Cloudflare Workers: Key Technical Points
Links: articles related to Cloudflare that are worth reading.
The following is excerpted from https://blog.cloudflare.com/cloud-computing-without-containers/
A basic Node Lambda running no real code consumes 35 MB of memory. When you can share the runtime between all of
the Isolates as we do, that drops to around 3 MB.
Isolates are lightweight contexts that group variables with the code allowed to mutate them. Most importantly, a
single process can run hundreds or thousands of Isolates, seamlessly switching between them. They make it
possible to run untrusted code from many different customers within a single operating system process. They’re
designed to start very quickly (several had to start in your web browser just for you to load this web page), and
to not allow one Isolate to access the memory of another.
The following is excerpted from https://blog.cloudflare.com/jamstack-podcast-with-kenton-varda/
But the one that has received by far the most scrutiny, and the most real-world battle testing over the years,
would be the V8 JavaScript engine from Google Chrome. We took that and embedded it in a new server environment
written in C++ from scratch.
The following is excerpted from https://blog.cloudflare.com/mitigating-spectre-and-other-security-threats-the-cloudflare-workers-security-model/
First, we need to create an execution environment in which code cannot access anything it is not supposed to.
For this, our primary tool is V8, the JavaScript engine developed by Google for use in Chrome. V8 executes code inside "isolates", which prevent that code from accessing memory outside the isolate, even within the same process. Importantly, this means we can run many isolates within a single process. This is essential for an edge compute platform like Workers, where we must host many thousands of guest applications on every machine and rapidly switch between thousands of guests per second with minimal overhead. If we had to run a separate process for every guest, the number of tenants we could support would be drastically reduced, and we would have to limit edge computing to a small number of big enterprise customers willing to pay a lot of money. With isolate technology, we can make edge computing available to everyone.
That said, we do sometimes decide to schedule a worker in its own private process. We do this if the worker uses certain features that we feel need extra isolation. For example, when a developer uses the devtools debugger to inspect their worker, we run that worker in a separate process. This is because, historically in the browser, the inspector protocol has only been usable by the browser's trusted operator, and therefore it has not received the same security scrutiny as the rest of V8. To hedge against the increased risk of bugs in the inspector protocol, we move inspected workers into a separate process with a process-level sandbox. We also use process isolation as an extra defense against Spectre, which I will describe later in this post.
Additionally, even for isolates that run in a shared process with other isolates, we run multiple instances of the whole runtime on each machine, which we call "cordons". Workers are distributed among cordons by assigning each worker a trust level and separating low-trust workers from those we trust more highly. As one example of this in operation: a customer who signs up for our free plan will not be scheduled in the same process as an Enterprise customer. This provides defense in depth in the case of a zero-day security vulnerability in V8. I will say more later in this post about V8 bugs and how we address them.
At the whole-process level, we apply another layer of sandboxing for defense in depth. The "layer 2" sandbox uses Linux namespaces and seccomp to prohibit all access to the filesystem and the network. Namespaces and seccomp are commonly used to implement containers, but our use of these technologies is much stricter than is typical in container engines, because we configure namespaces and seccomp after the process has started (but before any isolates have been loaded). This means, for example, that we can (and do) use a totally empty filesystem (mount namespace) and use seccomp to block absolutely all filesystem-related system calls. Container engines normally cannot prohibit all filesystem access, because doing so would make it impossible to use exec() to start the guest program from disk; in our case, the guest programs are not native binaries, and the Workers runtime itself has already finished loading before we block filesystem access.
The layer 2 sandbox also totally prohibits network access. Instead, the process is limited to communicating over local Unix domain sockets, in order to talk to other processes on the same system. Any communication with the outside world must be mediated by some other local process outside the sandbox.
One such process in particular, which we call the "supervisor", is responsible for fetching worker code and configuration from disk or from other internal services. The supervisor ensures that the sandbox process cannot read any configuration except that which is relevant to the workers it is supposed to run.
For example, when the sandbox process receives a request for a worker it has not seen before, that request includes the encryption key for that worker's code (including attached secrets). The sandbox can then pass that key to the supervisor in order to request the code. The sandbox cannot request any worker for which it has not received the appropriate key. It cannot enumerate known workers. It also cannot request configuration it does not need; for example, it cannot request the TLS key used for HTTPS traffic to the worker.
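The key-gated lookup described above can be illustrated with a minimal sketch. All names here are invented for illustration (the real supervisor protocol is not public): the point is that possession of a random key is the only way to obtain a worker's code, and there is no enumeration API.

```python
import hmac
import os

class Supervisor:
    """Hypothetical sketch of the supervisor's key-gated code lookup."""

    def __init__(self):
        self._workers = {}  # secret key (bytes) -> worker code (str)

    def register(self, code):
        # The key travels with the routed request; a sandbox only ever sees
        # keys for the workers it has been asked to run.
        key = os.urandom(32)
        self._workers[key] = code
        return key

    def fetch_code(self, key):
        # Constant-time comparison; deliberately no API to list workers.
        for known_key, code in self._workers.items():
            if hmac.compare_digest(known_key, key):
                return code
        raise PermissionError("unknown worker key")

sup = Supervisor()
key_a = sup.register("export default { fetch() { return new Response('A') } }")
code_a = sup.fetch_code(key_a)       # the right key yields the code
try:
    sup.fetch_code(os.urandom(32))   # guessing a key is hopeless
    denied = False
except PermissionError:
    denied = True
```

A sandbox holding `key_a` can fetch only that one worker; any other key, guessed or fabricated, is rejected.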
Aside from reading configuration, the other reason for the sandbox to talk to other processes on the system is to implement the APIs exposed to Workers. Which brings us to API design.
In a sandbox environment, API design takes on a new level of responsibility. Our APIs define exactly what a worker can and cannot do. We must design each API very carefully so that it can only express operations we want to allow, and no more. For example, we want to allow Workers to make and receive HTTP requests, while we do not want them to be able to access the local filesystem or internal network services.
...
How would such an API be implemented? As described above, the sandbox process cannot access the real filesystem, and we would like to keep it that way. Instead, file access is mediated by the supervisor process. The sandbox talks to the supervisor using Cap'n Proto RPC, a capability-based RPC protocol. (Cap'n Proto is an open source project currently maintained by the Cloudflare Workers team.) This protocol makes it very easy to implement capability-based APIs, so that we can strictly limit the sandbox to accessing only the files that belong to the Workers it is running.
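The capability style described here can be sketched in-process: holding the object *is* the permission, and the object itself enforces its scope. The class name and layout below are invented; the real system applies this pattern across process boundaries via Cap'n Proto RPC rather than within one program.

```python
import tempfile
from pathlib import Path

class DirectoryCap:
    """Sketch of a capability-style file API: a holder of this object can
    read files under its root directory and nothing else."""

    def __init__(self, root):
        self._root = Path(root).resolve()

    def read(self, name):
        path = (self._root / name).resolve()
        # The capability covers only files under its root; reject traversal.
        if self._root not in path.parents:
            raise PermissionError(f"{name} is outside this capability")
        return path.read_bytes()

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "worker.js").write_bytes(b"// worker code")
    cap = DirectoryCap(d)
    contents = cap.read("worker.js")      # in scope: allowed
    try:
        cap.read("../outside.txt")        # escape attempt: refused
        escaped = True
    except PermissionError:
        escaped = False
```

Because the only file API the sandbox ever receives is a capability scoped to one worker's files, "don't touch other workers' files" needs no extra policy checks; it is unrepresentable in the API.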
What about network access? Today, Workers are allowed to talk to the rest of the world only via HTTP, both incoming and outgoing. There is no API for other forms of network access, so they are prohibited (though we plan to support other protocols in the future).
As mentioned before, the sandbox process cannot connect directly to the network. Instead, all outbound HTTP requests are sent over a Unix domain socket to a local proxy service. That service enforces restrictions on the request. For example, it verifies that the request is addressed either to a public Internet service or to the worker's zone's own origin server, not to internal services that might be visible on the local machine or network. It also adds headers to every request identifying the worker it originates from, so that abusive requests can be traced and blocked. Once everything is in order, the request is sent on to our HTTP caching layer and then out to the Internet.
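The proxy's vetting might look roughly like the sketch below. The function and header names are invented, and a real proxy would resolve DNS names and re-check the resulting IP; the sketch only shows the two documented checks: reject internal targets (unless it is the zone's own origin) and tag the request with the originating worker.

```python
import ipaddress
from urllib.parse import urlsplit

def is_public_address(host):
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: the real proxy would resolve it, then re-check the IP.
        return True
    return ip.is_global  # rejects loopback, link-local, RFC 1918 ranges, etc.

def vet_outbound(url, worker_id, origin_host):
    """Allow public targets or the zone's own origin; tag the request."""
    host = urlsplit(url).hostname or ""
    if host != origin_host and not is_public_address(host):
        raise PermissionError(f"blocked internal target: {host}")
    # Identify the originating worker so abuse can be traced and blocked.
    return {"X-Originating-Worker": worker_id}

headers = vet_outbound("https://example.com/api", "worker-123", "origin.example")
try:
    vet_outbound("http://10.0.0.5/admin", "worker-123", "origin.example")
    blocked = False
except PermissionError:
    blocked = True
```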
Similarly, inbound HTTP requests do not go directly to the Workers runtime. They are first received by an inbound proxy service. That service is responsible for TLS termination (the Workers runtime never sees TLS keys) and for identifying the correct Worker script to run for a particular request URL. Once everything is in order, the request is passed over a Unix domain socket to the sandbox process.
The following is excerpted from https://www.infoq.com/presentations/cloudflare-v8/
On resource control and security in Workers:
Linux's timer_create system call is used to cap each isolate's execution at 50 milliseconds of CPU time.
For CPU time, we actually limit each isolate to 50 milliseconds of CPU execution per request. The way we do that
is the Linux timer create system call lets you set up to receive a signal when a certain amount of CPU time has
gone by. Then from that signal handler, we can call a V8 function, called terminate execution, which will
actually cancel execution wherever it is.
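The mechanism Varda describes (a CPU-time timer whose signal handler cancels the running computation) can be sketched in Python with the standard library's virtual interval timer. This is only a POSIX, main-thread analogue of the real C++ implementation, which uses timer_create plus V8's TerminateExecution:

```python
import signal

class CpuLimitExceeded(Exception):
    pass

def _on_cpu_timer(signum, frame):
    # Analogue of calling v8::Isolate::TerminateExecution() from the handler.
    raise CpuLimitExceeded()

def run_with_cpu_limit(fn, cpu_seconds):
    """Run fn(), aborting once it has burned cpu_seconds of user-mode CPU time."""
    old = signal.signal(signal.SIGVTALRM, _on_cpu_timer)
    # ITIMER_VIRTUAL counts CPU time spent in user mode, similar to a
    # timer_create() timer armed on the thread's CPU-time clock.
    signal.setitimer(signal.ITIMER_VIRTUAL, cpu_seconds)
    try:
        return fn()
    finally:
        signal.setitimer(signal.ITIMER_VIRTUAL, 0)  # disarm
        signal.signal(signal.SIGVTALRM, old)

def runaway():
    while True:  # burns CPU until the timer fires
        pass

try:
    run_with_cpu_limit(runaway, 0.05)  # 50 ms budget, as in the talk
    aborted = False
except CpuLimitExceeded:
    aborted = True
```

The key property is the same as in the talk: the limit is on CPU time, not wall-clock time, so a worker that is merely waiting on I/O is not penalized.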
Each V8 thread runs only one isolate at any moment; concurrent requests are handled by multiple threads.
An isolate in JavaScript is a single-threaded thing. JavaScript is inherently a single threaded event driven
language. So an isolate is only running on one thread at a time, other isolates can be on other threads. We don't
technically have to, but in our design, we never run more than one isolate on a thread at a time. We could have
multiple isolates assigned to one thread and handle the events as they come in. But what we don't want is for one
isolate to be able to block another with a long computation and create latency for someone else, so we put them
each on different threads.
Memory use by the code inside an isolate is controlled by monitoring; if it goes over the limit, the isolate is killed. It is also mentioned that new requests may start a new isolate; it is unclear whether an existing thread is reused.
Instead, we end up having to do more of a monitoring approach. After each time we call into JavaScript, when it returns, we check how much heap space it is now using. If it's gone a little bit over its limit, then we'll do a soft eviction where it can continue handling in-flight requests, but for any new requests, we can just start up another isolate. If it goes way over, then we'll just kill it and cancel all the requests. This works in conjunction with the CPU time limit because generally, you can't allocate a whole lot of data without spending some CPU time on that, at least not JavaScript objects. Typed arrays are something different, but it's a long story.
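The soft/hard eviction policy described above amounts to a small state machine. The thresholds and names in this sketch are invented for illustration; only the two-tier check after each return from JavaScript comes from the talk:

```python
# Soft limit: stop accepting new requests, drain the in-flight ones.
# Hard limit: kill the isolate and cancel everything.
SOFT_LIMIT = 128 * 1024 * 1024
HARD_LIMIT = 256 * 1024 * 1024

class Isolate:
    def __init__(self):
        self.accepting_new = True
        self.alive = True

def check_after_js_call(isolate, heap_bytes_used):
    """Called each time execution returns from JavaScript."""
    if heap_bytes_used > HARD_LIMIT:
        isolate.alive = False           # way over: kill, cancel all requests
        isolate.accepting_new = False
    elif heap_bytes_used > SOFT_LIMIT:
        isolate.accepting_new = False   # soft eviction: new requests go to a
                                        # fresh isolate, in-flight ones finish

iso = Isolate()
check_after_js_call(iso, 100 * 1024 * 1024)
assert iso.alive and iso.accepting_new
check_after_js_call(iso, 150 * 1024 * 1024)
assert iso.alive and not iso.accepting_new   # soft-evicted
check_after_js_call(iso, 300 * 1024 * 1024)
assert not iso.alive                         # hard kill
```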
Serverless code is published to the edge servers within 3 seconds of being uploaded, which speeds up startup when requests are handled.
Another problem is we need to get our code, or the user's code, to all the machines that run that code. It sure
would be sad if we had achieved our 5 millisecond startup time only to spend 200 milliseconds waiting for some
storage server to return the code to us before we could even execute it. So what we're doing right now is
actually we distribute the code to all of the machines in our fleet up front. We already had technology for this
to distribute configuration changes to the edge, and we just said code is another kind of configuration, and
threw it in there and it works. It takes about three seconds between when you upload your code and when it's on
every machine in our fleet.
To manage V8 bugs and security risk, they watch the V8 repository for updates, automatically sync the code, and automatically build and release new versions to production.
We can see when the commit lands in the V8 repository, which happens before the Chrome update, and automate our
build system so that we can get that out into production within hours automatically. We don't even need someone to click.
They do not allow execution features like eval in programs, and they watch for 0-day attack code; when any is found, they examine the code and submit it to Google. It is also mentioned that timer functionality and concurrency features are not supported.
There are some things, some risk management things we can do on the server, that we cannot do so easily on the
browser. One of them is we store every single piece of code that executes on our platform, because we do not
allow you to call eval to evaluate code at runtime. You have to upload your code to us and then we distribute it.
What that means is that if anyone tries to upload an attack, we now have a record of that attack. If it's a
zero-day that they have attacked, they have now burned their zero day, when we take a look at that code. We'll
submit it to Google, and then the person who uploaded won't get their $15,000.
They also monitor for segfaults across all of their servers and raise alerts, and they examine the programs' crash reports.
Each HTTP request corresponds to one V8 thread running an isolate. These HTTP requests come from an Nginx on the same machine; each thread runs only one isolate at a time, and concurrent requests are handled via multiple threads. (A question here: an audience member asked whether spare idle threads exist, and what the minimum and maximum counts are; this did not seem to get a clear answer.)
As I said earlier, we start up a thread, or we have different isolates running on different threads. We actually start a thread for each incoming HTTP connection, which are connections incoming from an Nginx server on the same machine. This is kind of a neat trick, because Nginx will only send one HTTP request on that connection at a time. So this is how we know that we only have one isolate executing at a time. But we can potentially have as many threads as are needed to handle the concurrent requests. The workers will usually be spending most of their time waiting for some back end, so not actually executing that whole time.
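A toy model of this thread-per-connection design: because Nginx sends at most one request at a time per connection, handling within a connection is strictly serial, while separate connections proceed on their own threads. This is a pure simulation with invented names, not the runtime's actual code:

```python
import threading

def connection_thread(requests, results):
    # One thread per connection; requests on this connection arrive
    # one at a time, so they are handled strictly in order.
    for req in requests:
        results.append(f"handled {req}")

conns = {
    "conn-a": ["a1", "a2", "a3"],
    "conn-b": ["b1", "b2"],
}
results = {name: [] for name in conns}
threads = [
    threading.Thread(target=connection_thread, args=(reqs, results[name]))
    for name, reqs in conns.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Per-connection ordering is preserved even though connections interleave.
assert results["conn-a"] == ["handled a1", "handled a2", "handled a3"]
assert results["conn-b"] == ["handled b1", "handled b2"]
```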
Each edge location has to keep enough CPU capacity available, because at any given moment one CPU can serve only one HTTP request; if the CPUs at one location are overloaded, users are shifted to other locations that have spare CPU.
We need to make sure that we have plenty of CPU capacity in all of our locations. When we don't, when one
location gets a little overloaded, what we do is we actually shift. Usually the free users we just shift them to
other locations- the free users of our general service, there isn't a free tier of workers yet.
Someone challenged Cloudflare's practice of monitoring and viewing customer code; the answer was that it is done only for diagnosing errors and incident response. They also mentioned that they have built an app store that lets customers publish their own Serverless applications, so code in the marketplace is likewise seen by more customers.
We look at code only for debugging and incident response purposes. We don't dig through code to see what people
are doing for fun. That's not what we want to do. We have something called the Cloudflare app store, which
actually lets you publish a worker for other people to install on their own sites. Being able to do it with workers is in beta right now. So this will be something that will ramp up soon. But then you sell that to other
users, and we'd much rather have people selling their neat features that they built on Cloudflare to each other
in this marketplace, than have us just build it ourselves.
They have now launched the Serverless KV store. A detail worth noting in this passage: the KV Store at the edge is performance-optimized for read-heavy, write-light workloads.
One of the first ones that's already in beta is called Workers KV. It's fairly simple right now, it's KV Store
but it's optimized for read-heavy workloads, not really for lots of writes from the edge. But there are things
we're working on that I'm very excited about but not ready to talk about yet, that will allow whole databases to
be built on the edge.
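One plausible reading of "optimized for read-heavy workloads" is a per-location read cache in front of a central store: reads are cheap and local, while writes go to the center and only become visible at the edge once cached entries expire. The sketch below (invented names and TTL) illustrates that trade-off; it is not the actual Workers KV design.

```python
import time

class CentralStore:
    def __init__(self):
        self.data = {}

class EdgeKV:
    """Read-optimized edge view of a central KV store (illustrative only)."""

    def __init__(self, central, ttl_seconds=60.0):
        self.central = central
        self.ttl = ttl_seconds
        self._cache = {}  # key -> (value, fetched_at)

    def get(self, key):
        hit = self._cache.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]  # fast path: served locally, no central round trip
        value = self.central.data.get(key)
        self._cache[key] = (value, time.monotonic())
        return value

    def put(self, key, value):
        # Write-through to the central store; caches at other locations
        # only converge after their TTLs lapse (eventual consistency).
        self.central.data[key] = value

central = CentralStore()
edge = EdgeKV(central, ttl_seconds=60.0)
edge.put("greeting", "hello")
assert edge.get("greeting") == "hello"
central.data["greeting"] = "updated elsewhere"
stale = edge.get("greeting")  # still the cached value until the TTL lapses
```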
The current Serverless pricing model: $5 per month to deploy to the edge, plus $0.50 per million requests, with the first 10 million requests free. Cheaper than AWS Lambda.
Varda: Great question. If you go to cloudflareworkers.com, you can actually play around with it just in your web
browser. You write some code and it immediately runs, and it shows you what the result would be. That's free.
Then when you want to actually deploy it on your site, the cost is $5 per month minimum, and then it's 50 cents per million requests.
We can do a lot of monitoring. For example, we can watch for segfaults anywhere on any of our servers. They are
rare, and when they happen, we raise an alert, we look at it. And we see in the crash report, it says what script
was running. So we're going to immediately look at that script, which we have available.