Using ultra-low overhead virtual machines to create programmable tenant logic for Varnish Enterprise
One of the things I’ve realized about doing something crazy or new is that you don’t really understand the possibilities until you’re neck deep in it. For example, I couldn’t know early on whether this project would even work out, or that it would become possible to embed a SQLite database and generate dynamic content on the edge at very high levels of concurrency.
VCL
Varnish currently uses VCL as its configuration language for programming how requests, responses, caching and other things are handled.
“With Workers we’ve been able to push routing and caching to the edge, allowing us to scale further while improving npm’s performance for millions of developers. Migrating away from VCL means we can spend more time in JavaScript, which we love.”
Laurie Voss, co-founder and Chief Data Officer at npm, on “Cloudflare Workers”
There is no doubt that VCL lacks general computation features, proper objects and a whole lot of other modern features. But it has native speed: it’s the fastest caching configuration language out there. Still, I can empathize. The problem, though, is that a solution that takes milliseconds to spin up is not the answer, and neither is departing from VCL completely. Without revealing too much, I will talk about the current state of my project to implement tiny virtual machines in Varnish.
VMs
Virtual machines allow you to run untrusted code anywhere, including at the edge. They could be hosting any computer language, really, as long as that language can output the machine code your virtual machine can process. Tiny virtual machines don’t have to be traditional VMs in any sense. In this case they are written to have low overhead and fast startup and teardown, and they don’t have to emulate any specific system (such as Linux userspace). Instead, they use a limited API written for HTTP caching. They can’t read or write files. They can’t connect to anything or modify system settings. What they can do is modify HTTP header fields and generate dynamic content.
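To make the sandboxing concrete, here is a sketch of what such a deliberately narrow guest API could look like. Every name below is illustrative, not the project’s actual API:

#include <cstddef>

/* Append a header field to the request or response being processed. */
void http_append(int which_http, const char* field);

/* Generate a dynamic response: content type plus the content itself. */
void backend_response(const char* content_type,
                      const char* data, size_t length);

/* Deliberately absent: open(), read(), write(), socket(), connect()...
   The guest simply has no way to express those operations. */

The whole point is that the API surface is small enough to audit, and nothing outside it exists from the guest’s point of view.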
HTTP caching
HTTP caches are high-concurrency beasts that deliver cached content in a variety of ways. They act as accelerators for traditional web servers, and are used everywhere these days. Programming in an HTTP cache is a weird experience, because everything is pushed to the limit. Memory caches are always being thrashed, for example, so all your synthetic benchmarks lose their meaning. If you curl -v the big domains you will undoubtedly see some Via: varnish headers pop up.
So, what on earth have I been up to? Well, I wanted to explore the possibilities afforded by tiny virtual machines that exist only for a single request, whether on the front- or back-end of an HTTP cache. A very challenging task, because these things make the Internet responsive. I can’t just throw in an emulator and say “Fuck it, a few milliseconds extra won’t hurt anyone.” Thankfully, I started out with a 70 microsecond overhead. What that means is complicated to explain, but I’ll try. There is a master machine created for each tenant, and off that we create a fast copy on each request, then delete it afterwards. This gives important guarantees, and makes it easier to reason about security and lifetimes. The overhead of a machine is measured on the outside, as the time it takes to serve a synthetic response from this machine, for example a 403 Forbidden response. If we measure times internally, we get all kinds of insanely low, good-looking numbers, but I want to compare apples to apples.
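My mental model of that per-request lifecycle, in sketch form (the types and names are mine, not the actual implementation):

#include <cstdint>

struct Request {};
struct Response {};

struct Machine {
    Machine fork() const;                /* cheap copy of the master */
    void set_max_instructions(uint64_t); /* operator-imposed budget  */
    Response run(const Request&);        /* run the tenant program   */
};

Response handle_request(const Machine& master, const Request& req)
{
    Machine vm = master.fork();          /* fast per-request copy */
    vm.set_max_instructions(50'000'000); /* an invented budget    */
    Response resp = vm.run(req);         /* e.g. a synthetic 403  */
    return resp;                         /* vm is destroyed here: nothing
                                            leaks into the next request */
}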
After working on this for a month, I ended up with an 8–13 microsecond overhead, and a scaling of 2.75x compared to native code. Not bad at all, as you will be comfortably inside 100 microseconds for any normal program. Sure, you can’t resize large images without the aid of native code in the form of system calls. Why not? Well, because the CDN operator applies limits to the code executed inside the virtual machines. That way, no tenant can sit there and just mine bitcoins. Although I assume that if you pay for the privilege, you very well could!
Many insane optimizations had to be applied to solve problems in the high-concurrency environment, problems that a synthetic benchmark will not shine a light on. I have a synthetic benchmarking suite, and everything there is measured in nanoseconds. You will blow through 500 ns just by copying a few bytes from A to B at 300k req/s. I won’t talk more about that, though. Some interesting ideas came to mind while I was working on this.
Live updates
I noticed that I could add an update function where you supply a binary to a tenant, and the tenant will start using it immediately, at run-time. The CDN can implement the update method however they wish, but one way would be to simply accept POST requests, with some authentication of course. So, now there is live updating of tenant logic, which allows CDNs to get out of the way and let tenants update their own logic directly. It takes effect immediately, and if an update fails to initialize, the old program is kept running.
I measured an update to take around 1 millisecond to initialize. It doesn’t really matter, as it’s not affecting the operation of the cache, but it’s good to keep an eye on it.
import std;
import tenant;

sub vcl_recv {
    /* Live update mechanism */
    if (req.method == "POST") {
        std.cache_req_body(15MB);
        set req.backend_hint = tenant.live_update();
        return (pass);
    }
}
The simplest implementation of live updates in VCL. The tenant vmod is this project’s own; std is the standard Varnish vmod.
Dynamic content generation
By implementing a fake backend that simply calls into the VM, it was possible to let tenants generate content dynamically. This works exactly as one would think: the function simply returns the content type and the content.
Since the VM can run almost any language, it can also, for example, run an in-memory database. Read-only, of course. And with the live update functionality it never has to go stale (there is a sketch of this after the example below). That presented a challenge, though: big binaries take up a lot of memory.
#include <string>

static const std::string page = R"(
<html>
...
</html>
)";

static void my_page()
{
    /* Append a header field to the backend response. */
    beresp.append("Etag: 12beef34");
    /* Hand back the content type and the content itself. */
    backend_response("text/plain", page);
}

pub(on_client_request)
{
    decision(my_page);
}
The decision() function tells Varnish/the CDN what to do next. In this case we will come back to my_page to generate a dynamic response. The API is really just example code that happens to work.
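And here is the in-memory database idea from earlier, sketched with the same example API, under the assumption that SQLite is compiled into the tenant binary:

#include <sqlite3.h>
#include <string>

static sqlite3* db = nullptr;

/* Runs once when the master machine initializes. The dataset lives
   entirely in the VM's memory and is refreshed via live updates. */
static void open_database()
{
    sqlite3_open(":memory:", &db);
    sqlite3_exec(db,
        "CREATE TABLE toppings (name TEXT);"
        "INSERT INTO toppings VALUES ('cheese'), ('mushroom');",
        nullptr, nullptr, nullptr);
}

static void my_db_page()
{
    std::string result;
    sqlite3_exec(db, "SELECT name FROM toppings",
        [] (void* out, int, char** row, char**) {
            auto* s = static_cast<std::string*>(out);
            s->append(row[0]);
            s->append("\n");
            return 0;
        }, &result, nullptr);
    backend_response("text/plain", result);
}

Since each request runs in a throwaway fork, the database can never be corrupted by a request: writes simply vanish with the fork.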
Sharing memory
Typically when you think of sharing memory, you think of sharing between the host and the virtual machine, or between threads, or shared page tables between CPUs. In this case, I had to share the executable and read-only segments between all machines of a tenant, to make sure that big binaries didn’t blow the workspace budget of each request. Requests can’t have a large budget, because there are so many of them that it simply wouldn’t scale. Instead, this memory should be owned by the original machine and never copied around. It’s read-only, after all.
Alright, that should be enough, right? Nope. It turns out that each page costing 80 bytes of metadata in the fixed-size red-black tree with an overflow allocator was a problem too. Just look at this calculation:

15 MB / 4096 bytes per page = 3840 pages, and 3840 pages × 80 bytes per page ≈ 300 KB.

Even if a page cost zero bytes of payload it would still use 48 bytes, so there isn’t much to gain by trying to shrink the entries (although they did start at 80 and I managed to cut some fat). 300 KB is way too much to initialize on each request. So, the only thing left was to verify that the binary had a sequential execute segment followed by a sequential read-only segment, and then share those pages (not the data, just the page information) sequentially with each forked machine. In the end it actually turned out to be an optimization in most of my benchmarks, and now the binary size does not affect requests at all! Bizarre, almost.
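Here is my simplified sketch of that idea; the structures are invented for illustration and do not mirror the real implementation:

#include <cstdint>
#include <map>

struct Page { uint8_t data[4096]; };

/* One entry describes a whole run of sequential pages, instead of one
   48-80 byte tree node per page. The data is owned by the master. */
struct SharedRange {
    uint64_t       vaddr_begin;
    uint64_t       vaddr_end;
    const uint8_t* host_data;   /* read-only, never copied into forks */
};

struct ForkedMachine {
    /* Shared with the master: the execute and read-only segments. */
    const SharedRange* exec_segment;
    const SharedRange* rodata_segment;
    /* Private pages are only created when the fork actually writes. */
    std::map<uint64_t, Page> dirty_pages;
};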
Logging
One of the things I’m currently working on is how to debug these things. I wrote strace-like logging functionality that can run inside the VM without anyone really knowing or caring:
>>> ypizza.com: [trace] HTTP(REQ)::append(X-VM: Client request)
>>> ypizza.com: [trace] HeaderField(REQ, 9)
>>> ypizza.com: [trace] name() = "ypizza.com"
>>> ypizza.com: [trace] HTTP(REQ)::append(X-Tenant: ypizza.com)
>>> ypizza.com: [trace] HeaderField(REQ, 10)
>>> ypizza.com: [trace] decision(synth, 200)
>>> ypizza.com: [trace] HeaderField(REQ, 1)
>>> ypizza.com: [trace] HeaderField(REQ, 1)::regsub(0, "", false)
...
There are also plenty of existing logging solutions for Varnish itself, which should cover the rest. Still, I have plans to make some kind of API simulation that would let people run their code directly on Linux. No promises, though.
The future
I don’t know what it will hold, but I imagine I have to select a language and stick to it. I’ve been looking at Go, but I’m not sure if it’s moldable enough. The older languages are, and Rust might also meet those requirements.
Right now I’m using C++ for both the host and the VM guest. I’m quite liking modern C++, but the language selection for the guest is an open question. It’s very easy to build with newlib, override all the memory-handling C functions to accelerate them as system calls, and write inline assembly to optimize the general API. The abstractions on top make for easy-to-use code. However, C++ has a problem with hard-to-understand error messages, so I’m not sure yet. I’m not really writing this for me, after all.
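As an example of that acceleration trick, here is a sketch of overriding memcpy in the guest so that it becomes a single trap into the host, which copies the bytes natively. I’m assuming a RISC-V guest and an invented system call number; the real calling details are the project’s own:

#include <cstddef>

/* Hypothetical host-accelerated memcpy: one ecall instead of a
   guest-side copy loop emulated instruction by instruction. */
extern "C" void* memcpy(void* dst, const void* src, size_t len)
{
    register void*       a0 asm("a0") = dst;
    register const void* a1 asm("a1") = src;
    register size_t      a2 asm("a2") = len;
    register long        nr asm("a7") = 500; /* invented syscall no. */
    asm volatile("ecall"
                 : "+r"(a0)
                 : "r"(a1), "r"(a2), "r"(nr)
                 : "memory");
    return dst;
}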
Also, I might just write a KVM hypervisor instead. This was a nice proof of concept that ended up really far above my expectations, but I think VCL has set the precedent with native speed. Alright, KVM_EXIT_HLT :)
-gonzo