I think this would have been much better titled "Boosting NGINX Performance 9x with Asynchronous wrappers around blocking system calls".
Most people when hearing about "thread pools" in the context of a web-server think about using multiple threads for handling separate requests, which is NOT what this is about. It is using threads to enable some blocking syscalls (read and sendfile) to run asynchronously from the main event loop.
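The pattern can be sketched in a few lines of Python (a toy illustration, not nginx's actual C implementation): a worker thread performs the blocking read, then wakes the event loop through a pipe that the loop's poller is already watching.

```python
import os
import selectors
import tempfile
import threading

# Toy sketch of the pattern described above: the event loop never calls a
# blocking read() itself; a worker thread does, then wakes the loop through
# a pipe the poller is already watching.

def offloaded_read(path, notify_fd, results):
    with open(path, "rb") as f:
        results.append(f.read())   # the blocking read happens off-loop
    os.write(notify_fd, b"\x01")   # notify the event loop we're done

# A file standing in for the static asset being served.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"hello from disk")
tmp.close()

r, w = os.pipe()
sel = selectors.DefaultSelector()
sel.register(r, selectors.EVENT_READ)

results = []
threading.Thread(target=offloaded_read, args=(tmp.name, w, results)).start()

# One iteration of the "event loop": it sleeps in the poller, not in read().
for key, _ in sel.select():
    os.read(key.fd, 1)             # drain the notification byte

print(results[0])                  # b'hello from disk'
```

nginx does the same thing in C with its own task queue and eventfd/pipe notification; the point is only that the loop waits in the poller, never in read() or sendfile().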
libuv isn't unique. It's roughly equivalent to libev + libeio; in fact, that's what Node.js used before libuv was written. Whether or not it's faster than those is really case by case, but what you'll definitely get with libuv is callbacks everywhere.
That Flare Mobile crap on this site is constantly applying zoom CSS attributes and vertically centering the useless sidebar share buttons, making for a miserable reading experience. The page freezes momentarily whenever a scroll event fires. And I'm not even using a mobile device!
A bit ironic since this is an article about reducing blocking for improving performance.
Hi there - I work on Flare. Could you let me know what phone/OS you're using, so I can take a quick look? jason [at] filament (dot) io. Thanks very much!
The design seems similar to the one suggested in the 1999 USENIX paper "Flash: An Efficient and Portable Web Server"; it's a good read on the topic. nginx came along at a time when your site is a lot more likely to be cached in RAM than it was in 1999 (along with offloading large files to a CDN/S3 and reverse-proxying to an app server for much of the rest), but it's nice to see them working on making performance better for the bad cases.
"On the other hand, users of FreeBSD don’t need to worry at all. FreeBSD already has a sufficiently good asynchronous interface for reading files, which you should use instead of thread pools."
Originally, as Igor has said in many talks, nginx was written for FreeBSD and supports what FreeBSD supports; the Linux port has historically made do with what it had. This is a case of actually adding something for Linux specifically, which is unusual.
So the answer is that it would probably have been implemented years ago if that had been the case on Linux.
A slight note on terminology: reads and writes of ordinary disk files technically do not "block"; they go into "disk wait" instead. The difference is visible, for example, in that select()/poll() always consider ordinary files readable and writable.
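A quick Python demonstration of that point (assuming Linux/Unix poll semantics): poll() reports a plain disk file as readable and writable immediately, even though the eventual read may still end up in disk wait.

```python
import select
import tempfile

# poll() considers a regular file ready at once: a read would return data
# (or EOF) without blocking in the select()/poll() sense, even though the
# actual I/O may sit in disk wait.
tmp = tempfile.NamedTemporaryFile()
p = select.poll()
p.register(tmp.fileno(), select.POLLIN | select.POLLOUT)
events = dict(p.poll(0))           # timeout 0: don't wait at all

readable = bool(events[tmp.fileno()] & select.POLLIN)
writable = bool(events[tmp.fileno()] & select.POLLOUT)
print(readable, writable)          # True True
```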
I don't think that a load of 172 is a good idea. I know this is a benchmark measuring how fast you can go ideally, but in production the question is how fast you can go while keeping latency within the SLA. As a general rule you want to run your boxes around a normalized load of 1 (load / # of CPU cores). The rest of the article is pretty nice.
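For reference, the normalized load I mean is just the load average divided by the core count; a hypothetical helper in Python (not from the article):

```python
import os

# "Normalized load": the 1-minute load average divided by the number of CPU
# cores; values near 1.0 mean roughly one runnable task per core.
def normalized_load():
    one_minute, _, _ = os.getloadavg()
    return one_minute / os.cpu_count()

print(normalized_load())
```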
Having load average exceed core count isn't necessarily a sign of an imminent cascade failure. IIRC it's just a count of the number of processes that could run, but are waiting for a timeslice. A particular server and application might be perfectly fine with a load average that is 10 times the number of cores, as long as the average remains stable and the server is meeting response requirements.
Actually, since you have no idea what is causing the load (it can be waiting on network I/O), that's why I think running your production system in that shape is not recommended. Out of curiosity, in what situation is it OK to have significantly more things waiting to run than your actual capacity? That seems like bad capacity planning to me. Anyway, this is how it was done (keeping the normalized load around 1) at my previous gig, where we had ~5000 nodes, and it worked fine. I work on Hadoop clusters nowadays, and any time we hit a load of 100+ there is severe degradation of the service: timeouts and so on. In reality, sustained high normalized load (not one-minute spikes) should be avoided; this is based on my experience.
As well as network I/O. This is why you can't tell from load alone what is going on, and this is exactly why I don't like it too much in production. A box should have a smooth 15-minute normalized load over time. (If your workload changes you can autoscale; I think we used normalized load as the metric for scaling up and down.)
Processes blocked on a network socket should be sleeping and not contributing to the load average. In the case of network filesystems, however, processes may well be in disk wait while waiting for the server to respond.
I think it would be cool to see something similar from MS about IIS and .Net which have used thread pools for some time, though only relatively recently has asynchronous development taken hold at the application level (beyond lifecycle events)...
In practice, I've seen plenty of errant bugs because of race conditions in sites that start to come under heavy load. I wish more people would take the time to understand how their platforms work. That said, I've really come to appreciate the node.js approach.
Great explanation of an event-driven web server. It helped me understand some of the benefits of the mongrel2 architecture, which completely separates the tasks to be done by using ØMQ as the mechanism to decouple connection handling from the message handling of the request.
While asynchronous messaging is generally a good idea, using messaging middleware here seems like overkill. One should use the smallest hammer for the job.
Threads were the wrong idea in the first place: by breaking the isolation of processes (the share-nothing principle), they brought in a whole new class of problems with locking and synchronization. Only threads that share nothing are a reasonable choice, but without sharing the whole concept makes no sense anymore. So kernel lightweight processes seem like a good choice for offloading blocking operations from the main loop.
BTW, Erlang has done this right from the very beginning.
It is really unfortunate that Linux does not do proper async disk I/O. Then again, for lots of websites the static assets stored on disk fit in the OS cache, so the boost won't really be that big.
I'm not sure what you really mean by this. Linux has supported non-blocking I/O using select and poll since at least 2.4. 2.6 even added support for epoll, which scales even further since the callbacks are O(1).
It's a fairly common practice to spawn 2n processes/threads (n processors) to allow half to block on I/O and system calls though.
Well, TFA talks about Linux not having great support for async IO for the filesystem. You can use O_DIRECT and get async IO that way, but that completely bypasses the OS cache, so it's not a great way to do it, at least not for nginx. Just read the article to see the details.
Note that kqueue(2) in BSD-land supports a unified interface for async I/O on both sockets and files, so you can have a proper event loop without having to resort to reading files in a thread pool. If Linux had something similar, nginx wouldn't need to integrate a thread pool for this (though it might for other things, such as CPU-intensive plugins).
The nginx threadpools aren't strictly for I/O. One of the other major issues TFA mentions is that plugins don't use epoll/kqueue, and they block (with all the associated performance costs).
The detail I apparently skipped is that uncached file reads aren't handled uniformly through epoll (which surprises me). I don't see why files should be handled any differently than sockets, etc., with regard to non-blocking I/O using epoll.
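In fact epoll makes the asymmetry explicit: registering a plain disk file fails outright with EPERM, so file reads simply can't be driven from the same readiness loop as sockets. A small Python demonstration:

```python
import select
import tempfile

# epoll_ctl() refuses plain disk files with EPERM (they are always "ready"
# in the readiness model), so file I/O can't join the socket event loop.
ep = select.epoll()
tmp = tempfile.NamedTemporaryFile()
try:
    ep.register(tmp.fileno(), select.EPOLLIN)
    refused = False
except PermissionError:            # EPERM: fd doesn't support epoll
    refused = True

print(refused)                     # True
```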
My issue, though, is that everyone tends to reach for the aio_* functions to do asynchronous I/O. Those are fairly bad interfaces (POSIX AIO) and inefficient (effectively thread pools). Using the nginx model with an epoll/kqueue event loop is a better architecture.
Hardly exciting stuff, async libraries have been doing this for things like DNS queries (where there's no portable non-blocking API) for decades. Good for Nginx addon devs I guess.
Linux kernel aio will often still block when dealing with the page cache even if you request nonblocking. The workaround for this is to use O_DIRECT, which is okay for databases that do their own cache management but not for something like nginx (which is depending on the OS cache).
Glibc's posix aio (aio_*(3)), on the other hand, does not use Linux's kernel aio AFAIK. It probably uses thread pools. It also uses signals to signal completion. It is not generally considered performant.
Yes, good points about caching. Informix does everything by itself, indeed, on raw devices or direct-mapped files, which is the only way to maintain not eventual but strong data consistency. Thanks for clarifying.
Open-source API Management KONG (https://github.com/mashape/kong), which is based on NGINX, uses the same workaround of async wrappers to make requests faster.
There's already a library for that! http://software.schmorp.de/pkg/libeio.html