I think this would have been much better titled "Boosting NGINX Performance 9x with Asynchronous wrappers around blocking system calls".
Most people when hearing about "thread pools" in the context of a web-server think about using multiple threads for handling separate requests, which is NOT what this is about. It is using threads to enable some blocking syscalls (read and sendfile) to run asynchronously from the main event loop.
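The pattern can be sketched in a few lines of Python (a toy illustration, not nginx's actual C implementation): a worker thread performs the blocking read, then wakes the event loop through a pipe that the loop's poller is already watching.

```python
import os
import selectors
import tempfile
import threading

# Toy sketch of the pattern described above: the event loop never calls a
# blocking read() itself; a worker thread does, then wakes the loop through
# a pipe the poller is already watching.

def offloaded_read(path, notify_fd, results):
    with open(path, "rb") as f:
        results.append(f.read())   # the blocking read happens off-loop
    os.write(notify_fd, b"\x01")   # notify the event loop we're done

# A file standing in for the static asset being served.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"hello from disk")
tmp.close()

r, w = os.pipe()
sel = selectors.DefaultSelector()
sel.register(r, selectors.EVENT_READ)

results = []
threading.Thread(target=offloaded_read, args=(tmp.name, w, results)).start()

# One iteration of the "event loop": it sleeps in the poller, not in read().
for key, _ in sel.select():
    os.read(key.fd, 1)             # drain the notification byte

print(results[0])                  # b'hello from disk'
```

nginx does the same thing in C with its own task queue and eventfd/pipe notification; the point is only that the loop waits in the poller, never in read() or sendfile().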
libuv isn't unique. It's roughly equivalent to libev + libeio; in fact, that's what Node.js used before libuv was written. Whether or not it's faster than those is really case by case, but what you'll definitely get with libuv is callbacks everywhere.
That Flare Mobile crap on this site is constantly applying zoom CSS attributes and vertically centering the useless sidebar share buttons, making for a miserable reading experience. The page freezes momentarily whenever a scroll event fires. And I'm not even using a mobile device!
A bit ironic since this is an article about reducing blocking for improving performance.
Hi there - I work on Flare. Could you let me know what phone/OS you're using, so I can take a quick look? jason [at] filament (dot) io. Thanks very much!
The design seems similar to the one suggested in the 1999 USENIX paper "Flash: An Efficient and Portable Web Server"; it's a good read on the topic. nginx came along at a time when your site is a lot more likely to be cached in RAM than it was in 1999 (along with offloading large files to a CDN/S3 and reverse-proxying to an app server for much of the rest), but it's nice to see them working on making performance better for the bad cases.
"On the other hand, users of FreeBSD don’t need to worry at all. FreeBSD already has a sufficiently good asynchronous interface for reading files, which you should use instead of thread pools."
Originally, as Igor has said in many talks, nginx was written for FreeBSD and supports what FreeBSD supports; the Linux port has historically made do with what it had. This is a case of actually adding something for Linux specifically, which is unusual.
So the answer is that it would probably have been implemented years ago if that had been the case on Linux.
A slight note on terminology: reads and writes of ordinary disk files technically do not "block"; they go into "disk wait" instead. The difference is visible, for example, in that select()/poll() always consider ordinary files readable and writable.
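A quick Python demonstration of that point (assuming Linux/Unix poll semantics): poll() reports a plain disk file as readable and writable immediately, even though the eventual read may still end up in disk wait.

```python
import select
import tempfile

# poll() considers a regular file ready at once: a read would return data
# (or EOF) without blocking in the select()/poll() sense, even though the
# actual I/O may sit in disk wait.
tmp = tempfile.NamedTemporaryFile()
p = select.poll()
p.register(tmp.fileno(), select.POLLIN | select.POLLOUT)
events = dict(p.poll(0))           # timeout 0: don't wait at all

readable = bool(events[tmp.fileno()] & select.POLLIN)
writable = bool(events[tmp.fileno()] & select.POLLOUT)
print(readable, writable)          # True True
```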
I don't think that a load of 172 is a good idea. I know this is a benchmark measuring how fast you can go ideally, but in production the question is how fast you can go while keeping latency within the SLA. As a general rule you want to run your boxes around a normalized load of 1 (load / # of CPU cores). The rest of the article is pretty nice.
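For reference, the normalized load I mean is just the load average divided by the core count; a hypothetical helper in Python (not from the article):

```python
import os

# "Normalized load": the 1-minute load average divided by the number of CPU
# cores; values near 1.0 mean roughly one runnable task per core.
def normalized_load():
    one_minute, _, _ = os.getloadavg()
    return one_minute / os.cpu_count()

print(normalized_load())
```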
Having load average exceed core count isn't necessarily a sign of an imminent cascade failure. IIRC it's just a count of the number of processes that could run, but are waiting for a timeslice. A particular server and application might be perfectly fine with a load average that is 10 times the number of cores, as long as the average remains stable and the server is meeting response requirements.
Actually, since you have no idea what is causing the load (it can be waiting on network I/O), that's why I think running your production system in that shape is not recommended. Out of curiosity, in what situation is it OK to have significantly more things waiting to run than your actual capacity? That seems like bad capacity planning to me. Anyway, this is how it was done (keeping the normalized load around 1) at my previous gig, where we had ~5000 nodes, and it worked fine. I work on Hadoop clusters nowadays, and any time we hit a load of 100+ there is severe degradation of the service: timeouts and so on. In reality, sustained high normalized load (not one-minute spikes) should be avoided; this is based on my experience.
As well as network I/O. This is why you can't tell from load alone what is going on, and this is exactly why I don't like it too much in production. A box should have a smooth 15-minute normalized load over time. (If your workload changes you can autoscale; I think we used normalized load as the metric for scaling up and down.)
Processes blocked on a network socket should be sleeping and not contributing to the load average. In the case of network filesystems, however, processes may well be in disk wait while waiting for the server to respond.
I think it would be cool to see something similar from MS about IIS and .Net which have used thread pools for some time, though only relatively recently has asynchronous development taken hold at the application level (beyond lifecycle events)...
In practice, I've seen plenty of errant bugs because of race conditions in sites that start to come under heavy load. I wish more people would take the time to understand how their platforms work. That said, I've really come to appreciate the node.js approach.
Great explanation of an event-driven web server. It helped me understand some of the benefits of the mongrel2 architecture, which completely separates the tasks to be done by using ØMQ as the mechanism to decouple connection handling from the message handling of the request.
While asynchronous messaging is generally a good idea, using messaging middleware here seems like overkill. One should use the smallest hammer for the job.
Threads were the wrong idea in the first place: by breaking the isolation of processes (the share-nothing principle), they brought in a whole new class of problems with locking and synchronization. Only threads that share nothing are a reasonable choice, but without sharing the whole concept makes no sense anymore. So kernel lightweight processes seem like a good choice for offloading blocking operations from the main loop.
BTW, Erlang has done this right from the very beginning.
It is really unfortunate that Linux does not do proper async disk I/O. Then again, for lots of websites the static assets stored on disk fit in the OS cache, so the boost won't really be that big.
I'm not sure what you really mean by this. Linux has supported non-blocking I/O using select and poll since at least 2.4. 2.6 even added support for epoll, which scales even further since the callbacks are O(1).
It's a fairly common practice to spawn 2n processes/threads (n processors) to allow half to block on I/O and system calls though.
Well, TFA talks about Linux not having great support for async IO for the filesystem. You can use O_DIRECT and get async IO that way, but that completely bypasses the OS cache, so it's not a great way to do it, at least not for nginx. Just read the article to see the details.
Note that kqueue(2) in BSD-land supports a unified interface for async I/O on both sockets and files, so you can have a proper event loop without having to resort to reading files in a thread pool. If Linux had something similar, nginx wouldn't need to integrate a thread pool for this (though it might for other things, such as CPU-intensive plugins).
The nginx threadpools aren't strictly for I/O. One of the other major issues TFA mentions is that plugins don't use epoll/kqueue, and they block (with all the associated performance costs).
The detail I apparently skipped is that uncached file reads aren't handled uniformly through epoll (which surprises me). I don't see why files should be handled any differently than sockets, etc., with regard to non-blocking I/O using epoll.
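In fact epoll makes the asymmetry explicit: registering a plain disk file fails outright with EPERM, so file reads simply can't be driven from the same readiness loop as sockets. A small Python demonstration:

```python
import select
import tempfile

# epoll_ctl() refuses plain disk files with EPERM (they are always "ready"
# in the readiness model), so file I/O can't join the socket event loop.
ep = select.epoll()
tmp = tempfile.NamedTemporaryFile()
try:
    ep.register(tmp.fileno(), select.EPOLLIN)
    refused = False
except PermissionError:            # EPERM: fd doesn't support epoll
    refused = True

print(refused)                     # True
```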
My issue, though, is that everyone tends to reach for the aio_* functions to do asynchronous I/O. Those are fairly bad interfaces (POSIX AIO) and inefficient (effectively thread pools). Using the nginx model with an epoll/kqueue event loop is a better architecture.
Hardly exciting stuff, async libraries have been doing this for things like DNS queries (where there's no portable non-blocking API) for decades. Good for Nginx addon devs I guess.
Linux kernel aio will often still block when dealing with the page cache even if you request nonblocking. The workaround for this is to use O_DIRECT, which is okay for databases that do their own cache management but not for something like nginx (which is depending on the OS cache).
Glibc's posix aio (aio_*(3)), on the other hand, does not use Linux's kernel aio AFAIK. It probably uses thread pools. It also uses signals to signal completion. It is not generally considered performant.
Yes, good points about caching. Informix does everything by itself, indeed, on raw devices or direct-mapped files, which is the only way to maintain not eventual but strong data consistency. Thanks for clarifying.
Open-source API Management KONG (https://github.com/mashape/kong), which is based on NGINX, uses the same workaround of async wrappers to make requests faster.
There's already a library for that! http://software.schmorp.de/pkg/libeio.html