I feel a bit paralyzed by Fear Of Missing Io_Uring. There's so much awesome streaming stuff about (RisingWave, Materialize, NATS, DataFusion, Velox, neat upstarts like Iggy, many more), but it all feels built on slower legacy system libraries.
You don't need io_uring. For many workloads being slow & inefficient is acceptable; it isn't awful. But gee, I'd rather start from a modern baseline that has high levels of mechanistic sympathy with the hardware, where things like network & IO work can be done in an efficient async manner.
Why do I need io_uring? Because it sounds awful and unhackerly to settle for living in a much lesser world.
Mechanical sympathy is understanding the system, not using the shiniest thing. If you want low latency processing of one event at a time, you are either going to burn an entire core spinning or you are going to do a syscall for each operation. The io_uring syscalls are not especially fast — they get their awesomeness by doing, potentially, a whole lot of work per operation. And, for some use cases, by having a superior async IO model.
But if you actually just want read(), then call read().
Low latency for a single event is never going to have mechanistic sympathy, will be a colossal waste of most of your system.
Highly concurrent system usage is what it takes. EPOLLEXCLUSIVE (2016) finally sort of gets epoll vaguely capable of what OSes were doing decades ago, but it is still difficult to use & a rat's nest of complexity. Who here feels good reading https://stackoverflow.com/questions/41582560/how-does-epolls... ?
The submission/completion queue model of io_uring makes sense. It lets work be added or resolved without crossing that painful slow kernel barrier. It's been expanded to offer a lot more operations than what could be done in epoll.
The "shiniest thing" is a vast leap in capabilities, systems legibility, and overall (not single operation) throughout. You cannot remotely get the numbers io_uring was bringing three years ago any other way. And it's only gotten further and further ahead while everyone else has sat still.
> Low latency for a single event is never going to have mechanistic sympathy, will be a colossal waste of most of your system.
Excuse me? I maintain a production system that cares about low latency for single events. Declaring that it doesn’t have “mechanistic sympathy” entirely misses the point. Of course I’m not squeezing the most throughput out of every cycle of my CPU. I have a set of design requirements, I understand what the kernel and CPU and IO system do under the hood, and I designed the system to make the most of the resources at hand to achieve the design requirements. Which, in this case, are minimal latency for single events or small groups of events, and io_uring would have no benefit.
(I can stream in events at a very nice rate as measured in events/sec, but I never tried to optimize that, and I should not try to optimize that because it would make the overall system perform worse.)
You aren't using your chips efficiently. That's basically it. Maybe your use case justifies it, but you are not taking advantage of a massive part of what chips do. That's on you. And it does make yours a pretty different use case than most software development.
Fine, you've talked yourself deeply into a conviction that async doesn't and won't ever matter for you. But man, most people are properly doing the right thing by optimizing for throughput, not single events, and async has altered the game in amazingly, colossally positive ways for computing efficiency.
if you want mechanistic sympathy and low latency then you can't really do much better than dpdk; uring is still going through the very generic and abstracted kernel networking stack.
io_uring is a low level abstraction and is generally a wash against epoll. Really won't make a difference for these kinds of applications, especially not for client nodes.
io_uring allows for async reads and writes to disk without forcing a thread pool or direct I/O. That alone makes it much more scalable for workloads that touch both the network and disk.
The point was that io_uring isn't going to make a big difference for the network code, as for disk I/O code (especially for the sorts of things GP is talking about) you have a bounded number of "threads" of execution anyway. For a node in a pub-sub system, maybe it has c10k users but it's probably appending to a handful of LSM-like datastructures that are written sequentially to disk. The biggest difference is random reads, but even then you can saturate what the disk will do with double digit numbers of threads.
No, most operations in the ring directly work asynchronously. The thread mechanism only exists as a fallback for combinations of operations and system configurations (e.g. filesystems) that don't support asynchronous operation.
I don't know anything about the internals of io_uring and am genuinely curious how it works. Saying it "directly works asynchronously" doesn't mean anything though. When circular buffer requests are processed what thread is processing the request, how is that thread managed, and how does it manage blocking/unblocking when communicating with the storage device?
Internally, many parts of the Linux kernel operate asynchronously: they queue up a request with some subsystem (e.g. a hardware device), and get an event delivered when the request is completed. In such cases, io_uring can enqueue such a request, and complete it when receiving the event, without needing to use a thread to block waiting for it.
See, for instance, https://lpc.events/event/11/contributions/901/attachments/78... slide 5 (though more has happened since then). io_uring will first see if it has everything needed to do the operation immediately, if not it'll queue a request in some cases (e.g. direct I/O, or buffered I/O in some cases). The thread pool is the last fallback, which always works if nothing else does.
Is it safe to say that a single thread using io_uring should be as fast or faster than N threads performing the same set of I/O tasks in a blocking manner?
In other words, can you count on the kernel to use its own threads internally whenever an I/O task might actually need to use a lot of CPU?
If you saturate the submission queue with CPU-bottlenecked tasks, it defeats the value-add of io_uring - at that point, you might as well replace your kernel-space thread pool with a user-space one.
Sure, but that approach forces you to consider/research just how much CPU your I/O tasks may or may not require. What if I'm not sure? How CPU-intensive is open()? What about close()? What about read()?
It would simplify my design process if I could count on io_uring being optimal for ~all I/O tasks, rather than having to treat "CPU-heavy I/O" and "CPU-light I/O" as two separate things that require two separate designs.
This is something that will require profiling to get exact numbers. The non-async portions of a high level filesystem read operation appear rather trivial: checking for cache hits (page cache, dentry cache, etc), parsing the inode/dentry info, and the memcpy to userspace. I wouldn't worry about any of these starving subsequent io_uring SQEs.
I reckon the most likely place you'd find unexpected CPU-heavy work is at the block layer. Software RAID and dmcrypt will burn plenty of cycles, enough to prove as exceptions to the "no FPU instructions in the kernel" guideline.
> Software RAID and dmcrypt will burn plenty of cycles, enough to prove as exceptions to the "no FPU instructions in the kernel" guideline.
LUKS has a negligible impact on I/O bandwidth, and the same is true for software RAID. I'm almost saturating NVMe drives using a combination of LUKS (aes-xts) and software RAID. Additionally, the encryption and decryption processes are almost free when using hardware AES-NI instructions, especially while waiting for I/O.
Agreed that you are deep into "you need to try & figure out" territory. The abstract theorycrafting has dug too deep, there's no good answers to such questions at this stage.
> The non-async portions of a high level filesystem read operation appear rather trivial: checking for cache hits (page cache, dentry cache, etc), parsing the inode/dentry info, and the memcpy to userspace.
Worth maybe pointing out the slick work extfuse has done to make eBPF a capable way to do a lot of base fs stuff. That userland can send in eBPF programs to run various fs tasks is pretty cool flexibility, and this work has shown colossal gains by letting these formerly-FUSE filesystems-in-userland author & send up their own eBPF to handle various of these responsibilities, but now in kernel.
https://github.com/extfuse/extfuse
Very much agreeing again though. Although the CF article highlights extremes, there's really a toolkit described here to build io_uring processing as you'd like, shaping how many kernel threads run & many other parameters as you please. Folks keep asking for specifics of how things work, but the answer keeps feeling like: it depends on how you opt to use it.
It's not heavily used yet, but Rust has a bunch of fairly high-visibility efforts. The situation sort of feels similar to HTTP/3, where the problem is figuring out which to pick. https://github.com/tokio-rs/tokio-uring https://github.com/bytedance/monoio https://github.com/DataDog/glommio
Alas, libuv (powering Node.js) shipped io_uring but disabled it later. Seems to have significantly worn out the original author on the topic to boot. https://github.com/libuv/libuv/pull/4421#issuecomment-222586...