I feel a bit paralyzed by Fear Of Missing Io_Uring. There's so much awesome streaming stuff about (RisingWave, Materialize, NATS, DataFusion, Velox, neat upstarts like Iggy, many more), but it all feels built on slower legacy system libraries.
You don't need io_uring. For many workloads being slow & inefficient is acceptable; it isn't awful. But gee, I'd rather start from a modern baseline that has high levels of mechanistic sympathy with the hardware, where things like network & IO work can be done in an efficient async manner.
Why do I need io_uring? Because it sounds awful and unhackerly to settle for living in a much lesser world.
Mechanical sympathy is understanding the system, not using the shiniest thing. If you want low latency processing of one event at a time, you are either going to burn an entire core spinning or you are going to do a syscall for each operation. The io_uring syscalls are not especially fast — they get their awesomeness by doing, potentially, a whole lot of work per operation. And, for some use cases, by having a superior async IO model.
But if you actually just want read(), then call read().
Low latency for a single event is never going to have mechanistic sympathy, will be a colossal waste of most of your system.
Highly concurrent system usage is what it takes. EPOLLEXCLUSIVE (2016) finally sort of gets epoll vaguely capable of what OSes were doing decades ago, but it is still difficult to use & a rat's nest of complexity. Who here feels good reading https://stackoverflow.com/questions/41582560/how-does-epolls... ?
The submission/completion queue model of io_uring makes sense. It lets work be added or resolved without crossing that painful slow kernel barrier. It's been expanded to offer a lot more operations than what could be done in epoll.
The "shiniest thing" is a vast leap in capabilities, systems legibility, and overall (not single operation) throughout. You cannot remotely get the numbers io_uring was bringing three years ago any other way. And it's only gotten further and further ahead while everyone else has sat still.
> Low latency for a single event is never going to have mechanistic sympathy, will be a colossal waste of most of your system.
Excuse me? I maintain a production system that cares about low latency for single events. Declaring that it doesn’t have “mechanistic sympathy” entirely misses the point. Of course I’m not squeezing the most throughput out of every cycle of my CPU. I have a set of design requirements, I understand what the kernel and CPU and IO system do under the hood, and I designed the system to make the most of the resources at hand to achieve the design requirements. Which, in this case, are minimal latency for single events or small groups of events, and io_uring would have no benefit.
(I can stream in events at a very nice rate as measured in events/sec, but I never tried to optimize that, and I should not try to optimize that because it would make the overall system perform worse.)
You aren't using your chips efficiently. That's basically it. Maybe your use case justifies it, but you are not taking advantage of a massive part of what chips do. That's on you. And it does make yours a pretty different use case than most software development.
Fine, you've talked yourself deeply into a conviction that async doesn't and won't ever matter for you. But man, most people are properly doing the right thing by optimizing for throughput, not single events, and async has altered the game in amazingly, colossally positive ways for computing efficiency.
if you want mechanistic sympathy and low latency then you can't really do much better than dpdk; uring is still going through the very generic and abstracted kernel networking stack.
io_uring is a low level abstraction and is generally a wash against epoll. Really won't make a difference for these kinds of applications, especially not for client nodes.
io_uring allows for async reads and writes to disk without forcing a thread pool or direct I/O. That alone makes it much more scalable for workloads that touch both the network and disk.
The point was that io_uring isn't going to make a big difference for the network code, as for disk I/O code (especially for the sorts of things GP is talking about) you have a bounded number of "threads" of execution anyway. For a node in a pub-sub system, maybe it has c10k users but it's probably appending to a handful of LSM-like datastructures that are written sequentially to disk. The biggest difference is random reads, but even then you can saturate what the disk will do with double digit numbers of threads.
No, most operations in the ring directly work asynchronously. The thread mechanism only exists as a fallback for combinations of operations and system configurations (e.g. filesystems) that don't support asynchronous operation.
I don't know anything about the internals of io_uring and am genuinely curious how it works. Saying it "directly works asynchronously" doesn't mean anything though. When circular buffer requests are processed what thread is processing the request, how is that thread managed, and how does it manage blocking/unblocking when communicating with the storage device?
Internally, many parts of the Linux kernel operate asynchronously: they queue up a request with some subsystem (e.g. a hardware device), and get an event delivered when the request is completed. In such cases, io_uring can enqueue such a request, and complete it when receiving the event, without needing to use a thread to block waiting for it.
See, for instance, https://lpc.events/event/11/contributions/901/attachments/78... slide 5 (though more has happened since then). io_uring will first see if it has everything needed to do the operation immediately, if not it'll queue a request in some cases (e.g. direct I/O, or buffered I/O in some cases). The thread pool is the last fallback, which always works if nothing else does.
Is it safe to say that a single thread using io_uring should be as fast or faster than N threads performing the same set of I/O tasks in a blocking manner?
In other words, can you count on the kernel to use its own threads internally whenever an I/O task might actually need to use a lot of CPU?
If you saturate the submission queue with CPU-bottlenecked tasks, it defeats the value-add of io_uring - at that point, you might as well replace your kernel-space thread pool with a user-space one.
Sure, but that approach forces you to consider/research just how much CPU your I/O tasks may or may not require. What if I'm not sure? How CPU-intensive is open()? What about close()? What about read()?
It would simplify my design process if I could count on io_uring being optimal for ~all I/O tasks, rather than having to treat "CPU-heavy I/O" and "CPU-light I/O" as two separate things that require two separate designs.
This is something that will require profiling to get exact numbers. The non-async portions of a high level filesystem read operation appear rather trivial: checking for cache hits (page cache, dentry cache, etc), parsing the inode/dentry info, and the memcpy to userspace. I wouldn't worry about any of these starving subsequent io_uring SQEs.
I reckon the most likely place you'd find unexpected CPU-heavy work is at the block layer. Software RAID and dmcrypt will burn plenty of cycles, enough to prove as exceptions to the "no FPU instructions in the kernel" guideline.
> Software RAID and dmcrypt will burn plenty of cycles, enough to prove as exceptions to the "no FPU instructions in the kernel" guideline.
LUKS has a negligible impact on I/O bandwidth, and the same is true for software RAID. I'm almost saturating NVMe drives using a combination of LUKS (aes-xts) and software RAID. Additionally, the encryption and decryption processes are almost free when using hardware AES-NI instructions, especially while waiting for I/O.
Agreed that you are deep into "you need to try & figure out" territory. The abstract theorycrafting has dug too deep, there's no good answers to such questions at this stage.
> The non-async portions of a high level filesystem read operation appear rather trivial: checking for cache hits (page cache, dentry cache, etc), parsing the inode/dentry info, and the memcpy to userspace.
Worth maybe pointing out the slick work extfuse has done to make eBPF a capable way to do a lot of base fs stuff. That userland can send in eBPF programs to run various fs tasks is pretty cool flexibility, and this work has shown colossal gains by letting these formerly-FUSE filesystems-in-userland author & send up their own eBPF to handle various of these responsibilities, but now in kernel.
https://github.com/extfuse/extfuse
Very much agreeing again though. Although the CF article highlights extremes, there's really a toolkit described here to build io_uring processing as you'd like, shaping how many kernel threads run & many other parameters as you please. Folks keep asking for specifics of how things work, but the answer keeps feeling like: it depends on how you opt to use it.
It's not heavily used yet, but Rust has a bunch of fairly high-visibility efforts. The situation sort of feels similar to HTTP/3, where the problem is figuring out which to pick. https://github.com/tokio-rs/tokio-uring https://github.com/bytedance/monoio https://github.com/DataDog/glommio
Alas, libuv (powering Node.js) shipped io_uring but disabled it later. Seems to have significantly worn out the original author on the topic to boot. https://github.com/libuv/libuv/pull/4421#issuecomment-222586...