You don't need io_uring. For many workloads, being slow & inefficient is acceptable; it isn't awful. But gee, I'd rather start from a modern baseline that has a high level of mechanistic sympathy with the hardware, where things like network & IO work can be done in an efficient async manner.
Why do I need io_uring? Because it sounds awful and unhackerly to suffer living in a much lesser, worse world.
Mechanical sympathy is understanding the system, not using the shiniest thing. If you want low latency processing of one event at a time, you are either going to burn an entire core spinning or you are going to do a syscall for each operation. The io_uring syscalls are not especially fast — they get their awesomeness by doing, potentially, a whole lot of work per operation. And, for some use cases, by having a superior async IO model.
But if you actually just want read(), then call read().
Low latency for a single event is never going to have mechanistic sympathy, will be a colossal waste of most of your system.
Highly concurrent system usage is what it takes. EPOLLEXCLUSIVE (2016) finally sort of gets epoll vaguely capable of what OSes were doing decades ago, but it is still difficult to use & a rat's nest of complexity. Who here feels good reading https://stackoverflow.com/questions/41582560/how-does-epolls... ?
The submission/completion queue model of io_uring makes sense. It lets work be added or resolved without crossing that painful slow kernel barrier. It's been expanded to offer a lot more operations than what could be done in epoll.
The "shiniest thing" is a vast leap in capabilities, systems legibility, and overall (not single-operation) throughput. You cannot remotely get the numbers io_uring was bringing three years ago any other way. And it's only gotten further and further ahead while everyone else has sat still.
> Low latency for a single event is never going to have mechanistic sympathy, will be a colossal waste of most of your system.
Excuse me? I maintain a production system that cares about low latency for single events. Declaring that it doesn’t have “mechanistic sympathy” entirely misses the point. Of course I’m not squeezing the most throughput out of every cycle of my CPU. I have a set of design requirements, I understand what the kernel and CPU and IO system do under the hood, and I designed the system to make the most of the resources at hand to achieve the design requirements. Which, in this case, are minimal latency for single events or small groups of events, and io_uring would have no benefit.
(I can steam in events at a very nice rate as measured in events/sec, but I never tried to optimize that, and I should not try to optimize that because it would make the overall system perform worse.)
You aren't using your chips efficiently. That's basically it. Maybe your use case justifies it, but you are not taking advantage of a massive part of what chips do. That's on you. And it does make you a pretty unusual use case compared with most software development.
Fine, you've talked yourself deeply into a conviction that async doesn't and won't ever matter for you. But man, most people are properly doing the right thing by optimizing for throughput, not single events, and async has altered the game in amazingly, colossally positive ways for computing efficiency.
If you want mechanistic sympathy and low latency, then you can't really do much better than DPDK; io_uring still goes through the very generic and abstracted kernel networking stack.