Gazette: Cloud-native millisecond-latency streaming (github.com/gazette)
139 points by danthelion on Aug 7, 2024 | hide | past | favorite | 52 comments


From reading the docs, this has an IMO surprising design decision: the “journal” is a stream of bytes, where each append (of a byte string) is atomic and occurs in a global order. The bytes are grouped into fragments, and no write spans a fragment boundary.

This seems sort of okay if writes are self-delimiting and never corrupt, and synchronization can always be recovered at a fragment boundary.

I suppose it’s neat that one can write JSONL and get actual JSONL in the blobs. But this seems quite brittle if multiple writers write to one journal and one malfunctions (aside from possibly failing to write a delimiter, there’s no way to tell who wrote a record, and using only a single writer per journal seems to defeat the purpose). And getting, say, Parquet output doesn’t seem like it will happen in any sensible way.


:wave: Hi, I'm the creator of Gazette.

> But this seems quite brittle if multiple writers write to one journal and one malfunctions (aside from possibly failing to write a delimiter, there’s no way to tell who wrote a record, and using only a single writer per journal seems to defeat the purpose).

Yes, writers are responsible for only ever writing complete delimited blocks of messages, in whatever framing the application wants to use.

Gazette promises to provide a consistent total order over a bunch of raced writes, and to roll back broken writes (partial content and then a connection reset, for example), and checksum, and a host of other things. There's also a low-level "registers" concept which can be used to cooperatively fence a capability to write to a journal, off from other writers.
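For illustration, here's a rough Python sketch of that writer/reader discipline with JSONL framing. To be clear, these helper names are made up for the example and aren't Gazette's client API: the writer only ever appends whole newline-terminated records, and a reader that lands mid-stream resynchronizes by skipping to the next delimiter.

```python
import json

def frame_records(records):
    """Writer side: encode records as one complete, newline-delimited
    JSONL append. The whole byte string goes in a single append so no
    partial record is ever visible in the journal."""
    return b"".join(json.dumps(r).encode() + b"\n" for r in records)

def resync(raw: bytes) -> list:
    """Reader side: recover synchronization at a delimiter boundary by
    dropping any torn or unparseable line, then parsing whole records."""
    out = []
    for line in raw.split(b"\n")[:-1]:  # last element is empty or a partial tail
        try:
            out.append(json.loads(line))
        except ValueError:
            continue  # skip a torn/garbage line rather than fail the stream
    return out
```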

But garbage in => garbage out, and if an application correctly writes bad data, then you'll have bad data in your journal. This is no different from any other file format under the sun.

> there’s no way to tell who wrote a record

To address this comment specifically: while brokers are byte-oriented, applications and consumers are typically message oriented, and the responsibility for carrying metadata like "who wrote this message?" shifts to the application's chosen data representation instead of being a core broker concern.

Gazette has a consumer framework that layers atop the broker, and it uses UUIDs which carry producer and sequencing metadata in order to provide exactly-once message semantics atop an at-least-once byte stream: https://gazette.readthedocs.io/en/latest/architecture-exactl...
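The reader-side half of that boils down to something like this toy sketch (illustrative only, not the actual consumer framework code): each message carries producer identity plus a sequence number, and the consumer drops anything it has already seen, turning at-least-once delivery into exactly-once processing.

```python
def dedupe(messages):
    """Yield each payload at most once, assuming every message is tagged
    (producer, seq, payload) with per-producer increasing sequence numbers.
    Replays and duplicates from the at-least-once byte stream are dropped."""
    last_seen = {}  # producer -> highest sequence number processed
    for producer, seq, payload in messages:
        if seq <= last_seen.get(producer, -1):
            continue  # duplicate or replayed message
        last_seen[producer] = seq
        yield payload
```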


> :wave: Hi, I'm the creator of Gazette.

Hi!

> if an application correctly writes bad data, then you'll have bad data in your journal. This is no different from any other file format under the sun.

In a journal that delimits itself, a bad write corrupts only that write (and anything depending on it) — it doesn’t make the next message unreadable. I’m not sure how I feel about this.

I maintain a journal-ish thing for internal use, and it’s old and crufty and has all manner of design decisions that, in retrospect, are wrong. But it does strictly separate writes from different sources, and each message has a well defined length.

Also, mine supports compressed files as its source of truth, which is critical for my use case. It looks like Gazette has a way to post process data before it turns into a final fragment — nifty. I wonder whether anyone has rigged it up to produce compressed Parquet files.


To my knowledge, nobody's implemented Parquet fragment files. But it supports compression of JSONL out of the box. JSON compresses very well, and compression ratios approaching 10:1 are not uncommon.
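You can sanity-check that claim in a couple of lines with synthetic change-event records (illustrative data, not a benchmark; real ratios depend on your payloads):

```python
import gzip
import json

# Repetitive JSONL, as change-event streams tend to be.
records = [
    {"id": i, "op": "update", "table": "orders", "status": "shipped"}
    for i in range(1000)
]
raw = "".join(json.dumps(r) + "\n" for r in records).encode()

compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)  # typically well above 5x for data like this
```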

But more to the point, journals are meant for things that are written _and read_ sequentially. Parquet wasn't really designed for sequential reads, so it's unclear to me whether there would be much benefit. IMHO it's better to use journals for sequential data (think change events) and other systems (e.g. RDBMS or parquet + pick-your-compute-flavor) for querying it. I don't think there's yet a storage format that works equally well for both.


I don't think it's correct to say that JSONL is any more vulnerable to invalid data than other message framings. There's literally no system out there that can fully protect you from bugs in your own application. But the client libraries do validate the framing for you automatically, so in practice the risk is low. I've been running decently large Gazette clusters for years now using the JSONL framing, and have never seen a consumer write invalid JSON to a journal.

The choice of message framing is left to the writers/consumers, so there's also nothing preventing you from using a message framing that you like better. Similarly, there's nothing preventing you from adding metadata that identifies the writer. Having this flexibility can be seen as either a benefit or a pain. If you see it as a pain and want something that's more high-level but less flexible, then you can check out Estuary Flow, which builds on Gazette journals to provide higher-level "Collections" that support many more features.


Gazette is at the core of Estuary Flow (https://estuary.dev), a real-time data platform. Unlike Kafka, Gazette’s architecture is simpler to reason about and operate. It plays well with k8s and is backed by S3 (or any object storage).


Interesting, are there any open source alternatives to tinybird?

https://www.tinybird.co/


Not an exact match, but https://github.com/soketi/soketi might work for your needs (API-compatible with https://pusher.com )


I've written one, but it's not public or production-ready yet. It's built on top of ClickHouse as well.

Basically a pipe is just a collection of WITH statements with some template processing.


I feel a bit paralyzed by Fear Of Missing Io_Uring. There's so much awesome streaming stuff about (RisingWave, Materialize, NATS, DataFusion, Velox, neat upstarts like Iggy, many more), but it all feels built on slower legacy system libraries.

It's not heavily used yet, but Rust has a bunch of fairly high-visibility efforts. The situation feels similar to HTTP/3, where the problem is figuring out what to pick. https://github.com/tokio-rs/tokio-uring https://github.com/bytedance/monoio https://github.com/DataDog/glommio

Alas, libuv (powering Node.js) shipped io_uring but disabled it later. It seems to have significantly worn out the original author on the topic, to boot. https://github.com/libuv/libuv/pull/4421#issuecomment-222586...


I seem to be missing context for this reply. Why do you need io_uring?


You don't need io_uring. For many workloads, being slow and inefficient is acceptable. But gee, I'd rather start from a modern baseline that has high levels of mechanistic sympathy with the hardware, where things like network and IO work can be done in an efficient async manner.

Why do I need io_uring? Because it sounds awful and unhackerly to suffer living in a much worse world.


Mechanical sympathy is understanding the system, not using the shiniest thing. If you want low latency processing of one event at a time, you are either going to burn an entire core spinning or you are going to do a syscall for each operation. The io_uring syscalls are not especially fast — they get their awesomeness by doing, potentially, a whole lot of work per operation. And, for some use cases, by having a superior async IO model.

But if you actually just want read(), then call read().


Low latency for a single event is never going to have mechanistic sympathy; it will be a colossal waste of most of your system.

Highly concurrent system usage is what it takes. EPOLLEXCLUSIVE (2016) finally sort of gets epoll vaguely capable of what OSes were doing decades ago but is still difficult to use & a rats nest of complexity. Who here feels good reading https://stackoverflow.com/questions/41582560/how-does-epolls... ?

The submission/completion queue model of io_uring makes sense. It lets work be added or resolved without crossing that painful slow kernel barrier. It's been expanded to offer a lot more operations than what could be done in epoll.
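As a purely toy model of the queue-pair idea (not real io_uring, which lives in shared memory mapped between userspace and the kernel): the application enqueues submission entries without paying a syscall per operation, and completions get posted back in bulk.

```python
from collections import deque

# Toy sketch of io_uring's submission/completion queue pair.
sq = deque()  # submission queue: app -> "kernel"
cq = deque()  # completion queue: "kernel" -> app

def submit(op, data):
    """App side: enqueue work without a per-operation syscall."""
    sq.append((op, data))

def kernel_drain():
    """'Kernel' side: drain all pending submissions in one crossing
    and post their completions to the CQ."""
    while sq:
        op, data = sq.popleft()
        cq.append((op, f"done:{data}"))
```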

The "shiniest thing" is a vast leap in capabilities, systems legibility, and overall (not single-operation) throughput. You cannot remotely get the numbers io_uring was bringing three years ago any other way. And it's only gotten further and further ahead while everyone else has sat still.


> Low latency for a single event is never going to have mechanistic sympathy; it will be a colossal waste of most of your system.

Excuse me? I maintain a production system that cares about low latency for single events. Declaring that it doesn’t have “mechanistic sympathy” entirely misses the point. Of course I’m not squeezing the most throughput out of every cycle of my CPU. I have a set of design requirements, I understand what the kernel and CPU and IO system do under the hood, and I designed the system to make the most of the resources at hand to achieve the design requirements. Which, in this case, are minimal latency for single events or small groups of events, and io_uring would have no benefit.

(I can stream in events at a very nice rate as measured in events/sec, but I never tried to optimize that, and I should not try to optimize that because it would make the overall system perform worse.)


You aren't using your chips efficiently. That's basically it. Maybe your use case justifies it, but you are not taking advantage of a massive part of what chips do. That's on you. And it does make you a pretty different use case from most software development.

Fine, you've talked yourself deeply into a conviction that async doesn't and won't ever matter for you. But man, most people are properly doing the right thing by optimizing for throughput, not single events, and async has altered the game in amazingly positive ways for computing efficiency.


Squeezing every ounce of latency out of a system is just as valid mechanical sympathy as squeezing out every ounce of throughput.


Or you are going to use an FPGA network card, like the HFT firms do.


if you want mechanistic sympathy and low latency then you can't really do much better than dpdk; uring is still going through the very generic and abstracted kernel networking stack.


Every time we've used something built on dpdk in production it was horribly bloated and slow.

I'm pretty sure this stuff is optimized for marketing benchmarks, not the real world.


io_uring is a low level abstraction and is generally a wash against epoll. Really won't make a difference for these kinds of applications, especially not for client nodes.


io_uring allows for async reads and writes to disk without forcing a thread pool or direct I/O. That alone makes it much more scalable for workloads that touch both the network and disk.


The point was that io_uring isn't going to make a big difference for the network code, as for disk I/O code (especially for the sorts of things GP is talking about) you have a bounded number of "threads" of execution anyway. For a node in a pub-sub system, maybe it has c10k users but it's probably appending to a handful of LSM-like datastructures that are written sequentially to disk. The biggest difference is random reads, but even then you can saturate what the disk will do with double digit numbers of threads.


Doesn't the kernel use a thread pool to process the requests in the ring, because the kernel is still designed around blocking disk I/O?


No, most operations in the ring directly work asynchronously. The thread mechanism only exists as a fallback for combinations of operations and system configurations (e.g. filesystems) that don't support asynchronous operation.


I don't know anything about the internals of io_uring and am genuinely curious how it works. Saying it "directly works asynchronously" doesn't mean anything though. When circular buffer requests are processed what thread is processing the request, how is that thread managed, and how does it manage blocking/unblocking when communicating with the storage device?


Internally, many parts of the Linux kernel operate asynchronously: they queue up a request with some subsystem (e.g. a hardware device), and get an event delivered when the request is completed. In such cases, io_uring can enqueue such a request, and complete it when receiving the event, without needing to use a thread to block waiting for it.

See, for instance, https://lpc.events/event/11/contributions/901/attachments/78... slide 5 (though more has happened since then). io_uring will first see if it has everything needed to do the operation immediately, if not it'll queue a request in some cases (e.g. direct I/O, or buffered I/O in some cases). The thread pool is the last fallback, which always works if nothing else does.

https://lwn.net/Articles/821274/ talks about making async buffered reads work, for instance.


Is it safe to say that a single thread using io_uring should be as fast or faster than N threads performing the same set of I/O tasks in a blocking manner?

In other words, can you count on the kernel to use its own threads internally whenever an I/O task might actually need to use a lot of CPU?


If you saturate the submission queue with CPU-bottlenecked tasks, it defeats the value-add of io_uring - at that point, you might as well replace your kernel-space thread pool with a user-space one.


Sure, but that approach forces you to consider/research just how much CPU your I/O tasks may or may not require. What if I'm not sure? How CPU-intensive is open()? What about close()? What about read()?

It would simplify my design process if I could count on io_uring being optimal for ~all I/O tasks, rather than having to treat "CPU-heavy I/O" and "CPU-light I/O" as two separate things that require two separate designs.


This is something that will require profiling to get exact numbers. The non-async portions of a high level filesystem read operation appear rather trivial: checking for cache hits (page cache, dentry cache, etc), parsing the inode/dentry info, and the memcpy to userspace. I wouldn't worry about any of these starving subsequent io_uring SQEs.

I reckon the most likely place you'd find unexpected CPU-heavy work is at the block layer. Software RAID and dmcrypt will burn plenty of cycles, enough to prove as exceptions to the "no FPU instructions in the kernel" guideline.


> Software RAID and dmcrypt will burn plenty of cycles, enough to prove as exceptions to the "no FPU instructions in the kernel" guideline.

LUKS has a negligible impact on I/O bandwidth, and the same is true for software RAID. I'm almost saturating NVMe drives using a combination of LUKS (aes-xts) and software RAID. Additionally, the encryption and decryption processes are almost free when using hardware AES-NI instructions, especially while waiting for I/O.


Agreed that you are deep into "you need to try & figure out" territory. The abstract theorycrafting has dug too deep, there's no good answers to such questions at this stage.

One of the best gems of insight available about how io_uring's work does get ran is Missing Manuals - io_uring worker pool, cloudflare writeup that at least sets the stage. https://blog.cloudflare.com/missing-manuals-io_uring-worker-...

Since you mention

> The non-async portions of a high level filesystem read operation appear rather trivial: checking for cache hits (page cache, dentry cache, etc), parsing the inode/dentry info, and the memcpy to userspace.

Worth maybe pointing out the slick work extFUSE has done to make eBPF a capable way to do a lot of base fs stuff. That userland can send eBPF programs into the kernel to run various fs tasks is pretty cool flexibility, and this work has shown colossal gains by having these formerly-FUSE filesystems-in-userland author and send up their own eBPF to run these responsibilities, but now in kernel. https://github.com/extfuse/extfuse

Very much agreeing again though. Although the CF article highlights extremes, there's really a toolkit described for building io_uring processing as you'd like, shaping how many kernel threads and many other parameters as you please. People keep asking for specifics of how things work, but the answer keeps being that it depends on how you opt to use it.


More details viewable here: https://gazette.readthedocs.io/en/latest/


> the broker is pushing new content to us over a single long-lived HTTP response

Any plans to support websocket?

https://gazette.readthedocs.io/en/latest/brokers-tutorial-in...


What's the use case for millisecond-latency streaming? HFT? Remotely driving heavy machinery? Anything else?


I think it's less about guaranteed 1ms real time transactions and more about, like, it's just fast enough that you most likely don't have to worry about it introducing perceptible lag?

I'm working on a streaming audio thing and keeping latency low is a priority. I actually think I'll try Gazette, I just saw it now and it was one of those moments where it's like wait I go to Hacker News to waste time but this is quite exactly what I've been wanting in so many ways.

I'll use it for Ogg/Opus media streams, transcription results, chat events, LLM inferences...

I really like the byte-indexed append-only blob paradigm backed by object storage. It feels kind of like Unix as a distributed streaming system.

Other streaming data gadgets like Kafka always feel a bit uncomfortable and annoying to me with their idiosyncratic record formats and topic hierarchies and whatnot... I always wanted something more low level and obvious...


> wait I go to Hacker News to waste time but this is quite exactly what I've been wanting in so many ways.

This has happened so many times for me that I don't consider the time "wasted". I try to make sure I separate the a) "this is interesting personally", and b) "this is interesting professionally" threads and have a bunch of open tabs for a) that I can "read later".

But the items in (b) I read "now" and consider that to be work, not pleasure.


Collaborative systems come to mind. If you edit a document and want to subscribe to changes from other nodes it is valuable to have very low latency.


A millisecond is an eternity in HFT.


Where can I get nanosecond latency streaming?


The screen in front of your face. That'll give you like 3ns latency.


It is frequently faster to send an IP packet to another continent than to change a pixel on the screen.


No it isn't. John Carmack found a tv ten years ago that had 200-300ms of latency due to all its post processing and wrote an essay about it.

That doesn't mean that it is "frequently" faster to send packets to other continents than change pixels on screens. It doesn't even apply to modern tvs set up for latency, let alone computer monitors.


Yes, it really is. The problem is it takes a lot more than one frame for most modern software to change a pixel on the screen. I'm sitting in Hawaii on wifi right now and the first random US mainland server I pinged responded in 120ms, which means sending only took 60ms. Now say you're running a 30 Hz game with 2 frames of input lag, and you've already lost before even considering the input lag of the monitor itself.

There are just so many ways to accidentally get many frames of input lag. OS window compositors generally add a whole frame of input lag globally to every windowed app. Anything running in a browser has a second compositor in between it and the display that can add more frames. GPU APIs typically buffer one or two frames by default. And all of that is on top of whatever the app itself does, and whatever the monitor does (and whatever the input device does if you want to count that too).
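The back-of-envelope arithmetic from the comment above, spelled out (the 120ms ping and 30Hz/2-frame numbers are the commenter's example figures):

```python
# Round trip Hawaii -> US mainland -> Hawaii, so one way is half.
rtt_ms = 120.0
one_way_ms = rtt_ms / 2       # 60 ms to deliver the packet

# A 30 Hz game with 2 frames of buffered input lag.
frame_ms = 1000.0 / 30        # ~33.3 ms per frame
input_lag_ms = 2 * frame_ms   # ~66.7 ms before the monitor even gets it
```

So the rendered pixel already loses to the transoceanic packet before monitor latency is counted.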


No it really isn't. Are you really doubling down on this by talking about software that has built in latency?

Any game that runs at half the frame rate of a cheap TV and has an architecture designed to not draw frames immediately has nothing to do with what you're saying. That would be like someone deciding to send packets every 100ms and claiming 100ms extra latency.

All of this is forgetting that packets can be fired off whenever but with vsync on, frames need to wait for a specific timing. If you take that away you can set pixels with less latency.


Once you throw in head-of-line blocking, other requests in flight, and your average website's pile of ads and JavaScript operating systems layered on top of each other to emulate a small library that reimplements much of what browsers natively support:

Yeah I think displays, even when triple buffered, might win on average. Sending a single packet is fighting a straw man when compared against a full rendering pipeline with common habits. Compare minimums or compare common cases, crossing between them is unfair regardless of which direction you go.


> OS window compositors generally add a whole frame of input lag globally to every windowed app.

Is there a way to verify this is the case? In X11 Linux specifically.

Also does variable refresh rate like freesync help with this?


Not exactly what you ask for, but some reviewers measure it like this: https://www.rtings.com/monitor/tests/inputs/input-lag


A wire


A rather short wire. Short enough for Grace Hopper to give away to students during lectures.


The wire



