Slitter: A slab allocator that trusts, but verifies

gopalv · on Aug 5, 2021

Allocator efficiency was the primary concern in the past (both speed and packing), which is why there are several things in malloc which is obviously bad for allocation safety.

The allocators in the past have tried to use the space it is not currently using for user code for its own book-keeping, which is very efficient when it comes to usage of space. So the free-lists are interleaved with the actual allocations to be as cheap as they can be (not in a different page).

Oddly enough, packing all the metadata in a single area might actually be faster though at lower scales might be worse for the data itself.

I've used talloc[1] in the past, which traded some of the efficiencies for safety (I have my own sins trying to copy, but simplify pool based allocators for PHP apc with some tricks[3] borrowed from xcache).

If the author is reading, the missing magic I've always missed with jemalloc are the .gdb macros for walking the allocator and do things like print the allocations by size/content [2]. You can write them in python now and that can let you print them out neatly in a graphviz dot structure with linkages, which is the sort of project which you hand out to an intern who just joined.

[1] - https://talloc.samba.org/talloc/doc/html/libtalloc__dts.html

[2] - https://research.nccgroup.com/2015/08/11/libtalloc-a-gdb-plu...

[3] - http://notmysock.org/blog/2009/Jan/08/

pkhuong · on Aug 5, 2021

gdb scripts are a really good idea! I'll add that to our issues. https://github.com/backtrace-labs/slitter/issues/12

parhamn · on Aug 5, 2021

> guarantee type stable allocations: once an address has been used to fulfill a request for a certain allocation class, it will only be used for that class. Slitter doesn’t overlay intrusive lists on top of freed allocations, so the data always reflects what the application last stored there. This means that double-frees and use-after-frees only affect the faulty allocation class. An application could even rely on read-after-free being benign to simplify non-blocking algorithms.

This is really fascinating. Does anyone know how much the typical cost of such types of guarded allocations are? is it in the single/double digit percentages?

wahern · on Aug 5, 2021

Intrusive vs. not-intrusive isn't nearly sufficient to make that comparison. For example, in phkmalloc the switch to a separate index generally improved performance relative to the allocator being replaced: http://phk.freebsd.dk/pubs/malloc.pdf However, from that paper it seems that the previous allocator made some particularly bad choices that resulted in unnecessarily poor cache locality.

OpenBSD's allocator uses a separate index, as well as many more mitigations (many more than Slitter), and unfortunately OpenBSD's allocator ends up being substantially, even notoriously slower than most other allocators. But, again, this has little to do with using a separate index, except that using a separate index opens the door to many other potential mitigations. And where ever there's a choice between performance or mitigation, OpenBSD usually takes the latter (though sometimes they walk that back as usage and efficacy become more clear).

lep · on Aug 5, 2021

I can only advise everybody to run your C program under OpenBSD. I maintain a tool which is probably run against million lines of code and which i tested under FreeBSD, Windows and Linux where it worked. But recently i ran it the first time under OpenBSD and it SEGFAULTed almost immediatly. Of course it was an error on my side. It was a beautiful experience.

ndesaulniers · on Aug 5, 2021

Looks like the rise of hardened memory allocators. Reminds me of LLVM's scudo allocator now being used and Android and Fuchsia: https://llvm.org/docs/ScudoHardenedAllocator.html.

wyldfire · on Aug 5, 2021

I was going to ask how this one compared to scudo.

Const-me · on Aug 5, 2021

I have recently implemented something remotely similar in a totally different way.

Computers are 64-bit by now. For most practical applications, address space is unlimited. This means you can allocate let’s say 1TB of continuous address space at startup, and implement allocator on top that allocates and commits physical pages in that range, implementing a stack allocator. Not only it never moves the data, it also guarantees the complete memory is at continuous range of addresses. For instance, you can view that thing as an infinite std::vector() which never requires reallocations.

I did it for Windows on top of VirtualAlloc() API. Linux has it too, mmap() with PROT_NONE protection flag to reserve address space, then mprotect() to allocate physical pages in that range.

titzer · on Aug 5, 2021

Well, unfortunately, address space isn't unlimited. In my experience developing 64-bit wasm engines, the typical approach of reserving 8GB of memory for each Wasm memory leads to address space exhaustion after only about 1000 memories (regardless of how many actual pages those memories use). Therefore for a long living VM that might load and unload hundreds or even thousands of programs (on different webpages for example), it's really important to clean up those reservations promptly.

In V8, cleaning up the reservations for Wasm memories is ultimately tied to garbage collection of the JS heap, since JS objects can root memory. That led to the somewhat clunky need to retry allocation in a loop, running a full GC if allocation fails the first few times (https://github.com/v8/v8/blob/master/src/objects/backing-sto...).

There are also fundamental kernel limitations on virtual memory size. They are system dependent but much lower than 2^64.

Const-me · on Aug 5, 2021

On modern windows the limit is 2^47 bytes: https://docs.microsoft.com/en-us/windows/win32/memory/memory... With 8GB of virtual space per engine, should be enough for 16k of these engines. Interesting that you got OOM only after 1k engines, that’s 16x below the limit. Was that on Windows? Actual hardware or some VM substitute?

In my use case, I was lucky as I only need an allocator per hardware thread. They never approach thousands. So far, the maximum is 256 hardware threads, like in dual-CPU EPYC 7713.

titzer · on Aug 5, 2021

It seems to depend a lot on current conditions and is ultimately probabilistic. I.e. it would not fail 100% at 1000, but some large enough fraction of the time that that strategy was not shippable for Wasm. IIRC it might only have been 10% of the time, but it quickly increases to 100% along a sliding scale up to the hard limit. Part of that is fragmentation as well, because Chrome uses an allocator underneath that can carve out fairly large reservations that fragment the address space.

psykotic · on Aug 6, 2021

Just wanted to cosign that this is absolutely a real-world concern. For that reason I always try to make huge reservations as early in the process lifecycle as possible [1] and then subdivide them from there. Then you can know ahead of time how much space you have to work with. This still has potential composability issues, but partitioning the address space like this before VA fragmentation becomes an issue really helps when you have a greedy VA reserver competing against typical workloads (and their malloc's mmap/VirtualAlloc requests), especially in long-running programs.

[1] Ideally before most shared libraries have been loaded. I should do a simulation to get some quantitative estimates, but my hunch is that ASLR hurts badly even if each loaded library only takes up one 4 kB page. E.g. if you have an empty address space, allocate a 4 kB page at a random address and split the space in two around it, then try to densely pack as many 8 GB ranges in there as possible, you lose one potential 8 GB range (even if the ranges only require 4 kB alignment for their base address). Whenever such splits don't land in the same 8 GB clump, you lose a potential 8 GB range per split.

michielderhaeg · on Aug 5, 2021

This is what the Glasgow Haskell Compiler runtime does. If you run a Haskell application, you'll see it has uses 1TB of virtual memory.

This way you never have to talk to the kernel if you need more memory.

weavie · on Aug 5, 2021

Also Clozure Common Lisp.

eptcyka · on Aug 5, 2021

I can only imagine this working for applications that don't churn through memory or applications that are short lived. But in those cases, it would work really well.

Const-me · on Aug 5, 2021

My application is long-lived (at least hours) and uses gigabytes of memory.

It works for me because I have distinct moments in time when the whole stack is no longer needed. I’m working on a CAM/CAE app where work is organized as a sequence of optimization passes. I only need to keep the data within 1 iteration. At the end of every optimization pass I simply reset the stack pointer to the very bottom, and on the next iteration the stack starts growing again from zero reusing these physical memory pages.

BTW I’m using multiple of these stacks, one per thread. Helps with performance because no locks are needed, also L1 and L2 caches are inside cores, only the L3 is shared across the whole CPU.

Joker_vD · on Aug 5, 2021

So basically, your use case is the extreme variant of the one described in Appel's "Garbage collection can be faster than stack allocation" paper.

Const-me · on Aug 5, 2021

Not sure it deserves to be called garbage collector, too simple.

The approach is pretty close to arena allocation https://en.wikipedia.org/wiki/Region-based_memory_management... The only twist, with modern 64-bit processors there’s no need for linked lists or multiple blocks, can instead add new physical pages into pre-allocated address space.

stefan_ · on Aug 5, 2021

Address space is unlimited but pages are still 4K.

Const-me · on Aug 5, 2021

Right, I thought about large pages for that, unfortunately on Windows they’re pretty much useless. They require an elevated process, and they are incompatible with swap file i.e. they never paged to disk even when there’s low memory: https://docs.microsoft.com/en-us/windows/win32/memory/large-...

gpderetta · on Aug 5, 2021

and TLB is also a scarce resource.

sgt · on Aug 5, 2021

Speaking of Backtrace, they recently got acquired by Sauce Labs: https://saucelabs.com/news/sauce-labs-extends-continuous-tes...

itamarst · on Aug 5, 2021

I'm working on a memory profiler for Python that is fast enough to run in production (see link below), so I've ended up with some similar problems re performance and importance of testing.

A few things the article talks about where one can maybe do even better:

1. likely()/unlikely() not being in stable Rust. This is true, but the hashbrown project has some hacked-up variants it claims work on stable: https://github.com/rust-lang/hashbrown/blob/bbda6e0077bafb75...

2. Rust not having fast thread locals. Same problem for me, so likewise did it in C with "initial-exec". But! If you use clang, you can get LTO across C and Rust, so you can get fast thread locals _and_ not have function call overhead. Basically need to use same version of Clang as Rust does (12 at the moment) and do a little song and dance in linker and compiler flags. See https://matklad.github.io/2020/10/03/fast-thread-locals-in-r...

3. For testing these sort of things, being able to assert "this test touched this code path" is extremely useful. In my case, for example, I have different code paths for sampled and unsampled allocations, but from perspective of code calling malloc() everything should be identical. So how to tell if correct code path was used? Coverage marks are a great solution for this: https://ferrous-systems.com/blog/coverage-marks/

(The Python profiler, if anyone is interested: I've already released an open source memory profiler that tracks all allocations, https://pythonspeed.com/fil/. Unlike most memory profilers it tells you source-of-allocations at peak memory, which is key for data processing applications. The commercial production-grade version I'm now working on uses sampling, and will be even more focused on data processing batch applications; the goal is to have essentially no performance overhead so it can always be on.)

pkhuong · on Aug 5, 2021

I tried a handful of such hacks for likely/unlikely annotations on stable, when working on versions of https://crates.io/crates/reciprocal. They all resulted in correct code, but none of them had any impact on basic block ordering (on rust 1.52, I believe).

nextaccountic · on Aug 5, 2021

Does this allocator ever returns memory to the OS after deallocation?

saagarjha · on Aug 5, 2021

I'm curious how you do allocation partitioning based on type for C. You can certainly segregate on size, but that still allows confusions across types that are the same size–plumbing type metadata should require compiler support.

pkhuong · on Aug 5, 2021

You use a different interface than malloc/free and ask the programmer to pass in a class tag. https://github.com/backtrace-labs/slitter/blob/7afb9781fd25b...

BoiledCabbage · on Aug 5, 2021

"Trust but verify" is such an empty phrase.

Two people getting ready to leave the house.

A: "Did you turn off the stove?" B: "Yes" A: "Are you sure?" B: "Yes"

If you trust them, then you leave the house. If you go verify it yourself, then you don't trust them.

anamexis · on Aug 5, 2021

I think it's an intentionally oxymoronic, witty turn of phrase.

In case you aren't aware of its history: https://en.wikipedia.org/wiki/Trust,_but_verify

BoiledCabbage · on Aug 6, 2021

Thanks - and yes I'm aware of it's history.

And as to support my point, even your link quotes Stalin being the inspiration of the concept specifically says:

"Healthy distrust makes good base for cooperation"

Ie the phrase is simply distrust - it has no meaning to trust at all.

strken · on Aug 5, 2021

I don't think it's entirely empty. There's a difference between a hostile entity who will try to harm you and whose attempts must be thwarted, and a mostly benevolent entity who will occasionally do the wrong thing if not given oversight.

Consider the different approaches to mitigating terrorism (proactive prevention against a hostile actor willing to make sacrifices to harm you) versus mitigating tax evasion (after-the-fact verification and punishment against a greedy actor willing to make rational selfish decisions).