Why is memory reclamation so important for lock-free algorithms? (concurrencyfreaks.blogspot.com)
165 points by qznc on Sept 17, 2017 | hide | past | favorite | 42 comments


I once read an excellent book, The Art of Multiprocessor Programming [0]. All the examples in the book are in Java. After reading it, when thinking about some of the solutions and algorithms presented, I quickly concluded that deploying them in C/C++ would be an order of magnitude more complex, because of 'memory reclamation'. So yes, I think the OP makes a good point.

[0] https://www.amazon.com/Art-Multiprocessor-Programming-Revise...


I picked this book up from the library and it is definitely a useful book. About 2/3rds of the book is actual runnable example code.

Also agree totally on Java making concurrency way easier with GC. Russ Cox describes one of these problems here https://research.swtch.com/lockfree


Typical system memory allocation locks, so being able to ignore that aspect makes lock-free algorithms simpler, often because it makes them no longer lock free.


I've found that a lot of the tricky edge cases in lock free programming can be reduced to the following case:

Imagine that a thread reads a pointer, then immediately (before it dereferences it) gets scheduled out and frozen for arbitrarily long.

To keep the thread from crashing or performing incorrect logic when it wakes up, you have to make sure this pointer remains valid for arbitrarily long.

You have to accommodate the fact that a thread can get frozen basically forever.
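To make the hazard concrete, here is a minimal C++ sketch (all names are mine, purely illustrative) of exactly that window between loading a shared pointer and dereferencing it:

```cpp
#include <atomic>

struct Node { int value; };

std::atomic<Node*> shared{nullptr};

// Reader side: between the load and the dereference, the thread can be
// descheduled for an arbitrarily long time. If another thread frees the
// node during that window, the dereference is a use-after-free.
int unsafe_read() {
    Node* p = shared.load(std::memory_order_acquire);
    // <-- the thread may sleep here forever; *p must stay valid regardless
    return p ? p->value : -1;
}
```

Everything that follows in this thread (hazard pointers, epochs, GC) is a way of guaranteeing that `*p` stays valid across that gap.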


I've done quite a bit of lock free programming and getting into this situation should mean that the design isn't going to work in the first place. To start, using pointers to objects is a red flag to me. What you are really working with is data. Indexing that data in an array or lock free map can work.

This is also ignoring the issue that global heap memory allocation locks anyway, so allocating a new object already takes a lock every time. That is rarely what you want even for performance on a single core, and it's definitely not what you want for a lock-free data structure.


How do you think a lock-free map works?

Pretty much every fundamental lock-free data structure is built around pointers. It's true that a static array can work if you're willing to commit to a fixed size up-front (and pre-allocate that entire fixed size), but that is often too limiting for practical use.

> This is also ignoring the issue that global heap memory allocation locks anyway, so allocating a new object is already taking a lock every time.

Any memory allocator optimized for multi-threaded performance will have a thread-local or CPU-local cache that can allocate without taking a lock.
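As an illustration of that fast path (a toy sketch, not how any real allocator such as jemalloc or tcmalloc is actually structured), a thread-local free-list front end can satisfy most allocations without a lock:

```cpp
#include <cstddef>
#include <cstdlib>

struct FreeNode { FreeNode* next; };

// Per-thread cache of previously freed blocks (single size class,
// for illustration only).
thread_local FreeNode* local_cache = nullptr;

void* cached_alloc(std::size_t sz) {
    if (local_cache) {                // fast path: no lock, no atomics
        FreeNode* n = local_cache;
        local_cache = n->next;
        return n;
    }
    // Slow path: fall through to the shared heap, which may lock.
    return std::malloc(sz < sizeof(FreeNode) ? sizeof(FreeNode) : sz);
}

void cached_free(void* p) {           // push back onto the local cache
    FreeNode* n = static_cast<FreeNode*>(p);
    n->next = local_cache;
    local_cache = n;
}
```

Real allocators add size classes and periodically spill the cache back to a shared pool, which is where the occasional lock (or at least heavier synchronization) comes back in.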


Not all lock free maps work the same. The ones I've written didn't use pointers.

Even an allocator with a thread-local cache will still have to lock to map in more memory. That might be unavoidable altogether, but allocating objects from the heap and working through pointers is inefficient: there's the cost of the allocation itself, a huge increase in the number of system calls, and the fact that every access now means jumping around in memory to find anything.

Resizing contiguous memory is a separate problem, not a show stopper. Multiple buffers with an atomic switch is one way to confront it. Having all writes go to a new buffer while reads try the new buffer first but try the old buffer second if they can't find anything is one way to ensure that all reads stay lock free.
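A minimal sketch of that new-buffer-first/old-buffer-second read path (all names and the linear scan are my own illustration, not the parent's actual implementation):

```cpp
#include <atomic>
#include <cstddef>

struct Buffer {
    std::size_t cap;
    std::atomic<long>* slots;  // 0 == empty
};

struct Table {
    std::atomic<Buffer*> fresh{nullptr};  // all new writes land here
    std::atomic<Buffer*> stale{nullptr};  // old buffer, read-only

    // Readers never block: probe the new buffer first, then fall back
    // to the old one while the migration is still in flight.
    bool contains(long key) const {
        Buffer* bufs[2] = {fresh.load(std::memory_order_acquire),
                           stale.load(std::memory_order_acquire)};
        for (Buffer* b : bufs) {
            if (!b) continue;
            for (std::size_t i = 0; i < b->cap; ++i)
                if (b->slots[i].load(std::memory_order_acquire) == key)
                    return true;
        }
        return false;
    }
};
```

The switch itself is a single atomic store of the new buffer pointer; the old buffer can be retired once no reader can still be probing it, which is of course the reclamation problem again in miniature.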


Why would you ever free an object if something else might be pointing at it?


>Why would you ever free an object if something else might be pointing at it?

Hi, welcome to programming /s

This is the crux of the entire issue for high-performance multi-threaded programs: how to tell when memory can safely be freed without introducing expensive synchronization?

The simple solutions are obvious:

1. Don't share memory

2. Use a lock

The real tricks are in forms of reference counting or garbage collection that avoid taking locks as much as possible. For really high performance you also need to avoid atomic operations and fences as much as possible. A "pause the world" garbage collector is easy (relatively) to get right and you can be confident it will work. Doing a concurrent pauseless GC is another matter entirely. Inserting a memory fence on every write is not great for performance.

For both GC and reference counting schemes, weak references are an additional wrinkle. Swift's original weak reference scheme solved this by inverting control and maintaining parallel reference counts: when the last strong reference was released it ran deinit and set a flag in the object header but otherwise left the memory allocated. Only once every weak reference was touched (or was itself deallocated) was the memory actually freed. That's trading greater high-water memory usage for performance. (I don't know off-hand why the implementation was changed, though.)


> 1. Don't share memory 2. Use a lock

3. Have a concept of ownership and make sure all your code follows some basic rules?

E.g. if a consumer reads a pointer from a lock-free queue, it is then responsible for freeing that memory. OP seems to be saying let some other thread free it on some racy assumption that the consumer will be done with it within N nanoseconds.

That's not how I would code it.


The problem is inside the lock-free queue. How do you remove something from a lock-free queue? Usually by reading a pointer to the queue node, then doing a CAS to remove the node from the queue. But another thread can remove that same node and free it before you can do the CAS!

If you're just using a lock-free queue, then the hardest part of the problem has been solved for you already. But the person who wrote the lock-free queue had to think about and solve these really tricky problems.
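For concreteness, here is a sketch of a Michael-Scott-style dequeue showing exactly where the problem bites (names are illustrative; the retry loop of a real implementation is elided):

```cpp
#include <atomic>

struct QNode {
    int value;
    std::atomic<QNode*> next;
};

struct Queue {
    std::atomic<QNode*> head;  // points at a dummy node

    bool try_pop(int& out) {
        QNode* h = head.load(std::memory_order_acquire);
        // Another thread can pop h and free() it right here; without a
        // reclamation scheme, the next line reads freed memory.
        QNode* n = h->next.load(std::memory_order_acquire);
        if (!n) return false;  // queue empty
        if (head.compare_exchange_strong(h, n)) {
            out = n->value;
            // h is unlinked, but other threads may still hold it: it
            // must be retired (hazard pointers, epochs), not free()'d.
            return true;
        }
        return false;  // lost the race; a real implementation retries
    }
};
```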


So how did they solve it?


There are a number of techniques for solving this:

    1. use a GC'd language
    2. hazard pointers
    3. epoch-based GC
    4. collect nodes when a user destroys the structure
       (user is required to guarantee that destruction is
       single-threaded).
There are probably others.


The tricky bit isn't making sure you don't free pointed-to objects; the tricky bit is knowing which objects can still be pointed to. A consequence of GP's point is that because there can be an arbitrary delay between reading a pointer and dereferencing it, any pointer ever exposed to other threads must be considered in-use indefinitely (the reader could sleep indefinitely, then wake up and deref) unless one can prove that the pointer is no longer held by any potential readers.

There are a few common ways to do that. Hazard pointers are one: every thread updates some per-thread pointer indicating what object it is reading, and other threads check this (for every live thread) before freeing an object. GC also solves the problem because any multithreaded GC must somehow examine live roots in every thread -- so it will see the still-live reference in the sleeping reader.
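A single-slot-per-thread sketch of that hazard pointer handshake (simplified; a real implementation also batches retired nodes and scans the hazard list in bulk):

```cpp
#include <atomic>

constexpr int kMaxThreads = 8;
std::atomic<void*> hazard[kMaxThreads];  // one published slot per thread

// Reader side: publish the pointer, then re-read the source to make
// sure it wasn't retired between the first load and the publication.
template <typename T>
T* protect(std::atomic<T*>& src, int tid) {
    T* p;
    do {
        p = src.load(std::memory_order_acquire);
        hazard[tid].store(p, std::memory_order_seq_cst);
    } while (p != src.load(std::memory_order_acquire));
    return p;  // safe to dereference until hazard[tid] is cleared
}

// Reclaimer side: an object may be freed only if no live thread has it
// published as a hazard.
bool can_free(void* p) {
    for (auto& h : hazard)
        if (h.load(std::memory_order_seq_cst) == p) return false;
    return true;
}
```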

The main point, though, is that it's harder than the lock-synchronized case because one no longer has the guarantee that a reader will hold the lock as long as it's examining the pointed-to object.


OK, sure. But now you have a system that has only a monotonically increasing memory footprint. In other words, it's leaking memory. This is undesirable because memory is finite.

This is the core issue behind a lot of the differences between competing programming languages. Languages like C and C++ allow you to manage memory on your own. If you can guarantee through your own program constraints that some piece of memory won't be referenced, then go ahead and free it if you need to. Other programming languages like Java, C#, Python, Javascript, etc. use a managed memory design. A runtime keeps track of memory usage and does "garbage collection" of memory that it can prove is not still in use. The overhead of the runtime and especially of garbage collection is a big sticking point in performance comparisons between such languages and C/C++. Other languages use still other techniques. Rust, for example, uses a "borrow checker".


I came to use an epoch-based reclamation approach where a token is passed between all threads in the system. Since this chain involves only relaxed reads/writes, it has practically zero overhead. The only downside is the relatively high latency between retiring and recycling the memory. That, and all participating threads must occasionally process their epoch.

I hadn't seen this exact approach used elsewhere, all the other epoch approaches were more complex and required atomic operations at certain points.
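My reading of that token-passing idea, as a hedged sketch (the details here are my assumptions, not the parent's actual code): a token circulates through the threads using only relaxed operations, and one full lap means every thread has passed through its handler, i.e. one grace period has elapsed:

```cpp
#include <atomic>
#include <vector>

struct TokenRing {
    std::vector<std::atomic<bool>> slot;  // slot[i] == true: thread i holds the token
    explicit TokenRing(int n) : slot(n) { slot[0].store(true); }

    // Called periodically by thread `tid`. Returns true when the token
    // was held and has been passed on; when the token makes it all the
    // way around, memory retired before the previous lap can be freed.
    bool pass(int tid) {
        if (!slot[tid].load(std::memory_order_relaxed)) return false;
        slot[tid].store(false, std::memory_order_relaxed);
        slot[(tid + 1) % (int)slot.size()].store(true, std::memory_order_relaxed);
        return true;
    }
};
```

The latency cost the parent mentions is visible here: reclamation waits for a full lap of the ring, so a stalled thread delays everyone's recycling.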


This is quiescent-state-based reclamation; other cheap implementations exist. I have seen the technique you're talking about used in K42 (after I had also thought I invented something novel, when I used the same approach as you :-P).


Doesn't the strict definition of QSBR require a thread to keep no references to shared data when it is in a quiescent state, whereas epoch-based reclamation allows threads to simply retain no references to older versions of shared data when it advances its epoch?


I'm referring to the FIFO implementation. On the definition of QSBR though, frankly, I think the term gets overloaded in some of the more recent literature in this area. Both EBR and QSBR require that no hazardous references get carried over across "protected sections" (EBR is explicit read-side protected sections and in QSBR, across a quiescent point).


I think it does fit that description, but for some reason none of the implementations I found were completely without atomic ops. They had at least barriers which aren't free on ARM.


Wouldn't you need acquire/release barriers around your loads and stores of the token (or alternatively acq/rel loads/stores)? Otherwise your other operations won't be ordered with regard to the epoch marked by the token.


The last time I had to write a lock-free algorithm (in this case, a multi-writer, multi-reader queue), the memory reclamation strategy was straightforward:

Make 2 lock-free queues. Pre-allocate all the memory into message-sized chunks, and post them to the "upwards" queue. A writer would read/consume from the upwards queue, write into the received buffer, and then post the same buffer in the other queue. Everything is passed by reference, so it is zero-copy.
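A sketch of that two-queue recycling scheme, using a minimal single-producer/single-consumer ring in place of the real lock-free queues (all names are illustrative):

```cpp
#include <atomic>
#include <cstddef>

// Minimal bounded SPSC ring; stands in for whatever lock-free MPMC
// queue the real design used.
template <typename T, std::size_t N>
struct SpscRing {
    T buf[N];
    std::atomic<std::size_t> head{0}, tail{0};

    bool push(T v) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;
        buf[t % N] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;
        v = buf[h % N];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

struct Msg { char data[64]; };

SpscRing<Msg*, 16> free_q;  // "upwards": empty buffers travel back
SpscRing<Msg*, 16> data_q;  // filled buffers travel to the reader
```

A writer pops a buffer from `free_q`, fills it, and pushes it to `data_q`; the reader does the reverse and returns the buffer to `free_q`. No buffer is ever freed, so there is nothing to reclaim.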

I guess I don't understand the reason for all the complexity. Perhaps for more complicated algorithms than queues?


If you preallocate memory, you're not reclaiming it, are you?


Sure, I'm reclaiming it for future writers.


Depending on your implementation (hard to say given the detail provided) you're quite possibly running into ABA-style problems with such an approach. If you use cmpxchg and rely on pointer identity it could be that you see the same pointer, but in its reused form...


What do you do when your queue is full?


Block. Lock-free doesn't imply infinite RAM.


Wait freedom does, though; the article is about both. In wait-free queues memory reclamation is the hard part in practice (most wait-free queues are actually much slower than their locked or lock-free counterparts).

Here is a great paper describing how the memory reclamation strategy was central to speeding up their implementation:

http://chaoran.me/assets/pdf/wfq-ppopp16.pdf


Huh. I expected this to be a discussion about the ABA problem; it probably would have been worth at least a mention.

edit ref: https://en.wikipedia.org/wiki/ABA_problem


They claim there are no lock-free or wait-free GCs. But Azul have had a pauseless one for the best part of a decade: http://www.artima.com/lejava/articles/azul_pauseless_gc.html


C4 (the GC implementation you are referring to) is not truly pauseless. It does an enormous amount of work with the assistance of the "user" threads and has a concept of local (instead of VM-wide) safepoints, but it still (rarely) pauses.

http://dl.acm.org/citation.cfm?id=1993491&dl=ACM&coll=DL

http://static.usenix.org/events/vee05/full_papers/p46-click....


So they are wrong to repeatedly call it pauseless? Or are you making a distinction between their implementation and their algorithm? Or are there different definitions of pauseless?


To quote their earlier paper:

"HotSpot supports a notion of GC safepoints, code locations where we have precise knowledge about register and stack locations [1]. The hardware supports a fast cooperative preemption mechanism via interrupts that are taken only on user-selected instructions, allowing us to rapidly stop individual threads only at safepoints. Variants of some common instructions (e.g., backwards branches, function entries) are flagged as safepoints and will check for a pending per-CPU safepoint interrupt. If a safepoint interrupt is pending the CPU will take an exception and the OS will call into a user-mode safepoint-trap handler. The running thread, being at a known safepoint, will then save its state in some convenient format and call the OS to yield. When the OS wants to preempt a normal Java thread, it sets this bit and briefly waits for the thread to yield. If the thread doesn't report back in a timely fashion it gets preempted as normal.

The result of this behavior is that nearly all stopped threads are at GC safepoints already. Achieving a global safepoint, a Stop-The-World (STW) pause, is much faster than patch-and-roll-forward schemes [1] and is without the runtime cost normally associated with software polling schemes. While the algorithm we present has no STW pauses, our current implementation does. Hence it's still useful to have a fast stopping mechanism."

To quote their later paper:

"The C4 algorithm is entirely concurrent, i.e. no global safepoints are required. We also differentiate between the notion of a global safepoint, where are all the mutator threads are stopped, and a checkpoint, where individual threads pass through a barrier function. Checkpoints have a much lower impact on application responsiveness for obvious reasons. Pizlo et al [16] also uses a similar mechanism that they refer to as ragged safepoints.

The current C4 implementation, however, does include some global safepoints at phase change locations. The amount of work done in these safepoints is generally independent of the size of the heap, the rate of allocation, and various other key metrics, and on modern hardware these GC phase change safepoints have already been engineered down to sub-millisecond levels. At this point, application observable jitter and responsiveness artifacts are dominated by much larger contributors, such as CPU scheduling and thread contention. The engineering efforts involved in further reducing or eliminating GC safepoint impacts will likely produce little or no observable result."


Thank you. My reading of that is that the algorithm is theoretically pauseless, but the implementation has a fixed sub millisecond pause that doesn't scale with heap size or allocation rate. So they could make the implementation truly pauseless, but it wouldn't make any practical impact because of other factors like CPU scheduling jitter dominating. So to all intents and purposes it is pauseless.


Here's what I found with some Googling, with a mix of strengths and weaknesses. Also, note that the one with fragmentation might not be a problem if you're using clustering with regular replication and fail-over. It depends on how quickly it fragments, but small, regular restarts can knock out lots of issues like that.

Wait-Free

http://www.non-blocking.com/download/Sun04_WaitFreeRef_TR.pd...

Lock-Free 1

http://www.win.tue.nl/~jfg/articles/CSR-04-31.pdf

Lock-Free 2

http://www.non-blocking.com/download/gidpst05_lockfreegc_tr....

Far as restarts, you being a Java guy on distributed systems might find the Micro-Reboots paper interesting too as it ties into above tip:

https://www.usenix.org/legacy/event/osdi04/tech/full_papers/...

https://radlab.cs.berkeley.edu/people/fox/static/pubs/pdf/j0...


I recall explaining to a guy that the lock free hashmap was very easy to use as long as you never free()'d anything you put in it, in which case I had some literature and code on Hazard Pointers for him to take a look at...


Great write-up! A few notes worth mentioning (IMO) below:

Garbage collection: This is only true in the absence of garbage collection. If you're already paying the garbage collection cost, this is not an issue. Also, note things such as http://web.mit.edu/willtor/www/res/threadscan-spaa-15.pdf

This only applies to dynamic lock-free data structures: Or, data structures requiring memory allocation. If you're using bounded buffers and don't require memory allocation, this isn't an issue.

Taxonomy: Not all passive schemes are quiescent-state-based. In QSBR, quiescent points are a first-class entity while that is not the case in EBR (you have only 3 distinct states). Absent extensions you are unable to differentiate one quiescent point from another, which has real implications on being able to implement things like deferred / non-blocking reclamation efficiently. There are some advantages to this: a while ago Paul Khuong contributed a reference counting scheme to epoch reclamation allowing for overlapping protected sections (bounded epoch makes reference counting practical, something you can't do with unbounded quiescent points). It's pretty great and, note, it doesn't require a strict notion of quiescence! You'll find this in Concurrency Kit.

It is also worth noting that HP, etc... (pointer-based schemes) can be used to implement QSBR.

For these reasons, I prefer to classify these techniques into two primary families (based on the semantics of the interface rather than the implementation): "passive schemes" and "active schemes". Passive schemes do not require cooperation from the algorithms requiring safe memory reclamation (EBR, QSBR, etc...) while "active schemes" do (HP, PTB, etc... in their textbook form require modification to the algorithm).

On blocking for passive schemes: It is worth noting that QSBR, EBR and other "passive schemes" do have "non-blocking" (informally) interfaces (rcu_call, text-book EBR utilizes limbo lists, etc...) such that writers do not have to block on the fast path, but of course, there is the cost of memory accumulating until a grace period is detected (so, not non-blocking in the formal sense but if you've the memory to spare and sufficient forward progress, becomes a non-issue). In my implementations, I typically use a dedicated thread for forward progress / epoch detection, ensuring the writer has forward progress.

You have much higher granularity and the lock-free progress guarantee in hazard pointers, because it is pointer-based (tied to the forward progress guarantees of the underlying algorithm).

Under-appreciated recent development: An interesting thing to note here is there are schemes such as https://www2.seas.gwu.edu/~seotest/publications/eurosys16ps.... which do not necessitate the same heavy-weight barriers in the presence of invariant TSC+TSO and with a sufficiently high granularity TSC, can provide the same forward progress guarantees as hazard pointers.

On hazard pointers being slow: One interesting thing to note is hazard pointers can also be used to implement "passive" schemes such as proxy collection, to give similar performance properties as EBR and QSBR.

On real world implementations: It is worth mentioning http://concurrencykit.org :-P


I'm having trouble understanding your point about GC. The post points out that GC solves some of these problems, at the expense of creating waits. Do you think there's something wrong with the way it describes the issue?


Not GP, but I think what he's getting at is that the GC cost is already paid and doesn't get paid again for reclaiming memory with lock-free algorithms.


The best way to do it: don't require memory reclamation in your design.

I coded a lock-free IPC interface from standard C to C#. Luckily, in my case the data I was moving originated on the stack. So all I do is copy from the stack to a shared IPC buffer and obviously keep using that same buffer until exit.

HOWEVER, if you're competent enough to handle raw threads and lock free design to begin with, you shouldn't have the slightest issue with queuing up your calls to free(). My workstation is old, but I still have 6 cores/12 threads @ 4.2 GHz. That's more than enough headroom to have a dedicated GC thread that has zero effect on performance, and a barely noticeable effect on memory.

Proper concurrency is not that difficult.


The difficulty isn't calling free, it's to know when it's safe to do so, without incurring significant overhead.


> Btw, even if you have a GC, no known GC is lock-free (much less wait-free)

There are lots of lock-free memory allocators for C (and C++).

They're not at all hard to write, I've written them myself.



