
Parallelism in 2021 should not be tightly coupled across threads if performance matters; the limitations of that model are well-understood. There is no way to make that comparatively efficient; the CPU cache waste alone ensures that. Nothing you can do with thread support in a programming language will be competitive with e.g. a purpose-built scheduler + native coroutines. That's right up against the theoretical limit of what is possible in terms of throughput, and it doesn't have any thread overhead. It does introduce the problem of load shedding across cores, but that's solved for all practical purposes.

I’ve been writing parallel code at the largest scales most of my career. The state-of-the-art architectures are all, effectively, single-threaded with latency-hiding. This model has a lot of mechanical sympathy with real silicon which is why it is used. It is also pleasantly simple in practice.



I don't understand -- isn't what you are suggesting single-threaded async code? That might be useful for servers, where you are mostly waiting on other things (like databases and networks), but in other places the point of parallelism is to get all your CPUs doing useful work, and there (in my experience, happy to be shown counterexamples) coroutines aren't very useful. You just want to blast a bunch of threads (or tightly coupled processes).


Yes, roughly single-threaded async, with each core running a disjoint subset of the workload on data private to that core. You can’t beat the operation throughput. The software architecture challenge is shedding load between cores, since this will hotspot under real workloads with a naive design. Fortunately, smoothly and dynamically shedding load across cores with minimal overhead and latency is a solved design problem.
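To make the shard-per-core idea concrete, here is a minimal Python sketch (illustrative only; the names `shard_of` and `shard_worker` are made up, and a real design would pin one event loop per physical core rather than use Python threads):

```python
import queue
import threading

NUM_SHARDS = 4  # one shard per pinned core in a real system

def shard_of(key):
    # Every key maps to exactly one shard, so shard state is never shared.
    return hash(key) % NUM_SHARDS

def shard_worker(inbox, merged):
    state = {}  # private to this shard: no locks, no cache-line contention
    while True:
        item = inbox.get()
        if item is None:  # shutdown sentinel
            break
        key, delta = item
        state[key] = state.get(key, 0) + delta
    merged.update(state)  # publish once, after the shard drains

def run(events):
    inboxes = [queue.Queue() for _ in range(NUM_SHARDS)]
    merged = {}
    workers = [threading.Thread(target=shard_worker, args=(q, merged))
               for q in inboxes]
    for w in workers:
        w.start()
    for key, delta in events:
        inboxes[shard_of(key)].put((key, delta))  # route to the owning shard
    for q in inboxes:
        q.put(None)
    for w in workers:
        w.join()
    return merged

counts = run([("a", 1), ("b", 2), ("a", 3)])
print(counts["a"])  # 4: both "a" updates landed on the same shard
```

Because a key's data lives on exactly one shard, no synchronization is needed during processing; the only coordination is routing at the front and merging at the end.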

It works pretty well for ordinary heavy crunch code too. I originally designed code like this on supercomputers. You do need a practical mechanism for efficiently decomposing loads at a fine granularity, but you rarely see applications that lack this property and are also compute bound.


Ah, I understand now. I misinterpreted "no thread overhead" as meaning "I'm not running things in multiple threads", like the current node.js/javascript obsession, where we just run code in one thread and use a bunch of async to "parallelise". Sorry!

I've (badly) written code like you describe -- usually by abusing fork to do my initial data structure setup, then using C pipes to pass jobs around. I suspect there are much better ways of doing it, but that parallelised well enough for the stuff I was doing. I'd be interested to know if there are good libraries (or best practices) for doing this kind of parallelism.
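One cleaner alternative to hand-rolled fork + C pipes is Python's `multiprocessing` module, which wraps the same primitives. A minimal sketch (the `square_worker`/`run_jobs` names are made up for illustration):

```python
from multiprocessing import Pipe, Process

def square_worker(conn):
    # Drain jobs from the pipe until the parent closes its end.
    while True:
        try:
            job = conn.recv()
        except EOFError:
            break
        conn.send(job * job)
    conn.close()

def run_jobs(jobs):
    parent_end, child_end = Pipe()
    worker = Process(target=square_worker, args=(child_end,))
    worker.start()
    child_end.close()  # the parent only uses its own end of the pipe
    results = []
    for job in jobs:
        parent_end.send(job)
        results.append(parent_end.recv())
    parent_end.close()  # EOF on the pipe tells the worker to exit
    worker.join()
    return results

if __name__ == "__main__":
    print(run_jobs([2, 3, 4]))  # [4, 9, 16]
```

This is the same fork-and-pipe pattern, but with serialization, cleanup, and cross-platform process startup handled by the library.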


I was also confused. I agree that touching raw threads is usually not the right thing to do, and chains of parallel coroutines are one of the good abstractions. It’s crazy how few languages have easy access to that very basic abstraction.


Do you know of any resources to learn more about this approach?


There is not a meaningful semantic difference between what you're describing and what tools like rayon provide (and BTW, threads do just fine when pinned to a core and appropriately managed as they should be in large data processing workloads). Whether threads are used on the backend is largely a distraction, you still have to write things roughly the same way to create correct code (for example, you cannot share memory between tasks on different cores, or on different nodes, without synchronizing somehow).
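The "write it the same way regardless of backend" point can be sketched in a few lines of Python (a rough stand-in for what rayon-style libraries do; `partial_sum` and the chunking scheme are made up for illustration): each task owns a disjoint chunk, and the only synchronization is the final join.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each task works only on its own chunk; no memory is shared between tasks.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(n=100, shards=4):
    data = list(range(n))
    chunks = [data[i::shards] for i in range(shards)]  # disjoint slices
    with ProcessPoolExecutor(max_workers=shards) as pool:
        # The only "synchronization" is collecting results at the join point.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum_of_squares())  # 328350, the sum of squares 0..99
```

Whether the workers are pinned threads, processes, or coroutines, the program has the same shape: partition, compute independently, merge.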


Thanks. What is latency hiding?


There are many operations on data that are relatively slow from a CPU's perspective: filling a cache line, page faulting, maintaining cache coherency, acquiring a lock, waiting on I/O to complete, etc. All of these add latency. In conventional software, when these events occur, execution simply stalls, possibly triggering a context switch (which is very expensive). In many types of modern systems, these events are extremely frequent.

Latency hiding is a technique where 1) most workloads are trivially decomposed into independent components that can be executed separately and 2) you can infer or directly determine when any particular operation will stall. There are many ways to execute these high-latency operations in an asynchronous, non-blocking way so that you can immediately work on some other part of the workload. The "latency-hiding" part is that the CPU is rarely stalled: it always switches to a part of the workload that is immediately runnable, if possible, so that it is almost always doing real, constructive work. Latency hiding optimizes for throughput, maximizing utilization of the CPU, but it can increase the latency of specific sub-operations by reordering the execution schedule to "hide" the latency of operations that would stall the processor. For many workloads, the latency of the sub-operations doesn't matter, only the throughput of the total operation. The real advantage of latency-hiding architectures is that you can approach the theoretical IPC of the silicon in real software.
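The effect is easy to demonstrate with Python's asyncio, using a sleep as a stand-in for any high-latency operation (the `fetch`/`run_batch` names are made up; real systems would overlap memory or I/O stalls, not timers):

```python
import asyncio
import time

async def fetch(i):
    # Stand-in for a high-latency operation (I/O, remote call, page fault).
    await asyncio.sleep(0.1)
    return i * i

async def run_batch(n):
    # All n waits overlap: whenever one task stalls, another one runs.
    return await asyncio.gather(*(fetch(i) for i in range(n)))

start = time.perf_counter()
results = asyncio.run(run_batch(10))
elapsed = time.perf_counter() - start

print(results[3])     # 9
print(elapsed < 0.5)  # True: ten 0.1s waits overlap into ~0.1s, not 1.0s
```

Each individual `fetch` still takes 0.1s (its latency is unchanged, possibly worse), but total throughput is roughly 10x because the waits are hidden behind each other.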

There are exotic CPU architectures explicitly designed for latency hiding, mostly used in supercomputing. The Cray/Tera MTA architecture is probably the canonical example, as is the original Xeon Phi. As a class, latency-hiding CPU architectures are sometimes referred to as "barrel processors". In the case of the Cray MTA, the CPU can track 128 separate threads of execution in hardware and automatically switch to a thread that is immediately runnable at each clock cycle. Thread coordination is effectively "free". In software, switching between logical threads of execution has to be driven by inference rather than hardware, but it often sees huge gains in throughput. The only caveat is that you can't ignore tail latencies in the design: a theoretically optimal latency-hiding architecture may defer execution of an operation indefinitely.
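A barrel processor's round-robin scheduling can be simulated with generators, where each `yield` marks a point where a real hardware thread would stall (a toy model only; `hw_thread` and `barrel_run` are invented names, and real barrel processors switch every cycle in hardware):

```python
from collections import deque

def hw_thread(work_units):
    # Yield wherever a real thread would stall (e.g. on a memory load).
    total = 0
    for unit in range(work_units):
        total += unit
        yield  # hand the "core" back to the scheduler
    return total

def barrel_run(threads):
    # Round-robin over runnable threads, one slice per "cycle", MTA-style.
    runnable = deque(threads)
    results = {}
    while runnable:
        name, gen = runnable.popleft()
        try:
            next(gen)                     # execute one slice of this thread
            runnable.append((name, gen))  # still runnable: back of the line
        except StopIteration as done:
            results[name] = done.value    # this thread has retired
    return results

finished = barrel_run([("a", hw_thread(3)), ("b", hw_thread(5))])
print(finished)  # {'a': 3, 'b': 10}
```

The scheduler never waits on a stalled thread; it simply moves to the next runnable one, which is the whole trick.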


> There are exotic CPU architectures explicitly designed for latency hiding, mostly used in supercomputing.

I don’t know much about supercomputers, but what you described is precisely how all modern GPUs deal with VRAM latency. Each core runs multiple threads, and the count is bound by resources used: the more registers and group shared memory a shader uses, the fewer threads of that shader can be scheduled on the same core. The GPU then switches threads instead of waiting out the latency.

That’s how GPUs can saturate their RAM bandwidth, which exceeds 500 GB/second in modern high-end GPUs.


Is there some tool like ps/top/time that I can use to measure how much “constructive” work my CPU spends doing?



Latency hiding is a way to substantially increase throughput by queuing massive numbers of requests or tasks while waiting on expensive resources (e.g. main memory access can have latency in the hundreds of cycles on GPUs). By scheduling enough tasks or requests that can be dispatched in parallel (or very soon after one another), after an initial delay you may be able to process the queued requests very quickly (possibly one per clock cycle or two), providing similar overall performance to running the same set of tasks sequentially with very low latency. However, such long pipelines are very prone to stalling, especially if there are data dependencies that prevent loads from being issued early, so getting maximum performance out of code on an architecture that heavily exploits latency-hiding techniques can require a lot of very specific domain knowledge.
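The data-dependency point can be shown directly: a chain where each load needs the previous result cannot overlap, while independent loads can all be issued up front. A small asyncio sketch (a timer stands in for the slow resource; the function names are invented):

```python
import asyncio
import time

async def load(x):
    await asyncio.sleep(0.05)  # stand-in for a slow memory access
    return x + 1

async def dependent_chain(n):
    # Each load consumes the previous result, so none can be issued early.
    v = 0
    for _ in range(n):
        v = await load(v)
    return v

async def independent_batch(n):
    # No data dependencies: all n loads are queued up front and overlap.
    return sum(await asyncio.gather(*(load(i) for i in range(n))))

t0 = time.perf_counter()
asyncio.run(dependent_chain(8))
chained = time.perf_counter() - t0

t0 = time.perf_counter()
asyncio.run(independent_batch(8))
batched = time.perf_counter() - t0

print(batched < chained)  # True: only the dependent chain stalls the pipeline
```

The batch finishes in roughly one load's latency; the chain pays the full latency eight times, which is exactly the stall behavior described above.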


You are mistaking parallelism for concurrency.


> purpose-built scheduler + native coroutines

"Threads" are nothing but kernel-mode coroutines with purpose-built schedulers in the kernel.

Redoing the same machinery except in usermode is not the way to get performance.

The problem is that scripting languages don't let you access the kernel APIs cleanly due to various braindead design decisions - global locks in the garbage collector, etc.

But the solution isn't to rewrite the kernel in every scripting language, the solution is to learn to make scripting languages that aren't braindead.



