
It came about because of Linux's intrinsically unpredictable and unreliable memory management. Even when you disable overcommit, Linux can still easily get into situations where the OOM killer is triggered. Similarly, OOM scores won't reliably and consistently result in the desired process being killed.

Vendors like Google and Facebook have for years created their own userspace OOM killers that attempt to stay one step ahead of the kernel's OOM killer in order to improve consistency and reliability. PSI is just the latest in a long series of features over the years that companies have contributed for the benefit of userspace OOM killers.

There's a similar issue with Linux's aggressive buffer cache. Even if you disable overcommit, and even if there's technically uncommitted (free) memory remaining, you can easily bring a Linux system to a halt with a combination of I/O and memory contention. See, e.g., https://lkml.org/lkml/2019/8/4/15. The hanging issue can even lead to the OOM killer kicking in when Linux's heuristics determine that memory eviction isn't progressing quickly enough. And, yes, even if you use memory cgroups, you can still hit these situations. At scale and under load you will see these issues regularly. It's a nightmare.

Outside Linux, the enterprise solution to OOM situations was to write robust applications that could handle malloc and mmap failure, naturally responding to memory pressure and allowing them to reliably and deterministically maintain state and QoS. Solaris doesn't do overcommit, for example. Neither does Windows, at least not by default--overcommit-like behavior is opt-in there--so robust applications don't need to fear being shot down by devil-may-care memory hogs.

To reiterate, even if you disable overcommit in Linux, the fact of the matter is that many aspects of the Linux kernel were designed and implemented with the assumption of overcommit and loose accounting.

There are various ways to deterministically handle the related buffer cache and I/O contention issue. The simplest is to keep buffer cache and anonymous memory separate so one doesn't intrude on the other. But of course that's less than ideal from an efficiency and performance perspective. Another is to have a [very] sophisticated I/O scheduler that can integrate memory and I/O resource accounting and prioritization together so you can get deterministic worst-case behavior, at least for your most important processes. A partial solution might be to strictly provision memory for disk-mapped executables. But in any event Linux doesn't provide any of these. Any particular application could theoretically implement these itself, but unless all applications do, the kernel can still get wedged.

PSI is just another band-aid. It can improve things dramatically, at least if you have the time and capability to write your own userspace OOM killer and tune it to your particular workloads. (There are no generic solutions--if there were, the existing OOM killer would be good enough.) But when it comes to resource accounting, Linux is basically broken by design. It's the price to be paid for its performance and rapid evolution.


