Copy-on-write friendly Python garbage collection

ComputerGuru · on Jan 1, 2018

Not directly pertaining to the article but in relation to the comments here, I feel it’s necesary to remind that the real problem with GC isn’t average processing time/max simultaneous requests but rather the large standard deviation resulting from GC pauses (and not to mention the poor cache friendliness, memory evictions, and more as a result - but those are taken into account in the the average run times so we can ignore them in this context).

As a related metric, sometimes it’s not your average response times and scale that matters but how much headroom you’ve got for a sudden influx of traffic. If memory/pagefile thrashing and hard faults bring your system screeching to a halt, you won’t be happy no matter your scale when you can’t benefit from the eyeballs and pageclicks coming your way.

rbehrends · on Jan 2, 2018

Eh, incremental garbage collection is technology that has been around since the 1980s/1990s. For purely sequential systems (which is what we're talking about here), it is pretty easy to make GC soft real-time [1]. We are talking about maximum pause times well below just the network latency between the client and the web server.

Cache friendliness is a double-edged sword. There are plenty of use cases where a GC can be more cache-friendly than manual memory management, in particular if you have a generational GC with a bump allocator.

[1] What makes life (considerably) harder is when you have multiple threads sharing a heap, but that's not at issue here.

CoolGuySteve · on Jan 2, 2018

If it’s pretty easy, where are all these soft real-time garbage collectors?

rbehrends · on Jan 2, 2018

Note that I was specifically talking about the sequential case, not multi-threaded languages.

OCaml has had one since the 1990s [1]; it's a fairly standard generational, compacting collector with incremental collection for the old generation. Lua has had an incremental GC since version 5.1 (in 2006); as it's frequently used as a scripting language for video games, it's safe to assume that pause times aren't much of an issue.

The problem is that virtually every language since then has pretty much decided to have all threads use a global shared heap. Once you do that, you run into all kinds of challenges, such as root discovery from thread stacks without stopping the world. That said, there are plenty of languages that have this option, anyway; it's simply more challenging, not impossible.

Languages that are single-threaded maintain thread-local heaps don't have the problem. Python and Ruby (unlike Lua) have issues for historical reasons (they started out with basic reference counting and mark-and-sweep collection, respectively, and then had to maintain backwards compatibility [2]).

Intermediate designs (having both thread-local heaps and shared heaps at the same time) are also possible, but that design space hasn't been explored much.

[1] http://prl.ccs.neu.edu/blog/2016/05/24/measuring-gc-latencie...

[2] I think that in principle one could make the cycle detector in Python incremental (it's basically a form of trial deletion); Ruby eventually got an incremental GC for its major generations in 2.2, but I believe there are still some inherent limitations due to the lack of write barriers in C code.

tychver · on Jan 2, 2018

Running T-Mobile's Erlang powered LTE network.

the8472 · on Jan 2, 2018

Golang, Azul's C4, Shenandoah, Metronome, ...?

jacobolus · on Jan 1, 2018

Moreover, in a distributed systems context where many machines are involved in each task some 'rare' issues can end up happening a large proportion of tasks, and solitary problems can cascade and cause positive feedback loops that bring your whole system down.

IgorPartola · on Jan 1, 2018

A more general question: why is it so typical for a Python we worker to have its memory grow infinitely with each request processed? I don’t expect the kind of performance I can get out of C with its ability to only allocate memory on the stack or to pre-allocate all the buffers I’d need. And I have found a number of memory leaks in Python libraries and especially poorly written Python C modules over the years. But still I would expect something like a simple ideomatic Django application to not have its memory footprint grow idefinitely. Is this just life now? Are there any good tools out there for figuring out what part of the application keeps requesting objects that don’t get destroyed? Are corporations people?

sametmax · on Jan 1, 2018

> why is it so typical for a Python we worker to have its memory grow infinitely with each request processed?

It's not. The cases where it happened to me during the last 10 years and dozens of web projects spanning on multiple frameworks were generally me doing something stupid:

- using a mutable object as default value or in a class attribute

- doing something in __del__

- keeping DEBUG=True for Django (which is very well known for causing memory leaks)

> Is this just life now?

Nope. I have big Django apps running (like 500k users/day streaming video sites) and it doesn't happen.

> Are there any good tools out there for figuring out what part of the application keeps requesting objects that don’t get destroyed?

Yes for course: http://tech.labs.oliverwyman.com/blog/2008/11/14/tracing-pyt...

IgorPartola · on Jan 1, 2018

I found pdb to be useful when I wrote my own code from scratch. I found it useless when using a large app on top of Django. Oh really? I have a lot of dicts and tuples? Who knew. I would rather see something like a count of object allocated but not released at the end of each request cycle and their reference graph. Because when I have found memory leaks it has had to do with some kind of cyclic dependencies or creating objects that some kind of long lived object like a database adapter wanted to hold onto.

sametmax · on Jan 1, 2018

You got muppy for that: https://pythonhosted.org/Pympler/muppy.html

Also objgraph: https://www.darkcoding.net/software/finding-memory-leaks-in-...

If you are lucky to use Python 3.6+, you got tracemalloc: https://docs.python.org/3/library/tracemalloc.html

IgorPartola · on Jan 1, 2018

Thanks. Those do look excellent.

ubernostrum · on Jan 1, 2018

It's hard to call DEBUG in Django a "memory leak" when the "leak" is documented behavior -- when DEBUG=True, Django keeps a running per-connection log in memory of database queries issued. For, y'know, debugging :)

This does cause memory use to increase with the number of queries issued, but a debug log is not really what I'd call a leak (and genuinely leaking memory in Python takes some effort).

ComputerGuru · on Jan 1, 2018

Who keeps a running, unbounded log of all connections in memory?

I’m not a pythonista, so please forgive my ignorance here, but: is this just the default configuration for Django? Does it ship out-of-the-box with adapters to log that to MySQL or similar instead? Why does it not log the last n connections for some sane (and configurable) default value of n instead?

traverseda · on Jan 1, 2018

With django, you don't need to set up a webserver or mysql. You can just run `./manage.py runserver` and it will run in it's own webserver and use sqlite.

That is what the "debug=True" variable does. Sets things up for running in a minimal development environment. Another feature? It restarts the server when any source files change.

Suffice to say, it's really not meant to be run in anything like production.

https://docs.djangoproject.com/en/dev/ref/settings/#debug

>Never deploy a site into production with DEBUG turned on.

>Did you catch that? NEVER deploy a site into production with DEBUG turned on.

>Still, note that there are always going to be sections of your debug output that are inappropriate for public consumption. File paths, configuration options and the like all give attackers extra information about your server.

>It is also important to remember that when running with DEBUG turned on, Django will remember every SQL query it executes. This is useful when you’re debugging, but it’ll rapidly consume memory on a production server.

ComputerGuru · on Jan 1, 2018

I don’t know. If I were designing that system, I’d note that most queries would be extremely similar and would benefit insanely from being compressed in situ or would (given were talking DB here) be a perfect fit for normalization.

You mentioned SQLite - a second “querylog.db” with a table of queries and a table of instances (query time, parameters, whatever) would be an obvious option (at the expense of slightly exaggerating your db response times, but I’m sure debug mode is already slower as is).

traverseda · on Jan 1, 2018

When running in debug mode, generally the server is restarted every few minutes at worst. As the app is under active development and the server restarts whenever the app changes.

The reason it logs those queries is just for debugging. There are a lot of different ways they could do it, but they don't matter because you can just turn that feature off, or set it up to log in a different way, or whatever else.

Logging to a DB would be worse, because then it stores a bunch of useless queries that all look very similar. That logging is used for debugging, and wouldn't be something you'd want to use in production.

Why hasn't anyone already done this? Because it's trivia, not an actual problem that needs fixing.

ubernostrum · on Jan 1, 2018

As others have noted, the behavior is there for use in lighter-weight debugging/dev situations when you want to interact with the code and also have the ability to inspect what it's doing in detail (either by directly poking things from inside a Python interpreter, or through plugins like the debug toolbar). Trying to design a hyper-efficient compressed query storage mechanism would be massive overkill for such a thing. The simplest thing that could possibly work for this is an in-memory list of the queries as strings, so that's what it does.

jsmeaton · on Jan 1, 2018

Since no one has mentioned yet, it IS bounded to the last 9000 queries by default.

bjt · on Jan 1, 2018

> Who keeps a running, unbounded log of all connections in memory?

It does that only when DEBUG=True, which is not the recommended production configuration.

pwaai · on Jan 2, 2018

> Nope. I have big Django apps running (like 500k users/day streaming video sites) and it doesn't happen.

Mind sharing your stack? I'm currently torn between Laravel and Django. It won't be the first time I've used an MVC framework (CodeIgniter, some Rails) but I like working with Python. However, Laravel also seems to be heavily used. I don't have anything against PHP but wanting to compound my familiarity with Python is one big reason for leaning towards Django...

sametmax · on Jan 2, 2018

Varnish, nginx, gunicorn, Django, redis, celery, postgres most of the time.

Sometime a bit of crossbar.io.

Laravel is a nice framework. I prefer it to symfony, espacially since they love Vue.js which is now my fav front end lib.

But the definitive advantage of using Python is that you get a much more versatile language than PHP. Python is not just good for the Web. It's heavily used for scripting, automation, sysadmin, data analysis, GUI, pentesting, etc.

People don't just use Python with HTTP in mind. You find it at Apple, NASA, at Sony, Fedora, in the state of Geneva, in french schools, in 3D (Maya, Blender) or geography (most GIS)...

Basically, you get a tool that you can use for a lot more stuff, because the ecosystem has been developed for a lot more types of tasks.

mycelium · on Jan 2, 2018

My default stack is almost exactly yours, except

> gunicorn

Any particular reason you use it over uwsgi? I last did real research in 2013 and have not revisited since then.

sametmax · on Jan 2, 2018

Ease of use. Uwsgi is faster, but it's not the place in my stack where i need speed. The cost of transforming nginx packets in wsgi object is very small compared to all the rest.

alexnewman · on Jan 2, 2018

I use uwsgi when I wanna use the whole box. Now with k8s use gunicorn

misterbowfinger · on Jan 2, 2018

>- using a mutable object as default value or in a class attribute

Python noob here. Can you expand on that a little? Or perhaps link a post? I can't think of an example or how it'd impact memory

nitely · on Jan 2, 2018

Here[0]. Mutable objects as default values create a single instance of the object. For iterables (list, set, etc) the behaviour is particularly surprising coz it accumulates/memoize every item ever added (i.e a list that grows/leaks with every call to the function). If you use a good IDE (like pycharm), it will warn you about this stuff.

[0] http://docs.python-guide.org/en/latest/writing/gotchas/

hamandcheese · on Jan 2, 2018

It would only lead to a memory leak depending on your usage, but it’s pretty common gotcha that can be annoying with or without memory issues.

The following is some pretty good reading on the matter, and the example given is also one that would lead to a memory leak:

https://stackoverflow.com/questions/1132941/least-astonishme...

kasabali · on Jan 1, 2018

> A more general question: why is it so typical for a Python we worker to have its memory grow infinitely with each request processed?

I can't talk about general situation but in this specific case it's because they've turned off GC previously, so growing infinitely was expected. Details are in the first blog post they've linked from the OP.

IgorPartola · on Jan 2, 2018

Notice that even with the GC on, even with the modified GC, the memory footprint doesn’t stabilize. It just grows slower.

Scaevolus · on Jan 1, 2018

The default Python allocator experiences a lot of fragmentation over time, leading to sparse memory packing.

Dropbox implemented a custom allocator with different types and sizes getting their own arenas, leading to long term memory stability. Even controlling all their own code, processing millions of file paths, metadata, and buffers inevitably lead to memory leaks.

lathiat · on Jan 2, 2018

Aaron Patterson recently talked about doing similar work for Ruby, with good technical details - he's quite good at explaning these kinds of issues for people less familiar with them. Explains how and why you want to be copy-on-write friendly and what that means for Ruby's GC - and the impacts of his work.

I watched the RubyConf AU version: https://www.youtube.com/watch?v=nAEt36XNtAE&t=2482s

But looks like there was a version at Rubyconf as well that may or may not be better/different: https://www.youtube.com/watch?v=8Q7M513vewk

kasabali · on Jan 2, 2018

I've recently read their previous post, now this one and I really enjoyed reading both of them. I appreciate their practical approach and solving their issues with minimal changes.

On the other hand, I am disappointed that this change goes into upstream Python instead of properly solving the problem by making reference count implementation really CoW friendly so that all applications would benefit from it without the need for a careful use of a special function.

eeks · on Jan 2, 2018

Managing large memory objects, shared or not, is a very commmon problem with garbage-collected languages. The solution of “hiding” the region from the GC has been around for quite some time for languages such as Java, C#, and even OCaml. That being said, it’s a very enjoyable little write up.

vosper · on Jan 1, 2018

When exactly is this new GC behavior useful? When forking processes? Is it something we'd see used in the multiprocessing or concurrent modules?

bpicolo · on Jan 1, 2018

Memory is typically the scaling bottleneck for Python workers for web servers. One way to cut back drastically on this is to load as much as possible in startup and then fork for request serving processes. The problem is that python refcounting causes a lot of copy-on-writes for data that's really just used as read-only data.

This change allows you to run more workers with less memory.

Instagram has had a few good posts about how they've approached this problem, here's another: https://engineering.instagram.com/dismissing-python-garbage-...

sametmax · on Jan 1, 2018

> Memory is typically the scaling bottleneck for Python workers for web servers.

Wut ?

When I have performances problems, my RAM is full with varnish cache, redis stored objects and postgres buffer.

Python is like, low, low on the list.

Again, most people are NOT instagram or Google. They have a very atypical load.

bpicolo · on Jan 1, 2018

That's latency, not bandwidth. This is to help scale out concurrent request counts per box, rather than time to serve a single request (though obviously with request queuing, failure to serve requests fast enough makes concurrency a latency bottleneck).

And sure, most people aren't instagram. This is a python-at-big-scale problem, when the cost benefits of fitting more requests onto fewer servers actually matter.

One of the biggest reasons this actually makes a difference is that, in SoA, internal API scaling against your monolith can become expensive.

ComputerGuru · on Jan 1, 2018

> Again, most people are NOT Instagram or Google

And, again, can we stop with these pointless comparisons? Past 1 front end server the only cost that matters is how many requests your one node can handle and how much that one node costs. If you’re not a VPS and you have your frontend http cache correctly configured, then it doesn’t matter how much smaller than Google you are, comparisons are valid (although, of course, the fewer servers you have the more you can afford to spend on them; though you probably aren’t making as much money as IG/FB/Google either...).

sametmax · on Jan 1, 2018

The only metrics is economic. The question is how much does those performance constrains cost you ?

ot · on Jan 1, 2018

Yes, when you fork all the pages in the address space are referenced, and copied when modified. However if you have a GC it will go around touching objects in the child process, and there's a good chance it will end up touching a few bytes in each page, nullifying the benefits of COW.

rurban · on Jan 3, 2018

Exactly. That's why you normally don't call refcounting a GC, as it writes to objects on referencing, and thereby destroys COW. With Perl it's even more stupid, they write to the const strings also, not just the refcount. And still call it COW.

I call it write on assign, WOA. A normal GC leaves it alone and does not force fresh memory maps for these.

ot · on Jan 3, 2018

It's not just refcounting, GCs also normally have some sort of bookkeeping that is updated in the background, for example a mark&sweep GC will need to mark the objects. See the comment about the object header in the OP.

rurban · on Jan 3, 2018

This GC bookkeeping cannot be compared to the refcounting madness. A GC needs 1-2 bits during the GC, but does not disturb COW pages at all during run-time. It also has no word overhead, these 2 bits easily fit into every object. Also a GC compacts the live heap, which makes all upcoming accesses much faster. Well, not Mark & Sweep, but a compacting GC.

masklinn · on Jan 1, 2018

> When exactly is this new GC behavior useful? When forking processes?

When forking a process and performing lots of allocations in the parent process (e.g. lots of imports & objects creation before forking) and doing so on a CoW-forking OS (but I'm guessing that's most modern POSIX systems).

sigmonsays · on Jan 1, 2018

One web request should not be served by one process at a time. This is a terrible software architecture which is just the fault of python. I love python, but it is not suitable for backend work at scale. Its simply too costly in terms of resources required. While its interesting to see improvements in the GC, it also makes equal if not more sense to migrate away from python. Just my 0.02$

sametmax · on Jan 1, 2018

> One web request should not be served by one process at a time

It's doesn't have to be. You can use asyncio and deal with many requests at a time by one process.

> I love python, but it is not suitable for backend work at scale.

What's at scale ?

Most projects I work on IRL, including in banks and administrations, never reach the level of scale this even remotely a problem. Too many people think they are Google.

ComputerGuru · on Jan 1, 2018

Do you mean intranet bank apps? Or public facing sites?

sametmax · on Jan 1, 2018

It doesn't matter. In France, there are 70 millions people. Remove old and young ones, and the ones that are not clients in your bank. Then spread the usage on the week, then on 24h, divide by the number of requests that actually hit your Python backend (so no cache, no static files, etc). You got what ? 1000 requests/s top ? That's nothing.

Bank sites are not youtube.

Besides, operations on your bank account don't even hit the Python backend, but a dedicated system. Usually some COBOL dinosaure they froze, wrapped into a Java service and exposed through a RESTish API so that the rest of their system can use it without ever having to touch it again.

figgis · on Jan 2, 2018

And if you were really wanting to get as much performance as you wanted out of python there's plenty of options like epoll...

danso · on Jan 1, 2018

How many systems are people working at that have greater scale than Instagram?

ComputerGuru · on Jan 1, 2018

Net connections means nothing. How many servers does IG dedicate to this task? Compare IG’s server budget with another company’s and then divide by the number of servers. Not so unimaginably, out-of-reach big any longer.