This is a great post. I like how they walked through all the steps and especially the "perf" tool.<p>Ruby has a patch to do the same thing -- increase sharing by moving reference counts out of the object itself:<p>Index here:<p><a href="http://www.rubyenterpriseedition.com/faq.html#what_is_this" rel="nofollow">http://www.rubyenterpriseedition.com/faq.html#what_is_this</a><p>First post in a long series:<p><a href="http://izumi.plan99.net/blog/index.php/2007/07/25/making-rubys-garbage-collector-copy-on-write-friendly/" rel="nofollow">http://izumi.plan99.net/blog/index.php/2007/07/25/making-rub...</a><p>I think these patches or something similar may have made it into Ruby 2.0:<p><a href="http://patshaughnessy.net/2012/3/23/why-you-should-be-excited-about-garbage-collection-in-ruby-2-0" rel="nofollow">http://patshaughnessy.net/2012/3/23/why-you-should-be-excite...</a><p><a href="https://medium.com/@rcdexta/whats-the-deal-with-ruby-gc-and-copy-on-write-f5eddef21485#.10aa2bnnw" rel="nofollow">https://medium.com/@rcdexta/whats-the-deal-with-ruby-gc-and-...</a><p>The Dalvik VM (now replaced by ART) also did this to run on phones with 64 MiB of memory:<p><a href="https://www.youtube.com/watch?v=ptjedOZEXPM" rel="nofollow">https://www.youtube.com/watch?v=ptjedOZEXPM</a><p>I think PHP might do it too. It feels like Python should be doing this as well.
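CPython never did move refcounts out of the object header, but it did eventually get a related CoW-friendly knob: `gc.freeze()` (added in Python 3.7, motivated by exactly this fork-and-share workload), which moves everything tracked so far into a permanent generation so the cycle collector never touches, and therefore never dirties, those pages. A minimal sketch of the pattern (POSIX-only because of `os.fork`; `shared_config` is a made-up stand-in for real preloaded data):

```python
import gc
import os

# Long-lived shared data, loaded in the master before forking.
shared_config = {i: str(i) for i in range(100_000)}

gc.disable()   # avoid a collection sneaking in between freeze and fork
gc.freeze()    # move everything tracked so far into the permanent
               # generation, so the cycle collector never touches
               # (and never dirties) those pages again

pid = os.fork()
if pid == 0:
    gc.enable()              # the child collects only post-fork garbage
    _ = shared_config[42]    # reads stay on pages shared with the master
    os._exit(0)

os.waitpid(pid, 0)
gc.enable()
```

Note this only keeps the *collector* off the shared pages; plain refcount updates (`Py_INCREF`/`Py_DECREF`) still write into the object headers, so it is a partial fix compared to the Ruby approach.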
It's basically "cheating" at GC by exploiting a very narrow use case. I saw a trick like this at Smalltalk Solutions in 2000 with a 3D game debugging tool. The "GC" actually simply threw everything away for each frame tick.<p>Someone needs to come up with something like a functional language based on a trick like this. Or maybe a meta-language akin to RPython, so people can write domain specific little languages for doing things like serving web requests, combined with domain specific "cheating" GC that can get away with doing much less work than a full general purpose GC.<p>Couldn't a pure functional programming environment be structured to allow for such GC "cheating?"
I find Instagram's engineering blog really good (I especially like their content on PostgreSQL), and it seems like they implemented a solid solution to a problem they were facing.<p>That being said, I wonder if their team considered adopting a different language designed to work without GC overhead. I'm all for working with something you're familiar with, but they seem to have hit the point where they know the problem surface area well enough to start optimizing for more than a 10% gain by turning off a selling point of safer languages.
Nice. I worked on something like this at an internship. I wrote a Unicorn-like preload-fork multiprocess server in Ruby (for other reasons).<p>I realized that the workload (which involved a large amount of long-lived static data on the heap) would have seen enormous memory savings, if only we weren't running with Ruby 1.9's mark-and-sweep GC algorithm that marked every object during the mark phase.<p>I briefly experimented with turning off GC and periodically killing workers. Thankfully, in <i>that</i> situation, all we actually had to do was upgrade to Ruby 2.2, which does have a proper CoW-friendly incremental GC algorithm.<p>`fork` is awesome.
One of their issues was that Python runs a final GC call before process exit. Why <i>does</i> Python run that final GC call if the process is exiting anyway?
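For context, that final collection is part of interpreter finalization (`Py_FinalizeEx` walks the heap so destructors and cleanup code fire before teardown). The escape hatch is `os._exit`, which skips finalization entirely; a small demonstration of the trade-off (note it also skips `atexit` hooks, so flush anything important first):

```python
import subprocess
import sys

# sys.exit() goes through full interpreter finalization: atexit hooks,
# module teardown, and a final pass over the heap.
normal = subprocess.run(
    [sys.executable, "-c",
     "import atexit, sys;"
     "atexit.register(lambda: print('cleanup ran'));"
     "sys.exit(3)"],
    capture_output=True, text=True)

# os._exit() lets the kernel reclaim the pages wholesale: no final GC,
# no destructors, no atexit -- which is the point for a worker whose
# entire address space is about to vanish anyway.
fast = subprocess.run(
    [sys.executable, "-c",
     "import atexit, os;"
     "atexit.register(lambda: print('cleanup ran'));"
     "os._exit(3)"],
    capture_output=True, text=True)

print(normal.stdout.strip())   # cleanup ran
print(fast.stdout)             # (empty: atexit was skipped)
```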
Is it just me, or does this look like the typical short-term hack that will blow up in your face pretty quickly and turn your life into a constant stream of low-level tinkering?<p>I suppose the people at Instagram didn't just stop there, but are also planning a longer-term solution to optimizing their stack (i.e. migration to a more performant language).
> Instagram can run 10% more efficiently<p>Seems quite risky/costly for a mere 10% computational efficiency gain. If you're going to change the memory model of a programming language, you might as well shoot for a <i>10x</i> improvement instead of 10%.
Fun fact: Lisp originally had no GC. It just allocated and allocated memory till there was none left, and then it died, after which the user dumped their working heap to tape, restarted Lisp, and loaded the heap back from tape. Since only the "live" objects were actually written, the heap took up less memory than before and the user could keep going.
<i>Instagram’s web server runs on Django in a multi-process mode with a master process that forks itself to create dozens of worker processes that take incoming user requests.</i><p>So this is all a workaround for Python's inability to use threads effectively. Instead of one process with lots of threads, they have many processes with shared memory.
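That preload-then-fork pattern can be sketched in a few lines (a toy example: the data, the one-request handler, and `run_demo` are all made up for illustration, and `os.fork` makes it POSIX-only):

```python
import os
import socket

# Preload shared, read-only data in the master before forking.
ROUTES = {f"/page/{i}": f"content {i}" for i in range(10_000)}

def handle_one(listener):
    # A worker handles a single request using the inherited listener.
    conn, _ = listener.accept()
    path = conn.recv(1024).decode().strip()
    conn.sendall(ROUTES.get(path, "404").encode())
    conn.close()

def run_demo():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    port = listener.getsockname()[1]

    pid = os.fork()
    if pid == 0:               # worker: shares ROUTES pages via CoW
        handle_one(listener)
        os._exit(0)

    # Master acts as a client here just to exercise the worker.
    client = socket.create_connection(("127.0.0.1", port))
    client.sendall(b"/page/7\n")
    reply = client.recv(1024).decode()
    client.close()
    os.waitpid(pid, 0)
    listener.close()
    return reply
```

As long as the workers only read `ROUTES`, those pages stay physically shared between all processes; the article's whole problem is that GC bookkeeping writes break that sharing.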
Noting that some other library might call `gc.enable()` is correct. But then ignoring the fact that another library can just as easily call `gc.set_threshold(n)` with some n > 0 seems like a bug waiting to happen -- it's the same issue as something calling `gc.enable()`.
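For illustration, the two switches really are independent in CPython's `gc` module, so guarding one doesn't guard the other:

```python
import gc

gc.set_threshold(0)        # one way to stop automatic collection...
assert gc.isenabled()      # ...but the GC still reports "enabled"

gc.disable()                   # the other switch
gc.set_threshold(700, 10, 10)  # a library resetting thresholds does
assert not gc.isenabled()      # NOT undo gc.disable()...

gc.enable()                # ...while a library calling gc.enable()
                           # silently reactivates collection, with
                           # whatever thresholds happen to be set
assert gc.isenabled() and gc.get_threshold()[0] == 700
```

So an app that "disables" GC via `gc.set_threshold(0)` is one stray `gc.set_threshold(700, ...)` away from collecting again, exactly as the comment suggests.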
This is called out-of-band GC. We've been doing it for years in Ruby with Unicorn: <a href="https://blog.newrelic.com/2013/05/28/unicorn-rawk-kick-gc-out-of-the-band/" rel="nofollow">https://blog.newrelic.com/2013/05/28/unicorn-rawk-kick-gc-ou...</a><p>However, when the Ruby community moved to Puma, which is based on both processes and threads, it was needed less. None of this is rocket science (it's still far behind the JVM and .NET); I assume a hybrid process/thread model is something that just hasn't reached critical mass in the Python/Django/Flask/Bottle community?
They mentioned msgpack was calling gc.enable(), but it looks like that issue was fixed quite a while ago in version 0.2.2:<p><a href="https://github.com/msgpack/msgpack-python/blob/2481c64cf162d765bfb84bf8e85f0e9861059cbc/ChangeLog.rst#bugs-fixed-10" rel="nofollow">https://github.com/msgpack/msgpack-python/blob/2481c64cf162d...</a>
This writing feels a little sloppy.<p>> At Instagram, we do the simple thing first. [...] Lesson learned: prove your theory before going for it.<p>So do they no longer do the simple thing first?<p>More on topic: this seems like an optimization that might really constrain them down the road. Now, if anyone creates reference cycles that ref-counting alone can't reclaim, they will get OOMs.
Carl Meyer, a Django core dev, presented at Django Under the Hood on using Django at Instagram. It was a really good talk that goes through how they scaled and what metrics they use for measuring performance. <a href="https://youtu.be/lx5WQjXLlq8" rel="nofollow">https://youtu.be/lx5WQjXLlq8</a>
I actually didn't know that CPython had a way of breaking reference cycles. I seem to remember reading that reference counting was the only form of garbage collection that CPython did. Maybe this was the case in the past?
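It was: CPython has had a supplemental cycle collector (the `gc` module) since Python 2.0; before that, refcounting really was the only mechanism. A quick demonstration of the difference:

```python
import gc

class Node:
    def __init__(self):
        self.partner = None

def make_cycle():
    a, b = Node(), Node()
    a.partner, b.partner = b, a   # a <-> b: refcounts never drop to 0

gc.disable()          # rule out a collection sneaking in mid-demo
gc.collect()          # start from a clean slate
make_cycle()          # plain refcounting would leak this pair forever
freed = gc.collect()  # the cycle detector finds and reclaims them
gc.enable()
print(freed)          # at least 2 (the two Nodes, plus their __dict__s)
```

This is exactly why the article's "refcounting only" setup is risky: anything caught in a cycle after `gc` is shut off stays resident until the process dies.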
> Each CoW triggers a page fault in the process.<p>Maybe I've misunderstood how page faults work, but I thought the causality was reversed: each page fault (on a write to a shared page) triggers a CoW, not the other way around?
Using threading to handle user requests with Python seems very wrong to me. They might see solid improvement by ditching WSGI and employing a non-blocking solution (like Tornado, aiohttp or Sanic), running on PyPy as multiple instances behind a load balancer.
Instead of a bunch of hacks that are obviously going to blow up in someone's face one day, why not just use a more suitable platform?<p>Forking worker processes for web requests is so old school... And Python is a terrible choice for something at their scale.<p>Just redo the hosting bit in Java or Go and call it a day. If their UI code is sufficiently isolated from the back end, it's not a huge deal.<p>Instagram is a pretty small application feature-wise; a few devs could probably do it in a couple of months.
If you think about it, this approach is actually very similar to the FaaS/Lambda/Serverless model. Each request lives in its own little container which gets thrown away after every execution. This approach means you reduce the amount of shared state and lots of problems like garbage collection either get easier or go away.