While I won't claim this is unique to Go, I've had similarly good experiences cloning out various bits of Go for my own crazy purposes. The standard library and compiler are relatively clean code for what they are, and it's relatively easy for a pro developer to fork them temporarily like this, or to pick up a standard library package that almost does what you need and add the rest. I've forked encoding/json to add an attribute that collects "the rest" of the JSON attributes not automatically marshaled, both in and out. I've forked encoding/xml to add a variety of things I needed to write an XML sanitizer (where you care about things like how long an attribute is *during* parsing; by the time user code is presented with a 4-gigabyte attribute it's too late, it has already brought your server to its knees). I saved weeks by being able to start with a solid encoder/decoder backend that I could follow and bring up to snuff, rather than starting from scratch. A coworker forked crypto/tls because it was the easiest TLS implementation to break in deliberate ways to test for the various SSL vulnerabilities that have emerged over the years.

Of course, I recommend this more as a last resort than as the first thing you reach for, but it's a fantastic option to have in the arsenal, even if you don't reach for it often. I encourage people to at least consider it.
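For readers wondering what that encoding/json fork buys: stock encoding/json has no such catch-all attribute, but a rough approximation without forking is a second unmarshal pass into a map, dropping the keys the struct already declares. A minimal sketch (the `Event` type and field names are hypothetical, and the outbound/marshal direction the fork also covered is omitted):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a hypothetical example type: Known is decoded normally,
// and Rest collects every attribute the struct doesn't declare.
type Event struct {
	Known string                     `json:"known"`
	Rest  map[string]json.RawMessage `json:"-"`
}

func (e *Event) UnmarshalJSON(data []byte) error {
	// First pass: decode the declared fields.
	type plain Event // local alias so this method isn't called recursively
	if err := json.Unmarshal(data, (*plain)(e)); err != nil {
		return err
	}
	// Second pass: decode everything, then drop the keys already handled.
	if err := json.Unmarshal(data, &e.Rest); err != nil {
		return err
	}
	delete(e.Rest, "known")
	return nil
}

func main() {
	var e Event
	_ = json.Unmarshal([]byte(`{"known":"x","extra":1,"more":"y"}`), &e)
	fmt.Println(e.Known, string(e.Rest["extra"]), string(e.Rest["more"]))
}
```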
Summary: developers were calling a method over and over, 30 levels deep in the stack, and just barely staying under Go's 2 KiB initial goroutine stack size. In other words, they were getting great performance because everything happened to fit in 1.8 or 1.9 KiB, just shy of 2 KiB. Then a change landed and performance went terrible: on the other side of the boundary, at 2.1 or 2.2 KiB, an entire extra 2 KiB had to be allocated, for a total of 4 KiB, to fit everything. The engineers stopped at nothing to find the root cause, reading the assembly of the binary and, yes, forking the Go compiler.
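For anyone who wants to see that cliff on their own machine, here's a rough sketch (assuming the default ~2 KiB starting stack; the exact crossover depth depends on compiler version and frame layout): spawn a batch of goroutines at two recursion depths and compare `runtime.MemStats.StackInuse`.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// deep burns roughly 64 bytes of stack per frame; a deep enough call
// chain outgrows the initial goroutine stack and forces a copy/grow.
//go:noinline
func deep(depth int) byte {
	var pad [64]byte
	if depth == 0 {
		return pad[0]
	}
	return deep(depth-1) + pad[len(pad)-1]
}

// stackInUse spawns n goroutines that recurse to the given depth, then
// reports total stack memory in use while they are all still alive.
func stackInUse(n, depth int) uint64 {
	var ready sync.WaitGroup
	ready.Add(n)
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() {
			_ = deep(depth)
			ready.Done()
			<-release // keep the (possibly grown) stack alive while we measure
		}()
	}
	ready.Wait()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	close(release)
	return m.StackInuse
}

func main() {
	// Illustrative depths: one comfortably under the initial stack, one over it.
	fmt.Println("shallow goroutines:", stackInUse(1000, 4), "bytes of stack in use")
	fmt.Println("deep goroutines:   ", stackInUse(1000, 100), "bytes of stack in use")
}
```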
Ah, memory management. Here's my basic understanding:

- Go initially allocates a 2 KiB stack per goroutine. When a goroutine exceeds it, the runtime copies the whole stack into a new allocation twice the size.

- This was happening once or twice per request. They didn't explain exactly why all that stack memory was being used (maybe someone can chime in), but contributing factors were a 30-function-deep call stack and a minor code change that tipped it into the next stack-growth tier.

- The grown stack also doesn't get freed up until garbage collection runs.

- They worked around it by implementing a kind of goroutine pool that keeps assigning work to the same (stack-expanded) goroutines, staying ahead of the garbage collector (a rough sketch follows below).

My takeaways:

1. Fantastic analysis and a job well done.

2. Pooling does not seem to be how things "should" work in Go. It's more of a hack around undesirable allocator/garbage-collector behavior.

3. I'm really interested in reference-counted languages like Swift on the server for these reasons. I know ARC means more predictable latency when it comes to garbage-collector behavior (which is only indirectly the problem here). Now I'm really curious how Swift allocates the stack and whether it would avoid the "morestack" growth penalty that Go pays.
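A back-of-the-envelope version of the workaround described in that last bullet (hypothetical names, not Uber's actual implementation): a fixed set of long-lived workers pulls jobs from a channel, so any stack growth a worker pays for is reused across requests. Caveat: the runtime can still shrink a mostly-idle goroutine's stack at GC time, so this only helps while the workers stay busy.

```go
package main

import "sync"

// workerPool reuses a fixed set of goroutines so that any stack growth
// they pay for happens once, not once per submitted task.
type workerPool struct {
	tasks chan func()
	wg    sync.WaitGroup
}

func newWorkerPool(workers int) *workerPool {
	p := &workerPool{tasks: make(chan func())}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			for task := range p.tasks {
				task() // deep call chains grow this goroutine's stack once
			}
		}()
	}
	return p
}

// Submit blocks until a worker picks the task up.
func (p *workerPool) Submit(task func()) { p.tasks <- task }

// Close stops accepting work and waits for in-flight tasks to finish.
func (p *workerPool) Close() {
	close(p.tasks)
	p.wg.Wait()
}

func main() {
	pool := newWorkerPool(8)
	var done sync.WaitGroup
	for i := 0; i < 100; i++ {
		done.Add(1)
		pool.Submit(func() { defer done.Done() /* handle one request */ })
	}
	done.Wait()
	pool.Close()
}
```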
Fascinating read. Although the idea of using thread pools evokes manual pthread management, this post is rather convincing that such hand-holding is necessary in applications with strict SLAs. Alas, the magic the Go team has worked with goroutines doesn't yield a free lunch for *everybody*.

If we accept that pooling is necessary in some cases, I'm curious: is there a common library that these applications use?

In trying to answer my own question, I found that M3 has a mature-looking implementation of such an abstraction: https://github.com/m3db/m3/tree/master/src/x/sync

Elsewhere, I couldn't find anything similar in the usual suspects. CockroachDB has one-off, specific implementations in the places where they've decided pooling is worth it. Kubernetes looks like it uses the stdlib's `sync.Pool` in a similar way, but doesn't use a full-fledged goroutine pool.

Do people at Uber think this is a robust enough solution to be used outside of M3? Seems like it might be useful in the stdlib alongside `sync.Pool` :)
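For contrast with a full goroutine pool: `sync.Pool` recycles objects (buffers, scratch structs) between GC cycles rather than keeping goroutines, and their grown stacks, alive. A minimal sketch of that stdlib pattern:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles byte buffers between requests; unlike a goroutine
// pool, it reuses objects, not goroutines or their stacks.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func handle(payload string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	buf.WriteString("processed: ")
	buf.WriteString(payload)
	return buf.String()
}

func main() {
	fmt.Println(handle("hello"))
}
```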
It seems to me they could have used the following hack to fix the issue:

```go
var dontOptimizeMeBro byte

// The directive below must have no space after "//", otherwise it is
// treated as an ordinary comment and the function may still be inlined.
//go:noinline
func makeStackBig() {
	var buf [16 * 1024]byte
	// Touch both ends so the compiler can't elide the buffer.
	dontOptimizeMeBro = buf[0] + buf[len(buf)-1]
}
```

Call this at the start of the goroutine. What it does, I hope, is grow the stack to (at least) 16 KB in a single step, as opposed to going from 2 KB to 4 KB, then to 8 KB, then to 16 KB and paying to copy the memory at each doubling. The stack then stays big for the remaining lifetime of the goroutine.
This is a case in which generational GC can help. If you allocate goroutine stacks in the nursery, then you can use a bump allocator, which makes allocation throughput extremely fast. Throughput of allocation matters just as much as latency does!

(By the way, Rust used to cache thread stacks back when it had M:N threading, because we found that situations like this arose a lot.)
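To make "bump allocator" concrete, here's a toy sketch (not how the Go runtime actually carves out stacks): allocation is just an offset increment into a pre-reserved arena, and reclaiming the whole nursery is a single reset.

```go
package main

import "fmt"

// arena is a toy bump allocator: each allocation just advances an offset,
// and reclaiming the whole region is one reset (as a GC nursery would be).
type arena struct {
	buf []byte
	off int
}

func newArena(size int) *arena { return &arena{buf: make([]byte, size)} }

// alloc returns n bytes, or nil when the arena is exhausted (a real
// nursery would trigger a minor collection here instead).
func (a *arena) alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		return nil
	}
	p := a.buf[a.off : a.off+n : a.off+n]
	a.off += n
	return p
}

// reset reclaims every allocation at once.
func (a *arena) reset() { a.off = 0 }

func main() {
	a := newArena(1 << 20)
	stack := a.alloc(2048) // e.g. a 2 KiB "stack" carved out of the nursery
	fmt.Println(len(stack), "bytes allocated, offset now", a.off)
	a.reset()
}
```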
Except it was the second growth, just exceeding the 4096-byte stack size, that was causing the issue:

"it looked like the goroutine stack was growing from 4 kibibytes to 8 kibibytes"
I seem to recall there are a couple of GitHub issues around reusing goroutines and their stacks, or being able to specify the stack size of a new goroutine instead of making it a runtime constant. Either would be very helpful for those of us using Go at scale.
Great read. So a key takeaway would be to make sure to prime the goroutine pool and also reuse it. Isn't reusing goroutines a bit of an anti-pattern, though?