I've been bitten by the surprising amount of time it takes for Kubernetes to update loadbalancer target IPs in some configurations. For me, 90% of the graceful shutdown battle was just ensuring that traffic was actually being drained before pod termination.<p>Adding a global preStop hook with a 15 second sleep did wonders for our HTTP 503 rates. This creates time between when the loadbalancer deregistration gets kicked off, and when SIGTERM is actually passed to the application, which in turn simplifies a lot of the application-side handling.
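For illustration, here is roughly what the application side can shrink to once the preStop sleep covers the drain window; this is just a sketch, and the port and shutdown timeout are arbitrary values, not anything prescribed:<p><pre><code> package main

 import (
     "context"
     "net/http"
     "os"
     "os/signal"
     "syscall"
     "time"
 )

 func main() {
     srv := &http.Server{Addr: ":8080"} // arbitrary port

     go func() {
         // Returns http.ErrServerClosed once Shutdown is called.
         _ = srv.ListenAndServe()
     }()

     // Wait for SIGTERM. By the time it arrives, the preStop sleep has
     // already given the load balancer time to deregister the pod, so
     // no extra application-side sleep is needed.
     stop := make(chan os.Signal, 1)
     signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
     <-stop

     // Finish in-flight requests, bounded by a deadline (value made up).
     ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
     defer cancel()
     _ = srv.Shutdown(ctx)
 }
</code></pre>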
another factor to consider is that if you have a typical Prometheus `/metrics` endpoint that gets scraped every N seconds, there's a period in between the "final" scrape and the actual process exit where any recorded metrics won't get propagated. this may give you a false impression about whether there are any errors occurring during the shutdown sequence.<p>it's also possible, if you're not careful, to lose the last few seconds of logs from when your service is shutting down. for example, if you write to a log file that is watched by a sidecar process such as Promtail or Vector, and on startup the service truncates and starts writing to that same path, you've got a race condition that can cause you to lose logs from the shutdown.
one tiny thing I see quite often: people think that if you do `log.Fatal`, it will still run things in `defer`. It won't!<p><pre><code> package main

 import (
     "fmt"
     "log"
 )

 func main() {
     defer fmt.Println("in defer")
     log.Fatal("fatal")
 }
</code></pre>
this just prints "fatal"... because log.Fatal calls os.Exit, which terminates the process immediately without running deferred functions.<p><pre><code> package main

 import (
     "fmt"
     "log"
 )

 func main() {
     defer fmt.Println("in defer")
     panic("fatal")
 }
</code></pre>
This prints both: the deferred `in defer` runs first (deferred calls still execute while a goroutine panics), followed by the `panic: fatal` message and a stack trace.
I was hoping the article would describe how to restart the application without dropping a single incoming connection, where the new service instance receives the listening socket from the old instance.<p>It is relatively straightforward to implement under systemd, and nginx has supported it for over 20 years. Sadly, Kubernetes and Docker have no support for it, on the assumption that it is handled by the load balancer or reverse proxy.
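For what it's worth, the receiving side under systemd socket activation is only a few lines in Go; this is just a sketch of the LISTEN_FDS convention (inherited file descriptors start at 3), with a made-up fallback port and minimal error handling:<p><pre><code> package main

 import (
     "fmt"
     "net"
     "net/http"
     "os"
     "strconv"
 )

 // systemdListener returns the listening socket passed in by systemd
 // socket activation, if any.
 func systemdListener() (net.Listener, error) {
     if os.Getenv("LISTEN_PID") != strconv.Itoa(os.Getpid()) {
         return nil, fmt.Errorf("socket not passed to this process")
     }
     if os.Getenv("LISTEN_FDS") == "" {
         return nil, fmt.Errorf("LISTEN_FDS not set")
     }
     return net.FileListener(os.NewFile(3, "from-systemd"))
 }

 func main() {
     ln, err := systemdListener()
     if err != nil {
         // No inherited socket; open one ourselves (port is arbitrary).
         if ln, err = net.Listen("tcp", ":8080"); err != nil {
             panic(err)
         }
     }
     http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         fmt.Fprintln(w, "ok")
     }))
 }
</code></pre>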
This is one of the things I think Elixir is really smart about. I'm not very experienced in it, but it seems to me that designing your application around tiny VM processes that are meant to panic, quit, and get respawned largely eliminates the need to intentionally write graceful shutdown routines, because that behavior is already embedded in the application architecture.
I created a small library for handling graceful shutdowns in my projects: <a href="https://github.com/eberkund/graceful">https://github.com/eberkund/graceful</a><p>I find that I typically have a few services that I need to start up, and sometimes they have different mechanisms for start-up and shutdown. Sometimes you need to instantiate an object first, sometimes you have a context you want to cancel, other times you have a "Stop" method to call.<p>I designed the library to help me consolidate this all in one place with a unified API.
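To be clear, this isn't the library's actual API, just a sketch of the kind of unification I mean: adapt cancel funcs and Stop methods behind one interface and tear them down in reverse order (all names below are made up):<p><pre><code> package graceful

 import "context"

 // Stopper is a hypothetical unified shutdown hook.
 type Stopper interface {
     Stop(ctx context.Context) error
 }

 // StopFunc adapts a plain function (e.g. a context cancel func or a
 // service's Stop method) to the Stopper interface.
 type StopFunc func(ctx context.Context) error

 func (f StopFunc) Stop(ctx context.Context) error { return f(ctx) }

 // Group collects services and stops them in reverse registration order.
 type Group struct {
     stoppers []Stopper
 }

 func (g *Group) Add(s Stopper) { g.stoppers = append(g.stoppers, s) }

 func (g *Group) StopAll(ctx context.Context) error {
     var firstErr error
     for i := len(g.stoppers) - 1; i >= 0; i-- {
         if err := g.stoppers[i].Stop(ctx); err != nil && firstErr == nil {
             firstErr = err
         }
     }
     return firstErr
 }
</code></pre>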
> <i>After updating the readiness probe to indicate the pod is no longer ready, wait a few seconds to give the system time to stop sending new requests.</i><p>> <i>The exact wait time depends on your readiness probe configuration</i><p>A terminating pod is not ready by definition. The service will also mark the endpoint as terminating (and as not ready). This occurs on the transition into Terminating; you don't have to fail a readiness check to cause it.<p>(I don't know about the ordering of the SIGTERM & the various updates to the objects such as Pod.status or the endpoint slice; there might be a <i>small</i> window after SIGTERM where you could still get a connection, but it isn't the large "until we fail a readiness check" window that TFA implies.)<p>(And as someone who manages clusters, honestly that infinitesimal window probably doesn't matter. Just stop accepting new connections, gracefully close existing ones, and terminate reasonably fast. But I feel like half of the apps I work with fall into one of two buckets: "handle SIGTERM & take forever to terminate" or "fail to handle SIGTERM (and take forever to terminate)".)
We've adopted Google Wire for some projects at JustWatch, and it's been a game changer. It's surprisingly under the radar, but it helped us eliminate messy shutdown logic in Kubernetes. Wire forces clean dependency injection, so now everything shuts down in order instead of... well, who knows :-D<p><a href="https://go.dev/blog/wire" rel="nofollow">https://go.dev/blog/wire</a>
<a href="https://github.com/google/wire">https://github.com/google/wire</a>
I tend to use a waitgroup plus context pattern. Any internal service that needs to wind down for shutdown gets a context it can watch in a goroutine to know when to start shutting down, and a waitgroup to signal that it has finished shutting down.<p>Then the main goroutine can cancel the context when it wants to shut down, and block on the waitgroup until everything is closed.
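A minimal sketch of that pattern (the worker names and the sleep standing in for cleanup work are made up):<p><pre><code> package main

 import (
     "context"
     "fmt"
     "os"
     "os/signal"
     "sync"
     "syscall"
     "time"
 )

 func worker(ctx context.Context, wg *sync.WaitGroup, name string) {
     defer wg.Done()
     <-ctx.Done()                       // wait for the shutdown signal
     fmt.Println(name, "shutting down") // e.g. flush buffers, close conns
     time.Sleep(100 * time.Millisecond) // stand-in for real cleanup work
 }

 func main() {
     ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
     defer cancel()

     var wg sync.WaitGroup
     for _, name := range []string{"http", "consumer", "metrics"} {
         wg.Add(1)
         go worker(ctx, &wg, name)
     }

     <-ctx.Done() // SIGTERM/SIGINT cancels the context
     wg.Wait()    // block until every worker reports it has finished
     fmt.Println("all services stopped")
 }
</code></pre>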