Major incidents aside, I always think cache-related bugs are some of the most likely to go undetected: if you don't test for them end-to-end, they're really not that easy to spot and diagnose.

An article sticking around too long on the home page. Semi-stale data creeping into your pipeline. Someone's security token being accepted post-revocation. All really hard to spot unless (1) you're explicitly looking, or (2) the manure hits the fan.
Required reading for all of the "I could code up Twitter in a weekend" types.

The long-listen-queue -> multiple-queued-up-retries feedback loop is a classic: see RFC 896 on TCP/IP "congestion collapse" (https://datatracker.ietf.org/doc/html/rfc896) and the 1986 Internet meltdown [various sources].
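To make the loop concrete, here's a toy back-of-the-envelope model (my own numbers and simplification, not from the article or the RFC): once requests start timing out, every retry adds to the offered load, which makes even more requests time out.

    # Toy model of the retry feedback loop: retries on timeouts inflate
    # offered load, which causes more timeouts, which causes more retries.
    def offered_load(base_rps, capacity_rps, max_retries, rounds=10):
        load = base_rps
        for _ in range(rounds):
            # Rough fraction of requests that time out once we're over capacity.
            timeout_fraction = max(0.0, min(1.0, (load - capacity_rps) / load))
            # Each timed-out request gets re-sent, adding to the next round's load.
            load = base_rps + base_rps * timeout_fraction * max_retries
        return load

    # A server that handles 1000 rps, clients offering 1100 rps with up to 3
    # retries: offered load spirals to well over 3x capacity instead of 1.1x.
    print(offered_load(1100, 1000, max_retries=3))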
What I find most interesting in this is the pseudo-detective story of hunting down disappearing post-mortem and "lessons learned" documentation. Optimistically, we'd hope the older systems no longer reflect the existing systems in any meaningful way (possibly as the org structures and/or software stacks shift and change) and the documentation is simply no longer relevant.

I'd imagine most lost knowledge is not an explicit decision, however, which means such historical scenarios / documentation / ... are just lost in the course of business. Lost knowledge is the default for companies.

Twitter is likely better than most, given their documentation is all digital and there exist explicit processes to catalogue such incidents. I'd also be curious to see how much of this knowledge has been implicitly exported to their open source codebases.
I remember reading that Facebook's caches had a dedicated standby set of "gutter" servers (otherwise inactive and unused) that would take over quickly after a failure. That was an interesting mitigation for some failure scenarios.
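From what I recall of the memcache paper, the client-side logic is roughly this shape. The sketch below is my own reconstruction, not Facebook's code; the GutteredCache class and the primary/gutter client objects are hypothetical, duck-typed cache clients.

    # Rough shape of the "gutter pool" fallback. If the primary cache server
    # is unreachable, fall back to a small standby pool with short TTLs so
    # the database doesn't eat the full miss storm.
    class GutteredCache:
        def __init__(self, primary, gutter, gutter_ttl=10):
            self.primary = primary        # normal cache client for this key
            self.gutter = gutter          # small standby pool, idle otherwise
            self.gutter_ttl = gutter_ttl  # short TTL bounds staleness

        def get(self, key, load_from_db):
            try:
                value = self.primary.get(key)
                if value is not None:
                    return value
            except ConnectionError:
                # Primary is down: serve from (or fill) the gutter pool instead.
                value = self.gutter.get(key)
                if value is None:
                    value = load_from_db(key)
                    self.gutter.set(key, value, expire=self.gutter_ttl)
                return value
            # Ordinary miss on a healthy primary: fill it as usual.
            value = load_from_db(key)
            self.primary.set(key, value)
            return value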
These big incidents involving 'big cache' are fun to read about. Years ago I had to deal with a bunch of cache issues over a short time, but they were all minor incidents with minor uses of cache (simple memoization, storing stuff in maps on attributes of java singletons, browser local storage). Still, I made a checklist of questions to ask thenceforth on any proposal or implementation of a cache in a doc or code review. A bunch of them are just focused on actually paying attention to what your keys are made of and how invalidation works (or if you even can invalidate, or if it's even needed). I think for 'big cache' questions I should just refer to this blog post and ask "what's the risk of these issues?"
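For the key-composition question in particular, the failure mode is easiest to see in a tiny made-up example (the names and fields here are invented): every input that affects the cached result has to be part of the key, and invalidation should be an explicit operation that derives the key the same way the read path does.

    # Tiny made-up example of the "what is your key made of?" question.
    # Leaving locale or feed_version out of the key would silently make
    # different results share one cache entry.
    cache = {}

    def feed_key(user_id, locale, feed_version):
        return ("feed", user_id, locale, feed_version)

    def get_feed(user_id, locale, feed_version, build_feed):
        key = feed_key(user_id, locale, feed_version)
        if key not in cache:
            cache[key] = build_feed(user_id, locale, feed_version)
        return cache[key]

    def invalidate_feed(user_id, locale, feed_version):
        # Explicit invalidation: same key derivation as the read path.
        cache.pop(feed_key(user_id, locale, feed_version), None)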
Yeah, see also Marc Brooker's good article on why the bimodal behavior of caches can cause a lot of headaches: https://brooker.co.za/blog/2021/08/27/caches.html
"There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton<p><a href="https://martinfowler.com/bliki/TwoHardThings.html" rel="nofollow">https://martinfowler.com/bliki/TwoHardThings.html</a>
“On Nov 8, a user changed their name from tigertwo to Woflstar_Bachi.”

Horrifically inappropriate inclusion of PII in this post. Didn’t someone at legal go through this?