I'm not sure what qualifies as "stateful", but

> *Fewer dependencies to fail and take down the service.*

No logging? No metrics? No monitoring? (And *yes*, you'd think those shouldn't take down the stack if they went offline. I'd agree. And yet, I've witnessed that failure mode multiple times. In one, a call to Sentry was synchronous & a hard-fail, so when Sentry went down, that service 500'd. In another, syslog couldn't push logs out to the logging service, as that was *very* down, having been inadvertently deleted by someone who ran "terraform apply", didn't read the plan, & then said "make it so"; syslog then responded to the logging service being down by logging that error to a local file. Repeatedly. As fast as it possibly could. The disk fills. The disk is full. The service fails.)

I've also seen our alerting provider have an outage *during an outage we were having*, & thus not send pages for our outage, leaving me to wonder how I'd just rolled a 1 on the SRE d20 and which god I'd angered. Also, who watches the watchmen?

> *A common pitfall is to introduce something like ElasticSearch, only to realize a few months later that no one knows how to run it.*

Yeah, I've seen that exact pit fallen into.

No DNS? A global Cloudflare outage == fun.

No certificates?

I've seen certs fail in so many different ways. Not getting renewed, of course; that's your table-stakes "welcome to certs!" failure mode. Certs get renewed, but an *allegedly semver-compatible* upgrade changed the defaults, required extensions didn't get included, and the client rejected the cert. I've seen a service which watches certs to make sure they don't expire (see the outage earlier in this paragraph!) have an outage (which, b/c it's monitoring, wasn't customer-visible) because a tool issued a malformed cert (…by… default…) that the monitor failed to parse (as it was malformed). Oh, and then the LE cross-signing expiration took out an Azure service that wasn't ready for it, a service from a third party of ours that wasn't ready for it, *and* our CI system, b/c several tools were out of date, including *an up-to-date system on Debian that was theoretically "supported"*… but still shipped an ancient crypto library riddled with bugs in its path validation.

> *Okay fine, S3 too, but that's a different animal.*

*Is it?* I've seen S3 have outages too, & bring a service down with it. (There really wasn't a choice there; S3 was the service's backing store, & without it, the service was truly screwed.)

But of course, all this is to say I violently agree with the article's core point: think carefully about each dependency, as each one carries a very real production cost.

(I've recently been considering changing my title to SRE because I have done very little in the way of SWE recently…)
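
(For what it's worth, the Sentry failure above has a pretty mechanical fix: make the reporting path fail open so the error reporter being down can never become *your* outage. A minimal sketch of that shape in Python; report_error, dsn_url, and the timeout value are made up for illustration, not the actual code involved:)

    import logging
    import requests

    log = logging.getLogger(__name__)
    REPORT_TIMEOUT = 0.5  # seconds; never block a request on the error reporter

    def report_error(payload: dict, dsn_url: str) -> None:
        # Hypothetical wrapper: bounded by a timeout, and any failure to reach
        # the reporting service is swallowed and noted locally instead of
        # propagating back into the request path.
        try:
            requests.post(dsn_url, json=payload, timeout=REPORT_TIMEOUT)
        except requests.RequestException:
            log.warning("error reporter unreachable; dropping event", exc_info=True)

(Better still is handing the event to a background queue, but the load-bearing part is the same: timeout plus fail-open.)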