Latest scoop:<p>At 06:00 UTC on March 8th, 2023 the Datadog platform started experiencing widespread issues across multiple products. The web application was unavailable or intermittently loading, and data ingestion & monitor evaluation were delayed.<p>We have identified and remedied the issue that caused this outage. We will prepare and share a detailed root cause analysis as soon as possible after our incident response is complete, but we can share a preliminary analysis now. A critical software update applied to a broad set of hosts in our infrastructure caused a subset of these hosts to lose network connectivity.<p>The primary impact of this was that several of our regional Kubernetes clusters became unhealthy, affecting the control plane that keeps our workloads running smoothly. At this point, we believe we have repaired all the affected Kubernetes clusters, and our recovery efforts are now focused on the application layer above this.<p>The web application is now generally available, although data and monitor evaluation remains delayed in some cases (refer to the Status Page in your region for the latest information). We have made substantial progress on restoring the various core services that were impacted by the incident, and have now moved on to getting our data processing pipelines for metrics, logs, traces, and other data into a healthy state.<p>It is difficult to give a precise ETA on our full recovery and we are focusing our efforts on restoring real-time data and alerts within a matter of hours (not minutes, but also not days). The recovery of historical data (between the start of the outage and 15 minutes in the past) has been deprioritized.<p>We understand the impact an outage can have, and are sorry for the disruption.
Ironically, DD promotes using their tool to set and measure SLAs but has a low bar on their SLA:<p>> Excluding scheduled maintenance windows, Datadog will use commercially reasonable efforts to maintain 99.8% availability of the hosted portion of the Service for each calendar month during the term of this Agreement. The Service will be deemed “available” so long as Authorized Users are able to login to the Service interface and access monitoring data. Excluding planned maintenance periods, in the event the Service availability drops below 99.8% for two consecutive months, Customer may terminate the Service in the calendar month following such two-month period upon written notice to Datadog. To assess uptime, Customer may, if under a Paying Plan, request the Service availability for a prior month by filing a support ticket through the Site.
We're up to 10 hours of downtime on their services.<p>Anyone know what could have caused this? Companies generally [citation needed] don't go down for half a day across all their services.
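For a sense of scale, the 99.8% figure quoted above can be turned into a monthly downtime budget and compared with a 10-hour outage. This is only a back-of-the-envelope sketch: the function names are made up for illustration, it assumes a 31-day month (March), and it treats all 10 hours as "unavailable", which may not match how the SLA itself defines availability (login plus access to monitoring data, excluding scheduled maintenance).

```python
def downtime_budget_minutes(availability_pct: float, days_in_month: int) -> float:
    """Minutes of downtime allowed in a month at the given availability target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def availability_after_outage(outage_hours: float, days_in_month: int) -> float:
    """Monthly availability (percent) if the only downtime is one outage."""
    total_hours = days_in_month * 24
    return 100 * (1 - outage_hours / total_hours)


# Assumption: March has 31 days and the whole 10 hours counts as downtime.
budget = downtime_budget_minutes(99.8, 31)
achieved = availability_after_outage(10, 31)

print(f"99.8% target allows about {budget:.1f} minutes of downtime in the month")
print(f"A 10-hour outage alone gives about {achieved:.2f}% availability")
```

Under those assumptions the budget is roughly an hour and a half for the month, so a 10-hour outage exceeds it several times over.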
All of Datadog's auto-muting logic during incidents is super well thought out and impresses me every time.<p>I'd have a few incidents open at 2am for missing business metrics and hosts falling out of the sky due to this if they didn't have that logic there, but instead we've sent out no false alerts for this.
> Mar 08, 2023 - 13:14 EST
> Update - We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.<p>> Mar 08, 2023 - 12:29 EST
> Update - We continue progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.<p>> Mar 08, 2023 - 11:46 EST
> Update - We continue progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.<p>-----<p>I won't post all of it, but you get the picture. Datadog's status updates go on like this for 12 hours. This is a status update anti-pattern. These updates add no value, except maybe some small reassurance that Datadog hasn't forgotten it is down and is still working on it.<p>But it's actually worse than no update at all, because now every customer needs to parse through a whole mess of these updates to try and figure out what's happening, and if you subscribe to the updates, you start getting spam with no new information every 45 minutes.<p>I know companies are in a bind here. They don't want to provide estimates they might miss, or ad hoc engineering info without vetting. On the other hand, customers complain about a lack of communication if there are no updates. But spamming your status page like this is not a real update, it's a pretend update: something to point to when customers complain about a lack of communication, but ultimately still a lack of communication.
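One way to picture the alternative: only notify subscribers when the text of an update actually changes. The snippet below is a minimal sketch of that idea, not how Datadog's status page or any status-page product works; `normalize` and `changed_updates` are hypothetical helpers, and the sample data is the three updates quoted above, in chronological order.

```python
from typing import Iterable, Iterator, Optional, Tuple

Update = Tuple[str, str]  # (timestamp, message)


def normalize(message: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(message.lower().split())


def changed_updates(updates: Iterable[Update]) -> Iterator[Update]:
    """Yield only updates whose text differs from the previously kept one."""
    last: Optional[str] = None
    for timestamp, message in updates:
        key = normalize(message)
        if key != last:
            yield timestamp, message
            last = key


REPEATED = (
    "We continue progress towards recovering all services. Data ingestion "
    "and monitor notifications remain delayed across all data types."
)
REWORDED = (
    "We are continuing to make progress towards recovering all services. "
    "Data ingestion and monitor notifications remain delayed across all data types."
)

updates = [
    ("11:46 EST", REPEATED),
    ("12:29 EST", REPEATED),
    ("13:14 EST", REWORDED),
]

to_notify = list(changed_updates(updates))
for timestamp, message in to_notify:
    print(timestamp, "-", message)
```

The 12:29 update is dropped as an exact repeat, but the 13:14 one still goes out because it is reworded, even though it says nothing new. Text comparison is only a floor; deciding whether an update carries new information still takes a human.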
There's some third-party uptime tracking here that might be useful: <a href="https://app.metrist.io/demo/datadog" rel="nofollow">https://app.metrist.io/demo/datadog</a>