Deno’s July 13th incident update

110 points by mostafah, almost 3 years ago

10 comments

ctvo, almost 3 years ago
> On July 13th, at around 18:45 UTC we started to receive reports of an outage from a small number of users. We investigated the status of our services, but were unable to confirm any of the reports. All of our status monitoring and tests reported that everything was operating normally.

> Over the course of the outage, we continued to monitor our service status, and worked with some of the affected users to narrow down the source of the problem.

> On July 14th, at 19:14 UTC we were able to identify that the problem was within our us-west3 region, which we then took offline, directing traffic to other nearby regions instead.

The time difference between when the first reports came in and when it was confirmed is a little concerning.

As an aside:

> ... approximately 18:00 UTC ...

> ... just over 24 hours ...

> ... For a period of around 24 hours, some users in the us-west3 region

> ... less than 30 minutes ...

> ... On July 13th, at around 18:45 UTC we started to receive reports of an outage from a small number of users. ...

"Approximately", "just", "around", "some", "small number of". It goes on and on. I disagree with the stylistic approach of being *less* specific in posts like these. A "small number of users" is relative. As readers, we have no idea what your typical load may be. Small may be a large number to us. "Just" over 24 hours is 26 hours? 24.5 hours? I implore you to be specific when you have the actual data.

These terms read as weasel words, and impact your effort at being fully transparent.
alluro2, almost 3 years ago
I don't mean anything bad to Deno's team (I'm very partial to what they're building), but I'm rather surprised whenever a widely-publicized service has an outage that lasts hours or more than 24h. I'm genuinely curious to understand whether it's typically due to the complexity of the infrastructure and how hard it is to find root causes, how long it takes to redirect traffic / patch temporarily once the cause is found, or whether it's due to an attitude where it's considered normal for these things to happen and to take time to solve step by step.

Our services are of what I consider medium complexity (~70 services, ~10 different "layers" of logic, db, caching, load balancing etc, AWS, mostly self-managed centralized logging and monitoring) but still quite low-volume (< 100 requests / second), and any more serious issue (let alone an outage) is spontaneously treated by my team as an absolute emergency and typically fixed in < 10 minutes.

We're very modestly funded compared to Deno (in this example) and the team is small...

Not sure whether that changes with traffic volume, complexity, team size, or is more primarily attitude-based and should continue to be cultivated.
tetha, almost 3 years ago
Ew, we've had similar issues in the past. These are really messy and confusing to recognize.

In our case, 1 out of 5 LB instances lost its connection to the service discovery and later on ended up not knowing about a failover of one of the 5 backends for a service. As a result, something like 1 in 20 to 1 in 25 requests got answered with a connection refused. That took a minute to find.
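The 1-in-20-to-25 figure falls straight out of that topology. A back-of-the-envelope sketch with numbers that match the description but assume evenly spread traffic, which the comment doesn't actually state:

```ts
// Rough failure-rate math for the scenario described above (my assumptions,
// not the commenter's actual setup): 5 LB instances share traffic evenly,
// one of them has a stale service-discovery view, and that stale instance
// still routes to 1 of its 5 backends that has since failed over.

const LB_COUNT = 5;       // one instance lost its service-discovery connection
const BACKEND_COUNT = 5;  // one backend failed over; the stale LB doesn't know

// P(request fails) = P(request hits the stale LB) * P(stale LB picks the dead backend)
const failureRate = (1 / LB_COUNT) * (1 / BACKEND_COUNT);

console.log(failureRate);     // 0.04  -> roughly 1 in 25
console.log(1 / failureRate); // 25
```

Uneven traffic or connection reuse would push the observed rate toward the 1-in-20 end, which is consistent with the range reported.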
sentrms, almost 3 years ago
Is anyone running mission critical software on this rather new platform? In my experience every added cloud service becomes another potential weak link in your chain. A distributed DB for your data, a CDN for static assets, a couple of lambda functions for background processing. With every move away from the monolith your surface for potential downtime or "elevated error rates" increases.
remram, almost 3 years ago
So three failures:

- The load balancer lost its connection to etcd and did not reconnect
- The load balancer had no healthy backend and did not un-advertise itself
- The load balancer did not report either of those issues to monitoring

Honestly this is a little concerning. Are they using their own load-balancing software? If yes, why?
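Those three failures arguably collapse into one missing, externally observable control loop. A minimal sketch of what such a loop could look like; the interfaces below are hypothetical placeholders, not Deno Deploy's actual components:

```ts
// Hypothetical control loop addressing the three failure modes listed above.
// EtcdClient / Advertiser / Metrics are stand-in interfaces for illustration.
interface EtcdClient { listHealthyBackends(service: string): Promise<string[]>; }
interface Advertiser { advertise(): Promise<void>; withdraw(): Promise<void>; }
interface Metrics { gauge(name: string, value: number): void; incr(name: string): void; }

async function controlLoop(etcd: EtcdClient, route: Advertiser, metrics: Metrics) {
  while (true) {
    try {
      const backends = await etcd.listHealthyBackends("deploy-workers");
      metrics.gauge("lb_healthy_backends", backends.length); // failure #3: always report

      if (backends.length === 0) {
        await route.withdraw();   // failure #2: never keep advertising a dead end
      } else {
        await route.advertise();
      }
    } catch (err) {
      // failure #1: a lost etcd connection is itself alertable, and the safe
      // default is to stop attracting traffic until it recovers.
      metrics.incr("lb_etcd_errors");
      await route.withdraw();
      console.error("etcd unreachable, route withdrawn:", err);
    }
    await new Promise((r) => setTimeout(r, 5_000)); // re-evaluate every 5s
  }
}
```

The key property is that "zero healthy backends" and "can't reach etcd" both produce a metric and a withdrawal, rather than silently black-holing traffic.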
AtNightWeCode, almost 3 years ago
"... (a TCP load balancer). It does not record any diagnostics about dropped connections, nor does it have a return channel to return diagnostic information to the user (unlike HTTP loadbalancers, which can return a response header)."

And there is no API monitoring apparently.
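For contrast, here is a toy illustration of the return channel an HTTP-level proxy gets for free and a TCP load balancer lacks; the backend addresses and header names are invented for the example:

```ts
// Toy L7 proxy sketch: because it terminates HTTP, it can stamp diagnostic
// headers on every response, including error responses. Addresses and header
// names are made up; this is not Deno Deploy's actual load balancer.
const backends = ["http://10.0.0.1:8000", "http://10.0.0.2:8000"];
let next = 0;

Deno.serve(async (req) => {
  const url = new URL(req.url);
  const backend = backends[next++ % backends.length];
  try {
    const upstream = await fetch(new URL(url.pathname + url.search, backend), {
      method: req.method,
      headers: req.headers,
      body: req.body,
    });
    const headers = new Headers(upstream.headers);
    headers.set("x-served-by", backend); // diagnostic breadcrumb for the client
    return new Response(upstream.body, { status: upstream.status, headers });
  } catch (err) {
    // A TCP load balancer in the same situation can only drop or reset the
    // connection; there is no response to attach diagnostics to.
    return new Response("upstream unavailable", {
      status: 502,
      headers: { "x-lb-error": String(err) },
    });
  }
});
```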
turtlebits, almost 3 years ago
Seems like a huge gap in observability - low/zero healthy targets for a load balancer should be a P0/critical alert, especially when traffic is getting black-holed.

LBs should also be alerting on health check failures / no data for targets as well.
donavanm, almost 3 years ago
Hey Luca, some thoughts from working on similar systems.

Visibility is the cause & lesson learned on duration. It's worth simply paying for 3P distributed RUM. Make sure you can get down to /24s & ASNs as well as breaking it out by (your) target destination/address. I really liked TurboBytes in the past. Cedexis was ok, but I remember the API/raw data access being a bit of a pain.

It sounds like your TCP LB wasn't exporting metrics this time. For other cases you can get decent data out of the tcp metrics cache on linux. And proc has some good counters even before you get a socket; PAWSPASSIVEREJECTED may have bitten me before :( Make sure your reads of /proc/net/netstat are aligned to the right size if you go that route.

> ... because the load balancer that failed was very early in the network stack (a TCP load balancer). It does not record any diagnostics about dropped connections ...

You may be able to sort out some improved visibility with something like netflow/sflow. This aligns well with discrete components and independent failure domains as well.

> Services announce themselves to the etcd cluster when their availability state changes ... If there are no healthy backends it will un-advertise itself from the network to prevent requests ending up at this "dead end".

In my experience you really can't rely on nodes to manage themselves when it comes to service availability or health. There are too many grey failure cases where a dataplane node will partially fail enough to keep mangling traffic or passing shallow health checks. E.g. a disk going read-only or stalled IO can keep the LB and active data in memory up, and signalling like BGP sessions stays up, but it prevents consuming new system/customer state updates. A separate system/component is necessary for the control loop to be insulated from those failures.

You end up in a situation where the distributed LB has "data plane" workers that handle connections & packets while the out-of-band "control plane" determines health & controls BGP/routing/ARP/whatever to put the data plane nodes in or out of service. Your application/lb/etc data plane can still self-report & retrieve data from etcd. But put the control somewhere with less correlated failures. While you're at it, build data versioning into your configuration, e.g. active customers/domains/etc, that your dataplane uses & reports. That way your control plane can check both the availability/performance *and* the current working state of LB dataplane configuration.

> [The LB did not have] any healthy backends to direct traffic to. ... This caused the traffic to be dropped entirely.

Throwing a RST or similar here is not wrong per se, and is a nice clear failure mode. One other approach is to have something like a default route that you can punt traffic to (and alert on) as a last resort. It depends on your network/LB configuration, but this could be a common MAC address, an internal ECMP'd route, or similar. I think you'll see many services that build L3/4 LBs, like CDNs, take this approach. IIRC Google Maglev and Fastly document their take on this to deal with problems like IP fragments and MTU discovery where some packets don't flow with the rest of the 5-tuple.

> The region will remain disabled until our monitoring has improved and the issue has been fixed more permanently.

I understand if this choice is around business & customer confidence. However, I didn't see anything that indicated your failure modes were specific to us-west3. It seemed that visibility & detection were the real failure. And in that case I'd posit the better path is getting *global* visibility into your failure mode, deploying that first/early to us-west3 and using that as your gate.

edit: I'm a couple years past doing distributed networking/lb systems as my full time job, so apologies if this is dated/fuzzy advice.
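As a footnote to the /proc/net/netstat suggestion: the file alternates a header line and a value line per counter group, so a scraper only needs to zip the pairs together. A rough Deno-flavoured sketch; the specific counters picked out at the end are my own guesses and are kernel-version dependent:

```ts
// Sketch of scraping /proc/net/netstat (Linux only). The file looks like:
//   TcpExt: SyncookiesSent SyncookiesRecv ...
//   TcpExt: 0 0 ...
// i.e. alternating header/value lines per counter group.
const text = await Deno.readTextFile("/proc/net/netstat");
const lines = text.trim().split("\n");
const counters: Record<string, number> = {};

for (let i = 0; i < lines.length; i += 2) {
  const [prefix, ...names] = lines[i].split(/\s+/);
  const values = lines[i + 1].split(/\s+/).slice(1).map(Number);
  names.forEach((name, j) => {
    counters[`${prefix}${name}`] = values[j]; // e.g. "TcpExt:ListenDrops"
  });
}

// Counters plausibly relevant to silently refused/dropped connections.
// Names vary by kernel version; check your own /proc/net/netstat first.
for (const key of ["TcpExt:ListenDrops", "TcpExt:ListenOverflows", "TcpExt:PAWSPassive"]) {
  if (key in counters) console.log(key, counters[key]);
}
```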
FBISurveillance, almost 3 years ago
Hugops to the team. A quick question: is it intentional that there's nothing on https://denostatus.com/?
Shadonototra, almost 3 years ago
> several services provided by the Deno company experienced a service disruptions in our us-west3 region for a period of just over 24 hours.

'"JUST" over 24 hours', no big deal of course /s

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

no mention of that issue

either nobody uses Deno, so 0 complaints

or people use Deno and for some reason a 24h+ downtime didn't impact anybody, which is surprising, to say the least