Hardening the registers: A cascading failure of edge induced fault tolerance

137 points by matrix almost 3 years ago

12 comments

yardstick almost 3 years ago

> Although Autobahn contained all the item data, the 404 responses were interpreted by the SDM Proxy as an indicator that the item was missing in Autobahn and the SDM Proxy retried the request to the central ILS API in the data centers.

This is why I never design web APIs to use the HTTP status code to indicate the application response. Always embed the application response within the HTTP payload. It should be independent of the transport mechanism. I'm OK with it not being a proper REST/RESTful service.

{ "status": 1000, "message": "Item not found" }

And intentionally don't use the same status numbers as HTTP (i.e. don't use 404 as not found, because someone will mix them up!)
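
A minimal Python sketch of the payload-level status idea described above. The status values 0 and 1000, the `item` field, and the function name `parse_item_response` are assumptions made for illustration; none of them come from the incident write-up. The point is only that a non-200 HTTP status is treated as a transport problem and never as "item missing".

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical application status codes, deliberately outside the HTTP range.
APP_OK = 0
APP_ITEM_NOT_FOUND = 1000


@dataclass
class LookupResult:
    found: bool
    item: Optional[dict]


def parse_item_response(http_status: int, body: str) -> LookupResult:
    """Interpret a response whose application outcome lives in the JSON payload."""
    if http_status != 200:
        # Transport or routing problem (bad path, proxy error, ...).
        # It says nothing about whether the item exists.
        raise RuntimeError(f"transport-level failure: HTTP {http_status}")
    payload = json.loads(body)
    status = payload["status"]
    if status == APP_OK:
        return LookupResult(found=True, item=payload.get("item"))
    if status == APP_ITEM_NOT_FOUND:
        return LookupResult(found=False, item=None)
    raise RuntimeError(f"unknown application status: {status}")
```

With this shape, a misconfigured URL that returns HTTP 404 raises an error instead of being read as a cache miss.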

haroldl almost 3 years ago

This was really interesting both in exploring the architecture of a retail system and looking at how systems fail. Better to read about it and learn than to live it.

I'd call it a 4 hour outage because the initial "recovery" was a result of cashiers manually typing in prices for items. Then when load decreased and they discovered that scanning items worked again the problem came right back.

Maybe returning 404 for both a cache miss and a "there's no endpoint at this path" error is an issue too. For other status codes there's a distinction between temporary and permanent failure; e.g. 301 versus 302. It would've been good to use HTTP 400 Bad Request for the misconfigured URL and 404 for a cache miss.

In the 10% of stores with the early roll out of the config change the cache hit rate went to 0 right away, and that started 12 days before the outage. Alerts on cache hit rates and per-store alerts would've caught that.

Then there were 4 days where traffic to the main inventory micro-service in the data center jumped 3x which took it to what appears to be 80% of capacity. Load testing to know your capacity limits and alerts when you near that limit would've called out the danger.

Then during the outage when services slowed down due to too many requests they were taken out of rotation for failing health checks. Applying back pressure/load shedding could have kept those servers in active use so that the system could keep up.
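
The per-store cache hit rate alert suggested above could look roughly like this in Python. This is a sketch under assumptions: the class name `HitRateMonitor`, the 50% threshold, and the 100-request minimum are all made up for illustration, and a real deployment would more likely express this as a rule in its existing metrics system.

```python
from collections import defaultdict


class HitRateMonitor:
    """Tracks cache hits per store and flags stores whose hit rate is too low."""

    def __init__(self, min_hit_rate: float = 0.5, min_requests: int = 100):
        # Both thresholds are illustrative assumptions, not values from the article.
        self.min_hit_rate = min_hit_rate
        self.min_requests = min_requests
        self._hits = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, store_id: str, hit: bool) -> None:
        self._total[store_id] += 1
        if hit:
            self._hits[store_id] += 1

    def hit_rate(self, store_id: str) -> float:
        total = self._total[store_id]
        return self._hits[store_id] / total if total else 0.0

    def stores_below_threshold(self) -> list:
        # Skip stores with too few requests so a quiet store does not page anyone.
        return sorted(
            store
            for store, total in self._total.items()
            if total >= self.min_requests
            and self.hit_rate(store) < self.min_hit_rate
        )
```

A store whose hit rate drops to 0, as described for the early-rollout stores, would show up in `stores_below_threshold()` once it has served enough requests.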

jkaptur almost 3 years ago

It's very interesting that by building a system that's more resilient and reliable:

> high profile processes (such as POS) implement their own fallback processes to handle the possibility of issues with the SDM system in store. In the case of item data, the POS software on each register is capable of bypassing the SDM Proxy and retrying its request directly to the ILS API in the data centers.

... the system as a whole became much more complex and difficult to observe. The system was running in a degraded, abnormal, less-tested, fallback mode for days without anyone caring.

This is also a point about the normalization of deviance. When there is a background rate of the POS using the fallback path, who is to say how important an increase in that rate might be?
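
One way to make the fallback path observable is to count how often it is taken and compare that against a known background rate. The Python sketch below is illustrative only: `FallbackTracker`, the `LookupError` convention for a failed primary lookup, and the 3x alert factor are assumptions, not how the POS software actually works.

```python
class FallbackTracker:
    """Wraps a primary and a fallback lookup and counts fallback usage."""

    def __init__(self, primary, fallback):
        self._primary = primary    # e.g. the in-store proxy (hypothetical callable)
        self._fallback = fallback  # e.g. the data center API (hypothetical callable)
        self.total = 0
        self.fallbacks = 0

    def fetch(self, key):
        self.total += 1
        try:
            return self._primary(key)
        except LookupError:
            # The fallback keeps the register working, but the event is counted
            # so that a rising rate is visible instead of silently absorbed.
            self.fallbacks += 1
            return self._fallback(key)

    def fallback_rate(self) -> float:
        return self.fallbacks / self.total if self.total else 0.0

    def should_alert(self, baseline_rate: float, factor: float = 3.0) -> bool:
        # Alert when usage is well above the normal background rate.
        return self.fallback_rate() > baseline_rate * factor
```

Comparing against a baseline rather than zero is one answer to the question of how much an increase matters when some fallback traffic is normal.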

joedissmeyer almost 3 years ago

Glad to see that a big US retailer like Target is using the same types of "de-facto" observability tools that I've been using for a while at all of my various employers over the last 5+ years - which are Grafana, Prometheus and Elastic Stack (specifically the Kibana UI for the logging analysis screenshot).

EricE almost 3 years ago

"It’s not enough to implement redundant systems and failovers, we must monitor and alert when those systems are being exercised."

My air conditioner in my house has a secondary drain pan under it. The outlet for that drain pan is right above a main window outside. If the primary condensate drain gets plugged/fails and the water overflows into the backup pan there would be a stream of water in front of a window that shouldn't otherwise be there. They want you to be able to readily notice it as you are now at risk for significant water damage if that secondary drain manages to plug up too.

Always something worth considering when designing any system - how to make it fail in a way that is noticeable!

at_a_remove almost 3 years ago

I don't work at that level, or even want to, but I did detect a dark pattern that I often complain about, but have never managed to get people to pay attention to: do not collect data unless you have attached to it a decision with two or more distinct outcomes based on that data.

londons_explore almost 3 years ago

I'd like to see staff training for major outages in retail like this.

For example, if the shop loses power, do they have the ability to sell goods still?

One approach is to let staff members estimate the value of goods - for example at the register, the staff member looks at the cart contents, estimates that it's about $120 worth of goods, charges the customer $120, and hand writes a receipt saying "$120 of goods sold, Date, store name, signature". The staff member then uses a phone to photograph the cart and the receipt.

At the end of the shift, the staff member drops all the photos into a big store wide Dropbox account, that the accounts department can use to pay taxes.

You'd probably want to practice this process ahead of time with every staff member.

I imagine it might actually be a good process to use on very busy days too - it is probably quicker than scanning every item at the register.

prithvi24 almost 3 years ago

Target has 250k SKUs total - why is their inventory system so complicated? Why the hybrid on-prem store + data center cloud model - isn't it easier if there is one source of truth? Seems like it would reduce the need for even dealing with all this eventually consistent cache syncing and whatnot.

I ofc don't know what I don't know, but super curious if anyone has insight into why such a complex system is required.

Also, if this microservice is used for brick and mortar, can't imagine more than a couple hundred per second? (2000 stores, 5 registers a store, and humans manually scanning items.) Why did that overload the microservice (guessing it wasn't an endless exponential backoff)?
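
The back-of-envelope estimate above can be written out. The store and register counts are the commenter's own figures; the scan intervals are pure assumptions chosen to show how sensitive the result is, since the thread gives no real scan rate.

```python
def estimated_requests_per_second(
    stores: int, registers_per_store: int, seconds_between_scans: float
) -> float:
    """Average item lookups per second if every register scans at a steady interval."""
    registers = stores * registers_per_store
    return registers / seconds_between_scans


# 2000 stores x 5 registers = 10,000 registers (figures from the comment).
# Assumed: one scan per register every 50 seconds, averaged over idle time.
print(estimated_requests_per_second(2000, 5, 50.0))  # 200.0

# Assumed: one scan per register every 5 seconds during a busy period.
print(estimated_requests_per_second(2000, 5, 5.0))  # 2000.0
```

So "a couple hundred per second" corresponds to roughly one scan per register every 50 seconds; a busier assumed interval moves the figure by an order of magnitude, before counting any retries.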

ncmncm almost 3 years ago

Wasn't Target where hackers were running loose in their POS ("point of sale", not the other meaning) system for months or years? Was that before or after this incident?

aftbit almost 3 years ago

Why don't the ILS services have their own cache in front of them? Supporting a per-store cache already requires good discipline on timeouts and invalidation, so adding an additional caching layer in the datacenter between the inbound requests and ILS itself seems like it would provide for a cheap extra layer of scalability in case the per-store caches become unavailable.
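
A data center cache layer of the kind proposed here could be as simple as a read-through cache with a time-to-live. The Python sketch below is an assumption-laden illustration: `TTLCache`, the `loader` callable standing in for a call to ILS, and the 60-second TTL are all hypothetical, and it ignores invalidation, size limits, and concurrency.

```python
import time


class TTLCache:
    """Read-through cache: serves stored values until they expire, then reloads."""

    def __init__(self, loader, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._loader = loader      # hypothetical stand-in for a call to ILS
        self._ttl = ttl_seconds    # illustrative value, not from the article
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        # Miss or expired entry: go to the backing service and store the result.
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

Repeated lookups for the same item within the TTL hit the backing service only once, which is the extra headroom being suggested for when the per-store caches are bypassed.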

InCityDreams almost 3 years ago

I presume "guest" means "customer"?

sydthrowaway almost 3 years ago

Surprised they’re still not on IBM Mainframes