Knightmare: A DevOps Cautionary Tale (2014)

460 points by sathishmanohar over 1 year ago

51 comments

I'm not sure how automated deployments would have solved this problem. In fact, if anything, they would have magnified the impact and fallout of the problem.

Substitute "a developer forgot to upload the code to one of the servers" for "the deployment agent errored while downloading the new binary/code onto the server and a bug in the agent prevented the error from being surfaced." Now you have the same failure mode, and the impact happens even faster.

The blame here lies squarely with the developers--the code was written in a non-backwards-compatible way.

lolinder over 1 year ago

> why code that had been dead for 8-years was still present in the code base is a mystery, but that’s not the point

This seems to be exactly the point! For 8 years they left unused code in place, seemingly only bothering to remove it because they wanted to repurpose a flag. If they'd done the right thing 8 years prior and removed code they weren't using, this story plays out very differently. No ancient routines get resurrected, no rogue server.

Maybe Knight Capital wasn't using version control and held onto this code "just in case", but I've seen this same resistance to deleting code in programmers working in repos that are completely under VCS, and it's flabbergasting. If you need it again, you can always bring it back from version control. If you need it again but forget it's there, you'd do the same with the dead code path. Leaving it in the source tree is pure liability.

EDIT: Kevlin Henney gave an excellent talk at GOTO about software reliability and he touches on this, using Knight Capital as the example; he actually cites this very blog post [0]. The whole talk is excellent, but I've linked the three minutes where he talks about Knight Capital.

> The problem is there is no code that is truly dead. It turns out all you need to do is make a small assumption, a change of an assumption and then suddenly it's no longer dead, it's zombie code. It has come back to life and the zombie apocalypse costs money.

[0] https://youtu.be/IiGXq3yY70o?si=hZ9HB2dlfj0vHvNK&t=463

hedora over 1 year ago

No continuous deployment system I have worked with would have blocked this particular bug.

They were in a situation where they were incrementally rolling out, but the code had a logic bug where the failure of one install within an incremental rollout step bankrupted the company.

I’d guard against this with runtime checks that the software version (e.g. git sha) matches, and also add fault injection into tests that invoke the software rollout infrastructure.
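
The runtime version check suggested above could look roughly like the following. This is only a minimal sketch, assuming every host can report the git SHA of the build it is running; the `fleet` mapping, the host names, and the SHAs are hypothetical and do not come from the incident.

```python
def find_version_mismatches(reported: dict[str, str], expected_sha: str) -> list[str]:
    """Return the hosts whose running build differs from the expected one."""
    return sorted(host for host, sha in reported.items() if sha != expected_sha)


def safe_to_enable(reported: dict[str, str], expected_sha: str) -> bool:
    """Only allow the new feature on when every host runs the expected build."""
    return not find_version_mismatches(reported, expected_sha)


# Hypothetical fleet: one host silently missed the rollout.
fleet = {
    "host-1": "a1b2c3d",
    "host-2": "a1b2c3d",
    "host-3": "a1b2c3d",
    "host-4": "9f8e7d6",  # stale build
}

stale = find_version_mismatches(fleet, expected_sha="a1b2c3d")
if stale:
    print(f"refusing to enable feature, stale hosts: {stale}")
```

The point of the design is that the check runs before the new behaviour is switched on, so a partially failed rollout blocks activation instead of going unnoticed.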

TheAlchemist over 1 year ago

Wild west times! It's worth noting that things have changed a lot in trading systems since then.

When I started working in this domain (2009), it was pretty crazy how unreliable those systems were, on all sides - banks, brokers, exchanges. Frequently you needed to confirm over the phone what quantities got executed etc.

I remember when the Italian exchange was rolling out their systems, at some point we did "tests" on a mix of production and UAT - if my memory is correct, we were just changing the IPs to connect to for order passing, to test for the upcoming release, after the market closed. We couldn't just test in their UAT environment, since it was so bugged and half down most of the time.

And let's not even talk about Excel spreadsheets with some VBA code that would make ChatGPT swear, that were pricing instruments with volumes traded with a lot of zeros.

It's very different nowadays, in part thanks to stories like this one. Most things are automated, and there is much less of a cowboy attitude.

There are mandatory kill switches, a lot of layers of risk / trading activity monitoring (on your side, on the exchange side), and really a lot of hard-learned lessons incorporated into the systems. That's also part of the reason why people sometimes tend to be naive about how hard it is to build a good trading system - the strategies are sometimes now really smart - it's mostly about how to avoid getting killed by something that's outside of usual conditions.

valdiorn over 1 year ago

Literally everyone in quant finance knows about Knight Capital. It even has its own phrase: "pulling a Knight Capital" (meaning: cutting corners on mission critical systems, even ones that can bankrupt the company in an instant, and experiencing the consequences).

pavas over 1 year ago

My team's systems play a critical role for several $100M of sales per day, such that if our systems go down for long enough, these sales will be lost. Long enough means at least several hours and in this time frame we can get things back to a good state, often without much external impact.

We too have manual processes in place, but for any manual process we document the rollback steps (before starting) and monitor the deployment. We also separate deployment of code from deployment of features (which is done gradually behind feature flags). We insist that any new feature (or modification of code) requires a new feature flag; while this is painful and slow, it has helped us avoid risky situations and panic and alleviated our ops and on-call burden considerably.

For something to go horribly wrong, it would have to fail many "filters" of defects:

1. Code review: accidentally introducing a behavioral change without a feature flag (this can happen, e.g. updating dependencies).
2. Manual and devo testing (which is hit or miss).
3. Something in our deployment fails (luckily this is mostly automated, though as with all distributed systems there are edge cases).
4. Rollback fails or is done incorrectly.
5. Missing monitoring to alert us that the issue still hasn't been fixed.
6. Failure to escalate the issue in time to higher levels.
7. Enough time passes that we miss out on our ability to meet our SLA, etc.

For any riskier manual changes we can also require two people to make the change (one points out what's being changed over a video call, the other verifies).

If you're dealing with a system where your SLA is in minutes, and changes are irreversible, you need to know how to practically monitor and roll back within minutes, and if you're doing something new and manually, you need to quadruple check everything and have someone else watching you make the change, or it's only a matter of time before enough things go wrong in a row and you can't fix it.

It doesn't matter how good or smart you are, mistakes will always happen when people have to manually make or initiate a change, and that chance of making mistakes needs to be built into your change management process.

dang over 1 year ago

Related:

Knightmare: A DevOps Cautionary Tale (2014) - https://news.ycombinator.com/item?id=22250847 - Feb 2020 (33 comments)

Knightmare: A DevOps Cautionary Tale (2014) - https://news.ycombinator.com/item?id=8994701 - Feb 2015 (85 comments)

Knightmare: A DevOps Cautionary Tale - https://news.ycombinator.com/item?id=7652036 - April 2014 (60 comments)

Also:

The $440M software error at Knight Capital (2019) - https://news.ycombinator.com/item?id=31239033 - May 2022 (172 comments)

Bugs in trading software cost Knight Capital $440M - https://news.ycombinator.com/item?id=4329495 - Aug 2012 (1 comment)

Knight Capital Says Trading Glitch Cost It $440 Million - https://news.ycombinator.com/item?id=4329101 - Aug 2012 (90 comments)

Others?

foota over 1 year ago

The real issue here (sorry for true Scotsman-ing) is that they were using an untested combination of configuration and binary release. Configuration and binaries can be rolled out in lockstep, preventing this class of issues.

Of course there were other mistakes here etc., but the issue wouldn't have been possible if this weren't the case.
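
One way to read "lockstep" is that each configuration file records the build it was tested with, and the binary refuses to load a config made for a different build. The sketch below assumes that reading; `Config`, `load_flags`, `ConfigMismatch`, and the build identifiers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


class ConfigMismatch(RuntimeError):
    """Raised when a config was validated against a different build."""


@dataclass(frozen=True)
class Config:
    built_for: str  # build identifier this config was tested with
    flags: dict[str, bool] = field(default_factory=dict)


def load_flags(config: Config, running_build: str) -> dict[str, bool]:
    """Return the flags only if the config matches the running binary."""
    if config.built_for != running_build:
        raise ConfigMismatch(
            f"config targets {config.built_for!r}, binary is {running_build!r}"
        )
    return dict(config.flags)


new_config = Config(built_for="build-42", flags={"new_routing": True})

# The updated binary accepts the config it was tested with.
flags = load_flags(new_config, running_build="build-42")

# A server still running the old binary fails loudly instead of
# interpreting the new flag with old code.
try:
    load_flags(new_config, running_build="build-41")
except ConfigMismatch as exc:
    print(f"startup aborted: {exc}")
```

With this kind of guard, an untested pairing of old code and new configuration becomes a startup failure on one machine instead of live behaviour.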

dkarl over 1 year ago

> why code that had been dead for 8-years was still present in the code base is a mystery, but that’s not the point

It's not the worst mistake in the story, but it's not "not the point." A proactive approach to pruning dead functionality would have resulted in a less complex, better-understood piece of software with less potential to go haywire. Driving relentlessly forward without doing this kind of maintenance work is a risk, calculated or otherwise.

daft_pink over 1 year ago

I'm so glad I don't write code that automatically routes millions of dollars with no human intervention.

It's like writing code that flies a jumbo jet.

Who wants that kind of responsibility?

skizm over 1 year ago

I feel like the first thing I would build into any automated trading system is a kill switch. Then every single diff or pull request I add would have some sort of automated testing to ensure the kill switch still works. Also I'd manually flip it on/off once a day to make sure it works for real. That seems like the single most important thing to build and make sure works. Or is the system too complex for something like this and I don't understand the domain well enough?
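
The kill switch plus automated test described above can be sketched as follows. This is a toy model under stated assumptions, not how any real trading system does it: `KillSwitch` and `OrderRouter` are hypothetical classes, and orders are plain strings recorded in a list instead of being sent to a market.

```python
import threading


class KillSwitch:
    """Process-wide flag that, once tripped, blocks all outgoing orders."""

    def __init__(self) -> None:
        self._tripped = threading.Event()

    def trip(self) -> None:
        self._tripped.set()

    def reset(self) -> None:
        self._tripped.clear()

    @property
    def tripped(self) -> bool:
        return self._tripped.is_set()


class OrderRouter:
    """Toy router: records orders instead of sending them to a market."""

    def __init__(self, kill_switch: KillSwitch) -> None:
        self.kill_switch = kill_switch
        self.sent: list[str] = []

    def send(self, order: str) -> bool:
        # Check the switch on every order, not once at startup.
        if self.kill_switch.tripped:
            return False
        self.sent.append(order)
        return True


def test_kill_switch_blocks_orders() -> None:
    switch = KillSwitch()
    router = OrderRouter(switch)
    assert router.send("BUY 100 XYZ")
    switch.trip()
    assert not router.send("BUY 100 XYZ")
    assert router.sent == ["BUY 100 XYZ"]
    switch.reset()
    assert router.send("SELL 100 XYZ")


test_kill_switch_blocks_orders()
```

A test like this could run on every pull request, which covers the automated half of the idea; the daily manual flip would still have to be exercised against the real system.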

quickthrower2 over 1 year ago

Hold on. Are we blaming the plane crash on the pilot here? It seems there is so much other stuff wrong with this company first that such a deployment would tank it.

No kill switch. Literally needs to be a power switch and a trader who runs to the room and flips it. Ridiculously small amount of cash for the trading volume, and no way to borrow more to stay in business (but with that borrowing requiring manual intervention not accessible to the trading system). Obviously the decision to leave that code in there, and for there to be a config setting to bring it back.

Then the devops stuff - rollback plans, approvals, pairing on deployments, etc.

hyperhopper over 1 year ago

Yes, the deployment practices were bad, but they still would have had an issue even with proper practices.

The real issue was re-using an old flag. That should have never been thought of or approved.

jammycakes over 1 year ago

This incident highlights a problem that is often overlooked in the debate about feature branches versus feature toggles.

I've worked with both feature branches and feature toggles, and while long-lived feature branches can be painful to work with, what with all the conflicts, they do have the advantage that problems tend to be uncovered and resolved in development before they hit production.

When feature toggles go wrong, on the other hand, they go wrong in production -- sometimes, as was the case here, with catastrophic results. I've always been nervous about the fact that feature toggles and trunk-based development mean merging code into main that you know for a fact to be buggy, immature, insufficiently tested and in some cases knowingly broken. If the feature toggles themselves are buggy and don't cleanly separate out your production code from your development code, you're asking for trouble.

This particular case had an additional problem: they were repurposing an existing feature toggle for something else. That's just asking for trouble.

xyst over 1 year ago

Having worked in some Fortune 500 financial firms and low rent “fintech” upstarts, I am not surprised this happened. Decades of bandaid fixes, years of rotating out different consultants/contractors, and software rot. Plus years of emphasizing mid level management over software quality.

As others have mentioned, I don’t think “automation of deployment” would have prevented this company’s inevitable downfall. If it wasn’t this one incident in 2012, then it would have been another incident later on.

gumballindie over 1 year ago

> Had Knight implemented an automated deployment system – complete with configuration, deployment and test automation – the error that cause the Knightmare would have been avoided.

Would it have been avoided though? Configuration, deployment and test automation mean nothing if they don't do what they are supposed to do. Regardless of how many tests you have, if you don't test for the right stuff it's all useless.

alexpotato over 1 year ago

Couple fun facts/stories:

1. I signed my offer letter to work at Knight 5 days before this happened (and I still went to work there).

You can read more about that here: https://twitter.com/alexpotato/status/1501174282969305093

2. As I mentioned above, I went to work at Knight as a DevOps on a team that dealt directly with the team mentioned in the blog post.

There are lots of stories around this but I will share this one:

Late 2012 is when Apple rolled out the "emergency weather notification" function. I was in the office and the notification went off on multiple people's phones. Knight was also experimenting with call notifications.

So when the alert goes off, someone yells "God damn it! Not again!!" (thinking there was another big outage).

3. People outside of finance have no idea of the different types of outage that can happen due to all sorts of factors.

I have a LOT of stories here: https://twitter.com/alexpotato/status/1215876962809339904

4. In finance in general, the amount of legacy code that behaves in weird ways or was written by someone 10 years ago who is no longer with the firm is ASTOUNDING.

Coupled with the billions of combinations of regulations, internal controls, multiple countries and jurisdictions etc., this makes accounting for every single edge case impossible.

To use an infosec term, the "attack surface" of possible user actions that could lead to bugs is enormous.

Typical case:

- User says they want to see reports for a couple days' worth of trading for all securities

- User also says they want to see FULL history for one security

- User never says they might want to see FULL history for ALL securities at the same time

- This being HN, someone will say "you should have thought of that"

- Sure, but then they pull only some of the history for a Ukrainian bond that has a 182 (not 180 like most) day bond. This is the only example of this type of bond. Ever. Did you think of that? What should the system have done?

- And oh, btw, this system was pushed out quickly due to regulatory pressure etc.

brundolf over 1 year ago

Much as I enjoy articles that reinforce my existing beliefs, high-frequency trading is a pretty extreme example when it comes to how badly things can go in a short time.

dilyevsky over 1 year ago

Their issue was neglecting an automated SCRAM system that would halt all the trading or any alerting with manual intervention. The article touches on that. There was no excuse why the system wasn’t halted by 9:32 which would’ve avoided most of the kerfuffle

stevage over 1 year ago

> They had 48-hours to raise the capital necessary to cover their losses (which they managed to do with a $400 million investment from around a half-dozen investors).

I'm very curious about this bit. How exactly do you raise $400m of "investment" to cover such a massive footgun, in 48 hours, when you haven't even had time to understand what happened or whether it would happen again?

Why are people stumping up hundreds of millions of cash here?

supportengineer over 1 year ago

They were missing any kind of risk mitigation steps in their deployment practice.

codegeek over 1 year ago

I refuse to believe that a failed deployment can bring a company down. That is just a symptom. The root cause has to be a whole big collection of decisions and processes/systems built over years.

civilized over 1 year ago

I see a lot of criticism of the deployment, but why did the developers "repurpose an old flag" that activates 8 years dead code that you haven't deleted and that has completely unknown current functionality? That seems like the strangest decision made in this debacle.

nickdothutton over 1 year ago

“The code that that was updated repurposed an old flag…” was as far as I needed to read. Never do this.

zsoltkacsandi over 1 year ago

This has nothing to do with “DevOps”, and I am getting tired of this word. This mistake could have been prevented on multiple levels, and in my experience, deployments that involve major architectural changes are rarely repeatable or fully automatable.

motoboi over 1 year ago

Changes we make to software and hardware infrastructure are essentially hypotheses. They're backed by evidence suggesting that these modifications will achieve our intended objectives.

What's crucial is to assess how accurately your hypothesis reflects the reality once it's been implemented. Above all, it's important to establish an instance that would definitively disprove your hypothesis - an event that wouldn't occur if your hypothesis holds true.

Harnessing this viewpoint can help you sidestep a multitude of issues.

taspeotis over 1 year ago

Needs (2014) in the title.

gumby over 1 year ago

> (why code that had been dead for 8-years was still present in the code base is a mystery, but that’s not the point).

Actually it's a big part of the point: they have a system that works with dead code in it. If you remove that dead code perhaps it unwittingly breaks something else.

That kind of Chesterton's fence is a good practice.

0xFEE1DEAD over 1 year ago

I don't exactly understand what this has to do with continuous delivery, but maybe I just don't know enough about this topic.

Wouldn't it have been best to set up a 'shadow infrastructure' and route every trade into it for several weeks/months to verify the correctness of the system?

danielvaughn over 1 year ago

I worked in fintech for a few years. I'll never again work on software that's responsible for trading, you could offer $1M/year and I wouldn't take it. By far the most stress I've ever experienced at a job.

siliconc0w over 1 year ago

While nice, automated deployment is the wrong lesson here; the real failures were not anticipating backwards incompatibility, plus poor alerting and incident training.

Flags should never be reused and should be retired after they're no longer useful.
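
The "never reuse, always retire" rule can be enforced mechanically. The following is a minimal sketch, assuming flags are declared through a central registry; `FlagRegistry` and the flag names are hypothetical and only illustrate the rule.

```python
class FlagRegistry:
    """Tracks feature flag names so a retired name can never be reused."""

    def __init__(self) -> None:
        self._active: dict[str, str] = {}
        self._retired: set[str] = set()

    def register(self, name: str, description: str) -> None:
        if name in self._retired:
            raise ValueError(f"flag {name!r} was retired and must not be reused")
        if name in self._active:
            raise ValueError(f"flag {name!r} is already registered")
        self._active[name] = description

    def retire(self, name: str) -> None:
        # Raises KeyError if the flag was never active.
        del self._active[name]
        self._retired.add(name)


registry = FlagRegistry()
registry.register("power_peg", "legacy order type")
registry.retire("power_peg")

# Attempting to give the old name a new meaning is rejected.
try:
    registry.register("power_peg", "new routing feature")
except ValueError as exc:
    print(exc)

# A fresh name for the new behaviour is accepted.
registry.register("new_routing", "new routing feature")
```

Keeping the retired names around forever costs almost nothing, and it turns flag reuse from a design-review question into an error at registration time.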

belter over 1 year ago

Knightmare: A DevOps Cautionary Tale (2014) - https://news.ycombinator.com/item?id=8994701

Knightmare: A DevOps Cautionary Tale (2014) - https://news.ycombinator.com/item?id=22250847

Knightmare: A DevOps Cautionary Tale - https://news.ycombinator.com/item?id=7652036

KnuthIsGod over 1 year ago

Not so simple. The company was then used as a building block to create another entity, which was then acquired for over a billion dollars.

"The company agreed to be acquired by Getco LLC in December 2012 after an August 2012 trading error lost $460 million. The merger was completed in July 2013, forming KCG Holdings....On April 20, 2017, KCG announced that it had agreed to be acquired by Virtu Financial for $20 per share in cash in a deal valued at approximately $1.4 billion."

_boffin_ over 1 year ago

If you want to see what it looked like at the tick scale, take a look here: http://www.nanex.net/aqck2/3522.html

PS: Anyone know of any other sites / places that do a comparable level of research that's open to the public?

praptak over 1 year ago

Focusing on deployments is too narrow. Deployment can be automatic but still have a botched config.

In this context it's more useful to think in terms of production principles. The principle that was poorly followed was defence in depth. There was no line of defence after the deployment.

roughly over 1 year ago

This is the Ur “devops fuckup” tale - I’ve told this to junior engineers who’ve bodged a deploy to make them feel better. I’ve been in this field for 20 years, and I can’t imagine I’ll ever have a day as bad as the engineers who got bit by this fuckup.

hinkley over 1 year ago

> $400M in assets to bankrupt

Was this Knight Capital?

> Knight Capital Group

Yep. Practically the canonical case study in deployment errors.

cratermoon over 1 year ago

Not removing old code is akin to never throwing away food, even after it reaches its expiration date. Sure, you'll have it around next time you need it, but putting year-old yeast into your baguettes is, well, a recipe for disaster.

thorum over 1 year ago

Honestly seems like the market itself should have safeguards against this kind of thing.

firesteelrain over 1 year ago

Automation is not a silver bullet. Automation is still designed by humans. Peer reviews, acceptance test procedures, promotion procedures, sandbox environments, etc. all would have helped. And yes, some of those things are manual.

markus_zhang over 1 year ago

Sometimes I wonder whether these events are more sinister than they appear to be. But then I heard that another MM is using Access applications to make markets for options and I think it's just incompetence.

codeulike over 1 year ago

When I got to the memorable words "Power Peg" I remembered I'd heard all about this before.

mariusmg over 1 year ago

First problem was repurposing flags used in the past for different functionality. Just don't do that.

piyh over 1 year ago

This can be summarized as "terminally inadequate technical controls"

reedf1 over 1 year ago

Gotta imagine the sinking feeling that guy felt.

cdchn over 1 year ago

Good example of a blameless post mortem.

m3kw9 over 1 year ago

Someone missed a blind spot

jokoon over 1 year ago

Oh no

Anyways

tomp over 1 year ago

Ah, Knight Capital. The warning story for every quant trader / engineer.

This is what people don't realize when they say HFT (high frequency trading) is risk-free, leeching off people, etc.

You make a million every day with very little volatility (the traditional way of quantifying "risk" in finance) but one little mistake, and you're gone. The technical term is "picking up pennies in front of a steamroller (train)". Selling options is also like that.

40yearoldman over 1 year ago

lol. No. Deployments were not the issue. At any given time an automated deployment system could have had a mistake introduced that resulted in bad code being sent to the system. It does not matter if it was old or new code. Any code could have had this bug.

The issue was one that I see often. Firstly, no visibility into the system. Not even a dashboard showing the software's running version. How often I see people ship software without a banner posting its version and/or an endpoint that simply reports the version.

Secondly, no god damn kill switch. You are working with money!! Shutting down has to be an option.

rvz over 1 year ago

But ChatGPT would have fixed the issue faster in 45 mins than a human would. /s

A high-risk situation like this makes using LLMs a non-option, before someone puts out a 'use-case' for an LLM to fix this issue.

I'm sorry to preempt the thought of this in advance, but it would not.
