I once shut down an algorithmic trading server by hastily typing (in bash):<p>- Ctrl-r to reverse-search through history<p>- 'ps' to find the process status utility (of course)<p>- Enter... and then realizing that Ctrl-r had actually matched 'stopserver.sh' in history instead. (There's a 'ps' inside stoPServer.sh.)<p>I got a call from the head Sales Trader within 5 seconds asking why his GUI showed that all orders were paused. Luckily, our recovery code was robust and I could restart the server and resume trading in half a minute or so.<p>That's $250 million to $400 million of orders on pause for half a minute. Not to mention my heartbeat.<p>Renamed stopserver.sh to stop_server.sh after that incident :|<p>P.S. Typing speed is not merely overrated, but dangerous in some contexts. Haste makes waste.
I can't read something like this without feeling really bad for everyone involved and taking a quick mental inventory of things I've screwed up in the past or potentially might in the future. Pressing the enter key on anything that affects a big-dollar production system is (and should be) slightly terrifying.
Every time I read this story, there is one question I've never understood: why couldn't they just shut down the servers themselves? There ought to be some mechanism to do that. I mean, $400 million is a lot of money to not just bash the server with a hammer. It seems like they realized the issue early on and were debugging for at least part of the 45 minutes. I know they might not have had physical access to the servers, but wouldn't there be any way to do a hard reboot?
While articles like this are very interesting for explaining the technical side of things, I am always left wondering about the organizational/managerial side. Had anyone at Knight Capital Group argued for an automated and verifiable deployment process? If so, why were their concerns ignored? Was it seen as a worthless expenditure of resources? Given how common automated deployment is, I think it would be unlikely that none of the engineers involved ever recommended moving to a more automated system.<p>I encountered something like this about a year ago at work. We were deploying an extremely large new system to replace a legacy one. The portion of the system I work on required a great deal of DBA involvement for deployment. We, of course, practiced the deployment. We ran it more than 20 times against multiple different non-production environments. Not once in any of those attempts was the DBA portion of the deployment completed without error. There were around 130 steps involved and some of them would always get skipped. We also had the issue that the production environment contained some significant differences from the non-production environments (over the past decade we had, for example, delivered software fixes/enhancements which required database columns to be dropped... this was done on the non-production systems, but not on the production environment, because dropping the columns would take a great deal of time). Others and I tried to raise concerns about this, but in the end we were left to simply expect to do cleanup after problems were encountered. Luckily we were able to do the cleanup, and the errors (of which there were a few) were fixed in a timely manner. We also benefited from other portions of the system having more severe issues, giving us some cover while we fixed up the new system. The result, however, could have been very bad. And since it wasn't, management is growing increasingly enamored with the idea of by-the-seat-of-your-pants development, hotfixes, etc. When it eventually bites us, as I expect it will, I fear that no one will realize it was these practices that put us in danger.
The post is quite poor and suffers a lot from hindsight bias.
The following article is much better:
<a href="http://www.kitchensoap.com/2013/10/29/counterfactuals-knight-capital/" rel="nofollow">http://www.kitchensoap.com/2013/10/29/counterfactuals-knight...</a>
Repurposing a flag should be spread over two deployments. First remove the code using the old flag and verify that removal everywhere; only then introduce code reusing the flag.<p>Even if the deployment had been done correctly, <i>during</i> the deployment there would have been both old and new code in the system interpreting the same flag differently.
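To make the two-step idea concrete, here is a minimal Python sketch of how a flag repurpose could be split across deployments. Every name in it (REPURPOSED_FLAG, the route_order_* functions, the router stubs) is hypothetical, not Knight's actual code.

```python
# Hypothetical sketch: repurposing a flag across two separate deployments.

def send_to_smart_router(order):
    return ("SMART", order)

def send_to_new_path(order):
    return ("NEW_PATH", order)

# Deployment 1: the old flag-guarded branch is deleted; the flag becomes a no-op.
def route_order_step1(order, flags):
    # The flag is deliberately ignored here -- no server can act on its old meaning.
    return send_to_smart_router(order)

# Deployment 2: rolled out only after step 1 is verified live on *every* server.
# Now the flag is given its new meaning.
def route_order_step2(order, flags):
    if flags.get("REPURPOSED_FLAG", False):
        return send_to_new_path(order)
    return send_to_smart_router(order)

if __name__ == "__main__":
    print(route_order_step1({"id": 1}, {"REPURPOSED_FLAG": True}))  # ('SMART', ...)
    print(route_order_step2({"id": 1}, {"REPURPOSED_FLAG": True}))  # ('NEW_PATH', ...)
```

The point is that step 2 ships only after step 1 has been confirmed on every server, so at no moment can any server still interpret the flag with its old meaning.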
I used to work in HFT, and what I don't understand is why there were no risk controls. The way we did it was to have explicit shutdown/pause rules (pause meaning that the strategy would only try to get flat).<p>The rules were things like:
- Too many trades in one direction (i.e., a big position)
- P/L down by X over Y
- P/L up by X over Y
- Orders way off the current price<p>Whenever there was a shutdown/pause, a human trader would need to assess the situation and decide whether to continue. A rough sketch of such checks is below.
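This is illustration only; the thresholds, field names, and the `should_pause` helper are all made up, not anyone's production risk engine.

```python
# Rough sketch of the kind of pause/shutdown rules described above.
from dataclasses import dataclass

@dataclass
class StrategyState:
    position: int          # signed net position
    pnl: float             # running profit/loss
    last_order_price: float
    market_price: float

def should_pause(state, max_position=10_000, max_drawdown=50_000.0,
                 max_gain=50_000.0, max_price_offset=0.05):
    """Return a reason string if the strategy should pause, else None."""
    if abs(state.position) > max_position:
        return "position too large"
    if state.pnl < -max_drawdown:
        return "P/L down too far over the window"
    if state.pnl > max_gain:
        return "P/L up suspiciously fast over the window"
    offset = abs(state.last_order_price - state.market_price) / state.market_price
    if offset > max_price_offset:
        return "orders far from current price"
    return None

# A pause only flattens the position; a human decides whether to resume.
state = StrategyState(position=25_000, pnl=-1_000.0,
                      last_order_price=101.0, market_price=100.0)
print(should_pause(state))  # -> "position too large"
```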
I remember reading a summary of this when it occurred in 2012. It's obvious to everyone here what SHOULD have been done, and I find this pretty surprising in the finance sector.<p>Also, your submission should probably have (2014) in the title.
It's nice to see a more detailed technical explanation of this. I've used the story of Knight Capital as part of the pitch for my own startup, which addresses (among other things) consistency between server configurations.<p>This isn't just a deployment problem. It's a <i>monitoring</i> problem. What mechanism did they have to tell if the servers were out of sync? Manual review is the recommended approach. Seriously? You're going to trust human eyeballs for the thousands of different configuration parameters?<p>Have computers do what computers do well - like comparing complex system configurations to find things that are out of sync. Have humans do what humans do well - deciding what to do when things don't look right.
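A toy example of the "let computers compare configurations" idea, assuming you can already pull each server's configuration into a dict (how you fetch it is out of scope here, and the server names and keys below are invented):

```python
# Flag any configuration key whose value disagrees across servers.
def find_config_drift(configs):
    """configs: {server: {key: value}} -> {key: {server: value}} for mismatched keys."""
    all_keys = set().union(*configs.values())
    drift = {}
    for key in sorted(all_keys):
        values = {server: cfg.get(key, "<missing>") for server, cfg in configs.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift

configs = {
    "server1": {"power_peg_code": "removed", "flag_x": "on"},
    "server8": {"power_peg_code": "present", "flag_x": "on"},  # the odd one out
}
for key, values in find_config_drift(configs).items():
    print(key, values)   # power_peg_code {'server1': 'removed', 'server8': 'present'}
```

Anything flagged goes to a human; the computer only does the tedious comparison.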
Somebody was on the other side of all those trades, and they made a lot of money that day. That's finance. No real value gets destroyed, no physical damage gets done, and somebody on the other side of the poker table gets all the money somebody else lost.
This must be an old wives' tale. I live in Chicago, and a trading firm on the floor beneath us went bankrupt, at roughly the same time, with a similar "repurposed bit" story.<p>Maybe it's the same one...
Ah yes, this story is legendary. I discuss it in my JavaScript Application Design book[1]. Chaos-monkey-style server wrecking sounds like a reasonable way to mitigate this kind of issue (along with sane development/deployment processes, obviously).<p>[1]: <a href="http://bevacqua.io/bf" rel="nofollow">http://bevacqua.io/bf</a>
What really looks broken to me in this story is the financial system. It has become a completely artificial and lunatic system that has almost nothing to do with the real - goods and services producing - economy.
As usual in catastrophic failures, a series of bad decisions had to line up:<p>- They had dead code in the system<p>- They repurposed a flag that had been used for previous functionality<p>- They (apparently) didn't have code reviews<p>- They didn't have a staging environment<p>- They didn't have a tested deployment process<p>- They didn't have a contingency plan to revert the deploy<p>The damage could have been minimized or avoided altogether by fixing just one of these points. Incredible.
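As a rough illustration of the last three points, even a crude automated gate like the sketch below (all server names, version strings, and helpers are hypothetical) would have caught the out-of-date eighth server before any flag was enabled, and gives an obvious abort-and-revert point.

```python
# Hypothetical pre-flight check: refuse to proceed unless every server
# reports the expected deployed version.

EXPECTED_VERSION = "2012-08-01-rls"

# Stand-in data: in reality each server would be queried directly.
VERSIONS = {f"server{i}": EXPECTED_VERSION for i in range(1, 9)}
VERSIONS["server8"] = "2003-power-peg"   # the server the deploy never reached

def deployed_version(server):
    return VERSIONS[server]

def verify_deployment(servers, expected):
    """Return the list of servers that are not running the expected version."""
    return [s for s in servers if deployed_version(s) != expected]

stragglers = verify_deployment(VERSIONS, EXPECTED_VERSION)
if stragglers:
    print("ABORT: do not enable the flag; roll back. Out-of-date:", stragglers)
else:
    print("All servers match; safe to proceed.")
```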