I once shut down an algorithmic trading server by hastily typing (in bash):<p>- Ctrl-r to reverse-search through history<p>- 'ps' to find the process status utility (of course)<p>- Enter... and then realizing that Ctrl-r had actually matched 'stopserver.sh' in history instead. (There's a 'ps' inside stoPServer.sh.)<p>I got a call from the head Sales Trader within 5 seconds asking why his GUI showed that all orders were paused. Luckily, our recovery code was robust and I could restart the server and resume trading in half a minute or so.<p>That's $250 million to $400 million of orders on pause for half a minute. Not to mention my heartbeat.<p>Renamed stopserver.sh to stop_server.sh after that incident :|<p>P.S. Typing speed is not merely overrated, but dangerous in some contexts. Haste makes waste.
I can't read something like this without feeling really bad for everyone involved and taking a quick mental inventory of things I've screwed up in the past or potentially might in the future. Pressing the enter key on anything that affects a big-dollar production system is (and should be) slightly terrifying.
Every time I read this story, there is one question I've never understood: why couldn't they just shut down the servers themselves? There ought to be some mechanism to do that. I mean, $400 million is a lot of money to not just bash the server with a hammer. It seems like they realized the issue early on and were debugging for at least part of the 45 minutes. I know they might not have had physical access to the servers, but wouldn't there be any way to do a hard reboot?
While articles like this are very interesting for explaining the technical side of things, I am always left wondering about the organizational/managerial side. Had anyone at Knight Capital Group argued for an automated and verifiable deployment process? If so, why were their concerns ignored? Was it seen as a worthless expenditure of resources? Given how common automated deployment is, I think it would be unlikely that none of the engineers involved ever recommended moving to a more automated system.<p>I encountered something like this about a year ago at work. We were deploying an extremely large new system to replace a legacy one. The portion of the system I work on required a great deal of DBA involvement for deployment. We, of course, practiced the deployment. We ran it more than 20 times against multiple different non-production environments. Not once in any of those attempts was the DBA portion of the deployment completed without error. There were around 130 steps involved and some of them would always get skipped. We also had the issue that the production environment contained some significant differences from the non-production environments (over the past decade we had, for example, delivered software fixes/enhancements which required database columns to be dropped... this was done on the non-production systems, but not on the production environment, because dropping the columns would take a great deal of time). Others and I tried to raise concerns about this, but in the end we were left to simply expect to do cleanup after problems were encountered. Luckily we were able to do the cleanup, and the errors (of which there were a few) were fixed in a timely manner. We also benefited from other portions of the system having more severe issues, giving us some cover while we fixed up the new system. The result, however, could have been very bad. And since it wasn't, management is growing increasingly enamored with the idea of by-the-seat-of-your-pants development, hotfixes, etc. When it eventually bites us, as I expect it will, I fear that no one will realize it was these practices that put us in danger.
The post is quite poor and suffers a lot from hindsight bias.
The following article is much better:
<a href="http://www.kitchensoap.com/2013/10/29/counterfactuals-knight-capital/" rel="nofollow">http://www.kitchensoap.com/2013/10/29/counterfactuals-knight...</a>
Repurposing a flag should be spread over two deployments. First remove the code using the old flag and verify that removal everywhere; only then introduce code reusing the flag.<p>Even if the deployment had been done correctly, <i>during</i> the deployment there would have been both old and new code in the system interpreting the same flag differently.
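To make the two-step idea concrete, here is a minimal Python sketch of how a flag repurpose could be split across deployments. Every name in it (REPURPOSED_FLAG, the route_order_* functions, the router stubs) is hypothetical, not Knight's actual code.

```python
# Hypothetical sketch: repurposing a flag across two separate deployments.

def send_to_smart_router(order):
    return ("SMART", order)

def send_to_new_path(order):
    return ("NEW_PATH", order)

# Deployment 1: the old flag-guarded branch is deleted; the flag becomes a no-op.
def route_order_step1(order, flags):
    # The flag is deliberately ignored here -- no server can act on its old meaning.
    return send_to_smart_router(order)

# Deployment 2: rolled out only after step 1 is verified live on *every* server.
# Now the flag is given its new meaning.
def route_order_step2(order, flags):
    if flags.get("REPURPOSED_FLAG", False):
        return send_to_new_path(order)
    return send_to_smart_router(order)

if __name__ == "__main__":
    print(route_order_step1({"id": 1}, {"REPURPOSED_FLAG": True}))  # ('SMART', ...)
    print(route_order_step2({"id": 1}, {"REPURPOSED_FLAG": True}))  # ('NEW_PATH', ...)
```

The point is that step 2 ships only after step 1 has been confirmed on every server, so at no moment can any server still interpret the flag with its old meaning.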
I used to work in HFT, and what I don't understand is why there were no risk controls. The way we did it was to have explicit shutdown/pause rules (pause meaning that the strategy would only try to get flat).<p>The rules were things like:
- Too many trades in one direction (i.e., a big position)
- P/L down by X over Y
- P/L up by X over Y
- Orders way off the current price<p>Whenever there was a shutdown/pause, a human trader would need to assess the situation and decide whether to continue. A rough sketch of such checks is below.
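This is illustration only; the thresholds, field names, and the `should_pause` helper are all made up, not anyone's production risk engine.

```python
# Rough sketch of the kind of pause/shutdown rules described above.
from dataclasses import dataclass

@dataclass
class StrategyState:
    position: int          # signed net position
    pnl: float             # running profit/loss
    last_order_price: float
    market_price: float

def should_pause(state, max_position=10_000, max_drawdown=50_000.0,
                 max_gain=50_000.0, max_price_offset=0.05):
    """Return a reason string if the strategy should pause, else None."""
    if abs(state.position) > max_position:
        return "position too large"
    if state.pnl < -max_drawdown:
        return "P/L down too far over the window"
    if state.pnl > max_gain:
        return "P/L up suspiciously fast over the window"
    offset = abs(state.last_order_price - state.market_price) / state.market_price
    if offset > max_price_offset:
        return "orders far from current price"
    return None

# A pause only flattens the position; a human decides whether to resume.
state = StrategyState(position=25_000, pnl=-1_000.0,
                      last_order_price=101.0, market_price=100.0)
print(should_pause(state))  # -> "position too large"
```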
I remember reading a summary of this when it occurred in 2012. It's obvious to everyone here what SHOULD have been done, and I find this pretty surprising in the finance sector.<p>Also, your submission should probably have (2014) in the title.
It's nice to see a more detailed technical explanation of this. I've used the story of Knight Capital as part of the pitch for my own startup, which addresses (among other things) consistency between server configurations.<p>This isn't just a deployment problem. It's a <i>monitoring</i> problem. What mechanism did they have to tell if the servers were out of sync? Manual review is the recommended approach. Seriously? You're going to trust human eyeballs for the thousands of different configuration parameters?<p>Have computers do what computers do well - like comparing complex system configurations to find things that are out of sync. Have humans do what humans do well - deciding what to do when things don't look right.
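A toy example of the "let computers compare configurations" idea, assuming you can already pull each server's configuration into a dict (how you fetch it is out of scope here, and the server names and keys below are invented):

```python
# Flag any configuration key whose value disagrees across servers.
def find_config_drift(configs):
    """configs: {server: {key: value}} -> {key: {server: value}} for mismatched keys."""
    all_keys = set().union(*configs.values())
    drift = {}
    for key in sorted(all_keys):
        values = {server: cfg.get(key, "<missing>") for server, cfg in configs.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift

configs = {
    "server1": {"power_peg_code": "removed", "flag_x": "on"},
    "server8": {"power_peg_code": "present", "flag_x": "on"},  # the odd one out
}
for key, values in find_config_drift(configs).items():
    print(key, values)   # power_peg_code {'server1': 'removed', 'server8': 'present'}
```

Anything flagged goes to a human; the computer only does the tedious comparison.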
Somebody was on the other side of all those trades, and they made a lot of money that day. That's finance. No real value gets destroyed, no physical damage gets done, and somebody on the other side of the poker table gets all the money somebody else lost.
This must be an old wives' tale. I live in Chicago, and a trading firm on the floor beneath us went bankrupt, at roughly the same time, with a similar "repurposed bit" story.<p>Maybe it's the same one...
Ah yes, this story is legendary. I discuss it in my JavaScript Application Design book[1]. Chaos-monkey-style server wrecking sounds like a reasonable way to mitigate this kind of issue (along with sane development/deployment processes, obviously).<p>[1]: <a href="http://bevacqua.io/bf" rel="nofollow">http://bevacqua.io/bf</a>
What really looks broken to me in this story is the financial system. It has become a completely artificial and lunatic system that has almost nothing to do with the real - goods and services producing - economy.
As usual in catastrophic failures, a series of bad decisions had to line up:<p>- They had dead code in the system<p>- They repurposed a flag that had been used for previous functionality<p>- They (apparently) didn't have code reviews<p>- They didn't have a staging environment<p>- They didn't have a tested deployment process<p>- They didn't have a contingency plan to revert the deploy<p>The damage could have been minimized or avoided altogether by fixing just one of these points. Incredible.
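As a rough illustration of the last three points, even a crude automated gate like the sketch below (all server names, version strings, and helpers are hypothetical) would have caught the out-of-date eighth server before any flag was enabled, and gives an obvious abort-and-revert point.

```python
# Hypothetical pre-flight check: refuse to proceed unless every server
# reports the expected deployed version.

EXPECTED_VERSION = "2012-08-01-rls"

# Stand-in data: in reality each server would be queried directly.
VERSIONS = {f"server{i}": EXPECTED_VERSION for i in range(1, 9)}
VERSIONS["server8"] = "2003-power-peg"   # the server the deploy never reached

def deployed_version(server):
    return VERSIONS[server]

def verify_deployment(servers, expected):
    """Return the list of servers that are not running the expected version."""
    return [s for s in servers if deployed_version(s) != expected]

stragglers = verify_deployment(VERSIONS, EXPECTED_VERSION)
if stragglers:
    print("ABORT: do not enable the flag; roll back. Out-of-date:", stragglers)
else:
    print("All servers match; safe to proceed.")
```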