I'm amused by the tone. It's like the author doesn't realize that 99% of software development and deployment is done like this, or much, much worse. Welcome to the real world.

We work in an incredibly immature industry. And trying to enforce better practices rarely works out as intended. To give one example: we rolled out mandatory code reviews on all changes. Now we have thousands of rubber-stamped "looks good to me" code reviews without any remarks.

Managers care about speed of implementation, not quality. At retrospectives, I hear unironic boasts about how many bugs were solved last sprint, instead of reflection on how those bugs were introduced in the first place.
Discussed previously at:

https://news.ycombinator.com/item?id=6589508

I remember the week after this. Everyone I knew who worked at a fund was going over their code and also updating their compliance documents covering testing and deployment of automated code.

As a side note, one of the biggest ways funds tend to get in trouble with their regulators is by not following the steps outlined in their compliance manual. It's been my experience that regulators care more that you follow the steps in your manual than that those steps are necessarily the best way to do something.

I came away from this thinking the worst part was that their system did send them errors; it's just that when you deal with billions of events, emailed errors tend to get ignored, because logging at that scale generates so many false positives.

I still don't know the best way to monitor and alert on large distributed systems.

The other takeaway was that this wasn't just a software issue but a deployment issue as well. It wasn't one root cause but a number of issues that built up:

1) A new exchange feature going live, so this was the first day actually running live with that feature.

2) Old code left in the system long after it was done being used.

3) A re-purposed command flag that used to call the old code but is now used by the new code.

4) Only a partial deployment, leaving both old and new code running together (the sketch below shows how 3 and 4 combine).

5) Inability to quickly diagnose where the problem was.

6) You are also managing client orders and have the equivalent of an SLA with them, so you don't want to go nuclear and shut everything down.
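A minimal sketch of that interaction in Python -- the function and flag names here are made up for illustration, not the identifiers from the actual system:

```python
# Hypothetical illustration of points 3 and 4 above: the same command flag is
# interpreted by the new code on updated servers and by retired code on the
# one server that missed the deployment.

def route_with_new_logic(order):
    return f"NEW: routed {order} once, tracking fills"

def run_retired_test_logic(order):
    # The retired logic never tracked fills, so it keeps firing child orders.
    return f"OLD: re-sending {order} with no fill tracking"

def handle_order(order, flag_set, has_new_build):
    if flag_set:                                   # same wire-level flag, two meanings
        if has_new_build:
            return route_with_new_logic(order)
        return run_retired_test_logic(order)       # dead code, still reachable
    return f"routed {order} normally"

# Seven servers updated, one forgotten: the same inbound order behaves
# completely differently depending on which box receives it.
for has_new_build in (True, True, True, False):
    print(handle_order("ORD-1", flag_set=True, has_new_build=has_new_build))
```

The deployment can look green on every updated box; the hazard only exists on the one server still carrying the old meaning of the flag.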
Deployment is where the really scary bugs can happen the easiest.

I worked on warehouse management software that ran on the mobile barcode scanners each warehouse worker carried as they moved stuff around the warehouse, confirming each step with the system by scanning barcodes on shelves and products.

We had a test mode running against a test database and a production mode running against the production database, and you could switch between them in a menu at startup.

During testing/training, users ran against the test database; then we intended to switch the devices to production mode permanently, so that the startup menu wouldn't show anymore.

A few devices weren't switched for some reason (I suspect they were lost when we did the switch and found later), and on those devices the startup menu remained active.

Users grabbed devices at random each morning, and most of them knew to choose "production" when the menu appeared. Some didn't, and chose the first option instead.

We started getting small inaccuracies in the production database. People were directed by the system to take 100 units of X from shelf Y, but there were only 90 units there. We looked at the logs on the (production) database and on the application server, but everything looked fine.

We suspected someone might just be stealing, but later we found examples where some shelves held more stock in reality than in the system.

At that time we had introduced a big change to pathfinding, and we thought the system was directing users to put products in the wrong places. Mostly we were trying to confirm that this was the cause of the bugs.

We finally found the reason by deploying a change to the thin-client software on the mobile devices that gathered log files from all of them and sent them to the server.
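In hindsight, the fix is to take the choice away from the user entirely. A minimal sketch, assuming a locked-down per-device config file at a made-up path:

```python
# The environment comes from device configuration, not from a menu a user can
# mis-click. A device with no explicit environment refuses to start instead of
# quietly defaulting to the first option.

import configparser
import sys

CONFIG_PATH = "/etc/scanner/environment.ini"   # assumed per-device path

def load_database_dsn():
    cfg = configparser.ConfigParser()
    if not cfg.read(CONFIG_PATH) or not cfg.has_section("deployment"):
        sys.exit("No environment configured on this device - refusing to start")

    env = cfg.get("deployment", "environment", fallback=None)
    if env == "production":
        return cfg.get("deployment", "production_dsn")
    if env == "test":
        return cfg.get("deployment", "test_dsn")
    sys.exit(f"Unknown environment {env!r} - refusing to start")

if __name__ == "__main__":
    print("Connecting to", load_database_dsn())
```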
My company's legacy system (which still does most of the revenue-producing work) has deployment problems like this. The deployment is fully automated, but if it fails on a server, it fails silently.

I rarely work on this system, but had to make an emergency change last summer. We deployed the change at around 10 pm. A number of our tests failed in a really strange way. It took several hours to determine that one of the 48 servers still had the old version. Its disk was full, so the rollout of the new version failed. The deployment pipeline happily reported all was well.

We got lucky in that our tests happened to land on the affected server. The results of this making it past the validation process would have been catastrophic. Not as catastrophic as this case, I hope, but it would have been bad.

We made a couple of human process changes, like telling the sysadmins not to ignore full-disk warnings anymore (sigh). We also fixed the rollout script to actually report failures, but I still don't trust it.
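A post-deploy verification step closes exactly that gap. A rough sketch, where the hostnames, port, and /version endpoint are assumptions rather than our actual setup:

```python
# After the rollout, ask every server which build it is actually running and
# fail loudly on any mismatch or unreachable host.

import sys
from urllib.request import urlopen

EXPECTED_VERSION = "2024.07.3"                                   # whatever just shipped
SERVERS = [f"app{n:02d}.internal:8080" for n in range(1, 49)]    # 48 hosts

def deployed_version(host):
    try:
        with urlopen(f"http://{host}/version", timeout=5) as resp:
            return resp.read().decode().strip()
    except OSError as exc:
        return f"unreachable ({exc})"

failures = {}
for host in SERVERS:
    version = deployed_version(host)
    if version != EXPECTED_VERSION:
        failures[host] = version

if failures:
    for host, version in failures.items():
        print(f"FAILED: {host} reports {version}, expected {EXPECTED_VERSION}")
    sys.exit(1)   # make the pipeline go red instead of happily reporting all was well
print(f"All {len(SERVERS)} servers are on {EXPECTED_VERSION}")
```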
> Knight did not design these types of messages to be system alerts, and Knight personnel generally did not review them when they were received

So they received these 90 minutes before the orders were executed, and, as happens in many organizations, automated emails fly back and forth without anyone paying attention.

Also: running new trading code and NOT having someone watching it live at kick-off is simply irresponsible and reckless.
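The fix is structural, not attentional: anything that can affect live orders should page a human, not land in a group inbox. A toy sketch -- the severities, dedup window, and page_oncall() stand-in are all hypothetical:

```python
# Distinguish "error email" from "system alert": critical events page the
# on-call directly, deduplicated so a flood of repeats doesn't drown the signal.

import time
from collections import defaultdict

PAGE_SEVERITIES = {"CRITICAL"}        # e.g. pre-open rejects, unknown order state
DEDUP_WINDOW_SECONDS = 300
_last_paged = defaultdict(float)

def page_oncall(message):             # stand-in for whatever paging system is in use
    print("PAGING ON-CALL:", message)

def handle_error_event(severity, component, message):
    if severity not in PAGE_SEVERITIES:
        return                        # log-only; never routed to a group inbox
    key = (severity, component)
    now = time.time()
    if now - _last_paged[key] > DEDUP_WINDOW_SECONDS:
        _last_paged[key] = now
        page_oncall(f"[{component}] {message}")

handle_error_event("CRITICAL", "order-router", "pre-open errors: needs eyes before 9:30")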
I bring up this story every time someone talks about trying to do something dumb with feature toggles.

(Except I had remembered them losing $250M, not $465M. Yeow.)

The sad thing is that if the engineering team had insisted on removing the old feature toggle first, deploying that code and letting it settle, and only *then* starting work on the new toggle, they may well have noticed the problem before turning on the flag, and rolling back certainly would not have caused the catastrophic failure they saw.

Basically they were running with scissors. When I say "no" in this sort of situation I almost always get pushback, but I can also usually find at least a couple of people who are as insistent as I am. It's okay for your boss to be disappointed sometimes. That's always going to happen (they're always going to test boundaries to see if the team is really producing as much as it can). It's better to have disappointed bosses than bosses who don't trust you.
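One way to make that discipline mechanical is to never let a retired toggle's name be reused for new behaviour. A toy sketch (not any real feature-flag library):

```python
# A tiny flag registry that refuses to re-register a name that was retired in
# an earlier, settled deploy, so old builds can't interpret it the old way.

RETIRED_FLAGS = {"old_algo_flag"}      # retired and fully removed before new work began

class FlagRegistry:
    def __init__(self):
        self._flags = {}

    def register(self, name, default=False):
        if name in RETIRED_FLAGS:
            raise ValueError(
                f"{name!r} was retired; pick a new name so stale builds "
                "can't give it the old meaning"
            )
        self._flags[name] = default

    def enabled(self, name):
        return self._flags.get(name, False)

flags = FlagRegistry()
flags.register("new_router_flag")      # fine: brand-new name
try:
    flags.register("old_algo_flag")    # rejected: re-use of a retired toggle
except ValueError as exc:
    print("rejected:", exc)
```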
I had a chance to get familiar with deployment procedures at Knight two years after the incident. And let me tell you, they were still atrocious. It's no surprise this thing happened. In fact, what's more surprising is that it didn't happen again and again (or perhaps it did, just not on such a large scale).

Anyway, this is what deployment looked like two years after:

* All configuration files for all production deployments were located in a single directory on an NFS mount. Literally, countless *.ini files for hundreds of production systems in a single directory, with no subdirectories (or any other structure) allowed. The *.ini files themselves were huge, as typically happens in a complex system.

* The deployment config directory was called 'today'. Yesterday's deployment snapshot was called 'yesterday'. That is as much revision control as they had.

* In order to change your system configuration, you'd be given write access to the 'today' directory. So naturally, you could wipe out all the other configuration files with a single erroneous command. Stressful enough? That's not all.

* Reviewing config changes was hardly possible. You had to write a description of what you changed, but I never saw anybody attach an actual diff. Say you changed 10 files: in the absence of a VCS, manually diffing 10 files wasn't something anybody wanted to do.

* The deployment of binaries was also manual. Binaries were on the NFS mount as well. So theoretically, you could replace your single binary and all production servers would pick it up the next day. In practice, though, you'd have multiple versions of your binary, and production servers would use different versions for one reason or another. In order to update all production servers, you'd need to check which version each server used and update that version of the binary.

* There was nothing to ensure that changes to configs and binaries were made at the same time in an atomic manner. Nothing to check that a binary used the correct config. No config or binary version checks, no hash checks, nothing (a minimal version of what that could look like is sketched below).

Now, count how many ways you can screw up. This is clearly an engineering failure. You cannot put more people or more process on top of this broken system to make it more reliable. On the upside, I learned more about reliable deployment and configuration by analyzing the shortcomings of this system than I ever wanted to know.
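For contrast, even a bare-bones improvement is not much code: versioned release directories plus a checksum manifest that both the deploy step and the starting binary verify. A sketch with made-up paths:

```python
# Deploy configs and binaries together behind a versioned manifest with
# checksums, and refuse to start if the manifest doesn't match what's on disk.

import hashlib
import json
from pathlib import Path

RELEASE_DIR = Path("/deploy/releases/2014-08-01")   # immutable, never "today"

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def write_manifest(release_dir: Path):
    files = [p for p in release_dir.rglob("*")
             if p.is_file() and p.name != "manifest.json"]
    manifest = {str(p.relative_to(release_dir)): sha256(p) for p in files}
    (release_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))

def verify_release(release_dir: Path):
    manifest = json.loads((release_dir / "manifest.json").read_text())
    for rel_path, expected in manifest.items():
        if sha256(release_dir / rel_path) != expected:
            raise SystemExit(f"{rel_path}: hash mismatch - refusing to start")
    print(f"{len(manifest)} files verified for {release_dir.name}")

# Deploy step: write_manifest(RELEASE_DIR) after copying configs + binaries in.
# Startup: verify_release(RELEASE_DIR) before touching a single order.
```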
I realize that the consensus is that lots of companies do this kind of thing. I don't know if it's 99%, but the percentage is pretty high.

What's neglected, though, is the risk associated with a catastrophic software error. If you are, say, Instagram and you lose an uploaded photo of what someone ate for lunch, that is undesirable and inconvenient; the consequence of that risk, should it come to fruition, is relatively low.

On the other hand, if you employ software developers who are literally the lifeblood of your automated-trading business, you'd think a company like that would understand the importance of treating this "cost center" as a critical asset rather than just a commodity.

Unfortunately, you would be wrong. Nearly every developer I have ever met who has worked for a trading firm has told me that the general attitude is that nearly all employees not generating revenue are a disposable commodity. It's not just developers but also research, governance, secretarial, customer service, etc. This is a bit of a broad brush, but generally the principals and traders of those firms are arrogant and greedy and cut corners whenever possible.

In this case you'd think these people would be rational enough to know that cutting corners on your IT staff could be catastrophic. This is where you would be wrong. Friends who have worked at small/mid-sized financial firms have told me those firms generally treat their staff like garbage and routinely push out people who want decent raises, bonuses, etc. These people are generally greedy, egocentric, and egomaniacal, and they believe all their employees are leeching directly from their yearly bonus.

This story is not a surprise to me in the least. What's shocking is that no one in the finance industry has learned anything. Instead of looking at this story as a warning, most finance people hear it and laugh at how stupid everyone else is, certain this would never happen to them because they're so much smarter than everyone else.
What baffles me is how they got this far into operations with such dreadful practices. $100-200k could have bought them a really solid CI pipeline with rollbacks, monitoring, testing, etc.

But spend $200,000 on managing $460,000,000? No way!
Loosely related: this is what terrifies me about deploying to cloud services like Google's, which have no hard limit on monthly spend. If background jobs get stuck in an infinite loop using 100% CPU while I'm away camping, my fledgling business could be bankrupt by the time I get phone signal back.
This is one of the classics of the genre. If you're interested in software reliability/failure, you should read some of comp.risks ... and then stop before you get too depressed to continue.
Totally unrelated, but the title made me think back to one of my previous roles in the broadcast industry. If you're using a satellite as part of your platform, every second that you aren't transmitting to your birds (satellites), you're losing a massive amount of money. There are always a lot of buffers and redundant circuits in those situations, but things can always go wrong.

Funny tangent: the breakroom at that job was somewhat near the base stations. Some days around lunchtime we'd have transmission interruptions. The root cause ended up being an old, noisy microwave.
Just popping in to say I believe the Equifax hack was also due to a 'bad manual deployment' similar to this. They had a number of servers, but they didn't patch one of them. Hackers were able to find that one server with outdated, vulnerable software and took advantage of it.

I think deploys get better with time, but that initial blast of software development at a startup is insane. You literally need to do whatever it takes to get your shit running. Some of these details don't matter because initially you have no users. But if your company survives for a couple of years and builds a user base, you're still stuck with the same shitty code and practices from the early days.
I have no sympathy for high-frequency traders losing everything.

There are so many more interesting and meaningful uses of computing than building a system to out-cheat other systems at manipulating the global market for the express purpose of amplifying wealth.
I watched the market go haywire that day. Attentive people made a cool buck or two as dislocations arose.

What's crazy is that there were already rules in place to prevent stuff like this from happening, namely the Market Access Rule (https://www.sec.gov/news/press/2010/2010-210.htm), which was in place in 2010.

When the dust settled, Knight sold the entire portfolio via a blind portfolio bid to GS. It was a couple of billion dollars gross. I think they made a pretty penny on that as well.
> Knight relied primarily on its technology team to attempt to identify and address the SMARS problem in a live trading environment.

Ah, the good old "fk it, we'll do it live" approach to managing billions.
Could use a [2013] tag, but this story is fascinating and horrifying and I re-read it every time it pops up. It's a textbook case of why a solid CI/CD pipeline is vital.