I've seen the guts of a few major financial organizations, and there are some common themes regarding their infrastructures.<p>The one that really sticks out to me as an engineer is that in most cases the whole system seems to be tied together by a fragile arrangement of 100+ different vendors' systems & middleware, each tailored to fit specific audit items that cropped up over the years.<p>Individually, all of these components have high-availability assurances up and down their contracts, but combine all these durable components haphazardly and you get emergent properties that no one person or vendor can account for comprehensively.<p>When the article says a full reset entails killing the power and restarting, this matches my actual experience. These complex leviathans have to be brought up in a special snowflake sequence or your infra state machine gets fucked up and you have to start all over. When dependency chains are 10+ systems long and part of a complex web of other dependency chains, it starts to get hopeless pretty quickly.
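To make the "special snowflake sequence" point concrete: if the dependencies were actually written down, the bring-up order is just a topological sort. A minimal sketch, assuming a made-up set of systems (every name in `DEPENDS_ON` is hypothetical, not taken from the article or any real exchange):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical systems: each key maps to the set of systems that must be up first.
DEPENDS_ON = {
    "storage": set(),
    "database": {"storage"},
    "message_bus": {"storage"},
    "matching_engine": {"database", "message_bus"},
    "market_data_feed": {"matching_engine"},
    "order_gateway": {"matching_engine", "message_bus"},
}


def startup_waves(depends_on):
    """Group systems into waves; everything in a wave can start in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError (a ValueError) on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as started
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(startup_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

The hard part in practice is not the sort but that nobody has the complete, correct dependency map across 100+ vendors, and a circular dependency makes the sort fail outright.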
"A data device critical to the Tokyo Stock Exchange’s trading system had malfunctioned, and the automatic backup had failed to kick in. It was less than an hour before the system, called Arrowhead, was due to start processing orders in the $6 trillion equity market. Exchange officials could see no solution."<p>You know, just from a human perspective, talk about a bad day.<p>It's like seeing the cruise liner heading for the port too fast, knowing it is going to crash and cause immense damage, and realizing there is absolutely nothing that can now be done to prevent the damage.<p>Except, in this case, it's just way worse.
Leaders frankly accepting responsibility is known as the Asoh Defense[0], named after Capt. Kohei Asoh, who accidentally landed a DC-8, JAL Flight 2, in San Francisco Bay.<p>[0] <a href="https://en.wikipedia.org/wiki/Japan_Airlines_Flight_2#The_%22Asoh_defense%22" rel="nofollow">https://en.wikipedia.org/wiki/Japan_Airlines_Flight_2#The_%2...</a>
Depending on your POV, financial exchanges are a great/awful example of the "behind the surface" complexity of modern life. You'd think with only a few order types and not that many tickers you could stand up an exchange using Rust in a few nights and weekends, no? fsync liberally, pay Colin his Tarsnap dues, and off you go! /s
I wonder if a solution such as "cancel all orders, publicly announce this along with a new trading commencement time (say, one hour later), then reboot the server just before that time and resume the day as normal" was considered, and if so, why it wouldn't have worked, given that it seems to satisfy the constraints mentioned in the article.
It is not actually a very bad thing for the whole market to go down for everybody at the same time. What is bad is for part of the market to go down, or for it to go down for some people but not others.<p>This failure highlights how frequent, scheduled testing of your failover system is needed in order to be able to say, honestly, that you even have a failover system, and not just another box burning power and doing nothing. When you have a choice, it is often better to have both systems running all the time, at less than half capacity and sharing the load; or each doing all the work, and throwing away half.<p>If you choose the former, traffic can sometimes peak over half capacity, without loss. If the latter, you can check that they are producing the same answers, too.
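A rough sketch of the two options described above, with both systems live at all times so the "backup" is exercised on every request. The `Replica` class and both function names are hypothetical and only illustrate the idea; this is not how Arrowhead or any real exchange is built:

```python
class Replica:
    """Hypothetical stand-in for one of two identical systems."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.healthy = True
        self.handled = 0

    def process(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.handled += 1
        return self.handler(request)


def share_load(replicas, requests):
    """Option 1: alternate requests across replicas, skipping any that are down."""
    results = []
    for i, request in enumerate(requests):
        start = i % len(replicas)
        # Preferred replica first, then the others as fallback.
        for replica in replicas[start:] + replicas[:start]:
            if replica.healthy:
                results.append(replica.process(request))
                break
        else:
            raise RuntimeError("no healthy replica left")
    return results


def do_all_and_compare(replicas, request):
    """Option 2: every healthy replica does the work; the answers must match."""
    answers = [r.process(request) for r in replicas if r.healthy]
    if not answers:
        raise RuntimeError("no healthy replica left")
    if any(answer != answers[0] for answer in answers):
        raise ValueError(f"replicas disagree on {request!r}: {answers}")
    return answers[0]
```

Either way, losing one box is the normal code path rather than a rarely tested special case, which is the whole point.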
The interesting aftermath of this situation is that the top management of the Tokyo Stock Exchange (東証) understands the technical situation well. That's not the case for most companies in Japan, since most of them simply outsource their systems, but 東証 is an exception.<p>Meanwhile, Japan's so-called journalists are completely blind when it comes to technology. Even worse, they lack basic literacy or listening skills: they ask about things they have already been told, or blurt out questions like "But computers don't break, do they?" when the cause is likely a soft memory error.
Sounds very much like a fault <i>plus a bad HA implementation</i> took down the market. Couldn't care less about the fault. I'd really like to hear about the bug(s) in the HA software.
‘Upgraded Version of Tokyo Stock Exchange's "arrowhead" Trading System’<p><a href="https://archive.is/kkuxu" rel="nofollow">https://archive.is/kkuxu</a>
It's amazing to me that a <i>single</i> hardware failure can lead to this kind of chaos. Wasn't there any redundancy? Didn't anyone foresee this possibility?
Shouldn't something like Lasp[1] help to build better distributed and fault-tolerant systems of this scale? Or using purer programming languages and frameworks, like Jane Street with OCaml or Standard Chartered with Haskell?<p>[1] <a href="https://lasp-lang.readme.io/docs" rel="nofollow">https://lasp-lang.readme.io/docs</a>