TechEcho

12 comments

bob1029 over 4 years ago

I've seen the guts of a few major financial organizations, and there are some common themes regarding their infrastructures.

The one that really sticks out to me as an engineer is that the whole system, in most cases, seems to be tied together by a fragile arrangement of 100+ different vendors' systems and middleware, each tailored to fit specific audit items that cropped up over the years.

Individually, all of these components have high-availability assurances up and down their contracts, but combine all these durable components haphazardly and you get emergent properties that no one person or vendor can account for comprehensively.

When the article says a full reset entails killing the power and restarting, this is my actual experience. These complex leviathans have to be brought up in a special snowflake sequence or your infra state machine gets fucked up and you have to start all over. When dependency chains are 10+ systems long and part of a complex web of other dependency chains, it starts to get hopeless pretty quickly.
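To make the "special snowflake sequence" concrete: if the dependencies between systems are written down, a valid cold-start order is a topological sort of that graph. The sketch below is only an illustration. The system names and their dependencies are hypothetical and not taken from the article or from any real exchange. It uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical systems; each key maps to the systems that must be up first.
DEPENDS_ON = {
    "order_gateway": {"matching_engine", "auth"},
    "matching_engine": {"shared_storage", "market_data"},
    "market_data": {"shared_storage"},
    "auth": {"shared_storage"},
    "shared_storage": set(),
}


def startup_order(depends_on):
    """Return a start sequence in which every system follows its dependencies."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes that form the detected cycle.
        raise RuntimeError(
            f"circular dependency, no valid cold start: {err.args[1]}"
        ) from err


order = startup_order(DEPENDS_ON)
print(order)  # shared_storage comes first, order_gateway last
```

The hopeless part in practice is that nobody has the full graph written down, and a cycle between two vendors' systems means there is no valid order at all, which is the case the `RuntimeError` branch reports.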

stanrivers over 4 years ago

"A data device critical to the Tokyo Stock Exchange’s trading system had malfunctioned, and the automatic backup had failed to kick in. It was less than an hour before the system, called Arrowhead, was due to start processing orders in the $6 trillion equity market. Exchange officials could see no solution."

You know, just from a human perspective, talk about a bad day.

It's like seeing the cruise liner heading for the port too fast, knowing it is going to crash and cause immense damage, and realizing there is absolutely nothing that can now be done to prevent the damage.

Except, in this case, it's just way worse.

neurotech1 over 4 years ago

Leaders frankly accepting responsibility is known as the Asoh defense[0], named after Capt. Kohei Asoh, who accidentally landed a DC-8, JAL Flight 2, in San Francisco Bay.

[0] https://en.wikipedia.org/wiki/Japan_Airlines_Flight_2#The_%22Asoh_defense%22

fomine3 over 4 years ago

I'm really impressed by how the CIO explained the problem at the press conference. He knows the system and explained the issue perfectly.

fovc over 4 years ago

Depending on your POV, financial exchanges are a great/awful example of the "beneath the surface" complexity of modern life. You'd think that with only a few order types and not that many tickers you could stand up an exchange using Rust in a few nights and weekends, no? fsync liberally, pay Colin his Tarsnap dues, and off you go! /s

ve55 over 4 years ago

I wonder if a solution such as "Cancel all orders, publicly announce this along with a trading commencement time, say one hour ahead, then reboot the server just before that and resume the day as normal" was considered, and if so, why it wouldn't have worked, given that it seems to satisfy the constraints mentioned in the article.

ncmncm over 4 years ago

It is not actually a very bad thing for the whole market to go down for everybody at the same time. What is bad is for part of the market to go down, or for it to go down for some people but not others.

This failure highlights how frequent, scheduled testing of your failover system is needed in order to be able to say, honestly, that you even have a failover system, and not just another box burning power and doing nothing. When you have a choice, it is often better to have both systems running all the time, at less than half capacity and sharing the load; or each doing all the work, and throwing away half.

If you choose the former, traffic can sometimes peak over half capacity, without loss. If the latter, you can check that they are producing the same answers, too.
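A minimal sketch of the two arrangements described above, assuming a toy `process` function and a hypothetical `Replica` class (neither comes from the article or from Arrowhead). `share_load` alternates work between two live replicas and falls over to the other on failure. `duplicate_and_compare` has both replicas do all the work and checks that their answers agree before one is thrown away.

```python
def process(order):
    """Stand-in for real work (hypothetical): notional value of an order."""
    return order["price"] * order["qty"]


class Replica:
    def __init__(self, name, handler=process):
        self.name = name
        self.handler = handler
        self.healthy = True

    def handle(self, order):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.handler(order)


def share_load(a, b, orders):
    """Alternate orders between replicas; retry on the other one on failure."""
    results = []
    for i, order in enumerate(orders):
        first, second = (a, b) if i % 2 == 0 else (b, a)
        try:
            results.append(first.handle(order))
        except ConnectionError:
            results.append(second.handle(order))
    return results


def duplicate_and_compare(a, b, order):
    """Both replicas do all the work; answers are compared, one is discarded."""
    results = []
    for replica in (a, b):
        try:
            results.append(replica.handle(order))
        except ConnectionError:
            pass  # the surviving replica's answer is still usable
    if not results:
        raise RuntimeError("both replicas are down")
    if len(results) == 2 and results[0] != results[1]:
        raise RuntimeError(f"replicas disagree: {results}")
    return results[0]
```

In both modes the second box handles real traffic every day, so a broken failover path shows up during normal operation rather than during an outage.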

ezoe over 4 years ago

The interesting takeaway from this situation is that the top management of the Tokyo Stock Exchange (東証), as a company, understands the technical situation well. That's not the case for most companies in Japan, since most of them simply outsource their systems, but 東証 is an exception.

Meanwhile, Japanese so-called journalists are completely blind to technology. Even worse, they lack basic literacy and listening skills: they ask about things they have already been told, or blurt out questions like "But computers don't break, do they?" when the cause is likely a soft memory error.

notacoward over 4 years ago

Sounds very much like a fault plus a bad HA implementation took down the market. Couldn't care less about the fault. I'd really like to hear about the bug(s) in the HA software.

Stierlitz over 4 years ago

‘Upgraded Version of Tokyo Stock Exchange's "arrowhead" Trading System’ https://archive.is/kkuxu

gautamcgoel over 4 years ago

It's amazing to me that a single hardware failure can lead to this kind of chaos. Wasn't there any redundancy? Didn't anyone foresee this possibility?

xvilka over 4 years ago

Shouldn't something like Lasp[1] help to build better distributed and fault-tolerant systems of this scale? Or using purer programming languages and frameworks, like Jane Street with OCaml or Standard Chartered with Haskell?

[1] https://lasp-lang.readme.io/docs

Tokyo Stock Exchange Blackout: One Piece of Hardware Took Down a Market