TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

Monitoring 9600 banks at scale

96 points by jeandenis almost 7 years ago

7 comments

steakknife almost 7 years ago
Interesting writeup. This is also a major issue for us at TradeIt (we do something similar but for stock brokers and portfolio/trading) as the brokers we integrate are not always...ahem..."robust". We've found that our upstream users really appreciate that we can often tell them about brokers' service outages before the brokers even announce them (when the brokers even bother). Sometimes the brokers don't even realize their system is malfunctioning until we poke them to ask what's going on.

Our throughput numbers are much lower and we have far fewer integrations than Plaid, so we have been able to get away with keeping a close eye on Graphite/Grafana for spikes in request failures/timeouts. Seems like eventually we will need to implement some kind of statistical monitoring and alerting.
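For readers wondering what "statistical monitoring" of failure spikes might look like in its simplest form, here is a minimal sketch. It is not taken from the comment or the article: the function name `is_anomalous`, the 3-sigma threshold, and the sample failure rates are all hypothetical, and a z-score against recent history is only one of many possible approaches.

```python
import statistics


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it sits more than `threshold` standard deviations
    above the mean of `history` (a plain z-score check).

    `history` is a list of recent failure rates (fractions between 0 and 1)
    for a single integration; all values here are illustrative.
    """
    if len(history) < 2:
        # Not enough data to estimate a spread; do not alert.
        return False
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Perfectly flat history: any increase counts as a deviation.
        return current > mean
    return (current - mean) / stdev > threshold


# Hypothetical per-interval failure rates for one broker integration.
recent_failure_rates = [0.010, 0.020, 0.015, 0.012, 0.018]

spike = is_anomalous(recent_failure_rates, 0.20)    # large jump in failures
normal = is_anomalous(recent_failure_rates, 0.016)  # within the usual range
```

A per-integration baseline like this avoids a single global threshold, which matters when some upstream providers are routinely flakier than others.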
divxflounder almost 7 years ago
Great article! I'm definitely taking an action item to look into Prometheus. I own DevOps/Monitoring and Alerting for my org and it's really cool to see how other companies skin this cat.

I saw CloudWatch in the pipeline, which is an Amazon product. I know I'm going to make a very controversial statement here, but why Amazon? With volumes like yours, you will eventually hit the point where your cost skyrockets.

Regarding the metrics themselves, you might already do this, but I highly recommend splitting your metrics into 50th, 95th, and 99th percentiles in your Grafana graphs. This will give you a solid idea of not only what your customers typically experience, but edge cases as well.

Do you have a regular forum for reviewing said metrics and pre-solving problems? We're still trying to solve this across multiple teams where I work and have noticed that some teams are great at it and other teams are a little more reactive.

Love to see this stuff :)
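To illustrate the percentile split suggested above, here is a minimal sketch using the nearest-rank method. It is not from the comment or the article: the helper names `percentile` and `latency_summary` and the sample latencies are hypothetical, and a real setup would more likely compute quantiles inside the monitoring system than in application code.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that p=0 returns the minimum.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples):
    """Return the 50th, 95th, and 99th percentiles as a dict."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# Hypothetical request latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [100] * 90 + [500] * 8 + [3000] * 2

summary = latency_summary(latencies_ms)
mean_ms = sum(latencies_ms) / len(latencies_ms)
```

With this made-up data the median stays at the fast value while the 99th percentile lands on the slow tail, which is the kind of edge case a single average would blur.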
syastrov almost 7 years ago
Nice write-up. I love reading these kinds of postmortems.

Unlike a lot of those I read, it sounds like you actually set out with a good set of requirements and really understood the problem.

I had a good experience using Prometheus as well for a smaller project (server monitoring). It's interesting to know that it can handle so many metrics and scale so well to more complex problem areas.
lordxenu almost 7 years ago
How do you get the data from banks? Are you scraping the webpage after the user logs in? Not many banks I know of have public APIs.
Rainymood almost 7 years ago
How do you guys handle user log-in credentials? I mean, you're basically logging into their bank, right?
wbh1 almost 7 years ago
Really enjoyed this write-up. I'm currently in the process of scaling out a Prometheus-based replacement for an old Nagios setup that was scaled to its limit, and posts like this just make me that much more excited for Prometheus as a technology.
beamatronic almost 7 years ago
With that many integrations, some small set must be broken at any given time. How do you handle this without scaling a support staff accordingly?