TechEcho

7 comments

steakknife almost 7 years ago

Interesting writeup. This is also a major issue for us at TradeIt (we do something similar, but for stock brokers and portfolio/trading), as the brokers we integrate with are not always...ahem..."robust". We've found that our upstream users really appreciate that we can often tell them about brokers' service outages before the brokers even announce them (when the brokers even bother). Sometimes the brokers don't even realize their system is malfunctioning until we poke them to ask what's going on.

Our throughput numbers are much lower and our integrations are far fewer than Plaid's, so we have been able to get away with keeping a close eye on Graphite/Grafana for spikes in request failures/timeouts. It seems like eventually we will need to implement some kind of statistical monitoring and alerting.

divxflounder almost 7 years ago

Great article! I'm definitely taking an action item to look into Prometheus. I own DevOps/Monitoring and Alerting for my org and it's really cool to see how other companies skin this cat.

I saw CloudWatch in the pipeline, which is an Amazon product. I know I'm going to make a very controversial statement here, but why Amazon? With volumes like yours, your scale will eventually hit the point where your cost skyrockets.

Regarding the metrics themselves, you might already do this, but I highly recommend splitting your metrics into 50th, 95th, and 99th percentiles in your Grafana graphs. This will give you a solid idea of not only what your customers experience on average, but edge cases as well.

Do you have a regular forum for reviewing said metrics and pre-solving problems? We're still trying to solve this across multiple teams where I work and have noticed that some teams are great at it and other teams are a little more reactive.

Love to see this stuff :)
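To illustrate the percentile split suggested above, here is a minimal sketch, assuming raw latency samples are available as a plain list. In a Prometheus/Grafana setup the percentiles would more likely be derived from histogram buckets (for example with `histogram_quantile`); the function names `percentile` and `latency_summary` are hypothetical and not taken from the article.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it.

    `p` is an integer between 1 and 100.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_summary(samples):
    """Return the p50, p95 and p99 of a list of latency samples."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# Hypothetical data: 98 fast requests and two very slow ones (milliseconds).
latencies_ms = [100] * 98 + [5000, 9000]
summary = latency_summary(latencies_ms)
print(summary)
```

With this made-up data the p50 and p95 both stay at 100 ms, while the p99 jumps to 5000 ms, which is the kind of edge case a median-only graph would hide.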

syastrov almost 7 years ago

Nice write-up. I love reading these kinds of postmortems.

Unlike a lot of those I read, it sounds like you actually set out with a good set of requirements and really understood the problem.

I had a good experience using Prometheus as well for a smaller project (server monitoring). It's interesting to know that it can handle so many metrics and scale so well to more complex problem areas.

lordxenu almost 7 years ago

How do you get the data from banks? Are you scraping the webpage after the user logs in? Not many banks I know of have public APIs.

Rainymood almost 7 years ago

How do you guys handle user log-in credentials? I mean, you're basically logging into their bank, right?

wbh1 almost 7 years ago

Really enjoyed this write-up. I'm currently in the process of scaling out a Prometheus-based replacement for an old Nagios setup that was scaled to its limit, and posts like this just make me that much more excited about Prometheus as a technology.

beamatronic almost 7 years ago

With that many integrations, some small set must be broken at any given time. How do you handle this without scaling a support staff accordingly?

Monitoring 9600 banks at scale
