Interesting writeup. This is also a major issue for us at TradeIt (we do something similar but for stock brokers and portfolio/trading) as the brokers we integrate with are not always...<i>ahem</i>..."robust". We've found that our upstream users really appreciate that we can often tell them about brokers' service outages before the brokers even announce them (when the brokers even bother). Sometimes the brokers don't even realize their system is malfunctioning until we poke them to ask what's going on.<p>Our throughput numbers are much lower and our integrations are far fewer than Plaid's, so we have been able to get away with keeping a close eye on Graphite/Grafana for spikes in request failures/timeouts. It seems like eventually we will need to implement some kind of statistical monitoring and alerting.
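For what it's worth, here is a minimal sketch of what "statistical monitoring" of failure spikes could look like, in place of eyeballing a dashboard. It is not anything TradeIt or Plaid actually runs: the class name, the window size, and the 3-sigma threshold are all assumptions chosen for illustration. It keeps a rolling baseline of per-interval failure rates and flags an interval that sits far above it.

```python
from collections import deque
from statistics import mean, stdev


class FailureRateMonitor:
    """Hypothetical rolling z-score detector for per-interval failure rates."""

    def __init__(self, window=60, threshold=3.0, min_samples=10):
        self.history = deque(maxlen=window)  # recent non-anomalous rates
        self.threshold = threshold           # z-score that triggers an alert (assumed)
        self.min_samples = min_samples       # stay quiet until a baseline exists

    def observe(self, failures, total):
        """Record one interval; return True if its failure rate looks anomalous."""
        rate = failures / total if total else 0.0
        alert = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            # Floor sigma so a perfectly flat baseline does not divide by zero.
            sigma = max(stdev(self.history), 1e-3)
            alert = (rate - mu) / sigma > self.threshold
        if not alert:
            # Only learn from normal intervals, so an outage does not
            # become part of the baseline.
            self.history.append(rate)
        return alert


monitor = FailureRateMonitor(window=30)
for i in range(20):
    monitor.observe(failures=1 + i % 2, total=100)  # roughly 1-2% failures
spike = monitor.observe(failures=30, total=100)     # sudden 30% failure rate
```

One monitor per integration (per broker, in this case) would keep a flaky upstream from masking a healthy one. A real setup would also need to handle low-traffic intervals, where a single failure swings the rate a lot.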
Great article! I'm definitely taking an action item to look into Prometheus. I own DevOps/Monitoring and Alerting for my org and it's really cool to see how other companies skin this cat.<p>I saw CloudWatch in the pipeline, which is an Amazon product. I know I'm going to make a very controversial statement here, but why Amazon? With volumes like yours, your scale will eventually hit the point where your cost skyrockets.<p>Regarding the metrics themselves, you might already do this, but I highly recommend splitting your metrics into 50th, 95th, and 99th percentiles in your Grafana graphs. This will give you a solid idea of not only what your typical customer experiences, but the edge cases as well.<p>Do you have a regular forum for reviewing these metrics and getting ahead of problems? We're still trying to solve this across multiple teams where I work, and have noticed that some teams are great at it while other teams are a little more reactive.<p>Love to see this stuff :)
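To illustrate the percentile point with a minimal sketch: the helper name and the sample data below are made up, and this uses Python's standard `statistics.quantiles` instead of anything Grafana-specific. It shows how a mean can hide a slow tail that the 99th percentile exposes.

```python
from statistics import mean, quantiles


def latency_percentiles(samples_ms):
    """Return p50/p95/p99 for a list of latency samples (needs at least 2)."""
    # n=100 yields 99 cut points; cut point k sits at index k - 1.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical data: 98 fast requests and 2 very slow ones.
samples = [100] * 98 + [5000] * 2
stats = latency_percentiles(samples)
average = mean(samples)
```

Here the average lands at roughly twice the median, which describes no request anyone actually made, while p50 and p95 stay at the fast value and p99 shows the slow tail.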
Nice write-up. I love reading these kinds of postmortems.<p>Unlike a lot of the ones I read, it sounds like you actually set out with a good set of requirements and really understood the problem.<p>I had a good experience using Prometheus as well for a smaller project (server monitoring). It’s interesting to know that it can handle so many metrics and scale so well to more complex problem areas.
Really enjoyed this write-up. I'm currently in the process of scaling out a Prometheus-based replacement for an old Nagios setup that was scaled to its limit and posts like this just make me that much more excited for Prometheus as a technology.