I'm looking for the right tool to track "operational" health, which in my context is mostly the size of various queues (e.g., how many pending withdrawals there are, split by type of withdrawal).<p>Maybe some other business KPIs too, but the emphasis is on building "graphs that the operational team should be looking at on a ~hourly basis to prioritize fixes and spot systemic regressions."<p>Ideally it would integrate with PagerDuty, so that it'd be easy to express "page me if X is above Y."<p>My preference is to have my own cron job that captures the metrics from our queues/databases and pushes them to the tool via a ~REST API, but I could be convinced of a different method.<p>So far the only thing I can think of is Datadog, specifically https://www.datadoghq.com/solutions/real-time-business-intelligence/<p>Anything else I should be looking at? Or any other tips?
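For what it's worth, the cron-push approach described above can be sketched roughly like this. It's only an illustration under loud assumptions: the endpoint URL, the metric name `withdrawals.pending`, the JSON shape, and all function names are made up and are not any particular vendor's API, so you'd swap in whatever ingest format your chosen tool documents.

```python
import json
import time
import urllib.request

# ASSUMPTION: hypothetical ingest endpoint, not a real vendor URL.
METRICS_URL = "https://metrics.example.com/api/v1/gauges"


def build_payload(queue_sizes, now=None):
    """Turn {withdrawal_type: pending_count} into a list of gauge points.

    The metric name and JSON layout are illustrative assumptions.
    """
    ts = int(now if now is not None else time.time())
    return [
        {
            "metric": "withdrawals.pending",
            "tags": {"type": wtype},
            "value": count,
            "timestamp": ts,
        }
        for wtype, count in sorted(queue_sizes.items())
    ]


def breaches(queue_sizes, thresholds):
    """Return the types whose pending count is above its threshold,
    i.e. the "page me if X is above Y" check done on the client side."""
    return sorted(
        wtype
        for wtype, count in queue_sizes.items()
        if wtype in thresholds and count > thresholds[wtype]
    )


def push(payload, url=METRICS_URL):
    """POST the payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

A cron entry would run a small script that queries the database for the counts, calls `push(build_payload(sizes))`, and leaves alerting to the tool. `breaches` is only there to show the threshold logic; in practice you'd probably define that rule in the monitoring tool so it can route to PagerDuty.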
There's an indie hacker I follow building this: <a href="https://chartbrew.com/" rel="nofollow">https://chartbrew.com/</a><p>From what I understand, it's more like an easier-to-use Grafana, where you can build charts/graphs from different data sources.
SaaS-wise, there’s Datadog, Splunk, New Relic (expensive), and others.<p>Self-hosted-wise: Prometheus/Grafana, the ELK stack, maybe old-school Nagios.<p>AWS also has managed Grafana/Prometheus for a fairly reasonable price. I would recommend this if you don’t have a lot of time to mess around updating your Grafana stack every few months.