Datadog is very nice. Here's something I wrote when asked what value we were getting from it: <a href="https://gist.github.com/sciurus/3a1cd4c203891c8d33b2" rel="nofollow">https://gist.github.com/sciurus/3a1cd4c203891c8d33b2</a><p># Why Datadog? #<p>I would break it down into four pieces. Datadog is<p>1. providing functionality<p>1. we need<p>1. in an easy-to-use manner<p>1. that would be difficult to build and maintain ourselves<p># 1) Functionality #<p>## The agent ##<p>It gathers system metrics, integrates with key software we use, and provides a standard interface to which our applications can send custom metrics.<p>## Integrations ##<p>Datadog has prebuilt integrations to pull data from almost every important service we use.<p>## Events ##<p>Through the integrations, Datadog generates a consolidated event stream that we can filter and search as needed.<p>## Dashboards ##<p>Datadog lets us build dashboards that combine metrics from many different sources. We can combine and transform metrics to make them more useful. It also provides a powerful interface for interactive exploration of metrics.<p>## Alerting ##<p>Datadog has nice stream processing capabilities for generating alerts, and it can surface them in services we use like PagerDuty and Slack.<p># 2) Need #<p>## The agent ##<p>We don't get nearly enough insight from CloudWatch alone; we need an on-instance tool to gather system and app metrics.<p>## Integrations ##<p>There are lots of services with operational significance, but many of them don't provide a good way to access their data.<p>## Events ##<p>We would spend <i>dramatically</i> longer investigating problems if we had to look at each source of events in isolation. Many of our event sources don't even provide a way for us to view past events or to query them.<p>## Dashboards ##<p>Per-service and per-instance dashboards are important for investigating problems quickly. 
The consolidation of data from multiple sources is again a key feature.<p>## Alerting ##<p>We need to analyze trends in our metrics and alert on them.<p># 3) Ease of use #<p>## The agent ##<p>The agent is deployable via a Chef cookbook Datadog wrote for us. It requires minimal configuration. It knows which system and application metrics are worth gathering.<p>## Integrations ##<p>Integrating with all the data sources takes literally a few clicks.<p>## Events ##<p>The interface makes searching and filtering events straightforward.<p>## Dashboards ##<p>There are prebuilt dashboards for lots of things we care about. Snazzy features like autocomplete and templating make building our own dashboards easy.<p>## Alerting ##<p>The guided steps and previewed outputs make creating alerts simple.<p># 4) Hard to replicate #<p>Here I described a system of collectd, custom code to pull metrics from CloudWatch, custom code to pull or receive events from various sources (Airbrake, CloudTrail, Chef, PagerDuty, Jenkins, etc.), InfluxDB, and Grafana.