At my company I'm considering switching us from Nagios to another monitoring system, and I'm starting to do some research. What's the best monitoring solution out there today? I'm pretty impressed by Prometheus, but I'd like to get some more opinions.
Prometheus.io is a modern, fresh monitoring system that I would check out if replacing a legacy system.<p>Also take a look at Riemann, a system monitoring tool written in Clojure. Riemann should be good for monitoring the latency of the system.<p>If it helps, here is a slide deck from Spotify on how they do their monitoring:
<a href="https://www.netways.de/fileadmin/images/Events_Trainings/Events/OSMC/2015/Slides_2015/Monitoring_at_Spotify_When_things_go_ping_in_the_night-Martin_Parm.pdf" rel="nofollow">https://www.netways.de/fileadmin/images/Events_Trainings/Eve...</a>
If you want to hand off all the hard stuff about monitoring and get some easy to use, core functionality (graphing, alerts, dashboards) then my company <a href="https://www.serverdensity.com" rel="nofollow">https://www.serverdensity.com</a> has been going 7+ years now.<p>For highly sophisticated environments then <a href="https://www.datadog.com" rel="nofollow">https://www.datadog.com</a> is a very advanced product.<p>Both are based off my original agent: <a href="https://github.com/serverdensity/sd-agent" rel="nofollow">https://github.com/serverdensity/sd-agent</a> (DataDog forked it in 2010 and we forked them back last year).<p>We're also behind <a href="http://www.humanops.com" rel="nofollow">http://www.humanops.com</a> trying to build monitoring that also helps you run on-call and ops teams generally in a way that considers fatigue, stress and the realities of the humans running IT systems! E.g. <a href="https://blog.serverdensity.com/introducing-alert-costs/" rel="nofollow">https://blog.serverdensity.com/introducing-alert-costs/</a> and <a href="https://opzzz.sh" rel="nofollow">https://opzzz.sh</a>
I've used Nagios and Icinga2, and I've become a huge proponent of check_mk. The documentation isn't great, but the product rocks: with very little time, you can start monitoring a slew of services (disk, hardware, logs, ntp) with almost no tweaking required. You can easily create custom checks, but you also have all the Nagios plug-ins it's compatible with. No daemon listens on the hosts being monitored. You get graphing for free (no setup time) for almost all your checks.<p>It uses Nagios under the hood, it's basically an automation system that generates those Nagios systems. The GUI is amazing, because it uses a plug-in so you don't have to edit files on disk to group your hosts or tweak the alerts. Those configs are snapshotted automatically at every change, and you can replicate that configuration automatically to remote servers. Download it from the upstream site instead of relying on distro package repositories.<p>Caveat, the documentation sucks, the GUI can be nonintuitive and it's hard to Google problems. It takes time to fully tune. Out of the box you'll probably still be impressed though.
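The custom checks mentioned above typically follow the Nagios plugin convention: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Here is a minimal sketch of that convention in Python. The check itself, the function names `evaluate_disk` and `run`, and the 80/90 percent thresholds are illustrative assumptions, not anything shipped with check_mk.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_disk(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to (exit_code, status_line)."""
    if used_pct >= crit:
        code = CRITICAL
    elif used_pct >= warn:
        code = WARNING
    else:
        code = OK
    # Text before "|" is for humans; text after it is performance data
    # in the label=value;warn;crit form that graphing add-ons can parse.
    line = "DISK {} - {:.1f}% used | used_pct={:.1f}%;{:g};{:g}".format(
        LABELS[code], used_pct, used_pct, warn, crit
    )
    return code, line


def run(path="/"):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print("DISK UNKNOWN - {}".format(exc))
        return UNKNOWN
    code, line = evaluate_disk(100.0 * usage.used / usage.total)
    print(line)
    return code
```

A wrapper script would finish with `sys.exit(run("/"))` so that the monitoring system sees the exit code.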
Best at what? What is driving you to want to switch?<p>I like Icinga a lot. I won't bother reviewing it; it is very well known. Professionally, my last two gigs have used Zabbix.<p>Zabbix, architecturally, is a nightmare. Uses an RDBMS for storing time-series data, so it wastes a ton of space on historic data while managing to be far slower than it needs to be when querying larger ranges. Uses an agent. Has a proxy-agent that, while handy, encourages all sorts of sketchy, error-prone monitoring topologies. With 3.0, the UI has crawled out of the awful range, and is now merely annoying. Takes the all-singing, all-dancing monolithic approach for the main app, including features for drawing maps on big screens.<p>For all that, it works well. Give it the hardware it wants, be sane in setting it up[1], ignore the goofy features (maps, inventory, screens - I guess someone must have requested those), and it is very solid and very powerful.<p>[1] The template system, pseudo-language for triggers, naming convention for variables and method of creating custom monitors take some getting used to. Expect to take the time to actually read the docs, and most likely to throw out your templates the first time you model your systems.
It depends on your needs and budget.<p>Can you afford time but not money? Try Sensu or Nagios.<p>Do you have money and not time? Try datadog.<p>Like someone else mentioned here, if you're looking to alert off of logs from ELK, try Elastalert.
Prometheus is absolutely the way you should be going. All of the other systems I'm seeing mentioned here — Nagios, Icinga, check_mk, Zabbix, Sensu — are host-centric and are very awkward when you try to bend them to fit modern (containerized, etc.) workloads.
If you can have a monitoring system in the cloud, Datadog is a great choice.<p>Good documentation, UI, many, many plugins and fair pricing (IMO).<p><a href="https://www.datadoghq.com/" rel="nofollow">https://www.datadoghq.com/</a><p>(I'm not affiliated with them in any way other than using their product on a pet project with many moving parts).
It depends on your architecture and scale. There is no "best", just "best we've found for this" and "best given other constraints".<p>This is yet another point where DevOps is not "devs doing ops" but "operations building and deploying with all the tools of modern software development". You need a subject matter expert.<p>What are you monitoring? Do you care about availability or performance or both? Scale? Do you have services or servers? Do you manage the underlying hardware? Do you need to track which hardware boxes have which VMs or containers?<p>There are a million questions to answer. One big set of them: what do you dislike about Nagios? Make sure that you don't get those problems with the next one, but also make sure you get something that does what you need as well as what you want.
Just my opinion, but I won't use Prometheus, because of the active polling model. It won't scale without a number of workarounds.<p>My preferred method is Icinga2 (a Nagios clone with better configuration and clustering built-in) with reports coming in via passive NSCA. Toss in Graphite (or I'm warming up to Grafana on Influx) with some ability to write alerts against those reported metrics, and you're as close to ideal as I can come up with.<p>Of course, that requires a fair bit of up-front knowledge to stand up and operate, but they're so rock solid (and scale like mad) I have a hard time not recommending them.
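Passive reporting means the monitored host runs the check itself and pushes the result to the server, instead of waiting to be polled. As a rough sketch, this builds one result in the tab-separated form that the `send_nsca` client reads on standard input (host, service, return code, output). The helper name `nsca_line` and the host and service names are hypothetical.

```python
def nsca_line(host, service, return_code, output):
    """Build one tab-separated service check result for send_nsca's stdin."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK) through 3 (UNKNOWN)")
    # Tabs and newlines are field and record delimiters in this format,
    # so collapse any whitespace inside the free-text output.
    clean = " ".join(output.split())
    return "{}\t{}\t{}\t{}\n".format(host, service, return_code, clean)


# Example: a worker reporting its own queue depth as a WARNING.
print(nsca_line("web01", "queue_depth", 1, "WARNING - 120 jobs waiting"), end="")
```

The resulting line would be piped to `send_nsca` from cron or from the application itself, so the server never has to open a connection to the host.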
At Stack Overflow we use a homebuilt Go solution called bosun: <a href="http://bosun.org/" rel="nofollow">http://bosun.org/</a> -- it runs on pretty much anything and lets us incorporate data from windows machines / linux machines in one place.
The one you use. I have sold and implemented these types of tools for the past ten years. Biggest problem is companies not actually fully implementing and using the tools they already own, and letting teams splinter off into their own tool sets.
I think it depends on your needs and software, how much time you want to invest, what you want to monitor, do you want to maintain it or you want saas?<p>You want metrics from counters you build in your app? (see statsd?)<p>You want to aggregate and do analysis on logs? (see ELK stack?)<p>You want to monitor cloud infrastructure (see stackdriver?)<p>You want to run end to end tests on your application to ensure it's behaving? (see runscope?)<p>As your application grows, you probably want a blend of tools to see inside your app.
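For the "counters you build in your app" case, the statsd wire protocol is small enough to sketch with the standard library alone: a counter is a UDP datagram of the form `name:value|c`, optionally followed by `|@rate` when sampling. The metric names below and the helper names `statsd_counter` and `send` are made up for illustration; 8125 is the conventional statsd port.

```python
import socket


def statsd_counter(name, value=1, sample_rate=None):
    """Encode a counter increment in the statsd line format."""
    payload = "{}:{}|c".format(name, value)
    if sample_rate is not None:
        payload += "|@{}".format(sample_rate)
    return payload.encode("ascii")


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: UDP means the app never blocks on the metrics server."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


print(statsd_counter("checkout.completed"))
print(statsd_counter("cache.miss", 3, 0.1))
```

In application code you would call something like `send(statsd_counter("checkout.completed"))` at the point where the event happens.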
Why not start with AWS CloudWatch: <a href="https://aws.amazon.com/cloudwatch/details/" rel="nofollow">https://aws.amazon.com/cloudwatch/details/</a> - a simple, scalable, out-of-the-box solution. It's much simpler than building similar functionality yourself.
Hi all,
I'm surely biased as I work at Instana (<a href="https://www.instana.com" rel="nofollow">https://www.instana.com</a>), but here's my opinion about monitoring.<p>Applications are changing dramatically and rapidly. With continuous delivery, the microservice approach, containers and orchestration tools, things are all over the place, and you might have a component spun up and down within a few minutes.
Humans cannot keep up with the data, and it doesn't make any sense to stare at a big screen all day, looking at charts and trying to visually correlate data.
The correlation of data is becoming harder and harder as systems are more and more resilient. There's, therefore, no unique root cause anymore (<a href="https://www.instana.com/blog/no-root-cause-microservice-applications/" rel="nofollow">https://www.instana.com/blog/no-root-cause-microservice-appl...</a>).<p>At Instana we're re-defining what monitoring means. We're moving the bar from visualizing data to providing a plain-English explanation of what's going on, together with suggestions for remediation.
Instana's 3 main values are:
- Automatic Discovery: dynamically models the architecture of infrastructure, middleware and services
- Automatic QoS Analysis: continuously derives KPIs of all components and services and alerts on incidents
- Integrated Investigation: visualizes in real-time physical and logical architecture, compares over time, suggests fixes and optimizations.<p>Happy to get feedback and provide more info.
Enrico
Hynek Schlawack gave a talk at PyCon this year about using Prometheus and Grafana to unify monitoring metrics. Honestly the talk goes beyond my own understanding, but you may find it helpful. He's quite knowledgeable.<p>> To get real time insight into your running applications you need to instrument them and collect metrics: count events, measure times, expose numbers. Sadly this important aspect of development was a patchwork of half-integrated solutions for years. Prometheus changed that and this talk will walk you through instrumenting your apps and servers, building dashboards, and monitoring using metrics.<p>Abstract - <a href="https://us.pycon.org/2016/schedule/presentation/1601/" rel="nofollow">https://us.pycon.org/2016/schedule/presentation/1601/</a><p>Slides - <a href="https://speakerdeck.com/hynek/get-instrumented-how-prometheus-can-unify-your-metrics" rel="nofollow">https://speakerdeck.com/hynek/get-instrumented-how-prometheu...</a><p>Video - <a href="https://www.youtube.com/watch?v=b-qLOY5ChnQ" rel="nofollow">https://www.youtube.com/watch?v=b-qLOY5ChnQ</a>
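To make "count events, measure times, expose numbers" concrete, here is a standard-library-only sketch that renders a counter and a timer in the Prometheus text exposition format (`# HELP`, `# TYPE`, then samples). It is not the official client library; the `Counter` and `Timer` classes and the metric names are assumptions made for this example.

```python
import time
from contextlib import contextmanager


class Counter:
    """Monotonically increasing count of events."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.value = 0

    def inc(self, amount=1):
        self.value += amount

    def render(self):
        return "# HELP {0} {1}\n# TYPE {0} counter\n{0} {2}\n".format(
            self.name, self.help_text, self.value
        )


class Timer:
    """Tracks how many timed blocks ran and their total duration."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.count = 0
        self.total = 0.0

    @contextmanager
    def measure(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.count += 1
            self.total += time.perf_counter() - start

    def render(self):
        # A summary with only _sum and _count is enough to derive averages.
        return (
            "# HELP {0} {1}\n# TYPE {0} summary\n"
            "{0}_sum {2}\n{0}_count {3}\n"
        ).format(self.name, self.help_text, self.total, self.count)


requests = Counter("http_requests_total", "Requests handled.")
latency = Timer("job_seconds", "Job duration.")
with latency.measure():
    requests.inc()
print(requests.render() + latency.render(), end="")
```

Serving that text over HTTP (conventionally at `/metrics`) is what lets a Prometheus server scrape it. In a real application the official client libraries handle this, along with labels and histograms.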
In the past we used Icinga at Zalando and it scaled for us to 40k checks; after that we got huge latency problems. We now use zmon <a href="https://github.com/zalando/zmon/" rel="nofollow">https://github.com/zalando/zmon/</a>, which is really great because it scales the checks. The graph database is KairosDB on top of Cassandra, which also scales. Even creating alerts can be automated, and development teams can add them themselves. You can easily build team dashboards, reuse checks/alerts and filter to your entities.
InfluxDB was a nice try, but clustering was very unstable in the beginning (tried with 0.7 and 0.8). If you don't want to be the monitoring configurator for your organization (application monitoring should also be created and maintained), I highly recommend using zmon (maybe Prometheus can also help). There is also a check to query Prometheus in zmon.
Most people here are recommending Prometheus. What is the best monitoring system to monitor good old infrastructure software like DNS servers, IMAP/SMTP server etc? Is Prometheus a reasonable choice for those as well?
We have had very good success with sensu. We like it better than nagios, but I haven't used many others so can't really say that sensu is better than everything.
So we just recently switched over to Wavefront from an aging Zabbix monitoring. We had tested and reviewed a few time series based monitoring systems and felt Wavefront was what we needed for Enterprise level monitoring.<p>Some of the key items we liked were:<p>* Able to consume millions of metrics per second. This is pretty huge. While we're not even close to that much (11k/s at the moment), we expect that number to triple or quadruple in the next year.<p>* Fast. Wavefront renders graphs quickly. The ability to manipulate the data in real time has been impressive.<p>* Feature requests. Wavefront has been receptive to ideas from their customer base. They even have a voting system in their community page if other customers like a certain request.<p>* Support has been great. Questions on issues or general technical guidance has been handled quickly, within the hour.<p>* Docker ready. Already using Wavefront with our emerging docker infrastructure.<p>* Engineers are self sufficient. Before, Tech Ops had to do all the monitoring for new services. With technologies such as docker, our engineers are capable of setting up monitoring within the application to directly send to Wavefront. This offloads quite a bit of work from Tech Ops.<p>No, I'm not affiliated with Wavefront. We just use their monitoring service.
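As a back-of-the-envelope check on the growth figures above (11k points per second today, tripling or quadrupling within a year), the daily volumes work out as follows. Only the 11k/s rate and the 3x and 4x factors come from the comment; the helper name is made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(points_per_second):
    """Daily data-point volume at a steady ingest rate."""
    return points_per_second * SECONDS_PER_DAY


current_rate = 11_000  # points per second, from the comment above
for label, factor in (("today", 1), ("tripled", 3), ("quadrupled", 4)):
    rate = current_rate * factor
    print("{:<10} {:>6}/s  {:>13,} points/day".format(label, rate, points_per_day(rate)))
```

That is roughly 0.95 billion points per day now and about 2.9 to 3.8 billion after the expected growth, still far below a system rated for millions of points per second.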
We use Icinga 2 at work which serves our needs well enough.<p>The configuration was a bit of an initial hurdle when coming from icinga 1 / nagios - the config syntax is essentially an EDSL for programming your monitoring requirements - but the flexibility is worth it. Adding new hosts and services is pretty cheap (programmer-time-wise), and I can use whatever programming constructs and conditions I want to decide what services to apply to which hosts in which measure.<p>That said, it's still in a bit of a young state and some parts are very rough around the edges - for example, icinga 2's dependency model is a bit naive. You can configure email notifications to ignore notifications for services that depend on a different failed host/service, but this only applies if icinga already knows about the dependency having failed. So when a parent service dies, an extra e-mail notification could be generated for each of its children before icinga realizes the parent has also died and stops sending notifications for them.<p>tl;dr I had fun setting it up and it works well for us, but expect some quirks
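The ordering problem described above can be shown with a toy model. This is not Icinga's actual implementation, just a sketch under the assumption that a child's notification is suppressed only when the parent's failure has already been observed; the function and service names are hypothetical.

```python
def notifications(check_order, parent_of, failed):
    """Return the services that trigger a notification, in arrival order.

    check_order: services in the order their check results arrive
    parent_of:   maps a service to the service it depends on (or None)
    failed:      set of services that are actually down
    """
    known_down = set()
    sent = []
    for service in check_order:
        if service not in failed:
            continue
        parent = parent_of.get(service)
        # Suppress only if the parent's failure has already been observed.
        if parent is None or parent not in known_down:
            sent.append(service)
        known_down.add(service)
    return sent


parent_of = {"db": None, "api": "db", "web": "db"}
failed = {"db", "api", "web"}

# Parent result arrives first: the children are suppressed.
print(notifications(["db", "api", "web"], parent_of, failed))
# Children are checked first: every failure produces its own e-mail.
print(notifications(["api", "web", "db"], parent_of, failed))
```

In this model the only thing that changes between one e-mail and three is the order in which the check results happen to arrive.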
We've been using Riemann and it's wonderful. There's a bit of a learning curve as the configurations are just Clojure code, but since it's all code you can build whatever you want on top of it if you know some Clojure. The DSL is well thought out and we ended up writing a REST API on top of Riemann to make our monitoring stack self-service for all the internal users.
Hey, there's a new metric system called monsoon. It's a framework and it's pretty impressive. It can do collectd and any JSON, and is both a push- and pull-based system. Check it out: <a href="https://github.com/groupon/monsoon" rel="nofollow">https://github.com/groupon/monsoon</a>. Support for Wavefront and PagerDuty is coming soon.
We use a combination of metric monitors, with Wavefront being the leading monitoring solution - integration is smooth, the querying language is simple and powerful, the graphs render fast and their support is very helpful - even after the contract is signed :)
Cray Advance Cluster Engine EMS ( <a href="http://www.cray.com/products/computing/cs-series?tab=cs_series" rel="nofollow">http://www.cray.com/products/computing/cs-series?tab=cs_seri...</a> ). Formerly Appro Cluster Engine.<p>Complete control and monitoring of cluster with either a CLI or GUI. Scalable monitoring with negligible impact on running workloads, including global synchronization of metric collection times, to minimize jitter. Ganglia front-end, but without the overhead of gmond/gmetric running on nodes. Validated as scaling well on a 8,000 node cluster.<p>Full disclosure: I designed and implemented the monitoring system.
Hey there, Librato here (<a href="https://www.librato.com/" rel="nofollow">https://www.librato.com/</a>)<p>Welp, nobody can blame you for wanting to get away from Nagios. It’s certainly a tool from a different, simpler era and hasn’t aged well in our opinion.<p>As a push-based metrics solution, Librato is probably a lot different than what you're used to. But don’t worry: we're super easy to get up and running with, and obviously you no longer need to worry about maintaining or scaling infrastructure. Also, unlike with some other solutions, you can use us with your existing toolchain (it’s easy to plug us into your existing Nagios infrastructure to try us - the trial is free & full-featured).<p>We’re a hosted metrics platform, meaning you can send metrics of any type and amount you want. We’re functionally similar to Graphite+Grafana, except we do all the work of scaling and management for you so you can focus on the metrics themselves. We provide alerting and other useful bits out of the box (things that are not trivial to set up yourself, e.g., bolting together collectd+Graphite+Grafana+statsd+flapjack+kitchen sink and hoping it scales and doesn’t fall over). We’ve got an agent that comes with a bunch of turn-key integrations too, to make it super easy for you to monitor what you care about.<p>As to pricing, we're the only hosted monitoring system that will just charge you for what you actually USE. You pay pennies per metric metered by the hour, instead of a per-node model, which gets crazy expensive and inefficient for modern ephemeral infrastructure. For example, if all you're doing is integrating us with AWS CloudWatch to monitor some EC2 instances and an RDS instance, we can do that for effectively $1-$2 an instance. We also have an agent you can install on your servers if you want more detailed metrics, which adds $5-10 per instance depending on how many metrics you enable.
Our customer success team (email support@librato.com, or the Help chat window if you already have a Librato account) will be more than happy to walk you through any permutation of our pricing and the details of the model to help you better understand it.<p>As mentioned, you can try us out for free--no credit card required: <a href="https://www.librato.com/" rel="nofollow">https://www.librato.com/</a>
We are using Wavefront at DoorDash and have been very happy with it. Setup is super easy, the UI is easy to use, and they have never had a major outage. Definitely something you can try out.
I used to use nagios and migrated to sensu for system checks. I was using graphite/seyren for time series and alerting, but doing a YoY or week over week was very slow especially if it's a lot of metrics. You should look at <a href="http://wavefront.com" rel="nofollow">http://wavefront.com</a><p>You can do some nice math functions for your alerts.
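For readers unfamiliar with the week-over-week comparison mentioned above, the underlying arithmetic is just each point compared with the one seven days earlier. A plain-Python sketch follows. It is not Wavefront's or Graphite's query language, and the sample data and the 5 percent alert threshold are invented.

```python
def week_over_week(daily_values):
    """Percent change of each day versus the same weekday one week earlier."""
    changes = []
    for i in range(7, len(daily_values)):
        previous = daily_values[i - 7]
        if previous == 0:
            changes.append(None)  # undefined when the baseline is zero
        else:
            changes.append(100.0 * (daily_values[i] - previous) / previous)
    return changes


def should_alert(change_pct, threshold_pct=5.0):
    """Alert when the metric dropped by more than the threshold."""
    return change_pct is not None and change_pct < -threshold_pct


signups = [100, 100, 100, 100, 100, 100, 100, 110, 90]
for change in week_over_week(signups):
    print(change, should_alert(change))
```

The slowness the comment describes comes from doing this across many metrics and long ranges at query time, which is the part a purpose-built time-series backend is meant to make cheap.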
AppSignal is also a cool product, although mostly focused on Rails applications. And, a big plus, they are working on an Elixir integration :-) <a href="https://appsignal.com/elixir" rel="nofollow">https://appsignal.com/elixir</a>
OpenNMS. It is a Java memory pig but is used by some of the twenty largest ASNs (by CAIDA ASRANK) in North America. Truly open source and free. Very extensible. It has a large development community behind it and frequent updates.
If you want monitoring plus automation and remote management check out <a href="http://www.kaseya.com/" rel="nofollow">http://www.kaseya.com/</a>
If you are okay with something which you don't have to run yourself-<p>The winner IMO is dataloop.io [0].<p>Dataloop is a SaaS monitoring solution that is super easy to get up and running and has tons of fantastic features and capabilities. The team behind it is stellar and their pricing is reasonable.<p>10/10, will continue to use again and again :)<p>[0] <a href="https://dataloop.io/" rel="nofollow">https://dataloop.io/</a>
If you care about correctness of data, solid data retention and good analytics (prediction, forecasting, etc.) then you should take a look at Circonus.<p><a href="http://www.circonus.com/" rel="nofollow">http://www.circonus.com/</a><p>500 metrics accounts are free for life.<p>Built by SREs for SREs.
Disclaimer: I evaluated most of these tools and wrote a blog post here. <a href="https://thehftguy.wordpress.com/2016/04/18/monitoring-in-the-cloud-datadog-vs-server-density-vs-stackdriver-vs-bmc-boundary-vs-newrelic/" rel="nofollow">https://thehftguy.wordpress.com/2016/04/18/monitoring-in-the...</a><p>It's a bit old and I'll update it later, but here is a short summary with all the latest tools:<p>###
Free (as in open-source) shitty options:
icinga, nagios, riemann<p>They suck so much they're not even worthy of having their names written.<p>###<p>The other open-source option is Prometheus.<p>I didn't try it personally, but I've had candidates interviewing at my company who talked at length about their experience with it, and they were satisfied.<p>I read the whole documentation and it's better than the old shitty tools, but it's still not great. Be aware that it has many limitations by design; they skipped all the hard stuff (single node only, no HA, pull-mode only for metrics).<p>----<p>The new SaaS tools (ordered by maturity), all $10-20 per host, are mostly copy-cats:<p>Datadog, BMC TrueSight Pulse (Boundary), SignalFx, Wavefront, Server Density.<p>Datadog is the best option. It's older (about 5 years) and more mature. It has the most features and integrations. It's really the next generation of monitoring.<p>BMC TrueSight Pulse is the historic competitor. It was a startup called "Boundary" that was bought by BMC, and BMC rebranded the product. It's about the same thing. Not sure what the acquisition may or may not have changed.<p>SignalFx is a direct copy-cat of Datadog (and BMC). But it came later, so it's lacking in features and integrations.<p>Wavefront is an even later copy-cat of Datadog and SignalFx. Except it has no public price nor public trial. You have to contact them and go through sales for anything. (Honestly: just ignore Wavefront. There are 3 direct competitors who are better and more accessible).<p>ServerDensity: Don't bother trying. The website is buggy; it fails to load pages very often. The product is not even finished and lacks 80% of the competitors' features. The company will probably die soon. (sorry for their employees who are commenting here and reading that :\ )<p>[Google] StackDriver: It was another company that was acquired by Google 2 years ago. Currently, it's dead and it's being integrated into Google's offerings.
That might be great when it comes back (probably this year, there seem to be some closed beta given by Google at the moment).<p>###
Current status quo:<p>Datadog beats everything by a wide margin. More mature, more features, more integrations. It has the advantage and it's evolving faster. That's the horse you have to put your money on (I did).<p>You can try the competitors (either BMC or SignalFx) if you wanna play around or just tickle the Datadog sales team to get a better price (I did) :D<p>###
Far future:<p>There might be a market disruption within 1-2 years when Google finally releases StackDriver. It had some quite advanced stuff and great reviews when it was acquired. It's the only one that might be able to catch up with Datadog and provide the very advanced stuff that doesn't currently exist (e.g. outlier detection done right).<p>If and when Google finally offers GCE (cheaper & faster than AWS) + Kubernetes (Docker and infrastructure on steroids) + StackDriver (complete monitoring AND logging solutions), they will be the best IaaS provider on the planet by a wide margin. The evolutions brought by these tools will allow me to do the work of 3 infra/SRE guys all by myself.
Plenty of folks building large scale custom monitoring solutions with InfluxDB (plus Grafana, Collectd, Telegraf etc) - <a href="https://influxdata.com/testimonials/" rel="nofollow">https://influxdata.com/testimonials/</a>
I know I might come off as trolling, but try to get out of the business of managing servers. I've done it, it sucks, and I don't ever want to do it again.<p>Get everything containerized and use a container runtime like ECS, unless you're operating in analytics, adtech, or something else with extreme storage/compute/network requirements.