Looks like another faster horse. A pretty GUI on /proc is not the most burning issue to solve in Linux performance monitoring. I wish anyone making these tools would spend 30 minutes watching my Monitorama talk about instance monitoring requirements at Netflix: http://www.brendangregg.com/blog/2015-06-23/netflix-instance-analysis-requirements.html . I still hate gauges.

Where is the PMC support? At Facebook a few days ago, they said their number one issue was memory bandwidth. Try analyzing that without PMCs. You can't. And that's their number one issue. And it shouldn't be a surprise that you need PMC access to have a decent Linux monitoring/analysis tool. If that's a surprise to you, you're creating a tool without actual performance expertise.

It should front BPF tracing as well... Maybe it will in the future and I can check again.
Don't use the red-green combination in charts: it makes them really hard to read for those of us with a degree of red-green color blindness (the most common type, present in ~5% of the male and ~1% of the female population).

Other than that it looks AWESOME.
Well, it's pretty. It's probably great if you have one to five machines you care about, or you really want a pretty dashboard.

Notable features that I would need all relate to multi-server usage:

- central config across hosts

- alerting when values go over or under thresholds

- a mode for automatically selecting and viewing the machines which are working hardest, or not working

- a mode for viewing a few stats across all machines

- a mode for slide-show viewing of a few stats across all machines
https://github.com/firehol/netdata/wiki/Installation#nodejs

> I believe the future of data collectors is node.js

:(
Is there such a thing as 95th-percentile CPU monitoring?

Consider an application that spikes (close to 100% on a core) for 2-3s on some web requests -- let's assume this is normal (nothing can be done about it). Now, let's say the average user of the system is idle for 2 minutes per web request. So users won't see performance degradation unless $(active-users) > $(cores) within the same 2-3 second window.

For most monitoring systems, CPU is reported as an average over a minute, and even if it's pinned for only 2-3s per 60s, that's only 5% usage. Presume a 2-CPU system with 5 users, who all happen to be in a conference call... and hit the system at exactly the same time (but are otherwise mostly idle). The CPU graph might show 10-15% usage (no flag). Yet those 5 users will report significant application performance issues (one of them will have to wait 6-9s).

What I'd like to monitor, as a system administrator, is the 95th-percentile utilization of the CPUs -- that is, over the minute, throw away the bottom 94% of samples (mostly idle) and report the CPU utilization of the next highest percentile. This should show me those pesky CPU spikes. Does anything do that?
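Roughly what I have in mind, as a sketch (illustrative Python; the 1-second sampling interval, the nearest-rank percentile, and the whole-box /proc/stat aggregation are my own assumptions, not something any existing tool does):

    # Sketch: report the 95th-percentile of per-second CPU utilization each
    # minute, instead of the one-minute average. Interval and window size
    # are illustrative choices.
    import time

    def read_cpu_times():
        """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
        with open("/proc/stat") as f:
            fields = [float(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]          # idle + iowait
        total = sum(fields)
        return total - idle, total

    def percentile(samples, pct):
        """Nearest-rank percentile of a list of numbers."""
        ordered = sorted(samples)
        rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
        return ordered[rank]

    prev_busy, prev_total = read_cpu_times()
    window = []                               # one utilization sample per second

    while True:
        time.sleep(1)
        busy, total = read_cpu_times()
        delta_total = total - prev_total
        if delta_total > 0:
            window.append(100.0 * (busy - prev_busy) / delta_total)
        prev_busy, prev_total = busy, total

        if len(window) >= 60:                 # once per minute
            print("p95 CPU over last minute: %.1f%%" % percentile(window, 95))
            window = []

Per-core or per-process variants would need finer-grained sources (the per-cpu lines of /proc/stat, or /proc/<pid>/stat), but the percentile idea is the same.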
Gave it a try. Definitely not useful for running the daemon and viewing the UI on the same machine: Chrome alone eats 50% of one of the cores just to show the realtime data.

On my RPi B, the daemon eats 4% on average across all four cores, with almost all the time spent in the kernel. I assume polling the various entries under /proc/ is costly.
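A quick way to sanity-check that assumption is to time the reads themselves (rough sketch; the file list is just a guess at what a collector like this might scrape every second):

    # Rough timing of a /proc polling pass: read a handful of commonly
    # scraped entries in a loop and report the average cost per pass.
    # The file list is illustrative, not what netdata actually reads.
    import time

    PROC_FILES = ["/proc/stat", "/proc/meminfo", "/proc/net/dev",
                  "/proc/diskstats", "/proc/vmstat"]

    def poll_once():
        for path in PROC_FILES:
            with open(path) as f:
                f.read()

    N = 1000
    start = time.monotonic()
    for _ in range(N):
        poll_once()
    elapsed = time.monotonic() - start
    print("avg per polling pass: %.3f ms" % (1000.0 * elapsed / N))

Since reading /proc makes the kernel generate the text on the fly, a high per-pass cost on a slow SoC would line up with the time showing up as system time.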
The dashboard is gorgeous, one of the prettiest I've ever seen.

But I wish it were a Riemann/Graphite/whatever dashboard instead of reimplementing its own data collection system.

There is a need for great dashboards, but I don't feel any need for yet another format of data collection plugins.
Interesting! Really gorgeously rendered dashboards.

But also weird. The fact that the collectors, the storage *and* the UI all run on each box makes this more of a small-scale replacement for top and assorted command-line tools such as iostat than a scalable, distributed monitoring system. Lack of central collection means you cannot get a cluster-wide view of a given metric, nor can you easily build alerting into this.

I'm also disappointed that it reimplements a lot of collectors that already exist in mature projects like Collectd and Diamond (and, more recently, Influx's Telegraf). I understand that fewer external dependencies can be useful, but still, does every monitoring tool really need to write its own collector for reading CPU usage? You'd think there would be some standardization by now.

For comparison, we use Prometheus + Grafana + a lot of custom metrics collectors. Grafana is less than stellar, though. I'd love to have this UI on top of Prometheus.
At the moment (like, literally right now; I just took a break and saw this), I am configuring Graphite + collectd + Grafana (and probably Cabot on top for alerts), using Ansible to set up collectd and sync the configuration across the nodes.

After some time using Graphite + StatsD and friends, I have come to really appreciate the benefits of widely adopted open source components and the flexibility they give over all-in-one solutions such as this. On the other hand, solutions like this are much easier to configure, especially the first time, when you are not yet familiar with the tools.
It's great that they've got all that explanatory prose for the metrics. That would help when reviewing data with other team members who aren't familiar with the context of each of these.

I have less of a realtime system review need than a post-mortem need. Today, I'll use kSar to do that, but this tool looks much more capable.

It's too bad that it doesn't provide an init script or other startup feature. The installer, while it doesn't seem to follow typical distribution patterns, is otherwise fairly complete.
Did some spot checking. Found a race condition in the dictionary code in less than five minutes of poking around. Ugh.

Edit: the code that adds an entry to the dictionary releases its lock, whereupon you can wind up with duplicate NV pairs.
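To illustrate the class of bug (a minimal Python sketch of the pattern, not netdata's actual C code): if the lock is released between the lookup and the insert, two threads can both miss the same name and both append it.

    # Sketch of a check-then-insert race on a name/value store guarded by a lock.
    # Illustrative only; it just shows how dropping the lock between the lookup
    # and the insert allows duplicate NV pairs.
    import threading

    lock = threading.Lock()
    entries = []                      # list of (name, value) pairs

    def add_racy(name, value):
        with lock:                    # lookup under the lock...
            exists = any(n == name for n, _ in entries)
        if not exists:                # ...but the lock is gone here
            with lock:                # another thread may have added it meanwhile
                entries.append((name, value))

    def add_safe(name, value):
        with lock:                    # hold the lock across check *and* insert
            if not any(n == name for n, _ in entries):
                entries.append((name, value))

The usual fixes are holding the lock across both steps (as in add_safe above) or re-checking for the name after re-acquiring the lock, before inserting.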
It would be nice if it could show the processes that were running at the time of a peak in the graph.

Also, it would be nice if this could be run over multiple machines and show combined results.

Further, it appears that this tool shows information that other tools currently do not. It would also be nice if it allowed scripting and/or offered a CLI.
netdata is perfect for single-server monitoring, so it's perfectly suited to integration into my Centmin Mod LEMP stack installer: https://community.centminmod.com/threads/addons-netdata-sh-new-system-monitor-addon.7022/

For folks wanting multiple servers, the wiki does mention that, I believe, at https://github.com/firehol/netdata/wiki#how-it-works
Impressive. The dashboard could be condensed a bit, though; putting all the details on one page is a little overwhelming. Maybe have some tabs (CPU, memory, disk, network, etc.)?
nmon gives you console beauty without external dependencies. You can watch it in console mode and cron-schedule it in batch mode for long-term data collection.