We just learned the hard way that the NewRelic disk usage alert will never trigger if a user-level process fills up the disk to the point of "no space left on device", unless you change the defaults of either the alert threshold or your ext4 filesystem.

Running df on a "completely filled" disk gives you this:

$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda1        9611492 9123252         0 100% /

Notice the difference between "Used" and "Available"? The gap is the space reserved for root processes, so that careless users can't crash the system just by filling up the disk. See this Stack Exchange question for further discussion: http://unix.stackexchange.com/questions/7950/reserved-space-for-root-on-a-filesystem-why

While df calculates the disk usage percentage as Used/(Used+Available), NewRelic calculates it as Used/1K-blocks.

This means that when our disk was filled by a rogue process writing a huge logfile, the NewRelic disk usage measurement got stuck at 94.9%. Since the NewRelic default threshold for disk usage alerts is 95%, the alert never triggered, the alert email never got sent, and we had a service outage because the streaming server process crashed when it couldn't write to the disk.

End of story, to put it in the words of NewRelic support staff: "[…] you should really set to threshold under 95%, or tune your filesystem so that you have <5% reserved […]". For them the current state is intended behaviour. So heads up if you rely on NewRelic disk usage alerts!
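To make the difference concrete, here is a minimal Python sketch using the numbers from the df output above; the variable names are my own, it just reproduces the two formulas:

    # Numbers from the df output above (in 1K blocks)
    blocks = 9_611_492     # "1K-blocks" column: total size of /dev/sda1
    used = 9_123_252       # "Used" column
    available = 0          # "Available" column: what non-root users can still write

    # df-style: Used / (Used + Available) -- the reserved root space is ignored
    df_percent = 100.0 * used / (used + available)

    # NewRelic-style: Used / 1K-blocks -- the reserved root space counts as free
    newrelic_percent = 100.0 * used / blocks

    print(f"df reports:       {df_percent:.1f}%")        # 100.0%
    print(f"NewRelic reports: {newrelic_percent:.1f}%")  # 94.9%

With the default 5% of blocks reserved for root, the NewRelic number can never reach the default 95% threshold, which is exactly what bit us. If you go the filesystem route instead of lowering the threshold, the reserved percentage on ext4 can be inspected with tune2fs -l and changed with tune2fs -m.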