I use runit in production for <a href="http://typing.io" rel="nofollow">http://typing.io</a>. I appreciate runit's strong Unix philosophy (shell scripts instead of DSLs). However, I'm starting to experiment with systemd because of features like properly tracking and killing services [1]. That feature would be useful for a task like upgrading an nginx binary without dropping connections [2], which isn't possible with runit (or most process monitors) because nginx double-forks during the upgrade, breaking out of its supervision tree.<p>[1] <a href="http://0pointer.de/blog/projects/systemd-for-admins-4.html" rel="nofollow">http://0pointer.de/blog/projects/systemd-for-admins-4.html</a><p>[2] <a href="http://wiki.nginx.org/CommandLine#Upgrading_To_a_New_Binary_On_The_Fly" rel="nofollow">http://wiki.nginx.org/CommandLine#Upgrading_To_a_New_Binary_...</a>
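For reference, the on-the-fly upgrade in [2] is driven purely by signals to the nginx master process. A rough sketch, assuming the default pidfile location (in practice you'd verify the new master is healthy before killing the old one):

```shell
#!/bin/sh
# Sketch of nginx's documented on-the-fly binary upgrade, driven by signals.
# The pidfile path is an assumption; check the "pid" directive in nginx.conf.
PIDFILE=${PIDFILE:-/run/nginx.pid}

upgrade_nginx() {
    old_pid=$(cat "$PIDFILE") || return 1
    kill -USR2 "$old_pid"                   # new master forks with the new binary;
                                            # old pidfile is renamed to *.oldbin
    sleep 2                                 # give the new master time to start workers
    kill -WINCH "$(cat "$PIDFILE.oldbin")"  # old master: gracefully stop workers
    kill -QUIT  "$(cat "$PIDFILE.oldbin")"  # old master: exit once workers are done
}
```

Note the new master is a child of the old one rather than of the supervisor, which is exactly what breaks the supervision tree.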
You forgot one:<p>sysv init!<p>All of my systems' processes are managed by it and have been for at least two decades.<p>Occasionally I do these periodic tasks as well, which are handled by a thing called "cron".<p>Yes this is sarcasm. There is a lot of wheel-reinventing done these days which is entirely unnecessary if you consider the long-forgotten "Unix philosophy".
Monit is by far your best bet. It's easy to install, packaged on most distros, and does reactive monitoring (acting on failures itself), as opposed to traditional monitoring systems like Nagios that mostly just alert. Plus you can enable its web interface for easy browsing of monitored processes.<p>EDIT:
I liked it so much (and it was so easy) that I wrote a blog post expounding how much I liked it and how to use it. <a href="http://moduscreate.com/monit-easy-monitoring/" rel="nofollow">http://moduscreate.com/monit-easy-monitoring/</a>
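For a flavor of the reactive part, a minimal monit check looks something like this (service name, paths, and port are illustrative):

```
check process nginx with pidfile /var/run/nginx.pid
  start program = "/etc/init.d/nginx start"
  stop program  = "/etc/init.d/nginx stop"
  if failed host 127.0.0.1 port 80 protocol http then restart
  if 5 restarts within 5 cycles then timeout
```

The last line is the safety valve: monit gives up instead of restart-looping a service that's broken for real.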
Folk on Solaris / Illumos would probably like SMF to be added to the list.<p>I'd go on to mention various z/OS subsystems, but that's a bit esoteric even for HN :D<p>(Process management ties into my larger rant that nobody has properly combined it with configuration management. But nobody has time for this nonsense.)
Nothing. The answer is that you don't monitor production processes directly, ever; it's a waste of your time and effort. Certainly this sort of foolishness should not be used to page an employee off-hours, if that's where you're headed.<p>The only thing you need to monitor is whether a server answers the network requests it was designed to. Beyond that you might optionally want to know whether the disk is full, RAM is maxed out (pushing Linux into swap), or the CPU runs too high to cope with losing some servers at peak; but really that's all optional if you're in EC2 and can just spin up more servers at a moment's notice.<p>You can gather all this data yourself with New Relic, or send it to Graphite, or if you're old-fashioned use Icinga in place of Nagios, since it keeps history in a database. If the developers want to know about the process for the application they implemented, you can put New Relic on the server for them, along with the system-level agent; just don't pay attention to it, or pretend it's important, until something breaks.<p>The important catch here, the thing that is critical to this whole line of thinking: you have to have thought things through before you built them, focused on one service per OS and real redundancy throughout the environment. Critically, your kick (automated rebuild) should be fast enough that if a server has some kind of problem in production you don't fix it, you just re-kick it. That means your kick lays down the OS, then triggers Salt or Ansible or Chef to configure every single detail, and then triggers a deploy of internally developed applications. And that means you have to test the kick to death before you can rely on it to rebuild something live.
If the problem is recurring you can use immediate tools, jdump or whatever, to get some data, give it to the application's developers, and let them try to recreate it in staging while you go ahead and re-kick the prod server and go back to writing documentation for lesser ops to not read, drinking at your desk, reading hackernews, acting as a cia listening post for cat pictures or whatever else passes the time.
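The "monitor the answer, not the process" idea boils down to something like this; in practice the check lives in Icinga/Nagios/New Relic rather than a script, and the URL is a placeholder for whatever your service actually serves:

```shell
#!/bin/sh
# Outside-in health check: hit the endpoint the service is supposed to answer.
# The URL is a placeholder for your service's real endpoint.
check_url() {
    # -f: treat HTTP error codes as failure; -sS: quiet, but still show errors
    curl -fsS --max-time 5 "$1" >/dev/null 2>&1
}
# e.g.: check_url http://app.example.com/healthz || echo "take it out of rotation"
```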
Some points of note:<p>1. daemontools and runit are practically identical. I somewhat prefer runit, as svlogd has a few more features than multilog (syslog forwarding, more timestamp options), and sv has a few more options than svc (it can issue more signals to the supervised process).<p>2. Among the criteria I look for in a process manager are (1) the ability to issue any signal to a process (not just the limited set of TERM and HUP), and (2) the ability to run some kind of test as a predicate for restarting or reloading a service. The latter is especially useful to help avoid automating yourself into an outage. As far as I'm aware, none of the above process supervisors can do the latter, so I tend to eschew them in favor of init scripts and prefer server implementations that are reliable enough not to need supervision.
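Criterion (2) is easy enough to bolt on around a supervisor yourself; a sketch using runit's sv, where the check command and service name are placeholders:

```shell
#!/bin/sh
# Gate a reload behind a config-check predicate, so a bad config never takes
# the service down. "sv hup" sends SIGHUP to a runit-supervised service;
# the check command and service name here are placeholders.
safe_reload() {
    check_cmd=$1
    service=$2
    if $check_cmd; then
        sv hup "$service"
    else
        echo "config check failed; not reloading $service" >&2
        return 1
    fi
}
# e.g.: safe_reload "nginx -t" nginx
```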
I've been using supervisord for most everything (doesn't hurt that I'm primarily a python guy), but I'm slowly testing out Mozilla's circus (<a href="https://github.com/mozilla-services/circus" rel="nofollow">https://github.com/mozilla-services/circus</a>) and it's been going great so far.
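For anyone curious, a minimal supervisord program section looks something like this (program name and paths are illustrative):

```
[program:myapp]
command=/usr/local/bin/myapp --port 8080
directory=/srv/myapp
autostart=true
autorestart=true
stdout_logfile=/var/log/myapp.out.log
stderr_logfile=/var/log/myapp.err.log
```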
I don't, currently, because I don't have to monitor services, but if I did, I think I'd likely use daemontools, based on the fact that djb really, really understands how to write Unix software.
systemd handles service monitoring natively, as well as socket management and many aspects of container management. It's a superset of most of the tools listed.
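As a sketch of that superset: one declarative unit file covers supervision and cgroup resource limits (service name and paths are illustrative; MemoryLimit= is the resource-control directive of this era):

```
# /etc/systemd/system/myapp.service
[Unit]
Description=My application

[Service]
ExecStart=/usr/local/bin/myapp
Restart=on-failure
MemoryLimit=512M

[Install]
WantedBy=multi-user.target
```

Because systemd tracks the service's cgroup rather than a single PID, even a double-forking daemon can't escape it.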
SNMP for centralized systems monitoring, cron jobs for "daemon management". I dislike tools whose system footprint is sometimes larger than the things they're supposed to manage, just to do simple things. No, they're not always extensible. When a 4-5 line shell script using less than 100K of RAM can accomplish the same thing as a giant process taking 40-50MB of RAM, I know which one I prefer (generally).
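In that spirit, the whole "daemon manager" can be a cron-run script on this order (the daemon command and pidfile are placeholders for your actual service):

```shell
#!/bin/sh
# Minimal cron watchdog: if the pidfile's process is gone, start the daemon.
# DAEMON and PIDFILE are placeholders; point them at your real service.
DAEMON=${DAEMON:-/bin/true}            # placeholder start command
PIDFILE=${PIDFILE:-/tmp/mydaemon.pid}  # placeholder pidfile path
STATUS=running
if ! kill -0 "$(cat "$PIDFILE" 2>/dev/null)" 2>/dev/null; then
    "$DAEMON" && STATUS=restarted      # daemon was down; start it again
fi
echo "$STATUS"
```

Run it from cron, e.g. `* * * * * /usr/local/bin/watchdog.sh`.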
systemd for most servers; svc / runsvdir (BusyBox's build of the daemontools/runit tools) for embedded devices.<p>The code is designed to be shot, and will always recover after a restart. runsvdir/svc will neatly reap a dead process and start it again, and can be used separately.
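For anyone who hasn't seen one, an entire service definition in this style is just a run script (path and binary are illustrative):

```
#!/bin/sh
# /etc/service/myapp/run -- illustrative; runsv executes this and
# restarts the daemon whenever it exits
exec 2>&1                      # send stderr to the service's log
exec /usr/local/bin/myapp      # must stay in the foreground (no double-fork)
```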
Do any of these integrate with cgroups? I've found myself wanting to specify some rules about resource usage on occasion, and cgroups seem conceptually nice, but I'm not sure how to work them nicely into my other tools, short of writing custom shell scripts to manipulate /sys/fs/cgroup.
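For what it's worth, the hand-rolled version isn't much shell. A sketch against the cgroup filesystem (the group name, limit, and memory-controller mount point are assumptions, and it needs root):

```shell
#!/bin/sh
# Sketch: run a command inside its own memory-limited cgroup by hand.
# Assumes a cgroup-v1 memory controller mounted at /sys/fs/cgroup/memory
# and root privileges; the group name and limit are whatever you choose.
run_limited() {
    name=$1; limit_bytes=$2; shift 2
    cg=/sys/fs/cgroup/memory/$name
    mkdir -p "$cg"
    echo "$limit_bytes" > "$cg/memory.limit_in_bytes"
    echo $$ > "$cg/tasks"   # move this shell into the group
    exec "$@"               # the command inherits the cgroup membership
}
# e.g.: run_limited webjob 268435456 /usr/local/bin/worker
```

(systemd, per the comment above, does this for you declaratively.)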
Anyone with any experience using Angel (<a href="https://github.com/MichaelXavier/Angel" rel="nofollow">https://github.com/MichaelXavier/Angel</a>)?
Have been using monit to keep a couple of node.js services online / monitor their PIDs and HTTP interfaces. It's been a positive experience so far.
Please add God[1] to the list; we use it in production.<p>Also, a shameless plug for my side project, a service management tool for multi-project development: Hack On[2]<p>[1] <a href="http://godrb.com/" rel="nofollow">http://godrb.com/</a>
[2] <a href="https://github.com/snikch/hack" rel="nofollow">https://github.com/snikch/hack</a>
<a href="https://github.com/caldwell/daemon-manager" rel="nofollow">https://github.com/caldwell/daemon-manager</a><p>I've been dogfooding it in a production environment for a couple years and it's been pretty solid.
I wish that s6 (skarnet.org's small and secure supervision software suite) [1] were more-widely packaged and available on distros. It's very much in the same vein as daemontools, but with some improvements. While certainly biased, the author wrote a pretty good breakdown and comparison of why s6 was developed [2].<p>[1] <a href="http://www.skarnet.org/software/s6/" rel="nofollow">http://www.skarnet.org/software/s6/</a><p>[2] <a href="http://www.skarnet.org/software/s6/why.html" rel="nofollow">http://www.skarnet.org/software/s6/why.html</a>
I use supervisord now, before I would use mon/mongroup [1] which is just a tiny C program to monitor stuff.<p>I have also used god at some point, but I kept having trouble. I can't remember exactly what was wrong but it never quite worked correctly for me. Probably PEBCAK.<p>[1] <a href="https://github.com/jgallen23/mongroup" rel="nofollow">https://github.com/jgallen23/mongroup</a>
upstart, because I wouldn't use a job control system in prod that isn't included with the base distribution (do any base unix distros ship monit or supervisord?). It's just too much useless work to rewrite job control logic for daemons when the OS already gives it to you, and I've been quite surprised by the feature-completeness of upstart.
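For comparison, a complete upstart job for a simple daemon is short (job name and path are illustrative):

```
# /etc/init/myapp.conf
description "My application"
start on runlevel [2345]
stop on runlevel [016]
respawn
respawn limit 10 5
exec /usr/local/bin/myapp
```

The `respawn limit` stanza stops upstart from restart-looping a crashing service forever.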
Histogram so far:<p><a href="http://quickhist.onloop.net/monit=75,supervisord=107,daemonize=2,runit=30,perp=2,DJB%27s%20daemontools=33,systemd=54,god=18,upstart=76/Unix%20process%20management%20tools%20-%20Hacker%20News%20Poll%20Jul%20%2713" rel="nofollow">http://quickhist.onloop.net/monit=75,supervisord=107,daemoni...</a>
I use upstart, but am not happy with it for a number of reasons. Two important ones: "restart" does not reread the configuration file, and the DSL is poorly designed (the "respawn" stanza, among others).<p>I haven't looked recently at alternatives, but I'm open to them.
For daemons: none of the above (just init), monitored with Zabbix. I assume services don't crash, and hey, they don't (not that I know of, in any case).<p>Unless you really did mean process and not daemon, in which case it's supervisord.
Personally I've had great success with supervisord, no success with god, and good experiences with monit. But I'm curious: whatever happened to good ol' Linux watchdog?
Reactive monitoring via Riemann: <a href="http://riemann.io/" rel="nofollow">http://riemann.io/</a><p>We use this to monitor services at the application level.
Pacemaker if you need to keep it alive no matter what.<p>Systemd was pretty stable until user mode flat out broke in 205. I use it to manage my entire desktop session.
Older HN post for reference:<p><a href="https://news.ycombinator.com/item?id=1368855" rel="nofollow">https://news.ycombinator.com/item?id=1368855</a>
forever (<a href="https://github.com/nodejitsu/forever/" rel="nofollow">https://github.com/nodejitsu/forever/</a>) has worked great for me, but doesn't make any sense if you're not running node.js applications.
Kinda sad that "nothing" isn't on the list. I just use software that isn't broken, so it doesn't need to be constantly restarted.