> <i>Why does systemd give up by default?</i><p>> <i>I’m not sure. If I had to speculate, I would guess the developers wanted to prevent laptops running out of battery too quickly because one CPU core is permanently busy just restarting some service that’s crashing in a tight loop.</i><p><i>sigh</i> … bounded randomized exponential backoff retry.<p>(Exponential: double the maximum time you might wait on each iteration. Randomized: the time you actually wait is a random amount in [0, current maximum] (yes, zero counts). Bounded: you stop doubling at a certain point, like 5 minutes, so you never wait longer than 5 minutes; otherwise, at some point you're waiting ∞ seconds, which I guess is like giving up.)<p>(The concern about logs filling up is a worse one. Backoff won't directly solve it, but a high enough maximum wait usually slows the rate of log generation enough that it stops mattering. Also, rotate your logs on size.)
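(For the curious, a minimal sketch of that retry loop in shell, assuming bash's $RANDOM and a placeholder start_service command:)<p><pre><code> max=1     # current maximum wait, in seconds
cap=300   # bound: never grow past 5 minutes
until start_service; do                  # start_service is a placeholder
    sleep "$(( RANDOM % (max + 1) ))"    # randomized: anywhere in [0, max], including 0
    max=$(( max * 2 ))                   # exponential: double the ceiling after each failure
    [ "$max" -gt "$cap" ] && max="$cap"  # bounded: clamp at the cap
done</code></pre>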
> I would guess the developers wanted to prevent laptops running out of battery too quickly<p>And I would guess sysadmins also don't like their logging facilities filling the disks just because a service is stuck in a start loop. There are many reasons to think that a service which has failed to start several times in a row isn't going to start on further attempts either. Misconfiguration is probably the most frequent reason.
This must be a different philosophy. When I see something like this happening, I investigate to find out <i>why</i> the service is failing to start, which usually uncovers some dependency that can be encoded in the service unit, or some bug in the service.
I can understand avoiding infinite restarts when there is something clearly wrong with the configuration, but I can't figure out why the "systemctl restart" command is also subject to this limit. For services that don't support dynamic reloading, restarting them is a substitute for a reload. This makes "systemctl restart" extremely brittle when used from scripts.<p>Nobody accidentally runs "systemctl restart" too fast; when such a command is issued, it is clearly intentional and should always be honored by systemd.
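(The workaround I know of for scripts that really must restart unconditionally is to clear the unit's failed/rate-limit state first; a sketch, with SYSTEMD_UNIT as a placeholder:)<p><pre><code> # Clear any start-limit / failed state, then restart
systemctl reset-failed "${SYSTEMD_UNIT}" || true
systemctl restart "${SYSTEMD_UNIT}"</code></pre>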
I recently discovered, while writing a monitoring script, that systemd exposes a few properties you can use to alert on a service that is continuously failing to start, if it's set to restart indefinitely.<p><pre><code> # Get the number of restarts for a service, to see if it exceeds an arbitrary threshold.
systemctl show -p NRestarts "${SYSTEMD_UNIT}" | cut -d= -f2
# Get when the service started, to work out how long it's been running, as the restart counter isn't reset once the service does start successfully.
systemctl show -p ActiveEnterTimestamp "${SYSTEMD_UNIT}" | cut -d= -f2
# Clear the restart counter if the service has been running for long enough based on the timestamp above
systemctl reset-failed "${SYSTEMD_UNIT}"</code></pre>
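Putting those together, a rough (untested) sketch of the check, with SYSTEMD_UNIT and THRESHOLD as placeholders:<p><pre><code> restarts=$(systemctl show -p NRestarts "${SYSTEMD_UNIT}" | cut -d= -f2)
started=$(systemctl show -p ActiveEnterTimestamp "${SYSTEMD_UNIT}" | cut -d= -f2)
if [ "${restarts:-0}" -ge "${THRESHOLD}" ]; then
    echo "ALERT: ${SYSTEMD_UNIT} restarted ${restarts} times (active since: ${started})"
fi</code></pre>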
It would be nice if `RestartSec` weren't constant.<p>Then you could have the default be 100ms for one-time blips, but (after a burst of failures) fall back gradually to 10s to avoid spinning during longer outages.<p>That said, beware of failure <i>chains</i> causing the intervals to add up. AFAIK there's no way to have the kernel notify you when a different process starts listening on a port.
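(For what it's worth, I believe newer systemd versions — v254+, if I remember right — added RestartSteps= and RestartMaxDelaySec=, which do roughly this; a sketch:)<p><pre><code> [Service]
Restart=on-failure
# first retries are quick, for one-time blips
RestartSec=100ms
# grow the interval over this many attempts, up to the max delay below
RestartSteps=8
RestartMaxDelaySec=10s</code></pre>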
I've always preferred daemontools and runit's ideology here. If a service dies, wait one second, then try starting it. Do this forever.<p>The last thing I need is emergent behavior out of my service manager.
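(To illustrate, a minimal runit-style run script; the path and binary are placeholders. runsv simply re-runs it, pausing about one second, every time it exits:)<p><pre><code> #!/bin/sh
# /etc/sv/myapp/run  (myapp is a placeholder)
exec 2>&1
exec /usr/local/bin/myapp --foreground</code></pre>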
I believe this allows you to have cascading restart strategies, similar to what can be done in Erlang/OTP: only after the start limit (StartLimitBurst= within StartLimitIntervalSec=) has been exceeded does systemd consider the service failed. Then services that have Requires= set on the failed service will be restarted/marked failed as well.<p>I think you can even have systemd reboot or move the system into a recovery mode (target) if an essential unit does not come up. That way, you can get pretty robust systems that are highly tolerant of failures.<p>(Though after reading `man systemd.unit`, I am not fully sure exactly how restarts cascade to requiring units.)
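A sketch of what that escalation could look like, as I understand the StartLimit*/OnFailure= settings (unit name and paths are placeholders):<p><pre><code> [Unit]
Description=Essential service (placeholder)
# five failed starts within 60s and the unit is marked failed
StartLimitIntervalSec=60
StartLimitBurst=5
# pull in recovery mode once the unit enters the failed state
OnFailure=rescue.target
# alternatively, reboot the machine when the start limit is hit:
# StartLimitAction=reboot

[Service]
# placeholder binary
ExecStart=/usr/local/bin/essential-daemon
Restart=on-failure
RestartSec=2</code></pre>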
I’ve been bitten by the restart limit many times. Our application server (backend) was crash-looping; the newest build fixed the crash, but systemd refused to restart the service due to the limit. A subtle but very annoying default behavior.
> And then you need to remember to restart the dependent services later, which is easy to forget.<p>You missed the other direction of the relationship.<p>I posted elsewhere in the thread on this: don't rely on entropy. Define your dependencies (well).<p>After=/Requires= are the obvious ones. People forget PartOf=.
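For example, a sketch of a worker unit tied to its parent (all names are placeholders): with PartOf=, restarting or stopping app.service also restarts/stops the worker, which covers the other direction mentioned above.<p><pre><code> # app-worker.service (names are placeholders)
[Unit]
# order the worker after the main service
After=app.service
# don't run the worker unless the main service is up
Requires=app.service
# stop/restart of app.service propagates to this unit
PartOf=app.service

[Service]
ExecStart=/usr/local/bin/app-worker
Restart=on-failure</code></pre>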