For monitoring and alerts I look to how industrial SCADA does it.<p>Unfortunately I have no code to share, because I'm a dev rather than a sysadmin, I do backups and such at home with the GUI, and I don't work on anything microservicy, so I've only done monitoring of features within one monolithic application.<p>My preferred way to monitor a backup task would just be to use a backup tool that has its own monitoring built in, or integrations with a popular monitoring solution. I've done DIY backup scripts; it always seems so simple that you might as well just write a few lines... But it's also such a common use case that there are lots of really nice options.<p>I've done the systemctl --failed thing on every new terminal, and probably should go back to doing so, but it doesn't do much if you're not logging in regularly. It does help when you're logging in to see what went wrong, though.<p>But the general idea, when I have actually implemented monitoring, is that alerts are state machines. They go from normal, to tripped, to active.<p>If you acknowledge an active alert, it becomes acknowledged. If the bad condition goes away, it becomes cleared, and returns to normal when acknowledged (or instantly, if auto-ack is selected).<p>Every alert has a trip condition, which can be any function of one or more "tag points" (think observable variables with lots of extra features).<p>A tripped alert only becomes active if it remains tripped for N seconds, to filter out irrelevant blips caused by normal dropped packets and such, while still logging them.<p>While an alert is active, it shows in the list on the server's admin page, and can periodically make a noise or send some other reminder. Eventually I'd like to find some kind of MQTT dashboard solution that shows everything in one place and sends messages to an app, but I haven't needed anything like that yet.<p>Under the hood the model is fairly complex, but you don't have to think about it much to use it.
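<p>The alert lifecycle described above (normal, tripped, active, acknowledged, cleared) can be sketched roughly like this in Python. This is only an illustration, not the actual implementation: the names `Alert`, `State`, `update`, `acknowledge`, `trip_delay` and `clock` are all made up, the trip condition is reduced to a plain callable standing in for a function of tag points, and two transitions the description leaves open are assumptions (an acknowledged alert goes straight back to normal once the condition goes away, and a cleared alert goes back to active if the condition returns).

```python
import time
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    NORMAL = "normal"
    TRIPPED = "tripped"
    ACTIVE = "active"
    ACKNOWLEDGED = "acknowledged"
    CLEARED = "cleared"


class Alert:
    """Hypothetical alert; `condition` stands in for a function of tag points."""

    def __init__(self, condition: Callable[[], bool], trip_delay: float,
                 auto_ack: bool = False,
                 clock: Callable[[], float] = time.monotonic):
        self.condition = condition      # returns True while things are bad
        self.trip_delay = trip_delay    # the "N seconds" before going active
        self.auto_ack = auto_ack
        self.clock = clock              # injectable so the logic is testable
        self.state = State.NORMAL
        self.tripped_at: Optional[float] = None

    def update(self) -> State:
        """Poll the trip condition once and advance the state machine."""
        bad = self.condition()
        now = self.clock()
        if self.state is State.NORMAL and bad:
            self.state = State.TRIPPED
            self.tripped_at = now
        elif self.state is State.TRIPPED:
            if not bad:
                # Short blip (dropped packet etc.): never becomes active.
                self.state = State.NORMAL
            elif now - self.tripped_at >= self.trip_delay:
                self.state = State.ACTIVE
        elif self.state is State.ACTIVE and not bad:
            # With auto-ack there is nothing left to acknowledge.
            self.state = State.NORMAL if self.auto_ack else State.CLEARED
        elif self.state is State.ACKNOWLEDGED and not bad:
            # Assumption: already acknowledged, so go straight to normal.
            self.state = State.NORMAL
        elif self.state is State.CLEARED and bad:
            # Assumption: the condition came back before anyone acknowledged.
            self.state = State.ACTIVE
        return self.state

    def acknowledge(self) -> State:
        """Operator acknowledgement; only meaningful when active or cleared."""
        if self.state is State.ACTIVE:
            self.state = State.ACKNOWLEDGED
        elif self.state is State.CLEARED:
            self.state = State.NORMAL
        return self.state
```

<p>In this sketch something would call `update()` on a timer and `acknowledge()` from the admin page. Logging of filtered-out trips and the periodic reminders are left out to keep it short.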