Dealing with faults is easy: read the stack trace and stop doing what shouldn't be happening. But what about errors where something should be happening but isn't?<p>For example, I might have a notification system where users are reporting missed messages. They give concrete examples and I can verify that, given the state of the system, the notification should have been sent, but evidently it wasn't. This involves multiple systems, changing state and many potential failure points. It is a legacy system written by others, so I have an incomplete mental model of it and can't reproduce the error.<p>I'm an experienced software developer, but whenever I encounter an error like the above, my steps are:
* Backtrack from where something is supposed to happen, looking for potential logic bugs.
* Add more logging.
* Repeat.<p>It's not very efficient! I'm tempted to throw the system out and replace it, but we all know what kind of a trap that is.<p>So how do engineers here attack a problem like this? Are there activities you do, e.g. building flow diagrams, that help guide you through the process?
I would start at the point where something should be happening but isn't, and then manually check the conditions required to make it happen, one by one. Sometimes a condition itself relies on something else having happened that didn't, so I continue from there. Eventually you'll be able to unravel the entire chain of events.
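If the manual walk gets tedious, the same idea can be roughly sketched in code: write each precondition as a small check against a snapshot of the system state, link each one to the conditions it depends on, and walk backwards from the missing event until you reach failures that have no failing cause upstream. Everything below is hypothetical and only for illustration: the `Condition` class, the condition names, and the `snapshot` fields are assumptions, not anything from the system in the question.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Condition:
    """One precondition for the expected event (hypothetical model)."""
    name: str
    check: Callable[[Dict], bool]  # returns True if the condition held
    depends_on: List["Condition"] = field(default_factory=list)


def find_root_failures(cond: Condition, state: Dict) -> List[str]:
    """Return the names of the deepest failing conditions reachable from `cond`."""
    if cond.check(state):
        return []  # this condition held, so nothing upstream needs investigating
    upstream: List[str] = []
    for dep in cond.depends_on:
        upstream.extend(find_root_failures(dep, state))
    # If no dependency failed, this condition is itself where the chain broke.
    return upstream or [cond.name]


# Hypothetical chain for a notification that was never sent.
user_opted_in = Condition("user_opted_in", lambda s: s["opted_in"])
event_recorded = Condition("event_recorded", lambda s: s["event_id"] is not None)
job_enqueued = Condition("job_enqueued", lambda s: s["queued"],
                         depends_on=[event_recorded])
notification_sent = Condition("notification_sent", lambda s: s["sent"],
                              depends_on=[user_opted_in, job_enqueued])

# State reconstructed by hand from logs and the database for one reported case.
snapshot = {"opted_in": True, "event_id": None, "queued": False, "sent": False}

root_failures = find_root_failures(notification_sent, snapshot)
print(root_failures)
```

For this snapshot the walk skips `user_opted_in` (it held), follows `job_enqueued` (it failed) and stops at `event_recorded`, which failed without any failing dependency of its own. The useful side effect is that the dependency graph becomes a written-down version of the mental model you were missing.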