hmmm, reminds me of <a href="http://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html" rel="nofollow">http://www.twilio.com/blog/2013/07/billing-incident-post-mor...</a><p>If you have a software stack and it is going to do something that is not idempotent (like billing customers or sending emails), you need a state machine more complex than "not done" and "done". You need a "doing" state. After service is restored and everything is running smoothly, you go through all the tasks stuck in "doing" and decide whether to retry or abort, based on other logs or an evaluation of the consequences of not acting vs double acting. What you do not do is have your software just keep hammering away until everything magically turns "done".
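To make the "doing" state concrete, here is a minimal sketch of that three-state idea. Everything in it is an assumption for illustration: `TaskStore` is a hypothetical in-memory stand-in for whatever durable store a real system would use, and the extra `ABORTED` state is just one way to record an operator's "abort" decision.

```python
from enum import Enum


class State(Enum):
    NOT_DONE = "not done"
    DOING = "doing"
    DONE = "done"
    ABORTED = "aborted"  # assumption: how an operator's "abort" decision is recorded


class TaskStore:
    """Hypothetical in-memory stand-in for a durable task table."""

    def __init__(self):
        self.states = {}

    def add(self, task_id):
        self.states[task_id] = State.NOT_DONE

    def run(self, task_id, action):
        # Only start tasks that were never attempted; never blindly re-run
        # a task that is already "doing", "done", or "aborted".
        if self.states[task_id] is not State.NOT_DONE:
            return False
        # Record the attempt BEFORE the non-idempotent side effect, so a
        # crash mid-action leaves the task visibly stuck in DOING.
        self.states[task_id] = State.DOING
        action(task_id)  # may raise; the state then stays DOING
        self.states[task_id] = State.DONE
        return True

    def stuck(self):
        # Tasks a human (or a careful script) must review after recovery.
        return sorted(t for t, s in self.states.items() if s is State.DOING)

    def resolve(self, task_id, retry):
        # Explicit decision per stuck task: retry (back to NOT_DONE) or abort.
        if self.states[task_id] is not State.DOING:
            raise ValueError(f"task {task_id!r} is not stuck in DOING")
        self.states[task_id] = State.NOT_DONE if retry else State.ABORTED
```

The point of the design is that `run` refuses to touch anything already in `DOING`: after an outage you call `stuck()`, check logs, and call `resolve()` per task, instead of letting a retry loop double-bill or double-send.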
I just got an email from them with a link to this; what's interesting is the garbled name in the To: field of the email. The mail was sent to ">,
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"MyFirst
MyLast\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"
<" <myaddress@example.com><p>That's not the sort of thing I'd expect from a company whose entire business is email...
Beeminder got bitten pretty badly by this yesterday, as one of the "small number of customers" affected by the duplicates. Several users let us know they were seeing quadruplicates. Oy.<p>We're still huge Mailgun fans though, and have been since they were just starting out. We've certainly had worse crashes of ineptitude than this ourselves.<p>Kudos for the thorough post-mortem!
Slightly off topic, but posts like this show, in stark detail, why it's a good idea to turn off comments. Good on you for being straightforward though.