A good writeup, but quite shocking that this managed to happen in the first place. I'd have expected that an email service provider would have very good monitoring on deliverability and failure reasons on both sending and receiving, and that something like a cloud migration would be done very incrementally to ensure no loss of service.<p>For this particular issue I would have expected some or all internal email at HEY! to be moved before any customers' email so that the new system could be tested.<p>Email is notoriously finicky when it comes to networks, IPs, the cryptography involved, and all sorts of details that are in flux during a cloud migration, and it's also notoriously difficult to recover if you accidentally get your sending IPs or domains listed on denylists.
I'm glad that they posted a "miss" - but this reads over and over like a sales pitch:<p>- I created a card in <X> Basecamp
- Someone posted a message in Campfire
- We have our own encryption
- Another message posted in a different Campfire
- Oh, this one uses custom categories!
- To-dos in a Basecamp project<p>I get it, 37signals dogfoods their system. What we don't normally see in other companies' posts is that person/company X posted in Slack, made a ticket in Jira, and then created a to-do on their Trello board.<p>Maybe I'm being too cynical...
I'm a little surprised this was published. It is hard to sound charitable when writing something like this but it was such a trivial, obvious fault (moving an email system and then SPF starts failing) that normally things like this are embarrassingly swept under the rug. Generally that is probably the best path.<p>While I appreciate the transparency and it's a great write-up, at the same time somehow I leave the post with a worse opinion of 37signals.
> Senior SRE Paul Shuvashish first noticed that these emails weren’t failing DKIM but SPF. [...] This pointed out a flaw in our application-level analysis system: we were assimilating DMARC errors – which can be either because of SPF or DKIM – to DKIM errors. So while the app was doing the right thing nevertheless – marking the email as spam – the insight it was collecting internally was misleading.<p>I don't agree with 'the app was doing the right thing' here: for a DMARC pass you need SPF <i>or</i> DKIM alignment. One of the two is enough.<p>So an email from a domain with DMARC enabled that passes DKIM but fails SPF should still pass DMARC. The application should not have rejected the email based on SPF when it was actually DKIM aligned.
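To make the DMARC point concrete, here is a rough sketch of the evaluation logic as I understand it. The names (`AuthResult`, `dmarc_passes`, `failure_reasons`) are made up for illustration and are not HEY's actual code; it also assumes the SPF/DKIM checks and alignment comparisons have already been done upstream.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AuthResult:
    """Hypothetical container for already-computed authentication results."""
    spf_pass: bool       # SPF check passed for the envelope sender
    spf_aligned: bool    # SPF domain aligns with the From: domain
    dkim_pass: bool      # at least one DKIM signature verified
    dkim_aligned: bool   # the verified signature's domain aligns with From:


def dmarc_passes(r: AuthResult) -> bool:
    """DMARC passes if SPF *or* DKIM both passes and is aligned."""
    spf_ok = r.spf_pass and r.spf_aligned
    dkim_ok = r.dkim_pass and r.dkim_aligned
    return spf_ok or dkim_ok


def failure_reasons(r: AuthResult) -> List[str]:
    """Record each mechanism separately instead of lumping them under DKIM."""
    reasons = []
    if not (r.spf_pass and r.spf_aligned):
        reasons.append("spf")
    if not (r.dkim_pass and r.dkim_aligned):
        reasons.append("dkim")
    return reasons


# SPF broken (e.g. after an IP change) while DKIM is still valid and aligned.
msg = AuthResult(spf_pass=False, spf_aligned=False,
                 dkim_pass=True, dkim_aligned=True)
print(dmarc_passes(msg), failure_reasons(msg))
```

With this shape, an SPF-only failure still shows up in the stats as "spf", but it does not by itself turn into a DMARC failure.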
As someone who works on a team with minimal collaboration software overhead: is there a ton of bloat in their process (Basecamp this, Campfire that, etc.), or is that just the reality of modern software development?