My team is currently very unstable. We have incidents every 2-3 days, and we're constantly firefighting. I don't want to bore you with the specifics of our situation. Instead, I'm curious to hear some stories about how you managed to fix your team and what problems you were able to solve.<p>For example, maybe someone actually tried putting up a "Days since last incident" sign and that somehow worked? Or you did something that goes against the typical advice you find when you google this topic?<p>Curious to hear your stories for some inspiration.
Generally this is due to a combination of problematic and false beliefs on the team about speed vs. quality trade-offs. Customers always ask for more features, but their values in use (rather than their espoused values) suggest uptime matters a lot. Turn off feature work and fix the reliability issue. Incidents every 2-3 days is ridiculous, and it is absolutely a choice, one you can avoid on a months-to-years timescale.
Every system is perfectly designed to get the results it does. -Deming<p>Already doing regular retros? If yes, are you capturing what's causing pain, and is the team collaborating on how to fix it?