[2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.
[2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.

There's a 20-minute gap between the investigation and the rollback. Why did they roll back if the service was back to normal? How could they decide on, and document, the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly, not all variables were considered.

To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.