Received the following email from them this morning.

“
It has been brought to our attention that between December 28th and January 3rd, some customers encountered issues with backups and snapshots not executing correctly. This disruption affected all background processes at SnapShooter, including backups and emails.

Customers may have experienced delayed backups from December 28th to January 1st. From January 1st to 4th, there were further delays and even complete backup failures. Slow tasks in our worker queue caused these delays, and to fix the system, we had to clear the backlog.

We have found the root cause of the slow tasks: a third-party service was experiencing degraded API performance. Tasks that normally take a few milliseconds were taking minutes to complete, blocking other tasks from being executed.

Implemented Solutions:

- Introduced alerts to notify engineers when queue processing takes too long, along with improved memory management.
- Divided the queues across multiple systems, ensuring that if one part of the system encounters delays, backup jobs receive higher priority.
- Planned timeouts carefully when calling third-party services.
- Provisioned more capacity for our queue system, ensuring that it can sustain a greater load for the foreseeable future.

Kind regards

The SnapShooter Team
”
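For anyone wondering how the timeout and priority fixes fit together, here is a minimal Python sketch, assuming a single worker process draining an in-memory queue. The job kinds (`backup`, `email`), the priority numbers, the 0.2 s timeout and the `fake_third_party` stand-in are all hypothetical; this is not SnapShooter's actual implementation, just an illustration of a per-call timeout stopping one slow third-party call from holding up everything queued behind it.

```python
import heapq
import itertools
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical priorities: lower number runs first. The email only says
# backup jobs receive higher priority; the exact scheme is an assumption.
PRIORITY = {"backup": 0, "email": 1}


class PriorityJobQueue:
    """In-memory priority queue; FIFO order is preserved within a priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker so equal priorities stay FIFO

    def push(self, kind, payload):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), kind, payload))

    def pop(self):
        _, _, kind, payload = heapq.heappop(self._heap)
        return kind, payload

    def __len__(self):
        return len(self._heap)


def call_with_timeout(executor, fn, timeout_s, *args):
    """Return (True, result), or (False, None) if fn exceeds timeout_s seconds."""
    future = executor.submit(fn, *args)
    try:
        return True, future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker stops waiting, but Python cannot kill the thread: the slow
        # call keeps running in the pool until it returns on its own.
        return False, None


def drain(queue, handler, timeout_s, max_workers=4):
    """Process every queued job without letting one slow call block the rest."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while queue:
            kind, payload = queue.pop()
            ok, value = call_with_timeout(executor, handler, timeout_s, kind, payload)
            results.append((kind, payload, value if ok else "timed out"))
    return results


def fake_third_party(kind, payload):
    # Hypothetical stand-in for a degraded third-party API call.
    if payload == "slow":
        time.sleep(1.0)
    return f"{kind}:{payload} done"


if __name__ == "__main__":
    q = PriorityJobQueue()
    q.push("email", "welcome")
    q.push("backup", "slow")
    q.push("backup", "db-1")
    for row in drain(q, fake_third_party, timeout_s=0.2):
        print(row)
```

Running it prints both backup jobs before the email job, with the `slow` one reported as timed out. Note that abandoned calls still occupy a pool thread until they finish, so if every thread gets stuck, new jobs time out too; that is roughly where extra queue capacity and the "queue is taking too long" alerts come in.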