This line: <i>I would have sent out an email to the mailing lists earlier; but since at each point I thought I was "one change away" from fixing the problems, I kept on delaying said email until it was clear that the problems were finally fixed</i> describes such a common situation for most people, but I tend to see it with engineers especially. I struggle with it an incredible amount myself. In some ways, it's reassuring that incredibly smart people like Colin Percival run into the same difficulty in fully understanding the scope of a problem and its solution.<p>All that being said, I really respect the detailed technical response, as well as the way he owns up to a spell of degraded performance and the decisions that went into it.<p>Later edit because I don't want to spam the comments: I'd love some context (maybe from cperciva himself?) around the performance enhancement from integrating the new Intel AES-NI instructions. This is well beyond my depth, and while Colin mentions that it didn't necessarily increase performance, I'm wondering if the hope is that it will in the long term. Or were there other benefits to such an integration?
In case any other customer is wondering "Wait, I didn't hear anything from my monitoring about that and I'm retroactively worried. How worried should I be?" like I was: I just pulled our logs and reconstructed the timings, and over the last ~30 days the worst-case run of our daily backup (~150 MB per day delta, ~45 GB total post-deduplication) took about 40% longer than our typical case. This didn't trip our monitoring at the time because every backup completed successfully.<p>n.b. Our backups run outside of the hotspot times for Tarsnap, so we may have had less performance impact than many customers. I have an old habit of "Schedule all cron jobs to start predictably but at a random offset from the hour to avoid stampeding any previously undiscovered SPOFs." That's one of the Old Wizened Graybeard habits that I picked up from one of the senior engineers at my last real job, which I impart to y'all for the same reason he imparted it to me: it costs you nothing and <i>will</i> save you grief some day far in the future.
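For anyone who wants to try the "predictable but random offset" habit, here is a minimal sketch of one way to do it. It is only an illustration, not how anyone in this thread necessarily does it: it hashes a host name and job name to pick a fixed minute of the hour, so each job always starts at the same time on a given machine but different machines spread out. The function names, host name, job name, and script path are all hypothetical.

```python
import hashlib


def stable_minute(job_name: str, host: str) -> int:
    """Map a (host, job) pair to a fixed minute in the range 0..59.

    The same inputs always give the same minute, so the schedule is
    predictable, while different hosts and jobs land on different minutes.
    """
    digest = hashlib.sha256(f"{host}:{job_name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 60


def daily_cron_line(job_name: str, host: str, hour: int, command: str) -> str:
    """Build a crontab entry that runs `command` once a day at the given hour."""
    minute = stable_minute(job_name, host)
    return f"{minute} {hour} * * * {command}"


if __name__ == "__main__":
    # Hypothetical host, job name, and script path, for illustration only.
    print(daily_cron_line("nightly-backup", "web01.example.com", 3,
                          "/usr/local/bin/run-backup.sh"))
```

Hashing instead of calling a random number generator at install time means you can regenerate the crontab from configuration and get the same schedule back.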
For those who want to run a similar service on their own systems, I found that Attic [1] is a great open source backup tool that works in a very similar way, including deduplication and compression.<p>I back up some VPS servers to my NAS at home using Attic over an SSH tunnel. Incremental backups are quite small and it's easy to automate with a simple cron job.<p>[1] <a href="https://attic-backup.org/" rel="nofollow">https://attic-backup.org/</a>
As an AWS user this type of thing gives me cause for concern:<p><i>At 2015-04-01 00:00 UTC, the Amazon EC2 "provisioned I/O" volume on which most
of this metadata was stored suddenly changed from an average latency of 1.2
ms per request to an average latency of 2.2 ms per request. I have no idea
why this happened -- indeed, I was so surprised by it that I didn't believe
Amazon's monitoring systems at first -- but this immediately resulted in the
service being I/O limited.</i><p>A sudden near-doubling of latency can have dire consequences for any system. Knowing that such unexpected changes are possible makes it hard to build trust in your environment, even if it is running fine today.
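To put rough numbers on the quoted latencies, here is a back-of-the-envelope sketch. It assumes requests are issued strictly one at a time (a queue depth of 1), which is my simplification and not something the post states; with more requests in flight the absolute figures change, though the ratio stays the same.

```python
def max_serial_requests_per_second(latency_ms: float) -> float:
    """Upper bound on requests per second when only one request is in flight."""
    return 1000.0 / latency_ms


# Average latencies quoted in the post, in milliseconds per request.
before_ms = 1.2
after_ms = 2.2

before_rate = max_serial_requests_per_second(before_ms)
after_rate = max_serial_requests_per_second(after_ms)

# How much longer each request takes, and how much serial throughput is lost.
slowdown = after_ms / before_ms
throughput_loss = 1.0 - after_rate / before_rate

print(f"before: {before_rate:.0f} req/s, after: {after_rate:.0f} req/s")
print(f"each request takes {slowdown:.2f}x as long")
print(f"serial throughput drops by {throughput_loss:.0%}")
```

Under that assumption the ceiling falls from roughly 833 to roughly 455 requests per second, a loss of about 45%, which is enough to turn a workload with modest headroom into an I/O-bound one.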
Sorry if this is off-topic, but can anybody explain the value proposition of Tarsnap to me? It seems like a nice service and all, but the pricing is an order of magnitude higher than S3. If you are storing a few GB, this might not matter ("over half of Tarsnap users spend under $1 per month on storing their backups"), but if you have that little data, why not just dump it on a free Dropbox/Gdrive/etc. account?<p>For more data, why not just use one of the many compressed, deduplicated, encrypted, incremental backup systems (Attic comes to mind, I'm sure there are others) and then sync to S3 at a tenth of the cost?
Good description, but I'm missing lesson learned #0: Do not wait too long before informing your users, even if only to tell them "we know about it and are working on it."