This matches my experience with BigTable, down to the short-duration failure spikes.<p>I feel that something should be said on the plus side of the ledger here. I'm the solo founder of a company that indexes huge amounts of fine-grained information. Bigtable is the key technology that let me start my company on my own: it soaks up all the data we can throw at it, with almost zero maintenance. Even within the stable of GCP technologies it stands out as being particularly reliable.<p>My biggest "problem" with BigTable is the lack of public information on schema design - which in this context is mostly the art of designing key structures to solve specific problems. I've come up with sensible strategies, but much of it was far from obvious. I can't help but feel that there should be a body of prior art I could draw on.
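<p>To make that concrete, here is roughly the kind of pattern I mean, a minimal sketch with invented field names and bucket counts rather than my actual schema: a short hash-derived salt to spread writes across tablets, plus a reversed timestamp so the newest rows for an entity sort first in a prefix scan.

```python
import hashlib
import time

MAX_TS_MS = 10**13  # assumed upper bound on millisecond timestamps, for illustration only


def salted_prefix(entity_id: str, buckets: int = 64) -> str:
    """Small hash-derived prefix so sequential ids don't all land on one tablet."""
    h = int(hashlib.md5(entity_id.encode()).hexdigest(), 16)
    return f"{h % buckets:02d}"


def row_key(entity_id: str, event_ts_ms: int) -> bytes:
    """Key layout: <salt>#<entity_id>#<reversed_ts>.

    Reversing the timestamp means a prefix scan on "<salt>#<entity_id>#"
    returns the most recent events first.
    """
    reversed_ts = MAX_TS_MS - event_ts_ms
    return f"{salted_prefix(entity_id)}#{entity_id}#{reversed_ts:013d}".encode()


if __name__ == "__main__":
    print(row_key("user-42", int(time.time() * 1000)))
```

The trade-off, of course, is that a scan across all entities now needs one ranged scan per salt bucket; whether that is acceptable depends entirely on your access patterns, which is exactly the kind of guidance I wish were written down somewhere.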
Random inside joke I overheard about ten years ago:<p>"It's called BigTable, not FastTable or AvailableTable!"<p>...It's probably a <i>bad</i> idea to evaluate 2019's BigTable based on the joke, but my puerile mind still finds it amusing. :)
We are a user of Bigtable, 30k writes/sec and 300k reads/sec, and compared to the other managed services we use (Pub/Sub, Memorystore, etc.) it has been the most stable by far. But we have to scale up our node count at times when we don't think we should have to (based on the performance described in the docs), and we also see the latency spikes and errors described in the article. They also added storage caps based on node count last year that increased our costs dramatically.<p>The Key Visualizer has been a huge help, but there still aren't enough metrics and tooling to understand what is happening behind the scenes when things do go wrong. Luckily we have a cache sitting in front of Bigtable for reads, which lets us absorb most of the described intermittent issues, because cost has prevented us from doing any sort of replication.
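<p>To give a rough idea of what that cache does for us, here is a minimal sketch rather than our actual code, with a placeholder fetch function standing in for the real Bigtable read: read-through on a miss, and fall back to the last cached value when the read errors out, which is what absorbs the intermittent blips.

```python
import time
from typing import Any, Callable, Optional


class ReadThroughCache:
    """Tiny in-process read-through cache with stale fallback.

    `fetch` is whatever performs the real Bigtable read (a placeholder here);
    if it raises, we serve the last known value instead of surfacing the error.
    """

    def __init__(self, fetch: Callable[[str], Any], ttl_s: float = 30.0):
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._entries: dict[str, tuple[float, Any]] = {}  # key -> (stored_at, value)

    def get(self, key: str) -> Optional[Any]:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]  # fresh hit
        try:
            value = self._fetch(key)  # the real Bigtable row read would live here
        except Exception:
            # Intermittent Bigtable error: serve stale data if we have any.
            return entry[1] if entry is not None else None
        self._entries[key] = (now, value)
        return value


def read_row_from_bigtable(key: str) -> Any:
    """Placeholder for the actual client call."""
    raise NotImplementedError


cache = ReadThroughCache(read_row_from_bigtable, ttl_s=30.0)
```

A real deployment would more likely use a shared cache (memcached/Redis or similar) than an in-process dict, but the fall-back-to-stale behaviour is the part that rides out short error spikes.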
Reading the article, the following quote got my attention: "you should always keep things simple even if your tools allow for more complex patterns".<p>I follow the "it is perfect when there is nothing left to remove" rule in most systems/processes/functions/tasks in life (not only IT systems). I am happy to see that in this cluttered space called IT there are many more like-minded people who recognize that too much is TOO much.
As another (former) user of Cloud Bigtable (we migrated from Cassandra), we saw almost identical results: great performance when it works, but regular periods of unavailability (this was around 2-3 years ago). Interesting to hear that they still have the same problems. We had a similar experience spending time with the Cloud Bigtable team, but they never really got to the bottom of it.
Worth noting his original reason for moving away from DynamoDB is outdated. DynamoDB added an “adaptive capacity” feature to handle hot partitions.[1]<p>[1] <a href="https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/" rel="nofollow">https://aws.amazon.com/blogs/database/how-amazon-dynamodb-ad...</a>
"Unfortunately, we do multiple operations on Bigtable in one request to our api and we rely on strong consistency between those operations."<p>I feel like "strong consistency" is misused here. Strongly consistent is relevant only in a distributed environment. Its usually solved by using paxos/raft between the replicas. Bigtable only has had best-effort replication, so I am not sure its being mentioned here. I think they are looking for the term serial, that their queries have to be executed in a specific order for a particular user request.
I really, really, really hate unexplained problems like the one described here. Not in storage but any facet of computing. It's true that the systems we build and work on are complex, but they are also ultimately deterministic, and there is a reason why something goes wrong like TFA describes. Ideally we would seek to understand our systems before continuing to add features to them, but of course the real world often doesn't work that way.<p>This would be a super frustrating situation for me, particularly when you're not given the tools you need to diagnose in the first place, <i>and</i> you loop in support but they still can't help you identify what's wrong.<p>Years ago, I worked on a .NET system that sometimes would respond super slowly and we didn't have a concrete explanation for why. As in TFA, we developed a kind of religion about it. "Oh, it must be JITting", that sort of stuff.