><i>Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available</i><p>In my experience, yes, true network partitions are incredibly rare. However, 99% of my distributed-system partitions have had little to do with the network. When running databases in a cloud environment, partitions can occur for a variety of reasons that don't actually involve the network link between databases:<p>1. The host database is written in a GC'd language and experiences a pathological GC pause.<p>2. The virtual machine is migrated and experiences a pathological pause.<p>3. Google migrates your machine with local SSDs, fucks up that process, and you lose all the data on that machine (you do have backups, right?).<p>4. AWS retires your instance and you need to reboot your VM.<p>You may never see these issues if you are running a 3- or 5-node cluster. I began seeing issues like this semi-regularly once the cluster grew to 30-40 machines (Cassandra). Now, I will agree that none of these issues took down a majority, but if your replication factor is 3, it really only takes an unlucky partition to fuck up an entire shard.
It's definitely true that putting the burden of consistency on developers (instead of on the DB) results in a lot more tricky work for developers. On my project, which started six years ago, we use Cloud Datastore, because Cloud Spanner hadn't come out yet. It results in complicated, painful code that would be completely unnecessary with stronger transactional guarantees. Some examples: <a href="https://github.com/google/nomulus/blob/master/java/google/registry/backup/CommitLogCheckpointStrategy.java" rel="nofollow">https://github.com/google/nomulus/blob/master/java/google/re...</a> <a href="https://github.com/google/nomulus/blob/master/java/google/registry/model/ofy/CommitLoggedWork.java" rel="nofollow">https://github.com/google/nomulus/blob/master/java/google/re...</a> <a href="https://github.com/google/nomulus/blob/master/java/google/registry/model/index/ForeignKeyIndex.java" rel="nofollow">https://github.com/google/nomulus/blob/master/java/google/re...</a><p>It's no surprise that we're currently running screaming to something with stronger transactional guarantees.
The post talks about Spanner using "a specialized hardware solution that uses both GPS and atomic clocks to ensure a minimal clock skew across servers." But what it fails to mention is that this time is distributed using a software solution --- NTP, in fact.<p>Google makes its "leap-smeared NTP network" available via Google's Public NTP service. And it's not expensive for someone to buy their own GPS clock and run their own leap-smeared NTP service.<p>Yes, it means that someone who installs a NewSQL database will have to set up their own time infrastructure. But that's not hard! There are lots of things about hardware setup that are at a similar level of trickiness: a UPS with notification so that servers can shut down gracefully when the batteries are exhausted, locating your servers across on-prem data centers so that if a flood takes out one data center you have continuity of service elsewhere, etc., etc.<p>Or of course, you can pay a cloud infrastructure provider (Google happens to provide this, but Amazon and Azure also provide similar services) to take care of all of these details for you. Heck, if you use Google Cloud Platform, you can use the original Spanner service (accept no substitutes :-)
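For what it's worth, the client side of this really is just a few lines of chrony config. A minimal sketch pointing at Google's public NTP servers (the hostnames are Google's; everything else here is my own assumption, and note you shouldn't mix leap-smeared sources with non-smeared ones):

    # /etc/chrony.conf (sketch): sync against Google's leap-smeared public NTP.
    # Do not mix these with non-smeared sources; they disagree around leap seconds.
    server time1.google.com iburst
    server time2.google.com iburst
    server time3.google.com iburst
    server time4.google.com iburst
    # Ignore samples with an excessive estimated error.
    maxupdateskew 100.0
    # Allow stepping the clock only during the first three updates; slew afterwards.
    makestep 1.0 3

The harder part, as the article implies, is knowing what clock-error bound you can actually promise your database once this is in place.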
In the post and another comment here it is stated that Calvin can handle any real-world workload. However, according to my reading of the Calvin paper, one must know the keys a transaction will update before the transaction starts.
I also experienced limitations when trying to use FaunaDB: it doesn't support ad-hoc queries and only allows indexed queries.<p>I really like the Calvin protocol, and it does seem perfectly suited to many application workloads, but it seems odd to me to see Calvin presented as purely superior to all alternatives. It seems like more research and work needs to be done to create systems that can address its shortcomings around transactions that must query to find their keys (section 3.2.1, "Dependent transactions," in the paper), including ad-hoc queries and even interactive queries (within a transaction).<p>Disclaimer: I work on TiDB/TiKV. TiKV (a distributed K/V store) uses the Percolator model for transactions to ensure linearizability, but TiDB SQL (built on top) allows write skew.
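For readers who haven't read that section: the paper's workaround for dependent transactions is an optimistic reconnaissance pass (OLLP). Here's my own rough paraphrase in Python -- not Calvin's or FaunaDB's actual code; scan_keys_matching and run_deterministic are hypothetical APIs:

    # Sketch of OLLP for dependent transactions: a cheap, low-isolation
    # reconnaissance read guesses the read/write set, the transaction is then
    # submitted with that set declared up front, and the guess is re-verified
    # inside the deterministic transaction.

    class StaleKeySet(Exception):
        """Raised when the reconnaissance guess no longer matches reality."""

    def submit_dependent_txn(db, predicate, update_fn):
        guessed_keys = set(db.scan_keys_matching(predicate))   # reconnaissance read

        def txn(snapshot):
            actual_keys = set(snapshot.scan_keys_matching(predicate))
            if actual_keys != guessed_keys:
                raise StaleKeySet(actual_keys)                  # restart with new guess
            for key in guessed_keys:
                update_fn(snapshot, key)

        # The declared write set is what lets the sequencer order/lock it deterministically.
        return db.run_deterministic(txn, write_set=guessed_keys)

Workloads where the guess goes stale often end up paying for repeated reconnaissance rounds, which is exactly the ad-hoc/interactive-query weakness I'm describing.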
The CAP theorem has been truly disastrous for databases. The CAP theorem simply says that if you have a database on 2 servers and the connection between those servers goes down, then queries against one server don't see new updates from the other server, so you either have to give up consistency (serve stale data) or give up availability (one of the servers refuses to process further requests).<p>That's all that CAP is, but somehow half the industry has been convinced that slapping a fancy name on this concept is a justification for giving up consistency (even when the network and all servers are fully functional) to retain availability. The A for availability in CAP means that ALL database servers remain fully available, which is unnecessary in practice, because clients can switch to the other servers. Giving up consistency introduces big engineering challenges. You're getting something that most people don't need in return for a large cost.
I'm the author of that post. I'm happy to respond to comments on the post on this thread for the next several hours. You can also leave comments on the post itself, and I will respond there at any time.
I’ve seen comments on HN over the years in which someone Dunning-Krugers their way into saying that TrueTime is easily replicated. I always wonder if they have sixteen senior SREs in their pocket, because that’s the level of production engineering Google applies to the problem. Time SRE has at various points had to take measures up to and including calling the USAF and telling them their satellites are fucked up. If you don’t have the staff for this, the easiest way to get access to TrueTime is probably to just use Cloud Spanner.
This article lacks any reference to FoundationDB, which offers external consistency and serializable distributed transactions without trusting clocks in any way. We designed it starting in 2009 and so its lineage is independent of either Calvin or Spanner.<p>FDB doesn't have an actively developed SQL layer at the moment, so I guess you could say it isn't a "NewSQL" database, but none of the properties under discussion have much to do with the query language.
> With 10ms batches, Calvin was able to achieve a throughput of over 500,000 transactions per second. For comparison, Amazon.com and NASDAQ likely process no more than 10,000 orders/trades per second even during peak workloads.<p>I haven’t worked with the NASDAQ feed directly, but knowing how fast equities tick, I find this “10,000 orders/sec” estimate quite low.<p>Not to mention that a 10ms delay in confirming an order would be really terrible.
> We will trace the failure to guarantee consistency to a controversial design decision made by Spanner that has been tragically and imperfectly emulated in other systems.<p>This post doesn't establish any "controversy" about Spanner's design decision. It only says that it requires special hardware, which other systems attempt to emulate despite not having this specialized hardware.<p>To call this decision "controversial" I think one would need to show that it has some significant problem <i>in the environment it was designed for</i>.
This is the clearest presentation of why CAP is misleading that I’ve ever seen! Wonderful.<p>If the app programmer knows they don’t have global consistency, just consistency per partition, I wonder if in practice there are ways to achieve the necessary application-level guarantees such as in the photo-sharing case (not that it sounds that appealing to need to do so).
Anyone interested in this subject may also find it worthwhile to read the paper describing Spanner: <a href="http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf" rel="nofollow">http://static.googleusercontent.com/media/research.google.co...</a>
There seems to be a powerful straw man here about eventual consistency.<p>The point of AP systems is not 100% availability, but rather <i>higher</i> availability.<p>By the same reasoning, one should never do CP, because it is also not possible to have 100% consistency. Disk/memory/network corruption, even with ECC, can overwhelm the ability to maintain consistency.
The context of this article still seems to constrain consistency to guarantees for single atomic units of data (atomic at the node level).<p>But the second you have multi-node transactions (your data, after all, in a truly distributed system will exist on multiple nodes), single-node reliability becomes dependent on multiple data exchanges across the network for confirmations. Then your network partition exposure starts to snowball.<p>Same thing for joins. Great if your joins are perfectly served by a node-local sharding strategy, but when you retrieve from multiple nodes, again your network partitioning risk starts to compound.<p>In AWS you will see noisy-neighbor networks and network unreliability. Your VMs get yanked. Your EBS might experience an I/O pause. GC pauses. Or simply one of your nodes might get swamped by a query spike.
I like the realistic view of NoSQL vs. CAP and the different tradeoffs. We do need to talk more about the disadvantages of NoSQL systems and all the new database alternatives.<p>Which brings me to my point - what the author fails to talk about is why Spanner made the design decision it did. He does claim scalability, but that is a very general word.<p>I believe the Spanner decision is about avoiding a global vote - global across the entire world, from North America to Asia, Europe, and even as far as Australia. Such a global vote would take a lot of time and impose seconds of latency on any write.<p>I think the author, when he talks about having a single global vote with a small window, thinks in terms of a single database, maybe two regions in the continental USA, as opposed to Spanner trying to be geo-distributed.
CTO of YugaByte here. We firmly stand by our claims, and I wanted to explain more.<p>From the post by Daniel:
<< CockroachDB, to its credit, has acknowledged that by only incorporating Spanner’s software innovations, the system cannot guarantee CAP consistency (which, as mentioned above, is linearizability).<p>YugaByte, however, continues to claim a guarantee of consistency. I would advise people not to trust this claim. YugaByte, by virtue of its Spanner roots, will run into consistency violations when the local clock on a server suddenly jumps beyond the skew uncertainty window. >><p>The statement about YugaByte DB is incorrect.<p>1. With respect to CAP, both CockroachDB (<a href="https://www.cockroachlabs.com/blog/limits-of-the-cap-theorem/" rel="nofollow">https://www.cockroachlabs.com/blog/limits-of-the-cap-theorem...</a>) and YugaByte DB (<a href="https://docs.yugabyte.com/latest/faq/architecture/#how-can-yugabyte-db-be-both-cp-and-ha-at-the-same-time" rel="nofollow">https://docs.yugabyte.com/latest/faq/architecture/#how-can-y...</a>) are CP databases with HA, and there is really no difference in the claims.<p>2. With respect to isolation level in ACID, YugaByte DB does not make the linearizability claim (called external consistency by Google Spanner). YugaByte DB offers Snapshot Isolation (detects write-write conflicts) today, and Serializable isolation (detects read-write and write-write conflicts) is on the roadmap (<a href="https://docs.yugabyte.com/latest/architecture/transactions/isolation-levels/" rel="nofollow">https://docs.yugabyte.com/latest/architecture/transactions/i...</a>).<p>3. We have publicly stated that we rely on NTP and max clock skew bounds to guarantee consistency. For example, on slide 43 of our NorCal DB Day talk (<a href="https://www.slideshare.net/YugaByte/yugabyte-db-architecture-storage-engine-and-transactions" rel="nofollow">https://www.slideshare.net/YugaByte/yugabyte-db-architecture...</a>), we mention we are “relying on bounded clock sync (NTP, AWS Time Sync, etc).”
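For readers unfamiliar with the distinction in point 2: snapshot isolation only aborts on write-write conflicts, so two transactions that read overlapping data but write disjoint rows can both commit and jointly break an invariant (write skew). A toy, database-free illustration of the anomaly's logic (the "doctors on call" example from the literature; names and values are made up):

    # Write skew under snapshot isolation: each transaction reads from its own
    # snapshot, writes a disjoint row, and sees no write-write conflict, so both
    # commit -- yet the invariant "at least one doctor on call" is broken.
    # A serializable engine would abort one of them.

    committed = {"alice": "on_call", "bob": "on_call"}       # shared database state

    def take_myself_off_call(snapshot, me):
        others = sum(1 for who, s in snapshot.items() if who != me and s == "on_call")
        if others >= 1:                                      # invariant check on the snapshot
            return {me: "off_call"}                          # write set: only my own row
        return {}

    snap = dict(committed)                                   # both start from the same snapshot
    writes_a = take_myself_off_call(snap, "alice")
    writes_b = take_myself_off_call(snap, "bob")

    assert not (writes_a.keys() & writes_b.keys())           # disjoint writes: SI commits both
    committed.update(writes_a)
    committed.update(writes_b)
    print(committed)   # {'alice': 'off_call', 'bob': 'off_call'} -- nobody is on call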
The author talks about high throughput achieved via batching, but he has not mentioned the latency implications of batching - or perhaps I missed that in the article.<p>Wouldn't batching lead to an increase in transaction latency, even if we achieve higher throughput?
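My own back-of-the-envelope (not from the article): with a fixed 10ms epoch, a transaction arriving at a random point in the epoch waits about 5ms on average just for the batch to close, before replication or execution even starts.

    # Rough estimate of the queueing delay added by fixed batch epochs.
    # Assumes uniformly random arrivals; replication/execution time not modeled.
    import random

    EPOCH_MS = 10.0
    waits = [EPOCH_MS - random.uniform(0, EPOCH_MS) for _ in range(100_000)]
    print(f"mean added latency ~{sum(waits) / len(waits):.2f} ms (analytically {EPOCH_MS / 2:.1f} ms)")

Whether an extra few milliseconds matters obviously depends on the workload, but it isn't free.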
It's becoming harder to evaluate the guarantees of the most recent so-called "cloud-scale" databases - Spanner or CockroachDB, for instance.<p>Is there anything out there like a standard benchmark to compare these offerings more accurately?
I don’t know for sure, but to me the AP vs. CP interpretation seems to be a true limitation only for distributed systems of exactly two nodes.<p>I also like the blog's point that availability is never 100%, but I think the added cost in availability when going from an eventually consistent system to a linearizable one is underestimated, because performance is going to be a significant availability factor, not only the failure modes discussed.
Maybe I’m missing something, but the way this reads seems to attribute Spanner’s concurrency control entirely to TrueTime. Spanner still uses partitioned Paxos to establish consensus and 2PC for multi-partition write transactions.
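Agreed - TrueTime's role is narrower than "does the concurrency control." What it enables, on top of Paxos and 2PC, is the commit-wait rule. A minimal sketch of that rule as described in the OSDI '12 paper; tt_now() here is a stand-in for TrueTime with a made-up 7ms uncertainty bound:

    # Sketch of Spanner's commit-wait (per the OSDI '12 paper). tt_now() is a
    # stand-in for TrueTime: an interval guaranteed to contain true time.
    import time
    from dataclasses import dataclass

    @dataclass
    class TTInterval:
        earliest: float
        latest: float

    def tt_now(epsilon: float = 0.007) -> TTInterval:
        t = time.time()                      # assumed uncertainty bound of +/- 7ms
        return TTInterval(t - epsilon, t + epsilon)

    def commit(replicate_and_apply):
        # Choose a commit timestamp no smaller than any clock could currently read.
        s = tt_now().latest
        replicate_and_apply(s)               # Paxos replication / 2PC happen here (elided)
        # Commit wait: don't acknowledge until s is definitely in the past on
        # every clock, so any transaction started afterwards gets a larger timestamp.
        while tt_now().earliest <= s:
            time.sleep(0.001)
        return s

Paxos and 2PC give you agreement within and across partitions; commit wait is what turns the bounded clock error into external consistency.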
Consistency is a requirement that holds us back deeply. Implementing consistency with eventually voted, know-it-all masters is also holding us back. Looking at this makes an engineer's heart cringe. In fact, the current architecture is only slightly better than what we had in 2005, more than 10 years ago. And if it doesn't run on Google or Amazon clouds, and therefore runs on UNRELIABLE networks, the voting itself can fail and you have a centralized system without a center. And while we claim that we need a central point of knowledge that is in sync throughout the whole system, the world AROUND these IT systems works completely without being in sync and completely without knowing it all (sometimes with no knowledge, or with assumed but wrong knowledge), at a scale that IT systems only come close to nowadays to some degree.
Probably a dumb question but if you need all servers to talk to each other to achieve consensus, how do you deal with network latency? Do people just keep data in separate zones so they aren't globally replicated?
And the majority of developers don’t need NoSQL or NewSQL - they are well served by PostgreSQL.<p>I think the tried and true should be the default choice here, and other paths taken only after careful consideration and good justification.
No mention of Clustrix? The Spanner paper was in 2012. Clustrix started working on a "NewSQL" database in 2006 and had a product out in 2010.
> a specialized hardware solution that uses both GPS and atomic clocks to ensure a minimal clock skew across servers<p>Maybe I'm wrong, but pretty much anyone with a DC is going to use Microsemi (previously Symmetricom) grandmaster clocks with all the bells and whistles, including the internal rubidium atomic oscillator, along with PTP to keep everything in sync with multiple layers of redundancy. A specialized hardware solution that uses "both GPS and atomic clocks" to guarantee ~15ns RMS to UTC is just a shopping cart away.
I have a whitepaper about scalable consistency guarantees using optimistic transactions. It should solve these very problems.<p><a href="https://medium.com/@andrasgerlits/optimistic-acid-transactions-4f193844bb5f" rel="nofollow">https://medium.com/@andrasgerlits/optimistic-acid-transactio...</a>
Since Spanner is distributed on the backend, any function that issues multiple SQL commands - which may be routed across multiple sharded servers before reaching Spanner - should send those calls in a single transaction.
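To make that concrete, here's roughly what it looks like with the Python client's read-write transaction helper; instance, database, table, and column names are placeholders, and I believe run_in_transaction retries the whole function on aborts, but check the docs:

    # Sketch: group related statements into one Spanner read-write transaction so
    # they commit (or roll back) atomically instead of being issued independently.
    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("my-database")

    def transfer(txn, src, dst, amount):
        # Both updates belong to the same transaction; partial effects are never visible.
        txn.execute_update(
            "UPDATE Accounts SET balance = balance - @amt WHERE id = @id",
            params={"amt": amount, "id": src},
            param_types={"amt": spanner.param_types.INT64, "id": spanner.param_types.STRING},
        )
        txn.execute_update(
            "UPDATE Accounts SET balance = balance + @amt WHERE id = @id",
            params={"amt": amount, "id": dst},
            param_types={"amt": spanner.param_types.INT64, "id": spanner.param_types.STRING},
        )

    database.run_in_transaction(transfer, "alice", "bob", 10)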
Distributed databases are like systems of government. Some are worse than others, but all of them suck. I'm not aware of any study that shows that X type of database reduces bugs, increases availability, and makes customers happier. Pick one that fits your application and deal with the suckiness.