Cassandra is not row level consistent

274 pointsby leastangleover 8 years ago

20 comments

jbellisover 8 years ago

Cassandra developer here.Lots of comments here about how Cassandra is AP so of course you get inconsistent (non-serializable) results.This is true, to a point. I'm firmly convinced that AP is a better way to build distributed systems for fault tolerance, performance, and simplicity. But it's incredibly useful to be able to "opt in" to CP for pieces of the application as needed. That's what Cassandra's lightweight transactions (LWT) are for, and that's what the authors of this piece used.However! Fundamentally, mixing serializable (LWT) and non-serializable (plain UPDATE) ops will produce unpredictable results and that's what bit them here.Basically the same as if you marked half the accesses to a concurrently-updated Java variable with "synchronized" and left it off of the other half as an "optimization."Don't take shortcuts and you won't get burned.

评论 #13036563 未加载

评论 #13036712 未加载

评论 #13034882 未加载

评论 #13038751 未加载

评论 #13035670 未加载

评论 #13039047 未加载

评论 #13035616 未加载

评论 #13036299 未加载

helperover 8 years ago

As a long time Cassandra user its easy to forget that some of Cassandra's semantics will be surprising to new users. That being said, if you are considering adopting an AP database it really is important for you to know the details about how write conflicts get resolved. This is perhaps the biggest difference between Cassandra other databases like Riak and ought to be part of your decision making process instead of a surprise you run into later.That being said, using Cassandra for distributed locks is a terrible idea. I can't think of any way in which Cassandra would be better than using {Zookeeper,etcd,consul}. Trying to force a database to do something it really isn't designed for will almost always lead to disappointment (and often resentment) of said database.

评论 #13033835 未加载

评论 #13033631 未加载

评论 #13033712 未加载

评论 #13034841 未加载

评论 #13033749 未加载

Animatsover 8 years ago

This:<pre><code> INSERT INTO locks (id, lock, revision) VALUES ('Tom', true, 1) IF NOT EXISTS USING TTL 20; </code></pre> looks like a race condition. The same problem comes up in SQL databases - you can't lock a row that doesn't exist yet. If you write, in SQL:<pre><code> BEGIN TRANSACTION SELECT FROM locks WHERE id = "Tom" AND lock = true AND revision = 1; -- if no records returned INSERT INTO LOCKS locks (id, lock, revision) VALUES ('Tom', true, 1) COMMIT </code></pre> you have a race condition. If two threads make that identical request near-simultaneously, both get a no-find from the select, and both do the INSERT. SELECT doesn't lock rows that don't exist.The usual solution in SQL is to use UNIQUE indices which will cause an INSERT to fail if the record about to be inserted already exists.I ran into this reassembling SMS message fragments, where I wanted to detect that all the parts had come in. The right answer was to do an INSERT for each new fragment, then COMMIT, then do a SELECT to see if all the fragments of a message were in. Doing the SELECT first produced a race condition.

评论 #13034297 未加载

评论 #13034726 未加载

评论 #13034303 未加载

aarturover 8 years ago

And there's another surprise waiting to be discovered. The execution of a LWT is not guaranteed to return applied/not-applied response [1]. It can raise a WriteTimeout exception that means "I don't know if applied". It looks like in that case it can be worked around by inserting a UUID and in case of a WriteTimeout reading the UUID using SERIAL consistency and checking if it's the inserted UUID. But generally this limitation of LWTs makes implementing some algorithms impossible, e.g. you can't implement a 100% reliable counter.[1] <a href="https://issues.apache.org/jira/browse/CASSANDRA-9328" rel="nofollow">https://issues.apache.org/jira/browse/CASSANDRA-9328</a>

seanparsonsover 8 years ago

I railed against CQL right from the start and it's precisely because of this kind of thing. Imitating SQL has the side effect of setting certain expectations and drags a certain mental model along with it.

im_down_w_otpover 8 years ago

If you're not writing purely immutable data or can't 100% guarantee a serialized reader/writer, then you're just looking for trouble with Cassandra.

beefsackover 8 years ago

The post was quite interesting, but I find image macros and GIFs really distracting in technical writings.

zzzcpanover 8 years ago

Shouldn't Cassandra be using Lamport timestamps or even vector clocks there? Relying on timer and its resolution sounds strange for a database, especially a distributed one.

评论 #13033771 未加载

评论 #13033620 未加载

leastangleover 8 years ago

Author here.Great discussion around the CAP theorem but it misses the point. AP vs CP / Cassandra being AP is not relevant to this particular problem:1) This is not a distributed systems corner case. You will run into this if you are running Cassandra on a single node. A node should be able to guarantee consistency internally during normal operation. If it is not able to do that, there is something wrong with the system.2) This is a case where queries are being send from the same process/thread and go to exactly the same nodes. Attach a simple, monotonically increasing query counter to each call and you can easily serialize it on the other side.

sigyover 8 years ago

I see this as a basic misunderstanding of how LWT works. If you want to ensure serializable operations, then you need to use LWT with preconditions that ensure serializable operations.Even better, stop trying to emulate the old and tired distributed lock methods that have been proven over and over again to be insufficient.

refresh-credsover 8 years ago

If it's not obtained synchronously what does the lock achieve?

supergirlover 8 years ago

Couldn't explain the problem in a less cringy way? Sounds like a bug to me. So file a bug report?

评论 #13033548 未加载

agentgtover 8 years ago

I guess I'm old or just not hip (most likely both) but I had to google WAT (I know WTF but WAT ... never seen it).Even now I'm still not sure but I presume WAT = what!

评论 #13034529 未加载

评论 #13036033 未加载

评论 #13034480 未加载

second_picardover 8 years ago

Hazelcast has a distributed lock (see <a href="http://docs.hazelcast.org/docs/3.7/manual/html-single/index.html#lock" rel="nofollow">http://docs.hazelcast.org/docs/3.7/manual/html-single/index....</a>) and I've used for more than a year to synchronize jobs across a cluster.

steeveover 8 years ago

Using Cassandra as a lock is a terrible, terrible idea.

knownover 8 years ago

With the Oracle/PostgreSQL, readers never wait for writers and writers never wait for readers <a href="http://philip.greenspun.com/sql/your-own-rdbms.html" rel="nofollow">http://philip.greenspun.com/sql/your-own-rdbms.html</a> using underlying locking mechanism <a href="http://www.beej.us/guide/bgipc/output/html/singlepage/bgipc.html#flocking" rel="nofollow">http://www.beej.us/guide/bgipc/output/html/singlepage/bgipc....</a>

knownover 8 years ago

Its' better to implement <a href="https://en.wikipedia.org/wiki/Priority_inversion" rel="nofollow">https://en.wikipedia.org/wiki/Priority_inversion</a> in all <a href="https://en.wikipedia.org/wiki/NoSQL#Types_and_examples_of_NoSQL_databases" rel="nofollow">https://en.wikipedia.org/wiki/NoSQL#Types_and_examples_of_No...</a>

cbsmithover 8 years ago

It feels like it is 2013 all over again: <a href="https://aphyr.com/posts/294-jepsen-cassandra" rel="nofollow">https://aphyr.com/posts/294-jepsen-cassandra</a>

mydpyover 8 years ago

As others have noted, this blog uses an approach to data modeling that is considered an anti pattern for an AP data store like Cassandra.

neeleshsover 8 years ago

Anyone has experience around this on HBase, a CP database?

评论 #13034684 未加载

20 comments

jbellisover 8 years ago

评论 #13036563 未加载

评论 #13036712 未加载

评论 #13034882 未加载

评论 #13038751 未加载

评论 #13035670 未加载

评论 #13039047 未加载

评论 #13035616 未加载

评论 #13036299 未加载

helperover 8 years ago

评论 #13033835 未加载

评论 #13033631 未加载

评论 #13033712 未加载

评论 #13034841 未加载

评论 #13033749 未加载

Animatsover 8 years ago

评论 #13034297 未加载

评论 #13034726 未加载

评论 #13034303 未加载

aarturover 8 years ago

seanparsonsover 8 years ago

im_down_w_otpover 8 years ago

If you're not writing purely immutable data or can't 100% guarantee a serialized reader/writer, then you're just looking for trouble with Cassandra.

beefsackover 8 years ago

The post was quite interesting, but I find image macros and GIFs really distracting in technical writings.

zzzcpanover 8 years ago

Shouldn't Cassandra be using Lamport timestamps or even vector clocks there? Relying on timer and its resolution sounds strange for a database, especially a distributed one.

评论 #13033771 未加载

评论 #13033620 未加载

leastangleover 8 years ago

sigyover 8 years ago

refresh-credsover 8 years ago

If it's not obtained synchronously what does the lock achieve?

supergirlover 8 years ago

Couldn't explain the problem in a less cringy way? Sounds like a bug to me. So file a bug report?

评论 #13033548 未加载

agentgtover 8 years ago

I guess I'm old or just not hip (most likely both) but I had to google WAT (I know WTF but WAT ... never seen it).Even now I'm still not sure but I presume WAT = what!

评论 #13034529 未加载

评论 #13036033 未加载

评论 #13034480 未加载

second_picardover 8 years ago

steeveover 8 years ago

Using Cassandra as a lock is a terrible, terrible idea.

knownover 8 years ago

cbsmithover 8 years ago

It feels like it is 2013 all over again: <a href="https://aphyr.com/posts/294-jepsen-cassandra" rel="nofollow">https://aphyr.com/posts/294-jepsen-cassandra</a>

mydpyover 8 years ago

As others have noted, this blog uses an approach to data modeling that is considered an anti pattern for an AP data store like Cassandra.

neeleshsover 8 years ago

Anyone has experience around this on HBase, a CP database?

评论 #13034684 未加载