Let's consign CAP to the cabinet of curiosities

135 points by nalgeon 10 months ago

38 comments

kstrauser 10 months ago

So you’re setting up a multi-region RDS. If region A goes down, do you continue to accept writes to region B?

A bank: No! If region A goes down, do not process updates in B until A is back up! We’d rather be down than wrong!

A web forum: Yes! We can reconcile later when A comes back up. Until then keep serving traffic!

CAP theorem doesn’t let you treat the cloud as a magic infinite availability box. You still have to design your system to pick the appropriate behavior when something breaks. No one without deep insight into your business needs can decide for you, either. You’re on the hook for choosing.
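The bank-versus-forum choice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Region`, `PartitionPolicy`, and `Unavailable` are hypothetical names invented for the example, not part of RDS or any real API, and "peer reachable" stands in for whatever partition detection the real system has.

```python
from dataclasses import dataclass, field
from enum import Enum


class PartitionPolicy(Enum):
    REJECT_WRITES = "reject"          # the bank: rather down than wrong
    ACCEPT_AND_RECONCILE = "accept"   # the forum: keep serving, merge later


class Unavailable(Exception):
    """Raised when a region refuses a write in order to stay consistent."""


@dataclass
class Region:
    # Hypothetical model of one region; not a real cloud API.
    name: str
    policy: PartitionPolicy
    peer_reachable: bool = True
    committed: list = field(default_factory=list)
    pending_reconcile: list = field(default_factory=list)

    def write(self, record):
        if self.peer_reachable:
            # Normal operation: both regions see the write.
            self.committed.append(record)
            return "committed"
        if self.policy is PartitionPolicy.REJECT_WRITES:
            # Consistency over availability.
            raise Unavailable(f"{self.name}: peer unreachable, refusing write")
        # Availability over consistency: queue the write for later reconciliation.
        self.pending_reconcile.append(record)
        return "accepted-pending-reconcile"
```

With `peer_reachable=False`, a `REJECT_WRITES` region raises `Unavailable`, while an `ACCEPT_AND_RECONCILE` region keeps taking writes and leaves the merge for later. Which of the two is right is a business decision, which is the point of the comment.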

mordae 10 months ago

You wish.

> DNS, multi-cast, or some other mechanism directs them towards a healthy load balancer on the healthy side of the partition

Incidentally that's where CAP makes its appearance and bites your ass.

No amount of VRRP, UCARP wishful thinking can guarantee a conclusion on what partition is "correct" in presence of a network partition between load balancer nodes.

Also, who determines where to point the DNS? A single point of failure VPS? Or perhaps a group of distributed machines voting? Yeah.

You still need to perform the analysis. It's just that some cloud providers offer the distributed voting clusters as a service and take care of the DNS and load balancer switchover for you.

And that's still not enough, because you might not want to allow stragglers to write to orphan databases before the whole network fencing kicks in.

bunderbunder 10 months ago

I once lost an entire Christmas vacation to fixing up the damage caused when an Elasticsearch cluster running in AWS responded poorly to a network partition event and started producing results that ruined our users' day (and business records) in a "costing millions of dollars" kind of way.

It was a very old version of ES, and the specific behavior that led to the problem has been fixed for a long time now. But still, the fact that something like this can happen in a cloud deployment demonstrates that this article's advice rests on an egregiously simplistic perspective on the possible failure modes of distributed systems.

In particular, the major premise that intermittent connectivity is only a problem on internetworks is just plain wrong. Hubs and switches flake out. Loose wires get jiggled. Subnetworks get congested.

And if you're on the cloud, nobody even tries to pretend that they'll tell you when server and equipment maintenance is going to happen.

throwaway71271 10 months ago

When I design systems I just think about tiny traitor generals and their sneaky traitor messengers racing in the war, their clocks are broken, and some of them are deaf, blind or both.

CAP or no CAP, chaos will reign.

I think FLP (https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf) is a better way to think about systems.

I think CAP is not as relevant in the cloud because the complexity is so high that nobody even knows what is going on, so just the C part, regardless of the other letters, is ridiculously difficult even on a single computer. A book can be written just to explain write(2)'s surprise attacks.

So you think you have guarantees, whatever the designers said they have, AP or CP, and yet... the impossible will happen twice a day (and 3 times at night when it's your on-call).

killjoywashere 10 months ago

The military lives in this world and will likely encourage people to continue thinking about it. Think about wearables on a submarine, as an example. Does the captain want to know his crew is fatigued, about to get sick, getting less exercise than they did on their last deployment? Yes. Can you talk to a cloud? No. Does the Admiral in Hawaii want to know those same answers about that boat, and every boat in the Group, eventually? Yes. For this situation, datacenter-aware databases are great. There are other solutions for other problems.

rdtsc 10 months ago

> The CAP Theorem is Irrelevant

Just sprinkle the magic "cloud" powder on your system and ignore all the theory.

https://ferd.ca/beating-the-cap-theorem-checklist.html

Let's see, let's pick some checkboxes.

(x) you pushed the actual problem to another layer of the system
(x) you're actually building an AP system

xnorswap 10 months ago

There's a better rebuttal(*) of CAP in Kleppmann's DDIA, under the title "The unhelpful CAP theorem".

I won't plagiarize his text; instead, the chapter references his blog post, "Please stop calling databases CP or AP": https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html

(*): rebuttal, I think, is the wrong word, but I couldn't think of a better one.

vmaurin 10 months ago

Plot twist: in the article drawings, replica one and two are split by network, and it could fail.

The author seems not to understand the meaning of the P in CAP.

pyrale 10 months ago

Someone else took ownership of the problem for you and sells you their solution: "The theoretical issue is irrelevant to me".

Sure. Also, there's a long list of other things that are probably irrelevant to you. That is, until your provider fails and you need to understand the situation in order to provide a workaround.

And slapping "load-balancers" everywhere on your schema is not really a solution, because load-balancers themselves are a distributed system with a state and are subject to CAP, as presented in the schema.

> DNS, multi-cast, or some other mechanism directs them towards a healthy load balancer on the healthy side of the partition.

"Somehow, something somewhere will fix my shit hopefully". Also, as a sidenote, a few friends would angrily shake their "it's always DNS" cup reading this.

edit: reading the rest of the blog and author's bio, I'm unsure whether the author is genuinely mistaken, or whether they're advertising their employer's product.

justinsaccount 10 months ago

> None of the clients need to be aware that a network partition exists (except a small number who may see their connection to the bad side drop, and be replaced by a connection to the good side).

What a convenient world where the client is not affected by the network partition.

tristor 10 months ago

As someone who's worked extensively on distributed systems, including at a cloud provider, after reading this I think the author doesn't actually understand the CAP theorem or the two generals problem. Their conclusions are essentially utterly incorrect.

kristjansson 10 months ago

Many things can be solved by the SEP Field[0]

[0]: https://en.wikipedia.org/wiki/Somebody_else's_problem#Douglas_Adams'_SEP_field

ivan_gammel 10 months ago

The CAP theorem is the quantum mechanics of software, with C*A = O(1) in theory, similar to the uncertainty principle, but in many use cases this value is so small that "classical" expectations of both C and A are fine.

mrkeen 10 months ago

> In practice, the redundant nature of connectivity and ability to use routing mechanisms to send clients to the healthy side of partitions

IOW: You can have CAP as long as you can communicate across "partitions".

PaulHoule 10 months ago

So glad to see that the CAP "theorem" is being recognized as a harmful selfish meme like Fielding's REST paper with a deadly seductive power against the overly pedantic.

senorrib 10 months ago

I think every couple of months there's yet another article saying the CAP theorem is irrelevant. The problem with these is that they ignore the fact that CAP theorem isn't a guide, a framework or anything else.

It's simply the formalization of a fact, and whether or not that fact is *important* (although still a fact) depends on the actual use case. Hell, it applies even to services within the same memory space, although obviously the probability of losing any of the three is orders of magnitude less than on a network.

Can we please move on?

fractalic 10 months ago

Hmm this article seems misleading. I suppose it's trying to make the point that application designers usually don't need to think too hard about it, because it's already being addressed by a quorum consensus protocol implemented by someone else. This is a bit of a tautology though; the author seems to be saying 'assume you have a solution to CAP theorem -- now isn't it silly to worry about CAP theorem?'.

One of the fundamental assumptions of CAP theorem is that you can't tell whether or not you have a partition. If you have an oracle that can instantaneously tell you the state of every subsystem, then yeah, CAP is pointless.

But if one of your DBs is connected, reporting itself as alive, and throwing all its writes into /dev/null, you won't be able to route traffic to a quorum of healthy instances because it's not possible to be certain that they're all healthy.

This is what CAP theorem is about: managing data in a distributed system where the status of any given system is fundamentally unknowable because of the Two Generals' Problem (https://en.wikipedia.org/wiki/Two_Generals'_Problem).

In many cases in Cloud though, we can skip that technical stuff and design systems as if we really _did_ have an oracle that could instantaneously and perfectly tell us the state of the system, and things will typically work fine.

rubiquity 10 months ago

The point trying to be made is that with nimble infrastructure the A in CAP can be designed around to such a small amount you may as well be a CP system unless you have a really good reason to go after that 0.005% of availability. Not being CP means sacrificing the wonderful benefits that being consistent (linearizability, sequential consistency, strict serializability) makes possible. It's hard to disagree with that sentiment, and is likely why the Local First ideology is centered on data ownership rather than that extra 0.0005 ounces of availability. Once availability is no longer the center of attention the design space can be focused on durability or latency: how many copies to read/write before acking.

Unfortunately the point is lost because of the usage of the word "cloud", a somewhat contrived example of solving problems by reconfiguring load balancers (in the real world certain outages might not let you reconfigure!), and missing empathy: you can't tell people not to care about the semantics that thinking about, or not thinking about, availability imposes on the correctness of their applications.

As for the usage of the word cloud: I don't know when a set of machines becomes a cloud. Is it the APIs for management? Or when you have two or more implementations of consensus running on the set of machines?
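The "how many copies to read/write before acking" knob is usually reasoned about with the quorum-overlap rule R + W > N. A minimal Python sketch of that arithmetic follows; the function names are made up for illustration and it assumes a simple replicated store with N copies, write quorum W, and read quorum R.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read quorum must share at least one replica with every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    # Pigeonhole: r + w > n forces any r replicas and any w replicas to intersect.
    return r + w > n


def tolerated_failures(n: int, w: int, r: int) -> int:
    """Replicas that may be unreachable while both reads and writes can still be acked."""
    return n - max(w, r)
```

For N=3, W=2, R=2 the quorums overlap and one replica can be down; with W=1, R=1 latency is lower but a read may miss the latest write. That is the durability/latency trade the comment describes.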

lupire 10 months ago

He's saying that you don't need Partition Tolerance because network is never actually Partitioned. This is exactly why the Internet and the US Interstate Highway system were invented in the first place.

Or he's saying you don't need Consistency because your system isn't actually distributed; it's just a centralized system with hot backups.

It's unclear what he's trying to say.

No idea why he wrote the blog post. It doesn't increase my confidence in the engineering quality of his employer, AWS.

cryptonector 10 months ago

> If the partition extended to the whole big internet that clients are on, this wouldn’t work. But they typically don’t.

This is the key, that network partitions either keep some clients from accessing any servers, or they keep some servers from talking to each other. The former case is uninteresting because nothing can be done server-side about it. The latter is interesting and we can fix it with load balancers.

This conflicts with the picture painted earlier in TFA where the unhappy client is somehow stuck with the unhappy server, but let's consider that just didactic.

We can also not use load balancers but have the clients talk to all the servers they can reach, when we trust the clients to behave correctly. Some architectures do this, like Lustre, which is why I mention it.

I see several comments here that seem to take TFA as saying that distributed consensus algorithms/protocols are not needed, but TFA does not say that. TFA says you can have consistency, availability, and partition tolerance because network partitions between servers typically don't extend to clients, and you can have enough servers to maintain quorum for all clients (if a quorum is not available it's as if the whole cloud is down, then it's not available to any clients). That is a very reasonable assertion, IMO.

linuxhansl 10 months ago

So basically this is saying that the CAP theorem is irrelevant because a partition is not really a partition (since the load balancer still can reach everybody). Hmm...

I agree that in modern data centers the CAP theorem is essentially irrelevant for intra-DC services, due to the uptime and redundancy of networking H/W (making a partition less likely than other systemic failures).

Across DCs I'll claim it is still absolutely relevant.

hot_gril 10 months ago

The only concrete solution the article proposes that I can think of: Spanner uses quorum to maintain availability and consistency. Your "master" is TrueTime, which is considered reliable enough. You have replicated app backends. If this isn't too generous, let's also say the cloud handles load balancing well enough. CAP isn't violated, but you might say the user no longer worries about it.

Most databases don't work like Spanner, and Spanner has its downsides, two of them being cost and performance. So most of the time, you're using a traditional DB with maybe a RW replica, which will sacrifice significant consistency or availability depending on whether you choose sync or async mode. And you're back to worrying about CAP.

skywhopper 10 months ago

Weird article. Different users have different priorities and that’s what the CAP theorem expresses. The article also pretends that there’s a magic “load balancer” in the cloud that always works and also knows which segment of a partitioned network is the “correct” one (one of the points of CAP is that there’s not necessarily a “correct” side), and that no users will ever be on the “wrong” side of the partition. And not only that but all replicas see the exact same network partition. None of this is reality.

But the gist, I guess, is that for most applications it’s not actually that important, and that’s probably true. But when it is important, “the cloud” is not going to save you.

jorblumesea 10 months ago

CAP was never designed as an end-all template you blindly apply to large scale systems. Think of it more as a mental starting point: systems have these trade-offs you need to consider. Each system you integrate has complex and nuanced requirements that don't neatly fall into clean buckets.

As always Kleppmann has a great and deep answer for this.

https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html

bjornsing 10 months ago

I suspect the CAP theorem factored into the design of these cloud architectures, in such a way that it now seems irrelevant. But it probably was relevant in preventing a lot of other more complex designs.

KaiserPro 10 months ago

I kinda see what the author is getting at, but I don't buy the argument.

However, in the example with the network partition, it relies on proper monitoring to work out if the DB it's attached to is currently in a partition.

Managing reads is a piece of piss, mostly. It's when you need to propagate writes to the rest of the DB system, that's where stuff gets hairy.

Now, most places can run from a single DB, especially as disks are fucking fast now, so CAP is never really that much of a problem. However when you go multi-region, that's when it gets interesting.

thayne 10 months ago

This only addresses one kind of partition.

What if your servers can't talk to each other, but clients can?

What if clients can't connect to any of your servers?

What if there are multiple partitions, and none of them have a quorum?

Also, changing the routing isn't instantaneous, so you will have some period of unavailability between when the partition happens, and when the client is redirected to the partition with the quorum.
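The "multiple partitions, none with a quorum" case is easy to show with a small Python sketch. It assumes a plain majority quorum over a fixed cluster size; `side_with_quorum` is a hypothetical helper, not taken from any real consensus library.

```python
from typing import Optional, Sequence, Set


def side_with_quorum(cluster_size: int, sides: Sequence[Set[str]]) -> Optional[Set[str]]:
    """Return the side of the partition holding a strict majority, or None if no side does."""
    needed = cluster_size // 2 + 1  # strict majority of the full cluster
    for side in sides:
        if len(side) >= needed:
            return side
    return None


# Five replicas split 3/2: the three-node side can keep making progress.
two_way = side_with_quorum(5, [{"a", "b", "c"}, {"d", "e"}])

# Five replicas split 2/2/1: nobody has a majority, so a CP system stops accepting writes.
three_way = side_with_quorum(5, [{"a", "b"}, {"c", "d"}, {"e"}])
```

Here `two_way` is the three-node side and `three_way` is `None`. In the second case there is no "healthy side" for a load balancer to route clients to, however quickly the routing changes.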

ibash 10 months ago

> if a quorum of replicas is available to the client, they can still get both strong consistency, and uncompromised availability.

Then it’s not a partition.

api 10 months ago

This is just saying because the cloud system hides the implications of the theorem from you, it's not relevant.

I suppose it's kinda true in the sense that how to operate a power plant is not relevant when I turn on my lights.

remram 10 months ago

This article assumes P(artitions) don't happen, and then concludes you can have both C and A. Congrats, that's the CAP theorem.

hinkley 10 months ago

> The formalized CAP theorem would call this system unavailable, based on their definition of availability:

Umm, no? That’s a picture of a partition. The partition is not able to make progress because the system is not partition tolerant. If it did it wouldn’t be consistent. It’s still available.

throw0101c 10 months ago

See also:

> In database theory, the PACELC theorem is an extension to the CAP theorem. It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C) (as per the CAP theorem), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and loss of consistency (C).

* https://en.wikipedia.org/wiki/PACELC_theorem

kwillets 10 months ago

This seems like what I've noticed on MPP systems (a little before cloud): data replicas give a lot more availability than the number of partition events would suggest.

I likely need to read the paper linked, but it's common to have an MPP database lose a node but maintain data availability. CAP applies at various levels, but the notion of availability differs:

1. all nodes available
2. all data available

Redundancy can make #2 a lot more common than #1.
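A rough back-of-the-envelope version of why #2 can be far more common than #1, as a Python sketch. The numbers and the independence assumption are illustrative only: each node is assumed to be up with probability `p` independently of the others, and a shard counts as readable if at least one of its replicas is up.

```python
def all_nodes_up(p: float, nodes: int) -> float:
    """Probability that every node in the cluster is up at once (notion #1)."""
    return p ** nodes


def shard_readable(p: float, replicas: int) -> float:
    """Probability that at least one replica of a given shard is up (per-shard view of notion #2)."""
    return 1 - (1 - p) ** replicas


# Illustrative figures: 100 nodes, each up 99% of the time, 3 replicas per shard.
print(f"all nodes up:   {all_nodes_up(0.99, 100):.3f}")
print(f"shard readable: {shard_readable(0.99, 3):.6f}")
```

Under these assumptions the whole cluster is fully up only a bit over a third of the time, while any given shard is readable almost always. Real failures are correlated, so treat this as intuition rather than a prediction.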

sir-dingleberry 10 months ago

The CAP theorem is irrelevant if your acceptable response time is greater than the time it takes your partitions to sync.

At that point you get all 3: consistency, availability, partitioning.

In my opinion it should be the CAPR theorem.

motbus3 10 months ago

If you don't care about costs...

mcbrit 10 months ago

All models are wrong, some are useful. CAP is probably at least as useful as Newtonian mechanics WHEN you are explaining why you just did a bunch of… extra stuff.

I would like to violate CAP, please. I would like to be nearish to c, please.

Here is my passport. I have done the work.

jumploops 10 months ago

> The point of this post isn’t merely to be the ten billionth blog post on the CAP theorem. It’s to issue a challenge. A request. Please, if you’re an experienced distributed systems person who’s teaching some new folks about trade-offs in your space, don’t start with CAP.

Yeah… no. Just because the cloud offers primitives that allow you to skip many of the challenges that the CAP theorem outlines, doesn’t mean it’s not a critical step to learning about and building novel distributed systems.

I think the author is confusing systems practitioners with distributed systems researchers.

I agree in some part, the former rarely needs to think about CAP for the majority of B2B cloud SaaS. For the latter, it seems entirely incorrect to skip CAP theorem fundamentals in one’s education.

tl;dr: just because Kubernetes (et al.) make building distributed systems easier, it doesn’t mean you should avoid the CAP theorem in teaching or disregard it altogether.

hot_gril 10 months ago

Every time someone tries to deprecate the nice and simple CAP theorem, it grows stronger. It's an unstoppable freight train at this point, like the concept of relational DBs after the NoSQL fad.
