Personally, this kind of thing actually gives me _more_ confidence in Postgres rather than less. The core team's responsiveness to this bug report was incredibly impressive.<p>Around June 4th, the article's author comes in with a bug report that basically says "I hammered Postgres with a whole bunch of artificial load and made something happen" [1].<p>By the 8th, a preliminary patch is ready for review [2]. That includes all the time to get the author's testing bootstrap up and running, reproduce, diagnose the bug (which, lest we forget, is the part of all of this that is actually hard), and assemble a fix. It's worth noting that it's no one's job per se on the Postgres project to fix this kind of thing — the hope is that someone will take interest, step up, and find a solution — and as unlikely as that sounds to work in most environments, amazingly, it usually does for Postgres.<p>Of note to the hacker types here, Peter Geoghegan was able to track the bug down through the use of rr [3] [4], which allowed an entire problematic run to be captured and then stepped through forwards _and_ backwards (the latter being the key to not having to re-run the simulation over and over) until the problematic code was identified and a fix could be developed.<p>---<p>[1] <a href="https://www.postgresql.org/message-id/CAH2-Wzm9kNAK0cbzGAvDtdJi-rj_ngsBbRX0i_DKdjYxqJnzNA%40mail.gmail.com" rel="nofollow">https://www.postgresql.org/message-id/CAH2-Wzm9kNAK0cbzGAvDt...</a><p>[2] <a href="https://www.postgresql.org/message-id/CAH2-Wzk%2BFHVJvSS9VPPJ_K9w4xwqeVyfnkzYWtWrBzXJSJcMVQ%40mail.gmail.com" rel="nofollow">https://www.postgresql.org/message-id/CAH2-Wzk%2BFHVJvSS9VPP...</a><p>[3] <a href="https://en.wikipedia.org/wiki/Rr_(debugging)" rel="nofollow">https://en.wikipedia.org/wiki/Rr_(debugging)</a><p>[4] <a href="https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQh4mFvBH86J7bqZpNqteVUzo8p%3D6Hg%40mail.gmail.com" rel="nofollow">https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQ...</a>
This PostgreSQL mailing list thread lets you read along with the PostgreSQL developers and Jepsen; it seems like a very useful discussion:
<a href="https://www.postgresql.org/message-id/flat/db7b729d-0226-d162-a126-8a8ab2dc4443%40jepsen.io" rel="nofollow">https://www.postgresql.org/message-id/flat/db7b729d-0226-d16...</a>
It is very rare to see a Jepsen report that concludes with a note that a project is being too humble about its consistency promises.<p>Finding effectively only a single obscure, now-fixed issue where real-world consistency did not match the promised consistency is pretty impressive.
><i>PostgreSQL has an extensive suite of hand-picked examples, called isolationtester, to verify concurrency safety. Moreover, independent testing, like Martin Kleppmann’s Hermitage has also confirmed that PostgreSQL’s serializable level prevents (at least some!) G2 anomalies. Why, then, did we immediately find G2-item with Jepsen? How has this bug persisted for so long?</i><p>This is super interesting. Jepsen seems to be like Hypothesis for race conditions: you specify the race condition to be triggered and it generates tests to simulate it.<p>Yesterday, GitLab acquired a fuzz-testing company [1]. I wonder if Jepsen was envisioned as a fully CI-integrated testing system.<p>[1] <a href="https://m.calcalistech.com/Article.aspx?guid=3832552" rel="nofollow">https://m.calcalistech.com/Article.aspx?guid=3832552</a>
Reading through the source of Elle:<p>> "I cannot begin to convey the confluence of despair and laughter which I encountered over the course of three hours attempting to debug this issue. We assert that all keys have the same type, and that at most one integer type exists. If you put a mix of, say, Ints and Longs into this checker, you WILL question your fundamental beliefs about computers" [1].<p>I feel like Jepsen/Elle is a great argument for Clojure; reading the source is actually kind of fun. Not what you'd expect for a project like this.<p>[1]: <a href="https://github.com/jepsen-io/elle/blob/master/src/elle/txn.clj#L43" rel="nofollow">https://github.com/jepsen-io/elle/blob/master/src/elle/txn.c...</a>
This is my understanding of what a G2-Item Anti-dependency Cycle is from the linked paper example:<p><pre><code> -- Given (roughly) the following transactions:
-- Transaction 1 (SELECT, T1)
with all_employees as (
select sum(salary) as salaries
from employees
),
department as (
select department, sum(salary) as salaries
from employees group by department
)
select (select salaries from all_employees)
     - (select sum(salaries) from department) as difference;
-- Transaction 2 (INSERT, T2)
insert into employees (name, department, salary)
values ('Tim', 'Sales', 70000);
-- G2-Item is where the INSERT completes between all_employees and department,
-- making the SELECT result inconsistent
</code></pre>
This is called an "anti-dependency" issue because T2 clobbers the data T1 depends on before it completes.<p>They say Elle found 6 such cases in 2 minutes, which I'm guessing is a "very big number" of transactions, but I can't figure out exactly how big that number is from the included logs/results.<p>Also, "Elle has found unexpected anomalies in every database we've checked".
Props to Jepsen for exposing this long-standing bug. Props to the PG team for identifying the culprit and for their response. This report just strengthens my faith in the project.
It would be great to see Jepsen testing on distributed Postgres, since what they've found here is a single-node issue. In prod, don't folks run HA?
> Neither process crashes, multiple tables, nor secondary-key access is required to reproduce our findings in this report. The technical justification for including them in this workload is “for funsies”.<p>Always read the footnotes!
By the way: where does the Jepsen name come from?<p>I have wondered more than once, and my browsing and searching skills are failing me on this one.<p>Edit: The closest link I can find is "Call me maybe", but I am not able to find a direct connection or even an explicit mention for now.
I am still wondering when we will see PostgreSQL being tested in an HA configuration.<p>It's just extraordinary to me that it's 2020 and it still does not have a built-in, supported set of features for this use case. Instead we have to rely on proprietary vendor solutions or dig through the many obsolete or unsupported options.
So this does not affect SSI guarantees if the transactions involved all operate on the same row? Is my understanding correct?
For instance, can I update a counter with serializable isolation and not run into this bug?
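I can't speak to whether this particular bug reaches the single-row case, but for reference, here is roughly what the counter pattern looks like under serializable isolation (the counters table and column names below are made up). Whatever else is true, serializable transactions in Postgres can always fail with a serialization error, so the increment has to be retried on SQLSTATE 40001:<p><pre><code> -- hypothetical table: counters(id int primary key, value bigint)
begin isolation level serializable;
update counters set value = value + 1 where id = 1;
commit;
-- if two of these race on the same row, one may abort with
--   ERROR: could not serialize access due to concurrent update
-- (SQLSTATE 40001, serialization_failure); the application should catch that and retry
</code></pre>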
For the repeatable read issue, I don't intuitively understand why the violation mentioned would be a problem. In particular, even though the transaction sequence listed wouldn't make sense at the serializable level, it seems consistent with what I'd expect from repeatable read (though I have not read the ANSI SQL standard's definition of repeatable read).<p>Any insights into why we should want repeatable read to block that? It feels like blocking that is specifically the purpose of serializable isolation.
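Not an authoritative answer, but it might help to see the shape of the anomaly. Below is a minimal write-skew (G2-item) sketch in plain SQL, using a hypothetical accounts table where rows 1 and 2 both start at 100. PostgreSQL's repeatable read is really snapshot isolation, which permits this interleaving; in the Adya-style formalization the report leans on, repeatable read proscribes G2-item, which is why some readers expect it to be blocked:<p><pre><code> -- hypothetical table: accounts(id int primary key, balance int)
-- session A
begin isolation level repeatable read;
select balance from accounts where id = 2;   -- sees 100

-- session B
begin isolation level repeatable read;
select balance from accounts where id = 1;   -- sees 100
update accounts set balance = 0 where id = 2;
commit;

-- session A (continued)
update accounts set balance = 0 where id = 1;
commit;  -- also succeeds: each transaction read a row the other overwrote,
         -- so no serial order of A and B explains the final state
</code></pre>Under serializable, Postgres's SSI is expected to abort one of the two with a serialization failure instead.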
I don’t know why the author is surprised that Postgres offers stronger guarantees in practice than serializability. Serializability per se allows anomalies that would be disastrous in real applications: <a href="http://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html" rel="nofollow">http://dbmsmusings.blogspot.com/2019/06/correctness-anomalie...</a>.
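For anyone curious what those anomalies look like, here is a hedged sketch of the stale-read case that post describes, using a hypothetical accounts table on a primary plus a lagging read replica. The history is serializable (ordering the read before the deposit explains it), but it violates strict serializability, because the read began after the deposit committed in real time:<p><pre><code> -- hypothetical table: accounts(id int primary key, balance int)
-- t0, on the primary:
begin;
update accounts set balance = balance + 100 where id = 1;
commit;

-- t1, strictly later in wall-clock time, on a lagging read replica:
begin;
select balance from accounts where id = 1;  -- may still show the old balance
commit;
-- serializable, since the serial order "read, then deposit" explains it,
-- but surprising to the user who just made the deposit
</code></pre>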
@aphyr could you please clarify this sentence?<p>> <i>This behavior is allowable due to long-discussed ambiguities in the ANSI SQL standard, but could be surprising for users familiar with the literature.</i><p>Should that be "not familiar"? And which literature - the standard or the discussions?
Thanks for doing these. They're incredibly interesting, useful, and amusing (Oh no! The schadenfreude!), and also incredibly inspiring to me to be a better engineer, so thank you again :)