Personally, this kind of thing actually gives me _more_ confidence in Postgres rather than less. The core team's responsiveness to this bug report was incredibly impressive.<p>Around June 4th, the article's author comes in with a bug report that basically says "I hammered Postgres with a whole bunch of artificial load and made something happen" [1].<p>By the 8th, a preliminary patch is ready for review [2]. That includes all the time to get the author's testing bootstrap up and running, reproduce, diagnose the bug (which, lest we forget, is the part of all of this that is actually hard), and assemble a fix. It's worth noting that it's no one's job per se on the Postgres project to fix this kind of thing — the hope is that someone will take interest, step up, and find a solution — and as unlikely as that sounds to work in most environments, amazingly, it usually does for Postgres.<p>Of note to the hacker types here, Peter Geoghegan was able to track the bug down through the use of rr [3] [4], which allowed an entire problematic run to be captured and then stepped through forwards _and_ backwards (the latter being the key to not having to re-run the simulation over and over) until the problematic code was identified and a fix could be developed.<p>---<p>[1] <a href="https://www.postgresql.org/message-id/CAH2-Wzm9kNAK0cbzGAvDtdJi-rj_ngsBbRX0i_DKdjYxqJnzNA%40mail.gmail.com" rel="nofollow">https://www.postgresql.org/message-id/CAH2-Wzm9kNAK0cbzGAvDt...</a><p>[2] <a href="https://www.postgresql.org/message-id/CAH2-Wzk%2BFHVJvSS9VPPJ_K9w4xwqeVyfnkzYWtWrBzXJSJcMVQ%40mail.gmail.com" rel="nofollow">https://www.postgresql.org/message-id/CAH2-Wzk%2BFHVJvSS9VPP...</a><p>[3] <a href="https://en.wikipedia.org/wiki/Rr_(debugging)" rel="nofollow">https://en.wikipedia.org/wiki/Rr_(debugging)</a><p>[4] <a href="https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQh4mFvBH86J7bqZpNqteVUzo8p%3D6Hg%40mail.gmail.com" rel="nofollow">https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQ...</a>
This PostgreSQL mailing list thread lets you read along with the PostgreSQL developers and Jepsen; it seems like a very useful discussion:
<a href="https://www.postgresql.org/message-id/flat/db7b729d-0226-d162-a126-8a8ab2dc4443%40jepsen.io" rel="nofollow">https://www.postgresql.org/message-id/flat/db7b729d-0226-d16...</a>
It is very rare to see a Jepsen report that concludes with a note that a project is being too humble about its consistency promises.<p>Finding effectively only a single obscure, now-fixed issue where real-world consistency did not match the promised consistency is pretty impressive.
><i>PostgreSQL has an extensive suite of hand-picked examples, called isolationtester, to verify concurrency safety. Moreover, independent testing, like Martin Kleppmann’s Hermitage has also confirmed that PostgreSQL’s serializable level prevents (at least some!) G2 anomalies. Why, then, did we immediately find G2-item with Jepsen? How has this bug persisted for so long?</i><p>This is super interesting. Jepsen seems to be like Hypothesis for race conditions: you specify the race condition to be triggered and it generates tests to simulate it.<p>Yesterday, GitLab acquired a fuzz-testing company [1]. I wonder if Jepsen was envisioned as a fully CI-integrated testing system.<p>[1] <a href="https://m.calcalistech.com/Article.aspx?guid=3832552" rel="nofollow">https://m.calcalistech.com/Article.aspx?guid=3832552</a>
Reading through the source of Elle:<p>> "I cannot begin to convey the confluence of despair and laughter which I encountered over the course of three hours attempting to debug this issue. We assert that all keys have the same type, and that at most one integer type exists. If you put a mix of, say, Ints and Longs into this checker, you WILL question your fundamental beliefs about computers" [1].<p>I feel like Jepsen/Elle is a great argument for Clojure; reading the source is actually kind of fun. Not what you'd expect for a project like this.<p>[1]: <a href="https://github.com/jepsen-io/elle/blob/master/src/elle/txn.clj#L43" rel="nofollow">https://github.com/jepsen-io/elle/blob/master/src/elle/txn.c...</a>
This is my understanding of what a G2-Item Anti-dependency Cycle is from the linked paper example:<p><pre><code> -- Given (roughly) the following transactions:
-- Transaction 1 (SELECT, T1)
with all_employees as (
select sum(salary) as salaries
from employees
),
department as (
select department, sum(salary) as salaries
from employees group by department
)
select (select salaries from all_employees)
     - (select sum(salaries) from department) as difference;
-- Transaction 2 (INSERT, T2)
insert into employees (name, department, salary)
values ('Tim', 'Sales', 70000);
-- G2-Item is where the INSERT completes between all_employees and department,
-- making the SELECT result inconsistent
</code></pre>
This is called an "anti-dependency" issue because T2 clobbers the data T1 depends on before it completes.<p>They say Elle found 6 such cases in 2 minutes, which I'm guessing is a "very big number" of transactions, but I can't figure out exactly how big that number is from the included logs/results.<p>Also, "Elle has found unexpected anomalies in every database we've checked".
Props to Jepsen for exposing this long-standing bug. Props to the PG team for identifying the culprit and for their response. This report just strengthens my faith in the project.
It would be great to see Jepsen testing on distributed Postgres, since what they've found here is a single-node issue. In prod, don't folks run HA?
> Neither process crashes, multiple tables, nor secondary-key access is required to reproduce our findings in this report. The technical justification for including them in this workload is “for funsies”.<p>Always read the footnotes!
By the way: where does the Jepsen name come from?<p>I have wondered more than once, and my browsing and searching skills are failing me on this one.<p>Edit: The closest link I can find is "Call me maybe", but I am not able to find a direct connection or even an explicit mention for now.
I am still wondering when we will see PostgreSQL being tested in an HA configuration.<p>It's just extraordinary to me that it's 2020 and it still does not have a built-in, supported set of features for this use case. Instead we have to rely on proprietary vendor solutions or dig through the many obsolete or unsupported options.
So this does not affect SSI guarantees if the transactions involved all operate on the same row? Is my understanding correct?
For instance, can I update a counter with serializable isolation and not run into this bug?
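I can't speak to whether this particular bug reaches the single-row case, but for reference, here is roughly what the counter pattern looks like under serializable isolation (the counters table and column names below are made up). Whatever else is true, serializable transactions in Postgres can always fail with a serialization error, so the increment has to be retried on SQLSTATE 40001:<p><pre><code> -- hypothetical table: counters(id int primary key, value bigint)
begin isolation level serializable;
update counters set value = value + 1 where id = 1;
commit;
-- if two of these race on the same row, one may abort with
--   ERROR: could not serialize access due to concurrent update
-- (SQLSTATE 40001, serialization_failure); the application should catch that and retry
</code></pre>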
For the repeatable read issue, I don't intuitively understand why the violation mentioned would be a problem. In particular, even though the transaction sequence listed wouldn't make sense at the serializable level, it seems consistent with what I'd expect from repeatable read (though I have not read the ANSI SQL standard's definition of repeatable read).<p>Any insights into why we should want repeatable read to block that? It feels like blocking that is specifically the purpose of serializable isolation.
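Not an authoritative answer, but it might help to see the shape of the anomaly. Below is a minimal write-skew (G2-item) sketch in plain SQL, using a hypothetical accounts table where rows 1 and 2 both start at 100. PostgreSQL's repeatable read is really snapshot isolation, which permits this interleaving; in the Adya-style formalization the report leans on, repeatable read proscribes G2-item, which is why some readers expect it to be blocked:<p><pre><code> -- hypothetical table: accounts(id int primary key, balance int)
-- session A
begin isolation level repeatable read;
select balance from accounts where id = 2;   -- sees 100

-- session B
begin isolation level repeatable read;
select balance from accounts where id = 1;   -- sees 100
update accounts set balance = 0 where id = 2;
commit;

-- session A (continued)
update accounts set balance = 0 where id = 1;
commit;  -- also succeeds: each transaction read a row the other overwrote,
         -- so no serial order of A and B explains the final state
</code></pre>Under serializable, Postgres's SSI is expected to abort one of the two with a serialization failure instead.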
I don’t know why the author is surprised that Postgres offers stronger guarantees in practice than serializability. Serializability per se allows anomalies that would be disastrous in real applications: <a href="http://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html" rel="nofollow">http://dbmsmusings.blogspot.com/2019/06/correctness-anomalie...</a>.
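For anyone curious what those anomalies look like, here is a hedged sketch of the stale-read case that post describes, using a hypothetical accounts table on a primary plus a lagging read replica. The history is serializable (ordering the read before the deposit explains it), but it violates strict serializability, because the read began after the deposit committed in real time:<p><pre><code> -- hypothetical table: accounts(id int primary key, balance int)
-- t0, on the primary:
begin;
update accounts set balance = balance + 100 where id = 1;
commit;

-- t1, strictly later in wall-clock time, on a lagging read replica:
begin;
select balance from accounts where id = 1;  -- may still show the old balance
commit;
-- serializable, since the serial order "read, then deposit" explains it,
-- but surprising to the user who just made the deposit
</code></pre>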
@aphyr could you please clarify this sentence?<p>> <i>This behavior is allowable due to long-discussed ambiguities in the ANSI SQL standard, but could be surprising for users familiar with the literature.</i><p>Should that be "not familiar"? And which literature - the standard or the discussions?
Thanks for doing these. They're incredibly interesting, useful, and amusing (Oh no! The schadenfreude!), and also incredibly inspiring to me to be a better engineer, so thank you again :)