科技回声 (Tech Echo), a tech news platform built with Next.js for global tech news and discussion.

MVCC – the part of PostgreSQL we hate the most (2023)

382 points by virtualwhys 7 months ago

26 comments

hn_throwaway_99, 7 months ago
Wow, as someone who feels decently familiar with the ins and outs of Postgres, I thought this was a great article and I learned a ton.

It seems like one of the biggest fundamental flaws is that Postgres chose the O2N approach for tracking row versions instead of N2O. While switching to N2O wouldn't solve all problems (e.g. the article also talks about how Postgres stores full row copies and not just diffs), from an "80/20 rule" perspective, it seems like it would get rid of most of the downsides of the current implementation. For example, I'd assume that the vast majority of the time transactions want the latest row version, so using the N2O ordering means you could probably do away with storing each row version in an index, as you'd only need to traverse the linked list if you needed an older version, which should be much less common.
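A toy sketch of the chain-ordering difference the comment describes (invented names and structures, not PostgreSQL internals): with O2N the chain head is the oldest version, so reaching the latest version costs a walk over the whole chain, while with N2O the head already is the latest.

```python
# Toy MVCC version chain, oldest-to-newest (O2N) vs newest-to-oldest (N2O).
# Purely illustrative; real heap tuples carry xmin/xmax and ctid pointers.

class Version:
    def __init__(self, value, next_version=None):
        self.value = value
        self.next = next_version

def latest_o2n(head):
    # O2N: head is the oldest version; walk to the end of the chain.
    steps, node = 0, head
    while node.next is not None:
        node = node.next
        steps += 1
    return node.value, steps

def latest_n2o(head):
    # N2O: head is already the latest version; zero hops.
    return head.value, 0

# Build a 5-version chain in O2N order: v1 -> v2 -> ... -> v5.
chain = None
for v in ["v5", "v4", "v3", "v2", "v1"]:
    chain = Version(v, chain)  # final head is v1 (oldest)

print(latest_o2n(chain))          # ('v5', 4): four hops to the latest
print(latest_n2o(Version("v5")))  # ('v5', 0): latest is the head
```

This is why, under O2N, common "give me the current row" reads pay a traversal cost that N2O avoids.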
fweimer, 7 months ago
The big advantage is that you do not need any extra space if your workload mostly consists of INSERTs (followed by table drops). And it's generally unnecessary to split up insertion transactions because there is no size limit as such (neither on the generated data nor on the total count of rows changed). There is a limit on statements in a transaction, but you can sidestep that by using COPY FROM if you do not have to switch tables too frequently. From a DBA point of view, there is no need to manage a rollback/undo space separately from table storage.

Every application is a bit different, but it's not that the PostgreSQL design is a loser in all regards. It's not like bubble sort.
OrvalWintermute, 7 months ago
This article is incorrect IMO, the following section in particular:

"In the 2000s, the conventional wisdom selected MySQL because rising tech stars like Google and Facebook were using it. Then in the 2010s, it was MongoDB because non-durable writes made it 'webscale'. In the last five years, PostgreSQL has become the Internet's darling DBMS. And for good reasons! It's dependable, feature-rich, extensible, and well-suited for most operational workloads."

Smart engineers were choosing Postgres not because of the ad populum fallacy, but for the following reasons: data safety (not MyISAM), ACID, similarity to Oracle, MVCC, SQL standards adherence, the Postgres team, a helpful and awesome community, data types, high performance, and BSD flexibility.

Those are the reasons I selected Postgres while at ATT in the early 2000s, and our Oracle DBA found it a very easy transition. While MySQL went through rough transitions, PG has gone from strength to strength on an ever-improving path.

I think Bruce Momjian is a big part of this success; they truly have an excellent community. <3
nightfly, 7 months ago
> MySQL and Oracle store a compact delta between the new and current versions (think of it like a git diff).

Doesn't git famously _not_ store diffs, and instead follow the same storage pattern Postgres uses here, storing the full new and old objects?
mxey, 7 months ago
> The need for PostgreSQL to modify all of a table's indexes for each update has several performance implications. Obviously, this makes update queries slower because the system has to do more work.

You know, I was wondering something regarding this write amplification. It's true that MySQL doesn't need to update its indexes like that. However, MySQL replication relies on the binlog, where every change has to be written in addition to the database itself (InnoDB redo log and so on). So it seems to me that MySQL, if used in a cluster, has a different kind of write amplification, one that PostgreSQL does not have because it reuses its WAL for replication.

In addition, on the receiving side, MySQL first writes the incoming binlog to the relay log. The relay log is then consumed by the applier threads, creating more InnoDB writes and (by default) more binlog.
halayli, 7 months ago
This topic cannot be discussed without also talking about disks. SSDs write a 4k page at a time, meaning that if you're going to update 1 bit, the disk will read 4k, you update the bit, and it writes back a 4k page in a new slot. So the penalty for copying varies depending on the disk type.
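The arithmetic behind that point can be sketched as a simple write-amplification estimate (a deliberate simplification: real flash adds erase blocks and FTL behavior on top of page-sized writes):

```python
# Back-of-the-envelope device-level write amplification for small updates
# on an SSD that writes in 4 KiB pages. Illustrative only.

PAGE_BYTES = 4096

def write_amplification(logical_bytes_changed):
    # The device must rewrite at least one whole page per update.
    pages = max(1, -(-logical_bytes_changed // PAGE_BYTES))  # ceil division
    return (pages * PAGE_BYTES) / logical_bytes_changed

print(write_amplification(1))     # 4096.0: a one-byte change costs a full page
print(write_amplification(4096))  # 1.0: page-aligned writes have no penalty
```

So whether a DBMS copies whole rows or compact deltas, the device rounds every write up to page granularity, which shifts where the copying penalty actually bites.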
dfox, 7 months ago
> Oracle and MySQL do not have this problem in their MVCC implementation because their secondary indexes do not store the physical addresses of new versions. Instead, they store a logical identifier (e.g., tuple id, primary key) that the DBMS then uses to look up the current version's physical address. Now this may make secondary index reads slower since the DBMS has to resolve a logical identifier, but these DBMS have other advantages in their MVCC implementation to reduce overhead.

An interesting behavior of MySQL that I have observed (~500GB database, with a schema that is more document-oriented than relational) is that when you update a single row, doing SELECT id WHERE something; UPDATE what WHERE id=id is orders of magnitude faster than UPDATE what WHERE something. I suspect that this is the reason for that behavior. But the normal workload will not do that, and this only slows down ad-hoc DML when you fix some inconsistency.
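The indirection the quoted passage describes can be sketched with two toy secondary-index designs (invented dict-based structures, not actual InnoDB or PostgreSQL code): a "physical" index points straight at a heap slot, while a "logical" index stores the primary key and pays an extra lookup, but survives the row moving.

```python
# Toy comparison: secondary index holding physical pointers vs logical ids.

heap = {0: {"pk": 10, "email": "a@x"}, 1: {"pk": 11, "email": "b@x"}}
pk_index = {10: 0, 11: 1}               # primary key -> heap slot
phys_index = {"a@x": 0, "b@x": 1}       # secondary index, physical pointers
logical_index = {"a@x": 10, "b@x": 11}  # secondary index, primary keys

def lookup_physical(email):
    return heap[phys_index[email]]               # one hop

def lookup_logical(email):
    return heap[pk_index[logical_index[email]]]  # two hops

# Move row pk=10 to a new heap slot, as an update would in PostgreSQL;
# only the primary-key index needs to learn the new location.
heap[2] = heap.pop(0)
pk_index[10] = 2

print(lookup_logical("a@x")["pk"])  # 10: logical index still resolves
# phys_index["a@x"] still says slot 0, which no longer exists: stale.
```

The trade-off is exactly the one quoted: slower secondary-index reads in exchange for not rewriting every secondary index on each update.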
thih9, 7 months ago
> Then in the 2010s, it was MongoDB because non-durable writes made it "webscale".

Off topic: it was marketing all along: https://news.ycombinator.com/item?id=15124306
vinnymac, 7 months ago
The part I hate the most is that in 2024 I still need a connection pooler (such as pgbouncer) in front of it to make it usable.
fforflo, 7 months ago
OrioleDB was supposed to tackle this problem with a new storage engine: https://github.com/orioledb/orioledb
globular-toast, 7 months ago
The article says the benefit of O2N is that there's no need to immediately update indexes, but then goes on to say Postgres updates the indexes anyway! So is there actually any advantage to O2N at all?
didgetmaster, 7 months ago
How big an issue is this really for DB users who work daily with large tables that are frequently updated but still have to be fast for queries?

The article mentioned that there are nearly 900 different databases on the market. I am trying to build yet another one, using a unique architecture I developed. It is very fast, but, although it is designed for transactions, I haven't implemented them yet.

I think if I spend the time and effort to do it right, this could be a real game changer (I think the architecture lends itself very well to a superior implementation); but I don't want to waste too much time on it if people don't really care one way or the other.
hlandau, 7 months ago
With every new PostgreSQL release we see yet more features and sugar added to the frontend, yet seemingly no meaningful improvement to the backend/storage layer, which suffers these fundamental problems.

I wish the PostgreSQL community would stop chasing more frontend features and spend a concerted few years completely renovating their storage layer. The effort in each release seems massively and disproportionately skewed towards frontend improvements, without the will to address these fundamental issues.

It's absurd that in 2024, "the world's most advanced open source database" doesn't have a method of doing upgrades between major versions that doesn't involve taking the database down.

Yes, logical replication exists, but it still doesn't do DDL, so it has big caveats attached.
bluedonuts, 7 months ago
Loved this post! Could anyone recommend a book (or other medium) with similar content about RDBMS internals?
thr0w, 7 months ago
I don't agree with their characterization of `pg_repack`. `VACUUM FULL` is definitely crushing, but that's why repack exists as a faster/lighter alternative. Anyone have a different experience?
guenthert, 7 months ago
> And second, traversing the entire version chain just to find the latest version (which is what most queries want) is wasteful.

Is that what most queries want? I would have thought that the latest version is part of a transaction which isn't committed yet, and hence isn't even meant to be found by most queries (only those within the session which opened the modifying transaction). Where did I go wrong?
Izkata, 7 months ago
They mention autovacuum_vacuum_scale_factor and its default value, but give no hint whether they tried to change it. Obviously I have no access to their database, but one piece of advice for ages has been, in situations similar to theirs where a lot of dead tuples accumulate and autovacuum has trouble finishing, to lower this value so autovacuum runs much more often; each run then has less to do, and less gets blocked.
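The knob works through autovacuum's trigger condition: a table is vacuumed once its dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples` (defaults 50 and 0.2). A quick sketch of that formula:

```python
# Dead-tuple count at which autovacuum kicks in for a table of a given size,
# per PostgreSQL's documented trigger formula (defaults: threshold=50,
# scale_factor=0.2).

def vacuum_trigger(reltuples, scale_factor=0.2, threshold=50):
    return threshold + scale_factor * reltuples

# With defaults, a 10M-row table accumulates ~2M dead tuples before
# autovacuum starts; at scale_factor=0.01 it starts ~20x sooner.
print(vacuum_trigger(10_000_000))                     # 2000050.0
print(vacuum_trigger(10_000_000, scale_factor=0.01))  # 100050.0
```

Lowering the scale factor (per table or globally) is exactly the "smaller, more frequent runs" tuning the comment recommends.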
WuxiFingerHold, 7 months ago
Putting aside the license / closed-source issues with CockroachDB (CRDB) and just focusing on it technically: CRDB uses MVCC too, but its storage is a key-value store. I know it uses some kind of garbage collection to remove the old versions.

I wonder if CRDB (or other newer-design DBs) has circumvented those issues? Or do we just not hear about them because CRDB and the other newer DBs are not that widely used, and mainly in the commercial space?
mbreese, 7 months ago
So, if this is such a problem, my question is: are the poor MVCC choices of Postgres enough to make the authors (or people here) recommend another RDBMS?
uvas_pasas_per, 7 months ago
My problem with PG is that it just doesn't seem to help much with my situation. I want to write apps that work offline and sync data across devices using the "cloud". I think that means SQLite on the client, but ?? for the server. I have yet to find a good book explaining techniques for this kind of replication/syncing.
derefr, 7 months ago
Question: is storing full new row-tuple versions something fundamental to Postgres as a whole, or is it just a property of the default storage engine / "table access method"?
msie, 7 months ago
Nicely written article. Easy to read and understand!
avg_dev, 7 months ago
Pretty informative. Now I understand why people so often have issues with vacuum-related stuff. I like the diagrams too.
srcreigh, 7 months ago
It's really annoying to see people write that Postgres has a "primary index" and "secondary indexes". No. That's not what those words mean. Every index in Postgres is a secondary index.
terminalbraid, 7 months ago
https://news.ycombinator.com/item?id=41892830
jongjong, 7 months ago
For most cases, MVCC sounds like over-engineering. From the problem description:

> The goal of MVCC in a DBMS is to allow multiple queries to read and write to the database simultaneously without interfering with each other when possible.

How is that a problem for most use cases?

If there is a read query which is taking a long time, over many rows, and some of the later rows happen to be updated mid-read while the earlier rows are not... it's not really a problem for the vast majority of applications. Why is it better for all rows to be delivered out of date versus just the first half fetched being out of date? It's not ideal in either case, but it's unavoidable that some requests will sometimes return out-of-date data. It seems like a tiny advantage.

I suspect the real need to implement MVCC arose out of the desire for databases like Postgres to implement atomic transactions as a magical black box.

IMO, two-phase commit is a simpler solution to this problem. It's not possible to fully hide concurrency complexity from the user; it ends up with tradeoffs.
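For readers unfamiliar with the alternative the comment proposes, here is a minimal two-phase-commit sketch (toy in-memory participants with invented names, not a real protocol implementation): phase one collects votes, phase two applies the coordinator's all-or-nothing decision.

```python
# Toy two-phase commit: prepare (vote) phase, then commit/abort phase.

class Participant:
    def __init__(self, will_commit=True):
        self.will_commit = will_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes/no and hold resources if prepared.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, commit):
        # Phase 2: apply the coordinator's global decision.
        if self.state == "prepared":
            self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]
    decision = all(votes)  # commit only if every participant voted yes
    for p in participants:
        p.finish(decision)
    return decision

nodes = [Participant(), Participant(), Participant(will_commit=False)]
print(two_phase_commit(nodes))       # False: a single "no" vote aborts all
print([n.state for n in nodes])      # every participant ends up aborted
```

Note the trade-off the comment alludes to: 2PC is conceptually simple, but participants block holding locks between the two phases, which is one reason MVCC-style snapshot reads remain attractive.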