For those unaware, zheap is a new storage engine for Postgres that handles updates and deletes differently. Currently, when you update a row in Postgres, Postgres writes a new copy of the row and marks the old one as deleted. This is done for several reasons; in particular, it makes concurrency control easier and it makes rolling back transactions easier.

The issue with this approach is that over time it leads to lots of "dead rows": deleted row versions that still take up space in your DB. Postgres has a background job called the vacuum which effectively garbage-collects dead rows. Depending on your settings, the vacuum can be a pretty expensive job, and even with the vacuum you can still wind up using a lot more space than you actually need.

Zheap addresses these problems with a different approach. When performing an update, instead of marking the old row as deleted and inserting a new one, Postgres replaces the old row in place and writes a copy of the old version to a separate undo log. This means the main table file doesn't need to grow just to keep track of dead rows.

Zheap does lead to lots of tricky scenarios. If for any reason you need an old copy of a row, you have to fetch it from the undo log. If the transaction that performed the update is rolled back, you need to put the old version back in place of the new one. This sounds straightforward, but gets really tricky really fast. For example, say I have row X that takes up 1 kB. I replace it with row Y that takes up 500 bytes. I then write a row Z, also 500 bytes, into the space freed up after Y. If the original transaction now rolls back, row X no longer fits in its original spot, because row Z is occupying part of the space X used to take up.
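To make that last scenario concrete, here's a toy sketch (my own illustration, not zheap's actual page layout; the 1200-byte page size is picked so the numbers from the example collide):

    # Toy model: rows packed into a fixed-size page; lengths in bytes.
    class Page:
        def __init__(self, size):
            self.size = size
            self.rows = {}  # row id -> length

        def free(self):
            return self.size - sum(self.rows.values())

        def write(self, row_id, length):
            # Extra space needed beyond what this row already occupies.
            needed = length - self.rows.get(row_id, 0)
            if needed > self.free():
                raise RuntimeError(f"row {row_id} ({length}B) no longer fits")
            self.rows[row_id] = length

    page = Page(size=1200)
    page.write("X", 1000)   # row X takes 1 kB
    undo = ("X", 1000)      # old version saved, as an undo log would
    page.write("X", 500)    # in-place update: X replaced by the smaller Y
    page.write("Z", 500)    # row Z moves into the freed space
    page.write(*undo)       # rollback: X needs 500B more, only 200B is free

The real system works at the level of 8 kB heap pages with item pointers rather than a simple byte pool, but the core tension is the same: undo has to restore data into space that may have been reused.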
Confusingly, there is another new PostgreSQL storage effort called "Zedstore": https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w%40mail.gmail.com

Disclosure: I work for VMware, which is sponsoring Zedstore development via Greenplum.
There is a super informative blog post from last week at https://www.cybertec-postgresql.com/en/zheap-reinvented-postgresql-storage/ , discussed at https://news.ycombinator.com/item?id=24710765
> What happens if the new row does not fit in? In this case performance will be worse because zheap essentially has to perform a DELETE / INSERT operation which is of course not as efficient as an in-place UPDATE.

I would be wary of this. InnoDB, for example, also has an optimistic (in-place) UPDATE mode and a pessimistic (delete + insert) UPDATE mode.

Repeatedly updating the same row under the pessimistic mode would end up stalling the database, even at a rather low QPS.

https://bugs.mysql.com/bug.php?id=53825 was originally reported by Facebook 10 years ago, and is still not fixed in 8.0.21, even with VATS / CATS scheduling.

And then there is also the performance gap between the ideal case, where the undo log is still in memory, and the pathological case, where the undo log has to be fetched from disk.

The last thing Postgres needs is something that looks good on paper / in benchmarks but has a bunch of issues in production.
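One reason hot rows hurt in undo-based designs: a reader with an old snapshot has to walk back through every newer version before finding one it's allowed to see, and each hop can be a disk read once undo has been evicted from memory. A rough sketch of that walk (my own illustration, not InnoDB's or zheap's actual logic):

    # Each row version carries the timestamp of the transaction that wrote it.
    # undo_chain is ordered newest-to-oldest; walking it reconstructs history.
    def visible_version(newest, undo_chain, snapshot_ts):
        version = newest
        for older in undo_chain:
            if version["ts"] <= snapshot_ts:
                break               # this version is visible to our snapshot
            version = older         # one more hop -- possibly a disk read

        return version if version["ts"] <= snapshot_ts else None

    row = {"ts": 100, "val": "newest"}
    chain = [{"ts": 90, "val": "older"}, {"ts": 50, "val": "oldest"}]
    print(visible_version(row, chain, snapshot_ts=60))  # -> the ts=50 version

A row updated thousands of times under long-running readers makes that chain long, which is exactly the kind of stall the bug report above describes.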
It is nice that PG is getting another storage engine. I hope it will ease the implementation of column-based storage in official PG, as row-based storage is a major problem with very large datasets.
Great to see Antonin and others working on this! I did some work on some pieces of this (along with many others), and I gave a talk about an earlier prototype of the undo log machinery and how it fits at PGCon 2019 (which now seems like a whole lifetime ago!). Slides and recording here in case the background is interesting:

https://speakerdeck.com/macdice/transactions-in-postgresql-and-other-animals
This seems neat!

Last year I helped a friend diagnose an issue they had with Postgres. A database table with a scheduled full DELETE/INSERT had slowed to the point of failure. It turned out that having slightly less I/O capacity than needed caused the autovacuum process to fall further and further behind each time it ran.

My friend simply provisioned more I/O and moved on. Another option would have been to rewrite the process to naturally produce fewer dead rows. It would be great to have a third feasible option.
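For anyone who hits something similar: the standard pg_stat_user_tables view exposes dead-tuple counts and the last autovacuum time per table, which makes this kind of lag easy to spot. A minimal check (the "dbname=mydb" DSN is a placeholder for your own connection settings):

    # Requires psycopg2; the DSN below is a placeholder for your environment.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
            FROM pg_stat_user_tables
            ORDER BY n_dead_tup DESC
            LIMIT 10
        """)
        for rel, live, dead, last in cur.fetchall():
            print(f"{rel}: {dead} dead vs {live} live tuples, "
                  f"last autovacuum {last}")

If n_dead_tup keeps climbing between runs while last_autovacuum barely advances, autovacuum is losing the race.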
This should solve one of the problems Uber had with PostgreSQL, reported by Christophe Pettus back in 2017 [1].

[1] https://thebuild.com/presentations/uber-perconalive-2017.pdf

Previous discussion, from 2018: https://news.ycombinator.com/item?id=16526623
I wonder how this approach will compare with LSM-tree-style storage, which has gained popularity recently (mostly via RocksDB). LSM trees are also write-optimized and allow for some optimizations by batching up updates and merging them in bulk.
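For anyone unfamiliar, the core LSM idea is small, even if real engines like RocksDB add WALs, bloom filters, and leveled compaction on top. A toy illustration (mine, nothing like RocksDB's actual implementation): absorb writes in a sorted in-memory buffer, flush it as an immutable sorted run, and have reads consult runs newest-to-oldest:

    class TinyLSM:
        def __init__(self, memtable_limit=4):
            self.memtable = {}
            self.memtable_limit = memtable_limit
            self.runs = []                  # immutable sorted runs, newest first

        def put(self, key, value):
            self.memtable[key] = value      # repeated updates batch in memory
            if len(self.memtable) >= self.memtable_limit:
                self.runs.insert(0, dict(sorted(self.memtable.items())))
                self.memtable = {}          # flush as an immutable sorted run

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run in self.runs:           # newest version wins
                if key in run:
                    return run[key]
            return None

    lsm = TinyLSM()
    for i in range(6):
        lsm.put(f"k{i}", i)                 # the fourth put triggers a flush
    lsm.put("k0", 99)                       # newer value shadows the flushed one
    print(lsm.get("k0"), lsm.get("k5"))     # -> 99 5

The contrast with zheap's design: LSM trees never update in place at all, trading read-path merging for purely sequential writes, while zheap tries to keep reads cheap by keeping the newest version in place and pushing history into undo.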