
Zheap – Reinvented PostgreSQL Storage

263 points by PhilipTrauner, over 4 years ago

11 comments

malisper, over 4 years ago

For those unaware, Zheap is a new storage engine for Postgres that handles updates and deletes in a different way. Currently, when you update a row in Postgres, Postgres creates a new copy of the row and marks the old one as deleted. This is done for several reasons; in particular, it makes handling concurrency easier and it makes rolling back transactions easier.

The issue with this approach is that over time it leads to lots of "dead rows" - deleted rows that are taking up space in your DB. Postgres has a background job called the "vacuum" which effectively garbage collects all deleted rows. Depending on your settings, the vacuum can be a pretty expensive job, and even with the vacuum you can still wind up using a lot more space than you actually need.

Zheap addresses these problems with a different approach. When performing an update, instead of marking the old row as deleted and inserting a new one, Postgres replaces the old row with the new one and writes a copy of the old row to a separate file. This means the main table file doesn't need to be larger than necessary just to keep track of dead rows.

Zheap does lead to lots of tricky scenarios, though. If for any reason you need to access the old copy of a row, you have to fetch it from the separate file. If the transaction that performed the update is rolled back, you need to replace the new version of the row with the old version. This sounds straightforward, but gets really tricky really fast. For example, say I have row X that takes up 1 kB. I replace it with row Y that takes up 500 bytes. I then write row Z after row Y that also takes up 500 bytes. If you want to roll back the original transaction, row X will no longer fit in its original spot, because row Z is now taking up part of the space it used to occupy.
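A toy sketch of the mechanics described above - in-place update plus an undo record, and the rollback that restores the old version. All names and structures here are invented for illustration; this is not how zheap's pages are actually laid out:

```python
# Toy model of in-place UPDATE with an undo log (invented structures,
# not zheap's real page format).
class Page:
    def __init__(self):
        self.rows = {}       # row_id -> bytes currently stored in the page
        self.undo_log = []   # (row_id, old_version) records, newest last

    def update(self, row_id, new_value):
        # Save the old version to the undo log, then overwrite in place.
        self.undo_log.append((row_id, self.rows[row_id]))
        self.rows[row_id] = new_value

    def rollback_last(self):
        # Restore the most recent old version. In a real page this is the
        # hard part: the old, larger version may no longer fit if a later
        # row now occupies part of its original slot.
        row_id, old_value = self.undo_log.pop()
        self.rows[row_id] = old_value

page = Page()
page.rows["X"] = b"x" * 1024    # row X takes 1 kB
page.update("X", b"y" * 500)    # replaced in place by a 500-byte row Y
page.rollback_last()            # the 1 kB version comes back from the undo log
assert page.rows["X"] == b"x" * 1024
```

A Python dict hides the space-accounting problem, of course - the point is only to show where the undo record is written and when it is read back.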
jacques_chester, over 4 years ago

Confusingly, there is another new PostgreSQL storage effort called "Zedstore": https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w%40mail.gmail.com

Disclosure: I work for VMware, who are sponsoring Zedstore development via Greenplum.
fanf2, over 4 years ago

There is a super informative blog post from last week at https://www.cybertec-postgresql.com/en/zheap-reinvented-postgresql-storage/ - discussed at https://news.ycombinator.com/item?id=24710765
throwdbaaway, over 4 years ago

> What happens if the new row does not fit in? In this case performance will be worse because zheap essentially has to perform a DELETE / INSERT operation which is of course not as efficient as an in-place UPDATE.

I would be wary of this. InnoDB, for example, also has an optimistic (in-place) UPDATE mode and a pessimistic UPDATE mode. Repeatedly updating the same row under the pessimistic mode would end up stalling the database, even at a rather low QPS.

https://bugs.mysql.com/bug.php?id=53825 was originally reported by Facebook 10 years ago, and is still not fixed in 8.0.21 with VATS / CATS scheduling.

And then there is also the performance gap between the ideal case, where the undo log is still in memory, and the pathological case, where the undo log needs to be fetched from disk.

The last thing Postgres needs is something that looks good on paper / in benchmarks but has a bunch of issues in production.
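The fast/slow split quoted above can be sketched as a dispatch on whether the new version fits in the old slot. This is entirely invented for illustration - neither zheap nor InnoDB is structured like this - but it shows why workloads that keep growing the same row keep hitting the slow path:

```python
# Invented sketch of the two UPDATE paths: in-place when the new version
# fits the old slot, DELETE + INSERT otherwise.
def update_row(page, slot, new_value, undo_log):
    old_value = page[slot]
    undo_log.append((slot, old_value))         # old version goes to undo
    if len(new_value) <= len(old_value):
        # Fast path: overwrite in place (pad so the slot size is stable).
        page[slot] = new_value.ljust(len(old_value), b"\x00")
        return "in-place"
    # Slow path: the old slot is too small; delete and re-insert elsewhere.
    page[slot] = b""
    page.append(new_value)
    return "delete+insert"

page = [b"x" * 100]
undo = []
assert update_row(page, 0, b"y" * 50, undo) == "in-place"
assert update_row(page, 0, b"z" * 200, undo) == "delete+insert"
```

Once a row only ever grows, every update after the first oversized one takes the slow path - which is the stalling pattern the InnoDB bug report describes.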
sega_sai, over 4 years ago

It is nice that PG got another storage engine. I hope it will ease the implementation of column-based storage in official PG, as row-based storage is a major problem with very large datasets.
ksec, over 4 years ago

When can we expect Zheap to land? 14?

With the new pluggable storage API, are there any other new storage engines besides Zheap?
macdice, over 4 years ago

Great to see Antonin and others working on this! I did some work on some pieces of this (along with many others), and I gave a talk about an earlier prototype of the undo log machinery and how it fits in at PGCon 2019 (which now seems like a whole lifetime ago!). Slides and recording here, in case the background is interesting:

https://speakerdeck.com/macdice/transactions-in-postgresql-and-other-animals
teej, over 4 years ago

This seems neat!

Last year I helped a friend diagnose an issue they had with Postgres. A database table with a scheduled full DELETE/INSERT had slowed to the point of failure. It turned out that having slightly less IO than needed led the auto-VACUUM process to fall further and further behind each time it ran.

My friend simply provisioned more IO and moved on. Another option would be to rewrite the process to naturally produce fewer dead rows. It would be great to have a third feasible option.
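The failure mode in that story can be modeled in a few lines: each cycle the full DELETE/INSERT creates a table's worth of dead rows, while the vacuum can only reclaim a fixed number of rows per cycle (a stand-in for the IO budget). The numbers are invented purely to show the trend:

```python
# Toy model of auto-VACUUM falling behind a scheduled full DELETE/INSERT.
def simulate(cycles, table_size, vacuum_rate):
    dead = 0
    for _ in range(cycles):
        dead += table_size               # DELETE marks every row dead
        dead -= min(dead, vacuum_rate)   # vacuum reclaims what IO allows
    return dead

# Vacuum keeps up when it can reclaim a full table's worth per cycle...
assert simulate(10, 1000, 1000) == 0
# ...but with slightly less IO, bloat grows by the shortfall every cycle.
assert simulate(10, 1000, 800) == 10 * 200
```

The key property is that the backlog compounds: the deficit is per cycle, so the table never recovers on its own once the rate falls short.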
yayzheap, over 4 years ago

This should solve one of the problems Uber had with PostgreSQL, reported by Christophe Pettus back in 2017 [1].

[1] https://thebuild.com/presentations/uber-perconalive-2017.pdf

Previous discussion from 2018: https://news.ycombinator.com/item?id=16526623
AdamProut, over 4 years ago

I wonder how this approach will compare with the LSM-tree-style indexes that have gained popularity recently (mostly via RocksDB). LSM trees are also write-optimized and allow for some optimizations by batching up updates and merging them in bulk.
ronlobo, over 4 years ago

Look no further: after playing around with TimescaleDB, there are no updates anymore, and we can put this problem into a vacuum :)