An unexpected find that freed 20GB of unused index space in PostgreSQL

375 点作者 haki超过 4 年前

19 条评论

mjw1007超过 4 年前

Summary: if you have an index on a column which is mostly NULL, consider using a partial index covering only the records where it's non-NULL.

评论 #25989980 未加载

评论 #25995445 未加载

mulander超过 4 年前

Partial indexes are amazing but you have to keep in mind some pecularities.If your query doesn't contain a proper match with the WHERE clause of the index - the index will not be used. It is easy to forget about it or to get it wrong in subtle ways. Here is an example from work.There was an event tracing structure which contained the event severity_id. Id values 0-6 inclusive are user facing events. Severity 7 and up is debug events. In practice all debug events were 7 and there were no other values above 7. This table had a partial index with WHERE severity_id < 7. I tracked down a performance regression, when an ORM (due to programmer error) generated WHERE severity_id != 7. The database is obviously not able to tell that there will never be any values above 7 so the index was not used slowing down event handling. Turning the query to match < 7 fixes the problem. The database might also not be able to infer that the index can be indeed used, for example when prepared statements are involved WHERE severity_id < ?. The database will not be able to tell that all bindings of ? will satisfy < 7 so will not use the index (unless you are running PG 12, then that might depend on the setting of plan_cache_mode[1] but I have not tested that yet).Another thing is that HOT updates in PostgreSQL can't be performed if the updated field is indexed but that also includes being part of a WHERE clause in a partial index. So you could have a site like HN and think that it would be nice to index stories WHERE vote > 100 to quickly find more popular stories. That index however would nullify the possiblity of a hot update when the vote tally would be updated. Again, not a problem but you need to know the possible drawbacks.That said, they are great when used for the right purpose. Kudos to the author for a nice article![1] - <a href="https://postgresqlco.nf/doc/en/param/plan_cache_mode/" rel="nofollow">https://postgresqlco.nf/doc/en/param/plan_cache_mode/</a>

评论 #25991825 未加载

评论 #26008530 未加载

boomer918超过 4 年前

Partial indexes can flip query plans if the covered part becomes so small that it won't be represented when sampled by the stats collector. The planner could then decide that the index scan isn't worth it and could try an alternative less efficient index if one exists.

评论 #25990240 未加载

deathanatos超过 4 年前

> Coming from Oracle, I was always taught that NULLs are not indexedThat left me wondering how, if all indexes are by default partial in Oracle… how does one make an unpartial? nonpartial? index.<a href="https://use-the-index-luke.com/sql/where-clause/null/index" rel="nofollow">https://use-the-index-luke.com/sql/where-clause/null/index</a>Apparently, you add a computed column to the index that just computes a constant value. And single non-null column then causes the nulls in other columns to get indexed, it's only if the whole tuple is composed of nulls that it gets left out.That also seems like a bug waiting to happen; someone inverts a query to find unset (NULL) entries, and now you're doing a table scan.…but it seems also like a form of brain rot, induced by a particular implementation, e.g., similar to how I've had MySQL users ask how to make a key on a table. Where a "key" is an index, it's just that MySQL by default uses the word "key" to mean index, instead of … key¹. (The query language even supports "INDEX" in place of "KEY", but things like "SHOW TABLE" default to the "wrong" (linguistically, not programmatically) word.) And then you might have to de-tangle why these two are different concepts, how they're different. It's very Arrival, in the sense of language (mis-)shaping perception.¹a key is a set of columns that are sufficient to identify a row. The primary such set of columns is … the primary key. An index can index a key (if more than one exists within a table), but it doesn't have to.

pottertheotter超过 4 年前

This has nothing to do with the content, but the design of this page really stuck out to me. It's very easy to read and doesn't have fluff. But it still feels modern (in the good way). It's perfectly balanced.

评论 #25993167 未加载

de6u99er超过 4 年前

When I did my Oracle DBA training 15 years ago, I learnt about database reorgs.It means basically exporting your database (or tables) and importing it again. What happens is that deleted data which doesn't necessarily free up space (Oracle reuses the freed up space sometimes) doesn't get exported.<a href="https://www.iri.com/blog/vldb-operations/database-reorgs-why-they-matter/" rel="nofollow">https://www.iri.com/blog/vldb-operations/database-reorgs-why...</a><a href="https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:1336002638218" rel="nofollow">https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUEST...</a>

评论 #25989672 未加载

评论 #25990457 未加载

brianberns超过 4 年前

> REINDEX INDEX CONCURRENTLY index_name;> If for some reason you had to stop the rebuild in the middle, the new index will not be dropped. Instead, it will be left in an invalid state and consume space.Well, that sure sounds like a bug in PostreSQL to me.

评论 #25992734 未加载

评论 #25990548 未加载

ivoras超过 4 年前

Is the partial index technique to avoid indexed NULL data as effective for PostgreSQL 13+?It looks like in v13+ PostgreSQL could create a single leaf for NULL data and just store row pointers within it, which should reduce data sizes at least a bit.

评论 #25989467 未加载

评论 #25989440 未加载

matsemann超过 4 年前

> Clear bloat in tablesOhh, we've had issues with this. We have this table that's mostly ephemeral data, so rows are constantly inserted and then deleted after a certain amount of time. Due to a bug the deletion didn't work for a while and the db grew very large. Fixed the deletion, but no amount of vacuuming actually allows us to fully reclaim that space so we don't have to pay for it.At the same time the extra cost is probably negligible compared to spending more energy fixing it..

评论 #25992806 未加载

评论 #25997596 未加载

nieve超过 4 年前

The article includes a couple of useful queries unrelated to the "find" and led me to these useful bloat-detection resources <a href="https://wiki.postgresql.org/wiki/Show_database_bloat" rel="nofollow">https://wiki.postgresql.org/wiki/Show_database_bloat</a> <a href="https://github.com/ioguix/pgsql-bloat-estimation" rel="nofollow">https://github.com/ioguix/pgsql-bloat-estimation</a>

lucian1900超过 4 年前

Partial indexes can be useful in any case where one value has much higher cardinality than others.Indexing boolean columns is often only useful if one of the values is uncommon and the index is partial to only include those uncommon rows.

评论 #25989307 未加载

MrStonedOne超过 4 年前

>There are several ways to rebuild a table and reduce bloat:>Re-create the table: Using this method as described above often requires a lot of development, especially if the table is actively being used as it's being rebuilt.>Vacuum the table: PostgreSQL provides a way to reclaim space occupied by dead tuples in a table using the VACUUM FULL command. Vacuum full requires a lock on the table, and is not an ideal solution for tables that need to be available while being vacuumed:This is confusing to me, i thought postgre was suppose to be better then mysql, yet mysql has a non-locking command to recreate a table. it has like 3 that would fit here, AND deal with the indexes in one command.

malinens超过 4 年前

Too bad MySQL does not have partial indexes.We have one huge table I want to add some indexes for specific cases (for max 1% of records) but server will not have enough memory for it if I add those indexes for all records :/

评论 #25989992 未加载

评论 #25989631 未加载

评论 #25989436 未加载

gangstead超过 4 年前

The included query for finding which indexes in your database could benefit from a partial index is amazing. Thanks for putting the extra effort into this post.

alexfromapex超过 4 年前

It seems like this is an optimization that Postgres should handle internally, doesn't it?

tantalor超过 4 年前

Graphing "free storage" is meaningless and confusing; it should be "used storage".Available storage depends on usage and capacity.Edit: I meant for this article; of course I believe it is useful to track this in practice.

评论 #25990276 未加载

评论 #25990441 未加载

评论 #25990221 未加载

pierrebai超过 4 年前

The chart seems to show an uptick of 2GB, not 20GB. Am I missing something?

评论 #25995632 未加载

bombcar超过 4 年前

Are there things you can do to check a MySQL/mariadb instance? I see a command called mysqlcheck I may have to investigate.

gyrgtyn超过 4 年前

wow 20GB