This is really neat. I like how the compression can be applied to specific columns rather than whole rows - or whole pages. I could imagine storing JSON from an HTTP request here and using Generated Columns[1] to extract the bits my application is interested in.<p>I also like how simple and powerful the custom partitioning is.<p>[1]: <a href="https://www.sqlite.org/gencol.html" rel="nofollow">https://www.sqlite.org/gencol.html</a>
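<p>For example, a minimal sketch of the generated-column half of that idea (assuming a SQLite build of 3.31+ for generated columns, with the JSON functions available; the table and column names are made up):
<pre><code>
import sqlite3

# Store the raw JSON payload once; a VIRTUAL generated column exposes just the
# field the application cares about, computed on read via json_extract().
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE api_responses (
        id     INTEGER PRIMARY KEY,
        body   TEXT,  -- raw JSON from the HTTP request
        status TEXT GENERATED ALWAYS AS (json_extract(body, '$.status')) VIRTUAL
    )
""")
con.execute("INSERT INTO api_responses (body) VALUES (?)",
            ('{"status": "ok", "items": [1, 2, 3]}',))
print(con.execute("SELECT status FROM api_responses").fetchone())  # ('ok',)
</code></pre>
<p>(Whether and how that body column then plays with the extension's column-level compression is something I haven't checked.)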
The image near the top of the readme implies that compressing the whole file means giving up random access, which is indeed true. But it also implies that there's nothing between whole-file compression and row/column-level compression, and <i>that</i> is not true: you could compress individual pages, kinda like ZFS compresses blocks, keep random access, and still get pretty close to whole-file compression ratios.
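<p>A rough sketch of that middle ground, just to make it concrete (plain zlib, arbitrary page size, not how ZFS or any real pager actually does it):
<pre><code>
import zlib

PAGE_SIZE = 4096  # arbitrary page size for illustration

def compress_paged(data):
    """Compress each fixed-size page independently and build an offset index."""
    blobs = [zlib.compress(data[i:i + PAGE_SIZE])
             for i in range(0, len(data), PAGE_SIZE)]
    index, pos = [], 0
    for b in blobs:
        index.append((pos, len(b)))   # (offset, length) of each compressed page
        pos += len(b)
    return b"".join(blobs), index

def read_page(blob, index, page_no):
    """Random access: decompress a single page without touching the others."""
    start, length = index[page_no]
    return zlib.decompress(blob[start:start + length])

data = b"hello sqlite " * 10_000
blob, index = compress_paged(data)
assert read_page(blob, index, 3) == data[3 * PAGE_SIZE:4 * PAGE_SIZE]
</code></pre>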
One option not listed (under "Other options for compressing data in a database"), but probably closest to what this solution does under the hood, is row-interdependent compression at the application level.<p>That is, while saving rows to the database, pick out every (say) 10th row and compress it by itself, but keep the compression state in memory. Then compress each of the next 9 rows against that dictionary and store a reference to the row it's based on. (For example, in Python's zlib module, you'd use the zdict parameter [1] of compress and decompress.) A bit like keyframes in video compression.<p>You can potentially get better results this way than with a generic solution like the article's, because you might have application knowledge about which rows are likely to be similar to others. But it obviously requires application-level code (e.g. fetching one logical row requires fetching two physical rows so you have enough information to decode), and it would be a nightmare if you ever need to remove or update rows.<p>[1] <a href="https://docs.python.org/3/library/zlib.html#zlib.compressobj" rel="nofollow">https://docs.python.org/3/library/zlib.html#zlib.compressobj</a>
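<p>A minimal sketch of the keyframe idea with zlib's zdict (made-up rows, and ignoring the bookkeeping of where each keyframe lives):
<pre><code>
import zlib

# "Keyframe" row: compressed on its own; its raw bytes double as the shared
# dictionary for the rows that follow it.
keyframe = b'{"user": "alice", "path": "/home",  "status": 200}'
next_row = b'{"user": "alice", "path": "/about", "status": 200}'

keyframe_blob = zlib.compress(keyframe)

# Delta row: compressed against the keyframe's raw bytes via zdict.
c = zlib.compressobj(zdict=keyframe)
delta_blob = c.compress(next_row) + c.flush()

# Decoding a delta row means fetching and decompressing its keyframe first,
# then decompressing with the same dictionary.
d = zlib.decompressobj(zdict=zlib.decompress(keyframe_blob))
assert d.decompress(delta_blob) + d.flush() == next_row
</code></pre>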
Oh, interesting. I scrape GitHub's APIs to generate user and repository statistics, then compress the result with ZStandard so the databases fit within GitHub's file size restrictions.[1] So this is very relevant to me.<p>It's pleasantly surprising how much data you can fit within a ZStandard-compressed SQLite database. The two technologies fit very well together.<p>[1]: <a href="https://github.com/andrewmcwattersandco/github-statistics" rel="nofollow">https://github.com/andrewmcwattersandco/github-statistics</a>
I don't want to diss the effort, but I have an honest question: wouldn't this be more appropriately done at the filesystem level?
This mentions something I've been wondering about:<p>>Depending on the data, this can reduce the size of the database by 80% while keeping performance mostly the same (or even improving it, since the data to be read from disk is smaller).<p>"Even improving it" is a bold claim!<p>Modern CPUs are often hungry for data, aren't they? While being incredibly fast and mostly idle?<p>Is reading a smaller compressed file and decompressing it in the CPU's hyperfast cache before use faster than reading it uncompressed?<p>Maybe you could speed up an entire server, even one with plenty of free space and no need for compression, by adding full-disk compression anyway!<p>Unless the CPUs are at 100% utilization, this could always be used. Since the CPU will be consuming the data anyway, it might be able to decompress it faster than the bus can deliver it at file-read speeds.<p>A chart I found, which only goes up to 2000, suggests the performance gap between cache access and even RAM bandwidth has kept widening, to say nothing of SSDs.[1]<p>It is from here.[2]<p>(However, compressing a full disk can only help if there are spare CPU cycles.<p>For activity spikes when the CPU doesn't have spare cycles or cache to spare, both compressed and uncompressed versions could be kept, at most doubling disk usage, so that the uncompressed file could be read instead.<p>That would ensure the CPU isn't burdened with a decompression task, since this only helps if the fastest caches are used for it, and they might have better things to do than decompress files.<p>But whenever the CPU has idle cores and plenty of cache, which is most of the time, it could request the compressed version; when it's busy, it could ask for the original version and save the decompression work.)<p>Can this help solve the problem of waiting for the disk?<p>[1] <a href="http://www.extremetech.com/wp-content/uploads/2014/08/CPU-DRAM.png" rel="nofollow">http://www.extremetech.com/wp-content/uploads/2014/08/CPU-DR...</a><p>[2] <a href="https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips" rel="nofollow">https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu...</a>
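<p>For what it's worth, the back-of-the-envelope math does favor compression once decompression is faster than the disk, at least with the numbers I'm assuming here (purely illustrative: a ~500 MB/s SSD, ~1500 MB/s single-core decompression, 80% size reduction):
<pre><code>
# All numbers are assumptions for illustration, not measurements.
disk_mb_s       = 500    # assumed sequential read throughput
decompress_mb_s = 1500   # assumed single-core decompression throughput
data_mb         = 100    # logical (uncompressed) amount of data to read
ratio           = 0.2    # assumed compressed/uncompressed size (80% reduction)

uncompressed_read_s = data_mb / disk_mb_s            # 0.200 s
compressed_read_s   = data_mb * ratio / disk_mb_s    # 0.040 s
decompress_s        = data_mb / decompress_mb_s      # ~0.067 s

# Even if the read and the decompression don't overlap at all, the compressed
# path wins; pipelining them would widen the gap.
print(uncompressed_read_s, compressed_read_s + decompress_s)  # 0.2 vs ~0.107
</code></pre>
<p>Whether that holds on a loaded box with a fast NVMe drive is exactly the kind of thing you'd have to benchmark.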
Let's say I have a web analytics log database, with a "user-agent" column and a "url" column that mostly contain the same values (99.9% of rows have one of 100 typical values).<p>Would compression work well here?<p>i.e. from 50 bytes on average for these columns... down to a few bytes only?
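<p>That's pretty much the ideal case for dictionary-based compression, though how close you get to "a few bytes" depends on per-value framing overhead. Here's the general idea illustrated with zlib's preset dictionary rather than zstd's trained dictionaries (made-up user-agent strings):
<pre><code>
import zlib

# Hypothetical set of "typical" user-agent values; the shared dictionary is
# simply their concatenation.
typical = [
    b"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
    b"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
    b"Mozilla/5.0 (X11; Linux x86_64) Firefox/121.0",
]
shared_dict = b"".join(typical)

value = typical[0]  # a ~50-byte column value that matches a typical entry

# Compressed on its own, a short string barely shrinks (or even grows).
alone = zlib.compress(value)

# Compressed against the shared dictionary, it collapses to a back-reference
# plus a fixed amount of per-value overhead.
c = zlib.compressobj(zdict=shared_dict)
with_dict = c.compress(value) + c.flush()

print(len(value), len(alone), len(with_dict))
</code></pre>
<p>There's still fixed per-value overhead (around 10 bytes of header/checksum in zlib's case), so "a few bytes" is optimistic, but a large reduction from 50 bytes looks plausible.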
SQLite allows you to override the low-level paging routines; I wonder if it wouldn't work to just plug in a different pager (though the compression might not be that efficient unless the pages are big, e.g. 64 KB).
I'm using SQLite on ZFS and Btrfs with the transparent compression those file systems support. Query speed is very good and the compression ratio is also very good (a database just under 3 GB uncompressed comes out at about 22% of that size after compression).
I am happy for the author, but I want to warn the credulous that this is a very strange way to approach compression with SQLite. The much more common way is to use a SQLite VFS:<p><a href="https://www.sqlite.org/vfs.html" rel="nofollow">https://www.sqlite.org/vfs.html</a><p>A brief search for "SQLite compressed VFS" will turn up many extensions, some of them quite mature.<p>I get the impression that the author doesn't know about the VFS layer, as it's not even mentioned in the readme. The clickbait title similarly seems amateurish. Caveat emptor.