This is really neat. I like how the compression can be applied to specific columns rather than whole rows - or whole pages. I could imagine storing JSON from an HTTP request here and using Generated Columns[1] to extract the bits my application is interested in.<p>I also like how simple and powerful the custom partitioning is.<p>[1]: <a href="https://www.sqlite.org/gencol.html" rel="nofollow">https://www.sqlite.org/gencol.html</a>
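<p>For example, a minimal sketch of the generated-column half of that idea (assuming a SQLite build of 3.31+ for generated columns, with the JSON functions available; the table and column names are made up):
<pre><code>
import sqlite3

# Store the raw JSON payload once; a VIRTUAL generated column exposes just the
# field the application cares about, computed on read via json_extract().
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE api_responses (
        id     INTEGER PRIMARY KEY,
        body   TEXT,  -- raw JSON from the HTTP request
        status TEXT GENERATED ALWAYS AS (json_extract(body, '$.status')) VIRTUAL
    )
""")
con.execute("INSERT INTO api_responses (body) VALUES (?)",
            ('{"status": "ok", "items": [1, 2, 3]}',))
print(con.execute("SELECT status FROM api_responses").fetchone())  # ('ok',)
</code></pre>
<p>(Whether and how that body column then plays with the extension's column-level compression is something I haven't checked.)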
The image near the top of the readme implies that compressing the whole file means giving up random access, which is indeed true. But it also implies that there's nothing between whole-file compression and row/column-level compression, and <i>that</i> is not true: you could compress individual pages, kinda like ZFS compresses blocks, keep random access, and still get pretty close to whole-file compression ratios.
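<p>A rough sketch of that middle ground, just to make it concrete (plain zlib, arbitrary page size, not how ZFS or any real pager actually does it):
<pre><code>
import zlib

PAGE_SIZE = 4096  # arbitrary page size for illustration

def compress_paged(data):
    """Compress each fixed-size page independently and build an offset index."""
    blobs = [zlib.compress(data[i:i + PAGE_SIZE])
             for i in range(0, len(data), PAGE_SIZE)]
    index, pos = [], 0
    for b in blobs:
        index.append((pos, len(b)))   # (offset, length) of each compressed page
        pos += len(b)
    return b"".join(blobs), index

def read_page(blob, index, page_no):
    """Random access: decompress a single page without touching the others."""
    start, length = index[page_no]
    return zlib.decompress(blob[start:start + length])

data = b"hello sqlite " * 10_000
blob, index = compress_paged(data)
assert read_page(blob, index, 3) == data[3 * PAGE_SIZE:4 * PAGE_SIZE]
</code></pre>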
One option not listed (under "Other options for compressing data in a database"), but probably closest to what this solution does under the hood, is row-interdependent compression at the application level.<p>That is, while saving rows to the database, pick out every (say) 10th row and compress it by itself, but keep the compression state in memory. Then compress each of the next 9 rows against that dictionary and store a reference to the row it's based on. (For example, in Python's zlib module, you'd use the zdict parameter [1] of compress and decompress.) A bit like keyframes in video compression.<p>You can potentially get better results this way than with a generic solution like the article's, because you might have application knowledge about which rows are likely to be similar to others. But it obviously requires application-level code (e.g. fetching one logical row requires fetching two physical rows so you have enough information to decode), and it would be a nightmare if you ever need to remove or update rows.<p>[1] <a href="https://docs.python.org/3/library/zlib.html#zlib.compressobj" rel="nofollow">https://docs.python.org/3/library/zlib.html#zlib.compressobj</a>
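<p>A minimal sketch of the keyframe idea with zlib's zdict (made-up rows, and ignoring the bookkeeping of where each keyframe lives):
<pre><code>
import zlib

# "Keyframe" row: compressed on its own; its raw bytes double as the shared
# dictionary for the rows that follow it.
keyframe = b'{"user": "alice", "path": "/home",  "status": 200}'
next_row = b'{"user": "alice", "path": "/about", "status": 200}'

keyframe_blob = zlib.compress(keyframe)

# Delta row: compressed against the keyframe's raw bytes via zdict.
c = zlib.compressobj(zdict=keyframe)
delta_blob = c.compress(next_row) + c.flush()

# Decoding a delta row means fetching and decompressing its keyframe first,
# then decompressing with the same dictionary.
d = zlib.decompressobj(zdict=zlib.decompress(keyframe_blob))
assert d.decompress(delta_blob) + d.flush() == next_row
</code></pre>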
Oh, interesting. I scrape GitHub's APIs to generate user and repository statistics, then compress the result with ZStandard so the databases fit within GitHub's file size restrictions.[1] So this is very relevant to me.<p>It's pleasantly surprising how much data you can fit within a ZStandard-compressed SQLite database. The two technologies fit very well together.<p>[1]: <a href="https://github.com/andrewmcwattersandco/github-statistics" rel="nofollow">https://github.com/andrewmcwattersandco/github-statistics</a>
I don't want to diss the effort, but I have an honest question: wouldn't this be more appropriately done at the filesystem level?
This mentions something I've been wondering about:<p>>Depending on the data, this can reduce the size of the database by 80% while keeping performance mostly the same (or even improving it, since the data to be read from disk is smaller).<p>"Even improving it" is a bold claim!<p>Modern CPUs are often hungry for data, aren't they? While being incredibly fast and mostly idle?<p>Is reading a smaller compressed file and decompressing it in the CPU's hyperfast cache before use faster than reading it uncompressed?<p>Maybe you could speed up an entire server, even one with plenty of free space and no need for compression, by adding full-disk compression anyway!<p>Unless the CPUs are at 100% utilization, this could always be used. Since the CPU will be consuming the data anyway, it might be able to decompress it faster than the bus can deliver it at file-read speeds.<p>A chart I found, which only goes up to 2000, suggests the performance gap between cache access and even RAM bandwidth has kept widening, to say nothing of SSDs.[1]<p>It is from here.[2]<p>(However, compressing a full disk can only help if there are spare CPU cycles.<p>For activity spikes when the CPU doesn't have spare cycles or cache to spare, both compressed and uncompressed versions could be kept, at most doubling disk usage, so that the uncompressed file could be read instead.<p>That would ensure the CPU isn't burdened with a decompression task, since this only helps if the fastest caches are used for it, and they might have better things to do than decompress files.<p>But whenever the CPU has idle cores and plenty of cache, which is most of the time, it could request the compressed version; when it's busy, it could ask for the original version and save the decompression work.)<p>Can this help solve the problem of waiting for the disk?<p>[1] <a href="http://www.extremetech.com/wp-content/uploads/2014/08/CPU-DRAM.png" rel="nofollow">http://www.extremetech.com/wp-content/uploads/2014/08/CPU-DR...</a><p>[2] <a href="https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips" rel="nofollow">https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu...</a>
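<p>For what it's worth, the back-of-the-envelope math does favor compression once decompression is faster than the disk, at least with the numbers I'm assuming here (purely illustrative: a ~500 MB/s SSD, ~1500 MB/s single-core decompression, 80% size reduction):
<pre><code>
# All numbers are assumptions for illustration, not measurements.
disk_mb_s       = 500    # assumed sequential read throughput
decompress_mb_s = 1500   # assumed single-core decompression throughput
data_mb         = 100    # logical (uncompressed) amount of data to read
ratio           = 0.2    # assumed compressed/uncompressed size (80% reduction)

uncompressed_read_s = data_mb / disk_mb_s            # 0.200 s
compressed_read_s   = data_mb * ratio / disk_mb_s    # 0.040 s
decompress_s        = data_mb / decompress_mb_s      # ~0.067 s

# Even if the read and the decompression don't overlap at all, the compressed
# path wins; pipelining them would widen the gap.
print(uncompressed_read_s, compressed_read_s + decompress_s)  # 0.2 vs ~0.107
</code></pre>
<p>Whether that holds on a loaded box with a fast NVMe drive is exactly the kind of thing you'd have to benchmark.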
Let's say I have a web analytics log database, with a "user-agent" column and a "url" column that mostly contain the same values (99.9% of rows have one of 100 typical values).<p>Would compression work well here?<p>i.e. from 50 bytes on average for these columns... down to a few bytes only?
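<p>That's pretty much the ideal case for dictionary-based compression, though how close you get to "a few bytes" depends on per-value framing overhead. Here's the general idea illustrated with zlib's preset dictionary rather than zstd's trained dictionaries (made-up user-agent strings):
<pre><code>
import zlib

# Hypothetical set of "typical" user-agent values; the shared dictionary is
# simply their concatenation.
typical = [
    b"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
    b"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
    b"Mozilla/5.0 (X11; Linux x86_64) Firefox/121.0",
]
shared_dict = b"".join(typical)

value = typical[0]  # a ~50-byte column value that matches a typical entry

# Compressed on its own, a short string barely shrinks (or even grows).
alone = zlib.compress(value)

# Compressed against the shared dictionary, it collapses to a back-reference
# plus a fixed amount of per-value overhead.
c = zlib.compressobj(zdict=shared_dict)
with_dict = c.compress(value) + c.flush()

print(len(value), len(alone), len(with_dict))
</code></pre>
<p>There's still fixed per-value overhead (around 10 bytes of header/checksum in zlib's case), so "a few bytes" is optimistic, but a large reduction from 50 bytes looks plausible.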
SQLite allows you to override the low-level paging routines; I wonder if it wouldn't work to just plug in a different pager (though the compression might not be that efficient unless the pages are big, e.g. 64 KB).
I'm using SQLite on ZFS and Btrfs with the transparent compression those file systems support. Query speed is very good and the compression ratio is also very good (a database just under 3 GB uncompressed comes out at about 22% of that size after compression).
I am happy for the author, but I want to warn the credulous that this is a very strange way to approach compression with SQLite. The much more common way is to use a SQLite VFS:<p><a href="https://www.sqlite.org/vfs.html" rel="nofollow">https://www.sqlite.org/vfs.html</a><p>A brief search for "SQLite compressed VFS" will turn up many extensions, some of them quite mature.<p>I get the impression that the author doesn't know about the VFS layer, as it's not even mentioned in the readme. The clickbait title similarly seems amateurish. Caveat emptor.