This simple idea has helped me a lot in past projects.

The performance gain from columnar storage comes largely from the compression ratio, and ordering similar attributes together greatly reduces the entropy between adjacent rows, which in turn leads to higher compression and better performance.

The trick is to pick the right column to sort all the rows on.

At my previous company we used Redshift to store about a billion rows, and the simple change of sorting the table by user_id cut the whole table size by 50%, that is, half a TB of disk storage. The improvement was nothing short of jaw-dropping. I think Google here just turns this trick into a more systematic method, which is really neat.

The takeaway: in a columnar storage system, take row ordering into account. Try an ordering that you think will maximize the redundancy between neighboring rows; usually a primary id that is most representative of the underlying data works best. You don't need a fancy system like this one to leverage this powerful idea, it applies to all columnar systems.
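For anyone who wants to see the effect on a small scale, here's a minimal sketch (synthetic data, with plain zlib standing in for a real columnar engine's encodings, so the numbers are only illustrative) of how sorting rows by user_id shrinks a correlated column:

    # Toy demonstration of the ordering trick: compress a column of values
    # as-is versus after sorting all rows by user_id. Synthetic data and
    # zlib are stand-ins for a real columnar engine's encodings.
    import random
    import zlib

    random.seed(0)

    # 1M rows of (user_id, country). Each user's rows share one country,
    # so sorting by user_id groups identical country values together.
    countries = ["US", "DE", "JP", "IN", "BR"]
    user_country = {uid: random.choice(countries) for uid in range(10_000)}
    rows = [(uid, user_country[uid]) for uid in
            (random.randrange(10_000) for _ in range(1_000_000))]

    def column_size(rows, col_index):
        """Serialize one column and return its zlib-compressed size."""
        col = "\n".join(str(r[col_index]) for r in rows).encode()
        return len(zlib.compress(col))

    unsorted_size = column_size(rows, 1)
    sorted_rows = sorted(rows)              # sort all rows by user_id
    sorted_size = column_size(sorted_rows, 1)

    print(f"country column, unsorted: {unsorted_size:,} bytes")
    print(f"country column, sorted by user_id: {sorted_size:,} bytes")

The sorted version compresses far better simply because identical values end up in long runs, which is the same effect Redshift's sort keys (and Capacitor's reordering) exploit at scale.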
> doesn’t end here. BigQuery has background processes that constantly look at all the stored data and check if it can be optimized even further. Perhaps initially data was loaded in small chunks, and without seeing all the data, some decisions were not globally optimal. Or perhaps some parameters of the system have changed, and there are new opportunities for storage restructuring. Or perhaps, Capacitor models got more trained and tuned, and it is possible to enhance existing data. Whatever the case might be, when the system detects an opportunity to improve storage, it kickstarts data conversion tasks. These tasks do not compete with queries for resources, they run completely in parallel, and don’t degrade query performance. Once the new, optimized storage is complete, it atomically replaces old storage data — without interfering with running queries. Old data will be garbage-collected later.

I wonder if they could share more details on how this is handled.
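My guess, based only on the usual pattern in such systems and not on any BigQuery internals, is rewrite-in-the-background followed by an atomic flip of a pointer/manifest, with old files garbage-collected once no running query still references them. A rough sketch of that shape (all names here are hypothetical):

    # Hypothetical sketch of the swap-then-GC pattern the post hints at;
    # not BigQuery's implementation, just the common shape of such systems.
    import os
    import tempfile

    MANIFEST = "table.manifest"   # hypothetical pointer to the current storage file

    def rewrite_storage(old_path: str) -> str:
        """Background task: produce a better-encoded copy of the data."""
        new_path = old_path + ".optimized"
        with open(old_path, "rb") as src, open(new_path, "wb") as dst:
            dst.write(src.read())  # a real system would re-encode, not copy
        return new_path

    def publish(new_path: str) -> None:
        """Atomically repoint the manifest; in-flight readers keep their old handle."""
        fd, tmp = tempfile.mkstemp(dir=".")
        with os.fdopen(fd, "w") as f:
            f.write(new_path)
        os.replace(tmp, MANIFEST)  # atomic rename: readers see old or new, never a mix

    # Old storage files would be deleted later by a separate garbage collector,
    # once no in-flight query still holds a reference to them.

The interesting details are presumably in how they track which queries still reference old storage and how they schedule these rewrites so they never compete with query capacity.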
Slightly off-topic, but it's great to see that Mosha of MDX and SQL Server Analysis Services fame (the data warehouse product from MS) is now part of Google's BigQuery. His thorough blog posts, full of technical details, are a blessing even after so many years.
I'm surprised they didn't give a justification as to why they couldn't just adopt Parquet [0].

[0] https://parquet.apache.org/