Capacitor, BigQuery’s next-generation columnar storage format

133 points by fhoffa about 9 years ago

4 comments

eva1984 about 9 years ago
This simple idea has helped me a lot in past projects.

The performance gain from columnar storage comes from the compression ratio. Ordering rows so that similar attribute values sit together greatly reduces the entropy between adjacent rows, which in turn leads to higher compression and better performance.

The trick is to choose wisely which column the rows are sorted on.

At my previous company we used Redshift to store 1 billion rows, and the simple change of sorting the table by user_id reduced the whole table size by 50%, which is half a TB of disk storage. The improvement was nothing short of jaw-dropping. I think Google here has just turned this trick into a more systematic method, which is really neat.

The takeaway: in a columnar storage system, take ordering into account. Try an ordering that you feel could maximize the redundancy between rows; usually that is the primary id that is most representative of the underlying data. You don't need a fancy system like this one to leverage this powerful idea; it applies to all columnar systems.
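
The effect described above can be illustrated with a small experiment. This is only a sketch under stated assumptions: the table, the `user_id` and `country` columns, and the row counts are hypothetical, and `zlib` stands in for whatever codecs a real columnar engine uses. It is not how Redshift or Capacitor actually encode data.

```python
import random
import zlib

COUNTRIES = ["US", "DE", "JP", "BR", "IN"]  # hypothetical attribute values


def make_rows(n_rows=50_000, n_users=500, seed=0):
    """Build a toy (user_id, country) table in random arrival order."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        user_id = rng.randrange(n_users)
        # country depends only on user_id, so equal ids share a country
        rows.append((user_id, COUNTRIES[user_id % len(COUNTRIES)]))
    return rows


def compressed_column_size(values):
    """Serialize one column as newline-separated text and zlib-compress it."""
    raw = "\n".join(str(v) for v in values).encode("utf-8")
    return len(zlib.compress(raw, 6))


def table_size(rows):
    """Sum of per-column compressed sizes, i.e. a toy columnar layout."""
    return sum(compressed_column_size(col) for col in zip(*rows))


rows = make_rows()
unsorted_size = table_size(rows)
sorted_size = table_size(sorted(rows, key=lambda r: r[0]))
print(f"unsorted: {unsorted_size} bytes, sorted by user_id: {sorted_size} bytes")
```

Sorting by `user_id` turns each column into long runs of identical values, which a general-purpose compressor handles far better than the same values in random order. How large the gain is on real data depends on how strongly the other columns correlate with the sort key.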
kgp7 about 9 years ago
> doesn’t end here. BigQuery has background processes that constantly look at all the stored data and check if it can be optimized even further. Perhaps initially data was loaded in small chunks, and without seeing all the data, some decisions were not globally optimal. Or perhaps some parameters of the system have changed, and there are new opportunities for storage restructuring. Or perhaps, Capacitor models got more trained and tuned, and it is possible to enhance existing data. Whatever the case might be, when the system detects an opportunity to improve storage, it kickstarts data conversion tasks. These tasks do not compete with queries for resources, they run completely in parallel, and don’t degrade query performance. Once the new, optimized storage is complete, it atomically replaces old storage data without interfering with running queries. Old data will be garbage-collected later.

I wonder if they could share more details on how this is handled.
polskibus about 9 years ago
Slightly off-topic, but it's great to see that Mosha of MDX and SQL Server Analysis Services fame (data warehouse from MS) is now part of Google's BigQuery. His thorough blog posts full of technical details are a blessing even after so many years.
michaelmior about 9 years ago
I'm surprised they didn't give a justification as to why they couldn't just adopt Parquet[0].

[0] https://parquet.apache.org/