TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

PostgreSQL Columnar Store for Analytic Workloads

190 点作者 tadruj大约 11 年前

15 条评论

chrisfarms大约 11 年前
I&#x27;ve never really digged into column-oriented storage, so had a quick skim... Would the below excerpts&#x2F;example be a fair note of pros&#x2F;cons of the general idea?<p>&gt; Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.<p>Example: SELECT sum(a) FROM things;<p>&gt; Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.<p>Example: UPDATE things SET a = a+1;<p>&gt; Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.<p>Example: SELECT * FROM things;<p>&gt; Row-oriented organizations are more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.<p>Example: INSERT INTO things (a,b,c,d,e,f,g) VALUES (1,2,3,4,5,6,7);
评论 #7524948 未加载
评论 #7524868 未加载
评论 #7530707 未加载
rusanu大约 11 年前
Good work, and I&#x27;m impressed by the PostgreSQL foreign data wrapped API power to allow for such ease of implementation.<p>The .proto defined for cstore_fwd differs from the ORC as defined for Hive. At a quick glance I can&#x27;t find references to dictionary encoding nor statistics, and the datatypes used are apparently the native PostgreSQL data types. From what I can tell this implementation leverages reduced IO (fetch only columns of interest from disk), segment elimination (use min&#x2F;max info to skip over entire row groups) and pgzl compression for the data stream. I couldn&#x27;t find references to run-length encoding (RLE) or dictionary encoding. I&#x27;m sure the shortcoming will be improved in future iterations, specially better encoding schemes which would result in better compression.<p>But I&#x27;m a bit disappointed that the ORC format used is not the same as the one originally used in Hive, Pig and the rest of Java&#x2F;Hadoop ecosystem. Had it shared the actual file format it would had enabled a number of very interesting scenarios. Think about Hive&#x2F;PIG serving as ETL pipelines to produce ORC file that are attached directly to PostgreSQL (add today&#x27;s partition as FOREIGN DATA WRAPPER ... OPTIONS (filename &#x27;...&#x2F;4_4_2014.cstore&#x27;) and then do interactive analytic driven by PostgreSQL. It would reduce the cost of ingress significantly (fast attach, no need for expensive INSERT and re-compression). I realize data type translation would had been thorny, to say the least (datetime and decimal as primitives, probably many of the structured types too).<p>- github.com&#x2F;citusdata&#x2F;cstore_fdw&#x2F;blob&#x2F;master&#x2F;cstore.proto - github.com&#x2F;apache&#x2F;hive&#x2F;blob&#x2F;trunk&#x2F;ql&#x2F;src&#x2F;protobuf&#x2F;org&#x2F;apache&#x2F;hadoop&#x2F;hive&#x2F;ql&#x2F;io&#x2F;orc&#x2F;orc_proto.proto
tkyjonathan大约 11 年前
I think the emphasis should be on the fact that Postgres now has a free and open source columnar store. For many years (probably decades) there have been companies that have developed &quot;big data&quot;&#x2F;analytics system with Postgres as the base, but have not contributed back to the Postgres ecosystem.<p>While this new columnar store is not as speedy as ones that have been around since the early 2000s, it does give a platform for CitusData and other companies to build on and share solutions.<p>Having a DB that can hold both the transactional data and data for fast analytical purposes is very advantageous as you have less moving parts and much less ETL work.<p>What I am looking forward to now is a few start ups similar to CitusData that solve different &quot;Big Data&quot; problems and for them to work together to disrupt the multi-billion dollar datawarehouse&#x2F;analytics vendors.
lucian1900大约 11 年前
This is potentially very interesting, it&#x27;s the only open source SQL columnar store I&#x27;m aware of. Bonus points for being part of Postgres.<p>Basically ParAccel (like used by Amazon RedShift) at reasonable cost.
评论 #7524800 未加载
评论 #7530186 未加载
评论 #7524360 未加载
评论 #7526673 未加载
评论 #7529377 未加载
评论 #7524542 未加载
monstrado大约 11 年前
Any particular reason why ORC was chosen as the columnar store format over Parquet (<a href="https://github.com/Parquet/parquet-format" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;Parquet&#x2F;parquet-format</a>)? Reason I ask is because Parquet seems to have its own development cycle, roadmap, and is pretty continuously updated with enhancements.
mason55大约 11 年前
I&#x27;d be interested to see more benchmarks. The improvements in this post are not anywhere close to what we&#x27;ve seen going from PG to Infobright for our reporting queries - we get speedups from 10x - 1000x, the one speed benchmark they have here is only 2x.
评论 #7524591 未加载
评论 #7529700 未加载
评论 #7524886 未加载
alecco大约 11 年前
How is this different than the many other columnar SQL databases and extensions?<p>Columnar querying is typical for OLAP, PostgreSQL engine is aimed at OLTP. This doesn&#x27;t look like a good idea. Like adding side floats to a car and paddles to use it like a boat.<p>This goes against using the right tool for the right job.
评论 #7525236 未加载
评论 #7524910 未加载
noelherrick大约 11 年前
This is really exciting! Columnar storage is something that the big boys like Microsoft and Oracle charge an arm and a leg for. You can currently get column-based SQL open-source databases, but this new FDW allows you to mix workloads on the same machine.
mimighost大约 11 年前
I am curious about how this thing compares to something like Amazon Redshift.<p>Briefly skimming, it looks pretty similar, except for the data compression part, which uses RCFile. It also supports more data types. If this being adapted by redshift or something else, I will be thrilled.
capkutay大约 11 年前
So with Cassandra you have a pretty nice, scalable, columnar DB with a SQL interface[0]. Not to mention, it&#x27;s free and apache licensed so you can distribute it as part of your own software. I guess I&#x27;ve only looked at cassandra from the scope of a developer. Would a DBA prefer using a columnar version of PostgreSQL rather than using cassandra for free?<p>0:<a href="http://cassandra.apache.org/doc/cql/CQL.html" rel="nofollow">http:&#x2F;&#x2F;cassandra.apache.org&#x2F;doc&#x2F;cql&#x2F;CQL.html</a><p>Edit: I didn&#x27;t realize Citus Data was making the columnar postgres offering open source...that&#x27;s great!
评论 #7525421 未加载
评论 #7530010 未加载
评论 #7526756 未加载
评论 #7525415 未加载
评论 #7535926 未加载
hfmuehleisen大约 11 年前
I have written a short blog post on how CitusDB&#x27;s column store compares with purpose-built systems such as MonetDB. Repeatability ftw. Disclosure: I am part of the MonetDB development team.<p>CitusDB vs. MonetDB TPC-H Shootout <a href="http://www.monetdb.org/content/citusdb-postgresql-column-store-vs-monetdb-tpc-h-shootout" rel="nofollow">http:&#x2F;&#x2F;www.monetdb.org&#x2F;content&#x2F;citusdb-postgresql-column-sto...</a>
rpedela大约 11 年前
I am curious, why are foreign tables necessary?
评论 #7524551 未加载
评论 #7524506 未加载
hans_castorp大约 11 年前
This sounds very interesting. Are there any pre-built Windows binaries available for this extension?
joevandyk大约 11 年前
You&#x27;d have to mangage backups of the storage file separately than backups of postgresql, right?
yazun大约 11 年前
Any chance it will support INSERT some day?