Transforming Postgres into a Fast OLAP Database

245 点作者 philippemnoel超过 1 年前

23 条评论

HermitX超过 1 年前

Thanks for sharing, I really like Postgres. However, I generally use Postgres for OLTP work. I would like to point out two things:1. Based on Clickbench results, pg_analytics is still far from top-tier performance. If you're looking for a high-performance OLAP database, you should consider the top-ranked products.2. The queries tested by Clickbench are really simple, far from real data analysis scenarios. You must be aware of the TPC-DS and TPC-H benchmarks, because ClickHouse simply cannot run these test suites, so their Clickbench does not include these datasets.Lastly, I want to say, if your enterprise is of a certain size, separating OLTP and OLAP into two databases is the right choice, because you will have two different teams responsible for two different tasks. By the way, I use StarRocks for OLAP work.

评论 #39292710 未加载

评论 #39292368 未加载

评论 #39293924 未加载

评论 #39292865 未加载

DrDroop超过 1 年前

Interesting stuff keeps happening in PostgresLand. This uses Apache DataFusion, first think i hear about it, but from reading the docs it seems like it already offers build in SQL interface atop of Apache parquet and Apache Arrow. What is the difference between this two versions of SQL. Can I still use my favourite pl/sql function together with a deltalake table and does this play nice with other kind of tables?

评论 #39293153 未加载

pgaddict超过 1 年前

Looks very nice. I have three random questions:1) How does this deal with backups? Presumably the deltalake tables can't be backed up by Postgres itself, so I guess there's some special way to d backups?2) Similarly for replication (physical or logical). Presumably that's not supported, right? I guess logical replication is more useful for OLTP databases from which the data flow to datalakes, so that's fine. But what's the HA story without physical replication?3) Presumably all the benefits are from compression at the storage level? Or are there some tweaks to the executor to do columnar stuff? I looked at the hooks in pg_analytics, but I see only stuff to handle DML. But I don't speeak rust, so maybe I missed something.

评论 #39295520 未加载

yunyu超过 1 年前

Congrats on launching – we just spent quite a bit of time replicating/transforming our primary database into clickhouse for OLAP use cases, and it would have been way easier if there were a postgres-native solution. Hoping the managed hosting providers catch on

评论 #39296691 未加载

zX41ZdbW超过 1 年前

There are many Postgres forks for OLAP workloads, such as:- Greenplum;- Citus with cstore_fdw;- IMCS (Konstantin Knizhnik);- Hydra;- AlloyDB;For example, Greenplum is one of the earliest, fairly mature, but abandoned.

评论 #39292521 未加载

ggregoire超过 1 年前

Slightly related but it's so interesting to me all those new small tech companies that basically provides value on top of Postgres! They usually blog about some obscure stuff about Postgres and one of their articles gets on the frontpage of HN. Good marketing and very cool stuff!

caust1c超过 1 年前

How do you find working with Arrow / Parquet?Personally I found it very hard to reason about and thought that Clickhouse's strategy for managing columnar data much more reasonable.

评论 #39294085 未加载

评论 #39294167 未加载

pletnes超过 1 年前

Is this deltalake the same as Databricks deltalake? Is it compatible?

评论 #39292772 未加载

Xcelerate超过 1 年前

This is pretty cool. It seems like there are a lot of hard database problems that still need to be solved, but finding the right database system (or extension) seems about as difficult as finding a research article and implementing the solution yourself lately. This seems like a step in the right direction by integrating with Postgres, which is widely used and supported.Case in point regarding OLAP in particular, I am currently trying to solve a problem where I have a high number of categorical dimensions, and I want to perform a “count distinct” over any combination of dimensions, grouped by any other combination of dimensions, filtered by specific values in each dimension. E.g., count(distinct a.col1, b.col2), count(distinct a.col1), count(distinct b.col3) from table a join table b using (id) group by a.col4, b.col7.Sounds obscure when I word it that way, but this is actually a pretty “generic” problem that appears whenever you want to filter and count the number of distinct property combinations that occur within a fact dataset of transactions or events that has been joined with other dimensional datasets. A naive implementation is exorbitantly expensive (and impractical) if you have to join many large tables before grouping and performing count distinct.However, this specific problem manifests in various equivalent forms mathematically: model counting of boolean expressions, low rank factorization of sparse high dimensional boolean tensors (each row in your transaction dataset corresponds to a value of “true” in a sparse tensor with dimensions indexed by the values of your columns), minimal hypergraph covering set, etc.Is there a database already out there that’s optimized for this fairly common business problem? Maybe...? I searched for a while but couldn’t easily separate the startup database hype from the actual capabilities of a particular offering. Plus, even if the ideal “hypergraph counting database” exists, it’s not like my company is just going to replace its standard cloud SQL platform that serves as the backbone of our entire product with a niche and fragile experimental database with questionable long-term support. It’s much easier to just translate one of the latest tensor factoring research papers into a Python script, plop that into the data processing pipeline, and output the simple factored form of the transactions dataset into a new table that can be easily queried in the standard way.

评论 #39295373 未加载

pberton超过 1 年前

Nice benchmark with amazing results! Congrats!

tomkaos超过 1 年前

That look like what Microsoft has done with Microsoft Fabric. SQL Server going full parquet file for table storage.

评论 #39294428 未加载

ZeroCool2u超过 1 年前

If I combined ParadeDB with a common postgres extension like PostGIS, would I get the benefits of both?

评论 #39293056 未加载

datadrivenangel超过 1 年前

Interesting and smart choice to use Apache DataFusion as their query engine.

评论 #39292494 未加载

评论 #39292438 未加载

ComputerGuru超过 1 年前

If you're looking for pg_analytics (the product in TFA) in the charts and can't find it, it's because the chart calls it ParadeDB (after the company, instead of the product).

评论 #39292460 未加载

cha42超过 1 年前

I would say that the biggest missing features is not being able to mix normal and delta tables.This would allow easier reporting within a very active db without too much bother.

评论 #39350595 未加载

dbmikus超过 1 年前

Congrats on the pg_analytics launch! Great to see ParadeDB consolidating some of these tools. Useful at the early stage to not manage multiple databases.

brand超过 1 年前

How does this handle sort ordering / collations?

评论 #39292920 未加载

DanielLapidus超过 1 年前

Nice, congrats! How does this compare to Hydra in philosophy and use cases?

评论 #39295571 未加载

rywalker超过 1 年前

Excited to see ParadeDB make progress in this area :)

nwhnwh超过 1 年前

Can it be installed on managed solutions?

评论 #39350608 未加载

slt2021超过 1 年前

clickhouse is phenomenal

winrid超过 1 年前

(in Rust) :)

评论 #39292150 未加载

tomcam超过 1 年前

OLAP was my favorite character in Frozen