I would suggest looking at Delta Lake (<a href="https://delta.io/" rel="nofollow">https://delta.io/</a>) - it's built on top of Parquet, but has advantages over plain Parquet:<p>- transactions - you don't get garbage in your table if a write fails<p>- supports update/delete/merge operations (in some implementations)<p>- metadata allows faster discovery of data, for example when you have a lot of partitions on cloud storage<p>- metadata also supports features like data skipping, where you can filter out files that don't contain the necessary data<p>- time travel - you can go back to previous versions of the table
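To illustrate a couple of those features, a rough Spark SQL sketch (assuming a Spark session with the Delta Lake extension enabled; the table, column, and path names here are made up):<p><pre><code>-- Convert an existing Parquet directory into a Delta table in place
CONVERT TO DELTA parquet.`s3://bucket/events`;

-- Transactional upsert (update/merge support)
MERGE INTO events AS t
USING updates AS s
  ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: query an earlier version of the table
SELECT count(*) FROM events VERSION AS OF 12;
</code></pre>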
> "Generating Parquet"<p>It is also useful for moving data from Postgres to BigQuery! ( batch load )<p><a href="https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet" rel="nofollow">https://cloud.google.com/bigquery/docs/loading-data-cloud-st...</a><p>Thanks for the "ogr2ogr" trick! :-)<p>I hope the next blog post will be about GeoParquet and storing complex geometries in parquet format :-)<p><a href="https://github.com/opengeospatial/geoparquet" rel="nofollow">https://github.com/opengeospatial/geoparquet</a>
After reading a few comments here, here's the experiment I'm going to run this weekend to couple Postgres (AWS RDS) and DuckDB.<p>There's already an FDW for DuckDB [1]! But it's not available from within AWS RDS because they only allow a few extensions. So I would launch an EC2 instance, run a Postgres instance on it, and install the DuckDB FDW on that Postgres instance. Then, on the AWS RDS Postgres instance, I would use the postgres_fdw extension to connect to the Postgres instance running on EC2 and indirectly gain access to its DuckDB FDW. I'm not sure if this chained FDW strategy will work out, but here's why I think it's a great idea:<p>1. Any data that I am willing to put into DuckDB is guaranteed to have backups and/or an ETL already set up to sync it. Therefore I don't need this EC2 setup to be "reliable". If it goes down, I'll just rerun the ETL.<p>2. I pay a small DevOps cost for some performance gain, but the analytics complexity remains unchanged because at the end of the day it's just Postgres. I get to continue to use dbt against the original AWS RDS.<p>The only thing I need to see is whether there are any performance gains to be had with this setup. I won't take that for granted until it's tested.<p>[1] <a href="https://github.com/alitrack/duckdb_fdw" rel="nofollow">https://github.com/alitrack/duckdb_fdw</a>
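Roughly what I have in mind, as SQL (untested; duckdb_fdw is modeled on sqlite_fdw, so I'm assuming a 'database' server option and that IMPORT FOREIGN SCHEMA works the same way; hosts, schemas, and credentials are placeholders):<p><pre><code>-- On the EC2 Postgres instance: expose a DuckDB database via duckdb_fdw
CREATE EXTENSION duckdb_fdw;
CREATE SERVER duckdb_srv FOREIGN DATA WRAPPER duckdb_fdw
  OPTIONS (database '/data/analytics.duckdb');
CREATE SCHEMA duck;
IMPORT FOREIGN SCHEMA main FROM SERVER duckdb_srv INTO duck;

-- On the RDS Postgres instance: chain to the EC2 instance via postgres_fdw
CREATE EXTENSION postgres_fdw;
CREATE SERVER ec2_pg FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'ec2-host.internal', port '5432', dbname 'analytics');
CREATE USER MAPPING FOR CURRENT_USER SERVER ec2_pg
  OPTIONS (user 'analytics', password 'secret');
CREATE SCHEMA duck;
IMPORT FOREIGN SCHEMA duck FROM SERVER ec2_pg INTO duck;
</code></pre>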
Wouldn't it be nice if PostgreSQL could gain a new native storage engine that serializes columnar data as Parquet files? Does this even make sense?<p>DuckDB already has fine Parquet support and I find it really helpful.
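For example, in DuckDB SQL (file paths and columns here are made up), you can both query and produce Parquet directly:<p><pre><code>-- Query Parquet files in place, with projection and filter pushdown
SELECT user_id, count(*) AS events
FROM read_parquet('events/*.parquet')
WHERE created_at >= DATE '2022-01-01'
GROUP BY user_id;

-- Write query results back out as Parquet
COPY (SELECT * FROM read_parquet('events/*.parquet')
      WHERE created_at >= DATE '2022-01-01')
  TO 'recent_events.parquet' (FORMAT PARQUET);
</code></pre>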
For [Ex-]Googlers: I thought this sounded a bit like Capacitor, and indeed it is mentioned in this page:<p><a href="https://cloud.google.com/blog/products/bigquery/inside-capacitor-bigquerys-next-generation-columnar-storage-format" rel="nofollow">https://cloud.google.com/blog/products/bigquery/inside-capac...</a><p>(I don't think there's a direct analogue to Postgres though? Maybe people sometimes take data out of Spanner and put it in a Capacitor file instead?)
This is a great post demonstrating the wide variety of things you can do with Postgres foreign data wrappers.<p>But I've worked on analytics systems built on this, and it should be clear that with this kind of system you're trading actual interactive-level performance for ease of operation and ease of integration.<p>At some point, if you want interactive-level performance on this historic data (which is very much possible), you'll have to start putting it in ClickHouse or Elasticsearch, or some system on top of Postgres with a better sharding scheme.<p>If you wanted better performance without giving up the Postgres frontend, it would (to me) make more sense for folks to invest in building FDWs for Elasticsearch or ClickHouse rather than for Parquet or CSV. But that's just a theory.
While using a data lake for this case is a good idea, I'm not sure Parquet is a particularly good storage format for data like event logs or audit logs. That kind of data is usually written and read a whole record at a time, so it doesn't take full advantage of a column-oriented format.
Another similar option is cstore_fdw [0] -- it's now part of Citus but can still be used standalone as a foreign data wrapper. We use it at my startup to do OLAP on Postgres. It has some advantages over parquet_fdw:<p>* Supports writes (actually generating Parquet files was also difficult in my testing: I used odbc2parquet [1], but thanks for the ogr2ogr tip!), so you can write directly to the foreign table by running INSERT INTO ... SELECT FROM ...<p>* Supports all PG datatypes (including types from extensions, like PostGIS)<p>* Performance was basically comparable in my limited testing (faster for a single-row SELECT with cstore_fdw in our case since we do partition pruning, same for a full-table scan-and-aggregation).<p>Re: performance overhead, with FDWs we have to re-munge the data into PostgreSQL's internal row-oriented TupleSlot format. Postgres also doesn't run aggregations that can take advantage of the columnar format (e.g. CPU vectorization). Citus had some experimental code to get that working [2], but that was before FDWs supported aggregation pushdown. Nowadays it might be possible to have an FDW that hooks into the GROUP BY execution and runs a faster version of the aggregation that's optimized for columnar storage. We have a blog post series [3] about how we added agg pushdown support to Multicorn -- similar idea.<p>There's also DuckDB, which obliterates both of these options when it comes to performance. In my (again limited, not very scientific) benchmarking on a customer's 3M row table [4] (278MB in cstore_fdw, 140MB in Parquet), I see a 10-20x (1/2s -> 0.1/0.2s) speedup on some basic aggregation queries when querying a Parquet file with DuckDB as opposed to using cstore_fdw/parquet_fdw.<p>I think the dream is being able to use DuckDB from within an FDW as an OLAP query engine for PostgreSQL. duckdb_fdw [5] exists, but it basically took sqlite_fdw and connected it to DuckDB's SQLite interface, which means that a lot of operations get lost in translation and aren't pushed down to DuckDB, so it's not much better than plain parquet_fdw. I had a complex query in the PG dialect generated with dbt that used joins, CTEs and window functions. I don't remember the exact timings, but it was even slower on duckdb_fdw than with cstore_fdw, whereas I could take the same query and run it on DuckDB verbatim, only replacing the foreign table name with the Parquet filename.<p>This comment is already getting too long, but FDWs can indeed participate in partitions! There's this blog post [6] that I keep meaning to implement, where the setup is: a "coordinator" PG instance has a partitioned table, where each partition is a postgres_fdw foreign table that proxies to a "data" PG instance. The "coordinator" node doesn't store any data and only gathers execution results from the "data" nodes (rough sketch after the links below).
In the article, the "data" nodes store plain old PG tables, but I don't think there's anything preventing them from being parquet_fdw/cstore_fdw tables instead.<p>[0] <a href="https://github.com/citusdata/cstore_fdw" rel="nofollow">https://github.com/citusdata/cstore_fdw</a><p>[1] <a href="https://github.com/pacman82/odbc2parquet" rel="nofollow">https://github.com/pacman82/odbc2parquet</a><p>[2] <a href="https://github.com/citusdata/postgres_vectorization_test" rel="nofollow">https://github.com/citusdata/postgres_vectorization_test</a><p>[3] <a href="https://www.splitgraph.com/blog/postgresql-fdw-aggregation-pushdown-multicorn-part-1" rel="nofollow">https://www.splitgraph.com/blog/postgresql-fdw-aggregation-p...</a><p>[4] <a href="https://www.splitgraph.com/trase/supply-chains" rel="nofollow">https://www.splitgraph.com/trase/supply-chains</a><p>[5] <a href="https://github.com/alitrack/duckdb_fdw" rel="nofollow">https://github.com/alitrack/duckdb_fdw</a><p>[6] <a href="https://swarm64.com/post/scaling-elastic-postgres-cluster/" rel="nofollow">https://swarm64.com/post/scaling-elastic-postgres-cluster/</a>
If you were willing to give up a bit of archival quality, I wonder whether you could write Apache Arrow straight to disk and then just mmap the columns back in at query time.