
Buckets of Parquet Files Are Awful

20 points by memset, about 1 year ago

15 comments

kermatt, about 1 year ago
Everything in this article implies a lack of understanding of the more common use cases for Parquet: data sets that exceed typical single-node RAM sizes. The article would make more sense titled "Large Buckets of Parquet Files Are Awful on a Laptop".

Don't read Parquet into memory in one pass. Use an engine that can scan it, e.g. Spark.

PostgreSQL and ClickHouse have their own storage engines, and bulk importing _large_ volumes of data, no matter the format (Parquet, CSV, JSON), will present challenges.

DuckDB is meant to be a single-node engine. If your data size exceeds RAM, it's going to be a problem.

If your data is not partitioned appropriately for its size, that is a problem with the people who created it.
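A minimal sketch of that "scan, don't load" pattern, assuming pyarrow, configured S3 credentials, and a hypothetical Hive-partitioned s3://my-bucket/events/ layout (bucket, columns, and partition key are illustrative):

```python
import pyarrow.dataset as ds

# Hypothetical bucket of Parquet files, Hive-partitioned by year.
dataset = ds.dataset("s3://my-bucket/events/", format="parquet", partitioning="hive")

# Stream record batches instead of materializing the whole dataset in RAM;
# only the projected columns and the matching partitions/row groups are read.
total_rows = 0
for batch in dataset.to_batches(columns=["user_id", "amount"],
                                filter=ds.field("year") == 2024):
    total_rows += batch.num_rows

print(total_rows)
```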
dekhn, about 1 year ago
Buckets of shard files are awesome. Many massive pipelines doing important work use that model. It's simple and easy to program around. Inspecting data and sampling it is straightforward. Even record files (with no index) can be easily seeked and sampled. They are extremely efficient by reducing the number of inodes. Tools like LexicographicRangeSharding can be used to make the bucket object sizes more balanced.

Use sampling to make the data size more reasonable for doing local work. But every tool I've used to deal with large data has handled it the way I expect: either loading whole files into RAM, mapping whole files into RAM, or using fixed buffers. Typically when my work exceeds RAM, I just move to a machine with bigger RAM but still work with large shard files.
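One way to do that sampling step locally, sketched with DuckDB and assuming the shards are Parquet files under a hypothetical local glob:

```python
import duckdb

# Pull a small random sample out of a large set of shard files so the
# local workflow fits comfortably in laptop RAM.
con = duckdb.connect()
sample = con.sql("""
    SELECT *
    FROM read_parquet('shards/*.parquet')   -- hypothetical shard files
    USING SAMPLE 1 PERCENT (bernoulli)
""").arrow()

print(sample.num_rows)
```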
wenc, about 1 year ago
I would say, on the contrary, please stop dumping CSV data just because that's all you know how to deal with.

CSV is full of gotchas, which is why CSV readers have so many switches.

They are huge (and hugely inefficient at scale), have no type safety, don't support predicate pushdown, and are only meant to be read linearly.

Most of the OP's complaints are about loading and memory, not about actually using the data. The OP is more concerned about infra and memory usage than about the end goal: using the data well.

I'm not saying Parquet is the be-all and end-all of data formats, but it's a darned sight better than CSVs for almost all analytics use cases.

People who use CSV because it's readable and understandable either have non-analytics use cases or just don't want to learn anything new.
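To make the pushdown point concrete, a small sketch with DuckDB (file globs and columns are hypothetical): the Parquet query can prune columns and skip row groups using the file's own metadata, while the CSV query has to parse every byte and re-infer types.

```python
import duckdb

con = duckdb.connect()

# Parquet: column pruning + min/max row-group statistics mean only the
# needed columns and matching row groups are scanned; types come from the file.
parquet_result = con.sql("""
    SELECT user_id, SUM(amount) AS total
    FROM read_parquet('events/*.parquet')
    WHERE event_date >= DATE '2024-01-01'
    GROUP BY user_id
""")

# CSV: the whole file is parsed linearly and every column's type is guessed.
csv_result = con.sql("""
    SELECT user_id, SUM(amount) AS total
    FROM read_csv_auto('events/*.csv')
    WHERE event_date >= DATE '2024-01-01'
    GROUP BY user_id
""")
```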
fifilura, about 1 year ago
OK, I guess I am repeating myself, but so does this post, since I assume it spawned from this discussion: https://news.ycombinator.com/item?id=40284225

Why does no one mention Trino, or the AWS-branded Trino, Athena?

This will solve basically all of the problems listed. In the case of Athena, provided you already have your data in S3, it does so without having to set up a new service and with very little work.

Using Athena is not like setting up a new database. You just point Athena to the files and it will use resources provided by AWS to query them. No import required, no service required.

I think Athena is an amazing tool. I can borrow something like 120 CPUs (my tasks often parallelize to that) for free for 5 minutes or even an hour to join / group / modify all my data.
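A sketch of that workflow with the AWS SDK for pandas (awswrangler); the Glue database, table, columns, and bucket path are illustrative, and AWS credentials are assumed to be configured:

```python
import awswrangler as wr

# Register the existing S3 Parquet files as an external table; nothing is imported.
wr.catalog.create_parquet_table(
    database="analytics",                 # hypothetical Glue database
    table="events",
    path="s3://my-bucket/events/",
    columns_types={"user_id": "bigint", "amount": "double", "event_date": "date"},
)

# Athena scans the files in place with AWS-provided compute and returns a DataFrame.
df = wr.athena.read_sql_query(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id",
    database="analytics",
)
```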
ein0p, about 1 year ago
I just built an NVMe cache when a customer of mine had the problem of reading Parquet from S3. Basically, the first hit goes through RAM, but it also writes the file to NVMe. The next hit just checks the ETag and, if it hasn't changed, mmaps the file directly from the cached version for faster access. Eviction is typical LRU, but with leases. When the disk cache is full (and sufficient storage can't be released because files are in use) it only uses DRAM. When DRAM is full, which is very unlikely, the cache blocks and waits until stuff gets released.

I'd think twice before abandoning a more or less standard, open format for some brand-new shiny thing.
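A stripped-down sketch of that read path using boto3 and a hypothetical cache directory; the LRU eviction, leases, and DRAM fallback described above are omitted:

```python
import mmap
import os

import boto3  # assumes AWS credentials are configured

s3 = boto3.client("s3")
CACHE_DIR = "/nvme/cache"  # hypothetical NVMe mount


def read_cached(bucket: str, key: str) -> bytes:
    """Serve an S3 object from local NVMe when its ETag still matches."""
    etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
    path = os.path.join(CACHE_DIR, f"{key.replace('/', '_')}.{etag}")
    if not os.path.exists(path):
        # Cache miss (or the object changed, so the ETag-suffixed name is new):
        # download once, then every later read is local.
        os.makedirs(CACHE_DIR, exist_ok=True)
        s3.download_file(bucket, key, path)
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return bytes(mm)
```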
tppiotrowski, about 1 year ago
I've recently played with the Overture Maps datasets, which are in .parquet, and I agree that there is a learning curve.

One thing that's nice about .parquet is that typical Macs come with 128/256 GB SSDs, and you can query 300 GB of Parquet in S3 without needing all those files on your local hard drive. Some of these queries are also surprisingly fast, but I haven't looked into how it works.

Edit: Having used MySQL and Postgres, I actually think DuckDB is better than either.
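The trick that makes this possible is the Parquet footer, which describes where each column chunk lives, so an engine can fetch only the byte ranges a query needs. A hedged sketch with DuckDB; the S3 path is a placeholder, not the real Overture release path:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # enables reading s3:// paths over HTTP

# Only the footer, the filtered row groups, and the projected columns are
# fetched via ranged GETs; nothing is downloaded to the local SSD first.
con.sql("""
    SELECT count(*)
    FROM read_parquet('s3://example-bucket/overture/*.parquet')
""")
```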
twobitshifter, about 1 year ago
I don't think buckets of CSV would be better; in fact I know it would be worse, because you lose data types and the in-memory representation.
paulsutter, about 1 year ago
Postgres CAN read from Parquet using parquet_fdw, and then you can import directly using "create table x as select" from the Parquet foreign tables.

> Postgres can't read from S3, load from Parquet, or query parquet files.
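Roughly what that looks like, sketched with psycopg2 against a Postgres that already has parquet_fdw installed; the database name, columns, and local file path are illustrative:

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics")  # hypothetical database
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS parquet_fdw")
cur.execute("CREATE SERVER IF NOT EXISTS parquet_srv FOREIGN DATA WRAPPER parquet_fdw")
cur.execute("""
    CREATE FOREIGN TABLE events_pq (user_id bigint, amount double precision)
    SERVER parquet_srv
    OPTIONS (filename '/data/events.parquet')
""")

# Import straight from the foreign table into a regular heap table.
cur.execute("CREATE TABLE events AS SELECT * FROM events_pq")
conn.commit()
```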
Galanwe, about 1 year ago
A big part of my job is to ingest data from hundreds of providers to process.

A provider that will push data to my S3 bucket, as Parquet files, is the best I can think of.

I don't want a provider-managed database to host, stupidly huge zipped CSV dumps, or whatever weird technology the provider can think of.

Seriously, reading a file from S3 and stream-processing the Parquet is not the peak of engineering. What is there to complain about?

Some of the insanity I had to deal with:

- 350 TB of zipped CSVs whose columns and delimiter vary across time, shipped on HDD by mail
- Self-hosting an MSSQL server for the provider to update
- Good old FTP polling at more or less predictable times
- Fancy you-name-it-shiny-new-tech remote database accounts
- The crappy client HTTP website with a custom SQL-inspired query language to download zipped extracts
- The web 2.0 REST API, billed per request
- Snowflake data warehouse data lake big data

Please, don't be that guy, just push Parquet flat files to my bucket.
rckygry, about 1 year ago
The arguments presented by the author remind me of the saying "the incompetent dancer complained that the stage was crooked".
IshKebab, about 1 year ago
This article would be greatly improved by at least a single sentence saying what your proposed alternative is. Otherwise it sounds like you're saying "having to eat is really annoying, let's stop eating".
martinky24, about 1 year ago
It's quite clear that this blog post is in direct response to this comment: https://news.ycombinator.com/item?id=40299364

Nothing wrong with that, but it's good for folks to know the context.
ithkuil, about 1 year ago
Things become popular because they are good/decent/better than the alternatives.

Then popular things become the default choice, even when they don't necessarily make sense, because they are not solving the problems they were originally meant to solve.

Then people complain about it.
barryrandall, about 1 year ago
This is the one thing that Azure does better than AWS. With Azure Synapse, you can upload parquet files to a storage account (similar to an AWS bucket), create a view in a serverless SQL instance, and interact with it via any SQL Server client.
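A hedged sketch of that last step from Python via pyodbc; the workspace endpoint, storage account, and path are placeholders, and the ODBC Driver for SQL Server is assumed to be installed:

```python
import pyodbc

# Hypothetical Synapse serverless SQL endpoint.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=my-workspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

# Serverless SQL reads the Parquet files in place from the storage account.
rows = conn.execute("""
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://myaccount.dfs.core.windows.net/container/events/*.parquet',
        FORMAT = 'PARQUET'
    ) AS events
""").fetchall()
```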
ordx, about 1 year ago
They are not awful if you use the right tools. Welcome to the world of Spark, Apache Drill, Trino/Presto, etc.