
Parquet: An efficient, binary file format for table data

283 points by calpaterson, about 2 years ago

22 comments

julik, about 2 years ago
There is a point which one needs to be aware of with Parquet: if you are not stepping into it with tools well suited for it (that is: Go, C++, Java or Python codebase/runtime, mainstream platform...) you are going to have a bad time. For all its advantages, Parquet is _very_ complicated (IMO overcomplicated, on multiple fronts), and uses binary dependencies which you absolutely need to carry around (Thrift?..). Up to a certain data size the advantage of CSV is not that it is a better format (it is decidedly, absolutely worse/more ambiguous than Parquet) but the fact that you can write a valid CSV in a couple of lines of anything, and read in a couple of lines of anything. If you ask me, Parquet - for all its advantages - is a product of the Cambrian explosion of the Apache-big-data-projects. Very little in its design warrants (or requires) using things like Thrift, so I won't exclude that something less overcomplicated (and more welcoming, and requiring less dependencies) will appear on the scene at some point.
rf15, about 2 years ago
This is all very interesting (and makes me want to check out Parquet), but it's painful to see how the text goes to great lengths to describe the common problems Parquet avoids, yet doesn't spend a single word on how they actually solved those problems. What is the actual boolean type? Which encoding are you actually working in? There's nothing concrete, only an implication that it also compresses data.
clord, about 2 years ago
“Parquet has both a date type and the datetime type (both sensibly recorded as integers in UTC).”

What does it mean for a date to be UTC? My date, but in the UTC timezone? Usually when I write a date to a file, I want a naive date, since that's the domain of the data: 2020-12-12 sales: $500. Splitting that by some other timezone seems to introduce a mistake.

Often I want to think in local naive time too, especially for things like appointments that might change depending on DST or whatever. Converting to UTC involves some scary things. Timestamps are also useful, but sometimes I don't want to transcode my data, since the natural format is the most correct one.
poulpy123, about 2 years ago
One of the main interests of CSV is that it's human-readable and usable with common text tools (such as grep, cat, or any text editor), so I'm not sure the comparison should be between Parquet and CSV, but rather between Parquet and other formats like HDF5 and netCDF. What is the advantage of this new format compared to the older ones?
jatorre, about 2 years ago
And if you are interested in encoding geospatial data, there is a format called GeoParquet that is on its way to standardization: https://geoparquet.org/ Essentially it adds metadata in the extensible schema-metadata structure that Parquet supports. Think of WKB in a column. With more exciting stuff coming that way.
lysecret, about 2 years ago
Oh very interesting didn’t expect so much scepticism on parquet. We use it to store around 100 tb of data in a partitioned way on GCS and query it using BigQuery. Absolutely fantastic solution and incredibly cheap too.
antipaul, about 2 years ago
Can you `grep` it?

Text-based files like CSV can be `grep`-ed en masse, which I do often. E.g., to find some value across a ton of files.

Is that possible with Parquet?
crop_rotation, about 2 years ago
S3 Select supports Parquet, which means you can push queries down to individual S3 blobs. I think there is a possibility we might see someone build a Postgres foreign data wrapper which pushes the querying via S3 Select to the S3 blobs and uses Postgres only for the final aggregations. Someone on HN had done this with SQLite files in S3 queried by Lambdas using SQLite, but pushing it wholly to S3 would eliminate a lot of overhead.
nwatson, about 2 years ago
Parquet fits well into a system like Apache Arrow for good query performance, when considering partitioning (via directories), push-down predicates, and read-only-the-columns-you-need: https://arrow.apache.org/docs/python/dataset.html ("Tabular Datasets")
pietroppeter, about 2 years ago
I was surprised lately to realize there is not a standard way to import a Parquet file into MSSQL. Happy to be proven wrong.
domoritz, about 2 years ago
If you need a quick tool to convert your CSV files, you can use csv2parquet from https://github.com/domoritz/arrow-tools.
Helmut10001, about 2 years ago
I've used Parquet here to filter/map 400 million geotagged tweets [1]. The main advantage was that it automatically partitions data so it can be streamed and chunked during processing.

[1]: https://ad.vgiscience.org/twitter-global-preview/00_Twitter_...
crabbone, about 2 years ago
> On a real world 10 million row financial data table I just tested with pandas I found that Parquet is about 7.5 times quicker to read than csv, ~10 times quicker to write and a about a fifth of the size on disk. So way to think of Parquet is as "turbo csv" - like csv, just faster (and smaller).

Come on... who does tests like this? Why is anyone supposed to believe these numbers? What is even the point of making such claims w/o *any* evidence?

---

PS. Also, buried somewhere in the middle of the article is the admission that the format isn't possible to stream.

And, if we expand on the file structure, then it becomes apparent that it was inspired by typical relational database storage, except simplified, which made the format awful for insertion/deletion operations. Of course, if you compare to CSV, then there's no loss here, but there are other formats which can handle limited, or even unlimited, insertion/deletion with resource use similar to writes.

Even truncation in this format is complicated (compared to CSV).

It seems like the format was trying to be a middle ground between proper relational database storage and something like CSV, but I'm not sure such a middle ground is actually necessary...
stewbrew, about 2 years ago
Parquet is cool. What isn't so cool is that, e.g., partitioning isn't supported on all platforms out of the box. Hence, reading a set of files created in one language may require extra work in another language. Partitioning should be supported in all languages.
ayhanfuat, about 2 years ago
I wanted to dive into Parquet internals a couple of times to see how it works, but there seems to be no tool to see page-level data. The only option seems to be to use Arrow's C++ library. Is anybody aware of a higher-level option?
grandinteg3, about 2 years ago
Why not use Delta? https://www.vldb.org/pvldb/vol13/p3411-armbrust.pdf
kragen, about 2 years ago
i wonder why the words 'apache', 'twitter', 'cloudera', 'cutting', 'rcfile', 'orc', 'trevni', 'thrift', 'hadoop', 'julien', 'jco', 'tianshuo', and 'deng' don't appear in this page

i mean i understand not mentioning *all* of them but it's pretty disappointing to not mention *any* of them
jiggunjer, about 2 years ago
I'm interested in a good tool for converting CSV to Parquet. It seems many encoding features are not supported by most tools.
wodenokoto, about 2 years ago
I've still yet to figure out where Feather fits into this.
CoBE10, about 2 years ago
Why is the picture of the Sutjeska war memorial in the article?
jjgreen, about 2 years ago
    apt-cache search parquet
    <nada>
Madeindjs, about 2 years ago
Why not just use a simple database instead?