This article seems written by someone who has never had to work with diverse data pipelines.<p>I work with large volumes of data from many different sources. I'm lucky to get them to send CSV at all. Of course there are better formats, but all these sources can't agree on one.<p>CSV that's zipped is producible and readable by everyone. And that makes it more efficient.<p>I've been reading these "everyone is stupid, why don't they just do the simple, right thing and I don't understand the real reason for success" articles for so long that it just makes me think the author doesn't have a mentor or an editor with deep experience.<p>It's like arguing how much MP3 sucks and how we should all just use FLAC.<p>The author means well, I'm sure. Maybe his next article will be about how airlines should speak Esperanto because English is such a flawed language. That's a clever and unique observation.
"You give up human readable files, but what you gain in return is..."
Stop right there. You lose more than you gain.<p>Plus, taking the data out of [proprietary software app my client's data is in] in csv is usually easy. Taking the data out in Apache Parquet is...usually impossible, but if it is possible at all you'll need to write the code for it.<p>Loading the data into [proprietary software app my client wants data put into] using a csv is usually already a feature it has. If it doesn't, I can manipulate csv to put it into their import format with any language's basic tools.<p>And if it doesn't work, I can look at the csv myself, because it's human readable, to see what the problem is.<p>90% of real world coding is taking data from a source you don't control, and somehow getting it to a destination you don't control, possibly doing things with it along the way. Your choices are usually csv, xlsx, json, or [shudder] xml. Looking at the pros and cons of those is a reasonable discussion to have.
As a French person, I see another problem with CSV.<p>In the French locale, the decimal separator is the comma, so "121.5" is written "121,5". That means, of course, that the comma can't be used as a field separator, so the semicolon is used instead.<p>So depending on whether the tool that exports the CSV is localized, you get commas or you get semicolons. If you are lucky, the tool that imports it speaks the same language. If you are unlucky, it doesn't, but you can still convert the file. If you are really unlucky, you get commas both as decimal separators and as field separators, making the file completely unusable.<p>There is a CSV standard, RFC 4180, but no one seems to care.
Of course if you only consider the disadvantages, something looks bad.<p>The advantages of CSV are pretty massive though - if you support CSV you support import and export into a massive variety of business tools, and there is probably some form of OOTB support.
The reason CSV is popular is that it is (1) super simple, and (2) the simplicity leads to ubiquity. It is extremely easy to add CSV export and import capability to a data tool, and that has come to mean that there are no data tools that don't support CSV.<p>Parquet is the opposite of simple. Even when good libraries are available (which they usually aren't), it is painful to read a Parquet file. Try reading a Parquet file using Java and the Apache Parquet lib, for example.<p>Avro is similar. Last I checked there were two Avro libs for C# and each had its own issues.<p>Until there is a simple format with ubiquitous libs in every language, CSV will continue to be the best format despite the issues caused by under-specification. Google Protobuf is a lot closer than Parquet or Avro. But Protobuf is not a splittable format, which means it is not Big Data friendly, unlike Parquet, Avro and CSV.
Author here. I see now that the title is too controversial, I should have toned that down. As I mention in the conclusion, if you're giving parquet files to your user and all they want to know is how to turn it into Excel/CSV, you should just give them Excel/CSV. It is, after all, what end users often want. I'm going to edit the intro to make the same point there.<p>If you're exporting files for machine consumption, please consider using something more robust than CSV.
Or export to CSV correctly and test with Excel and/or LibreOffice. Honestly, CSV is a very simple, well-defined format that is decades old and "obvious". Over the years I've had far more trouble with various export-to-Excel functions, which have much more complex third-party dependencies. Parsing CSV correctly is not hard, you just can't use split and be done with it. This has been my coding kata in every programming language I've touched since I was a teenager learning to code.
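To make the "you can't just use split" point concrete, here is roughly what the kata boils down to with Python's stdlib; quoted fields with embedded commas and doubled quotes are the usual trap (the sample data is made up):

<pre><code>import csv
import io

# A field with an embedded comma and a doubled quote: naive splitting breaks here.
raw = 'name,comment\n"Smith, Jane","She said ""hi"""\n'

# Naive approach: wrong field count on the data row.
print(raw.splitlines()[1].split(","))  # ['"Smith', ' Jane"', '"She said ""hi"""']

# csv module: handles quoting, embedded commas, and doubled quotes correctly.
for row in csv.reader(io.StringIO(raw)):
    print(row)  # ['name', 'comment'] then ['Smith, Jane', 'She said "hi"']
</code></pre>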
CSV is very durable. If I want it read in 20 years, csv is the way to go until it’s just too big to matter.<p>Of course there are better formats. But for many use cases friends encourage friends to export to CSV.
Friends don't let friends export to CSV -- in the data science field.<p>But outside the data science field, my experience working in software these years is that it won't matter how beautiful your backoffice dashboards and web apps are: many non-technical business users will at some point demand CSV import and/or export capabilities, because it is easier for them to just dump all the data in a system into Excel/Sheets to make reports, or to bulk edit the data via the export-Excel-import route, rather than dealing with the navigation model and maybe tens of browser tabs in your app.
CSV wins because it's universal and very simple.
With an editor like Notepad++ and the CSV plugin, reformatting, like changing the date format, is very easy, and you even get colored columns.
It's strange to me that people complain about some variety in CSV files while acting as if parquet was one specific file format that's set in stone.
They can't even decide which features are core, and the file format has many massive backwards-incompatible changes already. If you give me a parquet file I cannot guarantee that I can read it, and if I produce one I cannot guarantee that you can.<p>I treat formats such as parquet as I generally do: I try to allow various different inputs, and produce standard outputs. Parquet is something I allow purely as an optimization. CSV is the common default all of my tools have (UTF-8 without BOM, international locale, comma separator, quoting at the start of the value optional, standards-compliant date format or unix timestamps). Users generally don't have any issue with adapting their files to that format if there's any difference.
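For what it's worth, that default output is trivial to produce from Python's stdlib; a rough sketch (column names are made up): UTF-8 without BOM, comma separator, minimal quoting, ISO 8601 timestamps.

<pre><code>import csv
from datetime import datetime, timezone

rows = [
    {"id": 1, "name": "Ada, Lovelace",
     "created_at": datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)},
]

# UTF-8 without BOM, comma separator, quoting only where needed, ISO 8601 timestamps.
with open("export.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "created_at"],
                            quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    for row in rows:
        row["created_at"] = row["created_at"].isoformat()  # e.g. 2024-01-02T03:04:05+00:00
        writer.writerow(row)
</code></pre>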
> <i>One of the infurating things about the format is that things often break in ways that tools can't pick up and tell you about</i><p>This line is emblematic of the paradigm shift LLMs have brought. It’s now easier to build a better tool than change everyone’s behaviour.<p>> <i>You give up human readable files, but</i><p>What are we even doing here.
Every single use I've ever seen of CSV would be improved by the very simple change to TSV.<p>Even Excel can handle it.<p>It is far safer to munge data containing tabs (convert to spaces, etc.) than data containing commas (remove? convert to dots? escape?).<p>The better answer is to use ASCII separators as Lyndon Johnson intended, but that turns out to be asking a lot of data producers. Generating TSV is usually easier than generating CSV.
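Both variants are a one-argument change if you're already generating the files programmatically; a quick Python sketch of the writing side (readers still need to agree on the separators):

<pre><code>import csv

rows = [["id", "note"], ["1", "contains, commas"]]

# TSV: same csv module, different delimiter.
with open("export.tsv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# ASCII separators: unit separator (0x1F) between fields, record separator (0x1E) between rows.
with open("export.asv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\x1f", lineterminator="\x1e").writerows(rows)
</code></pre>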
My takeaway is that csv has some undefined behaviours, and it takes up space.<p>I like that everyone knows about .csv files, and it's also completely human readable.<p>So for <100mb I would still use csv.
1. CSV is for ensuring compatibility with the widest range of consumers, not for ensuring best read or storage performance for consumers. (It <i>is</i> already more efficient than JSON because it can be streamed, and takes up less space than a JSON array of objects)<p>2. The only data type in CSV is a string. There is no null, there are no numbers. Anything else must be agreed upon between producer and consumer (or more commonly, a consumer looks at the CSV and decides how the producer formatted it). JSON also doesn’t include dates, you’re not going to see people start sending API responses as Apache Parquet. CSV is fiiine.
CSV has some limits and difficulties, but has massive benefits in terms of readability, portability, etc.<p>I feel like USV (Unicode Separated Values) neatly improves CSV while maintaining most of its benefits.<p><a href="https://github.com/sixarm/usv">https://github.com/sixarm/usv</a>
The poor performance argument is not true even for the Python ecosystem that the author discusses. Try saving geospatial data as GeoPackage, GeoJSON, or FlatGeobuf: they all save more slowly than plain CSV (the only inconvenience is that you must convert geometries into WKT strings). GeoPackage was "the Format of the Future" 8 years ago, but it's utterly slow when saving, because it's an SQLite database and indexes all the data.<p>Files in .csv.gz are more compact than anything else, unless you work in some very, very specific field with very compressible data. As far as I remember, Parquet files are larger than CSV with the same data.<p>Working with the same kind of data in Rust, I see everything saved and loaded in CSV lightning fast. The only thing you may miss is indexing.<p>Saving to binary, by contrast, is notably slower, and data in a generic binary format becomes LARGER than in CSV. (Maybe if you define your own format and write a driver for it you'll be faster, but that means no interoperability at all.)
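Roughly what that WKT round-trip looks like from Python, assuming geopandas/shapely are installed (file names are hypothetical):

<pre><code>import pandas as pd
import geopandas as gpd

gdf = gpd.read_file("parcels.gpkg")  # hypothetical GeoPackage source

# Convert geometries to WKT strings and write a gzipped CSV
# (pandas infers gzip compression from the .gz extension).
df = pd.DataFrame(gdf.drop(columns="geometry"))
df["geometry_wkt"] = gdf.geometry.to_wkt()
df.to_csv("parcels.csv.gz", index=False)
</code></pre>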
I think the sad reality there is that it's become "the" format that users expect, and more importantly, it's what's integrated into the majority of peripheral services and tools.<p>Like JSON.
Remembering all the cases where I needed to export to CSV – 99% are relatively small datasets, so marginal gains of a few milliseconds on import aren't worth sacrificing convenience. And sometimes you just get data from a gazillion diverse sources, and CSV is the only option available.<p>I suspect that not everybody here works exclusively with huge datasets and well-defined data pipelines.<p>On the practical side, if I want to follow the suggestions, how do I export to Avro from Numbers/Excel/Google Sheets?
I never liked articles about how you should replace CSV with some other format while pulling some absolutely idiotic reasons out of their rear...<p>1. CSV is underspecified
Okay, so specify it for your use case and you're done?
E.g. use RFC 3339 instead of the straw-man 1-1-1970, and define what a missing value looks like, which is usually just an empty string (a minimal sketch of such a convention follows at the end of this comment).<p>2. CSV files have terrible compression and performance
Okay, who in their right mind uses a plain text file to export 50 GB of data? Some file systems don't even support files that large. When you are at the stage of REGULARLY shipping around files this big, you should think about a database and not another file type to send via mail.
Performance may be a point, but again, using it for gigantic files is wrong in the first place.<p>3. There's a better way (insert presentation of a file type I have never heard of)
There are lots of better ways to do this, but:
CSV is extremely quick to implement, it is universally known unlike Apache Parquet (or Pickle or ORC or Avro or Feather...), and it is human readable.<p>So in the end: use it for small data exports where you can specify everything you want, or basically everywhere you need to import data, because most software takes CSV as input anyway.<p>For lots of data use something else.<p>Friends don't let friends write one-sided articles.
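To illustrate point 1, a minimal sketch of the kind of per-use-case convention I mean: RFC 3339 timestamps, empty string for missing values (the column names and records are made up):

<pre><code>import csv
from datetime import datetime, timezone

def fmt(value):
    """Apply the agreed convention: RFC 3339 for timestamps, empty string for missing."""
    if value is None:
        return ""
    if isinstance(value, datetime):
        return value.astimezone(timezone.utc).isoformat()
    return value

records = [
    {"user": "alice", "last_login": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)},
    {"user": "bob", "last_login": None},
]

with open("logins.csv", "w", encoding="utf-8", newline="") as f:
    w = csv.writer(f)
    w.writerow(["user", "last_login"])
    for r in records:
        w.writerow([fmt(r["user"]), fmt(r["last_login"])])
</code></pre>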
The reason why USV did not use the proper ASCII codes for field separator and record separator is a bit too pragmatic for me…<p><a href="https://github.com/SixArm/usv/tree/main/doc/faq#why-use-control-picture-characters-rather-than-the-control-characters-themselves">https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...</a>
An article promoting parquet over CSV. Fair enough, but parquet has been around for a while and still no support in Debian. Is there some deep and dark reason why?
The problem, as always, is that you deal with multiple data sources - which you can not control the format of. I work as a data analyst, and in my day-to-day work I collect data from around 10 different sources. It's a mix of csv, json, text, and what not.<p>Nor can you control the format others want. The reason I have to export to csv, is unfortunately because the people I ship out to use excel for everything - and even though excel does support many different data formats, they either enjoy using .csv (should be mentioned that the import feature in excel works pretty damn well), or have some system written in VBA that parses .csv files.
As a data architect in a big company, I cannot tell you how harmful a data format as stupid as CSV can be.
All the possible semantics of the data have to be offloaded either to people's brains [don't do that! Just don't!], or to out-of-sync specs [better hidden in the company's CMS than the Ark of the Covenant, and outdated anyway], or to obscure code or SQL queries [an opportunity for hilarious reverse-engineering sessions, where you hate forever a retired developer for all the tricks he added to the code to work around poorly defined data, and who got away to a Florida beach after hiring you.]
I like the ping pong of one day an article being posted where everyone asks, "when/why did everything become so complicated", and then the next day something like this is posted.
I wouldn't say the article proposes a better way; it proposes a more complex one.<p>Nothing beats CSV in terms of simplicity, minimal friction, and ease of exploration across diverse teams.
"There's a better way" - "just" write your application in Java or Python, import Thrift, zstandard and boost, do some compiling - and presto, you can now export a very complicated file format you didn't really need which you hope your users (who all undoubtedly have Java and Python and Thrift and whatnot) will be able to read.<p>CSV does not deserve the hate.
CSV is a superb, incredibly useful data format... but not perfect or complete.<p>Instead of breaking CSV by adding to it, I recommend augmenting it:<p>It would be useful to have a good standardized/canonical JSON format for things like encoding, delimiter, schema and metadata, to accompany a zipped CSV file, perhaps packaged in the same archive.<p>Gradually datasets would become more self-documenting and machine-usable without wrangling.
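Something like the following, say; this is a made-up sidecar layout (loosely in the spirit of things like CSVW or Frictionless data packages), not an existing standard:

<pre><code>import json

# Hypothetical sidecar describing the accompanying data.csv.gz; the field names are made up.
metadata = {
    "file": "data.csv.gz",
    "encoding": "utf-8",
    "delimiter": ",",
    "header": True,
    "columns": [
        {"name": "id", "type": "integer"},
        {"name": "amount", "type": "decimal"},
        {"name": "created_at", "type": "datetime", "format": "rfc3339"},
    ],
}

with open("data.csv-meta.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
</code></pre>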
“the use case where people often reach for CSV, parquet is easily my favorite”<p>My use case is that other people can’t or won’t read anything but plain text.
Schemas are overrated. Often the source-system can't be trusted so you need to check everything anyway or you'll have random strings in your data. Immature languages/libraries often do dumb stuff like throwing away the timezone before adjusting it to UTC. They might not support certain parquet types (e.g. an interval).<p>Like I've recently found it much easier to deal with schema evolution in pyspark with a lot of historical CSVs than historical parquets. This is essentially a pyspark problem, but if everything works worse with your data format then maybe it's the format that's the problem. CSV parsing is always and everywhere easy, easier than the problems parquets often throw up.<p>The only time I'd recommend parquet is if you're setting up a pipeline with file transfer and you control both ends... but that's the easiest possible situation to be in; if your solution only works when it's a very easy problem then it's not a good solution.
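For example, with historical CSVs in pyspark I can just pin the schema myself instead of letting each file infer its own; a rough sketch, with hypothetical path and column names:

<pre><code>from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Pin the schema explicitly instead of inferring it from each historical file.
schema = StructType([
    StructField("event_id", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("occurred_at", TimestampType(), True),
])

df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("s3://bucket/historical/*.csv"))  # hypothetical path
</code></pre>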
CSV is totally fine if you use it for the right kind of data and the right application. That means:<p>- data that has predictable value types (mostly numbers and short labels would be fine), e.g. health data about a school class wouldn't involve random binary fields or unbounded user input<p>- data that has a predictable, managable length — e.g. the health data of the school class wouldn't be dramatically longer than the number of students in that class<p>- data with a long sampling period. If you read that dataset once a week performance and latency become utterly irrelevant<p>- if the shape of your data is already tabular and not e.g. a graph with many references to other rows<p>- if the gain in human readability and compatibility for the layperson outweighs potential downsides about the format<p>- if you use a sane default for encoding (utf8, what else), quoting, escaping, delimiter etc.<p>Every file format is a choice, often CSV isn't the wrong one (but: very often it is).
I have all SQL exported to CSV and committed to git once a day (no, I don't think this is the same as WAL/replication).<p>Dumping to CSV is built into MySQL and Postgres (though MySQL has better support), is faster on export and much faster on import, doesn't fill up the file with all sorts of unneeded text, can be diffed (and triangulated by git) line by line, is human readable (eg. grepping the CSV file) and overall makes for a better solution than mysqldumping INSERTs.<p>In Docker, I can import millions of rows in ~3 minutes using CSV; far better than anything else I tried when I need to mock the whole DB.<p>I realize that the OP is more talking about using CSV as a interchange format or compressed storage, but still would love to hear from others if my love of CSV is misplaced :)
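For anyone curious, the Postgres side of that dump can be scripted as well; a rough sketch assuming psycopg2 is installed, with a hypothetical connection string and table name (MySQL has its own SELECT ... INTO OUTFILE route):

<pre><code>import psycopg2

# COPY ... TO STDOUT streams the table out as CSV; copy_expert writes it to the file object.
conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur, open("orders.csv", "w", encoding="utf-8") as f:
    cur.copy_expert("COPY orders TO STDOUT WITH (FORMAT csv, HEADER true)", f)
</code></pre>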
I tend to prefer line-delimited JSON myself, even if it's got redundant information. It will gzip pretty well if you want to use less storage space.<p>Either that, or use the ASCII codes for field and row delimiters in a UTF-8 file without a BOM.<p>Even then you're still stuck with data encoding issues around numbers and booleans. And that doesn't even cover all the holes I've seen in CSV in real-world use by banks and govt agencies over the years.<p>When I've had to deal with varying imports, I push for a scripted (JS/TS or Python) preprocessor that takes the vendor/client format and normalizes it to line-delimited JSON, and then that output gets imported.
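The preprocessor usually ends up being something as small as this sketch (the vendor file and the cleanup rules here are hypothetical):

<pre><code>import csv
import json

def normalize(row):
    # Per-vendor cleanup goes here; this sketch just trims whitespace and maps empty strings to None.
    return {k.strip(): (v.strip() or None) for k, v in row.items() if v is not None}

with open("vendor_export.csv", encoding="utf-8", newline="") as src, \
     open("normalized.ndjson", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        dst.write(json.dumps(normalize(row), ensure_ascii=False) + "\n")
</code></pre>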
It's far easier than trying to create a flexible importer application.<p>Edit: I've also advocated for using SQLite3 files for import, export and archival work.
SQL and XML have schemas, and they're to a large extent human readable, even to people who aren't developers. If storage is cheap, compression isn't very important.<p>I've never come across this Parquet format; is it greppable? Gzipped CSV is. Can a regular bean counter import Parquet into their spreadsheet software? A cursory web search indicates they can't without having a chat with IT, and SQL might be easier while XML seems pretty straightforward.<p>Yes, CSV is kind of brittle, because the peculiarities of a specific source are like an informal schema, but someone versed in whatever programming language makes this Parquet convenient won't have much trouble figuring out a CSV.
Use the right data format for the right data. CSV can be imported into basically any spreadsheet, which can make it appealing, but that doesn't mean it's always a good option.<p>If you want CSV, consider a normalization step. For instance, make sure numbers have no thousands-separator commas and use a "." decimal point. Probably quote all strings. Ensure you have a header row.<p>Probably don't reach for CSV if:<p>- You have long text blobs with special characters (e.g. quotes, newlines, etc.)<p>- You can't normalize the data for some reason (e.g. some columns have formulas instead of specific data)<p>- You know that every user will always convert it to another format or import it
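That normalization step is cheap to do with the stdlib; a sketch with made-up sample columns: header row, plain "." decimals, every string field quoted.

<pre><code>import csv

rows = [
    {"product": "Widget, large", "price": 1299.5, "qty": 3},
    {"product": 'Gadget "Pro"', "price": 49.0, "qty": 12},
]

with open("normalized.csv", "w", encoding="utf-8", newline="") as f:
    # QUOTE_NONNUMERIC quotes every non-numeric field; floats are written with a "." decimal point.
    w = csv.DictWriter(f, fieldnames=["product", "price", "qty"], quoting=csv.QUOTE_NONNUMERIC)
    w.writeheader()
    w.writerows(rows)
</code></pre>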
I've written CSV exports in C from scratch, no external dependencies required.<p>It's "Comma-Separated Values"; it doesn't really need much more specification than that.<p>These files have always imported into the M$ and Libre office suites without issue.
Here is a crazy idea: CSV itself is ambiguous, but as a convention we could encode the options in the file name. E.g. data.uchq.csv means UTF-8, comma-separated, with header, quoted.
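A toy sketch of decoding that naming scheme; this is just the idea above, not an existing convention, and the flag letters are made up to match the example:

<pre><code>import csv

def dialect_from_name(filename):
    """Toy decoder for the proposed naming scheme, e.g. data.uchq.csv -> reader options."""
    flags = filename.split(".")[-2]  # "uchq"
    return {
        "encoding": "utf-8" if "u" in flags else "latin-1",
        "delimiter": "," if "c" in flags else ";",
        "has_header": "h" in flags,
        "quoting": csv.QUOTE_ALL if "q" in flags else csv.QUOTE_MINIMAL,
    }

opts = dialect_from_name("data.uchq.csv")
with open("data.uchq.csv", encoding=opts["encoding"], newline="") as f:
    reader = csv.reader(f, delimiter=opts["delimiter"])
    header = next(reader) if opts["has_header"] else None
    rows = list(reader)
</code></pre>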
What's the best way to expose random CSV/.xlsx files for future joins etc? We're house hunting and it would be nice have a local db to keep track of price changes, asking prices, photos, etc. And look up (local) municipal OpenData for an address and grab the lot size, zoning, etc. I'm using Airtable and sometimes Excel, but it would be nice to have a home (hobby) setup for storing queryable data.
One particularly memorable on-call shift had a phenomenal amount of pain caused by the use of CSV somewhere along the line, and a developer who decided to put an entry "I wonder, what happens if I put in a comma", or something similar. That single comma caused hours of pain. Quite why they thought production was the place to test that, when they knew the data would end up in CSV, is anybody's guess.<p>I think Hanlon's razor applies in that situation.
xsv makes dealing with csv miles easier: <a href="https://github.com/BurntSushi/xsv">https://github.com/BurntSushi/xsv</a>
Not sure I correctly understand what this article is about.
From my point of view, CSV is an easy way to export data from a system to allow an end user to import it in excel and work on this data.
Unless Parquet is as easy to import into Excel as CSV, I suspect this is fixing a problem that doesn't exist, and making things more complicated.<p>Outside the context of end users, I don't see any advantage in this compared to an XML or JSON export.
> You give up human readable files, but what you gain in return is incredibly valuable<p>Not as valuable as human-readable files.<p>And what kind of monstrous CSV files has this dude been working with? Data types? Compression? I just need to export 10,000 names/emails/whatevers so I can re-import them elsewhere.<p>Like, I guess once you start hitting GBs, an argument can be made, but this article sounds more like "CSV considered harmful", which is just silly to me.
Ok, funny guy. Tell that to all the wholesale providers who use software from 2005, or at least it feels that way.<p>No query params in their single endpoint, and only CSV exports possible.<p>Then add to that that Shopify, apparently the leader or whatever in shopping software, can't do better than requiring exactly the format they specify; don't you dare come in with configurable fields or mapping.<p>The industry is stuck in the 00s, if not the 90s.
I'd rather work with someone that prefers a format, but doesn't write articles like this. It's fine to "prefer" parquet, but CSV is totally fine - whatever works mate.<p>When you hit the inevitable "friends don't let friends" or "considered harmful" type of people, it's time to move quickly past them and let the actual situation dictate the best solution.
"I'm a big fan of Apache Parquet as a good default. You give up human readable files, but..."<p>Lost me right there. It has to be human readable.
I still don’t understand how you deal with cardinalities in a CSV.
Do you always recreate an object model on top of it to deal with them properly?<p>Cf. a tweet I wrote in one of my past lives: <a href="https://x.com/datao/status/1572226408113389569?s=20" rel="nofollow">https://x.com/datao/status/1572226408113389569?s=20</a>
It is weird to say both that "CSV files have terrible compression" and then that the proposed format, Apache Parquet, has "really good compression properties, competitive with .csv.gz". I think what's meant here is that CSV compresses really well but you lose the ability to "seek" inside the file.
I finally set aside my laziness and started a thread on r/vim for the .usv project:<p><a href="https://www.reddit.com/r/vim/comments/1bo41wk/entering_and_displaying_ascii_separators_in_vim/" rel="nofollow">https://www.reddit.com/r/vim/comments/1bo41wk/entering_and_d...</a>?
I’ve always liked CSV. It’s a streaming friendly format so:<p>- the sender can produce it incrementally<p>- the receiver can begin processing it as soon as the first byte arrives (or, more roughly, unescaped newline)<p>- gzip compression works without breaking the streaming nature<p>Yeah, it’s a flawed interchange format. But in a closed system over HTTP it’s brilliant.
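Concretely, the streaming property looks like this in Python: rows come out of a gzipped CSV one at a time, without decompressing the whole file first (the file name and the per-row handler are hypothetical):

<pre><code>import csv
import gzip

def handle(row):
    ...  # hypothetical per-row processing; runs as soon as each row is parsed

with gzip.open("events.csv.gz", mode="rt", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):  # rows are decompressed and parsed incrementally
        handle(row)
</code></pre>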
Ubiquity has a quality all of its own.<p>Yes CSV is a pain in many regards, but many of the difficulties with it arise from the fact that anybody can produce it with very little tool support - which is also the reason it is so widely used.<p>Recommending a decidedly niche format as an alternative is not going anywhere.
>Numerical columns may also be ambigious, there's no way to know if you can read a numerical column into an integral data type, or if you need to reach for a float without first reading all the records.<p>Most of the time you know the source pretty well and can simply ask about the value range.
There is CSVY, which lets you set a delimiter, schema, column types, etc. and has libraries in many languages and is natively supported in R.<p>Also is backwards-compatible with most CSV parsers.<p><a href="https://github.com/leeper/csvy">https://github.com/leeper/csvy</a>
I gave up at "You give up human readable files". While I recognize in some cases these recommendations may make sense/CSV may not be ideal, the idea of a CSV _export_ is generally that it could need to be reviewed by a human.
If you ever need to parse CSV <i>really fast</i> and happen to know C#, there is an incredible vectorized parser for that: <a href="https://github.com/nietras/Sep/">https://github.com/nietras/Sep/</a>
This is the level of discourse among Norwegian graduates. Half of them are taught to worship low level, the other half has framework diabetes.<p>Don't come here to work if you don't want to drown in nitpicking and meaningless debates like this.
OpenRefine has saved my bacon more times than I care to admit. It ingests everything and has powerful exporting tools. Friends give friends CSV files, and also tell them about tools that help them deal with the wide array of crap formats.
Using Parquet in Python requires installing pyarrow and numpy, whereas CSV support comes with the stdlib.<p>Also, the csv module has a very Pythonic interface compared to Parquet; in most cases, if I can fit the file in memory, I'll go with CSV.
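The difference in practice, roughly; the Parquet half assumes a reasonably recent pyarrow is installed, and the file names are made up:

<pre><code># Stdlib only: no extra installs.
import csv

with open("data.csv", encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))

# Parquet: needs `pip install pyarrow` (which pulls in its own binary wheel).
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")
rows_from_parquet = table.to_pylist()  # list of dicts, one per row
</code></pre>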
The author seems to be missing the point of CSVs entirely. I looked him up expecting a fresh college grad, but am surprised to see he's probably in his early 30s. Seems to be in a dev bubble that doesn't actually work with users.<p>Try telling 45 year old salesman he needs to export his data in parquet. "Why would I need to translate it to French??"<p>I feel like I'm pretty up to date on stuff, and I've never heard of parquet or seen in as an option, in any software, ever.
I wish I could get Excel to stop converting Product UPCs to scientific notation when opening CSVs.<p>Also some UPCs start with 0<p>Worst is when Excel saves the scientific notation back to the CSV, overwriting the correct number.
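This doesn't fix Excel itself (there the only reliable route I know is the text-import wizard / Power Query, marking the column as Text before it is parsed), but whenever the file passes back through code, forcing the column to text keeps the leading zeros; a sketch assuming pandas and a hypothetical column name:

<pre><code>import pandas as pd

# Read the UPC column as text so leading zeros and long digit strings survive.
df = pd.read_csv("products.csv", dtype={"upc": str})
</code></pre>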
CSVs won the war. No vendor lock in and very portable.<p>I wish TSVs were more popular though. Tabs appear less frequently than commas in data.<p>My biggest recommendation is to avoid Excel! It will mangle your data if you let it.
I use JSON for import/export of user data (in my super app collAnon); it's more predictable, and the tooling around it for transforming into any other format (even CSV) is underappreciated, imo.
Parquet is a columnar format. Which might be what you want, but it also might not, like if you want to process one row at a time in a stream. Maybe avro would be a better format in that case?
Trying to do business without using CSV is like trying to weld without using a torch. Might be possible but you aren't likely to have success at it.
Excel 2021:
the "a spreadsheet is all it needs" file is not usable, because Excel is not able to translate the "LC references that are inside brackets" into other languages.
there needs to be some pandoc (panbin?) for binary formats to convert between parquet, hdf5, fits, netcdf, grib, ROOT, sqlite, etc. (Ok these are not all equivalent in capability...).
Tell me you've never worked a real job without telling me. This is a technologists solution in search of a problem. Do you also argue that "email is dead"?
XML. Not CSV, not Parquet (whatever that is), not protobufs. Export the data as XML, with a schema. Not json or yaml either. You can render the XML into whatever format you want downstream.<p>The alternative path involves parsing csv in order to turn it into a different csv, turning json into yaml and so forth. Parsing "human readable" formats is terrible relative to parsing XML. Go with the unambiguous source format and turn it into whatever is needed in various locations as required.