CSVs Are Kinda Bad. DSVs Are Kinda Good

107 点作者 hieronymusN9 个月前

41 条评论

teddyh9 个月前

The article talks about reading and parsing CSV data of unknown variants, but then skips to the solution being using a different format altogether. But you can only switch to a different format if you are producing data, not if you are reading it!And if you are in control of producing data, just produce strict RFC 4180-compliant CSV data and everybody will be able to read it just fine. There is no need to make your reader’s lives difficult by using yet another non-standard data format.See also: <<a href="https://news.ycombinator.com/item?id=39679753">https://news.ycombinator.com/item?id=39679753</a>>

评论 #41245065 未加载

评论 #41244859 未加载

评论 #41244766 未加载

评论 #41249295 未加载

评论 #41244814 未加载

usrbinbash9 个月前

> CSVs are kinda bad.Not really.What's bad is when people keep insisting on coming up with new and amazing CSV dialects.<a href="https://www.ietf.org/rfc/rfc4180.txt" rel="nofollow">https://www.ietf.org/rfc/rfc4180.txt</a> is very clear about what CSV files are supposed to look like, and the fact that people keep ignoring this for whatever reason, is not the formats problem.And no, "using another format" is not a solution to this. Because: I can just invent a new DSV dialect. Or a JSON dialect. Or a dialect where the field separator is "0xFF00FF00" and the row separator is the string `DECAFCOFFEE` encoded in EBCDIC, all other characters have to be UTF-32, except for a, b and f, which also need to be EBCDIC encoded.> For starters, it’s rather unreadable when opened in a text editor. But I bet you don’t really do that with your CSVs all that often anyway!Wrong. I do that with csv files all the time. In fact I even have an amazing vim plugin just for them [0]. That's pretty much the point of having a plaintext tabular data storage format: That I can view and edit it using standard text wrangling utilities.---There is a much simpler solution to this problem: Don't accept broken CSV. If people keep ignoring standards, thats their problem.[0]: <a href="https://github.com/mechatroner/rainbow_csv">https://github.com/mechatroner/rainbow_csv</a>

评论 #41250820 未加载

评论 #41245302 未加载

评论 #41246006 未加载

评论 #41252517 未加载

评论 #41253824 未加载

评论 #41245340 未加载

mjevans9 个月前

This keeps coming up as new people discover what CSVs are. An ancient TEXT data exchange format. The lowest vaguely common denominator. A style of format with flavors software long out of support contract are happy to export data in.The intent of the format is to be human readable and editable. Sure, Tab characters can be used instead of commas. (TSV files) Yes that's that "" to escape a quote rule. Oh and quoted values are optional, unquoted strings are fine as long as they contain no newline or record separator characters.Sure, you could make another CSV inspired format which uses the old mainframe control characters; except as keeps getting pointed out, even programmers often don't know how to enter raw flow control characters on their systems. Who even bashes those out these days? I know I have to look it up every time.Rejoice that the format is so simple, it's all just text which software might convert to numbers or other values as it might desire.

评论 #41247807 未加载

评论 #41244741 未加载

评论 #41247125 未加载

评论 #41244816 未加载

wmal9 个月前

The author seems to ignore the fact that CSV got so popular because it is human readable. If anyone wanted a binary format there’s plenty of them - most better than this DSV.Also, I’m on a mobile right now, so can’t verify that, but it seems the format is flawed. The reader decodes UTF8 strings after splitting the binary buffer by the delimiter, but I believe the delimiter may be a part of a UTF8 character.Edit: just checked and there’s actually no chance that the delimiter the author chose would be part of UTF8 encoding of any other character than the delimiter itself

评论 #41244739 未加载

评论 #41253869 未加载

评论 #41251243 未加载

评论 #41244862 未加载

bvrmn9 个月前

[Grumpy mode start]Some nitpicks, maybe someone finds it useful. Could we talk about a code design a little bit?<pre><code> class DSV: @property def delimiter(cls) -> bytes: return b'\x1F' @property def record_separator(cls) -> bytes: return b'\x1E' @property def encoding(cls) -> str: return 'utf-8' </code></pre> It's Python, do not make a premature properties for static values.<pre><code> class DSV: delimiter = b'\x1F' record_separator = b'\x1E' encoding = 'utf-8' </code></pre> Also it's a false inheritance relationship. Writer is not related to configuration. You can't make any other useful subclasses for DSV (ok maybe DSVReader, but that's it). At least it should be in the opposite way: an abstract Writer operating on instance configuration attributes and DSVWriter defining these attributes.Also `self._buffer += chunk` is O(N^2). It starts to bite even for buffers small as 100 bytes. It's ok for an example, but it's an issue for real code. Example at least buffers incomplete record not a whole read chunk (good!). But does only one split at a time (bad).[Grumpy mode end]Nevertheless article is very valuable and interesting to read. CSV gotchas are well described.

Pikamander29 个月前

From what I've seen, the biggest problem isn't with the CSV standard (even though it has a lot of faults), but rather that a lot of software that utilizes CSVs is poorly tested.I can't tell you how many times I've downloaded a CSV that didn't escape quotes or newlines correctly, or how many times Excel has failed to correctly parse a perfectly valid CSV due to some decades-old quirk.I know that there are better formats that make these types of situations pop up less, but is a little bit of quality control too much to ask for? If you've tested your software to make sure that it can handle CSVs with line breaks, tabs, and both types of quotes, then you've seemingly done more testing than 90% of the software out there.On that note, the LibreOffice Calc team deserves major credit for how many different CSV formats it can handle. It's saved my bacon so many times when Excel wasn't up to the task.

brunokim9 个月前

I've read a comment here some years ago of someone discovering ASCII field delimiters and excited to use them. They then discovered that those characters are only used in three places: the ASCII spec, their own code, and the data from the first client where he tried to use this solution.Any file format needs a well-specified escape strategy, because every file format is binary and may contain binary data. CSV is kinda bad not only because, in practice, there's no consensus escaping, but also because we don't communicate what the chosen escaping is!I think a standard meta header like follows would do wonders to improve interchangeability, without having to communicate the serialization format out-of-band.``` #csv delim=";" encoding=utf8 quote=double locale="pt-BR" header=true ```(RFC-4180 does specify that charset and header may be specified in the MIME type)

评论 #41248809 未加载

NoboruWataya9 个月前

The only real benefit of CSV (other than that it is widely supported) is that it is easy for humans to read and write. The approach in this article solves the quoting problem, but also removes that benefit. If you have the power to move from CSV, surely JSON would be better if you need to keep the human readable/writable feature. And if you don't need it, there are other more featureful binary formats out there like parquet.

评论 #41244775 未加载

评论 #41244784 未加载

joeld429 个月前

I like the idea but this is non-standard enough to be just as hard as making a custom format.In my experience, the best way to handle this is:1) Use TSV (tab-separated) instead of CSV (most things that export CSV also export TSV). Strip LF characters while reading and assume newlines are CR.2) If you have a stubborn data source that insists on CSV, convert it to TSV in a pre-process. This could be a separate step or part of your reader as you're reading in the file. That means there's a single place to handle the escaping nonsense, and you can tailor that to each data source.

tbrownaw9 个月前

> If we used 31 as a field delimiter and 30 instead of newlines, we solve every single edge case from above. Why? Because these are non-printing characters that should never appear in a text-stream data set.I have in fact seen CSV files used as an interchange format for things that include non-plaintext fields. And I've seen nested CSV files.

评论 #41250860 未加载

RockRobotRock9 个月前

>For starters, it’s rather unreadable when opened in a text editor. But I bet you don’t really do that with your CSVs all that often anyway!I really wish that were true.

FerretFred9 个月前

I had to LOL a bit about this. I built a career that lasted over 30 years writing software that deciphered clients' attempts to produce sales data files in CSV format.Many times they just couldn't seem to find the comma. Other times there were commas in the item description (unescaped). My favourite though was when files were edited collaboratively using a Mac, Windows and Linux machines - multiple line-end types FTW! Like I said, a long and somewhat inglorious career..

zelphirkalt9 个月前

CSVs are a subset of DSVs. So I guess the idea is, that using that specific subset is bad. But then again it sort of does not matter too much, which character is used for separation, at least if it is not a character that is frequently used as part of a cell value, because that would cause a lot of escaping being needed.

cafard9 个月前

About a month ago, somebody posted a link to an interview with Brian Kernighan. About 6 minutes in, he talks about the difficulty of writing a decent CSV parser: <a href="https://www.youtube.com/watch?v=_QQ7k5sn2-o" rel="nofollow">https://www.youtube.com/watch?v=_QQ7k5sn2-o</a>

Diti9 个月前

The same idea, applied to JSON: <a href="https://shkspr.mobi/blog/2017/03/kyli-because-it-is-superior-to-json/" rel="nofollow">https://shkspr.mobi/blog/2017/03/kyli-because-it-is-superior...</a>

TomMasz9 个月前

When we get to CSVs I tell my Python students that while the CSV module does do a lot of nice things for you, CSVs are still a minefield and you really have to look at the file in a text editor first if you're not the one who created it.

knallfrosch9 个月前

If you strip the exchangeability from an exchange format, it is useless.DSVs didn't work with either Google Sheets, nor vim and neither Python – I assume this is the exhaustive list of software the author would have needed support from. The question, then: If no software understands the format, what's the point?> I first learned about these ASCII delimiters while working with .fec [Federal Election Commission] files.And then the author instantly chose a different delimiter. Two parties and already two standards. That should have been the final red flag for this proposal.--- Aside: CSVs have so many problems wit their data format that you have to always verify them anyway.Germans write 5.000,24€ where an American would write $5,000.24. Date strings. URL-encoded strings. Numbers as strings. Floating numbers.Solving the delimiter problem accomplishes nothing.

snthpy9 个月前

I was wondering what DSV is and saw it's a term the author created. I have seen this format usually called ASV (ASCII Separated Values).There's also a more modern USV (Unicode Separated Values) which has visible separators.

mattewong9 个月前

I cannot imagine any way it is worth anyone's time to follow this article's suggestion vs just using something like zsv (<a href="https://github.com/liquidaty/zsv">https://github.com/liquidaty/zsv</a>, which I'm an author of) or xsv (<a href="https://github.com/BurntSushi/xsv/edit/master/README.md">https://github.com/BurntSushi/xsv/edit/master/README.md</a>) and then spending that time saved on "real" work

poikroequ9 个月前

If you're not concerned with the size of the file, you might consider just using NDJSON.<a href="https://github.com/ndjson/ndjson-spec">https://github.com/ndjson/ndjson-spec</a>

dflock9 个月前

If only...Every time I have to do any major work with CSVs, I re-lament this exact thing.I think the only way this could ever become more widespread is to fix all the open source tooling so that it's eventually just supported everywhere - then keep evangelizing for... ~30 yrs.Probably you should also register a new mime type and extension and make it a new thing - don't overload .CSV any further - but make the same tooling support it.

评论 #41244767 未加载

tpoacher9 个月前

People here complaining this guy is suggesting a "new" standard: it's ASCII. It is already a standard, and probably a lot more sensible than others that followed.I too have wondered why the hell aren't we using those special characters already ever since I discovered their existence

jgord9 个月前

CSV is kinda great .. but it does help to have nice tools to wrangle it, such as the famous xsv by burnt-sushi.

eigenvalue9 个月前

I have found that pandas is much better than the standard library csv library for just importing random CSV files and automatically figuring out what you would want to do most of the time, detecting column headers, dealing with quotes strings, etc.

euroderf9 个月前

Are there any commonly-used fonts that display FS GS RS US as cool graphical characters ? If I'm going to use them to structure text, I want them to be visible, and I want them to be clearly distinguishable from the text.

canimus9 个月前

Had a similar challenge when writing alphareader is in GitHub. HN comments helped me to think in multi-byte separators, and one thing is sure no matter which char you choose it will appear in the wrong place at some point.

jmclnx9 个月前

['Alice', 'She said, "Hello" and waved.', 'Fred''s Car is broken']You still have the issue described by "" with '' if I read the examples correctly.

mr902109 个月前

Nice! Regarding support from third-party software, perhaps it would be worth writing a specification for DSVs. I think that it could ease the adoption from well-known softwares.

beardyw9 个月前

I wanted a quick and dirty to parse a CSV in js the other day and just added square brackets around it and used JSON.parse.Am I alone in this?

评论 #41246458 未加载

cogman109 个月前

CSVs are bad. If you can change the format then don't use a DSV, use parquet and a library for your language to consume parquet.It's less code for you and you can do neat things like zstd compression on the columns.Bonus, it also doesn't require that you load and unload everything in memory.<a href="https://arrow.apache.org/docs/python/parquet.html" rel="nofollow">https://arrow.apache.org/docs/python/parquet.html</a>

Decabytes9 个月前

When I'm in control I just produce a TSV instead of a CSV. A comma is much more likely in text than a tab

kvbe9 个月前

Ok, but which tool can you use to edit csvs with the same power as excel… without messing with the csv?

BeFlatXIII9 个月前

What this DSV format needs for evangelization is for someone to create a front end editor for it.

dsevil9 个月前

Original author writes: >>> "Quick aside: I first learned about these ASCII delimiters while working with .fec files. For whatever reason, the Federal Election Commission in the United States also decided that they needed to ditch the comma, but they landed on using ASCII character 28 which is supposed to be used as a file separator not a field saparator. I have no idea why they picked that one when 31 was right there. Anyway, the FEC also has a tool called fs2comma.exe that turns it back into a CSV format, and a couple of years I filed a FOIA request and got the source code."I can only speculate on this but in Perl, for fake multimensional arrays à la `$foo{$x,$y,$z}`[^1], Perl uses ASCII character 28 (U+001C INFORMATION SEPARATOR FOUR) as its default subscript separator. Perl borrowed this feature from AWK, which uses the same character by default for the same purpose.Based on Perl, I initally used that same character for that same purpose in a project or two. I cannot speculate on why Aho, Weinberger, and/or Kernighan chose that character. (On or before 1977.)[^1]: Not to be confused with nested array (or hash) references in Perl, a truer form of multimensional arrays: `$foo->[$x]->{$y}->[$z]`

foobarkey9 个月前

DSVs are pretty bad, CSVs are kinda OK (just don’t ever open one in Excel)

评论 #41244788 未加载

bionhoward9 个月前

Love it. I’m gonna use this. Thank you for sharing!

mberning9 个月前

> Because these are non-printing characters that should never appear in a text-stream data set.Good luck with that.

welcome_dragon9 个月前

Ah yes good old CSV. It's perfectly fine to use for data transfer and there are libraries for (probably) every language that handle it perfectly to spec.The problem isn't "CSV". The problems come from: - "excel doesn't like this CSV therefore it's not valid" - "what do you mean the CSV I sent you is wrong? I used excel" - "standard? What's a standard? I put info then a comma. That should be good enough for anyone"CSV, when done right (i.e. following a standard) is a great format. It's human readable, less verbose than, say, JSON, and everybody understands it.Just have to make sure business people (and less technical technical people) understand that CSV != Excel and vice-versa.

eth0up9 个月前

Question: I started with a deliberately convoluted PDF which after much effort I filtered, sorted, reorganized and transferred the 18000 useful lines to a csv. These lines are simple, with dates, indicator and corresponding numbers.The purpose is to statically analyze the numbers for anomalies or any signs of deviation from expected randomness. I do this all in python3 with various libraries. It seems to be working, but...What is a more efficient format than csv for this kind of operation?Edit: I have also preserved all leading zeros by conversion to strings -- csv readers don't care much for leading zeros and simply disappear them, but quotes fix that.

评论 #41247320 未加载

gsck9 个月前

The phones we use at {JOB} can be programmatically controlled by using their proprietary command language, which is just CSV and each command is ended with a new line (Because how else would you do it, packet boundary pfft?).It's something I've never understood why, why not use something more standard like SIP, or even a more structured message format. Having to parse CSV across N different packet boundaries is a royal PITA

TudorAndrei9 个月前

This reminds me of <a href="https://xkcd.com/" rel="nofollow">https://xkcd.com/</a>As other said, most of the times, if you are producing them, just produce them right, or choose other formats.If you don't then pray for the best.

评论 #41245763 未加载