TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

So You Want to Write Your Own CSV code

158 pointsby Monkeygetalmost 11 years ago

39 comments

slgalmost 11 years ago
CSV are a headache. Like the article says, RFC4180 doesn&#x27;t necessarily represent the real world. However sometimes you just have to reject things that aren&#x27;t spec.<p>Not too long ago I was struggling with one of these CSV issues and received some good advice from Hans Passant [1] on a Stack Overflow question pertaining to my problem (emphasis mine):<p>&quot;It is pretty important that you don&#x27;t try to fix it. That will make you responsible for bad data for a long time. Reject the file for being improperly formatted. If they hassle you about it then point out that it is not RFC-4180 compatible. <i>There&#x27;s another programmer somewhere that can easily fix this.</i>&quot;<p>It makes perfect sense in hindsight. If you accept a malformed CSV file, people will expect you to accept <i>any</i> malformed data that has a CSV extension. You are taking on a lot of extra responsibility to cover for the lack of work by another programmer. Odds are they can make a change to fix the problem that takes a fraction of the time it would take you work around it. You just have to raise the issue.<p>I realize that rejecting bad files isn&#x27;t really possible in every circumstance. But I have a feeling it is an option more times than you might initially think.<p>[1] - <a href="http://stackoverflow.com/users/17034/hans-passant" rel="nofollow">http:&#x2F;&#x2F;stackoverflow.com&#x2F;users&#x2F;17034&#x2F;hans-passant</a>
评论 #7797150 未加载
评论 #7797861 未加载
评论 #7797577 未加载
评论 #7797214 未加载
评论 #7798583 未加载
mrweaselalmost 11 years ago
The most retard structure I&#x27;ve seen in a CSV file relates to the &quot;What if the character separating fields is not a comma?&quot;.<p>We get &quot;CSV&quot; files from Klarna, an invoicing company, with the payments they&#x27;ve processed for us. Because we&#x27;re Danish and they are Swedish, it&#x27;s not really weird that they would use comma as the decimal separator. So to compensate for having used the comma, they for some reason picks &quot;, &quot; ( that&#x27;s comma + space ) as the field separator. Most good csv parsers can handle the field separator to be any character you like, as long is it&#x27;s just ONE character. By picking a two character separator they&#x27;ve just dictated that I write my own or resort to just splitting a line on &quot;, &quot;.
评论 #7797252 未加载
评论 #7796848 未加载
qwerty_asdfalmost 11 years ago
Garbage in? Garbage out. You give me a shitty file, you get shitty results. Tough shit.<p>None of these questions are particularly daunting. CSV means &quot;comma separated values&quot;, so if you want to play games and use other delimiters, please fuck off. If it&#x27;s not a comma, then guess what: it&#x27;s not delimited. New line characters are well-known, and well-understood, across all platforms and easy to detect. If you manage to fuck that up in your file, then take a look in the mirror, because the problem is you. Enforcing the practice of enclosing the target data in quotation marks among users is a good idea. It&#x27;s something that should be supported and encouraged, and ignored at one&#x27;s own risk.<p>Additionally, employing an escape character (such as backslash) to allow for the use of a quotation mark within enclosing quotation marks is a nice feature to add in. After that, the concept of a CSV file has provided enough tools, to tolerate [an arbitrarily large percentage] of all use cases. If you need something more robust, XML is thataway.
foxhillalmost 11 years ago
as the article mentions, CSV is not well defined. libraries are.. well, different. you&#x27;d spend as much time becoming familiar with one as you would writing a basic parser.<p>commas don&#x27;t delimit field entries? CSV -&gt; comma separated values.<p>new lines inside a field? i&#x27;ve never written a parser that would be foiled by this. could be an issue if you use a built-in tokeniser (e.g strtok, etc.). be aware.<p>variable number of fields? you’re probably writing this for something with an expected input form. throw errors if you see something you do not accept. make sure you catch them.<p>ascii&#x2F;unicode? yea. it’s a fucking mess. everywhere.<p>just do it. handle failure gracefully. learn from your mistakes. don&#x27;t be naive. consider a library if the (risk of failure):(time) ratio is skewed the wrong way. the only time i would absolutely insist that a 3rd party library be used is when crypto is involved. even then, be aware that they are not perfect.<p>absolutely ignore people who&#x27;s argument is along the lines of &quot;you are not smart enough to implement this standard. let someone else do it.”. fuck <i>everything</i> about that statement, and it’s false sense of superiority.<p>nothing comes for free. wether you use a library, or do your own thing, you’re going to run into problems.
评论 #7797447 未加载
评论 #7797163 未加载
gavinpcalmost 11 years ago
My most popular stackoverflow answer [1] includes a CSV writer and reader. Yeah, I&#x27;d clean it up a little if I were doing it now (return enumerator instead of array, etc). But people keep using it.<p>It uses regex lookaheads to deal with quoting, so it&#x27;s not 100% portable. But it&#x27;s only about one page.<p>As for the other things mentioned by the OP (BOM, encoding), those should be handled by the stream, and are not the provenance of CSV <i>per se</i>.<p>[1] <a href="http://stackoverflow.com/a/769713/4525" rel="nofollow">http:&#x2F;&#x2F;stackoverflow.com&#x2F;a&#x2F;769713&#x2F;4525</a>
评论 #7807099 未加载
iagooaralmost 11 years ago
&quot;CSV is not a well defined file-format. The RFC4180 does not represent reality. It seems as every program handles CSV in subtly different ways. Please do not inflict another one onto this world. Use a solid library.&quot;<p>I can&#x27;t but disagree when I read stuff like this. Why shouldn&#x27;t I release a library if I think it&#x27;s good enough for the community? Even the powerful and versatile Ruby library for CSV parsing started as a gem from a person who didn&#x27;t give a s... about advise like &quot;do not inflict another one into this world&quot;.
评论 #7797513 未加载
seanwoodsalmost 11 years ago
This article makes it much more complicated than it needs to be. It tries to be all things to all people. In practice you&#x27;re going to have to sacrifice some functionality for the sake of usability and your own sanity.<p>When I add a CSV import feature to a project I&#x27;m working on, I tell people &quot;this works with MS Excel flavor of CSV.&quot; This covers most, if not all, real world cases because in my world the people who want to import data are non-programmer types who all use Excel.<p>I&#x27;ll often include the basic rules in the screen that accepts the import. If I ever had to accept data from something that was _not_ Excel I&#x27;d probably include a combo box on the web form that lets you pick the dialect. So far I haven&#x27;t had to do that.<p>The only thing I might not be totally covering is how Excel handles newlines, but in practice I&#x27;ve never had to deal with that.
评论 #7797023 未加载
评论 #7796423 未加载
评论 #7797529 未加载
评论 #7796415 未加载
Dorian-Mariealmost 11 years ago
&gt; Ruby CSV library is 2321 lines.<p>If you look at lib&#x2F;csv.rb [1] it&#x27;s:<p>* 2325 Lines<p>* 2161 Non-blank lines<p>* 950 Lines of Code<p>[1]: <a href="https://github.com/ruby/ruby/blob/trunk/lib/csv.rb" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;ruby&#x2F;ruby&#x2F;blob&#x2F;trunk&#x2F;lib&#x2F;csv.rb</a>
encodereralmost 11 years ago
Early on in my career, just a year out of school, I, for some absurd reason, had the idea to build my own date library.<p>Primarily, I didn&#x27;t fully understand the date objects and functions available in the languages&#x2F;libraries i was using so simple things like formatting a string date seemed difficult to me.<p>This was an awful idea. Dreadful.<p>I came up with all sort of delightful helper methods to cover common use cases like adding one month to the current date. I made this decision to represent dates internally with a timestamp, so adding a month is easy, right?! No. ...What&#x27;s 1 month from January 31st? February 28th? Well then what&#x27;s 1 month from February 28th? The list of edge cases goes on.<p>Most things in life are more complicated than they, at first, seem.
评论 #7798657 未加载
EpicEngalmost 11 years ago
&gt; <i>What if the character separating fields is not a comma?</i><p>&gt; <i>Not kidding.</i><p>We&#x27;d ll be better off really, but that ship has sailed. Using CSV for data which is only ever read by a machine is a dumb decision. Use the RS (record separator) character and many of these ambiguities disappear.<p>Of course, like I said, that ship has sailed. If you want your data to be read nicely by other programs you&#x27;re probably stuck with CSV, TSV, or something similar.
评论 #7798536 未加载
jimehalmost 11 years ago
Personally I know the pain of creating a CSV parser. In late 2006 I was working on a PHP project that required a CSV parser, and what was available at the time did not come close to cutting it. So I created my own parser&#x2F;generator, which among many other things included automatic delimiter character detection. It was a rather painful project to create, but I learned a lot, and found the experience really fun.<p>Overall I agree with the article, there&#x27;s no point in reinventing the wheel if there are libraries out there. And CSV specifically is a horribly complex format to deal with. But sometimes rolling your own is the best and&#x2F;or only choice you have, and you might come out the other end enjoying the experience, and having learned a lot.<p>As for what happened to my old CSV parser? It ended up being quite popular, but stuck in the dark ages as I&#x27;d mostly moved on from PHP years ago. But thanks to a contributor, we&#x27;ve recently put renewed effort into bringing the project in to modern times: <a href="https://github.com/parsecsv/parsecsv-for-php" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;parsecsv&#x2F;parsecsv-for-php</a>
p0ncealmost 11 years ago
Can we stop being liberal in what we accept from others? It only leads to an unfixable mess.
huhertoalmost 11 years ago
CSV works for simple cases. It is trivial to parse, you shouldn&#x27;t even need a library.<p>It there are many &quot;what ifs&quot; like in the posted article. You probably need another format like JSON (preferably) or XML.
评论 #7797285 未加载
评论 #7798442 未加载
评论 #7797525 未加载
评论 #7797583 未加载
mooredsalmost 11 years ago
This goes for most complex problems. The first step of any dev problem should be to make sure you understand the problem, the second to map out the main pieces and the third to make sure you are leveraging every (well maintained) library possible. There are, of course, issues with dependencies and tying yourself to code you didn&#x27;t write, but what would you rather depend on--code that has had tens or hundreds of eyes on it, or code that you, and maybe one or two team members has reviewed?
Rabidgremlinalmost 11 years ago
One of my first open source projects was a JDBC driver that read CSV files. It started simply enough but once you started adding in support for all the quirks things became really complicated really quickly. Just check out all the &quot;options&quot; for the driver that have been added by the community over the last 14ish years <a href="http://csvjdbc.sourceforge.net/doc.html" rel="nofollow">http:&#x2F;&#x2F;csvjdbc.sourceforge.net&#x2F;doc.html</a>
Sami_Lehtinenalmost 11 years ago
Parsing CSV is easier than handling XML or JSON. I do integrations as my job and most common format used is CSV because it&#x27;s handy simple and reliable compared to other formats. That is exactly the reason why ini and props file are also preferred over database for data which isn&#x27;t too volatile or big. Any one can open the datafile and see what&#x27;s stored and what&#x27;s wrong.
评论 #7822305 未加载
michaelmioralmost 11 years ago
The best tool I&#x27;ve found for working with CSV files is csvkit[1]. I&#x27;ve run into some of the issues mentioned in the article and it&#x27;s handled them all gracefully. It&#x27;s basically a bunch of scripts mirroring sort, grep, cut, etc. but specifically for dealing with CSV files.<p>[1] <a href="http://csvkit.readthedocs.org/" rel="nofollow">http:&#x2F;&#x2F;csvkit.readthedocs.org&#x2F;</a>
评论 #7798302 未加载
kemayoalmost 11 years ago
We actually use CSV-reading as an incidental part of a hiring exercise. We provide a really simple homemade CSV parser as part of a PHP project, with a &quot;could you find and fix bugs in this?&quot; instruction. The way to get full marks is to rip out the parser and replace it with the appropriate standard library function.
评论 #7797778 未加载
Roboprogalmost 11 years ago
CSVs were simpler back in the 80s, when there were a few products (e.g. - Lotus 123, xBASE) that all wrote RFC 4180 compliant text (and I&#x27;m pretty sure there was no RFC 4180 yet)<p>No alternate delimiters, no backslashes.<p>Now I have to put up with offshore staff trying to use apostrophes (&#x27;) instead of quotes (&quot;) :-(<p>Barring alternate delimiters, and disallowing newlines* in fields, I can write the parser for 4180 in about 30 lines of perl, reading a char at a time and flipping between about 4 states. (avoids getting root access and days of paperwork to install from CPAN)<p>* disallowing newlines in the data is admittedly a big restriction, but it works for many use-case&#x2F;applications, and allows the caller to pull in a line before calling the parse function.<p>For Java, the &quot;Ostermiller&quot; library is pretty good for CSV handling, and has a few options for dealing with freaky variants.
collywalmost 11 years ago
I think this example is relevant to many seemingly trivial problems. Where the task seems simple, but once you think about the details a bit more it becomes complex.<p>I was trying to get Perl tar libraries working, when my colleague asked why I don&#x27;t just use backticks to do it in the shell. Basically because I don&#x27;t know that much about tar. I can use it to untar file, or create a new archive. Someone else who has written a library probably has taken the time to read through the whole manual and make it work nicely. They know the errors and warnings, and have abstracted that to a sensible level hopefully. They have thought about these things, so hopefully I won&#x27;t have to.
michaelfeathersalmost 11 years ago
Easy to write, hard to read. Perfect illustration of an emergent case of Postel&#x27;s Law.
评论 #7797240 未加载
joshvmalmost 11 years ago
I trust Numpy a lot for CSV handling. It deals with lots of edge cases including missing data, weird delimiters (pipes &#x27;|&#x27; are popular in astro for some reason) and massive files. If in doubt, whack it into Excel which has been doing this stuff for decades now. I prefer using Numpy to Python&#x27;s CSV library which I find a bit clunky.<p>Very little data is actually true CSV.<p>The code isn&#x27;t particularly long (~900 lines), it&#x27;s Python (hence readable) and it&#x27;s well commented:<p><a href="https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/npyio.py#L1172" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;numpy&#x2F;numpy&#x2F;blob&#x2F;v1.8.1&#x2F;numpy&#x2F;lib&#x2F;npyio.p...</a>
评论 #7796878 未加载
winter_bluealmost 11 years ago
A good and <i>performant</i> alternative to CSVs are Google&#x27;s protocol buffers: <a href="https://code.google.com/p/protobuf/" rel="nofollow">https:&#x2F;&#x2F;code.google.com&#x2F;p&#x2F;protobuf&#x2F;</a>
iziettoalmost 11 years ago
But it is still by far the most readable text data format out there. Which is the reason for its wide adoption. I&#x27;ll be downvoted, but I really believe in this.
jstschalmost 11 years ago
So, which library? CSV is a mess.
评论 #7796727 未加载
评论 #7796322 未加载
评论 #7796907 未加载
评论 #7797614 未加载
评论 #7797615 未加载
NaNaNalmost 11 years ago
Why CSV is not just for readability? I think RFC is sometimes too pedantic, that it let CSV can handle both plain text and binary files. COMMA is not just a COMMA, but a COMMA not in different environments. Why should we use the phrase CSV or Comma Separated Values just for RFC?<p>CSV or Comma Separated Values are not only for RFC, but also for EVERYONE who wants to use this word or phrase. Pedantry sucks!
mrcozzalmost 11 years ago
We switched from a CSV based delivery to Apache Avro files. These are binary files which have the record schema embedded in the file header. We&#x27;re pretty happy with this solution for the time being and it seems to be an awesome alternative to CSV. I wonder if anyone else is doing something similar? Good article but I&#x27;d appreciate if the author gave some alternatives.
mschuster91almost 11 years ago
I usually take advantage of the fixed formats of the individual exporting tool. Everyone does it a bit different - so what? I have a php parser for it and adapt it for every of my clients. It&#x27;s cheaper to have a small parser, adapted for the customer&#x27;s needs, than having one 10k SLOC library to handle a boatload of files...
mantis369almost 11 years ago
CSV is really slow to work with, because you have to check for well-formedness, like you do with XML. And in the end, I always end up making specific concessions for the files that my customers use (which must be patched again and again) or having to take a hard stance on what can and can&#x27;t be in the &quot;CSV&quot; files.
aubergenealmost 11 years ago
Mike Bostock&#x27;s DSV library handles pretty much all of the cases listed for encoding and decoding. Written in JavaScript, in 116 lines.<p><a href="https://github.com/mbostock/dsv/blob/master/dsv.js" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;mbostock&#x2F;dsv&#x2F;blob&#x2F;master&#x2F;dsv.js</a>
justifieralmost 11 years ago
i recently needed to deal with a ~4G xml file.. i tried a parser but after waiting thirty minutes for it to load i decided to parse out the bits i needed manually with a bash script<p>knowing my needs i could easily account for all possible muck ups and avoid the instances where ambiguity could play a part<p>i was then able to use the bits i pulled out of the ~4G file, now 16M, in the parser with all of its assurances<p>sure, edge cases justify using a tried and true library for generics, but there are also edge cases that justify mocking up your own naive implementation.. if only, like in my case, to make the dada usable in such a library
minimaxiralmost 11 years ago
Most likely prompted by discussion on <a href="https://news.ycombinator.com/item?id=7794684" rel="nofollow">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=7794684</a>
kabdibalmost 11 years ago
CSV: Where the only way to win is not to play . . .
neoyagamialmost 11 years ago
This article represents all my feeligs when my boss says &quot; just write a csv parser for this, its just csv . So aint that hard&quot;
itamarhaberalmost 11 years ago
A non-standard standard is always a sure way to shoot yourself in the foot. Endianess also causes some confusion...
codingdavealmost 11 years ago
Its a flippin&#x27; CSV.<p>Of course you can come up with scenarios where it doesn&#x27;t work, but anyone who considers themselves to be a competent programmer should be able to deal with these issues, use another data format, or just talk to whomever is giving you the data to correct their data issues.<p>Seriously, The overhwleming CSV_bashing in these comment really makes me worry that coders just can&#x27;t handle the basics anymore.
评论 #7797913 未加载
epeusalmost 11 years ago
Never ever use csv to export. Use tab separated, as it takes work to type a tab in excel.
yp_allalmost 11 years ago
Post a sample .csv file you believe is too difficult.<p>I will solve your problem with only UNIX utilities. And I&#x27;m sure others will solve it other ways.<p>Usually I only need sed and tr. Sometimes lex or AWK.<p>Arguing about something without ever pointing to an example accomplishes nothing; it&#x27;s just whining.<p>Post an example.<p>Thank you.
评论 #7797870 未加载
mantrax5almost 11 years ago
Why are people using CSV when better (and less fuzzily defined) solutions exist, such as JSON?
评论 #7796374 未加载
评论 #7796471 未加载
评论 #7797079 未加载
评论 #7796534 未加载
评论 #7796342 未加载
评论 #7796341 未加载
评论 #7797536 未加载