Consider Using CSV

69 points by jfhr over 2 years ago

31 comments

Pinus over 2 years ago
CSV looks deceptively simple. It is far too easy to just write(','.join(whatever)), which sort of works, until it doesn’t, and then someone, sometimes I, has to sort out the resulting mess. PLEASE use a proper CSV library (Python comes with a CSV module in the standard library), or at least implement the entire format according to the RFC from the outset, even if you think you won’t need it!
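To make the failure mode concrete, a minimal Python sketch (field values invented for illustration):

    import csv
    import io

    row = ["8fe96b88", 'Widget, deluxe "XL"', 2]

    # Naive join: the embedded comma silently turns 3 columns into 4.
    print(",".join(str(field) for field in row))
    # 8fe96b88,Widget, deluxe "XL",2

    # csv.writer quotes and escapes per RFC 4180.
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    print(buf.getvalue())
    # 8fe96b88,"Widget, deluxe ""XL""",2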
ndsipa_pomu over 2 years ago
As much as I like and use CSV for database work, it has a problem with being poorly specified. The most common problems are when processing CSVs produced elsewhere which might not enclose text fields with quotes and thus have issues with data that includes commas and multi-line data.
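For illustration, a short Python sketch: a reader that implements RFC 4180 quoting copes with embedded commas and newlines, which is exactly what ad-hoc line splitting gets wrong (sample data invented):

    import csv
    import io

    data = 'id,note\r\n1,"hello, world"\r\n2,"line one\nline two"\r\n'

    # Hand-splitting on newlines and commas would mangle both rows;
    # csv.reader tracks quoting state across the embedded newline.
    for row in csv.reader(io.StringIO(data)):
        print(row)
    # ['id', 'note']
    # ['1', 'hello, world']
    # ['2', 'line one\nline two']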
majkinetor over 2 years ago
With gzip on the web server, the size difference is not important at all.

CSV in general is problematic as there is no real standard (RFC 4180 is not one in practice). In certain contexts this can surely be a good solution, but definitely not in the general scenario.
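The compression claim is easy to test; a rough sketch, with sample data invented for illustration:

    import gzip
    import json

    rows = [{"productId": "5710031efdfe", "quantity": i % 5, "customerId": "8fe96b88"}
            for i in range(100_000)]

    as_json = json.dumps(rows).encode()
    as_csv = ("productId,quantity,customerId\n" +
              "".join(f"{r['productId']},{r['quantity']},{r['customerId']}\n"
                      for r in rows)).encode()

    # Repeated JSON keys compress very well, so the size gap narrows after gzip.
    print(len(as_json), len(gzip.compress(as_json)))
    print(len(as_csv), len(gzip.compress(as_csv)))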
thangalin over 2 years ago
CSV is also great for importing external data into documents. My text editor, KeenWrite[0], includes an R engine and a CSV-to-Markdown function[1]. This means you can write the following in a plain text R Markdown document:

    `r#csv2md('filename.csv')`

The editor will convert Markdown to XHTML in the preview panel (in real time), then ConTeXt can typeset the XHTML into a PDF file in various styles.[2][3] This avoids spending time fighting with table formatting/consistency in certain word processors while storing the data in a machine-friendly format. (Thereby upholding the DRY principle because the data can have a single source of truth, as opposed to copying data into documents, which could go stale/diverge.)

Using JSON would be possible, but it's not as easy to convert into a Markdown table.

[0]: https://github.com/DaveJarvis/keenwrite

[1]: https://github.com/DaveJarvis/keenwrite/blob/main/R/csv.R#L35

[2]: https://i.ibb.co/6FLXKsD/keenwrite-csv.png

[3]: https://i.ibb.co/47h6zNx/keenwrite-table.png
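The CSV-to-Markdown step itself is small; here is a rough Python equivalent of the idea (not KeenWrite's actual R implementation):

    import csv

    def csv2md(path: str) -> str:
        # Render a CSV file as a Markdown pipe table.
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        header, body = rows[0], rows[1:]
        lines = ["| " + " | ".join(header) + " |",
                 "|" + "---|" * len(header)]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)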
gugagore over 2 years ago
The only reason, in my eyes, to use CSV is to have easy interoperability with spreadsheet software.

If you want streaming: https://jsonlines.org/
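A quick sketch of why JSON Lines streams well, in Python: each line is an independent document, so nothing ever holds the whole file in memory (the file name is hypothetical):

    import json

    def stream_records(path: str):
        # One JSON document per line; parse lazily, one record at a time.
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    # for record in stream_records("orders.jsonl"): ...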
elcritch over 2 years ago
Sometimes CSV is nicer. Still, you can cut down on your JSON by formatting it with a similar header style:

    [
      ["productId", "quantity", "customerId"],
      ["5710031efdfe", 1, "8fe96b88"],
      ["479cd9744e5c", 2, "526ba6f5"]
    ]

This style also works well with jsonlines, which a sibling comment mentioned. Of course my favorite is MessagePack (or CBOR) using similar styles. MsgPack can be as small as gzipped JSON. :)
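Restoring named fields from that header style is a one-liner per row; a sketch:

    import json

    payload = ('[["productId","quantity","customerId"],'
               '["5710031efdfe",1,"8fe96b88"],'
               '["479cd9744e5c",2,"526ba6f5"]]')

    header, *rows = json.loads(payload)
    records = [dict(zip(header, row)) for row in rows]
    # [{'productId': '5710031efdfe', 'quantity': 1, 'customerId': '8fe96b88'}, ...]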
account-5 over 2 years ago
I think one of the issues is data types. JSON has them, CSV doesn't, so your program needs to be aware of which columns are which data type and do the conversion where needed. It's similar to JSON vs INI files for config files.

On a different note, I wouldn't nest JSON in a CSV column. I'd delimit with a pipe or something, then split the string on that. Much simpler if you're in control of the data.
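In practice this means carrying a small schema alongside the CSV; a sketch of per-column converters in Python (column names invented):

    import csv
    import io

    converters = {"quantity": int, "price": float}  # the columns the program must know about

    data = "productId,quantity,price\n5710031efdfe,1,9.99\n"
    for row in csv.DictReader(io.StringIO(data)):
        typed = {k: converters.get(k, str)(v) for k, v in row.items()}
        print(typed)  # {'productId': '5710031efdfe', 'quantity': 1, 'price': 9.99}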
bufferoverflow over 2 years ago
The author didn't compare gzipped/brotlied sizes.

The author also didn't consider any examples with even a bit more complexity. If you have two-level object nesting, now what?
ARandomerDude over 2 years ago
> It's only 77 bytes, with 29 for the header and 24 for each line. At 100,000 entries, this list would be 2.4 MB (that's ~63% less than the JSON).

If size is really the issue but you still want schema enforcement, protobuf is the way to go.
xwowsersx over 2 years ago
I mean, point well taken, but, as they acknowledged in the post themselves, CSV isn't suitable when you have a nested structure. And you almost *always* have/need a nested structure, no?
YmiYugy over 2 years ago
I always thought CSV was just fine, until I had to ingest and export a bunch of CSV in my last project. The big problem is that CSV is not well defined, and it's so deceptively simple that many don't bother to adhere to the spec that does exist. Just a few idiosyncrasies I found: inconsistent character encoding (if you open or save a CSV with Excel, it will assume a Windows-1252 encoding, and since browsers deal exclusively with UTF-8, this gets really messy); the CSV I got didn't actually use a comma as a delimiter but a semicolon; everyone seems to have conflicting opinions about whether strings should have quotes and, if so, which ones; and the CSV I had to deal with also came with a decimal comma, which screwed up even more stuff. My advice: stay away from CSV as an exchange format. Use something that is well defined.
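A defensive-ingestion sketch in Python, assuming you must accept such files anyway: make the encoding explicit and sniff the delimiter rather than trusting defaults:

    import csv

    def read_messy_csv(path: str, encoding: str = "cp1252"):
        # Excel on Windows commonly writes cp1252; pass "utf-8-sig" to strip a BOM.
        with open(path, newline="", encoding=encoding) as f:
            sample = f.read(4096)
            f.seek(0)
            dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
            yield from csv.reader(f, dialect)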
pcthrowaway over 2 years ago
I'm definitely in the "just use JSON for most things" camp, but I'm wondering: why would you *ever* choose CSV for interfacing microservices over protobuf?

Isn't protobuf basically CSV but with good libraries at the interface point and standards around how to deserialize the streams?
panzerboiler over 2 years ago
I usually prefer a binary encoding. More efficient on the wire, easier to parse and generate, and with no ambiguity. We have two control codes given to us by the teletype era that have the perfect meaning for this kind of data:

    0x1E  Record Separator
    0x1F  Unit Separator
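A sketch of that encoding in Python; assuming the fields themselves never contain these control codes, no quoting or escaping is needed at all:

    RS, US = "\x1e", "\x1f"  # Record Separator, Unit Separator

    def encode(records):
        return RS.join(US.join(fields) for fields in records)

    def decode(blob):
        return [record.split(US) for record in blob.split(RS)]

    blob = encode([["5710031efdfe", "1"], ["479cd9744e5c", "2"]])
    assert decode(blob) == [["5710031efdfe", "1"], ["479cd9744e5c", "2"]]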
sitkack over 2 years ago
No one uses that format for streamed JSON; see ndjson and jsonl: http://ndjson.org/

The size complaint is overblown, as repeated fields are compressed away.

As other folks rightfully commented, CSV is a minefield. One should assume every CSV file is broken in some way. The author also doesn't enumerate any of the downsides of CSV.

What people *should* consider is using formats like Avro or Parquet that carry their schema with them, so the data can be loaded and analyzed without having to manually deal with column meaning.
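For the schema-carrying point, a minimal sketch using the pyarrow package (assuming it is installed; data invented):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # The schema travels with the file; readers get column names and types for free.
    table = pa.table({
        "productId": ["5710031efdfe", "479cd9744e5c"],
        "quantity": [1, 2],
        "customerId": ["8fe96b88", "526ba6f5"],
    })
    pq.write_table(table, "orders.parquet")
    print(pq.read_table("orders.parquet").schema)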
majkinetor over 2 years ago
Since this is about CSV, here is the obligatory tool for larger ones:

* https://github.com/antonycourtney/tad
SillyUsername over 2 years ago
Holy cow.

If somebody asked me to support this format after you'd left the company, I'd quit on the spot. This frankenformat is 100% premature optimization: non-standardised, requiring custom parsers (which are potentially inefficient and may negate the network performance gains, from having to parse both JSON and CSV), and potentially very difficult to maintain and debug (no syntax highlighters or REST-like posting tools).

Just use either gRPC or JSON with regular network-level gzip encoding.
beached_whale over 2 years ago
A constrained format based on JSONL, with each record being a tuple of number/string/bool/null, could be better defined than CSV and looks almost like it. The benefit is that almost any JSON library could work with it, or could be made to work one line at a time, and it can be parallelized because newlines only exist as the delimiter.

    ["hello",5,false,1,2,2.334,null]
    ["world",12,true,1,2,2.334,null]
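A sketch of reading that constrained form with a stock JSON library, as the comment suggests:

    import json

    lines = ['["hello",5,false,1,2,2.334,null]',
             '["world",12,true,1,2,2.334,null]']

    # Each line parses independently, so batches of lines can be handed to workers.
    records = [json.loads(line) for line in lines]
    assert records[0][0] == "hello" and records[1][2] is True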
albertopv over 2 years ago
What else do you use if you have to import millions of rows from a client or supplier with no direct integration, only SFTP?
spentu over 2 years ago
I cannot count how many times the CSV "format" has caused problems for me.

In my country the decimal separator is a comma instead of a period. This causes problems when importing and exporting with this "format".

Just a few weeks ago I had fun times working with an API returning CSV in an unknown encoding. Hopefully they will never make changes (you cannot always trust headers). Ah, and I do love when a CSV is missing headers and someone adds data into the middle.

Of course some of these issues can be avoided by doing things "right". Sadly you cannot trust that in real life. People write ugly structures in JSON, but at least you can validate the results.
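For what it's worth, a small Python sketch of normalizing a decimal-comma field, assuming "." only ever appears as a thousands separator in the input:

    def parse_decimal_comma(s: str) -> float:
        # "1.234,56" -> 1234.56: drop thousands separators, then swap the comma.
        return float(s.replace(".", "").replace(",", "."))

    assert parse_decimal_comma("1.234,56") == 1234.56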
WirelessGigabit over 2 years ago
No. Just no. The number of times I've had issues with CSVs exported from a non-US locale is insane. They use the semicolon as separator because, for some weird reason, they use the comma as the decimal point.

Then there's the issue of encoding, as that is also not the same across locales. Then you get a CSV with BOM characters up front, or some French accents represented as ? because of incorrect encoding parsing/saving.

At least JSON doesn't have any of these problems: standardized strings and a standardized number format.
sheeeep86 over 2 years ago
You could have the advantages of both worlds by having one json object per line. You could stream process, and you could structure more complex objects and have consistent escaping.
whateveracct over 2 years ago
I quite like CSVs. I've used them to great effect at maybe every job I've ever had. xsv, sqlite, and Excel/LibreOffice provide useful tooling on top of them.

I see a lot of complaining about "no standard" in this thread, but the way I've used them, it's been fine. I just use Haskell's cassava. If humans produce them with Excel/LibreOffice, I never have issues on the ingestion end.
slotrans over 2 years ago
Please don&#x27;t. CSV is one of the worst file formats ever conceived. Use (compressed) line-delimited JSON if you need a file of records.
cpeterso over 2 years ago
Another alternative is a streaming JSON format like JSONL (newline-delimited JSON). You can parse one record/line at a time, but still have the structure and named fields of JSON.

https://en.m.wikipedia.org/wiki/JSON_streaming
brundolf over 2 years ago
I worked at a company where we did this for some endpoints and it worked great. Our client app had to request enormous time-series datasets and using CSV cut a significant percentage off of the payload size. I recommend it if you have similar constraints
nathants over 2 years ago
i had a lot of fun exploring the performance ceiling of csv and csv-like formats. turns out binary encoding of size-prefixed byte arrays is fast[1].

csv is just a sequence of 2d byte arrays. probably avoid if dealing with heterogeneous external data. possibly use if dealing with homogeneous internal data.

1. https://github.com/nathants/bsv/tree/55c90797283f5e37f91bbb6cdf60f0f187a33302/experiments
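A minimal sketch of the size-prefixed idea (not bsv's actual wire format): each field is a length header followed by raw bytes, so parsing never scans for delimiters and never needs escaping:

    import struct

    def write_row(fields: list) -> bytes:
        out = struct.pack("<H", len(fields))          # field count
        for f in fields:
            out += struct.pack("<I", len(f)) + f      # length prefix, then raw bytes
        return out

    def read_row(buf: bytes) -> list:
        (n,), pos, fields = struct.unpack_from("<H", buf), 2, []
        for _ in range(n):
            (size,) = struct.unpack_from("<I", buf, pos)
            fields.append(buf[pos + 4 : pos + 4 + size])
            pos += 4 + size
        return fields

    assert read_row(write_row([b"a,b", b"line\ntwo"])) == [b"a,b", b"line\ntwo"]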
fellowniusmonk over 2 years ago
Delimited formats' performance can be exceptional, and they can also be phenomenally terse and avoid the string tarpits of CSV and TSV if you just use these Unicode characters:

U+241D, U+241E, U+241F
pkstn over 2 years ago
Use gzip for compressing. If you want to stream, use the following syntax:

    [
      { ... },
      { ... },
      { ... },
      ...
    ]

With this simple trick (one element per line), you can stream easily.
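On the consuming side, a sketch that leans on exactly that one-element-per-line convention (it would break on arbitrary JSON, which is the trade-off):

    import json

    def stream_array_lines(fileobj):
        # Expects the producer's layout above: "[", then "{...}," per line, then "]".
        for line in fileobj:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue
            yield json.loads(line)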
revskill over 2 years ago
Sure! For example, for batch processing, CSV is always the default for me and the teams.
margarina72 over 2 years ago
You may also simply add a format specification and return either CSV or JSON depending on the need or the context. Most languages have what's needed to return either without much trouble.
dsmmcken over 2 years ago
You could also consider Kafka for streaming, and Parquet for batch.