This article seems written by someone who has never had to work with diverse data pipelines.<p>I work with large volumes of data from many different sources. I'm lucky to get them to send CSV at all. Of course there are better formats, but all these sources can't agree on one.<p>CSV that's zipped is producible and readable by everyone. And that makes it more efficient.<p>I've been reading these "everyone is stupid, why don't they just do the simple, right thing and I don't understand the real reason for success" articles for so long that it just makes me think the author doesn't have a mentor or an editor with deep experience.<p>It's like arguing how much MP3 sucks and how we should all just use FLAC.<p>The author means well, I'm sure. Maybe his next article will be about how airlines should speak Esperanto because English is such a flawed language. That's a clever and unique observation.
"You give up human readable files, but what you gain in return is..."
Stop right there. You lose more than you gain.<p>Plus, taking the data out of [proprietary software app my client's data is in] in csv is usually easy. Taking the data out in Apache Parquet is...usually impossible, but if it is possible at all you'll need to write the code for it.<p>Loading the data into [proprietary software app my client wants data put into] using a csv is usually already a feature it has. If it doesn't, I can manipulate csv to put it into their import format with any language's basic tools.<p>And if it doesn't work, I can look at the csv myself, because it's human readable, to see what the problem is.<p>90% of real world coding is taking data from a source you don't control, and somehow getting it to a destination you don't control, possibly doing things with it along the way. Your choices are usually csv, xlsx, json, or [shudder] xml. Looking at the pros and cons of those is a reasonable discussion to have.
As a French person, I see another problem with CSV.<p>In the French locale, the decimal separator is the comma, so "121.5" is written "121,5". That means, of course, that the comma can't be used as a field separator, so the semicolon is used instead.<p>So depending on whether the tool that exports the CSV is localized, you get commas or you get semicolons. If you are lucky, the tool that imports it speaks the same language. If you are unlucky, it doesn't, but you can still convert the file. If you are really unlucky, you get commas both as decimal separators and as field separators, making the file completely unusable.<p>There is a CSV standard, RFC 4180, but no one seems to care.
Of course if you only consider the disadvantages, something looks bad.<p>The advantages of CSV are pretty massive though - if you support CSV you support import and export into a massive variety of business tools, and there is probably some form of OOTB support.
The reason CSV is popular is that it is (1) super simple, and (2) the simplicity leads to ubiquity. It is extremely easy to add CSV export and import capability to a data tool, and that has come to mean that there are no data tools that don't support CSV.<p>Parquet is the opposite of simple. Even when good libraries are available (which they usually aren't), it is painful to read a Parquet file. Try reading a Parquet file using Java and the Apache Parquet lib, for example.<p>Avro is similar. Last I checked there were two Avro libs for C# and each had its own issues.<p>Until there is a simple format with ubiquitous libs in every language, CSV will continue to be the best format despite the issues caused by under-specification. Google Protobuf is a lot closer than Parquet or Avro. But Protobuf is not a splittable format, which means it is not Big Data friendly, unlike Parquet, Avro and CSV.
Author here. I see now that the title is too controversial, I should have toned that down. As I mention in the conclusion, if you're giving parquet files to your user and all they want to know is how to turn it into Excel/CSV, you should just give them Excel/CSV. It is, after all, what end users often want. I'm going to edit the intro to make the same point there.<p>If you're exporting files for machine consumption, please consider using something more robust than CSV.
Or export to CSV correctly and test with Excel and/or LibreOffice. Honestly, CSV is a very simple, well-defined format that is decades old and "obvious". Over the years I've had far more trouble with various export-to-Excel functions, which have much more complex third-party dependencies. Parsing CSV correctly is not hard, you just can't use split and be done with it. This has been my coding kata in every programming language I've touched since I was a teenager learning to code.
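To make the "you can't just use split" point concrete, here is roughly what the kata boils down to with Python's stdlib; quoted fields with embedded commas and doubled quotes are the usual trap (the sample data is made up):

<pre><code>import csv
import io

# A field with an embedded comma and a doubled quote: naive splitting breaks here.
raw = 'name,comment\n"Smith, Jane","She said ""hi"""\n'

# Naive approach: wrong field count on the data row.
print(raw.splitlines()[1].split(","))  # ['"Smith', ' Jane"', '"She said ""hi"""']

# csv module: handles quoting, embedded commas, and doubled quotes correctly.
for row in csv.reader(io.StringIO(raw)):
    print(row)  # ['name', 'comment'] then ['Smith, Jane', 'She said "hi"']
</code></pre>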
CSV is very durable. If I want it read in 20 years, csv is the way to go until it’s just too big to matter.<p>Of course there are better formats. But for many use cases friends encourage friends to export to CSV.
Friends don't let friends export to CSV -- in the data science field.<p>But outside the data science field, my experience working in software these years is that it won't matter how beautiful your backoffice dashboards and web apps are: many non-technical business users will at some point demand CSV import and/or export capabilities, because it is easier for them to just dump all the data in a system into Excel/Sheets to make reports, or to bulk edit the data via the export-Excel-import route, rather than dealing with the navigation model and maybe tens of browser tabs in your app.
CSV wins because it's universal and very simple.
With an editor like Notepad++ and the CSV plugin, reformatting, like changing the date format, is very easy, and you even get colored columns.
It's strange to me that people complain about some variety in CSV files while acting as if parquet was one specific file format that's set in stone.
They can't even decide which features are core, and the file format has many massive backwards-incompatible changes already. If you give me a parquet file I cannot guarantee that I can read it, and if I produce one I cannot guarantee that you can.<p>I treat formats such as parquet as I generally do: I try to allow various different inputs, and produce standard outputs. Parquet is something I allow purely as an optimization. CSV is the common default all of my tools have (UTF-8 without BOM, international locale, comma separator, quoting at the start of the value optional, standards-compliant date format or unix timestamps). Users generally don't have any issue with adapting their files to that format if there's any difference.
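For what it's worth, that default output is trivial to produce from Python's stdlib; a rough sketch (column names are made up): UTF-8 without BOM, comma separator, minimal quoting, ISO 8601 timestamps.

<pre><code>import csv
from datetime import datetime, timezone

rows = [
    {"id": 1, "name": "Ada, Lovelace",
     "created_at": datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)},
]

# UTF-8 without BOM, comma separator, quoting only where needed, ISO 8601 timestamps.
with open("export.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "created_at"],
                            quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    for row in rows:
        row["created_at"] = row["created_at"].isoformat()  # e.g. 2024-01-02T03:04:05+00:00
        writer.writerow(row)
</code></pre>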
> <i>One of the infurating things about the format is that things often break in ways that tools can't pick up and tell you about</i><p>This line is emblematic of the paradigm shift LLMs have brought. It’s now easier to build a better tool than change everyone’s behaviour.<p>> <i>You give up human readable files, but</i><p>What are we even doing here.
Every single use I've ever seen of CSV would be improved by the very simple change to TSV.<p>Even Excel can handle it.<p>It is far safer to munge data containing tabs (convert to spaces, etc.) than data containing commas (remove? convert to dots? escape?).<p>The better answer is to use ASCII separators as Lyndon Johnson intended, but that turns out to be asking a lot of data producers. Generating TSV is usually easier than generating CSV.
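Both variants are a one-argument change if you're already generating the files programmatically; a quick Python sketch of the writing side (readers still need to agree on the separators):

<pre><code>import csv

rows = [["id", "note"], ["1", "contains, commas"]]

# TSV: same csv module, different delimiter.
with open("export.tsv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# ASCII separators: unit separator (0x1F) between fields, record separator (0x1E) between rows.
with open("export.asv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\x1f", lineterminator="\x1e").writerows(rows)
</code></pre>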
My takeaway is that csv has some undefined behaviours, and it takes up space.<p>I like that everyone knows about .csv files, and it's also completely human readable.<p>So for <100mb I would still use csv.
1. CSV is for ensuring compatibility with the widest range of consumers, not for ensuring best read or storage performance for consumers. (It <i>is</i> already more efficient than JSON because it can be streamed, and takes up less space than a JSON array of objects)<p>2. The only data type in CSV is a string. There is no null, there are no numbers. Anything else must be agreed upon between producer and consumer (or more commonly, a consumer looks at the CSV and decides how the producer formatted it). JSON also doesn’t include dates, you’re not going to see people start sending API responses as Apache Parquet. CSV is fiiine.
CSV has some limits and difficulties, but has massive benefits in terms of readability, portability, etc.<p>I feel like USV (Unicode Separated Values) neatly improves CSV while maintaining most of its benefits.<p><a href="https://github.com/sixarm/usv">https://github.com/sixarm/usv</a>
The poor performance argument is not true even for the Python ecosystem that the author discusses. Try saving geospatial data as GeoPackage, GeoJSON, or FlatGeobuf: they all save more slowly than plain CSV (the only inconvenience is that you must convert geometries into WKT strings). GeoPackage was "the Format of the Future" 8 years ago, but it's utterly slow when saving, because it's an SQLite database and indexes all the data.<p>Files in .csv.gz are more compact than anything else, unless you work in some very, very specific field with very compressible data. As far as I remember, Parquet files are larger than CSV with the same data.<p>Working with the same kind of data in Rust, I see everything saved and loaded in CSV lightning fast. The only thing you may miss is indexing.<p>Saving to binary, by contrast, is notably slower, and data in a generic binary format becomes LARGER than in CSV. (Maybe if you define your own format and write a driver for it you'll be faster, but that means no interoperability at all.)
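Roughly what that WKT round-trip looks like from Python, assuming geopandas/shapely are installed (file names are hypothetical):

<pre><code>import pandas as pd
import geopandas as gpd

gdf = gpd.read_file("parcels.gpkg")  # hypothetical GeoPackage source

# Convert geometries to WKT strings and write a gzipped CSV
# (pandas infers gzip compression from the .gz extension).
df = pd.DataFrame(gdf.drop(columns="geometry"))
df["geometry_wkt"] = gdf.geometry.to_wkt()
df.to_csv("parcels.csv.gz", index=False)
</code></pre>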
I think the sad reality there is that it's become "the" format that users expect, and more importantly, it's what's integrated into the majority of peripheral services and tools.<p>Like JSON.
Remembering all the cases where I needed to export to CSV – 99% are relatively small datasets, so marginal gains of a few milliseconds on import aren't worth sacrificing convenience. And sometimes you just get data from a gazillion diverse sources, and CSV is the only option available.<p>I suspect that not everybody here works exclusively with huge datasets and well-defined data pipelines.<p>On the practical side, if I want to follow the suggestions, how do I export to Avro from Numbers/Excel/Google Sheets?
I never liked articles about how you should replace CSV with some other format while pulling some absolutely idiotic reasons out of their rear...<p>1. CSV is underspecified
Okay, so specify it for your use case and you're done?
E.g. use RFC 3339 instead of the straw-man 1-1-1970, and define what a missing value looks like, which is usually just an empty string (a minimal sketch of such a convention follows at the end of this comment).<p>2. CSV files have terrible compression and performance
Okay, who in their right mind uses a plain text file to export 50 GB of data? Some file systems don't even support files that large. When you are at the stage of REGULARLY shipping around files this big, you should think about a database and not another file type to send via mail.
Performance may be a point, but again, using it for gigantic files is wrong in the first place.<p>3. There's a better way (insert presentation of a file type I have never heard of)
There are lots of better ways to do this, but:
CSV is extremely quick to implement, it is universally known unlike Apache Parquet (or Pickle or ORC or Avro or Feather...), and it is human readable.<p>So in the end: use it for small data exports where you can specify everything you want, or basically everywhere you need to import data, because most software takes CSV as input anyway.<p>For lots of data use something else.<p>Friends don't let friends write one-sided articles.
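To illustrate point 1, a minimal sketch of the kind of per-use-case convention I mean: RFC 3339 timestamps, empty string for missing values (the column names and records are made up):

<pre><code>import csv
from datetime import datetime, timezone

def fmt(value):
    """Apply the agreed convention: RFC 3339 for timestamps, empty string for missing."""
    if value is None:
        return ""
    if isinstance(value, datetime):
        return value.astimezone(timezone.utc).isoformat()
    return value

records = [
    {"user": "alice", "last_login": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)},
    {"user": "bob", "last_login": None},
]

with open("logins.csv", "w", encoding="utf-8", newline="") as f:
    w = csv.writer(f)
    w.writerow(["user", "last_login"])
    for r in records:
        w.writerow([fmt(r["user"]), fmt(r["last_login"])])
</code></pre>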
The reason why USV did not use the proper ASCII codes for field separator and record separator is a bit too pragmatic for me…<p><a href="https://github.com/SixArm/usv/tree/main/doc/faq#why-use-control-picture-characters-rather-than-the-control-characters-themselves">https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...</a>
An article promoting parquet over CSV. Fair enough, but parquet has been around for a while and still no support in Debian. Is there some deep and dark reason why?
The problem, as always, is that you deal with multiple data sources - which you can not control the format of. I work as a data analyst, and in my day-to-day work I collect data from around 10 different sources. It's a mix of csv, json, text, and what not.<p>Nor can you control the format others want. The reason I have to export to csv, is unfortunately because the people I ship out to use excel for everything - and even though excel does support many different data formats, they either enjoy using .csv (should be mentioned that the import feature in excel works pretty damn well), or have some system written in VBA that parses .csv files.
As a data architect in a big company, I cannot tell you how harmful a data format as stupid as CSV can be.
All the possible semantics of the data have to be offloaded either to people's brains [don't do that! Just don't!], or to out-of-sync specs [better hidden in the company's CMS than the Ark of the Covenant, and outdated anyway], or to obscure code or SQL queries [an opportunity for hilarious reverse-engineering sessions, where you hate forever a retired developer for all the tricks he added to the code to work around poorly defined data, and who got away to a Florida beach after hiring you.]
I like the ping pong of one day an article being posted where everyone asks, "when/why did everything become so complicated", and then the next day something like this is posted.
I wouldn't say the article proposes a better way; it proposes a more complex one.<p>Nothing beats CSV in terms of simplicity, minimal friction, and ease of exploration across diverse teams.
"There's a better way" - "just" write your application in Java or Python, import Thrift, zstandard and boost, do some compiling - and presto, you can now export a very complicated file format you didn't really need which you hope your users (who all undoubtedly have Java and Python and Thrift and whatnot) will be able to read.<p>CSV does not deserve the hate.
CSV is a superb, incredibly useful data format... but not perfect or complete.<p>Instead of breaking CSV by adding to it, I recommend augmenting it:<p>It would be useful to have a good standardized/canonical JSON format for things like encoding, delimiter, schema and metadata, to accompany a zipped CSV file, perhaps packaged in the same archive.<p>Gradually datasets would become more self-documenting and machine-usable without wrangling.
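Something like the following, say; this is a made-up sidecar layout (loosely in the spirit of things like CSVW or Frictionless data packages), not an existing standard:

<pre><code>import json

# Hypothetical sidecar describing the accompanying data.csv.gz; the field names are made up.
metadata = {
    "file": "data.csv.gz",
    "encoding": "utf-8",
    "delimiter": ",",
    "header": True,
    "columns": [
        {"name": "id", "type": "integer"},
        {"name": "amount", "type": "decimal"},
        {"name": "created_at", "type": "datetime", "format": "rfc3339"},
    ],
}

with open("data.csv-meta.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
</code></pre>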
“the use case where people often reach for CSV, parquet is easily my favorite”<p>My use case is that other people can’t or won’t read anything but plain text.
Schemas are overrated. Often the source-system can't be trusted so you need to check everything anyway or you'll have random strings in your data. Immature languages/libraries often do dumb stuff like throwing away the timezone before adjusting it to UTC. They might not support certain parquet types (e.g. an interval).<p>Like I've recently found it much easier to deal with schema evolution in pyspark with a lot of historical CSVs than historical parquets. This is essentially a pyspark problem, but if everything works worse with your data format then maybe it's the format that's the problem. CSV parsing is always and everywhere easy, easier than the problems parquets often throw up.<p>The only time I'd recommend parquet is if you're setting up a pipeline with file transfer and you control both ends... but that's the easiest possible situation to be in; if your solution only works when it's a very easy problem then it's not a good solution.
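For example, with historical CSVs in pyspark I can just pin the schema myself instead of letting each file infer its own; a rough sketch, with hypothetical path and column names:

<pre><code>from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Pin the schema explicitly instead of inferring it from each historical file.
schema = StructType([
    StructField("event_id", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("occurred_at", TimestampType(), True),
])

df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("s3://bucket/historical/*.csv"))  # hypothetical path
</code></pre>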
CSV is totally fine if you use it for the right kind of data and the right application. That means:<p>- data that has predictable value types (mostly numbers and short labels would be fine), e.g. health data about a school class wouldn't involve random binary fields or unbounded user input<p>- data that has a predictable, managable length — e.g. the health data of the school class wouldn't be dramatically longer than the number of students in that class<p>- data with a long sampling period. If you read that dataset once a week performance and latency become utterly irrelevant<p>- if the shape of your data is already tabular and not e.g. a graph with many references to other rows<p>- if the gain in human readability and compatibility for the layperson outweighs potential downsides about the format<p>- if you use a sane default for encoding (utf8, what else), quoting, escaping, delimiter etc.<p>Every file format is a choice, often CSV isn't the wrong one (but: very often it is).
I have all SQL exported to CSV and committed to git once a day (no, I don't think this is the same as WAL/replication).<p>Dumping to CSV is built into MySQL and Postgres (though MySQL has better support), is faster on export and much faster on import, doesn't fill up the file with all sorts of unneeded text, can be diffed (and triangulated by git) line by line, is human readable (eg. grepping the CSV file) and overall makes for a better solution than mysqldumping INSERTs.<p>In Docker, I can import millions of rows in ~3 minutes using CSV; far better than anything else I tried when I need to mock the whole DB.<p>I realize that the OP is more talking about using CSV as a interchange format or compressed storage, but still would love to hear from others if my love of CSV is misplaced :)
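For anyone curious, the Postgres side of that dump can be scripted as well; a rough sketch assuming psycopg2 is installed, with a hypothetical connection string and table name (MySQL has its own SELECT ... INTO OUTFILE route):

<pre><code>import psycopg2

# COPY ... TO STDOUT streams the table out as CSV; copy_expert writes it to the file object.
conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur, open("orders.csv", "w", encoding="utf-8") as f:
    cur.copy_expert("COPY orders TO STDOUT WITH (FORMAT csv, HEADER true)", f)
</code></pre>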
I tend to prefer line-delimited JSON myself, even if it's got redundant information. It will gzip pretty well if you want to use less storage space.<p>Either that, or use the ASCII codes for field and row delimiters in a UTF-8 file without a BOM.<p>Even then you're still stuck with data encoding issues around numbers and booleans. And that doesn't even cover all the holes I've seen in CSV in real-world use by banks and govt agencies over the years.<p>When I've had to deal with varying imports, I push for a scripted (JS/TS or Python) preprocessor that takes the vendor/client format and normalizes it to line-delimited JSON, and then that output gets imported.
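The preprocessor usually ends up being something as small as this sketch (the vendor file and the cleanup rules here are hypothetical):

<pre><code>import csv
import json

def normalize(row):
    # Per-vendor cleanup goes here; this sketch just trims whitespace and maps empty strings to None.
    return {k.strip(): (v.strip() or None) for k, v in row.items() if v is not None}

with open("vendor_export.csv", encoding="utf-8", newline="") as src, \
     open("normalized.ndjson", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        dst.write(json.dumps(normalize(row), ensure_ascii=False) + "\n")
</code></pre>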
It's far easier than trying to create a flexible importer application.<p>Edit: I've also advocated for using SQLite3 files for import, export and archival work.
SQL and XML have schemas, and they're to a large extent human readable, even to people who aren't developers. If storage is cheap, compression isn't very important.<p>I've never come across this Parquet format; is it greppable? Gzipped CSV is. Can a regular bean counter import Parquet into their spreadsheet software? A cursory web search indicates they can't without having a chat with IT, and SQL might be easier while XML seems pretty straightforward.<p>Yes, CSV is kind of brittle, because the peculiarities of a specific source are like an informal schema, but someone versed in whatever programming language makes this Parquet convenient won't have much trouble figuring out a CSV.
Use the right data format for the right data. CSV can be imported into basically any spreadsheet, which can make it appealing, but that doesn't mean it's always a good option.<p>If you want CSV, consider a normalization step. For instance, make sure numbers have no thousands-separator commas and use a "." decimal point. Probably quote all strings. Ensure you have a header row.<p>Probably don't reach for CSV if:<p>- You have long text blobs with special characters (e.g. quotes, newlines, etc.)<p>- You can't normalize the data for some reason (e.g. some columns have formulas instead of specific data)<p>- You know that every user will always convert it to another format or import it
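That normalization step is cheap to do with the stdlib; a sketch with made-up sample columns: header row, plain "." decimals, every string field quoted.

<pre><code>import csv

rows = [
    {"product": "Widget, large", "price": 1299.5, "qty": 3},
    {"product": 'Gadget "Pro"', "price": 49.0, "qty": 12},
]

with open("normalized.csv", "w", encoding="utf-8", newline="") as f:
    # QUOTE_NONNUMERIC quotes every non-numeric field; floats are written with a "." decimal point.
    w = csv.DictWriter(f, fieldnames=["product", "price", "qty"], quoting=csv.QUOTE_NONNUMERIC)
    w.writeheader()
    w.writerows(rows)
</code></pre>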
I've written CSV exports in C from scratch, no external dependencies required.<p>It's "Comma-Separated Values"; it doesn't really need much more specification than that.<p>These files have always imported into the M$ and Libre office suites without issue.
Here is a crazy idea: CSV itself is ambiguous, but as a convention we could encode the options in the file name. E.g. data.uchq.csv means UTF-8, comma-separated, with header, quoted.
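A toy sketch of decoding that naming scheme; this is just the idea above, not an existing convention, and the flag letters are made up to match the example:

<pre><code>import csv

def dialect_from_name(filename):
    """Toy decoder for the proposed naming scheme, e.g. data.uchq.csv -> reader options."""
    flags = filename.split(".")[-2]  # "uchq"
    return {
        "encoding": "utf-8" if "u" in flags else "latin-1",
        "delimiter": "," if "c" in flags else ";",
        "has_header": "h" in flags,
        "quoting": csv.QUOTE_ALL if "q" in flags else csv.QUOTE_MINIMAL,
    }

opts = dialect_from_name("data.uchq.csv")
with open("data.uchq.csv", encoding=opts["encoding"], newline="") as f:
    reader = csv.reader(f, delimiter=opts["delimiter"])
    header = next(reader) if opts["has_header"] else None
    rows = list(reader)
</code></pre>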
What's the best way to expose random CSV/.xlsx files for future joins etc? We're house hunting and it would be nice have a local db to keep track of price changes, asking prices, photos, etc. And look up (local) municipal OpenData for an address and grab the lot size, zoning, etc. I'm using Airtable and sometimes Excel, but it would be nice to have a home (hobby) setup for storing queryable data.
One particularly memorable on-call shift had a phenomenal amount of pain caused by the use of CSV somewhere along the line, and a developer who decided to put an entry "I wonder, what happens if I put in a comma", or something similar. That single comma caused hours of pain. Quite why they thought production was the place to test that, when they knew the data would end up in CSV, is anybody's guess.<p>I think Hanlon's razor applies in that situation.
xsv makes dealing with csv miles easier: <a href="https://github.com/BurntSushi/xsv">https://github.com/BurntSushi/xsv</a>
Not sure I correctly understand what this article is about.
From my point of view, CSV is an easy way to export data from a system to allow an end user to import it in excel and work on this data.
Unless Parquet is as easy to import into Excel as CSV, I suspect this is fixing a problem that doesn't exist, and making things more complicated.<p>Outside the context of end users, I don't see any advantage in this compared to an XML or JSON export.
> You give up human readable files, but what you gain in return is incredibly valuable<p>Not as valuable as human-readable files.<p>And what kind of monstrous CSV files has this dude been working with? Data types? Compression? I just need to export 10,000 names/emails/whatevers so I can re-import them elsewhere.<p>Like, I guess once you start hitting GBs, an argument can be made, but this article sounds more like "CSV considered harmful", which is just silly to me.
Ok, funny guy. Tell that to all the wholesale providers who use software from 2005, or at least it feels that way.<p>No query params in their single endpoint, and only CSV exports possible.<p>Then add to that that Shopify, apparently the leader or whatever in shopping software, can't do better than requiring exactly the format they specify; don't you dare come in with configurable fields or mapping.<p>The industry is stuck in the 00s, if not the 90s.
I'd rather work with someone that prefers a format, but doesn't write articles like this. It's fine to "prefer" parquet, but CSV is totally fine - whatever works mate.<p>When you hit the inevitable "friends don't let friends" or "considered harmful" type of people, it's time to move quickly past them and let the actual situation dictate the best solution.
"I'm a big fan of Apache Parquet as a good default. You give up human readable files, but..."<p>Lost me right there. It has to be human readable.
I still don’t understand how you deal with cardinalities in a CSV.
Do you always recreate an object model on top of it to deal with them properly?<p>Cf. a tweet I wrote in one of my past lives: <a href="https://x.com/datao/status/1572226408113389569?s=20" rel="nofollow">https://x.com/datao/status/1572226408113389569?s=20</a>
It is weird to say both that "CSV files have terrible compression" and then that the proposed format, Apache Parquet, has "really good compression properties, competitive with .csv.gz". I think what's meant here is that CSV compresses really well but you lose the ability to "seek" inside the file.
I finally set aside my laziness and started a thread on r/vim for the .usv project:<p><a href="https://www.reddit.com/r/vim/comments/1bo41wk/entering_and_displaying_ascii_separators_in_vim/" rel="nofollow">https://www.reddit.com/r/vim/comments/1bo41wk/entering_and_d...</a>?
I’ve always liked CSV. It’s a streaming friendly format so:<p>- the sender can produce it incrementally<p>- the receiver can begin processing it as soon as the first byte arrives (or, more roughly, unescaped newline)<p>- gzip compression works without breaking the streaming nature<p>Yeah, it’s a flawed interchange format. But in a closed system over HTTP it’s brilliant.
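Concretely, the streaming property looks like this in Python: rows come out of a gzipped CSV one at a time, without decompressing the whole file first (the file name and the per-row handler are hypothetical):

<pre><code>import csv
import gzip

def handle(row):
    ...  # hypothetical per-row processing; runs as soon as each row is parsed

with gzip.open("events.csv.gz", mode="rt", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):  # rows are decompressed and parsed incrementally
        handle(row)
</code></pre>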
Ubiquity has a quality all of its own.<p>Yes CSV is a pain in many regards, but many of the difficulties with it arise from the fact that anybody can produce it with very little tool support - which is also the reason it is so widely used.<p>Recommending a decidedly niche format as an alternative is not going anywhere.
>Numerical columns may also be ambigious, there's no way to know if you can read a numerical column into an integral data type, or if you need to reach for a float without first reading all the records.<p>Most of the time you know the source pretty well and can simply ask about the value range.
There is CSVY, which lets you set a delimiter, schema, column types, etc. and has libraries in many languages and is natively supported in R.<p>Also is backwards-compatible with most CSV parsers.<p><a href="https://github.com/leeper/csvy">https://github.com/leeper/csvy</a>
I gave up at "You give up human readable files". While I recognize in some cases these recommendations may make sense/CSV may not be ideal, the idea of a CSV _export_ is generally that it could need to be reviewed by a human.
If you ever need to parse CSV <i>really fast</i> and happen to know C#, there is an incredible vectorized parser for that: <a href="https://github.com/nietras/Sep/">https://github.com/nietras/Sep/</a>
This is the level of discourse among Norwegian graduates. Half of them are taught to worship low level, the other half has framework diabetes.<p>Don't come here to work if you don't want to drown in nitpicking and meaningless debates like this.
OpenRefine has saved my bacon more times than I care to admit. It ingests everything and has powerful exporting tools. Friends give friends CSV files, and also tell them about tools that help them deal with the wide array of crap formats.
Using Parquet in Python requires installing pyarrow and numpy, whereas CSV support comes with the stdlib.<p>Also, the csv module has a very Pythonic interface compared to Parquet; in most cases, if I can fit the file in memory, I'll go with CSV.
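The difference in practice, roughly; the Parquet half assumes a reasonably recent pyarrow is installed, and the file names are made up:

<pre><code># Stdlib only: no extra installs.
import csv

with open("data.csv", encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))

# Parquet: needs `pip install pyarrow` (which pulls in its own binary wheel).
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")
rows_from_parquet = table.to_pylist()  # list of dicts, one per row
</code></pre>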
The author seems to be missing the point of CSVs entirely. I looked him up expecting a fresh college grad, but am surprised to see he's probably in his early 30s. Seems to be in a dev bubble that doesn't actually work with users.<p>Try telling 45 year old salesman he needs to export his data in parquet. "Why would I need to translate it to French??"<p>I feel like I'm pretty up to date on stuff, and I've never heard of parquet or seen in as an option, in any software, ever.
I wish I could get Excel to stop converting Product UPCs to scientific notation when opening CSVs.<p>Also some UPCs start with 0<p>Worst is when Excel saves the scientific notation back to the CSV, overwriting the correct number.
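This doesn't fix Excel itself (there the only reliable route I know is the text-import wizard / Power Query, marking the column as Text before it is parsed), but whenever the file passes back through code, forcing the column to text keeps the leading zeros; a sketch assuming pandas and a hypothetical column name:

<pre><code>import pandas as pd

# Read the UPC column as text so leading zeros and long digit strings survive.
df = pd.read_csv("products.csv", dtype={"upc": str})
</code></pre>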
CSVs won the war. No vendor lock in and very portable.<p>I wish TSVs were more popular though. Tabs appear less frequently than commas in data.<p>My biggest recommendation is to avoid Excel! It will mangle your data if you let it.
I use JSON for import/export of user data (in my super app collAnon); it's more predictable, and the tooling around it for transforming into any other format (even CSV) is underappreciated, imo.
Parquet is a columnar format. Which might be what you want, but it also might not, like if you want to process one row at a time in a stream. Maybe avro would be a better format in that case?
Trying to do business without using CSV is like trying to weld without using a torch. Might be possible but you aren't likely to have success at it.
Excel 2021:
the "a spreadsheet is all it needs" file is not usable, because Excel is not able to translate the "LC references that are inside brackets" into other languages.
there needs to be some pandoc (panbin?) for binary formats to convert between parquet, hdf5, fits, netcdf, grib, ROOT, sqlite, etc. (Ok these are not all equivalent in capability...).
Tell me you've never worked a real job without telling me. This is a technologists solution in search of a problem. Do you also argue that "email is dead"?
XML. Not CSV, not Parquet (whatever that is), not protobufs. Export the data as XML, with a schema. Not json or yaml either. You can render the XML into whatever format you want downstream.<p>The alternative path involves parsing csv in order to turn it into a different csv, turning json into yaml and so forth. Parsing "human readable" formats is terrible relative to parsing XML. Go with the unambiguous source format and turn it into whatever is needed in various locations as required.