Using command line to process CSV files (2022)

161 points by mr_o47, almost 2 years ago

36 comments

benhoyt, almost 2 years ago

Unfortunately, "awk -F," (field separator of comma) doesn't work with most real CSV files, because of quoted fields, commas in fields, and (less frequently) multiline fields. My GoAWK implementation has a CSV mode activated with "goawk -i csv" (input mode CSV) and some other CSV features that properly handle quoted and multiline fields: <a href="https://benhoyt.com/writings/goawk-csv/" rel="nofollow noreferrer">https://benhoyt.com/writings/goawk-csv/</a>

The frawk tool (written in Rust) also supports this.

Interestingly, Brian Kernighan is currently updating the book "The AWK Programming Language" for a second edition (I'm one of the technical reviewers), and Gawk and awk are adding a "--csv" option for this purpose. So real CSV mode is coming to an AWK near you soon!
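
To make the failure mode concrete, here is a minimal Python sketch (not GoAWK or gawk code) that contrasts naive comma splitting, roughly what "awk -F," does, with a real CSV parser. The sample data and the function names `naive_rows` and `csv_rows` are made up for illustration.

```python
import csv
import io

# Hypothetical sample: one quoted field containing a comma,
# and one quoted field containing a newline.
SAMPLE = 'name,address\n"Smith, Jane","12 High St\nSpringfield"\n'


def naive_rows(text):
    """Split on newlines, then on commas, ignoring CSV quoting entirely."""
    return [line.split(",") for line in text.splitlines()]


def csv_rows(text):
    """Parse with the standard csv module, which honours quoted fields."""
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print(naive_rows(SAMPLE))  # quoted fields are torn apart
    print(csv_rows(SAMPLE))    # one header row plus one data row
```

The naive version sees three "records" and splits the name in two; the csv module returns the header plus a single two-field record.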

wenc, almost 2 years ago

For any kind of tabular data, I use DuckDB (parallel CSV reads, works on small to humongous CSVs). It supports full Postgres-like SQL, so you can do arbitrarily complex manipulations on columns and rows. The CSV essentially becomes a dataframe object that can be manipulated in performant ways (DuckDB vectorizes and parallelizes, so it's way faster than awk on large CSV files). You can even manipulate multiple CSVs in a single statement (JOINs, UNIONs etc).

<pre><code>% duckdb -c "from 'test.csv'"
┌───────┬───────┬───────┐
│   A   │   B   │   C   │
│ int64 │ int64 │ int64 │
├───────┼───────┼───────┤
│     1 │     2 │     3 │
│     4 │     5 │     6 │
│     7 │     8 │     9 │
└───────┴───────┴───────┘
% duckdb -c "select sum(C) from 'test.csv'"
┌────────┐
│ sum(C) │
│ int128 │
├────────┤
│     18 │
└────────┘
% duckdb -c "select sum(A + 2*B + C^2) from 'test.csv'"
┌────────────────────────────────┐
│ sum(((A + (2 * B)) + (C ^ 2))) │
│             double             │
├────────────────────────────────┤
│                          168.0 │
└────────────────────────────────┘
</code></pre>

ldmosquera, almost 2 years ago

Have a look at visidata, a TUI table viewer/editor with vim keybindings. It's incredibly powerful and fully automatable: everything you do gets recorded as a series of commands which you can save and replay.

Not the right tool for everything, but it shines for quickly glancing at the shape of tabular data and making sense of it. Sorting, filtering, joins across files, column histograms and even column splitting/rejoining are all keystrokes away.

It groks anything even remotely table-shaped, like CSV, JSONL, JSON, even Excel files. It can even directly connect to databases and parse tables out of HTML.

<a href="https://www.visidata.org/" rel="nofollow noreferrer">https://www.visidata.org/</a>

sklarsa, almost 2 years ago

Personally, I use xsv and it’s been tremendously helpful, especially when working with larger files. <a href="https://github.com/BurntSushi/xsv">https://github.com/BurntSushi/xsv</a>

kippinitreal, almost 2 years ago

Cool stuff! But it’s criminal to not call attention to jq’s elder sibling, csvkit. It’s invaluable for playing with CSVs. It makes it much easier to parse out columns, and lets you generate new CSVs and even merge them. More importantly, it allows SQL on CSVs (via SQLite, iirc), which empowers all sorts of CSV shenanigans. The bash scripting this enables is incredible (good and bad).

<a href="https://csvkit.readthedocs.io/en/latest/" rel="nofollow noreferrer">https://csvkit.readthedocs.io/en/latest/</a>
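
The "SQL on CSVs via SQLite" idea can be sketched with nothing but the Python standard library. This is not csvkit's implementation or API, only an illustration of the approach; `load_csv`, `query_csv`, the table name `data`, and the sample data are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample data with a header row.
SAMPLE = "A,B,C\n1,2,3\n4,5,6\n7,8,9\n"


def load_csv(conn, table, text):
    """Create `table` from CSV text; the header row supplies column names.

    Every value is stored as text, and every row is assumed to have the
    same number of fields as the header.
    """
    rows = csv.reader(io.StringIO(text, newline=""))
    header = next(rows)
    # Double any embedded quote so the name is a valid quoted identifier.
    columns = ", ".join('"' + name.replace('"', '""') + '"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)


def query_csv(text, sql, table="data"):
    """Load CSV text into an in-memory SQLite database and run one query."""
    conn = sqlite3.connect(":memory:")
    try:
        load_csv(conn, table, text)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Values are loaded as text, so cast before doing arithmetic.
    print(query_csv(SAMPLE, 'SELECT sum(CAST("C" AS INTEGER)) FROM data'))
```

Because the columns are untyped text, numeric work needs an explicit `CAST`; a real tool would infer column types instead.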

pradeepchhetri, almost 2 years ago

As an SRE, I used to install one tool for every data format until I came across clickhouse-local. Now I use it for everything.

- It works with every data format I come across in my daily job. It supports data formats like Protobuf, Avro, and Cap'n Proto, which regular tools don't support. Funny thing, it can even read MySQL dumps.

- I can read data stored in a local or remote location like HTTP, S3, GCS, Azure, and whatever location I can think of.

- I can run SQL queries on the raw data and improve my SQL skills on a daily basis.

globular-toast, almost 2 years ago

None of this stuff works.

Whenever CSV comes up I always feel a bit sad. ASCII includes a set of four out-of-band delimiters[0] that can be used instead of silly formats like CSV that use in-band delimiters, which necessitate complicated quoting rules.

You can't just treat CSV as text. It's not, and woe betide you if you ever use this stuff in a script instead of using a proper CSV parsing tool. If we used the ASCII delimiters instead, it would be possible to treat it as text and stuff like this would work.

[0] <a href="https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text</a>
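
As a rough illustration of ASCII-delimited text, the Python sketch below uses two of the four separators: the unit separator (0x1F) between fields and the record separator (0x1E) between records. The function names `encode` and `decode` are made up for the example, and it simply rejects fields that contain a separator character.

```python
UNIT_SEP = "\x1f"    # ASCII 31: separates fields within a record
RECORD_SEP = "\x1e"  # ASCII 30: separates records


def encode(rows):
    """Join rows of string fields using the ASCII separators."""
    for row in rows:
        for field in row:
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(UNIT_SEP.join(row) for row in rows)


def decode(text):
    """Split ASCII-delimited text back into rows; no quoting rules needed."""
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


if __name__ == "__main__":
    rows = [["name", "note"], ["Smith, Jane", 'said "hi"\non two lines']]
    assert decode(encode(rows)) == rows
    print(repr(encode(rows)))
```

Commas, quotes, and newlines pass through untouched, so plain splitting is enough to read the data back.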

woodruffw, almost 2 years ago

It's worth noting that a lot of these tricks will break on non-trivial CSV inputs, e.g. ones that contain escaped commas.

I like using the shell (and Awk in particular!) as much as anyone, but for CSV I tend to reach for Python's standard csv module[1].

[1]: <a href="https://docs.python.org/3/library/csv.html" rel="nofollow noreferrer">https://docs.python.org/3/library/csv.html</a>

vajdagabor, almost 2 years ago

Nushell is also quite powerful for this. For example:

<pre><code>> open people.csv | where status == 'customer' | uniq-by email | select surname forename email | sort-by email | save customers.json
</code></pre>

<a href="https://www.nushell.sh/" rel="nofollow noreferrer">https://www.nushell.sh/</a>

zX41ZdbW, almost 2 years ago

My favorite tool is clickhouse-local: <a href="https://clickhouse.com/blog/extracting-converting-querying-local-files-with-sql-clickhouse-local" rel="nofollow noreferrer">https://clickhouse.com/blog/extracting-converting-querying-l...</a>

It is the most powerful (works with any format, with remote datasets, and supports all the ClickHouse SQL) and the most performant.

tanin, almost 2 years ago

In my day job, a customer would very often ask me to reconcile two giant CSVs. Think GBs.

When I tell them to do it in Excel by themselves, they say Excel can't open a CSV larger than 1M rows...

At first, I was using sqlite through the shell. It wasn't slick enough with all the typing, and I hated it so much that I built a desktop app on top of it.

It is quite a joy to use, and I'd love for people to try it out: <a href="https://superintendent.app" rel="nofollow noreferrer">https://superintendent.app</a> (disclaimer: I'm the creator).

dash2, almost 2 years ago

There's a nice tool called dplyr-cli which lets you use R's dplyr data manipulation language on the command line.

<pre><code>cat mtcars.csv | group_by cyl | summarise "mpg = mean(mpg)" | kable
#> | cyl|      mpg|
#> |---:|--------:|
#> |   4| 26.66364|
#> |   6| 19.74286|
#> |   8| 15.10000|
</code></pre>

<a href="https://github.com/coolbutuseless/dplyr-cli">https://github.com/coolbutuseless/dplyr-cli</a>

hfkwer, almost 2 years ago

These days I just use powershell. It has built-in csv import and then I'm dealing with familiar pwsh/.net objects. I don't miss the days of learning bespoke tools to handle slightly different cases.

snthpy, almost 2 years ago

Shameless plug: I created prql-query (<a href="https://github.com/PRQL/prql-query">https://github.com/PRQL/prql-query</a>) in order to scratch my own itch and use PRQL (prql-lang.org) with DataFusion and DuckDB for things like this.

pq is overdue some maintenance, but I will update it with the imminent PRQL 0.9 release.

I've used pq in anger at my $dayjob and found it incredibly productive to have the full power of SQL combined with the terse and logical syntax of PRQL.

chrisshroba, almost 2 years ago

I have this shell function defined:

<pre><code>csv_to_json () {
  python -c 'import csv, json, sys; print(json.dumps([dict(r) for r in csv.DictReader(sys.stdin)]))' | jq .
}
</code></pre>

It converts a CSV to a JSON list of objects, mapping column names to values. I find it way easier to then operate on JSON by filtering with jq or gron, or just pasting it into other tools for post-processing. The jq at the end isn't necessary but makes for nice formatting!

dima55, almost 2 years ago

Lots of tools do this sort of thing. An incomplete list is in the vnlog docs: <a href="https://github.com/dkogan/vnlog/#description">https://github.com/dkogan/vnlog/#description</a>

LispSporks22, almost 2 years ago

Ruby's CSV module can be handy:

<pre><code>ruby -rcsv -ne 'CSV($<).each { |r| puts r[0] }'
</code></pre>

I like the "seen" example the author has:

<pre><code>'!seen[$1]++'
</code></pre>
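
For readers unfamiliar with the idiom, '!seen[$1]++' keeps a line only the first time its first field appears. A rough Python equivalent that parses real CSV first might look like the sketch below; `unique_by_first_field` and the sample data are made up for illustration.

```python
import csv
import io


def unique_by_first_field(rows):
    """Yield each row whose first field has not been seen before."""
    seen = set()
    for row in rows:
        key = row[0] if row else ""  # treat blank lines as an empty key
        if key not in seen:
            seen.add(key)
            yield row


if __name__ == "__main__":
    # Hypothetical sample: the key contains a comma, so splitting on ","
    # would use the wrong first field.
    sample = '"Smith, Jane",1\n"Smith, Jane",2\n"Smith, John",3\n'
    reader = csv.reader(io.StringIO(sample, newline=""))
    for row in unique_by_first_field(reader):
        print(row)
```

With this sample, plain comma splitting would treat both names as the same key ('"Smith') and drop the third row, while the parsed version keeps it.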

thibran, almost 2 years ago

Nushell can open and write CSV. Those AWK commands look horrible compared to the nu syntax.

<a href="https://www.nushell.sh/commands/docs/from_csv.html" rel="nofollow noreferrer">https://www.nushell.sh/commands/docs/from_csv.html</a>

jiehong, almost 2 years ago

If you use powershell, you can directly run Import-Csv and don’t really have to think about it.

oofnik, almost 2 years ago

Many command-line CSV parsing tools have been mentioned here; adding my choice to the list: <a href="http://harelba.github.io/q/" rel="nofollow noreferrer">http://harelba.github.io/q/</a>

asicsp, almost 2 years ago

For field extraction (like the first example: `awk -F, '{print $1}'`), you can also use `cut -d, -f1`.

GNU datamash (<a href="https://www.gnu.org/software/datamash/" rel="nofollow noreferrer">https://www.gnu.org/software/datamash/</a>) provides features like groupby, statistical operations, etc.

See also this free ebook: Data Science at the Command Line (<a href="https://jeroenjanssens.com/dsatcl/" rel="nofollow noreferrer">https://jeroenjanssens.com/dsatcl/</a>)

two_handfuls, almost 2 years ago

I think the lesson here is: don’t use awk for CSV. Instead, use one of the many tools discussed in the comments that know how to handle CSV.

Some are very close to awk in spirit, like my own attempt: `pawk` (1). It will parse your CSV just fine. Or TSV. Or JSON or YAML or TOML. Or Parquet, even.

1: <a href="https://github.com/jean-philippe-martin/pawk">https://github.com/jean-philippe-martin/pawk</a>

PhilippGille, almost 2 years ago

Many of the tools shared in this thread simplify working with CSV files, but only some allow running proper SQL queries.

SQLite, DuckDB and clickhouse-local have been mentioned, but another very simple one, a single dependency-free binary, is <a href="https://github.com/multiprocessio/dsq">https://github.com/multiprocessio/dsq</a>

Not affiliated, just a happy user.

nbk_2000, almost 2 years ago

Just thought I'd plug Octosql[1], which I've enjoyed using for this. It parses CSV and JSON, which are the file types I parse the most.

[1] <a href="https://github.com/cube2222/octosql/">https://github.com/cube2222/octosql/</a>

zh3, almost 2 years ago

'tr' and 'cut' are also very useful; 'tr' can be used to get rid of extra spaces and to convert commas to spaces and vice versa (and to handle text data with pretty much any character used as a separator).

<pre><code>cat "$SPACE_SEPARATED_FILE" | tr -s ' ' | tr ' ' ',' > out.csv
</code></pre>

Allied with 'cut', it becomes easy to pull particular fields out of a text file:

<pre><code>cat "$FILE_WITH_COMMAS_AND_SPACES" | tr ',' ' ' | tr -s ' ' | cut -d ' ' -f1,2,17 > out.txt
</code></pre>
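
One caveat with the first pipeline is that a field which already contains a comma ends up unquoted in out.csv. A small Python sketch of the same conversion that lets the csv module handle the quoting is shown here. `whitespace_to_csv` is a made-up name, and unlike the tr version it splits on any whitespace (tabs included) and drops leading and trailing whitespace.

```python
import csv
import io


def whitespace_to_csv(text):
    """Convert whitespace-separated lines to CSV, quoting where required."""
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        # str.split() with no argument squeezes runs of whitespace,
        # much like `tr -s ' '` followed by splitting on single spaces.
        writer.writerow(line.split())
    return out.getvalue()


if __name__ == "__main__":
    print(whitespace_to_csv("a  b   c\n1 2  3\nx,y  z\n"), end="")
```

The last input line contains the field `x,y`, which the writer emits as `"x,y"` so that it still reads back as a single field.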

greazy, almost 2 years ago

I recommend everyone check out the very cool csvtk by the amazing bioinformatician Shen Wei: <a href="https://github.com/shenwei356/csvtk">https://github.com/shenwei356/csvtk</a>

hermitcrab, almost 2 years ago

Presumably this command, `awk -F, '{print $1}' file.csv`, doesn't work if the data contains commas (with escapes)? If so, that might be worth spelling out.

bobnamob, almost 2 years ago

I've defaulted to doing all my data munging in a Clojure REPL.

The datasets that I work with are easily small enough to fit in memory on my machine, and having the full Java ecosystem plus Clojure ergonomics is worth more to me than the performance full DB tooling might offer.

petre, almost 2 years ago

I use miller, as it's available on my distro.

pyeri, almost 2 years ago

Python's pandas library can work seamlessly with CSV files by importing and exporting them as data frames.

SPBS, almost 2 years ago

I don't understand why this is so high when it's just using awk to naively split on commas. Any programming language could do that! I was expecting an actual command line tip that can handle CSV files in general, looks like it's still basically impossible without resorting to a full blown CSV parser (none of which come installed by default).

replwoacause, almost 2 years ago

I use PowerShell for this and it works very well.

activiation, almost 2 years ago

Interesting but not very intuitive.

kohlerm, almost 2 years ago

xsv works pretty well for me

cutler, almost 2 years ago

Title should have been "Using awk to process CSV files".

9735194, almost 2 years ago

Can’t wait for the tutorial on ‘ls’. It’s not like there is a manual.