I always use sqlite3 for working with large CSV files:<p><pre><code> $ sqlite3
 sqlite> .mode csv
 sqlite> .import my.csv foo
 sqlite> SELECT * FROM foo WHERE name = 'bar';
</code></pre>
.import reads the header row automatically for the field names and then stores all the values as strings.
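Since everything comes in as text, numeric comparisons need a cast; a minimal sketch, assuming the imported file has a numeric amount column (the column name is made up here):<p><pre><code> sqlite> SELECT name, amount FROM foo WHERE CAST(amount AS INTEGER) > 100;
</code></pre>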
If you have Windows, you have a tool that can do this already installed: PowerShell. (If you don't have Windows, PowerShell is free and open source [MIT license]. One of the first things I install on a new Linux or macOS box lately.)<p>The first example in PowerShell:<p>> Import-Csv example.csv | Sort-Object Color, Shape<p>To filter, use `Where-Object`:<p>> Import-Csv example.csv | Where-Object { $_.Color -eq 'purple' } | Sort-Object Color, Shape<p>Or using aliases:<p>> ipcsv example.csv | ? { $_.Color -eq 'purple' } | sort Color, Shape<p>JSON uses the same syntax -- just replace `Import-Csv` with `Get-Content` and `ConvertFrom-Json`:<p>> Get-Content example.json | ConvertFrom-Json | Select-Object Color -Unique<p>> purple<p>> red<p>> yellow<p>There's also `Group-Object` for aggregation.
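For the aggregation case, a minimal `Group-Object` sketch (assuming the same hypothetical example.csv with a Color column):<p>> Import-Csv example.csv | Group-Object Color | Select-Object Name, Count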
Some other useful tools for working with csv files in particular:<p>csvkit, a set of command line tools for manipulating csvs: <a href="https://csvkit.readthedocs.io/en/latest/" rel="nofollow">https://csvkit.readthedocs.io/en/latest/</a><p>VisiData, a terminal-based csv explorer: <a href="https://www.visidata.org/" rel="nofollow">https://www.visidata.org/</a>
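For example (a sketch, with hypothetical file and column names), csvkit can slice columns and pretty-print the result:<p><pre><code> $ csvcut -c name,color example.csv | csvlook
</code></pre>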
I make a lot of use of Miller at work and it has been a godsend. People tend to underestimate tools like this, but I'm able to answer a lot of questions about our data that would take other people longer, or that they don't have the capacity to check. It's great as a first step, and if anything more complex is needed we usually spin up a python script. So it's very much in the philosophy of small composable tools, but for records.
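As a rough sketch of the kind of one-liner I mean (file and field names are made up):<p><pre><code> $ mlr --icsv --opprint stats1 -a count,mean -f price -g category sales.csv
</code></pre>That groups by category and reports the count and mean of price for each group.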
Not exactly the same, but we wrote a library that easily loads any delimited file and finds the header (even if it isn't the first row). It also loads JSON, Parquet and Avro, and puts everything into a dataframe. Not a CLI exactly, but pretty easy:<p><a href="https://github.com/capitalone/dataprofiler" rel="nofollow">https://github.com/capitalone/dataprofiler</a><p>To install: <i>pip install dataprofiler[ml] --user</i><p>How it works:<p><pre><code> from dataprofiler import Data

 csv_data = Data('your_file.csv')                    # Load: delimited, JSON, Parquet, Avro
 csv_data.data.head(10)                              # Get head
 csv_data.data.sort_values(by='name', inplace=True)  # Sort
</code></pre>Anyway, Miller looks like a pretty interesting CLI. I'm not sure how its header detection works, especially when the header isn't the first row (which is often the case).
Is this (supposedly) better than BurntSushi's toolset (e.g. xsv [1]) or just different?<p>[1] <a href="https://github.com/BurntSushi/xsv" rel="nofollow">https://github.com/BurntSushi/xsv</a>
The comparison between the ongoing C version [1] and the upcoming Go version [2] is nice to see. The README talks a bit more about the performance comparison and the C version's strengths, but still.<p>[1] <a href="https://github.com/johnkerl/miller/tree/main/c" rel="nofollow">https://github.com/johnkerl/miller/tree/main/c</a><p>[2] <a href="https://github.com/johnkerl/miller/tree/main/go" rel="nofollow">https://github.com/johnkerl/miller/tree/main/go</a>
See also my own foray into this space: <a href="https://tkatchev.bitbucket.io/" rel="nofollow">https://tkatchev.bitbucket.io/</a><p>You might find it useful.
See LNAV (<a href="https://lnav.org" rel="nofollow">https://lnav.org</a>) for a "mini-ETL" CLI powertool with embedded SQLite; it's fantastic.
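A sketch of the SQLite side, assuming a standard combined-format access log (which lnav maps to its built-in access_log table):<p><pre><code> $ lnav -n -c ';SELECT c_ip, count(*) AS hits FROM access_log GROUP BY c_ip ORDER BY hits DESC' access.log
</code></pre>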
How does this compare to jq?<p>Aside from the fact that it appears to support more formats (CSV [which has partial jq support] and TSV), of course.<p>I find it very helpful when people compare their new toy or service with what is already pretty well known.
I don't think sed and awk should be on this list. Miller isn't really able to transform free-form text into something else, is it?<p>Even the tutorial leans on sed for cleansing ahead of Miller. <a href="https://www.ict4g.net/adolfo/notes/data-analysis/miller-quick-tutorial.html" rel="nofollow">https://www.ict4g.net/adolfo/notes/data-analysis/miller-quic...</a>
Just yesterday I was opining in IRC about what it would be like to have an operating system built from the ground up to support smarter data formats than just text.<p>The concept of everything as a file has awesome benefits, I believe, but that doesn't mean files need to be unstructured text.<p>There's a core problem: with text languages (like AWK) or structured regex (sed, etc.) you end up both having to parse AND manipulate data, which is no fun and prone to errors.<p>Abstracting all of that away into codecs that all of coreutils could speak would be very cool to me.<p>The second issue is structured data vs unstructured text. CSVs, or other table-based formats, make sense sometimes, and sometimes you want to be able to query things more easily. JSON provides that. What the synthesis of the two looks like I don't know... but maybe this is closer to heaven.<p>I'd like to _minimize_ the amount of parsing or text manipulation done in any program and be able to focus on extracting and inserting the data I need, like a rosetta stone that just handles everything for me while I work. I want to be able to do:<p>awk "{print $schoolName}" or awk "{print $4}" and have it just work -- json, text, CSV.
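For what it's worth, Miller's cut verb gets fairly close to that wish for CSV and JSON, with the same expression either way; a sketch with hypothetical file and field names:<p><pre><code> $ mlr --csv cut -f schoolName schools.csv
 $ mlr --json cut -f schoolName schools.json
</code></pre>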
Similar, but using the ACTUAL awk, sed, join, sort tools you already have and know about: <a href="https://github.com/dkogan/vnlog/" rel="nofollow">https://github.com/dkogan/vnlog/</a>
I'm a bit surprised by the need for the verbose "--icsv --ocsv"... Shouldn't it be trivial to detect that the input is csv? (And in that case, the output could default to csv.)
I added a new feature to my sqlite-utils CLI tool in June which covers similar ground: it now lets you load CSV, TSV and JSON data into an in-memory SQLite database and execute SQL queries against it: <a href="https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/" rel="nofollow">https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/</a><p>Example usage:<p><pre><code> sqlite-utils memory example.csv "select * from t order by color, shape"
# Defaults to outputting JSON, you can add
# --csv or --tsv for those formats or
# --table to output as a rendered table
</code></pre>
More docs here: <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#querying-data-directly-using-an-in-memory-database" rel="nofollow">https://sqlite-utils.datasette.io/en/stable/cli.html#queryin...</a>
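And the aggregation case, as a sketch (the color column here is hypothetical):<p><pre><code> sqlite-utils memory example.csv "select color, count(*) as n from t group by color" --table
</code></pre>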
There's also rq (record query) [1], which supports CSV and JSON, though not TSV. It's written in Rust.<p>[1] <a href="https://github.com/dflemstr/rq" rel="nofollow">https://github.com/dflemstr/rq</a>
I just prefer using R data-frames - the convenience of a full data-science language shell for getting statistical data out of CSV. One can also make gorgeous graphs using ggplot2.
Nice. Any reason why the installation section isn't prominent and at the top of the README?<p>It's even in a non-intuitive place in the documentation.
I don't know about Miller CLI's portability, but RecordStream (<a href="https://github.com/benbernard/RecordStream" rel="nofollow">https://github.com/benbernard/RecordStream</a>) is my go-to Swiss Army knife.
I recently came across jq, which I use for JSON parsing: <a href="https://stedolan.github.io/jq/" rel="nofollow">https://stedolan.github.io/jq/</a>
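For example (a sketch, assuming a top-level array of objects with a color field):<p><pre><code> $ jq '.[] | select(.color == "purple")' example.json
</code></pre>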
For Windows users, NimbleText is a great little utility for working with tabular data:<p><a href="https://nimbletext.com/" rel="nofollow">https://nimbletext.com/</a>