I always use sqlite3 for working with large CSV files:<p><pre><code> $ sqlite3
 sqlite> .mode csv
 sqlite> .import my.csv foo
 sqlite> SELECT * FROM foo WHERE name = 'bar';
</code></pre>
.import reads the header row automatically for the field names and then stores all the values as strings.
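Since everything comes in as text, numeric comparisons need a cast; a minimal sketch, assuming the imported file has a numeric amount column (the column name is made up here):<p><pre><code> sqlite> SELECT name, amount FROM foo WHERE CAST(amount AS INTEGER) > 100;
</code></pre>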
If you have Windows, you have a tool that can do this already installed: PowerShell. (If you don't have Windows, PowerShell is free and open source [MIT license]. One of the first things I install on a new Linux or macOS box lately.)<p>The first example in PowerShell:<p>> Import-Csv example.csv | Sort-Object Color, Shape<p>To filter, use `Where-Object`:<p>> Import-Csv example.csv | Where-Object { $_.Color -eq 'purple' } | Sort-Object Color, Shape<p>Or using aliases:<p>> ipcsv example.csv | ? { $_.Color -eq 'purple' } | sort Color, Shape<p>JSON uses the same syntax -- just replace `Import-Csv` with `Get-Content` and `ConvertFrom-Json`:<p>> Get-Content example.json | ConvertFrom-Json | Select-Object Color -Unique<p>> purple<p>> red<p>> yellow<p>There's also `Group-Object` for aggregation.
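For the aggregation case, a minimal `Group-Object` sketch (assuming the same hypothetical example.csv with a Color column):<p>> Import-Csv example.csv | Group-Object Color | Select-Object Name, Count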
Some other useful tools for working with csv files in particular:<p>csvkit, a set of command line tools for manipulating csvs: <a href="https://csvkit.readthedocs.io/en/latest/" rel="nofollow">https://csvkit.readthedocs.io/en/latest/</a><p>VisiData, a terminal-based csv explorer: <a href="https://www.visidata.org/" rel="nofollow">https://www.visidata.org/</a>
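For example (a sketch, with hypothetical file and column names), csvkit can slice columns and pretty-print the result:<p><pre><code> $ csvcut -c name,color example.csv | csvlook
</code></pre>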
I make a lot of use of Miller at work and it has been a godsend. People tend to underestimate tools like this, but I'm able to answer a lot of questions about our data that would take other people longer, or that they don't have the capacity to check. It's great as a first step, and if anything more complex is needed we usually spin up a python script. So it's very much in the philosophy of small composable tools, but for records.
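As a rough sketch of the kind of one-liner I mean (file and field names are made up):<p><pre><code> $ mlr --icsv --opprint stats1 -a count,mean -f price -g category sales.csv
</code></pre>That groups by category and reports the count and mean of price for each group.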
Not exactly the same, but we wrote a library that easily loads any delimited file and finds the header (even if it isn't the first row). It also loads JSON, Parquet and Avro, and puts everything into a dataframe. Not a CLI exactly, but pretty easy:<p><a href="https://github.com/capitalone/dataprofiler" rel="nofollow">https://github.com/capitalone/dataprofiler</a><p>To install: <i>pip install dataprofiler[ml] --user</i><p>How it works:<p><pre><code> from dataprofiler import Data

 csv_data = Data('your_file.csv')                    # Load: delimited, JSON, Parquet, Avro
 csv_data.data.head(10)                              # Get head
 csv_data.data.sort_values(by='name', inplace=True)  # Sort
</code></pre>Anyway, Miller looks like a pretty interesting CLI. I'm not sure how its header detection works, especially when the header isn't the first row (which is often the case).
Is this (supposedly) better than BurntSushi's toolset (e.g. xsv [1]) or just different?<p>[1] <a href="https://github.com/BurntSushi/xsv" rel="nofollow">https://github.com/BurntSushi/xsv</a>
The comparison between the ongoing C version [1] and the upcoming Go version [2] is nice to see. The README talks a bit more about the performance comparison and the C version's strengths, but still.<p>[1] <a href="https://github.com/johnkerl/miller/tree/main/c" rel="nofollow">https://github.com/johnkerl/miller/tree/main/c</a><p>[2] <a href="https://github.com/johnkerl/miller/tree/main/go" rel="nofollow">https://github.com/johnkerl/miller/tree/main/go</a>
See also my own foray into this space: <a href="https://tkatchev.bitbucket.io/" rel="nofollow">https://tkatchev.bitbucket.io/</a><p>You might find it useful.
See LNAV (<a href="https://lnav.org" rel="nofollow">https://lnav.org</a>) for a "mini-ETL" CLI powertool with embedded SQLite; it's fantastic.
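A sketch of the SQLite side, assuming a standard combined-format access log (which lnav maps to its built-in access_log table):<p><pre><code> $ lnav -n -c ';SELECT c_ip, count(*) AS hits FROM access_log GROUP BY c_ip ORDER BY hits DESC' access.log
</code></pre>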
How does this compare to jq?<p>Aside from the fact that it appears to support more formats (CSV [which has partial jq support] and TSV), of course.<p>I find it very helpful when people compare their new toy or service with what is already pretty well known.
I don't think sed and awk should be on this list. Miller isn't really able to transform free-form text into something else, is it?<p>Even the tutorial leans on sed for cleansing ahead of Miller. <a href="https://www.ict4g.net/adolfo/notes/data-analysis/miller-quick-tutorial.html" rel="nofollow">https://www.ict4g.net/adolfo/notes/data-analysis/miller-quic...</a>
Just yesterday I was opining in IRC about what it would be like to have an operating system built from the ground up to support smarter data formats than just text.<p>The concept of everything as a file has awesome benefits, I believe, but that doesn't mean files need to be unstructured text.<p>There's a core problem: with text languages (like AWK) or structured regex (sed, etc.) you end up both having to parse AND manipulate data, which is no fun and prone to errors.<p>Abstracting all of that away into codecs that all of coreutils could speak would be very cool to me.<p>The second issue is structured data vs unstructured text. CSVs, or other table-based formats, make sense sometimes, and sometimes you want to be able to query things more easily. JSON provides that. What the synthesis of the two looks like I don't know... but maybe this is closer to heaven.<p>I'd like to _minimize_ the amount of parsing or text manipulation done in any program and be able to focus on extracting and inserting the data I need, like a rosetta stone that just handles everything for me while I work. I want to be able to do:<p>awk "{print $schoolName}" or awk "{print $4}" and have it just work -- json, text, CSV.
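For what it's worth, Miller's cut verb gets fairly close to that wish for CSV and JSON, with the same expression either way; a sketch with hypothetical file and field names:<p><pre><code> $ mlr --csv cut -f schoolName schools.csv
 $ mlr --json cut -f schoolName schools.json
</code></pre>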
Similar, but using the ACTUAL awk, sed, join, sort tools you already have and know about: <a href="https://github.com/dkogan/vnlog/" rel="nofollow">https://github.com/dkogan/vnlog/</a>
I'm a bit surprised by the need for the verbose "--icsv --ocsv"... Shouldn't it be trivial to detect that the input is csv? (And in that case, the output could default to csv.)
I added a new feature to my sqlite-utils CLI tool in June which covers similar ground: it now lets you load CSV, TSV and JSON data into an in-memory SQLite database and execute SQL queries against it: <a href="https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/" rel="nofollow">https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/</a><p>Example usage:<p><pre><code> sqlite-utils memory example.csv "select * from t order by color, shape"
# Defaults to outputting JSON, you can add
# --csv or --tsv for those formats or
# --table to output as a rendered table
</code></pre>
More docs here: <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#querying-data-directly-using-an-in-memory-database" rel="nofollow">https://sqlite-utils.datasette.io/en/stable/cli.html#queryin...</a>
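And the aggregation case, as a sketch (the color column here is hypothetical):<p><pre><code> sqlite-utils memory example.csv "select color, count(*) as n from t group by color" --table
</code></pre>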
There's also rq (record query) [1], which supports CSV and JSON, though not TSV. It's written in Rust.<p>[1] <a href="https://github.com/dflemstr/rq" rel="nofollow">https://github.com/dflemstr/rq</a>
I just prefer using R data-frames - the convenience of a full data-science language shell for getting statistical data out of CSV. One can also make gorgeous graphs using ggplot2.
Nice. Any reason why the installation section isn't prominent and at the top of the README?<p>It's even in a non-intuitive place in the documentation.
I don't know about Miller CLI's portability, but RecordStream (<a href="https://github.com/benbernard/RecordStream" rel="nofollow">https://github.com/benbernard/RecordStream</a>) is my go-to Swiss Army knife.
I recently came across jq, which I use for JSON parsing: <a href="https://stedolan.github.io/jq/" rel="nofollow">https://stedolan.github.io/jq/</a>
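For example (a sketch, assuming a top-level array of objects with a color field):<p><pre><code> $ jq '.[] | select(.color == "purple")' example.json
</code></pre>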
For Windows users, NimbleText is a great little utility for working with tabular data:<p><a href="https://nimbletext.com/" rel="nofollow">https://nimbletext.com/</a>