The most effective combination I've found so far is jq + basic shell tools.<p>I still think jq's syntax and data model are unbelievably elegant and powerful once you get the hang of it - but its "standard library" is unfortunately sorely lacking in many places and has some awkward design choices in others, which means that a lot of practical everyday tasks - such as aggregations or even just set membership - are a lot more complicated than they ought to be.<p>Luckily, what jq does <i>really well</i> is bring data of interest into a line-based text representation, which is ideal for all kinds of standard unix shell tools - so you can just use those to take over the parts of your pipeline that would be hard to do in "pure" jq.<p>So I think my solution to the OP's task - get all distinct OSS licenses from the project list and count usages for each one - would be:<p>curl ... | jq '.[].license.key' | sort | uniq -c<p>That's it.
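<p>If you want raw strings and repos without a license dropped, a slightly fuller variant of the same pipeline (a sketch of the idea, not the OP's exact command) would be:<p><pre><code> curl ... \
   | jq -r '.[] | .license.key // empty' \
   | sort | uniq -c | sort -rn
</code></pre>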
If you like lisp, and especially clojure, check out babashka[0]. This is my first attempt, but I bet you can do something nicer even if you keep forcing yourself to stay within a single pipe command.<p><pre><code> cat repos.json | bb -e ' (->> (-> *in* slurp (json/parse-string true))
            (group-by #(-> % :license :key))
            (map #(-> {:license (key %)
                       :count (-> % val count)}))
            json/generate-string
            println)'
</code></pre>
[0] <a href="https://babashka.org/" rel="nofollow">https://babashka.org/</a>
Related: the clickhouse-local CLI command is a speed demon for parsing and querying JSON and other formats such as CSV:<p>- "The world’s fastest tool for querying JSON files" <a href="https://clickhouse.com/blog/worlds-fastest-json-querying-tool-clickhouse-local" rel="nofollow">https://clickhouse.com/blog/worlds-fastest-json-querying-too...</a><p>- "Show HN: ClickHouse-local – a small tool for serverless data analytics" <a href="https://news.ycombinator.com/item?id=34265206">https://news.ycombinator.com/item?id=34265206</a>
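<p>For the license-count task mentioned upthread, a clickhouse-local version might look roughly like this (a sketch from memory; the single `json` column name comes from the JSONAsString format, and the exact flags can vary between versions):<p><pre><code> clickhouse local -q "
     SELECT JSONExtractString(json, 'license', 'key') AS license, count() AS repos
     FROM file('repos.json', 'JSONAsString')
     GROUP BY license
     ORDER BY repos DESC"
</code></pre>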
Very cool!<p>I am also a big fan of jq.<p>And I think using DuckDB and SQL probably makes a lot of sense in a lot of cases.<p>But I think the examples are heavily geared towards problems that are better solved in SQL.<p>The ideal jq examples are combinations of filter (select), map (map) and concat (.[]).<p>For example, finding the right download link:<p><pre><code> $ curl -s https://api.github.com/repos/go-gitea/gitea/releases/latest \
     | jq -r '.assets[]
              | .browser_download_url
              | select(endswith("linux-amd64"))'
https://github.com/go-gitea/gitea/releases/download/v1.15.7/gitea-1.15.7-linux-amd64
</code></pre>
Or extracting the KUBE_CONFIG of a DigitalOcean Kubernetes cluster from Terraform state:<p><pre><code> $ jq -r '.resources[]
          | select(.type == "digitalocean_kubernetes_cluster")
          | .instances[].attributes.kube_config[].raw_config' \
     terraform.tfstate
apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority-data: ...
    server: https://...k8s.ondigitalocean.com
...</code></pre>
I tried this and it just seems to add bondage and discipline that I don't need on top of what is, in practice, an extremely chaotic format.<p>Example: trying to pick one field out of 20000 large JSON files that represent local property records.<p>% duckdb -json -c "select apn.apnNumber from read_json('*')"
Invalid Input Error: JSON transform error in file "052136400500", in record/value 1: Could not convert string 'fb1b1e68-89ee-11ea-bc55-0242ad1302303' to INT128<p>Well, I didn't want that converted. I just want to ignore it. This has been my experience overall. DuckDB is great if there is a logical schema, not as good as jq when the corpus is just data soup.
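<p>One workaround that sort of proves the point (a sketch, not tested against these files) is to give up on inference entirely, read each document as a single JSON value, and extract by path:<p><pre><code> duckdb -json -c "SELECT json_extract_string(json, '$.apn.apnNumber') AS apn
                  FROM read_json_objects('*')"
</code></pre>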
You can also query (public) Google Sheets [0]<p><pre><code> SELECT *
 FROM read_csv_auto('https://docs.google.com/spreadsheets/export?format=csv&id=1GuEPkwjdICgJ31Ji3iUoarirZNDbPxQj_kf7fd4h4Ro',
                    normalize_names=True);
</code></pre>
0 - <a href="https://x.com/thisritchie/status/1767922982046015840?s=20" rel="nofollow">https://x.com/thisritchie/status/1767922982046015840?s=20</a>
In a similar vein, I have found Benthos to be an incredible swiss-army-knife for transforming data and shoving it either into (or out of) a message bus, webhook, or a database.<p><a href="https://www.benthos.dev/" rel="nofollow">https://www.benthos.dev/</a>
I've found Nushell (<a href="https://www.nushell.sh/" rel="nofollow">https://www.nushell.sh/</a>) to be really handy for ad-hoc data manipulation (and a decent enough general-purpose shell).
I run a pretty substantial platform where I implemented structured logging to SQLite databases. Each log event is stored as a JSON object in a row. A separate database is kept for each day. Daily log files are about 35GB, so that's quite a lot of data to go through if you want to look for something specific. Being able to index on specific fields, as well as express searches as SQL queries, is a real game changer IMO.
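<p>A rough sketch of the pattern (table and field names here are made up) with SQLite's JSON functions plus an expression index:<p><pre><code> -- one JSON event per row
 CREATE TABLE IF NOT EXISTS events (body TEXT NOT NULL);
 -- index a frequently filtered field
 CREATE INDEX IF NOT EXISTS idx_events_level
   ON events (json_extract(body, '$.level'));
 -- searches that repeat the indexed expression can use it
 SELECT json_extract(body, '$.msg')
 FROM events
 WHERE json_extract(body, '$.level') = 'error';
</code></pre>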
I have a lot of trouble understanding the benefits of this versus just working with json with a programming language. It seems like you're adding another layer of abstraction versus just dealing with a normal hashmap-like data structure in your language of choice.<p>If you want to work with it interactively, you could use a notebook or REPL.
i've been using simonw's sqlite-utils (<a href="https://sqlite-utils.datasette.io/en/stable/" rel="nofollow">https://sqlite-utils.datasette.io/en/stable/</a>) for this sort of thing; given structured json or jsonl, you can throw data at an in-memory sqlite database and query away: <a href="https://sqlite-utils.datasette.io/en/stable/cli.html#querying-data-directly-using-an-in-memory-database" rel="nofollow">https://sqlite-utils.datasette.io/en/stable/cli.html#queryin...</a>
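<p>for the license-count example it's a one-liner; something like this (a sketch from memory -- the table name comes from the file name, and nested objects are stored as JSON text, hence json_extract):<p><pre><code> sqlite-utils memory repos.json \
   "select json_extract(license, '$.key') as license, count(*) as n
    from repos group by 1 order by n desc"
</code></pre>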
While jq’s syntax can be hard to remember, ChatGPT does an excellent job of generating jq from an example json file and a description of how you want it parsed.
I work primarily in projects that use js and I mostly don't see the point in working with json in other tools than js.<p>I have tried jq a little bit, but learning jq is learning a new thing, which is healthy, but it also requires time and energy, which is not always available.<p>When I want to munge some json I use js... because that is what js is innately good at and it's what I already know. A little js script that does stdin/file read and then JSON.parse, and then map and filter some stuff, and at the end JSON.stringify to stdout/file does the job 100% of the time in my experience.<p>And I can use a debugger or put in console logs when I want to debug. I don't know how to debug jq or sql, so when I'm stuck I end up going for js which I can debug.<p>Are there js developers who reach for jq even though you are already familiar with js? Is it because you are already strong in bash and terminal usage? I think I get why you would want to use sql if you are already experienced in sql. Sql is common and made for data munging. Jq, however, is a new dsl, and I don't see what limitation of existing js or sql it addresses.
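<p>For reference, the little script I mean is roughly this (a sketch; the field names are whatever your data happens to have):<p><pre><code> // read stdin, parse, transform, write back out
 const fs = require('node:fs');
 const repos = JSON.parse(fs.readFileSync(0, 'utf8')); // fd 0 = stdin
 const licenses = repos
   .filter((r) => r.license)       // drop repos without a license
   .map((r) => r.license.key);
 process.stdout.write(JSON.stringify(licenses, null, 2) + '\n');
</code></pre>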
Hi,<p>I very much share your sentiment and I saw a few comments mentioning PRQL so I thought it might be worth bringing up the following:<p>In order to make working with data at the terminal as easy and fun as possible, some time ago I created pq (prql-query) which leverages DuckDB, DataFusion and PRQL.<p>Unfortunately I am currently not in a position to maintain it so the repo is archived but if someone wanted to help out and collaborate we could change that.<p>It doesn't have much in the way of json functions out-of-the-box but in PRQL it's easy to wrap the DuckDB functions for that and with the new PRQL module system it will soon also become possible to share those. If you look through my HN comment history I did provide a JSON example before.<p>Anyway, you can take a look at the repo here: <a href="https://github.com/PRQL/prql-query">https://github.com/PRQL/prql-query</a><p>If interested, you can get in touch with me via Github or the PRQL Discord. I'm @snth on both.
<a href="https://github.com/mikefarah/yq">https://github.com/mikefarah/yq</a><p>Yq handles almost every format, and is IMO easier to use.
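<p>For example, converting between formats is just a pair of flags, and field access uses jq-like paths (a sketch; the file and field names here are made up, and flag spellings vary a bit between yq versions):<p><pre><code> # JSON in, YAML out
 yq -p json -o yaml '.' repos.json
 # pull a field straight out of some YAML
 yq '.metadata.name' deployment.yaml
</code></pre>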
Shoutout to jqp, an interactive jq explorer.<p><a href="https://github.com/noahgorstein/jqp">https://github.com/noahgorstein/jqp</a>
I love jq and yq, but sometimes I don't want to invest time in learning new syntax and just fall back to a python one-liner that can, if necessary, become a small python script.<p>Something like this (I have a version of it in a shell alias):<p><pre><code> python3 -c "import json,sys;d=json.load(sys.stdin);print(doStuff(d['path']['etc']))"
</code></pre>
Pretty print is done with json.dumps.
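<p>e.g. a sketch of the same alias, just re-serializing with an indent:<p><pre><code> python3 -c "import json,sys;print(json.dumps(json.load(sys.stdin), indent=2))"
</code></pre>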
There is also a way to import a table from STDIN (see also <a href="https://duckdb.org/docs/data/json/overview" rel="nofollow">https://duckdb.org/docs/data/json/overview</a>)<p>cat my.json | duckdb -c "CREATE TABLE mytbl AS SELECT * FROM read_json_auto('/dev/stdin'); SELECT ... FROM mytbl"
My current team produces a CLI binary that is available on every build system and everybody's dev machines<p>Whenever we're writing automation, if the code is nontrivial, or if it starts to include dependencies, we move the code into the CLI tool.<p>The reason we like this is that we don't want to have to version control tools like duckdb across every dev machine and every build system that might run this script. We build and version control a single binary and it makes life simple.