
DuckDB as the New jq

364 points | by pgr0ss | about 1 year ago

23 comments

xg15, about 1 year ago
The most effective combination I've found so far is jq + basic shell tools.

I still think jq's syntax and data model is unbelievably elegant and powerful once you get the hang of it - but its "standard library" is unfortunately sorely lacking in many places and has some awkward design choices in others, which means that a lot of practical everyday tasks - such as aggregations or even just set membership - are a lot more complicated than they ought to be.

Luckily, what jq can do *really well* is bring data of interest into a line-based text representation, which is ideal for all kinds of standard unix shell tools - so you can just use those to take over the parts of your pipeline that would be hard to do in "pure" jq.

So I think my solution to the OP's task - get all distinct OSS licenses from the project list and count usages for each one - would be:

    curl ... | jq '.[].license.key' | sort | uniq -c

That's it.
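For comparison, the same distinct-license aggregation can be done without shelling out at all; a minimal sketch in pure Python, with sample records invented here to mirror the GitHub repo-list payload the thread discusses:

```python
import json
from collections import Counter

# Invented sample; shape mirrors the GitHub repo-list JSON from the article.
raw = json.dumps([
    {"name": "a", "license": {"key": "mit"}},
    {"name": "b", "license": {"key": "apache-2.0"}},
    {"name": "c", "license": {"key": "mit"}},
])

# Equivalent of: jq '.[].license.key' | sort | uniq -c
counts = Counter(repo["license"]["key"] for repo in json.loads(raw))
for key, n in counts.most_common():
    print(n, key)
```

`Counter.most_common()` plays the role of `sort | uniq -c` plus the descending sort, in one call.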
ndr, about 1 year ago
If you like lisp, and especially clojure, check out babashka[0]. This is my first attempt, but I bet you can do something nicer even if you keep forcing yourself to stay in a single pipe command.

    cat repos.json | bb -e '
      (->> (-> *in* slurp (json/parse-string true))
           (group-by #(-> % :license :key))
           (map #(-> {:license (key %) :count (-> % val count)}))
           json/generate-string
           println)'

[0] https://babashka.org/
hu3, about 1 year ago
Related: the clickhouse local CLI command is a speed demon for parsing and querying JSON and other formats such as CSV:

- "The world's fastest tool for querying JSON files": https://clickhouse.com/blog/worlds-fastest-json-querying-tool-clickhouse-local

- "Show HN: ClickHouse-local - a small tool for serverless data analytics": https://news.ycombinator.com/item?id=34265206
sshine, about 1 year ago
Very cool!

I am also a big fan of jq, and I think using DuckDB and SQL probably makes a lot of sense in a lot of cases. But I think the examples are very geared towards being better solved in SQL.

The ideal jq examples are combinations of filter (select), map (map) and concat (.[]).

For example, finding the right download link:

    $ curl -s https://api.github.com/repos/go-gitea/gitea/releases/latest \
      | jq -r '.assets[] | .browser_download_url | select(endswith("linux-amd64"))'
    https://github.com/go-gitea/gitea/releases/download/v1.15.7/gitea-1.15.7-linux-amd64

Or extracting the KUBE_CONFIG of a DigitalOcean Kubernetes cluster from Terraform state:

    $ jq -r '.resources[]
             | select(.type == "digitalocean_kubernetes_cluster")
             | .instances[].attributes.kube_config[].raw_config' \
      terraform.tfstate
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        certificate-authority-data: ...
        server: https://...k8s.ondigitalocean.com
    ...
jeffbee, about 1 year ago
I tried this and it just seems to add bondage and discipline that I don't need on top of what is, in practice, an extremely chaotic format.

Example: trying to pick one field out of 20000 large JSON files that represent local property records.

    % duckdb -json -c "select apn.apnNumber from read_json('*')"
    Invalid Input Error: JSON transform error in file "052136400500", in record/value 1:
    Could not convert string 'fb1b1e68-89ee-11ea-bc55-0242ad1302303' to INT128

Well, I didn't want that converted. I just want to ignore it. This has been my experience overall. DuckDB is great if there is a logical schema; not as good as jq when the corpus is just data soup.
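The "treat it as soup" approach can be sketched in plain Python, where nothing is type-inferred for fields you never asked about; `pick_field` and the sample records below are invented for illustration:

```python
def pick_field(doc, path):
    """Walk a dotted path like 'apn.apnNumber'; return None if any step is missing."""
    cur = doc
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur

# Records with inconsistent shapes - no global schema required.
records = [
    {"apn": {"apnNumber": "052-136-400"}},
    {"apn": {"apnNumber": 12345}},          # different type: returned as-is
    {"parcel_id": "fb1b1e68-89ee-11ea"},    # field absent: silently skipped
]
values = [v for r in records if (v := pick_field(r, "apn.apnNumber")) is not None]
print(values)  # ['052-136-400', 12345]
```

Each record is inspected independently, so one oddly-typed file cannot abort the whole scan the way the schema-inference error above does.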
mritchie712, about 1 year ago
You can also query (public) Google Sheets [0]:

    SELECT * FROM read_csv_auto(
        'https://docs.google.com/spreadsheets/export?format=csv&id=1GuEPkwjdICgJ31Ji3iUoarirZNDbPxQj_kf7fd4h4Ro',
        normalize_names=True);

0 - https://x.com/thisritchie/status/1767922982046015840?s=20
NortySpock, about 1 year ago
In a similar vein, I have found Benthos to be an incredible swiss-army-knife for transforming data and shoving it into (or out of) a message bus, webhook, or database.

https://www.benthos.dev/
haradion, about 1 year ago
I've found Nushell (https://www.nushell.sh/) to be really handy for ad-hoc data manipulation (and a decent enough general-purpose shell).
nf3, about 1 year ago
I run a pretty substantial platform where I implemented structured logging to SQLite databases. Each log event is stored as a JSON object in a row. A separate database is kept for each day. Daily log files are about 35GB, so that's quite a lot of data to go through if you want to look for something specific. Being able to index on specific fields, as well as express searches as SQL queries, is a real game changer IMO.
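The indexing-on-JSON-fields pattern can be sketched with Python's stdlib sqlite3; the schema, field names, and sample events below are invented, and an expression index requires SQLite's built-in JSON1 functions (present in modern builds):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # one database file per day in the real setup
conn.execute("CREATE TABLE logs (event TEXT)")  # one JSON log event per row

events = [
    {"level": "error", "msg": "disk full", "host": "web-1"},
    {"level": "info", "msg": "request ok", "host": "web-2"},
    {"level": "error", "msg": "timeout", "host": "web-2"},
]
conn.executemany("INSERT INTO logs VALUES (?)", [(json.dumps(e),) for e in events])

# Expression index on a specific JSON field, so filtered searches avoid full scans.
conn.execute("CREATE INDEX idx_level ON logs (json_extract(event, '$.level'))")

rows = conn.execute(
    "SELECT json_extract(event, '$.msg') FROM logs "
    "WHERE json_extract(event, '$.level') = 'error' ORDER BY rowid"
).fetchall()
print(rows)  # [('disk full',), ('timeout',)]
```

The WHERE clause uses the same expression as the index definition, which is what lets SQLite satisfy the filter from the index.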
JeremyNT, about 1 year ago
I have a lot of trouble understanding the benefits of this versus just working with JSON in a programming language. It seems like you're adding another layer of abstraction versus just dealing with a normal hashmap-like data structure in your language of choice.

If you want to work with it interactively, you could use a notebook or REPL.
hprotagonist, about 1 year ago
I've been using simonw's sqlite-utils (https://sqlite-utils.datasette.io/en/stable/) for this sort of thing; given structured json or jsonl, you can throw data at an in-memory sqlite database and query away: https://sqlite-utils.datasette.io/en/stable/cli.html#querying-data-directly-using-an-in-memory-database
pletnes, about 1 year ago
Worth noting that both jq and duckdb can be used from Python as well as from the command line. Both are very useful data tools!
ec109685, about 1 year ago
While jq's syntax can be hard to remember, ChatGPT does an excellent job generating jq from an example JSON file and a description of how you want it parsed.
Sammi, about 1 year ago
I work primarily in projects that use js, and I mostly don't see the point in working with json in tools other than js.

I have tried jq a little bit, but learning jq is learning a new thing, which is healthy, but it also requires time and energy, which is not always available.

When I want to munge some json I use js... because that is what js is innately good at and it's what I already know. A little js script that reads stdin or a file, runs JSON.parse, maps and filters some stuff, and at the end writes JSON.stringify to stdout or a file does the job 100% of the time in my experience.

And I can use a debugger or put in console logs when I want to debug. I don't know how to debug jq or sql, so when I'm stuck I end up going back to js, which I can debug.

Are there js developers who reach for jq even though you are already familiar with js? Is it because you are already strong in bash and terminal usage? I think I get why you would want to use sql if you are already experienced in sql: sql is common and made for data munging. Jq, however, is a new DSL, and I don't see what limitation of existing js or sql it addresses.
snthpy, about 1 year ago
Hi,

I very much share your sentiment, and I saw a few comments mentioning PRQL, so I thought it might be worth bringing up the following:

In order to make working with data at the terminal as easy and fun as possible, some time ago I created pq (prql-query), which leverages DuckDB, DataFusion and PRQL.

Unfortunately I am currently not in a position to maintain it, so the repo is archived, but if someone wanted to help out and collaborate we could change that.

It doesn't have much in the way of JSON functions out of the box, but in PRQL it's easy to wrap the DuckDB functions for that, and with the new PRQL module system it will soon also become possible to share those. If you look through my HN comment history, I did provide a JSON example before.

Anyway, you can take a look at the repo here: https://github.com/PRQL/prql-query

If interested, you can get in touch with me via GitHub or the PRQL Discord. I'm @snth on both.
HellsMaddy, about 1 year ago
Jq tip: Instead of `sort_by(.count) | reverse`, you can do `sort_by(-.count)`
mutant, about 1 year ago
https://github.com/mikefarah/yq

Yq handles almost every format, and IMO is easier to use.
schindlabua, about 1 year ago
Shoutout to jqp, an interactive jq explorer.

https://github.com/noahgorstein/jqp
rpigab, about 1 year ago
I love jq and yq, but sometimes I don't want to invest time in learning new syntax and just fall back to some Python one-liner that can, if necessary, become a small Python script.

Something like this; I have a version of it in a shell alias:

    python3 -c "import json,sys;d=json.load(sys.stdin);print(doStuff(d['path']['etc']))"

Pretty-printing is done with json.dumps.
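Spelled out as a small script rather than a one-liner, the same pattern might look like the sketch below; `pluck` is a stand-in for the comment's placeholder `doStuff(d['path']['etc'])`, and the sample document is invented:

```python
import json

def pluck(doc, *path):
    # Stand-in for the one-liner's doStuff(d['path']['etc']) step.
    for key in path:
        doc = doc[key]
    return doc

raw = '{"path": {"etc": {"name": "demo", "id": 7}}}'
value = pluck(json.loads(raw), "path", "etc")
print(json.dumps(value, indent=2))  # pretty-print via json.dumps, as the comment notes
```

In the alias form, `json.loads(raw)` would be `json.load(sys.stdin)` so the script sits in a pipe like jq does.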
phmx, about 1 year ago
There is also a way to import a table from STDIN (see also https://duckdb.org/docs/data/json/overview):

    cat my.json | duckdb -c "CREATE TABLE mytbl AS SELECT * FROM read_json_auto('/dev/stdin'); SELECT ... FROM mytbl"
dudus, about 1 year ago
DuckDB parses JSON using yyjson internally.

https://github.com/ibireme/yyjson
jonfw, about 1 year ago
My current team produces a CLI binary that is available on every build system and on everybody's dev machines.

Whenever we're writing automation, if the code is nontrivial, or if it starts to include dependencies, we move the code into the CLI tool.

The reason we like this is that we don't want to have to version-control tools like duckdb across every dev machine and every build system that might run a given script. We build and version-control a single binary, and it makes life simple.
hermitcrab, about 1 year ago
If you want a very visual way to transform JSON/XML/CSV/Excel etc. in a pipeline, it might also be worth looking at Easy Data Transform.