Thank you all for your feedback. The benchmark was updated and the fastest tool is NOT written in Python. Here are the highlights:<p>* Added ClickHouse (written in C++) to the benchmark: I was unaware that the clickhouse-local tool could handle these tasks. ClickHouse is now the fastest (together with OctoSQL);<p>* OctoSQL (written in Go) was updated in response to the benchmark: updates included switching to fastjson, short-circuiting LIMIT, and eagerly printing when outputting JSON and CSV. Now, OctoSQL is one of the fastest and its memory usage is stable;<p>* SPyQL (written in Python) is now third: SPyQL leverages orjson (written in Rust) to parse JSON, while the query engine is written in Python. When processing 1GB of input data, SPyQL takes 4x-5x more time than the best, while still achieving up to 2x higher performance than jq (written in C);<p>* I removed Pandas from the benchmark and focused on command-line tools. I am planning a separate benchmark on Python libs where Pandas, Polars and Modin (and eventually others) will be included.<p>This benchmark is a living document.
If you are interested in receiving updates, please subscribe to the following issue: <a href="https://github.com/dcmoura/spyql/issues/72" rel="nofollow">https://github.com/dcmoura/spyql/issues/72</a><p>Thank you!
Please take this claim and these results with a pinch of salt. spyql was not created with the goal of being the fastest tool for querying data, and the same tools might outperform spyql on different datasets or in different use cases. There might also be other tools that I was not aware of when I wrote the benchmark (I just learned about a new one that we will be adding to the benchmark).<p>For me, the lesson was that in certain problems (e.g. I/O-intensive ones) the architecture/design can have a higher impact than the choice of programming language.<p>spyql can leverage both the Python standard lib for parsing JSON (written in C) and orjson (written in Rust). In this benchmark we used the latter, which shows considerable performance improvements. Still, query processing (expression evaluation, filtering, aggregations, etc.) is implemented in Python. I guess it's in the nature of Python to leverage internal/external modules written in a statically-typed compiled language to deliver high performance on core functionality.<p>Here is a simple experiment with a 1GB file that shows that JSON decoding takes less than 40% of the processing time:<p><pre><code> !spyql "SELECT avg_agg(json->overall) FROM orjson" < books.json
avg_agg_overall
4.31181166791025
time: 11.7 s (started: 2022-04-13 23:37:07 +00:00)
import orjson as json
acc = 0
cnt = 0
with open('books.json') as f:
    for line in f:
        acc += json.loads(line)['overall']
        cnt += 1
print(acc/cnt)
4.31181166791025
time: 4.55 s (started: 2022-04-13 23:37:19 +00:00)</code></pre>
Hey, OctoSQL author here.<p>Great to see such benchmarks and another dataset to test with!<p>Just wanted to add a small explanation, for the interested, of why OctoSQL[0] falls behind in the first and last test.<p>OctoSQL supports retractions. That means an input for OctoSQL can not only produce records, but also say "I'm taking this record back", and OctoSQL will calculate the correct output based on that with minimal processing overhead. It's a dataflow system in that sense.<p>To achieve that, it has to buffer data at the output to be able to display the current output state (which might have rows removed from it). The same goes for LIMIT, which is currently far from optimal in OctoSQL: because the top 100 rows can change if an ORDER BY clause is specified and new rows/retractions arrive, it buffers the whole output and only then displays the top 100 rows.<p>Running with the stream_native output format should drastically reduce the memory overhead and possibly the processing time (I'll post an update when the data is downloaded and I can compare), at least in the first case.<p>Obviously, OctoSQL could be much smarter about detecting whether a source can even produce retractions, and optimize its behavior based on that. There's no point in buffering if retractions are not possible (which is the case for, e.g., a JSON file). This would make LIMIT and the output table much faster and lighter.<p>It's definitely a big priority for me in future development to fix this (and make LIMIT and the table output format usable with big outputs), so stay tuned! 
:)<p>And congrats to SPyQL for the first place here, its design is awesome.<p>EDIT: Running the first test with `--output stream_native` (which just sends the productions and retractions as they come to the output, instead of buffering and sorting them) does indeed run 2-3x faster and use 50 MB RAM - instead of GB's.<p>[0]:<a href="https://github.com/cube2222/octosql" rel="nofollow">https://github.com/cube2222/octosql</a>
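The retraction model described above can be illustrated with a toy Python sketch (this is an illustration of the idea, not OctoSQL code): each input event carries a +1 or -1 multiplicity, and an aggregate updates incrementally instead of being recomputed from scratch.

```python
# Toy illustration of retraction-based (dataflow-style) aggregation.
# Each event is (multiplicity, value): +1 produces a record, -1 retracts it.

def running_sum(events):
    """Yield the current SUM after each production/retraction."""
    total = 0
    for mult, value in events:
        total += mult * value
        yield total

events = [(+1, 10), (+1, 5), (-1, 10), (+1, 7)]
print(list(running_sum(events)))  # [10, 15, 5, 12]
```

Note how the retraction of 10 simply subtracts it from the running total; the expensive part in a real system is output operators like ORDER BY + LIMIT, which must buffer rows because a retraction can change which rows are in the top N.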
Not sure the HN title is that great. The actual title is "The fastest tools for querying large JSON datasets", and the bit about it being written in Python is editorialized. The analysis includes tools written in Go, Python, and Java, but it doesn't appear to care about the implementation language at all: you must click through to GitHub to see the language.<p>I'm also not confident that the list was exhaustive.
Allow me a moment to hijack this thread. I'm a PhD student looking to make analytics of JSON data a first-class citizen in <a href="https://duckdb.org/" rel="nofollow">https://duckdb.org/</a>. I'm still very much in a literature study stage, so I'd love to hear <i>your</i> success (or failure) stories about analyzing data stored in JSON. E.g. I'd love to hear about:<p>* Which dialect for querying worked the best for you?<p>* Which tools do you enjoy the most? This could be for any reason: convenience of install/availability, expressivity, raw speed, other niche features, etc.<p>* What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, object-of-arrays, array-of-objects, or in an SQL context you could have object-per-row or object-of-arrays-as-table, etc). I'd love to hear about others that are important to you.<p>* What do you feel is the biggest gap/missing feature in currently available tools?
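To make the "meta-formats" question concrete, here is a small stdlib-only Python sketch contrasting two of the shapes mentioned: object-per-line (NDJSON), which can be parsed one record at a time, versus array-of-objects, where the whole document must be parsed before any row is available.

```python
import io
import json

# Object-per-line (NDJSON): each line is an independent JSON document,
# so the file streams well and parsing can start immediately.
ndjson = io.StringIO('{"a": 1}\n{"a": 2}\n')
rows_ndjson = [json.loads(line) for line in ndjson]

# Array-of-objects: a single JSON document; the parser must consume
# the whole array before the rows can be handed to a query engine.
array_doc = '[{"a": 1}, {"a": 2}]'
rows_array = json.loads(array_doc)

assert rows_ndjson == rows_array == [{"a": 1}, {"a": 2}]
```

The same logical table, but with very different implications for streaming, parallel parsing, and memory use.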
Daniel Lemire’s simdjson probably belongs in this discussion and I would be surprised if it is not the fastest tool by some margin.<p><a href="https://github.com/simdjson/simdjson" rel="nofollow">https://github.com/simdjson/simdjson</a>
Thanks for including dsq! Its focus is always going to be flexibility and UX over performance, since it's just a thin wrapper over the internals of a GUI data integration tool [0]. For example, many other tools operate on only one or two file types and have optimized querying on those types.<p>Some, like octosql and spyql IIRC, have implemented their own SQL parsing layer and engine. So if full SQL support is necessary, you may not be able to use them.<p>In contrast, dsq supports Excel, Avro, Parquet, JSON, CSV/TSV, nginx logs, Apache logs, JSON newline, concatenated JSON, etc. And this list keeps growing. dsq stores data in SQLite under the hood, so you have access to the full SQLite dialect. textql and q also do it this way.<p>That said, despite performance not being the major goal, I think it will get better over time. There's already an OSS contributor who happens to be thinking about perf.<p>If you're interested in getting started with OSS contributions, btw, I keep a page of good first projects [1]. If you know Go, you're in a good place to contribute. I've already had a number of people take me up on this and merged a number of PRs by first-time OSS contributors.<p>[0] <a href="https://github.com/multiprocessio/datastation" rel="nofollow">https://github.com/multiprocessio/datastation</a><p>[1] <a href="https://github.com/multiprocessio/datastation/blob/main/GOOD_FIRST_PROJECTS.md" rel="nofollow">https://github.com/multiprocessio/datastation/blob/main/GOOD...</a>
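The "load into SQLite, then query" approach that dsq, textql and q take can be sketched in a few lines of stdlib Python (this is a minimal illustration of the idea, not dsq's actual code): flatten the JSON records into an in-memory table, then the full SQLite SQL dialect is available for free.

```python
import json
import sqlite3

# Hypothetical mini version of the SQLite-backed approach: parse JSON
# records, load them into an in-memory table, query with real SQL.
rows = [json.loads(line) for line in
        ['{"name": "a", "n": 3}', '{"name": "b", "n": 5}']]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, n INTEGER)")
conn.executemany("INSERT INTO t VALUES (:name, :n)", rows)

# Full SQLite dialect (joins, window functions, etc.) is now available.
total, = conn.execute("SELECT SUM(n) FROM t").fetchone()
print(total)  # 8
```

The trade-off is exactly the one described above: an extra ingestion pass costs time and memory, but buys complete SQL support across many input formats.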
Tools like<p><a href="https://www.alteryx.com/" rel="nofollow">https://www.alteryx.com/</a><p>are really limited in my view because they are based on a relational model that requires joins to do things that are easily expressed without joins in the object-relational (basically JSON) model.<p>"Boxes and lines" data pipelines that involve joins are a bear to maintain because of the structural instability: what looks like a little change to management can turn a very simple pipeline that goes in one direction to a pipeline that splits into four streams that have to get reassembled at the end.<p>If you were passing JSON documents between the nodes you can keep the stuff that would have been split out and then joined inside the document and the pipelines are a lot easier to maintain.<p>The people who develop those tools, however, are in love with the relational model because it is SIMDable, not branchy, easy to implement at hyper-speed so there is going to be an opportunity for people to make one that gets much better customer satisfaction.<p>To do that though somebody has to ask the question of "what algebra does this thing work over?"
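The join-versus-nesting point above can be shown with a toy example: in the relational shape, order items live in a separate table and every pipeline that needs them must join; in the document shape, the same information travels as one nested record.

```python
# Relational shape: two tables, reassembled with a join.
orders = [{"id": 1, "customer": "ada"}]
items = [{"order_id": 1, "sku": "A"}, {"order_id": 1, "sku": "B"}]
joined = [
    {**o, "items": [i["sku"] for i in items if i["order_id"] == o["id"]]}
    for o in orders
]

# Document shape: one nested record per order, so pipeline stages
# never need to split the data into streams and re-join at the end.
docs = [{"id": 1, "customer": "ada", "items": ["A", "B"]}]

assert joined == docs
```

When a requirement changes (say, items gain their own sub-lists), the document pipeline keeps its single-stream topology, while the relational one grows another split-and-join.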
That SpyQL is pretty cute... I like how it imports Python modules directly.<p>It's a lot like list comprehensions on steroids, but it lacks the high-end features of SQL like joins, and it has a simple query processing strategy that doesn't support query optimization the way SQL does.<p>It is convenient to incorporate the Python ecosystem (functions), but in the end it is not that strong because the query language doesn't understand Python. It reminds me of the Java Streams API, which can represent all the stages of a processing pipeline before you <i>collect</i> it, yet that doesn't buy very much because the map, filter, group, etc. functions are black boxes that it can't reason about globally.<p>The Expression Tree version of LINQ in C# can do better (like compiling what looks like a C# function to SQL), but it's disappointing that very few people really did anything with expression trees. (Give programmers the power of LISP and they say... Meh.)
I asked about this on the GitHub issue regarding these benchmarks as well.<p>I'm curious as to why libraries like ultrajson[0] and orjson[1] weren't explored. They aren't command-line tools, but neither is pandas (which <i>is</i> benchmarked), right? Is it perhaps because the code required to implement the challenges is large enough that they are considered too inconvenient to use in the same way pandas was used (i.e., `python -c "..."`)?<p>[0] <a href="https://github.com/ultrajson/ultrajson" rel="nofollow">https://github.com/ultrajson/ultrajson</a><p>[1] <a href="https://github.com/ijl/orjson" rel="nofollow">https://github.com/ijl/orjson</a>
I personally tend to put all my data for analysis into ClickHouse (which is written in C++, FWIW). My side project is actually all about ingesting JSON data into ClickHouse and making it super easy to query, visualize, monitor and alert on JSON data.<p>You certainly pay a computation cost every time you want to scan the data, but I think the flexibility more than makes up for it. You can read about how and why I built GraphJSON here if you're interested: <a href="https://www.graphjson.com/guides/about" rel="nofollow">https://www.graphjson.com/guides/about</a>
Why didn't they compare with Polars [0]? According to the h2oai benchmark [1], Polars is fast.<p>[0] -- <a href="https://www.pola.rs/" rel="nofollow">https://www.pola.rs/</a>
[1] -- <a href="https://h2oai.github.io/db-benchmark/" rel="nofollow">https://h2oai.github.io/db-benchmark/</a>
So for large files the bottleneck is encoding/decoding, which is done by orjson rather than Python. If you wrapped orjson in C++, the results would be even faster.
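This claim is easy to check with a rough micro-benchmark that times the decode step separately from the aggregation step (stdlib `json` here; if orjson is installed, swapping the import makes the decode slice shrink further):

```python
import json
import time

# Synthetic object-per-line input, similar in shape to the books.json
# records used in the benchmark (field name is illustrative).
lines = ['{"overall": %d}' % (i % 5) for i in range(200_000)]

t0 = time.perf_counter()
records = [json.loads(line) for line in lines]            # decode
t1 = time.perf_counter()
avg = sum(r["overall"] for r in records) / len(records)   # "query"
t2 = time.perf_counter()

print(f"decode: {t1 - t0:.2f}s, aggregate: {t2 - t1:.2f}s, avg={avg}")
```

On most machines the decode slice dominates, which is why moving only the parser into Rust/C already closes most of the gap to compiled tools.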
I focus on ease of use instead of performance (I'm usually looking at small files), but I found jq's syntax to be really difficult, so I wrote my own teeny tiny tool in Python:<p><a href="https://replit.com/@gabrielsroka/Bash#pj.py" rel="nofollow">https://replit.com/@gabrielsroka/Bash#pj.py</a><p>Obviously it only does a fraction of what jq does, but it does what I need and I can expand it in a fraction of the time it would take to learn jq.<p>Inspired by <a href="https://docs.python.org/3/library/json.html#module-json.tool" rel="nofollow">https://docs.python.org/3/library/json.html#module-json.tool</a>
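The core of such a teeny tool fits in a few lines (a hypothetical sketch of the idea, not the actual pj.py): parse the document with stdlib `json` and walk a dotted path through dicts and lists.

```python
import json

def get_path(obj, path):
    """Follow a dotted path like 'a.b.0.c' through dicts and lists."""
    for key in path.split("."):
        obj = obj[int(key)] if isinstance(obj, list) else obj[key]
    return obj

doc = json.loads('{"user": {"emails": ["x@example.com", "y@example.com"]}}')
print(get_path(doc, "user.emails.1"))  # y@example.com
```

No filters, maps, or pipes like jq, but for "pull this one field out of a blob" it covers most day-to-day cases.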
TIL that time can also output max memory use and I/O use. Very useful!<p>Like they do here:<p>/usr/bin/time -o performance.log -a -f"%e\t%M\t%x\t%C"<p>%e (Not in tcsh.) Elapsed real time (in seconds).<p>%M Maximum resident set size of the process during its lifetime, in Kbytes.<p>%x (Not in tcsh.) Exit status of the command.<p>%C (Not in tcsh.) Name and command-line arguments of the command being timed.
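The same peak-RSS number (GNU time's %M field) can also be read from inside a Python process via the Unix-only `resource` module, which is handy when benchmarking a library rather than a whole command:

```python
import resource
import sys

# ru_maxrss is the process's peak resident set size, like time's %M.
# Unit caveat: Linux reports KiB, macOS reports bytes.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
unit = "bytes" if sys.platform == "darwin" else "KiB"
print(f"peak RSS so far: {peak} {unit}")
```

Unlike `/usr/bin/time`, this measures the current process only, so it won't capture memory used by subprocesses unless you query `RUSAGE_CHILDREN` as well.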
Potentially relevant here...<p><a href="https://lemire.me/blog/2018/05/03/how-fast-can-you-parse-json/" rel="nofollow">https://lemire.me/blog/2018/05/03/how-fast-can-you-parse-jso...</a><p><a href="http://www.vldb.org/pvldb/vol10/p1118-li.pdf" rel="nofollow">http://www.vldb.org/pvldb/vol10/p1118-li.pdf</a><p>"We experimentally evaluate Mison using representative real-world JSON datasets and the TPC-H benchmark, and show that Mison produces significant performance benefits over the best existing JSON parsers; in some cases, the performance improvement is over one order of magnitude."
Just this week I had to query a JSON file that is 13-14 MB in size. I already had Node installed, so that is what I used. I was surprised that the processing was so fast that it was done as soon as I hit the enter key in the terminal window. And the processing wasn't trivial: it had to touch every single element of the entire JSON file and put the entire thing into a new data structure. I was quite impressed at how fast it was. It took my IDE about 5 minutes just to open the JSON file so I could take a look at how it was structured.
You could also try my Xidel. Use the dev build 0.9.9, not the latest release 0.9.8, because I recently implemented my own hash map and JSON parser to make it faster than before.<p>That said, it is neither fast nor suited to large files (it loads everything into memory).