Some random comments:

* A few GBs of data isn't really that much. Even /considering/ the use of cloud services just for this sounds crazy to me... but I'm sure there are people out there that believe it's the only way to do this (not the author, fortunately).

* "You might find out that the data doesn’t fit into RAM (which it well might, JSON is a human-readable format after all)" -- if I'm reading this right, the author is saying that the parsed data takes _more_ space than the JSON version? JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

* "When you’re ~trial-and-error~iteratively building jq commands as I do, you’ll quickly grow tired of having to wait about a minute for your command to succeed" -- well, change your workflow then. When tackling new queries, it's usually a good idea to reduce the data set. Operate on a few records until you have the right query so that you can iterate as fast as possible. Only once you are confident with the query, run it on the full data.

* Importing the data into a SQLite database may be better overall for exploration. Again, JSON is slow to operate on because it's text. Pay the cost of parsing only once.

* Or write a custom little program that streams data from the JSON file without buffering it all in memory. JSON parsing libraries are plentiful so this should not take a lot of code in your favorite language (see the sketch below).
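To make that last suggestion concrete, here is a minimal sketch of such a streaming program in Python using ijson (a library mentioned elsewhere in this thread). It assumes the records live in a top-level JSON array; the file name and field names are placeholders.

    import ijson

    # Stream one record at a time; memory use stays flat regardless of file size.
    with open("big.json", "rb") as f:
        for record in ijson.items(f, "item"):     # "item" = elements of a top-level array
            if record.get("status") == "error":   # hypothetical filter
                print(record["id"])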
I really like using line-delimited JSON [0] for stuff like this. If you're looking at a multi-GB JSON file, it's often made of a large number of individual objects (e.g. semi-structured JSON log data or transaction records).

If you can get to a point where each line is a reasonably-sized JSON document, a lot of things get way easier. jq will be streaming by default. You can use traditional Unixy tools (grep, sed, etc.) in the normal way because it's just lines of text. And you can jump to any point in the file, skip forward to the next line boundary, and know that you're not in the middle of a record.

The company I work for added line-delimited JSON output to lots of our internal tools, and working with anything else feels painful now. It scales up really well -- I've been able to do things like process full days of OPRA reporting data in a bash script.

[0]: https://jsonlines.org/
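A rough Python sketch of why the per-line shape is so convenient (the file name and fields are made up): every line parses independently, so you can stream, filter, and parallelize with ordinary line-oriented tooling.

    import json

    with open("events.ndjson") as f:
        for line in f:
            event = json.loads(line)              # each line is a complete document
            if event.get("type") == "transaction":
                print(event["amount"])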
I had to parse a database backup from Firebase, which was, remarkably, a 300GB JSON file. The database is a tree rooted at a single object, which means that any tool that attempts to stream individual objects always wanted to buffer this single 300GB root object. It wasn’t enough to strip off the root either, as the really big records were arrays a couple of levels down, with a few different formats depending on the schema. For added fun our data included some JSON serialised inside strings too.

This was a few years ago and I threw every tool and language I could at it, but they were either far too slow or buffered records larger than memory, even the fancy C++ SIMD parsers did this. I eventually got something working in Go and it was impressively fast and ran on my MacBook, but we never ended up using it as another engineer just wrote a script that read the entire database from the Firebase API record-by-record throttled over several days, lol.
Nice writeup, but is jq & GNU parallel or a notebook full of python spaghetti the best (least complex) tool for the job?

DuckDB might be nice here, too. See https://duckdb.org/2023/03/03/json.html
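For reference, a hedged sketch of what the DuckDB route can look like from Python; the file name and columns are placeholders, and read_json_auto infers the schema for you.

    import duckdb

    top = duckdb.sql("""
        SELECT user_id, count(*) AS n
        FROM read_json_auto('big.json')
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """).fetchall()
    print(top)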
Clickhouse is the best way to analyze 10GB of JSON, by far.

The latest bunch of features adds near-native JSON support. Coupled with the ability to add extracted columns, it makes the whole process easy. It is fast, you can use familiar SQL syntax, and you are not constrained by RAM limits.

It is a bit hard if you want to iteratively process the file line by line or use advanced SQL. And you have the one-time cost of writing a schema. Apart from that, I can't think of any downsides.

Edit: clarify a bit
One thing that will greatly help with `jq` is rebuilding it so it suits your machine. The package of jq that comes with Debian or Ubuntu Linux is garbage that targets k8-generic (on the x86_64 variant), is built with debug assertions, and uses the GNU system allocator which is the worst allocator on the market. Rebuilding it targeting your platform, without assertions, and with tcmalloc makes it twice as fast in many cases.

On this 988MB dataset I happen to have at hand, compare Ubuntu jq with my local build, with hot caches on an Intel Core i5-1240P:

    time parallel -n 100 /usr/bin/jq -rf ../program.jq ::: *  ->  1.843s
    time parallel -n 100 ~/bin/jq -rf ../program.jq ::: *     ->  1.121s
I know it stinks of Gentoo, but if you have any performance requirements at all, you can help yourself by rebuilding the relevant packages. Never use the upstream mysql, postgres, redis, jq, ripgrep, etc etc.
Recently had 28GB of JSON IoT data with no guarantees on the data structure inside.

Used simdjson [1] together with Python bindings [2].
Achieved massive speedups for analyzing the data. Before, it took on the order of minutes; afterwards it was fast enough that I didn't have to leave my desk.
Reading from disk became the bottleneck, not CPU power and memory.

[1] https://github.com/simdjson/simdjson
[2] https://pysimdjson.tkte.ch/
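A hedged sketch of how those bindings can be used, assuming the data was split into line-delimited records (the field names are invented). Reusing a single Parser instance is what gives the speedup, but note that a parsed document is only valid until the next parse() call.

    import simdjson

    parser = simdjson.Parser()
    with open("iot.ndjson", "rb") as f:
        for line in f:
            doc = parser.parse(line)          # reuses the parser's internal buffer
            if doc["sensor"] == "temperature":  # hypothetical fields
                print(doc["value"])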
tbh my usual strategy is to drop into a real programming language and use whatever JSON stream parsing exists there, and dump the contents into a half-parsed file that can be split with `split`. Then you can use "normal" tools on one of those pieces for fast iteration, and simply `cat * | ...` for the final slow run on all the data.

Go is quite good for this, as it's extremely permissive about errors and structure, has very good performance, and comes with a streaming parser in the standard library. It's pretty easy to be finished after only a couple minutes, and you'll be bottlenecked on I/O unless you did something truly horrific.

And when jq isn't enough because you need to do joins or something, shove it into SQLite. Add an index or three. It'll massively outperform almost anything else unless you need rich text content searches (and even then, a fulltext index might be just as good), and it's plenty happy with a terabyte of data.
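A minimal sketch of that last SQLite step with Python's built-in sqlite3, assuming the records were already flattened to line-delimited JSON; the table, columns, and index are made up for illustration.

    import json, sqlite3

    conn = sqlite3.connect("data.db")
    conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, ts TEXT, body TEXT)")
    with open("records.ndjson") as f:
        # Keep the raw record around as JSON text alongside the indexed columns.
        rows = ((r.get("id"), r.get("ts"), json.dumps(r)) for r in map(json.loads, f))
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_records_ts ON records (ts)")
    conn.commit()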
Spark is my favorite tool to deal with JSON. It can read as many JSON files as you want – in any format, located in any (even nested) folder structure – offers parallelization, and is great for flattening structs. I've never run into memory issues (or never ran out of workarounds) so far.
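For instance, a minimal PySpark sketch (assumes a local Spark installation; the input path and the nested field being flattened are placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("json-exploration").getOrCreate()
    df = (spark.read
          .option("recursiveFileLookup", "true")   # walk nested folders
          .json("data/"))
    flat = df.select("id", F.col("payload.device.id").alias("device_id"))
    flat.groupBy("device_id").count().show()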
To analyze and process the Pushshift Reddit comment & submission archives we used Rust with simd-json and currently get to around 1 - 2GB/s (that’s including the decompression of the zstd stream). It still takes a load of time when the decompressed files are 300GB+.

Weirdly enough, we ended up networking a bunch of Apple silicon MacBooks together, as the Ryzen 32C servers didn’t come close to matching their performance :/
This is something I did recently. We have this binary format we use for content traces. You can dump it to JSON, but that turns a ~10GB file into a ~100GB file. I needed to check some aspects of this with Python, so I used ijson [1] to parse the JSON without having to keep it in memory.

The nice thing is that our dumping tool can also output JSON to STDOUT, so you don't even need to dump the JSON representation to the hard disk. Just open the tool in a subprocess and pipe the output to the ijson parser. Pretty handy.

[1] https://pypi.org/project/ijson/
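Roughly, the subprocess-pipe setup looks like the sketch below; the dump tool's name and flags, and the "traces.item" prefix, are placeholders for whatever the real tool and JSON layout are.

    import subprocess
    import ijson

    # Pipe the tool's JSON output straight into the streaming parser.
    proc = subprocess.Popen(["dump-tool", "--json", "trace.bin"], stdout=subprocess.PIPE)
    for event in ijson.items(proc.stdout, "traces.item"):
        ...  # each `event` is one fully parsed record; nothing else is buffered
    proc.wait()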
For a hacky solution, I've often just used grep, tr, awk, etc. If it's a well structured file and all the records are the same or similar enough, it's often possible to grep your way into getting the thing you want on each line, and then use awk or sed to parse out the data. Obviously lots of ways this can break down, but 9GB is nothing if you can make it work with these tools. I have found jq much slower.
LNAV (https://lnav.org) is ideally suited for this kind of thing, with an embedded sqlite engine and what amounts to a local laptop-scale mini-ETL toolkit w/ a nice CLI. I've been recommending it for the last 7 years since I discovered this awesome little underappreciated util.
If you're doing interactive analysis, converting the JSON to Parquet is a great first step.

After that, DuckDB or Spark are a good way to go. I only fall back to Spark if some aggregations are too big to fit in RAM. Spark spills to disk and subdivides the physical plans better in my experience.
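A hedged sketch of that one-time conversion using DuckDB from Python (file names are placeholders); after this, all the iteration happens against the Parquet file instead of re-parsing the JSON.

    import duckdb

    duckdb.sql("""
        COPY (SELECT * FROM read_json_auto('big.json'))
        TO 'big.parquet' (FORMAT PARQUET)
    """)
    print(duckdb.sql("SELECT count(*) FROM 'big.parquet'").fetchone())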
Dask looks really cool, I hope I remember it exists next time I need it.

I've been pretty baffled, and disappointed, by how bad Python is at parallel processing. Yeah, yeah, I know: the GIL. But so much time and effort has been spent engineering around every other flaw in Python and yet this part is still so bad. I've tried every "easy to use" parallelism library that gets recommended and none of them has satisfied. Always: "couldn't pickle this function", or spawning loads of processes that use up all my RAM for no visible reason but don't use any CPU or make any indication of progress. I'm sure I'm missing something, I'm not a Python guy. But every other language I've used has an easy-to-use stateless parallel map that hasn't given me any trouble.
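For what it's worth, the Dask route for line-delimited JSON can be as small as this sketch (paths and fields are invented); dask.bag gives a parallel map/filter without hand-rolled multiprocessing.

    import json
    import dask.bag as db

    # read_text splits the input into partitions that are processed in parallel.
    records = db.read_text("logs-*.ndjson").map(json.loads)
    errors = records.filter(lambda r: r.get("level") == "error")
    print(errors.count().compute())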
Would sampling the JSON down to 20MB and running jq experimentally until one has found an adequate solution be a decent alternative approach?

It depends on the dataset, one supposes.
There was an interesting article on this recently: https://news.ycombinator.com/item?id=31004563

It prompted quite a bit of conversation and discussion and, in the end, an updated benchmark across a variety of tools, conveniently right at the 10GB dataset size: https://colab.research.google.com/github/dcmoura/spyql/blob/master/notebooks/json_benchmark.ipynb
jq does support slurp mode, so you should be able to do this using that... granted I've never attempted this and the syntax is very different.

--- edit ---

I used the wrong term; the correct term is streaming mode.
I would seriously consider sqlite-utils here.

https://sqlite-utils.datasette.io/en/stable/cli.html
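A hedged sketch of what that might look like, using sqlite-utils as a Python library and assuming line-delimited input (the CLI's `sqlite-utils insert ... --nl` does roughly the same thing); file and table names are placeholders.

    import json
    from sqlite_utils import Database

    db = Database("data.db")
    with open("records.ndjson") as f:
        # insert_all takes an iterable of dicts and creates the table/columns for you.
        db["records"].insert_all(json.loads(line) for line in f)
    print(db["records"].count)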
OctoSQL [0] or DuckDB [1] will most likely be much simpler, while going through 10 GB of JSON in a couple seconds at most.

Disclaimer: author of OctoSQL

[0]: https://github.com/cube2222/octosql

[1]: https://duckdb.org/
Allow me to advertise my zsh jq plugin +jq-repl:
https://github.com/reegnz/jq-zsh-plugin

I find that for big datasets choosing the right format is crucial. Use json-lines format + some shell filtering (eg. head, tail to limit the range, egrep or ripgrep for the more trivial filtering) to reduce the dataset to a couple of megabytes, then use that jq-repl of mine to iterate fast on the final jq expression.

I found that the REPL form factor works really well when you don't exactly know what you're digging for.
If it could be tabular in nature, maybe convert to sqlite3 so you can make use of indexing, or CSV to make use of high-performance tools like xsv or zsv (the latter of which I'm an author).

https://github.com/liquidaty/zsv/blob/main/docs/csv_json_sqlite.md

https://github.com/BurntSushi/xsv
> Also note that this approach generalizes to other text-based formats. If you have 10 gigabyte of CSV, you can use Miller for processing. For binary formats, you could use fq if you can find a workable record separator.

You can also generalize it without learning a new minilanguage by using https://github.com/tyleradams/json-toolkit which converts csv/binary/whatever to/from json
Rust's serde-json will iterate over a file of JSON without difficulty, and will write one from an iterative process without building it all in memory.
I routinely create and read multi-gigabyte JSON files. They're debug dumps of the scene my metaverse viewer is looking at.

Streaming from large files was routine for XML, but for some reason, JSON users don't seem to work with streams much.
Flare’s (https://blog.datalust.co/a-tour-of-seqs-storage-engine/) command line tool can query CLEF-formatted (newline-delimited) JSON files and is perhaps an order of magnitude faster.

Good for searching and aggregating. Probably not great for transformation.
You can deserialize the JSONs and filter the resulting arrays or lists. For C# the IDE can automatically generate the classes from JSON and I think there are tools for other languages to generate data structures from JSON.
I like SQLite and JSON columns. I wonder how fast it would be if you save the whole JSON file in one record and then query SQLite. I bet it’s fast.

You could probably use that one record to then build tables in SQLite that you can query.
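A rough sketch of that idea with Python's sqlite3 and SQLite's built-in JSON functions (json_each/json_extract, available when the JSON1 extension is compiled in, as it is in modern SQLite builds); storing the whole file as one value and the '$.items' path are assumptions about the data.

    import sqlite3

    conn = sqlite3.connect("analysis.db")
    conn.execute("CREATE TABLE IF NOT EXISTS dump (doc TEXT)")
    with open("big.json") as f:
        conn.execute("INSERT INTO dump VALUES (?)", (f.read(),))

    # Query elements of a top-level "items" array straight out of the stored blob.
    rows = conn.execute("""
        SELECT json_extract(value, '$.id')
        FROM dump, json_each(dump.doc, '$.items')
        LIMIT 10
    """).fetchall()
    print(rows)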
What I would have done is first create a map of the file, just the keys and shapes, without the data. That way I can traverse the file. And then mmap the file to traverse and read the data.

A couple of dozen lines of code would do it.
Note the Go standard library has a streaming parser:

https://go.dev/play/p/O2WWn0qQrP6
Might be useful for some - https://github.com/kashifrazzaqui/json-streamer
The real trick is to do the debugging/exploration on a small subset of the data. Then usually you don't need all these extra measures, because the real processing is only done a small number of times.