A few months ago: <a href="https://news.ycombinator.com/item?id=22745351" rel="nofollow">https://news.ycombinator.com/item?id=22745351</a><p>2019: <a href="https://news.ycombinator.com/item?id=19214387" rel="nofollow">https://news.ycombinator.com/item?id=19214387</a>
I would consider Daniel Lemire (the main author) quite an authority on the practical use of vectorization (SIMD). He is a computer science professor at Université du Québec and is also behind the popular Roaring Bitmaps project [1]. You can check out his publication list here [2].<p>[1] <a href="https://roaringbitmap.org/" rel="nofollow">https://roaringbitmap.org/</a><p>[2] <a href="https://lemire.me/en/#publications" rel="nofollow">https://lemire.me/en/#publications</a>
Gigabytes per second can be a worrying statistic. It suggests that the benchmarks are parsing massive JSON files rather than the small ones that real-world applications deal with.<p>However, this library maintains roughly constant throughput for both small (e.g. 300-byte) and large documents, if its benchmarks are accurate.
The GitHub page links to a video that explains some of the internals [1]. Can someone comment on the result they show at 14:26?<p>My understanding is that they run code that makes 2000 branches based on a pseudo-random sequence. Over around 10 runs of that code, the CPU supposedly learns to correctly predict those 2000 branches and performance steadily increases.<p>Do modern branch predictors really have the capability to remember an exact sequence of 2000 past decisions on the same branch instruction? Also, why would the performance increase incrementally like that? I would imagine it would remember the loop history on the first run and achieve maximum performance on the second run.<p>I doubt there's really a neural net in the silicon doing this, as the author speculates.<p>[1] <a href="https://youtu.be/wlvKAT7SZIQ?t=864" rel="nofollow">https://youtu.be/wlvKAT7SZIQ?t=864</a>
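If it helps, here is a minimal standalone sketch (my own, not the code from the talk) of the kind of experiment being described: run the same fixed pseudo-random sequence of 2000 branches several times and time each pass. Be aware that an optimizing compiler may replace the branch with a conditional move, and timings at this scale are noisy, so checking the generated assembly and the branch-misses perf counter gives a much cleaner signal than wall-clock time.<p><pre><code> // build with e.g. g++ -O2 branch_demo.cpp
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 gen(42);              // fixed seed: every pass sees the same sequence
    std::vector<int> bits(2000);
    for (auto &b : bits) b = gen() & 1;

    volatile long long sink = 0;       // keeps the work from being optimized away
    for (int pass = 0; pass < 20; ++pass) {
        auto t0 = std::chrono::steady_clock::now();
        for (int b : bits) {
            if (b) sink += 3;          // data-dependent branch the predictor must learn
            else   sink -= 1;
        }
        auto t1 = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("pass %2d: %lld ns\n", pass, (long long)ns);
    }
    return 0;
}
</code></pre>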
For other folks interested in using this in Node.js, the performance of `simdjson.parse()` is currently slower than `JSON.parse()` due to the way C++ objects are converted to JS objects. It seems the same problem affects a Python implementation as well.<p>Performance-sensitive json-parsing Node users must do this instead:<p><pre><code> require("simdjson").lazyParse(jsonString).valueForKeyPath("foo.bar[1]")
</code></pre>
<a href="https://github.com/luizperes/simdjson_nodejs/issues/5" rel="nofollow">https://github.com/luizperes/simdjson_nodejs/issues/5</a>
SQLite can seemingly parse and process gigabytes of JSON per second. I was pretty shocked by its performance when I tried it out the other month. I ran all kinds of queries on JSON structures and it was so fast.
An idea I had a few years ago which someone might be able to run with is to develop new charsets based on the underlying data, not just some arbitrary numerical range.<p>The idea being that characters that are more common in the underlying language would be represented as lower integers and then use varint encoding so that the data itself is smaller.<p>I did some experiments here and was able to compress our data by 25-45% in many situations.<p>There are multiple issues here though. If you're compressing the data anyway you might not have as big of a win in terms of storage but you still might if you still need to decode the data into its original text.
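To make the idea concrete, here is a rough sketch (my own illustration, nothing from an existing library): rank the code points that occur in a sample of the data by frequency, assign the smallest integers to the most frequent characters, and emit each code as a LEB128-style varint so the common characters of the target language end up as a single byte.<p><pre><code> #include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

using CodePoint = uint32_t;

// Most frequent code point gets code 0, the next gets 1, and so on.
std::unordered_map<CodePoint, uint32_t>
build_codes(const std::vector<CodePoint> &corpus) {
    std::unordered_map<CodePoint, uint64_t> freq;
    for (CodePoint cp : corpus) ++freq[cp];
    std::vector<std::pair<CodePoint, uint64_t>> ranked(freq.begin(), freq.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto &a, const auto &b) { return a.second > b.second; });
    std::unordered_map<CodePoint, uint32_t> code;
    for (uint32_t rank = 0; rank < ranked.size(); ++rank)
        code[ranked[rank].first] = rank;
    return code;
}

// LEB128-style varint: 7 payload bits per byte, high bit marks continuation,
// so codes 0..127 (the most frequent characters) take exactly one byte.
void put_varint(std::vector<uint8_t> &out, uint32_t v) {
    while (v >= 0x80) { out.push_back(uint8_t(v) | 0x80); v >>= 7; }
    out.push_back(uint8_t(v));
}

std::vector<uint8_t> encode(const std::vector<CodePoint> &text,
                            const std::unordered_map<CodePoint, uint32_t> &code) {
    std::vector<uint8_t> out;
    for (CodePoint cp : text) put_varint(out, code.at(cp));
    return out;
}
</code></pre>
The code table itself has to be stored or agreed on out of band, and as noted above, general-purpose compression applied to the original text may already capture much of the same gain.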
And if you're looking for a fast JSON lib for CPython, orjson[1] (written in Rust) is the best I've found.<p>[1] <a href="https://github.com/ijl/orjson#performance" rel="nofollow">https://github.com/ijl/orjson#performance</a>
I never thought I’d write this, but we have officially entered a golden age for C++ JSON utils. They are everywhere, and springing up right and left. It is a great time to be alive.
Just noting that this library requires that you are able to hold the expanded document in memory.<p>I needed to parse a very large JSON document and pull out a subset of the data, which didn't work because the document exceeded available RAM.
So what is the fastest JSON library available? orjson claims they are the fastest but they don't benchmark simdjson. simdjson claims they are the fastest but did they forget to benchmark anything?
The author gave a talk last month, which can be viewed on YouTube:<p><a href="https://www.youtube.com/watch?v=p6X8BGSrR9w" rel="nofollow">https://www.youtube.com/watch?v=p6X8BGSrR9w</a>
I use Emacs with lsp-mode (Language Server Protocol) a lot (for Haskell, Rust, Elm and even Java), and there was a dramatic speedup from Emacs 27 onwards when it started using jansson for JSON parsing.<p>I don't think it's the bottleneck at the moment, but it's good to know there are faster parsers out there. I had a quick search but couldn't find any plans to incorporate simdjson, besides a thread from last year on the Emacs China forums.
Very impressive. Still, there are a couple of issues there.<p>This comment is incorrect:
<a href="https://github.com/simdjson/simdjson/blob/v0.4.7/src/haswell/simd.h#L111" rel="nofollow">https://github.com/simdjson/simdjson/blob/v0.4.7/src/haswell...</a><p>The behavior of that instruction is well specified for all inputs. If the high bit is set, the corresponding output byte will be 0. If the high bit is zero, only the lower 4 bits will be used for the index. Ability to selectively zero out some bytes while shuffling is useful sometimes.<p>I’m not sure about this part:
<a href="https://github.com/simdjson/simdjson/blob/v0.4.7/src/simdprune_tables.h#L9-L11" rel="nofollow">https://github.com/simdjson/simdjson/blob/v0.4.7/src/simdpru...</a>
The popcnt instruction is very fast: the latency is 3 cycles on Skylake and only 1 cycle on Zen 2. It produces the same result without RAM loads and therefore without taking up precious L1D space. The code does use popcnt in some places, but apparently the lookup table is still used in others.
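To illustrate the zeroing behavior described above, here is a tiny standalone demo (my own code, not from simdjson) using the 128-bit _mm_shuffle_epi8 intrinsic; the AVX2 vpshufb used in the Haswell kernel behaves the same way within each 128-bit lane.<p><pre><code> // build with e.g. g++ -O2 -mssse3 pshufb_demo.cpp
#include <cstdio>
#include <immintrin.h>

int main() {
    unsigned char src[16], out[16];
    for (int i = 0; i < 16; ++i) src[i] = (unsigned char)(0xA0 + i);

    // Positions 3 and 7 have the high bit set (-> output byte is zeroed);
    // position 5 uses 0x1F, whose low 4 bits (0xF) select src[15].
    unsigned char idx[16] = { 0, 1, 2, 0x80, 4, 0x1F, 6, 0x83,
                              8, 9, 10, 11, 12, 13, 14, 15 };

    __m128i v = _mm_loadu_si128((const __m128i *)src);
    __m128i m = _mm_loadu_si128((const __m128i *)idx);
    __m128i r = _mm_shuffle_epi8(v, m);   // pshufb

    _mm_storeu_si128((__m128i *)out, r);
    for (int i = 0; i < 16; ++i) std::printf("%02x ", out[i]);
    std::printf("\n");  // expect: a0 a1 a2 00 a4 af a6 00 a8 a9 aa ab ac ad ae af
    return 0;
}
</code></pre>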
There's an R (#rstats) wrapper as well: <a href="https://github.com/eddelbuettel/rcppsimdjson" rel="nofollow">https://github.com/eddelbuettel/rcppsimdjson</a>
It seems this is for parsing multiple JSONs, each a few MBs at most. What does one do if they have a <i>single</i> 100GB JSON file? :)<p>ie.<p><pre><code> {
// 100GB of data
}</code></pre>