This is very cool. Meanwhile, in the xi-editor project, we're struggling with the fact that Swift JSON parsing is very slow [1]. My benchmarking clocked in at 0.00089 GB/s for Swift 4, and things don't seem to have improved much with Swift 5. I'm encouraging people on that issue to do a blog post.<p>[1]: <a href="https://github.com/xi-editor/xi-mac/issues/102" rel="nofollow">https://github.com/xi-editor/xi-mac/issues/102</a>
One of the two authors here. Happy to answer questions.<p>The intent was to open this up but not publicize it at this stage, but Hacker News seems to find stuff anyway. Wouldn't surprise me if plenty of folks follow Daniel Lemire on GitHub, as his stuff is always interesting.
If you're working with JSON documents on the larger end, quite often you don't need all of the document, just a small part of it. For that workload, the trick is to parse as little data as possible: skip validation, locate the relevant bits, and only then do the full parse, validation and all (a toy sketch of the idea follows below). In that setting, optimizing the JSON scanner/lexer gives a much greater improvement than optimizing the parser.<p>This job is trickier than it may look, though. The logic to extract the "relevant" bits is often dynamic or tied to user input, but for the scanner/lexer to be ultra-fast it has to be tightly compiled. You can try JITting it, but libLLVM is probably too heavyweight for parsing JSON.
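A toy sketch of that "locate first, parse later" idea, assuming the simplest possible case (a unique top-level key whose string value contains no escaped quotes); the helper name here is made up for illustration:

    #include <optional>
    #include <string>
    #include <string_view>

    // Hypothetical helper: scan the raw JSON text for `"key":` and return the raw
    // bytes of the string value that follows, without validating or parsing the
    // rest of the document. Deliberately naive: no handling of \" escapes,
    // nesting, or duplicate keys.
    std::optional<std::string_view> find_raw_string_value(std::string_view json,
                                                          std::string_view key) {
        std::string quoted = "\"" + std::string(key) + "\"";
        size_t pos = json.find(quoted);
        if (pos == std::string_view::npos) return std::nullopt;
        pos = json.find(':', pos + quoted.size());
        if (pos == std::string_view::npos) return std::nullopt;
        size_t start = json.find('"', pos);       // opening quote of the value
        if (start == std::string_view::npos) return std::nullopt;
        size_t end = json.find('"', start + 1);   // closing quote (no escape handling)
        if (end == std::string_view::npos) return std::nullopt;
        return json.substr(start + 1, end - start - 1);
    }

Only the slice this returns ever needs to go through a real parser; the scan itself is just a byte search over the buffer.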
Number handling looks like it could be a problem. There are test suites for JSON parsers, and lots of parsers fail many of these tests. Check e.g. <a href="https://github.com/nst/JSONTestSuite" rel="nofollow">https://github.com/nst/JSONTestSuite</a>, which checks compliance against RFC 8259.<p>Publishing results against this could be useful both for assessing how good this parser is and for establishing and documenting any known issues. If correctness is not a goal, that can still be fine, but finding out your parser of choice doesn't handle common JSON emitted by other systems can be annoying.<p>Regarding the numbers, I've run into a few cases where Jackson's ability to parse BigIntegers and BigDecimals was very useful to me. Silently rounding to doubles or floats can be lossy, and failing on some documents just because a value exceeds max long/int can be an issue as well (see the example below).
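A small illustration of the rounding point, independent of any particular parser: 2^53 + 1 fits in a 64-bit integer but not in a double, so a parser that funnels every number through double silently changes it.

    #include <cstdint>
    #include <cstdio>

    int main() {
        // 2^53 + 1 is representable as an int64_t but not as a double.
        int64_t exact = 9007199254740993LL;
        double as_double = static_cast<double>(exact);  // what a double-only parser keeps
        std::printf("%lld -> %.0f\n", (long long)exact, as_double);
        // prints: 9007199254740993 -> 9007199254740992
    }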
> We store strings as NULL terminated C strings. Thus we implicitly assume that you do not include a NULL character within your string, which is allowed technically speaking if you escape it (\u0000).<p>I've lost count of the broken JSON parsers that fall over on exactly that (small demonstration below).
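A generic C++ demonstration of why that bites (not code from the parser under discussion): the decoded value of the JSON string "ab\u0000cd" is five bytes, but anything that re-measures it as a NUL-terminated C string silently throws away everything after the NUL.

    #include <cstdio>
    #include <cstring>
    #include <string>

    int main() {
        // Decoded value of the JSON string "ab\u0000cd": five bytes, one of them NUL.
        std::string decoded("ab\0cd", 5);
        std::printf("std::string length: %zu\n", decoded.size());                 // 5
        // Any API that re-measures the data with strlen() truncates it.
        std::printf("strlen length:      %zu\n", std::strlen(decoded.c_str()));   // 2
    }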
I feel like if you need to parse gigabytes per second of JSON, you should probably think about using a more efficient serialization format than JSON. Binary formats are not much harder to generate and can save a lot of bandwidth and CPU time.
I guess the question is, what do you parse it into? I'm guessing definitely not turning objects into std::unordered_map and arrays into std::vector or some such. So how easy is it to use the "parsed" data structure? How easy is it to add an element to some deeply nested array, for example?
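For context on why that question matters, the textbook answer looks roughly like the sketch below, where every object, array and string is its own heap allocation; that per-node cost is exactly what a gigabytes-per-second parser has to avoid or amortize. This is only a generic illustration, not the representation this library actually uses.

    #include <string>
    #include <utility>
    #include <vector>

    // A textbook DOM node: convenient to navigate and mutate, but allocation-heavy
    // compared to an index or "tape" into the original buffer.
    struct JsonValue {
        enum class Kind { Null, Boolean, Number, String, Array, Object } kind = Kind::Null;
        bool        boolean = false;
        double      number  = 0.0;
        std::string text;                                         // Kind::String
        std::vector<JsonValue> array;                             // Kind::Array
        std::vector<std::pair<std::string, JsonValue>> object;    // Kind::Object
    };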
> <i>Requirements: […] A processor with AVX2 (i.e., Intel processors starting with the Haswell microarchitecture released 2013, and processors from AMD starting with Ryzen)</i>
I wonder how this compares to fast.json: "Fastest JSON parser in the world is a D project?" (<a href="https://news.ycombinator.com/item?id=10430951" rel="nofollow">https://news.ycombinator.com/item?id=10430951</a>), both in an implementation/approach sense and in terms of performance.
Will this work on JSON files that are larger than the available system memory?<p>Firebase backups are huge JSON files and we haven’t found a good way to deal with them.<p>There are some “streaming JSON parsers” that we have wrestled with but they are buggy.
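FWIW, the usual way around files larger than memory is SAX-style streaming rather than building a DOM. A rough sketch of what that looks like with RapidJSON's SAX interface (not the parser from the article; the key name "email" and the file name are just placeholders):

    #include <cstdio>
    #include <cstring>
    #include <rapidjson/filereadstream.h>
    #include <rapidjson/reader.h>

    using namespace rapidjson;

    // Handler that only reacts to the events it cares about; every other event
    // falls through to the defaults in BaseReaderHandler (which just return true).
    struct KeyCounter : BaseReaderHandler<UTF8<>, KeyCounter> {
        size_t count = 0;
        bool Key(const char* str, SizeType len, bool) {
            if (len == 5 && std::memcmp(str, "email", 5) == 0) ++count;
            return true;
        }
    };

    int main() {
        std::FILE* fp = std::fopen("backup.json", "rb");
        if (!fp) return 1;
        char buffer[65536];
        FileReadStream is(fp, buffer, sizeof(buffer));  // reads the file in 64 KB chunks
        KeyCounter handler;
        Reader reader;
        reader.Parse(is, handler);                      // never holds the whole document
        std::printf("saw %zu \"email\" keys\n", handler.count);
        std::fclose(fp);
    }

The DOM never exists, so memory use stays flat regardless of file size; the downside is that your logic has to be written as event callbacks.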
Perhaps I'm misunderstanding or don't have a good enough grasp of this, but in what circumstances would you need to parse gigabytes of JSON? I've only seen it used in config files, so...
If this kind of work is interesting to you, you might like Daniel Lemire's blog (<a href="https://lemire.me/blog/" rel="nofollow">https://lemire.me/blog/</a>).<p>He's a professor, but his work is highly applied and immediately usable. He manages to find and demonstrate a lot of code where we assume big-O analysis tells the whole story, but the realities of modern processors and caching (etc.) mean very different performance in practice.
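A classic illustration of that point (a generic example, not taken from his blog): both loops below are O(n^2) and touch exactly the same elements, but the row-major traversal walks memory sequentially while the column-major one strides across it, and on typical hardware the second is several times slower purely because of caching.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = 4096;
        std::vector<int> m(n * n, 1);   // ~64 MB, far larger than any cache
        long long sum = 0;

        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < n; ++i)          // row-major: sequential, cache-friendly
            for (size_t j = 0; j < n; ++j)
                sum += m[i * n + j];
        auto t1 = std::chrono::steady_clock::now();
        for (size_t j = 0; j < n; ++j)          // column-major: strided, cache-hostile
            for (size_t i = 0; i < n; ++i)
                sum += m[i * n + j];
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf("sum=%lld  row-major: %lld ms  column-major: %lld ms\n", sum,
                    (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
    }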
Thanks for posting. I've been working with lidar/robotics data more recently, and it's nice to work with JSON directly when the performance is good enough.
> All JSON is JavaScript, but not all JavaScript is JSON<p>Really? I thought the specifications diverged long ago. For example, unescaped U+2028/U+2029 characters are valid inside JSON strings but, before ES2019, were a syntax error in JavaScript string literals (though relying on such corner cases could be discouraged anyway).