This is very cool. Meanwhile, in the xi-editor project, we're struggling with the fact that Swift JSON parsing is very slow [1]. My benchmarking clocked in at 0.00089 GB/s for Swift 4, and things don't seem to have improved much with Swift 5. I'm encouraging people on that issue to do a blog post.<p>[1]: <a href="https://github.com/xi-editor/xi-mac/issues/102" rel="nofollow">https://github.com/xi-editor/xi-mac/issues/102</a>
One of the two authors here. Happy to answer questions.<p>The intent was to open this up but not publicize it at this stage, but Hacker News seems to find stuff anyway. Wouldn't surprise me if plenty of folks follow Daniel Lemire on GitHub, as his stuff is always interesting.
If you're working with JSON documents on the larger end, quite often you don't need all of the document, just a small part of it. For that workload, the trick is to parse as little data as possible: skip validation, locate the relevant bits, and only then do the full parse, validation and all (a toy sketch of the idea follows below). In that setting, optimizing the JSON scanner/lexer gives a much greater improvement than optimizing the parser.<p>This job is trickier than it may look, though. The logic to extract the "relevant" bits is often dynamic or tied to user input, but for the scanner/lexer to be ultra-fast it has to be tightly compiled. You can try JITting it, but libLLVM is probably too heavyweight for parsing JSON.
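A toy sketch of that "locate first, parse later" idea, assuming the simplest possible case (a unique top-level key whose string value contains no escaped quotes); the helper name here is made up for illustration:

    #include <optional>
    #include <string>
    #include <string_view>

    // Hypothetical helper: scan the raw JSON text for `"key":` and return the raw
    // bytes of the string value that follows, without validating or parsing the
    // rest of the document. Deliberately naive: no handling of \" escapes,
    // nesting, or duplicate keys.
    std::optional<std::string_view> find_raw_string_value(std::string_view json,
                                                          std::string_view key) {
        std::string quoted = "\"" + std::string(key) + "\"";
        size_t pos = json.find(quoted);
        if (pos == std::string_view::npos) return std::nullopt;
        pos = json.find(':', pos + quoted.size());
        if (pos == std::string_view::npos) return std::nullopt;
        size_t start = json.find('"', pos);       // opening quote of the value
        if (start == std::string_view::npos) return std::nullopt;
        size_t end = json.find('"', start + 1);   // closing quote (no escape handling)
        if (end == std::string_view::npos) return std::nullopt;
        return json.substr(start + 1, end - start - 1);
    }

Only the slice this returns ever needs to go through a real parser; the scan itself is just a byte search over the buffer.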
Number handling looks like it could be a problem. There are test suites for JSON parsers, and lots of parsers fail many of these tests. Check e.g. <a href="https://github.com/nst/JSONTestSuite" rel="nofollow">https://github.com/nst/JSONTestSuite</a>, which checks compliance against RFC 8259.<p>Publishing results against this could be useful both for assessing how good this parser is and for establishing and documenting any known issues. If correctness is not a goal, that can still be fine, but finding out your parser of choice doesn't handle common JSON emitted by other systems can be annoying.<p>Regarding the numbers, I've run into a few cases where Jackson's ability to parse BigIntegers and BigDecimals was very useful to me. Silently rounding to doubles or floats can be lossy, and failing on some documents just because a value exceeds max long/int can be an issue as well (see the example below).
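A small illustration of the rounding point, independent of any particular parser: 2^53 + 1 fits in a 64-bit integer but not in a double, so a parser that funnels every number through double silently changes it.

    #include <cstdint>
    #include <cstdio>

    int main() {
        // 2^53 + 1 is representable as an int64_t but not as a double.
        int64_t exact = 9007199254740993LL;
        double as_double = static_cast<double>(exact);  // what a double-only parser keeps
        std::printf("%lld -> %.0f\n", (long long)exact, as_double);
        // prints: 9007199254740993 -> 9007199254740992
    }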
> We store strings as NULL terminated C strings. Thus we implicitly assume that you do not include a NULL character within your string, which is allowed technically speaking if you escape it (\u0000).<p>I've lost count of the broken JSON parsers that fall over on exactly that (small demonstration below).
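A generic C++ demonstration of why that bites (not code from the parser under discussion): the decoded value of the JSON string "ab\u0000cd" is five bytes, but anything that re-measures it as a NUL-terminated C string silently throws away everything after the NUL.

    #include <cstdio>
    #include <cstring>
    #include <string>

    int main() {
        // Decoded value of the JSON string "ab\u0000cd": five bytes, one of them NUL.
        std::string decoded("ab\0cd", 5);
        std::printf("std::string length: %zu\n", decoded.size());                 // 5
        // Any API that re-measures the data with strlen() truncates it.
        std::printf("strlen length:      %zu\n", std::strlen(decoded.c_str()));   // 2
    }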
I feel like if you need to parse gigabytes per second of JSON, you should probably think about using a more efficient serialization format than JSON. Binary formats are not much harder to generate and can save a lot of bandwidth and CPU time.
I guess the question is, what do you parse it into? I'm guessing definitely not turning objects into std::unordered_map and arrays into std::vector or some such. So how easy is it to use the "parsed" data structure? How easy is it to add an element to some deeply nested array, for example?
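For context on why that question matters, the textbook answer looks roughly like the sketch below, where every object, array and string is its own heap allocation; that per-node cost is exactly what a gigabytes-per-second parser has to avoid or amortize. This is only a generic illustration, not the representation this library actually uses.

    #include <string>
    #include <utility>
    #include <vector>

    // A textbook DOM node: convenient to navigate and mutate, but allocation-heavy
    // compared to an index or "tape" into the original buffer.
    struct JsonValue {
        enum class Kind { Null, Boolean, Number, String, Array, Object } kind = Kind::Null;
        bool        boolean = false;
        double      number  = 0.0;
        std::string text;                                         // Kind::String
        std::vector<JsonValue> array;                             // Kind::Array
        std::vector<std::pair<std::string, JsonValue>> object;    // Kind::Object
    };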
> <i>Requirements: […] A processor with AVX2 (i.e., Intel processors starting with the Haswell microarchitecture released 2013, and processors from AMD starting with Ryzen)</i>
I wonder how this compares to fast.json: "Fastest JSON parser in the world is a D project?" (<a href="https://news.ycombinator.com/item?id=10430951" rel="nofollow">https://news.ycombinator.com/item?id=10430951</a>), both in an implementation/approach sense and in terms of performance.
Will this work on JSON files that are larger than the available system memory?<p>Firebase backups are huge JSON files and we haven’t found a good way to deal with them.<p>There are some “streaming JSON parsers” that we have wrestled with but they are buggy.
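FWIW, the usual way around files larger than memory is SAX-style streaming rather than building a DOM. A rough sketch of what that looks like with RapidJSON's SAX interface (not the parser from the article; the key name "email" and the file name are just placeholders):

    #include <cstdio>
    #include <cstring>
    #include <rapidjson/filereadstream.h>
    #include <rapidjson/reader.h>

    using namespace rapidjson;

    // Handler that only reacts to the events it cares about; every other event
    // falls through to the defaults in BaseReaderHandler (which just return true).
    struct KeyCounter : BaseReaderHandler<UTF8<>, KeyCounter> {
        size_t count = 0;
        bool Key(const char* str, SizeType len, bool) {
            if (len == 5 && std::memcmp(str, "email", 5) == 0) ++count;
            return true;
        }
    };

    int main() {
        std::FILE* fp = std::fopen("backup.json", "rb");
        if (!fp) return 1;
        char buffer[65536];
        FileReadStream is(fp, buffer, sizeof(buffer));  // reads the file in 64 KB chunks
        KeyCounter handler;
        Reader reader;
        reader.Parse(is, handler);                      // never holds the whole document
        std::printf("saw %zu \"email\" keys\n", handler.count);
        std::fclose(fp);
    }

The DOM never exists, so memory use stays flat regardless of file size; the downside is that your logic has to be written as event callbacks.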
Perhaps I'm misunderstanding or don't have a good enough grasp of this, but in what circumstances would you need to parse gigabytes of JSON? I've only seen it used in config files, so...
If this kind of work is interesting to you, you might like Daniel Lemire's blog (<a href="https://lemire.me/blog/" rel="nofollow">https://lemire.me/blog/</a>).<p>He's a professor, but his work is highly applied and immediately usable. He manages to find and demonstrate a lot of code where we assume big-O analysis tells the whole story, but the realities of modern processors and caching (etc.) mean very different performance in practice.
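A classic illustration of that point (a generic example, not taken from his blog): both loops below are O(n^2) and touch exactly the same elements, but the row-major traversal walks memory sequentially while the column-major one strides across it, and on typical hardware the second is several times slower purely because of caching.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = 4096;
        std::vector<int> m(n * n, 1);   // ~64 MB, far larger than any cache
        long long sum = 0;

        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < n; ++i)          // row-major: sequential, cache-friendly
            for (size_t j = 0; j < n; ++j)
                sum += m[i * n + j];
        auto t1 = std::chrono::steady_clock::now();
        for (size_t j = 0; j < n; ++j)          // column-major: strided, cache-hostile
            for (size_t i = 0; i < n; ++i)
                sum += m[i * n + j];
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf("sum=%lld  row-major: %lld ms  column-major: %lld ms\n", sum,
                    (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
    }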
Thanks for posting. I've been working with lidar/robotics data more recently, and it's nice to work with JSON directly when the performance is good enough.
> All JSON is JavaScript, but not all JavaScript is JSON<p>Really? I thought the specifications diverged long ago. For example, unescaped U+2028/U+2029 characters are valid inside JSON strings but, before ES2019, were a syntax error in JavaScript string literals (though relying on such corner cases could be discouraged anyway).