
Building a high performance JSON parser

532 points · by davecheney · over 1 year ago

32 comments

jchw · over 1 year ago
Looks pretty good! Even though I've written far too many JSON parsers already in my career, it's really nice to have a reference for how to think about making a reasonable, fast JSON parser, going through each step individually.

That said, I will say one thing: you don't *really* need an explicit tokenizer for JSON. You can get rid of the concept of tokens and integrate parsing and tokenization *entirely*. This is what I usually do, since it makes everything simpler. This is a lot harder to do with something like the rest of ECMAScript, because there you wind up needing look-ahead (sometimes arbitrarily large look-ahead... consider arrow functions: they're mostly a subset of the grammar of a parenthesized expression. Comma is an operator, and for default values, equals is an operator. It isn't until the => does or does not appear that you know for sure!)
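To make the "no explicit tokenizer" point concrete, here is a minimal sketch of the merged approach in Go (my own toy code, not the article's parser; numbers, objects and string escapes are omitted): the parser dispatches on a single byte of look-ahead and consumes the input directly, with no token type anywhere.

```go
// Package jsonsketch is a toy illustration of parsing JSON without a
// separate token stream: the parser dispatches on the next byte directly.
package jsonsketch

import "fmt"

type parser struct {
	buf []byte
	pos int
}

func (p *parser) skipSpace() {
	for p.pos < len(p.buf) && (p.buf[p.pos] == ' ' || p.buf[p.pos] == '\t' || p.buf[p.pos] == '\n' || p.buf[p.pos] == '\r') {
		p.pos++
	}
}

// parseValue needs only one byte of look-ahead, which is all JSON requires.
func (p *parser) parseValue() (any, error) {
	p.skipSpace()
	if p.pos >= len(p.buf) {
		return nil, fmt.Errorf("unexpected end of input")
	}
	switch p.buf[p.pos] {
	case 'n':
		return p.parseLiteral("null", nil)
	case 't':
		return p.parseLiteral("true", true)
	case 'f':
		return p.parseLiteral("false", false)
	case '"':
		return p.parseString()
	case '[':
		return p.parseArray()
	default:
		return nil, fmt.Errorf("unexpected byte %q at %d", p.buf[p.pos], p.pos)
	}
}

func (p *parser) parseLiteral(lit string, v any) (any, error) {
	if p.pos+len(lit) > len(p.buf) || string(p.buf[p.pos:p.pos+len(lit)]) != lit {
		return nil, fmt.Errorf("invalid literal at %d", p.pos)
	}
	p.pos += len(lit)
	return v, nil
}

// parseString handles only escape-free strings, to keep the sketch short.
func (p *parser) parseString() (any, error) {
	p.pos++ // consume opening quote
	start := p.pos
	for p.pos < len(p.buf) && p.buf[p.pos] != '"' {
		p.pos++
	}
	if p.pos >= len(p.buf) {
		return nil, fmt.Errorf("unterminated string")
	}
	s := string(p.buf[start:p.pos])
	p.pos++ // consume closing quote
	return s, nil
}

func (p *parser) parseArray() (any, error) {
	p.pos++ // consume '['
	out := []any{}
	for {
		p.skipSpace()
		if p.pos < len(p.buf) && p.buf[p.pos] == ']' {
			p.pos++
			return out, nil
		}
		v, err := p.parseValue()
		if err != nil {
			return nil, err
		}
		out = append(out, v)
		p.skipSpace()
		if p.pos >= len(p.buf) {
			return nil, fmt.Errorf("unterminated array")
		}
		switch p.buf[p.pos] {
		case ',':
			p.pos++ // note: a trailing comma is not rejected in this toy
		case ']':
			p.pos++
			return out, nil
		default:
			return nil, fmt.Errorf("expected ',' or ']' at %d", p.pos)
		}
	}
}
```

Because every JSON value is identified by its first non-whitespace byte, the dispatch never needs more look-ahead than this.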
eatonphil · over 1 year ago
The walkthrough is very nice: how to do this if you're going to do it.

If you're going for pure performance in a production environment, you might take a look at Daniel Lemire's work: https://github.com/simdjson/simdjson. Or the MinIO port of it to Go: https://github.com/minio/simdjson-go.
wood_spirit · over 1 year ago
My own lessons from writing fast JSON parsers have a lot of language-specific things, but here are some generalisations:

Avoid heap allocations in tokenising. Have a tokeniser that is a function that returns a stack-allocated struct, or an int64 token that is a packed field describing the start, length, type, etc. of the token.

Avoid heap allocations in parsing: support a getString(key String) type interface for clients that want to chop up a buffer.

For deserialising to objects where you know the fields at compile time, generally generate a switch case on key length before comparing string values.

My experience in data pipelines that process lots of JSON is that the choice of JSON library can make a 3-10x performance difference, and that all the main parsers want to allocate objects.

If the classes you are serialising or deserialising are known at compile time, then Jackson (Java) does a good job, but you can get a 2x boost with careful coding and profiling.

Whereas if you are parsing arbitrary JSON, all the mainstream parsers want to do lots of allocations that a more intrusive parser you write yourself can avoid, and you can make massive performance wins if you are processing thousands or millions of objects per second.
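As a hedged illustration of the packed-token idea (my own Go sketch, not wood_spirit's code; the names Token, Kind and pack are invented): the start offset, length and kind fit in one 64-bit word, so the tokenizer returns plain values and never touches the heap.

```go
package tokensketch

// Kind identifies the token type stored in the low bits of the packed word.
type Kind uint64

const (
	KindString Kind = iota
	KindNumber
	KindTrue
	KindFalse
	KindNull
	KindPunct
)

// Token packs kind (4 bits), length (28 bits) and start offset (32 bits)
// into one machine word, so returning a token allocates nothing.
type Token uint64

func pack(kind Kind, start, length int) Token {
	return Token(uint64(kind) | uint64(length)<<4 | uint64(start)<<32)
}

func (t Token) Kind() Kind { return Kind(t & 0xF) }
func (t Token) Len() int   { return int((t >> 4) & 0x0FFFFFFF) }
func (t Token) Start() int { return int(t >> 32) }

// Bytes returns the token's slice of the original buffer; the caller decides
// whether to copy it into a string, so the tokenizer itself never allocates.
func (t Token) Bytes(buf []byte) []byte {
	return buf[t.Start() : t.Start()+t.Len()]
}
```

A getString(key)-style accessor can then slice the original buffer via Bytes and only pay for a string conversion when the caller actually asks for one.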
jensneuse · over 1 year ago
I've taken a very similar approach and built a GraphQL tokenizer and parser (amongst many other things) that's also zero memory allocations and quite fast. In case you'd like to check out the code: https://github.com/wundergraph/graphql-go-tools
evmar · over 1 year ago
In n2[1] I needed a fast tokenizer and had the same "garbage factory" problem, which is basically that there's a set of constant tokens (like json.Delim in this post) and then strings, which cause allocations.

I came up with what I think is a kind of neat solution, which is that the tokenizer is generic over some T and takes a function from byte slice to T, and uses T in place of the strings. This way, when the caller has some more efficient representation available (like one that allocates less) it can provide one, but I can still unit test the tokenizer with the identity function for convenience.

In a sense this is like fusing the tokenizer with the parser at build time, but the generic allows layering the tokenizer such that it doesn't know about the parser's representation.

[1] https://github.com/evmar/n2
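The same layering can be sketched with Go generics (my own paraphrase of the idea, with invented names; this is not code from n2): the caller supplies a func([]byte) T, so the tokenizer never decides how strings are represented.

```go
package lexsketch

// Tokenizer is generic over the string representation T. The caller supplies
// intern, a func([]byte) T, so the tokenizer never decides whether a string
// is copied, interned, or left as a cheap handle into the buffer.
type Tokenizer[T any] struct {
	buf    []byte
	pos    int
	intern func([]byte) T
}

func New[T any](buf []byte, intern func([]byte) T) *Tokenizer[T] {
	return &Tokenizer[T]{buf: buf, intern: intern}
}

// NextIdent scans a run of identifier characters and hands the raw bytes to
// intern. (Other token kinds and error handling are omitted.)
func (t *Tokenizer[T]) NextIdent() (T, bool) {
	start := t.pos
	for t.pos < len(t.buf) && isLetter(t.buf[t.pos]) {
		t.pos++
	}
	if t.pos == start {
		var zero T
		return zero, false
	}
	return t.intern(t.buf[start:t.pos]), true
}

func isLetter(c byte) bool {
	return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c == '_'
}
```

Unit tests can instantiate it with func(b []byte) string { return string(b) }, while a production parser can pass an interning or handle-returning function to avoid the per-token allocation.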
ncruces · over 1 year ago
It's possible to improve over the standard library with better API design, but it's not really possible to do a fully streaming parser that doesn't half-fill structures before finding an error and bailing out in the middle, which is another explicit design constraint for the standard library.
crabbone · over 1 year ago
Maybe I overlooked something, but the author keeps repeating that they wrote a "streaming" parser, yet never explains what that actually means. In particular, they never explain how they deal with repeated keys in "hash tables". What does their parser do? Call the "sink" code twice with the repeated key? Wait until the entire "hash table" is read and then call the "sink" code?

In my mind, JSON is inherently inadequate for streaming because of its hierarchical structure, no length known upfront and, most importantly, repeated keys. You could probably make a subset of JSON more streaming-friendly, but at that point, why bother? I mean, if the solution is to modify JSON, then a better solution would be something that's not JSON at all.
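For what it's worth, "streaming" in this context usually means SAX-style callbacks rather than building a tree. One possible shape for such a sink interface (my own sketch, not the article's actual API) makes the repeated-key question concrete: the parser simply fires Key twice and leaves the deduplication policy to the sink.

```go
package streamsketch

// Sink is one possible shape for a streaming (SAX-style) JSON consumer.
// A repeated key inside an object produces two Key calls; whether the last
// value wins, or duplicates are an error, is entirely the sink's policy.
type Sink interface {
	ObjectStart() error
	Key(k []byte) error
	ObjectEnd() error
	ArrayStart() error
	ArrayEnd() error
	String(s []byte) error
	Number(raw []byte) error
	Bool(b bool) error
	Null() error
}
```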
rexfuzzle · over 1 year ago
Great to see a shout-out to Phil Pearl! Also worth looking at https://github.com/bytedance/sonic
kevingadd · over 1 year ago
I'm surprised there's no way to say 'I really mean it, inline this function' for the stuff that didn't inline because it was too big.

The baseline whitespace count/search operation seems like it would be MUCH faster if you vectorized it with SIMD, but I can understand that being out of scope for the author.
mgaunard · over 1 year ago
"It’s unrealistic to expect to have the entire input in memory" -- wrong for most applications
peterohler · over 1 year ago
You might want to take a look at https://github.com/ohler55/ojg. It takes a different approach with a single-pass parser. There are some performance benchmarks included on the README.md landing page.
forrestthewoods · over 1 year ago
> Any (useful) JSON decoder code cannot go faster that this.

That line feels like a troll. Cunningham’s Law in action.

You can definitely go faster than 2 Gb/sec. In a word, SIMD.
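As a sketch of why byte-at-a-time scanning is not the ceiling (my own code, not from the article): even without platform intrinsics, SWAR ("SIMD within a register") examines eight bytes per iteration when searching for a structural character; real SSE/AVX/NEON code, as in simdjson, goes wider still.

```go
package swarsketch

import (
	"encoding/binary"
	"math/bits"
)

const (
	ones = 0x0101010101010101
	high = 0x8080808080808080
)

// indexByteSWAR finds the first occurrence of c in buf eight bytes at a time.
func indexByteSWAR(buf []byte, c byte) int {
	i := 0
	for ; i+8 <= len(buf); i += 8 {
		// XOR turns matching bytes into zero bytes, then the classic
		// "has-zero-byte" trick flags them via the high bit of each byte.
		w := binary.LittleEndian.Uint64(buf[i:]) ^ (ones * uint64(c))
		if m := (w - ones) & ^w & high; m != 0 {
			// Borrow propagation can only corrupt bytes above the first true
			// match, so the lowest set high-bit is always a real hit.
			return i + bits.TrailingZeros64(m)/8
		}
	}
	for ; i < len(buf); i++ {
		if buf[i] == c {
			return i
		}
	}
	return -1
}
```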
1vuio0pswjnm7 · over 1 year ago
"But there is a better trick that we can use that is more space efficient than this table, and is sometimes called a computed goto."

From 1989:

https://raw.githubusercontent.com/spitbol/x32/master/docs/spitbol-manual-v3.7.pdf

"Indirection in the Goto field is a more powerful version of the computed Goto which appears in some languages. It allows a program to quickly perform a multi-way control branch based on an item of data."
arun-mani-j · over 1 year ago
I remember reading an SO question which asked for a C library to parse JSON. A comment was like: C developers won't use a library for JSON, they will write one themselves.

I don't know how "true" that comment is, but I thought I should try to write a parser myself to get a feel :D

So I wrote one, in Python - https://arunmani.in/articles/silly-json-parser/

It was a delightful experience though, writing and testing to break your own code with a variety of different inputs. :)
wslh · over 1 year ago
I remember this JSON benchmark page from RapidJSON [1].

[1] https://rapidjson.org/md_doc_performance.html
romshark · over 1 year ago
I've recently given a talk (https://youtu.be/a7VBbbcmxyQ?si=0fGVxfc4qmKMVCXk) about github.com/romshark/jscan, which I've been working on. It's a performance-oriented JSON iterator / tokenizer you might want to take a look at if you're interested in high-performance, zero-allocation JSON parsing in Go.
isuckatcoding · over 1 year ago
This is fantastically useful.

Funny enough, I stumbled upon your article just yesterday through a Google search.
rurban · over 1 year ago
This is a very poor and overly simplified text on writing basic JSON parsers, not touching any of the topics of writing actually fast JSON parsers, such as non-copying tokenizers (e.g. jsmn), word-wise tokenizers (simdjson) and fast numeric conversions (fast_double_parser et al.).
EdwardDiego · over 1 year ago
A person who helped me out a lot when I was learning to code wrote his own .NET JSON library because the MS-provided one had a rough API and was quite slow.

His lib became the de facto JSON lib in .NET dev and, naturally, MS head-hunted him.

Fast JSON is that important these days.
nwpierce · over 1 year ago
Writing a JSON parser is definitely an educational experience. I wrote one this summer for my own purposes that is decently fast: https://github.com/nwpierce/jsb
suzzer99 · over 1 year ago
Can someone explain to me why JSON can't have comments or trailing commas? I really hope the performance gains are worth it, because I've lost hundreds of man-hours to those things, and had to resort to stuff like this in package.json:

    "IMPORTANT: do not run the scripts below this line, they are for CICD only": true,
mleonhard · over 1 year ago
I wrote a Rust library that works similarly to the author's byteReader: https://crates.io/crates/fixed-buffer
mannyv · over 1 year ago
These are always interesting to read because you get to see runtime quirks. I'm surprised there was so much function call overhead, for example. And it's interesting that you can bypass range checking.

The most important thing, though, is the process: measure, then optimize.
flaie · over 1 year ago
This was a very good read, and I did learn some nice tricks, thank you very much.
mikhailfranco · over 1 year ago
I notice 'sample.json' contains quite a few escaped nulls \u0000 inside quoted strings.

Is "\u0000" legal JSON?

P.S. ... and many other control characters < \u0020
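For reference: yes, the escaped form is legal. RFC 8259 requires control characters below U+0020 to be escaped inside strings, but \u0000 itself is a valid escape. A quick check with Go's encoding/json (written for this note, not taken from the article):

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// The escaped form is valid JSON and decodes to a one-byte string.
	var s string
	err := json.Unmarshal([]byte(`"\u0000"`), &s)
	fmt.Printf("escaped NUL: err=%v, len=%d\n", err, len(s)) // err=<nil>, len=1

	// ...but a raw (unescaped) control character inside a string is not.
	err = json.Unmarshal([]byte("\"\x00\""), &s)
	fmt.Printf("raw NUL: err=%v\n", err) // non-nil error
}
```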
cratermoon · over 1 year ago
My favorite bit about this is his reference to John Ousterhout: define errors out of existence. youtu.be/bmSAYlu0NcY?si=WjC1ouEN1ad2OWjp&t=1312

Note the distinct lack of:

    if err != nil {
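The usual Go shape for "defining errors out of existence" is a sticky-error reader: the first failure is recorded and later calls become cheap no-ops, so the hot path has no per-call error checks and the caller asks once at the end. A rough sketch of the pattern (invented names, not the article's actual byteReader):

```go
package stickysketch

import "io"

// byteReader records the first error it sees and then refuses to do more
// work, so callers can skip per-call `if err != nil` checks on the hot path.
type byteReader struct {
	r   io.Reader
	buf []byte
	err error
}

// ReadByte returns 0 once the buffer is drained and an error has occurred;
// callers inspect Err() exactly once, when they are done.
func (b *byteReader) ReadByte() byte {
	if len(b.buf) == 0 {
		if b.err != nil {
			return 0
		}
		tmp := make([]byte, 4096)
		n, err := b.r.Read(tmp)
		b.buf, b.err = tmp[:n], err
		if n == 0 {
			return 0
		}
	}
	c := b.buf[0]
	b.buf = b.buf[1:]
	return c
}

// Err reports the first error encountered, treating io.EOF as "done".
func (b *byteReader) Err() error {
	if b.err == io.EOF {
		return nil
	}
	return b.err
}
```

bufio.Scanner uses the same trick: Scan() simply returns false, and Err() reports the first non-EOF error afterwards.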
hintymad · over 1 year ago
How does this compare to Daniel Lemire's simdjson? https://github.com/simdjson/simdjson
thomasvn · over 1 year ago
In what cases would an application need to regularly parse gigabytes of JSON? Wouldn't it be advantageous for the app to get that data into a DB?
denysvitali · over 1 year ago
Also interesting: https://youtu.be/a7VBbbcmxyQ
lamontcg · over 1 year ago
Wish I wasn't 4 or 5 uncompleted projects deep right now and had the time to rewrite a Monkey parser using all these tricks.
hknmtt · over 1 year ago
What does this bring over goccy's JSON encoder?
visarga · over 1 year ago
Nowadays I am more interested in a "forgiving" JSON/YAML parser that would recover from LLM errors. Is there such a thing?