Parsing malformed JSON

66 points by p8donald over 8 years ago

22 comments

bhaak over 8 years ago
Great, after the tag soup of modern browsers are we now also going to see JSON soup?

Sometimes it's obvious what's wrong with malformed data you receive. A classic would be encoding errors.

But as soon as you start supporting broken components and APIs, you will never be able to unsupport them.

Prime example would be HTML. Granted, in the beginning, it was supposed to be written by humans but that was rather quickly not a major obstacle anymore and even a human can produce valid HTML with the help of a syntax checker.
peterkelly over 8 years ago
Please don't do things like this. It only encourages people to be lazy about producing conforming documents, and different parsers that try to compensate for syntax errors are going to do so in different ways. We learnt this the hard way with HTML.
captainmuon over 8 years ago
Many people here are wondering how you can end up with JSON this bad, and who is "sending" it to them. Well, the poster is not necessarily running a REST service. At work, I've dealt with plenty of little JSON (and XML) files, created by "little tools" and passed around via files and pipes. Since I work in science, most of our coders are the users of their code, so you can imagine both code quality and UX are poor. And the main reason something like this happens is that people don't use proper serialization, because they never heard of it, or they don't have the right tools. They just construct JSON by string interpolation. If they are lucky, they remember to replace `'` with `\"`. In fact, that looks a lot like what happened here (plus one or two levels of escaping).

Apropos escaping, people are most likely to do this if they never wrote PHP websites as kids and never went through `urlencode` and `mysql_escape_string` hell.
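A minimal sketch of the difference described above, in Python. The record and its field name are made up for illustration; the point is only that a value containing quote characters breaks interpolated JSON, while a serializer escapes it.

```python
import json

# Hypothetical value that contains both kinds of quote character.
note = 'He said "it\'s fine"'

# String interpolation: the embedded double quotes end the JSON string early.
interpolated = '{"note": "%s"}' % note

# Proper serialization: json.dumps escapes the embedded double quotes.
serialized = json.dumps({"note": note})


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(interpolated))  # False
print(is_valid_json(serialized))    # True
```

The serialized form also round-trips: `json.loads(serialized)["note"]` gives back the original string.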
RMarcus over 8 years ago
I wrote a library to handle (many cases of) invalid JSON, motivated by a similar experience. https://github.com/RyanMarcus/dirty-json

I'm on my phone now, but later today I'll test to see if it would have worked for the author. It's good for cleaning up JSON, but I would be wary of putting it (or anything like it) anywhere near production.
k2xl over 8 years ago
I'm hoping nobody actually does this in production. As an academic exercise it is interesting.

Maybe I'm old fashioned. I'm all for flexible APIs and all, but only to a point. If a customer sends rotten stuff, it should just be rejected with a 40x code.

At minimum, check to make sure it is proper JSON... I know that a lot of stream processors will put it into a queue and 200 right away and then process in the background, but I don't think that ensuring it is at least JSON and doesn't have a content size of more than X could be too intensive.

In this case, if the data was already accepted and you've got no choice but to deal with it, you've gotta do what you got to do. I've been there, and it ain't fun cleaning up a 900 GB JSON file.
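A rough sketch of that kind of check at the door, in Python. The function name `check_upload` and the 1 MB limit are assumptions (the comment only says "X"), and a real service would do this inside whatever web framework it uses.

```python
import json

MAX_BODY_BYTES = 1_000_000  # assumed limit; the comment only says "X"


def check_upload(body: bytes, max_bytes: int = MAX_BODY_BYTES) -> int:
    """Return an HTTP-style status code for a raw request body."""
    if len(body) > max_bytes:
        return 413  # Payload Too Large
    try:
        json.loads(body)
    except ValueError:  # JSONDecodeError and UnicodeDecodeError are both ValueErrors
        return 400  # Bad Request
    return 200


print(check_upload(b'{"a": 1}'))            # 200
print(check_upload(b'{"a": 1'))             # 400
print(check_upload(b'[1, 2]', max_bytes=1))  # 413
```

Checking the size first keeps the parser away from oversized bodies. Note that Python's `json.loads` is itself lenient about `NaN` and `Infinity` by default, so "proper JSON" here means "what this parser accepts".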
devy over 8 years ago
This reminds me of the "Parsing JSON is a minefield" post from a few weeks ago. TL;DR: JSON is not standardized (or rather, it has multiple standards), which makes parsing / validating JSON data very tricky in edge cases.

https://news.ycombinator.com/item?id=12796556
Analemma_ over 8 years ago
How well do you know the sender? Because this looks like an attack, or at least a probe: something to try and crash the parser and see what response they get back, to see if you are vulnerable to some kind of heap corruption attack.
aikah over 8 years ago
> I have no idea how something like this was generated.

It would be interesting to ask the sender how.

> If the file is small enough or the data regular enough, you could fix it by hand with some search & replace.

Of course.

> But the file I had was gigabytes in size and most of it looked fine.

I suspect a faulty JSON library. It's important to figure out how it was generated so that an issue can eventually be opened and the bug fixed.
junke over 8 years ago
> I had this "JSON" file sent to me

Why? By whom? Did you complain loudly?
PaulHoule over 8 years ago
Malformed data is a scalability problem. Unusual failure modes from coding problems to random bit flips become inevitable as the data volume approaches infinity.
mSparks over 8 years ago
My first reaction would be to parse until you hit a problem, then use a string distance function and a genetic algorithm to find the problematic characters.

In other words, find multiple possibilities that result in a valid JSON object and choose the one with the shortest distance.

Then, of course, log out the changes.

I do something similar with CSVs. MSSQL is notorious for spitting out junk inside CSV files.

Also, I can guess how it was created.

The code is probably in C, and a rare edge case is overwriting memory before it hits the file.
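A small sketch of only the first step ("parse until you hit a problem"), in Python. It does not attempt the string-distance or genetic-algorithm repair; it just reports where the standard parser gives up, using the `pos` attribute of `json.JSONDecodeError`. The helper name `first_error` and the sample input are made up.

```python
import json


def first_error(text):
    """Return the character offset where parsing fails, or None if text is valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.pos
    return None


bad = '{"a": 1, "b": tru}'
pos = first_error(bad)
print(pos)        # offset of the first character the parser could not accept
print(bad[pos:])  # the remainder of the input, starting at the problem
```

A repair loop would generate candidate edits near that offset, re-parse each candidate, and keep the one closest to the original text.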
wccrawford over 8 years ago
It's a neat trick, but not something I'd deploy into production. If I have to try to guess at what the customer is sending me, I'm not going to apply it to their account.

In an emergency, I might hand-edit it and make it right, but I'd absolutely insist that further files be in the correct format.
mwkaufma over 8 years ago
Or, how I made my service a DDoS target.

It's not just the extra compute, it's the lack of a formal specification. If different services applied this kind of ad hoc "Postel's principle" they may parse the malformed markup differently, and end up introducing downstream inconsistencies.
latch over 8 years ago
Not a Python developer, so I was surprised that the built-in json library has a flag allow_nan which is True by default.

Also, not invalid, but surprising / annoying (it took a while to debug): an empty Lua table is the same as an empty Lua array, {}. This causes ambiguity.

  -- will print {}
  print(cjson.encode(cjson.decode('[]')))
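For the Python half of that comment, a short sketch of the default behaviour and how to turn it off. `allow_nan` and `parse_constant` are real parameters of the standard `json` module; the helper names `dumps_strict`, `loads_strict`, and `reject_constant` are made up here.

```python
import json


def dumps_strict(obj):
    """Serialize, raising ValueError on NaN or Infinity instead of emitting them."""
    return json.dumps(obj, allow_nan=False)


def reject_constant(name):
    # Called by the decoder for the tokens NaN, Infinity and -Infinity.
    raise ValueError(f"non-standard JSON constant: {name}")


def loads_strict(text):
    """Parse, raising ValueError if the input contains NaN or Infinity tokens."""
    return json.loads(text, parse_constant=reject_constant)


nan = float("nan")

# Default: allow_nan=True, so the output contains the non-standard token NaN.
print(json.dumps({"x": nan}))  # {"x": NaN}

try:
    dumps_strict({"x": nan})
except ValueError as err:
    print("refused to write:", err)

try:
    loads_strict('{"x": NaN}')
except ValueError as err:
    print("refused to read:", err)
```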
anentropic over 8 years ago
just send the file back where it came from
nommm-nommm over 8 years ago
Why would a JSON file be GBs in size? I think that's the more interesting question.
agounaris over 8 years ago
Wasn't it easier to just remove the wrong characters manually? :P

Validate the JSON and if it's wrong just throw it away. It makes no sense trying to fix/guess the correct form of an input.
nkrisc over 8 years ago
Should you really assume malformed JSON is even correct?
ekiara over 8 years ago
Wouldn't a better option be an error log? You reply to the client: "I can accept 398,500 of your 400,000 submitted records; attached are the records that do not conform to the expected template. Choose either to (1) submit only the validated records and discard the malformed ones or (2) reformat the malformed records and resubmit the entire batch."
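One way such a report could be produced, sketched in Python. Several things here are assumptions rather than anything stated in the thread: that records arrive one JSON object per line, that the "expected template" means a set of required keys, and the names `triage` and `REQUIRED_KEYS`.

```python
import json

REQUIRED_KEYS = {"id", "value"}  # hypothetical "expected template"


def triage(lines):
    """Split lines into (accepted records, rejected (line number, text, reason) tuples)."""
    accepted, rejected = [], []
    for lineno, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            rejected.append((lineno, line, err.msg))
            continue
        if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
            rejected.append((lineno, line, "does not match the expected template"))
            continue
        accepted.append(record)
    return accepted, rejected


batch = [
    '{"id": 1, "value": "a"}',
    '{"id": 2, "value": "b"',  # truncated record
    '{"id": 3}',               # parses, but a required key is missing
]
accepted, rejected = triage(batch)
print(f"I can accept {len(accepted)} of your {len(batch)} submitted records")
for lineno, line, reason in rejected:
    print(f"line {lineno}: {reason}: {line}")
```

Nothing is repaired or guessed; the rejected lines go back to the sender unchanged along with the reason.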
ape4 over 8 years ago
JSON should have a nicer way of dealing with double quotes in data. That would avoid many encoding mistakes.
fbreduc over 8 years ago
I don't get malformed JSON, I tell the sender to re-send the data as JSON.
bborud over 8 years ago
Don't.