Great, after the tag soup of modern browsers, are we now also going to see JSON soup?

Sometimes it's obvious what's wrong with malformed data you receive; encoding errors are a classic. But as soon as you start supporting broken components and APIs, you will never be able to unsupport them.

The prime example is HTML. Granted, in the beginning it was meant to be written by humans, but that quickly stopped being a real constraint, and even a human can produce valid HTML with the help of a syntax checker.
Please don't do things like this. It only encourages people to be lazy about producing conforming documents, and different parsers that try to compensate for syntax errors are going to do so in different ways. We learnt this the hard way with HTML.
Many people here are wondering how you can end up with JSON this bad, and who is "sending" it to them. Well, the poster is not necessarily running a REST service. At work, I've dealt with plenty of little JSON (and XML) files created by "little tools" and passed around via files and pipes. Since I work in science, most of our coders are the users of their own code, so you can imagine both code quality and UX are poor. And the main reason something like this happens is that people don't use proper serialization, because they never heard of it or don't have the right tools. They just construct JSON by string interpolation. If they are lucky, they remember to escape the quotes (`"` → `\"`). In fact, that looks a lot like what happened here (plus one or two levels of escaping).

Apropos escaping, people are most likely to get this wrong if they never wrote PHP websites as kids and never went through `urlencode` and `mysql_escape_string` hell.
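For illustration, a minimal Python sketch of that failure mode (the field name and value are made up):

    import json

    # Building JSON by string interpolation breaks as soon as a value
    # contains a quote, backslash, or newline:
    name = 'sample "run 2" data'
    broken = '{"name": "%s"}' % name      # not valid JSON
    # json.loads(broken) raises json.JSONDecodeError

    # A real serializer handles the escaping for you:
    good = json.dumps({"name": name})
    print(good)                               # {"name": "sample \"run 2\" data"}
    print(json.loads(good)["name"] == name)   # True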
I wrote a library to handle (many cases of) invalid JSON, motivated by a similar experience: https://github.com/RyanMarcus/dirty-json

I'm on my phone now, but later today I'll test whether it would have worked for the author. It's good for cleaning up JSON, but I would be wary of putting it (or anything like it) anywhere near production.
I'm hoping nobody actually does this in production. As an academic exercise it is interesting.

Maybe I'm old fashioned. I'm all for flexible APIs, but only up to a point: if a customer sends rotten stuff, it should just be rejected with a 4xx code.

At minimum, check to make sure it is proper JSON. I know that a lot of stream processors will put it into a queue, return 200 right away, and then process in the background, but I don't think that ensuring it is at least valid JSON and doesn't have a content size of more than X would be too intensive.

In this case, if the data was already accepted and you've got no choice but to deal with it, you've gotta do what you've got to do. I've been there, and it ain't fun cleaning up a 900 GB JSON file.
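A minimal sketch of that validate-before-you-200 idea, standard library only (the size limit and function name are made up):

    import json

    MAX_BODY_BYTES = 10 * 1024 * 1024   # hypothetical limit

    def validate_body(body: bytes):
        # Reject oversized or malformed JSON before queuing it.
        if len(body) > MAX_BODY_BYTES:
            return 413, None             # Payload Too Large
        try:
            return 200, json.loads(body)
        except json.JSONDecodeError:
            return 400, None             # Bad Request: not valid JSON

    print(validate_body(b'{"records": [1, 2, 3]}')[0])   # 200
    print(validate_body(b'{"records": [1, 2,')[0])       # 400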
This reminds me of the "Parsing JSON is a minefield" post from a few weeks ago. TL;DR: JSON is not well standardized (or rather has multiple standards), which makes parsing / validating JSON data very tricky in edge cases.

https://news.ycombinator.com/item?id=12796556
How well do you know the sender? Because this looks like an attack, or at least a probe: something to try and crash the parser and see what response they get back, to see if you are vulnerable to some kind of heap corruption attack.
> I have no idea how something like this was generated.

It would be interesting to ask the sender how.

> If the file is small enough or the data regular enough, you could fix it by hand with some search & replace.

Of course.

> But the file I had was gigabytes in size and most of it looked fine.

I suspect a faulty JSON library. It's important to figure out how it was generated so an issue can eventually be opened and the bug fixed.
Malformed data is a scalability problem. Unusual failure modes from coding problems to random bit flips become inevitable as the data volume approaches infinity.
My first reaction would be to parse until you hit a problem, then use a string distance function and a genetic algorithm to find the problematic characters.

In other words: find multiple possibilities that result in a valid JSON object and choose the one with the shortest distance. Then, of course, log the changes.

I do something similar with CSVs; MSSQL is notorious for spitting out junk inside CSV files.

Also, I can guess how it was created: the code is probably in C, and a rare edge case is overwriting memory before it hits the file.
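For what it's worth, here is a greedy, much-simplified sketch of that repair idea in Python (single-character edits at the error position rather than a real genetic algorithm; the candidate edits are assumptions):

    import json

    def repair(text, max_edits=10):
        # Parse until the first error, try a couple of single-character
        # edits at that position, keep the one that lets parsing get
        # furthest, and log every change that was made.
        changes = []
        for _ in range(max_edits):
            try:
                return json.loads(text), changes
            except json.JSONDecodeError as e:
                pos = e.pos
                candidates = [
                    text[:pos] + text[pos + 1:],       # drop the character
                    text[:pos] + '\\' + text[pos:],    # escape it
                ]
                def progress(t):
                    try:
                        json.loads(t)
                        return len(t)                  # fully parses
                    except json.JSONDecodeError as err:
                        return err.pos                 # how far it got
                best = max(candidates, key=progress)
                changes.append((pos, text[pos:pos + 1]))
                text = best
        raise ValueError("could not repair within %d edits" % max_edits)

    doc, log = repair('{"a": 1,, "b": "x"}')   # stray comma
    print(doc, log)                            # {'a': 1, 'b': 'x'} [(8, ',')]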
It's a neat trick, but not something I'd deploy into production. If I have to try to guess at what the customer is sending me, I'm not going to apply it to their account.

In an emergency, I might hand-edit it and make it right, but I'd absolutely insist that further files be in the correct format.
Or, how I made my service a DDoS target.

It's not just the extra compute; it's the lack of a formal specification. If different services applied this kind of ad hoc "Postel's principle", they might parse the malformed markup differently and end up introducing downstream inconsistencies.
Not a Python developer, so I was surprised that the built-in json library has a flag allow_nan which is True by default.

Also, not invalid but surprising / annoying (took a while to debug): an empty Lua table is the same as an empty Lua array: {}. This causes ambiguity.

    -- will print {}
    print(cjson.encode(cjson.decode('[]')))
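For context, a quick Python illustration of that allow_nan default, standard library only:

    import json

    # By default the standard library happily emits NaN, which is not
    # part of the JSON spec and which many other parsers will reject:
    print(json.dumps({"x": float("nan")}))      # {"x": NaN}

    # allow_nan=False makes the serializer fail loudly instead:
    try:
        json.dumps({"x": float("nan")}, allow_nan=False)
    except ValueError:
        print("out-of-range float rejected")    # NaN is refused with a ValueError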
Wouldn't it be easier to just remove the wrong characters manually? :P

Validate the JSON and if it's wrong just throw it away. It makes no sense to try to fix/guess the correct form of an input.
Wouldn't a better option be an error log? You reply to the client: "I can accept 398,500 of your 400,000 submitted records; attached are the records that do not conform to the expected template. Choose either to (1) submit only the validated records and discard the malformed ones, or (2) reformat the malformed records and resubmit the entire batch."
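A minimal sketch of that triage approach (the record format and the function name are invented):

    import json

    def triage(raw_records):
        # Validate each submitted record; accept the good ones and
        # report the malformed ones back instead of guessing a fix.
        accepted, rejected = [], []
        for i, raw in enumerate(raw_records):
            try:
                accepted.append(json.loads(raw))
            except json.JSONDecodeError as e:
                rejected.append({"index": i, "error": str(e), "raw": raw})
        return accepted, rejected

    accepted, rejected = triage(['{"id": 1}', '{"id": 2,'])
    print(len(accepted), len(rejected))   # 1 1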