Great, after the tag soup of modern browsers, are we now also going to see JSON soup?

Sometimes it's obvious what's wrong with malformed data you receive; encoding errors are a classic. But as soon as you start supporting broken components and APIs, you will never be able to unsupport them.

The prime example is HTML. Granted, in the beginning it was meant to be written by humans, but that quickly stopped being a real constraint, and even a human can produce valid HTML with the help of a syntax checker.
Please don't do things like this. It only encourages people to be lazy about producing conforming documents, and different parsers that try to compensate for syntax errors are going to do so in different ways. We learnt this the hard way with HTML.
Many people here are wondering how you can end up with JSON this bad, and who is "sending" it to them. Well, the poster is not necessarily running a REST service. At work, I've dealt with plenty of little JSON (and XML) files created by "little tools" and passed around via files and pipes. Since I work in science, most of our coders are the users of their own code, so you can imagine both code quality and UX are poor. And the main reason something like this happens is that people don't use proper serialization, because they never heard of it or don't have the right tools. They just construct JSON by string interpolation. If they are lucky, they remember to escape the quotes (`"` → `\"`). In fact, that looks a lot like what happened here (plus one or two levels of escaping).

Apropos escaping, people are most likely to get this wrong if they never wrote PHP websites as kids and never went through `urlencode` and `mysql_escape_string` hell.
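For illustration, a minimal Python sketch of that failure mode (the field name and value are made up):

    import json

    # Building JSON by string interpolation breaks as soon as a value
    # contains a quote, backslash, or newline:
    name = 'sample "run 2" data'
    broken = '{"name": "%s"}' % name      # not valid JSON
    # json.loads(broken) raises json.JSONDecodeError

    # A real serializer handles the escaping for you:
    good = json.dumps({"name": name})
    print(good)                               # {"name": "sample \"run 2\" data"}
    print(json.loads(good)["name"] == name)   # True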
I wrote a library to handle (many cases of) invalid JSON, motivated by a similar experience: https://github.com/RyanMarcus/dirty-json

I'm on my phone now, but later today I'll test whether it would have worked for the author. It's good for cleaning up JSON, but I would be wary of putting it (or anything like it) anywhere near production.
I'm hoping nobody actually does this in production. As an academic exercise it is interesting.

Maybe I'm old fashioned. I'm all for flexible APIs, but only up to a point: if a customer sends rotten stuff, it should just be rejected with a 4xx code.

At minimum, check to make sure it is proper JSON. I know that a lot of stream processors will put it into a queue, return 200 right away, and then process in the background, but I don't think that ensuring it is at least valid JSON and doesn't have a content size of more than X would be too intensive.

In this case, if the data was already accepted and you've got no choice but to deal with it, you've gotta do what you've got to do. I've been there, and it ain't fun cleaning up a 900 GB JSON file.
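A minimal sketch of that validate-before-you-200 idea, standard library only (the size limit and function name are made up):

    import json

    MAX_BODY_BYTES = 10 * 1024 * 1024   # hypothetical limit

    def validate_body(body: bytes):
        # Reject oversized or malformed JSON before queuing it.
        if len(body) > MAX_BODY_BYTES:
            return 413, None             # Payload Too Large
        try:
            return 200, json.loads(body)
        except json.JSONDecodeError:
            return 400, None             # Bad Request: not valid JSON

    print(validate_body(b'{"records": [1, 2, 3]}')[0])   # 200
    print(validate_body(b'{"records": [1, 2,')[0])       # 400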
This reminds me of the "Parsing JSON is a minefield" post from a few weeks ago. TL;DR: JSON is not well standardized (or rather has multiple standards), which makes parsing / validating JSON data very tricky in edge cases.

https://news.ycombinator.com/item?id=12796556
How well do you know the sender? Because this looks like an attack, or at least a probe: something to try and crash the parser and see what response they get back, to see if you are vulnerable to some kind of heap corruption attack.
> I have no idea how something like this was generated.

It would be interesting to ask the sender how.

> If the file is small enough or the data regular enough, you could fix it by hand with some search & replace.

Of course.

> But the file I had was gigabytes in size and most of it looked fine.

I suspect a faulty JSON library. It's important to figure out how it was generated so an issue can eventually be opened and the bug fixed.
Malformed data is a scalability problem. Unusual failure modes from coding problems to random bit flips become inevitable as the data volume approaches infinity.
My first reaction would be to parse until you hit a problem, then use a string distance function and a genetic algorithm to find the problematic characters.

In other words: find multiple possibilities that result in a valid JSON object and choose the one with the shortest distance. Then, of course, log the changes.

I do something similar with CSVs; MSSQL is notorious for spitting out junk inside CSV files.

Also, I can guess how it was created: the code is probably in C, and a rare edge case is overwriting memory before it hits the file.
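For what it's worth, here is a greedy, much-simplified sketch of that repair idea in Python (single-character edits at the error position rather than a real genetic algorithm; the candidate edits are assumptions):

    import json

    def repair(text, max_edits=10):
        # Parse until the first error, try a couple of single-character
        # edits at that position, keep the one that lets parsing get
        # furthest, and log every change that was made.
        changes = []
        for _ in range(max_edits):
            try:
                return json.loads(text), changes
            except json.JSONDecodeError as e:
                pos = e.pos
                candidates = [
                    text[:pos] + text[pos + 1:],       # drop the character
                    text[:pos] + '\\' + text[pos:],    # escape it
                ]
                def progress(t):
                    try:
                        json.loads(t)
                        return len(t)                  # fully parses
                    except json.JSONDecodeError as err:
                        return err.pos                 # how far it got
                best = max(candidates, key=progress)
                changes.append((pos, text[pos:pos + 1]))
                text = best
        raise ValueError("could not repair within %d edits" % max_edits)

    doc, log = repair('{"a": 1,, "b": "x"}')   # stray comma
    print(doc, log)                            # {'a': 1, 'b': 'x'} [(8, ',')]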
It's a neat trick, but not something I'd deploy into production. If I have to try to guess at what the customer is sending me, I'm not going to apply it to their account.

In an emergency, I might hand-edit it and make it right, but I'd absolutely insist that further files be in the correct format.
Or, how I made my service a DDoS target.

It's not just the extra compute; it's the lack of a formal specification. If different services applied this kind of ad hoc "Postel's principle", they might parse the malformed markup differently and end up introducing downstream inconsistencies.
Not a Python developer, so I was surprised that the built-in json library has a flag allow_nan which is True by default.

Also, not invalid but surprising / annoying (took a while to debug): an empty Lua table is the same as an empty Lua array: {}. This causes ambiguity.

    -- will print {}
    print(cjson.encode(cjson.decode('[]')))
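For context, a quick Python illustration of that allow_nan default, standard library only:

    import json

    # By default the standard library happily emits NaN, which is not
    # part of the JSON spec and which many other parsers will reject:
    print(json.dumps({"x": float("nan")}))      # {"x": NaN}

    # allow_nan=False makes the serializer fail loudly instead:
    try:
        json.dumps({"x": float("nan")}, allow_nan=False)
    except ValueError:
        print("out-of-range float rejected")    # NaN is refused with a ValueError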
Wouldn't it be easier to just remove the wrong characters manually? :P

Validate the JSON and if it's wrong just throw it away. It makes no sense to try to fix/guess the correct form of an input.
Wouldn't a better option be an error log? You reply to the client: "I can accept 398,500 of your 400,000 submitted records; attached are the records that do not conform to the expected template. Choose either to (1) submit only the validated records and discard the malformed ones, or (2) reformat the malformed records and resubmit the entire batch."
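A minimal sketch of that triage approach (the record format and the function name are invented):

    import json

    def triage(raw_records):
        # Validate each submitted record; accept the good ones and
        # report the malformed ones back instead of guessing a fix.
        accepted, rejected = [], []
        for i, raw in enumerate(raw_records):
            try:
                accepted.append(json.loads(raw))
            except json.JSONDecodeError as e:
                rejected.append({"index": i, "error": str(e), "raw": raw})
        return accepted, rejected

    accepted, rejected = triage(['{"id": 1}', '{"id": 2,'])
    print(len(accepted), len(rejected))   # 1 1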