The fortunate thing is, almost all of the broken sequences are unambiguous enough to be a clear sign that the text should be re-encoded and then re-decoded as UTF-8. (This is not the case with any arbitrary encoding mixup -- if you mix up Big5 with EUC-JP, you might as well throw out your text and start over -- but it works for UTF-8 and the most common other encodings because UTF-8 is well-designed.)<p>So if you want a Python library that can do this automatically with an extremely low rate of false positives: <a href="https://github.com/LuminosoInsight/python-ftfy" rel="nofollow">https://github.com/LuminosoInsight/python-ftfy</a>
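For reference, the round trip being described looks like this in plain Python (a minimal sketch assuming the common case where UTF-8 bytes were mis-decoded as Windows-1252; ftfy's fix_text handles the detection automatically):

```python
# Mojibake produced by decoding UTF-8 bytes with the wrong codec can often
# be reversed: re-encode with that same wrong codec to recover the original
# bytes, then decode them correctly as UTF-8.
garbled = "Theyâ€™re"  # what "They’re" becomes after a cp1252 mis-decode
fixed = garbled.encode("cp1252").decode("utf-8")
print(fixed)  # They’re
```

This only works because the mojibake sequences (like "â€™") are rare enough in real text to be unambiguous markers of a mis-decode, which is the point the comment makes.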
I previously wrote about this common double-encoding issue at <a href="http://www.pixelbeat.org/docs/unicode_utils/" rel="nofollow">http://www.pixelbeat.org/docs/unicode_utils/</a>, which references tools and techniques for fixing up such garbled data.
Missing: how default conversions go wrong between UTF-8 and EBCDIC. E.g. a UTF-8 character outside the MES-2 subset ('latin1') will be mapped to the 0x3F 'unknown' character of EBCDIC, which will then be mapped back to the 0x1A character (Ctrl-Z) in UTF-8...
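The second half of that round trip can be demonstrated with Python's cp037 EBCDIC codec (my sketch; the exact code page on a given mainframe may differ):

```python
# In EBCDIC code page 037, 0x3F is SUB, the "unknown character" byte that
# unmappable characters get substituted with. Decoding it back to Unicode
# yields U+001A (ASCII SUB, "Ctrl-Z") rather than '?', so the replacement
# silently becomes an invisible control code in the UTF-8 output.
sub_byte = b"\x3f"
roundtripped = sub_byte.decode("cp037")
print(hex(ord(roundtripped)))  # 0x1a
```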
Lol, the biggest bug is developers ignoring that Latin-1 and Unicode encoded as UTF-8 can coexist in the same stream of data:<p>- HTTP 1.1 headers are ISO-8859-1 (CERN legacy) while the content can be UTF-8
- SIP, being based on the HTTP RFC, has the same flaw.<p>The CTO of my last VoIP company is still wondering why some caller IDs keep breaking his nice Python program that assumes everything is UTF-8, and he still does not understand this...<p>Yes, encoding can change; I also saw it in logs while using regionalisation with C# .NET.
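A defensive decoder for such mixed header streams might look like this (a sketch, not from the comment; decode_header is a hypothetical helper):

```python
def decode_header(raw: bytes) -> str:
    """Decode a header value that may be UTF-8 or ISO-8859-1.

    HTTP/1.1 (and SIP, which inherits its syntax) historically treats
    header values as ISO-8859-1, so assuming UTF-8 everywhere breaks on
    legacy senders. Try strict UTF-8 first; since every byte sequence is
    valid ISO-8859-1, the fallback always succeeds.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")

print(decode_header("José".encode("utf-8")))   # José
print(decode_header("José".encode("latin-1")))  # José
```

The fallback order matters: trying ISO-8859-1 first would "succeed" on every input and turn UTF-8 caller IDs into mojibake.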
Currently down. Here’s a snapshot from January:<p><a href="http://archive.is/t2tB3" rel="nofollow">http://archive.is/t2tB3</a>