UTF-8 Encoding Debugging Chart

116 points by tard about 9 years ago

6 comments

rspeer about 9 years ago
The fortunate thing is, almost all of the broken sequences are unambiguous enough to be signs that the text should be encoded and then re-decoded as UTF-8. (This is not the case with any arbitrary encoding mixup -- if you mix up Big5 with EUC-JP, you might as well throw out your text and start over -- but it works for UTF-8 and the most common other encodings because UTF-8 is well-designed.)

So if you want a Python library that can do this automatically with an extremely low rate of false positives: https://github.com/LuminosoInsight/python-ftfy
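
The "encode, then re-decode as UTF-8" step described above can be sketched with the standard library alone. This is only an illustration: the helper name `fix_mojibake` and the choice of cp1252 as the wrong encoding are assumptions, and the library linked above is said to handle this automatically rather than needing the wrong encoding to be named.

```python
def fix_mojibake(text, wrong_encoding="cp1252"):
    """Undo one layer of 'UTF-8 bytes decoded as wrong_encoding'.

    Hypothetical helper for illustration. If the text does not
    round-trip cleanly, it is returned unchanged.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a clean UTF-8 sequence once re-encoded: leave it alone.
        return text


# Simulate the bug: UTF-8 bytes read as cp1252.
garbled = "café".encode("utf-8").decode("cp1252")

repaired = fix_mojibake(garbled)
untouched = fix_mojibake("café")  # correct text fails to re-decode, so it is kept
```

Correct text that contains non-ASCII characters usually fails the re-decoding step, which is one reason the broken sequences are so recognizable.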
pixelbeat about 9 years ago
I previously wrote about this common double encoding issue at http://www.pixelbeat.org/docs/unicode_utils/ which references tools and techniques to fix up such garbled data.
plank about 9 years ago
Missing: how defaults are wrong between UTF8 and EBCDIC. E.g. where a character in UTF8 outside the MES2 subset ('latin1') will be mapped to the x3F 'unknown' character of EBCDIC, which will be mapped back to the x1A character ('CTRL-z') of UTF8...
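
A rough sketch of that round trip, assuming Python's `cp037` codec as a stand-in for "EBCDIC" (real systems use many different code pages) and a hypothetical helper that mimics a converter substituting x3F for anything it cannot map:

```python
EBCDIC_SUB = 0x3F  # EBCDIC substitute byte, as described in the comment above


def to_ebcdic_with_sub(text, codec="cp037"):
    """Encode text to EBCDIC, replacing unmappable characters with x3F.

    Hypothetical helper: it imitates a lossy converter and is not how
    Python's own errors="replace" behaves.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            out.append(EBCDIC_SUB)
    return bytes(out)


original = "price: 5 €"  # the euro sign is not in cp037
round_tripped = to_ebcdic_with_sub(original).decode("cp037")

# Latin-1 characters such as é survive the round trip.
latin1_ok = to_ebcdic_with_sub("é").decode("cp037")
```

After the round trip the unmappable character comes back as U+001A, a control character that many tools display as nothing at all.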
julie1 about 9 years ago
Lol, the biggest bug is developers ignoring that latin1 & Unicode encoded in UTF8 can coexist in the same stream of data:

- HTTP 1.1 headers are ISO-8859-1 (CERN legacy) while content can be UTF8
- SIP, being based on the HTTP RFC, has the same flaw.

The CTO of my last VoIP company is still wondering why some callerIDs are breaking his nice Python program that assumes everything is UTF8, and still does not understand this...

Yes, encoding can change; I also saw it in logs while using regionalisation with C# .NET.
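
To make the mixed-stream point concrete, here is a minimal sketch that builds a message with ISO-8859-1 headers and a UTF-8 body, then decodes each part with its own encoding. The header name `X-Caller-Name` and the helper `split_and_decode` are made up for illustration, and the body charset is hardcoded where a real parser would read it from `Content-Type`.

```python
# Headers encoded as ISO-8859-1, body encoded as UTF-8, in one byte stream.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "X-Caller-Name: José\r\n"  # hypothetical header
    "Content-Type: text/plain; charset=utf-8\r\n"
    "\r\n"
).encode("latin-1") + "José ☎".encode("utf-8")


def split_and_decode(message):
    """Decode the header block as latin-1 and the body as UTF-8."""
    head, _, body = message.partition(b"\r\n\r\n")
    return head.decode("latin-1"), body.decode("utf-8")


head, body = split_and_decode(raw)

# Treating the whole stream as UTF-8 fails on the latin-1 header byte.
try:
    raw.decode("utf-8")
    whole_stream_is_utf8 = True
except UnicodeDecodeError:
    whole_stream_is_utf8 = False
```

A program that assumes everything is UTF-8 hits the `UnicodeDecodeError` branch as soon as a header carries a non-ASCII latin-1 byte.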
arm about 9 years ago
Currently down. Here’s a snapshot from January:

http://archive.is/t2tB3
alien3d about 9 years ago
Is utf8mb4 affected too? I have been converting my tables from utf8 unicode to utf8mb4 unicode to support emoji.
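
For context, and assuming the question is about MySQL's `utf8` and `utf8mb4` character sets: `utf8` stores at most three bytes per character, while emoji need four bytes in UTF-8. A small sketch, with a hypothetical helper name, shows which strings need the wider character set:

```python
def needs_utf8mb4(text):
    """Return True if any character takes four bytes in UTF-8.

    Hypothetical helper for illustration only.
    """
    return any(len(ch.encode("utf-8")) == 4 for ch in text)


emoji_needs_it = needs_utf8mb4("thumbs up 👍")
bmp_text_fits = needs_utf8mb4("é € 中")  # at most three bytes each
```

This only addresses storage width. It does not check for text that was already garbled before it was stored.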