What programmers need to know about encodings and charsets (2011)

70 points by neiesc, more than 4 years ago

8 comments

at_a_remove, more than 4 years ago
I was looking for the catch. Here it is: "It's really simple: *Know* what encoding a certain piece of text, that is, a certain byte sequence, is in, then interpret it with that encoding."

That's like "knowing" the truth. How?

I have received some very interesting files that made Python yack Unicode errors, again and again. Why? Not only did I not "know" what encoding it was in -- *the encodings changed* at different points in the stream of bytes. I call this "slamming bytes together" because somewhere along the line, someone's program did exactly that.

Everything is simple -- until it isn't.
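
For the "slamming bytes together" case, here is a minimal Python sketch of one possible coping strategy. It rests on two assumptions that are not in the comment: that the encoding only changes at line boundaries, and that any line which is not valid UTF-8 is Windows-1252. The names `decode_mixed` and `fallback` are made up for illustration.

```python
def decode_mixed(data: bytes, fallback: str = "cp1252") -> str:
    """Decode line by line: try UTF-8 first, then fall back to a legacy codec."""
    decoded = []
    for line in data.splitlines(keepends=True):
        try:
            decoded.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # Assumption: anything that is not valid UTF-8 is in `fallback`.
            # errors="replace" covers the few bytes cp1252 leaves undefined.
            decoded.append(line.decode(fallback, errors="replace"))
    return "".join(decoded)


# Two lines "slammed together": the first is UTF-8, the second is cp1252.
sample = "naïve\n".encode("utf-8") + "café\n".encode("cp1252")
print(decode_mixed(sample))
```

This is a heuristic, not a way of "knowing" the encoding: a legacy-encoded line that happens to be valid UTF-8 would be decoded incorrectly without any error.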
jbandela1, more than 4 years ago
Note: This post is basically a TLDR of https://www.theregister.com/2013/10/04/verity_stob_unicode/ by Verity Stob.

One of the reasons there is a lot of confusion about encodings vs Unicode is that Unicode was initially an encoding. It was thought that 65K characters was enough to represent all the characters in actual use across the languages, and thus you just needed to change from an 8-bit char to a 16-bit char and all would be well (apart from the issue of endianness). Thus Unicode initially specified what each symbol would look like encoded in 16 bits (see http://unicode.org/history/unicode88.pdf, particularly section 2). Windows NT, Java, and ICU all embraced this.

Then it turned out that you needed a lot more characters than 65K, and instead of each character being 16 bits, you would need 32-bit characters (or else have weird 3-byte data types). Whereas people could justify going from 8 bits to 16 bits as the cost of not having to worry about charsets, most developers balked at 32 bits for every character. In addition, you now had a bunch of early adopters (Java and Windows NT) that had already embraced 16-bit characters. So encodings such as UTF-16 were hacked on (surrogate pairs of 16-bit characters for some Unicode code points).

I think that if the problem had been understood better at the start, namely that you have a lot more characters than will fit in 16 bits, then something like UTF-8 would likely have been chosen as the canonical encoding and we could have avoided a lot of these issues. Alas, such is the benefit of 20/20 hindsight.
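
To make the size trade-off and the surrogate-pair mechanism concrete, here is a small Python sketch. The helper name `utf16_surrogates` and the sample characters are chosen for illustration only. The arithmetic is the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def utf16_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# Bytes needed per character in each encoding (no BOM, little-endian variants).
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(f"U+{ord(ch):04X}", sizes)

high, low = utf16_surrogates(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

The loop shows why a fixed 32 bits per character looked expensive: ASCII text takes one byte per character in UTF-8 but four in UTF-32, while a character outside the original 16-bit range takes four bytes in all three encodings.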
sgopalra, more than 4 years ago
Interesting article from Joel Spolsky on Unicode and character sets: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
dang, more than 4 years ago
If curious, see also:

2015: https://news.ycombinator.com/item?id=9788253

2012: https://news.ycombinator.com/item?id=4771987
UpdatedFolders, more than 4 years ago
I personally had a good time re-reading this over and over again when I was migrating Python 2 to Python 3. It's a great resource: http://farmdev.com/talks/unicode/
neiesc, more than 4 years ago
I think the article does not explore the UTF-8 BOM: https://en.wikipedia.org/wiki/Byte_order_mark
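
As a rough illustration of what the UTF-8 BOM does in practice, here is a short Python sketch. The sample data is made up. The `utf-8-sig` codec and `codecs.BOM_UTF8` are part of the standard library.

```python
import codecs

# A UTF-8 file that starts with a byte-order mark (bytes EF BB BF).
data = codecs.BOM_UTF8 + "id,name\n".encode("utf-8")

plain = data.decode("utf-8")      # the BOM survives as an invisible U+FEFF
clean = data.decode("utf-8-sig")  # the BOM is stripped if present

print(repr(plain))  # '\ufeffid,name\n'
print(repr(clean))  # 'id,name\n'
```

With plain `utf-8`, the first field would compare unequal to `"id"` even though it looks identical when printed. `utf-8-sig` also decodes input that has no BOM, so it is a reasonable default when reading files of unknown origin.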
ExtremisAndy, more than 4 years ago
I love C++ so much, and it has brought me such joy as a hobbyist programmer, but good grief, this one aspect of it (dealing with encodings & charsets) is so depressing I just want to cry sometimes.
nunez, more than 4 years ago
F for respects for everyone who got wrecked by BOM (byte-order mark) and CRLF vs LF.
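
For the CRLF vs LF half of that, here is a minimal Python sketch of normalizing line endings on raw bytes before any further processing. The function name `normalize_newlines` is hypothetical. It assumes the data is in an ASCII-compatible encoding such as UTF-8, and would corrupt UTF-16 or UTF-32 data.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so that it does not turn into two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


print(normalize_newlines(b"a\r\nb\rc\n"))  # b'a\nb\nc\n'
```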