科技回声

8 comments

at_a_remove over 4 years ago

I was looking for the catch. Here it is: "It's really simple: Know what encoding a certain piece of text, that is, a certain byte sequence, is in, then interpret it with that encoding."

That's like "knowing" the truth. How?

I have received some very interesting files that made Python yack Unicode errors, again and again. Why? Not only did I not "know" what encoding they were in -- the encodings changed at different points in the stream of bytes. I call this "slamming bytes together" because somewhere along the line, someone's program did exactly that.

Everything is simple -- until it isn't.
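For streams like the one described above, where the encoding changes partway through, one pragmatic approach is to decode in small pieces and try a short list of candidate encodings on each piece. Here is a minimal Python sketch. It assumes the data is line-oriented and that UTF-8 and Windows-1252 are the only candidates; both are assumptions for illustration, and `decode_mixed` is a hypothetical helper, not something from the article.

```python
def decode_mixed(data: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Decode line by line, trying each candidate encoding in order."""
    out = []
    for line in data.splitlines(keepends=True):
        for enc in candidates:
            try:
                out.append(line.decode(enc))
                break
            except UnicodeDecodeError:
                continue
        else:
            # No candidate fit this line: keep going and mark the
            # undecodable bytes with U+FFFD instead of raising.
            out.append(line.decode("utf-8", errors="replace"))
    return "".join(out)


# Two lines "slammed together": the first is UTF-8, the second Windows-1252.
sample = "na\u00efve\n".encode("utf-8") + "caf\u00e9\n".encode("cp1252")
print(decode_mixed(sample))
```

This is a heuristic, not detection: a Windows-1252 line can occasionally be valid UTF-8 as well, and nothing here helps when the encoding switches in the middle of a line.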

jbandela1 over 4 years ago

Note: This post is basically a TLDR of https://www.theregister.com/2013/10/04/verity_stob_unicode/ by Verity Stob.

One of the reasons there is a lot of confusion about encodings vs Unicode is that Unicode was initially an encoding. It was thought that 65K characters was enough to represent all the characters in actual use across the languages, and thus you just needed to change from an 8-bit char to a 16-bit char and all would be well (apart from the issue of endianness). Thus Unicode initially specified what each symbol would look like encoded in 16 bits (see http://unicode.org/history/unicode88.pdf, particularly section 2). Windows NT, Java, and ICU all embraced this.

Then it turned out that you needed a lot more characters than 65K, and instead of each character being 16 bits, you would need 32-bit characters (or else have weird 3-byte data types). Whereas people could justify going from 8 bits to 16 bits as the cost of not having to worry about charsets, most developers balked at 32 bits for every character. In addition, you now had a bunch of early adopters (Java and Windows NT) that had already embraced 16-bit characters. So encodings such as UTF-16 were hacked on (surrogate pairs of 16-bit characters for some Unicode code points).

I think if the problem had been understood better at the start, namely that you have a lot more characters than will fit in 16 bits, then something like UTF-8 would likely have been chosen as the canonical encoding and we could have avoided a lot of these issues. Alas, such is the benefit of 20/20 hindsight.
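To make the size trade-off and the surrogate-pair mechanism above concrete, here is a small Python sketch. `utf16_units` is a hypothetical helper written for illustration. It applies the standard UTF-16 rule: subtract 0x10000 from the code point and split the remaining 20 bits into a high and a low surrogate.

```python
def utf16_units(ch: str) -> list:
    """Return the UTF-16 code units (as ints) for a single code point."""
    cp = ord(ch)
    if cp < 0x10000:
        # Fits in one 16-bit unit, as the original 16-bit design assumed.
        return [cp]
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return [high, low]


# "A", e-acute, euro sign, and an emoji outside the 16-bit range.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    units = [f"{u:04X}" for u in utf16_units(ch)]
    print(f"U+{ord(ch):04X}", sizes, units)
```

Only the last character needs two UTF-16 units, and UTF-32 spends four bytes on every one of them, which is the cost developers balked at.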

sgopalra over 4 years ago

Interesting article from Joel Spolsky on Unicode and character sets: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

dang over 4 years ago

If curious, see also:

2015: https://news.ycombinator.com/item?id=9788253

2012: https://news.ycombinator.com/item?id=4771987

UpdatedFolders over 4 years ago

I personally had a good time re-reading this over and over again when I was migrating Python 2 to Python 3. It's a great resource: http://farmdev.com/talks/unicode/

neiesc over 4 years ago

I think it does not explore the UTF-8 BOM: https://en.wikipedia.org/wiki/Byte_order_mark
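For anyone who has not met the UTF-8 BOM: it is the three bytes EF BB BF at the start of a file, and a plain UTF-8 decode keeps it as an invisible U+FEFF character. A minimal Python sketch using the standard `utf-8-sig` codec, which drops the BOM when it is present (the sample CSV header is made up for illustration):

```python
import codecs

# A made-up CSV header, as written by a tool that prepends a UTF-8 BOM.
raw = codecs.BOM_UTF8 + "id,name\n".encode("utf-8")

plain = raw.decode("utf-8")      # BOM survives as U+FEFF at index 0
clean = raw.decode("utf-8-sig")  # BOM is stripped if present

print(repr(plain[0]))
print(plain.startswith("id"), clean.startswith("id"))

# utf-8-sig is also safe on input that has no BOM at all.
print("id,name\n".encode("utf-8").decode("utf-8-sig") == clean)
```

The classic failure is the first column name silently becoming "\ufeffid" and then not matching anything.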

ExtremisAndy over 4 years ago

I love C++ so much, and it has brought me such joy as a hobbyist programmer, but good grief, this one aspect of it (dealing with encodings & charsets) is so depressing I just want to cry sometimes.

nunez over 4 years ago

F for respects for everyone who got wrecked by BOM (byte-order mark) and CRLF vs LF.
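The CRLF half of that pain usually shows up as a stray "\r" stuck to the end of a value after splitting on "\n". A short Python sketch of the symptom and two ways around it, assuming the text is already decoded; `normalize_newlines` is a hypothetical helper, while `io.TextIOWrapper` with `newline=None` is the standard library's universal-newlines mode:

```python
import io


def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


windows_text = "key=value\r\nother=1\r\n"

# The symptom: the carriage return rides along with the value.
first = windows_text.split("\n")[0]
print(repr(first), first == "key=value")

# Fix 1: normalize before splitting.
print(normalize_newlines(windows_text).split("\n")[0] == "key=value")

# Fix 2: let the text layer translate line endings while reading bytes.
reader = io.TextIOWrapper(io.BytesIO(windows_text.encode("utf-8")),
                          encoding="utf-8", newline=None)
print(repr(reader.read()))
```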

What programmers need to know about encodings and charsets (2011)
