TechEcho
A tech news platform built with Next.js, providing global tech news and discussions.


What to know about encodings and character sets

83 points · by dshankar · almost 10 years ago

12 comments

Animats · almost 10 years ago
Something they should have mentioned: put a

    <meta charset="utf-8" />

in all your HTML documents that are in UTF-8. Note that this has to be in the first 1024 bytes of the document. Otherwise, the browser has to invoke the "encoding guesser"[1], which will sometimes guess wrong. (W3C: "The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream.") The result will be occasional users seeing random pages in the wrong encoding, depending on browser, browser version, platform, and page content.

I recently saw the front page of the New York Times misconverted because they didn't specify an encoding, and the only UTF-8 sequence near the beginning of the document was in

    <link rel="apple-touch-icon-precomposed" sizes="144×144" ...

The "×" there is not the letter x, it's the Unicode multiplication sign. This confused an encoding guesser. Don't go there.

[1] http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding
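The misconversion Animats describes is easy to reproduce: the same bytes decoded under two different charsets give different text. A minimal Python sketch (the example string mirrors the sizes="144×144" attribute above):

```python
# The "×" in sizes="144×144" is U+00D7 MULTIPLICATION SIGN, not the
# letter x.  In UTF-8 it encodes to the two bytes 0xC3 0x97.
data = "144×144".encode("utf-8")
assert data == b"144\xc3\x97144"

# Decoded with the right charset, the original text comes back.
assert data.decode("utf-8") == "144×144"

# If an encoding guesser settles on Windows-1252 instead, each byte
# becomes its own character and the result is mojibake: the two
# bytes turn into U+00C3 ("Ã") followed by an em dash (U+2014).
assert data.decode("cp1252") == "144\u00c3\u2014144"
```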
PeterisP · almost 10 years ago
We work with a lot of multilingual text, and for "what to know about encodings and character sets" we have a very simple answer - a guideline called "use UTF-8 or die".

It's not suitable for absolutely everyone (e.g. if you have a lot of Asian text you may want a different encoding), but for our use case every single deviation causes a lot of headaches, risks, and unnecessary work fixing garbage data.

In simplistic terms, what we mean by this guideline:

* In your app, 100% of all human text should be stored as UTF-8 only, no exceptions. If you need to deal with data in other encodings - other databases, old documents, whatever - make a wrapper/interface that takes some resource ID and returns the data with everything properly converted to UTF-8, and has no way (or at least no convenient way) to access the original bytes in another encoding.

* In all persistence layers, store text only as UTF-8. If at all possible, don't even provide options to export files in other encodings. If legacy/foreign data stores need another encoding, then in your code never have an API that requires a programmer to pass data in *that* encoding - the functions "store_document_to_the_mainframe_in_EBCDIC" and "send_cardholder_data_to_that_weird_CC_gateway" should take UTF-8 strings only and handle the encodings themselves.

* In all [semi-]public APIs, treat text as UTF-8-only and *document that*. If your API documentation mentions a text field, state the encoding so that there is no guessing or assuming by anyone.

* In all system configuration, set UTF-8 as the default whenever possible. A database server? Make sure that any new databases/tables/text fields will have UTF-8 set as the default, so unless someone takes explicit action, user-local-language encodings won't accidentally appear.

* Whoever introduces a single byte of data in a different encoding is responsible for fixing the resulting mess. This is the key part. Did you write a data input function that passed on data in the user's default encoding, tested it only on US-ASCII rather than non-English symbols, and got a bunch of garbage data stored? You're responsible for finding the relevant entries and fixing them, not only your code. Used a third-party library that crashes or loses data when passed non-English Unicode symbols? Either fix the library (if it's open source) or rewrite the code to use a different one.
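The boundary-wrapper idea in the first bullet can be sketched in a few lines of Python (the function name, source IDs, and encoding table here are hypothetical, for illustration only):

```python
# Hypothetical mapping of legacy data sources to their encodings;
# cp500 is an EBCDIC codec, latin-1 a common Western legacy encoding.
LEGACY_ENCODINGS = {"mainframe": "cp500", "old_crm": "latin-1"}

def load_document(source: str, raw: bytes) -> str:
    """Decode raw bytes from a known source.  Everything past this
    boundary deals only in properly decoded text, stored as UTF-8."""
    encoding = LEGACY_ENCODINGS.get(source, "utf-8")
    return raw.decode(encoding)

# A Latin-1 byte for "é" comes out as proper text, not garbage:
assert load_document("old_crm", b"caf\xe9") == "café"
# Unknown sources are assumed to already be UTF-8:
assert load_document("web", "naïve".encode("utf-8")) == "naïve"
```

The point of the design is that no caller ever sees the raw legacy bytes, so there is no convenient way to smuggle them deeper into the system.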
peapicker · almost 10 years ago
From the article: "Overall, Unicode is yet another encoding scheme."

It is more than that - it includes algorithms as well: for instance, dealing with RTL languages with ordering and shaping rules (i.e. Arabic), how to know what to do when RTL languages are mixed with LTR (is that '.' at the end of '123' a decimal point, or a period? (this determines whether it goes to the right or the left of the sequence)), and how to know when data is equivalent despite being normalized or not, etc.
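Normalization, one of the algorithms mentioned, can be seen directly with Python's standard library:

```python
import unicodedata

# "é" can be one code point (U+00E9) or two ("e" plus U+0301,
# COMBINING ACUTE ACCENT).  They render identically...
precomposed = "\u00e9"
decomposed = "e\u0301"
assert len(precomposed) == 1 and len(decomposed) == 2

# ...but compare unequal as raw code point sequences:
assert precomposed != decomposed

# Unicode's normalization algorithm (NFC here) makes them comparable:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```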
scottfr · almost 10 years ago
The linked article by Joel Spolsky is also great: http://www.joelonsoftware.com/articles/Unicode.html
ninjakeyboard · almost 10 years ago
I hate that I have to know this stuff. Working on implementing a spec today where handling character encoding is a requirement.
vorg · almost 10 years ago
> It basically defines a ginormous table of 1,114,112 code points that can be used for all sorts of letters and symbols. That's plenty to encode all existing, pre-historian and future characters mankind knows about. There's even an unofficial section for Klingon. Indeed, Unicode is big enough to allow for unofficial, private-use areas.

The private use areas only encode about 137,000 code points (U+E000 to U+F8FF and U+F0000 to U+10FFFF) and are running out quickly. Most of U+E000 to U+F8FF is used by many different private agreements, and some pseudo-public ones like the ConScript registry, which encodes Klingon, linked to in the article. ConScript also uses a large chunk of plane F to encode the constructed script Kinya, i.e. the 3,696 code points in U+F0000 to U+F0E6F, see http://www.kreativekorp.com/ucsur/charts/PDF/UF0000.pdf . It takes up so much room because it's a block script like Korean Hangul and is encoded by formula just like Hangul. Each Korean Hangul block is made up of 2 or 3 jamo: one of 19 leading consonants, one of 21 vowels, and optionally one of 27 trailing consonants, giving a total of 19 * 21 * 28 = 11,172 possible syllable blocks, generated by formula into the range U+AC00 to U+D7A3. Kinya also uses such a formula to generate its script, and I'm sure many other constructed block scripts will make their way into the quasi-official ConScript registry. I'm even working on one of my own.

In fact, rather than filling up U+F0000 to U+10FFFF, such conscripts only need to fill up the first quarter of it (i.e. U+F0000 to U+F7FFF) for Unicode to run out of private use space, because the remainder (U+F8000 to U+10FFFF) is needed for a second-tier surrogate system (see https://github.com/gavingroovygrover/utf88 ) to extend the code point space back up to a billion-ish code points, as it was originally specified by Pike and Thompson until it was clipped back down to about a million in 2003.

So Unicode is not "plenty to encode" or "big enough to allow for" all known, future, or private-use characters.
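The Hangul generation formula described above can be written out directly; a short Python sketch:

```python
# Each Hangul syllable block at U+AC00..U+D7A3 is generated from its
# jamo indices rather than being assigned a code point one by one.
def hangul_syllable(lead: int, vowel: int, trail: int = 0) -> str:
    """lead in 0..18, vowel in 0..20, trail in 0..27 (0 = none)."""
    return chr(0xAC00 + (lead * 21 + vowel) * 28 + trail)

# 19 * 21 * 28 = 11,172 possible blocks, filling U+AC00..U+D7A3:
assert 19 * 21 * 28 == 11172
assert hangul_syllable(0, 0) == "\uac00"        # first block, 가
assert hangul_syllable(18, 20, 27) == "\ud7a3"  # last block, 힣
```

Any block script encoded this way consumes its whole index space up front, which is why Kinya alone takes thousands of code points.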
keedot · almost 10 years ago
Interesting that I actually don't need to know this stuff. I think you'll find that MOST developers actually don't need to know this stuff. People seem to forget that the vast bulk of developers do corporate and in-house development in a single language, that being English.

I know this stuff because I like to understand how things work, but for all the devs under me, there are probably a thousand concepts I want them to understand before they start tackling encoding beyond knowing when to call the correct function.
imaok · almost 10 years ago
One thing I'm still confused about: what exactly happens when you copy-paste some text from one app to another? What encoding will the copied text be in?
JoachimS · almost 10 years ago
A good, gentle introduction that goes through everything step by step. Turns to PHP at the end.
carsonreinke · almost 10 years ago
...never assume one byte per glyph
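A concrete Python illustration of why one byte per glyph (or even one code point per glyph) is a bad assumption:

```python
# One on-screen glyph can be several code points, and each code point
# can be several UTF-8 bytes.
accented = "e\u0301"          # "é" as e + combining acute: one glyph
assert len(accented) == 2     # ...but two code points
assert len(accented.encode("utf-8")) == 3   # ...and three bytes

snowman = "\u2603"            # "☃", a single code point
assert len(snowman) == 1
assert len(snowman.encode("utf-8")) == 3    # still three bytes
```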
SilasX · almost 10 years ago
Sorry, but now I reflexively flag-on-sight any instance of this clickbaity, obviously overstated "every programmer needs to know about semiconductor opcodes/mainframe architecture/etc".
SFjulie1 · almost 10 years ago
PHP devs are so slow they just now adopted UTF-8 and see its glory.

I myself "UTF-8 or die!"d a long time ago and discovered it was not a good idea.

I will skip the problems of parsing the nth character, string length vs memory used, and canonicalization of strings for comparison, and go directly to 2 problems:

* There exist cases in which Latin-1 and UTF-8 are mangled in the same string (e.g. HTTP/SIP headers are Latin-1 while the content can be UTF-8, and you may want to store a full HTTP/SIP transaction verbatim for debugging purposes). It can be stored in ISO Latin-3 (the code table for Esperanto, to be sarcastic), but will explode as UTF-8 unless you re-encode it (Base64).

* Tools are only partly UTF-8 compliant: MySQL (which is as good as PHP in terms of quality) is clueless about UTF-8 (hint: indexes and collation), and PHP too: https://bugs.php.net/bug.php?id=18556 <--- THIS BUG TOOK 10 YEARS TO BE CLOSED

The whole point is that developers don't understand the organic nature of culture, especially of its writing, and the diversity of cultures.

They think that because some rules apply in their language, they also apply in others. BUT:

* PHP devs: the lowercase of I is not always i (it can be i without a dot). It took the devs 10 years to find where their bug was!
* Shortening a memory representation does not always shorten its graphical representation (Apple's bug with SMS in Arabic).
* Sort orders are not immutable (collation can vary not only from language to language but also according to the administrative context (e.g. proper names in French)).
* Inflections are hell, and the text size of errors varies a lot (hence the instability of Windows 95 in French, because error messages were copied into a reserved page and the fixed size was less than the whole size of the domain's corpus... hence any contiguous block in memory (lower xor upper bound) would potentially have its memory corrupted).

My point is that UTF-8 is not hell. The real world is complex. It becomes hell when some dumb devs think that by manipulating strings that represent any language, they know about any language, and apply "universal" rules that are not.

Some problems can be solved by ignoring them. But with culture that is not the case.

And actually, Unicode sucks because it is US-centric:

* Computers should be able to store all of our past books and make them available for the future, even in verbatim mode. But Unicode lacks archaeological character sets like THIS: https://fr.wikipedia.org/wiki/S_long . I don't care about the USA's lack of history. I see no use in the computer if it requires sacrificing our histories and cultures.
* https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name - some people cannot even use it in their own language.

Unicode suffers from many problems, plus a conceptual one: it has immutable characters AND directives (change the direction of the writing, apply ligatures)... which will create security concerns (one of the funniest being the possibility, by appending to a string, of silently re-editing text already on an effector (screen or printer))... We are introducing typesetting rules into Unicode.

For those who have used TeX for a long time, the non-separation of the almost-programmatic typography from the graphemes is like not separating the model and the controller. Which actually also calls for the view (the rendering), and thus the fonts. Having the encoding of the long s does not tell you what it looks like unless you have a canonical representation of the code point as a grapheme.

And since we are printing/creating documents for legal purposes, we may want to control the view to ensure that the rendering of the string will not alter the graphical representation in a way that compromises its meaning. If someone signs in a box, you don't want the signature to alter the representation anywhere, or worse, without notice.

The devil lies in the details. Unicode is a Babel tower that may well crash for the same reason as in the Bible: hubris.
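The dotted/dotless I problem mentioned above is visible even in Python's locale-independent case mapping:

```python
# "The lowercase of I is not always i": Turkish distinguishes dotted
# and dotless I.  U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)
# lowercases to TWO code points: "i" plus U+0307 COMBINING DOT ABOVE.
assert "\u0130".lower() == "i\u0307"

# Correct Turkish casing ("I" -> dotless "ı", U+0131) needs
# locale-aware tailoring that plain str.lower() does not perform:
assert "I".lower() == "i"
assert "I".lower() != "\u0131"
```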