What developers should know about Unicode and character sets in 2013

47 points by oyvindeh, over 11 years ago

8 comments

jrochkind1, over 11 years ago
> Never assume that the data you’re dealing with is UTF-8 — ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8).

Um, what? This is just wrong. ASCII-equivalent characters take only one byte in UTF-8. Other characters take two, three, or four bytes.

If the author actually viewed ASCII text that, when in UTF-8, took three bytes per character... I don't know what they were looking at, but it wasn't UTF-8.
PeterisP, over 11 years ago
The concluding statement is a bit weird: "ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8)"

That isn't accurate. ASCII text would appear identical even if "you view the hex", because it is byte-for-byte identical in UTF-8; that's the whole point of UTF-8. You'd have to look at non-ASCII characters to see how they're encoded.
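
The claim is easy to check by looking at the bytes directly. Here is a minimal sketch that uses only Python's built-in `str.encode`; the sample characters are arbitrary picks for illustration, not taken from the article.

```python
# Sample characters chosen for illustration; any code points would do.
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji)
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies when encoded as UTF-8.
utf8_widths = {ch: len(ch.encode("utf-8")) for ch in samples}

# Pure ASCII text produces exactly the same bytes in both encodings,
# so a hex dump cannot tell them apart.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

for ch, width in utf8_widths.items():
    print(f"U+{ord(ch):04X}: {width} byte(s) -> {ch.encode('utf-8').hex(' ')}")
print("ASCII and UTF-8 bytes identical:", same_bytes)
```

Only the non-ASCII samples show multi-byte sequences, which is why you have to look at non-ASCII characters to learn anything about the encoding.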
VLM, over 11 years ago
Some background not covered in an otherwise pretty good article:

"In general, don’t save a Byte Order Mark (BOM) — it’s not needed for UTF-8, and historically could cause problems."

This attitude comes from the agony of processing UTF-16 files. I interface with a group that finds it hilarious to send me textual data in UTF-16 format, and the first hard-won lesson you learn with UTF-16 is that, superficially, the default byte order should be correct 50% of the time if guessed randomly, but somehow it's always wrong. So say you read one line of a UTF-16 text file and process it after passing it through a UTF-16 decoder. OK, no problemo: it had a BOM as the first glyph/byte/character/whatever and was converted and interpreted correctly. Then you read another line, just like you'd read a line and process a line with ASCII or UTF-8. However, they only give me a BOM at the start of the file, not at the start of each line, so invariably I translate that to garbage because the bytes are swapped.

Now, there are programmatic ways to analyze the BOM and remember it. Or you can read the whole blasted multi-gig file into memory, de-UTF-16 it all at once, and then process the file line by line. But fundamentally it's a simple one-liner sysadmin-type job to just shove the file through a UTF-16 to UTF-8 translator program before it hits my processing system. I already had to decrypt it, unzip it, and verify its hash so I know they sent the whole file to me (and correctly), so adding a conversion stage is no big deal.

And this kind of UTF-16 experience is what leads people to say things like "oh, it's Unicode? That means I should squirt out BOMs as often as possible", even though that technically only applies to UTF-16 and is not helpful for UTF-8.
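
For what it's worth, "analyze the BOM and remember it" does not have to be hand-rolled. A minimal Python sketch, assuming the input is a binary stream with a single BOM at the start: the standard library's incremental decoder for `"utf-16"` reads the BOM once and keeps the detected byte order for every later chunk. The function name `iter_utf16_lines`, the chunk size, and the sample data are made up for illustration, and only `"\n"` line endings are handled.

```python
import codecs
import io

def iter_utf16_lines(binary_stream, chunk_size=4096):
    """Yield text lines from a UTF-16 byte stream that has one leading BOM."""
    # The incremental decoder consumes the BOM once and remembers the byte
    # order, so later chunks are decoded consistently without a new BOM.
    decoder = codecs.getincrementaldecoder("utf-16")()
    pending = ""
    while True:
        chunk = binary_stream.read(chunk_size)
        if not chunk:
            break
        pending += decoder.decode(chunk)
        # Everything before the last "\n" is a complete line; keep the rest.
        *lines, pending = pending.split("\n")
        yield from lines
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending

# Hypothetical sample: big-endian UTF-16 with one BOM at the very start.
sample = codecs.BOM_UTF16_BE + "first line\nsecond line\n".encode("utf-16-be")
lines = list(iter_utf16_lines(io.BytesIO(sample), chunk_size=5))
print(lines)
```

The file is never held in memory as a whole, and the odd chunk size in the usage example shows that the decoder also copes with chunks that split a 16-bit unit in half.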
danso, over 11 years ago
I hate to be "that SEO guy", but the OP needs to do some SEO. The submitted title here is nowhere to be seen, which is too bad because it's a great title and one that I would try to Google after forgetting to bookmark this page.

Luckily I do use Pinboard, which auto-grabs the title, if it existed. But this is a helpful reference to many devs who don't read HN, and it's all but obscured.
golergka, over 11 years ago
Oh, one more fun fact: some emoji characters occupy more than one _Unicode_ character, and can be encoded in different ways depending on the device that uses them. (Before they were introduced into Unicode, they used character codes designated for custom platform-specific stuff.)

Debugging a text input field where a user can enter emoji & RTL text is FUN.
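
A small Python sketch of the "more than one Unicode character" point, using a thumbs-up followed by a skin-tone modifier as an illustrative example (the specific emoji is my pick, not from the comment). One visible emoji gives three different "lengths" depending on what you count.

```python
# Thumbs-up (U+1F44D) plus a skin-tone modifier (U+1F3FD):
# rendered as one emoji, but made of two code points.
emoji = "\U0001F44D\U0001F3FD"

code_points = len(emoji)                           # Python str counts code points
utf16_units = len(emoji.encode("utf-16-le")) // 2  # what UTF-16 based APIs count
utf8_bytes = len(emoji.encode("utf-8"))            # what byte-oriented code counts

print(code_points, utf16_units, utf8_bytes)
```

Any input-length limit or cursor logic has to decide which of these counts it means, which is part of why such fields are hard to debug.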
ygra, over 11 years ago
Site appears to be down; Google cache: http://webcache.googleusercontent.com/search?q=cache:A8oNdl-pbKIJ:the-pastry-box-project.net/oli-studholme/2013-october-8/+&cd=1&hl=de&ct=clnk&gl=de
ohwp, over 11 years ago
Note that some browsers do use the <meta charset="UTF-8"> even if the Content-Type header already specified the charset.

Another thing to add: always open a database connection in the charset of choice. And if you are a PHP user (like I am): there are still functions that don't support multibyte, so be careful.
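
To show what "doesn't support multibyte" means in practice, here is a rough Python analogy (not PHP): byte-oriented functions such as PHP's `strlen` and `substr` count and cut bytes, while the `mb_*` variants work on characters. The helper names `truncate_bytes` and `truncate_chars` are hypothetical and exist only for this sketch.

```python
text = "na\u00efve"          # "naive" with a diaeresis: 5 characters
raw = text.encode("utf-8")   # the bytes a byte-oriented function would see

char_count = len(text)       # counts characters
byte_count = len(raw)        # counts bytes; U+00EF takes two of them

def truncate_bytes(data: bytes, limit: int) -> bytes:
    """Byte-oriented truncation, as a multibyte-unaware function would do."""
    return data[:limit]

def truncate_chars(data: bytes, limit: int, encoding: str = "utf-8") -> bytes:
    """Decode first, cut on character boundaries, then re-encode."""
    return data.decode(encoding)[:limit].encode(encoding)

broken = truncate_bytes(raw, 3)  # ends in the middle of a two-byte sequence
intact = truncate_chars(raw, 3)

try:
    broken.decode("utf-8")
    broken_is_valid = True
except UnicodeDecodeError:
    broken_is_valid = False

print(char_count, byte_count, broken_is_valid, intact.decode("utf-8"))
```

The byte-based cut leaves an invalid UTF-8 fragment behind, which is the typical symptom when a multibyte-unaware function touches user text.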
hcarvalhoalves, over 11 years ago
> While there are a ton of encodings you could use, for the web use UTF-8. You want to use UTF-8 for your entire stack. So how do we get that?

You should use your language's internal Unicode representation, and decode from/encode to UTF-8 on I/O.
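
A minimal Python sketch of that pattern, assuming the bytes arriving at the boundary are UTF-8: decode once on input, work on `str` internally, encode once on output. The function name `shout` and the sample input are made up for illustration.

```python
def shout(raw_request: bytes) -> bytes:
    """Upper-case a UTF-8 payload, doing all text work on str, not bytes."""
    text = raw_request.decode("utf-8")   # input boundary: bytes -> str
    result = text.upper()                # internal work on Unicode text
    return result.encode("utf-8")        # output boundary: str -> bytes

payload = "caf\u00e9".encode("utf-8")

# Working on the decoded text upper-cases every letter...
via_text = shout(payload).decode("utf-8")

# ...while working on the raw bytes only touches the ASCII letters.
via_bytes = payload.upper().decode("utf-8")

print(via_text, via_bytes)
```

The contrast between the two results is the practical argument for keeping encoded bytes at the edges and Unicode text in the middle.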