UTF-8 Everywhere (2012)

122 points by thefox almost 9 years ago

15 comments

Animats almost 9 years ago
The Python problem is amusing. Python 3 has three representations of strings internally (1-byte, 2-byte, and 4-byte) and promotes them to a wider form when necessary. This is mostly to support string indexing. It probably would have been better to use UTF-8, and create an index array for the string when necessary.

You rarely need to index a string with an integer in Python. FOR loops don't need to. Regular expressions don't need to. Operations that return a position into the string could return an opaque type which acts as a string index. That type should support adding and subtracting integers (at least +1 and -1) by progressing through the string. That would take care of most of the use cases. Attempts to index a string with an int would generate index arrays internally. (Or, for short strings, just start at the beginning every time and count.)

Windows and Java have big problems. They really are 16-bit char based. It's not Java's fault; they standardized when Unicode was 16 bits.
wcoenen almost 9 years ago
It's interesting how history seems to have repeated itself with UTF-16. With ASCII and its extensions, we had 128 "normal" characters and everything else was exotic text that caused problems.

Now with UTF-16, the "normal" characters are the ones in the Basic Multilingual Plane that fit in a single UTF-16 code unit.
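To make the BMP boundary concrete, here is a small Python sketch (the helper name `utf16_code_units` is made up for illustration). It counts how many 16-bit units a string needs in UTF-16: characters inside the Basic Multilingual Plane need one, anything above U+FFFF needs a surrogate pair.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    # utf-16-le is used so that no byte order mark is added to the output.
    return len(text.encode("utf-16-le")) // 2


for ch in ["A", "я", "中", "😀"]:
    print(f"U+{ord(ch):04X} needs {utf16_code_units(ch)} UTF-16 code unit(s)")
```

Code that assumes one unit per character works for the first three samples and silently breaks on the last one, which is the same failure mode 8-bit code pages had with anything outside ASCII.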
chillacy almost 9 years ago
This article was from 4 years ago. Since then, UTF-8 adoption has increased from 68% to 87% of the top 10 million websites on Alexa:

https://w3techs.com/technologies/history_overview/character_encoding/ms/y
IvanK_net almost 9 years ago
When you create a table in MySQL, a text attribute (VARCHAR etc.) is not encoded in UTF8 by default.

I think UTF8 should be the default and only format for storing text attributes in all databases and all other text encodings should be removed from database systems.
yuhong almost 9 years ago
I have the feeling that back in 1990, ISO 10646 wanted 32-bit characters but had no software folks on that committee, while the Unicode people were basically software folks but thought that 16-bit was enough (this dates back to the original Unicode proposal from 1988). UTF-8 was only created in 1992, after the software folks rejected the original DIS 10646 in mid-1991.
mangix almost 9 years ago
This seems specific to Windows. UTF8 is already standard in Linux and the web, for example. It's just Microsoft.
voaie almost 9 years ago
Maybe off-topic, but I wonder if anyone is planning a redesign of Unicode for the far future? Or is there a better way to handle characters, so we don't require a giant library like ICU?
Const-me almost 9 years ago
> In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes.

Wrong: up to 4 bytes UTF16, and up to 6 bytes UTF8.

> Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.

Cyrillic, Hebrew and several other languages still have spaces and punctuation, which take a single byte in UTF8. Now it's 2016, RAM and storage are cheap and declining, but the CPU branch misprediction cost is the same 20 cycles and not going to decline.

> plain Windows edit control (until Vista)

Windows XP is 14 years old, and now in 2016 its market share is less than 3%. Who cares what was before Vista?

> In C++, there is no way to return Unicode from std::exception::what() other than using UTF-8.

The exceptions that are part of the STL don't return Unicode at all, they are in English.

If you throw your custom exceptions, return non-English messages in exception::what() in UTF-8, catch std::exception and call what(), you'll get English error messages for STL-thrown exceptions, and non-English error messages for your custom exceptions.

I'm not sure mixing GUI languages in a single app is always the right thing.

> First, the application must be compiled as Unicode-aware

The oldest Visual Studio I have installed is 2008 (because I sometimes develop for WinCE). I've just created a new C++ console application project, and by default it is already Unicode-aware.

So, for anyone using a Microsoft IDE, this requirement is not a problem.
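For reference on the byte counts discussed above: UTF-8 as currently specified (RFC 3629) stops at U+10FFFF, so a code point takes at most 4 bytes; the 5- and 6-byte forms belong to the original, wider definition. A small Python check follows, where the helper name `encoded_sizes` and the sample sentence are arbitrary choices for illustration. It also shows the effect of single-byte spaces and punctuation in Cyrillic text.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without any BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Highest valid code point: 4 bytes in both encodings.
print(encoded_sizes("\U0010FFFF"))

# Cyrillic letters take 2 bytes in both encodings, but the comma, space and
# exclamation mark take 1 byte in UTF-8 and 2 bytes in UTF-16.
print(encoded_sizes("Привет, мир!"))
```

For mostly-Cyrillic text the two encodings end up close in size, with UTF-8 slightly ahead because of the ASCII punctuation and spaces.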
hackuser almost 9 years ago
Is there any application where UTF-8 isn't the best choice for long-term (i.e., 20-200 year) forward compatibility?
Murk almost 9 years ago
After considering this problem at length in the past, I too favoured UTF-8 at the time.

I remember a project (circa 1999) I worked on which was a feature phone HTML 3.4 browser and email client (one of the first). The browser/IP stack handled only ASCII/code page characters to begin with. To my surprise it was decided to encode text on the platform using UTF-16. Thus the entire code base was converted to use 16-bit code points (UCS-2). On a resource-constrained platform (~300k RAM, IIRC), it would have been better, I think, to update the renderer and email client to understand UTF-8.

Nice as it might be to think that a UTF-16 or UTF-32 unit is a "character", it is, as has been pointed out, not the case, and when you look into language you can see how it never can be that simple.
misnome almost 9 years ago
I quite like Swift's approach: Characters, where a character can be "An extended grapheme cluster ... a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character.". In practice, this seems to mean that things like multibyte entries and modified entries end up as a single entry.

As the trade-off, directly indexing into strings is... either not possible or discouraged, and often relies on an opaque(?) indexing class.

The main weirdness I have encountered so far is that the Regex functions operate only on the old, Objective-C method of indexing, so a little swizzling is required to handle things properly.
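A rough Python illustration of the gap between code points and user-perceived characters. The function `rough_clusters` is a made-up helper that only attaches combining marks to the preceding base character; it is a crude approximation and not the full extended grapheme cluster algorithm (Unicode UAX #29) that Swift implements, so it will not handle emoji sequences or Hangul correctly.

```python
import unicodedata
from typing import List


def rough_clusters(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(len(word))                  # counts code points
print(len(rough_clusters(word)))  # counts approximate user-perceived characters
```

Here the string has five code points but reads as four characters, which is why indexing by integer position is a poor fit for text that is meant for humans.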
cm3 almost 9 years ago
Offtopic, but does anyone know of a way to ensure I don't introduce non-ASCII filenames, to ensure broad portability across systems? I've had to resort to disabling UTF-8 on Linux to achieve that.
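One possible approach that leaves the system locale alone is to scan the tree for offending names, for example from a pre-commit hook or CI step. The sketch below is only an assumption about what such a check could look like; the function names `non_ascii_paths` and `main` are invented for illustration, and `str.isascii()` needs Python 3.7 or later.

```python
import os
from typing import Iterator, List


def non_ascii_paths(root: str) -> Iterator[str]:
    """Yield every file or directory path under root whose name is not pure ASCII."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not name.isascii():
                yield os.path.join(dirpath, name)


def main(argv: List[str]) -> int:
    """Print offending paths; return 1 if any were found, else 0."""
    root = argv[1] if len(argv) > 1 else "."
    offenders = list(non_ascii_paths(root))
    for path in offenders:
        print(path)
    return 1 if offenders else 0
```

Calling `sys.exit(main(sys.argv))` from a script would make the check fail the hook or build whenever a non-ASCII name appears.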
wrp almost 9 years ago
This militancy to *force* everyone to use UTF-8 is bad engineering. I'm thinking of GNOME 3, where you aren't even allowed the option of choosing ASCII as a default setting, only UTF-8 or ISO-8859-x. A default setting is just as important for what it filters out as for what it passes through. I use a lot of older tools on *nix that are ASCII-only, in tool chains that slurp and munge text. If the chain includes any of these UTF-8-only apps, I'm constantly dealing with the problem of invalid ASCII passing through.
codeulike almost 9 years ago
With or without a BOM?
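For context on what the BOM question means in practice, here is a short Python sketch using the standard `utf-8-sig` codec. The helper name `decode_utf8_any` is hypothetical; the point is that a reader can accept both forms, while a plain `utf-8` decode leaves the BOM in the text as U+FEFF.

```python
import codecs

text = "héllo"
without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")  # prepends the 3-byte BOM EF BB BF

print(with_bom == codecs.BOM_UTF8 + without_bom)


def decode_utf8_any(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")


# A plain utf-8 decode keeps the BOM as an invisible leading character.
print(repr(with_bom.decode("utf-8")[0]))
```

Writing without a BOM and reading with `utf-8-sig` is one tolerant combination, since the decoder handles input produced either way.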
douche almost 9 years ago
Some days, I imagine a parallel universe, where the ancient Chinese had called ideograms a bad idea, and went on to develop a proper alphabet. Unicode would be pretty much unnecessary.