
How ASCII lost and unicode won

82 points by stevejalim almost 12 years ago

13 comments

joshuaellinger almost 12 years ago
ASCII is by far the most successful character encoding that computers have used. It was invented in 1963, back in the era of punch cards and core memory. Modern RAM did not exist until 1975 -- a decade later.

Unicode is the replacement, not the competitor, like 128-bit IPv6 addresses are the replacement for 32-bit IPv4 addresses. It was developed in the early 1990s, when RAM got cheap enough that you could afford two bytes per character.

Personally, I deal with data all the time and rarely encounter Unicode. Of course, I'm in the US, dealing with big files out of financial and marketing databases. In fact, I've seen more EBCDIC than Unicode.
timthorn almost 12 years ago
I really hate to nitpick, but the article implies that ASCII was the first character encoding. In fact, there was a rich history of different encodings before that, with different word sizes and/or incompatible 8-bit encodings. It's quite interesting to look back and see what trade-offs were made and why.
salmonellaeater almost 12 years ago
The fact that UTF-8 and UTF-16 are often exposed to programmers when dealing with text is a major failure of separation of concerns. If you had a stream of data that was gzipped, would it ever make sense to look at the bytes in the data stream before decompressing it? Variable-length text encodings are the same. Application code should only see Unicode code points.

In general it was a mistake to put variable-length encodings into the Unicode standard. A much better design would have been to use UTF-32 for the application-level interface to characters, and use a separate compression standard, optimized for fixed alphabets, when transporting or storing text. This has the advantage that the compression scheme can be dynamically updated to match the letter frequencies in real-world text, and it logically separates the ideas of encoding and compression so that the compression container is easier to swap out. And, of course, an entire class of bugs would be eliminated from application code.

(Edited first paragraph to clarify: variable-length text encodings are the same.)
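To make the class of bugs concrete, here is a minimal sketch (Python 3; the sample string and slice offset are illustrative): slicing the encoded bytes can split a multi-byte character, while slicing the decoded code points cannot.

    # Python 3: str exposes code points; bytes exposes the UTF-8 encoding.
    text = "naïve"                  # 5 code points
    utf8 = text.encode("utf-8")     # 6 bytes: 'ï' (U+00EF) takes two bytes

    print(text[:3])                 # 'naï' -- truncating code points is safe
    try:
        utf8[:3].decode("utf-8")    # b'na\xc3' -- cuts 'ï' in half
    except UnicodeDecodeError as e:
        print("corrupted:", e)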
ygra almost 12 years ago
I'm impressed. Easily readable and understandable, short, and as far as I can tell no factual inaccuracies or wrong information (unlike many other Unicode introductions and tutorials).
Digit-Al almost 12 years ago
> ASCII really should have been named ASCIIWOA: the American Standard Code for Information Exchange With Other Americans.

So he thinks Americans are the only people to use the English language, does he?
peterkelly almost 12 years ago
Another good article on this topic is the one by Joel Spolsky:

http://www.joelonsoftware.com/articles/Unicode.html
gnosis almost 12 years ago
"Designed as a single, global replacement for localised character sets, the Unicode standard is beautiful in its simplicity. In essence: collect all the characters in all the scripts known to humanity and number them in one single, canonical list. If new characters are invented or discovered, no problem, just add them to the list. The list isn't an 8-bit list, or a 16-bit list, it's just a list, with no limit on its length."

Is this really true? My impression was that UTF-32 is a fixed-length encoding which uses 32 bits to encode all of Unicode. It seems that this means that Unicode can never have more code points than could fit in 32 bits. Right?
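In fact, the limit is tighter than 32 bits: the standard caps code points at U+10FFFF (1,114,112 values, fitting in 21 bits), chosen so that every character remains representable in UTF-16 via surrogate pairs. A quick check in Python 3, which enforces that ceiling:

    print(chr(0x10FFFF))        # highest valid code point: accepted
    try:
        chr(0x110000)           # one past the ceiling
    except ValueError as e:
        print("rejected:", e)   # chr() arg not in range(0x110000)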
okwa almost 12 years ago
> These mappings of numbers to characters are just a convention that someone decided on when ASCII was developed in the 1960s. There's nothing fundamental that dictates that a capital A has to be character number 65, that's just the number they chose back in the day.

I don't think it's mere coincidence that the capital letters start at 65 and the lower case at 97 and the decimal digits at 48.
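It isn't: the codes were picked as bit patterns, so that case conversion and digit parsing are single bit operations. A short Python 3 demonstration (the sample characters are illustrative):

    print(f"{ord('A'):07b}")         # 1000001 -- 'A' = 0x41
    print(f"{ord('a'):07b}")         # 1100001 -- same pattern with bit 0x20 set
    print(f"{ord('0'):07b}")         # 0110000 -- digits are 0x30 + value

    print(chr(ord('A') | 0x20))      # 'a': lowercase by setting one bit
    print(ord('7') - ord('0'))       # 7: digit value by subtracting '0'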
stuartcw almost 12 years ago
It's not a matter of winning or losing. The pre-Unicode mix of character sets was a mess when it came to internationalization. Try truncating a Japanese Shift-JIS string in C. That will learn you.
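The pitfall translates directly; here is a minimal sketch in Python 3 rather than C (the sample string is illustrative):

    # Shift-JIS mixes 1- and 2-byte characters; byte-level truncation can
    # split a character and corrupt the string.
    sjis = "日本語".encode("shift_jis")   # 6 bytes, 2 per character
    try:
        sjis[:3].decode("shift_jis")      # cuts the second character in half
    except UnicodeDecodeError as e:
        print("mojibake:", e)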
danso almost 12 years ago
OT and out of curiosity... how do non-native English speakers experience typing/keyboard education? I can barely remember how to make any of the basic accents over the `e` when trying to sound French... are typing classes in non-English schooling systems much more sophisticated than in English (i.e. ASCII-centric) schools? I wonder if non-native English typists come away with a better handle on the power of keyboard shortcuts (whether to create accents or not).
lmm almost 12 years ago
Given the controversy over Han unification, I suspect that incompatible character sets will be with us for a while yet, more's the pity.
lelf almost 12 years ago
Well, when 99% think Unicode = encoding = UCS-2 = UTF-16, don't believe there's anything outside the BMP, and "wtf" is the only word that comes to mind when they hear about graphemes… Unicode won?
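A short Python 3 illustration of those distinctions (the sample characters are illustrative): a code point outside the BMP is one character to Unicode but two UTF-16 code units, and a user-perceived grapheme may span several code points.

    s = "𝄞"                                 # U+1D11E, outside the BMP
    print(len(s))                            # 1 code point
    print(len(s.encode("utf-16-le")) // 2)   # 2 UTF-16 code units (surrogate pair)

    flag = "🇯🇵"                             # one grapheme: two regional indicators
    print(len(flag))                         # 2 code points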
rayiner almost 12 years ago
Unicode, meh. Nobody will ever need more than 128 characters.