What Every Software Developer Must Know About Unicode and Character Sets

3 points by tmlee over 8 years ago

1 comment

nabla9 over 8 years ago
> Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point.

This is wrong.

A code point does not correspond to each platonic letter (an abstract character, in Unicode terms), nor to a grapheme. It does in many Western languages and alphabets, but not in general. A code point is just a unit of information used in __encoding__.

The mapping from code points to abstract characters is not total, injective, or surjective. Some abstract characters need more than one code point to express them. A grapheme can also be a sequence of one or more code points, and so can an abstract character. You can't split text at code point boundaries and expect to get abstract characters or graphemes.

What every software developer must know beyond code points and code units:

User-perceived character: what the user thinks of as a character.

Grapheme cluster: a sequence of coded characters that 'should be kept together'. Grapheme clusters try to represent user-perceived characters in a language-independent way. Selecting a single character and moving the cursor happen at this level.
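A small Python sketch may make the distinction between code units, code points, and grapheme clusters concrete. It uses only the standard library. The helper names `code_points` and `naive_clusters` are made up for illustration, and `naive_clusters` is a deliberate simplification: it only attaches combining marks to the preceding base character, which is not the full grapheme cluster algorithm from Unicode Standard Annex #29 (it does not handle emoji sequences or Hangul jamo, for example).

```python
import unicodedata


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def naive_clusters(text):
    """Group each combining mark with the character before it.

    Hypothetical helper for illustration only: a rough approximation
    of grapheme clusters, NOT the full UAX #29 segmentation rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)
    return clusters


composed = "\u00e9"     # "é" as one precomposed code point
decomposed = "e\u0301"  # "é" as "e" plus COMBINING ACUTE ACCENT

# The same user-perceived character, two different code point sequences.
print(code_points(composed))    # ['U+00E9']
print(code_points(decomposed))  # ['U+0065', 'U+0301']

# Code units depend on the encoding form, not only on the code points.
print(len(decomposed.encode("utf-8")))           # 3 (UTF-8 code units)
print(len(decomposed.encode("utf-16-le")) // 2)  # 2 (UTF-16 code units)

# Splitting at a code point boundary separates the accent from its base.
print(code_points(decomposed[:1]))  # ['U+0065']

# Grouping combining marks keeps the user-perceived character together.
print(len("cafe\u0301"))                  # 5 code points
print(len(naive_clusters("cafe\u0301")))  # 4 clusters
```

For real text handling, a library that implements the UAX #29 rules is the safer choice; the sketch above is only meant to show why counting or slicing by code point is not the same as counting or slicing by character.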