
Ask HN: Why not extended grapheme cluster as character type?

2 points by red2awn over 4 years ago
Among the most popular programming languages with native Unicode support (Go, Java, Rust, .NET, etc.), why do most of them define a character as a Unicode codepoint/scalar value instead of an extended grapheme cluster? Swift is the only exception I know of.

My assumption is that most programs would want to deal with user-perceived characters rather than individual codepoints. Is this decision made for performance reasons, or something else?
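
For concreteness, a minimal Swift sketch of the distinction the question draws; the string literal and counts are illustrative and not from the post:

    // "cafe" followed by U+0301 COMBINING ACUTE ACCENT renders as "café".
    let cafe = "cafe\u{301}"

    print(cafe.count)                 // 4 – extended grapheme clusters (Swift's Character)
    print(cafe.unicodeScalars.count)  // 5 – Unicode scalar values ("characters" in most other languages)
    print(cafe.utf8.count)            // 6 – bytes in the UTF-8 encoding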

2 comments

brudgers over 4 years ago
The reason is computer science. In computer science, a character is simply part of an arbitrary alphabet. Zero or more characters form a string. A string can be input into one of several types of automata, e.g. finite state, push-down, Turing machine, etc. Automata accept or reject strings, or loop endlessly (in the case of Turing machines).

None of this has anything to do with human-readable text, except that it is accidentally possible to represent human-readable text with a single byte in the case of English, and since C was written by English speakers who were familiar with ASCII encoding as part of their job, "char" became synonymous with "byte." And so strings of bytes were used to encode English text, and "string" was a useful but very leaky abstraction for English text. How leaky? C strings are NULL-terminated because with a Turing machine there are no guarantees that the input terminates, and this is useful for switching telephone networks, since the switches have to work continuously even though individual calls end, i.e. the end of a call is not the end of the stream of input for a network switch.

Conflating "string" with "text" is a source of endless clusterfucks. The most notable is Python 2 strings versus Python 3 strings and all the code that had to be rewritten because of a dumb design decision that could not be questioned under the governing model of dictatorship. Contrast Perl, where the difference between strings and text was understood: handling text with regexes meant that parsing an evilly constructed regex might take a really long time, an engineering tradeoff against the finite running time of the Unix regex, which lacked backtracking and lookahead, because the Unix "regex" was a regular expression in the automata-theory sense and as such equivalent to a finite state machine.

All of which probably expresses an opinion I might hold about Swift if I looked at it closely, but I haven't, because I know enough about it to know that it does things that are hard to reason about in the way I tend to apply engineering reasoning.

But that's me, and if it gets the job done for you, then that's great. I just don't want to try to figure out something that was designed that way.
Someone over 4 years ago
History, dominance of English in computing, performance, plus the fact that even extended grapheme clusters don't fully solve the problem. In Swift,

    "ﬃ".count

returns 1.
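
That result holds because "ﬃ" (U+FB03, LATIN SMALL LIGATURE FFI) is a single Unicode scalar and a single extended grapheme cluster, even though a reader perceives three letters. A small sketch of the point, assuming Foundation is available for the compatibility decomposition:

    import Foundation

    let ligature = "\u{FB03}"             // "ﬃ"
    print(ligature.count)                 // 1 – one extended grapheme cluster
    print(ligature.unicodeScalars.count)  // 1 – one Unicode scalar value

    // NFKD compatibility decomposition expands the ligature to plain "ffi",
    // but that is a normalization choice, separate from grapheme segmentation.
    let decomposed = ligature.decomposedStringWithCompatibilityMapping
    print(decomposed)        // "ffi"
    print(decomposed.count)  // 3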