TechEcho
How the Unicode Committee Broke the Apostrophe

35 points by Ovid almost 10 years ago

7 comments

thaumasiotes almost 10 years ago
Makes a real effort to completely gloss over a very common English use of apostrophes. From the article:

> Consider any English word with an apostrophe, e.g. *“don’t”*. The word “don’t” is a single word. It is not the word “don” juxtaposed against the word “t”. The apostrophe is part of the word, which, in Unicode-speak, means it’s a *modifier letter*, not a *punctuation mark*, regardless of what colloquial English calls it.

> According to the Unicode character database, U+2019 is a punctuation mark (General Category = Pf), while U+02BC is a modifier letter (General Category = Lm). Since English apostrophes are part of the words they’re in, they are modifier letters, and hence should be represented by U+02BC, not U+2019.

> (It would be different if we were talking about French. In French, I think it makes more sense to consider «L’Homme» as two words, or «jusqu’ici» as two words. But that’s a conversation for another time. Right now I’m talking about English.)

OK, I've considered an English word: "man's". In the sentence "That man's pants are on fire!", this is usually considered a single word, the genitive case of "man" (personally, I'm not a huge fan of that approach, since the "genitive" 's attaches to phrases, not to words, but it is the mainstream position in linguistics).

In the sentence "That man's about to jump", on the other hand, the "word" "man's" is two words joined by an apostrophe, exactly as in French "l'homme". These clitics aren't exactly rare in English. The author shows some linguistic training in the comments to his piece, but never once mentions clitics, and fails to address them when another commenter brings them up.

Use U+0027 for English apostrophes. ;p
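The General Category values quoted from the article can be checked against the Unicode Character Database. What follows is a minimal sketch, assuming Python 3.9+ and its standard `unicodedata` module; the `CANDIDATES` dictionary, its labels, and the `describe` helper are names chosen here for illustration and do not come from the article or the thread.

```python
import unicodedata

# Hypothetical labels for the three code points discussed in the thread.
CANDIDATES = {
    "ASCII apostrophe": "\u0027",
    "right single quotation mark": "\u2019",
    "modifier letter apostrophe": "\u02bc",
}


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, General Category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for label, ch in CANDIDATES.items():
    # Prints the label followed by the (code point, name, category) tuple.
    print(label, describe(ch))
```

The categories only say how the database classifies each code point (Po, Pf and Lm respectively); they do not settle whether "man's" is one word or two.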
TazeTSchnitzel almost 10 years ago
In an *ideal* world, yes, this is how it would work. But in practice it is not. The vast, vast majority of documents using non-straight quotes use ’ (U+2019[1], the Windows-1252 \x92 right curly quote that Microsoft Word <3s) for apostrophes. There's not much that can be done about that.

Unicode has to strike a balance between what's most "correct" and how the real world actually uses it.

[1] I was looking at that codepoint and thought it must be wrong. It's too big a number for a Latin-1 codepoint. Aren't the first 256 characters of Unicode just Latin-1? Well, exactly. They're Latin-1, *rather than Windows-1252*, which is where the now-infamous curly “smart quotes” come from. The two encodings are easily confused, because they're mostly the same. The difference is that Microsoft replaced the extra control codes in the upper half (0x80 to 0x9F) (who needs those, really? ASCII had too many already) with more useful new printable characters.
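The footnote's point about Latin-1 versus Windows-1252 can be seen by decoding the same byte both ways. This is a small sketch, assuming Python 3 and its built-in `cp1252` and `latin-1` codecs; the variable names are illustrative only.

```python
import unicodedata

RAW = b"\x92"  # the byte Windows-1252 assigns to the right curly quote

as_cp1252 = RAW.decode("cp1252")   # interpret the byte as Windows-1252
as_latin1 = RAW.decode("latin-1")  # interpret the byte as ISO 8859-1 (Latin-1)

# Show the resulting code point and its General Category for each decoding.
print(f"cp1252:  U+{ord(as_cp1252):04X}", unicodedata.category(as_cp1252))
print(f"latin-1: U+{ord(as_latin1):04X}", unicodedata.category(as_latin1))
```

Under Windows-1252 the byte becomes U+2019 (category Pf); under Latin-1 it becomes the control character U+0092 (category Cc), which is why the curly quote does not sit in the first 256 code points.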
shiggerino almost 10 years ago
Sadly, Unicode is a clusterfuck. But can anything be done about it? Or should we just be happy that, for once, we have managed to get decent adoption of something interoperable?
harshreality almost 10 years ago
Iʼm sold, I think... (U+02BC isn't really intended for such use, but until there's a proper alternative, other than U+2019 or U+0027, I'm using it)[1].

    # in ~/.XCompose
    include "%L"
    <Multi_key> <apostrophe> <minus> : "ʼ" U02BC # MODIFIER LETTER APOSTROPHE

[1] A potential problem with U+0027 is that low-ASCII ' and " have uses for demarcating things (like attribute values in HTML, most popularly), so if you're editing anything that uses ' for markup, you can't search and replace based on ' anymore.
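The search-and-replace problem in footnote [1] can be illustrated with a context-sensitive substitution. This is only a rough sketch, assuming Python 3; the "between two word characters" heuristic, the `INNER_APOSTROPHE` pattern, and the `to_modifier_apostrophe` function are assumptions made here for illustration, not something proposed in the thread.

```python
import re

# Heuristic (an assumption, not a rule from the thread): treat ' or U+2019 as
# an apostrophe only when it sits directly between two word characters.
INNER_APOSTROPHE = re.compile(r"(?<=\w)['\u2019](?=\w)")


def to_modifier_apostrophe(text: str) -> str:
    """Replace word-internal apostrophes with U+02BC; leave other quotes alone."""
    return INNER_APOSTROPHE.sub("\u02bc", text)


sample = "<a href='x'>don't</a>"
# The attribute quotes are untouched; only the apostrophe in "don't" changes.
print(to_modifier_apostrophe(sample))
```

The heuristic misses trailing possessives such as "dogs'" and leading elisions such as "'tis", and it cannot tell an apostrophe from a quote inside an attribute value that itself contains one, so a blind replace on ' remains unsafe in marked-up text.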
jmount almost 10 years ago
"Using U+2019 is inconsistent with the rest of the standard." I agree with the article; it's just that my negativity is such that I would say the correct statement is more like: "Using U+2019 is inconsistent with good use, making it consistent with the rest of the mess that is the standard."
sctb almost 10 years ago
https://news.ycombinator.com/item?id=9655387
lambda almost 10 years ago
I think there are a few things wrong with this argument. I'm unconvinced by the argument that this should actually be considered a modifier letter.

    Consider any English word with an apostrophe, e.g. “don’t”. The word “don’t” is a single word. It is not the word “don” juxtaposed against the word “t”. The apostrophe is part of the word, which, in Unicode-speak, means it’s a modifier letter, not a punctuation mark, regardless of what colloquial English calls it.

The definition of a modifier letter is (http://www.unicode.org/versions/Unicode7.0.0/ch07.pdf#G15832):

    Modifier letters, in the sense used in the Unicode Standard, are letters or symbols that are typically written adjacent to other letters and which modify their usage in some way. They are not formally combining marks (gc=Mn or gc=Mc) and do not graphically combine with the base letter that they modify. They are base characters in their own right. The sense in which they modify other letters is more a matter of their semantics in usage; they often tend to function as if they were diacritics, indicating a change in pronunciation of a letter, or otherwise distinguishing a letter’s use. Typically this diacritic modification applies to the character preceding the modifier letter, but modifier letters may sometimes modify a following character. Occasionally a modifier letter may simply stand alone representing its own sound.

Punctuation, on the other hand, is (http://www.unicode.org/faq/punctuation_symbols.html):

    Punctuation marks are standardized marks or signs used to clarify the meaning and separate structural units of text.

Based on these definitions, the apostrophe seen in contractions and possessives is definitely punctuation, not a modifier letter. Modifier letters indicate some effect on sound or pronunciation, either modifying an adjacent letter or having a sound on their own. U+02BC (MODIFIER LETTER APOSTROPHE) is such an example, being used to indicate a glottal stop.

Apostrophes used in contractions and possessives, however, have no effect on pronunciation; instead, just as in the definition of punctuation, they are used to "clarify the meaning and separate structural units of text."

    But we shouldn’t be perpetuating this problem. When a programmer is writing a regex that can match text in Chinese, Arabic, or any other human language supported by Unicode, they shouldn’t have to add an exception for English.

Thinking that it's possible to do text processing in a language- or writing-system-neutral way is a fallacy. Unicode simply provides an encoding that allows all of these writing systems in a single document, plus a number of algorithms that are designed to be fairly reasonable across the entire encoding, but which cannot be correct for all languages and writing systems without specific tailoring.

Many writing systems do not use spaces between words. Any form of word segmentation for these writing systems will necessarily be language specific, generally involving dictionaries. Using a regex like \w+ on Chinese or Thai text is fairly meaningless, as it will generally match an entire sentence at a time, rather than actually matching a single word.

    For godsake, apostrophes are not closing quotation marks!

No, they are not. However, they also aren't modifier letters.

If you wanted to provide a distinction for the purposes mentioned here, you would probably need to add a new, distinct punctuation character "curly apostrophe" or something of the sort (since the ASCII range apostrophe can't be reused due to its overloaded meaning). However, even if you did that, you would still need to deal with all of the legacy documents which use ASCII apostrophe and closing quotation marks; you wouldn't actually be able to simplify the implementation by making the assumption that a closing quotation mark was always actually closing a quotation.

Now having three different characters that looked identical (the modifier letter apostrophe, the closing quotation mark, and the punctuation apostrophe) would additionally add to confusion.

Even if you didn't introduce a new character, and instead used the modifier letter apostrophe as a punctuation apostrophe, you would still have all of the problems with legacy documents; it would take years for this change to make its way through all of the various word processing programs and text editors, and even after it had, there would be existing documents using the old conventions, etc.

In short, text processing is hard, because text conventions were designed for human readers who know the language, not computers trying to process text in a language-independent way, and they were designed either through handwriting or manual typesetting, not keyboard entry. You are never going to achieve a perfect text processing model that can handle all of the world's languages simply by using particular global Unicode properties of characters and applying a simple algorithm or categorization on them.

A lot of text processing will need to be contextual, language (and locale) specific, and involve dictionaries.

I don't think that switching from the punctuation closing quote character to the modifier letter apostrophe for the punctuation apostrophe is likely to help much; and the confusion caused by nearly 20 years of documents that follow the existing conventions and so having to support both conventions is likely to make the situation much worse, not better.
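Both sides of the `\w+` argument can be tried directly. This is a minimal sketch, assuming Python 3.9+ and the standard `re` module, whose `\w` is Unicode-aware for `str` patterns; the `words` helper and the sample Chinese sentence are illustrative choices made here, not taken from the article or the comments.

```python
import re

WORD = re.compile(r"\w+")  # Unicode-aware for str patterns in Python 3


def words(text: str) -> list[str]:
    """Return every maximal run of word characters in the text."""
    return WORD.findall(text)


# U+2019 is punctuation (Pf), so it splits the contraction in two.
print(words("don\u2019t"))

# U+02BC is a letter (Lm), so the contraction stays in one piece.
print(words("don\u02bct"))

# A sentence of several Chinese words, written without spaces.
print(words("我喜欢吃苹果。"))
```

The first two calls show the effect the article wants (`["don", "t"]` versus a single match), while the third returns the whole sentence minus the full stop as one "word", which is the limitation the comment describes: the character class alone does not segment text in writing systems without spaces.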