I think there are a few things wrong with this argument. In particular, I'm unconvinced that the apostrophe should actually be considered a modifier letter.<p><pre><code> Consider any English word with an apostrophe, e.g.
“don’t”. The word “don’t” is a single word. It is not the
word “don” juxtaposed against the word “t”. The
apostrophe is part of the word, which, in Unicode-speak,
means it’s a modifier letter, not a punctuation mark,
regardless of what colloquial English calls it.
</code></pre>
The definition of a modifier letter is (<a href="http://www.unicode.org/versions/Unicode7.0.0/ch07.pdf#G15832" rel="nofollow">http://www.unicode.org/versions/Unicode7.0.0/ch07.pdf#G15832</a>):<p><pre><code> Modifier letters, in the sense used in the Unicode
Standard, are letters or symbols that are typically
written adjacent to other letters and which modify their
usage in some way. They are not formally combining marks
(gc=Mn or gc=Mc) and do not graphically combine with the
base letter that they modify. They are base characters in
their own right. The sense in which they modify other
letters is more a matter of their semantics in usage; they
often tend to function as if they were diacritics,
indicating a change in pronunciation of a letter, or
otherwise distinguishing a letter’s use. Typically this
diacritic modification applies to the character preceding
the modifier letter, but modifier letters may sometimes
modify a following character. Occasionally a modifier
letter may simply stand alone representing its own sound.
</code></pre>
Punctuation, on the other hand, is (<a href="http://www.unicode.org/faq/punctuation_symbols.html" rel="nofollow">http://www.unicode.org/faq/punctuation_symbols.html</a>):<p><pre><code> Punctuation marks are standardized marks or signs used to
clarify the meaning and separate structural units of text.
</code></pre>
Based on these definitions, the apostrophe seen in contractions and possessives is definitely punctuation, not a modifier letter. Modifier letters indicate some effect on sound or pronunciation, either modifying an adjacent letter or having a sound on their own. U+02BC (MODIFIER LETTER APOSTROPHE) is such an example, being used to indicate a glottal stop.<p>Apostrophes used in contractions and possessives, however, have no effect on pronunciation; instead, just as in the definition of punctuation, they are used to "clarify the meaning and separate structural units of text."<p><pre><code> But we shouldn’t be perpetuating this problem. When a
programmer is writing a regex that can match text in
Chinese, Arabic, or any other human language supported by
Unicode, they shouldn’t have to add an exception for
English.
</code></pre>
Thinking that it's possible to do text processing in a language- or writing-system-neutral way is a fallacy. Unicode simply provides an encoding that allows all of these writing systems in a single document, plus a number of algorithms that are designed to be fairly reasonable across the entire encoding, but which cannot be correct for all languages and writing systems without specific tailoring.<p>Many writing systems do not use spaces between words. Any form of word segmentation for these writing systems will necessarily be language specific, generally involving dictionaries. Using a regex like \w+ on Chinese or Thai text is fairly meaningless, as it will generally match an entire sentence at a time, rather than a single word.<p><pre><code> For godsake, apostrophes are not closing quotation marks!
</code></pre>
No, they are not. However, they also aren't modifier letters. If you wanted to provide a distinction for the purposes mentioned here, you would probably need to add a new, distinct punctuation character, "curly apostrophe" or something of the sort (since the ASCII range apostrophe can't be reused due to its overloaded meaning). However, even if you did that, you would still need to deal with all of the legacy documents which use the ASCII apostrophe and closing quotation marks; you wouldn't actually be able to simplify the implementation by assuming that a closing quotation mark always closes a quotation.<p>Having three different characters that look identical (the modifier letter apostrophe, the closing quotation mark, and the punctuation apostrophe) would only add to the confusion.<p>Even if you didn't introduce a new character, and instead used the modifier letter apostrophe as a punctuation apostrophe, you would still have all of the problems with legacy documents. It would take years for this change to make its way through all of the various word processing programs and text editors, and even after it had, there would still be existing documents using the old conventions.<p>In short, text processing is hard, because text conventions were designed for human readers who know the language, not computers trying to process text in a language-independent way, and they were designed around handwriting or manual typesetting, not keyboard entry. You are never going to achieve a perfect text processing model that can handle all of the world's languages simply by using particular global Unicode properties of characters and applying a simple algorithm or categorization to them. 
A lot of text processing will need to be contextual, language (and locale) specific, and involve dictionaries.<p>I don't think that switching from the punctuation closing quote character to the modifier letter apostrophe for the punctuation apostrophe is likely to help much. With nearly 20 years of documents that follow the existing convention, software would have to support both conventions, which is likely to make the situation much worse, not better.
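<p>To make the points above concrete, here is a minimal sketch in Python, using only the standard <code>unicodedata</code> and <code>re</code> modules. The helper names (<code>general_category</code>, <code>words</code>) and the sample strings are my own, chosen just for illustration. It prints the general category Unicode assigns to each of the three look-alike characters, and shows what a naive \w+ tokenizer does with each convention and with a Chinese sentence. Other regex engines define \w differently, so treat the exact matches as specific to Python's <code>re</code>.

```python
import re
import unicodedata

# The three visually similar characters discussed above.
LOOKALIKES = {
    "ASCII apostrophe": "\u0027",
    "right single quotation mark": "\u2019",
    "modifier letter apostrophe": "\u02bc",
}


def general_category(ch):
    """Return the Unicode general category (e.g. 'Po', 'Pf', 'Lm') of one character."""
    return unicodedata.category(ch)


def words(text):
    """Naive tokenizer: every maximal run of characters that Python's re treats as \\w."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    for label, ch in LOOKALIKES.items():
        print(f"U+{ord(ch):04X} {label}: {general_category(ch)}")

    # The punctuation conventions split the contraction in two ...
    print(words("don't"))        # ASCII apostrophe
    print(words("don\u2019t"))   # right single quotation mark
    # ... while the modifier letter keeps it together.
    print(words("don\u02bct"))

    # Chinese has no spaces between words, so the whole sentence
    # (minus the final full stop) comes back as a single "word".
    print(words("我喜欢读书。"))
```

So the modifier letter does make \w+ behave for "don't", but only for text that already uses it, and the same regex still says nothing useful about where the words are in the Chinese sentence.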