Note that normalization involves reordering combining characters according to their canonical combining classes:<p><pre><code> > Array.from("\u{10FFff}\u0300\u0327".normalize('NFC')).map(x=>x.codePointAt().toString(16))
[ '10ffff', '327', '300' ]
</code></pre>
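For instance, here is what composition does when a composable accent is separated from its base by a mark of a lower combining class (a small Node sketch):

```javascript
// U+0327 COMBINING CEDILLA has canonical combining class 202 and
// U+0300 COMBINING GRAVE has class 230, so the cedilla does not
// "block" the grave: NFC composes the grave into the base even
// though another mark sits between them.
const nfc = "a\u0327\u0300".normalize("NFC");
console.log(Array.from(nfc).map(c => c.codePointAt(0).toString(16)));
// [ 'e0', '327' ]  -- precomposed à (U+00E0) followed by the bare cedilla
```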
If a precomposed character exists, the relevant accent will be composed into the base regardless of where it appears in the sequence (provided no intervening mark of an equal or higher combining class blocks it). Note also that normalization can change the visual length (see below) under some circumstances.
<p>The article is somewhat wrong when it says Unicode may "change character normalization rules": new combining characters may be added (which affects the class sort above), but the normalization stability policy guarantees that no new precomposed ones will be.
<p>---
<p>There's one important notion of "length" this doesn't cover: how wide is the string on the screen?
<p>For variable-width fonts this is of course very difficult. For monospace fonts, there are several steps toward the least-bad answer:
<p>* Zeroth, if you have reason to believe a later stage has a limit on the number of combining characters or will normalize, do the normalization yourself, provided that won't ruin your other concerns. (TODO - since there are some precomposed characters with multiple accents, can this actually make things worse?)
<p>* First, deal with whitespace. Do you collapse spaces? Which forms of line separator do you accept? How far apart are your tab stops?
<p>* Second, deal with any nonprintable/control/format characters (including spaces you don't recognize), e.g. by escaping them or replacing them with their printable form plus the "inverted" attribute.
<p>* Third, deal with any leading combining characters (meaning, immediately after a nonprintable or a line separator) by synthesizing an NBSP base (which is not a space), which has length 1. Likewise, synthesize missing Hangul fillers <i>anywhere</i> in the line.
<p>* Now, iterate through the codepoints, checking their EastAsianWidth (note that you can usually have one table combining this lookup with the earlier stages): -1 for a control character, 0 for a combining character (unless you're dealing with a system too dumb to strip them), 1 or 2 for normal characters.
<p>* Any codepoints that are Ambiguous or in one of the Private Use Areas should be counted <i>both</i> ways (you want to produce two separate counts). Any enclosing combining characters should be treated as ambiguous (unless the base was already wide). Likewise, for Korean Hangul LVT sequences you should produce a range of lengths, since in practice whether they combine depends on whether the font includes that exact sequence.
<p>* If you encounter any ZWJ sequences, regardless of whether they correspond to a known emoji, count them both ways (the minimum length is the largest of any single component; the maximum counts all components separately).
<p>* Flag characters (regional indicator pairs) are evil, since they violate Unicode's random-access rule. Count them both as if they would render separately and as if they would render as a single flag.
<p>* TODO: what about Ideographic Description Characters?
<p>* Finally, hard-code any exceptions you encounter in the wild; e.g. there are some Arabic codepoints that are really supposed to occupy more than 2 columns.
<p>For the purposes of layout, you should mostly work from the largest possible count. But if the smallest possible count differs, you need some sort of absolute positioning so you don't mess up the user's terminal.
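<p>The two-separate-counts idea can be sketched roughly like this. Caveat: the classify() ranges below are illustrative placeholders, not a real EastAsianWidth table, and ZWJ/flag sequences need sequence-level handling that a per-codepoint loop cannot express:

```javascript
// Toy [min, max] column-width accumulator. Real code would consult the
// full EastAsianWidth and combining-class data from the Unicode data
// files; classify() only covers a few illustrative ranges.
function classify(cp) {
  if (cp >= 0x0300 && cp <= 0x036F) return "combining"; // combining diacriticals
  if (cp >= 0xE000 && cp <= 0xF8FF) return "ambiguous"; // a Private Use Area
  if ((cp >= 0x1100 && cp <= 0x115F) ||                 // Hangul Jamo leads
      (cp >= 0x4E00 && cp <= 0x9FFF)) return "wide";    // CJK ideographs
  return "narrow";
}

// Returns the [min, max] possible column widths for a line.
function widthRange(line) {
  let min = 0, max = 0;
  for (const ch of line) {                       // iterates by codepoint
    switch (classify(ch.codePointAt(0))) {
      case "combining": break;                   // occupies no cell
      case "wide":      min += 2; max += 2; break;
      case "ambiguous": min += 1; max += 2; break; // count both ways
      default:          min += 1; max += 1; break;
    }
  }
  return [min, max];
}

console.log(widthRange("ab\u00E9"));      // [3, 3]
console.log(widthRange("a\uE000\u4E2D")); // [4, 5] -- PUA char counted both ways
```

When the two numbers disagree, that is exactly the case where the final paragraph's advice applies: lay out for the larger count, but reposition absolutely rather than trusting relative cursor movement.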