> How many spaces is a tab? GCC seems to say “8”, for some reason<p>On old mechanical typewriters without adjustable tab stops, the tab key (on models that had one) slid the carriage left 8 spaces. (The stops were on the carriage, not on the part you typed at, BTW; that arrangement skeuomorphically survives in Word’s tab-stop interface.) You could often push the carriage back a bit if you wanted. This carried over to TTYs, which were grossly electromechanical devices, and from there to glass teletypes and terminals.<p>So it’s the proper default.<p>Typewriters were pretty direct. Often they omitted the 1 and 0 digits (you just used l and O). Of course there was no VT or LF: you stretched out your right hand and turned the platen. For FF you gripped the paper and pulled. On some models, to backspace you just pushed the carriage. And to delete (called rubout on some old terminals) you painted or XXXed out the offending text, or used a pen. Even on legal documents (which is why they still both write out and use digits to specify numbers).
> Dare I say and ask: How many spaces is a tab?<p>I just checked (because I had a hazy memory of this doing something weird): tab characters printed raw to a PTY and looked at in a terminal emulator/tmux/etc., <i>aren't</i> any fixed number of spaces wide.<p>Rather, a tab, at least when rendered onto a PTY, advances the cursor to <i>the next tabstop</i> — i.e. to the next virtual column that is a multiple of 8.<p>Which has the dreadful implication that, to handle column prediction when there could ever be tabs <i>in the input</i>, you actually need to fully model the rendering behavior of a PTY.<p>And don't get me started on predicting the width of text printed raw containing ANSI escape codes!<p>The fact that libncurses works at all, while dealing with <i>all</i> of this, is a never-ending source of amazement to me.
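The tabstop rule itself is simple to model in isolation; a minimal sketch in Python (assuming the classic fixed tabstop of 8, and ignoring wide characters, escape codes, and line wrapping):

```python
def advance(col, text, tabstop=8):
    """Predict the cursor column after printing `text` raw, starting at 0-based `col`."""
    for ch in text:
        if ch == "\t":
            # A tab moves to the next multiple of the tabstop; it never stays put.
            col = (col // tabstop + 1) * tabstop
        else:
            col += 1  # simplification: every other character is one cell wide
    return col
```

So `advance(0, "\tx")` and `advance(7, "\tx")` both give 9: the width of a tab depends on where the cursor already is, which is exactly why you cannot treat it as a fixed number of spaces.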
> grapheme clusters [...] portable<p>If only. Boundary locations depend on the Unicode version. If your terminal uses one version and your compiler another: boom.<p><pre><code> * * *
</code></pre>
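There is no general fix for that version skew, but you can at least inspect which Unicode version a given runtime bundles. In Python, for example, the stdlib exposes it (shown purely to illustrate where the disagreement comes from):

```python
import unicodedata

# The Unicode Character Database version bundled with this Python build.
# A terminal, a shaping library, and a compiler may each bundle a different
# one, so grapheme boundaries computed by each can disagree.
print(unicodedata.unidata_version)
```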
In my C compiler I distinguish 'acol' and 'vcol'. The former stands for 'actual column', the latter for 'visual column'. The actual column is a byte offset, which can be used to identify the offending location in the source, while the visual column represents a physical offset in glyph widths.<p>The issue raised in TFA, of tab widths being different, can be resolved by making the compiler expand the tabs itself. This does nothing for e.g. emoji, if there is disagreement about the version of Unicode in use.<p>Vim seems to do something similar. Given 'set ruler', if I type a tab followed by an em dash, I am told that I am in column '5-10'. 5 is 3 bytes from the em dash (in UTF-8) plus 1 byte from the tab, plus 1. 10 is 8 glyph widths from the tab plus 1 glyph width from the em dash, plus 1.<p><pre><code> * * *
</code></pre>
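The acol/vcol distinction can be sketched as follows (a hypothetical Python rendition, not the compiler's actual code; it assumes a tabstop of 8 and treats every non-tab character as one glyph wide, so wide CJK glyphs and emoji are out of scope):

```python
def acol_vcol(prefix, tabstop=8):
    """Given the text before the cursor on a line, return (acol, vcol), both 1-based."""
    # acol: byte offset into the line's UTF-8 encoding, plus 1
    acol = len(prefix.encode("utf-8")) + 1
    # vcol: visual column, expanding tabs to the next tabstop
    vcol = 0
    for ch in prefix:
        if ch == "\t":
            vcol = (vcol // tabstop + 1) * tabstop
        else:
            vcol += 1  # simplification: one glyph width per character
    return acol, vcol + 1
```

For a tab followed by an em dash, `acol_vcol("\t\u2014")` returns `(5, 10)`, matching the Vim ruler example above.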
However, my approach to error handling should perhaps not be taken as representative. E.g.:<p>> But nobody looks at an error message and manually navigates to the location using the column information!<p>I do. And I also dislike rustc's error messages, which apparently receive universal acclaim.
Two things:<p>1) I recently implemented truncating long lines for a CLI tool and went with a hybrid approach using both graphemes and virtual columns -- I'd only truncate a line between graphemes, but when counting how much space was used up I would use virtual columns. In the case of something like a multi-code-point emoji, this means things tend to err on the "safe" side of truncating a line too early if the virtual column approach counts it as 4 columns wide rather than 2.<p>2) I wanted to test something with the scientist emoji and managed to crash Ruby's repl, irb, simply by pasting it into the repl and then backspacing over it. (It was clearly confused about the position of the cursor, and the stacktrace pointed to an error in a line_editor.rb file.) I was on Ruby 2.7.1, but it looks like it's been fixed in 3.0.0!
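A stripped-down version of that hybrid idea, in Python (a sketch, not the actual tool: it breaks between code points rather than true grapheme clusters, and uses a crude width model where East Asian Wide/Fullwidth characters count as 2 cells):

```python
import unicodedata

def cell_width(ch):
    # Crude width model: Wide and Fullwidth characters take two cells,
    # everything else one. Real terminals also special-case zero-width
    # joiners, combining marks, ambiguous-width characters, and more.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def truncate(line, max_cols):
    out, used = [], 0
    for ch in line:
        w = cell_width(ch)
        if used + w > max_cols:
            break  # stop before overflowing the column budget
        out.append(ch)
        used += w
    return "".join(out)
```

Overestimating a glyph's width only makes the truncation kick in early, which is the "safe" failure mode described above; underestimating it would overflow the line.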
> Most Chinese, Japanese or Korean characters are rendered twice as wide as most other characters, even in a monospace font.<p>Not just CJK characters, but also a lot of non-Latin characters and symbols (a canonical example being ↑). In the East Asian Width standard [1] they are classified as "ambiguous", which can mean half-width or full-width depending on the user's choice. (By the way, thank you very much for pointing this out; this is super non-obvious to non-CJK developers and consequently affects CJK developers!)<p>[1] <a href="https://unicode.org/reports/tr11/" rel="nofollow">https://unicode.org/reports/tr11/</a>
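Python's stdlib exposes these East Asian Width classifications directly, which makes the "ambiguous" class easy to see (just an illustration; the terminal still decides for itself how wide to render class A):

```python
import unicodedata

# East Asian Width classes per UAX #11: Na = narrow, W = wide, A = ambiguous
for ch in ("A", "\u6f22", "\u2191"):  # Latin A, a CJK ideograph, and the up arrow
    print(ch, unicodedata.east_asian_width(ch))
```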
If you're measuring text for terminal display, you might like my "widecharwidth" library. It tries to be what wcwidth should have been.<p><a href="http://github.com/ridiculousfish/widecharwidth" rel="nofollow">http://github.com/ridiculousfish/widecharwidth</a>
> Emojis easily combine many many code points into one symbol. It begins with flag emojis such as 🇪🇺, which is actually a special “E” followed by “U”, continues with emojis such as 🧑‍🔬 (scientist), which is 🧑 (person) glued together with 🔬 (microscope) using a special joiner code point, and ends at the absolute pinnacle of code point combinations - the family emoji 👨‍👩‍👧‍👦. How do you make a family? Easy, you take a person (with optional skin tone and gender modifier) and glue it with another person, as well as their children. That way you can easily end up with a single “character” consisting of ten or more code points!<p>Why do we create unnecessary complexities and then refuse to dismantle them when we start having problems with them? Are we so unable to admit mistakes?
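The combination schemes quoted above are easy to see from any language that exposes code points; a quick Python illustration (the scientist is person + zero-width joiner + microscope, the EU flag is two regional-indicator letters, and the four-person family chains three joiners):

```python
scientist = "\U0001F9D1\u200D\U0001F52C"  # person + ZWJ + microscope
eu_flag = "\U0001F1EA\U0001F1FA"          # regional indicators E + U
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # man+ZWJ+woman+ZWJ+girl+ZWJ+boy

# One visible "character" each, but many code points:
print(len(scientist), len(eu_flag), len(family))  # 3 2 7
```

And since skin-tone and gender modifiers add further code points per person, the "ten or more" figure for a family is easy to reach.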