Note that normalization involves reordering combining characters according to their canonical combining classes:<p><pre><code> > Array.from("\u{10FFff}\u0300\u0327".normalize('NFC')).map(x=>x.codePointAt().toString(16))
[ '10ffff', '327', '300' ]
</code></pre>
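For instance, here is what composition does when a composable accent is separated from its base by a mark of a lower combining class (a small Node sketch):

```javascript
// U+0327 COMBINING CEDILLA has canonical combining class 202 and
// U+0300 COMBINING GRAVE has class 230, so the cedilla does not
// "block" the grave: NFC composes the grave into the base even
// though another mark sits between them.
const nfc = "a\u0327\u0300".normalize("NFC");
console.log(Array.from(nfc).map(c => c.codePointAt(0).toString(16)));
// [ 'e0', '327' ]  -- precomposed à (U+00E0) followed by the bare cedilla
```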
If a precomposed character exists, the relevant accent will be composed into the base regardless of where it appears in the sequence (provided no intervening mark of an equal or higher combining class blocks it). Note also that normalization can change the visual length (see below) under some circumstances.
<p>The article is somewhat wrong when it says Unicode may "change character normalization rules": new combining characters may be added (which affects the class sort above), but the normalization stability policy guarantees that no new precomposed ones will be.
<p>---
<p>There's one important notion of "length" this doesn't cover: how wide is the string on the screen?
<p>For variable-width fonts this is of course very difficult. For monospace fonts, there are several steps toward the least-bad answer:
<p>* Zeroth, if you have reason to believe a later stage has a limit on the number of combining characters or will normalize, do the normalization yourself, provided that won't ruin your other concerns. (TODO - since there are some precomposed characters with multiple accents, can this actually make things worse?)
<p>* First, deal with whitespace. Do you collapse spaces? Which forms of line separator do you accept? How far apart are your tab stops?
<p>* Second, deal with any nonprintable/control/format characters (including spaces you don't recognize), e.g. by escaping them or replacing them with their printable form plus the "inverted" attribute.
<p>* Third, deal with any leading combining characters (meaning, immediately after a nonprintable or a line separator) by synthesizing an NBSP base (which is not a space), which has length 1. Likewise, synthesize missing Hangul fillers <i>anywhere</i> in the line.
<p>* Now, iterate through the codepoints, checking their EastAsianWidth (note that you can usually have one table combining this lookup with the earlier stages): -1 for a control character, 0 for a combining character (unless you're dealing with a system too dumb to strip them), 1 or 2 for normal characters.
<p>* Any codepoints that are Ambiguous or in one of the Private Use Areas should be counted <i>both</i> ways (you want to produce two separate counts). Any enclosing combining characters should be treated as ambiguous (unless the base was already wide). Likewise, for Korean Hangul LVT sequences you should produce a range of lengths, since in practice whether they combine depends on whether the font includes that exact sequence.
<p>* If you encounter any ZWJ sequences, regardless of whether they correspond to a known emoji, count them both ways (the minimum length is the largest of any single component; the maximum counts all components separately).
<p>* Flag characters (regional indicator pairs) are evil, since they violate Unicode's random-access rule. Count them both as if they would render separately and as if they would render as a single flag.
<p>* TODO: what about Ideographic Description Characters?
<p>* Finally, hard-code any exceptions you encounter in the wild; e.g. there are some Arabic codepoints that are really supposed to occupy more than 2 columns.
<p>For the purposes of layout, you should mostly work from the largest possible count. But if the smallest possible count differs, you need some sort of absolute positioning so you don't mess up the user's terminal.
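<p>The two-separate-counts idea can be sketched roughly like this. Caveat: the classify() ranges below are illustrative placeholders, not a real EastAsianWidth table, and ZWJ/flag sequences need sequence-level handling that a per-codepoint loop cannot express:

```javascript
// Toy [min, max] column-width accumulator. Real code would consult the
// full EastAsianWidth and combining-class data from the Unicode data
// files; classify() only covers a few illustrative ranges.
function classify(cp) {
  if (cp >= 0x0300 && cp <= 0x036F) return "combining"; // combining diacriticals
  if (cp >= 0xE000 && cp <= 0xF8FF) return "ambiguous"; // a Private Use Area
  if ((cp >= 0x1100 && cp <= 0x115F) ||                 // Hangul Jamo leads
      (cp >= 0x4E00 && cp <= 0x9FFF)) return "wide";    // CJK ideographs
  return "narrow";
}

// Returns the [min, max] possible column widths for a line.
function widthRange(line) {
  let min = 0, max = 0;
  for (const ch of line) {                       // iterates by codepoint
    switch (classify(ch.codePointAt(0))) {
      case "combining": break;                   // occupies no cell
      case "wide":      min += 2; max += 2; break;
      case "ambiguous": min += 1; max += 2; break; // count both ways
      default:          min += 1; max += 1; break;
    }
  }
  return [min, max];
}

console.log(widthRange("ab\u00E9"));      // [3, 3]
console.log(widthRange("a\uE000\u4E2D")); // [4, 5] -- PUA char counted both ways
```

When the two numbers disagree, that is exactly the case where the final paragraph's advice applies: lay out for the larger count, but reposition absolutely rather than trusting relative cursor movement.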