Charset="WTF-8"

282 points by edent 6 months ago

32 comments

kgeist 6 months ago

My rule of thumb is to treat strings as opaque blobs most of the time. The only validation I'd always enforce is some sane length limit, to prevent users from shoving entire novels inside. If you treat your strings as opaque blobs, and use UTF-8, most internationalization problems go away. Imho, input validation is often an attempt to solve a problem from the wrong side. Say, when XSS or SQL injections are found on a site, I've seen people's first reaction be to validate user input by looking for "special symbols", or to add a whitelist of allowed characters, instead of simply escaping strings right before rendering HTML (and modern frameworks do it automatically), or using parameterized queries if it's SQL. If a user wants to call themselves "alert('hello')", why not? Why the arbitrary limits? I think there are very few exceptions to this, probably something law-related, or if you have to interact with some legacy service.
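A minimal sketch of that approach in Python, using only the standard library. The `users` table, the 200-character limit, and the function names are assumptions for illustration, not anything from the comment: the name is stored untouched through a parameterized query and escaped only at the moment HTML is built.

```python
import html
import sqlite3

MAX_NAME_LENGTH = 200  # hypothetical limit; pick whatever is sane for your app


def store_name(conn: sqlite3.Connection, name: str) -> None:
    """Store the name unchanged; the only validation is a length limit."""
    if not name or len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"name must be 1 to {MAX_NAME_LENGTH} characters")
    # Parameterized query: the value travels separately from the SQL text,
    # so quotes and semicolons in the name are just data.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def render_greeting(name: str) -> str:
    """Escape at the point of output, right before building HTML."""
    return f"<p>Hello, {html.escape(name)}!</p>"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

raw_names = [
    "stępień",
    "<script>alert('hello')</script>",
    "Robert'); DROP TABLE users;--",
]
for raw in raw_names:
    store_name(conn, raw)

# The names come back byte-for-byte as entered.
stored = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY rowid")]
pages = [render_greeting(name) for name in stored]
```

The stored values are identical to the input; only the rendered HTML differs, which is the point of escaping at output time instead of filtering at input time.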

jtvjan 6 months ago

A coworker once implemented a name validation regex that would reject his own name. It still mystifies me how much convincing it took to get him to make it less strict.

poizan42 6 months ago

I have an 'æ' in my middle name (formally secondary first name because history reasons). Usually I just don't use it, but it's always funny when a payment form instructs me to write my full name exactly as written on my credit card, and then goes on to tell me my name is invalid.

powersnail 6 months ago

As someone who really thinks the name field should just be one field accepting any printable Unicode characters, I do wonder what the hell I would need to do if I take customer names in this form, and then my system has to interact with some other service that requires a first/last name split, and/or [a-zA-Z] validation, like a bank or postal service.

Automatic transliteration seems to be very dangerous (wrong name on bank accounts, for instance), and not always feasible (some Unicode characters have more than one way of being transliterated).

Should we apologize to the user, and just ask the user twice, once correctly, and once for the bad computer systems? This seems to be the only approach that both respects their spelling and avoids creating potential conflicts with other systems.

gavinsyancey 6 months ago

WTF-8 is actually a real encoding, used for encoding invalid UTF-16 unpaired surrogates for UTF-8 systems: https://simonsapin.github.io/wtf-8/

wruza 6 months ago

I'll say it again: this is the consequence of Unicode trying to be a mix of HTML and docx, instead of a charset. It went too far for an average Joe DevGuy to understand how to deal with it, so he just selects a subset he can handle and bans everything else. HN does that too: special symbols simply get removed.

Unicode screwed itself up completely. We wanted a common charset for things like Latin, extended Latin, CJK, Cyrillic, Hebrew, etc. And we got it, for a while. Shortly after, it focused on becoming a complex file format with colorful icons and invisible symbols, which is not manageable without cutting out all that bs by force.

imrejonk 6 months ago

A system not supporting non-latin characters in personal names is pitiful, but a system telling the user that they have an invalid name is outright insulting.

KPGv2 6 months ago

It seems ridiculous to apply form validation to a name, given the complexity of charsets involved. I don't even validate email addresses. I remember [this](https://www.netmeister.org/blog/email.html) wonderful explainer of why your email validation regex is wrong.

RadiozRadioz 6 months ago

I've got a good feel now for which forms will accept my name and which won't, though mostly I default to an ASCII version for safety. Similarly, I've found a way to mangle my address to fit a US house/state/city/zip format.

I don't feel unwelcome; I empathize with the developers. I'd certainly hate to figure out address entry for all countries. At least the US format is consistent across websites and I can have a high degree of confidence that it'll work in the software, and my local postal service knows what to do because they see it all the time.

hedora 6 months ago

I’d expect iCloud to accept that name, even though Rachel True’s name breaks the heck out of it: https://www.reddit.com/r/ProgrammerHumor/comments/lz27ou/she_has_true_as_her_last_name_and_that_breaks/

Hackbraten 6 months ago

Situations like these regularly make me feel ashamed about being a software developer.

Diggsey 6 months ago

I thought this was https://simonsapin.github.io/wtf-8/

rmrfchik 6 months ago

Well, the labels of the input fields are written in English, yet the user enters his name in his native language.

What's the reason for having a name at all? So you can call the person by this name. But if I write you my name in my language, what can you (not knowing how to read it) do? Only "hey, still-don't-know-you, here is your info".

In my foreign passport I have my name __transliterated__ to the Latin alphabet. Shouldn't this be the case for other places?

ginko 6 months ago

Under GDPR you have the legal right for your name to be stored and processed with the correct spelling in the EU.

https://gdprhub.eu/index.php?title=Court_of_Appeal_of_Brussels_-_2019/AR/1006

josephcsible 6 months ago

What would be wrong with "enter your name as it appears in the machine-readable zone of your passport" (or "would appear" for people who have never gotten one)? Isn't that the one standard format for names that actually is universal?

wvh 6 months ago

There's little more you can do to validate a name internationally than to provide one textbox and check if it's a valid encoding of Unicode. Maybe you can exclude some control and graphical ranges at best.

Of course there are valid concerns that international names should pass through e.g. local postal services, which would require at least some kind of Latinized representation of name and address. I suppose the Latin alphabet is the most convenient lowest common denominator across writing systems, even though I admit to being Euro-centric.
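As a rough illustration of "valid encoding, minus some control ranges", here is a Python sketch. The function name and the choice of rejected general categories (control, private use, unassigned) are assumptions; format characters such as zero-width joiners are deliberately left alone because some scripts need them.

```python
import unicodedata

# Assumed policy: reject control (Cc), private-use (Co) and unassigned (Cn)
# code points. Everything else, including any script, is accepted.
REJECTED_CATEGORIES = {"Cc", "Co", "Cn"}


def decode_name(raw: bytes) -> str:
    """Return the name as text if it is well-formed UTF-8 without control characters."""
    try:
        text = raw.decode("utf-8")  # strict by default: rejects malformed bytes
    except UnicodeDecodeError as exc:
        raise ValueError("name is not valid UTF-8") from exc
    for ch in text:
        if unicodedata.category(ch) in REJECTED_CATEGORIES:
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")
    return text
```

With this policy `decode_name("stępień".encode("utf-8"))` passes, while a truncated byte sequence or an embedded NUL raises `ValueError`.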

account42 6 months ago

1) The "real" WTF-8 charset [0] is a cool and useful encoding.

2) This may be an unpopular opinion but I think restricting name input to Latin is an OK thing to do, especially if the entered names are being viewed/used by humans who cannot be expected to know all scripts.

3) Similarly, internationalized domain names were a mistake. If your business card tells me to go to stępień.com then chances are I won't bother to try and remember how to enter those accents on my keyboard layout. Most users won't even be able to enter them. Worse are letters that are visually indistinguishable, and no, registries preventing confusable names is not enough when I still won't know which letter to enter. This makes IDN domains less useful while retaining all the security issues they bring.

Most languages were already forced to deal with ASCII and have developed standardized ways to "romanize" names and other words to that character set. This solution achieves peak interoperability, not only between computer systems but also between the fleshy components operating them.

[0] https://simonsapin.github.io/wtf-8/

surfingdino 6 months ago

I lost count of the projects where this was an issue. US and Western European-born devs are oblivious to this problem and it ends up catching them over and over again.

cabirum 6 months ago

How do I allow "stępień" while detecting Zalgo-isms?
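One possible heuristic, sketched in Python: normalize to NFC and count how many combining marks are stacked in a row. Names like "stępień" collapse to precomposed letters with no leftover marks, while Zalgo text keeps long runs. The threshold of 3 is an assumption, and scripts that legitimately stack several marks would need a higher limit or a different rule.

```python
import unicodedata

MAX_MARKS_PER_BASE = 3  # assumed threshold, not a standard value


def longest_mark_run(text: str) -> int:
    """Length of the longest run of consecutive combining marks after NFC."""
    normalized = unicodedata.normalize("NFC", text)
    longest = current = 0
    for ch in normalized:
        # Mn = nonspacing mark, Me = enclosing mark
        if unicodedata.category(ch) in ("Mn", "Me"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def looks_like_zalgo(text: str) -> bool:
    return longest_mark_run(text) > MAX_MARKS_PER_BASE
```

Both the precomposed and the decomposed spelling of "stępień" pass, because NFC folds each base letter and its single accent into one code point before counting.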

card_zero 6 months ago

Pfft, "Dein Name ist ungültig" (your name is invalid). Let's get straight to the point, it's the user's fault for having a bad name, user needs to fix this.

bawolff 6 months ago

It's really not that hard though. PCRE regexes support Unicode letter classes. There is really no excuse for this type of issue.
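In PCRE the letter class is written `\p{L}`. Python's built-in `re` module does not support `\p{...}`, so the sketch below gets the same effect from `unicodedata` general categories. The function name and the small set of allowed punctuation are assumptions, and this is stricter than treating names as opaque strings (it rejects digits, for example).

```python
import unicodedata

# Assumed extras commonly found in names; adjust to taste.
ALLOWED_PUNCTUATION = {" ", "-", "'", "\u2019", "."}


def is_plausible_name(name: str) -> bool:
    """Accept letters (L*) and combining marks (M*) from any script, plus a few separators."""
    if not name.strip():
        return False
    for ch in name:
        major_class = unicodedata.category(ch)[0]
        if major_class in ("L", "M") or ch in ALLOWED_PUNCTUATION:
            continue
        return False
    return True
```

This accepts "stępień", "Jean-Luc O'Brien" and "李小龍" alike, since all of their characters fall in a letter category regardless of script.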

rawbert 6 months ago

OMG, the second screenshot might actually be the application I am working on right now ...

Pesthuf 6 months ago

I totally get that companies are probably more successful using simple validation rules that work for the vast majority of names, rather than accepting everything just so that some person with no name, or someone whose name cannot possibly be expressed or at least transliterated to Unicode, can use their services.

But that person's name has no business failing validation. They fucked up.

jccalhoun 6 months ago

My first name is hyphenated. I still find forms that reject it. My favorite was one that said "invalid first name."

stop_nazi 6 months ago

grzegorz brzęczyszczykiewicz

dcow 6 months ago

Apropos: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/

dathinab 6 months ago

Fun fact: there is a semi-standard encoding called WTF-8, which is UTF-8 extended so that it can represent ill-formed UTF-16 (unpaired surrogate code points).

It's used in situations like when a UTF-8 based system has to interact with Windows file paths.
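A small Python sketch of the idea, using the standard `surrogatepass` error handler. The function names are made up for illustration, and this only approximates WTF-8: a conforming WTF-8 decoder must also reject a surrogate pair encoded as two separate 3-byte sequences, which `surrogatepass` does not enforce.

```python
def utf16le_to_wtf8(raw: bytes) -> bytes:
    """Convert possibly ill-formed UTF-16-LE (e.g. a Windows path) to WTF-8-style bytes."""
    # Valid surrogate pairs become one supplementary code point;
    # lone surrogates are passed through instead of raising.
    text = raw.decode("utf-16-le", "surrogatepass")
    # Lone surrogates are written as 3-byte sequences that strict UTF-8 forbids.
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_utf16le(raw: bytes) -> bytes:
    """Reverse direction, restoring the original 16-bit code units."""
    text = raw.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


# "a", an unpaired high surrogate U+D800, "b" as UTF-16-LE code units.
ill_formed = b"a\x00\x00\xd8b\x00"
wtf8 = utf16le_to_wtf8(ill_formed)

# Well-formed input produces ordinary UTF-8.
well_formed = "a\U0001F600".encode("utf-16-le")
plain_utf8 = utf16le_to_wtf8(well_formed)
```

For well-formed input the result is byte-identical to UTF-8, which is why the encoding works as a superset: only the otherwise unrepresentable lone surrogates get the extra 3-byte form.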

rurban 6 months ago

Just use the Unicode identifier rules, my libu8ident. https://github.com/rurban/libu8ident

Windows folks need to convert to UTF-8 first.

egorfine 6 months ago

Mandatory mention: https://github.com/minimaxir/big-list-of-naughty-strings

ljouhet 6 months ago

mdavid626 6 months ago

It's like with phone numbers. Some people assume they contain only digits.

xyst 6 months ago

Software has been gaslighting generations of people around the world.

Side note: not a bad way to skirt surveillance though.

A name like “stępień” will without a doubt have many ambiguous spellings across different intelligence gathering systems (RUMINT, OSINT, …). Americans will probably spell it as “Stefen” or “Steven” or “Stephen”, especially once communicated over the phone.
