Tech Echo
Charset="WTF-8"

282 points, by edent, 6 months ago

32 comments

kgeist, 6 months ago
My rule of thumb is to treat strings as opaque blobs most of the time. The only validation I'd always enforce is some sane length limit, to prevent users from shoving entire novels inside. If you treat your strings as opaque blobs, and use UTF-8, most internationalization problems go away. IMHO, input validation is often an attempt to solve a problem from the wrong side. Say, when XSS or SQL injections are found on a site, I've seen people's first reaction be to validate user input by looking for "special symbols", or to add a whitelist of allowed characters, instead of simply escaping strings right before rendering HTML (modern frameworks do it automatically), or using parameterized queries if it's SQL. If a user wants to call themselves "alert('hello')", why not? Why the arbitrary limits? I think there are very few exceptions to this, probably something law-related, or if you have to interact with some legacy service.
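The approach described above (accept the string as-is, cap its length, escape only at the point of output, and bind values as query parameters) might look roughly like the following Python sketch. The function names, the table layout, and the 200-character limit are assumptions made for illustration; they do not come from the comment.

```python
import html
import sqlite3

# Hypothetical limit; the comment only asks for "some sane length limit".
MAX_NAME_LENGTH = 200


def accept_name(name: str) -> bool:
    """Treat the name as an opaque blob: only enforce a length bound."""
    return 0 < len(name) <= MAX_NAME_LENGTH


def render_greeting(name: str) -> str:
    """Escape at output time, right before the value is placed into HTML."""
    return "<p>Hello, {}!</p>".format(html.escape(name))


def save_and_load(name: str) -> str:
    """Store and read back a name using a parameterized query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (name TEXT)")
        # The "?" placeholder keeps the value out of the SQL text entirely.
        conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        (stored,) = conn.execute("SELECT name FROM users").fetchone()
        return stored
    finally:
        conn.close()
```

With this shape, a name such as `Robert'); DROP TABLE users;--` is stored and returned unchanged, and markup in a name is rendered as visible text rather than executed.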
jtvjan, 6 months ago
A coworker once implemented a name validation regex that would reject his own name. It still mystifies me how much convincing it took to get him to make it less strict.
poizan42, 6 months ago
I have an 'æ' in my middle name (formally secondary first name because history reasons). Usually I just don't use it, but it's always funny when a payment form instructs me to write my full name exactly as written on my credit card, and then goes on to tell me my name is invalid.
powersnail, 6 months ago
As someone who really thinks the name field should just be one field accepting any printable Unicode characters, I do wonder what the hell I would need to do if I take customer names in this form, and then my system has to interact with some other service that requires a first/last name split and/or [a-zA-Z] validation, like a bank or postal service.

Automatic transliteration seems to be very dangerous (wrong name on bank accounts, for instance), and not always feasible (some Unicode characters have more than one way of being transliterated).

Should we apologize to the user, and just ask the user twice, once correctly, and once for the bad computer systems? This seems to be the only approach that both respects their spelling and avoids creating potential conflicts with other systems.
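To make the transliteration concern concrete, here is a small Python sketch of the naive "strip the accents" approach. The helper names are hypothetical. It handles letters that decompose into a base letter plus a combining mark, but it has no answer for letters such as "Ł" or "æ", which is one reason asking the user for a second, ASCII-only spelling may be the safer route.

```python
import unicodedata
from typing import Optional


def strip_diacritics(name: str) -> str:
    """Decompose (NFD) and drop combining marks. Lossy by design."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_ascii_or_none(name: str) -> Optional[str]:
    """Return an ASCII form if stripping marks is enough, else None.

    None signals "do not guess": ask the user how they spell their
    name for the legacy system instead of inventing a spelling.
    """
    stripped = strip_diacritics(name)
    return stripped if stripped.isascii() else None
```

In this sketch `to_ascii_or_none("Stępień")` yields `"Stepien"`, while `"Łukasz"` and `"Kjærgaard"` yield `None`, because "Ł" and "æ" have no canonical decomposition. Whether "Stepien" is the spelling the person actually uses elsewhere is still something only they can confirm.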
gavinsyancey, 6 months ago
WTF-8 is actually a real encoding, used for encoding invalid UTF-16 unpaired surrogates for UTF-8 systems: https://simonsapin.github.io/wtf-8/
wruza, 6 months ago
I'll say it again: this is the consequence of Unicode trying to be a mix of HTML and docx instead of a charset. It went too far for an average Joe DevGuy to understand how to deal with it, so he just selects a subset he can handle and bans everything else. HN does that too: special symbols simply get removed.

Unicode screwed itself up completely. We wanted a common charset for things like latin, extlatin, cjk, cyrillic, hebrew, etc. And we got it, for a while. Shortly after, it focused on becoming a complex file format with colorful icons and invisible symbols, which is not manageable without cutting out all that bs by force.
imrejonk, 6 months ago
A system not supporting non-latin characters in personal names is pitiful, but a system telling the user that they have an invalid name is outright insulting.
KPGv2, 6 months ago
It seems ridiculous to apply form validation to a name, given the complexity of charsets involved. I don't even validate email addresses. I remember [this](https://www.netmeister.org/blog/email.html) wonderful explainer of why your email validation regex is wrong.
RadiozRadioz, 6 months ago
I've got a good feel now for which forms will accept my name and which won't, though mostly I default to an ASCII version for safety. Similarly, I've found a way to mangle my address to fit a US house/state/city/zip format.

I don't feel unwelcome; I empathize with the developers. I'd certainly hate to figure out address entry for all countries. At least the US format is consistent across websites, so I can have a high degree of confidence that it'll work in the software, and my local postal service knows what to do because they see it all the time.
hedora, 6 months ago
I’d expect iCloud to accept that name, even though Rachel True’s name breaks the heck out of it:

https://www.reddit.com/r/ProgrammerHumor/comments/lz27ou/she_has_true_as_her_last_name_and_that_breaks/
Hackbraten, 6 months ago
Situations like these regularly make me feel ashamed about being a software developer.
Diggsey, 6 months ago
I thought this was https://simonsapin.github.io/wtf-8/
rmrfchik, 6 months ago
Well, the labels of the input fields are written in English, yet the user enters his name in his native language.

What's the reason for having a name at all? You can call the person by this name. But if I write you my name in my language, what can you (not knowing how to read it) do? Only "hey, still-don't-know-you, here is your info".

In my foreign passport I have my name __transliterated__ to the Latin alphabet. Shouldn't this be the case for other places?
ginko, 6 months ago
Under GDPR you have the legal right for your name to be stored and processed with the correct spelling in the EU.

https://gdprhub.eu/index.php?title=Court_of_Appeal_of_Brussels_-_2019/AR/1006
josephcsible, 6 months ago
What would be wrong with "enter your name as it appears in the machine-readable zone of your passport" (or "would appear" for people who have never gotten one)? Isn't that the one standard format for names that actually is universal?
wvh, 6 months ago
There's little more you can do to validate a name internationally than to provide one textbox and check that it's a valid encoding of Unicode. Maybe you can exclude some control and graphical ranges at best.

Of course there are valid concerns that international names should pass through e.g. local postal services, which would require at least some kind of Latinized representation of name and address. I suppose the Latin alphabet is the most convenient lowest common denominator across writing systems, even though I admit to being Euro-centric.
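A minimal Python sketch of that idea: check that the bytes are well-formed UTF-8, then reject only a few character categories. The set of rejected categories and the length limit are assumptions chosen for illustration; the comment does not say which ranges to exclude.

```python
import unicodedata
from typing import Optional

# Assumed policy: reject control characters (Cc), surrogates (Cs),
# private-use characters (Co) and unassigned code points (Cn).
# Format characters (Cf) are left alone on purpose, because some scripts
# rely on joiners such as ZWNJ. Note that "Cn" depends on the Unicode
# version bundled with the running Python.
REJECTED_CATEGORIES = {"Cc", "Cs", "Co", "Cn"}


def decode_name(raw: bytes) -> Optional[str]:
    """Strictly decode UTF-8; return None when the bytes are ill-formed."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def is_acceptable_name(raw: bytes, max_length: int = 200) -> bool:
    """Accept any well-formed, non-empty, bounded-length name."""
    text = decode_name(raw)
    if text is None:
        return False
    text = text.strip()
    if not text or len(text) > max_length:
        return False
    return all(
        unicodedata.category(ch) not in REJECTED_CATEGORIES for ch in text
    )
```

Everything beyond this, such as producing a Latinized form for a postal service, is a separate step that this check does not attempt.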
account42, 6 months ago
1) The "real" WTF-8 charset [0] is a cool and useful encoding.

2) This may be an unpopular opinion, but I think restricting name input to Latin is an OK thing to do, especially if the entered names are being viewed/used by humans who cannot be expected to know all scripts.

3) Similarly, internationalized domain names were a mistake. If your business card tells me to go to stępień.com then chances are I won't bother to try and remember how to enter those accents on my keyboard layout. Most users won't even be able to enter them. Worse are letters that are visually indistinguishable, and no, registries preventing confusable names is not enough when I still won't know which letter to enter. This makes IDN domains less useful while retaining all the security issues they bring.

Most languages were already forced to deal with ASCII and have developed standardized ways to "romanize" names and other words to that character set. This solution achieves peak interoperability, not only between computer systems but also between the fleshy components operating them.

[0] https://simonsapin.github.io/wtf-8/
surfingdino, 6 months ago
I lost count of the projects where this was an issue. US and Western European-born devs are oblivious to this problem and it ends up catching them over and over again.
cabirum, 6 months ago
How do I allow "stępień" while detecting Zalgo-isms?
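One possible heuristic, sketched in Python: normalize to NFC so that ordinary accented letters become single precomposed code points, then flag text in which too many combining marks pile up on one base character. The threshold of two marks is an assumption; some legitimate text stacks several marks, so the right number is a trade-off rather than a known constant.

```python
import unicodedata

# Hypothetical threshold: how many combining marks may follow one base
# character after NFC normalization before the text is flagged.
MAX_MARKS_PER_BASE = 2


def looks_like_zalgo(text: str, max_marks: int = MAX_MARKS_PER_BASE) -> bool:
    """Return True if any run of combining marks is longer than max_marks."""
    run = 0
    for ch in unicodedata.normalize("NFC", text):
        # Mn = nonspacing mark, Me = enclosing mark.
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                return True
        else:
            run = 0
    return False
```

Under NFC, "stępień" contains no combining marks at all, whether it was typed precomposed or decomposed, so it passes, while a base letter followed by five stacked marks is flagged.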
评论 #42228250 未加载
评论 #42228647 未加载
评论 #42228284 未加载
评论 #42230633 未加载
评论 #42228254 未加载
评论 #42230745 未加载
card_zero, 6 months ago
Pfft, "Dein Name ist ungültig" (your name is invalid). Let's get straight to the point, it's the user's fault for having a bad name, user needs to fix this.
bawolff, 6 months ago
It's really not that hard though. PCRE regexes support Unicode letter classes. There is really no excuse for this type of issue.
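As a rough illustration in Python: the standard `re` module does not support PCRE's `\p{L}` syntax, so this sketch substitutes `[^\W\d_]`, which is only an approximation of "any Unicode letter". The allowed separators (space, apostrophe, hyphen) and the function name are assumptions for the example.

```python
import re
import unicodedata

# [^\W\d_] means "a word character that is neither a decimal digit nor an
# underscore". For str patterns this approximates a Unicode letter class;
# it is not an exact equivalent of PCRE's \p{L}.
LETTER = r"[^\W\d_]"

# Assumed shape: runs of letters joined by a space, apostrophe or hyphen.
NAME_PATTERN = re.compile(rf"{LETTER}+(?:[ '\-]{LETTER}+)*")


def looks_like_name(text: str) -> bool:
    """Match letter runs after NFC normalization."""
    normalized = unicodedata.normalize("NFC", text.strip())
    return NAME_PATTERN.fullmatch(normalized) is not None
```

This accepts "Stępień", "Jean-Luc Picard" and "O'Brien" and rejects "R2-D2". It would still reject names in scripts that need combining marks which NFC cannot fold into a single letter, so a letter class alone is a starting point rather than a complete answer.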
rawbert, 6 months ago
OMG, the second screenshot might actually be the application I am working on right now ...
Pesthuf, 6 months ago
I totally get that companies are probably more successful using simple validation rules that work for the vast majority of names, rather than accepting everything just so that some person with no name, or someone whose name cannot possibly be expressed or at least transliterated to Unicode, can use their services.

But that person's name has no business failing validation. They fucked up.
jccalhoun, 6 months ago
My first name is hyphenated. I still find forms that reject it. My favorite was one that said "invalid first name."
stop_nazi, 6 months ago
grzegorz brzęczyszczykiewicz
dcow, 6 months ago
Apropos: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
dathinab, 6 months ago
Fun fact: there is a semi-standard encoding called WTF-8, which is UTF-8 extended so that it can represent ill-formed UTF-16 (unpaired surrogate code points).

It's used in situations like when a UTF-8 based system has to interact with Windows file paths.
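The core trick can be approximated in Python with the `surrogatepass` error handler, as in the sketch below. This is not a conforming WTF-8 implementation: in the reverse direction, Python will also accept a surrogate pair encoded as two 3-byte sequences, which the WTF-8 specification forbids. The function names are made up for the example.

```python
def utf16le_to_wtf8ish(raw: bytes) -> bytes:
    """Convert possibly ill-formed UTF-16LE bytes to WTF-8-style bytes.

    Proper surrogate pairs decode to one supplementary code point and are
    written as ordinary 4-byte UTF-8. A lone surrogate survives as a
    3-byte sequence that strict UTF-8 would reject.
    """
    text = raw.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8ish_to_utf16le(raw: bytes) -> bytes:
    """Inverse direction. More lenient than the WTF-8 spec (see above)."""
    text = raw.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")
```

For example, the lone high surrogate U+D800 (bytes `00 D8` in UTF-16LE) becomes `ED A0 80`, and a strict `bytes.decode("utf-8")` on that result raises `UnicodeDecodeError`, which is why a path containing it needs an encoding like WTF-8 to round-trip.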
rurban, 6 months ago
Just use the Unicode identifier rules, as in my libu8ident: https://github.com/rurban/libu8ident

Windows folks need to convert to UTF-8 first.
egorfine, 6 months ago
Mandatory mention: https://github.com/minimaxir/big-list-of-naughty-strings
ljouhet, 6 months ago
Yes, all these forms should handle existing names...

but the author's own website doesn't (url: xn--stpie-k0a81a.com, bottom of the page: "© 2024 ę ń. All rights reserved.")
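For context, `xn--stpie-k0a81a.com` is the ASCII (Punycode) form of the internationalized domain name, not a mangled spelling. A short Python sketch using the standard library's `idna` codec, which implements the older IDNA 2003 rules, shows the mapping in both directions. The helper names are hypothetical.

```python
def to_ascii_hostname(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII-compatible form."""
    return hostname.encode("idna").decode("ascii")


def to_unicode_hostname(hostname: str) -> str:
    """Convert an ASCII-compatible hostname back to Unicode."""
    return hostname.encode("ascii").decode("idna")


print(to_ascii_hostname("stępień.com"))             # xn--stpie-k0a81a.com
print(to_unicode_hostname("xn--stpie-k0a81a.com"))  # stępień.com
```

The page footer that drops the plain ASCII letters and keeps only "ę ń" is a separate rendering problem that this conversion does not explain.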
mdavid626, 6 months ago
It's like with phone numbers. Some people assume they contain only digits.
xyst, 6 months ago
Software has been gaslighting generations of people around the world.

Side note: not a bad way to skirt surveillance though.

A name like “stępień” will without a doubt have many ambiguous spellings across different intelligence gathering systems (RUMINT, OSINT, …). Americans will probably spell it as “Stefen” or “Steven” or “Stephen”, especially once communicated over phone.