The Turkish İ Problem and Why You Should Care (2012)

112 points by Rygian, 12 days ago

18 comments

German has a similar problem with 'ß'. Unicode has a corresponding capital 'ẞ', and Germany has since officially adopted it as an alternative, but in Unicode's SpecialCasing.txt the uppercase of 'ß' is still 'SS'. Since the lowercase of 'S' is of course 's', there is no going back after converting to upper case. The lowercase of 'ẞ', however, is still 'ß'. So by alternating case you end up with ß→SS→ss or ẞ→ß→SS. This certainly has the potential to break naive attempts at case-insensitive comparison via case conversion. Then again, Unicode adopting 'ẞ' as the uppercase of 'ß' in some future version would probably only increase that potential. I'm interested to hear from people dealing with a lot of German text how much of a problem this is in practice.
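
The round trips described above are easy to reproduce. A minimal sketch in Python, whose `str.upper`, `str.lower` and `str.casefold` apply Unicode's full case mappings regardless of locale; other runtimes may use simple one-to-one mappings and behave differently. The helper name `same_ignoring_case` is made up for illustration.

```python
# Case round trips for the two sharp-s characters.
SMALL_SHARP_S = "\u00df"    # ß
CAPITAL_SHARP_S = "\u1e9e"  # ẞ

# ß -> SS -> ss: uppercasing expands to two letters, and the way back is lost.
upper_of_small = SMALL_SHARP_S.upper()
round_trip_small = upper_of_small.lower()

# ẞ -> ß -> SS: lowercasing works, but uppercasing again does not return ẞ.
lower_of_capital = CAPITAL_SHARP_S.lower()
round_trip_capital = lower_of_capital.upper()


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare via Unicode case folding, which maps ß, ẞ and SS all to 'ss'."""
    return a.casefold() == b.casefold()


print(upper_of_small, round_trip_small, lower_of_capital, round_trip_capital)
print(same_ignoring_case("Stra\u00dfe", "STRASSE"))
```

Comparing with `lower()` alone would treat "Straße" and "STRASSE" as different. Folding has its own cost, though: "Maße" and "Masse" also compare equal.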

JimDabell, 11 days ago

Transliterating this character incorrectly resulted in a violent attack causing two deaths: https://languagelog.ldc.upenn.edu/nll/?p=73

ozgung, 11 days ago

> So while we have two i’s (upper and lower), they have four.

No, we don't have four i's in Turkish. I (ı) and İ (i) are two separate letters. The Turkish alphabet has 29 letters, and each letter has its own key on a Turkish keyboard. We also have Öö, Üü, Çç, Şş and Ğğ. These are all individual letters. They are not dotted versions, accents or a writing convention. So the language is as simple as it gets. The complications come from mapping the letters to the English alphabet.

donatj, 11 days ago

I feel like Turkish should have been given an entirely separate lowercase "i" character so the pairs could be consistent, like the Greek lookalikes. Considering that capital letters historically came before lowercase, it seems like İ should have been considered an entirely separate letter from I. Greek was given entirely separate characters even though many are indistinguishable from the Latin alphabet. In Greek, for instance, "Ν" lowercases to "ν" instead of "n". The Greek "Ν", however, is not a Latin "N" but an entirely separate character. This makes a lot more sense.
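
The difference is visible at the code point level. A quick sketch in Python using the standard `unicodedata` module (the helper `describe` is made up for illustration): Greek capital nu has its own code point and its own lowercase, while Turkish shares ASCII "I" and "i" with every other Latin-script language and only adds the two missing partners.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin N, Greek Nu, then the four letters involved in the Turkish case pairs.
for ch in ["N", "\u039d", "I", "\u0131", "\u0130", "i"]:
    print(ch, describe(ch))

# Greek Nu lowercases within its own script; Latin N stays Latin.
print("\u039d".lower(), "N".lower())
```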

jongjong, 11 days ago

This is one of the reasons why software development is so difficult: most people cannot even begin to imagine how complex the user environment can be. Even within very niche problem domains you may have to deal with a broad range of different environments with different locales, different spoken languages, operating systems, programming languages, compilers/transpilers, engine versions, server frameworks, cache engines, load balancers, TLS certificate provisioning, container engines, container image versions, container orchestrators, browsers, browser extensions, frontend frameworks, test environments, transfer protocols, databases (with different client and server versions), database indexes, schema constraints, rate limiting... I could probably keep going for hours. Now imagine being aware of all these factors (and many more) and of all possible permutations of them; that's what you need in order to be a senior software developer these days. It's a miracle that any human being can produce any working software at all. As a developer, if some code works perfectly on your own computer, the journey has barely begun.

Karliss, 11 days ago

This makes me wonder: is there a programming language which has separate data types for locale-aware and locale-independent strings? I know that Rust has OsString, but that's a slightly different use case. The problem with the current widely used approach of having a global, application-wide locale setting is that most applications contain a mix of user-facing strings and technical code interacting with file formats or remote APIs. It doesn't matter if you set it to the current language (or just let the operating system set it) or force it to a language-independent locale; sooner or later something is going to break. If you are lucky, a programming language might provide some locale-independent string functions, but using them is often clunky and unlikely to be done consistently across the whole code base and all the third-party libraries. It's easier to do things correctly if you are forced to declare the intention from the start and any mixing of different contexts requires an explicit conversion.
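
As a rough illustration of what such a split could look like, here is a sketch in Python. The type names `TechText` and `UserText` are hypothetical and come from no existing library; the point is only that case conversion exists on one type and not the other, and that crossing over requires an explicit, checked call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TechText:
    """Identifiers, keys, protocol tokens: ASCII only, locale-independent."""
    value: str

    def __post_init__(self) -> None:
        if not self.value.isascii():
            raise ValueError(f"not ASCII: {self.value!r}")

    def lower(self) -> "TechText":
        # Safe: for ASCII input the mapping is the same in every locale.
        return TechText(self.value.lower())


@dataclass(frozen=True)
class UserText:
    """Text typed by or shown to a person; carries its language tag."""
    value: str
    language: str  # assumed to be a BCP 47 tag such as "tr" or "en"

    # Deliberately no lower()/upper(): casing needs language-aware rules.

    def to_tech(self) -> TechText:
        """Explicit conversion; raises ValueError if the text is not ASCII."""
        return TechText(self.value)


header = TechText("Content-ID").lower()
print(header.value)
```

Because the two types never compare equal and share no implicit conversion, accidentally feeding user input into an identifier lookup fails at the conversion point instead of somewhere downstream.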

alkonaut, 11 days ago

I think the key to doing text sanely in programming is separating "text" from "international text" or "user text". "Text" can be, e.g., the characters that make up my XML node names, or the names of my DB columns. You still have to worry about encodings and everything with this data, but you don't have to worry that there is a 10-byte emoji or a Turkish upper-case i. A key property of it is that you can, for example, run toUpper or toLower with a default culture. It has symmetric transforms. It can often be assumed to be the ASCII subset, regardless of encoding. Then on the other end you have text that the user enters. It can be anything (so it may need validation and washing). You may not be able to run "to lower" on it (although I'd be tempted to do it on an email address, for example). The key is just knowing what you have. It's unfortunate that "string" is usually used for everything from paths to user input to DB column names.

bob1029, 11 days ago

System.Globalization is quite the feat of engineering. Setting CultureInfo is like getting onto an actual airplane. I don't know of any other ecosystem with docs like: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/japanese-era-change

sebstefan, 11 days ago

Boy, it would sure be easier if the Turkish i was a different Unicode character in lowercase too.

ndepoel, 11 days ago

Ahh yes, been there, done that. Several years ago we had issues with certification of our game on PS4 because the capitalization of Sony's Turkish translation for "wireless controller" was wrong. The problem was that Turkish dotless I. What was the cause? Some years prior we had had issues with internal system strings (read: stringified enums) breaking on certain international PCs because they were being upper/lowercased using locale-specific capitalization rules. As a quick fix, the choice was made then to change the culture info to invariant globally across the entire game. This of course meant that all strings were now being upper/lowercased according to English rules, including user-facing UI strings. Hence Turkish strings mixing up dotted and dotless I's in several places. The solution? We just pre-uppercased that one "wireless controller" term in our localization sheet, because that was the only bit of text Sony cared about. An ugly fix, and we really should have gone through the code to properly separate system strings from UI texts, but it got the job done.
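
Both halves of this story can be sketched in a few lines. The example below is in Python, whose built-in `str.upper` ignores the process locale, so the Turkish rules are emulated by a hypothetical helper `upper_tr`; the `Button` enum and `parse_button` are likewise made up for illustration and have nothing to do with the actual game code.

```python
from enum import Enum


class Button(Enum):
    """Hypothetical 'stringified enum' used as an internal system string."""
    FIRE = 1
    JUMP = 2


def upper_tr(text: str) -> str:
    """Uppercase with Turkish rules: i -> İ and ı -> I (minimal emulation)."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def upper_invariant(text: str) -> str:
    """Locale-independent uppercasing, as Python's str.upper already is."""
    return text.upper()


def parse_button(name: str, upper) -> Button:
    """Look up an enum member by its uppercased name."""
    return Button[upper(name)]


# System string: invariant rules find the member, Turkish rules do not.
print(parse_button("fire", upper_invariant))
try:
    parse_button("fire", upper_tr)
except KeyError as err:
    print("lookup failed for", err)

# UI string: here it is the invariant rules that give the wrong result.
print(upper_tr("istanbul"), upper_invariant("istanbul"))
```

Neither function is right for both jobs, which is why a single global culture setting ends up breaking one side or the other.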

the_mitsuhiko, 11 days ago

Over the years this has shown up a few times because PHP was internally using a locale-dependent function to normalize class names, but was also doing it inconsistently in a few places. The bug was active for years and has resurfaced more than once: https://bugs.php.net/bug.php?id=18556

whizzter, 11 days ago

This highlights the single biggest problem I have with the MS/C#/.NET runtime and ecosystem (the article seems to be from a .NET developer): so many functions connected to string handling are locale-dependent, and you have to explicitly select the non-locale variants. That then becomes an issue when dealing with common data interchange and file formats, since those usually have US semantics. Many European developers run into this frequently, since the default parse for float/double/decimal will assume a comma as the decimal separator due to our locale settings.
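
The failure mode is easy to show without any locale machinery. The sketch below is in Python and uses a hypothetical helper, `parse_decimal`, that takes the separators as explicit arguments; it only stands in for what a locale-aware parser does with a comma-decimal culture and is not a model of the .NET API.

```python
from decimal import Decimal


def parse_decimal(text: str, decimal_sep: str = ".", group_sep: str = "") -> Decimal:
    """Parse a number using explicitly supplied separators (hypothetical helper)."""
    if group_sep:
        text = text.replace(group_sep, "")
    return Decimal(text.replace(decimal_sep, "."))


# A value read from a file format that always writes '.' as the decimal point.
wire_value = "1.5"

# Parsed with the format's own convention.
as_interchange = parse_decimal(wire_value)

# Parsed with comma-decimal, dot-grouping rules: the dot is dropped as a
# grouping separator and the value silently becomes ten times larger.
as_comma_locale = parse_decimal(wire_value, decimal_sep=",", group_sep=".")

print(as_interchange, as_comma_locale)
```

The second parse raises no error, which is what makes this class of bug hard to spot on a developer machine that happens to use the same conventions as the file format.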

prmph, 11 days ago

I wish someone would write a book that distills all the knowledge contained in those "Falsehoods Programmers Believe About X" or "Things Programmers Should Know" topics, providing a resource for how to write robust real-world software that works reasonably well anywhere, anytime. The list of gotchas with any non-trivial software is long and frequently obscure.

hudo, 11 days ago

Reminds me of a friend's old but brilliant project: use Unicode to draw art in stack trace logs! Enough with boring stack traces in logs, let's make some art there and make life a bit easier for the poor soul who's on support and has to debug the latest prod issue. https://medium.com/@ironcev/stack-trace-art-4b700a8817ea

poulsbohemian, 11 days ago

When I was in Turkey on a project, the i was absolutely a problem in the software I was trying to deploy. Glad to see this as it's one of those classic "Things Programmers Should Know" topics right up there with all the other classics like address formats and name formats not being the same across the globe.

rolandog, 11 days ago

Huh. Trying to find the letter "i" in this page in Firefox for Android results in a 0-based index of results (starts at 0/-1); you get 999/-1 as the last result if you start from the end.

anticensor, 11 days ago

We need a combining character DELETE DOT ABOVE to make i into ı.

shultays, 11 days ago

<pre><code>using System;

const string input = "interesting";
// Under a Turkish culture (tr-TR), ToUpper() yields "İNTERESTİNG", so this is false.
bool comparison = input.ToUpper() == "INTERESTING";
Console.WriteLine("These things are equal: " + comparison);
Console.ReadLine();
</code></pre>
Is this a realistic scenario? Changing the case of a string and comparing it to something else? Running some kind of operations and logic on a string that is meant for the user? If you are doing such things, then it looks more like a code smell.
