TechEcho

13 comments

dwdz 3 months ago

The script works just fine on real Linux: it creates 2048 files, and the ls command lists them all with different names.

<pre><code>ls -l win32/
total 0
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\277\237''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\267\213''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\240\220''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\274\273''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\251\205''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\255\223''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\272\257''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\264\207''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\261\246''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\254\266''.exe'
...</code></pre>
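
The octal escapes in that listing are the raw bytes of each name. As a rough sketch, assuming each name is a single lone surrogate stored in generalized UTF-8 (which is what the byte patterns suggest), a few of them can be decoded back to code points. The helper name `octal_name_to_code_point` is made up for illustration.

```python
import re


def octal_name_to_code_point(escaped: str) -> int:
    """Turn an ls-style octal escape such as r'\355\277\237' into a code point.

    Assumption: the bytes hold exactly one code point in generalized UTF-8,
    so the 'surrogatepass' error handler is needed for lone surrogates.
    """
    raw = bytes(int(octal, 8) for octal in re.findall(r"\\([0-7]{3})", escaped))
    return ord(raw.decode("utf-8", errors="surrogatepass"))


for escaped in (r"\355\277\237", r"\355\240\220", r"\355\267\213"):
    print(escaped, "->", f"U+{octal_name_to_code_point(escaped):04X}")

# The surrogate range U+D800..U+DFFF holds exactly 2048 code points,
# which matches the number of files the script creates.
print(0xDFFF - 0xD800 + 1)
```

Every decoded value falls inside U+D800..U+DFFF, so each file name is one unpaired surrogate followed by `.exe`.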

account42 3 months ago

Falsehoods programmers believe about filenames #1: Filenames are text and can be represented in common text encodings.

> Windows was an early adopter of Unicode, and its file APIs use UTF-16 internally since Windows 2000

Wrong. Windows uses WTF-16 [0] despite what the documentation says.

[0] https://simonsapin.github.io/wtf-8/#ill-formed-utf-16
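
To make the distinction concrete, here is a small sketch that uses Python's codecs as a stand-in (an assumption, since this is not the Windows API itself). A strict UTF-16 encoder rejects an unpaired surrogate, while a permissive one simply emits the bare 16-bit code unit, which is the "potentially ill-formed UTF-16" the linked spec describes.

```python
lone = "\ud800"  # an unpaired high surrogate

# Strict UTF-16 refuses to encode it...
try:
    lone.encode("utf-16-le")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but the 'surrogatepass' handler writes the bare 16-bit code unit.
relaxed = lone.encode("utf-16-le", errors="surrogatepass")

print(strict_ok)      # False
print(relaxed.hex())  # 00d8
```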

feldrim 3 months ago

Hi all. OP here. I added a Postscriptum about surrogate pairs and their status in Linux. I used WSL to access those files under Windows, and generated the same files on Linux. You can see that the behavior differs for the same file names:

1. On Windows, accessed by WSL
2. On Linux (WSL), using UTF-8 locale
3. On Linux (WSL), using POSIX locale

The difference is weird to me as a user. I'd like to know about the decisions behind these. If anyone has information, please let me know.

kzrdude 3 months ago

I remember that in Mac OS X times, sometime between OS X v10.1 and 10.4, a system upgrade caused a bunch of Unicode-named files to become inaccessible/untouchable (but still present in a directory listing). At the time I didn't have the skills to figure out what had happened. I'm still curious to know if it was an intended breaking change.

mofeien 3 months ago

Hi, thanks for the interesting submission!

I was a bit confused by the detour via UTF-8 to arrive at the code points and had to look up UTF-8 encoding first to understand how they relate. Then I tried out the following

<pre><code>candidate = chr(0xD800)
candidate2 = bytes([0xED, 0xA0, 0x80]).decode('utf-8', errors='surrogatepass')
print(candidate == candidate2)  # True</code></pre>

and it seems that you could just iterate over code points directly with the `chr()` function.
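
Extending that check to the whole surrogate range might look like the sketch below. The function names and the `.exe` suffix (borrowed from the listing elsewhere in the thread) are illustrative assumptions, and nothing is written to disk.

```python
SURROGATES = range(0xD800, 0xE000)  # U+D800..U+DFFF inclusive


def via_chr(cp: int) -> str:
    """Build the one-character string directly from the code point."""
    return chr(cp)


def via_utf8_bytes(cp: int) -> str:
    """Build the three-byte generalized UTF-8 form by hand, then decode it."""
    raw = bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
    return raw.decode("utf-8", errors="surrogatepass")


names = [via_chr(cp) + ".exe" for cp in SURROGATES]
print(len(names))  # 2048
print(all(via_chr(cp) == via_utf8_bytes(cp) for cp in SURROGATES))  # True
```

Both routes give identical strings for every surrogate, so the byte-level detour is only needed if you want to see the on-disk representation.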

qingcharles 3 months ago

Aha! Found the name of my next album. Try downloading me on Napster now!

somewhereoutth 3 months ago

Ah, got caught by surrogate pairs recently: JavaScript sees them as 2 chars when e.g. slicing strings, so it is possible to end up with invalid strings if you chop between a pair.
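
The same effect can be reproduced outside JavaScript. This is a sketch in Python that emulates slicing by UTF-16 code units. The helper names are invented for illustration, and Python's own `str` slicing works on code points, so the emulation is explicit.

```python
def utf16_units(text):
    """Split text into 16-bit code units, the way a JavaScript string stores it."""
    raw = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def is_well_formed(units):
    """True if the code units decode as strict UTF-16 (no unpaired surrogates)."""
    raw = b"".join(unit.to_bytes(2, "little") for unit in units)
    try:
        raw.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


units = utf16_units("a\U0001F600")  # 'a' plus one emoji outside the BMP
print([hex(u) for u in units])      # ['0x61', '0xd83d', '0xde00']
print(is_well_formed(units))        # True
print(is_well_formed(units[:2]))    # False: the cut landed between the pair
```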

n_plus_1_acc 3 months ago

I think it's hilarious that the event viewer XML gets borked.

ooterness 3 months ago

Why does the Windows filesystem allow filenames with invalid strings?

It seems obvious that attempts to create files with such filenames ought to be blocked.

ge96 3 months ago

I remember a long time ago I accidentally put some symbol like ? in a folder name in Windows and had problems.

rob74 3 months ago

Or otherwise said: Surrogate pairs are used in UTF-16 (which uses two-byte code units, so a single unit can represent at most 65,536 values) to encode Unicode characters whose code points don't fit in a single two-byte unit.
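
The arithmetic behind a pair is short enough to write out. This is a minimal sketch of the standard UTF-16 split, with function names chosen only for illustration.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point fits in one 16-bit unit; no pair needed")
    offset = cp - 0x10000            # 20 bits remain after removing the BMP
    high = 0xD800 + (offset >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(hex(from_surrogate_pair(high, low)))  # 0x1f600
```

A high or low value on its own, without its partner, is the "lone surrogate" that causes the filename trouble discussed in this thread.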

Devasta 3 months ago

Stuff like this is why UTF and any attempt at trying to encode all characters is a mistake.

The real solution is to force the entire world population to use the Rotokas language of Papua New Guinea.

theiebrjfb 3 months ago

Yet another reason to use Linux everywhere. It is 2025 and Windows (and probably Mac) users have to deal with weird Unicode filesystem issues. Good luck putting Chinese characters or emoticons into filenames.

An ext4 filename has a maximum length of 255 bytes. That is the only legacy limit you have to deal with as a Linux user. And even that can be avoided by using more modern filesystems.

And we get filesystem-level snapshots etc...
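
One caveat on that limit: it is counted in bytes, not characters, so non-ASCII names hit it sooner. Here is a rough sketch, assuming names are stored as UTF-8 (the usual case on Linux, though the filesystem itself does not enforce an encoding). The constant and function names are illustrative only.

```python
EXT4_NAME_MAX = 255  # bytes per filename component


def fits_name_max(name: str, limit: int = EXT4_NAME_MAX) -> bool:
    """True if the UTF-8 encoding of name fits within the byte limit."""
    return len(name.encode("utf-8")) <= limit


print(fits_name_max("a" * 255))  # True: 255 ASCII characters are 255 bytes
print(fits_name_max("文" * 85))  # True: 85 * 3 = 255 bytes
print(fits_name_max("文" * 86))  # False: 258 bytes
```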

Understanding Surrogate Pairs: Why Some Windows Filenames Can't Be Read
