The script works just fine on real Linux: it creates 2048 files, and the <i>ls</i> command lists them all with different names.<p><pre><code> ls -l win32/
total 0
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\277\237''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\267\213''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\240\220''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\274\273''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\251\205''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\255\223''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\272\257''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\264\207''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\261\246''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\254\266''.exe'
...</code></pre>
Falsehoods programmers believe about filenames #1: Filenames are text and can be represented in common text encodings.<p>> Windows was an early adopter of Unicode, and its file APIs use UTF‑16 internally since Windows 2000<p>Wrong. Windows uses WTF-16 [0] despite what the documentation says.<p>[0] <a href="https://simonsapin.github.io/wtf-8/#ill-formed-utf-16" rel="nofollow">https://simonsapin.github.io/wtf-8/#ill-formed-utf-16</a>
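To illustrate the point about ill-formed UTF-16, here is a minimal Python sketch. It assumes CPython 3.4 or later, where the strict UTF-16 codec rejects lone surrogates; the helper name `is_well_formed_utf16` is made up for this example and is not from the article.

```python
# A lone high surrogate, like the ones used in the file names listed above.
lone = "\ud800"


def is_well_formed_utf16(text):
    """Return True if `text` can be encoded as strict (well-formed) UTF-16."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


# 'surrogatepass' writes the 16-bit code unit through unchanged, which is
# roughly what a "potentially ill-formed UTF-16" API accepts.
raw_units = lone.encode("utf-16-le", errors="surrogatepass")

print(is_well_formed_utf16("a.exe"))  # True
print(is_well_formed_utf16(lone))     # False
print(raw_units)                      # b'\x00\xd8'
```

So the name is a perfectly good sequence of 16-bit code units, just not valid UTF-16 text.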
Hi all. OP here. I added a postscript about the surrogate pairs and their status in Linux. I used WSL to access those files under Windows, and generated the same files on Linux. You can see that the behavior differs for the same file names:<p>1. On Windows, accessed by WSL<p>2. On Linux (WSL), using UTF-8 locale<p>3. On Linux (WSL), using POSIX locale<p>The difference seems weird to me as a user. I'd like to know about the decisions behind this behavior. If anyone has information, please let me know.
I remember that in Mac OS X times, sometime between OS X v10.1 and 10.4, a system upgrade caused a bunch of Unicode-named files to become inaccessible/untouchable (but still present in a directory listing). At the time I didn't have the skills to figure out what had happened. I'm still curious to know if it was an intended breaking change.
Hi, thanks for the interesting submission!<p>I was a bit confused by the detour via utf-8 to arrive at the code points and had to look up UTF-8 encoding first to understand how they relate. Then I tried out the following<p><pre><code> candidate = chr(0xD800)
candidate2 = bytes([0xED, 0xA0, 0x80]).decode('utf-8', errors='surrogatepass')
print(candidate == candidate2) # True
</code></pre>
and it seems that you could just iterate over code points directly with the `chr()` function.
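A rough sketch of that direct iteration, assuming the goal is just to enumerate the 2048 surrogate code points and get their UTF-8-style byte sequences (the variable names are mine, not from the article):

```python
# All 2048 surrogate code points: U+D800 through U+DFFF.
surrogates = [chr(cp) for cp in range(0xD800, 0xE000)]

# 'surrogatepass' lets the UTF-8 codec emit the 3-byte form for each one.
encoded = [s.encode("utf-8", errors="surrogatepass") for s in surrogates]

print(len(surrogates))                     # 2048
print(encoded[0])                          # b'\xed\xa0\x80'
print(encoded[-1])                         # b'\xed\xbf\xbf'
print(all(e[0] == 0xED for e in encoded))  # True
```

The common first byte 0xED is the `\355` (octal) that shows up at the start of every name in the `ls` listing.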
Ah, I got caught by surrogate pairs recently: JavaScript sees them as 2 chars when e.g. slicing strings, so it is possible to end up with invalid strings if you chop in the middle of a pair.
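Python strings are indexed by code point, so the same accident doesn't happen with plain slicing there, but the effect can be simulated by working on UTF-16 code units. This is only a sketch of the failure mode, not JavaScript's actual implementation:

```python
text = "a\U0001F600"  # U+1F600 needs a surrogate pair in UTF-16

units = text.encode("utf-16-le")  # 2 bytes per 16-bit code unit
n_units = len(units) // 2         # 3 code units for 2 characters

# Keep only the first two code units, cutting the pair in half.
chopped = units[:4].decode("utf-16-le", errors="surrogatepass")

try:
    chopped.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(n_units)        # 3
print(repr(chopped))  # 'a\ud83d'
print(encodable)      # False
```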
Why does the Windows filesystem allow filenames with invalid strings?<p>It seems obvious that attempts to create files with such filenames ought to be blocked.
Put another way: surrogate pairs are used in UTF-16 (which uses two-byte code units, so a single unit can represent at most 65,536 values) to encode Unicode characters whose code points don't fit in just two bytes.
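For the curious, the arithmetic is small enough to sketch. The function name `to_surrogate_pair` is made up for this example; the constants are the standard UTF-16 ones.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000     # 20 bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 encoder.
expected = "\U0001F600".encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))  # True
```

Either half on its own is a lone surrogate, which is exactly what ends up in those file names.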
Stuff like this is why UTF and any attempt at trying to encode all characters is a mistake.<p>The real solution is to force the entire world population to use the Rotokas language of Papua New Guinea.
Yet another reason to use Linux everywhere. It is 2025 and Windows (and probably Mac) users have to deal with weird Unicode filesystem issues. Good luck putting Chinese characters or emoticons into filenames.<p>An ext4 filename has a maximum length of 255 bytes. That is the only legacy limit you have to deal with as a Linux user. And even that can be avoided by using more modern filesystems.<p>And we get filesystem-level snapshots etc...
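One caveat on that limit: it is counted in bytes, not characters, so non-ASCII names hit it sooner. A quick sketch, assuming names are stored as UTF-8 (the usual case on Linux); `NAME_MAX` and `fits_ext4` are names picked for this example:

```python
NAME_MAX = 255  # ext4 per-filename limit, in bytes


def fits_ext4(name):
    """Check a file name against the 255-byte limit, assuming UTF-8 storage."""
    return len(name.encode("utf-8")) <= NAME_MAX


ascii_name = "a" * 255      # 255 characters, 255 bytes
cjk_name = "\u6c49" * 100   # 100 characters, 300 bytes in UTF-8

print(fits_ext4(ascii_name))  # True
print(fits_ext4(cjk_name))    # False
```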