Understanding Surrogate Pairs: Why Some Windows Filenames Can't Be Read

85 points by feldrim 3 months ago

13 comments

dwdz 3 months ago
The script works just fine on real Linux: it creates 2048 files, and the ls command lists them all with different names.

    ls -l win32/
    total 0
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\277\237''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\267\213''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\240\220''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\274\273''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\251\205''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\255\223''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\272\257''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\264\207''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\261\246''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\254\266''.exe'
    ...
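
The octal escapes in this listing can be read back as raw bytes. The following is a minimal sketch, assuming the names are UTF-8-style encodings of lone surrogates; `decode_ls_escape` is a hypothetical helper written for illustration and is not part of the original script.

```python
import re


def decode_ls_escape(escaped: str) -> str:
    """Turn ls-style octal escapes such as \\355\\277\\237 into a string."""
    # Each \NNN group is one byte written in octal.
    raw = bytes(int(octal, 8) for octal in re.findall(r"\\([0-7]{3})", escaped))
    # 'surrogatepass' lets Python decode the surrogate range,
    # which the strict UTF-8 decoder rejects.
    return raw.decode("utf-8", errors="surrogatepass")


for name in (r"\355\277\237", r"\355\240\220"):
    code_point = ord(decode_ls_escape(name))
    print(f"{name} -> U+{code_point:04X}")
# \355\277\237 -> U+DFDF
# \355\240\220 -> U+D810
```

Both results fall in the surrogate range U+D800 to U+DFFF, which holds exactly 2048 code points, consistent with the 2048 files mentioned above.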

account42 3 months ago
Falsehoods programmers believe about filenames #1: Filenames are text and can be represented in common text encodings.

> Windows was an early adopter of Unicode, and its file APIs use UTF‑16 internally since Windows 2000

Wrong. Windows uses WTF-16 [0] despite what the documentation says.

[0] https://simonsapin.github.io/wtf-8/#ill-formed-utf-16
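
To make the distinction concrete: well-formed UTF-16 requires every high surrogate to be followed immediately by a low surrogate, whereas an arbitrary sequence of 16-bit units carries no such guarantee. The sketch below checks that property on a list of code units; `is_well_formed_utf16` is a hypothetical name, and this illustrates the rule only, not how Windows validates anything.

```python
def is_well_formed_utf16(units: list[int]) -> bool:
    """Return True if every surrogate in `units` is part of a proper pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            return False
        i += 1
    return True


print(is_well_formed_utf16([0x0041, 0xD83D, 0xDE00]))  # True: 'A' plus one pair
print(is_well_formed_utf16([0xD800]))                  # False: lone high surrogate
print(is_well_formed_utf16([0xDE00, 0xD83D]))          # False: pair in wrong order
```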

feldrim 3 months ago
Hi all. OP here. I added a postscript about surrogate pairs and their status on Linux. I used WSL to access those files under Windows, and generated the same files on Linux. You can see that the behavior differs for the same file names:

1. On Windows, accessed by WSL

2. On Linux (WSL), using UTF-8 locale

3. On Linux (WSL), using POSIX locale

The difference is weird to me as a user. I'd like to know about the decisions behind these behaviors. If anyone has information, please let me know.

kzrdude 3 months ago
I remember that in Mac OS X times, sometime between OS X v10.1 and 10.4, a system upgrade caused a bunch of Unicode-named files to become inaccessible/untouchable (but still present in a directory listing). At the time I didn't have the skills to figure out what had happened. I'm still curious to know if it was an intended breaking change.

mofeien 3 months ago
Hi, thanks for the interesting submission!

I was a bit confused by the detour via UTF-8 to arrive at the code points and had to look up UTF-8 encoding first to understand how they relate. Then I tried out the following

    candidate = chr(0xD800)
    candidate2 = bytes([0xED, 0xA0, 0x80]).decode('utf-8', errors='surrogatepass')
    print(candidate == candidate2)  # True

and it seems that you could just iterate over code points directly with the `chr()` function.
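
The direct iteration suggested here might look like the following. This is a sketch only; `surrogate_code_points` is a hypothetical helper, and the original script may build its candidates differently.

```python
def surrogate_code_points() -> list[str]:
    """Return one single-character string per surrogate code point."""
    # U+D800 through U+DFFF inclusive; chr() accepts surrogates directly.
    return [chr(cp) for cp in range(0xD800, 0xE000)]


candidates = surrogate_code_points()
print(len(candidates))  # 2048

# The UTF-8-style bytes are still available when needed.
first_bytes = candidates[0].encode("utf-8", errors="surrogatepass")
print(first_bytes.hex())  # eda080
```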

qingcharles 3 months ago
Aha! Found the name of my next album. Try downloading me on Napster now!

somewhereoutth 3 months ago
Ah, I got caught by surrogate pairs recently: JavaScript sees them as two chars when, e.g., slicing strings, so it is possible to end up with invalid strings if you chop between a pair.
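
The effect can be reproduced outside JavaScript by slicing at the level of UTF-16 code units. The sketch below does this in Python; `utf16_units` and `ends_with_lone_high_surrogate` are hypothetical helpers, and the code models the indexing behavior described above rather than calling any JavaScript API.

```python
def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of `text` as integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def ends_with_lone_high_surrogate(units: list[int]) -> bool:
    """True if the last unit is a high surrogate with no partner after it."""
    return bool(units) and 0xD800 <= units[-1] <= 0xDBFF


text = "a\U0001F600b"          # 'a', one emoji, 'b'
units = utf16_units(text)
print(len(text), len(units))   # 3 4: three code points, four code units

chopped = units[0:2]           # slice that ends between the two halves
print([hex(u) for u in chopped])               # ['0x61', '0xd83d']
print(ends_with_lone_high_surrogate(chopped))  # True
```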

n_plus_1_acc 3 months ago
I think it's hilarious that the event viewer XML gets borked.

ooterness 3 months ago
Why does the Windows filesystem allow filenames with invalid strings?

It seems obvious that attempts to create files with such filenames ought to be blocked.

ge96 3 months ago
I remember a long time ago I accidentally put some symbol like ? in a folder name in Windows and had problems.

rob74 3 months ago
Or, put another way: surrogate pairs are used in UTF-16 (whose code units are two bytes each, so a single unit can take only 65,536 values) to encode Unicode characters whose code points can't be encoded using just two bytes.
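
The arithmetic behind a surrogate pair is short enough to write out. This is a sketch of the standard UTF-16 mapping for code points above U+FFFF; the function names are hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000       # 20 bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(hex(from_surrogate_pair(high, low)))  # 0x1f600
```

A high or low surrogate on its own, such as U+D800, has no partner to recombine with, which is why it does not correspond to any character.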

Devasta 3 months ago
Stuff like this is why UTF and any attempt at trying to encode all characters is a mistake.

The real solution is to force the entire world population to use the Rotokas language of Papua New Guinea.

theiebrjfb 3 months ago
Yet another reason to use Linux everywhere. It is 2025 and Windows (and probably Mac) users have to deal with weird Unicode filesystem issues. Good luck putting Chinese characters or emoticons into filenames.

An ext4 filename has a maximum length of 255 bytes. That is the only legacy limit you have to deal with as a Linux user. And even that can be avoided by using more modern filesystems.

And we get filesystem-level snapshots etc...
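
One caveat on the 255 limit: as far as I know it is counted in bytes, not characters, so non-ASCII names hit it sooner. A rough sketch, assuming UTF-8 names; the constant and the helper `fits_ext4_name` are hypothetical and not taken from any library.

```python
EXT4_NAME_MAX_BYTES = 255  # assumed limit, counted in bytes rather than characters


def fits_ext4_name(name: str) -> bool:
    """Check whether the UTF-8 encoding of `name` fits in the assumed limit."""
    return len(name.encode("utf-8")) <= EXT4_NAME_MAX_BYTES


print(fits_ext4_name("a" * 255))   # True: 255 ASCII characters are 255 bytes
print(fits_ext4_name("文" * 85))   # True: 85 characters at 3 bytes each are 255 bytes
print(fits_ext4_name("文" * 86))   # False: 258 bytes
```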