In 2009, David A. Wheeler wrote a comprehensive article covering problems with Unix/Linux/POSIX filenames¹. Given that the OS naïvely treats filenames as a simple stream of bytes, he advocated that developers use UTF-8 for encoding filenames. He mentioned the issue of multiple normalisation systems being used to encode characters that have more than one Unicode representation, but glossed over it because such problems are “overshadowed by the terrible awful even worse problems caused by filenames all being in random unguessable charsets”.

I’m guessing that, by now, most developers on Unix-like systems would be using UTF-8 for filenames – though a decade after that article was published, there still doesn’t seem to be any good/universal solution to the problem of characters with multiple Unicode representations.

¹ https://dwheeler.com/essays/fixing-unix-linux-filenames.html
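To make the “multiple representations” problem concrete, here’s a small Python sketch (the filename is just illustrative): the precomposed and decomposed spellings of the same visible name are different codepoint sequences, so a filesystem that treats names as opaque bytes will happily store them as two separate files.

```python
import unicodedata

name = "café.txt"
nfc = unicodedata.normalize("NFC", name)   # 'caf' + U+00E9 + '.txt'
nfd = unicodedata.normalize("NFD", name)   # 'caf' + 'e' + U+0301 + '.txt'

print(nfc == nfd)              # False: different codepoint sequences
print(nfc.encode("utf-8"))     # b'caf\xc3\xa9.txt'
print(nfd.encode("utf-8"))     # b'cafe\xcc\x81.txt'

# On a byte-oriented filesystem (typical Linux setups), creating both
# would leave two visually identical directory entries:
#   open(nfc, "w"); open(nfd, "w")  ->  two distinct files
```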
You should normalize names on write; fixing this on read is very hard. A string can be perfectly valid yet denormalized, with its codepoints drawn from a mixture of normalization forms.

So if there are four possible normalization forms (NFD, NFC, NFKD, NFKC) and your string has N ambiguous codepoints, the number of possible strings you need to try on read grows like 4^N, since each ambiguous codepoint can independently appear in any of the four forms.
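A minimal sketch of the normalize-on-write idea in Python, assuming a Linux-style filesystem where names are stored as UTF-8 bytes; save_file is a hypothetical helper and NFC is just one reasonable choice of target form:

```python
import unicodedata

def save_file(path: str, data: bytes) -> None:
    # Hypothetical helper: pick one canonical form (NFC here) before the
    # name ever reaches the filesystem, so only one spelling is stored.
    canonical = unicodedata.normalize("NFC", path)
    with open(canonical, "wb") as f:
        f.write(data)

# Both spellings of "café.txt" end up as the same on-disk name:
save_file("cafe\u0301.txt", b"data")   # decomposed input
save_file("caf\u00e9.txt", b"data")    # precomposed input
```

The choice of form is itself a convention: NFC is common on Linux, while macOS’s HFS+ historically stored names in a decomposed form, which is part of why no single approach ever became universal.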
Side note - after all these years I still don't feel comfortable using special characters (like ą, ż, ź) and spaces in filenames in Windows. DOS times sit deeply in my soul and it just doesn't feel right.