There is one property of UTF-8 that distinguishes it from opaque byte strings of unknown encoding: its encoding of codepoints is self-synchronizing, so you can naively locate instances of a substring (and delete them, replace them, split on them, etc.) without worrying that you've grabbed something else whose bytes merely happen to coincide with the substring's.<p>Contrast with UTF-16, where a substring might match the bytes at an odd index in the original string, corresponding to totally different characters.<p>This kind of substring matching is valid in every human language I know of, as long as the substring itself is semantically meaningful (e.g., it doesn't end in part of a grapheme cluster; though if you want to avoid breaking up words, you may also want a \b-like mechanism). So it does seem to refute the author's notion that you can do nothing with knowledge only of the encoding.
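<p>A minimal sketch of the difference, assuming a hypothetical naive byte-level search helper (standard library only). The UTF-16 haystack below contains no 'A' at all, yet the needle's bytes still match mid-character; the UTF-8 case cannot do that:<p><pre><code> // Naive byte-level search: offset of the first byte-wise match.
 fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
     haystack.windows(needle.len()).position(|w| w == needle)
 }

 fn main() {
     // UTF-16LE haystack "\u{4142}\u{4300}" and needle 'A' (U+0041).
     // The needle's bytes [0x41, 0x00] show up at odd offset 1, straddling
     // two code units: a false positive for a character that isn't there.
     let utf16_haystack: &[u8] = &[0x42, 0x41, 0x00, 0x43];
     assert_eq!(find_bytes(utf16_haystack, &[0x41, 0x00]), Some(1));

     // UTF-8: continuation bytes always start with the bit pattern 10 and
     // lead bytes never do, so a valid substring's bytes can only match at
     // a real character boundary.
     let utf8_haystack = "日本語A";
     assert_eq!(find_bytes(utf8_haystack.as_bytes(), b"A"), Some(9));
 }
</code></pre>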
My understanding is that Rust designed the rest of the String API under the assumption of validity. You can't create an invalid String because the methods that operate on strings strive to be tightly optimized UTF-8 manipulation algorithms that assume the string has already been cleaned. Pack all of the actual robustness into the guarantee that you are working with UTF-8, and you can avoid unnecessary CPU cycles, which is one of the goals of a systems language. If you want to skip that guarantee, go for raw byte strings or CStr -- all raw byte buffers have the basic ASCII functions available, which are designed to be robust against whatever you throw at them, and it shouldn't be too hard to introduce genericity so an API can accept both strings and raw data.<p>That being said, I'm not sure how this is actually implemented; I assume there is still some degree of robustness when running methods on strings created with `unsafe fn from_utf8_unchecked`, just by nature of UTF-8's self-synchronization, which may be what the article is pointing out. It's possible that some cleverly optimized UTF-8 algorithms don't need valid data to avoid memory issues / UB that takes down the entire program, and can instead catch the error or perform a lossy transformation on the spot without incurring too much overhead.
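<p>To make the checked/unchecked split concrete, a small sketch (standard library only; none of this is from the article):<p><pre><code> fn main() {
     let bytes = vec![0x66, 0x6f, 0x6f, 0xff]; // "foo" plus a byte that is never valid UTF-8

     // The checked constructor refuses to build an invalid String.
     assert!(String::from_utf8(bytes.clone()).is_err());

     // Basic ASCII functions are available on plain byte slices and are
     // robust against arbitrary data.
     assert!(bytes[..3].eq_ignore_ascii_case(b"FOO"));
     assert!(!bytes[3].is_ascii());

     // The unsafe escape hatch skips validation; String methods are then
     // entitled to assume valid UTF-8, so handing them garbage this way
     // violates the safety contract.
     let s = unsafe { String::from_utf8_unchecked(bytes) };
     assert_eq!(s.len(), 4); // the byte length is still well-defined
 }
</code></pre>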
Good paper on UTF-8 validation performance: <a href="https://arxiv.org/pdf/2010.03090" rel="nofollow">https://arxiv.org/pdf/2010.03090</a><p><pre><code> The relatively simple algorithm (lookup) can be several times faster than conventional algorithms at a common task using nothing more than the instructions available on commodity processors. It requires fewer than an instruction per input byte in the worst case.</code></pre>
So... the main reason to use Unicode in general, and UTF-8 specifically, is that it's the common denominator of a lot of weird stuff you'd see in the wild.<p>For example, most Unix platforms allow filenames to be arbitrary sequences of bytes, while Windows lets filenames be arbitrary 16-bit code units, effectively UCS-2 (i.e. unpaired surrogates are allowed). Also, both Unix and Windows have some notion of a "local encoding" (LC_ALL etc. on Unix, codepages on Windows).<p>The common denominator, the Schelling point [1], of all of these weird systems is Unicode. Without prior coordination, you can generally assume that other participants in your system will try to use Unicode, and probably the UTF-8 encoding.<p>Checking at the boundaries of your program that your inputs are valid Unicode/UTF-8 leads to (a) good error messages when they aren't, and (b) not having to deal with jank internally.<p>[1] <a href="https://en.wikipedia.org/wiki/Focal_point_(game_theory)" rel="nofollow">https://en.wikipedia.org/wiki/Focal_point_(game_theory)</a>
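<p>A minimal sketch of that boundary check (hypothetical helper; Utf8Error::valid_up_to points at the first offending byte, which is what makes the error message useful):<p><pre><code> use std::str;

 // Reject non-UTF-8 input up front, with an error that says where it broke.
 fn check_input(raw: &[u8]) -> Result<&str, String> {
     str::from_utf8(raw)
         .map_err(|e| format!("input is not valid UTF-8: bad byte at offset {}", e.valid_up_to()))
 }

 fn main() {
     assert!(check_input(b"hello").is_ok());
     // 0xC3 starts a two-byte sequence that never gets its continuation byte.
     assert_eq!(
         check_input(b"abc\xC3").unwrap_err(),
         "input is not valid UTF-8: bad byte at offset 3"
     );
 }
</code></pre>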
"98% of web sites are encoded in UTF8"<p>And quite a few web sites claim to be encoded in UTF8 and serve latin-1. It is best to check, or at least to specify error handling on your decoder.
I agree. Validating UTF-8 wastes processing time and does not work well with non-Unicode text; and (like it says in the article) often you should not actually care what character encoding (if any) it uses anyways. Furthermore, it is often useful to measure lengths or split by bytes rather than by Unicode code points.<p>Unicode string types are just a bad idea, I think. Byte strings are better; you can still add functions to deal with Unicode or other character codes if necessary (and/or add explicit tagging for character encoding, if that is helpful).<p>Many programming languages, though, make it difficult to work with byte strings, non-Unicode strings, etc. This often causes problems, in my experience, unless you are careful.<p>Unicode string types are a problem especially when used incorrectly, since if a library uses them they can be exposed to applications that call it, even if those applications do not want them and even if the library doesn't or shouldn't really care. GOTO is not a problem; it is good, because it does not affect library APIs: even if a library uses it, your program does not have to use it, and vice-versa. Unicode string types do not have that kind of benefit, so they are a much more significant problem, and should be avoided when designing a programming language.<p>(None of the above means that there is never any reason to deal with UTF-8, although usually there isn't a good one. For example, if a file in ASCII format can contain commands which are used to produce some output in a UTF-16 format, then it makes sense to treat the arguments to those commands as WTF-8 so that they can be converted to UTF-16, since WTF-8 is the "corresponding ASCII-compatible character encoding" for UTF-16. Similarly, if the output file is JIS, then using EUC-JP would be sensible.)
Personally, I prefer Ruby's behavior of explicit encoding on strings and being very cranky when invalid codepoints show up in a UTF-8 string.<p>If you want to ignore invalid UTF-8, use String#scrub to replace heretical values with \uFFFD and life is good. :)
As a small nit, the reason Rust has pairs of types (String vs &str) is not due to mutability, but to allow references to substrings.<p>A String is an allocated object, with a location and length. A str is a sequence of characters "somewhere", which is why you can only have references to str, never an actual str object. CStr and OsStr are similar. You could use &[u8] instead of any of them, but the stronger types enforce some guarantees on what you'll find in the sequence.
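<p>A quick illustration of that distinction (standard library only):<p><pre><code> fn main() {
     // String owns its heap allocation.
     let owned: String = String::from("hello, world");

     // &str borrows a substring of it without copying: just a pointer
     // and a length into the same buffer.
     let hello: &str = &owned[0..5];
     assert_eq!(hello, "hello");

     // &str can also point at data that was never a String at all,
     // such as a literal baked into the binary.
     let literal: &str = "static data";
     assert_eq!(literal.len(), 11);
 }
</code></pre>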
> In Rust, strings are always valid UTF8, and attempting to create a string with invalid UTF8 will panic at runtime:<p>> [piece of code explicitly calling .unwrap()]<p>You misspelled "returns an error".<p>It might be worth considering Python, where the most central change from 2 to 3 was that strings became validated Unicode text, decoded (and checked) from bytes at the program's boundaries. I don't understand why that gets dismissed with "it was designed in the 1990's" when the change happened so recently.
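<p>For the record, a sketch of the non-panicking path (the panic in the article's snippet comes from the explicit .unwrap(), not from the conversion itself):<p><pre><code> fn main() {
     let bytes = vec![0xf0, 0x9f, 0x92, 0x96, 0xff]; // a valid emoji followed by a stray byte

     // String::from_utf8 returns a Result; nothing panics unless you ask it to.
     match String::from_utf8(bytes) {
         Ok(s) => println!("valid: {s}"),
         Err(e) => {
             // The original bytes can even be recovered from the error.
             let recovered = e.into_bytes();
             eprintln!("invalid UTF-8; got {} raw bytes back", recovered.len());
         }
     }
 }
</code></pre>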
> As always, immutability comes with a performance penalty: Mutating values is generally faster than creating new ones.<p>I get what they're saying, but I'm not sure I agree with it. Mutating one specific value is faster than making a copy and then altering that. Knowing that a value can't be mutated and using that to optimize the rest of the system can be faster yet. I think it's more likely the case that allowing mutability comes with a performance penalty.
You do have to validate UTF-8 strings:<p>- You can't just skip stuff if you run any kind of normalization<p>- How would you index into or split an invalid UTF-8 string?<p>- How would you apply a regex?<p>- What is its length?<p>- How do you deal with other systems that <i>do</i> validate UTF-8 strings?<p>Meta point: scanning a sequence of bytes for invalid UTF-8 sequences is validating. The decision to skip them is just slightly different code than "raise error". It's probably also a lot slower, as you have to do this for every operation, whereas once you've validated a string you can operate on it with impunity.<p>Love this for the hot take/big swing, but it's a whiff.
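<p>Concretely, the index/split/length questions above only have answers once the bytes are known to be valid UTF-8; a small Rust illustration (standard library only):<p><pre><code> fn main() {
     let s = "naïve"; // 'ï' takes two bytes in UTF-8

     // "Length" depends on interpreting a valid encoding:
     assert_eq!(s.len(), 6);           // bytes
     assert_eq!(s.chars().count(), 5); // code points

     // Indexing and splitting are boundary-aware; split_at would panic
     // if the index fell inside a multi-byte character.
     assert!(s.is_char_boundary(4));
     assert!(!s.is_char_boundary(3));
     assert_eq!(s.split_at(4), ("naï", "ve"));
 }
</code></pre>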
In my opinion, one argument for internally representing `String`s as UTF-8 is that it prevents accidentally saving a file as Latin-1 or some other encoding. I would like to be able to read a file my coworker sent me in my favorite language without having to figure out what the encoding of the file is.<p>For example, my most recent Julia project has the following line:<p><pre><code> # `decode` here comes from the StringEncodings.jl package (`using StringEncodings`)
 windows1252_to_utf8(s) = decode(Vector{UInt8}(String(coalesce(s, ""))), "Windows-1252")
</code></pre>
Figuring out that I had to use Windows-1252 (and not Latin-1) took a lot more time than I would have liked.<p>I get that there are some ergonomic challenges around this in languages like Julia that are optimized for data analysis workflows, but imho all data analysis languages/scripts should either be forced to explicitly list encodings whenever reading/writing a file or default to UTF-8.