There is one property of UTF-8 that distinguishes it from opaque byte strings of unknown encoding: its encoding of codepoints is self-synchronizing, so you can naively locate instances of a substring (and delete them, replace them, split on them, etc.) without worrying that you've grabbed something else whose bytes merely happen to coincide with the substring's.<p>Contrast with UTF-16, where a substring might match the bytes at an odd index in the original string, corresponding to totally different characters.<p>This kind of substring matching is valid in every human language I know of, as long as the substring itself is semantically meaningful (e.g., it doesn't end in part of a grapheme cluster; though if you want to avoid breaking up words, you may also want a \b-like mechanism). So it does seem to refute the author's notion that you can do nothing with knowledge only of the encoding.
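<p>A minimal sketch of the difference, assuming a hypothetical naive byte-level search helper (standard library only). The UTF-16 haystack below contains no 'A' at all, yet the needle's bytes still match mid-character; the UTF-8 case cannot do that:<p><pre><code> // Naive byte-level search: offset of the first byte-wise match.
 fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
     haystack.windows(needle.len()).position(|w| w == needle)
 }

 fn main() {
     // UTF-16LE haystack "\u{4142}\u{4300}" and needle 'A' (U+0041).
     // The needle's bytes [0x41, 0x00] show up at odd offset 1, straddling
     // two code units: a false positive for a character that isn't there.
     let utf16_haystack: &[u8] = &[0x42, 0x41, 0x00, 0x43];
     assert_eq!(find_bytes(utf16_haystack, &[0x41, 0x00]), Some(1));

     // UTF-8: continuation bytes always start with the bit pattern 10 and
     // lead bytes never do, so a valid substring's bytes can only match at
     // a real character boundary.
     let utf8_haystack = "日本語A";
     assert_eq!(find_bytes(utf8_haystack.as_bytes(), b"A"), Some(9));
 }
</code></pre>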
My understanding is that Rust designed the rest of the String API under the assumption of validity. You can't create an invalid String because the methods that operate on strings strive to be tightly optimized UTF-8 manipulation algorithms that assume the string has already been cleaned. Pack all of the actual robustness into the guarantee that you are working with UTF-8, and you can avoid unnecessary CPU cycles, which is one of the goals of a systems language. If you want to skip that guarantee, go for raw byte strings or CStr -- all raw byte buffers have the basic ASCII functions available, which are designed to be robust against whatever you throw at them, and it shouldn't be too hard to introduce genericity so an API can accept both strings and raw data.<p>That being said, I'm not sure how this is actually implemented; I assume there is still some degree of robustness when running methods on strings created with `unsafe fn from_utf8_unchecked`, just by nature of UTF-8's self-synchronization, which may be what the article is pointing out. It's possible that some cleverly optimized UTF-8 algorithms don't need valid data to avoid memory issues / UB that takes down the entire program, and can instead catch the error or perform a lossy transformation on the spot without incurring too much overhead.
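<p>To make the checked/unchecked split concrete, a small sketch (standard library only; none of this is from the article):<p><pre><code> fn main() {
     let bytes = vec![0x66, 0x6f, 0x6f, 0xff]; // "foo" plus a byte that is never valid UTF-8

     // The checked constructor refuses to build an invalid String.
     assert!(String::from_utf8(bytes.clone()).is_err());

     // Basic ASCII functions are available on plain byte slices and are
     // robust against arbitrary data.
     assert!(bytes[..3].eq_ignore_ascii_case(b"FOO"));
     assert!(!bytes[3].is_ascii());

     // The unsafe escape hatch skips validation; String methods are then
     // entitled to assume valid UTF-8, so handing them garbage this way
     // violates the safety contract.
     let s = unsafe { String::from_utf8_unchecked(bytes) };
     assert_eq!(s.len(), 4); // the byte length is still well-defined
 }
</code></pre>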
Good paper on UTF-8 validation performance: <a href="https://arxiv.org/pdf/2010.03090" rel="nofollow">https://arxiv.org/pdf/2010.03090</a><p><pre><code> The relatively simple algorithm (lookup) can be several times faster than conventional algorithms at a common task using nothing more than the instructions available on commodity processors. It requires fewer than an instruction per input byte in the worst case.</code></pre>
So... the main reason to use Unicode in general, and UTF-8 specifically, is that it's the common denominator of a lot of weird stuff you'd see in the wild.<p>For example, most Unix platforms allow filenames to be arbitrary sequences of bytes, while Windows lets filenames be arbitrary 16-bit code units, effectively UCS-2 (i.e. unpaired surrogates are allowed). Also, both Unix and Windows have some notion of a "local encoding" (LC_ALL etc. on Unix, codepages on Windows).<p>The common denominator, the Schelling point [1], of all of these weird systems is Unicode. Without prior coordination, you can generally assume that other participants in your system will try to use Unicode, and probably the UTF-8 encoding.<p>Checking at the boundaries of your program that your inputs are valid Unicode/UTF-8 leads to (a) good error messages when they aren't, and (b) not having to deal with jank internally.<p>[1] <a href="https://en.wikipedia.org/wiki/Focal_point_(game_theory)" rel="nofollow">https://en.wikipedia.org/wiki/Focal_point_(game_theory)</a>
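<p>A minimal sketch of that boundary check (hypothetical helper; Utf8Error::valid_up_to points at the first offending byte, which is what makes the error message useful):<p><pre><code> use std::str;

 // Reject non-UTF-8 input up front, with an error that says where it broke.
 fn check_input(raw: &[u8]) -> Result<&str, String> {
     str::from_utf8(raw)
         .map_err(|e| format!("input is not valid UTF-8: bad byte at offset {}", e.valid_up_to()))
 }

 fn main() {
     assert!(check_input(b"hello").is_ok());
     // 0xC3 starts a two-byte sequence that never gets its continuation byte.
     assert_eq!(
         check_input(b"abc\xC3").unwrap_err(),
         "input is not valid UTF-8: bad byte at offset 3"
     );
 }
</code></pre>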
"98% of web sites are encoded in UTF8"<p>And quite a few web sites claim to be encoded in UTF8 and serve latin-1. It is best to check, or at least to specify error handling on your decoder.
I agree. Validating UTF-8 wastes processing time and does not work well with non-Unicode text; and (like it says in the article) often you should not actually care what character encoding (if any) it uses anyways. Furthermore, it is often useful to measure lengths or split by bytes rather than by Unicode code points.<p>Unicode string types are just a bad idea, I think. Byte strings are better; you can still add functions to deal with Unicode or other character codes if necessary (and/or add explicit tagging for character encoding, if that is helpful).<p>Many programming languages, though, make it difficult to work with byte strings, non-Unicode strings, etc. This often causes problems, in my experience, unless you are careful.<p>Unicode string types are a problem especially when used incorrectly, since if a library uses them they can be exposed to applications that call it, even if those applications do not want them and even if the library doesn't or shouldn't really care. GOTO is not a problem; it is good, because it does not affect library APIs: even if a library uses it, your program does not have to use it, and vice-versa. Unicode string types do not have that kind of benefit, so they are a much more significant problem, and should be avoided when designing a programming language.<p>(None of the above means that there is never any reason to deal with UTF-8, although usually there isn't a good one. For example, if a file in ASCII format can contain commands which are used to produce some output in a UTF-16 format, then it makes sense to treat the arguments to those commands as WTF-8 so that they can be converted to UTF-16, since WTF-8 is the "corresponding ASCII-compatible character encoding" for UTF-16. Similarly, if the output file is JIS, then using EUC-JP would be sensible.)
Personally, I prefer Ruby's behavior of explicit encoding on strings and being very cranky when invalid codepoints show up in a UTF-8 string.<p>If you want to ignore invalid UTF-8, use String#scrub to replace heretical values with \uFFFD and life is good. :)
As a small nit, the reason Rust has pairs of types (String vs &str) is not due to mutability, but to allow references to substrings.<p>A String is an allocated object, with a location and length. A str is a sequence of characters "somewhere", which is why you can only have references to str, never an actual str object. CStr and OsStr are similar. You could use &[u8] instead of any of them, but the stronger types enforce some guarantees on what you'll find in the sequence.
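<p>A quick illustration of that distinction (standard library only):<p><pre><code> fn main() {
     // String owns its heap allocation.
     let owned: String = String::from("hello, world");

     // &str borrows a substring of it without copying: just a pointer
     // and a length into the same buffer.
     let hello: &str = &owned[0..5];
     assert_eq!(hello, "hello");

     // &str can also point at data that was never a String at all,
     // such as a literal baked into the binary.
     let literal: &str = "static data";
     assert_eq!(literal.len(), 11);
 }
</code></pre>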
> In Rust, strings are always valid UTF8, and attempting to create a string with invalid UTF8 will panic at runtime:<p>> [piece of code explicitly calling .unwrap()]<p>You misspelled "returns an error".<p>It might be worth considering Python, where the most central change from 2 to 3 was that strings became validated Unicode text, decoded (and checked) from bytes at the program's boundaries. I don't understand why that gets dismissed with "it was designed in the 1990's" when the change happened so recently.
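<p>For the record, a sketch of the non-panicking path (the panic in the article's snippet comes from the explicit .unwrap(), not from the conversion itself):<p><pre><code> fn main() {
     let bytes = vec![0xf0, 0x9f, 0x92, 0x96, 0xff]; // a valid emoji followed by a stray byte

     // String::from_utf8 returns a Result; nothing panics unless you ask it to.
     match String::from_utf8(bytes) {
         Ok(s) => println!("valid: {s}"),
         Err(e) => {
             // The original bytes can even be recovered from the error.
             let recovered = e.into_bytes();
             eprintln!("invalid UTF-8; got {} raw bytes back", recovered.len());
         }
     }
 }
</code></pre>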
> As always, immutability comes with a performance penalty: Mutating values is generally faster than creating new ones.<p>I get what they're saying, but I'm not sure I agree with it. Mutating one specific value is faster than making a copy and then altering that. Knowing that a value can't be mutated and using that to optimize the rest of the system can be faster yet. I think it's more likely the case that allowing mutability comes with a performance penalty.
You do have to validate UTF-8 strings:<p>- You can't just skip stuff if you run any kind of normalization<p>- How would you index into or split an invalid UTF-8 string?<p>- How would you apply a regex?<p>- What is its length?<p>- How do you deal with other systems that <i>do</i> validate UTF-8 strings?<p>Meta point: scanning a sequence of bytes for invalid UTF-8 sequences is validating. The decision to skip them is just slightly different code than "raise error". It's probably also a lot slower, as you have to do this for every operation, whereas once you've validated a string you can operate on it with impunity.<p>Love this for the hot take/big swing, but it's a whiff.
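<p>Concretely, the index/split/length questions above only have answers once the bytes are known to be valid UTF-8; a small Rust illustration (standard library only):<p><pre><code> fn main() {
     let s = "naïve"; // 'ï' takes two bytes in UTF-8

     // "Length" depends on interpreting a valid encoding:
     assert_eq!(s.len(), 6);           // bytes
     assert_eq!(s.chars().count(), 5); // code points

     // Indexing and splitting are boundary-aware; split_at would panic
     // if the index fell inside a multi-byte character.
     assert!(s.is_char_boundary(4));
     assert!(!s.is_char_boundary(3));
     assert_eq!(s.split_at(4), ("naï", "ve"));
 }
</code></pre>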
In my opinion, one argument for internally representing `String`s as UTF-8 is that it prevents accidentally saving a file as Latin-1 or some other encoding. I would like to be able to read a file my coworker sent me in my favorite language without having to figure out what the encoding of the file is.<p>For example, my most recent Julia project has the following line:<p><pre><code> # `decode` here comes from the StringEncodings.jl package (`using StringEncodings`)
 windows1252_to_utf8(s) = decode(Vector{UInt8}(String(coalesce(s, ""))), "Windows-1252")
</code></pre>
Figuring out that I had to use Windows-1252 (and not Latin-1) took a lot more time than I would have liked.<p>I get that there are some ergonomic challenges around this in languages like Julia that are optimized for data analysis workflows, but imho all data analysis languages/scripts should either be forced to explicitly list encodings whenever reading/writing a file or default to UTF-8.