As far as I can tell this is a long-form “I used to be able to ignore encoding issues and now it’s a ‘mess’ because the language is forcing me to be correct”. Each of the examples cited was a source of latent bugs that only looked like working code because the errors were silently ignored.<p>Only his third bit of advice isn’t wrong, and treating it as something unusual shows the problem: the only safe way to handle text has always been to decode bytes as soon as you get them, work with Unicode, and then encode when you send the bytes back out. Anything else is extremely hard to get right, even if many English-native programmers were used to being able to delay learning why for long periods of time.
Text encoding in general is a mess, and Python 2 Unicode support was a mess, but Python 3 makes it <i>much less</i> of a mess.<p>I think the author has a mess on his hands because he's trying to do it the Python 2 way – processing text without a known encoding, which is not really possible, if you want the results to come out right.<p>To resolve the mess in Python 3, choose what you actually want to do:<p>1. Handle raw bytes without interpreting them as text – just use bytes in this case, without decoding.<p>2. Handle text with a known encoding – find out the encoding out-of-band from some piece of metadata, decode as early as possible, handle the text as strings.<p>3. Handle Unix filenames or other byte sequences that are usually strings but could contain arbitrary byte values that are invalid in the chosen encoding – use the "surrogateescape" error handler; see PEP 383: <a href="https://www.python.org/dev/peps/pep-0383/" rel="nofollow">https://www.python.org/dev/peps/pep-0383/</a><p>4. Handle text with unknown encoding – not possible; try to turn this case into one of the other cases.<p>Also, watch Ned Batchelder's excellent talk, <i>Pragmatic Unicode, or, How do I stop the pain?</i>, from 2012:
<a href="https://pyvideo.org/pycon-us-2012/pragmatic-unicode-or-how-do-i-stop-the-pain.html" rel="nofollow">https://pyvideo.org/pycon-us-2012/pragmatic-unicode-or-how-d...</a>
There is a particular use case which leads to frustration with Python 3, if you don't know the latin1 trick.<p>The use case is when you have to deal with files that are encoded in some unknown ASCII-compatible encoding. That is, you know that bytes with values 0–127 are compatible with ASCII, but you know nothing whatsoever about bytes with values 128–255.<p>The use case arises when you have files produced by legacy software where you don't know what the encoding is, but you want to process embedded ASCII-compatible parts of the file as if they were text, and pass the other parts (which you don't understand) through unchanged (for example, the files are documents in some markup language, and you want to make automatic edits to the markup but leave the rest of the text unchanged). Processing as text requires you to decode it, but you can't decode as 'ascii' because there are high-bit-set bytes too.<p>The trick is to decode as latin1 on input, process the ASCII-compatible text, and encode as latin1 on output. The latin1 character set has a code point for every byte value, so bytes with the high bit set pass through unchanged. Even if the file was actually utf-8 (say), it still works to decode and encode it as latin1, and multi-byte characters will survive the round trip.<p>The latin1 trick deserves to be better known, perhaps even a mention in the porting guide.
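A minimal sketch of the trick; the file names and the markup edit are made up, and the point is only that latin1 decode/encode is a lossless byte-to-code-point round trip:

    # Decode as latin1: every byte value maps to exactly one code point,
    # so decoding can never fail and never loses information.
    with open("input.sgml", "rb") as f:
        text = f.read().decode("latin1")

    # Process the ASCII-compatible parts as ordinary text. Bytes 128-255
    # (including fragments of multi-byte UTF-8 sequences) pass through
    # unchanged because we never reinterpret them.
    text = text.replace("<titel>", "<title>")  # hypothetical markup fix

    # Encode as latin1 on output: each code point maps back to the exact
    # byte it came from.
    with open("output.sgml", "wb") as f:
        f.write(text.encode("latin1"))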
The real problem here is that<p>* UNIX file systems allow any byte sequence that doesn't contain / or \0 as file and directory names<p>* User interfaces have to render those names as strings, so they must decode<p>* There is no metadata recording what encoding the file names are in<p>Many programs use the encoding from the current locale, which is mostly a good assumption, but locales scope (basically) per process, which has nothing to do with how file names are scoped.<p>So, many programs make some assumptions. Some models are:<p>1) Assume file names are encoded in the current locale<p>2) Assume file names are encoded in UTF-8<p>3) Don't assume anything<p>The "correct" model would be 3), but it's not very useful. People want to be able to sort and display file names, which generally isn't possible in any meaningful way with binary data.<p>Which is why most programs, including Python, use 1) or 2), and sometimes offer some kind of kludge for when the assumption doesn't hold -- and sometimes not.<p>IMHO a file system should store an encoding for the file names contained in it, and validate on writes that the names are correct. But of course that would be a huge POSIX incompatibility, and thus won't happen.<p>People just live with the current models, because they tend to be good enough. Mostly.
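Python 3 exposes exactly this tension: hand the filesystem APIs a str and you get model 1/2 (decode with the filesystem encoding, escaping what doesn't fit); hand them bytes and you get model 3. A quick sketch, assuming a Linux filesystem that accepts arbitrary bytes and a UTF-8 locale:

    import os

    os.makedirs(b"demo", exist_ok=True)
    with open(os.path.join(b"demo", b"caf\xe9"), "wb"):  # name not valid UTF-8
        pass

    print(os.listdir("demo"))    # ['caf\udce9'] -- decoded, bad byte escaped
    print(os.listdir(b"demo"))   # [b'caf\xe9']  -- raw bytes, nothing assumed

    # os.fsencode/os.fsdecode convert between the two views using the
    # filesystem encoding plus surrogateescape.
    print(os.fsdecode(b"caf\xe9") == "caf\udce9")   # True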
I'm not so sure other languages do this any better (Node.js doesn't even support non-Unicode filenames at all, for instance). Modern Python does a pretty good job of supporting Unicode; calling it a "mess" is just not true at all. People always like to hate on Python, but other languages supposedly designed by actually capable people mess up other things all the time. Look at how the great Haskell represents strings, for instance, and what a clusterfuck[1] that is.<p>[1] <a href="https://mmhaskell.com/blog/2017/5/15/untangling-haskells-strings" rel="nofollow">https://mmhaskell.com/blog/2017/5/15/untangling-haskells-str...</a>
My main criticism of Python 3's changes to strings is that it has become much more demanding about them.<p>In Python 2, if you have a series of bytes -or- a "string", the language has no opinion about the encoding. It just passes the bytes around. If that set of bytes enters and exits Python without being changed, its format is of no concern. Interactions do not force you to define an encoding. This is <i>not correct</i>, but it is <i>often functional.</i><p>Python 3, on the other hand, forces you to have an opinion about the encoding whenever you treat bytes as a string, and likewise when you convert back to bytes. For uncommon or unexpected encodings, the chance of this going wrong in a casual, accidental way is much higher. Of course, the approach is more correct, but it doesn't <i>feel</i> more correct to the programmer.
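The difference in one illustrative snippet:

    data = b"r\xc3\xa9sum\xc3\xa9"        # bytes that happen to be valid UTF-8

    # Python 2 would silently concatenate str and bytes, right or wrong.
    # Python 3 refuses until you state an encoding:
    try:
        data + " received"
    except TypeError as err:
        print(err)                        # can't concat str to bytes

    text = data.decode("utf-8")           # the opinion Python 3 forces on you
    print(text + " received")             # résumé received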
For anyone interested in learning why Python 3 works this way, I highly recommend the blog of Victor Stinner[0].<p>As for the article, this is nothing new. The problems are similar to the issues raised by Armin Ronacher[1]. These problems are well known, and Python developers address them one at a time; the edge cases have improved steadily since the initial release of Python 3.0.<p>[0] <a href="http://vstinner.github.io" rel="nofollow">http://vstinner.github.io</a><p>[1] <a href="http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/" rel="nofollow">http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/</a>
This article is kind of hard to evaluate, because the OP doesn't provide an example program with an example input that fails. So it's hard to judge whether the solution presented here is actually ideal. Instead, we're forced to just take the OP's word for it, which is kind of uncomfortable.<p>I do somewhat agree with the general sentiment, although I find it difficult to distinguish between the problems specifically related to its handling of Unicode and the fact that the language is unityped, which makes a lot of really subtle things very implicit.
IMHO the whole Python 3 string mess could have been prevented if they had chosen UTF-8 as the only string encoding instead of adding a strict string type with a lot of under-the-hood magic. That way strings and byte streams could remain the same underlying data, just as in Python 2. The main problem I have with byte streams vs strings in Python 3 is that it adds strict type checking at runtime which isn't visible at 'authoring time'. Some APIs even make upfront type checking impossible even if type hints were provided (e.g. reading file content returns either a byte stream or a string, depending on the mode string passed to the file open function).<p>Recommended reading: <a href="http://utf8everywhere.org/" rel="nofollow">http://utf8everywhere.org/</a>
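A minimal illustration of the open() behavior described above (the file name is made up):

    with open("example.txt", "w", encoding="utf-8") as f:
        f.write("hëllo")

    text = open("example.txt", "r", encoding="utf-8").read()
    data = open("example.txt", "rb").read()

    print(type(text), repr(text))   # <class 'str'>   'hëllo'
    print(type(data), repr(data))   # <class 'bytes'> b'h\xc3\xabllo'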
> And the environment? [it’s not even clear.] <a href="https://stackoverflow.com/questions/44479826/how-do-you-set-a-string-of-bytes-from-an-environment-variable-in-python" rel="nofollow">https://stackoverflow.com/questions/44479826/how-do-you-set-...</a><p>That question is about interpreting backslash escape sequences for bytes in an environment variable. All this person wants is `os.environb` (and look, its existence highlighted a Windows incompatibility, saving them from subtle bugs like every other Python 3 improvement). <a href="https://docs.python.org/3/library/os.html#os.environb" rel="nofollow">https://docs.python.org/3/library/os.html#os.environb</a>
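For reference, a short sketch of os.environb (POSIX only; it doesn't exist on Windows, which is the incompatibility mentioned above; the variable name is made up):

    import os

    # os.environ decodes values with the filesystem encoding (plus
    # surrogateescape); os.environb exposes the same mapping as raw bytes.
    os.environb[b"MY_VAR"] = b"\xff\xfe"   # bytes that are not valid UTF-8
    print(os.environb[b"MY_VAR"])          # b'\xff\xfe'    -- untouched
    print(os.environ["MY_VAR"])            # '\udcff\udcfe' -- escaped view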
Getting Unicode right, especially with various file systems and cross-platform implementations is hard, for sure. But, I think this quote:<p>"And, whatever you do, don’t accidentally write if filetype == "file" — that will silently always evaluate to False, because "file" tests different than b"file". Not that I, uhm, wrote that and didn’t notice it at first…"<p>shows a behavior that, to me, is inexcusable. The encoding of a string should never cause a comparison to fail when the two strings are equivalent <i>except for the encoding</i>. For example, in Delphi/FreePascal, if you compare an AnsiString or UTF-8-encoded string with a Unicode string that is equivalent, you get the correct answer: they are equal.
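For the record, here is the Python 3 failure mode and the like-for-like comparison that avoids it (values are illustrative):

    filetype = b"file"                # e.g. read from a binary stream

    print(filetype == "file")                    # False: bytes never equal str
    print(filetype.decode("ascii") == "file")    # True: compare like with like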
Let's be honest, the real mess is with UNIX filenames. I dare you to come up with a legitimate use case for allowing newlines and other control characters in a file name.
Going off on a tangent a bit here, but I think there are 2 important related issues:<p>* API design should fit the language. In a "high on correctness" language like Haskell or Rust, I'd expect APIs to force the programmer to deal with errors, and make them hard to ignore. In a dynamically typed language like Python, where many APIs are very relaxed / robust about dealing with multiple data types (being able to handle numbers/strings/objects generically is part of the point of the language), being super strict about string encoding feels extra painful compared to a statically typed language. I'd expect an API in this language to err on the side of "automatically doing a useful/predictable thing" when it encounters data that is only slightly incorrect, as opposed to raising errors, which makes for very brittle code. Most Python code is the opposite of brittle, in the sense that you can take more liberties with data types before it breaks than in statically typed languages. Note that I am not advocating incorrect APIs, or APIs that silently ignore errors, just that the design should fit the language philosophy as well as possible.<p>* Where in a program/service should bytes be converted to text? Clearly they always come in as bytes (network, files..), and when the user sees them rendered (as fonts), those bytes have been interpreted using a particular encoding. The question is where in the program this should happen. You can do it as early as possible, or as late as possible. Doing it as early as possible increases the code surface where you have to deal with conversions, and thus with possible errors and code complexity, so that doesn't seem so great to me personally, but I understand there are downsides to most of your program dealing with a "bag of bytes" approach too.
Indeed py3 decided to make unicode strings the default. This fixes all sorts of thorny issues across many use cases. But it does indeed break filenames. I haven't dealt with this issue myself, but the way python was supposed (?) to have "solved" this is with surrogate escapes. There's a neat piece on the tradeoffs of the approach here: <a href="https://thoughtstreams.io/ncoghlan_dev/missing-pieces-in-python-3-unicode/" rel="nofollow">https://thoughtstreams.io/ncoghlan_dev/missing-pieces-in-pyt...</a><p>Maybe handling the surrogates better would allow you to use 'str' everywhere instead of bytes?
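In short: undecodable bytes become lone surrogate code points in the str, and encode back to the original bytes, so str can round-trip arbitrary Unix filenames. A sketch, assuming a UTF-8 locale and a made-up filename:

    import os

    raw = b"backup-caf\xe9.tar"        # a filename that is not valid UTF-8
    name = os.fsdecode(raw)            # 'backup-caf\udce9.tar'
    assert os.fsencode(name) == raw    # lossless round trip

    # The catch: the str now contains a lone surrogate, so strict UTF-8
    # output (printing, JSON, ...) fails without an error handler.
    try:
        name.encode("utf-8")
    except UnicodeEncodeError as err:
        print(err)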
> For a Python program to properly support all valid Unix filenames, it must use “bytes” instead of strings, which has all sorts of annoying implications.<p>While in python 2, you had to use unicode strings for all sorts of actual text, which caused its own problems.<p>> What’s the chances that all Python programs do this correctly? Yeah. Not high, I bet.<p>Exactly.
Don't think of Python Unicode as a "string". Think of it as "text". I don't really understand the issues the author is having with things like sys.stdout and such, because he did not provide complete examples. He should cite actual examples and bug reports that he has posted for these things; I've had no such issues. There's a lot we need to do to accommodate non-ASCII text, but it is all "right" as far as I've observed.
Part of the issue is that bytes and strings are considered totally different by Python but look confusingly similar to people.<p>The error from "file" != b"file" is particularly bad. It makes sense if you realise that a == b means a and b have the same type and their values are equal. But there is no way even a reasonably careful programmer could spot this without super careful testing (and who’s to say they would remember to test b"file" and not "file"). Other ways this could be solved are:<p>1. String == bytes is true iff converting the string to bytes gives equality (but then == can become non-transitive)<p>2. String == bytes raises (and so does string == string if encodings are different)<p>3. Type-specific equality operators, as in Lisp. But these are ugly and verbose, which would discourage their use, and so one would not think to use bytesEqual instead of ==<p>4. A stricter/looser notion of equality that behaves as one of the above, called e.g. ===, but this is also not great
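One existing mitigation worth knowing: CPython's -b flag warns whenever bytes are compared to str, and -bb turns the warning into an error (the script name here is made up):

    # compare.py
    print(b"file" == "file")

    # $ python3 -b compare.py
    # compare.py:2: BytesWarning: Comparison between bytes and string
    # False
    # $ python3 -bb compare.py   # raises BytesWarning instead of warning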
I love Unicode handling in Python 3; it's so much better to work with. Python 2 was a mess. Migrating old code requires looking at old code, and the result is only better code, never a mess.
The article is about a specific instance (filenames). In general, handling Unicode as a bunch of indexable code points, as Py3 does, turned out to be not that great. I guess the idea came from the era when people still thought that strings could be, in some sense, fixed-length. These days we understand better that strings are inherently variable-length. So there is no longer any reason not to just leave everything encoded as UTF-8 and convert to other forms as and if required. Strings are just a bunch of bytes again.
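A quick example of why the code-point view is a leaky middle ground: it is neither what the user perceives as a character nor the size on the wire:

    s = "cafe\u0301"               # 'café' spelled with a combining accent

    print(len(s))                  # 5 code points, though a user sees 4 chars
    print(len(s.encode("utf-8")))  # 6 bytes in UTF-8
    print(s[4] == "\u0301")        # True: s[4] is half of the perceived 'é'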
There are lots of comments indicating that the programmer is doing things wrong. But what is the right way to deal with encoding issues? Wait for code to break in production?<p>Whatever "best practices" there are for dealing with unexpected text encodings in Python, they do not seem to be widely known. I bet a large percentage of Python programmers (myself included) have made the exact same errors the author did, with little insight as to how to avoid them in the future.
His examples are all stuff that isn't Unicode. The filename thing would probably work using a latin1 encoding, since that leaves 8-bit bytes undisturbed.
That's not Python's fault - those are programmer errors.<p>Having said that, Python really has something to answer for with "encode" versus "decode" - WTF? Which is which? Which direction am I converting? I still have to look that up every single time I need to convert.<p>Why the heck are there not "thistothat" and "thattothis" functions in Python that are explicit about what they do?
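The rule, for whatever it's worth: str.encode goes to bytes and bytes.decode goes to str; Unicode text is the "decoded" form. A crib sheet:

    text = "naïve"                 # str: decoded text (code points)

    data = text.encode("utf-8")    # str -> bytes: en-CODE into an encoding
    back = data.decode("utf-8")    # bytes -> str: de-CODE back out of it
    assert back == text

    # Python 3's types also rule out the wrong direction: bytes has no
    # .encode method, and str has no .decode method.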
If you're storing files with non-Unicode-compatible names, you should really stop. Even if on Unix you <i>can</i> technically use any kind of binary mess as a name, that doesn't mean you <i>should</i>. And this applies to all kinds of data. All current operating systems support (and default to) Unicode, so handling anything else is a job for a compatibility layer, not your application.<p>If you write new code to be compatible with that one Windows ME machine set to that one weird IBM encoding sitting in the back of the server room, you're just expanding your technical debt. Instead, write good, modern code, then write a bridge to translate to and from whatever garbage that one COBOL program spits out. That way, when you finally replace it, you can just throw away the compatibility layer and be left with a nice, modern program.<p>In EE terms, think of it like an opto-isolator. You <i>could</i> use a voltage divider and a Zener diode, but that's just asking for trouble.
I can't believe there are still people whining about this in 2018.<p>Those problems with gpodder, pexpect, etc. aren't due to Python 3; they're due to the software being broken. Without knowing the encoding, UNIX paths can't be converted to strings. It's unfortunate, but that's the way it is, and it's not Python's fault.
The author doesn't seem to care that there is a difference between Unicode the standard and UTF-8 the encoding. While the changes at the fringes of the system are debatable, they are also in a way sensible. Internal to your application everything should be encoding-independent (unicode objects in Py2, strings in Py3), while anything talking to the world outside your program (be it the network, local file content, or filesystem names) has to be encoded somehow. The distinction between encoding-independent storage and raw byte streams forces you to do just that!<p>Stop worrying and go with the flow. Just do it as it is supposed to be done and you'll be happy.
The encoding is a property of the string, just like the content, just as with any other object. If you want to compare strings with different encodings, you'll have to convert at least one of them.<p>I was never forced into encoding hell again, after reading this excellent post: <a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/" rel="nofollow">https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...</a>
I invest some karma to point out how I'd love for str to just use UTF-8 by default, and print as UTF-8 by default:<p>print(b'DONT b"EVERYTHING!"')<p>print(str(b'SAME!'))<p>print(str(b'I DONT WANT TO add ,"UTF-8" everywhere!','UTF-8'))<p>line="שלום עולם"<p>output.write(line) # TypeError: a bytes-like object is required, not 'str'<p>fp.write(output.getvalue()) # TypeError: write() argument must be str, not bytes<p>Please at least allow us to set a global option via
sys.setdefaultencoding('UTF8')
as before to automatically encode/decode as UTF-8 by default!
Dealing with string encoding has always been the bane of my existence in Python...going back over 10 years when I first started using it. I've never had such wild issues with decoding/encoding in other languages...that may be my privilege, though, since I was dealing with internal systems before Python, and then I got into web scraping.<p>Regardless, string encoding/decoding in Python is <i>hard</i>, and it doesn't feel like it needs to be.
I agree Python 3 is an awful mistake and that straight-up Unicode is not well suited to storing arbitrary byte strings from old disk images. However, Python 3.1+ encodes such file names as utf-8b, using the "surrogateescape" error handler: <a href="https://www.python.org/dev/peps/pep-0383/" rel="nofollow">https://www.python.org/dev/peps/pep-0383/</a> .
This post barely scratches the tip of the iceberg.<p>For a more comprehensive discussion of unicode issues and how to solve them in Python, "Let’s talk about usernames" does this issue more justice than I could write in a comment: <a href="https://news.ycombinator.com/item?id=16356397" rel="nofollow">https://news.ycombinator.com/item?id=16356397</a>
TFA is short and to the point. A few examples, a few links to other examples. Py3's insistence on shoving Unicode into every API it could possibly fit is often inconvenient for coders and for users. This thread has 100 comments, mostly disagreeing in the same fingers-in-ears-I-can't-hear-you fashion. Whom are we struggling to convince, here?
If Python programmers think they are the only ones with UTF problems, try the Lazarus and FreePascal development mailing lists. The debates have been going on forever, and I am sure issues will keep popping up every now and then.<p>Try Elixir. According to their docs they've had it right from the word go - I think.
Is the author saying that the Python programming language handles this badly, and all other (relevant) programming languages do not?<p>Or is it that Python's attention to detail means that issues that would be glossed over or hidden in other languages are brought to the fore and require addressing?
Filenames are a good example to show people why forcing an encoding onto all strings simply doesn't work.
The usual reaction is to ignore that, and people will shout: "fix your filenames!"<p>Here is another example:
Substrings of Unicode strings. Just split a Unicode string into chunks of 1024 bytes. Forcing an encoding here and allowing automatic conversions would be a mess. People will shout: "You're splitting your strings wrong!"<p>The first language I knew that fell for encoding-aware strings was Delphi - people there called them "Frankenstrings", and meanwhile that language is pretty dead.<p>As a professional who has to handle a lot of different scenarios (barcodes, EDIFACT, filenames, string buffers, ...) - in the end you'll have to write all the code using byte strings. Then you'll have to write a lot of GUI libraries to be able to work with byte strings... and in the end you'll be at the point where the old Python was. (In fact you'll never reach that point, because just going elsewhere will be a lot easier.)
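The chunking example above is real: slicing encoded text at fixed byte offsets can cut a multi-byte sequence in half. A sketch with the chunk size shrunk from 1024 to 3 bytes to make it visible:

    data = "grüße".encode("utf-8")   # b'gr\xc3\xbc\xc3\x9fe', 7 bytes

    chunk = data[:3]                 # a fixed-size split lands mid-character
    print(chunk)                     # b'gr\xc3' -- half of the 2-byte 'ü'
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError as err:
        print(err)                   # can't decode byte 0xc3: unexpected end of data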
It's just a poorly explained rant about some shitty libraries and a lot of legacy code. If you want to read about REAL complaints, read Armin Ronacher's thoughts about it instead: <a href="http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/" rel="nofollow">http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/</a>