Unicode in Python 3

207 points by buttscicles about 11 years ago

20 comments

wbond about 11 years ago

Having written a bunch of Python 2 and porting it to 3 where I deal with unknown encodings (FTP servers), I can't help but disagree with Armin on most of his Python 3 posts.

The crux of his argument with this article is "unix is bytes, you are making me deal with pain to treat it like Unicode." Python 2 just allowed you to take crap in and spit crap out. Python 3 requires you to do something more complicated when crap comes in. In my situation, I am regularly putting data into a database (PostgreSQL with UTF-8 encoding) or working with Sublime Text (on all three platforms). You try to pass crap along to those and they explode. You HAVE to deal with crappy input.

In my experience, Python 2 explodes at run time when you get weird crappily-encoded data. And only your end users see it, and it is a huge pain to reproduce and handle. Python 3 forces you to write code that can handle the decoding from the get-go. By porting my Python 2 to 3, I uncovered a bunch of places where I was just passing the buck on encoding issues. Python 3 forced me to address the issues.

I'm sure there are bugs and annoyances along the way with Python 3. Oh well. Dealing with text input in any language is a pain. Having worked with Python, C, Ruby and PHP and dealing with properly handling "input" for things like FTP, IMAP, SMTP, HTTP, etc, yeah, it sucks. Transliterating, converting between encodings, wide chars, Windows APIs. Fun stuff. It isn't really Python 3 that is the problem, it is undefined input.

Unfortunately, it seems Armin happens to play in areas where people play fast and loose (or are completely oblivious to encodings). There is probably more pain generally there than dealing with transporting data from native UI widgets to databases. Sorry dude.

Anyway, I never write Python 2 anymore because I hate having this randomly explode for end-users and having to try and trace down the path of text through thousands of lines of code. Python 3 makes it easy for me because I can't just pass bytes along as if they were Unicode, I have to deal with crappy input and ask the user what to do.

Python 2 is a dead end with all sorts of issues. The SSL support in Python 2 is a joke compared to 3. You can't re-use SSL contexts without installing the cryptography package, which requires cffi, pycparser and a bunch of other crap. Python 2 SSL verification didn't exist unless you roll your own, or use Requests. Except Requests didn't even support HTTPS proxies until less than a year ago.

Good riddance Python 2.
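
To make "handle the decoding up front" concrete, here is a minimal sketch of one way to deal with bytes of unknown encoding in Python 3. The helper name `decode_with_fallback` and the candidate encoding list are assumptions made up for illustration, not anything the commenter describes using; a real program might instead ask the user, as the comment suggests.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order (hypothetical helper).

    Returns the decoded text together with the encoding that worked,
    so the caller can report or log the guess instead of hiding it.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never fails;
    # the result may still be wrong, but no information is lost.
    return data.decode("latin-1"), "latin-1"


print(decode_with_fallback("café".encode("utf-8")))
print(decode_with_fallback(b"caf\xe9"))
```

Returning the encoding alongside the text keeps the guess visible, which is the opposite of silently passing bytes through.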

twic about 11 years ago

There was a related discussion on the Mercurial mailing list a while back. Not about Python 2 vs 3, but about filename encoding.

Mercurial follows a policy of treating filenames as byte strings. Matt Mackall is very clear about this. Because unix treats filenames as byte strings, this makes Mercurial interoperate with other programs on a unix machine pretty well: you can manage files of any encoding, you can embed filenames in file contents (eg in build scripts) and be confident they will always be byte-for-byte identical with the names managed by Mercurial, etc.

However, it also means Mercurial falls flat on its face when it's asked to share files between machines using different encodings. Names which work fine on one machine will, to human eyes, be garbled nonsense on the other.

This is a problem which does actually happen; there is a slow trickle of bug reports about it. And because of the commitment to unix-style filenames, it will probably never be fixed. List members did try and come up with some ideas to fix it which preserved the unix semantics in normal cases, but they weren't popular.

And before anyone gets lippy, i assume Git has the same problem.

Ultimately, i would say this comes down to a conflict between two fundamentally different kinds of users of strings: machines and people. Machines are best served by strings of bytes. People are best served by strings of characters. Usually. And sadly, unix's lack of a known filesystem encoding is too well-established for there to be much chance of building a bridge.

overgard about 11 years ago

I had to deal with this a lot at a job I used to have (not python specifically, but just with unicode issues), and there's really just not a right answer to how to do any of this. Any solution you pick is going to suck for someone.

One thing he's leaving out of the Python 2 being better aspect: Ok, for cat you can treat everything as one long byte array. But what if, say, I need to count how many characters are in that string? Or what if I need to write a "reverse cat", which reverses the string? Python 2's model is entirely broken there.

Armin suggests that printing broken characters is better than the application exploding and I agree.. sometimes. On the other hand, try explaining to a customer why the junk text they copy pasted from Microsoft Word into an html form has question marks in it when it shows on your site.

The problem with the whole "treat everything as bytes" thing is that you'll never have a system that quite works. You'll just have a system that mostly works, and mostly for languages closer to English. Going the rigorous route is the hard way, but it will end up with systems that actually work right.
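
The counting and "reverse cat" cases can be sketched in a few lines of Python 3. The sample string is an assumption chosen for illustration. Note also that reversing code points is still not a full answer for text with combining characters, so this only shows why the byte view breaks, not a complete solution.

```python
# Illustrative sample text (not taken from the thread).
text = "hellø 日本語"
data = text.encode("utf-8")

# Counting: bytes and characters disagree once the text is not ASCII.
byte_count = len(data)
char_count = len(text)

# Reversing code points keeps every character intact.
reversed_text = text[::-1]

# Reversing raw bytes scrambles each multi-byte sequence,
# so the result is no longer valid UTF-8.
try:
    data[::-1].decode("utf-8")
    bytes_reversal_ok = True
except UnicodeDecodeError:
    bytes_reversal_ok = False

print(byte_count, char_count, reversed_text, bytes_reversal_ok)
```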

rdtsc about 11 years ago

> There is a perfectly other language available called Python 2, it has the larger user base and that user base is barely at all migrating over. At the moment it's just very frustrating.

I come from a different perspective. I looked at the benefits of Python 3, looked at my existing code base and asked how it would be better if it was written in Python 3. Apart from bragging rights and having a few built-in modules (that I now get externally), it wouldn't actually be better.

To put it plainly, Python 3, for me, doesn't offer anything at the moment. There is no carrot at the end. I have not seen any problems with Unicode yet. Not saying they might not be lurking there, I just haven't seen them. And, most important, Python 2 doesn't have any stick beating me on the head to justify migrating away from it. It is just a really nice language, fast, easy to work with, plenty of libraries.

From _my_ perspective Python 3 came at the wrong time and offered the wrong thing. I think it should have happened a lot earlier, and I think to justify incompatibilities it should have offered a lot more, for example:

* Increased speed (a JIT of some sort)
* Some new built-in concurrency primitives or technologies (something greenlet or message passing based).
* Maybe a built-in web framework (flask) or something like requests or ipython.

It is even hard to come up with a list, just because Python 2 with its library ecosystem is already pretty good.

ak217 about 11 years ago

Is sys.getfilesystemencoding() not a good way to get at filename encoding?

I think on the face of it I do like the Go approach of "everything is a byte string in utf-8" a lot, but I haven't really worked with it so there's probably some horrible pain there somewhere, too. In the meantime Python 3 is a hell of a lot better than Python 2 to me because it doesn't force unicode coercion with the insane ascii default down my throat (by the time most new Python 2 coders realize what's going on, their app already requires serious i18n rework). Also, I don't really know why making sure stuff works when locale is set to C is important - I would simply treat such a situation as broken.

In writing python 2/3 cross-compatible code, I've done the following things when on Python 2 to stay sane:

- Decode sys.argv asap, using sys.stdin.encoding
- Wrap sys.stdin/out/err in text codecs from the io module (https://github.com/kislyuk/eight/blob/master/eight/__init__.py#L78-L98). This approximates Python 3 stdio streams, but has slightly different buffering semantics compared to Python 2 and messes around with raw_input, but it works well. Also, my wrappers allow passing bytes on Python 2, since a lot of things will try to do so.
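
On the filename question, a short Python 3 sketch may help. It assumes a hypothetical file name whose bytes are Latin-1 and therefore not valid UTF-8, and it uses the `surrogateescape` error handler explicitly so the result does not depend on the machine's locale. On POSIX systems `os.fsdecode` and `os.fsencode` apply the same handler with the filesystem encoding.

```python
import sys

# Hypothetical file name stored as Latin-1 bytes; 0xE9 is invalid UTF-8 here.
raw_name = b"caf\xe9.txt"

# surrogateescape carries each undecodable byte through as a lone surrogate
# code point instead of raising, so the original bytes can be recovered.
name = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = name.encode("utf-8", errors="surrogateescape")

print(sys.getfilesystemencoding())  # what Python assumes for file names
print(repr(name))
print(round_tripped == raw_name)
```

The decoded name is safe to pass back to file APIs, but printing or strictly encoding it will fail, which is where much of the friction discussed in this thread comes from.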

inklesspen about 11 years ago

If you want to work with bytes on stdin and stdout, Python 3 documents how to do that, at the same place it documents the stdin and stdout streams.

https://docs.python.org/3/library/sys.html#sys.stdin

All you have to do is use sys.stdin.buffer and sys.stdout.buffer; the caveat is that if sys.stdin has been replaced with a StringIO instance, this won't work. But in Armin's simple cat example, we can trivially make sure that won't happen.

I'd be a lot more willing to listen to this argument if it didn't overlook basic stuff like this.
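
As a rough illustration of that approach, here is a minimal bytes-only cat. The function name `cat_bytes` and the chunk size are assumptions for the sketch; it is not Armin's script, only one way to write to `sys.stdout.buffer` without ever decoding.

```python
import sys


def cat_bytes(paths, out=None):
    """Copy each file to `out` as raw bytes, never decoding (illustrative)."""
    if out is None:
        # The binary layer underneath the text-mode sys.stdout.
        out = sys.stdout.buffer
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(64 * 1024)
                if not chunk:
                    break
                out.write(chunk)
    out.flush()


# Typical use from a script: cat_bytes(sys.argv[1:])
```

Because nothing is decoded, invalid UTF-8 in the input passes through unchanged, matching what the Unix cat does.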

mangecoeur about 11 years ago

I get that Armin runs into pain points with Py3, but on the other hand I get annoyed with the heavily English-centric criticisms - it's easy to think py2 was better when you're only ever dealing with ASCII text anyway.

Fact is, most of the world doesn't speak English and needs accents, symbols, or completely different alphabets or characters to represent their language. If POSIX has a problem with that then yes, it is wrong.

Even simple things like French or German accents can make the Py2 csv module explode, while Py3 works like a dream. And anyone who thinks they can just replace accented characters with ASCII equivalents needs to take some language lessons - the result is as borked and nonsensical as if, in some parallel universe, I had to replace every "e" with an "a" in order to load simple English text.

lmm about 11 years ago

If you're happy with Go's "everything is a unicode string" approach then you should be happy to just treat everything as unicode. Don't handle the decode errors - if someone sends some data to your stdin that's not in the correct encoding, too bad.

Yes, python3 makes it hard to write programs that operate on strings as bytes. This is a good thing, because the second you start to do anything more complicated than read in a stream of bytes and dump it straight back out to the shell (the trivial example used here), your code will break. Unix really is wrong here, and the example requirement would seem absurd to anyone not already indoctrinated into the unix approach: you want a program that will join binary files onto each other, but also join strings onto each other, and if one of those strings is in one encoding and one is in another then you want to print a corrupt string, and if one of them is in an encoding that's different from your terminal's then you want to display garbage? Huh? Is that really the program you want to write?

cool-RR about 11 years ago

Worth it if only for `copyfileobj`. As a seasoned Python expert, I was not familiar with that function. From the docs:

shutil.copyfileobj(fsrc, fdst[, length])

Copy the contents of the file-like object fsrc to the file-like object fdst. The integer length, if given, is the buffer size. In particular, a negative length value means to copy the data without looping over the source data in chunks; by default the data is read in chunks to avoid uncontrolled memory consumption. Note that if the current file position of the fsrc object is not 0, only the contents from the current file position to the end of the file will be copied.
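
A small sketch of the behaviour described in that docstring, using in-memory streams so it runs anywhere. The sample data and buffer size are assumptions for illustration. The last part shows the mismatch that matters elsewhere in this thread: copying from a bytes source into a text destination fails.

```python
import io
import shutil

# Source stream with a 7-byte header followed by binary payload.
src = io.BytesIO(b"header\n" + b"\x00\xff" * 4)
src.seek(7)  # copying starts at the current position, so the header is skipped

dst = io.BytesIO()
shutil.copyfileobj(src, dst, 4)  # third argument: buffer size in bytes
copied = dst.getvalue()

# Mixing a bytes source with a text destination raises TypeError,
# because a text stream only accepts str in write().
try:
    shutil.copyfileobj(io.BytesIO(b"abc"), io.StringIO())
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(copied, mixed_ok)
```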

andreasvc about 11 years ago

I think the main problem here is an impedance mismatch caused by forcing things to be Unicode. While the Python developers are technically correct (the best kind they say..) in claiming that LANG=C means ASCII, that's not how everything else in UNIX has worked until now; most applications don't crash because of encoding errors. And filenames are byte strings, so forcing Unicode on them is a bad idea.

It would be great if everyone fixed their locale settings and all their filename encodings but in the meantime this will cause even more friction for Python 3 adoption.

andrewstuart about 11 years ago

It's a great concern that some of Python's most respected developers such as mitsuhiko and Zed Shaw are not on board with the current future direction of Python. It would be a better world for all if somehow Python 4 could be something that everyone is happy with - I want the mitsuhikos and Zed Shaws of the world to be writing code that I can run as a Python 3 user, written in a language that these top level developers feel enthused about.

Is there no way forward that everyone agrees on? Has anyone ever proposed a solution?

shadowmint about 11 years ago

> That I work with "boundary code" so obviously that's harder on Python 3 now (duh)

mhm. I tell people now and then that python 3 (and the python 3 developers) are hostile to people embedding it and using it for low level tasks specifically because of this unicode stuff, and they tend to tell me I should just suck it up.

I suppose I'm morbidly glad I'm not the only one feeling the pain, but really, it honestly feels like the python 3 line is just not making any effort towards making this stuff easier and simpler. :/

andrewstuart about 11 years ago

I hear and understand and agree with the issues raised, the question is what is the right way to fix this stuff? How can we get there?

How can we get the Python 2 stalwarts and the Python 3 folks to all sit in the same figurative room and create a future that everyone is happy with?

It would be nice to see the ongoing grumbling about Python 3 replaced with a tangible peace process.

Are the warring parties talking about solutions?

e12e about 11 years ago

I don't know... I get an error from the first script with python3:

<pre><code>
$ ls
test  test3.py  test.py  tøst  日本語
$ python2.7 test.py *
hello hellø こにちは tøst 日本語
import sys
# (…)
hello hellø こにちは tøst 日本語
hello hellø こにちは tøst 日本語
$ python3 test.py *
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    shutil.copyfileobj(f, sys.stdout)
  File "/usr/lib/python3.2/shutil.py", line 68, in copyfileobj
    fdst.write(buf)
TypeError: must be str, not bytes
#But I can make it work with:
$ diff test.py test3.py
8c8
< f = open(filename, 'rb')
---
> f = open(filename, 'r')
$ python3 test3.py *
# same as above
</code></pre>

Now, these two scripts are no longer the same, the python3 script outputs text, the python2 script outputs bytes:

<pre><code>
$ python3 test3.py /bin/ls
Traceback (most recent call last):
  File "test3.py", line 13, in <module>
    shutil.copyfileobj(f, sys.stdout)
  File "/usr/lib/python3.2/shutil.py", line 65, in copyfileobj
    buf = fsrc.read(length)
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 24: invalid start byte
</code></pre>

The other script works like cat -- and dumps all that binary crap to the terminal.

So, yeah, I guess things are different -- not entirely sure that the python3 way is broken, though? It's probably correct to say that it doesn't work well with the "old" unix way in which text was ascii and binary was just bytes -- but consider:

<pre><code>
$ cat /bin/ls |wc
403 2565 114032
e12e@stripe:~/tmp/python/unicodetest $ du -b /bin/ls
114032 /bin/ls
</code></pre>

Does that "wordcount" and "linecount" from wc make any sense? For that matter, consider:

<pre><code>
$ cat test
hello hellø こにちは tøst 日本語
e12e@stripe:~/tmp/python/unicodetest $ wc test
1 5 42 test
</code></pre>

(Here the word count does make sense, but just because it's an artificial example, it wouldn't make sense for actual Japanese).

The character count is pretty certainly wrong unless you cared about what "du -b" thinks of the number of bytes...
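
The `wc` numbers for the `test` file can be reproduced in Python 3, which makes the byte versus character gap explicit. This sketch assumes the file holds exactly the one line shown by `cat test`, UTF-8 encoded with a trailing newline.

```python
# The single line shown by `cat test`, with its trailing newline.
line = "hello hellø こにちは tøst 日本語\n"
data = line.encode("utf-8")

line_count = data.count(b"\n")   # what wc reports as lines
word_count = len(line.split())   # whitespace-separated words
byte_count = len(data)           # what plain `wc` and `du -b` report
char_count = len(line)           # code points, newline included

print(line_count, word_count, byte_count, char_count)
```

The first three values match the `1 5 42` that `wc test` printed; the fourth is the number a person would probably call the character count.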

keyme about 11 years ago

Strings should be byte strings. Not ASCII, not Unicode. Bytes.

Strings don't represent Text lest I decide they do. For this a UnicodeString object should exist, and it should _not_ be the default.

In my latest project I've made myself use Python 3.4 over 2.7, for its new great features. So many steps forward, except this one thing.

What a stupid decision are these default Unicode strings...

pekk about 11 years ago

From the one person who has complained most about this topic, making him an expert on complaining about Python 3 but not necessarily as much of an expert on how to cope.

skizm about 11 years ago

Bit off topic, but can anyone recommend a good tutorial/book/whatever for python 2 programmers looking to move to (or at least become familiar with) python 3?

im3w1l about 11 years ago

> For instance it will attempt decoding from utf-8 with replacing decoding errors with question marks.

Please don't do this. Replacing with question mark is a lossy transformation. If you use a lossless transformation, a knowledgeable user of your program will be able to reverse the garbling, in their head, or using a tool. Consider Ã¥Ã¤Ã¶, the result of interpreting utf8 åäö as latin1. You could find both the reason and solution by googling on it.
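
The lossless versus lossy distinction is easy to check in Python 3. This sketch uses the same åäö example; the second half assumes the opposite mistake (Latin-1 bytes decoded as UTF-8 with `errors="replace"`), where Python substitutes U+FFFD rather than a literal question mark, but the information loss is the same.

```python
original = "åäö"

# Lossless garbling: UTF-8 bytes misread as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# Because Latin-1 maps every byte to a code point, the mistake is reversible.
recovered = garbled.encode("latin-1").decode("utf-8")

# Lossy handling: Latin-1 bytes decoded as UTF-8 with replacement.
latin1_bytes = original.encode("latin-1")
lossy = latin1_bytes.decode("utf-8", errors="replace")

print(garbled, recovered, lossy)
```

After the replacement step there is no way to get the original bytes back, which is the commenter's objection.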

Retr0spectrum about 11 years ago

Did anyone else find the title font hard to read?

jrochkind1 about 11 years ago

I have to admit I can't follow this completely -- dealing with file system file names that are not in ascii is a very confusing thing, and one I haven't done before -- plus I am not very familiar with python.

But I have done a lot of dealing with char encoding issues though -- in ruby.

In ruby 1.9+, I find ruby's char encoding handling to be quite good. Which does not mean it's not incredibly challenging and confusing to deal with char encoding issues. But it means I haven't been able to come up with any better approach than ruby 1.9+'s, anything I wish ruby 1.9+ did differently.

The mental model is simple (relatively, for the domain anyway) -- any strings are tagged with an encoding. If your string contains illegal bytes for the encoding it's tagged with, it's gonna raise if you try to concatenate it or do much anything else with it. Concatenating strings of two different encodings is probably going to raise too (some exceptions if they are both ascii supersets and happen to contain only ascii-valid 7-bit chars). You can easily check if a string contains any illegal bytes; change the tagged encoding to any encoding you like (including the 'binary' null encoding); remove bad bytes; or trans-code from one encoding to another.

It means that you have all the tools you need to deal with char encoding issues, but you still need to think through some complicated and confusing issues to deal with em. It is an inherently confusing domain (which is why it's nice that more and more of the time you can just assume UTF8 everywhere -- but yes, I've written plenty of code that can't assume that too, or that has to deal with bad bytes in presumed UTF8).

(The biggest frustrations can be when using gems (libraries) that themselves aren't dealing with char encoding correctly, and then you find yourself debugging someone else's code and trying to convince them that their code is incorrect when they're putting up a fight cause it's so damn confusing. There are still plenty of encoding related bugs. But I'm not sure that's ruby's fault).

You certainly can deal with everything as a byte stream (the 'binary' null encoding) if you want to in ruby, as far as the language is concerned, although I don't think you actually usually want to. (and some open source gems might not play well with that approach either)

It would be interesting to see someone who understands both ruby and python take the OP and analogize the problem case to ruby 1.9+ and see if it's any different.

(One important thing ruby was missing prior to 2.1 is the new String#scrub method. It was possible to write it yourself though, which I figured out eventually. Another thing I still wish ruby had built-in to stdlib was more of the Unicode algorithms (sort collation, case change, etc.), although there are gems for most of em these days, thanks open source.)
