I don't know... I get an error from the first script with python3:<p><pre><code> $ ls
test test3.py test.py tøst 日本語
$ python2.7 test.py *
hello hellø こにちは tøst 日本語
import sys
# (…)
hello hellø こにちは tøst 日本語
hello hellø こにちは tøst 日本語
$ python3 test.py *
Traceback (most recent call last):
File "test.py", line 13, in <module>
shutil.copyfileobj(f, sys.stdout)
File "/usr/lib/python3.2/shutil.py", line 68, in copyfileobj
fdst.write(buf)
TypeError: must be str, not bytes
# But I can make it work with:
$ diff test.py test3.py
8c8
< f = open(filename, 'rb')
---
> f = open(filename, 'r')
$ python3 test3.py *
# same as above
</code></pre>
Now these two scripts are no longer equivalent: the python3 script
outputs text, while the python2 script outputs bytes:<p><pre><code> $ python3 test3.py /bin/ls
Traceback (most recent call last):
File "test3.py", line 13, in <module>
shutil.copyfileobj(f, sys.stdout)
File "/usr/lib/python3.2/shutil.py", line 65, in copyfileobj
buf = fsrc.read(length)
File "/usr/lib/python3.2/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 24: invalid start byte
</code></pre>
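For what it's worth, python3 can keep the byte-oriented behaviour if you write to the binary stream underneath sys.stdout. A minimal sketch, assuming the script does little more than open each argument in 'rb' mode and hand it to shutil.copyfileobj (the function name `cat` and the `out` parameter are mine, not from the original script):

```python
import shutil
import sys


def cat(filenames, out=None):
    """Copy each file to `out` as raw bytes, without decoding anything."""
    if out is None:
        # In Python 3, sys.stdout is a text stream; sys.stdout.buffer is
        # the underlying binary stream, which accepts bytes.
        out = sys.stdout.buffer
    for filename in filenames:
        with open(filename, 'rb') as f:
            shutil.copyfileobj(f, out)
    out.flush()
```

Calling `cat(sys.argv[1:])` at the bottom of the script should then avoid both the TypeError and the UnicodeDecodeError, since nothing is ever decoded.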
The python2 script works like cat -- and dumps all that binary crap to the
terminal.<p>So, yeah, I guess things are different -- not <i>entirely</i> sure that the
python3 way is broken, though? It's probably correct to say that it
doesn't work well with the "old" unix way in which text was ascii and
binary was just bytes -- but consider:<p><pre><code> $ cat /bin/ls |wc
403 2565 114032
$ du -b /bin/ls
114032 /bin/ls
</code></pre>
Do the "word count" and "line count" from wc make any sense? For that
matter, consider:<p><pre><code> $ cat test
hello hellø こにちは tøst 日本語
$ wc test
1 5 42 test
</code></pre>
(Here the word count does make sense, but only because it's an
artificial example; it wouldn't make sense for actual Japanese.)<p>The character count is almost certainly wrong, unless what you care about is
the number of bytes, which is what "du -b" reports...
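<p>To make the bytes-versus-characters point concrete, here is a small python3 sketch. It assumes the `test` file holds exactly the one line shown by cat above, UTF-8 encoded, with a trailing newline:

```python
# Assumed contents of the `test` file, including the trailing newline.
text = 'hello hellø こにちは tøst 日本語\n'
data = text.encode('utf-8')

print(len(data))          # number of bytes
print(len(text))          # number of code points
print(len(text.split()))  # whitespace-separated "words"
print(text.count('\n'))   # newlines, i.e. "lines"
```

Under that assumption the byte length is 42, matching what wc printed, while the string is only 26 code points long.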