Bzr command line arguments may contain non-ASCII text, either typed by the user or produced as output by other tools. This page discusses issues with argument encoding on darwin (Mac OS X).
Bzr uses locale.getpreferredencoding() to get the encoding of command line arguments, but this gives a useless result on Mac OS X 10.3. The following session fails to decode non-ASCII arguments:
$ bzr init
$ touch \327\242\327\221\327\250\327\231\327\252
$ echo *
עברית
$ bzr add *
bzr: ERROR: exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)
  at /opt/local/lib/python2.4/site-packages/bzrlib/commands.py line 484
  in run_bzr
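The failure can be reproduced without bzr. This is a minimal sketch, assuming the argument arrives as the raw bytes typed in the `touch` command and that the locale reports an ASCII encoding. The names `raw_arg` and `try_decode` are illustrative and are not bzrlib code:

```python
# The ten bytes typed in the `touch` command (octal escapes).
raw_arg = b"\327\242\327\221\327\250\327\231\327\252"


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "ERROR: %s" % exc


# Decoding with 'ascii' fails on the first byte (0xd7) ...
print(try_decode(raw_arg, "ascii"))
# ... while decoding the same bytes as utf-8 gives the Hebrew name.
print(try_decode(raw_arg, "utf-8"))
```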
When you type non-ASCII characters, the terminal enters your text as utf-8 (shown as octal escapes). When you let the shell complete names, it uses utf-8 and shows the names correctly.
The Terminal has a Character Set Encoding menu, which is set to utf-8 by default and can be changed to a few other encodings (not all the encodings supported on Mac OS X). However, changing the encoding in this menu does not change the result of locale.getpreferredencoding(). I did not find a way to detect the encoding used by the Terminal.
If you change the encoding to Japanese, you will get junk when typing Hebrew or using shell completion:
$ echo *
ラ「ラ泰ィラ燮ェ
This is expected: the bytes are still utf-8, and you can decode anything as any 8 bit encoding and get junk.
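The effect is easy to see in Python. This is a sketch only: it assumes that Shift JIS stands in for the Terminal's Japanese setting, and the exact glyphs printed may differ from what the Terminal shows.

```python
# utf-8 bytes of the Hebrew file name used on this page.
raw = b"\327\242\327\221\327\250\327\231\327\252"

# Decoded with the encoding the bytes were written in: 5 Hebrew letters.
hebrew = raw.decode("utf-8")

# Assumption: Shift JIS stands in for the Terminal's "Japanese" setting.
# The same bytes decode without error, but to unrelated characters.
junk = raw.decode("shift_jis")

print(hebrew)
print(junk)
```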
I did not find any documentation about argument encoding and the output of command line tools, except this note:
- "All BSD system functions expect their string parameters to be in UTF-8 encoding and nothing else. Code that calls BSD system routines should ensure that the contents of all const *char parameters are in canonical UTF-8 encoding."
Further evidence for the file system encoding is sys.getfilesystemencoding(), which according to http://docs.python.org/lib/module-sys.html always returns utf-8 on Mac OS X.
Testing common commands from Python (using os.system) and from shell scripts shows that completed file names use utf-8, and commands that return file names also return utf-8 output. For example:
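To compare the two values on a given machine, a short check like the following can be run. It only prints what Python reports and makes no assumption about the result:

```python
import locale
import sys

# Encoding Python uses for file names.
print("filesystem encoding:", sys.getfilesystemencoding())
# Encoding derived from the locale settings.
print("preferred encoding: ", locale.getpreferredencoding())
```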
>>> import os, commands
>>> os.system('touch \327\242\327\221\327\250\327\231\327\252')
0
>>> '\327\242\327\221\327\250\327\231\327\252' in commands.getoutput('ls -1').splitlines()
True
From my experience programming in C, Objective-C and Python on Mac OS X, arguments are always encoded as utf-8.
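A similar check can be written without shelling out. This is a sketch and not what was run for this page: it assumes a Python version where `os.listdir` accepts a bytes path and returns raw bytes names, and it works in a scratch directory.

```python
import os
import tempfile

# utf-8 bytes of the Hebrew file name used on this page.
name = b"\327\242\327\221\327\250\327\231\327\252"

# Work in a scratch directory so the current directory is left alone.
scratch = os.fsencode(tempfile.mkdtemp())
path = os.path.join(scratch, name)
open(path, "wb").close()

# Passing a bytes path makes os.listdir return the names as raw bytes.
listed = os.listdir(scratch)
print(name in listed)
print(listed[0].decode("utf-8"))

# Clean up the scratch file and directory.
os.remove(path)
os.rmdir(scratch)
```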
Solution
Always use utf-8 encoding on darwin, ignoring the locale.
See revisions 1575..1577 from http://nirs.dyndns.org/bzr/encoding-nirs
The changes fix the decoding error of the arguments. There are still encoding errors when printing, and unicode names are displayed badly (as ???); both have to be investigated.
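The idea can be sketched as follows. This is not the code from the revisions above: `argument_encoding` and `decode_arguments` are hypothetical helpers, and the `platform` parameter exists only so the darwin branch can be tried on any system.

```python
import locale
import sys


def argument_encoding(platform=sys.platform):
    """Pick the encoding used to decode command line arguments.

    Hypothetical helper: on darwin the locale is ignored and utf-8 is
    always used; elsewhere the locale's preferred encoding is kept.
    """
    if platform == "darwin":
        return "utf-8"
    # Fall back to ascii if the locale reports nothing usable.
    return locale.getpreferredencoding() or "ascii"


def decode_arguments(raw_args, platform=sys.platform):
    """Decode a list of byte-string arguments to text."""
    encoding = argument_encoding(platform)
    return [arg.decode(encoding) for arg in raw_args]


# The Hebrew file name from the failing session decodes cleanly.
print(decode_arguments([b"\327\242\327\221\327\250\327\231\327\252"], "darwin"))
```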
Todo
- tests for user_encoding value
- tests for decoding shell completed names with user_encoding on posix
- tests for decoding shell completed names with user_encoding on win32 ?
- Test on 10.4 - someone on 10.4 is invited to do this
- Find a way to detect the Terminal encoding
- Check how third-party terminal apps behave
- iTerm has many encodings to choose from, but typing in Hebrew is impossible, so it is probably not relevant.
