This is a general discussion of what John Meinel has been working on to improve bzr's support for user encodings, and for unicode filenames.
See also BzrIOEncodingSpec
- Bzr output
- Normalization of unicode filenames
- Transport URL interface
- Avoiding print
- Global encoding
- Optional things
- Unknown input encoding
Bzr output
The output of bzr is affected by the user's encoding. If you commit a file whose name contains a non-ASCII character, the name must be encoded properly so that the correct characters are displayed to the user. Internally, bzr stores everything as unicode, so it should be able to track most files/committers/etc.
The biggest issue is that some encodings don't support all characters. So we need to decide what to do (we can error, or we can display a substitute character).
In general, the 'correct' thing depends on the command. For example, the output of bzr log is not critical, so it is better to display substitute characters (generally '?'). The text output of bzr commit is also not critical.
We should be able to commit changes to a file whose name can't be displayed on the terminal. However, the output of bzr diff should be correct, so we should error if we can't display the paths correctly. (There is a discussion about whether bzr diff should always display the paths as utf-8.)
To make this reasonably easy, I added an encoding_type field to Command definitions. I also added a new member object, Command.outf. This is the output file, which is basically sys.stdout wrapped with a codecs encoder.
encoding_type can take the values strict, exact, replace:
- strict - error if the characters cannot be displayed (default)
- exact - do not perform any encoding translation. (This will be used by diff)
- replace - Used by log, etc. to indicate that it should try to translate, but if the character is not available, it is okay to substitute.
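The three policies map fairly directly onto the error handlers of Python's codecs module. The following is only a sketch of the idea, written for Python 3 and not taken from bzrlib: wrap_output and the handler table are hypothetical names, and treating exact as "hand back the raw byte stream untouched" is an assumption based on the description above.

```python
import codecs
import io

# Hypothetical mapping from encoding_type to a codecs error handler.
# 'exact' is absent: in that mode the raw byte stream is used as-is.
_ERROR_HANDLERS = {'strict': 'strict', 'replace': 'replace'}


def wrap_output(byte_stream, encoding, encoding_type='strict'):
    """Return a file-like object that could play the role of Command.outf."""
    if encoding_type == 'exact':
        return byte_stream
    writer_class = codecs.getwriter(encoding)
    return writer_class(byte_stream, errors=_ERROR_HANDLERS[encoding_type])


# 'replace' substitutes '?' for characters the terminal cannot show.
raw = io.BytesIO()
wrap_output(raw, 'ascii', 'replace').write('added \xe5.txt\n')
replaced_output = raw.getvalue()

# 'strict' raises instead of silently losing information.
try:
    wrap_output(io.BytesIO(), 'ascii', 'strict').write('\xe5.txt\n')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

With an ASCII terminal, the replace writer emits "added ?.txt", while the strict writer refuses to write the name at all.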
Most of the commands have been converted to use self.outf instead of sys.stdout though a few more need to be done. (There are blackbox tests in bzrlib/tests/blackbox/non_ascii.py)
Rename encoding_type to encoding_policy?
Maybe instead of an output file, supply a write or display function that does the encoding? Instead of self.outf.write('stuff') you would use self.display('stuff').
Commands requiring strict
This is a (partial) list of commands that should be run in strict mode. The reason for strict is that users are expected to act on the output of these commands; it is not just informational.
- ls - This is generally a scripting interface.
- conflicts, added, removed, etc - These give a list of paths. Ultimately they will probably be superseded by 'bzr ls'.
Commands wanting replace
- commit - The screen output of this command is unimportant. The internals are very important, and must be correct. But you should not fail to commit because it cannot print added foo.
- add, rm, mv - Same as commit.
Diff gets its own section, because it is fairly special in its needs. The paths that are displayed need to handle unicode translation, but the contents of the patch should not be modified in any way.
The question is whether bzr diff should always encode the paths in utf-8. The issue is that sometimes bzr diff will be used to create a patch to send to someone else.
This works best if you use utf-8, so that the patch is in a standard form. However, if you are just reviewing what has changed in your directory, you would want it to be in your local encoding, so you can read the filenames.
It might be less important once changesets are introduced, since then you would have bzr submit, which could send an email of the changes in a well-defined format (and bzr cset if you just want the file). Then bzr diff would not be overloaded with this dual purpose.
Normalization of unicode filenames
Unicode has all sorts of rules for normalizing unicode characters. Basically, some characters can be represented either as 1 code point or as 2 code points (for example, an a with a ring above).
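To make the two representations concrete, here is a small sketch using the standard unicodedata module (Python 3 syntax). The variable names are only for illustration.

```python
import unicodedata

composed = '\xe5'        # one code point: LATIN SMALL LETTER A WITH RING ABOVE
decomposed = 'a\u030a'   # two code points: 'a' + COMBINING RING ABOVE

# The two strings render identically, but a plain comparison says they differ.
same_without_normalizing = (composed == decomposed)

# Normalizing both to one form (NFC or NFD) makes them comparable.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', composed)
```

After normalization, nfc equals the composed string and nfd equals the decomposed one.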
Mac OS X issues
Different file systems in Mac OS X have different levels of Unicode support:
- Mac OS Extended (HFS+) - the default and recommended file system - uses canonically decomposed Unicode 3.2 in UTF-16 format.
- UFS file system allows any character from Unicode 2.1 or later, but uses the UTF-8 format
- Mac OS Standard (HFS) does not support Unicode and instead uses legacy Mac encodings, such as MacRoman.
I think we can ignore this legacy file system for now. -- NirSoffer 2006-02-15 10:38:41
See Apple Developer Connection: File Encodings and Fonts for more info.
When using the HFS+ file system, unicode names are saved in decomposed form (NFD), which prefers two code points that combine into a character over a single precomposed character. Internally we want to use the composed form (NFC), since that is what the XML spec uses (XXX url). So when we read filenames from the working tree, we need to translate.
Here is an example that illustrates this problem using the HFS+ file system:
>>> import os
>>> file(u'\xe5', 'w').write('foo')
>>> u'\xe5' in os.listdir(u'.')
False
>>> os.listdir(u'.')
[u'a\u030a']
>>> os.path.exists(u'\xe5')
True
>>> file(u'\xe5').read()
'foo'
>>> file(u'a\u030a', 'w').write('bar')
>>> file(u'\xe5').read()
'bar'
We saved a file named u'\xe5', but got a file named u'a\u030a'!
UFS, Linux and Win32 Issues
Here is a similar example when working with UFS on Mac OS X:
>>> import os
>>> file(u'\xe5', 'w').write('foo')
>>> u'\xe5' in os.listdir(u'.')
True
>>> os.listdir(u'.')
[u'\xe5']
>>> os.path.exists(u'a\u030a')
False
>>> file(u'a\u030a', 'w').write('bar')
>>> os.listdir(u'.')
[u'\xe5', u'a\u030a']
Does it work the same on other file systems and platforms?
This is the expected behavior, but the two files, despite having different names, look exactly the same when listed on the file system.
If Bzr uses only one form to record the file name, as the XML spec requires, it cannot keep both files in the same directory, because both of them have the same normalized form.
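One way to spot this situation is to group the names in a directory by their normalized form. This is only a sketch in Python 3; find_normalization_collisions is a hypothetical helper, not an existing bzrlib function.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names):
    """Group names that share one NFC form; return only the clashing groups."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize('NFC', name)].append(name)
    return {nfc: found for nfc, found in groups.items() if len(found) > 1}


# The directory listing from the UFS session above, plus an unrelated file.
collisions = find_normalization_collisions(['\xe5', 'a\u030a', 'README'])
```

Here the two visually identical names end up in the same group, keyed by their shared NFC form, while README is left out.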
Multi platform compatibility issues
If you create a file named '\xe5.txt' on Linux or Windows, and check out the project on a Mac, it will create what it thinks is '\xe5.txt', but actually create 'a\u030a.txt'. When it tries to list the directory, '\xe5.txt' has disappeared, and an unknown 'a\u030a.txt' file has appeared.
We discussed the issue and decided that it makes the most sense to always normalize filenames internally (XXX using NFC as XML requires?), and to complain if the user tries to add a non-normalized filename. (On Mac you can't create one.)
We want to alert the user if they are trying to add a file which isn't properly normalized, though it should not be an error to have such a file in the directory (and it should be possible to ignore it, possibly requiring a wildcard).
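The check at add time could look roughly like this. It is a sketch in Python 3 that assumes NFC is the chosen internal form (still marked XXX above); split_addable is a hypothetical name.

```python
import unicodedata


def split_addable(paths):
    """Separate NFC-normalized paths from those that should trigger a warning."""
    accepted, rejected = [], []
    for path in paths:
        if unicodedata.normalize('NFC', path) == path:
            accepted.append(path)
        else:
            rejected.append(path)
    return accepted, rejected


accepted, rejected = split_addable(['\xe5.txt', 'a\u030a.txt', 'plain.txt'])
```

The rejected list would be reported to the user rather than treated as a fatal error, in line with the note above.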
Transport URL interface
Transport needs a proper URL interface. Basically we want the input into Transport to be only URLs, and we define those URLs to be url-quoted, utf-8 strings (URLs are defined as ASCII strings). LocalTransport especially needs to quote the return values from some functions like list_dir().
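The "url-quoted, utf-8" convention can be sketched with the standard library as follows (Python 3, urllib.parse). The two helper names are hypothetical and are not part of the Transport API.

```python
from urllib.parse import quote, unquote


def path_to_url_segment(unicode_path):
    """Encode a unicode path as UTF-8, then percent-quote it into ASCII."""
    return quote(unicode_path.encode('utf-8'))


def url_segment_to_path(url_segment):
    """Reverse the quoting: percent-decode, then decode the bytes as UTF-8."""
    return unquote(url_segment, encoding='utf-8', errors='strict')


quoted = path_to_url_segment('dir/\xe5.txt')
round_tripped = url_segment_to_path(quoted)
```

The quoted form contains only ASCII characters, and unquoting it gives back the original unicode path.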
Avoiding print
There are places in the code that use print directly. These should be cleaned up to take an output file. There should be no printing inside bzrlib, since it may be used as a library by a GUI. The callback style of Commit should be used when possible.
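Both styles mentioned above can be sketched like this (Python 3). The function names and the reporter signature are assumptions for illustration and do not reflect the actual Commit callback interface.

```python
import io
import sys


def report_added(paths, to_file=None):
    """Write one line per path to to_file instead of calling print."""
    if to_file is None:
        to_file = sys.stdout
    for path in paths:
        to_file.write('added %s\n' % path)


def add_files(paths, reporter=None):
    """Library-level operation: it never prints, it only notifies a callback."""
    for path in paths:
        # ... the real work of adding the path would happen here ...
        if reporter is not None:
            reporter(path)


# A command-line front end passes a file; a GUI could pass any callable.
buffer = io.StringIO()
report_added(['foo', 'bar'], to_file=buffer)

seen = []
add_files(['foo', 'bar'], reporter=seen.append)
```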
Global encoding
After everything has been updated to use self.outf, it should be easy to introduce a global --utf-8 option. This could be used by scripts and front-ends which do not want to depend on the user's encoding.
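As a rough illustration of how such an option could feed into the choice of output encoding, here is a Python 3 sketch using argparse. The parser, the force_utf8 attribute and output_encoding are all hypothetical; bzr's own option handling is different.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog='bzr-sketch')
    # Hypothetical global flag, as proposed in the text above.
    parser.add_argument('--utf-8', dest='force_utf8', action='store_true')
    return parser


def output_encoding(force_utf8, terminal_encoding):
    """Pick the encoding used to wrap the output stream."""
    return 'utf-8' if force_utf8 else terminal_encoding


options = build_parser().parse_args(['--utf-8'])
chosen = output_encoding(options.force_utf8, terminal_encoding='cp1252')
default = output_encoding(False, terminal_encoding='cp1252')
```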
Optional things
Just a brain dump of ideas.
Should we have a separate return code for encoding failure? I'm thinking that scripting interfaces might want to be aware that bzr ls --foo is failing because a file name cannot be represented, rather than because of a generic error.
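A sketch of what that could look like in Python 3 is below. The numeric exit codes are made up for the example, since the text only proposes that they differ, and run_listing is a hypothetical stand-in for a command.

```python
import codecs
import io

# Hypothetical exit codes; the text above only proposes that they differ.
EXIT_OK = 0
EXIT_ERROR = 1
EXIT_ENCODING_ERROR = 3


def run_listing(paths, out):
    """Write the paths and translate failures into distinct exit codes."""
    try:
        for path in paths:
            out.write(path + '\n')
    except UnicodeEncodeError:
        return EXIT_ENCODING_ERROR
    except Exception:
        return EXIT_ERROR
    return EXIT_OK


ascii_out = codecs.getwriter('ascii')(io.BytesIO(), errors='strict')
ok_code = run_listing(['README'], ascii_out)
encoding_code = run_listing(['\xe5.txt'], ascii_out)
```

A calling script could then tell "this name cannot be shown in your encoding" apart from any other failure.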
Path support - Windows users would probably like to see \ separating their paths rather than /, since native commands like dir do not support the forward slash version. It might be possible to use a fancier object than a plain codec wrapper, which would have a function like write_path which would use native paths for display.
Mac OS X has a similar problem with posix style paths. Users are used to seeing Mac style paths like foo:bar:baz, while Cocoa (bought from NeXT) works with posix paths. To solve this problem Cocoa has the method displayNameAtPath: in the NSFileManager class. See Getting display names.
A possible solution is to use os.path.normpath(internal_path) to display file paths, or a shortcut.
>>> import posixpath
>>> posixpath.normpath('foo/bar')
'foo/bar'
>>> import ntpath
>>> ntpath.normpath('foo/bar')
'foo\\bar'
Move to a page about Win32 support?
Unknown input encoding
When reading input files, you must know the input encoding to decode the file correctly.
- versioned files - may contain info about the encoding, like the Python #-*- coding: foo -*- line, but usually do not.
- configuration files - a specific encoding (utf-8?) may be required, but users can easily introduce encoding errors because of editor defaults or ignorance.
- command line arguments - see DarwinCommandLineArgumentDecoding
It is easy to detect utf-8 because of its special format, but if the data is not utf-8, it can be decoded as any 8 bit encoding without errors, generating junk.
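Both halves of that statement can be seen in a short sketch (Python 3). decode_guessing is a hypothetical helper, and the latin-1 fallback is an arbitrary choice for the example.

```python
def decode_guessing(data, fallback='latin-1'):
    """Try strict UTF-8 first; otherwise fall back to an 8-bit encoding."""
    try:
        return data.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        return data.decode(fallback), fallback


utf8_bytes = '\xe5'.encode('utf-8')        # two bytes that form valid UTF-8
hebrew_bytes = '\u05d0'.encode('cp1255')   # Hebrew alef as one cp1255 byte

utf8_result = decode_guessing(utf8_bytes)
guessed_result = decode_guessing(hebrew_bytes)
```

The cp1255 byte is rejected by the strict utf-8 decoder, but the fallback decodes it without complaint into a different, wrong character.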
For some applications you can guess the encoding from the context. For example, the Hebrew Wikipedia accepts URLs encoded in both utf-8 and windows-1255 (as generated by some versions of IE on systems with a Hebrew locale).
I'm not sure how this problem affects bzr; maybe someone with a clue can elaborate on this?
http://bzr.arbash-meinel.com/branches/bzr/encoding - JAM encoding branch