Launchpad Entry: https://launchpad.net/products/bzr/+spec/path-tokens
Created: 2007-03-16 by RobertCollins
Our internal model of a tree is insufficient to model copying or combining of files, directories and symlinks. This spec maps out the infrastructure required to allow that modelling, as a prelude to implementing it.
Corner cases: file_ids make a lot of code extremely simple. Whether two files are the same, taking renames into account, is very simple to answer in a file id world: just compare the ids; if they are the same, then yes, if not, then no. Applying a delta to a versioned path is easy: look up the file id the delta was made against in the target tree, and apply. Keeping this simple way of talking about versioned things is extremely important in my opinion.
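To make the simplicity concrete, here is a minimal sketch of the two operations described above. It is a toy model, not bzrlib code: a tree is assumed to be a plain dict mapping file id to path, and the names `id_of`, `same_file` and `target_path` are hypothetical.

```python
# Toy model (an assumption for illustration): a tree is a dict of
# file id -> current path. None of these helpers exist in bzrlib.

def id_of(tree, path):
    """Return the file id versioned at path, or None if there is none."""
    for file_id, tree_path in tree.items():
        if tree_path == path:
            return file_id
    return None


def same_file(tree_a, path_a, tree_b, path_b):
    """Are the two paths the same versioned file, renames included?"""
    id_a = id_of(tree_a, path_a)
    # Just compare the ids: same id means yes, anything else means no.
    return id_a is not None and id_a == id_of(tree_b, path_b)


def target_path(delta_file_id, target_tree):
    """Find where a delta made against delta_file_id applies in target_tree."""
    # Returns None when the target tree does not contain that file id.
    return target_tree.get(delta_file_id)


old = {"id-readme": "README", "id-main": "src/main.py"}
new = {"id-readme": "docs/README.txt", "id-main": "src/main.py"}
```

With these two trees, `same_file(old, "README", new, "docs/README.txt")` is true despite the rename, and a delta made against `"id-readme"` lands on `docs/README.txt` in `new`. Any replacement for file ids has to keep questions like these about this easy to answer.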
Parallel imports: There are many cases where parallel imports occur. These imports make it difficult to really work in a decentralised manner. Conversions from CVS, SVN, GNU Arch, imports from tarballs, and application of regular patches (which create files) all exhibit the parallel import problem: it is desirable for two different imports to be able to be merged and talked about as though they are the same project, when it is hard for bzr to actually know that they are. We currently jump through a lot of hoops to achieve [nearly] identical output here. If we had the ability to take two separate trees which happen to have paths that users consider the same, and commit a record somewhere that identifies which paths in these trees should be treated as the same, it would be possible to merge, and replay, correctly between those trees. This has been talked about in the past under the term 'file id aliases'. This would allow a dramatic simplification of the user experience when converting from systems, like tarballs and CVS, where a repeatable conversion is essentially impossible.
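Purely to illustrate what a 'these are the same' record could answer, and not as a proposed design, the sketch below groups ids that users have declared equivalent using a small union-find structure. The `AliasTable` class and the id strings are hypothetical, and where such records would be stored is deliberately left open.

```python
class AliasTable:
    """Hypothetical record of file ids that users consider the same."""

    def __init__(self):
        # Maps each known id to its parent; a root is its own parent.
        self._parent = {}

    def _find(self, file_id):
        """Return the representative id of the group containing file_id."""
        self._parent.setdefault(file_id, file_id)
        while self._parent[file_id] != file_id:
            # Path halving: point this id at its grandparent as we walk up.
            self._parent[file_id] = self._parent[self._parent[file_id]]
            file_id = self._parent[file_id]
        return file_id

    def record_alias(self, id_a, id_b):
        """Record that id_a and id_b should be treated as the same file."""
        root_a, root_b = self._find(id_a), self._find(id_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, id_a, id_b):
        """True if the two ids are equal or have been aliased together."""
        return self._find(id_a) == self._find(id_b)


aliases = AliasTable()
aliases.record_alias("cvs-import:README", "tarball-import:README")
aliases.record_alias("tarball-import:README", "svn-import:README")
```

After these two records, `aliases.same("cvs-import:README", "svn-import:README")` holds transitively, while unrelated ids still compare as different, so the simple 'compare the ids' question keeps a yes/no answer across parallel imports.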
Copies: This is an oft-requested feature. I think it comes up at least monthly on IRC, and it's a real issue when representing what other VCS systems like SVN actually do to perform 'renames'. This isn't to say that we want to represent SVN renames as copy and delete (I think that is fugly), but we currently cannot accurately convert svn repositories that do copies and *do not* delete. Copies also make sense for some user operations, like splitting a file's contents, or taking a file like 'COPYING' that does not change often and putting it into other locations or trees. Telling people to use symlinks, or to remember to manage separate files, is IMO a reflection on our limits, not a reflection on what we *should* allow.
Two versioned paths become one: This is mostly covered in my text about parallel imports. While not quite the same thing, they are closely related. Specifically, there are use cases such as 'combining two source files' which are independent of the parallel import case, and also worth supporting, if we can clearly document sane behaviour.
No reference to historical data: Accessing lots of historical data is expensive - it means performance degrades as history accumulates. Additionally, in order to support history horizons, a proposal that we allow people to set a strict limit on what historical data is available to bzr, we need to be able to identify 'these are the same' across trees without necessarily having access to a common ancestor.
Storage size: A naive implementation approach to supporting both file copying and file combining without history searches may well result in rapidly increasing storage requirements, so while we are not yet discussing implementation, this is a constraint on the implementation.
So here are a few ideas that I have about the shape of a new tool, which I'll call path tokens, to avoid confusion with file ids [a better name is welcome]. I don't intend to talk about implementation yet - partly because I don't have one in mind, but mainly because I want us to agree on the *goals* first: there's no point talking about an implementation until we agree on what we want to achieve. I propose a multi-step plan for tackling this problem:
- identify the problems/use cases to solve.
- design acceptable semantics for the new functionality that we've decided we want to support.
- design an implementation that can supplant/extend file_ids to deliver the agreed semantics.
- go forth and implement.
path tokens should:
- for currently supported cases, have no more corner cases than file-ids.
- allow us to support parallel imports better than file-ids.
- allow us to support copies as first-class operations.
- allow us to support 'two versioned paths become one versioned path'.
- allow us to compare two trees with no reference to historical data.
path tokens should not:
- increase storage size proportional to history or tree size. Note that this isn't the same as saying 'they should have fixed size'.