Nested Trees Design
NOTE: This is a design document for future work, not a description of anything bzr currently implements. Don't attempt to do anything this talks about. For more up to date information see: http://bazaar.launchpad.net/~bzr-core/bzr/devnotes/annotate/head:/nested-trees.txt
This document describes the current design of by-reference Nested Trees. Much of this is already implemented. It does not describe by-value trees, because these are indistinguishable from normal trees.
- 1 Core Concepts
- 2 Basic approach
- 3 Implementation strategies
- 4 Scope
- 5 Data storage
- 6 Commands
- 7 Use cases
- 8 API Changes
- 9 Comparison with other systems
- 10 Future work
Subtree A tree which is inside another tree, which bzr has been asked to treat as part of the outer tree.
Containing tree A tree which has another tree inside it
Tree reference A directory in a containing tree which contains a subtree.
Composite tree A tree which behaves as a single tree, but is actually made up of several trees.
Commands in containing trees should behave as though the subtrees were plain directories. Commands in subtrees should not affect the containing trees (except possibly with user intervention).
One of the objectives of nested trees is to provide ways of reproducing historical combinations of different codebases. The dependency chain points downwards, such that trees are affected by the revision of their subtrees, but subtrees are oblivious to their containing trees. Just as bazaar doesn't entice people to commit inconsistent trees, it should not entice people to commit inconsistent combinations of containing tree and subtree. Therefore, commit should recurse downwards by default.
Status and diff should reflect what will happen when commit is used, so they should also recurse downward by default. Add almost does this already. With status, diff, commit and add recursing downwards, it would be confusing to users if other operations did not. Therefore, all operations should recurse downwards by default.
One of the reasons for using nested trees is to gain performance by only committing in a subtree. Therefore, operations should not recurse upwards by default. However, some users do want to have upwards recursion, so it should be provided as an option.
The idea that a set of nested trees behaves like a single, larger tree seems relatively easy to grasp. Both for users and for developers, it provides a clear expectation for the behaviour of nested trees. There are no obvious drawbacks in terms of code clarity or performance. Therefore, it seems like a good model to start with.
There are cases where users place multiple copies of a source tree inside another tree, and treating a set of nested trees like a single large tree is in conflict with that. It is essentially the same as supporting file copies, and file copies themselves are a big and complicated feature to implement. It seems likely that Bazaar will one day support file copies, and when it does, this conflict will disappear. Implementing limited support for copies via nesting would have similar complexity to implementing general support for file copies, but would have fewer fruits. It would lead to complication when true support for file copies is implemented. It does not seem appropriate for initial nested-tree support. On the other hand, it could be implemented later; none of the data model changes are in conflict with that.
The idea that tree-reference file-ids are the same as the file-ids of the corresponding root directories has a nice symmetry. It is one way of ensuring that "bzr split" is deterministic, and "bzr join" is deterministic. When performing operations involving a tree and a split version of that tree, using the same file-id makes it easy to ensure that operations such as moves and renames are applied appropriately to the tree-reference. Providing mechanisms whereby a tree-reference can be treated as it would if it had its old file-id encroaches on the territory of path tokens or file-id aliases. Having "split" cause file-id changes means that in comparing these revisions, it would be seen as a deleting a directory and creating a new tree-reference with the same name. Handling this correctly in operations such as merge and revert would be more complicated than if it were treated as a kind change, especially when unversioned files are present in the subtree.
When branching from a branch with tree references, bzr should create local branches in .bzr/subbranches. Aside from their location, there will be nothing special about them. In relation to their containing branch or tree, they shall be called "subbranches". If there is a working tree, its subtrees should be a lightweight checkouts of the corresponding subbranches. The reference-location information in the containing branch should point to the subbranch.
If the branches were not local, the local subtrees might not be committable, and commits to the remote branch would make the local subtree out of date. They should not be in a separate location from the containing branch, because the branch might be in a shared repository, and might share history with the tree-reference's branch. However, those local branches should not be at the same location as their tree, because the tree might be deleted or moved. Indeed, they should not be anywhere within a working tree. Even though a branch's revisions could be preserved while deleting the actual branch, its configuration data would not.
subtree branches should not be above or beside their containing branch, because it could cause terrible confusion if subtrees from two different trees were updating the same branches with every push, pull, commit and uncommit.
Sub-branches make it conceivable to eliminate the reference-location concept, by simply requiring that all subtrees in a composite tree be checkouts of subbranches. The sub-branch name would have to be based on the file-id. It could be a hash of the file-id, or it could be an encoded file-id. Encoding file-ids has been a problem in the past-- we would want to avoid percent encoding due to buggy web servers.
In the initial implementation, recursion into subtrees should be implemented at the highest level possible. This will make experimentation easier, reduce the chance for regressions in core code, and reduce the number of public API changes. This will entail some redundant post-operation determination of whether subtrees are present. Speed of operations involving subtrees is not a major concern, but operations that do not use subtrees must not be observably slower.
Aaron Bentley believes that recursion at the top level is also easier to understand and debug, but Robert Collins and Vincent Ladeuil disagree.
As time goes on and the design is proven, operations can be rewritten as more intrusive changes, in order to provide better performance or better handling of edge cases. It may make sense to implement them on NestedTrees, so that they can take advantage of its subtree caching.
However, some low-level operations would be dubious to implement as recursive. For example, a recursive Tree.id2path might open several subtrees in order to determine a single path.
The obvious way to prevent slowness when there are no subtrees is to provide a fast way of determining whether there are subtrees. This could be provided by additional on-disk indices, or potentially by providing a flag indicating whether there are subtrees. Such a flag could be updated by Tree.iter_entries, to avoid re-running Tree.iter_entries. A parallel flag for iter_changes is also conceivable, but would be more complicated, since iter_changes is relative to another Tree. Further, during iter_changes it is possible to have subtrees that were not encountered because they were unchanged between the trees (as iter_changes does not always walk the entire tree).
Locking a NestedTrees should lock all subtrees, to reduce the potential for new failure modes. This suggests the need for an on-disk index of subtrees.
For some operations such as diff, status and export, knowing the locations of tree references is not essential. They can be satisfied by a Tree implementation that papers over tree-references, making the set of nested trees look like a single big tree with directories instead of tree-references. This implementation strategy has a limited scope, and probably a limited shelf life, but provides a fast way of getting coverage of several important commands.
Ultimately, all commands should support nested trees. Initial work is focused on:
The root-ids of trees must be unique, so that the same file-id can be used in both the containing tree and the subtree, to simplify access within trees. Tree references are an inventory type that is distinct from a directory, and has a revision-id associated with it. All modern working trees support tree references. Indices may be provided to ensure fast access to the list of subtrees.
Because of the need to use branches for recursion, it would be desirable to associate RevisionTrees with branches. The alternative is for many operations to accept Tree + Branch as input, and return Tree, Branch pairs.
Branches store a mapping of file-id to branch location and path. The location is used both when branching and when recursing into the subtrees of revision trees. It is associated with the branch because it is not a versioned property-- different sites may use different values, as long as they contain the relevant revisions. Relative paths are interpreted relative to the branch's location. BzrBranchFormat8 is the first format to support references.
Some repository formats have 'subtree' variants, e.g. pack-0.92-subtree, development-subtree. These are hidden, experimental formats that support storing tree-references in their inventory formats. The most recent format which supports subtrees is RepositoryFormatKnitPack3.
The following new commands are introduced:
split Convert a single working tree into two trees, one inside the other.
join --reference Cause an inner tree to be treated as a subtree. (Adds a tree-reference to the containing tree)
reference-location list, view and update the locations associated with tree references in a branch. (Invocation similar to bzr alias).
John works on a project called FooBar, but has decided that it would be better structured as two projects, where Bar is a library that may be of general use outside of Foo. As it happens, bar already has its own subdirectory.
# Convert into two trees: foobar and foobar/bar. # In each tree, files will be removed and deleted. In foobar/bar, "bar" will have been moved to become the tree root. # These changes will be committed later. $ bzr split foobar/bar # Add a tree-reference from foobar to foobar/bar $ bzr join --reference foobar/bar # This recurses into foobar/bar and commits the deletion of the containing tree. # In foobar, it commits a kind change for 'bar' from directory to tree-reference, # and the deletion of the contents of bar. $ bzr commit foobar
This commits new revisions to foobar and bar, and foobar's tree-reference bar refers to the revision-id of bar.
Next, he adds two new files: foobar/baz and foobar/bar/qux:
$ vi foobar/baz $ vi foobar/bar/qux # This adds qux to foobar/bar and adds baz to foobar. $ bzr add foobar
Since foobar/bar/qux is in a commitable state and foobar/baz is not, he invokes
$ bzr commit foobar/bar
This commits foobar/bar/baz/qux to the subtree and commits foobar/bar to the containing tree.
(Had he wanted to commit to just the subtree, or just the containing tree, he could have specified an option.)
Robert wants to hack on a project, Baz, that is structured as a nested tree, which uses the external library "quxlib", from the site quxlib.org.
$ bzr branch http://baz.org/dev baz
This creates a branch and working tree for baz, as normal. It then recurses into the quxlib directory, and does a branch for that. bzr queries http://baz.org/dev for the location associated with "quxlib-id-snarglezap", and determines the location to be http://quxlib.org/dev. It creates a new branch and working tree using quxlib.org/dev, but it uses the revision-id from the tree-reference in the containing tree, not the head revision of http://quxlib.org/dev. This allows Robert to get a known-good nested tree.
Later, Robert decides to update the version of quxlib being used. He runs:
$ bzr pull -d baz/quxlib
This updates the version of quxlib in the working tree, which mean that baz is now out-of-date with its last-committed tree. Unfortunately, the new rev on quxlib is not completely compatible with the old one, and Robert must tweak a few files before Baz runs properly. Once he has done so, he runs:
$ bzr commit baz
Now he has committed a known-good nested tree, and the baz working tree is once again up-to-date.
Note that if quxlib.org disappears, the Baz maintainers can substitute their own local mirror.
When a pull involves updates to tree references, pull will always pull into the reference branch. For all new revisions in the upper branch, it will determine the revision values of tree references, and fetch them into the reference branch's repository.
When new tree references are encountered, pull should create a corresponding branch in .bzr/branches and add it to the list of reference branches.
A subtree should typically be a lightweight checkout of a branch in the containing tree's .bzr/branches directory, so that the branch does not move around, and is not created or destroyed, when subtree is moved, created, or destroyed.
Pulls will update the subtrees whose tree-references change, including creating lightweight checkouts of the reference branches.
New class to provide services for sets of nested trees, including coordinating locking, caching Tree instances, and mapping file-ids and paths to the appropriate Tree and Branch. Includes:
NestedTrees.lock_read NestedTrees.lock_write NestedTrees.unlock NestedTrees.paths_info(self, paths) - return a list of tuples of Tree, relpaths, where relpaths is a list of relative paths, and Tree is the subtree that contains them. NestedTrees.get_tree_and_treepath - return the tree and relative path within that tree of a file-id NestedTrees.apply_inventory_delta - splits an inventory delta into the affected subtrees and applies it to them all NestedTrees.iter_changes - provide iter_changes over a set of nested subtrees. Emits tree-references in preference to tree roots.
New class to permit smooth recursion of TreeTransform operations like merge and revert even when the subtree boundaries differ. API essentially the same as TreeTransform.
New class to provide easy transition of superficial commands like diff, status, export. API is a subset of Tree API. Additional methods:
CompositeTree.maybe_composite - return a CompositeTree if the input tree has subtrees, otherwise return the input Tree
- Return a Tree and Branch for the tree-reference with the specified file_id
WorkingTree.pull recurses downwards by default. Subtrees are pulled from the reference location of the specified branch.
MutableTree.commit recurses downwards by default.
WorkingTree.revert recurses downwards by default.
Merger recurses downwards by default. This is not equivalent to running merge in each of the subtrees-- it uses the tree-reference revision-id to determine which revision to merge into each subtree, recursively. Merger should not care whether subtree boundaries have changed. Like WorkingTree.pull, it fetches from the reference location of the branch associated with the tree.
- Return branch for tree reference
- Provide location and saved file path of tree reference branch
- Set location and saved file path of tree reference branch
pull recurses into reference branches, and pulls from the source's reference branches. fetch recursively fetches revisions from the source's reference branches into the target's reference branches.
fetch provides a list of tree-reference revision ids/file-id pairs for the revisions that were fetched
At the storage layer, this is the most similar system to Nested Trees. Both store revisions, and separately store locations. Submodules uses a versioned file to store locations, while Nested Trees uses unversioned branch data. At the UI layer, they are quite different, because submodules does not attempt to integrate subtree support into the core commands.
The wiki page does not give confidence that this is a well-maintained project. It seems similar to config-manager-- its 'snapshot' files are like config-manager's config files, describing what branches to get and where to put them, optionally specifying a revision. No metadata about nesting is stored in the tree. Optionally, a 'snapshot.txt' file may be stored in the containing tree, but it can also be stored somewhere else.
There is no attempt to integrate subtree support into the core commands.
This design was for an integrated feature, but there are apparently 4 implementations as extensions. While it does integrate subtree support into core commands, this may be off by default: "The alternative that I lean towards is to not recurse unless explicitly instructed to. Most probably, only a few commands should arguably even be aware of modules."
- hg module add ~= bzr join --reference
- hg module remove ~= bzr remove --keep
- hg module record's functionality is part of nested commits in nested trees.
Like submodules, this stores location information in versioned data: a .hgmodules directory. Like submodules and nested trees, particular revisions are recorded.
Like bzr, uses per-file metadata. Like submodules and nested repositories, locations are versioned data. Like Forests, revisions are optional. Like nested repositories, there is limited integration into core commands; checkout and update support externals, and commit may support them in the future. However, there is no UI specific to creating and updating svn:externals references.
Unlike all other alternatives, supports partial checkouts. This is because svn natively supports partial checkouts. Also, supports checkouts of tags, because tags are merely a convention in svn.
Supports pulling in "head" subtrees too, not just specific ("known-good") revisions. However, in bzr, you just update the subtree and commit any time, so this approaches the same ability to track development sources, while still guaranteeing that when bzr checks out a tree it gets a reproducible result. So the way bzr handles this is probably more appropriate for bzr than "head" subtree support would be.
Some support for single-file svn:externals (see http://subversion.tigris.org/svn_1.6_releasenotes.html#externals), whereas bzr subtrees must be directories.
Has a --ignore-externals option for checking out without pulling in the svn:externals items (svn checkout is more or less equivalent to bzr branch). You can also use that option on update (more or less like bzr merge).
It may be nice to have an additional command like checkout-or-branch that creates a checkout or branch depending on configuration. This allows new contributors to get bootstrapped quickly, with some trees read-write and others read-only. However, this breaks command nesting, since various commands behave differently for checkouts. Of course, reconfigure can also be used to convert checkouts to branches and vice-versa.
It would be nice to support multiple copies of subtree in a master tree. Currently, this would lead to having multiple copies of a given file-id in the (composite) tree, which is forbidden. Allowing this would involve one of:
- permitting multiple copies of a file-id within a tree
- not treating nested trees as composite trees
- using fake file-ids within the composite tree
- replacing the file-id concept with something else (e.g. path tokens)