Launchpad Entry: https://launchpad.net/products/bzr/+spec/shallow-checkouts
Created: 2006-02-25 by AaronBentley
This page is under discussion.
Make it possible to get a fully fledged branch from a project with a lot of history very quickly, by only downloading enough data for a small number of revisions, and having a pointer to the remaining data. This extra data can then be retrieved on demand. This feature is primarily a performance orientated feature.
Being required to download a project's entire history can be a frustrating and apparently-arbitrary requirement for new contributors or users.
Historical data goes stale; its utility decreases over time. Many participants in a project will only be interested in the most recent history. Downloading the entire history imposes a burden on very active or older projects and casual participants.
Lengthy revision histories cannot be completely compensated for by fast network access or efficient storage, because the costs are approximately O(n*m) where n is the size of the project, and m is the length of its history.
Bzr has some experience with ghost revisions already, and supports them, though they have 'a bad smell'. Implementing support for partial histories will make them a permanent, and constant feature.
The suggested approach is to implement a history horizon. Only the history horizon, it descendants, and their direct parents are required to be present in a branch's repository. Other revisions may be present, e.g. if they're needed for a merge base. It seems like it would be a good idea if the history horizon marker also indicated a revno offset, so that a branch with a history horizon could have the same revno as an otherwise-identical branch without one.
Since this is repository data, it appears that the history horizon should be a repository file. Which suggests that it should permit many horizons to be specified.
The original horizon description referred to a parent list; this does not exist in present-day bzr. Such a list would be ~470K for the bzr project itself, or ~90K compressed with bzip2. It would make merge base selection possible, but the actual merge would still be impossible. External data would be needed for that, but we could presumably use external data for picking a base, too. So it becomes an optimization of a corner case. Probably not a good idea.
The major operations on historical data are:
Merging currently requires that one of the two branch repositories involved in the merge contain the merge base. In a star topology, this is almost always true; the mainline branch be able to merge anything branched from it. So the only potentially problematic case is mesh-merging.
If we are able to convert to a completely annotate-based merge, then there will be no merge base, so neither branch will need to have any past history.
Diff is supported by requiring all of the parents of the history horizon revision and its descendants to be present.
The Knit format includes support for annotated full-text versions. So in order to support annotation, it is recommended to retain the most recent full-text annotation which is required to annotate the files in the history horizon revision. (Or perhaps, all revisions that derive from ghosts.)
- History horizons will provide a significant efficiency gain in some important cases
- Users will want not want to lose the ability to do diff on the history horizon revision (and its descendants).
- Users will want to be able to annotate files in said revisions
- Users who use horizons will be more interested in getting their branch merged than in merging other branches
- It is useful for revnos to be consistent, whether or not horizons are used.
J. Random Hacker uses and likes project FOO. But he there is one bug that really, well, bugs him. He decides to fix it. He runs bzr branch http://example.com/FOO. This produces a branch with a basis-revision of 'ghmak', with only enough data to reconstruct 'ghmak', and a history search path pointing 'http://example.foo.com/FOO'. It takes roughly as long as a lightweight checkout would take.
He runs bzr revno, which replies 465. He knows that 465 is a recent version of the project.
He goes on a plane. Makes changes, commits, makes more changes, commits some more.
He gets off the plane, (maybe pushes his branch somewhere,) submits a merge request. His fix is incorporated into the mainline.
He runs 'bzr log', which searches the history for the revision texts to show log to the user. Every text downloaded is cached locally.
Shallow checkouts would be the same as shallow branches, except for being 'bound'. The data retrieval requirements for shallow checkouts are very similar to those for lightweight checkouts. Their storage requirements are less than 2x the requirements of lighweight checkouts. Yet they have most of the local performance benefits of heavyweight branches. (Obviously, they don't have fast access to non-local data.)
So they appear to be a good compromise between lightweight and heavyweight checkouts, and could become the default.
snoreutils is a fork of GNU coreutils that is dedicated to making the diagnostic messages as boring as possible. GNU coreutils is a project with a very long history. The snoreutils maintainers have no interest in ancient history, and they do not have the space on their web site to host the entire revision history. Besides which, there's a full history at sourcecontrol.net, if anyone's interested.
They do bzr branch core_repo/coreutils snore_repo/snoreutils --horizon -100. This creates a repository branch with the last hundred revisions in it, which they upload to their server and start hacking on.
Old huge projects
Say a project FOO is running for 30 years, accumulated 100 000 revisions and the repository has grown to 1GB1. Besides the fact that it's no longer practical to copy this amount of data when branching, even local operations on the mainline are getting slower, because only the the index of (main) inventory knit is ~10MB. So the project creates a new branch with history horizon say -5000 and moves the big repository into an archive. New revisions may be pushed there from time to time.
Low contribution barrier
A user downloads a standard distribution of project FOO as a tarball called foo-xy.tgz. This contains the files necessary for the project to run, and a very small bazaar manifest file. This manifest specifies the location and revision of the branch the tarball was made from.
If the user intends to fix a bug and commit it back, he would first use a bzr command (possibly 'init --manifest') to create a branch that's shallow (without history) and lightweight (not all files present) from the files in the tarball, without any additional downloads.
He could then contribute back using the normal bzr development procedure.
bzr branch and checkout will default to grabbing a single revision of data rather than greedy history. clone will remain unchanged. bzr branch and checkout will gain a 'greedy' option which will grab all history, and a history-limit (This option needs brainstorming) that will allow precise control for those folk that like doing so. A limit of 0 / null: is equivalent to passing greedy.
Discussed option names:
--shallow - needs to much internal knowledge to 'get it'.
--horizon - suggests inaccessability and also a rolling window, neither of which are true.
* Repository format with history source url list
1 def set_fallback_repositories(repositories): 2 """Set the fallback repositories for this repository. 3 4 fallback repositories are used to access history not accessible to this repository on-demand. 5 write locks that are taken out on this repository will not cause write locks to be taken on the fallback repositories. 6 a read lock on this repository may be silently upgraded to a write lock when data is being retrieved from a fallback location. 7 8 repositories that cannot fallback should raise a subclass of UnsupportedOperation. 9 """
- minimal Tests needed:
- can set fallback on knit
- access a revision tree on a fallback-configured repo loads the data into it.
- accessing a revision tree does not pull data further back than the last full text
- Branch format with history source url list on disk
- change checkout to create an empty branch
The fetcher will need to be modified to respect horizons. Because they support annotation, we may require that at least the output branch (and preferably the input) use knits to store file contents.
When a local branch is made with 1 revision of history in the repository and a limits file in the branch containin the revno and revision id of the last-revision in the projects trunk, we have a commit object, its linked inventory, both with ghost parents. We have many different revisions in the repository fileid knits, each with ghosts for their per-file history.
corner cases in fetch logic. what assumptions are made about file text/delta presence. what assumptions are made about file text/delta fetching in terms of copying content etc.
how much to copy?
Should non-limited branches copy only the needed data, or be greedy. I.e. when merging 10 revisions, we only really need the data for the tip, but the other data would be ghosted and might lost if we dont fetch it. Likewise for limited branches, should we be semi-greedy when merging, or ???
More generally, when do we deliberately create ghosted data locally during fetch operations?
- We are willing to ghost data locally if and only if we can reconstruct that data from one or more of our branches history sources.
- When fetching, gather the transitive closure of revisions from the requested revision [that we want to be able to construct], and then subtract from that set, all the revisions that are available from our history sources to get the set that must be fully populated locally after the fetch. Any additional data needed to construct the target revision will be grabbed from the source repository preferentially, then falling back to our history sources.
- This involves downloading the revision index to get the graph of related revisions, then checking locally to eliminate revisions we can completely construct: and we can eliminate their parents transitively because the only way we get a constructable revision is by making it on top of a constructable revision, or by pull it and everything back to our history sources edge from somewhere else, or by pulling it from a history source. This is flawed - the history sources may change under us - does that matter? A completely safe algorithm is:
- get the full contents of all history sources - read revision index, inventory contents for all inventories, scan out to all referenced file texts. This is 10's of MBs of data for samba - 15K revisions. Accessing that much data before doing a pull from a shallow branch with only 2 revisions in it would be nuts. Equally pulling everything when merging from a full branch would be annoying. This would not be a problem, if the history source's can answer 'has complete revision' cheaply: we can spider the revision requested from the source repository in the history source and stop at the first 'complete revision' if we assume that history sources dont contain ghosts, or we can even do a full check relatively cheaply, though having a fully incremental algorithm would be good.
-- Given a partial-revision-free repository, if we strictly append full revisions - all data to construct the revision and all its transitive parents that are available [i.e. not historically ghosted], then the repository will remain partial-revision-free.
Partial revisions are only permitted to a repository when the revisions, and their transitive parents, are fully constructable from a supplied history source.
This means that a branch with a different history source may see a repository with a partial revision that is not constructable either locally or via a history source, but for any *branch*, all revisions are fully constructable.
full branch A partial branch B, history source of A, Z (added after C branched) partial branch C, history source of A,B
sequence: init A branch A to B branch B to C add D as a source for B, and fetch revision Z's revision.xml to B.
C cannot at this point construct Z, while B can. But, C has no reference to Z either. C pulls from B, which means C requires Z's transitive history be constructable... this is achieved by:
- at fetch time, the repository for B is a stack containing B, A, D.
- the repository for C is a stack containing C, B, A.
- 'constructable' for C is satisfied when Z can be made from C,B,A, which it cant even though B contains the revision.xml for Z, so the data is fetched from the stack for B - B,A,D - which will actually resolve at step D... we may want to share the object for B and A to make optimal use of caching logic.
simple case: revision knit - each revision is standalone, just copy binary image as-is. ditto signatures. inventory knits - may have up to many hundred inventories to get to a full text. To construct *any* local version we must download *all* the texts from a fulltext-cache to the one we want. file text knits - same situation as inventories.
- every time we read a knit blob, insert it locally.
- optimal network use: if we had to transfer it, we cache it.
- only insert those locally that match some policy.
- optimal local storage size - we only store it if we decide we want it.
Network costs more, so we should start with the former approach.
ghosts: should we have a 'greedy' option to pull/merge to trigger grabbing all history, for filling ghosts out. Or perhaps a 'fetch' command that just grabs related history without altering the branch.
has_revision becomes vastly more complex:
- has xml
- has xml and referenced material
- any of history sources have xml and referenced material
performance of 'has revision' checks will be important.
assertion: 'has revision' checks are currently done before operations that require any content from a revision. They are also done as part of fetch optimisation.
A new branch format will be required because once revision-history is truncated, older bzrs will not calculate revno correctly. It may make sense to remove revision-history entirely in this branch format.
It probably makes sense to introduce a list of 'related-repositories'. This would be a list stored in the branch. When a revision was required that was a ghost revision, it could be fetched from one of the related-repositories. The decision to retrieve a revision from the related-repositories should probably be made at a high level, so this would not introduce a get_revision / get_local_revision split in the repository API.
branches grow a list of history sources, branch operations append the source to the list.
- cloning a limited branch should preserve the limit ?
- revision-history pivoting seems interesting - making sure revnos stay correct
- revision history should be truncated too.
- what should log do ?
- repository-based-limits imply a lock for all repo formats to manage the limit-list, per branch is better as we have to lock branches anyway, and branches are the right place to set policy for when to pull beyond the limit, and when not.
- should fetching data to do a merge put it in the local repo? should it get *just* the data, or the data between the common base and the current history point. i.e. if a tree has history A,B,C,D and we have only D, doing a merge which requires A as the base - should that pull in A,B,C or just A? (noting that we have any content from B,C, still valid in D anyway.
- What about reannotation of data - do we require reading the full data to do that ?
- generating dotted-revnos without the full graph may be 'tricky'. When showing a log with a merge whos origin is below the horizon, the number of parallel branches from that origin revision is unknown and thus the merge's revnos cannot be calculate. Anything above a dominator revision can be correctly calculated if the dominator is above the history horizon. It might make sense to cache the graph down to the most recent dominator - but this requires having graph data in the repository would the associated tree data, unless we also populate those trees.
- inserting lots of partial-graphs into a knit may lead to a form of fragmentation. will knits work when partial graphs join up, and what are the performance ramifications?
Questions and Answers
Some time ago I heared GCC CVS repository is something like 800MB, so it's not so much out of question (1)