Pieces of bzr, Part 2
Presumably before you get here, you've read PiecesInBrief, which gave an initial set of working definitions of our pieces and an overview of how they fit together. This document will attempt to expand those definitions so we have a fuller understanding. And then we'll try applying terms to other VCS's for comparison.
I'm not yet entirely happy with this doc. I think it's longer than it should be for what it's trying to convey. This probably calls for some editing...
Now that we've handwaved the approximate positions, we can talk in more detail about how the pieces work.
A Revision conceptually contains all the files you had when you created it. You can think of this as if it contained a tarball of your working files (though you can have files in your Working Tree that you don't tell bzr to track, and those don't go into the Revision). There are a number of ways revisions can be created (for instance, converters from other systems), but in practice just about every revision will be created by running bzr commit.
It also contains some metadata describing it. First, it has what's called a Revision ID, which is a globally unique string that can be used as a key to refer to this revision. No other revision will ever use that same identifier. This is how (as described earlier) one Branch can know if it's related to another Branch: by comparing the IDs of the Revisions in it.
A Revision will have a timestamp, telling when you created it. It'll have a committer identifier, telling who created it. It'll have a log message, where you describe what's special about this Revision that you wanted to create it. It has a checksum, so that disk corruption can be detected. And various other fields.
Importantly, it also stores a reference to the Revision it's derived from. Very few Revisions are actually created from thin air. Most of the time, you have an existing Revision that you make some changes to, yielding a new one. This is called a parent Revision. And that parent revision also has a pointer to a parent of its own, and that to another parent... all the way back to the beginning of a project.
Most of the time, a Revision will have one parent. This happens because you have some existing Revision, you edit (in your Working Tree) the files to make some change, then you commit your changes, creating a new Revision, with that pre-existing Revision as its parent. However, a Revision can also have no parents, in the case of a first Revision when you're starting from scratch. And a Revision can also have more than one parent (it's possible, but very rare, to have more than 2), in the case where it's the result of a merge.
I'm not going to go into describing what merges are here. If you don't already know, just store "merge == multiple parents == merge" in your brain for now. It'll be explored in other documents.
This has some important implications. For one thing, it means that just by having one Revision, you can walk back to its parent, and the parent's parent, and the parent's parent's parent, and so on all the way back to the start of the project. So you don't need to refer to an ever-growing list to define the history, just the latest Revision. For another, it means every Revision "depends on" all the revisions that came before it. That means that when one Branch is comparing itself against another Branch, as soon as they find one Revision in common, they can stop following the links because they know that everything prior to that is the same as well.
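To make the parent-walking concrete, here's a minimal sketch of the idea in Python. This is a toy model, not bzr's actual code: the Revision class, the ancestry and first_common helpers, and the short IDs like "r1" are all made up for illustration (real Revision IDs are long, globally unique strings).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    revision_id: str
    parents: tuple = ()  # parent Revision IDs; empty for an initial Revision


def ancestry(store, revision_id):
    """Return every Revision ID reachable from revision_id, itself included."""
    seen = set()
    todo = [revision_id]
    while todo:
        rid = todo.pop()
        if rid in seen:
            continue
        seen.add(rid)
        todo.extend(store[rid].parents)
    return seen


def first_common(store, tip_a, tip_b):
    """Walk back from tip_b, stopping at the first Revision tip_a also has."""
    known = ancestry(store, tip_a)
    visited = set()
    todo = [tip_b]
    while todo:
        rid = todo.pop(0)
        if rid in known:
            return rid  # everything before this point is shared too
        if rid in visited:
            continue
        visited.add(rid)
        todo.extend(store[rid].parents)
    return None  # the two histories are unrelated


revisions = [
    Revision("r1"),                  # no parents: started from scratch
    Revision("r2", ("r1",)),         # one parent: an ordinary commit
    Revision("r3a", ("r2",)),
    Revision("r3b", ("r2",)),
    Revision("r4", ("r3a", "r3b")),  # two parents: a merge
]
store = {r.revision_id: r for r in revisions}

print(sorted(ancestry(store, "r4")))
print(first_common(store, "r3a", "r3b"))
```

Note that nothing here keeps a list of "the history"; knowing "r4" is enough to dig up everything behind it.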
For those with a technical background, it might sound like I'm describing a Directed Acyclic Graph (DAG). The reason it sounds like that is because I am. The ancestry of a Revision is a graph, with each Revision being a vertex, and the parent links in the Revision defining directed edges backward in time, all the way to initial revisions which are leaf nodes.
- If that didn't make any sense, don't worry; you don't need it.
One required attribute for all this to work is that Revisions are immutable. When you create a new Revision, it can have anything in it you might imagine. But once it's created, it's carved in stone; you can never change anything in it again. From the moment you create it and it gets assigned a Revision ID, that Revision ID will refer to, and only to, that exact Revision, now and forevermore, anywhere in the universe. You can't change the content of the files you committed, or change the list of parents. Doing that would mean that the same Revision ID (which other things may now have picked up and started referring to) would point at something different than they expect, and that's a no-no.
This also means you can add new Revisions "on top of" existing Revisions all day long, but you can't change how the connections back into the past on existing Revisions are hooked up. You can't take some Revision that's in your history, and pull it out and throw it away, unless you also throw away everything "later" than it.
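As a rough illustration of what "carved in stone" means, here's a small Python sketch using a frozen dataclass. The Revision class and its fields are hypothetical stand-ins, not how bzr stores anything; the point is only that changing an existing Revision is refused, while stacking a new one on top is fine.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Revision:
    revision_id: str
    parents: tuple
    message: str


rev = Revision("rev-2", ("rev-1",), "Fix typo")

# Trying to change an existing Revision is refused.
try:
    rev.message = "Something else entirely"
    changed = True
except FrozenInstanceError:
    changed = False

# Adding a new Revision "on top of" it is always allowed.
child = Revision("rev-3", (rev.revision_id,), "Next change")

print(changed, rev.message, child.parents)
```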
With that more thorough definition of Revisions, we can return to Repositories. Actually, there's not much to add to our previous description; a Repository is still basically a giant bucket that you keep all your Revisions in. But we can elaborate a bit.
It's not precisely just an unorganized clump, since the Revisions themselves have their Parent references, so just by picking a Revision out of the Repository, you can figure out its whole history. But the Repository doesn't know or care anything about that; all it knows is that it either has or doesn't have a Revision with a particular Revision ID.
We made an earlier reference to Repositories being shared by multiple Branches. This allows us to avoid storing multiple copies of the history. Imagine that we have a history of say 100 Revisions, and we have two branches based on that, each of which adds 1 Revision. If each Branch has its own separate Repository, we would have two Repositories, each holding 101 Revisions, of which 100 are exactly the same across both (a total of 202 Revisions eating up space on our drive and IO resources to read/write).
If both Branches instead use the same Repository, on the other hand, we only have to store those 100 common Revisions once, and then we have one new Revision from each of the two Branches, so we're only storing 102 Revisions with no duplication. This saves space on the drive, and also saves having to copy all that data around when we make a new Branch. Now, in terms of actually using these branches, committing in them and merging between them, everything acts the same whether they share one Repository, or each have their own. Where the Repository boundary falls doesn't affect bzr's user-level behavior; sharing just saves time and space.
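The arithmetic above can be checked with a few lines of Python. This is only a counting exercise with made-up Revision IDs; each Repository is treated as a plain set of IDs.

```python
# 100 Revisions of common history, with illustrative IDs.
common = {f"rev-{n}" for n in range(1, 101)}

# Each Branch adds one Revision of its own on top.
branch_a = common | {"rev-a"}
branch_b = common | {"rev-b"}

# Separate Repositories: each one stores its Branch's full history.
separate_total = len(branch_a) + len(branch_b)

# Shared Repository: one bucket holding the union, so no duplicates.
shared_total = len(branch_a | branch_b)

print(separate_total, shared_total)
```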
When we have a configuration like this, where a Repository exists independently of a particular Branch, we call it a Shared Repository. These are created via the bzr init-repo command. Judicious use of Shared Repositories is essential to using bzr efficiently. However, it's not critical to get it exactly right the first time; through the use of the bzr reconfigure command, it's possible to switch a Branch that has its own internal private Repository around to using a Shared Repository (and vice versa). So you can rearrange your workspace down the road if necessary.
- Details of setting up repositories aren't covered here. This doc is about the concepts. How to set it up and examples of using it will be covered elsewhere.
Now You See It, Now You Don't
One thing of note here is that, aside from initially setting up a Repository, you almost never run a command that directly interacts with it. Whether you create a Shared Repository with init-repo, or you create a Branch with its own private internal Repository with init, from that point on you only run commands on the Repository in special cases. Practically all your actual interaction is with either a Branch or a Working Tree.
Now that we have a more thorough definition of Revisions, we can more precisely say how Branches work.
First, we said before that a branch describes a particular sequence of Revisions. Now, with the more precise knowledge of how Revisions work behind us, we can say it's not even that involved.
Because each Revision tells us what its parent (or parents) are, from a single Revision we can know its entire history. So the Branch doesn't need to store a giant list; it just needs to point at the 1 Revision which is currently the head, or latest revision on the Branch. And then from that Revision, the entire past history can be dug up.
A Branch standing all alone can't do much. It can point to a Revision, but it doesn't store Revisions. It has to have a Repository, which is where all the Revisions are. Whether it's a private Repository colocated with the Branch, or a Shared Repository elsewhere that other Branches can also use, that's where new Revisions get stored, and old Revisions get found.
When you make a new Revision (e.g. via commit), it takes that previous Revision that used to be the head of the branch, and makes that its Parent. Then the new Revision is stored in whichever Repository this particular Branch is using. And finally, the Branch's pointer to the head is changed to point at this new Revision. And now everything's set to move history forward the next step!
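Here's a toy sketch of those commit steps in Python. The Repository and Branch classes and their method names are invented for this example and are not bzr's real API; the Repository is just a dictionary from Revision ID to parent IDs, and the Branch is just a pointer to a head.

```python
class Repository:
    """A bucket of Revisions, keyed by Revision ID."""

    def __init__(self):
        self.revisions = {}  # revision_id -> tuple of parent Revision IDs

    def add(self, revision_id, parents):
        self.revisions[revision_id] = tuple(parents)


class Branch:
    """A pointer to a head Revision, plus the Repository it stores into."""

    def __init__(self, repository, head=None):
        self.repository = repository
        self.head = head

    def commit(self, revision_id):
        # 1. The old head becomes the parent of the new Revision.
        parents = () if self.head is None else (self.head,)
        # 2. The new Revision is stored in this Branch's Repository.
        self.repository.add(revision_id, parents)
        # 3. The head pointer moves to the new Revision.
        self.head = revision_id

    def history(self):
        """Dig up the past by following first parents back from the head."""
        found = []
        rid = self.head
        while rid is not None:
            found.append(rid)
            parents = self.repository.revisions[rid]
            rid = parents[0] if parents else None
        return found


repo = Repository()
trunk = Branch(repo)
trunk.commit("rev-1")
trunk.commit("rev-2")

# A second Branch using the same Repository, starting at trunk's head.
feature = Branch(repo, head=trunk.head)
feature.commit("rev-3")

print(trunk.history())
print(feature.history())
```

The second Branch costs nothing but a pointer; the only new thing stored in the Repository is "rev-3".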
Again, there's not much to add to our previous description of Working Trees. Just some elaboration.
As a Branch is incomplete without a Repository, so a Working Tree is incomplete without a Branch. A Working Tree only has the files; it can't look at any history, and it can't move forward to new Revisions. It can only sort of float around. It's only by association with a Branch that it can have some frame of reference.
Just as the Branch has a head Revision that points at where it is, the Working Tree has a base Revision that tells us what its "original" state is. From that base, you make changes; you edit files, you move them around, you add new files... whatever is involved in moving your project forward. The Working Tree keeps track of what changes you've made, and stores them up so that when you run commit it's ready to make a new Revision.
And that Revision goes on the Branch your Working Tree is associated with. And then the Branch stores it in its Repository as we covered above. The Working Tree doesn't itself know anything about the Repository; it just talks to the Branch.
It's also possible for the Revision the Working Tree is based on to be different from the Revision that's the head of the Branch. This can happen in a variety of ways; for instance, someone elsewhere, from another Working Tree or via push, could add a new Revision to the Branch and leave you behind. Or you could use update -r to switch the Working Tree to an older revision. In this case, you can't commit, because the Branch would then have 2 head Revisions: the pre-existing head and the new Revision you're creating. Neither would be an ancestor of the other, so they'd have diverged. A Branch can't diverge from itself, so an attempt to commit a new Revision will fail unless the Revision the Working Tree is based on is the current head of the Branch.
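The out-of-date check can be sketched like this. Again, this is a toy model under my own assumptions: the class names, the OutOfDateError exception, and the update method are hypothetical, parent links are left out for brevity, and a real update would also have to carry your uncommitted changes along.

```python
class OutOfDateError(Exception):
    """Hypothetical error: the Working Tree's base is not the Branch head."""


class Branch:
    def __init__(self, head):
        self.head = head  # Revision ID of the current head


class WorkingTree:
    def __init__(self, branch):
        self.branch = branch
        self.base = branch.head  # the Revision the files started from

    def commit(self, new_revision_id):
        # Committing from an old base would give the Branch two heads.
        if self.base != self.branch.head:
            raise OutOfDateError("working tree is out of date; update first")
        self.branch.head = new_revision_id
        self.base = new_revision_id

    def update(self):
        # Catch up to the Branch head (merging of local edits is omitted).
        self.base = self.branch.head


branch = Branch("rev-1")
tree = WorkingTree(branch)

branch.head = "rev-2"  # someone else adds a Revision, e.g. via push

try:
    tree.commit("rev-3")
    rejected = False
except OutOfDateError:
    rejected = True

tree.update()
tree.commit("rev-3")  # now the base matches the head, so this succeeds

print(rejected, branch.head)
```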
Comparison With Other VCS's
Note: I'm no expert on differences between these systems, or indeed in most of the systems described. I'm reasonably confident in my assessments, but they shouldn't be taken as gospel. And because of the differences in how terms are used between the systems (which is most of what I'm trying to address here), people can seem to be disagreeing when they're actually saying the same thing. So watch out.
In Subversion (like in CVS before it) the primary division is between the repository and the checkout. An svn checkout fills the same niche as a bzr Working Tree, so that's an easy comparison to make. The svn repository fills the role of a bzr Repository, and also contains one or more Branches internally. The svn repository can be treated as a whole, or any subdivision of it, whereas in bzr the granularity you work with is the Branch; neither more nor less. svn's branches are conventional paths within a single big versioned glob. So its model of things, in gross and in fine, is rather different from your average DVCS.
When you run git init, you're given a git repository. This use of the word repository is different from what bzr means by Repository. A git repository contains the equivalent of a bzr Repository (the object store). It contains one or more git branches, each of which is comparable in concept to a bzr Branch; they're just stored as internal objects in the git repository rather than being visible top-level elements like they are in bzr. And finally, the git repository, at its root, has a working tree (except for a bare repository, which doesn't have any working tree files), which is pretty much the same as a bzr Working Tree.
The overall model of history, the meaning of Revisions and how they connect to each other, and such abstractions are actually pretty much the same. The implementation of them is somewhat different, and some fine details are very different, but the abstract model is very similar, so if you can get your mind past terminological confusions, it's not too hard to move from one to the other.
The Term 'Repository'
The Repository term is the commonest source of problems here. Recall back in PiecesInBrief we talked about the colloquial term branch, which is used in a loose sense to generally mean a Branch and/or its Working Tree (usually for a colocated pair). People used to git will tend to use the term repository for this, since in git repository means a working tree with one or more associated branches, while for bzr Repository means specifically the Revision store.
Usually, it's not too hard to figure out what somebody actually means. Still, it's best to avoid that sort of blending of terms. For one thing, it can lead to confusion, when one person says repository in a git sense, and the other person thinks they mean Repository in a bzr sense. And it can confuse you the other way too; if a bzr person starts using the term repository in a discussion, they almost certainly mean a bzr Repository, and often a Shared Repository, so if you interpret it in a git repository sense, you'll end up talking past each other.
So, for quick correspondence (bzr term first, then the git term):
- Branch (in the strict sense): branch
- branch (in the loose sense): repository (normal or bare)
- Working Tree / Checkout: working tree / checkout
In monotone, the repository is a first class, opaque, external object, much like in svn. mtn's branches, like git's, are internal objects in the repository. A mtn checkout is essentially the same as a bzr Working Tree or git's working tree; it's working files on a specific branch, where you can edit and commit. Revision ancestry and connection works much like in bzr or git, with one exception.
Unlike in bzr or git, a Branch (the conceptual object) isn't defined by ancestry, but rather by certificates applied to each revision; any revision carrying a certificate that claims membership in a given branch is on that branch. This means that a mtn branch can have multiple head revisions at one time. The various implications of this are way outside the scope of this doc. Aside from that, though, the model is fairly similar.
I don't know hg deeply enough to write a thorough comparison here. A hg clone is roughly the same as a bzr branch (in the loose sense); it contains the equivalent of a Branch, with an internal Repository and an associated Working Tree. It has no equivalent to bzr's Shared Repository, though it gets some of the same effects using hard links.
hg also has an internal named-head sort of operation which I believe is something like a cross between git's named branches in a single repo, and mtn's multiple heads in one branch. I could be completely wrong about that, though, so don't put too much credence in it.
That completes this document. Having read these two parts, you should have an understanding of the Revision as the basic entity bzr manipulates, and of the triumvirate of Repository, Branch, and Working Tree as the three constructions you use in bzr to create and assemble Revisions.
I believe this gives you a conceptual background to better understand bzr. Without this, a lot of situations you get into and commands you're told to run seem like bizarre, ad-hoc special cases. But by understanding what you're trying to do with Revisions, and where your Working Tree, Branch, and Repository (or their plurals) are in your particular situation, the actions and consequences fall out more as applications of general rules.