Bazaar

Bazaar

 




Wiki Tools

  • Find Page
  • Recent Changes
  • Page History
  • Attachments

Warning: old document

This document was written before Bazaar started using weaves for backend storage. Bazaar no longer uses weaves by default, because they were found to have poor performance and could not provide append-only guarantees. (When a weave is updated, new data may be added anywhere in the file.) However, we still can provide weave style merging "on the fly", with 'bzr merge --weave'.

What a weave is

A weave is a way of storing all the revisions of a file in a single file. It is a form of delta compression, so it is quite efficient storage. It is planned as a new backend storage method for bzr. Weaves are a fairly old idea; they were probably first used in SCCS, and are also used in BitKeeper. (SCCS weaves store more data than this, but the additional data isn't central to the 'weave' concept)

Weaves work by storing all the lines that have ever appeared in a file, in order. They also store metadata indicating which lines are added and deleted by each revision. Since the lines are already in order, extracting a revision is merely a matter of switching on the appropriate lines.

Each line is uniquely identified by the revision that added it, and its position in that revision. Two lines with the same contents are treated as different lines, because they are in different positions or were added in different revisions.

Having the line-by-line metadata makes it easy to do 'annotate' or 'merge' operations.

Example

Here's a weave in an example format (not bzr's format)

LINES
1. #include <stdio.h>
2. int main(int argc, const *argv[])
3. {
4. /* It's bad form to printf a string directly */
5. /* printf is overkill for this */
6.     printf("Hello, World!\n");
7.     printf("%s", "Hello, World!\n");
8.     puts("Hello, World!");
9.     return 0;
10. }
REVISIONS
1: add 1 2 3 6 9 10
2, ancestor 1: add 4 7 delete 6
3, ancestor 1: add 5 8 delete 6

If we extracted Revision 1, it would look like this (except without the line ids):

1. #include <stdio.h>
2. int main(int argc, const *argv[])
3. {
6.    printf("Hello, World!\n");
9.    return 0;
10.}

Revision 2

1. #include <stdio.h>
2. int main(int argc, const *argv[])
3. {
4. /* It's bad form to printf a string directly */
7.     printf("%s", "Hello, World!\n");
9.     return 0;
10.}

Revision 3

1. #include <stdio.h>
2. int main(int argc, const *argv[])
3. {
5. /* printf is overkill for this */
8.     puts("Hello, World!");
9.     return 0;
10.}

Here is what the real bzr weave would look like, using bzr's current (2005-08-18) weavelib:

# bzr weave file v5
i
1 617c35cf3f0da48f5adaeaa8a18edaaaeea4df84
n base1

i 0
1 66409c5ff598479fe8fd7274f4a8ef3c91237c7f
n rev2

i 0
1 7c4c736573e5181faec34e01e3052385f008caa6
n rev3

w
{ 0
. #include <stdio.h>
. int main(int argc, const *argv[])
. {
[ 1
[ 2
.     printf("Hello, World!\n");
] 1
{ 1
. /* It's bad form to printf a string directly */
.     printf("%s", "Hello, World!\n");
}
] 2
{ 2
. /* printf is overkill for this */
.     puts("Hello, World!");
}
.     return 0;
. }
}
W

The curly braces delimit insertions and the square braces mark deletions. Note that insertions are always properly nested (LIFO), but deletions can overlap with each other and with insertions.

Weave Merges: Why and How

Three-way merges are the most common kind in SCMs, but they have disadvantages. They don't work well with cherrypicking, or when two people accidentally merge from each other at the same time. (Monotone's Nathaniel Smith describes three-way merging problems here and here)

Weave merges don't need three versions of the file, so they don't have these limitations. Roughly speaking, they work by

  • First, turning on all the lines that are added in either revision
  • Second, turning off all the lines that are deleted in either revisions

These operations are sequential, so a line that is deleted in either revision will never appear in the result.

This works because each line is distinct, no matter what its contents are.

For a more detailed explanation, see Precise Codeville Merge

Here is the output from the plan-merge operator for the example above:

% weave plan-merge aaron.weave 1 2
     unchanged | #include <stdio.h>
     unchanged | int main(int argc, const *argv[])
     unchanged | {
   killed-both |     printf("Hello, World!\n");
         new-a | /* It's bad form to printf a string directly */
         new-a |     printf("%s", "Hello, World!\n");
         new-b | /* printf is overkill for this */
         new-b |     puts("Hello, World!");
     unchanged |     return 0;
     unchanged | }

We also need to flag conflicting regions for the attention of the user. This case does conflict, between the mutually unchanged lines.

It is the desire to support weave merging that motivates us to use weaves instead of other delta-compressed formats.

Disadvangages of Weaves

Current implementations of weaves permit any portion of the file to be rewritten at any time. This means that using them with standard network protocols requires re-downloading the entire weave every time it is updated. However, weaves tend to be small (compressed weaves can be smaller than the original), and broadband connections should be able to deliver them quickly.

The other disadvantage is that the size of a weave file is proportional to the amount of changes that were made to the file. For projects with a long history, one might expect this to be prohibitively slow, but this does not seem to be the case. For example, the CHANGELOG from the gcc project, which has 1112 revisions, can be extracted in 0.2 seconds.

Both of these disadvantages can be avoided by using more complicated, append-only weave formats. It is not clear whether they will be worthwile, since neither disadvantage is expected to be severe.

In summary form:

Advantages

  • Fast annotate and weave-merge.
  • Supports regular three-way merge equally well.
  • Since we're rewriting the storage file every time, we can compress it in one gzip history and so get very good compression.
  • Very compact.

Disadvantages

  • Current implementation rewrites the storage file every time:
    • Clients can't do an http range request to fetch just the new region.
    • Upload over sftp must rewrite the entire file.
    • However, rsync of uncompressed files or files with rsyncable gzip will probably do well.
    • This is not incremental backup-friendly.

      (Really? Incremental backups that work at the whole-file level will see that some weaves within the tree have changed, and store a full copy. Backups that use rsync or xdelta to detect changes in a file should do reasonably well. -- MatthieuMoy: Well, I'm not an expert of backup systems, but it seems many sysadmins have a preference for whole-file level incremental backup. At least, that's what I have in my lab. Any snapshot tool based on hardlink will have similar innefficiency problem.

      • I would say, really. Every time you add a revision to a gzipped weave, it may change completely, even though the changes added may be small. This is far worse than say, an Arch archive --AaronBentley)

  • Since the same file is being rewritten, a corruption will destroy all earlier history.
    • Wow. Am I the only person scared by this? I really hope you give users the choice of multiple backends ...

  • The entire history must be re-read to extract any version. However for special cases, such as inventory files, where we do not care about annnotations or textual merges we could use other formats, or we could split the weave into several files.

Future weave options

It would be possible to also track cherry-picks, and exclusions (anti-merges) in a weave. These make future mesh merging more complex.

Code changes

This has already been completed, and is part of bzr 0.1.

DONE

Separate out code that handles packing revisions and inventories into XML, so we can read both old and new trees

DONE

In a new-style inventory, store the last changed revision

DONE

Remove parent revision sha checks to make upgrades easier

Check all code gets texts and inventories by going through the Branch object, rather than directly to the store

DONE

Convert old branches to new format