RobertCollins

For more details: https://launchpad.net/~lifeless.

Performance Guidelines

bzr should work well on very large and deep trees. Current trees *in use today* have 55000 paths and 50000 revisions, so this should be considered a minimum set for testing scaling and performance.

Measuring matters

If you don't have real-world measured numbers for the code you are profiling, it's entirely possible to pessimise it rather than optimise it; so the first thing should always be to create a benchmark. For instance, this is what I use to benchmark _iter_changes as I work on it:

PYTHONPATH=~/source/baz/dirstate/ python -m timeit \
  -s "import bzrlib.workingtree" \
  -s "tree=bzrlib.workingtree.WorkingTree.open('/home/robertc/source/bigtree')" \
  -s "tree.lock_read(); basis=tree.basis_tree()" \
  "list(tree._iter_changes(basis))"

If your initial benchmarks don't fit what you are observing at a high level, be skeptical and scrutinize them: it's entirely possible you are measuring the wrong thing (remove the 'list' from the benchmark above for an example), or your high-level observations may be confounded by something else.

Scaling

  1. Avoid designs whose cost scales with tree size or history size.
  2. Avoid algorithms with bad big-O complexity.

Minimise overheads in inner loops

  1. tuples are fast
  2. lists are 1/5 as fast
  3. objects are 1/6 the speed of lists
  4. Function calls are relatively slow.
  5. special-case options can hurt: you have to execute bytecode to decide what to do, even when the caller already knows what it wanted. If a different way of choosing the right code path can be found without duplication etc, it will reduce overhead. However, a check of a boolean flag is still faster than calling into a parameterisation object.
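
To reproduce the first three claims on your own machine (the ratios vary by interpreter version and hardware, so treat these commands as a sketch), compare construction costs directly - building from variables avoids constant folding:

python -m timeit -s 'a = 1; b = 2; c = 3' '(a, b, c)'
python -m timeit -s 'a = 1; b = 2; c = 3' '[a, b, c]'
python -m timeit -s 'class Obj(object):' \
                 -s '    def __init__(self):' \
                 -s '        self.a = 1; self.b = 2; self.c = 3' 'Obj()'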

Reducing function call overhead

One way: rather than calling a single 'do_x' function for each item in a loop, call a 'do_all_x' function for all items at once. This reduces the overhead significantly. A do_x API can still be offered by thunking through to do_all_x.
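
A sketch of the shape (do_x/do_all_x are the illustrative names from above, not bzrlib APIs):

def do_all_x(items):
    # One call processes every item: the per-item cost is loop
    # bytecode only, not a fresh Python frame per item.
    results = []
    append = results.append  # hoist the attribute lookup out of the loop
    for item in items:
        append(item * 2)     # stand-in for the real per-item work
    return results

def do_x(item):
    # The convenience single-item API is a thunk through the batch one.
    return do_all_x([item])[0]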

Alternatively, iterators can be used to reduce the function setup overhead.

$ python -m timeit -s 'def foo(bar, baz):' -s '  if 1:' -s '    return 1' 'foo(12, 34)'
1000000 loops, best of 3: 0.627 usec per loop
$ python -m timeit -s 'def foo(bar, baz): ' -s '  while 1:' -s '     yield 1' -s 'i=foo(12, 34)' -s 'n = i.next' 'n()'
1000000 loops, best of 3: 0.472 usec per loop

As a result, a 30% improvement can be made if it's possible to structure things as an iterator rather than function calls. While you can play tricks to create an iterator that data is pushed into, this is actually slower than calling functions: if the routine can be made into a pipeline of iterators, do that; if not, consider inlining function calls, or just calling functions.

What is the impact here? I recently analysed performance on the same machine as the timeit results above, for dirstate- and xml-based trees. The total time to analyse a single path, excluding startup costs:

  • xml: 0.22ms (220usecs)
  • dirstate: 0.12ms (120usecs)

Combining that with the timeit data above - dividing the 120usec dirstate budget by the per-operation costs just measured - we can see that to retain this performance we can afford a maximum of roughly ~240 iterator steps, or ~200 function calls, for a single path.

Never LBYL

Look-before-you-leap is very bad when 'looking' involves getting data from high-latency resources. Of course, profiling and testing may show that the common case is such that looking first makes sense - but it should not be the default way code is written.
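
For example, in filesystem terms (a generic sketch; the same shape applies to transports and other remote resources):

import errno
import os

path = 'some-file'  # illustrative

# LBYL: two operations per path, plus a race between the check and the use.
if os.path.exists(path):
    text = open(path).read()
else:
    text = None

# EAFP: one operation; the error path costs only when it actually happens.
try:
    text = open(path).read()
except IOError as e:
    if e.errno != errno.ENOENT:
        raise
    text = None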

Profiling HOWTO

Writing a little wrapper can be useful for getting at the guts:

from bzrlib.workingtree import WorkingTree
from bzrlib.lsprof import profile

t = WorkingTree.open('.')
t.lock_read()
# Warm-up pass, so the profile below measures steady-state behaviour
# rather than cold caches.
list(t._iter_changes(t.basis_tree()))

_, stats = profile(list, t._iter_changes(t.basis_tree()))
stats.sort()
stats.pprint()

t.unlock()

Running this as PYTHONPATH=/path/to/dirstatedir python profile.py > output.lsprof gives a file you can analyse easily. In vim, :sort! /.*\%32v/ will sort the output on total time spent in each routine, which is very useful.

Python hints

These should/will go in the dev guide at some point.

dicts are not sets

While set methods like difference_update and intersection accept dicts as the supplied argument, their performance deteriorates dramatically. I saved 50% of the runtime of bzr pull in my experimental format by changing one argument from a dict to a set. At worst my back-of-envelope figures say that's 700-300=400 vs 2000-700=1300, roughly a 1:3 performance difference. I think it's much greater than that, but I am not interested in establishing just how big a deal it is - 3 times is plenty enough reason to change for me.
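
To see the effect on your own interpreter (a sketch; the sizes are arbitrary):

python -m timeit -s 's = set(range(100000)); d = dict.fromkeys(range(100000))' 's.intersection(d)'
python -m timeit -s 's = set(range(100000)); d = set(range(100000))' 's.intersection(d)'

The fix is usually one conversion at the call site, e.g. s.intersection(set(d)).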

Blazing local disk access (Dirstate Version 2)

Path string kinds

Should paths be unicode? I think they should be utf8 internally, and only cast to unicode upon use.

Rationale: We should only be accessing a given path once or twice, and only showing it to the user once, within bzrlib. We may, however, split it, combine it, and otherwise manipulate it a lot internally. unicode operations are much slower than bytestring operations, so we get an overall win by using utf8 bytestrings internally and only making paths unicode at the UI and disk boundaries.

Judicious caching of unicode values - e.g. storing the tree base path as both unicode AND utf8 - would reduce the number of common strings converted to unicode even when the filesystem is not utf8.

Lastly, utf8-based platforms avoid all unicode conversions in this scheme: the UI may be utf8, and the disk utf8, so we never need a unicode string at all. (What about normalisation? We may need to make strings unicode at some point.)

We may need a mapping from internal paths to actual disk paths to deal with normalisation issues.

In the discussion below, references to paths can be 'internal' or 'external'. External paths are for showing to users or using to access the filesystem; internal ones are purely for internal logic, e.g. dirstate index entries.
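
A minimal sketch of the boundary rule (function names are illustrative, not bzrlib APIs):

def to_internal(path):
    # UI/disk boundary -> internal: encode exactly once on the way in.
    return path.encode('utf8')

def to_external(path_utf8):
    # internal -> UI/disk boundary: decode exactly once on the way out.
    return path_utf8.decode('utf8')

internal = to_internal(u'dir/caf\xe9')
# All internal manipulation stays on cheap bytestring operations:
dirname, basename = internal.rsplit('/', 1)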

Users

Operation | Internal path | External path | disk kind | logical kind | size | executable | sha1 | statcache | ignorecache | versioned
----------|---------------|---------------|-----------|--------------|------|------------|------|-----------|-------------|----------
add       | X             | X             | X         |              |      |            |      |           | X           | X
unknowns  | X             | X             | X         |              |      |            |      |           | X           | X
diff      | X             | X             | X         | X            | X    | X          | X    | X         |             | X
status    | X             | X             | X         | X            | X    | X          | X    | X         | X           | X
commit    | X             | X             | X         | X            | X    | X          | X    | X         |             | X

Logical kinds may replace 'directory' kinds with 'tree-reference' when there is a subtree at that path. This also somewhat conflicts with the versioned flags.

  • I think commit needs 'ignorecache' if you supply the --strict flag. -- JohnArbashMeinel

Other use cases still to consider:

  • list just files: doesn't need to get details for non-file kinds

Stat cache usage: operations that want to determine 'difference between trees' check the following:

  1. stat the file - gives disk kind, size, statcache, executable
  2. compare logical kind - may stat path/.bzr
  3. compare executable (note that we could in theory bail out here on an exe-bit change, without caring about size or sha1, but we don't, for nicer reporting of what has changed).
  4. check size; if different, stop - it's a content change.
  5. check sha1 - which requires reading the file iff the statcache differs.

The sha1 value is only used by diff and status when the kind and size values are the same.
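
A sketch of that decision ladder (the entry fields and helpers here are hypothetical, not the real dirstate layout):

import os
import hashlib

def statcache_fields(st):
    # hypothetical: the stat fields the cache is assumed to record
    return (st.st_mtime, st.st_ctime, st.st_ino, st.st_dev, st.st_mode)

def sha1_of_file(path):
    f = open(path, 'rb')
    try:
        return hashlib.sha1(f.read()).hexdigest()
    finally:
        f.close()

def content_changed(entry, path):
    st = os.lstat(path)                       # step 1: one stat per path
    # steps 2 and 3 (kind and executable comparison) elided for brevity
    if st.st_size != entry.size:              # step 4: size differs ->
        return True                           # content change, stop early
    if statcache_fields(st) == entry.statcache:
        return False                          # step 5: cache hit, no read
    return sha1_of_file(path) != entry.sha1   # read the file only on a miss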

Physical constraints

  1. Reading disk may require unicode paths to be passed to the OS on some platforms.
  2. Reading all data in a single directory at once is faster than swapping between directories.
  3. stat()ing in inode order is apparently good on Linux.

Layers

Low level primitives, easily replaceable by C

  1. Dirstate parser.
  2. 'file.gather_meta_data': stat a single path, and return the data structured and ready to be used by dirstate.
  3. 'directory.gather_meta_data': read a directory, stat everything in it, and return the data structured and ready to be used by dirstate. This should return logical kind - all the apis want logical kind. This can do inode order optimisations and so forth if desired. The result does not have to be sorted, but the order of data within the tuples returned should be amenable to a trivial sort being done without lambdas or special keys.
  4. Sha1 of a single file. (python is _SLOW_).
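
A pure-Python sketch of the directory primitive (the tuple layout is illustrative; a C version could additionally use readdir's d_ino to stat in inode order, which plain os.listdir cannot see):

import os

def gather_directory_meta_data(dirpath):
    # Read the whole directory in one pass and stat everything in it.
    # Keeping the name first in each tuple means a plain .sort() on the
    # result suffices - no lambdas or special keys needed.
    entries = []
    for name in os.listdir(dirpath):
        st = os.lstat(os.path.join(dirpath, name))
        entries.append((name, st.st_mode, st.st_size, st.st_mtime))
    return entries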

dirstate APIs build on these to offer statcache facilities. Dirname and basename are in utf8. We have size, exe, fingerprint and logical kind for a single tree's details; and at a given path we have a list of file_ids, one of which *may* be versioned - if versioned, it receives this data from the low-level API.

  1. for add:
    • Return logical data about a single path: All the data we want to be able to update a dirstate path details in the current tree.
    • Return logical data for the immediate children of a path: As for a single path
    • Return data about a single path: all the paths present on disk given a single path. That is, the path itself, and if it is a directory (but not a subtree), the details for its direct children. The minimal interesting data here is path, kind, versioned status, ignorecache. It could for example return a tuple (path, childrenofpath). Alternatively, we could have a rarely-used API for gathering data for a single path (calling it on a dir returns the details of the dir, not its children), and callers can handle ENOTDIR by calling this single-path API. Another way to avoid double-processing each directory is to have a parameter to say 'don't give me the path's details, only its children's'.

Currently we have walk_entries, which ignores disk. I think we need an iterator that gives up-to-date but potentially sha1-less entries. That is: logical kind, size and exe are 'fresh', and sha1 is None if it was invalidated by a statcache miss. So this updates the versioned entry for each path, and adds in data for unversioned paths. It will not recurse into unversioned dirs, nor into the disk data of paths in the current tree that are 'tree-references'. That is, it will give basis-tree data for paths that have become subtrees, but consider their disk data to be missing.

tree apis which perform deltas within the dirstate data.

At the moment _iter_changes - a tree API - runs dirblock._iter_entries and osutils.walkdirs, advancing each iterator as needed and calling update_entry on every versioned path. It then compares the resulting entries and yields those that match the caller's requested policy. Callers invoke is_ignored as needed.
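
The core of that advancing-two-iterators dance is a merge-join over path-sorted streams. A generic sketch (not the actual _iter_changes code):

def merge_join(left, right):
    # left and right yield (key, value) pairs sorted by key; we emit
    # (key, left_value, right_value) with None for a missing side.
    left = iter(left)
    right = iter(right)
    l = next(left, None)
    r = next(right, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            yield l[0], l[1], None
            l = next(left, None)
        elif l is None or r[0] < l[0]:
            yield r[0], None, r[1]
            r = next(right, None)
        else:
            yield l[0], l[1], r[1]
            l = next(left, None)
            r = next(right, None)

# merge_join([('a', 1), ('c', 3)], [('b', 2), ('c', 30)]) yields
# ('a', 1, None), ('b', None, 2), ('c', 3, 30)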

Random notes

walkdirs is useful - it grabs data for a whole directory at once. But it stats ignored and unknown files unconditionally. Perhaps it would be better to call a function per directory, with a list of desired paths; this function would stat everything in the list and return a list of path_infos. So want_unversioned would be 'list_dir_with_inode' -> stat_directory -> zip them together, and not want_unversioned would be stat_directory(dirpath, [name for name in dir]). At the moment, though, I think it's probably OK to stat everything. When we come to commit we should look at this.

_iter_changes internals: the upper level translates dirstate into the zipped tuples. This needs: path info if the path is unversioned, plus source details and target details, with any that match disk refreshed.

Possible layering:

  • iter_selected_trees -> returns all details spidering out from the named tree indexes, block by block.
  • update_index_from_disk -> stats disk and updates, based on the iter_selected_trees iteration; yields blocks at a time.
  • _iter_changes -> sets up the iterator chain it needs and then converts the selected indexes to _iter_changes form.

Current timings scratchpad

dirstate2

python -m timeit -s 'from bzrlib import osutils' -s 'from bzrlib.workingtree import WorkingTree' -s "tree=WorkingTree.open($TREE)" 'tree.lock_read()' 'tree.current_dirstate()._read_dirblocks_if_needed()' 'tree.unlock()'
10 loops, best of 3: 57 msec per loop

python -m timeit -s 'from bzrlib import osutils' -s 'from bzrlib.workingtree import WorkingTree' -s "tree=WorkingTree.open($TREE)" 'tree.lock_read()' 'list(tree._iter_changes(tree.basis_tree()))' 'tree.unlock()'
10 loops, best of 3: 431 msec per loop

python -m timeit -s 'from bzrlib import osutils' -s 'from bzrlib.workingtree import WorkingTree' -s 'tree=WorkingTree.open("/home/robertc/source/canonical/lp.dirstate")' 'tree.lock_read()' 'for details in tree.current_dirstate().iter_dirs_in_trees([''], [0]):' '  list(details[1][1])' "list(osutils.walkdirs($TREE))" 'tree.unlock()'
10 loops, best of 3: 254 msec per loop

python -m timeit -s 'from bzrlib import osutils' "list(osutils.walkdirs($TREE))"
10 loops, best of 3: 191 msec per loop

python -m timeit -s 'from bzrlib import osutils' "list(osutils._walkdirs_utf8($TREE))"
10 loops, best of 3: 150 msec per loop

python -m timeit -s 'from bzrlib.path_info import path_info' -s 'from bzrlib.osutils import _walkdirs_utf8 as walkdirs' -s "dirs = [dir[1] for dir in walkdirs($TREE)]" -s 'paths = [] ' -s 'for dir in dirs: paths.extend([path[3] for path in dir])' 'for path in paths: path_info(path)'
10 loops, best of 3: 117 msec per loop

python -m timeit -s 'from os import listdir' -s 'from bzrlib.path_info import path_info' -s 'from bzrlib.osutils import _walkdirs_utf8 as walkdirs' -s "dirs = [dir[1] for dir in walkdirs($TREE)]" -s 'paths = [] ' -s 'for dir in dirs: paths.extend([path[3] for path in dir if path[2][0]=="directory"])' 'for path in paths: listdir(path)'
100 loops, best of 3: 11.6 msec per loop

pathinfo (C module)

python -m timeit -s 'from bzrlib import osutils' -s 'from bzrlib.workingtree import WorkingTree' -s "tree=WorkingTree.open($TREE)" 'tree.lock_read()' 'tree.current_dirstate()._read_dirblocks_if_needed()' 'tree.unlock()'
10 loops, best of 3: 55.2 msec per loop

python -m timeit -s 'from bzrlib import osutils' -s 'from bzrlib.workingtree import WorkingTree' -s "tree=WorkingTree.open($TREE)" 'tree.lock_read()' 'list(tree._iter_changes(tree.basis_tree()))' 'tree.unlock()'
10 loops, best of 3: 464 msec per loop

python -m timeit -s 'from bzrlib import osutils' -s 'from bzrlib.workingtree import WorkingTree' -s 'tree=WorkingTree.open("/home/robertc/source/canonical/lp.dirstate")' 'tree.lock_read()' 'for details in tree.current_dirstate().iter_dirs_in_trees([''], [0]):' '  list(details[1][1])' "list(osutils.walkdirs($TREE))" 'tree.unlock()'
10 loops, best of 3: 187 msec per loop

python -m timeit -s 'from bzrlib import osutils' "list(osutils.walkdirs($TREE))"
10 loops, best of 3: 131 msec per loop

python -m timeit -s 'from bzrlib import osutils' "list(osutils._walkdirs_utf8($TREE))"
10 loops, best of 3: 89.5 msec per loop

python -m timeit -s 'from bzrlib.path_info import path_info' -s 'from bzrlib.osutils import _walkdirs_utf8 as walkdirs' -s "dirs = [dir[1] for dir in walkdirs($TREE)]" -s 'paths = [] ' -s 'for dir in dirs: paths.extend([path[3] for path in dir])' 'for path in paths: path_info(path)'
10 loops, best of 3: 55.7 msec per loop

python -m timeit -s 'from os import listdir' -s 'from bzrlib.path_info import path_info' -s 'from bzrlib.path_info_c import read_dir' -s 'from bzrlib.osutils import _walkdirs_utf8 as walkdirs' -s "dirs = [dir[1] for dir in walkdirs($TREE)]" -s 'paths = [] ' -s 'for dir in dirs: paths.extend([path[3] for path in dir if path[2][0]=="directory"])' 'for path in paths: read_dir(path)'
100 loops, best of 3: 11.1 msec per loop

unix comparisons

time find $TREE -size 3c

real    0m0.039s
user    0m0.008s
sys     0m0.020s
================
user+sys = 0m0.028s = 28msec

time find . -false

real    0m0.015s
user    0m0.004s
sys     0m0.008s
================
user+sys = 0m0.012s = 12msec

Walking directory overview:

  • find: 28msec
  • list all directories: 11msec
  • stat all files and encode to statcache etc: 55.7msec with C, 117msec pure Python
  • unaccounted overhead: 89.5msec-11-55.7=22.8msec (C) and 150msec-11-117=22msec (pure Python)

TODO

  1. reduce or eliminate inventories.
  2. operations should take file_id or path.
  3. dirstate adding-deleted and iter into subtrees

Indices take 2

Some progress (size, b+tree, memory-capped reading and writing) has been implemented as index2, aka the btree repository format.

Flaws to fix:

  • size: The indices are plain-text encoded, which, while not huge, is much larger than (e.g.) a bzipped index of the same data.
  • Performance: The in-python logic required to manage bisection with irregular key sizes leads to high cpu use.
  • memory utilisation: Due in part to the high cost of reading parts of the index, the index memoises all read areas in memory. The lack of an effective way to describe portions of the file makes using an LRU or similar cache hard - large indices (millions of records) take up gigabytes of memory. Generally we should be able to access each key in an index only once, or even layer a simple dict-based LRU over underlying indices to accommodate repeated accesses with high locality of reference (such as during annotate); a sketch follows this list.
  • Network traffic: The ad hoc expansion-of-reads in transport was a good hack to get more effective network use, but it seems the coupling is too weak, leading to overlapping reads and less efficient use than could be achieved. I think that in hindsight the original concept of having the index use a 'page size' hinted at by the transport will allow more precise reads with less waste.
  • locality of reference: having a vector and bisecting within it works well as a simple approach, but an actual tree can be sorted on disk to put the root at the start of the file, so less disk IO is needed to perform the first few tree splits, rather than paging in many parts of the index file. This will help cold-cache and large-index scenarios.
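
A minimal sketch of that LRU layer, assuming some underlying lookup(key) callable (the O(n) recency list is a toy; a real version would use a linked list or similar):

class LRUCachingLookup(object):

    def __init__(self, lookup, max_entries=10000):
        self._lookup = lookup
        self._max = max_entries
        self._cache = {}     # key -> value
        self._recency = []   # least recently used first

    def get(self, key):
        if key in self._cache:
            self._recency.remove(key)   # O(n): fine for a sketch only
        else:
            self._cache[key] = self._lookup(key)
            if len(self._cache) > self._max:
                del self._cache[self._recency.pop(0)]
        self._recency.append(key)
        return self._cache[key]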

Key goals of indices #1 to preserve/improve upon:

  • streaming writes: indices get written once.
  • scalable IO sizes: we can read more at once to get more from the index
  • composable across multiple segments.
  • graph details are embedded in the index

Some possible improvements (maybe not in this iteration):

  • Suggested by Martin - add multiple columns to the index rather than the current single one; then we could conceptually store what are multiple indices today in a single index, which may cut down on total index size and IO load during operations.

Design percolating in my head:

  • same basic approach: headers followed by data.
  • data will be grouped in 4K pages
  • each page is a NULL-padded, gzip-compressed node of a BTree/B+Tree. (For signatures we may want a plain BTree; for inventories/texts we definitely want a B+Tree. Starting with a B+Tree and evaluating performance before actually implementing a plain BTree would be sensible.)
  • graph references are internal to the node (should counterbalance the increased size via the gzip compression)
  • LRU cache tracks nodes in memory
  • building the index builds in-memory in a dict initially, but could spill to disk or (handwave) something in future.
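
A sketch of the page framing only (zlib-level compression stands in for gzip here, and node serialisation is handwaved):

import zlib

PAGE_SIZE = 4096

def pack_node(node_bytes):
    # compress a serialised node and NULL-pad it to exactly one page
    compressed = zlib.compress(node_bytes)
    if len(compressed) > PAGE_SIZE:
        raise ValueError("node too large for one page even compressed")
    return compressed + '\x00' * (PAGE_SIZE - len(compressed))

def unpack_node(page_bytes):
    # decompressobj stops at end-of-stream, so the NULL padding is
    # ignored (it lands in unused_data) rather than raising an error
    return zlib.decompressobj().decompress(page_bytes)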

Inventory ideas

Flaws to consider:

  1. cannot do a commit in less than O(tree size) [and actually we are much worse]
  2. references-introduced-by can be considered a hack
  3. neither fish nor fowl - claims to be xml, but is really a binary format that merely looks like xml
  4. cannot diff in less than O(tree size)
  5. derived data (last change) and facts ((parent, name) aka path, file id, content pointer) are stored in the same place - so changes to derived data algorithms affect factual data storage

Related bugs: