Created: 2009-04-20 by IanClatworthy
We need to seek one-format-to-rule-them-all-for-ever-and-ever. That implies taking a fresh look at the drivers for format bumps and putting in place alternative solutions where we can. If and when we do release a new format, we should handle the upgrade for users as cleanly and efficiently as we can.
New formats are fine pre 1.0, annoying post 1.0, and highly anti-social post 2.0. For large communities, the cost of getting all developers to both upgrade their software and upgrade their branches is huge. In the distributed VCS community, this is even more of an issue than it is in the central VCS user community.
Multiple formats mean that users need to think about selecting a format. That complicates the user experience and goes against our Just Works philosophy.
Potential adopters perceive frequent format releases as a sign of product immaturity. Competitive tools have successfully used this argument against Bazaar on numerous occasions.
We should always retain the right to release new formats as technology and our understanding improve. Being able to adapt as the world changes is important: the trade-offs for version control of source code on an IDE hard drive won't necessarily apply for version control of binary files (e.g. music) on different storage technologies (e.g. SSD and EC2). But new formats should be a rare thing, once per major release at most. Given corporate planning cycles, a better goal is once every 2 years (compare Ubuntu LTS) or longer.
Format change drivers
There are at least 3 reasons for releasing a new format:
- Better storage organisation.
- Metadata enhancements enabling new features.
- Protection of users against data-vs-code capability mismatches.
As an example of data-vs-code protection, end-of-line support required a format bump. If someone had multiple versions of Bazaar installed and accidentally ran commit using the old version on a working tree with converted line endings, we didn't want every text file to get committed! By bumping the format marker, we prevent that sort of thing and instead report an error explaining that the running (older) version can't interpret the (newer) format.
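The guard described above can be sketched roughly as follows. This is a minimal illustration, not Bazaar's actual code: the marker strings, `SUPPORTED_FORMATS`, `UnsupportedFormatError`, and `check_format` are all hypothetical names.

```python
# Hypothetical illustration only: none of these names are Bazaar's real API.
# The set lists the format markers that this (older) code knows how to read.
SUPPORTED_FORMATS = frozenset({"example tree format 1", "example tree format 2"})


class UnsupportedFormatError(Exception):
    """Raised when the on-disk format marker is unknown to the running code."""


def check_format(marker, supported=SUPPORTED_FORMATS):
    """Return the cleaned marker, or refuse to operate on an unknown format."""
    marker = marker.strip()
    if marker not in supported:
        # Fail before touching any data, rather than misinterpreting it.
        raise UnsupportedFormatError(
            f"format {marker!r} is not supported by this version; "
            "a newer version is needed to work with this tree"
        )
    return marker
```

An older client that meets a marker written by a newer version stops with an error instead of committing every text file.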
Units of work
This is an umbrella spec covering multiple areas.
Here are the component specs in rough order of priority, together with the driver they are addressing:
- extensible metadata: metadata for new features
- branch baggage: metadata for new features
- branch dependencies: data-vs-code protection
TODO: add more specs, particularly one for making bulk upgrade of repositories and branches a breeze.
It is possible to evolve formats with limited inconvenience to users. Applications like iTunes roll out new features every few months, with new metadata supporting those, and the user is lucky to notice the implicit format upgrade happening under the covers. Things are more tricky in Bazaar's case because users may want new software but an old format, to ensure interoperability with peers who haven't upgraded. OTOH, it is possible, if tricky, to adopt policies that permit users on new formats to interoperate with users on old formats transparently, e.g. guaranteeing forward compatibility of semantics and round-tripping data you don't understand. This becomes far easier if the only difference between formats is better storage organisation.
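As a rough sketch of what "round-tripping data you don't understand" could look like, assume a record is stored as simple `key: value` lines. The record layout, `KNOWN_KEYS`, `parse_record`, and `serialize_record` are assumptions made for illustration and do not describe any real Bazaar format.

```python
# Hypothetical illustration only: the "key: value" layout and all names here
# are assumptions, not a real Bazaar serialisation.
KNOWN_KEYS = ("author", "message")


def parse_record(text):
    """Split a record into fields this code understands and fields it does not."""
    known = {}
    unknown = []  # kept as (key, value) pairs, in their original order
    for line in text.splitlines():
        key, _, value = line.partition(": ")
        if key in KNOWN_KEYS:
            known[key] = value
        else:
            unknown.append((key, value))
    return known, unknown


def serialize_record(known, unknown):
    """Write known fields first, then pass unknown fields through unchanged."""
    lines = [f"{key}: {known[key]}" for key in KNOWN_KEYS if key in known]
    lines.extend(f"{key}: {value}" for key, value in unknown)
    return "\n".join(lines)
```

An older client can then edit the fields it knows and write the record back without losing a field added by a newer client. Note that this sketch moves unknown fields after the known ones, so it preserves their content but not necessarily their position.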
Sadly we seem to get next-to-no credit for protecting users from themselves. I guess no-one even notices when they are implicitly rescued, while everyone notices needing to do a format upgrade. We could always take the approach other tools seem to - if people screw up, it's their own fault - but that seems morally wrong. It's not about being smart or dumb: even the best people still make silly mistakes now and then. Branch dependencies are a really cool idea because they solve the protection issue without needing a format bump.
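The component spec holds the actual design. Purely as a sketch of the general idea, assume a branch can list the named features it depends on and a client compares that list with the features it supports. The feature names, `CLIENT_FEATURES`, `MissingFeatureError`, and both functions are hypothetical.

```python
# Hypothetical illustration only: feature names and functions are assumptions,
# not the design from the branch dependencies spec.
CLIENT_FEATURES = frozenset({"eol-conversion", "tags"})


class MissingFeatureError(Exception):
    """Raised when a branch depends on features the running code lacks."""


def missing_features(required, available=CLIENT_FEATURES):
    """Return the required features that the running code does not provide."""
    return sorted(set(required) - set(available))


def check_dependencies(required, available=CLIENT_FEATURES):
    """Refuse to operate on a branch whose declared dependencies are unmet."""
    missing = missing_features(required, available)
    if missing:
        raise MissingFeatureError(
            "this branch requires features not supported here: "
            + ", ".join(missing)
        )
```

The point of the sketch is that the check happens per feature, so the format marker itself never has to change.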
Here's some feedback from lifeless on IRC:
(17:50:42) lifeless: igc1: btw, you need to enlarge your 'open data format' stuff a _lot_ - last I read it it looked like handwaving : 'we' have spent a lot of time considering extension points for the core store and not found any that were satisfactory in the general case
(17:51:00) lifeless: igc1: [and a general mechanism is by definition the general case]
(17:52:06) igc1: lifeless: Can you point me to mailing list discussions that I ought to read/re-read?
(17:53:46) lifeless: igc: versioned properties discussions right from the start basically
(17:54:32) lifeless: igc: various challenges exist - consistency, updating, schema, validation, check, reconcile, propogation, merge,
(17:54:58) lifeless: indexing
(17:55:01) lifeless: and performance
(17:56:07) lifeless: igc: more broadly, one needs to consider how things will degrade for clients not supporting $extension, and if the answer is 'they do the wrong thing', then a repo with that extended data in it is broken for clients
(17:56:16) lifeless: which is arguably identical to our current behaviour
(17:57:19) lifeless: In general terms, I think the question is about 'is it possible' not 'how to make it happen', for a safe extensible format
(17:58:50) igc: lifeless: it's software - anything is possible :-) :-)
(17:59:03) igc: lifeless: seriously, I like your list
(17:59:42) igc: lifeless: I don't expect the core to solve every problem for every type of data
(18:00:19) igc: lifeless: I do expect it to provide plumbing where it can and delegate things it can't to the plugin/code registering the extended data
(18:00:51) lifeless: I'd be delighted to make sure fetch calls hooks to let people add data and recieve extended data
(18:01:02) igc: lifeless: there will be limitations but I don't believe that means we should throw the baby out with the bathwater and say "it can't be done"
(18:01:04) lifeless: I'm not at all sure that the _storage_ of said data should be in .bzr/repository
(18:03:54) lifeless: igc: I think there is a different between an extensible store, and hooks allowing plugins to do what they want to alongside core operations
(18:04:18) lifeless: igc: I argue that there isn't a difference at the user level, but there is a vast difference for the programming and performance implications we face
(18:05:12) lifeless: we consider very carefully where *our* data goes and what it implies; is it a column store or row store, how long does it take to do X, or Y.
(18:06:04) lifeless: someone storing a mime type per file could easily add 50% to the raw storage size of inventories
(18:06:41) lifeless: and writing a general purpose, high performance, tuning-free, self-maintaining database, while extremely interesting, isn't really what we're here to do
(18:06:45) lifeless: IMO
(18:07:10) lifeless: I hope I'm making sense
(18:07:29) igc: lifeless: you are and I agree with most of what you're saying
(18:08:27) igc: lifeless: but ewe need a better answer than "new format required" when code/a plugin wants a small amount of data managed for it
(18:09:16) lifeless: igc: I think plugins should maintain their own stores
(18:09:25) igc: lifeless: it one thing to say "one format only during the life of 2.0"
(18:09:26) lifeless: in db terms they should get their own shard, and do what they want with it
(18:10:28) igc: lifeless: it implies though that only people running the development format get to see any data-dependent new features until 3.0 is released
(18:11:25) lifeless: it comes back to definitions
(18:11:34) lifeless: I don't think replacing format with 'schema version' makes things better
(18:11:56) igc: lifeless: which, IMO, means 3.0 is back to "big bang" releasing with lower quality (as a rule)
(18:12:12) lifeless: if something is -truely- safe to add to the data existing code processes it can be added to an existing format with todays mechanisms
(18:13:12) lifeless: I think we're trying to cross the chasm at the moment
(18:13:26) lifeless: and fast incremental releases are giving our non early adopters problems
(18:13:52) igc: lifeless: but we seem to do that extremely rarely because format bumps are our only instrument for code-data alignment protection
(18:14:32) igc: lifeless: I think branch dependency rules will help here
(18:14:42) lifeless: we'll be having a sprint
(18:16:27) lifeless: I am not against achieving safe finer granularity; I'm against making the system more fragile,
(18:16:51) lifeless: and I think that adding more stores is a better approach than a single 'extensible' store, for a number of reasons
(18:16:56) lifeless: those above
(18:17:14) lifeless: plus separation of core and non core data
(18:17:24) lifeless: see for instance the long desired 'file graph is a cache' one
(18:17:34) igc: I like the multiple stores approach - it's what I was getting at with the baggage idea
(18:17:50) lifeless: bzr-search is an example of this
(18:18:02) lifeless: living breathing working without needing changes to bzr itself
(18:18:47) igc: lifeless: right, so we need the rules on how to do that better documented and explained
(18:19:03) lifeless: sure
See the component specs for details.