
File Synchronization: Theory and Practice Benjamin C. Pierce

This document discusses file synchronization theory and practice. It begins with examples of areas where synchronization is needed, such as distributed file systems, mobile devices, and version control systems. It then describes the goals of the Unison synchronization tool project, which aims to develop a clean conceptual foundation and robust cross-platform synchronization tool. The document provides a demo and overview of the conceptual foundations, followed by a discussion of how to specify synchronization using a simple example. It proposes organizing principles and structure for the specification, defines preliminary concepts, and provides a core specification for an acceptable synchronization run. The document observes that the specification does not require propagating all changes and discusses how to handle iterated synchronization runs.

Copyright
© Attribution Non-Commercial (BY-NC)

File Synchronization

Theory and Practice

Benjamin C. Pierce
University of Pennsylvania

with
Trevor Jim (AT&T), Jérôme Vouillon (Penn),
Sundar Balasubramaniam (Jareva Tech.), Matthieu Goulay,
Sylvain Gommier (Ecole Polytechnique)

“First, do no harm...” —Hippocrates

snc / 1
Synchronization all around us...

increasing distribution ⇒ replicated data
increasing mobility ⇒ disconnected updates
⇓
synchronization

Examples...
- distributed filesystems and databases (with optimistic replication strategies)
- synchronization utilities for mobile laptops
- “hot sync” software for PDAs
- version control systems (e.g., CVS)
- groupware applications
- ...

snc / 2
The Unison Project

Goals:
Theory: A clean conceptual foundation for file synchronization
(and, ultimately, other forms of synchronization).

Practice: A robust, portable, cross-platform synchronization tool.

snc / 3
Demo

snc / 4
Foundations

snc / 5
Synchronization (a simple example)

A synchronizer should propagate changes...

Before synchronization:
  first replica:  a ↦ f', b ↦ g
  second replica: a ↦ f,  b ↦ g'

After synchronization:
  both replicas:  a ↦ f', b ↦ g'

snc / 6
... as long as they do not conflict:

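Before synchronization:
  first replica:  a ↦ f', b ↦ g'
  second replica: a ↦ f,  b ↦ g''

After synchronization:
  first replica:  a ↦ f', b ↦ g'
  second replica: a ↦ f', b ↦ g''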

snc / 7
A more interesting example

If a file gets renamed on one side and modified on the other, what should
the synchronizer do?

  first replica:  a ↦ f, c ↦ g     (b has been renamed to c)
  second replica: a ↦ f, b ↦ g'    (b has been modified)

snc / 8
Three reasonable possibilities:
1. Copy old version with new name (c); report a conflict for old name (b)

  first replica:  a ↦ f, c ↦ g
  second replica: a ↦ f, b ↦ g', c ↦ g

2. Modify the file in the first replica and move it in the second

  both replicas:  a ↦ f, c ↦ g'

3. Do nothing (report a conflict)

snc / 9
Another unclear case

Suppose a file is created on one side and its parent directory is deleted
on the other side...

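  first replica:  directory d (which contained a ↦ f) has been deleted
  second replica: directory d contains a ↦ f and a newly created file b ↦ g'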

What should happen?

snc / 10
1. Nothing; a conflict should be reported

2. The siblings (d:a) should be deleted from the second replica, leaving
just the file (d:b) and its parent directory (d)

3. The siblings and parent directory should all be deleted from the
second replica; the file should be moved to a special “orphanage” and
the user alerted

snc / 11
What we want...

A simple, precise framework for specifying and discussing file
synchronizers, phrased in terms accessible to both implementers and users.

snc / 12
Organizing principles

- Start by trying to specify just one synchronizer (ours) cleanly
- Specify a user-level synchronizer
  - synchronization operation occurs under explicit user control
  - only the current state of the filesystems is available to the
    synchronizer (plus any information it chooses to remember from
    last time)
- Assume a static model of the world
- Factor out heuristics and user interactions for merging overlapping
  updates

snc / 13
Structure of the specification

  O, O      (the original filesystem O, replicated on both sides)
    ↓ user updates (independently on each side)
  A, B
    ↓ synchronizer
  A', B'

snc / 14
Preliminaries
- A path is a (possibly empty) sequence of names.
  The empty path is written $\epsilon$. The symbol : is used as both the path
  separator (d:e:f) and for concatenating paths (p:q).
- A file is a value drawn from some uninterpreted set $\mathcal{F}$ (e.g., strings of
  bytes).
- A filesystem is a total function mapping paths to their contents,
  where the contents at a path may be a file, a directory, or nothing.
  Formally:
  $$\{\, A \in \mathcal{P} \to (\mathcal{F} \cup \{\mathrm{DIR}, \bot\}) \;\mid\; \forall p, x.\ A(p{:}x) \neq \bot \Rightarrow A(p) = \mathrm{DIR} \,\}$$

snc / 15
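Write $A/p$ (“A after p”) for the sub-filesystem of $A$ rooted at path $p$.

[Figure: a filesystem A drawn as a tree of nested directories, with the
subtree rooted at path b:c marked (labelled "A / b.c" in the figure).]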
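
The definitions above can be made concrete with a small model. The following is only an illustrative sketch, not Unison's actual representation: it assumes a filesystem is a Python dict from path tuples to contents, that the string "DIR" marks a directory, that file contents are bytes, and that a missing key stands for "nothing". The names `lookup`, `well_formed`, and `after` are hypothetical.

```python
DIR = "DIR"  # assumed marker for "this path is a directory"; files are bytes

def lookup(fs, p):
    """Contents at path p; None plays the role of 'nothing'."""
    return fs.get(p)

def well_formed(fs):
    """Anything that exists at p:x must sit inside a directory at p."""
    return all(fs.get(p[:-1]) == DIR
               for p, v in fs.items() if p and v is not None)

def after(fs, p):
    """Sub-filesystem of fs rooted at p, with paths re-rooted at p."""
    n = len(p)
    return {q[n:]: v for q, v in fs.items() if q[:n] == p}

# A small example: the root contains directory a (holding file f) and file b.
A = {(): DIR, ("a",): DIR, ("a", "f"): b"hello", ("b",): b"world"}
```

With this encoding the empty tuple `()` is the empty path, and `after(A, ("a",))` is the filesystem one would see after descending into `a`.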

snc / 16
Conflicts

Key question: What is a conflict?


Our answer: A conflict occurs when the two replicas do not agree (at
some path), and both have been changed.

Formally, we say there is a conflict at path $p$ if

    $A(p) \neq B(p)$     “A and B are different at p”
and $A/p \neq O/p$       “A has been changed at (or below) p”
and $B/p \neq O/p$       “B has been changed at (or below) p”
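
As a rough illustration, the conflict condition can be written as a predicate over three filesystems. This sketch assumes filesystems are dicts from path tuples to contents, with missing keys meaning "nothing"; `after` and `conflict` are hypothetical helper names, and the example data mirrors the earlier slide in which both sides modify b.

```python
DIR = "DIR"  # assumed directory marker

def after(fs, p):
    """Sub-filesystem of fs rooted at p, with paths re-rooted at p."""
    n = len(p)
    return {q[n:]: v for q, v in fs.items() if q[:n] == p}

def conflict(A, B, O, p):
    """True when A and B differ at p and both differ from O at or below p."""
    return (A.get(p) != B.get(p)
            and after(A, p) != after(O, p)
            and after(B, p) != after(O, p))

O = {(): DIR, ("a",): b"f", ("b",): b"g"}
A = {(): DIR, ("a",): b"f'", ("b",): b"g'"}    # both a and b modified
B = {(): DIR, ("a",): b"f", ("b",): b"g''"}    # only b modified
```

Here `conflict(A, B, O, ("b",))` holds because both sides changed b differently, while `conflict(A, B, O, ("a",))` does not, since only the first replica touched a.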

snc / 17-b
Core Specification

Each run of a file synchronizer takes filesystems $A$, $B$, and $O$ as inputs
and yields new filesystems $A'$ and $B'$ as outputs. A run is said to be
acceptable if, for all paths $p$:

(1) if $A(p) \neq O(p)$, then $A'(p) = A(p)$
    if $B(p) \neq O(p)$, then $B'(p) = B(p)$
    (“don’t overwrite user changes”)
(2) if $A'(p) \neq A(p)$, then $A'/p = B/p$
    if $B'(p) \neq B(p)$, then $B'/p = A/p$
    (“only change replicas by (completely) propagating user changes”)
(3) if there is a conflict at $p$, then $A'/p = A/p$ and $B'/p = B/p$
    (“don’t change (at or below) conflicting paths”)

A synchronizer implementation is correct if all its runs are acceptable.
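
Rules (1) to (3) translate almost line for line into a checker. The sketch below is an assumption-laden illustration rather than anything taken from Unison: filesystems are dicts from path tuples to contents, they are assumed to be well formed (so only paths that appear in some dict need checking), and `after`, `conflict`, and `acceptable` are hypothetical names.

```python
DIR = "DIR"  # assumed directory marker

def after(fs, p):
    """Sub-filesystem of fs rooted at p, with paths re-rooted at p."""
    n = len(p)
    return {q[n:]: v for q, v in fs.items() if q[:n] == p}

def conflict(A, B, O, p):
    return (A.get(p) != B.get(p)
            and after(A, p) != after(O, p)
            and after(B, p) != after(O, p))

def acceptable(A, B, O, A2, B2):
    """Check rules (1)-(3) for a run with inputs A, B, O and outputs A2, B2."""
    paths = set(A) | set(B) | set(O) | set(A2) | set(B2)
    for p in paths:
        # (1) don't overwrite user changes
        if A.get(p) != O.get(p) and A2.get(p) != A.get(p):
            return False
        if B.get(p) != O.get(p) and B2.get(p) != B.get(p):
            return False
        # (2) only change a replica by completely propagating the other side
        if A2.get(p) != A.get(p) and after(A2, p) != after(B, p):
            return False
        if B2.get(p) != B.get(p) and after(B2, p) != after(A, p):
            return False
        # (3) leave conflicting paths alone, at and below p
        if conflict(A, B, O, p) and (after(A2, p) != after(A, p)
                                     or after(B2, p) != after(B, p)):
            return False
    return True

O = {(): DIR, ("a",): b"f", ("b",): b"g"}
A = {(): DIR, ("a",): b"f'", ("b",): b"g'"}
B = {(): DIR, ("a",): b"f", ("b",): b"g''"}
B_good = {(): DIR, ("a",): b"f'", ("b",): b"g''"}  # a propagated, b left alone
B_bad = {(): DIR, ("a",): b"f'", ("b",): b"g'"}    # b overwritten despite conflict
```

A run that returns `(A, B_good)` passes the check, as does the run that changes nothing at all, whereas `(A, B_bad)` is rejected because it overwrites a user change in the second replica.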

snc / 18
Observations

Interestingly, this specification does not force the synchronizer to do
anything at all!
Of course, we prefer that the synchronizer should propagate as many
changes as possible, but requiring that it propagate all changes is too
strong:
1. The specification should apply even in the case of failure during
synchronization.
2. For efficiency, we want to allow the implementation to be
conservative in detecting updates — i.e., to give some “false
positives,” which may lead to false conflicts.
Propagation of updates is thus a nonfunctional requirement: a
synchronizer implementation should try to propagate as many changes as
it can, subject to the above rules.

snc / 19
Going Deeper

snc / 20
Iterated Synchronization

The synchronizer may fail to make the replicas equal at some path (e.g.,
because of conflicting changes, or over-conservative change detection).
In this case, what should we use for the “original filesystem” O on the
next round of synchronization?

Answer: maintain a (fictitious) filesystem recording the “last synchronized
state” of each path.

$$O'(p) = \begin{cases} A'(p) & \text{if } A'(p) = B'(p) \\ O(p) & \text{otherwise} \end{cases}$$
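
The archive update can be sketched in a few lines. As before, this assumes the dict-of-path-tuples encoding with missing keys standing for "nothing"; `next_archive` is a hypothetical name, not a function from Unison.

```python
DIR = "DIR"  # assumed directory marker

def next_archive(O, A2, B2):
    """Record the new common value where the replicas agree; else keep O."""
    O2 = {}
    for p in set(O) | set(A2) | set(B2):
        v = A2.get(p) if A2.get(p) == B2.get(p) else O.get(p)
        if v is not None:          # 'nothing' is represented by absence
            O2[p] = v
    return O2

O = {(): DIR, ("a",): b"f", ("b",): b"g"}
A2 = {(): DIR, ("a",): b"f'", ("b",): b"g'"}
B2 = {(): DIR, ("a",): b"f'", ("b",): b"g''"}   # b is still in conflict
O2 = next_archive(O, A2, B2)
```

In this example the archive learns the new contents of a, where the replicas now agree, and keeps the old entry for b, so the next run still sees both sides of b as changed.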

snc / 21-a
Strictly speaking...

To deal with the possibility of machine failures during synchronization, we
need to treat $O'$ as an output of the synchronizer. The specification is
extended so that the following values for $O'$ are considered to be
correct:

1. $O' = O$
   unchanged (failure during update detection)

2. $$O'(p) = \begin{cases} A(p) & \text{if } A(p) = B(p) \\ O(p) & \text{otherwise} \end{cases}$$
   recording just the paths already synchronized in the inputs (failure
   during change propagation)

3. $$O'(p) = \begin{cases} A'(p) & \text{if } A'(p) = B'(p) \\ O(p) & \text{otherwise} \end{cases}$$
   additionally recording the paths that have just become synchronized
   (successful termination)

snc / 22
Synchronizing Multiple replicas

We’ve treated just the two-replica case in this specification (and in our
implementation).
Pairwise synchronization can be used to keep 3-5 replicas in sync. Just
synchronize successive pairs in a star or ring topology.
For synchronizing more replicas, both specification and implementation
can be extended straightforwardly... iff we require that all replicas
participate in every synchronization.
For synchronizing many replicas, we need to deal with the fact that only
a subset may participate in any given sync. Problems become significantly
trickier. (Need something like version vectors.)

snc / 23
Dealing with Links

Synchronization of Unix-style symbolic links can easily be handled in our
framework. A symbolic link is just a special kind of file whose “contents”
is a string. Both the “ordinary file / symlink” bit and the link-target string
are considered part of the contents of the file as far as the synchronizer
is concerned.
(Hard links are more problematic.)

snc / 24
Permission bits...

Handled just like symlinks: we consider them as part of the “contents” of
the file.

snc / 25
Heterogeneity

Unison is the only synchronizer (AFAWK) that tries to do a good job of
synchronizing across different filesystem architectures (Win32 / Posix).
This involves dealing with...
- different permission bits
- different modtime representations
- file name capitalization
- UID/GIDs (between different Unix systems)
- etc.

To achieve this, we need to change our goal to “synchronizing the
common information” (and doing something reasonable with the rest).

snc / 26
Implementation

snc / 27
Unison

The Unison synchronizer aims for robustness, portability, and
heterogeneity...
- Design strongly influenced by the specification described earlier, and
  vice versa
- Runs on Windows [98/NT/2K] and most flavors of Unix
- Supports cross-platform synchronization between Windows and Unix
- Deals with symlinks, file permissions, modtimes, uids, etc., etc.
- Tuned for high- (ethernet) and medium-bandwidth (PPP) connections
- Uses the rsync protocol for “diffs only” transmission of small updates
  to large files
- Tunnels over ssh for security (can also use raw sockets)
- Easy install (single executable, no administrative privileges required)
- Source code available under GPL (15K lines of O’Caml)
- Growing user community (500-1000 users, max replicas 5 Gb)

snc / 28
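System Architecture

[Figure: client and server sides connected by RPC over ssh. Each side has
an update detector and a local filesystem (Client FS, Server FS) holding
a replica and an archive. The remaining components shown are the user,
the UI, the reconciler, and the transport agent.]

snc / 29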
Robustness

Our promise to users:

After any run of Unison (whether successful or not), each path in
each replica will be either unchanged, or (if permitted by the
specification) updated to exactly match the other replica.
(Modulo bugs (natch), plus a few unavoidable races.)

Issues:
- Safety for arbitrary crash failures
- Atomicity of changes to filesystems
- Resilience to concurrent activity by the user
- etc.

snc / 30
Going further
(“what do you want to
synchronize today...?”)

snc / 31
Data synchronization

Many commercial synchronization tools are able to synchronize individual
records within databases. For each database, certain fields are designated
as key fields. Two records are regarded as “the same record” if they
have identical key fields.
Our framework incorporates this case without change. We just have to
extend the notion of “path” to include the key fields.
E.g., suppose the path d:f refers to a database

FirstName   LastName   Age   Address
Adam        Smith      275   Scotland
John        Keynes     115   England
...         ...        ...   ...

and that the key fields of this database are FirstName and LastName.
Then the path d:f:⟨Adam, Smith⟩ refers to the record with contents
⟨275, Scotland⟩.
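
One way to picture "paths that include the key fields" is to flatten a table into path-to-contents entries. This is only a sketch under assumed conventions: rows are Python dicts, the key tuple becomes the last path component, and `record_paths` is a hypothetical helper name.

```python
def record_paths(db_path, rows, key_fields):
    """Map each record to a path ending in its key tuple."""
    out = {}
    for row in rows:
        key = tuple(row[k] for k in key_fields)
        rest = tuple(v for k, v in row.items() if k not in key_fields)
        out[db_path + (key,)] = rest
    return out

rows = [
    {"FirstName": "Adam", "LastName": "Smith", "Age": 275, "Address": "Scotland"},
    {"FirstName": "John", "LastName": "Keynes", "Age": 115, "Address": "England"},
]
entries = record_paths(("d", "f"), rows, ("FirstName", "LastName"))
```

Each entry in `entries` can then be treated like any other path by the specification, so a change to one record on each side conflicts only if it concerns the same key.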
snc / 32
XML Synchronization

Key issue:
There are many ways to index information in XML structures (hence, it is
not clear how to “match up” the parts of the structures that should be
synchronized).

snc / 33
E.g., in
<table>
<header> <column>Names</column> <column>Phones</column> </header>
<element> <name>Joe</name> <phone>123-4567</phone> </element>
<element> <name>Amy</name> <phone>888-2001</phone> </element>
<element> <name>Fred</name> <phone>777-1234</phone> </element>
<element> <name>Eve</name> <phone>932-3528</phone> </element>
</table>

is the “path” to 777-1234...
- the second child of the fourth child of the root?
- the phone child of the third element child of the root?
- the phone child of the element whose name is Fred?

All are plausible!
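
The three readings can be tried out with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch, assuming the header cells are tagged `column`; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

DOC = """<table>
<header> <column>Names</column> <column>Phones</column> </header>
<element> <name>Joe</name> <phone>123-4567</phone> </element>
<element> <name>Amy</name> <phone>888-2001</phone> </element>
<element> <name>Fred</name> <phone>777-1234</phone> </element>
<element> <name>Eve</name> <phone>932-3528</phone> </element>
</table>"""

root = ET.fromstring(DOC)

# Purely positional: second child of the fourth child of the root.
by_position = root[3][1].text

# Tag plus position: phone child of the third <element> child.
by_tag_and_position = root.findall("element")[2].find("phone").text

# Keyed: phone child of the <element> whose <name> is Fred.
by_key = root.find("element[name='Fred']/phone").text
```

All three expressions reach the same node in this document, but they diverge as soon as a row is inserted before Fred or the rows are reordered, which is why the choice of index matters to a synchronizer.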

snc / 34
How can we figure out what indexing method is intended?
1. guess

2. ask the user

3. look at the schema!

snc / 35-c
Another issue: Ordering

Although the absolute position of a piece of information is generally not
its primary index, it is often desirable to maintain ordering of children.
In effect, there can be multiple relevant indexing schemes for a given
part of an XML document.

snc / 36
Finishing up...

snc / 37
Related Projects

Lots of implementations:
- Many distributed file systems [Coda, Bayou, Ficus, etc., etc.]
- Many commercial products for Windows and MacOS [MS Briefcase,
  Puma IntelliSync, etc.]
- Rumor [UCLA]
- Reconcile [Mitsubishi Research]

A few specifications:
- Norman Ramsey [Harvard]
  algebraic specifications of unison-like synchronizers
- Marc Shapiro & co [MSR-UK]
  trace-based specifications of more general middleware layers

snc / 38
Want to play?...

http://www.cis.upenn.edu/~bcpierce/unison

snc / 39
Extra slides...

snc / 40
Examples

We tested some popular synchronizers to see whether they satisfy our
specification...
- Microsoft Briefcase...
  Yes, modulo some bugs and differences in intended behavior
- PowerMerge (Mac)...
  No
- Rumor (a Unix-only synchronizer from UCLA)...
  Pretty much (extra generality of Rumor makes comparison hard)
- Distributed filesystems (CODA, Ficus, Bayou, etc.)...
  Pretty much (again, modulo extra generality)
- “Data synchronizers” (Intellisync, etc.)...
  Yes

snc / 41
Strategies for update detection
- Exact update detector
  dirty(p) iff ⟨current contents at p⟩ ≠ O(p)
- Modtime update detector (for Unix)
  dirty(p) iff for some ancestor q of p,
  modtime(q) > ⟨last sync time for q⟩
  [Note that dirty(p) iff modtime(p) > ⟨last sync time for p⟩ is not
  right!]
- Modtime-inode update detector (for Unix)
  dirty(p) iff modtime(p) > ⟨last sync time for p⟩
  or inode(p) ≠ inode_O(p)
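
The difference between the naive modtime test and the ancestor-based one can be seen on a small example. The sketch below is an illustration under several assumptions: modtimes and last sync times are given as dicts keyed by path tuples, a path counts as one of its own ancestors, and the function names are hypothetical. The scenario assumed is that directory d:x was swapped for another directory with older files, which updates the modtime of d but not of the files beneath it.

```python
NEVER = float("-inf")

def ancestors(p):
    """All prefixes of p, from the empty path up to and including p."""
    return [p[:i] for i in range(len(p) + 1)]

def dirty_exact(current, O, p):
    """Exact detector: compare current contents with the archive."""
    return current.get(p) != O.get(p)

def dirty_naive(modtime, last_sync, p):
    """The detector the slide warns against: looks at p alone."""
    return modtime.get(p, NEVER) > last_sync.get(p, NEVER)

def dirty_modtime(modtime, last_sync, p):
    """Dirty if p or any ancestor was modified since its last sync."""
    return any(modtime.get(q, NEVER) > last_sync.get(q, NEVER)
               for q in ancestors(p))

paths = [(), ("d",), ("d", "x"), ("d", "x", "file"), ("e",)]
last_sync = {p: 200 for p in paths}
modtime = {(): 100, ("d",): 250, ("d", "x"): 100,
           ("d", "x", "file"): 100, ("e",): 100}
target = ("d", "x", "file")
```

Here `dirty_naive` reports `target` as clean even though what lives at that path has changed, while `dirty_modtime` flags it because its ancestor d has a newer modtime.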

snc / 42
Incorporating heuristic / interactive merging
  O, O      (the original filesystem O, replicated on both sides)
    ↓ user updates (independently on each side)
  A, B
    ↓ interactive/heuristic merging
  A'', B''
    ↓ synchronizer
  A', B'

snc / 43
