Unionfs 2.x CONCEPTS:
=====================

This file describes the concepts needed by a namespace unification file
system.


Branch Priority:
================

Each branch is assigned a unique priority - starting from 0 (highest
priority).  No two branches can have the same priority.


Branch Mode:
============

Each branch is assigned a mode - read-write or read-only. This allows
directories on media mounted read-write to be used in a read-only manner.


Whiteouts:
==========

A whiteout removes a file name from the namespace. Whiteouts are needed when
one attempts to remove a file on a read-only branch.

Suppose we have a two-branch union, where branch 0 is read-write and branch
1 is read-only. And a file 'foo' on branch 1:

./b0/
./b1/
./b1/foo

The unified view would simply be:

./union/
./union/foo

Since 'foo' is stored on a read-only branch, it cannot be removed. A
whiteout is used to remove the name 'foo' from the unified namespace. Again,
since branch 1 is read-only, the whiteout cannot be created there. So, we
try on a higher priority (lower numerically) branch and create the whiteout
there.

./b0/
./b0/.wh.foo
./b1/
./b1/foo

Later, when Unionfs traverses branches (due to lookup or readdir), it
eliminate 'foo' from the namespace (as well as the whiteout itself.)


Opaque Directories:
===================

Assume we have a unionfs mount comprising of two branches.  Branch 0 is
empty; branch 1 has the directory /a and file /a/f.  Let's say we mount a
union of branch 0 as read-write and branch 1 as read-only.  Now, let's say
we try to perform the following operation in the union:

	rm -fr a

Because branch 1 is not writable, we cannot physically remove the file /a/f
or the directory /a.  So instead, we will create a whiteout in branch 0
named /.wh.a, masking out the name "a" from branch 1.  Next, let's say we
try to create a directory named "a" as follows:

	mkdir a

Because we have a whiteout for "a" already, Unionfs behaves as if "a"
doesn't exist, and thus will delete the whiteout and replace it with an
actual directory named "a".

The problem now is that if you try to "ls" in the union, Unionfs will
perform is normal directory name unification, for *all* directories named
"a" in all branches.  This will cause the file /a/f from branch 1 to
re-appear in the union's namespace, which violates Unix semantics.

To avoid this problem, we have a different form of whiteouts for
directories, called "opaque directories" (same as BSD Union Mount does).
Whenever we replace a whiteout with a directory, that directory is marked as
opaque.  In Unionfs 2.x, it means that we create a file named
/a/.wh.__dir_opaque in branch 0, after having created directory /a there.
When unionfs notices that a directory is opaque, it stops all namespace
operations (including merging readdir contents) at that opaque directory.
This prevents re-exposing names from masked out directories.


Duplicate Elimination:
======================

It is possible for files on different branches to have the same name.
Unionfs then has to select which instance of the file to show to the user.
Given the fact that each branch has a priority associated with it, the
simplest solution is to take the instance from the highest priority
(numerically lowest value) and "hide" the others.


Unlinking:
=========

Unlink operation on non-directory instances is optimized to remove the
maximum possible objects in case multiple underlying branches have the same
file name.  The unlink operation will first try to delete file instances
from highest priority branch and then move further to delete from remaining
branches in order of their decreasing priority.  Consider a case (F..D..F),
where F is a file and D is a directory of the same name; here, some
intermediate branch could have an empty directory instance with the same
name, so this operation also tries to delete this directory instance and
proceed further to delete from next possible lower priority branch.  The
unionfs unlink operation will smoothly delete the files with same name from
all possible underlying branches.  In case if some error occurs, it creates
whiteout in highest priority branch that will hide file instance in rest of
the branches.  An error could occur either if an unlink operations in any of
the underlying branch failed or if a branch has no write permission.

This unlinking policy is known as "delete all" and it has the benefit of
overall reducing the number of inodes used by duplicate files, and further
reducing the total number of inodes consumed by whiteouts.  The cost is of
extra processing, but testing shows this extra processing is well worth the
savings.


Copyup:
=======

When a change is made to the contents of a file's data or meta-data, they
have to be stored somewhere.  The best way is to create a copy of the
original file on a branch that is writable, and then redirect the write
though to this copy.  The copy must be made on a higher priority branch so
that lookup and readdir return this newer "version" of the file rather than
the original (see duplicate elimination).

An entire unionfs mount can be read-only or read-write.  If it's read-only,
then none of the branches will be written to, even if some of the branches
are physically writeable.  If the unionfs mount is read-write, then the
leftmost (highest priority) branch must be writeable (for copyup to take
place); the remaining branches can be any mix of read-write and read-only.

In a writeable mount, unionfs will create new files/dir in the leftmost
branch.  If one tries to modify a file in a read-only branch/media, unionfs
will copyup the file to the leftmost branch and modify it there.  If you try
to modify a file from a writeable branch which is not the leftmost branch,
then unionfs will modify it in that branch; this is useful if you, say,
unify differnet packages (e.g., apache, sendmail, ftpd, etc.) and you want
changes to specific package files to remain logically in the directory where
they came from.

Cache Coherency:
================

Unionfs users often want to be able to modify files and directories directly
on the lower branches, and have those changes be visible at the Unionfs
level.  This means that data (e.g., pages) and meta-data (dentries, inodes,
open files, etc.) have to be synchronized between the upper and lower
layers.  In other words, the newest changes from a layer below have to be
propagated to the Unionfs layer above.  If the two layers are not in sync, a
cache incoherency ensues, which could lead to application failures and even
oopses.  The Linux kernel, however, has a rather limited set of mechanisms
to ensure this inter-layer cache coherency---so Unionfs has to do most of
the hard work on its own.

Maintaining Invariants:

The way Unionfs ensures cache coherency is as follows.  At each entry point
to a Unionfs file system method, we call a utility function to validate the
primary objects of this method.  Generally, we call unionfs_file_revalidate
on open files, and __unionfs_d_revalidate_chain on dentries (which also
validates inodes).  These utility functions check to see whether the upper
Unionfs object is in sync with any of the lower objects that it represents.
The checks we perform include whether the Unionfs superblock has a newer
generation number, or if any of the lower objects mtime's or ctime's are
newer.  (Note: generation numbers change when branch-management commands are
issued, so in a way, maintaining cache coherency is also very important for
branch-management.)  If indeed we determine that any Unionfs object is no
longer in sync with its lower counterparts, then we rebuild that object
similarly to how we do so for branch-management.

While rebuilding Unionfs's objects, we also purge any page mappings and
truncate inode pages (see fs/unionfs/dentry.c:purge_inode_data).  This is to
ensure that Unionfs will re-get the newer data from the lower branches.  We
perform this purging only if the Unionfs operation in question is a reading
operation; if Unionfs is performing a data writing operation (e.g., ->write,
->commit_write, etc.) then we do NOT flush the lower mappings/pages: this is
because (1) a self-deadlock could occur and (2) the upper Unionfs pages are
considered more authoritative anyway, as they are newer and will overwrite
any lower pages.

Unionfs maintains the following important invariant regarding mtime's,
ctime's, and atime's: the upper inode object's times are the max() of all of
the lower ones.  For non-directory objects, there's only one object below,
so the mapping is simple; for directory objects, there could me multiple
lower objects and we have to sync up with the newest one of all the lower
ones.  This invariant is important to maintain, especially for directories
(besides, we need this to be POSIX compliant).  A union could comprise
multiple writable branches, each of which could change.  If we don't reflect
the newest possible mtime/ctime, some applications could fail.  For example,
NFSv2/v3 exports check for newer directory mtimes on the server to determine
if the client-side attribute cache should be purged.

To maintain these important invariants, of course, Unionfs carefully
synchronizes upper and lower times in various places.  For example, if we
copy-up a file to a top-level branch, the parent directory where the file
was copied up to will now have a new mtime: so after a successful copy-up,
we sync up with the new top-level branch's parent directory mtime.

Implementation:

This cache-coherency implementation is efficient because it defers any
synchronizing between the upper and lower layers until absolutely needed.
Consider the example a common situation where users perform a lot of lower
changes, such as untarring a whole package.  While these take place,
typically the user doesn't access the files via Unionfs; only after the
lower changes are done, does the user try to access the lower files.  With
our cache-coherency implementation, the entirety of the changes to the lower
branches will not result in a single CPU cycle spent at the Unionfs level
until the user invokes a system call that goes through Unionfs.

We have considered two alternate cache-coherency designs.  (1) Using the
dentry/inode notify functionality to register interest in finding out about
any lower changes.  This is a somewhat limited and also a heavy-handed
approach which could result in many notifications to the Unionfs layer upon
each small change at the lower layer (imagine a file being modified multiple
times in rapid succession).  (2) Rewriting the VFS to support explicit
callbacks from lower objects to upper objects.  We began exploring such an
implementation, but found it to be very complicated--it would have resulted
in massive VFS/MM changes which are unlikely to be accepted by the LKML
community.  We therefore believe that our current cache-coherency design and
implementation represent the best approach at this time.

Limitations:

Our implementation works in that as long as a user process will have caused
Unionfs to be called, directly or indirectly, even to just do
->d_revalidate; then we will have purged the current Unionfs data and the
process will see the new data.  For example, a process that continually
re-reads the same file's data will see the NEW data as soon as the lower
file had changed, upon the next read(2) syscall (even if the file is still
open!)  However, this doesn't work when the process re-reads the open file's
data via mmap(2) (unless the user unmaps/closes the file and remaps/reopens
it).  Once we respond to ->readpage(s), then the kernel maps the page into
the process's address space and there doesn't appear to be a way to force
the kernel to invalidate those pages/mappings, and force the process to
re-issue ->readpage.  If there's a way to invalidate active mappings and
force a ->readpage, let us know please (invalidate_inode_pages2 doesn't do
the trick).

Our current Unionfs code has to perform many file-revalidation calls.  It
would be really nice if the VFS would export an optional file system hook
->file_revalidate (similarly to dentry->d_revalidate) that will be called
before each VFS op that has a "struct file" in it.

Certain file systems have micro-second granularity (or better) for inode
times, and asynchronous actions could cause those times to change with some
small delay.  In such cases, Unionfs may see a changed inode time that only
differs by a tiny fraction of a second: such a change may be a false
positive indication that the lower object has changed, whereas if unionfs
waits a little longer, that false indication will not be seen.  (These false
positives are harmless, because they would at most cause unionfs to
re-validate an object that may need no revalidation, and print a debugging
message that clutters the console/logs.)  Therefore, to minimize the chances
of these situations, we delay the detection of changed times by a small
factor of a few seconds, called UNIONFS_MIN_CC_TIME (which defaults to 3
seconds, as does NFS).  This means that we will detect the change, only a
couple of seconds later, if indeed the time change persists in the lower
file object.  This delayed detection has an added performance benefit: we
reduce the number of times that unionfs has to revalidate objects, in case
there's a lot of concurrent activity on both the upper and lower objects,
for the same file(s).  Lastly, this delayed time attribute detection is
similar to how NFS clients operate (e.g., acregmin).

Finally, there is no way currently in Linux to prevent lower directories
from being moved around (i.e., topology changes); there's no way to prevent
modifications to directory sub-trees of whole file systems which are mounted
read-write.  It is therefore possible for in-flight operations in unionfs to
take place, while a lower directory is being moved around.  Therefore, if
you try to, say, create a new file in a directory through unionfs, while the
directory is being moved around directly, then the new file may get created
in the new location where that directory was moved to.  This is a somewhat
similar behaviour in NFS: an NFS client could be creating a new file while
th NFS server is moving th directory around; the file will get successfully
created in the new location.  (The one exception in unionfs is that if the
branch is marked read-only by unionfs, then a copyup will take place.)

For more information, see <http://unionfs.filesystems.org/>.