backupatronche is a backup utility aimed at backing up disks that hold large collections of media (such as a library of DVDs). backupatronche is written in Python.
- The typical volume you want to back up with backupatronche is several terabytes in size
- Typical files are more than 1 GB, and they typically don't change (think of them as a ripped DVD).
- Each backup disk is big enough to hold one backed-up disk (backups aren't spread across several media)
- The data is too big to afford keeping more than one copy
- The data is too big for comparisons to be cheap, so they should be kept to a minimum
- Files are typically on a NAS, and backupatronche runs on another machine, so the bandwidth is typically "low" (20 MB/s maximum, often less), as opposed to a directly attached disk.
- backupatronche knows when a file has been moved, and knows the structure of related files (only DVDs for now, thus the VIDEO_TS directory and everything under it).
Theory of operations
- backupatronche keeps a table of the SHA1 signature of every file it has to back up (hence the name!).
- backupatronche goes to some effort to avoid recomputing SHA1 signatures unnecessarily, so it has a notion of "invariant" that is supposed to be, well, invariant, as long as a file is not modified. More on this later.
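Since typical files exceed 1 GB, a signature is best computed by streaming the file rather than loading it whole. The following is a minimal sketch of that idea, not the tool's actual code; the function name file_sha1 and the 1 MiB chunk size are assumptions made for illustration.

```python
import hashlib


def file_sha1(path, chunk_size=1024 * 1024):
    """Return the hexadecimal SHA1 digest of the file at `path`.

    `file_sha1` and `chunk_size` are illustrative names, not part of
    backupatronche itself.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a multi-gigabyte file never has to
        # fit in memory; iter() stops when read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Over a 20 MB/s link, hashing a single 4 GB file this way takes more than three minutes, which is why avoiding needless recomputation matters.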
The "live" volume is called the active volume.
The volume that receives the copies is called the backup volume.
The file that lists every file of a volume, together with its metadata (SHA1 and invariant), is called the catalog. There is one catalog per volume (active and backup, thus 2 catalogs). The catalog must be kept on a separate volume to avoid oscillation (every change to the catalog would change the active volume, which would trigger a new synchronization).
The machine that runs the scripts is called the backup machine. Typically, the backup volume is attached to the backup machine, and the active volume is mounted as a remote file system. The backup volume may also be remote, at a performance penalty.
In order to quickly check whether a file has been modified without having to recompute the SHA1, a value from the operating system is used, such that the value changes if the file is modified and stays the same if it is not. This value is called the invariant. If the invariant has changed, the SHA1 is recomputed.
Note that this method does not work if the file is modified without the operating system being aware of it (typically when the physical medium is corrupted). Other provisions are made for this case.
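The invariant check can be sketched as follows. This is an illustration under assumptions rather than the actual implementation: the catalog is modelled as a dict keyed by path, and CatalogEntry, refresh_entry, get_invariant and compute_sha1 are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    # Hypothetical record layout: the text only says that a catalog
    # stores the SHA1 and the invariant of each file.
    invariant: tuple
    sha1: str


def refresh_entry(path, catalog, get_invariant, compute_sha1):
    """Bring catalog[path] up to date; return True if the SHA1 was recomputed.

    `get_invariant` and `compute_sha1` are caller-supplied functions that
    take a path, so the sketch does not depend on any particular platform.
    """
    current = get_invariant(path)
    entry = catalog.get(path)
    if entry is not None and entry.invariant == current:
        # Invariant unchanged: trust the stored SHA1 and skip the costly read.
        return False
    # New file or changed invariant: recompute and store both values.
    catalog[path] = CatalogEntry(current, compute_sha1(path))
    return True
```

The expensive compute_sha1 call is reached only for files that are new or whose invariant differs from the stored one.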
In the case of Linux, the invariant is the triple (inode number, file size, file modification time). The inode number, however, is masked down to its last 32 bits because, if the active volume is mounted through CIFS on the backup machine, the other bits may change from one mount to the next.
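A minimal sketch of how such a triple could be built with os.stat is shown below. It assumes that "the last 32 bits" means the low-order 32 bits and that the modification time is taken at nanosecond resolution; the names make_invariant and stat_invariant are made up for this example.

```python
import os

# Assumption: "the last 32 bits" is read as the low-order 32 bits.
INODE_MASK = 0xFFFFFFFF


def make_invariant(inode, size, mtime):
    """Build the (inode, size, mtime) triple, masking the inode number."""
    return (inode & INODE_MASK, size, mtime)


def stat_invariant(path):
    """Return the invariant of `path` from a single os.stat() call."""
    st = os.stat(path)
    # st_mtime_ns is an integer, which avoids float rounding; the
    # resolution actually used by backupatronche is not stated.
    return make_invariant(st.st_ino, st.st_size, st.st_mtime_ns)
```

Building the triple needs only file metadata, so it stays cheap even when the file contents sit behind a slow network mount.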
Difference with rsync
The main difference, for the kind of backup described in the main ideas above, is that if a file is moved on the active volume, the move is detected through the invariant and the file is moved on the backup volume at practically no cost, whereas rsync would copy it.
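One way move detection through the invariant could work is sketched here. The data shapes are assumptions: old_catalog maps each path to an (invariant, SHA1) pair from the previous run, and scanned maps each path found on the active volume now to its invariant. The function name detect_moves is hypothetical.

```python
def detect_moves(old_catalog, scanned):
    """Return a list of (old_path, new_path) pairs for files that moved.

    old_catalog: {path: (invariant, sha1)} from the previous run.
    scanned:     {path: invariant} from the current scan.
    Both layouts are assumptions made for this sketch.
    """
    # Index the catalog entries whose path no longer exists, by invariant.
    vanished = {}
    for path, (invariant, _sha1) in old_catalog.items():
        if path not in scanned:
            vanished[invariant] = path

    moves = []
    for path, invariant in sorted(scanned.items()):
        # A new path carrying the invariant of a vanished path is a move.
        if path not in old_catalog and invariant in vanished:
            moves.append((vanished.pop(invariant), path))
    return moves
```

Each pair could then be replayed on the backup volume as a rename, which costs almost nothing compared with copying a multi-gigabyte file over the network again. The stored SHA1 can be carried over to the new path without rereading the file.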
There are 3 scripts: sync-catalog, purge-catalog and diff.
sync-catalog brings the catalog in sync with the live storage. Syncing means that, after a successful run, every file on the live storage is in the catalog with the correct SHA1 (possibly alongside old entries). If old entries (identified by their SHA1) are still present in the catalog, they are not purged; that is the job of purge-catalog.
sync-catalog is an offline procedure, meaning it doesn't need the backup volume.
purge-catalog removes old entries from a catalog, be it the catalog of a live or a backup storage. The file holding the old catalog isn't replaced; rather, the new catalog is written with the prefix .new.
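A purge step along these lines might look like the sketch below. It rests on several assumptions the text does not confirm: the catalog is stored as a JSON object keyed by path, an "old" entry is one whose path is no longer present on the volume, and the caller chooses the name of the new catalog file. The function name purge_catalog is hypothetical.

```python
import json


def purge_catalog(catalog_path, new_catalog_path, live_paths):
    """Write a purged copy of the catalog; return how many entries were dropped.

    Assumptions for this sketch: the catalog is a JSON object keyed by
    path, and an entry is "old" when its path is not in `live_paths`.
    """
    with open(catalog_path, encoding="utf-8") as handle:
        catalog = json.load(handle)

    kept = {path: entry for path, entry in catalog.items() if path in live_paths}

    # The original catalog file is left untouched; the purged catalog
    # goes to a separate file whose name is supplied by the caller.
    with open(new_catalog_path, "w", encoding="utf-8") as handle:
        json.dump(kept, handle, indent=2, sort_keys=True)
    return len(catalog) - len(kept)
```

Leaving the old catalog in place means a purge that goes wrong can be undone simply by discarding the new file.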