The search for the perfect backup

Why backing up

Basically, you want backup to recover your data after some kind of event destroyed them.

There are at least three of these kinds we can think of.

For each of them, we give an example and also an alternative mechanism (as opposed to back up) allowing to somewhat limit the effects of the event.

Event What backup allows you Alternative method
Catastrophic hardware failure (faulty hardware) Execute your recovery plan: get new servers online, then load them with the last data from backup redundant storage (RAID)
Faulty software, manipulation error that destroy your data Get data back to a working state "Undo" into your database
Malicious data alteration (in this case, there is an inside or outside attack). Get data back to a working state, identify modification point, forensics "Undo" into your database
Forensic Analysis of what happened, after an attack or for a search in responsibility Logs. Note that log and backup usually complete each other for that case.

The salt of the game: security

Backup is essential for data integrity and recovery, however it has security implications too. They come in two flavors.

Breach of data protection

Some of your data are confidential. For example medical records, credit cards numbers, etc... but they may be as simple as your private emails or a list of passwords.

Backup means you create many copies of your data. Each copy is another opportunity for an attacker to get her hands on it. Too often, the live data are very heavily secured, but backup storage not so (after all, they are only backup). This way an attacker is able to get the data just by stealing the backup hard drive.

Using the backup system as a Trojan horse

Most of the time, the backup system needs extended rights on the server, for example, it needs access to all files in order to back them up. If the connection flow is to go from the warehouse to the server, an attacker might compromise the warehouse to get access to the server, possibly with extended rights. In the other direction (the server connects to the warehouse), the reverse is also possible. Jumping from a compromised machine to another one might be even possible in the direction contrary to the connection flow.

What to back up

Backing up always ends in backing up files, in the sense that, whatever object you want to backup: files, databases, applications or systems, you'll have to produce a set of files that can be put in the backup storage and used to restore the object to a previous state.

For a file, there are no more questions to ask, but for more complex objects, backup cannot avoid to be application-aware to some extent.

Here is alist of objects ordered by increasing level of difficulty, in term of backing up, and especially restoring the data.

Files

A file is the simplest thing you may want to backup. You backup the file by copying it to the backup storage, you restore the file by copying it from the backup storage to the production storage.

There are at least two subtleties however.

First, on a system like Unix (or Linux), you can create large empty files that take no physical place on your disk (for example the popular BDB system takes advantage of it). If you copy them by reading the file content and then writing it to the backup storage, you will end in writing large files, wasting space and possibly filling up your backup storage beyond its capacity without any good reason.

Second point, an application may be modifying the file while you're copying it, resulting in a file that your application can not use any more. This is the case in particular (but not only) if the file is a storage space for a database.

Databases

In the case of a database, the data stored in it are usually (not always) stored as files in your file system, however copying theses raw files doesn't mean you'll be able to restore the database. This is because operations inside your database may alter files while your copying them, but not in a consistent manner. For example, your database may be spread between two files, data and index. Before you start copying the file, database is in a state A. You copy the first file, and starts to copy the second one, and then the database goes to a state B. You have then an index reflecting state A, a data file reflecting state B and you are unable to restore either state A or B.

Even if the data is in only one file, or if you have a sophisticated file system that let you take a snapshot of the entire filesystem, the state of the database can be spread between the computer memory and the disk, etc... and copying the file that embodies the state of the database is not enough to restore the database.

Some databases use their own file system, bypassing completely the operating system running the copy process, and again you can not copy the database by copying the file.

If you can stop the database, copying files may be enough to have something you can restore, but this is a luxury that can usually not be afforded.

Happily, databases always come with a way to produce "backup-able" data, that is a file which is guaranteed to be in a consistent set, and that you can feed to a database server to restore the database. Here the added complexity of backing up a database, as opposed to backing up a file, is that backing up a database is a two step process: first you must produce an image of the database in form of files, and then backup these files. You also have to hope that you have enough space to do so, because the image format usually takes more space than the internal format of the database.

Applications

An application is yet another step in complexity for backup, because an application can use several databases and file storages that may be difficult to save in a consistent manner.

For example, the Opera mail software saves mail as files, but also uses a database to store meta-data. This is the same for Otrs, a popular open-source CRM software. To backup such an application, you should be able to backup a "snapshot" of both the database and the directory hierarchy of files. However, this is not possible in most cases. Unlike databases, most applications don't offer the ability to dump everything in one (or several) files in a consistent manner.

System (host)

By a host / system I mean restoring the proper functioning of a computer on another computer (piece of hardware), by restoring the previously saved files. That is, you save the files of a computer, the computer burns, and you hope to create an identically running system by restoring the files on another (more or less identical) computers. In general, this is not possible, and you should question if this is desirable.

The reason why this may not work is that different computers, even bought at the same time with seemingly identical hardware, may have subtle hardware differences, such as different usb bridge controllers or things like that. Some operating systems probe this, store the result in a cache file, and rely on the cache at the next reboot. Thus, by copying the files, and restoring them on a subtly different hardware, you're configuring your new hardware using inappropriate configuration file. Furthermore, there are files, like the kernel OS, that you usually cannot copy or don't have control over. Thus, you can end for example running a 64 bits kernel with 32 bits library.

My opinion is that a server should always be installed for scratch. Thus, the server has the opportunity to probe the hardware, and do everything the way the author of the install procedure had in mind.

This raise however the question of the configuration of the machine [...].

A set of hosts ?

How to back up

The storage ("warehouse")

Where should it be ?

In the most remote possible place than your live data. If the backup data are, say, in a safe in the same building than your servers, and the whole building is destroyed by fire, so are your backup data: you lose. At minimum, the backup data should be in another city than your live data. Another continent is better: if both continents are destroyed together, most likely you'll have other concerns than getting your data back.

How to build it

Tapes

Online disks

Removable disks

In the cloud ?

Why not, this may be an attractive option that is available today. However, most of the time, you don't know where your cloud provider storage is physically located. It may be in the same datacenter than your servers (I've seen the case), which is very much as storing your backup data in the same building than your live data. If you choose to use a cloud provider, be sure that your know where the data is physically located. It they can't or don't want to tell you, but they commit to protect your data through a contract, be sure than they're up to it: the price of your data is generally more than the price you pay for the backup. In case of failure of the contract, the penalty won't cover your operational losses.

Connecting the warehouse to the servers

Checking backup

In short, there are two possibilities:

Check files in backup against a signature you computed during the backup

This is the fastest way, since you only copy data once from storage to CPU memory for recomputing the key (if you want to compare against the actual file instead of a signature, you have to read both the backup and live files).

Restore backup to see if it works

In the case of complex backup (host / application in our typology), you can't avoid that, since restoring the files isn't enough to restore the application (you have to ensure that various files are consistent so the application can start).

Replicating backup data

(with encryption)

Removing doublons

Preparing data for backup

Storing the data into an intermediate place

(like you dump your database, also privilege processes can copy data into a less privileged aerea, so that the backup processes don't need full privileges)

Only copying delta

Ideally it should grow incrementally.

Whatever the backup mechanism, it's better to copy only the modified data from the server to the warehouse. If you have hundred of gigabytes of data (not such a big deal nowadays), you can't transfer it every day or so. You're much more likely to be able to copy the difference (delta) since your last backup. Roughly, the amount of delta depends on your activity, which places a kind of upper limit on the delta size. In contrast, the total storage on your server may have been accumulated over a long period of time, and thus be arbitrary large.

Note that copying only the delta doesn't tell you anything about the organization of the data in the backup storage. On some backup systems, you perform somewhat frequent delta backup, and some not-so-frequent "full" backup (copying all your data), with a difficulty to restore since you have to play with both the last full and last delta set of data (or even worse, with several delta). This is a legacy of the tape days, where you have to store the data the way they were transferred. Your backup should be on disks (see "Desirable properties of the backup storage" above), allowing to have a permanent mirror of your data.

(Notes for myself)

These are notes taken for myself during the writing of this (e-)book.


  • Keep several version of your data (just in case one should be destroyed).

Backing up files

What user on the server (the remote machine)

Case 1: a known, specific unix user

You're backing up some very specific part of your file system. For example, these are the files of your CRM application, and your CRM application runs under a special unix user, for example "'crm"'.

Then you can perform your backup using the "crm" user. Even better, you can create a crm-readonly user, put it in the same unix group as the crm user, and have files writable by the crm user, and read-only with the group.

Case 2: not case 1

You don't know a specific user. May be there are several of them, or may be you just don't know, for example you want to backup an entire file system, and you don't know exactly what unix users and permissions are used.

If you think it's too dangerous for the backup system to have root access on your server (remote machine)

You can arrange for using a generic user for backup purpose, let's call it backup. In order to do that, every file that must be backed up must be readable by the backup user, either because the file is world-readable, or because the file is in a specific group that can read the file, or the file has backup as its owner (quite unlikely). Also backup must be able to traverse every directory leading to the files that must be backed up.

In this context, errors are likely so you must be sure that:

  • you can see the problem in the backup log
  • Someone actually reads it on a regular basis, and takes appropriate actions.
If you think it better be automatic

Then go for the root user.

(Former first page)

A bit of vocabulary

In essence, there are:

  • machines with data to backup: I call them servers
  • one (may be several) machines to which the backup storage is attached: I call it the warehouse.

Why ?

If poorly handled, backup can be a security nightmare.

Of course, the first goal of backup is to be able to get your files back.

To be sure of being able to get your files back, you must have the warehouse in a (very) remote location. If the backup data are, say, in a safe in the same building than your servers, if the whole building is destroyed by fire, so are your backup data: you lose. So you should put your servers in a city and your warehouse in another city, the best being on another continent.

Fortunately, this is easily done these days, if the amount of data you've to store is "reasonable", for example, with a delta of 100 MB / day.

Desirable properties of the backup storage

The backup storage should be:

  • A disk. This is important. In my view, tapes are obsolete as a backup medium. It is difficult to organize data on tapes, and furthermore, they are not as reliable as disks. Disks on the other hand are cheap, even if it means having as much backup capacity as your main (server) capacity, or even more. Again, disks are cheap, errors due to problem with your backup medium are expensive. Moreover, they can be kept on-line, allowing for fast data recovery, without the need for an expensive infrastructure such as a tape changer robot.
  • encrypted
  • into an area remote from the backuped server
  • it should hold multiple versions, that is you have the last backup of your data, but also the previous one and several before.

The link between the warehouse the server must be encrypted and of course authenticated.

Minimal set of tools

  • rsync
  • sftp (must be secured ! not allowing ssh)
  • encfs / cryfs