System backup

Table of contents

  1. System backup
    1. Table of contents
    2. Backup technologies
    3. Notes on archival backups
      1. Backup levels
      2. Backup frequency
      3. Backup timing
      4. Automated backups with cron
    4. Tools for archival backup dumps
      1. Archival backups with dump
      2. Archival backups with DAR
      3. The backup.pl Perl script
      4. Copying dump files with vacuum.pl
    5. Notes on mirroring backup
      1. Mirroring case history 1: Web server content
      2. Mirroring case history 2: Full disk copy
    6. Acknowledgements

Backup technologies

The baseline motivation behind all backup systems is disaster recovery: You want to ensure that your files will survive all hardware failures that Murphy's Law might conceivably throw at you. All backup technologies meet this goal by making a copy, but there are really two kinds of copies, with distinct recovery characteristics: Archival, and mirroring.

Archival backup gives you the ability to travel through time: If you suddenly realize that an important file is missing, and you're not sure when it was deleted, then the ability to sift through a year of backup dumps looking for the missing file can be a life-saver. In order to do this, however, you must keep a lot of data around, and that almost always means putting the backup dumps on some sort of offline storage, such as CD-R (or CD-RW), Zip, etc.

Mirroring backup gives you immediate access to the most recent copy of your data; if you deleted that important file just this morning, then it's a snap to go get it from the backup drive, without any searching. On the other hand, if you deleted it before the last mirroring operation, you are completely out of luck. At a minimum, mirroring only requires a spare disk of comparable size, and is easy to automate completely, as it requires no manipulation of offline media.

The three "entry-level" backup options for Linux (and Unix systems generally) tend to provide either archival or mirroring, but not both. They are:

  1. The standard Unix tar utility. On Linux (and even some other Unix flavors) this is GNU tar, which is the easiest tool for archiving particular directories. It has the distinct advantage of being supremely portable; "tar" format can be read by all other Unix systems, and even by DOS/Windows and Macintosh.
  2. The traditional Unix dump and restore programs. Most Linux systems come with the dump/restore implementation for ext2/ext3; since these are the traditional names for the archival backup programs on Unix, "man dump" will almost always turn up something on any Unix system. dump can only operate on a whole partition at a time, but it supports incremental backups, so you only need to back up what has changed.
  3. The rsync program. Unlike tar or dump, rsync is designed to mirror the content of directory trees over the network, is quite clever about transferring only the data that have changed, and can also be set up to do local disk-to-disk copying. (Example invocations of all three tools appear below.)
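
To give a flavor of each tool, here are minimal example invocations; the directory, device, and host names are purely illustrative, and none of these commands does incremental or verified backups by itself:

    tar -czf /tmp/home-backup.tar.gz /home/user          # archive one directory tree
    dump -0u -f /tmp/home-full.dump /dev/hda9            # level 0 dump of a whole partition
    rsync -e ssh -a /home/ backuphost:/mirror/home/      # mirror a tree onto another machine
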
Fortunately, it is possible to have both archival and mirroring backup for those who need it. For small to medium installations where high availability is important, you can install a hybrid system in which archival dumps are created on a primary server, copied to a backup server for safe-keeping, and also restored onto the backup server's disks for quick access in the event that the primary server fails.

And for many small installations, archival backups are sufficient. This is all I need at home, in fact.

It is also possible to do mirroring without archival, though I myself would not recommend it. The low maintenance burden of an rsync solution may still make it the most appealing option for some -- just be clear that you are giving up your "data history" when you pass on archival backup.

Notes on archival backups

In order to reduce the amount of storage required for archival backups, it is desirable to skip files that haven't changed since the last backup. Obviously, the first backup must contain everything, but a series of subsequent backups need only contain the files changed since the last backup; in the event of a disaster, restoring all backups in the order in which they were made will return the file system to the same state as if it had been restored from a single full backup made on the last day. This scheme still has two drawbacks: The first is that the process of restoring the file system gets to be quite tedious after a few weeks, since there are quite a few backups to restore by that point. Worse yet, data will be lost if any of those backups is somehow lost or corrupted.
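
To make the restore procedure concrete, here is a minimal sketch of rebuilding a file system from such a chain of dump files; the file names are purely illustrative, and restore must be run from the root of the freshly created file system:

    cd /mnt/restored-home                   # new, empty file system mounted here
    restore -rf /backups/home-full.dump     # the original full dump first
    restore -rf /backups/home-incr1.dump    # then each later dump, oldest first
    restore -rf /backups/home-incr2.dump    # ... up to the most recent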

Backup levels

In order to address these drawbacks, it is useful to define a backup level between 0 and 9 that controls how comprehensive to make the backup. Each level k backup contains a snapshot of all files changed since the most recent dump made at any lower level. Level 0 is therefore the most comprehensive, and level 9 the most "incremental." At this point, some additional terminology is in order:

  1. A full dump is a level 0 dump; it contains every file on the partition.
  2. A consolidated dump is a low-level dump (level 1 in the schedule below) that collects everything changed since the last full dump.
  3. An incremental dump is any higher-level dump, containing only what has changed since the most recent lower-level dump.

In order to reduce the number of incrementals required, one can use the "modified Tower of Hanoi algorithm" described in the dump manpage, which prescribes the following sequence of incremental dump levels (after having made a full or consolidated dump):

    3 2 5 4 7 6 9 8 9 9 ...
These levels are intended for daily backups, which is the minimum frequency for a workgroup server in an office environment. At the end of the week, a consolidated dump is performed, and the daily cycle starts over again. At that point, last week's incrementals could be thrown away, as they are no longer needed for disaster recovery, but it's a good idea to keep them around for at least a month in order to cover the "I didn't mean to delete that" syndrome.

In any case, this multilevel backup system turns out to be quite effective in reducing the size of backups; even after a month, a consolidated dump can be only about 20% of the size of the full dump, and the daily incrementals only 3 to 5%.

Backup frequency

Deciding how often to make backups requires a tradeoff between how many days of work you are willing to lose and how much effort you have to spend on performing each backup. That is why a high degree of automation is a great advantage; it costs essentially nothing to take backups every day. My automated system costs me only 5 to 10 minutes per week, mostly to write consolidated backups to CD, and changing the daily backup schedule wouldn't affect that at all.

For less automated systems, the cost may be 5 to 10 minutes for each backup dump. A system failure that requires restoring from backups could happen at any time during the backup cycle, which means that the expected amount of work lost per failure is half of the usage between backups. In other words, if the system is backed up after every 40 hours of use, then the expected loss per failure is 20 hours. It seems reasonable to set the expected loss over the course of a year equal to the planned time investment and solve for the backup frequency; conveniently, this balance point also minimizes the expected total effort, since a sum of the form A*f + B/f is smallest exactly when the two terms are equal. If we do that, we get:

I*f = W*F/(2*f)

f^2 = W*F/(2*I)

f = sqrt(W*F/(2*I))
where

    f = the number of backups made per year,
    I = the time invested in making each backup (minutes),
    W = the amount of system use per year (minutes), and
    F = the expected number of failures per year.

Of course there are other costs to consider, such as inconvenience to customers (and staff embarrassment) when you have to admit that you lost their emails, but these mostly define a "maximum acceptable loss" ceiling, underneath which it is still desirable to seek an optimum.

If there is only one user who uses the system for 40 hours per week, and who does their own backups, then we have what might be called the "standard home office scenario." For this scenario, and assuming that (a) backups take 10 minutes on average, and (b) the system is likely to fail once per year on average (which might or might not be pessimistic), then we arrive at the following optimal backup frequency for the home office:

f_opt = sqrt((120000 min/yr * 1 failure/yr)/(2 * 10 min))

      = sqrt(6000) = 77.5 per yr

This works out to roughly three backups every two weeks, for a total time investment (which, at the optimum, equals the expected time lost to data recovery) of 77.5*10 = 775 minutes, or about 13 hours per year. If we round this frequency up to twice per week, the time investment becomes 1000 minutes (almost 17 hours!), while the expected time lost drops to only 10 hours (a quarter of a work week).
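
If you would rather let the computer do the arithmetic, the same estimate is a one-liner at the shell prompt; the values below simply restate the home-office assumptions above:

    W=120000   # minutes of system use per year (40 hr/week * 50 weeks)
    F=1        # expected failures per year
    I=10       # minutes spent on each backup
    echo "sqrt($W * $F / (2 * $I))" | bc -l    # prints 77.45...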

Most changes to this minimal scenario have the effect of driving the ideal backup frequency up. If there were ten people using the system via file sharing, then the amount of potential lost work is ten times higher, and so it becomes worth investing that 10 minutes every working day (the actual optimal frequency is nearly 245 backups per year). If the time of the person making backups is only worth half as much as that of the average file server user (in which case we should optimize the dollar cost), then the "daily is optimal" point would be reached with only 4 or 5 additional server users. The end result is that it rarely makes sense for small offices with shared file servers to do backups any less often than daily. If the resulting 41 hours per annum of staff time spent on backups becomes excessive, then it's time to increase the level of backup automation.

Backup timing

Backup timing is also important, though often overlooked. If the backup system makes its copy of a given file while an application is partway through updating it, the copy that winds up on the backup medium may be inconsistent, and would appear to be corrupted to the application if it were ever restored. For this reason, it is best to make backups at times when the file system isn't changing. The middle of the night is therefore ideal.

A particularly nasty case of backup-induced corruption can be caused by backing up the files used by a relational database management system (RDBMS) to implement tables. A transaction that updates multiple tables may be in different stages of being written to disk for each table, so the backup might be inconsistent even if it could be done instantaneously. There are really only two choices for archival backup of a database: Stop the RDBMS server completely (e.g. "/etc/init.d/mysql stop") during the backup, or use a database client backup program (e.g. mysqldump for the MySQL system).
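
For MySQL, for example, a nightly client-side dump made just before the file system backup might look like the following sketch; the output path is illustrative, and the --single-transaction option gives a consistent snapshot only for transactional (InnoDB) tables, so MyISAM tables would still need locking or a stopped server:

    mysqldump --all-databases --single-transaction \
        > /var/backups/mysql-$(date +%Y%m%d).sql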

For similar reasons, backing up more than once a day is probably not worth the bother. The only predictable period when the file system is highly unlikely to change is the night, when all users are asleep, and there is little point in making more than one backup of a file system that isn't changing.

Automated backups with cron

A regular weekly schedule is easy to automate via cron jobs. The crontab entries for the full schedule for my /home partition look like this:

# Backups.
# at 15 minutes after midnight every night, do a /home backup.
15 0 * * Mon	/usr/local/bin/backup.pl -cd-dir /scratch/backups/cd/to-write /dev/hda9 3
15 0 * * Tue	/usr/local/bin/backup.pl -cd-dir /scratch/backups/cd/to-write /dev/hda9 2
15 0 * * Wed	/usr/local/bin/backup.pl -cd-dir /scratch/backups/cd/to-write /dev/hda9 5
15 0 * * Thu	/usr/local/bin/backup.pl -cd-dir /scratch/backups/cd/to-write /dev/hda9 4
15 0 * * Fri	/usr/local/bin/backup.pl -cd-dir /scratch/backups/cd/to-write /dev/hda9 7
15 0 * * Sat	/usr/local/bin/backup.pl -cd-dir /scratch/backups/cd/to-write /dev/hda9 6
15 0 * * Sun	/usr/local/bin/backup.pl -cd-dir /scratch/backups/cd/to-write /dev/hda9 1
Backups of other partitions are usually done manually on Sundays. Many partitions change only when upgraded, /boot and /usr in particular, so they don't always need to be backed up. I usually skip a partition if the dump would be less than a megabyte.
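
A manual dump of one of those partitions follows the same pattern as the crontab entries above; for example, a full dump of the partition holding /usr might look like this (the device name is illustrative, so substitute your own):

    /usr/local/bin/backup.pl -cd-dir /scratch/backups/cd/to-write /dev/hda5 0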

Once I had the backup creation process fully automated, I tried making backups daily, but that got to be excessive; I don't generate as much stuff as an office full of people, and I still had to copy the backups to offline storage manually. Consequently, I only did the level 1, 2, 4, and 6 dumps in the crontab schedule above. Then I got a new machine (in March 2004), which made it possible to copy all dumps to the new machine's disk automatically.


Tools for archival backup dumps

Archival backups with dump

Historically, I used the standard, tried-and-true ext2 file system and its successor ext3, both of which are required by the dump and restore programs, in large part because there was no good incremental backup solution for ReiserFS; I used ReiserFS only for those partitions I didn't plan to back up. More recently I have used DAR for backups of all partitions; its advantages and disadvantages relative to dump are discussed below.

The use of dump on Linux is somewhat controversial: it is disparaged in the "dump/restore: Not Recommended!" section of the "Chapter 8. Planning for Disaster" page of The Official Red Hat Linux System Administration Primer for Red Hat Linux 8.0, and rebutted on the "Is dump really deprecated?" page of the Dump/restore utilities project. My personal opinion is that it is buggy of Linux to leave file data cached in RAM for hours, especially on an idle system; there is no benefit to doing so, but there is a risk of data loss if the UPS dies (or the chip melts, or whatever), and this is independent of any backup system. In any case, as discussed in the "Backup timing" section, there is little you can do about data that is changing during the backup, so the Linux caching issue is a difference of degree, not of kind. If you really care, you can always remount the file system read-only before creating a backup. (I've been meaning to experiment with this myself, but it hasn't been a high enough priority.)

The backup.pl Perl script described below can be used to call dump to create backup files, and then to verify them using restore. Verification is not as essential as it was in the bad old days, when media were much less reliable and hardware was much less clever about working around media failures; in fact, I have never yet found a corrupted backup file during verification. However, I've heard enough horror stories of people not finding out that their backups were unreadable until they needed them that I don't consider a backup complete until it has been verified in its final location. This is why my scripts always verify.
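
For the read-only remount mentioned above, a minimal sketch might look like the following; the device, dump level, and output file are illustrative, and the remount will fail if any process still has a file open for writing on /home:

    mount -o remount,ro /home
    dump -2 -u -f /scratch/backups/home-2.dump /dev/hda9
    mount -o remount,rw /home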

Initially, I wrote a shell script version, then recoded it in Perl when the functionality I needed to add became too awkward to express in shell syntax. The backup.pl Perl script has a number of extra features, plus changed defaults that make it more convenient to use for CD-R and DVD dumps (which is why the shell script is no longer published).

To see how many bytes are likely to be dumped, use the "-S" option to dump, e.g.

    dump -S2 /dev/hda9
for a level 2 dump of the /dev/hda9 partition (mounted on my system as /home).

Archival backups with DAR

"dar" stands for Disk ARchive, and has a number of advantages with respect to dump:

Unfortunately, there are also a few disadvantages:

The backup.pl Perl script

When backup.pl is run by root, it creates and verifies a set of backup files using either dump and restore or the DAR program. Usage is
    backup.pl [--test] [--verbose] [--usage|-?] [--help]
              [--date=<string>] [--name-prefix=<string>]
              [--file-name=<name>]
              [--dump-program=<dump-prog>] [--[no]dar]
              [--restore-program=<restore-prog>]
              [--gzip | -z] [--bzip2 | -y]
              [--cd-dir=<mv-dir>] [--dump-dir=<dest-dir>]
              [--dest-dir=<destination-dir>]
              [--[no]cd] [--volsize=<max-vol-size>]
              [--partition=<block-special-device> | <partition>]
              [--level=<digit> | <level>]
See the backup.pl man page for argument descriptions, known bugs, and other details.

Download backup.pl into /usr/local/bin/ and make it executable; note that incremental DAR dumps also require the Backup::DumpSet Perl module.
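
As a hypothetical example of the DAR back end (the options are taken from the synopsis above, and the partition and staging directory are the ones from my crontab), a full DAR dump of /home staged for later burning might be invoked as:

    backup.pl --dar --cd-dir /scratch/backups/cd/to-write /dev/hda9 0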

Copying dump files with vacuum.pl

vacuum.pl copies dump files from place to place, being careful to copy only current backups, and guarding against network corruption. Usage for this is
    vacuum.pl [--test] [--verbose] [--usage|-?] [--help]
              [--from=<source-dir>] [--to=<dest-dir>]
              [--mode=(mv|cp)] [--prefix=<tag>]
              [--since=<date-string>] [--min-free-left=<size>]
See the vacuum.pl man page for argument descriptions, known bugs, and other details.
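
As a hypothetical example (the option names come from the synopsis above; the directories are illustrative), the following would copy current dump files from a staging directory into the directory of files waiting to be written to CD:

    vacuum.pl --mode=cp --from=/scratch/backups --to=/scratch/backups/cd/to-write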

For another useful tool for manipulating backups, see the Burning CDs with cd-dump.pl section.


Notes on mirroring backup

The weakness of mirroring backup is that it only gives you a single archival time point from which to recover. Of course, this assumes that you only make a single mirrored copy; multiple copies could get quite expensive, so it's not surprising that I've never heard of anyone who has actually done multiple mirrored copies, except possibly for Web content.

The key parameter for mirroring backup is therefore the backup frequency, which involves a tradeoff between the two different kinds of recovery capability discussed above. If you back up more frequently, then you will lose less in the event of a catastrophic failure (i.e. a disk crash), but you will also have less time in which to recover from file corruption or accidental deletion. The extreme case of frequent backup is RAID 1 mirroring, in which backup is transparent and so frequent as to be effectively instantaneous, and recovery from single-disk failure is likewise transparent, but there is no archival history whatsoever. Having RAID is not the same as having a backup!

Another example of continuous mirroring backup is database replication. A full discussion of replication is beyond the scope of this page, but note that the same caution applies: Just because your database is replicated doesn't mean that it's backed up!

Mirroring case history 1: Web server content

At work, I use rsync to mirror the intranet Web server content. The primary server runs the following cron job every morning at 01:55:

    rsync -e ssh -a /srv/www/htdocs /srv/www/cgi-bin rome:/srv/www
The -a switch requests archival copying; according to the manual page, the -a option "... is a quick way of saying you want recursion and want to preserve almost everything." This command copies the contents of the /srv/www/htdocs and /srv/www/cgi-bin directory trees on the primary server into the same locations on rome, the standby server. That way, I only need to update the primary server; the standby server is always ready to take over, with content that is at most a day out of date.
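
In crontab form on the primary server, that 01:55 job would look like this:

    55 1 * * *	rsync -e ssh -a /srv/www/htdocs /srv/www/cgi-bin rome:/srv/www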

Note the "-e ssh" on the command line; that tells rsync to use ssh to establish the remote connection. In order for this cron job (not to mention others) to work unattended, it was necessary to configure ssh to use "public key authentication," creating a public/private key pair and installing the private key on the main server and the public key on the standby server. As a result, no password is needed, and the strong encryption techniques used by ssh are robust enough to permit mirroring to be done securely over the public Internet.
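
One way to set that up, sketched below, is to generate a passphrase-less key pair on the primary server and append its public half to the standby server's authorized_keys file; rome is the standby server from the example above:

    # On the primary server, as the user that owns the cron job:
    ssh-keygen -t rsa -N ''     # generate a key pair with no passphrase, default file name
    ssh-copy-id rome            # append the public key to rome's ~/.ssh/authorized_keys
    ssh rome true               # quick test: should succeed without prompting for a password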

Mirroring case history 2: Full disk copy

rsync can also be used for disk-to-disk copying within a single system. Here is how Anthony DiSante describes his backup system, in which he uses rsync in lieu of archival backup:

I use rsync for my weekly backups -- I've got two 120GB disks in my computer, and I have a 250GB disk in an external firewire enclosure. Once the external drive is mounted at /mnt/backup, all it takes is this simple command:
    rsync -a --delete --exclude /mnt/backup / /mnt/backup
The -a switch is for archival copying, --exclude tells it not to copy the external drive onto itself, and --delete means to delete any files on the destination that no longer exist on the source. The result is that, when complete, the disk at /mnt/backup is an exact copy of my root filesystem (which includes both 120GB disks). rsync is of course known for its highly efficient remote-update algorithm whereby only the changes in files are transmitted; in practice, I find that my weekly backup takes about an hour to run on my 172GB of used space.

Note that a system-to-system backup of this magnitude might not take much longer; probably not much of that 172GB changes from week to week, so rsync would figure that out and would only transfer the differences. Based on my experience, a full dump of 172GB (uncompressed) would require 11 hours to transmit over a local 100BaseT connection, so dealing with archival dumps of this size would be a pain.

Also, since the backup drive is removed after update, this setup can be extended to use two or more identically-configured external drives, which are updated in rotation. This requires no more effort than for a single drive, but begins to provide some archival history, for those who can afford the additional hardware.

Acknowledgements

Thanks to Anthony DiSante <orders at nodivisions dot com> for pointing out that I had neglected to mention rsync; the resulting reorganization of the material has made this page much more comprehensive.


Bob Rogers <rogers@rgrjr.dyndns.org>