System backup


Abstract

After describing the difference between archival and mirroring backups and some common backup tools used for each by Unix system administrators, I describe the general considerations that go into setting up an archival backup system, and describe the tools I use for archival backups. I conclude with a briefer section of notes on mirroring backup -- briefer both because the technology is simpler, and because I use it less.

Table of contents

  1. System backup
    1. Abstract
    2. Table of contents
    3. Backup technologies
    4. Notes on archival backups
      1. Backup levels
      2. Backup frequency
      3. Backup timing
      4. Automated backups with cron
    5. Tools for archival backup dumps
      1. Archival backups with tar
      2. Archival backups with dump
      3. Archival backups with dar
      4. Backup tools in the "scripts" project
      5. The backup.pl Perl script
      6. Listing backups with show-backups.pl
      7. Copying dump files with vacuum.pl
      8. Tidying up with clean-backups.pl
    6. Notes on mirroring backup
      1. Mirroring case history 1: Web server content
      2. Mirroring case history 2: Full disk copy
    7. Acknowledgements


Backup technologies

The baseline motivation behind all backup systems is disaster recovery: You want to ensure that your files will survive all hardware failures that Murphy's Law might conceivably throw at you. All backup technologies meet this goal by making a copy, but there are really two kinds of copies, with distinct recovery characteristics: Archival, and mirroring.

Archival backup gives you the ability to travel through time: If you suddenly realize that an important file is missing, and you're not sure when it was deleted, then the ability to sift through a year of backup dumps looking for the missing file can be a life-saver. In order to do this, however, you must keep a lot of data around, and that almost always means putting the backup dumps on some sort of offline storage.

Mirroring backup gives you immediate access to the most recent copy of your data; if you deleted that important file just this morning, then it's a snap to go get it from the backup drive, without any searching. On the other hand, if you deleted it before the last mirroring operation, you are completely out of luck. At a minimum, mirroring only requires a spare disk of comparable size, and is easy to automate completely, as it requires no manipulation of offline media.

The "entry-level" backup options for Linux (and Unix systems generally) tend to provide either archival or mirroring, but not both. They are:

  1. The standard Unix tar utility. This is GNU tar in free implementations (and even some other Unix flavors), and is the easiest tool for archiving particular directories. It has the distinct advantage of being supremely portable; "tar" format can be read by all other Unix systems, and even by DOS/Windows and Macintosh. It is usually included in a default installation, but if not, "zypper install tar" will get it on openSUSE.
  2. The traditional Unix dump and restore programs. Most Linux systems come with the dump/restore implementation for ext2/ext3, but these are the traditional names for the archival backup programs in Unix, so "man dump" will almost always come up with something on any Unix system (try "zypper install dump" on openSUSE).
  3. The dar program is something between tar and dump; it stands for "disk archiver," as in disk-to-disk (do "zypper install dar" on openSUSE).
  4. The rsync program. Unlike the other three, rsync is designed to mirror the content of directory trees over the network, is quite clever about only transferring data that have changed, and can also be set up to do local disk-to-disk copying. (You guessed it, "zypper install rsync" on openSUSE).

Fortunately, it is possible to have both archival and mirroring backup, for those that need it. For small to medium installations where high availability is important, you can install a hybrid system where archival dumps are created on a primary server, copied to a backup server for safe-keeping, and also restored onto the backup server's disks for quick access in the event that the primary server fails.

And for many small installations, archival backups are sufficient. This is all I need at home, in fact.

It is also possible to do mirroring without archival, though I myself would not recommend it. But the low maintenance burden of an rsync solution may make it the most appealing option for some -- just be clear that you're giving up your "data history" when you pass on archival.


Notes on archival backups

In order to reduce the amount of storage required for archival backups, it is desirable to skip files that haven't changed since the last backup. Obviously, the first backup must contain everything, but each subsequent backup in a series need only contain the files changed since the previous one; in the event of a disaster, restoring all of the backups in the order in which they were made will return the file system to the same state as if it had been restored from a single full backup made on the last day. This scheme still has two drawbacks: First, the process of restoring the file system gets to be quite tedious after a few weeks, since by then there are quite a few backups to restore. Worse yet, data will be lost if any one of those backups is somehow lost or becomes corrupted.

Backup levels

In order to address these drawbacks, it is useful to define a backup level between 0 and 9 that controls how comprehensive to make the backup. Each level k backup contains a snapshot of all files changed since the most recent backup made at any lower level. Level 0 is therefore the most comprehensive, and level 9 is the most "incremental." At this point, some additional terminology is in order: A level 0 backup is called a full dump, since it contains everything; a level 1 backup is called a consolidated dump, since it consolidates everything changed since the last full dump; and backups at levels 2 through 9 are called incremental (or "daily") dumps.

In order to reduce the number of incrementals required, one can use the "modified Tower of Hanoi algorithm" described in the dump manpage, which prescribes the following sequence of incremental dump levels (after having made a full or consolidated dump):

    3 2 5 4 7 6 9 8 9 9 ...

These levels are for daily backups, which is the bare minimum frequency for a workgroup server in an office environment. At the end of the week, a consolidated dump is performed, and the daily cycle starts over again. (To restore after the sixth day of such a cycle, for example, you would need the last full dump, the most recent consolidated dump, and then that week's level 2, 4, and 6 dumps.) Once the consolidated dump is made, last week's incrementals could be thrown away, as they are no longer needed for disaster recovery, but it's a good idea to keep them around for at least a month in order to cover the "I didn't mean to delete that" syndrome.

In any case, this multilevel backup system turns out to be quite effective in reducing the size of backups; even after a month, a consolidated dump can be only about 20% of the size of the full dump, and the daily incrementals only 3 to 5%.

Backup frequency

Deciding how often to make backups requires making a tradeoff between how many days of work you are willing to lose versus how much effort you have to spend on performing each backup. That is why a high degree of automation is a great advantage; it costs essentially nothing to take backups every day. My automated system costs me only 5 to 10 minutes per week, mostly to write consolidated backups to CD, and changing the daily backup schedule wouldn't affect that at all.

For less automated systems, the cost may be 5 to 10 minutes for each backup dump. A system failure that requires restoring from backups could happen at any time during the backup cycle, which means that the expected amount of work lost per failure is half of the usage accumulated between backups. In other words, if the system is backed up after every 40 hours of use, then the expected loss due to a failure is 20 hours. It seems reasonable to set the expected loss over the course of a year equal to the planned time investment, and then solve for the backup frequency. (Conveniently, this turns out to be exactly the frequency that minimizes the total expected effort, i.e. the time invested plus the expected loss.) If we do that, we get:

I*f = W*F/(2*f)

f^2 = W*F/(2*I)

f = sqrt(W*F/(2*I))

where

    I = the time invested in each backup (minutes),
    f = the backup frequency (backups per year),
    W = the annual usage of the system (minutes per year), and
    F = the expected failure rate (failures per year).

Of course there are other costs to consider, such as inconvenience to customers (and staff embarrassment) when you have to admit that you lost their emails, but these mostly define a "maximum acceptable loss" ceiling, underneath which it is still desirable to seek an optimum.

If there is only one user who uses the system for 40 hours per week, and who does their own backups, then we have what might be called the "standard home office scenario." For this scenario, and assuming that (a) backups take 10 minutes on average, and (b) the system is likely to fail once per year on average (which might or might not be pessimistic), then we arrive at the following optimal backup frequency for the home office:

f_opt = sqrt((120000 min/yr * 1 failure/yr)/(2 * 10 min))

      = sqrt(6000) = 77.5 per yr

This works out to be three times every two weeks, for a total time investment (which, at the optimum, equals the expected time lost to data recovery) of 77.5*10 = 775 minutes, or about 13 hours. If we round this frequency up to twice per week, the time investment becomes 1000 minutes (almost 17 hours!), but the expected time lost drops to only 10 hours (a quarter of a work week).

Most changes to this minimal scenario have the effect of driving the ideal backup frequency up. If there were ten people using the system via file sharing, then the amount of potential lost work is ten times higher, and so it becomes worth investing that 10 minutes every working day (the actual optimal frequency is nearly 245 backups per year). If the time of the person making backups is only worth half as much as that of the average file server user (in which case we should optimize the dollar cost), then the "daily is optimal" point would be reached with only 4 or 5 additional server users. The end result is that it rarely makes sense for small offices with shared file servers to do backups any less often than daily. If the resulting 41 hours per annum of staff time spent on backups becomes excessive, then it's time to increase the level of backup automation.
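
To plug in your own numbers, the calculation is easy to script; here is a minimal sketch (using the shell and awk, with the home-office figures from above):

    # f = sqrt(W*F/(2*I)):  W = annual usage (min/yr), F = failures/yr,
    # I = minutes spent per backup.
    W=120000 F=1 I=10
    awk -v W=$W -v F=$F -v I=$I \
        'BEGIN { printf "optimal backup frequency: %.1f per yr\n", sqrt(W*F/(2*I)) }'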

Backup timing

Backup timing is also important, though often overlooked. If the backup system makes its copy of a given file while an application is partway through updating it, the copy that winds up on the backup medium may be inconsistent, and would appear to be corrupted to the application if it were ever restored. For this reason, it is best to make backups at times when the file system isn't changing. The middle of the night is therefore ideal.

Another solution to the "changing data during dump" problem is to remount the filesystem read-only before performing the backup. This has never been practical for me; if the partition is exported via NFS (true for all partitions I need to back up), I would need to unmount it on all clients, possibly disrupting shell sessions or other long-running processes. The closest I've come is to edit /etc/fstab to mark the partition as read-only temporarily and then reboot, but that doesn't work for automated nightly backups, so I've only done it when making extra just-in-case backups before server upgrades, when I am planning to reboot the system anyway.
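
For a partition that is not NFS-exported and has no files open for writing, the remount can be done without a reboot; a minimal sketch (using /home purely as an example) would be:

    mount -o remount,ro /home    # fails harmlessly if any file is open for writing
    # ... make the backup ...
    mount -o remount,rw /home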

A particularly nasty case of backup-induced corruption can be caused by backing up the files used by a relational database management system (RDBMS) to implement tables. A transaction that updates multiple tables may be in different stages of being written to disk for each table, so the backup might be inconsistent even if it could be done instantaneously. There are really only two choices for archival backup of a database: Stop the RDBMS server completely (e.g. "systemctl stop mariadb") during the backup, or use a database client backup program (e.g. mysqldump for the MariaDB system). Doing the latter is more robust, since it makes it more likely that old database content can be restored into a much later version of the RDBMS system.
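
For example, a nightly dump of all MariaDB databases using the client backup program might look something like this (the output path is illustrative, and the command assumes the usual root credentials in /root/.my.cnf; --single-transaction gives a consistent snapshot for transactional tables such as InnoDB):

    mysqldump --all-databases --single-transaction \
        | gzip > /scratch/backups/mariadb-$(date +%Y%m%d).sql.gz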

For similar reasons, backing up more than once a day is probably not worth the bother. The only predictable period during the day when the file system is highly unlikely to change is during the night when all users are asleep. And, for just those reasons, doing more than one backup during this period would be pointless.

Automated backups with cron

A regular weekly schedule is easy to automate via cron jobs. The crontab entries for the full schedule for my /home partition look like this:

# At 03:00 every night, do a /home backup.
00 03 * * Mon	/usr/local/bin/home-backups /dev/mapper/boot-home 3
00 03 * * Tue	/usr/local/bin/home-backups /dev/mapper/boot-home 2
00 03 * * Wed	/usr/local/bin/home-backups /dev/mapper/boot-home 5
00 03 * * Thu	/usr/local/bin/home-backups /dev/mapper/boot-home 4
00 03 * * Fri	/usr/local/bin/home-backups /dev/mapper/boot-home 7
00 03 * * Sat	/usr/local/bin/home-backups /dev/mapper/boot-home 6
00 03 * * Sun	/usr/local/bin/home-backups /dev/mapper/boot-home 1
# [full backup recipe.  -- rgr, 10-Apr-04.]
# 00 01 * * Mon	/usr/local/bin/home-backups /dev/mapper/boot-home 0

When I first automated this process, I tried making backups daily, but that turned out to be too much work: I didn't change that many files, and I still had to copy the backups to offline storage manually, and a backup is no help for disaster recovery unless it is copied to another disk fairly promptly. Consequently, I ran only the level 1, 2, 4, and 6 dumps from the crontab schedule above. Then I got a new desktop machine and set the old one up as a server, which made it possible to copy all dumps automatically from the server to the disk on the new machine.
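
The home-backups script invoked above is not shown on this page; a minimal, purely hypothetical sketch of such a wrapper, assuming it does nothing more than look up the device's mount point and hand off to the backup.pl script described below, might be:

    #!/bin/sh
    # Hypothetical wrapper:  home-backups <device> <level>
    device="$1"; level="$2"
    # Find where the device is mounted (it must already be mounted).
    target=$(findmnt -n -o TARGET "$device") || exit 1
    exec /usr/local/bin/backup.pl --dest-dir=/scratch/backups "$target" "$level"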


Tools for archival backup dumps

Archival backups with tar

The traditional Unix tar program is not well suited to making backups because it is only capable of creating full dumps. However, the GNU tar program can do incremental backup dumps, and it's probably already installed on your GNU/Linux system, so it's worth mentioning along with the other possibilities.

To keep track of what has been dumped so far, tar uses what it calls a "snapshot file," which is passed to the --listed-incremental option. Every time tar is run with this option, it consults this file for the state of the file system at the last backup, and updates it to reflect the files it writes to the new tarfile(s). If the named snapshot file does not exist, it is created, and the resulting tarfile contains a full dump. To keep the original snapshot file from being modified (which would make subsequent consolidated dumps impossible), it must be copied for each subsequent backup, so that each tarfile backup gets its own snapshot, at least until it is superseded by a later backup.

The resulting backup protocol looks like this:
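(The commands below are only a sketch; the tarfile and snapshot file names are illustrative.)

    # Level 0 (full):  create a fresh snapshot file along with the tarfile.
    tar --create --listed-incremental=home-l0.snar \
        --file=home-l0.tar /home

    # Any subsequent backup:  copy the most recent snapshot so that the
    # original stays untouched for later consolidated dumps, then let tar
    # update the copy as it writes the new tarfile.
    cp home-l0.snar home-l1.snar
    tar --create --listed-incremental=home-l1.snar \
        --file=home-l1.tar /home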

[I am considering extending the backup.pl Perl script described below to support GNU tar backups. However, because GNU tar requires keeping track of the additional snapshot file, it would require extensive changes to the backup code infrastructure, and seems harder to automate, so it's not clear that it's worth it. -- rgr, 26-Jan-21.]

Archival backups with dump

Historically, I used the standard, tried-and-true dump and restore programs. More recently I have used dar for backups; it has some advantages and disadvantages compared to dump, as described below.

An interesting characteristic of dump is that it accesses the raw device file (e.g. /dev/hda5) of ext2, ext3, and ext4 file systems, instead of going through the file system interface, which means that it can only work on whole partitions, and it can miss file data that is cached in RAM and not yet written to disk. The pros and cons of this approach are discussed on the "Is dump really deprecated?" page of the Dump/restore utilities project. In a nutshell, it makes the "changing data during dump" problem worse, though there are ways around this, but it has the unique advantage that partitions can be dumped without affecting any of the times recorded by the file system. And (as also mentioned in the "Backup timing" section) data that is changing during the backup is only one of the tradeoffs you need to consider when setting up a backup system.

[Note that the backup.pl Perl script described below used to support dump to create backup files, but I dropped that support in 2017. -- rgr, 25-Jan-21.]

To see how many bytes are likely to be written to a dump file, use the "-S" option to dump, e.g.

    dump -S2 /dev/hda9

for a level 2 dump of the /dev/hda9 partition.

Archival backups with dar

"dar" stands for
Disk ARchive, and has a number of advantages with respect to dump:

Unfortunately, there are also a few disadvantages:

All in all, I find the drawbacks minor, and have come to prefer dar; it has been my standard backup tool since 2008.
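
For reference, creating a full dar backup of /home and then an incremental against it might look something like the following (the paths follow the naming convention described below, but are illustrative):

    # Full (level 0) dump of /home, compressed, written as
    # home-20210104-l0.1.dar (more slices if -s is used to limit slice size).
    dar -z -R /home -c /scratch/backups/home-20210104-l0
    # Incremental dump of everything changed since the full dump above.
    dar -z -R /home -A /scratch/backups/home-20210104-l0 \
        -c /scratch/backups/home-20210105-l2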

Backup tools in the "scripts" project

To assist in setting up a backup system, I have written a series of Perl scripts that help to automate the tedious parts. These are part of the "scripts" project on GitHub and are available under an open-source license.

With these tools, it is possible to automate the system backups of a small to medium office site (up to 20 daily users) to a high degree. A cron job runs backup.pl to create backup files on the primary server in a scratch (non-backed-up) partition, which are then copied by vacuum.pl in a second cron job to another scratch partition on a secondary system for safekeeping. Once on the secondary server, one can use cd-dump.pl to write the full and consolidated dumps to offline media without loading the primary server. Other cron jobs on each system can then run clean-backups.pl periodically to remove the oldest daily backups in order to preserve sufficient room for new backups. Depending on the amount of scratch disk space available and the volume of daily file system churn, the sysadmin only needs to intervene a few times per month to write offline media and to remove excess full and consolidated dumps.

These tools use a backup file naming convention in order to make it easier to keep track of what may amount to thousands of backup files collected from multiple systems over many years. All such files match "<prefix>-<date>-l<level>.<slice>.dar" for dar backups or "<prefix>-<date>-l<level><idx>?.dump" for dump backups, where <prefix> names the backup set (usually after the partition, e.g. "home"), <date> is the date of the dump in YYYYMMDD format, <level> is the backup level digit, <slice> is the dar slice number (starting from 1), and <idx> is an optional suffix that distinguishes multiple dump files made for the same prefix, date, and level.

To install "scripts" backup tools,

    # git clone https://github.com/rgrjr/scripts
    # cd scripts
    # make install-backup
    ...
    #

By default, this will put the above scripts into /usr/local/bin (along with backup-dbs.pl and svn-dump.pl, which are not discussed here, as they are more specialized), as well as installing the classes they use where Perl can find them.

The backup.pl Perl script

When backup.pl is run by root, it creates and verifies a set of backup files using the dar program. Usage is

    backup.pl [ --test ] [ --verbose ] [ --usage|-? ] [ --help ]
              [ --date=<string> ] [ --name-prefix=<string> ]
              [ --file-name=<name> ]
              [ --dump-program=<dump-prog> ] [ --[no]dar ]
              [ --gzip | -z ] [ --bzip2 | -y ] [  --compression[=[algo:]level] ]
              [ --dest-dir=<destination-dir> ] [ --dump-dir=<dest-dir> ]
              [ --volsize=<max-vol-size> ]
              [ --target=<dir> | <dir> ] [ --level=<digit> | <level> ]

See the documentation in the script for argument descriptions, known bugs, and other details.
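
As an illustration only (the script's own documentation is authoritative), a nightly level 5 dump of /home written to /scratch/backups might be requested with something like:

    backup.pl --dest-dir=/scratch/backups /home 5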

Listing backups with show-backups.pl

The show-backups.pl script lists all backup files it can find under the search root(s) that follow the naming convention described above. The default search roots are any directories that match "/scratch*/backups" but that can be overridden by specifying directories on the command line. Other options exist to constrain the search by level, date, and prefix, and to modify the output format.

Usage is as follows:

    show-backups.pl [ --help ] [ --man ] [ --usage ] [ --prefix=<pattern> ... ]
		    [ --[no]slices ] [ --[no]date | --sort=(date|prefix|dvd) ]
		    [ --before=<date> ] [ --since=<date> ] [ --size-by-date ]
		    [ --level=<level> | --level=<min>:<max> ]
		    [ <search-root> ... ]

where:

    Parameter Name     Default  Explanation
     --before                   If specified, show only dumps on or before this date.
     --help                     Print detailed help.
     --level              all   If specified, show only dumps in this level range.
     --man                      Print the man page.
     --prefix                   Partition prefix on files; may be repeated.
     --since                    If specified, show only dumps made since this date.
     --size-by-date        no   Print a table of total size by dump date.
     --slices                   If specified, print only slice file names.
     --sort             prefix  Sort by prefix, date, or dvd order.
     --usage                    Print this synopsis.

Here is an example of the output:

    # show-backups.pl --since 2021-1-1
     *    56890485 home-20210127-l5.1.dar [orion:/scratch/backups/]
     *    64615879 home-20210126-l2.1.dar [orion:/scratch/backups/]
	  57317283 home-20210125-l3.1.dar [orion:/scratch/backups/]
     *   106608760 home-20210124-l1.1.dar [orion:/scratch/backups/]
	  77181762 home-20210123-l6.1.dar [orion:/scratch/backups/]
	  69923104 home-20210122-l7.1.dar [orion:/scratch/backups/]
	  79237797 home-20210121-l4.1.dar [orion:/scratch/backups/]
	  46002295 home-20210120-l5.1.dar [orion:/scratch/backups/]
	  96332012 home-20210119-l2.1.dar [orion:/scratch/backups/]
	  78037401 home-20210118-l3.1.dar [orion:/scratch/backups/]
	  91735019 home-20210117-l1.1.dar [orion:/scratch/backups/]
	  97052916 home-20210116-l6.1.dar [orion:/scratch/backups/]
	  80359764 home-20210115-l7.1.dar [orion:/scratch/backups/]
	  98385074 home-20210114-l4.1.dar [orion:/scratch/backups/]
	  62601445 home-20210113-l5.1.dar [orion:/scratch/backups/]
	  73749317 home-20210112-l2.1.dar [orion:/scratch/backups/]
	  67564592 home-20210111-l3.1.dar [orion:/scratch/backups/]
	  96643821 home-20210110-l1.1.dar [orion:/scratch/backups/]
	 113825418 home-20210109-l6.1.dar [orion:/scratch/backups/]
	  92711331 home-20210108-l7.1.dar [orion:/scratch/backups/]
	 101974794 home-20210107-l4.1.dar [orion:/scratch/backups/]
	  72500168 home-20210106-l5.1.dar [orion:/scratch/backups/]
	  84543562 home-20210105-l2.1.dar [orion:/scratch/backups/]
     *    11687025 home-20210104-l0-cat.1.dar [orion:/scratch/backups/]
     *  1563907072 home-20210104-l0.1.dar [orion:/scratch/backups/]
     *    60423634 home-20210104-l0.2.dar [orion:/scratch/backups/]
	  73820512 home-20210102-l6.1.dar [orion:/scratch/backups/]
	  72067136 home-20210101-l7.1.dar [orion:/scratch/backups/]
    #

Note that the current backup files (the ones that would need to be restored in order to recreate the most recent state of the filesystem) are marked with a "*".

Copying dump files with vacuum.pl

vacuum.pl copies backup dump files from place to place, being careful to copy only current backups, and it checks that each copy is good in order to guard against network corruption. Usage for this is

    vacuum.pl [--test] [--verbose] [--usage|-?] [--help]
	      [--from=<source-dir>] [--to=<dest-dir>]
	      [--mode=(mv|cp)] [--prefix=<tag> ... ]
	      [--since=<date-string>] [--min-free-left=<size>]

See the documentation in the script for argument descriptions, known bugs, and other details.
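
As an illustration (again, the script's documentation is authoritative), copying the current "home" dumps from an NFS mount of the primary server's scratch partition to the local one might look something like this (the /mnt/orion mount point is hypothetical):

    vacuum.pl --mode=cp --prefix=home \
              --from=/mnt/orion/scratch/backups --to=/scratch/backups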

Tidying up with clean-backups.pl

When following the modified Tower of Hanoi backup level scheme described above, daily backups (those with a backup level of 2 or greater) contain only one or two days' worth of data: Odd dailies (levels 3, 5, 7, and 9) cover only one day, and even dailies (levels 2, 4, 6, and 8) cover two days -- that day plus the day covered by the previous odd daily. Consequently, it is less important to keep dailies around for extended periods of time, and they can be deleted automatically when they are no longer useful. That is what clean-backups.pl does: It maintains a specified minimum amount of free space on the partition used for backups, first by removing odd dailies that are older than a specified threshold (starting with the oldest), and then, if that does not free enough space, by removing even dailies as well. No output is generated except when clean-backups.pl fails to make its quota, which makes it work well as a cron job; the sysadmin gets an email when it's time to think about removing consolidated dumps.

Full and consolidated backups cover much longer time periods, and are usually kept around for much longer. For this reason, clean-backups.pl never deletes full or consolidated backups.

Usage is as follows:

            clean-backups.pl [ --conf=<config-file> ]
                             [ --[no]test ] [ --verbose ... ]

            clean-backups.pl [ --usage | --help ]

The default configuration file is /etc/backup.conf, which tells which partitions to clean, and how thoroughly to clean them. Here is an example:

    # Backup configuration.

    [scorpio:/scratch]
    min-free-space = 10
    min-odd-retention = 60
    min-even-retention = 120
    clean = home

    [orion:/scratch]
    min-free-space = 0.5
    min-odd-retention = 60
    min-even-retention = 120
    clean = home

Notice that the same configuration file can be shared between the two systems, since each section is named by host and partition. Minimum retention is specified as a number of days, with the even and odd retention values given separately. The min-free-space value is specified in GiB, and clean tells which prefixes should be cleaned, in case multiple backup sets are stored on that partition.
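
Since clean-backups.pl is silent unless something goes wrong, it fits naturally into a crontab; a hypothetical entry might look like this:

    # Tidy up old daily backups at 05:00 every morning.
    00 05 * * *	/usr/local/bin/clean-backups.pl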


Notes on mirroring backup

The weakness of mirroring backup is that it only gives you a single archival time point from which to recover. Of course, this assumes that you only make a single mirrored copy; multiple copies could get quite expensive, so it's not surprising that I've never heard of anyone who has actually done multiple mirrored copies, except possibly for Web content.

The key parameter for mirroring backup is therefore the backup frequency, which involves a tradeoff between the two different kinds of recovery capability discussed above. If you back up more frequently, then you will lose less in the event of a catastrophic failure (i.e. a disk crash), but you will also have less time in which to recover from file corruption or accidental deletion. The limiting case of frequent backup is RAID 1 mirroring, in which the "backup" is transparent and so frequent as to be effectively instantaneous, and recovery from a single-disk failure is likewise transparent, but there is no archival history whatsoever. Having RAID is not the same as having a backup!

Another example of continuous mirroring backup is database replication. In a master-slave setup, the master database server replicates its changes to one or more slave servers; each server keeps a copy of the data on its local disk(s), so that if the master server fails, any of the slaves can be reconfigured to take over as the replacement master. However, the same caution applies: Just because your database is replicated doesn't mean that it's backed up!

Mirroring case history 1: Web server content

This Web server uses rsync to mirror the CMU Common Lisp download content from common-lisp.net. The server runs the following cron job once a day as root:

    cd /scratch/mirror
    rsync -avz common-lisp.net::project/cmucl/ cmucl > /root/cmucl-mirror-log.text

The -a switch requests archival copying; according to the manual page, the -a option "... is a quick way of saying you want recursion and want to preserve almost everything." The -v switch makes it verbose (which is low-cost, as the content has few but very large files, and doesn't change often), and -z means to use compression in transit. This command contacts the common-lisp.net rsync server and updates the contents of the cmucl tree under /scratch/mirror/cmucl/ on my server -- without bothering to copy anything that hasn't been changed. You can browse the content at http://www.rgrjr.com/cmucl/downloads/.

Mirroring case history 2: Full disk copy

rsync can also be used for disk-to-disk copying within a single system. Here is how Anthony DiSante describes his backup system, in which he uses rsync in lieu of archival backup:

I use rsync for my weekly backups -- I've got two 120GB disks in my computer, and I have a 250GB disk in an external firewire enclosure. Once the external drive is mounted at /mnt/backup, all it takes is this simple command:
    rsync -a --delete --exclude /mnt/backup / /mnt/backup
The -a switch is for archival copying, --exclude tells it not to copy the external drive onto itself, and --delete means to delete any files on the destination that no longer exist on the source. The result is that, when complete, the disk at /mnt/backup is an exact copy of my root filesystem (which includes both 120GB disks). rsync is of course known for its highly efficient remote-update algorithm whereby only the changes in files are transmitted; in practice, I find that my weekly backup takes about an hour to run on my 172GB of used space.

Note that a system-to-system backup of this magnitude might not take much longer; probably not much of that 172GB changes from week to week, so rsync would figure that out and would only transfer the differences. Based on my experience, a full dump of 172GB (uncompressed) would require 11 hours to transmit over a local 100BaseT connection, so dealing with archival dumps of this size would be a pain.

Also, since the backup drive is removed after update, this setup can be extended to use two or more identically-configured external drives, which are updated in rotation. This requires no more effort than for a single drive, but begins to provide some archival history, for those who can afford the additional hardware.

Acknowledgements

Thanks to Anthony DiSante <orders at nodivisions dot com> for pointing out that I had neglected to mention rsync; the resulting reorganization of the material has made this page much more comprehensive.


Bob Rogers <rogers@rgrjr.com>