Tutorial

Most work on Bacula happens on the director, which is where backups are coordinated. Actual data is stored on the storage daemon, but the director is where we issue commands and manage jobs.

All commands below are run from the bconsole shell, which can be started on the director with:

root@bacula-director-01:~# bconsole 
Connecting to Director bacula-director-01.torproject.org:9101
1000 OK: 103 torproject-dir Version: 9.4.2 (04 February 2019)
Enter a period to cancel a command.
*

Then you end up with a shell with * as a prompt where you can issue commands.

Checking last jobs

To see the last jobs that ran, you can check the status of the director:

*status director
torproject-dir Version: 9.4.2 (04 February 2019) x86_64-pc-linux-gnu debian 9.7
Daemon started 22-Jul-19 10:30, conf reloaded 23-Jul-2019 12:43:41
 Jobs: run=868, running=1 mode=0,0
 Heap: heap=7,536,640 smbytes=701,360 max_bytes=21,391,382 bufs=4,518 max_bufs=8,576
 Res: njobs=74 nclients=72 nstores=73 npools=291 ncats=1 nfsets=2 nscheds=2

Scheduled Jobs:
Level          Type     Pri  Scheduled          Job Name           Volume
===================================================================================
Full           Backup    15  03-Aug-19 02:10    BackupCatalog      *unknown*
====

Running Jobs:
Console connected using TLS at 02-Aug-19 15:41
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
107689  Back Full          0         0  chiwui.torproject.org is waiting for its start time (02-Aug 19:32)
====

Terminated Jobs:
 JobId  Level      Files    Bytes   Status   Finished        Name 
====================================================================
107680  Incr      51,879    2.408 G  OK       02-Aug-19 13:16 rouyi.torproject.org
107682  Incr         355    361.2 M  OK       02-Aug-19 13:33 henryi.torproject.org
107681  Diff      12,864    715.9 M  OK       02-Aug-19 13:34 pauli.torproject.org
107683  Incr         274    30.78 M  OK       02-Aug-19 13:50 forrestii.torproject.org
107684  Incr       3,423    2.398 G  OK       02-Aug-19 13:55 meronense.torproject.org
107685  Incr         288    32.24 M  OK       02-Aug-19 14:12 nevii.torproject.org
107686  Incr         341    69.64 M  OK       02-Aug-19 14:51 getulum.torproject.org
107687  Incr         289    26.24 M  OK       02-Aug-19 15:11 dictyotum.torproject.org
107688  Incr         376    57.62 M  OK       02-Aug-19 15:22 kvm5.torproject.org
107690  Incr         238    20.88 M  OK       02-Aug-19 15:32 opacum.torproject.org

====

Here we see that no backup is actively running (one job is queued and waiting for its start time), and that the last jobs all completed successfully.

You can also check the status of individual clients with status client.
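
For example, to query a single client (the -fd name comes from the client list on the director; output omitted here):

*status client=chiwui.torproject.org-fd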

Checking messages

The messages command shows the latest messages on the bconsole. It's useful to run this command when you start your session as it will flush the (usually quite long) buffer of messages. That way the next time you call the command, you will only see the result of your latest jobs.

Running a backup

Backups are run regularly by a cron job, but if you need to run a backup immediately, this can be done from bconsole.

The short version is to just issue the run command and pick the server to back up.

Longer version:

  1. enter the console on the bacula director:

    ssh -tt bacula-director-01.torproject.org bconsole
    
  2. enter the run dialog:

    *run
    A job name must be specified.
    The defined Job resources are:
         1: RestoreFiles
         2: alberti.torproject.org
         3: archive-01.torproject.org
         [...]
         59: peninsulare.torproject.org
    
  3. pick a server, for example peninsulare.torproject.org above is 59, so enter 59 and confirm by entering yes:

    Select Job resource (1-77): 59
    Run Backup job
    JobName:  peninsulare.torproject.org
    Level:    Incremental
    Client:   peninsulare.torproject.org-fd
    FileSet:  Standard Set
    Pool:     poolfull-torproject-peninsulare.torproject.org (From Job resource)
    Storage:  File-peninsulare.torproject.org (From Pool resource)
    When:     2019-10-11 20:57:09
    Priority: 10
    OK to run? (yes/mod/no): yes
    Job queued. JobId=113225
    
  4. bacula confirms the job is queued. you can see the status of the job with status director, which should show a set of lines like this in the middle:

    JobId  Type Level     Files     Bytes  Name              Status
    ======================================================================
    113226  Back Incr          0         0  peninsulare.torproject.org is running
    
  5. this will take more or less time depending on the size of the server. you can call status director repeatedly to follow progress (for example, with watch "echo status director | bconsole" in another shell), or run the messages command to see new messages as they arrive. when the backup completes, you should see something like this in the status director output:

    Terminated Jobs:
     JobId  Level      Files    Bytes   Status   Finished        Name 
    ====================================================================
    113225  Incr          33    11.67 M  OK       11-Oct-19 20:59 peninsulare.torproject.org
    

That's it, new files were sucked in and you're good to do whatever nasty things you were about to do.

How to...

This section is more in-depth and will explain more concepts as we go. Relax, take a deep breath, it should go fine.

Configure backups on new machines

Backups for new machines should be automatically configured by Puppet using the bacula::client class, included everywhere (through hiera/common.yaml).
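
A rough sketch of what that hiera data looks like (the actual tor-puppet common.yaml may differ; the bacula::client::director_server key is the one referenced later in this document):

classes:
- bacula::client
bacula::client::director_server: 'bacula-director-01.torproject.org'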

There are special configurations required for MySQL and PostgreSQL databases, see the design section for more information on those.

Restore files

Short version:

$ ssh -tt bacula-director-01.torproject.org bconsole
*restore

... and follow instructions. Reminder: by default, backups are restored on the originating server. Use llist jobid=N and messages to follow progress.

The bconsole program has a pretty good interactive restore mode which you can just call with restore. It needs to know which "jobs" you want to restore from. Since a given backup job is typically incremental, you normally need multiple jobs (the last full backup plus the differential and incremental backups since) to restore to a given point in time.

The first thing to know is that restores are done from the server to the client, i.e. files are restored directly on the machine that is backed up. Note that by default restored files will be owned by the bacula user, because the file daemon runs as bacula in our configuration. If that's a problem (for example for large restores where fixing ownership afterwards is impractical), the override (in /etc/systemd/system/bacula-fd.service.d/override.conf) can be temporarily disabled by simply removing the file and restarting the service:

rm /etc/systemd/system/bacula-fd.service.d/override.conf
systemctl restart bacula-fd

And then restarting the restore procedure. Note that this file will be re-created by Puppet the next time it runs, so you may also want to run puppet agent --disable 'to respect the bacula-fd override'. In this configuration, however, the file daemon can overwrite any file, so be careful.
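
Once the restore is done, Puppet can be re-enabled and run so it recreates the override, and the file daemon restarted to go back to running as the bacula user; a minimal sketch:

puppet agent --enable
puppet agent -t
systemctl restart bacula-fd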

A simple way of restoring a given client to a given point in time is to use the "Select the most recent backup for a client" option (5 in the menu below). So:

  1. enter bconsole in a shell on the director

  2. call the restore command:

    *restore
    Automatically selected Catalog: MyCatalog
    Using Catalog "MyCatalog"
    
    First you select one or more JobIds that contain files
    to be restored. You will be presented several methods
    of specifying the JobIds. Then you will be allowed to
    select which files from those JobIds are to be restored.
    
  3. you now have a list of possible ways of restoring; choose 5: Select the most recent backup for a client:

    To select the JobIds, you have the following choices:
         1: List last 20 Jobs run
         2: List Jobs where a given File is saved
         3: Enter list of comma separated JobIds to select
         4: Enter SQL list command
         5: Select the most recent backup for a client
         6: Select backup for a client before a specified time
         7: Enter a list of files to restore
         8: Enter a list of files to restore before a specified time
         9: Find the JobIds of the most recent backup for a client
        10: Find the JobIds for a backup for a client before a specified time
        11: Enter a list of directories to restore for found JobIds
        12: Select full restore to a specified Job date
        13: Cancel
    Select item:  (1-13): 5
    
  4. you will see a list of machines; pick the machine you want to restore from by entering its number:

    Defined Clients:
         1: alberti.torproject.org-fd
    [...]
       117: yatei.torproject.org-fd
    Select the Client (1-117): 87
    
  5. you now get dropped into a file browser where you use the mark and unmark commands to mark and unmark files for restore. the commands support wildcards like *. use mark * to mark all files in the current directory; see also the full list of commands:

    Automatically selected FileSet: Standard Set
    +---------+-------+----------+-----------------+---------------------+----------------------------------------------------------+
    | jobid   | level | jobfiles | jobbytes        | starttime           | volumename                                               |
    +---------+-------+----------+-----------------+---------------------+----------------------------------------------------------+
    | 106,348 | F     |  363,125 | 157,545,039,843 | 2019-07-16 09:42:43 | torproject-full-perdulce.torproject.org.2019-07-16_09:42 |
    | 107,033 | D     |    9,136 |     691,803,964 | 2019-07-25 06:30:15 | torproject-diff-perdulce.torproject.org.2019-07-25_06:30 |
    | 107,107 | I     |    4,244 |     214,271,791 | 2019-07-26 06:11:30 | torproject-inc-perdulce.torproject.org.2019-07-26_06:11  |
    | 107,181 | I     |    4,285 |     197,548,921 | 2019-07-27 05:30:51 | torproject-inc-perdulce.torproject.org.2019-07-27_05:30  |
    | 107,257 | I     |    4,273 |     197,739,452 | 2019-07-28 04:52:15 | torproject-inc-perdulce.torproject.org.2019-07-28_04:52  |
    | 107,334 | I     |    4,302 |     218,259,369 | 2019-07-29 04:58:23 | torproject-inc-perdulce.torproject.org.2019-07-29_04:58  |
    | 107,423 | I     |    4,400 |     287,819,534 | 2019-07-30 05:42:09 | torproject-inc-perdulce.torproject.org.2019-07-30_05:42  |
    | 107,504 | I     |    4,278 |     413,289,422 | 2019-07-31 06:11:49 | torproject-inc-perdulce.torproject.org.2019-07-31_06:11  |
    | 107,587 | I     |    4,401 |     700,613,429 | 2019-08-01 07:51:52 | torproject-inc-perdulce.torproject.org.2019-08-01_07:51  |
    | 107,653 | I     |      471 |      63,370,161 | 2019-08-02 06:01:35 | torproject-inc-perdulce.torproject.org.2019-08-02_06:01  |
    +---------+-------+----------+-----------------+---------------------+----------------------------------------------------------+
    You have selected the following JobIds: 106348,107033,107107,107181,107257,107334,107423,107504,107587,107653
    
    Building directory tree for JobId(s) 106348,107033,107107,107181,107257,107334,107423,107504,107587,107653 ...  mark etc
    ++++++++++++++++++++++++++++++++++++++++++++++
    335,060 files inserted into the tree.
    
    You are now entering file selection mode where you add (mark) and
    remove (unmark) files to be restored. No files are initially added, unless
    you used the "all" keyword on the command line.
    Enter "done" to leave this mode.
    
    cwd is: /
    $ mark etc
    1,921 files marked.
    

    Do not use the estimate command as it can take a long time to run and will freeze the shell.

  6. when done selecting files, call the done command

    $ done
    
  7. this will drop you into a confirmation dialog showing what will happen. note the Where parameter, which shows where the files will be restored on the RestoreClient. Make sure that location has enough space for the restore to complete.

    Bootstrap records written to /var/lib/bacula/torproject-dir.restore.6.bsr
    
    The Job will require the following (*=>InChanger):
       Volume(s)                 Storage(s)                SD Device(s)
    ===========================================================================
    
        torproject-full-perdulce.torproject.org.2019-07-16_09:42 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-diff-perdulce.torproject.org.2019-07-25_06:30 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-07-26_06:11 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-07-27_05:30 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-07-29_04:58 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-07-31_06:11 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-08-01_07:51 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
        torproject-inc-perdulce.torproject.org.2019-08-02_06:01 File-perdulce.torproject.org FileStorage-perdulce.torproject.org
    
    Volumes marked with "*" are in the Autochanger.
    
    
    1,921 files selected to be restored.
    
    Using Catalog "MyCatalog"
    Run Restore job
    JobName:         RestoreFiles
    Bootstrap:       /var/lib/bacula/torproject-dir.restore.6.bsr
    Where:           /var/tmp/bacula-restores
    Replace:         Always
    FileSet:         Standard Set
    Backup Client:   perdulce.torproject.org-fd
    Restore Client:  perdulce.torproject.org-fd
    Storage:         File-perdulce.torproject.org
    When:            2019-08-02 16:43:08
    Catalog:         MyCatalog
    Priority:        10
    Plugin Options:  *None*
    
  8. this doesn't restore the backup immediately, but schedules a job that does so, like so:

    OK to run? (yes/mod/no): yes
    Job queued. JobId=107693
    

You can see the status of the jobs on the director with status director, but you can also check the status of that specific job with llist jobid=107693:

*llist JobId=107697
           jobid: 107,697
             job: RestoreFiles.2019-08-02_16.43.40_17
            name: RestoreFiles
     purgedfiles: 0
            type: R
           level: F
        clientid: 9
      clientname: dictyotum.torproject.org-fd
       jobstatus: R
       schedtime: 2019-08-02 16:43:08
       starttime: 2019-08-02 16:43:42
         endtime: 
     realendtime: 
        jobtdate: 1,564,764,222
    volsessionid: 0
  volsessiontime: 0
        jobfiles: 0
        jobbytes: 0
       readbytes: 0
       joberrors: 0
 jobmissingfiles: 0
          poolid: 0
        poolname: 
      priorjobid: 0
       filesetid: 0
         fileset: 
         hasbase: 0
        hascache: 0
         comment:

The JobStatus column is an internal database field that will show T ("terminated normally") when completed or R or C when still running or not started, and anything else if, well, anything else is happening. The full list of possible statuses is hidden deep in the developer documentation, obviously.

The messages command also provides a good way of showing the latest status, although it will flood your terminal if it hasn't been run for a long time. You can hit "enter" to see if there are new messages.

*messages
[...]
02-Aug 16:43 torproject-sd JobId 107697: Ready to read from volume "torproject-inc-perdulce.torproject.org.2019-08-02_06:01" on File device "FileStorage-perdulce.torproject.org" (/srv/backups/bacula/perdulce.torproject.org).
02-Aug 16:43 torproject-sd JobId 107697: Forward spacing Volume "torproject-inc-perdulce.torproject.org.2019-08-02_06:01" to addr=328
02-Aug 16:43 torproject-sd JobId 107697: Elapsed time=00:00:03, Transfer rate=914.8 K Bytes/second
02-Aug 16:43 torproject-dir JobId 107697: Bacula torproject-dir 9.4.2 (04Feb19):
  Build OS:               x86_64-pc-linux-gnu debian 9.7
  JobId:                  107697
  Job:                    RestoreFiles.2019-08-02_16.43.40_17
  Restore Client:         bacula-director-01.torproject.org-fd
  Where:                  /var/tmp/bacula-restores
  Replace:                Always
  Start time:             02-Aug-2019 16:43:42
  End time:               02-Aug-2019 16:43:50
  Elapsed time:           8 secs
  Files Expected:         1,921
  Files Restored:         1,921
  Bytes Restored:         2,528,685 (2.528 MB)
  Rate:                   316.1 KB/s
  FD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Restore OK

Once the job is done, the files will be present in the chosen location (Where) on the given server (RestoreClient).

See the upstream manual for more information about the restore command.

Restore the director server

If the storage daemon disappears catastrophically, there's nothing we can do: the data is lost. But if the director disappears, we can still restore from backups. Those instructions should cover the case where we need to rebuild the director from backups. The director is, essentially, a PostgreSQL database. Therefore, the restore procedure is to restore that database, along with some configuration.

This procedure can also be used to rotate or replace a still-running director.

  1. if the old director is still running, start a fresh backup of the old database cluster from the storage server:

    ssh -tt bungei sudo -u torbackup postgres-make-base-backups dictyotum.torproject.org:5433 &
    
  2. disable puppet on the old director:

    ssh dictyotum.torproject.org puppet agent --disable 'disabling scheduler -- anarcat 2019-10-10' 
    
  3. disable the scheduler by commenting out the cron job, wait for running jobs to complete, then shut down the old director:

    sed -i '/dsa-bacula-scheduler/s/^/#/' /etc/cron.d/puppet-crontab
    watch -c "echo 'status director' | bconsole "
    service bacula-director stop
    

    TODO: this could be improved: <weasel> it's idle when there are no non-idle 'postgres: bacula bacula' processes and it doesn't have any open tcp connections?
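
    A rough, untested sketch of that check, run on the old director (the process and port details here are assumptions based on the note above):

    # any non-idle bacula database backends left?
    ps aux | grep '[p]ostgres: bacula bacula' | grep -v idle
    # any TCP connections still held open by the director?
    ss -tnp state established | grep bacula-dir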

  4. create a new machine and run Puppet with the roles::backup::director class applied to the node, say in hiera/nodes/bacula-director-01.yaml:

    classes:
    - roles::backup::director
    bacula::client::director_server: 'bacula-director-01.torproject.org'
    

    This should restore a basic Bacula configuration with the director acting, weirdly, as its own director.

    When you add the machine to Nagios, make sure to add it to the postgres96-hosts group so that the PostgreSQL cluster is correctly monitored.

  5. Run Puppet by hand on the new director and the storage server a few times, so their manifests converge:

    ssh bungei.torproject.org puppet agent -t
    ssh bacula-director-01.torproject.org puppet agent -t
    ssh bungei.torproject.org puppet agent -t
    ssh bacula-director-01.torproject.org puppet agent -t
    ssh bungei.torproject.org puppet agent -t
    ssh bacula-director-01.torproject.org puppet agent -t
    ssh bungei.torproject.org puppet agent -t
    ssh bacula-director-01.torproject.org puppet agent -t
    

    The Puppet manifests will fail because PostgreSQL is not installed. And even if it were, they would fail because they don't have the right passwords. For now, PostgreSQL is configured by hand.

    TODO: Do consider deploying it with Puppet, as discussed in postgresql.

  6. Install the right version of PostgreSQL.

    It might be the case that backups of the director are from an earlier version of PostgreSQL than the version available in the new machine. In that case, an older sources.list needs to be added:

    cat > /etc/apt/sources.list.d/stretch.list <<EOF
    deb https://deb.debian.org/debian/  stretch  main
    deb http://security.debian.org/ stretch/updates  main
    EOF
    apt update
    

    Actually install the server:

    apt install -y postgresql-9.6
    
  7. Once the base backup from step one is completed (or if there is no old director left), restore the cluster on the new host; see the "Indirect restore procedure" in postgresql.

  8. You will also need to restore the file /etc/dsa/bacula-reader-database from backups (see "Getting files without a director", below), as that file is not (currently) managed through puppet (TODO). Alternatively, that file can be recreated by hand, using a syntax like this:

    user=bacula-dictyotum-reader password=X dbname=bacula host=localhost
    

    The matching user will need to have its password modified to match X, obviously:

    sudo -u postgres psql -c '\password bacula-dictyotum-reader'
    
  9. reset the password of the bacula database user to match the one Puppet put in the director configuration:

    grep dbpassword /etc/bacula/bacula-dir.conf | cut -f2 -d\"
    sudo -u postgres psql -c '\password bacula'
    

    same for the torbackup user:

    ssh bungei.torproject.org grep director /home/torbackup/.pgpass
    ssh bacula-director-01 -tt sudo -u postgres psql -c '\password bacula'
    
  10. copy over the pg_hba.conf and postgresql.conf (now conf.d/tor.conf) from the previous director cluster configuration (e.g. /var/lib/postgresql/9.6/main) to the new one (TODO: put in puppet). Make sure that:

    • the cluster name (e.g. main or bacula) is correct in the archive_command
    • the ssl_cert_file and ssl_key_file point to valid SSL certs
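
    A quick sanity check for those two points (assuming the new cluster configuration ends up under /etc/postgresql/9.6/main; adjust to the actual path and version):

    grep -rE 'archive_command|ssl_cert_file|ssl_key_file' /etc/postgresql/9.6/main/
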
  11. Once you have the postgres database cluster restored, start the director:

     systemctl start bacula-director
    
  12. Then everything should be fairies and magic and happiness all over again. Check that everything works with:

     bconsole
    

    Run a few of the "Basic commands" above, to make sure we have everything. For example, list jobs should show the latest jobs run on the director. It's normal that status director does not show those, however.

  13. Enable puppet on the director again.

    puppet agent --enable
    puppet agent -t
    

    This involves (optionally) keeping a lock on the scheduler so it doesn't start immediately. If you're confident (not tested!), this step might be skipped:

     flock -w 0 -e /usr/local/sbin/dsa-bacula-scheduler sleep infinity
    
  14. to switch a single node, configure its director in tor-puppet/hiera/nodes/$FQDN.yaml where $FQDN is the fully qualified domain name of the machine (e.g. tor-puppet/hiera/nodes/perdulce.torproject.org.yaml):

     bacula::client::director_server: 'bacula-director-01.torproject.org'
    

    Then run puppet on that node, the storage server, and the director server:

     ssh perdulce.torproject.org puppet agent -t
     ssh bungei.torproject.org puppet agent -t
     ssh bacula-director-01.torproject.org puppet agent -t
    

    Then test a backup job for that host: in bconsole, call run and pick that server, which should now show up.

  15. switch all nodes to the new director, in tor-puppet/hiera/common.yaml:

     bacula::client::director_server: 'bacula-director-01.torproject.org'
    
  16. run puppet everywhere (or wait for it to run):

    cumin -b 5 -p 0 -o txt '*' 'puppet agent -t'
    

    Then make sure the storage and director servers are also up to date:

     ssh bungei.torproject.org puppet agent -t
     ssh bacula-director-01.torproject.org puppet agent -t
    
  17. if you held a lock on the scheduler, it can be removed:

    killall sleep

  18. switch the nagios checks over to the new director: grep for the old director name in the nagios configuration and fix up some of the checks

    git -C tor-nagios grep dictyotum
    
  19. you will also need to restore the password file for the nagios check in /etc/nagios/bacula-database

  20. switch the director in /etc/dsa/bacula-reader-database or /etc/postgresql-common/pg_service.conf to point to the new host
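
    For the pg_service.conf route, the stanza from the Troubleshooting section below would gain a host entry pointing at the new director, something like (hypothetical values):

    [bacula]
    host=bacula-director-01.torproject.org
    dbname=bacula
    port=5433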

The new scheduler and director should now have completely taken over from the old one, and backups should resume. The old server can now be decommissioned, if it's still around, once you feel comfortable that the new setup is working.

TODO: 15:19:55 <weasel> and once that's up and running, it'd probably be smart to upgrade it to 11. pg_upgradecluster -m upgrade --link

TODO: some psql users still refer to host-specific usernames like bacula-dictyotum-reader, maybe they should refer to role-specific names instead?

Troubleshooting

If you get this error:

psycopg2.OperationalError: definition of service "bacula" not found

It's probably the scheduler failing to connect to the database server, because the /etc/dsa/bacula-reader-database refers to a non-existent "service", as defined in /etc/postgresql-common/pg_service.conf. Either add something like:

[bacula]
dbname=bacula
port=5433

to that file, or specify the dbname and port manually in the config file.

If the scheduler is sending you an email every three minutes with this error:

FileNotFoundError: [Errno 2] No such file or directory: '/etc/dsa/bacula-reader-database'

It's because you forgot to create that file, in step 8. Similar errors may occur if you forgot to change that password.

If the director takes a long time to start and ultimately fails with:

oct 10 18:19:41 bacula-director-01 bacula-dir[31276]: bacula-dir JobId 0: Fatal error: Could not open Catalog "MyCatalog", database "bacula".
oct 10 18:19:41 bacula-director-01 bacula-dir[31276]: bacula-dir JobId 0: Fatal error: postgresql.c:332 Unable to connect to PostgreSQL server. Database=bacula User=bac
oct 10 18:19:41 bacula-director-01 bacula-dir[31276]: Possible causes: SQL server not running; password incorrect; max_connections exceeded.

It's because you forgot to reset the director password, in step 9.

Get files without a director

If you want to get at files stored on the bacula storage host without involving the director, they can be accessed directly as well. Remember that to bacula everything is a tape, and /srv/backups/bacula is full of directories of tapes. You can see the contents of a tape using bls, that is, bls <file>, with a fully qualified filename, i.e. involving the full path. bls $(readlink -f <filename>) is a handy way to get that.

root@bungei:/srv/backups/bacula/dictyotum.torproject.org# bls `readlink -f torproject-inc-dictyotum.torproject.org.2019-09-25_11:53` | head
bls: butil.c:292-0 Using device: "/srv/backups/bacula/dictyotum.torproject.org" for reading.
25-Sep 13:48 bls JobId 0: Ready to read from volume "torproject-inc-dictyotum.torproject.org.2019-09-25_11:53" on File device "FileStorage-dictyotum.torproject.org" (/srv/backups/bacula/dictyotum.torproject.org).
bls JobId 0: drwxr-xr-x   4 root     root                   1024 2019-09-07 17:01:03  /boot/
bls JobId 0: drwxr-xr-x  24 root     root                    800 2019-09-25 11:33:53  /run/
bls JobId 0: -rw-r--r--   1 root     root                  12288 2019-09-25 11:51:17  /etc/postfix/debian.db
bls JobId 0: -rw-r--r--   1 root     root                   4732 2019-09-25 11:51:17  /etc/postfix/debian
bls JobId 0: -r--r--r--   1 root     root                  28161 2019-09-25 00:55:50  /etc/ssl/torproject-auto/crls/ca.crl
...

You can then extract files from there with bextract:

bextract /srv/backups/bacula/dictyotum.torproject.org/torproject-inc-dictyotum.torproject.org.2019-09-25_11:53 /var/tmp/restore

This will extract the entire tape to /var/tmp/restore. If you want only a few files, put their names into a file such as include and call bextract with -i:

bextract -i ~/include /srv/backups/bacula/dictyotum.torproject.org/torproject-inc-dictyotum.torproject.org.2019-09-25_11:53 /var/tmp/restore
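
The include file is simply a list of paths as they appear on the volume, one per line; hypothetical example contents:

/etc/passwd
/etc/ssh/sshd_config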

Restore PostgreSQL databases

See postgresql for restore instructions on PostgreSQL databases.

Restore MySQL databases

MySQL restoration should be fairly straightforward. Install MySQL:

apt install mysql-server

Load each database dump:

for dump in 20190812-220301-mysql.xz 20190812-220301-torcrm_prod.xz; do
    # dumps are xz-compressed, so decompress on the fly
    xzcat /var/backups/local/mysql/$dump | mysql
done

Restore LDAP databases

See ldap for LDAP-specific procedures.

Debug jobs

If a job is behaving strangely, you can inspect its job log to see what's going on. For example, today Nagios warned about the backups being too old on colchicifolium:

10:02:58 <nsa> tor-nagios: [colchicifolium] backup - bacula - last full backup is WARNING: WARN: Last backup of colchicifolium.torproject.org/F is 45.16 days old.

Looking at the bacula director status, it says this:

Console connected using TLS at 10-Jan-20 18:19
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
120225  Back Full    833,079    123.5 G colchicifolium.torproject.org is running
120230  Back Full  4,864,515    218.5 G colchicifolium.torproject.org is waiting on max Client jobs
120468  Back Diff     30,694    3.353 G gitlab-01.torproject.org is running
====

Which is strange because those JobId numbers are very low compared to (say) the gitlab backup job. To inspect the job log, you use the list command:

*list joblog jobid=120225
+----------------------------------------------------------------------------------------------------+
| logtext                                                                                              |
+----------------------------------------------------------------------------------------------------+
| bacula-director-01.torproject.org-dir JobId 120225: Start Backup JobId 120225, Job=colchicifolium.torproject.org.2020-01-07_17.00.36_03 |
| bacula-director-01.torproject.org-dir JobId 120225: Created new Volume="torproject-colchicifolium.torproject.org-full.2020-01-07_17:00", Pool="poolfull-torproject-colchicifolium.torproject.org", MediaType="File-colchicifolium.torproject.org" in catalog. |
[...]
| bacula-director-01.torproject.org-dir JobId 120225: Fatal error: Network error with FD during Backup: ERR=No data available |
| bungei.torproject.org-sd JobId 120225: Fatal error: append.c:170 Error reading data header from FD. n=-2 msglen=0 ERR=No data available |
| bungei.torproject.org-sd JobId 120225: Elapsed time=00:03:47, Transfer rate=7.902 M Bytes/second     |
| bungei.torproject.org-sd JobId 120225: Sending spooled attrs to the Director. Despooling 14,523,001 bytes ... |
| bungei.torproject.org-sd JobId 120225: Fatal error: fd_cmds.c:225 Command error with FD msg="", SD hanging up. ERR=Error getting Volume info: 1998 Volume "torproject-colchicifolium.torproject.org-full.2020-01-07_17:00" catalog status is Used, but should be Append, Purged or Recycle. |
| bacula-director-01.torproject.org-dir JobId 120225: Fatal error: No Job status returned from FD.     |
[...]
| bacula-director-01.torproject.org-dir JobId 120225: Rescheduled Job colchicifolium.torproject.org.2020-01-07_17.00.36_03 at 07-Jan-2020 17:09 to re-run in 14400 seconds (07-Jan-2020 21:09). |
| bacula-director-01.torproject.org-dir JobId 120225: Error: openssl.c:68 TLS shutdown failure.: ERR=error:14094123:SSL routines:ssl3_read_bytes:application data after close notify |
| bacula-director-01.torproject.org-dir JobId 120225: Job colchicifolium.torproject.org.2020-01-07_17.00.36_03 waiting 14400 seconds for scheduled start time. |
| bacula-director-01.torproject.org-dir JobId 120225: Restart Incomplete Backup JobId 120225, Job=colchicifolium.torproject.org.2020-01-07_17.00.36_03 |
| bacula-director-01.torproject.org-dir JobId 120225: Found 78113 files from prior incomplete Job.     |
| bacula-director-01.torproject.org-dir JobId 120225: Created new Volume="torproject-colchicifolium.torproject.org-full.2020-01-10_12:11", Pool="poolfull-torproject-colchicifolium.torproject.org", MediaType="File-colchicifolium.torproject.org" in catalog. |
| bacula-director-01.torproject.org-dir JobId 120225: Using Device "FileStorage-colchicifolium.torproject.org" to write. |
| bacula-director-01.torproject.org-dir JobId 120225: Sending Accurate information to the FD.          |
| bungei.torproject.org-sd JobId 120225: Labeled new Volume "torproject-colchicifolium.torproject.org-full.2020-01-10_12:11" on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org). |
| bungei.torproject.org-sd JobId 120225: Wrote label to prelabeled Volume "torproject-colchicifolium.torproject.org-full.2020-01-10_12:11" on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org) |
| bacula-director-01.torproject.org-dir JobId 120225: Max Volume jobs=1 exceeded. Marking Volume "torproject-colchicifolium.torproject.org-full.2020-01-10_12:11" as Used. |
| colchicifolium.torproject.org-fd JobId 120225:      /run is a different filesystem. Will not descend from / into it. |
| colchicifolium.torproject.org-fd JobId 120225:      /home is a different filesystem. Will not descend from / into it. |
+----------------------------------------------------------------------------------------------------+
+---------+-------------------------------+---------------------+------+-------+----------+---------------+-----------+
| jobid   | name                          | starttime           | type | level | jobfiles | jobbytes      | jobstatus |
+---------+-------------------------------+---------------------+------+-------+----------+---------------+-----------+
| 120,225 | colchicifolium.torproject.org | 2020-01-10 12:11:51 | B    | F     |   77,851 | 1,759,625,288 | R         |
+---------+-------------------------------+---------------------+------+-------+----------+---------------+-----------+

So that job failed three days ago, but now it's actually running. In this case, it might be safe to just ignore the Nagios warning and hope that the rescheduled backup will eventually go through. The duplicate job is also fine: worst case, it will just run after the first one does, resulting in a bit more I/O than we'd like.

Design

This section documents how backups are set up at Tor. It should be useful if you wish to recreate or understand the architecture.

Backups are configured automatically by Puppet on all nodes, and use Bacula with TLS encryption over the wire.

Backups are pulled from machines to the backup server, which means a compromise on a machine shouldn't allow an attacker to delete backups from the backup server.

Bacula splits the different responsibilities of the backup system among multiple components, namely:

  • storage daemon (bacula::storage in Puppet, currently bungei)
  • director (bacula::director in Puppet, currently bacula-director-01, PostgreSQL configured by hand)
  • file daemon (bacula::client, on all nodes)

In our configuration, the Admin workstation, Database server, and Backup server are all on the same machine, the bacula::director.

Volumes are stored in the storage daemon, in /srv/backups/bacula/. Each client stores its volumes in a separate directory, which makes it easier to purge offline clients and evaluate disk usage.
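
For example, to get a rough idea of per-client disk usage on the storage daemon:

du -sh /srv/backups/bacula/*/ | sort -h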

We do not have a bootstrap file as advised by the upstream documentation because we do not use tapes or tape libraries, which make it harder to find volumes. Instead, our catalog is backed up in /srv/backups/bacula/Catalog and each backup contains a single file, the compressed database dump, which is sufficient to re-bootstrap the director.
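
In a pinch, that dump can be pulled straight off the storage daemon with bextract, as described in "Get files without a director" above; a sketch, where <volume-name> is whatever ls shows in that directory:

ls /srv/backups/bacula/Catalog/
bextract /srv/backups/bacula/Catalog/<volume-name> /var/tmp/catalog-restore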

See the introduction to Bacula for more information on those distinctions.

PostgreSQL backup system

Database backups are handled specially. We use PostgreSQL (postgres) everywhere apart from a few rare exceptions (currently only CiviCRM) and therefore use postgres-specific configurations to do backups of all our servers.

See postgresql for that server's specific backup/restore instructions.

MySQL backup system

MySQL also requires special handling, and it's done in the mariadb::server Puppet class. It deploys a script (backup-mysql) which runs every hour and calls mysqldump to store plaintext copies of all databases in /var/backups/local/mysql.

It also stores a hard link to each backup file, named after the file's SHA256 checksum, for example:

1184448 -rw-r----- 2 root    154820 aug 12 21:03 SHA256-665fac68c0537eda149b22445fb8bca1985ee96eb5f145019987bdf398be33e7
1184448 -rw-r----- 2 root    154820 aug 12 21:03 20190812-210301-mysql.xz

Those both point to the same file, inode 1184448.
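
Since the checksum is encoded in the hard link's name, the dumps can be verified with something like this sketch:

cd /var/backups/local/mysql
for f in SHA256-*; do
    echo "${f#SHA256-}  $f" | sha256sum -c -
done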

Those backups then get included in the normal Bacula backups.

References