rsync is some powerful voodoo. It stands for Remote Synchronization. The idea is that if you have some files in one place and you want them replicated in another place, rsync will do this in the most efficient way possible. I usually use rsync over an SSH connection (handled automatically by rsync) but there are other ways too. The SSH connection ensures security during transport using known and trusted technology.
If rsync isn’t sufficient for your needs, this catalog of data moving technology is very nice. (Note the asciidoc presentation, obviously a pro!)
Using rsync
rsync -P -v -a -e ssh /path/dirtocopy [/path/moretocopy] xed.ucsd.edu:~/dest
Note that /path/dirtocopy creates a dirtocopy directory on the remote system while /path/dirtocopy/ just copies its contents.
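The trailing-slash distinction is easy to test since rsync happily copies between local paths too. Here is a local sketch (the scratch directories under /tmp are my own invention):

```shell
# Set up a small source tree plus two empty destinations.
mkdir -p /tmp/slashdemo/src/dirtocopy /tmp/slashdemo/a /tmp/slashdemo/b
touch /tmp/slashdemo/src/dirtocopy/file.txt

# Without the trailing slash: the directory itself appears at the destination.
rsync -a /tmp/slashdemo/src/dirtocopy /tmp/slashdemo/a/

# With the trailing slash: only the contents are copied.
rsync -a /tmp/slashdemo/src/dirtocopy/ /tmp/slashdemo/b/

ls /tmp/slashdemo/a   # dirtocopy
ls /tmp/slashdemo/b   # file.txt
```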
Sometimes you don’t want to saturate your network. To limit transfers to 15kB/sec:
--bwlimit 15
What if you want to make a backup to a machine where you have full sudo access but no direct connection as root (e.g. Ubuntu or a Mac)?
--rsync-path='sudo rsync'
As in:
sudo rsync --rsync-path="sudo rsync" -aP /home/ xed@xed.ucsd.edu:raven-backup/
See sudo notes for more details if there’s an error.
Automatic Secure Transfers With rsync and SSH Keys
One of the nice things about rsync is that when using SSH keys you can design a setup that will very securely do unattended file transfers. This is useful for nightly backups or offloading logs or video captures.
The first step in setting this up is getting your SSH keys all sorted out. There is some flexibility/complexity regarding the particular strategy, but I’m going to describe a situation where a backup server pulls data off of a main repository somewhere. There will be two machines, main and back. The idea will be to leave main alone except for when it’s time to get data, and do as much of the setup and initiation on back. That said, main still has to be prepared to receive and comply with the request that back will be making.
Establish Key Pair
The first step is to set up an SSH key pair so that back can log into main. This may be a security problem in general, but later, once things are working, we’ll restrict what this key pair can do so that it will only apply to back performing this particular backup task.

On back do the following, entering a name (I used main-puller_rsa) and just hitting enter at the password prompt:
[back][~]# cd ~/.ssh
[back][~/.ssh]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): main-puller_rsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in main-puller_rsa.
Your public key has been saved in main-puller_rsa.pub.
The key fingerprint is:
97:eb:40:97:2b:7a:a9:4d:7a:99:91:88:61:c6:b9:2f root@back
The key's randomart image is:
+--[ RSA 2048]----+
| .++.o. |
| oo.o. |
| . + .. |
| . o o.. |
| + o.E . |
| o =. . |
| +.. |
| *. |
| o.. |
+-----------------+
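If you’d rather skip the interactive dialog (in a provisioning script, say), ssh-keygen can take the file name and empty passphrase as flags. A sketch, writing into a scratch directory so it can’t clobber a real key:

```shell
# Non-interactive equivalent of the dialog above: -f names the key file,
# -N '' sets an empty passphrase, -C sets the comment, -q quiets the output.
# For real use you'd point -f at ~/.ssh/main-puller_rsa instead.
mkdir -p /tmp/keydemo
rm -f /tmp/keydemo/main-puller_rsa /tmp/keydemo/main-puller_rsa.pub
ssh-keygen -t rsa -f /tmp/keydemo/main-puller_rsa -N '' -C root@back -q
ls /tmp/keydemo
```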
Warning: Now you have an unencrypted key pair on back. Don’t let that private key wander off into an insecure situation!
Installing Public Key
Now you have to put the public key on the target machine. There are ways to do this using cut and paste, but this should be unambiguous:
[back][~]# cat ~/.ssh/main-puller_rsa.pub | ssh main "cat >> ~/.ssh/authorized_keys"
This appended your newly created key (the public part) to the authorized keys file.
Note: If the authorized_keys file didn’t exist before, it needs to have restrictive permissions to work (so that bad dudes don’t mess with it). Doing something like chmod 600 ~/.ssh/authorized_keys (on host main) should do the trick.
Test to make sure that the key pair works. You need to specify the key you want to use explicitly like so:
[back][~]# ssh -i ~/.ssh/main-puller_rsa main
Last login: Mon Oct 10 11:23:36 PDT 2011 from back on ssh
[main][~]#
Cool. Now we have a way to make easy SSH connections. Next is making that connection a lot less easy for everything but the backup mission.
Restricting SSH Key To Limited Functionality
The hardest thing about this process is knowing exactly what command you want the main host to run. In this example, it’s some kind of rsync but that’s not good enough. We need to know exactly what the command is. The way to find this out is to create a small script that simply dumps out the exact command as submitted by the SSH client (on back in this case). The way we intercept our client’s command is by using the command= key directive. By putting this phrase followed by the command you want to run when that key makes a connection, you can control what the key pair does. It turns out that it doesn’t matter what the client asked for. If a client makes a connection using a key pair with a command= directive, that command will be run. To answer our question, we create this temporary script and make it the command that is run:
#!/bin/sh
# Put this in front of the key of interest in .ssh/authorized_keys:
# command="/path/test_ssh_cmd" ssh-rsa AAAAB3.....x4sBbn62w6sISw== xed@xedshost
echo "$SSH_ORIGINAL_COMMAND" >> sshcmdlog
exec $SSH_ORIGINAL_COMMAND
Notice that all this script does is take the variable $SSH_ORIGINAL_COMMAND, append it to a file called sshcmdlog, and then run the submitted command.
Here’s a test command:
:-> [back][~/.ssh]# ssh -i ~/.ssh/main-puller_rsa main cal 1 2012
January 2012
Su Mo Tu We Th Fr Sa
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31
Seems that it ran that normally on main from back. But over on main here is the setup and the result:
[main][~]# cat ~/test_ssh_cmd
echo "$SSH_ORIGINAL_COMMAND" >> ~/sshcmdlog
exec $SSH_ORIGINAL_COMMAND
[main][~]# tail -n1 ~/.ssh/authorized_keys
command="~/test_ssh_cmd" ssh-rsa AAAA....etc....AhySEWf9 root@back
[main][~]# cat ~/sshcmdlog
cal 1 2012
Notice that we were able to capture the exact command sent by the client as the server saw it. In this case, the command was simple and I could have just guessed that directly, but you’ll see that with rsync commands it can get tricky and it’s best not to spend all day guessing what it’s receiving.
Now I run a test of the real thing on back:
[back][~]# rsync --rsh='ssh -i /root/.ssh/main-puller_rsa' -aP --del main:/files/users/xed /raid/users/
Check what turned up in sshcmdlog:
[main][~]# cat sshcmdlog
cal 1 2012
rsync --server --sender -vlogDtpre.iLsf . /files/users/xed
Now you can see why this is so hard to get right by guessing. I didn’t submit the command in this form, but the server interpreted it as such. This is how you have to set the key on the SSH server machine so it will only honor these jobs.
This is the final form of the command directive in the SSH key in the authorized_keys file on main:
:-> [main][~]# tail -n1 ~/.ssh/authorized_keys
command="rsync --server --sender -vlogDtpre.iLsf . /files/users/xed" ssh-rsa AAAAB3W...ETC...SEWf9 root@back
Note: In the example, I used the rsync flag -P which is a variant of verbose output which is nice for humans. In real life, you probably want to have an unattended backup job be silent. Your command string specified in the authorized_keys file must correspond with that exactly.
Note: Andrey believes that in recent versions of rsync, --progress doesn’t imply --verbose (which is the same as specifying --info=flist2,name,progress). Thus -aPv is not redundant, which may have been true in the past.
Restricting SSH Key To Limited Hosts
Now this unencrypted key pair really can’t do anything but make a backup. But just to limit mischief, we can further restrict the operation of that key to specific hosts. This means that if the credentials are somehow stolen, a backup won’t be made to some unknown IP number. Here’s how to implement that:
:-> [main][~]# tail -n1 ~/.ssh/authorized_keys
from="back",command="rsync --server --sender -vlogDtpre.iLsf . /files/users/xed" ssh-rsa AAAAB...ETC...hySEWf9 root@back
Note that our host here is back but it can be an IP number or even wild card values (so I’ve heard).
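The authorized_keys format supports more restriction options than from= and command=. This variant (same key line as above, with options I’ve added) also turns off pty allocation and all forwarding, none of which a backup job needs:

```
from="back",no-pty,no-agent-forwarding,no-port-forwarding,no-X11-forwarding,command="rsync --server --sender -vlogDtpre.iLsf . /files/users/xed" ssh-rsa AAAAB...ETC...hySEWf9 root@back
```

See the AUTHORIZED_KEYS FILE FORMAT section of the sshd man page for the full list.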
Automating
Now that you have set things up so your backup server can directly make a legitimate rsync backup job using SSH, you can automate that in a cron job. Just run crontab -e and put the rsync command in a cron job (in my example situation, this is done on back, the requesting client). This will do my example every night at one in the morning:
0 1 * * * rsync --rsh='ssh -i /root/.ssh/main-puller_rsa' -aP --del main:/files/users/xed /raid/users/
(see cron help)
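One refinement worth considering (my addition, not part of the setup above): if a night’s transfer runs longer than 24 hours, cron will happily start a second rsync on top of it. Wrapping the command in flock -n makes an overlapping run skip instead. A local sketch of the pattern:

```shell
mkdir -p /tmp/crondemo
# In the crontab you'd write something like (paths as in the example above):
#   0 1 * * * flock -n /var/lock/main-backup.lock rsync --rsh='ssh -i /root/.ssh/main-puller_rsa' -a --del main:/files/users/xed /raid/users/
# Local demonstration: when the lock file is free, flock runs the wrapped
# command; with -n a second attempt would exit immediately instead of queuing.
flock -n /tmp/crondemo/backup.lock sh -c 'echo backup would run here' > /tmp/crondemo/out
cat /tmp/crondemo/out
```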
Wait, Actually I Can’t Use Keys
I’ve had the very odd situation where there is a key installed that you don’t want to use, but SSH insists on using it. Here’s the answer.
sudo rsync \
--rsh='ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no' \
-rP xed@example.edu:/local/xed/data/dw .
Shortcomings of rsync
Nonsensical Unnecessary Transfers
Don’t get me wrong, rsync is pretty awesome. IMO it’s one of the best pieces of software ever written. However, sometimes it’s not the right thing to do. Imagine you had a directory like this.
/top/important_projectA
You make a backup of /top. Later you start some other important projects and realize this would make more sense.
/top/important/projectA
Even if projectA is a huge filesystem, it might only take milliseconds to reorganize it on the primary system. However the backup could be very painful indeed. It could temporarily require double the space during the transfer (--delete-after) or temporarily leave you without a backup (--delete-during, aka --del).
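To make the failure mode concrete, here is a small local experiment (scratch paths of my choosing; rsync works fine between local directories) showing that after the rename, rsync deletes the old copy and retransmits everything at the new location:

```shell
# Primary tree with one "project", plus a backup destination.
mkdir -p /tmp/movedemo/top/important_projectA /tmp/movedemo/backup
echo data > /tmp/movedemo/top/important_projectA/big.file

# Initial backup.
rsync -a --del /tmp/movedemo/top/ /tmp/movedemo/backup/

# Reorganize on the primary: nearly instant, it's just a rename.
mkdir -p /tmp/movedemo/top/important
mv /tmp/movedemo/top/important_projectA /tmp/movedemo/top/important/projectA

# rsync has no idea this is the same data; it deletes the old copy from the
# backup and transfers every byte again to the new location.
rsync -a --del /tmp/movedemo/top/ /tmp/movedemo/backup/
ls /tmp/movedemo/backup/important
```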
Trusting The OS Too Much
Let’s say you have 50 video files on your SSD hard drive and you want to put them on a USB drive. If each of the files is 1GB and the USB drive is 64GB, everything should be fine. But what happens is rsync plans the transfer and then gets started by asking the OS to do system calls related to the transfer. Rsync says something like "Hey Linux, I’ve got this file that needs to go over here; I’m going to read it and then write it over there, ok?" And the OS says, "Sure, go for it." Rsync does that and the OS immediately comes back with, "OK, done, what else do you have?" Rsync says, "Hang on, that was pretty quick! Are you sure you got all that?" And Linux says, "Oh ya totally." But here’s the thing — Linux is lying. Not in a bad way, just in a kind of overachiever optimistic way. What happens is that if you have a lot of RAM, Linux will answer the system calls to read the file and just stuff it in RAM in preparation for writing it to its final target. But what happens is that this breaks down quickly if the target writing is very slow, for example, as is the case on many USB drives.
Unfortunately rsync does not have an oflag=sync setting like the dd command which ensures writes are actually written. You’ll have to orchestrate moves like this yourself.
for F in /path/to/source/*; do
rsync -aP --inplace "$F" /path/to/destination/
sync # Make sure all writes are written.
done
Alternatives to rsync
The SSH method:
Using ssh to transfer a filesystem:
tar -cjf - mp3/ | ssh -C -o "CompressionLevel=9" xed.ucsd.edu tar -C /target -xjf -
(see tar help)
lftp
Here’s a way to use lftp to get something from a server with it prompting the user for a password.
FTP:
$ lftp -e "get public_html/x.ico;exit" xed@example.xed.ch
HTTP:
$ lftp -e "get location/index.html;exit" -u xed http://example.xed.ch
lftp can also do a mirror. How about some Gene Ontologies? Here’s a little mirroring script showing a way it can be done.
HOST=ftp://ftp.ebi.ac.uk/pub/databases/
X='' # Exclusions
X="${X} -x goa_uniprot"
X="${X} -x old"
lftp -e "mirror ${X} GO ; exit" ${HOST}
Unison
Want to have a topology nightmare? Unison might be just what you need. Unison allows changes on either of two file systems which can then be reconciled. Imagine that you have a laptop and a desktop, L and D respectively. You synchronize your file systems somehow so they are the same. Then you delete L:temp and add L:new. Over on D you create D:different_new and delete D:bad. When you run Unison, in theory, it will delete D:temp and L:bad while adding D:new and L:different_new. That’s already a wee bit confusing for me, but now imagine that you get a second laptop, L2, and you sync to that from L. Ok. Now what if you try syncing L2 to D? You’ve got a graph loop. You need to keep a star topology. Fun! Enjoy!
rdiff-backup
You want snapshots? You’re jealous of those fruity computer people going on about "Time Machine"? Well that functionality has been around for a long time in the free software world. Check out rdiff-backup which can take bandwidth efficient snapshots of the differences in a file system allowing you to reconstruct its state at any other point a snapshot was made. If you run this in a cron job every night you will be able to go back to a file system state from any day. This is very handy for certain kinds of backups. The downside is that if you deal with large temporary files (video editing, let’s say), rdiff-backup won’t really delete those but just make a note of the fact that they were deleted. This can be circumvented a bit, but generally rdiff-backup takes up more room than a straight mirror.