rsync is some powerful voodoo. It stands for Remote Synchronization. The idea is that if you have some files in one place and you want them replicated in another place, rsync will do this in the most efficient way possible. I usually use rsync over an SSH connection (handled automatically by rsync) but there are other ways too. The SSH connection ensures security during transport using known and trusted technology.
If rsync isn’t sufficient for your needs, this catalog of data moving technology is very nice. (Note the asciidoc presentation, obviously a pro!)
Using rsync
rsync -P -v -a -e ssh /path/dirtocopy [/path/moretocopy] xed.ucsd.edu:~/dest
Note that /path/dirtocopy creates a dirtocopy directory on the remote system while /path/dirtocopy/ just copies its contents.
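The trailing-slash distinction is easy to test since rsync happily copies between local paths too. Here is a local sketch (the scratch directories under /tmp are my own invention):

```shell
# Set up a small source tree plus two empty destinations.
mkdir -p /tmp/slashdemo/src/dirtocopy /tmp/slashdemo/a /tmp/slashdemo/b
touch /tmp/slashdemo/src/dirtocopy/file.txt

# Without the trailing slash: the directory itself appears at the destination.
rsync -a /tmp/slashdemo/src/dirtocopy /tmp/slashdemo/a/

# With the trailing slash: only the contents are copied.
rsync -a /tmp/slashdemo/src/dirtocopy/ /tmp/slashdemo/b/

ls /tmp/slashdemo/a   # dirtocopy
ls /tmp/slashdemo/b   # file.txt
```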
Sometimes you don’t want to saturate your network. To limit transfers to 15kB/sec:
--bwlimit 15
What if you want to make a backup to a machine where you have full sudo access but no direct connection as root (e.g. Ubuntu or a Mac)?
--rsync-path='sudo rsync'
As in:
sudo rsync --rsync-path="sudo rsync" -aP /home/ xed@xed.ucsd.edu:raven-backup/
See sudo notes for more details if there’s an error.
Automatic Secure Transfers With rsync and SSH Keys
One of the nice things about rsync is that when using SSH keys you can design a setup that will very securely do unattended file transfers. This is useful for nightly backups or offloading logs or video captures.
The first step in setting this up is getting your SSH keys all sorted out. There is some flexibility/complexity regarding the particular strategy, but I’m going to describe a situation where a backup server pulls data off of a main repository somewhere. There will be two machines, main and back. The idea will be to leave main alone except for when it’s time to get data, and do as much of the setup and initiation on back. That said, main still has to be prepared to receive and comply with the request that back will be making.
Establish Key Pair
The first step is to set up an SSH key pair so that back can log into main. This may be a security problem in general, but later, once things are working, we’ll restrict what this key pair can do so that it will only apply to back performing this particular backup task.

On back do the following, entering a name (I used main-puller_rsa) and just hitting enter at the password prompt:
[back][~]# cd ~/.ssh
[back][~/.ssh]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): main-puller_rsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in main-puller_rsa.
Your public key has been saved in main-puller_rsa.pub.
The key fingerprint is:
97:eb:40:97:2b:7a:a9:4d:7a:99:91:88:61:c6:b9:2f root@back
The key's randomart image is:
+--[ RSA 2048]----+
| .++.o. |
| oo.o. |
| . + .. |
| . o o.. |
| + o.E . |
| o =. . |
| +.. |
| *. |
| o.. |
+-----------------+
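If you’d rather skip the interactive dialog (in a provisioning script, say), ssh-keygen can take the file name and empty passphrase as flags. A sketch, writing into a scratch directory so it can’t clobber a real key:

```shell
# Non-interactive equivalent of the dialog above: -f names the key file,
# -N '' sets an empty passphrase, -C sets the comment, -q quiets the output.
# For real use you'd point -f at ~/.ssh/main-puller_rsa instead.
mkdir -p /tmp/keydemo
rm -f /tmp/keydemo/main-puller_rsa /tmp/keydemo/main-puller_rsa.pub
ssh-keygen -t rsa -f /tmp/keydemo/main-puller_rsa -N '' -C root@back -q
ls /tmp/keydemo
```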
Warning: Now you have an unencrypted key pair on back. Don’t let that private key wander off into an insecure situation!
Installing Public Key
Now you have to put the public key on the target machine. There are ways to do this using cut and paste, but this should be unambiguous:
[back][~]# cat ~/.ssh/main-puller_rsa.pub | ssh main "cat >> ~/.ssh/authorized_keys"
This appended your newly created key (the public part) to the authorized keys file.
Note: If the authorized_keys file didn’t exist before, it needs to have restrictive permissions to work (so that bad dudes don’t mess with it). Doing something like chmod 600 ~/.ssh/authorized_keys (on host main) should do the trick.
Test to make sure that the key pair works. You need to specify the key you want to use explicitly like so:
[back][~]# ssh -i ~/.ssh/main-puller_rsa main
Last login: Mon Oct 10 11:23:36 PDT 2011 from back on ssh
[main][~]#
Cool. Now we have a way to make easy SSH connections. Next is making that connection a lot less easy for everything but the backup mission.
Restricting SSH Key To Limited Functionality
The hardest thing about this process is knowing exactly what command you want the main host to run. In this example, it’s some kind of rsync but that’s not good enough. We need to know exactly what the command is. The way to find this out is to create a small script that simply dumps out the exact command as submitted by the SSH client (on back in this case). The way we intercept our client’s command is by using the command= key directive. By putting this phrase followed by the command you want to run when that key makes a connection, you can control what the key pair does. It turns out that it doesn’t matter what the client asked for. If a client makes a connection using a key pair with a command= directive, that command will be run. To answer our question, we create this temporary script and make it the command that is run:
#!/bin/sh
# Put this in front of the key of interest in .ssh/authorized_keys:
# command="/path/test_ssh_cmd" ssh-rsa AAAAB3.....x4sBbn62w6sISw== xed@xedshost
echo "$SSH_ORIGINAL_COMMAND" >> sshcmdlog
exec $SSH_ORIGINAL_COMMAND
Notice that all this script does is take the variable $SSH_ORIGINAL_COMMAND, append it to a file called sshcmdlog, and then run the submitted command.
Here’s a test command:
:-> [back][~/.ssh]# ssh -i ~/.ssh/main-puller_rsa main cal 1 2012
January 2012
Su Mo Tu We Th Fr Sa
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31
Seems that it ran that normally on main from back. But over on main here is the setup and the result:
[main][~]# cat ~/test_ssh_cmd
echo "$SSH_ORIGINAL_COMMAND" >> ~/sshcmdlog
exec $SSH_ORIGINAL_COMMAND
[main][~]# tail -n1 ~/.ssh/authorized_keys
command="~/test_ssh_cmd" ssh-rsa AAAA....etc....AhySEWf9 root@back
[main][~]# cat ~/sshcmdlog
cal 1 2012
Notice that we were able to capture the exact command sent by the client as the server saw it. In this case, the command was simple and I could have just guessed that directly, but you’ll see that with rsync commands it can get tricky and it’s best not to spend all day guessing what it’s receiving.
Now I run a test of the real thing on back:
[back][~]# rsync --rsh='ssh -i /root/.ssh/main-puller_rsa' -aP --del main:/files/users/xed /raid/users/
Check what turned up in sshcmdlog:
[main][~]# cat sshcmdlog
cal 1 2012
rsync --server --sender -vlogDtpre.iLsf . /files/users/xed
Now you can see why this is so hard to get right by guessing. I didn’t submit the command in this form, but the server interpreted it as such. This is how you have to set the key on the SSH server machine so it will only honor these jobs.
This is the final form of the command directive in the SSH key in the authorized_keys file on main:
:-> [main][~]# tail -n1 ~/.ssh/authorized_keys
command="rsync --server --sender -vlogDtpre.iLsf . /files/users/xed" ssh-rsa AAAAB3W...ETC...SEWf9 root@back
Note: In the example, I used the rsync flag -P which is a variant of verbose output which is nice for humans. In real life, you probably want to have an unattended backup job be silent. Your command string specified in the authorized_keys file must correspond with that exactly.
Note: Andrey believes that in recent versions of rsync, --progress doesn’t imply --verbose (which is the same as specifying --info=flist2,name,progress). Thus -aPv is not redundant, which may have been true in the past.
Restricting SSH Key To Limited Hosts
Now this unencrypted key pair really can’t do anything but make a backup. But just to limit mischief, we can further restrict the operation of that key to specific hosts. This means that if the credentials are somehow stolen, a backup won’t be made to some unknown IP number. Here’s how to implement that:
:-> [main][~]# tail -n1 ~/.ssh/authorized_keys
from="back",command="rsync --server --sender -vlogDtpre.iLsf . /files/users/xed" ssh-rsa AAAAB...ETC...hySEWf9 root@back
Note that our host here is back but it can be an IP number or even wild card values (so I’ve heard).
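The authorized_keys format supports more restriction options than from= and command=. This variant (same key line as above, with options I’ve added) also turns off pty allocation and all forwarding, none of which a backup job needs:

```
from="back",no-pty,no-agent-forwarding,no-port-forwarding,no-X11-forwarding,command="rsync --server --sender -vlogDtpre.iLsf . /files/users/xed" ssh-rsa AAAAB...ETC...hySEWf9 root@back
```

See the AUTHORIZED_KEYS FILE FORMAT section of the sshd man page for the full list.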
Automating
Now that you have set things up so your backup server can directly make a legitimate rsync backup job using SSH, you can automate that in a cron job. Just run crontab -e and put the rsync command in a cron job (in my example situation, this is done on back, the requesting client). This will do my example every night at one in the morning:
0 1 * * * rsync --rsh='ssh -i /root/.ssh/main-puller_rsa' -aP --del main:/files/users/xed /raid/users/
(see cron help)
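One refinement worth considering (my addition, not part of the setup above): if a night’s transfer runs longer than 24 hours, cron will happily start a second rsync on top of it. Wrapping the command in flock -n makes an overlapping run skip instead. A local sketch of the pattern:

```shell
mkdir -p /tmp/crondemo
# In the crontab you'd write something like (paths as in the example above):
#   0 1 * * * flock -n /var/lock/main-backup.lock rsync --rsh='ssh -i /root/.ssh/main-puller_rsa' -a --del main:/files/users/xed /raid/users/
# Local demonstration: when the lock file is free, flock runs the wrapped
# command; with -n a second attempt would exit immediately instead of queuing.
flock -n /tmp/crondemo/backup.lock sh -c 'echo backup would run here' > /tmp/crondemo/out
cat /tmp/crondemo/out
```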
Wait, Actually I Can’t Use Keys
I’ve had the very odd situation where there is a key installed that you don’t want to use, but SSH insists on using it. Here’s the answer.
sudo rsync \
--rsh='ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no' \
-rP xed@example.edu:/local/xed/data/dw .
Shortcomings of rsync
Nonsensical Unnecessary Transfers
Don’t get me wrong, rsync is pretty awesome. IMO it’s one of the best pieces of software ever written. However, sometimes it’s not the right thing to do. Imagine you had a directory like this.
/top/important_projectA
You make a backup of /top. Later you start some other important projects and realize this would make more sense.
/top/important/projectA
Even if projectA is a huge filesystem, it might only take milliseconds to reorganize it on the primary system. However the backup could be very painful indeed. It could temporarily require double the space during the transfer (--delete-after) or temporarily leave you without a backup (--delete-during, aka --del).
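To make the failure mode concrete, here is a small local experiment (scratch paths of my choosing; rsync works fine between local directories) showing that after the rename, rsync deletes the old copy and retransmits everything at the new location:

```shell
# Primary tree with one "project", plus a backup destination.
mkdir -p /tmp/movedemo/top/important_projectA /tmp/movedemo/backup
echo data > /tmp/movedemo/top/important_projectA/big.file

# Initial backup.
rsync -a --del /tmp/movedemo/top/ /tmp/movedemo/backup/

# Reorganize on the primary: nearly instant, it's just a rename.
mkdir -p /tmp/movedemo/top/important
mv /tmp/movedemo/top/important_projectA /tmp/movedemo/top/important/projectA

# rsync has no idea this is the same data; it deletes the old copy from the
# backup and transfers every byte again to the new location.
rsync -a --del /tmp/movedemo/top/ /tmp/movedemo/backup/
ls /tmp/movedemo/backup/important
```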
Trusting The OS Too Much
Let’s say you have 50 video files on your SSD hard drive and you want to put them on a USB drive. If each of the files is 1GB and the USB drive is 64GB, everything should be fine. But what happens is rsync plans the transfer and then gets started by asking the OS to do system calls related to the transfer. Rsync says something like "Hey Linux, I’ve got this file that needs to go over here; I’m going to read it and then write it over there, ok?" And the OS says, "Sure, go for it." Rsync does that and the OS immediately comes back with, "OK, done, what else do you have?" Rsync says, "Hang on, that was pretty quick! Are you sure you got all that?" And Linux says, "Oh ya totally." But here’s the thing — Linux is lying. Not in a bad way, just in a kind of overachiever optimistic way. What happens is that if you have a lot of RAM, Linux will answer the system calls to read the file and just stuff it in RAM in preparation for writing it to its final target. But what happens is that this breaks down quickly if the target writing is very slow, for example, as is the case on many USB drives.
Unfortunately rsync does not have an oflag=sync setting like the dd command which ensures writes are actually written. You’ll have to orchestrate moves like this yourself.
for F in /path/to/source/*; do
rsync -aP --inplace "$F" /path/to/destination/
sync # Make sure all writes are written.
done
Alternatives to rsync
The SSH method:
Using ssh to transfer a filesystem:
tar -cjf - mp3/ | ssh -C -o "CompressionLevel=9" xed.ucsd.edu tar -C /target -xjf -
(see tar help)
lftp
Here’s a way to use lftp to get something from a server with it prompting the user for a password.
FTP:
$ lftp -e "get public_html/x.ico;exit" xed@example.xed.ch
HTTP:
$ lftp -e "get location/index.html;exit" -u xed http://example.xed.ch
lftp can also do a mirror. How about some Gene Ontologies? Here’s a little mirroring script showing a way it can be done.
HOST=ftp://ftp.ebi.ac.uk/pub/databases/
X='' # Exclusions
X="${X} -x goa_uniprot"
X="${X} -x old"
lftp -e "mirror ${X} GO ; exit" ${HOST}
Unison
Want to have a topology nightmare? Unison might be just what you need. Unison allows changes on either of two file systems which can then be reconciled. Imagine that you have a laptop and a desktop, L and D respectively. You synchronize your file systems somehow so they are the same. Then you delete L:temp and add L:new. Over on D you create D:different_new and delete D:bad. When you run Unison, in theory, it will delete D:temp and L:bad while adding D:new and L:different_new. That’s already a wee bit confusing for me, but now imagine that you get a second laptop, L2, and you sync to that from L. Ok. Now what if you try syncing L2 to D? You’ve got a graph loop. You need to keep a star topology. Fun! Enjoy!
rdiff-backup
You want snapshots? You’re jealous of those fruity computer people going on about "Time Machine"? Well that functionality has been around for a long time in the free software world. Check out rdiff-backup which can take bandwidth efficient snapshots of the differences in a file system allowing you to reconstruct its state at any other point a snapshot was made. If you run this in a cron job every night you will be able to go back to a file system state from any day. This is very handy for certain kinds of backups. The downside is that if you deal with large temporary files (video editing, let’s say), rdiff-backup won’t really delete those but just make a note of the fact that they were deleted. This can be circumvented a bit, but generally rdiff-backup takes up more room than a straight mirror.