The problem#
Transferring many small files is much slower than you would expect given their total size.
The solution#
tar c directory | pv -abrt | ssh target 'cd destination; tar x'
or
cd destination; ssh source tar c directory | pv -abrt | tar x
The details#
There are a few different things going on here. The primary issue is that common file transfer programs like scp and rsync handle the case of lots of small files poorly, which can be solved by transferring one large file instead. We can create an archive with tar to put all of the small files into one file and then transfer that file:
tar cf archive.tar "$directory"
scp archive.tar "$target:$destination"
rm archive.tar
That works, but creating the intermediate file archive.tar might take a lot of time and disk space, during which the actual transfer hasn't even started yet.
Piping through ssh#
Taking a step back, we are going to use a feature of ssh that makes it very flexible for scripting, but which is easy to be unaware of if you only ever use ssh for its common use case of getting an interactive terminal. ssh can take as an argument a command to run on the destination computer. A simple demonstration of this is ssh hostname w, which will output the status of the remote computer.
Note that the output appeared on our terminal because ssh handles sending the output of w back along the network connection. In fact, it will also send its input over the network, so
echo "Hello World!" | ssh "$hostname" 'cat > file'
creates a file on the remote host containing the string Hello World!, which was generated on the local host.
In our case, we want to generate an archive on one host, send it over ssh, and extract it on the target host. Importantly, we are going to take advantage of the fact that tar was designed to generate its output sequentially, without making any random-access writes to the archive. While this was intended for writing tape backups, it helps us here as well. To make tar output to standard output, we change the tar command to tar cf - directory (or, equivalently, tar c directory; like most programs, it defaults to using standard input/output, which can be explicitly denoted by -). On the other side of the pipe, we use tar x to extract the archive. Inserting the ssh in the middle gives us tar c directory | ssh target tar x. If you don't want to extract the archive into your home directory on the target, you can add a cd to the desired directory before the tar x command; just remember to enclose the entire command to be run on the remote host in quotes, or the semicolon will be misinterpreted as ending the ssh command.
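As a sketch, the whole pipeline can be tried locally without a remote host. Here a subshell that changes directory stands in for the ssh target, and the directory names are made up for the demonstration (using the explicit `f -` form for portability, since GNU tar lets you omit it but other implementations may not):

```shell
# Set up some sample data (hypothetical names, just for the demo)
mkdir -p source/directory destination
echo 'hello' > source/directory/file.txt

# tar writes the archive to stdout; the second subshell stands in for
# "ssh target 'cd destination; tar x'"
(cd source && tar cf - directory) | (cd destination && tar xf -)

cat destination/directory/file.txt   # prints "hello"
```

Swapping the receiving subshell for `ssh target 'cd destination; tar x'` gives the real command from the top of the post.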
Compression#
Depending on the nature of the files being transferred and the network link, it may be worthwhile to compress the files before transferring them. Do note that on a fast enough network this may in fact slow down the transfer.
The simplest way is to just use tar's compression support by adding a flag to tar like --lzop. If you want more control, you can separate out the compression, like tar c directory | gzip | ssh destination 'gunzip | tar x', which is equivalent to giving tar the --gzip option. ssh also has built-in support for compression, which can be enabled with the -C flag. In practice, this often makes ssh slower.
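For illustration, the split-out compression pipeline can be exercised entirely locally; the `gunzip | tar x` subshell stands in for the remote side of the ssh connection, and the paths are hypothetical:

```shell
# Sample data: a highly compressible file
mkdir -p src dst
head -c 100000 /dev/zero > src/zeros.bin

# Compress after tar on the "sending" side; decompress before untar
# on the "receiving" side, as the remote ssh command would
(cd src && tar cf - .) | gzip | (cd dst && gunzip | tar xf -)

cmp src/zeros.bin dst/zeros.bin && echo 'round trip OK'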
A progress bar#
When doing a file transfer, you usually want to know how fast it's going and how long it's going to take. This is especially useful if you are experimenting with compression settings to see which one is the fastest. Do note that in that case you need to be careful to measure the transfer speed of the uncompressed content, which is why I mentioned splitting out the compression commands from tar above.
There's a very useful little utility called pv which will sit on a pipe and display the status of transfers through that pipe. Here we will be using it to measure the network transfer speed. By inserting pv in the pipeline before ssh, we can get an output of how fast ssh is transferring data. pv has a lot of output options, so you'll probably want to check the man page. The ones I used above are -r and -a for the current and average transfer rates, -t for the elapsed time, and -b for the bytes transferred so far.
Note that we can't easily get a progress percentage out of pv because it has no way of knowing how big the transfer is. We can tell it using the -s option… but the exact transfer size is unknown. du -bs directory can be used to estimate the size, although it ignores the archive overhead. Also, the -b option for counting bytes is a GNU extension. If not using GNU du, then use -k to count kilobytes and multiply the result by 1024.
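Putting that together, here is a sketch of feeding the estimate to pv. The pv/ssh line is shown as a comment since it needs a real remote host, and the directory names are hypothetical:

```shell
# Sample data (hypothetical directory, just for the demo)
mkdir -p directory
head -c 4096 /dev/urandom > directory/data.bin

# GNU du: total apparent size in bytes
size=$(du -bs directory | cut -f1)
echo "$size"

# Portable fallback: kilobytes times 1024 (a slight overestimate)
size_k=$(($(du -ks directory | cut -f1) * 1024))
echo "$size_k"

# The transfer then becomes (not run here):
# tar c directory | pv -s "$size" -abrt | ssh target 'cd destination; tar x'
```

With -s set, pv can additionally show a percentage and an ETA, which is approximate because the estimate ignores the archive overhead.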
Comments
3 comments
Jimmy Hartzell
Posted Tue 03 February 2015 06:06
There is a GNUism in this post: BSD tar does not default to stdin/stdout when the -f flag is not specified. Instead, it uses /dev/sa0, which is probably bad unless you have a tape drive, and definitely bad for this situation. BSD is doing the wrong thing, but they are into tradition. Also, the old-style tar flags are kind of badass, but the new-style POSIX flags are preferred and there's no reason to perpetuate the bad habit of using tar xf - when tar -xf - is actually the preferred, standardized syntax.
Manoj Paul
Posted Fri 21 February 2020 02:36
What is the reason behind the slow speed when we transfer smaller files through scp?
Daniel Perelman
Posted Sun 23 February 2020 16:10
I'm having trouble finding a discussion of SCP specifically, but there are two reasons that apply generally to any transfer protocol:
There's some amount of per-file overhead. (e.g. locating the file on disk on the sender side and allocating space to write it on the receiving side.)
Because of the way congestion control works, network transfers always start slow and speed up as the speed of the link is determined dynamically. With small files, there isn't time to get up to full speed before the transfer is completed. This is where the details of the protocol matter, because I think SCP keeps open a single TCP connection (for the SSH connection) for the entire transfer, but I believe I read somewhere that SSH internally has its own congestion control system similar to TCP's, so the effect is similar. (Congestion control is a big complicated topic. See https://en.wikipedia.org/wiki/TCP_congestion_control for way more detail than you likely wanted.)