A Weird Imagination

Transferring many small files

The problem

Transferring many small files is much slower than you would expect given their total size.

The solution

tar c directory | pv -abrt | ssh target 'cd destination; tar x'

or

cd destination; ssh source tar c directory | pv -abrt | tar x

The details

There are a few different things going on here. The primary issue is that common file transfer programs like scp and rsync handle lots of small files poorly, which we can work around by transferring one large file instead. We can use tar to pack all of the small files into a single archive and then transfer that file:

tar cf archive.tar "$directory"
scp archive.tar "$target:$destination"
rm archive.tar

That works, but creating the intermediate file archive.tar might take a lot of time and disk space, during which the actual transfer hasn't even started yet.

Piping through ssh

Taking a step back, we are going to use a feature of ssh that makes it very flexible for scripting, but is easy to be unaware of if you only ever use ssh for its most common purpose of getting an interactive terminal.

ssh can take as an argument a command to run on the remote computer. A simple demonstration of this is ssh hostname w, which shows who is logged in to the remote computer and what they are doing. The output appears on our terminal because ssh sends the output of w back along the network connection. In fact, it also sends its own standard input over the network to the remote command, so

echo "Hello World!" | ssh "$hostname" 'cat > file'

creates a file on the remote host containing the string Hello World! which was generated on the local host.
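
A minimal local sketch of the same idea, assuming `sh -c` as a stand-in for ssh so that it runs without a remote host (the scratch directory is only for illustration). Like ssh, `sh -c` receives the command as a single quoted string, and the redirection inside the quotes is performed by the shell that runs the command rather than the one that started the pipeline:

```shell
# Stand-in for the remote host: a scratch directory that a second shell writes into.
workdir=$(mktemp -d)

# With a real host this would be: echo "Hello World!" | ssh "$hostname" 'cat > file'
echo "Hello World!" | (cd "$workdir" && sh -c 'cat > file')

# The text generated on the left side of the pipe ended up in the file.
cat "$workdir/file"
```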

In our case, we want to generate an archive on one host, send it over ssh, and extract it on the target host. Importantly, we are going to take advantage of the fact that tar was designed to generate its output sequentially, without making any random access writes to the archive. While this was intended for writing tape backups, it helps us here as well. To make tar write to standard output, we change the tar command to tar cf - directory, where - explicitly denotes standard input/output. With GNU tar, tar c directory is usually equivalent because it defaults to standard input/output when no -f is given, but other implementations such as BSD tar may default to a tape device instead, so the explicit form is more portable. On the other side of the pipe, we use tar x to extract the archive. Inserting the ssh in the middle gives us tar c directory | ssh target tar x.

If you don't want to extract the archive into your home directory on the target, you can add a cd to the desired directory before the tar x command. Just remember to enclose the entire command to be run on the remote host in quotes, or the local shell will treat the semicolon as the end of the ssh command and run tar x locally.
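
The streaming behaviour can be tried without a second machine. This is a rough sketch that assumes two scratch directories standing in for the source and target hosts, with a subshell playing the part of the ssh command; the file names and the count of 100 are made up for the demonstration:

```shell
# Scratch directories standing in for the source and target hosts.
src=$(mktemp -d)
dest=$(mktemp -d)

# Create many small files to transfer.
mkdir "$src/directory"
i=1
while [ "$i" -le 100 ]; do
    echo "file $i" > "$src/directory/$i.txt"
    i=$((i + 1))
done

# Stream the archive through a pipe; no intermediate archive.tar is written.
# With a real host, the right-hand side would be: ssh target 'cd destination; tar xf -'
(cd "$src" && tar cf - directory) | (cd "$dest" && tar xf -)

# Count the files that arrived on the "target" side.
copied=$(find "$dest/directory" -type f | wc -l | tr -d ' ')
echo "copied $copied files"
```

The explicit `f -` is used on both sides so the sketch does not depend on which tar implementation is installed.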

Compression

Depending on the nature of the files being transferred and the network link, it may be worthwhile to compress the files before transferring them. Do note that on a fast enough network this may in fact slow down the transfer.

The simplest way is to use tar's built-in compression support by adding a flag like --lzop. If you want more control, you can separate out the compression, as in tar c directory | gzip | ssh target 'gunzip | tar x', which is equivalent to giving tar the --gzip option on both ends.
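
To get a feel for whether compression is worth it for a given set of files, one option is to compare the size of the plain and compressed tar streams locally before involving the network. The following is only a sketch, assuming gzip is installed and using a deliberately repetitive, made-up test file, so the ratio it prints says nothing about real data:

```shell
# Scratch directory with one highly compressible file (illustrative data only).
src=$(mktemp -d)
mkdir "$src/directory"
i=0
while [ "$i" -lt 1000 ]; do
    echo "the same line over and over"
    i=$((i + 1))
done > "$src/directory/repetitive.txt"

# Size in bytes of the plain tar stream versus the gzip-compressed stream.
plain=$(cd "$src" && tar cf - directory | wc -c | tr -d ' ')
piped=$(cd "$src" && tar cf - directory | gzip | wc -c | tr -d ' ')

# A separate gzip stage and tar's own z flag decompress to the same archive listing.
list_piped=$(cd "$src" && tar cf - directory | gzip | gunzip | tar tf -)
list_flag=$(cd "$src" && tar czf - directory | gunzip | tar tf -)

echo "plain: $plain bytes, gzip: $piped bytes"
```

Size is only half of the picture: the time spent compressing also matters, which is why a fast network can make the uncompressed transfer the quicker one.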

ssh also has built-in support for compression, which can be enabled with the -C flag. In practice, this often makes ssh slower.

A progress bar

When doing a file transfer, you usually want to know how fast it's going and how long it's going to take. This is especially useful if you are experimenting with compression settings to see which one is the fastest. Do note that in that case you need to be careful to measure the transfer speed of the uncompressed content, which is why I mentioned splitting out the compression commands from tar above.

There's a very useful little utility called pv which will sit on a pipe and display the status of transfers through that pipe. Here we will be using it to measure the network transfer speed. By inserting pv in the pipeline before ssh, we can get an output of how fast ssh is transferring data. pv has a lot of output options, so you'll probably want to check the man page. The ones I used above are -r and -a for the current and average transfer rates, -t for the elapsed time, and -b for bytes transferred so far.

Note that we can't easily get a progress percentage out of pv because it has no way of knowing how big the transfer is. We can tell it using the -s option… but the exact transfer size is unknown. du -bs directory can be used to estimate the size, but it ignores the overhead. Also, the -b option for counting bytes is a GNU extension. If not using GNU du, then use -k to count kilobytes and multiply the result by 1024.
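
The size estimate described above could be wrapped in a small helper. This is a sketch rather than a tested recipe: `estimate_bytes` is a hypothetical name, and it assumes that a du without the GNU -b option exits with an error when given it, so that the -k fallback is used instead:

```shell
# estimate_bytes DIR: print an approximate size of DIR in bytes (hypothetical helper).
estimate_bytes() {
    dir=$1
    # GNU du: -b reports the apparent size in bytes.
    if size=$(du -sb -- "$dir" 2>/dev/null); then
        # Keep only the leading digits, dropping the tab and the path.
        printf '%s\n' "${size%%[!0-9]*}"
    else
        # Fallback for other du implementations: -k reports 1024-byte units.
        size=$(du -sk -- "$dir") || return 1
        echo $(( ${size%%[!0-9]*} * 1024 ))
    fi
}

# Try it on a scratch directory containing one small file.
demo=$(mktemp -d)
printf 'hello' > "$demo/a"
estimate_bytes "$demo"
```

The result can then be handed to pv, for example `tar cf - directory | pv -s "$(estimate_bytes directory)" | ssh target 'cd destination; tar xf -'`. Since the estimate ignores tar's own overhead and the fallback measures disk usage in kilobytes, the percentage and ETA that pv shows are approximate.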

Comments

3 comments

Jimmy Hartzell

Posted Tue 03 February 2015 06:06

There is a GNUism in this post: BSD tar does not default to stdin/stdout when the -f flag is not specified. Instead, it uses /dev/sa0, which is probably bad unless you have a tape drive, and definitely bad for this situation. BSD is doing the wrong thing, but they are into tradition. Also, the old-style tar flags are kind of badass, but the new-style POSIX flags are preferred and there's no reason to perpetuate the bad habit of using tar xf - when tar -xf - is actually the preferred, standardized syntax.

Manoj Paul

Posted Fri 21 February 2020 02:36

What is the reason behind the slow speed when we transfer smaller files through scp?

Daniel Perelman

Posted Sun 23 February 2020 16:10

I'm having trouble finding a discussion of SCP specifically, but there are two reasons that apply generally to any transfer protocol:

  1. There's some amount of per-file overhead. (e.g. locating the file on disk on the sender side and allocating space to write it on the receiving side.)

  2. Because of the way congestion control works, network transfers always start slow and speed up as the speed of the link is determined dynamically. With small files, there isn't time to get up to full speed before the transfer is completed. This is where the details of the protocol matter, because I think SCP keeps open a single TCP connection (for the SSH connection) for the entire transfer, but I believe I read somewhere that SSH internally has its own congestion control system similar to TCP's, so the effect is similar. (Congestion control is a big complicated topic. See https://en.wikipedia.org/wiki/TCP_congestion_control for way more detail than you likely wanted.)