The problem#
I often copy important files onto multiple devices and later back up
all of those devices onto the same drive, which leaves multiple
redundant copies of those files within my backup. Tools like rdfind,
fdupes, and jdupes exist to deal with the general problem of
efficiently searching a collection of files for duplicates, but none of
them support comparing only files whose filenames and/or paths match,
so they end up doing a lot of extra work in this case.
The solution#
Download the script I wrote, hardlink-dups-by-name.sh, and run it as follows:
hardlink-dups-by-name.sh a_backup/ another_backup/
Then all files like a_backup/some/path that are identical to the
corresponding file another_backup/some/path will get hard-linked
together so there will only be one copy of the data taking up space.
The details#
Algorithm overview#
The work done by the script is very simple, just a bit too verbose to
put into a blog post. It uses find to enumerate the files in the first
directory, does a find/replace to get the corresponding path under the
second directory, checks whether the files are identical with cmp, and
calls ln to hardlink them if so.
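The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual hardlink-dups-by-name.sh: the function name link_matching is hypothetical, it ignores the already-hardlinked complication discussed in this section, and it assumes bash plus a find that supports -print0.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not the real script.
# link_matching FROM TO: hard-link identical files that live at the same
# relative path under both directories.
link_matching() {
    local from="${1%/}/" to="${2%/}/" src dst
    # -print0 with read -d '' keeps filenames containing spaces or newlines intact.
    while IFS= read -r -d '' src; do
        # Swap the leading $from for $to to get the candidate path.
        dst=${src/#"$from"/$to}
        # Skip paths that are not regular files on the other side.
        [ -f "$dst" ] || continue
        # Skip pairs that are already the same inode.
        [ "$src" -ef "$dst" ] && continue
        # cmp -s is silent; exit status 0 means byte-for-byte identical.
        if cmp -s -- "$src" "$dst"; then
            # -f replaces $dst with a hard link to $src.
            ln -f -- "$src" "$dst"
        fi
    done < <(find "$from" -type f -print0)
}
```

Calling link_matching a_backup/ another_backup/ would then walk a_backup/ and link whatever matches under another_backup/.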
The main complication is what to do if the files already have hardlinks. If they're hardlinked to each other, there's nothing to do. If one has only a single hardlink, there's nothing to lose by hardlinking the other one over it. But if both already have hardlinks, there's no straightforward way to find the other paths linked to those files, so it just prints a message by default and has an option for the user to specify which one to prefer.
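To make those cases concrete, here is a small sketch of the decision. The helper names link_count and decide are hypothetical and not taken from the script, and stat -c '%h' is the GNU coreutils spelling (BSD and macOS stat use a different format flag), so treat this as an assumption about the environment.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, assuming GNU stat.
# link_count FILE: print the number of hard links to FILE.
link_count() { stat -c '%h' -- "$1"; }

# decide A B: print which of the cases described above applies
# to a pair of files already known to have identical contents.
decide() {
    local a=$1 b=$2
    if [ "$a" -ef "$b" ]; then
        echo already-linked          # same inode: nothing to do
    elif [ "$(link_count "$b")" -eq 1 ]; then
        echo replace-second          # B has no other names: safe to overwrite
    elif [ "$(link_count "$a")" -eq 1 ]; then
        echo replace-first           # A has no other names: safe to overwrite
    else
        echo ask-user                # both have other links: needs a preference
    fi
}
```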
cmp or diff#
cmp's -s and diff's -q both compare two files without printing the
differences themselves (cmp -s prints nothing at all, while diff -q
prints at most a one-line notice that the files differ), allowing
their exit status to be used to determine the result. This
StackOverflow answer benchmarks them and shows there's no measurable
difference between the two. I used cmp to be clear this is about
comparing files as binary, not text.
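As a small illustration of using the exit status, here is a wrapper around cmp -s. The function name same_contents is hypothetical. cmp exits with 0 when the files are identical, 1 when they differ, and a value greater than 1 on error (for example, a missing file), so the wrapper treats errors the same as "different".

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only when both files are byte-for-byte identical.
same_contents() {
    # -s suppresses all output; only the exit status carries the answer.
    cmp -s -- "$1" "$2"
}
```

It can then be used directly in a condition, as in: if same_contents "$src" "$dst"; then ln -f -- "$src" "$dst"; fi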
Shell variable string manipulation#
The script I wrote is not portable to non-bash shells because it uses
bash's Shell Parameter Expansion for a few things (only a small subset
of that functionality is available in all POSIX-compliant shells).
${line/#$from/$to} is used to change the path under $from to the same
path under $to. The /# means to replace the string $from only at the
start of $line.
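A quick demonstration of that substitution, with made-up values. One detail that is my own addition rather than something the script is known to do: quoting "$from" inside the expansion makes bash treat it as a literal string instead of a glob pattern, which matters if a directory name contains characters like * or [.

```bash
#!/usr/bin/env bash
# Illustrative values only.
from='a_backup/'
to='another_backup/'
line='a_backup/some/a_backup/path'

# /# anchors the match at the start, so the later "a_backup/" is untouched.
swapped=${line/#"$from"/$to}
echo "$swapped"   # another_backup/some/a_backup/path

# A string that does not start with $from is returned unchanged.
other='x/a_backup/y'
unchanged=${other/#"$from"/$to}
echo "$unchanged" # x/a_backup/y
```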
${from:0-1} gets the last character of $from, which is used to
normalize the path to make sure it ends in a /.
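The offset in ${from:0-1} is an arithmetic expression, so 0-1 evaluates to -1 and selects the final character; writing it this way avoids the ambiguity with the unrelated ${from:-1} default-value syntax. A minimal sketch of the normalization, with a hypothetical function name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the argument with exactly one "/" appended
# if it does not already end in one.
ensure_trailing_slash() {
    local dir=$1
    # ${dir:0-1} is the last character of $dir (offset -1).
    if [ "${dir:0-1}" != "/" ]; then
        dir="$dir/"
    fi
    printf '%s\n' "$dir"
}
```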
${1:0:2} gets the first two characters of $1 to check if it's an
option.
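For illustration, a check of that kind might look like the following. I am assuming here that options start with two dashes; the source does not say what the two characters are compared against, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper, assuming options are spelled with a leading "--".
is_long_option() {
    # ${1:0:2} is the substring of $1 starting at offset 0 with length 2.
    [ "${1:0:2}" = "--" ]
}
```

With this, is_long_option --something succeeds, while a directory argument such as a_backup/ does not.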