The problem

I often copy important files onto multiple devices and later back up all of those devices onto the same drive, at which point the backup contains multiple redundant copies of those files. Tools like rdfind, fdupes, and jdupes handle the general problem of efficiently searching a collection of files for duplicates, but none of them can restrict the comparison to files whose filenames and/or paths match, so they end up doing a lot of extra work in this case.

The solution

Download the script I wrote, hardlink-dups-by-name.sh, and run it as follows:

hardlink-dups-by-name.sh a_backup/ another_backup/

Then every file like a_backup/some/path that is identical to the corresponding file another_backup/some/path gets hardlinked to it, so only one copy of the data takes up space.

The details

Algorithm overview

What the script does is very simple, just a bit too verbose to put into a blog post. It uses find to enumerate the files in the first directory, does a find/replace on each path to get the corresponding path under the second directory, checks whether the two files are identical with cmp, and calls ln to hardlink them if so.
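
The core loop can be sketched roughly as follows. This is a simplified illustration, not the actual script: the function name link_identical is made up for this example, it ignores the existing-hardlink cases, it assumes a find that supports -print0, and it uses the POSIX prefix-strip ${src#"$from"} in place of the script's find/replace expansion.

```bash
# Simplified sketch of the core loop; not the actual script.
# link_identical is a hypothetical name used only for this example.
link_identical() {
    local from=$1 to=$2 src dst
    # Make sure both directory arguments end in a slash.
    from=${from%/}/
    to=${to%/}/
    # Enumerate regular files under $from, NUL-delimited to survive odd names.
    while IFS= read -r -d '' src; do
        # Same relative path, but under $to.
        dst=$to${src#"$from"}
        # Skip paths whose counterpart is missing, a symlink, or not a regular file.
        [ -f "$dst" ] || continue
        [ -L "$dst" ] && continue
        # Skip pairs that are already the same inode.
        [ "$src" -ef "$dst" ] && continue
        # cmp -s is silent; exit status 0 means the contents are identical.
        if cmp -s -- "$src" "$dst"; then
            # Replace $dst with a hard link to $src.
            ln -f -- "$src" "$dst"
        fi
    done < <(find "$from" -type f -print0)
}
```

Calling link_identical a_backup/ another_backup/ would then leave differing files untouched and turn each identical pair into two names for one inode.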

The main complication is what to do if the files already have hardlinks. If they're hardlinked to each other, there's nothing to do. If one has only a single hardlink, there's nothing to lose by hardlinking the other one over it. But if both already have other hardlinks, there's no straightforward way to find those other paths, so by default the script just prints a message, and it has an option for the user to specify which one to prefer.
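
A minimal sketch of that decision, assuming the two files have already been found to have identical contents. The helper names link_count and choose_action are hypothetical, the order of the checks is an arbitrary choice for illustration, and the link count is read from the second field of ls -ld output (on GNU systems, stat -c %h reports the same number).

```bash
# Hypothetical helpers illustrating the decision; not the script's actual code.

# Print the hard link count of a file (second field of `ls -ld` output).
link_count() {
    local mode count rest
    read -r mode count rest < <(ls -ld -- "$1")
    printf '%s\n' "$count"
}

# Print which case applies to a pair of files with identical contents.
choose_action() {
    local a=$1 b=$2
    if [ "$a" -ef "$b" ]; then
        echo "already-linked"    # same inode: nothing to do
    elif [ "$(link_count "$b")" -eq 1 ]; then
        echo "replace-second"    # $b has no other names; link $a over it
    elif [ "$(link_count "$a")" -eq 1 ]; then
        echo "replace-first"     # $a has no other names; link $b over it
    else
        echo "ambiguous"         # both have other names; needs a user preference
    fi
}
```

Only the "ambiguous" case risks leaving a stray copy behind, since the other names for the overwritten file would still point at the old inode.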

`cmp` or `diff`

cmp -s compares two files and outputs nothing, and diff -q prints at most a one-line notice that the files differ, so for either one the exit status can be used to determine the result. This StackOverflow answer benchmarks them and shows there's no measurable difference between the two. I used cmp to make clear that this is about comparing files as binary, not text.
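
As a small illustration of using the exit status directly in a condition (the wrapper name same_contents and the temporary file names are made up for this example):

```bash
# same_contents is a hypothetical wrapper name used only for this example.
same_contents() {
    # -s suppresses all output; the exit status is 0 if the files are
    # identical, 1 if they differ, and greater than 1 on error.
    cmp -s -- "$1" "$2"
}

tmp=$(mktemp -d)
printf 'abc\n' > "$tmp/one"
printf 'abc\n' > "$tmp/two"
printf 'abd\n' > "$tmp/three"

if same_contents "$tmp/one" "$tmp/two"; then
    echo "one and two match"
fi
if ! same_contents "$tmp/one" "$tmp/three"; then
    echo "one and three differ"
fi
```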

Shell variable string manipulation

The script I wrote is not portable to non-bash shells because it uses bash's Shell Parameter Expansion for a few things (only a small subset of that functionality is available in all POSIX-compliant shells).

${line/#$from/$to} is used to change the path under $from to the same path under $to. The /# means to replace the string $from only where it appears at the start of $line.
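
For example, with sample values for the three variables (the paths here are illustrative only):

```bash
from="a_backup/"
to="another_backup/"
line="a_backup/some/path"

# /# anchors the match at the start of $line, so only a leading $from is replaced.
result=${line/#$from/$to}
echo "$result"    # prints another_backup/some/path

# An occurrence of $from later in the string is left alone.
nested="a_backup/copy_of/a_backup/file"
nested_result=${nested/#$from/$to}
echo "$nested_result"    # prints another_backup/copy_of/a_backup/file
```

Note that an unquoted $from is treated as a glob pattern here; writing the pattern as "$from" makes any wildcard characters in it match literally.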

${from:0-1} gets the last character of $from to normalize the path to make sure it ends in a /.
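
A sketch of that normalization, using a hypothetical helper name (ensure_trailing_slash) that is not from the script:

```bash
# Hypothetical helper: append a trailing slash unless one is already there.
ensure_trailing_slash() {
    local dir=$1
    # The offset 0-1 is arithmetic and evaluates to -1: the last character.
    # Writing ${dir:-1} would not work; bash parses :- as "use a default value".
    if [ "${dir:0-1}" != "/" ]; then
        dir=$dir/
    fi
    printf '%s\n' "$dir"
}

ensure_trailing_slash "a_backup"     # prints a_backup/
ensure_trailing_slash "a_backup/"    # prints a_backup/
```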

${1:0:2} gets the first two characters of $1 to check if it's an option.
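
A sketch of such a check, assuming options are spelled with a leading "--" (the helper name is_option and the flag --some-flag are hypothetical; the script's real option names are not shown here):

```bash
# Hypothetical helper: succeed if the argument starts with "--".
is_option() {
    # ${1:0:2} is the substring of $1 starting at offset 0 with length 2.
    [ "${1:0:2}" = "--" ]
}

for arg in --some-flag a_backup/ -x; do
    if is_option "$arg"; then
        echo "$arg: option"
    else
        echo "$arg: not an option"
    fi
done
```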

A Weird Imagination

Hardlink identical directory trees
