The problem#

When doing an incremental backup, any moved file on the source filesystem usually results in recopying the file to the destination filesystem. For a large file this can both be slow and possibly waste space if the destination keeps around deleted files (e.g. ZFS holding on to old snapshots). If both sides are ZFS, then you can get zfs send/recv to handle all of the details efficiently. But if only the source filesystem is ZFS or the ZFS datasets are not at the same granularity on both sides, that doesn't apply.

zfs diff gives the information about file moves from a snapshot, but its output format is a little awkward for scripting.

The solution#

Download the script I wrote, zfs-diff-move.sh and run it like

zfs-diff-move.sh /path/ /tank/dataset/ tank/dataset@base @new

The following is an abbreviated version of it:

#!/bin/bash
zfs diff -H "$3" "$4" | grep '^R' | while read -r line
do
  get_path() {
    path="$(echo -e "$(echo "$line" | cut -d$'\t' "-f$3")")"
    echo "${path/#$2/$1}"
  }

  from="$(get_path "$1" "$2" 2)"
  to="$(get_path "$1" "$2" 3)"
  mkdir -vp -- "$(dirname "$to")"
  mv -vn -- "$from" "$to" || echo "Unable to move $from"
done

The details#

The problem#

I will often make copies of important files onto multiple devices, and then later make backups of all of those devices onto the same drive. At which point, I now have multiple redundant copies of those files within my backup. Tools like rdfind, fdupes, and jdupes exist to deal with the general problem of searching a collection of files for duplicates efficiently, but none of them support only checking if files are identical if their filenames and/or paths match, so they end up doing a lot of extra work in this case.

The solution#

Download the script I wrote, hardlink-dups-by-name.sh and run it as follows:

hardlink-dups-by-name.sh a_backup/ another_backup/

Then all files like a_backup/some/path that are identical to the corresponding file another_backup/some/path will get hard-linked together so there will only be one copy of the data taking up space.

The details#

The problem#

When writing a shell script that starts background jobs, sometimes running those jobs past the lifetime of the script doesn't make sense. (Of course, sometimes background jobs really should keeping going after the script completes, but that's not the case this post is concerned with.) In the case that either the background jobs are used to do some background computation relevant to the script or the script can conceptually be thought of as a collection of processes, it makes sense for killing the script to also kill any background jobs it started.

The solution#

At the start of the script, add

cleanup() {
    # kill all processes whose parent is this process
    pkill -P $$
}

for sig in INT QUIT HUP TERM; do
  trap "
    cleanup
    trap - $sig EXIT
    kill -s $sig "'"$$"' "$sig"
done
trap cleanup EXIT

If you really want to kill only jobs and not all child processes, use the kill_child_jobs() function from all.sh or look at the other versions in the kill-child-jobs repository.

The details#

The problem#

After using Vim as my primary editor for a while, I find myself trying to use vi-style keyboard shortcuts in other contexts. Usually resulting in :w in middle of whatever I was writing as saving is a natural thing to do when pausing.

There's a few ways to fix this:

Get used to the fact that not everything supports vi-style keyboard shortcuts.
Change the keyboard shortcuts of the programs I do use.
Use Vim for everything.

The first option is suboptimal because vi-style keyboard shortcuts are very useful. Luckily, in many cases there's ways to get them.

The problem#

While trying to develop a modification to the Pelican source, I was unexpectedly having my installed version of Pelican get run instead of the local version:

$ which pelican
/usr/local/bin/pelican
$ command -v pelican
/usr/bin/pelican

For some reason, which was pointing to the executable I was expecting bash to run, but the Bash builtin command was telling me that bash was running the installed version instead.

The solution#

Use the hash builtin to clear bash's cache of the location of pelican:

$ hash -d pelican
$ command -v pelican
/usr/local/bin/pelican

The problem#

If you take the word wizard, reverse the order of the letters and reverse the alphabet:

From: abcdefghijklmnopqrstuvwxyz
To:   ZYXWVUTSRQPONMLKJIHGFEDCBA

then you get the word wizard back, an observation made at least as early as 1972.

Now let's write a shell script to verify this so we can find other words with similar interesting properties. The obvious shell script to verify this

echo wizard | tr a-z z-a | rev

unfortunately fails with the error

tr: range-endpoints of 'z-a' are in reverse collating sequence order

The error is by design: it's not clear what a sequence in reverse order should mean, so POSIX actually requires that it not work.

The problem#

Yesterday, I proposed using a hash function to choose colors:

ps1_color="32;38;5;$((0x$(hostname | md5sum | cut -f1 -d' ' | tr -d '\n' | tail -c2)))"

I did not discuss two issues with this approach:

Colors may be too similar to other colors, such that they are not useful.
Some colors may be undesirable altogether, particularly very dark ones may be too similar to a black terminal background color.

Random color selection#

In my post about hostname-based prompt colors, I suggested a fallback color scheme that was obviously wrong in order to remind you to set a color for that host:

alice@unknown:~$

This carried with it an implicit assumption: you care what color each host is assigned. You may instead be happy to assign a random color to each host. We could use shuf to generate a random color:

ps1_color="32;38;5;$(shuf -i 0-255 -n 1)"

The problem with this solution is the goal of the recoloring the prompt was not simply to make it more colorful, but for that color to have meaning. We want the color to always be the same for each login to a given host.

One way to accomplish this would be to use that code to randomly generate colors, but save the results in a table like the one used before for manually-chosen colors. But it turns out we can do better.

Hash-based color selection#

Hash functions have a useful property called determinism, which means that hashing the same value will always get the same result. The consequence is that we can use a hash function like it's a lookup table of random numbers shared among all of our computers:

ps1_color="32;38;5;$(($(hostname | sum | cut -f1 -d' ' | sed s/^0*//) % 256))"

The $((...)) syntax is bash's replacement for expr which is less portable but easier to use. Here we use it to make sure the hash value we compute is a number between 0 and 255. [sum][sum] computes a hash of its input, in this case the result of hostname. Its output is not just a number so cut selects out the number and sed gets rid of any leading zeros so it isn't misinterpreted as octal.

The idea of using sum was suggested by a friend after reading my previous post on the topic.

But this turns out to not work great for hosts with similar names like rob.example.com and orb.example.com:

alice@rob:~$ 
alice@orb:~$

Similar colors on hosts with very different names would not be so bad, but because of how sum works, it will tend to give similar results on similar strings (although less often than I expected; it took some effort to find such an example).

Better hash functions#

While this is not a security-critical application, here cryptographic hash functions solve the problem. Cryptographic hash functions guarantee (in theory) that knowing that two inputs are similar tells you nothing about their hash values. In other words, the output of cryptographic hash functions are indistinguishable from random and, in fact, they can be used to build pseudorandom generators like Linux's /dev/urandom.

The cryptographic hash function utilities output hex instead of decimal, so they aren't quite a drop-in replacement for sum:

ps1_color="32;38;5;$((0x$(hostname | md5sum | cut -f1 -d' ' | tr -d '\n' | tail -c2)))"

Here we use cut and tr to select just the hex string of the hash. tail's -c option specifies the number of bytes to read from the end, where 2 bytes corresponds to 2 hex digits, which can have a value of 0 to 255, so the modulo operation is not needed. Instead the 0x prefix inside $((...)) interprets the string as a hex number and outputs it as a decimal number.

This code uses the md5sum utility to compute an MD5 hash of the hostname. This is recommended because md5sum is likely to be available on all hosts. Do be aware that MD5 is insecure and it is only okay to use here because coloring the prompt is not a security-critical application.

sha1sum and sha256sum are also likely available on modern systems and work as drop-in replacements for md5sum in the above command should you wish to use a different hash. Additionally, you could also get different values out of the hash by adding a salt:

salt="Some string."
ps1_color="32;38;5;$((0x$( (echo "$salt"; hostname) | sha256sum | cut -f1 -d' ' | tr -d '\n' | tail -c2)))"

The problem#

SSH public key authenication can make SSH much more convenient because you do not need to type a password for every SSH connect. Instead, in common usage, the first time you connect, GNOME Keyring, KWallet, or your desktop environment's equivalent will pop up and offer to keep your decrypted private key securely in memory. Those programs will remember your key until the next time you reboot your computer (or possibly until you log out completely and log back in).

But those are tied to your desktop environment. If you are not at a GUI, either using a computer in text-mode using a console or connecting over SSH, then you do not have access to those programs.

The problem#

When using multiple terminals on different hosts, it can sometimes be confusing to remember which host you are on. The hostname appears in the command prompt, but it's easy to skim past that if you are not paying attention.

One solution that works pretty well for me is recoloring the prompt based on what host I am on. This is in fact why I researched how to get 256 colors terminals working in the first place: in order to have enough colors to be able to make a good choice for each host I use frequently.

A Weird Imagination

Recreate moves from zfs diff

The problem#

The solution#

The details#

Hardlink identical directory trees

The problem#

The solution#

The details#

Kill child jobs on script exit

The problem#

The solution#

The details#

Making :w work everywhere

The problem#

Which command will be run?

The problem#

The solution#

Reverse sequence for tr

The problem#

Better hash-based colors

The problem#

Hash-based hostname colors

Random color selection#

Hash-based color selection#

Better hash functions#

Type your SSH passphrase less often

The problem#

256 color hostnames

The problem#