The problem#

I wanted to find a specific Unicode character I had used somewhere in some previous blog post. But by the nature of not knowing exactly what character it was, I wasn't sure how to search for it.

The solution#

Instead, I wrote a script based on this post to simply list all of the characters appearing in any file in a given directory:

cat * | sed 's/./&\n/g' | sort -u

Although the output is small, to further reduce the noise, this version strips out the English letters, numbers, and common symbols:

cat * \
    | tr -d 'a-zA-Z0-9!@#$%^&*()_+=`~,./?;:"[]{}<>|\\'"'-" \
    | sed 's/./&\n/g' | sort -u

The details#

The problem#

When running an incremental backup with rsync with the --progress flag, it often spends lot of time outputting nothing as it scans through many unchanged files. If you think of it before starting the transfer, --info=progress2 or the name2/skip2 --info flags would give more detail, but once the transfer has been going for a while, you probably don't want to cancel and restart it so you can add those flags.

The solution#

The documentation and this StackExchange answer say you can send a SIGVTALRM signal to rsync version 3.2.0+ and it will output its current progress, but that wasn't working for me.

As a workaround, you can use strace to get a running log of which files rsync is looking at, which includes files it skips without actually opening:

strace --attach="$(pidof rsync)" --trace=openat

(If that's not showing anything, try removing the --trace=openat filter and seeing if there's other syscalls with paths to filter on.)

Alternatively, this StackExchange answer suggests a way to see the currently open files including their sizes (including directories but not unchanged files being inspected):

watch lsof -p"$(pidof rsync | tr ' ' ',')"

(The same should work for a recursive cp/mv/rm.)

Similarly, for getting the status of a transfer of a single large file, this answer attempts to read the files cp is reading/writing to give a running percentage of how much it has copied; a similar approach might work for rsync.

The details#

The problem#

I've been playing Codenames online a lot lately (using my fork of codenames.plus), and a friend suggested it might be fun to have themed word lists. Specifically, they suggested Star Trek as a theme as it's a fandom that's fairly widely known. They left it up to me to figure out what should be in a Star Trek themed word list.

The solution#

If you just want to play Codenames with the list, go to my Codenames web app and select one or both of the Star Trek card packs. If you just want the word lists, you can download the Star Trek: The Next Generation words and the Star Trek: Deep Space 9 words.

To generate a word list yourself (I used this source for the Star Trek scripts), you will need a common words list like en_50k.txt which I mentioned in my previous post on anagram games, and then pipe the corpus through the following script (which you will likely have to modify for the idiosyncrasies of your data):

#!/bin/bash
set -euo pipefail

NUM_COMMON=2000 # Filter out the most common 2000 words
COMMON_WORDS="$(mktemp)"
<en_50k.txt head "-$NUM_COMMON" | cut -d' ' -f1 |\
    sort | tr '[:lower:]' '[:upper:]' >"$COMMON_WORDS"

# Select only dialogue lines (in Star Trek scripts)
grep -aP '^\t\t\t[^\t]' |\
    # Split words
    tr ' .,:()\[\]!?;"/\t[:cntrl:]' '[\n*]' |\
    sed 's/--/\n/' |\
    # Strip whitespace
    sed 's/^\s\+//' | sed 's/\s\+$//' |\
    grep -av '^\s*$' |\
    # Strip quotes
    sed "s/^'//" | sed "s/'$//" |\
    # Filter out numbers
    grep -av '^[[:digit:]]*$' |\
    tr '[:lower:]' '[:upper:]' |\
    # Fix for contractions not being in wordlist
    sed "s/'\(S\|RE\|VE\|LL\|M\|D\)$//" |\
    grep -av "'T$" |\
    # Remove some more non-words
    grep -avF '-' |\
    grep -avF '&' |\
    # Count
    sort | uniq -c |\
    # Only keep words with >25 occurrences
    awk '{ if ($1 > 25) { print } }' |\
    # Remove common words
    join -v2 -22 -o 2.1,2.2 "$COMMON_WORDS" - |\
    # Sort most common words first
    sort -rn

rm "$COMMON_WORDS"

The output of the script will require some manual effort to decide which words really belong in the final list, but it's a good start.

The details#

Calculating π the hard way#

In honor of Pi Day, I was going to try to write a script that computed π in shell, but given the lack of floating point support, I decided it would be too messy. If you want to see hard to follow code to generate π, I highly recommend the IOCCC entry westley.c from 1998, the majority of which is an ASCII art circle which calculates its own area and radius in order to estimate π. The hint file suggests looking at the output of

$ cc -E westley.c

The 2012 entry, endoh2 is also a pretty amazing π calculator.

Getting π#

Instead, I will just generate π the shell way: using another program.

$ python -c 'import math; print(math.pi)'
3.14159265359

Random color selection#

In my post about hostname-based prompt colors, I suggested a fallback color scheme that was obviously wrong in order to remind you to set a color for that host:

alice@unknown:~$

This carried with it an implicit assumption: you care what color each host is assigned. You may instead be happy to assign a random color to each host. We could use shuf to generate a random color:

ps1_color="32;38;5;$(shuf -i 0-255 -n 1)"

The problem with this solution is the goal of the recoloring the prompt was not simply to make it more colorful, but for that color to have meaning. We want the color to always be the same for each login to a given host.

One way to accomplish this would be to use that code to randomly generate colors, but save the results in a table like the one used before for manually-chosen colors. But it turns out we can do better.

Hash-based color selection#

Hash functions have a useful property called determinism, which means that hashing the same value will always get the same result. The consequence is that we can use a hash function like it's a lookup table of random numbers shared among all of our computers:

ps1_color="32;38;5;$(($(hostname | sum | cut -f1 -d' ' | sed s/^0*//) % 256))"

The $((...)) syntax is bash's replacement for expr which is less portable but easier to use. Here we use it to make sure the hash value we compute is a number between 0 and 255. [sum][sum] computes a hash of its input, in this case the result of hostname. Its output is not just a number so cut selects out the number and sed gets rid of any leading zeros so it isn't misinterpreted as octal.

The idea of using sum was suggested by a friend after reading my previous post on the topic.

But this turns out to not work great for hosts with similar names like rob.example.com and orb.example.com:

alice@rob:~$ 
alice@orb:~$

Similar colors on hosts with very different names would not be so bad, but because of how sum works, it will tend to give similar results on similar strings (although less often than I expected; it took some effort to find such an example).

Better hash functions#

While this is not a security-critical application, here cryptographic hash functions solve the problem. Cryptographic hash functions guarantee (in theory) that knowing that two inputs are similar tells you nothing about their hash values. In other words, the output of cryptographic hash functions are indistinguishable from random and, in fact, they can be used to build pseudorandom generators like Linux's /dev/urandom.

The cryptographic hash function utilities output hex instead of decimal, so they aren't quite a drop-in replacement for sum:

ps1_color="32;38;5;$((0x$(hostname | md5sum | cut -f1 -d' ' | tr -d '\n' | tail -c2)))"

Here we use cut and tr to select just the hex string of the hash. tail's -c option specifies the number of bytes to read from the end, where 2 bytes corresponds to 2 hex digits, which can have a value of 0 to 255, so the modulo operation is not needed. Instead the 0x prefix inside $((...)) interprets the string as a hex number and outputs it as a decimal number.

This code uses the md5sum utility to compute an MD5 hash of the hostname. This is recommended because md5sum is likely to be available on all hosts. Do be aware that MD5 is insecure and it is only okay to use here because coloring the prompt is not a security-critical application.

sha1sum and sha256sum are also likely available on modern systems and work as drop-in replacements for md5sum in the above command should you wish to use a different hash. Additionally, you could also get different values out of the hash by adding a salt:

salt="Some string."
ps1_color="32;38;5;$((0x$( (echo "$salt"; hostname) | sha256sum | cut -f1 -d' ' | tr -d '\n' | tail -c2)))"

A Weird Imagination

Shell script over characters

The problem#

The solution#

The details#

Status of long-running copy

The problem#

The solution#

The details#

Generating specialized word lists

The problem#

The solution#

The details#

Pi in shell

Calculating π the hard way#

Getting π#

Reverse sequence for tr

The problem#

Better hash-based colors

The problem#

Hash-based hostname colors

Random color selection#

Hash-based color selection#

Better hash functions#

sh Rube Goldbergs

The problem#

The solution#

The details#

Child process not in ps?

A buggy program#