A Weird Imagination

Generating specialized word lists

The problem#

I've been playing Codenames online a lot lately (using my fork of codenames.plus), and a friend suggested it might be fun to have themed word lists. Specifically, they suggested Star Trek as a theme as it's a fandom that's fairly widely known. They left it up to me to figure out what should be in a Star Trek themed word list.

The solution#

If you just want to play Codenames with the list, go to my Codenames web app and select one or both of the Star Trek card packs. If you just want the word lists, you can download the Star Trek: The Next Generation words and the Star Trek: Deep Space 9 words.

To generate a word list yourself (I used this source for the Star Trek scripts), you will need a common words list like en_50k.txt which I mentioned in my previous post on anagram games, and then pipe the corpus through the following script (which you will likely have to modify for the idiosyncrasies of your data):

#!/bin/bash
set -euo pipefail

NUM_COMMON=2000 # Filter out the most common 2000 words
COMMON_WORDS="$(mktemp)"
<en_50k.txt head "-$NUM_COMMON" | cut -d' ' -f1 |\
    sort | tr '[:lower:]' '[:upper:]' >"$COMMON_WORDS"

# Select only dialogue lines (in Star Trek scripts)
grep -aP '^\t\t\t[^\t]' |\
    # Split words
    tr ' .,:()\[\]!?;"/\t[:cntrl:]' '[\n*]' |\
    sed 's/--/\n/' |\
    # Strip whitespace
    sed 's/^\s\+//' | sed 's/\s\+$//' |\
    grep -av '^\s*$' |\
    # Strip quotes
    sed "s/^'//" | sed "s/'$//" |\
    # Filter out numbers
    grep -av '^[[:digit:]]*$' |\
    tr '[:lower:]' '[:upper:]' |\
    # Fix for contractions not being in wordlist
    sed "s/'\(S\|RE\|VE\|LL\|M\|D\)$//" |\
    grep -av "'T$" |\
    # Remove some more non-words
    grep -avF '-' |\
    grep -avF '&' |\
    # Count
    sort | uniq -c |\
    # Only keep words with >25 occurrences
    awk '{ if ($1 > 25) { print } }' |\
    # Remove common words
    join -v2 -22 -o 2.1,2.2 "$COMMON_WORDS" - |\
    # Sort most common words first
    sort -rn

rm "$COMMON_WORDS"

The output of the script will require some manual effort to decide which words really belong in the final list, but it's a good start.
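
For example, if the script is saved as make-wordlist.sh (the name and the scripts/ directory below are placeholders) and en_50k.txt is in the current directory, generating a candidate list looks like:

cat scripts/*.txt | ./make-wordlist.sh > tng-candidates.txt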

The details#

Read more…

Shadowrun's text compression

The problem#

Several years ago, I was in a ROM hacking IRC room where another regular, Alchemic, was reverse engineering the text system of the SNES game Shadowrun. He figured it out and wrote a Python script to decompress the text, but had some questions about why it was designed the way it was. So we're going to walk through figuring out how the code works, with some help from his notes, and try to understand the design.

If you don't want spoilers and would rather try to reverse engineer it yourself, just read up to the end of the Trace format section and see how much you can figure out on your own.

Read more…

Custom Bullshit Sans

So, you saw Sans Bullshit Sans (which I previously blogged about) and thought, "Sans Bullshit Sans is cool, but its word list doesn't really match up with my field's bullshit"?

As mentioned in that previous post, Sans Bullshit Sans is open-source and comes with a detailed blog post explaining how it was made. Which means we can follow those instructions to modify the word list.

Read more…

Hash-based hostname colors

Random color selection#

In my post about hostname-based prompt colors, I suggested a fallback color scheme that was obviously wrong in order to remind you to set a color for that host:

alice@unknown:~$ 

This carried with it an implicit assumption: you care what color each host is assigned. You may instead be happy to assign a random color to each host. We could use shuf to generate a random color:

ps1_color="32;38;5;$(shuf -i 0-255 -n 1)"

The problem with this solution is that the goal of recoloring the prompt was not simply to make it more colorful, but for that color to have meaning. We want the color to always be the same for each login to a given host.

One way to accomplish this would be to use that code to randomly generate colors, but save the results in a table like the one used before for manually-chosen colors. But it turns out we can do better.
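
For example (the host names and color numbers here are made up), such a table could be as simple as a case statement on the hostname:

case "$(hostname)" in
    alpha) ps1_color='32;38;5;27'  ;;  # manually chosen
    beta)  ps1_color='32;38;5;118' ;;  # generated randomly once, then saved here
    *)     ps1_color='32;38;5;9'   ;;  # obviously-wrong fallback
esac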

Hash-based color selection#

Hash functions have a useful property called determinism, which means that hashing the same value will always get the same result. The consequence is that we can use a hash function like it's a lookup table of random numbers shared among all of our computers:

ps1_color="32;38;5;$(($(hostname | sum | cut -f1 -d' ' | sed s/^0*//) % 256))"

The $((...)) syntax is bash's replacement for expr: less portable, but easier to use. Here we use it to make sure the hash value we compute is a number between 0 and 255. sum computes a hash of its input, in this case the result of hostname. Its output is not just a number, so cut selects out the number and sed gets rid of any leading zeros so it isn't misinterpreted as octal.
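
For example, on a hypothetical host the intermediate steps might look like this (the checksum values shown are illustrative, not real output):

$ hostname
rob.example.com
$ hostname | sum
02188     1
$ hostname | sum | cut -f1 -d' ' | sed s/^0*//
2188
$ echo $((2188 % 256))
140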

The idea of using sum was suggested by a friend after reading my previous post on the topic.

But this turns out to not work great for hosts with similar names like rob.example.com and orb.example.com:

alice@rob:~$ 
alice@orb:~$ 

Similar colors on hosts with very different names would not be so bad, but because of how sum works, it will tend to give similar results on similar strings (although less often than I expected; it took some effort to find such an example).

Better hash functions#

While this is not a security-critical application, cryptographic hash functions solve the problem here. Cryptographic hash functions guarantee (in theory) that knowing two inputs are similar tells you nothing about how their hash values are related. In other words, the output of a cryptographic hash function is indistinguishable from random and, in fact, such functions can be used to build pseudorandom generators like Linux's /dev/urandom.

The cryptographic hash function utilities output hex instead of decimal, so they aren't quite a drop-in replacement for sum:

ps1_color="32;38;5;$((0x$(hostname | md5sum | cut -f1 -d' ' | tr -d '\n' | tail -c2)))"

Here we use cut and tr to select just the hex string of the hash. tail's -c option specifies the number of bytes to read from the end, where 2 bytes corresponds to 2 hex digits, which can have a value of 0 to 255, so the modulo operation is not needed. Instead the 0x prefix inside $((...)) interprets the string as a hex number and outputs it as a decimal number.
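
Putting the pieces together with a made-up hash value (real md5sum output will differ):

$ hostname | md5sum
0123456789abcdef0123456789abcd4b  -
$ hostname | md5sum | cut -f1 -d' ' | tr -d '\n' | tail -c2
4b
$ echo $((0x4b))
75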

This code uses the md5sum utility to compute an MD5 hash of the hostname. This is recommended because md5sum is likely to be available on all hosts. Do be aware that MD5 is insecure and it is only okay to use here because coloring the prompt is not a security-critical application.

sha1sum and sha256sum are also likely available on modern systems and work as drop-in replacements for md5sum in the above command should you wish to use a different hash. Additionally, you could also get different values out of the hash by adding a salt:

salt="Some string."
ps1_color="32;38;5;$((0x$( (echo "$salt"; hostname) | sha256sum | cut -f1 -d' ' | tr -d '\n' | tail -c2)))"

Changing Pelican URL scheme

The problem#

I changed the URI scheme of this blog recently from /posts/YYYY/MM/slug/ to /YYYY/MM/DD/slug/. The latter looks better and makes the actual day of the post more visible.

But I already had posts using the old scheme and cool URIs don't change. Luckily, someone wrote a Pelican plugin called pelican-alias which allows articles to be tagged with additional URIs to redirect to their canonical location. All I had to do was add an Alias: /posts/2015/02/... line to the top of each of the posts I had already written and the plugin would take care of the rest.
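
For example, a post's metadata block ends up looking something like this (the title, date, and slug here are placeholders):

Title: Some older post
Date: 2015-02-14
Alias: /posts/2015/02/some-older-post/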

Automating the aliasing#

The non-trivial part of automating this is that the URIs include the article's slug, which may have been generated by Pelican from the title, so Pelican has to be involved in generating the correct redirects.

There are two ways I could have automated this process:

  1. Modify the plugin to add a redirect from the old scheme to the new scheme for every article. Unless somehow controlled, this would result in creating redirects for new articles which do not need them.
  2. Write a one-off script to get the slugs out of Pelican and write the Alias: lines into the blog posts.

I took the latter approach because it was simpler and involved no new code to maintain.
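
Sketching just the rewriting half: given a file of tab-separated source-file/old-URL pairs extracted from Pelican (call it redirects.tsv; the name and format are only for illustration), a small loop with GNU sed could prepend the lines:

#!/bin/sh
# redirects.tsv (hypothetical): lines of the form "content/some-post.md<TAB>/posts/2015/02/some-post/"
while IFS="$(printf '\t')" read -r src old_url; do
    # Prepend the Alias line to the post's metadata block.
    sed -i "1i Alias: $old_url" "$src"
done < redirects.tsv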

Read more…

Checking for unsafe shell constructs

Filenames are troublesome#

While shell programming lets you write very concise programs, it turns out that the primary use case of working with files is unfortunately much harder than it seems. That detailed article by David A. Wheeler does a good job of explaining all of the various problems that a naive shell script can run into due to various characters which are allowed in filenames but which the shell treats specially in some way.

One surprising one is that filenames beginning with a dash (-) can be interpreted as options due to the way globbing works in the shell. Suppose we set up a directory as follows:

$ cat > -n
Some secret text.
$ cat > test
This is a test.
It has multiple lines.

Quick, what will cat * do here?

$ cat *
     1  This is a test.
     2  It has multiple lines.

Probably not what you wanted. The reason that happens is that the * is expanded by the shell before being fed to cat, so the command executed is cat -n test and -n gets interpreted not as a filename but as an option telling cat to number the lines of the output.

The workaround is to use ./* instead of *, so the - will not actually be the first character and therefore will not get misinterpreted as an option. But there are many other things that can go wrong with unexpected filenames and remembering to handle all of them everywhere is error-prone.
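
With the workaround, both files are printed (in glob order) as intended:

$ cat ./*
Some secret text.
This is a test.
It has multiple lines.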

Warnings for unsafe shell code#

The solution is shellcheck. shellcheck will warn you about mistakes like the cat * problem and many other issues you may not be aware of.

As I have many shell scripts around that I wrote before learning about shellcheck, I wanted to run it on all of the shell scripts (but not binaries or other language scripts) in my ~/bin directory, so naturally I wrote a script to do so:

#!/bin/sh

find . -exec file {} \; \
    | grep -F 'shell script' \
    | sed 's/:[^:]*$//' \
    | xargs shellcheck

This uses the file command to identify shell scripts and then selects out their file names to run shellcheck on all of them using xargs.

Warnings in Vim#

shellcheck is written to support integration into IDEs. I use Vim to edit shell scripts, so I installed the syntastic plugin (using Vundle, which makes installing Vim plugins off GitHub very easy). Note that you need to follow the instructions on the Syntastic page, specifically the recommended settings: without any settings it won't do anything at all. Once set up, it automatically runs shellcheck on every save, highlights lines with warnings, and shows a list of warnings that can be double-clicked to jump to the location of each warning.

If you use the other text editor, then the shellcheck website recommends the flycheck plugin.

sh Rube Goldbergs

The problem#

The command-line is an expressive interface which allows powerful commands to be written concisely. Sometimes you want a longer, less direct way of implementing a task. For example, merely writing wc -l is far too straightforward for counting lines in a file. Surely we can devise a more convoluted way to accomplish that task.
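
For reference, the straightforward version (with the filename in $file, as in the solution below) is just:

wc -l < "$file"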

The solution#

cat "$file" |
    expr $(od -t x1 |
    sed 's/ /\n/g' |
    grep '^0a$' |
    sed -z 's/\n//g' |
    wc -c) / 2

The details#

Read more…

Logging online status

The problem#

I used to have an occasionally unreliable internet connection. I wanted logs of exactly how unreliable it was and an easy way to have notice when it was back up.

The solution#

Use cron to check online status once a minute and write the result to a file. An easy way to check is to confirm that google.com will reply to a ping (this does give a false negative in the unlikely event that Google is down).
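
The core check can be tried by hand as a one-liner (the full script below additionally checks that the reply did not come from a local 192.168.* address):

ping -c 1 google.com >/dev/null 2>&1 && echo online || echo offline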

To run a script every minute, put a file in /etc/cron.d containing the line

* * * * * root /root/bin/online-check

where /root/bin/online-check is the following script:

#!/bin/sh

# Check if computer is online by attempting to ping google.com.
PING_RESULT="`ping -c 2 google.com 2>/dev/null`"
if [ $? -eq 0 ] && ! echo "$PING_RESULT" | grep -F '64 bytes from 192.168.' >/dev/null 2>/dev/null
then
    ONLINE="online"
else
    ONLINE="offline"
fi
echo "`date '+%Y-%m-%d %T%z'` $ONLINE" >> /var/log/online.log

The details and pretty printing#

Read more…

Out of inodes, what now?

When you start getting disk full messages on Linux, there are a few different reasons why that might happen:

  1. The expected. Too many large files. You can track down large directories using WinDirStat or

    du -hx --max-depth=1 | sort -h
    where the -x option tells du not to cross filesystem boundaries and the -h option tells both du and sort to use human-readable sizes like 11M or 1G.

  2. Deleted files aren't actually deleted if they are still open, so their space isn't freed until every process holding them open closes them. You can use lsof to find open files. Give it the filesystem as an argument like lsof /home.

  3. By default 5% of each filesystem is reserved for writes by root. Depending on what the filesystem is being used for, this may be too much or simply unnecessary. See this Server Fault answer for how to deal with this.

  4. The files could be shadowed by a mount. If a filesystem is mounted over a non-empty directory, the files in that directory aren't visible, but they still take up space on the underlying filesystem.

  5. Last, the disk might not actually be out of space at all. It might actually be out of inodes. Some filesystems, notably the ext2/3/4 filesystems used by default on most Linux distributions, have a fixed number of inodes allocated at filesystem creation time. The default is high enough that it is unlikely to be an issue unless there are a very large number of tiny or empty files. df -i will show the number of inodes free on each filesystem to verify whether a filesystem is indeed out of inodes.

    But how do you find those empty files? As described above, du will help find large files, but now we want to find large numbers of files. The following command acts like du -hx --max-depth=$depth | sort -h for inodes instead of file sizes:

    find -xdev | sed "s@\(\([^/]*/\)\{$depth\}[^/]*\).*@\1@" | uniq -c | sort -n
    

    find -xdev lists all of the files under the current directory on the same filesystem. The sed command finds the first $depth directories (ending in /) and discards the rest of the filename (the .* at the end), so each directory appears once for every file or directory anywhere under it. Then the end of the command counts the repeated lines and sorts by those counts, highlighting the directories with the most files.
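
    For example, with $depth set to 2, the end of the sorted output might look something like this (directory names and counts are purely illustrative):

       2342 ./home/alice
      31415 ./var/cache
      65536 ./var/spool

    Here ./var/spool would be the first place to look for an unexpectedly large number of small files.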