A Weird Imagination

Generating specialized word lists

The problem#

I've been playing Codenames online a lot lately (using my fork of codenames.plus), and a friend suggested it might be fun to have themed word lists. Specifically, they suggested Star Trek as a theme as it's a fandom that's fairly widely known. They left it up to me to figure out what should be in a Star Trek themed word list.

The solution#

If you just want to play Codenames with the list, go to my Codenames web app and select one or both of the Star Trek card packs. If you just want the word lists, you can download the Star Trek: The Next Generation words and the Star Trek: Deep Space 9 words.

To generate a word list yourself (I used this source for the Star Trek scripts), you will need a common words list like en_50k.txt, which I mentioned in my previous post on anagram games; then pipe the corpus through the following script (which you will likely have to modify for the idiosyncrasies of your data):

#!/bin/bash
set -euo pipefail

NUM_COMMON=2000 # Filter out the most common 2000 words
COMMON_WORDS="$(mktemp)"
<en_50k.txt head "-$NUM_COMMON" | cut -d' ' -f1 |\
    sort | tr '[:lower:]' '[:upper:]' >"$COMMON_WORDS"

# Select only dialogue lines (in Star Trek scripts)
grep -aP '^\t\t\t[^\t]' |\
    # Split words
    tr ' .,:()\[\]!?;"/\t[:cntrl:]' '[\n*]' |\
    sed 's/--/\n/' |\
    # Strip whitespace
    sed 's/^\s\+//' | sed 's/\s\+$//' |\
    grep -av '^\s*$' |\
    # Strip quotes
    sed "s/^'//" | sed "s/'$//" |\
    # Filter out numbers
    grep -av '^[[:digit:]]*$' |\
    tr '[:lower:]' '[:upper:]' |\
    # Fix for contractions not being in wordlist
    sed "s/'\(S\|RE\|VE\|LL\|M\|D\)$//" |\
    grep -av "'T$" |\
    # Remove some more non-words
    grep -avF '-' |\
    grep -avF '&' |\
    # Count
    sort | uniq -c |\
    # Only keep words with >25 occurrences
    awk '{ if ($1 > 25) { print } }' |\
    # Remove common words
    join -v2 -22 -o 2.1,2.2 "$COMMON_WORDS" - |\
    # Sort most common words first
    sort -rn

rm "$COMMON_WORDS"

The output of the script will require some manual effort to decide which words really belong in the final list, but it's a good start.

The details#

Source of words#

To make the word list, we first need a text corpus to base it on. Luckily, it was easy to find scripts of Star Trek episodes, but scripts of a less popular show, or the plain text of a book, might be harder to come by. Also, the two Star Trek shows I used both lasted for 7 seasons, so there were a lot of episodes and therefore a lot of text. Less text would give weaker signals on which words really stand out.

The scripts are screenplays, which include a lot of descriptions in addition to the actual spoken dialogue. Initially I did not filter out the descriptions and was surprised by some of the common words, so I decided it would be more appropriate to use just the dialogue. Luckily, the dialogue in those files is distinguished by being on lines starting with exactly three tab characters, so it was easy to select it automatically.
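
For example, on a made-up screenplay fragment indented the way just described (description lines unindented, character names indented deeper than the dialogue), the grep keeps only the dialogue line:

printf 'Picard turns to the viewscreen.\n\t\t\t\tPICARD\n\t\t\tEngage.\n' |\
    grep -aP '^\t\t\t[^\t]'
# Prints only the "Engage." line (exactly three leading tabs)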

Splitting into words#

The naive approach would be to simply split every line on spaces and take those as words, but that ignores punctuation. On the other hand, hyphen ("-") and apostrophe ("'") are used both in spelling words and as punctuation, so we can't split words on every punctuation character. So I have a list of punctuation characters that get converted to newlines, as well as [:cntrl:], which matches all control characters, since some of the files contained them.

Additionally, there's a separate sed command to treat "--" as punctuation, which tr couldn't handle since it's not a single character. Similarly, sed "s/^'//" | sed "s/'$//" removes single quotes from the start and/or end of a line. Note that since we've split words onto separate lines, any text longer than one word that was wrapped in single quotes will now have those single quotes on different lines.
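
To see those steps in isolation, here is a made-up line of dialogue run through just the word-splitting part of the pipeline (the same commands as in the script above); contractions keep their apostrophes, which later steps deal with:

echo "Captain's log, supplemental -- we've reached 'Deep Space 9'." |\
    tr ' .,:()\[\]!?;"/\t[:cntrl:]' '[\n*]' |\
    sed 's/--/\n/' |\
    sed 's/^\s\+//' | sed 's/\s\+$//' |\
    grep -av '^\s*$' |\
    sed "s/^'//" | sed "s/'$//"
# Outputs one word per line: Captain's, log, supplemental, we've,
# reached, Deep, Space, 9 (the quotes around 'Deep Space 9' were
# stripped on different lines)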

Filtering non-words#

grep -av '^\s*$' removes any lines containing only whitespace. Note that all of the grep commands have the -a option to force treating the input as text (as opposed to binary), because some of the files contain control characters that would otherwise make grep treat them as binary and suppress the matching lines.

There are also a few filters for removing numbers and symbols.
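
For instance, running a few made-up tokens through just those filters drops the digits-only entries and anything containing "-" or "&":

printf '1701\nNCC-1701\nRIKER\nB&B\n' |\
    grep -av '^[[:digit:]]*$' |\
    grep -avF '-' |\
    grep -avF '&'
# Prints only RIKER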

Taking only most common words#

sort | uniq -c

gives a list of the unique lines (words), each prefixed with a count of how many times it appeared in the input.

awk '{ if ($1 > 25) { print } }'

filters that list by reading the count and only outputting lines where it is over 25.
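
As a toy demonstration, with the threshold lowered to 2 so it triggers on a tiny input:

printf 'WARP\nENGAGE\nWARP\nBORG\nWARP\n' |\
    sort | uniq -c |\
    awk '{ if ($1 > 2) { print } }'
# Only WARP, with its count of 3, passes the threshold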

Filtering to specialized words#

The following filters out any words that appear in $COMMON_WORDS using join:

join -v2 -22 -o 2.1,2.2 "$COMMON_WORDS" -

At this point the input coming in on standard input (the - argument at the end) looks like

...
5029 A
89 ENTERPRISE
3891 THE
...

and we have a common words list that looks like

...
A
THE
...

The -v2 argument to join means that the output should be only the lines of the second file that do not match any line of the first file. -22 means that the match should be done on the second field for the second file (as the first field is the count and the second field is the word). We don't have to specify -1 because, in the common words file, the default first field is the entire line. -o 2.1,2.2 means that the output should be fields 1 and 2 of the second file (that is, the entire line).
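
Here is a self-contained toy version of that join step; note that both inputs are already sorted on their join fields, as join requires (the common words file alphabetically, and the counted words alphabetically by the word in the second field):

common="$(mktemp)"
printf 'A\nTHE\n' >"$common"
printf '5029 A\n89 ENTERPRISE\n3891 THE\n' |\
    join -v2 -22 -o 2.1,2.2 "$common" -
# Prints the one line whose word is not in the common list: 89 ENTERPRISE
rm "$common"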

At the top of the script, we generated $COMMON_WORDS by taking the first 2000 entries of the frequency list, keeping just the word (dropping the counts), and converting it to upper case. If we hadn't converted it to upper case, we could have used the -i option to join to do a case-insensitive comparison. Also, you may want to experiment with adjusting that 2000 to a value that produces a more appropriate word list.

Manual editing#

That got the list down to about 800 words, which I manually read through. Some entries, like major character names, common terms, and technobabble, I kept. Others, like adverbs that merely didn't make the top-2000 words list, I discarded. In between there were a lot of minor character names and less common terms, which I searched the scripts for to see how they were used before deciding whether to keep them in the list.

One common problem that probably could have been fixed automatically was that when the topic of an episode was a character that appears only in that episode, their name gets said a lot. Given a collection of files

grep -ail "\<$word\>" * | wc -l

will report how many files contain the word (grep's -l option means to list only the filenames that have matches, not the actual matches).
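
For example, assuming the script's output was saved to a file named words and the episode scripts are the *.txt files in the current directory (both assumptions, not part of the script above), a short loop can append that per-episode count to each candidate word:

# For each "count word" line, also report how many episode scripts
# mention the word; a name said many times in only one file stands out.
while read -r count word; do
    episodes="$(grep -ail "\<$word\>" ./*.txt | wc -l)"
    printf '%s %s %s\n' "$count" "$word" "$episodes"
done <words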

Another common problem was a word being common only as part of a common phrase: "tractor" appears in Star Trek scripts a lot not because they spend a lot of time discussing farming, but because they have tractor beams. Once again, I handled this by manually searching the scripts for those words, but another possibility would be to generate lists of the most common n-grams. If, before the first call to sort, we dump the cleaned text into a file $words, then we can use paste (you can avoid the bashism of the <(...) syntax by using more temporary files):

paste "$words" <(tail --lines=+2 "$words") \
        <(tail --lines=+3 "$words") |\
    sort | uniq -c | sort -n

That will generate a list of trigrams; you probably want at least bigrams as well, as shown below. Note the common words haven't been filtered out at this point, so the list is pretty noisy, but it's intended to be grep'd through for a specific word anyway.
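
A bigram list can be built the same way with a single shifted copy of the file:

paste "$words" <(tail --lines=+2 "$words") |\
    sort | uniq -c | sort -n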

Finalizing#

To convert the script's output (saved here to a file named words) into a final usable word list (remove the counts and sort the words alphabetically), run

cut -d' ' -f2 words | sort
