The problem
I've been playing Codenames online a lot lately (using my fork of codenames.plus), and a friend suggested it might be fun to have themed word lists. Specifically, they suggested Star Trek as a theme as it's a fandom that's fairly widely known. They left it up to me to figure out what should be in a Star Trek themed word list.
The solution
If you just want to play Codenames with the list, go to my Codenames web app and select one or both of the Star Trek card packs. If you just want the word lists, you can download the Star Trek: The Next Generation words and the Star Trek: Deep Space Nine words.
To generate a word list yourself (I used this source for the Star Trek scripts), you will need a common words list like en_50k.txt, which I mentioned in my previous post on anagram games. Then pipe the corpus through the following script (which you will likely have to modify for the idiosyncrasies of your data):
#!/bin/bash
set -euo pipefail
NUM_COMMON=2000 # Filter out the most common 2000 words
COMMON_WORDS="$(mktemp)"
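# Uppercase before sorting so the file is ordered the same way as the
# uppercased corpus words that join compares it against
<en_50k.txt head -n "$NUM_COMMON" | cut -d' ' -f1 |\
tr '[:lower:]' '[:upper:]' | sort >"$COMMON_WORDS"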
# Select only dialogue lines (in Star Trek scripts)
grep -aP '^\t\t\t[^\t]' |\
# Split words
tr ' .,:()\[\]!?;"/\t[:cntrl:]' '[\n*]' |\
sed 's/--/\n/g' |\
# Strip whitespace
sed 's/^\s\+//' | sed 's/\s\+$//' |\
grep -av '^\s*$' |\
# Strip quotes
sed "s/^'//" | sed "s/'$//" |\
# Filter out numbers
grep -av '^[[:digit:]]*$' |\
tr '[:lower:]' '[:upper:]' |\
# Fix for contractions not being in wordlist
sed "s/'\(S\|RE\|VE\|LL\|M\|D\)$//" |\
grep -av "'T$" |\
# Remove some more non-words
grep -avF '-' |\
grep -avF '&' |\
# Count
sort | uniq -c |\
# Only keep words with >25 occurrences
awk '{ if ($1 > 25) { print } }' |\
# Remove common words
join -v2 -22 -o 2.1,2.2 "$COMMON_WORDS" - |\
# Sort most common words first
sort -rn
rm "$COMMON_WORDS"
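The `join` step is probably the least obvious part of the pipeline, so here is a minimal sketch of it in isolation. The word lists and counts below are made-up toy data, not taken from the real corpus, and `-2 2` is just the spelled-out form of `-22`.

```shell
#!/bin/bash
set -euo pipefail

# Toy stand-in for the uppercased, sorted common-words file (made-up data)
common="$(mktemp)"
printf '%s\n' THE TIME YOU >"$common"

# Toy stand-in for the `sort | uniq -c` output: "count word", sorted by word
counts="$(printf '%s\n' '  31 PHASER' ' 900 THE' '  64 TIME' '  58 WARP')"

# -v2        print only lines from input 2 that have no match in input 1
# -2 2       join on field 2 (the word) of input 2
# -o 2.1,2.2 output the count, then the word
filtered="$(printf '%s\n' "$counts" | join -v2 -2 2 -o 2.1,2.2 "$common" -)"
rm "$common"

printf '%s\n' "$filtered"
```

With this toy input, only `31 PHASER` and `58 WARP` survive, because THE and TIME appear in the common-words file. Both inputs have to be sorted on the join field, which is why the real script sorts before this step.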
The output of the script will require some manual effort to decide which words really belong in the final list, but it's a good start.
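One way to set up that manual pass, as a rough sketch: keep only the top few rows of the script's output and drop the counts, so you are left with a plain list of candidate words to prune by hand. The sample rows and the `KEEP` cutoff here are invented for illustration.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical sample of the script's output: "count word", most frequent first
ranked="$(printf '%s\n' '412 ENTERPRISE' '127 WARP' '58 PHASER' '31 HOLODECK')"

KEEP=3 # assumed cutoff; pick whatever suits the size of your card pack

# Keep the first $KEEP rows and print only the word column
candidates="$(printf '%s\n' "$ranked" | awk -v keep="$KEEP" 'NR <= keep { print $2 }')"

printf '%s\n' "$candidates"
```

In practice you would pipe the real script's output into the `awk` command and redirect the result to a file that you then edit by hand.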