The problem
I've been playing Codenames online a lot
lately (using my fork of
codenames.plus), and a friend suggested it might
be fun to have themed word lists. Specifically, they suggested Star Trek
as a theme, since it's a fairly widely known fandom. They left it up
to me to figure out what should be in a Star Trek-themed word list.
The solution
If you just want to play Codenames with the list, go to my Codenames
web app and select one or both of the Star Trek card
packs. If you just want the word lists, you can download the
Star Trek: The Next Generation words and the
Star Trek: Deep Space 9 words.
To generate a word list yourself (I used
this source for the Star Trek scripts), you will need
a common-words list such as en_50k.txt,
which I mentioned in
my previous post on anagram games. Then pipe the
corpus into the following script on standard input (you will likely have to
modify it for the idiosyncrasies of your data):
#!/bin/bash
set -euo pipefail
NUM_COMMON=2000 # Filter out the most common 2000 words
COMMON_WORDS="$(mktemp)"
# Uppercase before sorting so the order matches the uppercased corpus words seen by join
<en_50k.txt head -n "$NUM_COMMON" | cut -d' ' -f1 |\
tr '[:lower:]' '[:upper:]' | sort >"$COMMON_WORDS"
# Select only dialogue lines (in Star Trek scripts)
grep -aP '^\t\t\t[^\t]' |\
# Split words
tr ' .,:()\[\]!?;"/\t[:cntrl:]' '[\n*]' |\
sed 's/--/\n/g' |\
# Strip whitespace
sed 's/^\s\+//' | sed 's/\s\+$//' |\
grep -av '^\s*$' |\
# Strip quotes
sed "s/^'//" | sed "s/'$//" |\
# Filter out numbers
grep -av '^[[:digit:]]*$' |\
tr '[:lower:]' '[:upper:]' |\
# Fix for contractions not being in wordlist
sed "s/'\(S\|RE\|VE\|LL\|M\|D\)$//" |\
grep -av "'T$" |\
# Remove some more non-words
grep -avF -e '-' |\
grep -avF '&' |\
# Count
sort | uniq -c |\
# Only keep words with >25 occurrences
awk '{ if ($1 > 25) { print } }' |\
# Remove common words
join -v2 -22 -o 2.1,2.2 "$COMMON_WORDS" - |\
# Sort most frequent words first
sort -rn
rm "$COMMON_WORDS"
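To get a feel for what the splitting and contraction steps do, here is a
minimal sketch that runs a shortened subset of the pipeline on a single
made-up line of dialogue. The sample sentence and the variable names are my
own illustration, and, like the script above, it assumes GNU sed (for \n in
the replacement and \| in the pattern).

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical dialogue line, not taken from the real scripts
line="It's the captain's decision -- we can't wait, Number One!"

tokens="$(printf '%s\n' "$line" |
  # Split on spaces and punctuation, one token per line
  tr ' .,:()!?;"/\t' '[\n*]' |
  sed 's/--/\n/g' |
  # Drop the empty lines left behind by the split
  grep -v '^$' |
  tr '[:lower:]' '[:upper:]' |
  # Strip contraction endings such as 'S, then drop N'T words entirely
  sed "s/'\(S\|RE\|VE\|LL\|M\|D\)$//" |
  grep -v "'T$")"

echo "$tokens"
```

This prints IT, THE, CAPTAIN, DECISION, WE, WAIT, NUMBER and ONE, one per
line: the 'S endings are stripped and CAN'T is dropped entirely.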
The output of the script will require some manual effort to decide
which words really belong in the final list, but it's a good start.
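The join step is probably the least obvious part of the script, so here is a
small sketch of it in isolation. It uses made-up words rather than the real
corpus, and it pins LC_ALL=C (my assumption, the script above does not do
this) so that sort and join agree on ordering.

```shell
#!/bin/bash
set -euo pipefail
export LC_ALL=C  # assumption: fixed collation so sort and join agree

common="$(mktemp)"
counts="$(mktemp)"

# Hypothetical sample data, not taken from the real corpus
printf '%s\n' THE CAPTAIN AND | sort >"$common"
printf '%s\n' THE WARP THE CAPTAIN WARP PHASER AND WARP |
  sort | uniq -c >"$counts"

# -v2: print only lines of the second file that have no match in the first
# -2 2: the join key of the second file is its second field (the word)
# -o 2.1,2.2: output the count and the word from the second file
result="$(join -v2 -2 2 -o 2.1,2.2 "$common" "$counts" | sort -rn)"
echo "$result"

rm -f "$common" "$counts"
```

This prints "3 WARP" followed by "1 PHASER": THE, CAPTAIN and AND are in the
sample common-words file, so only the remaining words survive, most frequent
first.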
The details