The problem#
I've recently been writing more series of blog posts and otherwise
linking between posts using {filename} links. I've also been adjusting
the scheduling of my future planned blog posts, which involves changing
filenames because my naming scheme includes the publication date in the
filename. That means there are opportunities for the links to get out of
sync with the renamed files, ending up with broken links between posts.
Pelican does generate warnings like
WARNING  Unable to find './invalid.md', skipping url replacement.  log.py:89
but currently building my entire blog takes about a minute, so I
generally only do it when publishing. I wanted a more lightweight way
to check just the intra-blog {filename} links.
The solution#
I wrote the script check_filename_links.sh:
#!/bin/bash
# Check intra-blog {filename} links. Pass the content/ directory as an
# argument, or run it from there (it defaults to the current directory).
content="${1:-.}"
find "$content" -iname '*.md' -type f -print0 |
    while IFS= read -r -d '' filename
    do
        # Select the lines that define {filename} links...
        grep '^\[.*]: {filename}' "$filename" |
            # ...extract just the link target, dropping any #fragment...
            sed 's/^[^ ]* {filename}\([^\#]*\)\#\?.*$/\1/' |
            # ...and check that each link target actually exists.
            while read -r link
            do
                if [ "${link:0:1}" != "/" ]
                then
                    # Relative link: resolve against the linking file's directory.
                    linkedfile="$(dirname "$filename")/$link"
                else
                    # Absolute link: resolve against the content/ directory.
                    linkedfile="$content$link"
                fi
                if [ ! -f "$linkedfile" ]
                then
                    echo "filename=$filename, link=$link,"\
                        "file does not exist: $linkedfile"
                fi
            done
    done
Run it from your content/ directory, or provide the path to the
content/ directory as an argument, and it will print out the broken
links:
filename=./foo/bar.md, link=./invalid.md, file does not exist: ./foo/./invalid.md
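For instance, with hypothetical paths:
cd ~/blog/content && ~/bin/check_filename_links.sh
# or, from anywhere, point it at the content directory:
~/bin/check_filename_links.sh ~/blog/content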
The details#
Initial quick and dirty version#
Okay, being honest, here's the messy one-liner version of the script I actually wrote first:
# DANGER: May execute arbitrary code.
grep -F ': {filename}.' */*.md |
sed 's/^\([^:]*\):\[[^ ]* {filename}\([^\#]*\)\#\?.*$/'\
'if [ ! -f \"\$(dirname \"\1\")\/\2\" ]; '\
'then echo \"\0\"; fi/' |
sh
That assumes relative links, so here's an even hackier version to search for absolute links:
# DANGER: May execute arbitrary code.
grep -F ': {filename}/' */*.md |
sed 's/^\([^:]*\):\[[^ ]* {filename}\([^\#]*\)\#\?.*$/'\
'if [ ! -f \"\$(dirname \"\1\")\/..\2\" ]; '\
'then echo \"\0\"; fi/' |
sh
Breaking it down#
Generally a good way to understand a pipeline is to cut it
off after each |
and see what it outputs at that step.
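For this one-liner, that means running the grep by itself first, and
then the grep and sed together with the final | sh left off, so the
generated code is printed instead of executed:
# Stage 1: just the grep
grep -F ': {filename}.' */*.md
# Stages 1-2: grep and sed, omitting the trailing "| sh"
grep -F ': {filename}.' */*.md |
sed 's/^\([^:]*\):\[[^ ]* {filename}\([^\#]*\)\#\?.*$/'\
'if [ ! -f \"\$(dirname \"\1\")\/\2\" ]; '\
'then echo \"\0\"; fi/'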
The grep selects lines that define {filename} links. Since it's run
over multiple files, grep includes the filename at the beginning of
each output line (which can also be enabled explicitly with
-H/--with-filename):
foo/bar.md:[badlink]: {filename}./invalid.md
Then the sed matches the entire line (since the pattern is anchored
with ^ and $) and captures the string before the first : and the string
after {filename}. Additionally, it leaves off any fragment after a #,
if there is one. We can check the match part of the sed command by
changing the replacement to /1=\1; 2=\2/.
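Putting that together (same match expression, diagnostic replacement):
grep -F ': {filename}.' */*.md |
sed 's/^\([^:]*\):\[[^ ]* {filename}\([^\#]*\)\#\?.*$/1=\1; 2=\2/'
which outputs: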
1=foo/bar.md; 2=./invalid.md
Using those matches, the replacement is a sh
program that checks
if the file exists and prints out the original line if it doesn't.
Since the path is relative to the file the link appears in, it uses
dirname
to get the directory of the file with the link
and concatenates that to the link to get the path to the link target.
Then it uses [
to check if that file exists. So for each
link, it generates code like
if [ ! -f "$(dirname "foo/bar.md")/./invalid.md" ]
then
    echo "foo/bar.md:[badlink]: {filename}./invalid.md"
fi
(newlines added for readability).
That's a valid shell script, so the one-liner just pipes it to sh to be
executed. Needless to say, do not do this with untrusted inputs.
As an example, if a file had the "link"
[evillink]: {filename}./$(zenity --error).md
then the script would run zenity and show an error dialog (see the
generated code below). A harmless example to demonstrate that
unintended code execution is happening, but pretty much any code could
go there, which is why generating code based on user data and then
executing it is a bad idea; if you think it's the right approach, then
you have to be very careful about sanitizing your inputs. I wasn't
worried because I was writing the script to run entirely over files I
had written, but it's a bad habit to be in. Luckily, in this case it's
completely unnecessary and the script can be rewritten to avoid that
problem entirely.
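Concretely, the sed replacement would expand that line into a program
roughly like this (assuming the hypothetical file foo/bar.md); the
$(zenity --error) inside the [ test is a command substitution, so sh
runs it while expanding the arguments, before the test even happens:
if [ ! -f "$(dirname "foo/bar.md")/./$(zenity --error).md" ]
then
    echo "foo/bar.md:[evillink]: {filename}./$(zenity --error).md"
fi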
Doing things safely#
I unpacked my one-liner into a slightly longer script that worked
a little differently. In order to not need to parse the output of
grep, I changed it to use find instead, and searched my own blog's
find tag because I remembered that for a previous post I had found
code for safely using find:
find "$content" -iname '*.md' -type f -print0 |
while IFS= read -r -d '' filename
do
# ... use "$filename"
done
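As a quick sketch of why that pattern is the safe one (filenames
hypothetical): -print0 and read -d '' delimit filenames with NUL bytes,
which cannot appear in paths, so names containing spaces or even
newlines come through intact where naive word splitting would mangle
them:
touch 'a strange name.md'
# Unsafe: word splitting yields "a", "strange", and "name.md"
for f in $(find . -iname '*.md'); do echo "$f"; done
# Safe: each filename arrives as one intact string
find . -iname '*.md' -print0 |
    while IFS= read -r -d '' f
    do
        echo "$f"
    done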
Then I modified the grep
and sed
regular expressions to just
pull out the links, making them much simpler. Since I was writing a
single script to handle both relative and absolute links, I had to write
an if
to check which kind of link it was, but then I could use [
to check for the existence of the file while only ever treating the
filenames and links as strings, not potentially as code.
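For the example line from earlier, those simplified expressions reduce
to a per-file pipeline (shown here against the hypothetical foo/bar.md)
that prints just the link target:
grep '^\[.*]: {filename}' foo/bar.md |
    sed 's/^[^ ]* {filename}\([^\#]*\)\#\?.*$/\1/'
which prints:
./invalid.md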