The problem#
I've recently been writing more series of blog posts and otherwise
linking between posts using {filename} links. I've also been adjusting
the scheduling of my future planned blog posts, which involves changing
filenames because my naming scheme includes the publication date in the
filename. That means there are opportunities for the links to get out of
sync with the renamed files, ending up with broken links between posts.
Pelican does generate warnings like
WARNING  Unable to find './invalid.md', skipping url replacement.  log.py:89
but currently building my entire blog takes about a minute, so I
generally only do it when publishing. I wanted a more lightweight way
to check just the intra-blog {filename} links.
The solution#
I wrote the script check_filename_links.sh:
#!/bin/bash
# Check intra-blog {filename} links. Pass the content/ directory as an
# argument, or run it from there (it defaults to the current directory).
content="${1:-.}"
find "$content" -iname '*.md' -type f -print0 |
    while IFS= read -r -d '' filename
    do
        # Select the lines that define {filename} links...
        grep '^\[.*]: {filename}' "$filename" |
            # ...extract just the link target, dropping any #fragment...
            sed 's/^[^ ]* {filename}\([^\#]*\)\#\?.*$/\1/' |
            # ...and check that each link target actually exists.
            while read -r link
            do
                if [ "${link:0:1}" != "/" ]
                then
                    # Relative link: resolve against the linking file's directory.
                    linkedfile="$(dirname "$filename")/$link"
                else
                    # Absolute link: resolve against the content/ directory.
                    linkedfile="$content$link"
                fi
                if [ ! -f "$linkedfile" ]
                then
                    echo "filename=$filename, link=$link,"\
                        "file does not exist: $linkedfile"
                fi
            done
    done
Run it from your content/ directory, or provide the path to the
content/ directory as an argument, and it will print out the broken
links:
filename=./foo/bar.md, link=./invalid.md, file does not exist: ./foo/./invalid.md
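For instance, with hypothetical paths:
cd ~/blog/content && ~/bin/check_filename_links.sh
# or, from anywhere, point it at the content directory:
~/bin/check_filename_links.sh ~/blog/content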
The details#
Initial quick and dirty version#
Okay, being honest, here's the messy one-liner version of the script I actually wrote first:
# DANGER: May execute arbitrary code.
grep -F ': {filename}.' */*.md |
sed 's/^\([^:]*\):\[[^ ]* {filename}\([^\#]*\)\#\?.*$/'\
'if [ ! -f \"\$(dirname \"\1\")\/\2\" ]; '\
'then echo \"\0\"; fi/' |
sh
That assumes relative links, so here's an even hackier version to search for absolute links:
# DANGER: May execute arbitrary code.
grep -F ': {filename}/' */*.md |
sed 's/^\([^:]*\):\[[^ ]* {filename}\([^\#]*\)\#\?.*$/'\
'if [ ! -f \"\$(dirname \"\1\")\/..\2\" ]; '\
'then echo \"\0\"; fi/' |
sh
Breaking it down#
Generally a good way to understand a pipeline is to cut it
off after each |
and see what it outputs at that step.
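For this one-liner, that means running the grep by itself first, and
then the grep and sed together with the final | sh left off, so the
generated code is printed instead of executed:
# Stage 1: just the grep
grep -F ': {filename}.' */*.md
# Stages 1-2: grep and sed, omitting the trailing "| sh"
grep -F ': {filename}.' */*.md |
sed 's/^\([^:]*\):\[[^ ]* {filename}\([^\#]*\)\#\?.*$/'\
'if [ ! -f \"\$(dirname \"\1\")\/\2\" ]; '\
'then echo \"\0\"; fi/'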
The grep selects lines that define {filename} links. Since it's run
over multiple files, grep includes the filename at the beginning of
each output line (which can also be enabled explicitly with
-H/--with-filename):
foo/bar.md:[badlink]: {filename}./invalid.md
Then the sed matches the entire line (since the pattern is anchored
with ^ and $) and captures the string before the first : and the string
after {filename}. Additionally, it leaves off any fragment after a #,
if there is one. We can check the match part of the sed command by
changing the replacement to /1=\1; 2=\2/.
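Putting that together (same match expression, diagnostic replacement):
grep -F ': {filename}.' */*.md |
sed 's/^\([^:]*\):\[[^ ]* {filename}\([^\#]*\)\#\?.*$/1=\1; 2=\2/'
which outputs: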
1=foo/bar.md; 2=./invalid.md
Using those matches, the replacement is a sh
program that checks
if the file exists and prints out the original line if it doesn't.
Since the path is relative to the file the link appears in, it uses
dirname
to get the directory of the file with the link
and concatenates that to the link to get the path to the link target.
Then it uses [
to check if that file exists. So for each
link, it generates code like
if [ ! -f "$(dirname "foo/bar.md")/./invalid.md" ]
then
    echo "foo/bar.md:[badlink]: {filename}./invalid.md"
fi
(newlines added for readability).
That's a valid shell script, so the one-liner just pipes it to sh to be
executed. Needless to say, do not do this with untrusted inputs.
As an example, if a file had the "link"
[evillink]: {filename}./$(zenity --error).md
then the script would run zenity and show an error dialog (see the
generated code below). A harmless example to demonstrate that
unintended code execution is happening, but pretty much any code could
go there, which is why generating code based on user data and then
executing it is a bad idea; if you think it's the right approach, then
you have to be very careful about sanitizing your inputs. I wasn't
worried because I was writing the script to run entirely over files I
had written, but it's a bad habit to be in. Luckily, in this case it's
completely unnecessary and the script can be rewritten to avoid that
problem entirely.
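Concretely, the sed replacement would expand that line into a program
roughly like this (assuming the hypothetical file foo/bar.md); the
$(zenity --error) inside the [ test is a command substitution, so sh
runs it while expanding the arguments, before the test even happens:
if [ ! -f "$(dirname "foo/bar.md")/./$(zenity --error).md" ]
then
    echo "foo/bar.md:[evillink]: {filename}./$(zenity --error).md"
fi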
Doing things safely#
I unpacked my one-liner into a slightly longer script that worked
a little differently. In order to not need to parse the output of
grep, I changed it to use find instead, and searched my own blog's
find tag because I remembered that for a previous post I had found
code for safely using find:
find "$content" -iname '*.md' -type f -print0 |
while IFS= read -r -d '' filename
do
# ... use "$filename"
done
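As a quick sketch of why that pattern is the safe one (filenames
hypothetical): -print0 and read -d '' delimit filenames with NUL bytes,
which cannot appear in paths, so names containing spaces or even
newlines come through intact where naive word splitting would mangle
them:
touch 'a strange name.md'
# Unsafe: word splitting yields "a", "strange", and "name.md"
for f in $(find . -iname '*.md'); do echo "$f"; done
# Safe: each filename arrives as one intact string
find . -iname '*.md' -print0 |
    while IFS= read -r -d '' f
    do
        echo "$f"
    done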
Then I modified the grep
and sed
regular expressions to just
pull out the links, making them much simpler. Since I was writing a
single script to handle both relative and absolute links, I had to write
an if
to check which kind of link it was, but then I could use [
to check for the existence of the file while only ever treating the
filenames and links as strings, not potentially as code.
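For the example line from earlier, those simplified expressions reduce
to a per-file pipeline (shown here against the hypothetical foo/bar.md)
that prints just the link target:
grep '^\[.*]: {filename}' foo/bar.md |
    sed 's/^[^ ]* {filename}\([^\#]*\)\#\?.*$/\1/'
which prints:
./invalid.md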