The problem

When writing blog posts, I like to use Markdown's reference-style links, which let you avoid writing URLs inline: you give each link a short name and define its URL elsewhere in the document. I always put the definitions at the end, so the bottom of the Markdown file looks like a bibliography for the post. That adds the extra task of making sure the references at the bottom stay consistent with their usage in the post. It isn't a huge problem, since I usually add a link and immediately use it, then check the preview to make sure it rendered properly. But sometimes I start a post by entering a list of links I expect to use, and sometimes I miss something when checking the preview.

The solution

lint_refs.py takes any number of Markdown files as arguments and prints the references that are invalid or not used:

#!/usr/bin/env python

import sys
from pathlib import Path
from markdown import markdown, extensions, postprocessors


class ReferenceProxy(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.read = set()

    def __contains__(self, key):
        self.read.add(key)
        return super().__contains__(key)


class ReferenceLintExtension(extensions.Extension,
                             postprocessors.Postprocessor):
    def __init__(self, filename):
        self.filename = filename

    def extendMarkdown(self, md):
        # Swap in the recording dict so every reference lookup is tracked.
        self.refs = md.references = ReferenceProxy(md.references)
        md.postprocessors.register(self, 'ref_lint', 1)

    def run(self, text):
        undefined = self.refs.read - set(self.refs.keys())
        unused = set(self.refs.keys()) - self.refs.read
        if unused or undefined:
            print(f"\n\n# {self.filename}")
            if undefined:
                print("\n## UNDEFINED REFERENCES")
                print('\n'.join(sorted(undefined)))
            if unused:
                print("\n## UNUSED REFERENCES")
                print('\n'.join(sorted(unused)))
        return text


for filename in sys.argv[1:]:
    markdown(Path(filename).read_text(), extensions=[
        ReferenceLintExtension(filename),
        'markdown.extensions.extra'])

Given example.md:

A [broken link][broken]. A [working link][working].

[working]: https://example.com/
[not-used]: https://example.org/

$ ./lint_refs.py example.md


# example.md

## UNDEFINED REFERENCES
broken
broken link

## UNUSED REFERENCES
not-used

The details

What are they called?

The first challenge was figuring out the term "reference-style link", since I didn't know what they were called. I had to do a few web searches for "Markdown" and "links at bottom of file" before I happened upon that term. I could also have looked at the official documentation or at the Python Markdown library source code, which refers to the concept as "references".

Python Markdown extensions

Python Markdown has an extensions API with lots of different places you can hook into its processing, including multiple internal representations of the document at different stages. I found the ReferenceProcessor at the "Block Processors" stage and ReferenceInlineProcessor at the "Inline Processors" stage, so I figured I wanted to run my code between them.

Unfortunately, the model does not include a concept of a parsed link with an unresolved reference: ReferenceInlineProcessor uses a regular expression to find possible references, and for matches with valid references, it turns them into links, but if there's no referent, it leaves it as text. I considered copying ReferenceInlineProcessor and writing a modified version that just recorded the missing referents, but that sounded messy.
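To make that behavior concrete, here is a toy stand-in for the resolution step. It is not Python Markdown's actual code: the regular expression is far simpler than the real one, and REF_LINK and resolve are names I made up for this sketch. It only illustrates the shape of the logic, where a match with a known referent becomes a link and anything else passes through as text.

```python
import re

# Hypothetical, simplified pattern for [text][ref]; the real library's
# pattern also handles nesting, titles, and the [ref] shorthand.
REF_LINK = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")


def resolve(text, references):
    """Replace [text][ref] with an HTML link when ref is defined."""
    def substitute(match):
        label = match.group(1)
        ref = (match.group(2) or label).lower()
        if ref in references:  # membership test decides link vs. text
            return f'<a href="{references[ref]}">{label}</a>'
        return match.group(0)  # no referent: leave the source text alone
    return REF_LINK.sub(substitute, text)


references = {"working": "https://example.com/"}
html = resolve("A [broken link][broken]. A [working link][working].",
               references)
print(html)
```

Nothing in the returned string records that [broken link][broken] was a failed lookup, which is why inspecting the output alone is not enough to lint references.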

Proxy object

Having looked at both of those classes, I saw that they communicate through the md.references object, and I added a debug print() to confirm it was just a normal Python dict. Looking at how it was used, I decided I could use the proxy pattern to replace it with my own type that would collect the information I needed. Rather than registering a new block or inline processor to gather that information, my extension's setup replaces md.references with a ReferenceProxy and keeps a handle on it.

Since ReferenceInlineProcessor uses the in operator to check if a reference is valid, the only method I had to override in ReferenceProxy was __contains__¹. It saves every reference that any link uses (whether it's valid or not) to the self.read set, and falls through to the actual dict implementation of __contains__ using super().
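Here is a small standalone sketch of that recording behavior, using only the standard library. RecordingDict is an illustrative name that mirrors ReferenceProxy from the script, and the keys are made up for the demonstration.

```python
class RecordingDict(dict):
    """A dict that remembers every key tested with `in`."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.read = set()

    def __contains__(self, key):
        self.read.add(key)  # record the lookup, hit or miss
        return super().__contains__(key)


refs = RecordingDict({"working": "https://example.com/"})
found = "working" in refs    # goes through __contains__
missing = "broken" in refs   # recorded even though the key is absent
refs.get("not-used")         # dict.get() bypasses __contains__
print(found, missing, sorted(refs.read))
```

The get() call shows the limit of this approach: only membership tests are recorded, so it works here because in is the operation the inline processor relies on.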

That set is then read in the postprocessor phase, which is definitely after all links have been processed. At that point, the set of references that have been read should equal the set of references that exist in the dictionary. Any that appear only in the former set are undefined, leading to literal text like "[broken link][broken]" in the output.² Any that appear only in the latter set are unused, which may indicate links I forgot to use in the post.
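The comparison itself is just two set differences. As a minimal sketch, here it is with the values from the example.md run above typed in by hand, rather than collected from a real Markdown pass:

```python
# Keys defined at the bottom of example.md.
defined = {"working", "not-used"}
# Keys the renderer looked up, as reported in the example output.
read = {"broken", "broken link", "working"}

undefined = read - defined  # looked up but never defined
unused = defined - read     # defined but never looked up

print(sorted(undefined))
print(sorted(unused))
```

If both differences are empty, the two sets are equal and the file has nothing to report.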

Matching Pelican's behavior

My first attempt surprisingly told me that a reference with a dash in it was unused even though I could see in the preview that the link worked. I printed the output of markdown.markdown() in my script and saw that in fact that link was not getting interpreted as a link. So, despite using the same Markdown engine as Pelican, I was somehow getting different results.

Obviously, I had to be calling it differently than Pelican was, so I looked at my Pelican settings, and in pelicanconf.py I found:

MARKDOWN = {
    'extension_configs': {
        'markdown.extensions.codehilite': {
            'css_class': 'highlight'
        },
        'markdown.extensions.extra': {},
        'markdown.extensions.meta': {},
        'markdown.extensions.toc': {
            'permalink': '#',
        },
    },
    'output_format': 'html5',
}

I don't know exactly what all of those extensions do, so I guessed which one might be relevant and added markdown.extensions.extra to the extensions list in my script, and that fixed the issue.

Making a nice script

I wanted to be able to easily run this on many blog posts at once, and I expected most to produce no output. So the script prints nothing when a file's references are fine, and otherwise prints the filename first so I know which file the issues are in.

Since markdown.markdown() takes a string, Path's read_text() provides a concise way to read a file into a string. There is also a markdown.markdownFromFile() which can take a filename, but it outputs to standard output by default, so, for this usage, it needs to be told explicitly to output to /dev/null:

markdownFromFile(input=filename, output='/dev/null',
                 extensions=[
                    ReferenceLintExtension(filename),
                    'markdown.extensions.extra'])

¹ This StackOverflow answer lists all of the methods you might consider overriding when subclassing dict.
² To keep that example from triggering the script itself, I wrote it in Markdown as \[broken link]\[broken].

A Weird Imagination

Linting Markdown reference-style links