The problem
When writing blog posts, I like to use Markdown's reference-style links, which let you avoid writing URLs inline: you give each link a short name and define its URL elsewhere in the document. I always put the definitions at the end, so the bottom of the Markdown file looks like a bibliography for the post. That adds the extra task of keeping the references at the bottom consistent with how they are used in the post. This usually isn't a big problem, since I tend to add a link and immediately use it, checking the preview to make sure it works. But sometimes I start a post by entering a list of links I expect to use, and sometimes I miss something when checking the preview.
The solution
lint_refs.py takes any number of Markdown files as arguments and prints the references that are invalid or not used:
#!/usr/bin/env python
import sys
from pathlib import Path

from markdown import markdown, extensions, postprocessors


class ReferenceProxy(dict):
    """A dict that records every key checked with the `in` operator."""

    def __init__(self, *arg, **kw):
        super().__init__(*arg, **kw)
        self.read = set()

    def __contains__(self, key):
        self.read.add(key)
        return super().__contains__(key)


class ReferenceLintExtension(extensions.Extension,
                             postprocessors.Postprocessor):
    def __init__(self, filename):
        self.filename = filename

    def extendMarkdown(self, md):
        # Swap in the proxy so every reference lookup gets recorded.
        self.refs = md.references = \
            ReferenceProxy(**md.references)
        md.postprocessors.register(self, 'ref_lint', 1)

    def run(self, text):
        # Postprocessors run after all links have been processed.
        undefined = self.refs.read - set(self.refs.keys())
        unused = set(self.refs.keys()) - self.refs.read
        if unused or undefined:
            print(f"\n\n# {self.filename}")
        if undefined:
            print("\n## UNDEFINED REFERENCES")
            print('\n'.join(sorted(undefined)))
        if unused:
            print("\n## UNUSED REFERENCES")
            print('\n'.join(sorted(unused)))
        return text


for filename in sys.argv[1:]:
    markdown(Path(filename).read_text(), extensions=[
        ReferenceLintExtension(filename),
        'markdown.extensions.extra'])
Given example.md:
A [broken link][broken]. A [working link][working].
[working]: https://example.com/
[not-used]: https://example.org/
$ ./lint_refs.py example.md
# example.md
## UNDEFINED REFERENCES
broken
broken link
## UNUSED REFERENCES
not-used
The details
What are they called?
The first challenge was figuring out the term "reference-style link", since I didn't know what they were called. I had to do a few web searches for "Markdown" and "links at bottom of file" before I happened upon that term. I could also have looked at the official documentation or at the Python Markdown library source code, which refers to the concept as "references".
Python Markdown extensions
Python Markdown has an extensions API with lots of different places you can hook into its processing, including multiple internal representations of the document at different stages. I found the ReferenceProcessor at the "Block Processors" stage and ReferenceInlineProcessor at the "Inline Processors" stage, so I figured I wanted to run my code between them.
Unfortunately, the model does not include a concept of a parsed link with an unresolved reference: ReferenceInlineProcessor uses a regular expression to find possible references, and for matches with valid references, it turns them into links, but if there's no referent, it leaves it as text. I considered copying ReferenceInlineProcessor and writing a modified version that just recorded the missing referents, but that sounded messy.
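To make that behavior concrete, here is a deliberately simplified sketch using only the standard library. It is not Python Markdown's actual implementation: the REF_LINK pattern and the resolve() helper are hypothetical stand-ins that only mimic the "link if the referent exists, otherwise leave the text alone" rule described above.

```python
import re

# Hypothetical, simplified pattern for [text][id]; the real library's
# regular expression handles many more cases than this.
REF_LINK = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def resolve(text, references):
    """Turn [text][id] into a link only when `id` has a definition."""
    def replace(match):
        label, ref_id = match.group(1), match.group(2)
        if ref_id in references:
            return f'<a href="{references[ref_id]}">{label}</a>'
        # No referent: the source text passes through unchanged, so
        # nothing downstream can tell a link was ever intended.
        return match.group(0)
    return REF_LINK.sub(replace, text)


references = {"working": "https://example.com/"}
html = resolve("A [broken link][broken]. A [working link][working].",
               references)
print(html)
```

The broken reference comes out as plain text, so the only place left to notice it is the lookup into the references mapping itself.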
Proxy object
Having looked at both of those classes, I saw that they communicated using the md.references object and I added a debug print() to confirm it was just a normal Python dict. Looking at how it was used, I decided I could use the proxy pattern to replace it with my own type that would collect the information I needed. Instead of using any of the .register() methods, I had my extension set up by replacing md.references with a ReferenceProxy that it saved.
Since ReferenceInlineProcessor uses the in operator to check if a reference is valid, the only method I had to override in ReferenceProxy was __contains__ [1]. It saves every reference that any link uses (whether it's valid or not) to the self.read set, and falls through to the actual dict implementation of __contains__ using super().
Then that set is read in the postprocess phase (which is definitely after all links have been processed). At that point, the set of references that have been read should ideally be equal to the set of references that exist in the dictionary. Any reference only in the former set is undefined, leading to literal text like "[broken link][broken]" in the output [2]. Any reference only in the latter set is unused, which may indicate a link I forgot to make use of in the post.
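The bookkeeping can be tried in isolation, without Python Markdown. This is a minimal sketch: RecordingDict is a hypothetical name standing in for ReferenceProxy, and the loop simulates the lookups a link processor would perform.

```python
class RecordingDict(dict):
    """dict that remembers every key tested with `in` (hypothetical name)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.read = set()

    def __contains__(self, key):
        self.read.add(key)
        return super().__contains__(key)


# Definitions, as they would be collected from the bottom of a post.
refs = RecordingDict()
refs["working"] = "https://example.com/"
refs["not-used"] = "https://example.org/"

# Simulated lookups, one per reference-style link found in the text.
for ref_id in ("broken", "working"):
    found = ref_id in refs

# set(refs) iterates over the keys and does not call __contains__,
# so computing the differences does not pollute refs.read.
undefined = refs.read - set(refs)
unused = set(refs) - refs.read
print(sorted(undefined))
print(sorted(unused))
```

Only checks made with the in operator are recorded; indexing and dict.get() bypass __contains__, which is fine here because the in check is the lookup that matters.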
Matching Pelican's behavior
My first attempt surprisingly told me that a reference with a dash in it was unused even though I could see in the preview that the link worked. I printed the output of markdown.markdown() in my script and saw that in fact that link was not getting interpreted as a link. So, despite using the same Markdown engine as Pelican, I was somehow getting different results.
Obviously, I had to be calling it differently than Pelican was, so I looked at my Pelican settings, and in pelicanconf.py I found
MARKDOWN = {
    'extension_configs': {
        'markdown.extensions.codehilite': {
            'css_class': 'highlight'
        },
        'markdown.extensions.extra': {},
        'markdown.extensions.meta': {},
        'markdown.extensions.toc': {
            'permalink': '#',
        },
    },
    'output_format': 'html5',
}
I don't know exactly what all of those extensions do, so I guessed which one might be relevant and added markdown.extensions.extra to the extensions list in my script, and that fixed the issue.
Making a nice script
I wanted to be able to easily run this on many blog posts at once, and I expected most of them to produce no output. So the script prints nothing when a file's references are fine, and otherwise prints the filename first so I know which file the issues were found in.
Since markdown.markdown() takes a string, Path's read_text() provides a concise way to read a file into a string. There is also a markdown.markdownFromFile() which can take a filename, but it outputs to standard output by default, so, for this usage, it needs to be told explicitly to output to /dev/null:
markdownFromFile(input=filename, output='/dev/null',
                 extensions=[
                     ReferenceLintExtension(filename),
                     'markdown.extensions.extra'])
1. This StackOverflow answer lists all of the methods you might consider overriding when subclassing dict.
2. In order to make that not actually trigger the script, I wrote that in Markdown as \[broken link]\[broken].