A Weird Imagination

Linting Markdown reference-style links

Posted Sun 30 March 2025 in Blogging

markdown pelican proxy object python reference-style links troubleshooting

The problem

When writing blog posts, I like to use Markdown's reference-style links, which let you avoid writing URLs inline: you give each link a short name and define its URL elsewhere in the document. I always put the definitions at the end, so the bottom of the Markdown file looks like a bibliography for the post. That adds the extra task of keeping the references at the bottom consistent with their usage in the post. This usually isn't a huge problem, as I tend to add a link and immediately use it, checking the preview to make sure it worked. But sometimes I start a post by entering a list of links I expect to use, and sometimes I miss something when checking the preview.

The solution

lint_refs.py takes any number of Markdown files as arguments and prints the references that are invalid or not used:

#!/usr/bin/env python

import sys
from pathlib import Path
from markdown import markdown, extensions, postprocessors


class ReferenceProxy(dict):
    """Dict that records every key checked with `in`."""
    def __init__(self, *arg, **kw):
        super().__init__(*arg, **kw)
        self.read = set()

    def __contains__(self, key):
        # The Markdown parser tests membership before resolving a link.
        self.read.add(key)
        return super().__contains__(key)


class ReferenceLintExtension(extensions.Extension,
                             postprocessors.Postprocessor):
    def __init__(self, filename):
        self.filename = filename

    def extendMarkdown(self, md):
        # Swap in the proxy so every reference lookup is recorded.
        self.refs = md.references = \
                ReferenceProxy(**md.references)
        md.postprocessors.register(self, 'ref_lint', 1)

    def run(self, text):
        # Postprocessors run after parsing, so all lookups have happened.
        undefined = self.refs.read - set(self.refs.keys())
        unused = set(self.refs.keys()) - self.refs.read
        if unused or undefined:
            print(f"\n\n# {self.filename}")
            if undefined:
                print("\n## UNDEFINED REFERENCES")
                print('\n'.join(sorted(undefined)))
            if unused:
                print("\n## UNUSED REFERENCES")
                print('\n'.join(sorted(unused)))
        return text


for filename in sys.argv[1:]:
    markdown(Path(filename).read_text(), extensions=[
        ReferenceLintExtension(filename),
        'markdown.extensions.extra'])

Given example.md:

A [broken link][broken]. A [working link][working].

[working]: https://example.com/
[not-used]: https://example.org/

$ ./lint_refs.py example.md


# example.md

## UNDEFINED REFERENCES
broken
broken link

## UNUSED REFERENCES
not-used

The details

Finding broken {filename} links

Posted Sun 21 July 2024 in Blogging

bash find grep if one-liner pelican sed sh test

The problem

I've recently been writing more series of blog posts and otherwise linking between posts using {filename} links. I've also been adjusting the schedule of my planned future blog posts, which involves renaming files, as my naming scheme includes the publication date in the filename. That creates opportunities to forget to adjust the links to match, leaving broken links between posts.

Pelican does generate warnings like

WARNING  Unable to find './invalid.md', skipping url        log.py:89
         replacement.

but currently building my entire blog takes about a minute, so I generally only do it when publishing. So I wanted a more lightweight way to just check the intra-blog {filename} links.

The solution

I wrote the script check_filename_links.sh:

#!/bin/bash

content="${1:-.}"

find "$content" -iname '*.md' -type f -print0 | 
  while IFS= read -r -d '' filename
  do
    grep '^\[.*]: {filename}' "$filename" |
      sed 's/^[^ ]* {filename}\([^\#]*\)\#\?.*$/\1/' |
      while read -r link
      do
        if [ "${link:0:1}" != "/" ]
        then
          linkedfile="$(dirname "$filename")/$link"
        else
          linkedfile="$content$link"
        fi
        if [ ! -f "$linkedfile" ]
        then
          echo "filename=$filename, link=$link,"\
               "file does not exist: $linkedfile"
        fi
      done
  done

Run it from your content/ directory or provide the path to the content/ directory as an argument and it will print out the broken links:

filename=./foo/bar.md, link=./invalid.md, file does not exist: ./foo/./invalid.md
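
The path resolution rule in the inner loop (links starting with `/` are relative to the content root, everything else is relative to the linking file's directory) can also be sketched in Python. This illustrates the same rule and is not a replacement for the script; `resolve_link` and `broken_links` are hypothetical names, and unlike the `sed` expression this regular expression also stops at whitespace.

```python
import re
from pathlib import Path

# Matches reference definitions such as "[name]: {filename}./other.md#frag".
# The captured group stops at "#" or whitespace.
LINK_RE = re.compile(r"^\[.*\]: \{filename\}([^#\s]*)", re.MULTILINE)


def resolve_link(content_root, source_file, link):
    """Map a {filename} link to the path it should point at."""
    if link.startswith("/"):
        # Absolute links are relative to the content directory.
        return Path(content_root) / link.lstrip("/")
    # Other links are relative to the file containing them.
    return Path(source_file).parent / link


def broken_links(content_root):
    """Yield (source_file, link, target) for links whose target is missing."""
    for source in Path(content_root).rglob("*.md"):
        for link in LINK_RE.findall(source.read_text()):
            target = resolve_link(content_root, source, link)
            if not target.is_file():
                yield source, link, target
```

Calling `list(broken_links("content"))` would then give the same information as the shell script's output lines.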

The details

Relative links in feeds

Posted Sun 25 February 2024 in Blogging

atom beautiful soup blogging fragment links monkey patch pelican pelican plugin python rss troubleshooting

The problem

In an RSS/Atom feed, relative links are a bad idea because it's unclear what they're relative to. There are ways to specify a base for them to be relative to, but since feed readers do not consistently respect those mechanisms, it's safer to just always use absolute URLs in feeds. And Pelican recommends setting RELATIVE_URLS = False to always generate absolute URLs. But that setting does not apply to the anchor links generated by the Markdown toc extension to link to headers.

The solution

I wrote a Pelican plugin, absolute_anchors, which rewrites all link destinations starting with # in every article to add the absolute URL of the article at the beginning of the link.
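
The rewrite itself is a small transformation. A rough sketch of the idea using only the standard library is shown here; the actual plugin may well work differently (for example, with an HTML parser rather than a regular expression). `make_anchors_absolute` is a hypothetical name and the article URL is a placeholder.

```python
import re

# href attributes whose double-quoted value starts with "#".
FRAGMENT_HREF = re.compile(r'href="(#[^"]*)"')


def make_anchors_absolute(html, article_url):
    """Prefix every fragment-only href in `html` with `article_url`."""
    return FRAGMENT_HREF.sub(
        lambda m: f'href="{article_url}{m.group(1)}"', html)


html = '<a href="#the-problem">#</a> <a href="https://example.com/">x</a>'
print(make_anchors_absolute(html, "https://blog.example/2024/02/25/post/"))
```

Links that already have a full URL are left untouched, since only values beginning with `#` match.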

The details

Pelican publish without downtime

Posted Sun 21 June 2020 in Blogging

blogging chmod linux pelican sh

The problem

My existing script for publishing my blog has Pelican run on the web server and generate the static site directly into the directory served by nginx. This has the effect that while the blog is being published, it is inaccessible or some of the pages or styles are missing. The publish takes well under a minute, so this isn't a big issue, but there's no reason for any downtime at all.

The solution

Instead of serving the output/ directory directly, generate it and then copy it over by changing the make publish line in schedule_publish.sh to the following:

make publish || exit 1
if [ -L output_dir ]
then
    cp -r output output_dir/
    rm -rf output_dir/html.old
    mv output_dir/html output_dir/html.old
    mv output_dir/output output_dir/html
fi

where output_dir/ is a symbolic link to the parent of the directory actually being served and html/ is the directory actually being served (which output/ previously was a symbolic link to).

The details

Timezones and scheduling tasks with at

Posted Sun 30 December 2018 in Blogging

at blogging date linux mail pelican sh timezones troubleshooting

The problem

My system for automatically posting future-dated blog posts mysteriously stopped working recently. The posts would appear if I manually published the blog, but not with the automatic scheduling mechanism.

The solution

In schedule_publish.sh, I changed the line

echo "$0" | at -q g $time

to

if [ "$(date -d "$time PST" +'%s')" -ge "$now" ]
then
    echo "$0" | at -q g -t "$(date +'%Y%m%d%H%M' -d "$time PST")"
fi

(where "PST" is the timezone of this blog; adjust as appropriate for your blog). $now is initialized with

now="$(date +'%s')"

before the call to make publish to avoid a race condition.
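
The two `date` invocations do two things: interpret the scheduled time in the blog's timezone, and re-express it in the server's local time in the `YYYYMMDDhhmm` form passed to `at -t`. A rough Python equivalent is sketched here, assuming a fixed UTC-8 offset for "PST" and a `HH:MM YYYY-MM-DD` input format. All function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")  # assumed blog timezone


def parse_blog_time(text, blog_tz=PST):
    """Parse 'HH:MM YYYY-MM-DD' as a time in the blog's timezone."""
    return datetime.strptime(text, "%H:%M %Y-%m-%d").replace(tzinfo=blog_tz)


def at_timestamp(when, server_tz):
    """Format an aware datetime for `at -t` in the server's timezone."""
    return when.astimezone(server_tz).strftime("%Y%m%d%H%M")


def should_schedule(when, now):
    """Mirror the shell test: only schedule times that are not in the past."""
    return when.timestamp() >= now.timestamp()


post = parse_blog_time("09:00 2018-12-30")
print(at_timestamp(post, timezone.utc))  # 09:00 PST is 17:00 UTC
```

Capturing `now` once, before the slow publish step, matches the shell version: a post whose time passes during the build was not included in that build, so it still needs to be scheduled.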

The details

Speeding up Pelican's regenerate

Posted Sat 07 March 2015 in Blogging

compile on save hash make pelican python

Pelican's default Makefile includes an option make regenerate which uses Pelican's -r/--autoreload option to regenerate the site whenever a file is modified. Combined with the Firefox extension Auto Reload, this makes it easy to keep an eye on how a blog post will be rendered as you author it and to quickly preview theme changes.

The problem

With just thirty articles, Pelican already takes several seconds to regenerate the site. For publishing a site, this is plenty fast, but for tweaking formatting in a blog post or theme, this is too slow.

The quick solution

Pelican has an option, --write-selected, which makes it only write out the files listed. Writing just one file takes about half a second on my computer, even though it still has to do some processing for all of the files in order to determine what to write. To use --write-selected, you have to determine the output filename of the article you are editing:

$ pelican -r content -o output -s pelicanconf.py \
    --relative-urls \
    --write-selected output/draft/in-progress-article.html

The right solution

Optimally, we wouldn't have to tell Pelican which file to output; instead, it would figure out which files could be affected by a change and regenerate only those files.

Changing Pelican URL scheme

Posted Tue 24 February 2015 in Blogging

eval grep one-off pelican sed sh

The problem

I changed the URI scheme of this blog recently from /posts/YYYY/MM/slug/ to /YYYY/MM/DD/slug/. The latter looks better and makes the actual day of the post more visible.

But I already had posts using the old scheme and cool URIs don't change. Luckily, someone wrote a Pelican plugin called pelican-alias which allows articles to be tagged with additional URIs to redirect to their canonical location. All I had to do was add an Alias: /posts/2015/02/... line to the top of each of the posts I had already written and the plugin would take care of the rest.

Automating the aliasing

The non-trivial part of automating this is that the URIs include the article's slug, which may have been generated by Pelican from the title, so Pelican has to be involved in generating the correct redirects.

There are two ways I could have automated this process:

Modify the plugin to add a redirect from the old scheme to the new scheme for every article. Unless somehow controlled, this would result in creating redirects for new articles which do not need them.
Write a one-off script to get the slugs out of Pelican and write the Alias: lines into the blog posts.

I took the latter approach because it was simpler and involved no new code to maintain.
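
Once an article's slug and date are known, the old-scheme URI follows mechanically. A small sketch of that mapping is shown here, assuming the two URI schemes described above. `old_uri`, `new_uri`, `alias_line`, and the example slug are hypothetical, and how the slug is obtained from Pelican is not shown.

```python
from datetime import date


def old_uri(published, slug):
    """URI under the old /posts/YYYY/MM/slug/ scheme."""
    return f"/posts/{published:%Y/%m}/{slug}/"


def new_uri(published, slug):
    """URI under the new /YYYY/MM/DD/slug/ scheme."""
    return f"/{published:%Y/%m/%d}/{slug}/"


def alias_line(published, slug):
    """Metadata line to add to a post so the old URI redirects."""
    return f"Alias: {old_uri(published, slug)}"


print(alias_line(date(2015, 2, 7), "example-slug"))
```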

My first Pelican plugin

Posted Sat 07 February 2015 in Blogging

beautiful soup blogging debugging footnotes monkey patch pelican pelican plugin python

The problem

My previous blog post has a footnote in the first sentence. Due to the way footnotes are handled, the footnote reference is a link to #fn:prg, which works fine if the footnote is actually on the page, but on the blog main page (or any other listing of multiple articles) the footnote is not present because it's after the Read more… link. The result is that on those pages, all footnote references are broken links. These broken links should either be repaired such that they point to the article page or removed.

First attempt

Unable to find an existing solution, I decided to write my own plugin, summary_footnotes. I started by finding another plugin that modifies the summary, clean_summary, and based my code on it. That plugin uses Beautiful Soup to parse the summary and rewrite it. After a quick look at the docs, I was able to figure out how to select the footnote links and rewrite them, which got me this version of the plugin.

Future-dating static blog content

Posted Mon 02 February 2015 in Blogging

at blogging git git hooks pelican sh

The problem

Static site generators are great. But so are blog posts that automatically appear on schedule. How do we reconcile the two? There are solutions involving checking for updates on a schedule like every hour or every day, but that seems unsatisfying: if the posts have already been written, the blog should only need to be regenerated exactly when there is new content to publish.

The solution

(These instructions are specifically for Pelican, as that is what this blog uses; a similar method should work for other static blogging engines.)

Use Pelican's WITH_FUTURE_DATES setting to make future-dated posts not appear as part of the blog, but only as drafts. Add the following to the article template in order to include the future publication dates in an easy-to-parse format:

{% if article.status == "draft" %}
    <!-- Post at datetime {{ article.date|strftime("%H:%M %Y-%m-%d") }} -->
{% endif %}

Then the following script schedule_publish.sh uses those comments to schedule rerunning itself using at:

#!/bin/sh

# Pelican publish
make publish

# Clear old queue entries if they call this script.
for q in $(atq -q g | cut -f1)
do
    # Quote the substitution so an empty or multi-word line cannot break the test.
    if [ "$(at -c "$q" | tail -2 | head -1)" = "$0" ]
    then
        atrm "$q"
    fi
done

# Check newly published drafts for when they should be published.
# Not using for because output lines have spaces.
grep -F -- '<!-- Post at datetime ' output/drafts/* | cut -d' ' -f5-6 | while read time
do
    # Schedule running this script for that time.
    echo "$0" | at -q g $time
done

Last, follow the instructions in this blog post and run that script as the deployment task.
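
The `grep | cut` pipeline in the script pulls the two fields after `Post at datetime` out of each rendered draft. For readers who want that extraction spelled out, here is a Python sketch of the same parsing. It is not part of the script above, and `scheduled_times` is a hypothetical name.

```python
import re
from pathlib import Path

# Matches the comment emitted by the template snippet,
# e.g. "<!-- Post at datetime 09:00 2015-02-07 -->".
COMMENT_RE = re.compile(
    r"<!-- Post at datetime (\d{2}:\d{2} \d{4}-\d{2}-\d{2}) -->")


def scheduled_times(drafts_dir):
    """Return every 'HH:MM YYYY-MM-DD' string found in the rendered drafts."""
    times = []
    for page in sorted(Path(drafts_dir).iterdir()):
        if page.is_file():
            times.extend(COMMENT_RE.findall(page.read_text()))
    return times
```

Each returned string has the same `HH:MM YYYY-MM-DD` shape that the shell loop passes to `at` as `$time`.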

The details

Hello World!

Posted Sun 01 February 2015 in Blogging

about blogging

Welcome to my blog. I am Daniel Perelman; I am presently a computer science graduate student at the University of Washington.

Much of my time is spent writing programs and using various highly-configurable tools like Bash, Vim, and LaTeX. I, like many other users of these tools, find myself often performing web searches for help on how to use these tools. Thanks to StackExchange and myriad technical blogs, the answers I am looking for are often close at hand. But not always. Sometimes I end up piecing together a solution from many sources. This blog is a place for me to save others time by sharing that knowledge.