A Weird Imagination

Speeding up Pelican's regenerate


Pelican's default Makefile includes an option make regenerate which uses Pelican's -r/--autoreload option to regenerate the site whenever a file is modified. Combined with the Firefox extension Auto Reload, this makes it easy to keep an eye on how a blog post will be rendered as you author it and to quickly preview theme changes.

The problem

With just thirty articles, Pelican already takes several seconds to regenerate the site. For publishing a site, this is plenty fast, but for tweaking formatting in a blog post or theme, this is too slow.

The quick solution

Pelican has an option, --write-selected, which makes it only write out the files listed. Writing just one file takes about half a second on my computer, even though it still has to do some processing for all of the files in order to determine what to write. To use --write-selected, you have to determine the output filename of the article you are editing:

$ pelican -r content -o output -s pelicanconf.py \
    --relative-urls \
    --write-selected output/draft/in-progress-article.html

The right solution

Optimally, we wouldn't have to tell Pelican which file to output; instead, it would figure out which files could be affected by a change and regenerate only those files.

Once you consider things like {filename} links and both adding and removing tags, it becomes clear that figuring out dependencies is actually non-trivial, and it would be easy to miss some detail.

A workaround would be to handle dependencies implicitly by recording all of the information that would be used to write each file. At the step where --write-selected applies, much of the processing has already been done, including processing {filename} links and organizing articles into tags. The only step remaining is using the template to build the HTML file. If that information changes, then the output file will change, so it should be regenerated. That's an oversimplification, because there are probably properties that the template doesn't read, but figuring out exactly which properties the template does read could be difficult.

As a further simplification, instead of remembering the full data for every output file, note that all we care about is whether that data changed, not what it is, so it's enough to store just a hash value. To avoid collisions, the hash should be a cryptographic hash of some repeatable value, like the string representation of the dictionary with its keys in sorted order.
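A minimal sketch of that hashing idea, assuming the data handed to the template can be treated as a plain dictionary of values with stable string representations. The names context_fingerprint, needs_rewrite, and seen are made up for illustration and are not part of Pelican:

```python
import hashlib


def context_fingerprint(context):
    """Return a SHA-256 hex digest of a dict, independent of key order."""
    # Sort the (key, value) pairs so that two dicts with equal contents
    # always produce the same string, then hash that string.
    canonical = repr(sorted((str(k), repr(v)) for k, v in context.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_rewrite(path, context, seen):
    """Report whether `context` changed since the last call for `path`.

    `seen` is a hypothetical cache mapping output paths to the
    fingerprint recorded last time; it is updated in place.
    """
    fingerprint = context_fingerprint(context)
    changed = seen.get(path) != fingerprint
    seen[path] = fingerprint
    return changed
```

Only the top-level keys are sorted here. Nested dictionaries, or objects whose repr includes a memory address, would need their own canonical form before the hash could be considered repeatable.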

I will look into implementing this at some point, but for the time being, I found a simpler, albeit incomplete, solution.

The hackish solution

I realised that I don't actually need a complete solution: most of the time when I care about regenerate being fast, I am writing a new blog post marked as a draft. Furthermore, I rarely have many drafts at the same time, so instead of figuring out exactly which files to generate, it suffices to regenerate all drafts. So I added an option --write-only-drafts to do so.

The --write-only-drafts flag is like the --write-selected flag except instead of taking an argument, it selects all drafts.
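The relationship between the two flags can be illustrated with a small standalone sketch. This is not Pelican's code or the actual patch; the Page class and the should_write function are hypothetical, and the only assumption carried over from Pelican is that a draft is identified by a status of "draft":

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a generated page; not a Pelican class."""
    output_path: str
    status: str  # assumed to be "published" or "draft"


def should_write(page, selected=None, only_drafts=False):
    """Decide whether a page's output file should be written."""
    if only_drafts:
        # Like --write-only-drafts: no argument, every draft is selected.
        return page.status == "draft"
    if selected is not None:
        # Like --write-selected: only the explicitly listed output paths.
        return page.output_path in selected
    # Neither option given: write everything.
    return True
```

With a draft at output/draft/in-progress-article.html and a published post elsewhere, should_write(page, only_drafts=True) accepts only the draft, without its path having to be spelled out.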
