Summary

In my last article, I talked about how I started to focus on resolving the higher priority issues from the project’s issues list, and about the problems I faced doing that. In this article, I talk about my own requirements for a front-matter processor and how I added it to the project.

Introduction

With a sharper focus, I took a look at the issues list for the project and made a firm commitment to myself: to work through and complete as many of the Priority 1 issues as possible, as quickly as possible. But in terms of importance, one of those issues stood apart from the rest:

- optional YAML/simple front loader

Back at the start of the project, I sat down and figured out what the real requirements for the project were, as detailed here. However, while it is not explicitly stated in that list of requirements, the overarching requirement is stated in the last paragraph of that section:

From my point of view, these requirements help me visualize a project that will help me maintain my website by ensuring that any articles that I write conform to a simple set of rules.

The big push for me to do this was always to help me maintain my own website. Since that website is written as a collection of Markdown documents, I did not think that I needed to explicitly state that this project must work on my own website’s Markdown documents. It was just expected. And to do that, I needed to be able to handle the metadata in the front-matter of my articles.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commit for 04 Mar 2021.

Markdown Front-Matter

With respect to Markdown, if all someone has ever done is to use it to format a readme.md file or to add formatting to a control on a web form, the term front-matter might be a bit confusing. But for someone like me, who has a website authored in Markdown, it is a pivotal piece of the authoring process. As such, it needs to be done right.

Why Is It Needed?

When I write an article for my website, the body of the article is approximately 95 percent of the work. Coming up with a good format, making sure the article flows, and checking for grammar: these are the harder parts of writing articles on the weekly schedule that I follow. Those parts form the content of the article, and once I fine-tune that content, it is very important to me that it is presented properly. That is where metadata comes in.

To ensure that articles are rendered properly and placed in the right areas of my website, I need a method that allows me to communicate metadata about each article to my website generator. Different website generators structure their content in different ways, but every website generator that I have looked at requires some form of additional metadata within each article to be published. After all, only so much information can be obtained from a directory structure or the article itself. For most website generators, that extra information is contained within something referred to as front-matter.

What Is Markdown Front-Matter?

Having experimented with a handful of static site generators, I chose the Pelican website generator for my own personal site. While I admit that I could have gone with something different, I do not regret any part of my decision to use Pelican as my site generator. It fits all my needs perfectly, and my simple tooling around it allows me to focus on the publishing of the articles, not the site generator.

I believe that one of the reasons that I work well with Pelican is that Pelican treats article metadata as a first-class object, not as an afterthought. This is evident in the very first example of creating an article on the Pelican Quick Start page:

Title: My First Review
Date: 2010-12-03 10:20
Category: Review

Following is a review of my favorite mechanical keyboard.

That example is then followed by a more complete one in the next section on Writing Content, under “File Metadata”:

Title: My super title
Date: 2010-12-03 10:20
Modified: 2010-12-05 19:30
Category: Python
Tags: pelican, publishing
Slug: my-super-post
Authors: Alexis Metaireau, Conan Doyle
Summary: Short version for index and feeds

This is the content of my super blog post.

These are both great examples of articles containing front-matter metadata. In its simplest form, front-matter is extra metadata that is attached to the beginning of an article, to be consumed by the website generator and not presented to the reader as content. While the effects of one or more of the data fields may influence how the content of the rest of the article is presented, that information is not explicitly part of the article itself.

A good example of that is the Title field in the above examples. If I were to publish that sample article on my website, the content of the article would be a single paragraph:

<p>This is the content of my super blog post.</p>

However, at the top of the page, the title itself would be presented as:

<h1>
    My super title
</h1>

Similarly, different parts of Pelican may use other fields to affect how or where the article is rendered. When was the article written? Is this article part of a series of articles? What tags are associated with the article? These are all specified in the front-matter of each article that I write.

That is why front-matter support is so important to me. This website is how I communicate with others, and I want to get it right.

What Are The Standards?

The first thing I need to cover is why any definition or standard around front-matter is not present in the GFM specification. The GFM specification itself is specifically focused on the common parts of Markdown and how it is rendered into HTML. Front-matter influences how the content is presented but is not directly part of the content. Therefore, it feels right that the GFM specification has no mention of front-matter.

Knowing this, I started looking for any kind of specification for front-matter weeks before I started to work on this feature. While information was hard to find, what I did find fell into three groups.

Group 1: Pelican Static Site Generator

The first group that defined front-matter was in the documentation for the Pelican Meta-Data plugin. That documentation was not easy to interpret, but it provided the following paraphrased rules:

  • keywords are composed of case-insensitive letters, numbers, underscores, and dashes, and must end with a colon
  • everything after the colon on that line, including no data at all, is acceptable
  • if a line is indented by 4 or more spaces, that line is added to the data for the previous keyword
  • the first blank line ends all metadata for the document
  • alternatively, fences may be used
    • the first line of the document must be ---
    • the metadata ends at the first blank line or the first line containing --- or ..., whichever comes first
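
To make those rules concrete, here is a small example of my own invention that exercises most of them: a keyword with data, a continuation line indented by 4 spaces, a keyword with no data at all, and a blank line ending the metadata:

    Title: My Example Article
    Summary: A summary that is just long enough
        to continue on an indented line
    Status:

    The blank line above ended the metadata, so this line is article content.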

I had been following these rules without knowing them for months, but it was nice to finally see them stated in a concrete way. At least it was a start.

Group 2: CommonMark Markdown Parser

The second group that I found was in the documentation for the CommonMark Markdown Parser. While the rules are not explicitly stated as such, a bit of digging through the regular expressions in the Java source code yielded the following:

  • normal lines start with 0 to 3 spaces, followed by 1 or more characters in a-z0-9._-
  • in certain cases where a “special” sequence has started (such as literals), a line that starts with 4 or more spaces may add to the current value
  • fences are required
    • the start fence must start with 3 or more - characters
    • the end fence must start with 3 or more - or . characters
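
To capture those rules in a form that I could reason about, I wrote them down as rough Python regular expressions. To be clear, these are my own reconstruction from reading the Java code, not the actual patterns from that project:

import re

# My rough reconstruction of the rules above; the actual regular
# expressions in the Java code may differ in detail.
START_FENCE = re.compile(r"^-{3,}\s*$")                        # 3 or more '-' characters
END_FENCE = re.compile(r"^(?:-{3,}|\.{3,})\s*$")               # 3 or more '-' or '.' characters
METADATA_LINE = re.compile(r"^ {0,3}([A-Za-z0-9._-]+):(.*)$")  # 0 to 3 leading spaces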

I really love documentation that presents you with a single example and almost nothing else. Had I not been able to read the Java code for the details, this one would not have made the list.

Group 3: Markdown IT Front Matter Extension

The final group that I found was in the documentation for the Markdown IT Front Matter Extension. By far, this was the most complete documentation. It is so complete and well documented that I did not have to search any source code or paraphrase any other documentation. These points are copied verbatim from that website:

  • Indicated by three or more dashes: —
  • Opening and closing fences must be the same number of dash characters
  • Opening fence must begin on the first line of the markdown string/file
  • Opening fence must not be indented
  • The example above uses YAML but YAML is not required (bring your own front matter parser)

As I mentioned before the list: complete and well-documented.

Finding The Best of All Worlds

Faced with those three groups of definitions, I needed to think about how I was going to pull them together in a way that made sense. For this initial release, my big reason for adding this feature was to enable front-matter support when linting my own website’s files. As such, the rules that apply to the Pelican Meta-Data plugin needed to carry a bit more weight than the other rules. Even so, as I want to be able to add support for other types of front-matter in the future, it made sense to use that only as a weighting, not as a final decision.

The first set of decisions that I needed to make was regarding the “fences” at the start and end of the metadata. Each of the sample implementations specifies that the metadata must start at the beginning of the article, so that was one decision neatly out of the way. With respect to the fences themselves, while the Pelican plugin provides for them only as an option, the other two implementations require them, so I decided to make them mandatory for now. But that left me to decide on the exact format of those fences.

The Pelican plugin implementation specifies exactly 3 - characters for the start fence and exactly 3 - or . characters for the end fence. The CommonMark implementation specifies 3 or more - characters for the start fence and 3 or more - or . characters for the end fence. Finally, the Markdown-It implementation specifies 3 or more - characters for both the start fence and the end fence, with the added caveat that the two fences must have the same number of characters. Each one of these was a little different, and I wanted to make a good first choice with respect to the fences. That took a bit of thought.

In the end, I decided that for now, having fences with 3 or more - characters was the best way to go. Since each of these implementations specifies, either explicitly or implicitly, that the fences must be at the start of a line, that decision was a simple one. Finally, as I like symmetry in my articles, mandating that the start fence and the end fence contain the same number and type of characters was the last decision I had to make with respect to the fences. As for the content of the front-matter, since my initial push was to support the Pelican Meta-Data plugin, using its rules for content was good enough for me.
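
As a quick sketch of those decisions, a fence check under my chosen rules might look something like the following; the function name and signature here are my own illustration, not the project’s actual code:

def is_valid_fence(line, start_fence=None):
    """Sketch: a fence is 3 or more '-' characters starting in the first
    column; an end fence must match the start fence exactly."""
    stripped = line.rstrip()
    # Every character must be '-', and there must be at least 3 of them.
    if len(stripped) < 3 or stripped != "-" * len(stripped):
        return None
    # An end fence must be identical to the remembered start fence.
    if start_fence is not None and stripped != start_fence:
        return None
    return stripped

Checking the end fence is then just a second call with the remembered start fence passed in.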

As I was coming up with these rules, I was cognizant of one big thing: I did not have to be correct, I just needed a good set of defaults. I was sure that at some point I was going to support all three formats, just not right away. Until I got to that point, I just needed to have something “good enough”.

Before We Go On

I pride myself on my honesty, so here goes: I messed up. This article is mainly about the setup around the addition of a new feature to the project, which is mostly complete in this commit. However, for some reason, I did not add the modules front_matter_markdown_token.py and test_markdown_front_matter.py until this commit, which is where the bulk of the work for parsing is located. So, while most of the work is contained within the first commit, I ask for your lenience, dear reader. Please pretend that those files were added in the first commit. Thank you.

Adding The Feature

Starting to add this feature, I was keenly aware that I needed to start thinking about how to implement new features in an adaptable manner. This was only the first extension, not the last one to come. To that end, I wanted to make sure that extending both the parser and the linter was possible. This was a good place to start!

Based on the actions that the front-matter processing needed to perform, I made the decision that it had to be encapsulated in a Leaf Block token. It was not a container, and it did not make sense to specify it as an Inline token. After a bit of thinking, I concluded that it was a special token that contains other information, though not other tokens. As such, while it was a bit of a reach, I figured that it made sense to express it as a Leaf Block token, just a specially scoped one.
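
To give a feel for what I mean by a specially scoped token, here is a simplified illustration. The boundary_line and collected_lines names mirror fields that appear later in the rehydration code, while matter_map is my own stand-in for the parsed values:

class SimplifiedFrontMatterToken:
    """Illustration only: a Leaf Block token that stores raw data about
    the front-matter instead of containing other tokens."""

    def __init__(self, boundary_line, collected_lines, matter_map):
        self.boundary_line = boundary_line      # fence text shared by the start and end fences
        self.collected_lines = collected_lines  # raw lines captured between the fences
        self.matter_map = matter_map            # keyword-to-value dictionary parsed from those lines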

Testing First

As is my process, while I might have added a bit of simple code here and there to test some little things out, this feature started with the first of the scenario tests added to the test_markdown_front_matter.py module. While this may appear to some as a backwards way to do things, I consistently return to the process because of one thing: it works well for me. Before starting the coding part of the feature, I have a test that provides a good set of requirements for what is needed to complete that one part.

For me, this is the best way to go. Write the test to specify your goal for the feature, and then work towards it.

Keeping a Single Point of Entrance

As I knew that the bulk of the code was going to be provided in an extension module, it was important to me that there be only one point of entry for the code to process front-matter.

    def __process_header_if_present(self, token_to_use, line_number, requeue):
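        # Delegate all front-matter work to the extension module so that this
        # function remains the parser's single point of entry for front-matter.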
        (
            token_to_use,
            line_number,
            requeue,
        ) = FrontMatterExtension.process_header_if_present(
            token_to_use,
            line_number,
            requeue,
            self.source_provider,
            self.tokenized_document,
        )
        return token_to_use, line_number, requeue

To make sure that it was called exactly once, I added the call to that function at the start of the __parse_blocks_pass function, right at the try statement:

    try:
        token_to_use, line_number, requeue = self.__process_header_if_present(
            token_to_use, line_number, requeue
        )
        did_start_close = token_to_use is None

Filling Out The Processing

For the most part, the processing was simple. If anyone is interested in the actual code that was added, look in this commit for process_header_if_present and follow along from there. I am going to focus on describing what was done rather than walking through each function line by line.

As I already had the is_thematic_break function, which was 95% of the way to where I needed it to be, I decided to retrofit that function instead of implementing a new one. The only three differences between a thematic break and the start fence were the characters used, the indentation allowed, and whitespace being allowed between thematic break characters. The characters difference was easily taken care of by adding:

    if start_char == "-":

after the call to the is_thematic_break function. The indentation was also easy to mitigate by simply not extracting any whitespace from the source string and passing in an empty string ("") for the extracted whitespace.1 The whitespace issue took a bit more finessing. To deal with that, I modified the is_thematic_break function by adding a switch called whitespace_allowed_between_characters. This allowed the calling function to specify whether whitespace was allowed between the characters.
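
To illustrate the shape of that change, here is a simplified sketch of the retrofitted check; the project’s real is_thematic_break function handles more cases than this:

def is_thematic_break(
    line, extracted_whitespace, whitespace_allowed_between_characters=True
):
    """Sketch: detect a run of a single repeated break character in a line
    that has already been stripped of its trailing newline."""
    if len(extracted_whitespace) >= 4:
        return None  # indented too far to be a thematic break
    if whitespace_allowed_between_characters:
        line = line.replace(" ", "").replace("\t", "")
    if len(line) >= 3 and line[0] in "*-_" and line == line[0] * len(line):
        return line[0]  # report which character formed the break
    return None

The front-matter code can then pass an empty string for extracted_whitespace, set the new switch to False, and check that the returned character is -.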

Once there was a valid start to the front-matter section, the __handle_document_front_matter function was called to handle the rest of the processing. This enabled me to keep the process_header_if_present function focused on the external processing of the front-matter section itself. When the __handle_document_front_matter function returned, there were only two possibilities. If a Front Matter token was created, it was added to the document. Otherwise, if for any reason it failed, every line that was used in determining that the Front Matter token could not be created was requeued for reprocessing. That encapsulation kept things nice and clean.

Implementing The Interesting Stuff

Once a Front Matter element start fence was detected, it was up to the __handle_document_front_matter function to collect and validate the rest of the front-matter. This meant collecting lines until a blank line or a closing fence was encountered. With my initial rule of having the start fence and the end fence contain the same number of characters, the code simply had to check whether it encountered the same fence twice: once at the start and once at the end.

When the end fence was encountered, the processing was passed off to the __is_front_matter_valid function to determine if the lines formed a valid front-matter section and, if they did, to create a dictionary of those values. The logic for this initial version of the Front Matter token was very simple. If a line started with 4 or more spaces, it was added to the value of the previous keyword. If not, it was checked to see if it started with a valid property name. Any errors along the way aborted the validation, with a good reason returned to the caller indicating why it was aborted.
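
A condensed sketch of that logic follows; the property-name pattern and the error messages here are my own stand-ins for what the real __is_front_matter_valid function reports:

import re

def is_front_matter_valid(collected_lines):
    """Sketch: build a dictionary from the collected lines, or return an
    error string explaining why validation was aborted."""
    keyword_pattern = re.compile(r"^([a-zA-Z0-9_-]+):(.*)$")  # assumed property-name form
    value_map, last_keyword = {}, None
    for line in collected_lines:
        if line.startswith("    "):  # 4 or more spaces: continue the previous value
            if last_keyword is None:
                return "Continuation line encountered before any keyword."
            value_map[last_keyword] += "\n" + line.strip()
        else:
            keyword_match = keyword_pattern.match(line)
            if not keyword_match:
                return "Line does not start with a valid property name."
            last_keyword = keyword_match.group(1).lower()
            value_map[last_keyword] = keyword_match.group(2).strip()
    return value_map if value_map else "No front-matter keywords were found."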

Finally, upon return to the __handle_document_front_matter function with a valid dictionary of values, the FrontMatterMarkdownToken instance was created. Following in the steps of other tokens, it was easy to add all the required fields and properties needed by the token.

Rounding Out The Feature

When I was adding the bulk of this feature, I commented out the parts of the scenario tests that validated the HTML support, the Markdown support, and the consistency checks. With all the scenario tests passing, I uncommented those parts and cleaned them up.

Adding HTML Support

Even though the Front Matter token does not generate any HTML output, I still needed to add the proper code to deal with adding nothing to the HTML output. To do this, I added the following code to the transform_to_gfm.py module:

from pymarkdown.extensions.front_matter_markdown_token import (
    FrontMatterExtension,
    FrontMatterMarkdownToken,
)
    ...
    self.register_handlers(
        FrontMatterMarkdownToken, FrontMatterExtension.handle_front_matter_token
    )

The handle_front_matter_token function was simple, as it had to do nothing:

def handle_front_matter_token(output_html, next_token, transform_state):
    _ = (next_token, transform_state)
    return output_html

Adding Markdown Rehydration Support

While this was not exactly like adding the HTML support, there were a lot of similarities. Instead of adding code to the transform_to_gfm.py module, code was added to the transform_to_markdown.py module:

    self.register_handlers(
        FrontMatterMarkdownToken, FrontMatterExtension.rehydrate_front_matter
    )

As PyMarkdown tokens capture all the information pertaining to how that token was created, it was easy to write the rehydrate function:

def rehydrate_front_matter(current_token, previous_token):
    _ = previous_token

    front_matter_parts = [current_token.boundary_line]
    front_matter_parts.extend(current_token.collected_lines)
    front_matter_parts.extend([current_token.boundary_line, ""])
    return ParserHelper.newline_character.join(front_matter_parts)

Adding Consistency Check Support

While I am sure that I will have to add more functions later, when adding this extension there were only two functions that I needed to change: __validate_block_token_height and __calc_initial_whitespace. Both the __validate_block_token_height function and the __validate_first_token function are long series of if statements that probably need refactoring. But the immediate need was to enable the Front Matter token to be handled properly by these functions.

In both cases, I used the is_extension property of the token to indicate that the token itself would calculate these values. For the __validate_block_token_height function, I added this code at the start of the if statement chain:

    if last_token.is_extension:
        token_height = last_token.calculate_block_token_height(last_token)

and for the __calc_initial_whitespace function, I added this code:

    if calc_token.is_extension:
        indent_level, had_tab = calc_token.calculate_initial_whitespace()

The bodies were easy to fill out because the token is so simple:

    @classmethod
    def calculate_block_token_height(cls, last_token):
        return 2 + len(last_token.collected_lines)

    @classmethod
    def calculate_initial_whitespace(cls):
        return 0, False

What Was My Experience So Far?

Let me get the good things out of the way first. It was easy to add the new Front Matter token, even in a raw form. While I realize that this token is a very easy token to add, I was able to add it cleanly and with only minimal issues. Hopefully, this means that adding future extensions will be somewhat easy, even if they are not as easy as this feature. In addition, the changes to the HTML generator, the Markdown generator, and the consistency checks were very easy to add. Once again, that bodes well for future extensions. But even so, I was cognizant that most of the ease of adding this feature came from the fact that this new token is very specialized and only occurs at the start of the document. So, I hope things will be good going forward, but I also realize that I got lucky in this case. For now, that is a good mindset for me to have.

While the bad thing is not a really big bad thing, it is still something that I mostly missed implementing in a real fashion: application configuration. I have little bits and pieces of it wired in as dictionaries containing hierarchical structures, but I do not have a good start-to-end story around application configuration that I can live with. And after a few searches for “Python application configuration”, it seems that there are no easy answers to this need. That means I will need to write something. A bit of a miss, but I can recover from it.

After taking a bit of a break when writing this article, I reread the above two paragraphs and realized something that put both paragraphs into a new perspective. While I was previously looking at the Front Matter token as present and configuration as “should have been done in the past”, I believe I had the time frames wrong. With my new viewpoint, I realized that I am adding the Front Matter token because I want to use the project now, and both parts help me accomplish that goal. In essence, these are both concepts that are needed for a release. If I am now worrying about them, it means the release is near!

Sometimes, it just takes a bit of a break and a fresh point of view.

What Is Next?

As the Front Matter token is non-GFM compliant, I need to make sure that I have a decent way to enable it when needed. Next up, application configuration.


  1. At various points in development, a common issue was passing the wrong whitespace to functions like this one. It was neat using this “issue” as a positive for once.
