Markdown Linter - Rules - Headings - Part 1

Introduction

After setting up a good, easy process for writing new rules, it was now time to make some headway on the task of writing those new rules. While the title “Headings - Part 1” is not glamorous, it is a very apt description for the way that I planned out the development of this group of rules dealing with Markdown headings. In David Anson’s MarkDownLint, there are a total of 15 linter rules that specifically deal with headings. The first 3 of those rules were used as examples in the previous article, leaving 12 rules to implement. For reasons I will follow up on, I decided to leave the implementation of rules MD041 and MD043 for later, reducing the number of rules to a nice manageable 10.

Without any unnecessary embellishment, the article’s title simply denotes that I am going to talk about the implementation of the first half of those rules. Sure, I could come up with a name like “The 5 Most Important Rules for Headings!” or “5 Rules for Headings That You Cannot Live Without!”, but the truth is that they were simply the next group of rules. Nothing fancy, nothing misleading, just the plain truth.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 24 April 2020 and 26 April 2020.

What About Those Two Rules That Were Left Out?

Just to get this out of the way: because it was expedient.

I started to look at rule MD041 when I was implementing rule MD002, as documented in the last article. At that time, I looked at rule MD041 to determine if I could just include the work for MD041 with the work to write rule MD002. While the only difference between the two rules is the inclusion of “YAML front matter” into the rule, I felt that the difference was a large one at that stage of the project. I knew that I needed to add metadata support to the parser as a requirement for my use of the linter. But the overriding question I had to answer was: Was this the right time to add this new feature to the parser?

After a decent amount of back and forth on the subject, I decided against adding it at that moment. From my point of view, at that time it was better to focus on additional linter rules for the project rather than focus on additional features for the parser. While it would be fun to add more functionality to the parser to allow for MD041 to be implemented, I made a judgement call that it could wait a bit.

For rule MD043, my decision was based on a combination of the complexity of the feature and the usefulness of the feature to me. Without a deep dive into the rule, it looks like there will be a fair number of edge cases that I will have to consider for rule MD043, easily increasing its complexity. From the usefulness point of view, while I can see how it may be useful, this is not a feature that I see myself using a lot. While I took a different path to the judgement call, the result was the same: Rule MD043 could wait for later.

Why Is This Group of Rules Important?

While I didn’t think about it at the time, I needed this group of rules to prove two important things to myself: that I had chosen the right base framework to write linter rules with, and that I had chosen a good development process with which to write and test those rules. It was only after the development was completed and I was writing this article that I was able to recognize their importance in proving or disproving those two things to myself.

Rule MD018 - No space after hash on atx style heading

This section describes the initial implementation of PyMarkdown’s Rule MD018.

Why Does This Rule Make Sense?

This rule is all about catching a typo. This rule surmises that when someone types:

#Heading 1

they probably forgot to add a space and really meant to type this:

# Heading 1

While it is possible that the author intended to start a new line with the text #Heading, I believe that it is far more likely that a typo occurred.

Adding the Rule

The two important parts of adding this rule were recognizing that it can take place in a normal paragraph and that it can occur at the start of any line in that paragraph.

While I did consider the possibility of text like #Heading being captured as part of another block element, my experimentation with the CommonMark parser did not reveal any case that causes another block element to stop capturing on a line like # Heading. As such, by default the text ends up being captured within a paragraph block using a text token. As is shown in example 47, an Atx heading can interrupt other blocks, but more importantly, it can interrupt a paragraph block. With these two constraints understood, I was ready to start coding the rule.

Leveraging the first constraint allowed me to add code to the next_token function to consider any text token within a paragraph block eligible for further examination. From there, I applied the second constraint by enhancing the rule to look for an Atx heading-like pattern at the start of each line within those eligible text tokens. This was accomplished by breaking down each text token into separate lines and checking them against a Python regular expression. By carefully reading the GFM specification’s section on Atx headings, I was able to construct the regular expression ^\s{0,3}#{1,6}\S, which breaks down into the following components:

^ - start matching at the start of the line
\s{0,3} - look for between 0 and 3 whitespace characters
#{1,6} - look for between 1 and 6 # characters
\S - look for one non-whitespace character

I needed a variant of the regular expression that recognizes the proper GFM specification form of an Atx heading, modified to look for the specific case that I wanted to catch. In this instance, I was looking for heading-like text that is missing the space after the starting # characters. To satisfy this constraint, I ended the regular expression with the \S sequence, which matches any single non-whitespace character and therefore only succeeds when no whitespace follows the # characters. Using my favorite online Python regular expression tester, I was able to verify that I had the correct regular expression right away.
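As a minimal sketch of the check described above, the following applies the regular expression to each line of a paragraph’s text. The helper name find_missing_space_lines and the plain-string input are assumptions made for illustration; the actual rule works on the parser’s text tokens inside the next_token function.

```python
import re

# Pattern from the text: up to 3 leading whitespace characters, 1 to 6 '#'
# characters, then one non-whitespace character.
MISSING_SPACE_PATTERN = re.compile(r"^\s{0,3}#{1,6}\S")


def find_missing_space_lines(paragraph_text):
    """Return the 0-based indexes of lines that look like an Atx heading
    with no space after the opening '#' characters.

    Hypothetical helper: it assumes `paragraph_text` is text the parser has
    already classified as paragraph content, so a valid heading such as
    '## Heading' never reaches it. (Applied to arbitrary lines, the pattern
    would match '## Heading' too, because the second '#' satisfies \\S.)
    """
    return [
        index
        for index, line in enumerate(paragraph_text.split("\n"))
        if MISSING_SPACE_PATTERN.search(line)
    ]


print(find_missing_space_lines("Some text\n#Heading 1\n   ##Heading 2\nmore # text"))
```

In this sample, only the second and third lines are reported. The last line is ignored because the pattern is anchored to the start of the line.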

The rigorous testing for this rule involved more testing of cases where the rule should not fail than cases where it should fail. But when I boiled all those tests down into what was relevant, there were a couple of patterns that I just needed to test. In testing, this is referred to as equivalence class testing.

While sites like this one go on and on about what it is, it breaks down into one statement. Unless there is something special (usually around boundary conditions), entire groups of tests can be considered equivalent and be represented by a single test, if their relevant behavior is consistent across that entire group. Consider a simple calculator application that allows for the addition of 2 integers together. The testing of that application does not need to test every single integer added to every single integer. Just thinking about the work required for that is exhausting! Instead, the testing can be broken down into representative groups such as “any integer added to 0 equals that integer”, drastically reducing the number of tests required to cover a specific component.
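To make the calculator example concrete, here is a small sketch in which each equivalence class is represented by a single test input. The add function and the chosen classes are hypothetical, picked only to illustrate the idea.

```python
def add(left, right):
    """Stand-in for the calculator's addition of two integers."""
    return left + right


# One representative per equivalence class, rather than every pair of
# integers. Each entry is (description, left, right, expected).
REPRESENTATIVE_CASES = [
    ("any integer added to 0 equals that integer", 7, 0, 7),
    ("two positive integers", 2, 3, 5),
    ("two negative integers", -2, -3, -5),
    ("an integer added to its negation equals 0", 4, -4, 0),
]


def run_cases():
    """Return the descriptions of any classes whose representative fails."""
    return [
        description
        for description, left, right, expected in REPRESENTATIVE_CASES
        if add(left, right) != expected
    ]


print(run_cases())  # an empty list means every class passed
```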

Applying this concept to the test cases for this rule, I was able to reduce over 20 tests down to a more reasonable 8 tests. Some of these groups made sure that the rule does not fail if eligible text is in any of the text-containing leaf blocks other than the paragraph block, where it should fail. The rest of the groups were simple cases based on variations of the regular expression above, which is itself a reflection of the specification.

Confident that I had covered all the conditions for this rule, it was time to move on to the next one!

Rule MD019 - Multiple spaces after hash on atx style heading

This section describes the initial implementation of PyMarkdown’s Rule MD019.

Why Does This Rule Make Sense?

Like a fair number of the other rules that are defined, this rule is about consistency. While the opening paragraph for the GFM specification for Atx headings does allow for more than one space after the initial set of # characters, it also specifies that those leading spaces are stripped before the rest of the line is parsed as inline content. Essentially:

# Heading 1

and

#                                            Heading 1

are syntactically equivalent. As they are equivalent, it makes sense to use the simpler form of the heading to keep things consistent.

Adding the Rule

The code to evaluate this rule was very simple. When the content of the Atx heading is parsed into a text token, any spaces that are stripped from that token are kept as part of the text token in a separate part specifically reserved for stripped whitespace. To check for multiple spaces, I simply added code that checked to see if a text token existed within an Atx heading, and if so checked to see if that token contained more than one space in that stripped whitespace area of the token.
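The check can be sketched roughly as follows. The TextToken class and its stripped_whitespace field are hypothetical stand-ins for the parser’s real text token, which I am only assuming keeps the stripped spaces in a comparable way.

```python
from dataclasses import dataclass


@dataclass
class TextToken:
    """Hypothetical text token: the heading text plus the whitespace that
    was stripped from in front of it."""

    text: str
    stripped_whitespace: str


def has_multiple_spaces_after_hash(text_token, in_atx_heading):
    """Return True if a text token inside an Atx heading had more than one
    space stripped from in front of it."""
    return in_atx_heading and len(text_token.stripped_whitespace) > 1


print(has_multiple_spaces_after_hash(TextToken("Heading 1", " "), True))    # False
print(has_multiple_spaces_after_hash(TextToken("Heading 1", "   "), True))  # True
```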

As the code was simple, the tests themselves were simple. The examples from the MD019 rule description page were used verbatim as the only cases needed as that is how simple this rule is. At least it was that simple until I looked more closely at the next two rules.

Rules MD020 and MD021

This section describes the initial implementation of PyMarkdown’s Rule MD020 and Rule MD021.

These Rules Look Very Familiar

While preparing for the work detailed in the last article, I quickly read through the list of rules before starting that work. I believe that because I was focusing on those first 3 rules, I missed an interesting piece of information on rules MD020 and MD021. The piece of information is that rules MD020 and MD021 are variations of the rules for MD018 and MD019 that apply specifically to Atx headings with the Atx_Closed style.

That information was both good news and bad news. The good news part was that since I had the code and tests already written for the previous two rules, I could repurpose a lot of that code for these new rules. The bad news part was that I was going to have to rewrite parts of those two previous rules to exclude cases where the latter two rules will fail. However, I believe that the good news here definitely outweighs the bad news, so on to the coding!

Adjusting the Old Rules

The first part of adding these new rules was modifying the old rules to not fail if an Atx_Closed style of Atx heading is found. Rule MD019 was the easiest one to adjust, requiring only a small change. As part of the normal Atx heading parsing, the number of trailing # characters in the heading is collected in the Atx heading’s remove_trailing_count variable. The small change I made was to only fail rule MD019 if that variable was zero, an easy change in anyone’s viewpoint. The change for rule MD018 was only slightly more difficult, adding an additional check for the regular expression #\s*$¹ to the existing condition. In both cases, simple test cases were added to expressly test the rules to make sure they did not fail if presented with an Atx_Closed style heading.

Simple changes, and simple tests to verify those changes.
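A rough sketch of the two adjusted conditions might look like this. The function names and the plain-string and integer arguments are assumptions for illustration; only the two regular expressions and the remove_trailing_count name come from the description above.

```python
import re

MISSING_SPACE_PATTERN = re.compile(r"^\s{0,3}#{1,6}\S")
CLOSING_HASH_PATTERN = re.compile(r"#\s*$")


def md018_line_fails(line):
    """MD018 sketch: a heading-like paragraph line with no space after the
    opening '#', unless the line ends with '#' (left to the closed rule)."""
    return bool(MISSING_SPACE_PATTERN.search(line)) and not CLOSING_HASH_PATTERN.search(line)


def md019_heading_fails(stripped_whitespace, remove_trailing_count):
    """MD019 sketch: more than one space after the opening '#', and no
    trailing '#' characters recorded for the heading."""
    return len(stripped_whitespace) > 1 and remove_trailing_count == 0


print(md018_line_fails("#Heading 1"))    # True
print(md018_line_fails("#Heading 1 #"))  # False
print(md019_heading_fails("  ", 0))      # True
print(md019_heading_fails("  ", 2))      # False
```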

Adding the New Rules

Adjusting rule MD019 to fit the parameters of rule MD021 was interesting, but not too difficult. The original check for spaces was modified to set the variable is_left_in_error instead of failing immediately. When the end token for the Atx heading token is seen, the rule checks to see if is_left_in_error was set or if the number of spaces before the closing # characters is greater than 1, failing if either part is true. Checking the end token was required as the parser’s pattern is that any leading spaces are applied to the token that follows those spaces, meaning that the spaces before the closing # characters are stored with the Atx heading’s end token.
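The deferred check can be sketched as below, assuming the end-token check looks at the spaces stored before the closing # characters. The (kind, value) pairs are a hypothetical stand-in for the parser’s token stream, and the sketch assumes a single text token per heading; only the is_left_in_error name comes from the description above.

```python
def count_md021_failures(tokens):
    """Count closed Atx headings with extra spaces inside either set of '#'.

    `tokens` is a list of (kind, value) pairs (hypothetical token model):
      ("atx_start", number of closing '#' characters)
      ("text", whitespace stripped from in front of the heading text)
      ("atx_end", whitespace found before the closing '#' characters)
    """
    failures = 0
    is_closed = False
    is_left_in_error = False
    for kind, value in tokens:
        if kind == "atx_start":
            is_closed = value > 0
            is_left_in_error = False
        elif kind == "text" and is_closed:
            # Remember the problem instead of failing immediately.
            if len(value) > 1:
                is_left_in_error = True
        elif kind == "atx_end" and is_closed:
            # Fail once per heading if either side has extra spaces.
            if is_left_in_error or len(value) > 1:
                failures += 1
            is_closed = False
    return failures


# "#  Heading #": extra space on the left side only.
print(count_md021_failures([("atx_start", 1), ("text", "  "), ("atx_end", " ")]))
# "# Heading #": correctly spaced on both sides.
print(count_md021_failures([("atx_start", 1), ("text", " "), ("atx_end", " ")]))
```

Deferring the failure to the end token means that a heading with problems on both sides is still reported only once.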

Making similar changes to rule MD020 proved to be a bit more difficult. The constraints from rule MD018 were still present, but rule MD020 added the constraint of looking for bad spacing before the closing set of # characters. Starting with the previous regular expression for finding normal Atx headings in paragraphs (^\s{0,3}#{1,6}\S) and merging it together with the regular expression for excluding Atx_Closed line failures from rule MD018 (#\s*$), I ended up with the expression ^\s{0,3}#{1,6}.*#+\s*$. To merge the two expressions together, I made two small changes. The \S expression used to indicate the matching of any character that is not whitespace was replaced with the .* expression to match zero or more instances of any character. The # character used to indicate a single # character was then replaced with the #+ expression to match 1 or more # characters, allowing for any number of closing # characters in the Atx heading.
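The merged expression can be tried out with a short script like the one below. The helper name looks_like_closed_heading and the sample lines are my own illustration. Note that the pattern only identifies candidate lines that start and end with # characters; any further spacing checks that the rule performs are not shown here.

```python
import re

# Merged pattern from the text: heading-like start, any content, then one or
# more closing '#' characters followed only by optional whitespace.
CLOSED_HEADING_PATTERN = re.compile(r"^\s{0,3}#{1,6}.*#+\s*$")


def looks_like_closed_heading(line):
    """Return True if a paragraph line starts and ends like a closed Atx
    heading (hypothetical helper name)."""
    return bool(CLOSED_HEADING_PATTERN.search(line))


SAMPLE_LINES = ["#Heading 1#", "#Heading 1 ##  ", "#Heading 1", "Text with a trailing #"]

for line in SAMPLE_LINES:
    print(repr(line), looks_like_closed_heading(line))
```

The first two samples match. The third has no closing # characters and the fourth does not start with a # character, so neither matches.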

Before adding any extra test cases, I used my favorite Python regular expression tester to do some extensive testing on the regular expression. I am happy to report that except for a simple typo, I got the regular expression right on the first try!

With the regular expression verification concluding successfully, I moved to add the test cases for both rules. For both rules, I started by adding copies of the test cases from the original rules but modified each one to represent missing spaces at the start of the Atx_Closed style headings instead of normal Atx style headings. From there, I added test cases where spaces were missing from both the start and the end of the Atx heading as well as where spaces were only missing from the end of the Atx_Closed style headings. It was there that I ran into an interesting case.

Is This Really A Failure?

While the original rules do not specifically identify this as a distinct case, consider the following Markdown:

# Heading#

While this obviously qualifies as an Atx heading, I can credibly argue that it should be included in MD020’s definition of a typo in an Atx_Closed styled heading. While I could have left it out, I felt strongly that this was most likely a typo and the rule should fail on this input as well.

Adding this into the rule for MD020 was easy. If the last token is a text token and the rule is currently looking at an end Atx heading token, the rule checks to see if that previous text token ends with a # character. If the # character is found, it fails the rule.
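As a minimal sketch of that extra check, with (kind, text) pairs standing in for the parser’s real tokens (a hypothetical model, not the project’s token classes):

```python
def fails_on_trailing_hash(previous_token, current_token):
    """Return True when an Atx heading's end token directly follows a text
    token whose text ends with '#', as in '# Heading#'.

    Both tokens are (kind, text) pairs, a hypothetical stand-in for the
    parser's tokens.
    """
    previous_kind, previous_text = previous_token
    current_kind, _ = current_token
    return (
        current_kind == "atx_end"
        and previous_kind == "text"
        and previous_text.endswith("#")
    )


print(fails_on_trailing_hash(("text", "Heading#"), ("atx_end", "")))  # True
print(fails_on_trailing_hash(("text", "Heading"), ("atx_end", "")))   # False
```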

After adding the code for this newly discovered case, I spent some time exploring different possible combinations, making sure that I had the right equivalence classes for all 4 new rules. When I was satisfied that I had not missed anything, I moved on to the next rule.

Rule MD022 - Headings should be surrounded by blank lines

This section describes the initial implementation of PyMarkdown’s Rule MD022.

Why Does This Rule Make Sense?

While I could stay with the consistency answer I used for MD019, a better answer for this rule is compatibility and readability. Speaking to compatibility, there are some Markdown parsers that will not recognize Atx headings unless they are preceded by a blank line or followed by a blank line. To keep things simple, having at least one blank line on either side of the Atx heading keeps the options for parser choices open.

From a readability point of view, hopefully it is obvious that:

Lorem ipsum dolor sit amet, consectetur adipiscing.

## Next Section

Morbi dictum tortor a diam volutpat, ut.

is a lot easier to read than:

Lorem ipsum dolor sit amet, consectetur adipiscing.
## Next Section
Morbi dictum tortor a diam volutpat, ut.

Between the coloring of the Atx heading in my editor and the spacing around the Atx heading, it is easy to find the different sections. Without that spacing, I know it takes me a lot more effort to find headings, especially in a large document.

In addition to these reasons, I believe that different people have varying requirements on what is acceptable or proper amounts of spacing before and after a heading. I personally know of a few people with visual impairments that find it easier to acknowledge the change in sections if accompanied by extra spacing. To support varying requirements like these, rule MD022 has the lines_above and lines_below configuration values that allow the user to alter the required number of blank lines before and after each heading element. From my point of view, the benefit provided by allowing the user to configure this rule easily outweighs the work that was required to add it.

Adding the Rule

In documenting the previous 4 rules, I noticed that I almost always talk about testing as a final step, which is far from my practice. As such, I thought it would be useful to use this rule’s development as an example of my normal development process.

Testing Done Right

As I documented in my last post, my usual process for developing a new rule is:

Creating a New Rule
Creating the Initial Tests
Implementing the Rule
Thorough Rule Validation

I usually talk about running the tests after I talk about creating a rule, because it is in the “Thorough Rule Validation” phase that most of the interesting things happen. In this case, it was the “Creating the Initial Tests” phase that was interesting.

Creating a new rule is a task that I can now perform while sleeping, as I have done it enough times that adding a new rule to the project only requires a little bit of my attention to accomplish correctly. I was then wakened out of slumber when I proceeded to the next step and started going through the initial combinations of test data for this rule. Honestly, keeping all that data in my head caused me to get confused somewhat quickly. There were simply too many combinations to keep in my head with any degree of confidence. As such, I started creating the initial test cases for rule MD022, adding one test case per file just like normal. Working through each set of combinations, I was surprised that at the end of that exercise, I had 22 test files ready to go for testing. Reviewing the variations of text block types, SetExt/Atx headings, valid/invalid spacing, and alternate spacing configurations, it did seem justified that 22 test files were required to properly test the rule. And that number was kept low as I cheated a little bit, leaving any combinations with container blocks (block quotes and list blocks) out of the equation until I could do a bit more research and thinking on them.

Writing the Rule

Having so many test cases, I was concerned that the logic was going to be incredibly complex, but I was able to boil it down to two simple metrics: the number of blank lines before a heading and the number of blank lines after a heading. The count of lines before any heading is equal to the number of blank lines between the end of the previous leaf block and the start of the heading block. The count of lines after any heading is equal to the number of blank lines between the end of the heading and the start of the next leaf block. This is where excluding the container blocks paid off, as adding those blocks in to this rule with their containing conditions and ending conditions would have complicated those calculations by a fair amount.

Given the calculations as specified in the last paragraph, the code to determine these values was simple, but required some thinking. The blank_line_count variable is incremented for each blank line, provided that it is not None and is greater than or equal to zero. At the start of the document, the blank_line_count variable is set to -1 to make sure it does not fail for a heading at the very start of the document. Whenever the end of a leaf block is encountered, the blank_line_count variable is set to 0 to allow the counting of blank lines to commence. The end token for anything other than leaf blocks is not important to this rule, so in that case, the blank_line_count is set to None to stop counting until the next qualifying end token starts it again.

Once I had the calculation of blank_line_count worked out properly, the rest was a matter of perspective. The before count is simply the value of blank_line_count at the time that the Atx heading is encountered. Likewise, the after count is the value of blank_line_count at the time that the next leaf block is encountered. But what if there is no next leaf block? What if the heading block is the last block in the document? That is where the completed_file function comes in handy. Added to the plugin manager for cases like this, this function is called after all the line and token processing for the Markdown has completed. When the completed_file function is called for this rule, it performs the usual check for closing conditions, failing the rule if the conditions are right.
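The counting logic from the last two paragraphs can be sketched as follows. This is an illustration under stated assumptions, not the project’s code: the token kinds are simplified strings, the class and helper names are hypothetical, and I am assuming that the rule fails when there are fewer blank lines than required. Only blank_line_count, lines_above, lines_below, next_token and completed_file are names taken from the article.

```python
class BlankLinesAroundHeadings:
    """Sketch of the MD022 counting logic over a simplified token stream.

    Token kinds are hypothetical stand-ins for the parser's tokens:
    "blank", "leaf_start", "leaf_end", "heading_start", "heading_end",
    and "other_end" (the end of anything that is not a leaf block).
    """

    def __init__(self, lines_above=1, lines_below=1):
        self.lines_above = lines_above
        self.lines_below = lines_below
        self.blank_line_count = -1  # -1 marks the start of the document
        self.heading_just_ended = False
        self.failures = []

    def next_token(self, kind):
        if kind == "blank":
            # Only count while counting is active (not None, not -1).
            if self.blank_line_count is not None and self.blank_line_count >= 0:
                self.blank_line_count += 1
        elif kind in ("leaf_start", "heading_start"):
            self._check_after_heading()
            if kind == "heading_start" and self._count_is_below(self.lines_above):
                self.failures.append("above")
        elif kind in ("leaf_end", "heading_end"):
            self.blank_line_count = 0  # start counting blank lines
            self.heading_just_ended = kind == "heading_end"
        elif kind == "other_end":
            self.blank_line_count = None  # stop counting

    def completed_file(self):
        # Covers a heading that is the last block in the document.
        self._check_after_heading()

    def _count_is_below(self, required):
        count = self.blank_line_count
        return count is not None and 0 <= count < required

    def _check_after_heading(self):
        if self.heading_just_ended and self._count_is_below(self.lines_below):
            self.failures.append("below")
        self.heading_just_ended = False


def check(kinds, **config):
    """Run the sketch over a list of token kinds and return its failures."""
    rule = BlankLinesAroundHeadings(**config)
    for kind in kinds:
        rule.next_token(kind)
    rule.completed_file()
    return rule.failures


# Paragraph, heading, paragraph with no blank lines between them.
print(check(["leaf_start", "leaf_end", "heading_start", "heading_end",
             "leaf_start", "leaf_end"]))
# The same blocks with one blank line on each side of the heading.
print(check(["leaf_start", "leaf_end", "blank", "heading_start", "heading_end",
             "blank", "leaf_start", "leaf_end"]))
```

The first call reports both an "above" and a "below" failure, while the second reports none. Passing lines_above=2 to the second call would produce an "above" failure, since only one blank line precedes the heading.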

With the 22 test cases identified at the start of the development for this rule, the testing was simple, with almost no errors in the implementation of the rule itself, just the normal semantic and typo stuff. I do feel that I must acknowledge that I believe part of the reason this went off so cleanly was because the container blocks are not yet implemented. I do think those 2 blocks alone will easily throw a monkey wrench into the works. But for now, it is a good place to stop with development of this rule.

What Was My Experience So Far?

After the stress of adding the first three rules, as documented in the previous article, adding these rules to PyMarkdown was comparatively easy. While I still had to contend with the line numbers and column numbers not being reported², everything else in my “rule development process” was working fine. There are times that I want to skip the tests and get to the development of the rule and have fun, as I am only human. But if there is anything that my experience as a developer has taught me, it is that having a clear picture of what you want to do before you write the code is essential to shaping the code. If you start with a poor picture for what your code will do, you get poor results. I find it almost always pays off to take the time to draw that good picture, getting together solid test cases and manually calculating their anticipated results once executed against the project.

While the ease of rule development is nice, it is the solid nature and usefulness of the core rule engine that is making me happy. In implementing this batch of 5 rules, I did not have to make any changes to the core rule engine. At the same time, the development of the rules was easy, my focus centering on the particulars of the rule itself, and not the infrastructure required to support the rule. While I am not sure if it is happiness or pride, it is a good feeling that the work to get the project to this point is paying off. And while I do know refactoring is needed in some areas, I also have a growing collection of proof that my approach to this project is solid and should be continued.

Based on those observations, I believe that any earlier questions about whether I had chosen the right framework and the right development process were answered positively. As an added benefit, I also sincerely believe that my early choices to do things “this way” are paying off for the project. While not 100% frictionless, writing new rules now takes minimal effort and is challenging without being difficult, increasing the “fun quotient” for the project.

I honestly couldn’t wait to keep the progress going with the next set of rules.

What is Next?

Without much fanfare, since this first group of 5 heading rules was accomplished with ease, next on the list is the second group of 5 rules. I do hope that I will have a similar result with those rules!

¹ Look for any # character followed by any number of whitespace characters, anchored to the end of the line.
² Yes, I am still kicking myself over that, just not as much.
