Markdown Linter - Rules - Headings - Part 2

Introduction

For any readers that have been waiting with bated breath since last week when I posted my article titled “Markdown Linter - Rules - Headings - Part 1”, I will now break the suspense. This week’s article is titled… drum roll please…

“Markdown Linter - Rules - Headings - Part 2”.

Yeah, I know, the title is terribly unoriginal, but like the previous article, there is nothing special about this article except that it details the second group of heading rules that I am creating.

With this group of rules completed, except for the two rules mentioned in the previous article, every rule in the initial heading rules group will be completed. Even though I know these rules will not be the last group of rules that I write, the writing of these rules serves an important purpose. The process of writing these rules will help to paint a reliable picture of the stability of the project at this point. This picture will then allow me to determine whether writing more rules or fixing some issues is the best course of action for the project.

To be clear, I am not pro-rules nor am I pro-issues, I am pro-project. If the picture that emerges gives me a level of confidence with which to write more rules, so be it. If that level of confidence is not achieved, then I will address any issues that I feel are getting in the way of me being able to add more rules with confidence. Basically, the tests and the issues provide me data, and I will interpret that data to determine the proper next step. Simple.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 28 April 2020 and 02 May 2020.

Dealing with An Omission

As I started to write this article, I went over the 15 different rules that I know contain headings: 3 in the “First Three” article, 5 in the last article, 2 rules to skip, and 4 rules in this article. Huh? I double checked my numbers, and as far as my math is concerned, 3 + 5 + 2 + 4 = 14. Where did the missing rule go?

At the end of implementing the last rule group, I determined that the development of rule MD025 needed to be put on hold, for the same reasons as rule MD041: YAML front matter. Rule MD025 exists to make sure that there is one and only one top-level heading present in each document. The description for rule MD041 allows that top-level heading to be specified in the metadata for the file, and the plan for rule MD025 is to follow that same pattern. As rule MD041 was put on hold until that metadata support is added, it only makes sense that rule MD025 is also put on hold pending the addition of that same metadata support.

And Now, More Rules

With this group of 4 rules, the total number of implemented heading rules will be 12. I believe that is a good sampling of rules with which to decide whether to fix more issues or create new rules. To that end, let’s go!

Rule MD023 - Headings must start at the beginning of the line

This section describes the initial implementation of PyMarkdown’s Rule MD023.

Why Does This Rule Make Sense?

For this rule, the first answer that popped into my mind was the usual answer: consistency. According to the GFM specification, there is no syntactic difference between:

# Heading

and

   # Heading

As they are equivalent, just like with previous rules, it makes sense to use the simpler form of the heading to keep things consistent.

However, on a more thorough examination, a better purpose for this rule is compatibility. While I recognize that the initial parser features and tests are based on the GFM specification, knowing where other parsers diverge from that specification allows me to start thinking about future “tunings” for the project. I do not know how possible it is, but I am wondering if I can tune PyMarkdown’s output based on a profile set up for a given Markdown parser being used. While this is still in the “far-future thinking” stage, one of the pieces of data I need to understand for that endeavor is how far from my chosen base specification are the specifications for those other parsers.

To get an idea of how compatible a normal Atx heading is across parsers, I submitted the text # Heading 1 to Babelmark¹. Except for the presence of an id attribute in the h1 tag (and that attribute’s value varying only slightly), the results were the same across all the parsers, indicating wide compatibility. When I added 2 spaces at the start of the text and submitted that text to Babelmark, only 12 out of the 31 available parsers interpreted the line as an Atx heading. With only 39% of the parsers at Babelmark recognizing the changed text as an Atx heading, it was far from being widely compatible.

Based on those results, for the sake of compatibility, it made sense to create a rule to recommend the removal of any whitespace at the start of an Atx heading line. While a standard GFM Markdown parser will treat it properly, with less than half of the sampled parsers handling it properly, it just made sense to avoid that extra whitespace. If PyMarkdown tuning is a feature that I add in the future, I can always revisit this rule and make it aware of the added tuning. As that feature is currently only a maybe, a current, solid rule for headings means no spaces at the start of heading lines.

Adding the Rule

Adding the code to implement this rule was trivial. The rule simply looks at the extracted_whitespace field of the Atx heading token where the captured whitespace from before the start of the Atx heading is stored. If the value of extracted_whitespace is not empty, whitespace is present, and the rule fails.
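As a rough sketch of that check, consider the following. The AtxHeadingToken class and the atx_heading_violates_md023 function are hypothetical stand-ins invented for illustration; only the extracted_whitespace field name comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class AtxHeadingToken:
    """Hypothetical stand-in for the parser's Atx heading token."""

    hash_count: int
    # Whitespace captured from before the first '#' of the heading.
    extracted_whitespace: str


def atx_heading_violates_md023(token: AtxHeadingToken) -> bool:
    """Return True when any whitespace precedes the Atx heading."""
    return bool(token.extracted_whitespace)


# "# Heading" starts at the beginning of the line, "  # Heading" does not.
flush_left = AtxHeadingToken(hash_count=1, extracted_whitespace="")
indented = AtxHeadingToken(hash_count=1, extracted_whitespace="  ")
print(atx_heading_violates_md023(flush_left))
print(atx_heading_violates_md023(indented))
```

The whole check reduces to testing whether a string is empty, which is why the Atx half of this rule needed so little code.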

Using the process from the previous section to test Markdown samples with multiple parsers, I then tested variations of SetExt headings like:

My heading
----------

where there was extra spacing at the start of the first line, the second line, or both lines. In each of those cases, the results showed that there was at least one parser that did not recognize it as a proper SetExt heading. If I want a linter that can be useful for a wide variety of parsers, the failure of one parser to recognize the above text as a SetExt heading is still somewhat of a failure.

To deal with whitespace before SetExt headings, only a small amount of extra parsing was required above what was added for the Atx headings. When the SetExt token is observed, that token is stored in the setext_start_token variable, and the presence of any leading whitespace is recorded in the any_leading_whitespace_detected variable. As Text tokens are observed, if the any_leading_whitespace_detected variable has not already been set, the text is split into lines and each line is checked for leading whitespace, with the any_leading_whitespace_detected variable being set to True if any is found. Finally, when the SetExt’s end token is observed, one final check is made against the end token for whitespace. When that is complete, if any whitespace was detected in any of these three phases, the rule fails with the value stored in the setext_start_token variable being used as the failure token.
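The three phases can be sketched roughly as follows. The Token class and its kind values are assumptions made for this example and do not mirror the project’s real token classes; only the setext_start_token and any_leading_whitespace_detected names are taken from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical token; kind is 'setext', 'text', or 'end-setext'."""

    kind: str
    text: str = ""
    extracted_whitespace: str = ""


def find_md023_setext_failures(tokens: List[Token]) -> List[Token]:
    """Return the SetExt start token of every heading with leading whitespace."""
    failures: List[Token] = []
    setext_start_token: Optional[Token] = None
    any_leading_whitespace_detected = False

    for token in tokens:
        if token.kind == "setext":
            # Phase 1: remember the start token and any whitespace before it.
            setext_start_token = token
            any_leading_whitespace_detected = bool(token.extracted_whitespace)
        elif token.kind == "text" and setext_start_token is not None:
            # Phase 2: check each line of heading text, unless already flagged.
            if not any_leading_whitespace_detected:
                any_leading_whitespace_detected = any(
                    line[:1].isspace() for line in token.text.split("\n")
                )
        elif token.kind == "end-setext" and setext_start_token is not None:
            # Phase 3: check the end token, then report against the start token.
            if token.extracted_whitespace:
                any_leading_whitespace_detected = True
            if any_leading_whitespace_detected:
                failures.append(setext_start_token)
            setext_start_token = None
    return failures
```

Reporting against the start token keeps each failure pointing at the first line of the heading, no matter which of the three phases noticed the whitespace.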

The code for implementing this rule, the tests, and test data were all trivial. Once I understood the constraints that I needed for the rule, it was easy to translate those constraints into test data and source code. Perhaps I am a bit jaded in my viewpoint, but after implementing 8 other rules for headings, adding this rule was just easy.

The really interesting part about implementing this rule was using Babelmark and looking at the output from the various parsers. Seeing the different HTML variations that arose from the different interpretations of a single Markdown element made it clearer to me why the GFM specification was written. If there is not a single, clear picture of how parsers translate Markdown into HTML, the authors start to get confused, tailoring their use of Markdown to a specific parser and its output. Seeing that Babelmark output just really drove that point home for me.

Rule MD024 - Multiple headings with the same content

This section describes the initial implementation of PyMarkdown’s Rule MD024.

Why Does This Rule Make Sense?

If the parser is following the GFM specification exactly or is a bare bones parser, this rule does not make any sense. When presented with Markdown such as:

## Next Heading

some text

## Next Heading

some text

the parser will generate HTML that is close to the following:

<h2>Next Heading</h2>
<p>some text</p>
<h2>Next Heading</h2>
<p>some text</p>

However, either through its own configuration or through a plugin, most Markdown parsers allow for a mode in which some transformation of the heading text is added to the <h2> tag, usually in an id attribute. Some parsers go further than that, providing an anchor point allowing the reader to navigate directly to the heading by using a full URL with the contents of the href attribute after the normal part of the URL.

One such example is this HTML, which was rendered by one of the parsers for the first heading in the previous sample:

<h2 id="h2-next-heading">
<a id="user-content-next-heading" class="anchor" href="#next-heading">
Next Heading
</a>
</h2>

This kind of generated HTML output is popular for two main reasons. First, various combinations of the heading tag’s id attribute, the anchor tag’s class, and the anchor tag’s id attribute allow for stylesheets to be applied to the generated page in a clean and unambiguous manner. Secondly, assuming that the normal URL to the page is https://www.website.com/that-page.html, since the href attribute for the anchor tag is specified as #next-heading, the reader can go directly to that section on the page by using the URL https://www.website.com/that-page.html#next-heading. This concept is so useful that multiple “Table of Contents” plugins that I looked at required another plugin to already be enabled that provides this information, which is then repurposed by the Table of Contents plugin.

Specifically related to this rule, the part that I want to focus on is the generation of the id attributes and the href attribute. If a naïve generator implementation is used, the heading text Next Heading will always be reduced to some form of the text next-heading, without any regard for duplicates. When a more advanced implementation is used, the plugin remembers that it has already generated the text next-heading for the current document. To avoid any duplication, when the parser goes to generate another instance of a heading id based on the text Next Heading, it generates it as normal but also appends a suffix to the generated text, usually something like -1 to keep the generated ids unique.
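To make the difference between the two kinds of generators concrete, here is a minimal sketch. Both naive_heading_id and UniqueHeadingIds are made-up names, and the exact slug and suffix formats are assumptions, since every parser or plugin has its own scheme.

```python
import re
from typing import Dict


def naive_heading_id(heading_text: str) -> str:
    """Lowercase the text and join its runs of letters and digits with '-'."""
    return "-".join(re.findall(r"[a-z0-9]+", heading_text.lower()))


class UniqueHeadingIds:
    """Remembers the ids issued for one document and suffixes any repeats."""

    def __init__(self) -> None:
        self._seen: Dict[str, int] = {}

    def next_id(self, heading_text: str) -> str:
        base_id = naive_heading_id(heading_text)
        count = self._seen.get(base_id, 0)
        self._seen[base_id] = count + 1
        # The first use keeps the plain id; later uses get -1, -2, and so on.
        return base_id if count == 0 else f"{base_id}-{count}"


generator = UniqueHeadingIds()
print(naive_heading_id("Next Heading"))
print(generator.next_id("Next Heading"))
print(generator.next_id("Next Heading"))
```

With the naïve function, both headings in the earlier sample end up sharing the id next-heading, so a link to #next-heading can only ever reach the first of them.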

Because Markdown parsers are not guaranteed to perform the proper, advanced generator implementation, this rule plays it safe by failing when it detects heading text that is duplicated. However, to provide further configuration for more compliant parsers, this rule can also be set to only fail the rule in cases where sibling heading elements have the same name. While it is not stated, my guess is that some parsers include extra information in their generated ids that only causes duplication issues when sibling headings have the same text.

Adding the Rule

Using the default configuration, the evaluation for this rule is very simple. When an Atx heading token or a SetExt heading token is observed, the rule starts collecting the text until it encounters the appropriate end token. At that point, one of two things happens. If this is the first time that text has been seen, it is saved in a dictionary that is cleared at the start of each document. If it is not the first time that the text has been seen, the rule fails. Simple and clear cut.
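Stripped of the token handling, the default behavior looks something like the sketch below. It assumes that the heading text has already been collected into a list of strings, and find_duplicate_headings is an illustrative name rather than anything from the project.

```python
from typing import Dict, List


def find_duplicate_headings(heading_texts: List[str]) -> List[int]:
    """Return the index of each heading whose text was already seen."""
    # Heading text -> index of its first occurrence; created fresh per document.
    seen: Dict[str, int] = {}
    failures: List[int] = []
    for index, text in enumerate(heading_texts):
        if text in seen:
            failures.append(index)
        else:
            seen[text] = index
    return failures


print(find_duplicate_headings(["Next Heading", "Next Heading"]))
```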

If this rule’s configuration is changed to only search for duplicates in siblings, a small change is required in the algorithm, but not much. The first change is that each of the 6 levels of headings must have its own dictionary. Instead of looking up the collected text in a single dictionary, the text is looked for in the dictionary assigned to its heading level. Basically, if a heading is a level 2 heading, then it needs to verify against the level 2 heading dictionary, and so on. As the heading level that determines which dictionary to use is an integer, it made sense to create an array that contains 6 dictionaries, one for each heading level and initialized to an empty dictionary for each new document.

With those changes in place, a bit of working through multiple scenarios provided the rest of the answers. For the increasing heading level case, consider the general case of:

## Level 2
### Level 3

While it is appropriate to only increase the heading level by 1, hence the creation of rule MD001, the general case for this rule applies to any increase in heading levels. When going to the new heading level, the level’s dictionary must be cleared to ensure that any previous headings do not pollute the rule’s sense of duplication. If multiple levels are involved, it makes logical sense to clear all the dictionaries up to and including the new heading level.

The other case to consider is where the heading levels decrease, such as with:

### Level 3
## Level 2

Once again, the general case for this applies to any decrease in heading levels. As the new Level 2 heading is separated from any previous level 2 headings by the ### Level 3 heading, those headings are no longer considered siblings. As such, when moving to the lower levels, the dictionaries must be cleared down to and including the new heading level.
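Putting the increasing and decreasing cases together gives something like the following sketch. It follows the description in this article literally, so other linters may treat the decreasing case differently, and the function name, the (level, text) input format, and the seen_by_level variable are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def find_sibling_duplicates(headings: List[Tuple[int, str]]) -> List[int]:
    """Return the index of each heading that duplicates a sibling's text.

    Each heading is a (level, text) pair with a level from 1 to 6.
    """
    # One dictionary per heading level, all empty for a new document.
    seen_by_level: List[Dict[str, int]] = [{} for _ in range(6)]
    last_level = 0
    failures: List[int] = []

    for index, (level, text) in enumerate(headings):
        if level > last_level:
            # Increase: clear the levels after the old one, up to and
            # including the new level.
            levels_to_clear = range(last_level + 1, level + 1)
        elif level < last_level:
            # Decrease: clear from the old level down to and including
            # the new level.
            levels_to_clear = range(level, last_level + 1)
        else:
            levels_to_clear = range(0)
        for stale_level in levels_to_clear:
            seen_by_level[stale_level - 1].clear()

        seen = seen_by_level[level - 1]
        if text in seen:
            failures.append(index)
        else:
            seen[text] = index
        last_level = level
    return failures
```

Under these clearing rules, two adjacent level 2 headings with the same text are reported, while a repeated level 2 heading that is separated from the first by a level 3 heading is not.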

Other than figuring out the logic for this rule, and taking a while to do it, the rest of the coding for this rule went smoothly. While the new algorithm based on that logic took longer to figure out than I hoped it would, it was useful to be able to have the test data and test cases on hand from VSCode’s output. Taking the information from the Problems window of that editor saved a lot of time when coded into my tests for this rule. I did have a couple of false starts but knowing that the algorithm passed the tests using the VSCode derived test data really increased my confidence that I got the algorithm right.

Rule MD026 - Trailing punctuation in heading

This section describes the initial implementation of PyMarkdown’s Rule MD026.

Why Does This Rule Make Sense?

This rule seems to be largely derived from Ciro Santilli’s Markdown Style Guide. In this document, Ciro claims that headings are not complete sentences, and as such, should not end with punctuation. Specifically, his examples show very good cases for not ending a heading with the : character or the . character. My best guess is that David Anson, when creating his rule based on that document, wanted to allow his linter’s users to be able to extend this concept to include any headings that end with a set of common punctuation characters. As such, the default for David’s rule is to fail the rule if any heading ends with any of the .,;:!?。，；：！？ characters, and that set of characters can be easily changed with configuration.

While I do not completely agree with this rule, I do agree with the premise behind the rule: headings should never be considered complete sentences. To properly describe that statement and how it applies to various headings, I needed to brush up on my English grammar rules. Even though it has been a few years since I was in high school, I was quickly able to find a good solid definition:

a complete sentence starts with a capital letter, ends with a punctuation character, and expresses a complete thought.

Using that definition as a rubric, I feel it is useful to demonstrate how that rule applies to the headings in my own documents. For this comparison, I used the layout of the current section of this article as a good example of how to conceptually apply this rule.

For each rule in this section, the level 2 heading is always the id of the rule followed by the description. In the case of this rule, that heading is Rule MD026 - Trailing punctuation in heading. That heading is not a valid sentence because it is not a complete thought and does not end with punctuation. The level 3 heading following this section is Adding the Rule. There is a bit more nuance involved with that heading, as I can easily argue that it can be interpreted as a complete sentence, just a very weak one. The saving grace for that heading is that it does not end with punctuation, so it is not advertising itself as a complete sentence.

That leaves the heading for this section which is Why Does This Rule Make Sense?. When I first looked at my grammar references for this, it seemed to fail at every point. It starts with a capital letter, ends with punctuation, and looks like a complete sentence. Yes, I said “looks”… and it took me a bit to get there as well. Remember above when I said that a complete sentence expresses a complete thought? That is where context comes in. Without the context imposed by this document, the obvious question to ask is “What rule?”. Because I tied the context of the heading to the document’s structure, it is not a complete thought unless it remains in the document. As it is not a complete thought on its own, it narrowly fails the complete sentence test.

Aside: So, Why Keep It Like That?

I know that using partial questions as headings is not a popular choice, and I know it narrowly fails the complete sentence test. So why keep it?

For me, it all boils down to using my own voice in my writing, and the authenticity of my writing in that voice. For each of these rules, if I were reading someone else’s document, I would ask two questions:

Why is it good to do this?
What did it take to do this?

Wording the second question as a statement is relatively easy. Instead of “What did it take to do this?” I used the heading “Adding the Rule”. It is not glamorous, but it concisely and accurately conveys the image that I want to convey for that section.

For the first question, I struggled for a long while trying to rephrase that question as anything other than a question, and it just did not seem correct. It either was missing something, or it just felt like something that I would not say unless I was prompted to. So instead of settling for something that I was not confident about, I opted for using a question that did capture the essence that I wanted in a heading for that section.

And yes, I purposefully worded the heading of this section to emphasize that point. That, and it is the kind of question I would ask myself if I was reading someone else’s article.

Adding the Rule

At this point in the development of heading rules, I believe the term “ditto” is appropriate. Like a fair number of the rules before this one, any text that is seen between the heading’s start token, either the Atx heading token or the SetExt heading token, and its end token is collected for later examination. As that process of collecting the heading text has been well documented in the other rules, I will avoid documenting it from here on out, assuming that avid readers already know it fairly well, instead focusing on the unique bits and differences in the algorithms.

Once the heading text is collected and the heading’s end token is encountered, the rule’s comparison logic is then activated. Quite simply, when the end token is encountered, the last text character of the collected text is checked against the configured set of punctuation characters. If there is a match, the rule fails. If configuration is provided to change the set of punctuation characters to check against, the check is simply performed against that list of characters instead of the default list.
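In code, that comparison is tiny. The sketch below uses the default character set quoted earlier in this section; the function name and the punctuation parameter are illustrative, not the rule’s actual configuration interface.

```python
DEFAULT_PUNCTUATION = ".,;:!?。，；：！？"


def heading_ends_with_punctuation(
    heading_text: str, punctuation: str = DEFAULT_PUNCTUATION
) -> bool:
    """Return True when the last character of the heading is in the set."""
    return bool(heading_text) and heading_text[-1] in punctuation


print(heading_ends_with_punctuation("Adding the Rule"))
print(heading_ends_with_punctuation("Why Does This Rule Make Sense?"))
# A configured set without '?' lets question headings through.
print(heading_ends_with_punctuation("Why Does This Rule Make Sense?", ".,;:!"))
```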

At first, I was a bit let down that I saw this as being a simple rule, as both the code and testing for this rule were very trivial. While I was initially disappointed, after a while I was able to see it as a good thing. One benefit that I started to see is that if these rules are performing a consistent action, it reduces the chance that I am going to get the logic wrong. Perhaps more obvious to me is the benefit that if the logic is indeed very similar, it may be encapsulated into a base class or helper class in a future refactoring, thereby reducing the maintenance costs.

When I realized what those benefits were, I became okay with this rule being a mundane rule to write. Mundane equals refactorable equals lower maintenance.

Rule MD036 - Emphasis used instead of a heading

This section describes the initial implementation of PyMarkdown’s Rule MD036.

Why Does This Rule Make Sense?

This rule is another rule that was largely derived from Ciro Santilli’s Markdown Style Guide. If I had to guess, it seems that Ciro had seen cases where people were using emphasis as headings in Markdown documents, something that is confusing. Doing a bit of easy research, most Markdown tutorial sites, such as Markdown Tutorial, address headings in their first 3 lessons. For people who may ignore tutorials and go straight to cheat sheets, both Markdown Guide’s Cheat Sheet and Adam Pritchard’s Cheatsheet deal with headings on their first page where they are immediately visible. From this research alone, it is hard to figure out why someone would use emphasis over headings. However, I started wondering if perhaps there were historical reasons for this rule?

If you start looking at the tutorials and cheat sheets from a different angle, perhaps the historical angle to justify this rule does make sense. Perhaps these distinct sources put headings near the start of their documents because people were not using headings properly at the time that those documents were written. Thinking as an author, by putting headings near the start of the tutorial or cheat sheet, I would expect that placement to strongly hint that they are important and should be looked at before going with the rest of the document.

To explore that concept further, I assumed that I did not know about headings, forcing me to theorize on how to make a heading-like element. Trying to forget about this rule, I started working on a list of the elements that I could use. With headings out of the picture, and every one of the other blocks having a very distinct purpose, that left only inline elements. Line breaks are not immediately visible, and links are really meant for navigation, so they were removed from the possibilities list early on. That left backslashes, character references, code spans, and emphasis. Out of those 4, only the emphasis made sense to use to draw attention to some text.

Based on those restrictions, the best heading-like element that I came up with was this example:

*My Not So Heading*

More text

What a surprise! That is exactly what this rule is looking for. In particular:

the paragraph is a single line, surrounded by emphasis
the contents of the paragraph are only simple text, no inline Markdown except for the surrounding emphasis
for the same reasons as rule MD026, the text does not end with a punctuation character

Basically, if the only action I do is to remove the emphasis around the text, instead prefixing the line with the text #, it should become a valid heading.

As I now had a working theory on why this rule may have been created and what a good justification was for it, it was time to start writing the rule.

Adding the Rule

While the logic for the rules has been somewhat simple before this point, even a simple glance at this rule led me to believe that the logic for this rule was going to need some heft to it. As it often is with many complex parsing scenarios, it was time to use a Finite State Machine. While there are many good articles on Finite State Machines, such as this decently complete one at Wikipedia, these complex machines boil down to this simple statement: the machine has a finite number of states and for each state there are 0 or more transitions from that state to other states.

From my experience, there is a certain level in parsing or pattern recognition where a simple comparison is not enough, and a formal Finite State Machine is required. The inverse of this is also true, where the formality and setup required for a proper Finite State Machine may get in the way of a small, efficient algorithm. Consider the algorithm and code for rule MD026 in the previous section. A simple if statement was used to look for 3 types of tokens: a start heading token, any contained text tokens, and an end heading token. While not called out as such, this algorithm was a small state machine with clear states and clear transitions between states. Once the contents of the heading were collected, then rule MD026 did an analysis on it, and determined whether to fail the rule based on that analysis.

Breaking that logic out into a full-fledged Finite State Machine had little benefit, as the simple logic was able to complete the task with fewer than 5 states and without any complicated transition logic. On the other hand, the logic for this rule as defined in the last section reveals that this rule’s algorithm requires 5 distinct states:

look for a paragraph start token
look for an emphasis start token
look for a text token
look for an emphasis end token
look for a paragraph end token

If at any point the next token that we encounter is not the type of token required to go to the next state, the machine resets back to state 1 and the machine starts to look for a new paragraph. While the transition logic remained simple, I felt that the “large” number of different states made using a Finite State Machine the best alternative.

Once I had the states and transitions figured out, the writing of the rule was trivial. The states were represented by a simple set of if statements, each one clearly looking for the correct condition to move the state machine to the next state. If that condition is not found, the current state is reset to state 1, causing it to look for the initial condition again. If the Finite State Machine gets past the final state, the rule fails and the current state is again reset back to state 1, looking for another candidate paragraph to evaluate.
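As an illustration only, the five states and their reset behavior could be sketched like this. The (kind, text) token tuples and the kind names are invented for the example and are not the project’s real token classes, and the punctuation set is borrowed from the defaults described for rule MD026.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text), with hypothetical kind names
PUNCTUATION = ".,;:!?。，；：！？"


def find_emphasis_headings(tokens: List[Token]) -> List[int]:
    """Return the index of the paragraph start token for each failure."""
    failures: List[int] = []
    state = 1
    paragraph_index = -1

    for index, (kind, text) in enumerate(tokens):
        matched = False
        if state == 1:  # look for a paragraph start token
            matched = kind == "para"
            if matched:
                paragraph_index = index
        elif state == 2:  # look for an emphasis start token
            matched = kind == "emphasis"
        elif state == 3:  # look for a single line of plain text
            matched = (
                kind == "text"
                and bool(text)
                and "\n" not in text
                and text[-1] not in PUNCTUATION
            )
        elif state == 4:  # look for an emphasis end token
            matched = kind == "end-emphasis"
        elif state == 5:  # look for a paragraph end token
            matched = kind == "end-para"

        if not matched:
            state = 1  # wrong token: start looking for a new paragraph
        elif state == 5:
            failures.append(paragraph_index)  # got past the final state
            state = 1
        else:
            state += 1
    return failures
```

For the sample above, the token stream would be a paragraph start, an emphasis start, the text My Not So Heading, an emphasis end, and a paragraph end, which walks the machine through all five states and reports the first paragraph.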

What Was My Experience So Far?

From the viewpoint of adding new rules, this was a good experience. My feelings, as expressed in the last article, were easily extended to cover this set of rules as well. While I still want to jump ahead sometimes, my confidence in the process of writing rules and the framework in which I write those rules is getting deeper with the authoring of each rule. A good part of that confidence boost is that when I create new rules, the bulk of the time used to author those rules is focused on the rules themselves, and not on how to wring the required data from the framework. It is only the rare exception where I need to figure out how to get some data from the framework.

But it is those exceptions that caused me to pause and think when trying to answer the question that I posed at the start of the article: write more rules or fix more issues? This answer is a difficult one for me to arrive at, as there is good data to support both sides. Writing more rules would round out the framework some more, while focusing on the framework will fix known issues with the framework, allowing future rules to be written with less friction.

This choice was also complicated by the arrival of new ideas for the project as I am implementing the rules. A good example of this is the Markdown parser “tunings” that I briefly talked about back in the documentation for rule MD023. While it is nice to think about these concepts and how they could make the project better, doing anything more than thinking about them at this stage would be distracting. Even worse, it could derail the project by having me follow that concept down the rabbit hole. If I want to be able to explore concepts such as “tunings”, which of the two options would allow me to get there faster while maintaining my required level of quality?

In the end, I determined that I wanted some more time to think about this. There are a few issues in my backlog, and fixing those issues will give me some more time to determine the best course of action. Sometimes, the best decision is not to decide, but to collect more information. I firmly believe that this is one of those cases. And if I am wrong? It just means the foundation for the project will be that much cleaner and stronger than it was before.

What is Next?

As I mentioned in the previous section, I feel that the best course of action is to do some refactoring. While it is not a glorious task, it will give me some time to determine whether to choose line/column support in the parser or more rules.

¹ Babelmark is a useful online tool that converts a given piece of Markdown text into HTML using a wide variety of Markdown parsers. I find tools like this very useful in exploring possibilities for how to solve issues that I have.

