Markdown Linter - Making Progress On New Rules

Summary¶

In my last article, I talked about starting to tackle the long list of rules that are not yet implemented. In this article, I talk about the progress I am making and my efforts to streamline the process.

Introduction¶

When I started working on this task, I had 31 rules to implement. By the start of this week, I was down to 28 rules. As my work week was filled with deep-thinking and experimentation, I had a feeling that most of my project work would end up being done on the weekend. And I was correct.

With two days to make some progress on the project, it was hard to find good, solid blocks of time in between my other plans for the weekend. But I did find some of that time, and I tried to use that time to my benefit. My goal was simple: if possible, get more than three rules designed, implemented, and tested before I started writing this article on Sunday. To do that, I was going to have to change how I was approaching the rules, and I hoped my plans would work out.

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 27 Jul 2021 and 01 Aug 2021.

Working Fast and Efficiently¶

With 28 rules left to implement, I wanted to try and get into the habit of making decisions on how to be more efficient. As this is week 2 of working on these rules, if I can get more efficient with how I design and implement these rules, hopefully I can complete the entire set within seven weeks or less.

But the only way that I was going to do that was with some new personal rules. The first rule is that, while it is usually okay for me to experiment, now is not the time for that. If I cannot come up with a clean design within 30 minutes, I need to move on to the next rule. This should help me go through the existing list of rules and get the easier rules taken care of. On the next pass through the list, I can extend that time if needed. But for now, I need to get the easy stuff done first.

The second personal rule is that if a rule takes more than two hours to get into a semi-finished state, I need to save the work and move on. If I am having issues with a design after two hours, I got something wrong and I need to rethink that design. But if I am doing that after having designed the rule, it means I did not think the design through and need to go back and redesign it. And that takes time.

While neither of these rules is permanent, I am hoping that by starting this block of work with these rules, I can quickly move through the easier rules. Will they help? Only one way to find out!

Rule Md010 - Hard Tab Characters¶

For years I have avoided using tab characters in anything except word processor documents for one good reason: they are a holy war waiting to happen. For those readers not familiar with the term holy war, let me explain. In development terms, a holy war is a never-ending discussion with no real, concrete answer. Some developers will swear by always 4 characters, others by always 2 characters, and others by always a 4 character tabstop.

As I mentioned in the last paragraph, there is no real answer to the question. To avoid starting that discussion, most of the coding and documentation guidelines that I know of specify explicit rules on how an entered tab character should be handled, including the GFM specification. But while the specification lists the exact behavior a compliant Markdown parser should use, that specification does not change how the author’s editor may decide how to interpret that tab character. In the end, it is a lot easier to have the editor translate any tab character into the author’s desired number of space characters, side-stepping the entire issue.

Design¶

Starting with that information, I was quick to design a simple function that would scan for any tab character on the line and trigger the rule if it found any. The problem? The original rule has a configuration setting that allows tab characters to be included in Code Block elements without triggering the rule.

Using up most of my allowed design time, I was at an impasse. While the PyMarkdown parser has decent support for tab characters, it accomplishes part of that support by replacing tab characters with a four space character tabstop.¹ There is already a section in the Issues List titled Bugs - Tabs to really dig into the proper support for tab characters, but right now there is a stopgap measure present. With the stopgap replacement of tab characters, there are no traces of the tab characters in the tokens, leaving no way to detect them. No matter what else I could try, those tab characters will remain unreachable until I address the issues in that section.
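
To make that stopgap concrete, here is a minimal sketch that uses Python's built-in `str.expandtabs` as a stand-in for the parser's tab handling. That stand-in is an assumption, as the parser's own replacement code is not shown in this article, and the `expand_to_tabstop` name is made up for the example. Once the tabs are expanded to the next multiple of 4, nothing in the resulting text records where a tab used to be.

```python
def expand_to_tabstop(line: str, tab_size: int = 4) -> str:
    """Hypothetical stand-in: replace each tab with spaces up to the next tabstop."""
    return line.expandtabs(tab_size)


original = "a\tbc\td"
expanded = expand_to_tabstop(original)

print(repr(expanded))    # 'a   bc  d'
print("\t" in expanded)  # False: the tab positions are no longer recoverable
```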

So, with that research done, I course corrected back to the original, simple design of scanning line-by-line. I will revisit this later, but for now, simple was the only available choice.

Implementation and Testing¶

After the course correction in the design phase, everything went fine with the implementation and testing of this rule. It was rather simple to come up with the test cases, seeing as the special code_block configuration was not in play. As such, the detection loop was simple:

def next_line(self, context, line):
    if "\t" in line:
        next_index = line.find("\t", 0)
        while next_index != -1:
            extra_data = f"Column: {next_index + 1}"
            self.report_next_line_error(
                context, next_index + 1, extra_error_information=extra_data
            )
            next_index = line.find("\t", next_index + 1)
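
For readers who want to try the scan outside of the rule plugin, the same loop can be sketched as a standalone function. The `find_tab_columns` name and the list return value are made up for this example; the rule itself reports each column through `report_next_line_error` instead of collecting them.

```python
from typing import List


def find_tab_columns(line: str) -> List[int]:
    """Return the 1-based column of every tab character in the line."""
    columns = []
    next_index = line.find("\t")
    while next_index != -1:
        columns.append(next_index + 1)  # columns are reported starting at 1
        next_index = line.find("\t", next_index + 1)
    return columns


print(find_tab_columns("a\tb\t"))    # [2, 4]
print(find_tab_columns("no tabs"))   # []
```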

Rule Md012 - Detect Extra Blank Lines¶

Blank Line elements are interesting elements in Markdown. Together with the Link Reference Definition element, they form a group of elements that do not show up in a rendered HTML document. For every other element, there is some physical representation in the generated document. But not those two.

In the case of Blank Line elements, they primarily serve to delineate one element from another element. Need a new paragraph? Use a blank line before the next paragraph. Want to create a paragraph after specifying a SetExt Heading element? Use a blank line. Want to stop a List element before the next element starts? Use a blank line. In every case, they only provide a method to terminate the element that precedes them.

But other than that small task, there is nothing that they do. And multiple blank lines do not make any sense either. A second blank line would terminate the element that the first blank line terminated? It just does not make sense. Therefore, this rule is in place to limit the documents to only one Blank Line element at a time.

Design¶

I originally thought that the design had to be complicated, but as I worked through the cases, I realized that it could be a lot simpler. At first, I thought I would have to include code to prevent this rule from firing within Code Block elements and HTML elements. But after checking things out, I remembered that the text within those elements uses the \n character instead of the Blank Line element. With those elements excluded, it was enough to count the number of consecutive Blank Line tokens and check that count against the configured limit.

Implementation and Testing¶

Nothing interesting to report here. With a solid design in hand, it was easy to create the test functions and their test data. From there, as the detection design was simple, the implementation was also simple:

def next_token(self, context, token):
    if token.is_blank_line:
        self.__last_blank_line = token
        self.__blank_line_count += 1
    else:
        self.__check_for_excess_blank_lines(context)
        self.__blank_line_count = 0
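
The counting idea can be sketched on its own, outside of the token stream. This is a simplified illustration, not the rule's code: the `find_excess_blank_runs` function and its `maximum` parameter are hypothetical names, and it works on raw lines rather than Blank Line tokens, so unlike the real rule it would also count blank lines inside Code Block and HTML elements.

```python
from typing import List


def find_excess_blank_runs(lines: List[str], maximum: int = 1) -> List[int]:
    """Return the 1-based line number where each over-long run of blank lines ends."""
    failures = []
    blank_count = 0
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            blank_count += 1
            continue
        # A non-blank line ends the current run, so check the run and reset.
        if blank_count > maximum:
            failures.append(line_number - 1)
        blank_count = 0
    # A run of blank lines at the very end of the document needs a final check.
    if blank_count > maximum:
        failures.append(len(lines))
    return failures


print(find_excess_blank_runs(["a", "", "", "b", "", "c"]))  # [3]
```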

Rule Md014 - Showing Bash Style Output¶

When writing articles that include Bash scripts, most authors start with script blocks that look something like this:

```shell
jacob@system:~$ ls /my/dir
jacob@system:~$ cat /my/dir/file
```

But with the user’s account name, the system name, and the directory name all exposed, most authors quickly trim that down to:

```shell
$ ls /my/dir
$ cat /my/dir/file
```

From experience, this format is only useful if the author interweaves shell input and shell output together in the example, such as:

```shell
$ ls /my/dir
file
file2
$ cat /my/dir/file
```

That text format is efficient as it makes it clear which lines are shell input commands and which lines are shell output text.

But if that interwoven format is not desired, it is simpler to trim the initial script down further to only reflect the shell input:

```shell
ls /my/dir
cat /my/dir/file
```

With those leading $ characters removed, these lines can then be copied by the reader into the clipboard and executed in their own Bash window. Even though this example is specific to the /my/dir directory, it can still be copied-and-pasted, with the results being somewhat predictable.

Design¶

The obvious starting point for this rule was detecting whether the rule was looking at a token within a Code Block element. Once that context is established, it then becomes a simple matter of looking at each line in the following Text token. For each line, the rule looks for at least one line that does not begin with the $ character.

This simple design was made possible by a good design decision that I made regarding Code Block elements. To prevent accidental parsing of their content as anything but a code block, I ensured that the encompassed Text token has an exact recording of what is in that Code Block element. While that decision was made for another reason, it benefited me in this design.

Implementation and Testing¶

With a solid design in place, the implementation for this rule was easy. I just implemented each step of the design, one step at a time. The only addition to the design occurred when I questioned whether leading whitespace before the dollar sign character ($) would affect the triggering of the rule. When I double-checked the original rule, its support of leading whitespace aligned with my thoughts, and the design was adapted. The rule now checked for a leading $ character after any leading whitespace was removed.

def next_token(self, context, token):
    if token.is_code_block:
        self.__in_code_block = True
    elif token.is_code_block_end:
        self.__in_code_block = False
    elif self.__in_code_block and token.is_text:
        are_all_preceded_with_dollar_sign = True
        split_token_text = token.token_text.split("\n")
        for next_line in split_token_text:
            if not next_line.strip().startswith("$"):
                are_all_preceded_with_dollar_sign = False
                break
        if are_all_preceded_with_dollar_sign:
            self.report_next_token_error(context, token)
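
The core check can be tried against the earlier shell examples with a small standalone sketch. The `all_lines_start_with_dollar` name is made up for this example; it only mirrors the per-line test from the function above and leaves out the Code Block tracking.

```python
def all_lines_start_with_dollar(code_block_text: str) -> bool:
    """Return True if every line, after stripping whitespace, starts with '$'."""
    return all(
        line.strip().startswith("$") for line in code_block_text.split("\n")
    )


commands_only = "$ ls /my/dir\n$ cat /my/dir/file"
interwoven = "$ ls /my/dir\nfile\nfile2\n$ cat /my/dir/file"

print(all_lines_start_with_dollar(commands_only))  # True: the rule would trigger
print(all_lines_start_with_dollar(interwoven))     # False: output lines are present
```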

Rule Md033 - No HTML in Markdown¶

I remember reading about this rule for the first time and wondering how it would be properly used. Now, it seems obvious: because there was a need for it! More specifically, this rule exists for two main reasons: generation and security.

From the generation point of view, most of the Markdown parsers that I have encountered translate the Markdown elements into HTML output. But I have heard of parsers that translate the elements into intermediate forms, such as the token format that the PyMarkdown project uses. In such a case, embedding pure HTML into that document would most likely result in the output not looking like the author intended. For any authors rendering the Markdown using one of those parsers, that is an issue.

The other big reason is security. If an author can place their own HTML into a generated document, it is possible that they can make that document do things it was not supposed to do. For example, say you want to allow someone to enter comments at a public kiosk. For whatever reason, there are requirements to let the kiosk users enter their comments in Markdown or Plaintext. If they can enter HTML as part of the Markdown content, they can cause all manner of problems. By removing the ability to execute any HTML other than Markdown generated HTML, that security hole is closed.

Design¶

With the experience of designing the other rules behind me, creating the design for this rule was not difficult. There are two main ways to get HTML into a document: an HTML Block element and an inline Raw HTML element. The Raw HTML element is easy in that its content is self-contained. The HTML Block element is a bit more difficult in that the rule needs to look for a Text element inside of an HTML Block element.

After detecting the HTML text, determining the “name” of the tag is important for being able to allow only certain tags in a valid document. In the original rule, this was limited to any opening HTML tag that started with an alphabetic character. The first thing I did was to ensure the design for the HTML Block element allows for both an opening HTML tag and a closing HTML tag. With that out of the way, and including a special case for the ![CDATA[ HTML tag, collecting the name of the tag became collecting the tag until one of the five terminating characters is encountered.

With all the scenarios covered, I moved on to the testing and implementation.

Implementation and Testing¶

Once I had all the test scenarios identified and their test functions set up, the actual code was easy to implement. Starting with the Raw HTML element, getting the __look_for_html_start function set up was rapidly accomplished. From there, I implemented the first pass with the HTML Block element, including better support for tags with non-alphabetic start characters. Finally, after making sure everything else was working properly, I made the small change to allow closing HTML tags in addition to opening HTML tags. It was an iterative process, but one that flowed smoothly along with the design.

def __look_for_html_start(self, context, token, tag_text):
    if tag_text.startswith("![CDATA["):
        tag_text = "![CDATA["
    else:
        _, tag_text = ParserHelper.collect_until_one_of_characters(
            tag_text, 0, " \n\t/>"
        )
    extra_data = "Element: " + tag_text
    if tag_text not in self.__allowed_elements:
        self.report_next_token_error(
            context, token, extra_error_information=extra_data
        )

def next_token(self, context, token):
    if token.is_inline_raw_html:
        self.__look_for_html_start(context, token, token.raw_tag)
    elif token.is_html_block:
        self.__is_next_html_block_start = True
    elif token.is_text and self.__is_next_html_block_start:
        modified_text = (
            token.token_text[2:]
            if token.token_text.startswith("</")
            else token.token_text[1:]
        )
        self.__look_for_html_start(context, token, modified_text)
    else:
        self.__is_next_html_block_start = False
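
To see how the tag name falls out of that logic, here is a self-contained sketch. The `collect_until_one_of_characters` function below is a hypothetical stand-in for the `ParserHelper` method, with its return shape assumed from how the code above unpacks the result, and `extract_tag_name` is a name made up for this example.

```python
from typing import Tuple


def collect_until_one_of_characters(
    text: str, start: int, stop_characters: str
) -> Tuple[int, str]:
    """Hypothetical stand-in: collect characters until a stop character is found."""
    index = start
    while index < len(text) and text[index] not in stop_characters:
        index += 1
    return index, text[start:index]


def extract_tag_name(html_text: str) -> str:
    """Return the tag name for an opening tag, closing tag, or CDATA section."""
    # Skip the leading "</" of a closing tag or the "<" of an opening tag.
    if html_text.startswith("</"):
        tag_text = html_text[2:]
    elif html_text.startswith("<"):
        tag_text = html_text[1:]
    else:
        tag_text = html_text
    if tag_text.startswith("![CDATA["):
        return "![CDATA["
    # The five terminating characters: space, newline, tab, slash, and ">".
    _, tag_name = collect_until_one_of_characters(tag_text, 0, " \n\t/>")
    return tag_name


print(extract_tag_name("<a href='x'>"))   # a
print(extract_tag_name("</div>"))         # div
print(extract_tag_name("<![CDATA[x]]>"))  # ![CDATA[
```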

While it is more code than a fair share of the other rules, it was easy to implement for a few reasons. The first was the design phase. While it is true that I used my design phase test cases for the test scenarios, the design was good enough to handle each of those test cases. Therefore, the second reason was those design test cases. Before I started coding, I knew what I was up against, and I was confident that I had identified all the test cases. Finally, I did not code the entire block of code at once. I started with the easier parts of the design and layered upon what was already done and tested until I was done.

It was at this point that I was sure that my rules were paying off. As far as I could tell, they were helping me to focus. The next rule would test that though.

Rule Md027 - False Start¶

Sometimes it is hard to keep to a self-imposed rule, and this was one of those times. I am still not sure why I had a problem with this design, but I did. I even gave myself an extra five minutes, but even with that extra time, my design was nowhere near completed.

Looking back at that rule and my notes on that rule, I am not sure what happened. I have a lot of things crossed out on the page where I was designing. Taking another look at them, the things that I had crossed out as being wrong were actually correct. Regardless, I am going to say something very reasonable: Shit happens!

We all have bad periods throughout the day, and I just happened to hit one when I tried to work on this rule. It happens. What was more important was what I did when I encountered that situation: I realized it happened. I was a bit upset with myself, but I did the mental equivalent of dusting myself off, worked on one of my Saturday home projects that I needed to deal with, and cooled down. More importantly, I gave myself some space from the project, and time to depressurize.

While the work that I did on the actual task is rather fuzzy, I do remember clearly that I got it out of my system, and quickly. I knew I would get back to it within a week or two, and I would hopefully have a better experience with it at that point. And for me, that was a good thing!

Rule Md035 - Consistent Horizontal Rules¶

As soon as I came to this rule, I started to have flashbacks to Rule Md004. That was not a particularly nasty rule to implement or test, but I remember it as being very finicky. While I thought it was going to be trivial, it ended up being a fair amount of work to get everything right.

But with everything being fair, I must admit that Rule Md004 was one of the reasons that I decided to add my two new rules for this week. That rule was the first one that I started working on as I was recovering, and I did not have a lot of fun implementing it. I do not remember much about it that was positive, but I do remember having to restart the design two or three times to get it right. To be blunt, I do not know if that was because I rushed things and did not do good design upfront or because I was still sick and did poor design. At this stage, it does not matter. I want to learn from my mistakes, hence the new rules.

And this should be an easy one. All Thematic Break elements must use a consistent sequence. Hoping that I did not just jinx myself, I started working on it.

Design¶

This design was simple from the start. If the configuration value is set to consistent, then do not set a sequence to match; otherwise, the configuration value is the sequence to match. In the main token function, that means a small amount of code to deal with setting the style to match if it is not set. Once that is out of the way, it is a simple comparison check: if it fails, it triggers the rule.

It did seem too simple though…

Implementation and Testing¶

But it was not. This was one of the shortest times that I have spent on designing, testing, and implementing a rule to date. All the test scenarios were easy to come up with and implement. The algorithm was just as easy, and quickly coded into a function:

def next_token(self, context, token):
    """
    Event that a new token is being processed.
    """
    if token.is_thematic_break:
        if self.__actual_style:
            if self.__actual_style != token.rest_of_line:
                extra_data = (
                    f"Expected: {self.__actual_style}, Actual: {token.rest_of_line}"
                )
                self.report_next_token_error(
                    context, token, extra_error_information=extra_data
                )
        else:
            self.__actual_style = token.rest_of_line
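
The function above does not show how the configuration value feeds into the style to match, so here is a small sketch of the whole idea under a few assumptions. The `find_inconsistent_breaks` name, the `style` parameter, and the list of break strings are all made up for this example; the rule itself reads its value from the plugin configuration and works on Thematic Break tokens.

```python
from typing import List, Optional, Tuple


def find_inconsistent_breaks(
    breaks: List[str], style: str = "consistent"
) -> List[Tuple[int, str, str]]:
    """Return (index, expected, actual) for each thematic break that does not match."""
    # With "consistent", the first break seen decides the style to match.
    expected: Optional[str] = None if style == "consistent" else style
    failures = []
    for index, actual in enumerate(breaks):
        if expected is None:
            expected = actual
        elif actual != expected:
            failures.append((index, expected, actual))
    return failures


print(find_inconsistent_breaks(["---", "---", "***"]))     # [(2, '---', '***')]
print(find_inconsistent_breaks(["---", "***"], style="***"))  # [(0, '***', '---')]
```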

And yes, I was a bit paranoid. I did some extra checking for possible boundary conditions, but there were none. I think the ghosts of implementing that other rule were just stuck in my head, and I could not stop listening to their whispers. But after that extra checking, things were good, and I moved on.

Rule Md037 - Missed Emphasis Sequences¶

For me, the most interesting rules to develop are those rules where I have to really think about their design and experiment with the existing data. Even with my two new rules in place, I knew that I would have to experiment on a couple of the rules, and this was one of them. That experimentation was to answer a simple question: what does a “missed” emphasis look like? When I started that research, I was worried that I was going to have to do a lot of parsing for the 4 sequences that can be used for emphasis. Instead, my experimentation revealed that I had already solved that problem.

Because of the way certain sequences need to be handled by the PyMarkdown parser, there is a small set of character sequences that are immediately classified as Special Text: emphasis characters and link characters. When any of those sequences are encountered, they are put into Special Text tokens so that they are more readily identifiable. With no need to change them back, I just left each of those sequences in the Special Text tokens.

And now, I got to benefit from that with this design.

Design¶

A lot of this design was simple, but I was aware that there were enough moving parts in it to still make it a tricky design, if not a difficult one. As mentioned above, one good thing is that the emphasis markers were already parsed and stored in their own SpecialTextMarkdownToken instances. Because of this, I did not have to do anything to look for those sequences; they were already extracted for me.

From there, if the token was in one of the three normal text blocks, I needed to append to a token list to keep track of the fact that any following text was within an acceptable text block. Then, if any following tokens were Text tokens, go into a state where the rule looks for emphasis character sequences in Special Text tokens. Once an emphasis sequence is found, look for a matching emphasis sequence to end the emphasized text. Along the way, make sure to handle normal boundary cases, such as an unmatched emphasis sequence in a paragraph.

There were a lot of little “except for” parts in that design, so I used the remaining time in my thirty minutes to work things out on paper with the nine different test scenarios that I created Markdown documents for. I had to do a couple of small last-minute changes, but that was it. It was then on to the implementation phase, and I was a bit nervous about it.

Implementation and Testing¶

The design for this rule was tricky, so I made sure to implement all nine test functions ahead of time. Knowing that the design had lots of small, moving parts, I started with the easiest parts of the design, and moved from there. There were times that I got lost and had to reset that individual section of code. Other than that, the implementation went smoothly. The extra time that I spent in the design phase helped me have confidence that I was taking the right approach.

def next_token(self, context, token):
    if self.__start_emphasis_token:
        if (
            token.is_paragraph_end
            or token.is_setext_heading_end
            or token.is_atx_heading_end
        ):
            del self.__block_stack[-1]
            self.__start_emphasis_token = None
            self.__emphasis_token_list = []
        elif (
            token.is_text
            and token.token_text == self.__start_emphasis_token.token_text
        ):
            self.report_next_token_error(context, self.__start_emphasis_token)

            self.__start_emphasis_token = None
            self.__emphasis_token_list = []
        else:
            self.__emphasis_token_list.append(token)
    elif token.is_paragraph or token.is_setext_heading or token.is_atx_heading:
        self.__block_stack.append(token)
    elif token.is_paragraph_end or token.is_setext_heading_end:
        del self.__block_stack[-1]
    elif (
        token.is_text
        and self.__block_stack
        and (
            self.__block_stack[-1].is_paragraph
            or self.__block_stack[-1].is_setext_heading
            or self.__block_stack[-1].is_atx_heading
        )
    ):
        if token.token_text in ("*", "**", "_", "__"):
            self.__start_emphasis_token = token
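
Stripped of the block tracking, the matching step can be sketched over a plain list of strings. This is a simplified illustration under assumptions: the `find_missed_emphasis` name is made up, and the input list stands in for the Text tokens of a single paragraph, with each leftover emphasis sequence already split out into its own entry.

```python
from typing import List, Optional, Tuple

EMPHASIS_SEQUENCES = ("*", "**", "_", "__")


def find_missed_emphasis(text_pieces: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of matching leftover emphasis sequences."""
    matches = []
    start_index: Optional[int] = None
    for index, piece in enumerate(text_pieces):
        if start_index is None:
            # Not inside a candidate yet: look for a sequence that starts one.
            if piece in EMPHASIS_SEQUENCES:
                start_index = index
        elif piece == text_pieces[start_index]:
            # The same sequence shows up again: this pair looks like missed emphasis.
            matches.append((start_index, index))
            start_index = None
    return matches


pieces = ["this is ", "**", " bold ", "**", " text"]
print(find_missed_emphasis(pieces))  # [(1, 3)]
```

An unmatched sequence, such as a single `*` in a paragraph, produces no pair, which corresponds to the reset that the rule performs when the enclosing block ends.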

Rule Md038 - Code Span Spacing¶

It was just after noon on Sunday, and I wanted to get this rule out of the way before writing this week’s article. In my head, I quickly came up with the design for this one, so I was pretty sure that I could get it done in the time frame I had set for myself.

The big reason for this rule is correctness. Except for a matching pair of single spaces at the start and end of the code span, everything else in the code span is preserved, as is. That means that any unbalanced or extra spaces at the start and end of the code span are preserved as well. While it is possible that this is what the author intended, there is one big problem with this: different parsers handle those spaces differently.

Doing some quick experimentation with the string this is ` a code` span on Babelmark 2, there are parsers that preserve that space, remove that space, and even some that do not consider that text to contain a valid code span. By simply removing that leading space, all parsers align on what the correct parsing and HTML output is for that Markdown example. And that is what this rule is about: matching the output to what the author most likely expected the output to be.

Design¶

It is nice to end a week’s work with some simple stuff, and this was no exception. Because the rule centers on the Code Span element, the design was simply to wait for a Code Span element to appear, and then analyze the text within it. From there the rules are simple. Spaces at the start and end of that text are only acceptable if there is exactly one at each end. The only exception is that the sequence `{space} at the start of the text and {space}` at the end of the text are allowed.

Implementation and Testing¶

With the design being as small as it was, I did not encounter any issues implementing the rule. And with eight separate test functions to make sure things are working, I was able to make quick work of this rule.

def next_token(self, context, token):
    if token.is_inline_code_span:
        has_trailing = False
        if len(token.span_text) == 1:
            has_leading = token.span_text[0] == " "
        else:
            has_leading = token.span_text[0] == " " and token.span_text[1] != "`"
            has_trailing = token.span_text[-1] == " " and token.span_text[-2] != "`"
        if has_leading != has_trailing or has_leading:
            self.report_next_token_error(context, token)
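
The condition is easier to experiment with as a standalone function that takes the span text directly. The `has_bad_code_span_spacing` name is made up for this example, and the guard for empty text is an added assumption; otherwise the sketch mirrors the checks above as written.

```python
def has_bad_code_span_spacing(span_text: str) -> bool:
    """Return True if the code span text would trigger the spacing check."""
    if not span_text:
        return False  # assumption: treat empty text as acceptable
    has_trailing = False
    if len(span_text) == 1:
        has_leading = span_text[0] == " "
    else:
        # A space next to a backtick is allowed at either end.
        has_leading = span_text[0] == " " and span_text[1] != "`"
        has_trailing = span_text[-1] == " " and span_text[-2] != "`"
    return has_leading != has_trailing or has_leading


print(has_bad_code_span_spacing(" a code"))  # True: unbalanced leading space
print(has_bad_code_span_spacing("a code"))   # False
print(has_bad_code_span_spacing(" `a` "))    # False: spaces sit next to backticks
```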

What Was My Experience So Far?¶

When I got Rule Md038 into its mostly finalized state before starting my writing, I was shocked. Hoping to get four rules implemented, I had managed to get seven done, and that was during the weekend. This means I am down to just 21 rules of the initial set to implement. That was something to be proud of.

But more importantly, I was finding that my new temporary rules were helping me focus and stay on track. That was the real bonus. When I got in trouble with previous features, it sometimes took me days to find out how out of focus I was. And even if I realized that I was out of focus before then, there was the feeling of giving up that I had to battle.

With my two new rules, I gave myself strict boundaries to help me accomplish my goal. And it worked. But, as an added bonus, it helped me deal with the emotional aspects of having to put the work down and move on to something else. That was the big win!

What is Next?¶

Yup, more rules, but I do not know which ones and how many I will get done. Stay tuned!

¹ A tabstop of 4 means that when the tab character is encountered, it moves the current position to the next multiple of 4.
