Summary

In my last article, I talked about gearing up for the work that I needed to do on the plugins to get them ready for the release. In this article, I do that work on the already implemented plugin rules.

Introduction

With the project getting close to being ready for the initial release, the list of things left to clean up was down to the actual focus of the project: the plugin rules. Without a healthy set of rules to power it, a linter is usually either a fancy parser engine that is not being used to its fullest or a collection of Regular Expression checks that have not been tested properly. While the PyMarkdown project was definitely not just a fancy parser, I needed to upgrade the initial set of thirteen rules to make sure that was not the case.

Hoping to make a quick couple of days' work of the upgrades, I started working on the plugin rules.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 11 Apr 2021 and 26 Apr 2021.

Before I Start

Apologies to any faithful readers for not posting an article last week. I was recovering from my second COVID-19 shot and was out of commission from about 6 PM on Saturday evening until about 6 PM on Tuesday evening. I was able to do minimal amounts of work during that time frame, but I felt like I was constantly trying to push through a tough wall of fog. After two days of trying to push through that fog, I decided to stop all work, which included waiting a week before posting an article. I was having issues with the quality of my own coding work at that time, and it was only through rigorous testing and linting that I was successful there. From an article-writing point of view, I decided that I would rather wait a week and rest than produce something that was not up to par with my usual posts.

Was it worth it? Yes… unequivocally yes! From a personal viewpoint, I know quite a few people who have had a severe case of COVID and required hospitalization, and even a couple of people who have had a good friend pass away from COVID and its complications. My personal choice is to deal with the side effects of the shots and know I am more protected than before, thereby reducing some of my anxiety from this unusual time we are all in.

To any readers out there: Please make an informed choice as to whether you get vaccinated. It is not only your health that you need to worry about, but also the health of those around you and the possibility of mutations in COVID. Do not just brush off the decision. Read the information from the CDC, read information from reputable sources about what can happen to you and your local groups with and without the vaccines, and make an educated decision.

And now, on to the rest of the article.

Where To Start?

To complete this task, I needed to verify and update each of the 13 plugin rules that had previously been coded, ensuring that they still worked and were tested properly. I was aware that some of those rules were going to need no changes at all, but I was also aware that some might need some decent refactoring to keep up with the changes that had occurred since they were coded. I had guesses as to which rules were going to fall into which category, but I just needed to start working on them to really get an idea of what changes needed to occur.

The First Three Rules

Examining Rule md001, everything looked fine with this rule except for its lack of support for front-matter. Since I execute this rule against my blog posts, which use front-matter to specify metadata about each article, the rule was firing on every article. So, after adding three new scenario tests that include various forms of metadata, I started to make the changes to this rule.

After setting the new __front_matter_title field to title, it was surprisingly easy to add support for the Front Matter token to the next_token function. As the Front Matter token only appears once, at the start of the document, the code was very simple:

    elif token.is_front_matter:
        if self.__front_matter_title in token.matter_map:
            hash_count = 1

Basically, if that token is present, the code checks for the configured title in the map, setting the hash_count variable to 1 if it finds one. At that point, the rest of the code takes control of any following Atx Heading tokens or SetExt Heading tokens as they occur.
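
To give a rough idea of what that "rest of the code" does with the seeded hash_count, here is a hedged sketch of the heading-increment check at the heart of Rule md001. It is an approximation written as a standalone function, not the project's actual source:

    def check_heading_increment(last_hash_count, new_hash_count):
        """Return True if the jump from the previous heading level is allowed."""
        # Rule md001: heading levels should only ever increment by one level.
        if last_hash_count and new_hash_count > last_hash_count + 1:
            return False
        return True

    assert check_heading_increment(1, 2)      # h1 followed by h2 is fine
    assert not check_heading_increment(1, 3)  # h1 followed by h3 fires the rule

With the front-matter title counted as a level 1 heading, a document whose body starts with a level 2 heading no longer trips the rule.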

But while everything looked good, the tests were not passing. As I started debugging, it looked like I needed to do some finessing with the front_matter_markdown_token.py module. In checking the source code against what tokens were being produced, there were two small issues that I needed to take care of.

The first change I needed to make was to add a call to the lower function to this line:

    value_map[current_title.lower()] = current_value

Because I want to do a simple lookup inside of the value_map dictionary, I want to get the current_title into a standardized form. If I did not do that, I would have to try an increasingly large combination of forms, such as Title, title, TiTle, tItLe, and so on. By standardizing on a single form, I keep the number of lookups required down to one. Hence, I appended the .lower() to that variable when using it as a key in the dictionary, standardizing on a lowercase metadata name.

The other issue was similar, but it concerned a standardized form for the data stored under that name. If possible, any value looked up from the dictionary should be usable without any modification. But as the code stood, any leading or trailing whitespace was being kept in the current_value variable along with the rest of the data. To me, it made more sense to take the same care with the front-matter values as with the names. To that end, I made sure to add a call to the strip built-in function before adding the string to the current_value variable. While it was not perfect, I was pretty sure it would handle a good 80% of the scenarios I already had in mind.
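
Put together, those two changes amount to normalizing both halves of each front-matter field. The following is a minimal sketch of that idea with hypothetical names, not the actual code from the front_matter_markdown_token.py module:

    def normalize_front_matter_fields(raw_fields):
        """Build a lookup map with lowercased keys and stripped values."""
        value_map = {}
        for current_title, current_value in raw_fields:
            # One standard key form and one clean value means one simple lookup later.
            value_map[current_title.lower()] = current_value.strip()
        return value_map

    # normalize_front_matter_fields([("Title", "  My Post  ")]) -> {"title": "My Post"}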

Finally, to round things out, I added some configuration handling to fetch a configurable title string to use:

    def initialize_from_config(self):
        self.__front_matter_title = self.plugin_configuration.get_string_property(
            "front_matter_title", default_value="title"
        ).lower()

The Other Two Rules

Because Rule md002 and Rule md003 are very narrowly focused, they did not require any changes to the source code or the test code. I did play around with (or experiment with, if you prefer) various combinations for about an hour on each rule, trying to find gaps in both of them. But in both cases, any failures that I found boiled down to a scenario that was already covered by the existing scenario tests.

The Long March

Yup, 13 days passed between the last commit before this change and the commit that included it. As with a fair number of previous difficult tasks, I tried to be smart about it, and ended up choosing what I believe to be the simplest solution.

The Research

At the beginning, I added 17 new scenario tests to deal with Block Quote elements, List elements, and inline elements. Right away, there were over ten test failures and a handful of cases that threw an exception when the rule was executed. At that point, this was part of the code in the next_token function:

elif token.is_text and self.__last_paragraph_token:
    split_whitespace = self.__last_paragraph_token.extracted_whitespace.split(
        "\n"
    )
    split_text = token.token_text.split("\n")
    assert len(split_whitespace) == len(split_text)

    for split_index, next_text in enumerate(split_text):
        combined_text = split_whitespace[split_index] + next_text
        if re.search(r"^\s{0,3}#{1,6}\S", combined_text) and not re.search(
            r"#\s*$", combined_text
        ):
            self.report_next_token_error(context, token)

The exceptions were occurring due to the assert statement failing. It was not failing on all tests, just the ones that had multiple Text tokens within the bounds of Paragraph tokens that spanned multiple lines. As soon as that assert statement was hit in those cases, the assert would trigger, and rightly so. In those cases, only some of the newlines were attributed to and stored with that first Text token, with the rest of the newlines being stored in the various inline tokens where they occurred. And as soon as that first Text token was hit, the assert statement evaluated to False and the rule failed.
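
To make that concrete, here is a small, hypothetical illustration of the mismatch, using invented values rather than the rule's real data. For a two-line paragraph whose first line contains a code span, the Paragraph token's extracted_whitespace covers both lines, but the first Text token only covers the text before the code span:

    # Paragraph input, conceptually:
    #   a line with a `code span` on it
    #   #not-a-heading
    extracted_whitespace = "\n"                 # one entry per line when split
    first_text_token_text = "a line with a "    # only the text before the code span

    split_whitespace = extracted_whitespace.split("\n")  # ['', ''] -> two lines
    split_text = first_text_token_text.split("\n")       # ['a line with a '] -> one line
    assert len(split_whitespace) == len(split_text)      # raises AssertionError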

Lather, Rinse, Repeat

Once again, I started out with the best of intentions and wanted to come up with a solution that was very workable and very smart. I have talked about this before in my articles, and I am pretty sure that I will talk about it in future articles. I do not purposefully do it, nor is it a matter of pride or a matter of proving myself to someone. As part of my nature, I want to keep quality and performance at a high level, so I strive to come up with the best solution that maximizes those concepts. And, I also admit, I sometimes get carried away.

So, with the best of intentions, I tried four different variations on a theme, each one trying to solve the issue in a slightly different manner. One incarnation tried to keep every token that occurred in a Python list, only to try and parse through it later. Another incarnation tried to be smarter about handling Link tokens but did not pay enough attention to the other types of tokens. It was just a lot of effort, with things getting close to a proper solution before falling apart. And with each incarnation of the solution, I got more frustrated until I had enough.

It was then that I decided to take a step back and think things through more clearly.

Learning From Mistakes

There were a couple of things that I learned from all those iterations, painful as they were to remember. The big thing that I noticed, once again, was that complicated solutions were just tripping over themselves. I would set a value here, access it there, and then when I went to reference it in another piece of the source code, I was surprised that its value had changed. I did not need complicated variables to hold the information I needed; I just needed to have “just enough” information.

Only a small distance away from that concept, I needed to keep my solutions small enough to hold “just enough” information to complete the task. Of the four solutions that I tried in that week, three of them tried to keep everything that occurred within the Paragraph element in a list to be analyzed later. It was just crazy the things I was trying to do to keep that data in sync with the tokens as the information came through.

I needed to take the best of what worked and learn from what did not. And I had a feeling that it would be a simple solution that I had missed. And I was right.

Final Solution

When I sat back and looked at what I had tried, at what worked and what did not, the easy part that worked was handling the Text token itself while not referencing Text tokens within an inline Link element. To cleanly take care of that, I replaced this code:

elif token.is_text and self.__last_paragraph_token:

with the slightly more complicated:

elif self.__last_paragraph_token:
    if self.__inside_of_link:
        if token.is_inline_link_end:
            self.__inside_of_link = False
    elif token.is_text:

Essentially, ignore any tokens inside of a Link token until the rule is looking outside of that Link token’s scope. Once that was done, I was able to continue with the processing of Text tokens within a Paragraph token with the following code:

split_whitespace = (
    self.__last_paragraph_token.extracted_whitespace.split("\n")
)
split_text = token.token_text.split("\n")

for split_index, next_text in enumerate(split_text):
    combined_text = (
        split_whitespace[split_index + self.__paragraph_index]
        + next_text
    )
    if (
        self.__first_line_after_hard_break
        or self.__first_line_after_other_token
    ) or split_index:
        if self.__first_line_after_other_token:
            adjusted_column_number = self.__paragraph_column_number
        else:
            adjusted_column_number = (
                self.__paragraph_column_number
                + len(split_whitespace[
                        split_index + self.__paragraph_index]
                )
            )

While it looks a lot more complicated than the code that it replaces, those two if statements are there solely to ensure that the column number is calculated properly. From that point of view, I think it was a good win! And when I said above that I needed the algorithm and variables to have “just enough” information, I believe I met that constraint here. The __paragraph_index variable is the only thing that was added to the main part of the changed code, as that was the only thing required to ensure that the right part of the Paragraph token’s leading space field was being used. The other fields that are present in the code do not try to reconstruct the data; they just report on what the algorithm encountered.
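
As a hypothetical worked example of that indexing, with invented values rather than project data: if earlier tokens have already accounted for the first two lines of a three-line paragraph, then __paragraph_index is 2, and the current Text token's first line pairs with the third leading-space entry.

    split_whitespace = ["", "  ", " "]   # one leading-space entry per paragraph line
    paragraph_index = 2                  # lines already consumed by earlier tokens
    split_text = ["#not-a-heading"]      # lines within the current Text token

    split_index = 0
    combined_text = (
        split_whitespace[split_index + paragraph_index] + split_text[split_index]
    )
    # combined_text == " #not-a-heading", ready for the heading pattern check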

With that main part of the solution in place, there were no more exceptions being thrown, as the assert statement had been removed. From my work with the consistency checks for the tokens, I was confident that everything lined up with respect to the newlines in the tokens; I just had to keep track of them in this rule. So, I added code that only tracked those changes, and nothing else:

elif token.is_inline_code_span:
    self.__paragraph_index += (
        token.leading_whitespace.count("\n")
        + token.span_text.count("\n")
        + token.trailing_whitespace.count("\n")
    )
elif token.is_inline_raw_html:
    self.__paragraph_index += token.raw_tag.count("\n")
elif token.is_inline_image or token.is_inline_link:
    self.__paragraph_index += token.text_from_blocks.count("\n")
    if token.label_type == "inline":
        self.__paragraph_index += token.before_link_whitespace.count("\n")
        self.__paragraph_index += token.before_title_whitespace.count("\n")
        self.__paragraph_index += token.after_title_whitespace.count("\n")
        self.__paragraph_index += token.active_link_title.count("\n")
    if token.label_type == "full":
        self.__paragraph_index += token.ex_label.count("\n")
    self.__inside_of_link = token.is_inline_link
elif token.is_inline_hard_break:
    self.__paragraph_index += 1
    self.__first_line_after_hard_break = True

The two exceptions to that statement were the Hard Line Break token and the Link token. As I mentioned previously, to deal with Link elements properly, I needed to keep track of when the rule was looking at tokens within the scope of a Link token, hence self.__inside_of_link. In a similar vein, as I had decided that any inline element on a line would invalidate that line for this rule, I needed to track the presence of that token, hence self.__first_line_after_hard_break.

I believe that the thing that made the above code work well was that I was primarily tracking only the one thing that I needed to: the __paragraph_index field. I did not have to try and reconstitute anything to figure that out; I just had to count newline characters. And to be honest, I think I might work on refactoring that out in the future as well.

Finally, adding in some initialization code when a Paragraph token was encountered:

self.__paragraph_index = 0
self.__first_line_after_other_token = True
self.__first_line_after_hard_break = False
self.__inside_of_link = False
self.__paragraph_column_number = token.column_number

and some code to handle the end of a line within a Text token:

    self.__first_line_after_other_token = False
    self.__first_line_after_hard_break = False
self.__paragraph_index += token.token_text.count("\n")

and I was done. Of course, it took a bit longer to actually arrive at that solution, but not much more. I focused on one small set of scenario tests and worked to get those passing before moving on to another set of tests. If my notes are accurate, I started with plain Text tokens, then multiple-line Text tokens, then Text tokens with various forms of Code Span tokens scattered around them, and then went on to the other inline tokens.

Rule md019

After all that work, it was nice to check out Rule md019. Like Rule md002 and Rule md003 above, the constrained and focused nature of this rule led to it being just fine as it was. No changes required.

The Next Four Rules

After taking a long time to get Rule md018 finished, it was with some trepidation that I started to work on Rule md020. I was not sure if it was going to be as bad as the previous rule, but I had to push forward anyway.

Rule md020

Starting on Rule md020, I copied over 21 new scenario tests from the scenario tests for Rule md018, modifying them to test for possible Atx Closed Heading elements instead of Atx Heading elements. After the modifications were made, I carefully determined what the correct output for each of the new scenario tests was, and coded that in.

Trying to leverage the work that I did in modifying the code for Rule md018, I looked at that code, determined to refactor it. To that end, I took the bulk of the code from the RuleMd018 class and moved it into the StartOfLineTokenParser class. In place of the conditional that determines if the report_next_token_error function should be called, I created a new check_start_of_line function and made it do nothing. Then I implemented a new MyStartOfLineTokenParser class, and moved the two conditionals from the original RuleMd018 class into its check_start_of_line function:

class MyStartOfLineTokenParser(StartOfLineTokenParser):

    def __init__(self, owner):
        super().__init__()
        self.__owner = owner

    def check_start_of_line(
        self, combined_text, context, token, line_number_delta, column_number_delta
    ):
        if re.search(r"^\s{0,3}#{1,6}\S", combined_text) and not re.search(
            r"#\s*$", combined_text
        ):
            self.__owner.report_next_token_error(
                context,
                token,
                line_number_delta=line_number_delta,
                column_number_delta=column_number_delta,
            )

Then, in the RuleMd018 class, after creating a new instance of the MyStartOfLineTokenParser class in the __init__ function, the wiring up of the calls to that instance was simple:

    def starting_new_file(self):
        self.__token_parser.starting_new_file()

    def next_token(self, context, token):
        self.__token_parser.next_token(context, token)
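
For completeness, the creation of that instance in the __init__ function would look something like the following sketch, inferred from the excerpts above rather than copied from the project:

    def __init__(self):
        super().__init__()
        # Give the parser a reference back to this rule so it can report errors.
        self.__token_parser = MyStartOfLineTokenParser(self)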

With that refactor accomplished and tested, it was then on to the actual rule that I wanted to work on, Rule md020. Reusing the StartOfLineTokenParser class, I created another MyStartOfLineTokenParser class in the rule_md_020.py module. The only change that was required to support the new rule was to replace the conditional in the check_start_of_line function with the following conditional:

    if re.search(r"^\s{0,3}#{1,6}.*#+\s*$", combined_text):

Because I took the time to do a solid refactor of Rule md018, reusing the StartOfLineTokenParser class in this rule was a smart move. Things just worked with little effort. A definite plus!

Rules md021, md022 and md023

While it was not as much work as Rule md018, Rule md020 did take a bit of work to get done. Looking at Rule md021, I knew that was not going to be the case. Similar to the code for Rule md019, the code for this rule was very simple and straightforward. As such, no changes were required in either the rule code or the test code.

Taking a quick look at the source code for Rule md022 and Rule md023, my hope was that they did not need any changes. In both cases, the code seemed to be very narrowly confined to working on existing Heading tokens and the text contained within those headings. Looking at the issues list, I addressed the concerns about these two rules by adding extra tests for Block Quote elements and List elements for both rules, as well as extra line spacing tests for Rule md022.

After that research and executing the scenario tests, that initial observation on the simplicity of these rules was confirmed. With the simple addition of some extra scenario tests, I was satisfied that the scenarios were adequately covered.

The Final Four

Proceeding on to Rule md024, the code itself was solid and did not require any changes. However, to ensure things were properly tested, I added 6 new scenario tests for the rule. As testing support for rules within Block Quote elements and List elements had been lacking for some of the previous rules, I made sure to add tests for different forms of those elements. In addition, as this rule deals with two headings that have the same text, I added extra tests to verify that adding a single space or an inline element (such as emphasis) is enough to make the headings not equal.

Looking at Rule md026, the code was bulky and could use a bit of refactoring, but it was solid. As this rule deals with the content of Atx Heading tokens and SetExt Heading tokens, there was no need to check out any weird conditions. Doing my due diligence, I ran through some experimental tests to see if I had missed anything. Everything that I threw at the rule emitted the data that I expected without fail, including the line number and column number. After checking to make sure that some form of each experimental test was present in the scenario tests, it was on to the next rule.

Similar to the previous two rules, Rule md036 did not require any changes to the code, as it deals with a focused examination of emphasized text within a Paragraph element. And as with the scenario tests for the previous rules, some quick experimental testing did not reveal anything that had been missed, outside of containing those scenarios within Block Quote elements and List elements. Those two scenario tests were easily added and passed without any code changes, so it was on to the last rule.

To say that ending up with Rule md047 was anticlimactic would be an understatement. This rule has one and only one very narrowly defined condition: does the document end with a blank line? Let me just say that it took less than 10 minutes to verify that the code was working as desired and that the scenario tests were covering everything properly. Yup, that easy.

What Was My Experience So Far?

Well, the first thing that I need to confess to my readers is that I did not listen to my own advice. I want to release this project so much that I pushed ahead with coding work on the rules even though I knew I was having issues with the quality of that work. And to be honest, I ended up paying for it with headaches and having to reexamine the code that I had already completed to make sure that I did not mess anything up. Hopefully one of these days I will learn to not push ahead so much.

As to the couple of days that I had hoped to take to get the rules into shape, that did not happen. It is okay that it did not happen; the work took as long as it needed to take to get to a solution that I am confident about. Sure, I would love to get back some of that week that I lost working on Rule md018, but it happened. Most of the rules took a decent one to two hours to properly inspect and fix, and I am okay with that. What I am concerned with is the level of confidence that I have in shipping an initial release with those rules.

To that end, I think I am doing very well. I made some notes about things to check in the rules, like I always do. But I am certain that anything that I find will be a small issue, not a big issue like Rule md018. I am also certain that when I start documenting the various rules for the initial release, I will find at least a couple of inconsistencies that I will need to look at. Never a dull moment!

What is Next?

With each plugin rule examined and working properly, I need to get back to my proper coding mindset and make sure that each rule passes all quality bars and is properly documented. Stay tuned!



