Summary

In my last article, I talked about the work I did to get the already implemented plugins ready for the release. In this article, I talk about the remaining work that I needed to do to fine-tune the plugins and document them for the release.

Introduction

Having just updated the rules and verified that they worked the way I envisioned them working, there were still two big tasks and a handful of small tasks left to do before release. The first big task was to ensure that each plugin rule was triggering properly and reporting the correct error position. The second was to ensure that each plugin rule had a good start at usable documentation. To wrap things up, I needed to keep whittling down the little items in the issues list that I had continued to find during other tasks.

Knowing that an initial release was inevitable in the next couple of weeks, I moved forward on the project.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 26 Apr 2021 and 28 Apr 2021.

Open-Source Software Projects And Your Life

Looking at the time between this commit and this commit, there is a noticeable gap of 9 days. While some of the time in that gap was spent trying to figure out the right wording for the documentation on the rules, a lot of it was not. That is what I want to talk about in this section: open-source software projects and your life.

I love my PyMarkdown project, I really do, but it is not my life. In the grand scheme of things, it is important, but not more than my health, my family, my job, and my sanity. Even though I was feeling better after the previous weekend, I was not at 100%. To be honest, for that week, if I was at 60% on any day, it was a great day. And while some people might disagree with my reasons, my priorities that week were on anything except the PyMarkdown project.

And it was not a difficult decision to make either. After working at my day job and doing what I could around the house, I had little energy left to do anything else. When I could, I did work on the project, but when my energy ran out, I locked my computer and went somewhere to depressurize and rest. I did not think twice about this either. The reason I can spend evenings and weekends working on this project is my job. The reason that I have the mental energy to work on this project is due to my health. The reason that I have the emotional energy to work on this project is due to my family and my sanity. Unless I want to develop negative feelings about this project, I need to ensure that I work on the PyMarkdown project for the right reasons and with the right mindset.

That decision did set me back a week or so, but it was the right thing for me to do. For any other open-source contributors out there, please figure out what is important to you and manage your projects based on that list of priorities. Most importantly, do not feel bad if that project does not always make it to the top of the list. It is healthier if it does not.

The Big Scan: Verifying Existing Rules

There was no way around this task. I needed to go through each of the rules and their test data, verifying that the scenario tests were reporting the right number of rule violations at the right positions. If it would help others think that I was not going insane, I could create a myth that I played a fun game to get it done, but it would just be a myth.

The truth is, I was looking forward to this. Yeah, I had no delusions about it being hard work; of that I was certain. But this was the culmination of over a year’s worth of work, and I wanted it to look and feel right. With that mindset, the only true way to do it was to check every scenario test and test configuration against every test Markdown file and verify the results manually.

Along the way, I also decided to create the first iteration of the documentation for each rule, placing notes in there that would help me write the final documentation. As all thirteen of the existing rules originated from the MarkdownLint rules, I opened each Markdown file in my editor and compared my manual results against the results produced by MarkdownLint.

From my viewpoint, I sincerely believe that this is going to be useful to users. As an avid user of MarkdownLint inside of VSCode, I find it useful to get warnings about my Markdown as I am creating documents. Because the PyMarkdown project is narrowly focused on the GFM Specification, instead of MarkdownLint’s any-parser approach, as a user I would want to know what those differences are so I could understand the tool better. And if I know that I would want information on the differences documented, I am going to assume that there are others out there who share that need.

I also want to mention that, to keep my sanity (or at least try to), I focused on one rule at a time to the exclusion of everything else. I reviewed the code for the rule, made sure that I had a solid understanding of how I believed that rule should work, and then started iterating between scenario tests and the rules they were testing.

And Of Course… Scenario Tests

I will admit that when I was going through the scenario tests, I did curse a little. Because I am being honest, I should probably admit that I cursed a lot. It was a big job, and it took a long time. But keeping my focus on a single rule helped me retain my calm, even though some of the scenario test files are huge.

But there were also benefits to my approach. When I originally developed some of these rules, I did not have all of the command line options that the project now has. As a result, some of the scenario test output included non-relevant rules being triggered. I was able to clean up a lot of tests by simply adding --disable-rules md0xx, with the specific rule id substituted in, to the command line for that test. That cleared things up very quickly.
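
As a purely illustrative sketch of what that cleanup looked like (the rule id and file path here are hypothetical, not copied from the project’s test suite), the change to a scenario test amounted to little more than an extra pair of entries in its argument list:

    # Hypothetical scenario-test arguments: disabling an unrelated rule keeps
    # its violations out of the expected output, so only the rule under test
    # can trigger on this Markdown file.
    supplied_arguments = [
        "--disable-rules",
        "md022",
        "test/resources/rules/md003/sample_document.md",
    ]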

Other than some small things, it was usually very easy to figure out when a triggered rule was reporting the wrong position or misfiring. While the “inside of list” and “inside of block quote” cases were more difficult to spot, the rest of the errors almost always came down to either the wrong line being reported or the position (0,0) being reported. That made them easy to spot.

But once I spotted issues, I needed to go back to the code for the rule that I was working on and determine if the test result was bad or if the rule was bad. And in cases where the rule was bad, I needed to fix that rule.

Fixing Up The Rules

As I went through the rules, most of them just required a bit of cleaning up and the occasional adjustment here and there. A good example of those adjustments was the change made in the RuleMd024.py module to pass the correct arguments into the report_next_token_error function. When this rule called the report_next_token_error function, it did so in a way that the reported position was always (0,0) for SetExt Heading tokens. A small adjustment to the token being passed in and the addition of the use_original_position argument quickly took care of that issue with a minimal amount of changed code.
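
To illustrate the idea behind that argument, here is a contrived, self-contained sketch. It is not the project’s plugin framework; it is just a mock of the behaviour described above, where a SetExt Heading reports (0,0) unless its original position is asked for.

    # Contrived mock of the reporting behaviour, not PyMarkdown's actual classes.
    class FakeSetExtHeadingToken:
        def __init__(self):
            # Position that was being reported before the fix.
            self.line_number, self.column_number = 0, 0
            # Position where the heading text originally started.
            self.original_line_number, self.original_column_number = 3, 1

    def report_position(token, use_original_position=False):
        if use_original_position:
            return token.original_line_number, token.original_column_number
        return token.line_number, token.column_number

    token = FakeSetExtHeadingToken()
    print(report_position(token))  # (0, 0) -> not useful to anyone
    print(report_position(token, use_original_position=True))  # (3, 1) -> actionable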

A similar issue occurred with the RuleMd026.py module and how it calculated the position of the error. While slightly more complicated, it took me a bit to make the changes to deal with both Atx Heading tokens and SetExt Heading tokens properly. And it was when I looked at Rule md023 that I realized it should be disabled by default. While some other parsers have problems with Atx Heading elements and SetExt Heading elements that do not start at the beginning of the line, the GFM Specification does not. As such, it makes sense to me to include it as a rule, but to disable it by default.

Even the work that I did on Rule md018 in the last article needed to be checked, and I found a couple of small issues there. However, for this rule, the big change was that I needed to break up the functions with lots of statements into a collection of smaller functions, each more accurately named. As I refactored into those smaller functions, I fixed the issues as I went.

Adding Proper Documentation For Plugin Rules

Taking my time with this task, per the above section on Open-Source Software Projects And Your Life, I worked through it at a leisurely pace. While I did make a couple of format changes and section title changes, the relative content of the sections was already fixed in my mind when I started. It just took a bit of experimentation to figure out the correct transformation from what was in my mind to the right layout in a Markdown file.

This task was all about taking the information that was in my head about each rule and documenting it in a manner that a user running into a violation of that rule would find useful. Starting with Rule md001, this meant that I needed to have any aliases for the rule at the top, followed by a one-line summary of the rule. Based on experience, if you are looking at a rule, you want something quick and easy at the top of the documentation to make sure you are looking at the documentation for the right rule. Next was a paragraph explaining the reason that the rule is needed, followed by good examples of what the rule triggers on and what it does not trigger on. Once you know that you have the right rule, you then want to understand why that rule is important and start to figure out why it triggered on your specific example. After that comes the information on any configuration that applies to the rule. While most of the time a user will go “yeah, that sounds reasonable”, short of disabling the rule altogether, good documentation on configuration lets that user know how far they can bend the rule, and in what direction.

The last section was the most interesting for me to add: the origination of the rule. While all the current rules are based on rules from David Anson’s MarkdownLint project, some changes have been made to bring them in line with the GFM Specification. As such, I thought it was important to give some references that contributed to each rule, as well as note any changes made to it. Basically, if I am going to implement a similar rule, I want to make sure I reference where I got the idea from.

Once I had the first rule done, the remaining twelve rules were relatively easy. Most of the content for these documents was already in the source code for the rule or the source code for the tests for that rule. It was just a matter of extracting that information and restructuring it for the documentation. The remaining information needed for the rules was primarily in the Reasoning section, and that was easily crafted after a bit of research.

Providing For Strict Rule Configuration

This change stemmed from a simple test that I made of the configuration for Rule md022. When I set a configuration value to a negative number, the existing code for the ApplicationProperties class did not complain. Given the state of that class, that was the correct behavior. However, in my mind, I had set that configuration value to an invalid value, and I wanted to know that it was invalid instead of the ApplicationProperties class quietly assuming the default value. Basically, I wanted to enable a “strict configuration mode” for that instance of the ApplicationProperties class.

Starting with the tests, I added two new tests for Rule md022 with a negative number specified for its lines_below configuration value. One of those tests was created with my proposed “strict configuration mode” enabled and the other was created without it enabled. The test without the new mode enabled passed right away, returning errors for the lines_above value, which had a valid value of 2, but not returning any errors for the invalid lines_below value. With the alternating_heading_types.md Markdown file and the default value for lines_below, that was the correct response.

With the base test working, I needed to add the code to wire up a new command line switch with the ApplicationProperties class. Having added new command line switches before, I knew it was only a small amount of effort to change the main.py module to add the argument:

    parser.add_argument(
        "--strict-config",
        dest="strict_configuration",
        action="store_true",
        default=False,
        help="throw an error if configuration is bad, instead of assuming default",
    )

and then a similarly small amount of effort to pass it into the properties class:

    if args.strict_configuration:
        self.__properties.enable_strict_mode()

As the enable_strict_mode function did not exist, I created it as a simple function that enables the flag as follows:

    def enable_strict_mode(self):
        """
        Sets strict mode to True to enable it.
        """
        self.__strict_mode = True

I did go back and forth on whether to provide an “enable” function or a “setter” function but landed on the “enable” function. The easier way to set the self.__strict_mode member variable would be to pass it in as one of the arguments to the constructor function.

Thinking through various scenarios, I concluded that everything boiled down to two of them. In the first scenario, whether to use strict mode is known when the ApplicationProperties instance is created and is passed into the constructor. In the second scenario, that decision needs to be delayed until a later point where it is known. In that case, it makes more sense to start with strict mode disabled, enabling it if required. As the configuration is read in first, and only then can the decision to use strict mode be made, I went with the second option.
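
As a minimal sketch of those two scenarios (this is not the real ApplicationProperties class, just the shape of the decision), the constructor covers the first scenario and the enable function covers the second:

    # Minimal sketch only; the real ApplicationProperties class does far more than this.
    class ApplicationProperties:
        def __init__(self, strict_mode=False):
            # Scenario 1: the caller already knows whether to be strict.
            self.__strict_mode = strict_mode

        def enable_strict_mode(self):
            # Scenario 2: the decision arrives later, after the instance exists.
            self.__strict_mode = True

        @property
        def strict_mode(self):
            return self.__strict_mode

    properties = ApplicationProperties()
    # ... command line arguments and configuration get read in here ...
    properties.enable_strict_mode()
    assert properties.strict_mode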

Altering The Rule

With the above code in place, I needed to modify the rule to perform a better check on the configuration value. Starting with:

    def initialize_from_config(self):
        self.__lines_above = self.plugin_configuration.get_integer_property(
            "lines_above", default_value=1
        )
        if self.__lines_above < 0:
            pass
        self.__lines_below = self.plugin_configuration.get_integer_property(
            "lines_below", default_value=1
        )
        if self.__lines_below < 0:
            pass

I quickly changed it to:

    @classmethod
    def __validate_configuration_value(cls, found_value):
        if found_value < 0:
            raise ValueError("Value must be zero or a positive integer.")

    def initialize_from_config(self):
        self.__lines_above = self.plugin_configuration.get_integer_property(
            "lines_above",
            default_value=1,
            valid_value_fn=RuleMd022.__validate_configuration_value,
        )
        self.__lines_below = self.plugin_configuration.get_integer_property(
            "lines_below",
            default_value=1,
            valid_value_fn=RuleMd022.__validate_configuration_value,
        )

The big thing to notice here is that I added the __validate_configuration_value function to handle the verification, leveraging the valid_value_fn argument of the ApplicationProperties class’s get_integer_property function. This really did simplify the needed logic.
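
My mental model of what that hook does is roughly the stand-in below. This is a simplified sketch, not the real get_integer_property implementation: the property value is fetched, the validation function is called on it, and any ValueError it raises is wrapped with the name of the offending property.

    # Simplified stand-in for the valid_value_fn behaviour, not the real class.
    def get_integer_property(properties, property_name, default_value, valid_value_fn=None):
        found_value = int(properties.get(property_name, default_value))
        if valid_value_fn:
            try:
                valid_value_fn(found_value)
            except ValueError as this_error:
                raise ValueError(
                    f"The value for property '{property_name}' is not valid: {this_error}"
                ) from this_error
        return found_value

    def validate(found_value):
        if found_value < 0:
            raise ValueError("Value must be zero or a positive integer.")

    try:
        get_integer_property(
            {"plugins.md022.lines_below": -2},
            "plugins.md022.lines_below",
            1,
            validate,
        )
    except ValueError as error:
        # Prints: The value for property 'plugins.md022.lines_below' is not valid:
        #         Value must be zero or a positive integer.
        print(error)

Getting that wrapped message all the way out to the command line in a readable form is what the next change was about.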

Debugging these changes, I cleaned up a couple of function default parameters to ensure that I could pass the correct values through, and everything worked. When the verification function failed, the command line returned an error, but it was a generic one. Looking in the PluginManager class, there is code behind the BadPluginError class to take an error and translate it into something readable:

    formatted_message = (
        f"Plugin id '{plugin_id.upper()}' had a critical "
        + f"failure during the '{str(plugin_action)}' action."
    )

It was good that the error was surfacing, but it lacked the detail needed for the user to act on it. Therefore, I added a new cause argument to the BadPluginError constructor function and rewrote that code slightly:

    if cause and isinstance(cause, ValueError):
        formatted_message = str(cause)
    else:
        formatted_message = (
            f"Plugin id '{plugin_id.upper()}' had a critical "
            + f"failure during the '{str(plugin_action)}' action."
        )

It was a small change, but that change allowed errors with the configuration values to be surfaced as:

    The value for property 'plugins.md022.lines_below' is not valid:
    Value must be zero or a positive integer.

To me, that was definitely actionable!

Atx Heading and Spaces & Tabs Vs. Spaces

So, this one was a bit weird, but it had been on my mind for a while. When I was going through the GFM Specification to verify one of the past issues dealing with Atx Heading elements, I read the following two lines from the section on Atx Headings:

An ATX heading consists of a string of characters, parsed as inline content, between an opening sequence of 1–6 unescaped # characters and an optional closing sequence of any number of unescaped # characters. The opening sequence of # characters must be followed by a space or by the end of line.

Very specifically, it mentions a space character and not any whitespace characters. I remember this clearly, because it seems to contradict the first two lines of the section on Tabs:

Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.

That text is backed up by example 10 in that section, which has a Markdown of

    #→Foo

(where → stands in for a tab character) and an HTML output of

    <h1>Foo</h1>

A tab directly after the opening hash character still produces a heading. This seems contradictory, so what to do?

Thinking It Through

I guess the best place to start is with the statement that while all specifications try to be perfect, most specifications take a while to get to that state. The GFM Specification is no different from any other specification in that regard. To help address the issue that I found, I posted a question to the Markdown forums to make sure that it got talked about. But I still had to do something with the parser to move it along. So what was the right answer?

From my experience, most specifications are layered to provide general context at a high level and a more specific context at a lower level. A good example of this could be a specification that says, “in the English language, i always comes before e” while the specification for a particular section says “except after c”. In that given section, I then expect to have an explanation as to why that section is an exception, with good reasons to back that decision up. This provides for a good foundation in the general case, with room to change things at a lower level if required.

Based on that experience and those two seemingly contradictory statements, I decided to change the code to not allow tab characters in the whitespace around the starting or ending hash characters of Atx Heading elements. There was one big factor that assisted me in making that change: it was a simple enough change that I could always revert it with little effort if necessary. That meant that I could adhere to the specification as I was reading it, and if I read it wrong, it was reversible.

That was good enough for me!

Resolving The Issue

Having made the decision to go with the more specific part of the specification, there were some changes that I needed to make. The first thing was to add eight new tests: two tests each for Rules md018, md019, md020, and md021. These tests are variations on existing tests that also include tabs in the area between the hash character (#) and the start of the heading text. With those tests ready to go, it was time to make the changes.

For detecting the opening part of the Atx Heading, the only change needed in the is_atx_heading function was to change this:

    (
        non_whitespace_index,
        extracted_whitespace_at_start,
    ) = ParserHelper.extract_whitespace(line_to_parse, new_index)

to this:

    non_whitespace_index = ParserHelper.collect_while_character(
        line_to_parse, new_index, " "
    )
    extracted_whitespace_at_start = line_to_parse[
        new_index : non_whitespace_index[1]
    ]

This change simply replaces the call to the generic extract_whitespace function with a call that is targeted specifically at the space character. As the collect_while_character function returns slightly different information than the extract_whitespace function, an extra statement is required to compute the value of the extracted_whitespace_at_start variable.
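
To make the intent of that change concrete, here is a small standalone illustration. It is not the project’s ParserHelper implementation, just a sketch of the behaviour being relied on: only space characters are collected, so a tab after the hash character is left alone.

    # Standalone illustration, not the project's ParserHelper implementation.
    def collect_while_character(text_to_parse, index, match_character):
        start_index = index
        while index < len(text_to_parse) and text_to_parse[index] == match_character:
            index += 1
        # Return the count of matched characters and the index just past them.
        return index - start_index, index

    print(collect_while_character("# Foo", 1, " "))   # (1, 2) -> a space follows the hash
    print(collect_while_character("#\tFoo", 1, " "))  # (0, 1) -> the tab is not collected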

With the code for the opening hash characters completed, I needed to shift focus to the ending hash characters. These changes were both within the parse_atx_headings function, in the portion of the code after the ending hash characters are detected and removed. Like the changes required to focus on space characters at the start of the Atx Heading element, the ending hash characters required the same type of changes, from:

    if ParserHelper.is_character_at_index_whitespace(
        remaining_line, end_index - 1):
        remaining_line = remaining_line[:end_index]
        (
            end_index,
            extracted_whitespace_before_end,
        ) = ParserHelper.extract_whitespace_from_end(remaining_line)

to:

    if ParserHelper.is_character_at_index(
        remaining_line, end_index - 1, " "):
        remaining_line = remaining_line[:end_index]
        non_whitespace_index = (
            ParserHelper.collect_backwards_while_character(
                remaining_line, len(remaining_line) - 1, " "
            )
        )
        end_index = non_whitespace_index[1]
        extracted_whitespace_before_end = remaining_line[end_index:]

Once again, the *whitespace* function calls needed to be replaced with functions focused on the single space character. And once again, to keep the rest of the code the same, some extra computation was required.

I tested this code in two blocks, the starting hash characters and the ending hash characters. This worked out well and I was quickly able to get both changes made and tested. While it was not a big change, it was one more thing off the list, so it was good.

What Was My Experience So Far?

While I do not like that getting sick caused me to slip another week before the initial release, I did need the time to get back into proper form after my second COVID-19 vaccine shot. From experience, if I push myself more than a given amount, I am not only borrowing energy from the near future but prolonging the time it takes me to recover. Basically, I can take two to four days of “I’ll do stuff when I can” and get back to normal, or I can push through that period. If I do push, it usually means that instead of two to four days, it will take me four to six days to get back to normal. It truly is not fair, but it just is.

In terms of the project itself, I can feel the mental list of project checkboxes being checked off one by one. While I am still finding things that need fixing, those items can either be postponed or will only take a couple of hours to fix. Either way, I have the confidence that I can properly allocate them to the right priority, which is the important decision. And the number of checkboxes left to check is diminishing.

From my viewpoint, the entire PyMarkdown project is coming together nicely. The parser and the rules engine are both working properly. The first thirteen rules are implemented, tested, and verified. I have a good first draft of the documentation for those rules, leaving the bulk of the documentation for the base of the project still to compose.

While I am not at the finish line yet, I can definitely see it, and it is closer than the horizon!

What is Next?

The list of items in the priority 1 section remains low, but I need to get it to zero before I feel confident about releasing. With documentation being the big push now, it is a matter of balancing both of those tasks at the same time. Stay tuned!
