Summary

In my last article, I talked about how I “weeded” the project’s issues list to build up my confidence that the PyMarkdown project was going in the right direction. In this article, I talk about how I used that confidence to get back to work on the consistency checker for the tokenizer that is the backbone of the PyMarkdown linter.

Introduction

At the end of the last article, I talked about how confidence is more emotional than logical. From that point of view, resolving items off my issues list was done more to fan my passion for the PyMarkdown project than for any logical reason. And while there are logical components to my desire to instill quality into the project at every stage, it is mostly an emotional desire to make it better. It is true that part of that desire is due to my experience. I know the large cost of fixing a bug once a product is released, and I want to avoid as many of those issues as possible, as soon as possible. But another part of that desire is that I know that once I put the project out there, it will be considered a reflection of my abilities. For both of those reasons and others, I just want the project to have a healthy level of quality, one that I am confident I can maintain.

It was because of that desire to maintain or increase quality on the project that I started to think about whether the cost of my efforts was worth the benefits that I was seeing. Was it worth it?

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 28 Jun 2020 and 05 Jul 2020.

Is the Cost Worth The Benefit?

In the end, the tokens that are generated by the project’s parser are just a series of letters and numbers that must be verified for each scenario. For the parser’s scenarios alone, there are 728 scenario tests used to verify the parser’s output whenever a change is made. The validation of those tests is made easier by the test foundations that are already in place: checking the produced tokens against a static list for that scenario, and checking the HTML output generated from those tokens against the reference HTML snippet for that scenario.

The HTML part of that verification is easy: it either matches the reference HTML output or it does not. The tokens are a lot more difficult to verify. Is the actual content correct? Are the line number and column number correct? Was the correct amount of whitespace stored in the token to allow us to generate the HTML properly? While I try my hardest to ensure that the token information is correct, I do make mistakes.

And that is where I see the benefit of consistency checks. As I documented in my last article, there are enough parser rules to remember that I regularly add items to the project’s issues list to check later. But I have also noticed that even the act of answering those questions about whether those rules were applied properly can increase my confidence in the project. And if those answers can increase my confidence by themselves, working on completing those checks should have an even larger impact.

From where I was, even though the cost of adding the consistency checks was starting to grow, I felt that my increased confidence in the project was an acceptable benefit given that cost. As such, it was onward with the consistency checks!

Lists and Blank Lines

Sometimes it is the large issues that bother us the most. But often, I find it is the smaller items that really grind away at my mind. Before I start a task, I try to make sure that I have lined up as many of the requirements for that task as possible. This preparation helps me to focus on the task and to work on it without too many interruptions. But, as is normal in my line of work, I have had to adapt to make sure that I can switch away from a task before it is completed. I am usually okay with this, as I can often find a good stopping point and document why I stopped and what I was thinking, hoping that it helps me pick the task up later. But if I am close to finishing a task and I know there are only a “couple” of things left to go on that task, it is hard for me to resist getting it over that finish line. Just so very tough.

Identifying the Issue

This was the case with lists and blank lines, the subject of an item that was researched and further documented in the last article. This issue was not an overly big one per se, but it was the one that kept gnawing at me. Specifically, I looked at one of the two list examples that start with a blank line, example 257:

    -{space}{space}{space}

and the text before example 257, which clearly states:

When the list item starts with a blank line, the number of spaces following the list marker doesn’t change the required indentation:

Based on that information, the unordered list start token occupies column 1, and the number of spaces following the list marker is 1, for a total of 2. As such, the blank line, being contained within the list, starts at column number 2, with the extra whitespace added to the end of the line. So why was I seeing the token [BLANK(1,5):] instead of the expected token [BLANK(1,2): ]?

Working the Problem

After doing some research, the answer was easy to find. The blank line token was not wired up to capture the following spaces, something that was quickly fixed. But even with that fix in place, when the whitespace extended to the end of the line, the count of the following spaces was simply zeroed out within the list handler. To fix this, I had to take some time and create a new algorithm from scratch, paying extra attention to ensure that trailing spaces were put in the right location.

I knew that the block quote tokens needed a rewrite to track the skipped block quote characters and their trailing space characters, so I left that part alone. But I did update the issue to note that the list case had been fixed.

Improving Consistency

With the blank token issue fixed, I was then able to add the needed logic to the consistency checker. Previously, the content of the __validate_same_line function was pretty sparse:

    assert last_token.token_class == MarkdownTokenClass.CONTAINER_BLOCK
    assert current_position.index_number > last_position.index_number

It was time to change that!

Keeping in mind that I had only addressed the above issue for blank line tokens within a list token, the first thing I did was to exclude the block quote tokens from the new code. Then, remembering some previous work that I did, I knew there were going to be three different asserts that I would need: one for blank lines, one for indented code blocks, and one for everything else.

        if current_token.token_name == MarkdownToken.token_blank_line:
            # A blank line starts right after the list indent itself.
            assert current_position.index_number == last_token.indent_level
        elif current_token.token_name == MarkdownToken.token_indented_code_block:
            # An indented code block folds its whitespace into its column
            # number, so back that whitespace out before comparing.
            assert (
                current_position.index_number - len(current_token.extracted_whitespace)
                == last_token.indent_level + 1
            )
        else:
            # Every other leaf block starts at the list's indent level.
            assert current_position.index_number == last_token.indent_level + 1

For blank line tokens on the same line as a list token, the blank line token starts right after the list token, so the index_number is the same as the list’s indent_level. Because the indented code block token includes whitespace in its column number, there is a bit of correction to do to take that into account. Other than that, the remaining leaf block tokens verify that they start at the indent_level specified by the list.

After testing and manual verification, that change to the check was completed and verified. I examined around ten related scenario tests and double-checked the results from the new modifications. But even after that extra validation, I was left with a feeling that something remained undone. I was not sure what it was, but it was something to think about as I worked.

Making Position Markers Immutable

Honestly, the title that I was thinking of for this section was “Making Position Markers as Immutable as Python Will Allow”, but I thought that was too long. But let’s get that out of the way upfront. Python does not have a concept of an immutable complex object. Like most languages, the base objects (such as strings, integers, and floats) are immutable, meaning they cannot be changed once instantiated. To me, this makes perfect sense. If you create one of these base objects, it always retains its properties. If you add two of these objects together, you create a third object to contain the results. Basic computer science 101.
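
To make that concrete, here is a quick illustration in Python: concatenating two strings leaves both originals untouched and produces a brand-new third object.

    first_word = "pymark"
    second_word = "down"
    # Adding the two strings creates a third, new string object.
    combined_word = first_word + second_word
    # The original objects keep the values they were created with.
    assert first_word == "pymark"
    assert second_word == "down"
    assert combined_word == "pymarkdown"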

Python complex objects are not so nice with respect to immutability. Sure, there are tricks to prevent overwriting of well-known objects within a class, but there is no built-in support for creating an immutable complex object in Python.1 Conveniently, I did not need the PositionMarker class to be 100% immutable, just immutable enough that I would not accidentally add code that overwrites that class’s member variables.

Benefits of Immutable Objects

Why is this important to me? Partly cleanliness and partly experience. Immutable objects are just cleaner, as you can guarantee that their values will never change. If you instantiate an immutable object as the first object in your program, it will still have the same value that it was instantiated with when the program ends. From experience, immutability provides a level of safety that is desirable.

This sense of safety is because immutable objects are useful where there is a concern about random side effects introduced by child objects and functions. In the case of the parser, I know that before some previous refactorings, I was passing the variables text_to_parse and start_index all over the place. Even after that refactoring, I remained concerned that the parser code could accidentally alter the existing PositionMarker objects without a good way for me to detect it or prevent it.

By changing the PositionMarker class into a mostly immutable object, those concerns could be largely eliminated. Any concern over side effects would be mitigated by the simple fact that once the value for a member variable was specified upon creation, it could not be changed. Before this change was put in place, I knew there were cases where I was not 100% sure that I understood where the values in the object came from. If I did things right, after this change I would probably have a couple of extra instantiations in the code, but I would know that the values were solid.

Making the Change

Far from needing one of the bulletproof approaches talked about on Stack Overflow, I just needed something simple to prevent accidental overwrites. For my lightweight approach, I chose to change the PositionMarker class to expose its member variables exclusively through properties.

Before this change, the line_number member variable was defined in the constructor with:

    self.line_number = line_number

When I needed to access the line number, I would simply take the instance name of the PositionMarker class, say position_marker, and append .line_number to it: position_marker.line_number. Clean, simple, and neat, but also mutable. Just as:

    x = position_marker.line_number

assigns the value of the position_marker instance’s line_number member variable to the local variable x, the following:

    position_marker.line_number = x

assigns the value of the local variable x to the line_number member variable.

To make the line_number member variable mostly immutable, I simply changed its instantiation to:

    self.__line_number = line_number

and added a property getter named after the original variable name, but without adding a property setter:

    @property
    def line_number(self):
        """
        Gets the line number.
        """
        return self.__line_number

By doing this for each member variable, two things were accomplished. First, by prefacing the member variable’s name with __, I declared it as a private variable. Variables declared as such can only be changed from within the class that declared them.2 Then, by adding a property that has the name of the variable without the prefix (line_number vs. __line_number), I provided simple read-only access to the object’s values. It was simple, and it did not require any renaming of read-only references to the member variables.
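
Putting those two pieces together, a minimal sketch of the pattern behaves as follows. The two-argument constructor here is a simplification assumed for illustration, not the project’s actual signature.

    class PositionMarker:
        def __init__(self, line_number, index_number):
            # Name-mangled "private" members, set once at construction.
            self.__line_number = line_number
            self.__index_number = index_number

        @property
        def line_number(self):
            """
            Gets the line number.
            """
            return self.__line_number

        @property
        def index_number(self):
            """
            Gets the index number.
            """
            return self.__index_number

    position_marker = PositionMarker(1, 2)
    x = position_marker.line_number  # reads work exactly as before
    try:
        position_marker.line_number = 5  # no setter, so this fails
    except AttributeError as this_exception:
        print("overwrite prevented:", this_exception)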

Cleanup

The cleanup that occurred as part of that change was mostly contained within the ContainerBlockProcessor class. That made sense to me, as that is where most of the manipulation of the position occurred after the block quote tokens and list tokens were processed. In that class, there were a handful of cases where I was directly manipulating the PositionMarker member variables, which now required new PositionMarker objects. But since those new objects were all created within the ContainerBlockProcessor class, where I expected them to be created, it was okay. As a nice benefit, once that cleanup was completed, there were a few functions where passing in a new PositionMarker object was no longer needed, resulting in some bonus dead code deletion.

What did I gain by doing this? Mostly confidence. Before the change, I was concerned that I would introduce some random side effect in one of the processors. Since the class became mostly immutable and is only changed in the ContainerBlockProcessor class, I gained the confidence that a leaf block cannot change any PositionMarker object. Furthermore, I can easily scan to see where that class is instantiated, knowing in an instant where any changes are occurring, and therefore why they occurred.

Preventing Consistency Checks for Tabs

With all those changes in place, I was able to go back and add extra logic to prevent the consistency checks from executing if they encountered tab characters. Until I was able to work on proper tab character support, I felt that this was a good way to keep the consistency checks active without having to disable them in scenario tests that dealt with tabs. From a quality point of view, that felt like a short-term solution to me, not something I wanted to use long term.

The change side of this was implemented very quickly. Within the utils.py module, I added calls to the __calc_initial_whitespace function. If the second returned value (often called had_tab) was True, the whitespace contains a tab character. Due to the simple nature of the consistency check function, when this happened, I simply returned from the function at that point.
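
In rough terms, the guard looked something like the following sketch. The helper here is a simplified stand-in for the project’s __calc_initial_whitespace function, and the surrounding names are illustrative rather than the actual code.

    def calc_initial_whitespace(line_to_check):
        """
        Simplified stand-in: measure the leading whitespace and report
        whether a tab character was part of it.
        """
        index = 0
        had_tab = False
        while index < len(line_to_check) and line_to_check[index] in (" ", "\t"):
            had_tab = had_tab or line_to_check[index] == "\t"
            index += 1
        return index, had_tab

    def check_line_consistency(line_to_check):
        _, had_tab = calc_initial_whitespace(line_to_check)
        if had_tab:
            # Tabs are not properly supported yet, so skip the check.
            return
        # ... the normal line/column verification continues here ...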

With those changes made, I started to work on the testing and was surprised by the result. I was only able to uncomment one scenario test, test_tabs_004. Digging into the cases a bit, I came to an interesting realization. I was missing the most obvious use of tabs: in the prefixes of list content.

Extracting List Item Whitespace

During the early design stages of the parser, I made a simple design choice that I believe has saved me hours of work. I designed the container block handlers to remove any processed text before passing the remaining text on for further processing. That design allowed the leaf block processing to be kept simple. The leaf block processor only sees the text that it needs to see to do its processing, and nothing more. Basically, the design choice was to make sure that each processor, container block and leaf block, had a solid definition of their responsibilities, and to keep to those responsibilities.

Following that design, when the list block processing was added, the indent_level was used to figure out where the encompassed leaf block tokens would start. Before the ListBlockProcessor passed the line to the LeafBlockProcessor, it removed exactly indent_level characters from the start of the line, per that design. As those characters were part of the list’s structural indent, it was the list’s responsibility to deal with those characters, not the LeafBlockProcessor.
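
In its simplest form, that hand-off amounts to something like the sketch below, with illustrative values standing in for the real processing:

    # The list owns indent_level characters at the start of the line, so
    # it removes them before the leaf block processor sees the text.
    indent_level = 2
    line_to_parse = "  some paragraph text"
    remaining_line = line_to_parse[indent_level:]
    assert remaining_line == "some paragraph text"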

This approach worked well until I needed to understand the composition of that extracted whitespace to validate the token. If the whitespace was composed solely of space characters, everything remained fine: one character of indent equals one space character, and everything balances out. But when one of those characters was a tab character, that calculation was thrown out of whack.

While properly handling tab characters was out of scope, being able to detect them to properly disable the consistency check was not. As far as I could tell, all the remaining scenario cases that I wanted to uncomment contained a tab character in the list part of the line. If done properly, it would keep the responsibility of when to enable and disable the consistency check in the purview of the check itself. To me, that was the desired goal.

Correctly Extracting the Whitespace

To be able to detect the tab characters properly, a bit of juggling had to be done. As the token stack was still the best way to know which container was active, I modified all three list stack tokens to allow them to keep the instance of the list markdown token that they started. This allowed me to add the add_leading_space function to the two main list markdown tokens, effectively adding the whitespace to the list markdown token that owns it. Once that foundation was set up, I proceeded to modify the __adjust_line_for_list_in_process function from the ListBlockProcessor to properly account for the leading spaces in the lines.
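
While the real implementation lives in the list markdown tokens themselves, a stripped-down sketch of the idea, assuming the leading spaces are simply accumulated on the token, looks like this:

    class ListStartMarkdownToken:
        """
        Illustrative stand-in for the two main list markdown tokens.
        """

        def __init__(self):
            self.leading_spaces = []

        def add_leading_space(self, leading_space_to_add):
            # The list token owns the structural whitespace it removed
            # from the line, so record it instead of discarding it.
            self.leading_spaces.append(leading_space_to_add)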

Properly Enabling Consistency Checks

Having completed that previous task, I knew that I was almost ready to go back and clean up the work started in the previous section on Preventing Consistency Checks for Tabs. The only thing that I needed to do was to make sure that if the leading spaces contained a tab character, they would be set to a single tab character. While this assignment was a large simplification of how to handle the tab character, the proper implementation could wait. I just needed it to work well enough for me to detect it.
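
That simplification boils down to a check along these lines (again, illustrative rather than the actual code):

    # Collapse any leading space run that contains a tab character down
    # to a single tab, just enough for the consistency checks to detect
    # that scenario and exclude it.
    leading_space = "  \t "
    if "\t" in leading_space:
        leading_space = "\t"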

And it did work well enough. By adding a couple of extra tab checks in the right places, I was able to uncomment the remaining calls to the assert_token_consistency function within the test_markdown_tabs.py module. Between the previous tab checks and these extra ones, the consistency checks were now able to properly exclude any Markdown with a tab character from the checks.

For me, this was just the right way to do it. While I needed to comment out those function calls at the time, it always felt like I was cheating. It was not the test’s responsibility to know when to disable the consistency check; it was the consistency check that needed to make that decision. By moving the determination of how to handle tabs into the consistency checks, I felt that I had properly constrained the problem to the responsible object. It did not feel like cheating any more. It just felt right.

But now that the consistency of the line/column numbers was in a better place, how would I verify that the whitespace I extracted was the correct whitespace? If I wanted people to write solid linting rules, I needed to give them solid data to base those rules on.

This All Leads Up to This

Given my stated desires to check the consistency of the tokens, I could only see one way to be sure of the content of the tokens. While in previous weeks it was a small voice, that small voice was now starting to speak with its outdoor voice. To check the content, I would need to write a Markdown transformer.

I had been thinking about working on this for a while, and even added an item to the issues list to keep track of it:

## Features - Correctness

- can we generate Markdown from tokens? do we have enough info?

To me, it felt like I was trying to avoid going down this route, afraid that writing a Markdown transformer would result in me going down another rabbit hole. Given my recent experience, I believe that was a valid concern, and one that I needed to address if I decided to write the Markdown transformer.

But from a quality point of view, I felt that writing the transformer was inevitable. Outside of manual validation, the only way that I could validate that the tokens accurately represented the Markdown document was to regenerate the original document from the tokens. As much as I tried to convince myself otherwise, I just could not see another path that would provide the level of quality and confidence that I believe the project deserves.

What Was My Experience So Far?

While the actual work that was documented in this article varied from the work documented in the last article, the goals driving both were the same: quality and confidence.

For me, it is always important to make sure that project responsibilities are defined and adhered to. Some of those responsibilities are people or role focused, and some of those responsibilities are assigned to objects within the project. In both cases, having those clear boundaries helps to figure out where things go and what should take care of any issues.

Having completed this block of work, I felt good. For me, commenting out or disabling a test or a part of a test without a good reason just feels wrong. Like “nails down a chalkboard” wrong. By properly assigning the responsibility for disabling the consistency check to the consistency check itself, I was restoring the control of that decision to the object that was responsible for it. I had restored balance to one small corner of the universe!

But it was hard for me not to think about doing a consistency check for the token content. Like before, it boiled down to a cost-benefit analysis. Did I think that the benefit of a deep check like that would justify its cost? The more I thought about it, the more it just made sense. Once again, this was not a logical decision, but an emotional one. And as such, I did feel that the benefit justified the cost. I could not give a solid reason why, just a solid feeling that it would.

What is Next?

Once I acknowledged to myself the need for proper verification of the tokens’ content, the only truly viable path was to write a Markdown transformer. While I knew that I had a good foundation for checking the line/column numbers of the tokens, I could not see any other alternative that would verify the tokens at a level that was acceptable to me. But how could I start work on that massive task and not lose myself like I did with adding tab support?


  1. In all fairness to Python, most popular software languages do not natively support immutable complex objects without some sort of trick, pattern, or additional plugins. 

  2. As with everything, there are caveats. If you understand the system that Python uses to mangle the names as part of making a variable private, you can technically get around this. However, that effort is considered out of scope for this article. Since all the member variables are base types, any consideration of modifying complex types exposed in this way by such an object are also considered out of scope. 
