Summary

At the end of my last article, I talked about how I felt a lack of confidence in the work that I had just completed, primarily due to the large number of scenario tokens that I added line/column numbers to. This article documents the work to increase my confidence in those line/column numbers by starting to add consistency checks to all the scenario tests. Additionally, I talk about the bad line/column numbers that were found by the new consistency checks and how I addressed those issues.

Introduction

Why am I referring to this work as a rabbit hole? It is because this kind of verification can be easy, but if things are not done right, the verification itself can be very messy and seemingly endless. As with many things in this project, whether or not to do this is a balancing act between the risk of doing (or not doing) it, the cost of doing it, and the impact if the risk comes to pass. The risk of not doing this type of verification is that I may have miscalculated one of the line number/column number pairs for one of the tokens in the 700+ examples that make up the scenario tests. The cost of doing this starts with a small amount of verification code, but it could also grow to include a large amount of code in the actual parser to track additional elements. The impact is even more difficult to pin down.

While part of the impact is measurable, namely the number of line number and column number errors found, the other part of the impact is not measurable: confidence. There is my confidence that I have those numbers right and there is the confidence of any users that I have those numbers right. And if I do not reach the minimum confidence level for any users of the project, I am certain that I will lose those users. It will be a balancing act that I will need to be aware of, and to monitor going forward, even after this work is finished.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits after 30 May 2020 up to 14 Jun 2020.

A Quick Aside

Yes, one of these again! I have been trying to figure out a good way to talk about the line number and column numbers as a pair, trying out various combinations over the last article and during the writing of this article. The most descriptive and most compact wording that I believe captures the information is: line/column number. It both indicates that they are a pair and that the pair has two distinct parts. From this point forward, I am going to strive to consistently use this format.

Starting with Some Refactoring

I am always looking for ways to make the project better, and this was no exception. During the addition of the position_marker variables throughout the parsing code, I noticed there were four variables that were usually getting passed around together: token_stack, token_document, close_open_blocks_fn and handle_blank_line_fn. While these variables might have slightly different names in the functions that used them, their usage pattern was very consistent. In each case, they were defined once in the TokenizedMarkdown class, and passed without change down throughout the rest of the parser. If Python had a good concept of a read-only variable, I would have used that concept to decorate these variables. Another alternative was to make these variables static variables, but it did not feel right to me to make them into 4 different static variables.

class ParserState:
    """
    Class to provide for an encapsulation of the high level state of the parser.
    """

    def __init__(
        self, token_stack, token_document, close_open_blocks_fn, handle_blank_line_fn
    ):
        self.token_stack = token_stack
        self.token_document = token_document
        self.close_open_blocks_fn = close_open_blocks_fn
        self.handle_blank_line_fn = handle_blank_line_fn

As I did want to reduce the overhead of passing these around, I created a new class ParserState, initialized an instance of it in the __parse_blocks_pass function of the TokenizedMarkdown class, and added those four variables to that class as member variables. Then I worked my way through the parser, passing that object down further and further into the parser. Along the way, where possible, I removed PyLint too-many-arguments warnings and a couple of too-many-locals warnings as well. To me, it just left things a bit cleaner, and ready for the next step: starting to add the consistency checks.
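
To make the pattern concrete, here is a minimal sketch of that wiring; apart from the ParserState class itself, every name below is an illustrative stand-in for the actual parser internals:

def close_open_blocks(parser_state, position_marker):
    """Illustrative stand-in for the real close_open_blocks function."""

def handle_blank_line(parser_state, input_line):
    """Illustrative stand-in for the real handle_blank_line function."""

# Created once, near the top of the block parsing pass...
parser_state = ParserState(
    token_stack=[],
    token_document=[],
    close_open_blocks_fn=close_open_blocks,
    handle_blank_line_fn=handle_blank_line,
)

# ...and each lower-level function now accepts one argument instead of four.
def parse_paragraph(parser_state, position_marker):
    parser_state.close_open_blocks_fn(parser_state, position_marker)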

Why Not Static Variables?

Perhaps it is my object-oriented background or perhaps it is my test automation background, but both of those backgrounds note that static variables are bad. From the object-oriented point of view, static variables cause issues because you must be very aware of what is changing, and what is observing, those variables at every moment. From the test automation point of view, static variables are just hard to test. Because everything in the project that uses that static variable has access to it, it can be the result of a direct or indirect change from almost anywhere in that project. If it is only accessible from a static method, then there are problems with clearing out the value of that variable so it can be properly tested under all circumstances. Basically, most of my experience tells me to stay away from static variables if possible.

Starting with the Plumbing

With any sort of verification code, it is always a good idea to start with something simple, slowly building on top of that work. As this was just normal verification code, the first thing I did was to add this function to the utils.py module:

def assert_token_consistency(source_markdown, expected_tokens):
    pass

Just a simple function that could be called from each of the scenario tests, starting out as a skeleton function that does nothing. With that in place, I started the laborious task of going to each of the test_markdown_*.py files and changing the assertions at the end of each scenario test from:

    # Assert
    assert_if_lists_different(expected_tokens, actual_tokens)
    assert_if_strings_different(expected_gfm, actual_gfm)

to:

    # Assert
    assert_if_lists_different(expected_tokens, actual_tokens)
    assert_if_strings_different(expected_gfm, actual_gfm)
    assert_token_consistency(source_markdown, actual_tokens)

This was a simple task, but by making sure it was completed, I was able to go ahead and add in any consistency checks between the Markdown and the tokens without wondering if I had missed any of the tests. And yes, I was paranoid and double-checked all the scenario tests at least one more time to make sure I did not miss any of them.

Why did I take such a small step? While other people’s mileage may vary, I find that adding multiple consistency checks in a single step compounds the errors that can occur if you get a check wrong. Unless I can avoid it, I choose small changes to the verification that I can easily validate manually, getting those changes addressed solidly before moving on. The base calculus that I am betting on is that the total time taken to complete multiple steps independently is less than the time taken to complete them combined. Based on my experience, once implementing, testing, debugging, and maintenance are factored into that equation, it usually favors the independent steps.

Verifying the Line Numbers Start At 1

I was sure that verifying column numbers was going to be messy, but I was confident that verifying line numbers would be simpler. When it comes down to it, the base part of verifying line numbers is making sure that each block token starts on its own new line, except for a block that is within a container block. In those cases, and only those cases, can the line number be the same.

But even before I started down that path, the first verification that I wanted to do was to make sure that each array of tokens started with a block token on line 1. It was the simplest verification that I could think of, and from previously discovered issues, I knew there were at least a couple of reported issues in this area. If I coded it properly, not only would I validate that the verification was working, but I could also detect and fix an issue or two at the same time.

Classifying the Tokens

Up to this point in the project, I had little need to classify any of the Markdown tokens as anything other than Markdown tokens. When I addressed a block of tokens, I always dealt with them as a single token or as a small group of 2-3 tokens. But, to verify the line numbers properly, I was going to need to group the tokens on a larger scale. To accomplish that, I was going to have to add classification to the tokens.

In re-reading the verification steps outlined at the start of the previous section, I realized at the minimum I was going to need to add the concept of each token belonging to a specific class of token. In this context, I was defining the class of the token as the grouping assigned to the Markdown element in the GFM specification. Using that as a template, along with some previous work with Python enumerations, I was able to quickly set up this enumeration:

class MarkdownTokenClass(Enum):
    """
    Enumeration to provide guidance on what class of token the token is.
    """

    CONTAINER_BLOCK = 0
    LEAF_BLOCK = 1
    INLINE_BLOCK = 2

While Python supports four types of enumerations, I was confident that the simple type of enumeration was the right fit to solve this issue. I did not need the enumeration to do anything fancy, just provide a simple property value for each token that specifies what the class of that token is: container block, leaf block, or inline. With the enumeration in place, it was a simple task to go through all the tokens in the markdown_token.py module and associate each token with its respective class. Using the GFM specification as a guide, each token was matched up to its owning element and assigned the proper class instance for that element.
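
As a hedged sketch of that association, with simplified constructors and with the token names only approximating the real ones in markdown_token.py, the pattern was to declare the class of each token exactly once:

class MarkdownToken:
    def __init__(self, token_name, token_class):
        self.token_name = token_name
        self.token_class = token_class

class BlockQuoteMarkdownToken(MarkdownToken):
    def __init__(self):
        # Block quotes are one of the GFM container blocks.
        MarkdownToken.__init__(
            self, "block-quote", MarkdownTokenClass.CONTAINER_BLOCK
        )

class ParagraphMarkdownToken(MarkdownToken):
    def __init__(self):
        # Paragraphs are leaf blocks per the GFM specification.
        MarkdownToken.__init__(self, "para", MarkdownTokenClass.LEAF_BLOCK)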

Adding the First Verification

With each token classified, the first part of the token verification was to focus only on the blocks, as inline line/column numbers were not yet implemented. Using the new classification variable token_class, this was easily accomplished by adding a simple loop that skips over the inline tokens:

    for current_token in expected_tokens:
        if current_token.token_class == MarkdownTokenClass.INLINE_BLOCK:
            continue

Next, this was expanded to remember the last token, like so:

    last_token = None
    for current_token in expected_tokens:
        if current_token.token_class == MarkdownTokenClass.INLINE_BLOCK:
            continue

        if last_token:
            pass
        else:
            assert current_token.line_number == 1

        last_token = current_token

While these were not big changes, it was a good start. It made sure that the first token in the document was always positioned on line number 1, which it should always be. If not, the verification code asserts on the failure. Simple. And while I could have added more checks, I decided that was complex enough for now, and stopped there. As I mentioned above, taking small steps with the verification is a far safer bet. And as I would soon discover, it was already bearing fruit!

When implementing the line/column numbers for each block token, as documented in the last article, I noted down scenario tests that looked dodgy for later examination. At that time, I tagged the scenario tests for examples 166 and 168 as being incorrect, as they started off with an extra blank line that started on line 2. With this new verification code in place, when those scenario tests were executed again, those new verification tests failed right away.

The HTML output of the test was not an issue, as the current output formatter emits nothing for a blank line token. However, when I added in the line/column number check, it failed because the first block token in the output reported that it started on line 2. The blank line was indeed on line 2, but its token was being output at the start of the token array, and then again in the middle of the actual_tokens array.

After some digging, I was able to find the error. During the processing of the blank line, the parser checks to see whether there is an active link reference definition, and if so, stops the definition. However, regardless of whether a definition was found, that blank line token is appended to the token_document array. With any other token, this would be the correct thing to do. However, as the processing of the link reference definition requires rewinding on failures, the blank line was output, the parsing was rewound for the failed link reference definition, and then the blank line was output again. Once I figured out what the problem was, it was easily remedied by seeing if a rewind was in progress by checking the force_ignore_first_as_lrd variable, and only emitting the token if it was not set.
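
In spirit, the fix was a one-condition guard. This is a paraphrased sketch; only the force_ignore_first_as_lrd name is taken from the actual code, and blank_line_token stands in for the token being emitted:

    # Only emit the blank line token if a link reference definition
    # rewind is not in progress; otherwise the rewind will cause the
    # same token to be emitted a second time.
    if not force_ignore_first_as_lrd:
        parser_state.token_document.append(blank_line_token)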

After some double-checking, the scenario tests for examples 166 and 168 were executed again, verifying that they were indeed fixed. I then ran the tests again to see what other tests were failing, and it was a mass of SetExt heading failures that caught my eye right away.

Looking For Issues - SetExt Headings

From the scenario test summary line, it was very evident that more than half of the SetExt heading tests were failing. A quick count showed 13 SetExt heading test failures out of a total of 25 scenario tests. As I had put significant time into making sure that both the SetExt and Atx headings were working properly, I was surprised to see any verification failures, let alone 13 of them.

Within seconds of looking at the first of those tests, it was easy to see why the verification was failing. In the scenario test for example 50, the first part of the Markdown text is:

Foo *bar*
=========

but the tokens being output for that block of Markdown text were:

        "[setext(2,1):=:]",
        "[text:Foo :]",
        "[emphasis:1]",
        "[text:bar:]",
        "[end-emphasis::1]",
        "[end-setext::]",

From an HTML point of view, the tokens were correct, as the HTML output for that text:

<h1>Foo <em>bar</em></h1>

was correct. But it was also obvious that the line/column number for the first token was wrong. As the first token, it should start on line 1, not line 2. Even more interesting, a quick look at the Markdown makes it obvious that the heading text indeed is the first Markdown element in the document.

Starting to debug this scenario test, the answer was quick to come. In the processing of a SetExt heading, as talked about previously, the Markdown for a SetExt heading occurs after a paragraph has completed, transforming that paragraph into a SetExt heading. At the time that the SetExt heading token is added to the markdown token array, the position that is associated with the token is the start of the SetExt Markdown indicator. As such, that is the position that is used for the token. And it does make sense that for every token, the position should indicate where that token occurs.

Adapting the Token to Properly Represent Its Data

Despite the token position being technically correct, from a line/column number verification point of view it did not make sense, and line/column numbers were what was being verified. As I thought about this, I vacillated between representing the token with the position of the SetExt Markdown element and the position of the block of text contained within the SetExt heading. To me, there just was not a good answer. They both had a valid use and a valid reason to be the one position that represented the token.

Battling back and forth, I finally came to the realization that I needed to break out of my current thinking that a token must have only one position. While it is not a feature I want to use frequently, this was an honest case where the token should contain two positions: the primary position to contain the location of the SetExt element itself and the secondary position to contain the location of the block of text contained within the SetExt heading.

To accomplish this, I added two new variables to the SetExtHeadingMarkdownToken class: original_line_number and original_column_number. These two variables would be used independently of the line_number and column_number variables to track the position of the block of text contained within the SetExt heading. I wired up the necessary code to pass in the correct values for the two new variables, including adjusting the __str__ function to present them as part of the token.
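
Boiled down to just the position handling, and with everything else simplified away, the reworked token behaves like this sketch; only the four position variables and their serialized form reflect the real class:

class SetExtHeadingMarkdownToken:
    def __init__(
        self, line_number, column_number, original_line_number, original_column_number
    ):
        # Primary position: where the SetExt "=" or "-" line was found.
        self.line_number = line_number
        self.column_number = column_number
        # Secondary position: where the heading's block of text started.
        self.original_line_number = original_line_number
        self.original_column_number = original_column_number

    def __str__(self):
        # The real token also serializes its heading character and other
        # data; the secondary position is appended at the end.
        return "[setext({0},{1}):=::({2},{3})]".format(
            self.line_number,
            self.column_number,
            self.original_line_number,
            self.original_column_number,
        )

print(SetExtHeadingMarkdownToken(2, 1, 1, 1))  # [setext(2,1):=::(1,1)]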

When I reran the scenario test for example 50, I was pleasantly greeted with the following new set of tokens:

        "[setext(2,1):=::(1,1)]",
        "[text:Foo :]",
        "[emphasis:1]",
        "[text:bar:]",
        "[end-emphasis::1]",
        "[end-setext::]",

Changing the Scenario Tests

Now that the SetExt heading token contained a good set of information, I needed to change the scenario tests to understand the new information. Without any change, those tests would still only know about the primary position. To address this issue, I added the following code:

def __calc_adjusted_position(markdown_token):
    if markdown_token.token_name == MarkdownToken.token_setext_heading:
        line_number = markdown_token.original_line_number
        index_number = markdown_token.original_column_number
    else:
        line_number = markdown_token.line_number
        index_number = markdown_token.column_number
    return PositionMarker(line_number, index_number, "")

While it may seem like overkill to some people, the purpose of this function is to keep the addition of this new logic contained within a single function. This encapsulation came in useful when I added it in to the consistency function:

    last_token = None
    for current_token in expected_tokens:
        current_position = __calc_adjusted_position(current_token)
        if current_token.token_class == MarkdownTokenClass.INLINE_BLOCK:
            continue

        if last_token:
            last_position = __calc_adjusted_position(last_token)
        else:
            assert current_position.line_number == 1

        last_token = current_token

Used for both the current_token and the last_token, the new function easily provides the right position for both tokens, with only a small change to the target function. In addition, adding the last_position variable reminded me of the direction that I needed to go in with the consistency checks. More on that in a minute.

Refocusing myself on the work directly ahead of me, running all the scenario tests yielded 20 test failures, the original 13 tests plus another 7 tests that contained SetExt heading tokens, just not at the start of the array. For each of those 20 tests, I manually checked the new tokens against the Markdown sample, and once I was satisfied that they were accurate, I adjusted the expected tokens for that test to match the new tokens.

And this time, when I ran all the scenario tests, I was greeted with a clean slate of no test failures.

Before Going On

Up to this point, when adding the consistency checks, I was mostly invoking the tests with the following line:

pipenv run pytest -m gfm

As the new consistency checks were only concerned with the scenario tests, this was the fastest method to run all those tests. However, before continuing, I wanted to make sure those changes did not have any adverse side effects, so I ran all the tests in the project using:

pipenv run pytest

While running all the tests may not have been necessary, I wanted to run them to be confident that I did not introduce any bad side effects. As the complete group of tests can be executed in less than half a minute, the cost of the little bit of extra confidence was easy for me to justify.

Adding More Consistency Checks

With the SetExt heading tokens addressed and all the scenario tests passing again, it was time to add more consistency checking. As alluded to in the last section, now that the line numbers are being verified for the first token, it was time to add verification of container blocks and leaf blocks on the same line.

In this case, the smallest step I could take was to check for two blocks that had the same line number. When this condition is true, the previous block must be a container block, otherwise some error has just occurred. In addition, without worrying about column number accuracy yet, I can also state that the current position must be to the right of the enclosing block. Adding those changes into the consistency checking was therefore just as easy as its description:

        if last_token:
            last_position = __calc_adjusted_position(last_token)
            if last_position.line_number == current_position.line_number:
                assert last_token.token_class == MarkdownTokenClass.CONTAINER_BLOCK
                assert current_position.index_number > last_position.index_number
        else:
            assert current_position.line_number == 1

While this was not checking everything yet, it was a good step forward. While I was certain that I had not messed anything up with the container tokens, it was good to have the validation that I did not miss anything. It was to my surprise that when I ran the scenario tests again, two failures were reported.

Duplicated Tokens

After I got over my surprise at there being two possible container block failures, I started to investigate those two tests. It was with relief that I noticed that both tests, 218 and 219, were noted as being suspicious during the work from my last article. From a quick glance at the test output, it was easy to spot that the check was not failing due to container block issues, but due to almost duplicate blank line tokens present in the actual_tokens array. Because the line number was the same in both tokens, it looked like a case of a leaf block within a leaf block instead of a duplicated token.

Debugging through this, it took me a while to figure this issue out. After a lot of head scratching, I finally added enough debug information to notice that in the parse_line_for_container_blocks function, once the call to the handle_block_quote_block function completed, the blank line had already been handled, but the processing continued anyway. After verifying this a couple of times, I tried removing the first call to process the blank line, but some of the other cases where I handled blank lines stopped working. While it did seem a bit of a kludge[1], on the return from the handle_block_quote_block function, I simply checked whether the blank line had already been handled. If it had, the tokens were added to the markdown token array, and the function returned right away.
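
A paraphrased sketch of that workaround follows; aside from the parse_line_for_container_blocks and handle_block_quote_block names, the variables and return values are illustrative:

def parse_line_for_container_blocks(parser_state, position_marker):
    # If the block quote handler has already consumed the blank line,
    # record its tokens and return right away, rather than letting the
    # rest of this function process the same blank line a second time.
    did_process_blank_line, new_tokens = handle_block_quote_block(
        parser_state, position_marker
    )
    if did_process_blank_line:
        parser_state.token_document.extend(new_tokens)
        return
    # ... normal container block processing continues here ...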

As the scenario test for example 261 was also flagged alongside 218 and 219, I took a similar look at its output but did not notice any issues. I even tried some variations on the data to see if there was an error that had previously been exposed and since fixed, but I was not able to reproduce it. Confident that 218 and 219 were fixed and that I could not reproduce 261, I removed those issues from my issues list.

Once again, I re-ran the scenario tests, and all the tests passed. Executing the entire batch of tests for the project, I was also greeted with a clean set of tests, each one of them passing. Now to figure out what to do next.

Adding in More Line Number Checks?

Taking a good look at adding more line number verification, I had a hard choice to make. Either I could try and add more line number verification, or I could start working on column number verification. There were good reasons for both paths, but I had to make a choice.

I went back and forth on this decision before taking another look at the existing scenario tests, trying to figure out what I was most concerned about. I decided that while it would be nice to get the line numbers completed, I was more concerned about the column numbers. In the token array for each scenario test, I was able to track the lines and line numbers more easily than I was able to track the column numbers. In the end, that one observation is what affected my decision: I needed to start verifying the column numbers next.

What Was My Experience So Far?

For any consistency checking being implemented, it is often useful to have a handful of issues that you know should fail, and then see the new consistency logic catch them. The new checks passed that part of the test. In addition, the consistency checks added so far pointed out an issue with how I assigned line/column numbers to the SetExt heading token. Having caught and reported failures on those issues, I was confident that adding consistency checking for the line/column numbers was a good thing to do. Seeing as the cost was still very low, it seemed that the cost:benefit ratio was still heavily in favor of continuing, so that is what I decided to do.

Going back to my main issue regarding my confidence with line/column numbers and their accuracy, if anything, I think I believed that I had lost ground. With each check finding something, it was both an affirmation that I was right to question my manual verification abilities and a challenge to find more issues with more consistency checking. Basically, it was both “good job” and “what else ya got?” in the same sentence.

If I am honest, I felt it was a bit of a setback. I had hoped it was going to lift my confidence a bit, instead of a give-and-take that balanced each other out. I knew I needed to do something to boost my confidence, and I just hoped that adding column number consistency checks would be the key to that.

What is Next?

Having reached a good point in my verification for line numbers, it was time to proceed further down the rabbit hole that is consistency checking by starting to verify the column numbers. And what a rabbit hole it would end up being!


  1. According to Merriam-Webster: “a haphazard or makeshift solution to a problem and especially to a computer or programming problem” 
