Summary¶
In my last article, I started adding proper support for line and column numbers to both the Text tokens and the Emphasis tokens. In this article, I increase my confidence in the line and column numbers for those two inline tokens by adding consistency checks for them.
Introduction¶
I know that I am fallible. It therefore stands to reason that any code that I write will have some issues with it. Those issues may be obvious issues, or they may be issues that only occur under a bizarre set of circumstances, but they are there. Rather than fight against them, I embrace the attitude that good test automation will help me to identify those types of issues as early as possible.
For the PyMarkdown project, this test automation takes the form of scenario tests containing consistency checks. These consistency checks validate that the Markdown documents in the scenario tests are properly interpreted by the PyMarkdown project. But while these consistency checks are beneficial, they have taken a long while to complete. After 3 calendar months, it can fairly be said that my decision to add consistency checks to the project replaced 3 months of project development time with 3 months of project test time. Plain and simple, that is a fact.
My confidence in the project and its ability to work correctly is emotional and abstract. However, with effort, I have been able to move it in the direction of being more of a fact than a feeling. The consistency checks are a form of test automation that apply a generalized set of rules over a group of tokens, looking for each group to behave in a predictable manner. Before this work, my confidence was expressed as a feeling: “I believe the project is stable”. With this work nearing its completion, I can now point to the scenario tests and the consistency checks that run within those scenario tests. I can state that each of the scenario tests must satisfy a rigorous set of criteria before it is marked as passing. That confidence can now be expressed as: “Here are the tests that are passing and the checks that are being performed on each test.”
From that point of view, it made sense to implement the consistency checks for the Text token and the Emphasis tokens before starting to work on setting the line/column numbers for the remaining inline tokens.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commit of 02 Sep 2020.
Getting Started With Inline Token Validation¶
At the start of the week, the code used to verify the consistency of inline tokens was extremely simple:
print(">>last_token:" + ParserHelper.make_value_visible(last_token))
next_token_index = last_token_index + 1
while actual_tokens[next_token_index] != current_token:
print(
"-token:" + ParserHelper.make_value_visible(actual_tokens[next_token_index])
)
next_token_index += 1
Added in as a placeholder to allow me to see what was going on with the inline tokens, it served its purpose well. But as I started to work on the inline tokens and their line/column numbers, I needed to facilitate better consistency checking of those inline tokens.
To start the work off, I removed that placeholder code from two places in the code and replaced both with a call to a new function, verify_inline. The only difference between the two invocations of the function was the fourth argument, current_token. Called for the first time from the __verify_token_height function, the current_token variable is set to the block token after a series of inline tokens. The second time it is called, it is called at the end of processing to capture any inline tokens that are within one of the valid text elements, but at the very end of the document. When it is invoked from that location, that same argument is set to None. In both cases, the inline tokens to be validated were clearly outlined for the verify_inline function.
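In rough terms, the two call sites looked something like the following. This is only a hedged sketch: apart from the fourth argument, current_token, the parameter list shown here is an assumption on my part, not the project’s exact signature.
# Hedged sketch only: argument names other than current_token are assumed.
# First call site, from __verify_token_height, passing the block token that
# follows the run of inline tokens:
verify_inline(actual_tokens, last_token, last_token_index, current_token)
# Second call site, at the very end of processing, with no following block token:
verify_inline(actual_tokens, last_token, last_token_index, None)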
Clearly Defining the Problem¶
Before doing any real processing with the inline tokens, I needed to create a simple list containing the actual inline tokens that I wanted to check. I could have done that with the main list of tokens and the boundaries outlined previously. However, I thought about it and decided that it was clearer to have a separate list that contained only the tokens that I was concerned about. Once I had all the inline tokens between the two block tokens in that new list, there was a small amount of work to do before the list was usable. While it was not difficult, the new list had some extra end tokens at the end that needed to be removed. Working around those extra end tokens would have been okay, but I just felt that it was simpler to remove them from the list before doing any further processing.
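As a minimal sketch of that list-building step (the is_end_token check below is a stand-in of my own, not the project’s actual test for end tokens):
# Hedged sketch: gather the tokens between the two block tokens into a new
# list, then trim any trailing end tokens so the rest of the checks can
# ignore them.
inline_tokens = []
next_token_index = last_token_index + 1
while next_token_index < len(actual_tokens) and (
    actual_tokens[next_token_index] != current_token
):
    inline_tokens.append(actual_tokens[next_token_index])
    next_token_index += 1
while inline_tokens and inline_tokens[-1].is_end_token():
    del inline_tokens[-1]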
Having a simple list of the inline tokens to work with, the first iteration of the checking algorithm started with an easy outline to follow:
if inline_tokens:
    for token_index, current_inline_token in enumerate(inline_tokens):
        if not token_index:
            __verify_first_inline(last_token, inline_tokens[0])
        else:
            __verify_next_inline(
                last_token,
                inline_tokens[token_index - 1],
                current_inline_token,
            )
    # verify_last_inline(inline_tokens[-1], current_inline_token)
From my viewpoint, the processing of the inline tokens had 3 distinct phases: the first element in that list, each element after it, and the last element in that list. Based on their locations, the first and last elements are special in that they anchor the other inline tokens to the block tokens on either side of the middle elements. Without those anchors, the middle elements lack a foundation on which to base their positions.
Based on those observations, I chose to implement the check for the first inline token against the previous block token, and not the check for the last inline token against the following block token. Without validating the first element, validating any of the elements on the inside of the list would be useless. So, whether I liked the idea or not, validation of the first element in the list was mandatory. The last element is a different story. While it would be nice to tie the last inline token to the following block token, I felt that it was not as important as the verification of the first element. However, I added in a placeholder to the code to make sure that I would follow up on it later.
Validating the First Element¶
Following the pattern that I have used for validation in the past, I created the __verify_first_inline function with my standard starting template:
def __verify_first_inline(last_non_inline_token, first_inline_token):
    if <something>:
        pass
    else:
        assert False, last_non_inline_token.token_name
As this function is comparing the starting position of the first inline token to the last valid block token, the <something> in the above code sample was quickly replaced with:
if last_non_inline_token.token_name == MarkdownToken.token_atx_heading:
    assert False
elif last_non_inline_token.token_name == MarkdownToken.token_setext_heading:
    assert False
elif last_non_inline_token.token_name == MarkdownToken.token_paragraph:
    assert False
elif last_non_inline_token.token_name == MarkdownToken.token_fenced_code_block:
    assert False
elif last_non_inline_token.token_name == MarkdownToken.token_indented_code_block:
    assert False
elif last_non_inline_token.token_name == MarkdownToken.token_html_block:
    assert False
else:
    assert False, last_non_inline_token.token_name
and one by one I added the validation functions to replace the assert False statements. Following the same pattern for resolving these as I have before, I ran the scenario tests over the entire project using the command line:
pipenv run pytest -m gfm
Each time, I just picked one of the failing tests and worked on the tests in that group until they were all passing. For each validation function, I repeated the same pattern with the first inline token that was observed. For example, the __verify_first_inline_atx function quickly evolved to look like:
def __verify_first_inline_atx(last_non_inline_token, first_inline_token):
    """
    Handle the case where the last non-inline token is an Atx Heading token.
    """
    col_pos = last_non_inline_token.column_number + last_non_inline_token.hash_count
    if first_inline_token.token_name == MarkdownToken.token_text:
        replaced_extracted_whitespace = ParserHelper.resolve_replacement_markers_from_text(
            first_inline_token.extracted_whitespace
        )
        col_pos += len(replaced_extracted_whitespace)
        assert first_inline_token.line_number == last_non_inline_token.line_number
        assert first_inline_token.column_number == col_pos
    elif first_inline_token.token_name == MarkdownToken.token_inline_hard_break:
        assert False
    ...
    else:
        assert (
            first_inline_token.token_name != MarkdownToken.token_inline_link
            and first_inline_token.token_name
            != EndMarkdownToken.type_name_prefix + MarkdownToken.token_inline_link
        ), first_inline_token.token_name
What Did I Discover?¶
Predictably, I discovered that there are 2 groups of block tokens that contain text: ones that support inline tokens other than the Text token, and ones that do not. The ones that do not support other inline tokens are mostly easy: assert that the inline token is a Text token, and then assert on a simple calculation of the first line/column number. The validation of the HTML Block token and the Indented Code Block token both followed this pattern, with very simple validation.
def __verify_first_inline_html_block(last_non_inline_token, first_inline_token):
    assert first_inline_token.token_name == MarkdownToken.token_text
    leading_whitespace_count = len(first_inline_token.extracted_whitespace)
    assert last_non_inline_token.line_number == first_inline_token.line_number
    assert (
        last_non_inline_token.column_number + leading_whitespace_count
        == first_inline_token.column_number
    )
The Fenced Code Block tokens required a bit more effort, but not much. As Fenced Code Blocks can start with 0-3 space characters that then need to be managed on any subsequent line in the code block, the owning block token’s leading_spaces variable holds the information on what leading spaces were already removed. As such, when calculating the proper position of the first Text token inside of a Fenced Code Block, that removed space needs to be accounted for. To properly facilitate that, the last_token_stack argument needed to be plumbed through so the verification function could determine the proper owning block token.
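As a hedged sketch of that adjustment, with both the stack lookup and the shape of the leading_spaces field being assumptions on my part:
# Hedged sketch: assume the owning Fenced Code Block token sits on top of
# last_token_stack and that leading_spaces records, per line, the spaces that
# were already removed; those spaces are added back into the expected column.
owning_fenced_token = last_token_stack[-1]
removed_prefix = owning_fenced_token.leading_spaces.split("\n")[0]
assert first_inline_token.column_number == 1 + len(removed_prefix)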
The second group of block tokens was the more interesting group to deal with. This group includes the Atx Heading tokens (as shown in the above example), SetExt Heading tokens, and Paragraph tokens. The __verify_first_inline_atx function and the __verify_first_inline_setext function ended up looking similar: the Text inline token case was populated, but all the other types of inline tokens were handled with assert False statements. The __verify_first_inline_paragraph function was similar, but also slightly different. The same template was used to generate the function, but each of the conditions in the if-elif-else block was met at least once. However, only the Text token and the Emphasis tokens currently have line/column numbers, allowing this comparison to be performed for them:
assert first_inline_token.line_number == last_non_inline_token.line_number
assert first_inline_token.column_number == last_non_inline_token.column_number
All the other inline tokens, the ones that did not yet have a line/column number assigned to them, used the following comparison:
assert first_inline_token.line_number == 0
assert first_inline_token.column_number == 0
It was not much, but it gave me two important bits of information. The first was that there was at least one case where each available inline token was the first inline token inside of a Paragraph token. The second was that for both heading tokens, the Atx Heading token and the SetExt Heading token, the scenario tests only ever started with Text tokens. I made a note of that observation in the issues list and moved on.
Verifying the Middle Tokens¶
With the validation of the first element out of the way, it was time to start working on the __verify_next_inline function. Now that the middle tokens were anchored at the beginning, each of the middle inline tokens could be validated against the inline token that preceded it. Since I knew that most of the inline tokens had not been handled yet, I started out that function with a slight change to the template:
def __verify_next_inline(
    last_token, pre_previous_inline_token, previous_inline_token, current_inline_token
):
    if (
        previous_inline_token.line_number == 0
        and previous_inline_token.column_number == 0
    ):
        return
    if (
        current_inline_token.line_number == 0
        and current_inline_token.column_number == 0
    ):
        return
    estimated_line_number = previous_inline_token.line_number
    estimated_column_number = previous_inline_token.column_number
    if previous_inline_token.token_name == MarkdownToken.token_text:
        assert False
    ...
    else:
        assert False, previous_inline_token.token_name
    assert estimated_line_number == current_inline_token.line_number, (
        ">>est>"
        + str(estimated_line_number)
        + ">act>"
        + str(current_inline_token.line_number)
    )
    assert estimated_column_number == current_inline_token.column_number, (
        ">>est>"
        + str(estimated_column_number)
        + ">act>"
        + str(current_inline_token.column_number)
    )
The first set of if statements made sure that if either the previous inline token or the current inline token was one that I had not worked on yet, the function would return right away. While this assumed that the line/column numbers were correct to a certain extent, I was okay with that assumption in the short term. The second part computed a starting point for the new line/column numbers, and then went into the usual pattern of dealing with each of the eligible tokens by name. Finally, the third part compared the modified line/column numbers against the actual line/column numbers of the current token, asserting with meaningful information if there were any issues.
Emphasis Tokens¶
I thought it would be quick to get emphasis out of the way, and it was! As both the start and end Emphasis tokens contain the emphasis_length, it was a quick matter of adjusting the column number by that amount. As both tokens are confined to a single line, there was no adjusting of the line number to worry about.
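In the shape of the __verify_next_inline handlers above, that works out to something like the following hedged sketch, with the exact token-name constant being my assumption:
# Hedged sketch: both the start and end Emphasis tokens carry emphasis_length,
# and emphasis never spans a line, so only the column estimate moves.
if previous_inline_token.token_name == MarkdownToken.token_inline_emphasis:
    estimated_column_number += previous_inline_token.emphasis_length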
Text Tokens¶
As mentioned in a previous section, there are two major groups of block tokens that contain Text tokens: ones that allow all inline tokens and ones that do not allow any inline tokens except for the Text token. The ones that do not allow other inline tokens are simple, as all the information about the Text token is contained within the token itself. It is the other group that is interesting to deal with.
The easy part of dealing with the Text token is determining the new line number. With the exception of a Text token that occurs right after a Hard Line Break token, the calculation is simple: split the text by the newline character, and the number of parts minus 1 is the number of newlines in the Text token. If the token before the Text token was a Hard Line Break token, it already increased the line number, but the Text token that followed also started with a newline character. To remedy this, that pattern is looked for, and the current_line variable is adjusted to remove the newline character at the start of the line.1
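As a minimal sketch of that line calculation, reusing the variable names from the template above, and with the Hard Line Break adjustment simplified into a subtraction of my own:
# Hedged sketch: every newline in the Text token's text advances the line.
split_text = previous_inline_token.token_text.split("\n")
estimated_line_number += len(split_text) - 1
# If a Hard Line Break token came just before, it already advanced the line
# number and the Text token starts with that leftover newline, so avoid
# counting it twice.
if pre_previous_inline_token.token_name == MarkdownToken.token_inline_hard_break:
    estimated_line_number -= 1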
Determining the column number is a more interesting task to undertake. For any Text tokens occurring within a block that does not allow for extra inline tokens, the column number information is already in the token itself, and the calculation is just as simple. The column delta is equal to the number of text characters stored within the token2. If there was a newline in the token’s text, this count starts after the last newline character.
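A hedged sketch of that column calculation, again reusing the template’s variable names:
# Hedged sketch: the column delta is the number of characters after the last
# newline in the Text token's text; with no newline, the whole text length is
# added to the previous column, otherwise the column restarts on the new line.
text_after_last_newline = previous_inline_token.token_text.split("\n")[-1]
if "\n" in previous_inline_token.token_text:
    estimated_column_number = 1 + len(text_after_last_newline)
else:
    estimated_column_number += len(text_after_last_newline)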
The second group of block tokens that can contain text are the Atx Heading token, the SetExt Heading token, and the Paragraph token. Since the Atx Heading token can only contain a single line’s worth of data, no extra calculations are required to handle multiple line scenarios. In the case of the other Heading token, the SetExt Heading token, the starting whitespace is stored in the Text token’s end_whitespace field. The processing of this information is a bit tricky in that the starting and ending whitespace for the Text tokens within the SetExt Heading token is stored in that field using the \x02 character as a separator. Still, determining the proper indent and applying it to the column number is relatively simple.
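As a hedged sketch of that lookup, with both the per-line splitting and the ordering of the parts around the \x02 separator being my assumptions:
# Hedged sketch: assume end_whitespace keeps one entry per line, with the
# leading and trailing whitespace for that line separated by \x02.
line_index = estimated_line_number - previous_inline_token.line_number
whitespace_for_line = previous_inline_token.end_whitespace.split("\n")[line_index]
leading_indent = whitespace_for_line.split("\x02", 1)[0]
estimated_column_number = 1 + len(leading_indent)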
Dealing with a Text token within a Paragraph token is a lot more work. Due to other design reasons, the whitespace indent for these Text tokens is stored within the owning Paragraph token. While that is not difficult by itself, keeping track of which indent goes with which line is a bit of a chore. Luckily, when I was working on the Markdown transformer, I introduced a variable, rehydrate_index, to the Text token. When rehydrating the Text token, I used this variable to keep track of which stripped indent needed to be added back to which line of any subsequent Text tokens. Given the prefix whitespace for any line within the Paragraph block, calculating the column number delta was easy.
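A hedged sketch of that calculation, assuming that last_token is the owning Paragraph token and that it stores its stripped per-line indents in its extracted_whitespace field:
# Hedged sketch: rehydrate_index says which of the Paragraph token's lines the
# Text token is currently on, and the stripped indent for that line gives the
# column offset.
paragraph_indents = last_token.extracted_whitespace.split("\n")
line_indent = paragraph_indents[previous_inline_token.rehydrate_index]
estimated_column_number = 1 + len(line_indent)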
Blank Line Tokens¶
That left the Blank Line tokens to deal with, and I hoped that the effort needed to complete them was more in line with the Emphasis tokens than the Text tokens. I was lucky, and the Blank Line tokens were easy, but with a couple of small twists. Intrinsically, a blank line increases the line number and resets the column number to 1. That was the easy part. The first twist is that if the current token is a Text token, that text token can provide leading whitespace that needs to be considered. That was easily dealt with by adding the following lines to the handler:
if current_inline_token.token_name == MarkdownToken.token_text:
    estimated_column_number += len(current_inline_token.extracted_whitespace)
The more difficult problem occurred when 2 Blank Line tokens appeared one after the other within a Fenced Code Block token. Because of how the numbers added up, I needed to adjust the estimated_line_number variable by one.
if current_inline_token.token_name == MarkdownToken.token_blank_line:
    if previous_inline_token.token_name != MarkdownToken.token_blank_line:
        estimated_line_number += 1
    estimated_column_number = 1
With that tweak being done, all the tests were then passing, and it was time to wrap it up.
Was It Worth It?¶
The interesting part about defensive code is that sometimes you are not aware of how good that defense is. Using the analogy of a castle, is a castle better defended if it can withstand an attack, or if it deters others from attacking it in the first place? While I did not have any information about potential attacks that were stopped ahead of time, there were 2 actual issues that the current round of consistency checks did find.
Issue #1: Image Link¶
The first of those issues was an issue with the column number for example 600 as follows:
!\[foo]
[foo]: /url "title"
Before these inline consistency checks were added, the text for the ] character was reported as (1,6). By simply counting the characters, the ! character starts at position 1 and the second o character is at position 6. As such, the ] character should be reported as (1,7).
Doing some research, I concluded that a properly initiated Image token was being handled properly. However, with a failed Image token sequence, that is, the ! character followed by any character other than the [ character, the ! character was being emitted, but the column number’s delta wasn’t being set. Adding the line
inline_response.delta_column_number = 1
at the end of the __handle_inline_image_link_start_character function solved that issue.
Issue 2: A Simple Adjustment¶
The second of those issues was more of a nitpick than an actual issue. In the tokenization for example 183:
# [Foo]
[foo]: /url
> bar
the first line was tokenized as:
"[atx(1,1):1:0:]",
"[text(1,3):\a \a\x03\a:]",
"[link:shortcut:/url:::::Foo:::::]",
"[text(1,4):Foo: ]",
"[end-link:::False]",
"[end-atx:::False]",
Having a lot of experience sight-reading serializations for all the tokens, the information in the Text token leapt out at me right away. In that token, the extra data associated with the token is composed by adding the self.token_text field, the : character, and the self.extracted_whitespace field. Based on the above tokenization, that meant that the text sequence \a \a\x03\a was being considered as text instead of whitespace.
To understand why I thought this was wrong requires an understanding of why that character sequence exists. The \a sequence is used to denote that a sequence of characters in the original Markdown document was interpreted and replaced with another sequence of characters. The \x03 character within the second half of that sequence means that the {space} character in the first part of the sequence is being replaced with the empty string. Basically, to properly represent the space between the # character denoting the Atx Heading element and the [ character that starts the Link element, I needed to add a space character that would not appear in any HTML transformation.
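As a hedged illustration of how I read that marker, with the surrounding grammar paraphrased by me rather than taken from the project’s code:
# Hedged illustration: "\a<original>\a<replacement>\a" records that <original>
# was interpreted and replaced by <replacement>, and \x03 stands in for the
# empty string. So "\a \a\x03\a" says a single space was replaced with nothing.
marker = "\a \a\x03\a"
original, replacement = marker.strip("\a").split("\a")
assert original == " " and replacement == "\x03"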
And here is where the nitpicking comes in. When I originally added that sequence while working on the Markdown transformer, it made sense to me to assign it to the token’s self.token_text field. But since then, I have grown to think of that sequence as being more extracted whitespace than token text. To resolve that, I decided to move the call that generates the replacement text from the self.token_text field to the self.extracted_whitespace field. It wasn’t a big move, but it was something that I thought was the right thing to do.
What Was My Experience So Far?¶
While this batch of work wasn’t as laborious as last week’s work, the effort required to make sure it was correct equaled or exceeded that of last week. I knew that if I made any mistakes last week, they would be caught when I implemented the consistency checks. Well, these were the consistency checks that would capture any such issues that slipped through.
I am both happy and proud that I am coming to the end of implementing the consistency checks. It has been a long 3-month voyage since I decided that consistency checks were the best way to ensure that the quality that I wanted in the PyMarkdown project was maintained. And while there were times that I questioned whether dedicating this large block of time to this aspect of the project was the right call, I remained confident that it was.
But looking ahead to what I needed to do after the consistency checks, I saw a fair number of items in the issues list that would need researching and possibly fixing. While I could start to release the project without addressing them, I didn’t feel comfortable doing that. I wanted to give the project the best chance it could have to make a good first impression, and then move on from there. And that would mean more work up front. So while I was happy that the consistency check work was coming to an end, there seemed to be a deep pool of issues that would need to be researched… and I wasn’t sure how much I was looking forward to that.
I still believe that adding the consistency checks was the right move. Of that I am still certain. Instead of a feeling that I have the right code in place to do the Markdown transformations, I have hard, solid checks that verify the results of each and every scenario test. It also gave me the interesting bit of information that the scenario tests did not include any cases where the Atx Heading token and the SetExt Heading token were followed by anything other than a Text token. Something interesting to follow up on later.
To me, adding more of those checks for the inline tokens was just another solid step forward in quality.
What is Next?¶
Having completed the hardest inline token (the Text token) and the easiest inline tokens (the Emphasis tokens), it was time to buckle down and get the remaining tokens done. If I was lucky, the foundational work that I had already completed would make completing those tokens easy. If I was unlucky, there would be a whole selection of edge cases that I needed to account for. Realistically, I was expecting something squarely in the middle between those two scenarios. The next batch of work would answer that question!
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.