Summary

In my last article, I talked about how I integrated my configuration system into the PyMarkdown project, revamping the command line interface along the way. In this article, I talk about my efforts in resolving the remaining Priority 1 items from the issues list.

Introduction

At one point late last year, I was worried that I would never get to the point where I would release the PyMarkdown project in some form. It seemed that for every item I resolved from the issues list, I added two more items. It just looked like I was never going to get to a point where I was comfortable with releasing the project.

Things have changed a lot since last year. While it is taking me longer to get to that initial release point than I had hoped for, I know that I am very close to having a solid first release of this project. One of the best things I did to help me work towards that point was to classify most of the issues list into one of four categories. And with only three items left in the Priority 1 section of that list, it was time to deal with them and get that much closer to a solid release!

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 27 Mar 2021 and 28 Mar 2021.

Why Was This Cleanup Required?

Looking through the issues list, I knew that most of the issues were in one of four buckets of issues, numbered one to four. There were a small handful of issues left in the bucket labelled “Priority 1 - Must Solve Before Initial”, and these were important to me to get resolved before the initial release of the project. But why?

In categorizing the remaining issues into those buckets, any issues placed in the first bucket were issues that either would drastically impede the project’s success or would be too costly to fix later. The remaining handful of issues in the Priority 1 section were all issues that I consider to be breaking changes.

From experience, breaking changes are painful. A breaking change is any change to a system that causes dependent systems to fail specifically due to that change. These changes usually happen when some part of the interface between systems is deleted or changed, but it can also occur if a newly added part of the interface is required. A great example of a breaking change starts with a system allowing free-form user input for the id field of one of its objects. If that system decided to modify that user input process to mandate that any id fields must be only alphabetic characters, that is not enough for it to be a breaking change. For it to be a breaking change, there needs to be a dependent system that interfaces with that field, requiring that system to change its understanding of that field to match the newly provided understanding. Until such time as that understanding is fixed, the dependency between those two systems is broken.

These changes are painful for two main reasons. The biggest reason is that the interface between the two systems is changed in such a way that the old interface is invalidated. Therefore, one of the more frequently suggested mitigations for breaking changes is to support a fixed number of the old interfaces, say the last two interfaces, until such a time as there is confidence that no other system is depending on that old interfaces. If things are done properly, the oldest of the supported interfaces can then be dropped with no systems being dependent on it, therefore removing the “breaking” from a breaking change.

That is where the other big reason comes in: communication. Whether the strategy is to “bite the bullet” and do everything at once or whether it is to support old interfaces, that must be clearly communicated to other teams and other projects. If not, everything falls apart very quickly. And please take it from me, that is where the wrong kind of communication usually happens, with a lot of finger pointing and harsh words being thrown around. Unless careful attention and effort is made to clearly communicate what is going to be done, when it is going to be done, and the work required of all teams, breaking changes are just a source of pain that it is best to avoid.

For each of the remaining Priority 1 issues, I wanted to change small things with tokens and how their debug output is rendered. If I waited until after I released the project, I believe these changes would all qualify as breaking changes when people start to use the project. As these tokens are part of the foundation of the project, it would be difficult to version them in any way that would make sense. Therefore, I mandated these changes as a requirement for the initial release.

A Quick Note

The PyMarkdown project uses its Markdown parser to translate Markdown into tokens, layering a rules engine on top of those tokens emitted by the parser. Unlike other articles in this series, this article almost exclusively deals with tokens. Therefore, I believe that a rudimentary knowledge of tokens is required to understand this article.

The base of every token is either the MarkdownToken class or the EndMarkdownToken class. The MarkdownToken class itself contains the names of each of the different tokens in a list of member variables at the top of the class. When looking at the debug output for a token, it is the open square bracket character [ and this name that form the start of every token. For every MarkdownToken instance and most instances of EndMarkdownToken1, this information is followed by the line number and column number where the area belonging to that token begins. This is then followed by the separator character :, any extra token data, and the close square bracket character ].

A simple example of this is a simple Text token containing the words baz and bar separated by a newline character. When parsed, the Text token [text(1,1):baz\nbar::\n] is emitted within the scope of a Paragraph token. Following the instructions above, it reads that this a text token that starts at line 1 and column 1. The extra data for this token starts with the text content baz\nbar, followed by some extra information regarding any whitespace encountered.

As the extra data for each token varies from token to token, the extra token data part of the debug output varies from token to token. Where possible, I will try and call out what is present in the tokens in the description of the information. As the HTML output is usually not concerned with whitespace characters, the most common start to the extra token data section of the debug output usually deals with any whitespace that was encountered.

The Important Of Good Scenario Tests

While I usually consider good unit tests and good scenario tests pivotal to any project, they are even more important when refactoring code. The group of tests for any project present a logical description of how the project is supposed to behave under a given set of circumstances. Depending on the type of refactoring that is being done, one of two types of expectations are desired. If the refactoring only changes internal components, then the expectation is that no tests will need to be changed. For these changes, a small external change is the expected change, and if the refactoring is successful, then only those small external changes should occur.

I maintain a high level of code coverage and scenario coverage for the PyMarkdown project. Because of this, I have the confidence to make these changes. If anything breaks, I will know about it right away, and not days later. With that confidence in hand, it was time to start making those changes.

End Tokens

The first change that I wanted to tackle was something that sounded simple, but I figured it was going to trickle down into many changes. That change was to refine how the EndMarkdownToken was used, specifically with how it closed certain tokens.

During my development of the PyMarkdown project, I made certain decisions based on the information I had at the time. One of the first decisions that I needed to make revolved around the use of the EndMarkdownToken token. This token is used for tokens that require some manner of manual close token to delineate the section being specified. One part of that token was an indicator force_close, which indicates whether one or more characters caused the section to be closed without force, or whether the section was forced close without any characters. It was that decision that I wanted to take another look at. Was it always necessary?

The Research

One week, as I was fixing an issue with the EmphasisMarkdownToken tokens, I looked at the EndMarkdownToken output and wondered if I had made the right choice, those many months ago. While I was sure that I had made the right decision at that time, I was not sure that the information underlying that decision had changed enough to invalidate that decision. I needed to take a new look.

To me, it just seemed wasteful and misleading to have a force_close field that was always set to False. It was correct to say that it was never forced close, but it just seemed to me that it was extra information that could be conveyed in a better form.

Searching through the source code for [end-emphasis, I searched the many results that appeared and could not find one case where the found debug output indicated a forced close. This backed up my thoughts that always presenting that force_close field in the debug output was inefficient. That information along with the over 100 results that I found in my search led me to decide to make this change. Without searching for any other end tokens, I believe that making this change solely on the merits of the end Emphasis token were enough.

The Effort

The fist change was an easy one. To better understand things in the future, I added a new requires_end_token field to the token classes to define which tokens require an end token to match their start token. There was no external indication of this change, it was just something to help me to define the characteristics of each type of token. To make use of this change, I added a new assert call when a new EndMarkdownToken is created to confirm that the token type supports end tokens.

It was the other change that I was dreading. If I wanted to remove the force_close indication for any type of token that did not support it, I needed to know with certainty that it would never be forced. To do this, I added a new can_force_close field and property to track that property of the tokens. Adding to the effort started in the previous section, I also added another check in the EndMarkdownToken constructor to verify that if that flag was set, the was_forced argument was never set. This left me with the following code:

    if isinstance(start_markdown_token, MarkdownToken):
        assert (
            start_markdown_token.requires_end_token
        ), f"Token '{start_markdown_token} does not require end token."
        if not start_markdown_token.can_force_close:
            assert (
                not was_forced
            ), f"Token '{start_markdown_token}'s end token cannot be forced."

With that in place, a small change to __compose_data_field function was all that was needed to effect the change:

    if (
        isinstance(self.__start_markdown_token, MarkdownToken)
        and self.__start_markdown_token.can_force_close
    ):
        field_parts.append(str(self.was_forced))

That was not the part I was dreading… that came next.

The Cleanup

With all the foundational changes in place, it was time to activate that code. To be doubly sure, I again verified that the corresponding end token was always created with was_forced set to False and set the new can_force_close field to False. Running the scenario tests, I went through and fixed all the scenario tests that were expecting the old format and corrected them to the new format. That might sound difficult, but it effectively was changing the expected output that looked like "[end-emphasis(1,8):::False]" into "[end-emphasis(1,8)::]" for each reference that occurred in a scenario test.

Once those changes were fixed, I looked for other tokens that fell into this category and discovered that this behavior was true for the AtxHeadingMarkdownToken, the SetExtHeadingMarkdownToken, and the LinkStartMarkdownToken. One at a time, I verified that the EndMarkdownToken for those tokens were not being forced, and set the can_force_close arguments to False. From there it was lather, rinse, and repeat. By the time I was done, I had changed the format more than 879 times in over 879 scenario tests.

That was about five hours of really mind-numbing work, but it was worth it. It just looked cleaner to me. It was a good first step.

Fenced Code Blocks

One of the things that I never had time to research is why the Text tokens inside of Fenced Code Blocks tokens did not coalesce like they did within other tokens. And it was not all the time either, it was only part of the time. With this issue, there were two things that I wanted to accomplish. The first was to figure out why it happened. The second, depending on the answer of the first, was to fix it. If I was going to fix it, I was highly confident that it would require changes to the tokens. So, while it was not big enough to normally make my radar, if I was going to fix it, now was the time while I was addressing other tokens.

The Examples

To illustrate this behavior, consider the Markdown string from the test function test_fenced_code_blocks_099f:

    source_markdown = """```

\a
\a\aabc
\a

```""".replace(
        "\a", " "
    )

This test was created to test how blank lines and lines with text interact with each other within a Fenced Code Block element. Since this occurs within a Fenced Code Block, the important thing to remember is that the translation to HTML must occur without losing any data, including whitespaces. As such, when it is tokenized, it tokenizes into the following form:

    expected_tokens = [
        "[fcode-block(1,1):`:3::::::]",
        "[BLANK(2,1):]",
        "[BLANK(3,1): ]",
        "[text(4,3):abc:  ]",
        "[BLANK(5,1): ]",
        "[BLANK(6,1):]",
        "[end-fcode-block::3:False]",
    ]

This set of tokens generated for that Markdown starts and ends with generic Fenced Code Block tokens. Between those tokens are the Blank Line tokens representing each blank line and the one Text token for the one non-blank line of text.

For comparison, test function test_fenced_code_blocks_099c starts with a line of text and then is followed by blank lines until a final line of text and the closing fence:

    source_markdown = """```
z
\a
\a\a
\a
z
```""".replace(
        "\a", " "
    )

This is translated into the tokenized form of:

    expected_tokens = [
        "[fcode-block(1,1):`:3::::::]",
        "[text(2,1):z\n\x03 \n\x03  \n\x03 \nz:]",
        "[end-fcode-block::3:False]",
    ]

Unlike the previous token output, the Text tokens and Blank Line tokens have all been coalesced or combined into a single Text token. In the debug output for this Text token, the \n sequence represents a newline character and the \x03 is used to separate the removed whitespace from the start of the line from the removed whitespace at the end of the line. While it is a dense form of the information, it provides every bit of information needed to faithfully generate the HTML output for that Markdown sample.

I sincerely believe that this single Text token format is easier to understand than the alternative. That alternative would be to use five tokens instead of one token; the first and last tokens of that sequence would be Text tokens with the remaining three tokens being Blank Line tokens. While the single Text token is packed with information, all the necessary information is contained within that one token, not spread out over five tokens. As such, I find that I do not need to try and remember what interactions occur between the tokens as there is only the one token. It is just there.

But if this is the case, why didn’t the previous example also coalesce its tokens into one token?

The Research

It did not take me that long to track the cause for this behavior down, and I was shocked at how easy it was to fix. Inside of the coalesce_processor.py module, there is a loop that looks for Text tokens and Blank Line tokens and sees if it can combine those tokens into an already encountered Text token. Basically, a Text token for a followed by a Text token for b becomes a Text token for a\nb. Similarly, a Text token for a followed by a Blank Line Token becomes a\n. Of course, this is all without whitespace, which changes those combinations a bit, but not by much.

There are five Markdown elements that can contain Text tokens: the two paragraph-like elements (Paragraph and SetExt Heading) and the three code block elements (Indented Code Block, Fenced Code Block, and HTML Code Block). Because of the way that they are parsed, the two paragraph-like elements never start with a Blank Line token. The paragraph-like element is only started when actual text is presented, skipping over those blank lines. The HTML Code Blocks elements are always started by a HTML tag and the Indented Code Block element is always started by indented text. That leaves the Fenced Code Block element as the only element that can start with a Blank Line token.

The Effort

With that research under my belt, it was easy to determine the base change needed to fix this issue. The reason that the Text tokens and Blank Line tokens were not coalescing was that there was no Text token to coalesce to. The change was to add the following code to the end of the loop in the coalesce_text_blocks function:

            elif (
                first_pass_results[coalesce_index].is_blank_line
                and coalesced_list[-1].is_code_block
            ):
                replacement_token = TextMarkdownToken(
                    "",
                    first_pass_results[coalesce_index].extracted_whitespace,
                    line_number=first_pass_results[coalesce_index].line_number,
                    column_number=first_pass_results[coalesce_index].column_number,
                )
                coalesced_list.append(replacement_token)
                did_process = True

This code would only trigger for the scenarios in which the current element was a Blank Line token and it was the first token inside of a code block. In those cases, it kicked off the coalesce process by creating an “empty” Text token that represented the blank line. With this new logic in place, the Markdown for function test_fenced_code_blocks_099f now parsed as:

    expected_tokens = [
        "[fcode-block(1,1):`:3::::::]",
        "[text(2,1):\n\x03 \n  abc\n\x03 \n\x03:]",
        "[end-fcode-block::3:False]",
    ]

The Cleanup

The cleanup for this change was not too bad. The HTML transformer was simplified by removing two lbocks of code, one from the __handle_text_token function and the other from the __handle_blank_line_token function. As Blank Line tokens no longer appeared within Fenced Code Blocks, there was now no need for that processing. In addition, the __handle_end_fenced_code_block_token function required a similar change to remove unused code, as well as a small change to add a newline character under certain specific circumstances. The Markdown transformer change was even simpler, requiring the string modification at the end of the transform function to be further restricted in cases where the end Fenced Code Block token was forced.

Other than those changes, the remaining changes were in the scenario tests that included Fenced Code Blocks that began with a Blank Line. Including new tests that were added to round our scenarios, only 34 scenario tests needed to be adjusted. I was lucky with that change, only requiring changes for scenarios that specifically dealt with Fenced Code Block elements beginning with a blank line. It could have been a lot worse!

Hard Line Breaks

This change is something that I have looked at on and off for around seven months. For whatever reason, when I added support for the Hard Line Break token, I added it without capturing the newline that accompanies the hard line break inside of the token. Instead, that newline was placed in the Text token that followed the Hard Line Break token. Having recently made sure that Hard Line Break elements were being handled properly, I was now even more convinced that the newline character should be with the Hard Line Break token itself, not with the following token.

The reason I had just looked at this before and taken no action to fix it was because of the scope of this change. There was no way that fixing this was not going to be a breaking change. I was moving a newline character from a following token back to the current token, where it should have been from the start. According to my estimates, it was going to impact at least 20 tests outside of the tests for the Hard Line Break token itself. But, to avoid a breaking change later, it was better to incur the effort of changing the code now.

The Examples

To illustrate this behavior, consider the Markdown string from the test function test_hard_line_breaks_655x:

    source_markdown = """foo\\
baz"""

This is a simple example that illustrates how a single backslash at the end of the line (shown in its Python form as a backslash escaped by a backslash) causes a Hard Line Break token to be generated. This in turn produces the tokens:

    expected_tokens = [
        "[para(1,1):\n]",
        "[text(1,1):foo:]",
        "[hard-break(1,4):\\]",
        "[text(2,1):\nbaz::\n]",
        "[end-para:::True]",
    ]

If this does not look a bit weird, it should. Usually, the newline character is contained within the token that contains that newline character. In this case, the newline character for the Hard Line Break token is contained at the start of the following Text token. I mean, it works, but it is not ideal. Dealing with that weirdness caused me to have to add extra bits of code to compensate for that behavior, as well as causing issues during debugging. It just did not work well for me.

Ideally, what I wanted to see was:

    expected_tokens = [
        "[para(1,1):\n]",
        "[text(1,1):foo:]",
        "[hard-break(1,4):\\:\n]",
        "[text(2,1):baz:]",
        "[end-para:::True]",
    ]

It was just a matter of making the necessary changes to get there and undoing a lot of the little changes made to compensate for it along the way.

The Effort

From a change point of view, the first two parts were the easy ones: changing the token itself and changing the __handle_hard_break_token function in the HTML transformer to produce the correct output. Starting from there, I expected a change would be required in the handle_line_end function, and I was correct. For both forms of the token, the whitespace_to_add variable needed to be set to None and the append_to_current_string variable needed to be set to "". This ensured that the correct actions were taken at the end of that line.

After some more debugging, a couple of lines were removed from the __complete_inline_loop function and the __complete_inline_block_processing function as they were no longer needed. The Markdown transformer itself was mostly correct, requiring recombination with any existing paragraph text to be taken care of in the __rehydrate_hard_break function. While not as simple, the verify_line_and_column_numbers.py module required a series of changes that were foreseeable, based on the changes that had already been made.

One of the trickier things to debug in this area were the changes required for links and images. To get things working properly before this change, the __collect_text_from_blocks function and others like it had “funny” logic to deal with the newline being in a separate token. For the __collect_text_from_blocks function, I was lucky: I literally just needed to append a newline character to both the text_parts variable and the text_raw_parts variable. But in other cases, I had to remove code that had worked before and replace it with code, often code that was aware of whether or not it was captured inside of a link label. That took a while to work through.

SetExt Headings

The SetExt Heading token presented its own set of issues. As that token is essentially a Paragraph token followed by a special sequence, the inline processing had to be coded differently than the paragraph handling. With this change, all the fixes that corrected for the Text token following a Hard Line Break token within a SetExt Heading token needed to be removed.

While there weren’t any big blocks of codes that had to be written, there was a lot of moving code around and adding code to specifically handle the whitespace to be stored near Hard Line Break tokens. This occurred in three places within the __process_inline_text_block function and the only effective way to debug it was to uncomment every PLOGGER call within that function. Small thing: I forgot to comment them out again when I was done.

The Cleanup

Both the transformers were not that bad, as they are contained, but the consitency checks were full of cases for handling the newline character in the following Text token that had to be undone. In all, 59 scenario tests were changed, and it took a lot of debugging to remove the “fixes” that were previously added. As a lot of those “fixes” only impacted two or three scenario tests, I just had to work through each of those issues and the ones that followed.

But, as I have said with the previous two changes, those changes just looked right. It just made sense to have the newline character for the Hard Line Break token with the token itself.

What Was My Experience So Far?

While I previously could not see my way to get here, I was now at the point where I had zero Priority 1 items in the issues list. I had to make some hard choices of what I needed to have in the project for an initial release versus what I wanted to have for that release. But I made those choices and followed through to get to this point. No more Priority 1 items!

And then that passed… but in a good way. There were a couple of new items that I knew I needed to add into that section, and I added them. Nothing major, but the section had items again. But knowing that I emptied that section was a big confidence booster. And knowing that I only added three small issues also helped: a task to write documentation, to check that the existing rules were still valid, and setting configuration properties from the command line. These were tasks that I knew would be achievable in a short amount of time.

The big thing for me was to start getting the project out there with a solid initial set of features, fixing up things in a following release as I worked on resolving issues. From there I could start to get feedback, which would help me figure out how to prioritize things going forward.

It was just a good feeling knowing that I was getting close to a solid initial release with an increased sense of confidence in the project.

What is Next?

While I was sure that I was going to find a couple more things to add to the Priority 1 section of the issues list, it was currently empty. As such, I wanted to devote the next week to getting a first pass at the Python library setup code underway.

Almost there!


  1. The line number and column number are only present if the EndMarkdownToken instance consumes one of more characters to create the EndMarkdownToken. For example, an Emphasis element requires matching opening and closing characters to delineate the emphasized text, so that instance of the EndMarkdownToken shows the line number and column number. Conversely, Indented Code Blocks are terminated when a new line of information does not begin with the required indentation, so that instance of the EndMarkdownToken does not show either the line number nor the column number. 

Like this post? Share on: TwitterFacebookEmail

Comments

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.


Reading Time

~19 min read

Published

Markdown Linter Road To Initial Release

Category

Software Quality

Tags

Stay in Touch