In my last article, I walked through the items that I chose to work on from the issues list, detailing my process for resolving each one. In this article, I detail how I continued the march forward to increase my consistency check confidence by further implementing the token to Markdown transformer.


Having removed a good number of items from the issues list, I decided that it was time to get back to verifying the tokens using the Markdown transformer. While my initial attempt at implementing the transformer yielded 8 new items on my issues list, I was hopeful that this next batch of features for the transformer would uncover fewer errors. Do not get me wrong. If there are issues with any part of the project, I want to know about them so I can properly prioritize them. I was just hoping that the number of new items that I found would be lower this time.

It is with that hopeful mindset that I started to work on implementing the transformations for the other Markdown features, excluding the container blocks and the link related blocks. While awkwardly stated, that statement outlined a block of work that I knew I would be comfortable working on and confident that I could complete. Looking ahead, I knew that link transformations were next on the list, with container blocks taking up the final position. I knew I still needed to tackle those token groups, but I was far more confident about handling the simpler cases first. By handling those simpler cases first, I hoped to build up my confidence to work on the links in the subsequent section. At least it would build up if I did not discover as many errors as with the last chunk of work!

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 19 Jul 2020 and 26 Jul 2020.

Did I Take the Week Off?

Looking at the commits for this chunk of work, it may look like I did the HTML block on the Sunday, then took the week off, restarting the work on Saturday. For no other reason than dumb luck, I just hit a wall when trying to transform the SetExt Heading tokens. Everything else just flew right by! More on that in a bit.

Adding HTML Support

During the writing of that week’s article, I just felt that I needed a bit of a break. Being a bit fidgety about the block of work that was coming up next, I decided that I wanted to start doing some work on a random token type. The HTML block token support was that work, the type of token to work on chosen randomly using the time-tested Eeny, Meeny method. I honestly did not think that adding that support would be so easy that I could complete it with only 25 lines of code changed, even less if I do not include blank lines. I was just looking to find something to give my mind a break from writing the article. Nothing more.

However, I cannot lie about it. It was refreshing. Go in, add the transformation, run the tests, fix 1 or 2 problems that I found, and… done! It worked so well, I thought I would do it with the next token. And that next token was the SetExt Heading token.

Adding SetExt Heading Support

While you can estimate and plan for hours, there are times where the effort required to do something seems almost random. In one of the teams that I worked on, we referred to this type of work as spins of the “Wheel of Effort”. One spin of the wheel may get you 1 hour, and another spin for a similar item may get you 1 day. It just depends. I know there were underlying conditions that were contributing to that calculation of effort. However, from my viewpoint, it just seemed like a spin of the wheel.

And the wheel was about to give me a spin that I really did not like.

The Initial Push

From experience dealing with SetExt Heading issues, I had a feeling that adding the support for the SetExt Heading tokens was going to be more difficult than most other tokens. That experience also reminded me that most of the issues that I have had with SetExt Headings were not with the SetExt Heading tokens themselves, but the text block contained within. While the SetExt Heading token itself is created to replace an existing Paragraph token, the handling of the two text blocks was different enough that I was immediately cautious about those differences.

The first change that I made for this feature was to write a simple implementation of the token handlers. It was during that implementation that I discovered that while the SetExt Heading tokens contained information about which heading character was used, the quantity of those characters in the Markdown document was absent. That was an easy fix to implement. A quick jump over to the SetextHeadingMarkdownToken class to add the heading_character_count argument and member variable. Then another quick jump back to the new token handlers to use that new member variable when rehydrating the end token. Done. Easy.
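As a minimal sketch of that change — the real class derives from a base token class and records more state than this — the new field and its use when rehydrating the end token look something like:

```python
class SetextHeadingMarkdownToken:
    """Simplified sketch; the real token records much more state."""

    def __init__(self, heading_character, heading_character_count):
        self.heading_character = heading_character
        # New field: how many heading characters the document actually used.
        self.heading_character_count = heading_character_count


def rehydrate_setext_heading_end(token):
    # The end-token handler can now emit the exact underline, e.g. "===".
    return token.heading_character * token.heading_character_count
```

With that field in place, a heading underlined with `===` rehydrates back to three characters instead of a guessed-at default.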

Except it was not. A single set of tests was failing in each test run: the example 052 series of tests dealing with SetExt text that starts with whitespace. To simplify things for a previous fix, I had created the scenario test test_setext_headings_052a to keep the problem simple and easy to diagnose. The issues for both test_setext_headings_052 and test_setext_headings_052a were easily fixed, but I then noticed a glaring issue: there were no tests for SetExt Heading text that had both leading and trailing whitespace. To address that, I created test_setext_headings_052b to add some trailing whitespace while maintaining the same HTML output as test_setext_headings_052a. From there, the c, d and e variations were added to test various other patterns that I hoped were working properly. All those variations passed immediately, except for scenario test test_setext_headings_052b. That was failing and I was not sure why.

The Problem

Scenario test test_setext_headings_052b is a simpler version of example 52 with some changes thrown in. In this new scenario test, the data is as follows:
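Working backwards from the serialized tokens shown later in this section, that Markdown data was along the lines of:

      a{space}
      b{space}
      c
    ===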


where the literal text {space} marks a space character at the end of each of the first two lines. For comparison, the Markdown for test test_setext_headings_052a is:
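Reconstructing from the tokens, that Markdown was roughly:

      a
      b
      c
    ===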


and the HTML output for that test is:
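With the SetExt Heading using the = character, that output was along the lines of:

    <h1>a
    b
    c</h1>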


By comparison, the new scenario test test_setext_headings_052b is a literal copy of the original scenario test test_setext_headings_052a with the addition of a single space character at the end of the first two lines. As the GFM specification states that leading and trailing spaces are removed in normal paragraphs, it makes sense that those additional space characters would be removed as excess whitespace. As such, I expected that the HTML output would be the same, and it was.

But when the Markdown transformer rehydrated the text, that removed whitespace was inserted at the start of the line. The rehydrated Markdown for this initial stage was:
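Reconstructed from the merged whitespace described below, it looked roughly like this, with the trailing spaces moved to the front of the first two lines:

       a
       b
      c
    ===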


Digging a bit deeper, it took a fair amount of additional logging and debugging before an answer started forming in my head. The last line was fine, mostly because it did not have any whitespace at the end of its line. In the case of the first and second lines, that trailing space was being placed at the start of the line, instead of the end of the line.

The problem? The 2 spaces removed from the start of the line and the 1 space removed from the end of the line were being merged. The result? When the Markdown transformer went to rehydrate the token, all 3 of those removed spaces were placed at the start of the line. I needed some way to keep the two groups of spaces separate from each other.

Another Rabbit Hole?

Yup, that thought had crossed my mind. By the time that I had debugged the problem and understood it clearly, it was late Wednesday night and I had already spent 2 nights working on this issue. To ensure that I did not go “missing” again, I set a maximum duration of 2 days to solve this issue. If not solved in that time, it would go on the issues list to be dealt with later.

Was this a bit overboard? Perhaps. But given my past history with chasing down issues like this, I figured it was better to plan ahead to prevent myself from getting in that predicament. Given how much I like puzzles, combined with a gut feeling about this issue, it just seemed the right thing to do at the time.

Over the Next 2 Days

Over the next 2 days, I tried 3 or 4 different approaches. But in each case, there was some manner of complication that prevented me from going forward with it. It was frustrating, but I was adamant that I was going to find a solution. Well, if it did not exceed my self-imposed deadline of Friday night. And that deadline just kept on getting closer and closer.

The Final Breakthrough

Early on Friday night, I was very frustrated that nothing I tried had worked. After resetting the code yet again, I decided I was going to do my best to keep things simple. Before, I had tried altering the signatures of the functions and passing variables around, and that did not get me anywhere. Turning that effort around, I figured that the simplest way to separate the two parts of the line was with a separator character.

Previously, the tokens for scenario test test_setext_headings_052b were:

    expected_tokens = [
        "[setext(4,1):=:3:  :(1,3)]",
        "[text:a\nb\nc:: \n   \n  ]",

Using a simple separator character of \x02, I was able to change the serialized form of the text token to:

    expected_tokens = [
        "[setext(4,1):=:3:  :(1,3)]",
        "[text:a\nb\nc:: \n  \x02 \n  \x02]",

That one change was simple and pivotal, and it struck a chord in me right away. That separator character clearly separated the two different whitespace sequences from each other, with no complex variable passing to enforce it. But that was only part of it. Would the Markdown transformer portion of this fix be as clean and easy?

The Solution

I needed to come up with a fix to compensate for that change in the token’s text. Before this change, the token’s text and the token’s whitespace were reintegrated with each other using a simple algorithm:

    rejoined_token_text = []
    split_token_text = main_text.split("\n")
    split_parent_whitespace_text = next_token.end_whitespace.split("\n")
    for iterator in enumerate(split_token_text, start=0):
        joined_text = split_parent_whitespace_text[iterator[0]] + iterator[1]
        rejoined_token_text.append(joined_text)
    main_text = "\n".join(rejoined_token_text)

Basically, split the text string and the whitespace string into arrays, splitting them on the newline character. Then take those two arrays and create a new array with the concatenation of the element from the whitespace array with the text from the text token element. When done with all the elements, create a new string by merging the contents of the array together, using a newline character to join the lines.

After this change, that algorithm got a bit more complex, but not by much.

    rejoined_token_text = []
    split_token_text = main_text.split("\n")
    split_parent_whitespace_text = next_token.end_whitespace.split("\n")
    for iterator in enumerate(split_token_text, start=0):
        ws_prefix_text = ""
        ws_suffix_text = ""
        if split_parent_whitespace_text[iterator[0]]:
            split_setext_text = split_parent_whitespace_text[iterator[0]].split("\x02")
            if len(split_setext_text) == 1:
                if iterator[0] == 0:
                    ws_suffix_text = split_setext_text[0]
                else:
                    ws_prefix_text = split_setext_text[0]
            else:
                ws_prefix_text = split_setext_text[0]
                ws_suffix_text = split_setext_text[1]
        joined_text = ws_prefix_text + iterator[1] + ws_suffix_text
        rejoined_token_text.append(joined_text)
    main_text = "\n".join(rejoined_token_text)

Instead of the very simple concatenation of the whitespace and the text, there was a bit more work to do. First, the whitespace needed to be split into two, based on the new separator character \x02.

It was more difficult than the original algorithm, but not by much. Once I did a refactoring pass on the Markdown transformer, I was sure that I could clean that algorithm up a lot. But even without that refactoring, the changes in the algorithm were easy to understand.

The root of the changes centered on splitting the whitespace on the newly inserted \x02 character. Once a given line was split, there were 3 cases to handle. The easy case was the one where the \x02 character was present, yielding an array with 2 entries. In that case, the ws_prefix_text variable was set to the first element and the ws_suffix_text variable was set to the second element. The second case was where there was only 1 element in the array and it was the very first entry in the array. In that case, the prefix whitespace had already been applied when the SetExt Heading token was dealt with, therefore the whitespace was assigned to the ws_suffix_text variable. Finally, in all other cases, that whitespace belonged at the start of the line, hence it was assigned to the ws_prefix_text variable.

After testing this algorithm endlessly in my head, I was pleased when I tested it with the actual code and it almost worked. The first iteration of this algorithm did not include the proper handling of the second case, as detailed above. As such, the other lines were formatted properly, but that first line was still a bit off. However, the good news is that it only took a little bit of debugging before I had the cause identified and fixed. And after 4 days, seeing all the tests pass for this change was a sweet sight for my eyes!

Hindsight is Always Clearer

Looking back at the solution, its simplicity and correctness are even more evident. Adding an extra marker during the initial processing and interpreting it during the Markdown transformer processing just makes sense. In hindsight, the simplest solution was the one that won. It was as complex as it needed to be, and no more.

And after taking the time to add this support properly, I could only hope I would have enough time to finish adding the support for the other tokens that I had planned for.

Adding Emphasis Support

After the effort required to add the SetExt Heading transformation, I was hesitant to start working on another token. It was not a lack of confidence that was causing me to pause for a bit, it was the work! After taking 4 days to complete the SetExt Heading token support, I was exhausted. Also, by the time the code changes were committed, I only had about 24 hours to complete the rest of the work!

After adding my initial attempt to perform a simple transformation of Emphasis tokens, I was very pleased to find out that my work for this token was almost complete. That initial attempt used the information already present in the token and was able to satisfy most of the tests dealing with emphasis. The only tests that had a problem were the tests that used the alternate emphasis character _ and its slightly altered emphasis rules. When I originally added the EmphasisMarkdownToken class, the HTML transformer did not need to know which character was used for emphasis, only the level of emphasis in the emphasis_length field. As such, any indication of the emphasis character used was absent from the token.

To address that problem, I simply added the emphasis_character field to the EmphasisMarkdownToken class. This was followed up by some small changes in the __process_emphasis_pair function to ensure that the new field was being properly initialized in both the EmphasisMarkdownToken constructor and the emphasis block’s end token constructor. The change in the end token was a bit trickier in that it does not have a specific place for the character, just a generic extra_end_data variable. Those changes modified the code from:

            MarkdownToken.token_inline_emphasis, "", str(emphasis_length),

to:

            EmphasisMarkdownToken(emphasis_length, emphasis_character),
            MarkdownToken.token_inline_emphasis, "", \
                 str(emphasis_length) + ":" + emphasis_character,

These changes made sure that both the start of the emphasis block and the end of the emphasis block had access to the emphasis character that had been used. To accommodate these changes, a trivial change was needed in the HTML transformer. Once that was done, I temporarily disabled the consistency checks and ran the full set of tests against those changes.

Perhaps I am paranoid, but even with such a small change, I wanted to run the full battery of GFM specification tests to ensure that those changes were solid on their own before adding in more changes. But with all those changes in place and all tests passing, I then returned to the Markdown transformer. Due to the above work, the Markdown transformer was changed to be aware of the emphasis character used. Similar to the HTML transformer, the Markdown transformer only required a few lines to be changed to accommodate the new versions of the tokens. After that work was completed, the consistency checks were enabled, and I was glad to find out that all emphasis tests were passing on the first try. Compared to the last element, this element was a breeze!

Seemingly on a roll, the Autolink token support was added in 15 minutes and the Raw HTML token support was added in just over 10 minutes. In both cases, the tokens contained all the required information, allowing the transformation to be accomplished with very simple functions. The testing for these changes was likewise quick, as the amount of code that was changed was small. As these changes took less than 30 minutes combined, there really is not anything more to add.

Adding Code Span Support

Trivial, Hard, Decent, Trivial. That was the effort that I required to add support for the tokens documented in the last 4 sections. Picking Code Spans at random out of the group of remaining tokens, I was not sure where the “Wheel of Effort” would land this time. It turned out that adding support for the Code Span tokens was a bit more difficult than adding support for the Autolink token and the Raw HTML token, but it was not much more difficult.

The Initial Attempt

To add the support for the Code Span tokens, I first needed to change the constructor for the InlineCodeSpanMarkdownToken class to keep track of three additional fields: extracted_start_backticks, leading_whitespace, and trailing_whitespace. While those 3 fields were easy to add to the token and populate with the right information, their use in many of the scenario tests required small changes in each of those tests. While those changes were annoying, once they were out of the way, the changes to the Markdown transformer to properly support the Code Span token were minimal.

After running all the tests with the above changes implemented, everything looked good except for a couple of tests. When I looked at those tests more closely, in each case the Markdown contained a code span with a newline character in the middle of it. While Code Span tokens keep everything within their boundaries exactly as it is, including backslashes and character reference sequences, newline characters are a separate matter.

Addressing the Issue

When a Code Span element has a newline character in it, that newline character gets replaced with a single space character. This introduced a bit of a problem for the Markdown transformer, as the HTML transformer was expecting a space character and the Markdown transformer was expecting a newline character. Luckily, I was able to repurpose some work I did a couple of weeks ago for character references in indented code blocks. As I wanted to represent both states of the code span data in the tokens, I replaced the newline character with the text \a\n\a \a instead of replacing it with a single space character. At that point, depending on which viewpoint I was taking, I knew both the before-processing and after-processing states of that character.
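To illustrate the idea with a minimal sketch — the function name and its exact shape here are my own for illustration, not the project’s actual code — a marker of the form \a{original}\a{replacement}\a lets each transformer resolve to the side it needs:

```python
def resolve_replacement(text, use_original):
    # Resolve "\a<original>\a<replacement>\a" markers, keeping either
    # the original text or its replacement for every marker found.
    result, index = [], 0
    while True:
        start = text.find("\a", index)
        if start == -1:
            result.append(text[index:])
            break
        result.append(text[index:start])
        middle = text.find("\a", start + 1)
        end = text.find("\a", middle + 1)
        chosen = text[start + 1 : middle] if use_original else text[middle + 1 : end]
        result.append(chosen)
        index = end + 1
    return "".join(result)
```

With a code span’s text of `code\a\n\a \aspan`, the Markdown transformer resolves to the original and gets the newline back, while the HTML transformer resolves to the replacement and gets the single space.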

Going forward with those changes, I went to look for any errors with the HTML transformer, but there were not any. It turned out that the work I did on 16 Jul 2020 included adding a call to the resolve_references_from_text function, then housed in the module. To be honest, I was not aware that I had done that, but it was nice that it was already in place. That just left the changes to the Markdown transformer to resolve any replacement markers in that data. And given the literal text string \a\n\a \a, that was a simple task.

Running the tests at each interval along the way, I was confident that I had made the right changes. But there is always that moment where I say to myself “did I remember everything?”. Fighting past that thought, I ran all the tests and was happy to see that Code Spans were now being transformed properly.

Adding Fenced Code Block Support

Like the work done for the Code Span tokens, the Fenced Code Block tokens were missing some information. In this case, that information was tied to both the start token and the end token. For the start token, the information was being serialized properly, but I needed to add specific member variables for each of the arguments being passed in. This small change allowed me to use that information in the transformers in subsequent steps. For the end token, instead of trying to add a lot of generic information to the token, I opted to add a new member variable start_markdown_token to allow for a reference to the start token. In this way, I was able to easily reference the data of the start token without having to duplicate information in a generic way in the end token.
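As a rough sketch of that design (simplified here; the real end token class carries more fields than this):

```python
class EndMarkdownToken:
    """Simplified sketch of an end token; the real class has more state."""

    def __init__(self, start_markdown_token):
        # Keep a reference to the start token instead of duplicating
        # its data in a generic extra-data string.
        self.start_markdown_token = start_markdown_token
```

With that reference in place, the Markdown transformer can read details such as the fence character and fence length straight from `end_token.start_markdown_token` when rehydrating the closing fence.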

When I started testing these changes, there were a couple of tests that were trickier than the others to get working. Debugging those tests, I quickly found that the whitespace removed from the start of lines within the fenced code block was causing issues for the Markdown transformer. Having faced a similar problem with the Code Span tokens, I was quickly able to pivot and replace that removed whitespace with a similar replacement sequence and a NOOP character.

Based on the work in dealing with Code Spans, I knew that if I needed to replace a single newline character with a space character, I would replace the newline with the sequence \a\n\a \a. As there are boundary cases where replacements can be nested, I did not want to use the sequence \a \a\a to replace a removed space character. Instead, I used the \x03 character to signal a character that did not mean anything, in effect a NOOP character. Thus, the replacement string became \a \a\x03\a. After a few small fixes, everything passed.
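Sketched out with an illustrative helper (not the project’s actual function), the two replacement sequences mentioned above fall out of the same pattern:

```python
NOOP = "\x03"

def make_replacement_marker(original_text, replacement_text):
    # Encode "\a<original>\a<replacement>\a", substituting the NOOP
    # character for an empty replacement so the marker keeps its
    # three-part shape even when replacements nest.
    return "\a" + original_text + "\a" + (replacement_text or NOOP) + "\a"

# A newline replaced by a space inside a code span:
assert make_replacement_marker("\n", " ") == "\a\n\a \a"
# A removed leading space inside a fenced code block:
assert make_replacement_marker(" ", "") == "\a \a\x03\a"
```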

Adding Atx Heading Support

With the way the “Wheel of Effort” had been spinning for me during this chunk of work, I had no clue where the effort for this token would land on that metaphorical wheel. On one side, there were plenty of simple changes required to add support for many of the tokens. On the other side were the SetExt Heading tokens, which took days. It was with trepidation and caution that I started to work on the support for this token.

And it was within a couple of minutes of starting that I figured out this work was destined for the trivial category. Piggybacking on the work documented in the last section regarding the start_markdown_token field, the big change for the parser was to set this field in the end token for the heading. Seeing as the change was solely additive, I jumped to the Markdown transformer and added the handlers for both the AtxHeadingMarkdownToken and its corresponding end token. The handlers were both simple, just regurgitating the token’s member variables back into a Markdown form.

Running the tests, I was pleased to see that most of the tests worked on the first run. The ones that did not caused me to look at the tests more closely. It was there that I realized that the only tests that were failing were tests with closing Atx Heading characters. Looking specifically at the code to handle those closing characters, I discovered that I had transposed two variables, producing weird results in the output. Fixing that error, the tests ran without any issues.

What Was My Experience So Far?

While I mentioned in the last article that I had started to believe that I could see the end of the project, it was this chunk of work that brought it into sharper focus for me. In my mind, anyone can say that they can properly parse a Markdown document, but for me, anything short of being able to empirically prove that the project can parse Markdown documents properly is a failure. For me, it is not about guesswork, it is whether I can back up my claim with data.

The comparisons between the HTML parser output and the HTML output from the GFM specification? They proved the project could get the HTML right for those specific cases. The comparisons between the line numbers and column numbers from the parser’s tokens and the consistency check algorithms? They proved that the project could pinpoint the origin of any element in the original Markdown document. The consistency check using the Markdown transformer? It proved that the project could, with certainty, transform the Markdown document into an intermediate form and then back again, with no loss of data.

From my point of view, each one of these checks was building on the work of the other, providing solid, empirical proof that the parser had properly tokenized the document. I understand that to some it may seem excessive. But to me, this was just dotting my Is and crossing my Ts. I wanted to make a good first impression with this parser, and I wanted to take steps to ensure that.

In addition, that new level of confidence spoke to me. The scenario tests were effective. The consistency checks were working. And in my mind, the project was clearly past the half-way point on its journey to the finish line! It was all starting to come together.

What is Next?

As mentioned above, Link transformations were next on the list of Markdown transformations to add. Given the work documented in this article, I was sure that I would run into at least a couple of issues. It was just a question of how much effort it would take for me to push through it!
