In my last article, I took care of completing the consistency checks by verifying the second half of the container block tokens: the Block Quote tokens. In this article, I fill out the line/column number consistency checks by adding support for determining and verifying the height of all block tokens.
From a high-level point of view, I believe that the project is coming together nicely. Each of the examples in the base GFM specification has been tested, and its prescribed HTML output verified. The consistency check that verifies that the Markdown tokens contain the correct information is also in place, covering all existing tokens. What was left was a bit of unfinished business with the consistency checks to verify those same tokens. Two parts of that check were left: verifying the token height and verifying the inline tokens.
I had always intended the verification of inline tokens to be the final verification. That was always immediately clear to me. To properly verify any of the inline tokens, the tokens around it needed to be verified to give that token a solid foundation. Without those other tokens as a foundation, any attempt at verifying inline tokens would be shaky at best. It only made sense for me to start working on the verification of token heights before verifying the inline tokens.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 15 Aug 2020 and 19 Aug 2020.
Looking Back at Last Week’s Issue¶
After finishing the work on last week’s article, I took it easy for a few days, working on the project when I could, but not pushing too hard. As someone who uses his brain heavily in both his professional capacity and his personal capacity, it was a bit of a wakeup call for me. When I saw the end of the project getting closer and closer, I started to put in extra effort towards the project, thinking that I could get there faster. This normally would not be a bad thing. But to achieve that extra effort, I diverted some of the energy that I normally use to take care of myself. I did not really think about it before I started doing it; it just happened.
While the result was me taking an extra day to complete the article, it could have been worse. From an article point of view, I needed to rework 2-3 sections, but it was not too bad. From a personal point of view, if I had continued to push myself, I believe it would have multiplied the time needed to recover substantially. I forgot that taking care of yourself and your mental well-being is very important, especially in times of crisis. In normal times, I might have been able to make that trade off work, but currently, I do not believe it is a viable option for me. I just do not have tons of extra energy to spare. Like it or not, those are just the times we are living in right now.
And that is okay. It is taking me a bit to accept that, but I am working on it. What is important to me is this project and writing about it… all of it. Not just the rosy stuff you read in other blogs, but the actual problems I encountered and how I approached them. And that includes problems like these.
Getting Ready for the Changes¶
Knowing that I was going to be making a lot of changes to the verification logic for line/column numbers, I wanted to prepare for that work by moving that code into its own module. Having seen this feature organization work well for the Markdown transformer and the HTML transformer, I figured that moving the line/column verification code into its own module was an easy choice. Further increasing the benefit of that choice was the fact that I was going to add more logic to the code. Keeping all that logic in one place just made a lot of sense to me.
The movement of the code was a pretty painless process, with all functions that handle the verification of line/column numbers being moved into the new module. Once moved, any functions that did not need to be public were prefixed with __, and their invocations were also changed to add the __ prefix. Having a good set of scenario tests added to the ease of this change, as I was able to verify that the code was still working properly at each stage of the change.
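As an illustration of that layout, a module can expose one public entry point while keeping its helpers private by convention with the double-underscore prefix. This is a minimal sketch with hypothetical names, not the project's actual code:

```python
# A minimal sketch of the module layout described above; the function names
# here are hypothetical stand-ins, not the project's actual code.

def verify_line_and_column_numbers(actual_tokens):
    """Public entry point into the line/column number consistency checks."""
    for token_index in range(len(actual_tokens)):
        __verify_single_token(actual_tokens, token_index)

def __verify_single_token(actual_tokens, token_index):
    # The double-underscore prefix marks this helper as module-internal by
    # convention; only the public function above is meant to be called from
    # outside the module.
    _ = actual_tokens[token_index]
```

Because the prefix is only a convention at module level, the scenario tests could keep exercising the public function unchanged while the internals were renamed.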
With the new module created, it was time to add to that new module.
Verifying the Height of Tokens¶
As with any of the consistency checks, the verification of the token height started as a very small and compact function. At the beginning, the function had 2 parameters, one of them being last_token. Knowing that I had the SetExt Heading token to process, I encapsulated that logic within that function, leaving the __validate_block_token_height function to do the heavy lifting. This encapsulation allowed me to replace the values used by the other tokens with the variables used by the SetExt Heading token.
That verification function, the __verify_token_height function, needed to be called from 2 locations: after the __validate_new_line function was called and at the end of the normal processing. The call after the __validate_new_line function ensured that the height of any 2 block tokens not on the same line was calculated. If the 2 tokens were on the same line, one of them was a container block token, and container block tokens always share a line with the token that follows them. As such, I could just focus on the tokens on different lines without missing anything. The call at the end of processing ensured that the height of the final block would also be verified.
It was a good start. With all the little stuff out of the way, it was on to the heavy lifting part of the change.
Doing the Heavy Lifting¶
The __validate_block_token_height function was always intended to be a function that needed to know about every block level token. From the initial design for this change, I figured that it was best to have one big function that could be refactored later, rather than to have duplicate code in separate handler functions. As I have had good success with that pattern so far, I decided to use it again here.
Like my other uses of the pattern, I started off the function with a large if-elif-else statement containing all the leaf block token names, one to each if statement. Each if statement contained a single assert False line, and the statement was capped with a final:

```python
else:
    assert False, "Token " + last_token.token_name + " not supported."
```
Just like before, I had a plan. I started running tests in groups based on their name. So to start, I used the command line:
```shell
pipenv run pytest -k test_paragraph_blocks_
```
to run all the tests that dealt with paragraph blocks. If I hit a type of leaf block that I had not worked on yet, its assert False triggered, with the line number indicating which token type failed. If I hit a type of block that I was not expecting, the assert False in the final else would be triggered, letting me know which token I missed.
And it was a lot of lather-rinse-repeat. I ran the tests over and over again using the above command line. If any tests failed, I picked either the first test or the last test and examined why the test failed. If it was the first time that I was dealing with that specific token, I coded a good guess as to what the height formula should be. Otherwise, I examined the existing formula, and tried a variation of the code that would handle the new case. Once all the tests for a given group were passing, I picked another group. This repeated until all the scenario tests in the project were passing.
For the most part, that seemingly endless processing loop worked. But as in any project, there were things that needed to be handled separately.
Shortly into the changes, I figured out that I needed a simple helper function to calculate the number of newline characters in each string. Based on my observations, I was going to need to count the newline characters in a different set of strings for most of the tokens. Rather than implementing the algorithm multiple times, it made sense to put it into one function in one location.
After a couple of tries, I ended up with:
```python
def __count_newlines_in_text(text_to_examine):
    original_length = len(text_to_examine)
    removed_length = len(text_to_examine.replace("\n", ""))
    return original_length - removed_length
```
I forget where I encountered this pattern initially, but it was a useful one to remember. While it does include the possibly expensive creation of a new string, the algorithm itself is simple. The number of occurrences of a given character in a string is the difference between the original length of the string and the length of that same string with every instance of that character replaced with an empty string.
Say, for example, I need to know how many a characters are in the string maybe a good day to die, which has a length of 23. If I remove all the a characters, I am left with the string mybe  good dy to die (note the doubled space), which has a length of 20. Subtracting the second result from the first result leaves a value of 3, the number of a characters in the string.
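That worked example can be checked directly with a generalized form of the helper. The function name here is mine for illustration, not the project's:

```python
def count_character_in_text(text_to_examine, character):
    # same length-difference trick as __count_newlines_in_text, generalized
    # to any single character
    return len(text_to_examine) - len(text_to_examine.replace(character, ""))

phrase = "maybe a good day to die"
print(len(phrase))                           # 23
print(count_character_in_text(phrase, "a"))  # 3
```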
For Paragraph tokens, the use of this function was simple:
```python
token_height = 1 + __count_newlines_in_text(last_token.extracted_whitespace)
```
as it was for Indented Code Block tokens:
```python
token_height = 1 + __count_newlines_in_text(last_token.indented_whitespace)
```
For Link Reference Definitions, it was even more useful:
```python
token_height = (
    1
    + __count_newlines_in_text(last_token.extracted_whitespace)
    + __count_newlines_in_text(last_token.link_name_debug)
    + __count_newlines_in_text(last_token.link_destination_whitespace)
    + __count_newlines_in_text(last_token.link_title_raw)
    + __count_newlines_in_text(last_token.link_title_whitespace)
)
```
After dealing with those tokens, there were a handful of other tokens that were easily handled by very simple calculations. After those tokens were handled, there were only two troublesome tokens left: the HTML Block token and the Fenced Code Block token.
Leaf Block Stacks and Tracking Tokens¶
To properly process the height of those two troublesome tokens, a pair of concepts were introduced almost at the same time. I tried splitting these two concepts into their own sections, but I found each attempt to do that complicated by their dependencies on each other.
The first of those concepts was the ability to track the leaf block that was currently active at any point. The primary driver for this concept was to provide context to the tokens that occurred within the HTML Block element and the Fenced Code Block element. As these two block elements handle their own text parsing, I needed to avoid any “extra” checking that occurred within these blocks. After trying 4 or 5 other alternatives, the tried-and-true block stack was easily the best, and most reliable, solution.
The second concept was closely tied to the first concept and dealt with properly tracking the right tokens. To finish the handling of the HTML Block element and the Fenced Code Block element, I needed to make sure that the concept of the “last” token was correct. To get that working properly, I added code to check the stack and only set the new “remembered” token variable if that was not set.
How Did This Work?¶
Unless there was anything to do with HTML Block elements or Fenced Code Block elements, this code was not invoked. Except for the 4 block tokens that do not have close tokens (Blank Line, New List Item, Link Reference Definition, and Thematic Break), any start token was added to the stack. When an end token was encountered, if the name of that end token matched the top token on the stack, that top element was removed. Simple stack management. After a test revealed that I had forgotten to add one of those 4 block tokens to the “do not stack” list, the stack worked flawlessly.
The tracking of the tokens to avoid duplication worked as well. When one of the two special blocks was encountered, the stack logic would add it to the stack. Once added, the algorithm assumed that the handling of the HTML Block tokens and Fenced Code Block tokens would take care of any encapsulated tokens, and did not track those encapsulated tokens. When the end of the block occurred, it was popped off the stack and the normal processing resumed. There were a couple of small issues at the start, but after they were cleaned up, it was smooth sailing.
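The stack management described above can be sketched with simplified token objects. The real token classes carry much more state; the names here follow the article:

```python
from types import SimpleNamespace

# the 4 block tokens that never get a matching end token
TOKENS_WITHOUT_CLOSE = {
    "blank-line",
    "new-list-item",
    "link-reference-definition",
    "thematic-break",
}

def update_block_stack(block_stack, token):
    # Simple stack management: push start tokens, pop on a matching end token.
    if token.is_end_token:
        if block_stack and block_stack[-1].token_name == token.token_name:
            block_stack.pop()
    elif token.token_name not in TOKENS_WITHOUT_CLOSE:
        block_stack.append(token)
```

With this in place, checking whether the top of the stack is an HTML Block or Fenced Code Block token is enough to tell whether any "extra" checking should be skipped.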
Why Did They Need Special Processing?¶
Although I tried a number of different options, the only thing that worked for determining the height of these two special block tokens was a brute-force iteration of all the inline tokens. While there were other tokens that persisted information on how many newlines were contained within their Leaf Block, these two Leaf Block tokens did not. Without that information, the only option left was to iterate through each of the encapsulated inline tokens, counting newline characters as I went. But with those tokens already counted, I needed to avoid counting them a second time.
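In outline, that brute-force iteration reads something like this sketch, assuming simplified inline tokens that each carry a text field (the project's real tokens carry more):

```python
from types import SimpleNamespace

def count_newlines_in_block(inline_tokens):
    # walk every inline token captured inside the leaf block, summing the
    # newline characters that each one carries
    newline_count = 0
    for inline_token in inline_tokens:
        newline_count += inline_token.text.count("\n")
    return newline_count

# example: three inline text tokens spanning two extra lines
inline_tokens = [
    SimpleNamespace(text="<div>\n"),
    SimpleNamespace(text="content"),
    SimpleNamespace(text="\n</div>"),
]
print(count_newlines_in_block(inline_tokens))  # 2
```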
It was not a great solution, but it was the one that I ended up with. Up to this point in the project, there was no reason to change how those two Leaf Blocks stored (or did not store) any newlines; it just was not a problem. While not elegant, at this stage it was an efficient solution. But to check to see if I could do better, I created a new item in the issues list and moved on.
There Was One Additional Thing¶
As I finished up my work validating the Fenced Code Block token heights, there was one scenario test that snuck up and surprised me, example 97:

``````
`````
```
aaa
``````
This example, and the ones around it, show how an open block is closed when the container block that owns it is closed or when the end of the document is reached. While everything else was working properly with this example, the token’s line height was off by one. After double checking the math for the consistency check and for the existing tokens, I confirmed that it was an off-by-one error. Given that error and the section that the example was in, the next step was an easy one: craft an example that included the end of the Fenced Code Block element:
``````
`````
```
aaa
`````
``````
Running that test, it immediately worked, which meant only one thing to me: the algorithm needed to know if the end token was forced.
Determining If a Token Is Forced¶
Right away, I was aware that determining if the end token was forced was not going to be an easy task. I immediately figured out one approach but dismissed it as too costly. But as I spent hours looking for any other approach that worked, I was coming up empty with each attempt. I found some partial solutions, but nothing that worked in all required cases. In the end, it was that costly approach that I returned to as my only solution.
What was that solution? Costly as it was, that solution was to add a field to every end token that indicates whether it was asked to be closed or forced closed. Why was it costly? A quick scan of the scenario tests revealed almost 200 instances of an end token… for only the scenario tests starting with test_markdown_b. Extrapolating from that sample, I believed that it would realistically mean changing between 1250 and 1750 end tokens throughout all the scenario tests.
It was not a decision that I made lightly, but with no other viable options, I started to embrace it. Making the change was the easy part. Before the was_forced field was added to the EndMarkdownToken class, the compose_data_field function looked like this:

```python
def compose_data_field(self):
    display_data = self.extracted_whitespace
    if self.extra_end_data is not None:
        display_data = display_data + ":" + self.extra_end_data
```

After the was_forced field was added, it changed into:

```python
def compose_data_field(self):
    display_data = ""
    if self.extra_end_data is not None:
        display_data += self.extracted_whitespace
    display_data += ":"
    if self.extra_end_data is not None:
        display_data += self.extra_end_data
    display_data += ":" + str(self.was_forced)
    self.extra_data = display_data
```
If this looks like I snuck something extra in, it is because I did. Given the number of changes that I was going to make, I wanted to ensure that I could efficiently make those changes and verify them. While having optional parts of the token serialization left out was more compact, it did make the serialization harder to read. I figured that if I was making a thorough change like this, being explicit with each of the fields would reduce my chances of getting one of the serializations wrong. Instead of asking myself “what was the default value of the second and third fields”, I just added those fields and ensured they were serialized.
Basically, I gambled. The cost of making the change to the token’s serialization was cheap. It would be in the verification of that change where most of the expense would come. And my bet was that serializing all fields explicitly would make that verification easier and faster.
Was It Worth It?¶
If anyone had asked me that question at the start of the process, I would have hemmed and hawed, giving an answer that was uncertain at best. But by the time I had finished the first group of scenario tests, that answer had easily become a more solid “Yes!”.
Was it painful? Yes! I purposefully did not keep track of how many changes I had completed and how many I had left to go. I felt that keeping track of that number would be discouraging, focusing my mind on something other than the obvious thing: this was the right change to make. I was aware this was going to take a while, and as the hours ticked by, that was hammered home multiple times. By the time I got to the end of this process, it had taken 4 days and many long hours to complete.
But, in my eyes, it was worth every step of it. As this change touched on every scenario test, I needed to manually verify each scenario test before going on to the next test. And while I had test automation in place, I had not manually gone through each of the tests and verified for myself that the tokens looked correct. For me, I found peace in having inspected the tests for myself, knowing that the consistency checks were providing me with the same results as my manual verification.
Does that mean I want to do it again? Not really. But would I do it if I felt that I needed to? In a heartbeat. In terms of confidence, this was me doing a final read of my article before publishing it. This was me running the Static Code Analysis against my code one more time just to make sure that I did not miss anything. This was me looking over the changes I am about to commit to the repository before I submit the commit. It is easy to argue that each of those actions is more emotional than logical, but that is the point.
Sometimes, it is just logical to do an action for a positive emotional reaction. And that is okay too.
What Was My Experience So Far?¶
I know I will only get one chance to make a good impression with this project, so I want to make that impression a good one. If I want to provide a good impression by having a good linter as the outcome for this project, I will need to test as many of the combinations of Markdown elements with each other as possible, in as many circumstances as possible. From the beginning of the project, a simple calculation made it clear that the scope of testing required would be difficult if I was lucky. If I was not lucky, that hopeful guess of difficult would grow by at least 2 orders of magnitude.
But even given that prediction that testing everything would be very difficult, I wanted to give the project the best chance to succeed. The best way that I know of doing that is to fully commit to making the right changes to solve the problems properly. For the first set of changes, this meant working with what I had, as it was only the token height calculation that needed that logic. For the second set of changes, it meant changing all the tokens, because there really was not any alternative.
In a strange way, both sets of changes brought me peace, increasing my confidence about the project. In the grand scheme of things, the iterative calculations for the height of the two Leaf Block tokens were not too expensive, and they are localized to that one module. Amortizing the cost of those calculations over all the Leaf Block tokens makes them even less expensive. From that point of view, those changes were definitely cost effective in my eyes.
And while people can say I am pedantic and a perfectionist, I somewhat liked spending that time going through each scenario and verifying it. Before that review, I had a lot of trust in those consistency checks, but there was always a question of whether I had missed something important. With the existing checks in place and a manual review of those scenario tests, the chances of any major issues being left are drastically reduced.
I will never say that any piece of software has zero bugs in it, but I do know that I feel that I am eliminating many of the paths for bugs to form in this project.
And that I am confident about!
What is Next?¶
After completing the token height verification for block tokens, it was time to start working on the line/column numbers for the inline tokens. I was not sure how much of a chore it would be, but it would be gratifying to get them done!
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.