Markdown Linter - Rabbit Hole 2 - Losing My Way

Summary¶

In my last article, I talked about starting to add consistency checks for the line/column numbers by adding simple line number checks. In this article, I document the work that I performed to add column checks to the project in a similar fashion. While I would like to say that the required effort was simple and the work was quickly accomplished, as the title of the article implies, there were issues along the way that I could have handled better.

Introduction¶

At the end of the last article, I talked about how I felt that the right move at that time was to move from the basic consistency checking of line numbers to adding a similar level of checks for column numbers. My main argument for switching was that, from experience, it was easier to track the line numbers when I manually verified the line/column numbers than it was to track the column numbers.

For line numbers, unless it was a list block or a block quote block, a new block always meant a new line, which was easy to keep track of. For those two container blocks, I had to check to see if a new block followed the container block, but the calculation was still relatively simple.

The calculation of column numbers was more detailed, and therefore more difficult for me to accurately calculate on-the-fly. If not contained within a list block or a block quote block, I looked at the whitespace at the start of the token and adjusted the column based on the amount of whitespace. If it was within a container block, I looked at the amount of indent imparted by the container block and adjusted from there. And while these calculations may seem straightforward and easy, I initially double and triple checked these values before moving on. Even then, when I was checking the consistency of the line numbers, I know that I got a decent handful of column numbers wrong.

In the end, after getting some proper perspective, I believe that the situation became an easy one to resolve. I had the easy cases for line numbers done, and that was good enough for a while. I needed to get the column numbers to a confidence level where I felt comfortable going back to the line numbers and finishing off those checks. Even though I was nervous about the mistakes that I would find, I buckled down and started to work.

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits after 30 May 2020 up to 11 Jun 2020.

In Case You Did Not Notice¶

Looking at the dates in the previous section, the duration between the start of the column number work and the end of the column number work was 12 days. To be clear, this was not all the column number work, just the easy parts of the column numbers. The batch of work documented in this article took a lot longer than my normal 5-day to 7-day work time.

It was for reasons like this situation that I named the articles in this part of the series as “rabbit hole” articles. As I use that term at work a lot, it was amusing to look around for a good definition of what “rabbit hole” really means. After some fun research, I settled on this specific part of the article at dictionary.com’s slang dictionary:

But as Kathryn Schulz observed for The New Yorker in 2015, rabbit hole has further evolved in the information age: “These days…when we say that we fell down the rabbit hole, we seldom mean that we wound up somewhere psychedelically strange. We mean that we got interested in something to the point of distraction—usually by accident, and usually to a degree that the subject in question might not seem to merit.”

That really does capture what happened here. But I do not want to spoil the story too much. Just continue reading and keep this in mind.

Keeping It Simple to Start¶

As with all good tales, the work started off with good intentions. As stated above, I simply wanted to replicate for column numbers the extra confidence that I achieved with line numbers. To that end, the first stipulation that I made was that, if possible, I was going to leave the container blocks until later. Container blocks were easily going to make things more confusing, so leaving them out would free my mind up to concentrate solely on the leaf blocks. While I was not sure what impact that rule would make, I hoped that it would reduce the large amount of work that I was anticipating.

Taking Time to Come Up With A Plan¶

With container block tokens (via the previous section) and inline tokens (as noted in previous articles) excluded, I was able to concentrate all my effort on the leaf blocks. While it was tempting to dive right into the work and to “clean everything up”, I knew that I needed more of a plan than “clean everything up” or “check all leaf blocks”. I needed to take a good survey of the existing tokens and their resultant HTML and determine what the intent of the token is and how it translates into HTML.

How did I know to do that? When I was going through the different scenario tests while implementing the consistency checks for the line numbers, I noticed that there seemed to be 2 different classes of column numbers for leaf tokens: one class where the whitespace occurs before the token and one class where the whitespace seems to be part of the token. It took a bit of research and looking again at the tests, but I did verify that there were at least 2 classes of leaf tokens.

Confused? Let me help!

Organizing Classes of Blocks By Their Intent¶

Fenced code blocks are blocks where any whitespace that occurs before the token is recorded in the token, placing it solidly in the before class. Looking at the Markdown for example 105:

```
aaa
  ```

and the Markdown for example 106:

   ```
aaa
  ```

in the GFM specification, both fenced code block Markdown fragments produce the exact same HTML output:

<pre><code>aaa
</code></pre>

Note that any whitespace before the code block starts is removed and does not appear in the output. While I need to add that whitespace back in to validate the column numbers properly, it makes no difference to the HTML that is output. As all the whitespace occurs before the token itself starts, I decided to call this class of leaf blocks the before class.

On the other hand, example 86 shows Markdown text for an indented code block:

        foo
    bar

that produces the following HTML:

<pre><code>    foo
bar
</code></pre>

In this case, the indented code block is more aware of the whitespace that occurs before the block starts, putting it firmly in the contains class. Specifically, this class of blocks is for any leaf block where the whitespace makes up even a small part of the block itself. This is obvious with this block as the whitespace starts at column 1 but continues in the produced HTML within the <pre><code> block.

To prove that these leaf blocks are in the correct classes, I replicated the above examples, but deleted a single space character from the initial whitespace for each code block. That change did not make any difference to the fenced code block, but that change precipitated the removal of one space character from the HTML output for the indented code block. Due to that simple behavior, it was simple to place every leaf block type into one of the two classes, allowing the blocks to be mostly handled as a group of blocks instead of as individual blocks.

Aside¶

To keep things above board, there technically is a third class, but it is mostly hidden. In the case of blank lines and link reference definitions, the actual HTML output for both elements is that they are ignored. Therefore, technically the third class is for ignored tokens, but for the purposes of this classification, I assume they are in neither class.

Classifying the Blocks¶

With that newly acquired information in mind, I walked through similar examples and assigned each of the leaf tokens to one of these three classes: before, contains, or ignored.

thematic breaks: before, see example 17
atx headings: before, see example 38
setext headings: before, see example 54
indented code block: contains, see example 86
fenced code block: before, see example 106
HTML block: contains, see example 120
link reference definition: ignored, see example 162
paragraphs: before, see example 192
blank lines: ignored, but treated as contains, see example 197
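The classification above can be captured in a small lookup table. The following is a minimal sketch, not code from the project: the `LEAF_BLOCK_CLASS` name, the `blocks_in_class` helper, and the string keys are hypothetical and chosen only for illustration.

```python
# Hypothetical lookup table: leaf block type -> whitespace class.
LEAF_BLOCK_CLASS = {
    "thematic break": "before",
    "atx heading": "before",
    "setext heading": "before",
    "indented code block": "contains",
    "fenced code block": "before",
    "html block": "contains",
    "link reference definition": "ignored",
    "paragraph": "before",
    "blank line": "ignored",  # ignored in the HTML, but handled like "contains"
}


def blocks_in_class(class_name):
    """Return the leaf block types assigned to the given class, sorted by name."""
    return sorted(
        name for name, block_class in LEAF_BLOCK_CLASS.items()
        if block_class == class_name
    )
```

Grouping by class this way makes it easy to see that only two block types need special column handling.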

Keeping Things Consistent¶

In looking at how whitespace was handled for each of the leaf block tokens, one thing did indeed stand out: I was not storing whitespace for each of the tokens in the token itself. While it was being passed down to the base class, there was not an explicit variable named extracted_whitespace for each token that stored this information. Some tokens had that variable, some did not.

While I could calculate this information on a token by token basis, it seemed simpler to add the following line:

    self.extracted_whitespace = extracted_whitespace

to each of the constructors that were not assigning the argument to a member variable. Specifically, the tokens for block quotes, list starts, indented code blocks, html blocks, and thematic breaks needed this change, and it was a very quick change to perform. The important thing about this change was that, going forward, each of the non-ignored tokens contained a variable that explicitly was there to contain any extracted whitespace. As this variable was now in most block tokens, I hoped that the group of tokens could now be processed largely as one group, instead of 11 distinct block tokens.

Player, Enter Rabbit Hole Level 1¶

When I implemented the column numbers to begin with, I honestly thought there was only one class of tokens, and therefore I treated them all as if they were in the before class. To that end, when I calculated each of the column numbers for the leaf tokens, I started with the number 1 at the first position in the line and incremented by 1 for each space character present in the data. Effectively, I used the formula:

    column_number = 1 + len(token.extracted_whitespace)

But now that I knew there was also a contains class, it meant adjusting the formula slightly. Additionally, based on further observations, it looked like the HTML blocks and the indented code blocks were going to need slightly different formulas.

HTML blocks were the easiest one to observe and determine the pattern for. Based on the Markdown from example 120:

 <div>
  *hello*
         <foo><a>

it was obvious to me that the Markdown produces output that is exactly what is contained within the block, namely:

 <div>
  *hello*
         <foo><a>

Based on the transformation, an identity transformation to be exact, it made sense that the formula for HTML blocks is:

    column_number = 1

The reasoning for this simple calculation is based on the observation of the transformed HTML. As any text that is part of the Markdown HTML block is part of the resultant HTML, I interpreted that to mean that the block itself starts at the start of the line, hence a column number of 1.

Following a similar line of investigation, I looked at a good example of an indented code block, the Markdown for example 86:

        foo
    bar

and its resultant HTML:

<pre><code>    foo
bar
</code></pre>

Based on this transformation, after the first 4 spaces are removed from the start of each line, those lines are inserted into the HTML output, with <pre><code> tags around them. As the first 4 spaces are removed from each line, they are effectively part of the token, but in a weird way. In the above example, the first 4 spaces are swallowed while the remaining spaces are left intact. Based on this behavior, I interpreted that to mean that the block itself starts after the 4th space, even though there are more spaces before the non-space text starts. Given that, the formula is:

    column_number = 1 + 4

but with a special provision that any unused text is added to the start of the text token that follows the indented code block token.
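Put together, the three formulas in this section amount to one small function. This is only an illustrative sketch: the function name `expected_column` and the string block names are hypothetical and not taken from the project, and the indented code block case assumes that the block starts right after the 4 swallowed spaces, at column 5, which is the value used for example 235 later in this article.

```python
def expected_column(block_type, extracted_whitespace=""):
    """Return the 1-based column where a leaf block is considered to start."""
    if block_type == "html block":
        # The whole line, leading whitespace included, is part of the block.
        return 1
    if block_type == "indented code block":
        # The first 4 columns of indent are swallowed by the block.
        return 1 + 4
    # "before" class: the block starts right after its leading whitespace.
    return 1 + len(extracted_whitespace)
```

For example, a fenced code block preceded by 3 spaces would be expected at column 4, while an HTML block preceded by the same whitespace would still be expected at column 1.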

Now that I had the behavior down, it was on to implementing the code for the consistency check.

Congratulations Player, Now Enter Rabbit Hole Level 2¶

Having a solid idea of the behavior that I expected from each contains token and the group of before tokens, it was time to write the consistency check. To allow for an easy computation of what the initial positioning was, I created the __calc_initial_whitespace function, starting it off with the equivalent of:¹

def __calc_initial_whitespace(calc_token):

    if calc_token.token_name in ( `every token` ):
        indent_level = len(calc_token.extracted_whitespace)
    else:
        assert False
    return indent_level

By isolating all the various formulas into one function, I was able to use this simple code to implement the check itself:

    init_ws = __calc_initial_whitespace(current_token)
    assert current_position.index_number == 1 + init_ws

Fixing the Easy Failures¶

After running the scenario tests with this new code in place, there were a lot of successes, but also failures that I needed to deal with. The first set of failures was with various tokens that did not have a properly implemented extracted_whitespace member variable: HTML blocks, blank lines, and SetExt headings.

The HTML blocks were the easiest to deal with, as the formula from the previous section always returns a column number of 1. I added code to always return an index of 0 for the length of the initial whitespace, resulting in a column position of 1 once the above code was applied. From there, blank lines were equally easy. While there are good arguments for having blank lines denote the start or the end of the line, I picked the start of the line to keep things simple. This made the coding easy, as these tokens ended up returning 0, the same value as for HTML blocks.

That left an oddball: SetExt headings. For some reason that is lost to me, SetExt headings use the remaining_line member variable instead of extracted_whitespace. While it felt weird, it was easy to do a quick check for that, returning the length of that variable when SetExt headings tokens were encountered. I also added an issue to my list to check this out in the future, as it is an unnecessary complication that does not add any benefit.
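The three special cases above can be sketched as follows. The `FakeToken` class, the function names, and the token name strings are stand-ins invented for this example. Only `token_name`, `extracted_whitespace`, and `remaining_line` come from the article, and the real project code may differ.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a leaf block token."""

    token_name: str
    extracted_whitespace: str = ""
    remaining_line: str = ""


def calc_initial_whitespace(token):
    """Return the number of columns to skip before the token starts."""
    if token.token_name in ("html-block", "blank-line"):
        # Both are treated as starting at the beginning of the line.
        return 0
    if token.token_name == "setext-heading":
        # SetExt headings keep their leading whitespace in remaining_line.
        return len(token.remaining_line)
    return len(token.extracted_whitespace)


def check_column(token, column_number):
    """Mirror the consistency check: column must be 1 + initial whitespace."""
    return column_number == 1 + calc_initial_whitespace(token)
```

With this shape, each new oddity becomes one more branch in a single function, and the check itself never changes.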

Running the scenario tests again, most of the issues were dealt with easily, following the above rules, and adjusting the column numbers to the newly corrected values after additional manual verification. It was only a small handful of errors that showed up, mostly in the HTML block scenario tests and the indented code block scenario tests. As those two blocks were ones that I placed into the contains class, these failures were expected and quickly addressed, applying changes to the scenario tests to resolve those failures in short order.

Paragraphs Were a Bit More Work¶

The next group of failures were easily grouped together as scenario tests that deal explicitly with paragraphs. On closer examination, it seemed that the failures occurred only in a couple of specific cases. After a couple of observations, it became obvious that the failures were restricted to multiline paragraphs. In those multiline paragraphs, all the whitespace stripped from the start of each line in the paragraph is added to the token, not just the first line. To address that, I made some quick changes to the code for dealing with multiline paragraph tokens:

    elif calc_token.token_name == MarkdownToken.token_paragraph:
        if "\n" in calc_token.extracted_whitespace:
            indent_level = calc_token.extracted_whitespace.index("\n")
        else:
            indent_level = len(calc_token.extracted_whitespace)

With the paragraph tokens dealt with, the column numbers now lined up nicely between my calculated numbers and the consistency check numbers. Other than tabs, the only outlying scenario tests were two tests for indented code blocks inside of list blocks.
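To see why the newline test matters, consider a hypothetical two-line paragraph whose first line is indented by 2 spaces and whose second line is indented by 3. Assuming the token stores the per-line whitespace joined by newline characters, as the paragraph code above implies, only the part before the first newline belongs to the first line. The function name below is made up for this sketch.

```python
def paragraph_indent_level(extracted_whitespace):
    """Return the length of the whitespace before the paragraph's first line."""
    if "\n" in extracted_whitespace:
        # Multiline paragraph: keep only the first line's whitespace.
        return extracted_whitespace.index("\n")
    return len(extracted_whitespace)


# Whitespace for a paragraph such as "  aaa\n   bbb" (hypothetical input).
multiline_whitespace = "  \n   "
first_line_column = 1 + paragraph_indent_level(multiline_whitespace)
naive_column = 1 + len(multiline_whitespace)
```

Here `first_line_column` is 3, while the naive length-based calculation gives 7, which is the kind of mismatch that the multiline failures exposed.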

Another Rabbit Hole¶

To be honest, during the writing of this article, getting a firm understanding of the status of these two scenario tests took a couple of passes. I had that understanding when I completed the tests, then forgot it before I wrote the article. When I started to write the article, I figured it out again, then forgot it while focusing on other aspects of the article. Finally, when I found the correct answer again, before I forgot the explanation yet again, I quickly jotted down notes for this section.

Yes, column numbers can be confusing. Keep good notes. Trust me on this.

The two list block scenario tests that I had issues with were example 235 and example 252. For example 235:

 -    one

     two

due to the extra indentation, the top paragraph starts at column number 7. However, as the lower paragraph does not have enough spaces to match that start column, it is instead interpreted as an indented code block.² As such, that indented code block starts at column 5, per the rule established above, and contains text that starts with the 1 extra space. As the column number in the test was 6, it was adjusted to the now correct value of 5, with manual verification and consistency check verification in place. From personal experience, it was easy to forget that the indented code block starts after the 4 space indent, making the calculation look wrong.
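The decision for the second line of example 235 can be sketched as a comparison of indents. This is a deliberately simplified illustration rather than the parser's logic: the `classify_line_after_blank` helper is hypothetical, it only handles space-indented lines that follow a blank line, and it ignores every other list rule.

```python
def classify_line_after_blank(list_content_indent, line):
    """Classify a space-indented line that follows a blank line in a list item."""
    leading_spaces = len(line) - len(line.lstrip(" "))
    if leading_spaces >= list_content_indent:
        return "list item content"
    if leading_spaces >= 4:
        return "indented code block"
    return "paragraph"


# " -    one": 1 space + 1 marker character + 4 spaces put the content at index 6.
content_indent = len(" -    ")
second_line = "     two"
kind = classify_line_after_blank(content_indent, second_line)
# The code block swallows 4 spaces, so 1 space remains in front of "two".
remaining_text = second_line[4:]
```

With 5 leading spaces, the second line falls short of the content indent of 6 but still has the 4 spaces needed for an indented code block, leaving one space in front of the text.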

For example 252:

1.      indented code

   paragraph

       more code

it was more fun with lists, but this time it was an ordered list that contained a valid indented code block, unlike the example above. In this case, the list start sequence does not have any whitespace before it and is a simple 1 digit list start. From a calculation point of view, that means start with 0, add 0 to that total for leading whitespace for the list start, add 2 to that total for the characters in the list start sequence, and then add 1 to that total for the mandatory list-to-content whitespace separator. That means that the earliest that any content can start in the list is at index 3, matching the list’s indent_level member variable. As the next token is an indented code block, adding 4 to that total results in an index of 7 or a position/column of 8 where the content for the code block starts. As the column number in the test was 9, it was adjusted to the now correct value of 8, with manual verification and consistency check verification in place.
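The arithmetic for example 252 can be written out step by step. The variable names below are only for illustration and do not come from the project.

```python
# Worked arithmetic for the first line of example 252: "1.      indented code".
leading_whitespace = 0         # nothing before the list start sequence
list_start_length = len("1.")  # one digit plus the delimiter
separator = 1                  # mandatory whitespace between list start and content

list_indent_level = leading_whitespace + list_start_length + separator
code_block_index = list_indent_level + 4  # the code block swallows 4 spaces
code_block_column = code_block_index + 1  # columns are 1-based

line = "1.      indented code"
text_after_code_indent = line[code_block_index:]
```

The list indent comes out at 3 and the code block column at 8, and the text that follows the swallowed indent still carries one leading space.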

As I mentioned in my introduction to this article, the calculations for column numbers were more detailed. It was not until I wrote down the formulas for the calculation, as outlined above, that I was able to confirm that I had made the right choice.

And with that, all the scenario tests were passing except for tabs.

Yeah, tabs again.

Be Wary Player, Now Enter Rabbit Hole Level 3: Tabs¶

As I looked at the code, dreading a lot of changes to the code to support tabs, there was both good news and bad news. The first part of the good news was that except for the tab scenarios, the rest of the code was solid. I had found a couple of issues so far, but otherwise column numbers looked good. The second part of the good news was that anything to fix with respect to tabs at the start of a line would be largely restricted to the indented code blocks. Until I started testing with container blocks, any blocks that started with any combination of tabs and spaces would always trigger the 4 whitespace minimum required to start an indented code block.

The bad news? This was not going to be fun. As I documented in More Fun With Tabs, Markdown uses tab characters as tab stops. To remove the complexity of handling those tabs at the start of every block, I did the necessary computations to translate tab characters into the correct number of spaces, based on the current index in the Markdown document. From that point on, any leading whitespace was tab-free and easy to work with and manipulate for the transformation into HTML.
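As a rough illustration of that translation, the sketch below expands each tab to the next multiple of 4 columns, taking into account where in the line the whitespace starts. The `expand_leading_tabs` name and its parameters are hypothetical. The project's own implementation is not shown in this article and may differ.

```python
def expand_leading_tabs(whitespace, start_index=0, tab_size=4):
    """Replace each tab with the spaces needed to reach the next tab stop."""
    expanded = []
    index = start_index
    for character in whitespace:
        if character == "\t":
            # Distance from the current index to the next multiple of tab_size.
            width = tab_size - (index % tab_size)
            expanded.append(" " * width)
        else:
            width = 1
            expanded.append(character)
        index += width
    return "".join(expanded)
```

The `start_index` parameter matters because the same tab character can stand for anywhere from 1 to 4 spaces, depending on the column at which it appears.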

But then I got to validating tokens for consistency with respect to column numbers. After clearing away all the other failures, only 13 failures remained, and all of them dealt with tabs and specifically with tabs in indented code blocks. How bad could it be to reverse the change and pass the initial whitespace through? I started working the problem.

Three days later, I determined that it would be very difficult. After 2 restarts, there was just too much code already in place that relied on all leading whitespace being space characters. Replacing it would require more time just to get it to a good starting point, not even to the point of having the tests passing. Resetting for a third time, I decided to just focus on the initial case of having a tab in the starting whitespace that caused an indented code block. Instead of working to reset the tab in all the code, I focused on reconstructing the leading whitespace to determine what the proper handling of the tab should be.

Another three days later, and the code was complete. When the initial whitespace is encountered, a check is made against the initial line to be parsed to see if that line contains a tab. If so, further processing is done to extract the original whitespace that directly corresponds to the initial whitespace that caused the indented code block to be started. Once the original whitespace fragment is recovered, the extracted whitespace in the indented code block token is adjusted, placing the proper portions of the leading whitespace in the code block token and the following text block token. And it also provisionally supported lists and block quotes.

Why Did I Need to Do This?¶

Basically, before this work, the whitespace that was in the leaf block tokens did not have to be correct as it was largely ignored. As such, only the extracted whitespace stored in the enclosed text block had to be correct, as that whitespace is directly exposed when rendering the token into HTML. To make this more evident, consider the Markdown text for example 7:

-<tab><tab>foo

where <tab> is a tab character. As the specification explains, there is a bit of calculation trickery in this area. The list start character - must be followed by a space, and the first tab character only advances to the next tab stop, 3 columns away. One of those columns serves as the mandatory space, so 2 spaces must be emitted to replace the rest of that tab. When the next tab character is encountered, we are already at a tab stop, so that tab is replaced with 4 characters, for a total initial whitespace of 6 characters.
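The same count can be checked with Python's built-in `str.expandtabs`, which expands tabs to tab stops measured from the start of the string. This is just a sanity check of the arithmetic, not how the project computes it, and the variable names are only illustrative.

```python
line = "-\t\tfoo"

# Expand tabs to 4-column tab stops: the first tab fills columns 2-4,
# the second tab fills columns 5-8.
expanded = line.expandtabs(4)

whitespace_after_marker = len(expanded) - len("-") - len("foo")
# One of those columns is the mandatory space after the list start character.
initial_whitespace = whitespace_after_marker - 1
# An indented code block swallows 4 of the remaining spaces.
code_text = " " * (initial_whitespace - 4) + "foo"
```

The 6 columns of initial whitespace split into the 4 that the indented code block swallows and the 2 that end up in front of the text in the HTML output.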

Why was this important? Because I had to make sure the right whitespace was being placed in the right token, otherwise either the HTML validation would fail, or the new consistency checks would fail. Before this consistency check, if I placed the needed 2 characters in the HTML text, the test passed. Due to the check, I now had to also properly calculate the whitespace to put in the indented code block token itself.

As I finished up work on the changes, I looked at the calendar… and noticed that instead of my usual one week of work, it was 12 days later. What I had budgeted 5 days for had taken 12 days.

What went wrong?¶

Looking back, I simply messed up. To me, there is nothing wrong with messing up with something like this, if I learn something and grow. So, it is just a matter of taking a good look at what happened, and being honest with myself.

The big thing is that, when I saw that it was going to be a lot of additional work to support tabs, I should have stopped to re-evaluate what needed to be done. Instead, I kept pressing forward without evaluating whether tabs were worth it at that time. As I personally do not use tabs, and many of the documents that I surveyed do not have tabs in them, I could have sidelined tabs for another day.

In addition, once I started to work on implementing the proper tab support for the tokens, I ignored additional warning signs. Thinking back, I believe that I kept on thinking “just around the next corner” and “one more line”, instead of being honest about the work. Essentially, I let my pride get in the way. Instead of being okay with leaving a partially implemented solution or resetting the work, I was convinced that it was just another 5 minutes away… even after 5 days.

On top of that, I mostly ignored my rule about leaving container blocks out of the current block of work. To get to this point with the tabs, I had to write extra code that is specifically in place to deal with indented code blocks within container blocks that contain tabs in the leading whitespace. That is doubling down on dealing with container blocks, not leaving them for later.

I need to think about things. How did I feel about all this?

What Was My Experience So Far?¶

As I have mentioned before:

Stuff happens, pick yourself up, dust yourself off, and figure out what to do next.

Yeah, I messed up. Stuff happened. Instead of doing what I normally do, I went down the rabbit hole and got lost in trying to get the column numbers right, not just “good enough”. Even after I started implementing the tab support, I did not pay attention to my own warning signs that were telling me to reset and regroup.

As for the dusting myself off, I realized that there was some good news out of all this work. Good news from all this? Really?

The biggest part of the good news is that this is the first time that this kind of thing has happened on this project. Yeah, I was hyperfocused on getting the work done, and did not pay attention to anything else. But unlike other projects where this happened multiple times throughout the same project, this was the first time for this project. One instance of this for a project that has lasted 8 months… not bad! This is not just me making lemonade out of lemons, I was genuinely happy. Granted, it took me a bit of self-reflection to get there, but I got there.

And on the way there, I did notice things. One thing that I am confident about is that even though having to take this route with the tab characters is painful, it would have been more painful to have to deal with those tabs in multiple places. The current implementation for leading whitespace removes the tabs, only adding them back in for the few cases where they are needed. Another thing is that although I needed to address a couple of issues with 2 classes of leaf blocks, the calculations for the column numbers were mostly spot on. I still want to make sure that they remain consistent by having consistency checks, but I am more confident that I calculated the column numbers correctly.

Sure, I got distracted. It happens to everyone, especially with projects that we are all passionate about. I sincerely wanted to do the right thing, and that is not bad, just counterproductive at this time. Right now, I need “good enough”, not “perfect”. While this was indeed a setback, it was a relatively small setback, one that I can easily recover from.

Overall, I was a bit bruised from following tabs down the rabbit hole, but I was okay with it. Not proud of it, but also not blaming myself and flogging myself for it either. Just okay with it. And while I did go overboard, I did get the initial scope of work done. In the end, all is good.

What is Next?¶

After focusing a lot of time on what went wrong, it took a bit for me to realize that the consistency checks were working as planned. But on further examination, with possible influence from my issues with hyperfocusing, I decided that I was not yet at the point where I could switch back to line numbers. As such, my next article will talk about how I continued this work verifying column numbers.

¹ The text every token is not meant to be taken literally. Instead of listing each of the 11 tokens, I just felt it was more compact to use a figurative value.
² The case where the second paragraph’s column matches the indent of the list item is tested in example 236.
