Summary¶
In my last article, I took care of completing the consistency checks by verifying the height of all block tokens. In this article, with all the block tokens now properly covered, I start to add proper support for line and column numbers for the text inline tokens and the emphasis inline tokens.
Introduction¶
As I mentioned in the last article:
To properly verify any of the inline tokens, the tokens around it needed to be verified to give that token a solid foundation. Without those other tokens as a foundation, any attempt at verifying inline tokens would be shaky at best.
With that foundation now firmly in place, it was then time for me to start adding the line/column numbers to the inline tokens.
The scope of what I was about to start was not lost on me. From the outset, I knew that adding the line/column numbers to the Text tokens was going to be expensive. Starting with the obvious, the Text tokens are the default “capture-all” for anything Markdown that does not firmly fall under another token’s purview. That alone meant there were going to be a fair number of scenarios in which Text tokens were present. Add to that the various forms of text that the token can contain, and each form’s way of dealing with the Text tokens within its bounds, and the number only grows. Also, as I wanted a good gauge of how hard it was going to be to add the other inline tokens, I added support for Emphasis tokens to the worklist.
I was clear with myself about the scope of this change from the outset. It was going to be a long trek to complete all this work in one week. I did contemplate updating the consistency checks to accommodate the changes to the inline tokens, but discretion got the better of me. This work was going to be tough enough on its own; there was no need to add extra tasks to the list.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commit of 29 Aug 2020.
Framing a Big Problem in a Better Light¶
Before starting with this monumental task, I wanted to take a step back and really understand the task and its intricacies. When I started looking at the sheer depth of this task, I will admit I was a bit scared at first. The work this task requires is daunting. Doing a simple search over the project’s scenario tests, I found 1577 instances of a Text token in a scenario test and 161 instances of Emphasis Start tokens in a scenario test. That meant between the Text tokens and both Emphasis Start and Emphasis End tokens, I was looking at 1899 instances that needed to be changed and manually verified. That was indeed overwhelming.
This is where my experience with test automation came in handy. I took a breath and started to look for equivalence partitions that I could use. While the numbers of discrete instances of Text tokens and Emphasis tokens were facts that I could not change, I decided to apply equivalence partitioning to reduce the effective number of instances down to something more manageable.
How Does That Work?¶
Let me take a small sample function from the ParserHelper class, the is_character_at_index function. This function is as follows:
@staticmethod
def is_character_at_index(source_string, index_in_string, valid_character):
return (
0 <= index_in_string < len(source_string)
and source_string[index_in_string] == valid_character
)
The function is simple in that, given a large variation in its parameters, it will simply return a True response or a False response.1 While the number of variations is largely finite2, they do fall into a number of categories. Starting with the index_in_string argument, those groups are: less than 0, equal to 0, greater than 0 and less than len(source_string), equal to len(source_string), and greater than len(source_string). Of those groups, only if the index_in_string argument is in the equal to 0 group or the greater than 0 and less than len(source_string) group do I need to check whether the character at the specified index is equal to the valid_character argument. As the value to compare against is a single character, the only two groups for that part of the comparison are that it matches that single character or it does not.
Based on this information, I can use those groups as equivalence partitions, or equivalence groups, to partition the data to be tested into 7 distinct test cases. The first 3 equivalence groups are the ones that cause the first comparison to fail: less than 0, equal to len(source_string), and greater than len(source_string). For this group, the negative group, a simple test with one value from each group is required. For the other 2 groups, the positive group, in addition to the comparison that places the index into one of the 5 groups, one test is required where the index specifies a character matching the valid_character argument, and one where it does not match. In all, 3 tests in the first group, and 2 sets of 2 tests in the second group, for a grand total of 7 tests.
This works well because it reduces the scope of the testing to a manageable number. Given the less than 0 group, it does not matter if the index_in_string argument is -1, -2, or any other negative number. They all fit into that group and they all evoke the same behavior: they cause the expression to be evaluated as False.
By applying this process to many testing problems, I can quickly break a problem down from an unmanageable number of possibilities to a smaller number of more easily handled cases.
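To make those seven tests concrete, here is a minimal sketch of them as plain assertions. The real project uses pytest for its scenario tests, so this is illustrative only; the ParserHelper class here is simply copied from the snippet above.

```python
# A sketch of the 7 equivalence-partition tests for is_character_at_index,
# using the ParserHelper snippet shown earlier in this article.


class ParserHelper:
    @staticmethod
    def is_character_at_index(source_string, index_in_string, valid_character):
        return (
            0 <= index_in_string < len(source_string)
            and source_string[index_in_string] == valid_character
        )


# Negative partitions: one representative value per group is enough.
assert not ParserHelper.is_character_at_index("abc", -1, "a")  # less than 0
assert not ParserHelper.is_character_at_index("abc", 3, "a")   # equal to len
assert not ParserHelper.is_character_at_index("abc", 4, "a")   # greater than len

# Positive partitions: each valid-index group gets a matching and a
# non-matching character, for 2 sets of 2 tests.
assert ParserHelper.is_character_at_index("abc", 0, "a")       # equal to 0, match
assert not ParserHelper.is_character_at_index("abc", 0, "b")   # equal to 0, no match
assert ParserHelper.is_character_at_index("abc", 1, "b")       # between 0 and len, match
assert not ParserHelper.is_character_at_index("abc", 1, "c")   # between 0 and len, no match
```

Any other value within a given partition, such as -37 for the less than 0 group, would exercise exactly the same behavior, which is the whole point of the technique.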
How Does That Apply to This Work?¶
No matter how it is viewed, having to change the serialization of 1577 instances of a Text token is a big job. That part of the work I could not change. However, I could make the manual validation part of the changes more efficient by applying equivalence classes to those changes. While I was not sure at the outset what those classes were going to be, I was confident that I could work out those groups one by one.
But it was still a big task, just not as big. Looking back at my notes, I have a scribble that says:
~40 variations for text, ~10 for emphasis
I believe that was a note to myself to boost my confidence by estimating how many equivalence classes that I believed I would get the tests down to. As I wrote this article and looked at that scribble, for a second, I was back at the point in time when I wrote that down. Like an echo, I vaguely remembered the feeling of optimism that washed over me when I wrote those numbers down. While I am not 100% certain of what I was thinking at the time, I am confident that it was something like:
1600 validations is insane! On the other hand, 40 is okay. I can do 40.
At that moment, it was not about whether those numbers were accurate, just that I had confidence that those numbers were in the general vicinity. While having to validate each of approximately 1600 variations of Text tokens filled me with dread, having to validate approximately 40 variations of those same Text tokens and approximately 10 variations of Emphasis tokens was something I had confidence that I could easily handle.
Updating the Text Token¶
Before I was ready to start with the Text tokens, I needed to get ready. Not a lot of work, but some solid foundational stuff to make the rest of the processing go easier.
Getting Ready¶
My main drive for updating the Text token to support line/column numbers was never about the small stuff. It was that boring work, stuff that was easy to do and quickly finished, that I wanted to get out of the way. Adding the ability to pass in either a position_marker argument or the line_number and column_number arguments? Done. Making sure they got copied along with the other information when the create_copy function was called? Done. Changing the InlineRequest and InlineResponse classes to handle line numbers and column numbers? Done. If my memory and notes are accurate, those changes were all completed in the first half-hour that I spent working on these changes.
Then, to ensure things were set up to verify the consistency of the changes in the future, I made some changes to the verify_line_and_column_numbers.py module. While I knew I was going to write the actual validations in a separate task, I wanted to make sure that I had a good view of which inline tokens were going to be handed off to the future consistency validators. To accomplish this, I added two sets of print statements: one as part of the __verify_token_height function and one at the end of the verify_line_and_column_numbers function. My plan here was to not only set myself up for the inline consistency checks to come, but to be able to see what the group of inline tokens to be processed was, allowing me to plan future sets of equivalence classes.
With that setup work done, it was on to the actual classes.
Starting with the Paragraph Tests¶
With that foundational work completed, I decided to start with the tests in the test_markdown_paragraph_blocks.py module. Since Paragraph Block tokens are the default containers for Text tokens, I figured that this was the best bet to get started with some of the simple stuff. That bet paid off with the first equivalence class: a Text token following a Paragraph token.
If I had to point out the simplest case of a Text element in a Markdown document, I would definitely point to an example similar to example 189. Despite its high index number, to me this is the simplest example of all Markdown documents:
aaa

bbb
Simply speaking, it is two paragraphs separated by a blank line. While it is true that a single line of text would be simpler, to me, that is not a realistic example of a Markdown document. To me, a document means multiple paragraphs of text that convey some information. From experience, it is very hard to convey anything except the most basic forms of information in a single paragraph. Also, as a realistic example, example 189 shows how you can separate two paragraphs in a Markdown document. As such, I consider this the root example.
As this was the root example to me, it also contained the first and root equivalence class: a Text token contained as the first token after a Paragraph token. While there are numerous variations of this equivalence class, for me this is the base class itself. And as I looked through the code on how to isolate this equivalence class, I came to an interesting observation. It should have been an obvious observation, but it took me a bit to work through from “huh?” to obvious. I forgot that equivalence classes deal with input and output, but that source code rarely follows those same patterns.
This Is A Good Thing¶
When I started to look for the source code behind my first equivalence class, I found that it was hard to isolate the source code to just that equivalence class. But as I looked at the source code more, that made sense. One reason that it made sense was that if the cases were isolated based on equivalence class, it would mean that there was a lot of duplicated code in the project. Another reason was that such separation would force distinct paths through the source code that would not be natural from any other viewpoint than that of equivalence classes.
The way the project was designed was to have an initial parsing phase to get all the raw information together, then a coalesce phase to combine any text tokens where possible, and finally an inline parse phase to handle the inline tokens. Dragging any artificial grouping of output across those phases seemed very counter-productive to me. But I still needed to figure things out. It was time for a bit of a regroup.
Rethinking My Approach¶
After performing a global search for TextMarkdownToken( on the project, I was rewarded with a small number of occurrences of a TextMarkdownToken being created within the project. This was good because it meant the number of actual changes that I would need to make was small, and hopefully each change would cover multiple equivalence classes.
The __handle_fenced_code_block function and the __handle_html_block function (through the check_normal_html_block_end function) were both responsible for handling the additional Text tokens as part of container processing, so they were the first to be changed. In addition, the parse_indented_code_block function, the parse_atx_headings function, and the parse_paragraph function all included the creation of new instances of the TextMarkdownToken class. Making those changes took care of all cases where the Parsing Processor created Text tokens. From there, a quick check confirmed that the Coalescing Processor only modified existing Text tokens and did not create any new ones.
After a bit of double checking to make sure I did not miss anything, I acknowledged that the preparation work was done, and it was now onto inline processing.
How The Inline Processing Works¶
When the Inline Processor starts, it loops through all the tokens, explicitly looking for Text tokens, as they are the only tokens that can contain inline sequences. Once such a Text token is found, a further check is done to make sure that the Text token is within a Paragraph element or a SetExt Heading element (the only two block elements in which inline tokens are allowed) before proceeding with the actual processing of the Text Token.
For any readers that have not been following along on the project’s journey, let me provide a bit of a recap on how that processing works. Back in the article on starting inline processing, I go through the algorithm that I use in the inline processor:3
- set the start point to the beginning of the string
- look from the start point for a new interesting sequence
- if none is found
  - emit the rest of the line from the start point and exit
- if one is found
  - emit the text from the start point to the start of the interesting sequence
  - handle the current interesting sequence
  - update the start point to the end of the interesting sequence
  - go back to the top
From the Text token perspective, the important parts of that algorithm are the emit the rest of the line part and the emit the text from... part. Whenever one of the other parts of the algorithm emits its own token4, a check is made to see what text has been “emitted” before that point. Then a new Text token is created with that emitted text, followed by the newly created token that represents the interesting sequence, after which the algorithm looks for the next interesting sequence to deal with.
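As a rough sketch, that scanning loop looks something like the following. This is not the project’s actual code; find_interesting, the token tuples, and the list of interesting characters are simplified stand-ins for the real search and handlers.

```python
# Simplified sketch of the inline scanning loop described above.
INTERESTING = "\\&*_`["  # characters that may start an inline sequence


def find_interesting(text, start):
    # look from the start point for a new interesting sequence
    for index in range(start, len(text)):
        if text[index] in INTERESTING:
            return index
    return -1


def scan_inline(text):
    tokens, start = [], 0
    while True:
        found = find_interesting(text, start)
        if found == -1:
            # emit the rest of the line from the start point and exit
            if start < len(text):
                tokens.append(("text", text[start:]))
            return tokens
        # emit the text from the start point to the interesting sequence
        if found > start:
            tokens.append(("text", text[start:found]))
        # handle the current interesting sequence (stand-in: emit it as-is)
        tokens.append(("special", text[found]))
        # update the start point past the sequence and loop again
        start = found + 1


print(scan_inline("some *emphasis* here"))
```

The two places where a ("text", ...) tuple is appended correspond exactly to the two emit points where the real code creates Text tokens, which is why so few creation sites needed to change.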
In the end, there were only 4 places where I had to change the creation of the Text tokens to provide the line/column number information. In all, there were only 9 places in the project where I had to change the creation of the Text token. Far from being lulled into thinking the hard work was done, I figured it would be in the updating of the scenario tests that things would get interesting. And I was not disappointed!
Scenarios¶
With the code changes made to the Inline Processor, it was time to focus on the scenario tests and getting their data changed and manually validated. Using the command line:
pipenv run pytest -k test_paragraph_blocks
I executed each of the paragraph-specific scenario tests, looking for the expected failures in each test that contained a Text token. Except for three tests, each of these tests was a simple case of the base equivalence class, which meant that they were quickly updated and verified. From those three tests, two new equivalence classes emerged: the first Text token within an Indented Code Block token, and a Text token following a Hard Break token.
The scenario test for example 195 is as follows:
    aaa
bbb
which was properly parsed into a new equivalence class: an Indented Code Block token containing a single Text token, followed by a normal Paragraph token containing a single Text token. As code blocks do not undergo any inline processing and no extra inline processing was specified, this was an easy validation of that new equivalence class. Quick, easy, done.
The other failing scenario test, the test for example 196 is as follows:
aaa{space}{space}{space}{space}{space}
bbb{space}{space}{space}{space}{space}
where the sequence {space} was replaced with actual space characters. I replaced the tokens with what I thought were their proper line/column numbers and was surprised to find out that the tests were still failing. As I started to work through the research on why this was happening, I came to an interesting conclusion: I was not going to get away from handling the other inline tokens after all.
The Truth Always Wins¶
Based on the above Markdown, the tokens that were generated for that scenario test were a Text token, a Hard Line Break token, and another Text token. The first Text token was fine, I had that covered, and the Hard Line Break token was not what I was focusing on, so the fact that it did not have a line/column number associated with it was fine. But that left the second Text token in a bit of a conundrum. Based on the code at that time, its line/column number was 1,4, which based on the existing logic was correct. But from a validation point of view it was incorrect: it should be 2,1.
It took me a bit to realize that if I was going to change each Text token, I would at least have to partially handle the other inline tokens. In this case, unless I added some code that understood the Hard Line Break token, the source code would continue to state that the line/column number was 1,4. To be clear, it is not that the line/column number of 1,4 is actually correct, but according to the information that the algorithm has, that is the correct value to compute for that token. So, while I did not have to output the line/column number for the other inline tokens yet, I at least had to figure out what change each such token was going to impart to the stream of inline tokens in that group.
And It Happened with The Most Complicated Inline Token¶
The Hard Line Break token just happened to be the token I needed to figure out. And it would end up being the most difficult inline token to figure out the new line/column number for. One reason was that, for whatever reason, I placed the newline for the Hard Line Break token with the following Text token, and not the Hard Line Break token itself.5 This meant that to properly deal with that token, I needed to reduce the vertical height of the following Text token by 1, as the Hard Line Break token had already increased the line number. The other reason for it being complicated is that the proper setting of the column number relied on checking with the owning Paragraph token, grabbing any leading space for that next line from that token.
All in all, it took a bit of work, but not too much, before all the tests in that scenario test group were passing. While I knew there were well over a thousand more changes to make, I knew I could do this. Yes, it would be slow, but if I just kept my focus on the task at hand, I could do this.
Lather-Rinse-Repeat¶
While I could go through each of the other equivalence classes that I discovered and processed, I will leave that for a future article where I talk about the inline consistency checks. It was enough of a brutal and time-consuming process that I will not make it more so by talking through it here. Each time, I picked a new section of scenario tests, replaced the test_paragraph_blocks in the command line with the prefix for another group of tests, and ran it again. With the results of that test run, I picked off one of the failing tests, corrected the line/column numbers for its Text tokens, and ran the tests again to repeat the process. As I went, I manually validated each test’s changes, and I rechecked my results as I staged the changes into the project’s GitHub repository.
A good example of this process was the next group of tests that I tackled: the Hard Line Break group. The first couple of tests were a rehash of what I had already done, so determining the proper line/column numbers for those tests was easy, and they were quickly verified. That left tests that included Emphasis tokens and Text tokens within Atx Heading tokens. I just buckled down and followed the same process as documented before, adjusting as I went.
Yes, it was slow, but it was also good. While it dragged on, I was getting predictable results with the application of my process. In my mind, I had confidence that it was no longer a matter of “um… around 1600 tokens? how am I going to…”. I was making that transition to “how can I get these done more efficiently and reduce my time on each test without sacrificing quality?”
Updating the Emphasis Token¶
Compared to the work required to change the Text token, updating the Emphasis token to include line/column numbers was trivial. As the work had already been done to determine the width to apply to the start and end tokens, the main change was to pass the line/column number information to the constructor of the EmphasisMarkdownToken and the EndMarkdownToken.
With that change in place, I started running the scenario tests in the emphasis group and only had to make one small change. In the cases where an end Emphasis token was completely consumed, everything was fine. But in the cases where an end Emphasis token was partially consumed, the column number was off by one. That took a bit of puzzling, but after some thinking, the answer leapt out at me. I will not kid you though: without scribbling down the various cases and working through the scenarios, it would have taken me a lot longer.
For the start and end Emphasis tokens, the Inline Processor creates a Special Text token that contains either the * or _ character and the number of those characters found. Because emphasis is processed from the inside out6, the emphasis characters taken from those Special Text tokens occur at the end of the Special Text token for the start Emphasis token and at the beginning for the end Emphasis token. As a result, the start Emphasis token’s column number needed to be adjusted by the number of characters consumed, to ensure it pointed at the right location. Once that adjustment was added, the remaining scenario tests passed.
While I was not sure if the other inline tokens would be as easy as the Emphasis tokens, I was hopeful. And in a long project, that is a good thing!
What Was My Experience So Far?¶
When I start to write these sections in my articles, I always refer to my notes and try to put my mind back into the frame of mind I was in at that time. While there are sparse notes here and there about this section of work, there is really only one of those notes that properly sums up the work:
Phew!
While there were times during that slog that I did not think it would ever end, I was adamant that I was going to get through it and complete it. In my mind, it was not a question of confidence, it was a question of endurance. My change algorithm was simple enough, and I had confidence that the validation part of that algorithm was solid. It was just a matter of working through what seemed to be a mind-numbing number of changes until they were all done.
But I persisted and got through it. And while I believe it was the right decision to focus only on the Text tokens and the Emphasis tokens, in hindsight, it might have been okay to add the other inline tokens at the same time. With all the work to make sure their space was properly accounted for, I believe that most of the work left for those tokens is to plug the calculated line/column numbers into the inline tokens themselves, change the serialized text, and write the consistency checks. Be that as it may, unless I messed up a calculation, the hard part of making sure the calculations work has already been done.
From the technical debt point of view, I am a bit worried, but not too much. The list of things to check in the issues list is a bit larger than I would like, but it contains some future ideas and a lot of double-check reminders. At the very least, I am sure I can start to make short work of many of those issues, or properly prioritize them for later, whichever is best for each issue.
What is Next?¶
With the Text tokens and the Emphasis tokens out of the way, I decided that it was better that I add the consistency checks for those tokens before progressing forward. After having to do a fair amount of work to support “bypassing” those tokens to properly calculate the line/column number of any following Text token, I had a feeling it would come in handy if I moved it up the priority list a bit.
-
This function makes simple assumptions: that the source_string argument is a string of any length, the index_in_string argument is an integer, and the valid_character argument is a string of length 1. Because the argument names are explicit enough and their usage is within a predefined scope, I decided not to verify the type of value for each. As such, the statement that the function will either return True or False assumes that those assumptions are followed. ↩ -
For more information, see Wikipedia. The short answer to this is that I would start with the first argument containing an empty string, for a count of 1. Then I would move on to a string with 1 character, and have to populate that string with every viable Unicode character. Moving on to strings with 2 characters, I would need to take every 1 character string, and repeat that same process with the second character. Repeating this process, the number of variations on possible strings is not infinite, but mathematically it is called a large finite number, or largely finite. ↩
-
The two exceptions to this are the handling of the backslash sequence and the character entity sequence, both of which add to the text that is being accumulated. ↩
-
Yes, I did add an item to the issues list for this. ↩
-
For a good example of this, look at example 422. ↩
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.