## Introduction
As I documented at the end of the previous article, having arrived at the decision to improve the linter's foundation, it was time to tackle the largest issue on the stack: missing support for line numbers and column numbers. While at the time I thought of it as a glaring omission (see this article for more), the passage of time since that discovery has helped me see that omission in a better light. On its own, getting the tokenization correct was a large enough task; adding the determination of each token's line number and column number would have made it unbearably larger. Knowing myself as I do, even if I had discovered that requirement during the design phase, there is a really good chance that I would have relegated the determination of each token's original position to its own task. Regardless of what happened in the past, it was obvious that I needed to start working on that task now.
Starting to think about this issue, I had no doubt it was going to be a tough task to complete, and I knew that I needed to find a way to break things down even further. As I figured that adding line numbers and column numbers to every token was going to be too much, I reduced the scope by limiting the work to only block tokens. As every inline element is rooted in a block element, I was confident that this would ensure the line numbers and column numbers were solid for each block element before adding the inline elements into the mix. It was also a solid dividing line for the task that just made sense to me.
With those decisions on scope dealt with, it was time to delve into the depths of line numbers and column numbers.
## What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project's GitHub repository and consult the commits between 16 May 2020 and 26 May 2020.
## Why Was Reducing the Scope Good?
Each of the implemented and enabled parser test functions includes the list variable `expected_tokens` and the string variable `expected_gfm`. The `expected_tokens` variable is used to ensure that the tokens that are expected from the Markdown document are emitted by the parser. The `expected_gfm` variable is used to ensure that those tokens can use a simple algorithm[^1] to transform themselves into the HTML required by the specification. Prior to this work, their focus was simple: enable that forward translation from Markdown into HTML.
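To make the shape of these tests concrete, here is a minimal sketch of one. The test name, the helper functions, and the serialized token strings are illustrative stand-ins, not copied from the project source.

```python
import pytest


@pytest.mark.gfm
def test_atx_headings_sketch():
    """A sketch of a scenario test: tokens forward, then HTML forward."""

    # Arrange: a small Markdown sample and the results we expect from it.
    source_markdown = "# Hello"
    expected_tokens = [
        "[atx(1,1):1:0:]",
        "[text(1,3):Hello: ]",
        "[end-atx::]",
    ]
    expected_gfm = "<h1>Hello</h1>"

    # Act: hypothetical helpers standing in for the project's real entry points.
    actual_tokens = tokenize_markdown(source_markdown)
    actual_gfm = transform_to_gfm(actual_tokens)

    # Assert: the parser emits the expected tokens, and those tokens
    # translate into the HTML required by the specification.
    assert [str(token) for token in actual_tokens] == expected_tokens
    assert actual_gfm == expected_gfm
```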
Adding in support for line numbers and column numbers was going to be a lot of work, but it was working in a different direction: backwards towards the original Markdown document. There was no denying that this was a difficult task to complete, but the reduction in scope left me with a far simpler problem to solve. In addition, to protect the work already done in the forward direction, the existing battery of tests remained in place, diligently guarding that work.
With protection for the already completed work in place, and a clear definition of the work to be done in mind, it was time for me to figure out how to start the process off.
## Where to start?
To help me keep track of things, I added each of the block element names to the `readme.md` document where I track my issues. As I needed to represent every Markdown block, I moved the question in the issue document regarding the adding of a Link Reference Definition token to the start of this new section and transformed it into a statement. As each block element in the Markdown document needed to be tagged with its position, it was a necessity that Link Reference Definition tokens be added, even if they did not result in any HTML output.
Having all the block elements listed in that document, I then needed to pick a good place to start. I was aware that wherever I decided to start, that token was going to have a lot more work associated with it, as I tested out and implemented the foundation used to add line numbers and column numbers to the other tokens. Whatever process and code I used for that first block element, the plan was to repeat it in some closely related form for each of the other 10 block elements in the list. But which one to pick?
In the end, I decided to start with Atx headings. Unfortunately, I did not keep any notes as to why I picked this block element to start. My best guess is that after having a lot of exposure to heading code while implementing the last group of rules, I figured headings would be a good place to start. Following that line of thought, I believe that I also would have guessed that SetExt headings were going to be the more difficult token to change, leaving Atx headings as the best place to start. Either that, or I decided to go alphabetic and ‘a’ is the first letter of the alphabet. Either way, Atx headings were first, and it was going to be a fun ride.
## Adding in Position Markers
Taking a good look at the Atx headings and the information that was being passed around, the first thing that caught my eye was a good opportunity to refactor. As I was letting parts of the parser grow organically, many functions included the argument `line_to_parse` along with an argument that was usually named `start_index`. This organic pairing made sense in a lot of places, allowing for the easy manipulation of the current index and, occasionally, the `line_to_parse`. However, when adding in support for line numbers, I was going to have to pass in a third variable, meaning it was time to refactor this. As I have said previously:

> Write it once, write it neat. Write it twice, think about extracting it. Write it three times, extract it without a second thought.

And that is where I started to work on replacing the `line_to_parse` variable and the `start_index`-like variables with a `position_marker` variable and the `PositionMarker` class. This class was, and remains, very simple:
```python
class PositionMarker:
    """
    Class to provide an encapsulation of the location within the Markdown document.
    """

    def __init__(self, line_number, index_number, text_to_parse):
        self.line_number = line_number
        self.index_number = index_number
        self.text_to_parse = text_to_parse
        self.index_indent = 0
```
The `line_to_parse` variable was replaced by the `text_to_parse` member variable, the index-type variable was replaced with the `index_number` member variable, and the `line_number` member variable was introduced. As I knew I would have to support "indented" text, such as text after a block quote character or a list start sequence, I added `index_indent` at the same time.
And with that, the easy stuff was done. It was time to move on to the tougher stuff.
### Creating the marker
Wanting to start at the beginning, I made some modifications to the `__parse_blocks_pass` function in the `TokenizedMarkdown` class to track the line number. As I did not have to deal with Link Reference Definition tokens yet (spoilers!), this was a straightforward change, with the local `line_number` variable being initialized before the main loop and that same variable being incremented at the end of the loop. To encapsulate this information, I created a new position marker as follows:
```python
position_marker = PositionMarker(line_number, 0, token_to_use)
```
Instead of passing `token_to_use` to the `parse_line_for_container_blocks` function, I instead passed in the expression `position_marker.text_to_parse`. While I wanted to pass in `position_marker` by itself, I knew I had some important work to do before that would be possible.
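To make the shape of that change concrete, here is a simplified sketch of the line tracking described above, using the `PositionMarker` class shown earlier. The function name and loop structure are illustrative, not the project's actual `__parse_blocks_pass` code.

```python
def parse_blocks_pass_sketch(document_text):
    results = []
    line_number = 1
    for token_to_use in document_text.split("\n"):
        position_marker = PositionMarker(line_number, 0, token_to_use)
        # At this stage, only the text was handed down to the container
        # block processing; passing the whole marker came later.
        results.append((position_marker.line_number, position_marker.text_to_parse))
        line_number += 1
    return results
```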
### Stage 1 - Integrating the Marker Locally
Using my knowledge of the project, I knew that the path between the `__parse_blocks_pass` function in the `TokenizedMarkdown` class and the `parse_atx_headings` function in the `LeafBlockProcessor` class was going to go through a number of functions in the `ContainerBlockProcessor` class. Instead of using a lot of guesswork to determine what that path was, I decided to start at the `parse_atx_headings` function itself and work my way backwards to the `parse_line_for_container_blocks` function.
Instead of adapting the entire function in one pass, I decided to start by focusing on the passing of the position marker into the function. To facilitate this decision, in the `parse_atx_headings` function, I wrote some simple glue code to interface between the old style code and the new style code. Before the change, the code looked like this:
```python
def parse_atx_headings(
    line_to_parse, start_index, extracted_whitespace, close_open_blocks_fn
):
    """
    Handle the parsing of an atx heading.
    """
    new_tokens = []
```
After the change, the code looked like this:
```python
def parse_atx_headings(
    position_marker, extracted_whitespace, close_open_blocks_fn
):
    """
    Handle the parsing of an atx heading.
    """
    line_to_parse = position_marker.text_to_parse
    start_index = position_marker.index_number
    new_tokens = []
```
In the `__parse_line_for_leaf_blocks` function, where the `parse_atx_headings` function was called from, I created a new `PositionMarker` instance named `temp_marker`, with the information from the local function. That new `temp_marker` variable was then passed into the `parse_atx_headings` function instead of the previously used `line_to_parse` and `start_index` arguments.
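In other words, the calling side looked roughly like this. This is a sketch of the glue inside `__parse_line_for_leaf_blocks` under my assumptions; the real function passes more arguments than shown here.

```python
# Sketch: wrap the existing local variables in a temporary marker and hand
# that marker to the refactored function in place of the old two arguments.
temp_marker = PositionMarker(line_number, start_index, line_to_parse)
new_tokens = LeafBlockProcessor.parse_atx_headings(
    temp_marker, extracted_whitespace, close_open_blocks_fn
)
```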
I knew from the start that this approach was going to be more work. However, by isolating the change to the function’s signature, I knew I could keep the parser stable and usable at any point. By keeping the parser in that good state, I knew I could depend on the large bank of tests used by the project to verify any change was a successful change with no unintended side effects. The process was indeed slower than some other methods, but it was a process that I had absolute confidence that I could rely on.
### Stage 2 - Integrating the Marker Up the Stack
In this stage, the process that was applied to the `parse_atx_headings` function was applied again and again until the `__parse_blocks_pass` function was reached. With every step going back towards that function, I repeated the same actions that I performed in stage 1, but with the calling function: add glue code, change function arguments, create a temporary position marker, and pass that position marker into the newly changed function.
The almost continuous execution of tests at this point was essential. As with any parser, even slight changes in the content or interpretations of the content can cause small ripples in the output that may escape detection if not quickly located and addressed. While I was doing my best to try and ensure that all changes were kept as isolated as possible, it was only the rapid battery of tests that kept my confidence high that I was going in the right direction with no ripples.
Another thing that helped me immensely was using Git's staging ability. To facilitate clean commits, Git has a nice feature called a staging area, with moving changes into this area commonly referred to as staging. The nice part about Git staging is that, as this article describes, it is very easy to discard any changes made to your local files in favor of the state of that file with any committed and staged changes applied. As such, each time I applied this process to another function, I ensured that once the tests were all passing, I staged those changes in the repository. In the cases where I made mistakes, which did happen frequently, I knew that I could revert the relevant files back to their pre-change states and resume work from there. When it was time to commit those changes, they were already isolated in the staging area, ready to be committed.
### Stage 3 - Cleaning Up Going Back Down
Once I reached the `__parse_blocks_pass` function, I was able to connect with the real `position_marker` variable, using that variable instead of creating a new `temp_marker` variable. I then started working back towards the `parse_atx_headings` function, replacing `temp_marker` with `position_marker` as I went.
When I got to the `parse_atx_headings` function, I then proceeded to replace instances of `line_to_parse` within that function with `position_marker.text_to_parse` and instances of `start_index` with `position_marker.index_number`. As these changes were localized, they were easy to make, but I still ran frequent test passes just to make sure I was making the right changes. I do acknowledge that some of those test passes were executed out of paranoia, but in my mind, if running the tests an extra time allowed me to step forward with no reservations, it was worth it.
Finally, at the end of this process, I removed the `line_to_parse` variable and the `start_index` variable from the function, as they were no longer being used. There was no benefit to keeping those two variables around, so I just removed them.
I now had a solid function to parse for Atx headings, referencing the line number and
index number in a nice compact object. In addition, I was able to easily trace this
compact object from the main processing line of the parser directly to the function
itself. The only thing left to complete the change was to wire up the token.
### Stage 4 - Wiring up the Token
Now that I had the `position_marker` object being passed properly into the `parse_atx_headings` function, I needed to pass that object into the token's constructor.
The first change I made was to the base `MarkdownToken` object, changing its constructor to allow the passing in of a `line_number` argument and a `column_number` argument to be stored in the instance. To complement this change, I made some modifications to the `__str__` method to add the line number and column number to the returned string, but only if at least one of those two numbers was not 0. The benefit of this slight change was that the line number/column number pair would only show up in the function's output if that information was provided. As that information was only provided once I added support for that type of token, I only had to worry about the token output changing for the tokens I had already changed.
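A rough sketch of that kind of change is shown below. The member names other than `line_number` and `column_number`, and the serialized format, are illustrative rather than the project's exact code.

```python
class MarkdownToken:
    def __init__(self, token_name, extra_data="", line_number=0, column_number=0):
        self.token_name = token_name
        self.extra_data = extra_data
        self.line_number = line_number
        self.column_number = column_number

    def __str__(self):
        # Only include the position pair if at least one value is non-zero,
        # so tokens without position support keep their old output.
        if self.line_number or self.column_number:
            return "[{0}({1},{2}):{3}]".format(
                self.token_name, self.line_number, self.column_number, self.extra_data
            )
        return "[{0}:{1}]".format(self.token_name, self.extra_data)
```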
The second change was to modify the constructor for the `AtxHeadingMarkdownToken` class to accept a `position_marker` argument. That code was implemented as such:
```python
line_number = 0
column_number = 0
if position_marker:
    line_number = position_marker.line_number
    column_number = (
        position_marker.index_number + position_marker.index_indent + 1
    )
```
Basically, if the `position_marker` variable was not present, the default value of 0 was used for the `line_number` and `column_number` variables. If the `position_marker` variable was present, the `line_number` was moved over from the `position_marker` object into the function's `line_number` variable, and the `column_number` variable was assigned the sum of the `index_number` member variable, the `index_indent` variable (at this point, always 0), and 1. As index values are always 0 based and column numbers are always 1 based, the addition of 1 to the index number ensured it was based properly.
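As a small worked example of that arithmetic, using a hypothetical heading line of my own making:

```python
# For a hypothetical line "  # Hello", the Atx start sequence begins at
# 0-based index 2 with no container indent.
index_number = 2
index_indent = 0
column_number = index_number + index_indent + 1
assert column_number == 3  # the "#" sits in the third column, 1-based
```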
Finally, these newly calculated values for the `line_number` and the `column_number` were passed into the newly retooled `MarkdownToken` constructor. This was the point that I was preparing for, where everything came together. It was now time to see how close I got to the actual test results. If memory serves, I believe I actually closed my eyes after saving my changes and started the execution of the tests.
### Stage 5 - Addressing Test Results
Up until the very end of the previous step, the most frequent test command line that I used was:
```
pipenv run pytest -m gfm
```
This executed only the parser's scenario tests, reducing the test area to only those tests that were tagged as being part of the `gfm` group. As I was running the tests with each change, it was important to keep the scope, and therefore the execution time, of the tests to a minimum. And up to this point, it had worked well: the tests were solid, passing every time.
But as I made the connection between the `AtxHeadingMarkdownToken` constructor and the `MarkdownToken` constructor that it calls, I knew that everything I had just changed was just a good guess. I hoped that I had wired things up correctly, but at that point, I only knew two things for sure:

- due to the tests, I had not disrupted any of the parsed tokens for the test cases, including their order and content
- I was passing some transformation of the base `position_marker` into the token's constructor
With everything wired up, I used the above command line multiple times, each time picking off one of the failed tests to validate. With each test, I first calculated what I thought the line/column pair should be, noting it down in the function's comments section. I then validated it against the test results, trying my hardest to not peek at the test results before I did my calculations. Once I had all those tests cleaned up, I did an extra pass through the changed tokens, manually recalculating the line/column pair and checking to make sure they were right. Only after all that was addressed did I change the marker in the above command line from `gfm` to `rules`, cleaning up any test failures caused by line/column pairs that were now being displayed as part of the output.
## The Importance of Good Tests
While I can talk at length about good practices with respect to tests, I hope I can convey the sincere importance that I associate with having good tests and good test coverage on a project. It is true that I can often "eyeball" a simple change to a project, figuring out whether that change is correct. But in my mind, anything beyond a simple change requires solid, repeatable testing. Without that testing in place, you are simply betting that nothing will go wrong with your project.
Testing is not about finding bugs as much as it is about risk management. If you have a small project that rolls various forms of dice for a Dungeons and Dragons campaign, the impact of a failure is very light. Any person I know that would use such a project would also have some backup dice for "just in case". For the PyMarkdown project, the end goal is to have a tool that website owners can trust to keep their website's Markdown documents in a prescribed format and style. The impact of too many failures is a loss of trust in the project, and with enough loss of trust, those website owners will simply stop using the project. By having good testing of various forms embedded within the project, I can hopefully mitigate some amount of the loss of trust that any failure brings with it.
Since I have worked so hard to get the project to this state, I did not want to take any unnecessary risks with the project's stability. The tests are a solid tool that I use frequently to keep any such risk to a minimum, especially during refactoring, where I rely on them heavily. And because I rely on them so heavily, I am also continuously looking for better ways to test my projects.
## Lather-Rinse-Repeat
I honestly tried to find another heading for this section. Try as I might, no other section heading seemed to convey the repetition that I went through other than the phrase: Lather, Rinse, and Repeat. For each of the 10 remaining tokens, the 5 steps outlined above for Atx headings were repeated, with only a few changes to the process. Those changes are outlined in the following sections. Please note that the following sections are ordered with respect to the amount of work needed to resolve them, rather than my usual chronological ordering.
When I encountered questions about whether something related to a given token was done properly or not, those questions were added as action items to the `readme.md` file to be examined later. I knew this was going to be a slog, and a hard slog at that. It just seemed more efficient to me to note them and move on, circling back to deal with them later.
### HTML Blocks, SetExt Heading Blocks and Code Blocks
There were no changes introduced to the process for these elements.
### Thematic Breaks
There was one small change to the process for this element. That change was the delaying of the cleanup stage until a later time, as I wanted to get more breadth of tokens implemented to ensure I had the right foundation for the change.
The other change that I made was to the `MarkdownToken` class, thereby affecting all the other tokens. For this change, I moved the code to calculate the `line_number` variable and the `column_number` variable from the `AtxHeadingMarkdownToken` class to the base `MarkdownToken` class. Once this code was proven, it just made more sense to keep it in the base class than to repeat it for each token.
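Roughly, that relocation looks like the sketch below: the same calculation as before, but living in the base class so every token constructor can simply hand over a `position_marker`. The constructor signature and member names other than the position pair are illustrative, not the project's exact code.

```python
class MarkdownToken:
    def __init__(self, token_name, extra_data="", position_marker=None):
        # The calculation previously in AtxHeadingMarkdownToken, now shared.
        line_number = 0
        column_number = 0
        if position_marker:
            line_number = position_marker.line_number
            column_number = (
                position_marker.index_number + position_marker.index_indent + 1
            )
        self.token_name = token_name
        self.extra_data = extra_data
        self.line_number = line_number
        self.column_number = column_number
```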
### Blank Line Token
The change required for the Blank Line token was not with the token itself, but with the processing of block quote blocks. The `__handle_blank_line` function in the `TokenizedMarkdown` class was changed to accommodate the `position_marker` argument, which in turn required every call to that function to supply that argument. Other than that change, everything else was normal.
### Block Quote Blocks and List Blocks
Strangely enough, I thought these two types of blocks would take the most amount of work to address, but due to the way they were implemented, only a couple of small changes were required. In both cases, the block start algorithms have to deal with the possibility of a tab character (often denoted as `\t`) being used as the whitespace between the block start sequence and the rest of the block's data. Having already dealt with tab stops versus tab characters and fixed that issue once, I really did not want to fix it again. I just needed to ensure that the current fix and the previous fix were compatible with each other. To ensure that happened correctly, the modified line text and index numbers were passed to a newly created `PositionMarker` instance, created after the processing of the Block Quote block and the List blocks. This ensured that any line text modified by the processing of either container block would be retained, while adding the now normal processing using the `PositionMarker` class.
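A sketch of that idea follows; the variable names and the way `index_indent` is derived here are my own illustrative assumptions, not the project's exact handling.

```python
# Sketch: after the container (block quote/list) processing has consumed its
# start sequence, build a fresh marker from the modified text so downstream
# leaf processing sees the adjusted line.
new_marker = PositionMarker(
    position_marker.line_number, adjusted_index, modified_line_to_parse
)
# Record how much of the original line the container start consumed, so
# column numbers can still be reported against the original line.
new_marker.index_indent = len(original_line) - len(modified_line_to_parse)
```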
### Paragraph Blocks
As the catch-all element for Markdown documents, I had a feeling that these blocks would end up near the top of the “most effort” list, although I will admit that I guessed the reason wrong. I thought that it would have something to do with lists and block quotes, while the actual reason is due to Link Reference Definitions. More precisely, the reason is due to failed or partially failed Link Reference Definitions.
Due to its composition, it is impossible to properly determine whether a Link Reference Definition is valid without reading the next line. As documented in the article Adding Link Reference Definitions, its multiline nature and use of various forms of quoted sections mean that unless the parser encounters the end of the closing section or a blank line, it does not know whether it has reached the end of the Link Reference Definition. If it reaches the end and, for any reason, that Link Reference Definition is not valid, the logic in the parser starts adding lines back on to a list of lines that need to be reparsed without being misinterpreted as a Link Reference Definition.
It is precisely that requeuing logic that I needed to alter to work properly with the new `PositionMarker` class. While the `index_number` remained untouched, I had to make sure that the `line_number` variable was properly reset to account for the length of the `lines_to_requeue` variable when any requeuing occurred. I had a bit of a problem with the initial implementation of this code, so I wanted to take extra time and really boost my confidence that I was handling the change for this token properly.
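In essence, the adjustment boils down to something like this sketch of the idea, not the exact code:

```python
# Sketch: when a failed Link Reference Definition hands lines back to be
# reparsed, pull the line counter back by the number of requeued lines so
# the next pass reports positions against the right lines.
if lines_to_requeue:
    line_number -= len(lines_to_requeue)
```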
### Link Reference Definitions
This one should be very obvious… I needed to add the token itself! When designing the parser, I did not see any need for a token that literally has no impact on the output, so I did not add a token for it. While a link element that refers to a link reference definition will show up as a link in the HTML output, the link reference definition itself is not added to the HTML in any form. However, now that I was adding support for all Markdown block elements, I found myself in the position of adding the token to the parser.
Adding the token itself was not too difficult. That was accomplished by the usual process of finding the right method, the `__stop_lrd_continuation` method in this case, creating a new instance of the new `LinkReferenceDefinitionMarkdownToken` class in that function, and properly populating it. Then the process of adding the support for the `PositionMarker` class kicked in, and I quickly had a wired-up token with the proper data.
And then I went to test it… and I noticed that the tests failed in the transformation of the tokens into HTML. Looking into it, I quickly determined that I needed to make some additional changes to the `TransformToGfm` module. This was a simple change as, by definition, Link Reference Definitions have no effect on the output. To accommodate the new token, a simple handler was registered for that token that simply returns the same string that was given to the handler.
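Something along these lines; the handler name and parameter names are my guess at the shape of the module's `handle_*` functions, not its exact API:

```python
# Sketch: a do-nothing handler for the new token, so the HTML transformer
# knows about it but leaves the output untouched.
def handle_link_reference_definition_token(output_html, next_token, transform_state):
    return output_html
```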
And then I went to test it… and the line numbers and column numbers were incorrect. In about half of the tests, the pair of numbers was wildly different from what I had manually calculated. Double-checking my numbers, I then noticed a pattern: the numbers being reported were for the last line of the Link Reference Definition. Because of its multiline nature, the values associated with the `position_marker` variable were the ones in use when that Link Reference Definition was considered both valid and complete. To avoid passing a single-use variable around, I added the starting position to the `LinkDefinitionStackToken` instance at the top of the stack, and things looked better.
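In spirit, that fix looked like the line below; the attribute name is illustrative, not the project's actual field.

```python
# Sketch: capture the definition's starting position on the stack token when
# the potential Link Reference Definition begins, so the Markdown token can
# later be created with that position instead of the position of its last line.
link_def_stack_token.original_position_marker = position_marker
```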
As I looked at the token and its `__str__` function output, it looked very lean. Based on other tokens, I expected a lot more information to be stored in that token, to be analyzed later. Slowly going through the information gathered during the processing of the token, I ended up figuring out how to properly represent the token. Between the various whitespace sequences between parts of the definition, the various parts of the definition themselves, and both uninterpreted and interpreted parts, the data called for a class with 11 member variables.
Slowly but surely, I started working through each of the Link Reference Definition scenario tests, starting with the easier ones and working my way to the more difficult ones. It took me a while to work through each scenario and validate it, but I breathed a sigh of relief when I finished going through all the tests. In the end, it had taken 2 days to complete, along with countless executions of the scenario tests.
It was not until I looked at my notes that I remembered something: I had delayed some of the cleanup.
## Cleaning Up
Before calling this round of changes done, there was a decent amount of delayed cleanup to be performed. As part of Stage 3 of the process, there were quite a few times where that cleanup was delayed for one reason or another. As I was about to move on to another issue, I wanted to make sure that cleanup was performed.
That cleanup itself was easy, as I had repeated the “change-test-check” process so many times that I believe I could perform it while sleeping. Following a variation of that same process, the code was cleaned up in short order.
## What Was My Experience So Far?
As I finished the cleanup, I was happy to have the work of adding in the line numbers and column numbers behind me. After 10 days of work, that large task was no longer on my task list. But while I was happy, I was also concerned about the 20 new issues that I had added to my issues list. I was aware that at least a few of those items would be checked out and determined to be false alarms, but that probably still left somewhere between 16 and 18 issues to research and triage.
On top of that, there was also the fact that each line number/column number pair for the tokens was determined manually by me. While I have faith that I can count accurately for small sets of numbers, I do know that I make mistakes. Some of those mistakes I had already caught while going over the test cases before committing the changes for each of the elements. But it was entirely possible, and indeed probable, that I had missed at least 1 or 2 mistakes. This was not due to lack of confidence, but due to a sense of reality. I know that the project has great coverage and great scenarios, but I also know that I do not have every scenario represented, just the really important ones.
That got me thinking. If I had already found that many errors, was there a way to remove the need for me to manually count those values? Could I automate it? Would it be worth it? Would I find anything?
The more I thought about those questions, the more I realized that I needed to explore this area. If nothing else, I wanted to make sure I had not missed anything. After thinking about it for a while, I knew I was not willing to risk having missed something important.
## What is Next?
And here, ladies and gentlemen, is where I begin to go down the rabbit hole.
[^1]: At the current moment, the `transform_to_gfm.py` module is approximately 900 lines long. About 30% of that module is processing overhead, with about 18% dedicated to handling the somewhat complex issue of list looseness. The remaining 52% is consumed with the simple `handle_*` methods used to translate each token's start and end tokens into HTML. While the calculation of list looseness does add some complexity to the translation, the algorithm and implementation are both relatively simple.
## Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.