Summary
In my last article, I took care of completing the Markdown transformer checks for one half of the container block tokens: the List Block tokens. In this article, I tackle both the line/column number consistency checks and the Markdown transformer checks for the Block Quote tokens.
Introduction
While the implementation of list tokens went more easily than I had thought it would, I remained cautiously optimistic about adding consistency check support for Block Quote tokens. During the initial development of the line/column number checks, I noticed that the Block Quote tokens did not keep any information about removed data. After having completed the List Block token support, I knew that it would be pivotal to get that right. And that meant more work. A lot of nitpicking work.
To add the consistency checks, I was going to have to accomplish three things: capture the removed line start text, use that information to validate the line/column numbers, and then write the Markdown transformations for it. The last two items were the easy part; those parts I had confidence I could complete. It was the capturing of the removed text that I was not confident about. And that capturing was the change that I needed to tackle first.
What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 12 Aug 2020 and 14 Aug 2020.
Let’s Talk About the Obvious
I am generally very disciplined when it comes to writing my blog. I keep notes while I am writing the code, curating those notes into an outline from the Thursday to the Saturday before I post my article. If I am lucky enough, I will start the actual writing of my article on Saturday night, but most often I start writing it early Sunday morning. This allows me to take my time to enjoy my Sunday while ending up with a solid version of the article by Sunday night. Then, on Monday night, I take that article, add extra code samples, and do some wordsmithing to shape the article into its final form. It may not be the way others write their articles, but it is a process that works for me.
Until this week that is. Everything was on track until Monday, when I started my edits with a headache that I just could not ditch. But process is process, so I started doing those edits, pushing on until a couple of hours later. At that point, I looked at what I was doing and went… huh? I started looking at my edits and tried to figure things out, using even more edits to fix any problems I encountered. Far from making things better, I was making them worse. But I would not see that until later.
It was a while later when I finally stopped and locked my computer. Above all else, this was the best thing that I did in that entire situation. It took me a while, but I stopped. However, by that point, a certain amount of damage was already done. I had written some things twice and other things not at all. In some cases, I had sentences that, looking at them later, made me wonder if I had written them while drunk. I knew I could recover from that position, but it would require a clear head, some extra time, and some hard work. And it was not going to happen that night. I came to the realization that I was not going to be able to meet my usual posting deadline of Monday night.
At first, I was angry with myself. Many weeks of not missing a posting deadline and now I had a bad record. “I always post on Monday nights, except for…” Argh! But after thinking about the purpose of my blog, to educate and help others, I started to find peace with it. None of us is perfect, and we all need to take care of ourselves. Me included. My health versus my article’s deadline. I needed to take that extra day and get better before finishing… er… unfinishing and then refinishing the editing.
Capturing the Leading Block Quote Text
In its simplest form, the handling of container block elements and their tokens is a relatively easy concept. Every line of a Markdown document enclosed within a container block element is usually prefaced with one or more characters that denote the presence of that container element. For Block Quotes, these are character sequences with the `>` character and an optional whitespace, and for Lists these are whitespaces. To make the processing of those contained lines easier, at the start of the project I took some up-front time to ensure that the container block element processing is independent of the leaf block element processing. As such, I was able to work on the leaf block processing independently of the container block processing.
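
To make that concept concrete, here is a minimal sketch of that prefix stripping for Block Quote elements. The function and sample document are my own illustration, not the project's code:

```python
# A minimal sketch of container prefix stripping for Block Quotes:
# remove a leading '>' character and one optional following space,
# leaving the rest of the line for the leaf block processors.
def strip_block_quote_prefix(line):
    if line.startswith(">"):
        line = line[1:]
        if line.startswith(" "):
            line = line[1:]
    return line

document_lines = ["> # heading", "> some text", "plain text"]
for document_line in document_lines:
    print(strip_block_quote_prefix(document_line))
# prints: "# heading", "some text", "plain text"
```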
When I was doing the initial research for container block support, I started with List tokens. I quickly wrote a proof of concept for the line/column number check and was happy to find that I had already placed the required information in the token. However, after using the same process with Block Quote tokens, it was evident that I had missed the mark with them. Unlike the List tokens, none of the required information was placed within the Block Quote tokens. At that time, after having gone down a rabbit hole for a bit, I wisely decided to wait until later to implement that. Well, that time was now and I knew it was going to take some work to add it.
Addressing the Issue
Like the work I did for the List Block tokens, the Block Quote stack token needed a change to support the `matching_markdown_token` field. This field is pivotal in informing the processor of the latest Markdown token that is containing other tokens. But to properly use this field, it would take a bit of disassembly and reassembly. Before this change, the code to add the necessary tokens to the two relevant stacks was:
```python
parser_state.token_stack.append(BlockQuoteStackToken())
...
container_level_tokens.append(
    BlockQuoteMarkdownToken(extracted_whitespace, adjusted_position_marker)
)
```
Two different tokens and two different stacks, with nothing between them. With this new functionality in place, that code needed some slight changes:
```python
new_markdown_token = BlockQuoteMarkdownToken(
    extracted_whitespace, adjusted_position_marker
)
container_level_tokens.append(new_markdown_token)
parser_state.token_stack.append(
    BlockQuoteStackToken(new_markdown_token)
)
```
Instead of two distinct tokens, there was now a stack token that included a reference to the Markdown token that it represented.
With that field properly initialized, the `add_leading_spaces` function was then added to make use of that new field. It is a simple function that adds any extracted text (usually whitespace) to its field, separating any new text from the existing text using a newline character. The function itself is very boring. The interesting part was locating where that function needed to be called from and using it properly.
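
As the function's behavior is described but not shown above, here is a minimal sketch of what `add_leading_spaces` might look like; the `leading_spaces` field name comes from a later code sample in this article, and everything else is my assumption:

```python
class BlockQuoteMarkdownToken:
    """Sketch of the Markdown token, reduced to the new behavior."""

    def __init__(self):
        self.leading_spaces = None

    def add_leading_spaces(self, leading_spaces_to_add):
        # Append the newly removed text, using a newline character to
        # separate it from any previously recorded text.
        if self.leading_spaces is None:
            self.leading_spaces = leading_spaces_to_add
        else:
            self.leading_spaces += "\n" + leading_spaces_to_add

token = BlockQuoteMarkdownToken()
token.add_leading_spaces("> ")
token.add_leading_spaces("> ")
print(repr(token.leading_spaces))  # '> \n> '
```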
The First One Is Almost Always Easy
The first location was obvious: within the `__handle_block_quote_section` function of the `BlockQuoteProcessor` class. This is where the bulk of the processing of Block Quote elements and their effects goes through. In there is an easy-to-find block of code that records the number of characters removed and resets the string to exclude those characters:
```python
line_to_parse = line_to_parse[start_index:]
removed_chars_at_start = start_index
```
That block of code was slightly modified to record that removed text and place it into its own variable:
```python
removed_text = (
    line_to_parse[position_marker.index_number : start_index]
)
LOGGER.debug(...)
line_to_parse = line_to_parse[start_index:]
removed_chars_at_start = start_index
```
With the removed text in hand, the code just needed to know which token to associate the removed text with. As the code processes Block Quote elements, it was reasonable to assume that the most relevant Block Quote stack token is the last one on the stack. That stack token was easily found with a simple for loop:
```python
found_stack_token = None
for stack_index in range(len(parser_state.token_stack) - 1, -1, -1):
    ...
    if parser_state.token_stack[stack_index].is_block_quote:
        found_stack_token = parser_state.token_stack[stack_index]
        break
```
With the extracted text and the ‘top’ stack token, the only thing that was left to do was:
```python
found_stack_token.matching_markdown_token.add_leading_spaces(removed_text)
```
Almost right away, I was off to a good start on extracting the leading spaces for the Block Quote token and storing them.
Locating the Next Issue
After those changes were in place, I ran the scenario tests again, seeing almost every test that has a Block Quote token fail. My heart dropped. But after a second, I realized that I wanted that to happen. As I wanted to make sure each Block Quote token was verified, I added the leading space data to the end of the serialized string for the `BlockQuoteMarkdownToken` class, separated from the rest with the `:` character. If things worked properly, that meant every Block Quote token would, at the very least, now include an extra `:` character. Every serialized Block Quote token was now “wrong”, so every Block Quote scenario test should fail. Good!
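
To illustrate what I was expecting to see, here is a hypothetical before-and-after of a serialized token; the surrounding format is my own guess, as only the extra `:` field is described above:

```python
# Hypothetical serialized forms of a Block Quote token (format assumed):
serialized_before = "[block-quote(1,1):]"
serialized_after = "[block-quote(1,1)::> \\n> ]"  # extra ':' plus captured leading text
```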
Working through each case was decently fast, with a solid methodical process in place: look at the results, find the next failing test, and manually determine what change needed to be made to the Block Quote token. After making that change to the test data, I would then re-run that specific scenario test, and check for token differences.
It was by using this process that I found the next issue: missing leading whitespace. In some cases, the leading text that was extracted was preceded by one or more whitespaces. Those whitespaces were stuffed into the `extracted_whitespace` variable and then ignored after that. The resolution to that issue was simple. Instead of only adding the leading space to the `removed_text` variable, the contents of that `extracted_whitespace` variable needed to be added as well, as such:
```python
removed_text = (
    extracted_whitespace
    + line_to_parse[position_marker.index_number : start_index]
)
```
Running the tests again, a new block of tests started passing.
And Next… Blank Lines
As more and more of the Block Quote scenario tests were changed and started passing, there were a handful of tests that were failing that I left for later. When I got to the end of those tests, I went back and started to look at those failing tests one-by-one.
The first group of failures that I examined were ones in which there was a Blank Line element contained within a Block Quote. A good example of this is example 222:
```markdown
> foo
>
> bar
```
In these cases, the parsing was correct, but the newly added `add_leading_spaces` function was not being called. The outcome was that the number of newlines contained within the owning Block Quote token did not match the number of lines within the Block Quote element itself. To ensure that those two values lined up, the `add_leading_spaces` function was called with an empty string, thereby evening up those values.
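
Reusing the token sketch from earlier, the effect of that fix on example 222 would look something like this; the per-line comments are my reading of the description above:

```python
token = BlockQuoteMarkdownToken()
token.add_leading_spaces("> ")  # for the "> foo" line
token.add_leading_spaces("")    # for the blank line: nothing recorded for it
token.add_leading_spaces("> ")  # for the "> bar" line

# One leading-space entry now exists for each line of the element.
print(token.leading_spaces.split("\n"))  # ['> ', '', '> ']
```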
Running the scenario tests again, all the tests explicitly for Block Quotes were passing. I then ran the scenario tests once more, checking to make sure that the tests with Block Quotes and Blank Lines passed. While the new changes were now passing, there were still a handful of tests to work on.
And Finally, HTML Blocks
Having run the scenario tests again, the only tests that were not passing were scenario tests that included Block Quotes and HTML blocks. Doing some research into the issue, I quickly found that it looked like the same issue as with the Blank Line elements, just with HTML blocks. This issue was uncovered through one of my additional tests, “cov_2”, used to increase coverage on the various forms of HTML start and end tags:
```markdown
</hrx
>
</x-table>
```
Like the previous issue with the Blank Line tokens, the number of newlines was not lining up between the two sources. Once again, calling the `add_leading_spaces` function with an empty string in these cases solved the issue.
Closing Out the Token Changes
After those changes had been completed, it was obvious that all the scenario tests were passing. Just to be sure, I manually went through each of the scenarios and verified the newly changed tokens. As I have said before, maybe I am paranoid, but if I was depending on those results, I figured an extra pass or two would not hurt.
It just felt good to get these changes completed. I had a high degree of confidence that I was able to find most of these issues, but an even higher degree of confidence that the remaining work would flush out anything that I missed. It was with those positive thoughts in my head that I started working on the consistency checks.
Validating the Line Numbers and Column Numbers
With that confidence, I started working on the consistency checks for line numbers and column numbers. The first step in doing this was removing a couple of lines of code that prevented the consistency checks from firing if one of the tokens was a Block Quote token. It was useful when I did not have anything in place, but now it would just get in the way.
After that change, tests predictably started failing, but I was ready for them. From the work that I had just completed, I figured that I would have to follow a similar path in implementing the consistency check. Therefore, the first thing I did to help with the consistency check was to reset the `leading_text_index` field of the `BlockQuoteMarkdownToken` instance to 0 when it is encountered. This made sure that the tracking of the leading spaces would always start with the first entry.
With that done, the code that was needed to take advantage of it was easy to write. When a token exists on a new line, its indent must be determined by tracking the size of the removed text and adding that measurement to the column number. Based on the previous work, this became trivially easy:
```python
split_leading_spaces = top_block.leading_spaces.split("\n")
init_ws += len(split_leading_spaces[top_block.leading_text_index])
```
Basically, grab the top Block Quote token, split its leading spaces into lines, and use the `leading_text_index` value of that topmost token to grab the right leading space.
From there, the new work mirrored the work that I did in preparing the tokens. When a Blank Line token is encountered, that index is incremented. For each line of text within an HTML block, that index is incremented. In addition, the `leading_text_index` field needed to be tweaked at the end of HTML blocks, Atx Heading blocks, and Paragraph blocks, which was something I figured might come up. Just some simple cases where a bit of extra finessing was needed.
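
Reusing the earlier token sketch once more, the bookkeeping described above amounts to something like the following; the loop and the line classifications are my own framing of it, not the project's code:

```python
token = BlockQuoteMarkdownToken()
token.add_leading_spaces("> ")
token.add_leading_spaces("")
token.add_leading_spaces("> ")
token.leading_text_index = 0  # reset when the Block Quote token is encountered

split_leading_spaces = token.leading_spaces.split("\n")
for line_kind in ["text", "blank", "text"]:
    # Use the current entry to compute the indent for this line...
    indent = len(split_leading_spaces[token.leading_text_index])
    print(line_kind, indent)
    # ...and advance the index so the next line gets the next entry.
    token.leading_text_index += 1
```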
To be clear, I was not 100% sure that this would happen, but I was hoping that it would. To me, it made sense that if I needed to add code to track the leading spaces that were removed, any consistency check would need to follow that same path to verify the information. And as for the extra tweak for the end tokens, I was confident that it was a solid change, and I was not worried about that deviation from the previous work.
Writing the Block Quote Markdown Transformations
Given the work I had just completed to get to this point, it was obvious to me that I was going to have to do a lot of fiddling with spaces to get the transformations correct. To make that job a bit easier, I decided for the moment to eliminate any Markdown documents which contained both Block Quote elements and List elements. I would get back to these quickly, but at that moment, it was just too much.
The processing of the start and end of the Block Quote elements was simple, largely mimicking the work I did for the List elements. When a Block Quote token was processed, it created a copy of itself and added that copy to the top of the `container_token_stack` list. From there, the leading spaces were retrieved from their storage in the token and added to the sequence to be emitted. The end of the Block Quote element was even easier, returning an empty string after removing the top token off the `container_token_stack` list. The work I had previously done on setting up the leading spaces was really paying off.
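
A rough sketch of that start and end handling follows; the handler names are my own invention, with only `container_token_stack` and the leading spaces storage coming from the description above:

```python
import copy

def rehydrate_block_quote_start(current_token, container_token_stack):
    # Work on a copy so the rehydration bookkeeping does not disturb
    # the original token.
    new_instance = copy.deepcopy(current_token)
    container_token_stack.append(new_instance)
    # Emit the first recorded leading space to start the element.
    return new_instance.leading_spaces.split("\n")[0]

def rehydrate_block_quote_end(container_token_stack):
    # Nothing to emit; just retire the matching start token.
    del container_token_stack[-1]
    return ""
```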
With the work to recognize and process the actual tokens taken care of, the main task ahead of me was to add and populate the `__perform_container_post_processing_block_quote` function. Like how the List Block tokens were handled in the `__perform_container_post_processing_lists` function, this new function was used to handle the newline processing for text enclosed within Block Quote elements. After having completed all this other work, that work was relatively simple to do. With all the individual token processing already performed, this function just needed to focus on detecting newline characters. For each of these characters encountered, the top Block Quote token would be used to determine the current `leading_text_index` to start with. With each newline encountered, the current split value would be used, incrementing the `leading_text_index` value afterwards. This pretty much worked flawlessly out of the box.
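
Under the same naming assumptions as the previous sketches, that newline processing could be reduced to something like this:

```python
def reapply_leading_spaces(text, top_block_quote_token):
    # For each newline in the emitted text, prefix the line that follows
    # it with the next recorded leading space, advancing the index as we go.
    split_leading_spaces = top_block_quote_token.leading_spaces.split("\n")
    rebuilt_lines = text.split("\n")
    for line_index in range(1, len(rebuilt_lines)):
        rebuilt_lines[line_index] = (
            split_leading_spaces[top_block_quote_token.leading_text_index]
            + rebuilt_lines[line_index]
        )
        top_block_quote_token.leading_text_index += 1
    return "\n".join(rebuilt_lines)
```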
As I Was Cleaning Up
An interesting issue came up as I was generating these transformations. For some reason, I had missed some cases involving Block Quote elements that contained Fenced Code Block elements. With the Markdown transformer now rehydrating the Markdown, it was obvious that things were missing. It did not take me long to figure out that the Fenced Code Blocks were missing their Block Quote leading characters. This was found for example 98:
````markdown
> ```
> aaa
bbb
````
When the transformer tried to rehydrate the document, it rehydrated that Markdown text without the leading `>` sequence before the text `aaa`. Having just gone through that research for other elements, I was quick to spot it and realize where the issue was. A couple of quick changes, and that test was passing!
Along the Way
And to hammer home the point that this consistency checking is good, it found some issues along the way.
The first one was an interesting issue where the starting whitespace for one transformation was missing. And it was such a weird case. It was an Atx Heading element that contained a Shortcut Link element as its only text, separated from the Atx Heading character (`#`) by the mandatory single space. That single space is really important. Two spaces it was okay with, but the one space, nope! Due to the way that it was parsed, that space was missing and not accounted for. The fix for this was to add that starting whitespace as a new token containing the removed text, but with a twist. That twist was to use the `create_replace_with_nothing_marker` function to add that text as itself for the Markdown transformer, but keep it as removed for the HTML transformer. With both transformers appeased, it was on to the next issue.
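
Before moving on, here is a conceptual sketch of that "replace with nothing" idea; the `create_replace_with_nothing_marker` function's actual signature is not shown here, so the class and names below are purely my illustration:

```python
class ReplacedText:
    # One piece of text, two renderings: the Markdown transformer emits
    # the original text, while the HTML transformer emits the replacement.
    def __init__(self, original_text, replacement_text):
        self.original_text = original_text
        self.replacement_text = replacement_text

# The removed starting whitespace is present when rehydrating Markdown,
# but stays removed when generating HTML.
leading_whitespace = ReplacedText(" ", "")
```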
The second issue that I uncovered was that, in rare cases, the leading spaces for certain lines were missing. After doing some digging, it appeared that this only affected lines that did not have any Block Quote prefix characters removed from them. So, after adding a new `__adjust_paragraph_for_block_quotes` function to the `LeafBlockProcessor` class and wiring it into the `parse_paragraph` function, the debug output confirmed that it was only those few cases where this was an issue.
And Of Course, Some Cleanup
It would not be a proper week of coding if I did not do some cleanup. In this case, the cleanup was simple: relocating the line/column number logic to its own module. One of the things that made the Markdown transformations easy to deal with is that those transformations are in their own `TransformToMarkdown` class. That worked well. The line/column number checks were in with some other validation code in the `utils.py` module. That worked… well… okay?
When I started with the consistency checks, they were new, and keeping all that code in the `utils.py` module made sense. But as the amount of code in the module grew, I never took the time to do some refactoring. As such, the module developed two responsibilities. The first was to be the location for all the `assert_*` functions and the `write_temporary_configuration` function. Those are all functions that are called directly from the tests, mostly in the final Assert stage of the test. The second was to house the logic for the line/column number consistency checks.
It just seemed logical, now that that part of the code was stable and well-tested, to take that second responsibility and put it into its own module. I created the `verify_line_and_column_numbers.py` module and started moving the functions into it, with the `verify_line_and_column_numbers` function being the main entry point for the module. It just seemed cleaner and more compact. One module, one responsibility.
What Was My Experience So Far?
The dominant part of my experience at the current moment is one of humility and patience. Yes, it is Tuesday night. And after a little over 3 hours of extra work, I have finally recovered the article to where it was at this time last night. I do feel an impulse to be hard on myself about this delay, but I also am working hard to remember to be patient with myself. This is one time where I am missing a deadline, and it probably will not be the last.
The humility that I am trying to practice is in understanding that I cannot do everything all the time. I know that sounds like it should be an obvious thing to know, but I think we all forget it from time to time. I was so focused on making sure that I met my deadline that I neglected to have a conversation with myself on whether that was the right choice. But it was not a large rabbit hole, just a small one that I was able to easily recover from.
Focusing more on the actual work that I accomplished, I was buoyed. I had long been worried about how hard it would be to implement the consistency checks for Block Quote tokens. Having completed that work, I am now trying to figure out why I was worried. If I had to guess, I would say it was because of what it took to complete that work for List tokens. That work was very nitpicky because it contained a lot of “empty” whitespace, if I am remembering it clearly. From the effort that it took to deal with that, I can see how I might have thought it would take the same effort for Block Quote tokens.
But it did not. The work I had done in that area on List tokens forced me to get things set up to make List token processing easier, which I believe Block Quotes benefited from. Regardless, I was glad that I could close the books on the Block Quote tokens and their consistency checks.
What is Next?
With all the block token consistency checks now taken care of, there was a bit of clean up to do with determining the height of each block token. While I initially thought that it would be easy, it did not turn out that way.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.