Markdown Linter - Delving Into the Issues - 19

Summary¶

In my last article, I documented how I was working hard to get to the end of the unprioritized items in my issues list and on to the prioritized parts of the list. This article details the work that I am performing to get to the end of the unprioritized list.

Introduction¶

I started the week with one goal on my mind: to clear out any unprioritized items so that I can start working on the prioritized items next week. I knew that goal might not be achievable, but that was okay for me. It was something concrete that I could work towards. Possibly achievable and concrete… sounded good to me!

Internally, I was fighting a different battle. I knew that I had some work to do to finish off the unprioritized items. I knew that I also had some clearly defined priorities that I had listed in the issues list. But in the process of prioritizing items, I missed the sections that occur after the Bugs - Tabs section. While not everything would be actionable right away, I did need to take the time to figure out when those sections will be handled. Not having a plan with the release getting closer was just getting to me.

At the very least, I know that I could wait with a concrete decision on that question until after all the unprioritized items were resolved, so off I went!

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 20 Jan 2021 and 24 Jan 2021.

The Big Push!¶

From the number of items in the issues list that were not prioritized, I knew that if I focused for between a couple of days and a week, I could get every unprioritized item resolved. It was time for a big push!

Yet More Fun With Link Reference Definitions¶

It seems like I cannot get away from issues involving Link Reference Definitions, but I was going to try by resolving this issue:

- dedup `append_text(`
- test_reference_links_extra_03h - 03hc

After consultation with the people on the CommonMark forums, I decided to double-down on my decision from last week to implement code that follows the CommonMark reference implementation. While I did get agreement that the Markdown from function test_reference_links_extra_03h should parse as a valid Link Reference Definition, the current version of CommonMark (0.29.2) does not support that. As such, I had to make sure that the PyMarkdown parser does not support that.

After adding a few extra test functions to make sure I had covered all the scenarios, I started debugging and found the issue almost immediately. In processing the line within the Link Reference Definition using the __is_link_reference_definition function, the function was always finding that the Link Reference Definition was valid, including when it ended with the Hard Line Break elements backslash \ character. Adding some extra code to the __is_link_reference_definition function solved that issue, only to show another issue.

In some rare cases, a line can end with an escaped backslash character, such as:

[bar\\
foo]: /uri

In this case, the final backslash is escaped, and is represented in the token by the Python string bar\\\b\\\nfoo. As Python escapes a backslash with another backslash, that string is effectively read as the text bar, an escaped backslash \\, a backspace special character \b, another escaped backslash \\, the newline character, and finally the text foo. As I have mentioned in previous articles regarding the project’s use of the backspace character, the first escaped backslash will be countered by the backspace character, effectively leaving a single backslash followed by a newline character as the most interesting character sequences in that string. While it can be a bit confusing to read (even to me sometimes!), that encoding allows a process to decide whether to look at that string in its original Markdown form or in its target HTML form.

While that interpretation of the sequence is correct, the parser needed to be changed to prevent that particular end-of-line sequence from being recognized as a Hard Line Break element. That sequence was not really a \ character at the end of the line, it was an escaped \ character at the end of the line. Having identified the issue, I quick worked to modify the handle_line_end function to verify that the characters before a backslash at the end of the line were not the sequence \\\b. After running the tests a couple of times and performing some extra verification steps to make sure I got it right, it was on to the next issue.

Link Reference Definitions and List Boundaries¶

With the goal of resolving this item:

- test_link_reference_definitions_extra_01c

I enabled test function test_link_reference_definitions_extra_01c and added a new test function test_link_reference_definitions_extra_01d. Then I started looking at the test examples, and I realized that I needed to think carefully about them:

- [foo]:
- /url

I knew it was a Link Reference Definition that should be broken, but I was not 100% sure why it was broken. Working through it on paper, I got the concrete answer that I needed. At the end of line 1, the parser has a single list started, with a partial Link Reference Definition active. However, when the second line starts with a character that is not part of a valid Link Reference Definition, the processing is rewound, and the first line is reinterpreted as plaintext. From there, it becomes a single list with two list items, each list item containing half of the “almost” Link Reference Definition.

With that research in hand, I started debugging and realized two things. The first was that the caller_can_handle_requeue argument on the close_open_blocks_fn function was not set to True. This meant that the Link Reference Definitions would never be closed by that function. When I addressed that issue, I found the second thing: turning that flag on and supporting the rewind or requeue from that point required a lot of argument passing. While I didn’t have a lot of time to address it now, I made sure that there was something in the issues list about addressing the issue of passing too many arguments along and moved on.

Link Reference Definitions and Block Quote Boundaries¶

Being in the same area as with Link Reference Definitions and List Boundaries, I figured that dealing with this item:

- test_link_reference_definitions_extra_02b

was a good use of my time. The research for this item almost followed the exact same path that my research had followed for List Blocks. The only difference was that they were working on Block Quote boundaries instead of List boundaries. Using the same manner of fixing this issue for Block Quotes as I used for List Blocks, this issue was quickly fixed. This fix also carried over to the next issue that I addressed:

- add new 2c that increases level of block quotes with LRD

As a matter of fact, I did not have to add any extra code to the parser above the code that I originally added for function test_link_reference_definitions_extra_02b.

But having seen that I missed on a number of these scenarios involving non-paragraph Leaf Block elements, I wanted to make sure that I had some coverage in there for the other Leaf Block elements, hence, the issue:

- verify that > to >> introduces a new level of block quotes

To address that issue, I added new scenario tests, test function test_block_quotes_229c to test function test_block_quotes_229j. I just started with Indented Code Blocks, proceeded to Fenced Code Blocks, ending up on HTML Code Blocks. With various variations in place, when I was done there were nine new scenario tests, with five of those tests being disabled. I had some work to do!

Before Signing Off For The Night¶

Before I shutdown my system for the night, I looked around and noticed that there was another item that was like the one I had just added tests for:

- links, 518b
  - 518b inside of list and/or block quote
  - now test_paragraph_extra_j0

As I had also forgotten to run Black on my last couple of commits, it was a good chance to see if I had even more work to do, or if I would get a break and have some good, positive test karma come my way. I added test function test_paragraph_extra_j0e and two variations, one with the same thing in a Block Quote element and the other with the same thing in a List Block element. The good news? It worked first time, with no test needing to be disabled. It was a good way to end that day!

Friday Night Blues¶

As Friday night rolled around, I found that I had some spare time to work on the PyMarkdown project that evening, but I was less than enthusiastic about it. It seemed that every time I worked on the project, I would resolve two or three items from the issues list, only to add another three or four items to that same list. It felt like I was spinning my wheels and getting nowhere. It took me looking at the specification and the existing scenario tests to realize that I was just hitting a lot of boundary conditions for Block Quote elements. At that moment it dawned on me: I was not adding hard-to-solve items to the issues list, I was adding little issues that were, in all honesty, way off the beaten path of Markdown. Not to sound too negative, but I was dealing with the nit-picky scenarios! That helped me to get my mind in the correct perspective!

It was with that renewed sense of purpose that I refocused to work on:

- test_block_quotes_229d - BQ+ICB

Working through the debug logs, I quickly came to an interesting conclusion. Due to paragraph continuation lines, most of the existing Block Quote tests would either maintain or increase their level of Block Quotes, but never decrease. If one of the Block Quote start characters was missing within a paragraph, the paragraph continuation rule would kick in the Block Quote would continue along without any disruptions.

But that was not so with other, non-Paragraph elements. Focusing on the item that I was currently working on, I changed the __ensure_stack_at_level function to properly determine when the stack needed to increase (stack_increase_needed) and when the stack needed to decrease (stack_decrease_needed). With that code in place, I added code to reduce the Block Quote count if needed, closing any open elements until the proper Block Quote count was reached.

And while a lot of issues I fixed in this time were only with the parser itself, this issue required changes in the Markdown transformer. It was not anything big, but to make sure that the Block Quote token’s leading_text_index field was properly maintained, the __rehydrate_block_quote function and the __rehydrate_block_quote_end function needed to properly increase that variable if needed. After a bit of fiddling and verification to make sure that everything was once again passing, it was off to relax before a long day on Saturday.

Saturday’s Slog¶

Knowing that I would have a big chunk of time on Saturday, I resolved to myself to push hard to see if I could get to the end of the unprioritized list by the end of the day. In my mind, I was not sure if I would be able to do it, but I figured a good push to resolve issues wouldn’t hurt either way. With that, and some loud music in the background, I hunkered down and got to work.

Block Quotes and HTML Blocks¶

Having had good success with Indented Code Blocks and Block Quotes the night before, I decided to just power ahead and deal with this item:

- test_block_quotes_229i and test_block_quotes_229j - BQ+HTML

Doing some testing against BabelMark, I soon found out that unlike the Indented Code Block element, HTML Code Blocks inside of a Block Quote element treated any extra Block Quote start characters as Code Block text. Looking over the GFM specification and thinking about it a bit, this made sense to me. The HTML Code Block has five different start sequences and five matching close sequences. Other than the Block Quote sequences that were in place when it started, unless it matched one of those close sequences, it would just treat any increases as text that was a part of the Code Block itself. But that wasn’t how the parser was interpreting it. It was trying to open a new Block Quote when it saw the extra Block Quote prefix character.

To address this, I had to modify the __count_block_quote_starts function to recognize that situation, and then respond to that with changes in the __get_nested_container_starts function to ignore a change in Block Quote starts. With those changes in place, enabling test function test_block_quotes_229j was trivial, modifying this piece of code from the function __ensure_stack_at_level from:

    stack_decrease_needed = top_token_on_stack.is_indented_code_block

to:

    stack_decrease_needed = (
        top_token_on_stack.is_indented_code_block
        or top_token_on_stack.is_html_block
    )

Running the tests, the scenario tests now passed, and it was on to the next item.

Two Code Blocks Down…¶

And one type of code block left to deal with:

- test_block_quotes_229g and test_block_quotes_229h - BQ+FCB

While I would love to say that fixing these tests just required one or two lines of code, that would be dishonest. Because the processing of the Fenced Code Block elements is sufficiently different from the other Code Blocks, the handling of it in the __handle_block_quote_section function was separate from the other token types. But while I had to repeat changes that were like the ones made for Indented Code Blocks and HTML Code Blocks, it was very beneficial to have them to use as templates. I effectively used them to shortcut my debugging processes, getting both tests working in what I would consider record time!

Just To Be Sure¶

I was confident that I had properly addressed all the issues that had been raised so far, but I was also concerned that I was being shortsighted. So, at that time, I added the item:

- multiple levels up and down over all block quote transitions

When I re-read that item before starting to work on it, I clearly understood that it was to question whether I had coded the Block Quote transitions for the “plus one” cases or the “plus any” cases. I realize that it may not seem like a large difference to others, but I was concerned that I had used an if statement and checked at one level instead of using a while statement, checking until multiple levels had been processed.

After adding 10 new tests that did multiple increases and multiple decreases within various Leaf Block elements, I held my breath and ran the scenario tests. Did I get all the cases the first time around? How many did I miss? Was the work to correct them going to be difficult or easy to complete? I just wanted to get this done.

Having worked myself up over a possible lengthy set of fixes, I was pleased to find out that only one change was required. In the __adjust_paragraph_for_block_quotes function, I had coded a check to look for a Block Quote end token that was preceded by a Fenced Code Block end token. While the purpose of the check remained the same, I needed to adjust it a bit for a case of multiple Block Quote transitions. That check went from:

    if (
        token_document[-1].is_block_quote_end
        and token_document[-2].is_fenced_code_block_end
    ):

to:

    number_of_block_quotes = 0
    end_index = len(token_document) - 1
    while token_document[end_index].is_block_quote_end:
        number_of_block_quotes += 1
        end_index -= 1
    if (
        number_of_block_quotes > 0
        and token_document[end_index].is_fenced_code_block_end
    ):

Other than that change, which took approximately 10 minutes to debug and implement, all the other tests passed without requiring any changes.

The Last Unprioritized Item¶

At this point, there was only one item left in the Uncategorized section:

- single level up or down with list

Like the work done in the last section, I believe this was a reminder to myself to not forget that List Blocks can also be encapsulated within a Block Quote block. So, with the learnings of the past two sections in mind, I put together a series of four new scenario tests for List Blocks within Block Quotes and ran the tests without thinking about it. They all worked on the first try, with no other changes required in the project! It was a good end to this series of work!

And… Done!¶

And it was with that commit that I finished working on the unprioritized items in the issues list. From here on, it was getting prioritized items out of the way for the release!

Starting On Prioritized Issues¶

Having spent a lot of my mental capital to get to this point in the project, I was explicitly looking for some issues that were low hanging fruits. While I knew that there were some bulkier items to deal with in the list, i wanted to try and clear some of the light issues away to get a clearer picture of what was left.

Some Simple Research¶

Intending to start with something simple, this item seemed like a good fit:

- where is `elif starting_whitespace:` used? why? better way to do it?

Thinking way back to when I added this, I remember wondering why I needed to do this in addition to the other code in this function, and what effect it has. To be honest, I have written a lot of code for the PyMarkdown project in the last year. While I try and be clear with everything I write, I dropped the ball on this one. Time to figure it out!

Digging into this issue, I searched for the text elif starting_whitespace: and replaced that string with elif False and starting_whitespace:. After running the scenario tests, I picked one of the failures at random, test function test_atx_headings_extra_42 and looked at the failure. As soon as I started looking at the test output, memories of this change came flooding back to me. Within an Atx Heading Element, there is at least one space character between the start character (#) and the text in the heading. The HTML output that is generated is fine, but when the Markdown transformer tries to reconstruct the original Markdown, that leading space is not represented in the tokens. To address that issue, in those cases where it is not otherwise present in the list of tokens, that code kicks in and adds a Text token with the right markers to resolve the issue.

Experimenting with three or four alternate solutions to the issue, only one of those solutions worked, and it was a lot more convoluted than the existing solution. If there is no text to create a Text token with that the parser can attach some extra whitespace to, then create a Text token specially to hold that extra whitespace. Basically, it was good that I thought I could do better, but I already had the best solution for the job!

Dealing With Multiply Defined Link Reference Definitions¶

Having thought about the item:

- Link_helper.py#86 - if link already registered, should warn?

at length, resolving it was easy. While I could add something specific to deal with that one case, it made more sense to just add a rule to deal with that case. After a quick check to verify that the LinkReferenceDefinitionMarkdownToken class has a did_add_definition field, I resolved this one as done.

Picking Another Easy One¶

Another easy item from the list was the item:

- look for cases where " " is used, and convert to whitespace helper

This refactor was so easy, I almost feel that it is a waste talking about it. Like the replacement of the \n character with ParseHelper.newline_character, this replacement was to replace the space character with ParserHelper.space_character to make it more visible. While I was doing that, I noticed that there were a few cases where I was trying to make those spaces visible using code like:

line_to_parse.replace(" ", "\\s")

To solve that issue, I cloned a copy of the make_value_visible function, calling it make_whitespace_visible, and moving that code, plus code for the other whitespace characters \t and \n into that function. After refactoring the other occurrences of that text to the use the new function, the work on that item was completed.

Wrapping Up For The Night¶

As I was winding down to spend some time with my family, I wanted a simple task to help me relax before stopping for the night. While it was not very productive, I inspected the code using PyCharm, resolving three very simple issues. Combined with a full run of my clean script that runs Black, the changes were minor but useful. Nothing spectacular was changed, but it left the code “just that much better” and allowed me to sign off for the night with a good feeling. I had reached my goal for the week, and I had also managed to clean some little things up as well. A good combination!

Sunday Morning Refactoring¶

After a long and productive Saturday, I decided to give myself a bit of a pass on the hard work and resolve some easy “off-the-books” items. Where possible, I have tried to adhere to good object-oriented programming practices, encapsulating data within an object that cleanly encapsulates the purpose behind the object. From my experience, the problem with doing this with any kind of language parser is that there are often a group of miscenalenous variables that do not cohesively fall under a single theme. Often, I find that they are just a loose group of variables that need to be maintained to get the job done. Having the same type of problem with the main processing phase of the PyMarkdown parser, I decided to just go ahead and group these variables under the ParserState class.

Over three commits, I managed to clarify the usage pattern for the four variables already in that object. This was done by adding a __ prefix to each of those variables to ensure that only private usage of that field was allowed. For external usage of those variables, I added a new getter property for each variable with the same name as the original variable. In this way, I guaranteed that existing code that did not modify those variables would remain as it was, while strengthening the usage pattern for those variables. The additional eight variables were moved into the ParserState class following a similar pattern. Instead of changing the name of the variable to include a __ prefix, the variable was added to the class with the prefix already added, with the provided getter function being added with the name of the variable minus the prefix.

With those changes in place, I started to run the tests, and encountered and handful of cases where the parser needed to change one of those variables during its processing. In each of those cases, I looked at the code and tried to determine the best way to set the information to the desired value while minimizing the exposure of that variable to the rest of the parser. In the case of the set_no_para_start_if_empty function, that exposure was just the name of the function, as the body of the function only sets the __no_para_start_if_empty variable to True. For the mark_start_information function, called when the container processor is invoked for a new line, six of the variables in the class are set to their initial values, keeping any changes in those variables within the class. In all cases, these were what I felt were the most minimal exposure that was needed for other functions and classes dependent on the ParserState class and its variables.

More Cleanup¶

Once those variables were placed within the ParserState class with their getters and setters, I started going through downstream functions. In each case, I looked for the name of one of the variables that I moved into the class and determined if it could be replaced by a simple reference to the parser_state variable that was used for the current instance of the ParserState class. When I could replace it, I simply took the function argument, say original_stack_depth, and replaced it with parser_state.original_stack_depth where it was used in the function. At that point, when I completed that task properly, the argument would show up as unused, and I would remove it from the function’s argument list, also removing it from the arguments passed to that function wherever it was called from. At that point, it was simply another case of lather-rinse-repeat until all non-class references were replaced with class references.

And to be honest, here was where having the solid heft of the scenario tests came in very handy. Instead of just praying that I had made the right changes, every couple of changes was followed with the execution of the complete set of scenario tests. If there was a failure, I would investigate and find out the reason why it failed, addressing that issue. If things were fine, I would stage the changes in Git, allowing me to progress forward with confidence that I could back out any change at any time. While this process was a long process, the confidence that it gave me that I was making the right changes was priceless. With over 99.5% code coverage, if I messed something up in the refactoring, I found out about it right away.

I was making the right changes to clean up the project code, and I was not negatively impacting the code. It was good to be able to do both with the security and confidence of a well-tested project.

What Was My Experience So Far?¶

Looking back at the work I did during the week, it was hard not to get jazzed by the fact that I set a goal to eliminate all non-prioritized items from the issues list. That was tampered with the question about what to do with the other sections that I had forgot about, but it was still a win. And in its own way, resolving the unprioritized section of the issues list helped me answer my own question. I would just prioritize them!

The remaining sections are cleanly divided into two groups: features and bugs. With the possible exception of front matter¹, while the other features would be nice to have, I didn’t need any of them for the initial release. The bugs that remained also divided nicely into two groups: Tabs and Rules. As a linter is nothing without some good rules to show its power, fixing the items in the Rules sections before release was non-negotiable.

The issues in the Tabs section were a different story. Making sure that the Tab support was spot on was going to be a sub-project, not something that I wanted to rush. Provided that I can come up with an “almost” interim solution for Tabs, I should be okay. Not great, but okay. The only reason I have confidence in saying that is because most people that I know shy away from tab character in their documents and source code, mostly due to questions on how they are interpreted. So, while I do have to address Tabs properly at some point, it doesn’t need to be until after the release.

Getting everything fixed and coming up with a good plan for the rest of the sections was just what I needed. Another concrete goal that I can see myself achieving. And yet another step closer to the release!

What is Next?¶

After 19 articles detailing how I was attacking each group of issues, I thought it would be useful to look back over that effort and determine how things went and lessons learned.

Since every article I write has a front matter section to it, I strongly feel that including that one feature into the release should be a priority. ↩

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.

Comments