Summary

In my last article, I started to work on the token transformer, a transformer specifically used to verify the PyMarkdown tokens by rehydrating, or reconstituting, the original Markdown document from them. In this article, I talk about how I worked through and resolved the items that I added to the issues list as a result of that work.

Introduction

Normally, I use this space to explain some of the reasoning behind the work which formed the basis for the article. Without that bit of preamble, I feel that I am just dropping the reader off in an unfamiliar location without a compass, a map, or any indication of where they are. While that remains true for this article, the reasoning for this article is so simple that I feel that I do not need a lot of words to explain it.

Last week, I added 7 new items to my issues list, with 1 other issue pending from the week before. Plain and simple, I just wanted those issues off the list before I continued with the project’s Markdown transformer.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 16 Jul 2020 and 18 Jul 2020.

Things Continue to Be Busy

While there are often weeks where I wish something interesting would happen, those types of weeks are not showing up in my life this summer. Instead, I am having the type of summer where I wish I had less to do, giving myself some extra time to relax and recharge. Thankfully, I have a spouse who is very supportive of my hobbies and writing, giving me time in the evenings and on the weekends to make my project and this blog happen. If you check the Git repository for this project, the times when I have been able to eke out some time to work should be obvious.

I am hoping that, as the summer continues, more “free” time comes my way.

Nothing to Worry About

Looking at the issues list and the items that I added during my last session, I did not believe there were any serious issues to address. Sure, there were some questions that I wanted to follow up on, but I did not feel that any of them indicated a serious issue. Similarly, there were some possible bugs that I had noticed, but I did not feel that any of them were severe. In the worst case, I figured that each of these issues would take a couple of hours’ worth of effort or less to solve. Nothing I could not handle.

But adding those items as I was working on the consistency checks was not something that I liked doing. After all, if I was checking the consistency of my work and there were problems with those checks, did that mean the checks were faulty? It just made sense to me to address those questions right away, ensuring that my confidence in the consistency checks remained high. Besides, if my guesses were right about the severity of the items, I should be able to address them quickly.

And with that mindset in place, I started working on removing those items from the issues list.

Closing Link Reference Definitions at the End of the Document

Link Reference Definitions are interesting in that they break the mold created by the other Markdown tokens. When these elements are parsed from their original Markdown document into Markdown tokens, they do not add any information to the output HTML document. For the elements themselves to be useful, a separate link element is required that references them. The absence of such a distinct link element is the basis for example 176 and example 188. Apparently, this issue was so important that the specification includes it twice, with the exact same example text for both examples:

[foo]: /url

These examples deal with the other unique part of the Link Reference Definition: it is the only multiline element in the base GFM specification. This means that the next line may need to be parsed to determine if the Link Reference Definition on the current line has ended. In this specific case, the link label ([foo]) and link destination (/url) have both been provided, but there is the possibility of the link title occurring on the next line. So, when the end of the document was reached, for whatever reason, the Link Reference Definition was being left partially open, resulting in the token not being emitted. But why?

Debugging and Resolving the Issue

After adding some debug statements and doing some research with the scenario tests, the cause of this issue quickly popped out at me. At first, I thought it was an issue with the __close_open_blocks function, as it was a likely candidate. But a quick change to example 176 to add a couple of blank lines after the Link Reference Definition put an end to that theory, as the modified case worked without any problems. Doing a bit more digging, the problem seemed to be in the handling of the return value from the __close_open_blocks function. At that time, when the __close_open_blocks function was called from the main __parse_blocks_pass function to close the document, the return value containing any immediately closed tokens was discarded. That definitely did not seem helpful.

To address this, instead of using the _ variable to capture and ignore the list returned from the __close_open_blocks function, I replaced it with the variable tokens_from_line to capture that value. Making sure that the list was not being extended with itself (which caused failures in almost every scenario test), the lines:

        if tokens_from_line and not self.tokenized_document:
            self.tokenized_document.extend(tokens_from_line)

were added to extend the document list with the previously missing tokens. A quick run of all the scenario tests to verify that it was fixed, and this issue was removed from the issues list!
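
To make that fix concrete, here is a minimal, self-contained sketch of the bug and its resolution. The DocumentParser class and its token strings are hypothetical stand-ins for the real __parse_blocks_pass machinery, not PyMarkdown's actual code:

    class DocumentParser:
        def __init__(self):
            self.tokenized_document = []
            self.open_blocks = ["document", "link-reference-definition"]

        def close_open_blocks(self):
            # Closing the still-open blocks emits their end tokens.
            closed_tokens = [f"[end-{name}]" for name in reversed(self.open_blocks)]
            self.open_blocks.clear()
            return closed_tokens

        def finish_document(self):
            # Before the fix, this return value was captured as "_" and
            # discarded, so a partially open Link Reference Definition never
            # emitted its token.
            tokens_from_line = self.close_open_blocks()
            # The fix: keep the tokens from the final close, guarding against
            # extending the token list with itself.
            if tokens_from_line and not self.tokenized_document:
                self.tokenized_document.extend(tokens_from_line)

    parser = DocumentParser()
    parser.finish_document()
    print(parser.tokenized_document)
    # ['[end-link-reference-definition]', '[end-document]']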

Getting Rid of Extra Blank Line Tokens

While example 559 and example 560 are the two main recorded issues, there were a total of 6 scenario tests that had this same issue. With a Link Reference Definition (or possible Link Reference Definition) being started at the end of the document, an extra Blank Line token was being inserted before the document was closed.

Basically, given the source Markdown document for example 559:

[foo]: /url

or a modification of example 188:

[foo]

an extra Blank Line token appeared in the token array without fail.

Debugging and Resolving the Issue

The cause of this issue was a bit more subtle than the others, so it took me a bit longer to figure it out. Most of my initial time on this issue was spent trying to figure out why this issue happened with Link tokens and Link Reference Definition tokens, not just Link Reference Definition tokens. It was not until I took a look at the log output that I realized they were the same issue.

While the text [foo] can be used as a shortcut link, that same text can also be used as a link label for a Link Reference Definition. As the Link Reference Definitions are processed in the initial phase and links are processed in the final phase, that text is considered part of a Link Reference Definition until it can be proven otherwise. In this case, that consideration was dropped quickly, as the link label is not followed by the : character. However, it still underwent the initial consideration before being dropped and requeued for further processing.

With this new insight, I went through the logs for the __parse_blocks_pass function again. I quickly noticed that in these cases, there was a single blank line getting requeued. Thinking about it for a while, this did make sense. At the end of the document, when a partial Link Reference Definition is closed, an empty string ("") is passed into the Link Reference Definition functions to make sure the Link Reference Definition is terminated properly. As a blank line in a normal document naturally closes the element, the passing of the empty string into those functions has proven to be a smart and efficient solution. However, following the flow of data in the log statements, it became obvious that these empty strings were surfacing to the __parse_blocks_pass function where that blank line was eventually stored as an entry in the lines_to_requeue variable. Once placed in that variable, it was requeued and the extra blank line appeared in the array of tokens.

From this information, I figured out that there were two parts to solving this problem. The first was to add this code near the start of the closing code in the __parse_blocks_pass function:

                was_link_definition_started_before_close = False
                if self.stack[-1].was_link_definition_started:
                    was_link_definition_started_before_close = True

As this problem only ever occurred with a Link Reference Definition in progress, it made sense to use that condition as a trigger for the fix. That change allowed me to add the code:

                if was_link_definition_started_before_close and not lines_to_requeue[0]:
                    del lines_to_requeue[0]
                    line_number -= 1

Basically, in the cases where an open Link Reference Definition is closed at the end of the document, if there is a blank line at the start of the lines to requeue, remove it. As a side effect of removing that line, it also becomes necessary to adjust the line number by 1 to ensure that the line number is pointing at the correct line.

My initial thought was that this code was a kludge, but in the end, I grew to be okay with it. As the mainline is where the requeued lines are managed, it started to grow on me that this was the correct place to address the issue. Fixing it at any other point seemed like I would just be passing a variable around through various functions to handle this one edge case. Handling it in the mainline was indeed clearer.

Did Example 575 Look Right?

Sometimes, as with example 575, I look at an example or its scenario test, and I get a feeling that it does not look quite right. If I am in the middle of something else, which I usually am, I note it down in the issues list to be worked on later. When I find the time to start working on it, I almost always begin by looking around for other supporting cases or documentation to support any possible answers to the asked question.

This issue was no different from the rest. To me:

[foo]()

[foo]: /url1

just looked weird. Should the shortcut link not take precedence? The following example, example 576:

[foo](not a link)

[foo]: /url1

produced the following HTML:

<p><a href="/url1">foo</a>(not a link)</p>

If that was the case for example 576’s output, in the HTML output for example 575:

<p><a href="">foo</a></p>

where were the parentheses?

Debugging and Resolving the Issue

It took a bit of looking around, but I finally found some good supporting documentation. In the case of example 576, the text not a link does fulfil the link destination requirements of an inline link with not, but then fails the link title requirements because of missing enclosing characters, such as ". In a subsequent processing pass through the tokens, the inline processor looks for a shortcut link, which it does find with [foo], ignoring the text (not a link). This leaves that text as normal text to be dealt with outside of the Link element.

The case for example 575 is slightly different. Based on example 495, this Markdown text:

[link]()

is a valid inline link, producing the HTML output:

<p><a href="">link</a></p>

This almost exactly matches the desired output for example 575. In the previous case, the text between the parentheses, (not a link), was determined to be invalid, and hence was not processed as a valid inline link. However, in this case, it is perfectly valid for an inline link to have no text within its parentheses, as example 495 demonstrates. As that text was a valid part of an inline link element, the entire text was consumed, leaving nothing else for processing to deal with later. With nothing outside of the inline link left for processing, nothing gets added to the surrounding paragraph as “leftover” text.

Based on that research, the parsing of the above Markdown into the above HTML is 100% correct and explainable. Did it initially look weird? Yes. But was it correct? Yes again. Question resolved and removed from the issues list.
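
For anyone who wants to see that precedence rule in motion, the following toy parser illustrates it. The parse_link function and its regular expressions are my own simplification for these two examples, not PyMarkdown's inline processing:

    import re

    DEFINITIONS = {"foo": "/url1"}  # link reference definitions: label -> URL

    def parse_link(text):
        # Try the inline form first: [label](destination).
        inline = re.match(r"\[([^\]]*)\]\(([^()\s]*)\)", text)
        if inline:
            # "[foo]()" is a valid inline link, so it consumes everything.
            label, destination = inline.groups()
            return f'<a href="{destination}">{label}</a>' + text[inline.end():]
        # "[foo](not a link)" fails the inline parse, so only "[foo]" matches
        # as a shortcut link, and "(not a link)" is left over as plain text.
        shortcut = re.match(r"\[([^\]]*)\]", text)
        if shortcut and shortcut.group(1) in DEFINITIONS:
            label = shortcut.group(1)
            return f'<a href="{DEFINITIONS[label]}">{label}</a>' + text[shortcut.end():]
        return text

    print(parse_link("[foo]()"))            # <a href="">foo</a>
    print(parse_link("[foo](not a link)"))  # <a href="/url1">foo</a>(not a link)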

Where Was That Extra text Coming From?

While the 8 scenario tests in which this issue was found were passing, they were only passing because I had modified the acceptance data in the tests. In each case, I simply added text at the appropriate place in the token data to match what was expected. The decision to do that never really sat well with me, so I was happy that it was time to fix this issue.

For this issue, I started by looking at the easy (ok, easier) to digest example 524. The Markdown for this example is:

[link *foo **bar** `#`*](/uri)

and its HTML output is:

<p><a href="/uri">link <em>foo <strong>bar</strong> <code>#</code></em></a></p>

Double checking those two against each other, and against the GFM specification, everything looked fine. However, when I looked at the serialized token for the link:

[link:inline:/uri:::::link *foo **bar** text*]

I noticed that the word text was showing up in the output instead of the text #. At that time, as noted above, I simply added the token information as-is to the output, created an item in the issues list, and planned to deal with it later. Now it was later, and I really wanted to figure out where that text was coming from.

Debugging and Resolving the Issue

When I looked at the logs, I arrived at the answer within the first 5 minutes of looking at the problem. It turned out that I had forgotten to add some functionality to the __collect_text_from_blocks function. This function is used to determine the representative text from a list of inline blocks. That text is then used to create a proper record of what the actual text for the link label was before any processing, as well as a normalized version used for looking up link information. In the case of various inline tokens, instead of placing the original text for the inline element in the resultant string, the literal text text was added to that string.

Fixing this was easy, as most inline tokens use the token_text member variable to contain the correct text. Working through the scenario tests for each of the failing cases, I modified the __collect_text_from_blocks function to properly handle the replacement of the text for the various inline cases. I then manually went through all eight cases and verified that they were working properly, without the text text showing up in the token unless the string text was in the source Markdown.
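
As a rough illustration of the shape of that fix, here is a stripped-down version of the collection logic. The InlineToken class and collect_text_from_blocks function are simplified stand-ins for the project's real tokens and its __collect_text_from_blocks function:

    class InlineToken:
        def __init__(self, token_text):
            self.token_text = token_text

    def collect_text_from_blocks(inline_tokens):
        # Build the representative text for a link label from its inline
        # tokens. Before the fix, a literal "text" placeholder was appended
        # for some inline tokens; the fix uses each token's own token_text.
        collected = ""
        for token in inline_tokens:
            collected += token.token_text
        return collected

    tokens = [InlineToken("link "), InlineToken("foo "), InlineToken("#")]
    print(collect_text_from_blocks(tokens))  # link foo #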

It took a while, but I got it working cleanly, and for all inline tokens. It just felt good to get this “dirty” issue off the books!

Where Was That Extra not Coming From?

After working on the last issue, I thought that this issue would take the same level of effort and research to resolve. In the tokenization for the link in example 576:

link:shortcut:/url1::not:::foo

I wondered why the text not was in the token. Spotted when I was debugging the last issue, that not value just looked out of place.

Debugging and Resolving the Issue

Thankfully, this was an easy problem to find, debug, and fix. In the handle_link_types function of the link_helper.py module, the Markdown for the example:

[foo](not a link)

[foo]: /url1

was parsed to see if it could be a shortcut link and failed. However, before it failed, the link label was set to foo and the link destination was set to not. In the code for handle_link_types, this meant that the inline_link variable was set to the link destination, and the pre_inline_link variable was set to not, the original text of the link destination before processing. After the failed processing, the pre_inline_link variable was never cleared, but was passed into the constructor for the link token when it was created.

With that research performed, it was easy to determine that the statement:

            pre_inline_link = ""

was all that was required to resolve this issue quickly and cleanly.
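
To show why that one line mattered, here is a toy reconstruction of the serialization. The serialize_link_token function is illustrative only; its field layout mimics the token dump above, not the project's actual serializer:

    def serialize_link_token(link_type, uri, pre_uri, label):
        return f"link:{link_type}:{uri}::{pre_uri}:::{label}"

    # Before the fix: pre_inline_link kept the "not" from the failed inline parse.
    print(serialize_link_token("shortcut", "/url1", "not", "foo"))
    # link:shortcut:/url1::not:::foo

    # After the fix: pre_inline_link is cleared before the token is built.
    print(serialize_link_token("shortcut", "/url1", "", "foo"))
    # link:shortcut:/url1:::::foo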

What to Do With Backslashes in Code Spans?

Sometimes the questions I pose to myself have simple answers. In this case, the question was:

- backslash in code span? what should it look like?

Debugging and Resolving the Issue

Doing a quick check in the backslash section of the GFM specification, I did not see anything relevant. However, in the code spans section, example 348 provided me with the information that I needed.

The preface to the example states simply:

Note that backslash escapes do not work in code spans. All backslashes are treated literally:

This is backed up with example 348, whose Markdown:

`foo\`bar`

produces the following HTML:

<p><code>foo\</code>bar`</p>

The answer to my question? Leave them alone. They exist as they are, no parsing or interpretation needed. As far as code spans go, they are just another normal character.
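
Because every character in a code span is literal, even a toy scanner reproduces the example's output. The render_code_span function below is a minimal sketch that ignores the specification's matching-backtick-length rule, so it only handles simple cases like this one:

    def render_code_span(text):
        start = text.find("`")
        end = text.find("`", start + 1)
        if start == -1 or end == -1:
            return f"<p>{text}</p>"
        # The span's contents are copied verbatim; no backslash escaping applies.
        before, code, after = text[:start], text[start + 1:end], text[end + 1:]
        return f"<p>{before}<code>{code}</code>{after}</p>"

    print(render_code_span("`foo\\`bar`"))  # <p><code>foo\</code>bar`</p>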

Is It Worth It?

Having quickly resolved this issue, I paused before moving on to the next one. I started to wonder whether this practice of noting questions and issues in the issues list was worth it. In this case, this was a question whose answer I should have known. The fact that I did not know it right away was somewhat embarrassing.

Thinking about it some more, I decided to devote some serious time to understanding what I felt about this question, and to really think it through. Doing that, it was only 5 minutes later that I noticed my feelings had changed from embarrassment to pride and confidence.

It took a bit of time, but I mentally reviewed the other times I have used the issues list to document questions and possible bugs. Overwhelmingly, the issues that I logged either helped identify an issue or helped me solidify my understanding of the project and the GFM specification. Should I have known that fact about code spans and backslashes? Maybe? While I have a good grasp of the specification, it is also a large specification with a decent number of exceptional cases along the way.

But as I thought about that further, I realized that my sincerity about the quality of the PyMarkdown project was not absolute, but ever increasing. When I started the project, I had to look up every fact about every element, just to make sure I had it right. I knew that I was now at a level where I usually only had to look up the more esoteric and exceptional parts of the specification. If I was sincere about the quality of the project, I also needed to acknowledge the quality that my learning about the specification brought to the project.

Given that altered point of view, I was confident that it was worth it. Both as an early problem indicator and as a learning tool, the list was indeed worth it. And besides, folks, I know I am not perfect. That does not mean I am going to stop trying and stop learning. And to me, that is what made the issues list worth it!

What About Backslashes and Character Entities in Link Labels?

When I add a question to my issues list, it is always with the hope that I have forgotten something simple, and that some quick research will resolve the issue. In this case, the question was:

- backslash and char ent in link label? tests?

Or, for those that do not read native “Jack”:

- what about the backslash character and the character entities in link labels?
    - are they covered by tests?

Reading this now, I believe that this was an honest question to verify how backslashes and character entities are handled in link labels. Based on my current work solving these issues, I have a good idea of the answer, but at the time, I probably was not as sure. The answer? From previous issues, both backslash characters and character entities are allowed in most areas, including link labels.

But as always, it was time to back up my guess with research. And something told me it was not going to be the quick answer I was hoping for.

Researching the Issue

Starting my search at the beginning, I looked for a scenario test that contained a backslash in a link label. It was tricky, but eventually I found example 523 in the section on inline links:

[link \[bar](/uri)

Checking the HTML output, it is easy to see that the backslash effectively escapes the character following it, even in the link label:

<p><a href="/uri">link [bar</a></p>

Backslashes? Check. One part of the issue down, one to go.

To start researching the issue of character entities in link labels, I decided to start with inline Unicode characters. Based on my usage of Markdown, I generally use character entities to represent characters that I cannot find on my keyboard, usually Unicode characters. As such, starting with inline Unicode characters seemed like a good first step before tackling character entities. This decision was validated by my ease in finding not 1, but 2 examples in the GFM specification, example 175:

[ΑΓΩ]: /φου

[αγω]

and example 548:

[ẞ]

[SS]: /url

But try as I might, I could not find an example of a character entity reference within a link label. As I looked, I did find the text above example 327, which backed up my previous research by stating:

Entity and numeric character references are recognized in any context besides code spans or code blocks, including URLs, link titles, and fenced code block info strings:

So, while the GFM specification did not explicitly state that character entities in link labels were rejected, there were no actual examples of the positive case either. To me, that was not a showstopper… it was just an invitation to be creative!

Testing the Issue

The first thing I needed to do was to come up with some good examples to test against. To do this, I started by creating the scenario test test_character_references_328a as a variation of scenario test test_character_references_328. The sole difference between these two scenario tests was the specification of the link label as [f&ouml;&ouml;] instead of [foo].

In a similar fashion, I copied the test test_reference_links_558 into the new test test_reference_links_558a, altering the link label from [bar\\] to [bar&#x5C;]. If I understand the specification properly, both \\ and &#x5C; should produce the same \ in the HTML output, so to me it was a good test case. In addition, I added the test function test_reference_links_558b with the link label of [bar&beta;] to make sure named entities were addressed.
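
As a sketch of the shape these new tests took, here is roughly what test_reference_links_558b could look like. The assert_markdown_renders helper is a hypothetical stand-in for the project's scenario-test harness, and the expected HTML reflects my reading of the specification:

    def test_reference_links_558b():
        # A named character entity inside a link label and its definition.
        source_markdown = "[bar&beta;]\n\n[bar&beta;]: /url"
        expected_html = '<p><a href="/url">barβ</a></p>'
        # assert_markdown_renders is hypothetical; the real scenario tests use
        # the project's own harness and also verify the token stream.
        assert_markdown_renders(source_markdown, expected_html)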

Finally, before I started debugging and changing code, I created the following Markdown document from bits and pieces of those new tests:

[f&ouml;&ouml;](/f&ouml;&ouml; "f&ouml;&ouml;")
[bar\\](/url)
[bar&#x5C;](/url)

Verifying its content, I submitted it to my favorite test harness, BabelMark 2, getting the following results for the reference parser, commonmark.js 0.29.0:

<p><a href="/f%C3%B6%C3%B6" title="föö">föö</a>
<a href="/url">bar\</a>
<a href="/url">bar\</a></p>

Taking the specific HTML lines generated for each scenario test, I inserted that text into the owning scenario test, and verified that the proper HTML was being output. With that taken care of, it was on to cleaning up the consistency check!

Debugging and Resolving the Issue

After putting in that hard work to get the scenario tests set up, the resolution to this issue was anticlimactic. In the inline_helper.py module, the handle_character_reference function was already being called, but its return value was being ignored. That was easily addressed by replacing the line:

            current_string_unresolved = InlineHelper.append_text(
                current_string_unresolved, new_string_unresolved
            )

with the line:

            current_string_unresolved += new_string_unresolved

The other change that happened at the same time was that the new_string_unresolved variable was added directly to the current_string_unresolved variable instead of using the append_text function. As the append_text function is used to add text to a string while respecting any needed encoding, it was adding a step of extra encoding to the text where it was not needed. With those two changes in place, a couple of extra changes were needed to the handle_character_reference function to make sure it was returning the correct original string, instead of a mangled version.
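
The resolved-versus-unresolved distinction is easier to see with a toy version of the two parallel strings. The handle_character_reference function here is a hypothetical simplification of the one in inline_helper.py:

    import html

    def handle_character_reference(reference):
        resolved = html.unescape(reference)  # "&ouml;" becomes "ö"
        new_string_unresolved = reference    # the original text, kept un-mangled
        return resolved, new_string_unresolved

    current_string = ""
    current_string_unresolved = ""
    resolved, new_string_unresolved = handle_character_reference("&ouml;")
    current_string += resolved
    # The fix: append the original text directly, with no extra encoding pass.
    current_string_unresolved += new_string_unresolved
    print(current_string)             # ö
    print(current_string_unresolved)  # &ouml;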

Having made those changes, and a couple of test runs later, the problem was fixed and verified. While the research and setup of the new tests took a while, the actual resolution of the issue took less than 20 minutes to complete. With everything set up, the logs were very helpful, showing me where the extra transformation of the text was taking place and leading to a quick resolution. Yay, verbose log files!

Now on to the last issue for this week!

Indented Code Blocks and Disappearing Leading Spaces

I knew that the spaces were not disappearing; they just were not being recorded properly. But the phrase “disappearing leading spaces” sounded cool, so that is what I called it to myself. It was also the last thing that I felt I needed to fix before getting back to implementing the Markdown transformer. Having been discovered in the last article, this was the issue with the Markdown transformer check and example 81, where leading spaces were not always being recorded for later rehydration.

Doing a bit of research, I deduced that when a blank line is encountered in an indented code block, there are three possibilities: too much whitespace, too little whitespace, and just the right amount of whitespace.

The scenario test for example 82 was passing without any problems:

    chunk1
{space}{space}{space}{space}{space}{space}
      chunk2

which took care of the “too many” case. A quick change of this scenario test [1] to present 4 spaces on the blank line instead of 6 spaces resulted in the expected behavior. That took care of the “just the right amount” case. That left the “too little” case and the issues demonstrated by example 81.

Debugging and Resolving the Issue

Initially, my time on this issue was spent looking at the logs with a specific focus on how the tokens were created. I had theorized that most of the issue was going to be concerned with how I extracted and stored the whitespace in the text tokens… and I was wrong. Looking at the logs, the tokens looked fine when they exited the initial processing phase, containing the right number of space characters and the right number of newline characters.

Debugging further, after the coalesce_text_blocks function was invoked on those tokens, the leading whitespace was missing. Digging deeper, the first thing that I noticed was that the variable remove_leading_spaces was being populated with data from the text token’s extra_data field, not the extracted_whitespace field. Fixing that issue got me part of the way there. The next part was to alter the TextMarkdownToken code to allow the combine function to return any whitespace that it removed. This change was pivotal in making sure that any removed whitespace was stored somewhere. Finally, in the case where the coalescing takes place within an indented code block, that removed whitespace is added to the active Indented Code Block token.
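
Putting those three parts together, here is a simplified sketch of the changed flow. The TextMarkdownToken and IndentedCodeBlockToken classes below mirror the names in the article, but their bodies are paraphrased, not the project's actual code:

    class TextMarkdownToken:
        def __init__(self, token_text, extracted_whitespace=""):
            self.token_text = token_text
            self.extracted_whitespace = extracted_whitespace

        def combine(self, other_token):
            # Merge the other text token into this one, returning the leading
            # whitespace that was removed instead of silently dropping it.
            removed_whitespace = other_token.extracted_whitespace
            self.token_text += "\n" + other_token.token_text
            return removed_whitespace

    class IndentedCodeBlockToken:
        def __init__(self):
            self.leading_spaces = []

    first_token = TextMarkdownToken("chunk1")
    second_token = TextMarkdownToken("chunk2", extracted_whitespace="    ")
    code_block_token = IndentedCodeBlockToken()

    # When coalescing happens inside an indented code block, the removed
    # whitespace is stored on the active code block token for rehydration.
    removed = first_token.combine(second_token)
    code_block_token.leading_spaces.append(removed)
    print(repr(first_token.token_text))     # 'chunk1\nchunk2'
    print(code_block_token.leading_spaces)  # ['    ']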

Unlike the previous issue, this one took a bit of effort to arrive at the right solution. In hindsight, the path to the fix seems crystal clear, but it took getting past a number of failed solutions to arrive there. Each one of the steps outlined in the last paragraph had to be coded, debugged, and tested before going forward. And in most cases, those steps were not the first solution, but the third or fourth.

But still, it was worth it. Another item off the issues list, and the project’s quality was raised just a bit higher than before!

What Was My Experience So Far?

For any readers who have followed the progression of this project from a simple idea to its current form, you may have noticed that my confidence and happiness spike a bit after a round of refactoring. There is just something about cleaning up the code and resolving ambiguity that makes me smile and makes me feel better about the project. While I can quantitatively show that the number of possible issues and tasks is declining, I believe that I am more affected by the qualitative changes in the project. Removing a single large task from the issues list? Meh. Removing 6-7 small tasks or questions from that same list? Yeah!

But the interesting part of this block of work was not the work itself, but the writing of the article about the work. During the writing of this article, it suddenly dawned on me: for the first time during this project’s lifetime, I believed that I could see the end of the project. There was just this moment where I remember looking at the issues list, and the list looked smaller. That was followed by a moment of clarity that the consistency checks were making a significant impact on my perception of the project. Together, these two perceptions formed a new level of confidence in my mind.

That new level of confidence spoke to me. The scenario tests were effective. The consistency checks were working. And in my mind, the project was clearly past the half-way point on its journey to the finish line!

What Is Next?

With the current round of questions and issues addressed, it was time to get back to adding to the consistency checker. While I knew I had a busy week of personal stuff and professional stuff coming up, I also knew that I did not want to lose any momentum on the project. And making progress on the Markdown transformer was just the right kind of thing to help boost my spirits in a long week!


  1. As part of writing this article, I added the scenario test test_indented_code_blocks_082a, which is the result of the number of spaces on the blank line being reduced from 6 to 4.



