Markdown Linter - Delving Into the Issues - 9

Summary

In my last article, I continued in my quest to reduce the size of the issues list. In this article, I split my time between adding to the scenario cases tables and dealing with items from the issues list.

Introduction

Not much of an introduction here, just my usual plodding forward. Having spent time in the last couple of weeks working on either the scenario cases tables or resolving items from the issues list, I tried this week to split my time evenly between those two tasks. Without further ado, on to the work!

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 03 Nov 2020 and 08 Nov 2020.

Dismissing an Easy Issue

Initially, looking at the following item:

- 553 with other in-lines?

I thought I would have some work to do. However, when I started looking at this item, it did not take longer than a couple of minutes before I was able to resolve this issue.

Along the way, there are times where I have good ideas on things to check, and then other times where I just have ideas. While I think I meant well with this item, it ended up falling into neither of those two buckets. Taking a look at the Markdown for example 553:

[bar][foo\\!]

[foo!]: /url

I believe I wanted to make sure that I tested other concepts to make sure the lookup worked properly. The most obvious of those concepts would usually be inline elements, so I think it might have made sense from that point of view. However, I had missed one little thing. As function test_reference_links_553 centers around subtle variations with the link reference, any inline element would be treated as plain text, without any interpretation.
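As a rough illustration of why the lookup never interprets the label, here is a minimal sketch of label matching as the GFM Specification describes it: strip surrounding whitespace, collapse internal whitespace, and apply a Unicode case fold. The function name `normalize_label` and the `definitions` dictionary are hypothetical and are not taken from the PyMarkdown code base. The label text is copied exactly as it is shown above.

```python
import re


def normalize_label(label: str) -> str:
    """Normalize the text between the brackets of a link label (hypothetical helper).

    The label is compared as raw text: no inline parsing takes place.
    """
    collapsed = re.sub(r"\s+", " ", label.strip())
    return collapsed.casefold()


# The definition `[foo!]: /url` registers the normalized label "foo!".
definitions = {normalize_label("foo!"): "/url"}

# The reference label still contains its backslashes, so the raw text differs.
print(normalize_label(r"foo\\!") in definitions)

# Case and surrounding whitespace are the only things that get smoothed over.
print(normalize_label("  FOO!  ") in definitions)
```

Because the comparison happens on the raw label text, anything that looks like an inline element inside the label is just more characters to compare.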

Based on that quick research and the fact that I already had tests for inline elements in the link label, I resolved the item without any changes. I may have had something else in mind when I added that issue to the list, but I could not find a reasonable way to honor it. It was good to check, even though it left nothing to enhance in the project.

Empty Link Labels

I spotted two issues, added a long time ago, that I knew I could resolve quickly:

- link ref def with empty link label, like 560?
- full reference link with empty link label, like 560?

Both items were obviously added at a time when my knowledge of the GFM Specification was less complete than it is now, as both indicated confusion as to why an empty link label wasn’t valid.

With the experience gained since those items were added, it was easy for me to reference the GFM Specification on Link Reference Definitions, select the link label reference in the first line of the first paragraph, and extract the following bit of text from the definition of a link label:

Between these brackets there must be at least one non-whitespace character.

While that one line escaped me early in the development of the project, I was now familiar enough with it to be able to locate it in 30 seconds or less. I acknowledge that it is a boundary case, and I can see why the specification writers added that text to deal with it. From my point of view, an empty link label is just an empty string that needs to be parsed. But I also understand that there is plenty of precedent for treating an empty string as having no value. I am not sure that is the way I would have gone, but I was happy to follow the specification on this one.
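To make that rule concrete, here is a minimal sketch of the check. The function name `is_valid_link_label` is hypothetical. The sketch only covers the non-whitespace requirement quoted above and the 999 character limit from the same part of the specification, and it deliberately ignores the rules about unescaped brackets.

```python
def is_valid_link_label(label_text: str) -> bool:
    """Check the text between the brackets of a link label (hypothetical helper)."""
    # The specification allows at most 999 characters between the brackets.
    if len(label_text) > 999:
        return False
    # There must be at least one non-whitespace character.
    return any(not character.isspace() for character in label_text)


print(is_valid_link_label(""))
print(is_valid_link_label("  \n "))
print(is_valid_link_label("foo"))
```

Under this rule, an empty label and a label made up only of whitespace are rejected in the same way, which is the behavior both issues list items were asking about.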

Bolstering Up the Scenario Cases

The more I used the new scenario-cases.md document, the more I was enjoying it and the confidence it brings to the project. While it is still early in the document’s life, I am starting to rely on that document at the same level that I rely on the GFM Specification examples. Basically, if a parser can properly handle either one of those groups of tests, it is a good thing. If it can properly handle both groups of tests, it is a wonderful thing.

As such, a certain amount of this week was spent beefing up that important document.

Moving Simple Inline Links

At the start of this block of work, one of the things that I wanted to do was to move the non-base test_inline_links_518 functions into the Series F group by moving them into the test_paragraph_extra_ group. While this was not a big move, it filled a hole that I had perceived in the Series F group tests. And since it was just moving the tests from one module to the other, the tests were already passing. That made the task seem to fly by.

Adding Links as The Last Element in the Document

Having just moved that small group of tests into the Series F group, I noticed that all the test cases in that group ended with a Text token, and not a Link element. As that affects what is checked at the end of a leaf block, I thought it was prudent to go through the Series F group and add a variation of each case that tested the base document without any elements after the Link element.

That was not a difficult task, but it was both tedious and lengthy. I went through each of the 16 base tests registered in the Series F group and created a new variant of that base test. Once created, I removed any trailing non-link characters from each test document, double checking that I had not disturbed the Link element itself. As usual, I verified the HTML output against Babelmark, then ran the tests to see if there were any issues.

When I ran the tests, I was greeted with the good news that the parser itself was working properly and the consistency checks only required minor changes. Those changes were in the __handle_last_token_end_link function, each of them small adjustments to handle the various parts of the Link token in its various forms, but nothing that wasn’t immediately resolvable.

Following Up with Image Elements

It should be no surprise that after completing the work documented in the previous two sections that I decided to follow that work up with ensuring parity for the Image elements in the Series F group. In total, 29 new scenarios were added to the group, mirroring the existing Link element tests.

Due to previous hard work and a bit of luck, there was only one change required in the __handle_last_token_image function. In the case where the last token is a full Image token, I just needed to add a single line to properly increase the inline_height variable by one for each newline in the text_from_blocks field of the Image token. While the verification phase of each test took a while, the testing phase of these additions went by very successfully and very quickly.
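The adjustment described above can be pictured with a small sketch. Only the names inline_height and text_from_blocks come from the description; the `ImageTokenSketch` class and the `height_after_image` function are hypothetical stand-ins, not the project's actual token class or the real __handle_last_token_image function.

```python
from dataclasses import dataclass


@dataclass
class ImageTokenSketch:
    """Hypothetical stand-in for an Image token; only one field is modelled."""

    text_from_blocks: str


def height_after_image(inline_height: int, token: ImageTokenSketch) -> int:
    """Add one line of height for every newline in the token's label text."""
    return inline_height + token.text_from_blocks.count("\n")


# A label that spans three lines contains two newlines.
token = ImageTokenSketch(text_from_blocks="first\nsecond\nthird")
print(height_after_image(0, token))
```

The point of the sketch is only that a label spanning several lines has to push the line count forward, otherwise every position check after the Image token would be off by the number of newlines inside it.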

Moving Scenario Tests into Their Own Modules

Over the course of the next three commits, I took on the immense chore of moving and renaming tests belonging to seven of the identified scenario case groups. Those seven groups were Series A to Series E, Series H and Series J. For each group, I created a new file, such as test_markdown_paragraph_series_a.py and moved tests in from their original modules, renaming them as I went.

As I renamed those functions, I started to come up with a solution for how to identify each test uniquely. What I quickly settled on was to start the test name with the series that it belonged to, followed by a descriptive name based on the contents of the test document. In this way, I could easily tell if I repeated a test within a given group by looking at the name of the function.

While this work was primarily copying and renaming scenario tests, it was exhausting. For each test, I needed to make sure that the name of the function matched the Markdown document contained within the test. Then I needed to take that Markdown document and run it through Babelmark to make sure the HTML output was correct. Repeated on over 100 scenario tests, it took a lot of time and a lot of patience to get correct. But in the end, it was satisfying to be able to see the groups come together, painting a cohesive picture of a group of passing tests along a given theme.

Better Tests for Link Reference Definitions

Switching back to resolving items from the issues list, the first thing that caught my eye was an issue dealing with Link Reference Definitions. Of all the leaf block elements that I have had to design and code for this project, the Link Reference Definition element was by far the most difficult to get right. It was no surprise to me to find the following item in the issues list:

- what if bad link definition discovered multiple lines down, how to back track?

When I added support for Link Reference Definitions back in April 2020, I felt that while the feature was implemented, there was always going to be a possibility of a gap in that implementation. Because of the unique multiline nature of this element, it is impossible to determine whether the element is valid without reading the next line. As such, I had to implement a “requeue” functionality to allow the parsing of a possible Link Reference Definition element to be rewound and attempted again as a different element. While that has worked well, the bulk of my concerns over this feature centered around whether that rewinding functionality dealt with all possible side effects, not just the most common set of them.

Given that history, I decided to add functions test_link_reference_definitions_166a and test_link_reference_definitions_166b to test two more cases where an element was only discovered to be invalid on a later line. In the case of function test_link_reference_definitions_166a, I made sure that the title portion started on the same line but was not properly terminated. This was to make sure that the entire element would be discarded, as there was no solution where the Link Reference Definition could be considered complete under any circumstances. When I added function test_link_reference_definitions_166b, I took the opposite approach, starting the title on the next line. As I started it on the next line, the Link Reference Definition could be completed, just without the title.

When I ran these two tests, it was no surprise to me that there was a failure. In looking at the tests, the failure was with function test_link_reference_definitions_166b, which failed due to an extra Blank Line token being generated before the rewound lines were reprocessed. It took me a bit of time to realize that I needed to add the line:

    force_ignore_first_as_lrd = len(lines_to_requeue) > 1

at the end part of the __stop_lrd_continuation function that dealt with continuations that were partially successful. I just had to try different combinations before figuring out what the correct one was before proceeding.
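To show the shape of the two situations, here is a heavily simplified sketch of a rewind decision. It is not the PyMarkdown implementation: the function names, the regular expression, and the restriction to double-quoted titles are all assumptions made for illustration. The function returns the definition it could salvage, if any, together with the lines that would have to be requeued and parsed again as something else.

```python
import re
from typing import List, Optional, Tuple

Definition = Tuple[str, str, str]  # (label, destination, title)

# Hypothetical, simplified start pattern: "[label]: destination" plus any remainder.
_LRD_START = re.compile(r"^\[([^\]]+)\]:\s*(\S+)\s*(.*)$")


def _read_title(candidate_lines: List[str]) -> Optional[Tuple[str, int]]:
    """Return (title, lines_used) if a double-quoted title closes, else None."""
    collected: List[str] = []
    for used, line in enumerate(candidate_lines, start=1):
        if not line.strip():
            return None  # a blank line ends the attempt
        collected.append(line.strip())
        text = "\n".join(collected)
        if len(text) >= 2 and text.endswith('"'):
            return text[1:-1], used
    return None


def try_parse_lrd(lines: List[str]) -> Tuple[Optional[Definition], List[str]]:
    """Return (definition or None, lines to requeue) for a candidate at lines[0]."""
    match = _LRD_START.match(lines[0])
    if not match or not match.group(1).strip():
        return None, list(lines)
    label, destination, rest = match.groups()
    if rest:
        # The title starts on the same line, so it must close or nothing is valid.
        result = _read_title([rest] + lines[1:]) if rest.startswith('"') else None
        if result is None:
            return None, list(lines)
        title, used = result
        return (label, destination, title), lines[used:]
    if len(lines) > 1 and lines[1].lstrip().startswith('"'):
        result = _read_title(lines[1:])
        if result is not None:
            title, used = result
            return (label, destination, title), lines[1 + used:]
    # The first line stands on its own; everything after it is requeued.
    return (label, destination, ""), lines[1:]


# Title opened on the same line and never closed: everything is requeued.
print(try_parse_lrd(['[foo]: /url "title', "more text"]))

# Title opened on the next line and never closed: the definition survives without it.
print(try_parse_lrd(["[foo]: /url", '"title', "more text"]))
```

In the second case the first line has already been accepted as a definition, so only the remaining lines go back on the queue. The partially successful path is where the extra bookkeeping was needed.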

Don’t Judge A Book…

Indeed, when I came across this item:

- 603 - href? doesn't look right

I agreed that it did not look right. The Markdown for example 603 is:

<http://foo.bar.baz/test?q=hello&id=22&boolean>

producing the HTML:

<p><a href="http://foo.bar.baz/test?q=hello&amp;id=22&amp;boolean">http://foo.bar.baz/test?q=hello&amp;id=22&amp;boolean</a></p>

I was able to verify it quickly against Babelmark, but it took me a bit to figure out what the parser did to get to that result. The big thing that I had to remember for this case was that it was interpreted as an Autolink, which is meant as a quick way to provide references. As such, it makes sense that instead of a literal interpretation of the specified link, the processing leans more towards what the user probably intended. To that end, it makes sense that the ampersand (&) character in the link is translated into the named character entity &amp;amp; for use in both the reference and the text.
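A minimal sketch of that translation, using Python's standard html.escape function, reproduces the HTML shown above. The function name `render_uri_autolink` is hypothetical, and the sketch assumes the URI contains no characters that would also need percent-encoding in the href attribute, which a full implementation has to handle as well.

```python
import html


def render_uri_autolink(uri: str) -> str:
    """Render a URI autolink as a paragraph (hypothetical helper, escaping only)."""
    # The same escaped text is used for the href attribute and the link text.
    escaped = html.escape(uri)
    return f'<p><a href="{escaped}">{escaped}</a></p>'


print(render_uri_autolink("http://foo.bar.baz/test?q=hello&id=22&boolean"))
```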

So, after thinking it through and checking it out, the function test_autolinks_603 is 100% correct. For extra points though, to produce a link with the ampersands left exactly as written, I determined that the following HTML block would be needed:

<a href="http://foo.bar.baz/test?q=hello&id=22&boolean">http://foo.bar.baz/test?q=hello&id=22&boolean</a>

Yeah, I like puzzles, and this was a good one.

I Really Need to Be More Specific

While I am usually good at adding items to the issues list, this one was cryptic:

- 620 - more bad cases, like <

Huh? That really was not a lot to go on, but I gave it a shot. Without more information in the item, I just got a bit creative.

Taking a look at function test_autolinks_604, I took the initial URI autolink of <irc://foo.bar:2233/baz>, stripping it down to <irc:foo.bar> for function test_autolinks_604a and expanding the scheme to <my+weird-custom.scheme1:foo.bar> for function test_autolinks_604b. Similarly, I took the email Autolink of <foo+special@Bar.baz-bar0.com> from function test_autolinks_613 and reduced it down to <l@f> for function test_autolinks_613a. Having added some good positive tests, I then decided to add negative tests. For function test_autolinks_620a I specified a scheme with too few characters, while function test_autolinks_620b specified a scheme with too many characters. Test function test_autolinks_613c specified a scheme with an invalid character, while function test_autolinks_613d had no domain part and function test_autolinks_613e had no name part.

These tests all passed without incident, but it felt good to increase the scenarios and increase my confidence in the project. While I was pretty sure that these would all pass, as they are all based on regular expressions with specific character counts, it just felt right to explicitly test those limits and make sure they were consistent.
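For reference, the limits being probed can be sketched with two regular expressions built from the GFM Specification's description of autolinks: a scheme of 2 to 32 characters that starts with an ASCII letter, and the email pattern that the specification quotes. The pattern names and the `classify_autolink` function are hypothetical, and these are not necessarily the expressions that PyMarkdown itself uses.

```python
import re
from typing import Optional

# Scheme: an ASCII letter followed by 1 to 31 letters, digits, "+", "." or "-".
# The rest of the URI excludes ASCII whitespace, control characters, "<" and ">".
_URI_AUTOLINK = re.compile(r"<[A-Za-z][A-Za-z0-9+.\-]{1,31}:[^\x00-\x20\x7f<>]*>")

# Email pattern following the one quoted in the GFM Specification.
_EMAIL_AUTOLINK = re.compile(
    r"<[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*>"
)


def classify_autolink(text: str) -> Optional[str]:
    """Return "uri", "email", or None for text that is not a valid autolink."""
    if _URI_AUTOLINK.fullmatch(text):
        return "uri"
    if _EMAIL_AUTOLINK.fullmatch(text):
        return "email"
    return None


samples = [
    "<irc:foo.bar>",
    "<my+weird-custom.scheme1:foo.bar>",
    "<l@f>",
    "<a:foo.bar>",                 # scheme with too few characters
    "<" + "a" * 33 + ":foo.bar>",  # scheme with too many characters
    "<foo@>",                      # no domain part
    "<@bar.com>",                  # no name part
]
for sample in samples:
    print(sample, "->", classify_autolink(sample))
```

Since the limits live in the quantifiers of the expressions, a test on each side of a boundary is the most direct way to show that the counts were typed correctly.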

Verifying Link Reference Definitions with Other Blocks

In the same manner as other tests, this one started from the issues list item:

- test_link_reference_definitions_183 is a partial lrd followed by bq, add cont+leaf blocks

In test function test_link_reference_definitions_183, the Link Reference Definition (or its acronym LRD as used in the item) follows an Atx Heading element. In the GFM Specification for this example, it explicitly states:

However, it can directly follow other block elements, such as headings and thematic breaks, and it need not be followed by a blank line.

While it states that it can follow other block elements, it only gave three examples: one after an Atx Heading element, one before a Thematic Break element, and one before a SetExt Heading element. While those three test cases, spread out in the three functions between test_link_reference_definitions_183 and test_link_reference_definitions_185, were a good start, I felt that better coverage was warranted. Therefore, I created functions a to g for function test_link_reference_definitions_183 and functions a to f for function test_link_reference_definitions_185 to cover the before and after cases.

Except for two of the tests, they all passed without incident. The two that did not pass were tests that involved a Link Reference Definition occurring both before and after a list. As I knew I was going to be finishing up with leaf blocks and heading to container blocks in the next week or two, I marked those tests as disabled, added an item to the issues list, and kept on going.

Cleaning Up Character Entity Tests

At first, when I saw the issues list item:

- test_markdown_entity* various extra tests

I thought that I had missed a couple of cases and looked for some missing cases. It was during that search that I came across the following text at the end of the test_markdown_entity_and_numeric_character_references.py module:

# TODO
#
# & and various forms at end of line
#
# 327 special parsing for html blocks?
# <a href="&ouml;&ouml;.html" x="&ouml;">
# <x-me foo="&ouml;">

# <script>
# &ouml; bar="&ouml;" bbb

Comparing the items in that TODO list against the existing tests, I determined that all those cases had already been covered, but there were some other tests that I thought were worth adding.

While the example for function test_character_references_321 specified that the text must match that of an entity in the named entities table, I added function test_character_references_321a to make it explicit that it was a case-sensitive lookup. Similarly, functions test_character_references_322 and test_character_references_323 mention turning numeric entities into characters, but only included the special NUL character 0 as a byproduct of the text in an example. As such, I created the test_character_references_323a function to call attention to this special character, also showing that any number of leading zeroes does not matter for numeric entities.

In a similar pattern, but at a higher level, I added the test_character_references_336 series of functions, named a to e. While I was okay with the examples showing the usage of entities in paragraphs, I felt that having explicit cases of entities in each of the other leaf blocks was useful. In order, the tests added named entities in each of an Atx Heading element, a SetExt Heading element, an Indented Code Block element, a Fenced Code Block element, and an HTML Block element. I also verified that the entity was interpreted in the first two elements, and not interpreted in the last three elements, as per the GFM Specification.

Finally, as a simple set of comprehensive tests, I wanted to have a good example of specifying a single character using all three forms: named, decimal, and hexadecimal. As such, I created the test_character_references_extra_ functions with 01 using &amp;quot;, 02 using &amp;#34;, and 03 using &amp;#x22;. I knew that these functions were going to pass ahead of time, but it gave me confidence knowing that I had a concrete set of three tests showing that the form of the entity didn’t matter, as they all produced the same HTML results.
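As a quick way to see the three forms converge, here is a sketch built on Python's standard html module. The `render_text` helper is hypothetical. Note that html.unescape follows HTML5 rules, which are more lenient than the GFM Specification in some cases (for example, some named entities without a trailing semicolon), so this is an approximation rather than a reference implementation.

```python
import html


def render_text(markdown_text: str) -> str:
    """Decode character references, then escape the result for HTML output."""
    return html.escape(html.unescape(markdown_text))


# Named, decimal, and hexadecimal forms of the quotation mark, plus a
# decimal form with leading zeroes.
for source in ("&quot;", "&#34;", "&#x22;", "&#0034;"):
    print(source, "->", render_text(source))

# The NUL character is replaced with the Unicode replacement character.
print(html.unescape("&#0;") == "\ufffd")
```

All four spellings decode to the same quotation mark character, and that one character is then escaped the same way on output, which is why the form used in the Markdown document does not show up in the HTML.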

Closing Things Up

Even though I was getting close to writing time on Sunday morning, I wanted to try and clear one more easy issue from the list:

- is HTML transformer using text_from_chars, instead of other field?
  - see https://github.com/jackdewinter/pymarkdown/commit/a506ddd3bda08a8ca1d97a7b0d68c114325b545e `extra_74`

This was more of a bookkeeping issue than anything else, or at least I hoped it was. During a previous change on 02 Oct 2020, I thought I had noticed that the HTML transformer was using the text_from_blocks field to create the text for the links. Thankfully, resolving this took a quick look at the __handle_image_token function in the transform__to_gfm.py module to verify it was not using that field.

Had I taken a second to think about it, this should have been more obvious to me. While it is possible to derive the image_alt_text field from the text_from_blocks field, it is the last thing I would have thought about when generating HTML. But I still felt good that I verified this and dispelled any doubts about the HTML output being based on the wrong part of the token.

What Was My Experience So Far?

The work went on like it always does, but an interesting milestone was met with the completion of this work: all outstanding issues clearly identified as being attributable to a leaf block have been solved. Short version? I finished any issue that was clearly a leaf block issue.

While the realization of that goal was not a big thing to me, it wasn’t a small one either. It still meant that I needed to check how leaf blocks interacted with the two container blocks, but it reduced the number of things to check to just interactions with and between container blocks. That was a good feeling, knowing I had hit that mark. It increased my confidence that things were going in the right direction.

It is still too early to tell, but I am now starting to hope for an initial release of PyMarkdown as a linter in the early parts of 2021. It felt good typing that. Real good.

What is Next?

With the work done to verify the leaf blocks, the next week was going to be full of me trying to reduce the issues specific to list blocks. Closer to the line I get!

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.