Summary

In my last article, I continued the long journey to remove items from the project’s issues list. Still without a better title than the equivalent of “chapter 3”, this article details more pruning of that list.

Introduction

While this process feels like just one of many stops on a long journey, I do find that I am finding this part of the project particularly satisfying. Sure, I am fixing issues that I have found along the way, but that is a good thing. I am not finding a lot of “I need to redesign this from scratch” issues, just a lot of “wow, did I forget to…” or “wow, I didn’t think to test the combination where…” issues. And I find that observation both calming and a boost to my confidence.

Is it taking up time? Sure. Would I like to get back to implementing and adding rules? Yup. But, with each issue that I resolve, my confidence that this is the right course of action increases. The project’s collection of different Markdown documents and test data just keeps on growing. As I manually check each scenario test out against what is expected, verify the output HTML against the reference implementation, and have consistency checks in place, any new issues just give me that much more information that the project is working properly.

Mind you, properly does not mean it is perfect, just that it is headed in the right direction.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits 21 Sep 2020 and 27 Sep 2020.

Small Cleanups and Large Aspirations

The first commit in this week’s work was an easy one. During the writing of the article on that week’s work, I realized two things: I wasn’t taking show_debug arguments out of tests after debugging, and test test_raw_html_632 needed a child test with a bit of change, creating test_raw_html_632a. Both changes were accomplished within 15 minutes, so they were good cleanup issues. But after that, I was left with an interesting problem to deal with: Newlines and various inline elements.

There were a couple of issues that I wanted to deal with:

- HTML and Fenced Blocks and better handling of capturing newlines to avoid counting token height
- verify which inlines cannot contain newlines and verify with line/col
  - autolinks
  - raw_html

To break these issues out a bit more, I want to make sure I convey the difference between the PyMarkdown project and a normal Markdown parser. As the project’s focus is to be a Markdown linter, there are additional requirements imposed that are tested with every commit. Those requirements are that any element of interest, once translated into a token to represent it, must include the line number and column number of the original Markdown. While the examples provided by the GFM specification are great for testing a parser, they fall a bit short for testing a linter.

To be clear, that is not a problem with the GFM specification. The section called About this document specifically starts with:

This document attempts to specify Markdown syntax unambiguously. It contains many examples with side-by-side Markdown and HTML.

The specification specifically deals with syntax and how a given Markdown document should be represented by HTML. Full stop. It is this project’s additional requirement of being able to examine the tokens that are then translated to HTML that add that extra level of complexity. And that extra complexity comes with a cost.

Changing That Cost-Benefit Ratio

The cost of this complexity has not been large until this point. Hitting an issue where I thought needed some extra testing, I created a child test by adding a letter to the end of the parent scenario test and made the required variations to the child test for that situation. In cases where there were more variations, I introduced multiple child scenarios, each one testing something specific. This approach did not significantly increase the complexity of the tests, due to its focused nature.

But with this week’s work, it was different. Instead of having one test here to verify and one test there to verify, I wanted to make sure that all inline tokens that followed a given pattern were being handled properly. I wanted to have one central location where I could look to make sure inline tokens and newline characters were being handled properly. This was different in that its scope was going to make it complex.

New Tests - First Batch

The first batch of these new scenario tests were the group from test_paragraph_extra_43 to test_paragraph_extra_46. To reiterate the message from the previous section, the purpose of these tests was not to provide one off tests for each type of inline element, but to provide a group of tests along a given theme. In this case, that theme was an inline element surrounded by simple text on either side, with a newline in the middle of the element.

Far from being tricky, adding these tests was simple. I started with test test_paragraph_extra_43 and support for the Code Span token, adding support for the Raw Html token, the URI Autolink token, and the email Autolink token, ending at test_paragraph_extra_46 Those were easy tests to add. In each case, it was just a simple Markdown document like:

a`code
span`a

It was when I started looking at the links and images that I realized that I was in for a world of hurt. Well, maybe not hurt, but it was going to be a lot of work.

Starting off with the links, I began with inline links, as they are what I mostly use in my documents. In terms of changeable parts, the inline links have 6 elements: the link label, the whitespace before the URI, the URI, the whitespace before the title, the title, and the whitespace after the title. To test these parts, I added tests test_paragraph_extra_47 to test_paragraph_extra_52, one for each of the parts. In addition, for each whitespace test, I added variations on the tests that included whitespace before the newline, whitespace after the newline, and whitespace before and after the newline.

Once those tests were added, it was time to focus on the other 3 link types. The full link type was tested by adding a test for newlines in the label (test_paragraph_extra_53) and newlines in the link reference (test_paragraph_extra_54). A new test was added for a shortcut link with a newline in the link reference (test_paragraph_extra_55) and a collapsed link with a newline in the link reference (test_paragraph_extra_56). Finally, to address some concerns I had about a newline at the start of those elements, I added test test_paragraph_extra_57 that includes a collapsed link with a newline at the start of the link label, and test test_paragraph_extra_58 that includes a full link with a newline at the start of the link reference.

With that group of tests added, it was time to shift to the next group of tests: images. After adding test test_paragraph_extra_59 to place a newline between the ! character and the [ character of the image link, tests test_paragraph_extra_60 to test_paragraph_extra_67 were added as versions of tests test_paragraph_extra_47 to test_paragraph_extra_52, modified for an image instead of a normal link. In addition, tests test_paragraph_extra_60 and test_paragraph_extra_65 were added to test an image link without any link title being provided.

To me, this was a good start to some solid tests. I was focusing on a particular group of tests, and make sure that pattern was cleanly covered.

Adding in Replacements and Backslashes

Finding that the pattern of adding new tests in the previous section was working well as a template, the next group of tests followed a similar pattern. In this group of tests, I wanted to focus on newlines as part of replacement sequences and backslash sequences within different parts of the inline links. My concern was that the special character handling would play havoc with the determination of the column number for the token following the token with the special character. It just seemed to be a good idea to test these thoroughly.

Tests test_paragraph_extra_68 and test_paragraph_extra_69 provide for testing in the inline link label, with a newline around a replacement sequence and a newline near a backslash sequence. Then test test_paragraph_extra_70 tests a inline URI containing a space, followed by tests test_paragraph_extra_71 and test_paragraph_extra_72 testing for newlines within a replacement sequence and a backslash sequence. Then tests test_paragraph_extra_73 to test_paragraph_extra_77 repeat those tests, but with image links instead of normal links. To round out this testing tests test_paragraph_extra_78 to test_paragraph_extra_89 repeat the link testing with the variations based on the other 3 link types.

As I was compiling the list of tests for this article, I did notice that it appears that I have some tests that are done more than once. To combat this, I added a new item to the issues list to create a table or spreadsheet to more accurately track these variations.

Addressing the Issues

All in all, things went well with adding the new tests. For the initial group of 4 tests, the only thing that the tests uncovered was that the Code Span token was not handling the newline properly. That was quickly resolved with a couple of small changes in the handle_inline_backtick function. The bigger issues arose in the handling of the links.

While not a design change, most of that section either needed to be added or enhanced to properly handle links with newlines. The way the code was implemented before that week’s work, if there was a newline character in the current token, the logic was invoked to properly determine what the line/column number should be. There were two problems with that code. The first is that it was only setup to handle newline characters in an inline link. This meant that the code to handle full links, collapsed links, and shortcut links needed to be added.

The second issue was that the initial code added to handle the inline links was only partially complete. In trying to figure out why that was for this article, my notes refer to an extreme lack of examples in the GFM specification that have newline characters in any parts of the Link elements themselves. Part of the push to add this current series of tests was to try and shore up that deficiency, adding those tests myself.

Breaking Things Down

As it is the link type that I use most often, I started with the inline links and their component length arrays. This array, in a variable named link_part_lengths, contained all 5 lengths of the variable parts of the inline links: pre-URI whitespace, URI, pre-title whitespace, title, and post-title whitespace. The link labels were already handled before reaching that point in the code, so there were no issues there.

That code started off with some initialization of the array and its values:

link_part_lengths = [0] * 5
link_part_lengths[0] = len(active_link_uri) + len(
    current_token.before_title_whitespace
)
if current_token.inline_title_bounding_character:
    link_part_lengths[1] = 1
    link_part_lengths[2] = len(active_link_title) + 1
    link_part_lengths[3] = len(
        current_token.after_title_whitespace
    )
link_part_lengths[4] = 0

These values were not always going to be the right values, but they proved to be a good place to start for each link part. Basically, I started with an array of 5 integer values, each set to reflect a certain section of the inline link. While those values are not a direct 1-to-1 mapping of inline link part to length of that part, they are keyed to handle a newline character showing up in any of those inline link parts.

A good example for this is handling a newline character in the inline link title part.

newline_count = ParserHelper.count_newlines_in_text(
    active_link_title
)
if newline_count:
    delta_line += newline_count
    para_owner.rehydrate_index += newline_count

    split_active_link_title = active_link_title.split("\n")
    link_part_lengths[2] = len(split_active_link_title[-1]) + 1
    link_part_index = 2

For the title part, if there are any newlines discovered, the count of newlines found is added to the relevant variables. Once that is done, the title is split on newline characters, and the length of the entire part is set to the length of the last part of that split. Finally, the link_part_index variable is set to the part itself.

Finally, once all the parts are examined in this manner, the following code is executed:

if link_part_index >= 0:
    link_part_lengths[4] = len(
        split_paragraph_lines[para_owner.rehydrate_index]
    )
    link_part_lengths[:link_part_index] = [
        0
    ] * link_part_index
    repeat_count = -(2 + sum(link_part_lengths))

Essentially, this is the magic part of the algorithm that pulls everything together. At the start of the final processing, the last element in the array is set to include any whitespace prefix from the last line that was added. Once that is done, the elements before the last element containing a newline are set to 0. Basically, if we have a title part that has a newline, anything before that title is not useful in determining the column number. Finally, the algorithm then calculates the column number (in the repeat_count variable) by summing up each of the values in the link_part_lengths array.

Show Me This in Practice

Applying that algorithm to the following Markdown

a[Foo](/uri "test
ing")a

Given this information, the array was initialized as follows:

link_part_lengths[0] = 4 + 1
link_part_lengths[1] = 1
link_part_lengths[2] = 8 + 1
link_part_lengths[3] = 0
link_part_lengths[4] = 0

When the newline in the title is detected, the link_part_index variable is set to 2 and the 2nd element is reset to 4: 3 for the length of ing and 1 as per the rest of the adjustment equation.

Without any more newlines, when the final part of the algorithm is executed, no adjustments are made to the 4th element, any element before the 2nd element is zeroed out, and the sum of the remaining elements is then 4. Following the final part of the algorithm, 2 is added to that value, ending with a value of 6. In double checking the above text, the letter a on the second line does indeed start at column 6.

Summary

For me, this algorithm is very effective. It starts with good defaults, updating those values as each part is evaluated. As the column number cannot be determined until after the last part is evaluated, that evaluation is only done at the very end. And by using an array, the algorithm can use the sum function to add those elements up, instead of manual summation.

It took me a while to get to it, but it is and accurate algorithm and an efficient one.

Adjusting the Consistency Checks

Like the changes that needed to be made to the Markdown parser, its consistency checks needed to change to test the new criteria. I initially wanted to just copy over the work I had done in setting the new line/column numbers with newlines, but after 15 milliseconds, I realized the fallacy of that approach. The problem was that because I knew of the approach I already used; it was hard to come up with another way to do it.

In the end, I just started the code with a less optimal starting point for the inline links, growing it from there. Instead of a nice array that I could easily sum, I used a series of 5 boolean variables and 5 string variables that contained the content of each part of the link. While I knew this was almost the same approach, I had confidence that I was picking a path that was enough different that I wouldn’t be tempted to look at the previous code base to figure things out. The approach did not have to be optimal, it just had to be independent from the original algorithm, at least for this phase of the project.

After all that work, adding the shortcut links and collapsed links were simple, being simple sums of the component parts, both constant and variable. But that brought up things that I knew I was going to have to work on soon.

First off, all this verification was happening in the __verify_next_inline_inline_image function. I made a note in the issues list to check out why, but there was not any verification present for the link tokens. Secondly, any support for validating the full link type was missing. After checking the entire list of tests, there were no image tests that were specified using a full link format. Noted, and continued. Finally, there were the existing calculations for the collapsed and shortcut tests. While there was something there, I was sure that those checks wouldn’t handle any newline characters in any of their components. Once again, I noted an item in the issues list, and moved forward.

Besides those missing test areas that I discovered, there were a handful of tests that I could not get working by the end of the week. While it may seem like the tests were only small variations on each other, it did take quite a bit of work to get them working to the extent I did. Not include child tests, there were 47 scenario tests that I added. While some of them worked right away, most of them required research and small changes to get them working. Any even with a good push to get all the work done, I needed to leave some of the tests unfinished.

Leaving Something Undone

It was a hard decision to document the cases that I had missed and leave them for later, but I thought it was an important thing to do. As it was, the commit was done on Sunday morning, already digging into my article writing time for that week. I had no clue if the remaining tests were going to take 5 minutes to resolve or 5 days. I needed to draw a line somewhere and put those tests into the bank for the following week.

While I was not 100% sure about the groupings, I had a bit of information that I used to group them together into 4 groups. The first group was a set of 7 tests that contained a character entity and seemed to have their column number off by 6. The second group was a set of 2 tests that contained a backslash and seemed to have their column number off by 2. I was not sure about the 3 tests in the third group, but they seemed to fit together. Finally, I recognized that I had started work on image versions of some of the links, but I needed to make sure that each of the tests at 78 and over needed image versions.

With that information added to the issues list for the following weeks’ work, I ended the week with a good amount of work done.

What Was My Experience So Far?

One thing that I am learning about in this phase of the project is that I have wells of patience that I can tap into if need be. To be honest, it is surprising me that I have this ability. I usually want to forge ahead with things and make sure they get done to the 90 percent mark, and struggle with that last 10 percent. But this project is different.

I am not sure if it is my self-imposed strictness on the project requirements or my personal investment into this project, but I just find I am a lot more focused on making this project a solid, well-tested application. I know I am not going to be able to catch every issue at the beginning, but I also know the cost of catching those issues once I release the application to the community. For that, I am prepared to make a good-faith effort to do a thorough job in testing, even if it delays the release of the project by a couple of months.

In writing this article though, I am realizing how boring it may appear to others. That possible appearance was made clear to me as I worked on parts of this article. Yes, I added over 40 new scenario tests, debugged them, and got most of them working. And because they are in themes, it was harder to write something interesting about them. No more “hey I caught this, and here is the interesting part”. It was just one or two themes, and then numerous variations on that theme.

But boring as it may be, I believe it is important. Going forward, I want to have a cohesive collection of tests that I can use as a foundation to improve on. I firmly believe that this work, and the work that follows in the next couple of weeks, builds that testing foundation.

What is Next?

No surprise here, more issues to address, and still focusing on newline and links. Stay tuned!

Like this post? Share on: TwitterFacebookEmail

Comments

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.


Reading Time

~14 min read

Published

Markdown Linter Issues

Category

Software Quality

Tags

Stay in Touch