Summary¶
In my last article, I continued the long journey to remove items from the project’s issues list. Still without a better title than the equivalent of “chapter 3”, this article details more pruning of that list.
Introduction¶
While this process feels like just one of many stops on a long journey, I do find that I am finding this part of the project particularly satisfying. Sure, I am fixing issues that I have found along the way, but that is a good thing. I am not finding a lot of “I need to redesign this from scratch” issues, just a lot of “wow, did I forget to…” or “wow, I didn’t think to test the combination where…” issues. And I find that observation both calming and a boost to my confidence.
Is it taking up time? Sure. Would I like to get back to implementing and adding rules? Yup. But, with each issue that I resolve, my confidence that this is the right course of action increases. The project’s collection of different Markdown documents and test data just keeps on growing. As I manually check each scenario test out against what is expected, verify the output HTML against the reference implementation, and have consistency checks in place, any new issues just give me that much more information that the project is working properly.
Mind you, properly does not mean it is perfect, just that it is headed in the right direction.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits 21 Sep 2020 and 27 Sep 2020.
Small Cleanups and Large Aspirations¶
The first commit in this week’s work was an easy one. During the writing of the
article on that week’s
work, I realized two things: I wasn’t taking show_debug
arguments out of tests
after debugging,
and test test_raw_html_632
needed a child test with a bit of change, creating
test_raw_html_632a
. Both changes were accomplished within 15 minutes, so
they were good cleanup issues. But after that, I was left with an interesting problem
to deal with: Newlines and various inline elements.
There were a couple of issues that I wanted to deal with:
- HTML and Fenced Blocks and better handling of capturing newlines to avoid counting token height
- verify which inlines cannot contain newlines and verify with line/col
- autolinks
- raw_html
To break these issues out a bit more, I want to make sure I convey the difference between the PyMarkdown project and a normal Markdown parser. As the project’s focus is to be a Markdown linter, there are additional requirements imposed that are tested with every commit. Those requirements are that any element of interest, once translated into a token to represent it, must include the line number and column number of the original Markdown. While the examples provided by the GFM specification are great for testing a parser, they fall a bit short for testing a linter.
To be clear, that is not a problem with the GFM specification. The section called About this document specifically starts with:
This document attempts to specify Markdown syntax unambiguously. It contains many examples with side-by-side Markdown and HTML.
The specification specifically deals with syntax and how a given Markdown document should be represented by HTML. Full stop. It is this project’s additional requirement of being able to examine the tokens that are then translated to HTML that add that extra level of complexity. And that extra complexity comes with a cost.
Changing That Cost-Benefit Ratio¶
The cost of this complexity has not been large until this point. Hitting an issue where I thought needed some extra testing, I created a child test by adding a letter to the end of the parent scenario test and made the required variations to the child test for that situation. In cases where there were more variations, I introduced multiple child scenarios, each one testing something specific. This approach did not significantly increase the complexity of the tests, due to its focused nature.
But with this week’s work, it was different. Instead of having one test here to verify and one test there to verify, I wanted to make sure that all inline tokens that followed a given pattern were being handled properly. I wanted to have one central location where I could look to make sure inline tokens and newline characters were being handled properly. This was different in that its scope was going to make it complex.
New Tests - First Batch¶
The first batch of these new scenario tests were the group from
test_paragraph_extra_43
to test_paragraph_extra_46
.
To reiterate the message from the previous section, the purpose of these tests was not
to provide one off tests for each type of inline element, but to provide a group of
tests along a given theme. In this case, that theme was an inline element
surrounded by simple text on either side, with a newline in the middle of the element.
Far from being tricky, adding these tests was simple. I started with test
test_paragraph_extra_43
and support for the Code Span token, adding support for the
Raw Html token,
the URI Autolink token, and the email Autolink token, ending at
test_paragraph_extra_46
Those were easy
tests to add. In each case, it was just a simple Markdown document like:
a`code
span`a
It was when I started looking at the links and images that I realized that I was in for a world of hurt. Well, maybe not hurt, but it was going to be a lot of work.
New Tests - Links and Newlines¶
Starting off with the links, I began with inline links, as they are what I mostly
use in my documents. In terms of changeable parts, the inline
links have 6 elements: the link label, the whitespace before the URI, the URI,
the whitespace before the title, the title, and the whitespace after the title.
To test these parts, I added tests test_paragraph_extra_47
to
test_paragraph_extra_52
, one for each of the parts. In addition,
for each whitespace test, I added variations on the tests that included whitespace
before the newline, whitespace after the newline, and whitespace before and after
the newline.
Once those tests were added, it was time to focus on the other 3 link types. The
full link type was tested by adding a test for newlines in the label
(test_paragraph_extra_53
) and
newlines in the link reference (test_paragraph_extra_54
). A new test was added
for a shortcut
link with a newline in the link reference (test_paragraph_extra_55
) and a
collapsed link with a newline
in the link reference (test_paragraph_extra_56
). Finally, to address some
concerns I had about a newline
at the start of those elements, I added test test_paragraph_extra_57
that includes
a collapsed link with
a newline at the start of the link label, and test test_paragraph_extra_58
that includes a full link with
a newline at the start of the link reference.
With that group of tests added, it was time to shift to the next group of tests: images.
After adding test test_paragraph_extra_59
to place a newline between the !
character and the [
character of the image link, tests test_paragraph_extra_60
to
test_paragraph_extra_67
were added as versions of tests
test_paragraph_extra_47
to test_paragraph_extra_52
, modified for an image instead
of a normal link. In addition, tests test_paragraph_extra_60
and
test_paragraph_extra_65
were added to test an image link without any link title
being provided.
To me, this was a good start to some solid tests. I was focusing on a particular group of tests, and make sure that pattern was cleanly covered.
Adding in Replacements and Backslashes¶
Finding that the pattern of adding new tests in the previous section was working well as a template, the next group of tests followed a similar pattern. In this group of tests, I wanted to focus on newlines as part of replacement sequences and backslash sequences within different parts of the inline links. My concern was that the special character handling would play havoc with the determination of the column number for the token following the token with the special character. It just seemed to be a good idea to test these thoroughly.
Tests test_paragraph_extra_68
and test_paragraph_extra_69
provide for testing in
the inline link label, with a newline around
a replacement sequence and a newline near a backslash sequence. Then test
test_paragraph_extra_70
tests a inline URI containing a space, followed by tests test_paragraph_extra_71
and test_paragraph_extra_72
testing for
newlines within a replacement sequence and a backslash sequence. Then tests
test_paragraph_extra_73
to test_paragraph_extra_77
repeat those tests, but with
image links instead of normal links. To round out
this testing tests test_paragraph_extra_78
to test_paragraph_extra_89
repeat
the link testing with the variations
based on the other 3 link types.
As I was compiling the list of tests for this article, I did notice that it appears that I have some tests that are done more than once. To combat this, I added a new item to the issues list to create a table or spreadsheet to more accurately track these variations.
Addressing the Issues¶
All in all, things went well with adding the new tests. For the initial group
of 4 tests, the only thing that the tests uncovered was that the Code Span token was
not handling the newline properly.
That was quickly resolved with a couple of small changes in the handle_inline_backtick
function. The bigger issues arose in the handling of the links.
While not a design change, most of that section either needed to be added or enhanced to properly handle links with newlines. The way the code was implemented before that week’s work, if there was a newline character in the current token, the logic was invoked to properly determine what the line/column number should be. There were two problems with that code. The first is that it was only setup to handle newline characters in an inline link. This meant that the code to handle full links, collapsed links, and shortcut links needed to be added.
The second issue was that the initial code added to handle the inline links was only partially complete. In trying to figure out why that was for this article, my notes refer to an extreme lack of examples in the GFM specification that have newline characters in any parts of the Link elements themselves. Part of the push to add this current series of tests was to try and shore up that deficiency, adding those tests myself.
Breaking Things Down¶
As it is the link type that I use most often, I started
with the inline links and their component length arrays. This array, in a variable
named link_part_lengths
, contained all 5 lengths of the variable parts of the
inline links: pre-URI whitespace, URI, pre-title whitespace, title, and post-title
whitespace. The link labels were already handled before reaching that point in the
code, so there were no issues there.
That code started off with some initialization of the array and its values:
link_part_lengths = [0] * 5
link_part_lengths[0] = len(active_link_uri) + len(
current_token.before_title_whitespace
)
if current_token.inline_title_bounding_character:
link_part_lengths[1] = 1
link_part_lengths[2] = len(active_link_title) + 1
link_part_lengths[3] = len(
current_token.after_title_whitespace
)
link_part_lengths[4] = 0
These values were not always going to be the right values, but they proved to be a good place to start for each link part. Basically, I started with an array of 5 integer values, each set to reflect a certain section of the inline link. While those values are not a direct 1-to-1 mapping of inline link part to length of that part, they are keyed to handle a newline character showing up in any of those inline link parts.
A good example for this is handling a newline character in the inline link title part.
newline_count = ParserHelper.count_newlines_in_text(
active_link_title
)
if newline_count:
delta_line += newline_count
para_owner.rehydrate_index += newline_count
split_active_link_title = active_link_title.split("\n")
link_part_lengths[2] = len(split_active_link_title[-1]) + 1
link_part_index = 2
For the title part, if there are any newlines discovered, the count of newlines found
is added to the relevant variables. Once that is done, the title is split on newline
characters, and the length of the entire part is set to the length of the last part of
that split. Finally, the link_part_index
variable is set to the part itself.
Finally, once all the parts are examined in this manner, the following code is executed:
if link_part_index >= 0:
link_part_lengths[4] = len(
split_paragraph_lines[para_owner.rehydrate_index]
)
link_part_lengths[:link_part_index] = [
0
] * link_part_index
repeat_count = -(2 + sum(link_part_lengths))
Essentially, this is the magic part of the algorithm that pulls everything together.
At the start of the final processing, the last element in the array is set to include
any whitespace prefix from the last line that was added. Once that is done, the
elements before the last element containing a newline are set to 0. Basically, if we
have a title part that has a newline, anything before that title is not useful in
determining the column number. Finally, the algorithm then calculates the column
number (in the repeat_count
variable) by summing up each of the values in the
link_part_lengths
array.
Show Me This in Practice¶
Applying that algorithm to the following Markdown
a[Foo](/uri "test
ing")a
Given this information, the array was initialized as follows:
link_part_lengths[0] = 4 + 1
link_part_lengths[1] = 1
link_part_lengths[2] = 8 + 1
link_part_lengths[3] = 0
link_part_lengths[4] = 0
When the newline in the title is detected, the link_part_index
variable is set to
2
and the 2nd element is reset to 4
: 3
for the length of ing
and 1
as per
the rest of the adjustment equation.
Without any more newlines, when the final part of the algorithm is executed, no
adjustments are made to the 4th element, any element before the 2nd element is
zeroed out, and the sum of the remaining elements is then 4
. Following the
final part of the algorithm, 2
is added to that value, ending with a value of
6
. In double checking the above text, the letter a
on the second line does indeed
start at column 6.
Summary¶
For me, this algorithm is very effective. It starts with good defaults, updating
those values as each part is evaluated. As the column number cannot be determined
until after the last part is evaluated, that evaluation is only done at the very
end. And by using an array, the algorithm can use the sum
function to add those
elements up, instead of manual summation.
It took me a while to get to it, but it is and accurate algorithm and an efficient one.
Adjusting the Consistency Checks¶
Like the changes that needed to be made to the Markdown parser, its consistency checks needed to change to test the new criteria. I initially wanted to just copy over the work I had done in setting the new line/column numbers with newlines, but after 15 milliseconds, I realized the fallacy of that approach. The problem was that because I knew of the approach I already used; it was hard to come up with another way to do it.
In the end, I just started the code with a less optimal starting point for the inline links, growing it from there. Instead of a nice array that I could easily sum, I used a series of 5 boolean variables and 5 string variables that contained the content of each part of the link. While I knew this was almost the same approach, I had confidence that I was picking a path that was enough different that I wouldn’t be tempted to look at the previous code base to figure things out. The approach did not have to be optimal, it just had to be independent from the original algorithm, at least for this phase of the project.
After all that work, adding the shortcut links and collapsed links were simple, being simple sums of the component parts, both constant and variable. But that brought up things that I knew I was going to have to work on soon.
First off, all this verification was happening in the
__verify_next_inline_inline_image
function. I made a note in the issues list to
check out why, but there was not any verification present for the link tokens.
Secondly, any support for validating the full link type was missing. After checking
the entire list of tests, there were no image tests that were specified using a full
link format. Noted, and continued. Finally, there were the existing calculations for
the collapsed and shortcut tests. While there was something there, I was sure
that those checks wouldn’t handle any newline characters in any of their components.
Once again, I noted an item in the issues list, and moved forward.
Besides those missing test areas that I discovered, there were a handful of tests that I could not get working by the end of the week. While it may seem like the tests were only small variations on each other, it did take quite a bit of work to get them working to the extent I did. Not include child tests, there were 47 scenario tests that I added. While some of them worked right away, most of them required research and small changes to get them working. Any even with a good push to get all the work done, I needed to leave some of the tests unfinished.
Leaving Something Undone¶
It was a hard decision to document the cases that I had missed and leave them for later, but I thought it was an important thing to do. As it was, the commit was done on Sunday morning, already digging into my article writing time for that week. I had no clue if the remaining tests were going to take 5 minutes to resolve or 5 days. I needed to draw a line somewhere and put those tests into the bank for the following week.
While I was not 100% sure about the groupings, I had a bit of information that I used
to group them together into 4 groups. The first group was a set of 7 tests that
contained a character entity and seemed to have their column number off by 6. The
second group was a set of 2 tests that contained a backslash and seemed to have their
column number off by 2. I was not sure about the 3 tests in the third group, but they
seemed to fit together. Finally, I recognized that I had started work on image versions
of some of the links, but I needed to make sure that each of the tests at 78
and over
needed image versions.
With that information added to the issues list for the following weeks’ work, I ended the week with a good amount of work done.
What Was My Experience So Far?¶
One thing that I am learning about in this phase of the project is that I have wells of patience that I can tap into if need be. To be honest, it is surprising me that I have this ability. I usually want to forge ahead with things and make sure they get done to the 90 percent mark, and struggle with that last 10 percent. But this project is different.
I am not sure if it is my self-imposed strictness on the project requirements or my personal investment into this project, but I just find I am a lot more focused on making this project a solid, well-tested application. I know I am not going to be able to catch every issue at the beginning, but I also know the cost of catching those issues once I release the application to the community. For that, I am prepared to make a good-faith effort to do a thorough job in testing, even if it delays the release of the project by a couple of months.
In writing this article though, I am realizing how boring it may appear to others. That possible appearance was made clear to me as I worked on parts of this article. Yes, I added over 40 new scenario tests, debugged them, and got most of them working. And because they are in themes, it was harder to write something interesting about them. No more “hey I caught this, and here is the interesting part”. It was just one or two themes, and then numerous variations on that theme.
But boring as it may be, I believe it is important. Going forward, I want to have a cohesive collection of tests that I can use as a foundation to improve on. I firmly believe that this work, and the work that follows in the next couple of weeks, builds that testing foundation.
What is Next?¶
No surprise here, more issues to address, and still focusing on newline and links. Stay tuned!
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.