Summary

In my last article, I continued in my quest to reduce the size of the issues list. In this article, I split my time between adding to the scenario cases tables and dealing with items from the issues list.

Introduction

With a specific focus on getting list issues resolved this week, I was hoping to make some decent headway with the issues list. From my reading of the list at the start of the week, there were a fair number of List element related issues, Block Quote element related issues, and cross-over issues between the two. It just made sense to me to pick one and focus on it. And I picked List elements related issues.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 11 Nov 2020 and 15 Nov 2020.

Starting Off with An Easy Task

While not a big task, the task:

- 292x with ordered lists?

was a nice easy one to start with. This task was simply to take the functions test_list_items_292a to test_list_items_292c, copy them to test_list_items_292d to test_list_items_292e, changing them from an Unordered List elements to Ordered List elements. Nothing more, nothing less.

And this was very simple, and similarly, the tests passed without any issues. A simple, but a good, solid start.

Double Checking

For these tasks:

- 269 and 305, and variations
  - whitespace is not 100% correct
  - weird cases in list_in_process

nothing was changed, but some research was done.

To start, I went to those two scenario tests and their variations and looked at the whitespace in the tests. While it took me a second to remember why the whitespace looks that way in the list tokens, there was nothing wrong with the way it was being stored there. However, it did take me a second or two to remember that a paragraph within a list stores part of its leading whitespace in the List token and the remaining leading whitespace in the paragraph token. Maybe that is what caused me to add the item to the issues list?

I then looked at the code for the list_in_process function with a similar exploratory effort, but nothing seemed out of place. I even enabled debug for a few of the test_list_items_292 functions, tracing through the code to look for any issues. While I did not find anything wrong, I was happy to take a second look at these areas to put any perceived issues that I might have had to rest.

Ordered List Blocks and Start Numbers

Once again, I found a small task to get out of the way:

- CommonMark and how handles non-initial cases for list starts

To examine this issue, I created two new functions, test_list_blocks_extra_1 and test_list_blocks_extra_2. In the first of these functions, I added a couple of lists with sub-lists, all starting with the the integer 1. In the second of these functions, I changed one of the sub-lists to start with the integer 2 instead.

Checking the PyMarkdown parser HTML output against the BabelMark, everything was fine for the first function, but there was a slight difference for the second function. Instead of acknowledging the 2. signifying the start of a sub-list, that 2. text was combined with the paragraph element from the previous line. That was curious.

After combing through the specification for about an hour, I posted a question to the CommonMark Discussion Boards, and waited for a response. More on that in a later article.

Variations on Existing Lists

Graduating from the simpler tasks, I decided to tackle a task that had a bit more substance to it:

- 276 with extra level of depth, with olist/olist, ulist/olist, and olist/ulist

Starting with the scenario test for example 276:

- - foo

I started generating variations based along different themes. For each of the main variations, I simply ran through a simple list of combinations of the Ordered List element and the Unordered List element: Unordered/Ordered, Ordered/Unordered, Ordered/Ordered, and Unordered/Unordered. After those combinations were taken care of, the a and b variations of those variations were created by adding an Unordered List element (for a) or an Ordered List element (for b).

While I expected something to happen, it was nice to be proven wrong. The tests all passed without any issues. The reason for my initial doubt on this issue? I have had problems with “empty” lists before. While I do not use them myself, the Markdown specification allows for a List element that only contains a start List Element text sequence, say 1., with no other text on that line. And from my knowledge of the GFM Specification, empty list items are covered, but not as completely as the list item starts followed by text. Based on that background, it was good to see that so far, those empty list items were not going to be an issue.

Variations on A Theme

On a bit of a roll with some easy wins in the completed task column, I decided to do a couple of tasks together as a single task:

- 256 with extra spaces on blanks
- 256 with other list types for last instead of just li

Based on previous tasks, it seemed like a good idea to come up with variations to deal with these tasks. To do this, I started with the example for example 256:

-
  foo
-
  ```
  bar
  ```
-
      baz

For the first two variations on test_list_blocks_256, I modified the example Markdown to include extra trailing whitespace as part of the empty list items. From there, I added a variation which replaced the unordered list item elements with ordered list item elements. In addition, I further modified that variation by adding extra blank lines and by reducing the indent on the Fenced Code Block element from 3 to 2, making it ineligible for inclusion into the list. Basically, I looked at Example 256 and experimented with what I thought would be good variations to test.

While a lot of those variations did not result in the discovery of any expected issues, there was one interesting new issue. In cases where there is a blank line inside of a list, there was a weird ordering where the processing of the blank line and the processing of the blank line to close the list were in the wrong order. As such, a small fix was required to make sure that the information is available to make a proper decision on how to handle that blank line with the correct data.

That one took a bit of effort to figure out, but it was a good warm up for what I knew was going to be a bear of a task to follow.

This Week’s Big Issue

While there was not an explicit entry in the issues list for this issue, it was an issue that I had long been concerned about. As the commit message stated, that issue was to “add better support for lists that are aborted due to missing leading space in front of a new block”. I knew I was opening a can of worms by exploring this issue, but for me to have confidence in the project, I felt that I needed to explore this, and explore it now.

Aborted Due to Missing Leading Space?

Basically, in its simplest form, this issue breaks down into the following scenario:

1.
---

In this scenario, the Markdown specifies a simple List element that is created without any content on that line, an empty list item. In starting to parse the second line, the parser must first determine if that newly parsed text will be part of that list or not. To ensure that text is added to that List element, the text on the next line must be indented 3 spaces, matching the indent level of that List element. Therefore, when the Thematic Break element on that next line fails to maintain that level of indentation, the original list is then aborted, and the Thematic Break element is processed after the list has been closed.

This reading of the specification is backed by the HTML output generated by BabelMark for this scenario:

<ol>
<li></li>
</ol>
<hr />

What Is the Issue Then?

While I had a specific answer to a specific question regarding how the List element and the Thematic Break element interacted, I wanted a more generic answer that I could work with. As usual, I opened up the GFM Specification in my browser and started looking for that generic answer. Unfortunately, I did not get an answer that I was satisfied with.

The first thing that I looked for was some variation on the previous example, of which I kind of found a related test with example 27:

- foo
***
- bar

While it was not exactly what I was looking for, it was something. When I went to look for a similar example including an Atx Heading element, I did not find any comparable example.

To be clear, I do not believe this is the fault of the GFM Specification. To be honest, I think the specification has done a great job and specifying the main cases that people are going to encounter. But there is always room for improvement, and I am hoping to contribute to that improvement with the PyMarkdown project’s testing suite. That is part of the process, and how the specification gets better.

With that newfound information in mind, I was left with a slightly modified issue. As I did not have a good set of examples detailing how lists and other leaf blocks interacted, I therefore did not have a good comprehensive scenario test suite that I had confidence in. I was confident that the GFM Specification was getting me a good 90% of the way there, but I wanted more than that 90%. Therefore, the newly modified issue that I needed to solve was that I needed to specifically add more specific tests in this area.

Getting A Good View on The Issue

The first thing I did to address this issue was to stop and think clearly about what needed to be done. From the last bit of work, I knew that the scenario functions test_list_blocks_256* were a good start, so I decided to add functions with similar names more from that point.

Starting with the example I outlined the function test_list_blocks_256f with the following markdown:

1.
---

Then I added a variation on that, making test function test_list_blocks_256fa be the same thing, just with text after the List Item element. Once that was done, I basically copied those scenario tests, replacing the Thematic Break element with an Atx Heading element, a SetExt Heading element, an Ordered Code Block element, a Fenced Code Block element, and an HTML Block element. With scenario test titles going from 256f to 256k, I went back and used BabelMark to replace the output HTML in each of the scenario tests.

Running the newly created scenario tests, I discovered a solid number of issues that needed to be looked at. I knew at this point that there was not going to be an easy solution here. I rolled up my sleeves and got to work.

HTML blocks

Picking one of the elements to start with, that night I decided to start working with the HTML Block element and regretted it within hours. Looking at the failures from those tests, it was obvious that there was more than one problem with these tests. The most immediate problem was that the tokens produced by the parser just looked wrong.

Taking the time to look at the problem in depth, I quickly discovered that the HTML Block element was not causing the list to close like it should. As that was a major issue to find, everything else after that token was affected in some way. To address this issue, I created the correct_for_leaf_block_start_in_list function to allow the parser to clean up in situations like this. More specifically, it was for cases where a paragraph had just been closed while in an active list. In those cases, the parser thought the lists had been closed, but the proper tokens were not being emitted.

But even with those observations in place, there still was something about the tokens that looked “off”. However, I know that there was currently too much noise in the way for me to see that other issue clearly, so I just decided to get it out of the way first.

Cleaning Up HTML Blocks in Lists

To take care of those concerns, I experimented with eight other variations on that list scenario, including some variations with sub-lists. While I was mostly pleased with the results, there were three tests that were failing due to issues in generating the correct HTML and rehydrated Markdown text. After some quick investigation, it became obvious that there was whitespace missing at the start of the HTML blocks.

Double checking with the other failing tests, the pattern that emerged was that the spacing between the end of the previous paragraph and the new HTML block needed to be altered a bit. To make sure that information could get from the parser to the HTML generator, I added a new member variable fill_count to the HTML Block token, using the add_fill function to adjust its value. Once that was added, I was then able to make small alterations to the correct_for_leaf_block_start_in_list function to adjust that fill_count member variable with the difference between the number of whitespace characters removed and the indentation level of the currently active list.

With that information now present in the token, the HTML generator was easily changed to add those extra characters between the start of the HTML Block and the processing of the text within that block. With those changes generating HTML properly, the focus shifted to applying similar changes to the the Markdown generator and the consistency checks. While the first part of this issue had taken days to complete and fix, this part took only half an hour. At that point, the HTML blocks and all their variations were working properly.

Moving on To The Other Blocks

After fixing the HTML Block elements, I took a closer look at the other failing tests and started to notice similar problems. While the fill_count solution was only useful for the HTML blocks, the correct_for_leaf_block_start_in_list function was useful in fixing issues with the handling of Fenced Code Block elements and Atx Heading elements. In both of those instances, the new element was supposed to abort the previous list but was not doing so. The correct_for_leaf_block_start_in_list function needed a couple of tweaks to make sure it was handling things consistently across all three scenarios, but they were all easy fixes.

With the parser generating the tokens properly, the tests were making more progress but still failed on the output HTML. While not a big difference, the output HTML was missing a newline between the end of the previous list and new Fenced Code Block elements, Atx Heading elements, and SetExt Heading elements. That difference was quickly eliminated by adding two small lines of code to the handle of those elements:

    if output_html.endswith("</ol>") or output_html.endswith("</ul>"):
        output_html += "\n"

After crossing my fingers for good luck, and fixing a couple of typing mistakes, the new tests passed without much fanfare. It was just a relief. I knew it would take a while, but I did not think it would take three days to complete the implementation and verification of these new scenario tests.

It Was A Slog…

Over that three days, I was able to get 18 new scenario tests coded and passing properly. To be honest, I was doubtful at certain points that I would get done, as progress was hard to come by. While I cannot remember how “on my game” I felt during that time, I do remember that I felt burdened by the knowledge that this work was only the starting point, and I would have to repeat it multiple times in the future.

And to that end, I had my next task already lined up.

Lather, Rinse, Repeat

While it was a tough couple of days getting through that block of work, I felt that it was a good task to complete, but that it was not yet 100% complete. With some extra time left in the evening, I decided to take a shot at replicating the width of tests that I added for HTML Block elements to the other Leaf Block elements as well.

Taking a bit of time to setup (copy, paste, alter, repeat… many times), I was pleasantly surprised with the low number of failures. For those tests that did fail, patterns that I was familiar with from the previous issues with HTML Block elements began to resurface. As these issues arose in the Atx Heading elements and the Fenced Code Block elements, the first set of changes I made were to add a fill_count member variable to those token class, similar to how I had added them to the Html Block token class. As some of those classes were in a bit more of a “raw” state than the HTML Block class was, I needed to do a bit of extra work to make sure that I could adjust the fill_count variable and have it be persisted. But it was nothing I had not done multiple times before, so it was quickly accommodated.

Other than a couple of small changes, the only big change was to the function __check_for_list_closures. Taking a while to get right, I needed to alter a few of the functions that feed that function to pass the right parameters around. While it was not too difficult, I was hoping that I would find time in the near future to revisit parts of this code and refactor it to make it cleaner.

What Was My Experience So Far?

Getting a fair number of items related to lists taken off the issues list was a good feeling. While there were a couple of draining issues in the middle, it was still good to make that progress forward. While it was a bit disheartening finding issues in the parser after a spell without any major issues with the parser, it was par for the course. Better for me to discover them now, than to have a consumer discover them later. But as long as I was finding these issues, I would have to make sure to examine the existing tests and identify any potential areas where I can add extra tests to validate that I had found most of the issues with lists.

At this point, I thought it would be useful for me to reiterate a point from previous articles. I am not trying to make this project perfect. After years of software development and years of software automation, I know that eliminating all issues is not possible. It always boils down to when the next issue is going to happen, not if. But for me, it is very important to make my best effort in getting the quality to a point that I feel comfortable with.

Following that, while I know there is a bit left to go before releasing this project, I know that it is getting nearer with each item I resolve from the issues lists. And that is the point. It is getting closer, and I just need to keep my focus on the prize: that project release.

What is Next?

Continuing with my efforts to get better scenario tests around lists, I knew that I had to be more structured about that testing. As that meant creating a new test series, it forbade that the next week’s work would be moving this week’s work into a new series and cleaning it up.

Like this post? Share on: TwitterFacebookEmail

Comments

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.


Reading Time

~14 min read

Published

Markdown Linter Issues

Category

Software Quality

Tags

Stay in Touch