In my last article, I talked about finding a process that works for yourself. In this article, I talk about continuing to make progress on testing nested containers.
It took me a while to get this latest round of scenario tests added and passing, but it was worth getting them added into the PyMarkdown project. I have known that nested container elements have been a concern of mine for a while, so it is nice to be able to finally start addressing it. Even if it did take a couple of weeks.
What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 04 Jan 2022 and 23 Jan 2022.
The Long Haul: Issue 227
The concept itself is simple, but the testing and debugging of the first set of scenarios for this are far from simple. The concept of this issue is to take every combination of three-level deep Markdown containers, and make sure that they all parse properly at their maximum distances.
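To make the size of that combination space concrete, here is a minimal sketch that enumerates every three-level nesting. It assumes the three container types are Block Quote, Unordered List, and Ordered List. The labels and variable names are illustrative and are not taken from the PyMarkdown test suite.

```python
from itertools import product

# Assumed container types; these labels are illustrative only.
CONTAINERS = ("block quote", "unordered list", "ordered list")

# Every ordered choice of an outer, a middle, and an inner container.
three_level_nestings = list(product(CONTAINERS, repeat=3))
```

Under that assumption there are 27 nestings (three choices at each of three levels), and each one needs its own set of documents.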
What does that mean? Picked at random, the test function
contains the following Markdown document:
> > 1. list
> >    item
This example could be taken out of the GitHub Flavored Markdown specification for its simplicity. It is simply an Ordered List element within a Block Quote element within another Block Quote element. Each bit of spacing between each container element and the next element is done according to normal standards.
But in the real world, things are not always standard. There is an optional space character that follows the Block Quote character (>) and a required space character that follows the Ordered List sequence (1.). At that point, to allow for different expressions of the document, up to three space characters can be provided.
While I was not able to find any reason why three or fewer space characters are okay, but four or more space characters start an Indented Code Block element, my guess is that it is because of the tab character. Putting aside that the tab character is interpreted as a Tab Stop rather than four straight space characters, it looks like Indented Code Block elements were prefaced with four space characters, or a non-justified tab character, because it was easy. The way I think about it is this: if you want to create an Indented Code Block element outside of any container, simply use a tab character or its equivalent of four space characters.
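As a rough sketch of that rule, the following helper measures the leading whitespace of a line and reports whether it reaches the four-column threshold. It assumes a tab stop of four columns and deliberately ignores containers and paragraph continuation, so it only models a line that starts a new block at column 1. The function names are hypothetical.

```python
def leading_indent_width(line: str, tab_stop: int = 4) -> int:
    """Return the column width of the leading spaces and tabs in `line`."""
    width = 0
    for character in line:
        if character == " ":
            width += 1
        elif character == "\t":
            # A tab advances to the next tab stop, not by a fixed count.
            width += tab_stop - (width % tab_stop)
        else:
            break
    return width


def starts_indented_code(line: str) -> bool:
    """Simplified check: a non-blank line indented four or more columns."""
    return line.strip() != "" and leading_indent_width(line) >= 4
```

Note that two spaces followed by a tab still measure four columns, which is the Tab Stop behavior mentioned above.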
But as I am about to show, that does add in some complexities to nested container elements.
Indented Text Works… Up To A Point
To see this indented behavior in action, run this Markdown document through any GitHub Flavored Markdown compliant Markdown parser, such as BabelMark:
1. zero spaces
1.  one space
1.   two spaces
1.    three spaces
1.     four spaces
Except for the English name for the number of spaces after the start of the List element, the first four lines each translate neatly into a simple text element within a List Item HTML element. Once four spaces are used, as on the fifth line, the List Item text transfers from simple Markdown text to an Indented Code Block element.
<ol>
<li>zero spaces</li>
<li>one space</li>
<li>two spaces</li>
<li>three spaces</li>
<li>
<pre><code>four spaces
</code></pre>
</li>
</ol>
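The behavior in that example can be approximated with a small sketch. It handles only single-line Ordered List items, counts the spaces beyond the one required space after the list sequence, and treats four or more extra spaces as the start of an Indented Code Block element. The function name and the regular expression are illustrative assumptions, not PyMarkdown's implementation.

```python
import re
from typing import Tuple

# An ordered list sequence, the spaces after it, and the item text.
ORDERED_ITEM = re.compile(r"^(\d{1,9}[.)])( +)(\S.*)$")


def classify_item_text(line: str) -> Tuple[str, str]:
    """Classify the text of a single-line ordered list item."""
    match = ORDERED_ITEM.match(line)
    if match is None:
        raise ValueError(f"not a single-line ordered list item: {line!r}")
    _marker, spaces, text = match.groups()
    extra = len(spaces) - 1  # spaces beyond the one required space
    if extra >= 4:
        # Four of the extra spaces form the code block indent; any
        # further spaces stay as part of the code block's content.
        return ("indented code", " " * (extra - 4) + text)
    return ("paragraph", text)
```

Applied to the five lines above, the first four classify as paragraph text and only the fifth becomes an indented code block.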
While this is simple when showing these indents and their behavior when starting
from column 1, it easily gets more complicated when nested containers come into
play. That was one of the reasons that I introduced this series of scenario tests:
max series. Starting with the Markdown document for the
test function, I then created the document for the
test function, injecting three spaces between elements instead of zero spaces.
> > > list
> > > list
From there, I pedantically added one extra space character to each level of that
scenario for each container element, increasing the space between container elements
from three spaces to four spaces. So, for the
test function, I added
plus one space character after the first Block Quote element,
resulting in a document of:
> > > list
> > > item
Once that was finished, I also went through the combinations for the other two container elements at that level, producing the required tests for those scenarios.
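The plus-one idea can be sketched as a small generator. This is an assumption about how such documents might be laid out, not the actual PyMarkdown test data: each container marker is preceded by a configurable number of spaces and followed by one space, and a variant is produced by adding one space at a single level. The helper names `build_line` and `plus_one_variants` are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_line(markers: Sequence[str], gaps: Sequence[int], text: str) -> str:
    """Prefix `text` with each marker, indented by its gap and followed by a space."""
    prefix = "".join(" " * gap + marker + " " for marker, gap in zip(markers, gaps))
    return prefix + text


def plus_one_variants(gaps: Sequence[int]) -> List[Tuple[int, ...]]:
    """Return one variant per level, each with a single gap increased by one."""
    variants = []
    for level in range(len(gaps)):
        bumped = list(gaps)
        bumped[level] += 1
        variants.append(tuple(bumped))
    return variants


MARKERS = (">", ">", ">")  # three nested Block Quote markers
MAX_GAPS = (3, 3, 3)       # assumed "maximum" spacing at every level

max_line = build_line(MARKERS, MAX_GAPS, "list")
plus_one_lines = [
    build_line(MARKERS, gaps, "list") for gaps in plus_one_variants(MAX_GAPS)
]
```

Generating the input lines this way only covers the Markdown side. The expected HTML for each variant still has to come from a reference parser.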
It was not complicated, but it took a while to work through everything. For each of the test_nested_three_ modules, twelve new scenarios were added to cover all the combinations. With nine combinations of the base two container elements, that meant that I added ninety-six new scenario tests. And that was just the beginning.
It was also interesting. Plus-one indents at the first two levels were easy to get working correctly, but it was the second lines that gave me the most trouble. Making sure that the containers and indents worked together on that second line was the bulk of the issue that I needed to fix. But it was worth it!
It Just Takes Time
Adding all those scenario tests was a chore, taking a couple of days to work through them all. It just took time to run through each of the ninety-six different scenarios, verify that I had entered everything properly, and generate the HTML output using Babelmark. From there, I ran each scenario test individually against PyMarkdown, noting which scenario tests passed and which scenario tests failed.
In the end, just over half of the scenario tests passed, and just under half of the scenario tests failed. It was not the ratio that I was hoping for, but it was the one I needed to work with. So, little bit by little bit, I started picking up groups of similar failures and dealing with them.
Two weeks after I started working on those failures, I finally had them all passing. It was a real mixed bag of experiences. Some of the failures that I looked at were resolved within a couple of hours of debugging and fixing. Other failures took a day or two for me to just figure out the debugging end of things, with the fixing of the issue taking just as long. In some cases, I was moving code from deeper within the parser code to closer to the surface. For others, it was adding net-new code for a situation that I did not have to cover before.
And the amount of code varied as much as the location of the code and the complexity met in figuring out what that code was. In a couple of cases, I merely had to ensure that a calculated value was being added to another variable in the correct manner. While those were nice to have, most of the code fixes required between ten and fifteen lines of new code, usually to compute a value that was needed in another location. And there was one case where the number of changed lines peaked at around fifty lines.
As I said above, it was a real mixed bag. But as I have said in the last couple of articles I posted, having a good set of processes that work for me really helped. Often, I ended up scribbling down the Markdown document for the scenario and working through it on paper as well as through the code. There are many Blank Characters scribbled on those pages, from a habit I picked up years ago when I started writing parsers. From the above link:
The symbol ␢ has a long history of use for this purpose in early computer programming. It was handwritten on coding sheets by programmers to indicate a space character to punch-card machine operators (who were like a typing pool).
I am not sure how common they are outside of my use of them in parser design, but I find them to be invaluable for clearly writing down what is being parsed.
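The same habit carries over to debugging output. As a minimal sketch, a hypothetical helper can substitute the blank symbol for every space so that the spacing in a line is unambiguous:

```python
BLANK = "\u2422"  # the blank symbol, U+2422


def show_blanks(line: str) -> str:
    """Return `line` with every space character replaced by the blank symbol."""
    return line.replace(" ", BLANK)
```

For example, `show_blanks("> > 1. list")` produces `>␢>␢1.␢list`, which makes it easy to count exactly how many spaces sit between container markers.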
So, it was a long journey to get those scenario tests passing, but I knew it was just a matter of time before I got there.
Still A Bit Left To Go
So the interesting thing about adding a series of combinatorially generated scenario tests is that interesting patterns appear. For this set of scenario tests, I know that there are some simple tests that I am going to have to add in the next series of scenario tests. A good portion of them are for my own benefit, making sure that the project is already covering scenarios that I believe are covered. With my solutions for a small group of third-level nesting scenario tests, I believe that my solutions might cause a problem if extended to a fourth-level nesting scenario. And finally, there are a handful of scenario test variations that I missed with the current set of scenario tests.
None of these issues (or possible issues) were things that I could have spotted ahead of time.
What Was My Experience So Far?
With other things going on in life in the last couple of weeks, it sometimes was a struggle to get some “me” time to work on the PyMarkdown project. But with few exceptions, it was useful time that helped me center myself. There were a couple of issues that I thought I would never solve. In each case, I trusted in my development process, took the breaks that I needed, and used a lot of paper to scribble out possible parsing paths. Because of my confidence in my process, while I knew I might have difficulty figuring out how to solve the current issue, I knew that I would get there.
And it was this week that led me to an interesting conclusion: this is not the last Container Block Processor module that I am going to write. While the last rewrite helped me get closer to a clean implementation, this recent set of issues has led me to believe that I still have some learning left to do. Only once I finish that learning will I properly be able to implement a clean parser.
I am okay with that. It is probably not going to happen for at least a year, and there is no timetable associated with it, but it will happen. And when I get to that point, I believe I will feel that it is the right thing to do.
But right now, it is all about finishing these nested container tests. And I still have work to do!
What is Next?
As one of the sections above noted, I have a handful of extra scenarios that I need to take care of before moving on. Once those are done, I will probably release a new version of PyMarkdown, just to make sure it is current. Stay tuned!