With the leaf blocks mostly in place, as documented in the last article, the next items on the implementation list were the list blocks and the block quote blocks. These Markdown blocks, referred to as Container Blocks in the GitHub Flavored Markdown (GFM) Specification, are the more complicated blocks to deal with, as they are capable of containing other blocks. As there are specific suggestions on how to parse these blocks, my confidence took a hit when I started looking at this section. My viewpoint: if the specification writers thought it was difficult enough to implement that they wrote suggestions on how to handle it, it must not be as easy as the leaf blocks!
What Is the Audience For This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 08 December 2019 and 15 December 2019. This work includes creating the scenario tests for all the Container Blocks as documented in the GFM specification and implementing the parsing to pass most of those tests, except for the nested cases.
Container Blocks, Leaf Blocks, and Interactions (Oh My!)
Before container blocks, parsing was easy. A block starts, and when the parser encounters the termination conditions, it ends. There are a few rules about when blocks can start and end, such as “An indented code block cannot interrupt a paragraph.”, but for the most part, there is little interaction between the leaf blocks. The leaf blocks are clean and tidy. Not so much with container blocks.
Container blocks, by their very definition, contain other blocks, namely leaf blocks and container blocks. While this makes certain visual elements easier, this also means specific rules about what interactions are allowed between the blocks. On top of that, as container blocks can contain other container blocks, testing is required to ensure that an arbitrary number of nested containers is properly supported.
A great example of nesting container blocks is the Markdown implementation of sublists. A list containing a list containing a list is simple in Markdown:
- first level
  - second level
    - third level
That example is not a single list, but three separate lists. The `first level` list contains the `second level` list, which in turn contains the `third level` list. And while sublists are a simple case of container blocks, more complex cases are possible, such as this one:
- first level
  - ```text
    my text
    ```
This list is like the first list, except it contains a fenced code block as the contained block. Both examples are just a few of the possibilities of how container blocks can contain other blocks. Looking through the specification, I quickly lost count of the number of combinations possible.
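To make the "list containing a list containing a list" idea concrete, here is a minimal Python sketch that turns the three-level example into a tree. It is not the project's parser: the `ListItem` type and both function names are hypothetical, and the sketch assumes only `-` bullets and nests purely by counting leading spaces, which is far simpler than the rules in the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ListItem:
    """One list item plus any lists nested inside it (hypothetical type)."""
    text: str
    indent: int
    children: List["ListItem"] = field(default_factory=list)


def parse_nested_bullets(markdown: str) -> List[ListItem]:
    """Build a tree from '-' bullets, nesting by leading-space count only."""
    roots: List[ListItem] = []
    stack: List[ListItem] = []  # currently open items, outermost first
    for line in markdown.splitlines():
        stripped = line.lstrip(" ")
        if not stripped.startswith("- "):
            continue  # assumption: this sketch ignores everything but bullets
        indent = len(line) - len(stripped)
        item = ListItem(text=stripped[2:], indent=indent)
        # Close every open item that cannot be an ancestor of the new one.
        while stack and stack[-1].indent >= indent:
            stack.pop()
        if stack:
            stack[-1].children.append(item)
        else:
            roots.append(item)
        stack.append(item)
    return roots


def depth(items: List[ListItem]) -> int:
    """Number of list levels in the tree."""
    if not items:
        return 0
    return 1 + max(depth(item.children) for item in items)


SAMPLE = "- first level\n  - second level\n    - third level"
tree = parse_nested_bullets(SAMPLE)
print(depth(tree))
```

The stack of open items is the part worth noticing: every container that is still open has to be remembered while its contents are parsed, and that bookkeeping is what grows once fenced code blocks, block quotes, and other lists can all appear inside one another.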
Enter Lazy Continuations
If the interactions between container blocks and the blocks they contain were not a fun enough exercise in mental agility, enter lazy continuations. From the GitHub Flavored Markdown (GFM) Specification’s block quotes section:
Laziness. If a string of lines Ls constitute a block quote with contents Bs, then the result of deleting the initial block quote marker from one or more lines in which the next non-whitespace character after the block quote marker is paragraph continuation text is a block quote with Bs as its content.
and from the list items section:
Laziness. If a string of lines Ls constitute a list item with contents Bs, then the result of deleting some or all of the indentation from one or more lines in which the next non-whitespace character after the indentation is paragraph continuation text is a list item with the same contents and attributes.
Basically, what they are both saying is that if a paragraph has been started within a block quote or within a list AND if a line is clearly a continuation of that paragraph, then it is valid to remove some or all of the container block markers from that line. For a more concrete example, example 211 has the following Markdown:
> bar
baz
> foo
which is parsed the same as if the following Markdown had been written:
> bar
> baz
> foo
After reading those sections and letting them sink in, my confidence took a dip. This was not going to be an easy concept to get right. But the sooner I dealt with those scenarios, the sooner I could try and implement them the right way. Knowing this, I went forward with the implementation phase of the container blocks.
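As a rough illustration of the laziness rule for block quotes, the following Python sketch collects the paragraphs of a single block quote. It is a deliberately narrow sketch, not the project's implementation: the function name is hypothetical, and it assumes every non-blank line is plain paragraph text (no headings, fences, or lists) and ignores the indentation limits in the specification.

```python
from typing import List


def block_quote_paragraphs(lines: List[str]) -> List[str]:
    """Collect the paragraphs of one block quote, honoring lazy continuation.

    Assumption: every non-blank line is plain paragraph text.
    """
    paragraphs: List[str] = []
    current: List[str] = []  # lines of the paragraph that is still open
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(">"):
            text = stripped[1:].strip()
            if text:
                current.append(text)
            elif current:
                # A bare ">" is a blank line inside the quote: paragraph ends.
                paragraphs.append("\n".join(current))
                current = []
        elif stripped and current:
            # No marker, but a paragraph is open: a lazy continuation line.
            current.append(stripped)
        else:
            # Blank line, or unmarked text with no open paragraph: quote ends.
            break
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


lazy = block_quote_paragraphs(["> bar", "baz", "> foo"])
strict = block_quote_paragraphs(["> bar", "> baz", "> foo"])
print(lazy == strict)
```

The only reason the unmarked `baz` line is accepted is that a paragraph is already open when it arrives. That dependence on the state of the innermost block is what makes lazy continuations awkward: the container cannot decide on its own whether a line belongs to it.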
Getting Down to Work - The Easy Scenarios
I often recommend to friends and co-workers that taking a break and doing something unconnected to the “chore” helps your mind get things together. As such, before getting started on this work, I decided to walk our dog for a while and let some of these concepts mull around in my head. I am not sure if it was the exercise or the change in scenery, but it helped to clear the cobwebs from my head and helped me to see things about the project more clearly.
The big thing that it accomplished was to help me cleanly separate out the easy tasks from the more difficult tasks. The easy tasks? Simple block quotes and simple lists, including sub-lists. The difficult tasks? Lazy continuations and mixed container types. I remember feeling that taking this time helped my confidence on the project, as I was taking simple steps to understand where the difficulties were most likely to show up. This process also allowed me to think about those hard issues a bit while implementing the easier features. While I was not devoting any serious time to the more complicated features, it was good to have my mind aware of which sections of code I was going to need to keep flexible going forward.
Keeping this in mind, I started with block quotes, adding the block quote test cases to `test_markdown_block_quotes.py` and disabling any tests that I figured were not in the easy category. I then proceeded to implement the code, in the same way as detailed in the prior article on leaf blocks.
Implementing the easy scenario tests for the block quotes was a decent-sized task, mostly completed during two days on a weekend where I had some time. This also included fixing scenario tests in 6 other test files that have block quotes in their scenarios.
I worked on the basic list items over the next week, and by the middle of the next weekend they were completed in a similar fashion to the block quotes: new scenario tests were added, the easy ones were enabled, tested, and verified for completion, and the more difficult ones were disabled. Getting this right took roughly a week and, as with the block quotes, the work also had an impact on scenario tests other than the ones I added.
During this process, I found the parsing of lists more difficult. Thinking about the implementation in hindsight, I believe it was mostly due to their parsing requirements. Block quotes have a single character, `>`, to consider for parsing, while lists can be unordered and start with the `-`, `+`, or `*` character, or ordered and start with a number followed by the `.` or `)` character. In addition, for ordered lists, there is also the parsing of the start number and how to interpret it. Looking at the two blocks that way, block quote blocks seem a lot easier.
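The marker variations can be sketched with two regular expressions. This is an illustrative sketch based on the marker rules in the GFM specification, not the project's code: `ListMarker` and `parse_list_marker` are hypothetical names, and the sketch does not handle cases such as a line that looks like a bullet but is really a thematic break.

```python
import re
from typing import NamedTuple, Optional


class ListMarker(NamedTuple):
    """Result of recognizing a list marker (hypothetical type)."""
    ordered: bool
    start: Optional[int]  # start number for ordered lists, else None
    delimiter: str        # '-', '+', '*', '.' or ')'
    width: int            # number of characters in the marker itself


# Up to 3 leading spaces, then the marker, then whitespace or end of line.
_BULLET = re.compile(r"^ {0,3}([-+*])(?=[ \t]|$)")
_ORDERED = re.compile(r"^ {0,3}(\d{1,9})([.)])(?=[ \t]|$)")


def parse_list_marker(line: str) -> Optional[ListMarker]:
    """Return the list marker that starts this line, or None."""
    match = _ORDERED.match(line)
    if match:
        digits, delimiter = match.groups()
        return ListMarker(True, int(digits), delimiter, len(digits) + 1)
    match = _BULLET.match(line)
    if match:
        return ListMarker(False, None, match.group(1), 1)
    return None


print(parse_list_marker("* item"))
print(parse_list_marker("12) item"))
print(parse_list_marker("-item"))
```

By comparison, recognizing a block quote only requires checking for a single `>` after the optional leading spaces.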
However, now that I have had a bit of time since that code was written, I believe that those two features were closer in difficulty than I initially thought. Having implemented both block quotes and lists, I think that they both had something difficult that needed overcoming. Since I have written a lot of parsers in my past, the number of variations in parsing the lists was immediately noticeable to me, while the block quotes were easy to parse. Balancing that out, once parsed, the lists were easy to coordinate, while the block quotes took a bit more finessing to get right. In the end, I believe it was a pretty even effort to get both done properly.
At least until nested mixed container blocks.
Nested and Mixed Containers
Nested container blocks, specifically mixed nested container blocks, are where things got messy. To be 100% honest, I am pretty sure I did not get everything right with the implementation, and I already have plans to rewrite this logic. More on that later.
I started implementing these features knowing that they probably made up the remaining 10% of the scenarios. I also figured that to handle these specific scenarios properly would require as much time and effort as the prior 90% of the scenarios. This was not really a surprise, as in software development getting a project to the 70-90% finished mark is almost always the easy part.
Over the next week’s work, I reset my fork of the code back to its initial state 3 or 4 times. In each case, I just got to a point where I either hit a block in going forward, I wasn’t happy and confident about the solution, or both. In one of those cases, the code was passing the scenario tests that I was trying to enable, but it just did not feel like I could extend it to the next scenario. I needed to be honest with myself and make an honest determination of how good the code I just wrote was.
In the end, I completed some of the sublists and nested block quotes, requiring only 4 scenario tests to be disabled or skipped. The ones that were disabled were the 10% of the 10%, the cases where there were 3 or more levels of block quotes and lists mixed together. I was not happy with it, but after a week, I knew I needed to move on with the project. Grudgingly, I acknowledged that I would need to rewrite this later.
Why Rewrite Already?
I am very confident that I coded the easy level cases correctly, as I have solid scenario tests, and a decent volume of them, to test the various use cases. For the medium difficulty cases, such as a container within a container, I have a decent amount of confidence that the scenario tests are capturing most of the permutations. It is the more complicated cases that I really am not confident about. And when I say I am not confident, it is not that I am unsure whether the code is handling a given test properly: that is a binary thing. The test is passing, or the test is failing and thus disabled. What I am not confident about is that the code works for all use cases like the ones those scenario tests represent.
Part of any project is learning what works and what does not work. As I started looking at implementing example 237, I read the following paragraph located right before the example:
It is tempting to think of this in terms of columns: the continuation blocks must be indented at least to the column of the first non-whitespace character after the list marker. However, that is not quite right. The spaces after the list marker determine how much relative indentation is needed.
It was then that I was pretty sure I had coded the container blocks in terms of columns instead of spaces. Add that to the list of rewrites needed.
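The difference between the two views can be sketched in Python. This is an illustration of the rule as the specification describes it, not the project's code: all three function names are hypothetical, tabs are ignored, and only the simple cases are covered. The required indentation is computed from the marker width plus the spaces that follow it, and it is measured after the enclosing block quote markers have been removed.

```python
import re
from typing import Optional

# Leading spaces, the list marker, the spaces after it, and the rest.
_ITEM = re.compile(r"^( {0,3})([-+*]|\d{1,9}[.)])( *)(.*)$")


def strip_block_quote_markers(line: str) -> str:
    """Remove leading '>' markers, each with one optional following space."""
    return re.sub(r"^(?: {0,3}> ?)+", "", line)


def required_indent(first_line: str) -> Optional[int]:
    """Spaces a later line needs to stay inside this list item, or None."""
    match = _ITEM.match(first_line)
    if match is None:
        return None
    lead, marker, spaces, rest = match.groups()
    if not spaces and rest:
        return None  # e.g. "-item": the marker is not followed by a space
    count = len(spaces)
    if not rest or count > 4:
        # Item starts with a blank line or an indented code block: the
        # specification counts only one space after the marker.
        count = 1
    return len(lead) + len(marker) + count


def belongs_to_item(first_line: str, later_line: str) -> bool:
    """True if later_line is indented enough to continue the list item."""
    needed = required_indent(first_line)
    if needed is None:
        return False
    actual = len(later_line) - len(later_line.lstrip(" "))
    return actual >= needed


first = strip_block_quote_markers("   > > 1.  one")
later = strip_block_quote_markers(">>     two")
print(required_indent(first), belongs_to_item(first, later))
```

In the raw text, `two` starts in the same column as the `1.` marker, so a purely column-based check would reject it. Measured relative to the innermost block quote marker, it has the four spaces the item requires.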
The other category where my confidence is not high is multiple levels of mixed container blocks. Once I complete the rewrite above, I can properly evaluate how well I can nest the containers, but at that moment my confidence was not high. At that point, example 237 will be a good scenario test to determine how well I have those set up. Having taken some time to really evaluate the code and the scenario tests, I have a suspicion that there are at least one or two bugs in the code that I wrote. For now, that is on my list of possible rewrites, with a medium to high probability of being needed.
The saving grace for both of these scenarios that I believe need rewrites? Their frequency. The scenarios for blocks, both leaf blocks and container blocks, comprise about half of the specification, ending with example 306. According to my test failure report, only 4 of the list block tests had to be marked as skipped because they were not passing. At approximately 1.3% of those 306 scenarios, that is not a big impact. In writing this blog, I have used lists frequently, block quotes sporadically, and block quotes with lists even less. I am not sure if my writing is representative of everyone’s writing, but at least for now, it is a good place to start.
What Was My Experience So Far?
All the leaf blocks were finished in about a week. The easy and medium cases for the container blocks were finished in about a week. The hard cases for the container blocks… not finished after a week, but close.
Was I disappointed? Sure. But in comparison to other issues I have had with projects, this was not even near the top 20 in terms of disappointment. To be honest, in terms of how projects have gone for me over the years, this has been a decent project to work on. Every project has its issues, and this was just the set of issues that happened to occur on this project.
I know it may sound a bit silly, but my immediate family and I have a saying we like to repeat when things get tough: “Stuff[1] happens, pick yourself up, dust yourself off, and figure out what to do next.” The disabled tests happened, so I took some time to find my focus and came up with a plan to deal with them. Not a great plan, but it meant I could go forward with the remaining scenarios and circle back once I accumulated more experience with the parser.
Sure, there was already some technical debt for this project, but other than that, I believed it was going well. At this point it was just before Christmas, and I had a Markdown parser that was coming along well. My confidence in the implemented leaf blocks was high, as was my confidence in the easy 90% of the container block implementation. The more difficult 10% of the container blocks was still undecided, but I had a plan to deal with it going forward. While not a sterling situation, it was definitely a good position for me to be in.
What is Next?
Before I took some time to improve my PyScan tool, I worked on adding HTML block support for the PyMarkdown project. As HTML in Markdown has some funny logic associated with it, the next article will be devoted entirely to the HTML blocks.
[1] When my kids were younger, I did indeed use the word “stuff”. As my kids got older, we changed that word to another one that also starts with “s”. The actual word that we now use should be easy to figure out!