Summary

In my last article, I continued working on Block Quote issues and on issues with how Block Quotes interact with List Blocks. In this article, I document the work that was done to address and resolve those issues.

Introduction

With a couple of weeks off for the holidays and a relatively light “honey-do” schedule around the house, I was hoping to have some good relaxing days along with some good productive project days. In my head, I was starting to evaluate each of the remaining items in the list, assigning them a priority between 1 (highest) and 4 (lowest). I knew that completing the Block Quote issues and Block Quote-adjacent issues had a priority of 1, but I needed to take some time to think about the others. And while I was thinking about those others, it was a good time to get some solid work in!

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 22 Dec 2020 and 27 Dec 2020.

Creating A New Scenario Series

After doing a block of work with Block Quotes, I thought it was a good time to start making a new scenario test group for them, Series N. To start this new series off, I simply copied and pasted example scenarios from other series and began to work through the permutations. Once I had those roughly filled out, I created the test_markdown_paragraph_series_n.py module to hold the scenario tests, and just started working on putting them together. After I had the first group of tests in Series N filled out, it was almost cut-and-paste to get the next group of tests ready, with a simple change to the specific Markdown element I was testing in that group.

As nothing is ever easy (it seems…), I ran into some failures that I needed to fix. The first set of failures occurred with unquoted HTML Blocks, in that those HTML Blocks did not end the Block Quote element. This was quickly followed by the same issue occurring with Atx Heading elements. As I had already solved this problem for List Blocks, it was quickly fixed by adding checks in the check_for_lazy_handling function for either an Atx Heading element starting (LeafBlockProcessor.is_atx_heading) or an HTML Block element starting (HtmlHelper.is_html_block).
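To illustrate the shape of that change, here is a minimal, self-contained sketch of the idea; the real check_for_lazy_handling function works on the parser state and calls LeafBlockProcessor.is_atx_heading and HtmlHelper.is_html_block, so the function name, signature, and string checks below are simplifying assumptions rather than the project’s actual code:

    import re


    def starts_new_leaf_block(line_to_parse):
        # Rough stand-in for the two new checks added to check_for_lazy_handling:
        # does this lazy continuation line actually start a new leaf block?
        stripped = line_to_parse.lstrip()
        is_atx_heading = bool(re.match(r"#{1,6}(\s|$)", stripped))
        is_html_block = stripped.startswith("<")
        return is_atx_heading or is_html_block


    # If the line starts one of those leaf blocks, the open Block Quote element
    # must be closed instead of lazily absorbing the line.
    assert starts_new_leaf_block("# A heading")
    assert starts_new_leaf_block("<div>")
    assert not starts_new_leaf_block("just some plain text")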

At the same time, to handle two extra list-based scenarios, I added the test_block_quotes_213aa and test_block_quotes_213ab functions as variations on the test_block_quotes_213a test function. I tried for a good hour or so to resolve the issues with all these functions, but I did not make a lot of progress. In the end, I got test function test_block_quotes_213ab working, but functions test_block_quotes_213a and test_block_quotes_213aa just did not seem to want to work.

My main priority at the time was filling out the Series N set of tests, so I left those two tests disabled while I proceeded to fill out the Series N tests. Nothing exciting to report from that task… it was just a lot of moving, documenting, checking, and rechecking. You know, the normal stuff.

Addressing Those Tests

Taking a good, hard look at those disabled tests, I was eventually able to see that a good group of those tests were dealing with nested containers. Specifically, they were dealing with a Block Quote element inside of a List Block element, or the other way around. Putting the disabled tests, their Markdown documents, and their expected tokens under a metaphorical microscope, it was a while before I noticed something that I had previously missed. When I followed the examples closely, it looked like the nested container elements were not being properly closed in each example.

In debugging this issue, I added a handful of extra tests to help me analyze the situation. The Markdown for one of those tests, the test_block_quotes_extra_02a test function, is as follows:

> start
> - quote
>
> end

and the list of tokens produced for that Markdown were:

[
    '[block-quote(1,1)::> \n> \n>\n> ]',
    '[para(1,3):]',
    '[text(1,3):start:]',
    '[end-para:::True]',
    '[ulist(2,3):-::4:  ]',
    '[para(2,5):]',
    '[text(2,5):quote:]',
    '[end-para:::True]',
    '[BLANK(3,2):]',
    '[para(4,3):]',
    '[text(4,3):end:]',
    '[end-para:::True]',
    '[end-ulist:::True]',
    '[end-block-quote:::True]'
]

As I worked through the example, following the Markdown document and the parser’s tokens, I discovered the issue. The Blank Line element on line 3 ended the preceding Paragraph element, and I was good with that. But more importantly, because that Paragraph element was closed due to the Blank Line being encountered, the new text on line 4 was not eligible to be included as paragraph continuation text. Based on that information, the new Paragraph element for line 4 needed to be created outside of the List Block element, as it did not have the indentation to support being included in the list.

Basically, the end Unordered List Block token was in the wrong place. To fix it, I would need to move it above the occurrence of the Blank Line element.
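Conceptually, the relevant portion of the corrected token stream would look something like the following, with the caveat that other token fields, such as the Block Quote token’s leading text, might also need to shift:

    '[end-para:::True]',
    '[end-ulist:::True]',
    '[BLANK(3,2):]',
    '[para(4,3):]',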

Fixing It Right!

As this was an issue with the tokenization of the Markdown document, I needed to start with the parser and the BlockQuoteProcessor class. From the above information, I knew that I needed to alter the way in which the parser_state.handle_blank_line_fn function was called, as it was not properly closing the right blocks. After trying a few other things, I reverted to a very basic approach to solve the issue: forgo the complicated calculations and just do the simple calculation where needed. So, in the __handle_block_quote_section function, before the handle_blank_line_fn function is called, a quick calculation is done to see if the List Block should possibly be closed. If so, the actual index number is set in the forced_close_until_index variable. Then, with a small change to the handle_blank_line_fn function, that variable is passed in and all tokens on the stack are closed until the stack reaches that point.
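As a rough, self-contained approximation of that forced-close idea (the real logic lives in the __handle_block_quote_section and handle_blank_line_fn functions, so the names below are simplifications, not the project’s actual code):

    def close_open_blocks_until(token_stack, forced_close_until_index):
        # Pop (close) open blocks until the stack is trimmed down to the
        # requested index, returning the end tokens that were emitted.
        closed_tokens = []
        while len(token_stack) > forced_close_until_index:
            closed_tokens.append("[end-" + token_stack.pop() + "]")
        return closed_tokens


    # With a List Block open inside a Block Quote, a blank line followed by
    # unindented text means the list must be closed, but not the block quote.
    stack = ["document", "block-quote", "ulist"]
    print(close_open_blocks_until(stack, 2))  # ['[end-ulist]']
    print(stack)                              # ['document', 'block-quote']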

At the time, it felt like I was using a large sledgehammer to fix a small problem, but it worked. The tokens were being generated properly, and I was able to move on from there. While I could then go on to describe all the problems I had with the Markdown transformer, I will spare any reader the pain. While the basic handling of Block Quote elements was working fine, the proper handling of indents within Block Quote elements needed a good solid day’s overhaul to get it right. It was a good thing I was on holiday, because I had that time to devote to working through all the issues and getting the Markdown rehydration of Block Quotes and their indents just right.

Following that, I expected a lot of issues with the consistency checks, but there was only one major issue. In the __validate_new_line function, the leading_text_index member variable of the Block Quote token was not being updated properly, resulting in an off-by-one error. Unlike the problems with the Markdown transformer, this issue took less than half an hour to find and fix. Phew!

In the end, I was able to enable four of the test_block_quotes_213 test functions and all five of the new test functions that I had added. That was a good feeling.

Still Missing One

After a good night’s worth of sleep, I got back to tackling the disabled functions, specifically test function test_block_quotes_213aa. With a Markdown document of:

> - foo
>   - boo
> - bar

it was immediately obvious that this was somewhat related to the work that I had completed the day before. Figuring out exactly how was what stumped me for a while.

Adding some debug, I started to suspect that the “can I start a container block” logic in the __get_nested_container_starts function was not working properly. Adding more specific debugging in that function, my guess was confirmed. Specifically, there were cases where I felt it was obvious that a new List Block element should start, but the code was skipping over those chances.

Stepping through that code, I noticed weird behavior when I got to this part of that function:

        nested_ulist_start, _ = ListBlockProcessor.is_ulist_start(
            parser_state, line_to_parse, 0, ""
        )
        nested_olist_start, _, _, _ = ListBlockProcessor.is_olist_start(
            parser_state, line_to_parse, 0, ""
        )
        nested_block_start = BlockQuoteProcessor.is_block_quote_start(
            line_to_parse, 0, ""
        )

Looking at the debug output for that code while debugging the test_block_quotes_213aa function, it was obvious that on the second line of the document, the code was not determining that a new block was starting. It just skipped right over it.

Thinking through the issue while I was debugging, this started to make sense. It was not triggering on any of the above start checks because the wrong information was being presented to it. In fact, the data presented to it for line 2 of the above Markdown was literally two space characters followed by the Unordered List Block start character (-). While that specific form of data made sense for the __get_nested_container_starts function, trying to invoke any one of the three Container block start functions with that data would never succeed. Each of those functions expected the line to at least be minimally processed, and that line of data clearly had not been.

Trying the simple approach first, I tested the following change while debugging function test_block_quotes_213aa:

        after_ws_index, ex_whitespace = ParserHelper.extract_whitespace(
            line_to_parse, 0
        )

        nested_ulist_start, _ = ListBlockProcessor.is_ulist_start(
            parser_state, line_to_parse, after_ws_index, ex_whitespace
        )

and it worked! Debugging through the test again, that processing allowed the is_ulist_start to work with the processed line data, resulting in a non-None result being returned. After applying that same change to the other two start Container block functions, I ran the tests again and everything was good. The tests were passing without any additional changes being required of the Markdown transformer or the consistency checks. I was stoked!

Making the Test Invocation More Efficient

As I was making the changes to all these scenario tests, one thing was becoming obvious to me: I was not changing the call pattern. In the beginning of the project, I was concerned that I was going to have a wide array of ways to invoke the scenario tests based on need. While it was a valid concern at the time, it had not played out that way. For each scenario test, I was always adding the following boilerplate code:

    # Act
    actual_tokens = tokenizer.transform(source_markdown)
    actual_gfm = transformer.transform(actual_tokens)

    # Assert
    assert_if_lists_different(expected_tokens, actual_tokens)
    assert_if_strings_different(expected_gfm, actual_gfm)
    assert_token_consistency(source_markdown, actual_tokens)

While those seven lines are not a lot, over the course of approximately 2,000 scenario tests, those lines added up. Thinking about it a bit, I realized that I was not going to change this format because it had consistently worked well for me. Declared before this code block in each test, the variables source_markdown, expected_tokens and expected_gfm held the relevant information for each test. Once set, I could not think of any reason to deviate from the pattern of calling the three validation functions, one after another.

At the same time, if I was going to make this change on a large scale, I wanted to start out with a smaller scope of work and validate it first. At that point, I created a new act_and_assert function to contain those seven lines of code, and changed that code block from above to the following:

    # Act & Assert
    act_and_assert(source_markdown, expected_gfm, expected_tokens)
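For reference, here is a minimal sketch of what that helper amounted to; it simply wraps the boilerplate shown above, and the construction of the tokenizer and transformer inside the helper is my assumption, not necessarily how the project arranges it:

    def act_and_assert(source_markdown, expected_gfm, expected_tokens):
        # Arrange (assumed construction of the two transform objects)
        tokenizer = TokenizedMarkdown()
        transformer = TransformToGfm()

        # Act
        actual_tokens = tokenizer.transform(source_markdown)
        actual_gfm = transformer.transform(actual_tokens)

        # Assert
        assert_if_lists_different(expected_tokens, actual_tokens)
        assert_if_strings_different(expected_gfm, actual_gfm)
        assert_token_consistency(source_markdown, actual_tokens)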

After testing it out on a couple of scenario tests, I applied it to all the scenario tests in the test_markdown_paragraph_series_n.py module and the test_markdown_block_quotes.py module. Making sure the scenario tests all worked, including introducing some false failures that I immediately fixed, I decided to leave it alone for a while. My thought process was that I would let it “set” in my mind. If I was still interested in making the change after a week or so, I could work on it over the course of a couple of weeks.

Looking at the issues list, the following item gained my attention:

- split up link definition within a block quote or list?

While I was not sure it was going to be easy, I at least figured that it would be fun! At the base level, it was an interesting question. Block Quote elements and List Block elements altered the way a line was processed, and Link Reference Definitions could span lines and not appear to follow the right rules. To that end, I added three new scenario tests, test_block_quotes_extra_03x to test_block_quotes_extra_03b, and three more, test_list_items_extra_01x to test_list_items_extra_01b.

The bad news first. Try as I might, I was not able to get the new Block Quote tests to pass within a decent amount of time. As such, I had to commit them as disabled tests.

The ContainerBlockProcessor was largely unchanged, with the changes in that class made to remove or simplify code rather than to fix issues. The Markdown transformer was altered to produce the proper output if there were any leading spaces in the list that were not being merged in by the Link Reference Definition token. Those were the easier two fixes. The hard changes involved the HTML transformer.

List Looseness… Again

While I thought all my issues regarding list looseness were behind me, I was wrong. Specifically:

A list is loose if any of its constituent list items are separated by blank lines, or if any of its constituent list items directly contain two block-level elements with a blank line between them.

Breaking that down, a list is loose if, for any item within the list:

- it is not the first item and is preceded by a blank line
- it is not the last item and is followed by a blank line
- any item contains any two block elements that are separated by a blank line
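For example, the blank line between the two items in the following Markdown makes the list loose, so each item’s text gets wrapped in its own paragraph when rendered to HTML:

- item one

- item two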

How is this related to Link Reference Definitions? When the PyMarkdown parser processes a Link Reference Definition element, it creates a token that represents that element so that it can be analyzed. However, from an HTML point of view, it does not exist. As such, the following Markdown:

- [foo]: /url "title"
- [foo]

is for HTML output purposes equal to the Markdown:

- 
- [foo]

Basically, from the HTML transformer’s point of view, it is a Blank Line element. I had some code in the __is_token_loose function to report a Link Reference Definition token as a Blank Line element, and code in the __calculate_list_looseness function to trigger off such a block, but it was not that simple. Rereading the above section on looseness, that Blank Line element only really mattered if it occurred between two blocks inside of the list item.

Fixing that specific case took a bit of work in the __calculate_list_looseness function to redesign the way I was handling those cases. In cases where the previous token was a Blank Line, I had to change the algorithm to look before that element and see if there was a valid block token there. If so, the logic from the last point above would come into play, and it would discover a valid case to check. Otherwise, the Blank Line token or the Link Reference Definition token would be intentionally overlooked, and the search would continue.
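As a self-contained approximation of that adjusted check (the real logic is spread across the __is_token_loose and __calculate_list_looseness functions, and the token model below is a deliberate simplification):

    BLANK_LIKE = {"blank-line", "link-reference-definition"}
    BLOCK_LIKE = {"paragraph", "atx-heading", "html-block", "ulist", "olist"}


    def is_separating_blank(tokens, index):
        # A blank-like token only counts towards looseness if a real block
        # token precedes it; otherwise it is intentionally overlooked.
        if tokens[index] not in BLANK_LIKE:
            return False
        look_behind = index - 1
        while look_behind >= 0 and tokens[look_behind] in BLANK_LIKE:
            look_behind -= 1
        return look_behind >= 0 and tokens[look_behind] in BLOCK_LIKE


    # The Link Reference Definition is the first thing in its list item, so it
    # is skipped and the list stays tight.
    print(is_separating_blank(["link-reference-definition", "paragraph"], 0))  # False
    # A blank line between two blocks inside an item does make the list loose.
    print(is_separating_blank(["paragraph", "blank-line", "paragraph"], 1))    # True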

Repeat, With Block Quotes

It was no surprise that I had to do similar work with Block Quote elements to make sure that they were also working properly. To accommodate that, I added five new scenarios and did a lot of debugging. In the end, the fix was to add a flag in the __handle_blank_line function that triggered when a Link Reference Definition had been started. Rather than avoid the default handling in this case, that handling was carried out and its tokens were appended to the already created tokens, resulting in the proper output.
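A hypothetical, self-contained sketch of the shape of that fix follows; the flag name, the signature, and the stand-in for the default handling are all assumptions rather than the project’s actual code:

    def handle_blank_line(already_created_tokens, within_link_reference_definition):
        default_tokens = ["[BLANK:]"]  # stand-in for the default blank line handling
        if within_link_reference_definition:
            # When a Link Reference Definition had been started, the default
            # handling is still carried out and its tokens are appended to the
            # tokens already produced, instead of being skipped.
            return already_created_tokens + default_tokens
        return default_tokens


    print(handle_blank_line(["[end-link-reference-definition]"], True))
    # ['[end-link-reference-definition]', '[BLANK:]']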

Once that fix was made, getting the consistency checks to agree with it was easy, with one calculation for the leading_text_index being off by 1. After a simple adjustment to compensate for that case, those tests were passing. After only an hour and a half of work, I knew I was lucky that it did not end up taking as long to fix as the last set of issues. I was grateful!

Evaluating How I Was Spending My Holiday

I do not think that I am a workaholic; I just like to keep busy. Whether it is figuring out how to do something with home automation, working on a puzzle, cleaning up the garage (again!?), or working on this project, it does not matter. If it is something that keeps me from spending hours doing one thing with nothing recognizable to show for it, I am good. Spending time with my wife helping her out with shopping works wonderfully for that. I get to spend time doing crossword puzzles and other fun things on my phone while waiting for her to finish at the various stores she needs to go to. For me, it is all about perspective.

So, it was part of the way through Christmas Eve Day when I thought about this with respect to the project. While it was true that I was resolving items from the issues list, it just was not feeling like the right thing to do at that time. As I had the weeks before and after Christmas off, I sat back and decided that I wanted to do something for the project that would be hard to do at any other time. I mean, since I had the time, was I using it to my best advantage?

Within minutes, I had an answer of “No, I was not using my time wisely!”. I had a couple of long-term things I wanted to do with the code base, but those things would be hard to fit into a normal week’s worth of work. It was time to shift gears.

Reducing the Code Required To Check For A Specific Token

While this was done very leisurely over two days, this change was something I had wanted to do for a while. As the project’s growth was organic, I had started out checking for the presence of a given token by looking at its name. At that time, given the rule module rule_md_019.py, if I wanted to see if a given token was a paragraph, I would use:

if token.type_name == MarkdownToken.token_paragraph:
    pass

There is nothing intrinsically wrong with that comparison, but it did have things going against it. The first was that I had to make sure to import the MarkdownToken class into any module where I wanted such a comparison. Next was my observation that I usually had to go to the markdown_token.py module and find out exactly how the token’s type name was spelled, hopefully avoiding any naming errors. Finally, I found it bulky. I guess if I had to add more description to bulky, it was that it took an awful lot of typing to figure out that the token was a paragraph. And if it took a lot of typing, it would also take a lot of reading to do the same thing. I needed something simpler.

Working through a couple of possibilities, I decided to use a simple approach and replace that entire line with:

if token.is_paragraph:
    pass

It required no extra imports, the naming was easy to remember, and it was as simple as I could make it without resorting to short forms that would make it harder to read. Sure, is_para might be more compact, but what was I really going to save with that?
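As a minimal sketch of how such a property can be exposed on the base token class, reusing the same comparison that the old code performed at every call site (the token name value here is illustrative only):

    class MarkdownToken:
        token_paragraph = "para"  # illustrative value

        def __init__(self, type_name):
            self.type_name = type_name

        @property
        def is_paragraph(self):
            # The comparison is now written once, inside the token class itself.
            return self.type_name == MarkdownToken.token_paragraph


    token = MarkdownToken(MarkdownToken.token_paragraph)
    assert token.is_paragraph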

Over the course of approximately ten commits, I transferred all the Markdown Tokens from the old way of comparing token types to the new way. I started with Paragraphs and the other Leaf Block tokens, moving on to Container Block tokens, before finishing up with Inline tokens. By the time I was done, there were more lines and functions present in the base MarkdownToken class, but the entire code base seemed to read better.

And that was the point of these changes. No scenario tests were changed as a part of this code base change, I just wanted it to read better!

Do I Make The Code More Object Oriented?

Having made those changes, I felt emboldened to make another set of changes that I had been itching to do for a while: making the code more object oriented.

While each programming paradigm has its good points and bad points, object-oriented programming is something I have had a lot of experience with. At its core, object-oriented programming has three pivotal concepts: encapsulation, composition, and polymorphism. As the PyMarkdown project generates tokens after parsing the Markdown document, I was most interested in encapsulation and polymorphism and how they could be applied to tokens.

From an encapsulation point of view, with a couple of exceptions [1], there is no reason to want to change the values of any of those parsed tokens. In the case of those exceptional tokens, there is a multiline concept present in each token that needs to be properly tracked by any after-parsing process. Other than providing that tracking information during a specific after-parsing process, those fields do not have any meaning to a consumer. Therefore, it is okay that they are exposed. However, I did not feel that the same was true for the other values.

And from a polymorphism point of view, it makes sense to me that the tokens should be organized into groups of tokens that have similar properties: Container Blocks, Leaf Blocks, and Inlines. As such, it also follows that some properties and methods are present in one or more of those groups of tokens, but not in all of them. So having a base class for each of those groups is something to work towards.

For both reasons, it just made sense to me to proceed!

Starting with EndMarkdownToken

With the decision made to move forward with this transition, I started at the base by making those changes with the EndMarkdownToken class. To be honest, as I worked through the project’s development, I always felt a bit guilty that I had let the use of this token get out of control.

From my observations, this class is a “hidden” fourth class of tokens, providing a close token to any of the tokens that provide for start functionality. As this functionality ranges across all three groups of tokens, the EndMarkdownToken exists off to the side in its own space. Because of this special grouping, it was the ideal token to start with.

To change this token, I started off with the EndMarkdownToken.type_name_prefix field, which contains the string that gets prepended to the start of any EndMarkdownToken type name. Moving it into the MarkdownToken class made the use of that field self-contained, allowing it to be modified into a protected field. [2] To accomplish that change, I created a new generate_close_markdown_token_from_markdown_token function that allowed me to generate the proper EndMarkdownToken from a given Markdown token.

To me, this just made sense. By performing these changes, I had an easy way of generating an EndMarkdownToken that was documented and set all the right fields. In particular, I was able to make sure that the start_markdown_token field was set. As it had not been set consistently, there were places in the code base where I had to do workarounds to try and figure out which start token belonged to a given end token. This was just cleaner.
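A hypothetical sketch of that idea follows; the real classes carry many more fields and parameters, so the names and shapes below are assumptions, meant only to show how building the end token from the start token keeps start_markdown_token set:

    class MarkdownToken:
        _end_token_prefix = "end-"  # moved into the base class, protected by convention

        def __init__(self, token_name):
            self.token_name = token_name

        def generate_close_markdown_token_from_markdown_token(self):
            # Build the matching end token directly from this start token so
            # that the start_markdown_token field is always set consistently.
            return EndMarkdownToken(self)


    class EndMarkdownToken(MarkdownToken):
        def __init__(self, start_markdown_token):
            super().__init__(
                MarkdownToken._end_token_prefix + start_markdown_token.token_name
            )
            self.start_markdown_token = start_markdown_token


    end_token = MarkdownToken("block-quote").generate_close_markdown_token_from_markdown_token()
    assert end_token.token_name == "end-block-quote"
    assert end_token.start_markdown_token.token_name == "block-quote"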

Simple Refactorings

Two of the paradigms that have served me well in my years of software development are D.R.Y. and S.O.L.I.D. The ‘L’ in S.O.L.I.D. stands for the Liskov substitution principle, which (in its abbreviated form) states that an object should be able to be replaced with another object that adheres to the same contract without any parent objects being aware of the change.

With respect to the PyMarkdown project, I had been using the isinstance built-in to deal with things instead of following the principle. However, with the recent round of refactorings completed, I was able to fix that. While it may not seem like much, it made me feel better to change:

    if isinstance(token, AtxHeadingMarkdownToken):
        pass

into a more SOLID-like:

    if token.is_atx_heading:
        pass

While I know that there is only one implementation of the AtxHeadingMarkdownToken class, I was just not comfortable with having calls of isinstance scattered throughout the code base. Over two commits, I was able to eliminate all uses of isinstance except for six legitimate uses of it in pivotal parts of the code base.

As with the other changes, this change was a simple change, but an effective one. With all the checks for various tokens using a single manner of access, it was now easier to scan the code base for locations where a given token was referenced. That, and to me, it just looked neater.

Cleaning Up The Leaf Block Tokens and EndMarkdownToken

All those changes inspired me, and I gave myself the green light to further encapsulate the tokens’ fields as Python private fields. It was not even remotely exciting, but it was something I knew would make the code cleaner and more maintainable. Using this stripped-down example of the EndMarkdownToken, this is where all the tokens started:

class EndMarkdownToken(MarkdownToken):
    def __init__(
        self,
        type_name,
    ):
        self.type_name = type_name

For each of the Leaf Block tokens, I started by adding two underscores to the start of the field name, per Python’s convention for private fields. But as that made the field “invisible” to other objects, I then needed to add a new function with the property decorator to be able to retrieve the value.

class EndMarkdownToken(MarkdownToken):
    def __init__(
        self,
        type_name,
    ):
        self.__type_name = type_name

    @property
    def type_name(self):
        return self.__type_name

Lather, rinse, repeat. Nothing special or exciting, which was what I was going for. There were no weird cases with these tokens that required a special setter function, so no workarounds were needed. And because the previously public field and the newly added property had the exact same name, no changes were needed in other parts of the code.

What Was My Experience So Far?

I really cannot lie on this one. While it was good to resolve some items off the issues list, it was really good to get these refactorings done. Those ideas on how to clean up the project had been in my mind for weeks, and just getting them done gave me a bit of an extra pep in my step. It is not that they were any more or less important than the other items, it is just that their large scope meant it was going to be hard to fit them in. As such, they always went into the “if I ever have some time” bucket… and were never seen again.

I honestly think that taking the time to focus on what was important to me with this project was good for me and for the project. It helped me refocus on the things that I find important. It helped the project by presenting a more concise and readable code base to any interested parties. In all, it was a win-win.

What is Next?

With a week left of holidays, I had hoped to get a similar mix of items dealt with in the next week as I had the week before. Stay tuned!


  1. The exceptions are the rehydrate_index field for Paragraph tokens, the leading_spaces_index field for List Block tokens, and the leading_text_index field for Block Quote tokens.

  2. Python does not have a protected field specifically. The common practice for naming fields to be used in a protected-like manner is to preface them with a single _ character.

Comments

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.

