In my last article, I continued working on Block Quote issues and issues with Block Quotes and their interactions with List Blocks. In this article, I document that work that was done to address those issues and resolve them.
With a couple of weeks off for the holidays and relatively light “honey-do” schedule around the house, I was hoping to have some good relaxing days along with some good productive project days. In my head, I was mentally starting to evaluate each of the remaining items in the list, assigning them a priority between 1 (highest) and 4 (lowest). I knew completing the Block Quote issues and Block Quote adjacent issues has a priority of 1, but I needed to take some time to think about the others. And while I was thinking about those others, it was a good time to get some solid work in!
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 22 Dec 2020 and 27 Dec 2020.
Creating A New Scenario Series¶
After doing a block of work with Block Quotes, I thought it was a good time to start
making a new scenario test group for them, Series N. To start this new series off, I
simply copied and pasted
example scenarios from other series and began to work through the permutations. Once
I had those roughly filled out, I created the
module to hold the scenario tests, and just started working on putting them together.
After I had the first group of tests in Series N filled out, it was almost cut-and-paste
to get the next group of tests ready, with a simple change to the specific Markdown
element I was testing in that group.
As nothing is ever easy (it seems…), I ran into some failures that I needed to fix.
The first set of failures occurred with unquoted HTML Blocks, in that those HTML
Blocks did not end the Block Quote element. This was quickly followed up by the same
issue occurring with Atx Heading elements. As I had already solved this problem for
List Blocks, it was quickly fixed by adding checks in the
function for both the Atx Heading element starting (
or the HTML Block element starting (
At the same time, to handle two extra list-based scenarios, I added the
test_block_quotes_213aa function and
test_block_quotes_213ab function as variations
test_block_quotes_213a test function. I tried for a good hour or so to resolve
the issues with all these functions, but I did not make a lot of progress. In the
end, I got test function
test_block_quotes_213ab working, but functions
test_block_quotes_213aa just did not want to seem to
My main priority at the time was filling out the Series N set of tests, so I left those two tests disabled while I proceeded to fill out the Series N tests. Nothing exciting to report from that task… it was just a lot of moving, documenting, checking, and rechecking. You know, the normal stuff.
Addressing Those Tests¶
In looking good and hard at those disabled tests, I was able to eventually see that a good group of those tests were dealing with nested containers. Specifically, they were dealing with a Block Quote element inside of a List Block element, or the other way around. Putting the disabled tests, their Markdown documents, and their expected tokens under a metaphorical microscope, it was a while before I noticed something that I had previously missed. When I followed the examples closely, it looked like Block Quote elements within a List Block element were not being properly closed in each example.
In debugging this issue, I added a handful of extra tests to help me analyze the
situation. The Markdown for one of those tests, the
function, is as follows:
> start > - quote > > end
and the list of tokens produced for that Markdown were:
[ '[block-quote(1,1)::> \n> \n>\n> ]', '[para(1,3):]', '[text(1,3):start:]', '[end-para:::True]', '[ulist(2,3):-::4: ]', '[para(2,5):]', '[text(2,5):quote:]', '[end-para:::True]', '[BLANK(3,2):]', '[para(4,3):]', '[text(4,3):end:]', '[end-para:::True]', '[end-ulist:::True]', '[end-block-quote:::True]' ]
As I worked through the example, following the Markdown document and the parser’s tokens, I discovered the issue. The Blank Line element on line 3 ended the first Paragraph element, and I was good with that. But more importantly, because that Paragraph element was closed due to the Blank Line being encountered, the new text on line 4 was not eligible to be included as paragraph continuation text. Based on that information, the new Paragraph element for line 4 needed to be created outside of the List Block element, as it did not have the indentation to support being included in the list.
Basically, the end Unordered List Block token was in the wrong place. To fix it, I would need to move it above the occurrence of the Blank Line element.
Fixing It Right!¶
Being an issue with the tokenization of the Markdown Document, I needed to start
with the parser and the
BlockQuoteProcessor class. From the above information, I knew
that I needed to alter the way in which the
was called, as it was not properly closing the right blocks. After trying a few
other things, I reverted to a very basic approach to solve the issue: forgo the
complicated calculations and just do the simple calculation where needed. So, in the
__handle_block_quote_section function, before the
handle_blank_line_fn function is
called, a quick calculation is done to see if the List Block should possibly be closed.
If so, the actual index number is set in the
forced_close_until_index variable. Then,
making a small change to the
handle_blank_line_fn function, that variable is passed in
and all tokens on the stack are closed until they get to that point.
At the time, it felt like I was using a large sledgehammer to fix a small problem, but it worked. The tokens were being generated properly, and I was able to move on from there. While I could them go on to describe all the problems I had with the Markdown transformer, I will spare any reader the pain. While the basic handling of Block Quote elements was working fine, the proper handling of indents within Block Quote elements needed a good solid day’s overhaul to get it right. It was a good thing I was on holiday, because I had that time to devote to working through all the issues and getting the Markdown rehydration of Block Quotes and their indents just right.
Following that, I expected a lot of issues with the consistency checks, but there was
only one major issue. In the
__validate_new_line function, the
member variable of the Block Quote token was not being updated properly, resulting in
an off-by-one error. Unlike the problems with the Markdown transformer, this issue
took less than half an hour to find and fix. Phew!
In the end, I was able to enable four of the
test_block_quotes_213 test functions
and all five of the new test functions that I had added. That was a good feeling.
Still Missing One¶
After a good night’s worth of sleep, I got back to tackling the disabled functions,
specifically test function
test_block_quotes_213aa. With a Markdown document of:
> - foo > - boo > - bar
it was immediately obvious that this was somewhat related to the work that I completed the day before. It was just figuring out how that stumped me for a while.
Adding some debug, I started to suspect that the “can I start a container block” logic
__get_nested_container_starts function was not working properly. Adding more
specific debugging in that function, my guess was confirmed. Specifically, there were
cases where I felt it was obvious that a new List Block element should start, but the
code was skipping over those chances.
Stepping through that code, I noticed weird behavior when I got to this part of that function:
nested_ulist_start, _ = ListBlockProcessor.is_ulist_start( parser_state, line_to_parse, 0, "" ) nested_olist_start, _, _, _ = ListBlockProcessor.is_olist_start( parser_state, line_to_parse, 0, "" ) nested_block_start = BlockQuoteProcessor.is_block_quote_start( line_to_parse, 0, "" )
Looking at the debug output for that code while debugging the
function, it was obvious that on the second line of the document, the code was not
determining that a new block was starting. It just skipped right over it.
Thinking through the issue while I was debugging, this started to make sense. It was
not triggering on any of the above start checks because the wrong information was being
presented to it. In fact, the data presented to it for line 2 of the above Markdown
was literally two space characters followed by the Unsigned List Block start character
-). While that specific form of data made sense for the
function, trying to invoke any one of the three Container block start functions with
that data would never succeed. Each of those functions expected the line to at least
be minimally processed, and that line of data clearly had not been.
Trying the simple approach first, I tested the following change while debugging function
after_ws_index, ex_whitespace = ParserHelper.extract_whitespace( line_to_parse, 0 ) nested_ulist_start, _ = ListBlockProcessor.is_ulist_start( parser_state, line_to_parse, after_ws_index, ex_whitespace )
and it worked! Debugging through the test again, that processing allowed the
is_ulist_start to work with the processed line data, resulting in a non-None
result being returned. After applying that same change to the other two start Container
block functions, I ran the tests again and everything was good. The tests were passing
without any additional changes being required of the Markdown transformer or the
consistency checks. I was stoked!
Making the Test Invocation More Efficient¶
As I was making the changes to all these scenario tests, one thing was becoming obvious to me: I was not changing the call pattern. In the beginning of the project, I was concerned that I was going to have a wide array of ways to invoke the scenario tests based on need. While it was a valid concern at the time, it had not played out that way. For each scenario test, I was always adding the following boilerplate code:
# Act actual_tokens = tokenizer.transform(source_markdown) actual_gfm = transformer.transform(actual_tokens) # Assert assert_if_lists_different(expected_tokens, actual_tokens) assert_if_strings_different(expected_gfm, actual_gfm) assert_token_consistency(source_markdown, actual_tokens)
While those seven lines are not a lot, over the course of approximately 2,000 scenario
tests, those lines added up. Thinking about it a bit, I realized that I was not going
to change this format because it was consistently worked well for me. Declared before
this code block
in each test, the variables
the relative information for each test. Once set, I could not think of any reason to
alter from the pattern of calling the three validation functions, one after another.
At the same time, if I was going to make this change on a large scale, I wanted to start
out with a smaller scope of work and validate it first. At that point, I created a new
act_and_assert function to contain those seven lines of code, and changed that code
block from above to the following:
# Act & Assert act_and_assert(source_markdown, expected_gfm, expected_tokens)
After testing it out on a couple of scenario tests, I applied it to all the scenario
tests in the
test_markdown_paragraph_series_n.py module and the
test_markdown_block_quotes.py module. Making sure the scenario tests all worked,
including introducing some false failures that I immediately fixed, I decided to leave
it alone for a while. My thought process was that I would let it “set” in my mind.
If I was still interested in making the change after a week or so, I could
work on it over the course of a couple of weeks.
Link Reference Definitions and Containers¶
Looking at the issues list, the following item gained my attention:
- split up link definition within a block quote or list?
While I was not sure it was going to be easy, I at least figured that it would be fun!
At the base level, it was an interesting question. Block Quote elements and List Block
elements altered the way a line was processed, and Link Reference Definitions could span
lines and not appear to follow the right rules. To that end, I added three new
test_block_quotes_extra_03b and I
added three new scenario tests
The bad news first. Try as I may, I was not able to get the new Block Quote tests to pass within a decent amount of time. As such, I had to commit them as disabled tests.
The ContainerBlockProcessor was largely unchanged, with the changes in that class being made to get rid of code or simplify code, rather than fix issues. The Markdown transformer was altered to produce the proper output if there were any leading spaces in the list that were not being merged in by the Link Reference Definition token. Those were the easier two fixes. The hard changes involved the HTML transformer.
List Looseness… Again¶
While I thought all my issues regarding list looseness were behind me, I was wrong. Specifically:
A list is loose if any of its constituent list items are separated by blank lines, or if any of its constituent list items directly contain two block-level elements with a blank line between them.
Breaking that down, a list is lose if, for any item within the list:
- it is not the first item and preceded by a blank line
- it is not the last item and followed by a blank line
- any item contains any two block elements that are separated by a blank line
How is this related to Link Reference Definitions? When the PyMarkdown processors parse a Link Reference Definition element, it creates a token that represents that element so it can be analyzed. However, from an HTML point of view, it does not exist. As such, the following Markdown:
- [foo]: /url "title" - [foo]
is for HTML output purposes equal to the Markdown:
- - [foo]
Basically, from the HTML transformer point of view, it is a Blank Line element. I had some
code in the
__is_token_loose function to report a Link Reference Definition as a Blank
Line element and in the
__calculate_list_looseness to trigger off a block, but it
was not that simple. Rereading the above section on looseness, it only really mattered
inside of the list item if that Blank Line element was between two blocks.
Fixing that specific case took a bit of work in the
to redesign the way I was handling those cases. In cases where the previous token was
a Blank Line, I had to change the algorithm to look before that element and see if there
was a valid block token there. It so, the logic from the last point above would come
and it would discover a valid case to check. Otherwise, the Blank Line token or the
Link Reference Definition token would be intentionally overlooked, and the search would
Repeat, With Block Quotes¶
It was no surprise that I had to do similar work with Block Quote elements to make sure
that they were also working properly. To accommodate that, I added five new scenarios
and did a lot of debugging. In the end, the fix was to add a flag in the
__handle_blank_line function that triggered when a Link Referenced Definition had
been started. Rather than avoid the default handling in this case, that handle was
carried out and appended to the already created tokens, resulting the proper output.
Once that fix was made, getting the consistency checks to agree with it was
easy, with one calculation for the
leading_text_index being off by 1. After a
simple adjustment to compensate for that case, those tests were now passing.
After only an hour and a half of work to get this working, I understood that I was
lucky that it did not end up taking as long as the last set of issues to fix. I
Evaluating How I Was Spending My Holiday¶
I do not think that I am a workaholic, I just like to keep busy. Whether it is figuring out how to do something with home automation, working on a puzzle, cleaning up the garage (again!?), or working on this project, it does not matter. If it is something that keeps me from spending hours doing one thing with no recognizable thing to show for it, I am good. Spending time with my wife helping her out with shopping works wonderfully for that. I get to spend time doing crossword puzzles and other fun things on my phone while waiting for her to finish at various stores she needs to go to. For me, it is all about perspective.
So, it was part of the way through Christmas Eve Day when I thought about this with respect to the project. While it was true that I was resolving items from the issues list, it just was not feeling like it was the right thing to do at that time. As I had the weeks before and after Christmas off, I sat back and decided that I wanted to do something for the project that would be hard to do at any other time. I mean, since I have the time, was I using it to my best advantage?
Within minutes, I had an answer of “No, I was not using my time wisely!”. I had a couple of long-term things I wanted to do with the code base, but those things would be hard to fit into a normal week’s worth of work. It was time to shift gears.
Reducing the Code Required To Check For A Specific Token¶
While this was done very leisurly over two days, this change was something I had wanted to
do for a while. As the project’s growth was organic, I had started out check for the
pressence of a given token by looking at its name. At that time, given the rule module
rule_md_019.py, if I wanted to see if a given token was a paragraph, I would use:
if token.type_name == MarkdownToken.token_paragraph: pass
There is nothing intrinsically wrong with that comparison, but it did have things going
against it. The first was that I had to make sure to import the MarkdownToken class into
any module where I wanted such a comparison. Next was my observation that I usually had
to go to the
markdown_token.py module and find out exactly how the token’s type name
was spelled, hopefully avoiding any naming errors. Finally, I found it bulky. I guess
if I had to add more description to bulky, it was that it took an awful lot of typing to
figure out that the token was a paragraph. And if it took a lot of typing, it would also
take a lot of reading to do the same thing. I needed something simpler.
Working through a couple of possibilities, I decided to use a simple approach and replace that entire line with:
if token.is_paragraph: pass
It required no extra imports, the naming was easy to remember, and it was as simple as I
could make it without short forms that would make it to read. Sure
is_para might be more
compact, but what was I really going to save with that?
Over the course of approximately ten commits, I transferred all the Markdown Tokens from the old way of comparing token types to the new way. I started with Paragraphs and the other Leaf Block tokens, moving on to Container Block tokens, before finishing up with Inline tokens. By the time I was done, there were more lines and functions present in the base MarkdownToken class, but the entire code base seemed to read better.
And that was the point of these changes. No scenario tests were changed as a part of this code base change, I just wanted it to read better!
Do I Make The Code More Object Oriented?¶
Having made those changes, I felt emboldened to make another set of changes that I had been itching to do for a while: making the code more object oriented.
While each programming paradigm has its good points and bad points, object-oriented was something I have had a lot of experience with. At its core object-oriented programming had three pivotal concepts: encapsulation, composition, and polymorphism. As the PyMarkdown project generates tokens after parsing the Markdown document, I was most interested in encapsulation and polymorphism and how they could be applied to tokens.
From an encapsulation point of view, with a couple of exceptions1 there is no reason to want to change the values of any of those parsed tokens. In the case of those exceptional tokens, there is a multiline concept present in that token that needs to be properly tracked by any after-parsing process. Other than providing that tracking information during a specific after-parsing process, those fields do not have any meaning to a consumer. Therefore, it is okay that they are exposed. However, I did not feel that it was the case with the other values.
And from a polymorphism point of view, it makes sense to me that the tokens should be organized into groups of tokens that have similar properties: Container Blocks, Leaf Blocks, and Inlines. As such, it also follows that some properties and methods are present in one or more of those groups of tokens, but not all those groups. So having a base class for each of those groups is something to work towards.
For both reasons, it just made sense to me to proceed!
Starting with EndMarkdownToken¶
With the decision made to move forward with this transition, I started at the base by
making those changes with the
EndMarkdownToken class. To be honest, as I worked
through the project’s development, I always felt a bit guilty that I had let the use
of this token get out of control.
From my observations, this class is a “hidden” fourth class of tokens, providing a
close token to any of the tokens that provide for start functionality. As this
functionality ranges across all three groups of tokens, the
EndMarkdownToken exists off
to the side in its own space. Because of this special grouping, it was the ideal token
to start with.
To change this token, I started off with the
which contains the string that gets prepended to the start of any
type name. Moving it into the
MarkdownToken class made the use of that field
self-contained, allowing it be modified into a protected field.2 To accomplish
that change, I created a new
that allowed me to generate the proper
EndMarkdownToken from a given Markdown token.
To me, this just made sense. By performing these changes, I had an easy way of
EndMarkdownToken that was documented and set all the right fields. I was able to
make sure that the
start_markdown_token field was set. As it had not been
set consistently, there were places in the code base where I had to do work arounds to
and figure out where the start token for a given end token was. This was just cleaner.
Two of the paradigms that have served me well in my years of software development are D.R.Y. and S.O.L.I.D. The ‘L” in S.O.L.I.D. stands for the Liskov substitution principle, which (in its abbreviated form) states that an object should be able to be replaced with another object that adheres to the same contract without any parent objects being aware of the change.
With respect to the PyMarkdown project, I had been using the
to deal with things instead of following the principle. However, with the recent
round of refactorings completed, I was able to fix that. While it may not seem like
much, it made me feel better to change:
if isinstance(token, AtxHeadingMarkdownToken): pass
into a more SOLID-like:
if token.is_atx_heading: pass
While I know that there is only one implementation of the
class, I was just not comfortable with having calls of
isinstance scattered throughout
the code base. Over two commits, I was able to eliminate all uses of
for six legitimate uses of it in pivotal parts of the code base.
As with the other changes, this change was a simple change, but an effective one. With all the checks for various tokens using a single manner of access, it was now easier to scan the code base for locations where a given token was referenced. That, and to me, it just looked neater.
Cleaning Up The Leaf Block Tokens and EndMarkdownToken¶
All those changes inspired me, and I gave myself the green light to further encapsulate the
private fields as Python private fields. It was not even remotely exciting, but it was
something I knew would make the code cleaner and more maintainable. Using this
stripped-down example of the
EndMarkdownToken, this is where all the tokens started:
class EndMarkdownToken(MarkdownToken): def __init__( self, type_name, ): self.type_name = type_name
For each of the Leaf Block tokens, I then started by adding the two underscores to the
start of the field name, per Python requirements. But as that made the field “invisible”
to other objects, I then needed to add a new function with the
property decorator to
be able to retrieve the value.
class EndMarkdownToken(MarkdownToken): def __init__( self, type_name, ): self.__type_name = type_name @property def type_name(self): return self.__type_name
Lather, rinse, repeat. Nothing special or exciting, which was what I was going for. There were no weird cases with these tokens that required a special setter function, so no workarounds were needed. And because the previously public field name and the newly added property name had the exact same name, so changes were needed in other parts of the code.
What Was My Experience So Far?¶
I really cannot lie on this one. While it was good to resolve some items off the issues list, it was really good to get these refactorings done. I have had those ideas on how to clean up the project in my mind for weeks, and just getting them done gave me a bit of an extra pep in my step. It is not that they were any more or less important than the other items, it is just that their large scope meant it was going to be hard to fit them in. As such, they always went into the “if I ever have some time” bucket… and never seen again.
I honestly think that taking the time to focus on what was important to me with this project was good for me and the project. It helped me refocus on the things that I find important. It helped the project by being able to present a more concise and readable code base to any interested parties. In all, it was a win-win.
What is Next?¶
With a week left of holidays, I had hoped to get a similar mix of items dealt with in the next week as I had the week before. Stay tuned!
The exceptions as the
rehydrate_indexfield for Paragraph tokens, the
leading_spaces_indexfield for List Block tokens, and the
leading_text_indexfield for Block Quote tokens. ↩
Python does not have a protected field specifically. The common practice of naming fields to be used in a protected-like manner is to preface them with a single
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.