Summary¶
In my last article, I documented how I was working hard to get to the end of the unprioritized items in my issues list and on to the prioritized parts of the list. This article details the work that I am doing to make that push happen.
Introduction¶
Now a week or so removed from the New Year’s holiday break, it was refreshing to know that I was still able to resolve a healthy collection of items from the issues list. Sure, it was not the same volume as during the holiday, but it was still a healthy number of issues.
And I definitely felt that I was getting closer to the end of the initial phase of the PyMarkdown project. I was pretty sure that I would not be able to resolve every item from the unprioritized section of the issues list this week, but I was confident that it would happen in the week after. I just had to maintain a good velocity of resolving issues, and I would get there soon!
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 11 Jan 2021 and 17 Jan 2021.
Doing Some Simple Cleanup¶
As always, I try to ease into the project work for the week with something simple, and this week was no different. While this change might seem to be of no consequence to others, to me it was a question of readability and searchability. The task? To replace newline characters in the source code with `ParserHelper.newline_character` and to replace the colon character (`:`) separating the extra data fields of the Markdown tokens with `MarkdownToken.extra_data_separator`.
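As a simple illustration of the idea, with values taken from the description above (the project's actual definitions may differ in detail):

```python
class ParserHelper:
    # A named constant is far easier to search for than a bare "\n" literal.
    newline_character = "\n"

class MarkdownToken:
    # Likewise for the ":" that separates a token's extra data fields.
    extra_data_separator = ":"
```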
While it might not be obvious to people who are not working with the source code, the presence of those two characters in an unescaped and unreferenced form had caused me a lot of frustration. If I had to pin down a cause, it was that a newline character and a single colon character are easy to miss when scanning through source code. I wanted something that I could easily find, not something that I would easily miss, as had happened numerous times during debugging sessions. For me, the cost of this change was easily outweighed by the benefit to readability.
It was not a change that fixed an issue or enabled a scenario test, but I noticed the difference almost instantly. It just helped me see the code better, and that was its goal!
Upgrading To Python 3.8¶
When I started the project a year ago, Python 3.7 was the stable release of Python. At that time, Python 3.8 had just been released on 14 October 2019[^1], a couple of weeks before I started working on the code. With the first commit of the source code on 22 Nov 2019, it just seemed like a safer bet to stay with version 3.7 until the bugs were worked out of the then brand-new Python release.
Just over a year later, with Python 3.9 released on 15 Oct 2020[^1], it felt like a good time to upgrade one minor version with the same reasoning in mind. However, there was also another reason: performance. Having started to explore the performance of the project on sample Markdown pages, I found that the project’s parser was taking a long time to parse a simple Markdown file. Using cProfile and SnakeViz, I knew that the number one performance problem I had was the way I used log statements. Without going too far into my research[^2], to solve the performance issue while keeping the extra functionality that helped me debug more efficiently, I would soon need to write my own logging wrapper. To do this properly, my research on logging indicated that I would need to use the `stacklevel` argument to allow the wrapper to function while logging the location from which the wrapper’s log statement was called. The catch? That argument was introduced in Python 3.8.
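To make that concrete, here is a minimal sketch of the kind of wrapper I had in mind; this is an illustration under my own naming, not the project's eventual implementation:

```python
import logging

LOGGER = logging.getLogger(__name__)

def log_debug(message, *args):
    # Only pay the cost of formatting when DEBUG logging is enabled.
    if LOGGER.isEnabledFor(logging.DEBUG):
        # stacklevel=2 (new in Python 3.8) makes the emitted record report
        # the caller of log_debug, not this wrapper function itself.
        LOGGER.debug(message, *args, stacklevel=2)
```

Without `stacklevel`, every record would report the wrapper's own file and line number, defeating the purpose of logging where each call came from.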
With a good reason to update and a good, stable version of Python 3.8 to update to, I proceeded with the upgrade with relatively few issues. The main issue I hit was that I needed to uninstall Python 3.7 from the project, install Python 3.8 on my system (including all environment variables), and then install Python 3.8 in the project. Once that was done, the only other issue was with the Black Python formatter. In that case, I needed to examine the dependency graph for that package and make sure that I installed the correct version of the dependent library in the project.
After that small headache, which took minutes to solve, everything was working fine and continues to work fine.
More Fun With Link Reference Definitions¶
The first thing to mention about the next task is that, while its commit appears earlier than the 3.8 upgrade commit, chronologically this task came after the upgrade task. The reason this is relevant is that the project uses the default settings for Black, and either those defaults or the algorithm implementing the line folding changed with the upgrade. Why is this relevant? While the commit itself looks like it has a lot of changes, many of those changes were simply lines refolded according to the upgraded settings. And as I was focused on the Link Reference Definitions, I did not notice those formatting changes until after I had made a number of changes. It was just easier to commit them together at that point than to pull them apart.
Other than that noise, there were three new scenario tests introduced, testing Link Reference Definition elements broken up across container block boundaries. The first test added, test function `test_link_reference_definitions_extra_01`, was created with a single Unordered List element character, followed by a valid Link Reference Definition spread over two lines, with the second line not indented:

```markdown
- [foo]:
/url
```
The second test, function `test_link_reference_definitions_extra_02`, used the same format, but used a Block Quote element prefix instead of an Unordered List prefix. Finally, to provide a touchstone, I added a third test function that has both lines of the Link Reference Definition preceded by the Block Quote element prefix. While it was a duplicate test, I felt it was a good reminder of how a test with both lines worked, and thus it was a good reference test.
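Based on those descriptions, the Markdown for the Block Quote tests would look roughly like the following; this is my reconstruction from the prose above, not text copied from the test suite. The broken variant:

```markdown
> [foo]:
/url
```

and the touchstone, with both lines prefixed:

```markdown
> [foo]:
> /url
```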
Now, according to the specification, a List element continues if the next line starts with enough whitespace to maintain the indent, or if it is a continuation of a Paragraph within the list. As the Link Reference Definition is not a Paragraph when parsed the first time, the second line terminates the list, causing the Link Reference Definition to be requeued and parsed as a normal Paragraph on the second pass. Or so I thought.
GitHub Flavored Markdown vs CommonMark¶
While the two are usually in sync, sometimes the GFM specification and the reference CommonMark implementation, commonmark.js, called from Babelmark 2, differ in small implementation details. I had experimented with the Block Quote scenario test for three or so hours before I submitted a question to the CommonMark forums asking if I had misunderstood something in the specification.
The answer that came back was supportive, but at the same time, honest. The approach that CommonMark’s reference parser had taken was to parse the lines as the start of an Unordered List followed by a Paragraph block. Only after that Paragraph block has been parsed, with the paragraph continuation kicking in, does the parser look for a Link Reference Definition at the start of that Paragraph.
Is this 100% according to the GFM specification? No. But does it make sense for the CommonMark team to do this? I would argue yes. Getting Link Reference Definitions correct in the PyMarkdown parser has continued to be a pain to this day. Based on my limited experience, while Link Reference Definitions can be spread over multiple lines, there are very few cases where that is done in “real life”. From a correctness viewpoint, if I had to guess at the percentages, I would estimate that their approach correctly parses 99.5% of Link Reference Definition elements, with only some “weird” multiline Link Reference Definition scenarios not being parsed.
But that left me with a decision. What was the correct thing to do for the PyMarkdown parser?
Which To Choose?¶
After thinking about this overnight, I decided that the best approach for the project was to align with the CommonMark reference implementation, while also discussing the differences from the GFM specification with the CommonMark team in the forums. By making that choice, I had confidence that I would have something to compare against for correctness that was both concrete and measurable. A document either would parse properly against commonmark.js 0.29.2 and be considered correct, or it would not and be considered a parsing failure. As for any differences, I could clearly note them in documentation somewhere and talk about them on the forums with the owners of CommonMark and the GFM specification, trying to improve both. It was not a perfect answer, but things rarely are.
With that decision in hand, I marked all three new tests as skipped before starting to work on the Block Quote functions. Based on what I was seeing in the test failures, everything looked fine in the HTML output, except that the output was missing an entry in the Block Quote for the parsed line. Taking a wild guess, I determined that I needed to introduce something in the `handle_block_quote_block` function to ensure that the number of lines in the Block Quote element was correct.
Surprisingly, that was the fix that was needed. No changes in the Markdown transformer were needed, and no changes in the consistency check were needed. Not sure what I had going on that evening, I decided to mark the List Block version of the scenario test, the function `test_link_reference_definitions_extra_01`, as disabled, then cleaned up the code and committed it to the repository. It was a good place to stop while I figured out what was going on that evening.
Getting Back To It¶
With my plans for that evening falling through, I found that I had a couple of free hours. Not wanting to let them go to waste, I decided to see if I could tackle the `test_link_reference_definitions_extra_01` function that I had elected not to get working in the previous section. To ensure I was moving in the correct direction, I added extra variations of the test that included one and two spaces before the second half of the Link Reference Definition element, as well as one with each half of the Link Reference Definition in its own List Item.
As I have had numerous problems with List Blocks in the past, I expected to expend a lot of effort cleaning these cases up, but only a little was required. Specifically, the only change that was needed was in the `ListBlockProcessor` class and its `__check_for_list_closures` function. As with the Link Reference Definition element that spanned Block Quote element levels in the previous section, the CommonMark reference implementation was treating the Link Reference Definition text as a Paragraph before detecting the Link Reference Definition element itself. To replicate this behavior, I needed to modify the `__check_for_list_closures` function to keep the ‘paragraph’ open if it was parsing a Link Reference Definition.
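In spirit, the change amounts to a guard like the one below. This is a minimal sketch with names of my own invention, not the project's actual `__check_for_list_closures` code:

```python
def should_close_list(is_in_link_reference_definition, indent_is_sufficient):
    # While a Link Reference Definition is still being collected, treat the
    # pending text as an open "paragraph" and keep the list open.
    if is_in_link_reference_definition:
        return False
    # Otherwise, fall back to the normal rule: close when the indent breaks.
    return not indent_is_sufficient
```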
Making those modifications, I was able to get the main function, `test_link_reference_definitions_extra_01`, working, as well as the sibling functions `test_link_reference_definitions_extra_01a` and `test_link_reference_definitions_extra_01b`. This meant that a Link Reference Definition split over a List Block and the end of that block, with various amounts of indentation, was working properly. However, function `test_link_reference_definitions_extra_01c`, where I split the Link Reference Definition over two List Items, was not working at all. With my time for the evening used up, I marked it as skipped, cleaned it up, committed it, and went to sleep for the night.
And More Link Reference Definitions¶
As I was looking at the content of the existing scenario tests, I noticed that I was missing a couple of tests with simple multi-line Raw HTML elements and multi-line Code Span elements inside links. Basically, I wanted to take example 644 for Raw HTML elements:
```markdown
foo <!-- this is a
comment - with hyphen -->
```
and example 345 for Code Spans:
```markdown
``
foo
bar
baz
``
```
and place them within both Inline Links and Reference Links. I was hoping that these would be simple tests, but I was not sure.
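For instance, the Inline Link variant for the Raw HTML case would look something like this; the exact test content is my reconstruction, not copied from the test suite:

```markdown
[foo <!-- this is a
comment - with hyphen -->](/uri)
```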
Adding all four tests (two for Inline Links and two for Reference Links), I was pleasantly surprised that all four passed without any changes. While I am aware that I am getting closer and closer to the initial release of the project, I still find that I expect things to fail by default. As I am usually an optimistic person, my only explanation for that failure viewpoint is that I have written too many automation tests in my career. When I note something down in the issues list, I feel that most of those items are going to be things that I forgot to cover, not things that I wish to ensure are covered.
Regardless, I need to figure that out and work on it a bit. I do have a lot of confidence in the PyMarkdown project and its accuracy, and I need to project that more.
Fixing Disabled Tests¶
After a good night’s sleep and a good day’s worth of work under my belt, I settled down in the evening to work on the next issue: enabling test functions `test_block_quotes_extra_02ax` to `test_block_quotes_extra_02ad`. The good news was that the HTML transformer and the Markdown transformer were both working properly. The bad news was that the consistency checks were failing for all of these tests.
It took me a bit to get going that evening, but when I did, it was obvious to me that the problem was that the consistency checks were not recognizing the active Block Quote element. Following along in the `verify_line_and_column_numbers` method, it became obvious that the code:

```python
if container_block_stack and container_block_stack[-1].is_block_quote_start:
```

was not evaluating to `True` for the four test functions that I was evaluating. But how to fix them?
It took me a while to realize that the reason the condition was not evaluating to `True` was that the Block Quote token was not always the last token on that list. When I read the variable named `container_block_stack`, in my head I was parsing it as “the stack for container Block Quotes”, not “the stack for container blocks”. Once I figured that out, the answer became obvious. I created a new function, `find_last_block_quote_on_stack`, that went back through the stack until it found the last Block Quote token and returned it. From there, I replaced any occurrence of `container_block_stack[-1]` with `last_block_quote_token`. Therefore, the code from above became:

```python
last_block_quote_token = find_last_block_quote_on_stack(
    container_block_stack
)
if last_block_quote_token:
```
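For reference, a minimal sketch of what `find_last_block_quote_on_stack` does, based on the description above rather than the project's exact code:

```python
def find_last_block_quote_on_stack(container_block_stack):
    # Walk backwards through the container stack, returning the most
    # recent Block Quote token, or None if no Block Quote is open.
    for stack_token in reversed(container_block_stack):
        if stack_token.is_block_quote_start:
            return stack_token
    return None
```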
I ran the tests, and after clearing up a couple of typing mistakes, they were all passing!
A Quick Fix… I Hope!¶
Looking at the end of the uncategorized section of the issues list, there was one item that I felt confident I could deal with quickly:

```text
- 634, but forcing an html block
```
To start working on this item, I made six copies of test function `test_html_blocks_123`. The first copy was in a List Block element, the next three copies were in various forms of a Block Quote element, the fifth copy was within a SetExt Heading element, and the last copy was within an Atx Heading element. The hard part of each of these tests was making sure I was generating an HTML Block token and not a Raw HTML token. That took extra care but did not slow me down much.
Like a handful of other issues that I have fixed, the answer to this one leapt out at me as soon as I looked through the log files. When the next line was examined to figure out if the Block Quote element should be continued, the `check_for_lazy_handling` function was allowing it to continue. The only issue here was that it was an HTML block, a leaf block type that does not have any continuation logic in the specification. Having noticed that, it was easy to change the following code:

```python
if (
    parser_state.token_stack[-1].is_code_block
    or (is_leaf_block_start)
):
```
to:

```python
if (
    parser_state.token_stack[-1].is_code_block
    or parser_state.token_stack[-1].is_html_block
    or (is_leaf_block_start)
):
```
thereby fixing the issue. Running the scenario tests again, the tests were indeed fixed without needing any other changes.
That Week’s Big Thing¶
Wrapping up the work for that week, I wanted to make another dent in the issues list, so I decided to at least get the tests set up for the following item:

```text
- links with & and \ with inner link to mine
- see __collect_text_from_blocks
```
The basic concept of this item was simple: create a group of tests to verify how inline elements are represented when placed within a Link element. To make sure I was writing a good scenario test, I chose to use a Reference Link element. By doing this, I would be testing the link label normalization code and the representation code at the same time.
Starting with `test_reference_links_extra_03x`, I created a Link element with a link label that contained a backslash:

```markdown
[bar\\foo]: /uri
[bar\\foo]
```
From there, I created a copy of that test that encapsulated that link label within another link label:

```markdown
[xx[bar\\foo]yy](/uri)
[bar\\foo]: /uri1
```
and finally, I created a copy of that test, changing the Link element to an Image element:

```markdown
![xx[bar\\foo]yy](/uri)
[bar\\foo]: /uri1
```
Then, after a bit of thinking, I decided there was only one combination I was missing, so I added it:

```markdown
[xx![bar\\foo]yy](/uri)
[bar\\foo]: /uri1
```
That being done, I then repeated that process with `&amp;` and `&copy;` for character entity references, and again for code spans, emphasis, autolinks, raw HTML, and hard line breaks. By the time I was done, I had added 40 scenario tests to cover all these cases.
Starting to execute the scenario tests, all the tests that dealt only with Link elements were passing without any changes. The Image elements, though, were a different story. The failures seemed to stare back at me, standing in the way of me writing the article for that week. It was just time to start debugging and figuring things out.
After a bit of debugging, the test failures seemed to fall into three separate groups of issues. The first group was that the `__collect_text_from_blocks` function, used to grab the existing tokens and render them as text, was not complete. But that was not the entire issue; I felt that there was too much “noise” in the way for me to see the issue clearly. Resolving to reduce that noise, I started working on the main part of the issue. In the cases where the inline element was on its own inside the parent Link element, the Code Span element, the Raw HTML element, and the Autolink element were not representing their elements properly. A bit of exploration and debugging took care of that. With that noise out of the way, I was able to see the other part of the issue better, and added the condition:

```python
if not is_inside_of_link:
```

at the start of each of those handlers. The noise that I had experienced was simply that, in cases where a possible Link element was within another Link element’s link label, each of the changed elements needed to emit nothing to the collection. Running the tests again verified those observations, but I also saw something else.
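Before moving on, here is a hedged sketch of how one of those handlers might apply the new guard; the function and parameter names are illustrative, not the project's actual code:

```python
def handle_raw_html_for_label(collected_parts, token, is_inside_of_link):
    # When collecting text to normalize a link label, a token that sits
    # inside a nested link label must contribute nothing.
    if not is_inside_of_link:
        collected_parts.append(token.token_text)
```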
Only after getting rid of those issues was I able to see that the `__consume_text_for_image_alt_text` function was not handling Emphasis start and end elements properly, failing an assert near the end of that function. That fix was easy, adding four lines to the `elif` statement:

```python
elif inline_blocks[ind + 1].is_inline_emphasis:
    pass
elif inline_blocks[ind + 1].is_inline_emphasis_end:
    pass
```
With both of those issues dealt with, the couple of failures that remained were easy ones. Looking at the HTML output, there were a number of `\a` character sequences in it. As that sequence is the signature for replacement references, I quickly changed the code for the Text element from:

```python
image_alt_text += inline_blocks[ind + 1].token_text
```

to:

```python
image_alt_text += ParserHelper.resolve_references_from_text(
    inline_blocks[ind + 1].token_text
)
```
Beware Of Special Cases¶
Running the tests again, all the tests were passing except for those that dealt with the `&amp;` sequence. All the tests dealing with the `&copy;` sequence were working fine, so I had to think quickly to figure out what the problem might be. Because I am used to looking at processed HTML code, I initially did not see any problem with an `&amp;` sequence in the HTML output. It looked right.
Then it hit me. The problem was not with the HTML output, it was with the processing of the Markdown input. In the cases that I had problems with, the desired HTML output contained `&amp;amp;`, which did not look right until I thought about it. Whereas the `&copy;` character sequence is interpreted as a named character entity and replaced with the © symbol, the sequence `&amp;` is not interpreted in the same way. The desired HTML output was correct! The initial `&` of the sequence was being replaced with `&amp;` to ensure it was displayed properly, removing any chance of it being interpreted as a named character entity.
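A quick illustration of that distinction, using Python's standard `html` module rather than the project's code:

```python
import html

# A complete named reference like "&copy;" is interpreted and replaced
# with the actual character.
print(html.unescape("&copy;"))            # -> ©

# A bare "&" is not a complete reference, so displaying the literal text
# "&amp;" requires escaping the ampersand itself.
print(html.escape("&amp;", quote=False))  # -> &amp;amp;
```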
Making a quick decision, I looked at the `InlineHelper` class and noticed the `append_text` function used to ensure that such strings were properly interpreted. Taking a quick look at the imports for `InlineHelper` and `LinkHelper`, I thought there was a chance of a circular reference occurring. Given that observation, I decided to make a copy of the `append_text` function in the `LinkHelper` class to get around the possibility of the circular reference.
Finally, after a couple of hours of work, all 40 tests were passing. It was a bit of a trek to get there, but it was worth it!
Whoops¶
As I was starting to write the article, I also started to look into what it would take to remove the duplicate of the `append_text` function in the `LinkHelper` class. I had introduced the clone into the `LinkHelper` class to avoid any issues with referencing the `InlineHelper` class from the `LinkHelper` class. It was as I was starting my research into this task that I discovered something I had previously missed: the `LinkHelper` class was already referencing the `InlineHelper` class.
After a bit of “how did I miss that?”, I replaced the call to `LinkHelper.append_text` with `InlineHelper.append_text`, and everything worked fine. Removing the instance of the `append_text` function from the `LinkHelper` class, I ran the complete suite of scenario tests again, and everything still worked fine.
What Was My Experience So Far?¶
Looking back at the work I did during the week, it was hard not to get jazzed about the progress. I started off with a simple fix that made the source code more readable and searchable. Then I upgraded the base Python version to 3.8, knowing that it would allow me to write the wrapper I wanted around Python’s logging functions. Add to that the coverage and testing I was able to add and verify for Link elements, Image elements, and Link Reference Definitions, and it was really good work!
At some point I noticed the number of scenario tests that I execute with each change. At approximately 2000 tests, I am confident that I am hitting a very large proportion of the possible scenarios for Markdown transformation, not just the “Golden” scenarios, and that was also a boost to my confidence. While I can expect things that I note as issues not to work, I also need to make sure I appreciate what is working. Having a solid set of tests like that is what allows me to refactor with confidence that I am not negatively impacting the code.
Refactor? Sigh. One of the things I know I am looking forward to is digging into the refactoring tasks in the prioritized sections and getting a couple of them underway. The fact that they are present in those sections is a good reminder that I can always learn how to do things better and how to make the project more maintainable. And that is always something I can look forward to!
What is Next?¶
With a solid amount of work done this week, I am hoping to be able to clear out the uncategorized section of the issues list in the next week. Will I make it? Stay tuned!
[^1]: Information courtesy of the Python download page.

[^2]: I will be covering this in a separate series of articles in the near future.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.