Summary

In my last article, I documented my push to get to the end of the unprioritized items in my issues list and on to the prioritized parts of the list. This article details the work that I did to make that push happen.

Introduction

Now a week or so removed from the New Year’s holiday break, it was refreshing to find that I was still able to resolve a healthy collection of items from the issues list. Sure, it was not the same volume as during the holidays, but it was still solid progress.

And I definitely felt that I was getting closer to the end of the initial phase of the PyMarkdown project. I was pretty sure that I would not be able to resolve every item in the unprioritized section of the issues list this week, but I was confident that it would happen the week after. I just had to maintain a good velocity of resolving issues, and I would get there soon!

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 11 Jan 2021 and 17 Jan 2021.

Doing Some Simple Cleanup

As always, I try to ease into the project work for the week with something simple, and this week was no different. While this change might seem to be of no consequence to others, to me it was a question of readability and searchability. The task? To replace newline characters in the source code with ParserHelper.newline_character, and to replace the colon character (:) separating the extra data fields of the Markdown tokens with MarkdownToken.extra_data_separator.
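
As a rough sketch of the kind of change this was (the declarations here are simplified stand-ins, not the project’s actual code), the before and after differ only in how easy the separator is to spot and search for:

    # Hypothetical stand-ins for the real declarations in PyMarkdown.
    class ParserHelper:
        newline_character = "\n"

    class MarkdownToken:
        extra_data_separator = ":"

    line = "first line\nsecond line"

    # Before: the bare "\n" literal is easy to miss when scanning.
    parts_before = line.split("\n")

    # After: the named constant is explicit and easy to search for.
    parts_after = line.split(ParserHelper.newline_character)
    assert parts_before == parts_after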

While it might not be obvious to people who are not working with the source code, the presence of those two characters in an unescaped and unreferenced form had caused me a lot of frustration. If I had to pin down why, it was because a newline character and a single colon character are easy to miss when scanning through source code. I wanted something that I could easily find, not something that I would easily miss, as had happened numerous times during debugging sessions. For me, the cost of this change was easily outweighed by the benefit to readability.

It was not a change that fixed an issue or enabled a scenario test, but I was able to notice the difference almost instantly. It just helped me see the code better, and that was its goal!

Upgrading To Python 3.8

When I started this project a year ago, Python 3.7 was the stable release of Python. Python 3.8 had just been released on 14 October 2019[1], a couple of weeks before I started working on the code. With the first commit of the source code on 22 Nov 2019, it just seemed like a safer bet to stay with version 3.7 until the bugs were worked out of the then brand-new Python release.

Just over a year later, with Python 3.9 released on 5 Oct 2020[1], it felt like a good time to upgrade one minor version, with the same reasoning in mind. However, there was also another reason: performance. Having started to explore the performance of the project on sample Markdown pages, I found that the project’s parser was taking a long time to parse even a simple Markdown file. Using cProfile and SnakeViz, I knew that the number one performance problem was the way I used log statements. Without going too far into my research[2], to solve the performance issue while keeping the extra functionality that helped me debug more efficiently, I would soon need to write my own logging wrapper. To do this properly, my research on logging indicated that I would need to use the stacklevel argument to allow the wrapper to function while logging the location where the wrapper’s log statement was called from. The catch? That argument was introduced in Python 3.8.
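
As a minimal sketch of why stacklevel matters (an illustration of the mechanism only, not the wrapper that I would eventually write), a wrapper can pass stacklevel=2 so that each record reports the wrapper’s caller instead of the wrapper itself:

    import logging

    logging.basicConfig(format="%(funcName)s: %(message)s", level=logging.DEBUG)
    LOGGER = logging.getLogger(__name__)

    def log_debug(message, *args):
        # Without stacklevel=2, every record would report "log_debug"
        # as its location; with it, the record reports the function
        # that called this wrapper.  (stacklevel requires Python 3.8+.)
        LOGGER.debug(message, *args, stacklevel=2)

    def parse_line():
        log_debug("parsing next line")  # logged as "parse_line: ..."

    parse_line()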

With a good reason to update and a good, stable version of Python 3.8 to update to, I proceeded with the upgrade with relatively few issues. The main work was that I needed to uninstall Python 3.7 from the project, install Python 3.8 on my system (including all environment variables), and then install Python 3.8 in the project. Once that was done, the only other issue that I had was with the Black Python formatter. In that case, I needed to examine the dependency graph for that package and make sure that I installed the correct version of the dependent library in the project.

After that small headache, which took minutes to solve, everything was working fine and continues to work fine.

Link Reference Definitions and Container Blocks

The first thing to mention about this next task is that while its commit was performed earlier than the 3.8 upgrade commit, chronologically the task came after the upgrade. The reason this is relevant is that the project uses the default settings for Black, and either those defaults or the line-folding algorithm changed with the upgrade. Why does this matter? While the commit itself looks like it has a lot of changes, many of those changes were simply lines being refolded according to the upgraded settings. And as I was focused on the Link Reference Definitions, I did not notice those formatting changes until after I had made a number of changes of my own. It was just easier to commit them together at that point than to pull them apart.

Other than that noise, there were three new scenario tests introduced, testing Link Reference Definition elements broken up across container block boundaries. The first test added, test function test_link_reference_definitions_extra_01, was created with a single Unordered List element character, followed by a valid Link Reference Definition spread over two lines, with the second line not being indented:

- [foo]:
/url

The second test, function test_link_reference_definitions_extra_02, used the same format but with a Block Quote element prefix instead of an Unordered List prefix. Finally, to provide a touchstone, I added function test_link_reference_definitions_extra_03, which has both lines of the Link Reference Definition preceded by the Block Quote element prefix. While it was a duplicate of existing tests, I felt it was a good reminder of how a test with both lines prefixed worked, and thus a good reference test.

Now, according to the specification, a List element continues if the next line starts with enough whitespace to maintain the indent, or if it is a continuation of a Paragraph within the list. As the Link Reference Definition is not a Paragraph when parsed the first time, the second line terminates the list, causing the Link Reference Definition to be requeued and parsed as a normal Paragraph on the second pass. Or so I thought.

GitHub Flavored Markdown vs CommonMark

While the two specifications are usually in sync with each other, sometimes the GFM Specification and commonmark.js, the reference CommonMark implementation called from Babelmark 2, differ in small implementation details. I had experimented with the Block Quote scenario test for three or so hours before I submitted a question to the CommonMark forums, asking if I had misunderstood something in the specification.

The answer that came back was supportive, but at the same time, honest. The approach that CommonMark’s reference parser had taken was to parse the lines as the start of an Unordered List followed by a Paragraph block. Only after that Paragraph block had been parsed, with the paragraph continuation kicking in, did the parser look for a Link Reference Definition at the start of that Paragraph.

Is this 100% according to the GFM specification? No. But does it make sense for the CommonMark team to do this? I would argue yes. Getting Link Reference Definitions correct in the PyMarkdown parser has been a pain to this day. Based on my limited experience, while Link Reference Definitions can be spread over multiple lines, there are very few cases where that is done in “real life”. From a correctness viewpoint, if I had to guess at the percentages, I would estimate that their approach correctly parses 99.5% of Link Reference Definition elements, with only some “weird” multiline scenarios not being parsed.

But that left me with a decision. What was the correct thing to do for the PyMarkdown parser?

Which To Choose?

After thinking about this overnight, I decided that the best approach for the project was to align with the CommonMark reference implementation, while also discussing the differences from the GFM Specification with the CommonMark team in the forums. By making that choice, I had confidence that I would have something concrete and measurable to compare against for correctness. Either a document would parse the same way as it does against commonmark.js 0.29.2 and be considered correct, or it would not and be considered a parsing failure. As for any differences, I could clearly note them in the documentation and talk about them on the forums with the owners of CommonMark and the GFM specification, trying to improve both. It was not a perfect answer, but things rarely are.

With that decision in hand, I marked all three new tests as skipped before starting to work on the Block Quote functions. Based on what I was seeing in the test failures, everything looked fine in the HTML output, except that the output was missing an entry in the Block Quote element for the parsed line. Taking a wild guess, I determined that I needed to introduce something in the handle_block_quote_block function to ensure that the number of lines tracked by the Block Quote element was correct.
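
To illustrate the shape of that guess (a hedged sketch with hypothetical names, not the project’s actual code), the idea is that the Block Quote token needs one recorded entry per line that it consumes:

    # Hypothetical sketch: the real Block Quote token in PyMarkdown
    # tracks more than this, but the idea is one entry per line.
    class BlockQuoteToken:
        def __init__(self):
            self.leading_spaces = []

        def add_leading_spaces(self, prefix):
            # Record the prefix consumed for this line so that the
            # Markdown transformer can rebuild the original document.
            self.leading_spaces.append(prefix)

    token = BlockQuoteToken()
    token.add_leading_spaces("> ")  # first line of the quote
    token.add_leading_spaces("")    # lazy continuation line
    assert len(token.leading_spaces) == 2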

Surprisingly, that was the only fix that was needed. No changes in the Markdown transformer were needed, and no changes in the consistency checks were needed either. Not sure what plans I had for that evening, I decided to mark the List Block version of the scenario test, function test_link_reference_definitions_extra_01, as disabled, then cleaned up the code and committed it to the repository. It was a good place to stop while I figured out what was going on that evening.

Getting Back To It

With my plans for that evening falling through, I found that I had a couple of free hours. Not wanting to let them go to waste, I decided to see if I could tackle the test_link_reference_definitions_extra_01 function that I had elected not to get working in the previous section. To ensure I was moving in the correct direction, I added extra variations of the test that included one and two spaces before the second half of the Link Reference Definition element, as well as one with each half of the Link Reference Definition in its own List Item.

As I have had numerous problems with List Blocks in the past, I expected to expend a lot of effort cleaning these cases up, but only a little effort was required. Specifically, the only change needed was in the ListBlockProcessor class and its __check_for_list_closures function. As with the Link Reference Definition element that spanned Block Quote boundaries in the previous section, the CommonMark reference implementation was treating the Link Reference Definition text as a Paragraph before detecting the Link Reference Definition element itself. To replicate this behavior, I needed to modify the __check_for_list_closures function to keep the ‘paragraph’ open if it was parsing a Link Reference Definition.
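
As a hedged sketch of that condition (the stack token below is a stand-in for the project’s own classes, not the real __check_for_list_closures code), the change amounts to treating an in-progress Link Reference Definition like an open paragraph:

    # Stand-in for the project's stack tokens.
    class StackToken:
        def __init__(self, is_paragraph=False, is_link_reference_definition=False):
            self.is_paragraph = is_paragraph
            self.is_link_reference_definition = is_link_reference_definition

    def keeps_list_open(stack_top):
        # Mirror commonmark.js: an in-progress Link Reference
        # Definition behaves like an open paragraph, so the enclosing
        # list is not closed out from under it.
        return stack_top.is_paragraph or stack_top.is_link_reference_definition

    assert keeps_list_open(StackToken(is_link_reference_definition=True))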

Making those modifications, I was able to get the main function, test_link_reference_definitions_extra_01, working, as well as the sibling functions test_link_reference_definitions_extra_01a and test_link_reference_definitions_extra_01b. This meant that a Link Reference Definition split across the end of a List Block, with various amounts of indentation, was working properly. However, function test_link_reference_definitions_extra_01c, where I split the Link Reference Definition over two List Items, was not working at all. With my time for the evening used up, I marked it as skipped, cleaned up the code, committed it, and went to sleep for the night.

A Couple of Extra Scenario Tests

As I was looking at the content of the existing scenario tests, I noticed that I did not have tests with simple multi-line Raw HTML elements and multi-line Code Span elements inside links. Basically, I wanted to take example 644 for Raw HTML elements:

foo <!-- this is a
comment - with hyphen -->

and example 345 for Code Spans:

``
foo
bar  
baz
``

and place them within both Inline Links and Reference Links. I was hoping that these would be simple tests, but I was not sure.

Adding all four tests (two for Inline Links and two for Reference Links), I was pleasantly surprised that all four passed without any changes. While I am aware that I am getting closer and closer to the initial release of the project, I still find that I expect things to fail as a default setting. As I am usually an optimistic person, my only explanation for that viewpoint is that I have written too many automation tests in my career. When I note something down in the issues list, I seem to assume that most of those items are things that I forgot to cover, not things that I simply wish to ensure are covered.

Regardless, I need to figure that out and work on it a bit. I do have a lot of confidence in the PyMarkdown project and its accuracy, and I need to project that confidence more.

Fixing Disabled Tests

After a good night’s sleep and a good day’s worth of work under my belt, I settled down in the evening to work on the next issue: enabling test functions test_block_quotes_extra_02ax to test_block_quotes_extra_02ad. The good news was that the HTML transformer and the Markdown transformer were both working properly. The bad news was that the consistency checks were failing for all these tests.

It took me a bit to get going that evening, but when I did, it quickly became clear that the problem was that the consistency checks were not recognizing the active Block Quote element. Following along in the verify_line_and_column_numbers method, it became obvious that the code:

    if container_block_stack and container_block_stack[-1].is_block_quote_start:

was not evaluating to True for the four test functions that I was examining. But how to fix them?

It took me a while to realize that the reason the condition was not evaluating to True was that the Block Quote token was not always the last token on that list. When I read the variable named container_block_stack, in my head I was parsing it as “the stack for container Block Quotes”, not “the stack for container blocks”. Once I figured that out, the answer became obvious. I created a new function, find_last_block_quote_on_stack, that went back through the stack until it found the last Block Quote token and returned it. From there, I replaced every occurrence of container_block_stack[-1] with last_block_quote_token. Therefore, the code from above became:

    last_block_quote_token = find_last_block_quote_on_stack(
        container_block_stack
    )
    if last_block_quote_token:
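
A minimal sketch of what that helper might look like (the token objects here are stand-ins for the project’s own classes):

    def find_last_block_quote_on_stack(container_block_stack):
        # Walk backwards: the most recent Block Quote token may be
        # buried under List tokens, so looking only at [-1] is not enough.
        for stack_token in reversed(container_block_stack):
            if stack_token.is_block_quote_start:
                return stack_token
        return None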

I ran the tests, and after clearing up a couple of typing mistakes, they were all passing!

A Quick Fix… I Hope!

Looking at the end of the uncategorized section of the issues list, there was one item that I felt confident that I could quickly deal with:

- 634, but forcing an html block

To start working on this item, I made six copies of test function test_html_blocks_123. The first copy was in a List Block element, the next three copies were in various forms of a Block Quote element, the fifth copy was within a SetExt Heading element, and the last copy was within an Atx Heading element. The hard part of each of these tests was that I needed to make sure I was generating an HTML Block token and not a Raw HTML token. That took extra care but did not slow me down that much.

Like a handful of other issues that I have fixed, the answer to this one leapt out at me as soon as I looked through the log files. When the next line was examined to figure out if the Block Quote element should be continued, the check_for_lazy_handling function was allowing it to continue. The only issue was that the element in question was an HTML block, a leaf block type that does not have any continuation logic in the specification. Having noticed that, it was easy to change the following code:

    if (
        parser_state.token_stack[-1].is_code_block
        or (is_leaf_block_start)
    ):

to:

    if (
        parser_state.token_stack[-1].is_code_block
        or parser_state.token_stack[-1].is_html_block
        or (is_leaf_block_start)
    ):

thereby fixing the issue. Running the scenario tests again, the tests were indeed passing without needing any other changes.

That Week’s Big Thing

Wrapping up the work for that week, I wanted to make another dent in the issues list, so I decided to at least get the tests set up for the following item:

- links with & and \ with inner link to mine
    - see __collect_text_from_blocks

The base concept of this item was simple: create a group of tests to verify how inline elements are represented when placed within a Link element. To make sure that I was writing good scenario tests, I made the choice to use Reference Link elements. By doing this, I would be testing the link label normalization code and the representation code at the same time.

Starting with test_reference_links_extra_03x, I created a Link element with a link label that contained a backslash:

[bar\\foo]: /uri

[bar\\foo]

From there, I created a copy of that test, encapsulating the link label within another link label:

[xx[bar\\foo]yy](/uri)

[bar\\foo]: /uri1

and finally, I created a copy of that test, changing the Link element to an Image element:

![xx[bar\\foo]yy](/uri)

[bar\\foo]: /uri1

Then, after a bit of thinking, I decided there was only one combination I was missing, so I added that:

[xx![bar\\foo]yy](/uri)

[bar\\foo]: /uri1

That being done, I then repeated that process with &amp; and &copy; for character entity references, and again for code spans, emphasis, autolinks, raw HTML, and hard line breaks. By the time I was done, I had added 40 scenario tests to cover all of these cases.

Starting to execute the scenario tests, all the tests that dealt only with Link elements were passing without any changes. The Image elements, though, were a different story. The failures seemed to stare back at me, standing between me and writing the article for that week. It was just time to start debugging and figuring things out.

After a bit of debugging, I noticed that the test failures fell into three separate groups of issues. The first group was that the __collect_text_from_blocks function, used to grab the existing tokens and render them as text, was not complete. But that was not the entire issue; there was too much “noise” in the way for me to see the problem clearly. Resolving to reduce that noise, I started working on the main part of the issue. In the cases where the inline element was on its own inside the parent Link element, the Code Span element, the Raw HTML element, and the Autolink element were not representing themselves properly. A bit of exploration and debugging took care of that. With that noise out of the way, I was able to see the other part of the issue better, and added the condition:

    if not is_inside_of_link:

at the start of each of those handlers. The noise that I had experienced was simply that, in cases where a possible Link element was within another Link element’s link label, each of the changed elements just needed to emit nothing to the collection. Running the tests again verified those observations, but I also saw something else.

Only after getting rid of those issues was I able to see that the __consume_text_for_image_alt_text function was not handling Emphasis start and end elements properly, failing an assert near the end of that function. That fix was easy: adding four lines to the elif statement:

    elif inline_blocks[ind + 1].is_inline_emphasis:
        pass
    elif inline_blocks[ind + 1].is_inline_emphasis_end:
        pass

With both of those issues dealt with, the couple of failures that remained were easy ones. Looking at the HTML output, there were a number of \a character sequences present. As that sequence is the signature for replacement references, I quickly changed the code for the Text element from:

    image_alt_text += inline_blocks[ind + 1].token_text

to:

    image_alt_text += ParserHelper.resolve_references_from_text(
        inline_blocks[ind + 1].token_text
    )
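
To illustrate what resolving those references involves, here is a hedged sketch; the "\a<original>\a<replacement>\a" layout is my assumption for illustration and not a quote of PyMarkdown’s internals:

    # Hedged sketch only: assumes markers of the form
    # "\a<original>\a<replacement>\a" and keeps the replacement half.
    def resolve_replacements(token_text, marker="\a"):
        result = []
        index = 0
        while True:
            start = token_text.find(marker, index)
            if start == -1:
                result.append(token_text[index:])
                break
            result.append(token_text[index:start])
            middle = token_text.index(marker, start + 1)
            end = token_text.index(marker, middle + 1)
            # Keep only the replacement half of the marked span.
            result.append(token_text[middle + 1:end])
            index = end + 1
        return "".join(result)

    assert resolve_replacements("a \a&\a&amp;\a b") == "a &amp; b"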

Beware Of Special Cases

Running the tests again, all the tests were passing except for those that dealt with the &amp; sequence. All the tests dealing with the &copy; sequence were working fine, so I had to think for a bit to figure out what the problem might be. Because I am used to looking at processed HTML code, I initially did not see any problem with an &amp; sequence in the HTML output. It looked right.

Then it hit me. The problem was not with the HTML output; it was with the processing of the Markdown input. In the cases that I had problems with, the desired HTML output contained &amp;amp;, which did not look right until I thought about it. Whereas the &copy; character sequence is interpreted as a named character entity and replaced with the © symbol, the sequence &amp; was not being given a new symbol in the same way. The desired HTML output was correct! The initial & from the sequence was being replaced with the sequence &amp; to ensure it was displayed properly, removing any chance of it being interpreted as a named character entity.
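
The same two-step behavior can be illustrated with Python’s standard library (PyMarkdown implements its own entity handling, so this is an analogy, not the project’s code):

    import html

    # Step 1: both sequences are valid named character references.
    assert html.unescape("&copy;") == "©"
    assert html.unescape("&amp;") == "&"

    # Step 2: when the resolved text is written back out as HTML, a
    # bare "&" must be re-escaped, which is where &amp;amp; comes from
    # when the source text already contained "&amp;".
    assert html.escape("&") == "&amp;"
    assert html.escape("&amp;") == "&amp;amp;"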

Making a quick decision, I looked at the InlineHelper class and found the append_text function that is used to ensure that such strings are properly escaped. Taking a quick look at the imports for InlineHelper and LinkHelper, I thought there was a chance of a circular reference occurring. Given that observation, I decided to make a copy of the append_text function in the LinkHelper class to get around the possibility of a circular reference.

Finally, after a couple of hours of work, all 40 tests were passing. It was a bit of a trek to get there, but it was worth it!

Whoops

As I was starting to write this article, I also started to look into what it would take to remove the duplicate of the append_text function from the LinkHelper class. I had introduced the clone into the LinkHelper class to avoid any issues with referencing the InlineHelper class from the LinkHelper class. It was as I was starting my research into this task that I discovered something I had previously missed: the LinkHelper class was already referencing the InlineHelper class.

After a bit of “how did I miss that?”, I replaced the call to LinkHelper.append_text with InlineHelper.append_text, and everything worked fine. Removing the duplicate append_text function from the LinkHelper class, I ran the complete suite of scenario tests again, and everything still worked fine.

What Was My Experience So Far?

Looking back at the work I did during the week, it was hard not to get jazzed about the progress. I started off with a simple fix that made the source code more readable and searchable. Then I upgraded the base Python version to 3.8, knowing that it would allow me to write the wrapper I wanted around Python’s logging functions. Add to that the coverage and testing I was able to add and verify for Link elements, Image elements, and Link Reference Definitions, and it was really good work!

At some point I noticed the number of scenario tests that I execute with each change. At approximately 2000 tests, I am confident that I am hitting a very large percentage of all scenarios for Markdown transformation, not just the “Golden” scenarios, and that was also a boost to my confidence. While I can expect the things that I note as issues to not work, I also need to make sure I appreciate what is working. Having a solid set of tests like that is what allows me to refactor with confidence that I am not negatively impacting the code.

Refactor? Sigh. One of the things I know I am looking forward to is digging into the refactor tasks in the prioritized sections and getting a couple of them underway. The fact that they are present in those sections is a good reminder that I can always learn how to do things better, and how to make the project more maintainable. And that is always something I can look forward to!

What is Next?

With a solid amount of work done this week, I am hoping to be able to clear out the uncategorized section of the issues list in the next week. Will I make it? Stay tuned!


  1. Information courtesy of the Python download page

  2. I will be covering this in a separate series of articles in the near future. 
