Summary

In my last article, I talked about the headway I was making towards getting more pending issues dealt with. In this article, I talk about my efforts to clean up the scenario tests and the existing parser issues.

Introduction

Another week of demanding work, and another week of results to show for it. While the results of this week are not as flashy as the results from previous weeks, they are still all very meaningful and useful. After a couple of weeks focusing on the newer issues, I decided to tackle some of the older issues. Or at the very least, issues that have been around for a while.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 06 Oct 2021 and 10 Oct 2021.

Issue 44 - The Task That Keeps on Giving

To say that completing Issue 44 is a huge task is not doing it justice. At nine hours and counting, I have verified that all non-rule scenario tests have either descriptive or derivative documentation at the top of their test function. With over 3300 scenario tests as of this afternoon, it is no easy feat to go through and verify the comment string starting each test. It is incredibly boring and monotonous. And I do not know about any readers, but for me, those are the tasks where I get into a mindless kind of action that easily lends itself to mistakes. Did I mention monotonous? Huge? Boring?

So why do it? Because I believe it is the right thing to do. During the months of work on this project, I tried to keep every comment string current. When all is said and done, that is the easiest way to ensure the documentation is up to date. But the truth is that I did not always keep those strings current. If I were in a hurry, I would often copy a scenario test, change it to suit the current circumstances, and then go on to the next scenario test. The function name of the test was changed, but not the description. It was because of that “speedy” behavior that I now found myself scouring over 3300 scenario tests for proper comment strings.

After nine hours of grueling work, I now have all scenario tests covered except for the ones that deal with the rules. There is not much to talk about regarding the fixing of documentation strings. In each case, I determine if the scenario test is an original test or a derivative of a nearby test. If it is an original test, I create a descriptive comment string to try and capture what the test is trying to achieve. In the other cases, I use a derivative comment string that notes which test it is derived from and what change is being made for that specific test.
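As a purely hypothetical illustration, with invented test names and wording rather than anything taken from the project, the two kinds of comment strings look roughly like this:

def test_fenced_code_blocks_extra_99x():
    """
    Test case:  A Fenced Code Block is started inside a list item and the
                list is closed before the code block is finished.
    """


def test_fenced_code_blocks_extra_99xa():
    """
    Test case:  Variation of 99x, but with the list item replaced by a
                block quote.
    """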

To be honest, while this task is monotonous, extremely monotonous, it is mostly reading the comment strings to make sure they make sense. When there are obvious copy-and-forget-to-change strings, they are typically easy to spot and come in groups of three or more. There really is no other way to describe it than it being a boring task. But the benefit is in having a coherent set of scenario tests with confidence that their comment strings describe what the test is about. And that decreases the cost of maintaining the project.

It is a pain, but it is worth it. I think so at least. But let me revisit that statement once I have finished verifying every scenario test. As a friend said one time:

Your Mileage May Vary. Objects in the rear-view mirror may be closer than they appear.

Issue 28 - Pre-Commit Hooks

It probably escaped everyone’s notice, but a change was introduced during the last month to add Git pre-commit support to the project. The change itself was a ridiculously small one: one of our contributors added the .pre-commit-hooks.yaml file to the project. But the effect was huge. The PyMarkdown project can now be used as a pre-commit hook on any system that supports Git pre-commit hooks.

With that file in the repository, Git projects on GitHub can now leverage that information to execute the PyMarkdown project without having to install a Python package. While the file itself does not contain much information, it allows a .pre-commit-config.yaml file to be added to any repository that references PyMarkdown. If that file is present and contains a reference to the PyMarkdown repository as documented here, then PyMarkdown will be invoked as configured. It is neat and tidy. Even better, it is an almost zero-cost approach for using PyMarkdown.

But as with any project, there were a couple of baby steps that needed to be taken.

Problem 1: Allowing for Configuration

The first issue with this feature was easy to overcome: how to allow for configuration? Because of the way the .pre-commit-hooks.yaml file is organized, my first instinct was to specify an entry value of pymarkdown scan. While this worked as a great default, when custom arguments were added, they were appended to the end of that value. Therefore, if you wanted to disable rule not-me and tried to add an args value of ["--disable-rules", "not-me"], the executed command line would be pymarkdown scan --disable-rules not-me, which fails because PyMarkdown expects its options to appear before the scan keyword. Adding the scan keyword to the end of the args value would not work either, as that keyword had already appeared.

After a bit of research, the solution was to pick a good default and to add easy-to-read documentation on how to deal with the rest. Having the default composed string of pymarkdown scan was the correct choice; I had just chosen a bad implementation of how to get there. Splitting those two keywords into an entry value of pymarkdown and an args value of scan provided the required flexibility to add user-defined configuration.

- id: pymarkdown
  name: PyMarkdown
  description: "PyMarkdown - GitHub Flavored Markdown and CommonMark Compliant Linter"
  language: python
  language_version: python3
  entry: pymarkdown
  args: [scan]
  types: [markdown]

To deal with the rest of the issue, documentation was added to inform users that if they add scanning configuration, the final args argument should be scan. While there is nothing preventing users from using any of the other modes of the project, those modes tend to be mostly static. Therefore, I determined that those modes have near-zero value when invoking PyMarkdown through a pre-commit hook and geared the pre-commit documentation solely towards scan mode usage.
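To make that ordering concrete, here is a small, simplified model of how the final command line gets assembled; it is only an illustration of the behavior described above, not pre-commit’s actual implementation.

import shlex


def compose_command(entry, args, filenames):
    # Simplified model: the entry tokens come first, then the hook's args,
    # then the staged file names that get passed to the hook.
    return shlex.split(entry) + list(args) + list(filenames)


# Original approach: "scan" was baked into the entry, so user-supplied
# options landed after the scan keyword.
print(compose_command("pymarkdown scan", ["--disable-rules", "not-me"], ["README.md"]))
# ['pymarkdown', 'scan', '--disable-rules', 'not-me', 'README.md']

# Current approach: the entry is just "pymarkdown", and the user supplies
# their options followed by "scan" as the final args value.
print(compose_command("pymarkdown", ["--disable-rules", "not-me", "scan"], ["README.md"]))
# ['pymarkdown', '--disable-rules', 'not-me', 'scan', 'README.md']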

Problem 2: What Is Good Documentation?

While addressing the simple stuff for pre-commit hooks was easy, there was an interesting question posed by one of the contributors. The condensed version of that question was: why provide any pre-commit documentation at all? The points that they raised were valid. The target implementation of pre-commit used by GitHub may change, requiring a change to the documentation. So would not a “for any questions, go to pre-commit’s site” note be sufficient?

Thinking about the issue like someone who had just found out about pre-commit hooks1, I added documentation while trying to maintain that viewpoint. My goal was to add enough documentation that a relative pre-commit newbie like myself could add PyMarkdown support to their project using that documentation. While the amount of text in that document was not large, it was not small either. But for me, it was the right amount of documentation for the subject at hand.

Even so, I spent the next couple of weeks after that comment wondering whether I had gone overboard or not. It was not an easy decision or a hard decision; I just was not sure how I should evaluate it.

In the end, I decided I had to set a clear goal for the feature and for the documentation. To me, the goal of the feature was to make it easy to add PyMarkdown support into a Git project. Once I nailed that down, the documentation focus became a lot clearer. If I wanted to make adding pre-commit support easy, I needed good support to walk people through 90% of the things they can do with PyMarkdown and their pre-commit hooks.

With that focus, I felt that the scope of the document was validated. As I had written the document as a relative pre-commit newbie, I felt that the information presented answered all normal questions that I had during that process. How do I add it? Where do I add it? Can I lock it down to a version so I can replicate a set of results? How do I configure it? Can I use a local configuration file? Can I control which files it scans? These were all questions that I asked myself when I added PyMarkdown support to one of my private repositories.

While the document is long, I strongly feel that it answers each of the questions in the previous paragraph properly. And from my viewpoint, if my goal is to make it easy to add PyMarkdown support through git pre-commit hooks, then that is the level of documentation required. Time will tell.

Release 0.9.1

Hopefully release events will be happening more frequently for this project, and this was a good one. With a good collection of additions and fixes, it just felt like a suitable time to release an update.

Issue 47 - Fun with Fenced Code Blocks

Okay, maybe fun is not the right word. Going back over the results of the latest test scan, I noticed that I still had around eleven scenario tests that were marked as skipped. Deciding to tackle them to wrap up the week, I first started looking at the three scenario tests that were part of the Fenced Code Block tests.

The parsing errors were simple to see. In two of the tests, a start Fenced Code Block token was present without a matching end Fenced Code Block token. The remaining test mistakenly reported the Fenced Code Block element as an HTML Block element, totally missing the mark. It just looked bad.

The good news? Well… the relatively good news? All three scenario tests dealt with Link Reference Definitions that are aborted part way through. As I have mentioned in previous articles, the proper handling of Link Reference Definition elements is not easy, and those elements remain the one aspect of the parser that does not have my full confidence. That lack of confidence is not for lack of trying. It is a measure of just how many combinations there are of aborted Link Reference Definitions and the other elements that surround them. I know I am getting closer to having them all covered, but I am not sure that I have every combination covered… yet! This was just a continuation of the process of increasing coverage on Link Reference Definitions, and I was ready to accept that challenge.

Covering This Combination

In each of these three scenarios, a Fenced Code Block element was started inside of a container element. In all three scenarios, the next line included a valid open for a Link Reference Definition element, only to have the container close on the third line. From a parser point of view, the Link Reference Definition element was only in a partial state and therefore needed to be rewound so it could be interpreted as normal text. It was in that code that the issue appeared.

In all three scenarios, the currently active token stack (token_stack) and the current token document (token_document) maintained in the parser_state variable were not being rewound properly. The result was that both the token_stack and token_document variables were left in weird states. It was then that I recognized that the second line of the Link Reference Definition was valid, and I needed to adjust the tokens to match a partially finished Link Reference Definition.

In these specific cases, the second line of the Link Reference Definition properly followed the first line, providing a Link Reference Definition element that has a label and a URL, but no title. By reducing the number of examined lines to only include those two lines, the code was able to form a complete reference with only the label and the URL. Yet after having done that, the work that went into forming that reference was being erased. But why?

Looking some more at the code, the answer slowly became obvious. As the code for the process_link_reference_definition function existed at that time, if a requeue was requested, the code assumed that no valid Link Reference Definition was parsed. However, in these three scenarios, a partial Link Reference Definition was found and there was at least one line to requeue for processing.

The fix was simple: do not rewind everything if a partial reference is parsed. Even a partial reference is valid and that means that the text from the start of the element remains valid. Looking in the (seriously in need of a refactor) function process_link_reference_definition, there is a block of code near line 188:

if lrd_stack_token.copy_of_last_block_quote_markdown_token:

and another block near line 228:

if lines_to_requeue:

To fix these scenarios, I grabbed the lines between those two sections and placed them under the following conditional:

if not did_complete_lrd:

Even though there may be lines that need to be requeued, if there is even a partial Link Reference Definition, the token stack and the token document are not rewound.
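To show the shape of that guard without reproducing the whole function, here is a heavily simplified, self-contained sketch; the names below are stand-ins for the real parser state, and the real process_link_reference_definition function does considerably more bookkeeping.

class ParserState:
    def __init__(self):
        self.token_stack = ["document"]
        self.token_document = []


def finish_link_reference_definition(
    parser_state, did_complete_lrd, new_tokens, lines_to_requeue, original_stack_depth
):
    # Only rewind the stack and throw away the new tokens when no Link
    # Reference Definition, not even a partial one, was completed.
    if not did_complete_lrd:
        del parser_state.token_stack[original_stack_depth:]
        new_tokens = []
    # Either way, any unused lines are handed back for reprocessing.
    return new_tokens, lines_to_requeue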

It took a while to figure that out, but when I added that code, things started to fall into place. I went through the three scenario tests and started to adjust the tokens and to verify the output against Babelmark. And things just worked. It was a good feeling.

Issue 49 - HTML Blocks

Following up from the previous work I did on Fenced Code Blocks, I thought that dealing with the six scenario tests dealing with HTML blocks was a good option for my next task. That, and while I had confidence I could deal with the skipped List element scenario tests, I wanted to leave them to last. I know that container elements are tricky and can easily devolve into a time sink. So, HTML Block elements it was!

HTML Blocks Are Like Code Blocks

Starting at the end of the skipped HTML Block tests, I worked my way back up the list. It was quickly obvious that the problem with each of the first four tests was a simple one. In each case, the Container Block processor was looking for a valid Link Reference Definition within an HTML Block element. It was as simple as that.

If that does not make sense, there are two lines in the GitHub Flavored Markdown specification that should make this clear:

HTML blocks continue until they are closed by their appropriate end condition, or the last line of the document or other container block. This means any HTML within an HTML block that might otherwise be recognized as a start condition will be ignored by the parser and passed through as-is, without changing the parser’s state.

While a bit terse, the meaning of those two sentences is clear. Once an HTML Block element is started, any HTML is passed through as-is. To be even more precise, everything within the block is considered HTML, even if it is not valid HTML. Therefore, if something that looks like a Link Reference Definition is started, it should be treated like HTML and passed through as-is.

To address that problem, I modified the call to __handle_link_reference_definition in the function __parse_line_for_leaf_blocks from:

    (
        outer_processed,
        requeue_line_info,
    ) = ContainerBlockProcessor.__handle_link_reference_definition(
        parser_state,
        outer_processed,
        position_marker,
        extracted_whitespace,
        remaining_line_to_parse,
        ignore_link_definition_start,
        pre_tokens,
    )

to:

    ignore_lrd_start = (
        ignore_link_definition_start or parser_state.token_stack[-1].is_html_block
    )

    (
        outer_processed,
        requeue_line_info,
    ) = ContainerBlockProcessor.__handle_link_reference_definition(
        parser_state,
        outer_processed,
        position_marker,
        extracted_whitespace,
        remaining_line_to_parse,
        ignore_lrd_start,
        pre_tokens,
    )

This code ignores any Link Reference Definition starts if the flag is passed to the function or if the current stack token is an HTML Block token.

Going through those four scenario tests, I was able to make quick work of getting each of them cleaned up and passing. In each scenario, it was just a matter of validating the tokens by hand, then using Babelmark to verify the HTML. That process went quickly.

To ensure that things were kept in a good state, I decided to commit these changes, having a good solid bookmark for later. It turns out that it was a clever idea that would pay dividends soon. Things were about to get messy.

I can clearly remember looking at the Markdown for scenario test test_html_blocks_extra_02a, then looking at the generated HTML and thinking “that looks wrong”. It took me about twenty minutes of looking and working around the house to figure out why, but I got there.2

Understanding why the generated HTML looked wrong requires some history on the PyMarkdown project. I started working on the PyMarkdown project because I saw a need for a Markdown linter that was written in Python. Now, I have a long love of parsers based on solid specifications because there is a finality to them that is appealing to me. People can argue about whether one piece of Markdown is prettier or more functional than another piece of Markdown. That is opinion. But I decided to start writing the parser against the GitHub Flavored Markdown specification, following that wherever it took me. It was a solid specification with examples, so it provided guidance on what was valid and test data that backed up that guidance. For most of the project, I easily implemented the new elements, one after the other.

The last element of the specification’s core that I implemented was the Link Reference Definition element. From looking at the specification for the element, I knew there was a simple way to do it and a proper way to do it. Knowing that the difference between the two approaches was at least one month of development time, I followed my gut and decided to do it the proper way. Regardless of what happens, I solidly feel that I made the right decision at that point.

And the fallout from that decision was not easy to swallow. The resultant effort added at least two months of work to support Link Reference Definitions and line requeues. Parsing the Link Reference Definitions going forward was the easy part. Because of the nature of Link Reference Definitions, it is not unusual for the next line to be read in before determining whether an active Link Reference Definition is complete. If that next line completes the Link Reference Definition, everything is good. Otherwise, the data needs to be rewound, resulting in any lines that were consumed being requeued. To keep things simple, I always use the requeue_line_info variable to denote requeued lines of this nature.

Those line requeues also must reverse any state changes, which is where the problems usually arise. In a smaller, more contained system, it would be easier to record any changes and simply roll them back when the lines are requeued. That is a more challenging task to achieve in this project. I am confident that I have at least 85% to 90% of the cases dealt with, but that is just a guess. And with only a small handful of failures in the base specification’s test data, I have had to generate my own test data to test these scenarios properly.
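For readers who want a more concrete picture of that look-ahead and requeue pattern, here is a tiny, self-contained illustration; it only recognizes an invented two-line definition and shares nothing with the real parser beyond the general shape.

def try_parse_toy_lrd(lines, start):
    # Consume a "[label]:" line, then look ahead for the URL on the next
    # line before knowing whether the definition is complete.
    consumed = []
    index = start
    if index < len(lines) and lines[index].startswith("[") and lines[index].endswith("]:"):
        consumed.append(lines[index])
        index += 1
        if index < len(lines) and lines[index].strip():
            return {"label": consumed[0][1:-2], "url": lines[index].strip()}, []
    # The definition never completed: every consumed line must be requeued
    # so it can be reparsed as ordinary text.
    return None, consumed


print(try_parse_toy_lrd(["[foo]:", "/url"], 0))  # ({'label': 'foo', 'url': '/url'}, [])
print(try_parse_toy_lrd(["[foo]:", ""], 0))      # (None, ['[foo]:'])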

What Does This Mean for These Tests?

With that history refresher completed, I can now point out that PyMarkdown is one of the few parsers that I have examined that properly implement Link Reference Definitions. After talking with a handful of parser authors and checking their source code, the most prevalent way of parsing Link Reference Definitions is to examine parsed Paragraph elements looking for Link Reference Definitions within their content. While it is not a perfect match for the functionality, that approach surprisingly handles most of the cases that come up. The only difference is that the use of Paragraphs as a “container” also allows for Paragraph Continuation to happen.

It was as I was looking at the test_html_blocks_extra_02a test that all this information came flooding back to me. The resultant HTML looked wrong because it was wrong. If you are using the “parse from paragraph” approach, then the Markdown:

- <script>
- [foo]:
/url
</script>

producing a Link Reference Definition token makes sense. Line two starts the Link Reference Definition, only to be continued and completed on line three because line three becomes paragraph continuation text, allowing it to be added to the previous line due to laziness. This results in:

<ul>
<li>
<script>
</li>
<li></script></li>
</ul>

However, I decided to approach Link Reference Definitions as their own leaf element. This meant adjusting the output of the scenario tests to reflect that approach. In particular, the above example produces the following HTML output:

<ul>
<li>
<script>
</li>
<li>[foo]:
/url
</script></li>
</ul>

Without the paragraph continuation promoting laziness, the Link Reference Definition is not valid, resulting in it being interpreted as plain text.

Fixing The Problem

Digging into this problem, I noticed a few issues that I needed to solve. The first issue was in the __check_for_list_closures function where an already started Link Reference Definition stopped a list from being properly closed. In the previous Markdown example, this is what prevented the parser from properly terminating the Link Reference Definition when the indentation did not support continuing the list on line three.

But removing:

and not parser_state.token_stack[-1].was_link_definition_started

caused a cascade effect. Once removed, the close_open_blocks_fn function was able to try and close the open list element. The problem that it uncovered was that there was no support for the requeue_line_info variable. It took the better part of two hours to add proper support for the requeue_line_info variable, including support for the cases where that variable was not used. And even after adding that code and getting all the other tests passing, I still walked through the scenarios with the debugger to make sure things were working properly.

As I switched my focus back to solving the issue for the tests, the next issue that arose was that an extra blank line was getting inserted during the rewind. After figuring out that it was the “empty” arguments used to terminate the Link Reference Definition in the close_open_blocks_fn function, I explored whether to pass extra data into that function to be used for aborted Link Reference Definitions. Since this was the only case where it happened, I opted instead to keep the modifications local, adding two asserts and replacing that blank line with the current line in the __check_for_list_closures function itself. I am not sure if I am 100% okay with this solution, because it feels more like a hack than a proper fix. But given the amount of extra work that would be needed to support a new parameter in the close_open_blocks_fn function, it was a lower-cost option that was clean to implement.

Thinking I was done, I reran the tests, expecting them to pass. They did not pass. Going once again into the debugger, I found that I was not properly managing the cases where the stack depth decreased. Specifically, there were cases where the start of the Link Reference Definition removed an existing element when it was parsed. As that line was now being unwound, those token_stack elements needed to be put back to how they were before.

Once again trying to think out of the box, I worked on clever tricks to make things work before deciding on using something simple: a stack copy. I had avoided using these in the past as I was afraid that they would not capture the state properly. But in this case, there was no alternative. Going back to the process_link_reference_definition function, I transformed this code:

if len(parser_state.token_stack) >= original_stack_depth:
    while len(parser_state.token_stack) > original_stack_depth:
        del parser_state.token_stack[-1]

into:

if len(parser_state.token_stack) >= original_stack_depth:
    while len(parser_state.token_stack) > original_stack_depth:
        del parser_state.token_stack[-1]
else:
    while len(parser_state.token_stack):
        del parser_state.token_stack[-1]
    for next_token in lrd_stack_token.copy_of_token_stack:
        parser_state.token_stack.append(next_token)

With that last change added, the final two scenario tests were now passing. Executing every scenario test for the project, I found a handful of other scenario tests that were also “wrong” and fixed them. It was a good find, and I was happy to increase my confidence in how aborted Link Reference Definitions were handled.

But next… the three disabled List element scenario tests. But they could wait until Sunday morning.

Issue 51 - Lists and Indents

After spending my free time on Saturday working on Issue 49, I could only hope that Issue 51 would be a lot easier. But knowing that those scenario tests were dealing with List elements did not give me confidence. It is not that I cannot find issues with Lists; it is quite the opposite. Whether it is true or not, I think I have worked hard to deal with most of the List element related failures. That means that whatever failures are left are more difficult. Not fun… but still worth doing.

Taking a quick look at the three scenario tests, the Markdown documents quickly divided the tests into two groups. The first group dealt with strange indents and Block Quote elements while the second group dealt with interesting indentation and its effect on the output. After flipping a card that I keep on my desk for just these decisions, I decided to work on the second group first.

Looking at the failures, it was obvious that there were two separate problems. The first problem was that valid list starts that began with more than three space characters were being ignored. In particular, the entries for nested lists that were indented more than three spaces from the start were not parsing properly. To solve this issue, I created a new __adjust_whitespace_for_nested_lists function that examined a current list token and its parent token, finding the correct range for a “base” indent to use. Once that range was determined, if the amount of whitespace was within that range, the adj_ws variable was trimmed to remove the excess space characters.
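As a rough, simplified illustration of that trimming (the real function works on the list tokens themselves, and the exact bounds here are my own simplification rather than the project’s code):

def adjust_whitespace_for_nested_lists(extracted_whitespace, parent_indent_level):
    # If the leading whitespace falls within the allowed window for a nested
    # list start, trim it back to the parent's indent so the rest of the
    # processing sees a "normal" amount of indentation.
    ws_count = len(extracted_whitespace)
    if parent_indent_level <= ws_count <= parent_indent_level + 3:
        return extracted_whitespace[:parent_indent_level]
    return extracted_whitespace


print(repr(adjust_whitespace_for_nested_lists("     ", 3)))  # '   ' - trimmed
print(repr(adjust_whitespace_for_nested_lists("  ", 3)))     # '  '  - left alone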

Continuing with that theme, the next issue was within the __process_eligible_list_start function. Like the work performed in the __adjust_whitespace_for_nested_lists function, the __process_eligible_list_start function needed to be adjusted to remove only those stack items that were necessary to remove. As with the previous solution, this solution was geared towards properly taking care of lists and their indentation. In this case, it needed to deal with list indents that were decreasing for a given list level but were still considered valid.

A good example of this is:

1. list
     1. list
    10. list
   100. list

In that document, each item of the nested list is indented properly, just aligned on the right side. That means the starting point for the item text remains constant, but the starting column of each list item number decreases by one for each line.

To solve this, I needed to add another range check. While increasing indents for lists had already been dealt with, I needed to support decreasing indents. As such, the left bound for that range was the start of text for the parent list. The right bound was that point plus three characters, as dictated by the specification. It took a bit to code, but it was quickly added to the function.
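Here is a small sketch of that bounds check, with invented names standing in for the values that __process_eligible_list_start pulls from its stack tokens:

def is_valid_nested_list_start(list_start_column, parent_text_start_column):
    # A nested list item may start anywhere from the column where the parent
    # item's text starts up to three characters past it, even if its indent
    # is smaller than the previous sibling's indent.
    return (
        parent_text_start_column
        <= list_start_column
        <= parent_text_start_column + 3
    )


# In the example above, the parent item's text starts at column 3, so the
# nested items starting at columns 5, 4, and 3 are all accepted.
print(is_valid_nested_list_start(5, 3))  # True
print(is_valid_nested_list_start(3, 3))  # True
print(is_valid_nested_list_start(8, 3))  # False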

Hoping I had done enough to solve the issue, I executed the scenario tests and verified the results. After correcting the tokens and verifying the output, I was happy to have both List element tests now passing. It was a good feeling.

What About the First Group?

After all that work, it was already later in the afternoon than I had hoped for. As such, I created a new issue for the remaining test and started to compose the article as I went through the process of committing the changes so far. The work was a big enough chunk as it was that I did not feel bad about splitting the remaining work off. It just made sense.

What Was My Experience So Far?

Ugh. It was a long week and a boring week. But I know that work is going to pay off in lower maintenance costs going forward, so I am still on-board for getting it done. It just needs to get done, and I must muster the patience to get it accomplished in short order.

As for the other work that I accomplished this week, I am happy to get those items off the Issues List. The painful thing that I realized after solving those issues is that I could have solved them earlier. Granted, it took the better part of two days to eliminate ten of the eleven skipped tests, but that is a small amount of effort to expend. I feel that if I had expended that effort earlier, solving them would have cost even less.

But things worked out the way they did, and the project is better for it. At this point, I am just focused on getting the reported issues dealt with and making the project as robust as possible. I do feel it is in a good place; I just want it to be in a slightly better place!

What is Next?

I still need to finish the verification of the comment strings. I do not want to let that drag on. But as to which issues I will choose other than that one, I am not sure. Stay tuned!


  1. I had been exposed to them before but had forgotten all about them in the time since then. 

  2. I am not sure if it works for any readers out there, but I often find that taking a break and doing something that is not related to what you are having a problem with is great for breaking through problems. 
