Summary

In my last article, I continued working on some big-ticket items from the issues list, making the most of my New Year Holiday break. Back in “normal time”, I am tackling Block Quote items, to try to get to the prioritized part of my issues list within the next week or two.

Introduction

While I knew that I was not going to solve the same volume of items as last week, I was confident that I could use this week to make some good progress in dealing with Block Quote elements and their interaction with other elements. I also knew that my mental space was going to be limited this week due to the end of the holidays. I was not the only one who took time off from my day job, as most of the company I work for took the same two weeks off. And with everyone coming back to work at the same time, there were bound to be lots of meetings to make sure everyone was resynced for the New Year. And that week there… were… lots… of… meetings.

Taking that into account, I started my work for the week by resetting my personal expectations of what I could accomplish. I felt that it was important to my sanity to take the time to seriously understand that I did not need to continue taking care of multiple big-ticket items. Just a handful of normal items would suffice. I knew that if I could manage to make the switch to that mindset, it would be a good week. So, with that in mind, the work started.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 07 Jan 2021 and 11 Jan 2021.

Starting with Something Simple

As a matter of pride, I try to keep the PyMarkdown code base clean and in line with flake8 and pylint guidelines. While I mostly correct any raised issues right away, I often choose to temporarily disable an issue until a later time when I can resolve it. My logic in making that decision is that it is usually better for me to concentrate on the big picture in the moment, addressing any raised issues when I have some less hectic bursts of time. As such, at various points in the code base, there are comments such as:

# pylint: disable=too-many-public-methods

to disable a warning and:

# pylint: enable=too-many-public-methods

to enable the warning again.

But as I am only human, I sometimes forget to balance these statements out, either disabling a warning and forgetting to enable it again, or enabling a warning that was never disabled. By writing up a simple Python script, I quickly figured out where these issues were and corrected them. While it was not a very important thing to do, it was nice to know that I had these nailed down. A good start to the week.
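
For illustration, a minimal sketch of that kind of balance checker could look like the following; the file name, regular expression, and output format here are invented for this example and do not come from the actual script in the repository.

    # pylint_balance_check.py: a made-up sketch of this kind of balance checker.
    # The actual script in the repository differs; names and output are examples.
    import re
    import sys

    PYLINT_COMMENT = re.compile(r"#\s*pylint:\s*(disable|enable)=([\w-]+)")

    def check_file(file_name):
        currently_disabled = set()
        with open(file_name, encoding="utf-8") as source_file:
            for line_number, line in enumerate(source_file, start=1):
                match = PYLINT_COMMENT.search(line)
                if not match:
                    continue
                action, warning_name = match.groups()
                if action == "disable":
                    currently_disabled.add(warning_name)
                elif warning_name in currently_disabled:
                    currently_disabled.remove(warning_name)
                else:
                    print(f"{file_name}:{line_number}: '{warning_name}' enabled but never disabled")
        # Anything still in the set was disabled but never re-enabled.
        for warning_name in sorted(currently_disabled):
            print(f"{file_name}: '{warning_name}' disabled but never re-enabled")

    for path_to_check in sys.argv[1:]:
        check_file(path_to_check)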

Rounding Out Multiline Inline Elements

One thing that I was sure that I had not covered properly was newline characters contained within Code Span elements and Raw HTML elements. While I had corrected a handful of issues in this area in the past, I did not feel that I had properly covered all the necessary cases, and I wanted to address that discrepancy.

Like I normally do, I started with scenario test creation. This began by taking a good look at the available scenario tests in the test_markdown_raw_html.py module and the test_markdown_code_spans.py module. For the first module, I added variations of the test function test_raw_html_634, focusing on any container block elements or leaf block elements that I had not covered elsewhere. I then repeated this process for the other module by adding variations on the test_code_spans_346 test function. This resulted in eleven new scenario tests being added: four for the Raw HTML element and seven for the Code Span element.

From a top-level point of view, the scenario tests for Raw HTML elements worked fine and did not reveal any additional issues. The Code Span element tests were another matter. While I had previously dealt with newline characters in the main body of the Code Span element, I had forgotten to perform the same actions on the leading and trailing whitespace for the element. Feeling embarrassed that I had forgotten the whitespace parts of the token, I quickly made changes to the handle_inline_backtick function and the __collect_text_from_blocks function to ensure that the correct tokens were being generated.

To balance these changes out, I also changed the __verify_next_inline_code_span function in the consistency checks to pay attention to the leading and trailing whitespace. Like the changes detailed in the last paragraph, these changes were not difficult once I knew what the problem was. But looking at the code while I was making these changes, I realized that I should not feel embarrassed. While I was being thorough with my testing, the issues that I was finding were more corner cases than anything else. Put bluntly, unless I was testing corner cases, I was sure that I would not create a Raw HTML element like:

<a><b href=""
><c>

or a Code Span element like:

This is some `
really nasty
` code.

Unless some specific formatting called for it in a really weird circumstance, I believe I would always write them on one line, not multiple lines.

But it was good to get the corner cases covered. In my head, I know that if I am focusing on the corner cases, I can feel confident about the normal cases. That is a good place for me to be!

Adding Glob Support

While not a part of the issues list, one of the things that I had been experimenting with in some “down time” was adding Glob support to the project. This work came about because the quick script that I threw together for validating PyLint disables and enables needed to be able to specify a targeted set of files using Python glob support. Since that script uses the same type of mainline base as the PyMarkdown project, I figured the PyLint scanner script was a low-cost, low-risk place to see how much effort it would take to implement the same support in the PyMarkdown project.

It turned out to be very easy. The __determine_files_to_scan function was the main point of contact for determining the files to process. It took exact file paths, each pointing to either a directory or a file, and returned a set containing all valid paths. In the case of a file path, it simply added the full path to that file to the collection to be returned. In the case of a directory, the directory was scanned, and all matching files were added to that same collection. Nice, self-contained, and simple.

Being self-contained, it was easy to modify this function to add glob support. To keep handling those simple cases, I moved that functionality out of the main function and into a new helper function, __process_next_path. With that extracted, I rewrote the __determine_files_to_scan function as follows:

    if "*" in next_path or "?" in next_path:
        globbed_paths = glob.glob(next_path)
        if not globbed_paths:
            print(
                "Provided glob path '"
                + next_path
                + "' did not match any files.",
                file=sys.stderr,
            )
            did_error_scanning_files = True
            break
        for next_globbed_path in globbed_paths:
            next_globbed_path = next_globbed_path.replace("\\", "/")
            self.__process_next_path(next_globbed_path, files_to_parse)
    else:
        if not self.__process_next_path(next_path, files_to_parse):
            did_error_scanning_files = True
            break

As the call to glob.glob already returns a list of matching paths, I was already most of the way to having this implemented. All I needed to do was properly add the elements returned from the glob call to the collection. So, instead of rewriting the code to add matching elements to the files_to_parse variable, I just called the already debugged __process_next_path function to do all the heavy lifting.
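
For context, here is a guess at what the extracted __process_next_path helper might look like, based on its description above; the body, the existence check, and the “.md” filter are all assumptions for this sketch, not the actual project code.

    import os
    import sys

    # A guess at the extracted helper, based on the description above; the
    # actual project code may differ (the ".md" filter here is an assumption).
    def __process_next_path(self, next_path, files_to_parse):
        if not os.path.exists(next_path):
            print(f"Provided path '{next_path}' does not exist.", file=sys.stderr)
            return False
        if os.path.isdir(next_path):
            # For a directory, add every eligible Markdown file it contains.
            for entry in os.listdir(next_path):
                full_path = os.path.join(next_path, entry).replace("\\", "/")
                if os.path.isfile(full_path) and full_path.endswith(".md"):
                    files_to_parse.add(full_path)
        else:
            # For a single file, add its path directly to the collection.
            files_to_parse.add(next_path)
        return True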

Once that was done, manual testing of the new functionality went fine. Some new scenario tests needed to be added, and a couple of existing scenario tests needed to be changed slightly, but nothing unexpected. After an hour or so, the work was done and tested. While not terribly exciting, I could now do some manual testing of the PyMarkdown project against a set of files that was not a single file, nor every eligible file in that directory. And it just felt good to get a small task like that out of the way!

Filling Out Existing Tests

Narrowing down the items to work on from the issues list, the one that I settled on was:

- test_block_quotes_extra_02a with extra levels of lists?

To start the work on this item, I added three variations of the test_block_quotes_extra_02 test function, altering the number of lists in the document and their locations. Noticing that I could do the same type of variations for Block Quote elements, I also added ten new scenario test functions that were variations on the test_block_quotes_extra_04 function, mixing Block Quote elements with the various types of non-inline elements.

Executing the bulk of the new tests, I was pleasantly surprised that everything except for the consistency checks was passing without any changes being needed. Even the changes needed for the consistency checks were relatively minor and fell into two main groups.

The first group of changes was in the inline handling part of the verify_line_and_column_numbers function. These changes were not material in nature but served to ensure that the leading_text_index field of the Block Quote token was properly updated. This required inspecting each inline token to determine whether any newline characters were encountered. If any are encountered, the leading_text_index field is incremented by the number of newline characters, ensuring that any references to that field point to the correct line prefix.
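
As a rough sketch of that bookkeeping, the logic looks something like the following; apart from the leading_text_index field name, everything here is a hypothetical stand-in.

    # A rough sketch of the bookkeeping described above; everything except
    # the leading_text_index field name is hypothetical.
    def advance_leading_text_index(block_quote_token, inline_token_text):
        newline_count = inline_token_text.count("\n")
        if newline_count:
            # Each newline consumes one more of the block quote's recorded
            # line prefixes, so move the index forward to match.
            block_quote_token.leading_text_index += newline_count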

Seemingly balancing that change, there were a handful of end Leaf tokens that also needed adjusting to properly support the leading_text_index field. Through trial and error, I quickly isolated each type of token and was able to properly increment the leading_text_index field to handle the end token. It was not a big task, but it was one that I needed to be very methodical about. I did find that I needed to adjust each of them at least once, as each test provided coverage for a specific scenario that had been missed. While it was not that much extra work for each individual test, the amount of work required over all the tests added up quickly.

In the case of test function test_block_quotes_extra_04f, the issue was that it was just broken. No niceties or anything else, just broken. The test function had been added during the last section’s work and then disabled, and its Markdown was:

> [
> abc
> ](/uri)
>
> end

What made this test function broken was not the Markdown itself, but the generated tokens for it. For whatever reason, the parsing of the Block Quote was both started and ended on the first line, only to be restarted on the second line. Because of the container nature of the Block Quote element, this then split the text required for the Inline Link element over two distinct Block Quotes. It was just wrong!

Setting the Stage

The debugging took a couple of hours to work through, but it was rewarding when I solved it. The problem with the parsing boiled down to my favorite element (heavy sarcasm is implied), the Link Reference Definition element. Because of the unique nature of this element and how it is parsed, I had to add the ability to rewind or requeue the parser so that failed lines from a Link Reference Definition could be properly processed. And while it had worked properly until this point, test function test_block_quotes_extra_04f provided an interesting twist to normal operation, and therefore, an interesting problem.

Because of design decisions for Markdown, the Link element and the Link Reference Definition element both start with the same sequence: [link]. If this sequence is followed by an open square bracket character [, then it probably specifies a collapsed or full link. If this sequence is followed by an open parenthesis character (, then it probably specifies an inline link. If this sequence is followed by a colon character :, then it probably specifies a Link Reference Definition. And finally, if not followed by any of the above, it is probably a shortcut link.
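
To make that concrete, here are the four shapes side by side, each starting with the same [link] sequence:

    [link][ref]     probably a full reference link (or, with [], a collapsed link)
    [link](/uri)    probably an inline link
    [link]: /uri    probably a Link Reference Definition
    [link]          probably a shortcut link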

Most of those combinations do not matter, except for my friend (once again, heavy sarcasm implied), the Link Reference Definition. While the Link element and its types are all processed in the inline phase of processing, the Link Reference Definition is processed much earlier in the block phase of processing. Due to that difference, the Link element processing is done with the entire contents of the processed Text token being available, but the Link Reference Definition processing is done one line at a time.

Working Through the Process

Why was that information relevant? In the case of the above Markdown, the specified text supports both a Link element and a Link Reference Definition element until line 3. Before that point, the Link Reference Definition processing continues forward. When that point is reached on line 3, the line is processed for suitability as a Link Reference Definition, it fails, and the requeue mechanism needs to be enacted so that the lines can be interpreted properly. Unlike in any previous scenario test, in this case that requeue mechanism was not sufficient.

What was being requeued was only the information after processing. When the requeue mechanism kicked in, it was trying to return to the state that was in place when the Link Reference Definition started. But when it started processing the requeued information, it did so with the processed line of information. That line was missing the Block Quote prefix, causing the Block Quote to be closed. It took a while to get there, but I had figured out why the Block Quote was being closed!

Fixing the Issue

In this case, the line that had been passed to the Link Reference Definition processor did not have the Block Quote prefix attached to it. Because that prefix had been removed at the container level before the line was passed on for Leaf Block processing, the lines to requeue were missing information. To fix that issue, I had to figure out a way to retain that information so that it could be requeued if needed. Therefore, I introduced the unmodified_line_to_parse variable, which contains the line as read, unmodified by any processing.
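
To illustrate the idea, here is a simplified, self-contained sketch; apart from unmodified_line_to_parse, the names and the prefix handling are stand-ins for this example, not the project’s code.

    # A simplified, self-contained sketch; apart from unmodified_line_to_parse,
    # the names and prefix handling here are hypothetical.
    lines_to_requeue = []

    def process_line(source_line, requeue_needed):
        # Keep the line exactly as it was read, before any prefix removal.
        unmodified_line_to_parse = source_line
        # The container processor strips the block quote prefix before the
        # line is handed on for leaf block processing.
        line_to_parse = source_line[2:] if source_line.startswith("> ") else source_line
        if requeue_needed:
            # Requeue the original line, prefix and all, so that reprocessing
            # starts from exactly what was read out of the document.
            lines_to_requeue.append(unmodified_line_to_parse)
        return line_to_parse

    process_line("> [", requeue_needed=False)
    process_line("> abc", requeue_needed=True)
    print(lines_to_requeue)  # ['> abc'] - the "> " prefix is preserved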

This got me a lot of mileage in fixing this issue, but after rerunning some of the tests, a couple of them were still failing because there was another issue somewhere. Debugging that issue over the course of an hour, I found that there was another requeue issue that I needed to address: the main document and the main token stack. In a couple of the new scenarios, when the processing of the Link Reference Definition was started, another type of block element was ended. The effect of this was that a new Markdown token was placed in the document and a new stack token was placed on the main token stack. While the rewinding took care of the data, it did not take care of that state information.

Dealing with that issue was somewhat simple but took a while to get right. Before starting the processing of the Link Reference Definition, I keep track of the lengths of both the main document and the token stack. If I need to requeue elements, I simply remove any entries that are past that mark. It is not very graceful, but it worked wonderfully.
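
A minimal, self-contained sketch of that snapshot-and-truncate approach looks something like this; the variable names and placeholder tokens are made up for the example.

    # Placeholder strings stand in for the real Markdown and stack tokens.
    token_document = ["paragraph-token"]
    token_stack = ["document", "paragraph"]

    # Before starting Link Reference Definition processing, record the lengths.
    document_mark = len(token_document)
    stack_mark = len(token_stack)

    # Processing the definition speculatively ends another block, adding
    # a new Markdown token and a new stack token.
    token_document.append("end-paragraph-token")
    token_stack.append("link-reference-definition")

    # The definition failed, so remove any entries added after the marks.
    del token_document[document_mark:]
    del token_stack[stack_mark:]
    assert token_document == ["paragraph-token"]
    assert token_stack == ["document", "paragraph"]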

Squeezing One More Task In

If things are going normally, I organize and write my articles on Sunday, with the editing of the draft article going into Monday evening. During that time, I do take a fair number of breaks to ensure that I am writing and editing with a clear mind. But just because I start working on the article, it does not mean that I stop thinking about whatever I was working on. Usually, it is a battle between getting the writing done and my urge to complete what I started. Most of the time, the article wins. In this case, it did not.

On Saturday morning, I had started working on figuring out how to get test function test_block_quotes_extra_03b working. And while I had made some progress on it, I was still working on it. For whatever reason, when placed within a Block Quote element, Link Reference Definitions were not being recognized properly. I had started working on this right after fixing test function test_block_quotes_extra_04f, and I had spent a decent amount of time trying to get it working. But with a busy weekend in my personal life, I was not able to get a good, solid, contiguous couple of hours to work on this issue as I had hoped to do. As such, I had started to try to figure out this issue about five times, giving up after each short attempt. It gnawed at me that I could not figure it out. It had not taken me long to resolve the previous set of issues, so why was it taking me so long with this one?

Regrouping

After completing the bulk of the rough draft of the article, I took some time to relax and clear my head, knowing that I needed to look at the problem again.

This time, I had much better results with my debugging. Starting with the basics, I turned on debug logging for the test and followed along in the source code as I read each line of the debug output. It was then that I noticed the issue: the Block Quote token itself was wrong. As I looked through the logs, everything was fine up until the requeue from the Link Reference Definition happened. From there, everything was just off.

Taking some time to think about it, I decided to take our dog Bruce for a walk. During that walk, I tried hard not to think about the issue, and mostly succeeded. When I came back, I was able to examine the log files again, knowing that the Block Quote token was off, and that I had to find the cause. Within five minutes, I had the answer. It was once again a state issue. Before the requeue happened, as each line was being processed within a Block Quote, new information was added to the Block Quote token. This information was about the leading text that was removed from each line in the container processor, ensuring that the leaf processor only had to deal with leaf block related issues. To ensure that the Markdown could be properly rehydrated, this information was stored in the Block Quote token itself. But when the requeue happened, nothing was done to erase the information added to the token between the start of the Link Reference Definition parsing and the start of the requeue. Or at least that is what I thought had happened.

Doing some quick testing, I proved my theory to be correct. As I followed along in the logs for the test function, I saw the amount of leading text in the Block Quote token increase, but never decrease. To further prove that I was on the right track, I compared the number of lines that were requeued to the number of extra lines of leading text present in the token, and it was a match!

Fixing the Issue

With a solid lead on what the cause was, the most concrete manner of proving that I had the right cause was to fix it. After mulling over various ideas in my head, the one that won out was to simply store a copy of the Block Quote token in the Link Reference Definition token at the start of processing. With the other requeue logic in place, once I had done all the other requeuing, I simply replaced the changed Block Quote token with the copy of the original token. Running through the tests, this worked right away!
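
Condensed down, the shape of that fix is something like the following sketch; the class and field names are hypothetical stand-ins for the project’s real tokens.

    import copy

    class BlockQuoteToken:
        """Hypothetical stand-in for the project's Block Quote token."""

        def __init__(self):
            self.leading_text = ["> "]

    token_document = [BlockQuoteToken()]
    block_quote_token = token_document[0]

    # When Link Reference Definition processing starts, stash a copy of the
    # Block Quote token as it looks at that moment.
    saved_block_quote_token = copy.deepcopy(block_quote_token)

    # Each line processed while the definition is still "open" adds more
    # leading text to the live token.
    block_quote_token.leading_text.append("> ")

    # On requeue, swap the modified token out for the saved copy.
    token_document[token_document.index(block_quote_token)] = saved_block_quote_token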

After having taken such a long way to get there, I now had it fixed. But since I had ended up solving the issue somewhat late on Sunday evening, I decided to put the changed code aside and to continue editing that week’s article. It was enough to know that I had solved it and that it just needed cleaning up before committing. It was only after I had completed my final edit of the article on Monday night that I noticed I had finished early, with a lot of time to spare. With that extra time in hand, I was able to take the roughly finished solution and polish it up enough to commit it. While technically it should be a part of next week’s article, it just felt right to include it with this article, as that is where most of the work occurred.

What Was My Experience So Far?

After a busy week of getting rid of some big-ticket issues, it was very nice to reduce my scope and focus on the smaller items. Not that I mind working on the big items; it is just that they require me to maintain a larger scope of focus, thereby tiring me out a bit more. The smaller items are not always as satisfying to resolve, but they are also not as draining.

During the article, I mentioned that I was becoming more aware that I was dealing more with corner cases than anything else, and that was a good feeling. I am very confident that the main scenarios driving the parser have already been addressed. With those out of the way, it stands to reason that any issues that I am finding are the weird cases that do not occur that often. It just makes sense to me.

It also means that I am getting more confident that I am nearing the end of this testing phase of the PyMarkdown project. My main drive for the project was to complete the project on my own terms, with the level of quality and testing that I expect from other projects. While I could have started releasing this project a while ago, I wanted to make sure that I had reached that level before shipping the project, and then work on improving it from there. And with the knowledge that I am cleaning up corner cases, I know that I am now closer to that point with the PyMarkdown project than I have ever been before!

And it is a good feeling!

What is Next?

I do not want to sound like a broken record, but it is back to the same process of finding the next item to work on and getting it resolved. The only difference is that I am getting close to eliminating all the “open range” items in favor of the prioritized issues. Progress!



