Markdown Linter - More Beta Bugs · Jack's Digital Workbench

Summary¶

In my last article, I talked about getting to work on reducing the count of bugs in the Issues List. In this article, I talk about the headway I am making towards getting those issues dealt with.

Introduction¶

Issues, issues, issues. Bugs, bugs, bugs. Problems, problems, problems. No matter which way I talk about it, it means the same thing to me. With the beta release out the door, it is time for me to try and battle the issues on the Issues List. I know I need to either resolve them as already fixed or to fix them there and then. I just need to keep on making good progress on fixing things.

That and I hope that I do not find any “must rearchitect” issues. That would be… well… expensive and time consuming. Fingers crossed!

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 26 Sep 2021 and 03 Oct 2021.

Issue 34 - Getting Things To Line Up¶

This issue was recorded simply as a Markdown document:

> this is text
> [a not
>  so
>  simple](/link
> "a
>  title"
>  )
> a real test

The note itself was not that helpful, but once I parsed the document into its tokens and looked at them, I noticed the problem right away: the tokens were not there. Looking more closely at the output, an exception occurred that stopped the document from being parsed.

Enabling the --stack-trace command line option, I quickly found that within a loop in the correct_for_leaf_block_start_in_list function, an assert statement was present to ensure that the loop only worked on lists. It was that assert statement that was failing, causing the parsing to be stopped. I started to address the issue by adding code to only continue the loop if the last item on the token_stack list was a list. Also adding code to conditional cleanup after the loop based on the same condition, I was ready to see how this document would parse.

Feeling a bit anxious, I reran the test, hoping that everything would work. It initially looked like everything was going to work properly. But when I looked at the tokens, something was wrong. It looked like most of the tokens were correct, but there were tokens that were slightly off. For the most part, the data was correct, but the position of the tokens was a bit off.

This was going to be a more complicated issue to work through.

Digging Into The Debug¶

It took a fair amount of experimentation to figure out what the real issue was. Adding debug here and there and observing the behavior slowly led me to the observation that the positions were not off by line numbers, but by small column numbers. Doing some further digging into the problem, I noticed that each column number offset remained constant within an example. With a stroke of luck, I noticed that it only happened with examples that had other elements within a Block Quote element.

It was then I figured out it was a problem with Block Quote elements and their leading_text_index variable. Suddenly, the full scope of the problem fell into place, and additional debug confirmed that hypothesis. When each example was being parsed, the leading text for the Block Quote was being carefully removed and placed in the Block Quote token, but the leading_text_index variable was not updated. As such, each of the calculated positions within the Block Quote element were calculated solely using the length of the leading text from the first Block Quote line.

Wow… it did take a while to get to that hypothesis, and it took a bit more time to confirm it. But when I had it confirmed, it made a lot of sense. To address it, I started adding the required plumbing to update the leading_text_index variable properly, and the rest just fell into place. It didn’t take me long to get that up and working, but I wanted to be extra sure that I solved the problem properly.

As the positions for most of the inline tokens are calculated, I double checked those tokens more thoroughly than the other tokens. I even checked variations on the Markdown document to make sure they were also working. In the end, after much thinking to get there, the issue was solved.

After that long effort, I wanted to pick something more relaxing to work on and get my energy back.

Issue 36 - Verifying Disabled Rules¶

This was a monotonous task, but one that I knew that I needed to do at some point: verify that any rule test that disabled another rule did so for the right reasons. As the number of rules grew dramatically in the last three months, I had suspicions that some of the reasons that rules were disabled were no longer applicable. And while I can check for any enabled rule and its applicability by executing the full set of tests for the project, verifying disabled rules was a very manual process.

This task took the better part of ten hours to complete over four days. It wasn’t that exciting, so I decided to break the one big task into smaller tasks to help keep me motivated throughout the entire process. That turned out to be a great decision. I found that the more I worked on the process, the more errors I made in verifying the disabled rules. Nothing too serious, but for each error I made, I restarted the testing at the last-known good point, hindering my forward process. As far as I can tell, it was just the monotonous nature of the task that made my mind wander just enough to lose track of things.

But after almost ten hours, I was done. When all was said and done, just over one hundred rules had extra --disable-rules options that they did not need. And while it was a very boring task, I found it to be a very worthwhile task. I now had my confidence back that the rules tests were testing what they were supposed to test, and only disabling rules when they had to. That meant I could introduce changes to the core and the rules with “that much” more confidence that the tests would catch any changes gone bad.

Issue 37 - Verifying Lists and Blank Lines¶

While I do try and put forth my best effort every time, sometimes I fall short of that goal. When I picked this Markdown out of the Issues List:

This is text and a blank line.

> a block quote
> + a list

This is a blank line and some text.

I looked for it in the scenario tests and found it. But for some reason, I focused on the scenario test right above the actual test that held this document. I feel kind of stupid, but for the first three or four times that I looked at what I thought was the test housing this example, I was looking at the previous test. After that, I think it was just stuck in my head until I came back to it after leaving it alone for a while. I was very embarrassed when I finally figured that out.

Once I got that straightened out, everything made sense. Because I had made this mistake this time, I probably had made a similar mistake when adding this document to the Issues List. But while I was verifying that example, I used the time to check the other examples for the rule. While all the existing tests were fine, I added this document:

This is text and a blank line.

> + a list

This is a blank line and some text.

to fill in a hole that I saw in the testing.

While the actual example was not a problem, I did find a minor problem that I was able to overcome and add to the tests for the rule. Sometimes, the steps are small. I am okay with that!

Issue 38 - False Positive¶

Sometimes I get lucky and see issues that look like problems, only to have them turn out not be problems at all. While that does not happen as often as I would like, this was one of those cases. The Markdown is:

[a
paragraph inspired link](
/paragraph
"paragraph")

While there wasn’t that much information in the list except for the Markdown example, I can only assume that the extra newline in the Link’s label was the cause of my concern. Looking at other examples for rule Md044, I was able to find examples that somewhat looked like that example, but not one that had a split Link label.

But in the end, just like the previous issue, the example in the Issues List passed without any problem. I added a new scenario test to fill a small hole that I found, but that was it. However, I still feel that this was time well spent. Sure, for the second time in a row the actual issue was a false positive, but I was also able to find something that I missed. And if nothing else, I verified that things were working properly.

Issue 39 - Fun with Nested Containers¶

During the testing of Rule Md007, I came across a document that was parsing incorrectly:

This is a test

> * this is level 1
>   * this is level 2
>     * this is level 3

Playing around with it at the time, I noted that the first two levels were being parsed properly, but the text from the third level was being included into the second level. Taking the time to look at it some more, I determined that the same example without any Block Quote element worked fine. Therefore, it was something about the Block Quote that was causing this to happen. And that meant looking at the __get_nested_container_starts function.

It was obvious that I needed to massage the whitespace to consider the Block Quote, but I was not sure how. After four or five hours of tinkering, I had this code ready:

        _, ex_ws_test = ParserHelper.extract_whitespace(line_to_parse, 0)

        whitespace_scan_start_index = 0
        for token_stack_index in parser_state.token_stack:
            if token_stack_index.is_block_quote:
                # if text_removed_by_container:
                #     if text_removed_by_container.startswith("> "):
                #         text_removed_by_container = text_removed_by_container[2:]
                #     elif text_removed_by_container.startswith(">"):
                #         text_removed_by_container = text_removed_by_container[1:]
                #     else:
                #         POGGER.info("check next container_start> out of block quote data")
                pass
            elif token_stack_index.is_list:
                if token_stack_index.ws_before_marker <= len(ex_ws_test):
                    whitespace_scan_start_index = token_stack_index.ws_before_marker

        after_ws_index, ex_whitespace = ParserHelper.extract_whitespace(
            line_to_parse, whitespace_scan_start_index
        )
        if not ex_whitespace:
            ex_whitespace = ""
            after_ws_index = whitespace_scan_start_index

If there is a lot of whitespace to deal with, the function needs to make an educated guess as to how much of that is actual whitespace and how much of that belongs to the List element. For example, given this Markdown:

> * this is level 1
>   * this is level 2
>     * this is level 3

it is visibly obvious that there are three levels to the list, with all three levels within a Block Quote element. But creating an algorithm to figure that out is not as easy. That algorithm must consider whether the text on the third line is a new level or is a continuation of level two. And since the container processing comes before the leaf processing, that determination needs to be made early on.

The algorithm works with this and tries to remove only as much whitespace as is necessary to get a solid determination of whether the line is eligible for a new List element start or if it is just text. While the second part of the algorithm, not shown in the above example, tackles the “eligible List element start” part of that process, the above code tries to get the whitespace ready for that process. If there is whitespace that has a length that is less than the current List element indentation, that whitespace can be removed for the purposes of the second algorithm. In fact, it is imperative that the whitespace is removed.

It took a while to get there, but with a couple of slight changes, it was working. And for what it is worth, I know that this is probably just the start of issues with nested container tokens. Right now, it is just a feeling, but that feeling tells me that I am going to need to add extra testing for various nested container elements. And as far as this project goes, those feelings are usually correct.

Issue 40 - Images as Headings¶

Most of the time, I look at a collection of rules and they just look like they are meant to be together. But sometime, I see a set of rules that look like two completely different organizations crafted them. That was the case with Rule Md041 and Rule Md033. Rule Md041 allows a top-level heading to be defined as a H1 HTML tag, but Rule Md033 triggers on any HTML tags that are present in the document. To me, that looked like a bit of a mismatch, and a mismatch that was easily fixable.

To cleanly fix this issue, I added a configuration value allow_first_image_element that defaults to True. First, I added code to set the __is_first_element variable if the current token is the very first token, and the __is_first_html_block variable to allow the __look_for_html_start function to know that the HTML block was part of that first token. With that preamble set up, the following code was added:

is_first_image_element = False
if (
    self.__is_first_html_block
    and self.__allow_first_image_element
    and tag_text.lower() == "h1"):

    is_first_image_element = full_tag_text.endswith("</h1>")
    if is_first_image_element:
        full_tag_text = full_tag_text[0 : -len("</h1>")]
        end_of_start_heading_index = full_tag_text.find(">")
        assert end_of_start_heading_index != -1
        full_tag_text = full_tag_text[end_of_start_heading_index + 1 :]
        end_of_image_index = full_tag_text.find(">")
        is_first_image_element = (
            full_tag_text.startswith("<img")
            and end_of_image_index == len(full_tag_text) - 1)

The code is not pretty, but it is packed with functionality. This code ensures that the allow_first_image_element configuration value is enabled and that the HTML tag text starts with <h1. From there, it makes sure that the text ends with </h1>, that only the img tag follows the <h1> tag, and that the img tag is the only tag present between the <h1> tag and the </h1> tag. Only if all those things line up, does the is_first_image_element variable get set to True, allowing the rule to not trigger on image headings.

It just felt good to get this taken care of. If I can find someone to make a good heading image for the PyMarkdown project, I might use this!

Issues 41 - Cleaning Up Documentation¶

Another issue that was added to the Issues List was the documentation for Rule Md041. While verifying that everything looked good for all rule documentation, there was a cryptic set of text at the bottom of the documentation for that rule:

- diff html comments

“Diff html comments?” What did that mean? I wasn’t sure. It took me a big of poking around before I was able to figure out what it meant. After looking at the test documents for this rule in my Markdown editor, I noticed that there was only one difference between the project rule and the original rule: it handled the HTML comment tags differently. Specifically, the original rule doesn’t deal with the comment tags or other special tags at all, rendering them more or less invisible.

Thinking through this difference, I came to a defensible decision that I had written the PyMarkdown rule correctly. Depending on the definitive source used, tags such as the comment tags can be viewed as either proper HTML tags or helper HTML tags. Since those tags do not directly affect how the web page is displayed, it can be argued that they are not proper tags. But it can also be argued that those tags are as important to a document as punctuation or foot notes.

Taking a holistic view, I decided that view every part of a HTML document as integral part of that document. Therefore, HTML comment tags and the other special tags are integral to the document. It therefore follows, if they show up in the Markdown before being translated into HTML, they will be present in the rendered HTML document. Therefore, the rule should trigger on those special tags just like it triggers on any other tag. Done!

Hopefully, resolving this issue will teach me to leave better notes for myself. Probably not right away, but I hope it does!

Issue 42 - Sorted Output¶

This issue was recently put on the Issues List after more rules were added to the project. While it does not have a lot of impact on the tests within the project, I have seen unordered output in normal uses. It is not usually a big problem to mentally go through the list of reported issues and figure out where they are in the document. However, I thoroughly admitted that it would be useful to have them be in sorted order to allow for a good top-to-bottom or bottom-to-top scan of the output.

To supply the output in a sorted order, a couple of small changes needed to be made, mostly in the plugin_manager.py module. Previously, an instance of the ScanContext class was created at the start of each document and functioned as a conduit to ensure that any triggered rules were reported properly. To address this issue, that design was changed slightly to introduce the add_triggered_rule function and the report_on_triggered_rules function. Instead of the report_next_line_error function and the report_next_token_error function directly calling the log_scan_failure function themselves, the new add_triggered_rule function is called, adding the reported error to the context’s list of errors. Once the entire document has been scanned, the report_on_triggered_rules function is then called. This function simply sorts the list of reported errors before calling the log_scan_failure function to report each error.

While resolving this issue does not provide a lot of difference to documents with a small number of reported failures, I find that it makes a substantial difference to larger sets of reported failures. In my case, when going through lists of errors, I tend to work from the bottom of the list to the top so that the line numbers are not disturbed. I can see this working better for me and how I attack those lists. I hope it helps others too!

Issue 31 - Adding Module Support¶

This issue didn’t really involve a big change, but it was a request from one of the users, so it made the top of the list. In Python, one of the interesting ways of executing Python code is through exposing the package as an executable module. As I was not sure which people would want to execute the project it that manner, I left it off the list of features to add. But with a request from a user, it was back on the list.

It took a bit of research to get it right, but in the end, the only change that was needed was the addition of the __main__.py module:

import pymarkdown

if __name__ == "__main__":
    pymarkdown.PyMarkdownLint().main()

If that code seems overly simplistic, it is because it is. According to various pages on the Python site, the __main__.py file had special properties that are invoked when it is invoked on behalf of a package using the -m module flag. When it is run as a module, the above if statement evaluates to True and the main function from the main.py module is invoked.

Not much to add, but a good addition to tidy things up for the week!

What Was My Experience So Far?¶

While I know there are probably a couple of difficult issues hiding in plain sight amongst the Issues List, I haven’t hit any of them yet. I am keeping up a good pace and finding and fixing issues, and I happy with the progress. Do I wish it went faster? Sure… but I also know it will take as long as it takes. Now that I have the beta release out there, it takes the pressure off a bit.

But I also know that I have a good collection of fixed issues, so I am planning a build version update tomorrow to get these fixed issues into the hands of users. That does feel great, knowing that I am at a place where I can do that!

What is Next?¶

What else? More issues. Hopefully, some exciting ones this next week. Well… maybe hopefully? Stay tuned!

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.

Comments

Markdown Linter - More Beta Bugs