Summary

In my last article, I talked about the work I put into getting the final three rules completed. In this article, I talk about getting to work on reducing the count of bugs in the Issues List.

Introduction

With all the work on creating the missing rules wrapped up, I was eager to double check things for the beta release and get it packaged and uploaded to PyPi.org. But after that was done, I knew that I still had work to do, just different work. Instead of writing new rules, I needed to go back over previous issues I had found and start dealing with the “I’ll deal with it later” issues.

Yes, it was now “later”.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 21 Sep 2021 and 25 Sep 2021.

The Beta Release Happens

From a commit point of view, the beta release was not anything more than changing a couple of files and running a script that I normally do not run. But mentally, it was an important milestone for me. After working for two years to get something working on my terms, I was able to finish a beta release that I am confident about. Running that script, knowing that it would publish a solid, well-tested version of my project to the Python PyPi repository, just seemed right. It was not too soon, it was not with reservations, it was just the right time for the release to happen.

Do I expect issues to get logged in GitHub? Yes, I would be surprised if issues were not logged. First off, I have confidence that I covered most of the more common scenarios, but I also have confidence that I did not cover all of them. Secondly, as weird as it may sound, if people are filing issues, it means they are using the project and putting it through their own scenarios. And hopefully, for every issue that gets logged with GitHub, there is a given multiplication factor of users that do not have any issues with the project, and are therefore happy with it.

What is that multiplication factor? I do not currently know. But even if that multiplication factor is in the single digits, it means that people are using the project. That is good for my confidence. That is all that I need, nothing more.

And as usual, before digging into fixing issues, there was a bit of cleaning up that I wanted to get done.

Doing Some House Cleaning

Having a clean script and a Continuous Integration script has been a big asset for the PyMarkdown project. I am usually fussy about running the clean script before every commit, but the Continuous Integration script makes sure that I stick to that process. Normally, I develop on a Windows machine because I have more tools that work seamlessly with how I work. But that still leaves testing on other operating systems, which has previously been painful. The Continuous Integration script addresses that by running the full set of tests against a Windows operating system, a Mac operating system, and a Linux operating system. Having this process done automatically with each commit has helped me concentrate on the project itself, not the operating systems.

But one little thing had been bugging me during the rule development phase: changing command lines. There are a small handful of rules that I want to disable on my own Markdown scans. In most cases, this is temporary while I get everything else worked out and think about possible rule changes. In other cases, the rule is disabled just to give me time to fix the issues it reports. These disabled rules are no-space-in-code, no-space-in-emphasis, line-length, and no-duplicate-heading. But that is not what concerned me. It was that when developing the rules, I needed to change the disabled rules in both scripts. That just was not cool.

Before I started tackling reported issues as they trickled in, I decided that I wanted to fix that right away. It was not a tremendous change to make, but I believe it was a good choice to work on, as it would be one less thing to worry about. Additionally, it showed off a real-world use of the configuration file, which was a benefit.

To perform this switch to a configuration file, I created a new file, clean.json, with the following contents:

{
    "plugins": {
        "no-space-in-code": {
            "enabled": false
        },
        "no-space-in-emphasis": {
            "enabled": false
        },
        "line-length": {
            "enabled": false
        },
        "no-duplicate-heading": {
            "enabled": false
        }
    }
}

Saving it in the root directory of the repository, I then went and replaced this part of the command line:

--disable-rules no-space-in-code,no-space-in-emphasis,line-length,no-duplicate-heading

in both scripts to read:

--config ./clean.json
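
With that file in place, both scripts pick up the same set of disabled rules from one spot. As a sketch, the scan invocation ends up looking something like this, with the scanned path here being a placeholder rather than the project’s actual one:

pymarkdown --config ./clean.json scan ./docs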

I executed the clean script locally and verified that everything looked correct. Hoping that I did not miss anything, I created a new commit for this change, pushed it to the main repository, and watched as the tests started to run. It took about four minutes, but when I saw that green circle beside the tasks saying that everything executed correctly, I let out a little “Yes!” to myself.

Not a big issue, and probably one that no one except me would worry about. But it was an issue that I was happy to get taken care of.

Issue 23 - Issue with Rule Md023

I know it might seem weird to some people, but I love debugging. It gives me a chance to really dig into something and either learn about it or increase my knowledge about how it works. Sometimes I learn that I did not think of a given use case, and sometimes I learn that I forgot something completely. That last one was the case with this issue: I forgot to implement pivotal code in Rule Md023.

Token History

To understand the issue, I need to go into a bit of parser history and explain that the end_whitespace within a Text token that exists within a SetExt Heading element was a bit tricky to store. Because whitespace must be removed from both the start of each line and the end of each line in a SetExt Heading element, it left me with a fun decision to make: where to put the removed whitespace? One of the project’s cardinal parser rules is that any parsing must produce a set of tokens that can reconstitute the Markdown document exactly as it was, so the whitespace had to go somewhere. If it were a normal paragraph, only the leading whitespace from each line would be removed, and it would be stored within the Paragraph element that owns that Text token.

After considering assorted options, I found none of them attractive. The least unattractive was to overload the end_whitespace field to hold both the leading space characters and the trailing space characters that were removed. Normally, I would do this by having two different fields, one for the leading spaces and one for the trailing spaces. But not wanting to add an extra field to the Text token just for use within a SetExt Heading element, I improvised. The end_whitespace field normally holds any trailing spaces, with lines separated by the newline character. So, to “duplex” each line, any leading spaces stored for that line come first, followed by the whitespace split character, \x02, and then any trailing spaces. In that way, I was able to store both leading spaces and trailing spaces within the same field.
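
To make that encoding concrete, here is a minimal sketch of how two lines of a SetExt Heading would be duplexed into one field. The helper function is hypothetical; only the \x02 split character and the newline separator between lines come from the parser:

def encode_line_whitespace(leading, trailing):
    # Hypothetical helper: duplex one line's removed whitespace.
    # Leading spaces, when present, come first, followed by the
    # whitespace split character, and then any trailing spaces.
    return f"{leading}\x02{trailing}" if leading else trailing

# Line 1 had two leading spaces and one trailing space removed.
# Line 2 had no whitespace removed at all.
end_whitespace = "\n".join(
    [encode_line_whitespace("  ", " "), encode_line_whitespace("", "")]
)
assert end_whitespace == "  \x02 \n"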

Back To the Issue

Going back to the issue at hand, that special character presented an issue that I had not thought about when I was writing this rule. Reminding myself of this, I looked at the existing code:

if self.__setext_start_token and not self.__any_leading_whitespace_detected:
    if token.end_whitespace and " " in token.end_whitespace:
        self.__any_leading_whitespace_detected = True

Without support for the whitespace split character, that code detected any whitespace that occurred at the start or the end of the line. Since whitespace at the start of the line is encoded as {space}\x02 and whitespace at the end of the line as {space}, the above code evaluated both as extra leading whitespace.

The only way to fix this was to scan through the end_whitespace field, one line of information at a time, looking for any cases where the sequence {space}\x02 was found. To do that, I replaced the earlier code with the following code:

if (
    self.__setext_start_token
    and not self.__any_leading_whitespace_detected
    and token.end_whitespace
):
    split_end_whitespace = token.end_whitespace.split("\n")
    for next_split in split_end_whitespace:
        if self.__seen_first_line_of_setext:
            split_next_split = next_split.split(
                ParserHelper.whitespace_split_character
            )
            if len(split_next_split) == 2 and split_next_split[0]:
                self.__any_leading_whitespace_detected = True
        else:
            self.__seen_first_line_of_setext = True

This code works by going through each line represented in the end_whitespace field, one line at a time. For each line, it splits the line’s information on the whitespace split character. If the whitespace split character is present, the split results in the variable split_next_split holding an array of two strings, with the first element holding any characters in the line that occurred before the whitespace split character. Therefore, if at least one line produces a split_next_split variable of length 2 with a non-empty first element, then that line has leading whitespace.
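
As a quick sanity check of that logic, consider a line whose only removed whitespace was a trailing space versus one that had leading whitespace. In this minimal sketch, with "\x02" standing in for ParserHelper.whitespace_split_character, the old substring test flags both lines while the split-based check flags only the second:

# Trailing-only whitespace (" ") versus leading whitespace ("  \x02").
for line in (" ", "  \x02"):
    naive = " " in line  # the old check: True for both lines
    parts = line.split("\x02")
    split_based = len(parts) == 2 and bool(parts[0])  # True only for "  \x02"
    print(f"{line!r}: naive={naive}, split_based={split_based}")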

Thinking this was clever, I ran it through the tests for the first time, and was disappointed when it failed. After a couple of minutes of debugging, I remembered that the leading spaces for the first line of a SetExt Heading element are stored with the start token, not within the Text token contained within it. Adding a bit of extra code to take care of that first-line case, I executed the code again and everything worked fine!

Just to be sure, I created five new scenario tests for Rule Md023, each with a different amount of trailing space at a different location. Maybe it is a bit paranoid, but after finding one example that failed, I wanted some extra insurance that I had fixed the problem and fixed it properly.

It was an interesting feeling getting back to debugging non-rule code. It partly felt like it was something that I was used to, and partly that it was something new and different. It was an odd pair of feelings, but both were good feelings, so I was okay with both.

Issue 22 - Issue with Rule Md033

This issue was not really a big one, but it was reported as not making sense to users, so I wanted to tackle it quickly. Rule Md033 deals with Raw HTML being exposed, either as an HTML Block element or a Raw HTML Inline element. I figured that I had missed something simple and wanted to make sure it was set right.

It turned out that I had missed two small issues, both nuisances. The first one was that I had included close HTML tags, any normal HTML tag starting with </, in the list of acceptable trigger items. While I had thought that the original rule did fire on this, extra checking revealed that it did not.

The other issue was a bit more nuanced, but just as small. In HTML, there are special sequences that allow for special handling of HTML documents. Of special note, these sequences are called out in the GFM documentation of the HTML Block element, with each one of them having its own special HTML Block type. These sequences are comment blocks, processing instructions, declarations, and CDATA. I had originally taken one of them, the ![CDATA[ sequence, and provided special behavior for it, but that was not good enough. Take this simple example from one of the project’s users:

# title

<!--TOC-->

Apparently, my habit of following a start HTML comment tag with whitespace was just that… my habit.

So, to correct that issue, I changed the code to allow both CDATA sequences and comments to be detected on their own, without requiring any whitespace to follow. Along with a change to suppress processing instructions and declarations by default, this issue was fixed.
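
As a rough sketch of what that change implies, the following captures the corrected behavior. The function name and shape are mine for illustration, not the project’s actual code:

def does_raw_html_trigger(tag_text):
    # Sketch only: close tags do not trigger the rule, and comments
    # and CDATA sections are recognized by their opening sequences
    # alone, with no whitespace required after them.
    if tag_text.startswith("</"):
        return False
    if tag_text.startswith("<!--") or tag_text.startswith("<![CDATA["):
        return False
    return True

assert not does_raw_html_trigger("<!--TOC-->")  # the user's example
assert does_raw_html_trigger("<script>")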

It was not a big fix, but a good one to get out of the way.

Issue 27 - Issue with Rule Md032

I must admit that when I first saw this issue pop up, I looked at it and went: “No, that looks right!” But after checking it more closely, the issue began to seep into my brain. Starting with something simple, the following Markdown:

- 1
  - 1.1
    - 1.1.1
  - 1.2
- 2

parsed cleanly. Then the second example, the following Markdown:

- 1
  - 1.1
    - 1.1.1

- 2

also parsed cleanly. Fine. I agreed with both of those. Md032 deals with blank lines around List elements, and everything looked good so far. Then the third sample was presented, one which was causing the rule to trigger. A modification of the earlier example, this one was:

- 1
  - 1.1
    - 1.1.1
- 2

and it triggered the rule each time. From looking at the nested List elements, it was obvious that it should not fire, so what was wrong?

Enabling debug for the rule, I was able to quickly narrow in on the issue. Specifically, the third line sets up the third level of a List element, only to have the two innermost levels of those lists ended by the first level List element on the fourth line. As my debugging showed, there were two end List element tokens in a row in the token stream. That was something I had missed.
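
Roughly, the token stream for that third sample looks something like this. The token names are illustrative approximations, not the parser’s exact identifiers:

tokens = [
    "ulist-start", "text:1",
    "ulist-start", "text:1.1",
    "ulist-start", "text:1.1.1",
    "ulist-end",
    "ulist-end",  # two end List tokens in a row
    "new-list-item", "text:2",
    "ulist-end",
]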

Looking at two lines from the rule, I was quickly able to nail down the source of this issue. In the following if statement:

if not token.is_blank_line and not token.is_new_list_item:
    self.report_next_token_error(context, token, line_number_delta=-1)

only a Blank Line token or a New List Item token prevents the rule from firing if the previous token was an end List token. That meant that, in this case, when the second end List token was evaluated by that if statement, neither condition applied, the statement evaluated to True, and the rule fired.

It did not take much code or time to fix this properly. Adding the end List token to the if statement allowed the List elements to flow properly when dropping multiple levels:

if (
    not token.is_blank_line
    and not token.is_new_list_item
    and not token.is_list_end
):
    self.report_next_token_error(context, token, line_number_delta=-1)

Issue 32 - Issue with Rule Md037

Every so often, you look at something you did and hope no one notices that you did it. In this case, I had Issue 32 filed against it, so I had no choice. Someone noticed it.

The Markdown itself was simple, and the point of focus was a single paragraph:

Read all readme*.md files.  
And the changelog*.md files.

It was this sample that was triggering Rule Md037, which looks for spaces inside of asterisks that may indicate an Inline Emphasis element that was not completed properly. There was only one problem: the code to deal with the actual search for whitespace was half-finished and commented out.

Whoops!

Oh well, it happened. After I got over a momentary feeling of embarrassment, I started to look at the commented-out code and quickly decided that it would not work properly. When the triggering section of the next_token function is entered, it has a list of every token that occurred between the possible start of emphasis and the possible end of emphasis. The main thing I needed the code to do was to verify whether the specific pattern, whitespace directly inside of the possible start or end, was present.

Instead of doing something fancy, I opted for something simple. With both the possible start and possible end sequences, this rule should only trigger on Text tokens. If a token is not a Text token, it cannot start or end with whitespace, so that was one aspect of the issue dealt with. The other was a good observation about what this rule was supposed to look for. Since I needed the function to look directly inside of the emphasis sequences, I needed it to check the first character of the first token and the last character of the last token. If either of those specific characters was a space character, then there was a space directly inside of the emphasis characters. And to be extremely specific, I needed it to detect either scenario, not both.¹

After going through that research, I came up with the following code, exactly what I just described above:

assert self.__emphasis_token_list
first_capture_token = self.__emphasis_token_list[0]
did_first_start_with_space = (
    first_capture_token.is_text
    and first_capture_token.token_text[0] == " "
)
last_capture_token = self.__emphasis_token_list[-1]
did_last_end_with_space = (
    last_capture_token.is_text
    and last_capture_token.token_text[-1] == " "
)
if did_first_start_with_space or did_last_end_with_space:
    self.report_next_token_error(context, self.__start_emphasis_token)
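
Applying that logic to the user’s sample above: assuming the capture starts right after the first asterisk and ends right before the second one, the first captured Text token starts with “.md files.” and the last one ends with “changelog”, so neither check is true and the rule correctly stays quiet. A tiny illustration under those assumptions:

# Assumed Text token contents captured between the two "*" characters.
first_token_text = ".md files."
last_token_text = "And the changelog"
did_first_start_with_space = first_token_text[0] == " "  # False
did_last_end_with_space = last_token_text[-1] == " "  # False
assert not (did_first_start_with_space or did_last_end_with_space)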

Once again, nothing stellar, but good simple corrections in the code.

Issue 33 - Issue with Rule Md031

In my blitz to get all rules completed and a beta release out the door, I had encountered this Markdown when trying to get Rule Md031 to work:

> + list
> ```block
> A code block
> ```
> 1. another list

With some more relaxed time on my hands, I decided to look closely at this rule and figure out what the problem with it was. What I did not notice at the time is that this was not a problem with the rule; it was a problem with the parser. When executed, the parser was stopping in the code below with an exception. It was something that needed to be looked into.

It took me a bit to figure out what was going wrong. Because I had been concentrating on getting things done for the beta release, I had not spent considerable time in the parser code. If I had to guess, it had been at least six months since I last looked at it. Luckily, with the comments and log messages that I had placed in the code, I was able to get up to speed quickly.

In the Leaf Block Processor module, there is code in the correct_for_leaf_block_start_in_list function to properly correct the indent of a line when coming out of a List element. Originally, the token_stack state variable was used to track only List elements, with plans to implement an easier mechanism for dealing with Block Quote elements. However, when I started to implement support for Block Quote elements, I realized how similar Block Quote elements and List elements really were. In response, the token_stack state variable was repurposed for both elements, with extra work on my part to clean up the code.

while repeat_loop:
    assert parser_state.token_stack[-1].is_list
    if removed_chars_at_start >= parser_state.token_stack[-1].indent_level:
        repeat_loop = False
    else:
        tokens_from_close, _ = parser_state.close_open_blocks_fn(
            parser_state,
            until_this_index=(len(parser_state.token_stack) - 1),
            include_lists=True,
        )
        html_tokens.extend(tokens_from_close)

assert parser_state.token_stack[-1].is_list
last_indent = parser_state.token_stack[-1].indent_level
delta_indent = removed_chars_at_start - last_indent
assert not delta_indent

Basically, it looks like I missed a boundary use case. In this case, the List element was getting closed properly within the loop, but the loop was followed by an assert to verify that the remaining element on the stack was a list. If it was a List token, then there were extra verifications that I had put in place to check the sanity of the parsing. But with this document, closing the lists left the Block Quote element on top of the stack, and that assert failed.

While slightly more verbose, once I understood the problem, the fix was simple: only act on List tokens:

repeat_loop = True
is_remaining_list_token = True
while repeat_loop and is_remaining_list_token:
    assert parser_state.token_stack[-1].is_list

    if removed_chars_at_start >= parser_state.token_stack[-1].indent_level:
        repeat_loop = False
    else:
        tokens_from_close, _ = parser_state.close_open_blocks_fn(
            parser_state,
            until_this_index=(len(parser_state.token_stack) - 1),
            include_lists=True,
        )
        html_tokens.extend(tokens_from_close)

        is_remaining_list_token = parser_state.token_stack[-1].is_list

if is_remaining_list_token:
    assert parser_state.token_stack[-1].is_list
    last_indent = parser_state.token_stack[-1].indent_level
    delta_indent = removed_chars_at_start - last_indent
    assert not delta_indent

Because this change fixed the parsing of that Markdown document, I was able to take the skip tag off the test_md031_bad_fenced_block_in_list_in_block_quote function. In addition, to make sure that Rule Md032 did not suffer from the same issue, I added the test_md032_bad_fenced_block_in_list_in_block_quote function to perform the same check against that rule. This was a good find, and I was happy to have it taken care of.

What Was My Experience So Far?

After working hard for two years, it was a relief to run the scripts to package the project and to upload it to PyPi.org. As I mentioned earlier in this article, I just felt like it was the right time to do it.

But there are still things to do to make it better, and I wanted to keep on marching forward. After measuring my progress for the last two months in terms of the number of rules I completed per week, I am not sure how, or if, I am going to measure progress going forward. I mean, I am going to resolve issues, of that there is no doubt. However, I cannot divine if the issues I pick are going to be easy issues or difficult issues.

I guess the only thing I can do is to keep on doing what I have always done with this project: keep on making it better!

What is Next?

Well, I hate to say it (well, no I don’t), but I am going to be fixing issues for a while. While some portions of that work will be boring, I will try to make it more interesting by describing what I am doing and my reasons for doing it. Stay tuned!


  1. Before I read the rule again very closely, I did write a version of the code where it looked for both instead of either. It did not work very well. 
