Markdown Linter - Tying Up Some Loose Ends

Summary¶

In my last article, I talked about my efforts to clean up the scenario tests and the existing parser issues. In this article, I continue to talk about cleaning up existing parser issues.

Introduction¶

To an outside viewer, it may seem like I am doing boring stuff, week in and week out. And I will admit, a certain amount of the work that I do is boring. But knowing that the work is going towards increasing the quality of the project really motivates me when things are dragging on. Since I spent most of this past week finishing the big task from last week, I did find the need to remind myself of those observations during the week… multiple times. And remembering why I was doing this did help me push through.

And while I am used to showing as much code as I can for what I am working on, this week is going to be a bit different. To be honest, most of the issues that I fixed had a relatively minor change that was implemented, with the effects of that change rippling out into the rest of the code. In some cases, it was a one-line initial change, and thirty or more lines of ripple changes. I still will try and explain things as best I can, you have my word on that. I’ll see how that works out and see how I feel about it later!

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 12 Oct 2021 and 17 Oct 2021.

Issue 44 - The Long And Winding Road¶

As I said during the introduction, it was an exceedingly long week. I had hoped that I would be able to rip through the comment strings for the scenario tests that related to rules, but that was not the case. Instead of having to read 7 strings and having to fix 1, I was lucky if I encountered any strings that I did not have to fix.

Part of that was my own making though. To make sure that the strings were all readable, I started establishing a common preface for each rule comment string:

"""
Test to make sure this rule does not trigger with a document that
contains ...
"""

While that brough consistency to every comment string for rules, it did involve tons of editing. That and in the cases where the rule did trigger, I had to ensure that I removed the not on the first line. Just checking that out for each file took a bit of time.

And that editing took me into Saturday afternoon. It was a slog. But I never had any doubt that it was worth it. Did I question if my sanity would remain intact? Yes, that was something I wondered many times during the task. But not the base worth of the task. I knew that it was something that I had put off, and I knew that this rewriting was the price that I needed to pay. And that was okay with me. It just needed to get done.

But I was incredibly happy when it was done. There were other things that I wanted to get cleaned up this week!

Issue 56 - Cleaning Up Rules¶

In terms of nagging tasks that I wanted to deal with, cleaning up the comment strings was the biggest one. Thought it was not as high on the list, as the comment string, it still bugged me that I had skipped tests. I know that at the time that I marked those tests as skipped it probably was the right thing to do. But now that I had some time, the right thing to do was to make sure that I resolved those skipped tests.

Having resolved some skipped tests dealing with language elements last week, I decided to try and deal with the skipped tests for rules this week. Luckily, as soon as I walked through the scenario tests for Rule Md005 and Rule Md027, I noticed that everything looked correct. All that I needed to do for those tests was to clean them up, reset the expected values, and verify that there were no other problems. Done.

Rule Md028 was another story. I worked through the debugger and log files for about an hour or two before I figured out the issue. In this scenario, the essence of the problem was with this Markdown document:

- > This is one section of a block quote

When I walked through the parsing of that Markdown, the tokens that I got back were mostly correct, just in a weird order. In particular, the Block Quote was started properly, but it was immediately followed by an end Block Quote token. And walking through the code did not at once reveal any interesting information because everything looked right. That is until I noticed that there was a single point where it no longer looked right.

That point was the on the return of the parser from the __handle_nested_container_blocks function. Everything looked correct before that function returned, especially the this_bq_count variable that contains what the perceived count of the Block Quote is. It was then that I noticed that when the function returned, the this_bq_count variable dropped from 1 to 0. As there was a Block Quote stack token on the token_stack and a this_bq_count variable with a value of 0, that Block Quote was being closed as no longer being correct. It was not the proper thing to do in this case, but in general, it was doing the right thing by removing the Block Quote from the stack.

To fix this, one change was needed, with a lot of ripples to accompany it. The one change was to have the __handle_nested_container_blocks function pass the this_bq_count variable back to its callers. A handful of functions needed changes to accommodate this, including code in the __look_for_container_blocks function to use this information. While these changes were not difficult, it took me a bit to get them right and review them to make sure that I felt that they were right.

But after all that work, all three sets of rule-based scenario tests had their skipped tag removed, and they were all working cleanly. To make sure things stayed that way, I added test functions test_extra_009 and test_extra_009a with the boiled down version of the issue. One nagging item off the list. That felt good! Now for another one.

Issue 53 - Cleaning Up List Tests¶

Sometimes the problem just stares you in the face and announces itself proudly. This was the case with test function test_list_blocks_237x which was clearly throwing an assert. Looking at the assert itself, it was obvious that it was failing, but not how. By adding a clear assert message, I was able to find that out. It was asserting because it assumed that the extracted whitespace would always be whitespace.

In this case, the variable adj_line_to_parse contained the adjusted line that needed to be parsed. However, because it occurs within a Block Quote, the start of the line included the Block Quote characters and whitespace, not just whitespace. That meant that when the parser hit this code:

assert adj_line_to_parse.startswith(
    parser_state.nested_list_start.matching_markdown_token.extracted_whitespace
)

the assert failed. The length of extracted_whitespace field was as it was expected to be, the string that it was being compared to just included other characters at the start of it.

The first step in fixing this was to change the assert into an if. The positive case of the previous assert was working fine, there were just some cases where it needed some refinement. To take care of that “refinement”, this code was added in the else clause of the if statement:

assert parser_state.original_line_to_parse.endswith(
    adj_line_to_parse)
orig_parse = len(parser_state.original_line_to_parse)
curr_parse = len(adj_line_to_parse)
delta_parse = orig_parse - curr_parse
whitespace_to_remove = parser_state.nested_list_start.matching_markdown_token.
    extracted_whitespace[delta_parse:]
assert adj_line_to_parse.startswith(whitespace_to_remove)
adj_line_to_parse = adj_line_to_parse[ len(whitespace_to_remove) : ]

This code was all to get around the possibility that the leading spaces for any line in a Block Quote can have Block Quote characters as part of that leading space. Once past those characters, everything else should be whitespace. So, this code works with that ending whitespace to adjust the line to what it needs to be.

It took me a while to come up with that approach, but it made sense. And another nagging issue down. Just one more left.

Issue 59 - Cleaning Up SetExt Tests¶

The last nagging issue deal with two almost identical scenario tests: test_setext_headings_064a and test_setext_headings_069a. With little information to go on, I started to look at the output of the parser for this document:

- Foo
===

Quickly looking at those Markdown documents, I guessed wrong that the document should produce a SetExt Heading. Using the faithful Babelmark tool, I verified that the provided Markdown should produce this HTML:

<ul>
<li>Foo
===</li>
</ul>

It took me a couple of minutes to have an “aha” moment, but I eventually got there. The problem that I had is that I prejudged the text because it looked like a SetExt sequence. Working though it some more in my head, I could then see that the second line would be considered a paragraph continuation line unless there was something more to that second line. Indeed, when I added two extra spaces before that second line, it matched the indent of the List Item and became a SetExt Heading element. But anything short of that, and it rightfully remaining a simple paragraph continuation line.

With that information in hand, I debugged through the parse_setext_headings function for a while before adding this code:

is_paragraph_continuation = False
if (
    len(parser_state.token_stack) > 1
    and parser_state.token_stack[-2].is_list
):
    adj_text = position_marker.text_to_parse[position_marker.index_number :]
    assert parser_state.original_line_to_parse.endswith(adj_text)
    removed_text_length = len(parser_state.original_line_to_parse) - len(
        adj_text
    )
    if (
        adj_text
        and removed_text_length < parser_state.token_stack[-2].indent_level
    ):
        is_paragraph_continuation = True

Based on the if statement just before this code, it would only be activated if the Leaf Block Processor was inside a Paragraph element. In the logic that I added, if that paragraph was inside of a List Item, a check was added to see how much whitespace was removed from the line to get to that point. If the amount removed was less than the indent level and there was at least one non-whitespace character on that line, it considered the current line a paragraph continuation.

With that logic in place, the ‘if’ statement in the code that followed:

(
    after_whitespace_index,
    extra_whitespace_after_setext,
) = ParserHelper.extract_whitespace(
    position_marker.text_to_parse, collected_to_index
)
if after_whitespace_index == len(position_marker.text_to_parse):

was changed to:

if not is_paragraph_continuation and after_whitespace_index == len(
    position_marker.text_to_parse
):

Basically, the existing logic was almost fine, but needed to also check that it was not a paragraph continuation before proceeding. Verifying the new code against both original test functions, it also passed the two new function I added to help diagnose the problem. I did leave them in there as they cleanly verified the boundary cases that existed.

Issue 43 - Sometimes It Is A Small Fix¶

Encountered during a previous debugging session, I made a note to myself to check out the following Markdown:

* First Item
  * First-First
   * First-Second
    * First-Third
* Second Item

While that Markdown made no difference to the rule that I was debugging at the time, I recognized that it was not properly creating a List Item out of the fourth line as it should have. Instead, the List Item started on line 3 contained both the text for line 3 and for line 4.

After creating a new scenario test, I quickly dropped into debug mode and found the problem relatively easily. Near that start of the is_olist_start and is_ulist_start functions, there is a standard whitespace check to make sure that the whitespace does not exceed the usual three characters. Simply speaking, prefaced with zero to three spaces, this Markdown:

   - this is an item

is translated as a List Item. By adding one more space to that prefix:

    - this is an indented code block

that List Item becomes content for an Indented Code Block. Within another List element, after the list indent is considered:

- this is an item
     - this is a nested list

is translated into a List and a Sublist, while:

- this is an item
      - this is a continuation line

is translated into one List with two-line content.

Verifying that everything still worked properly for the first two cases, I then moved my focus to the remaining two cases. While the last case worked properly, the “this is a nested list” case was being parsed as a continuation line.

With my eyes on those whitespace checks, this made sense. Those checks were evaluating whether the number of whitespace characters from the beginning of the line was three or less. In the “nested” case, it was failing that check from the beginning of the line, when it should have been passing that check from the start of the indent from the first List.

Already having calculated the amount of indent required by the parent list in the __adjust_whitespace_for_nested_lists function, it was a simple matter to pass it back with the adj_ws value. Once passed back, that value was added into the whitespace length check to reduce the required whitespace properly.

After the code was passing the first scenario test test_extra_010x, I added two extra scenario tests to make sure that this worked for other cases. While I could have added more tests, I figured that these three tests gave the changes a good amount of coverage for now. If necessary, I can always go back and add more.

Issue 62 - Sometimes It Really Is A Check¶

At an earlier point in the project’s history, I had tested this snippet of Markdown:

> simple text
> dd
> dd
# a

and it had failed. I can remember adding it during the development of Rule Md027, as the repository history attests to, and I remember being surprised that something so simple could change the tokens.

However, in the time between 2021 Aug 15 and when I looked at on the morning of 2021 Oct 17, it has ceased to be a problem. Just to be sure, I added three new scenario tests test_extra_011x to test_extra_011b to ensure that it would remain a non-issue. Sometimes you get lucky.

Issue 64 - And Sometimes You Did Something Wrong¶

Noticed during the “grand” commenting rewriting of Issue 44, I added a note to the Issues List to check on Rule Md030, as I had noticed something weird. It was a good thing that I noticed it, as there was something weird that was exposed by the scenario tests.

When I went to verify Rule Md030 and its scenario tests, I started by looking at the original rule that this rule was based off. While there is little clear documentation on the configuration values for the original rule, this specific phrase leapt out at me:

consists of a single paragraph or multiple paragraphs

The reason that this leapt out at me is that the work I had done on Rule Md030 was about single versus multiple levels of lists not paragraphs. Delving into the disconnect a bit more, the page sited by the original rule does show examples with multiple paragraphs, not multiple levels within a list.

I can honestly say that the thought that went through my head was something akin to “huh. wonder how that happened?” To be continually honest, if this is the only big mistake I made during my efforts to implement every rule, I will still consider it remarkably successful. For me, if I do not must throw away lots of work, my focus is on getting things done right, not trying to figure out where it messed up.

Looking through the code for module rule_md030.py, most of the code was still very usable. I removed the function __is_current_list_multiline as it was no longer needed. Instead, the __current_list_parent field and the __paragraph_count_map field were introduced to track the current list item and the paragraph count for that list item. Other than those changes, plus some extra code to properly populating those fields and their associated objects, and it relatively remained the same.

However, the changes to the scenario tests were more substantial. Each of the existing *_double.md tests were changed to have a list item with two paragraphs instead of a list item that was part of a level 2 sublist. Basically, the Markdown document:

1. First
1. Second
   1. Second A
1. Third

changed to:

1. First
1. Second - 1

   Second - 2
1. Third

To add coverage to properly test the new understanding of the rule, eight new scenario tests were added with nested sublists. These were evenly divided between the Ordered Lists and Unordered Lists, and again between single paragraph List Items and multiple paragraph List Items. Add to that the required cleanup of the documentation page for Rule Md030, and things were ready to go!

Issue 66 - Winding Down¶

As far as authoring these articles go, I am getting into a good groove most weekends. Because I take the time during the week to curate a simple outline of what I want the article to be, I can start building a set of good paragraphs and examples from that outline and really bring it to life. But sometimes I can get ahead of where I feel I need to be, and tonight is one of those nights.

Having spent a solid amount of time writing, doing work around the house, and spending time with my family, I found I had some “extra” time left over. To be honest, the reason the word extra is in quotation marks is because there really is no concept of extra time. I believe it would be more correct to say that I found myself with about two hours of time that I had not planned for, and I wanted to fill that time with something useful. As such, I started to look at some more of the issues that I logged during my work on Issue 44.

The first issue that I grabbed was simple labelled as such:

### Verify test_md023_bad_improper_indent_atx_in_list_item in lists and block quotes

As such, I looked at the specified tests, and everything looked fine. In fact, it was a fitting example of Rule Md023 operating cleanly inside of a List Item. Then I took another look at the title of the function: test_md023_bad_improper_indent_atx_in_list_item. Well, that was a bit off!

To fix this issue, and the same issue with the *in_block_quotes variant of this scenario test, I changed the prefix of both files from test_md023_bad_improper to test_md023_good_proper. After changing the file names, I changed the scenario text function names to match, and all was good. However, to make sure the original intent of those scenario tests was fulfilled, I created two new test functions and test data files with the original names. Executing those two new scenario tests, I then verified that they were failing properly, adjusted the expected output, and declared the issue fixed.

Issue 70 - More Winding Down¶

With the minutes of my “extra” time winding down, I found this issue and figured out that it was a simple one to fix. For the most part, the main problem with this issue was that the names of the functions were not indicative of what was being tested. In addition, I added a new test_md036_bad_configuration_punctuation test function to supply a test for the punctuation configuration value.

And then I put the digital equivalent of my pen down for the evening. I had an incredibly good start on the article for this week, and I also had managed to get a fair number of items removed from the Issues List.

What Was My Experience So Far?¶

To explain how I am feeling about the project is hard to explain without relating it to something relatable like drafting an article or an essay. There are those essays that we all wrote in high school where we just wanted to get it done. Those were more about ensuring that we met the minimum requirements for the essay to be counted by our teachers, and that was it.

This project is something different to me. To me, this feels like an extended essay on something that I care very deeply about. It is not enough that I want to meet the minimum requirements of the task, I want to make sure to convey a certain feeling with my essay. To that extent, I do not mind if I must go over certain parts of the essay repeatedly until I get the tone and theme exactly right.

From a given viewpoint, this project is already a success. I have a beta release that is out, and it addresses the linting requirements that I want it to address. But for me, that is not enough. I want to make sure the project is solid, with good development principles throughout. And for me, the first part of meeting that goal is to address those items that I put off for any reason.

And for me, striving to increase quality is the goal that keeps me going!

What is Next?¶

To be blunt, I am starting to get to the end of the easier to resolve issues. What are left are more general issues and issues that require me to double check things. Not sure where that is going yet, so stay tuned!

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.

Comments