Summary

In my last article, I talked about the work I put into getting that weeks’ three rules completed. In this article, I talk about the next three rules I worked on.

Introduction

Wow! It has been an interesting two weeks since my last article. With a ton of things going on in both my professional life and my personal life, I really have not had any time to work on the PyMarkdown project. There was not even enough time left over for me to write words, let alone code. Both take brain cells and creativity, and after about a week and a half of long days, I did not feel that I possessed either of them.

With most of those efforts out of the way, I was able to start the task of getting back to work on the project. I knew that it was going to take some effort to get my “project-muscles” back, but it was something I was eager to do. And what a better way to flex those muscles back into shape than to complete the work on Rule Md043.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 11 Sep 2021 and 12 Sep 2021.

Rule Md043 - Completing Glob Matches

Getting started again after a couple of long weeks was no easy task. Before I got busy with life, I had started to work on completing the work to properly handle the recursive code for matching lines. As I mentioned in my last article, I just ran out of time to figure it out for that week. Now, facing the daunting task of getting back to work, it sounded to me like a good task to start with.

To refresh my context on that rule, I took quick inventory on where I left things. I soon figured out that before I stopped working on Rule Md043, I had taken care of all the simple scenarios in which wildcards can be used in globs. The simple cases of a, a*, *a, and a*b were all covered, with a gaping hole in the code where patterns with multiple wildcards should be. Repetitive wildcard cases such as a*b*c were not being handled at all, and that needed to be addressed.

To be clear, while I am pretty sure that most people will not create monster matches like a*b*c*b*c*b*c*d, or any terribly recursive combination. However, I know I still needed to be able to support them. And depending on what the needs are, I can actually see some simple patterns for document headers and document footers requiring at least two or three wildcards. It was time to deal with that!

Design

Breaking it down from a design viewpoint, when the algorithm arrived at the recursive section of the glob, the header and footer of the glob had already been stripped away. This means that if the glob was a*b*c, what needed to be addressed was the remaining pattern *b*. Scribbling out different scenarios on paper, this made it easy for me to see how other matches would work, as the difference between *b* and *b*d* appeared to be an extra level of matching after the first wildcard.

While I am not a stranger to both recursive and iterative algorithms, I have enough experience to know that both algorithms can run away on the user if they are not coded properly. The difference is that recursive algorithms tend to fail with out of memory errors, which looks worse to the user than a bad index error. At least when I am using other software, that is my opinion. So, knowing that, if I chose a recursive algorithm, I wanted to make doubly sure that the recursion was solid before shipping it.

Breaking It Down

In this case, I chose a recursive algorithm as the way to go. As the difference between *b* and *b*d* is just another level, the most efficient way to take care of that is with recursive functions. With that decision done, it was on to figuring out how to do the matching and when to do the recursion.

Taking each part in stride took a bit for me to accomplish, wanting to design everything at once. But once I calmed down a bit, I realized that the algorithm was simple. With the previous header and footer trimmed from the match, the first thing the algorithm had to do was to find the next matches for the first block of non-wildcard matches. In the above examples, this meant looking for a block of length one that matched b. In the first match, once any one line was found that matched that, everything was done, and it could return. But in the second match, the algorithm then had to trim the variables and recurse into the algorithm to find *d* within the remaining data. Once again, if there was any line that matched d, the match was complete.

To be honest, I went through a handful of sheets of paper going through weird combinations of data, making sure that my design worked on paper. In each case, I was careful to adjust the current parameters before mentally “recursing”, and that was something that I was going to must be careful with. But after taking three hours over two days to make sure I had the design nailed down, it was time to see if I got it right.

Testing and Implementation

I already had a few scenario tests for multiple wildcards in the project, but I added a new one to make sure I had things covered. From using globs on file names for years, I had a good idea of the patterns to test, and I was confident that I had the right scenarios to test against.

Going to the code, the first thing I needed was to hook in the new function to the completed_file function, invoking it when there were multiple wildcard characters. This meant replacing this code:

    self.report_next_token_error(
        context,
        remaining_tokens[0][0],
        extra_error_information="More than one wildcard not supported.",

with this code:

    recurse_result = self.__do_recursive(
        remaining_headings,
        top_heading_index,
        remaining_tokens,
        top_token_index,
    )
    if not recurse_result:
        self.report_next_token_error(
            context,
            remaining_tokens[0][0],
            extra_error_information="Multiple wildcard matching failed.",
        )

That was where the fun began. The __do_recursive function itself does not have that many statements, but each statement has a lot of functionality built into it.

def __do_recursive(
        self, remaining_headings, top_heading_index, remaining_tokens, top_token_index):

        bottom_heading_index = top_heading_index + len(remaining_headings)

        assert remaining_headings[0] == "*" and remaining_headings[-1] == "*"
        start_index = 1
        end_index = 2
        while remaining_headings[end_index] != "*":
            end_index += 1
        delta = end_index - start_index
        search_index = 0
        heading_index = top_heading_index + 1
        scan_limit = top_heading_index + 1 + delta
        found_match = False
        while search_index < len(remaining_headings) and search_index < len(
            remaining_tokens):
            (
                end_heading_index,
                end_token_index,
                failure_token,
                _,
            ) = self.__verify_group_heading_match(
                heading_index, top_token_index + search_index, scan_limit=scan_limit)
            if not failure_token and end_heading_index < len(self.__compiled_headings):
                if end_heading_index == bottom_heading_index - 1:
                    found_match = True
                else:
                    new_top_heading_index = end_heading_index
                    new_remaining_headings = remaining_headings[
                        (end_heading_index - top_heading_index) : ]

                    new_remaining_tokens = remaining_tokens[
                        (end_token_index - top_token_index) : ]
                    new_top_token_index = end_token_index

                    found_match = self.__do_recursive(
                        new_remaining_headings,
                        new_top_heading_index,
                        new_remaining_tokens,
                        new_top_token_index, )
            if found_match:
                break
            search_index += 1
        return found_match

As was mentioned in the design section, the first part of the algorithm is setting up for the search. Thankfully, the __verify_group_heading_match function already looks for a block of matching headings, so a call to that function was pivotal to figuring out if a match is found. If it is found and the end index is not at the end of the __compiled_heading variable, it means that the algorithm ran out of headings to match, resulting in no matches. If the end index is valid and equal to the current “bottom”, then the algorithm is done with a found match. However, if it is not at the bottom, that means there are more wildcards to process, so the algorithm recurses.

From my experience, most recursive algorithms are simple in their nature, but the fact that they refer to themselves causes me some unease. It just means that I have to be extra careful about everything in that function as I am using it multiple times with variations on the same data. This was no different. The algorithm at its core was simple. Find a block of matching lines. If there is more to look for, set things up and find a more condensed block of matching lines. If at any point it fails, return right away.

But I still tested it extra carefully. Let me just say that I have been hit by recursive algorithms not being tested properly enough to give them some extra respect when testing. But after going through and testing a handful of additional scenarios, they all matched or did not match according to plan. After some simple cleanup, I checked in the code and moved on.

Ordered List Element Style

Looking through the remaining rules, a pair of them stood out from the rest: Rule Md029 and Rule Md030. To be honest, I thought I had completed them during a previous pass at the rules. But there it was in black and white, they still needed to be completed.

For no other reason that it was first, I decided to start with Rule Md029 regarding the style of Ordered List elements. This rule has a simple principle behind it: keep things consistent. Most Markdown parsers do not like it when an author starts an Ordered List Item with a seemingly random number. Those parsers prefer to have those numbers be all 1s or be a simple increasing number starting with 1. Anything outside of those two options mean that the parser decided what to do with it. And that choice does not always make sense.

Design

Looking at the configuration settings, there were three classes of starting numbers that I needed to look for: all zeroes, all ones, or ordered and starting from one. Basically, 0/0/0, 1/1/1, and 1/2/3. There was the 0/1/2 variant of the ordered, starting at 0 instead of 1, but that was just an alternate starting number. The interesting variant was the one or ordered variant. Allowing for either the 1/1/1 class or the 1/2/3 class, it would be interesting to work around that.

But after looking at all those classes and their variants, the design rapidly became clear to me. For every case other than the one or ordered variant, everything was clear and spelled out by the first number. Even differentiating between the 0-based ordering and the 1-based ordering was easily dealt with using just that first number. The rest of the cases were just following the lead of that first number.

For the one-or-ordered variant, I knew that most of the time the algorithm would need to wait until the second number to figure out which class to put the numbers in. If the first number was 1, then the list was still eligible for both the all ones class and the ordered class. But by the time the second number came around, the algorithm would be locked in, or the algorithm would trigger due to a bad number.

Testing and Implementation

Nailing down the scenario tests for the rule was easy. There are four basic positive sequences that were easy to define. After doing some scribbling, I was able to determine that the negative cases boiled down to three scenario tests. I owe a lot of that small test size due to the positive tests performing double duty as positive tests for their own setting and negative tests for the other settings.

The actual code was a variation on algorithms I have written for other rules dealing with container elements. Not needing to deal with Block Quote elements, I just wrote code to maintain a List element stack, with a second stack for the selected configuration.

def next_token(self, context, token):
    if token.is_list_start:
        self.__list_stack.append(token)
        if token.is_ordered_list_start:
            list_style, last_known_number = self.__match_first_item(context, token)
            self.__ordered_list_stack.append((list_style, last_known_number))
    elif token.is_list_end:
        del self.__list_stack[-1]
        if token.is_ordered_list_end:
            del self.__ordered_list_stack[-1]
    elif token.is_new_list_item and self.__list_stack[-1].is_ordered_list_start:
        list_style, last_known_number = self.__ordered_list_stack[-1]
        list_style, last_known_number = self.__match_non_first_items(
            context, token, list_style, last_known_number)
        self.__ordered_list_stack[-1] = (list_style, last_known_number)

As the configuration may be in flux, it can also vary between a list and any sublists. Therefore, the __ordered_list_stack field contains a tuple containing the list’s style and the last known number. With this information, any style can easily be handled by a list at any level of nesting.

With that information in place, matching the first item was added with the __match_first_item function:

def __match_first_item(self, context, token):
    list_style = self.__style
    last_known_number = int(token.list_start_content)

    if list_style == RuleMd029.__one_or_ordered_style and last_known_number != 1:
        list_style = RuleMd029.__ordered_style

    is_valid = True
    if list_style == RuleMd029.__ordered_style:
        is_valid = last_known_number in {0, 1}
    elif list_style == RuleMd029.__one_style:
        is_valid = last_known_number == 1
    elif list_style == RuleMd029.__zero_style:
        is_valid = last_known_number == 0

    if not is_valid:
        style = ...
        expected_number = 0 if list_style == RuleMd029.__zero_style else 1
        extra_error_information = f"Expected: {expected_number}; Actual: {last_known_number}; Style: {style}"
        self.report_next_token_error(
            context, token, extra_error_information=extra_error_information
        )
        list_style, last_known_number = (None, None)
    return list_style, last_known_number

Following the design, the configured style is used, allowing for a change if the __one_or_ordered_style style is selected. In that case, if the list’s number is not 1, then it must be an ordered style. However, if it is 1, then the class is still in flux, and the algorithm needs to wait.

As to the validity of the list number at that point, it is an easy decision. For any case other than the __one_or_ordered_style style, there is a small set of possible starting values. If it is not one of those values, the rule triggers. Otherwise, the value must be 1 which is valid no matter what.

def __match_non_first_items(self, context, token, list_style, last_known_number):
...    
        new_number = int(token.list_start_content)
        if list_style == RuleMd029.__one_or_ordered_style:
            list_style = (
                RuleMd029.__one_style
                if new_number == 1
                else RuleMd029.__ordered_style
            )
...
    return list_style, last_known_number

Except for the one_or_ordered logic, the __match_non_first_items function is the same as the __match_first_item function. It is only in the case where the first List Item started with a 1 that the style is left undecided. But when the algorithm gets to the second List Item, that ambiguity is either resolved or triggers the rule. As the above code indicates, if the __match_non_first_items function is called with that style still in place, it is either set to __one_style or __ordered_style, depending on whether the new number is set to 1 or not. The code after that change then decides whether the new_number content is valid or not.

I did start out with something a bit more complicated, but as usual, the simpler solution won out. While it was not that much more complicated, it just felt like I was stretching too much when a simpler approach was just as efficient at solving the problem.

List Element Spacing

While it may not seem like much, Rule Md030 is meant to protect against parsers that have a more rigid definition of the spacing that exists between the List Item indicators and the text within the List Item. And since it was one of the remaining ones to deal with, it felt that pairing it with Rule Md030 was just the right thing to do.

Design

I’ll admit it. I copied most of the concepts from the previous rule to this one and adjusted from there. That was possible because they both deal solely with lists and have similar constraints. In this case, the configuration values were stable, even with having to design “wait in this case until…” code. The twist in this rule was in calculating whether the List Item was on a single line or on multiple lines.

Therefore, the design was mostly about adapting the design instead of thinking through the design from scratch. The List Item stack was required to make sure that the algorithm compared List Items within its bounds, and not deeper. As the composition of the list, single line elements throughout or containing any multiple line elements, cannot be figured until the list is finished, all List Item tokens need to be collected until the end List element. Given the single vs multiple classifications at that point, the correct configuration value can then be chosen and applied to each token.

Which left one thing left to design: to design the “single vs multiple” algorithm. Reading the definition of the original rule an extra time, I noticed that it was very clearly documented that:

A document style may change the number of spaces after unordered list items and ordered list items independently, as well as based on whether the content of every item in the list consists of a single paragraph or multiple paragraphs (including sub-lists and code blocks).

This made most of the algorithm simple. Check to see if the difference in line numbers between one List Item and the previous List Item is greater than one. If so, it is a multiple line set of List Items. The very last List Item was a tricky one, but not difficult to design around. Every List Item must contain something, even if it is a Blank Line token. Therefore, the algorithm would need to keep track of the last non-end token encountered, to deal with that final case.

Looking through things, I felt that I had covered everything in the design, and it was time to proceed with the next phase.

Testing and Implementation

Once again taking out my pen and scrap paper, I whittled down the relevant test scenarios to eight scenarios: four for the Ordered List elements and four for the Unorder List elements. I came up with a handful of weird cases, but in the end, they all reduced to those eight scenario tests. While I am not 100% sure of that statement, I am confident enough in that statement to proceed with the coding. Basically, I have a feeling that I might has missed something, but after searching for it for an hour or so, I stopped looking.

The first part of the code has all the hallmarks of the previous rule. The __list_stack code is almost the same, with just a bit more code being executed in each if block.

def next_token(self, context, token):
    if token.is_list_start:
        self.__list_stack.append(token)
        self.__list_tokens.append([])
        self.__list_tokens[-1].append(token)
        self.__last_non_end_token = None
    elif token.is_list_end:
        self.__handle_list_end(context)
        del self.__list_stack[-1]
        del self.__list_tokens[-1]
    elif token.is_new_list_item:
        self.__list_tokens[-1].append(token)
        self.__last_non_end_token = None
    elif not token.is_end_token:
        self.__last_non_end_token = token

The important part of the algorithm is here in the __handle_list_end function, which is invoked when the list ends.1 It is only at the end of the list that the multiline status can be computed. After it is computed, the proper setting can be assigned to the required_spaces variable, which is then compared to the actual number of spaces in each List Item token.

def __handle_list_end(self, context):
    is_multiline = self.__is_current_list_multiline()
    if self.__list_tokens[-1][0].is_ordered_list_start:
        required_spaces = self.__ol_multi if is_multiline else self.__ol_single
    else:
        required_spaces = self.__ul_multi if is_multiline else self.__ul_single
    for _, list_token in enumerate(self.__list_tokens[-1]):
        delta = list_token.indent_level - list_token.column_number
        if self.__list_stack[-1].is_ordered_list_start:
            delta -= len(list_token.list_start_content)
        if delta != required_spaces:
            extra_error_information = (
                f"Expected: {required_spaces}; Actual: {delta}"
            )
            self.report_next_token_error(
                context, list_token, extra_error_information=extra_error_information
            )

Having that code in place, the only part that was remaining was the code to determine if the list contains only single line List Items or if any of the List Items have multiple lines. For every List Item except for the last one, its vertical size can only be calculated against the next List Item. But it does make for a simple loop.

def __is_current_list_multiline(self):
    is_multiline = False
    for token_index, list_token in enumerate(self.__list_tokens[-1]):
        if token_index:
            delta = (
                list_token.line_number
                - self.__list_tokens[-1][token_index - 1].line_number)
            if delta > 1:
                is_multiline = True
                break
    if not is_multiline:
        delta = (
            self.__last_non_end_token.line_number
            - self.__list_tokens[-1][-1].line_number
        )
        is_multiline = delta > 1
    return is_multiline

And because of the clear designing, I knew ahead of time that the last List Item was going to be a different scenario and how to address it. As that last List Item does not have a following List Item to compare it to, the last non-end token serves as a good stand-in.

Once again, I ran through all the cases, looking for that something that I missed and not finding anything. Everything was working and passing, so after that extra checking, I finished up the documentation for the rule and committed it to the repository.

What Was My Experience So Far?

I have been working on this project for almost two years now, and I have never felt closer to a solid beta release than I do right now. With three rules left to go, I feel very confident that I have caught most of the serious bugs. And that is all I can ask for.

As a professional Software Developer Engineer in Test, I know that finding every bug is almost possible, but it is prohibitively expensive. From my own work on this project, I know that maintaining the 99.5%+ code coverage does require commitment and time. And while I have solid code coverage on the code code and rules, I still acknowledge that I am aware that there are things I will have missed. And to be honest, some of those things I probably should have caught, and some are going to be a total surprise to me.

But I am okay with that. And I am okay knowing that an upcoming beta release with all the rules implemented is just a starting point. Besides the issues from the Issues List, I hope people start to use this project and I get issues being reported.

I know I am not perfect, and this project will never be perfect. However, with some good help from users out there, I know this project will continue to get better. And that is what I am hoping for. Good enough and getting better.

What is Next?

For getting restarted on the project after about a weeks’ absence, it took me a bit to get going. But now that I see that the end of the rules implementation phase is in sight, I am eager to get it completed. Stay tuned!


  1. That may sound like a weird statement to some. For example, I make it a practice to name functions after the action they take. Hence, the function __handle_list_end performs the handling for the end of the list. But I have also seen function names like handle, done, and other names that do not imply that context. Reading those functions is not fun. 

Like this post? Share on: TwitterFacebookEmail

Comments

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.


Reading Time

~15 min read

Published

Markdown Linter Beta Release

Category

Software Quality

Tags

Stay in Touch