Summary

In my last article, I talked about the work I put into getting that week’s three rules completed. In this article, I talk about the next three rules I worked on.

Introduction

I know I am getting close to being finished with this phase of implementing rules when I can represent the number of rules left with two hands! It was exciting. I could add a lot more, but what I really want to do is talk about the things I did this week, so here I go!

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 25 Aug 2021 and 29 Aug 2021.

Multiple Top-Level Headings

The implementation of Rule Md025 was a long time coming, but I was finally getting around to it. The basis of the rule is very simple: top-level headings are often used by parsers as a document title. Because of that, having multiple titles in the same document just doesn’t make sense. Well, it did not make sense to me at least.

Design

While I felt that this rule was simple enough to start implementing without any design, I wanted to stick to my goals and design this out… even if it only took five minutes. I arrived at that five-minute estimate because I could hold all the parts of the design together in my head with room to spare. From experience, if you can think about a design as a complete puzzle in your head, the design is generally going to be simple.

The first part of the design was easy. I needed a variable to keep track of whether the first top-level heading was observed. Once set, any other top-level heading was a multiple and needed to trigger the rule. Since both the Atx Heading tokens and the SetExt Heading tokens share the hash_count property, it would mean a simple check to see if that property was set to 1.

That was it. Nothing complex. Even with the configuration parts added in, those new parts were still manageable and simple. If the level configuration value is set, that 1 would be replaced with the variable containing the level configuration value. The front_matter_title configuration required a check for the Front Matter token, and then a check to see if the property map for the Front Matter token contained the specified configuration value. If so, set the indicator that the first top-level heading was observed.

For me, it was easy going through that design, with no issues involved. I did not even see any warning signs. But still, was it worth it? It did feel good going through the process and having that confidence verified, thinking some more about it and confirming that there were no apparent issues.

Testing and Implementation

After coming up with nine scenario tests, I felt I had given the rule enough good test coverage to move on to the implementation phase. Like the simplicity of the design phase, these scenarios were not particularly difficult. Three good scenarios to make sure the rule does not trigger for safe scenarios, and six scenarios testing the failure combinations.

Just like some of the rules from last week, the implementation was a reflection of the design that I specified above. The rule started with:

def starting_new_file(self):
    self.__have_top_level = False

def next_token(self, context, token):
    if token.is_atx_heading or token.is_setext_heading:
        if token.hash_count == 1:
            if self.__have_top_level:
                self.report_next_token_error(context, token)
            else:
                self.__have_top_level = True

Then, when the self.__level configuration value was introduced, the hash count comparison was changed to:

        if token.hash_count == self.__level:

Finally, when the self.__front_matter_title configuration value was added, a new comparison was added at the end of the main if statement:

    elif token.is_front_matter:
        if self.__front_matter_title in token.matter_map:
            self.__have_top_level = True

Just like the design indicated, it was a simple implementation. But I am not always so lucky that things match up. With extra confidence, and still a firm belief that I am taking the right path, I went on to the next rule.
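To see the three fragments above working together, here is a minimal, self-contained sketch. It is not the project's rule class: the Token dataclass, the kind strings, and the find_extra_top_level_headings function are hypothetical stand-ins I made up so the logic can run on its own, while hash_count and matter_map mirror the properties used in the snippets above.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical stand-in for the parser's tokens (not the real class)."""

    kind: str  # "atx", "setext", "front_matter", or anything else
    hash_count: int = 0  # heading level, only meaningful for headings
    matter_map: dict = field(default_factory=dict)  # front-matter keys/values


def find_extra_top_level_headings(tokens, level=1, front_matter_title="title"):
    """Return every heading at `level` that appears after a title was seen."""
    have_top_level = False
    extras = []
    for token in tokens:
        if token.kind in ("atx", "setext"):
            if token.hash_count == level:
                if have_top_level:
                    extras.append(token)  # a second title: the rule triggers
                else:
                    have_top_level = True
        elif token.kind == "front_matter":
            # A title in the front matter counts as the first top-level heading.
            if front_matter_title in token.matter_map:
                have_top_level = True
    return extras


document = [
    Token("front_matter", matter_map={"title": "My Doc"}),
    Token("atx", 1),
    Token("atx", 2),
    Token("setext", 1),
]
print(len(find_extra_top_level_headings(document)))
```

Because the front matter in this sample already supplies a title, both level 1 headings that follow it are reported.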

All Files Must Have Top-Level Headings

This rule, Rule Md041, was another rule that I had been delaying for a long time. Part of the reason for that delay is the scope of this rule. I was confident that once I completed this rule, it was going to be triggered on around 60% of the test scenarios for rules. That was an easy estimate to come up with. A good group of those test scenarios were not focused on Atx Heading elements or SetExt Heading elements; therefore, they were probably going to trigger this rule.

So, taking a slight detour from the usual design first approach, I decided that I needed to upgrade the invoke_main function to handle this issue. The invoke_main function is part of the test framework and is used by approximately 75% of the scenario tests. The function’s one and only job is to allow the scenario tests to execute the linter with as much parity as possible with the command line.

It was here that I needed to address the issue of Rule Md041 triggering during the scenario tests. Changing each test was out of the question… that would be a mess. Instead, I added a suppress_first_line_heading_rule parameter, defaulted it to True, and implemented the following:

def invoke_main(
    self, arguments=None, cwd=None, suppress_first_line_heading_rule=True
):
    if suppress_first_line_heading_rule:
        new_arguments = arguments.copy() if arguments else []
        if "--disable-rules" not in new_arguments:
            new_arguments.insert(0, "--disable-rules")
            new_arguments.insert(1, "md041")
        else:
            disable_index = new_arguments.index("--disable-rules")
            disable_value = new_arguments[disable_index + 1]
            if not disable_value.endswith(","):
                disable_value += ","
            disable_value += "md041"
            new_arguments[disable_index + 1] = disable_value
        arguments = new_arguments

The changes were simple, yet efficient. If there is no parameter named --disable-rules, then the parameter and the value md041 are inserted at the front of the parameter list. Otherwise, the value for that parameter has the value md041 appended to the end of the existing value.
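The same argument rewriting can be tried out in isolation. The sketch below is a standalone helper, not part of the test framework: the name with_md041_disabled and the sample arguments are my own, and like the original it assumes that --disable-rules is always followed by its value.

```python
def with_md041_disabled(arguments=None):
    """Return a copy of `arguments` with md041 added to --disable-rules."""
    new_arguments = list(arguments) if arguments else []
    if "--disable-rules" not in new_arguments:
        # No existing option: put the option and its value at the front.
        new_arguments[0:0] = ["--disable-rules", "md041"]
    else:
        # Existing option: append md041 to its comma-separated value.
        value_index = new_arguments.index("--disable-rules") + 1
        rule_ids = [rule for rule in new_arguments[value_index].split(",") if rule]
        rule_ids.append("md041")
        new_arguments[value_index] = ",".join(rule_ids)
    return new_arguments


print(with_md041_disabled(["scan", "test.md"]))
print(with_md041_disabled(["--disable-rules", "md022", "scan", "test.md"]))
```

Splitting and rejoining the value on commas covers the trailing-comma case that the original handles with endswith.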

It was not rocket science, but it was efficient. Instead of spending hours making changes and validating them, I had this change up and running within two hours, testing included. It was a bargain.

Design

I felt that this design was a bit more complex than the previous rule, but not by that much. The basis for this rule is all about the first “real” token in the document. I knew there was going to be a bit of a twist regarding the Front Matter token, so I left that to the end. Every other token was fair game though.

Since this rule is specifically about the first token in the document, I knew that I was going to need a variable just to provide that context to the rule. Once inside an if condition on that variable, unless there was a good reason not to, that variable would be set each time. Quick, easy, done.

From there it was just working through the allowable start cases one by one. If an Atx Heading element or a SetExt Heading element without the right level was found, it needed to trigger the rule. For an HTML Block, I would have to do a bit more work as the allowable <h1 sequence was inside a separate Text element. That meant I would have to set a separate capture variable, and not set the “done” variable until the rule analyzed the following Text element. But other than that, the general idea of how the block was handled was the same as with the Heading elements. I also figured out that Blank Lines would probably not count, so I made a note to test what effect those Blank Lines had.

For the Front Matter element, I thought more work would be needed, but in the end, I figured that it was a simple check. If the title is in the Front Matter element’s property map, then a title is present, and the requirements for having a title were met. I checked this a couple of times to make sure it made sense, and it passed each time. I think the original name of the rule could have been better worded, but it was working properly.

Testing and Implementation

The thirteen scenario tests that I came up with for this rule were simple to construct. Four of those scenario tests were the good cases where each of the four allowable elements starts the document with good data. Basically, tests for things like a Level 1 (top-level) Atx Heading element. I was concerned about Blank Lines causing issues, so I threw an extra good scenario test in there just to be sure. The negative tests? Just one scenario test for every leaf element, and it was taken care of.

The implementation was also uneventful, following the design of the rule without any issues. Starting with this code to handle the Heading elements:

def starting_new_file(self):
    self.__have_seen_first_token = False

def next_token(self, context, token):
    if not self.__have_seen_first_token:
        if token.is_atx_heading or token.is_setext_heading:
            self.__have_seen_first_token = True
            if token.hash_count != self.__start_level:
                self.report_next_token_error(context, token)
        elif not token.is_blank_line:
            self.report_next_token_error(context, token)
            self.__have_seen_first_token = True

the function was grown to handle Front Matter elements and HTML Block elements as described in the design. Except for a couple of typing mistakes here and there, everything went very smoothly.
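For readers who want to see how the Front Matter and HTML Block cases might slot in, here is a rough sketch built from the design described earlier. It is not the code from the commits. The Token dataclass, its kind strings, and the triggers_first_heading_rule function are hypothetical, and the sketch assumes that front matter without the configured title simply defers the decision to the next token.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical stand-in for the parser's tokens (not the real class)."""

    kind: str  # "atx", "setext", "blank", "html_block", "text", "front_matter", ...
    hash_count: int = 0
    text: str = ""
    matter_map: dict = field(default_factory=dict)


def triggers_first_heading_rule(tokens, start_level=1, front_matter_title="title"):
    """Return True when the document does not start with a top-level title."""
    awaiting_html_text = False
    for token in tokens:
        if awaiting_html_text:
            # The HTML lives in the Text token that follows the HTML Block token.
            return not (token.kind == "text" and token.text.lower().startswith("<h1"))
        if token.kind == "blank":
            continue  # Blank lines are skipped, as in the snippet above.
        if token.kind in ("atx", "setext"):
            return token.hash_count != start_level
        if token.kind == "html_block":
            awaiting_html_text = True  # Defer the decision by one token.
            continue
        if token.kind == "front_matter":
            if front_matter_title in token.matter_map:
                return False  # The front matter already supplies the title.
            continue  # Assumption: untitled front matter defers to the next token.
        return True  # Any other first element triggers the rule.
    return False  # Nothing but blank lines: nothing to report.


print(triggers_first_heading_rule([Token("blank"), Token("atx", 1)]))
print(triggers_first_heading_rule([Token("paragraph")]))
```

The awaiting_html_text flag plays the role of the separate capture variable from the design: the decision for an HTML Block is postponed until the Text token that follows it has been inspected.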

It was as I was wrapping up the work for this rule and looking forward to Rule Md043 that I realized that I was going to have my hands full with that rule.

Required Heading Structure

The first time I looked at Rule Md043, I did not think it would be too difficult to implement. It is a very simple text comparison between a list of Heading elements and a list of required heading levels and heading text.

But then I looked further in the description and saw that it supports the * character. My heart dropped. That meant I was going to have to implement a glob-based comparison. I was not happy.

Glob

While the dictionary definition is “a small drop” or “a usually large and rounded mass”, if you mention Globs to most computer people, they immediately think of * and ?. From a Linux point of view, a glob is a string of text that can contain the characters * and ?, which serve as placeholders for other characters. The phrase is so commonly used that GitHub has a project just for glob.

How does it work? The ? character is a placeholder for any single character. Therefore the glob me?on will match the strings melon and menon. The string mennon will not match the glob as it only has one ? character, not two.

The * character is similar but is a placeholder for any number of any characters, including none at all. To keep things simple, the glob me*on will match meon, both positive matches above, and the string mennon. As a matter of fact, it will also match a string starting with me, ending with on, and having an insanely large number of characters between them.
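These character-level rules are easy to try out with Python's standard fnmatch module. It is aimed at filenames, but fnmatchcase accepts any pair of strings. The candidate words below are just the examples from the text plus one non-match that I added.

```python
from fnmatch import fnmatchcase

candidates = ["meon", "melon", "menon", "mennon", "lemon"]

# "?" stands for exactly one character.
single = [name for name in candidates if fnmatchcase(name, "me?on")]

# "*" stands for any number of characters, including none.
multiple = [name for name in candidates if fnmatchcase(name, "me*on")]

print(single)    # ['melon', 'menon']
print(multiple)  # ['meon', 'melon', 'menon', 'mennon']
```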

So why was I worried about glob-based comparisons? They look easy, don’t they? Well, how do you deal with a glob like me*e*on? The algorithm to handle one * character is relatively simple. Adding in support for multiple * characters gets messy really quickly. I’ll get into that a bit more later, but it was that support that I was dreading.

Design

Making sure that I was thinking about the rest of the design clearly, I sat back and just let my mind wander and think about this rule. One thing that was immediately obvious was that any “fancy” headings that were not 100% text were going to be a problem. If an author adds the text # Heading *Me* One, the parser will translate it into three Text elements with two Emphasis elements in the middle. Should I then match the plaintext or the resolved text? That was ambiguous enough that I decided to not support any Heading elements that were not 100% plain-text. One issue down.

The next issue was how to process the entire set of Heading elements. I could try and process as I went, but I felt that it would be too complex to accomplish with little gain. As such, I decided that at the end of the document, the rule needed to examine a list of all Heading elements and their associated text. With that complete picture in hand, it would be a relatively simple matter of writing the Glob algorithm. Two issues down.

Glob

Ah, yes… a relatively simple matter… Glob algorithm. There were options there as well. I just had to decide which approach worked best in this situation. One option that I had was to try and represent each line as a given character, then use Python’s built-in Glob Library to resolve it. But after checking the library, I found that it only works on files and directories, so I would have to write it myself anyway. There was little benefit in transforming to an easier format and then transforming back if there was an error.

So given that I was going to need to write this myself, I decided to use the compile option for the Python RegEx library as inspiration. For this rule, I would take each part of the provided configuration and determine if it was either a valid, plain-text Atx Heading element or the * character. In this way, I could handle any configuration errors during the configuration phase, and not later. This was going to be a simple loop with some extra checks for Atx Heading element parts, but nothing too bad.

Given a compiled Glob to work with, the first thing I needed the algorithm to do is to avoid any Glob code unless there is a wildcard character present. For a non-wildcard scenario, a simple check through the entire compiled list against the actual Heading elements would be easy. For the wildcard scenarios, I would need to handle any constant headers and footers to get to the point where the algorithm could deal with the wildcards themselves. I was not sure where to go at that point, but I decided I would do more design when I got there.

Testing and Implementation

In all, I created just four scenario test documents to test this rule. I did not need any more. As one of the variables in these scenarios was the configuration, those four documents provided a solid enough base for me to create all the scenario test functions.

Compiling The Requirements

Because of the nature of the design, I decided to allocate a block of time specifically to the compiling and storing of the requirements. The way I look at it, if I did not get the internal form of the requirements correct, I would be lucky if the actual comparisons worked. It was better to be sure the foundation was correct, so I invested the time there.

The actual compilation function was spot on what the requirements asked for, allowing for a simple element structure:

@classmethod
def __compile(cls, found_value):
    found_parts = found_value.split(",")
    compiled_lines = []
    for next_part in found_parts:
        if next_part == "*":
            compiled_lines.append(next_part)
        else:
            count, new_index = ParserHelper.collect_while_character(
                next_part, 0, "#"
            )
            if not count:
                return None, "Element must start with hash characters (#)."
            if count > 6:
                return (None,
                    "Element must start with between 1 and 6 hash characters (#).",
                )
            new_index, extracted_whitespace = ParserHelper.extract_any_whitespace(
                next_part, new_index
            )
            if not extracted_whitespace:
                return (None,
                    "Element must have at least one space character after any hash characters (#).",
                )
            if len(next_part) == new_index:
                return (None,
                    "Element must have at least one non-space character after any space characters.",
                )
            compiled_lines.append((count, next_part[new_index:]))
    return compiled_lines, None

Basically, go through the elements, looking for either a line with the * character or a valid plain-text Atx Heading element. As one of my frequent complaints about software is error messages that do not help the user, I tried to add useful error messages detailing what the error is.

This kept the list to compare against simple. The entry was either a single * character or a tuple with the hash count and the heading text.
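To make that compiled form concrete, here is a compact stand-in that produces the same shape of output. It is a sketch, not the function above: it swaps the ParserHelper calls for a regular expression, collapses the separate error messages into one, and the names HEADING_PATTERN and compile_requirements are my own.

```python
import re

# One to six '#' characters, at least one whitespace character, then the text.
HEADING_PATTERN = re.compile(r"(#{1,6})\s+(\S.*)")


def compile_requirements(value):
    """Turn 'a,b,c' into a list of '*' markers and (level, text) tuples."""
    compiled = []
    for part in value.split(","):
        if part == "*":
            compiled.append(part)
            continue
        match = HEADING_PATTERN.fullmatch(part)
        if not match:
            # Simplification: a single message instead of one per failure mode.
            return None, f"Element '{part}' is not '*' or a plain-text Atx Heading."
        compiled.append((len(match.group(1)), match.group(2)))
    return compiled, None


print(compile_requirements("# Title,*,## Summary"))
# ([(1, 'Title'), '*', (2, 'Summary')], None)
```

As in the original, a failure returns None for the list together with a message, so configuration errors surface before any document is scanned.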

Collecting the Headings and Associated Text

With the requirements compiling, it was then on to the collecting of the Heading tokens and their Text data. This proved to be relatively easy:

def completed_file(self, context):
    if not self.__compiled_headings:
        return

def next_token(self, context, token):
    _ = context
    if self.__collected_tokens:
        if token.is_atx_heading_end or token.is_setext_heading_end:
            self.__all_tokens.append(self.__collected_tokens)
            self.__collected_tokens = None
        else:
            self.__collected_tokens.append(token)
    elif self.__compiled_headings:
        if token.is_atx_heading or token.is_setext_heading:
            self.__collected_tokens = [token]

This form actually took me a tiny bit of effort to get to as I struggled with how to determine whether a Heading token was plain-text only. In the end, I just decided to collect all the tokens for the Heading, store them in the list, and then decide later how to deal with them. While it may not be the most elegant way to deal with this problem, it was the easiest.

A Globbing We Will Go

Rather than walking through the Glob code line by line, I thought it would be more beneficial to walk through the design I did. Globbing a string is something that has weird cases in it. To that extent, I hope I did a good enough job in laying down the foundation for glob comparisons in the above Glob section.

Starting from that point, while whatever is being globbed is up to the implementation, the most common way to think about it is with characters. I believe this is due to the use of Glob for files and directories in Linux, and that it is easier to keep one character in your head than an entire string.

As I detailed above, the algorithm for dealing with zero * characters in a glob pattern is easy. With no special characters, the comparison becomes a simple comparison with no changes. If the glob pattern is abc, then those characters must appear in that order. To make sure that I had a good starting point, I implemented that code first. I made sure to get all the simple scenario tests passing, and everything was good.

I then went on to work out how to deal with one * character. As dealing with any number of any character is a harder issue to deal with, I focused on finding out if there are any characters at the start of the glob or the end of the glob that are not wildcards. If that is the case, then I can use my comparison function for zero * characters to deal with both of those sections. If the glob pattern only has one wildcard, the start of the glob and the end of the glob will meet at that wildcard.
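The zero-wildcard and one-wildcard cases can be sketched over lists of compiled entries rather than characters. This is my own minimal illustration of the idea, not the rule's code: the function names are hypothetical, the entries reuse the (level, text) tuple form from the compile step, and I assume the wildcard may stand for zero headings.

```python
def matches_exactly(required, actual):
    """Zero-wildcard case: both lists must be identical, entry for entry."""
    return list(required) == list(actual)


def matches_with_one_wildcard(required, actual):
    """Match when `required` holds at most one '*' entry."""
    if "*" not in required:
        return matches_exactly(required, actual)
    wildcard_index = required.index("*")
    prefix = required[:wildcard_index]
    suffix = required[wildcard_index + 1:]
    # The fixed parts may not overlap, so there must be room for both.
    if len(actual) < len(prefix) + len(suffix):
        return False
    # Compute the start explicitly; actual[-0:] would return the whole list.
    suffix_start = len(actual) - len(suffix)
    return matches_exactly(prefix, actual[: len(prefix)]) and matches_exactly(
        suffix, actual[suffix_start:]
    )


required = [(1, "Title"), "*", (2, "Summary")]
headings = [(1, "Title"), (2, "Intro"), (2, "Details"), (2, "Summary")]
print(matches_with_one_wildcard(required, headings))
```

Everything between the matched prefix and the matched suffix is absorbed by the wildcard. A second * would land inside the suffix and be compared literally, which is exactly where this approach stops working.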

This was a bit more interesting to work through. I had the scenario tests mostly passing, but I had to focus on the algorithm. It deals a lot with a start index and an end index of each section. It was really easy to get the different values mixed up. But I persevered and got all the scenario tests passing. And then on to the hardest part: multiple wildcard characters!

Or so I thought.

Life Often Shows Up When You Least Expect It

I was hoping to get more done this week. However, due to unrelated issues with both cars in our family, it became obvious that life had other plans for me this week. As I have mentioned before, working on this project is a hobby for me, and I must ensure that I keep the other parts of my life in balance.

Well, these car-related issues worked together this week to remove the better part of a day from my personal schedule. And I believe I made the right choice by focusing on those issues. However, by removing those hours from my schedule, I was not able to finish the work on Rule Md043. Specifically, I was not able to handle instances of multiple wildcard characters in the required configuration value list.

Going over how to handle this in my head, I determined that I had made enough progress on the rule itself to warrant checking it in. I added a custom warning and a new item to my Issues List, just to make sure I had everything covered. Hopefully I will be able to get to it soon, but I feel that this was the right decision to make. Time will tell though.

What Was My Experience So Far?

From a pure numerical standpoint, the number of completed rules is now 26 and the number of rules remaining is now 5. As my expected benchmark was 3 completed rules per week, I was successful in meeting that goal. While I wanted to get more rules done this week, I did not feel it was worthwhile shortchanging Rule Md043 to achieve that.

And with five rules left to go, I have a really good shot at getting the rules completed in the next week or two. That is something worth celebrating! I feel good that I have made all this progress, and I know that I will have something solid that people can use.

It is all coming together, and I like it! I wish it would move faster, but I know I will get there and soon.

What is Next?

I am so close to finishing off the rules that it is hard for me to bear. Will I reach it next week? Stay tuned!


