Summary

In my last article, I talked about the work I put into getting that week’s three rules completed. In this article, I talk about getting the remaining rules implemented.

Introduction

It has been a long way to get here, but I am finally at the point where the beta release is within reach. All I have to do is to finish the last three rules, make sure they are cleaned up a bit, and I am ready for the beta release of PyMarkdown.

And with that…

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 16 Sep 2021 and 19 Sep 2021.

Rule Md013 - Line Length

Most linters have a rule like it, and PyMarkdown was not going to be an exception. It needed a rule that curtailed overly long lines. It was something obvious that was missing. But as I have frequently found to be the case with simple and obvious things, there was a little hitch.

Design

At first, when I looked at this rule, I thought it was going to be easy. I mean, come on! This was line lengths. Just go through the file looking for long lines and flag when they exceed a given length. And then I got to the section on code blocks and headings in the original rules. Reading the description again, there was a brief discussion on why having alternate line lengths for those sections was needed. Furthermore, those discussions made sense.

But that left me with an interesting problem to solve: how do I do that efficiently?

If At First You Do Not Succeed

My first pass at solving this was horrible. Scribbling furiously, I wanted to take the approach of using the tokens to help me figure out what the line lengths were. Each non-end token has a start location, so piecing things back together would be easy, right? Wrong! No matter how much I tried to design around that issue, it always proved to be too much work for too little benefit.

It took a couple of hours of playing around with designs and walking my dog before I produced an observation that I thought had merit. From a more relaxed viewpoint, I observed that I was trying to deal with both requirements of the rule with the same algorithm. It was not the base algorithm that was complex, but the variations that I was trying to add to it to understand where in the document it was. That was the main issue. But could I attack that problem from a different angle and avoid that issue?

Time To Pivot

Instead of trying to change that base algorithm, I decided that I was going to try and implement two algorithms separately and have them work together. The first algorithm, the base algorithm, was great at detecting long lines, but did not have any concept of where in the document it was. I therefore figured out that I needed a second algorithm that simply focused on what the currently active Leaf element was. If there was effective communication between the two algorithms, the first algorithm could ask the second algorithm where in the document it was and make the right decisions for configuration.

Then the issue I needed to solve was that method of communication. Intrinsically, each algorithm occurred in a different part of the document scan: first the tokenized scan occurs, and then the line-by-line scan occurs. That was the gap that I needed the algorithms to cover.

Taking a hint from past ventures in trying to create complex solutions, I decided to keep it simple. During the token pass, the second algorithm would add any Blank Line token or Leaf token to an array. As those tokens contain the line number and multiple Leaf tokens cannot occur on the same line, a simple tracking index would work for tracking the current Leaf token within the line scan phase.

I was not concerned about covering the cases for long words at the end of lines, as I already had that part designed within minutes of reading the rule. Starting at the specified line length, the rule needed to avoid triggering if it could not find any whitespace before the end of the line. From experience with the project, it sounded like it was purposefully made for the ParserHelper module and its extract_until_whitespace function.
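To make the long-word exception concrete, here is a minimal sketch of the default (non-strict, non-stern) behavior only. The helper `index_of_next_whitespace` is a hypothetical stand-in for `ParserHelper.extract_until_whitespace`; it is assumed to return the index of the first whitespace character at or after the starting position, or the length of the line if there is none.

```python
def index_of_next_whitespace(line: str, start: int) -> int:
    """Hypothetical stand-in: index of the first space or tab at or after
    `start`, or len(line) if no whitespace is found."""
    index = start
    while index < len(line) and line[index] not in " \t":
        index += 1
    return index


def is_line_too_long(line: str, limit: int) -> bool:
    """Default mode only: a long line is reported unless there is no
    whitespace between the limit and the end of the line."""
    if len(line) <= limit:
        return False
    # If the search runs off the end of the line, the overflow is a single
    # unbreakable "word" (a long URL, for example), so do not report it.
    return index_of_next_whitespace(line, limit) != len(line)
```

With a limit of 10, a line such as `aaaa bbbb cccc dddd` would be reported because a space exists past the limit, while `see https://example.com/a/very/long/path` would not.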

Working through different mental runs of the algorithm, it seemed solid. But would it work when coded? It was time to see.

Testing and Implementation

Looking across the requirements for this rule, it was easy to see that the scenario tests were going to have to cover every one of the Leaf elements. With special configuration for both code blocks and headings, I did not see any way to avoid it. In addition, I needed to deal with words, or more properly, possible word-like entities that may extend past the end of the specified line lengths. The variations to deal with strict and stern configuration items would require testing on existing test cases, but no new test data files. Once I finished creating the test cases, I counted them up and had 18 different Markdown files entered.

From there, the number of scenario tests based on those test cases grew. With six configuration values and combinations between them and the test case data, the number quickly soared to 43 scenario tests before it stopped. There were my usual “what about…” scenarios that I thought about adding afterwards, but in each case those scenarios boiled down to slight variations on existing test cases and scenario tests. Knowing that they did reduce to those existing tests gave me the confidence to go on with the implementation.

Second Algorithm First

As I was more worried about the interface with the second algorithm, I decided to work on it first. To start, I added the following function:

def next_token(self, context, token):
    if token.is_blank_line or token.is_leaf:
        self.__leaf_tokens.append(token)

It was nothing special, but I was amazed by how simple it seemed. With all my scribbling on paper seeming overwhelming, this function seemed very underwhelming. But there was no disputing it: it did what it needed to very well. It simply added any Leaf tokens or Blank Line tokens to the list.

Then it was time for the other half of the algorithm. It was equally simple:

def next_line(self, context, line):
    if (
        self.__leaf_token_index + 1 < len(self.__leaf_tokens)
        and self.__line_index
        == self.__leaf_tokens[self.__leaf_token_index + 1].line_number
    ):
        self.__leaf_token_index += 1

    # Do Stuff Here

    self.__line_index += 1

Because the first algorithm is line based, the second algorithm just needed to provide a line-level understanding of which Leaf token is currently active. Once again, it was underwhelming to implement the design and find out how simple it was. It was so underwhelming that I added debug statements and walked through multiple test scenarios to make sure it was working properly. It worked in each case. It was a win! It was on to the other algorithm.
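The two halves can be exercised together in a small, self-contained simulation. This is only a sketch: the `Token` class, the token names, and the starting values of the two indices are assumptions made for illustration, and it assumes the first collected token begins on line 1.

```python
from dataclasses import dataclass


@dataclass
class Token:
    name: str          # illustrative label, e.g. "paragraph" or "blank"
    line_number: int   # 1-based line on which the token starts


class LeafTracker:
    def __init__(self):
        self.leaf_tokens = []
        self.leaf_token_index = 0  # assumed starting value
        self.line_index = 1        # assumed starting value

    def next_token(self, token):
        # Token pass: remember every Leaf or Blank Line token, in order.
        self.leaf_tokens.append(token)

    def next_line(self, line):
        # Line pass: advance when the next token starts on this line.
        upcoming = self.leaf_token_index + 1
        if (
            upcoming < len(self.leaf_tokens)
            and self.leaf_tokens[upcoming].line_number == self.line_index
        ):
            self.leaf_token_index = upcoming
        active = self.leaf_tokens[self.leaf_token_index].name
        self.line_index += 1
        return active


def active_leaf_per_line(tokens, lines):
    tracker = LeafTracker()
    for token in tokens:
        tracker.next_token(token)
    return [tracker.next_line(line) for line in lines]
```

For a heading on line 1, a blank line, a two-line paragraph, another blank line, and a code block, the second paragraph line reports the paragraph as still active because no new token starts on that line.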

Back To The First Algorithm

With the second algorithm code in place, the first algorithm went forward without any problems. While there are variations for code blocks and headings, the initial code added to the next_line function was quite simple:

    line_length = len(line)
    compare_length = self.__line_length
    if line_length > compare_length:

        trigger_rule = False
        if self.__strict_mode:
            trigger_rule = True
        else:
            next_space_index, _ = ParserHelper.extract_until_whitespace(
                line, compare_length)
            if self.__stern_mode:
                trigger_rule = line_length == next_space_index
            else:
                trigger_rule = line_length != next_space_index

        if trigger_rule:
            extra_error_information = (
                f"Expected: {compare_length}, Actual: {line_length}")
            self.report_next_line_error(
                context, 1, extra_error_information=extra_error_information)

Following the design, the actual comparison for line length is simple… it is what comes after it that took a bit to get right. Between the rules for strict, stern, and long end words, it took me some additional time to get the response to those configuration settings exactly right. While I did not state it in the design section for this rule, I was 40% convinced that I would need to design that on the spot to get it right. The other 60%? I wanted to see if my two-algorithm approach worked as well as I thought it did.

Regardless, the implementation was coded for the base paragraph scenario tests within an hour, and I started the lengthy process of enabling each test scenario and addressing any issues. With 43 scenario tests to enable, it took a while, but I only encountered slight issues along the way. When it came to handling the line lengths of the code blocks and the headings, I just introduced the extra configuration items to handle them. Then, in the above example, I added some… er… ugly if statements to manage the extra values.
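As a rough illustration of what those extra values amount to, the limit can be chosen from the kind of Leaf element that is currently active before the length comparison runs. The function name, the leaf-kind strings, the parameter names, and the default of 80 are all hypothetical; this is not the code from the rule itself.

```python
def select_line_length(
    leaf_kind: str,
    line_length: int = 80,
    code_block_line_length: int = 80,
    heading_line_length: int = 80,
) -> int:
    """Pick the limit that applies to the currently active Leaf element."""
    if leaf_kind == "code_block":
        return code_block_line_length
    if leaf_kind == "heading":
        return heading_line_length
    return line_length
```

The result would take the place of `compare_length` in the comparison, so the rest of the triggering logic stays unchanged.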

After double checking all 43 scenario tests, those changes were committed, and it was on to the next rule.

Rule Md009 - No Trailing Spaces

Lately, I have noted that I have a certain amount of luck that is showing up in my work for this project. It was therefore interesting to see that this trend was showing up again.

Design

Reading this rule, it felt like I was reading an echo of the previously implemented rule. Through serendipity, I had a second rule that would benefit from my recent two-algorithm scan approach. It was not any planning on my part; it was sheer luck.

There were a couple of new twists, but nothing that was a substantial change to the supporting algorithm. Based on reading the description for the rule, there were configuration values that needed to know of any Hard Line Break tokens (inline tokens) and if the existing line was within a List token (container token). Other than that complication, the rule seemed simpler than the previous rule. Instead of a complicated triggering scenario, this rule had a very straightforward one: if the number of spaces at the end was not 0 or the configured amount, trigger.
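That triggering condition fits in a few lines. The sketch below covers only the basic check and ignores the strict and list-item settings; the function names and the default of two break spaces are assumptions for illustration.

```python
def count_trailing_spaces(line: str) -> int:
    """Number of space characters at the very end of the line."""
    return len(line) - len(line.rstrip(" "))


def has_bad_trailing_spaces(line: str, break_spaces: int = 2) -> bool:
    """True when the line ends in spaces, but not in exactly the number
    reserved for a hard line break."""
    return count_trailing_spaces(line) not in (0, break_spaces)
```

With the assumed default, `"text "` and `"text   "` would trigger, while `"text"` and `"text  "` would not.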

I took my time to work through any issues with the design, but the adaptation of the previous rule saved me a lot of time. While it was luck that I encountered both rules together, it sure helped my design for this rule to have the design for the past rule still in my head.

Tests and Implementation

Getting the tests ready for this rule was easy. After creating 12 files for test cases and the 21 scenario tests to cover them, I was confident that I had everything covered. As the 21 tests for this rule were less than half of the tests for the previous rule, I was a bit nervous about proceeding.

But after double checking the tests, I honestly could not find anything that I had missed. Every extra test case that I produced was reduced into one of the existing test cases. After about a half hour of producing extra cases that matched that pattern, I was in a more confident place about the tests, and I decided to continue.

Second Algorithm, Take Two

Since this rule was mostly a clone of the previous rule, I knew that I had a solid base to work with. So instead of the two lines required for the previous rule’s next_token function, this one had a bit more content:

if token.is_block_quote_start or token.is_list_start:
    self.__container_token_stack.append(token)
elif token.is_block_quote_end or token.is_list_end:
    del self.__container_token_stack[-1]
elif token.is_blank_line or token.is_leaf or token.is_inline_hard_break:
    self.__leaf_tokens.append(token)
    if self.__container_token_stack:
        self.__leaf_owner_tokens.append(self.__container_token_stack[-1])
    else:
        self.__leaf_owner_tokens.append(None)

Taking care of the extra requirement for tracking any Hard Break tokens just required adding or token.is_inline_hard_break to the one conditional. Adding the proper support for the container tokens did take a bit more work. Keeping track of the “current” container token was accomplished with the first four lines of the new function, but that was not enough. As there is specific configuration for a Blank Line within a List Element, the owner of any added Leaf token needed to be stored in the sibling list __leaf_owner_tokens.
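The owner bookkeeping can be tried in isolation with a simplified token stream. In this sketch the `Token` class and the string kinds are stand-ins for the real token types and their `is_...` properties; only the stack-and-sibling-list technique is the point.

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str  # e.g. "list_start", "list_end", "paragraph", "blank_line"


CONTAINER_STARTS = {"block_quote_start", "list_start"}
CONTAINER_ENDS = {"block_quote_end", "list_end"}


def collect_leaves_with_owners(tokens):
    """Return (leaves, owners), where owners[i] is the innermost container
    that was open when leaves[i] was seen, or None at the top level."""
    container_stack, leaves, owners = [], [], []
    for token in tokens:
        if token.kind in CONTAINER_STARTS:
            container_stack.append(token)
        elif token.kind in CONTAINER_ENDS:
            container_stack.pop()
        else:
            # Simplification: every non-container token counts as a leaf.
            leaves.append(token)
            owners.append(container_stack[-1] if container_stack else None)
    return leaves, owners
```

Because the two lists grow in lockstep, the same index that selects the active Leaf token during the line scan also selects its owner.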

    if (
        self.__leaf_token_index + 1 < len(self.__leaf_tokens)
        and self.__line_index
        == self.__leaf_tokens[self.__leaf_token_index + 1].line_number
        and self.__leaf_tokens[self.__leaf_token_index + 1].is_leaf
    ):
        self.__leaf_token_index = self.__inline_token_index + 1
        self.__inline_token_index = self.__leaf_token_index

    if (
        self.__inline_token_index + 1 < len(self.__leaf_tokens)
        and self.__line_index
        == self.__leaf_tokens[self.__inline_token_index + 1].line_number
    ):
        self.__inline_token_index += 1

    if (
        not self.__leaf_tokens[self.__leaf_token_index].is_code_block
        and line
        and line[-1] == " "
    ):
        ...

With all the foundations set, I added debug and made some modifications to the tracking code in the next_line function. Tracking the Leaf elements remained mostly the same, but the inclusion of the Hard Break token added a slight bit of complexity.

    if (
        not self.__leaf_tokens[self.__leaf_token_index].is_code_block
        and line
        and line[-1] == " "
    ):
        (
            first_non_whitespace_index,
            extracted_whitespace,
        ) = ParserHelper.extract_whitespace_from_end(line)
        extracted_whitespace_length = len(extracted_whitespace)

        is_list_empty_line = (
            self.__list_item_empty_lines_mode
            and self.__leaf_owner_tokens[self.__leaf_token_index]
            and self.__leaf_owner_tokens[self.__leaf_token_index].is_list_start
            and first_non_whitespace_index == 0
        )

        if extracted_whitespace_length != self.__break_spaces or (
            self.__strict_mode and not is_list_empty_line
        ):
            self.report_next_line_error(
                context,
                first_non_whitespace_index + 1,
                extra_error_information=extra_error_information,)

Finally, with the second algorithm modified, it was time to make the required changes to the primary algorithm. Instead of checking the length of the line, the new conditional simply checked to see if the last character on the line is a space character. Once that character is detected, except for some code to prevent it from triggering on Blank Lines within a List element, the rest of the code is all about triggering the rule.

As always, I threw extra data and scenario tests against the code, and it came up mostly clean. I had missed a case with a completely empty line, but I quickly fixed that. After that, all tests were showing as passing, and code coverage was at 100 percent. I was good to go for the next rule. But would my luck continue?

The answer was yes!

Rule Md011 - Reversed Link Syntax

Design

Looking at this rule, I was floored by my luck. By some miracle, all three of the final three rules needed the same algorithms to provide a solid solution for them. And… yes. With this being a simpler version of both of the last two rules, I did skip over the design for this rule.

I was not 100% happy with that decision as I wanted to stick by my design and implementation rules. But I also like to be efficient. And as hard as I tried, I could not find a good enough reason for spending the time on the design of this rule that I knew was going to be a simpler clone of the previous two rules.

Tests and Implementation

Because this rule was simpler, the number of test cases and the number of scenario tests were drastically reduced. It also helped that there were no configuration values, reducing the complexity of the design even further.

With everything else being cloned code from the first rule, the only interesting part is the set of changes I made to detect the reversed link syntax:

# Requires `import re` at the top of the module.
def __init__(self):
    super().__init__()
    self.__reverse_link_syntax = re.compile(r"\(.*\)\[\s*[^\^].*\s*]")
    ...

A regular expression was easily the right way to go to detect the reversed link syntax. Once I had that decided, working through what that expression would look like only took about ten minutes. The hard part for me was that I wanted to peek at the original rule and see how they matched the line and borrow some of their code.

But as I wanted to figure it out for myself, that was out of the question. So, I just lined up test data and an online regular expression testing tool, and quickly narrowed down to that expression. It is not a fantastically complicated expression, but it does the job!

if (
    not self.__leaf_tokens[self.__leaf_token_index].is_code_block
    and not self.__leaf_tokens[self.__leaf_token_index].is_html_block
    and line
    and "(" in line
    and "[" in line
):
    regex_search = self.__reverse_link_syntax.search(line)
    if regex_search:
        regex_span = regex_search.span()
        extra_error_information = line[regex_span[0] : regex_span[1]]
        self.report_next_line_error(
            context,
            regex_span[0] + 1,
            extra_error_information=extra_error_information,)

With the regular expression decided on, the rest of the code was trivial to write. The only interesting part was making sure that I used the right regular expression function, one that would return where the match occurred within the line. But after a quick bit of searching on my favorite Python websites, I got that answer and wired everything up.
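The behavior of that expression is easy to check outside of the rule. The sketch below reuses the pattern exactly as shown above with the standard `re.Pattern.search` and `Match.span` calls; the function name `find_reversed_link` and its return shape are illustrative assumptions.

```python
import re

# Same pattern as in the rule: "(...)" immediately followed by "[...]",
# where the bracketed part does not start with "^".
REVERSED_LINK = re.compile(r"\(.*\)\[\s*[^\^].*\s*]")


def find_reversed_link(line: str):
    """Return (1-based column, matched text) for the first reversed link
    on the line, or None if the line has none."""
    match = REVERSED_LINK.search(line)
    if not match:
        return None
    start, end = match.span()
    return start + 1, line[start:end]
```

A correctly written link such as `[text](https://example.com)` produces no match, and neither does `(text)[^1]`, since the character class rejects a `^` right after the opening bracket.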

Ensuring Rule Details and Rule Documentation Are Consistent

With ten rules developed before June and thirty-one rules developed after June, I was concerned that the details of each rule and the documentation for each rule were not consistent with each other. Given that concern, I wanted to spend at least a good five or six hours going through the rule modules and the rule documentation, making sure that everything looked correct.

It was a bit of a slog, but it was worth it. I took the opportunity to run the document through a word processor and used that processor to look for spelling mistakes and grammar mistakes. I also pulled information out of the Reasoning section to include a sub-heading that identifies the primary reason for each rule. From a readability point of view, it just helped make that section crisper by clearly stating what the reasoning for the rule was up front.

Even though I am usually thorough, there are trivial things that even I mess up on. For me, the best way to look for those is to go through a set of objects that I am trying to standardize on and see what stands out as being different. Especially when I know that I often copied the entire document from one rule and used it as a template for another rule, I wanted to look for those differences that I missed.

What I found while doing this is as follows:

  • There was 1 case where I forgot to change the title to match the new rule.
  • There were 2 cases where I used {space} in examples without adding some text before the example to explain that it was just standing in for space characters.
  • There were 2 cases where a plugin was enabled by default, but the documentation for that rule said it was disabled by default.
  • There were a small handful of cases where I still used “as such:” after I decided that it did not sound right.
  • There were a small handful of cases where I had forgotten to remove quotation marks from within string configuration values. (i.e. transforming "consistent" to consistent)
  • There were 20 cases where the rule modules did not have a module documentation header or class documentation header that matched the rule.
  • There were 22 cases where the rule details function did not have the plugin_url and plugin_configuration values set properly.

Knowing that the documentation was worded better and that it was consistent between the rules was a relief. I know that some people might consider this being picky, and I guess they are right. For me, it is just making sure that similar concepts in the same project have a solid attempt at a singular theme running through them. I do hope I have achieved that.

What Was My Experience So Far?

I have been working on this project for two years now, and it felt good to have all the rules working. There were some issues that I needed to track down, but I was confident that those issues would only impact documents that were off the beaten track. Or it may be more proper to say that I hoped that was the case.

Not including my week off for work and home related issues, I was able to shave two weeks off my ten-week estimate to get the rules completed. I was able to keep up steady progress, but each rule just took a decent chunk of time to complete. It was not that I did not try hard enough; there was just not enough time in the day. While each design portion was resolved within hours, the testing and documentation took serious time to get right. It just took what it took.

And now that I am so close to a beta release, I want to double check and triple check things before I do this big release. Lots of small tasks before next week!

What is Next?

Having completed all the rules, I knew I needed to clean up a small handful of things before a beta release. It would not take more than a couple of days to complete, but it would be time well invested. Stay tuned!


