Markdown Linter - Road To Initial Release - Dotting The Is

Summary¶

In my last article, I talked about the work I did to fine-tune the plugins and to start documenting them for the release. With the plugin rules fine-tuned and notes for them in hand, I proceeded to resolve some final issues that were in the way of a clean release.

Introduction¶

The list is definitely getting shorter. My goal for this stretch was to get every item in the issues list resolved except for the documentation task. I wanted to get a good start on that task, but I knew that it would take more than a couple of days to get the documentation into a state that I would feel good about. After all, if it took me almost 18 months to get the code to the initial release state, I was not going to get the documentation done overnight. However, I could pave the way for me to focus solely on that documentation for the next week!

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 09 May 2021 and 16 May 2021.

Open-Source Software Projects And Your Life: The Update¶

I am going to share a bit of a secret with you. As someone who has Autism, I generally get between two and four headaches a week. Sometimes they are from too many people and sometimes they are from old injuries done in my youth. As such, I rarely have a week with fewer than two headaches. While working on the project in the last three weeks, I have had one long continuous three-week headache, with only brief episodes of peace during that time.

That is, until Sunday, which was yesterday. I am not sure if it was me taking care of myself or if the headache had just run its course. I do not care. I woke up yesterday with no headache and went to bed yesterday with no headache, with no headache between those two points. That was enough for me.

And it showed on Sunday when I did some work on the PyMarkdown project followed by picking up the notes for this article and working on it. For both tasks, I was able to figure things out more clearly and more easily, allowing me to be more concise in what I wanted to accomplish. After three weeks of pain and fog, it was a blessing.

I would be lying if I did not admit that sometimes it was a bit of an uphill battle for me emotionally. I know that I am very close to the project’s initial release and I just wanted to do “five more minutes” of work to get me that much closer to that goal line. However, in my experience, rarely have I ever heard “five more minutes” and had it mean exactly that. Usually it means “until I get this done”. And if I am being honest, I know for me that would have meant that five minutes would have blown up to at least 30 minutes, if not a couple of hours.

It was not always easy. But making sure that I took the time to deal with life and make sure that I was working on the project for the right reasons and in the right mindset was the right thing for me to do. Yes, I know I am behind in my plans for the initial release. But I also know that I can finish the remaining tasks off with a positive attitude and momentum.

And for me, doing things (my version of) right is one of the reasons that I started working on my own projects.

Atx Headings, Part 2¶

Having forgotten to run the Black code formatter before the last couple of commits, I executed the formatter on the code base with roughly the expected number of reformats present in the code. Doing some more digging into the work I just completed on Atx Headings, I found a couple of issues.

When I fixed the Atx Heading parsing to not include tab characters as part of the allowed heading whitespace after the starting hash characters, I had neglected to make the same change for the ending hash characters. As such, when I added four new scenario tests to verify that only starting and ending space characters were allowed, the two tests covering the ending whitespace both failed.

It did not take me long to single out the responsible code. Having recently been in that part of the code, I knew that there was a good chance that the error was in the recent work to deal with tab characters and space characters in Atx Heading tokens. With that knowledge, it was less than five minutes before I was able to change this code to handle the end of an Atx Closed Heading whitespace collection:

(
    end_index,
    extracted_whitespace_before_end,
) = ParserHelper.extract_whitespace_from_end(remaining_line)

into the end of an Atx Closed Heading space collection:

non_whitespace_index = (
    ParserHelper.collect_backwards_while_character(
        remaining_line, len(remaining_line) - 1, " "
    )
)
end_index = non_whitespace_index[1]
extracted_whitespace_before_end = remaining_line[end_index:]

With that fix in place, all four scenario tests were passing. I did double check my scenario tests to see why I had missed this, with little insight gained. My guess is that I was just eager to get things done and missed one part of the solution. I did find it before it was released, so I was not too upset. But I knew I needed to make sure I tempered my “need for speed” against my need to do things properly.
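To make the difference between the two snippets concrete, here is a minimal sketch of collecting only trailing space characters instead of all trailing whitespace. The `split_trailing_spaces` function is a hypothetical stand-in written for this illustration; it is not the project's `ParserHelper` code and makes no claim about that helper's actual return values.

```python
from typing import Tuple


def split_trailing_spaces(line: str) -> Tuple[int, str]:
    """Return (end_index, trailing_spaces), counting only space characters."""
    end_index = len(line)
    # Walk backwards while the previous character is a space; a tab stops the walk.
    while end_index > 0 and line[end_index - 1] == " ":
        end_index -= 1
    return end_index, line[end_index:]


# Two trailing spaces are both collected.
print(split_trailing_spaces("heading  "))   # (7, '  ')
# The tab is left as part of the heading text; only the final space is collected.
print(split_trailing_spaces("heading\t "))  # (8, ' ')
```

The point of the sketch is the stopping condition: a general whitespace check would have consumed the tab as well, which is the behavior the scenario tests were written to catch.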

And it was with that that I knew I was now on to the big issue that I had been thinking about in the back of my mind for weeks: pragmas.

Pragmas¶

With a long history in the software development industry, pragmas are not something that developers regularly see these days. Historically, pragmas provide additional information to a specific compiler during compile time, giving the compiler extra context to consider. While more famous in the C and C++ languages, the concept of pragmas has often been borrowed by linters and analyzers to allow the source document to provide additional information to guide how that document is to be analyzed. By default, if the process does not understand a given pragma, it is ignored. I believe this is one of the reasons that most pragmas look like inline comments or something similar.

One of the benchmarks I have for the PyMarkdown project is how well and how quickly it analyzes the directories containing the Markdown version of the articles that make up my blog. Up until I added support for Front-Matter tokens, it was rather messy. However, with that being implemented, a lot of the failures due to Front-Matter Markdown being misinterpreted were gone. I was now able to make some serious progress on dealing with those rule violations that were showing up when I scanned my blog.

Out of the 27 errors that were being reported by scanning the directory, only 12 of those were what I would consider real errors. While the other 15 errors were being properly reported, except for 2 instances of a parser error that I uncovered, the remaining 13 were “allowable” failures. These failures were allowable because I wanted to make the conscious choice to ignore the project’s scan results for those 13 separate instances. The only problem was that I did not have a way to mark those errors as being “allowable”. That is where pragmas came into play.

Starting With A Good Design¶

While there are times that I feel it is okay to jump into writing code, those times are few and far between. Even if the design is some scribbles on paper or some notes in a readme.md file, I have always found benefit in taking the time to work through the base scenarios involved with a given project. After all, the least costly software change to make is one that has not been coded yet.

This design was no different. Starting with scenario tests, I created a set of new scenario tests that started to fill out my design. Like other linters, to keep the readability high, I decided on a simple pragma indicator and format:

<!-- pyml disable-next-line no-multiple-space-atx-->

The entire pragma would be wrapped in a comment. That way, if there was something only slightly wrong with the pragma, the parser would consider it an HTML comment section, and treat it accordingly. Within the pragma, the {space}pyml{space} sequence would kick off the pragma, something that was likely not to be repeated elsewhere.

At that point, all that was left was the command structure. Looking at various other linters, they all seemed to have 8 or more commands for specifying regions, saving state, restoring state, and all that. While they probably have reasons for all those commands, I wanted to keep things simple. From my experience, I never use those other commands because I only want to suppress the failure on the next line. I am not sure if everyone will feel the same way, but it was a good place to start.

With the comment line and command structure done, it was on to more practical issues. It made sense to me to search and store pragmas in a manner independent of the tokenization. For me, pragmas and tokens are different concepts with different responsibilities. Tokens explain what the document is while pragmas provide instructions to the rules engine on what rules to apply (or not apply) and when. Given those restrictions, it made sense to add code to the Container Processor’s parse_line_for_container_blocks function. By adding the code there, the decision on whether to interpret a line normally or as a pragma line could be made without affecting anything else in the parser.

With a collection of pragmas, I now needed someplace to store them. Creating a new PragmaToken class allowed me to add the pragma information at the end of the token stream for storage, but also allowed the token to be easily extracted. Then, in the PluginManager class, I could add code to take the data within the PragmaToken and parse it for correctness. Assuming a line was a correctly formatted pragma, I could then put it into a simple format that could be referenced by the log_scan_failure function to determine if a failure should be logged or not.

Working things through on paper and with sample code, I was quickly able to tighten that design down some more. But in the end, the design was pretty well thought out, with some good tests written before one line of feature code was written.

Implementation - Detection¶

After all that focus on the design of the feature, there really were not many surprises that I encountered. A new __look_for_pragmas function was added to the ContainerProcessor class, only activating if an HTML comment was found with no whitespace before it and without being contained within a Block Quote element or a List element.

From there, some simple code determined whether the line looked like a pragma, without yet verifying that the pragma was valid:

was_extended_prefix = line_to_parse.startswith(
    PragmaToken.pragma_alternate_prefix
)

start_index, _ = ParserHelper.extract_whitespace(
    line_to_parse,
    len(
        PragmaToken.pragma_alternate_prefix
        if was_extended_prefix
        else PragmaToken.pragma_prefix
    ),
)
remaining_line = line_to_parse[start_index:].rstrip().lower()
if remaining_line.startswith(
    PragmaToken.pragma_title
) and remaining_line.endswith(PragmaToken.pragma_suffix):
    index_number = (
        -position_marker.line_number
        if was_extended_prefix
        else position_marker.line_number
    )
    pragma_lines[index_number] = line_to_parse
    return True

One of the small surprises that I came across during my design phase was that some of the parsers have plugins that support a concept of a “hidden” comment. These are HTML-style comments that begin with a sequence of <!--- instead of <!--. While these constructs are still valid comments, they are easy to spot and therefore easy to remove when parsing the document. By supporting both a normal comment and this hidden comment for pragmas, I could enable people to write pragmas that would not show up in the rendered HTML output of any parser that supports them. It was a small change, but a good one.

Other than that, the rest of the code simply looks for something that looks like a comment that starts with some whitespace, the text pyml, and some more whitespace. If it finds it, it adds it to the pragma_lines dictionary using the line number as the index. Because the function also returns True in those cases, the ContainerProcessor can then discard that entire line without affecting the token stream. When all the tokens have been collected, if the pragma_lines dictionary contains any elements, a PragmaToken is created with that dictionary and added to the token stream.
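For readers who want to experiment with the detection idea outside of the project, the following is a minimal, self-contained sketch. The constant values and the `look_for_pragma` function are assumptions made for illustration; the real values of the `PragmaToken` members are not shown in this article, and the project's own code relies on `ParserHelper` rather than plain string methods.

```python
from typing import Dict

# Assumed values, chosen to match the comment formats described in the article.
PRAGMA_PREFIX = "<!--"
PRAGMA_ALTERNATE_PREFIX = "<!---"
PRAGMA_TITLE = "pyml "
PRAGMA_SUFFIX = "-->"


def look_for_pragma(line: str, line_number: int, pragma_lines: Dict[int, str]) -> bool:
    """Record the line and return True if it looks like a pragma comment."""
    # Check the longer "hidden" prefix first, as it also starts with "<!--".
    was_extended_prefix = line.startswith(PRAGMA_ALTERNATE_PREFIX)
    prefix = PRAGMA_ALTERNATE_PREFIX if was_extended_prefix else PRAGMA_PREFIX
    if not line.startswith(prefix):
        return False
    remaining = line[len(prefix):].strip().lower()
    if remaining.startswith(PRAGMA_TITLE) and remaining.endswith(PRAGMA_SUFFIX):
        # Hidden-comment pragmas are stored under a negative line number.
        key = -line_number if was_extended_prefix else line_number
        pragma_lines[key] = line
        return True
    return False


found: Dict[int, str] = {}
look_for_pragma("<!-- pyml disable-next-line no-multiple-space-atx-->", 3, found)
look_for_pragma("<!--- pyml disable-next-line md022-->", 7, found)
look_for_pragma("<!-- just a comment -->", 9, found)
print(sorted(found))  # [-7, 3]
```

Note that this step only decides whether a line should be treated as a pragma. Whether the command inside it is valid is left for the compilation step.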

Implementation - Compilation¶

The compilation step was also relatively straightforward. In the compile_pragmas function, the keys for the pragma_lines property of the PragmaToken instance are used to get an ordered list of the line numbers for each pragma in the document. Compilation of the pragmas was achieved by simply iterating through each pragma in order. For each pragma, the function uses code like the detection code above, with the remaining information in the pragma becoming the command and any relevant command data.

Verifying the bulk of the commands was easy. No command? Report a pragma failure. Command not understood? Report a pragma failure. Missing pragma command data? Report a pragma failure. If anything did not look exactly right, just report an error. Simple.

That left the function to handle the disable-next-line command, which looked like this:

ids_to_disable = command_data[after_command_index:].split(",")
processed_ids = set()
for next_id in ids_to_disable:
    next_id = next_id.strip().lower()
    if not next_id:
        self.log_pragma_failure(
            scan_file,
            actual_line_number,
            f"Inline configuration command '{command}' specified a plugin with a blank id.",
        )
    elif next_id in self.__all_ids:
        normalized_id = self.__all_ids[next_id].plugin_id
        processed_ids.add(normalized_id)
    else:
        self.log_pragma_failure(
            scan_file,
            actual_line_number,
            f"Inline configuration command '{command}' unable to find a plugin with the id '{next_id}'.",
        )

if processed_ids:
    self.__document_pragmas[actual_line_number + 1] = processed_ids

The small change from the design here was that I wanted to be able to collapse the rule ids and the rule aliases down into the rule ids. To that end, I changed local references of all_ids in the rule registration code to self.__all_ids so that the pragma compiler could make use of them.

Other than that, the command expects a comma-separated list of ids and aliases to follow the disable-next-line command. If there is an empty id, it reports a pragma failure. If not, it looks for it in the self.__all_ids dictionary, which is a map from any valid identifier (id or alias) to the plugin object. If it is found in that dictionary, the normalized_id variable is set and added to the set of ids to disable. If not found, it reports a pragma failure.
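The collapsing of ids and aliases can be tried out in isolation with a sketch like the one below. Everything in it is an assumption for illustration: the `ALL_IDS` table is a hand-written stand-in that maps straight to an id string instead of a plugin object, and failures are collected in a list rather than being sent to `log_pragma_failure`.

```python
from typing import Dict, List, Set, Tuple

# Illustrative lookup table: every id and alias maps to one canonical rule id.
ALL_IDS: Dict[str, str] = {
    "md019": "md019",
    "no-multiple-space-atx": "md019",
    "md022": "md022",
    "blanks-around-headings": "md022",
}


def compile_disable_next_line(
    command_data: str, line_number: int
) -> Tuple[Dict[int, Set[str]], List[str]]:
    """Compile one disable-next-line pragma into (line -> rule ids, failures)."""
    document_pragmas: Dict[int, Set[str]] = {}
    failures: List[str] = []
    processed_ids: Set[str] = set()
    for next_id in command_data.split(","):
        next_id = next_id.strip().lower()
        if not next_id:
            failures.append(f"line {line_number}: blank id")
        elif next_id in ALL_IDS:
            # Ids and aliases both collapse to the canonical id.
            processed_ids.add(ALL_IDS[next_id])
        else:
            failures.append(f"line {line_number}: unknown id '{next_id}'")
    if processed_ids:
        # The command applies to the line after the pragma.
        document_pragmas[line_number + 1] = processed_ids
    return document_pragmas, failures


pragmas, problems = compile_disable_next_line("MD022, no-multiple-space-atx, bogus", 4)
print(sorted(pragmas[5]))  # ['md019', 'md022']
print(problems)            # ["line 4: unknown id 'bogus'"]
```

One design consequence worth noticing is that a pragma with one bad id still suppresses the ids that were understood, while the bad id is reported separately.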

Implementation - Suppressing the Failure¶

After the rest of the work on this feature, the suppression part was very simple. The last two lines of the last example are:

if processed_ids:
    self.__document_pragmas[actual_line_number + 1] = processed_ids

To keep things simple, the only existing command suppresses one or more rules being triggered on the following line. Therefore, one is added to the line number of the pragma, and the set processed_ids is stored in the dictionary self.__document_pragmas.

Then, the log_scan_failure function was altered to add this code at the beginning:

if self.__document_pragmas and line_number in self.__document_pragmas:
    id_set = self.__document_pragmas[line_number]
    rule_id = rule_id.lower()
    if rule_id in id_set:
        return

Basically, if there are compiled pragmas and the line number of the log scan failure is in the dictionary of pragmas, further processing is required. The rule_id of the failure that is being reported is looked for in the id_set of ids, and if there is a match, then the reporting of that failure is aborted.
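As a rough illustration of that lookup, the sketch below filters a list of failures against a dictionary of compiled pragmas. The `filter_failures` function and the `(line number, rule id)` tuples are hypothetical simplifications; the project performs this check inside `log_scan_failure` at the time each failure is reported.

```python
from typing import Dict, List, Set, Tuple

Failure = Tuple[int, str]  # (line number, rule id)


def filter_failures(
    failures: List[Failure], document_pragmas: Dict[int, Set[str]]
) -> List[Failure]:
    """Return only the failures that no pragma suppresses."""
    reported = []
    for line_number, rule_id in failures:
        suppressed_ids = document_pragmas.get(line_number, set())
        if rule_id.lower() in suppressed_ids:
            continue  # a pragma on the previous line disabled this rule
        reported.append((line_number, rule_id))
    return reported


# A pragma on line 4 was compiled into an entry for line 5.
pragmas = {5: {"md022"}}
print(filter_failures([(5, "MD022"), (5, "MD019"), (9, "MD022")], pragmas))
# [(5, 'MD019'), (9, 'MD022')]
```

Only the matching rule on the matching line is dropped. Other rules on the same line, and the same rule on other lines, are still reported.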

Wrap Up¶

To be honest, I was underwhelmed when I got to the final part of the implementation. After all that work, it was less than 10 lines that it took to implement the part where the log scan failure was suppressed. The hard work was making sure that I had good scenario tests from the beginning.

And to spread some good word on Test Driven Development, let me also supply some extra information on why I like scenario tests so much. While it is nice to have a solid design, I feel that those scenario tests help me pull a design from the abstract world into the concrete world. Everyone can argue about what was really meant when I say “data store”, but if I actually write a scenario test that sets up a data store, it makes the entire concept and usage of the data store more real. For me, that helps me ground myself and my designs.

After all that work, I then was able to go through the remaining failures and determine if any of those failures were “allowable”. By allowable, I mean that I acknowledge the failure and allow it to be suppressed because of a conscious decision. By using pragmas, I can then suppress those failures in a simple and discoverable manner. And by also supporting the <!--- sequence, I can also make it so the pragma is hidden. All good work!

Blank Lines and Lists¶

I am not sure if I have talked about this small issue before, but the parser has long had a slight glitch in how the tokens are recorded for a Blank Line element that ends a List element. When the Blank Line element occurs, a Blank Line token is recorded, and it triggers the code that closes any existing List elements or Block Quote elements.

While the tokens come out in the correct order for Block Quote tokens, the Blank Line token always comes out before the end List token. When testing what the HTML output would be for this scenario, everything is fine because a Blank Line token is ignored by the HTML generator. Even the end List token element is ignored, except for resetting the list state. As the only issue is the ordering of the tokens, and not their output, this is a small glitch.

Because it is a small glitch with an easy workaround, I decided long ago to keep things the way they were until something else forced me to fix this issue. That something finally showed up. While it only caused me to put this issue at the top of the issues list, it was enough to cross the line that I had drawn. The reason? Rule md022.

Discovery¶

One of my standard “smoke” tests is to use the project to scan the directories containing the Markdown for my own blog’s articles. As the plugin rules got cleaned up, especially with the addition of pragmas, the output for the scan got progressively cleaner. With pragmas in place, I was able to suppress all the reported errors that were legitimate, but that I had decided to ignore. That left only two “real” errors. Those two errors occurred in two Markdown documents with sections that looked like this:

- grouping of values? Extra code required.
- hierarchy and nesting? Extra code required.

## Configuration Type 2: Grouped

The next step up from the simple type is a grouped type.  While it is a step up from the

The error that was being reported was a triggered rule md022, complaining that there was not a single blank line above the Atx Heading element. Looking at the Markdown itself, it was clear there was a blank line before the Atx Heading. At that point I looked at the tokens being output for the Markdown document and it was then that I noticed the reversed tokens described above. It was a lightbulb moment.

Resolution¶

As I mentioned above, the first thing I did was to add a new item to the issues list. I did tell myself that one more issue related to this case of improperly ordered tokens was enough to make me fix it, and I wanted to keep my word.

In the meantime, I needed something to address the issue until after the initial release. Looking at the code, one small change seemed obvious but dangerous. So, I added the line:

if not token.is_list_end:

before the line:

self.__blank_line_count = (
    0
    if self.__is_leaf_end_token(token)
    or self.__is_container_end_token(token)
    else None
)

I felt it was dangerous because it was easy. Hopefully not sounding paranoid, it was too easy. With that one change, any ending of a list would not reset the rule’s __blank_line_count member variable.
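To see why such a small guard changes the outcome, here is a deliberately simplified model of the counting logic. The string token names and the `blank_lines_before_heading` function are invented for this sketch and only imitate the shape of the rule; the real rule works on the project's token objects and tracks more state than this.

```python
from typing import List, Optional


def blank_lines_before_heading(tokens: List[str], ignore_list_end: bool) -> Optional[int]:
    """Return the blank line count seen when the first 'atx' token arrives."""
    blank_line_count: Optional[int] = None
    for token in tokens:
        if token == "blank":
            if blank_line_count is not None:
                blank_line_count += 1
        elif token == "atx":
            return blank_line_count
        elif token.startswith("end-"):
            if ignore_list_end and token == "end-list":
                continue  # the workaround: a list end does not reset the count
            blank_line_count = 0
        else:
            blank_line_count = None
    return None


# The Blank Line token is emitted before the end List token, as described above.
stream = ["list", "para", "end-para", "blank", "end-list", "atx"]
print(blank_lines_before_heading(stream, ignore_list_end=False))  # 0
print(blank_lines_before_heading(stream, ignore_list_end=True))   # 1
```

In this model, the out-of-order end List token wipes out the blank line that was just counted, so the heading appears to have no blank line above it. Skipping the reset for list ends preserves the count, which is also why the change feels risky: it papers over the token ordering rather than correcting it.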

To this date, I still think that the fix was too easy, and sometimes play around with scenarios that I think can fail it. Thinking more deeply on this issue, I believe that my sense of the tokens not being correct has crossed over into my sense that the fix is not correct. Furthermore, I believe that the feeling is still present because I know that the correct way to fix this is to fix the tokens, and not to temporarily fix it.

But whether or not I feel 100% comfortable with the decision to delay the fixing of the tokens until after the initial release, I believe that was the right decision to make. And while I do not feel comfortable with the temporary fix, I have to find peace in the fact that I know it is going to be temporary.

Strict Mode Cleanup¶

After implementing the strict configuration mode previously, there were still some loose ends to clean up with its usage.

Reading The Value¶

The biggest “loose end” that I needed to deal with was around initializing this mode from both the command line and configuration. To get things working, I had placed the following code at the end of the __set_initial_state function:

if args.strict_configuration:
    self.__properties.enable_strict_mode()

While that code was correct, there were a few things that were not quite right with it. To deal with that properly, I pulled that code out into its own function, __initialize_strict_mode, and added some extra code before it:

def __initialize_strict_mode(self, args):
    effective_strict_configuration = args.strict_configuration
    if not effective_strict_configuration:
        effective_strict_configuration = self.__properties.get_boolean_property(
            "mode.strict-config"
        )

    if effective_strict_configuration:
        self.__properties.enable_strict_mode()

With this in place, it was now following the command-config-default pattern that I want for all project configurations. Only if the command line flag is set will the value of args.strict_configuration be True, so there is no need for fancy comparisons against None there. If it is not set, then the code tries to look for a value in the configuration, before adopting the default of False. While the False is not explicitly specified, it is the default of the get_boolean_property function.
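The command-config-default ordering can be sketched on its own as follows. The flat dictionary and both functions are assumptions made for this example; in particular, this stand-in `get_boolean_property` silently falls back to the default on a wrong type, which may or may not match how the project's non-strict lookup behaves.

```python
from typing import Any, Dict


def get_boolean_property(config: Dict[str, Any], name: str, default: bool = False) -> bool:
    """Stand-in lookup: return the configured boolean, or the default otherwise."""
    value = config.get(name, default)
    return value if isinstance(value, bool) else default


def resolve_strict_mode(command_line_flag: bool, config: Dict[str, Any]) -> bool:
    """Apply the command line first, then the configuration, then the default."""
    if command_line_flag:
        return True
    return get_boolean_property(config, "mode.strict-config")


print(resolve_strict_mode(True, {}))                             # True
print(resolve_strict_mode(False, {"mode.strict-config": True}))  # True
print(resolve_strict_mode(False, {}))                            # False
```

Because the command line flag can only turn the mode on, a configuration value of False cannot override a flag that was set, which is the precedence the pattern is meant to guarantee.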

Where To Place The Call?¶

With the functionality localized to a function, the next thing that I needed to focus on was the placement of the function call to the new __initialize_strict_mode function. It did not feel right in the __set_initial_state function, but I was indecisive on where to place it. Working through some scenarios by the time-tested method of trial by scribbles, I finally landed on placing the call to __initialize_strict_mode right before the call to __initialize_logging. The way I saw it, there were safeguards and debugging around providing bad information to the __set_initial_state function, but not after that point. Putting the call after the logging call would leave the logging call unprotected, so I placed it between the two.

The only part that was left of that was to try and protect the strict configuration mode itself. It was when I went to add the strict_mode=True argument to the get_boolean_property function call that I noticed something. I noticed that I had not specified the strict_mode argument for boolean properties. It was not a difficult fix to make, but it caused me to work through my design again, just to make sure I had it right that time.

With that part of the changes done, there was just one part left.

Handling Failures¶

Initially, I used the phrase “Handling Failures Properly” for this section, but I had second thoughts about that. With things coded the way they were, the only thing that was left was to make some small changes to the scenario tests to record the failures. At that point, the error was being reported properly, but I had to take an extra step to validate it: setting the --stack-trace command line option on the test that was reporting the failure.

The example that I mainly worked with was the test_markdown_with_bad_strict_config_type scenario test. When I executed that test, the error that I got back was a stack trace with this text near the end of the error’s message:

    raise ValueError(
        ValueError: The value for property 'mode.strict-config' must be of type 'bool'.
        )

While the right error was being raised, I thought the project should handle the presentation of that error in a more readable fashion.

To do that, I added the following snippet of code:

    except ValueError as this_exception:
        formatted_error = f"Configuration Error: {this_exception}"
        self.__handle_error(formatted_error)

to the try/except/finally clause in the __initialize_parser function and in the main function. While not a substantial change, it covered the two areas capable of raising these ValueError instances. With that error covered, I was then able to print out a more readable, concise, and actionable error message:

Configuration Error: The value for property 'mode.strict-config' must be of type 'bool'.

With a couple of extra formatting changes, this task was completed.
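The shape of that change can be shown with a small, self-contained sketch. The `get_strict_boolean` and `describe_configuration` functions are hypothetical; they only mimic raising a ValueError for a badly typed value and turning it into a single readable line, whereas the project routes the formatted message through its own `__handle_error` function.

```python
from typing import Any, Dict


def get_strict_boolean(config: Dict[str, Any], name: str) -> bool:
    """Stand-in strict lookup: raise ValueError if the value is not a bool."""
    value = config.get(name, False)
    if not isinstance(value, bool):
        raise ValueError(f"The value for property '{name}' must be of type 'bool'.")
    return value


def describe_configuration(config: Dict[str, Any]) -> str:
    """Return a one-line summary, or a readable error instead of a stack trace."""
    try:
        strict = get_strict_boolean(config, "mode.strict-config")
    except ValueError as this_exception:
        return f"Configuration Error: {this_exception}"
    return f"strict mode: {strict}"


print(describe_configuration({"mode.strict-config": "yes"}))
# Configuration Error: The value for property 'mode.strict-config' must be of type 'bool'.
```

Catching only ValueError keeps the handler narrow, so unexpected exceptions still surface with their full stack trace.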

And With That¶

There was only one item in the issues list section labelled Priority 1 - Must Solve Before Initial:

- command line and configuration documentation

While I did check in a commit for the work I did during the week, I want to save talking about that work until next week. It was a task that was both very easy and very difficult to do, so I want to make sure that I talk about it with some distance between working on it and talking about it. That and it’s only mostly done, and I would prefer to finish it before talking about it.

A good reason to check in next week though, no?

What Was My Experience So Far?¶

Phew… while there are still items in the issues list, there is only one issue remaining in the priority 1 section: documentation. I am not sure if this makes sense to anyone, but that is a big relief to me. Having taken the time to document my work each week, I am not worried about writing the documentation.

Getting to this point has been a roller coaster ride, with both positive and negative hills to climb along the way, but I feel good about this. Sure, there are some little things that I need to fix here and there, but I believe the code for the project is in really good shape to be released. I was not sure it was ever going to get that way, but it is there now.

I guess part of that feeling is confidence because I am literally throwing everything against it. After the pragmas and one little fix, I can now scan every file for my website without any non-allowed failures getting in the way. If I want to, I can probably add something to the linter at a later stage to record those pragmas in a reportable fashion. But for now, if I want to find them out, I just have to do a simple search through my Markdown files for <!---, and that is easy enough.

What is Next?¶

While I had started on the documentation, I needed some time to get it into a good shape to talk about. That is what I am going to be talking about next week. Stay tuned!

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.
