Way back in November 2019, I started this project with a bare-bones framework using a simple dynamic plugin loader. It was a simple proof of concept to determine whether I could create the basis for an extensible linter framework in Python. Once I verified that I could write that framework, I implemented a very simple case to test against: checking to make sure the provided Markdown text ends with an empty line. While that rule was easy to implement, it was when I looked for another rule to implement that I determined that to properly lint Markdown text, I needed a Markdown tokenizing parser. From my viewpoint, unless I had a parser that emitted tokens that were a good representation of the Markdown to lint, the linting rules that I wanted to write would require too much guesswork for my own liking. If I wanted to write good, solid rules, I needed to have the right information available for those rules to act upon. I needed a Markdown parser that emits Markdown tokens that I was confident would be the required, correct information for the rules.
Having now written such a parser against the GitHub Flavored Markdown specification, it was time to move on to the next part of the project: writing rules. However, since almost 5 months had passed since the project started, there were a few changes that were required in the linter’s core before I could continue.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 12 April 2020 and 16 April 2020, and the commit from 18 April 2020.
In my experience, following the threefold rule for refactoring is usually a good idea, as its wisdom has borne out true in my projects many times. While not a literal copy of the threefold rule, I choose to remember the rule as follows:
Write it once, write it neat. Write it twice, think about extracting it. Write it three times, extract it without a second thought.
The original text of “three strikes and you refactor” always seemed a little too harsh for me with its negative connotation. In addition, I feel that it does not provide good enough guidance on what to do in the first two cases, just the third one. My version of the rule still adheres to the spirit of the original rule, while fixing the specific issues that I perceive with it.
Source providers for the parser are a concept that fits that refactoring pattern very
well. When the
original framework for the parser was written, it was designed to parse a line at a
time to conserve memory. In the parser tests, this information is provided to the
TokenizedMarkdown class as a single string, with the internal functions of the
TokenizedMarkdown class breaking down that data
into individual lines for further processing. With the exception of a single
feature’s error case1, this design has proven to be very
useful in reducing the complexity of the parser.
It made sense to me to refactor this section of code when considering how to
add support for the second source of Markdown data: Markdown files.
Starting with the InMemorySourceProvider¶
Now that the project was moving into the rule-development phase, it was necessary to
ensure that it was just as easy to feed the parser information from a string as it was
to feed it information from a file. As the initial development
kept things neat, it was relatively simple to take the logic for grabbing the next line
and encapsulate it within the
InMemorySourceProvider class as follows:
```python
class InMemorySourceProvider:
    """
    Class to provide for a source provider that is totally within memory.
    """

    def __init__(self, source_text):
        self.next_token = source_text.split("\n", 1)

    def get_next_line(self):
        """
        Get the next line from the source provider.
        """
        token_to_use = None
        if self.next_token:
            if len(self.next_token) == 2:
                token_to_use = self.next_token[0]
                self.next_token = self.next_token[1].split("\n", 1)
            else:
                assert self.next_token
                token_to_use = self.next_token[0]
                self.next_token = None
        return token_to_use
```
This class contains very simple logic. When the instance of the class is initialized, it starts by breaking down the input text into a tuple. The first element of the resultant tuple contains the next line to be parsed and the second element of that same tuple contains the input text to be parsed in the immediate future. Once that calculation has been performed, the rest of the processing is relatively simple. When get_next_line is called and the tuple contains 2 elements, the first element is returned as the next line, and the next_token variable (for next time) is recalculated using the same expression as was used in the __init__ function. When get_next_line is called at the end of the file, the tuple contains only 1 element. At that point, that singular element is returned as the next line to be parsed, and the next_token variable is set to None to make sure we end the processing properly. Finally, when get_next_line is called and the tuple is set to None, there is nothing left to parse and None is returned, signaling that the provider has reached the end of its available text.
To be clear, this is the exact code that was in place for the duration of the parser
testing, just repackaged to be in a more reusable form. Its interface is plain and
simple: it either returns the next line as a string, or it returns the
None object if there are no more lines. Nothing fancy as a class either,
just a simple interface: one function to create the instance and get it setup, and one
function to read the next line.
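To make that interface concrete, here is a small usage sketch. It re-declares a minimal version of the class so the snippet runs on its own; the sample input string is my own, purely for illustration.

```python
# Minimal restatement of InMemorySourceProvider, for illustration only.
class InMemorySourceProvider:
    def __init__(self, source_text):
        self.next_token = source_text.split("\n", 1)

    def get_next_line(self):
        token_to_use = None
        if self.next_token:
            if len(self.next_token) == 2:
                token_to_use = self.next_token[0]
                self.next_token = self.next_token[1].split("\n", 1)
            else:
                token_to_use = self.next_token[0]
                self.next_token = None
        return token_to_use


# Two line terminators in the input produce three lines, the last one empty.
provider = InMemorySourceProvider("first line\nsecond line\n")
lines = []
next_line = provider.get_next_line()
while next_line is not None:
    lines.append(next_line)
    next_line = provider.get_next_line()
print(lines)  # → ['first line', 'second line', '']
```

Note how the trailing newline yields a final empty line rather than being silently dropped, which matters later when the parser checks for documents that end with a blank line.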
Continuing with the FileSourceProvider¶
By keeping things simple, creating the class
FileSourceProvider was almost as simple
as the refactoring to create the
InMemorySourceProvider class. While I want to keep
options open for future performance experimentation, I just needed something simple
for reading a file from the file system. Based on those qualifications, I came up with the following:
```python
class FileSourceProvider:
    """
    Class to provide for a source provider that is on media as a file.
    """

    def __init__(self, file_to_open):
        with open(file_to_open, encoding="utf-8") as file_to_parse:
            file_as_lines = file_to_parse.readlines()

        self.read_lines = []
        self.read_index = 0
        did_line_end_in_newline = True
        for next_line in file_as_lines:
            did_line_end_in_newline = next_line.endswith("\n")
            if did_line_end_in_newline:
                next_line = next_line[0:-1]
            self.read_lines.append(next_line)
        if did_line_end_in_newline:
            self.read_lines.append("")
```
Basically, open the file, read in the lines, and process the lines into the format
that we expect. The only tricky bit with the class’s
__init__ function was handling
line terminators properly. In fact, that is the only
purpose for the
did_line_end_in_newline variable, remembering if the current line
ended with a newline character before it is removed. Based on independent unit testing
of the class, I had problems with the characters at the end of the file, which adding
that variable and the final
if statement resolved cleanly. I am not sure if I feel that the
did_line_end_in_newline variable is a kludge or not, but I do feel that it
was the right thing to do in order to maintain the fidelity of the data being read in
from the file.
Because care was taken in the provider’s __init__ function to do all the necessary work up front, the get_next_line function is very basic:
```python
def get_next_line(self):
    """
    Get the next line from the source provider.
    """
    token_to_use = None
    if self.read_index < len(self.read_lines):
        token_to_use = self.read_lines[self.read_index]
        self.read_index += 1
    return token_to_use
```
While this function could be more complicated (or simplified depending on your
viewpoint), I feel that this is a good example of keeping things basic. The provider
reads the information into an array of strings during the
__init__ function, and this
function simply uses an index to iterate through and return each element of that
array. Nothing fancy for now, just some code that
is very functional. Fancy code can always be added later.
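A quick end-to-end sketch shows the provider preserving the trailing empty line from a file. It re-declares a minimal version of the class so it runs standalone, and the temporary-file contents are my own invention for illustration.

```python
import os
import tempfile


# Minimal restatement of FileSourceProvider, for illustration only.
class FileSourceProvider:
    def __init__(self, file_to_open):
        with open(file_to_open, encoding="utf-8") as file_to_parse:
            file_as_lines = file_to_parse.readlines()
        self.read_lines = []
        self.read_index = 0
        did_line_end_in_newline = True
        for next_line in file_as_lines:
            did_line_end_in_newline = next_line.endswith("\n")
            if did_line_end_in_newline:
                next_line = next_line[0:-1]
            self.read_lines.append(next_line)
        if did_line_end_in_newline:
            self.read_lines.append("")

    def get_next_line(self):
        token_to_use = None
        if self.read_index < len(self.read_lines):
            token_to_use = self.read_lines[self.read_index]
            self.read_index += 1
        return token_to_use


# Write a two-line Markdown file that ends in a newline, then read it back.
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as handle:
    handle.write("# heading\ntext\n")
provider = FileSourceProvider(handle.name)
lines = []
next_line = provider.get_next_line()
while next_line is not None:
    lines.append(next_line)
    next_line = provider.get_next_line()
os.remove(handle.name)
print(lines)  # → ['# heading', 'text', '']
```

The final `""` entry is the fidelity that the did_line_end_in_newline bookkeeping buys: both providers report the same three lines for two line terminators.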
Testing the Source Providers¶
To make sure both providers are adhering to the interface in the same way, I
added a decent number of tests in the
test_source_providers.py file. In all the
tests, the big thing that is being tested is if the source providers return the correct
lines given the correct input. If there are 2 line terminators in the input, each
provider must return 3 lines, even if the last one is empty. Every test is a variation
on that, thoroughly exercising each provider to ensure that both adhere to the
interface flawlessly. After all, if the parser gets bad input to tokenize, it cannot
help but produce bad output, even if it is only off by a line terminator.
Replacing Print with Log¶
This improvement was a long time coming: replacing all of the print statements with
log.debug statements. When I was developing the parser, adding a simple
print statement was always the path of least resistance. Why did I avoid using
log.debug statements from the beginning of development? I cannot say for sure.
Regardless of why I did it, the time to fix it was now. It was a mostly painless transition and didn’t take that long to accomplish. To the start of most files, I added the following import:
and at the start of many blocks of processing, I added the following line:

```python
logger = logging.getLogger(__name__)
```
Then, for each time I called the print function, like this example:

```python
print("Line:" + line_to_parse + ":")
```

I replaced it with the exact same arguments, just changing the name of the called function to logger.debug, as follows:

```python
logger.debug("Line:" + line_to_parse + ":")
```
After the initial changes to replace print with
log.debug, everything looked okay
until I ran the normal
clean script that I use with the project. This script is a
simple script to execute the
code formatter, the
linters, and the full set of tests for the project. When the script got to PyLint,
it seemed to go crazy, emitting lots of warning lines, each line essentially
being the same.
Reading the warnings carefully and looking at the source code, PyLint seemed to be complaining about each logging function call that involved concatenation. In each case where I was concatenating strings to arrive at what I wanted to log, PyLint raised a warning that I wasn’t doing it right. According to PyLint, the proper way to log a message for the above example is:
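The exact replacement line is not reproduced here, but based on PyLint’s logging-format-interpolation guidance, it would take this shape (the value of line_to_parse is a hypothetical stand-in):

```python
import logging

logger = logging.getLogger(__name__)
line_to_parse = "# A heading"  # hypothetical value, for illustration

# The format string and its argument are passed separately; the logging
# library applies the %-interpolation only if the message will be emitted.
logger.debug("Line:%s:", line_to_parse)
```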
Doing a bit more research, the reason for the warning is because the logging library
was purposefully created to be lazy. If the log level for a given call is not high
enough to cause the string to be logged, doing any kind of formatting or concatenation
on that string is wasted effort. Following that logic, the logger follows the same
conventions that are used with the percent character (
%) as the
string interpolation operator,
delaying the evaluation of the actual string until the logger determines
whether the specified string is actually going to be logged. Once a positive
determination has been made, the format and the arguments are applied to each other,
a resolved string is produced, and that string is then logged.
It took a while to go through each of those messages. I had to examine each concatenation sequence, break it down into its component parts, and verify my changes. Each string that was concatenated needed to be represented in either the format string or the arguments passed to the logger. It was slow and pedantic work, but in the end, I was happy to have a logging solution that was more performant than before.
Note that as a Python logging newbie, I am not 100% sure if I created more work for
frequently creating logging instances inside of the project’s static functions. It is
possible that I can get away with a static logger variable created in the module’s
namespace at the top of the file, and not worry about creating any other loggers within
the same file.
However, in all the examples I have seen to date, the logger is
either declared at the top of a simple file or within the
__init__ method of a
class. As a lot of the helper classes are a collection of functions that are labelled
@staticmethod annotation, I am not sure if one instance at the top of the
file is the correct way to go. While
it might be more effort than I really need, I am confident that I am covering all
the logging situations properly. If I learn something differently about logging, I
will come back and revisit it.
Better Error Reporting¶
When I initially proofed out the plugin architecture for the linter, I added a
BadPluginError class to help identify plugin issues and report them back to the
command line. Using this same pattern to integrate error handling for the parser into
the linter, I added the
BadParsingError class, raising this error when there were
exceptions raised during the parser’s tokenization of the Markdown text. A bit more
typing and a short while later, I had the
__handle_error function with refactored
content. This newly minted function reused the error handling meant for the
BadPluginError class to handle both the
BadPluginError class and the
BadParsingError class in the same manner.
With that tweak done, I had confidence that the error reporting was done, and I wouldn’t need anything more serious until far later in the project. That is what I foolishly thought until about 2 days later. After doing some work on implementing the first two rules, I realized that some things needed to be fixed with error reporting.
The first change I made was to change the name of the error class from
BadParsingError to
BadTokenizationError to reflect the problem area more accurately. While it is
more of a mouthful to say, it accurately describes that it is a problem with the
tokenization of the document, not a generic “parsing” issue. A good example of this is
when the
TokenizedMarkdown class is created. Upon creation, one of the things
that it does is load the
entities.json file from the resources directory, as detailed in an earlier article.
If that resource file is not
loaded properly, that code was changed to raise a
BadTokenizationError instead of
what it was previously doing. Without this file, parsing the Markdown text would still
be possible without any issues. But to properly tokenize the Markdown text, the parser
needs to know if the named character entities that are provided in the Markdown
document refer to valid named entities. It may
appear to be a semantic difference to some, but in my experience, it is the attention
to detail on little things like that which help improve the project’s maintainability.
In addressing the above case, I stumbled into the second issue: too many exit points.
While it is not shown in the commit for the
BadTokenizationError fix documented
in the paragraph above, the first pass at addressing that issue was messy. It caught
any raised error, did some bare-bones reporting, and then performed a
sys.exit(1) call to
stop the program with an error code of 1. Doing a quick search through the code, I
stopped counting once I hit the third instance of a
sys.exit(1) call in the code.
It was time for a refactor.
Beginning with the initial case that started the search, I took a quick look at the
various search results and came up with a good rubric to follow. If possible,
I would try and more correctly classify the error using one of the two existing error
classes, BadTokenizationError or
BadPluginError, which were already handled at the
top level by the
__handle_error function. If the error occurred in the
main.py file and didn’t fall into either of those categories, I would call the
__handle_error function directly, striving to give a clean and concise message on
what that error was and why it occurred. If I encountered any errors outside of those
parameters, I would save them for last and re-evaluate my rules.
Based on what I found in the project, the first two rules were enough, as I did
not find an error that could not neatly fit into one of those two categories.
Instead of a few different ways of exiting the linter with an error condition,
__handle_error was now the sole conduit for exiting with an error. I went over the
tests in the affected files to make sure everything looked consistent, and things were good!
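Sketched in miniature, the pattern looks something like the following. The class names follow the article, but the function bodies are my assumptions, not the project’s actual code; the real __handle_error lives in the project’s main.py and differs in detail.

```python
import sys


class BadPluginError(Exception):
    """Raised when a linting plugin fails to load or misbehaves."""


class BadTokenizationError(Exception):
    """Raised when the Markdown text cannot be properly tokenized."""


def handle_error(raised_error):
    # Single conduit: report the error, then exit with an error code of 1.
    print(f"{type(raised_error).__name__}: {raised_error}", file=sys.stderr)
    sys.exit(1)


def load_named_entities(resource_path):
    # Stub standing in for loading entities.json; it always fails here so
    # the example has an error to route through the conduit.
    raise BadTokenizationError(
        f"Named character entities in '{resource_path}' could not be loaded."
    )


# Route any well-classified error through the one exit point.
try:
    load_named_entities("resources/entities.json")
except (BadTokenizationError, BadPluginError) as this_error:
    try:
        handle_error(this_error)
    except SystemExit as exit_error:
        # Caught here only so the illustration can finish; the real
        # program would terminate with exit code 1.
        print(f"would have exited with code {exit_error.code}")
```

The payoff of funneling everything through one function is that there is exactly one place to audit for exit codes and message formatting.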
What Was My Experience So Far?¶
As I have mentioned in previous articles, I like refactoring, and this stretch of
refactoring was no
exception. Each of these tasks was a little task, but I felt better knowing that they
were addressed before I shifted my focus to writing the rules for the linter. The
change to using source providers would be pivotal in dealing with test sources (strings)
and live sources (files) as sibling concepts. Replacing the print statements with
log.debug calls was
also pivotal to using live sources, keeping the ability to debug what was going
on for experienced users, but not flooding a casual user with all of that information.
And having all of the error reporting go through one conduit just seems cleaner and more
concise to me. Getting these taken care of just felt like the right refactoring to
do at the right time.
I also realized that I enjoy the cadence I have established with refactoring. While there are short bursts where it is just adding new features, for the most part I have tried to switch off between refactoring and adding new features. I feel that for my development style, I have found a good balance between fixing things and adding things, even if it needs tweaking every so often. Perhaps it is because of that sense of balance that I have been able to focus on this project for so long without losing energy on it. For me, I believe it boils down to a feeling that this project is not about adding features as much as it is about continual improvement of the project towards its goal, taking my time to do things right as I go.
Looking forward, I am sure I am not done adding features, refactoring code, or learning something new and interesting.
What is Next?¶
Taking a bit of a jump back in time a couple of days from where I am leaving off, in the next article I am going to start talking about the rules that I am developing, and how their development proceeded. Please stay tuned!
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.