Markdown Linter - Road To Initial Release - Gearing Up For Plugins

Summary¶

In my last article, I talked about how I worked through the issues to create an installable package for the project. In this article, I talk about gearing up for the work that I need to do on the plugins to get them ready for the release.

Introduction¶

Having made great progress towards the initial release, I knew that I needed to spend some time focusing on the heart of the PyMarkdown linter project: the rule plugins. To be honest, the last time that I have spent any serious time looking at the rules was at least five or six months ago. In that time, I have changed a lot of code, both improving how things performed and fixing issues. It just made sense that before I release the project, that I go through the already implemented rules and make sure they get the same thorough treatment that I have given the rest of the project.

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 03 Apr 2021 and 10 Apr 2021.

Where To Start?¶

Looking at the current set of issues that I had with the configuration system, I noticed two little problems and one slightly bigger problem. The first little problem had to deal with the error output from the plugin manager. If there was any kind of exception in the plugin, it was being handled and replaced with a more standard error. But the issue that I had with it was that I needed more information to be provided with that error, to debug plugin issues more effectively. The second problem was the inverse of the first problem: too much information. When listing the plugins from the command line, I felt that I was providing too much information and needed a way to filter out plugins that were in development.

The bigger problem was going to take more time, but I also felt that it was important. I needed to find an effective way to override any configuration file properties from the command line. In any useful command line tool that I have used, that tool almost always has support for a configuration file and configuration from the command line. Having used tools with this functionality implemented, I sincerely felt it is a better, more efficient way to work. I felt it was important enough to take the time to implemented it and do implement it right.

With those three issues to fix, it was time to get to work!

Adding In Where Exceptions Occur¶

As I mentioned above, the first issue that I wanted to tackle was one that I believe is helpful in debugging issues with PyMarkdown’s rule plugin modules. Prior to this change, if one of the rules thrown an exception, plenty of information was included about the exception that occurred. What that information that was missing was any information detailing where the parser was in the Markdown document when the exception occurred. While that may seem like a small thing, that extra content is often pivotal to diagnosing why the exception was raised in the first place.

Thankfully, adding that information was rather trivial. To start, the __init__ function of the BadPluginError class was changed to accept four new arguments: line_number, column_number, actual_line, and actual_token. The handling of those arguments was then added to the end of the processing of the formatted_message variable in the __init__ function:

    if line_number:
        if column_number:
            position_message=f"({line_number},{column_number})"
        else:
            position_message=f"(Line {line_number})"
        formatted_message = f"{position_message}: {formatted_message}"
    if actual_line:
        formatted_message = f"{formatted_message}\nActual Line: {actual_line}"
    if actual_token:
        formatted_message = f"{formatted_message}\nActual Token: {ParserHelper.make_value_visible(actual_token)}"

Basically, if the line_number argument is provided, then the line_number and optionally the column_number are added to the end of the formatted_message variable. Additionly, if a line is being processed when the error occurs, the actual_line is specified, and that variable is added on its own line. Similarly, if a token is being processed when the error occurs, the actual_token is added on its own line.

With that work completed, it was then time to wire up the other side of the exception: where the PluginManager class raised those exceptions. To ensure that the extra information would only be provided when requested, its presence was enabled by passing the arg.show_stack_trace variable from the command line into the class. Once set as the self.__show_stack_trace member variable, the next_line function was changed to:

def next_line(self, context, line_number, line):
    context.line_number = line_number
    for next_plugin in self.__enabled_plugins_for_next_line:
        try:
            next_plugin.plugin_instance.next_line(context, line)
        except Exception as this_exception:
            actual_line = line if self.__show_stack_trace else None

            raise BadPluginError(
                next_plugin.plugin_id, inspect.stack()[0].function, line_number=line_number, actual_line=actual_line
            ) from this_exception

and the next_token function was changed to:

def next_token(self, context, token):
    """
    Inform any listeners of a new token that has been processed.
    """
    for next_plugin in self.__enabled_plugins_for_next_token:
        try:
            next_plugin.plugin_instance.next_token(context, token)
        except Exception as this_exception:
            actual_token = token if self.__show_stack_trace else None

            raise BadPluginError(
                next_plugin.plugin_id, inspect.stack()[0].function, line_number=token.line_number, column_number=token.column_number, actual_token=actual_token
            ) from this_exception

In the next_line function, the line_number is always passed on to the exception when it is raised. Keying off of the self.__show_stack_trace member variable, the actual_line is either set to None or to the current line. As expected, the changes to the next_token function perform the same actions, just with a token-based approach instead of a line-based approach.

Once this code was in place, I started to execute the scenario tests, and was welcomed by the behavior that I expected. The four scenario tests that I expected would change did change. In addition, when I added two additional scenario tests to ensure that the code was properly covered, the output for those two tests was as I expected. With a bit of cleanup to ensure that all PyLint checks pass, it was then on to the next planned change.

Only Displaying Implemented Plugins¶

For the initial release, I wanted everything to look nice and clean. As such, when I was going through the command line to document what I saw there, there was something in the output for the plugins list subcommand that I did not like: unimplemented rules were showing up. I wanted those “in-development” rules plugins to be included in the project to remind me to complete them, but I did not want them to show up until they were done. When I added the version numbers for all the rules, I set the version numbers of those “in-development” rules to 0.0.0 as a visual reminder that they were not yet ready. But now that I wanted to clean things up, that did not look as clean as I wanted it to, so I needed to change it.

To preserve the existing behavior, I changed the command line parser to implement a new --all flag:

sub_sub_parser.add_argument(
    "--all",
    dest="show_all",
    action="store_true",
    default=False,
    help="show all loaded plugins (default is False)",
)

Then, in the __handle_argparse_subparser_list function that handles the list of plugins, I made a small change:

    if next_plugin.plugin_version == "0.0.0" and not args.show_all:
        continue

These two small changes together specified that by default, if a plugin containing a rule has a version of 0.0.0, it will be skipped over in the list, unless the --all flag is specified. Scenario tests that previously emitted:

  md009  no-trailing-spaces     False              False              0.0.0
  md019  no-multiple-space-atx  True               True               0.5.0
  md029  ol-prefix              False              False              0.0.0
  md039  no-space-in-links      False              False              0.0.0

now emitted:

  md019  no-multiple-space-atx  True               True               0.5.0

unless the --all flag was set. By default, only those rules that were no longer in development would be listed. In my eyes, it just looked cleaner and more correct.

Setting Configuration Values Manually¶

With the two small issues out of the way, it was time for me to focus on the large issue. The definition of the issue itself was easy. There already existed a good way to set configuration values using the configuration files, as mentioned in my article on the project’s new configuration system. But for me, that was not enough. I wanted to be able to set configuration values directly from the command line.

Why?¶

When I am starting to use any tool like the PyMarkdown linter, I usually need to experiment with the configuration to get it dialed in the way I want it. With tools that only have a configuration file, I find that that process that I use to figure out settings is inefficient. That process involves a lot of repetitions of the following tasks: look something up or consult the tool’s command line help, figure out a variation that is close to how I want it, change the configuration file, save that file, and run the tool to verify the new behavior. The biggest issue I have with this process is that I am constantly switching contexts from one tool to another, figuring things out for the next step. The second biggest issue is that it is very easy to forget the “save that file” step of the process. I do that a lot!

Then there are those few tools that I find both useful and efficient, like PyTest. By default, PyTest reads its configuration from the setup.cfg file. However, any setting in the tool can be overridden from the command line. That one simple change made configuring PyTest simple and easy. Sure, I still had the consult, figure, and execute steps from above, but I eliminated the change and save steps. To make the PyMarkdown project easier to configure for users, I wanted to adopt a similar style of interface: configuration file for “tested” values and the command line for “experimental” values.

Figuring Out Requirements¶

The first step in adopting a similar type of interface for the PyMarkdown project was to take the time to figure out a solid set of requirements. Specifying the property name from the command line was the relatively easy part. As I had used the . character as a property name separator character, there were a couple of easy combinations that I needed to prevent against as they made no sense. Then focusing in on each one of those property parts, there was another set of requirements that arose from ensuring that each part was consistent and valid. Adding the = character to separate the property name from the property value add a couple of small requirements, but nothing big. For good measure, I decided that whitespaces within the property name were going to be too much trouble, so I added those characters to the list of illegal characters in a property name.

Then there was the big design decision: How to handle the property values? At this point, the configuration system itself can handle properties that are strings, integers, and booleans. Without any modifications, the command line flag -set key=value can handle only one type of property values: strings. I needed to figure out a simple way to allow for that one type to be extended to include the extra types supported by the configuration system. The question was how to do that in a way that made sense?

Research¶

A good subtitle for this section could be “Jack sits around and scribbles on paper like a maniac for a week”. Even if that is what it looked like, “Research” was a more precise definition of what I was doing. As readers of past articles might have picked up, I try to always put my best work out there, so I go the extra distance to try and make the right decisions for the right reasons. This was no different. The problem that I was facing was that there were few good examples out there to compare to.

I do not blame other teams for taking the easy way out and making simple decisions. If you have a configuration system that is relatively simple, the design decisions that come out of providing information to that system are also relatively simple. When I looked at other configuration systems and interfaces, most of what I found were simple configuration systems. If the tools went so far as to provide any additional interfaces, those provided interfaces were also simple. To be plain and concise, for a lot of applications, only supporting a string type is perfectly fine, so their configuration systems support that. If any extra meaning is to be imparted to that string, it is up to the developer to add that extra level of meaning and handle any errors.

The problem with that approach is that the PyMarkdown project does not have a simple configuration system, it has a complex one. As part of the design of the configuration system, it just made sense to have that system understand the difference between the different types of data. For me, to not support that functionality would make the configuration system seem like it was half done. But now I knew that I had to come up with an equally easy solution for manually setting the property values as I had for the rest of the configuration system.

I tried many different alternatives that attempted to figure out the type of the string without any explicit declarations being used. But for every 95% of the cases that I was able to get working well, the remaining 5% of the cases always seemed to sink the approach. One good example is the string True or true. Is it meant to be a string or is it meant to be a boolean? Similarly with the string 010. Is it meant to be a string or an integer? If it is an integer, is it in decimal form or binary form? While I can make sense of when to translate or not, figuring out an algorithm to implicitly make that same decision seemed pointless. In addition, I did not want any required translation to occur any later than it had to. If that translation fails, I would rather the translation failed fast, as soon as it could reasonably fail.

Where Did I Land?¶

Like many other features in this project, after spending a good amount of time trying to do things the fancy way, I was able to find success in using the simple approach. For the configuration values, I decided that a string should only be interpreted as something else if the user explicitly specifies so. As I mentioned in the last section, I was able to make sense of when to translate or not and I believe the user is also the best entity to decide whether to translate or not. Therefore, I decided to use the $ character to start special handling of translation of the configuration value to other types.

Hopefully keeping it simple, if the configuration string is of the form key=value or key=$$value, then the string value is assigned to the property named key. If that value needs to be an integer, then the form key=$#value is used. Similarly, if that value needs to be a boolean, then the form key=$!value is used. Trying to keep things simple, I figured that a decent percentage of the users will not have issues identifying the $ character with a string, the # with an integer, and the ! with a boolean. To further keep things simple, the integer can only be provided in the normal base 10 notation (with a sign and leading zeros a sign if desired) and the boolean is set to true if lowercase value equals the string true.

After spending some time to enumerate all the different scenarios that I wanted to handle, I felt good about the research. The only thing I was concerned about were the boundary cases, so I examine the algorithms again looking for any potential issues. I felt mostly silly for missing one of the boundary conditions, but I did find it before I started coding, that was the important thing. That boundary condition was to have a $ character that was not followed by any other of the type characters.

A couple of quick adjustments to the algorithm, and another check to make sure that I did not miss any other boundary cases, and it was on to the implementation!

Implementation¶

After all that research, I felt an extra sense of urgency to get this feature completed. I do not regret taking almost six days to figure out the right decision for this feature, but it was six days. I wanted to get this feature completed so I could move on to the next thing in the issues list. Given the research from above, I already had more than ten scenarios derived, so I coded each scenario into its own scenario test and started to get to work coding the new set_manual_property function.

The first version of the new set_manual_property function was simple:

__assignment_operator = "="

...

def set_manual_property(self, combined_string):

    equals_index = combined_string.find(ApplicationProperties.__assignment_operator)
    property_key = combined_string[0:equals_index].lower()
    property_value = combined_string[equals_index + 1 :]

    self.__flat_property_map[property_key] = copy.deepcopy(property_value)

Keeping things simple, my goal was to just get the first scenario tests with the simplest scenario passing. As such, the task was simply to split the string up into its constituent parts and add it to the dictionary. It was a simple start, but a good start.

Verifying The Manual Property String¶

With the foundation in place, it was then time to start adding the layers. The first layer that i needed to add was to verify the form of the manual property string. Where possible, I strongly believe that a utility should provide as much information about why something is wrong as is possible. With that in mind, I created the first iteration of the verify_manual_property_form function:

@staticmethod
def verify_manual_property_form(string_to_verify):
    if not isinstance(string_to_verify, str):
        raise ValueError("Manual property form must be a string.")
    equals_index = string_to_verify.find(
        ApplicationProperties.__assignment_operator
    )
    if equals_index == -1:
        raise ValueError(
            "Manual property key and value must be separated by the '=' character."
        )
    return string_to_verify

It was not much, but it was a good start. This function was there to verify that the entire string was a properly formed string, with good error messages if it was not. With that accomplished, it was on to the next step.

Verifying The Property Key¶

The next step was to extend the verification to verify that the property keys on the left of the manual property string were validly formed. This was easily done by adding these two lines at the end of the verify_manual_property_form function:

    property_key = string_to_verify[0:equals_index]
    ApplicationProperties.verify_full_key_form(property_key)

With that call to the new verify_full_key_form function coded, it was time to implement that function:

@staticmethod
def verify_full_key_form(property_key):

    if property_key.startswith(
        ApplicationProperties.__separator
    ) or property_key.endswith(ApplicationProperties.__separator):
        raise ValueError(
            f"Full property key must not start or end with the '{ApplicationProperties.__separator}' character."
        )
    doubles = (
        f"{ApplicationProperties.__separator}{ApplicationProperties.__separator}"
    )
    doubles_index = property_key.find(doubles)
    if doubles_index != -1:
        raise ValueError(
            f"Full property key cannot contain multiples of the {ApplicationProperties.__separator} without any text between them."
        )
    return property_key

Like the previous function, this function takes a slice of the string to verify and focuses on that slice. For this function, that focus was on making sure that the property_key argument was validly formed. This task broke down into two checks: does the key start or end with the separator character and does the key contain two separator characters in a row. By focusing on these checks, the function can then return the property_key argument knowing that there are no cases of any part of that property key that does not contain at least one character.

And just like the last section, it was then on to the next, and in this case, last step.

Verifying The Property Key Parts¶

In the same way that I added to the call to the verify_full_key_form function to the verify_manual_property_form function, I now added a call to the verify_full_part_form function to the verify_full_key_form function.

    split_key = property_key.split(ApplicationProperties.__separator)
    for next_key in split_key:
        ApplicationProperties.verify_full_part_form(next_key)

Constricting the focus even further, it was now time to examine each part of the property_key argument:

@staticmethod
def verify_full_part_form(property_key):
    """
    Given one part of a full key, verify that it is composed properly.
    """

    if (
        " " in property_key
        or "\t" in property_key
        or "\n" in property_key
        or ApplicationProperties.__assignment_operator in property_key
    ):
        raise ValueError(
            f"Each part of the property key must not contain a whitespace character or the '{ApplicationProperties.__separator}' character."
        )
    if not property_key:
        raise ValueError(
            "Each part of the property key must contain at least one character."
        )
    return property_key

Having eliminated all of the high-level checks, this function could then focus on the task of ensuring that each part of the property key was validly formed. That task was then simply to check two things: that the part of the key did not contain a whitespace character or the assignment character (=), and that the part had at least one character in it. I will admit freely that the last check was probably overkill. I just felt better adding it in for now, knowing that I could easily take it out once I proved it was not needed.

And with that done, I had the validation for the string passed to the set_manual_property function completed.

Why Did I Do It This Way?¶

One thing I have learned in the years I have been developing software is that it is important to keep things simple. If things are simple, they are easy to read. If they are easy to read, they are generally easy to debug. By placing each level of verification in its own function, I kept things small, concise, and simple. And as an extra benefit, I could reuse those functions in other parts of the application properties library!

Having followed principles like “Keep It Simple”, I do get confused when I see people trying to cram more than one responsibility into a given function. Sometimes I can understand how a single responsibility can organically grow into a pair of responsibilities, but even in those cases will I try and refactor the code. Anything more than two responsibilities and I wonder why the person who wrote or modified the code did not refactor it. From my point of view, the closer you keep the code to being simple and having a single responsibility, the easier it is to maintain and the fewer bugs I usually find in them. Just my experience, but maybe others have a different experience.

Adding The Type System¶

With the changes in place to validate the form of the string and the key part of that string, it was then time to add the type system I designed, as summarized in the section Where Did I Land?.

To add the type system, I started by defining the prefix values at the top of the ApplicationProperties class:

    __manual_property_type_prefix = "$"
    __manual_property_type_string = "$"
    __manual_property_type_integer = "#"
    __manual_property_type_boolean = "!"

From there, the next step was to explicitly look for the __manual_property_type_prefix character with the following code:

    if (
        property_value.startswith(
            ApplicationProperties.__manual_property_type_prefix
        )
        and len(property_value) >= 2
    ):
        pass

This condition was simple to code. Each of the special sequence starts with a $ character and contains exactly 2 characters. Therefore, if the property_value variable did not meet that criteria, there was no use taking those extra steps.

From there, it was then on to adding the first of the types. Since the string type was the easiest, and probably going to be the most frequently accessed, it made sense to add it at the start of the if block:

    if (
        property_value.startswith(
            ApplicationProperties.__manual_property_type_prefix
        )
        and len(property_value) >= 2
    ):
        if property_value[1] == ApplicationProperties.__manual_property_type_string:
            property_value = property_value[2:]
        else:
            property_value = property_value[1:]

Replacing the pass from the last example with the if-else conditional, this handling of the string type was easy. Knowing that there were other types to come, this was purposefully coded using an if-else instead of a trinary assignment. At this point, there was either another $ character after the first $ character or there was not. In either case, the property_value was modified to skip over those characters, as the property_value variable was already a string type and did not need any modifications.

That simplicity was then changed with the introduction of the boolean type. As per the design, this value gets assigned a True value or a False value based on whether the lowercase value of the remaining string exactly equals true:

            property_value = property_value[2:]
        elif (
            property_value[1]
            == ApplicationProperties.__manual_property_type_boolean
        ):
            property_value = property_value[2:].lower() == "true"
        else:

I went back and forth on whether to modify the check to include other values and handling of whitespace, but I decided to keep it simple as designed. By keeping it simple, it made it easier to explain and document: it is either true or it is not. And stripping whitespace seemed mostly pointless. I guess someone could pass a boolean argument with true as the value. But I just did not feel like that was a path that I wanted to take. So, I decided on no whitespace stripping and no alternate values.

Finally, it was time to handle the integer type:

        elif (
            property_value[1]
            == ApplicationProperties.__manual_property_type_integer
        ):
            try:
                property_value = int(property_value[2:])
            except ValueError as this_exception:
                raise ValueError(
                    f"Manual property value '{property_value}' cannot be translated into an integer."
                ) from this_exception

This translation was a bit different than the boolean type in that it may throw an exception if a bad value was passed for the integer. I thought this was a useful scenario and wanted to test it out to make sure it worked properly. I can see myself extending these translations in the future and knowing that the system can handle a ValueError being thrown is a good thing.

With the final type wired up, all the scenario tests were all passing properly. Looking at the code coverage for the project, I noticed a couple of gaps in the coverage, so I coded some extra scenarios to cover those gaps. Other than that, everything was working as planned.

Wiring It Up To The Command Line¶

Once I had the set_manual_property function thoroughly tested, it was time to wire it up to the command line and to get that working. To do this, I added the following code to incorporate the command into the base command line parser instance:

    parser.add_argument(
        "--set",
        "-s",
        dest="set_configuration",
        action="append",
        default=None,
        help="manualy set properties",
        type=ApplicationProperties.verify_manual_property_form,
    )

Then, in the __set_initial_state function, I simply passed the value collected from the command line the ApplicationProperties instance as follows:

    if args.set_configuration:
        self.__properties.set_manual_property(args.set_configuration)

That is all it took. With that all done, I ran some basic tests, and everything went fine. Almost. When I decided to set multiple configuration items from the same command line, the append action from argparse kicked in and the combined_string arguments to the set_manual_property function was now an array of strings. While I had thought about it when I designed the set_manual_property function, I did not think it would be an issue and I guess wrong. It was a simple omittance, and one that was easy to fix.

The way that I saw it at the time was that I could either modify the existing function to handle the “array of strings” case, or I could write a helper function that would perform that same action, but under a different function name. I played around with splitting the functionality out into a new set_manual_properties function but decided against it. With the determination to keep all the code in one place, I added this block of code to the start of the function:

    if not isinstance(combined_string, str):
        iterator = None
        try:
            iterator = iter(combined_string)
        except TypeError as this_exception:
            raise ValueError(
                "Manual property form must either be a string or an iterable of strings."
            ) from this_exception
        for i in iterator:
            self.set_manual_property(i)
        return

There were a couple of cases here that I wanted to handle. I wanted to prevent execution of the function for non-string arguments other than an iterable of strings. Without implementing mypy and generics, the best I can do is to make sure that if the value is not a string, then it is an iterable of some sort. Then, when passing the individual elements of the iterable instance to the set_manual_property function, the function itself can verify that those elements are strings or not.

It was not perfect, but it was a good design and a faithful implementation. With these changes in place, I was looking forward to verifying that the existing rules were up to date.

What Was My Experience So Far?¶

This block of work was both normal and humbling at the same time. From my viewpoint, this was just me looking at what needed to be cleaned up and getting it done. That really does not change from project to project or task to task. But finding something that I really needed to take a lot of time to focus on and design right was a bit of a shocker.

I know I am not perfect. Clear on that. But I do my best to see what are my next tasks and plan the design for those tasks in advance. If I am going to make a change to my backyard, I probably have thought about it for at least a month, talked to at least four or five people about it, and sketched it out at least three or four times. It’s not that I strive for perfection, I just like to understand the variables of the task dealt with so I can focus on the implementation problems when I am working on the task.

When I started to think about what I needed for to handle types, I honestly thought I would find plenty of examples that I could use to guide the modelling of my own system. I was wrong. It was because of that once omittance that I needed to dig deep and figure things out on my own and weigh the good point and bad points myself. To be clear, it is not the work that I was upset about, but that I missed the need for that work to be planned for.

That was the humbling part. But as I write this article, I am also starting to see it in a different light. Yeah, I missed a design point, and my ego was bruised a bit. So what? I kept to my principles and did the work to make sure I was making the right decision. Do I wish that I could have done that in the spare moments of working on other stuff? Sure… but I did not and did it anyway.

As one of my professors once said to me:

Principles aren’t things that you see clearly in the light. It takes some form of darkness for you to see them for what they truly are.

What is Next?¶

With all that foundational work out of the way so I can focus on the implemented rules plugins, I knew my next task was to start working through those plugins and verify that they worked as advertised. Stay tuned!

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.

Comments