Markdown Linter - Road To Initial Release - Configuration

Summary¶

In my last article, I talked about my own requirements for a front-matter processor and how I added it to the project. In this article, I talk about my own requirements for configuration and how I applied them to this project.

Introduction¶

Sometimes I encounter a part of a programming language that I do not like for one reason or another. For whatever reason, the concept of having a good entry-level configuration library for each given language seems to be not high on anyone’s priority list. To be clear, I know that there are libraries out there for each language that do some form of property reading and access, but the support is always basic at best. If one is provided, it either seems to be too simple to accomplish the task that is normally required of it, or it seems that it was added as an afterthought.

Therefore, it was a bit disheartening to find out Python was also in this category. When it comes to built-in support for logging, Python is there with the logging library and its good set of features. When it comes to command line parsing, Python is there with the argparse library and its various helper functions. When it comes to configuration file parsing, Python is there with… configparser? Honestly, I was not impressed. It only supports the ini file format and does not seem to have much in the way of useful support for helper functions that I could find.

So, for this project, I now need to figure out what configuration I need, and how to properly store it.

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 06 Mar 2021 and 11 Mar 2021.

What Is Configuration?¶

At face value, configuration is anything that provides information that guides the execution of the project. While input information is one class of configuration, a discussion of that class of configuration is a whole other article unto itself. As such, I am going to stick to the two types of configuration that most people think about when they hear configuration: dynamic configuration and static configuration.

To be honest, I usually take the phrases “data stores” and “dynamic configuration” to mean the same thing. This type of configuration is expected to change its configuration information during the execution of the project. More often than not, the project itself is the entity that is changing that information. This dynamic information can be as simple as a file that holds the time the project last scanned another data store to a complete map of the data and relationships for the project. The key point here is that this information is expected to change often.

The natural complement to dynamic configuration is static configuration. When most software developers hear “configuration”, they most often think of things such as configuration files, environment variables, and command line data. These items are all classified as static configuration as they generally do not change once the execution of the program has started. Showing a bit of synergy with each other, if a data store is used to hold information, there are usually one or more static properties that provide any needed configuration to gain access to that data store.

So, for this article, I am focusing on static configuration and how I added better support for static configuration to the PyMarkdown project.

Configuration Requirements Are Important¶

Up to this point, the project and its tests have been able to get by without any real hard requirements for configuration. I have a loose set of command line arguments and a simple map used to configure the plugins, but that is it. Taking a look at my initial list of requirements, there is nothing specifically in there about configuration, just phrases here and there that somewhat speak of configuration. Phrases like “plans to later add other flavors of parser” and “extending the base linting rules” imply that configuration will be needed, but do not explicitly state that need.

From a design point of view, this was not a failure, but a planned action. In many previous projects, I implemented a simple configuration system early in the project, only to find out that it missed a couple of significant configuration scenarios. In this case, “missed” means that there was not a straightforward way of representing some aspect of the configuration in a way that made the most sense. Given that history, I decided to go with a minimal configuration approach until I could see what kind of configuration the project was going to need.

I was glad I took that approach, as it allowed me to properly classify the configuration into the right category. For me that classification for configuration deals primarily with the requirements in 5 categories:

basic property support
overriding from the command line or environment
validation
grouping of values
hierarchy and nesting

How many of these requirements are needed, and which ones, defines the type of configuration that is required.

While there may be more “official” types of static configuration, the ones that I know about fall into three main types.

Configuration Type 1: Simple¶

When people talk about configuration, most often they are talking about simple configuration. This is primarily configuration that can easily be put into a file or passed around in variables, such as environment variables. From a file-based point of view, this type of configuration usually looks like:

host=localhost
port=8080
username=admin
password=admin@123

Looking for an example, a simple Google search turned up this article on how to handle properties files in Python using the jproperties library. This type is called simple because it is simple, usually implemented with no bells and whistles. If the application needs to fetch configuration for the username property, the application needs to explicitly specify the full name of the property as it appears in the configuration file. The result is a string that the application must then convert or validate itself before it can use it.

There are pros and cons to this approach. The pros are easily this type’s simplicity and the ability to easily add a layer on top of this configuration that allows overriding these values from the command line. Unfortunately, the cons are almost the same, just for other reasons. Because this approach is simple, if the application requires any extra organization or validation for the configuration, the application needs to handle that code. As mentioned before, another big con is that with this type of configuration, everything is a string. Any interpretation of the data as anything other than a string is left up to the developer.

Checking in against the list of five requirements:

basic property support? Done.
overriding from the command line or environment? Extra code required.
validation? Extra code required.
grouping of values? Extra code required.
hierarchy and nesting? Extra code required.
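To make the “extra code required” entries a bit more concrete, here is a minimal sketch of reading the simple file from above. The parse_simple_properties helper is a hypothetical name of my own, not part of jproperties or any other library, and the port range check is just an assumed example of validation.

```python
def parse_simple_properties(text):
    """Parse 'key=value' lines into a dictionary of strings."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


SAMPLE = """\
host=localhost
port=8080
username=admin
password=admin@123
"""

config = parse_simple_properties(SAMPLE)

# Every value comes back as a string, so conversion and validation
# are left to the application.
port = int(config["port"])
if not 0 < port < 65536:
    raise ValueError(f"Port '{port}' is out of range.")
```

The parsing itself is trivial. The conversion and range check at the end are the part that every application ends up rewriting.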

Configuration Type 2: Grouped¶

The next step up from the simple type is a grouped type. While it is a step up from the simple type, it does not add much. From a file-based point of view, this type of configuration usually either looks like this:

db.host=localhost
db.port=8080
db.username=admin
db.password=admin@123

or like this:

[Db]
host=localhost
port=8080
username=admin
password=admin@123

As the name of this type suggests, the main thing that this type adds is the ability to group a collection of configuration items together into a group. While this may not seem like a big step forward, it is this type of configuration that I have found to be of most use in approximately 95% of the projects that I have worked on.
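For the second form, Python's built-in configparser library can read the grouped file directly. The following is only a small sketch of that, using the sample values from above.

```python
import configparser

SAMPLE = """\
[Db]
host=localhost
port=8080
username=admin
password=admin@123
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Section names are case-sensitive, and values are stored as strings.
db_section = parser["Db"]
host = db_section["host"]
raw_port = db_section["port"]

# A typed getter converts on request and raises ValueError on bad data.
port = parser.getint("Db", "port")
```

The grouping comes for free, but anything beyond basic type conversion, such as validation or nesting inside a group, is still up to the application.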

Clarifying that statement, I believe I can successfully demonstrate that 95% of the projects that I have worked on have used a Grouped type of configuration for any non-dynamic project configuration. My choice of words was very deliberate in that previous statement. While many of those projects had dynamic configuration that was quite complex, the static configuration for those projects that pointed to the dynamic configuration artifacts was quite simple. And while many of those projects started out using the Simple type of configuration, something always seemed to come up that required grouping of configuration values on some level.

I do believe that my experiences in this area are common. Servers? Most of the time they require some setup of the server and pass any heavy lifting of information to a data store. This usually requires two or three small groups of information, usually with not more than two or three items in each group. Command line applications? Most of the time there were no data stores, and any information was passed in through the command line or environment variables. Even if some manner of data store was involved, the configuration only provided the data to connect to the data store, where the dynamic information was stored.

So, what changes from the previous type? Only this:

grouping of values? Done.

The next type is where all the big changes occur.

Configuration Type 3: Complex¶

While this type of configuration is overkill for simple projects, it can be required for more complex projects, hence its name. From the main list of configuration items, there is support for all five, though the support for overriding may be limited due to the nature of the configuration itself. When I have encountered a need for this type of configuration, it has usually been configuration scenarios where there is a strong need for nesting or hierarchical information.

A good example of this is our standard group of configuration items, this time expressed using a JSON file for information:

{
    "db" : {
        "host" : "localhost",
        "port" : 8080,
        "username" : "admin",
        "password" : "admin@123"
    }
}

While this hierarchy could be flattened, expressing it in this form provides meaningful context to the configuration. There is no looking for other configuration items throughout the file that may have the same prefix. All four of the related items are grouped together under the db key. However, a better example of this hierarchical relationship requires a slight adjustment of the example file:

{
    "db" : {
        "server" : {
            "host" : "localhost",
            "database" : "users",
            "port" : 8080
        },
        "account" : {
            "username" : "admin",
            "password" : "admin@123"
        }
    }
}

With a slight addition of the database field, this file now represents what I feel is a more logical expression of the configuration items. While it is true that the previous group of four items were all related to the database, they were not all related in the same way. This organization of the fields presents a more cohesive mapping of what configuration is required for database access: where the database is and what account to use.
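As a small sketch of what this buys the application, here is the same structure loaded with Python's json library. The sample is inlined as a string only to keep the example self-contained.

```python
import json

SAMPLE = """
{
    "db" : {
        "server" : {"host" : "localhost", "database" : "users", "port" : 8080},
        "account" : {"username" : "admin", "password" : "admin@123"}
    }
}
"""

config = json.loads(SAMPLE)

# Numbers keep their type, so no string-to-integer conversion is needed.
port = config["db"]["server"]["port"]

# A whole group can be handed to the code that needs it.
account = config["db"]["account"]
```

The code that connects to the server only needs the server group, and the code that logs in only needs the account group.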

But as I mentioned at the start of this section, this may be overkill for some applications. If all you are storing in the configuration is 10 strings or fewer, I would argue that a Grouped or Simple type can more easily be used, with some adjustments required in the application. But for other applications with more complex configuration, this type may be just right for the project.

Which Is Best For This Project?¶

After reviewing the different types in my head, I decided that the PyMarkdown project would benefit from a complex configuration type. While the base configuration for the project is basic, I believe that the configuration requirements for any plugins and extensions will require more complex configuration. I know that a couple of the plugins that I want to write for my own website will require lists and possibly lists of maps. As such, it is better to plan and be prepared, rather than face a nasty rewrite.

And with that decision, I was on to adding complex configuration support to the project!

Adding Proper Configuration Support¶

Having written both low-level and high-level code to work with configuration, I had a good idea of what to do. The most important thing for me to do was to make sure I had any extra requirements for the complex configuration worked out before I started.

Starting With Base Requirements¶

The base configuration requirement for this project is that it can handle complex configuration. Based on that requirement, the most frequently used formats for that kind of data are YAML and JSON. Both have their strengths and weaknesses, as the information at those links will detail. Having used both of them in previous projects, my feeling was that the JSON format is a simpler format and more direct for users to use. For those reasons, a JSON file was the better choice for holding configuration information.

With that out of the way, I created a simple class ApplicationProperties to hold the configuration information. Keeping any interaction with it simple, I added a load_from_dict function to take care of transforming a loaded dictionary into a more convenient form for the class. After that was in place, I created a simple ApplicationPropertiesJsonLoader class with a straightforward static function named load_and_set. This function (and class) was created for one purpose: to load a JSON file as a dictionary and add it to the provided instance of the ApplicationProperties class.

Class to hold the properties. Check. Class to read properties in JSON format and set to the properties object. Check. Scenario tests to verify that the loading was working properly? Check. Now I needed to figure out the best way to hold the data within the class and how to access it from the application.

Loading The Configuration From A File¶

Thinking of the properties “grouping of values” and “hierarchy and nesting”, I had a decision to make on how to store the data. To me, that problem had two possible solutions: keep the dictionary as it was or translate it into a flattened map. To shape that decision, I had to keep in mind that one of the reasons that a complex configuration type is best for this project is due to its extensibility. Designed correctly, the configuration for any part of the project would be kept in its own “playground”, free from interference from any of the other parts.

Keeping that in mind, I knew that if I kept the dictionary the way it was, I could easily hand off portions of that dictionary to a subsystem using a single line of code. However, accessing a property using that dictionary would require multiple lookups, one for each level of hierarchy required for that property. The balance to that was a flattened dictionary, which only ever required one lookup. However, this would mean that handing off the configuration for a distinct part of the project would not be as straightforward or inexpensive.
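The trade-off can be sketched with two hand-written dictionaries holding the same information. The keys and values here are illustrative assumptions, not the project's actual configuration.

```python
nested = {
    "db": {"server": {"host": "localhost", "port": 8080}},
    "log": {"level": "INFO"},
}
flat = {
    "db.server.host": "localhost",
    "db.server.port": 8080,
    "log.level": "INFO",
}

# Nested: one lookup per level, but handing off a subsection is trivial.
host_from_nested = nested["db"]["server"]["host"]
db_subsection = nested["db"]

# Flattened: a single lookup, but a subsection must be rebuilt by prefix.
host_from_flat = flat["db.server.host"]
prefix = "db."
db_from_flat = {
    key[len(prefix):]: value
    for key, value in flat.items()
    if key.startswith(prefix)
}
```

Rebuilding a subsection from the flattened form walks every key, which is the extra cost mentioned above.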

After a fair amount of thought, I decided that the flattened option was the best choice. It optimized fetching values over hierarchy exclusion, which I believe was the right choice to make. Following that decision were three other key decisions. The first decision, which might seem like a straightforward decision, was that all keys are strings. However, as some configuration formats (YAML, for example) and plain Python dictionaries can have non-string keys, I wanted to be explicit about it. Secondly, I needed to choose a key separator character that would determine how to create a path to allow orderly traversal through configuration. With many great ones to choose from, I decided to keep it simple and in the property file family by using the period character (.). Finally, I had to decide about having all key strings be in upper case or lower case, or to simply have case-sensitive property keys. That one was a bit more difficult, but after looking at a few property files, the answer was obvious: lower case.

With those choices made, I added the __scan_map function and called it from the load_and_set function. It was less than 20 minutes before I had the following function ready to go:

def __scan_map(self, config_map, current_prefix):
    for next_key in config_map:
        if not isinstance(next_key, str):
            raise ValueError(
                "All keys in the main dictionary and nested dictionaries must be strings."
            )
        if self.__separator in next_key:
            raise ValueError(
                f"Key strings cannot contain the separator character '{self.__separator}'."
            )

        next_value = config_map[next_key]
        if isinstance(next_value, dict):
            self.__scan_map(
                next_value, f"{current_prefix}{next_key}{self.__separator}"
            )
        else:
            new_key = f"{current_prefix}{next_key}".lower()
            self.__flat_property_map[new_key] = copy.deepcopy(next_value)
            LOGGER.debug(
                "Adding configuration '%s' : {%s}", new_key, str(next_value)
            )

Every key in the dictionary is checked to verify that it is a string and does not contain the key separator character. Once the key is verified, if the current value is a dictionary, the function is called recursively with the current key and that dictionary. If it is not a dictionary, it is added to the flattened dictionary by creating a copy of its value. When inserted into the dictionary, the key is transformed into lower case to ensure consistency.
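To see what the flattening produces, here is a standalone sketch of the same idea outside of the class. The flatten_config name and its return-a-new-dictionary shape are my own assumptions for illustration; the project's version is the member function described above.

```python
import copy


def flatten_config(config_map, separator=".", current_prefix=""):
    """Return a flat, lower-cased copy of a nested configuration dictionary."""
    flat_map = {}
    for next_key, next_value in config_map.items():
        if not isinstance(next_key, str):
            raise ValueError("All keys must be strings.")
        if separator in next_key:
            raise ValueError(f"Keys cannot contain the separator '{separator}'.")
        if isinstance(next_value, dict):
            # Recurse with the current key added to the prefix.
            nested_prefix = f"{current_prefix}{next_key}{separator}"
            flat_map.update(flatten_config(next_value, separator, nested_prefix))
        else:
            # Leaf value: store a copy under the full lower-cased path.
            new_key = f"{current_prefix}{next_key}".lower()
            flat_map[new_key] = copy.deepcopy(next_value)
    return flat_map


flattened = flatten_config(
    {"Db": {"server": {"host": "localhost", "port": 8080}}, "log": {"level": "INFO"}}
)
# flattened == {"db.server.host": "localhost", "db.server.port": 8080, "log.level": "INFO"}
```

Note that lists and other non-dictionary values are treated as leaves, so a list of maps stays intact under a single key.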

I did try a couple of different options out, but in the end, this simple recursive function was the clear winner. I tried to be super-smart about doing the same thing in an iterative fashion, but it seemed to me to be too much code for such a simple task. While I do not have problems with recursive functions in any programming language, I do know that they can get into runaway mode with one wrong conditional statement. As such, I verified the code a couple of times before running it. I also added the number_of_properties function and the property_names function before adding more tests to test what I had so far.

Fetching Values From The Configuration¶

With a lot of the tough work behind me, I created some new scenario tests to test the new function that I was just about to add: the get_property function. As this was the start of this function, its coding was very simple:

def get_property(self, property_name, property_type, default_value=None):
    """
    Get a property of a generic type from the configuration.
    """
    property_value = default_value
    property_name = property_name.lower()
    if property_name in self.__flat_property_map:
        found_value = self.__flat_property_map[property_name]
        is_eligible = type(found_value) == property_type
        if is_eligible:
            property_value = found_value
    return property_value

Basically, unless a valid value is found to replace its value, the property_value variable is set to default_value. This made sense to me because it allowed me to write code that would know if a property was not found, or to just use a default, whichever solution worked best for the scenario I was working on. From there, the function checks to see if the value of property_name is present in the flattened dictionary and has the specific type that is being looked for. If both of those conditions are met, the property_value variable is set to the value found in the flattened dictionary.

To me, this was just the start, but it was a good, solid, simple start. Basic property support was now done… almost. To make things a little easier, I added three functions: get_boolean_property, get_integer_property, and get_string_property. These were all simple wrappers around the get_property function, providing the appropriate type information for the third property_type parameter.
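As a rough sketch of how thin those wrappers are, here they are as free functions over a plain dictionary. Passing the flattened map as the first argument is an assumption made to keep the example self-contained; in the project these are member functions of the class.

```python
def get_property(flat_map, property_name, property_type, default_value=None):
    """Return the stored value only if it has exactly the requested type."""
    property_name = property_name.lower()
    if property_name in flat_map:
        found_value = flat_map[property_name]
        # Exact type match, mirroring the check in the function above.
        if type(found_value) == property_type:
            return found_value
    return default_value


def get_boolean_property(flat_map, property_name, default_value=None):
    return get_property(flat_map, property_name, bool, default_value)


def get_integer_property(flat_map, property_name, default_value=None):
    return get_property(flat_map, property_name, int, default_value)


def get_string_property(flat_map, property_name, default_value=None):
    return get_property(flat_map, property_name, str, default_value)


flat_map = {"db.port": 8080, "db.host": "localhost", "log.stack-trace": True}

port = get_integer_property(flat_map, "db.port")
# A boolean is not returned as an integer, so the default is used instead.
not_an_integer = get_integer_property(flat_map, "log.stack-trace", -1)
```

One side effect of comparing types exactly, rather than using isinstance, is that True and False are never mistaken for the integers 1 and 0, even though bool is a subclass of int in Python.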

Where was I now? Basic property support. Check. Grouping of values. Check. Hierarchy and nesting. Check.

The next one to tackle? Validation.

Adding Validation¶

While I do not always need to validate values that I am fetching from configuration, sometimes it is essential to have. Where possible, I prefer to assume that the user is intelligent and will provide intelligent values that make sense. If there is an issue with the value, I prefer to report that error at a later stage when the value is being used, either directly to the command line or by logging it to a log file.

But in some cases, I need to have some configuration values that I know I can trust, as they are pivotal to the way the application works. The easiest one for me to think of is the log.level configuration value. As I rely on a solid log file to report any errors, it is pivotal to me that any changes to the logging behavior are completely airtight. Specifically, I only want the log level to be set to one of the known values. The code to do this for the argparse library and command line handling is relatively straightforward:

    available_log_maps = {
        "CRITICAL": logging.CRITICAL,
        "ERROR": logging.ERROR,
        "WARNING": logging.WARNING,
        "INFO": logging.INFO,
        "DEBUG": logging.DEBUG,
    }
...
    parser.add_argument(
        "--log-level",
        dest="log_level",
        action="store",
        help="minimum level required to log messages",
        type=PyMarkdownLint.log_level_type,
        choices=list(PyMarkdownLint.available_log_maps.keys()),
    )
...
    @staticmethod
    def log_level_type(argument):
        if argument in PyMarkdownLint.available_log_maps:
            return argument
        raise ValueError(f"Value '{argument}' is not a valid log level.")

The argument is added with a type argument that specifies the log_level_type function. This function is called with the provided value as its argument, either returning that object or raising a ValueError. The check itself is to simply compare the argument with the name of one of the keys in the available_log_maps dictionary. For me, this is simple, and it seemed like a good pattern to follow.
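For anyone who wants to try the pattern in isolation, here is a self-contained sketch of it. To keep it short, the dictionary and the validator are module-level names here (AVAILABLE_LOG_MAPS and log_level_type) rather than members of the PyMarkdownLint class, and the "example" program name is made up.

```python
import argparse
import logging

AVAILABLE_LOG_MAPS = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def log_level_type(argument):
    """Return the argument unchanged if it names a known log level."""
    if argument in AVAILABLE_LOG_MAPS:
        return argument
    raise ValueError(f"Value '{argument}' is not a valid log level.")


parser = argparse.ArgumentParser(prog="example")
parser.add_argument(
    "--log-level",
    dest="log_level",
    action="store",
    help="minimum level required to log messages",
    type=log_level_type,
    choices=list(AVAILABLE_LOG_MAPS.keys()),
)

# A known level is accepted and can be mapped to its numeric value.
args = parser.parse_args(["--log-level", "DEBUG"])
numeric_level = AVAILABLE_LOG_MAPS[args.log_level]

# An unknown level makes argparse report a usage error and exit.
try:
    parser.parse_args(["--log-level", "LOUD"])
    was_rejected = False
except SystemExit:
    was_rejected = True
```

Because argparse turns the ValueError into a usage error and exits, a bad value never makes it into the rest of the application.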

To start following that pattern, I added the valid_value_fn argument to the get_property family of functions, defaulting to None. Once that was done, the other modification to that function was simple:

    is_eligible = type(found_value) == property_type
    if is_eligible and valid_value_fn:
        try:
            valid_value_fn(found_value)
        except Exception as this_exception:
            is_eligible = False

Added right after the type check, if the valid_value_fn is set, the function will be called to validate the found_value, catching any exceptions that are thrown. Once again, I looked at a few other options, but simplicity won out again.

But having finished that change, something still was not right. Something was a bit off. I was not sure what though.

Adding Strict Mode to Validation¶

Using the log level case as a baseline scenario, I thought through variations on that scenario and came up short. There was just something missing that I had not covered yet. After a bit of thinking and working through scenarios, it finally came to me: I needed a strict mode.

There are times where I want a call to the get_property function to return a value no matter what, but there are other times where I want an exception thrown. In the case of fetching the log level, if there is a problem with the value, I want to follow the argparse example and throw an exception that halts the application at that point. I needed something to toggle between this strict behavior and the more relaxed behavior.

To accomplish this, I added the strict_mode argument to the function with a default of False. I then added this code snippet:

    is_eligible = type(found_value) == property_type
    if not is_eligible and strict_mode:
        raise ValueError(
            f"The value for property '{property_name}' must be of type '{property_type.__name__}'."
        )
    if is_eligible and valid_value_fn:

and this code snippet:

    except Exception as this_exception:
        is_eligible = False
        if strict_mode:
            raise ValueError(
                f"The value for property '{property_name}' is not valid: {str(this_exception)}"
            )

Using the log.level configuration item as an example, I do not want to default the log level if it is not a string or is not one of the valid strings. In those cases, I want a clear message telling me what the issue is. For me, this completed the validation. I could strictly enforce the validation, request a value from the configuration file and know it was not fetched (checking default=None), or request a value and use the default if anything was wrong.
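Putting the pieces so far together, the following is a sketch of how the relaxed and strict behaviors differ for a bad log.level value. As before, it is written as a free function over a plain dictionary to stay self-contained, which is an assumption of the sketch and not how the class is laid out.

```python
def get_property(flat_map, property_name, property_type, default_value=None,
                 valid_value_fn=None, strict_mode=False):
    """Fetch a property, optionally validating it and failing loudly."""
    property_name = property_name.lower()
    if property_name not in flat_map:
        return default_value
    found_value = flat_map[property_name]
    is_eligible = type(found_value) == property_type
    if not is_eligible and strict_mode:
        raise ValueError(
            f"The value for property '{property_name}' must be of type "
            f"'{property_type.__name__}'."
        )
    if is_eligible and valid_value_fn:
        try:
            valid_value_fn(found_value)
        except Exception as this_exception:
            is_eligible = False
            if strict_mode:
                raise ValueError(
                    f"The value for property '{property_name}' is not valid: "
                    f"{this_exception}"
                ) from this_exception
    return found_value if is_eligible else default_value


def log_level_type(argument):
    if argument in ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"):
        return argument
    raise ValueError(f"Value '{argument}' is not a valid log level.")


flat_map = {"log.level": "LOUD"}

# Relaxed: the bad value is quietly replaced by the default.
relaxed_value = get_property(flat_map, "log.level", str, "WARNING", log_level_type)

# Strict: the same bad value stops everything with a clear message.
try:
    get_property(
        flat_map, "log.level", str, valid_value_fn=log_level_type, strict_mode=True
    )
    strict_error = None
except ValueError as this_error:
    strict_error = str(this_error)
```

In this sketch, a property that is missing entirely still returns the default, even in strict mode.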

I believe this now covers all the scenarios for validation. After double checking that the scenario tests were all up to date and passing cleanly, there were just a couple of small things left to handle.

Almost There¶

Looking at what I had, I believed that the ApplicationProperties class was almost there, but I needed to add two small things to make it complete. The first of those things was an is_required argument to allow me to state my intention that a property value must be present. That was followed by code that was added near the end of the function:

    elif is_required:
        raise ValueError(
            f"A value for property '{property_name}' must be provided."
        )

With that in place, I just wanted to bulletproof the main function get_property. To do that, I added the following code at the start of the function:

    if not isinstance(property_name, str):
        raise ValueError("The property_name argument must be a string.")
    if not isinstance(property_type, type):
        raise ValueError(
            f"The property_type argument for '{property_name}' must be a type."
        )
    if default_value is not None and type(default_value) != property_type:
        raise ValueError(
            f"The default value for property '{property_name}' must either be None or a '{property_type.__name__}' value."
        )

While this may not be Pythonic, as the types of arguments are being checked, I think it is necessary. As a developer, any reliance that I have on low level functions and libraries requires trust. I trust that those functions will let me know as quickly as possible if I mess up. For me, these parameter checks are just that.

Ordering¶

Out of the initial five items, there was now only one remaining: overriding from the command line or environment. While I did not integrate direct support for this into the ApplicationProperties class, I did start by setting up a process to follow for any new configuration items. This process is simply a “ladder” to follow when checking configuration values from multiple sources. That ladder is as follows:

command line
configuration file
default value

While I might add code to integrate ApplicationProperties and argparse together in the future, for right now it was manually done. Using the log.level item as an example, the code to properly fetch it was as follows:

    effective_log_level = args.log_level if args.log_level else None
    if effective_log_level is None:
        effective_log_level = self.__properties.get_string_property(
            "log.level", valid_value_fn=PyMarkdownLint.log_level_type
        )
    if effective_log_level is None:
        effective_log_level = self.default_log_level

This follows the process to the letter. The first line checks to see if the log_level field is provided from the command line, either using the present value or setting it to None. If that variable is None, then the function uses the get_string_property function with the log_level_type validator to fetch the log level from the configuration. Implied in that call is that if the value is not present, the default value is None. Finally, if neither of those actions resulted in assigning the effective_log_level variable a non-None value, the default_log_level member variable is used to set a default.

That process is what I want, but it seems a bit long. Maybe I will see about shortening it in the future, but for right now, it is exactly what I want.
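One possible way to shorten the ladder, offered purely as a sketch and not something that exists in the project, is a small helper that returns the first value that is not None. The first_available name, the dictionary standing in for the properties object, and the "WARNING" default are all assumptions for illustration.

```python
def first_available(*candidates):
    """Return the first candidate that is not None, or None if all are."""
    for candidate in candidates:
        # Callables are only invoked when the ladder actually reaches them.
        value = candidate() if callable(candidate) else candidate
        if value is not None:
            return value
    return None


cli_log_level = None  # stands in for args.log_level
config = {"log.level": "INFO"}  # stands in for the properties object

effective_log_level = first_available(
    cli_log_level,  # 1. command line
    lambda: config.get("log.level"),  # 2. configuration file
    "WARNING",  # 3. default value
)
```

Wrapping the configuration lookup in a lambda means it is skipped entirely when the command line already supplied a value. Unlike the first line of the code above, this sketch only treats None as missing, so an empty string from the command line would be used as-is.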

Providing Constrained Access To Subsystems¶

Finally, before I could finish implementing this in the PyMarkdown project, I needed to handle the scenario of passing off a subsection of the configuration to a subsystem. This was important to me because I wanted to make sure that there was absolutely no chance of overlap between the main application configuration and the configuration of any plugin.

To do this, I created a new class called ApplicationPropertiesFacade that takes as arguments the root ApplicationProperties instance and a property_prefix that is specific to the section to isolate. This class is a facade, as the name suggests. To that extent, this class spends most of its code handing off responsibility for satisfying the requests by passing them on to the root instance with the property_prefix prepended to the key string. While this may seem simplistic, it gets the job done remarkably well.
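The idea can be sketched in a few lines. Both classes below are hypothetical stand-ins written for this example (SimpleProperties for the root instance and PropertiesFacade for the facade), and the plugins.md001. prefix is made up; the real classes carry more functions than this.

```python
class SimpleProperties:
    """Hypothetical stand-in for the root properties instance."""

    def __init__(self, flat_map):
        self.__flat_map = flat_map

    def get_property(self, property_name, property_type, default_value=None):
        found_value = self.__flat_map.get(property_name.lower())
        if type(found_value) == property_type:
            return found_value
        return default_value


class PropertiesFacade:
    """Expose only the properties that live under a given prefix."""

    def __init__(self, base_properties, property_prefix):
        self.__base_properties = base_properties
        self.__property_prefix = property_prefix

    def get_property(self, property_name, property_type, default_value=None):
        # Every request is redirected to the prefixed key on the root.
        full_name = f"{self.__property_prefix}{property_name}"
        return self.__base_properties.get_property(
            full_name, property_type, default_value
        )


root = SimpleProperties({"log.level": "INFO", "plugins.md001.enabled": True})
plugin_view = PropertiesFacade(root, "plugins.md001.")

is_enabled = plugin_view.get_property("enabled", bool)
# The plugin cannot reach outside of its own section.
leaked_level = plugin_view.get_property("log.level", str)
```

Because the prefix is always added, a request for log.level through the facade turns into a lookup of plugins.md001.log.level, which does not exist.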

With the work done on this class, its incorporation into the PyMarkdown project, and all tests passing, it was time to wrap up the work on this item.

What Was My Experience So Far?¶

While I have written many of these types of classes in many languages over the years, I really am happy about how this one turned out. The way I know that I am happy about it as a library-type class is that it does one thing simply but is easily extendable to do slightly more complex things. This class does not try to answer for all the responsibilities of fetching configuration information; it focuses on getting that information from a file. It is small, it is light, and it accomplishes its goals in what I believe is a clean manner. What is there not to like?

But as much as I was confident that I had the right fit for the configuration file, I now found the command line interface lacking. I was going to need to take some time to clean that up. I was also aware that the initial release is very close now. There are two other things in the way of a good first release: cleaning up some important issues and having a good release story. I was hoping to fix the first as soon as possible and to research the second before I needed to start on it. Here we go!

What is Next?¶

With the ApplicationProperties class coded and working fine, I realized that I needed to up my game for the command line. Therefore, the next thing I worked on was the command line of the project.

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.
