Summary

In my last article, I talked about drifting back to the PyMarkdown project for a while and talking about what is happening there. One of the things that I latched on to was Python Type Hints, which is what I dug into this past week.

Introduction

Things have been very time-consuming in my life recently. Perhaps in the next couple of months, I will write an article about it. Who knows? The effect of those things happening is that I did not publish an article last week. I was about 60% done with a first draft of the article, and I just ran out of steam. Nothing bad really happened, it was just that life was very draining for a couple of weeks, and it caught up to me.

When it came down to it, I just didn’t feel that I could do my best to complete the article. Quite simply, I did not feel that I could author an article that I would be proud of. So, I decided to wait to complete it. And as the week went on, things in my life resolved and lightened up considerably.

So here goes. Sorry for the wait.

Python Is…

One of the wonderful things that I like about the Python programming language is that it is dynamically typed. As someone who is creative, I often find that my creativity in other languages is slowed down by trying to figure out what types to use in what scenarios. Not that types are a bad thing, I just find that using a strict type system often hinders my creative process.

As I understand it, one of the cornerstones of the Python language is that it is the behavior that dictates the language, not the other way around. To illustrate that understanding, consider two separate objects that are both initialized to have a self.my_var member variable. The underlying assumption is that if I have a function whose behavior is to act on that variable, it does not matter which object or which instance of a class holds that member variable. The passing of that object into the function is behavior enough.
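To make that concrete, here is a minimal sketch of the idea. The class names Heading and Paragraph and the function show are made up for this illustration and are not part of PyMarkdown.

```python
class Heading:
    def __init__(self):
        self.my_var = "heading"


class Paragraph:
    def __init__(self):
        self.my_var = "paragraph"


def show(thing):
    # Nothing here checks which class was passed in. The only
    # requirement is that the object has a my_var member variable.
    return f"my_var={thing.my_var}"


print(show(Heading()))
print(show(Paragraph()))
```

The two classes share no base class and no declared interface, yet both work with the same function.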

And that kind of thinking, that “we are adults in the room”, permeates a lot of Python. There are no interfaces as in Java and C#. Because of the loose typing, there is no need for generics. Things just work.

But using a language that does not have strict typing also has its share of problems. For one, since everything looks the same, any kind of function hinting in source editors struggles to figure out the proper set of functions to display. Say a variable was initialized to None, then assigned an int, and then a str. What function hints should the editor display and when? It is situations like that where things get complicated.
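A small sketch of that situation, using a made-up variable named value. The set of functions that make sense on the variable changes every time it is reassigned, which is what an editor has to keep up with.

```python
history = []

value = None                      # starts out with no value at all
history.append(type(value).__name__)

value = 42                        # now an int
history.append(type(value).__name__)
bits = value.bit_length()         # only makes sense while value is an int

value = "forty-two"               # now a str
history.append(type(value).__name__)
shout = value.upper()             # only makes sense while value is a str

print(history)
```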

Introducing Python Type Hints

While I have used Python for over four years now, it was only in the last couple of months that I came across Python Type Hints. To be clear, the name is exactly as it sounds. These are hints, not mandates, for how to interpret types within the Python language. To use a simple example, consider this Python code:

def my_function(do_something):
    return "something"

It is not really that interesting, which makes it easy to read. Now compare it to the following code, from PyMarkdown’s main.py module:

    def main(self):
        """
        Main entrance point.
        """
        args = self.__parse_arguments()
        self.__set_initial_state(args)

        new_handler, total_error_count = None, 0
        try:
            self.__initialize_strict_mode(args)
            new_handler = self.__initialize_logging(args)

            self.__handle_plugins_and_extensions(args)

            POGGER.info("Determining files to scan.")
            files_to_scan, did_error_scanning_files = self.__determine_files_to_scan(
                args.paths, args.recurse_directories
            )
            if did_error_scanning_files:
                total_error_count = 1
            else:
                self.__initialize_parser(args)

                self.__handle_main_list_files(args, files_to_scan)

                for next_file in files_to_scan:
                    try:
                        self.__scan_file(args, next_file)
                    except BadPluginError as this_exception:
                        self.__handle_scan_error(next_file, this_exception)
                    except BadTokenizationError as this_exception:
                        self.__handle_scan_error(next_file, this_exception)
        except ValueError as this_exception:
            formatted_error = f"Configuration Error: {this_exception}"
            self.__handle_error(formatted_error, this_exception)
        finally:
            if new_handler:
                new_handler.close()

        if self.__plugins.number_of_scan_failures or total_error_count:
            sys.exit(1)

What makes this function readable is that most of the heavy lifting has been delegated to the eleven functions that the main function itself calls. But look at the call to the __determine_files_to_scan function. What kind of arguments does that function require? What kind of return values does that function produce? Without good answers to those questions, using that function creates some hurdles that we must jump over. But what if we could reduce those hurdles a bit?

That is where type hints come in. Having already been modified during this past week, that function definition currently looks like this:

    def __determine_files_to_scan(
        self, eligible_paths: List[str], recurse_directories: bool
    ) -> Tuple[List[str], bool]:
        ...
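For anyone who wants to experiment with that signature outside of the project, here is a self-contained sketch. The body is a hypothetical stand-in that I made up for illustration, not PyMarkdown's implementation: it simply keeps the paths ending in .md and never touches the file system. The recurse_directories parameter is accepted only to mirror the signature.

```python
from typing import List, Tuple


def determine_files_to_scan(
    eligible_paths: List[str], recurse_directories: bool
) -> Tuple[List[str], bool]:
    """Hypothetical stand-in body, written only to exercise the signature."""
    # Keep anything that looks like a Markdown file, in a stable order.
    files_to_scan = sorted(path for path in eligible_paths if path.endswith(".md"))
    # In this sketch, finding nothing to scan counts as an error.
    did_error_scanning_files = not files_to_scan
    return files_to_scan, did_error_scanning_files


files, did_error = determine_files_to_scan(["b.md", "notes.txt", "a.md"], False)
print(files, did_error)
```

With the hints in place, the questions about the arguments and the return values are answered by the definition itself: a list of strings and a boolean go in, and a tuple holding a list of strings and a boolean comes out.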

What Are Type Hints Used For?

To be redundantly clear, Python type hints are exactly that, hints. Any Python code with type hints is interpreted by Python interpreters without using those type hints. However, packages like the Mypy package and the Python extension for VSCode can be used to take advantage of those type hints.
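A quick way to see that for yourself is the sketch below. The repeat function is made up for this example. The hints say str and int, but the interpreter happily runs the function with a list instead.

```python
def repeat(text: str, times: int) -> str:
    return text * times


# Matches the hints.
as_hinted = repeat("ab", 2)

# Does not match the hints, but the interpreter does not check them,
# so this runs without complaint. A type checker is what would flag it.
not_as_hinted = repeat([1], 2)

print(as_hinted, not_as_hinted)
```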

In the case of VSCode, the use of those type hints allows it to provide better feedback to someone editing Python code. Having started to apply type hints to my Python code, I can already see the benefits as I am editing that same code. Instead of me having to search through the code to find the definition of the function to determine how to interact with it, most of that information is displayed in a handy popup that appears when I add the open parenthesis for the function call. At the top are any type hints that I provided, and at the bottom is any documentation that I added with a docstring. That is useful.

The other strong use that I have found for type hints is with the Mypy type checking tool. From my understanding of this process, Mypy builds a model of what is being scanned, including any type hints. It then compares the usage of variables and functions to what it believes to be the correct typing, emitting warnings when the usage does not match what it believes to be the correct types.
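As a rough sketch of the kind of mismatch I mean, consider the following. The names counts, record, and total are made up for this example. The lines that I would expect a checker such as Mypy to report are left as comments so that the file still runs as written.

```python
from typing import Dict

# Declared once: string keys, integer values.
counts: Dict[str, int] = {}


def record(name: str) -> None:
    counts[name] = counts.get(name, 0) + 1


def total() -> int:
    return sum(counts.values())


record("rule")
record("rule")
print(total())

# Usage that does not agree with the declared types:
#   counts["rule"] = "two"   # a str stored where an int was declared
#   record(42)               # an int passed where a str was declared
```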

In its normal mode, this is useful in making sure that there is consistent usage of variables and types throughout a module. But with that, there is a price to pay.

Paying The Price

My first stab at adding type hints started fifteen days ago with this commit. It was not anything grand, but it was a start. Just some simple changes to see how things work. From there, I started to pick off individual modules, converting them where possible.

Basically, find something to work on, fix it, validate it, and move on. Then repeat that pattern many, many times. I found that I was shifting between responding to the result of this command line:

pipenv run mypy --strict pymarkdown stubs

to figure out what the next change to tackle was, and this command line:

pipenv run mypy pymarkdown stubs

to resolve any fallout from those changes. Then, when I thought I had a good chunk of work changed, I would enter this command line:

ptest -m

to execute the complete set of tests to verify the changes worked. And it just kept on going. The good news is that I started with over 4400 issues found with strict mode enabled, and now that is down to 1147 issues. Getting close to being done, but not there yet.

And it is a very costly process to add type hints to an already existing project.

Cost 1: Optional Values

One of the things that I did not think about while writing the Python for the PyMarkdown project was my use of the None value. Instead of producing good default values to denote that some action was not performed, I just assigned the None value to the corresponding variable.

What was wrong with that? Nothing at the time. But when I added type hints, it forced me to use the Optional type around the type of the variable. Instead of using a type like int, I found that I had to use the type Optional[int]. And that brought a whole lot of extra effort to adding type hints. Because that type literally means that the value can be a valid integer or the value None, in places where it was introduced, I had two options.

The first option was to use the type as it was, returning Optional[int] from that function if needed. But then, I was just delaying the evaluation of the second option. If I needed to return an int or pass an int while calling another function, the second option was to do something to remove the Optional around the type. In a handful of cases, a well-placed if variable: statement was used to do just that. But in most cases, the variable was only wrapped in an Optional type to cover the rare occurrence that it would be set to None. In those cases, I simply preceded any reference to that variable with assert variable is not None.
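Here is a minimal sketch of both ways of removing the Optional. All the function names are made up for this example. One thing to watch for with integers: a plain if variable: check also rejects a valid value of 0, so the sketch uses an explicit is not None comparison instead.

```python
from typing import List, Optional


def find_index(items: List[str], target: str) -> Optional[int]:
    for index, item in enumerate(items):
        if item == target:
            return index
    return None


def describe_position(index: int) -> str:
    # This function wants a real int, never None.
    return f"position {index}"


def narrowed_by_if(items: List[str], target: str) -> str:
    found = find_index(items, target)
    if found is not None:
        # Inside this branch, a type checker treats found as an int.
        return describe_position(found)
    return "missing"


def narrowed_by_assert(items: List[str], target: str) -> str:
    found = find_index(items, target)
    # For the cases where None should never happen in practice.
    assert found is not None
    return describe_position(found)


print(narrowed_by_if(["a", "b"], "a"))
print(narrowed_by_assert(["a", "b"], "b"))
```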

That was painful but it was what was needed. The painful part was that I needed to then run the ptest -m command to make sure that all paths through the code were not triggering that assertion. And while I would like to say I got them all on the first try, that was not the case. In those cases, it was reset and try again.

Cost 2: Getting Types Wrong in Tokens

So far, it has only happened two times, but having a single variable that contains multiple types is not desirable. That practice is very undesirable when those variables are member variables of Markdown tokens. Because each basic scenario test validates the list of Markdown tokens produced by the parser, in addition to verifying the HTML produced by applying a transformation to those tokens, any change in what is being tokenized can require many scenario tests to change.

It is through sheer luck that these have been kept to a minimum. In one case, it resulted in the correction of three scenario tests. In the other case, over three hundred scenario tests needed to be fixed. The issue? In some of the failure cases, the variable was set to the string value of "" instead of the boolean value of False. While both values are treated as false when used in a Python conditional, enforcing a bool type meant changing many empty string values in that particular token to a False boolean value.
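The difference is easy to miss, so here is a small sketch of it. How the PyMarkdown tokens actually write out that field is not shown here, so repr is used purely as an illustration of how two values that behave the same in a conditional can still look different when written out as text.

```python
empty_text = ""
flag = False

# Both are treated as false in a conditional...
both_falsy = (not empty_text) and (not flag)

# ...but they are not equal, and they are not the same type.
are_equal = empty_text == flag
same_type = type(empty_text) is type(flag)

# Written out as text, they also look different.
rendered = (repr(empty_text), repr(flag))

print(both_falsy, are_equal, same_type, rendered)
```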

Cost 3: Resolving Cycles

Another cost is making sure that each module agrees on which other modules it needs to import, and in which order to import them. This is particularly important because Python cannot resolve import cycles on its own. As soon as one module tries to import a name from another module that has not finished initializing, the import fails with an ImportError. From my research, there are two ways to get around this.

The first way is to make sure that the functions and classes are cleanly organized so that cycles do not occur. A good example of this is classes like InlineRequest and InlineResponse. These two classes used to be contained in the InlineProcessor module before this work started. However, there were a small number of cases where the InlineProcessor module imported another module that also needed the InlineRequest or InlineResponse classes. As such, that other module would try to import those classes from the InlineProcessor module, causing a cycle to occur.

The solution? By lifting those two classes out of the InlineProcessor module and into their own modules, other modules could then import those two classes without importing the InlineProcessor module. This was the easiest way to break the cycle.

But in some cases, this was not possible. While things worked properly when typing was not in the picture, there were a small number of cases where the import statement was needed with no way to reduce the cycle. That is where the second method of avoiding cycles comes in. For these cases, the following pattern was used:

from typing import TYPE_CHECKING

...

if TYPE_CHECKING:  # pragma: no cover
    from pymarkdown.tokenized_markdown import ParseBlockPassProperties

It took me a while to find this, but it works cleanly. The TYPE_CHECKING constant in the typing module was added as a work-around for exactly this kind of complication. It is False at runtime and is treated as True by type checkers such as Mypy, so the above import is only seen when type checking is being performed.

How does this help? As the code works fine without the type hints, the above import is only being done to ensure that the type hints are correct. As type hints are not necessary for the normal interpretation of the Python code, the interpreter can safely disregard any type hints. Therefore, as the type hints are only used for type checkers like Mypy, the TYPE_CHECKING constant is used to detect if that is the current evaluation mode of the module, only importing if that is the case.
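One detail worth spelling out: because the guarded name is never imported at runtime, any type hint that uses it needs to be written as a quoted string. Here is a minimal sketch of that. To keep it self-contained, Fraction from the standard library stands in for a class like ParseBlockPassProperties, and the describe function is made up for this example.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # pragma: no cover
    # Only seen by type checkers. At runtime this import never happens.
    from fractions import Fraction


def describe(value: "Fraction") -> str:
    # The hint is quoted, so the interpreter never has to look up Fraction.
    return f"value is {value}"


print(TYPE_CHECKING)
print(describe(3))
```

At runtime, the function works with whatever it is handed. The type checker is the only one that cares that the hint says Fraction.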

It took me a while to get behind it. But from my experience, it is both simple and brilliant at the same time. It is a good example of cleanly dividing responsibilities.

How Is the Work Coming Along? Is It Worth It?

As I mentioned earlier, Mypy has gone from detecting issues in the 4500-issue range to the 1100-issue range, with that number decreasing with each change. It is a prolonged process, but I feel that it is a good one. I can already see benefits in VSCode when I am editing the project. On top of that, it has helped me to clarify some of the implementation details that I have done over the course of the last two years. Those are good things.

But do not get me wrong, it is a slog. This effort does not erase any technical debt and does not make the project function any better. What it is doing for me is making the code more maintainable and more readable. While it is not erasing any technical debt, it is improving the quality of the project. That is not as quantifiable to others, but to me it is an important distinction.

And while others might not understand that difference, it is good enough for me.

What is Next?

With over 75% of the project converted, the best answer that I can give right now is that I hope to have the conversion finished by the end of the week. Not sure if it is possible, but I am going to try for it! Stay tuned!


