Summary
In my last article, I talked about drifting back to the PyMarkdown project for a while and what is happening there. One of the things that I latched on to was Python Type Hints, which is what I dug into this past week.
Introduction
Things have been very time-consuming in my life recently. Perhaps in the next couple of months, I will write an article about it. Who knows? The effect of those things happening is that I did not publish an article last week. I was about 60% done with a first draft of the article, and I just ran out of steam. Nothing bad really happened, it was just that life was very draining for a couple of weeks, and it caught up to me.
When it came down to it, I just didn’t feel that I could do my best to complete the article. Quite simply, I did not feel that I could author an article that I would be proud of. So, I decided to wait to complete it. And as the week went on, things in my life resolved and lightened up considerably.
So here goes. Sorry for the wait.
Python Is…
One of the wonderful things about the Python programming language is that it is dynamically typed: I do not have to declare types up front. As someone who is creative, I often find that my creativity in other languages is slowed down by trying to figure out what types to use in what scenarios. Not that types are a bad thing; I just find that strict type systems often hinder my creative process.
As I understand it, one of the cornerstones of the Python language is that it is the behavior that dictates the language, not the other way around. To illustrate that understanding, consider two separate objects that are both initialized to have a self.my_var member variable. The underlying assumption is that if I have a function whose behavior is to act on that variable, it does not matter which object or which instance of a class holds that member variable. The passing of that object into the function is behavior enough.
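As a minimal sketch of that idea, the two classes and the function below are hypothetical names invented for illustration. The function never asks what kind of object it was handed; it only relies on my_var being there.

```python
class Sensor:
    def __init__(self):
        self.my_var = 10


class Counter:
    def __init__(self):
        self.my_var = 32


def bump(holder):
    """Act on my_var without caring which class holds it."""
    holder.my_var += 1
    return holder.my_var


print(bump(Sensor()))   # prints 11
print(bump(Counter()))  # prints 33
```

Neither class inherits from a shared base or declares an interface. Having the member variable is all the function needs.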
And that kind of thinking, that “we are adults in the room”, permeates a lot of Python. There are no interfaces as in Java and C#. Because of the loose typing, there is no need for generics. Things just work.
But using a language that does not have strict typing also has its share of problems. For one, since everything looks the same, any kind of function hinting in source editors struggles to figure out the proper set of functions to display. Say a variable was initialized to None, then an int, and then a str. What function hints should the editor display and when? It is situations like that where things get complicated.
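That scenario is easy to reproduce. In this small sketch, the variable name is arbitrary and the same name is rebound three times, with the type changing each time.

```python
value = None
kinds = [type(value).__name__]

value = 42
kinds.append(type(value).__name__)

value = "forty-two"
kinds.append(type(value).__name__)

print(kinds)  # prints ['NoneType', 'int', 'str']
```

An editor looking at value has to work out which of those three types applies at the cursor before it can offer any suggestions.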
Introducing Python Type Hints
While I have used Python for over four years now, it was only in the last couple of months that I came across Python Type Hints. To be clear, the name is exactly as it sounds. These are hints, not mandates, for how to interpret types within the Python language. To use a simple example, consider this Python code:
def my_function(do_something):
    return "something"
It is not really that interesting, and therefore it is easy to read. Now compare it to the following code, from PyMarkdown’s main.py module:
def main(self):
    """
    Main entrance point.
    """
    args = self.__parse_arguments()
    self.__set_initial_state(args)
    new_handler, total_error_count = None, 0
    try:
        self.__initialize_strict_mode(args)
        new_handler = self.__initialize_logging(args)
        self.__handle_plugins_and_extensions(args)
        POGGER.info("Determining files to scan.")
        files_to_scan, did_error_scanning_files = self.__determine_files_to_scan(
            args.paths, args.recurse_directories
        )
        if did_error_scanning_files:
            total_error_count = 1
        else:
            self.__initialize_parser(args)
            self.__handle_main_list_files(args, files_to_scan)
            for next_file in files_to_scan:
                try:
                    self.__scan_file(args, next_file)
                except BadPluginError as this_exception:
                    self.__handle_scan_error(next_file, this_exception)
                except BadTokenizationError as this_exception:
                    self.__handle_scan_error(next_file, this_exception)
    except ValueError as this_exception:
        formatted_error = f"Configuration Error: {this_exception}"
        self.__handle_error(formatted_error, this_exception)
    finally:
        if new_handler:
            new_handler.close()
    if self.__plugins.number_of_scan_failures or total_error_count:
        sys.exit(1)
What makes this function readable is that most of the heavy lifting has been delegated to the eleven helper functions that the main function itself calls. But look at the call to the __determine_files_to_scan function. What kind of arguments does that function require? What kind of return values does that function produce? Without good answers to those questions, using that function creates some hurdles that we must jump over. But what if we could reduce those hurdles a bit?
That is where type hints come in. Having already been modified during this past week, that function definition currently looks like this:
def __determine_files_to_scan(
    self, eligible_paths: List[str], recurse_directories: bool
) -> Tuple[List[str], bool]:
    ...
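To show how a signature like that answers both questions at the call site, here is a toy stand-in with the same parameter and return types. The select_files function and its rules are assumptions invented for this sketch. It is not PyMarkdown's implementation, and it never touches the file system.

```python
from typing import List, Tuple


def select_files(
    eligible_paths: List[str], recurse_directories: bool
) -> Tuple[List[str], bool]:
    """Toy stand-in: returns (files to scan, whether an error occurred)."""
    files_to_scan: List[str] = []
    did_error = False
    for next_path in eligible_paths:
        if next_path.endswith(".md"):
            files_to_scan.append(next_path)
        elif next_path.endswith("/") and recurse_directories:
            # Pretend every directory holds exactly one README.md.
            files_to_scan.append(next_path + "README.md")
        else:
            did_error = True
    return files_to_scan, did_error


# The hints say what to pass in and how to unpack what comes back.
files, failed = select_files(["a.md", "docs/"], recurse_directories=True)
print(files, failed)  # prints ['a.md', 'docs/README.md'] False
```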
What Are Type Hints Used For?
To be redundantly clear, Python type hints are exactly that: hints. The Python interpreter runs code that contains type hints without enforcing them. However, tools like the Mypy package and the Python extension for VSCode can take advantage of those type hints.
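A quick way to see that the interpreter does not enforce hints is to call a hinted function with the "wrong" types. The repeat function below is a made-up example. The hints are stored on the function, where tools can read them, but nothing checks them at call time.

```python
def repeat(text: str, count: int) -> str:
    return text * count


# The hints say (str, int) -> str, yet passing a list runs without complaint.
print(repeat([1, 2], 2))  # prints [1, 2, 1, 2]

# The hints are still recorded and available to tools that want them.
print(repeat.__annotations__["count"])  # prints <class 'int'>
```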
In the case of VSCode, the use of those type hints allows it to provide better feedback to someone editing Python code. Having started to apply type hints to my Python code, I can already see the benefits as I am editing that same code. Instead of me having to do a search to find the definition of the function to determine how to interact with it, most of that information is displayed in a handy popup that appears when I add the open parenthesis for the function call. At the top are any type hints that I provided, and at the bottom is any documentation that I added with a docstring. That is useful.
The other strong use that I have found for type hints is with the Mypy type checking tool. From my understanding of this process, Mypy builds a model of what is being scanned, including any type hints. It then compares the usage of variables and functions to what it believes to be the correct typing, emitting warnings when the usage does not mirror what it believes to be the correct types.
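As a small illustration of the kind of mismatch a checker can catch, consider the sketch below. The count_words function is hypothetical. The second call runs without an exception but quietly produces the wrong answer, and a checker such as Mypy should report it as an incompatible argument type, because a str is passed where List[str] is declared.

```python
from typing import List


def count_words(lines: List[str]) -> int:
    """Count whitespace-separated words across all lines."""
    return sum(len(line.split()) for line in lines)


total = count_words(["one two", "three"])
print(total)  # prints 3

# Passing a bare string iterates over its characters instead of over lines.
# Nothing fails at runtime, but the result counts non-space characters.
letters = count_words("one two")
print(letters)  # prints 6
```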
In its normal mode, this is useful in making sure that there is consistent usage of variables and types throughout a module. But with that, there is a price to pay.
Paying The Price
My first stab at adding type hints started fifteen days ago with this commit. It was not anything grand, but it was a start. Just some simple changes to see how things work. From there, I started to pick off individual modules and convert them where possible.
Basically, find something to work on, fix it, validate it, and move on. Then repeat that pattern many, many times. I found that I was shifting between responding to the result of this command line:
pipenv run mypy --strict pymarkdown stubs
to figure out what the next change to tackle was, and this command line:
pipenv run mypy pymarkdown stubs
to resolve any fallout from those changes. Then, when I thought I had a good chunk of work changed, I would enter this command line:
ptest -m
to execute the complete set of tests to verify the changes worked. And it just kept on going. The good news is that I started with over 4400 issues found with strict mode enabled, and now that is down to 1147 issues. Getting close to being done, but not there yet.
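The gap between those two Mypy command lines comes from what each mode is willing to overlook. As I understand the Mypy documentation, the default mode does not report a function merely for lacking annotations, while the --strict flag turns on additional checks, including one that disallows untyped function definitions. A sketch with two hypothetical functions:

```python
from typing import List


def untyped_total(values):
    # No hints: accepted in the default mode, reported under --strict.
    return sum(values)


def typed_total(values: List[int]) -> int:
    # Fully hinted: acceptable in both modes.
    return sum(values)


print(untyped_total([1, 2, 3]), typed_total([1, 2, 3]))  # prints 6 6
```

Both functions behave identically at runtime. Only the checker's opinion of them differs, which is why a code base written without hints starts with such a high strict-mode count.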
And adding type hints to an already existing project is a very costly process.
Cost 1: Optional Values
One of the things that I did not think about while writing the Python for the PyMarkdown project was my use of the None value. Instead of producing good default values to denote that some action was not performed, I just assigned the None value to the corresponding variable.
What was wrong with that? Nothing at the time. But when I added type hints, it forced me to use the Optional type around the type of the variable. Instead of using a type like int, I found that I had to use the type Optional[int]. And that brought a whole lot of extra effort to adding type hints.
Because that type literally means that the value can be a valid integer or the value None, in places where it was introduced, I had two options.
The first option was to use the type as it was, returning Optional[int] from that function if needed. But then, I was just delaying the evaluation of the second option. If I needed to return an int or pass an int while calling another function, the second option was to do something to remove the Optional around the type. In a handful of cases, a well-placed if variable: statement was used to do just that.
But in most cases, the variable was wrapped in an Optional type only to cover the rare occurrence that it would be set to None. In those cases, I simply preceded any reference to that variable with assert variable is not None.
That was painful, but it was what was needed. The painful part was that I then needed to run the ptest -m command to make sure that no path through the code triggered that assertion. And while I would like to say I got them all on the first try, that was not the case. In those cases, it was reset and try again.
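Both ways of removing the Optional can be sketched in a few lines. The function names below are hypothetical and chosen only for illustration. One detail worth hedging on: for an Optional[int], a bare truthiness test also rejects the valid value 0, so this sketch uses an explicit comparison against None.

```python
from typing import List, Optional


def find_index(items: List[str], wanted: str) -> Optional[int]:
    """Return the position of wanted, or None when it is absent."""
    return items.index(wanted) if wanted in items else None


def describe(items: List[str], wanted: str) -> str:
    index = find_index(items, wanted)
    # The explicit check narrows Optional[int] to int inside the branch.
    if index is not None:
        return f"found at {index + 1}"
    return "missing"


def position_of_known(items: List[str], wanted: str) -> int:
    index = find_index(items, wanted)
    # The caller promises that wanted is present. The assert documents that
    # promise and lets a checker treat index as an int from here on.
    assert index is not None
    return index


print(describe(["a", "b"], "a"))           # prints found at 1
print(describe(["a", "b"], "z"))           # prints missing
print(position_of_known(["a", "b"], "b"))  # prints 1
```

The assert version trades a type error for a possible runtime failure, which is why every path through the code still has to be exercised by the tests.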
Cost 2: Getting Types Wrong in Tokens
So far, it has only happened two times, but having a single variable that contains multiple types is not desirable. It is especially undesirable when those variables are member variables of Markdown tokens. Because each basic scenario test validates the list of Markdown tokens produced by the parser, in addition to verifying the HTML produced by applying a transformation to those tokens, any change in what is being tokenized can require many scenario tests to change.
It is through sheer luck that these have been kept to a minimum. In one case, it resulted in the correction of three scenario tests. In the other case, over three hundred scenario tests needed to be fixed. The issue? In some of the failure cases, the variable was set to the string value of "" instead of the boolean value of False. While both values evaluate as false when used in a Python conditional, enforcing a bool type meant changing many empty string values in that particular token to a False boolean value.
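To see why such a small change ripples through so many tests, here is a sketch built around a made-up ToyToken class. Its text format is an assumption for illustration and is not how PyMarkdown serializes its tokens. The two values behave the same in a conditional but produce different text once they are written out.

```python
from dataclasses import dataclass


@dataclass
class ToyToken:
    is_loose: bool

    def serialize(self) -> str:
        return f"[toy:{self.is_loose}]"


# Dataclasses do not enforce hints at runtime, so the empty string slips in.
old_style = ToyToken("")      # a type checker would flag this line
new_style = ToyToken(False)

# Both values are falsy when used in a conditional...
print(bool(old_style.is_loose), bool(new_style.is_loose))  # prints False False

# ...but any test comparing serialized tokens sees two different strings.
print(old_style.serialize())  # prints [toy:]
print(new_style.serialize())  # prints [toy:False]
```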
Cost 3: Resolving Cycles
Another cost is making sure that each module agrees on which other modules it needs to import, and in which order to import them. This is particularly important as Python has no ability to resolve module cycles using forward references. Instead, when one module tries to import a name from another module that has not finished initializing, the interpreter raises an ImportError. From my research, there are two ways to get around this.
The first way is to make sure that the functions and classes are cleanly organized so that cycles do not occur. A good example of this is the InlineRequest and InlineResponse classes. These two classes used to be contained in the InlineProcessor module before this work started. However, there were a small number of cases where the InlineProcessor module imported another module that also needed the InlineRequest or InlineResponse classes. As such, that other module would try to import those classes from the InlineProcessor module, causing a cycle to occur.
The solution? By lifting those two classes out of the InlineProcessor module and into their own modules, other modules could then import those two classes without importing the InlineProcessor module. This was the easiest way to break the cycle.
But in some cases, this was not possible. While things worked properly when typing was not in the picture, there were a small number of cases where the import statement was needed with no way to reduce the cycle. That is where the second method of avoiding cycles comes in. For these cases, the following pattern was used:
from typing import TYPE_CHECKING
...
if TYPE_CHECKING:  # pragma: no cover
    from pymarkdown.tokenized_markdown import ParseBlockPassProperties
It took me a while to find this, but it works cleanly. With a good understanding of how type hints could get complicated, the authors of the typing module added the TYPE_CHECKING constant as a work-around. It is False at runtime and treated as True by type checkers such as Mypy, so the above fragment of code only imports when type checking is taking place.
How does this help? As the code works fine without the type hints, the above import is only being done to ensure that the type hints are correct. As type hints are not necessary for the normal interpretation of the Python code, the interpreter can safely disregard any type hints. Therefore, as the type hints are only used for type checkers like Mypy, the TYPE_CHECKING global variable is used to detect if that is the current evaluation mode of the module, only importing if that is the case.
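Here is a self-contained sketch of the same pattern, using the standard library's fractions module as a stand-in for a module that would otherwise cause a cycle. The halve function is hypothetical. Note one assumption this sketch depends on: because the guarded import never runs, the name only exists for the type checker, so the hints that use it are written as quoted strings.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # pragma: no cover
    # Seen by type checkers only; skipped when the program actually runs.
    from fractions import Fraction


def halve(value: "Fraction") -> "Fraction":
    # The quoted hints are not evaluated when the function is called,
    # so the name Fraction does not need to exist here.
    return value / 2


print(TYPE_CHECKING)  # prints False
print(halve(4))       # prints 2.0
```

Passing an int runs fine because the interpreter ignores the hint. A type checker reading the same file follows the guarded import and checks callers against Fraction.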
It took me a while to get behind it. But from my experience, it is both simple and brilliant at the same time. It is a good example of cleanly dividing responsibilities.
How Is the Work Coming Along? Is It Worth It?
As I mentioned earlier, Mypy has gone from detecting over 4400 issues to the 1100-issue range, with that number decreasing with each change. It is a prolonged process, but I feel that it is a good one. I can already see benefits in VSCode when I am editing the project. On top of that, it has helped me to clarify some of the implementation details that I have done over the course of the last two years. Those are good things.
But do not get me wrong, it is a slog. This effort does not erase any technical debt and does not make the project function any better. What it does do is make the code more maintainable and more readable, which improves the quality of the project. That is not as quantifiable to others, but to me it is an important quantification.
And while others might not understand that difference, it is good enough for me.
What is Next?
With over 75% of the project converted, the best answer that I can give right now is that I hope to have the conversion finished by the end of the week. Not sure if it is possible, but I am going to try for it! Stay tuned!