Way back in November 2019, I started this project with a bare-bones framework using a simple dynamic plugin loader. It was a simple proof of concept to determine whether I could create the basis for an extensible linter framework in Python. Once I verified that I could write that framework, I implemented a very simple case to test against: checking to make sure the provided Markdown text ends with an empty line. While that rule was easy to implement, it was when I looked for another rule to implement that I determined that to properly lint Markdown text, I needed a Markdown tokenizing parser. From my viewpoint, unless I had a parser that emitted tokens that were a good representation of the Markdown to lint, the linting rules that I wanted to write would require too much guesswork for my own liking. If I wanted to write good, solid rules, I needed to have the right information available for those rules to act upon. I needed a Markdown parser that emits Markdown tokens that I was confident would be the required, correct information for the rules.
Having now written such a parser against the GitHub Flavored Markdown specification, it was time to move on to the next part of the project: writing rules. However, since almost 5 months had passed since the project started, there were a few changes that were required in the linter’s core before I could continue.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 12 April 2020 and 16 April 2020, and the commit from 18 April 2020.
In my experience, following the threefold rule for refactoring is usually a good idea, as its wisdom has borne out true in my projects many times. While not a literal copy of the threefold rule, I choose to remember the rule as follows:
Write it once, write it neat. Write it twice, think about extracting it. Write it three times, extract it without a second thought.
The original text of “three strikes and you refactor” always seemed a little too harsh for me with its negative connotation. In addition, I feel that it does not provide good enough guidance on what to do in the first two cases, just the third one. My version of the rule still adheres to the spirit of the original rule, while fixing the specific issues that I perceive with it.
Source providers for the parser are a concept that fits that refactoring pattern very
well. When the
original framework for the parser was written, it was designed to parse a line at a
time to conserve memory. In the parser tests, this information is provided to the
TokenizedMarkdown class as a single string, with the internal functions of the
TokenizedMarkdown class breaking down that data
into individual lines for further processing. With the exception of a single
feature’s error case1, this design has proven to be very
useful in reducing the complexity of the parser.
It made sense to me to refactor this section of code when considering how to
add support for the second source of Markdown data: Markdown files.
Starting with the InMemorySourceProvider¶
Now that the project was moving into the rule-development phase, it was necessary to
ensure that it was just as easy to feed the parser information from a string as it was
to feed it information from a file. As the initial development
kept things neat, it was relatively simple to take the logic for grabbing the next line
and encapsulate it within the
InMemorySourceProvider class as follows:
```python
class InMemorySourceProvider:
    """
    Class to provide for a source provider that is totally within memory.
    """

    def __init__(self, source_text):
        self.next_token = source_text.split("\n", 1)

    def get_next_line(self):
        """
        Get the next line from the source provider.
        """
        token_to_use = None
        if self.next_token:
            if len(self.next_token) == 2:
                token_to_use = self.next_token[0]
                self.next_token = self.next_token[1].split("\n", 1)
            else:
                assert self.next_token
                token_to_use = self.next_token[0]
                self.next_token = None
        return token_to_use
```
This class contains very simple logic. When the instance of the class is initialized, it starts by breaking down the input text into a tuple. The first element of the resultant tuple contains the next line to be parsed and the second element of that same tuple contains the input text to be parsed in the immediate future. Once that calculation has been performed, the rest of the processing is relatively simple. When get_next_line is called and the tuple contains 2 elements, the first element is returned as the next line, and the next_token variable (for next time) is recalculated using the same expression as was used in the __init__ function. When get_next_line is called at the end of the file, the tuple contains only 1 element. At that point, that singular element is returned as the next line to be parsed, and the next_token variable is set to None to make sure we end the processing properly. Finally, when get_next_line is called and the tuple is set to None, there is nothing left to parse and None is returned, signaling that the provider has reached the end of its available text.
To be clear, this is the exact code that was in place for the duration of the parser
testing, just repackaged to be in a more reusable form. Its interface is plain and
simple: it either returns the next line as a string, or it returns the
None object if there are no more lines. Nothing fancy as a class either,
just a simple interface: one function to create the instance and get it setup, and one
function to read the next line.
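To make that interface concrete, here is a small usage sketch. It re-declares a minimal version of the class so the snippet runs on its own; the sample input string is my own, purely for illustration.

```python
# Minimal restatement of InMemorySourceProvider, for illustration only.
class InMemorySourceProvider:
    def __init__(self, source_text):
        self.next_token = source_text.split("\n", 1)

    def get_next_line(self):
        token_to_use = None
        if self.next_token:
            if len(self.next_token) == 2:
                token_to_use = self.next_token[0]
                self.next_token = self.next_token[1].split("\n", 1)
            else:
                token_to_use = self.next_token[0]
                self.next_token = None
        return token_to_use


# Two line terminators in the input produce three lines, the last one empty.
provider = InMemorySourceProvider("first line\nsecond line\n")
lines = []
next_line = provider.get_next_line()
while next_line is not None:
    lines.append(next_line)
    next_line = provider.get_next_line()
print(lines)  # → ['first line', 'second line', '']
```

Note how the trailing newline yields a final empty line rather than being silently dropped, which matters later when the parser checks for documents that end with a blank line.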
Continuing with the FileSourceProvider¶
By keeping things simple, creating the class
FileSourceProvider was almost as simple
as the refactoring to create the
InMemorySourceProvider class. While I want to keep
options open for future performance experimentation, I just needed something simple
for reading a file from the file system. Based on those qualifications, I came up with the following:
```python
class FileSourceProvider:
    """
    Class to provide for a source provider that is on media as a file.
    """

    def __init__(self, file_to_open):
        with open(file_to_open, encoding="utf-8") as file_to_parse:
            file_as_lines = file_to_parse.readlines()

        self.read_lines = []
        self.read_index = 0
        did_line_end_in_newline = True
        for next_line in file_as_lines:
            did_line_end_in_newline = next_line.endswith("\n")
            if did_line_end_in_newline:
                next_line = next_line[0:-1]
            self.read_lines.append(next_line)
        if did_line_end_in_newline:
            self.read_lines.append("")
```
Basically, open the file, read in the lines, and process the lines into the format
that we expect. The only tricky bit with the class’s
__init__ function was handling
line terminators properly. In fact, that is the only
purpose for the
did_line_end_in_newline variable, remembering if the current line
ended with a newline character before it is removed. Based on independent unit testing
of the class, I had problems with the characters at the end of the file, which adding
that variable and the final
if statement resolved cleanly. I am not sure if I feel that the
did_line_end_in_newline variable is a kludge or not, but I do feel that it
was the right thing to do in order to maintain the fidelity of the data being read in
from the file.
Because care was taken in the provider’s __init__ function to do all the necessary work up front, the get_next_line function is very basic:
```python
def get_next_line(self):
    """
    Get the next line from the source provider.
    """
    token_to_use = None
    if self.read_index < len(self.read_lines):
        token_to_use = self.read_lines[self.read_index]
        self.read_index += 1
    return token_to_use
```
While this function could be more complicated (or simplified depending on your
viewpoint), I feel that this is a good example of keeping things basic. The provider
reads the information into an array of strings during the
__init__ function, and this
function simply uses an index to iterate through and return each element of that
array. Nothing fancy for now, just some code that
is very functional. Fancy code can always be added later.
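A quick end-to-end sketch shows the provider preserving the trailing empty line from a file. It re-declares a minimal version of the class so it runs standalone, and the temporary-file contents are my own invention for illustration.

```python
import os
import tempfile


# Minimal restatement of FileSourceProvider, for illustration only.
class FileSourceProvider:
    def __init__(self, file_to_open):
        with open(file_to_open, encoding="utf-8") as file_to_parse:
            file_as_lines = file_to_parse.readlines()
        self.read_lines = []
        self.read_index = 0
        did_line_end_in_newline = True
        for next_line in file_as_lines:
            did_line_end_in_newline = next_line.endswith("\n")
            if did_line_end_in_newline:
                next_line = next_line[0:-1]
            self.read_lines.append(next_line)
        if did_line_end_in_newline:
            self.read_lines.append("")

    def get_next_line(self):
        token_to_use = None
        if self.read_index < len(self.read_lines):
            token_to_use = self.read_lines[self.read_index]
            self.read_index += 1
        return token_to_use


# Write a two-line Markdown file that ends in a newline, then read it back.
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as handle:
    handle.write("# heading\ntext\n")
provider = FileSourceProvider(handle.name)
lines = []
next_line = provider.get_next_line()
while next_line is not None:
    lines.append(next_line)
    next_line = provider.get_next_line()
os.remove(handle.name)
print(lines)  # → ['# heading', 'text', '']
```

The final `""` entry is the fidelity that the did_line_end_in_newline bookkeeping buys: both providers report the same three lines for two line terminators.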
Testing the Source Providers¶
To make sure both providers are adhering to the interface in the same way, I
added a decent number of tests in the
test_source_providers.py file. In all the
tests, the big thing that is being tested is if the source providers return the correct
lines given the correct input. If there are 2 line terminators in the input, each
provider must return 3 lines, even if the last one is empty. Every test is a variation
on that, thoroughly exercising each provider to ensure that both adhere to the
interface flawlessly. After all, if the parser gets bad input to tokenize, it cannot
help but produce bad output, even if it is only off by a line terminator.
Replacing Print with Log¶
This improvement was a long time coming: replacing all of the print statements with
log.debug statements. When I was developing the parser, adding a simple
print statement was always the path of least resistance. Why did I avoid using
log.debug statements from the beginning of development? I cannot say for sure.
Regardless of why I did it, the time to fix it was now. It was a mostly painless transition and didn’t take that long to accomplish. To the start of most files, I added the following import:
and at the start of many blocks of processing, I added the following line:

```python
logger = logging.getLogger(__name__)
```
Then, for each time I called the print function, like this example:

```python
print("Line:" + line_to_parse + ":")
```

I replaced it with the exact same arguments, just changing the name of the called function to logger.debug, as follows:

```python
logger.debug("Line:" + line_to_parse + ":")
```
After the initial changes to replace print with
log.debug, everything looked okay
until I ran the normal
clean script that I use with the project. This script is a
simple script to execute the
code formatter, the
linters, and the full set of tests for the project. When the script got to PyLint,
it seemed to go crazy, emitting lots of warning lines, each line essentially
being the same.
Reading the warnings carefully and looking at the source code, PyLint seemed to be complaining about each logging function call that involved concatenation. In each case where I was concatenating strings to arrive at what I wanted to log, PyLint raised a warning that I wasn’t doing it right. According to PyLint, the proper way to log a message for the above example is:
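The exact replacement line is not reproduced here, but based on PyLint’s logging-format-interpolation guidance, it would take this shape (the value of line_to_parse is a hypothetical stand-in):

```python
import logging

logger = logging.getLogger(__name__)
line_to_parse = "# A heading"  # hypothetical value, for illustration

# The format string and its argument are passed separately; the logging
# library applies the %-interpolation only if the message will be emitted.
logger.debug("Line:%s:", line_to_parse)
```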
Doing a bit more research, the reason for the warning is because the logging library
was purposefully created to be lazy. If the log level for a given call is not high
enough to cause the string to be logged, doing any kind of formatting or concatenation
on that string is wasted effort. Following that logic, the logger follows the same
conventions that are used with the percent character (
%) as the
string interpolation operator,
delaying the evaluation of the actual string until the logger determines
whether the specified string is actually going to be logged. Once a positive
determination has been made, the format and the arguments are applied to each other,
a resolved string is produced, and that string is then logged.
It took a while to go through each of those messages. I had to examine each concatenation sequence, break it down into its component parts, and verify my changes. Each string that was concatenated needed to be represented in either the format string or the arguments passed to the logger. It was slow and pedantic work, but in the end, I was happy to have a logging solution that was more performant than before.
Note that as a Python logging newbie, I am not 100% sure if I created more work for
frequently creating logging instances inside of the project’s static functions. It is
possible that I can get away with a static logger variable created in the module’s
namespace at the top of the file, and not worry about creating any other loggers within
the same file.
However, in all the examples I have seen to date, the logger is
either declared at the top of a simple file or within the
__init__ method of a
class. As a lot of the helper classes are a collection of functions that are labelled
@staticmethod annotation, I am not sure if one instance at the top of the
file is the correct way to go. While
it might be more effort than I really need, I am confident that I am covering all
the logging situations properly. If I learn something differently about logging, I
will come back and revisit it.
Better Error Reporting¶
When I initially proofed out the plugin architecture for the linter, I added a
BadPluginError class to help identify plugin issues and report them back to the
command line. Using this same pattern to integrate error handling for the parser into
the linter, I added the
BadParsingError class, raising this error when there were
exceptions raised during the parser’s tokenization of the Markdown text. A bit more
typing and a short while later, I had the
__handle_error function with refactored
content. This newly minted function reused the error handling meant for the
BadPluginError class to handle both the
BadPluginError class and the
BadParsingError class in the same manner.
With that tweak done, I had confidence that the error reporting was done, and I wouldn’t need anything more serious until far later in the project. That is what I foolishly thought until about 2 days later. After doing some work on implementing the first two rules, I realized that some things needed to be fixed with error reporting.
The first change I made was to change the name of the error class from
BadParsingError to
BadTokenizationError to reflect the problem area more accurately. While it is
more of a mouthful to say, it accurately describes that it is a problem with the
tokenization of the document, not a generic “parsing” issue. A good example of this is
when the
TokenizedMarkdown class is created. Upon creation, one of the things
that it does is load the
entities.json file from the resources directory, as detailed in an earlier article.
If that resource file is not
loaded properly, that code was changed to raise a
BadTokenizationError instead of
what it was previously doing. Without this file, parsing the Markdown text would still
be possible without any issues. But to properly tokenize the Markdown text, the parser
needs to know if the named character entities that are provided in the Markdown
document refer to valid named entities. It may
appear to be a semantic difference to some, but in my experience, it is the attention
to detail on little things like that which help improve the project’s maintainability.
In addressing the above case, I stumbled into the second issue: too many exit points.
While it is not shown in the commit for the
BadTokenizationError fix documented
in the paragraph above, the first pass at addressing that issue was messy. It caught
any raised error, did some bare-bones reporting, and then performed a
sys.exit(1) call to
stop the program with an error code of 1. Doing a quick search through the code, I
stopped counting once I hit the third instance of a
sys.exit(1) call in the code.
It was time for a refactor.
Beginning with the initial case that started the search, I took a quick look at the
various search results and came up with a good rubric to follow. If possible,
I would try and more correctly classify the error using one of the two existing error
classes, BadTokenizationError or
BadPluginError, which were already handled at the
top level by the
__handle_error function. If the error occurred in the
main.py file and didn’t fall into either of those categories, I would call the
__handle_error function directly, striving to give a clean and concise message on
what that error was and why it occurred. If I encountered any errors outside of those
parameters, I would save them for last and re-evaluate my rules.
Based on what I found in the project, the first two rules were enough, as I did
not find an error that could not neatly fit into one of those two categories.
Instead of a few different ways of exiting the linter with an error condition,
__handle_error was now the sole conduit for exiting with an error. I went over the
tests in the affected files to make sure everything looked consistent, and things were good!
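Sketched in miniature, the pattern looks something like the following. The class names follow the article, but the function bodies are my assumptions, not the project’s actual code; the real __handle_error lives in the project’s main.py and differs in detail.

```python
import sys


class BadPluginError(Exception):
    """Raised when a linting plugin fails to load or misbehaves."""


class BadTokenizationError(Exception):
    """Raised when the Markdown text cannot be properly tokenized."""


def handle_error(raised_error):
    # Single conduit: report the error, then exit with an error code of 1.
    print(f"{type(raised_error).__name__}: {raised_error}", file=sys.stderr)
    sys.exit(1)


def load_named_entities(resource_path):
    # Stub standing in for loading entities.json; it always fails here so
    # the example has an error to route through the conduit.
    raise BadTokenizationError(
        f"Named character entities in '{resource_path}' could not be loaded."
    )


# Route any well-classified error through the one exit point.
try:
    load_named_entities("resources/entities.json")
except (BadTokenizationError, BadPluginError) as this_error:
    try:
        handle_error(this_error)
    except SystemExit as exit_error:
        # Caught here only so the illustration can finish; the real
        # program would terminate with exit code 1.
        print(f"would have exited with code {exit_error.code}")
```

The payoff of funneling everything through one function is that there is exactly one place to audit for exit codes and message formatting.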
What Was My Experience So Far?¶
As I have mentioned in previous articles, I like refactoring, and this stretch of
refactoring was no
exception. Each of these tasks was a little task, but I felt better knowing that they
were addressed before I shifted my focus to writing the rules for the linter. The
change to using source providers would be pivotal in dealing with test sources (strings)
and live sources (files) as sibling concepts. Replacing the print statements with
log.debug calls was
also pivotal to using live sources, keeping the ability to debug what was going
on for experienced users, but not flooding a casual user with all of that information.
And having all of the error reporting go through one conduit just seems cleaner and more
concise to me. Getting these taken care of just felt like the right refactoring to
do at the right time.
I also realized that I enjoy the cadence I have established with refactoring. While there are short bursts where it is just adding new features, for the most part I have tried to switch off between refactoring and adding new features. I feel that for my development style, I have found a good balance between fixing things and adding things, even if it needs tweaking every so often. Perhaps it is because of that sense of balance that I have been able to focus on this project for so long without losing energy on it. For me, I believe it boils down to a feeling that this project is not about adding features as much as it is about continual improvement of the project towards its goal, taking my time to do things right as I go.
Looking forward, I am sure I am not done adding features, refactoring code, or learning something new and interesting.
What is Next?¶
Taking a bit of a jump back in time a couple of days from where I am leaving off, in the next article I am going to start talking about the rules that I am developing, and how their development proceeded. Please stay tuned!
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.