Summary¶
In my last article, I talked about applying my newfound knowledge of performance measurement in Python to the PyMarkdown project. In this article, I start to focus more on resolving the higher priority issues from the project’s issues list.
Introduction¶
When working on a project like the PyMarkdown project for as long as I have, there are things that you like doing, things that you do not mind doing, and things that you know you have to do. Most of the things that I do related to the project fall into the first two categories. But strangely enough, the closer I get to a potential release, the more I feel that things are moving into that third category.
Maybe it is just that after working on this project for over a year, I do not want it to go to the next level. Once I release the project, I need to promote it beyond my own blog, and that will take time. I must admit, I have a bit of a concern that people are going to look at my code and go “why did you do that?”.
Regardless, I am going to try and clear out the Priority 1 section as much as possible and get ready for a beta release as soon as I can. At least, that is what my plan is. Here is hoping I can stick to it!
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 25 Feb 2021 and 28 Feb 2021.
Performance or Readability?¶
After spending the last week working on performance improvements, I was left with one potential improvement that I did not get around to looking at: string concatenation. As I have explained in past articles, when I am solving an issue, I work the issue and fit things together as I go. The rigor that I have built into my development process allows me the freedom to do that, knowing that I will revisit the code at least once before I commit it to the repository. At that time, I know with certainty that I will give the code a more thorough looking over, guided by the knowledge that the algorithm that the code is implementing has been tested properly. That process is just built into how I develop projects. And it was because of that rigor that I felt that it was time to give string concatenation another look.
When it comes to string concatenation, I figured that if it ever became an issue, I would just give that code another look. As such, I just added strings together the normal way using the `+` operator. However, I started to suspect that concatenating strings in this way was an issue, so I surveyed the project code. In the process, I found that there are many places where strings are concatenated using the `+` operator and almost none where they are concatenated using any of the other methods. Performing some quick “off-the-cuff” performance tests, I was able to find only minuscule differences between the various methods of concatenating strings. As such, I was faced with the question of whether I wanted to refactor the code for readability rather than performance.
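For a concrete sense of what those quick checks can look like, here is a minimal sketch using only the standard library’s `timeit` module; the candidate expressions are my own illustrations, and the absolute timings will vary by machine and Python version.

```python
import timeit

# Compare common ways of building a small string from parts.
# The absolute timings vary by machine; what matters is how close they are.
setup = "a, b, c = 'one', 'two', 'three'"
candidates = {
    "plus": "a + ':' + b + ':' + c",
    "f-string": "f'{a}:{b}:{c}'",
    "join": "':'.join([a, b, c])",
}
for name, stmt in candidates.items():
    seconds = timeit.timeit(stmt, setup=setup, number=1_000_000)
    print(f"{name:>8}: {seconds:.3f}s")
```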
How Do You Define Readability?¶
Readability is an interesting concept in software development because it is often thought of as how easy it is for other people on the team to read the code. That definition becomes confusing on a project like PyMarkdown where I am the only one on the team. Technically speaking, if I can read my own code with decent ease, the code is readable. I would further argue that the above definition does not really define what readability is, just what it is within a limited scope.
From my point of view, readability is more than that. Having developed software professionally in more than six languages, I believe that readability is more fluid than that. Unless there is a business reason to define it otherwise, my general definition of readability is:
Readability is defined as the confidence that the development team has in the ability of someone, who is outside of the team and is familiar with the given language, to look at the code and to be able to clearly read and understand the code.
Now, obviously there are some caveats, but there it is. In the case of the PyMarkdown
project, my goal for readability is that I follow standard Python programming practices and
write performant Python code according to those practices. Following my definition of
readability, as stated above, I expect that a competent Python developer should be able to
look at the source code for the PyMarkdown project and read it without too many issues.
For me, string concatenation using the `+` operator is one of those issues that I believe impedes the readability of the code. From my viewpoint, there are two other methods of concatenation that are more readable because they are more concise.
So, purely from a position of wanting to increase the readability of the project, I decided to refactor all string concatenations to those two other forms. This meant I would need to switch every string concatenation over to one of the two more readable forms: f-strings and lists with `join` calls.
F-Strings¶
Added in Python 3.6, f-strings are a special type of string literal that combines the previous ways of formatting strings in a way that just makes sense. Take this example from the `container_block_processor` module:
```python
def __str__(self):
    return (
        "{ContainerIndices:ulist_index:"
        + str(self.ulist_index)
        + ";olist_index:"
        + str(self.olist_index)
        + ";block_index:"
        + str(self.block_index)
        + "}"
    )
```
For debug purposes, this string is composed of variables and constants woven together to create the resultant string. While it creates a string with the desired information, I often have a hard time reading strings formatted in that manner. It is a long sequence of operands over multiple lines, and it just seems broken up. Each of those breaks causes me a bit of confusion when I try to figure out which piece goes with which other piece. Therefore, when I scan those lines with my eyes, I find that I need to slow down and parse through the line mentally a couple of times before getting a good understanding of that function.
In comparison, here is that same code refactored to use an f-string:
```python
def __str__(self):
    return f"{{ContainerIndices:ulist_index:{self.ulist_index};olist_index:{self.olist_index};block_index:{self.block_index}}}"
```
For me, this is easier to read. While the string itself is longer and I may have to scroll to read it, it is one string that describes everything in order. The “curly-brace” characters `{` and `}` enclose code that computes a value to be added to the string. The sequences `{{` and `}}` are used to add a single one of those characters to the string. The f-string itself is denoted by simply adding an `f` character to the start of an otherwise normal string. It manages to keep the simple things simple.
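As a quick, self-contained illustration of that doubled-brace escaping rule:

```python
name = "ulist_index"
value = 3
# Doubled braces produce literal "{" and "}" characters; single braces
# interpolate the enclosed expression.
print(f"{{{name}:{value}}}")  # prints "{ulist_index:3}"
```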
While it initially took me a bit to warm up to f-strings, I now find them to be more readable by far. I can easily scan from left to right in the string and almost instantly know how the code is going to compose that string. But what about strings with a variable number of arguments? Or strings that have a separator that would be burdensome to use repeatedly? That is where lists come in.
Lists¶
While f-strings are very readable, they do not work well with a variable number of arguments. For those cases, a `list` of parts combined with the string `join` method is recommended. Consider the following code for the `debug_string` function of the `MarkdownToken` class:
```python
def debug_string(self):
    add_extra = (
        ":" + self.extra_data
        if self.extra_data or self.is_paragraph or self.is_blank_line
        else ""
    )
    column_row_info = (
        "(%d,%d)" % (self.line_number, self.column_number)
        if include_column_row_info and (self.line_number or self.column_number)
        else ""
    )
    return "[%s%s%s]" % (self.token_name, column_row_info, add_extra)
```
After various refactors of this function, I had replaced the variable parts of the string with ternary expressions and used a `%`-style format to stitch those values together. But it still looked disjointed to me. Even after changing the `%`-style format strings into f-strings, I was not happy with how it read. The part that was bothering me was that the optional parts of the string were not clearly represented as optional when I read them. Basically, if I just looked at the last line of that function, I felt that each one of those strings would contribute all the time, not optionally. I just felt that was not a good thing.
Therefore, I switched the optional parts of that concatenation to a `list`:
```python
def debug_string(self):
    debug_parts = ["[", self.token_name]
    if include_column_row_info and (self.line_number or self.column_number):
        debug_parts.append(f"({self.line_number},{self.column_number})")
    if self.extra_data or self.is_paragraph or self.is_blank_line:
        debug_parts.append(f":{self.extra_data}")
    debug_parts.append("]")
    return "".join(debug_parts)
```
For me, this reads cleaner than the previous iteration of this function. The start of the string and the end of the string are clearly stated. The optional parts format their data only if needed, and only then do they calculate and add those formatted results. For me, this is more readable because it clearly delineates between the optional parts of the string and the mandatory parts of the string.
But there are also other areas where I find using lists to concatenate strings useful. Take this code from one of the `__init__` functions for a token:
```python
str(indent_level)
+ MarkdownToken.extra_data_separator
+ extracted_whitespace
+ MarkdownToken.extra_data_separator
+ list_start_content
```
As with my statements above, I do not find this code easy to read. When I try to read this code, the `MarkdownToken.extra_data_separator` references take up most of my attention as I try to make sure each one is in the right place. Instead of focusing on what is between those strings, I find myself focusing on those strings. As such, I rewrote it using lists as:
```python
MarkdownToken.extra_data_separator.join(
    [str(indent_level), extracted_whitespace, list_start_content]
)
```
I find this format a lot easier to read. I can see that I have a `list` of items and that they will be joined using the `MarkdownToken.extra_data_separator` string. That fact is mentioned exactly once, and then I do not have to concern myself with it again. After I have read that, I can then clearly see the list of items that are going to be joined together, allowing me to focus on the contents of that list instead of the separator. For me, that is a win!
After Four Days¶
It took me roughly four days of evenings to go through the PyMarkdown code base and move every string concatenation over to one of the above methods. There were times during that effort when I felt like giving up, but I persevered. And once I got to the end of that huge effort, I almost immediately saw the payoff. Looking over the code that I had changed during those four days, it looked a lot more readable to me. In addition, finding string concatenations was a lot easier now. Instead of searching for the `+` operator, which is also used in numeric calculations, I could search for the sequence `f"` to find the f-strings. By searching for `.join(`, I was able to find both string concatenations and token list concatenations, but my naming practices make differentiating those two groups of concatenations easy.
It was a slog, but I believe my choice to improve readability with respect to string concatenations was the correct one. Not only could I find concatenations more easily than before, but when I read them, they were simply easier to understand. I just hope others think the code is more Pythonic and more readable as well.
Whittling Down The Issues List¶
With some of its items having been there for a long time, I decided that I needed to make a good dent in the issues list, specifically the items dealing with Hard Line Breaks. I had fixed some issues with Hard Line Break tokens in the last month or so, and I figured it would be a good time to continue that effort.
The first issue that I picked up on this theme was:
- hard break at start of text?
Looking through all the examples and scenarios that I had already implemented, I was astounded that I did not have any scenarios where there was a Hard Line Break element at the start of the line. After adding four very simple tests, I executed those tests, and the only issue was the firing of an `assert` statement in the `__verify_first_inline_paragraph` function of the consistency checks. That issue was remedied within 10 minutes, and I then moved on to the next issue.
Hard Breaks And Inline Types¶
The next item that I picked up regarding Hard Line Breaks was:
- hard break followed by each inline type
For this item, I looked through the code base and found most of the examples that I was looking for, but I felt that they were scattered all over the place. With an eye to resolving this item in the shortest time possible, leaving time for refactoring later, I simply went through the list of available inline tokens and created functions `test_hard_line_breaks_extra_03x` to `test_hard_line_breaks_extra_03i` for the one Hard Line Break sequence `\` and functions `test_hard_line_breaks_extra_04x` to `test_hard_line_breaks_extra_04i` for the other Hard Line Break sequence `{space}{space}`.
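For readers unfamiliar with the project’s scenario tests, here is a sketch of their general shape; the import path and the expected values are illustrative rather than copied from the repository.

```python
import pytest

# Import path assumed; `act_and_assert` is the project's scenario-test helper.
from .utils import act_and_assert


@pytest.mark.gfm
def test_hard_line_breaks_extra_03x():
    """Sketch only: a backslash Hard Line Break followed by an emphasis span."""
    # Arrange
    source_markdown = "foo\\\n*bar*"
    expected_tokens = []  # illustrative placeholder; the real tests list every token
    expected_gfm = "<p>foo<br />\n<em>bar</em></p>"

    # Act & Assert
    act_and_assert(source_markdown, expected_gfm, expected_tokens)
```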
Due to a lot of hard work in previous weeks, I was lucky in that I only had to make a small change to how the `__complete_inline_block_processing` function changed the `end_string` variable. Double-checking my work and running the scenario tests again, just to make sure, I resolved this item and moved on.
Hard Line Break Grab Bag¶
Apart from the “big” item itself, what was left was a bit of a grab bag of issues:
- hard break followed by 3 spaces, then emphasis
- variations of 52e with extra spaces after, etc.
- variation with start of 52e, then hard break, then something like start of 52e again i.e. make /x02 splitter come into affect
While I hopefully knew what I intended when I wrote those down, I was left a bit in the dark as I read them now. I knew that the lines that contain the sequence `/x02` deal with SetExt Heading tokens, as that is the only place that I used that sequence as a splitter. So, between what I knew and what I could guess, I came up with five new scenario tests that tried to capture the spirit of what I thought those statements meant.1
There was one issue that I needed to deal with after executing the new scenario tests. In the `InlineProcessor` class, the tokens were not being generated correctly when dealing with leading spaces on a line that was preceded by a Hard Line Break within the confines of a SetExt Heading element. After a bit of debugging, I was able to figure out the right settings for the three variables in question, and that was fixed. To balance that out, I needed to add code with the inverse functionality to the `__verify_next_inline_hard_break` function of the consistency checks to correctly compensate for that first change.
Another execution of the scenario tests, another double check that things were good, and I was on to another issue.
Tabs¶
When I grabbed the following item:
- stop gap solution for handling Tabs until tab work can be done
it was with the intent to document what needed to be done to properly handle tab characters in various situations. It was with delight that I looked at and experimented with the source code, only to find out that I was almost there. With one small change to the `parse_inline` function, I was able to get all the GFM specification examples with tab characters working.
So, the short story for this item is that there was nothing else that currently needed fixing with respect to tab characters. The long story is that I have over 2,000 tests for various parts of the linter and fewer than 20 tests for tab characters. As such, I added a new item to the issues list to deal with this later. I am not sure how many users have tab characters in their documents, so right now, that item’s priority is just a guess, based on my own usage.
False Alarms But Good Alarms¶
Having poked around the code while performing the refactor that I talked about in the section on Performance or Readability above, I noticed that two of the functions that I expected to have code in their bodies were mysteriously blank. As such, I added the following items to the issues list:
- __calculate_shortcut_collapsed_deltas - not being used at all
- figure out why?
- shotrcut link over multiple line
- does n't seem to have any code to handle?
Looking into this, I did my research and added two extra scenario tests to be sure of my results. In both cases, everything was fine. Digging deeper into the code, I found the answer to why these functions were okay with an empty body. As both of those scenarios rely heavily on the link label, which is common to all link types, any calculations for the size of the link label are already factored into the common code. However, to prevent myself from looking at this again, I added some extra comments to indicate what I had found, and why these functions being empty was okay.
Some Final Cleanup¶
When I picked this item from the list:
- test_paragraph_series_m_ul_t_nl_ulb_nl_tb - with abc/def/*
I hoped it was going to be simple to resolve. When I created the M-series of paragraph scenario tests, I noted that I did not have any more complicated tests and suggested a variation of a test with multiple lines. Creating a few variations on the initial scenario test with more complex data, I added three tests based on variations of:
```
- abc
- def
*
---
```
While I tried more than three variations, I did not find anything wrong with any of the tests that I tried. Instead of adding a whole bunch of scenario tests on only a guess that there might be a future problem, I whittled them down to the three tests that I did add.
Getting Back To Performance¶
With a decent handful of issues resolved from the issues list, there was a tiny bit of extra cleanup I wanted to do with respect to the plugins. Having looked at their code during breaks, I knew there were a couple of small changes I wanted to make before release. My feeling was that if I waited until after release, those changes would become breaking changes, and I wanted to avoid that if possible.
Calling Plugin Functions Only When Necessary¶
Having already reaped benefits from calling two of the plugin functions only when necessary, I decided that those changes were a good practice that I wanted to spread to each of the functions that comprise the plugin interface. I quickly made similar changes to the `completed_file` and `starting_new_file` functions, went to measure their impact, and… well… I was not sure.
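To make the idea concrete, here is a minimal sketch (my own illustration, not the project’s actual registration code) of dispatching an event only to plugins whose class overrides the do-nothing base method:

```python
class RulePlugin:
    """Hypothetical base class; the project's real names may differ."""

    def starting_new_file(self):
        """Templated placeholder with no real work to do."""

    def completed_file(self):
        """Templated placeholder with no real work to do."""


def overrides(plugin, method_name):
    # The plugin only needs the callback if its class replaced the base version.
    return getattr(type(plugin), method_name) is not getattr(RulePlugin, method_name)


def notify_completed_file(plugins):
    # Skip any plugin still using the do-nothing template.
    for plugin in plugins:
        if overrides(plugin, "completed_file"):
            plugin.completed_file()
```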
Rethinking Performance Measurements¶
In previous sections on performance, I was treating performance as actual clock time. Spending some time thinking about this while walking our dog, I realized that the main issue was that there was always going to be some difference in timing, even if I took two measurements minutes apart. If something external to the PyMarkdown project hit my hard drive while I was taking measurements, it could throw the measured time off. So, was clock time the best measurement to use?
It was then that I decided to move from measuring clock time to measuring clock time percentages. While I was pretty sure that if I took three samples I would get different timings in each sample, I had a high degree of confidence that the percentage of the time spent in each function would remain relatively stable. Running some sample tests, this theory was borne out in practice. While there were some differences in the percentages of the top 25 functions, they were mostly negligible. When there were differences, they most often came from the `print` function or one of the built-in functions in the `io` module. Since those two function groups are used for input and output, it made sense that there would be variations.
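For anyone wanting to reproduce this style of measurement, here is a rough sketch of how those percentages can be pulled out of a cProfile run. It leans on a couple of `pstats` attributes (`fcn_list` and `total_tt`) that the module’s own printing code uses, and `main` is a stand-in for whatever entry point is being profiled, so treat it as illustrative.

```python
import cProfile
import pstats

# Profile one scan; main() stands in for the run being measured.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("tottime")
total = stats.total_tt  # total clock time for the whole run
for func in stats.fcn_list[:25]:  # the top 25, in the order sorted above
    _cc, _nc, tottime, _ct, _callers = stats.stats[func]
    print(f"{tottime / total:7.2%}  {pstats.func_std_string(func)}")
```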
Throwing together a simple Excel spreadsheet, I now had what I believe to be a better way to measure performance.
Using The New Process¶
Using that new process, I composed a new spreadsheet to track any changes, and I found only negligible improvements. I was sure that I was going to get a performance boost out of those changes, so I checked the code base again, and I found the issue: the templated plugins. When I was adding the original set of plugins to be implemented, I added all the plugins that I had planned to implement with templated functions. Those templated functions were placeholders in the plugin that did not have any body other than a `pass` statement. My thinking was that when I was ready to write the code for a plugin, I could start right away without having to add the template stuff. While that worked as planned, it was now proving to be a performance issue. Roughly one third of all plugins have been coded, but all the plugins were being called regardless of whether they had any code other than a `pass` statement in their function bodies. So, I quickly went through the plugins and removed any plugin function that did not have any meaningful statements.
Once I completed that task, I measured the performance again and entered the results into this spreadsheet.
It took me a minute to understand what I was looking at, but then it made sense to me. The `next_line` functions were a large part of the templated plugin functions that were removed, so the measurements for that function dropped off the list. This effectively created a “vacuum” at the top of the measurements, causing other measurements to increase slightly to compensate for it. Specifically, the `debug` function went from 10.49% of the execution time to 11.51% of the execution time not because of any changes, but because the item before it in the list was no longer there. With that item no longer occupying that “space”, the `debug` function occupied a pro-rated portion of the time that item used to occupy.
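The arithmetic bears that out: because the `debug` function’s absolute time was unchanged, its two reported shares imply how much of the old total disappeared.

```python
# The debug function's absolute time t_debug was unchanged, so:
#   share_before = t_debug / total_before
#   share_after  = t_debug / (total_before - t_removed)
share_before = 0.1049
share_after = 0.1151
removed_fraction = 1 - share_before / share_after
print(f"{removed_fraction:.1%}")  # ~8.9% of the old total dropped off the profile
```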
Running two extra sets of performance tests, I was able to replicate this behavior both times. Looking at the clock time for each function alone, I am not sure I would have made the same observation as I did with the percentages, so I was glad that I switched!
Setting Context¶
While it was more evident before the previous performance changes, there was also an issue with the `set_context` function. Each time any of the plugin functions was called, a separate call to the `set_context` function was being made to establish the context for any reporting that was done. I am not sure why I thought that was a good idea, but it looked like it was causing performance issues before the last change. Now that I looked at the performance numbers, it was low down on the list, almost inconsequential.
Sitting back and thinking about it, the reduction in the impact of the `set_context` function on the performance made sense to me. With fewer plugins having eligible functions to call, the number of times that the `set_context` function was called was reduced at the same rate. Just as the `next_line` function dropped off the top 25 list, the `set_context` function also dropped off that same list.
But looking forward, I knew that those functions would soon resurface on the top 25 list. I had removed the templated methods in the last section because they were not being used, but somewhere between now and the initial release of the project, I plan to implement those missing plugins, thereby increasing their impact on the performance once again. So, while I could easily say I would do it later, it just made more sense to me to do it now while I was in the code. Also, by getting the refactoring out of the way now, it would give me more time later.
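As a sketch of the shape of that refactoring (my own illustration, since the actual change lives in the commits), the idea is to establish the context once per scanned file rather than once per callback:

```python
# Hypothetical sketch: establish the reporting context once per scanned file,
# instead of immediately before every single plugin callback.
def scan_file(plugins, context, lines):
    for plugin in plugins:
        plugin.set_context(context)  # once per file...
    for line in lines:
        for plugin in plugins:
            plugin.next_line(line)  # ...instead of once per callback
```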
It was a rather quick set of changes to make, and it was done within a couple of hours, including running the scenario tests and gathering the performance measurements. Looking at the measurements, I had to look far down the list to find the plugin entries, but I found them and added them to the spreadsheet below the normal area. Before I removed the templated functions, there were a combined 62,000 calls to the `next_line` and `next_token` functions. For those calls, there were 1.2 million calls to `set_context`, which took 0.16 seconds or 3.49% of the execution time.
Based on those numbers alone, I felt like refactoring that now was still a good thing to do from a performance point of view. Maybe the effects of the change would not show up right away, but I knew that it was quelling a future performance issue that would have definitely occurred otherwise.
What Was My Experience So Far?¶
While I am not sure I speak for everyone, I have an inkling that most software developers exist on a spectrum from “so what” to “what if they say…” with respect to others seeing their code. If that were not the case, linters for various languages and numerous articles on how to contribute successfully to a code review would not be needed. I know I am using PyLint and Flake8 as linters, and Black as a style formatter, so I hope I am covered in those areas. It is the other areas that I am concerned about.
But, in my normal quest to continually learn, I know that if I do not put the code out there, I will not get the feedback that I need to learn. I do know that the parser behind the linter passes every example in the GFM specification, along with every other case that I have come up with and run against CommonMark 0.92. I do know that I have a good handle on what the initial group of rules should be, based on the rules as defined by MarkdownLint. I must put my faith in the work that I have done to architect and design the project to meet my goals.
With that bit of deep thinking out of the way, I am looking at the issues list with a new eye to try and figure out what the minimum set of features that I need to have really is. What are the things that I will regret not having done before I release? The answer to that question, however, is something for another article!
What is Next?¶
Having resolved a number of other items on the issues list, there is one glaring item that I need to resolve before I can use this project myself: a YAML/simple front-matter loader. Without it, using this linter on my own blog posts is a no-go. As such, I tackle that next!
1. And yes, if that statement sounds wishy-washy, it is because it is.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.