Summary

In my last article, I talked about applying my newfound knowledge on performance measuring in Python to the PyMarkdown project. In this article, I start to focus more on resolving the higher priority issues from the project’s issues list.

Introduction

When working on a project like the PyMarkdown project for as long as I have, there are things that you like doing, things that you do not mind doing, and things that you know you have to do. Most of the things that I do related to the project fall into the first two categories. But strangely enough, the closer I get to a potential release, the more I feel that things are moving into that third category.

Maybe it is just that after working on this project for over a year, I do not want it to go to the next level. Once I release the project, I need to promote it beyond my own blog, and that will take time. I must admit, I have a bit of a concern that people are going to look at my code and go “why did you do that?”.

Regardless, I am going to try and clear out the Priority 1 section as much as possible and get ready for a beta release as soon as I can. At least, that is what my plan is. Here is hoping I can stick to it!

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 25 Feb 2021 and 28 Feb 2021.

Performance or Readability?

After spending the last week working on performance improvements, I was left with one potential improvement that I did not get around to looking at: string concatenation. As I have explained in past articles, when I am solving an issue, I work the issue and fit things together as I go. The rigor that I have built into my development process allows me the freedom to do that, knowing that I will revisit the code at least once before I commit it to the repository. At that time, I know with certainty that I will give the code a more thorough looking over, guided by the knowledge that the algorithm that the code is implementing has been tested properly. That process is just built into how I develop projects. And it was because of that rigor that I felt that it was time to give string concatenation another look.

When it comes to string concatenation, I figured that if it ever became an issue, I would just give that code another look. As such, I just added strings the normal way using the + operator. However, I started to suspect that concatenating strings in this way was an issue, so I surveyed the project code. In the process, I found that there are many places where strings are concatenated using the + operator and almost none being concatenated using any of the other methods. Performing some quick “off-the-cuff” performance tests, I found only minuscule differences between the various methods of concatenating strings. As such, I was faced with the question of whether I wanted to refactor the code for readability rather than performance.
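
To give a feel for what I mean by “off-the-cuff” tests, here is a rough sketch of the kind of comparison I ran. The strings and repetition counts are illustrative only, and the timings on any given machine will vary; the point is simply that the three approaches end up within a hair of each other.

    import timeit

    name, count = "ulist_index", 42

    def with_plus_operator():
        return "{prefix:" + name + ":" + str(count) + "}"

    def with_f_string():
        return f"{{prefix:{name}:{count}}}"

    def with_join():
        return "".join(["{prefix:", name, ":", str(count), "}"])

    # Each approach builds the identical string; only the timing differs.
    for label, function in [("+", with_plus_operator), ("f-string", with_f_string), ("join", with_join)]:
        print(f"{label:>8}: {timeit.timeit(function, number=1_000_000):.3f}s")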

How Do You Define Readability?

Readability is an interesting concept in software development because it is often thought of as how easy it is for other people on the team to read the code. That definition gets confusing on a project like PyMarkdown, where I am the only one on the team. Technically speaking, if I can read my own code with decent ease, the code is readable. I would further argue that the above definition does not really define what readability is, just what it is within a limited scope.

From my point of view, readability is more than that. Having developed software professionally in over six different languages, I believe that readability is more fluid than that. Unless there is a business reason to define it otherwise, my general definition of readability is:

Readability is defined as the confidence that the development team has in the ability of someone, who is outside of the team and is familiar with the given language, to look at the code and to be able to clearly read and understand the code.

Now, obviously there are some caveats, but there it is. In the case of the PyMarkdown project, my goal for readability is that I follow standard Python programming practices and write performant Python code according to those practices. Following my definition of readability, as stated above, I expect that a competent Python developer should be able to look at the source code for the PyMarkdown project and read it without too many issues. For me, string concatenation using the + operator is one of those issues that I believe impedes the readability of the code. From my viewpoint, there are two other methods of concatenation that are more readable because they are more concise.

So, purely from a position of wanting to increase the readability of the project, I decided to refactor all string concatenations to those two other forms. This meant I would need to switch every string concatenation over to one of the two more readable forms: f-strings and lists with join statements.

F-Strings

Added in Python 3.6, f-strings are a special type of string that combines the previous ways of formatting strings together in a way that just makes sense. Take this example from the container_block_processor module:

    def __str__(self):
        return (
            "{ContainerIndices:ulist_index:"
            + str(self.ulist_index)
            + ";olist_index:"
            + str(self.olist_index)
            + ";block_index:"
            + str(self.block_index)
            + "}"
        )

For debug purposes, this string is being composed of variables and constants woven together to create the resultant string. While it creates a string with the desired information, I often have a hard time reading strings formatted in that manner. It is a long sequence of operands over multiple lines, and it just seems broken up. Each of those breaks causes me a bit of confusion when I try and figure out which piece goes with which other piece. Therefore, when I scan those lines with my eyes, I find that I need to slow down and parse through them mentally a couple of times before getting a good understanding of that function.

In comparison, here is that same code refactored to use an f-string:

    def __str__(self):
        return f"{{ContainerIndices:ulist_index:{self.ulist_index};olist_index:{self.olist_index};block_index:{self.block_index}}}"

For me, this is easier to read. While the string itself is longer and I may have to scroll to read it, it is one string that describes everything in order. The “curly-brace” characters { and } encompass code that computes a value to be added to the string. The sequences {{ and }} are used to add a single one of those characters to the string. The f-string itself is denoted by simply adding an f character to the start of an otherwise normal string. It manages to keep the simple things simple.

While it initially took me a bit to warm up to f-strings, I now find them to be more readable by far. I can easily scan from left to right in the string and almost instantly know how the code is going to compose that string. But what about strings with a variable number of arguments? Or strings that have a separator that would be burdensome to use repeatedly? That is where lists come in.

Lists

While f-strings are very readable, they do not work well when the number of pieces is variable or when parts of the string are optional. For those cases, list objects and their join function are recommended. Consider the following code for the debug_string function of the MarkdownToken class:

    def debug_string(self):
        add_extra = (
            ":" + self.extra_data
            if self.extra_data or self.is_paragraph or self.is_blank_line
            else ""
        )
        column_row_info = (
            "(%d,%d)" % (self.line_number, self.column_number)
            if include_column_row_info and (self.line_number or self.column_number)
            else ""
        )
        return "[%s%s%s]" % (self.token_name, column_row_info, add_extra)

After various refactors of this function, I had replaced the variable parts of the string with ternary statements and used a % style format to stitch those values together. But it still looked disjointed to me. Even after changing the % style format strings into f-strings, I was not happy with how it read. The part that was bothering me was that the optional parts of the string were not clearly being represented as optional when I read them. Basically, if I just looked at the last line of that function, I felt that each one of those strings would contribute all the time, not optionally. I just felt that was not a good thing.

Therefore I switched the optional parts of that concatenation to a list:

    def debug_string(self):
        debug_parts = ["[", self.token_name]
        if include_column_row_info and (self.line_number or self.column_number):
            debug_parts.append(f"({self.line_number},{self.column_number})")
        if self.extra_data or self.is_paragraph or self.is_blank_line:
            debug_parts.append(f":{self.extra_data}")
        debug_parts.append("]")
        return "".join(debug_parts)

For me, this reads cleaner than the previous iteration of this function. The start of the string and the end of the string are clearly stated. The optional parts format their data only if needed, and only then do they calculate and add those formatted results. For me, this is more readable because it clearly delineates between the optional parts of the string and the mandatory parts of the string.

But there are also other areas where I find using lists to concatenate strings useful. Take this code from one of the __init__ functions for a token:

    str(indent_level)
    + MarkdownToken.extra_data_separator
    + extracted_whitespace
    + MarkdownToken.extra_data_separator
    + list_start_content

As with my statements above, I do not find this code easy to read. When I try and read this code, the MarkdownToken.extra_data_separator takes up most of my attention as I try and make sure each one is in the right place. Instead of focusing on what is between those strings, I find myself focusing on the strings themselves. As such, I rewrote it using lists as:

    MarkdownToken.extra_data_separator.join(
        [str(indent_level), extracted_whitespace, list_start_content]
    )

I find this format a lot easier to read. I can see that I have a list of items and that they will be joined using the MarkdownToken.extra_data_separator string. That fact is mentioned exactly once, and then I do not have to concern myself with it again. After I have read that, I can then clearly see the list of items that are going to be joined together, allowing me to focus on the contents of that list instead of the separator. For me, that is a win!

After Four Days

It took me roughly four days of evenings to go through the PyMarkdown code base and move every string concatenation over to one of the above methods. There were times during that effort that I felt like I should give up, but I persevered. And once I got to the end of that huge effort, I almost immediately saw the payoff. Looking over the code that I had changed during those four days, it looked a lot more readable to me. In addition, finding string concatenations was a lot easier now. Instead of searching for the + operator, which is also used in numeric calculations, I could search for the sequence f" to find f-strings. By searching for .join(, I was able to find both string concatenations and token list concatenations, but my naming practices make differentiating those two groups of concatenations easy.
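
For the curious, that kind of search does not need anything fancier than a couple of substring checks. This is just a throwaway sketch, and the pymarkdown directory name is an assumption about where the source lives, but it shows the idea:

    from pathlib import Path

    # Quick-and-dirty scan for the two concatenation patterns the refactor
    # standardized on: f-strings and join() calls.
    for path in Path("pymarkdown").rglob("*.py"):
        for line_number, line in enumerate(path.read_text().splitlines(), start=1):
            if 'f"' in line or ".join(" in line:
                print(f"{path}:{line_number}: {line.strip()}")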

It was a slog, but I believe my choice to improve readability with respect to string concatenations was the correct one. Not only could I find concatenations more easily than before, but when I read them, they were just easier to read. I just hope others think the code is more Pythonic and more readable as well.

Whittling Down The Issues List

A number of items had been on the issues list for a long time, so I decided that I needed to make a good dent in that list, specifically the items dealing with Hard Line Breaks. I had fixed some issues with Hard Line Break tokens in the last month or so, and I figured it would be a good time to continue that effort.

The first issue that I picked up on this theme was:

- hard break at start of text?

Looking through all the examples and scenarios that I have already implemented, I was astounded that I did not have any scenarios where there was a Hard Line Break element at the start of the line. After adding four very simple tests, I executed those tests, and the only issue was the firing of an assert statement in the __verify_first_inline_paragraph function of the consistency checks. That issue was remedied within 10 minutes, and I then moved on to the next issue.
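
To make that scenario concrete, the shape of such a test is simply a paragraph whose very first inline element is the Hard Line Break. The sketch below is an illustration of that shape rather than one of the project’s actual scenario tests, the expected output is my reading of the GFM specification for this input, and the final assertion is left out because the real tests use the project’s own helpers and token checks.

    def test_hard_line_break_at_start_of_text():
        # A backslash hard break as the very first inline element of the
        # paragraph, with "def" following on the next line.
        source_markdown = "\\\ndef"
        expected_gfm = "<p><br />\ndef</p>"
        # The project-specific act-and-assert call is omitted here.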

Hard Breaks And Inline Types

The next item that I picked up regarding Hard Line Breaks was:

- hard break followed by each inline type

For this item, I looked through the code base and found most of the examples that I was looking for, but I felt that they were scattered all over the place. With an eye to resolving this item in the shortest time possible, leaving time for refactoring later, I simply went through the list of available inline tokens and created functions test_hard_line_breaks_extra_03x to test_hard_line_breaks_extra_03i for the one Hard Line Break sequence \ and functions test_hard_line_breaks_extra_04x to test_hard_line_breaks_extra_04i for the other Hard Line Break sequence {space}{space}.

Due to a lot of hard work in previous weeks, I was lucky in that I only had to make a small change to how the __complete_inline_block_processing function changed the end_string variable. Double checking my work and running the scenario tests again, just to make sure, I resolved this item and moved on.

Hard Line Break Grab Bag

Apart from the “big” item itself, what was left was a bit of a grab bag of issues:

- hard break followed by 3 spaces, then emphasis
  - variations of 52e with extra spaces after, etc.
    - variation with start of 52e, then hard break, then something like start of 52e again i.e. make /x02 splitter come into affect

While I hopefully knew what I intended when I wrote those down, I was left a bit in the dark as I now read them. I knew that the lines that contain the sequence /x02 deal with SetExt Heading tokens, as that is the only place that I used that sequence as a splitter. So, between what I knew and what I could guess, I came up with five new scenario tests that tried to capture the spirit of what I thought those statements meant.1

There was one issue that I needed to deal with after executing the new scenario tests. In the InlineProcessor class, the tokens were not being generated correctly when dealing with leading spaces on a line that was preceded by a Hard Line Break within the confines of a SetExt Heading element. After a bit of debugging, I was able to figure out the right setting for the three variables in question, and that was fixed. To balance that out, I needed to add code with the inverse functionality to the __verify_next_inline_hard_break function of the consistency checks to correctly compensate for that first change.

Another execution of the scenario tests, another double check that things were good, and I was on to another issue.

Tabs

When I grabbed the following item:

- stop gap solution for handling Tabs until tab work can be done

it was with the intent to document what needed to be done to properly handle tab characters in various situations. It was with delight that I looked at and experimented with the source code and found out that I was almost there. With one small change to the parse_inline function, I was able to get all the GFM specification examples with tab characters to work.

So, the short story for this item is that there was nothing else that currently needed fixing with respect to tab characters. The long story is that I have over 2000 tests for various parts of the linter and fewer than 20 tests for tab characters. As such, I added a new item to the issues list to deal with this later. I am not sure how many users have tab characters in their documents, so right now, that item’s priority is just a guess, based on my usage.

False Alarms But Good Alarms

Having poked around the code while performing the refactor that I talked about in the above section on Performance or Readability, I noticed that two of the functions that I expected to have code in their body were mysteriously blank. As such, I added the following items to the issues list:

- __calculate_shortcut_collapsed_deltas - not being used at all
  - figure out why?
- shotrcut link over multiple line
  - does n't seem to have any code to handle?

Looking into this, I did my research and added two extra scenario tests to be sure of my results. In both cases, everything was fine. Digging deeper into the code, I found the answer to why these functions were okay with an empty body. As both of those two scenarios rely heavily on the link label, which is common to all link types, any calculations for the size of the link label are already factored into the code. However, to prevent myself from looking at this again, I added some extra comments to indicate what I had found, and why these functions being empty was okay.

Some Final Cleanup

When I picked this item from the list:

- test_paragraph_series_m_ul_t_nl_ulb_nl_tb - with abc/def/*

I hoped it was going to be simple to resolve. When I created the M-series of paragraph scenario tests, I noted that I did not have any more complicated tests and suggested a variation of a test with multiple lines. Creating a few variations on the initial scenario test with more complex data, I added three tests with variations on:

    - abc
    - def
      *
    ---

While I tried more than three variations, I did not find anything wrong with any of the tests that I tried. Instead of adding a whole bunch of scenario tests with only a guess that there might be a future problem, I whittled those tests down to the three tests that I did add.

Getting Back To Performance

With a decent handful of issues resolved from the issues list, there was a tiny bit of extra cleanup I wanted to do with respect to the plugins. Having looked at their code during breaks, I knew there were a couple of small changes I wanted to make before release. My feeling was that if I waited until after release, those changes would become breaking changes, and I wanted to avoid that if possible.

Calling Plugin Functions Only When Necessary

Having already reaped benefits from calling two of the plugin functions only when necessary, I decided that those changes were a good practice that I wanted to spread to each of the functions that comprise the plugin’s interface. I quickly made similar changes to the completed_file and starting_new_file functions, went to measure their effect, and… well… I was not sure.
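
The general pattern behind “only when necessary” is simple enough to sketch. The class and function names below are illustrative and do not mirror the project’s actual plugin API; the idea is just to dispatch to a plugin function only if the plugin class actually overrides it, so empty implementations never land on the hot path.

    class RulePlugin:
        """Hypothetical base class standing in for the real plugin interface."""

        def starting_new_file(self):
            pass

        def next_line(self, line):
            pass

        def completed_file(self):
            pass

    def collect_handlers(plugins, function_name):
        # Keep only the plugins that override the base implementation, so that
        # placeholder functions are never called during a scan.
        base_function = getattr(RulePlugin, function_name)
        return [
            getattr(plugin, function_name)
            for plugin in plugins
            if getattr(type(plugin), function_name) is not base_function
        ]

Registering handlers this way is also why removing the pass-only placeholder bodies, discussed below, pays off: a plugin that falls back to the base implementation simply never gets collected.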

Rethinking Performance Measurements

In previous sections on performance, I was thinking about performance as actual clock time. Spending some time thinking about this while walking our dog, I realized that the main issue was that there was always going to be some difference in timing, even if I took two measurements minutes apart from each other. If something external to the PyMarkdown project hit my hard drive when I was taking measurements, it could cause the measured time to be off. So, was clock time the best measurement to use?

It was then that I decided to move from measuring clock time to measuring clock time percentages. While I was pretty sure that if I took three samples I would get different timings in each sample, I had a high degree of confidence that the percentage of the time spent in each function would remain relatively the same. Running some sample tests, this theory was borne out in practice. While there were some differences in the percentages of the top 25, they were mostly negligible. When there were differences, it was most often from the print function or one of the functions in the io.builtin module. Since those two function groups are used for input and output, it made sense that there would be variations.
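
As a rough sketch of how such percentages can be pulled out of a profile, the snippet below leans on cProfile and the raw numbers that pstats keeps in its stats and total_tt attributes. It is an assumption about tooling on my part rather than the project’s actual measurement harness, but it shows the kind of top-25 percentage table that fed the spreadsheet.

    import cProfile
    import pstats

    def profile_with_percentages(target, *args, **kwargs):
        profiler = cProfile.Profile()
        profiler.runcall(target, *args, **kwargs)
        stats = pstats.Stats(profiler)
        total_time = stats.total_tt  # total measured time across all functions
        rows = []
        for function_key, (_, call_count, own_time, _, _) in stats.stats.items():
            rows.append((own_time / total_time, call_count, function_key))
        # Report the top 25 functions by their share of the total execution time.
        for share, call_count, function_key in sorted(rows, reverse=True)[:25]:
            print(f"{share:7.2%}  {call_count:>10}  {function_key}")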

Throwing together a simple Excel spreadsheet, I now had what I believe to be a better way to measure performance.

Using The New Process

Using that new process, I was able to compose a new spreadsheet to track any changes, and I found only negligible improvements. I was sure that I was going to get a performance boost out of those changes, so I checked the code base again and I found the issue: the templated plugins. When I was adding the original set of plugins to be implemented, I added all the plugins that I had planned to implement with templated functions. Those templated functions were placeholders in the plugin that did not have any body other than a pass statement. My thinking was that when I was ready to write the code for the plugin, I could start right away without having to add the template code first.

While that worked as planned, it was now proving to be a performance issue. Roughly one third of all plugins have been coded, but all the plugins were being called regardless of whether they had any code other than a pass statement in their function body. So, I quickly went through the plugins and removed any plugin function that did not have any meaningful statements. Once I completed that task, I measured the performance again, and entered the results into this spreadsheet.

It took me a minute to understand what I was looking at, but it then made sense to me. The next_line functions were a large part of the templated plugin functions that were removed, so the measurements for that function dropped off the list. This effectively created a “vacuum” at the top of the measurements, causing other measurements to increase slightly to compensate for it. Specifically, the debug function went from 10.49% of the execution time to 11.51% of the execution time not because of any changes, but because the item before it in the list was no longer there. With that item no longer occupying that “space”, the debug function occupied a pro-rated portion of the time that item used to occupy.
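
That shift can be sanity-checked with a little arithmetic. Assuming the debug function’s own time stayed the same and only the total shrank, the two percentages quoted above imply that roughly nine percent of the total measured time disappeared along with the removed functions:

    # If a function's share rises from 10.49% to 11.51% while its absolute time
    # stays constant, the total measured time must have shrunk accordingly.
    old_share, new_share = 0.1049, 0.1151
    implied_shrink = 1 - (old_share / new_share)
    print(f"implied drop in total measured time: {implied_shrink:.1%}")  # about 8.9%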

Running two extra sets of performance tests, I was able to replicate this behavior both times. Looking at the clock time for each function alone, I am not sure if I would have made the same observation as I did with the percentages, so I was glad that I made the switch!

Setting Context

While it was more evident before the previous performance changes, there was also an issue with the set_context function. Each time any of the plugin functions were called, a separate call to the set_context function was being made to establish the context for any reporting that was done. I am not sure why I thought that was a good idea, but it looked like it was causing performance issues before the last change. Now that I looked at the performance numbers, it was low down on the list, almost inconsequential.

Sitting back and thinking about it, the reduction of the impact of the set_context function on the performance made sense to me. With fewer plugins having eligible functions to call, the number of times that the set_context function was called was reduced at the same rate. Just as the next_line function dropped off the top 25 list, the set_context function also dropped off that same list.

But looking forward, I knew that those functions would soon be resurfacing on the top 25 list. I had removed those templated methods in the last section, as they were not being used. Somewhere between now and the initial release of the project, I know I have plans to implement those missing plugins, thereby increasing their impact on the performance once again. So, while I could easily say I would do it later, it just made more sense to me to do it now while I was in the code. Also, getting the refactoring out of the way now would give me more time later.
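
To make the shape of that refactor concrete, here is a before-and-after sketch. The names are hypothetical and do not mirror the project’s actual plugin manager; the point is simply that the reporting context gets established once up front instead of being re-set before every plugin function call.

    # Before: the context is re-established on every single plugin call.
    def scan_line_before(plugins, context, line):
        for plugin in plugins:
            plugin.set_context(context)
            plugin.next_line(line)

    # After: the context is established once for the scan, and the per-line
    # dispatch only has to call the plugin functions themselves.
    def start_scan(plugins, context):
        for plugin in plugins:
            plugin.set_context(context)

    def scan_line_after(plugins, line):
        for plugin in plugins:
            plugin.next_line(line)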

It was a rather quick set of changes to make, and it was done within a couple of hours including running the scenario tests and gathering the performance measurements. Looking at the measurements, I had to look far down the list to find the plugin entries, but I found them and added them to the spreadsheet below the normal area. Before I removed the templated functions, there were a combined 62,000 calls to the next_line and next_token functions. For those calls, there were 1.2 million calls to set_context which took 0.16 seconds or 3.49% of the execution time.

Based on those numbers alone, I felt like refactoring that now was still a good thing to do from a performance point of view. Maybe the effects of the change would not show up right away, but I knew that it was quelling a future performance issue that would have definitely occurred otherwise.

What Was My Experience So Far?

While I am not sure I speak for everyone, I have an inkling that most software developers exist on a spectrum of “so what” to “what if they say…” with respect to others seeing their code. If that were not the case, linters for various languages and numerous articles on how to contribute successfully to a code review would not be needed. I know I am using PyLint and Flake8 as linters, and Black as a style formatter, so I hope I am covered in those areas. It is the other areas that I am concerned about.

But, in my normal quest to continually learn, I know that if I do not put the code out there, I will not get the feedback that I need to learn. I do know that the parser behind the linter passes every example in the GFM specification, along with every other case that I have come up with and run against CommonMark 0.29. I do know that I have a good handle on what the initial group of rules should be, based on the rules as defined by MarkdownLint. I must put my faith in the work that I have done to architect and design the project to meet my goals.

With that bit of deep thinking out of the way, I am looking at the issues list with a new eye to try and figure out what the minimum set of features that I need to have really is. What are the things that I will regret not having done before I release? The answer to that question, however, is something for another article!

What is Next?

Having resolved a number of other items on the issues list, there is one glaring item that I need to resolve before I can use this project myself: a YAML/simple front-matter loader. Without it, using this linter on my own blog posts is a no-go. As such, that is what I tackle next!


  1. And yes, if that statement sounds wishy-washy, it is because it is. 
