Markdown Linter - Weeding the Project's Issue List

Summary¶

In my last article, I talked about how I was pulling myself out of the rabbit hole that I dug myself into, by finding a small task, and completing it with panache.¹ In this article, I talk about how I continued to get my confidence back by resolving items on the issues list while increasing the quality of the project.

Introduction¶

At the end of the last article, I talked about how I was starting to get out of the negative headspace that I found myself in, making progress with the project at the same time. The project’s progress was a good thing, but it was my emotions towards the project that I was more concerned with. I have seen people stop working on their passion projects for varied reasons, and I just did not want a momentary lapse of confidence on my part to be the reason that I stopped working on this project.

When I was pulling myself up out of my rabbit hole, I came to the realization that part of the reason that my confidence took a bit of a hit was the contents of the issues list. While I am pretty sure that not every item on the list is an actual issue, until I debug and verify each one, each item on that list is a potential bug. And each one of those potential bugs represented a bit of uncertainty that lowered my confidence. Given that realization, taking some time to go through and “weed” the project’s issue list seemed like a good idea!

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 16 Jun 2020 and 26 Jun 2020.

Starting Out Easy¶

The first three tasks on my list were simple things to resolve, little things that I kept on meaning to do when I had some time. Completing them to get me warmed up for the issues list was a good, solid plan.

The first item on my list was some simple refactoring of the assert_token_consistency function in the utils.py module. I had added some decent functionality to it in the last couple of weeks, and it was time to make sure that the module was returned to its normal, properly organized state. The needed changes were simple. Start with a bit of extracting code into functions, then add some function and variable renaming, and finally test the changes to make sure I did not break anything. While nothing changed in the source code for the project, it felt good knowing that the test code was just that much cleaner.

The next item was also a simple task: add reporting of leading whitespace to the next list item token (li). When I was adding the code to handle leading whitespace for the list start tokens, somehow, I forgot the new list item token in that work. I noticed this when I went to verify the next list item tokens in the consistency checks, and there was no indication of the leading whitespace, other than the indent_level variable. For the sake of those consistency checks, this needed to be addressed.

With the modifications to the new list item tokens in place, adding this support into the consistency checks was almost trivial. The “do not do new list item token until fixed” check was removed and replaced with a simple calculation of the init_ws variable, in keeping with the handling of the list start tokens. To complete those changes, I also added some code in the __maintain_block_stack function to guarantee that new list item tokens were properly added to and removed from the stack.

Finally, I decided to use PyCharm and its enhanced code analysis tools to take a look at the project, fixing things where possible. I have talked before about how PyCharm, while not a good development environment for me, is a product I definitely like to use as a tool on a current project. For me, the most useful of these tools is a comprehensive look at what arguments and variables are used, and whether they are needed. In addition, PyCharm has a project dictionary that allows me to search for typos in comments and variable names, while a custom dictionary lets me exclude often used terms and abbreviations from those results. Taken together, I feel that using PyCharm as a tool just adds an extra level of quality and cleanliness to the project.

While none of these tasks were big issues to tackle, they were small tasks that were easily resolved and crossed off my mental “when I get time” list. And one less item on that list is one less thing to worry about!

Verifying Issues¶

Scenario 86a: Indented Code Blocks¶

There are times when I look at an issue and I know exactly why I put that issue onto the issues list. This was not one of those times. The notes I have for this one clearly state that scenario 86a is a modification of scenario 86, but with 9 leading spaces before the indented code block. Those same notes are even clear that the reason I added this new scenario was because I was concerned that scenario 86 only tested a case where the length of the extracted whitespace equaled the length of the remaining whitespace. What I did not know was why I thought this was an issue.

I did some due diligence here and found nothing. I temporarily added some extra debug code around the indented code block code but did not find anything useful. After making a mental note to myself to write better notes on why I added items to the issues list, I removed it from the issues list and moved on.

Scenario 87: More Fun with Indented Code Blocks¶

When I was doing the initial work to add the line/column number to the tokens back in May, I wrote an issue to myself as a question:

087 - shouldn’t it be inside of the indented code block?

Unlike the issue with scenario 86a, it was easy to see why I wrote that question. If you look at the example by itself, the blank line before the reported start of the indented code block is also indented by 4 space characters. However, when I looked at the example within the context of the GFM specification for example 87, the line before the example reads:

Blank lines preceding or following an indented code block are not included in it:

An honest question, an honest and researched answer, and an issue that was quickly resolved. Next!

Scenarios 235, 236, 252, and 255: Indented Code Blocks and Lists¶

Each of these four scenarios had to do with something that may look like an indented code block being started from within a list. Like scenario 87 in the last section, I could see how these scenario tests raised questions. For each one of these examples, it is hard to tell at a quick glance whether the number of spaces is correct.

In the case of scenario 235, my first inclination was that I had coded something wrong. The - character is followed by 4 spaces, so one should be in an indented code block. Right? Almost. The actual start sequence for the list is not -, but -{space}. As such, the list starts with the - character at column number 2, the space character at column number 3, followed by 3 space characters for a total indent to column number 6. The blank line then ends the list on line number 2, and line 3 with 5 leading spaces is picked up as an indented code block. The scenario test was correct. Yes!

The general math for scenario 236 is the same, but because the text two is indented 6 spaces instead of scenario 235’s 5 spaces, it counts as a continuation of the original list. Scenario 252 and scenario 255 are just variations of this, with and without the indented code blocks confusing the issue. In each case, the scenario tests were correct. But I felt good that I had questioned whether they were correct, and it was solidly answered in the positive.
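The column arithmetic behind scenarios 235 and 236 can be sketched in a few lines of Python. This is a simplified illustration, not the project's code: the helper names list_content_indent and classify_line_after_blank are made up for this example, only bullet lists indented with space characters are handled, and the sample lines are built from the description above rather than copied from the specification.

```python
def list_content_indent(line: str) -> int:
    """Return the indent (in columns) that keeps a later line inside the list item."""
    leading = len(line) - len(line.lstrip(" "))
    rest = line[leading:]
    if rest[:1] not in ("-", "+", "*"):
        raise ValueError("line does not start a bullet list")
    after_marker = rest[1:]
    spaces = len(after_marker) - len(after_marker.lstrip(" "))
    if after_marker and spaces == 0:
        raise ValueError("the list marker must be followed by a space")
    # Only 1 to 4 spaces after the marker count towards the indent. With more
    # spaces, or with nothing but whitespace after the marker, just one counts.
    if not 1 <= spaces <= 4 or not after_marker.strip():
        spaces = 1
    return leading + 1 + spaces


def classify_line_after_blank(indent: int, line: str) -> str:
    """Classify a line that follows a blank line inside a list item."""
    if not line.strip():
        return "blank"
    leading = len(line) - len(line.lstrip(" "))
    if leading >= indent:
        return "list continuation"
    if leading >= 4:
        return "indented code block"
    return "paragraph"


indent = list_content_indent(" -    one")
print(indent)
print(classify_line_after_blank(indent, "     two"))   # 5 leading spaces
print(classify_line_after_blank(indent, "      two"))  # 6 leading spaces
```

With an indent of 6, a line with 5 leading spaces falls outside the list and is picked up as an indented code block, while a line with 6 leading spaces stays inside the list item.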

However, even though I did not find an immediate issue, I did find a future issue. In each of these cases, the indent_level associated with the list is assumed to be composed of space characters. This assumption is fine for now, as the current consistency checks explicitly ignore checking tests that contain tabs. But when the tab character support is enabled in the consistency checks, extra calculations will need to be added to ensure the column numbers remain accurate. This was not something I needed to deal with now, but it would be an issue later.
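To give a feel for the kind of extra calculation that tabs require, here is a minimal sketch. It assumes the GFM rule that, where whitespace defines block structure, tabs behave as if they were expanded to a tab stop of 4 characters. The function name expanded_width and its parameters are hypothetical and are not taken from the project.

```python
def expanded_width(whitespace: str, start_index: int = 0, tab_stop: int = 4) -> int:
    """Return how many columns a run of spaces and tabs occupies.

    start_index is the zero-based position in the line where the run begins,
    which matters because a tab only advances to the next tab stop.
    """
    index = start_index
    for character in whitespace:
        if character == "\t":
            index += tab_stop - (index % tab_stop)
        elif character == " ":
            index += 1
        else:
            raise ValueError("only spaces and tabs are expected")
    return index - start_index


print(expanded_width("    "))                 # four spaces
print(expanded_width("\t"))                   # one tab at the start of the line
print(expanded_width("  \t"))                 # two spaces, then a tab
print(expanded_width("\t", start_index=2))    # a tab right after a two-character list start
```

The point of the sketch is that the width of a tab depends on where it starts, so an indent that mixes tabs and spaces cannot be measured by simply counting characters.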

Blank Lines and HTML Blocks¶

I initially thought that this one was an open-and-shut case; the issue being noted down as:

blanks lines, if starts with 2 ws, is it (x,1) or (x,3)?

The obvious answer is that it is always 1, as I indicated in the commit message for removing the issue:

Answered question: blank lines always start at 1, as do HTML blocks.

To verify this, I did a quick scan of the test code for the text html-block and the text BLANK, looking for the string values of their respective tokens in the scenario test output. As expected, each of those tokens contained a column number of 1. Until they didn’t.

The HTML block token and the blank line token always start at the start of the line, and hence have a column number of 1. But when those leaf block tokens are created within a container block, the start of the line is where the container block says it is. Therefore, for blank lines created within a list block, their column number becomes the indent level for the container block.

The good news here is that I had an issue with the commit message, not the source code. If I could go back and correct the commit message², I would change it to:

Answered question: blank lines always start at 1, as do HTML blocks, except when enclosed by a container block.

After a bit of double checking, those scenarios and their tokens were also verified. Another quick issue to resolve and get off the list!
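The quick scan described above can be approximated with a regular expression. This is only a sketch under stated assumptions: the token strings are assumed to start with the token name followed by (line,column), the exact text of an HTML block token is a guess, and the sample strings are made up for illustration rather than copied from the scenario tests.

```python
import re

# Matches the start of a serialized token such as "[BLANK(4,1):  ]".
TOKEN_PATTERN = re.compile(r"\[(BLANK|html-block)\((\d+),(\d+)\)")


def tokens_not_in_first_column(expected_tokens):
    """Return (name, line, column) for every matching token whose column is not 1."""
    found = []
    for token_text in expected_tokens:
        match = TOKEN_PATTERN.match(token_text)
        if match and int(match.group(3)) != 1:
            found.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return found


# Hypothetical expected-token strings, in the style of the scenario tests.
sample = ["[BLANK(4,1):  ]", "[html-block(1,1)]", "[BLANK(2,3):]"]
print(tokens_not_in_first_column(sample))  # [('BLANK', 2, 3)]
```

Anything the scan reports is a candidate for a token created inside a container block, which is exactly the case the original commit message missed.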

Scenarios 197, 257, and 262: Blank Lines and Lists¶

This issue was logged in order to explore whether or not there were issues with lists that started with a blank line, as in scenario 257 and scenario 262. To start with a baseline for blank lines, the example for scenario 197 is as follows:

\a\a

aaa
\a\a

# aaa

\a\a

(Aside: Because I had previously missed spaces in the examples, I replaced the spaces in this example with the character sequence \a, making each space character more visible. Then, before the string is passed to the test code, it is processed by invoking .replace("\a", " ") on it, transforming it into an accurate representation of the example. This reduced the number of times that I missed trailing whitespace to zero!)
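The substitution described in the aside looks roughly like this in Python. The variable name source_markdown is an assumption for this sketch. In a regular (non-raw) Python string, \a is the escape sequence for the ASCII bell character, which practically never appears in real Markdown, so it makes a safe and visible stand-in for a space.

```python
# Each \a below is a visible placeholder for one space character.
source_markdown = """\a\a

aaa
\a\a

# aaa

\a\a""".replace("\a", " ")

lines = source_markdown.split("\n")
print(len(lines))        # 8
print(repr(lines[0]))    # '  '
print(repr(lines[3]))    # '  '
```

After the replacement, the 1st, 4th, and 8th lines each contain exactly two space characters, with no risk of an editor silently trimming them.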

In the scenario test for scenario 197, the tokens for the 1st, 4th, and 8th lines include the leading whitespace while maintaining a column number of 1. For example, the token for the blank line on line 4 is:

[BLANK(4,1):  ]

Therefore, I compared that behavior to the blank line’s behavior inside of a simple list block, such as with this Markdown for scenario 257:

-\a\a\a
  foo

and the differences were clear:

[BLANK(1,5):]

There are two differences between this token and the uncontained token above. The first is that unlike scenario 197, the leading whitespace that was removed is not stored in the token. The second is that the column number is 5, when it should be 2. Based on my experience with blank lines in the last section, it was easy to see that the column number should be 2, given the unordered list start character - and the mandatory space that followed it.

Scenario 262 was a far easier issue to deal with. Its Markdown is:

Simple. An unordered list start token, by itself in a document. Thanks to the work I did on scenario 257, this one was easily verified. If scenario 257 should produce a blank line token that contains 3 leading space characters, then scenario 262 should contain a blank line token with no leading space characters.

As this was just researching the issue, I resolved the existing issue and added two more specific issues to be properly addressed later: one for recording the whitespace for blank lines in a container, and the other for correcting the column number of that same blank line in a container.

Scenarios Extra001 and Extra002: Checking for Correctness¶

Looking at how blank lines are defined in the GFM specification, from a purely transform-to-HTML point of view, it is obvious that a document that only contains whitespace will produce an empty HTML document. This is mainly due to the stipulations:

Blank lines between block-level elements are ignored

and

Blank lines at the beginning and end of the document are also ignored.

But for those blank lines to be ignored, it stands to reason that from a tokenization point of view, there must be a blank line token to ignore. And as the linter operates on tokens, the scenario tests test_extra_001 and test_extra_002 were added to make sure the right blank tokens are produced.

After the previous work in the above sections with blank lines, verifying these scenario tests was quick and painless. In reverse order, the text for scenario test test_extra_002 was a simple document with 3 spaces, hence it needed to produce a single blank line token with 3 spaces. With that test solidly in place, it logically follows that removing those 3 spaces for scenario test test_extra_001 would produce a blank line token with no spaces, which is what the test expects.
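The shape of those two checks can be sketched with a toy stand-in for the tokenizer. Everything here is an assumption made for illustration: tokenize_blank_document is not the project's parser, it only understands whitespace-only documents, and the token text simply follows the [BLANK(line,column):whitespace] pattern shown earlier in this article.

```python
def tokenize_blank_document(source: str) -> list:
    """Toy tokenizer: emit one blank line token per line of a whitespace-only document."""
    tokens = []
    for line_number, line in enumerate(source.split("\n"), start=1):
        if line.strip(" "):
            raise ValueError("this sketch only handles whitespace-only documents")
        tokens.append(f"[BLANK({line_number},1):{line}]")
    return tokens


def test_extra_001_sketch():
    # An empty document still yields a blank line token, with no whitespace.
    assert tokenize_blank_document("") == ["[BLANK(1,1):]"]


def test_extra_002_sketch():
    # Three spaces yield a single blank line token that records those spaces.
    assert tokenize_blank_document("   ") == ["[BLANK(1,1):   ]"]


test_extra_001_sketch()
test_extra_002_sketch()
```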

While this may have seemed like a trivial test, it is often the trivial cases and edge cases that trip up projects. With everyone on the project worried about how a big complex example will be resolved, sometimes it is those little examples that slip through the cracks out into the wild.

Honestly, even though they are trivial, I just felt better knowing that these trivial cases were double-checked and covered.

Scenarios 559 and 560: Link Reference Definitions?¶

I almost felt embarrassed when I read this one, as the answer was right in the scenarios themselves. The function comments for 559 (and with one small modification, 560) are as follows:

def test_reference_links_559():
    """
    Test case 559:  (part 1) A link label must contain at least one non-whitespace character:
    """

Using scenario 559 as a benchmark, its Markdown is as follows:

[]

[]: /uri

and the Markdown for scenario 560 is almost the same, except for added whitespace:

[
 ]

[
 ]: /uri

It is hard to argue with the GFM specification’s definition of a link label: in both cases, there just was not any non-whitespace character in the link label.

But I still did due diligence: verified the example, checked the tokens, and after a slight face-palm, I resolved the issue and moved on.
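The rule quoted in the test comment is small enough to sketch directly. This is an illustration of just that one rule, not the project's link label handling: the function name is hypothetical, and the other link label requirements in the specification (such as bracket escaping and the length limit) are deliberately left out.

```python
# The characters that the GFM specification lists as whitespace characters:
# space, tab, newline, line tabulation, form feed, and carriage return.
GFM_WHITESPACE = " \t\n\x0b\x0c\r"


def has_non_whitespace_character(label: str) -> bool:
    """Return True if the text between [ and ] has at least one non-whitespace character."""
    return any(character not in GFM_WHITESPACE for character in label)


print(has_non_whitespace_character(""))      # scenario 559: nothing between the brackets
print(has_non_whitespace_character("\n "))   # scenario 560: only a newline and a space
print(has_non_whitespace_character("foo"))   # an ordinary label
```

Both scenario labels fail the check, so neither can form a link reference definition.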

Changing the MarkdownToken’s Constructor?¶

Starting to ramp down on the project work, I hoped that this was a simple issue to look at and resolve. I had logged a simple question in the issue list:

for all of the tokens that used position_marker, do we need =None any more?

This was an interesting question in that, except for the MarkdownToken class itself, none of the child classes have position_marker as an optional argument! From that point of view, it would be quick to resolve it. But I did not feel that I was doing a complete job, so I decided to run with that idea a bit and find out where it led me.

A quick search over the MarkdownToken class and its children gave the following breakdown of how the MarkdownToken constructor was called from the child classes:

line_number and column_number arguments used: 3
position_marker argument used: 10
none of the above arguments used: 11

For the “none of the above” case, most of those child classes were for inline tokens that do not have line/column support yet. But if the trend of the current statistics continues, this change may be something to revisit in the future. Knowing more about this issue, I was good resolving this issue now, possibly exploring this again in the future when line/column numbers are added to inline elements.
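To make the three calling patterns concrete, here is a minimal sketch of a base class whose constructor accepts either explicit line/column numbers or an optional position_marker. The class bodies, the PositionMarker fields, and the child class names are all assumptions made for this illustration and are not the project's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PositionMarker:
    """Hypothetical position holder: a line number and a zero-based index into that line."""
    line_number: int
    index_number: int


class MarkdownToken:
    def __init__(self, token_name: str, line_number: int = 0, column_number: int = 0,
                 position_marker: Optional[PositionMarker] = None):
        # When a position marker is supplied, it takes precedence.
        if position_marker is not None:
            line_number = position_marker.line_number
            column_number = position_marker.index_number + 1
        self.token_name = token_name
        self.line_number = line_number
        self.column_number = column_number


class BlankLineToken(MarkdownToken):
    """Pattern: the child requires a position_marker, so =None is only needed on the base."""
    def __init__(self, position_marker: PositionMarker):
        super().__init__("BLANK", position_marker=position_marker)


class TextToken(MarkdownToken):
    """Pattern: an inline token with no line/column support yet."""
    def __init__(self, text: str):
        super().__init__("text")
        self.text = text


blank = BlankLineToken(PositionMarker(line_number=4, index_number=0))
print(blank.line_number, blank.column_number)
```

In a layout like this, the default of None has to stay on the base class for as long as some children pass no position at all, which matches the conclusion above.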

Renaming the SetExt Token’s Whitespace Variable¶

I was finally at the end of my planned issues list, and I was glad that I was ending on another simple one. When I was writing my first pass of the consistency checker, I noted that I had to special case the SetExt heading tokens, as the member variable that is consistently named extracted_whitespace for other tokens was mysteriously named remaining_line for this token.

After performing due diligence to find out where the variable remaining_line was referenced and what was depending on it, there was no reason to keep this difference around. It just made more sense to change the name to the more consistent extracted_whitespace. In addition to a simple search-and-replace, this change allowed me to reduce the complexity of the __calc_initial_whitespace function.
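The effect of the rename on a helper like __calc_initial_whitespace can be sketched as a before-and-after pair. The function bodies and the token objects below are hypothetical stand-ins built for this example. Only the two member names, remaining_line and extracted_whitespace, come from the description above.

```python
from types import SimpleNamespace


def calc_initial_whitespace_before(token) -> int:
    """Before the rename: the SetExt heading token needs its own branch."""
    if token.token_name == "setext":
        return len(token.remaining_line)
    return len(token.extracted_whitespace)


def calc_initial_whitespace_after(token) -> int:
    """After the rename: every token exposes the same member name."""
    return len(token.extracted_whitespace)


# Hypothetical tokens, modeled as simple attribute bags.
old_setext = SimpleNamespace(token_name="setext", remaining_line="  ")
new_setext = SimpleNamespace(token_name="setext", extracted_whitespace="  ")
paragraph = SimpleNamespace(token_name="para", extracted_whitespace=" ")

print(calc_initial_whitespace_before(old_setext))
print(calc_initial_whitespace_after(new_setext))
print(calc_initial_whitespace_after(paragraph))
```

The result is the same either way. The gain is that the special case disappears from the calling code.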

Another small change, but another small step to a cleaner code base!

Why Was This Work Important to Me?¶

From my personal and professional experience, the longer you let an issue sit unexplored, the more uncertainty exists with respect to the team’s confidence in a project. At the time when I resolved those issues, I was at a more emotional place than normal, where those uncertainties were weighing more heavily on my confidence. While it took some time to work through them, it just felt right doing the proper due diligence on each issue and resolving it.

And the results of resolving these issues were very positive. With the exception of adding the proper encapsulation of leading whitespace for new list item tokens, no other source code was measurably changed.³ With respect to the test code, there were a couple of net-neutral changes for code quality, but the other changes only added extra test cases, not changing existing tests. Basically, while the work done in this time frame did not change the project’s code base significantly, it did increase my confidence in the project by eliminating a few of the existing questions.

What Was My Experience So Far?¶

As far as recoveries go, I was feeling better after this stint of work. By going through the issues list, resolving 9 of those issues, and doing some code cleanup, my confidence was back on track to be where it was before. In any project, people encounter circumstances that force them to evaluate if they have made the right project decisions up to that point. Depending on how hard those circumstances hit them, people will decide to continue with the project, abandon it, or take a wait-and-see approach. While I was shaken for a bit, I was now back firmly in the “continue with the project” camp.

According to Webster, confidence is:

a feeling or consciousness of one’s powers or of reliance on one’s circumstances

Confidence is an emotion, not logic. It is not a light switch and it is not something that listens to reason. Confidence is fickle, hence the expression:

One bad apple spoils the barrel.

I could have resolved 29 issues, but if I found just one issue that looked like a showstopper, those circumstances could be completely different. Who knows?

Whether it was something substantial or something more lightweight, I knew that I needed to do some work to try and influence my confidence in a positive direction. In this case, resolving several items off of the issues list was what was needed. It was not a guarantee, but a gamble that paid off.

Someone once told me:

Marriage is made up with a whole bunch of days. You have good days, you have bad days, and you have so-so days. The sign of a good marriage is that you have more good days than the other two combined.

A similar concept applies to working on passion projects, like me and my PyMarkdown project. It was just a matter of finding the right thing to do on the project that would reignite my confidence, and therefore passion, for the project.

What is Next?¶

Having done some decent cleanup, I decided it was time to get back to work on the consistency checker. While I realized it could not be 100% complete until I started handling tab characters, I wanted to make a good effort towards getting it more complete. And that started with proper accounting of whitespace.

¹ Per Webster’s: “dash or flamboyance in style and action”.
² Yes, I know you can change a commit message, but the price is usually too high to pay for anything other than your very last commit.
³ For the sake of this sentence, I define a measurable change as a change that changes the requirements for input or the actual output of the project, with the exception of adding or modifying log messages.
