Introduction

First off, despite what I said at the end of the last article, no, I am not going to call this “Refactoring: The Sequel”… even if that would be a cool name. From my point of view, refactoring is anything that involves getting rid of technical debt, usually increasing the correctness, readability, or maintainability of a project. This can be as simple as writing clear and concise documentation, or as difficult as rewriting project code. If a task moves the project in a positive direction without adding a new feature, I believe it falls under the category of refactoring. It is a simple definition, but it has served me well.

Since this project has been ongoing since 22 Nov 2019, a decent number of issues have been added to the parser’s debt list, and a decent number have been removed from that same list during informal “refactor weeks” that I had. Most of the issues I added to that list were put there for a very specific reason instead of being handled right away. But as I am only human, there were probably a few times where it was just easier to deal with it later. Regardless of how each issue was added to the list of debt, I wanted to try and make a decent-sized dent in the parser’s pile of debt before closing out the main development effort on it. As my goal is to give better perspective on what I am doing and why, I thought an article focusing on just these items would be informative to anyone that has been following the project so far.

Just to keep everything above board, I am not going to talk about every individual commit between the two commit points noted in the next section, just the interesting ones. In that uninteresting category are some changes to get ready for future features, some changes to fix cut-and-paste issues from past changes, and some changes to correct something that I forgot to do. While they are necessary, they are not interesting. They just prove I am human and make mistakes just like everyone else.

And like everyone else, I sometimes put things off until later. But at some point, the bill is due. For me, it was now that I wanted to pay some of that bill off.

What Is the Audience For This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 05 April 2020 and 11 April 2020.

Refactor 1: Matching CommonMark Character Encoding

The parent of this issue was an interesting one to find, and this issue was created to perform the cleanup for it. The HTML output for the examples in the GFM specification is generated by processing the Markdown text examples through CommonMark, mainly because CommonMark is considered the reference implementation for the specification. In each of the following scenario tests, the parser’s output for the href attribute in each link was arguably correct, but it did not match the output in the examples. To resolve this issue and get the scenario tests passing, I needed to fix the link functionality in the parser so that it encodes the right characters according to CommonMark’s “right way” of doing things.

It would have been easy if the encoding misses were all the same thing, but they were not. In the href attribute for example 603, the ampersand character & is replaced with the string &amp;. In example 510, the \ character is replaced by the string %5C. However, in example 509, the ? and # characters are both left alone, and not specially encoded. Just small differences in how the characters were represented, with each one handled a bit differently.
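To make those differences concrete, here is a minimal sketch of that per-character behavior. The function names are mine, and the three cases shown are only the ones called out above; the parser’s actual encoding tables cover far more characters.

```python
# A minimal sketch of the three cases noted above; the names and structure
# are illustrative, not the parser's actual implementation.
def encode_href_character(character):
    if character == "&":
        return "&amp;"  # example 603: rendered as an HTML entity
    if character == "\\":
        return "%5C"    # example 510: percent-encoded
    # example 509: characters such as ? and # pass through untouched
    return character


def encode_href(href):
    return "".join(encode_href_character(character) for character in href)
```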

To be clear, the main goal of resolving the original issue with those tests was not to figure out what the “actual right way” of encoding the href attribute was, but to match what CommonMark was doing. Specifically, I wanted to make sure I had a solid starting point for matching my parser against others, and CommonMark seemed a better candidate than the rest. After all, they published a specification on what they expect from the CommonMark parser in the GFM specification. But as the above cases only caught a small number of issues in 3 tests, I determined that I needed to figure out what the proper set of ASCII character encodings is for CommonMark.

To properly figure out what CommonMark encodes and does not encode, I added a test in the test_markdown_extra.py file that specifies, in sequence, each of the possible characters after the space character and before the del character in the ASCII table. While I could have done something clever using the Python string constants, I wanted the test to be explicit about what it was testing. After submitting the newly generated Markdown sample to Babelmark, I copied Babelmark’s response for CommonMark back into the test as the expected response. From there, I just kept on adjusting the encoding code until the HTML output matched Babelmark’s output for CommonMark exactly.
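While I do not remember the exact layout of that test data off-hand, generating a sample that covers every character in that range looks something like this sketch; the one-link-per-line format is an assumption on my part.

```python
# Hypothetical generation of an ASCII-range sample: every character after
# space (0x20) and before DEL (0x7F), each dropped into a link destination.
# Note: in real Markdown, some characters (such as a close parenthesis)
# would need escaping for the link to stay well-formed.
markdown_parts = []
for char_code in range(0x21, 0x7F):
    markdown_parts.append("[link](/url" + chr(char_code) + ")")
sample_markdown = "\n".join(markdown_parts)
print(sample_markdown)
```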

Did I have to do this? Not really. Technically, I was already meeting the minimum requirements by passing the scenario tests. However, I just felt that it was better to do the thorough job than to leave it to chance.

Refactor 2: More Fun with Tabs

When people start developing code in any programming language, one of the first things that they learn is that using the tab character is a bad idea in most languages. After that, the religious conversations start about whether to replace a tab character with 2 spaces or 4 spaces, but that is not something I want to go into. Those kinds of conversations usually just devolve into bike-shedding within minutes.

In the GFM specification, they avoid this conversation nicely by making it clear:

Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.

Good! So tab characters have a width of 4 characters.1 That sounds right. I then changed my calculate_length function to count every tab character as 4 characters and every non-tab character as 1 character. No problem. Then I went back to fix example 6, which was marked as skipped, and example 7, which was added incorrectly as a test. Reading the preface for those tests:

Normally the > that begins a block quote may be followed optionally by a space, which is not considered part of the content. In the following case > is followed by a tab, which is treated as if it were expanded into three spaces. Since one of these spaces is considered part of the delimiter, foo is considered to be indented six spaces inside the block quote context, so we get an indented code block starting with two spaces.

That statement did not sound right to me. The math was wrong. The block quote start character is followed by 2 tab characters for a total count of 8 whitespace characters. Subtract 1 from that for the whitespace character following the block quote character and you have 7, not 6. I read the entire section on tabs about 4 or 5 times, being either persistent or stubborn, depending on your point of view. Then finally I keyed in on the phrase “expanded into three spaces”.

Something was wrong. I rechecked my math using 3 spaces instead of 4 spaces for that example, and it worked. But it said at the top that tabs were the equivalent of 4 spaces, didn’t it? Rereading the preface on Markdown tab characters for what seemed like the 10th time, I finally noticed it. Not a tab width of 4 characters, but a tab stop of 4 characters. While it may seem like a simple difference, it is not.

A tab stop of 4 means that if a tab character is encountered, the parser should consider there to be spaces until reaching a column index that is a multiple of 4. In the above example, the tab character was encountered after the block quote start character, effectively at an index of 1 on that line. That meant when the parser encountered the tab character, it should be considered the same as 3 space characters, as it would take adding 3 space characters to get from the current position of 1 to the next tab stop at index 4.

While somewhat interesting, the impact of those changes was more interesting to me. The first impact was that I needed to rewrite the calculate_length function to expand tabs based on a possibly non-zero starting index. This wasn’t a really big deal as it was a small function, but the complexity increased enough that I was worried about properly calculating the length, so I added the test_calculate_length module to properly test all of its cases.
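A simplified sketch of the rewritten function, assuming a tab stop of 4 and an optional starting index; the project’s actual signature and details may differ.

```python
# Tab-stop aware length calculation: a tab advances to the next column that
# is a multiple of 4, relative to where the string starts on the line.
def calculate_length(source_string, start_index=0):
    length = start_index
    for next_character in source_string:
        if next_character == "\t":
            length = ((length // 4) + 1) * 4
        else:
            length += 1
    return length - start_index


assert calculate_length("\t") == 4     # a tab at index 0 expands to 4 spaces
assert calculate_length("\t", 1) == 3  # a tab right after ">" expands to 3
```

The second assert is the block quote case from example 7: a tab found at index 1 only adds 3 characters of width before hitting the tab stop at index 4.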

The second impact was on the block quote blocks and list blocks, both of which are container blocks. Before this change, the start position of an enclosed block inside of a container block was not an issue. With this change, to properly account for tabs and tab stops, the parser needs to know the exact position on the line where the current string started in case it contained tabs. This proved to be complicated enough that I spread it over 2 commits.

In between those 2 commits, the third impact was discovered: the tokens for indented code blocks. As a bit of a shortcut, instead of grabbing the required whitespace from the string being parsed, I had simply used a string containing 4 space characters as the leading whitespace for the code block. I needed to change that to grabbing the correct whitespace, as the width of a tab character was now variable.
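In rough terms, that change looked like the following sketch; the function name and variables are illustrative rather than the project’s actual code.

```python
# Before (hedged reconstruction): the indented code block token always
# assumed exactly four literal spaces of leading whitespace.
extracted_whitespace = "    "


# After: capture whatever whitespace is actually present, since a tab's
# effective width now depends on its position within the line.
def split_leading_whitespace(line):
    index = 0
    while index < len(line) and line[index] in " \t":
        index += 1
    return line[:index], line[index:]
```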

In the end, 3 small examples, 1 tab character, 3 impacted areas, and a lot of grumbling on my behalf. But it got fixed… it just took a while.

Refactor 3: Fenced Code Blocks

In getting rid of some of the technical debt, there were a couple of issues with the fenced code blocks that I really wanted to address: fenced code blocks inside of block quotes and fenced code blocks with starting whitespace.

The examination of the first issue, fenced code blocks inside of block quotes, started off easy enough. In example 98, which is as follows:

> ```
> aaa

bbb

a fenced code block is started within the block quote. As such, when the block quote ends, the fenced code block is closed. That part was already working fine. However, when the first and only line in the code block was extracted, it was extracted with 2 extra characters of whitespace in front of it, providing for a token representation of [text:aaa:  ] instead of [text:aaa:]. This in turn added the string aaa with 2 leading space characters to the HTML document instead of just aaa, causing the scenario test to fail.

This issue was easy to fix by making sure that the indent count was considered when returning the remaining string in the fenced code block. Along the way, just to make sure things were parsing properly, I added 2 extra variants of example 98: one with an extra > character at the start of the second line and one without any > at the start of the second line. Seeing as there was already one issue found in that area, I figured the extra tests would not hurt.

Following that quick find-and-fix, I moved on to example 101 and example 103, which also dealt with inconsistent spacing with fenced code blocks. However, in the case of these examples, it was an issue with properly handling the leading spaces before the fenced code block start, the code block content, and the fenced code block end.

In the case of example 101:

 ```
 aaa
aaa
```

the fenced code block start sequence is preceded by a single space character, as is the first line of the code block content. The specification clearly states:

If the leading code fence is indented N spaces, then up to N spaces of indentation are removed from each line of the content (if present).

Looking at the source code, it was obvious to me that any indentation for the fenced code block start was extracted and then immediately thrown out. A quick change to how the fenced code block was storing its information to include the indent count of the fenced code block start, some additional code to remove up to that many whitespace characters from the start of any content lines, and it was done. The same fix applied to example 103, just with a larger indent, and the added concern of having one of the lines contain fewer indent characters than the fenced code block start line.
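The removal logic itself is small; a minimal sketch of the quoted rule, with my own naming, might look like this:

```python
# Strip up to `fence_indent` leading spaces from a content line, quietly
# tolerating lines that are indented less than the fence itself.
def remove_fence_indent(content_line, fence_indent):
    removed = 0
    while (removed < fence_indent
           and removed < len(content_line)
           and content_line[removed] == " "):
        removed += 1
    return content_line[removed:]


assert remove_fence_indent(" aaa", 1) == "aaa"  # example 101's indented line
assert remove_fence_indent("aaa", 1) == "aaa"   # less indented, unchanged
```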

These two small issues only took a matter of hours before they were fixed, but they were good fixes. I am not sure why, but I felt that these issues were going to be larger than they were, and seeing them in my “to fix” list was causing me a bit of stress. Regardless of the effort required, it was good to get them taken care of. Sometimes you move the project in a positive direction in big steps, and sometimes they are baby steps. The important thing to remember is that both are moving the project in the right direction.

Refactor 4: HTML Blocks inside of Lists

Sometimes I look at some code, and I get a feeling that it isn’t quite right. I often cannot immediately point to something specific, but I just have a feeling that usually turns out to be an accurate indicator of bad code. That was the case with example 144. Ignoring any previous statements that I have made about the use of HTML in Markdown, I wanted to make sure that this example was being parsed properly. In this example:

- <div>
- foo

what should be created is a list with an item containing a single <div> HTML element and another list item with the text foo. To be honest, I had to check the HTML and list specifications a couple of times to find out which of them overruled the other. As I should have gathered, the list (a container block) has higher precedence than the HTML block (a leaf block), so when the next list item within the list block starts, the HTML block from the first item is automatically closed.
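For reference, the expected output has roughly the following shape; consult the specification for the exact whitespace and tag placement.

```html
<ul>
<li>
<div>
</li>
<li>foo</li>
</ul>
```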

It is probably because of my uncertainty with the specification regarding which block had the higher precedence that I decided to add an extra test, test_html_blocks_144a:

- <div>
- foo

foo

I realize it might seem like the extra paragraph is unwarranted, but I wanted to make extra sure that everything was being unwound properly, including the list. Whether or not the extra test really adds anything, I can double-check that later. Even though there was nothing wrong with this test, I still trust the instincts that my previous experience with code has given me. I would rather double-check some code and find nothing than not check and miss something important. But at the same time, it had me looking at the list block code, which is how I started looking at the issues for the next section.

Refactor 5: More Fun with Lists

As I mentioned previously, I often get a feeling that a given set of code is not 100% correct, and this time it was about lists, specifically sublists. While each of the required scenario tests for sublists was passing, there was just this little voice in my head saying that something did not feel right. There was just something there that inspired me to add an issue to my notes to check them out further, but not enough to pin down anything more than a vague feeling.

Unlike how things turned out in the previous section with HTML blocks, after I did some digging in the code, I found that my feeling was correct! That digging turned up several problems with sublists, specifically with indenting and with switching between sublists and other lists. To debug these problems and to make sure they stayed fixed, I introduced a total of 16 variations on existing, passing scenario tests. I am still trying to figure out if these issues are due to how I implemented the parser or if they are legitimate edge cases for the specification. Stay tuned for that!

These issues fall into 2 categories: indenting and switching list start sequences. My examples will mostly deal with unordered lists, but I added the same types of tests for both ordered and unordered lists.

As a good example of what I looked for, consider example 281:

- foo
- bar
+ baz

Looking at this Markdown, it should be obvious that it is a pair of unordered lists, the first having two elements and the second having a single element. Playing around with the structure a bit, I decided to keep all the list start characters the same but change the indent of the second list item to make it into a sublist.

* foo
  * bar
* baz

And… boom! The parser returned a single unordered list that contained a sublist with 2 items in it. Thinking it was a fluke, I changed the last list start character to the + character:

* foo
  * bar
+ baz

And boom… again! The results were not as expected. Working my way through many of the examples of lists in the GFM specification, I found 16 variations of simple changes to existing scenario tests that created sublists that were parsed incorrectly. If I worked them out on paper or by using Babelmark, the paper and Babelmark results backed each other up, but the parser’s results were off. I was missing something important somewhere. But where?

After taking a bit of time to cool down, I started looking at those examples, and various patterns of indentation, start characters, and sublists emerged. Adding some debug code to aid in the diagnosis, I ran each of the newly added tests again. After some screaming at my monitor in frustration2, I was rewarded with a common thread that ran across all of the tests: starting conditions.

When I start a list in the parser, the most relevant information about the list start is the starting index of the list and the start character. For single level lists and lists that are neat (that is, lists where all the sublists are indented the same amount), there are no issues with this approach, as everything lines up. But when I started moving the starting locations of the list items and the starting locations of the text within the list items, that is when things fell apart. The big issue was not with the parser going down into the sublists, but with how it recovered when exiting from those sublists: understanding which lists to take off the list stack on the way back out.

With a large amount of grumbling on my part, I worked through each of the new scenario tests and added code that not only remembered where the list item character was, but also where the text for that item started. This allowed me to properly figure out where the list itself started, letting me create better comparisons on whether or not a list should be removed on the way out. It wasn’t an easy experience, but one by one the failing tests became passing tests as I worked through the various combinations required to get it working properly.
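A hedged sketch of that extra bookkeeping, with field and function names that are mine rather than the project’s:

```python
from dataclasses import dataclass


# Hypothetical record of what the parser now tracks for each open list;
# the project's actual tokens carry more detail than this.
@dataclass
class ListStart:
    start_character: str  # "-", "+", "*", or an ordered list sequence
    marker_column: int    # column where the list start character appeared
    text_column: int      # column where the item's text actually started


def pop_closed_sublists(list_stack, new_marker_column):
    # Sketch of the unwinding: a sublist stays open only while new markers
    # start at or beyond the column where that sublist's markers began.
    while list_stack and new_marker_column < list_stack[-1].marker_column:
        list_stack.pop()
    return list_stack
```

In the * foo example above, the sublist’s markers begin at column 2, so a new + baz or * baz marker at column 0 pops that sublist off the stack while leaving the outer list open.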

Sure, I grumbled a lot on this one. But I was glad that I followed a feeling and did some exploratory testing to put that feeling to rest, rather than dismiss it! In my experience, there are many times where you can look at code and instantly know it is wrong and why it is wrong. (And no, I am not talking about tabs, or line spacing, or curly bracket positioning, or…) However, the really interesting times are when you have to trust your gut feeling and explore some scenarios, either to find an issue or to be able to increase your confidence to where you do not have that feeling anymore.

Whether it is a “I can see that is wrong” bug or a “I feel like there is something wrong there” feeling, both have a valid place in your arsenal for improving the quality of the project.

What Was My Experience So Far?

The purpose of this dedicated session of refactoring was not to eliminate all the parser’s technical debt, just to make a good dent in it. Based on that metric, I feel good that I have reduced this debt by a decent amount. Furthermore, I took some steps to group the remaining items into better organized lists for further study. After making some forward progress with other components of the linter, I will inevitably come back to the list in a few weeks and try to deal with a couple more of those items.

Other than the logistics, I feel good about the parser. There are a decent number of issues documented in the list, 40 items in all, but a lot of them are either suggestions for more complete testing or questions about whether something is possible or desired. Doing a quick scan of those items as I write this article, there are not any open items that I am very worried about, just concerns and ideas that I want explored. Basically, I have things that I want to deal with, but it is okay if they wait a bit.

It is with mixed feelings that I move from writing the parser to writing the rules that will take advantage of the parser. I know that I am moving from having conquered one challenge to starting another challenge, and I am okay with that. On one hand, that transition means that the hard work I have put into the parser will pay off. On the other hand, it means that I will be subjecting that same work to more stringent tests as I go forward with the rules and the rest of the linter.

In the end, I guess it boils down to two things. I have confidence in the work I have done to date with the parser, and if anything is missing or needs fixing, I know I will be able to handle it. I also know that the best way to show my confidence in the project is to move ahead and write rules that rely on that parser.

What is Next?

While I know the parser is not perfect, and it may never be, I know it is good enough to build the rules on top of. As such, my next set of tasks involves making some changes to how the parser interacts with the rest of the project code, to ensure that it can handle the wide variety of requirements that I may throw at it!


  1. Yes, I know. That does not sound right to me now either. But at the time, I honestly thought it said a tab width of 4 characters. Keep on reading. 

  2. While I don’t frequently use screaming at my monitor as a form of rubber duck debugging, it does have its benefits sometimes. 
