Summary
In my last article, I continued working on Block Quote issues and some general cleanup that I have wanted to do for a couple of months now. With one week left in my holiday, I wanted to make sure I tackled as many of the big-ticket items as I could while I had the time. If I am going to be doing some work during my holiday, I want to make it count!
Introduction
With a couple of weeks off for the holidays and a relatively light “honey-do”1 schedule, I had some free time to devote to getting the project closer to the goal line. While I did not want to lose focus on getting the smaller items completed, I had a number of big-ticket items that I wanted to do. Because of their size or scope, I wanted to ensure that I had a dedicated amount of contiguous time to work on each item. Except for the occasional weekend, I figured that this would be the best time to work on them and hopefully get all of them completed in one big push. With a good week of New Year’s holiday left to go, it seemed like a good idea to try and get as much done as I could in the short amount of time I had. At least, that was my plan.
What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 28 Dec 2020 and 03 Jan 2021.
Enhancing Code Quality
Continuing with the large item cleanup that I had started in the previous week, I was eager to get some refactoring done on the Markdown token classes. The first step towards refactoring those classes was to make each token as read-only as possible, to avoid the possibility of changing the tokens. To accomplish this task, I just went through each token in order, adding the `__` prefix to most of the field names, thereby making them private. With each of those fields now private, I added a new getter property for each field, named after the original field name. As the property name for the new getter function matched the old name of the field, any read-only access was able to continue without any issues.
As with any such change, there are always exceptions that need to be dealt with individually. One such case was the `active` field of the `SpecialTextMarkdownToken` class. The first exception was that instead of retaining the name `active`, I felt that the name `is_active` was more descriptive. The second exception was that this token type is used to look for the starts and ends of various inline token sequences. As such, when those sequences are found, the previous instances of those tokens are marked as inactive, meaning they will not be used any more. To take care of this, I introduced to that token a new `deactivate` function, specifically used to deactivate the token without exposing the token’s member variable in its private form.
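As a concrete illustration of that pattern, here is a minimal sketch; the field set is trimmed down and simplified from whatever the real class carries:

```python
class SpecialTextMarkdownToken:
    """Minimal sketch of the private-field-plus-property refactoring."""

    def __init__(self, token_text):
        # The `__` prefix makes these fields private to the class.
        self.__token_text = token_text
        self.__is_active = True

    @property
    def token_text(self):
        # A getter named after the original field keeps read access working.
        return self.__token_text

    @property
    def is_active(self):
        # Renamed from `active` to be more descriptive.
        return self.__is_active

    def deactivate(self):
        # Marks the token as used up, without exposing the private field.
        self.__is_active = False
```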
Once this refactoring was completed, I realized that the `markdown_token.py` module was way too large for my liking and needed to be broken down. Keeping the base `MarkdownToken` and `EndMarkdownToken` classes in their original file, I started to move every Markdown token to one of three new modules: one for container block tokens, one for leaf block tokens, and one for inline tokens. Once the tokens were in their new modules and all tests passed, I added a new base token class to each of the three new modules and switched the base class of each token to the new base token in the same module. By switching each token over to use these three new base classes, I was able to further reduce the amount of code in each token. While it was not too much of a reduction, it was a reduction I felt good about.
Enhancing Some More
When I started working on this group of tasks, the first thing that came to mind was the words to an old campfire song I learned long ago in Boy Scouts of Canada. While it is a silly song named Nelly In The Barn, the bit between the verses goes:
> Second verse, same as the first, a little bit louder and a little bit worse!
I could think of no phrase better to describe what I needed to do with the `StackToken` class. Having had good success with changing all the `MarkdownToken` classes to use private fields and `is_x` methods (as detailed in my last article), I felt that the `StackToken` class needed the same treatment.
Unlike the almost 20 classes descended from `MarkdownToken`, the transformation of the 9 classes descended from `StackToken` went by quickly. Like the changes made to the `MarkdownToken` classes, I was emboldened to make these changes by the confidence of having a large group of scenario tests that I could use to look for any issues. Without that large group of tests, I would be worried that I would fix one thing, only to break something else in a related piece of code that I had forgotten about.
Closing the Test Loop
As I noted in the section of the last article entitled Making the Test Invocation More Efficient, I implemented a change to how I invoked the test infrastructure to reduce the needed code from eight lines (1 blank line, 2 comment lines, and 5 code lines) to two lines (1 comment line and 1 code line). Having done some further examination, I felt that those changes had settled in nicely and that it was time to propagate them to all similar test functions.
I had a good model for the changes, but even so, the work was very monotonous. With just short of 2000 scenario tests that required changing, it was a test of wills: me versus the code base. To keep myself motivated, I kept a search window open on the side of my editor, selecting a new group of tests to modify whenever I found myself getting bored. Even after taking extra breaks to do household chores, I still found that it was a tough task. But I knew it was a good task to do, so even though I could feel the lobes of my brain numbing with each keystroke, I pressed forward with making the changes.
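To give a feel for the change without reproducing the project’s exact helper, here is a hypothetical before-and-after, with `act_and_assert` and the stand-in `tokenize` function invented for the sketch:

```python
def tokenize(source_markdown):
    # Stand-in for the real parser call, so this sketch is self-contained.
    return source_markdown.split()


def act_and_assert(source_markdown, expected_tokens):
    # Hypothetical consolidated helper: act and assert in a single call.
    assert tokenize(source_markdown) == expected_tokens


def test_scenario_before():
    # Arrange
    source_markdown = "alpha beta"
    expected_tokens = ["alpha", "beta"]

    # Act
    actual_tokens = tokenize(source_markdown)

    # Assert
    assert actual_tokens == expected_tokens


def test_scenario_after():
    # Arrange, Act & Assert
    act_and_assert("alpha beta", ["alpha", "beta"])
```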
Keeping Things Simple
Having ensured that all end Markdown tokens had their `start_markdown_token` field set, I looked at the code to figure out if there were any redundancies that were introduced with that change. As that field points to the start Markdown token, there was a good chance that I had stored duplicate data in the `EndMarkdownToken` to avoid having to calculate the start Markdown token for some of the tokens. It was just a matter of identifying any such tokens.

While I was sincerely expecting more of an impact, the only change that I was able to make was around the end Markdown token generated for the `EmphasisMarkdownToken` token. To get around the constraints at the time that it was written, I had added duplicated data to that end token to denote the length of the emphasis and the emphasis character. With the actual start token now available for reference, I was able to replace the duplicate data stored in the `EndMarkdownToken` with a reference to the `EmphasisMarkdownToken` instance.
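The gist of the change, with invented field names standing in for the real ones, is that the end token now derives the emphasis details from its start token instead of carrying copies:

```python
class EmphasisMarkdownToken:
    def __init__(self, emphasis_length, emphasis_character):
        self.emphasis_length = emphasis_length
        self.emphasis_character = emphasis_character


class EndMarkdownToken:
    def __init__(self, start_markdown_token):
        # A single reference replaces the previously duplicated fields.
        self.start_markdown_token = start_markdown_token

    @property
    def emphasis_length(self):
        # Derived on demand from the start token, not stored twice.
        return self.start_markdown_token.emphasis_length


end_token = EndMarkdownToken(EmphasisMarkdownToken(2, "*"))
assert end_token.emphasis_length == 2
```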
No longer needing that duplicate data, I removed it from the `__process_emphasis_pair` function. The fallout of that change was approximately 200 instances where I needed to replace the now useless data with the string `:`. To make things easier while editing, I simply kept the text `:::` in the clipboard, searched for `[end-emphasis(`, and replaced the first three characters after the `)` character. It was mind-numbing work that I did in three or four shifts, but I got it done. When I ran the tests, everything was working except for a couple of tests. Investigating each of those failed tests, the failures were all simple typing errors, quickly fixed to make the tests pass.
Reorganizing the Series M File
This task was a pure cut-and-paste task, but one that I really needed to do. At over 10 thousand lines of code, the `test_markdown_paragraph_series_m.py` module was just way too big! I was initially okay with the size of the module, seeing that all the scenario tests in the file were related to each other. But as I added more and more tests to that one large file, it was becoming too difficult to work with in that form. As such, I simply created one file for each group of tests, such as `test_markdown_paragraph_series_m_fb.py` for the Fenced Code Block tests, and moved the test functions into their new homes.
Collapsing Ordered and Unordered List Processing
At the start of the project, while I was working through the initial parts of the parser, I was not sure that the processing of Ordered List Blocks and Unordered List Blocks would overlap. With almost a year of processing accomplished, I now had a very solid observation on that subject. Most of the processing overlapped, and overlapped cleanly. Now it was just a matter of taking the time to surgically merge the two List Block token concepts into one in different places in the source code.
The First Step
A big first step on this journey was to move the code for calculating the looseness of an HTML-rendered list from the `TransformToGfm` class into the new `TransformToGfmListLooseness` class. While it was just a simple cut-and-paste move, I feel that the move left the `TransformToGfm` class better able to focus on the HTML transformation, instead of also having the responsibility of figuring out the list looseness. It just felt cleaner to have that responsibility in its own class and module.
Along with that change, I made equal changes to how the List Block starts were processed in the HTML transformer and the Markdown transformer. In the HTML transformer, the `__handle_start_unordered_list_token` function was renamed to `__handle_start_list_token` and the `__handle_start_ordered_list_token` function code was merged into that function. In the Markdown transformer, the same process was repeated, with the `__rehydrate_unordered_list_start` function being renamed to `__rehydrate_list_start` and the `__rehydrate_unordered_list_start_end` function code being merged into that function. That merge allowed the handler calls in each module to deal more simply with the List Blocks, something that was worth it to me.
Equalizing the Two List Blocks
Having done a lot of testing with the Unordered List Blocks, I felt it was time to give some extra focus to the Ordered List Blocks. Before the holiday break started, I had noticed a handful of cases where the Ordered List Blocks had errors in them that I thought should have been caught by existing tests. While fixing this issue was not the primary goal of this round of refactoring, I considered it a simple secondary goal: I should only have to fix list issues once, not twice.
Looking at the second half of the `is_ulist_start` function, I noticed that there were significant differences from its sibling `is_olist_start` function. Slowly, I started making changes to the `is_ulist_start` function, bringing it more in line with its sibling. But after making those changes, I still had the task of making sure that those changes were working properly. As any failures were not being caught directly, but discovered through other tests, I figured that I needed to step up my test game.

To start this off, I picked six scenarios from each of the Series M tests and made copies of those tests. Instead of using the Ordered List Blocks in those tests, I replaced the Ordered List Blocks with Unordered List Blocks. It was only after adding those 108 scenario tests that I was confident that those changes had a good start at getting coverage. And it paid off too. The only issues that were found were in the `__perform_container_post_processing_lists` function, where the data to be merged with the surrounding list had to be massaged before a call to `__merge_with_container_data` and restored after that call was completed.
Refining List Start Functions
The final set of functionality to merge was the `is_ulist_start` function and the `is_olist_start` function. Both of these functions had been on my “refactor” list for a while, so I was glad to get started on them. On closer examination, there were only a few statements or values that changed between the two functions. Once the setup was accomplished in the first half of each function, the second half was nearly identical. Creating a new `__xxx` function, I cut the second half out of one of those two functions and pasted it into that new function. After checking to make sure nothing was lost in the cut-and-paste, I compared it line-by-line with the similar code in the other function, adjusting both functions to be represented by the new function.
After a couple of rewind moments, the new `__xxx` function incorporated the process from both original functions. With that accomplished and staged, I removed the second function and used the new function in its place. After fixing a couple of small issues, the new function was up and running and working for both Ordered List Blocks and Unordered List Blocks.
At that time, I remember looking at the code and thinking that I had only completed half of the job. Repeating the same process that got me to that point, I soon renamed the `__xxx` function to `__is_start_phase_two`, and further extracted code into a new `__is_start_phase_one` function. With that done, the `is_olist_start` function was already slim, and I extracted the remaining logic into the `__is_start_olist` function to keep it slim, replicating that processing with the `is_ulist_start` function.
In the end, I was very satisfied with the amount of refactoring that I was able to accomplish. Both methods were now six statements long, with 95% of the differing functionality in the `__is_start_olist` function and the `__is_start_ulist` function. While it was a good feeling getting the refactoring done, it was an even better feeling knowing that I had a healthy set of test cases that I could count on when refactoring!
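Reduced to a skeleton, the final arrangement was along these lines; the marker checks here are grossly simplified stand-ins for the real logic:

```python
def __is_start_phase_one(line):
    # Shared setup, common to both list types.
    return line.lstrip()


def __is_start_olist(stripped):
    # The ordered-specific sliver: a digit followed by '.' or ')'.
    return len(stripped) >= 2 and stripped[0].isdigit() and stripped[1] in ".)"


def __is_start_ulist(stripped):
    # The unordered-specific sliver: a bullet character.
    return stripped[:1] in ("-", "+", "*")


def __is_start_phase_two(is_start):
    # Shared follow-up processing, common to both list types.
    return is_start


def is_olist_start(line):
    return __is_start_phase_two(__is_start_olist(__is_start_phase_one(line)))


def is_ulist_start(line):
    return __is_start_phase_two(__is_start_ulist(__is_start_phase_one(line)))


assert is_olist_start("1. item") and is_ulist_start("- item")
```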
Consolidating New Block Starts
Having pored through the code looking for things to simplify, I was keenly aware of one set of function calls that I could simplify: detecting new block starts. Developed in different parts of the code for similar reasons, both the List Block Processor’s `__check_for_list_closures` function and the Block Quote Processor’s `check_for_lazy_handling` function were performing almost the same set of instructions. One by one, the different Leaf Block start functions were being invoked to determine if a newline indicated the start of a new block.
It was not a big change, but consolidating that code into the `is_paragraph_ending_leaf_block_start` function just made things cleaner. There was just one function to call, with good documentation on what was going on. It just made sense to me.
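As a sketch of what that consolidation looks like, with a few loose regular expressions standing in for the project’s real per-block start checks:

```python
import re

# Loose stand-ins for the individual Leaf Block start checks that the two
# processors used to invoke one by one.
LEAF_BLOCK_START_PATTERNS = (
    r"^ {0,3}(\* *){3,}$",      # thematic break (asterisk form only)
    r"^ {0,3}#{1,6}(\s|$)",     # ATX heading
    r"^ {0,3}(```|~~~)",        # fenced code block
    r"^ {0,3}<",                # HTML block, very loosely
)


def is_paragraph_ending_leaf_block_start(line):
    """Single, documented place to ask: does this line start a leaf block
    that can interrupt the current paragraph?"""
    return any(re.match(pattern, line) for pattern in LEAF_BLOCK_START_PATTERNS)


assert is_paragraph_ending_leaf_block_start("# heading")
assert not is_paragraph_ending_leaf_block_start("just more paragraph text")
```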
Verifying Paragraph Usage
Just before I started writing on that Sunday morning, I decided to add something that was hopefully going to be a slam dunk. While I was reasonably sure that all the various references to Paragraph elements and any newlines within them were accounted for, I wanted to be doubly sure. So, to be sure of that, I added the following code to the end of the `__rehydrate_paragraph_end` function:
```python
rehydrate_index = current_token.start_markdown_token.rehydrate_index
expected_rehydrate_index = (
    current_token.start_markdown_token.extracted_whitespace.split("\n")
)
assert rehydrate_index + 1 == len(expected_rehydrate_index), (
    "rehydrate_index+1="
    + str(rehydrate_index + 1)
    + ";expected_rehydrate_index="
    + str(len(expected_rehydrate_index))
)
```
Basically, by the end of the Paragraph element’s processing, the `rehydrate_index` member variable should have been incremented once for each newline contained within the paragraph. For example, a paragraph containing two newlines splits into three pieces of whitespace, so the index should end at 2. If this did not happen, it means that the use of that `rehydrate_index` member variable is potentially off and needs to be fixed.
Surprisingly, even after adding some extra tests, only one element had issues: the Raw HTML element. In the main parser’s `parse_raw_html` function, the raw text to use for the tag was not properly accounting for the newline, something that was quickly fixed. This pattern continued in the Markdown generator’s `__rehydrate_inline_raw_html` function, where I specifically created the `__handle_extracted_paragraph_whitespace` function to handle the pulling apart and reintegration of that information. This was a useful function to have, as I found some small issues with the `__rehydrate_inline_code_span` function that required applying that function to the various parts of the Code Span element. Finally, to round out the fixes, the `__verify_next_inline_raw_html` function needed to have a similar approach taken to address the last issues with the `rehydrate_index`.
Given that it could have been a lot more cumbersome to fix, I was happy that I got off with a relatively light amount of work!
As I Was Making These Changes…
During the middle of these changes, I rearranged the items in the issues list. My goal was to take the priorities that I had in my mind and reinforce them in the list. The only exceptions to this reorganization were the items that were an immediate focus of what I was doing at the moment. The way I rationalized this was that anything I could set down for a while was something I could figure out when to get to later. If an item remained in the immediate section of the list or was newly added to it, I deemed it more of an immediate task, one that just needed doing.
While it may seem like a bit of a wishy-washy rule, it was one that I still felt fine with after a couple of days of having the prioritization in place. Granted, it felt like I was churning through the immediate task section, but that also felt right. As I work on whatever the current task is, I observe things in the code and have questions about whether I took the right approach. Writing those things down in the issues list allows me to continue with the task without losing too much of the essence of what I wanted to verify or question. To me, that just seems like a good approach!
Preparing for The Future
A major part of this week’s work was to clean things up a bit with bigger tasks that I do not normally have time for. As such, I decided to spend about four to five hours during the week experimenting with SnakeViz and incorporating primitive SnakeViz support into the project. While it is too early to say what needs the most improvement, I can say that I was able to get some good experience working with the tool and the output HTML that helps visualize the performance. I engaged in the experimentation specifically to gain some experience with code profiling, and I can honestly say that I think I got the experience that I was looking for!
The smart thing about using SnakeViz to profile is that it is interactive. To ensure that a user can dig down and get relevant information, SnakeViz takes a performance analysis file and hosts fragments of it through a webserver that it starts. As the webserver is starting, SnakeViz also opens its own page in the user’s browser. After that, it is just a matter of clicking around and displaying information on any number of functions that were executed during the code profile run.
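For anyone who wants to try it, the basic flow is to write a stats file with Python’s built-in cProfile module and then point SnakeViz at that file, roughly like this:

```python
import cProfile


def work():
    # Stand-in for the code being profiled; in my case, the parser.
    return sum(i * i for i in range(100_000))


# Write the profiling data out to a stats file...
cProfile.run("work()", "profile.stats")

# ...then, from the command line, have SnakeViz host and open it:
#     snakeviz profile.stats
```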
I do not know if everyone will have the same experience that I did, but I found the interface simple and easy to use. When I start focusing on performance, I know I will spend a decent amount of time looking at the `tottime` column, which displays the total amount of time spent in each function itself, excluding sub-functions, during the entire code profile run. I expect I will look at the top few items on that list and try to figure out why they are taking that much time. After making some guesses and changes to the code to match, rerunning the code profile run with the exact same data will be pivotal.
While it is still at least a month or so off, I am looking forward to using this tool and making the code perform well!
What Was My Experience So Far?
In terms of actual issues that I fixed during this week, the count was either low or zero, depending on who was doing the accounting. But the big win was getting most of the big-ticket items taken care of. To be honest, it was a lot of pressure off my mind getting those big items done. From my viewpoint, I am not sure that I would have felt okay with an initial release of the project without those items being addressed. It was just good housekeeping, and now it was done!
It just felt good to make the time to get those big-ticket items resolved. While it can be argued that there were not that many real changes to the code base, each of those changes made a significant impact on my confidence in the project. The refactoring to provide easy and private access to the tokens? It increased my confidence that I have not missed any weird side effects. The reorganizing and simple cleanup? It increased my confidence that I did not have extra “stuff” to maintain across the code base that would make things more complicated. Consolidating the List Block code? It increased my confidence that I have captured a healthy set of scenarios that properly test both Ordered List Blocks and Unordered List Blocks. Each change simply increased my confidence by safely reducing the amount of required code in the project.
It also felt good for another reason: it was fun. Sure, there were boring parts, like making 2000 changes for the scenario test refactor. That was definitely monotonous. But it was also kind of fun, because it was something different for the project. It was not the same old test-fix-run and repeat process that I had been doing for months. It was just something different. And that was just fun to do sometimes.
What is Next?
With most of the big-ticket tasks out of the way, I needed to buckle down and try to resolve as many of the Block Quote items in the unprioritized section as possible. While it would not be as fun as the big-ticket items, I knew they were worth doing and that doing so would get me closer to an initial release.
1. To be fair, in our household we have one honey-do list for me and one honey-do list for my spouse. And we both ignore some items until reminded and get some items off the list on our own. You know, typical list.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.