Summary
In my last article, I continued working on Block Quote issues and some general cleanup that I have wanted to do for a couple of months now. With one week left in my holiday, I wanted to make sure I tackled as many of the big-ticket items as I could while I had the time. If I am going to be doing some work during my holiday, I want to make it count!
Introduction
With a couple of weeks off for the holidays and a relatively light “honey-do”1 schedule, I had some free time to devote to getting the project closer to the goal line. While I did not want to lose focus on getting the smaller items completed, I had a number of big-ticket items that I wanted to do. Because of their size or scope, I wanted to ensure that I had a dedicated amount of contiguous time to work on each item. Except for the occasional weekend, I figured that this would be the best time to work on them and hopefully get all of them completed in one big push. With a good week of New Year’s holiday left to go, it seemed like a good idea to try and get as much done as I could in the short amount of time I had. At least, that was my plan.
What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 28 Dec 2020 and 03 Jan 2021.
Enhancing Code Quality
Continuing with the large item cleanup that I had started in the previous week, I was eager to get some refactoring done on the Markdown token classes. The first step towards refactoring those classes was to make each token as read-only as possible, to avoid the possibility of changing the tokens. To accomplish this task, I just went through each token in order, adding the `__` prefix to most of the field names, thereby making them private. With each of those fields now private, I added a new getter property for each field, named after the original field name. As the property name for the new getter function matched the old name of the field, any read-only access was able to continue without any issues.
As with any such change, there are always exceptions that need to be dealt with individually. One such case was the `active` field of the `SpecialTextMarkdownToken` class. The first exception was that instead of retaining the name `active`, I felt that the name `is_active` was more descriptive. The second exception was that this token type is used to look for the starts and ends of various inline token sequences. As such, when those sequences are found, the previous instances of those tokens are marked as inactive, meaning they will not be used any more. To take care of this, I introduced to that token a new `deactivate` function, specifically used to deactivate the token without exposing the token’s member variable in its private form.
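As a concrete illustration of that pattern, here is a minimal sketch; the field set is trimmed down and simplified from whatever the real class carries:

```python
class SpecialTextMarkdownToken:
    """Minimal sketch of the private-field-plus-property refactoring."""

    def __init__(self, token_text):
        # The `__` prefix makes these fields private to the class.
        self.__token_text = token_text
        self.__is_active = True

    @property
    def token_text(self):
        # A getter named after the original field keeps read access working.
        return self.__token_text

    @property
    def is_active(self):
        # Renamed from `active` to be more descriptive.
        return self.__is_active

    def deactivate(self):
        # Marks the token as used up, without exposing the private field.
        self.__is_active = False
```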
Once this refactoring was completed, I realized that the `markdown_token.py` module was way too large for my liking and needed to be broken down. Keeping the base `MarkdownToken` and `EndMarkdownToken` classes in their original file, I started to move every Markdown token to one of three new modules: one for container block tokens, one for leaf block tokens, and one for inline tokens. Once the tokens were in their new modules and all tests passed, I added a new base token class to each of the three new modules and switched the base class of each token to the new base token in the same module. By switching each token over to use these three new base classes, I was able to further reduce the amount of code in each token. While it was not too much of a reduction, it was a reduction I felt good about.
Enhancing Some More
When I started working on this group of tasks, the first thing that came to mind was the words to an old campfire song I learned long ago in Boy Scouts of Canada. While it is a silly song named Nelly In The Barn, the bit between the verses goes:
> Second verse, same as the first, a little bit louder and a little bit worse!
I could think of no phrase better to describe what I needed to do with the `StackToken` class. Having had good success with changing all the `MarkdownToken` classes to use private fields and `is_x` methods (as detailed in my last article), I felt that the `StackToken` class needed the same treatment.
Unlike the almost 20 classes descended from `MarkdownToken`, the transformation of the 9 classes descended from `StackToken` went by quickly. Like the changes made to the `MarkdownToken` classes, I was emboldened to make these changes by the confidence of having a large group of scenario tests that I could use to look for any issues. Without that large group of tests, I would be worried that I would fix one thing, only to break something else in a related piece of code that I had forgotten about.
Closing the Test Loop
As I noted in the section of the last article entitled Making the Test Invocation More Efficient, I implemented a change to how I invoked the test infrastructure to reduce the needed code from eight lines (1 blank line, 2 comment lines, and 5 code lines) to two lines (1 comment line and 1 code line). Having done some further examination, I felt that those changes had settled in nicely and that it was time to propagate them to all similar test functions.
I had a good model for the changes, but even so, the work was very monotonous. With just short of 2000 scenario tests that required changing, it was a test of wills: me versus the code base. To keep myself motivated, I kept a search window open on the side of my editor, selecting a new group of tests to modify whenever I found myself getting bored. Even after taking extra breaks to do household chores, I still found that it was a tough task. But I knew it was a good task to do, so even though I could feel the lobes of my brain numbing with each keystroke, I pressed forward with making the changes.
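To give a feel for the change without reproducing the project’s exact helper, here is a hypothetical before-and-after, with `act_and_assert` and the stand-in `tokenize` function invented for the sketch:

```python
def tokenize(source_markdown):
    # Stand-in for the real parser call, so this sketch is self-contained.
    return source_markdown.split()


def act_and_assert(source_markdown, expected_tokens):
    # Hypothetical consolidated helper: act and assert in a single call.
    assert tokenize(source_markdown) == expected_tokens


def test_scenario_before():
    # Arrange
    source_markdown = "alpha beta"
    expected_tokens = ["alpha", "beta"]

    # Act
    actual_tokens = tokenize(source_markdown)

    # Assert
    assert actual_tokens == expected_tokens


def test_scenario_after():
    # Arrange, Act & Assert
    act_and_assert("alpha beta", ["alpha", "beta"])
```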
Keeping Things Simple
Having ensured that all end Markdown tokens had their `start_markdown_token` field set, I looked at the code to figure out if there were any redundancies that were introduced with that change. As that field points to the start Markdown token, there was a good chance that I had stored duplicate data in the `EndMarkdownToken` to avoid having to calculate the start Markdown token for some of the tokens. It was just a matter of identifying any such tokens.

While I was sincerely expecting more of an impact, the only change that I was able to make was around the end Markdown token generated for the `EmphasisMarkdownToken` token. To get around the constraints at the time that it was written, I had added duplicated data to that end token to denote the length of the emphasis and the emphasis character. With the actual start token now available for reference, I was able to replace the duplicate data stored in the `EndMarkdownToken` with a reference to the `EmphasisMarkdownToken` instance.
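The gist of the change, with invented field names standing in for the real ones, is that the end token now derives the emphasis details from its start token instead of carrying copies:

```python
class EmphasisMarkdownToken:
    def __init__(self, emphasis_length, emphasis_character):
        self.emphasis_length = emphasis_length
        self.emphasis_character = emphasis_character


class EndMarkdownToken:
    def __init__(self, start_markdown_token):
        # A single reference replaces the previously duplicated fields.
        self.start_markdown_token = start_markdown_token

    @property
    def emphasis_length(self):
        # Derived on demand from the start token, not stored twice.
        return self.start_markdown_token.emphasis_length


end_token = EndMarkdownToken(EmphasisMarkdownToken(2, "*"))
assert end_token.emphasis_length == 2
```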
No longer needing that duplicate data, I removed it from the `__process_emphasis_pair` function. The fallout of that change was approximately 200 instances where I needed to replace the now useless data with the string `:`. To make things easier while editing, I simply kept the text `:::` in the clipboard, searched for `[end-emphasis(`, and replaced the first three characters after the `)` character. It was mind-numbing work that I did in three or four shifts, but I got it done. When I ran the tests, everything was working except for a couple of tests. Investigating each of those failed tests, the failures were all simple typing errors, quickly fixed to make the tests pass.
Reorganizing the Series M File
This task was a pure cut-and-paste task, but one that I really needed to do. At over 10 thousand lines of code, the `test_markdown_paragraph_series_m.py` module was just way too big! I was initially okay with the size of the module, seeing that all the scenario tests in the file were related to each other. But as I added more and more tests to that one large file, it was becoming too difficult to work with in that form. As such, I simply created one file for each group of tests, such as `test_markdown_paragraph_series_m_fb.py` for the Fenced Code Block tests, and moved the test functions into their new homes.
Collapsing Ordered and Unordered List Processing
At the start of the project, while I was working through the initial parts of the parser, I was not sure that the processing of Ordered List Blocks and Unordered List Blocks would overlap. With almost a year of processing accomplished, I now had a very solid observation on that subject. Most of the processing overlapped, and overlapped cleanly. Now it was just a matter of taking the time to surgically merge the two List Block token concepts into one in different places in the source code.
The First Step
A big first step on this journey was to move the code for calculating the looseness of an HTML-rendered list from the `TransformToGfm` class into the new `TransformToGfmListLooseness` class. While it was just a simple cut-and-paste move, I feel that the move left the `TransformToGfm` class better able to focus on the HTML transformation, instead of also having the responsibility of figuring out the list looseness. It just felt cleaner to have that responsibility in its own class and module.
Along with that change, I made equal changes to how the List Block starts were processed in the HTML transformer and the Markdown transformer. In the HTML transformer, the `__handle_start_unordered_list_token` function was renamed to `__handle_start_list_token` and the `__handle_start_ordered_list_token` function code was merged into that function. In the Markdown transformer, the same process was repeated, with the `__rehydrate_unordered_list_start` function being renamed to `__rehydrate_list_start` and the `__rehydrate_unordered_list_start_end` function code being merged into that function. That merge allowed the handler calls in each module to deal more simply with the List Blocks, something that was worth it to me.
Equalizing the Two List Blocks
Having done a lot of testing with the Unordered List Blocks, I felt it was time to give some extra focus to the Ordered List Blocks. Before the holiday break started, I had noticed a handful of cases where the Ordered List Blocks had errors in them that I thought should have been caught by existing tests. While fixing this issue was not the primary goal of this round of refactoring, I considered it a simple secondary goal: I should only have to fix list issues once, not twice.
Looking at the second half of the `is_ulist_start` function, I noticed that there were significant differences from its sibling `is_olist_start` function. Slowly, I started making changes to the `is_ulist_start` function, bringing it more in line with its sibling. But after making those changes, I still had the task of making sure that those changes were working properly. As any failures were not being caught directly, but discovered through other tests, I figured that I needed to step up my test game.

To start this off, I picked six scenarios from each of the Series M tests and made copies of those tests. Instead of using the Ordered List Blocks in those tests, I replaced the Ordered List Blocks with Unordered List Blocks. It was only after adding those 108 scenario tests that I was confident that those changes had a good start at getting coverage. And it paid off too. The only issues that were found were in the `__perform_container_post_processing_lists` function, where the data to be merged with the surrounding list had to be massaged before a call to `__merge_with_container_data` and restored after that call was completed.
Refining List Start Functions
The final set of functionality to merge was the `is_ulist_start` function and the `is_olist_start` function. Both of these functions had been on my “refactor” list for a while, so I was glad to get started on them. On closer examination, there were only a few statements or values that changed between the two functions. Once the setup was accomplished in the first half of each function, the second half was nearly identical. Creating a new `__xxx` function, I cut the second half out of one of those two functions and pasted it into that new function. After checking to make sure nothing was lost in the cut-and-paste, I compared it line-by-line with the similar code in the other function, adjusting both functions to be represented by the new function.
After a couple of rewind moments, the new `__xxx` function incorporated the process from both original functions. With that accomplished and staged, I removed the second function and used the new function in its place. After fixing a couple of small issues, the new function was up and running and working for both Ordered List Blocks and Unordered List Blocks.
At that time, I remember looking at the code and thinking that I had only completed half of the job. Repeating the same process that got me to that point, I soon renamed the `__xxx` function to `__is_start_phase_two`, and further extracted code into a new `__is_start_phase_one` function. With that done, the `is_olist_start` function was already slim, and I extracted the remaining logic into the `__is_start_olist` function to keep it slim, replicating that processing with the `is_ulist_start` function.
In the end, I was very satisfied with the amount of refactoring that I was able to accomplish. Both methods were now six statements long, with 95% of the differing functionality in the `__is_start_olist` function and the `__is_start_ulist` function. While it was a good feeling getting the refactoring done, it was an even better feeling knowing that I had a healthy set of test cases that I could count on when refactoring!
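Reduced to a skeleton, the final arrangement was along these lines; the marker checks here are grossly simplified stand-ins for the real logic:

```python
def __is_start_phase_one(line):
    # Shared setup, common to both list types.
    return line.lstrip()


def __is_start_olist(stripped):
    # The ordered-specific sliver: a digit followed by '.' or ')'.
    return len(stripped) >= 2 and stripped[0].isdigit() and stripped[1] in ".)"


def __is_start_ulist(stripped):
    # The unordered-specific sliver: a bullet character.
    return stripped[:1] in ("-", "+", "*")


def __is_start_phase_two(is_start):
    # Shared follow-up processing, common to both list types.
    return is_start


def is_olist_start(line):
    return __is_start_phase_two(__is_start_olist(__is_start_phase_one(line)))


def is_ulist_start(line):
    return __is_start_phase_two(__is_start_ulist(__is_start_phase_one(line)))


assert is_olist_start("1. item") and is_ulist_start("- item")
```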
Consolidating New Block Starts
Having pored through the code looking for things to simplify, I was keenly aware of one set of function calls that I could simplify: detecting new block starts. Developed in different parts of the code for similar reasons, both the List Block Processor’s `__check_for_list_closures` function and the Block Quote Processor’s `check_for_lazy_handling` function were performing almost the same set of instructions. One by one, the different Leaf Block start functions were being invoked to determine if a newline indicated the start of a new block.
It was not a big change, but consolidating that code into the `is_paragraph_ending_leaf_block_start` function just made things cleaner. There was just one function to call, with good documentation on what was going on. It just made sense to me.
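As a sketch of what that consolidation looks like, with a few loose regular expressions standing in for the project’s real per-block start checks:

```python
import re

# Loose stand-ins for the individual Leaf Block start checks that the two
# processors used to invoke one by one.
LEAF_BLOCK_START_PATTERNS = (
    r"^ {0,3}(\* *){3,}$",      # thematic break (asterisk form only)
    r"^ {0,3}#{1,6}(\s|$)",     # ATX heading
    r"^ {0,3}(```|~~~)",        # fenced code block
    r"^ {0,3}<",                # HTML block, very loosely
)


def is_paragraph_ending_leaf_block_start(line):
    """Single, documented place to ask: does this line start a leaf block
    that can interrupt the current paragraph?"""
    return any(re.match(pattern, line) for pattern in LEAF_BLOCK_START_PATTERNS)


assert is_paragraph_ending_leaf_block_start("# heading")
assert not is_paragraph_ending_leaf_block_start("just more paragraph text")
```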
Verifying Paragraph Usage
Just before I started writing on that Sunday morning, I decided to add something that was hopefully going to be a slam dunk. While I was reasonably sure that all the various references to Paragraph elements and any newlines within them were accounted for, I wanted to be doubly sure. So, to be sure of that, I added the following code to the end of the `__rehydrate_paragraph_end` function:
```python
rehydrate_index = current_token.start_markdown_token.rehydrate_index
expected_rehydrate_index = (
    current_token.start_markdown_token.extracted_whitespace.split("\n")
)
assert rehydrate_index + 1 == len(expected_rehydrate_index), (
    "rehydrate_index+1="
    + str(rehydrate_index + 1)
    + ";expected_rehydrate_index="
    + str(len(expected_rehydrate_index))
)
```
Basically, by the end of the Paragraph element’s processing, the `rehydrate_index` member variable should have been incremented once for each newline contained within the paragraph. For example, a paragraph containing two newlines splits into three pieces of whitespace, so the index should end at 2. If this did not happen, it means that the use of that `rehydrate_index` member variable is potentially off and needs to be fixed.
Surprisingly, even after adding some extra tests, only one element had issues: the Raw HTML element. In the main parser’s `parse_raw_html` function, the raw text to use for the tag was not properly accounting for the newline, something that was quickly fixed. This pattern continued in the Markdown generator’s `__rehydrate_inline_raw_html` function, where I specifically created the `__handle_extracted_paragraph_whitespace` function to handle the pulling apart and reintegration of that information. This was a useful function to have, as I found some small issues with the `__rehydrate_inline_code_span` function that required applying that function to the various parts of the Code Span element. Finally, to round out the fixes, the `__verify_next_inline_raw_html` function needed to have a similar approach taken to address the last issues with the `rehydrate_index`.
Given that it could have been a lot more cumbersome to fix, I was happy that I got off with a relatively light amount of work!
As I Was Making These Changes…
During the middle of these changes, I rearranged the items in the issues list. My goal was to take the priorities that I had in my mind and reinforce them in the list. The only exceptions to this reorganization were the items that were an immediate focus of what I was doing at the moment. The way I rationalized this was that anything I could set down for a while was something I could figure out when to get to later. If an item remained in the immediate section of the list or was newly added to it, I deemed it more of an immediate task, one that just needed doing.
While it may seem like a bit of a wishy-washy rule, it was one that I still felt fine with after a couple of days of having the prioritization in place. Granted, it felt like I was churning through the immediate task section, but that also felt right. As I work on whatever the current task is, I observe things in the code and have questions about whether I took the right approach. Writing those things down in the issues list allows me to continue with the task without losing too much of the essence of what I wanted to verify or question. To me, that just seems like a good approach!
Preparing for The Future
A major part of this week’s work was to clean things up a bit with bigger tasks that I do not normally have time for. As such, I decided to spend about four to five hours during the week experimenting with SnakeViz and incorporating primitive SnakeViz support into the project. While it is too early to say what needs the most improvement, I can say that I was able to get some good experience working with the tool and the output HTML that helps visualize the performance. I engaged in the experimentation specifically to gain some experience with code profiling, and I can honestly say that I think I got the experience that I was looking for!
The smart thing about using SnakeViz to profile is that it is interactive. To ensure that a user can dig down and get relevant information, SnakeViz takes a performance analysis file and hosts fragments of it through a webserver that it starts. As the webserver is starting, SnakeViz also opens its own page in the user’s browser. After that, it is just a matter of clicking around and displaying information on any number of functions that were executed during the code profile run.
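For anyone who wants to try it, the basic flow is to write a stats file with Python’s built-in cProfile module and then point SnakeViz at that file, roughly like this:

```python
import cProfile


def work():
    # Stand-in for the code being profiled; in my case, the parser.
    return sum(i * i for i in range(100_000))


# Write the profiling data out to a stats file...
cProfile.run("work()", "profile.stats")

# ...then, from the command line, have SnakeViz host and open it:
#     snakeviz profile.stats
```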
I do not know if everyone will have the same experience that I did, but I found the interface simple and easy to use. When I start focusing on performance, I know I will spend a decent amount of time looking at the `tottime` column, which displays the total amount of time spent in each function itself, excluding sub-functions, during the entire code profile run. I expect I will look at the top few items on that list and try to figure out why they are taking that much time. After making some guesses and changes to the code to match, rerunning the code profile run with the exact same data will be pivotal.
While it is still at least a month or so off, I am looking forward to using this tool and making the code perform well!
What Was My Experience So Far?
In terms of actual issues that I fixed during this week, the count was either low or zero, depending on who was doing the accounting. But the big win was getting most of the big-ticket items taken care of. To be honest, it was a lot of pressure off my mind getting those big items done. From my viewpoint, I am not sure that I would have felt okay with an initial release of the project without those items being addressed. It was just good housekeeping, and now it was done!
It just felt good to make the time to get those big-ticket items resolved. While it can be argued that there were not that many real changes to the code base, each of those changes made a significant impact on my confidence in the project. The refactoring to provide easy and private access to the tokens? It increased my confidence that I have not missed any weird side effects. The reorganizing and simple cleanup? It increased my confidence that I did not have extra “stuff” to maintain across the code base that would make things more complicated. Consolidating the List Block code? It increased my confidence that I have captured a healthy set of scenarios that properly test both Ordered List Blocks and Unordered List Blocks. Each change simply increased my confidence by safely reducing the amount of required code in the project.
It also felt good for another reason: it was fun. Sure, there were boring parts, like making 2000 changes for the scenario test refactor. That was definitely monotonous. But it was also kind of fun, because it was something different for the project. It was not the same old test-fix-run and repeat process that I had been doing for months. It was just something different. And that was just fun to do sometimes.
What is Next?
With most of the big-ticket tasks out of the way, I needed to buckle down and try to resolve as many of the Block Quote items in the unprioritized section as possible. While it would not be as fun as the big-ticket items, I knew they were worth doing and that doing so would get me closer to an initial release.
1. To be fair, in our household we have one honey-do list for me and one honey-do list for my spouse. And we both ignore some items until reminded and get some items off the list on our own. You know, typical list.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.