In my last article, I documented how I was working hard to get to the end of the unprioritized items in my issues list and on to the prioritized parts of the list. This article details the work that I performed to finally get to the end of the unprioritized list.
I started the week with one goal on my mind: to clear out any unprioritized items so that I can start working on the prioritized items next week. I knew that goal might not be achievable, but that was okay for me. It was something concrete that I could work towards. Possibly achievable and concrete… sounded good to me!
Internally, I was fighting a different battle. I knew that I had some work to do to finish off the unprioritized items, and I knew that I also had some clearly defined priorities listed in the issues list. But in the process of prioritizing items, I had missed the sections that occur after the Bugs - Tabs section. While not everything would be actionable right away, I did need to take the time to figure out when those sections would be handled. Not having a plan with the release getting closer was getting to me.
At the very least, I knew that I could defer a concrete decision on that question until after all the unprioritized items were resolved, so off I went!
What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 20 Jan 2021 and 24 Jan 2021.
The Big Push!
From the number of items in the issues list that were not prioritized, I knew that if I focused for between a couple of days and a week, I could get every unprioritized item resolved. It was time for a big push!
Yet More Fun With Link Reference Definitions
It seems like I cannot get away from issues involving Link Reference Definitions, but I was going to try by resolving this issue:
- dedup `append_text(` - test_reference_links_extra_03h - 03hc
After consultation with the people on the discussion forum, I decided to double down on my decision from last week to implement code that follows the CommonMark reference implementation. While I did get agreement that the Markdown in test function test_reference_links_extra_03h should parse as a valid Link Reference Definition, the current version of CommonMark (0.29.2) does not support that. As such, I had to make sure that the PyMarkdown parser did not support it either.
After adding a few extra test functions to make sure I had covered all the scenarios, I started debugging and found the issue almost immediately. In processing the line within the Link Reference Definition, the responsible function was always finding that the Link Reference Definition was valid, including when it ended with the Hard Line Break element's backslash \ character. Adding some extra code to the __is_link_reference_definition function solved that issue, only to expose another issue.
In some rare cases, a line can end with an escaped backslash character, such as:
[bar\\ foo]: /uri
In this case, the final backslash is escaped, and is represented in the token by the Python string bar\\\b\\\nfoo. As Python escapes a backslash with another backslash, that string is effectively read as the text bar, an escaped backslash \\, a backspace special character \b, another escaped backslash \\, the newline character, and finally the text foo. As I have mentioned in previous articles regarding the project’s use of the backspace character, the first escaped backslash is countered by the backspace character, effectively leaving a single backslash followed by a newline character as the most interesting character sequence in that string. While it can be a bit confusing to read (even to me sometimes!), that encoding allows a process to decide whether to look at that string in its original Markdown form or in its target HTML form.
While that interpretation of the sequence is correct, the parser needed to be changed
to prevent that particular end-of-line sequence from being recognized as a
Hard Line Break element. That sequence was not really a \ character at the end of the line; it was an escaped \ character at the end of the line.
Having identified the issue, I quickly worked to modify the
function to verify that the characters before a backslash at the end of the line were
not the sequence
\\\b. After running the tests a couple of times and performing
some extra verification steps to make sure I got it right, it was on to the next issue.
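The encoding and the end-of-line check described above can be sketched in a few lines of Python. This is my own illustration of the article's description, with a hypothetical helper name; it is not PyMarkdown's actual code.

```python
# Token text for the example above: the escaped backslash from the Markdown
# source is stored as backslash + backspace + backslash, so that either the
# Markdown view or the HTML view can be recovered later.
token_text = "bar\\\b\\\nfoo"

# Collapsing a backslash+backspace pair recovers the Markdown view:
# a single backslash immediately before the newline.
markdown_view = token_text.replace("\\\b", "")
assert markdown_view == "bar\\\nfoo"


def ends_with_hard_break(line):
    # Hypothetical version of the fix: a trailing backslash only signals a
    # Hard Line Break if it is not itself escaped, that is, if the characters
    # before it are not the backslash+backspace sequence.
    return line.endswith("\\") and not line.endswith("\\\b\\")


assert ends_with_hard_break("plain text\\")      # unescaped: hard break
assert not ends_with_hard_break("bar\\\b\\")      # escaped: plain text
```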
Link Reference Definitions and List Boundaries
With the goal of resolving this item:
I enabled test function
test_link_reference_definitions_extra_01c and added a new test function
test_link_reference_definitions_extra_01d. Then I started looking at
the test examples, and I realized that I needed to think carefully about them:
```
- [foo]:
- /url
```
I knew it was a Link Reference Definition that should be broken, but I was not 100% sure why it was broken. Working through it on paper, I got the concrete answer that I needed. At the end of line 1, the parser has a single list started, with a partial Link Reference Definition active. However, when the second line starts with a character that is not part of a valid Link Reference Definition, the processing is rewound, and the first line is reinterpreted as plaintext. From there, it becomes a single list with two list items, each list item containing half of the “almost” Link Reference Definition.
With that research in hand, I started debugging and realized two things. The first was that the caller_can_handle_requeue argument on the relevant function was not set to True. This meant that the Link Reference Definitions would never be
closed by that function. When I addressed that issue, I found the second thing: turning
that flag on and supporting the rewind or requeue from that point required a lot of
argument passing. While I didn’t have a lot of time to address it now, I made sure
that there was something in the issues list about addressing the issue of passing too
many arguments along and moved on.
Link Reference Definitions and Block Quote Boundaries
Being in the same area as with Link Reference Definitions and List Boundaries, I figured that dealing with this item:
was a good use of my time. The research for this item followed almost exactly the same path as my research for List Blocks; the only difference was that it dealt with Block Quote boundaries instead of List boundaries. Using the same manner of fixing this issue for Block Quotes as I had used for List Blocks, this issue was quickly fixed. That fix also carried over to the next issue that I addressed:
- add new 2c that increases level of block quotes with LRD
As a matter of fact, I did not have to add any extra code to the parser above the code that I had originally added for the previous function.
But having seen that I had missed a number of these scenarios involving non-paragraph Leaf Block elements, I wanted to make sure that I had some coverage for the other Leaf Block elements; hence, the issue:
- verify that > to >> introduces a new level of block quotes
To address that issue, I added new scenario tests, from test function test_block_quotes_229c to test function test_block_quotes_229j. I started with Indented Code Blocks, proceeded to Fenced Code Blocks, and ended up on HTML Code Blocks. With various combinations in place, when I was done there were nine new scenario tests, with five of those tests disabled. I had some work to do!
Before Signing Off For The Night
Before I shut down my system for the night, I looked around and noticed that there was another item that was like the one I had just added tests for:
- links, 518b - 518b inside of list and/or block quote - now test_paragraph_extra_j0
As I had also forgotten to run Black on my last couple of commits, it was a good chance to see if I had even more work to do, or if I would get a break and have some good, positive test karma come my way. I added a test function and two variations, one with the same content in a Block Quote element and the other with the same content in a List Block element. The good news? Everything worked the first time, with no tests needing to be disabled. It was a good way to end the day!
Friday Night Blues
As Friday night rolled around, I found that I had some spare time to work on the PyMarkdown project that evening, but I was less than enthusiastic about it. It seemed that every time I worked on the project, I would resolve two or three items from the issues list, only to add another three or four items to that same list. It felt like I was spinning my wheels and getting nowhere. It took me looking at the specification and the existing scenario tests to realize that I was just hitting a lot of boundary conditions for Block Quote elements. At that moment it dawned on me: I was not adding hard-to-solve items to the issues list, I was adding little issues that were, in all honesty, way off the beaten path of Markdown. Not to sound too negative, but I was dealing with the nit-picky scenarios! That helped me to get my mind in the correct perspective!
It was with that renewed sense of purpose that I refocused to work on:
- test_block_quotes_229d - BQ+ICB
Working through the debug logs, I quickly came to an interesting conclusion. Due to paragraph continuation lines, most of the existing Block Quote tests would either maintain or increase their level of Block Quotes, but never decrease. If one of the Block Quote start characters was missing within a paragraph, the paragraph continuation rule would kick in and the Block Quote would continue along without any disruptions.
But that was not so with other, non-Paragraph elements. Focusing on the item that I was currently working on, I changed the __ensure_stack_at_level function to properly determine both when the stack needed to increase and when it needed to decrease (stack_decrease_needed). With that code in place, I added code to reduce the Block Quote count if needed, closing any open elements until the proper Block Quote count was reached.
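The decrease path can be sketched roughly as follows. The names and structure here are my own illustration, not the actual __ensure_stack_at_level code.

```python
def adjust_block_quote_level(open_blocks, current_level, new_level):
    # Close open block quotes (and anything nested within them) until the
    # current level matches the level requested by the new line.
    while current_level > new_level:
        open_blocks.pop()
        current_level -= 1
    return current_level, open_blocks


# Dropping from three levels of block quotes down to one:
level, blocks = adjust_block_quote_level(["bquote", "bquote", "bquote"], 3, 1)
# level == 1, blocks == ["bquote"]
```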
And while a lot of the issues I fixed during this time were only in the parser itself, this issue required changes in the Markdown transformer. It was nothing big, but to make sure that the Block Quote token's leading_text_index field was properly maintained, the __rehydrate_block_quote function and a related function needed to properly increase that variable when required. After a bit of fiddling and
verification to make sure that everything was once again passing, it was off to
relax before a long day on Saturday.
Knowing that I would have a big chunk of time on Saturday, I resolved to myself to push hard to see if I could get to the end of the unprioritized list by the end of the day. In my mind, I was not sure if I would be able to do it, but I figured a good push to resolve issues wouldn’t hurt either way. With that, and some loud music in the background, I hunkered down and got to work.
Block Quotes and HTML Blocks
Having had good success with Indented Code Blocks and Block Quotes the night before, I decided to just power ahead and deal with this item:
- test_block_quotes_229i and test_block_quotes_229j - BQ+HTML
Doing some testing against BabelMark, I soon found out that unlike the Indented Code Block element, HTML Code Blocks inside of a Block Quote element treated any extra Block Quote start characters as Code Block text. Looking over the GFM specification and thinking about it a bit, this made sense to me. The HTML Code Block has five different start sequences and five matching close sequences. Other than the Block Quote sequences that were in place when it started, unless it matched one of those close sequences, it would just treat any increases as text that was a part of the Code Block itself. But that wasn’t how the parser was interpreting it. It was trying to open a new Block Quote when it saw the extra Block Quote prefix character.
To address this, I had to modify the
__count_block_quote_starts function to recognize
that situation, and then respond to that with changes in the
__get_nested_container_starts function to ignore a change in Block Quote starts.
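That recognition logic can be sketched roughly like this. The function name mirrors the one above, but the body is a simplified guess at the behavior, not the project's actual implementation.

```python
def count_block_quote_starts(line, current_count, in_html_block):
    # Count the leading '>' characters on the line.
    count = 0
    while line.startswith(">"):
        count += 1
        line = line[1:].lstrip(" ")
    # Inside an HTML block, extra '>' characters do not open new block
    # quotes; they are simply text belonging to the HTML block itself.
    if in_html_block and count > current_count:
        count = current_count
    return count


count_block_quote_starts(">> <pre>text", 1, True)   # stays at 1 level
count_block_quote_starts(">> some text", 1, False)  # increases to 2 levels
```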
With those changes in place, enabling test function test_block_quotes_229j was trivial, requiring only a modification to this piece of code in the function:

```python
stack_decrease_needed = top_token_on_stack.is_indented_code_block
```

changing it to:

```python
stack_decrease_needed = (
    top_token_on_stack.is_indented_code_block
    or top_token_on_stack.is_html_block
)
```
Running the tests, the scenario tests now passed, and it was on to the next item.
Two Code Blocks Down…
And one type of code block left to deal with:
- test_block_quotes_229g and test_block_quotes_229h - BQ+FCB
While I would love to say that fixing these tests just required one or two lines of
code, that would be dishonest. Because the processing of the Fenced Code Block elements
is sufficiently different from the other Code Blocks, the handling of it in the
__handle_block_quote_section function was separate from the other token types. But
while I had to repeat changes that were like the ones made for Indented Code Blocks
and HTML Code Blocks, it was very beneficial to have them to use as templates. I
effectively used them to shortcut my debugging processes, getting both tests working
in what I would consider record time!
Just To Be Sure
I was confident that I had properly addressed all the issues that had been raised so far, but I was also concerned that I was being shortsighted. So, at that time, I added the item:
- multiple levels up and down over all block quote transitions
When I re-read that item before starting to work on it, I clearly understood that it was questioning whether I had coded the Block Quote transitions for the “plus one” cases or the “plus any” cases. I realize that it may not seem like a large difference to others, but I was concerned that I had used an if statement and checked at only one level instead of using a loop and checking until multiple levels had been processed.
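A contrived example of the difference, counting how many levels of Block Quote prefix a line adds. This is my own illustration, not project code.

```python
def plus_one_level(line):
    # "Plus one": an if statement only ever notices a single extra level.
    return 1 if line.startswith(">") else 0


def plus_any_levels(line):
    # "Plus any": a loop keeps consuming '>' prefixes until none remain.
    count = 0
    while line.startswith(">"):
        count += 1
        line = line[1:].lstrip(" ")
    return count


plus_one_level(">>> quoted")   # 1 -- misses two of the levels
plus_any_levels(">>> quoted")  # 3
```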
After adding 10 new tests that did multiple increases and multiple decreases within various Leaf Block elements, I held my breath and ran the scenario tests. Did I get all the cases the first time around? How many did I miss? Was the work to correct them going to be difficult or easy to complete? I just wanted to get this done.
Having worked myself up over a possible lengthy set of fixes, I was pleased to find out that only one change was required. In one function, I had coded a check to look for a Block Quote end token that was preceded by a Fenced Code Block end token. While the purpose of the check remained the same, I needed to adjust it a bit for the case of multiple Block Quote transitions. That check went from:

```python
if (
    token_document[-1].is_block_quote_end
    and token_document[-2].is_fenced_code_block_end
):
```

to:

```python
number_of_block_quotes = 0
end_index = len(token_document) - 1
while token_document[end_index].is_block_quote_end:
    number_of_block_quotes += 1
    end_index -= 1
if (
    number_of_block_quotes > 0
    and token_document[end_index].is_fenced_code_block_end
):
```
Other than that change, which took approximately 10 minutes to debug and implement, all the other tests passed without requiring any changes.
The Last Unprioritized Item
At this point, there was only one item left in the unprioritized list:
- single level up or down with list
Like the work done in the last section, I believe this was a reminder to myself to not forget that List Blocks can also be encapsulated within a Block Quote block. So, with the learnings of the past two sections in mind, I put together a series of four new scenario tests for List Blocks within Block Quotes and ran the tests without thinking about it. They all worked on the first try, with no other changes required in the project! It was a good end to this series of work!
And it was with that commit that I finished working on the unprioritized items in the issues list. From here on, it was getting prioritized items out of the way for the release!
Starting On Prioritized Issues
Having spent a lot of my mental capital to get to this point in the project, I was explicitly looking for issues that were low-hanging fruit. While I knew that there were some bulkier items to deal with in the list, I wanted to try and clear some of the lighter issues away to get a clearer picture of what was left.
Some Simple Research
Intending to start with something simple, this item seemed like a good fit:
- where is `elif starting_whitespace:` used? why? better way to do it?
Thinking way back to when I added this, I remember wondering why I needed to do this in addition to the other code in this function, and what effect it has. To be honest, I have written a lot of code for the PyMarkdown project in the last year. While I try to be clear with everything I write, I dropped the ball on this one. Time to figure it out!
Digging into this issue, I searched for the text
elif starting_whitespace: and replaced
that string with
elif False and starting_whitespace:. After running the scenario tests,
I picked one of the failures at random and examined it. As soon as I started looking at the test output, memories of this
change came flooding back to me. Within an Atx Heading Element, there is at least one
space character between the start character (
#) and the text in the heading. The HTML
output that is generated is fine, but when the Markdown transformer tries to reconstruct
the original Markdown, that leading space is not represented in the tokens. To address
that issue, in those cases where it is not otherwise present in the list of tokens,
that code kicks in and adds a Text token with the right markers to resolve the issue.
Experimenting with three or four alternate solutions to the issue, only one of those solutions worked, and it was a lot more convoluted than the existing solution. If there is no existing Text token that the parser can attach the extra whitespace to, the code creates a Text token specifically to hold that extra whitespace. Basically, it was good that I thought I could do better, but I already had the best solution for the job!
Dealing With Multiply Defined Link Reference Definitions
Having thought about the item:
- Link_helper.py#86 - if link already registered, should warn?
at length, resolving it was easy. While I could add something specific to deal with
that one case, it made more sense to just add a rule to deal with that case. After a quick
check to verify that the
LinkReferenceDefinitionMarkdownToken class has a
did_add_definition field, I resolved this one as done.
Picking Another Easy One
Another easy item from the list was the item:
- look for cases where " " is used, and convert to whitespace helper
This refactor was so easy, I almost feel that it is a waste talking about it. Like the earlier replacement of the \n character, the point of this replacement was to replace the space character with a visible representation. While I was doing that, I noticed that there were a few cases where I was trying to make those spaces visible using code like:

```python
line_to_parse.replace(" ", "\\s")
```
To solve that issue, I cloned a copy of the make_value_visible function, calling it make_whitespace_visible, and moved that code, plus the code for the other whitespace character \n, into that function. After refactoring the other occurrences of that text to use the new function, the work on that item was completed.
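The consolidated helper presumably ended up looking something like this sketch. The exact replacements are inferred from the article's description, not copied from the project.

```python
def make_whitespace_visible(value):
    # Replace whitespace characters with visible escape sequences so that
    # they stand out clearly in debug output.
    return value.replace("\n", "\\n").replace(" ", "\\s")


print(make_whitespace_visible("a b\nc"))  # prints: a\sb\nc
```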
Wrapping Up For The Night
As I was winding down to spend some time with my family, I wanted a simple task to help
me relax before stopping for the night. While it was not very productive, I inspected
the code using PyCharm, resolving three very simple issues. Combined with a full
run of my
clean script that runs Black, the changes were minor but useful. Nothing
spectacular was changed, but it left the code “just that much better” and allowed me
to sign off for the night with a good feeling. I had reached my goal for the week,
and I had also managed to clean some little things up as well. A good combination!
Sunday Morning Refactoring
After a long and productive Saturday, I decided to give myself a bit of a pass on the hard work and resolve some easy “off-the-books” items. Where possible, I have tried to adhere to good object-oriented practices, encapsulating data within an object that cleanly represents the purpose behind the object. From my experience, the problem with doing this with any kind of language parser is that there is often a group of miscellaneous variables that do not cohesively fall under a single theme. Often, I find that they are just a loose group of variables that need to be maintained to get the job done. Having the same type of problem with the main processing phase of the PyMarkdown parser, I decided to just go ahead and group these variables under the ParserState class.
Over three commits, I managed to clarify the usage pattern for the four variables
already in that object. This was done by adding a
__ prefix to each of those
variables to ensure that only private usage of that field was allowed. For external
usage of those variables, I added a new getter property for each variable with the same
name as the original variable.
In this way, I guaranteed that existing code that did not modify those variables would
remain as it was, while strengthening the usage pattern for those variables.
The additional eight variables were moved into the ParserState class following a similar pattern. Instead of renaming an existing class variable to include the __ prefix, each variable was added to the class with the prefix already in place, and a getter property was added under the variable's name minus the prefix.
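A minimal sketch of that pattern, using a single illustrative variable. The mutator name here is hypothetical, and the real ParserState class holds many more fields.

```python
class ParserState:
    def __init__(self):
        # Name-mangled by Python, so only code inside the class can touch it.
        self.__no_para_start_if_empty = False

    @property
    def no_para_start_if_empty(self):
        # Read-only view for code outside of the class.
        return self.__no_para_start_if_empty

    def mark_no_paragraph_start_if_empty(self):
        # Hypothetical mutator: the write stays inside the class.
        self.__no_para_start_if_empty = True


state = ParserState()
state.mark_no_paragraph_start_if_empty()
```

Callers read `state.no_para_start_if_empty` as before, but any attempt to assign to it raises an AttributeError, which is exactly the usage pattern being enforced.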
With those changes in place, I started to run the tests and encountered a handful of cases where the parser needed to change one of those variables during its processing. In each of those cases, I looked at the code and tried to determine the best way to set the information to the desired value while minimizing the exposure of that variable to the rest of the parser. In the case of one such function, that exposure was just the name of the function, as the body of the function only sets the __no_para_start_if_empty variable to True. For the
mark_start_information function, called when the container processor is invoked for a
new line, six of the variables in the class are set to their initial values, keeping any
changes in those variables within the class. In all cases, these were what I felt was the minimal exposure needed by other functions and classes dependent on the ParserState class and its variables.
Once those variables were placed within the ParserState class with their getters and setters, I started going through downstream functions. In each case, I looked for the name of one of the variables that I had moved into the class and determined if it could be replaced by a simple reference to the parser_state variable that held the current instance of the ParserState class. When I could replace it, I simply took the function argument, original_stack_depth, and replaced it with the class reference wherever it was used in the function. When I completed that task properly, the argument would show up as unused, and I would remove it from the function's argument list, also removing it from the arguments passed to that function wherever it was called from. From there, it was simply a matter of repeating that process until all non-class references were replaced with class references.
And to be honest, here was where having the solid heft of the scenario tests came in very handy. Instead of just praying that I had made the right changes, every couple of changes was followed by the execution of the complete set of scenario tests. If there was a failure, I would investigate and find out why it failed, then address that issue. If things were fine, I would stage the changes in Git, allowing me to progress forward with confidence that I could back out any change at any time. While this was a long process, the confidence that it gave me that I was making the right changes was priceless. With over 99.5% code coverage, if I messed something up in the refactoring, I found out about it right away.
I was making the right changes to clean up the project code, and I was not negatively impacting the code. It was good to be able to do both with the security and confidence of a well-tested project.
What Was My Experience So Far?
Looking back at the work I did during the week, it was hard not to get jazzed by the fact that I had met my goal of eliminating all unprioritized items from the issues list. That win was tempered by the question of what to do with the other sections that I had forgotten about, but it was still a win. And in its own way, resolving the unprioritized section of the issues list helped me answer my own question: I would just prioritize them!
The remaining sections are cleanly divided into two groups: features and bugs. With the possible exception of front matter1, while the other features would be nice to have, I didn’t need any of them for the initial release. The bugs that remained also divided nicely into two groups: Tabs and Rules. As a linter is nothing without some good rules to show its power, fixing the items in the Rules sections before release was non-negotiable.
The issues in the Tabs section were a different story. Making sure that the Tab support was spot on was going to be a sub-project, not something that I wanted to rush. Provided that I can come up with an “almost” interim solution for Tabs, I should be okay. Not great, but okay. The only reason I have confidence in saying that is because most people that I know shy away from tab characters in their documents and source code, mostly due to questions about how they are interpreted. So, while I do have to address Tabs properly at some point, it does not need to be until after the release.
Getting everything fixed and coming up with a good plan for the rest of the sections was just what I needed. Another concrete goal that I can see myself achieving. And yet another step closer to the release!
What is Next?
After 19 articles detailing how I attacked each group of issues, I thought it would be useful to look back over that effort and determine how things went and what lessons I learned.
Since every article I write has a front matter section, I strongly feel that including that one feature in the release should be a priority. ↩
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.