Summary
In my last article, I increased the coverage provided by the token to Markdown transformer by adding support for the link related tokens. In this article, I take another large step towards completing the consistency checks by adding support for list related tokens.
Introduction
Having implemented the link related tokens, I was now down to one large group: the container related tokens. Even with the confidence I gained from the work that I performed with link related tokens, I felt that “container related tokens” was too large of a group for me to be comfortable working with. Given that the container related tokens group contained only two types of tokens, it only seemed natural to focus on one of those two tokens: the list tokens. More specifically, I was going to start with the Unordered List tokens.
What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 04 Aug 2020 and 08 Aug 2020.
Starting with Unordered Lists
Taking a wander through the scenario tests with unordered lists in their examples, two things were clear to me. The first was that, with few exceptions, they were all working well from the point of view of the existing consistency checks. Those few exceptions were one case with nested links and four cases that mixed block quotes with lists: example 528, example 237, example 238, example 270, and example 271. The second was that, because of the first point, I had a high degree of confidence that I was going to be addressing issues that almost exclusively focused on whitespace. With the HTML output already being verified, I was certain that properly transforming Unordered List tokens back into list Markdown would mostly boil down to whitespace.
Thinking About the Design
Before I leapt into the coding, I sat back and thought long and hard about the approach I was going to take with the transformation of these tokens. When I started sketching out what my approach would be, I started to understand that there would be two issues I would have to deal with. They were the transformation of the tokens themselves, and the whitespace they would inject before other tokens. The first part was the easy part: see an unordered list token, use the elements in the token to figure out its transformed form, and emit that transformed form. Done.
Managing the whitespace was going to be a lot more difficult. The thing that helped me was that I knew I already had experience with handling that initial whitespace from writing the main parser. What helped me immeasurably in the parser was to keep the processing of the two container elements, lists and block quotes, separate from the processing of the leaf tokens. By only passing “container element free” text to the leaf token processor, that processor was kept simple. To keep the container handling for the Markdown transformer simple, I decided that employing the same approach was pivotal.
But even with the decision to keep that processing separate, I figured that it would only get me part of the way there. To complete the management of the whitespaces, I would need to be able to calculate the whitespace used before each line contained within a list block. The more I looked at the problems to be solved, the more I was sure that most of my time was going to be managing that initial whitespace.
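To make that design concrete before writing any real code, here is a minimal sketch of the shape I had in mind. The Token class, the hardcoded list marker, and the handler logic are simplified stand-ins of my own, not the project’s actual API:

class Token:
    # Simplified stand-in for the project's Markdown tokens.
    def __init__(self, is_container, text="", indent=""):
        self.is_container = is_container
        self.text = text
        self.indent = indent

def transform(tokens):
    output, continue_seq = [], ""
    for token in tokens:
        if token.is_container:
            # Containers only change the whitespace that gets
            # applied to the lines that follow them.
            continue_seq = token.indent
            output.append("- ")
        else:
            # Leaf text is emitted "container element free", then
            # the indent is merged in as a post-processing step.
            output.append(token.text.replace("\n", "\n" + continue_seq))
    return "".join(output)

print(transform([Token(True, indent="  "), Token(False, text="line one\nline two")]))
# -> "- line one" followed by "  line two" on the next line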
It was not going to be fun getting the transformations done for the Unordered List tokens, but I had a feeling that it would be really satisfying when it was done!
And… Go!
I began this block of work by moving the if statement that avoided processing any scenario test that included block quote or list starts. Before that move, it was buried in a large collection of if statements, and I wanted to make sure I called it out until it was fixed. To make it even better, when I moved it, I broke that single statement into three explicit statements. As I knew I was going to be enabling each one in the next week or so, being explicit about that work just seemed like the right thing to do. But even though the move was mostly for aesthetics, I confess that it was also to remind me that I only had those tokens left to go.
Once that was completed, I did the easy work first and added the rehydrate_unordered_list_start function and the rehydrate_unordered_list_start_end function to the main processing loop. After running the scenario tests again, I was reminded by the test output that the rehydrate_next_list_item function would have to be added to deal with the Next List Item tokens in the stream. Another quick run of the tests produced a lot of comparison failures, but no missing token failures. A good step in the right direction!
First Step: List Indentation
With the actual token handlers dealt with, it was time to focus on the effects that those tokens had on the leaf blocks. Following my usual pattern, instead of immediately creating a new function to specifically handle the lists, I kept that code inline with the existing transform method of the Markdown transformer. While I recognize that it looks messy and sloppy at the outset, it helps me think more clearly without worrying about what I need to pass where.
Therefore, following my usual pattern, I first added a simple post-processing step that took the contents of the continue_seq variable and applied them to the start of the output of specific tokens. The continue_seq variable was initialized with an empty string, but I altered the rehydrate_unordered_list_start function to reset this variable to the amount of indent specified by the list. With that change in place, the end of the loop processing was simple:
new_data = new_data.replace("\n", "\n" + continue_seq)
This gained some traction in getting the scenario tests passing, but that processing needed to be made a bit more complicated.
The first complication that needed to be addressed was that both list starts and list ends modified the continue_seq variable but needed to avoid applying it to the line on which the element resided. This was because the processing of the Unordered List start token already had the required indenting taken care of, so the post-processing would just add extra “garbage” whitespace. To remedy this, I added the skip_merge variable to allow the Unordered List token handlers to inform the main loop to skip any post-processing.
The second complication was handling the list terminations using the rehydrate_unordered_list_start_end function. In some of the easy cases, what was there was fine, but as soon as a list was nested in another list, that processing fell apart. What was missing was a recalculation of the indent level once the prior list ended. That was quickly addressed by recalculating the contents of the continue_seq variable from the new list token at the top of the stack.
With those easy fixes, and with the main replacement call in the main loop, a lot of the scenario tests were passing, while keeping the processing simple.
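Putting those pieces together, the end-of-loop processing at this point looked roughly like the following sketch. The function boundaries and the numeric indent_level attribute are my simplifications, not the project’s actual code:

def merge_with_continue_sequence(new_data, continue_seq, skip_merge):
    # List starts and list ends already carry their own indent, so
    # their handlers set skip_merge to avoid adding extra "garbage"
    # whitespace to the line they live on.
    if skip_merge:
        return new_data
    return new_data.replace("\n", "\n" + continue_seq)

def recalculate_continue_sequence(container_token_stack):
    # Once a nested list ends, the indent is recalculated from the
    # list token now at the top of the stack (assumed here to
    # expose a numeric indent_level).
    if container_token_stack:
        return " " * container_token_stack[-1].indent_level
    return ""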
That simplicity would soon change.
Indented Code Blocks
As I went through the test failures, there were a few failures that stood out as an odd grouping: indented code blocks and lists. Doing some more research, I found out that due to a bug in my code, the indented code blocks were not being checked properly. It only involved one list item scenario test, but it still needed to be fixed.
In that newly found bug, the problem was that the indented code blocks were always adding their indented padding at the beginning of the lines. This was not usually a problem, but with any examples that contained blank lines within the indented code block, it was an issue. A good example of this is example 81:
chunk1
chunk2
{space}{space}
{space}
{space}
chunk3
When the parser tokenizes this example, the Blank Line tokens that are generated already include any whitespace that is present on that line. Because those tokens take care of their own whitespace data, when the Markdown transformer interprets those Blank Line tokens, it needs to accept those Blank Line elements as they are.
Modifications were needed to enforce this behavior. The combine function of the TextMarkdownToken class containing the indented blank line was changed to insert a NOOP character and then a newline character. As text used in an indented code block was the only paragraph-like encapsulating token that inserted a blank line into the composed text, I had confidence this change was isolated.
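As a rough illustration of that change, the composed text might be marked up like the sketch below. The NOOP value and the combine signature are hypothetical stand-ins; the project defines its own sentinel character:

NOOP = "\a"  # hypothetical stand-in for the project's NOOP character

def combine_with_blank_line(composed_text, blank_line_whitespace):
    # The Blank Line token owns its whitespace, so mark the line
    # with a NOOP instead of letting the indent be re-added later.
    return composed_text + "\n" + NOOP + blank_line_whitespace

print(repr(combine_with_blank_line("chunk1\nchunk2", "  ")))
# -> 'chunk1\nchunk2\n\x07  '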
With those NOOP characters in place, the Markdown transformer needed some modifications to understand how to deal with them. Before proceeding with the normal insertion of any whitespace in the continue_seq variable, a quick check was made to see if the new_data variable contained a NOOP character. If so, the string in the new_data variable was split and processed. Each element in the split list was checked to see if it started with a NOOP character. If it did, the NOOP character was simply removed by setting the replacement_data variable to the contents of that element after the first NOOP character. If it did not, the replacement_data variable was set to the contents of the continue_seq variable plus the contents of the element. Once that was done, the value was put back into the list at the same index. Then, when the processing was done, the new_data variable was reconstituted by joining the elements of the list back together using the \n character as a joining character.
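In code form, that per-line check looks something like this sketch, which mirrors the description above while omitting the extra state the real function carries:

NOOP = "\a"  # hypothetical stand-in for the project's NOOP character

def apply_continue_sequence(new_data, continue_seq):
    if NOOP not in new_data:
        # The normal case: apply the indent after every newline.
        return new_data.replace("\n", "\n" + continue_seq)
    split_data = new_data.split("\n")
    for index, line in enumerate(split_data):
        if line.startswith(NOOP):
            # This line owns its whitespace: drop the NOOP marker
            # and leave the rest of the line untouched.
            split_data[index] = line[len(NOOP):]
        else:
            # A normal line: prefix it with the list indent.
            split_data[index] = continue_seq + line
    return "\n".join(split_data)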
While I was not looking for that, I was glad I found it. A bit embarrassed that I did not find it right away, but I did find it!
Handling Lazy Continuation Lines
With most of the scenario tests now passing, my focus was centered on a set of tests that dealt with lists and lazy handling. While this took me a bit to get my head around, it basically says:
If a string of lines Ls constitute a list item with contents Bs, then the result of deleting some or all of the indentation from one or more lines in which the next non-whitespace character after the indentation is paragraph continuation text is a list item with the same contents and attributes. The unindented lines are called lazy continuation lines.
Huh? Let me translate:
If you have a list already started, and you encounter lines that should be in the list except for the fact that they are not indented properly, they are lazy continuation lines, and are included.
The best way to explain it is with an example:1
- first list item
next line of first list item
 next next line of first list item
In that example, all three of those lines are considered part of the list item, even though the continuation lines are indented less than the indent of 2 specified by the initial list item element. But that presented a bit of a problem.
When parsing the Markdown document, the indent level was being tracked, and any additional whitespace was added to the contained token. But as I was writing the Markdown transformer, I noticed that I had missed the case where the amount of indent was less than the current list’s indent level. This was not an issue with the HTML transformer, as that transformer does not rely on any of the extracted whitespace. However, the Markdown transformer does.
To fix this issue, I needed to first make a change in the parse_paragraph function of the LeafBlockProcessor class. In that function, I reconstituted the actual indent of the given line and compared it against the indent level from the dominant unordered list item token. If that actual indent level was less than the dominant indent level, I adjusted the actual whitespace by prefacing that whitespace with… well… blech characters.
Yes, blech characters. Blech, according to Webster’s, means “used to express disgust”. While I knew I had to track those indenting characters somehow, I really was not happy with it. Disgust may be a bit too intense of an emotion, but when I figured out what I had to do, that was the most printable word that I uttered.
Using the above example, the tokenized text looks like:
- first list item{newline}
{blech}{blech}next line of first list item{newline}
{blech}next next line of first list item{newline}
In this way, the indent was properly represented in the token, and the Markdown transformer had enough information to rehydrate the data afterwards. With those changes locked into the tokens, the Markdown transformer was then able to be changed to understand those tokens. That processing was simple. If a line of the text output started with a blech character, those blech characters were replaced with space characters. If no blech characters were there, the normal replacement of the newline character would occur.
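A minimal sketch of that replacement, following the description above and using a hypothetical stand-in for the blech character:

BLECH = "\x02"  # hypothetical stand-in for the project's blech character

def rehydrate_list_lines(text, continue_seq):
    rehydrated_lines = []
    for line in text.split("\n"):
        if line.startswith(BLECH):
            # A lazy continuation line: swap each blech character
            # back for a space character, as described above.
            blech_count = len(line) - len(line.lstrip(BLECH))
            rehydrated_lines.append(" " * blech_count + line[blech_count:])
        else:
            # A normal line: apply the usual list indent.
            rehydrated_lines.append(continue_seq + line)
    return "\n".join(rehydrated_lines)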
And I could have changed the name of the character from “blech character”, but after a while, it just kind of stuck with me.
New List Item Tokens
It was about this time when I decided to tackle the New List Item tokens. While I had been working around them somewhat, it was starting to get to the point where they were unavoidable. At first, I did not think these tokens would be an issue, but I forgot about one little part of the New List Item tokens: they reset the indent for the list.
A good example of this is example 290:
- a
 - b
  - c
   - d
  - e
 - f
- g
In this case, the first line starts a list and each line after that starts a new list item. As the new list items gradually increase and then decrease the indent, the three middle lines (c, d, and e) are interpreted as new list item elements, rather than as a sublist. If it were not for the new list item elements resetting that indent, those three lines are indented to the point where their indent would make them eligible to start a new sublist.
But properly tracking this indent change required some interesting thinking. If I tracked the indent change in the actual token, it would mean changing that token. To me, that was a non-starter. Instead, I added a separate stack for container tokens in the Markdown transformer and added copies of the container tokens to this stack. As I was only adding copies of the tokens to the stack, I was free to change the indent value located within the Unordered List token without worrying about side effects.
With those changes in place, the Markdown transformer was able to reset the indent level and make sure it was being properly tracked. This meant that the indents were able to be properly reset to the correct value once a List Item end token was received for a sublist.
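A minimal sketch of that bookkeeping, assuming a numeric indent_level attribute on the list tokens (the handler names here are my own, not the project’s):

import copy

container_token_stack = []

def push_container_token(list_start_token):
    # Push a copy, so the indent can be adjusted freely without
    # mutating the original token in the token stream.
    container_token_stack.append(copy.deepcopy(list_start_token))

def handle_new_list_item(new_list_item_token):
    # A new list item resets the indent for the enclosing list.
    container_token_stack[-1].indent_level = new_list_item_token.indent_level

def pop_container_token():
    # When a sublist ends, the enclosing list's copy, with its own
    # indent value, is back on top of the stack.
    container_token_stack.pop()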
Taking a bit of a deep breath and a pause, I noticed that I was close to finishing off the Unordered List Item tokens. That gave me a bit of a jump in my step to clean things up!
Taking Care of Stragglers
With all the major and minor cases of lists dealt with, I started going through the other scenario tests, fixing up the token lists after verifying that the new tokens were correct. Everything else was easily resolved at this point, except for some lists in a couple of cases. Those cases were interesting in that there was actually too much whitespace, not too little. And in this case, it was a newline character, not a space.
The Fenced Code Block element and the SetExt Heading element are unique in that they have a line-based sequence that delimits the end of the element. Usually this is not a problem, but in the case of interacting with lists, the transformer was inserting a newline after the element itself, and then another newline to mark the end of that line. While this duplication did not occur all the time, it took a bit to figure out the exact sequence that triggered it.
After doing some research, it still seemed weird to me, but the duplication only occurred if:
- it was one of these two elements
- the new block of data ends with a newline character
- the next token to be processed is a New List Item token
While the sequence of things that had to occur was weird, the solution was easy:
block_should_end_with_newline = False
if next_token.token_name == "end-fcode-block":
    block_should_end_with_newline = True
    delayed_continue = ""
elif next_token.token_name == "end-setext":
    block_should_end_with_newline = True
...
block_ends_with_newline = \
    block_should_end_with_newline and new_data.endswith("\n")
...
if (
    block_ends_with_newline
    and next_one
    and next_one.token_name == MarkdownToken.token_new_list_item
):
    new_data = new_data[0 : -len(continue_seq)]
Basically, if we hit that situation, just remove the excess character. I was hoping to refactor it into something more elegant, but it worked at the time and I wanted to get on to the handling of Ordered List Item tokens.
Second Verse…
I fondly remember being a kid at a summer camp and hearing the words “Second verse, same as the first, a little bit louder, and a little bit worse!”. Working on the ordered list tokens made me think of that saying almost immediately. Except it was not a little bit worse; it was a lot easier.
There were two main reasons for that. The first reason is that looking at the samples as a group, there are objectively fewer examples with ordered lists than unordered lists. In the list items section of the GFM specification, there are 20 of each, but in the lists section of the specification, there are 20 examples of unordered lists and 7 examples of ordered lists. The second reason was that most of the work was either completed when working on the unordered list token transformations, or it was used in a copy-and-paste manner.
However, knowing that lists are one of the two container elements in Markdown, I took some extra time and reverified all the scenario tests, both ordered lists and unordered lists. I was able to find a couple of small things that were quickly fixed, but other than that, everything looked fine!
Cleanup
As with a lot of my free project time recently, I used some extra time that I had to focus on a few cleanup items for the project. While none of them was important on its own, I just felt that the project would be cleaner with them done.
The first one was an easy one, going through the HTML transformer and the Markdown transformer, and ensuring that all the token handlers were private. There really was not any pressing need to do this, but it was just cleaner. The only code that was using those handlers was in the same class, so it was just tidier that way.
Next was the creation of the __create_link_token function to help tidy up the __handle_link_types function. The __handle_link_types function was already messy enough handling the processing of the link information; the creation of the normal or image link was just complicating things. While I still want to go back and clean up functions like that, for the time being, moving the creation code to __create_link_token was a good step.
Finally, there was the case of the justification function calls throughout the code. Not to sound like a purist, but I felt that they were messy. I often had to remind myself of what the three operands were: the string to perform the justification on, the justification amount, and the character to use for the justification. The actual use of the function was correct; it just felt like its usage was not clear. So instead of having code like this around the code base:
some_value = "".rjust(repeat_count, character_to_repeat)
I replaced it with:
some_value = ParserHelper.repeat_string(character_to_repeat, repeat_count)
While the code for this operation was a one-line function, now located in the ParserHelper class, I felt it now made sense and was in an easy-to-find place.
def repeat_string(string_to_repeat, repeat_count):
    """
    Repeat the given character the specified number of times.
    """
    return "".rjust(repeat_count, string_to_repeat)
“Fixing” Example 528
I do not want to spoil the surprise too much, but the fact that I have a section called “Fixing” Example 528 probably gives it away. I fixed it. But the story is more interesting than that.
In the last article, I talked about example 528 and how I was having problems getting it to parse properly. Even having done thorough research on the example and the algorithm, I came up with nothing. To me, it looked like the parsing was being performed according to the GFM specification’s algorithm, but the parsing was just off. After yet another attempt to figure this example out and get it working, I posted my research and a request for help to the CommonMark discussion forums.
Keeping my head down and focused on writing that week’s article, I did not notice that I had received a reply the very next day. As a matter of fact, while I did notice that I had a response to my post, it was not until Friday night that it clicked that it was a response to “THAT” post. After getting the cleanup documented in the previous section taken care of, I reserved some dedicated time to walk through the reply.
Kudos
First off, I would like to extend kudos to the replier, John MacFarlane, one of the maintainers of the Markdown GFM specification. While he would have been within his rights to tell me to RTFM2, he took some time to walk me through the algorithm as it applied to that example, even providing me with some debug output from his program for that example. His response was classy, with just the right amount of information.
Side by Side Comparisons
Armed with that new information, I turned on the debug output and ran through the output from my implementation of the algorithm again. Slowly, with my own written notes as an additional guide, I began to walk through the output. Found closer at 8. Check. Found matching opener at 4. Check. Deactivating opener at 3. Check. Found closer at 15. Check. Popping inactive opener at 3. Ch… er… what?
“Popping”?
Going back to the algorithm and the text that John provided, it hit me. The popping that he was referring to was this part of the algorithm:
If we do find one, but it’s not active, we remove the inactive delimiter from the stack, and return a literal text node ].
For some reason, when I read that before, I thought it was referring to removing the token from consideration, not actually removing it from the stack. It probably also confused things that I did not maintain a separate stack for the link resolution. Instead, I added an instance of the SpecialTextMarkdownToken token to the inline block list whenever I hit a link sequence. In either case, I was not removing that token. To compound the issue, I did not stop at that inactive token; I just kept on looking for the next active SpecialTextMarkdownToken token, finding the image start token. Ah… everything was now falling into place in my mind.
Fixing the Issue
The fix was very anticlimactic. I created the new __revert_token_to_normal_text_token function, which removed the SpecialTextMarkdownToken token and replaced it with a normal TextMarkdownToken token. In addition, I changed the algorithm to make sure that when this happened, it stopped processing for that link end sequence, as per the algorithm. With the start character sequence now being effectively invisible to the algorithm, the rest of the parsing went fine, with the correct results coming out. Well, almost. A small fix was needed to the __consume_text_for_image_alt_text function to make it properly emit the correct text if an instance of a SpecialTextMarkdownToken token was encountered.
With the big fix and the little fix both completed, the scenario test for Example 528 was fully enabled and fully passing. Finally!
Reminder to Self: Be Humble
Having taken quite a few attempts at implementing the algorithm and making sure it passed all test cases, I hit a wall. A seemingly rock-solid wall. That was fine. During any project, be it software or hardware, things happen. When it got to that point, I gave myself some time, I knuckled down3, and I did some solid research on the problem. Keeping good notes, I was then able to share those notes with peers in the community, along with a sincere and humble request for help.
I do not always get a good response to requests for help. However, I have noticed that doing good research and presenting that research with humility increases the chance of getting a positive response. At no point did I rant and say, “it’s broken” or “my way works, what is wrong with yours”. Instead I said, “Is there something wrong with the algorithm?” and “Based on my implementation of that algorithm”. By acknowledging that it could be an issue with my implementation, I feel that I opened the doors for someone to help, rather than slamming them shut with negative talk.
And as I mentioned in the Kudos section above, by taking what I believe was a humble approach to asking for help, I got a really good response.
What Was My Experience So Far?
Wow… that work was intense. For one, I was right, it was a lot of addressing issues with whitespace and running scenario tests repeatedly. But it was more than that for me. I knew I had the leaf blocks taken care of, but I was really concerned about how difficult the implementation of the container transformations would be. If I did it right and kept to my design, I was confident that I could keep the complexity down, but I still figured it would be complex.
I guess that led to me second-guessing every line of code and getting in my own way a bit. I did prevail, but that concern or fear of damaging existing tests was somewhat paralyzing at times. And while the logical half of my brain was telling me that I had plenty of tests to reinforce my completed work, the emotional half was another story. That is where that fear was coming from: my emotional side. Only when I took a moment to take another breath and focus on the tests was I able to banish that concern for a while.
And it also helped me to do a bit of self-analysis on why I was concerned. After a lot of thinking, I came to a simple conclusion. The closer I get to a complete project, the more I am concerned that I have not architected and designed it properly. If I have done that, small changes can be accomplished with few or no unintended side effects. If not, encountering side effects should be frequent. Seeing as I have identified some areas of the code base that I want to refactor, I questioned whether the current state was good enough.
Knowing that, it helped me figure it out for myself. I do believe that I have confidence with my architecture and design, and at the same time, I believe that I can improve it. It might seem like a dichotomy, but I am starting to think that both can be correct at the same time. But knowing that was the issue that was causing me concern helps me combat it. I am a lot less worried about it, but it is a work in progress, much like the project.
With that information in hand, I felt better. Cautious about the next steps in getting the checks closer to the finish line, but better! And let’s not forget about finally closing the issue with Example 528. That was good!
What is Next?
With the Markdown transformer almost completed, the only tokens left that need a transformation are the Block Quote tokens. In addition, as the line/column number consistency checks do not currently deal with Block Quote tokens either, I will need to add both checks in the next block of work.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.