In my last article, I increased the coverage provided by the token to Markdown transformer by adding support for the link related tokens. In this article, I take another large step towards completing the consistency checks by adding support for list related tokens.


Having implemented the link related tokens, I was now down to one large group: the container related tokens. Even with the confidence I gained from the work that I performed with link related tokens, I felt that “container related tokens” was too large of a group for me to be comfortable working with. Given that the container related tokens group contained only 2 types of tokens, it only seemed natural to focus on one of those two tokens: the list tokens. More specifically, I was going to start with the Unordered List tokens.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 04 Aug 2020 and 08 Aug 2020.

Starting with Unordered Lists

Taking a wander through the scenario tests with unordered lists in their examples, two things were clear to me. The first was that, with few exceptions, they were all working well from the point of view of the existing consistency checks. Those few exceptions were 1 case with nested links and 4 cases which mixed block quotes with lists: example 528, example 237, example 238, example 270, and example 271. The second was that, because of the first point, I had a high degree of confidence that I was going to be addressing issues that almost exclusively focused on whitespace. With the HTML output already being verified, I was certain that properly transforming Unordered List tokens back into list Markdown would mostly boil down to whitespace.

Thinking About the Design

Before I leapt into the coding, I sat back and thought long and hard about the approach I was going to take with the transformation of these tokens. When I started sketching out what my approach would be, I started to understand that there would be two issues I would have to deal with. They were the transformation of the tokens themselves, and the whitespace they would inject before other tokens. The first part was the easy part: see an unordered list token, use the elements in the token to figure out its transformed form, and emit that transformed form. Done.

Managing the whitespace was going to be a lot more difficult. The thing that helped me was that I knew I already had experience with handling that initial whitespace from writing the main parser. What helped me immeasurably in the parser was to keep the processing of the two container elements, lists and block quotes, separate from the processing of the leaf tokens. By only passing “container element free” text to the leaf token processor, that processor was kept simple. To keep the container handling for the Markdown transformer simple, I decided that employing the same approach was pivotal.

But even with the decision to keep that processing separate, I figured that it would only get me part of the way there. To complete the management of the whitespace, I would need to be able to calculate the whitespace used before each line contained within a list block. The more I looked at the problems to be solved, the more I was sure that most of my time was going to be spent managing that initial whitespace.

It was not going to be fun getting the transformations done for the Unordered List tokens, but I had a feeling that it would be really satisfying when it was done!

And… Go!

I began this block of work by moving the if statement that avoided processing any scenario test that included block quote or list starts. Before that move, it was buried in a large collection of if statements, and I wanted to make sure I called it out until it was fixed. Making it even better, when I moved it, I broke that single statement into three explicit statements. As I knew I was going to be enabling each one in the next week or so, being explicit about that work just seemed like the right thing to do. But even though the move was mostly for aesthetics, I confess that it was also to remind me that I only had those tokens left to go.

Once that was completed, I did the easy work first and added the rehydrate_unordered_list_start function and the rehydrate_unordered_list_start_end function to the main processing loop. After running the scenario tests again, I was reminded by the test output that the rehydrate_next_list_item function would have to be added to deal with the Next List Item tokens in the stream. Another quick run of the tests and a lot of comparison failures, but no missing token failures. A good step in the right direction!

First Step: List Indentation

With the actual token handlers dealt with, it was time to focus on the effects that those tokens had on the leaf blocks. Following my usual pattern, instead of immediately creating a new function to specifically handle the lists, I kept that code inline with the existing transform method of the Markdown transformer. While I recognize that it looks messy and sloppy at the outset, it helps me think more clearly without worrying about what I need to pass where.

Therefore, following my usual pattern, I first added a simple post-processing step that took the contents of the continue_seq variable and applied them to the start of the output of specific tokens. The continue_seq variable was initialized with an empty string, but I altered the rehydrate_unordered_list_start function to reset this variable to the amount of indent specified by the list. With that change in place, the end of the loop processing was simple:

        new_data = new_data.replace("\n", "\n" + continue_seq)

This gained some traction in getting the scenario tests passing, but that processing needed to be made a bit more complicated.
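To make that replacement concrete, here is a minimal, standalone illustration; the continue_seq and new_data names come from the article, while the sample values are my own:

```python
# Illustrative only: continue_seq holds the indent established by an
# unordered list start, and every following line of the leaf block's
# output gets that indent prepended.
continue_seq = "  "
new_data = "first line\nsecond line\nthird line"
new_data = new_data.replace("\n", "\n" + continue_seq)
# new_data is now "first line\n  second line\n  third line"
```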

The first complication that needed to be addressed was that both list starts and list ends modified the continue_seq variable, but needed to avoid applying it to the line on which the element itself resided. This was because the processing of the Unordered List start token already had the required indenting taken care of, so the post-processing would just add extra “garbage” whitespace. To remedy this, I added the skip_merge variable to allow the Unordered List token handlers to inform the main loop to skip any post-processing.
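A minimal sketch of how such a one-shot skip_merge flag could work; the variable names follow the article, but the surrounding structure is my own reconstruction, not the project's actual code:

```python
# Sketch, with assumed structure: a list start handler sets skip_merge
# so the indent post-processing is suppressed for its own output line.
continue_seq = "    "   # indent set by the list start handler
skip_merge = True       # set by the list start handler for its own output

def apply_continuation(new_data):
    """Apply continue_seq to following lines unless the handler opted out."""
    global skip_merge
    if skip_merge:
        skip_merge = False  # one-shot: only the current token is skipped
        return new_data
    return new_data.replace("\n", "\n" + continue_seq)

list_start_output = apply_continuation("- first item\n")     # skipped
paragraph_output = apply_continuation("line one\nline two")  # merged
```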

The second complication was handling the list terminations using the rehydrate_unordered_list_start_end function. In some of the easy cases, what was there was fine, but as soon as a list was nested in another list, that processing fell apart. What was missing was a recalculation of the indent level once the prior list ended. That was quickly addressed by recalculating the contents of the continue_seq variable from the new list token at the top of the stack.
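A small sketch of that recalculation; the indent_level attribute name and the SimpleNamespace stand-in are my assumptions, not the project's actual token type:

```python
from types import SimpleNamespace

# Sketch of recomputing the continuation whitespace when a nested list
# ends: the indent falls back to the list token now at the top of the
# container stack.  The indent_level attribute name is an assumption.
def recalculate_continue_seq(container_token_stack):
    if container_token_stack:
        return " " * container_token_stack[-1].indent_level
    return ""

# An outer list with indent 2 and an inner list with indent 4.
stack = [SimpleNamespace(indent_level=2), SimpleNamespace(indent_level=4)]
stack.pop()  # the inner list just ended
continue_seq = recalculate_continue_seq(stack)  # back to the outer indent
```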

With those easy fixes, and with the main replacement call in the main loop, a lot of the scenario tests were passing, while keeping the processing simple.

That simplicity would soon change.

Indented Code Blocks

As I went through the test failures, a few stood out as an odd grouping: indented code blocks and lists. Doing some more research, I found out that, due to a bug in my code, the indented code blocks were not being checked properly. It only involved one list item scenario test, but it still needed to be fixed.

In that newly found bug, the problem was that the indented code blocks were always adding their indented padding at the beginning of the lines. This was not usually a problem, but with any examples that contained blank lines within the indented code block, it was an issue. A good example of this is example 81:



When the parser tokenizes this example, the Blank Line tokens that are generated already include any whitespace that is present on that line. Because those tokens take care of their own whitespace data, when the Markdown transformer interprets those Blank Line tokens, it needs to accept them as they are. Modifications were needed to enforce this behavior. The combine function of the TextMarkdownToken class containing the indented blank line was changed to insert a NOOP character and then a newline character. As text used in an indented code block was the only paragraph-like encapsulating token that inserted a blank line into the composed text, I had confidence this change was isolated.

With those NOOP characters in place, the Markdown transformer needed some modifications to understand how to deal with them. Before proceeding with the normal insertion of any whitespace in the continue_seq variable, a quick check was made to see if the new_data variable contained a NOOP character. If so, the string in the new_data variable was split and processed. Each element in the split list was checked to see if it started with a NOOP character. If it did, the NOOP character was simply removed by setting the replacement_data variable to the contents of that element after the first NOOP character. If it did not, the replacement_data variable was set to the contents of the continue_seq variable plus the contents of the element. Once that was done, the value was put back into the list at the same index. Then, when the processing was done, the new_data variable was reconstituted by joining the elements of the list back together using the \n character as a joining character.
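The described split-and-join processing can be sketched as follows; the NOOP sentinel value and the function name are my own assumptions, while the variable names follow the article:

```python
# Sketch of the NOOP handling described above.  The NOOP sentinel value
# and the function name are assumptions; continue_seq, new_data, and
# replacement_data follow the article's naming.
NOOP = "\x02"  # assumed placeholder; the project's actual sentinel may differ

def merge_with_noop_handling(new_data, continue_seq):
    # Fast path: no NOOP characters, so apply the normal indent merge.
    if NOOP not in new_data:
        return new_data.replace("\n", "\n" + continue_seq)
    split_data = new_data.split("\n")
    for index, item in enumerate(split_data):
        if item.startswith(NOOP):
            # The line carries its own whitespace: drop the marker only.
            replacement_data = item[len(NOOP):]
        else:
            # Otherwise, prepend the list indent as usual.
            replacement_data = continue_seq + item
        split_data[index] = replacement_data
    return "\n".join(split_data)
```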

While I was not looking for that, I was glad I found it. A bit embarrassed that I did not find it right away, but I did find it!

Handling Lazy Continuation Lines

With most of the scenario tests now passing, my focus was centered on a set of tests that dealt with lists and lazy handling. While this took me a bit to get my head around, it basically says:

If a string of lines Ls constitute a list item with contents Bs, then the result of deleting some or all of the indentation from one or more lines in which the next non-whitespace character after the indentation is paragraph continuation text is a list item with the same contents and attributes. The unindented lines are called lazy continuation lines.

Huh? Let me translate:

If you have a list already started, and you encounter lines that should be in the list except for the fact that they are not indented properly, they are lazy continuation lines, and are included.

The best way to explain is with an example:1

- first list item
next line of first list item
 next next line of first list item

In that example, all three of those lines are considered part of the list item, even though the later lines are indented less than the indent of 2 specified by the initial list item element. But that presented a bit of a problem.

When parsing the Markdown document, the indent level was being tracked, and any additional whitespace was added to the contained token. But as I was writing the Markdown transformer, I noticed that I had missed the case where the amount of indent was less than the current list’s indent level. This was not an issue with the HTML transformer, as that transformer does not rely on any of the extracted whitespace. However, the Markdown transformer does.

To fix this issue, I needed to first make a change in the parse_paragraph function of the LeafBlockProcessor class. In that function, I reconstituted the actual indent of the given line and compared it against the indent level from the dominant unordered list item token. If that actual indent level was less than the dominant indent level, I adjusted the actual whitespace by prefacing that whitespace with…well… blech characters.

Yes, blech characters. Blech, according to Webster’s means “used to express disgust”. While I knew I had to track those indenting characters somehow, I really was not happy with it. Disgust may be a bit too intense of an emotion, but when I figured out what I had to do, that was the most printable word that I uttered.

Using the above example, the tokenized text looks like:

- first list item{newline}
{blech}{blech}next line of first list item{newline}
{blech}next next line of first list item{newline}

In this way, the indent was properly represented in the token, and the Markdown transformer had enough information to rehydrate the data afterwards. With those changes locked into the tokens, the Markdown transformer was then able to be changed to understand those tokens. That processing was simple. If a line of the text output started with a blech character, those blech characters were replaced with a space character. If no blech characters were there, the normal replacement of the newline character would occur.
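That replacement step can be sketched like this; the BLECH sentinel value and the function name are hypothetical stand-ins for the project's actual code:

```python
# Sketch of the rehydration step: each leading "blech" character on a
# line is turned back into the space it represents; other lines pass
# through unchanged.  BLECH's value is an assumption.
BLECH = "\x03"

def rehydrate_lazy_indents(text):
    rehydrated_lines = []
    for line in text.split("\n"):
        # Count the leading blech characters on this line.
        count = 0
        while count < len(line) and line[count] == BLECH:
            count += 1
        # Replace each blech character with a single space.
        rehydrated_lines.append(" " * count + line[count:])
    return "\n".join(rehydrated_lines)
```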

And I could have changed the name of the character from “blech character”, but after a while, it just kind of stuck with me.

New List Item Tokens

It was about this time when I decided to tackle the New List Item tokens. While I had been working around them somewhat, it was starting to get to the point where they were unavoidable. At first, I did not think these tokens would be an issue, but I forgot about one little part of the New List Item tokens: they reset the indent for the list.

A good example of this is example 290:

- a
 - b
  - c
   - d
  - e
 - f
- g

In this case, the first line starts a list and each line after that starts a new list item. As the new list items gradually increase and then decrease the indent, the 3 middle lines (c, d, and e) are interpreted as new list item elements, rather than sublists. If it were not for the new list item elements resetting that indent, those 3 lines would be indented to the point where their indent would make them eligible to start a new sublist.

But properly tracking this indent change required some interesting thinking. If I tracked the indent change in the actual token, it would mean changing that token. To me, that was a non-starter. Instead, I added a separate stack for container tokens in the Markdown transformer and added copies of the container tokens to this stack. Because I added copies of the tokens to the stack, I was free to change the indent value located within the Unordered List token without worrying about side effects.
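A minimal illustration of the copy-based stack, using a stand-in token class since the project's actual token type and its indent attribute name are not shown here:

```python
import copy

# Stand-in token class; the project's actual Unordered List token and
# its indent_level attribute name are assumptions here.
class FakeListToken:
    def __init__(self, indent_level):
        self.indent_level = indent_level

container_token_stack = []

# Push a *copy* so the transformer can adjust the indent without any
# side effects on the original token stream.
original_token = FakeListToken(indent_level=2)
container_token_stack.append(copy.deepcopy(original_token))

# A New List Item token resets the indent on the copy only.
container_token_stack[-1].indent_level = 4
```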

With those changes in place, the Markdown transformer was able to reset the indent level and make sure it was being properly tracked. This meant that the indents were able to be properly reset to the correct value once a List Item end token was received for a sublist.

Taking a bit of a deep breath and a pause, I noticed that I was close to finishing off the Unordered List Item tokens. That gave me a bit of a jump in my step to clean things up!

Taking Care of Stragglers

With all the major and minor cases of lists dealt with, I started going through the other scenario tests, fixing up the token lists after verifying that the new tokens were correct. Everything else was easily resolved at this point, except for some lists in a couple of cases. Those cases were interesting in that there was actually too much whitespace, not too little. And in this case, it was a newline character, not a space.

The Fenced Code Block element and the SetExt Heading element are unique in that they have a line-based sequence that delimits the end of the element. Usually this is not a problem, but in the case of interacting with lists, the transformer was inserting a newline after the element itself, and then another newline to mark the end of that line. While this duplication did not occur all the time, it took a bit to figure out the exact sequence that triggered this.

After doing some research, I found that, weird as it was, it only occurred if:

  • it was one of these two elements
  • the new block of data ends with a newline character
  • the next token to be processed is a New List Item token

While the sequence of things that had to occur was weird, the solution was easy:

        block_should_end_with_newline = False
        if next_token.token_name == "end-fcode-block":
            block_should_end_with_newline = True
            delayed_continue = ""
        elif next_token.token_name == "end-setext":
            block_should_end_with_newline = True
        block_ends_with_newline = \
            block_should_end_with_newline and new_data.endswith("\n")
        if (
            block_ends_with_newline
            and next_one
            and next_one.token_name == MarkdownToken.token_new_list_item
        ):
            new_data = new_data[0 : -len(continue_seq)]

Basically, if we hit that situation, just remove the excess character. I was hoping to refactor it into something more elegant, but it worked at the time and I wanted to get on to the handling of Ordered List Item tokens.

Second Verse…

I fondly remember being a kid at a summer camp and hearing the words “Second verse, same as the first, a little bit louder, and a little bit worse!”. Working on the ordered list tokens made me think of that saying almost immediately. Except for the fact that it was not a little bit worse, it was a lot easier.

There were two main reasons for that. The first reason is that looking at the samples as a group, there are objectively fewer examples with ordered lists than unordered lists. In the list items section of the GFM specification, there are 20 of each, but in the lists section of the specification, there are 20 examples of unordered lists and 7 examples of ordered lists. The second reason was that most of the work was either completed when working on the unordered list token transformations, or it was used in a copy-and-paste manner.

However, knowing that lists are one of the two container elements in Markdown, I took some extra time and reverified all the scenario tests, both ordered lists and unordered lists. I was able to find a couple of small things that were quickly fixed, but other than that, everything looked fine!


Cleanup

As with a lot of my free project time recently, I used some extra time that I had to focus on a few cleanup items for the project. While none of them was important on its own, I just felt that the project would be cleaner with them done.

The first one was an easy one: going through the HTML transformer and the Markdown transformer and ensuring that all the token handlers were private. There really was not any pressing need to do this, but it was just cleaner. The only code that was using those handlers was in the same class, so it was just tidier that way.

Next was the creation of the __create_link_token function to help tidy up the __handle_link_types function. The __handle_link_types function was already messy enough handling the processing of the link information; the creation of the normal or image link was just complicating things. While I still want to go back and clean up functions like that, for the time being, moving the creation code to __create_link_token was a good step.

Finally, there was the case of the justification function calls throughout the code. Not to sound like a purist, but I felt that they were messy. I often had to remind myself of what the three operands were: the string to perform on, the justification amount, and the character to use for the justification. The actual use of the function was correct; it just felt like its usage was not clear. So instead of having code like this around the code base:

        some_value = "".rjust(repeat_count, character_to_repeat)

I replaced it with:

        some_value = ParserHelper.repeat_string(character_to_repeat, repeat_count)

While the code for this operation was a one-line function, now located in the ParserHelper class, I felt it made sense and was in an easy-to-find place.

    def repeat_string(string_to_repeat, repeat_count):
        """
        Repeat the given character the specified number of times.
        """
        return "".rjust(repeat_count, string_to_repeat)

“Fixing” Example 528

I do not want to spoil the surprise too much, but the fact that I have a section called "Fixing" Example 528 probably gives it away. I fixed it. But the story is more interesting than that.

In the last article, I talked about example 528 and how I was having problems getting it to parse properly. Even having done thorough research on the example and the algorithm, I came up with nothing. To me, it looked like the parsing was being performed according to the GFM specification’s algorithm, but the parsing was just off. After yet another attempt to figure this example out and get it working, I posted my research and a request for help to the CommonMark discussion forums.

Keeping my head down and focused on writing that week’s article, I did not notice that I had received a reply the very next day. As a matter of fact, while I did notice that I had a response to my post, it was not until Friday night that it clicked that it was a response to “THAT” post. After getting the cleanup documented in the previous section taken care of, I reserved some dedicated time to walk through the reply.


Kudos

First off, I would like to extend kudos to the replier, John MacFarlane, one of the maintainers of the Markdown GFM specification. While he would have been within his rights to tell me to RTFM2, he took some time to walk me through the algorithm as it applied to that example, even providing me with some debug output from his program for that example. His response was a classy one, with just the right amount of information.

Side by Side Comparisons

Armed with that new information, I turned on the debug output and ran through the output from my implementation of the algorithm again. Slowly, with my own written notes as an additional guide, I began to walk through the output. Found closer at 8. Check. Found matching opener at 4. Check. Deactivating opener at 3. Check. Found closer at 15. Check. Popping inactive opener at 3. Ch…er… what? “Popping”?

Going back to the algorithm and the text that John provided, it hit me. The popping that he was referring to was this part of the algorithm:

If we do find one, but it’s not active, we remove the inactive delimiter from the stack, and return a literal text node ].

For some reason, when I read that before, I thought it was referring to removing the token from consideration, not actually removing it from the stack. It probably also confused things that I did not maintain a separate stack for the link resolution. Instead, I added an instance of the SpecialTextMarkdownToken token to the inline block list whenever I hit a link sequence. In either case, I was not removing that token. To compound the issue, I did not stop at that inactive token; I just kept on looking for the next active SpecialTextMarkdownToken token, finding the image start token. Ah… everything was now falling into place in my mind.

Fixing the Issue

The fix was very anticlimactic. I created the new __revert_token_to_normal_text_token function, which removed the SpecialTextMarkdownToken token and replaced it with a normal TextMarkdownToken token. In addition, I changed the algorithm to make sure that when this happened, it stopped processing for that link end sequence, as per the algorithm. With the start character sequence now being effectively invisible to the algorithm, the rest of the parsing went fine, with the correct results coming out. Well, almost. A small fix was needed to the __consume_text_for_image_alt_text function to make it properly emit the correct text if an instance of a SpecialTextMarkdownToken token was encountered.

With the big fix and the little fix both completed, the scenario test for Example 528 was fully enabled and fully passing. Finally!

Reminder to Self: Be Humble

Having taken quite a few attempts at implementing the algorithm and making sure it passed all test cases, I hit a wall. A seemingly rock-solid wall. That was fine. During any project, be it software or hardware, things happen. When it got to that point, I gave myself some time, I knuckled down3, and I did some solid research on the problem. Keeping good notes, I was then able to share those notes with peers in the community, along with a sincere and humble request for help.

I do not always get a good response to requests for help. However, I have noticed that doing good research and presenting that research with humility increases the chance of getting a positive response. At no point did I rant and say, “it’s broken” or “my way works, what is wrong with yours”. Instead I said, “Is there something wrong with the algorithm?” and “Based on my implementation of that algorithm”. By acknowledging that it could be an issue with my implementation, I feel that I opened the doors for someone to help, rather than slamming them shut with negative talk.

And as I mentioned in the Kudos section above, mostly, I believe, because of a humble approach to asking for help, I got a really good response.

What Was My Experience So Far?

Wow… that work was intense. For one thing, I was right: it was a lot of addressing issues with whitespace and running scenario tests repeatedly. But it was more than that for me. I knew I had the leaf blocks taken care of, but I was really concerned about how difficult the implementation of the container transformations would be. If I did it right and kept to my design, I was confident that I could keep the complexity down, but I still figured it would be complex.

I guess that led to me second guessing every line of code and getting in the way of myself a bit. I did prevail, but that concern or fear of damaging existing tests was somewhat paralyzing at times. And while the logical half of my brain was telling me that I had plenty of tests to reinforce my completed work, the emotional half was another story. That is where that fear was coming from, my emotional side. Only when I took a moment to take another breath and focus on the tests was I able to banish that concern for a while.

And it also helped me to do a bit of self-analysis on why I was concerned. After a lot of thinking, I came to a simple conclusion. The closer I get to completing the project, the more I am concerned that I have not architected and designed it properly. If I have done that properly, small changes can be accomplished with few or no unintended side effects. If not, encountering side effects should be frequent. Seeing as I know I have identified some areas of the code base that I want to refactor, I questioned whether the current state was good enough.

Knowing that helped me figure it out for myself. I do believe that I have confidence in my architecture and design, and at the same time, I believe that I can improve it. It might seem like a dichotomy, but I am starting to think that both can be correct at the same time. But knowing that this was the issue that was causing me concern helps me combat it. I am a lot less worried about it, but it is a work in progress, much like the project.

With that information in hand, I felt better. Cautious about the next steps in getting the checks closer to the finish line, but better! And let’s not forget about finally closing the issue with Example 528. That was good!

What is Next?

With the Markdown transformer almost completed, the only tokens left that need a transformation are the Block Quote tokens. In addition, as the line/column number consistency checks do not currently deal with Block Quote tokens either, I will need to add both checks in the next block of work.

  1. Based on example 249, but with an unordered list instead of an ordered list. 

  2. Read The F&$#ing Manual… or I guess RTFS in this case. 

  3. According to Webster’s: “pitch in, dig in”. 
