Summary

In my last article, I increased the coverage provided by the token to Markdown transformer by adding support for all tokens except for container related tokens and link related tokens. In this article, I take a large step forward towards complete consistency checks by adding support for link related tokens.

Introduction

Having started with a large group of tokens to implement, I was now down to two smaller groups left to add to the Markdown transformer: container related tokens and link related tokens. Before starting on the container related tokens, I wanted to make sure that all the leaf block tokens were complete, so the link related tokens were the solid choice for the next block of work. But even with confidence in that choice, I was still concerned.

Why was I concerned? Because outside of container related tokens and text tokens, I feel that link related tokens are the most used group of tokens. While a good argument can also be made that the Atx Heading token is the most used token, I feel that the importance of links in any document somewhat overpowers the argument for the Atx Heading token, if only a bit. In my own writing, I believe headings are useful, but I feel that it is the links that I add to a document that really take the quality of my documents to the next level. It is possible that others may not agree with my reasoning, but it was one of the sources of my concern.

Another reason for my concern? Links are complex. Just on a normal link alone, I counted at least 12 different properties that I am going to have to represent in the token to allow me to properly rehydrate it. And then there are the Link Reference Definitions, the only multiline element in the base GFM document. I hoped that I already had dealt with most of the complexity of that token, but I knew that adding support for this group of tokens to the Markdown transformer was going to include some serious work.

Regardless of how often the link related tokens are used or how difficult I thought they would be to implement, they needed to be implemented. And as the last group of tokens before the container tokens, now was the time to work on them.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 31 Jul 2020 and 31 Jul 2020.

Where to Start?

Before I started making any changes, I knew it was going to take a significant amount of time to complete the work on links. I also knew that all that work was not going to just happen. To get it right, I needed to plan it out properly.

I started that process by breaking down the Link token group into 4 smaller groups: Inline Links, Reference Links, Link Reference Definitions, and Image Links. The Inline Link tokens were prioritized into the first position, as they had no dependencies and could set the foundations for the other groups. Next, I looked at the Image Link tokens. As Image Link tokens were normal Link tokens with a couple of differences, it made sense to me that they would go last. By going last, the link token foundations would be tested and stable before adding support for the Image tokens on top of them. That just left the Link Reference Definition tokens and Reference Link tokens.

The ordering between Link Reference Definition tokens and Reference Link tokens was going to be a bit tricky, but it would be a process that I knew I could manage. To ensure that I could properly test the Link Reference Definition tokens, I needed to start with a rudimentary rehydration of the Shortcut Link token. Once that was done, I could complete the work for the Link Reference Definition tokens, hopefully not hitting any difficult Shortcut Link token cases or other link types along the way. At that point, I could switch back to the Shortcut Link token scenario tests before completing the other Link scenario tests.

With research done, the relative ordering of the tasks was easy. Start with the Inline Link tokens, which had no dependencies. Then work on the pair of Link Reference Definition tokens and Reference Link tokens, using the Inline Link tokens as a foundation. Finally, work on the Image tokens, using all that other work as a solid foundation for the changes required of the transformer.

It was not a complicated plan, but it was a decent plan that I believed in. And with that plan in place, I started to work on Inline links!

In the same way that I start all my transformer additions, like the addition of emphasis support in the last article, I found a good, easy example and started with the scenario test for that example. What was needed to pass that first scenario test, example 493, was simple. The LinkStartMarkdownToken class already had all the necessary fields, so no changes were needed there. I then proceeded to add the rehydrate_inline_link_text_from_token function to the transformer, containing a simple transformation:

    link_text = (
        "[" + link_token.text_from_blocks + "]("
        + link_token.link_uri + " \"" + link_token.link_title + "\")"
    )

From there, each additional scenario test introduced another variation on what an acceptable Link token was. For example, the scenario test for example 494 introduced a link without a link title. That changed the transformation into:

    link_text = (
        "[" + link_token.text_from_blocks + "](" + link_token.link_uri
    )
    if link_token.link_title:
        link_text += " \"" + link_token.link_title + "\""
    link_text += ")"

And so on, and so on. For regular readers of my articles, this is the same pattern I have been following since I started this project: get a simple case working, then keep adding little changes until the big change is complete. This set of changes was no different in that regard.

Then Why Did I Think It Was Difficult?

The daunting aspect of this group was its volume. To be able to recreate the inline link token properly, I needed to ensure that every part of the data was properly catalogued and stored in the token. Doing this exercise for the Inline Link token, I came up with the following diagram and mapping:

[foo](   </my url>     "  a title "     )
|---|||-|||-----|||---|||--------|||---||
  |  | | |   |   |  |  |     |    |  |  |
  A  B C D   E   D  F  G     H    G  I  B
  • A - ex_label and text_from_blocks
    • extracted link label, in final and original states
  • B - self.label_type
    • the use of ( and ) denoting an inline link
  • C - before_link_whitespace
    • whitespace before the link starts
  • D - did_use_angle_start
    • True indicating that this URI was encapsulated
  • E - self.link_uri and self.pre_link_uri
    • extracted link URI, in final and original states
  • F - before_title_whitespace
    • whitespace before the title starts
  • G - inline_title_bounding_character
    • character used to bound the title
  • H - link_title and pre_link_title
    • extracted link title, in final and original states
  • I - after_title_whitespace
    • whitespace after the title is completed

While this may seem overly thorough, I felt that without a complete map of the Inline Link, any Markdown transformation would be incomplete. Especially with so many moving parts, I was convinced that without a solid plan, I would miss a combination or a permutation of the test data.
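
To show how those pieces could fit back together, here is a minimal sketch of a fuller rehydration that leans on the fields in the map above. It is illustrative only, not the transformer's actual implementation, and it skips cases such as titles bounded by parentheses.

    # Minimal sketch of reassembling an Inline Link from the mapped fields above.
    # The field names follow the A-to-I map; the real transformer handles more
    # cases, such as titles bounded by parentheses instead of quote characters.
    def rehydrate_inline_link_sketch(link_token):
        link_text = "[" + link_token.text_from_blocks + "]("              # A, B
        link_text += link_token.before_link_whitespace                    # C
        link_uri = link_token.pre_link_uri or link_token.link_uri         # E
        if link_token.did_use_angle_start:                                # D
            link_uri = "<" + link_uri + ">"
        link_text += link_uri
        link_text += link_token.before_title_whitespace                   # F
        link_title = link_token.pre_link_title or link_token.link_title   # H
        if link_title:
            bounder = link_token.inline_title_bounding_character          # G
            link_text += bounder + link_title + bounder
            link_text += link_token.after_title_whitespace                # I
        return link_text + ")"                                            # B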

How Did Those Changes Go?

The scenario tests were not really difficult to get working properly; it was the sheer volume of them. Behind each of those scenario tests was a set of Markdown parser functions that needed to be properly exercised. While those functions had already been tested against their HTML output, this work was to add more information to each token, to ensure that the data in the tokens is complete. And that completeness came at a cost.

One of those costs was an increase in the number of results returned by functions, such as the __process_inline_link_body function. To accomplish the extraction requirements of these changes, this function went from returning a tuple containing 5 variables to returning a tuple containing 10 variables. And while that may seem like a simple refactor to complete, I am still debating with myself on how to handle that change. Do I create a new class that is only used internally in this one case, or do I throw it into an instance of the LinkStartMarkdownToken class that I can pass around more easily? Even though the token data would not be complete at that point, the LinkStartMarkdownToken class has all the necessary fields, with 2 fields to spare. Which to choose? As I said, I am still thinking on that one.
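
To make the first option concrete, a small internal-only result class could carry those ten values between the parsing functions. This is a hypothetical sketch of that option, not code from the project, and the field names simply mirror the mapping from the previous section.

    from dataclasses import dataclass

    # Hypothetical sketch of an internal-only result class for the values that
    # the body-processing function extracts, instead of a 10-element tuple.
    # The field names mirror the earlier mapping and are illustrative only.
    @dataclass
    class InlineLinkBodyResults:
        link_uri: str
        pre_link_uri: str
        link_title: str
        pre_link_title: str
        did_use_angle_start: bool
        before_link_whitespace: str
        before_title_whitespace: str
        after_title_whitespace: str
        inline_title_bounding_character: str
        new_parse_index: int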

Another function that I want to clean up is the rehydrate_inline_link_text_from_token function. At 57 lines, it is a bit larger than I usually like in my projects. But in this case, maybe it is warranted. This function does have a single responsibility, the rehydration of the token, and it sticks solidly to that purpose. With the 12 fields that it uses to rehydrate the token, some of that size is understandable. And that is only for the Inline Link tokens; the other 3 types of link tokens will need implementations too.

For me, the really tough part was that I needed to slog[1] through the sheer number of combinations presented in the examples. Using link labels as an example, the link label can contain normal text, backslashes, character entities, and inline tokens. To make sure I had all these covered, I needed to make sure I had representative scenario tests for each of these groups of combinations. And then there are the link URIs and link titles. It just got to the point where I used a small spreadsheet to keep track of things, ticking off combinations as I went… just to be sure.

Implementing the transformations for the Link Reference Definition and Reference Link tokens was different from the other tokens in that I had to develop the two transformations in coordination with each other. For example, I started adding support for these tokens using the scenario test for example 161, which has the following Markdown:

[foo]: /url "title"

[foo]

As I mentioned in a previous section, to properly get this test passing, I needed to start by implementing a bare bones version of the Shortcut Reference Link. After all, if I could not reference the Link Reference Definition, how could I know if it was working properly?

Thankfully, after looking at the Link Reference Definition examples, example 161 to example 188, the only extra link requirement in each of these tests was a Shortcut Reference Link. Having already written the transformer to handle Inline Link tokens, adding code to deal with Shortcut Link tokens was almost trivial.

A Shortcut Reference Link is merely an Inline Link without the inline specifier, so modifying the rehydrate_inline_link_text_from_token function to handle the extra link type was quick and efficient. Reusing code already in that function, I came up with these changes within 5 minutes:

    if link_token.label_type == "shortcut":
        link_label = link_token.text_from_blocks.replace(
            InlineHelper.backspace_character, ""
        ).replace("\x08", "")
        link_text = "[" + link_label + "]"
    elif link_token.label_type == "inline":
        ...
    else:
        assert False

Everything looked good from the shortcut link point of view. However, since most of the tests that have Shortcut Links also have Link Reference Definitions to provide the reference for those links, I needed to switch to get the Link Reference Definition tokens done.

While I was not 100% comfortable with leaving that implementation untested, I understood that I would have to wait a while to complete the testing. With that understood, on to Link Reference Definitions!

With a good stab at a transformation for Shortcut Links in place, I turned my attention to the Link Reference Definition transformation. When I started putting the code for this transformation together, I was greeted by good news.

The first bit of good news was that since support for Link Reference Definition tokens was added more recently than the other tokens, I had created that token with a better idea of what information I might need later. As such, all the fields that I required to properly represent a Link Reference Definition element were already in place. That allowed me to implement a near-complete version of the rehydrate_link_reference_definition function after only 3 tries and 10 minutes. That was very refreshing.

The second bit of good news was that the previous work on this token had already dealt with all the token’s complexity. Because I had a lot of issues implementing the parser’s support for the Link Reference Definition element, I assumed that the rehydration of the token would also be a lot of work. It turned out that, because of all that hard work, the near-complete version of the rehydrate_link_reference_definition function was very simple. I had even collected both pre-processed and post-processed versions of the link label, link destination URI, and link title!
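
A rough sketch of why that function stayed simple: with both versions of each part already stored on the token, rehydration mostly prefers the original ("pre") form and stitches the parts back together. The attribute names below are illustrative rather than the token's actual field names, and a double-quoted title is assumed for brevity.

    # Rough sketch only: prefer the original ("pre") form of each part when it
    # was recorded, falling back to the processed form. Attribute names are
    # illustrative, and a double-quoted title is assumed for brevity.
    def rehydrate_link_reference_definition_sketch(lrd_token):
        link_label = lrd_token.pre_link_label or lrd_token.link_label
        link_destination = lrd_token.pre_link_destination or lrd_token.link_destination
        rehydrated = (
            "[" + link_label + "]:"
            + lrd_token.before_destination_whitespace + link_destination
        )
        link_title = lrd_token.pre_link_title or lrd_token.link_title
        if link_title:
            rehydrated += lrd_token.before_title_whitespace + '"' + link_title + '"'
        return rehydrated + "\n"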

Now back to the other Link tokens.

With all the hard work done, finishing off the rest of the links was easier than I had previously anticipated. With Link Reference Definition support in place, the scenario tests that included both Link Reference Definition elements and Shortcut Links were executed and, with a few tweaks, passed. As with the Shortcut Reference Link tokens, support for Full Reference Link tokens and Collapsed Reference Link tokens was added quickly.
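
For the Reference Link variants, the extra label_type branches follow the same shape as the Shortcut Link handling shown earlier. This is a simplified sketch of those branches; treating ex_label as the reference label is my assumption, not confirmed project code.

    # Simplified sketch of the reference-link label types; using ex_label as the
    # reference label is an assumption, and the backslash/marker cleanup shown
    # earlier for shortcut links is omitted for brevity.
    def rehydrate_reference_link_sketch(link_token):
        if link_token.label_type == "full":
            return "[" + link_token.text_from_blocks + "][" + link_token.ex_label + "]"
        if link_token.label_type == "collapsed":
            return "[" + link_token.text_from_blocks + "][]"
        assert link_token.label_type == "shortcut"
        return "[" + link_token.text_from_blocks + "]"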

Within a couple of hours, a good percentage of the scenario tests that involved any of the 4 Link token types were completed and passing. The remaining tests were the scenario tests that gave me some real issues with determining the original text. A good example of this was the scenario test for example 540:

[foo [bar](/uri)][ref]

[ref]: /uri

which produces the HTML output:

<p>[foo <a href="/uri">bar</a>]<a href="/uri">ref</a></p>

The first link that gets interpreted from that text is the Inline Link [bar](/uri). When the ] character after that link is encountered, the provided algorithm keeps it as a literal ] character, because a valid link already exists between it and its matching opening [ character. Finally, [ref] is a valid Shortcut Link, matching the reference set up by the Link Reference Definition.

Getting the correct, original text to insert into the Link tokens was a bit of an effort. The __collect_text_from_blocks function took a bit of fiddling to make sure that the original text that was extracted matched the actual original text. As with other things in this project, it took a couple of tries to find something that worked and worked well, but that time was well spent. A bit frustrating at times, but worth it.

Having added support for all the non-image link scenario tests, it was time to add the image links into the mix.

Images

The GFM specification for images clearly states:

The rules for this are the same as for link text, except that (a) an image description starts with ![ rather than [, and (b) an image description may contain links.

Basically, as I stated before, the Image Link tokens get to use a lot of the work done for the other link tokens as a foundation, with just 2 differences. The first difference was easy to deal with in the transformer: emit the sequence ![ instead of the sequence [. Done.
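
In the transformer, that first difference amounts to little more than choosing the leading sequence. The token_name check below is illustrative rather than the project's exact condition.

    # Illustrative only: the first difference between links and images is the
    # leading sequence. The exact condition used by the project may differ.
    link_prefix = "![" if link_token.token_name == "image" else "["
    link_text = link_prefix + link_token.text_from_blocks + "]("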

The second difference was handling examples of links within image links and image links within links. While avoiding the scenario test for example 528[2], there were plenty of other cases, such as example 525:

[![moon](moon.jpg)](/uri)

and example 583:

![foo [bar](/url)](/url2)

that I needed to deal with. The actual parsing of those images and their transformation to HTML were already tested and working. It was just the extraction of the original text that gave me issues. However, having dealt with similar examples in the previous changes for the __collect_text_from_blocks function, I was able to finish up those cases quickly.

The good part about getting to the end of this work took a bit to sink in. I had budgeted an entire week to complete these changes, but even after making sure the commit was clean, it was still early on Saturday morning. It was Saturday morning and the link token group was completed. Well, almost completed; more on example 528 later. But it was good enough to mark this block of work as done. With some extra time left in my schedule, I decided to put it to good use.

Code Coverage

The first good use that I put that extra time to was improving code coverage. While there were only 3 cases where I needed to tighten up the code coverage, it was just something I wanted to make sure got done. It is true that I almost always argue that the scenario coverage metric is more important than the code coverage metric. But with the code coverage percentage in the high 99’s, I just wanted to nail this down while the required changes would be small and manageable.

Moving Special Character Support

The second good use for my extra time was to move the special character support in the parser into the ParserHelper class. Along the way, proper support for the Markdown transformer had been accomplished using a small set of special characters. These special characters allowed the normal processing of the tokens by the HTML transformer to generate the proper HTML output, while at the same time allowing the Markdown transformer to rehydrate the token into its original Markdown.

With the use of these characters scattered around the project’s code base, I felt it was useful to centralize their usage in the ParserHelper class. To complete that centralization, I also introduced a number of functions that either resolved the characters (for HTML) or removed the characters (for Markdown).[3] Those characters and their behaviors are:

  • the \b or backspace character, used primarily for the \\ or backslash character
    • when resolved, removes the character and the character before it
    • when removed, leaves the previous character in place
  • the \a or alert character, used to provide an “X was replaced with Y” notation
    • when resolved, removes the alert characters and the X sequence, leaving the Y sequence
    • when removed, removes the alert characters and the Y sequence, leaving the X sequence
  • the \x02 character, used to split whitespace
    • when resolved, is replaced with an empty string
    • when removed as part of SetExt Heading whitespace processing, delineates the removed leading whitespace from the removed trailing whitespace
  • the \x03 or “NOOP” character, used with the alert character sequence as a Y sequence
    • when resolved, replaces the entire sequence with an empty string
    • when removed, same as above, but used to indicate that the X sequence was caused by a blank line token

The centralization of these characters and their usage did help to clean up the code. In all cases, I created a single variable to represent the character and enforced its use throughout the codebase, except in test output. For example, instead of using a raw \b for the backspace character, I created a new static member variable of the ParserHelper class called __backspace_character. The use of these variables in the code made it clear that those characters were being used with purpose, not just out of convenience.
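
A small sketch of what those centralized helpers could look like is below. The function names and module-level constants are placeholders standing in for the ParserHelper class's private members, not necessarily the names used in the project, and the alert ("X was replaced with Y") sequence handling is omitted for brevity.

    # Placeholder sketch of the centralization idea; the real ParserHelper
    # function names may differ, and the alert-sequence handling is omitted.
    BACKSPACE_CHARACTER = "\b"
    SPLIT_CHARACTER = "\x02"

    def resolve_backspaces_from_text(text):
        # HTML direction: a backspace removes itself and the character before it.
        next_index = text.find(BACKSPACE_CHARACTER)
        while next_index != -1:
            text = text[: max(next_index - 1, 0)] + text[next_index + 1 :]
            next_index = text.find(BACKSPACE_CHARACTER)
        return text

    def remove_backspaces_from_text(text):
        # Markdown direction: only the marker is removed; the previous character stays.
        return text.replace(BACKSPACE_CHARACTER, "")

    def resolve_split_characters_from_text(text):
        # HTML direction: the split character is simply replaced with an empty string.
        return text.replace(SPLIT_CHARACTER, "")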

But even after that work, I still had a bit of time left. What else could I do to help the project?

Example 528

With my last remaining bits of extra time, I wanted to take another shot at the proper processing of example 528. Having tackled the Link related group of tokens, I felt that I had a good grasp of the processing required. With a sense of purpose and confidence, I felt it was time to put that belief to the test.

For some background, example 528 is similar to example 525 and example 583 referenced in the previous sections. However, while each of those examples deals with one type of link within the other type of link, example 528 can be considered the composition of both of those examples together. The example is as follows:

![[[foo](uri1)](uri2)](uri3)

producing the following HTML:

<p><img src="uri3" alt="[foo](uri2)" /></p>

To be clear, this is an inline image link that contains 2 possibly valid inline links within its link label. The final result is that the image’s URI is uri3, as expected, but the alt attribute’s text is set to [foo](uri2), an interpretation of the text within the link label. And to make it even more interesting, the GFM specification also provides a well-tested algorithm for evaluating emphasis and links.

Yes, I have had the algorithm given to me, and I cannot make it work. I confess. I have been over the specification’s algorithm and my own implementation of that algorithm, and I still cannot make it work. Every couple of weeks, I have spent a couple of hours looking at the log output, the source code, and the algorithm description, and… nothing.

Giving it another shot, I decided that instead of assuming I knew what was going on, I would approach it as a new problem. And as with any new problem, I tackle it by doing research, so that is what I set out to do. I turned on the logging for the scenario test associated with example 528 and started observing and scribbling. By the time I had gone through the algorithm 3 times, I was convinced that I was either missing something important or there was a problem with the well-tested algorithm. If I were a betting man, my money would be on the algorithm being correct, but I just could not figure out where the issue was!

What I saw in each of my examinations was that, as the processing progressed, the string [foo](uri1) was parsed as a link. Following the algorithm’s instructions, any link starts before that point needed to be marked as inactive, so the code marked the second [ character as inactive. I also double-checked the handling of the initial ![ sequence. As that sequence denotes an image token and not a link token, that initial ![ sequence was not marked as inactive. Then, when the second ] character was processed, the implementation skipped over the inactive [ character, hitting the ![ sequence for the image. With the link start characters exhausted, the rest of that string became plain text.

But that is not the result that was specified in the GFM specification. It wanted

<p><img src="uri3" alt="[foo](uri2)" /></p>

and the code was generating:

<p><img src="uri2" alt="foo" />](uri3)</p>

At that point, I decided to seek help from a higher power: the CommonMark specification site. I drafted a quick message for the forums, making sure I indicated that I was having a problem, clearly stating my findings from above, and noting that they were based on my implementation. After a couple of quick checks for spelling and grammar, I posted the request for help. I did get a response rather quickly, and I will address that in a future article.

What Was My Experience So Far?

The goal that I had set for myself for this chunk of work was to make sure that I added the link token group to the Markdown transformer’s features. While it did take me most of the week, I do believe that I accomplished that goal with some time to spare. It felt good to be able to take some time and do some small tasks to clean up the code a bit.

The weird thing about being able to see the end of the project’s initial phase is that while I want a quality project, I also want to hurry up. I can see the items in the issue list being resolved and then removed, and I just want to get them all removed. I want to push myself harder to finish them quicker, even though I know that is the wrong thing to do.

As with all side projects, my time and effort spent on the project is a balancing act between my personal responsibilities, my professional responsibilities, and the project’s responsibilities. And yes, while it is a balancing act, those three groups of responsibilities are in the correct order. I need to make sure to take care of myself and my job first, using any extra bandwidth to work on the project. While I do want to push myself to hurry and finish the project, from a resource allocation point of view, it just would not work.

And there is also the quality point of view. While I am good at my job, I am keenly aware that every project has four dials that can be adjusted: scope, cost, time, and quality. If I want the project to be completed faster, changing the time dial, I need to adjust the other dials to compensate. This is a personal side project, so I cannot adjust the cost dial, leaving the quality and scope dials. Seeing as I do not want to change my requirements for either of those two dials, I know I need to keep the time dial where it is.

Between the balancing act and the resource logic puzzle, I know I need to stay the course. While I was feeling the need to hurry up, I also knew that was not what had got me to this point in the project. What has got me here is good planning, solid effort, and not exhausting myself. If I upset that balance, I might lose my desire to finish the project, and that would be a shame.

So, as always, I started to look at and plan the next chunk of work, making sure it was a good chunk of work that I could easily accomplish. After all, slow and steady went the tortoise…

What is Next?

Add all the normal tokens to the Markdown transformer? Check. Add the Link-related token group to the Markdown transformer? Check. That just left the Container-related token group. And since I knew that Block Quotes were going to need a lot of work, taking care of the List-related tokens was a good first step.


  1. According to Webster’s dictionary: “to plod (one’s way) perseveringly especially against difficulty”. 

  2. More on example 528 later. 

  3. While I am sure I can come up with a better name for the two sets of functions, I am not sure what those better names would be. Ideas? 
