In my last article, I increased the coverage provided by the token to Markdown transformer by adding support for all tokens except for container related tokens and link related tokens. In this article, I take a large step forward towards complete consistency checks by adding support for link related tokens.
Starting with a large group of tokens to implement, I was now down to two smaller groups of tokens left to implement in the Markdown transformer: container related tokens and link related tokens. Before starting with the container related tokens, I wanted to make sure that all the leaf block tokens were complete, so the link related tokens were the solid choice for the next block of work. But even with confidence in that choice I was still concerned.
Why was I concerned? Because outside of container related tokens and text tokens, I feel that link related tokens are the most used group of tokens. While a good argument can also be made that the Atx Heading token is the most used token, I feel that the importance of links in any document somewhat overpowers that argument, if only a bit. In my own writing, I believe headings are useful, but I feel that it is the links I add to a document that really take its quality to the next level. It is possible that others may not agree with my reasoning, but it was one of the sources of my concern.
Another reason for my concern? Links are complex. Just on a normal link alone, I counted at least 12 different properties that I am going to have to represent in the token to allow me to properly rehydrate it. And then there are the Link Reference Definitions, the only multiline element in the base GFM document. I hoped that I already had dealt with most of the complexity of that token, but I knew that adding support for this group of tokens to the Markdown transformer was going to include some serious work.
Regardless of how often the link related tokens are used or how difficult I thought they would be to be implemented, they needed to be implemented. And as the last group of tokens before the container tokens, the time was now to work on them.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 31 Jul 2020 and 31 Jul 2020.
Where to Start?¶
Before I started making any changes, I knew I was going to take a significant amount of time to complete the work on links. I also knew that all that work was not going to just happen. To get it right, I needed to plan it out properly.
I started that process by breaking down the Link token group into 4 smaller groups: Inline Links, Reference Links, Link Reference Definitions, and Image Links. The Inline Link tokens were prioritized into the first position, as they had no dependencies and could set the foundations for the other groups. Next, I looked at the Image Link tokens. As Image Link tokens are normal Link tokens with a couple of differences, it made sense to me that they would go last. By going last, the link token foundations would be tested and stable before adding support for the Image tokens on top of them. That just left the Link Reference Definition tokens and Reference Link tokens.
The ordering between Link Reference Definition tokens and Reference Link tokens was going to be a bit tricky, but it would be a process that I knew I could manage. To ensure that I could properly test the Link Reference Definition tokens, I needed to start with a rudimentary rehydration of the Shortcut Link token. Once that was done, I could complete the work for the Link Reference Definition tokens, hopefully not hitting any difficult Shortcut Link tokens cases or other link types along the way. At that point, I could switch back to the Shortcut Link token scenario tests before completing the other Link scenario tests.
With research done, the relative ordering of the tasks was easy. Start with Inline Link tokens with their lack of dependencies. Then work on the pair of Link Reference Definition tokens and Reference Link tokens, using the Inline Link tokens as a foundation. Finally, work on Image tokens using all that other work as a solid foundation to make the changes required of the transformer.
It was not a complicated plan, but it was a decent plan that I believed in. And with that plan in place, I started to work on Inline links!
In the same way that I start all my transformer additions, like the additions in the last article, I found a good, easy example and started with the scenario test for that example. The changes needed to pass that first scenario test were simple. The `LinkStartMarkdownToken` class already had all the necessary fields, so no changes were needed there. I then proceeded to add the `rehydrate_inline_link_text_from_token` function into the transformer, containing a single statement:

```python
link_text = (
    "["
    + link_token.text_from_blocks
    + "]("
    + link_token.link_uri
    + " \""
    + link_token.link_title
    + "\")"
)
```
From there, each additional scenario test introduced another variation on what an acceptable Link token was. For example, the scenario test for Example 494 introduced a link title that was not present. That changed the transformation into:
```python
link_text = "[" + link_token.text_from_blocks + "](" + link_token.link_uri
if link_token.link_title:
    link_text += " \"" + link_token.link_title + "\""
link_text += ")"
```
And so on, and so on. For any readers of my articles, this is the same pattern I have been following since I started this project: get a simple case working, then keep on adding on little changes until the big change is complete. This set of changes was no different in that regard.
Then Why Did I Think It Was Difficult?¶
The daunting aspect of this group was its volume. To be able to recreate the inline link token properly, I needed to ensure that every part of the data was properly catalogued and stored in the token. Doing this exercise for the Inline Link token, I came up with the following diagram and mapping:
```text
[foo]( </my url> " a title " )
B A BBCD   E   DFG    H    GIB
```
- A - extracted link label, in final and original states
- B - the use of `[`, `](`, and `)` denoting an inline link
- C - whitespace before the link starts
- D - the use of `<` and `>`, with `True` indicating that this URI was encapsulated
- E - extracted link URI, in final and original states
- F - whitespace before the title starts
- G - character used to bound the title
- H - extracted link title, in final and original states
- I - whitespace after the title is completed
While this may seem overly thorough, I felt that without a complete map of the Inline Link, any Markdown transformation would be incomplete. Especially with so many moving parts, I was convinced that without a solid plan, I would miss a combination or a permutation of the test data.
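To make that mapping concrete, here is a minimal sketch of how those parts could be captured and reassembled. The class and field names are my own illustration, not the actual fields of the `LinkStartMarkdownToken` class:

```python
from dataclasses import dataclass


@dataclass
class InlineLinkParts:
    """Hypothetical container for the parts of `[foo]( </my url> " a title " )`."""

    link_label: str               # A: extracted link label, final state
    pre_link_label: str           # A: link label in its original state, if different
    before_uri_whitespace: str    # C: whitespace between `(` and the URI
    uri_is_angle_wrapped: bool    # D: True if the URI was wrapped in `<` and `>`
    link_uri: str                 # E: extracted link URI, final state
    pre_link_uri: str             # E: link URI in its original state, if different
    before_title_whitespace: str  # F: whitespace between the URI and the title
    title_bound_character: str    # G: the character used to bound the title
    link_title: str               # H: extracted link title, final state
    pre_link_title: str           # H: link title in its original state, if different
    after_title_whitespace: str   # I: whitespace between the title and `)`


def rehydrate(parts: InlineLinkParts) -> str:
    """Reassemble the original Markdown from the captured parts (sketch only)."""
    uri = parts.pre_link_uri or parts.link_uri
    if parts.uri_is_angle_wrapped:
        uri = "<" + uri + ">"
    bound = parts.title_bound_character
    return (
        "[" + (parts.pre_link_label or parts.link_label) + "]("
        + parts.before_uri_whitespace + uri
        + parts.before_title_whitespace
        + bound + parts.link_title + bound
        + parts.after_title_whitespace + ")"
    )
```

Feeding the diagram's example through this sketch yields the original text back, which is exactly the property the transformer needs.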
How Did Those Changes Go?¶
The scenario tests were not really difficult to get working properly; it was just the sheer volume of them. Behind each of those scenario tests was a set of Markdown parser functions that needed to be properly exercised. While those functions had already been tested against their HTML output, this work was to add more information to each token, to ensure that the data in the tokens is complete. And that completeness came at a cost.
One of those costs was an increase in the number of results returned by functions, such as the `__process_inline_link_body` function. To accomplish the extraction requirements of these changes, this function went from returning a tuple containing 5 variables to returning a tuple containing 10 variables. And while that may seem like a simple refactor to complete, I am still debating with myself on how to handle that change. Do I create a new class that is only used internally in this one case, or do I throw the data into an instance of the `LinkStartMarkdownToken` class that I can pass around more easily? While the token data is not complete at that point, the class has all the necessary fields, with 2 fields to spare. Which to choose? As I said, I am still thinking on that one.
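To illustrate the trade-off I was debating, here is a sketch of the named-tuple option. The `InlineLinkBody` name and its fields are hypothetical stand-ins, not the actual return values of the `__process_inline_link_body` function:

```python
from typing import NamedTuple, Optional


class InlineLinkBody(NamedTuple):
    """Hypothetical replacement for a 10-element tuple return value."""

    link_uri: str
    pre_link_uri: Optional[str]
    uri_is_angle_wrapped: bool
    link_title: Optional[str]
    pre_link_title: Optional[str]
    title_bound_character: Optional[str]
    before_uri_whitespace: str
    before_title_whitespace: str
    after_title_whitespace: str
    new_index: int


# Construction looks the same as building the bare tuple did...
body = InlineLinkBody("/uri", None, False, "title", None, '"', "", " ", "", 17)

# ...but callers can now read fields by name instead of by position,
# while existing index-based unpacking keeps working.
assert body.link_uri == "/uri"
assert body[0] == "/uri"
```

The appeal of this option over reusing the token class is that it documents the 10 values at their one internal call site without widening the token's public surface.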
Another function that I want to clean up is the rehydration function itself. At 57 lines, it is a bit larger than I usually like in my projects. But in this case, maybe it is warranted. This function does have a single responsibility, the rehydration of the token, and it sticks solidly to that purpose. With the 12 fields that it uses to rehydrate the token, the implementation difficulty is understandable. And that is only for the Inline Link tokens, not the other 3 types of link tokens. They will need implementations too.
For me, the really tough part was that I needed to slog through the sheer number of combinations presented in the examples. Using link labels as an example, a link label can contain normal text, backslashes, character entities, and inline tokens. To make sure I had all these covered, I needed to make sure I had representative scenario tests for each of these groups of combinations. And then there are the link URIs and link titles. It got to the point where I used a small spreadsheet to keep track of things, ticking off combinations as I went… just to be sure.
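That spreadsheet can be sketched in code as a simple cross product. The category names below are my own groupings for illustration, not an exhaustive list from the specification:

```python
from itertools import product

# Categories of content that can appear in each part of an inline link.
label_kinds = ["plain text", "backslash escape", "character entity", "inline token"]
uri_kinds = ["plain", "angle-wrapped", "backslash escape", "character entity"]
title_kinds = ["absent", "double-quoted", "single-quoted", "paren-bounded"]

# Each combination ideally has at least one scenario test behind it.
combinations = list(product(label_kinds, uri_kinds, title_kinds))
print(len(combinations))  # 4 * 4 * 4 = 64 combinations to tick off
```

Even with these coarse groupings, the count makes it obvious why a tracking sheet was worth the effort.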
Reference Links and Link Reference Definitions¶
Implementing the transformations for these two tokens was different than for the other tokens in that I had to develop the transformations in coordination with each other. For example, I started adding support for these tokens using the scenario test for example 161, which has the following Markdown:

```markdown
[foo]: /url "title"

[foo]
```
As I mentioned in a previous section, to properly get this test passing, I needed to start by implementing a bare bones version of the Shortcut Reference Link. After all, if I could not reference the Link Reference Definition, how could I know if it was working properly?
Starting with Simple Shortcut Links¶
Thankfully, after looking at the Link Reference Definition examples, example 161 to example 188, the only extra link requirement in each of these tests was a Shortcut Reference Link. Having already written the transformer to handle Inline Link tokens, adding code to deal with Shortcut Link tokens was almost trivial.
A Shortcut Reference Link is merely an Inline Link without the inline specifier, so extending the `rehydrate_inline_link_text_from_token` function to handle the extra link type was quick and efficient. Reusing code already in that function, I came up with these changes within 5 minutes:

```python
if link_token.label_type == "shortcut":
    link_label = link_token.text_from_blocks.replace(
        InlineHelper.backspace_character, ""
    ).replace("\x08", "")
    link_text = "[" + link_label + "]"
elif link_token.label_type == "inline":
    ...
else:
    assert False
```
Everything looked good from the shortcut link point of view. However, since most of the tests that have Shortcut Links also have Link Reference Definitions to provide the reference for those links, I needed to switch to get the Link Reference Definition tokens done.
While I was not 100% comfortable with leaving that implementation untested, I understood that I would have to wait a while to complete the testing. To do that, on to Link Reference Definitions!
Moving Over to Link Reference Definitions¶
With a good stab at a transformation for Shortcut Links in place, I turned my attention to the Link Reference Definition transformation. When I started putting the code for this transformation together, I was greeted by good news.
The first bit of good news was that since the support for Link Reference Definition tokens was added more recently than the other tokens, I had created that token with a better idea of what information I might need later. As such, all the fields that I required to properly represent a Link Reference Definition element were already in place. That allowed me to implement a near-complete version of the `rehydrate_link_reference_definition` function after only 3 tries and 10 minutes. That was very refreshing.

The second bit of good news was that the previous work on this token had already dealt with all the token’s complexity. As I had a lot of issues in implementing the parser’s support for the Link Reference Definition element, I assumed that the rehydration of the token would also be a lot of work. It turned out that because of all that hard work, that near-complete version of the `rehydrate_link_reference_definition` function was very simple. I had even collected both pre-processed and post-processed versions of the link label, link destination URI, and link title!
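As a rough sketch of why that rehydration turned out to be simple, here is a minimal reconstruction that prefers the pre-processed (original) text when it differs from the processed text. The function and parameter names here are illustrative, not the project’s actual signature:

```python
def rehydrate_link_reference_definition_sketch(
    link_name,
    link_destination,
    link_title,                 # None when the definition has no title
    pre_link_name=None,         # original text, when processing changed it
    pre_link_destination=None,
    pre_link_title=None,
):
    """Sketch: rebuild `[label]: /url "title"`, preferring original text."""
    text = "[" + (pre_link_name or link_name) + "]: "
    text += pre_link_destination or link_destination
    if link_title is not None:
        text += ' "' + (pre_link_title or link_title) + '"'
    return text
```

Because both versions of each field were already captured in the token, the rehydration reduces to string concatenation, which matches how quickly the real function came together.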
Now back to the other Link tokens.
Back to Finishing Up Links¶
With all the hard work done, finishing off the rest of the links was easier than I had previously anticipated. With Link Reference Definition support in place, the scenario tests that included both Link Reference Definition elements and Shortcut Links were executed and, with a few tweaks, passed. Like the effort required to support Shortcut Reference Link tokens, the support for Full Reference Link tokens and Collapsed Reference Link tokens was added quickly.
Within a couple of hours, a good percentage of the scenario tests that involved any of the 4 Link token types were completed and passing. The remaining tests were those scenario tests that gave me some real issues with determining the original text. A good example of this was the scenario test for example 540:
```markdown
[foo [bar](/uri)][ref]

[ref]: /uri
```
which produces the HTML output:
<p>[foo <a href="/uri">bar</a>]<a href="/uri">ref</a></p>
The first link that gets interpreted from that text is the Inline Link `[bar](/uri)`. When the `]` character is encountered after that link text, due to the provided algorithm, it is kept as a plain `]` character, as there is a valid link between it and its matching `[` character. Finally, the `[ref]` is a valid Shortcut Link, matching the reference set up by the Link Reference Definition.
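The deactivation rule from that algorithm can be sketched with a toy delimiter stack. This is a simplification of the GFM algorithm for illustration, not the project’s actual implementation:

```python
def deactivate_openers(delimiter_stack, matched_index):
    """Toy version of the GFM rule: once a link is formed, any earlier `[`
    opener can no longer start a link (links cannot contain links), but
    `![` image openers stay active (images may contain links)."""
    for entry in delimiter_stack[:matched_index]:
        if entry["marker"] == "[":
            entry["active"] = False


stack = [
    {"marker": "![", "active": True},  # start of an image
    {"marker": "[", "active": True},   # a pending link opener
]

# A link closed after both openers were pushed onto the stack.
deactivate_openers(stack, 2)

assert stack[0]["active"] is True   # image opener untouched
assert stack[1]["active"] is False  # link opener deactivated
```

It is this asymmetry between `[` and `![` openers that drives the behavior in example 540 and the image examples that follow.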
Getting the correct, original text to insert into the Link tokens was a bit of an effort. The `__collect_text_from_blocks` function took a bit of fiddling to make sure that the original text that was extracted matched the actual original text. As with other things in this project, it took a couple of tries to find something that worked and worked well, but that time was well worth it. A bit frustrating at times, but worth it.
Having completed adding the support for all the non-image link scenario tests, it was time to add the image links into that mix.
Looking at the GFM specification for images, it clearly states:
> The rules for this are the same as for link text, except that (a) an image description starts with `![` rather than `[`, and (b) an image description may contain links.
Basically, as I stated before, the Image Link tokens get to use a lot of the work done for the other link tokens as a foundation, with just 2 changes. The first part of that difference was easy to deal with in the transformer: emit the sequence `![` instead of the sequence `[`.

The second part of that difference was handling examples of links within image links and image links within links. While avoiding the scenario test for example 528, there were plenty of other cases, such as example 525 and example 583, that I needed to deal with. The actual parsing of those images and their transformation to HTML were already tested and working. It was just the extraction of the original text that gave me issues. However, having dealt with similar examples in the previous changes for the `__collect_text_from_blocks` function, I was able to finish up those tests quickly.

The good part about getting to the end of this work took a bit to sink in. I had budgeted an entire week to complete these changes, but even after making sure the commit was clean, it was only early Saturday morning, and the link token group was completed. Well, almost completed. More on example 528 later. But it was good enough to mark this block of work done and complete. With some extra time left in my schedule, I decided to put it to good use.
The first good use that I put that extra time to was improving code coverage. While there were only 3 cases where I needed to tighten up the code coverage, it was just something I wanted to make sure got done. It is true that I almost always argue that the scenario coverage metric is more important than the code coverage metric. But with the code coverage percentage in the high 99’s, I just wanted to nail this down while the required changes would be small and manageable.
Moving Special Character Support¶
The second good use for my extra time was to move the special character support in the parser into the `ParserHelper` class. Adding proper support for the Markdown transformer had been accomplished using a small set of special characters. These special characters allowed the normal processing of the tokens by the HTML transformer to generate the proper HTML output, while at the same time allowing the Markdown transformer to rehydrate the token into its original Markdown form.

With the use of these characters scattered around the project’s code base, I felt it was useful to centralize their usage in the `ParserHelper` class. To complete that centralization, I also introduced a number of functions that either resolved the characters (for HTML) or removed the characters (for Markdown).
Those characters and their behaviors are:

- `\b` or backspace character, used primarily for the `\\` or backslash character
  - when resolved, removes the character and the character before it
  - when removed, leaves the previous character in place
- `\a` or alert character, used to provide an “X was replaced with Y” notation
  - when resolved, removes the alert characters and the X sequence, leaving the Y sequence
  - when removed, removes the alert characters and the Y sequence, leaving the X sequence
- `\x02` character, used to split whitespaces
  - when resolved, is replaced with an empty string
  - when removed as part of SetExt Heading whitespace processing, delineates leading whitespace that was removed from trailing whitespace that was removed
- `\x03` or “NOOP” character, used with the alert character sequence as a Y sequence
  - when resolved, replaces the entire sequence with an empty string
  - when removed, same as above, but used to indicate that the X sequence was caused by a Blank Line token
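To show how the resolve and remove behaviors differ, here is a sketch for the alert character, assuming a `\a` + X + `\a` + Y + `\a` sequence layout. That layout is my reading of the description above, not necessarily the project’s exact format:

```python
import re

# Assumed layout: "\a" + original_text + "\a" + replacement_text + "\a"
_ALERT_SEQUENCE = "\a([^\a]*)\a([^\a]*)\a"


def resolve_replacements(text):
    """For HTML output: keep the Y (replacement) sequence from each marker."""
    resolved = re.sub(_ALERT_SEQUENCE, lambda m: m.group(2), text)
    # A NOOP (\x03) Y sequence resolves to nothing at all.
    return resolved.replace("\x03", "")


def remove_replacements(text):
    """For Markdown rehydration: keep the X (original) sequence instead."""
    return re.sub(_ALERT_SEQUENCE, lambda m: m.group(1), text)


# "&amp;" in the original Markdown was replaced with "&" for the HTML output.
sample = "a\a&amp;\a&\ab"
assert resolve_replacements(sample) == "a&b"        # what the HTML transformer needs
assert remove_replacements(sample) == "a&amp;b"     # what the Markdown transformer needs
```

The same pair of operations, resolve for HTML and remove for Markdown, applies to each of the characters in the list above.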
The centralization of these characters and their usage did help to clean up the code.
In all cases, I created a single variable to represent the character and enforced its use throughout the codebase, except in test output. For example, instead of using `\b` for the backspace character, I created a new static member variable of the `ParserHelper` class called `__backspace_character`. The use of these variables in the code made it clear that those characters were being used with purpose, and not because of convenience.
But even after that work, I still had a bit of time left. What else could I do to help the project?
With my last remaining bits of extra time, I wanted to take another shot at the proper processing of example 528. Having tackled the Link related group of tokens, I felt that I had a good grasp of the processing required. With a sense of purpose and confidence, I felt it was time to put that belief to the test.
For some background, example 528 is similar to example 525 and example 583 referenced in the previous sections. However, while each of those examples deals with one type of link within the other type of link, example 528 can be considered the composition of both of those examples together. The example is as follows:

```markdown
![[[foo](uri1)](uri2)](uri3)
```

producing the following HTML:

```html
<p><img src="uri3" alt="[foo](uri2)" /></p>
```
To be clear, this is an inline image link that contains 2 possibly valid inline links within its link label. The final result is that the image’s URI is `uri3`, as expected, and the `alt` parameter’s text is set to `[foo](uri2)`, an interpretation of the text within the link label. And to make it even more interesting, the GFM specification also provides an algorithm for evaluating emphasis and links, an algorithm which has been well tested.
Yes, I have had the algorithm given to me, and I cannot make it work. I confess. I have been over the specification’s algorithm and my own implementation of that algorithm, and I cannot make it work. Every couple of weeks, I have spent a couple of hours looking at the log output, the source code, and the algorithm description, and… nothing.
Giving it another shot, I decided that instead of assuming I knew what was going on, I would try and test it as a new problem. And with any new problem, I tackle it by doing research on it, so that is what I set out to do. I turned on the logging for the scenario test associated with example 528 and started observing and scribbling. By the time I had gone through the algorithm 3 times, I was convinced that I was either missing something important or there was a problem with the well-tested algorithm. If I were a betting man, my money would be on the algorithm being correct, but I just could not figure out where the issue was!
What I saw in each of my examinations was that as the processing progressed, the `[foo](uri1)` was parsed as a link. Following the algorithm’s instructions, any link starts before that point needed to be marked as inactive, so the second `[` character was marked as inactive. I also double checked the handling of the initial `![` sequence. As that sequence denotes an image token and not a link token, that initial `![` sequence was not marked as inactive. Then, when the `]` character was processed, the implementation skipped over the inactive character, hitting the `![` sequence for the image. With the start link characters exhausted, the rest of that string became plain text.
But that is not the result specified in the GFM specification. It wanted:

```html
<p><img src="uri3" alt="[foo](uri2)" /></p>
```

and the code was generating:

```html
<p><img src="uri2" alt="foo" />](uri3)</p>
```
At that point, I decided to seek help from a higher power: the CommonMark specification site. I posted a quick message to the forums, making sure I indicated that I was having a problem, clearly stating my findings from above and that they were based on my implementation. A couple of quick checks for spelling and grammar, and I posted a request for help. I did get a response to my request rather quickly, and I will address that in a future article.
What Was My Experience So Far?¶
The goal that I had set for myself for this chunk of work was to make sure that I added the link token group to the Markdown transformer’s features. While it did take me most of the week to accomplish that, I do believe that I accomplished that with some time to spare. It felt good to be able to take some time and do some small tasks to clean up the code a bit.
The weird thing about being able to see the end of the project’s initial phase is that while I want a quality project, I also want to hurry up. I can see the items in the issue list being resolved and then removed, and I just want to get them all removed. I want to push myself harder to finish them quicker, even though I know that is the wrong thing to do.
As with all side projects, my time and effort spent on the project is a balancing act between my personal responsibilities, my professional responsibilities, and the project’s responsibilities. And yes, while it is a balancing act, those three groups of responsibilities are in the correct order. I need to make sure to take care of myself and my job first, using any extra bandwidth to work on the project. While I do want to push myself to hurry and finish the project, from a resource allocation point of view, it just would not work.
And there is also the quality point of view. While I am good at my job, I am keenly aware that every project has four dials that can be adjusted: scope, cost, time, and quality. If I want the project to be completed faster, changing the time dial, I need to adjust the other dials to compensate. This is a personal side project, so I cannot adjust the cost dial, leaving the quality and scope dials. Seeing as I do not want to change my requirements for either of those two dials, I know I need to keep the time dial where it is.
Between the balancing act and the resource logic puzzle, I know I need to stay the course. While I was feeling the need to hurry up, I also knew that hurrying was not what had got me to this point in the project. What had got me here was good planning, solid effort, and not exhausting myself. If I upset that balance, I might lose my desire to finish the project, and that would be a shame.
So, as always, I started to look at and plan the next chunk of work, making sure it was a good chunk of work that I could easily accomplish. After all, slow and steady went the tortoise…
What is Next?¶
Add all the normal tokens to the Markdown transformer? Check. Add the Link-related token group to the Markdown transformer? Check. That just left the Container-related token group. And since I knew that Block Quotes were going to need a lot of work, taking care of the List-related tokens seemed like a good first step.
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.