Introduction
As detailed in the last article, the remaining work left on the main parser consists of inline links, link reference definitions, reference links, and image links. Inline links were covered in the last article. While I could try and come up with some grand reason for doing the link reference definitions next, the truth is simple: they were next on the list.
What Is the Audience For This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commit of 22 March 2020.
What Are Link Reference Definitions?
In the last article, I introduced inline links and how they present the text to appear in the link (link label), the link itself (link destination), and an optional title for that link (link title). Link reference definitions are a related concept in that they take the link destination and link title parts of the link and store them in a map, to be used at a different time.
Basically, where an inline link uses the form:
[link](/uri "title")
a link reference definition uses the form:
[link]: /uri "title"
The main difference between the two elements is that a link reference definition does not add any elements into the HTML document by itself. To utilize the link reference definition, a reference link must be added to the document that has a normalized[1] link label that matches the normalized link label from a link reference definition present elsewhere in the same Markdown document. For example, using the simplest form of reference links, a shortcut reference link, the following Markdown:
[link]: /uri "title"
[link]
creates a link reference definition and uses it, generating the following HTML:
<p><a href="/uri" title="title">link</a></p>
While the normalization rules[1] are somewhat complex, in most cases it just means using the same link label in both the reference link and the link reference definition. Unless you happen to get into the more interesting aspects of the normalization rules, both reference links and link reference definitions are easy to use, by design.
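To make the matching concrete, here is a minimal sketch of label normalization and the map of definitions, following the steps listed in the first footnote. This is not the project's actual code: `normalize_label`, `add_definition`, `lookup`, and the `definitions` dictionary are names invented for illustration, and keeping the first definition for a given label is my reading of the CommonMark specification.

```python
import re


def normalize_label(label: str) -> str:
    """Normalize a link label such as "[Foo  Bar]" for matching."""
    # Strip the opening and closing brackets, if both are present.
    if label.startswith("[") and label.endswith("]"):
        label = label[1:-1]
    # Unicode case fold, trim both ends, collapse internal whitespace runs.
    return re.sub(r"\s+", " ", label.casefold().strip())


# Hypothetical store: normalized label -> (link destination, link title).
definitions = {}


def add_definition(label, destination, title=None):
    # setdefault leaves an existing entry alone, so the first definition wins.
    definitions.setdefault(normalize_label(label), (destination, title))


def lookup(label):
    """Return (destination, title) for a reference link label, or None."""
    return definitions.get(normalize_label(label))


add_definition("[link]", "/uri", "title")
add_definition("[LINK]", "/other")  # ignored: same normalized label
```

With those definitions in place, `lookup("[ Link ]")` finds the `/uri` entry, because both labels normalize to `link`.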
Why Use Link Reference Definitions?
In the last article, I mentioned that I use inline links exclusively. Now that I have introduced link reference definitions and reference links, I can provide more context on the difference between them. Using a simple lorem ipsum generator, I came up with the following two examples. This first example contains an inline link:
Nam efficitur, turpis ac vestibulum imperdiet, nulla mi mollis erat, nec efficitur nunc
lorem rutrum metus. Vestibulum dictum lacinia lacus, at ornare quam consequat ultrices.
Nam quam leo, aliquet in luctus in,
[porttitor non quam](https://lipsum.com/).
Donec tincidunt augue nisi, sed
pellentesque nisl porttitor vel.
and this second example contains a reference link, specifically a shortcut reference link:
Nam efficitur, turpis ac vestibulum imperdiet, nulla mi mollis erat, nec efficitur nunc
lorem rutrum metus. Vestibulum dictum lacinia lacus, at ornare quam consequat ultrices.
Nam quam leo, aliquet in luctus in, [porttitor non quam]. Donec tincidunt augue nisi,
sed pellentesque nisl porttitor vel.
[porttitor non quam]: https://lipsum.com/
From an HTML output point of view, both examples generate the exact same output.
In both cases, the examples are presented as I would normally include them in my
articles, folding each line after the 90 characters that I keep my Markdown editor set
to. Following my own style, when I add an inline link I make sure that it begins at
the start of a new line so that I can clearly identify it as a link. In
trying an equivalent example with a reference link instead of an inline link, the
style that I chose was to place the link label delimiters ([ and ]) around the link
label itself with no change in formatting, with the link reference definition following
later in the document. While link reference definitions can precede or follow
any reference links that use them, to me this seemed like the right way to do it.
From my point of view, I find that the inline link provides better readability for me and how I read my articles when authoring them. Perhaps it is through having authored and proofed many articles in this format, but to me, not having the link information right in the paragraph feels like a grammatical or spelling error. And while I did not really think about it before, when proofing the Markdown version of my articles, I don’t really “see” the link destination and link title until I slow down on my final passes. Regardless of the reasoning behind it, I just find it works better for me to read the Markdown version of articles with inline links.
Please note that this view is solely my own. When performing a similar evaluation for yourself or your organization, it is important to consider your own criteria when determining which option, reference links or inline links, is best for you.
Hitting Implementation Issues
Having implemented inline links as documented in the last article, I started working on the link reference definitions thinking they would be easy. In the case of the first example, example 161, it was in fact pretty easy.
[foo]: /url "title"
[foo]
A complete link label, followed by both a link destination and a link title. All on one line. Then add a simple shortcut reference link to reference the previously added link reference definition. Nice. Compact. Complete.
It was when I moved on to example 162 that all of the issues started:
[foo]:
/url
'the title'
[foo]
The first important issue to understand is that link reference definitions are processed as leaf nodes instead of inline text. To keep the memory footprint of the parser low, I made an early design decision to only get the text from the input source one line at a time.[2] While the proper implementation of that design is still in the future, that design limits the main parsing function of the parser to only knowing about the current line, with no built-in capacity for look-ahead or look-behind. Remember this issue, as I will get back to it in just a minute.
The second important issue is that unlike all previous leaf node elements, it can take
multiple lines to determine if the current link reference definition element is valid.
In example 162, as provided above, it isn’t until the end of the line following the second
' character on line 3 that the link reference definition is determined to be valid.
To better highlight this problem, consider example 165, which provides an exaggerated
example of this:
[foo]: /url '
title
line1
line2
'
[foo]
While the link reference definition as stated in example 165 is valid, it isn’t until
the 5th line, where the second ' character followed by the end of the line
closes off the link title, that the entire link reference definition is deemed valid.
By making one small change to the previous example, removing that previously mentioned
5th line, that entire link reference definition is rendered invalid, as follows:
[foo]: /url '
title
line1
line2
[foo]
After that one small change, instead of a link reference definition followed by a valid reference link, both elements are now just interpreted as plain text.
Remember a couple of paragraphs ago when I mentioned “remember this issue” when talking about processing link reference definitions as leaf nodes? Here is the payoff.
Because of my design choice to process the Markdown document one line at a time, I needed to add extra logic to the parser to allow me to “rewind” those lines. In the case of the modified example 165, the entire link reference definition is rendered invalid, and the parser must rewind to the start of the link reference definition. However, when it starts parsing the lines again, care must be taken to inform the parser that it cannot consider the newly rewound lines to be eligible for a valid link reference definition. Painful, but not too painful.
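As a rough sketch of that rewind mechanism, and not the project's actual code, a line source can keep a small queue of lines that were handed back to it. The names `LineSource`, `next_line`, and `rewind` are invented for illustration. The `was_requeued` flag is one possible way to tell the parser that a line has already been tried as part of a link reference definition and should not be tried again.

```python
from collections import deque


class LineSource:
    """Feeds lines one at a time, with support for pushing lines back."""

    def __init__(self, lines):
        self._incoming = iter(lines)
        self._requeued = deque()

    def next_line(self):
        """Return (line, was_requeued), or (None, False) at end of input."""
        if self._requeued:
            return self._requeued.popleft(), True
        return next(self._incoming, None), False

    def rewind(self, lines):
        """Push already-consumed lines back so they are served again in order."""
        self._requeued.extendleft(reversed(lines))


# The modified example 165: the title is never closed, so everything rewinds.
source = LineSource(["[foo]: /url '", "title", "line1", "line2", "[foo]"])
consumed = [source.next_line()[0] for _ in range(5)]
source.rewind(consumed)

replayed = []
while True:
    line, was_requeued = source.next_line()
    if line is None:
        break
    replayed.append((line, was_requeued))
```

Every replayed line arrives with `was_requeued` set, so the main parsing function can fall through to ordinary paragraph handling instead of starting another definition attempt.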
Following along from that change, if we make a similar change to example 162 by removing
the final ' character from the definition, it poses a different problem. While the removal
of the 5th line of example 165 invalidates the entire link reference definition, removing
the final ' character from example 162 only invalidates the link title, leaving the rest of
the link reference definition valid. To deal with this, I needed to not only have logic to
go backwards towards the start of the link reference definition, but to stop that rewinding
if whatever part of the definition that was not rewound turned out to be a valid
definition. While the rewinding was a headache and somewhat obvious[3],
aborting the rewinding if a valid link reference definition “fragment” was found was
not obvious to me at all.
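One way to picture the "stop rewinding at a valid fragment" rule is to test progressively shorter prefixes of the collected lines and keep the longest one that still forms a complete definition. The sketch below is only an illustration under heavy assumptions: `split_at_longest_valid_prefix` and `is_valid_definition` are hypothetical names, and the validator is a toy regular expression that handles only the shapes used in the examples above, far short of the real CommonMark rules.

```python
import re

# Toy validator: a label, a destination, and an optional quoted title.
# It exists only to drive the example and ignores most of the real rules.
_TOY_DEFINITION = re.compile(
    r"""\[[^\]]+\]:\s*\S+(\s+(?:'[^']*'|"[^"]*"))?\s*"""
)


def is_valid_definition(text):
    return _TOY_DEFINITION.fullmatch(text) is not None


def split_at_longest_valid_prefix(lines, is_valid=is_valid_definition):
    """Return (kept_lines, lines_to_rewind) for a failed multi-line attempt."""
    # Try the longest candidate first, dropping one trailing line each time.
    for count in range(len(lines), 0, -1):
        if is_valid("\n".join(lines[:count])):
            return lines[:count], lines[count:]
    # No prefix is a valid definition: rewind everything.
    return [], list(lines)


# Example 162 with the closing quote removed: only the title is lost.
partial = split_at_longest_valid_prefix(["[foo]:", "/url", "'the title", "[foo]"])

# Example 165 with its 5th line removed: nothing survives.
nothing = split_at_longest_valid_prefix(
    ["[foo]: /url '", "title", "line1", "line2", "[foo]"]
)
```

In the first case the label and destination are kept as a definition and the last two lines are handed back for rewinding. In the second case all five lines go back.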
That code was painful. Honestly speaking, that logic alone took about half of the 5 days required to get the multiple-line aspect of link reference definition parsing working properly. A decent amount of that time was taken up with rewriting the logic for extracting the link label, link destination, and link title to handle being straddled over multiple lines. But the real “fun” was making sure that rewinding the lines would properly rewind the token stream and token stack in the parser.
While it took a lot of work to get there, it was personally fulfilling when I got it right. It wasn’t an easy issue to solve and coming up with a clean solution wasn’t easy. As a bonus, my dog stopped looking at me with a funny expression. It was pointed out to me that he did that when I facepalmed myself whenever I got the parsing wrong. Personally, I considered that a 2-in-1 benefit!
What Was My Experience So Far?
Only a few issues have taxed my experience this much, and implementing link reference definitions was one of them. That is both a good thing and a bad thing. On the good side, due to my experience and stubbornness, I was able to modify my implementation to deal with the issue without having to change my design decision. On the bad side, there is only one test with a link reference definition being within a container block, and that leaves a lot of questions about how to handle failures within those containers. While I noted the latter down for subsequent testing, it still leaves me feeling a bit uneasy that I had to modify the parser to handle that.
Considering where I am with the parser, I am glad that I hit that issue now, and not after I finished the parser. While it was painful to go through, it did reinforce a few things about the design of the parser so far.
The first thing that was reinforced was that, with only one exception, my early decision to do line-by-line parsing is viable. While there may be parser extensions that change that, the number of exceptions in the link reference definition category should be low. If I can then combine that with some Python generator logic, hopefully I can keep the memory profile of the parser low.
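The generator idea mentioned above might look something like the following minimal sketch. It is an assumption about a future design, not the parser's current code, and `read_lines` is a hypothetical name. The point is only that the parser can pull one line at a time without holding the whole document in memory.

```python
import io


def read_lines(stream):
    """Yield one line at a time from a text stream, without line endings."""
    for raw_line in stream:
        yield raw_line.rstrip("\r\n")


# An in-memory stream stands in for an open Markdown file here.
sample = io.StringIO("[foo]: /url\n\n[foo]\n")
lines = list(read_lines(sample))
```

In real use the generator would be consumed lazily by the parsing loop rather than collected into a list. The list is used here only to show the result.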
The second thing that was reinforced was that the general structure of the parser was properly designed and implemented. While I still need to take some time and refactor individual groups of functions into their own modules, the actual functions themselves are durable. Except for the link helper functions and a few core functions, I did not have to change any other functionality to handle the failure rewind scenario. In those few functions I changed, it was either to specifically handle that scenario or to pass information back to the main parser function about how to handle the rewind. Compared to past situations I have encountered, I consider it a benefit that I only had to change the small amount of code that I did.
Finally, my decision to use inline links over reference links was reinforced. In more closely examining the difference between the two types of links, I believe I better understand and appreciate both types of links. While I agree that writing Markdown with reference links will more closely approximate how the paragraphs will look when completed[4], the absence of the link destination and link title reduces my comprehension of the paragraph as a whole. I find myself making the same decision as before, but now I believe I can more clearly communicate the options for both types of links and why I chose inline links over reference links.
What is Next?
The list of what is left to complete in the parser is getting shorter and shorter. With inline links and link reference definitions tested and completed, only the full implementation of reference links and image links remains. While a lot of foundation work has already been set up for these features, I am wary of declaring that it will be an easy job from here on in. But with only 2 features left, I know the start of the real linting work is coming soon!
[1] Long-winded version: To normalize a label, strip off the opening and closing brackets, perform the Unicode case fold, strip leading and trailing whitespace and collapse consecutive internal whitespace to a single space. Shorter version: reduce or eliminate whitespace and do a case-insensitive comparison.
[2] To keep the memory footprint of a parser low, the parser design should try and only keep the information that is required for parsing in memory. By parsing the Markdown input line by line, I do not have to worry about having to load the entire Markdown input into memory before I can begin parsing it.
[3] Except for a Lisp parser that I once wrote, I cannot think of a single parser where I didn’t have to rewind at least one complex entity out of the 25+ parsers that I have written.
[4] As per John Gruber’s original intentions. Go to this part of his original Markdown article and go to the paragraph preceding the start of the emphasis section. For more complete information on why John felt that reference links were better, look further up in that section for more details.