Introduction
The end of the main parser is in sight! Two articles ago, the addition of support for inline links was documented, and the last article detailed the addition of support for link reference definitions. In terms of remaining work required to meet the GFM specification, only reference links and image links are left. As image links are just reference links with a slightly different syntax and slightly different rules, it made sense to focus on reference links first.
What Is the Audience For This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commit of 26 March 2020.
A Quick Aside
I just wanted to take a moment to give some context on why I am providing many examples in this article. If it feels like I am providing a lot of “extra” examples, that is because I initially had a few issues with the different link types and their syntax. For whatever reason, the different link types were just not “clicking” inside of my head. It was only after I started focusing equally on the rules and the examples, that I was able to match the “abstract” text in the specification with the “concrete” examples provided. Together they provided the context that I required to truly understand reference links.
I hope that by providing good examples in this article, I am providing a similar amount of clarity to help any readers who may encounter similar issues to what I encountered.
What Are Shortcut Reference Links?
In the last article on link reference definitions, I briefly introduced a concept called a shortcut reference link. As that article was focusing on link reference definitions and not reference links, I introduced it only to show what link reference definitions were capable of doing. A good example of this is the following Markdown where the link reference definition is specified on the first line and the shortcut reference link that utilizes that definition is specified on the third line:
[link]: /uri "title"

[link]
Shortcut reference links like this are the easiest of the three reference link types to understand as everything is kept simple. As with all types of reference links, shortcut reference links are taken care of in the inline processing stage, long after any link reference definitions have been collected. To use a link reference definition, the normalized version of the link label from the shortcut reference link must match a link label from a previously defined link reference definition.1 For now, I am going to treat “normalized” as meaning a case-insensitive comparison between two strings, with no other modifications. While this is not completely accurate, it keeps things simple, and I promise to revisit it before the end of the article.
If the link’s link label matches a link reference definition, the shortcut reference link uses the link label as the text within the anchor. In the case of the above example, this produces the following HTML:
<p><a href="/uri" title="title">link</a></p>
When this HTML was generated, the text `link` was taken from the shortcut reference link, and the rest of the anchor tag (the text between `<a` and `</a>`, excluding the text `link`) was composed using information from the link reference definition. The benefit of this approach is that the “marker” for the link is inline with the rest of the text while the bulkier link data is located elsewhere in the document. This allows the author to better control their own authoring experience by controlling where the link reference definitions occur in their document: after the paragraph containing the link, at the end of the section, or at the end of the document.
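To make that division of labor concrete, here is a minimal Python sketch of how a shortcut reference link could be turned into an anchor. It is an illustration only, not the code from the project: the `definitions` dictionary and the `render_shortcut` function are hypothetical names, and matching uses the simplified case-insensitive comparison described above.

```python
from html import escape

# Hypothetical store of link reference definitions, keyed by the
# lower-cased label (the simplified form of "normalized" used so far).
definitions = {"link": ("/uri", "title")}


def render_shortcut(label: str) -> str:
    """Render a shortcut reference link, or fall back to literal text."""
    entry = definitions.get(label.lower())
    if entry is None:
        # No matching definition: the brackets stay in the output as text.
        return f"[{label}]"
    url, title = entry
    title_attr = f' title="{escape(title)}"' if title else ""
    # The href and title come from the definition; the anchor text
    # comes from the label of the shortcut reference link itself.
    return f'<a href="{escape(url)}"{title_attr}>{escape(label)}</a>'


print(render_shortcut("link"))
```

The sketch treats the label as plain text; a real parser would also run inline processing over the label before emitting it.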
Without exploring the normalization of link labels (yet!), each link label is also run through inline processing when it is rendered. A good instance of this is example 566 from the GFM specification, which adds inline emphasis to the link label:
[*foo* bar]
[*foo* bar]: /url "title"
and is rendered as:
<p><a href="/url" title="title"><em>foo</em> bar</a></p>
The key is that if the normalized link label of both the shortcut reference link and the link reference definition match, everything works fine. If the link label does not match, the text is simply rendered as normal text:
<p>[<em>foo</em> bar]</p>
Other than that, the only other special thing is that the link label for a shortcut reference link cannot contain another link. This is explicitly stated in the definition of link text which states:
Links may not contain other links, at any level of nesting. If multiple otherwise valid link definitions appear nested inside each other, the inner-most definition is used.
A good example of this is obtained when the following Markdown text is fed through a GFM compliant parser:
[foo[foo]]
[foo]: /url "title"
generating the following HTML:
<p>[foo<a href="/url" title="title">foo</a>]</p>
Looking at the HTML, the inner shortcut reference link `[foo]` was interpreted and the outer link `[foo[foo]]` was deemed invalid, and was therefore rendered as normal text.
Multiple References of a Link Reference Definition
In the GFM specification there are instructions and examples on what to do if there are multiple link reference definitions declared with the same normalized link label. However, after a couple of passes through the GFM specification, I was unable to find any guidance on what to do if there are multiple reference links that use a given link reference definition. The closest that the specification gets to this is near the start of the section on full reference links (covered in the next section) which states:
followed by a link label that matches a link reference definition elsewhere in the document.
Based on this information, along with testing against the CommonMark reference parser, I can safely state that the following example:
[*foo* bar] \
[*foo* bar]
[*foo* bar]: /url "title"
generates the following HTML:
<p><a href="/url" title="title"><em>foo</em> bar</a><a href="/url" title="title"><em>foo</em> bar</a></p>
Based on this information and the quote above, I feel that this is a good feature of Markdown. In cases where a link is used repeatedly, this behavior allows a single link reference definition to provide the link itself, with multiple reference links pointing to that one definition. In my mind, that is where full reference links come in.
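The following Python sketch shows why several reference links can share one definition, and why the definition can sit anywhere in the document: definitions are collected in a first pass and links are resolved in a second. It is deliberately simplified and is not the project's code. The `DEFINITION` and `SHORTCUT` patterns and the `resolve` function are hypothetical, only one-line definitions and shortcut links without nested brackets are handled, and no inline processing or escaping is applied.

```python
import re

# Simplified one-line definition form: [label]: /url "title"
DEFINITION = re.compile(r'^\[([^\]]+)\]:\s*(\S+)(?:\s+"([^"]*)")?\s*$')
# Simplified shortcut link: a bracketed label with no nested brackets.
SHORTCUT = re.compile(r'\[([^\[\]]+)\]')


def resolve(markdown: str) -> list[str]:
    definitions = {}
    body = []
    # Pass 1: collect every definition, wherever it sits in the document.
    for line in markdown.splitlines():
        found = DEFINITION.match(line)
        if found:
            label, url, title = found.groups()
            # setdefault keeps the first definition for a given label.
            definitions.setdefault(label.lower(), (url, title))
        else:
            body.append(line)

    def replace(match):
        entry = definitions.get(match.group(1).lower())
        if entry is None:
            return match.group(0)  # no definition: leave the text alone
        url, title = entry
        title_attr = f' title="{title}"' if title else ""
        return f'<a href="{url}"{title_attr}>{match.group(1)}</a>'

    # Pass 2: every matching link reuses the same stored definition.
    return [SHORTCUT.sub(replace, line) for line in body if line.strip()]


for line in resolve('See [home] and again [Home].\n\n[home]: /index "Start"'):
    print(line)
```

Because the lookup happens only after the whole document has been scanned, moving the definition to the top of the input gives the same result.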
What Are Full Reference Links?
In John Gruber’s original specification, there were no shortcut links and full reference links were referred to as “reference-style links”. It was only with later parsers that the more compact shortcut reference link was introduced. In my mind, instead of the order in which reference links were historically introduced, I prefer to see the hierarchy of reference links in reverse chronological order, with the shortcut reference links first and the full reference links second.
My reasoning for this is as follows. Based purely on efficiency, I typically start with the clearest construct that has the least amount of effort to add. With reference links, my first instinct is to add a shortcut reference link with the link label matching the link reference definition that I need to add to complete it. In 95% of the cases that I come across, I need a single link reference definition and a single reference link, so a shortcut reference link is the best option. For the remaining 5%, I usually have a case where I have multiple reference links referencing a single link reference definition, and I need a reference link that I can use there to good effect.
Consider the following example:
This is a sentence that refers to [my link].
Then, in a separate paragraph, I must still refer to it as [my link].
When coming up with this example, I needed to take care to word the second paragraph so that `[my link]` would fit somewhat fluidly in the sentence.
It would be more useful to do the following:
This is a sentence that refers to [my link].
Then, in a separate paragraph, I can then refer to it as [my other link][my link].
For me, that is where the benefit of full reference links comes in.
As I alluded to in the previous example, full reference links are of the form:
[text][link]
[link]: /url "title"
where the first set of square brackets encloses the link text and the second set of brackets encloses the link label. Unlike the shortcut reference link, where the link label serves as both the text to match and the text around which to link, a full reference link assigns a block of text to each of those responsibilities. This allows a single link reference definition to be referred to by multiple reference links, each one having tailored link text.
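As a small illustration of those two responsibilities, the Python sketch below keeps the link text and the link label as separate arguments. The `definitions` dictionary and the `render_full` function are hypothetical names for this example, matching is the simplified case-insensitive comparison used so far, and the link text is treated as plain text rather than being passed through inline processing.

```python
# Hypothetical definition store: lower-cased label -> (url, title).
definitions = {"my link": ("/url", "title")}


def render_full(text: str, label: str) -> str:
    """Render a full reference link of the form [text][label]."""
    entry = definitions.get(label.lower())
    if entry is None:
        # The label matches nothing, so the source text is kept as-is.
        return f"[{text}][{label}]"
    url, title = entry
    # Only the label is used for matching; only the text is displayed.
    return f'<a href="{url}" title="{title}">{text}</a>'


print(render_full("my link", "my link"))
print(render_full("my other link", "my link"))
```

Both calls use the same definition, but each produces an anchor with its own tailored text.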
The link text is defined in a manner very similar to link labels, including the limitation that link text cannot include other links, as demonstrated by example 541:
[foo *bar [baz][ref]*][ref]
[ref]: /uri
which generates the following HTML:
<p>[foo <em>bar <a href="/uri">baz</a></em>]<a href="/uri">ref</a></p>
Similar to how the link-within-a-possible-link example was handled for shortcut reference links, the inner reference link `[baz][ref]` is interpreted as a valid link, with the rest of the possible-link’s link text being presented as plain text. It is simply an act of serendipity that both the inner link and the outer possible-link used the link label `[ref]`. Because of this, when the inline processor gets to the `[ref]` text at the end of the line, it is interpreted as a shortcut reference link, in a separate context from the previous link. As such, a second link to the same URI is generated at the end of the HTML paragraph.
While I eventually found it easy to see how the above example should be parsed by working through it in my head, it was examples of this flavor that I really struggled with before I combined looking at the rules with looking at the examples, as detailed above. To be honest, to properly figure these out, I used a pencil and a sheet of paper to visually break down the problem. Only when I had those scribbled notes in front of me did I really get this example. Perhaps it is only me, but it was by literally working through the example and showing my work that I was able to really understand what the parser needed to do. After that, coding the parser to do it was simple. As I have mentioned a few times, figure out whatever works for you, and leverage that.
Collapsed Reference Links
While the inclusion of collapsed reference links in the specification may seem unnecessary, they are an alternative that offers the author leeway on how their Markdown article is constructed. For some authors, the full reference link of `[label][label]` might be preferred. For other authors, the shortcut reference link of `[label]` might be preferred. If the author is looking for something in between, the collapsed reference link and its format of `[label][]` offers a middle ground. All three of the examples provided in this paragraph are semantically equal and will produce the exact same HTML.
In the end, which of the reference link formats to use is just a matter of what the author prefers and is comfortable with.
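One way to see that the three formats are equivalent is to reduce each of them to the same pair of link text and link label. The Python sketch below does that with a regular expression. The `REFERENCE` pattern and the `split_reference` function are hypothetical, and the pattern only accepts labels without nested brackets, which is far narrower than what the specification allows.

```python
import re

# One bracketed label, optionally followed by a second bracket pair.
REFERENCE = re.compile(r'^\[([^\[\]]+)\](?:\[([^\[\]]*)\])?$')


def split_reference(link: str):
    """Return (kind, text, label) for one reference link, else None."""
    found = REFERENCE.match(link)
    if found is None:
        return None
    first, second = found.groups()
    if second is None:
        return ("shortcut", first, first)   # [label]
    if second == "":
        return ("collapsed", first, first)  # [label][]
    return ("full", first, second)          # [text][label]


for form in ("[label]", "[label][]", "[label][label]"):
    print(split_reference(form))
```

All three calls report the same text and the same label, so looking up the definition and rendering the anchor proceeds identically from that point on.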
Normalizing Link Labels
Back in the section on What Are Shortcut Reference Links?, I simplified the term “normalized” to mean a case-insensitive comparison. The full definition of normalized is a bit more complicated, but not by much. In order of operation:
- grab the link label’s text in its unprocessed form
- remove the opening and closing brackets from the link label
- perform a Unicode case fold (the Unicode equivalent of reducing all letters to lower case)
- strip leading and trailing whitespace
- collapse consecutive internal whitespace to a single space
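The steps above can be sketched in a few lines of Python. This is an illustration rather than the project's implementation: `normalize_label` is a hypothetical name, the function assumes it is handed the raw label including its brackets, and `str.split()` treats a slightly wider set of characters as whitespace than the specification does.

```python
def normalize_label(raw_label: str) -> str:
    """Normalize a raw link label such as '[Foo  BAR]' for matching."""
    # Remove the opening and closing brackets from the unprocessed text.
    inner = raw_label[1:-1]
    # Perform a Unicode case fold.
    folded = inner.casefold()
    # split() drops leading and trailing whitespace and breaks on runs
    # of internal whitespace; join() puts single spaces back in.
    return " ".join(folded.split())


print(normalize_label("[ Referenced \n  Document ]"))
```

Note that no backslash escapes or other inline constructs are resolved here, because the input is the unprocessed form of the label.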
Once these steps have been applied to the link label’s text, it is that text that is used to determine if it matches an existing link reference definition. In cases where the link label is `[foo]` or `[referenced document]`, this process may seem weird or superfluous. But, in the case of example 553:
[bar][foo\!]
[foo!]: /url
the parsed inline text is equivalent, but the link is not interpreted as a full reference link. This is because the normalized text for the reference link is `foo\!` while the normalized text for the link reference definition is `foo!`. While both of these strings will be equivalent to `foo!` after applying inline parsing, their normalized values do not match, and as such, the above example is rendered in HTML as:
<p>[bar][foo!]</p>
What Implementation Problems Did I Have?
Between the work previously done for inline links and link reference definitions, most of the required foundation work was already in place when I started with reference links. As I required a simple implementation of shortcut reference links to properly test link reference definitions, it was only the introduction of link text for full reference links that required any real block of new code.
Even given that solid, tested foundation, there were two issues that gave me trouble as I implemented reference links: using the correct text to determine matching and the order of precedence between different types of reference links.
Because I added the parsing for shortcut reference links as part of the work for link reference definitions, I naturally did the bare minimum required to get it working. For the link label matching algorithm, it was a simple `a == b` comparison, which worked nicely for all the link reference definition examples except two. To get it working with those two examples, both dealing with case insensitivity, I changed the comparison to `a.lower() == b.lower()` and then both examples parsed correctly.
When I reached the reference link examples that dealt with link label matching, things got a bit more tricky, but not too tricky. Use `.casefold()` instead of `.lower()`? No problem. Stripping various forms of whitespace from the link label? No problem. Using the right text as a basis for those transformations? That took a bit of work.
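For readers wondering why `.lower()` is not enough, the short Python sketch below compares the three matching strategies mentioned above. The function names are hypothetical and the labels are my own illustration; the interesting case is the capital sharp s (U+1E9E), which lower-cases to a single character but case-folds to two.

```python
def matches_exact(a: str, b: str) -> bool:
    return a == b


def matches_lower(a: str, b: str) -> bool:
    return a.lower() == b.lower()


def matches_casefold(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()


label_in_link = "\u1e9e"       # capital sharp s
label_in_definition = "SS"

print(matches_exact(label_in_link, label_in_definition))     # False
print(matches_lower(label_in_link, label_in_definition))     # False
print(matches_casefold(label_in_link, label_in_definition))  # True
```

For plain ASCII labels such as `Foo` and `FOO`, the last two comparisons agree; they only diverge for characters whose case folding differs from their lower-case form.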
Given that link processing is handled in the inline processing phase, the easiest solution was to add parallel text handling. My thinking was that since the inline processing is constrained to a single continuous grouping of text, I just needed something simple that would only exist for the lifetime of that grouping. To accomplish this, I used the variable `current_string_unresolved` to keep track of a raw, unresolved version of the string that was being processed in the variable `current_string`.
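The idea of parallel text handling can be sketched as follows. Only the two variable names come from the description above; the `collect_text` function, its loop, and the decision to handle nothing but backslash escapes are assumptions made for this illustration, and the real inline processor deals with many more constructs.

```python
import string


def collect_text(source: str):
    """Build resolved and unresolved copies of the same run of text."""
    current_string = ""             # text after backslash escapes are applied
    current_string_unresolved = ""  # the same text exactly as it was written
    index = 0
    while index < len(source):
        char = source[index]
        next_char = source[index + 1 : index + 2]
        if char == "\\" and next_char and next_char in string.punctuation:
            # An escaped punctuation character: the resolved copy drops
            # the backslash, the unresolved copy keeps it verbatim.
            current_string += next_char
            current_string_unresolved += char + next_char
            index += 2
        else:
            current_string += char
            current_string_unresolved += char
            index += 1
    return current_string, current_string_unresolved


resolved, unresolved = collect_text("foo\\!")
print(resolved)    # foo!
print(unresolved)  # foo\!
```

With both copies at hand, the resolved text can be used for rendering while the unresolved text is the one fed into label normalization.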
While that might seem a bit of a kludge2, for me it was a simple solution that was contained to the area most affected by the issue. The other option, using an unresolved stream and then resolving it later, seemed to have too many issues to deal with in a manner that I was sure was going to cover all the edge cases. I know this because I tried that first (and second and third), before sitting back and rethinking what the best approach would be from a high level. After three attempts with the unresolved stream and no success, the kludge solution worked on the first try with no issues. Guess it really was not a kludge!
Having found a decent solution for using the right text to match against, the only issue left was the order of precedence of reference links, both among themselves and against other elements. The examples from example 572 to example 579 give very specific guidance on the precedence to use for each combination. While the examples provided good guidance, the implementation was not always so easy to get right.
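To give a feel for the kind of decisions involved, here is a heavily simplified Python sketch of the precedence between full, collapsed, and shortcut reference links for a run of consecutive bracketed labels such as `[foo][bar][baz]`. It is my own reduction of the rules, not the project's algorithm: `resolve_label_run` is a hypothetical function, the labels are assumed to be normalized already, and inline links, nested brackets, and inline processing are all ignored.

```python
def resolve_label_run(labels, definitions):
    """Resolve consecutive bracketed labels, e.g. [foo][bar][baz].

    `labels` holds the text of each bracket pair in order ("" for `[]`).
    `definitions` maps already-normalized labels to URLs.
    Returns ("link", text, url) and ("text", literal) tuples.
    """
    result = []
    index = 0
    while index < len(labels):
        text = labels[index]
        following = labels[index + 1] if index + 1 < len(labels) else None
        if following is None:
            # Nothing follows, so only a shortcut reference is possible.
            url, consumed = definitions.get(text), 1
        elif following == "":
            # Collapsed form `[text][]`: the text doubles as the label.
            url, consumed = definitions.get(text), 2
        else:
            # A label follows, so only a full reference is tried; the
            # shortcut form is not allowed when a link label follows.
            url = definitions.get(following)
            consumed = 2 if url is not None else 1
        if url is not None:
            result.append(("link", text, url))
        else:
            literal = "".join(f"[{label}]" for label in labels[index:index + consumed])
            result.append(("text", literal))
        index += consumed
    return result


print(resolve_label_run(["foo", "bar", "baz"], {"baz": "/url1", "foo": "/url2"}))
```

In that call, `[foo]` stays as text even though `foo` has a definition, because it is followed by another link label, and `[bar][baz]` then becomes a full reference link to `/url1`.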
While getting the order in which to check for the various types of reference links took a bit, it was complicated by the determination of whether a given reference link referenced a valid link reference definition. Out of the 4 days it took for me to complete reference links, one of those days was spent just going over all of the combinations, making sure that the specification detailed them (it did!), and then checking and rechecking each modification that made another of the above examples work. It was only then that I staged my changes and moved on to the next example. In the end, it was very tedious work, but it was worthwhile because it worked.
What Was My Experience So Far?
For the most part, the implementation of the reference links was a good experience, combining a bit of “reuse foundation” work with some “how do I get this to work properly” challenges. There was just one dark spot on the implementation of the feature.
I have already mentioned my issues with understanding the link specification in the above sections, but I feel the topic is important enough to warrant more discussion. Unlike before where I was skipping parts of the specification, this challenge was a genuine case of me reading the specification and not getting a good understanding of it. Even with the provided examples, there were still times where I was unable to comprehend what the specification was doing. It was only after going “old school” and getting out a pencil and some paper that I was able to work through it.
At one point, I remember thinking to myself:
Writing parsers for 30 years and you still need a pencil and paper?
It was not one of the brightest moments of the project, but it did happen. It was after what I think was the 7th or 8th time of me trying to understand the logic for preventing a link within a link, as detailed in the section on reference links. Honestly, I am guessing it was the 7th or 8th time; I lost count of the number of attempts. For whatever reason, it just was not clicking for me. It is times like those that I like to break a problem down to smaller components, what I refer to as the “boulders to pebbles” approach. And for some reason, I needed to break those pebbles down even smaller, and I was very hard on myself for it.
No matter who you are, you are going to have good days, bad days, and a lot of in between days. The more that you take care of yourself, the better your chance of being on the positive end of that spectrum. I did not need a pencil and paper because I wasn’t capable of figuring out the problem myself, I needed them as a tool to help me figure it out myself on that day. I now look back at the problem and my scribbles and have the mental capacity to understand it was just a bad day. And even if I am being more charitable about the day than it really was, so what? I used the tools that I needed to in order to solve the problem I faced. Simple as that.
I started this project with a need and a desire to learn. I completed this feature learning that I still have what it takes to solve the issues that I need to. I also learned that I need to focus a bit more on taking better care of myself and nudging myself towards the positive end of that spectrum. Guess what? I stumble, I learn, and I get up and try and not do the same thing again. Well… not too often at least.
What is Next?
While I know there is some “optional” stuff I need to add to the parser before I can use it on my own website, there is only one more feature that I need to complete before the parser is complete and GFM compliant. With a bit of a mental drum roll… image links are next!
1. As link reference definitions are parsed with the leaf blocks, and reference links are parsed later with inline processing, the term “previously defined” refers to any definition in the Markdown document that was parsed, and not “previously defined” with respect to the relative locations of the reference links and the definition within the Markdown document.
2. According to Merriam-Webster: “a haphazard or makeshift solution to a problem and especially to a computer or programming problem”