Introduction

Having just implemented the Inline Emphasis feature as documented in my last article, I was eager to move forward with the implementation of the links feature. As the group of features under links were the only ones separating me from the completion of the project’s parser, I was happy when I noticed that I was starting to think “when the parser is done” instead of “if the parser is EVER done”. I have been working on this project for a while and it was nice to know that in my mind, I could “see the light” with respect to this project.

In implementing the algorithm outlined in the section Phase 2: inline structure, I chose to implement the emphasis feature first, leaving the implementation of links until the base algorithm and emphasis feature were implemented and tested. Seeing as both of those were accomplished, I felt that it was the right time to apply my success in implementing the emphasis feature to the link feature. I knew that the work on the delimiter stack would easily carry over, but I was eager to see if the implementation of the next part of links would be as easy as the implementation of emphasis.

What Is the Audience For This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commit from 17 March 2020.

Taking a quick peek ahead, I observed that the link feature in the specification is broken down into four groups: inline links, link reference definitions, reference links, and image links. Based on a quick reading of each section, it was obvious to me that for reference links to work properly, link reference definitions would be needed. Image links are variations on the inline link type and reference link type, the only difference being a different start delimiter. As such, I did not see any benefit to implementing image links before those other two link types are completed. This leaves inline links. Inline links are self-contained, allowing them to be implemented separately from the other three groups. Being somewhat impatient to get the parser done, I decided to go for the sub-feature that was more immediately available: inline links.

When writing my articles and documentation, I personally find it more readable to include any links in my documents as a complete unit. This means that when I add a link, I will typically add it in the form shown in the GFM specification at example 494:

[link](/uri)

As a concrete example of this, the above link to example 494 that precedes the sample link format was created with the following Markdown:

[example 494](https://github.github.com/gfm/#example-494)

While the form can be augmented as such:

[link](/uri "title")

to provide a title, I cannot remember a case where I have used that form. While I do not have a strong reason for or against this format, I believe that I just have not encountered a case where I believe that a title for the link was either desired or required. On the other hand, I have used the following form a few times before:

[link](</my uri>)

as an alternative to the form:

[link](my uri)

While my use of the angle bracket form is rare, it was useful in a couple of cases where I needed to provide a space character as part of a URI. While different Markdown-to-HTML processors will handle the space character differently, I just wanted something that read well and was mostly universal.

For a well-documented answer to the question of what an inline link is, please look at the inline link section of the GFM specification for a word-for-word answer, complete with helpful examples. My own summarization of this section is as follows:

  • an inline link occurs on a single line and is comprised of the link text, a left parenthesis, an optional link destination followed by optional whitespace and an optional link title, and a right parenthesis
  • if in doubt about any punctuation characters in the below constructs, backslash escape them
  • the link text is any text that appears between the opening square brackets ([) and the closing square brackets (])
  • the link destination is any sequence of non-space, non-control characters, with special rules for the <, (, and ) characters
  • the link title, if included, is contained within a single-quoted string ('), double-quoted string ("), or a parenthesized string (( and ))

After 5 revisions to try and keep my answer “minimal”, that is it! While I could leave it in a more complicated state, that summary is what I believe I have it broken down to in my head. For me, the first two points are the pivotal ones, setting up the link component order and exclusions needed to create valid inline links. The last three rules are just simplifications of what is needed to represent each of the three components, rounding out the definition for inline links.

Keeping to those rules, when I am picturing Markdown links in my head, I usually think of these two forms:

[link](/uri "title")

and:

[link](/uri)

Between the 5 rules stated above and these two examples, I believe I am keeping it simple and minimal, ensuring my consistent use of links. Let me dive into those a bit more.

In terms of the order of components, the link text and both outer parentheses are required for the inline links, but both the link destination and link title are optional. However, due to how they are documented, if a link title is desired, a link destination must precede it. Basically, the component order is always: the text that shows up inside of the link’s anchor, the link URI itself, and a title to use when the link is traversed. Once again, I keep it simple to make it easy to use.
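To make that component order concrete, here is a minimal Python sketch. It is not the project's code: the `InlineLink` type, the `parse_simple_inline_link` function, and the regular expression are all names and choices made up for this illustration. It only covers the simplified, single-line forms from the five rules above, so nested brackets, backslash escapes, and parenthesized titles are deliberately left out.

```python
import re
from typing import NamedTuple, Optional


class InlineLink(NamedTuple):
    """Hypothetical container for the three components of an inline link."""

    text: str
    destination: str
    title: Optional[str]


# Simplified pattern (an assumption, not the specification's full grammar).
# The title group sits inside the destination group, so a title can only
# appear when a destination precedes it.
_SIMPLE_LINK = re.compile(
    r"""
    \[(?P<text>[^\[\]]*)\]        # link text between square brackets
    \(\s*                         # opening parenthesis
    (?:
        (?:<(?P<angle>[^<>\n]*)>  # destination in angle brackets, spaces allowed
          |(?P<bare>[^\s()<>]+))  # or a bare destination with no spaces
        (?:\s+(?P<quote>["'])(?P<title>.*?)(?P=quote))?  # optional quoted title
    )?
    \s*\)                         # closing parenthesis
    """,
    re.VERBOSE,
)


def parse_simple_inline_link(source: str) -> Optional[InlineLink]:
    """Return the link components, or None if the text is not a simple link."""
    match = _SIMPLE_LINK.fullmatch(source)
    if match is None:
        return None
    destination = match.group("angle")
    if destination is None:
        # No angle-bracket form: use the bare form, or "" for "[link]()".
        destination = match.group("bare") or ""
    return InlineLink(match.group("text"), destination, match.group("title"))


print(parse_simple_inline_link('[link](/uri "title")'))
print(parse_simple_inline_link("[link](</my uri>)"))
print(parse_simple_inline_link("[link](my uri)"))
```

In this sketch, the `[link](my uri)` form is rejected because a bare destination cannot contain a space, which is the reason the angle bracket form exists.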

For remembering when to use punctuation characters and backslashes to escape them, I once again try to keep it simple. The second rule is the embodiment of that. My definition of “if in doubt”, as stated in the second rule, is that if I have to think “is this punctuation part of the link or not?”, I have doubt. Therefore, if I am authoring a link and am not sure if I should backslash escape punctuation within a link, I escape it.
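The “if in doubt, escape it” habit can also be expressed as a small helper. This is only a sketch under my own assumptions: the `escape_link_punctuation` name and the choice of which characters count as “doubtful” are hypothetical, picked because those characters act as delimiters in the link forms shown earlier.

```python
# Characters that delimit parts of an inline link, plus the backslash itself.
# This set is an assumption for illustration, not a list from the specification.
_DOUBTFUL_CHARACTERS = frozenset("[]()<>\"'\\")


def escape_link_punctuation(text: str) -> str:
    """Backslash-escape every character that could be mistaken for link syntax."""
    return "".join(
        "\\" + character if character in _DOUBTFUL_CHARACTERS else character
        for character in text
    )


print(escape_link_punctuation("see [draft] (v2)"))
```

The output of that call is `see \[draft\] \(v2\)`, which can then be dropped into link text without a second thought about where the link text ends.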

Those first two rules are specifically tailored for me and to how I write my articles. My primary goal in coming up with those rules is to allow me to author rough drafts of my articles as I go, links included. During the rough draft phase, as efficiently as possible, I need to leave enough information in the added links to allow me to finish each link in subsequent passes through the document. While the “as I go” element is not present in the subsequent passes when I clean up the link, I do need to make sure that I keep those passes as efficient as possible. By keeping those rules simple, I reduce the amount of friction incurred by adding links to the documents, therefore keeping those passes efficient.

An additional benefit to using those rules is the simplification of the specification’s acceptable rules. Specifically, the wording of my first two points helps me avoid a lot of the weird cases included in the 41 test cases for inline links. The two most frequent reasons for the examples containing weird cases are the inclusion of newline characters and the inclusion of extra punctuation characters. The “single line” part of the first rule helps me avoid any of those newline cases, and the “if in doubt” part of the second rule helps me avoid any of the cases with extra punctuation.

My conservative estimate is that by adding those extra words to my personal rules, I was able to reduce the number of “applicable” cases for links that I am authoring by half, if not a bit more. While that might sound like a weird statement to make, for me it means that when I am authoring a document and adding a link, I can keep my focus on what I am adding to the document, and not focus on trying to remember all of the weird cases for links and how to avoid them. For me, that is a plus.

Implementing the Algorithm

Having constructed and tested the delimiter stack and emphasis parts of the algorithm, as documented in the last article on Inline Emphasis, it was time to implement the look for link or image part of that algorithm. Once again, in an effort to keep things simple, I added very minimal support for image links (as detailed in the algorithm), but that support also added assert False statements to ensure that any scenarios with images were clearly identifiable and not executable. This helped to prevent me from accidentally working on testing image link features before adding the real support for them in a subsequent feature.
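As a rough illustration of that step, the sketch below shows the part of the look for link or image logic that decides whether a closing bracket has a usable opener. It is a simplified sketch and not the project's implementation: the `BracketDelimiter` class and both function names are hypothetical, and the stack here is assumed to hold only bracket delimiters, whereas the real delimiter stack also carries the emphasis delimiters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BracketDelimiter:
    """Hypothetical delimiter-stack entry for a '[' or '![' delimiter."""

    position: int           # index of the delimiter in the source text
    is_image: bool = False  # True for '![', False for '['
    active: bool = True     # cleared for earlier '[' once a link is formed


def find_opener(stack: List[BracketDelimiter]) -> Optional[BracketDelimiter]:
    """Handle a ']' by taking the most recent bracket opener off the stack.

    Returns the opener when a link may be attempted, or None when the ']'
    should simply be emitted as literal text.
    """
    if not stack:
        return None
    opener = stack.pop()
    if not opener.active:
        return None
    if opener.is_image:
        # Image links come in a later feature, so fail loudly instead of
        # quietly producing output that has never been tested.
        assert False, "image links are not implemented yet"
    return opener


def deactivate_earlier_openers(stack: List[BracketDelimiter]) -> None:
    """After forming a link, mark remaining '[' openers inactive (no nested links)."""
    for delimiter in stack:
        if not delimiter.is_image:
            delimiter.active = False


# Walk through "[foo [bar](/uri)](/uri)": the inner link wins, the outer does not.
stack = [BracketDelimiter(position=0), BracketDelimiter(position=5)]
inner = find_opener(stack)          # opener at position 5, link may be formed
deactivate_earlier_openers(stack)   # outer '[' can no longer open a link
outer = find_opener(stack)          # None: the second ']' is literal text
print(inner, outer)
```

The `assert False` guard is the interesting part here: any test scenario that reaches the image path stops immediately, which keeps unfinished support from being mistaken for working support.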

While the link part of the algorithm does not have as many special cases as the 17 rules for emphasis, there are a decent number of elements to keep track of when implementing the links portion of the algorithm. I found that by following the examples and the algorithm as stated, I was able to quickly isolate the changes for each example. This isolation allowed me to cleanly implement each small change with a clear idea of what needed to be accomplished. As I am human, there were a few issues I initially had in following various portions of the algorithm. In all those cases, a quick re-read of the section helped me get the proper information I needed to implement that portion of the links.

Out of the 41 total examples for inline links, only the first 6 are ones that I would consider normal examples, with the remaining 35 testing special cases and boundary conditions. As such, I started with the first example, example 493, and added the code necessary to do a simple parsing of each of the components of that example: link text, link destination, link title, link format separators and link whitespace. Then I simply started working my way down the list of examples, refining each implementation with each example that I worked on. With each example that I cleared, the implementation visibly got closer to its final form.

I am not too proud to admit that on my first reading of a lot of the examples, I questioned what their worth to the feature was. However, as I worked down the list of examples, the questioning changed into enjoyment. Each new example added a small layer of complexity to the implementation, like a piece of a puzzle cleanly fitting into place. Even concepts that I worried about, like new lines and backslashes, were given enough examples to clearly and properly demonstrate how they worked. There were a couple of times where I questioned whether an example was really needed, but those instances were few and far between.

Where I Had Problems

The addition of inline links to the project went without too many issues. The most prominent of these were the proper encoding of special characters in the links and the interactions between links and emphasis.

In terms of the special characters, the main problem that I had was in selecting an interpretation of the term “special characters” that was consistent with the GFM specification, and hence the CommonMark implementation. The first group of characters, the ones to replace with named HTML entities, is small, so examples such as example 514 and example 517 were easy to implement and get right on the first try. From an HTML author’s point of view, it was obvious that the following Markdown from example 517:

[link](/url 'title "and" title')

should produce the following HTML:

<p><a href="/url" title="title &quot;and&quot; title">link</a></p>
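A minimal way to reproduce that title escaping in Python is the standard library's `html.escape` function. This is only a sketch: the `render_link` helper is hypothetical, and I am not claiming the project does it this way. One caveat worth knowing is that `html.escape` also rewrites a single quote as `&#x27;`, so a real implementation that has to match the specification's output character for character may need a narrower replacement.

```python
import html
from typing import Optional


def render_link(text: str, destination: str, title: Optional[str] = None) -> str:
    """Render a paragraph containing one anchor, escaping attribute values."""
    attributes = f'href="{html.escape(destination, quote=True)}"'
    if title is not None:
        # quote=True turns the double quotes in the title into &quot;
        attributes += f' title="{html.escape(title, quote=True)}"'
    return f"<p><a {attributes}>{html.escape(text, quote=False)}</a></p>"


print(render_link("link", "/url", 'title "and" title'))
```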

Taking that a bit further, the next group was still simple. Any characters that are represented by more than 7 bits needed Unicode encoding, which I also thought was obvious. Once again, from an HTML author’s point of view, the following Markdown from example 511:

[link](foo%20b&auml;)

would obviously translate into the following HTML:

<p><a href="foo%20b%C3%A4">link</a></p>

Verifying this was correct was easy. I had to look at the project’s entities.json file for the information on the &auml; symbol. From there I quickly verified that it is represented by the Unicode sequence \u00E4, which becomes the sequence %C3%A4 when encoded with utf-8. This was all done with Python’s urllib module, specifically with the urllib.parse.quote function, and it got all of these right on the first try.
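That verification can be repeated in a few lines. In this sketch the standard library's `html.unescape` stands in for the lookup in the project's entities.json file, which is an assumption made purely to keep the example self-contained, and `percent_encode_entity` is a hypothetical helper name.

```python
import html
from urllib.parse import quote


def percent_encode_entity(entity: str) -> str:
    """Resolve a named HTML entity and percent-encode its UTF-8 bytes."""
    character = html.unescape(entity)  # "&auml;" becomes "\u00e4"
    return quote(character)            # quote() encodes text as UTF-8 by default


character = html.unescape("&auml;")
print(repr(character), character.encode("utf-8"), percent_encode_entity("&auml;"))
```

The two bytes of the UTF-8 encoding, C3 and A4, are exactly the two percent-encoded groups that show up in the expected HTML.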

The issue came with the proper encoding of characters with an ordinal value below 127 (or 7F hex) that were not control characters and not alphanumeric characters. By default, the only punctuation character that the urllib.parse.quote function treats as safe through its safe parameter is the / character; apart from letters and digits, only the _, ., -, and ~ characters are never quoted. When the parser emitted the HTML for many of those remaining characters, it replaced the actual character with the percent-form of that character. While both approaches are usually syntactically equivalent, the comparison failed because the URI was not the same as the example’s output. It was only over the course of a couple of examples that the safe list of punctuation characters was constructed.
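The effect of the safe list is easy to see with a small experiment. The `EXAMPLE_SAFE_CHARACTERS` value below is a hypothetical list chosen for this illustration; it is not the list that the project ended up with.

```python
from urllib.parse import quote

# Hypothetical safe list for illustration only. Including "%" keeps sequences
# that are already percent-encoded from being encoded a second time.
EXAMPLE_SAFE_CHARACTERS = "/%?=&#():"

destination = "/my(uri)?a=b#frag"

# With the default safe list, only "/" survives among these punctuation characters.
default_result = quote(destination)

# With the wider safe list, the destination passes through unchanged.
widened_result = quote(destination, safe=EXAMPLE_SAFE_CHARACTERS)

print(default_result)
print(widened_result)
print(quote("foo%20b\u00e4"))
print(quote("foo%20b\u00e4", safe=EXAMPLE_SAFE_CHARACTERS))
```

The last two lines show why the % character matters: without it in the safe list, the already encoded `%20` from example 511 becomes `%2520`, which no longer matches the expected output.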

The second group of issues came around the interaction between normal inline processing and the processing used for links. In cases such as the following example:

[link *foo **bar** `#`*](/uri)

all the inline processing occurs within the link text section and is pretty unambiguous as to what the intent is. However, in the cases of example 529 to example 534, it is not as clear what the intent of the author was. In the case of example 531:

[foo *bar](baz*)

it is not clear to me at all what the author would have intended with this Markdown. Can you figure it out? The good news is that the GFM specification is clear on the precedence of each of the inline processes, but even still, it took me a bit to get that precedence properly implemented.

What Was My Experience So Far?

Except for adding a metadata feature and a table of contents feature, both of which are unofficial extensions, every other feature that I normally use when writing articles is now implemented and tested. Having hit that personal milestone, it is a good feeling knowing that I am THAT close to being able to run PyMarkdown against my own articles to lint them. The only other core feature that I sometimes use is the image link feature, and I know that is just a couple of features away.

In terms of the feature implementation and testing for this feature, the issues I had were either caused by my misreading of the specification or caused by trying to skip forward in the examples, and not following the example order set out in the GFM specification. As I have mentioned numerous times in this section of past articles, the GFM specification is well thought out and battle tested from many implementations. While I do recognize that and heap praise on them, at the same time I seem to think that I either know better than they do or know where they are going, hence my skips forward. I need to stop that, as it seems to keep getting me in trouble.

Except for a couple of “didn’t I learn this already?” moments, things went fine. My guess is that I will learn to trust this specification properly just before I finish the last example. Go figure. In all seriousness, I am going to try and put more effort into following the specification and its examples in their proper order for the next feature, and try and get it locked in. If I have a great specification, I need to learn to lean into it for support, and not fight against it.

At the start of this article, I expressed an interest in seeing if the implementation of inline links would be as easy as the implementation for emphasis. While there are differences, namely the 17 rules around emphasis versus the search for a complete inline link after the link label itself has been found, I am convinced that the effort required to implement each of them was pretty similar. As a bonus, having taken a quick look at all of the features in the link group, I am pretty sure that the work for inline links will be heavily leveraged to complete the link group itself.

What is Next?

Emphasis and link support and delimiter stack? Done and tested. Inline support for links? Done and tested. Before going on to reference links and image links, it only made sense to do link reference definitions next, so that is what I implemented next, though not without a lot of difficulties.

