Markdown Linter - Starting Inline Processing

Introduction

Maybe it is just me, but I love the feeling of completing a good round of refactoring where I really get to focus on making sure that the foundations of the project are stable. If it helps any readers, I imagine it as a spa day for your project where the project just gets some personalized attention and cleaning up. While the project is not in a perfectly clean state, I know that I performed a decent amount of tidying up in that direction, work that will help the project as it grows.

With the project cleaned up, and with the new changes to make the text blocks continuous, it was time to start on the inline processing. The first three inline elements to be implemented were the first three elements in the GitHub Flavored Markdown (GFM) Specification: backslashes, character references, and code spans. These elements allow Markdown to escape certain characters, replace a text sequence with a single Unicode character, or indicate that some text is literal code. Each of these elements has its own special use, and all three are used very frequently when writing Markdown documents. And if those reasons were not good enough, they just happen to come first in the specification’s inline processing section.

What Is the Audience For This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 08 February 2020 and 14 February 2020.

Inline Processing

In Markdown processing, there are two main types of processing that occur: processing to sort the text into blocks and the processing of the contents for those blocks. Courtesy of the specification, another good explanation is as follows:

We can think of a document as a sequence of blocks — structural elements like paragraphs, block quotations, lists, headings, rules, and code blocks. Some blocks (like block quotes and list items) contain other blocks; others (like headings and paragraphs) contain inline content — text, links, emphasized text, images, code spans, and so on.

While it was not readily apparent on my first read of the specification, inline processing occurs only on the content of leaf blocks that do not strictly govern their content. As code blocks contain the literal content for their output and HTML blocks contain the literal HTML content for their output, inline processing is not applied to those blocks. Inline processing is applied to the content of the remaining blocks, the heading blocks and the paragraph blocks, which just happen to be the most frequently used blocks in most Markdown documents.

Backslash Escapes

Having completed most of the processing required for the leaf blocks and container blocks, it was time to move on to the inline processing of the content within those blocks. The first of the inline processes to be worked on: the backslash escapes.

For readers familiar with backslashes in modern programming languages, Markdown’s usage of backslashes is similar, but with a twist. In modern programming languages, a backslash character is used in strings to escape the character following the backslash, using that next character to denote a special character. For each special character to be represented, a distinct backslash escape sequence is used to represent it. For example, most languages include the escape sequence \n for a line feed or end-of-line character. This backslash escape is used so often that many programmers use the terms “slash-en” or “backslash-en” instead of referring to the \n character sequence as the new-line character it represents.

The twist that I mentioned earlier is that Markdown, unlike programming languages, uses backslash escapes to only escape the following ASCII punctuation characters with themselves:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

For example, the string \! will emit the sequence ‘!’, but the string \a will emit the sequence \a. Coming from a programming background, this took a bit of getting used to, but it makes sense. As Markdown is used to create a human readable document, authors should not be concerning themselves with control characters, but with how the content is organized. From that point of view, escaping the punctuation characters used to denote organization makes a lot of sense.

It then follows that each processing character is included in that string of characters, and that the most prominent use of backslash escapes in Markdown is to avoid block and inline processing. Because a backslash escaped character is emitted as part of the backslash processing in the parser, any other processing of that character by the parser is effectively short-circuited. This simply allows the punctuation character to be represented without the parser mistaking it for any kind of processing instruction.

For example, to include the text & in your document, the sequence \& can be used to escape the ‘&’ character.¹ Another example is that the following text:

\- this is a paragraph

will generate the text - this is a paragraph as part of a paragraph, instead of creating a list item containing the paragraph this is a paragraph. In both cases, the backslash escapes are used to tell the parser to just treat the escaped character as itself and not to perform any further processing. As useful as that is, backslash escapes cannot be used in code blocks, which have been covered previously, code spans, which are covered later in this article, or autolinks and raw HTML, which are covered in a future article.

Implementing support for backslash escapes was simple, as it just required a change in how the characters were interpreted. As the text was still contained within a single text block, it was just a matter of making sure the right characters were emitted. The processing itself was straightforward:

is the next character a backslash?
- if not, emit that character and resume normal processing
- if so, check to see what character follows
  - if that character is not in the escape list above, emit the backslash and resume normal processing
  - if it is in the list, consume that character and emit it in place of the two-character sequence

Basically, if there is a valid backslash sequence, consume the second character and emit it; otherwise, emit the first character (the backslash character) and continue. The limits on where backslashes can be used were easy to implement, as there were only a few places where they were not allowed.
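As a minimal sketch of the steps above, and not the code that is actually in the project, the processing can be written as a single pass over the text. The function name `process_backslashes` is hypothetical. The sketch also assumes that the text has already been confirmed to be in a block where escapes are allowed, and it leaves out the specification's hard line break rule for a backslash at the end of a line.

```python
import string

# The escapable set in the specification is the ASCII punctuation
# characters, which is the same set as Python's string.punctuation.
ESCAPABLE = set(string.punctuation)


def process_backslashes(text):
    """Return text with each valid backslash escape replaced by its character."""
    output = []
    index = 0
    while index < len(text):
        current = text[index]
        if current != "\\":
            # Not a backslash: emit the character unchanged.
            output.append(current)
            index += 1
        elif index + 1 < len(text) and text[index + 1] in ESCAPABLE:
            # Valid escape: consume both characters, emit only the second.
            output.append(text[index + 1])
            index += 2
        else:
            # Backslash before a non-escapable character, or at the very
            # end of the text: emit the backslash itself and continue.
            output.append(current)
            index += 1
    return "".join(output)
```

With this sketch, `\!` becomes `!`, while `\a` is passed through untouched because `a` is not in the escape list.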

Character References

Character references are an important part of HTML, and as such, Markdown has solid support for them built in. Character references allow for the inclusion of special characters into the document, without the need to rely on the file editor to support Unicode characters. In addition, the document’s writability and readability are often enhanced by presenting the reader with the text &copy; instead of the symbol ‘©’.

Think about it this way. As a document author, you want to add the copyright symbol to your document. Where do you find it on your keyboard? If it is not there, what is the clearest and easiest way to add it to the document that is not tied to a specific editor? Markdown addresses this issue by reusing the HTML5 method of specifying character references.

Each character reference starts with the ‘&’ character and ends with the ‘;’ character, with the characters in between denoting the type of reference and the character being referenced. Named character entity references are the easiest to read, as they contain some form of the name of the character they represent, such as &copy; for the copyright symbol. The full list of named character references that are supported is at the HTML5 entity names document.² As an alternative to the &copy; named reference, the equivalent numeric references &#169; or &#x00A9; may be used instead. While the result on the rendered page is the same, I feel that the named references are more readable than the numeric references. However, in cases where there is no named reference for a given Unicode character, the numeric references are very handy.

Like the way in which backslash escapes are handled, there are certain blocks that the character references cannot be used in. In particular, they are not recognized in code blocks and code spans but are recognized in most other locations. For example³, given the following Markdown text:

``` f&ouml;&ouml;
f&ouml;&ouml;
```

the character references in the fenced block info string are recognized, but the character references within the code block are not recognized. As such, after translating this Markdown into HTML, the following HTML is expected:

<pre><code class="language-föö">f&ouml;&ouml;
</code></pre>

In the example, as expected, the character references that feed the class attribute for the code tag were translated, while the character references within the bounds of the code tag, which are used to denote a code block, are left alone.

Similar to my experience in processing the backslashes, all three types of character references (named, decimal numeric, and hexadecimal numeric) were processed in roughly the same manner. Instead of a single character to look for, as with backslash escapes, character references have a set of allowable character sequences, but otherwise the processing is the same. Once again, the processing was simple: just follow a few well-defined rules.
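To make those rules a bit more concrete, here is a rough sketch of resolving the three kinds of references. It is an illustration under stated assumptions rather than the project's code: the function name `resolve_character_references` is hypothetical, Python's built-in `html.entities.html5` table stands in for the named entity list, and the digit limits and the use of U+FFFD for invalid code points follow my reading of the GFM specification.

```python
import html.entities
import re

# Assumption: the standard library's HTML5 table is used as a stand-in for
# the named entity list. Its keys look like "copy;" with no leading "&".
NAMED = html.entities.html5

# One pattern, three alternatives: hexadecimal, decimal, and named.
REFERENCE = re.compile(
    r"&(?:#[xX]([0-9a-fA-F]{1,6})|#([0-9]{1,7})|([A-Za-z][A-Za-z0-9]*));"
)


def _replace(match):
    hex_digits, decimal_digits, name = match.groups()
    if name is not None:
        # Unknown names are left exactly as they were written.
        return NAMED.get(name + ";", match.group(0))
    if hex_digits is not None:
        code_point = int(hex_digits, 16)
    else:
        code_point = int(decimal_digits)
    # Zero, surrogates, and out-of-range values are treated as invalid
    # and become the Unicode replacement character.
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    if code_point == 0 or code_point > 0x10FFFF or is_surrogate:
        return "\ufffd"
    return chr(code_point)


def resolve_character_references(text):
    """Replace every valid character reference in text with its character."""
    return REFERENCE.sub(_replace, text)
```

The named and numeric forms all go through the same substitution, which mirrors how similar the three cases are: only the lookup differs.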

However, while it was not particularly difficult, determining the proper handling of the entities.json file used as a reference for HTML named entities took a bit of thinking to get right. The main decision was whether to download it each time, cache it somewhere once downloaded, or just do a “one-time” include of it into the project as a resource. In the end, I decided to take the latter path, placing the file in the pymarkdown/resources/ directory. My assumption is that the file does not change that often, perhaps once a month at its worst. As I added the file exactly as it was downloaded from the source at the HTML5 home page, I believe I can check on it from time to time, updating the file when required. With that decision made, I just needed to do some research on the best way to include resources into a project, and the rest was once again just following well documented instructions.
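For illustration only, a small loader for such a bundled file might look like the following. The function names are hypothetical and this is not necessarily how the project reads the resource. It assumes the published layout of entities.json, where each key is the full reference text (such as `&copy;`) and each value holds a `characters` string, and it drops the legacy names that lack a trailing semicolon, since the specification requires the semicolon.

```python
import json
import os


def default_resource_path(package_directory):
    """Build the path to the bundled file under the package's resources directory."""
    return os.path.join(package_directory, "resources", "entities.json")


def load_named_entities(resource_path):
    """Load the entity table, keeping only names that end with a semicolon."""
    with open(resource_path, encoding="utf-8") as resource_file:
        raw_entities = json.load(resource_file)
    return {
        name: details["characters"]
        for name, details in raw_entities.items()
        if name.endswith(";")
    }
```

Loading the table once at startup keeps the lookup itself down to a dictionary access.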

Code Spans

Code spans are like code blocks, in that they both protect the characters that are within their confines. However, while code blocks are designed to protect multiple lines of text, such as source code examples, code spans are designed to protect text within a single paragraph. To create a code span, the text to be protected is simply surrounded by one or more backtick (‘`’) characters on each side, making sure that the number of starting backticks and the number of closing backticks are the same.

As a simple example, the Markdown `foo` produces the text foo within a special HTML tag that has special styling associated with it. Like how code blocks protect blocks of text that are already formatted in a specific way, code spans use that styling to mark targeted text that already has meaning attached to it. In my articles, as with other blog authors that I have read, I use code spans to indicate that certain strings have literal meaning to them, such as the literal text to type in at a keyboard.

One good example of this from the previous section is the set of Markdown sequences needed to produce the copyright symbol. If I had simply added the text &copy; to the Markdown document, it would have been interpreted as a character reference, and the ‘©’ symbol would have been generated. By placing backticks around that text, such as `&copy;`, those characters are contained within a code span and the text is preserved literally. And for that last sentence where I needed to include the literal text including backticks, I just made sure to include more backticks around the text than were contained within the text, such as `` `&copy;` ``.⁴
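The matching rule, where a closing run of backticks must be exactly as long as the opening run, can be sketched as follows. This is a simplified illustration with a hypothetical function name, `find_code_span`, and not the project's implementation. It assumes the text is a single line, so it skips the specification's handling of line endings, but it does apply the rule that strips one space from each side when the content both begins and ends with a space.

```python
def find_code_span(text, start):
    """Return (content, end_index) for a code span opening at start, or None."""
    # Measure the opening run of backticks.
    run_end = start
    while run_end < len(text) and text[run_end] == "`":
        run_end += 1
    run_length = run_end - start
    if run_length == 0:
        return None

    index = run_end
    while index < len(text):
        if text[index] != "`":
            index += 1
            continue
        # Measure this candidate closing run.
        close_end = index
        while close_end < len(text) and text[close_end] == "`":
            close_end += 1
        if close_end - index == run_length:
            content = text[run_end:index]
            # Strip one space from each side, unless the content is all spaces.
            if (
                len(content) >= 2
                and content[0] == " "
                and content[-1] == " "
                and content.strip(" ")
            ):
                content = content[1:-1]
            return content, close_end
        # A run of a different length is just part of the content.
        index = close_end
    return None
```

Because shorter or longer runs inside the span are skipped over, wrapping text in more backticks than it contains is enough to keep the inner backticks literal.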

I knew that the parsing and rendering of the tokens was about to get more complex in order to properly implement the code span processing. To keep the code span, the text before it, and the text after it in the right order, I changed the inline parsing to allow for a Markdown token to be emitted. When the new code span Markdown token is emitted, the surrounding code first adds a new text block containing any text collected up until that point, emits the new token, and then resets the collected text back to the empty string. This correctly ordered the tokens and is hopefully generic enough to support similar parsing for the other inline elements.
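A rough sketch of that ordering logic is shown below. The names `parse_inline` and `single_backtick_span`, and the tuple-based token shapes, are all hypothetical stand-ins chosen for the illustration. The sketch also assumes that an empty run of collected text is simply skipped rather than emitted as an empty text block.

```python
def parse_inline(text, try_parse_token):
    """Split text into text tokens and handler-produced tokens, in order."""
    tokens = []
    collected = ""
    index = 0
    while index < len(text):
        result = try_parse_token(text, index)
        if result is None:
            # Nothing special starts here: keep collecting plain text.
            collected += text[index]
            index += 1
            continue
        token, index = result
        # Flush the text collected so far, then emit the new token.
        if collected:
            tokens.append(("text", collected))
            collected = ""
        tokens.append(token)
    if collected:
        tokens.append(("text", collected))
    return tokens


def single_backtick_span(text, index):
    """Deliberately simplified handler: one backtick on each side only."""
    if text[index] != "`":
        return None
    closing = text.find("`", index + 1)
    if closing == -1:
        return None
    return ("code-span", text[index + 1:closing]), closing + 1
```

Passing the handler in as a parameter is one way to keep the flushing logic generic, so that other inline elements could reuse the same loop.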

There were only a small number of issues with the existing scenarios that needed to be addressed, now that code spans were handled properly. Fixing those tests was simple and just required resampling the parser’s output. But during that testing, I realized I had made a mistake with the handling of one of the header blocks. When I wrote the original code for the Atx Header blocks, as documented in the article on Parsing Normal Markdown Blocks, I hadn’t thought about code spans or other more complex inline elements as part of an Atx header. As such, I wrote a simple implementation that represented the header text as a simple string within the token.

Double checking the specification, I verified that there were no restrictions on using code spans within a SetExt or Atx header block. As such, I needed to rewrite the parsing code to support having Atx header blocks contain text blocks, instead of simply including the enclosed text in the Atx Markdown token. Instead of tackling that as part of this group of code, I decided to look to see if there were any other “little” things that I missed, and I found a few of them.

Basically, of the issues that I found, most of them were small variations of the scenarios, things that just got lost in the shuffle or lost in the translation. As such, I thought it would be best to take some time, try and note them all down, and then tackle them together before continuing. As the only scenario test that was affected was example 339, I believe that temporarily skipping that test and taking the time to fix those issues was the right call. It would mean that I would have to wait a bit before I could say that code spans were done, but when they were done, I would know that I did them the right way. That was, and still is, important to me.

What Was My Experience So Far?

I usually read a specification thoroughly and identify most of the edge cases on my first pass. However, I must admit that I dropped the ball with that on this project. And to be totally honest, I do not expect that it will be the last time either. It is a big specification, and there are going to be hits and misses along the way. What matters to me is not whether I make mistakes, but whether I have enough use cases, scenarios, and tests to help me identify them. With 673 scenarios already identified in the specification, I know the scenario coverage will be good, but there will be gaps that I will still need to address. Whether it is my dropping the ball or the specification dropping the ball, the work on these three inline elements has improved my confidence that I am prepared to deal with any such issues that come up.

A good example of this is my reading of the specification around the use of Atx headers. I know I missed the part where the specification, in the preamble to example 36, says:

Contents are parsed as inlines:

In retrospect, not only is this one of the few times that inlines within Atx headers are mentioned, but there is also only one scenario that covers them, example 36. So, from one point of view, the specification could have more scenarios dealing with inlines and Atx headers. From another point of view, it was mentioned and I just missed it. From my personal point of view, it does not matter either way. What matters is that I had enough process and tools in place to catch it. And once I saw that issue, it helped me take a deeper look at some of the other tests, finding small issues with the output from those tests.

From a quality point of view, my confidence was holding steady or increasing. As I mentioned a couple of paragraphs ago, I do not expect to be perfect; I just hope to have the right tools and processes in place to help me figure out when I miss something or get something wrong. Sure, I realized that taking some time to work on fixing these issues was going to put my work on the linter on hold for another week. But my confidence that the linter was on solid footing increased because I found some issues.

For me, quality is not about being perfect; it is about movement in the right direction. And finding those issues was a step in that right direction.

What is Next?

After documenting those issues at the end of the test_markdown_list.py file, I thought it was best to do a quality pass and resolve those issues before moving on to other inline processes. As such, the next article focuses on what bugs I found in the scenario tests, and how I addressed them.

¹ Just to be complete, the character references in the next section also provide a way to include the ‘&’ character in Markdown: write &amp; instead of \&. While both produce identical output, I prefer the first for its clarity. Your mileage may vary.
² To keep things simple for parsers, this file is maintained as a JSON file that is easily interpreted with a small amount of code in most current languages.
³ Note that this example is a slightly modified version of example 330 from the GFM specification.
⁴ For a good example of this, see example 339 in the GFM specification.
