Having completed another refactoring session, I am confident that the PyMarkdown project code has been returned to what I feel is a healthy amount of technical debt. After taking a deep breath and a good look at the features left to implement, I realized that the list is now decently short: emphasis, links, autolinks, raw HTML, and line breaks. Rechecking the section on inline structure in the GitHub Flavored Markdown (GFM) specification, it is hard to miss the fact that the emphasis and link elements have their own “here is a suggested approach to implementation” section, while the implementation for parsing the other elements is left up to the reader. Taking that as a hint from the authors of the GFM, I decided to focus on autolinks, raw HTML, and line breaks first, leaving emphasis and links for the last chunk of features.
What Is the Audience For This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 22 February 2020 and 27 February 2020.
My Take on HTML in Markdown
When it comes to HTML blocks, I implemented them as part of the parser because they are part of the specification. But because of the complexity in understanding HTML blocks, I wholeheartedly recommend avoiding HTML blocks if at all possible.
The big problem I have with HTML blocks is that there are 7 types of HTML blocks, and each one has a different ending condition. As an author that heavily uses Markdown, I want any documents that I create to be simple and easy to read. In my opinion, without a lot of memorization, HTML blocks are not simple at all.
Raw HTML elements are on the complete other end of the spectrum. If I had my way with the GFM specification, I would demote and remove HTML blocks entirely and replace them with raw HTML. Raw HTML is exactly as it sounds: what you type is what you get. The good news is that if you follow a couple of easy to memorize rules, you can do this:
- NEVER start a line with a HTML tag
- Markdown is an authoring language, and should only use HTML sparingly
In support of the first rule, my reading of the start conditions of each of the HTML block types in the GFM concludes with the observation that all start conditions begin with the text “line begins…”. Thus, I created the first rule to ensure that I never inadvertently trigger a HTML block. Any non-whitespace text followed by any HTML is fine; the HTML just cannot be by itself at the start of the line.
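The first rule is simple enough to check mechanically. The following is a minimal sketch, not part of the PyMarkdown project: the function name `lines_starting_with_html` is hypothetical, and the check is deliberately stricter than the specification, flagging every line whose first non-space character is `<`, including lines inside code blocks.

```python
def lines_starting_with_html(markdown_text):
    """Return the 1-based numbers of lines whose first non-space character is '<'.

    Assumption: this is a blunt approximation of the "line begins with" start
    conditions, not an implementation of the seven HTML block types.
    """
    flagged = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        if line.lstrip(" ").startswith("<"):
            flagged.append(number)
    return flagged


sample = "Some text <b>bold</b>\n<div>\n  <http://example.com>\nplain text"
print(lines_starting_with_html(sample))
```

For the sample above, the second and third lines are flagged, while the first line is fine because the tag is preceded by text.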
My second rule may seem like my own opinion, but I believe it is a solid rule. While it is useful to read the GFM section on What is Markdown?, that section boils down to this one quote from that section by John Gruber:
The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.
There is little question that having the ability to add a HTML tag where needed is a good feature of Markdown. However, I would contend that any HTML tags that are added are contrary to Gruber’s stated goal for a document to not be marked up with tags. As such, unless Gruber’s readability goal is dismissed, anything more than sparing use of HTML in a Markdown document is harmful.
Don’t get me wrong, there are good use cases for HTML, but they are rare. In the 5-6 years that I have been authoring documents in Markdown, there are only two times I have ever used HTML tags in Markdown documents. The first time is in the writing of these articles on the Markdown parser for the PyMarkdown project, and the other time was for a project that used the <u> and </u> tags to satisfy a legal documentation requirement for text underlining. From talking with other developers who use Markdown parsers, I gather that underlining text is a popular request, along with a blanket request to disable all HTML tag support in Markdown, mostly for security reasons.
So, for what it is worth, that is my take on HTML in Markdown, and my reasons for that opinion. Your mileage may vary.
Unlike the rules that an author needs to memorize to properly use HTML blocks, using raw HTML in Markdown is very simple: it’s either a legal HTML tag or it gets interpreted as normal text. No “if it’s a blah tag, then…” rules. Just valid or invalid. No ending conditions, as raw HTML is inline processing of text. Simple. Easy. Clean.
The reason that I can be comfortable in saying that raw HTML is simple is the following block of Markdown text:
    Start and End Tags - <a href="link">link text</a> \
    Self-Closing Start Tag - <br /> \
    Alternate Parameter Enclosing - <b2 data='foo'> \
    Simple Alternate Parameter - <c3 foo=bar /> \
    Gratuitous Parameter Example - <d4 foo="bar" bam = 'baz <em>"</em>' _boolean zoop:33=zoop:33 />
While the grammar is broken down in the Raw HTML section of the GFM specification, it follows the HTML 5 specification for how to construct the tag and which tag formats are valid. From a document author point of view, this is simple. When adding HTML to a Markdown document, the author either knows how to author a web page in HTML or has a person or web page that tells the author what HTML to insert into the document. In either case, assuming that those tags are valid, those tags are emitted exactly as added, with no extra baggage added during the translation.
In adding these HTML samples to the above Markdown example, I am also following my own rules: never start a line with a tag and use it sparingly. None of the lines start with a tag, ensuring that none of the text is parsed as a HTML block. And while I went a bit overboard with the HTML specifically as it is an example, I can honestly say it is one of the less-than-5 times I have used HTML in Markdown. I think that qualifies as sparingly.
The raw HTML inline processing was very easy to add, as the rules are very simple: it’s either a valid HTML tag, or not. Not much to add.
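To make the “valid or not” decision concrete, here is a minimal sketch of my reading of the open tag and closing tag grammar in the Raw HTML section of the GFM specification. It is not the PyMarkdown implementation: the names `OPEN_TAG`, `CLOSE_TAG`, and `is_raw_html_tag` are hypothetical, and HTML comments, processing instructions, declarations, and CDATA sections are left out.

```python
import re

# Building blocks, following the grammar described in the GFM Raw HTML section.
TAG_NAME = r"[A-Za-z][A-Za-z0-9-]*"
ATTR_NAME = r"[A-Za-z_:][A-Za-z0-9_.:-]*"
# An attribute value is unquoted, single-quoted, or double-quoted.
ATTR_VALUE = r"""(?:[^\s"'=<>`]+|'[^']*'|"[^"]*")"""
# An attribute is whitespace, a name, and an optional "= value" part.
ATTRIBUTE = rf"\s+{ATTR_NAME}(?:\s*=\s*{ATTR_VALUE})?"

OPEN_TAG = re.compile(rf"<{TAG_NAME}(?:{ATTRIBUTE})*\s*/?>")
CLOSE_TAG = re.compile(rf"</{TAG_NAME}\s*>")


def is_raw_html_tag(text):
    """Return True if the whole string is a valid open or closing tag."""
    return bool(OPEN_TAG.fullmatch(text) or CLOSE_TAG.fullmatch(text))


for candidate in ['<a href="link">', "</a>", "<br />", "<c3 foo=bar />", "<33>"]:
    print(candidate, is_raw_html_tag(candidate))
```

Every tag from the Markdown sample above passes this check, while something like `<33>` fails because a tag name must start with a letter and so would be treated as normal text.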
Until I started reading the specification, I had no idea that Markdown was capable of a lot of things. In this case, I wasn’t aware that it can make a decent attempt at creating links with very little effort. As far as using them in my articles and documents, I am not sure about them yet, but at the very least they are an interesting concept for Markdown to include.
The concept is simple. A URI contained within the < and > characters is interpreted as a link to that URI, with both the link and the link text being set to that value. There is also a variation of autolinks that uses any email address that matches the email address regular expression in the HTML5 specification. For email address autolinks, the link is set to use the mailto scheme followed by the email address, and the link text is set to the email address itself.
Real simple examples of these autolinks are contained within the following Markdown:
    My website: <http://example.com>

    My email: <email@example.com>
which generates the following HTML:
    <p>My website: <a href="http://example.com">http://example.com</a></p>
    <p>My email: <a href="mailto:email@example.com">email@example.com</a></p>
To make sure they are not interpreted as HTML blocks, they are prefaced with text, according to the rules I established in the last section. While a properly created autolink should not be interpreted as either of the HTML elements, I prefer to keep things simple. “NEVER start a line with a HTML tag unless it is a validly formed autolink” just seemed too much. As mentioned before, your mileage may vary.
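The mapping from autolink to HTML can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PyMarkdown implementation: the function name `render_autolinks` is hypothetical, the URI pattern follows my reading of the GFM definition of an absolute URI (a 2 to 32 character scheme, a colon, then anything except whitespace, control characters, `<`, and `>`), the email pattern is the HTML5 email regular expression that the specification quotes, and percent-encoding of the link destination is ignored.

```python
import html
import re

# Absolute URI: scheme of 2-32 characters, a colon, then no whitespace, controls, "<" or ">".
URI = r"[A-Za-z][A-Za-z0-9+.-]{1,31}:[^\s<>\x00-\x1f\x7f]*"
# Email address pattern taken from the HTML5 specification, as quoted by the GFM.
EMAIL = (
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)
AUTOLINK = re.compile("<(?:(?P<uri>" + URI + ")|(?P<email>" + EMAIL + "))>")


def render_autolinks(text):
    """Replace each autolink in the text with the equivalent anchor tag."""

    def replace(match):
        if match.group("uri") is not None:
            target = html.escape(match.group("uri"))
            return f'<a href="{target}">{target}</a>'
        address = html.escape(match.group("email"))
        return f'<a href="mailto:{address}">{address}</a>'

    return AUTOLINK.sub(replace, text)


print(render_autolinks("My website: <http://example.com>"))
print(render_autolinks("My email: <email@example.com>"))
```

Anything between `<` and `>` that matches neither pattern is left alone, which is where the raw HTML check or plain text handling would take over in a real parser.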
The actual HTML output is simple, as denoted in the second paragraph of this section. In looking at autolinks a couple of times for this article, my feeling about autolinks as a document author is that there is not enough control of the output. Unless I am missing something, the following Markdown is equivalent to the above HTML, and it gives me more control of the document:
    My website: [http://example.com](http://example.com)

    My email: [email@example.com](mailto:email@example.com)
In the end, while autolinks were as trivial to add as raw HTML, I think I’ll stick with my explicit links.
Wrapping things up for this group of features are the line breaks: hard breaks and soft breaks. At 14 examples for hard line breaks and 2 examples for soft line breaks, only the indented code blocks (11), the tabs (11), and the paragraphs (8) are in the same ballpark for the low number of examples needed to adequately demonstrate that given structural element of Markdown. In addition, if you really look at the 14 examples for hard line breaks, there is a good argument to be made that there is a large amount of replication between the two character sequences, reducing the “actual” number of examples down into the same 8-11 examples range.
As indicated by the number of examples, line breaks in Markdown are really simple to use, inheriting their line break rules from HTML. In both languages, if the author wants to specifically break a line after some text, they must use an element that forces a line break before its content, use an element that preserves line breaks within its content, or specify a hard line break itself.
A good example of the first case is the grouping of Markdown lines into paragraphs by separating them with a blank line. This is best shown in the following Markdown from example 190:
    aaa
    bbb

    ccc
    ddd
which generates the following HTML:
    <p>aaa
    bbb</p>
    <p>ccc
    ddd</p>
and is displayed as the following:
In this example, the (soft) line break that occurs in the Markdown between aaa and bbb, and then again between ccc and ddd, is kept as it is translated into HTML. However, when the HTML is rendered, that line break is not displayed. When displayed, the first pair of strings is displayed, followed by a line break, and then the second pair. As a general default rule, Markdown blocks force a hard line break before displaying their contents, to ensure that the content is understood to be different.
For the second case, a good example of it is a slightly modified version of the Markdown from example 110:
    foo
    ```
    bar
    bam
    ```
    baz
which generates the following HTML:
    <p>foo</p>
    <pre><code>bar
    bam
    </code></pre>
    <p>baz</p>
and is displayed as the following:
The first thing to notice is that, as described in the last example, when a new Markdown block is started, the default rule for displaying HTML forces a hard line break to be rendered, keeping its content distinct from the previous content. As such, in this example both the <pre> tag and the <p> tag are displayed with a line break before them. While this behavior can sometimes be overridden with styles and style sheets, it tends to make things more confusing and is mostly avoided by HTML authors.
The second thing to notice is that the line breaks within the code block are preserved as-is. Both code blocks and HTML blocks maintain very tight control over how their data is translated and displayed, being the only two block elements that preserve any line breaks within their content. In terms of other elements, it should be no surprise that code spans and raw HTML are the only two inline elements that also preserve line breaks within their content.
That leaves the final use case, where a Markdown author wants to force a hard line break outside of any of the previously mentioned constructs. But how does an author do that? The first way is to end the line with two or more spaces, as shown in the following Markdown:
    foo<space><space>
    <space><space><space><space><space>bar
    foo<space><space>
generating the following HTML:
    <p>foo<br />
    bar
    foo</p>
and is displayed as the following:
Note that for the sake of clarity with this example, the string <space> is used in place of the actual space character itself. The two spaces at the end of the first line cause the HTML hard break tag <br /> to be inserted into the data, generating a line break not only in the generated HTML, but also in the displayed HTML. In contrast, since the two spaces at the end of the third line close off the paragraph block, they are simply stripped away and not replaced with a hard break. This was a smart move: any Markdown following that paragraph will be in a new block, and the start of that new block will, by default, force a hard break in the display of that block, as noted above.
In contrast, the second way to force a hard line break is to end the line with the
\ character, as shown in the following Markdown:
    foo\
    bar
    foo\
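While there are changes in the Markdown from the previous example, the generated HTML for the hard line break after the first foo remains the same. One difference is worth noting: per the GFM specification, neither form produces a hard break at the end of a paragraph, and while trailing spaces are stripped there, a trailing \ is emitted as a literal backslash. After my initial confusion between Python’s \ line continuation character and Markdown’s \ hard line break character (as documented here), the explicit hard line break character grew on me.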
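The two ways of forcing a hard line break can be captured in a small sketch. This is an illustration under stated assumptions, not the PyMarkdown implementation: the function name `render_paragraph` is hypothetical, the input is assumed to be the lines of a single paragraph with no other inline elements, and an escaped backslash at the end of a line is not handled. Per my reading of the GFM specification, trailing spaces at the end of the paragraph are stripped, while a trailing backslash there stays as literal text.

```python
def render_paragraph(lines):
    """Render the lines of one paragraph, turning hard line breaks into <br />."""
    rendered = []
    last_index = len(lines) - 1
    for index, line in enumerate(lines):
        # Leading spaces on a paragraph line are not significant.
        text = line.lstrip(" ")
        if index == last_index:
            # No hard break at the end of a paragraph: spaces are stripped,
            # and a trailing backslash is left as literal text.
            rendered.append(text.rstrip(" "))
        elif text.endswith("\\"):
            rendered.append(text[:-1] + "<br />")
        elif text.endswith("  "):
            rendered.append(text.rstrip(" ") + "<br />")
        else:
            # A soft line break: the newline is kept, any single trailing space is dropped.
            rendered.append(text.rstrip(" "))
    return "<p>" + "\n".join(rendered) + "</p>"


print(render_paragraph(["foo  ", "     bar", "foo  "]))
print(render_paragraph(["foo\\", "bar", "foo\\"]))
```

Treating the last line separately keeps the end-of-paragraph rule in one place instead of spreading it across both hard break forms.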
What Was My Experience So Far?
If I am being honest with myself, I was not sure at the beginning of the project if I would ever get to this point. With only 2 inline elements left to process, not including the link definitions deferred from before, the parser is getting close to being able to handle a full and rich Markdown document. As the GFM specification contains over 673 scenarios, there were times that I thought I would just give up or use a “mostly” complete parser… something that was just barely good enough. But getting to this point, close to having a solid parser completed, feels great!
Sure, there have been cases where it took me a day or two to figure out how to do something properly, such as list blocks and block quotes. Those were tough. And as I am writing this article at a 2-3 week delay from when I made the actual changes to the project, I know that there are some more bumps in the road yet to come. (No Spoilers!) But the important thing is that while those things are hard, I face them with the mentality that I talk about in my article on Embracing Something Hard. Maybe it’s just how I am, but for me part of the challenge is to embrace something hard and work my way through it.
It was when I was thinking about some of the hard stuff that I had tackled previously, and how easy this block of features was, that I started wondering. Looking back in the project at anything written before this commit on 27 February 2020, it is hard to find any test that compares the output to HTML, though all of the scenario test output from the GFM specification is written as HTML. What’s with that?
Going back to my article on Collecting Requirements, I determined that to properly write a Markdown linter, I needed to be able to take the output from a parser as a set of tokens, not as HTML. The entire reason that I have taken steps to write this Markdown parser is that there are no current parsers that do not output interpreted Markdown as HTML. Furthermore, the previous section on autolinks is proof of that need. To produce a simple link to a web page, I can use an autolink, a raw HTML tag, or a HTML block. Tokenizing the output before that translation to HTML is the only way to ensure that I am linting the Markdown properly.
To get me this far, testing against the tokenized output of the parser was the right thing to do. The linter is going to observe and consume the tokens from the parser, so they are the right thing to test. But the question of whether I was generating the correct tokens started to bounce around my mind…
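A small sketch shows why HTML output is not enough for a linter. The token classes below are hypothetical and are not PyMarkdown’s actual tokens; they only illustrate that different Markdown constructs can collapse into identical HTML while staying distinguishable as tokens. An inline link stands in here as a third way of writing the same link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToken:  # hypothetical: plain text
    text: str


@dataclass(frozen=True)
class AutolinkToken:  # hypothetical: <http://example.com>
    uri: str


@dataclass(frozen=True)
class InlineLinkToken:  # hypothetical: [text](uri)
    text: str
    uri: str


@dataclass(frozen=True)
class RawHtmlToken:  # hypothetical: a raw HTML tag, emitted as typed
    html: str


def to_html(tokens):
    """Translate a token stream into HTML, losing the original construct."""
    parts = []
    for token in tokens:
        if isinstance(token, TextToken):
            parts.append(token.text)
        elif isinstance(token, AutolinkToken):
            parts.append(f'<a href="{token.uri}">{token.uri}</a>')
        elif isinstance(token, InlineLinkToken):
            parts.append(f'<a href="{token.uri}">{token.text}</a>')
        elif isinstance(token, RawHtmlToken):
            parts.append(token.html)
    return "".join(parts)


uri = "http://example.com"
autolink_stream = [TextToken("See "), AutolinkToken(uri)]
inline_stream = [TextToken("See "), InlineLinkToken(uri, uri)]
raw_html_stream = [
    TextToken("See "),
    RawHtmlToken(f'<a href="{uri}">'),
    TextToken(uri),
    RawHtmlToken("</a>"),
]

print(to_html(autolink_stream) == to_html(inline_stream) == to_html(raw_html_stream))
print(autolink_stream == inline_stream)
```

All three streams render to the same HTML, yet a rule such as “prefer explicit links over autolinks” can only be written against the tokens.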
What is Next?
As I mentioned in the last section, I had some concerns about whether or not the tokenization of the Markdown was correct, so I decided to go all out for the next section and add the remaining scenario tests from the GFM specification. To close the loop on the testing, I also went through all the existing tests and added a new class that transforms the PyMarkdown tokens into HTML, comparing that output directly against the GFM specification. Stay tuned!