Markdown Linter

Introduction¶

Having completed most of the Markdown block elements, as documented in the last two articles on leaf blocks and container blocks, I wanted to go back and revisit the HTML blocks that I deferred. For anyone following this series, in the Stopping At A Good Place section of the “Parsing Normal Markdown Blocks” article, I determined that there were 3 types of leaf blocks that would be difficult to implement, so I deferred them. Between my lack of use for most of those deferred features and my distinct status as the first user of the parser, I thought this was a decent trade-off in the short run. With increased confidence from implementing the other block types, I thought it was a good time to deal with this block type.

Before continuing, I believe it is important for me to highlight some information about HTML blocks in Markdown. I have never needed to use HTML blocks or raw HTML (covered in a later article) in any of my own Markdown documents. Quick research revealed that there are some interesting cases where injecting HTML blocks is a benefit. However, that same research also noted that allowing either type of HTML in Markdown is a potential security issue, and as such, may be disabled for a given Markdown-to-HTML generator. Regardless of my usage patterns or security patterns, I wanted to be sure to include it in the PyMarkdown project for completeness.

What Is the Audience For This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commit of 20 December 2019. This work includes implementing the parsing logic for the HTML blocks as documented in the GFM specification and getting the parser to pass all the scenario tests for HTML blocks that were previously entered.

Why Did I Defer HTML Blocks?¶

The HTML blocks were one of those 3 deferred types because a quick reading of the specification led me to believe the HTML block implementation would be tricky.

Why? Take a minute and read or browse the HTML blocks section of the GitHub Flavored Markdown (GFM) Specification. Don’t worry, I’ll wait while you do that.

Done? What did you think? My initial read of the specification made me think that it was going to be a complete mess to figure out properly. After a walk to clear my head, I took another look at the section. When I factored in the work I did to implement the container blocks, this section looked like it would be tedious, but not too bad. After handling container blocks within container blocks, the straight parsing of a leaf block would not be too bad. Right?

I admit, it still looked kind of daunting to me. From my observations, the 42 use cases for the HTML blocks were far more than the count for any of the other block groups ¹ except for the list items group, at 47 use cases. And yes, that count is nearly double the count for block quotes (at 22 use cases) and more than half of the combined count for list blocks (with lists at 25 use cases and list items at 47 use cases). The data backed up my daunting feeling, which was a relief. Thinking about how I got to that feeling, I realized that in reading the specification, I was telling myself a story about how hard it would be to implement based on the sheer number of use cases. So how was I going to change that narrative I was telling myself?

Changing the Narrative¶

In the last article, I mentioned that one of my family’s favorite sayings is: “Stuff happens, pick yourself up, dust yourself off, and figure out what to do next.” “Stuff happens” was the decision to defer the HTML blocks, “pick yourself up” was the decision to pick them up again, “dust yourself off” was figuring out why I deferred the blocks, leaving the “figure out what to do next” part. One of my favorite tools to figure out what to do next is to see if I can change the narrative, or story, with whatever it is that I am doing.

Why is this important?

There are facts and opinions that are part of every story. Did the main character go to the cantina before boarding the freighter with the smuggler? That is a fact. Whether or not the captain of that freighter is a smuggler can be an opinion, depending on supporting facts. How much trust the main character had in that smuggler when boarding the freighter is mostly an opinion. The closer something is to a fact, the harder it is to change. Opinions can be changed in many cases if you can find the right story to tell.

The HTML blocks having 42 use cases to define their behavior is a fact, and facts do not change easily. Taking a deeper look at the 7 categories at the start of the specification’s section on HTML blocks, I can make a good argument that there are 3 sets of HTML tags instead of the 7 presented: the meta tags, the special tags, and everything else. Furthermore, the first 20 use cases present general cases, the next 18 use cases cover specific rules, and the last 4 use cases explain why those rules were specified.

Given this information, I can change the story I am telling myself by breaking down the previous story into smaller stories, each with a specific focus. Instead of one group of 42 use cases, I can have 3 smaller groups: 1 for general HTML blocks with 20 use cases, 1 for specific HTML blocks with 18 use cases, and finally a “wrap-up” group of 4 use cases that better explains why the specified rules are important.

Why is this better?

At 42 use cases, the HTML blocks group is the second biggest block of use cases, and is somewhat scary. Breaking that group up into 2 groups of about 20 use cases, followed by a small group with 4 example use cases, gives me something I can comprehend better and implement better, which removes my concerns about the large scope.

In addition, experience has taught me that when translating use cases to scenario tests, the last 2 to 3 translations are frequently show-stoppers or require major reworking to properly translate and get working. With a big group of 42 use cases, I know I would be expecting that behavior to happen, with a large amount of rework to do when it happened. After breaking down the problem into the 3 smaller groups, I was somewhat confident that if the same situation occurs, the amount of rework will be limited to approximately 20 scenario tests. For me, reducing that perceived effort helped me keep my confidence up instead of having it take a hit. Instead of “when it happens” with the 42 use cases, it became “if it happens” with the smaller groups of 20 use cases.

Let the Implementation Begin!¶

With a boost to my confidence in place, I was able to get a decent amount of work completed on the HTML blocks, wedged between shopping and work during the end of the holiday season. Despite my initial concerns about the size and complexity of this feature, the development went smoothly. Given how it went, I believe it lends support to my opinion that breaking down the use cases into the 3 groups was the right thing to do.

For those not familiar with Markdown and HTML, there are some basic rules for HTML blocks, and then the 3 categories of HTML blocks themselves: the meta tags, the special tags, and everything else. The basic rules are simple. HTML blocks are always started with tags that start at the beginning of a new line (the specification allows up to three spaces of indentation), and once the start condition is met for one of the 7 block types, only the matching end condition finishes off the HTML block. In some cases, the end conditions can be met on the same line, and in some cases, the end conditions make sense… and in some they do not. At least not without understanding the rules!

Meta Tags¶

Block type 1 contains what I refer to as the “meta tags”, because those tags usually contain information that is at a higher level than normal tags, such as script information or style information. For anyone familiar with authoring HTML, the Markdown interpretation of these tags is almost the same as in a raw HTML document. The start condition is that one of the strings <script, <pre, or <style is present, followed by whitespace, the string > or the end of the line. The end condition is that one of the strings </script>, </pre>, or </style> is present, though the tags specified in the start condition and end condition do not need to match each other.

As such, the following text is considered a complete HTML block:

<style type="text/css">
h1 { font-size: 140%; font-weight: bold; border-top: 1px solid gray; padding-top: 0.5em; }
</style>

as is:

<script src="jquery.min.js"></script>

and:

<script src="jquery.min.js"></pre>

Note that in the last example, while the Markdown specification considers it a complete HTML block, it is not a valid HTML snippet. The Markdown specification does not specify any validation of the produced output, so beware of garbage-in, garbage-out.

This HTML block type was easy to figure out, hence it was easy to implement. Pretty straightforward: look for one of the start strings, capture everything until we find one of the end strings. Quick and painless.
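The process described above can be sketched in a few lines of Python. This is only a minimal illustration, not the PyMarkdown implementation: the names `TYPE_1_START`, `TYPE_1_ENDS`, and `collect_type_1_block` are hypothetical, and the sketch assumes the document has already been split into lines. It also assumes the specification's allowance of up to three leading spaces and its case-insensitive matching of the tag names.

```python
import re

# Start condition for block type 1: one of the three tag names, followed by
# whitespace, ">", or the end of the line. Names here are illustrative only.
TYPE_1_START = re.compile(r"^ {0,3}<(script|pre|style)(\s|>|$)", re.IGNORECASE)
TYPE_1_ENDS = ("</script>", "</pre>", "</style>")


def collect_type_1_block(lines):
    """Return the lines of a type 1 HTML block starting at lines[0], or []."""
    if not lines or not TYPE_1_START.match(lines[0]):
        return []
    block = []
    for line in lines:
        block.append(line)
        lowered = line.lower()
        # Any of the end strings closes the block, even a mismatched one.
        if any(end in lowered for end in TYPE_1_ENDS):
            break
    return block
```

Because the end strings are checked on the first line as well, a one-line block such as the `<script ...></pre>` sample is handled by the same loop. If no end string is ever found, the block simply runs to the end of the input.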

Special Tags¶

Block types 2 to 5 are what I refer to as the special tags. In order, the HTML specification refers to these as the comment tag, the processing instruction tag, the declaration tag, and the CDATA tag. Each of these tags is authored exactly as would be expected in a normal HTML document and has its own distinct purpose. In each case, the start condition is a simple string, and the end condition is the matching closing string: <!-- is closed by -->, <? by ?>, <! followed by an uppercase letter by >, and <![CDATA[ by ]]>.

While most of these tags have seldom used or esoteric purposes, the comment tag is used frequently in HTML code, and is common in HTML documents. Similar to block type 1 above, the following text is considered a complete HTML block:

<!--
    style type="text/css">
h1 { font-size: 140%; font-weight: bold; border-top: 1px solid gray; padding-top: 0.5em; }
</style>
-->

as is:

<!-- this is a comment -->

Like the previous HTML block type, these HTML block types were also easy to figure out and implement. Just like before: look for one of the start strings, capture everything until we find one of the end strings. Just as quick and just as painless.
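As a rough sketch of how these four block types can share one code path, the start and end strings can be kept in a small table. The names `SPECIAL_BLOCKS`, `special_block_type`, and `collect_special_block` are hypothetical and chosen for illustration. The start and end strings themselves follow the GFM specification, and the sketch assumes the input is already split into lines.

```python
import re

# (block type, start pattern, end string) for block types 2 to 5.
SPECIAL_BLOCKS = (
    (2, re.compile(r"^ {0,3}<!--"), "-->"),           # comment
    (3, re.compile(r"^ {0,3}<\?"), "?>"),             # processing instruction
    (4, re.compile(r"^ {0,3}<![A-Z]"), ">"),          # declaration
    (5, re.compile(r"^ {0,3}<!\[CDATA\["), "]]>"),    # CDATA
)


def special_block_type(line):
    """Return (block type, end string) if the line starts a type 2-5 block."""
    for block_type, start, end in SPECIAL_BLOCKS:
        if start.match(line):
            return block_type, end
    return None


def collect_special_block(lines):
    """Return the lines of a type 2-5 HTML block starting at lines[0], or []."""
    found = special_block_type(lines[0]) if lines else None
    if found is None:
        return []
    _, end = found
    block = []
    for line in lines:
        block.append(line)
        if end in line:  # the end string may appear on the start line itself
            break
    return block
```

Keeping the pairs in data rather than in separate functions is one possible design choice. It makes the symmetry between the four block types visible at a glance.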

“Everything Else” Tags¶

With block types 1 to 5 out of the way, the work focused in on the remaining block types 6 and 7. These two block types are different than the other blocks, with their most prominent difference being that their end condition is a simple blank line. Another difference is that there is a long list of tag names that are eligible for block type 6, while any other tag is relegated to block type 7. This becomes important as the start conditions of block type 6 are the string < or </, followed by the tag name, and then followed by whitespace, the string >, the string /> or the end of the line. In contrast, the start conditions for block type 7 are that the HTML must either be a complete open tag or a complete close tag, followed by optional whitespace and the end of the line. As an additional requirement, a block type 7 HTML block cannot interrupt a paragraph.

To me, these rules are confusing to anyone authoring even a small piece of HTML in Markdown, adding to the reasons for me to suggest to people not to use HTML in Markdown. While this confusion is not evident in the examples for the block types 1 to 5, consider this sample:

<canvas
    class="my-canvas">
<pre>
**Hello**,

_world_.
</pre>
</canvas>

and this sample:

<table
    class="column"><tr><td>
<pre>
**Hello**,

_world_.
</pre>
</td></tr></table>

Without looking at the information in the specification, how easy is it to tell what the output of each sample is? To be honest, I had to refer back to the HTML block definitions in the GFM specification twice when I was writing these samples and three times when I was verifying the samples before publishing this article. That does not bode well, does it?

For the first example, the canvas tag name is not in the list for block type 6, and a block type 7 evaluation fails as the tag is neither a complete start tag nor a complete end tag. As such, the canvas start tag ends up being normal text, to be wrapped in a paragraph. The next tag, the pre start tag, gets identified as a block type 1 start, finishing at its own pre end tag. The remaining canvas end tag is a complete end tag on a line by itself, and it follows an HTML block rather than a paragraph, so by the block type 7 rules it starts an HTML block of its own. I know that was not what I expected at first glance.

The second example has different issues. Because the table tag name is in the block type 6 list of allowable tag names, the start conditions only state that it needs to start with the first part of a start tag or end tag, which the string <table satisfies. However, as the end condition for block type 6 HTML blocks is a blank line, the HTML block ends after **Hello**, and before _world_.. At this point, the text _world_. is parsed as normal text, starting a paragraph. As a block type 7 HTML block cannot interrupt a paragraph, the text </pre> on the next line continues that paragraph instead of starting a new HTML block. The final line then starts a new block type 6 HTML block, as the td tag name is also in the block type 6 list and that block type is allowed to interrupt a paragraph. When reading a similar example as part of example 118, it did take several tries to figure out what was going on.

These block types provided a bit of complexity that was different than the previous blocks. As such, I hit a couple of roadblocks that I had to work through. It wasn’t that the implementation was much more complicated than the previous HTML block types, because it wasn’t. It is almost the same process: find one of the start conditions, and capture everything until a blank line. Sure, the start conditions were a bit meatier, but other than that, it was relatively simple. It was that the start conditions and end conditions were different for these 2 HTML block types that made me look back at the use cases and scenario tests with a couple of “huh”s until that difference registered in my head. And that list separating HTML block type 6 from 7… sheesh.
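To make the block type 6 process concrete, here is a minimal sketch under a few stated assumptions. The tag name list is a small illustrative subset of the long list in the specification, the names `TYPE_6_TAG_NAMES`, `TYPE_6_START`, and `collect_type_6_block` are hypothetical, and the input is assumed to be split into lines already. Block type 7 is left out on purpose, as recognizing a complete open or close tag takes considerably more validation.

```python
import re

# Illustrative subset only: the specification lists many more tag names.
TYPE_6_TAG_NAMES = ["address", "div", "h1", "p", "table", "td", "tr", "ul"]

# "<" or "</", a listed tag name, then whitespace, ">", "/>", or end of line.
TYPE_6_START = re.compile(
    r"^ {0,3}</?(" + "|".join(TYPE_6_TAG_NAMES) + r")(\s|/?>|$)", re.IGNORECASE
)


def collect_type_6_block(lines):
    """Return the lines of a type 6 HTML block starting at lines[0], or []."""
    if not lines or not TYPE_6_START.match(lines[0]):
        return []
    block = []
    for line in lines:
        if not line.strip():  # a blank line ends the block and is not part of it
            break
        block.append(line)
    return block
```

Run against the table sample above, this sketch collects the first four lines and stops at the blank line before `_world_.`, while the `<canvas` line from the first sample is not recognized as a start at all.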

My Recommendation¶

When it comes to HTML blocks, I implemented them as part of the parser because they are part of the specification. But because of the complexity in understanding HTML blocks, I wholeheartedly recommend avoiding using HTML blocks if possible.

What Was My Experience So Far?¶

I took my time with the implementation for HTML blocks due to the complexities stated above. For the most part, the code I implemented worked on the first or second try, with few cases where it took more tries and debugging than that. I believe the key to the relatively easy implementation was breaking the groups and tasks down into multiple, smaller groups and smaller tasks. In retrospect, I believe this enabled me to more readily get my mind around the task to accomplish, and not get overwhelmed by the size of the problem.

Implementing that thinking for the project, while not concrete, helped me see other things for the project in a better perspective. Most of the things I initially thought would be complex turned out to not be that complex. The long list of tag names for block type 6? Strings in a list object. The end conditions? Either looking for a blank line or one of a set of strings in one of the following lines. Getting the use cases right in the scenario tests? Really simple. I still contend that authoring HTML in Markdown is complex, but the implementation was easy.
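The two kinds of end condition mentioned above can be expressed as a single lookup keyed by block type. This is a sketch of one possible arrangement rather than the project's actual code, and the names `END_STRINGS` and `is_end_of_block` are hypothetical.

```python
# End strings keyed by block type; an empty tuple means "ends at a blank line".
END_STRINGS = {
    1: ("</script>", "</pre>", "</style>"),
    2: ("-->",),
    3: ("?>",),
    4: (">",),
    5: ("]]>",),
    6: (),
    7: (),
}


def is_end_of_block(block_type, line):
    """Return True if the line satisfies the end condition for the block type."""
    end_strings = END_STRINGS[block_type]
    if not end_strings:
        return not line.strip()
    lowered = line.lower()  # only the type 1 end strings contain letters
    return any(end in lowered for end in end_strings)
```

One detail the lookup does not capture: for block types 1 to 5 the line that meets the end condition is part of the HTML block, while for block types 6 and 7 the blank line is not.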

Another boost to my confidence was tackling the HTML blocks and getting them out of my “technical debt column”. While I believe that I made the right decision to defer the HTML blocks for the right reasons, it still felt good to get them dealt with. Like my experience with translating the last 2 to 3 use cases into scenario tests, thinking about revisiting any technical debt also triggers similar expectations of the reworking of existing code, if that revisiting is actually possible at all. Taking something out of technical debt and being able to remove that uncertainty helped my confidence towards the completion of the parser for this project.

All in all, I believe things are still headed in the right direction!

What is Next?¶

During the implementation of the PyMarkdown parser, I have been using my PyScan script to great benefit. As such, I decided to take the time to polish it up a bit and document it in this article on Software Quality. While doing that, I took some time to refactor the PyMarkdown code to make it easier to work with, preparing it for the inline processing that was to come next. The next article will go over the refactoring that I did, and how it helped the project.

¹ The totals are as follows: paragraphs (9), tabs (11), indented code blocks (15), atx headings (18), thematic breaks (19), block quotes (22), lists (25), setext headings (27), and fenced code blocks (29).

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.
