Introduction

In the wrap-up for my last article, I provided extra information on my decision to write a Markdown parser as part of the PyMarkdown project. My belief is that writing an effective Markdown linter requires a parser that emits an intermediate set of tokens instead of the usual HTML output. Accordingly, from the start of the project, all the scenario tests for the parser have dealt with those generated tokens and not the usual HTML output. Based on my experience to date, both the belief in a non-HTML parser and the decision to test the tokens have proven to be the right choices. During development, I have not observed a specific case where comparing the output to HTML instead of the intermediate tokens would have been a substantial benefit. However, even with that in mind, the lack of a direct comparison between the output specified in the GFM specification and the output of the parser started to erode my confidence, even if just fractionally.

The questions that came to mind were simple ones. Did I keep the right information when parsing the Markdown to enable an HTML transformation of the tokens? Did I pick the right token to represent each Markdown element in the token stream? Do the tokens contain enough information to allow me to write linting rules against them? For me, these questions are relevant given the nature of the project.

Looking at those three questions, I quickly realized that answering the last one was impossible until I started writing the linting rules. Sure, I could take a guess, but that is all it would be. However, I realized that I could probably answer the first two questions and that there was significant benefit to be gained from doing so. If I can write a token-to-HTML translator and apply it to the token stream, then when the HTML output for all scenarios matches, I have answered the first question. And while I cannot answer the second question completely, if the translation to HTML is simple enough, I will have shown that I am making good choices for the representative tokens. While I cannot prove that those choices are perfect until the rules are written, I can at least prove to myself that my token choices are headed in the right direction.

What Is the Audience For This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 29 February 2020 and 06 March 2020.

Finishing Up the Scenario Tests

For me, this task was essentially a bookkeeping issue. As the only unimplemented features were the emphasis elements and the link elements, it felt like the right time to march through the remaining scenarios and implement them as scenario tests.

These scenario tests fell into two categories. In the first category, if the scenario tested a failure using already implemented features, I copied over an existing test, changed the Markdown input, executed the new test, and copied the generated tokens from the test output into the new test, manually verifying the tokens as I copied them. Basically, I figured that if a scenario test fails in a manner that will not change even when the owning feature is implemented, then completing the test was the best choice for the project. In the second category, I did the same thing, except that I stopped before the execution step, instead adding a skip marker (@pytest.mark.skip) to the test’s definition. In this way, I was able to add the bulk of the remaining tests without having tests that would obviously fail getting in the way.
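
As an illustration of the second category, a stubbed-out scenario test looked something like this (the test name and skip reason here are hypothetical):

import pytest


@pytest.mark.skip(reason="owning feature not implemented yet")
def test_emphasis_360():
    """
    Test case 360:  (hypothetical) an emphasis scenario awaiting its feature.
    """
    # The Arrange/Act/Assert steps were filled in ahead of time; the
    # skip marker keeps the test from failing until the feature lands.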

While this may have seemed like busy work, it was meant to give me a realistic picture of how close I was to finishing the parser, and it worked. By executing pipenv run pytest, I executed every test and was able to look for entire modules with skipped tests, getting a good indication of what features and testing were left. From a holistic point of view, it was good to see that out of the 875+ tests in the project so far, there were only just over 100 tests left before the parser would be completed. Being able to see how close I was to finishing the parser was definitely worthwhile!

Adding Test Grouping

I knew from the start that this would be the monotonous part, so I tried to make sure that I could stage the changes as I brought more scenario tests online. The first thing I did was to add a new marker for PyTest to the project by adding this line to the setup.cfg file for the project:

markers=gfm
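
For context, pytest reads custom markers from the [tool:pytest] section, so the relevant part of setup.cfg ends up looking something like this:

[tool:pytest]
markers=gfm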

This change allowed me to use one of PyTest’s grouping features: marking. By changing a scenario test’s definition from:

def test_atx_headings_032():
    ...

to:

@pytest.mark.gfm
def test_atx_headings_032():
    ...

I included that test in the gfm group. While I could still execute that test by itself by entering pipenv run pytest -k 032, I could now execute all tests in the gfm group by entering pipenv run pytest -m gfm.

This command was invaluable during development of the parser. After adding HTML translation support to a scenario test, I ensured that it was added to this group, thereby staging the scenario test with its owning feature. After completing the change to make the test pass, I then executed all the tests in the gfm group to ensure that I didn’t break anything else in the process. While it caused me some issues from time to time, it was an extra watch over the completed work, one that I appreciated.

Adding Translation into HTML

Translating any stream into something requires a loop to process each element in the stream, with some mix of emitting data and altering state. I created the TransformToGfm class to handle that translation, with its transform function as the entry point. At this point in the implementation, this class was very simple. As each token was seen in the loop, its data was emitted with only minor additional processing required.
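
As a rough sketch of that shape (a minimal illustration with made-up token names, not the project’s actual code):

class TransformToGfm:
    """Transform a stream of Markdown tokens into GFM-style HTML."""

    def transform(self, actual_tokens):
        output_html = ""
        for next_token in actual_tokens:
            # Emit each token's data with only minor extra processing.
            if next_token.token_name == "paragraph-start":
                output_html += "<p>"
            elif next_token.token_name == "paragraph-end":
                output_html += "</p>\n"
            elif next_token.token_name == "text":
                output_html += next_token.token_text
            # ... one branch per remaining token type ...
        return output_html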

Adding this support to existing tests was easy, just monotonous. Using the same test 032 from the prior example, the test changed from:

@pytest.mark.gfm
def test_atx_headings_032():
    """
    Test case 032:  Simple headings
    """

    # Arrange
    tokenizer = TokenizedMarkdown()
    source_markdown = """some markdown"""
    expected_tokens = [...]

    # Act
    actual_tokens = tokenizer.transform(source_markdown)

    # Assert
    assert_if_lists_different(expected_tokens, actual_tokens)

to:

@pytest.mark.gfm
def test_atx_headings_032():
    """
    Test case 032:  Simple headings
    """

    # Arrange
    tokenizer = TokenizedMarkdown()
    transformer = TransformToGfm()
    source_markdown = """some markdown"""
    expected_tokens = [...]
    expected_gfm = """<p>some markdown</p>"""

    # Act
    actual_tokens = tokenizer.transform(source_markdown)
    actual_gfm = transformer.transform(actual_tokens)

    # Assert
    assert_if_lists_different(expected_tokens, actual_tokens)
    assert_if_strings_different(expected_gfm, actual_gfm)

In order of appearance, an instance of the TransformToGfm class was added, the expected_gfm variable was set to the expected HTML, the transform function was called, and then the contents of the expected_gfm variable were compared against the output from the transform function. Except for the per-test value of the expected_gfm variable, this same change was repeated for each test as support for its owning feature was added.

Translating the Leaf Blocks

Translating the leaf block tokens added in this article and this article proceeded quickly, encountering only a few unexpected issues. These issues fell into two categories: the handling of HTML blocks and code blocks, and the handling of newlines in the HTML output.

Most of the issues that were uncovered for leaf blocks dealt with the processing of HTML blocks and code blocks. As mentioned in previous articles, these two leaf blocks are special in that they maintain firm control over the formatting of their content. To accommodate these two leaf block types, the handling of the TextMarkdownToken was changed to meet the stricter output requirements of those blocks, mostly ensuring that whitespace was preserved. Beyond that, the only other change needed was to expose certain fields on most of the tokens, allowing the translator to access each token’s attributes cleanly.

From a rendering viewpoint, I had guessed from the start that any newlines in the HTML output were going to be a problem, and I was right. The GFM specification is purposefully vague about when to add newlines in the translation from Markdown to HTML, and it was a vagueness that I could not avoid. As the main push for this article’s work was to add proper comparisons against the GFM’s HTML output for each example, I had a hard choice to make: either modify the HTML output of each of the 673 scenarios as I copied it into the scenario tests, or ensure that the translation replicated the specification’s HTML exactly.

After a lot of thinking, I decided to go with the exact HTML output path, thereby avoiding any errors I might introduce while hand-modifying the HTML output for each scenario test. When I thought about both options, I just felt that I would almost instantly regret making any changes to the HTML output, as it would no longer be synchronized with the GFM specification. Considering that, I figured it was better to be consistent and do a bit more work on the project than to start changing the scenarios.

My current focus was on enabling the HTML comparisons, and I knew it was going to take more effort and time to get them right. As such, I decided to add a fair number of “if this looks like” conditions to decide whether to add a newline, with plans to refactor the code into better groupings down the road. I do not like adding technical debt just for the sake of expediency, but it felt like the right decision at the time. I figured that adjusting the translator with little tweaks here and there would give me a more complete picture of what needed to be done for a proper refactor later. It was not a perfect decision, but it was one that I felt I could live with.
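
To give a flavor of those conditions, here is a hypothetical reconstruction (the names and exact checks are illustrative, not the project’s actual code):

def append_newline_if_needed(output_html, in_firm_control_block):
    # One of many small "if this looks like" checks: avoid doubling up
    # newlines, and leave the whitespace of HTML blocks and code
    # blocks, which firmly control their own formatting, strictly alone.
    if in_firm_control_block:
        return output_html
    if output_html and not output_html.endswith("\n"):
        output_html += "\n"
    return output_html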

When all the leaf blocks were completed, I did notice a decent boost in my confidence. Except for some issues with getting newlines right, the translation of leaf blocks was straightforward. Knowing that I had made good decisions so far gave me that boost… something that I would need sooner rather than later.

Translating the Container Blocks

While the translation of the leaf blocks went smoothly, I hit a number of issues dealing with the container blocks added in this article. While the block quotes themselves were not really an issue, it was the list blocks that caused me a lot of trouble.

In implementing the list block support in the translator, I was able to quickly get to a point where the tags themselves were being emitted properly, but the whitespace around the tags was off, especially the newlines. That was frustrating, but with some helpful observations and experimentation, I was able to get that taken care of relatively quickly.

Following that triumph, I spent a few aggravating days trying to figure out why some list items contained <p> tags and why others did not. I tried a couple of approaches based on the surrounding tags and tokens, but each of them failed. It wasn’t until I took another look at the lists section of the specification that I noticed the following paragraph:

A list is loose if any of its constituent list items are separated by blank lines, or if any of its constituent list items directly contain two block-level elements with a blank line between them. Otherwise a list is tight. (The difference in HTML output is that paragraphs in a loose list are wrapped in <p> tags, while paragraphs in a tight list are not.)

That was the information I was searching for! While the actual implementation is a bit more complicated than just that, that is the essence of when I needed to add the paragraph tags.

The complications in implementation arose as the examples became more complex. For example, based on the above description, it is easy to see that this modified example 294 is a tight list:

- a
- b
- c

and this unmodified example 294 is a loose list:

- a
- b

- c

From the above quote, since a blank line separates two of the list elements, it is a loose list. Pretty easy and straightforward. Implementing this aspect of looseness was decently easy but did require some non-trivial extra code: go back to the start of the current list, then go through each list element, looking to see if the Markdown element before it is a blank line. If so, mark the entire list as loose and apply that looseness throughout the list.
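
A rough sketch of that scan, using hypothetical token names, might look something like this:

def mark_list_looseness(actual_tokens, list_start_index):
    # Walk forward from the list's start token; if any list item is
    # directly preceded by a blank line, the whole list is loose.
    is_loose = False
    for index in range(list_start_index + 1, len(actual_tokens)):
        token_name = actual_tokens[index].token_name
        if token_name == "list-end":
            break
        if (
            token_name == "list-item"
            and actual_tokens[index - 1].token_name == "blank-line"
        ):
            is_loose = True
            break
    # Recording the looseness on the start token lets the translator
    # decide later whether to wrap the list's paragraphs in <p> tags.
    actual_tokens[list_start_index].is_loose = is_loose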

However, when dealing with lists and sublists, it was not so simple. For example, consider the Markdown from example 299:

- a
  - b

    c
- d

Understanding that Markdown list blocks can be nested and following the guidance from the above quote, you can deduce that the outer list is tight and the sublist is loose. Making the leap from handling the previous two examples to this example meant that I needed to find a way to add scoping to the translation. Without scoping, when the translator processed the above example, it saw three items in the same list, with the second element making the entire list loose. Scoping was required to allow the translator to determine that the a and d items were in one list and the b/c item was in its own list, therefore determining the correct looseness for both lists.
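
A stack provides that scoping: push a context on each list start, pop it on each list end, and let a blank line affect only the innermost open list. Again, this is a sketch with hypothetical token names, covering only the blank-line-between-items rule:

def mark_looseness_with_scoping(actual_tokens):
    # Each stack entry is the start token of one currently open list.
    list_stack = []
    last_was_blank_line = False
    for next_token in actual_tokens:
        if next_token.token_name == "list-start":
            next_token.is_loose = False
            list_stack.append(next_token)
        elif next_token.token_name == "list-end":
            list_stack.pop()
        elif next_token.token_name == "list-item":
            # A blank line only loosens the innermost open list.
            if last_was_blank_line and list_stack:
                list_stack[-1].is_loose = True
        last_was_blank_line = next_token.token_name == "blank-line"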

The code itself to handle scoping was simple, but the tradeoff was that the translator was slowly becoming more complicated, something that I was not happy about. It was not in dangerous territory yet, but it was something to keep watch over. In addition, while the difference between a list being loose or tight is obvious in hindsight, at the time it was very annoying. It took me the better part of four days to do something that was obvious. Even better than obvious: it was plainly spelled out in the specification. But as I had to do a number of other times during this project, I picked myself up, dusted myself off, and continued on with the inline translations.

Translating Inlines - Backslash Escapes, Character References, and Code Spans

During this translation process, first I hit a high, then I hit a low, and then I saw the payoff of both with the translation of these inline elements into HTML. These elements consist of the backslash escape, character reference, and code span elements, and were added in this article. Apart from a couple of typos, adding support for these features flew past quickly. The backslash escapes and character references were already being processed along with the text tokens, which in turn were already tested with the leaf blocks. The only new code needed was for code spans, but those additions were quickly made by copying the work done for code blocks and simplifying it a bit. Once the typos were corrected, the entire body of this work was completed in just under three hours. And to be honest, that included me grabbing some well-deserved dinner.

Based on the days spent figuring out list blocks and paragraph tags in the last section, it was nice to get a really easy set of changes. It wasn’t anything challenging, just… nice.

Rounding out the series of translations were the tests for raw HTML, autolinks, and line breaks. As these features were just added in the last article, their tests went in with only a couple of issues, similar in severity to the issues from the leaf blocks.

The largest group of issues encountered involved character encodings in the autolinks feature. Some of those issues were due to Unicode characters being present in the Markdown that needed to be properly encoded and escaped when placed in URIs. Others arose because characters already present in the URIs were special characters and had to be protected from being encoded twice.
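
In Python terms, both flavors of problem can be seen with urllib.parse.quote (illustrative values only, not the project’s actual calls):

from urllib.parse import quote

# Unicode characters in a URI must be percent-encoded as UTF-8.
print(quote("http://example.com/föö", safe="/:"))
# -> http://example.com/f%C3%B6%C3%B6

# Characters that are already special, like the % of an existing
# escape, must be marked safe, or they get encoded a second time.
print(quote("http://example.com/f%C3%B6%C3%B6", safe="/:%"))
# -> http://example.com/f%C3%B6%C3%B6
print(quote("http://example.com/f%C3%B6%C3%B6", safe="/:"))
# -> http://example.com/f%25C3%25B6%25C3%25B6  (double-encoded!)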

However, the most annoying issues were differences in the language libraries that caused the translator to URI-encode a different set of characters than the GFM specification shows. Specifically, it looks like the CommonMark parser uses the JavaScript libraries to encode URIs, while the PyMarkdown project uses the Python libraries. I wasn’t too concerned with these issues at the time, so I made sure to add some notes to address these concerns later and kept on marching forward.

The big catch with these changes was with the scenario test for the GFM specification’s example 641:

<a href='bar'title=title>

While it might seem like a small difference when looking at a web page, the PyMarkdown parser emitted an HTML block with the tag as its content, instead of emitting a simple paragraph containing text, as follows:

<p>&lt;a href='bar'title=title&gt;</p>

Looking at the HTML output in the example, it is very clear that it should be a paragraph containing text, but somewhere in the copy-and-paste process I had accepted the wrong tokens as correct. Digging into this issue, I quickly found that one of the functions in the HtmlHelper module was not checking for whitespace between a tag’s attributes, therefore treating the tag as valid when it was not. Within five minutes, I had a fix implemented and the test corrected, and the last of the currently implemented scenario tests was now complete!
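
To illustrate the kind of check that was missing (a hypothetical reconstruction, not the actual HtmlHelper code), the GFM specification requires at least one whitespace character before each attribute inside a tag:

import re

# Each attribute must be preceded by whitespace; attribute values may
# be single-quoted, double-quoted, or unquoted.
RAW_TAG = re.compile(
    r"^<[a-zA-Z][a-zA-Z0-9-]*"
    r"(\s+[a-zA-Z_:][a-zA-Z0-9_.:-]*"
    r"(=('[^']*'|\"[^\"]*\"|[^\s\"'=<>`]+))?)*"
    r"\s*/?>$"
)

def is_valid_raw_tag(text):
    return RAW_TAG.match(text) is not None

assert not is_valid_raw_tag("<a href='bar'title=title>")  # no whitespace
assert is_valid_raw_tag("<a href='bar' title=title>")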

As a strange aside, mine may not be the only parser that has made this mistake. When I was writing this article in VSCode, as usual, I had to check the example’s Markdown a few times. The Markdown in the above section was generated with the following fenced code block:

```Markdown
<a href='bar'title=title>
```

It’s a pretty simple example, and I typed it in as I usually do: create the fenced block, add in the text from its source, verify it a couple of times, and then add in the language specifier. As soon as I typed in Markdown, a strange thing happened. The a for the tag name turned a deep blue, as expected. But then both attribute names, href and title, turned light blue, while the attribute values, bar and title, turned red. I added a space before title and then deleted it, repeating this experiment a couple of times, looking for color changes. There were none. For whatever reason, the coloring library that VSCode uses to color Markdown text seems to believe that example 641 contains valid Markdown. Weird!

What Was My Experience So Far?

In writing automation tests as part of my professional job, I have a clear understanding of my responsibility to my team. I am not there to break their code or to find flaws in it; I am there to help them find issues before they become problems. If possible, I stress automated tests over manual tests, and I also stress adopting consistent processes to shine a very focused light on anything that is different.

The fact that I found a couple of small issues with the parser, even at this late stage of the project, is fine with me. I am helping my team (me) find these issues before the code is released and impacts other people and their processes. While it was a large pain to go through, I felt that I closed part of the testing loop by consistently adding HTML output verification to each parser scenario test. That those issues were found at all proves the verification’s worth. In addition, there was a small efficiency and confidence boost because I no longer have to guess whether I chose the right tokens. The HTML output from the examples proved that I made the right choices.

In the end, what it boils down to for me is that while adding the HTML output verification to each test was painfully monotonous at times, it paid off. Even though only a handful of issues were uncovered, each one boosted my confidence in the project. And regardless of the issues found, knowing that the tokens the parser generated could be properly translated into the GFM specification’s HTML output was worth it on its own. No more questioning whether the tokens would translate into the proper HTML… I now had proof that they did!

What is Next?

Whenever a protocol specification states something like “and here is a suggested way of…”, it usually means that at least two or three groups implementing the specification had issues. So it was with a bit of dread and a bit of confidence that I started looking at dealing with inline emphasis, the topic of the next article.
