Introduction¶
In the wrap-up for my last article, I provided extra information on my decision to write a Markdown parser as part of the PyMarkdown project. My belief is that writing an effective Markdown linter requires a parser that emits an intermediate set of tokens instead of the usual HTML output. Accordingly, from the start of the project, all the scenario tests for the parser have dealt with those generated tokens and not the usual HTML output. Based on my experience to date, both the belief in a non-HTML parser and the decision to test the tokens have proven to be the right choices. During development, I have not observed a specific case where comparing the output to HTML instead of the intermediate tokens would have been substantially beneficial. However, even with that in mind, the lack of a direct comparison between the output specified in the GFM specification and the output of the parser started to erode my confidence, even if just fractionally.
The questions that came to mind were simple ones. Did I keep the right information when parsing the Markdown to enable an HTML transformation of the tokens? Did I pick the right token to represent a Markdown element in the token stream? Do the tokens contain enough information to allow me to write linting rules against them? For me, these questions are relevant given the nature of the project.
Looking at those three questions, I quickly realized that answering the last one was impossible until I started writing the linting rules. Sure, I could take a guess, but that is all it would be. However, I realized that I could probably answer the first two questions, and that there was significant benefit to be gained from doing so. If I could write a token-to-HTML translator and apply it to the token stream, then when the HTML output for all scenarios matched, I would have answered the first question. And while I could not answer the second question completely, if the translation to HTML was simple enough, I would have shown that I was making good choices for the representative tokens. While I cannot prove that those choices are perfect until the rules are written, I can at least prove to myself that my token choices are headed in the right direction.
What Is the Audience For This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project's GitHub repository and consult the commits between 29 February 2020 and 06 March 2020.
Finishing Up the Scenario Tests¶
For me, this task was essentially a bookkeeping issue. As the remaining features were the emphasis elements and the link elements, it felt like the right time to march through the remaining scenarios and implement them as scenario tests.
These scenario tests fell into two categories. In the first category, if the scenario needed to test a failure using already implemented features, I copied over an existing test, changed the Markdown input, executed the new test, and copied the tokens from the newly executed test output into that test, manually verifying the tokens as I copied them. Basically, I figured that if the scenario test was failing in a manner that would not change even when the owning feature was implemented, then completing the test was the best choice for the project.
In the second category, I did the same thing except I stopped before the execution step, instead adding a @skip tag to the test's definition. In this way, I was able to add the bulk of the remaining tests without having tests that would obviously fail getting in the way.
While this may have seemed like busy work, it was meant to give me a realistic picture of how close I was to finishing the parser, and it worked. By executing pipenv run pytest, I ran every test and was able to look for entire modules with skipped tests, getting a good indication of what features and testing were left. From a holistic point of view, it was good to see that out of the 875+ tests in the project so far, there were only just over 100 tests left to go before the parser would be completed. Being able to see how close I was to finishing the parser was definitely worthwhile!
Adding Test Grouping¶
I knew from the start that this would be the monotonous part, so I tried to make sure that I could stage the changes as I brought more scenario tests online. The first thing I did was to add a new marker for PyTest to the project by adding this line to the setup.cfg file for the project:
markers=gfm
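For reference, pytest reads custom markers from its own section of setup.cfg, so that line sits under a section header; the surrounding section looks something like this (the exact layout in the project's file may differ):

```ini
[tool:pytest]
markers=gfm
```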
This change allowed me to use one of PyTest’s grouping features: marking. By changing a scenario test’s definition from:
def test_atx_headings_032():
    ...
to:
@pytest.mark.gfm
def test_atx_headings_032():
    ...
I included that test in the gfm group. While I could still execute that test by itself by entering pipenv run pytest -k 032, I could now execute all the tests in the gfm group by entering pipenv run pytest -m gfm.
This command was invaluable during development of the parser. After adding HTML translation support to a scenario test, I ensured that it was added to this group, thereby staging the scenario test with its owning feature. After completing the change to make the test pass, I then executed all the tests in the gfm group to ensure that I didn't break anything else in the process. While it caused me some issues from time to time, it was an extra watch over the completed work, one that I appreciated.
Adding Translation into HTML¶
Translating any stream into something requires a loop to process each element in the stream, with some mix of emitting data and altering state. I created the TransformToGfm class to handle that translation, with the transform function as the entry point to facilitate the transformation. At this point in the implementation, this class was very simple. As each token was seen in the loop, its data was emitted with only minor additional processing required.
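As a rough sketch of what such a dispatch loop looks like (the token names and fields below are simplified assumptions, not PyMarkdown's actual internals):

```Python
# A minimal sketch of a token-to-HTML translation loop; the token names
# and fields are illustrative assumptions, not PyMarkdown's actual API.
class TransformToGfm:
    def transform(self, tokens):
        output = ""
        for token in tokens:
            if token.name == "paragraph":
                output += "</p>" if token.is_end else "<p>"
            elif token.name == "atx-heading":
                tag = "h" + str(token.level)
                output += ("</" + tag + ">") if token.is_end else ("<" + tag + ">")
            elif token.name == "text":
                output += token.text
        return output
```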
Adding this support into existing tests was easy, just monotonous. Using the same test 32 that was used in the prior example, that test changed from:
@pytest.mark.gfm
def test_atx_headings_032():
    """
    Test case 032: Simple headings
    """
    # Arrange
    tokenizer = TokenizedMarkdown()
    source_markdown = """some markdown"""
    expected_tokens = [...]

    # Act
    actual_tokens = tokenizer.transform(source_markdown)

    # Assert
    assert_if_lists_different(expected_tokens, actual_tokens)
to:
@pytest.mark.gfm
def test_atx_headings_032():
    """
    Test case 032: Simple headings
    """
    # Arrange
    tokenizer = TokenizedMarkdown()
    transformer = TransformToGfm()
    source_markdown = """some markdown"""
    expected_tokens = [...]
    expected_gfm = """<p>some markdown</p>"""

    # Act
    actual_tokens = tokenizer.transform(source_markdown)
    actual_gfm = transformer.transform(actual_tokens)

    # Assert
    assert_if_lists_different(expected_tokens, actual_tokens)
    assert_if_strings_different(expected_gfm, actual_gfm)
In order of appearance, an instance of the TransformToGfm class was added, the expected_gfm variable was set to the expected HTML, the transform function was called, and then the contents of the expected_gfm variable were compared against the output from the transform function. Except for the expected changes to the expected_gfm variable for each test, this transformation was repeated for each test as support for the feature it enabled was added.
Translating the Leaf Blocks¶
Translating the leaf block tokens added in this article and this article proceeded quickly, encountering only a few unexpected issues. These issues fell into two categories: the handling of HTML blocks and code blocks, and the handling of newlines in the HTML output.
Most of the issues that were uncovered for leaf blocks dealt with the processing of HTML blocks and code blocks. As mentioned in previous articles, these two leaf blocks are special in that they maintain firm control over the formatting of their content. To accommodate these two leaf block types, the handling of the TextMarkdownToken was changed to meet the stricter output requirements of those blocks, mostly ensuring that whitespace was preserved. Other than that, the only remaining changes were to alter most of the tokens to expose certain fields, allowing the translator to access the token's attributes cleanly.
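To illustrate the kind of special-casing involved, here is a hedged sketch; the flags and token fields are my own illustrative assumptions:

```Python
# Sketch: text tokens inside HTML blocks and code blocks keep their
# whitespace verbatim, while other text can be emitted more loosely.
# The flags and the token's text field are illustrative assumptions.
def emit_text(token, in_html_block, in_code_block):
    if in_html_block or in_code_block:
        return token.text          # preserve whitespace and newlines exactly
    return token.text.strip()      # other blocks tolerate normalized whitespace
```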
From a rendering viewpoint, I had guessed from the start that any newlines in the HTML output were going to be a problem, and I was right. While the GFM specification is purposefully vague on when to add newlines in the translation from Markdown to HTML, it was a vagueness that I could not avoid. As the main push for this article's work was to add proper comparisons of the GFM's HTML output for each example, I had a hard choice to make. Either I modified each of the 673 scenarios as I copied their HTML output into the scenario tests, or I ensured that the translation replicated the HTML exactly.
After a lot of thinking, I decided to go with the exact HTML output path, hopefully removing any possible errors that might have crept in while transferring the HTML output for each scenario test. When I thought about both options, I just felt that I would almost instantly regret making any changes to the HTML output, as it would no longer be synchronized with the GFM specification. Considering that, I figured it was better to be consistent and do a bit more work on the project than to start changing the scenarios.
My current focus was on enabling the HTML comparisons, and I knew it was going to take more effort and time to get them right. As such, I decided to add a fair number of "if this looks like" conditions to decide whether to add newlines, with plans to refactor the code into better groupings down the road. I do not like adding technical debt just for the sake of expediency, but I felt it was the right decision at the time. I figured that adjusting the translator with little tweaks here and there would give me a more complete picture of what needed to be done for a proper refactor later. It was not a perfect decision, but it was one that I felt I could live with.
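In practice, those conditions looked something like the following sketch; the specific checks are invented examples of the pattern, not the project's actual rules:

```Python
# Sketch of the "if this looks like" newline handling described above.
# The specific token-name checks are invented examples of the pattern.
def newline_before(previous_token, current_token):
    if previous_token is None:
        return ""                   # nothing precedes the first token
    if previous_token.name == "fenced-code-block" and current_token.name == "text":
        return ""                   # code blocks control their own newlines
    if current_token.name in ("paragraph", "atx-heading", "thematic-break"):
        return "\n"                 # block-level starts usually begin a new line
    return ""
```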
When all the leaf blocks were completed, I did notice a decent boost in my confidence. Except for some issues with getting newlines right, the translation of leaf blocks was straightforward. Knowing that I had made good decisions so far gave me that boost… something that I would need sooner rather than later.
Translating the Container Blocks¶
While the translation of the leaf blocks went smoothly, I hit a number of issues dealing with the container blocks added in this article. While the block quotes themselves were not really an issue, it was the list blocks that caused me a lot of trouble.
In implementing the list block support in the translator, I quickly got to a point where the tags themselves were being emitted properly, but the whitespace around the tags was off, especially the newlines. That was frustrating, but with some helpful observations and experimentation, I was able to take care of it relatively quickly.
Following that triumph, I spent a few aggravating days trying to figure out why some list items contained <p> tags and why some list items didn't. I tried a couple of approaches based on the surrounding tags and tokens, but each of them failed. It wasn't until I took another look at the specification's lists section that I noticed the following paragraph:
A list is loose if any of its constituent list items are separated by blank lines, or if any of its constituent list items directly contain two block-level elements with a blank line between them. Otherwise a list is tight. (The difference in HTML output is that paragraphs in a loose list are wrapped in <p> tags, while paragraphs in a tight list are not.)
That was the information I was searching for! While the actual implementation is a bit more complicated than just that, that is the essence of when I needed to add the paragraph tags.
The complications in implementation arose as the examples became more complex. For example, based on the above description, it is easy to see that this modified example 294 is a tight list:
- a
- b
- c
and this unmodified example 294 is a loose list:
- a
- b

- c
From the above lists section quote, since there is a blank line that separates two of the list elements, it is a loose list. Pretty easy and straightforward. Implementing this aspect of looseness was decently easy but did require some non-trivial extra code: go back to the start of the current list, then go through each list element, looking to see whether the Markdown element before it is a blank line. If so, mark the entire list as loose and apply that throughout the list.
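A minimal sketch of that check, assuming invented token names for list items and blank lines:

```Python
# Sketch of the looseness check described above: walk over the current
# list's tokens and mark the list loose if any list item is immediately
# preceded by a blank line. Token names are illustrative assumptions.
def is_list_loose(tokens, list_start_index, list_end_index):
    for index in range(list_start_index + 1, list_end_index):
        if tokens[index].name == "list-item":
            if tokens[index - 1].name == "blank-line":
                return True
    return False
```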
However, when dealing with lists and sublists, it was not so simple. For example, consider the Markdown from example 299:
- a
  - b

    c
- d
Understanding that Markdown list blocks can be nested and following the guidance from the above quote, you can deduce that the outer list is tight and the sublist is loose. To make the leap from handling the previous two examples to this example meant that I needed to find a way to add scoping to the translation. Without scoping, when the translator processed the above example, it saw three items in the same list, with the second element making the entire list loose. Scoping was required to allow the translator to determine that the a and d items were in one list and the b/c item was in its own list, therefore determining the correct looseness for both lists.
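One hedged way to picture that scoping is a stack with one looseness flag per open list, so a blank line only affects the innermost list; again, the token names are illustrative assumptions:

```Python
# Sketch of scoped looseness: a stack keeps one flag per open list, so a
# blank line before an item in a sublist only marks that sublist loose.
# Token names are illustrative assumptions, not PyMarkdown's internals.
def compute_list_looseness(tokens):
    open_lists = []                 # one looseness flag per open list
    results = []                    # looseness of each list, in closing order
    last_was_blank = False
    for token in tokens:
        if token.name == "list-start":
            open_lists.append(False)
        elif token.name == "list-item" and last_was_blank and open_lists:
            open_lists[-1] = True   # blank line separated two items: loose
        elif token.name == "list-end":
            results.append(open_lists.pop())
        last_was_blank = token.name == "blank-line"
    return results
```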
The code itself to handle scoping was simple, but the tradeoff was that the translator was slowly becoming more complicated, something that I was not happy about. It was not in dangerous territory yet, but it was something to keep watch on. In addition, while the difference between a list being loose or tight is obvious in hindsight, at the time it was very annoying. It took me the better part of four days to do something that was obvious. Even better than obvious was the fact that it was plainly spelled out in the specification. But as I had to do a number of other times during this project, I picked myself up, dusted myself off, and continued on with the inline translations.
Translating Inlines - Backslash Escapes, Character References, and Code Spans¶
During this translation process, first I hit a high, then I hit a low, and then I saw the payoff of both with the translation of these inline elements into HTML. These elements consist of the backslash escapes, character references, and code spans elements, and were added in this article. Apart from a couple of typos and the code spans, adding support for these features flew past quickly. The backslash escapes and character references were already being processed along with the text tokens, which in turn were already tested with the leaf blocks. The only new code needed was for code spans, but those additions were quickly made by copying the work done for code blocks and simplifying it a bit. Once those typos were corrected, the entire body of this work was completed in just under three hours. And to be honest, that included me grabbing some well-deserved dinner.
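For the code spans, the translation itself is small: per the GFM specification, a code span's content goes inside <code> tags with HTML-significant characters entity-encoded. A minimal sketch, assuming the token already carries its collected text:

```Python
# Sketch of a code span translation; assumes the token already holds the
# span's collected text. Per the GFM spec, the content is emitted inside
# <code> tags with HTML-significant characters entity-encoded.
def emit_code_span(token):
    text = token.text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return "<code>" + text + "</code>"
```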
After the days spent trying to figure out list blocks and paragraph tags in the last section, it was nice to get a really easy set of changes. It wasn't anything challenging, just… nice.
Translating Inlines - Raw HTML, Autolinks, and Line Breaks¶
Rounding out the series of translations were the tests for raw HTML, autolinks, and line breaks. As these features were just added in the last article, the tests for them were added with only a couple of issues, similar in severity to the issues from the leaf blocks.
The largest group of issues encountered involved character encodings in the autolinks feature. Some of those issues were due to Unicode characters that were present in the Markdown but needed to be properly encoded and escaped when placed in URIs. Others arose because some characters already present in the URIs were special characters and had to be escaped to prevent them from being encoded twice.
However, the most annoying issues were differences in the language libraries that caused the translator to URI-encode a different set of characters than in the GFM specification. Specifically, it looks like the CommonMark parser uses the JavaScript libraries to encode URIs, while the PyMarkdown project uses the Python libraries. I wasn't too concerned with these issues at the time, so I made sure to add some notes to address these concerns later and kept on marching forward.
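To see the mismatch concretely, here is a short illustration; the URI is a made-up example. Python's urllib.parse.quote escapes a different default set of characters than JavaScript's encodeURI, though choosing the safe set carefully can approximate encodeURI's behavior:

```Python
# Illustration of the library difference described above; the URI is a
# made-up example. Python's quote() percent-encodes characters such as
# '?', '=', and '+' by default, while JavaScript's encodeURI leaves them
# alone; the explicit safe set below approximates encodeURI's behavior.
from urllib.parse import quote

uri = "foo?bar=ä+β"                              # hypothetical autolink target
print(quote(uri))                                # foo%3Fbar%3D%C3%A4%2B%CE%B2
print(quote(uri, safe="!#$&'()*+,/:;=?@-._~"))   # foo?bar=%C3%A4+%CE%B2
```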
The big catch with these changes was with the scenario test for the GFM specification’s example 641:
<a href='bar'title=title>
While it might seem like a small difference when looking at a web page, the PyMarkdown parser emitted an HTML block with the tag as its content instead of emitting a simple paragraph containing text, as follows:
<p><a href='bar'title=title></p>
Looking at the HTML output in the example, it is very clear that it should be a paragraph containing text, but somewhere in the copy-and-paste process I had accepted the wrong tokens as correct. Digging into this issue, I quickly found a single omission: one of the functions in the HtmlHelper module was not checking for whitespace between the tag's attributes, and therefore considered the tag valid when it was not. Within five minutes, I had a fix implemented and the test corrected, and the last currently implemented scenario test was complete!
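As an illustration of the rule involved (not the HtmlHelper module's actual code), the GFM specification requires each attribute in a raw HTML tag to be preceded by whitespace, so a validity check has to insist on it:

```Python
# Illustrative check, not the HtmlHelper module's actual code: per the
# GFM spec, every attribute in an open tag must be preceded by
# whitespace, so the pattern requires \s+ before each attribute.
import re

ATTR_VALUE = r"""([^ \t\n"'=<>`]+|'[^']*'|"[^"]*")"""
ATTRIBUTE = r"\s+[a-zA-Z_:][a-zA-Z0-9_.:-]*(\s*=\s*" + ATTR_VALUE + r")?"
OPEN_TAG = re.compile(r"^<[a-zA-Z][a-zA-Z0-9-]*(" + ATTRIBUTE + r")*\s*/?>$")

print(bool(OPEN_TAG.match("<a href='bar' title=title>")))  # True: valid tag
print(bool(OPEN_TAG.match("<a href='bar'title=title>")))   # False: no whitespace
```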
As a strange aside, mine may not be the only parser that has made this mistake. When I was writing this article in VSCode, as usual, I had to check the example's Markdown a few times. The Markdown in the above section was generated with the following fenced code block:
```Markdown
<a href='bar'title=title>
```
It's a pretty simple example, and I typed it in as I usually do: create the fenced block, add in the text from its source, verify it a couple of times, and then add in the language specifier. As soon as I typed in Markdown, a strange thing happened. The a for the tag name turned a deep blue, as expected. But then both attribute names, href and title, turned light blue, while the attribute values, bar and title, turned red. I added a space before title and then deleted it, repeating this experiment a couple of times, looking for color changes. There were none. For whatever reason, the coloring library that VSCode uses to color Markdown text seems to believe that example 641 contains valid Markdown. Weird!
What Was My Experience So Far?¶
In writing automation tests as part of my professional job, I maintain a clear distinction in my responsibility to my team: I am not there to break their code or to find flaws in it; I am there to help them find issues before they become problems. Where possible, I stress automated tests over manual tests, and I also stress adopting consistent processes to shine a very focused light on anything that is different.
The mere fact that I found a couple of small issues with the parser, even at this late stage of the project, is fine with me. I am helping my team (me) find these issues before the code is released and impacts other people and their processes. While it was a large pain to go through, I felt that I closed part of the testing loop by consistently adding HTML output verification to each parser scenario test. That the issues were found at all proves the effort's worth. In addition, there was a small efficiency and confidence boost because I no longer have to guess whether I chose the right tokens. The HTML output from the examples proved that I made the right choices.
In the end, what it boils down to for me is that while adding the HTML output verification to each test was painfully monotonous at times, it paid off. Only a handful of issues were uncovered, but finding even one boosted my confidence in the project. Regardless of how many issues were found, knowing that the tokens the parser was generating were being properly translated into the GFM specification's HTML output was worth it. No more questioning whether the tokens would translate into the proper HTML… I now had proof that they did!
What is Next?¶
Whenever a protocol specification states something like "and here is a suggested way of…", it usually means that at least two or three groups of people implementing the specification had issues. So it was with a bit of dread and a bit of confidence that I started looking at dealing with inline emphasis, the topic of the next article.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.