In my last article, I talked about my continuing work on the PyMarkdown project, and how I evaluated adding new features to the project. In this article, I talk about going back and ensuring that one of the foundations of any word-based parser is solid: whitespace handling.
It All Starts With Whitespace
In my experience writing over ten different parsing engines over the years, the one aspect that does not get enough attention is the definition of whitespace and how it applies to the parser. For code parsers, whitespace is often seen as unimportant, a space-filler between the “real” focus of the language: the keywords. Even Python, with its indent-based semantics, is not far from that appraisal.
But when translating from one language or format to another, whitespace can play a key role in deciding how something is interpreted. That is how things are structured with Markdown, specifically GitHub Flavored Markdown or GFM. As one of the main goals of Markdown is to allow people to author documents in a near-authentic format, adding any unnecessary verbiage or formatting was frowned upon. Specifically, the GFM specification states:
The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.
To preserve that readability, whitespace is used creatively in Markdown. A good example of this is the Indented Code Block. This is a section of the document, often computer related code, which needs to be passed unaltered to the reader. As the name suggests, each line in this code block is preceded by an indent of at least four spaces. While it does not have the language naming capability of its sibling the Fenced Code Block, the Indented Code Block supplies the same code block experience without negatively affecting the readability.
But to support different experiences in different situations, the use of whitespace must be broken down a bit.
Whitespace and Markdown
To keep things simple, Markdown provides for three types of whitespace.
The space character is written as a literal space or \u0020 depending on the format. In many of the cases where a space is allowed, a tab character (\u0009) is allowed and is treated as up to four spaces, as if the tab character advanced to the next tab stop. The fun part there is that most developers think that the tab character is unilaterally expanded to four spaces, instead of interpreting it as a tab stop. Therefore, when reading the specification, any reference to spaces means a specified number of space characters or tab characters.
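To make the tab stop distinction concrete, here is a minimal sketch in Python. The function names are made up for illustration and are not taken from PyMarkdown, and the sketch assumes tab stops every four columns. Note that it physically expands the tabs to show the column arithmetic, which is a simplification: a real parser may only need to compute the column.

```python
TAB_STOP = 4  # assumption: tab stops fall on every fourth column


def expand_tabs_to_stops(line: str, tab_stop: int = TAB_STOP) -> str:
    """Replace each tab with the spaces needed to reach the next tab stop."""
    pieces = []
    column = 0
    for ch in line:
        if ch == "\t":
            # A tab advances to the next multiple of tab_stop (1 to 4 columns).
            width = tab_stop - (column % tab_stop)
            pieces.append(" " * width)
            column += width
        else:
            pieces.append(ch)
            column += 1
    return "".join(pieces)


def naive_expand(line: str) -> str:
    """The common misreading: every tab is always exactly four spaces."""
    return line.replace("\t", "    ")


sample = "  \tfoo"  # two spaces, then a tab, then text
print(repr(expand_tabs_to_stops(sample)))  # tab only fills up to column 4
print(repr(naive_expand(sample)))          # tab always adds four columns
```

For a single line, Python's built-in `str.expandtabs(4)` gives the same result as `expand_tabs_to_stops`; the loop is written out only to show where the "up to four spaces" comes from.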
Expanding from that simple definition of spaces is the definition for whitespace characters. This definition includes both the space character and the tab character but adds the newline character (\u000a), line tabulation character (\u000b), form feed character (\u000c), and carriage return character (\u000d). Basically, this allows for a whitespace construct that includes any character that is typically used to separate words, lines, or paragraphs.
Finally, there is the Unicode whitespace character. This is a more expansive version of the whitespace character, adding all the characters from the Unicode Zs category and subtracting the line tabulation character. While I am clear on the prior use of the line tabulation character, I am less clear on why it has fallen out of use. My best guess is that with modern systems, the concept of scrolling down to a given line is outdated, and hence its inclusion into whitespace is less important than it once was. But as it is in the specification, I follow it!
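The three definitions above can be written down as three small predicates. This is only a sketch of my reading of the definitions, not the code used in PyMarkdown, and the function names are hypothetical. It relies on Python's standard `unicodedata` module to look up the Zs (space separator) category.

```python
import unicodedata

# Type 1: the characters counted as "spaces" (space or tab).
SPACES = frozenset("\u0020\u0009")

# Type 2: "whitespace characters" add newline, line tabulation,
# form feed, and carriage return.
WHITESPACE_CHARACTERS = SPACES | frozenset("\u000a\u000b\u000c\u000d")

# Type 3 starts from these control characters (no line tabulation)...
UNICODE_WHITESPACE_CONTROLS = frozenset("\u0009\u000a\u000c\u000d")


def is_space_or_tab(ch: str) -> bool:
    return ch in SPACES


def is_whitespace_character(ch: str) -> bool:
    return ch in WHITESPACE_CHARACTERS


def is_unicode_whitespace(ch: str) -> bool:
    # ...and adds every code point in the Unicode Zs category,
    # which includes the ordinary space character itself.
    return ch in UNICODE_WHITESPACE_CONTROLS or unicodedata.category(ch) == "Zs"


# A no-break space is Unicode whitespace but not a "whitespace character".
print(is_whitespace_character("\u00a0"), is_unicode_whitespace("\u00a0"))
# Line tabulation is the reverse.
print(is_whitespace_character("\u000b"), is_unicode_whitespace("\u000b"))
```

Keeping the three sets separate, instead of using one catch-all like `str.isspace()`, makes it harder to apply the wrong definition to a given element by accident.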
Getting these definitions clear in my head was important to me. If I am going to increase my confidence that I have the right type of whitespace selected for a given element, I need to know what those whitespace types are.
And before anyone asks… no, this is not glorious work. But this is necessary work. The items in the whitespace section have been on the issues list for almost as long as the project has been around. I need to make sure that I have a clear implementation of the whitespace handling for the project. Not for any sense of ego or anything like that, but to make sure it is right for the users. I want that confidence that I implemented the whitespace handling correctly.
Issue 456 - Cleaning Up Whitespace
When I was originally coding the parser behind PyMarkdown, I did not understand
these distinct types of whitespace as I do now. To bring my understanding more
in line with the specification, I have made minor changes to the way the parser
handles whitespace over the last two years. But in the back of my mind, I was always
concerned that I missed something. Hence, there was a section in the original
issues.md file that dealt with whitespace and correctness. Having taken a quick
look at that section again, I thought it was an opportune time to finally resolve any confidence
issues I had with whitespace.
Opening a new issue, I decided to take a multi-task approach to solving any outstanding issues. The first part of that was making sure that I had a clear understanding of what the three distinct types of whitespace were. Once that was taken care of, I carefully went through the specification and created a list in the issue’s ticket, detailing each element and the types of whitespace contained within that element.
The results of that list were interesting: there were four types of whitespace involved, not three. The most prevalent whitespace type was the spaces type, which seems to be the default whitespace type, appearing in ten elements. From there, the whitespace characters type was the next most popular, appearing in six of the elements. In a tie for last are the whitespace type used in link labels and the whitespace type used in emphasis. The emphasis type is the Unicode whitespace type and is very expansive in what it includes.
The new type is the Unicode case fold used to compare link labels with a matching Link Reference Definition element. The best description that I found is in this project and basically walks through the answer to the question: how to do a case-insensitive match with a full Unicode codeset. While I could try to describe the folding process more, I believe the article does a much better job of explaining it and would suggest any readers check that article out if they want more information.
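As a rough illustration of what case folding buys over simple lowercasing, here is a small sketch of link label normalization in Python. It assumes that Python's `str.casefold()` is an acceptable stand-in for the Unicode case fold the specification refers to, and that labels are compared after trimming and collapsing runs of whitespace characters. The function name is hypothetical and this is not the PyMarkdown implementation.

```python
import re

# Assumption: the six "whitespace characters" are what gets trimmed and collapsed.
_WHITESPACE = "\u0020\u0009\u000a\u000b\u000c\u000d"
_WHITESPACE_RUN = re.compile("[" + _WHITESPACE + "]+")


def normalize_link_label(label: str) -> str:
    """Normalize the text between the brackets of a link label for matching."""
    trimmed = label.strip(_WHITESPACE)
    collapsed = _WHITESPACE_RUN.sub(" ", trimmed)
    # casefold() is more aggressive than lower(): for example, it maps the
    # German sharp s to "ss", so labels that differ only by case still match.
    return collapsed.casefold()


print(normalize_link_label("  Foo\n   BAR "))
print(normalize_link_label("Straße") == normalize_link_label("STRASSE"))
print("Straße".lower() == "STRASSE".lower())
```

The last two lines show the difference: the case-folded labels compare equal, while the lowercased strings do not.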
Creating The Scenario Tests
With the list compiled, I started the process of adding a new set of scenario test cases to the project. As I am a big proponent of Test-Driven Development, it was imperative that I implemented the tests first so I could understand the scope of what I needed to deal with. As I have detailed in earlier articles, this is not a fun process, but a necessary one. Over three days, I added 166 new scenario tests to the project, all dealing with testing how whitespace is handled for each of the elements in the list. As emphasis is the only element to use the Unicode whitespace type, and I want to test Unicode punctuation support at the same time, I left emphasis scenario tests out of that list for now.
It was a brutal process, but I was used to it (somewhat) by now. Working through the list, I copied a series of tests from the series before it, changed the function name and the function description, and changed the internals. Then the process was a simple one: change the Markdown sample to demonstrate the scenario, verify with Babelmark that CommonMark produces the same HTML results that I calculated, and then run the tests themselves. With the test output, I then verified that the HTML output matched CommonMark and that the generated PyMarkdown tokens looked correct. Lather, rinse, repeat. Many times.
More Work Yet To Come
I am done generating the 166 new scenario tests. Along the way, I noticed that 33 of those scenario tests were not functioning properly, so I marked them as skipped. My plan for this week is to start attacking that list and see how many of the scenario tests I can get working by the end of the weekend. Since it is the Labor Day holiday weekend, I am going to take it somewhat easy, but I still want to make significant progress on this task.
On one hand, the items have lasted almost two years without any progress, so what is another week? But that is not good enough for me! Now that I have an idea of the work involved, I want to solve any issues and finally put whitespace issues to rest. Stay tuned for the progress!
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.