Introduction

With the project requirements, the test framework, and the test strategy in place, it was time to start working on the most frequently used and easy-to-parse Markdown items. These Markdown blocks, referred to as Leaf Blocks in the GitHub Flavored Markdown (GFM) Specification, are the root of many Markdown documents and have the virtue of being easy to parse. With small exceptions, each of the Leaf Blocks is self-contained. For the most part, those exceptions arise in how the Leaf Blocks interact with each other. In all cases, this interaction is small and does not require complicated logic to understand.

The full record of the work detailed in this article is documented in the project’s GitHub repository in the commits that occurred between 30 November 2019 and 05 December 2019. This work includes creating the scenario tests for all the Leaf Blocks as documented in the GFM specification and implementing the parsing to pass all of those tests except for the Link Reference Definitions, HTML Blocks, and Tables.

While the documentation of what needed to be done (GFM Specification) and what was done (GitHub commits) is straightforward, the “hows” and “whys” of what I implemented is worth talking about. The process that I followed for the implementation of the Leaf Blocks did not uncover any development issues during implementation. However, without giving too much away, the same process applied to other block types (to be talked about in future articles) did uncover issues that were not so easy to resolve. As there were complications that arose with those feature implementations, I wanted to provide a consistent documentation of the process from the beginning, to provide a complete picture of how things progressed. I firmly believe that it is always good to show the complete story of what happened, and not only one side of the story. Let’s go!

Moving Forward with Implementation

Even though the first commit for processing Markdown elements is on 30 November 2019, my work on implementing them started on 25 November 2019. Based on the test framework and strategy documented in previous articles, the first thing to do was to write the scenario tests cases, even if most of those tests were initially disabled or skipped. This was easily done by annotating each test function with @pytest.mark.skip. Once I implemented the code to satisfy a given test, I removed that skip annotation for that specific test. While I would made modifications on how I disabled tests later, this was a good point for me to start off at.

What Was the Workflow?

From the outset, the basic implementation workflow was as follows:

  1. figure out the next section to work on
  2. figure out the next section-feature to implement
  3. enable the relevant tests for that section-feature
  4. add or change the code in tokenized_markdown.py to implement that feature
  5. execute all enabled tests, with special attention to the feature added in item 4.
  6. if there were any test errors; debug, fix and go back to item 4.
  7. stage the changes in the project before
  8. if there are more features in the current section, go back to item 2.
  9. verify each test case’s input and output against the specification
  10. if any verification errors are found; debug, fix and go back to item 4.
  11. if there are any leaf block sections left to work on, go back to item 1.

It was not really glamourous, but it worked well. Looking closely at the list, it is easy for me to see why… I took an agile approach without really being aware of it. According to the Wikipedia article on Agile Software Development, there are a number of good practices that I was following. Because I was doing testing as I went, the is a good argument to be made that I was practicing Agile Testing and Test Driven Development. As the tests are also the acceptance criteria for this stage of the project, Acceptance Test Driven Development could also be tacked on to those two Agile practices. Finally, as the workflow is iterative by its very nature, the workflow also qualifies as Iterative and Incremental Development. All in all, I see a few solid agile patterns within the workflow.

Agile aspirations aside, the real test of this workflow is that it works for me and works well. I was able to stick to the process easily. It very nicely compartmentalized my work into nice iterations that were easy for me to keep in my head. It was also simple enough that if I needed to refocus myself, I just had to figure out where I was in the workflow and where I was in the specification, and I was able to get back to work! In addition, I feel that if I had performed this development as part of a team, the frequent commits and complete with enabled tests would enable me to share my progress with the rest of the team, and solicit their feedback in a quick and iterative manner.

More importantly, at no point in the development practice did I feel that I bit off more than I could handle. Of course, there were times where I was wondering how long it was going to take me and how I would handle some features… I am only human! But the agile nature of how the workflow is structured kept me grounded and focused on the feature that was in front of me. I just reminded myself to keep that focus, and feature by feature, the foundations of the parser came together.

In the end, this workflow was not about being agile or taking easy to implement steps. It is about finding something that works well for the team… namely me.

How Did Things Progress?

The order in which things are tackled is important. Doing the big stuff at the start of the project sometimes pays off, but it can often be demoralizing. Doing the small stuff first can lay some great foundations but miss the larger target due to the smaller focus. To accomplish this for the PyMarkdown project, I broke this part of the project down into 4 groups of Markdown elements. Each group of Markdown elements that were handled added new information to the stream of tokens that were being generated by the parser, allowing for future examination. It was very important to me to ensure that the token stream was kept working, the implementation always moving forwards.

Group 1: Foundational Elements

The first group that I worked on were the rudimentary elements of blank lines, paragraphs, and thematic breaks. This was a good first group to work on, as these were all common Markdown elements that people use and are foundational to the rest of the work. As such, they were good confidence boosters for the tribulations that I expected that would occur later with the more complicated elements.

The only real issue that I had with this first group was due to my lack of confidence about the Markdown specification itself. From my days on the Internet Engineering Task Force, I am used to clear grammar specifications written in Backus-Naur form. However, this specification has no such representation and is written mainly as a series of use cases and text to describe each use case. It took me a while to see that what I perceived initially as a downfall was a bonus. Instead of having to search for examples or to make them up myself, they were already provided. Once I got used to that concept, my confidence increased, and I started to implement each test more quickly than the last one.

While it did not seem like much at the time, at this point the parser was capable of handling the following Markdown:

This is captured in a paragraph.

***

Group 1 Sidebar: Tabs

I started to tackle the GFM specification decision that any tab character is replaced with exactly 4 space characters. For the most part, this had little bearing on the foundational elements, but the subject of tabs versus spaces has ignited programming holy wars that last to this day. I thought it was useful and prudent to deal with it and get it out of the way early.

Smartly, Markdown avoids these arguments with a strong statement that 1 tab character equals 4 space characters, and a decent argument to reinforce that the decision is the right one. Except for the indented code block, every Markdown element is only recognized if it starts with less than 4 spaces. An indented code block line is only recognized if it starts with 4 spaces. Therefore, a shortcut for any indented code block is to start the line with 1 tab character, due to its 1:4 mapping. To be honest, I feel this is brilliant in its simplicity.

Group 2: Headers

The next group that I tackled were the header markers, referred to in the specification as the setext and atx elements. Weird names though they are, they are the up to 6 # characters at the start of the line, or the - or = characters underlining text from a previous paragraph. While the atx elements (the # characters) was straight forward, the ‘underlining’ aspect of the setext element made it interesting. As that element essentially makes the last paragraph a heading, I had to search backwards in the list of generated tokens for the first time.

It was also at this point that I decided to perform some refactoring to better handle string processing. The simple truth about any parser is that it requires gratuitous amounts of “string fiddling” 1. Most efficient parsers work aggressively to parse their documents in a way that minimizes the number of actual strings created while parsing. A good example of efficient “string fiddling” can be seen in the following example of parsing the sentence I have a black dog. When parsing out the word black, the most optimal parsers will find the index of the b in black, then find the space character after the k, using the language’s substring function and those two indexes to create a single string with black in it. Less optimal parsers will find the b append it to the end of an empty string (creating a new string with b), then find the l character and appended it, etc. This can easily cause 6 strings to be created during the parsing of the word black, when only 1 is needed. As some of the Markdown documents that the parser will handle are large, it is important to remember optimizations like this as features are added.

Keeping this in mind, I started looking for “string fiddling” patterns that looked ripe for refactoring. The most obvious one was the determine_whitespace_length function that took care of any tabs in the input data. While I would rip this out later, opting instead to do a simple search-and-replace for tabs at the start of parsing, the determine_whitespace_length function kept things manageable for tabs characters. There were also the extract_whitespace* functions for extracting whitespace and the collect_while_character function for collecting data for a string while the input was a given character. Taking a peek ahead in the specification, it was easy to see that moving the code into those functions was going to pay off.

When it comes down to it, there were no real issues that I experienced with the headers. My confidence was still building from the foundational group above, but there was nothing weird or challenging that I did not handle with a bit of serious thought and planning.

At this point, the parser could handle the following Markdown elements:

# My Markdown

This is captured in a paragraph.

But this is also a header
-------------------

***

Group 3: Indented and Fenced Code Blocks

Marching right along, indented and fenced code blocks were next on the list. Both are used to denote sections of text that are to be represented literally, but one is easier and one is more flexible. The indented code blocks require 4 space characters (or a tab character) at the start of the line to denote the block, and text is presented plainly. However, the fenced code blocks start and end with an equal number of ` or ~ characters and include provisions for naming the type of text used within the code block. This naming allows processors to specify a given style to apply to the code block, allowing processors and style sheets to ‘colorize’ the text according to the the specified type name.

This grouping was easy to process, adding the extract_until_whitespace function to the growing list of helper functions. The interesting part to the code blocks was that I needed to add extra processing of normal text to handle the text within the code blocks. Prior to these code blocks, any text that did not fall into one of the other categories was simply wrapped in a paragraph. Both blocks have specific end conditions, and until those end conditions are met, the collection continues. This meant adding extra code at the start of line parsing to determine if it was within one of the code blocks. If the end condition was met, then the end block token was emitted, and if not, a text block would be emitted without further parsing.

It was at this point that I started seeing the intertwining nature of some of the use cases. An indented code block cannot interrupt a paragraph, but a fenced code block can. So when looking for the indented code block, I had to explicitly disallow one from starting if the block currently being process was a paragraph. While this was only a small case, it became obvious to me from a quick scan over the specification that this type of pattern was going to repeat more than once. As such, I started moving the start and stop logic into their own functions, whether they required it or not. This improved the readability and enabled me to get a better view on what was being handled and where.

At this point, the parser could handle the following Markdown elements:

# My Markdown

This is captured in a paragraph.

```Python
    rt = "1:" + str(1)
```

But this is also a header
-------------------

    code block

***

Please note that the fenced code block specifies python as it’s type, allowing the colorization of the text with the assumption that the code block is Python code.

Group 4: Stopping at a Good Place

Sometimes it makes sense to march forward without much attention to the surroundings, and sometimes it makes sense to stop at a good place along the way. In taking a quick look at HTML blocks, I figured they were going to be tricky, and I had the same determination with the table element. Taking a look at the link reference definitions, I noticed that they required inline expansion of text within the blocks, something that I wasn’t even remotely close to yet. These three leaf blocks were in the final group: the To Be Done Later group.

To ensure that I had a good place to come back to when I was ready for the each of these blocks, I made sure to go through and implement, verify, and then disable each test for every leaf block.

Depending on the leaf block, I handled the disabling of the tests differently. To properly deal with the link reference definitions, I needed the inline processing capabilities that I knew were many weeks away. As such, I kept those tests disabled in the previous documented way of using the @pytest.mark.skip annotation. This was a big shout out to me that these were going to need to be completed after almost everything else.

In the case of any other of the leaf node tests, I captured the current tokens emitted for that case and placed them in the corresponding test. While it might seem weird, my belief was that by testing each test case this way, I would increase overall coverage and possibly hit edge cases not currently documented in a use case. It also meant that once I started implementing the HTML blocks and table blocks, those tests would just start failing in predictable fashion.

What Was My Experience So Far?

It is always easier to look back and see what worked and what did not work, than to observe it at the time. With only a few issues, I personally felt like I dodged a lot of pain due to the specification and planning. While BNF grammars are easy to implement, the general rule is to “be strict in what you generate and lenient in what you accept”. As such, coming up with “valid” parse cases is a task that takes a long time to complete. By having the acceptable test cases as part of the core specification, the time that I would normally spend in the development and testing phase was greatly reduced. True, it took me a while to get used to it, but when I did, it just worked and worked well.

One of the practices that I engaged in during the development of the parser is to liberally spread around print statements as I went. As I was adding these statements, my dominant thought was to collect enough information to determine which pieces of information were the most relevant for log messages to be added later. However, as I proceeded, that information also had the additional benefits of being immensely helpful to debug any parsing issues, and indispensable in the verification of the code itself. While I know I need to remove those statements or convert them before the project is completed, their presence is indeed beneficial.

All in all, I think I had a great start to an interesting project and learned a bit in the process… and learning is always good!

What is Next?

Next up on the list is adding block quote and list support to the parser. Stay tuned!


  1. I remember this term being used all the way back to my university days. The closest I have been able to come to a definition is the Oxford dictionary’s definition: touch or fidget with something in a restless or nervous way. Perhaps this is alluding to amount of work to get most string operations “just right”? 

Like this post? Share on: TwitterFacebookEmail

Comments

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.


Reading Time

~13 min read

Published

Markdown Linter

Category

Software Quality

Tags

Stay in Touch