When I was working on implementing the inline code spans, as detailed in the last article, I performed a thorough scan of the scenario tests and their source data, noting down any issues I found. I knew that I had missed the mark on how to internally represent Atx Headers, and I was curious about how many other things I had missed. Having found a decent handful of issues to fix, I decided to spend some time to address those issues before adding more of the inline processing. In my mind, it was better to take a week and try and keep the issue count low than to continue forward, possibly compounding the cost of fixing those issues. As I am in a somewhat ideal scenario, with nobody pressuring me for features or issue fixes, I figured it was best if I took advantage of that to the best of my abilities and “do it right”.
What Is the Audience For This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 14 February 2020 and 20 February 2020.
Why Refactor Now?¶
As the owner and sole developer on the project, my decision to stop and refactor was any easy one to make. It also did not escape my attention that making a decision like that isn’t always that simple for larger projects. If this were a work project or a team project, the group working on the project would have to meet to figure things out. The team itself usually plays a critical role in assessing the cost and benefit of each task, but in the end it is a balancing act for managers and directors to figure out which tasks are the best ones to focus on at any given time. As it is only me on the PyMarkdown project, I get to conveniently shortcut those conversations in favor of simple decisions based on my own experience.
From my years of experience, I find that there are usually two driving forces that make me support a decision to refactor a project. The first driving force is the cost of any applicable technical debt on product code. As I consider tests integral to a project’s quality, my definition of the term “product code” includes all source code and resources required to execute the project normally and to perform testing used to certify that same project as shippable. With almost 700 scenario tests for PyMarkdown’s parser, even a small change can generate large ripples through the code base, especially the scenario tests. As the scenarios for a feature are only added when that feature itself is added, each feature added therefore adds a significant amount of product code for any change to ripple through. It also follows that each extra test impacted by an issue means an increased cost to fix that issue before proceeding. Therefore, if the goal is to keep overall costs down, refactoring more frequently during development seems logical.
The second driving force is less quantifiable, but equally important to any type of project. That force is the flow of the team working on the project. Based on past projects, I know that I work more efficiently if I focus on a group of similar things for a while, then shift to another group when done. The larger the difference is in the skill sets between the two groups of work, the more relaxed I feel about the work. This sense of relaxation allows me to get into a better flow. As this project is being written by me in the evenings and on the weekends, keeping a good flow for the project keeps me focused and energized about the project. As I am the sole developer on the project at the time, keeping myself motivated in a necessity!
While one force is more empirical and the other is more feelings based, both forces worked together to convince me that it was yet again time to refactor the PyMarkdown project.
Issue 1: SetExt Headers, Paragraphs, and Block Quotes¶
I needed one of the smaller issues to get me warmed up, so after looking over the list
of possible issues to work on, I decided on
this one. During the review of active scenario tests, I noticed that the text
was in the paragraph tags for the specification’s example, but the tokens that the
parser were omitting had them outside of the paragraph. It seemed simple enough but
looks were deceiving.
It took me a bit of digging to find two possible reasons for the string
=== to be kept
inside of the paragraph: laziness and non-interrupts. Working backwards, the
concept of laziness with block quotes and list items is that you can omit any of
the leading characters if the next non-whitespace character after the omitted
characters is text that continues the paragraph. In essence:
> this is a block quote
is equivalent to
> this is > a block quote
That got me halfway there, but there was still the question of how the
=== would be
kept as part of the paragraph and not as a SetExt header indicator. I read the
section on SetExt a couple of times before the following lines clicked:
However, it cannot interrupt a paragraph, so when a setext heading comes after a paragraph, a blank line is needed between them.
Bingo! I wish it did not take multiple reads through that part of the specification, but specifications are sometimes like that. Based on those discoveries, I temporary rewrote the scenario for scenario 63 to:
foo bar ===
The goal of this was to remove the laziness out of the equation while working on the interrupt issue. When I reran scenario test 63, I correctly got a single paragraph with 3 lines in it. Whatever the issue was, it was not just simple parsing of paragraphs. Taking a step forward, I added the block quotes back into the picture, changing that scenario text to:
> foo > bar > ===
It was at this point that the scenario broke, apparently thinking that the third line was a SetExt header and should close off the paragraph. Issue reproduced! To be sure, I applied the laziness concept to the block quote, reverting the scenario back to its original text of:
> foo bar ===
and validated that the behavior was the same.
Armed with a good reproduction case for the issue, good debug output, and a general area in the source code for the cause of the issue, the issue was identified and fixed in quick order. This issue was very specific, so only the one scenario test was impacted, which was a good thing. The problem was that during the development of block quotes, something made me think that anything looking like a SetExt header should close off a paragraph, hence I added code to do just that. I checked the code a couple of times, and that was the only scenario test referencing that code, so I just deleted it.
While the research on the issue was a bit more effort than I originally thought, fixing this issue was a great warm up to the next couple of issues. One issue, one scenario test impacted, and I was refactoring.
Issue 2: Python, Markdown, and Line Continuation Characters¶
The next issue for me to work on was a subtle copy-and-paste error, one that flew under my radar until I took a solid look at it. This issue did not show any indications of failure until I started my scan of the scenario tests. The only reason I found this one was that I went looking for any error, not something specific.
In Markdown, the
\ character at the end of the line is used to denote a hard line
break, not yet to be implemented in the PyMarkdown project. In Python, the
character at the end of the line is used as a line continuation character, telling the
Python interpreter to treat the text before the character and the text after the
character as a single line. Hopefully, any readers see where I am going with this.
When I added the scenario test for
I did a copy-and-paste on the Markdown input to the new scenario test, a process I
have done for 99% of the scenario tests in the project. To accomplish this, I pasted
the following Markdown text between the
""" characters denoting the Markdown to use
resulting in the Python code:
source_markdown = """Foo\ ----"""
After pasting each scenario’s Markdown into the scenario test, I try to ensure that
I replace every instance of the
\ character with the
\\ characters to properly
represent the backslash character in a Python string. As I am
only human, there are times that I forget to do this. Luckily for me, if a
character is not paired up with a valid character to escape, the Python interpreter
will generate a warning like the following:
======================================================== warnings summary ======================================================== test\test_markdown_setext_headings.py:346 C:\old\enlistments\pymarkdown\test\test_markdown_setext_headings.py:346: DeprecationWarning: invalid escape sequence \> expected_gfm = """<h2\>Foo\\</h2>""" -- Docs: https://docs.pytest.org/en/latest/warnings.html
As the backslash character in Markdown is used with punctuation characters and the backslash character in Python is used with alphabetic characters, this is usually a very solid detection scheme for finding copy-and-paste misses. In this case, that check failed.
The good news here is two-fold: an easy fix and a very localized fix. This fix was easy as I just had to apply the missed substitution. It was localized mainly because I had not yet implemented hard line breaks. And yes, it meant that when I did implement hard line breaks, I triple checked my line endings to avoid this issue showing up again.
Momentum was increasing, so it was time to step things up!
Issue 3: Code Blocks, Indenting, and Blank Lines¶
Having resolved a couple of warm-up issues, I felt it was time to tackle some larger issues. Each issue in this group either deals with vertical space issues or leading space issues within a code block. The vertical space issue was that blank lines were not being folded into the text blocks properly, causing foreseeable issues with parsing complete blocks of text soon. Given some Markdown text, such as:
``` abc def ```
I expected that the output tokens would include a fenced code block with a single text block inside of it with three lines present. Instead, there were three tokens present, the first and last were text tokens with strings and the token in the middle was a blank line token. While the token representation was technically correct, parsing it would be awkward. For the specific case of blank lines within a code block, it made sense to merge the blank line tokens into the surrounding text tokens.
The leading space issue was a bit more subtle but equally simple. To properly parse text blocks within a code block, an appropriate amount of whitespace may need to be removed from each line as it is combined. As always, it is the details that matter, and it is easy to gloss over them. In the opening part of the indented code block section of the specification, the following line is present:
The contents of the code block are the literal contents of the lines, including trailing line endings, minus four spaces of indentation.
The similar line for fenced code blocks reads:
If the leading code fence is indented N spaces, then up to N spaces of indentation are removed from each line of the content (if present).
Basically, this means that the following indented code block:
should produce a code block with the text
fred on the first line and the text
<space>frank1 on the second line, having had the first 4 spaces removed. The fenced
code blocks are
a bit more nuanced, in that:
``` fred ```
is parsed as
``` fred ```
is parsed as
fred, based on the extra indenting of the fenced code block start.
While the line containing the text
fred is the same in both cases, the number of
leading spaces before the fenced code block are different, resulting in the different
Prior to fixing this issue, text lines were combined in a simple manner and no whitespace was removed from the start of any lines within code blocks. To properly address this issue, not only did these two rules need to be followed, but the existing code to properly remove leading spaces for each line within a normal paragraph needed to be preserved. It took a bit to get it right, but with a good number of scenario tests to keep things honest, it was easy to get it right quickly.
My original estimates for the impact of this issue was 15-20 tests, and it thankfully remained within that range. While the initial number of scenarios covered by these issues was 15, I expected other scenarios to use code blocks to show how their feature worked with code blocks, adding another 3-7 scenario tests in the process. Looking back, I think I got off nicely with the scope of these changes.
Issue 4: Paragraphs and Indenting¶
Feeling energized from fixing the issues documented in the previous sections, I decided to keep with the theme and attack the issue with leading spaces in normal paragraphs. Similar to the prior issue, the rule for normal paragraphs is that all leading whitespace is removed as the text for the paragraph is pasted together. While it is hard to point to an exact quote from the specification for this rule2, example 192 clearly shows this as the Markdown:
is translated into the HTML text:
The fix for this was similar to the change made for the code blocks issue, but instead
of specifying a fixed number of whitespace to remove, the
combine function was
changed to accept a value that indicates the removal of all whitespace. The original 9
cases were quickly tested with the fix, and things looked solid.
Originally, it looked like the changes would be confined to the original 9 cases, but I suspected that the number would at least be double that, as this fix would affect any scenario involving a paragraph with multiple lines. While a number of the cases were simple cases, when all was said and done, there were 59 changes to scenario tests in the commit for this issue. Even so, those changes were quickly made, and the scenario tests were re-verified.
While this took a lot of work to get right, it felt good to get this one resolved. It was hard for me to quantify this to myself, but the parsing of the paragraphs always looked like they had too many spaces. It was nice to finally figure out why!
Issue 5: Trailing Spaces in Scenarios¶
During the early development phase of the project, I wanted to get going with the
scenarios as quickly
as possible, so I copied each scenario’s Markdown text from the GFM specification
exactly as-is. This was not a
problem, except that in a small number of cases, there were lines in the scenarios that
ended in one or more whitespaces. These trailing whitespaces raised the
trailing-whitespace warning when I ran the PyLint program over the project’s
code base, as I do with each set of changes. Determined to deal with the issue later,
I added a few comment lines like this:
# pylint: disable=trailing-whitespace
to disable the Pylint warnings that occurred, knowing I would have to fix them later. At this point in the project, it seemed like a good time to deal with this.
As this issue was largely out of sight, everything was fine. That is, everything was fine until I hit a couple of problems in a row that did involve these scenarios. Instead of immediately noticing the trailing whitespace and the comment, I started debugging each issue without noticing either the comment or the whitespace, and was dumfounded by why the parsing was not as was suggested by the Markdown text that was clearly visible. When I took a step back to really read the scenario tests, I then noticed the comment at the top of each of the problem test functions, and then it clicked. But it took a lot longer than it should have. Instead of “just dealing with it”, I decided that refactoring was a better solution.
The fix was an easy one too, something I should have thought of earlier. The ASCII BELL
character is represented by
\a in Python strings, and to the best of my knowledge is
seldom used in most Python programs or Markdown text. As it has a very low probability
of being used in any scenarios, I replaced the
terminating whitespace characters with
\a characters, then added
.replace("\a", " ")
at the end of the sample string. It was then a simple matter of going through the
other 6 scenarios with trailing whitespaces and repeating this fix.
While issues like this may seem small, having to disable a PyLint warning did not feel right, even if it helped me maintain momentum at the time. It just felt really good to solve this issue properly.
Issue 6: Getting Atx Headers Right¶
In the last article, I started looking for issues after wondering if a code span could work within an Atx Header, realizing that the specification allowed it, but my current implementation did not allow it. As stated in the preamble for scenario 36:
The raw contents of the heading are stripped of leading and trailing spaces before being parsed as inline content.
Basically, my decision was based on my usage patterns, not the specification. In my experience, I have only ever done Atx Headers in the form:
### This is my header
When I read the specification, I glossed over that section, only thinking of Atx Headers as containers for normal text. However, based on the specification, the following text is also allowed:
### This *is* `my` header
As a result, instead of the header text
This is my header being generated by the
first sample, I can use the second sample to generate the header text of
This <em>is</em> <code>my</code> header. Neat!
The change itself was pretty simple and confined to the
As it was a simple change, the accompanying change in each test was also pretty
simple: take a single Atx Header token with text, replace it with an Atx Header token
without the text, a Text token with the text, and an End token for the Atx Header.
While I was concerned that the fix for this issue was going to be more widespread, it was confined to 22 scenario tests, and was easy to verify.
Issue 7: Bringing the Tabs Back¶
Looking for something to finish the refactoring session with, I decided to tackle one of the longstanding fixes that I had some reservations about: the bulk conversion of tabs to spaces. While it was a good fix at the time, I suspected that there might be problems with the code blocks where all text is supposed to be preserved literally, including tab characters.
In a stroke of luck, all the affected scenario tests are in the
and the places where tabs are important can be grouped into 2 distinct groups.
In the Markdown specification, there is a distinction between whether there is enough
whitespace for an indented code block with 4 spaces, or not with less than 4 spaces.
To address those cases, I simply added the
is_length_less_than_or_equal_to functions. While I could have simply used a
modifier to get the same effect, I thought it was more readable to simply spell it
out. For cases where the actual length was needed, the
determines the length of the string, allowing for the length of a tab to be 4
characters while every other character is assigned a length of 1.
While this was not a very technical issue to fix, it helped me return things to a known good state, with confidence that tabs were being treated properly. Before this fix, I was always concerned that the bulk translation of tab characters to spaces would introduce a hard to diagnose issue. With that translation removed, that concern went away.
What Was My Experience So Far?¶
At various points in the development of PyMarkdown, I have wondered if my thinking should be more realistic with a “two steps forward, one step back” feel to it. Maybe it is just who I am, but with a few exceptions, I see almost all of this development as stepping forward with quality, and hence, all positive. I like the fact that I am implementing some new features, then doing some refactoring, and repeating. It gives me a solid perception that the project is on stable footing at every stage, accumulating little technical debt along the way.
Something that struck me at this point was how easily I seemed to fall into a rhythm that works well for me and the project: implementing a couple of features, noting down any issues as I implement, and then fixing a couple of the more pressing issues before repeating the pattern. I am not sure if that kind of a pattern that everyone else works well with, but it seems to work well for me. To a certain extent, it also helps me write these articles, as writing about quality software is very different than the development of that software. For me, I find that they complement each other very well.
In terms of energy, keeping that rhythm going and writing these articles is helping to keep me charged up about the project. While I have written my share of parsers in my career, they have almost always been for work projects with a specific goal and deadline to achieve. Being freed from those restrictions does come with its benefits, but not having someone looking over your shoulder means that you have to take on that role yourself. These articles, while initially created to talk about my approach in creating quality software, also server the purpose of keeping me honest and responsible to any readers. Call it a backup plan, but it seems to be working well!
What is Next?¶
Going back to the specification for features to implement, I decided to start at the end and get the line breaks, autolinks, and raw html inline processing taken care of. While I do not use them frequently myself, they are interesting aspects to the GFM specification, and perhaps learning about them will increase my use of them in my everyday Markdown usage.
<space>represents a space character, which by it’s very nature, is invisible. ↩↩
The specification states “The paragraph’s raw content is formed by concatenating the lines and removing initial and final whitespace.” This is the closest reference that I could find to removing whitespace. Perhaps initial means per line? ↩
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.