Summary¶
In my last article, I talked about how my effort to work efficiently continues to pay dividends for the project. In this article, I talk about the work I put into getting Rule Md027 implemented.
Introduction¶
Everyone has something that they are particular about. Depending on the person, there might be a whole collection of things they are particular about. While a few of my things deal with “just making sense”, the one that I hit the other week was about not getting something done.
Don’t get me wrong. If you ask me to move a mountain, I will not be upset if I don’t get that done. Now, if you ask me to organize a conference because no one else can do it, and if I have the support, I will give it my best shot. I will probably kick myself along the way for the things that I didn’t anticipate ahead of time, but I won’t kick myself too hard. After all, I have never organized a conference before.
But designing a rule for the PyMarkdown project? I thought I could do that while sipping on a cold beverage and nibbling on some carrot cake with the tunes up loud. But a couple of weeks ago, I hit Rule Md027 and that changed. Given a 30-minute design window, I couldn’t even get a basic design off the ground. Even with a 10-minute extension, I was still at ground zero. It wasn’t that I didn’t get it done, it was that I should have been able to get it done and didn’t. It weighed on me.
I didn’t want to let that negativity get in the way of the other rules, so I decided to give myself a week to design and implement the rule properly. I wanted to get it done in less than a week, but if it took that long, it took that long. I just wanted to get it done.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred on 15 Aug 2021.
I Do Not Give Up Easily¶
Roughly two weeks ago, at the start of my efficiency push, I started to work on Rule Md027, but failed to make it out of the design phase. This week, I decided that I was going to go back to that rule to properly design and implement it.
Why? Because I do not give up easily. I have pride in my work, and the fact that I had to bail on this rule, even temporarily, just did not sit right with me. This time, I knew I needed to give myself as much design time as I needed to get the design right. In the end, it took about five hours and two small redesigns to make it happen. The important thing is that I got there.
Deciding To Use Token Rules¶
The big problem that my design faced was that I needed to work off the token information, not the line information. However, I tried to not acknowledge that and work with the line information instead. Call me stubborn, but I thought it would be easier to design it that way. I knew that working with line information on this rule would still be difficult; I just figured that it would not be as difficult as working with tokens.
From my research on the original rule, it was clear that certain lines, such as a line in an HTML Block element, did not trigger that rule. Thinking things out, this made sense to me. When specifying an HTML Block element, such as:
<!-- this is an example -->
the author is very specific that anything within that HTML Block be represented as-is in the document. Therefore, if the author specifies:
> I have HTML that look like:
> <!--
> this is an example
> -->
this new rule should assume that the author specifically decided to include that extra space within the HTML Block. The same argument follows for Fenced Code Block elements and Indented Code Block elements.
I tried to find a way to work around cases like that one, and others, but they all ended up being too expensive from a design point of view. The token already had this information translated for me, so working with line information would mean repeating some of that work to get this rule triggering correctly. Thinking about it long and hard, I decided in the end that working with the token information made more sense. But I was also aware that working with tokens would still require a fair amount of work.
The Cost¶
When I write rules that are line based, the calculations for where a given
rule triggers are easy: calculate the number of characters since
the start of the line and pass it to the report_next_line_error
function
which reports that the rule was triggered. Quick, easy, done. With tokens,
it becomes a bit more complicated than that.
The difficult part about reporting on the triggering of a given token-based rule is that the engine is restricted to information available in the token. For any single token, the only constants are the starting position of the token and the information contained within the token. Anything that needs to be provided to report on the triggered rule needs to be calculated from that.
It is not a high cost, but it costs the project the time required to properly figure out the equation used to translate between the position of the token and the position where the rule is triggered. And based on my research, that calculation was going to be different for each token.
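To make that cost concrete, here is a tiny contrast between the two approaches. The numbers are made up purely for illustration and this is not the project’s code; the point is that the token-based position has to be rebuilt from the token’s starting point plus per-token bookkeeping.
# Made-up values, not the project's code.
current_line, whitespace_column = 11, 3                  # line-based: the scan already knows this
token_line, lines_into_token, prefix_length = 10, 1, 2   # token-based raw material

# The token-based rule has to recompute what the line-based rule gets for free.
assert current_line == token_line + lines_into_token
assert whitespace_column == prefix_length + 1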
Designing For The Container Tokens¶
Given all that information and a whole bunch of scribbles that I used to work out issues, I was ready to focus on the design. I knew that the design was going to entail multiple levels of effort, but I was prepared for that.
At the top level of the design are the container tokens, the two List tokens and the Block Quote tokens. As the rule is invoked for each token in turn, the top level needs to track which container token is currently active. I consider a container token currently active if it is the deepest token in the stack of all container tokens that have been opened but not closed. This is important because Block Quote elements and List elements are container elements, and therefore they can contain any container element or leaf element. Therefore, when processing the top level of the tokens, the algorithm needs to track this information and only act on non-container tokens if those tokens occur while a Block Quote token is active.
Basically, if this rule is called with a Markdown document that looks like this:
this is a test
or:
- this is a test
or:
> - this is a test
then the rule should never take any action for the text this is a test beyond the top level of the design, because the algorithm never reaches a state where that text is encountered while a Block Quote token is active.
It is only those cases where a Block Quote token is currently active that processing needs to proceed to the next level of the design. It is with scenarios like the following where that extra processing is needed:
> this is a test
or:
>> this is a test
or:
- > this is a test
Therefore, as part of the design, I knew I would have to keep track of the current line of the active Block Quote token. The information about which characters started each line within a Block Quote is stored with the active Block Quote token. While that information would not affect whether the rule triggers, I knew it would be needed to provide an accurate column number for where the rule triggered. If the current Block Quote line index is not correct, then the reported column number will not be correct. Not as bad as not triggering the rule at all, but still important.
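To make that lookup concrete, here is a tiny, self-contained sketch of the idea. The values are assumed for illustration only and are not taken from the parser; the real per-line prefixes live in the Block Quote token’s leading_spaces field.
# Toy lookup with assumed values, not the project's code: given how many lines
# into the block quote the rule currently is, find that line's prefix and turn
# its length into a column number.
leading_spaces = "> \n> \n> "      # one prefix per block-quote line, "\n"-separated
bq_line_index = 2                  # currently on the third line of the block quote
prefix = leading_spaces.split("\n")[bq_line_index]
column_number = len(prefix) + 1    # the column just after the "> " prefix
print(column_number)               # -> 3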
With the container tokens dealt with, I needed to deal with the handling of leaf tokens within the scope of an active Block Quote token. That was next!
Designing For The Leaf Tokens¶
Having dealt with the container tokens, in the next level of the design I needed to deal with leaf tokens. Trying to break things down into manageable blocks of work, I organized the leaf tokens into four classes that I needed to design for: single line tokens, complex single tokens, text encompassing tokens, and everything else.
For the single line tokens, the action to take is relatively easy. Single line elements such as the Thematic Break element only have one line, so a simple check is required to determine if the rule is triggered. Once that check has been performed, only the current Block Quote line needs to be updated.
The next tokens to consider are the complex single tokens. Currently the only token that falls into this category is the Link Reference Definition, but I am sure future work will add more tokens like this. This group is specific to tokens like the Link Reference Definition token where a single token is defined with complex parts. That makes this group unique because while the rules for a newline within a part may vary from part to part, the triggering of this rule on any eligible whitespace between parts is constant. That is why the following Markdown document will only trigger this rule three times:
> [lab
> el]:
> /url
> "tit
> le"
The three trigger points are the whitespace on the first line before the Link Label, the whitespace on the third line before the URL, and the whitespace on the fourth line before the title. Each of those trigger points is eligible whitespace between parts of the definition; the other lines fall within a part and are handled according to that part’s own rules.
The next group of tokens, and the largest of the four groups, are the text encompassing tokens. What makes this group interesting is that the handling of any Text tokens within the scope of these tokens is dependent on the type of token that contains them. The handling for text in either of the two Code Block elements or HTML Block element is different from the text in a SetExt Heading element, which is also different from the text in the Atx Heading element or the Paragraph element. From a design point of view, that meant that I would have to track whichever of those encompassing tokens was active to enable me to properly deal with any encompassed text.
With all those tokens out of the way, any remaining tokens fall into a simple group called “everything else”. This includes tokens like the Emphasis token used to denote Emphasis within a block of text. The distinction for these tokens is that there is no way that any of these tokens can trigger the rule to occur, so they can be dismissed. With each token in this group, the nature of the token just precludes it from being interesting to this rule.
Designing For The Inline Tokens¶
After designing for the other two classes of tokens, the only class that was left to design for was the Inline tokens class. Looking over the use cases for each token, I determined that I only needed to design for three tokens: Links, Raw HTML, and Code Spans. Every other inline token is interpreted inline or defined so as not to exceed a single line. Therefore, if there are any extra space characters before that token is encountered, those characters will be placed into a Text token. That meant I only had to deal with the three outliers.
The Raw HTML token and the Code Span tokens were the easy ones to design for. Everything within the token is eligible, so one simple check is sufficient. But with the Link tokens, there are multiple parts, such as the Link Labels, which are not eligible, and parts like the whitespace, which are eligible. Not too much of a difference, but one to keep track of.
Once one of those tokens triggered the rule to fire, I knew I would have some non-trivial calculations to figure out the proper line number and column number. Because these tokens occur within leaf tokens, I knew that I would have to do some interesting work to merge the results of any of these tokens with their parent tokens. But I was okay with designing that part of the algorithms when I got to it.
Starting To Implement¶
Like any good Test-Driven Development practitioner, I looked at
my scribbled notes and started writing scenario tests for each
of the scenarios I was interested in. For each scenario test, I
created a new Markdown document in the tests/md027
directory
and an accompanying disabled test in the test_md027
Python module.
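For readers who have not seen that workflow before, each disabled test is roughly shaped like the sketch below. The function name, skip reason, and body are my own stand-ins rather than the project’s actual test code, which uses its own helpers to run the scanner and check the reported positions.
import pytest

# Hypothetical sketch of one disabled scenario test; names and body are
# placeholders, not the real test_md027 module.
@pytest.mark.skip(reason="Rule Md027 is not implemented yet")
def test_md027_bad_block_quote_whitespace_paragraph():
    # scan the matching Markdown document under tests/md027 and assert the
    # line/column where the rule is expected to trigger
    raise NotImplementedError()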
Going through all the relevant examples on my testing worksheet, I ended up with 81 disabled scenario tests when I was done. From start to finish, it took me just over three and a half hours (including short breaks) to add all those scenarios and their tests into the project. To be clear, that was not getting the tests working, just entering them in a disabled state.
From that effort, I knew that implementing this rule was not going to be something that would be done in a day. I was hoping I could keep it to a week.
Implementing For The Container Tokens¶
As designed, the first thing that I needed to code was high-level handling of the container tokens. This proved to be very simple:
def next_token(self, context, token):
    if token.is_block_quote_start:
        self.__container_tokens.append(token)
        self.__bq_line_index[len(self.__container_tokens)] = 0
    elif token.is_block_quote_end:
        num_container_tokens = len(self.__container_tokens)
        newlines_in_container = self.__container_tokens[-1].leading_spaces.count("\n")
        del self.__bq_line_index[num_container_tokens]
        del self.__container_tokens[-1]
    elif token.is_list_start:
        self.__container_tokens.append(token)
    elif token.is_list_end:
        del self.__container_tokens[-1]
    elif (
        self.__container_tokens
        and self.__container_tokens[-1].is_block_quote_start
    ):
        self.__handle_within_block_quotes(context, token)
Without missing a beat, this followed the design that I had specified.
The self.__container_tokens
list maintains what the active container
token is, being modified by both List tokens and Block Quote tokens. If
a Block Quote token is encountered, a bit more work is done to add an
entry in the self.__bq_line_index
dictionary to track the index within
the Block Quote. Finally, if the token
variable is set to a non-container
token and the active container token is a Block Quote token, then the
__handle_within_block_quotes
function is called to handle the processing.
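Tracing two of the earlier examples through this function shows why the stack matters. The token names in the trace below are simplified stand-ins, not the exact token types.
# Simplified trace for "- > this is a test":
#   list-start          __container_tokens: [list]
#   block-quote-start   __container_tokens: [list, block-quote]; __bq_line_index[2] = 0
#   paragraph, text     top of the stack is a Block Quote, so
#                       __handle_within_block_quotes is called
#   end-block-quote     __container_tokens: [list]
#   end-list            __container_tokens: []
#
# Simplified trace for "> - this is a test":
#   block-quote-start   __container_tokens: [block-quote]; __bq_line_index[1] = 0
#   list-start          __container_tokens: [block-quote, list]
#   paragraph, text     top of the stack is a List, so this rule ignores them
#   end-list, end-block-quote then unwind the stack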
With the container tokens design dealt with, it was on to the leaf tokens.
Implementing For The Leaf Tokens¶
For any readers who follow these articles, it should come as no surprise
that I started implementing the __handle_within_block_quotes
function
as one humongous function. As I have mentioned in other articles,
I prefer to code first, refactoring later when I have a more complete
picture of the code in question.
Starting With Paragraphs¶
For no other reason than Paragraph elements showing up more than any
other elements, I decided to start my work on the __handle_within_block_quotes
function with them:
def __handle_within_block_quotes(self, context, token):
    num_container_tokens = len(self.__container_tokens)
    if token.is_paragraph:
        self.__last_leaf_token = token
        for line_number_delta, next_line in enumerate(
            token.extracted_whitespace.split("\n")
        ):
            if next_line:
                split_leading_spaces = self.__container_tokens[
                    -1
                ].leading_spaces.split("\n")
                line_index = (
                    self.__bq_line_index[num_container_tokens] + line_number_delta
                )
                calculated_column_number = len(split_leading_spaces[line_index]) + 1
                self.report_next_token_error(
                    context,
                    self.__container_tokens[-1],
                    line_number_delta=line_number_delta,
                    column_number_delta=-calculated_column_number,
                )
        self.__bq_line_index[num_container_tokens] += token.extracted_whitespace.count(
            "\n"
        )
The Paragraph token is a special scenario in the PyMarkdown project.
But because Paragraph elements are the most common elements in Markdown, I
wanted to get it out of the way. Due to the constraints of the
Paragraph element, any leading spaces on a line within a Paragraph element
are skipped.1 But because the PyMarkdown project deals in tokens
and not HTML, any skipped or translated characters must be accounted for in
the token. Therefore, the Paragraph token contains an extracted_whitespace property to hold those skipped spaces, separated by newline characters for readability.
Given that background, the code within the bounds of the if token.is_paragraph part of the function uses that extracted_whitespace property to figure out if any of
the individual lines begin with whitespace. Once split into separate lines, a simple
iteration over the lines and a check for if next_line
determines if leading
whitespace was removed for that line. If so, the real fun begins.
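To show the mechanics of that check, here is a tiny runnable illustration. The value of the property is assumed for a two-line paragraph where only the second line had an extra space removed; it is not taken from the parser.
# Assumed contents, not parser output: line 1 had nothing skipped, line 2 had
# one space skipped after the block-quote prefix.
extracted_whitespace = "\n "
for line_number_delta, next_line in enumerate(extracted_whitespace.split("\n")):
    if next_line:   # truthy only when leading whitespace was stripped from that line
        print("rule triggers", line_number_delta, "line(s) below the token start")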
As Paragraph elements can contain multiple lines of text, calculations
are performed to determine where in the paragraph the triggering
occurred. The change from the token’s line number is easy to figure out.
Enumerating through the various lines in the extracted_whitespace property,
the change is simply the iteration through the for
loop. To keep things
simple, I named this variable line_number_delta
.
For the column number, there is no way around using the line prefix information
stored in the Block Quote token. Since the number of lines into the Block
Quote is kept track of in the self.__bq_line_index
dictionary, the function
needs to calculate the proper index into that prefix list. By taking the current
Block Quote token’s entry in that table and adding the iteration through the loop
(line_number_delta
), the proper index is obtained. With that calculation, the
Block Quote prefix for that line is looked up, its length is computed,
and the rule is triggered with a call to report_next_token_error
.
Finally, to make sure the index within the Block Quote token remains
correct, the entry in the self.__bq_line_index
dictionary is updated
to include the number of newline characters in the paragraph.
Phew! To be honest, I had some confidence that this would be one of the more difficult elements to get right, and I was correct. But it was also good that I dealt with it early. By working on this code and the ten or so tests that deal directly with paragraph elements, I gave myself a good example that I could use as a reference point.
And the testing was pivotal. I was able to comment out the skip
test attribute, run the tests, and make any adjustments as necessary.
Slowly, the scenario tests for Rule Md027 were starting to pass!
On To Single Line Elements¶
While it may be an oversimplification, the remaining token handlers that I wrote are just variations on the Paragraph handler from the last section. Starting with the handler for Thematic Break tokens, I found it easier to implement than the Paragraph token’s handler:
elif token.is_thematic_break:
    if token.extracted_whitespace:
        column_number_delta = -(
            token.column_number - len(token.extracted_whitespace)
        )
        self.report_next_token_error(
            context, token, column_number_delta=column_number_delta
        )
    self.__bq_line_index[num_container_tokens] += 1
As the exact information is already stored within the token, and because
that information cannot span lines, that handler simply checks to see
if the token was prefaced with any whitespace. When that scenario
occurs, the reported position is adjusted by providing a new
calculation for the column number to report. This is required because
negative deltas are used to represent absolute positions on the line, and
not negative changes to the token’s column number. As such, the
column_number
is used as a base, the length of the found whitespace is subtracted from it, and the result is reported as the absolute start.
After that calculation, adding 1
to the proper __bq_line_index
entry
was trivial. It needed to account for the single line containing the
Thematic Break, so adding 1
made sense. And the if
statement for
token.is_blank_line
was even simpler, because the token’s line number
and column number are always the position where the rule triggers.
Hence, no column_number_delta
variable was required.
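A quick worked example of that convention, with made-up numbers:
# Made-up values: a Thematic Break token starting at column 5, preceded by two
# characters of whitespace, should be reported at absolute column 3.
token_column_number = 5
extracted_whitespace = "  "
column_number_delta = -(token_column_number - len(extracted_whitespace))
print(column_number_delta)   # -> -3, which the reporting code reads as "column 3"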
Complex Elements¶
Link Reference Definition tokens are essentially the same as the previous two elements, just with more parts to take care of. The first part of handling this element is practically a copy of what was written to handle the Thematic Break token:
elif token.is_link_reference_definition:
    scoped_block_quote_token = self.__container_tokens[-1]
    if token.extracted_whitespace:
        column_number_delta = -(
            token.column_number - len(token.extracted_whitespace)
        )
        self.report_next_token_error(
            context, token, column_number_delta=column_number_delta
        )
    self.__bq_line_index[num_container_tokens] += (
        1
        + token.link_name_debug.count("\n")
        + token.link_destination_whitespace.count("\n")
        + token.link_title_whitespace.count("\n")
        + token.link_title_raw.count("\n")
    )
The main difference so far is that the calculation of the number of lines consumed by the token is a bit more involved than it was for the Paragraph token. Once that was taken care of, I was able to work on the other two parts of that handler: the whitespaces.
found_index = token.link_destination_whitespace.find("\n")
if found_index != -1 and ParserHelper.is_character_at_index_whitespace(
    token.link_destination_whitespace, found_index + 1
):
    line_number_delta = token.link_name_debug.count("\n") + 1
    split_array_index = (
        self.__bq_line_index[num_container_tokens] + line_number_delta
    )
    split_leading_spaces = scoped_block_quote_token.leading_spaces.split("\n")
    specific_block_quote_prefix = split_leading_spaces[split_array_index]
    column_number_delta = -(len(specific_block_quote_prefix) + 1)
    self.report_next_token_error(
        context,
        token,
        line_number_delta=line_number_delta,
        column_number_delta=column_number_delta,
    )
The first handler looks like parts of the handler for the Paragraph token.
Once that whitespace is found, a bit more work is required to calculate the
column_number_delta
variable, but it is just a variation of previous work.
Instead of using the line_number_delta
returned by the enumerate
function,
it is computed from the parts of the element that come before that whitespace, using the information in the token.link_name_debug variable.
The Block Quote token’s leading_spaces
variable is then split and the
index into that list of split values is calculated. Then, just as with
the Paragraph token’s handler, the column_number_delta
variable is set based on the length of that line’s prefix.
found_index = token.link_title_whitespace.find("\n")
if found_index != -1 and ParserHelper.is_character_at_index_whitespace(
    token.link_title_whitespace, found_index + 1
):
    line_number_delta = (
        token.link_name_debug.count("\n")
        + token.link_title_whitespace.count("\n")
        + 1
    )
    split_array_index = (
        self.__bq_line_index[num_container_tokens] + line_number_delta
    )
    split_leading_spaces = scoped_block_quote_token.leading_spaces.split("\n")
    specific_block_quote_prefix = split_leading_spaces[split_array_index]
    column_number_delta = -(len(specific_block_quote_prefix) + 1)
    self.report_next_token_error(
        context,
        token,
        line_number_delta=line_number_delta,
        column_number_delta=column_number_delta,
    )
This is again copied, almost verbatim, to handle the next whitespace part of the Link Reference Definition token. They are close enough to each other that I will probably refactor them together in a future refactoring effort.
Once again, the handling of this token was just repeating the work done previously, with some small alterations.
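Since I mentioned folding those two checks together, one possible shape for such a helper is sketched below. This is a guess at a future refactoring, not the project’s actual code; the caller would compute the part-specific line_number_delta and pass in the whitespace field to check.
def __check_lrd_whitespace(
    self, context, token, part_whitespace, line_number_delta, num_container_tokens
):
    # Sketch only: a shared version of the two nearly identical checks above.
    found_index = part_whitespace.find("\n")
    if found_index != -1 and ParserHelper.is_character_at_index_whitespace(
        part_whitespace, found_index + 1
    ):
        split_array_index = (
            self.__bq_line_index[num_container_tokens] + line_number_delta
        )
        split_leading_spaces = self.__container_tokens[-1].leading_spaces.split("\n")
        column_number_delta = -(len(split_leading_spaces[split_array_index]) + 1)
        self.report_next_token_error(
            context,
            token,
            line_number_delta=line_number_delta,
            column_number_delta=column_number_delta,
        )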
Encompassing Elements¶
With most of the other work done, it was time to focus on the encompassing elements. To start, I picked the Atx Heading token, which contains the Heading text between the Atx Heading token and the end Atx Heading token. To ensure that the function knows it is dealing with another token between those two tokens, I added some very simple handling of those Atx Heading tokens:
elif token.is_atx_heading:
    if token.extracted_whitespace:
        column_number_delta = -(
            token.column_number - len(token.extracted_whitespace)
        )
        self.report_next_token_error(
            context, token, column_number_delta=column_number_delta
        )
    self.__last_leaf_token = token
    self.__bq_line_index[num_container_tokens] += 1
elif token.is_atx_heading_end:
    self.__last_leaf_token = None
While there are variations on this code, like Fenced Code Block elements or SetExt Heading elements which have a closing line, almost all the encompassing block handler functions are like this. The only outlier is the end Fenced Code Block token handler, which has information about the closing fence characters. But even that handler is simple, reusing elements from the Link Reference Definition handler.
Now that the rule knows a token was found within one of these encompassing
elements, it was time to implement the handler for the Text token. The
text within a SetExt Heading element is specially encoded to handle the
start and end of the line differently than a Paragraph token, so that
took a lot of the work. The Code Block tokens and the HTML Block tokens
were easy to handle. For those tokens, the text is as written, so no
analysis is needed. Because of that, incrementing __bq_line_index
was all that was required.
And with that, I checked my handy list of things that I needed to implement, and everything was checked off. Along the way I had found three different parser bugs and added them to the Issues List to deal with later.
Things looked good with the rule, but I wanted to make sure I had addressed every issue that I could find. Therefore, it was time to start looking over the scenario tests and double checking everything.
Simple Refactoring¶
Having completed the large handler function for tokens, before I went on, I knew that I needed to refactor it into separate functions. So, taking my time, I carefully created a new function for each token, moving the handler code into that new function.
With the solid suite of scenario tests to back me up, I was confident that any issue I introduced would be found. But that got me thinking. Did I miss something?
Double Checking¶
Wanting to have confidence that I completed everything, I started going through the scenario test data and validated that I had taken care of all the scenarios that I could think of. Along the way, I added four more tests to the rule, bringing the total of scenario tests to 85.
And I even found another bug. Well, not really a bug, but an inconsistency. Given the Markdown document:
> simple text
# a
the test failed with an unmatched Block Quote index. Basically, the number of newlines in the Block Quote token was not equal to the number of newlines collected within the Block Quote.
After a quick look, I was interested to find that in the above case,
the Block Quote token’s leading_spaces
field did not end with a
newline character. Verifying that this seemed to be the only case
where it happened, I added this code to the next_token
function to
deal with it for now:
if self.__container_tokens[-1].leading_spaces and \
        not self.__container_tokens[-1].leading_spaces.endswith("\n"):
    newlines_in_container += 1
Everything was good… until it wasn’t.
Dropping The Ball On Inline Elements¶
I blame myself for doing something careless; I am just not sure when I did it. At some point along the weeklong implementation of this rule, I had entries for Code Span, Raw HTML, and Links on my To-Do list. And somewhere along the line, I checked them off.
What is even more embarrassing is that I double check my tests and implementations to try and make sure that I do not miss things like this. But this time, my double checks only verified that I had all the correct scenario test data. And I did include scenario test data for those three elements… I just never added the tests for them. I did not think I had added one without the other, but obviously I did for those elements.
As I found this out at 6pm on Sunday night, it was too late to add in any fixes at that point. So, I put it on a short list of things to fix, and I want to get to it when I have some spare time in between writing the other rules.
What Was My Experience So Far?¶
Looking at the picture of a scoreboard that I have scribbled on a piece of paper on my desk, the number of completed rules is now 20 and the number of rules remaining is now 11. I hope that I don’t find many more rules like Rule Md027, but I am confident I can still get the rules done quickly.
As to how I feel about Rule Md027, I must admit that it is a bit of a mixed bag. On one hand, it was a monster rule, and it took good design followed by good implementation to get it to this point. But the shadow of having missed the three inline elements is also present. To throw things back into the positive, I did find a few parser bugs, which means I can fix them before users find them. As I said, a mixed bag.
But in the end, I am still positive about how things are going. Once I get the rules done, or at least a good first pass at the rules, I am thinking about releasing a minor version to include the new rules. I think that is still probable, and I am kind of looking forward to it.
So yeah, I took some bumps, and I got some bruises. But I also got a monster of a rule mostly done. That feels good!
What is Next?¶
After a week of focusing only on one rule, I wanted to get back to making more progress. I hope I get more done this week! Stay tuned!
1. See example 192 in the GFM specification.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.