Summary¶
In my last article, I started adding proper support for line and column numbers to both the Text tokens and the Emphasis tokens. In this article, I increase my confidence in the line and column numbers for those two inline tokens by adding consistency checks for them.
Introduction¶
I know that I am fallible. It therefore stands to reason that any code that I write will have some issues with it. Those issues may be obvious issues, or they may be issues that only occur under a bizarre set of circumstances, but they are there. Rather than fight against them, I embrace the attitude that good test automation will help me to identify those types of issues as early as possible.
For the PyMarkdown project, this test automation takes the form of scenario tests containing consistency checks. These consistency checks validate that the Markdown documents in the scenario tests are properly interpreted by the PyMarkdown project. But while these consistency checks are beneficial, they have taken a long while to complete. After 3 calendar months, it can fairly be said that my decision to add consistency checks to the project replaced 3 months of project development time with 3 months of project test time. Plain and simple, that is a fact.
My confidence in the project and its ability to work correctly is emotional and abstract. However, with effort, I have been able to move it in the direction of being more of a fact than a feeling. The consistency checks are a form of test automation that apply a generalized set of rules over a group of tokens, looking for each group to behave in a predictable manner. Before this work, my confidence was expressed as a feeling: “I believe the project is stable”. With this work nearing its completion, I can now point to the scenario tests and the consistency checks that run within those scenario tests. I can state that each of the scenario tests must satisfy a rigorous set of criteria before it is marked as passing. That confidence can now be expressed as: “Here are the tests that are passing and the checks that are being performed on each test.”
From that point of view, it made sense to implement the consistency checks for the Text token and the Emphasis tokens before starting to work on setting the line/column numbers for the remaining inline tokens.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commit of 02 Sep 2020.
Getting Started With Inline Token Validation¶
At the start of the week, the code used to verify the consistency of inline tokens was extremely simple:
print(">>last_token:" + ParserHelper.make_value_visible(last_token))
next_token_index = last_token_index + 1
while actual_tokens[next_token_index] != current_token:
print(
"-token:" + ParserHelper.make_value_visible(actual_tokens[next_token_index])
)
next_token_index += 1
Added in as a placeholder to allow me to see what was going on with the inline tokens, it served its purpose well. But as I started to work on the inline tokens and their line/column numbers, I needed to facilitate better consistency checking of those inline tokens.
To start the work off, I removed that placeholder code from two places in the code and replaced both with a call to a new function, verify_inline. The only difference between the two invocations of the function was the fourth argument, current_token. Called for the first time from the __verify_token_height function, the current_token variable is set to the block token after a series of inline tokens. The second time it is called, it is called at the end of processing to capture any inline tokens that are within one of the valid text elements, but at the very end of the document. When it is invoked from that location, that same argument is set to None. In both cases, the inline tokens to be validated were clearly outlined for the verify_inline function.
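In rough terms, the two call sites looked something like the following. This is only a hedged sketch: apart from the fourth argument, current_token, the parameter list shown here is an assumption on my part, not the project’s exact signature.
# Hedged sketch only: argument names other than current_token are assumed.
# First call site, from __verify_token_height, passing the block token that
# follows the run of inline tokens:
verify_inline(actual_tokens, last_token, last_token_index, current_token)
# Second call site, at the very end of processing, with no following block token:
verify_inline(actual_tokens, last_token, last_token_index, None)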
Clearly Defining the Problem¶
Before doing any real processing with the inline tokens, I needed to create a simple list containing the actual inline tokens that I wanted to check. I could have done that with the main list of tokens and the boundaries outlined previously. However, I thought about it and decided that it was clearer to have a separate list that contained only the tokens that I was concerned about. Once I had all the inline tokens between the two block tokens in that new list, there was a small amount of work to do before the list was usable. While it was not difficult, the new list had some extra end tokens at the end that needed to be removed. Working around those extra end tokens would have been okay, but I just felt that it was simpler to remove them from the list before doing any further processing.
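As a minimal sketch of that list-building step (the is_end_token check below is a stand-in of my own, not the project’s actual test for end tokens):
# Hedged sketch: gather the tokens between the two block tokens into a new
# list, then trim any trailing end tokens so the rest of the checks can
# ignore them.
inline_tokens = []
next_token_index = last_token_index + 1
while next_token_index < len(actual_tokens) and (
    actual_tokens[next_token_index] != current_token
):
    inline_tokens.append(actual_tokens[next_token_index])
    next_token_index += 1
while inline_tokens and inline_tokens[-1].is_end_token():
    del inline_tokens[-1]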
Having a simple list of the inline tokens to work with, the first iteration of the checking algorithm started with an easy outline to follow:
if inline_tokens:
    for token_index, current_inline_token in enumerate(inline_tokens):
        if not token_index:
            __verify_first_inline(last_token, inline_tokens[0])
        else:
            __verify_next_inline(
                last_token,
                inline_tokens[token_index - 1],
                current_inline_token,
            )
    # verify_last_inline(inline_tokens[-1], current_inline_token)
From my viewpoint, the processing of the inline tokens had 3 distinct phases: the first element in that list, each element after it, and the last element in that list. Based on their locations, the first and last elements are special in that they anchor the other inline tokens to the block tokens on either side of the middle elements. Without those anchors, the middle elements lack a foundation on which to base their positions.
Based on those observations, I chose to implement the check for the first inline token against the previous block token, and not the check for the last inline token against the following block token. Without validating the first element, validating any of the elements on the inside of the list would be useless. So, whether I liked the idea or not, validation of the first element in the list was mandatory. The last element is a different story. While it would be nice to tie the last inline token to the following block token, I felt that it was not as important as the verification of the first element. However, I added in a placeholder to the code to make sure that I would follow up on it later.
Validating the First Element¶
Following the pattern that I have used for validation in the past, I created the __verify_first_inline function with my standard starting template:
def __verify_first_inline(last_non_inline_token, first_inline_token):
    if <something>:
        pass
    else:
        assert False, last_non_inline_token.token_name
As this function is comparing the starting position of the first inline token to the last valid block token, the <something> in the above code sample was quickly replaced with:
if last_non_inline_token.token_name == MarkdownToken.token_atx_heading:
    assert False
elif last_non_inline_token.token_name == MarkdownToken.token_setext_heading:
    assert False
elif last_non_inline_token.token_name == MarkdownToken.token_paragraph:
    assert False
elif last_non_inline_token.token_name == MarkdownToken.token_fenced_code_block:
    assert False
elif last_non_inline_token.token_name == MarkdownToken.token_indented_code_block:
    assert False
elif last_non_inline_token.token_name == MarkdownToken.token_html_block:
    assert False
else:
    assert False, last_non_inline_token.token_name
and one by one I added the validation functions to replace the assert False statements. Following the same pattern for resolving these as I have before, I ran the scenario tests over the entire project using the command line:
pipenv run pytest -m gfm
Each time, I just picked one of the failing tests and worked on the tests in that group until they were all passing. For each validation function, I repeated the same pattern with the first inline token that was observed. For example, the __verify_first_inline_atx function quickly evolved to look like:
def __verify_first_inline_atx(last_non_inline_token, first_inline_token):
    """
    Handle the case where the last non-inline token is an Atx Heading token.
    """
    col_pos = last_non_inline_token.column_number + last_non_inline_token.hash_count
    if first_inline_token.token_name == MarkdownToken.token_text:
        replaced_extracted_whitespace = ParserHelper.resolve_replacement_markers_from_text(
            first_inline_token.extracted_whitespace
        )
        col_pos += len(replaced_extracted_whitespace)
        assert first_inline_token.line_number == last_non_inline_token.line_number
        assert first_inline_token.column_number == col_pos
    elif first_inline_token.token_name == MarkdownToken.token_inline_hard_break:
        assert False
    ...
    else:
        assert (
            first_inline_token.token_name != MarkdownToken.token_inline_link
            and first_inline_token.token_name
            != EndMarkdownToken.type_name_prefix + MarkdownToken.token_inline_link
        ), first_inline_token.token_name
What Did I Discover?¶
Predictably, I discovered that there are 2 groups of block tokens that contain text: ones that support inline tokens other than the Text token, and ones that do not. The ones that do not support other inline tokens are mostly easy: assert that the inline token is a Text token, and then assert on a simple calculation of the first line/column number. The validation of the HTML Block token and the Indented Code Block token both followed this pattern, with very simple validation.
def __verify_first_inline_html_block(last_non_inline_token, first_inline_token):
    assert first_inline_token.token_name == MarkdownToken.token_text
    leading_whitespace_count = len(first_inline_token.extracted_whitespace)
    assert last_non_inline_token.line_number == first_inline_token.line_number
    assert (
        last_non_inline_token.column_number + leading_whitespace_count
        == first_inline_token.column_number
    )
The Fenced Code Block tokens required a bit more effort, but not much. As Fenced Code Blocks can start with 0-3 space characters that then need to be managed on any subsequent line in the code block, the owning block token’s leading_spaces variable holds the information on what leading spaces were already removed. As such, when calculating the proper position of the first Text token inside of a Fenced Code Block, that removed space needs to be accounted for. To properly facilitate that, the last_token_stack argument needed to be plumbed through so the verification function could determine the proper owning block token.
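As a hedged sketch of that adjustment, with both the stack lookup and the shape of the leading_spaces field being assumptions on my part:
# Hedged sketch: assume the owning Fenced Code Block token sits on top of
# last_token_stack and that leading_spaces records, per line, the spaces that
# were already removed; those spaces are added back into the expected column.
owning_fenced_token = last_token_stack[-1]
removed_prefix = owning_fenced_token.leading_spaces.split("\n")[0]
assert first_inline_token.column_number == 1 + len(removed_prefix)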
The second group of block tokens was the more interesting group to deal with. This group includes the Atx Heading tokens (as shown in the above example), SetExt Heading tokens, and Paragraph tokens. The __verify_first_inline_atx function and the __verify_first_inline_setext function ended up looking similar: the Text inline token case was populated, but all the other types of inline tokens were handled with assert False statements. The __verify_first_inline_paragraph function was similar, but also slightly different. The same template was used to generate the function, but each of the conditions in the if-elif-else block was met at least once. However, only the Text token and the Emphasis tokens currently have line/column numbers, allowing this comparison to be performed for them:
assert first_inline_token.line_number == last_non_inline_token.line_number
assert first_inline_token.column_number == last_non_inline_token.column_number
All the other inline tokens, the ones that did not yet have a line/column number assigned to them, used the following comparison:
assert first_inline_token.line_number == 0
assert first_inline_token.column_number == 0
It was not much, but it gave me two important bits of information. The first was that there was at least one case where each available inline token was the first inline token inside of a Paragraph token. The second was that for both heading tokens, the Atx Heading token and the SetExt Heading token, the scenario tests only ever started with Text tokens. I made a note of that observation in the issues list and moved on.
Verifying the Middle Tokens¶
With the validation of the first element out of the way, it was time to start working on the __verify_next_inline function. Now that the middle tokens were anchored at the beginning, each of the middle inline tokens could be validated against the inline token that preceded it. Since I knew that most of the inline tokens had not been handled yet, I started out that function with a slight change to the template:
def __verify_next_inline(
    last_token, pre_previous_inline_token, previous_inline_token, current_inline_token
):
    if (
        previous_inline_token.line_number == 0
        and previous_inline_token.column_number == 0
    ):
        return
    if (
        current_inline_token.line_number == 0
        and current_inline_token.column_number == 0
    ):
        return
    estimated_line_number = previous_inline_token.line_number
    estimated_column_number = previous_inline_token.column_number
    if previous_inline_token.token_name == MarkdownToken.token_text:
        assert False
    ...
    else:
        assert False, previous_inline_token.token_name
    assert estimated_line_number == current_inline_token.line_number, (
        ">>est>"
        + str(estimated_line_number)
        + ">act>"
        + str(current_inline_token.line_number)
    )
    assert estimated_column_number == current_inline_token.column_number, (
        ">>est>"
        + str(estimated_column_number)
        + ">act>"
        + str(current_inline_token.column_number)
    )
The first set of if statements made sure that if either the previous inline token or the current inline token was one that I had not worked on yet, the function would return right away. While this assumed that the line/column numbers were correct to a certain extent, I was okay with that assumption in the short term. The second part computed a starting point for the new line/column numbers, and then went into the usual pattern of dealing with each of the eligible tokens by name. Finally, the third part compared the modified line/column numbers against the actual line/column numbers of the current token, asserting with meaningful information if there were any issues.
Emphasis Tokens¶
I thought it would be quick to get emphasis out of the way, and it was! As both the start and end Emphasis tokens contain the emphasis_length, it was a quick matter of adjusting the column number by that amount. As both tokens are confined to a single line, there was no adjusting of the line number to worry about.
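In the shape of the __verify_next_inline handlers above, that works out to something like the following hedged sketch, with the exact token-name constant being my assumption:
# Hedged sketch: both the start and end Emphasis tokens carry emphasis_length,
# and emphasis never spans a line, so only the column estimate moves.
if previous_inline_token.token_name == MarkdownToken.token_inline_emphasis:
    estimated_column_number += previous_inline_token.emphasis_length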
Text Tokens¶
As mentioned in a previous section, there are two major groups of block tokens that contain Text tokens: ones that allow all inline tokens and ones that do not allow any inline tokens except for the Text token. The ones that do not allow other inline tokens are simple, as all the information about the Text token is contained within the token itself. It is the other group that is interesting to deal with.
The easy part of dealing with the Text token is determining the new line number. With the exception of a Text token that occurs right after a Hard Line Break token, the calculation is simple: split the text by the newline character, and the number of parts minus 1 is the number of newlines in the Text token. If the token before the Text token was a Hard Line Break token, it already increased the line number, but the Text token that followed also started with a newline character. To remedy this, that pattern is looked for, and the current_line variable is adjusted to remove the newline character at the start of the line.1
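As a minimal sketch of that line calculation, reusing the variable names from the template above, and with the Hard Line Break adjustment simplified into a subtraction of my own:
# Hedged sketch: every newline in the Text token's text advances the line.
split_text = previous_inline_token.token_text.split("\n")
estimated_line_number += len(split_text) - 1
# If a Hard Line Break token came just before, it already advanced the line
# number and the Text token starts with that leftover newline, so avoid
# counting it twice.
if pre_previous_inline_token.token_name == MarkdownToken.token_inline_hard_break:
    estimated_line_number -= 1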
Determining the column number is a more interesting task to undertake. For any Text tokens occurring within a block that does not allow for extra inline tokens, the column number information is already in the token itself, and the calculation is just as simple. The column delta is equal to the number of text characters stored within the token2. If there was a newline in the token’s text, this count starts after the last newline character.
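A hedged sketch of that column calculation, again reusing the template’s variable names:
# Hedged sketch: the column delta is the number of characters after the last
# newline in the Text token's text; with no newline, the whole text length is
# added to the previous column, otherwise the column restarts on the new line.
text_after_last_newline = previous_inline_token.token_text.split("\n")[-1]
if "\n" in previous_inline_token.token_text:
    estimated_column_number = 1 + len(text_after_last_newline)
else:
    estimated_column_number += len(text_after_last_newline)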
The second group of block tokens that can contain text are the Atx Heading token, the SetExt Heading token, and the Paragraph token. Since the Atx Heading token can only contain a single line’s worth of data, no extra calculations are required to handle multiple line scenarios. In the case of the other Heading token, the SetExt Heading token, the starting whitespace is stored in the Text token’s end_whitespace field. The processing of this information is a bit tricky in that the starting and ending whitespace for the Text tokens within the SetExt Heading token is stored in that field using the \x02 character as a separator. Still, determining the proper indent and applying it to the column number is relatively simple.
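As a hedged sketch of that lookup, with both the per-line splitting and the ordering of the parts around the \x02 separator being my assumptions:
# Hedged sketch: assume end_whitespace keeps one entry per line, with the
# leading and trailing whitespace for that line separated by \x02.
line_index = estimated_line_number - previous_inline_token.line_number
whitespace_for_line = previous_inline_token.end_whitespace.split("\n")[line_index]
leading_indent = whitespace_for_line.split("\x02", 1)[0]
estimated_column_number = 1 + len(leading_indent)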
Dealing with a Text token within a Paragraph token is a lot more work. Due to other design reasons, the whitespace indent for these Text tokens is stored within the owning Paragraph token. While that is not difficult by itself, keeping track of which indent goes with which line is a bit of a chore. Luckily, when I was working on the Markdown transformer, I introduced a variable, rehydrate_index, to the Text token. When rehydrating the Text token, I used this variable to keep track of which stripped indent needed to be added back to which line of any subsequent Text tokens. Given the prefix whitespace for any line within the Paragraph block, calculating the column number delta was easy.
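A hedged sketch of that calculation, assuming that last_token is the owning Paragraph token and that it stores its stripped per-line indents in its extracted_whitespace field:
# Hedged sketch: rehydrate_index says which of the Paragraph token's lines the
# Text token is currently on, and the stripped indent for that line gives the
# column offset.
paragraph_indents = last_token.extracted_whitespace.split("\n")
line_indent = paragraph_indents[previous_inline_token.rehydrate_index]
estimated_column_number = 1 + len(line_indent)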
Blank Line Tokens¶
That left the Blank Line tokens to deal with, and I hoped that the effort needed to complete them was more in line with the Emphasis tokens than the Text tokens. I was lucky, and the Blank Line tokens were easy, but with a couple of small twists. Intrinsically, a blank line increases the line number and resets the column number to 1. That was the easy part. The first twist is that if the current token is a Text token, that text token can provide leading whitespace that needs to be considered. That was easily dealt with by adding the following lines to the handler:
if current_inline_token.token_name == MarkdownToken.token_text:
    estimated_column_number += len(current_inline_token.extracted_whitespace)
The more difficult problem occurred when 2 Blank Line tokens appeared one after the other within a Fenced Code Block token. Because of how the numbers added up, I needed to adjust the estimated_line_number variable by one.
if current_inline_token.token_name == MarkdownToken.token_blank_line:
    if previous_inline_token.token_name != MarkdownToken.token_blank_line:
        estimated_line_number += 1
    estimated_column_number = 1
With that tweak being done, all the tests were then passing, and it was time to wrap it up.
Was It Worth It?¶
The interesting part about defensive code is that sometimes you are not aware of how good that defense is. Using the analogy of a castle, is a castle better defended if it can withstand an attack, or if it deters others from attacking it in the first place? While I did not have any information about potential attacks that were stopped ahead of time, there were 2 actual issues that the current round of consistency checks did find.
Issue #1: Image Link¶
The first of those issues was an issue with the column number for example 600 as follows:
!\[foo]
[foo]: /url "title"
Before these inline consistency checks were added, the text for the ] character was reported as (1,6). By simply counting the characters, the ! character starts at position 1 and the second o character is at position 6. As such, the ] character should be reported as (1,7).
Doing some research, I concluded that a properly initiated Image token was being handled properly. However, with a failed Image token sequence, that is, the ! character followed by any character other than the [ character, the ! character was being emitted, but the column number’s delta wasn’t being set. Adding the line
inline_response.delta_column_number = 1
at the end of the __handle_inline_image_link_start_character function solved that issue.
Issue 2: A Simple Adjustment¶
The second of those issues was more of a nitpick than an actual issue. In the tokenization for example 183:
# [Foo]
[foo]: /url
> bar
the first line was tokenized as:
"[atx(1,1):1:0:]",
"[text(1,3):\a \a\x03\a:]",
"[link:shortcut:/url:::::Foo:::::]",
"[text(1,4):Foo: ]",
"[end-link:::False]",
"[end-atx:::False]",
Having a lot of experience sight-reading serializations for all the tokens, the information in the Text token leapt out at me right away. In that token, the extra data associated with the token is composed by adding the self.token_text field, the : character, and the self.extracted_whitespace field. Based on the above tokenization, that meant that the text sequence \a \a\x03\a was being considered as text instead of whitespace.
To understand why I thought this was wrong requires an understanding of why that character sequence exists. The \a sequence is used to denote that a sequence of characters in the original Markdown document was interpreted and replaced with another sequence of characters. The \x03 character within the second half of that sequence means that the {space} character in the first part of the sequence is being replaced with the empty string. Basically, to properly represent the space between the # character denoting the Atx Heading element and the [ character that starts the Link element, I needed to add a space character that would not appear in any HTML transformation.
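As a hedged illustration of how I read that marker, with the surrounding grammar paraphrased by me rather than taken from the project’s code:
# Hedged illustration: "\a<original>\a<replacement>\a" records that <original>
# was interpreted and replaced by <replacement>, and \x03 stands in for the
# empty string. So "\a \a\x03\a" says a single space was replaced with nothing.
marker = "\a \a\x03\a"
original, replacement = marker.strip("\a").split("\a")
assert original == " " and replacement == "\x03"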
And here is where the nitpicking comes in. When I originally added that sequence while working on the Markdown transformer, it made sense to me to assign it to the token’s self.token_text field. But since then, I have grown to think of that sequence as being more extracted whitespace than token text. To resolve that, I decided to move the call that generates the replacement text from the self.token_text field to the self.extracted_whitespace field. It wasn’t a big move, but it was something that I thought was the right thing to do.
What Was My Experience So Far?¶
While this batch of work wasn’t as laborious as last week’s work, the effort required to make sure it was correct equaled or exceeded that of last week. I knew that if I made any mistakes last week, they would be caught when I implemented the consistency checks. Well, these were the consistency checks that would capture any such issues that slipped through.
I am both happy and proud that I am coming to the end of implementing the consistency checks. It has been a long 3-month voyage since I decided that consistency checks were the best way to ensure that the quality that I wanted in the PyMarkdown project was maintained. And while there were times that I questioned whether dedicating this large block of time to this aspect of the project was the right call, I remained confident that it was.
But looking ahead to what I needed to do after the consistency checks, I saw a fair number of items in the issues list that would need researching and possibly fixing. While I could start to release the project without addressing them, I didn’t feel comfortable doing that. I wanted to give the project the best chance it could have to make a good first impression, and then move on from there. And that would mean more work up front. So while I was happy that the consistency check work was coming to an end, there seemed to be a deep pool of issues that would need to be researched… and I wasn’t sure how much I was looking forward to that.
I still believe that adding the consistency checks was the right move. Of that I am still certain. Instead of a feeling that I have the right code in place to do the Markdown transformations, I have hard, solid checks that verify the results of each and every scenario test. It also gave me the interesting bit of information that the scenario tests did not include any cases where the Atx Heading token and the SetExt Heading token were followed by anything other than a Text token. Something interesting to follow up on later.
To me, adding more of those checks for the inline tokens was just another solid step forward in quality.
What is Next?¶
Having completed the hardest inline token (the Text token) and the easiest inline tokens (the Emphasis tokens), it was time to buckle down and get the remaining tokens done. If I was lucky, the foundational work that I had already completed would make completing those tokens easy. If I was unlucky, there would be a whole selection of edge cases that I needed to account for. Realistically, I was expecting something squarely in the middle between those two scenarios. The next batch of work would answer that question!
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.