Summary
In my last article, I finished the last bit of work needed to complete the consistency checks. However, as I had accumulated some items in my issues list, I decided to take some time and make a sizable dent in that list. This article details the issues that I investigated and the results of those investigations.
Introduction
When I am teaching archery at the camps in my local area, I must constantly keep a level head and strive to find ways to keep the kids motivated. Faced with children possessing various levels of archery skill, not only do I have to tailor any assistance to each individual child, but I also have to figure out how to get that child to retain some part of that assistance. Luckily, I have a couple of tricks up my sleeve that help me in this area.
The most useful trick involves the difference between moving a mountain and moving a mountain’s worth of pebbles. When I ask a camper how easy it is to move a mountain, they usually look at me like I am the most stupid parent on the planet and then proceed to express how impossible it is in various colorful forms. As a follow-up, when I then ask them if they can move that mountain one pebble at a time, they state that it would take a long time, but eventually they would be able to move that mountain. Granted, the description of how long that effort would take differs from camper to camper, some more colorful than others, but they all convey that it is ultimately doable.
At that point, I calmly talk to the camper and explain that we are going to start working on their archery skills, one pebble at a time. At open range events, I let each group of kids know that the other coaches and I will be there all day, and will be as helpful at the end of the day as we are at the beginning. Admittedly, we are all a bit crazier near the end, but we try our best to remain helpful in the middle of that craziness.
The reason I mention this story is that the project’s issues list definitely looks more like a mountain than a pebble to me. But by taking the same approach with the items on that list that I do when teaching archery, I can approach the list calmly and not be hit with a large wall of anxiety. Instead of seeing the list as a large mountain, I can choose to see it as a large group of pebbles, to be moved one at a time. My goal at this point is not to get rid of all those items at once, but to make steady progress in reducing the size of the issues list, one pebble at a time.
What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 09 Sep 2020 and 12 Sep 2020.
Starting with An Easy One
Now that I have the consistency checks in place, when I am faced with a line number or a column number being off, it is a question of whether the parser’s calculation is wrong or the check’s calculation is wrong. While I hope that it is not an error with the consistency checks, I feel that it is a valid question to ask myself with each issue.
That is the mindset that I had when I started looking at this issue. Left over from my previous work on consistency checks, this was an issue where 15-20 minutes of digging with little success had caused me to sideline it for later. The Markdown sample in question was a modified example 143:
```Markdown
> <div>
> foo
> bar

bar
```
At issue was the tokenization of the Blank Line element at the end of the Block Quote element. According to the parser, that Blank Line element was tokenized to `[BLANK(4,1):]`. However, the consistency checks were stating that the proper tokenization should be `[BLANK(4,3):]`. Which one was correct? And what was the miscalculation that was causing one of the two to be off?
Digging In
Performing a quick visual check of the tokenization, things seemed to be valid on the parser side. The Blank Line element was enough to terminate the HTML Block element, but not to also terminate the Block Quote element before the Blank Line element itself. That part of the test looked good. In addition, the Markdown text does not contain any other text on that Blank Line, so the position of `(4,1)` looked to be an accurate position for the token. This meant shifting my focus to the consistency checks.
Looking at the debug information for the consistency checks, something immediately leapt out at me. The debug output read:
```
>>blank_line>>split>['> ', '> ', '> ', '', '']
>>blank_line>>index>1
>>current_position.index_number>>1
>>current_position.index_indent>>0
>>1 + init_ws(2)>>3
```
Matching that up against the code that produced the debug output, it stated that the `index_number` variable was set to `1` while the consistency check calculated the number `3` for that same value. That matched up with the previous information, so looking at that last line, I saw that the calculation was determining that there was an initial whitespace of 2 being applied to get to the value of `3`. That did not seem right.
Adding some temporary debug statements, I was able to quickly determine that the reason the check was adding that indent of 2 characters was a bad index on the Block Quote token. With every newline that occurs within a given Block Quote, the `leading_text_index` field must be updated to point to the current line. In this case, the HTML Block had a number of newline characters within its data but had not updated the index. As a result, instead of the index being set to 3, it was set to 0. This meant that it was picking up the 0th element of the array¹, `>{space}`, with its length of 2, instead of the 3rd element in the array, an empty string with its length of 0.
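To make that miscalculation concrete, here is the arithmetic, using the split leading text from the debug output above:

```Python
# The Block Quote element's split leading text, from the debug output.
leading_spaces = ['> ', '> ', '> ', '', '']

# With leading_text_index stuck at 0, the check added len('> ') == 2
# to the column; with the correct index of 3, it adds len('') == 0.
assert 1 + len(leading_spaces[0]) == 3  # the check's wrong answer
assert 1 + len(leading_spaces[3]) == 1  # the parser's right answer
```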
Fixing the Problem
Having learned a decent amount of information from my research, fixing the issue was relatively simple. To increase the index by the right amount, I had to count the number of newlines in the text and apply that number to the `leading_text_index` field.
```Python
newlines_in_text_token = __count_newlines_in_text(current_token.token_text)
print(">>newlines_in_text_token>" + str(newlines_in_text_token))
container_block_stack[-1].leading_text_index += newlines_in_text_token
```
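The body of the `__count_newlines_in_text` function is not shown here, but an implementation consistent with its name and its use in the above snippet could be as simple as:

```Python
def __count_newlines_in_text(text_to_scan):
    """Count the number of newline characters in the token's text."""
    return text_to_scan.count("\n")
```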
To err on the side of caution, I also decided to add a couple of extra scenario tests with further variations on the example. Just to make sure that the right thing would happen with more text, I added an extra line of text and created the test `test_html_blocks_143b`. This addressed my concern that there might be something special about 2 lines of text, as a third line would either highlight or eliminate that concern. Then, to make sure that Block Quote lines and their indents were working properly, I added the test `test_html_blocks_143c`. This test alternated the indents for the Block Quote element between 0 spaces and 1 space, stressing that part of the fix.
The Start of A Series
This next issue was the start of what I would later refer to as the 518 series of tests. The process started when I looked at example 518 and wrote down a note to check whether all the combinations were covered. To give more context, I scribbled down “518, are we sure?”, which I interpreted as “Am I sure that I have all the combinations covered?”. While it may sound like a weird logical jump to make from a couple of scribbles, in my own way, I wrote down what was needed to make sure that I followed up on it.
My thinking was simple. A standard link, such as the following Markdown from example 518:
```Markdown
[link]( /uri
"title" )
```
has 6 parts: the link label, the whitespace before the URI, the URI, the whitespace before the title, the title, and the whitespace after the title. All other parts of that link are single required characters. So, what I did with this series of tests was to start changing where the newlines were, seeing if I could break anything. To add some extra flavor to the tests, I also added a couple of tests that included backslash characters.
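To make those 6 parts concrete, here is how I would break down the Markdown from example 518; the variable names are mine, purely for illustration:

```Python
link_label = "link"
before_uri = " "      # whitespace before the URI
uri = "/uri"
before_title = "\n"   # whitespace before the title (here, the newline)
title = '"title"'
after_title = " "     # whitespace after the title

# Everything else in the link is a single required character.
markdown = (
    "[" + link_label + "](" + before_uri + uri
    + before_title + title + after_title + ")"
)
print(markdown)

# The 518 series moves the newlines between these whitespace parts,
# checking that the line/column calculations survive each position.
```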
What Did I Find Out?
The primary thing that I found out was that the parser itself was not handling newline characters within links properly. Specifically, the line numbers and column numbers did not reflect the Markdown document that was parsed. To address this, the `__handle_inline_special` function was modified to handle an alternate calculation of the line/column number if a Link end token was found. While the initial calculation of `new_index - next_index` works for any Link token that does not contain a newline character, the calculation for those that include a newline character was going to be a bit more complicated.
After trying out a couple of solutions in my head, the one that gained the most traction with me was a staggered approach. In this approach, I preset an array with the lengths of the various parts of the link, as outlined above. The function then goes through each of the parts of the link in order, checking each part for newlines as it goes. If at least one newline character is found, the line variables are updated and the `link_part_index` is set to that part’s index number. At the end, all values before that index are reset to 0 and the column number is adjusted by the sum of each value in the array.
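As a rough illustration of that staggered approach, here is a minimal sketch; it is not the project’s actual code, and the exact contents of the parts array are my assumption:

```Python
def calculate_link_end_position(line_number, link_parts):
    """Staggered approach: scan each link part for newlines, in order."""
    part_lengths = [len(part) for part in link_parts]
    link_part_index = -1
    for index, part in enumerate(link_parts):
        newline_count = part.count("\n")
        if newline_count:
            line_number += newline_count
            link_part_index = index
            # Only the characters after the last newline in this part count.
            part_lengths[index] = len(part) - (part.rindex("\n") + 1)
    if link_part_index != -1:
        # Reset all values before that index to 0, then sum what remains.
        for index in range(link_part_index):
            part_lengths[index] = 0
        column_number = 1 + sum(part_lengths)
        return line_number, column_number
    return line_number, None  # no newline: keep the simple calculation
```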
While there were small fixes along the way to make the algorithm work properly, the algorithm works well because it keeps things simple. As each part is checked in turn for newline characters, the change to the line number variable is accurate. By setting and resetting the `link_part_index` variable to the last element that had a newline, only the elements after that index get added to the total, accurately reflecting the number of characters after the last newline character.
After that was done, the only extra adjustment that needed to be added was accounting for whitespace at the start of lines within Paragraph elements, making sure that it gets applied properly. That entailed tracking whether the parser was currently processing a Paragraph token and using the `rehydrate_index` field of the Paragraph token. With that in place, manual verification of the newly added cases confirmed that the algorithm was calculating things properly.
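Here is a hedged sketch of that paragraph adjustment, with the caveat that `rehydrate_index` is the only field name taken from the text above; the `extracted_whitespace` field is my assumption about where a Paragraph token keeps the whitespace stripped from each of its lines:

```Python
def adjust_column_for_paragraph(column_number, paragraph_token, newlines_in_link):
    # Each newline inside the link advances the paragraph to its next line,
    # whose stripped leading whitespace must be added back into the column.
    split_whitespace = paragraph_token.extracted_whitespace.split("\n")
    paragraph_token.rehydrate_index += newlines_in_link
    leading_whitespace = split_whitespace[paragraph_token.rehydrate_index]
    return column_number + len(leading_whitespace)
```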
But What About…?
Finishing up the work, there were two things that I knew I needed to work on: consistency checks and other types of links. Strangely, the only change I needed to make to the checks was to the Link Reference Definition check, ensuring that it accounted for the newlines in its various components. Other than that, everything seemed to line up.
As for other types of links, the answer was clear. Based on the parser code and the consistency check code, the only type of link that was being tested for newlines inside of the link was the plain inline link. To address this, I added extra items to the issues list and made a mental note to revisit this later.
Fenced Code Blocks and Blank Lines
While this did not take a long time to solve, it was a good issue to get out of the way. At issue was the Markdown for example 99:
````Markdown
```
{space}{space}
```
````
where `{space}` stands for an actual space character. While the actual item in my notes reads `fenced, 99 with more blanks`, I knew that I was concerned about not having more examples with different types of blank lines inside of the fenced code block. To fully test this, I created many variations on this one test, varying the number of blank lines and the amount of whitespace on each blank line.
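While I do not reproduce the actual test functions here, the variations were in this spirit, with each string being a hypothetical Markdown document to test:

```Python
variations = [
    "```\n\n```",        # one truly blank line
    "```\n  \n```",      # one blank line holding two spaces
    "```\n\n  \n\n```",  # mixed blank lines, with and without spaces
]
```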
I was happy to find out that the work on the parser stood up to this extended testing, and the consistency checks only required a small change. To be precise, the only change needed was to reset the column number to `0` if the first inline token inside of a Fenced Code Block was a Blank Line token.
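A minimal sketch of that check change, assuming illustrative token names rather than the project’s exact constants:

```Python
def expected_first_inline_column(fenced_block_token, first_inline_token):
    if first_inline_token.token_name == "BLANK":
        # A Blank Line as the first inline resets the column number to 0.
        return 0
    return fenced_block_token.column_number
```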
I did discover something somewhat puzzling though. In a Fenced Code Block, if the first token is a Text token, the rest of the body of the block is coalesced into it. If the first token is a Blank Line token, then the tokens are not coalesced at all. Rather than focusing on that at the time, I just noted it down in the issues list, hoping to find time soon to figure out whether that behavior is correct.
The First Inline Token After an Atx Heading Token
I noticed this one a while ago, and it was always something that I wondered about: why does the parser sometimes insert a whitespace character right after an Atx Heading token? I knew I must have had a good reason for it, but I could not recall why. I did find a couple of scribbles in old notes about Atx Headings, but that was it. I needed something more to go on for me to understand this and wrap it up.
Doing the Research
Going way back in the project’s commit log, I noticed two things: the first was that this project has been going on for a long while, and the second was that in that long while, very few of the commits refer to Atx Headings in their commit messages. To me, this means that Atx Headings are very stable, something that I felt proud of. That stability also helps me to understand why I did not make any notes about what I was doing: quite probably, they just were not needed.
Atx Headings have been mentioned only 4 times in the commit logs:
- Adding atx and setext headings.
    - initial addition for Atx Heading support
- Fixing Atx headers to allow better inline processing.
    - moved most text out of the token to allow normal inline processing to take place
- Adding line/column support for tax headings.
    - adding line numbers and column numbers to the tokens
- Added testing of every inline at the start of a Atx heading…
    - this issue
This summary made it easy for me to observe that when the change was made to allow better inline parsing, the initial whitespace between the Atx Heading and its enclosed text was placed in the whitespace section of the Text token. By looking at the consistency checks, I also observed that there are zero instances where an Atx Heading is not immediately followed by a Text token. Given those two observations and the commit history, I do not feel it is a leap to say that this was a design decision that I made but never recorded. Further, the worst case is that it is a pattern with a lot going for it, one that could easily be adopted as a new design decision. Either way, it seems to be a winning design.
Backing Up the Design Decision
Another thing that I learned from my research was that, in all of the Markdown examples, there is only one case of a non-Text element following the Atx Heading characters. The Markdown for example 183 is:
"""# [Foo]
[foo]: /url
> bar
which produces an Atx Heading element with a Link element right after it. Based on the previous research, I expected the tokenization to include that whitespace character, and it did not disappoint:
```Python
expected_tokens = [
    "[atx(1,1):1:0:]",
    "[text(1,3)::\a \a\x03\a]",
    "[link(1,3):shortcut:/url:::::Foo:::::]",
    "[text(1,4):Foo:]",
    "[end-link:::False]",
    "[end-atx:::False]",
```
For readers that have not been following along, the `\a` character in an inline text string represents the replacement of one character with another. In this case, the `{space}` character between the first two `\a` characters is being replaced by the `\x03` character, which is used as a NOOP character. To me, this means that when the parser tokenized that part of the line, it purposefully replaced a space character with an empty string. Based on the research, this was the right thing to do.
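To show how those markers work, here is a small sketch of rehydrating a replacement sequence back into its original Markdown; this is my own illustration, not the project’s code:

```Python
def rehydrate_replacements(text):
    """Turn each \\a{original}\\a{replacement}\\a sequence back into {original}."""
    result, index = [], 0
    while index < len(text):
        if text[index] == "\a":
            end_original = text.index("\a", index + 1)
            end_replacement = text.index("\a", end_original + 1)
            result.append(text[index + 1 : end_original])
            index = end_replacement + 1
        else:
            result.append(text[index])
            index += 1
    return "".join(result)

# The sequence from the Text token above: a space replaced by the NOOP \x03.
assert rehydrate_replacements("\a \a\x03\a") == " "
```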
The rest of addressing this issue was just replicating various forms of this example for each inline sequence. Starting with the function `test_atx_headings_extra_2` and ending with the function `test_atx_headings_extra_12`, I just cycled through each of the inline sequences, including Hard Line Breaks. And with two exceptions, each of the new test functions passed.
The first case where I needed to fix something came even before I added the extra functions: example 183 itself. To make sure it was working properly, I needed to add an extra line to the `__process_inline_text_block` function of the `InlineProcessor` class. In the specific case where I was adding that special replacement, I determined that I was not clearing the `starting_whitespace` variable, and that caused the extra text to occur within the link instead of before the link. That in turn caused the rehydration to fail, as the space character was in the wrong position.
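The shape of that fix, sketched from the description above rather than copied from the project:

```Python
def fold_starting_whitespace(token_text, starting_whitespace):
    if starting_whitespace:
        # Fold the heading's leading whitespace into a NOOP replacement...
        token_text = "\a" + starting_whitespace + "\a\x03\a" + token_text
        # ...and clear the variable, so the whitespace is not applied again
        # inside the following (Link) token. This was the missing line.
        starting_whitespace = ""
    return token_text, starting_whitespace
```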
The second case was in the `__verify_first_inline_atx` function of the consistency checks. The function was first cleaned up by removing all the extra `assert False` statements and replacing them with an assert that the first token processed within the Atx Heading was a Text token. Once that was done, I just reordered the remaining lines in the function to keep related functionality together, and it was done.
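Here is a hedged sketch of the cleaned-up check; the field names are illustrative assumptions, and the column arithmetic assumes the single swallowed space from the research above:

```Python
def __verify_first_inline_atx(last_block_token, first_inline_token):
    # Per the research: an Atx Heading is always followed by a Text token.
    assert first_inline_token.token_name == "text"
    assert first_inline_token.line_number == last_block_token.line_number
    # The Text token starts just past the hash characters and the one
    # whitespace character swallowed into the replacement sequence.
    expected_column = (
        last_block_token.column_number + last_block_token.hash_count + 1
    )
    assert first_inline_token.column_number == expected_column
```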
What Was My Experience So Far?
I went into this work knowing that, at best, I would be able to knock a handful of issues off the list in one week. While I met that goal, it also started to sink in that research was going to take the lion’s share of the time spent resolving each issue.
If I was lucky, I figured I would stumble upon what I needed to do early in the research. But I prepared myself for the probability that, most often, I would need to add debug statements, add some extra scenario tests, execute all relevant scenario tests, examine the results of those tests, and iterate. Sometimes it would only take 1 or 2 iterations, but I figured that most often it would take upwards of 5 to 10 iterations. Based on my experience with this week’s work, that set of expectations was a healthy set to start with.
At this point, I was feeling decently okay. Not to sound too vague, but what I went through was right down the middle of what I expected to happen. As such, I was not feeling overly positive or overly negative, just… well… okay. If anything, I was hoping that this “down the middle” result would apply most of the time. Having too many quick issues would get my hopes up for the remaining issues, and having too many long-running issues would stop any momentum I was gaining.
But more importantly, I just wanted to keep on working through the issues list. I was okay with moving issues to the “nice to have” section that I had created, but only if I honestly thought they were not required. If I kept myself honest, it was a good way to move forward without jeopardizing the project.
And with that, I ended the work for the week, on a mostly positive note.
What is Next?
Having resolved a small number of issues, I was hoping to keep up some momentum by continuing to solve more issues in the next week. Simple as that!
¹ A simple reminder that when most people talk about arrays, they refer to an index into that array. As such, the first element of the array is the zeroth (0th) element of the array.
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.