Summary¶
In my last article, I talked about starting to work on getting rid of some long-standing issues: nested Container Block elements. In this article, I talk about continuing that work and dealing with the remaining nested block scenario tests.
Introduction¶
I wish I could say that this was a good week, but it was not. It was not even this week that I am about to talk about here; that was last week. Despite my best intentions, I got sick, and it has taken the better part of a week for me to get better. To hear me talk about it, please read my upcoming article, Autism and Patience Do Not Mix, coming out later this week.
Before Monday afternoon when I got too sick to work, I had 90% of this article finished. I knew I was getting sick, but I did not feel good about publishing something that was only “mostly” done. I hope my decision to delay publishing this article for a week is okay with any readers. I would rather postpone publishing an article for a full week than release something that I did not think was a quality article.
And on a similar level, I did not want the care that I put into that week’s work to be glossed over by an article that was not on par with that work. As I talk about below, I was stunned to find out that there were only 2 examples out of 673 examples in the GFM specification that deal with relative indentation of block elements. It was something that I knew about and something that I should have dealt with earlier. But with no pressure from the specification’s examples to deal with relative indentation, it was not until this week that I worked to resolve the issue.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 04 Jul 2021 and 05 Jul 2021.
Why Do I Use Detailed Log Files?¶
One thing that I do not think I have ever mentioned is why I use log file output instead of just debugging the code interactively. In normal everyday development, I do a bit of both as the issue requires. For me, the real understanding comes from looking at a big picture and comparing it to other pictures. Whether the right comparison pictures are from earlier runs of the same scenario or pictures of similar scenarios depends on what I am looking for. Basically, I see it as a puzzle with lots of data that I need to solve, and I love it!
While the success of this technique may vary from person to person, it just works well for me. If I am comparing the output of one test to another test, this technique allows me to copy the relevant section of the log files to another editor, run the tests again, and examine both log file sections side-by-side. If I did not understand something well enough, I can reset my position in both log files to an earlier point and restart my examination from those new positions. Most importantly, it allows me to see a larger sequence of values than the ones visible from a single breakpoint within the code. And it is with that data that I can better see the bigger picture.
For those reasons and others, more detailed log files just work better for me. Knowing the parser because of my experience with it, I find it easy to follow along with how each line is parsed, even through the complicated sections dealing with Container Block elements. I have learned that I usually need to ignore most of the “stuff” in those blocks of the logs unless I am looking for something specifically to do with Container Blocks.
But even with all that good stuff working in my favor, that does not always make the job of fixing issues easier.
More Detail Does Mean More Work¶
While there are benefits to using more detailed log files, there are drawbacks as well. If you are not used to the volume of data, it can be overwhelming. Even being used to the volume, there are still times that I need to walk away because it becomes overwhelming. There are still times where I look at all the data without the right mindset, and it all looks like gibberish. I find closing my eyes and taking three or four deep breaths does wonders for getting the right mindset, but even that does not always work. Sometimes I need to clear my mind by walking around for 10 to 15 minutes until my head clears.
But in the end, my evaluation on whether the costs are worth the benefits always produces a resounding “Yes!”. But even with that yes, I do admit that there are times, like with Scenario Tests 237 and 238, that those costs and benefits get tested… sometimes to their maximums.
Scenario Test 237¶
Starting with what I believed to be the easier of the two disabled tests, I began with Scenario Test test_list_blocks_237, quickly renamed test_list_blocks_237x:
   > > 1.  one
>>
>>     two
Unlike the previous two tests, these two tests introduced something that is a core tenet of nesting Container Blocks: relative spacing. Not to complain to the writers of the GFM specification too much, but as this is a core concept of nested blocks, I wonder why the specification allocated only two examples to this concept and not more. In all seriousness, every other example in their specification can be properly converted into HTML without worrying about relative spacing. So, what is relative spacing, or more properly, relative nested container block spacing?
Basically, the GFM specification talks about how nested block spacing is not absolute, but relative to the last Container Block element on that line. To put this into perspective, consider the above example. When properly parsed, the specification says that it will produce the following HTML output:
<blockquote>
<blockquote>
<ol>
<li>
<p>one</p>
<p>two</p>
</li>
</ol>
</blockquote>
</blockquote>
But if you look at the Markdown example with an absolute mindset, the text two is clearly not indented enough to qualify as being within the Ordered List element. So what gives?
From a relative positioning point of view, the first line is broken down into the following sections: {space}{space}{space}>{space}>{space} for the two nested Block Quote elements that start the line, 1.{space}{space} for the Ordered List element that is next on that same line, and one for the text of the first List Item of that Ordered List element. Using that same relative point of view, this means that there are four characters after the Block Quote elements before the text of that List Block element begins.
Applying that to the last line of the example, the two nested Block Quote elements take up >>{space}, with the following nested Ordered List element coming into effect after the four space characters {space}{space}{space}{space}. At that point, the text two is present. This means that, relatively speaking, the text two is validly indented to remain part of the Ordered List element. Looking at the HTML output above, this is indeed how that is parsed.
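To make the difference between the two views concrete, here is a minimal sketch that measures the same two lines both ways. It is not PyMarkdown's code: the helper names are invented for this illustration, and the regular expressions are deliberate simplifications of the GFM rules (up to three spaces, a > character, and one optional space per Block Quote marker; digits, a delimiter, and spaces for the list marker).

```python
import re


def split_block_quote_prefix(line):
    """Split a line into its Block Quote prefix and the remaining text.

    Simplification: each marker is up to three spaces, a '>' character,
    and one optional space.
    """
    match = re.match(r"(?: {0,3}> ?)*", line)
    return line[: match.end()], line[match.end() :]


def list_content_offset(text):
    """Width of an ordered-list marker plus the spaces that follow it."""
    match = re.match(r"\d{1,9}[.)] +", text)
    return match.end() if match else 0


def leading_space_count(text):
    """Number of space characters at the start of the text."""
    return len(text) - len(text.lstrip(" "))


first_prefix, first_rest = split_block_quote_prefix("   > > 1.  one")
last_prefix, last_rest = split_block_quote_prefix(">>     two")

required_indent = list_content_offset(first_rest)
actual_indent = leading_space_count(last_rest)

# Absolute view: compare columns measured from the start of each line.
absolute_ok = len(last_prefix) + actual_indent >= len(first_prefix) + required_indent
# Relative view: compare indents measured from the end of each prefix.
relative_ok = actual_indent >= required_indent

print(absolute_ok, relative_ok)
```

With these inputs, the absolute comparison fails (index 7 on the last line against index 11 on the first line), while the relative comparison succeeds (four spaces against a required four).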
How To Approach That Change¶
In adding this support to the PyMarkdown parser, I was faced with two options: make the parser’s handling of nested sections relative-aware or make adjustments where needed. While I would like to eventually adjust the parser to be more relative-aware, I quickly figured out that it would take quite a bit to make that change. I spent a good day going through some preview steps of what it would take, just to get an idea of the effort. If I had to guess, it would take at least a month or two, if not more. Seeing as I want to get the PyMarkdown linter out there, that was out. So, the only good option left was to make adjustments where needed.
Subtle Changes¶
To be fully honest, getting to this set of changes took a long time to figure out, including at least three times where I had to walk away to clear my head.
The first part of solving that puzzle was recognizing from the logs that the __handle_block_quote_section function was not providing the right adjusted text. Specifically, in handling this case under normal situations, the correct number of space characters is placed in the respective Block Quote token. But because they both start on the same line, things got messed up. To correct this, I added the following code to ensure the adjusted_removed_text and adj_leading_spaces variables both had the correct values.
if (
    container_start_bq_count
    and parser_state.token_stack[stack_index - 1].is_block_quote
):
    count_of_actual_starts = ParserHelper.count_characters_in_text(
        adjusted_removed_text, ">"
    )
    assert count_of_actual_starts != this_bq_count
    adj_leading_spaces = parser_state.token_stack[
        stack_index - 1
    ].matching_markdown_token.leading_spaces
    while len(text_removed_by_container) > len(
        adj_leading_spaces + adjusted_removed_text
    ):
        adj_leading_spaces += " "
    adjusted_removed_text = adj_leading_spaces + adjusted_removed_text
With that text now adjusted properly, it was on to modifications to the __handle_nested_container_blocks function. For this function, the indent_level member variable of the start List Block token was being set to a number that indicated absolute position, not relative position. To remedy that, I added code to calculate the indentation difference between the original Block Quote element and the original List Block element. The adjusted_indent_level variable was then set to properly reflect the indentation relative to the Block Quote element and how it was set on the current line.
indent_level = parser_state.nested_list_start.indent_level
list_start_token_index = parser_state.token_document.index(
    parser_state.nested_list_start.matching_markdown_token
)
token_after_list_start = parser_state.token_document[
    list_start_token_index + 1
]
assert (
    parser_state.nested_list_start.matching_markdown_token.line_number
    == token_after_list_start.line_number
)
column_number_delta = (
    token_after_list_start.column_number
    - parser_state.nested_list_start.matching_markdown_token.column_number
)
adjusted_indent_level = (
    column_number_delta + end_container_indices.block_index
)
With those values calculated, I could make the final changes to the function. Before the big change was added, I needed the function to recompute the indent_level variable to allow for an adjustment to the relative positioning. However, before I reset indent_level, I needed the algorithm to be aware of whether there was any difference between the indent_level variable and the adjusted_indent_level variable, so adjustments could be made later.
After all those changes, it came down to the two lines at the end of the following example:
if indent_level > adjusted_indent_level:
    delta = indent_level - adjusted_indent_level
indent_level = (
    column_number_delta + end
...
if (
    parser_state.token_document[-1].is_blank_line
    and (end_container_indices.block_index + start_index)
    < indent_level
):
    ...
elif delta:
    adj_line_to_parse = adj_line_to_parse[delta:]
After all that work, the only thing that still needed to be adjusted was the adj_line_to_parse variable containing the current line. Because both Container Block elements are processed independently of any other elements, the adj_line_to_parse variable is reset to remove any whitespace that is part of the whitespace for one of the Container Block elements. In this scenario, the right amount of whitespace was not removed, causing the Leaf Block element processing to go wrong.
By removing that extra whitespace, everything fell into place, and it worked! It was a long way to get there, but it was worth it.
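The arithmetic in this section can be walked through with a small stand-alone sketch. Everything here is an assumption made for illustration: SketchToken is a hypothetical stand-in for the parser's tokens, columns are taken to be 1-based, the Block Quote prefix on the last line is taken to end at index 3, and the absolute indent of 11 is simply the prefix width on line 1 (7) plus the width of the list marker and its spaces (4). The real values inside the parser may differ.

```python
from dataclasses import dataclass


@dataclass
class SketchToken:
    """Hypothetical stand-in for a parser token; only positions are kept."""

    line_number: int
    column_number: int


def relative_indent(list_start, token_after, block_index):
    """Indent of the list content, measured from the end of the Block Quote prefix."""
    # Both tokens must come from the line that started the list.
    assert list_start.line_number == token_after.line_number
    column_delta = token_after.column_number - list_start.column_number
    return column_delta + block_index


# Line 1: three spaces, ">", space, ">", space, "1.", two spaces, "one".
# The list marker sits at column 8 and the text at column 12 (assumed 1-based).
list_start = SketchToken(line_number=1, column_number=8)
text_token = SketchToken(line_number=1, column_number=12)

# Last line: ">>", a space, four more spaces, "two". The prefix ends at index 3.
adjusted_indent_level = relative_indent(list_start, text_token, block_index=3)

# Absolute indent from line 1 (assumed): prefix width 7 plus marker width 4.
indent_level = 11

# The difference is how far the absolute view overshoots the relative one.
delta = max(indent_level - adjusted_indent_level, 0)
print(adjusted_indent_level, delta)
```

Under those assumptions, the relative indent on the last line comes out four characters smaller than the absolute one, which is the kind of gap the delta variable in the code above is meant to capture.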
Scenario Test 238¶
Taking a quick look at this issue, on the surface it seemed to be the same issue as with scenario test 237, except that the differing indentation appeared on the last line, not the first:
>>- one
>>
  >  > two
Looking at the output tokens, everything looked fine, and the HTML was being generated properly. But on closer examination, there was one little difference that changed the output of the Markdown generator: an extra line of leading whitespace.
To make sure that the Block Quote element is being represented properly, when each line inside of the Block Quote element is tokenized, the leading spaces including the > character are stored within the owning Block Quote token. This allows for it to be reconstructed without any issues as all leading Block Quote information is present, even if it varies. But in this case, the leading spaces for the first Block Quote were added to the token, followed by a newline character and a fully indented representation of the second Block Quote element, both from that line. While each one was accurate by itself, when they were combined by the Markdown generator, they added a lot of extra whitespace. That was the issue.
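As a rough model of that storage scheme, the sketch below keeps one prefix per source line, joined by newline characters, and rebuilds the Markdown from them. The class and function names are hypothetical and the generator is far simpler than the real one. It is only meant to show why the round trip depends on each line contributing exactly one prefix.

```python
class SketchBlockQuoteToken:
    """Hypothetical token that records one Block Quote prefix per line."""

    def __init__(self):
        self._prefixes = []

    def add_leading_spaces(self, prefix):
        """Record the prefix (spaces and '>' characters) for the next line."""
        self._prefixes.append(prefix)

    @property
    def leading_spaces(self):
        # Prefixes for successive lines are separated by newline characters.
        return "\n".join(self._prefixes)


def regenerate(token, contents):
    """Rebuild the Markdown by pairing each stored prefix with its content."""
    prefixes = token.leading_spaces.split("\n")
    return "\n".join(
        prefix + content for prefix, content in zip(prefixes, contents)
    )


token = SketchBlockQuoteToken()
# The prefix is allowed to vary from line to line.
for prefix in (">>", ">>", "  >  > "):
    token.add_leading_spaces(prefix)

markdown = regenerate(token, ["- one", "", "two"])
print(markdown)
```

If a single source line adds two entries to the stored text, as happened here, the prefixes no longer line up one-to-one with the lines and the regenerated Markdown stops matching the original.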
It took a bit of head scratching before I figured it out, but I did figure it out. In a nutshell, because of the two spaces between the first Block Quote character and the second Block Quote character, they were not interpreted as a “group” of Block Quote elements. Rather, they were interpreted as a Block Quote element, a space, and another Block Quote element.
special_case = False
special_case_adjusted_text = None
if (
    container_start_bq_count
    and stack_bq_count > 1
    and container_start_bq_count != stack_bq_count
):
    stack_index = 1
    block_quote_token_count = 0
    while True:
        if parser_state.token_stack[stack_index].is_block_quote:
            block_quote_token_count += 1
            if block_quote_token_count == stack_bq_count:
                break
        stack_index += 1
        assert stack_index < len(parser_state.token_stack)
    assert stack_index < len(parser_state.token_stack)
    matching_block_quote_token = parser_state.token_stack[
        stack_index
    ].matching_markdown_token
    if "\n" in matching_block_quote_token.leading_spaces:
        last_newline_index = (
            matching_block_quote_token.leading_spaces.rindex("\n")
        )
        special_case_adjusted_text = (
            matching_block_quote_token.leading_spaces[
                last_newline_index + 1 :
            ]
        )
        special_case = True
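Read as two steps, the code above first walks the token stack until it has seen the required number of Block Quote entries, and then takes whatever follows the last newline in that token's stored leading spaces. A simplified stand-alone sketch of both steps follows. It uses plain strings in place of stack tokens, and both function names are invented for this illustration.

```python
def index_of_nth_block_quote(stack, wanted_count):
    """Return the stack index of the Nth Block Quote entry, skipping index 0."""
    seen = 0
    for index in range(1, len(stack)):
        if stack[index] == "block-quote":
            seen += 1
            if seen == wanted_count:
                return index
    raise ValueError("fewer Block Quote entries than requested")


def last_line_prefix(leading_spaces):
    """Return the text after the final newline, or None if there is none."""
    _, separator, tail = leading_spaces.rpartition("\n")
    return tail if separator else None


# Plain strings stand in for the entries on the parser's token stack.
stack = ["document", "block-quote", "block-quote", "list"]
print(index_of_nth_block_quote(stack, 2))
print(repr(last_line_prefix(">>\n>>\n  >")))
```

The sketch returns None when there is no newline, which plays the same role as special_case staying False in the original code.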
Cleaning Up With Small Variations¶
Having a Monday off and with this week’s article started, I wanted to see if I could make some progress on two variations: one for scenario test 237 and one for scenario test 238.
As detailed above, the Markdown for scenario test 237 is:
   > > 1.  one
>>
>>     two
where the indentation on the final line allows that line to be included in the Ordered List element started on line 1. The small variation there was to create test function test_list_blocks_237e with one less space character on that final line:
   > > 1.  one
>>
>>    two
This reduction in indentation on that final line makes that final line ineligible for the Ordered List element. I was pleasantly surprised that it worked right away, without any changes required.
Scenario test test_list_blocks_238a includes a similar change, this time adding an extra space on the final line to make it eligible for the Unordered List element:
>>- one
>>
> > two
While this new test was not as successful as test function test_list_blocks_237e was, it only required minimal changes to make it work properly. The HTML output was already consistent with what was expected, but the Markdown generator had additional whitespace in its output for the final line. Specifically, when that final line was parsed, the Paragraph token started with the extraction of two space characters. This meant that the regenerated output had four space characters between the final > character and the two text.
Addressing that issue did not require that much work. In the __handle_nested_container_blocks function, I added the following code to reduce the number of spaces by the appropriate amount:
elif (
    not nested_container_starts.block_index
    and adj_line_to_parse
    and adj_line_to_parse[0] == " "
    and indent_was_adjusted
    and parser_state.nested_list_start
):
    assert adj_line_to_parse.startswith(
        parser_state.nested_list_start.matching_markdown_token.extracted_whitespace
    )
    adj_line_to_parse = adj_line_to_parse[
        len(
            parser_state.nested_list_start.matching_markdown_token.extracted_whitespace
        ) :
    ]
    already_adjusted = True
This code was added specifically to address the extra space characters and to remove them from the adjusted line variable adj_line_to_parse. If the indent was adjusted and the current line includes a nested list start token, this code reduces that adjusted line variable by the list’s indent. The code itself was simpler to write because an increase in the spacing was already being handled earlier in that same function.
After taking a bit to figure out that solution, once it was implemented, everything worked fine.
What Was My Experience So Far?¶
The full quote from Shakespeare’s Macbeth is:
It is a tale told by an idiot, full of sound and fury, signifying nothing.
Trying to mentally return to when I learned this line in high school, I remember talking specifically about that line. I seem to remember that we were talking about how Macbeth’s wife had just died, and he did not see that life contained any meaning for him after that point. Over the years since high school, my thoughts on that line have changed a bit. I now think of that same line in situations where someone goes on and on about something, only to have it appear in real life as something with little sound or fury.
The work that I documented in this article really did feel like that. When I started the work, I was not sure how difficult the work was going to be, only that it would require some changes. Now, it may be because of the research that I did to get prepared for these changes, but those changes ended up feeling… well… trivial. I was worried that I was going to have to make some grand changes to the project to accommodate this “little” issue that needed to be fixed, and the actual work was “little”.
Do not get me wrong, I am grateful that those issues required less than 100 lines of code to change. But at the same time, I realized that I had built this issue up as “THE NASTY CHANGES REQUIRED TO…”[1] instead of “yup, just some normal tweaking” changes. For context, I spent a full evening working on the research and trying simple changes out until I was convinced it would take a more concerted effort to solve. And even then, I did some more testing to make sure that my research was correct. For me, usually that amount of research leads to a lot of changes.
And maybe that is why I feel that it went from “sound and fury” to almost “nothing”: I did proper research. Sure, it took some time to figure out the correct decisions to make based on that research, but it was that research that was pivotal. For me, that is just a good feeling to have. While I was not able to show any actual code as a result of that research, it helped me prune many decision trees early on, allowing me to follow a quick path to the actual work I needed to do. Essentially, it pointed out the 90% of the work that I should avoid and had me focus on the 10% of the work that would be most beneficial. And that helped a lot!
In the end, the way I see it, while the “sound and fury” of debugging is usually where I expect the hard work to be, there are cases where the “signifying (almost) nothing” portion of the debugging work is where it is at!
What is Next?¶
After the week I have had being sick, I really am not sure what is going on yet. Stay tuned!
[1] For any readers not fluent in text-speak or DM/IM-speak, the extended use of capital letters usually implies that the author of the text is yelling. In this case, it would be more of a “booming loud” voice, implying sound and fury.