Summary
In my last article, I took care of completing the Markdown transformer checks for one half of the container block tokens: the List Block tokens. In this article, I tackle both the line/column number consistency checks and the Markdown transformer checks for the Block Quote tokens.
Introduction
While the implementation of list tokens went more easily than I had thought it would, I remained cautiously optimistic about adding consistency check support for Block Quote tokens. During the initial development of the line/column number checks, I noticed that the Block Quote tokens did not keep any information about removed data. After having completed the List Block token support, I knew that it would be pivotal to get that right. And that meant more work. A lot of nitpicking work.
To add the consistency checks, I was going to have to accomplish three things: capture the removed line start text, use that information to validate the line/column numbers, and then write the Markdown transformations for it. The last two items were the easy part; those parts I had confidence I could complete. It was the capturing of the removed text that I was not confident about. And that capturing was the change that I needed to tackle first.
What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 12 Aug 2020 and 14 Aug 2020.
Let’s Talk About the Obvious
I am generally very disciplined when it comes to writing my blog. I keep notes while I am writing the code, curating those notes into an outline from the Thursday to the Saturday before I post my article. If I am lucky enough, I will start the actual writing of my article on Saturday night, but most often I start writing it early Sunday morning. This allows me to take my time to enjoy my Sunday while ending up with a solid version of the article by Sunday night. Then, on Monday night, I take that article, add extra code samples, and do some wordsmithing to shape the article into its final form. It may not be the way others write their articles, but it is a process that works for me.
Until this week that is. Everything was on track until Monday, when I started my edits with a headache that I just could not ditch. But process is process, so I started doing those edits, pushing on until a couple of hours later. At that point, I looked at what I was doing and went… huh? I started looking at my edits and tried to figure things out, using even more edits to fix any problems I encountered. Far from making things better, I was making them worse. But I would not see that until later.
It was a while later when I finally stopped and locked my computer. Above all else, this was the best thing that I did in that entire situation. It took me a while, but I stopped. However, by that point, a certain amount of damage was already done. I had written some things twice and other things not at all. In some cases, I had sentences that, looking at them later, made me wonder if I had written them while drunk. I knew I could recover from that position, but it would require a clear head, some extra time, and some hard work. And it was not going to happen that night. I came to the realization that I was not going to be able to meet my usual posting deadline of Monday night.
At first, I was angry with myself. Many weeks of not missing a posting deadline and now I had a bad record. “I always post on Monday nights, except for…” Argh! But after thinking about the purpose of my blog, to educate and help others, I started to find peace with it. None of us is perfect, and we all need to take care of ourselves. Me included. My health versus my article’s deadline. I needed to take that extra day and get better before finishing… er… unfinishing and then refinishing the editing.
Capturing the Leading Block Quote Text
In its simplest form, the handling of container block elements and their tokens is a relatively easy concept. Every line of a Markdown document enclosed within a container block element is usually prefaced with one or more characters that denote the presence of that container element. For Block Quotes, these are character sequences with the `>` character and an optional whitespace, and for Lists these are whitespaces. To make the processing of those contained lines easier, at the start of the project I took some up-front time to ensure that the container block element processing is independent of the leaf block element processing. As such, I was able to work on the leaf block processing independently of the container block processing.
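
To make that concept concrete, here is a minimal sketch of that prefix stripping for Block Quote elements. The function and sample document are my own illustration, not the project's code:

```python
# A minimal sketch of container prefix stripping for Block Quotes:
# remove a leading '>' character and one optional following space,
# leaving the rest of the line for the leaf block processors.
def strip_block_quote_prefix(line):
    if line.startswith(">"):
        line = line[1:]
        if line.startswith(" "):
            line = line[1:]
    return line

document_lines = ["> # heading", "> some text", "plain text"]
for document_line in document_lines:
    print(strip_block_quote_prefix(document_line))
# prints: "# heading", "some text", "plain text"
```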
When I was doing the initial research for container block support, I started with List tokens. I quickly wrote a proof of concept for the line/column number check and was happy to find that I had already placed the required information in the token. However, after using the same process with Block Quote tokens, it was evident that I had missed the mark with them. Unlike the List tokens, none of the required information was placed within the Block Quote tokens. At that time, after having gone down a rabbit hole for a bit, I wisely decided to wait until later to implement that. Well, that time was now and I knew it was going to take some work to add it.
Addressing the Issue
Like the work I did for the List Block tokens, the Block Quote stack token needed a change to support the `matching_markdown_token` field. This field is pivotal in informing the processor of the latest Markdown token that is containing other tokens. But to properly use this field, it would take a bit of disassembly and reassembly. Before this change, the code to add the necessary tokens to the two relevant stacks was:
```python
parser_state.token_stack.append(BlockQuoteStackToken())
...
container_level_tokens.append(
    BlockQuoteMarkdownToken(extracted_whitespace, adjusted_position_marker)
)
```
Two different tokens and two different stacks, with nothing between them. With this new functionality in place, that code needed some slight changes:
```python
new_markdown_token = BlockQuoteMarkdownToken(
    extracted_whitespace, adjusted_position_marker
)
container_level_tokens.append(new_markdown_token)
parser_state.token_stack.append(
    BlockQuoteStackToken(new_markdown_token)
)
```
Instead of two distinct tokens, there was now a stack token that included a reference to the Markdown token that it represented.
With that field properly initialized, the `add_leading_spaces` function was then added to make use of that new field. It is a simple function that adds any extracted text (usually whitespace) to its field, separating any new text from the existing text using a newline character. The function itself is very boring. The interesting part was locating where that function needed to be called from and using it properly.
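
As the function's behavior is described but not shown above, here is a minimal sketch of what `add_leading_spaces` might look like; the `leading_spaces` field name comes from a later code sample in this article, and everything else is my assumption:

```python
class BlockQuoteMarkdownToken:
    """Sketch of the Markdown token, reduced to the new behavior."""

    def __init__(self):
        self.leading_spaces = None

    def add_leading_spaces(self, leading_spaces_to_add):
        # Append the newly removed text, using a newline character to
        # separate it from any previously recorded text.
        if self.leading_spaces is None:
            self.leading_spaces = leading_spaces_to_add
        else:
            self.leading_spaces += "\n" + leading_spaces_to_add

token = BlockQuoteMarkdownToken()
token.add_leading_spaces("> ")
token.add_leading_spaces("> ")
print(repr(token.leading_spaces))  # '> \n> '
```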
The First One Is Almost Always Easy
The first location was obvious: within the `__handle_block_quote_section` function of the `BlockQuoteProcessor` class. This is where the bulk of the processing of Block Quote elements and their effects goes through. In there is an easy-to-find block of code that records the number of characters removed and resets the string to exclude those characters:
```python
line_to_parse = line_to_parse[start_index:]
removed_chars_at_start = start_index
```
That block of code was slightly modified to record that removed text and place it into its own variable:
```python
removed_text = (
    line_to_parse[position_marker.index_number : start_index]
)
LOGGER.debug(...)
line_to_parse = line_to_parse[start_index:]
removed_chars_at_start = start_index
```
With the removed text in hand, the code just needed to know which token to associate the removed text with. As the code processes Block Quote elements, it was reasonable to assume that the most relevant Block Quote stack token is the last one on the stack. That stack token was easily found with a simple for loop:
```python
found_stack_token = None
for stack_index in range(len(parser_state.token_stack) - 1, -1, -1):
    ...
    if parser_state.token_stack[stack_index].is_block_quote:
        found_stack_token = parser_state.token_stack[stack_index]
        break
```
With the extracted text and the ‘top’ stack token, the only thing that was left to do was:
```python
found_stack_token.matching_markdown_token.add_leading_spaces(removed_text)
```
Almost right away, I was off to a good start on extracting the leading spaces for the Block Quote token and storing them.
Locating the Next Issue
After those changes were in place, I ran the scenario tests again, seeing almost every test that has a Block Quote token fail. My heart dropped. But after a second, I realized that I wanted that to happen. As I wanted to make sure each Block Quote token was verified, I added the leading space data to the end of the serialized string for the `BlockQuoteMarkdownToken` class, separated from the rest with the `:` character. If things worked properly, that meant every Block Quote token would, at the very least, now include an extra `:` character. Every serialized Block Quote token was now “wrong”, so every Block Quote scenario test should fail. Good!
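
To illustrate what I was expecting to see, here is a hypothetical before-and-after of a serialized token; the surrounding format is my own guess, as only the extra `:` field is described above:

```python
# Hypothetical serialized forms of a Block Quote token (format assumed):
serialized_before = "[block-quote(1,1):]"
serialized_after = "[block-quote(1,1)::> \\n> ]"  # extra ':' plus captured leading text
```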
Working through each case was decently fast, with a solid methodical process in place: look at the results, find the next failing test, and manually determine what change needed to be made to the Block Quote token. After making that change to the test data, I would then re-run that specific scenario test, and check for token differences.
It was by using this process that I found the next issue: missing leading whitespace. In some cases, the leading text that was extracted was preceded by one or more whitespaces. Those whitespaces were stuffed into the `extracted_whitespace` variable and then ignored after that. The resolution to that issue was simple. Instead of only adding the leading space to the `removed_text` variable, the contents of that `extracted_whitespace` variable needed to be added as well, as such:
```python
removed_text = (
    extracted_whitespace
    + line_to_parse[position_marker.index_number : start_index]
)
```
Running the tests again, a new block of tests started passing.
And Next… Blank Lines
As more and more of the Block Quote scenario tests were changed and started passing, there were a handful of tests that were failing that I left for later. When I got to the end of those tests, I went back and started to look at those failing tests one-by-one.
The first group of failures that I examined were ones in which there was a Blank Line element contained within a Block Quote. A good example of this is example 222:
```markdown
> foo
>
> bar
```
In these cases, the parsing was correct, but the newly added `add_leading_spaces` function was not being called. The outcome was that the number of newlines contained within the owning Block Quote token did not match the number of lines within the Block Quote element itself. To ensure that those two values lined up, the `add_leading_spaces` function was called with an empty string, thereby evening up those values.
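
Reusing the token sketch from earlier, the effect of that fix on example 222 would look something like this; the per-line comments are my reading of the description above:

```python
token = BlockQuoteMarkdownToken()
token.add_leading_spaces("> ")  # for the "> foo" line
token.add_leading_spaces("")    # for the blank line: nothing recorded for it
token.add_leading_spaces("> ")  # for the "> bar" line

# One leading-space entry now exists for each line of the element.
print(token.leading_spaces.split("\n"))  # ['> ', '', '> ']
```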
Running the scenario tests again, all the tests explicitly for Block Quotes were passing. I then ran the scenario tests once more, checking to make sure that the tests with Block Quotes and Blank Lines passed. While the new changes were now passing, there were still a handful of tests to work on.
And Finally, HTML Blocks
Having run the scenario tests again, the only tests that were not passing were scenario tests that included Block Quotes and HTML blocks. Doing some research into the issue, I quickly found that it looked like the same issue as with the Blank Line elements, just with HTML blocks. This issue was uncovered through one of my additional tests, “cov_2”, used to increase coverage on the various forms of HTML start and end tags:
```markdown
</hrx
>
</x-table>
```
Like the previous issue with the Blank Line tokens, the number of newlines was not lining up between the two sources. Once again, calling the `add_leading_spaces` function with an empty string in these cases solved the issue.
Closing Out the Token Changes
After those changes had been completed, it was obvious that all the scenario tests were passing. Just to be sure, I manually went through each of the scenarios and verified the newly changed tokens. As I have said before, maybe I am paranoid, but if I was depending on those results, I figured an extra pass or two would not hurt.
It just felt good to get these changes completed. I had a high degree of confidence that I was able to find most of these issues, but an even higher degree of confidence that the remaining work would flush out anything that I missed. It was with those positive thoughts in my head that I started working on the consistency checks.
Validating the Line Numbers and Column Numbers
With that confidence, I started working on the consistency checks for line numbers and column numbers. The first step in doing this was removing a couple of lines of code that prevented the consistency checks from firing if one of the tokens was a Block Quote token. It was useful when I did not have anything in place, but now it would just get in the way.
After that change, tests predictably started failing, but I was ready for them. From the work that I had just completed, I figured that I would have to follow a similar path in implementing the consistency check. Therefore, the first thing I did to help with the consistency check was to reset the `leading_text_index` field of the `BlockQuoteMarkdownToken` instance to 0 when it is encountered. This made sure that the tracking of the leading spaces would always start with the first entry.
With that done, the code that was needed to take advantage of it was easy to write. When a token exists on a new line, its indent must be determined by tracking the size of the removed text and adding that measurement to the column number. Based on the previous work, this became trivially easy:
```python
split_leading_spaces = top_block.leading_spaces.split("\n")
init_ws += len(split_leading_spaces[top_block.leading_text_index])
```
Basically, grab the top Block Quote token, split its leading spaces into lines, and use the `leading_text_index` value of that topmost token to grab the right leading space.
From there, the new work mirrored the work that I did in preparing the tokens. When a Blank Line token is encountered, that index is incremented. For each line of text within an HTML block, that index is incremented. In addition, the `leading_text_index` field needed to be tweaked at the end of HTML blocks, Atx Heading blocks, and Paragraph blocks, which was something I figured might come up. Just some simple cases where a bit of extra finessing was needed.
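
Reusing the earlier token sketch once more, the bookkeeping described above amounts to something like the following; the loop and the line classifications are my own framing of it, not the project's code:

```python
token = BlockQuoteMarkdownToken()
token.add_leading_spaces("> ")
token.add_leading_spaces("")
token.add_leading_spaces("> ")
token.leading_text_index = 0  # reset when the Block Quote token is encountered

split_leading_spaces = token.leading_spaces.split("\n")
for line_kind in ["text", "blank", "text"]:
    # Use the current entry to compute the indent for this line...
    indent = len(split_leading_spaces[token.leading_text_index])
    print(line_kind, indent)
    # ...and advance the index so the next line gets the next entry.
    token.leading_text_index += 1
```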
To be clear, I was not 100% sure that this would happen, but I was hoping that it would. To me, it made sense that if I needed to add code to track the leading spaces that were removed, any consistency check would need to follow that same path to verify the information. And as for the extra tweak for the end tokens, I was confident that it was a solid change, and I was not worried about that deviation from the previous work.
Writing the Block Quote Markdown Transformations
Given the work I had just completed to get to this point, it was obvious to me that I was going to have to do a lot of fiddling with spaces to get the transformations correct. To make that job a bit easier, I decided for the moment to eliminate any Markdown documents which contained both Block Quote elements and List elements. I would get back to these quickly, but at that moment, it was just too much.
The processing of the start and end of the Block Quote elements was simple, largely mimicking the work I did for the List elements. When a Block Quote token was processed, it created a copy of itself and added that copy to the top of the `container_token_stack` list. From there, the leading spaces were retrieved from their storage in the token and added to the sequence to be emitted. The end of the Block Quote element was even easier, returning an empty string after removing the top token off the `container_token_stack` list. The work I had previously done on setting up the leading spaces was really paying off.
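
A rough sketch of that start and end handling follows; the handler names are my own invention, with only `container_token_stack` and the leading spaces storage coming from the description above:

```python
import copy

def rehydrate_block_quote_start(current_token, container_token_stack):
    # Work on a copy so the rehydration bookkeeping does not disturb
    # the original token.
    new_instance = copy.deepcopy(current_token)
    container_token_stack.append(new_instance)
    # Emit the first recorded leading space to start the element.
    return new_instance.leading_spaces.split("\n")[0]

def rehydrate_block_quote_end(container_token_stack):
    # Nothing to emit; just retire the matching start token.
    del container_token_stack[-1]
    return ""
```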
With the work to recognize and process the actual tokens taken care of, the main task ahead of me was to add and populate the `__perform_container_post_processing_block_quote` function. Like how the List Block tokens were handled in the `__perform_container_post_processing_lists` function, this new function was used to handle the newline processing for text enclosed within Block Quote elements. After having completed all this other work, that work was relatively simple to do. With all the individual token processing already performed, this function just needed to focus on detecting newline characters. For each of these characters encountered, the top Block Quote token would be used to determine the current `leading_text_index` to start with. With each newline encountered, the current split value would be used, incrementing the `leading_text_index` value afterwards. This pretty much worked flawlessly out of the box.
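
Under the same naming assumptions as the previous sketches, that newline processing could be reduced to something like this:

```python
def reapply_leading_spaces(text, top_block_quote_token):
    # For each newline in the emitted text, prefix the line that follows
    # it with the next recorded leading space, advancing the index as we go.
    split_leading_spaces = top_block_quote_token.leading_spaces.split("\n")
    rebuilt_lines = text.split("\n")
    for line_index in range(1, len(rebuilt_lines)):
        rebuilt_lines[line_index] = (
            split_leading_spaces[top_block_quote_token.leading_text_index]
            + rebuilt_lines[line_index]
        )
        top_block_quote_token.leading_text_index += 1
    return "\n".join(rebuilt_lines)
```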
As I Was Cleaning Up
An interesting issue came up as I was generating these transformations. For some reason, I had missed some cases involving Block Quote elements that contained Fenced Code Block elements. With the Markdown transformer now rehydrating the Markdown, it was obvious that things were missing. It did not take me long to figure out that the Fenced Code Blocks were missing their Block Quote leading characters. This was found for example 98:
````markdown
> ```
> aaa
bbb
````
When the transformer tried to rehydrate the document, it rehydrated that Markdown text without the leading `>` sequence before the text `aaa`. Having just gone through that research for other elements, I was quick to spot it and realize where the issue was. A couple of quick changes, and that test was passing!
Along the Way
And to hammer home the point that this consistency checking is good, it found some issues along the way.
The first one was an interesting issue where the starting whitespace for one transformation was missing. And it was such a weird case. It was an Atx Heading element that contained a Shortcut Link element as its only text, separated from the Atx Heading character (`#`) by the mandatory single space. That single space is really important. Two spaces it was okay with, but the one space, nope! Due to the way that it was parsed, that space was missing and not accounted for. The fix for this was to add that starting whitespace as a new token containing the removed text, but with a twist. That twist was to use the `create_replace_with_nothing_marker` function to add that text as itself for the Markdown transformer, but keep it as removed for the HTML transformer. With both transformers appeased, it was on to the next issue.
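
Before moving on, here is a conceptual sketch of that "replace with nothing" idea; the `create_replace_with_nothing_marker` function's actual signature is not shown here, so the class and names below are purely my illustration:

```python
class ReplacedText:
    # One piece of text, two renderings: the Markdown transformer emits
    # the original text, while the HTML transformer emits the replacement.
    def __init__(self, original_text, replacement_text):
        self.original_text = original_text
        self.replacement_text = replacement_text

# The removed starting whitespace is present when rehydrating Markdown,
# but stays removed when generating HTML.
leading_whitespace = ReplacedText(" ", "")
```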
The second issue that I uncovered was that, in rare cases, the leading spaces for certain lines were missing. After doing some digging, it appeared that this only affected lines that did not have any Block Quote prefix characters removed from them. So, after adding a new `__adjust_paragraph_for_block_quotes` function to the `LeafBlockProcessor` class and wiring it into the `parse_paragraph` function, the debug output confirmed that it was only those few cases where this was an issue.
And Of Course, Some Cleanup
It would not be a proper week of coding if I did not do some cleanup. In this case, the cleanup was simple: relocating the line/column number logic to its own module. One of the things that made the Markdown transformations easy to deal with is that those transformations are in their own `TransformToMarkdown` class. That worked well. The line/column number checks were in with some other validation code in the `utils.py` module. That worked… well… okay?
When I started with the consistency checks, they were new, and keeping all that code in the `utils.py` module made sense. But as the amount of code in the module grew, I never took the time to do some refactoring. As such, the module developed two responsibilities. The first was to be the location for all the `assert_*` functions and the `write_temporary_configuration` function. Those are all functions that are called directly from the tests, mostly in the final Assert stage of the test. The second was to house the logic for the line/column number consistency checks.
It just seemed logical, now that that part of the code was stable and well-tested, to take that second responsibility and put it into its own module. I created the `verify_line_and_column_numbers.py` module and started moving the functions into it, with the `verify_line_and_column_numbers` function being the main entry point for the module. It just seemed cleaner and more compact. One module, one responsibility.
What Was My Experience So Far?
The dominant part of my experience at the current moment is one of humility and patience. Yes, it is Tuesday night. And after a little over 3 hours of extra work, I have finally recovered the article to where it was at this time last night. I do feel an impulse to be hard on myself about this delay, but I also am working hard to remember to be patient with myself. This is one time where I am missing a deadline, and it probably will not be the last.
The humility that I am trying to practice is in understanding that I cannot do everything all the time. I know that sounds like it should be an obvious thing to know, but I think we all forget it from time to time. I was so focused on making sure that I met my deadline that I neglected to have a conversation with myself on whether that was the right choice. But it was not a large rabbit hole, just a small one that I was able to easily recover from.
Focusing more on the actual work that I accomplished, I was buoyed. I had long been worried about how hard it would be to implement the consistency checks for Block Quote tokens. Having completed that work, I am now trying to figure out why I was worried. If I had to guess, I would say it was because of what it took to complete that work for List tokens. That work was very nitpicky because it contained a lot of “empty” whitespace, if I am remembering it clearly. From the effort that it took to deal with that, I can see how I might have thought it would take the same effort for Block Quote tokens.
But it did not. The work I had done in that area on List tokens forced me to get things set up to make List token processing easier, which I believe Block Quotes benefited from. Regardless, I was glad that I could close the books on the Block Quote tokens and their consistency checks.
What is Next?
With all the block token consistency checks now taken care of, there was a bit of clean up to do with determining the height of each block token. While I initially thought that it would be easy, it did not turn out that way.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.