Summary¶
In my last article, I started the transition to working on Block Quote issues. Having made good progress, this week I continued with that work, sometimes blurring the line between Block Quote issues and Block Quote/List Block interactions.
Introduction¶
Now firmly in the mode of dealing with Block Quote issues, I was looking forward to making more progress in this area. With each week that passed, I was becoming more aware of how close I was to being able to do at least an initial release of the project. Block Quotes, and possibly Block Quote/List Block interactions, are the last big thing that I need to do before taking that leap, so I just needed to buckle down and work through those issues, wherever they would lead.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 16 Dec 2020 and 20 Dec 2020.
Fixing Things Up Before Proceeding¶
Before I started doing a lot of work for the week, I decided I needed to get one of the disabled functions working: test_list_items_282. While it was not a tremendously important issue, it was one that I had put off for a while, and I just wanted it out of the way. Following the GFM Specification exactly, it can be argued that the following Markdown:
1. foo
2. bar
3) baz
should be parsed either as a single list with two items or as two separate lists. The crux of the issue is whether the third line is considered a continuation of the second line or the start of an entirely separate list. Basically, it all comes down to how your parser implements paragraphs and paragraph continuations.
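For reference, and based on my own reading of the specification rather than anything quoted in the issue, the delimiter change on the third line should win, with the expected HTML being two separate Ordered List elements:

<ol>
<li>foo</li>
<li>bar</li>
</ol>
<ol start="3">
<li>baz</li>
</ol>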
Luckily, one of the primary authors of the specification (@jgm) chimed in with this following bit of information:
The intent was to have language that applies to the starting of list items in the middle of a paragraph that isn’t a direct child of a list item at the same level.
While some of the other help was not as useful, that one sentence was. I did agree with the other author that the Paragraph element is Markdown’s default element; however, I was not as sure that his interpretation of paragraph continuations was correct. Still, this information helped me out a lot. Using it, I was quickly able to put together these lines:
if parser_state.token_stack[-2].is_list:
    is_sub_list = (
        start_index >= parser_state.token_stack[-2].indent_level
    )
Following the suggestion of @jgm, I changed the exclusion code at the end of the is_olist_start function to include a reference to is_sub_list. Going through my usual process, I quickly found that everything just fell into place, and the test was passing. But I was not confident that I had properly fixed the issue, so I created four variants of the test data, each just a little different from the others. It was only when all five tests had passed that I considered the issue resolved.
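To make that change concrete, here is a rough sketch of the kind of exclusion I am describing. This is not the project’s actual is_olist_start code, and every name other than is_sub_list is my own placeholder: the rule that an ordered list may only interrupt a paragraph when it starts with 1 should not be applied when the new marker sits at the same level as the list item that owns the paragraph.

# Illustrative sketch only, not the real is_olist_start exclusion code.
# An ordered list can normally interrupt a paragraph only when its number
# is "1"; that exclusion should still hold for document-level paragraphs
# and for would-be sub-lists, but not for a marker at the same list level.
if is_in_paragraph and number_text != "1":
    if not is_within_list or is_sub_list:
        is_list_start = False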
An Easy Set of Tests¶
Some of these items take days to complete and some take hours. Until I start working on them, I never know which bucket they will end up in. Therefore, when I started working on this item:
- 634a in bq and in list
I had no clue what was going to happen. Starting with the basics, I looked at the Markdown for function test_raw_html_634, which was:
<a /><b2
data="foo" ><c>
As the Markdown does not start with a single complete HTML tag, or with an HTML tag from one of the special groups, an HTML block is ruled out. But when the processing for a Paragraph element happens, the inside of that paragraph is filled with Raw HTML elements, only slightly altering the Markdown when rendered as HTML:
<p><a /><b2
data="foo" ><c></p>
As the item suggests, placing that same Markdown inside of a List Block or a Block Quote should result in the same behavior, just inside of that block. Crossing my fingers for good luck, I created two variants of that Markdown: one prefixed with - and the other prefixed with >. Things flowed quickly through my usual process, and I was happy to find that both scenarios worked without any additional changes being needed.
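For illustration only (these are my reconstructions of the variants, not text copied from the project’s test files), the Block Quote variant would look something like:

> <a /><b2
> data="foo" ><c>

with the expected HTML being the same paragraph wrapped in a blockquote element:

<blockquote>
<p><a /><b2
data="foo" ><c></p>
</blockquote>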
More Fun with Block Quotes¶
With good luck on my last item, I hoped it would carry over to my next one. So, when I looked for more work to round out the Block Quote tests, I came across the following group of items in the issues list:
- 300 with different list following
- 300 with extra indent on following item
- 301, but with extra levels of block quotes
- 301, with indented code blocks
Rather than tackling them separately, I decided to handle them together as a group. The first part of that work was making the requested variations on the data for test function test_list_items_300. That data was somewhat simple, with the Markdown text of:
* a
  > b
  >
* c
To address the first item, I added test function test_list_items_300a, which included an Ordered List element after the Block Quote element instead of an Unordered List element. Test function test_list_items_300b addressed the second item by keeping that List Item element as an Unordered List element, but adding 2 space characters before it to make it a sublist.
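For illustration only (these are my reconstructions of the new test data, not text copied from the project), test_list_items_300a would swap the final item for an Ordered List item:

* a
  > b
  >
1. c

while test_list_items_300b would indent that final item by two spaces to make it a sublist:

* a
  > b
  >
  * c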
Similarly, test function test_list_items_301 has a Markdown text of:
- a
  > b
  ```
  c
  ```
- d
The new variation of this Markdown in test function test_list_items_301a was to change the second line to start two Block Quote elements instead of one. Test function test_list_items_301b modified the test data slightly by indenting each line of the Fenced Code Block element by one space. Test function test_list_items_301c made a more drastic change by replacing the Fenced Code Block element with the single character c, indented from the start of the Block Quote element. Finally, the test_list_items_301d function provided a more correct version of test function test_list_items_301c by including a blank line and a c character indented by four spaces, but properly enclosing them within the Block Quote started on line 2.
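As a rough guide only (again, my own reconstruction rather than the project’s actual test data), the test_list_items_301b variant would indent each line of the Fenced Code Block by one extra space, something like:

- a
  > b
   ```
   c
   ```
- d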
With those changes made, it was time to get down to figuring out if there were any problems and dealing with them!
Working The Problem¶
Starting with my normal process for working through new test functions, I worked through the tokens and HTML for each of these functions. Apart from the functions test_list_items_301b and test_list_items_301c, the other tests were producing the tokens that I expected them to. After trying to get those two functions working properly for an hour or so, I decided to put them on hold while I got the other functions cleared up.
Focusing on those other test functions, the HTML output was mostly there, but required a small amount of fiddling. Specifically, in tests in which there was a Blank Line element within a Block Quote element within a List Block element, the Blank Line within the Block Quote was being used to determine whether the list was loose. As that Blank Line was within another block and not within the List Block itself, it should not have been affecting the calculation of the List Block’s looseness. Luckily, the fix for this was to add nine lines of code to the __calculate_list_looseness function to properly increase and decrease the stack_count variable to account for the Block Quote token.
With the tokens and HTML output dealt with, it was time to handle the rehydrated Markdown and the consistency checks. The fix to the Markdown was an easy one to see: the whitespace allocated to the Block Quote tokens was not being added back into the Markdown text that was generated. Incorporating that information took only a few easy changes, leaving just the consistency checks.
While the consistency checks took a bit, in retrospect they were somewhat easy to understand and fix; at the time, though, it took a bit of effort to work through them. Like the Markdown generator, the consistency checks needed changes to properly track which part of the Block Quote’s extracted whitespace had been applied by the Markdown generator. That tracking is done using the Block Quote token’s leading_text_index variable, which was not being properly incremented to track the newlines used within the owning Block Quote token.
Once that change was done, things were looking a lot better, but there was a single case where there was an index error getting that whitespace out of the token. Upon debugging, it was obvious that the leading_text_index variable was getting incremented twice. Fixing that took a bit of passing information around, but it was quickly taken care of. And with that fix in place, each of the tests that I was working on was solved and passing properly.
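To make the leading_text_index bookkeeping concrete, here is a minimal sketch of the idea, assuming the Block Quote token stores one leading prefix per line in a newline-separated field (the leading_spaces name is my assumption; leading_text_index is the real field discussed above). Each time one line’s worth of prefix is written back out, the index advances by exactly one, so the Markdown generator and the consistency checks always agree on which prefix has been consumed:

# Sketch only: one Block Quote prefix is consumed per rehydrated line.
leading_texts = block_quote_token.leading_spaces.split("\n")
prefix = leading_texts[block_quote_token.leading_text_index]
rehydrated_line = prefix + rest_of_line
block_quote_token.leading_text_index += 1   # advance once, and only once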
Not Everything Was Solved¶
With the other tests passing cleanly, I refocused my efforts on test functions test_list_items_301b and test_list_items_301c. Doing my usual research into the issues, I started to figure out what the tokens generated by the parser should be, comparing that answer with what was actually being generated. It was then that I noticed that the tokens for the tests were close to normal, but not correct. In both test functions, the tokens mostly differed in where one element stopped and the next one started.
Now, when I say, “I fiddled with the code”, I really mean that I tried normal paths and interesting paths to try to solve the issue. And it was similar with these issues: after around two hours of fiddling, I was no closer to having a solution than when I first looked at the problem.
In the end, after doing a fair amount of research and debugging, I decided that I was going to commit the code with functions test_list_items_301b and test_list_items_301c disabled. I just was not getting the right “angle” on solving those issues. I felt it would be better to commit what I had and work on those two functions in the next couple of days, so that is what I did!
Dealing With 301B¶
During that Saturday, I decided to take another shot at test function test_list_items_301b. I knew that the tokens were just wrong. To quantify that wrongness, the Markdown text:
- a
  > b
   ```
   c
   ```
- d
produced a Block Quote element that tried to include an empty Fenced Code Block element within the Block Quote, leaving the c character by itself on a line outside of that element, followed by another empty Fenced Code Block. From prior work on List Blocks, I knew that the Fenced Code Block should terminate the Block Quote element, as Block Quotes follow similar paths. It was just a matter of figuring out how to get there.
Knowing that I had already tried to solve this problem the day before, I decided to take the hard-line approach of debugging line by line through some of the code. I don’t usually do this, as it is very time consuming and requires meticulous notes. While I am comfortable with doing that work, there are just more efficient ways of getting to the same target. But with those ways not working for this problem, it was down to the nitty-gritty.
It was a good thing that I used this approach, because it told me something interesting. Usually, at the start of processing, the code separates the line into extracted whitespace and a parseable line. In this case, the debugger was showing me that the separation I expected was not being done. As such, when the parser looked at the raw data for the line, the spaces at the start of the line prevented it from being recognized as the start of a Fenced Code Block.
The good news here was that adding these four lines to the __process_lazy_lines function made everything work:
after_ws_index, ex_whitespace = ParserHelper.extract_whitespace(
    line_to_parse, 0
)
remaining_line = line_to_parse[after_ws_index:]
Basically, I just took the same code that split the original line_to_parse variable in the main-line processor and added it here. With that code in place, the is_fenced_code_block function did the rest, properly noticing the Fenced Code Block start sequence, and properly starting the code block within the Block Quote.
Dealing With 301C¶
It was just after 6pm PST when I committed the changes for test_list_items_301b, and I decided to start working on test_list_items_301c. In this case, it was not the tokens that were wrong, but the spacing included for one of the lines within a Block Quote element. Doing research into the night, I was still unable to figure out what the magic sequence was to get this working properly. Rather than press on, I decided to take a break and spend some time with my family, relaxing myself and my brain.
This worked well. Starting again in the morning, I was quickly able to see that there were two problems that I needed to address. The first was that the function __adjust_paragraph_for_block_quotes was always adding an empty string to the Block Quote token, which was the problem I was trying to solve. The second was that, while that behavior was a great solution most of the time, this specific test was one of the few times where it was not.
With that fresh information, I started experimenting, and I was able to isolate code in the parse_line_for_container_blocks function that affected the outcome of the test. The fun part was that if I took the indentation used for the list and added it to the token’s data, test function test_list_items_301c worked, but other tests failed. Doing some extra plumbing and experimentation, I narrowed the active scenario down to trigger only when the parse_line_for_container_blocks function was invoked within a paragraph that was within a Block Quote. [more]
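To show the shape of that narrowing, here is a rough sketch of the kind of guard I mean. Apart from parse_line_for_container_blocks, every name here is a placeholder of mine rather than the project’s actual code: the list indentation only gets added to the Block Quote token when the line being parsed belongs to a paragraph that is itself inside a Block Quote.

# Sketch only: the narrowed condition, not the project's actual code.
is_paragraph_in_block_quote = (
    parser_state.token_stack[-1].is_paragraph
    and any(entry.is_block_quote for entry in parser_state.token_stack)
)
if is_paragraph_in_block_quote:
    add_list_indent_to_block_quote_token(list_indent_text)  # hypothetical helper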
What Was My Experience So Far?¶
Having felt dread at the prospect of Block Quotes taking as long to complete as List Blocks did, I am happy to report that the feeling is quickly fading. The sense I am getting from the issues I am looking at is that there are issues to deal with, but nothing I have not dealt with already, just variations on it. On top of that, there is only one type of Block Quote to worry about, with a very simple start sequence. I was very solidly feeling that Block Quotes were going to be a lot easier than List Blocks.
That did not mean things were going to be easy, though! I was starting to come close to the end of the initial set of issues that I had added to the issues list, but I was adding more as I looked through the code. This week, I was able to get rid of a handful of those issues, and it felt good. But with approximately 70 lines of items to go before hitting those that deal with Rules, Tabs, and Correctness, it was a sobering reminder that I just needed to get stuff done and done cleanly.
And for me, that often poses a problem. There are things that I want to do and feel that I should be doing, and there are things that I need to do. Resolving any issue that deals with Tab characters? Now that is a want, and I can live with a “stupid” translation of Tabs until I can get the time to make them right. Resolving any issues that might uncover scenarios that I have not covered yet? To the best of my abilities, that is a must. But there is a grey area between those two.
Take the item:
- image token handling confusing and non-standard
I would like to get it resolved, but I am not sure that I need to. I must balance my desire to get things done right, even if that delays the release, against my knowledge that I can live with this issue for a couple of months while I get the project out. Those are the types of decisions that I am going to have to make more and more as this project gets to its closing stages. Do I really need it, or can it wait?
At least I know it is going to be fun!
What is Next?¶
I was just as happy to get some holiday time off in which I could spend more time with my family as I was to be able to make some solid holiday progress. This next week was going to be a good mixture of resolving solid issues, test cleanup, and code cleanup. Stay tuned!
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.