In my last article, I talked about my efforts to clean up the scenario tests and the existing parser issues. In this article, I continue to talk about cleaning up existing parser issues.
To an outside viewer, it may seem like I am doing boring stuff, week in and week out. And I will admit, a certain amount of the work that I do is boring. But knowing that the work is going towards increasing the quality of the project really motivates me when things are dragging on. Since I spent most of this past week finishing the big task from last week, I did find the need to remind myself of those observations during the week… multiple times. And remembering why I was doing this did help me push through.
And while I am used to showing as much code as I can for what I am working on, this week is going to be a bit different. To be honest, most of the issues that I fixed had a relatively minor change that was implemented, with the effects of that change rippling out into the rest of the code. In some cases, it was a one-line initial change, and thirty or more lines of ripple changes. I still will try and explain things as best I can, you have my word on that. I’ll see how that works out and see how I feel about it later!
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 12 Oct 2021 and 17 Oct 2021.
As I said during the introduction, it was an exceedingly long week. I had hoped that I would be able to rip through the comment strings for the scenario tests that related to rules, but that was not the case. Instead of having to read 7 strings and having to fix 1, I was lucky if I encountered any strings that I did not have to fix.
Part of that was my own making though. To make sure that the strings were all readable, I started establishing a common preface for each rule comment string:
""" Test to make sure this rule does not trigger with a document that contains ... """
While that brough consistency to every comment string for rules, it did
involve tons of editing. That and in the cases where the rule did trigger,
I had to ensure that I removed the
not on the first line. Just checking
that out for each file took a bit of time.
And that editing took me into Saturday afternoon. It was a slog. But I never had any doubt that it was worth it. Did I question if my sanity would remain intact? Yes, that was something I wondered many times during the task. But not the base worth of the task. I knew that it was something that I had put off, and I knew that this rewriting was the price that I needed to pay. And that was okay with me. It just needed to get done.
But I was incredibly happy when it was done. There were other things that I wanted to get cleaned up this week!
In terms of nagging tasks that I wanted to deal with, cleaning up the comment strings was the biggest one. Thought it was not as high on the list, as the comment string, it still bugged me that I had skipped tests. I know that at the time that I marked those tests as skipped it probably was the right thing to do. But now that I had some time, the right thing to do was to make sure that I resolved those skipped tests.
Having resolved some skipped tests dealing with language elements last week, I decided to try and deal with the skipped tests for rules this week. Luckily, as soon as I walked through the scenario tests for Rule Md005 and Rule Md027, I noticed that everything looked correct. All that I needed to do for those tests was to clean them up, reset the expected values, and verify that there were no other problems. Done.
Rule Md028 was another story. I worked through the debugger and log files for about an hour or two before I figured out the issue. In this scenario, the essence of the problem was with this Markdown document:
- > This is one section of a block quote
When I walked through the parsing of that Markdown, the tokens that I got back were mostly correct, just in a weird order. In particular, the Block Quote was started properly, but it was immediately followed by an end Block Quote token. And walking through the code did not at once reveal any interesting information because everything looked right. That is until I noticed that there was a single point where it no longer looked right.
That point was the on the return of the parser from the
__handle_nested_container_blocks function. Everything looked correct before
that function returned, especially the
this_bq_count variable that contains
what the perceived count of the Block Quote is. It was then that I noticed
that when the function returned, the
this_bq_count variable dropped from
0. As there was a Block Quote stack token on the
this_bq_count variable with a value of
0, that Block Quote was
being closed as no longer being correct. It was not the proper thing to
do in this case, but in general, it was doing the right thing by removing
the Block Quote from the stack.
To fix this, one change was needed, with a lot of ripples to accompany it.
The one change was to have the
this_bq_count variable back to its callers. A handful of
functions needed changes to accommodate this, including code in
__look_for_container_blocks function to use this information.
While these changes were not difficult, it took me a bit to get them
right and review them to make sure that I felt that they were right.
But after all that work, all three sets of rule-based scenario tests
skipped tag removed, and they were all working cleanly.
To make sure things stayed that way, I added test functions
test_extra_009a with the boiled down version of
the issue. One nagging item off the list. That felt good! Now for
Sometimes the problem just stares you in the face and announces itself
proudly. This was the case with test function
which was clearly throwing an assert. Looking at the assert itself,
it was obvious that it was failing, but not how. By adding a clear
assert message, I was able to find that out. It was asserting because
it assumed that the extracted whitespace would always be whitespace.
In this case, the variable
adj_line_to_parse contained the adjusted
line that needed to be parsed. However, because it occurs within a
Block Quote, the start of the line included the Block Quote characters
and whitespace, not just whitespace. That meant that when the parser hit
assert adj_line_to_parse.startswith( parser_state.nested_list_start.matching_markdown_token.extracted_whitespace )
the assert failed. The length of
extracted_whitespace field was as it
was expected to be, the string that it was being compared to just included
other characters at the start of it.
The first step in fixing this was to change the
assert into an
positive case of the previous
assert was working fine, there were just
some cases where it needed some refinement. To take care of that
“refinement”, this code was added in the
else clause of the
assert parser_state.original_line_to_parse.endswith( adj_line_to_parse) orig_parse = len(parser_state.original_line_to_parse) curr_parse = len(adj_line_to_parse) delta_parse = orig_parse - curr_parse whitespace_to_remove = parser_state.nested_list_start.matching_markdown_token. extracted_whitespace[delta_parse:] assert adj_line_to_parse.startswith(whitespace_to_remove) adj_line_to_parse = adj_line_to_parse[ len(whitespace_to_remove) : ]
This code was all to get around the possibility that the leading spaces for any line in a Block Quote can have Block Quote characters as part of that leading space. Once past those characters, everything else should be whitespace. So, this code works with that ending whitespace to adjust the line to what it needs to be.
It took me a while to come up with that approach, but it made sense. And another nagging issue down. Just one more left.
The last nagging issue deal with two almost identical scenario tests:
test_setext_headings_069a. With little information
to go on, I started to look at the output of the parser for this document:
- Foo ===
Quickly looking at those Markdown documents, I guessed wrong that the document should produce a SetExt Heading. Using the faithful Babelmark tool, I verified that the provided Markdown should produce this HTML:
<ul> <li>Foo ===</li> </ul>
It took me a couple of minutes to have an “aha” moment, but I eventually got there. The problem that I had is that I prejudged the text because it looked like a SetExt sequence. Working though it some more in my head, I could then see that the second line would be considered a paragraph continuation line unless there was something more to that second line. Indeed, when I added two extra spaces before that second line, it matched the indent of the List Item and became a SetExt Heading element. But anything short of that, and it rightfully remaining a simple paragraph continuation line.
With that information in hand, I debugged through the
function for a while before adding this code:
is_paragraph_continuation = False if ( len(parser_state.token_stack) > 1 and parser_state.token_stack[-2].is_list ): adj_text = position_marker.text_to_parse[position_marker.index_number :] assert parser_state.original_line_to_parse.endswith(adj_text) removed_text_length = len(parser_state.original_line_to_parse) - len( adj_text ) if ( adj_text and removed_text_length < parser_state.token_stack[-2].indent_level ): is_paragraph_continuation = True
Based on the
if statement just before this code, it would only be
activated if the Leaf Block Processor was inside a Paragraph element.
In the logic that I added, if that paragraph was inside of a List Item,
a check was added to see how much whitespace was removed from the line
to get to that point. If the amount removed was less than the indent
level and there was at least one non-whitespace character on that line,
it considered the current line a paragraph continuation.
With that logic in place, the ‘if’ statement in the code that followed:
( after_whitespace_index, extra_whitespace_after_setext, ) = ParserHelper.extract_whitespace( position_marker.text_to_parse, collected_to_index ) if after_whitespace_index == len(position_marker.text_to_parse):
was changed to:
if not is_paragraph_continuation and after_whitespace_index == len( position_marker.text_to_parse ):
Basically, the existing logic was almost fine, but needed to also check that it was not a paragraph continuation before proceeding. Verifying the new code against both original test functions, it also passed the two new function I added to help diagnose the problem. I did leave them in there as they cleanly verified the boundary cases that existed.
Encountered during a previous debugging session, I made a note to myself to check out the following Markdown:
* First Item * First-First * First-Second * First-Third * Second Item
While that Markdown made no difference to the rule that I was debugging at the time, I recognized that it was not properly creating a List Item out of the fourth line as it should have. Instead, the List Item started on line 3 contained both the text for line 3 and for line 4.
After creating a new scenario test, I quickly dropped into debug mode and found
the problem relatively easily. Near
that start of the
is_ulist_start functions, there is a
standard whitespace check to make sure that the whitespace does not exceed the
usual three characters. Simply speaking, prefaced with zero to three spaces,
- this is an item
is translated as a List Item. By adding one more space to that prefix:
- this is an indented code block
that List Item becomes content for an Indented Code Block. Within another List element, after the list indent is considered:
- this is an item - this is a nested list
is translated into a List and a Sublist, while:
- this is an item - this is a continuation line
is translated into one List with two-line content.
Verifying that everything still worked properly for the first two cases, I then moved my focus to the remaining two cases. While the last case worked properly, the “this is a nested list” case was being parsed as a continuation line.
With my eyes on those whitespace checks, this made sense. Those checks were evaluating whether the number of whitespace characters from the beginning of the line was three or less. In the “nested” case, it was failing that check from the beginning of the line, when it should have been passing that check from the start of the indent from the first List.
Already having calculated the amount of indent required by the parent list in
__adjust_whitespace_for_nested_lists function, it was a simple matter
to pass it back with the
adj_ws value. Once passed back, that value was
added into the whitespace length check to reduce the required whitespace
After the code was passing the first scenario test
added two extra scenario tests to make sure that this worked for other
cases. While I could have added more tests, I figured that these
three tests gave the changes a good amount of coverage for now. If necessary,
I can always go back and add more.
At an earlier point in the project’s history, I had tested this snippet of Markdown:
> simple text > dd > dd # a
and it had failed. I can remember adding it during the development of Rule Md027, as the repository history attests to, and I remember being surprised that something so simple could change the tokens.
However, in the time between 2021 Aug 15 and when I looked at on the morning
of 2021 Oct 17, it has ceased to be a problem. Just to be sure, I added three new
test_extra_011b to ensure that it would
remain a non-issue. Sometimes you get lucky.
Noticed during the “grand” commenting rewriting of Issue 44, I added a note to the Issues List to check on Rule Md030, as I had noticed something weird. It was a good thing that I noticed it, as there was something weird that was exposed by the scenario tests.
When I went to verify Rule Md030 and its scenario tests, I started by looking at the original rule that this rule was based off. While there is little clear documentation on the configuration values for the original rule, this specific phrase leapt out at me:
consists of a single paragraph or multiple paragraphs
The reason that this leapt out at me is that the work I had done on Rule Md030 was about single versus multiple levels of lists not paragraphs. Delving into the disconnect a bit more, the page sited by the original rule does show examples with multiple paragraphs, not multiple levels within a list.
I can honestly say that the thought that went through my head was something akin to “huh. wonder how that happened?” To be continually honest, if this is the only big mistake I made during my efforts to implement every rule, I will still consider it remarkably successful. For me, if I do not must throw away lots of work, my focus is on getting things done right, not trying to figure out where it messed up.
Looking through the code for module
rule_md030.py, most of the code was still
very usable. I removed the function
__is_current_list_multiline as it was no
longer needed. Instead, the
__current_list_parent field and the
__paragraph_count_map field were introduced to track the current list item and
the paragraph count for that list item. Other than those changes, plus some
extra code to properly populating those fields and their associated objects,
and it relatively remained the same.
However, the changes to the scenario tests were more substantial. Each of the
*_double.md tests were changed to have a list item with two paragraphs
instead of a list item that was part of a level 2 sublist. Basically, the
1. First 1. Second 1. Second A 1. Third
1. First 1. Second - 1 Second - 2 1. Third
To add coverage to properly test the new understanding of the rule, eight new scenario tests were added with nested sublists. These were evenly divided between the Ordered Lists and Unordered Lists, and again between single paragraph List Items and multiple paragraph List Items. Add to that the required cleanup of the documentation page for Rule Md030, and things were ready to go!
As far as authoring these articles go, I am getting into a good groove most weekends. Because I take the time during the week to curate a simple outline of what I want the article to be, I can start building a set of good paragraphs and examples from that outline and really bring it to life. But sometimes I can get ahead of where I feel I need to be, and tonight is one of those nights.
Having spent a solid amount of time writing, doing work around the house, and spending time with my family, I found I had some “extra” time left over. To be honest, the reason the word extra is in quotation marks is because there really is no concept of extra time. I believe it would be more correct to say that I found myself with about two hours of time that I had not planned for, and I wanted to fill that time with something useful. As such, I started to look at some more of the issues that I logged during my work on Issue 44.
The first issue that I grabbed was simple labelled as such:
### Verify test_md023_bad_improper_indent_atx_in_list_item in lists and block quotes
As such, I looked at the specified tests, and everything looked fine.
In fact, it was a fitting example of Rule Md023 operating cleanly inside of a
List Item. Then I took another look at the title of the function:
test_md023_bad_improper_indent_atx_in_list_item. Well, that was a bit off!
To fix this issue, and the same issue with the
of this scenario test, I changed the prefix of both files from
test_md023_good_proper. After changing the
file names, I changed the scenario text function names to match, and all
was good. However, to make sure the original intent of those scenario tests
was fulfilled, I created two new test functions and test data files with the
original names. Executing those two new scenario tests, I then verified
that they were failing properly, adjusted the expected output, and declared
the issue fixed.
With the minutes of my “extra” time winding down, I found this issue and
figured out that it was a simple one to fix. For the most part, the main
problem with this issue was that the names of the functions were not
indicative of what was being tested. In addition, I added a new
test_md036_bad_configuration_punctuation test function to supply a test
punctuation configuration value.
And then I put the digital equivalent of my pen down for the evening. I had an incredibly good start on the article for this week, and I also had managed to get a fair number of items removed from the Issues List.
What Was My Experience So Far?¶
To explain how I am feeling about the project is hard to explain without relating it to something relatable like drafting an article or an essay. There are those essays that we all wrote in high school where we just wanted to get it done. Those were more about ensuring that we met the minimum requirements for the essay to be counted by our teachers, and that was it.
This project is something different to me. To me, this feels like an extended essay on something that I care very deeply about. It is not enough that I want to meet the minimum requirements of the task, I want to make sure to convey a certain feeling with my essay. To that extent, I do not mind if I must go over certain parts of the essay repeatedly until I get the tone and theme exactly right.
From a given viewpoint, this project is already a success. I have a beta release that is out, and it addresses the linting requirements that I want it to address. But for me, that is not enough. I want to make sure the project is solid, with good development principles throughout. And for me, the first part of meeting that goal is to address those items that I put off for any reason.
And for me, striving to increase quality is the goal that keeps me going!
What is Next?¶
To be blunt, I am starting to get to the end of the easier to resolve issues. What are left are more general issues and issues that require me to double check things. Not sure where that is going yet, so stay tuned!
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.