Summary
In my last article, I talked about the work I put into getting that week's three rules completed. In this article, I talk about the next three rules I worked on.
Introduction
Wow! It has been an interesting two weeks since my last article. With a ton of things going on in both my professional life and my personal life, I really have not had any time to work on the PyMarkdown project. There was not even enough time left over for me to write words, let alone code. Both take brain cells and creativity, and after about a week and a half of long days, I did not feel that I possessed either of them.
With most of those efforts out of the way, I was able to start the task of getting back to work on the project. I knew that it was going to take some effort to get my “project-muscles” back, but it was something I was eager to do. And what better way to flex those muscles back into shape than to complete the work on Rule Md043.
What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 11 Sep 2021 and 12 Sep 2021.
Rule Md043 - Completing Glob Matches
Getting started again after a couple of long weeks was no easy task. Before I got busy with life, I had started the work to properly handle the recursive code for matching lines. As I mentioned in my last article, I just ran out of time to figure it out that week. Now, facing the daunting task of getting back to work, it sounded to me like a good task to start with.
To refresh my context on that rule, I took a quick inventory of where I left things. I soon figured out that before I stopped working on Rule Md043, I had taken care of all the simple scenarios in which wildcards can be used in globs. The simple cases of `a`, `a*`, `*a`, and `a*b` were all covered, with a gaping hole in the code where patterns with multiple wildcards should be. Repeated wildcard cases such as `a*b*c` were not being handled at all, and that needed to be addressed.
To be clear, I am pretty sure that most people will not create monster matches like `a*b*c*b*c*b*c*d` or any terribly recursive combination. However, I knew I still needed to be able to support them. And depending on what the needs are, I can actually see some simple patterns for document headers and document footers requiring at least two or three wildcards. It was time to deal with that!
Design
Breaking it down from a design viewpoint, when the algorithm arrived at the recursive section of the glob, the header and footer of the glob had already been stripped away. This means that if the glob was `a*b*c`, what needed to be addressed was the remaining pattern `*b*`. Scribbling out different scenarios on paper made it easy for me to see how other matches would work, as the difference between `*b*` and `*b*d*` appeared to be an extra level of matching after the first wildcard.
While I am no stranger to either recursive or iterative algorithms, I have enough experience to know that both can run away on the user if they are not coded properly. The difference is that recursive algorithms tend to fail with stack overflows or out-of-memory errors, which look worse to the user than a bad index error. At least, that is my opinion when I am using other software. So, knowing that, if I chose a recursive algorithm, I wanted to make doubly sure that the recursion was solid before shipping it.
Breaking It Down
In this case, I chose a recursive algorithm as the way to go. As the difference between `*b*` and `*b*d*` is just another level, the most efficient way to take care of that is with recursive functions. With that decision made, it was on to figuring out how to do the matching and when to do the recursion.
Taking each part in stride took a bit for me to accomplish, as I wanted to design everything at once. But once I calmed down a bit, I realized that the algorithm was simple. With the previous header and footer trimmed from the match, the first thing the algorithm had to do was find the next matches for the first block of non-wildcard matches. In the above examples, this meant looking for a block of length one that matched `b`. In the first match, once any one line was found that matched, everything was done, and it could return. But in the second match, the algorithm then had to trim the variables and recurse into the algorithm to find `*d*` within the remaining data. Once again, if there was any line that matched `d`, the match was complete.
To be honest, I went through a handful of sheets of paper going through weird combinations of data, making sure that my design worked on paper. In each case, I was careful to adjust the current parameters before mentally “recursing”, and that was something I was going to have to be careful with in the code as well. But after taking three hours over two days to make sure I had the design nailed down, it was time to see if I got it right.
Testing and Implementation
I already had a few scenario tests for multiple wildcards in the project, but I added a new one to make sure I had things covered. From using globs on file names for years, I had a good idea of the patterns to test, and I was confident that I had the right scenarios to test against.
Going to the code, the first thing I needed to do was hook the new function into the `completed_file` function, invoking it when there were multiple wildcard characters. This meant replacing this code:
```python
self.report_next_token_error(
    context,
    remaining_tokens[0][0],
    extra_error_information="More than one wildcard not supported.",
)
```
with this code:
```python
recurse_result = self.__do_recursive(
    remaining_headings,
    top_heading_index,
    remaining_tokens,
    top_token_index,
)
if not recurse_result:
    self.report_next_token_error(
        context,
        remaining_tokens[0][0],
        extra_error_information="Multiple wildcard matching failed.",
    )
```
That was where the fun began. The `__do_recursive` function itself does not have that many statements, but each statement has a lot of functionality built into it.
```python
def __do_recursive(
    self, remaining_headings, top_heading_index, remaining_tokens, top_token_index
):
    bottom_heading_index = top_heading_index + len(remaining_headings)
    assert remaining_headings[0] == "*" and remaining_headings[-1] == "*"
    start_index = 1
    end_index = 2
    while remaining_headings[end_index] != "*":
        end_index += 1
    delta = end_index - start_index
    search_index = 0
    heading_index = top_heading_index + 1
    scan_limit = top_heading_index + 1 + delta
    found_match = False
    while search_index < len(remaining_headings) and search_index < len(
        remaining_tokens
    ):
        (
            end_heading_index,
            end_token_index,
            failure_token,
            _,
        ) = self.__verify_group_heading_match(
            heading_index, top_token_index + search_index, scan_limit=scan_limit
        )
        if not failure_token and end_heading_index < len(self.__compiled_headings):
            if end_heading_index == bottom_heading_index - 1:
                found_match = True
            else:
                new_top_heading_index = end_heading_index
                new_remaining_headings = remaining_headings[
                    (end_heading_index - top_heading_index) :
                ]
                new_remaining_tokens = remaining_tokens[
                    (end_token_index - top_token_index) :
                ]
                new_top_token_index = end_token_index
                found_match = self.__do_recursive(
                    new_remaining_headings,
                    new_top_heading_index,
                    new_remaining_tokens,
                    new_top_token_index,
                )
        if found_match:
            break
        search_index += 1
    return found_match
```
As was mentioned in the design section, the first part of the algorithm is setting up for the search. Thankfully, the `__verify_group_heading_match` function already looks for a block of matching headings, so a call to that function was pivotal to figuring out whether a match was found. If a match is found but the end index is past the end of the `__compiled_headings` variable, it means that the algorithm ran out of headings to match, resulting in no match. If the end index is valid and equal to the current “bottom”, then the algorithm is done with a found match. However, if it is not at the bottom, that means there are more wildcards to process, so the algorithm recurses.
From my experience, most recursive algorithms are simple in their nature, but the fact that they refer to themselves causes me some unease. It just means that I have to be extra careful about everything in that function as I am using it multiple times with variations on the same data. This was no different. The algorithm at its core was simple. Find a block of matching lines. If there is more to look for, set things up and find a more condensed block of matching lines. If at any point it fails, return right away.
But I still tested it extra carefully. Let me just say that I have been hit often enough by recursive algorithms that were not tested thoroughly to give them some extra respect when testing. But after going through and testing a handful of additional scenarios, they all matched or did not match according to plan. After some simple cleanup, I checked in the code and moved on.
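As an aside, the kind of scenario I mean looks something like this parametrized `pytest` sketch, written against the hypothetical `lines_match_glob` helper from earlier rather than against the project's real test harness. The `glob_sketch` module name is equally hypothetical.

```python
import pytest

# Assumes the lines_match_glob sketch from the earlier example is importable.
from glob_sketch import lines_match_glob

@pytest.mark.parametrize(
    "pattern, lines, expected",
    [
        (["a", "*", "b", "*", "c"], ["a", "x", "b", "y", "c"], True),
        (["a", "*", "b", "*", "c"], ["a", "b", "c"], True),  # "*" may match zero lines
        (["a", "*", "b", "*", "c"], ["a", "x", "y", "c"], False),
        (["*", "b", "*"], ["b"], True),
        (["*", "b", "*", "d", "*"], ["x", "b", "y"], False),
    ],
)
def test_multiple_wildcards(pattern, lines, expected):
    assert lines_match_glob(pattern, lines) == expected
```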
Ordered List Element Style
Looking through the remaining rules, a pair of them stood out from the rest: Rule Md029 and Rule Md030. To be honest, I thought I had completed them during a previous pass at the rules. But there it was in black and white: they still needed to be completed.
For no other reason than that it was first, I decided to start with Rule Md029 regarding the style of Ordered List elements. This rule has a simple principle behind it: keep things consistent. Most Markdown parsers do not like it when an author starts an Ordered List Item with a seemingly random number. Those parsers prefer to have those numbers be all 1s or be a simple increasing sequence starting with 1. Anything outside of those two options means that the parser decides what to do with it. And that choice does not always make sense.
Design
Looking at the configuration settings, there were three classes of starting numbers that I needed to look for: all zeroes, all ones, or ordered and starting from one. Basically, `0/0/0`, `1/1/1`, and `1/2/3`. There was the `0/1/2` variant of the ordered class, starting at `0` instead of `1`, but that was just an alternate starting number. The interesting variant was the `one_or_ordered` variant. Allowing for either the `1/1/1` class or the `1/2/3` class, it would be interesting to work around that.
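To make those classes concrete, here is a small sample of each, using my own illustrative text:

```markdown
0. The all zeroes class
0. keeps every number at zero.

1. The all ones class
1. keeps every number at one.

1. The ordered class
2. counts upward
3. from its starting number.
```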
But after looking at all those classes and their variants, the design rapidly became clear to me. For every case other than the `one_or_ordered` variant, everything was clear and spelled out by the first number. Even differentiating between the 0-based ordering and the 1-based ordering was easily dealt with using just that first number. The rest of the cases were just following the lead of that first number.
For the `one_or_ordered` variant, I knew that most of the time the algorithm would need to wait until the second number to figure out which class to put the numbers in. If the first number was `1`, then the list was still eligible for both the all ones class and the ordered class. But by the time the second number came around, the algorithm would be locked in, or the algorithm would trigger due to a bad number.
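As a sketch of that waiting game, the decision can be boiled down to a single function. This is my own distillation, not the rule's actual code, which spreads the work across the two functions shown later.

```python
def classify_one_or_ordered(numbers):
    """Classify a list's item numbers as "one", "ordered", or "invalid"."""
    if numbers[0] not in (0, 1):
        return "invalid"
    style = "ordered" if numbers[0] == 0 else None  # None means still ambiguous
    expected = numbers[0]
    for number in numbers[1:]:
        if style is None:
            # The second number locks the class in place.
            style = "one" if number == 1 else "ordered"
        if style == "one":
            if number != 1:
                return "invalid"
        else:
            expected += 1
            if number != expected:
                return "invalid"
    return style or "one"
```

Here, `classify_one_or_ordered([1, 1, 1])` resolves to `"one"`, `[1, 2, 3]` resolves to `"ordered"`, and `[1, 3]` comes back `"invalid"` because the second number rules out both classes.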
Testing and Implementation
Nailing down the scenario tests for the rule was easy. There are four basic positive sequences that were easy to define. After doing some scribbling, I was able to determine that the negative cases boiled down to three scenario tests. I owe a lot of that small test count to the positive tests performing double duty: positive tests for their own setting and negative tests for the other settings.
The actual code was a variation on algorithms I have written for other rules dealing with container elements. Not needing to deal with Block Quote elements, I just wrote code to maintain a List element stack, with a second stack for the selected configuration.
```python
def next_token(self, context, token):
    if token.is_list_start:
        self.__list_stack.append(token)
        if token.is_ordered_list_start:
            list_style, last_known_number = self.__match_first_item(context, token)
            self.__ordered_list_stack.append((list_style, last_known_number))
    elif token.is_list_end:
        del self.__list_stack[-1]
        if token.is_ordered_list_end:
            del self.__ordered_list_stack[-1]
    elif token.is_new_list_item and self.__list_stack[-1].is_ordered_list_start:
        list_style, last_known_number = self.__ordered_list_stack[-1]
        list_style, last_known_number = self.__match_non_first_items(
            context, token, list_style, last_known_number
        )
        self.__ordered_list_stack[-1] = (list_style, last_known_number)
```
As the configuration may be in flux, it can also vary between a list and any sublists. Therefore, the `__ordered_list_stack` field contains a tuple holding the list's style and the last known number. With this information, any style can easily be handled by a list at any level of nesting.
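For example, with the `one_or_ordered` style, a sublist is free to resolve to a different class than its parent, which is exactly why each stack entry carries its own style and last known number. A sample of my own making:

```markdown
1. The parent list resolves to the ordered class,
   1. while this sublist resolves to the all ones class
   1. because its second number is a one.
2. The parent list then continues counting upward.
```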
With that information in place, matching the first item was added with the `__match_first_item` function:
```python
def __match_first_item(self, context, token):
    list_style = self.__style
    last_known_number = int(token.list_start_content)
    if list_style == RuleMd029.__one_or_ordered_style and last_known_number != 1:
        list_style = RuleMd029.__ordered_style
    is_valid = True
    if list_style == RuleMd029.__ordered_style:
        is_valid = last_known_number in {0, 1}
    elif list_style == RuleMd029.__one_style:
        is_valid = last_known_number == 1
    elif list_style == RuleMd029.__zero_style:
        is_valid = last_known_number == 0
    if not is_valid:
        style = ...
        expected_number = 0 if list_style == RuleMd029.__zero_style else 1
        extra_error_information = (
            f"Expected: {expected_number}; Actual: {last_known_number}; Style: {style}"
        )
        self.report_next_token_error(
            context, token, extra_error_information=extra_error_information
        )
        list_style, last_known_number = (None, None)
    return list_style, last_known_number
```
Following the design, the configured style is used, allowing for a change if the `__one_or_ordered_style` style is selected. In that case, if the list's number is not 1, then it must be an ordered style. However, if it is 1, then the class is still in flux, and the algorithm needs to wait. As to the validity of the list number at that point, it is an easy decision. For any case other than the `__one_or_ordered_style` style, there is a small set of possible starting values. If it is not one of those values, the rule triggers. Otherwise, the value must be `1`, which is valid no matter what.
```python
def __match_non_first_items(self, context, token, list_style, last_known_number):
    ...
    new_number = int(token.list_start_content)
    if list_style == RuleMd029.__one_or_ordered_style:
        list_style = (
            RuleMd029.__one_style
            if new_number == 1
            else RuleMd029.__ordered_style
        )
    ...
    return list_style, last_known_number
```
Except for the `one_or_ordered` logic, the `__match_non_first_items` function is the same as the `__match_first_item` function. It is only in the case where the first List Item started with a `1` that the style is left undecided. But when the algorithm gets to the second List Item, that ambiguity is either resolved or triggers the rule. As the above code indicates, if the `__match_non_first_items` function is called with that style still in place, it is set to either `__one_style` or `__ordered_style`, depending on whether the new number is `1` or not. The code after that change then decides whether the `new_number` content is valid or not.
I did start out with something a bit more complicated, but as usual, the simpler solution won out. While it was not that much more complicated, it just felt like I was stretching too much when a simpler approach was just as efficient at solving the problem.
List Element Spacing
While it may not seem like much, Rule Md030 is meant to protect against parsers that have a more rigid definition of the spacing that exists between the List Item indicators and the text within the List Item. And since it was one of the remaining rules to deal with, it felt like pairing it with Rule Md029 was just the right thing to do.
Design
I’ll admit it. I copied most of the concepts from the previous rule over to this one and adjusted from there. That was possible because they both deal solely with lists and have similar constraints. In this case, the configuration values were stable, even though I still had to design “wait in this case until…” code. The twist in this rule was in calculating whether the List Item was on a single line or on multiple lines.
Therefore, the design was mostly about adapting the previous design instead of thinking through a design from scratch. The List Item stack was required to make sure that the algorithm compared List Items within its bounds, and not deeper. As the composition of the list (single line elements throughout, or containing any multiple line elements) cannot be determined until the list is finished, all List Item tokens need to be collected until the end List element is encountered. Given the single versus multiple classification at that point, the correct configuration value can then be chosen and applied to each token.
Which left one thing to design: the “single versus multiple” algorithm. Reading the definition of the original rule an extra time, I noticed that it was very clearly documented that:
> A document style may change the number of spaces after unordered list items and ordered list items independently, as well as based on whether the content of every item in the list consists of a single paragraph or multiple paragraphs (including sub-lists and code blocks).
This made most of the algorithm simple. Check to see if the difference in line numbers between one List Item and the previous List Item is greater than one. If so, it is a multiple line set of List Items. The very last List Item was a tricky one, but not difficult to design around. Every List Item must contain something, even if it is a Blank Line token. Therefore, the algorithm would need to keep track of the last non-end token encountered, to deal with that final case.
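To illustrate the two classifications with a sample of my own, the first list below is single-line throughout, while the second contains a List Item that spans multiple lines, pushing the whole list into the multiple line category:

```markdown
* a single line item
* another single line item

* a List Item whose text
  continues on a second line
* another item in the same list
```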
Looking through things, I felt that I had covered everything in the design, and it was time to proceed with the next phase.
Testing and Implementation
Once again taking out my pen and scrap paper, I whittled the relevant test scenarios down to eight: four for the Ordered List elements and four for the Unordered List elements. I came up with a handful of weird cases, but in the end, they all reduced to those eight scenario tests. While I am not 100% sure of that, I am confident enough to proceed with the coding. Basically, I have a feeling that I might have missed something, but after searching for it for an hour or so, I stopped looking.
The first part of the code has all the hallmarks of the previous rule. The `__list_stack` code is almost the same, with just a bit more code being executed in each `if` block.
```python
def next_token(self, context, token):
    if token.is_list_start:
        self.__list_stack.append(token)
        self.__list_tokens.append([])
        self.__list_tokens[-1].append(token)
        self.__last_non_end_token = None
    elif token.is_list_end:
        self.__handle_list_end(context)
        del self.__list_stack[-1]
        del self.__list_tokens[-1]
    elif token.is_new_list_item:
        self.__list_tokens[-1].append(token)
        self.__last_non_end_token = None
    elif not token.is_end_token:
        self.__last_non_end_token = token
```
The important part of the algorithm is here in the `__handle_list_end` function, which is invoked when the list ends.[^1] It is only at the end of the list that the multiline status can be computed. After it is computed, the proper setting can be assigned to the `required_spaces` variable, which is then compared to the actual number of spaces in each List Item token.
```python
def __handle_list_end(self, context):
    is_multiline = self.__is_current_list_multiline()
    if self.__list_tokens[-1][0].is_ordered_list_start:
        required_spaces = self.__ol_multi if is_multiline else self.__ol_single
    else:
        required_spaces = self.__ul_multi if is_multiline else self.__ul_single
    for list_token in self.__list_tokens[-1]:
        delta = list_token.indent_level - list_token.column_number
        if self.__list_stack[-1].is_ordered_list_start:
            delta -= len(list_token.list_start_content)
        if delta != required_spaces:
            extra_error_information = (
                f"Expected: {required_spaces}; Actual: {delta}"
            )
            self.report_next_token_error(
                context, list_token, extra_error_information=extra_error_information
            )
```
With that code in place, the only part remaining was the code to determine if the list contains only single line List Items or if any of the List Items span multiple lines. For every List Item except the last one, its vertical size can only be calculated against the next List Item. But it does make for a simple loop.
```python
def __is_current_list_multiline(self):
    is_multiline = False
    for token_index, list_token in enumerate(self.__list_tokens[-1]):
        if token_index:
            delta = (
                list_token.line_number
                - self.__list_tokens[-1][token_index - 1].line_number
            )
            if delta > 1:
                is_multiline = True
                break
    if not is_multiline:
        delta = (
            self.__last_non_end_token.line_number
            - self.__list_tokens[-1][-1].line_number
        )
        is_multiline = delta > 1
    return is_multiline
```
And because of the clear design work, I knew ahead of time that the last List Item was going to be a different scenario, and how to address it. As that last List Item does not have a following List Item to compare it to, the last non-end token serves as a good stand-in.
Once again, I ran through all the cases, looking for that something that I missed and not finding anything. Everything was working and passing, so after that extra checking, I finished up the documentation for the rule and committed it to the repository.
What Was My Experience So Far?
https://jackdewinter.github.io/2021/07/26/markdown-linter-getting-back-to-new-rules/
What is Next?
For getting restarted on the project after a couple of weeks' absence, it took me a bit to get going. But now that the end of the rules implementation phase is in sight, I am eager to get it completed. Stay tuned!
[^1]: That may sound like a weird statement to some. For example, I make it a practice to name functions after the action they take. Hence, the function `__handle_list_end` performs the handling for the end of the list. But I have also seen function names like `handle`, `done`, and other names that do not imply that context. Reading those functions is not fun.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.