## Summary
In my last article, I talked about the work I put into getting the final three rules completed. In this article, I talk about getting to work on reducing the number of bugs in the Issues List.
## Introduction
With all the work on creating the missing rules wrapped up, I was eager to double-check things for the beta release and get it packaged and uploaded to PyPI.org. But after that was done, I knew that I still had work to do, just different work. Instead of writing new rules, I needed to go back over previously found issues and start dealing with the “I’ll deal with it later” items.
Yes, it was now “later”.
## What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please consult the commits that occurred between 21 Sep 2021 and 25 Sep 2021.
## The Beta Release Happens
From a commit point of view, the beta release was not anything more than changing a couple of files and running a script that I normally do not run. But mentally, it was an important milestone for me. After working for two years to get something working on my terms, I was able to finish a beta release that I was confident about. Running that script, knowing that it would publish a solid, well-tested version of my project to the Python PyPI repository, just seemed right. It was not too soon, it was not with reservations, it was just the right time for the release to happen.
Do I expect issues to get logged in GitHub? Yes, I would be surprised if issues were not logged. First off, I am confident that I covered most of the more common scenarios, but I am also confident that I did not cover all the scenarios. Secondly, as weird as it may sound, if people are filing issues, it means they are using the project and putting it through their own scenarios. And hopefully, for every issue that gets logged with GitHub, there is a given multiplication factor of users that do not have any issues with the project, and are therefore happy with it.
What is that multiplication factor? I do not currently know. But even if that multiplication factor is in the single digits, it means that people are using the project. That is good for my confidence. That is all that I need, nothing more.
And as usual, before digging into fixing issues, there was a bit of cleaning up that I wanted to get done.
## Doing Some House Cleaning
Having a clean script and a Continuous Integration script has been a big asset for the PyMarkdown project. I am usually fussy about running the clean script before every commit, but the Continuous Integration script makes sure that I stick to that process. Normally, I develop on a Windows machine because I have more tools that work seamlessly with how I work. But that still leaves testing on other operating systems, which has previously been painful. The Continuous Integration script addresses that by running the full set of tests against a Windows operating system, a Mac operating system, and a Linux operating system. Having this process done automatically with each commit has helped me concentrate on the project itself, not the operating systems.
But one little thing had been bugging me during the rule development phase: changing command lines. There is a small handful of rules that I want to disable on my own Markdown scans. In most cases, the disabling is temporary while I get everything else worked out and think about possible rule changes. In other cases, it is just to give me time to fix the underlying issues. These disabled rules are no-space-in-code, no-space-in-emphasis, line-length, and no-duplicate-heading. But the rules themselves were not what concerned me. It was that, when developing the rules, I needed to change the list of disabled rules in both scripts. That just was not cool.
Before I started tackling reported issues as they trickled in, I decided that I wanted to fix that right away. It was not a tremendous change to make, but I believe it was a good choice to work on, as it would be one less thing to worry about. Additionally, it showed off a real-world use of the configuration file, which was a benefit.
To perform this switch to a configuration file, I created a new file, `clean.json`, with the following contents:
```json
{
    "plugins": {
        "no-space-in-code": {
            "enabled": false
        },
        "no-space-in-emphasis": {
            "enabled": false
        },
        "line-length": {
            "enabled": false
        },
        "no-duplicate-heading": {
            "enabled": false
        }
    }
}
```
Saving it in the root directory of the repository, I then went and replaced this part of the command line:
```text
--disable-rules no-space-in-code,no-space-in-emphasis,line-length,no-duplicate-heading
```
in both scripts to read:
```text
--config ./clean.json
```
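For context, the full invocation in the scripts then looks something like the line below. This is a sketch only: the paths being scanned are placeholders, not the actual contents of the project's scripts.

```text
pymarkdown --config ./clean.json scan ./docs ./*.md
```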
I executed the clean script locally and verified that everything looked correct. Hoping that I had not missed anything, I created a new commit for this change, pushed it to the main repository, and watched as the tests started to run. It took about four minutes, but when I saw the green circle beside the tasks, saying that everything executed correctly, I let out a little “Yes!” to myself.
Not a big issue, and one that probably no one except me would worry about. But it was an issue that I was happy to get taken care of.
## Issue 23 - Issue with Rule Md023
I know it might seem weird to some people, but I love debugging. It gives me a chance to really dig into something and either learn about it or increase my knowledge about how it works. Sometimes I learn that I did not think of a given use case, and sometimes I learn that I forgot something completely. That last one was the case with this issue: I forgot to implement pivotal code in Rule Md023.
### Token History
To understand the issue, I need to go into a bit of parser history. The `end_whitespace` field within a Text token that exists within a SetExt Heading element was a bit tricky to store. Because whitespace must be removed from both the start of each line and the end of each line in a SetExt Heading element, it left me with a fun decision to make: where to put the removed whitespace? One of the project’s cardinal parser rules is that any parsing must produce a set of tokens that can then reconstitute the Markdown document exactly as it was, so the whitespace had to go somewhere. If it were a normal paragraph, only the leading whitespace from each line would be removed, and it would be stored within the Paragraph element that owns that Text token.
After considering assorted options, I found none of them attractive. But the least unattractive was to overload the `end_whitespace` field to hold both the leading space characters and the trailing space characters that were removed. Normally, I would do this by having two different fields, one for the leading spaces and one for the trailing spaces. But not wanting to add an extra field to the Text token just for use within a SetExt Heading element, I improvised. The `end_whitespace` field would normally hold any trailing spaces, separated by the newline character. So, to “duplex” each line, if there were any leading spaces to store for the same line, they would come first, followed by the whitespace split character, `\x02`. In that way, I was able to store both leading spaces and trailing spaces within the same field.
### Back to the Issue
Going back to the issue at hand, that special character presented a problem that I had not thought about when I was writing this rule. With that history fresh in my mind, I looked at the existing code:
```python
if self.__setext_start_token and not self.__any_leading_whitespace_detected:
    if token.end_whitespace and " " in token.end_whitespace:
        self.__any_leading_whitespace_detected = True
```
Without support for the whitespace split character, that code was detecting any whitespace that occurred at the start or the end of the line. Since whitespace at the start of the line would be encoded as `{space}\x02` and whitespace at the end of the line as `{space}`, the above code would evaluate both as extra leading whitespace.
The only way to fix this was to scan through the `end_whitespace` field, one line of information at a time, looking for any cases where the sequence `{space}\x02` was found. To do that, I replaced the earlier code with the following code:
```python
if (
    self.__setext_start_token
    and not self.__any_leading_whitespace_detected
    and token.end_whitespace
):
    split_end_whitespace = token.end_whitespace.split("\n")
    for next_split in split_end_whitespace:
        if self.__seen_first_line_of_setext:
            split_next_split = next_split.split(
                ParserHelper.whitespace_split_character
            )
            if len(split_next_split) == 2 and split_next_split[0]:
                self.__any_leading_whitespace_detected = True
        else:
            self.__seen_first_line_of_setext = True
```
This code works by going through each line represented in the `end_whitespace` field, one line at a time. For each line, it splits the line’s information on the whitespace split character. If the whitespace split character is present, the split will result in the variable `split_next_split` holding an array of two strings, with the first element holding any characters in the line that occurred before the whitespace split character. Therefore, if at least one line has a `split_next_split` variable of length 2 with a non-empty first element, then that line has leading whitespace.
Thinking this was clever, I ran it through the first time and was disappointed that it failed. After a couple of minutes of debugging, I remembered that the leading spaces for the first line of a SetExt Heading element are stored with the start token, not within the Text token contained within it. Adding a bit of extra code to take care of that first case, I executed the code again and everything worked fine!
Just to be sure, I created five new scenario tests for Rule Md023, each with a different amount of trailing space at a different location. Maybe it is a bit paranoid, but after finding one example that failed, I wanted some extra insurance that I had fixed the problem and fixed it properly.
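As a rough illustration of the shape of those tests, and not the exact test data, a document like the following, with trailing spaces sprinkled at the ends of the heading’s text lines, should now scan cleanly:

```markdown
a setext heading with trailing spaces  
on more than one of its lines 
=============================
```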
It was an interesting feeling getting back to debugging non-rule code. It partly felt like it was something that I was used to, and partly that it was something new and different. It was an odd pair of feelings, but both were good feelings, so I was okay with both.
## Issue 22 - Issue with Rule Md033
This issue was not really a big one, but it was reported as not making sense to users, so I wanted to tackle it quickly. Rule Md033 deals with Raw HTML being exposed, either as an HTML Block element or a Raw HTML Inline element. I figured that I had missed something simple and wanted to make sure it was set right.
It turned out that I had missed two small issues, both nuisances. The first one was that I had included close HTML tags, any normal HTML tag starting with `</`, in the list of acceptable trigger items. While I had thought that the original rule did fire on this, extra checking revealed that it did not.
The other issue was a bit more nuanced, but just as small. In HTML, there are special sequences that allow for special handling of HTML documents. Notably, these sequences are called out in the GFM documentation of the HTML Block element, with each one having its own special HTML Block type. These sequences are comment blocks, processing instructions, declarations, and CDATA. I had originally taken one of them, the `![CDATA[` sequence, and provided special behavior for that sequence, but that was not good enough. Given a simple example from one of the project’s users:
```markdown
# title
<!--TOC-->
```
Apparently, my habit of following a start HTML comment tag with whitespace was just that… my habit.
So, to correct that issue, I changed the code to allow both CDATA sequences and comments to be detected on their own, without any whitespace needing to follow. Along with a change to the default value to suppress processing instructions and declarations by default, this issue was fixed.
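As a sketch of the shape of that check, here is my own reconstruction, not the rule’s actual code; the names are assumptions. The special sequences can simply be matched as prefixes of the tag text, with no whitespace required after them:

```python
# Hypothetical reconstruction of the trigger check after the fix.
ALLOWED_PREFIXES = ("!--", "![CDATA[")  # comment and CDATA sequences

def starts_with_allowed_sequence(tag_text: str) -> bool:
    # tag_text is the text after the opening "<", e.g. "!--TOC--" for "<!--TOC-->".
    # Note that no whitespace is required after the sequence any more.
    return any(tag_text.startswith(prefix) for prefix in ALLOWED_PREFIXES)

assert starts_with_allowed_sequence("!--TOC--")  # "<!--TOC-->" no longer triggers Md033
```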
It was not a big fix, but a good one to get out of the way.
## Issue 27 - Issue with Rule Md032
I must admit, when I first saw this issue pop up, in my head I went: “No, that looks right!” But after checking it more closely, the issue began to seep into my brain. Starting with something simple, the following Markdown:
```markdown
- 1
  - 1.1
    - 1.1.1
  - 1.2
- 2
```
parsed cleanly. Then the second example, the following Markdown:
```markdown
- 1
  - 1.1
    - 1.1.1
  - 2
```
also parsed cleanly. Fine. I agreed with both of those. Md032 deals with blank lines around List elements, and everything looked good so far. Then the third sample was presented, one which was causing the rule to trigger. A modification of the earlier example, this one was:
```markdown
- 1
  - 1.1
    - 1.1.1
- 2
```
and it triggered the rule each time. From looking at the nested List elements, it was obvious that it should not fire, so what was wrong?
Enabling debug output for the rule, I was able to quickly narrow in on the issue. Specifically, the third line sets up the third level of a List element, only to have the top two levels of those lists ended by the first-level List element on the fourth line. As my debugging showed, there were two end List element tokens in a row in the token stream. That was something I had missed.
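A rough sketch of the relevant slice of the token stream for that third example; the token names here are abbreviated for illustration and are not the parser’s exact serialization:

```text
[ulist]          <- "- 1"
  [ulist]        <- "- 1.1"
    [ulist]      <- "- 1.1.1"
    [end-ulist]  <- closes the "1.1.1" list
  [end-ulist]    <- closes the "1.1" list: two end List tokens in a row
[li]             <- "- 2"
```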
Looking at two lines from the rule, I was quickly able to nail down the source of this issue. In the following `if` statement:
```python
if not token.is_blank_line and not token.is_new_list_item:
    self.report_next_token_error(context, token, line_number_delta=-1)
```
only a Blank Line token or a New List Item token prevents the rule from firing if the previous token was an end List token. That meant that, in this case, after the first end List token was processed, the second end List token was run through that `if` statement, which evaluated to `True` and incorrectly fired the rule.
It did not take much code or time to fix this properly. Adding a check for the end List token to the `if` statement allowed the List elements to flow properly when dropping multiple levels:
```python
if (
    not token.is_blank_line
    and not token.is_new_list_item
    and not token.is_list_end
):
    self.report_next_token_error(context, token, line_number_delta=-1)
```
## Issue 32 - Issue with Rule Md037
Every so often, you look at something you did and hope no one notices that you did it. In this case, I had Issue 32 filed against it, so I had no choice. Someone noticed it.
The Markdown itself was simple, and the point of focus was a single paragraph:
```markdown
Read all readme*.md files.
And the changelog*.md files.
```
It was this sample that was triggering Rule Md037, which looks for spaces inside of asterisks that may indicate an Inline Emphasis element that was not completed properly. There was only one problem: there was half-finished code, commented out, that was supposed to deal with the actual search for whitespace.
Whoops!
Oh well, it happened. After I got over a momentary feeling of embarrassment, I started to look at the commented-out code and quickly decided that it would not work properly. When the triggering section of the `next_token` function is entered, it has a list of every token that occurred between the possible start of emphasis and the possible end of emphasis. The main thing I needed the code to do was to verify whether the specific pattern, whitespace directly inside of the possible start or end, was present.
Instead of doing something fancy, I opted for something simple. With both the possible start and possible end sequences, this rule should only trigger if they were Text tokens. If it is not a Text token, it cannot start or end with whitespace, so that was one aspect of the issue dealt with. The other was a good observation on what this rule was supposed to look for. Since I needed the function to look directly inside of the emphasis sequences, I needed it to check the first character of the first token and the last character of the last token. If either of those specific characters was a space character, then there was a space directly inside of the emphasis characters. And to be extremely specific, I needed it to detect either scenario, not both.[^1]
After going through that research, I came up with the following code, exactly what I just described above:
```python
assert self.__emphasis_token_list
first_capture_token = self.__emphasis_token_list[0]
did_first_start_with_space = (
    first_capture_token.is_text
    and first_capture_token.token_text[0] == " "
)
last_capture_token = self.__emphasis_token_list[-1]
did_last_end_with_space = (
    last_capture_token.is_text
    and last_capture_token.token_text[-1] == " "
)
if did_first_start_with_space or did_last_end_with_space:
    self.report_next_token_error(context, self.__start_emphasis_token)
```
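Walking the reported sample through that check by hand, as my own illustration rather than a test from the project: between the `*` in `readme*.md` and the `*` in `changelog*.md`, the captured text neither starts nor ends with a space, so the rule no longer triggers:

```python
# The text captured between the two "*" characters in the sample paragraph.
captured = ".md files.\nAnd the changelog"
assert captured[0] != " " and captured[-1] != " "  # no space directly inside: no trigger
```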
Once again, nothing stellar, but good simple corrections in the code.
## Issue 33 - Issue with Rule Md031
In my blitz to get all rules completed and a beta release out the door, I had encountered this Markdown when trying to get Rule Md031 to work:
````markdown
> + list
> ```block
> A code block
> ```
> 1. another list
````
With some more relaxed time on my hands, I decided to look closely at this rule and figure out what the problem with it was. What I did not notice at the time was that this was not a problem with the rule; it was a problem with the parser. When executed, the parser was stopping in this code with an exception. It was something that needed to be looked into.
It took me a bit to figure out what was going wrong. As I had been concentrating on getting things done for the beta release, I had not spent considerable time in the parser code. If I had to guess, it had been at least six months since I last looked at it. Luckily, with the comments and log messages that I had placed in the code, I was able to get up to speed quickly.
In the Leaf Block Processor module, there is code in the `correct_for_leaf_block_start_in_list` function to properly correct the indent of a line when coming out of a List element. Originally, the `token_stack` state variable was used to track only the List elements, with plans to implement an easier mechanism for dealing with Block Quote elements. However, when I started to implement support for Block Quote elements, I realized how similar Block Quote elements and List elements really were. In response to that, the `token_stack` state variable was repurposed for both elements, with extra work on my behalf to clean up the code.
```python
while repeat_loop:
    assert parser_state.token_stack[-1].is_list
    if removed_chars_at_start >= parser_state.token_stack[-1].indent_level:
        repeat_loop = False
    else:
        tokens_from_close, _ = parser_state.close_open_blocks_fn(
            parser_state,
            until_this_index=(len(parser_state.token_stack) - 1),
            include_lists=True,
        )
        html_tokens.extend(tokens_from_close)
assert parser_state.token_stack[-1].is_list
last_indent = parser_state.token_stack[-1].indent_level
delta_indent = removed_chars_at_start - last_indent
assert not delta_indent
```
Basically, it looks like I missed a boundary use case. In this case, the List element was getting closed properly within the loop, but the loop was then followed by an assert to verify that the remaining element on the stack was a list. If it was a List token, then there were extra verifications that I had in place to check the sanity of the parsing.
While slightly more verbose, once I understood the problem, the fix was simple: only act on List tokens:
```python
repeat_loop = True
is_remaining_list_token = True
while repeat_loop and is_remaining_list_token:
    assert parser_state.token_stack[-1].is_list
    if removed_chars_at_start >= parser_state.token_stack[-1].indent_level:
        repeat_loop = False
    else:
        tokens_from_close, _ = parser_state.close_open_blocks_fn(
            parser_state,
            until_this_index=(len(parser_state.token_stack) - 1),
            include_lists=True,
        )
        html_tokens.extend(tokens_from_close)
        is_remaining_list_token = parser_state.token_stack[-1].is_list
if is_remaining_list_token:
    assert parser_state.token_stack[-1].is_list
    last_indent = parser_state.token_stack[-1].indent_level
    delta_indent = removed_chars_at_start - last_indent
    assert not delta_indent
```
Because this change fixed the parsing of that Markdown document, I was able to take the `skip` tag off the `test_md031_bad_fenced_block_in_list_in_block_quote` function. In addition, to make sure that Rule Md032 did not suffer from the same issue with this Markdown, I added the `test_md032_bad_fenced_block_in_list_in_block_quote` function to perform the same test against that rule. This was a good find, and I was happy to have it taken care of.
## What Was My Experience So Far?
After working hard for two years, it was a relief to run the scripts to package the project and upload it to PyPI.org. As I mentioned earlier in this article, it just felt like the right time to do it.
But there are still things to do to make it better, and I want to keep marching forward. After measuring my progress over the last two months in terms of the number of rules completed per week, I am not sure how, or if, I am going to measure progress going forward. I mean, I am going to resolve issues, of that there is no doubt. However, I cannot divine whether the issues I pick are going to be easy ones or difficult ones.
I guess the only thing I can do is to keep on doing what I have always done with this project: keep on making it better!
## What is Next?
Well, I hate to say it (well, no I don’t), but I am going to be fixing issues for a while. While some portions of that work will be boring, I will try to make it more interesting by describing what I am doing and my reasons for doing it. Stay tuned!
[^1]: Before I read the rule again very closely, I did write a version of the code where it looked for both instead of either. It did not work very well.
## Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.