Summary

In my last article, I started adding extra scenario tests that exercised newline characters in various groups or themes. In this article, I continue working towards a good level of coverage for each of those groups.

Introduction

There are times during the development of a project when the tasks are all feature related, and there are times when the tasks are all issue related. Instead of waiting for that second era to arrive, I elected to proactively invest extra time in improving the scenario tests. While I know I will not find every issue, I want to do my best to throw as many combinations of elements against the wall as I can, hoping that none of those combinations will stick.

From my viewpoint, I am aware that this focus on quality is taking time and effort away from adding new rules to PyMarkdown. However, in my eyes, every minute spent on making sure the project is stable is worth it. Even with the knowledge that I will not catch every issue before it happens, I am confident that by being proactive and adding groups of tests, I will put the project in a great position to be a stable platform for linting Markdown documents. Even better, by focusing on groups of combinations instead of individual tests, I believe I can discover a large group of issues at the current stage, instead of in a bug report later. At the very least, that is my hope, and my confidence is increasing with each group I add.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits 02 Oct 2020 and 04 Oct 2020.

Cleaning Up After Last Week

As the work I documented last week ran into overtime territory, I decided at that time to disable those tests, promising to work on them the next week. As this is the next article, this is that next week. Time to get to work!

With very few exceptions, the results of each scenario test have been checked at least three times. As such, I was very confident that it was the consistency checks that needed updating to deal with these changes, and not the test results. As an ancillary task, since the code to calculate the line/column number was only recently introduced, I knew there was a decent chance that I might find an issue or two that would need to be fixed. As the consistency checks were not passing, I was hoping that it was the checks that needed updating, but I prepared for changes in the parser itself. Only research would tell, so it was on to more research!

Doing the Research

Digging into the code for links, it seemed obvious that the issue with the links and their column numbers was something simple. As I scanned more code, my confidence that it was something simple only increased. All the big pieces of code looked fine, but one thing was sticking out: replacements. Back when I was adding the Markdown transformer, I made sure that both the HTML-friendly and Markdown-friendly versions of the token’s text were included in the Link token. However, when I added the code to handle this into the consistency checks, I got confused.

Part of this is my own fault, but my confusion was due to the duality that exists in the project’s tokens. In inline text situations, replacement sequences are used to show both the before and after parts of the replacement. In link situations, a similar pattern is used, including both parts of the replacement in the token. However, in the link case, there are a couple of added wrinkles.

The first wrinkle is that there are usually two different fields of the Link token that contain this information. Using the link_title field as an example, the processed version is stored in that field, while the raw text used to generate that field is kept in the pre_link_title field. To avoid useless duplication, the pre_ version of a field is only populated if its value differs from that of its processed counterpart. That is the second wrinkle. In processing that Link token information, it is common to see code like:

    link_title = parent_cur_token.link_title
    if parent_cur_token.pre_link_title:
        link_title = parent_cur_token.pre_link_title

Basically, set the link_title variable to the token’s link_title field by default. But if the pre_link_title field is set, use that instead.

Given that there are two different access patterns, I sometimes get confused and use the process meant for the other access pattern. And that is where the third wrinkle shows up: dealing with link labels. In the case of link labels, the processed version of the text exists between the start Link token and the end Link token, with the raw version being kept in the start Link token’s text_from_blocks field.
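
To make that third pattern more concrete, here is a rough sketch of how the two versions of a link label might be retrieved. Only the text_from_blocks field comes from the description above; the attribute names on the inner tokens are hypothetical stand-ins, not the project’s actual code.

    # Hypothetical sketch only: the raw label lives on the start Link token,
    # while the processed label has to be rebuilt from the inline tokens
    # that sit between the start and end Link tokens.
    raw_label = start_link_token.text_from_blocks
    processed_label = "".join(
        inner_token.token_text
        for inner_token in tokens_between_start_and_end
        if inner_token.is_text
    )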

And to be honest, while each of the three use cases makes sense in practice, remembering which one is in play when I am writing code against a given field can be difficult.

Fixing the Issues

Hopefully the previous section contains enough information to make this revelation easy to see coming: I used the wrong pattern in one place and the wrong field in another. While the fix was just a couple of lines of code in the newly refactored __verify_next_inline_inline_image_inline function, it took some debugging to properly address the issue by using the correct patterns. With both patterns adjusted, most of the tests were passing.

It was in those remaining failures that I noticed another problem that had been hiding behind the first one: character references were not updating the column number. With the confusion of the other problem out of the way, there were a couple of outlier tests dealing with character entities that were not adding any space for the Code Span tokens. That was fixed by adding the following code to the end of the handle_character_reference function:

        inline_response.delta_line_number = 0
        inline_response.delta_column_number = (
            inline_response.new_index - inline_request.next_index
        )

Once again, once I started looking at the function itself, it was obvious that something needed to be done; but without something causing me to look there, it had been overlooked. I was just grateful that the extra scenario tests pointed this out!

Rounding Out Image Support

While the support for images and their line/column numbers was added a long time ago, the proper validation support in the consistency checks was only recently added. In addition, while previous work tested the inline image support, there was very little testing of the shortcut and compressed image types, and no testing of the full image type.

Adding The Tests

The testing part of this support was easy to address. As part of finishing the work from the previous week, I copied over tests test_paragraph_extra_78 to test_paragraph_extra_89, renaming them test_paragraph_extra_90 to test_paragraph_extra_a2. Once that was done, I changed the example text from the link start sequence [ to the image start sequence ![. Temporarily disabling the consistency checks, I then ran through each of the scenario tests, adjusting the token output and the HTML output for the change from a Link token to an Image token. From the token point of view, I was primarily looking for the Image token to take the place of the start Link token. In addition, I was checking to ensure that any tokens after the start Link token, up to and including the end Link token, were removed. Finally, I adjusted the HTML for the scenario from an HTML anchor tag to an HTML image tag, along with any tag attribute changes that were required.

At this point, it should go without saying that as I was making those changes, I ran through the token output and the HTML output multiple times. I did these checks both mentally and with my tried-and-true method of “counting out loud”, verifying that both values were the same. As I knew the next task was to properly implement the consistency checks for the other image types, I also knew that an extra count or two would take less time than trying to debug any off-by-one issues with the line/column numbers. It just made sense to take an extra minute per test and do that extra level of verification.

Reviewing Collapsed and Shortcut Images

As they contain only a link label, the shortcut and collapsed image types had already been coded into the consistency check. From a quick visual inspection, they both made sense and looked great:

    elif previous_inline_token.label_type == "collapsed":
        # Prefer the raw (pre-processed) version of the link label if the
        # token carries one.
        image_alt_text = previous_inline_token.image_alt_text
        if previous_inline_token.text_from_blocks:
            image_alt_text = previous_inline_token.text_from_blocks

        # Account for the opening [ of the link label.
        token_prefix = 1
        newline_count = ParserHelper.count_newlines_in_text(image_alt_text)
        if newline_count:
            # Newlines push the estimate onto a later line, and only the text
            # after the last newline still matters for the column number, so
            # the opening [ no longer contributes.
            estimated_line_number += newline_count
            para_owner.rehydrate_index += newline_count
            estimated_column_number = 0

            split_label_text = image_alt_text.split("\n")
            image_alt_text = split_label_text[-1]
            token_prefix = 0

        # The [] at the end of the collapsed label, then the index-to-position
        # adjustment, the closing ], the opening [ (if still relevant), and
        # the remaining label text.
        estimated_column_number += 2
        estimated_column_number += 2 + token_prefix + len(image_alt_text)

Using the collapsed link type as an example, the algorithm was really easy to code up. A Markdown collapsed link type looks like this: [reference][]. To start with, the code to determine the content of the link label is used, as was mentioned in the previous sections. With that information in hand, the token_prefix was set to 1 to account for the opening [, before proceeding to deal with any newlines. Dealing with the newlines used the same pattern that I used for each part of the inline image type. However, the big change here is that the token_prefix is set to 0 if a newline character is found. The rationale behind this is that if there are any newlines in the link label, the length of the opening [ is no longer useful in determining the column number, as it comes before the part that has a newline character in it.

With all that preparation in place, it is then time to compute the change in the column number. The initial estimated_column_number += 2 takes care of the [] characters at the end of the image text. The 2 from the first part of the following statement combines a +1 for the translation from an index into a position and a +1 for the ] character after the link label. The token_prefix variable is then added to account for the opening [ of the label, as described above, followed by the length of the raw version of the link label text.
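
As a quick sanity check of that arithmetic, here is a small, hypothetical example for a collapsed image whose label contains no newlines:

    # Hypothetical example: a collapsed image of ![reference][] whose label
    # contains no newline characters.
    image_alt_text = "reference"
    token_prefix = 1                 # the opening [ of the link label
    delta = 2 + 2 + token_prefix + len(image_alt_text)
    # 2 for the trailing [], 1 for the index-to-position translation,
    # 1 for the closing ] after the label, 1 for the opening [, and 9 for
    # the label text, giving a total column change of 14.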

Why Review This?

It may seem weird that I reviewed this code before going forward, so let me explain. I reviewed both types because I knew that I had plans to use either the shortcut code or the collapsed code as a template for the full image type. I knew that the only difference in the code between those two types and the full image type was the addition of a link reference. To me, it just made sense to start with the code for either of those image types and just add the extra code to handle the link reference.

Following that plan to reuse that code, the support for the full image types was added within a couple of hours. After making sure the consistency checks were enabled, I started running the newly added scenario tests against the newly added consistency check code. It was then that I started noticing some weird failures.

Debugging The Alt Attribute

Digging into those failures, it was initially hard for me to figure out what the issue was. Everything looked fine until I started doing a character-by-character mental reconstruction of the tokens from the Markdown. It was then that I saw it: the alt attributes on a handful of the image tags were wrong.

The scenario test function test_paragraph_extra_73 is a great example of that. With the Markdown text:

a![Fo&beta;o](/uri "testing")a

and a simple knowledge of Markdown, the text assigned to the alt attribute of the HTML image tag was obvious to me. It should be Foβo. But when I looked at the token itself, that token was:

[image(1,2):inline:/uri:testing:Foo::::Foo:False:":: :]

That was close, but I was expecting a token that was more like:

[image(1,2):inline:/uri:testing:Foβo::::Fo&beta;o:False:":: :]

which contained both the processed version of the text Foβo and the unprocessed version of the text Fo&beta;o. This was not like some previous issues that I had already resolved where one of the other inline tokens was not being represented properly. This was a case of some very simple replacements needing to take place but being missed.

In addition, after looking at some other cases, backslash escape sequences were also causing issues, though usually hidden. Function test_paragraph_extra_74 is a good example where the output HTML was correct, but the tokenization contained an o for the processed text instead of Fo]o.

Fixing The Issue

To get around this issue, I was faced with three different choices. The first choice was to code something up specific to this issue. I immediately rejected that approach, as I felt that one of the extensions that I will need to support in the future may introduce some base level of character manipulation, like the character entities do. As such, I wanted to leave my options open.

That left the other two options, or rather one option with two variations. In either case, I already had a mechanism for registering inline handlers and using those handlers. As such, my second choice was to use the existing inline processor code for text blocks. The issue that I had with that approach was that I would need to pass an additional flag into that function to limit its use to only the backslash escapes and the character entities. While that may have been possible with a smaller function, the size and complexity of the __process_inline_text_block function gave me concerns about possible side effects.

From three choices down to one, I went with my third choice: a simplified version of the __process_inline_text_block function. When the handlers register themselves, I added a new flag to denote whether the handler was for one of the two simple inline sequences. If it was, I added the text sequence to the __valid_inline_simple_text_block_sequence_starts variable. Then, in the imaginatively named process_simple_inline function, I added the code for a very pared down version of the __process_inline_text_block function. This function was purposefully crafted to only handle those simple inline sequences and the newline character, nothing else.
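
To illustrate the shape of that idea, here is a rough sketch with hypothetical names and handler signatures; it is not the project’s actual code:

    # Hypothetical sketch only, not the project's actual implementation.
    inline_character_handlers = {}
    valid_simple_sequence_starts = set()

    def register_handler(start_character, handler, is_simple=False):
        # Remember the handler and, if it deals with one of the two simple
        # inline sequences (backslash escapes and character references),
        # record its start character for the pared-down processor.
        inline_character_handlers[start_character] = handler
        if is_simple:
            valid_simple_sequence_starts.add(start_character)

    def process_simple_inline(source_text):
        # A very pared-down text-block processor: only the registered simple
        # sequences receive special handling, everything else is copied over.
        result, index = [], 0
        while index < len(source_text):
            character = source_text[index]
            if character in valid_simple_sequence_starts:
                handled_text, index = inline_character_handlers[character](
                    source_text, index
                )
                result.append(handled_text)
            else:
                result.append(character)
                index += 1
        return "".join(result)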

It was when I went to wire the function up that I found another interesting surprise waiting for me.

Circular References Suck!

With a decent first attempt at a simplified inline processor in place, I did as I normally do and went to the place where I needed the function, the __consume_text_for_image_alt_text function in the LinkHelper module, and added the reference and the import to go along with it. After starting the tests executing, there was a long pause, and I was then greeted with a simple error telling me:

E   ImportError: cannot import name 'LinkHelper' from 'pymarkdown.link_helper'
    (C:\old\enlistments\pymarkdown\pymarkdown\link_helper.py)

Having hit these a number of times, I was sure it was a circular import reference, but where was it? It was then that it hit me. The LinkHelper module was already imported by the InlineProcessor module, as I used it to handle all the heavy lifting with the links. As Python does not have any concept of forward references, all referenced classes or functions must be loaded before a reference is made to them. In this case, with the relationship already established in one direction, I could not add the reference in the other direction. I needed to find another solution.

I was sure that I would be able to come up with a more elegant solution at a later time, but I wanted to finish this task up, so I took the lazy approach of passing the function as an object. As it was passed as an argument down to each function, there was no import required, and I was able to use that function at the target __consume_text_for_image_alt_text function. I may want to look at a better way to do that in the future, but at the time, it worked.
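
For anyone who has not used this trick before, here is a small, self-contained sketch of the idea, with hypothetical function names standing in for the project’s own:

    # Hypothetical sketch: instead of the link helper importing the inline
    # processor (which would complete the circular import), the processing
    # function is handed in as an ordinary argument.

    def process_simple_inline(source_text):
        # Stand-in for the pared-down inline processor described above.
        return source_text.replace("\\]", "]")

    def consume_text_for_image_alt_text(label_text, simple_inline_processor):
        # No import of the inline processor module is needed here; the
        # function object arrives as a parameter and is simply called.
        return simple_inline_processor(label_text)

    print(consume_text_for_image_alt_text("Fo\\]o", process_simple_inline))
    # prints: Fo]o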

Rounding The Corner

At that point, all the scenario tests were passing. I was sure that there were other scenarios that I could find and test, but I was confident that I had a good collection of paragraph tests added to the project. I knew in future weeks I would need to expand that collection in size and cover the SetExt Headings, but that could come later.

Duplicating Tests Without Duplicating Effort

With those scenario tests completed for links and images contained within paragraphs, it was time to extend those tests to include inline processing within SetExt Heading tokens. To that end, I copied over test_paragraph_extra_43 to test_paragraph_extra_a2, creating tests test_setext_headings_extra_43 to test_setext_headings_extra_a2. In terms of the Markdown documents for each test, the only change I made was to add --- after the Markdown being tested, transforming it from a Paragraph token test into a SetExt Heading token test.
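
As a hypothetical illustration of that change, a Markdown document like a![foo](/uri)a from the Paragraph token tests becomes the following for the SetExt Heading token tests:

    a![foo](/uri)a
    ---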

As I expected, there were some issues that I needed to resolve to get those tests working. As the original versions of the tests used Paragraph tokens, and the handling of those Paragraph tokens relied on the rehydrate_index field of the Paragraph token, that was an obvious change that needed to be made. To be specific, I needed to change each reference to the rehydrate_index field so that it was only used if a Paragraph token was the container for the inline text. For the consistency check and the Inline Processor, this meant protecting any use of the para_owner variable, and for the Markdown transformer, the owning_paragraph_token variable.
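
In practice, those protections were small guards along these lines, shown here as a hedged sketch that mirrors the collapsed-image code from earlier rather than the exact change:

    if newline_count:
        estimated_line_number += newline_count
        # Only a Paragraph token owns a rehydrate_index; when the container
        # is a SetExt Heading token, para_owner is not set and the update
        # is skipped.
        if para_owner:
            para_owner.rehydrate_index += newline_count
        estimated_column_number = 0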

With that change made, the only other thing that needed to be dealt with was the change in how leading whitespace is handled within a Paragraph token versus within a SetExt Heading token. To allow better processing of paragraphs, any leading whitespace is collected in the Paragraph token itself, whereas for inline tokens within a SetExt Heading token, that information is stored with the inline tokens themselves. The handling was changed to ensure that the information was properly being mined from each inline token.

And while it probably took a couple of hours to resolve, it felt like these changes only took a couple of minutes. Compared to some of the other changes I have done in the last couple of weeks, these changes were completed quickly. I am sure that there were a couple of issues in the code that I needed to fix, but as I said, the time just flew by.

What Was My Experience So Far?

In the introduction, I mentioned my increase in confidence in the project, mostly due to my switch from individual scenario tests to scenario test groups. From a work point of view, instead of completing many small issues, I am lucky if I can get two or three medium or large-sized issues done in a week. But as I am covering many combinations in one task, I feel that the benefit easily outweighs the cost. And that benefit is what helps me drive forward.

And especially with today’s climate, I find that I sometimes need some help in maintaining focus on this project and driving it forward. At times like that, when I need some extra focus, I find it is good to think about the positive parts of the project. Instead of focusing on what is left to do, I focus on the progress I have made in getting to this point. Instead of focusing on the quantity of tasks I can get done in a week, I focus on the quality that those tasks bring to the project. And most importantly, instead of focusing on what I cannot change around me, I choose to focus on this PyMarkdown project, and the benefits I can provide to Markdown authors with it.

So, as I move forward with the project, those are the areas of quality that I choose to focus on.

What is Next?

Staying with the theme of testing blocks, I tackle another couple of testing blocks in the next article.



