Summary

In my last article, I continued my quest to reduce the size of the issues list. In this article, I make good progress in clearing more of those items.

Introduction

After having a week where I felt like I barely made any progress, it was nice to get a week where I was able to get some work done while keeping balance with my personal life. For some reason, the stop-and-go nature of last week was not repeated this week, and I was grateful for it. But this week was not really about any one big issue, just getting some good, solid work in to reduce the size of the issues list.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 14 Oct 2020 and 18 Oct 2020.

Finishing Up the Work

After a day’s rest from writing that week’s article, I looked at my notes and noticed there was one little issue left over from the previous week’s worth of work:

- test_paragraph_extra_b5 - adding extra newline somewhere, possibly not clearing after previous image?

Basically, I was not able to get the test_paragraph_extra_b5 function to pass before the end of the previous week, so I marked it with @pytest.mark.skip to disable the test until I could look at it. Not wanting to leave things undone, I made this the number one thing on my list for this week.

Looking at the Markdown document and the HTML output, everything looked fine. It was when I looked at the tokens and saw the issue that it came flooding back to me: there was a problem with the newlines. Specifically, there was a problem with the Text token following the test document’s Image token:

    "[text(3,19):\ndef::\n\n]",

If things are working properly, the number of newline characters in the text field is equal to the number of newline characters in the extracted whitespace field. In this case, the text field contains one newline character, but the extracted whitespace contains two newline characters. And once I saw that pattern, I was able to see it in other similar tests. In total, there were 16 Paragraph token tests and 12 SetExt Heading token tests that exhibited this behavior. The key was that in each case, the Text token followed an Image token and preceded the end of block token.
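
As a minimal sketch of that invariant (the variable names here are mine, not the parser’s), the two newline counts can be compared directly:

    # Minimal illustration of the invariant described above: the newline count in
    # the Text token's text should match the newline count in its extracted whitespace.
    token_text = "\ndef"            # text portion of the failing token
    extracted_whitespace = "\n\n"   # extracted whitespace portion of the failing token
    print(token_text.count("\n"), extracted_whitespace.count("\n"))  # 1 vs 2: the mismatch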

Digging In

Now I had a pattern, so it was time to fix it. I started by turning the debug log on and working my way through the data for the test. There was a substantial amount of data, but using my experience on the project, I was able to quickly narrow the search to something in the inline processing section. When I took a look at the tokens produced right before the inline processing started, everything looked right, so I decided to focus on the inline processing based on that information.

Digging into the Inline Processor, I followed the processing along token by token, right up to and including the Image token, and everything looked fine. Even the processing for the Text token itself would have looked fine if it were not for that extra newline character in the extracted whitespace field. To narrow down where that character was being added, I started by adding some debug statements to log the tokens at various stages of the processing, and the data confirmed my previous observation. From a token point of view, everything looked perfect.

To me, that just left the tidying up after the tokens. Because of the way the parser is designed, the inline text is accumulated until an inline token is required. Once that token is created, the parser then grabs all the characters up to the start of that token, creates a Text token with that information, and adds it before adding the new token. I stared at the code that created the token, and wondered what the problem could be.

    inline_blocks.append(
        TextMarkdownToken(
            current_string,
            starting_whitespace,
            end_whitespace=end_string,
            line_number=last_line_number,
            column_number=last_column_number,
        )
    )

It was on a hunch more than anything else that I decided to look at the end_string variable. This variable is where the extra whitespace stripped from the end of the line is stored. Looking at this variable’s impact on the logs, I noticed a couple of “holes” that I filled with extra debug log statements. It was with that information that I was able to pinpoint that along the way to the creation of that Text token, the extra newline character was appearing. Tracking backwards, it soon became clear that the issue was that the end_string variable was being assigned that newline character, but it was not being cleared properly.

Fixing the Issue

After about 45 minutes of additional debugging and fiddling, the answer finally came to me. When the inline_response.consume_rest_of_line field was set, four different fields and variables were being reset. The thinking here was that if the different handlers indicated that they had consumed the rest of the data on the line, there was not anything else to do. But one thing was forgotten: the resetting of the end_string variable.

Sometimes a fix is rewriting some logic, and sometimes a fix is adding some new logic to handle a missed case. This time, the fix was just about completing the reset action. Adding a single line in the if inline_response.consume_rest_of_line: block was all that was needed:

        end_string = None

It was anticlimactic that the fix was only resetting one variable on one line, but that is the way it sometimes is. Lots of looking only to find out something small was missed. Given the research I had completed, it did make sense though. Without that variable being reset to None, the newline character from a previous line was being carried through to the Text token after the Image token.

Once I verified that the fix worked with that one test, I ran the entire suite of tests with the new code in place. Comparing the test failures to the set of 16 Paragraph token tests and 12 SetExt Heading token tests, there was a 100% match! After approximately 3 hours of work, I had found that small but required change!

Being More Precise

The first part of this task was to perform some catch-up work on code coverage. Having forgotten to check the coverage with recent changes, I was happy that there were only two small areas that I needed to improve on. Even better, both were cases where I had added defensive code that was not being utilized. As such, removing that code from the __consume_text_for_image_alt_text function and the __complete_inline_block_processing function returned the code coverage numbers to their normal levels.

With that task under my belt, I decided to address an issue that had been bugging me for a couple of weeks: the rehydrate_index assert. Located in the __handle_last_token_text function, this assert is one that I have improved over the last few weeks. While it started off as a simple “less than” expression, it was now down to a comparison between the count of newlines in the tokens and the rehydrate_index field. My problem with it? As improved as it was, it was still checking whether the rehydrate_index field was equal to either num_newlines or num_newlines + 1.
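
As a hedged sketch (the exact expression in the __handle_last_token_text function may be arranged differently), that two-part check looked something like this:

    assert last_block_token.rehydrate_index in (num_newlines, num_newlines + 1)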

As documented in past articles, I have incrementally improved this check to the point it had reached when I started this work. But the real target was always to get that assert statement down to a one-part expression. Specifically, based on its usage, the goal was to get the statement down to:

    assert last_block_token.rehydrate_index == (num_newlines + 1)

Debugging

Starting the work on this task, the first thing I did was to figure out the amount of work I needed to do. In this case, the best way to do this was to make the change detailed at the end of the last section and look for failures. While it was not research in the typical way that I do it, it was useful. It pointed out that all but approximately 15 of the scenario tests were passing. This was useful both for marking a starting point and for my confidence.

As I started digging into the failures, those tests started to separate themselves into two distinct groups: those with code spans containing newline characters, and those with Text tokens occurring at the end of Paragraph blocks. While the work took a long time, in both cases the process was the same. I started by isolating one test in the group I was looking at, then looked at the consistency check output to find a pattern that made sense. Then I just kept on drilling and adding debug where needed until the pattern became visible to me. Now, as I have been working on this project for a while, that “until the pattern became visible” part is guided by experience and enhanced by luck, and both just take time.

After a while, the pattern that leapt out at me for the code span tests was simple: the line number was being updated properly, but the rehydrate_index field was not being updated. Adding some extra debug statements, I was able to validate that quickly, and then I added the following code to remedy the problem:

        if last_token.token_name == MarkdownToken.token_paragraph and not link_stack:
            last_token.rehydrate_index += num_columns

Having found and fixed that issue, the second group was then easier to see. Seeing as an un-updated rehydrate_index field was the cause of the first problem, I gave the same fix a shot with this group, and it worked here as well! Once again, a quick fix solved the issue. Executing the tests with the new changes in place, the consistency checks were now testing against my intended target of:

    assert last_block_token.rehydrate_index == (num_newlines + 1)

Paragraphs Ending with Multiline Non-text Inlines

That title is a mouthful, but accurate. The next issue that I tackled was:

- repeat __handle_last_token_text with paragraphs ending with other inlines

I was not sure if this was going to be a problem, but I could see why I had added this to the issues list. Most of the existing tests were focused on a specific Markdown element and either ended with that element or with text following that element. Even harder to find were tests where the element at the end of the block contained a newline character in the middle of it. So, while I found an isolated scenario test here and there that tackled some of these combinations, there was not a good solid “here it is” block of tests.

Time to change that! To address this, I added test functions test_paragraph_extra_c7 to test_paragraph_extra_d4: 8 tests in total, 4 for links and 4 for images. Within each group of 4 tests, I added a split Text element, a split Code Span element, a split Raw Html element, and a split Emphasis element. It was not rocket science, but it was good to have all those tests in one place, ensuring that I was addressing that concern. Even better? They all worked on the first try.
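
To give a sense of the shapes involved, here are hypothetical examples of those documents; the exact Markdown in test_paragraph_extra_c7 through test_paragraph_extra_d4 may differ:

    # Hypothetical Markdown documents for paragraphs ending with a multiline,
    # non-text inline; the actual test documents may differ.
    split_text_in_link = "a[li\nnk](/uri)"             # link label text split across lines
    split_code_span_in_link = "a[`co\nde`](/uri)"      # code span split inside a link label
    split_raw_html_in_image = "a![<x\ny='1'>](/uri)"   # raw HTML split inside an image label
    split_emphasis_in_image = "a![*em\nph*](/uri)"     # emphasis split inside an image label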

Expanding Paragraph Tests to SetExt Heading Tests

With a metaphorical bounce in my step after addressing the last issue, I decided to tackle the copying and transformation of Paragraph tests into SetExt Heading tests. While not officially on the issues list, my goal was to keep the extra Paragraph tests synced up with their cousins, the extra SetExt Heading tests. I am not sure if the right term is cousins, but that is how I thought of them, as the only difference between the two groups of tests was their parent blocks. The contents of the tests remained the same, except for the addition of a SetExt Heading block terminator.

As I knew that these tests were based off existing Paragraph tests, it was pretty simple to go through the new tests and adjust them to use a SetExt Heading token instead of a Paragraph token. While simple, even the creation of the copied test functions test_setext_headings_extra_a3 to test_setext_headings_extra_d4 took some time to get right. But once again, I was rewarded for my hard work with tests that all passed, with no failures.
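
As a rough illustration of that transformation (the actual documents in the copied tests may differ), the same document body simply gains the SetExt Heading underline:

    # The same content, first as a Paragraph and then as a SetExt Heading.
    paragraph_document = "a[li\nnk](/uri)"
    setext_document = "a[li\nnk](/uri)\n---"  # identical body plus the heading terminator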

Researching Link Verification

This task was not to solve the following issue, but to research it:

- are links getting verified properly in checks? images are, cannot find link code for same

In previous tasks, I had done extensive work on making sure that any image links were verified properly. However, I had previously noticed that the same rigor had not been applied to the Link token. While I did not want to solve the issue right away, I wanted to make sure I had the right information to resolve it later.

And it did not take me long to finally figure out the cause of this issue that had been bothering me for a while. The big issue here was that I was missing validation in cases where an end token was present. Specifically, while the end Emphasis token has a line/column number attached to it, the end Link token does not. That meant that when the previous token validation was occurring, the existing code was doing the equivalent of a digital shrug and letting it go. That was the real crux of the problem.

Additionally, the Image token was of interest. When I double-checked my findings, it became immediately obvious why it had escaped the same fate: it is a whole token. Whereas the Link token has a start token, the inner text, and then an end token, the Image token is entirely self-contained. As such, everything is processed at one time, and no end token handling is required, thus the processing for an empty line/column number was avoided.
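
To make that difference concrete, here is an illustrative view of the two token streams; this is not the parser’s exact serialization:

    # Illustrative token shapes for "[link](/uri)" versus "![image](/uri)".
    link_token_stream = ["link-start(1,1)", "text(1,2)", "link-end"]  # end token has no position
    image_token_stream = ["image(1,1)"]                               # one self-contained token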

It was clear to me that solving this would require some serious planning and thinking.

Verifying Links Properly

Given the research from the last section, I decided to take the time needed to tackle that issue properly. Dedicating a minimum of 4 hours to this task, I sat down and started to work the problem.

To keep things simple, I am going to relay the pseudo-code for this solution instead of the actual code. While it is true that the code needs some major refactoring, the more important reason is that it is just very verbose. The big reason for relaying this information at all is that I found the pseudo-code useful during my development of the solution. While the algorithm is simple in its base components, I just found that keeping it in my head in a clear and concise form was too much.

Step 1

The first issue was that I needed to have an anchor to base the consistency check’s line/column numbers from. As the end Link token does not have a line/column number associated with it, I first needed to add some code to look backwards in the token stream for the first valid line/column number. While I was initially worried about possible interactions between the end Emphasis token and the end Link token, I eventually determined that exploring that path was a dead end. Because the end Emphasis token has a line/column number, it can serve as an anchor token just as well as any other token.

- search backwards from the current token for the last token to have a valid line/column number
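
A minimal sketch of that first step, assuming each token exposes line_number and column_number attributes that are falsy when no position is attached:

    def find_anchor_token(token_stream, current_index):
        """Walk backwards to the closest token with a valid line/column number."""
        for index in range(current_index - 1, -1, -1):
            token = token_stream[index]
            if token.line_number and token.column_number:
                return token
        return None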

Step 2

With that anchor in place, it was then possible to apply the translations to the line/column number. These translations follow from how the link is constructed. In the case of a simple inline link such as [link](/url), the [ is processed, then the link label text link, and then the rest of the link text. When the end Link token is encountered, the remaining text ](/url) needs to be accounted for.

Starting with the code for the handling of the Image token, I was able to quickly build up a similar block of code that provided the translations for the line/column number. As I had a pattern to work off of, this work went by fairly quickly.

- search backwards from the current token for the last token to have a valid line/column number
- apply the translations for the line/column number due to the rest of the Link token
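
A hedged sketch of that second step, assuming the remaining link text is available as a plain string such as "](/url)":

    def advance_past(line_number, column_number, remaining_text):
        """Advance a line/column number past a piece of Markdown text."""
        newline_count = remaining_text.count("\n")
        if newline_count:
            # the text spans lines, so the column restarts on its last line
            return line_number + newline_count, len(remaining_text.split("\n")[-1]) + 1
        return line_number, column_number + len(remaining_text)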

Step 3

With that work in place, I then went to handle the previous token in the normal case and hit an issue. The anchor line/column number did not seem to be correctly calculated. I double checked my code, and everything looked fine. Doing some deep digging, I started to understand why. When I figured out which token to use as the anchor, I did not take into account any changes introduced to the line/column number by the anchor token itself. Basically, if I was using a Text token containing text at (2,7) as an anchor, I was using that (2,7) as the starting point, ignoring any changes introduced by its contents. Instead, I needed to use (2,11) as the starting point, taking the length of the token’s original Markdown into account.

The saving grace here was that the immediate calculations were okay, so I just added the new code after the previously added code. While the order does not affect the results, in hindsight I might want to move the translation code for the anchor token up to make sure that the code flows more logically.

- search backwards from the current token for the last token to have a valid line/column number (anchor token)
- apply the translations for the line/column number due to the rest of the Link token
- apply the translations for the line/column number due to the anchor token
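
Using the worked example from above, a quick check of that third step, assuming the anchor Text token’s original Markdown is four characters long:

    # A Text token whose original Markdown is four characters long and that
    # starts at (2,7) should contribute (2,11) as the real starting point.
    anchor_line, anchor_column = 2, 7
    anchor_markdown = "text"   # hypothetical four-character original Markdown
    assert (anchor_line, anchor_column + len(anchor_markdown)) == (2, 11)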

Testing the Changes and Further Adjustments

Having executed an isolated scenario test or two during these changes, I then went to execute those tests again, and the results were hopeful. From what I could see, the line/column number calculations were working properly, but another assert was firing. After a quick check, I came to an interesting observation: while looking at the line/column numbers and making sure they were being adjusted properly, I had totally forgotten about the rehydrate_index field.

Most of the adjusting went off without a hitch, but a couple of outlying tests were still failing with the assert for the rehydrate_index field. It took me a good couple of hours of adding debug statements and scanning the consistency check output before I found the pattern. While I had just added code to adjust the rehydrate_index field to account for the Link token, I had failed to adjust the field to account for the anchor token. As such, if the token before the end Link token (the anchor token) contained one or more newline characters, those newline characters were missed.

Fixing that problem took a bit of creativity. I already had the code to calculate the proper values, but it was in the __process_previous_token function. The problem with reusing that function was that it modifies the rehydrate_index field as a side effect, something I was not sure that I wanted to happen. In the end, I worked around this problem by doing a sample call to the __process_previous_token function to get the change in the line number, then invoking that function again on the anchor token, applying the proper change and restoring the rehydrate_index field. It took a while to figure that out and get it working, but it was worth it.
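
A self-contained sketch of that “sample call, then restore the side effect” idea; process_previous_token here stands in for the project’s __process_previous_token, whose real signature and call sequence differ:

    def sample_line_delta(process_previous_token, block_token, anchor_token):
        """Measure the line-number change for a token without keeping the side effect."""
        saved_rehydrate_index = block_token.rehydrate_index   # field the helper mutates
        line_delta = process_previous_token(anchor_token)     # sample call to measure the change
        block_token.rehydrate_index = saved_rehydrate_index   # undo the unwanted side effect
        return line_delta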

With that code in place, I ran those tests again, and the consistency checks for the small sample of tests I was using passed. Crossing my fingers, I ran the entire set of scenario tests and was happy to see that they all passed. While it took quite a while to accomplish, it was good to finally put that issue to rest.

Needing Some Extra Emphasis

While the description for this issue was:

- add verification for links with the endlink as the last element
- add verification for links+emphasis as the last 2 inline elements

I felt that it really did not describe the problem properly. I added this note when addressing the previous issue, and what I believe I was trying to capture is that I wanted more testing of the interaction between the end Link token and the end Emphasis token. At the time when I added that note, I remember thinking that there might be some weird interaction between the two end tokens. It was only later that I remembered that the end Emphasis token was the rare end token that has a line/column number attached to it.

Still, it made sense to add some extra tests for emphasis and its interactions with the link tokens, as I did not believe I had good coverage of that so far. To remedy that, I added test functions test_paragraph_extra_d7 to test_paragraph_extra_e0, with emphasis around the link and image elements, followed by emphasis within the link label of the link and image elements. These tests were not at all complicated to add, and I am also not sure they are actually needed. However, as I perceived testing in that area to be a bit thin, adding those tests just seemed to me to be a good idea. And luckily, after running those new tests, everything passed without incident.
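
As hypothetical examples of those combinations (the documents in test_paragraph_extra_d7 through test_paragraph_extra_e0 may differ):

    emphasis_around_link = "a*[link](/uri)*a"       # emphasis wrapping a link
    emphasis_in_link_label = "a[*link*](/uri)a"     # emphasis inside a link label
    emphasis_around_image = "a*![image](/uri)*a"    # emphasis wrapping an image
    emphasis_in_image_label = "a![*image*](/uri)a"  # emphasis inside an image label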

Just Before the End

I remember looking at the clock and noticing that I had around 20 minutes left before I started writing that week’s article at noon. While I used to start writing first thing Sunday morning, recently I have tried to keep my notes in a way that lets me more easily mine them for the articles. While that approach is still in the proving stage, it has helped me out with the writing process, allowing me to find my flow more easily on Sunday at noon when I now start writing.

Anyhow, with somewhere near 20 minutes left before noon on a Sunday and me wanting to milk the clock for every minute that I could, I quickly looked at the issues list. I was hopeful that I could find some low-hanging fruit to work on in the time remaining, and quickly came across this little gem:

- code span
  - multiple lengths of ticks, whitespace

I felt that this was a good candidate, but I still needed to investigate it. At the very least, I could get the research done and have it ready for next week.

The Research

After some quick research, I found that while the scenario tests between (and including) the test_code_spans_338 function and the test_code_spans_359 function have some interesting variations, they were missing some basic coverage. Looking at those tests, it was pretty evident that tests with variations on the number of backticks and the amount of whitespace were in the minority. Doing some quick scans for the number of backticks in the tests, my suspicions were confirmed: only 4 tests dealt with double backticks and no tests dealt with more than two backticks. As for whitespace, the tally was similar: 5 tests with one whitespace character and 1 test with 2 whitespace characters.

Fixing the Issue

With that research in hand, I could either document the issue more thoroughly or fix the issue. Looking back at the description of the issue, I inferred that what I was really looking for was a centralized place where those tests were kept, rather than having to look for them all over the place. If that were the case, I figured I could solve the issue by adding 3 new tests, so I decided to fix the issue.

The 3 new tests? I started with the text aa`aa`aa, adding one backtick and one whitespace in critical areas to get aa `` aa `` aa and then finally aa ``` aa ``` aa. I just kept it simple and created 3 new tests, each one clearly handling an increasing number of backticks and whitespace. A quick execution of the new scenario tests, and… no issues. Assuming my inference was correct, problem solved!
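
For clarity around the backticks and the whitespace, here are the three new documents verbatim:

    code_span_documents = [
        "aa`aa`aa",          # single backticks, no extra whitespace
        "aa `` aa `` aa",    # double backticks with surrounding whitespace
        "aa ``` aa ``` aa",  # triple backticks with surrounding whitespace
    ]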

What Was My Experience So Far?

While there were only a couple of issues that made me really think about how to solve them, this was a good week for just getting some issues taken care of. At this point in the project, I am confident that I have 95 percent of the code working properly. My hope is that by adding new tests and making sure that existing tests are grouped properly, I can find those outlier scenarios that I have not thought of. Whether those scenarios end up working or failing is something I cannot control. But I can strive to find as many of those scenarios as possible.

And I have noticed that this hope is being largely reflected in the work I am doing. Less and less work is in the project’s parser itself, with more changes being attributed to improving the verification of the parser. From where I sit, that is a good place to be in. The other observation? In times when I do need to change the parser, those changes are usually small changes of five lines or less. That statistic alone fills me with confidence.

Now, while I do want to find as many scenarios as possible, I need to balance that with trying to wrap up the initial phase of this project so I can use it and get the project out there. For me, that means I need to start buckling down and shifting my focus away from issues in the “Nice to Have” category and work on the other issues. Here’s hoping I can act on that!

What is Next?

From my viewpoint, next week’s article will be interesting as I talk about my efforts to group the tests together more concisely, with documentation to back up the reasons for those tests.


Comments

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.

