Summary¶
In my last article, I started to add extra scenario tests covering newline characters in various groups or themes. In this article, I continue working towards a good level of test coverage for each of those groups.
Introduction¶
To be totally honest with any readers, the work that I did for this article was just more of the same. Basically: look at the issues list, find something to work on, and work on it. However, as of late, I have been trying to find groups of issues to work on, instead of one-off items. As such, I was looking for issues that could help me resolve questions about a group of behaviors in the project, not just one very specific behavior.
And to continue to be honest, this work was done during a difficult week for me. With things to do around the house, I did not seem to get any really good stretches of time to work on the project until later at night when I was tired. It was just one of those weeks.
But even though I was tired, I wanted to continue. While I would have been ecstatic to keep working at great velocity, what I needed to do was maintain forward momentum. That meant picking items from the issues list that I could work on in phases, keeping that momentum up from phase to phase. That is why I wanted to focus on items I could deal with as a group. It was a good way to keep working without feeling that I was forever stopping. For me, it was not about doing the work as much as maintaining the feeling that the work was progressing. That is what was important to me.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits of 07 Oct 2020 and 11 Oct 2020.
Starting With Some Cleanup¶
As usual, there were some issues left over from the previous week that I needed to
take care of. While not pivotal to my efforts going forward, I spent a brief amount of
time going through the Bugs - General - Uncategorized
section of the issues list,
making sure to remove anything that had already been done. This effort was mostly
done for reasons of cleanliness and accuracy, though I will admit that getting rid of
a handful of items that had been completed was good for my confidence.
In addition, during the prior week’s testing, I had added the functions `test_fenced_code_blocks_099k` and `test_fenced_code_blocks_099l` to test out different counts of multiple blank lines within a fenced code block. As those tests passed and seemed useful, I decided to keep them in the test suite, but did not have an issue to tag them with. It just made sense to add them at this point, before taking them any further.
Ensuring Consistent Paragraphs¶
The work in this section all came from one line:
- make sure line/column is tracking text indenting on each line
Not a fantastically descriptive line, but it was there. And it was enough for me to understand what it was that I wanted to look at.
In previous articles, I have talked about how paragraphs are the base building block of any Markdown document. Being the default block container for the default inline element, I would guess that, on average, 75% of the blocks in existing Markdown documents are Paragraph elements. I have no hard facts to back that up, but as I look at a representative sample of the articles I have worked on, a SWAG leads me to think that 75% is a fairly good estimate.
This perceived predominance of Paragraph elements in Markdown documents influenced my perception when I was designing the token system used by the PyMarkdown project. To accommodate this perception, I decided to place any newline handling in the Paragraph token instead of the encapsulated inline tokens. At the time, my thought was that the Paragraph element had rules that were different enough from the 2 header elements, the 2 container elements, and the 2 code block elements that I needed to capture the element’s information differently. As Paragraph elements largely ignore leading space on a given line, the Paragraph token seemed to be the right place to store that information. While it has caused a couple of issues in the past, I believe that it is still the right decision.
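To make that design concrete, here is a minimal sketch, assuming a simple two-line paragraph: the leading whitespace of each line is stripped from the inline text and stored with the Paragraph token, with the per-line entries joined by newline characters. This is illustrative Python, not PyMarkdown's actual code.

```python
# Minimal sketch of the design described above: the Paragraph token keeps
# the leading whitespace for every line, joined by newlines, while the
# inline tokens keep the stripped text.
markdown_lines = ["first line", "   second line"]

extracted_whitespace = "\n".join(
    line[: len(line) - len(line.lstrip())] for line in markdown_lines
)
stripped_text = "\n".join(line.lstrip() for line in markdown_lines)

assert extracted_whitespace == "\n   "          # "" for line 1, "   " for line 2
assert stripped_text == "first line\nsecond line"
```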
The Issue¶
While I do believe it was the right decision to make, that decision has added extra headaches to the project. The big headache is confirming that the newlines in the inline token correspond to newlines in the whitespace that is extracted and stored in the Paragraph token. That is where the `rehydrate_index` field comes in.
I have talked about it in passing, especially in a previous post in the section titled Verifying the Rehydration Index. In that article, I talk about how I was not sure if there were any errors because I was not sure that the field was named correctly. While I concluded that the field was named correctly, that name can still be confusing at times. The `rehydrate_index` field indicates the index of the next newline that will need processing. As such, once the processing of any text at the start of a line is done within the bounds of a Paragraph token, that field is updated to `1`. This offset also means that when all processing for the text within the Paragraph token has been completed, the index should be set to the number of newline characters plus `1`.
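As a small worked example of that invariant, using plain Python string counting rather than the project's `ParserHelper.count_newlines_in_text` helper:

```python
# A three-line paragraph contains two newline characters, so once all of
# its text has been processed, rehydrate_index should have advanced to 3.
paragraph_text = "line one\nline two\nline three"

num_newlines = paragraph_text.count("\n")  # 2

# rehydrate_index moves to 1 after the first line is handled, then advances
# once per processed newline, ending at num_newlines + 1.
assert num_newlines + 1 == 3
```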
To move towards consistency in this field’s value, I added the following code to the handling of the end of a Paragraph block:
```python
num_newlines = ParserHelper.count_newlines_in_text(
    last_block_token.extracted_whitespace
)
if last_block_token.rehydrate_index > 1:
    assert (
        last_block_token.rehydrate_index == num_newlines
        or last_block_token.rehydrate_index == (num_newlines + 1)
    ), (
        "rehydrate_index ("
        + str(last_block_token.rehydrate_index)
        + ") != num_newlines("
        + str(num_newlines)
        + ")"
    )
```
Basically, if the algorithm is at the end of a Paragraph block, any inline elements contained within the paragraph should have moved the `rehydrate_index` field to its proper value.
However, I needed to add an alternate condition to handle the case where it was 1 count short. While that was concerning, it was more concerning that there were some cases where the `rehydrate_index` field fell short of even that adjusted mark. I felt it was imperative to get those outliers addressed first.
Addressing the Failures¶
In debugging, it is not often that an answer almost literally jumps out and screams “Here I am, please fix me!” In this case, when I executed the failing scenario tests and looked at their failures, it was an easy observation to make. For some reason, the Text token after a Hard-Line Break token included a newline character in its text section, but did not include a newline character in its whitespace section. As a result of that mismatch, the tests failed as the Paragraph token’s `rehydrate_index` field was not set to the correct value when verifying the tokens.
It took me a while of careful tracing through the logs, but I finally found that the handling of the end of the line for a Hard-Line Break token was not clearing the `whitespace_to_add` variable in both cases. This code:
```python
whitespace_to_add = ""
```
was being executed if the Hard-Line Break token was created by multiple space characters in the Markdown document. However, if the token was created by the backslash character at the end of the line, it was not. Making sure that code was in both branches solved some of those issues, but not every one of them. There were still a handful of tests that failed.
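The shape of that fix, as a hedged sketch; the function layout and branch conditions here are illustrative only, not the project's actual code:

```python
# Illustrative sketch only: both ways of producing a Hard-Line Break must
# reset the pending whitespace, not just the multiple-spaces form.
def end_of_line_for_hard_break(line_text, whitespace_to_add):
    if line_text.endswith("  "):     # hard break from two or more trailing spaces
        whitespace_to_add = ""       # this branch already had the reset
    elif line_text.endswith("\\"):   # hard break from a trailing backslash
        whitespace_to_add = ""       # the branch that was missing the reset
    return whitespace_to_add
```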
Covering My Bases¶
While the tests for the SetExt Heading tokens were all passing, the discovery of the previous failures inspired me to add some similar tests for those tokens. To do this, I started with the original scenario test, `test_setext_headings_extra_22`:

```
this was \\
---
```
From there, I added the functions `test_setext_headings_extra_22a` to `test_setext_headings_extra_22d` to test the creation of a Hard-Line Break token followed by a Text token. To start, I added the first two functions, which simply had both forms of creating a Hard-Line Break token with some simple text following it. In addition, to make sure that I was handling the next line’s leading whitespace properly, I added two more variations that included a single space character at the start of the following line.
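My best guess at the shape of those four variations, written out as Python strings; the exact Markdown in the repository's tests may differ.

```python
# Both Hard-Line Break forms, each with and without a leading space on the
# following line, ahead of the SetExt Heading marker.
setext_hard_break_variations = [
    "this was\\\nanother line\n---",   # backslash form, simple text after
    "this was  \nanother line\n---",   # two-space form, simple text after
    "this was\\\n another line\n---",  # backslash form, leading space on next line
    "this was  \n another line\n---",  # two-space form, leading space on next line
]
```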
While I am not sure how useful these four tests will be in the future, at the time they were important. As I had just fixed some issues with Paragraph tokens and extracted whitespace, I wanted to make sure that the same format, with the text embedded in a SetExt Heading, did not suffer from a similar problem.
Continuing Forward¶
With the experience from the last section in hand, I continued to look for the root cause of the remaining failed tests. As in the previous section, the tests were failing with a common theme: newlines that occurred within a Link token.
While the solution to this issue was not as easy to arrive at as the solution for the last section, it was fairly simple to see that the problem had two parts: the tokens derived from the link’s label, and the main body of the link itself. Almost as soon as I started looking at the logs, I noticed that the numbers were off, and I had to dig in a bit deeper.
Digging Deeper¶
The way inline links are constructed is as follows:

```
[link](/uri "title")
```
However, the token generation algorithms follow the way that the tokens are used to generate the HTML output, which for that Markdown is:

```
<p><a href="/uri" title="title">link</a></p>
```
As the link label (the text `link` in the above sample) may contain any form of inline text except for another link, that information cannot be contained in its processed form within the Link token itself. Instead, there is a start Link token, followed by the processed version of the link label, followed by the end Link token. From PyMarkdown’s point of view, the token stream for the above Markdown document is:
"[para(1,1):]",
'[link(1,1):inline:/uri:title::::link:False:":: :]',
"[text(1,2):link:]",
"[end-link:::False]",
"[end-para:::True]",
As long as a Link token did not contain a newline character somewhere within its bounds, everything was working properly, and the consistency checks were properly verifying the tokens. But when a Link token did contain a newline character, those properly working algorithms were not working so well.
Finding A Solution¶
As I knew that this problem had two parts, I figured that the solution needed to have two parts as well. I believe that realization is what enabled me to shortcut other solutions that did not work in favor of a split solution that did.
The first half of the solution needed to deal with the text contained within the link label. While this text is handled in the tokens between the start and end Link tokens, from a Markdown point of view, it occurs right after the opening `[` character of the link. The good news here is that adjusting the `rehydrate_index` field for each enclosed token was easily done by adding some simple code at the end of the processing loop in the `__verify_inline` function.
The second half of the solution was in the handling of the Link token itself. As the enclosed inline tokens come first in the Markdown document, it made sense to handle the Link part of the solution when the end Link token is processed. This meant adding some extra code to the `__verify_next_inline` function, processing the end Link token at the top of the function if the `current_line_token` line/column numbers are both `0`. If the `current_line_token` variable is an end Link token and the code is processing within a Paragraph token, I added new functionality to calculate the number of newline characters encountered within the Link token itself.
Running through this code in my head and solving some simple issues, I executed the failing tests again, and was pleased to find that they were all passing. In that version of the code, I had four different sections, one for each type of link. However, after a quick examination of the code, I believed that it could easily be pared down to one block of code that just gathered the newline counts from each part. Eliminating all the other blocks except for the inline block, I was happy to find out that my guess was correct. The way I had set up the values for the other types of links allowed for a more simplified version of the code.
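As a rough sketch, that simplified single block amounts to summing the newline counts from each part of the link; the part names below are assumptions for illustration, not PyMarkdown's actual field names.

```python
# Illustrative only: gather the newline counts from each part of the link.
# For the non-inline link types, the unused parts can simply be empty
# strings, which is what lets one block of code serve all four types.
def count_newlines_in_link(link_label, link_uri, link_title):
    return (
        link_label.count("\n")
        + link_uri.count("\n")
        + link_title.count("\n")
    )

# a label split across two lines contributes exactly one newline
assert count_newlines_in_link("li\nnk", "/uri", "title") == 1
```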
After a quick local check of that guess, a full scenario test run confirmed that there were no ill effects from these changes. I cleaned the code up and checked in that work for the first part of the week.
Interlude¶
It was after this point in the week that it became more difficult to find good blocks of time to work on the project. The first day it happened, it was not so bad, but the ongoing struggle to find good time to work on the project was draining. I knew I had to keep my focus on the other tasks in my life, as they had priority. But even so, it was just hard not to get a good block of work done on the project.
But I persevered.
Inlines and New Lines¶
The second half of the week was spent working on clearing up a related item from the issues list:
- need other multiline elements in label and verify
Piggybacking off the previous item, this seemed like a really good choice to pick off the issues list. While the previous work dealt with plain text inside of the link label, this issue took that work and improved on it. Instead of just simple Text elements, this work added testing support for Raw Html tokens and Code Span tokens that span multiple lines.
Starting with Tests¶
The start of this work was easy. I added the scenario test functions `test_paragraph_extra_a3` to `test_paragraph_extra_c6` to cover all the new scenarios. Starting with simple tests, the first three tests that I specified were a normal link label `[li\nnk]`, a code span ``[li`de\nfg`nk]``, and a raw HTML element `[li<de\nfg>nk]`. Moving on to variations of those tests, having completed them for the inline link type, I transitioned to examples for the full link type, the collapsed link type, and the shortcut link type. Once those variations were done, I copied each of those tests, replacing the start Link element character `[` with the Image element character `![`.
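For reference, here are the three multiline label forms above, written as Python strings for the inline link type; the `(/uri)` portion is assumed here for illustration.

```python
# "\n" marks the newline embedded within each link label.
multiline_label_samples = [
    "[li\nnk](/uri)",        # plain text label split across two lines
    "[li`de\nfg`nk](/uri)",  # code span spanning two lines
    "[li<de\nfg>nk](/uri)",  # raw HTML element spanning two lines
]
```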
Work to set up the tests was also easy, but pedantic. After a quick survey of the inline element types, I was shocked to find out that only the three inline tokens named above allow for newlines within their elements. It did make the creation of the tests easier though, so I was happy to find that out.
And maybe it was just my experience on the project, but the test setup went smoothly. I did the research, figured out what I needed to test, and then cycled through the variations with speed and accuracy. It just felt good to have a solid grasp on the problems.
Starting with the Inline Processor¶
From a link point of view, the tests were generating their tokens successfully. The Text tokens and Code Span tokens were being handled properly, passing all their tests. However, when the Raw Html token tests were executed, the raw text was being added to the token stream as a Text token instead of a Raw Html token. Having fixed an issue in this area before, I had a feeling it was in the `__collect_text_from_blocks` function, where link tokens are translated back into their source text. After taking a quick look there, it was somewhat funny to see that I had handled the Code Span token case, but not the Raw Html token case. A quick fix, followed by another run over those tests, and the Link token tests that were failing started working again.
That left the failing tests that dealt with Image tokens. While this took a bit more work, it was roughly the same process as with the Link token. In this case, the text for the `alt` parameter of the `img` tag is created by consuming the elements generated from the Image token’s label field. This work is done by the `__consume_text_for_image_alt_text` function, consuming the tokens as the processing is done.
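A hedged sketch of that consuming approach, using plain `(token_type, text)` tuples as stand-ins for PyMarkdown's token objects; the real function works on the project's actual tokens.

```python
# Illustrative only: walk the tokens generated for the image's label,
# folding each one into the alt text and consuming it from the stream.
def consume_text_for_image_alt_text(tokens):
    alt_text = ""
    while tokens and tokens[0][0] != "end-link":
        token_type, text = tokens.pop(0)   # consume the token as it is handled
        if token_type in ("text", "code-span", "raw-html"):
            alt_text += text               # alt text keeps only the plain content
    return alt_text

# the label from ![li`de\nfg`nk] contributes its plain text to the alt attribute
tokens = [("text", "li"), ("code-span", "de\nfg"), ("text", "nk"), ("end-link", "")]
assert consume_text_for_image_alt_text(tokens) == "lide\nfgnk"
```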
The main difference between this process and the link process is that, according to the GFM specification, this text is largely preserved without any of the special characters. As such, I had to check the proper translation of each of the inline tokens against BabelMark to make sure that I had the right translation. But other than that extra bump in the process, it went smoothly. The `__consume_text_for_image_alt_text` function filled out quickly with each coded translation.
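As a concrete example of that preservation, mirroring one of the specification’s own examples, the emphasis characters are dropped and only the plain text lands in the `alt` attribute. The Markdown:

```
![foo *bar*](/url "title")
```

is rendered as:

```
<p><img src="/url" alt="foo bar" title="title" /></p>
```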
With that task completed, the scenario tests were generating the HTML output that I expected. Onwards!
Rehydrating the Markdown Text¶
After I was sure that the tokens were being generated correctly, I quickly ran each of the Markdown documents from the new tests through BabelMark, verifying that the output HTML was correct. Encountering no problems, I moved on to the rehydration of those tokens into Markdown and was happy with the results. With only a couple of tests failing, I took a quick look at the failures and noticed the problem right away: rehydrating Link tokens and Image tokens that contained newlines was not working.
Following the log files, I was able to quickly figure out that the problem was that the backslash escape sequences and replacement markers were not being resolved from the text before doing the final rehydration of the elements. In the end, the following lines:
```python
text_to_modify = ParserHelper.remove_backspaces_from_text(text_to_modify)
text_to_modify = ParserHelper.resolve_replacement_markers_from_text(
    text_to_modify
)
```
were added to the `__insert_leading_whitespace_at_newlines` function to resolve those elements.
With that code added, every scenario test was passing to the point of being able to regenerate its original Markdown. On to the consistency checks!
Cleaning Up the Consistency Checks¶
It was in this area that I spent a lot of time making sure things were correct. Due to the shortened time blocks in which I could work on the project, the solutions that I initially came up with were just not solid enough to use. These solutions were either too complicated or failed to meet the criteria, leading me to throw each approach out. In the end, I just reset and used a simple approach, after which things started to work out fine.
Learning from my previous work, I was pretty sure these changes were going to involve handling replacement markers and backslash escapes better. Specifically focusing on the link label and image label, I was able to quickly determine that the link labels and link text from the different link types in the `__verify_next_inline_inline_image` function needed to call the `resolve_replacement_markers_from_text` function to remove the replacement markers.
After making that change, I followed a hunch to see if the other changes I made needed to be copied in some form to the consistency checks. I was rewarded with a positive result when I extended the code under the `if "\n" in str(current_inline_token):` condition of the `__verify_inline` function in a manner similar to the changes made to the `__consume_text_for_image_alt_text` function. It just made sense.
What Was My Experience So Far?¶
This was just a brutal week project-wise. The stop-and-go (but mostly stop) nature of this week reminded me of a time when I worked in a cubicle farm for a year. Instead of people just looking over at you or emailing you, they would have to walk over to your cubicle and stand by your “door” to have a conversation with you. At least for me, it often seemed like I was just getting into a good work flow when someone else showed up to talk, causing another interruption.
While I definitely had my priorities straight in dealing with the issues around my house, the stop-and-go nature of this week made it hard to get into a good flow. Even thinking about how I felt that week made the task of writing this article more difficult. It was just that jarring at times.
That also made my current predicament that much more painful. I can see the issues list getting smaller and smaller, getting closer to a point where I know I will feel comfortable releasing the project. And I want to get there, but I want to get there with what I consider to be solid quality. But I also want to get there soon. And I know that if I had had more quality time during the week, I would have been able to resolve at least a couple more issues.
But I am still happy with the momentum on the project, and where I am with it. And one of the promises that I made to myself at the start of the project is that I must have balance between this project and other priorities in my life. And this was just a week where I had to put my money where my mouth was.
Hopefully next week will be better.¹
What is Next?¶
Next week’s article will be more interesting, as I was able to improve my time allotment for the project, submitting 9 commits during that week.
¹ It was better. These articles track roughly 2-3 weeks behind the actual work, so I know for a fact that it got better.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.