Markdown Linter - Delving Into the Issues - 2

Summary¶

In my last article, I started to tackle a few of the items on the project’s issues list. As the imaginative title for this article suggests, this article details more of the effort to reduce the items on the issues list.

Introduction¶

As I mentioned in the last article, this part of the project is not about anything stellar, just going through the issues list and removing one item at a time. Far from moving a mountain by itself, it is all about moving that mountain one pebble at a time. The work for this week was no different. It was just about dealing with 6 items from the issues list and either proving they are not an issue or fixing the item if it really is an issue.

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather that the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 15 Sep 2020 and 20 Sep 2020.

Rounding Out Coverage¶

While the entry for this issue:

- each inline token surrounded by text

was short, I knew that this issue was all about completing the code coverage on recently changed code. I was already aware that more than half of the cases in the consistency check __verify_first_inline_setext function had the default assert False code put there as a placeholder. In addition, as soon as I looked the rest of the code, it was obvious that I had not tested a lot of the inline sequences within an Atx Heading element or a SetExt Heading element. That entry was a reminder for me to circle back around and fix it!

To accomplish that, I started by working with each inline element and its relevant position in a line of the Markdown document. Tests 1 to 11 dealt with each inline element at the start of the line with text after it, tests 12 to 20 dealt with each element with text on either side of it, and tests 21 to 31 dealt with each element at the end of the line with text before it. For good measure, I added tests 32 to 42 which were each inline element by itself on a line. Once that was accomplished for the simple case, the Paragraph element, I cloned those tests into 2 other sets of tests, one for Atx Heading elements and one for SetExt Heading elements. Once cloned, I introduce a small change for both elements, the Atx Heading tests starting with the sequence #{space} at the start of each line and the SetExt Heading tests having the sequence --- follow the test text on the next line.

What Did I Find?¶

Running the tests after completing those changes, I found two expected results and one unexpected result. The first expected result was that the new tests rounded out the __verify_first_inline_setext function by providing good examples for each of the inline elements, except for the Hard Line Break element. The Hard Line Break element has some special rules, so in most cases the setup caused the text to not be interpreted into a Hard Line Break token, and just placed in a normal Text token.

The second expected result was that several of the __handle_last_token_ functions needed adjustment when evaluated within the bounds of a SetExt Heading element. During the creation of those functions, as evidenced by the work required in the __verify_first_inline_setext function, testing within a SetExt Heading element was almost entirely missed. While it was not difficult to address, the testing clearly showed that I needed to add the following code to each __handle_last_token_ function to compensate:

    if last_block_token.token_name == MarkdownToken.token_setext_heading:
        inline_height += 1

Finally, the unexpected result was that the normal Paragraph processing and the Atx Heading processing were fine, requiring no fine tuning. While I had hoped for that outcome, it was nice for it to happen. I had expected at least one or two small issues to show up, but none appeared.

Fixing A Simple, Yet Wincing Issue¶

Yes, I said a “wincing” issue. While there are more colorful names for these issues, I prefer the name “wincing” as I believe it is clear, descriptive, and obvious. Quite simply, these are a class of issues that are so simple that when I see them, I cannot help but wince that the error made it through any of my commit processes without being detected. In this case, it was a typo where I typed estiated_column_number instead of estimated_column_number, before I copied it into many places. Yeah, I winced.

While I was at it, I also decided to also tackle the issue:

- 52e - make new case with different indent levels for each`

Like my reasons for doing the work in the previous section on Rounding Out Coverage, I just had doubts about whether I had missed anything with the scenario test 52e. With previous issues, I had seen a handful of false positives occur from a test having a constant number of indent spaces. I believe my feeling was that to truly test this scenario, I needed to add a test where the number of ident spaces varied from line to line. As such, I cloned scenario test 52e into 52f, altering the indents on the second and third line to 3 spaces and 1 space, respectively.

What Did I Find?¶

One of the amusing things that I found out is that I get really embarrassed by simple typos. Other than that, while I was right to question the indents from scenario test 52e, the newly added scenario test 52f confirmed that the code was handling the indents properly.

Verifying the Rehydration Index¶

In looking at this issue, I could easily tell beforehand that it was going to be a research issue:

- why?  shouldn't each one be of the proper length?
```
  if split_extracted_whitespace and last_token.rehydrate_index < len(
            split_extracted_whitespace
        ):
```

Introduced when I was working on the Markdown transformer, the rehydrate_index field of the Paragraph token is not serialized with the other fields in the token. It is used exclusively for tracking the index into the Paragraph token’s extracted_whitespace field. When I reviewed this code for another issue, I believe that I saw something about this statement that made me question why it was needed or what it was used for.

What Did I Find?¶

After adding extra debug statements and running through the tests, I started to understand why I might have had issues with this code. Normally, when software developers talk about an index, they are referring to a number that is 0 based instead of a position which is 1 based. Basically, the difference is whether the counting starts with 0, as with an index, or the counting starts with 1, as with a position. From a purist point of view, this field is named properly: if the field has a value of 0, it is referring to the set of characters before the first newline character in the extracted_whitespace field. Another way to look at this is that when the line is split on the newline character into an array, as I often do for processing, the 0th element is the first element.

However, when this field is used, it has the appearance of being a position. The extracted_whitespace field holds any leading whitespace for a Paragraph token. Accordingly, any whitespace before the first newline in that field is applied as leading whitespace to that text string as soon as that string’s processing starts. As such, if there are multiple lines in the paragraph, the rehydrate_index field is set to 1 almost from the very beginning of processing. So, while it is properly called an index, it can appear to look like a position based on its usage.

Following that information back to the code sample above, my question now made sense. However, while the naming could perhaps be better, it is indeed an index that somewhat behaves like a position, with that duality causing the confusion. In the end, after a decent amount of review, I resolved this issue without any changes except for an extra assert statement and some debug. I felt that the name of the variable and the variable’s use were both exactly what they should be.

Fenced Code Blocks and Blank Lines¶

This issue was an interesting one:

- verify that 2 blank lines solution
`if previous_inline_token.token_name != MarkdownToken.token_blank_line`
does not affect single line and 3+ line solutions
  - why needed?

From the start of the work on this issue, I knew it was going to either be very simple or very complex. The easy part of working on this issue was adding 4 new scenario tests, 99f to 99i, to check for any abnormalities. Based on the information contained in the issue’s text, those new tests included multiple blank lines as well as mixing blank lines and text lines within the Fenced Code Block element.

What Did I Find?¶

The first thing I did after running the tests was to adjust the produced tokens to the new values, after manually verifying each one individually. Considering how the coalescing processor merged the Text tokens together, the text and the line/column numbers for each token seemed fine. What was immediately different was the HTML that was produced. Instead of being a direct interpretation of the Markdown, the resultant HTML had a few newlines missing.

Far from being easy to find, this issue took me 2 days of research to find and diagnose, revealing problems with both the HTML transformer and the Markdown transformer. This issue just proved to be very difficult to isolate and identify.

Addressing the Issue¶

For the HTML transformer, additional logic needed to be added to track past characters. For whatever reason, when there are 2 blank lines in a Fenced Code Block element, one of the newline characters is not emitted.

To properly address this issue, newlines needed to be added in 3 areas. The first area was in the handling of the text token, where a Text token preceded by 2 Blank Line tokens, requiring a newline to be added. The second area was in the handling of a Blank Line token to make sure that the newline was only added in the right cases. Finally, at the end of a Fenced Code Block element, another check needed to be added for the multiple Blank Line tokens, adding a newline if they were found.

After completing that grueling episode of debugging, my focus turned to the Markdown transformer and consistency checks. While I was dreading another 2 days of research into the issue and how it affected the Markdown transformer, the results there was very easy to see and diagnose. When a switch is made from rehydrating a Text token to rehydrating a Blank Line token, a newline is omitted in the process. As such, it was easy to identify this case and add an extra newline before focusing on the rehydration of the Blank Line token.

Having fixed that issue, I reran the scenario tests and got a series of weird errors complaining about current_token and None. That was quickly tracked down to working code for handling the last token in a consistency check, and checking to see if the inline_height variable needed to be increased by 1 to take care of the end Fenced Code Block token. After adding a quick check to make sure that it was not None, all the tests ran without issue.

I had mixed emotions about this issue as I worked through it. On one hand, it was not a simple issue, so it was good to get it out of the way. Whether it was pride or confidence, I was pretty sure that most of the simple issues had been cleaned up, and this easily was not a simple issue. On the other hand, to find an issue like this near the end of the first phase of the project was a bit disheartening.

In the end, I felt that the positives of finding and fixing this issue outweighed the negatives. It took a while to clean up, but that scenario was now running properly.

Testing Backslashes¶

The item for this one was simple, yet descriptive:

- backslashes - 600
  - verify with before and after with all valid inline

During the various passes I have made through the code, one thing that I worried about was whether each of the inline sequences properly handled the backslash character. While I was confident that the start sequences of inline elements were decently tested, I still retained a shred of doubt that I had missed something. As for the end of the inline elements, the confidence level was somewhere between a shred of doubt and a hunch that I missed something. Regardless, it was a good thing to check out and then check off the list.

To complete this task, the first part was simple: I needed to come up with a series of scenario tests to cover all the cases for backslashes and inline character sequences. So, starting with the Code Span element, I started working my way through the inline elements. In each case, the main test was to make sure that a backslash character before the start inline sequence prevented the inline element from starting. Then, if there was an end inline sequence, I added an additional test to make sure that a backslash used right before the end of the sequence had the intended result.

Finally, to make sure all inline sequences were covered properly, I replicated each of the tests for all three inline blocks: Paragraph elements, Atx Heading elements, and SetExt Heading elements. Tests 1 to 13 were for Paragraph elements, tests 14 to 26 were for SetExtHeading elements, and tests 27 to 39 were for the Atx Heading elements.

What Did I Find?¶

For this issue, I lucked out. Each of the tests passed, without any issues. But far from being a disappointment, researching this issue and adding the additional tests helped me put to rest any doubts that I had about backslash handling.

Properly Escaping Special Characters¶

Unlike a lot of the other issues in this block of work, this item was clearly a task:

verify that any special characters used can be recognized and specially escaped

To get to this point in the development of the project, I had to enable certain characters to have additional meanings. From the backspace character (\b) meaning “do not consider the previous character” to the replacement character (\a) allowing for a substitution to be documented, there were in all 5 characters that fit into this group.

According to the Github Flavored Markdown (GFM) specification, a character is:

a Unicode code point. Although some code points (for example, combining accents) do not correspond to characters in an intuitive sense, all code points count as characters for purposes of this spec.

Following this logically, any character that is a classic ASCII control character, that is a character between 0 and 31, is considered part of the Unicode control characters group. As all the special characters are classic control characters, they also exist as valid Unicode code points. Therefore, if I wanted to use them as special characters, I had to make sure to escape them if they appeared in input.

So now, to complete this task, I just needed to figure out how to do that.

Walking on The Shoulders of Giants¶

Many languages face the same problem that I was then facing: how do we represent a control character in various formats while keeping the string readable? In the case of strings in languages such as Python, the backslash character is the go-to choice.

A good example of this is the definition of the various special characters themselves.

    __backspace_character = "\b"
    __alert_character = "\a"
    whitespace_split_character = "\x02"
    replace_noop_character = "\x03"
    blech_character = "\x04"

In the first two cases, the backslash precedes a character which is a stand-in for the control character. In the two first cases, the \b character is standing in for the backspace character (ASCII 08) and the \a character is standing in for the bell or alert character (ASCII 07). In the remaining cases, the \x informs the Python interpreter that a 2-digit hexadecimal number will follow that is the actual character to insert into the string. In all cases, there is a clear understanding of what the character is, due to the backslash character escaping the readable form of the character to insert.

For the processing of Markdown documents, the backslash was not a good character to use. The biggest advocate against its use was that it was already being used as an escape character for Markdown’s special characters. To efficiently escape the desired character sequences, the chosen sequence would have to be one that rarely showed up in the normal Markdown document. Unimaginatively, I ended up choosing the character \x05.

I could try and explain why it was the best choice, giving lots of made up reasons, but in this case, the reason was simple: 5 follows after 4. It made a lot of sense to pick another control character for the escaping character, as the other characters were all in that group. At that point, after eliminating any control characters that were commonly used, there were roughly 25 characters left. Any character would have been just as good as the others, so I just picked the next one in the sequence. Sometimes you need to solidly think something through, and sometimes you just need to pick something that just works. This was definitely a “just works” moment.

Making The Change¶

From the outset, I knew that if I did this change properly, the bulk of the work would be contained within the ParserHelper class and the InlineProcessor class. The changes to the InlineProcessor class were easy to define. When parsing inline blocks of text, I would need to specifically look for any of the special characters, escaping them before moving forward with the rest of the string. For any processing after that change had been made, I knew that is where the ParserHelper class came in.

Adding the following declaration to the top of the class:

escape_character = "\x05"

the next function was easy. To avoid a cycle in the module dependencies, the valid_characters_to_escape function was added to allow the InlineProcessor class query the ParserHelper class for the characters to escape. After that, to make sure that I had coded things properly, I added a series of 24 scenario tests prefixed with test_textual_content_extra_ to perform special character tests. Starting with a simple test with an example that contained each special character, those tests quickly moved to testing various combinations of each character and normal text. Each section picked a specific type of escaped character, including it in a simple string and interspersed with examples that produced the unescaped versions of those same characters.

A good example of this is the function test_textual_content_extra_4. Used to test the replacement character \a, the Markdown document is as follows:

\a&\a&\a

To prove that the code is working properly, an escaped replacement character must appear on either side of an actual replacement sequence, also using the \a character. This results in a Text token containing the text \x05\a\a&\a&\a\x05\a\a&\a&\a\x05\a, as predicted. It starts with the sequence \x05\a to provide for an actual \a character to be output. The following sequence, \a&\a&\a then documents that the & character was replaced with the text &. From there, it just repeats enough times to make sure it works well.

Once that simple sequence and others like it with the \a character were verified, this pattern was then repeated with the other inline containers: Atx Heading elements and SetExt Heading elements.

Cleaning Up¶

While these changes sound good in theory, in practice there were some issues that needed to be addressed. The major change was the inclusion of the function remove_escapes_from_text. To ensure that those character were properly represented, I needed to be able to present a text string and have any instances of the escape character removed from them. Basically, if I supplied a Markdown document of \aa, that \a character is then escaped, producing a token with the text \x05\aa When I go to use that text, I need to remove the “protection” of the escape character before rendering that text.

That was easy to do, and as I started adding calls to the remove_escapes_from_text function, there were some previously working tests that started failing. After doing a bit of research, the cases that were failing were a portion of the tests that relied on the remove_x_from_text and resolve_x_from_text functions. In each of those functions, something like the following was being done:

    def remove_backspaces_from_text(token_text):
        return token_text.replace(ParserHelper.__backspace_character, "")

While that was acceptable before this change, there was one glaring error with it after this change: it was removing all backspace characters, including escaped ones. Thought it took a bit more code, that one function was transformed into the following:

    def remove_backspaces_from_text(token_text):
        start_index = 0
        adjusted_text_token = token_text[0:]
        next_backspace_index = ParserHelper.__find_with_escape(
            adjusted_text_token, ParserHelper.__backspace_character, start_index
        )
        while next_backspace_index != -1:
            adjusted_text_token = (
                adjusted_text_token[0:next_backspace_index]
                + adjusted_text_token[next_backspace_index + 1 :]
            )
            start_index = next_backspace_index
            next_backspace_index = ParserHelper.__find_with_escape(
                adjusted_text_token, ParserHelper.__backspace_character, start_index
            )
        return adjusted_text_token

While it was not as simple, it was correct. Create a copy of the string, and then look for the first occurrence of the backspace character that is not escaped. When that is done, adjust that copy to remove that backspace, and then continue searching from there until there are no more backspaces to remove.

Finally, after all that work, there was only one set of cases left, special characters within a code span. This was a special case because the characters needed to be preserved exactly, with no substitutions. In the case of the code span, once the code span start character sequence is found, a scan is made to locate a matching end character sequence. If it is found, the handler simply grabs and characters between the end of the start sequence and the start of the end sequence. To properly escape these characters, the escape_special_characters function was added.

After a lot of work, and a fair number of changes, the parser was cleanly handling the escaped characters. From my point of view, that was a good, solid issue to end the work for the week on!

What Was My Experience So Far?¶

While there were many little things to learn, the big thing that I am learning with this phase of the project is that the bulk of this work is in the research. If I am having a good day, the research will take anywhere from 5 minutes to an hour, followed by at least an hour of retesting, or adding a slew of new tests and then retesting. When that happens, I am grateful that it goes by so quickly.

And those positive experiences prepared me for experiences like the one that I had with Fenced Code Blocks this week. They are frustrating and can make me want to pull my hair out. I actually prefer the complex issues over these types of issues. With those issues, I can more clearly see the issue by breaking down that complex systems into simpler systems that can then be analyzed individually. For the Fenced Code Block issue, it was just a small nitpicky issue that I had to calmly analyze, looking within a small and complex set of variables to figure out.

It is at moments like that, when I am faced with tough issues, that I take the time to step back and try and get a better perspective. While there are some issues that are more difficult like the Fenced Code Block issue, the bulk of the issues are relatively easy to diagnose given the right debug statements. And fortunately, the bulk of the project’s design has held up nicely, so only a couple of small redesigns needed to be done. All in all, it is not a bad place to be in. Stepping back and really appreciating those parts of the project is what I believe keeps me sane. In the end, the severity of the issues and features evens out, even reducing in scope the further I get to the end of the project’s first phase.

Looking ahead, I know that I have a good handful of tough issues to deal with. Knowing that I was still able to keep a good perspective and keep my wits around me, I have confidence that I will be able to solve those issues. Some of them may take an hour to solve, some may take a couple of days to solve, but I will solve them!

What is Next?¶

Not to sound like I am repeating myself, but this part of the project is all about keeping the momentum going and solving another handful of issues.

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.

Comments

Markdown Linter - Delving Into the Issues - 2

Summary¶

Introduction¶

What Is the Audience for This Article?¶

Rounding Out Coverage¶

What Did I Find?¶

Fixing A Simple, Yet Wincing Issue¶

What Did I Find?¶

Verifying the Rehydration Index¶

What Did I Find?¶

Fenced Code Blocks and Blank Lines¶

What Did I Find?¶

Addressing the Issue¶

Testing Backslashes¶

What Did I Find?¶

Properly Escaping Special Characters¶

Walking on The Shoulders of Giants¶

Making The Change¶

Cleaning Up¶

What Was My Experience So Far?¶

What is Next?¶

Comments

Reading Time

Published

Markdown Linter Issues

Category

Tags

Stay in Touch