Markdown Linter - Delving Into the Issues - 14

Summary

In my last article, I started the transition to working on Block Quote issues. Having made good progress there, I continued with them this week, sometimes blurring the line between Block Quote issues and Block Quote/List Block interactions.

Introduction

Now firmly in the mode of dealing with Block Quote issues, I was looking forward to making more progress in this area. With each week that passed, I was becoming more aware of how close I was to being able to do at least an initial release of the project. Block Quotes, and possibly Block Quote/List Block interactions, were the last big thing that I needed to do before taking that leap, so I just needed to buckle down and work through those issues, wherever they would lead.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 16 Dec 2020 and 20 Dec 2020.

Fixing Things Up Before Proceeding

Before I started doing a lot of work for the week, I decided I needed to get one of the disabled functions working: test_list_items_282. While it was not a tremendously important issue, it was one that I had put off for a while, and I just wanted it out of the way. Following the GFM Specification exactly, it can be argued that the following Markdown:

1. foo
2. bar
3) baz

should be parsed either as a single list with two items or as two lists. The crux of the issue is whether the third line is considered a continuation of the second line or the start of an entirely separate list. Basically, it all comes down to how your parser implements paragraphs and paragraph continuations.

Luckily, one of the primary authors of the specification (@jgm) chimed in with the following bit of information:

The intent was to have language that applies to the starting of list items in the middle of a paragraph that isn’t a direct child of a list item at the same level.

While some of the other help was not as useful, that one sentence was. I did agree with the other author that Paragraph elements are the default element in Markdown, but I was not as sure that his interpretation of paragraph continuations was correct. Still, this information helped me out a lot. Using it, I was able to quickly put together these lines:

    if parser_state.token_stack[-2].is_list:
        is_sub_list = (
            start_index >= parser_state.token_stack[-2].indent_level
        )

Following the suggestion of @jgm, I changed the exclusion code at the end of the is_olist_start function to include a reference to is_sub_list. Following my usual process, I was quick to find that everything just fell into place, and the test was passing. But I was not confident that I had properly fixed the issue, so I created four variants of the test data, each just a little different from each other. It was only when all five tests had passed that I considered the issue resolved and dealt with.
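To make the reasoning concrete, here is a minimal sketch of how I read that rule. It is not the project's is_olist_start function: the function name, its parameters, and the simplified rule are all assumptions made for illustration, and the related restriction on empty list items is not modeled at all.

```python
def ordered_marker_can_start_item(
    number, in_paragraph, start_index, parent_list_indent=None
):
    """Hypothetical helper: may an ordered-list marker start a list item here?

    number             -- the numeric value of the marker (e.g. 3 for "3)")
    in_paragraph       -- True if the parser is currently inside a paragraph
    start_index        -- column at which the marker starts
    parent_list_indent -- indent level of the enclosing list, or None if the
                          paragraph is not inside a list (an assumed encoding)
    """
    if not in_paragraph:
        # Nothing is being interrupted, so any number is acceptable.
        return True

    in_list = parent_list_indent is not None
    # Mirrors the spirit of the is_sub_list check shown above.
    is_sub_list = in_list and start_index >= parent_list_indent

    if in_list and not is_sub_list:
        # The paragraph is a direct child of a list item at the same level,
        # so the "must start with 1" restriction is not meant to apply.
        return True

    # Otherwise the marker interrupts a paragraph and must be numbered 1.
    return number == 1


# "3) baz" directly after "2. bar": same level as the enclosing list.
print(ordered_marker_can_start_item(3, True, 0, parent_list_indent=3))
# "3) baz" in the middle of a plain top-level paragraph.
print(ordered_marker_can_start_item(3, True, 0))
```

Whether the new item then joins the existing list or opens a second one depends on the list delimiter, which this sketch deliberately leaves out.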

An Easy Set of Tests

Some of these items take days to complete and some take hours. Until I start working on them, I never know which bucket they will end up in. Therefore, when I started working on this item:

- 634a in bq and in list

I had no clue what was going to happen. Starting with the basics, I looked at the Markdown for function test_raw_html_634 which was:

<a  /><b2
data="foo" ><c>

As the start of the Markdown does not have a single HTML tag, or an HTML tag from one of the special groups, an HTML block is ruled out. But when the processing happens for a Paragraph element, the inside of that paragraph is then filled with Raw HTML elements, only slightly altering the Markdown when rendered as HTML:

<p><a  /><b2
data="foo" ><c></p>

As the item suggests, placing that same Markdown inside of a List Block or a Block Quote should result in the same behavior, just inside of that block. Crossing my fingers for good luck, I created two variants of that Markdown: one that was prefixed with - and the other that was prefixed with >. Things flowed quickly through my usual process and I was happy to find that these scenarios both worked without any additional changes being needed.
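As a rough illustration of how such variants can be derived, the sketch below wraps the original Markdown in each container. The helper names are hypothetical, and the exact test data in the repository may be laid out differently (for example, in how continuation lines are indented).

```python
SOURCE_MARKDOWN = '<a  /><b2\ndata="foo" ><c>'


def in_list_item(markdown):
    """Prefix the first line with "- " and indent the rest to match."""
    lines = markdown.split("\n")
    wrapped = ["- " + lines[0]] + ["  " + line for line in lines[1:]]
    return "\n".join(wrapped)


def in_block_quote(markdown):
    """Prefix every line with "> "."""
    return "\n".join("> " + line for line in markdown.split("\n"))


print(in_list_item(SOURCE_MARKDOWN))
print(in_block_quote(SOURCE_MARKDOWN))
```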

More Fun with Block Quotes

With good luck occurring for my last item, I hoped it would carry on to my next item. So, when I looked for more work to round out the Block Quotes tests, I came across the following group of items in the issues list:

- 300 with different list following
- 300 with extra indent on following item
- 301, but with extra levels of block quotes
- 301, with indented code blocks

Rather than tackling them separately, I decided to tackle them together as a group.

The first part of that work was making the requested variations on the data for test function test_list_items_300. That data was somewhat simple, with the Markdown text of:

* a
  > b
  >
* c

To address the first item, I added test function test_list_items_300a that included an Ordered List element after the Block Quote element, instead of an Unordered List element. Test function test_list_items_300b addressed the second item by keeping that List Item element as an Unordered List element but adding 2 space characters before it to make it a sublist.

Similarly, test function test_list_items_301 has a Markdown text of:

- a
  > b
  ```
  c
  ```
- d

The new variation of this Markdown in test function test_list_items_301a was to change the second line to start two Block Quote elements instead of one. Test function test_list_items_301b modified the test data slightly by indenting each line of the Fenced Code Block element by one space. Test function test_list_items_301c made a more drastic change by replacing the Fenced Code Block element with the single character c indented from the start of the Block Quote element. Finally, test function test_list_items_301d was a more correct version of test function test_list_items_301c, including a blank line and a c character indented by four spaces, but properly enclosing them within the Block Quote started on line 2.

With those changes made, it was time to get down to figuring out if there were any problems and dealing with them!

Working The Problem

Starting with my normal process for working through new test functions, I worked through the tokens and HTML for each of these functions. Apart from the functions test_list_items_301b and test_list_items_301c, the other tests were producing the tokens that I expected them to. After trying to get those two functions working properly for an hour or so, I decided to put them on hold while I got the other functions cleared up.

Focusing on those other test functions, the HTML output was mostly there, but required a small amount of fiddling. Specifically, in tests where there was a Blank Line element within a Block Quote element within a List Block element, the Blank Line within the Block Quote was being used to determine whether the list was loose. As that Blank Line was within another block and not within the List Block itself, it should not have been affecting the calculation of List Block looseness. Luckily, the fix was to add nine lines of code to the __calculate_list_looseness function to properly increase and decrease the stack_count variable to account for the Block Quote token.
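The idea behind that fix can be sketched with a heavily simplified token stream. This is not the project's __calculate_list_looseness function: the string token names and the reduced looseness rule are assumptions for illustration, and only the stack_count bookkeeping follows the description above.

```python
def is_list_loose(tokens):
    """Decide looseness for one list, given a flat list of token names.

    Assumed token names: "li", "para", "blank", "block-quote",
    "end-block-quote". A blank line only counts if it sits directly in the
    list (stack_count == 0) and is followed by more list content.
    """
    stack_count = 0  # how many Block Quotes we are currently nested inside
    saw_blank = False

    for name in tokens:
        if name == "block-quote":
            if stack_count == 0 and saw_blank:
                return True
            stack_count += 1
        elif name == "end-block-quote":
            stack_count -= 1
        elif stack_count > 0:
            # Anything inside a Block Quote, blank lines included, belongs
            # to that Block Quote and must not affect the list.
            continue
        elif name == "blank":
            saw_blank = True
        elif saw_blank:
            # Content directly in the list after a blank line: loose.
            return True
    return False


# Roughly test_list_items_300: the blank line is inside the Block Quote.
print(is_list_loose(
    ["li", "para", "block-quote", "para", "blank", "end-block-quote", "li", "para"]
))
# A blank line directly between two list items.
print(is_list_loose(["li", "para", "blank", "li", "para"]))
```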

With the tokens and HTML output dealt with, it was time to deal with the rehydrated Markdown and the consistency checks. The fix to the Markdown was an easy one to see: the whitespace allocated to the Block Quote tokens was not being added back into the Markdown text that was generated. The changes to incorporate that information were almost as easy to make, leaving the consistency checks.

While the consistency checks took a bit, in retrospect they were somewhat easy to understand and fix. At the time though, it took a bit of effort to work through them. Like the changes in the Markdown generator, changes needed to be introduced to properly track which part of the Block Quote’s extracted whitespace was applied in the Markdown generator. That tracking is done using the Block Quote token’s leading_text_index variable, which was not being properly incremented to track the newlines used within the owning Block Quote token.

Once that change was done, things were looking a lot better, but there was a single case where there was an index error getting that whitespace out of the token. Upon debugging, it was obvious that the leading_text_index variable was getting incremented twice. Fixing that took a bit of passing information around but was quickly taken care of. And with that fix in place, each of the tests that I was working on was solved and passing properly.
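To show why a double increment leads to an index error, here is a small sketch of the bookkeeping. The class below is a hypothetical stand-in, not the project's Block Quote token. It only assumes that the token stores one piece of leading text per source line and walks through them using leading_text_index.

```python
class BlockQuoteTokenSketch:
    """Hypothetical stand-in for a Block Quote token."""

    def __init__(self, leading_spaces):
        # Assumed layout: one entry per source line, joined by newlines.
        self.leading_spaces = leading_spaces
        self.leading_text_index = 0

    def next_leading_text(self):
        parts = self.leading_spaces.split("\n")
        # Raises IndexError if the index was advanced more than once per line.
        text = parts[self.leading_text_index]
        self.leading_text_index += 1
        return text


def rehydrate(token, inner_lines):
    """Put the extracted Block Quote prefixes back in front of each line."""
    return "\n".join(token.next_leading_text() + line for line in inner_lines)


token = BlockQuoteTokenSketch("> \n>")
print(rehydrate(token, ["b", ""]))

# Simulating the bug: the index is bumped twice for a single line.
bad_token = BlockQuoteTokenSketch("> \n>")
bad_token.next_leading_text()
bad_token.leading_text_index += 1  # the extra, incorrect increment
try:
    bad_token.next_leading_text()
except IndexError:
    print("ran out of leading text")
```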

Not Everything Was Solved

With the other tests passing cleanly, I refocused my efforts on test functions test_list_items_301b and test_list_items_301c. Doing my usual research into the issues, I started to figure out what the tokens generated by the parser should be, comparing that answer with what was being generated. It was then that I noticed that the tokens for the tests were close to normal, but not correct. In both test functions, the tokens mostly differed in where one element stopped and the next one started.

Now, when I say, “I fiddled with the code”, I really mean I tried normal paths and interesting paths to try and solve the issue. And it was similar with these issues. After around two hours of fiddling, I was no closer to having a solution than when I first looked at the problem. In the end, after doing a fair amount of research and debugging, I decided that I was going to commit the code with functions test_list_items_301b and test_list_items_301c disabled. I just was not getting the right “angle” on solving those issues. I felt it would be better to commit what I had and work on those two functions in the next couple of days, so that is what I did!

Dealing With 301B

During that Saturday, I decided to take another shot at test function test_list_items_301b. I knew that the tokens were just wrong. To quantify that wrongness, the Markdown text:

- a
  > b
   ```
   c
   ```
- d

produced a Block Quote element that tried to include an empty Fenced Code Block element, leaving the c character by itself on a line outside of the element, followed by another empty Fenced Code Block. From prior work on List Blocks, I knew that the Fenced Code Block should terminate the Block Quote element, as Block Quotes follow similar rules. It was just a matter of figuring out how to get there.

Knowing that I had tried to solve this problem the day before, I decided to take the hard-line approach of debugging line-by-line through some of the code. I don’t usually do this as it is very time consuming and requires meticulous notes. While I am comfortable with doing that work, there are just more efficient ways of getting to the same target. But with those ways not working for this problem, it was down to the nitty-gritty.

It was a good thing that I used this approach because it told me something interesting. Usually, at the start of processing, the code separates the line into extracted whitespace and parseable line. In this case, the debugger was showing me that the separation that I had expected was not done. As such, when the parser looked at the raw data for the line, the spaces at the start of the line prevented that line from being recognized as the start of a Fenced Code Block.

The good news here was that adding these four lines to the __process_lazy_lines function made everything work:

        after_ws_index, ex_whitespace = ParserHelper.extract_whitespace(
            line_to_parse, 0
        )
        remaining_line = line_to_parse[after_ws_index:]

Basically, I just took the same code that split the original line_to_parse variable in the main-line processor and added it here. With that code in place, the is_fenced_code_block function did the rest, properly noticing the Fenced Code Block start sequence, and properly starting the code block within the Block Quote.
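The effect of that split can be shown with a small standalone sketch. The functions below are simplified stand-ins written for this illustration, not the project's ParserHelper.extract_whitespace or is_fenced_code_block: they only handle space characters and a bare fence check, which is an assumption on my part.

```python
def extract_whitespace(source_string, start_index):
    """Return (index after the leading spaces, the spaces themselves)."""
    index = start_index
    while index < len(source_string) and source_string[index] == " ":
        index += 1
    return index, source_string[start_index:index]


def looks_like_fence_start(text):
    """Simplified check: does the text begin with a fence sequence?"""
    return text.startswith("```") or text.startswith("~~~")


line_to_parse = " ```"

# Looking at the raw line, the leading space hides the fence.
print(looks_like_fence_start(line_to_parse))

# Splitting first, as in the four lines above, exposes it.
after_ws_index, ex_whitespace = extract_whitespace(line_to_parse, 0)
remaining_line = line_to_parse[after_ws_index:]
# A fence may be indented by at most three spaces.
print(len(ex_whitespace) <= 3 and looks_like_fence_start(remaining_line))
```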

Dealing With 301C

It was just after 6pm PST when I committed the changes for test_list_items_301b, and I decided to start working on test_list_items_301c. In this case, it was not the tokens that were wrong, but the spacing included from one of the lines within a Block Quote element. Doing research into the night, I still was unable to figure out what the magic sequence was to get this working properly. Rather than press on, I decided to take a break and spend some time with my family, relaxing myself and my brain.

This worked well. Starting again in the morning, I was able to quickly see that there were two problems that I needed to address. The first of those problems was that the function __adjust_paragraph_for_block_quotes was always adding an empty string to the Block Quote token, which was the problem I was trying to solve. The second problem was that it was a great solution for most of the time, this specific test being one of the few times where it was not.

With that fresh information, I started experimenting and was able to isolate code in the parse_line_for_container_blocks function that affected the outcome of the test. The fun part was that if I took the indentation used for the list and added it to the token’s data, test function test_list_items_301c worked, but other tests failed. Doing some extra plumbing and experimentation, I narrowed the scenario down so that it only triggered when the parse_line_for_container_blocks function was invoked within a paragraph that was within a block quote.

What Was My Experience So Far?

Having felt dread at the prospect of Block Quotes taking as long to complete as List Blocks did, I am happy to report that the feeling is quickly fading. The sense I am getting from the issues I am looking at is that there are issues to deal with, but nothing I have not dealt with already, just variations on it. On top of that, there is only one type of Block Quote to worry about, with a very simple start sequence. I was very solidly feeling that Block Quotes were going to be a lot easier than List Blocks.

That did not mean things were going to be easy though! I was starting to come close to the end of the initial set of issues that I had added to the issues list, but I was adding more as I looked through the code. This week, I was able to get rid of a handful of those issues, and it felt good. But with approximately 70 lines of items before hitting those that dealt with Rules, Tabs, and Correctness, it was a sobering reminder that I just needed to get stuff done and done cleanly.

And for me, that often poses a problem. There are things that I want to do and feel that I should be doing, and there are things that I need to do. Resolving any issue that deals with Tab characters? Now that is a want, and I can live with a “stupid” translation of Tabs until I can get the time to make them right. Resolving any issues that might uncover scenarios that I have not covered yet? To the best of my abilities, that is a must. But there is a grey area between those two.

Take the item:

- image token handling confusing and non-standard

I would like to get it resolved, but I am not sure that I need to do that. I must balance my desire for wanting things done right and delaying any release with my knowledge that I can live with it for a couple of months while I get the project out. And those are the types of decisions that I am going to have to make more and more as this project gets to its closing stages. Do I really need it, or can it wait?

At least I know it is going to be fun!

What is Next?

I was just as happy to get some holiday time off in which I could spend more time with my family as I was to be able to make some solid holiday progress. This next week was going to be a good mixture of resolving solid issues, test cleanup, and code cleanup. Stay tuned!

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.
