Summary

In my last article, I talked about how I quickly found myself descending into a rabbit hole while addressing an issue with my PyMarkdown project. From experience, that behavior is a pattern that I personally must work hard to avoid. In this article, I talk about mentally digging myself out of that hole while extending the consistency checks for the PyMarkdown linter to tokens beyond the initial token.

Introduction

At the end of the last article, I talked about how I started chasing a feature down multiple rabbit holes, noticing that behavior a good 4-5 days after I felt I should have noticed it. As part of who I am, I find certain types of puzzles very hard to put down and almost addictive in nature. During the development of the PyMarkdown project, I have had my share of these puzzles, so I have had to be very vigilant about taking a measured approach to each issue in an attempt to keep that behavior in check.

As this was the first time during this project that I have travelled down this route so completely, I was slightly proud of myself. Usually, for a project of this complexity and duration, I would have expected to descend to those depths 2-3 more times, if not more. From experience, I find that it is during those times that I lose my perspective on a project, and either go for perfection, give up the project that I am working on, or both. I am not sure, but my current belief is that by looking at the bigger picture, and looking at it frequently, I have mostly mitigated my desire for perfection by moderating it with other grounding concepts, like feasibility.

Having noticed that I dug myself into that hole, it was now time to try and figure out what to do. Stay where I was? Go forward? Go backward? It really was my choice on how I handled that situation.

What Is the Audience for This Article?

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commit of 14 Jun 2020.

Mental and Emotional Stuff Is Important Too

I am not one for using excuses when something does not work out, but I am also not one to keep on knocking someone down for failing. Keep in mind the saying I have adopted:

Stuff happens, pick yourself up, dust yourself off, and figure out what to do next.

For me and my family, there are two parts of that statement that are important to realize: the trying part and the figuring-out/learning part.

In all honesty, I have every reason to not write this blog or work on this project. I could spend more time interacting with my family, fiddling around with home automation (how this blog started), watching movies (the more b-movie or “hard” sci-fi the better), reading books (see movies), or any number of other things. At the time that I am writing this article, it is just shy of 4 months that I have been working from home due to COVID-19. There is the stress related to thinking about the 500,000 fatalities and 10 million infected people worldwide. Let’s face it: the mere thought of that scale of death and disease is somewhat crippling to me on good days. Add to that the normal stress of making sure I am being a good worker, colleague, husband, and father. Pick any 1 or 2 of those reasons, and most people would understand if I said “I started writing a blog, but I stopped when…”

But I continue writing this blog and continue developing the project. I have good days and I have bad days, and I also have just, well, days. But I keep on trying. I know I am not going to get the blog right each time, but I try, and I improve. I know I am not going to get the PyMarkdown project right on the first try, and I am probably at my tenth or eleventh try if I add all the small tries together. But I keep on trying.

And that is where the second half of the saying comes into play: figure out what to do next. There is a great quote that may or may not be attributed to Albert Einstein:

Insanity is doing the same thing over and over again and expecting different results.

Figuring out what to do next for me is learning from what just happened, trying to avoid the same mistakes in the future. Does it always work? Nope. Sometimes it takes a couple of times for me to figure things out. Sometimes I do not figure them out and have to ask for help on how to deal with those mistakes or how to prevent them. And in some rare cases, I just need to let them be. But that is also learning: learning what I can do, what I cannot do, and trying to understand the effort required for each.

How Does This Relate to My Current Situation?

I went down a rabbit hole and lost 4-5 days of productive work, only realizing it once I was completely obsessed with the solution. While I did struggle to find some positive spin on the situation, it was a rough break to deal with. Mentally, it took me about a week to get over the “oh no, I went there again” thoughts in my head. Emotionally, I was hard on myself because I had “let myself” not pay attention and lose my way. Then add the weight of everything going on in the world around me at that time. It took me a bit to work through that quagmire.

But I do not believe in giving up on something without a good reason. I have learned a lot about Python, writing, and myself by working on the PyMarkdown project and writing about it in this blog. The things that I have learned are in all sorts of areas, not just the technical areas that I expected. And personally, I have committed to this path: to learning and helping others learn. Basically, I have too many good reasons to keep going, and not many reasons to give up.

And yes, it was hard to push myself forward for the first couple of weeks, through all the emotional and mental debris in the way. I did a lot of falling in that time, deciding each time to pick myself up. There were times when the words flew out of my fingers, and times when I could not type the next line in either the project or the blog. But I had confidence that I would work through it, and that I would not give up. This was something I knew I could do; I just had to push through some bad stuff to get there.

Where to Start?

I knew that I had to push forward with something positive, but what to do? After a bit of thinking, hoping to get myself back into the writing mood, I decided that I would make some enhancements to the work already done. In the last article, I described how I added code to validate the initial token in the token stream. This was done by adding logic to the function only for the cases where the last_token variable was None for any block tokens. The plan for this enhancement was to keep the focus on block tokens, but to shift it to the other block tokens in the arrays. That is, to focus on the cases where the last_token variable was not None.

Keeping the Scope Small

After being burnt by the events documented in the previous article, I wanted to be extra careful by setting myself up for success. If I started with a small scope and found that I was able to get it working quickly, I could always do more work in a subsequent enhancement. Knowing how I work, I knew that it would be far more difficult for me to reduce the work’s scope once I had started it. Especially with my confidence needing a boost, I needed a solid, yet achievable, goal that I could complete.

The question was: where to start? I had a feeling that dealing with leaf blocks which started on the same line as container blocks was going to be a headache, so I removed them from the scope. I knew that I had issues with tab characters from the last enhancement, so I removed them as well. Was there anything else I could remove?

Looking over the scenario tests and their data while scribbling some notes for myself, I quickly came to two conclusions. The first was that verifying the consistency of list-related objects would be relatively easy, as the token contained all the necessary data. Because of the presence of the indent_level variable and extracted whitespace in those tokens, I was able to quickly figure out how to check their consistency. After a couple of mental dry runs with my solution and the real test data, I had confidence I could complete those checks.

The second conclusion was that handling the block quote tokens was going to be a completely different issue. Those tokens did not have a concept like an indent_level that could be generally applied, nor any memory of the whitespace that they extracted from the line. Supporting consistency checking with block quotes was going to take some decent changes to the block quote tokens, something that would take a lot of time and planning to properly execute.

With these considerations in mind, I decided that this enhancement would not address any token arrays with block quote tokens or tab characters. To keep things more manageable, I would restrict any checking to leaf block tokens on new lines. With this in place, I was good to go! But I needed to add one thing first: a consistency token stack.

Adding the Consistency Token Stack

The necessity of adding this stack to the consistency check was not a surprise to me. To allow the parser to maintain a proper understanding of the scope of the active container blocks, I had added a stack to the parser explicitly for those container blocks. Tasked with validating the consistency of the tokens, and without access to that internal stack, I knew that I needed to replicate some of its functionality.

Before any of the serious checking of the tokens took place, I needed a simple stack that would track which container tokens were currently active. The first pass at this was really simple, with the pseudocode being as follows:

for each token:
  if the token is a container block:
    add the token to the stack
  else if the token ends a container block:
    remove the last item from the stack

  do the rest of the processing

This pseudocode is part of my mental algorithm repository that I lovingly refer to as Parser 101. Except in the case of very simple parsers, almost every other parser I have written has had to maintain some manner of stack to track context. While the consistency checker is not a full-fledged parser, it is operating on the tokens generated by a legitimate full-fledged parser. As such, it made sense that some of my parser experience would assist me with this enhancement.

The first pass at implementing this algorithm was a simple translation of the above pseudocode fragment into Python code. The if the token is a container block translation was simple, due to a previous refactoring:

if current_token.token_class == MarkdownTokenClass.CONTAINER_BLOCK:

and the add the token to the stack part of the translation was just as simple:

    container_block_stack.append(current_token)

The if the token ends a container block part of the translation was a bit more tedious to implement. This was primarily due to end tokens not having any concept of the token that started that block, only the name. As such, after adding a comment and a note to improve this, the following code was written to figure out if the container block was ending:

    elif isinstance(current_token, EndMarkdownToken):
        token_name_without_prefix = current_token.token_name[
            len(EndMarkdownToken.type_name_prefix) :
        ]
        if token_name_without_prefix in (
            MarkdownToken.token_block_quote,
            MarkdownToken.token_unordered_list_start,
            MarkdownToken.token_ordered_list_start,
            MarkdownToken.token_new_list_item,
        ):

Finally, the remove the last item from the stack part of the translation became:

            assert container_block_stack[-1].token_name == token_name_without_prefix
            del container_block_stack[-1]
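
For readers who want to see the pieces together, below is a minimal, self-contained sketch of the whole stack-tracking loop. The token classes here are simplified stand-ins that I am assuming purely for illustration; the project's real definitions carry much more information:

from enum import Enum

# Simplified stand-in token classes, assumed for illustration only.
class MarkdownTokenClass(Enum):
    CONTAINER_BLOCK = 0
    LEAF_BLOCK = 1
    INLINE_BLOCK = 2

class MarkdownToken:
    token_block_quote = "block-quote"
    token_unordered_list_start = "ulist"
    token_ordered_list_start = "olist"
    token_new_list_item = "li"

    def __init__(self, token_name, token_class):
        self.token_name = token_name
        self.token_class = token_class

class EndMarkdownToken(MarkdownToken):
    type_name_prefix = "end-"

    def __init__(self, started_token_name):
        super().__init__(
            EndMarkdownToken.type_name_prefix + started_token_name,
            MarkdownTokenClass.INLINE_BLOCK,
        )

def track_container_blocks(actual_tokens):
    # Maintain a stack of the currently open container block tokens.
    container_block_stack = []
    for current_token in actual_tokens:
        if current_token.token_class == MarkdownTokenClass.CONTAINER_BLOCK:
            container_block_stack.append(current_token)
        elif isinstance(current_token, EndMarkdownToken):
            token_name_without_prefix = current_token.token_name[
                len(EndMarkdownToken.type_name_prefix) :
            ]
            if token_name_without_prefix in (
                MarkdownToken.token_block_quote,
                MarkdownToken.token_unordered_list_start,
                MarkdownToken.token_ordered_list_start,
                MarkdownToken.token_new_list_item,
            ):
                # An ending token must match the innermost open container.
                assert container_block_stack[-1].token_name == token_name_without_prefix
                del container_block_stack[-1]
        # ... the rest of the consistency processing goes here ...
    return container_block_stack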

After all that was added, I quickly ran the tests and… tests were failing all over the place. But why?

Getting the Stack Code Working

Looking at the stack code, something became obvious almost immediately. If my observation was correct, the processing of the EndMarkdownToken was never being invoked. A couple of quick debug statements verified that observation. Doing a bit more digging, I found out that I had set the class of the EndMarkdownToken to INLINE_BLOCK. This meant that at the start of the consistency check loop, the code:

        if current_token.token_class == MarkdownTokenClass.INLINE_BLOCK:

would prevent the EndMarkdownToken from ever getting through. A quick fix changed that statement to:

        if (
            current_token.token_class == MarkdownTokenClass.INLINE_BLOCK
            and not isinstance(current_token, EndMarkdownToken)
        ):

and the tests all ran… er… not fine? With echoes of “what now?” ringing in my ears, and after some more debugging, the answer took a while to find. Using extra debug statements, I was able to pull some additional information out of the tests. It became apparent that the second half of the function had many issues with those end tokens. However, addressing that issue was easy: I added the following code:

        if isinstance(current_token, EndMarkdownToken):
            continue

after the stack code and before the remaining code. A couple of extra test runs, both with and without debug statements, and… everything was working with the new stack code. On to the next issue!
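
To make the ordering concrete, the top of the check loop now looked roughly like this (a structural sketch of my understanding, not the verbatim project code):

    for current_token in actual_tokens:
        # First, maintain the container block stack, as shown earlier.
        # Then, skip end tokens before any of the positional checks run.
        if isinstance(current_token, EndMarkdownToken):
            continue
        # ... the remaining consistency processing ...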

Removing Block Quotes

With the stack code in place and running cleanly, I needed to add code to prevent consistency checking for block quotes. As previously decided, block quotes were outside the scope of the current enhancement, and as such the following code was added to check for them:

            found_block_quote = False
            for i in container_block_stack:
                if i.token_name == MarkdownToken.token_block_quote:
                    found_block_quote = True
            if found_block_quote:
                last_token = current_token
                continue

While this code may seem flawed, it was written this way on purpose. I did not want to have to validate any tokens that were contained within a block quote block, but I did not want to remove the validation of the non-contained tokens either. The above block of code simply checks to see if there is a block quote on the container block stack, skipping over any token while that condition is true. This allows for any block quote contained tokens to be avoided, while still validating any other tokens present in that test sample.
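
As a side note, the same scan could be expressed more compactly with Python's built-in any(), with behavior identical to the loop above:

            # Equivalent, more compact form of the scan above.
            found_block_quote = any(
                i.token_name == MarkdownToken.token_block_quote
                for i in container_block_stack
            )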

Ignoring Tabs for Now

The other big thing to take care of was to ignore any tabs that occurred in the initial whitespace for any of the block tokens. This was accomplished by changing the return value of __calc_initial_whitespace to be a pair of values, the first being the same indent_level as before and the second being a new had_tab variable. By setting this variable, it was then possible to change this assert:

            assert current_position.index_number == 1 + init_ws

to:

            if not had_tab:
                assert current_position.index_number == 1 + init_ws

As I knew tabs were going to be a problem until they were properly and thoroughly addressed, this change circumvented the assert statement when a tab was encountered.
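
The project's actual __calc_initial_whitespace function special-cases a number of token types, but a simplified sketch of the shape of this change, assuming the token exposes its extracted_whitespace, might look like:

def __calc_initial_whitespace(calc_token):
    # Simplified assumption: the real function handles more token types.
    indent_level = len(calc_token.extracted_whitespace)
    had_tab = "\t" in calc_token.extracted_whitespace
    return indent_level, had_tab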

Adding the Basic Enhancement

With those changes in place and the tests passing, it felt like the right time to add the code for the planned enhancement. The basic part of this change was easy: adding an else clause to the existing same-line-number check, as such:

            if last_position.line_number == current_position.line_number:
                assert last_token.token_class == MarkdownTokenClass.CONTAINER_BLOCK
                assert current_position.index_number > last_position.index_number
            else:
                init_ws, had_tab = __calc_initial_whitespace(current_token)
                if not had_tab:
                    assert current_position.index_number == 1 + init_ws

This part of the algorithm makes sense. Leaf blocks always start on a new line, except when they are started on the same line as a container block. As such, the token starts right after any leading whitespace. Since this enhancement focuses solely on leaf block tokens on a new line, and due to the previous work to ignore any tabs in the consistency checks, this code was kept simple.
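
For instance (my own example, not a specific scenario test), consider a paragraph line indented with three spaces:

   a simple paragraph

Here __calc_initial_whitespace yields an init_ws of 3, so the paragraph token must report an index_number of 1 + 3 == 4, one column past the leading whitespace.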

Testing this code out in my head, it was all sound except for when the token was in a container block. While block quote blocks were excluded, that still left list blocks. As such, when I ran the tests this time, I expected failures to occur with leaf blocks that start on their own line but are contained within a list block. As I went through the test run failures, it was affirming to see that each of the failures now showing up was squarely within those parameters.

Going Through the Failures

As I went through the failures from the last set of code changes, I scribbled down notes about the failure patterns that I saw. The big pattern that I observed was, I felt, a very obvious one. When a leaf block was created within a list block, the reported column number was always one more than the indent_level variable of the list token. That pattern makes sense, as the list block contains the new leaf block, positioning it based on the indent level of the list.
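
As a concrete illustration (my reconstruction, not a specific failing test), consider:

- item text
  a contained paragraph

The unordered list start token carries an indent_level of 2, and the paragraph token on the second line reports column 3, one more than that indent level.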

The first pass at modifying the check to take this into account was the following:

                top_block = container_block_stack[-1]
                if top_block.token_name in (
                    MarkdownToken.token_unordered_list_start,
                    MarkdownToken.token_ordered_list_start,
                    MarkdownToken.token_new_list_item,
                ):
                    init_ws += top_block.indent_level

This code almost worked but reported errors with the first line. That was quickly addressed with a simple change, checking to make sure that the container_block_stack variable is not empty:

                if container_block_stack:
                    top_block = container_block_stack[-1]
                    if top_block.token_name in (
                        MarkdownToken.token_unordered_list_start,
                        MarkdownToken.token_ordered_list_start,
                        MarkdownToken.token_new_list_item,
                    ):
                        init_ws += top_block.indent_level

Dealing with Consistency and Lists

At this point, the only tests that were failing were tests related to list blocks. Most of those tests were contained in either the test_markdown_list_blocks.py module or the test_markdown_lists.py module. Doing a bit more digging, the reason for a lot of the failures was visible within seconds of looking at the failing scenario tests. To make sure that lists and their inevitable sublists can be controlled properly, the indent_level variable contains the required indentation from the beginning of the line, not from the end of the last list block. To complement this, any whitespace that occurs before that list block is saved within the list block token. The effect of this is that both ordered and unordered list start tokens did not require any further modification, as they already contain all the required information within their own token.
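
To make that concrete (again, my own example), take a simple sublist:

- outer
  - inner

The inner list start token's indent_level is 4, measured from the beginning of the line rather than from the end of the outer list's indent, and the two spaces before its - marker are saved as that token's extracted whitespace. With both pieces of information in the token itself, no adjustment from the stack was needed.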

To address this, the check for adjusting the calculation if the token is in a list was changed to:

                if container_block_stack and current_token.token_name not in (
                    MarkdownToken.token_unordered_list_start,
                    MarkdownToken.token_ordered_list_start,
                ):

Furthermore, on additional examination, blank lines were in a similar situation. Blank lines contain complete information on how they were derived, so no manipulation of the token’s column number was required. Hence, another slight modification of the check resulted in the code:

                if container_block_stack and current_token.token_name not in (
                    MarkdownToken.token_blank_line,
                    MarkdownToken.token_unordered_list_start,
                    MarkdownToken.token_ordered_list_start,
                ):
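
Assembled, the new-line branch of the check now looked roughly like this (my reconstruction of the combined pieces, not the verbatim commit):

            else:
                init_ws, had_tab = __calc_initial_whitespace(current_token)
                if container_block_stack and current_token.token_name not in (
                    MarkdownToken.token_blank_line,
                    MarkdownToken.token_unordered_list_start,
                    MarkdownToken.token_ordered_list_start,
                ):
                    top_block = container_block_stack[-1]
                    if top_block.token_name in (
                        MarkdownToken.token_unordered_list_start,
                        MarkdownToken.token_ordered_list_start,
                        MarkdownToken.token_new_list_item,
                    ):
                        init_ws += top_block.indent_level
                if not had_tab:
                    assert current_position.index_number == 1 + init_ws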

Almost There

Running the tests again, the number of failures remaining was now a much smaller set of tests. Once again doing my due diligence, I discovered an interesting issue with the new list item token: I had left an important field out of the token. Unlike the ordered and unordered list start tokens, the new list item token contained only the indent_level field, with the extracted_whitespace field being added to try to solve the failures. But without the proper values for the extracted_whitespace field, it was not a proper solution to the problem. I even made modifications to the container block stack to properly deal with the new list tokens. But even with those two changes in place, it just was not enough to solve all the issues.

Once I determined that it was not enough, I knew that I had to fix this properly at a later date and close out this enhancement. To accomplish that, I made a small change to the code as part of the main consistency check:

            elif current_token.token_name == MarkdownToken.token_new_list_item:
                # TODO later
                pass

I did not feel that it was a great solution, but it was a decent stop-gap until I could address those tokens.

The Inevitable Bug Fixes

Along the way, there were some small issues uncovered in the parser code, fixed almost immediately due to their small nature. There was a small fix to the list_in_process function to make sure that the line was adjusted properly in some edge cases. There was another fix to the __adjust_for_list_start function to make sure that it returned an indication of whether it processed the list start index. And finally, even though tabs were not part of this enhancement, there was a bit of cleanup to do in the dual-purpose code used to handle list indents and block quotes. None of these were big bugs, just little issues that the consistency checker was uncovering, dealt with up front before they became larger ones.

And that is where I stopped. I knew that if I started trying to get more of the tab code to work, it would be another rabbit hole. Reminding myself of the big picture and the current goals for the enhancement, I backed off. And after a couple of minutes, I felt good about it. Tabs would wait for another enhancement, and then I would spend a good amount of time working on tabs. But for now, it could wait.

Importance of Starting Small

I needed this enhancement to be a win, and I think I achieved that by keeping my goals in mind. Rather than trying to do everything and having my effort be scattered, I kept my work focused within the boundaries of a simple goal, one that I was confident I would be able to achieve. With this progress, I felt more confident about myself and the project, and I knew I was on my way back to a solid and stable foundation that I could build further on.

While my previous experience guided me towards this approach as I needed a win, it is a solid approach regardless. One of my managers at a previous company used to say:

It is better to under promise and over perform than the other way around.

I prefer another quote that was relayed to me by a friend at my current company:

Having a conversation with someone about resetting expectations is always troubling for me. I would rather my reports take small bites of work and get better estimates on those bites of work, than take huge chunks of work and get lost and need help finding their way.

Those words have stayed with me for a long while, and I have tried to live up to the essence of that quote. I cannot tell anyone when the PyMarkdown linter is going to be ready. The scope is big enough that I cannot accurately judge that span of time. But I can let people know what my next 3 items on the issue list are, and approximately how long it will take to complete them. Those items are my small bites, something I can solidly analyze and realistically figure out a time frame for their completion.

For me, I feel that it is better to complete 10 tasks that work together than to complete 1 task that will not work until it is completely done. There just is no comparison in my mind. And having just completed one of those smaller tasks, my confidence was returning.

What Was My Experience So Far?

After my experience with the rabbit hole called Tabs, I was worried that I was going to focus more on the negative aspects of the project, rather than making some good progress. Would I repeat the behavior that I documented in my last article, or would I correct that behavior and move the project forward? While the enhancements I made to the consistency checking were small, I was convinced that they were solid enhancements. That was exactly the type of win that I needed.

This enhancement was not about technical feasibility, it was about getting myself out of a negative headspace. I believe it is true of most people that we are our own worst enemies, as we are privy to those thoughts and emotions that we keep private. I know that I am my own worst enemy for those reasons. And while I did try hard with the last enhancement, I know I was harder on myself than others would have been. Not for going down the rabbit hole, but because I did not notice that I had slipped down there and needed to get out before getting lost.

I started this enhancement trying to get rid of the “oh no, here we go again” mindset. Based on where my confidence was on finishing this enhancement, I am proud to report that a lot of positive work was done. My confidence was by no means back to where it should be, but I was on my way there. I developed a plan, set certain limitations, and stuck by those limitations while delivering a well-thought-out enhancement.

This was definitely a step in the right direction.

What Is Next?

Having added a few issues to my issues log, mostly to verify certain scenarios I tested or wrote about, I knew that I had some technical debt building up. Along with the new list item token issue, I felt it was a good time to build up some more confidence by knocking some more of those issues out of the way.
