In my last article, I talked about my work on the PyMarkdown project to solidify how it manages whitespace. In this article, I talk about how inspiration struck me for this article… and when.

To Get Things Started

First off, I want to be honest with any readers on the timing of this article. It is Tuesday afternoon and I had almost given up on authoring an article this week. At this point in most weeks, I have either finished the article and it is posted, or I am giving it a “24-hour” no-read period to make sure it still reads like I want it to. Basically, it is either posted or within two or three small edits of being posted. Call me old fashioned, but I always feel better knowing that I can go back to a document 24 hours after I have finished writing it and still have it sound the way I wanted it to.

This week was different. With Labor Day here in North America, I spent part of the day doing yard work and part of the day working on the whitespace handling that I mentioned last week. By Monday evening I am usually polishing the article up if it is not already finished. But this week I had nothing unique that I wanted to say. Sure, I could have put together a low-quality article that said what I was doing, but it would have been a status report more than an article.

So, it was interesting to me that I got the inspiration for this article when I sat down to do some more work on Tuesday evening. Inspiration is weird.

The Journey Here

To understand the inspiration, I need to explain how I got there. The first part is the easy part: hard work. In the last week, I have resolved most of the whitespace issues that I documented prior to last week. The first skipped test and the skipped test with tabs and code spans are the only two skipped tests that are left to fix. Through hard work and imagination, I managed to crack through and correct the other thirty-one scenario tests and get them working.

For the most part, that hard work was correcting one thing or another. A good example of that was coming up with better names for the various helper functions that I used to extract whitespace. To be blunt, my naming sucked. Granted, it sucked after learning more about the three types of whitespace, but it still sucked. Instead of a generic whitespace in function names, I created variants of those functions that used space, ascii_whitespace, or unicode_whitespace properly. That took a while to carry out, but it felt good to have those functions with names that reflected my new understanding. And as it felt like I was changing the entire code base, I also refactored a handful of large functions while I was there. “Always leave an area better than when you found it” is a good motto of mine.
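A minimal sketch of what that naming split might look like follows. The function names and the exact character sets are assumptions made for illustration (loosely following the CommonMark definitions of the three kinds of whitespace), not the actual PyMarkdown helpers.

```python
import unicodedata
from typing import Callable, Tuple

# Assumed definition of ASCII whitespace: space, tab, newline,
# line tabulation, form feed, and carriage return.
ASCII_WHITESPACE = " \t\n\x0b\x0c\r"


def _extract_while(
    text: str, start: int, matches: Callable[[str], bool]
) -> Tuple[int, str]:
    """Collect characters from `start` while `matches` holds.

    Returns the index after the collected run and the run itself.
    """
    end = start
    while end < len(text) and matches(text[end]):
        end += 1
    return end, text[start:end]


def extract_spaces(text: str, start: int) -> Tuple[int, str]:
    """Collect only U+0020 space characters."""
    return _extract_while(text, start, lambda ch: ch == " ")


def extract_ascii_whitespace(text: str, start: int) -> Tuple[int, str]:
    """Collect any ASCII whitespace characters."""
    return _extract_while(text, start, lambda ch: ch in ASCII_WHITESPACE)


def extract_unicode_whitespace(text: str, start: int) -> Tuple[int, str]:
    """Collect tab, newline, form feed, carriage return, or any Zs character."""
    return _extract_while(
        text,
        start,
        lambda ch: ch in "\t\n\x0c\r" or unicodedata.category(ch) == "Zs",
    )
```

With names like these, a call site states which kind of whitespace it expects, so a spaces-only helper is harder to mistake for a general whitespace helper.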

To that end, I just bore down and started working through each of the thirty-three skipped scenarios, from the bottom to the top. Why? Because I did not want to get bored, and I figured starting from the bottom would help me not get bored. Basically, for no reason other than to do something different. After all, a list of thirty-three failed tests is the same to me whether I start at the bottom or the top: they all need to be fixed.

For the most part, the last thirty (or first thirty, depending on viewpoint) scenario tests were relatively easy to fix. In some cases, I had skipped the test because I had “thought” it might be bad. In those cases, I just made sure to verify the test data and everything worked properly. There were other cases where I had collected the expected output HTML improperly, and I just went ahead and fixed those.

That left the significant issues for me to fix. The next batch of issues that I dealt with involved removing whitespace from the start and end of paragraphs. Originally, I was removing only spaces, but the specification clearly says:

The paragraph’s raw content is formed by concatenating the lines and removing initial and final whitespace.

It took a little while to figure out how to resolve that, but I was able to do it with a minimal amount of code changes. That was a good find.
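The difference between stripping spaces and stripping whitespace is easy to see in a small sketch. The function names here are hypothetical, and the whitespace set is an assumption based on the ASCII whitespace characters. This is not the PyMarkdown implementation.

```python
from typing import List

# Assumed set of whitespace characters to strip.
ASCII_WHITESPACE = " \t\n\x0b\x0c\r"


def raw_content_spaces_only(lines: List[str]) -> str:
    """The earlier behavior: only space characters are removed."""
    return "\n".join(lines).strip(" ")


def raw_content_whitespace(lines: List[str]) -> str:
    """Closer to the specification text: initial and final whitespace is removed."""
    return "\n".join(lines).strip(ASCII_WHITESPACE)


lines = [" \taaa", "bbb \t"]
print(repr(raw_content_spaces_only(lines)))  # the tabs survive at both ends
print(repr(raw_content_whitespace(lines)))   # both ends are clean
```

In the spaces-only version, the leading tab stops the strip after the first space, and the trailing tab stops it before anything is removed at all.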

The next issue that I tackled was properly dealing with whitespace at the end of the closing fence of a fenced code block. While I was producing the right HTML, I was not storing that whitespace in the end token. As a result, when I tried to rehydrate the Markdown in function test_whitespaces_fenced_code_closed_with_spaces_after, there was no space to rehydrate with. That was a bit more of a chore, requiring me to add a new field to the end fenced code block token to store that whitespace. The downside was that there was already an optional field at the end of that token, so it made more sense to add the new field before the serialization of that optional field. This affected every end fenced code block token in the scenario tests, but it was a quick search-and-replace to fix them all. Even so, let me just say I stopped counting the instances after the first fifty and leave it at that.
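To illustrate why the new field went before the optional one, here is a small sketch. The class, its field names, and the serialization format are all invented for this example and do not reflect the real PyMarkdown token.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndFencedCodeToken:
    """Hypothetical stand-in for an end fenced code block token."""

    extracted_whitespace: str         # whitespace before the closing fence
    trailing_whitespace: str          # new field: whitespace after the closing fence
    extra_data: Optional[str] = None  # existing optional field, serialized last

    def serialize(self) -> str:
        # Required fields always occupy the same positions. The optional
        # field, when present, is the only thing that can follow them.
        fields = [self.extracted_whitespace, self.trailing_whitespace]
        if self.extra_data is not None:
            fields.append(self.extra_data)
        return "[end-fcode-block:" + ":".join(fields) + "]"


print(EndFencedCodeToken("", "  ").serialize())
print(EndFencedCodeToken("", "  ", "x").serialize())
```

Keeping the optional field last means a reader of the serialized form can always find the required fields by position. The price is that every expected token string in the tests gains one more field.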

Having dealt with the issue of bad whitespace storage in the end fenced code block token, I thought it was only fitting to look at a similar issue with not properly stripping whitespace at the end of the fenced code block’s info string construct. As with paragraphs, the specification has a quote for info strings:

The line with the opening code fence may optionally contain some text following the code fence; this is trimmed of leading and trailing whitespace and called the info string.

In this instance, the start of the info string element was being trimmed properly, but the end of that same element was only handling spaces and not whitespace. Most of the work there was just handling the extra whitespace characters, but it did take some time to work out.

Basically, up to this point everything I did to fix those issues required some thinking or some changes, but nothing major. Just me applying solid debugging practices to every test, putting in the time and getting it working. By far, the costliest fix was having to do the search-and-replace for the end fenced code block token, and that was only tedious, not difficult. But that would change on Monday.

And Then Came Monday

With everyone caught up, that brings us to Monday. At the start of Monday, I had three scenario tests that I needed to get working. One dealt with properly recognizing a hard break, one dealt with properly handling spaces inside of a code span, and the last was the very first scenario test in the file, the one I have not even looked at yet.

This was the place that I needed any inspiration I could find. This was an interesting problem to solve and not an easy one. To make things easier for the project, I replaced any tab characters with the proper count of space characters. The downside to that approach was that the parser loses any context of where the original tab characters were. I knew I did not have an idea on how to fix it, so I decided to do something else for a while. Enter yardwork.
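As a sketch of why that context gets lost, Python's built-in str.expandtabs can stand in for the replacement step. This assumes tab stops every four columns, as the CommonMark specification describes, and it is not necessarily how PyMarkdown performs the replacement.

```python
TAB_SIZE = 4  # assumption: tab stops every four columns


def detabify(line: str) -> str:
    """Replace each tab with the spaces needed to reach the next tab stop."""
    return line.expandtabs(TAB_SIZE)


with_tab = "a\tb"
with_spaces = "a   b"

# Two different source lines become identical once detabified, so the
# parser can no longer tell which one the author actually wrote.
print(repr(detabify(with_tab)))
print(repr(detabify(with_spaces)))
print(detabify(with_tab) == detabify(with_spaces))
```

Once both lines collapse to the same string, any rule that distinguishes tabs from spaces has nothing left to work with.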

It might sound counter-intuitive, but I find the best way to solve problems and get inspiration is to do something that is not connected to the problem you are trying to solve. In this case, it was a section of our yard that I decided to not weed, only to change my mind later. Just freely thinking about things and staring out into the blue sky helped me reset my mind. That and clearing about fifty square feet of backyard from weeds and filling up our yard waste bin to the top.

When I came inside, I looked at the problem again and came up with a solution almost immediately. I knew that the first thing that I needed to do was to pass the original string to the lower levels of the parser, to allow the parser to tell what the original whitespace was. That was easy enough on its own, but it was not enough. That change enabled me to properly deal with container tokens and leaf tokens, but not the inline tokens. For those tokens, I quickly figured out that I needed to add an optional tabified_text field to the text token, one that would only be populated for original lines that had tab characters in them.

As a proof of concept, I worked on the hard break issue and got it working properly within a couple of hours. To be fair, most of that time was getting the tabified_text field working properly. Once that was done, I looked back at the specification where it says:

A line break (not in a code span or HTML tag) that is preceded by two or more spaces and does not occur at the end of a block is parsed as a hard line break

But in the scenario test where a line in a paragraph ended with two tab characters, the parser saw more than two space characters in the detabified line and, rightly by its own logic, decided that there should be a hard break. With the new tabified_text field, I was able to change the parser to notice that the original line did not end with two space characters, returning the proper HTML. Once I was inspired to put the right data in the text token, the rest of the solution almost wrote itself.
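A rough sketch of that check follows. The TextToken class and both helper functions are simplified, hypothetical stand-ins. Only the tabified_text name comes from the description above, and four-column tab stops are assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextToken:
    """Simplified, hypothetical text token."""

    text: str                            # detabified text
    tabified_text: Optional[str] = None  # original text, set only if it had tabs


def make_text_token(original_line: str) -> TextToken:
    """Build a token, keeping the original line only when it contains a tab."""
    detabified = original_line.expandtabs(4)
    has_tab = "\t" in original_line
    return TextToken(detabified, original_line if has_tab else None)


def ends_with_hard_break_spaces(token: TextToken) -> bool:
    """Check for two trailing spaces against the original text when available."""
    source = token.tabified_text if token.tabified_text is not None else token.text
    return source.endswith("  ")


tabbed = make_text_token("foo\t\t")
spaced = make_text_token("foo  ")
print(tabbed.text.endswith("  "))           # the detabified text looks like a hard break
print(ends_with_hard_break_spaces(tabbed))  # the original text says otherwise
print(ends_with_hard_break_spaces(spaced))
```

Leaving the field empty for lines without tabs keeps the common case unchanged. Only lines that actually contained a tab carry the extra data.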

And Then It Hit Me

Once again, inspiration comes at weird moments in our lives. Having given up on authoring an article this week, I finished some work downstairs and was taking a quick shower before starting work on the project for the evening. As per my routine, I mentally went over what I did yesterday and what I planned to do today. Just my usual routine.

And then a weird thought came into my head. Instead of using the same backwards tab lookup concept to fix one or two small issues, what if I used the same approach to fix all the tab issues?

To be clear, after the raw line is passed into the Container Block Processor to look for blocks, one of the first things that is done is to translate the line into a detabified format. From that point on, the line information is passed without any tab characters. But with the original string now available, other options opened. It was then that the ideas started flowing.

The first one was a simple one: what if I applied the same solution to other elements? If I had the original line within reach, I should be able to match things up and figure out which space characters belonged to each tab.
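One way that matching could work is to walk the original line once and record which columns of the detabified line each original character produced. This is only a sketch of the idea, with a hypothetical function name and an assumed tab stop of four columns.

```python
from typing import List, Tuple


def map_original_to_columns(
    original_line: str, tab_size: int = 4
) -> Tuple[str, List[Tuple[int, int]]]:
    """Detabify a line and record the column span of each original character.

    spans[i] is the (start, end) column range in the detabified line that
    came from original_line[i]; a tab may cover several columns.
    """
    column = 0
    pieces: List[str] = []
    spans: List[Tuple[int, int]] = []
    for ch in original_line:
        if ch == "\t":
            width = tab_size - (column % tab_size)  # distance to next tab stop
            pieces.append(" " * width)
        else:
            width = 1
            pieces.append(ch)
        spans.append((column, column + width))
        column += width
    return "".join(pieces), spans


detabified, spans = map_original_to_columns("a\tb")
print(repr(detabified))
print(spans)
```

Any span wider than one column marks a run of spaces that was a single tab in the original line.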

The second idea was a more radical one. What if I only looked at the leading whitespace and translated any tabs in that leading whitespace to space characters? For the most part, the translated tab characters are only beneficial for indentation purposes, so it might work.
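A minimal sketch of that second idea, assuming four-column tab stops and using a hypothetical function name, could look like this:

```python
def detabify_leading_whitespace(line: str, tab_size: int = 4) -> str:
    """Expand tabs in the leading whitespace only; leave later tabs intact."""
    remainder = line.lstrip(" \t")
    leading = line[: len(line) - len(remainder)]
    # The leading run starts at column 0, so expandtabs computes the same
    # tab stops it would for the full line.
    return leading.expandtabs(tab_size) + remainder


print(repr(detabify_leading_whitespace("\t- a\tb")))
print(repr(detabify_leading_whitespace(" \tfoo\t")))
```

The indentation logic still sees only spaces, while tabs later in the line reach the inline processing untouched.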

Where To Go From Here?

To be honest, I am not sure which choice I am going to take. But that is the beauty of inspiration! Not only do I have one possible solution, but that solution inspired me to think of the other solution. And not only do I have two possibly viable solutions to my tab issues, but I had good inspiration for an article!

For me, the bonus is that even in my fifties, I am finding new ways of exploring and analyzing that I previously thought were dead ends. This is one of the reasons that I am starting to cultivate other hobbies that deal less and less with computers, video, and audio, and to root myself more in the real world. Not only is it better for me, but it allows me to be inspired more often. And that is always a good thing! After all, you never know what weird thoughts and weird moments in your life will inspire you!
