Making Progress with Tabs

Summary¶

In my last article, I talked about how I am making strides to get any remaining whitespace issues dealt with, namely tab characters. In this week’s article, I will talk about the progress I have made in doing just that.

Before We Start¶

I apologize to any readers out there, but I was unable to author an article last week. Due to a small accident and family commitments, I had some bad headaches and little energy left after dealing with both of those things. It does not happen too often, but I just did not have any mental space to write anything publishable. I hope that I can do better this week!

History With Tabs¶

As I mentioned in the last article, when I started with the PyMarkdown project a few years ago, I needed to decide how to manage tab characters. At the time, I thought that the right decision was to deal only with space characters instead of both tab characters and space characters. Following that belief, I converted all tabs to the correct number of spaces, following Markdown’s “tabs as tab stops” rule. However, looking back at my project requirements, I decided that to be faithful to those requirements, I needed to add tab character support back in.

And despite having to take it easy for the better part of a week due to my accident, things are going ahead nicely. As of last night, all elements have support for tab characters except for lists, block quotes, and link reference definitions. And while the progress has not been steady, it continues forward at a good pace.

Taking A Short Side Trip¶

Last week I took a couple of days and tried out a “what if” scenario. The question that I wanted to play out was whether I was correct in fixing tab characters after parsing. Instead, what if I had started to go through the code and expected the code to manage both space characters and tab characters? I wanted it to be more than a thought experiment. I wanted to try it out and see what the cost would be.

While the first results were positive, the rest of the results went downhill quickly. One thing that I found was that the initial transformation brought to the parser was a clear standard on the length of certain elements. As the transformation was performed at the start of the line, there was no need to track the number of characters that occurred before the start of the line… it was always zero. Since Markdown uses tab stops, knowing the start position for interpreting a tab character is essential. Having to track that through multiple functions was just messy, and I got it wrong multiple times. That was easily one headache that I dodged.
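To make the dependence on the start position concrete, here is a minimal sketch of tab-stop expansion. It is not the PyMarkdown implementation; the function name `expand_tabs` and its parameters are hypothetical, and it assumes the usual Markdown tab stop of four columns.

```python
def expand_tabs(text: str, start_column: int = 0, tab_size: int = 4) -> str:
    """Replace each tab with enough spaces to reach the next tab stop.

    `start_column` is the zero-based column at which `text` begins; this is
    the value that must be tracked if expansion is not done on whole lines.
    """
    pieces = []
    column = start_column
    for char in text:
        if char == "\t":
            # Distance from the current column to the next multiple of tab_size.
            width = tab_size - (column % tab_size)
            pieces.append(" " * width)
            column += width
        else:
            pieces.append(char)
            column += 1
    return "".join(pieces)


# The same tab produces a different number of spaces depending on where it starts.
at_line_start = expand_tabs("\ty")                 # tab begins at column 0
after_two_chars = expand_tabs("\ty", start_column=2)  # tab begins at column 2
```

When the whole line is expanded at once, `start_column` is always zero, which is the simplification described above. For that whole-line case, Python's built-in `str.expandtabs(4)` gives the same result.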

Another thing that was an issue was interpreting tab stops on the fly. Not to complain too much, but it was painful. Instead of seeing the correct number of space characters from an algorithm that I knew was correct, I had to figure out the start of the string each time. If I was off by one, the length of the string would not be correct. And I realize that it may sound like whining, but looking at two strings with tabs in each and mentally trying to compare them was simply hard. When going through this experiment, I found that when I was looking at parts of the code, I was focusing more on the tab expansions than the code itself. It just felt off.

And finally, perhaps from a biased viewpoint, my confidence was down as I was playing out this “what if” scenario. It can very well be that I have grown reliant on the fact that the string x\ty will get translated into x   y (with three space characters between the x and the y), if it starts at the beginning of the line. To me, it just looks cleaner. And when I am verifying the code or debugging the code, I want clean and easy to read. When I look at the translation algorithm and the output from that algorithm, I have a tremendous amount of confidence that I translated the tab characters properly. When I looked at the same types of scenarios in the debugger with an untranslated tab character, my confidence was noticeably lower.

Was it the right choice for everyone? Possibly not. But as I am the maintainer and developer for the project, what is important is that it was the right choice for me. And now I have that confidence that I made the right choice.

Translating Tabs Back¶

Except for Link Reference Definitions, all other Inline elements and Leaf Block elements are done. Some were a bit trickier than others, but they are all done. In all cases, that work involved finding where each index was in the “normal” line and matching it up to the same index in the “tabified” line. With those anchors in place, it was just a matter of replacing a part of the normal line with the same part from the tabified line, and things were done.
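One way to picture that index matching is the sketch below. It is an illustration only, not the code in PyMarkdown: the helper names `build_index_map` and `restore_tabs` are hypothetical, and it assumes a tab stop of four columns with the line starting at column zero.

```python
from typing import List


def build_index_map(tabified: str, tab_size: int = 4) -> List[int]:
    """For every index in the expanded ("normal") line, record the index of
    the character in the tabified line that produced it."""
    index_map: List[int] = []
    column = 0
    for source_index, char in enumerate(tabified):
        width = tab_size - (column % tab_size) if char == "\t" else 1
        # A tab that expands to N spaces contributes N entries, all pointing
        # back at the same tab character.
        index_map.extend([source_index] * width)
        column += width
    return index_map


def restore_tabs(tabified: str, start: int, end: int, tab_size: int = 4) -> str:
    """Return the tabified text matching the span [start, end) of the normal line.

    Assumption: a span that begins or ends partway through an expanded tab
    gets the whole tab character back.
    """
    if start >= end:
        return ""
    index_map = build_index_map(tabified, tab_size)
    return tabified[index_map[start] : index_map[end - 1] + 1]


tabified_line = "a\tbc\td"
normal_line = tabified_line.expandtabs(4)       # "a   bc  d"
tail = restore_tabs(tabified_line, 4, 9)        # span covering "bc  d"
```

The assumption about partial tabs is exactly the kind of decision that becomes harder inside containers, where one expanded tab may be split between two tokens.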

It took a bit to translate each element, including extra test scenarios for most elements, but it was worth it.

What Happened With Indented Code Blocks?¶

Having completed Indented Code Blocks, I can say that their translation was only a slight bit trickier than the other elements. Nothing much to report there. What I thought was going to be an issue with tab characters and Indented Code Block elements was about housing them inside of containers. But more on that later.

What About Link Reference Definitions?¶

The next thing that I am going to work on, starting tomorrow, is Link Reference Definitions. The difficult part with these elements is their need for rewinding. As I covered in other articles, a parser can only tell if a Link Reference Definition is complete by one of two methods: either a full Link Reference Definition is parsed, or the next line is examined and found not to be a continuation, in which case a rewind mechanism is needed to manage rewinding the input. While there are other ways to manage that last method, they are not guaranteed to work and have their own issues. As painful as rewinding is, it is the most dependable.
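As a rough illustration of the rewinding idea, here is a toy sketch built on a queue of lines. It is not how PyMarkdown handles Link Reference Definitions; the function `parse_with_rewind` and the two rule functions are hypothetical, and the rules are far simpler than the real grammar (a label line followed by exactly one non-blank line).

```python
from collections import deque
from typing import Callable, Deque, List, Optional


def parse_with_rewind(
    lines: Deque[str],
    accepts: Callable[[List[str], str], bool],
    is_valid: Callable[[List[str]], bool],
) -> Optional[List[str]]:
    """Consume lines while they continue the construct; rewind if it fails."""
    consumed: List[str] = []
    # Peek at the next line before committing to it.
    while lines and accepts(consumed, lines[0]):
        consumed.append(lines.popleft())
    if is_valid(consumed):
        return consumed
    # Rewind: put the consumed lines back at the front, in their original order.
    lines.extendleft(reversed(consumed))
    return None


def accepts(consumed: List[str], line: str) -> bool:
    """Toy rule: a "[label]:" line, then one non-blank line."""
    if not consumed:
        return line.startswith("[") and line.endswith("]:")
    return len(consumed) == 1 and line.strip() != ""


def is_valid(consumed: List[str]) -> bool:
    """Toy rule: the construct is complete only with both lines present."""
    return len(consumed) == 2


good = deque(["[foo]:", "/url", "text"])
bad = deque(["[foo]:", "", "text"])
good_result = parse_with_rewind(good, accepts, is_valid)  # both lines consumed
bad_result = parse_with_rewind(bad, accepts, is_valid)    # rewound, queue restored
```

In the failing case the queue ends up exactly as it started, so the caller can reprocess the same lines as some other element.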

When I added the current Link Reference Definition support to the parser, I did not have to worry about reverting to an original form of the string. As such, I added code to kind of handle it, but that code was not thoroughly tested. Well… it is now going to get tested. I know I need to place the tabified string in the queue to get reprocessed, but I am just not sure how I will carry that out. I either need to adjust the normal string or have another queue to handle a special tabified string.

But as that is the first task for this week, I am not worried about it. Yet.

What About Lists and Block Quotes?¶

This is where the fun comes in. As I showed in last week’s article, changing tabs into spaces and back can be tricky in some situations. From last week, consider the Markdown snippets:

- first line
  - second line
\t- third line

and

- first line
\tsecond line

From a translation to HTML and understanding point of view, both snippets are easy to parse. In the first snippet, the tab expands to four characters of prefix, effectively creating a triply nested Unordered List block, with a single item having the text third line. The second snippet is similar but creates a singly nested Unordered List with a single item having the text first line\nsecond line.

But there is more to see when examined from a tokenization point of view. The tab that is translated in the first snippet is translated to four space characters that are exclusively in the list item’s prefix area. While that might cause some issues spanning two List Blocks, let me table that observation. The real problem appears when generating tokens for the second snippet. The tab character is at the same column as in the first snippet, so it expands to four space characters.

The real question is how are those space characters distributed between the Unordered List token of the owning list and the Text token containing the line’s text?

Working Out The Kinks¶

Strangely, it took some experimentation with Indented Code Blocks to figure this out.

Note, for simplicity’s sake, in the following snippets I am representing space characters as \b and tab characters as \t. Otherwise, it would be hard to see the differences!

Using the Babelmark tool, I submitted this snippet:

\t\tfoo

and received back the following HTML:

<pre><code>\tfoo
</code></pre>

In the GitHub Flavored Markdown specification, there is a stated principle referred to as the principle of uniformity:

principle of uniformity: if a chunk of text has a certain meaning, it will continue to have the same meaning when put into a container block (such as a list item or blockquote).

Basically, if we take our first snippet and place it within a container block, the original block of text should retain its meaning. Therefore, we can submit this snippet:

-\b\t\tfoo

and observe its HTML output.

<ul>
<li>
<pre><code>\b\bfoo
</code></pre>
</li>
</ul>

Taking the container HTML code away, the principle is supported as the Indented Code Block is still there, but with a slight variation. That variation is that the entire indent was translated into space characters, totally removing any trace of the tab characters.

What Is Going On?¶

After doing my usual scribbling, here is what I believe is happening. Because of its location in column three, the first tab character is expanded to two characters and the second tab character is expanded to four characters, for a total of six characters. Four of those space characters are used for the Indented Code Block, which leaves the two space characters that are seen preceding the word foo.
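That arithmetic can be checked with a few lines of code. This is only a sketch of the accounting described above, not parser code; the helper name `tab_widths` is hypothetical and a tab stop of four columns is assumed.

```python
from typing import List


def tab_widths(line: str, tab_size: int = 4) -> List[int]:
    """Return the number of columns each tab character in `line` expands to."""
    widths: List[int] = []
    column = 0  # zero-based, so "column three" in the text is column 2 here
    for char in line:
        if char == "\t":
            width = tab_size - (column % tab_size)
            widths.append(width)
        else:
            width = 1
        column += width
    return widths


# The list snippet, written with a real space and real tabs.
widths = tab_widths("- \t\tfoo")
total_indent = sum(widths)

# Four columns go to the Indented Code Block; the rest stays in front of "foo".
leftover_spaces = total_indent - 4
```

The leftover count matches the two space characters in the HTML output from Babelmark.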

I repeated this experiment with other elements, number of tabs, and number of spaces, and that relationship seemed to always hold. From this, the reasonable conclusion is that to avoid any problems, any leading tab characters are always represented as spaces. This makes sense and this is backed up by experimental data.

Why is this important? Because it gives me a solid rule to follow when translating the spaces back into tab characters. And that helps me out when I need to come up with an algorithm for how to tokenize the tabs. I do not have that algorithm yet, but I will work on it after I deal with Link Reference Definitions, which are the next thing on my plate. Stay tuned!

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.
