In my last article, I talked about how I try my best to maintain focus on my various endeavors. In this article, I talk about how I am making strides to get the remaining whitespace issues dealt with, namely tab characters.
When going through the initial requirements for the PyMarkdown project in an article a couple of weeks ago, I was pleasantly surprised to find that I was closer to a final release than I thought I was. Specifically, it was comforting to find that I only have a little bit left to do with respect to whitespace handling: the handling of tab characters as whitespace.
Tabs Are Evil… No, Really
Just like many older software development professionals, I have a bit of a hate relationship with the tab character. It is nothing personal against the character itself. It is just that for the majority of my professional development life, I have had other developers imply that tab characters are evil.
I think that feeling may come from “ye olde religious wars” of years past. It was not uncommon in the 1990s and 2000s to hear conversations between developers extolling the virtues of “tabs as 2 spaces” versus “tabs as 4 spaces” versus “tabs as 8 spaces”. They were always conversations that I put into the category of conversations known as bike shedding. They were mostly trivial conversations with little hard evidence.
My point of view on concepts like software style has remained static over the years. I do not much care which reasonable style or coding guidelines a team picks, as long as they are clearly documented. Bonus points to any team that is using code analysis to ensure that every commit follows those guidelines. I just remember too many “deep” conversations on teams about tabs and how they converted into spaces to really care one way or the other. And like most developers, I just stopped using tab characters to avoid the noise of those conversations.
Markdown Takes A Different Tack
The people authoring the GitHub Flavored Markdown specification had probably also had some of the same experiences I had, so they created an entire section to detail how they want tabs to be handled. And instead of taking one of the approaches detailed in the last section, they changed it up a bit. From the specification:
Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.
Now, parts of that are somewhat vague, but they spelled out the important part: a tab character advances to the next tab stop, and the tab stops sit every four characters.
What is a tab stop? On old typewriters and word processors, there was the option to specify multiple locations on a given line that were destinations for the tab key. When the tab key was pressed, the position on the line would move to the next place that the tab key was told to stop. Hence the name, tab stop. When this functionality was moved into word processors, this same behavior was replicated to allow people to seamlessly translate their typing skills between typewriters and the word processors. As some of the more popular applications on the first personal computers involved creating documents, this was probably a good direction to take.
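As a minimal sketch of the tab stop idea (this is an illustration, not code from PyMarkdown, and the function name `expand_tabs` is made up for this example), each tab is replaced by however many spaces are needed to reach the next multiple of four columns. Python's built-in `str.expandtabs` does the same thing and would normally be used instead; the loop is written out only to show the arithmetic.

```python
def expand_tabs(line: str, tab_size: int = 4) -> str:
    """Replace each tab with enough spaces to reach the next tab stop."""
    result = []
    column = 0
    for char in line:
        if char == "\t":
            # Distance to the next multiple of tab_size: always 1..tab_size.
            width = tab_size - (column % tab_size)
            result.append(" " * width)
            column += width
        else:
            result.append(char)
            column += 1
    return "".join(result)


print(repr(expand_tabs("\tabc")))  # '    abc' (tab at column 0 fills 4 columns)
print(repr(expand_tabs("ab\tc")))  # 'ab  c'   (tab at column 2 fills only 2)
```

The second call is the interesting one: the same tab character produces a different number of spaces depending on where it sits on the line.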
Hopefully not sounding too ancient, I remember taking typing classes on electronic typewriters and loved that I was able to translate my typing skills to the new computers in the computer lab. While we used those computers to create programs during computer labs, we also found out that it was easier to type our essays and other assignments on those same computers. When I used a typewriter to type those same essays, I always needed a healthy supply of Liquid Paper on hand to help correct all the mistakes I made. Being able to use a computer to craft documents, like I am doing with this article, was a new concept… something that we now take for granted.
These days, tab stops are still in use. When authoring this article, I tested out the tab support with both Markdown files and Python files and learned that Visual Studio Code (VSCode) uses tab stops of four for both types of files. To be honest, I hoped that VSCode would support them for Markdown files, but was not sure. I was happily surprised at the tab stop support for Python files. From my own viewpoint, that one observation gave the authors of the GitHub Flavored Markdown specification extra credibility in their choice. It was still a pain to implement it, but that was on me, not them.
Taking A Short Cut
Now, almost three years after I started writing PyMarkdown, I have a better appreciation for the tab character and how it can be used creatively. But when I started writing the original implementation for PyMarkdown, having tab characters interpreted as a tab stop of four was a nuisance. From my developer point of view, anyone in their right mind would always use spaces instead of tabs. It was the quickest way to avoid the “tabs as…” holy wars after all.
As such, when I added support for tabs to PyMarkdown I used a shortcut. As I was convinced that any “sane” person would shy away from tab characters, I wanted to take the quickest path to supporting them. From my point of view, why add support for something most people are not going to use? As such, I added the support at the highest level of the tokenizer to replace tab characters with the proper space characters. All done and correct, right?
Fast forward to the current day with more experience and a deeper understanding of Markdown. Some readers may question why I started talking about my days with electric typewriters and word processors. But from my new point of view, I believe that thinking back to those days helped me understand why the specification has support for tab characters as tab stops.
That point of view is a simple one: replacing tabs with a constant number of space characters is how a developer thinks, but it is not how people usually think when writing documents. When people open word processors, they expect that word processor to support tab stops. However, most people will not think about it in that way. For the most part, if they use tabs in their documents, they will use those tabs to line up columns of information. If they need something fancy, they may try and insert tables and play with column widths and titles. But for something quite simple, tab characters come in very handy.
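To illustrate the difference between those two ways of thinking, here is a small sketch that compares a constant four-space replacement with tab stop expansion for a made-up two-column list. The sample rows and the helper name `second_column_starts` are assumptions for this example only.

```python
# Hypothetical two-column data, with a tab between the columns.
rows = [("id", "name"), ("1", "Ada"), ("42", "Grace"), ("777", "Linus")]


def second_column_starts(expand):
    """Return the column where the second field starts in each row."""
    return [len(expand(first + "\t")) for first, _ in rows]


# "Developer" view: every tab is exactly four spaces.
constant_starts = second_column_starts(lambda text: text.replace("\t", "    "))
# "Document" view: every tab moves to the next tab stop.
tab_stop_starts = second_column_starts(lambda text: text.expandtabs(4))

print(constant_starts)  # [6, 5, 6, 7]
print(tab_stop_starts)  # [4, 4, 4, 4]

for first, second in rows:
    print((first + "\t" + second).expandtabs(4))
```

With the constant replacement, the second column wanders with the width of the first field. With tab stops, it lines up, which is what someone typing a simple list of columns expects.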
It was with that new viewpoint that I looked at the issues left to take care of and decided that I needed to add proper tab support into PyMarkdown. The next question was how to add it back in.
I had two choices on how to add tab character support back in. The first was to replace code that was currently working with code that also supported tab characters. The second choice was to do post-processing in select sections to convert space characters back into the tab characters that spawned them.
After going back and forth on the pros and cons, I decided on the second approach. While it adds complexity to the project, I believe it was a valid calculated risk that I needed to take. If I take tab support out of the equation, I am happy and confident with PyMarkdown as it stands. I do find the odd issue here and there, but that is normal with most projects. From my point of view, I wanted to add tab support back in as a solid feature, not a bug fix. As such, adding extra support to properly translate space characters back to tab characters seemed like the right approach.
That approach is paying off well so far. Except for Indented Code blocks, I believe all the Leaf Block elements are taken care of. As of the writing of this article, I am working to flesh out the paragraph support for handling generic text that has tab characters. Since elements such as autolinks are invalid if they have any tab characters, they can present themselves as paragraph text. It just seemed like a clever idea to make sure those bases were covered.
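One way to picture that second approach is sketched below. It is only an assumption about how such post-processing could work, not PyMarkdown's actual code, and both function names are hypothetical. While expanding the tabs, it records which original character produced each expanded column, so that any slice of the expanded line can later be mapped back to the original text, tabs included.

```python
def expand_with_map(line: str, tab_size: int = 4):
    """Expand tabs and record, for every expanded column, the index of the
    original character that produced it."""
    expanded = []
    origin = []
    for index, char in enumerate(line):
        if char == "\t":
            width = tab_size - (len(expanded) % tab_size)
            expanded.extend(" " * width)
            origin.extend([index] * width)
        else:
            expanded.append(char)
            origin.append(index)
    return "".join(expanded), origin


def original_text(line: str, origin, start: int, end: int) -> str:
    """Return the original characters behind expanded columns [start, end)."""
    if start >= end:
        return ""
    return line[origin[start] : origin[end - 1] + 1]


line = "a\tbc\td"
expanded, origin = expand_with_map(line)
print(repr(expanded))                            # 'a   bc  d'
print(repr(original_text(line, origin, 4, 9)))   # 'bc\td'
```

Note that a slice starting or ending in the middle of an expanded tab returns the whole tab in this sketch. Deciding what to do in that case is a design choice the sketch deliberately leaves open.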
Based on what I was working on last night, I hope to have paragraphs wrapped up and committed within the next couple of days. That leaves Inline Elements, Indented Code blocks, and Container Blocks that I need to work on. I am not too worried about Inline Elements, but I am concerned about the other two. To properly capture the tab character in Markdown that looks like this:
- first line
- second line
\t- third line
is easy. The tab character becomes four space characters to help it reach the next tab stop. But this Markdown is a bit more nuanced:
- first line
\tsecond line
The same character expansion occurs, but which of the two elements should “control” the tab character? The token for the list block or the text token for the list content?
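To make the question concrete, here is a small sketch for the second example. It assumes that the list item claims the first two columns of the continuation line (the width of the `- ` marker) and that whatever is left belongs to the text. The function name and the idea of returning a split are hypothetical; this only shows the arithmetic, not how PyMarkdown decides.

```python
LIST_INDENT = 2  # assumed content indent of a "- " list item


def split_leading_tab(line: str, container_indent: int, tab_size: int = 4):
    """Split a line's expanded leading whitespace between the container
    and the content, returning (container_columns, content_columns, text)."""
    expanded = line.expandtabs(tab_size)
    leading = len(expanded) - len(expanded.lstrip(" "))
    for_container = min(leading, container_indent)
    return for_container, leading - for_container, expanded[leading:]


print(split_leading_tab("\tsecond line", LIST_INDENT))  # (2, 2, 'second line')
```

Under that assumption, the single tab character is worth two columns to the list and two columns to the text, so neither token can simply own the whole character.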
I am not sure at this point… but I am hoping to have an answer by next week. Stay tuned!