Summary
In my last article, I talked about the progress I have made in adding tab support to the project. This week, I will talk about the various issues that I have encountered.
Before We Start
My writing has been spotty at best lately. Part of that is due to lack of energy and the rest is due to lack of quality. Life has been punching hard lately and I needed to take the time to get things back in order so that I could move forward. While everything is not back where I need it to be, it is slowly moving back in that direction.
For me, work can often be a distraction from the real world. When I am working on one of my projects, such as the PyMarkdown project, I can just focus on it and not worry about the things around me. While the other stuff that has happened to me in the past month was a bit much, I was starting to get a handle on it when my mother passed away. That really knocked me down and out for a bit.
While it may seem like a bit of a crutch, having my projects as a distraction has allowed me to channel my negative energy into a positive effort… something that my mother would have approved of. It has also allowed me to reduce the flow of those negative emotions to a level where I can more readily deal with them. And given that my mother was one of my closest friends, I definitely needed that in the last couple of weeks.
It will still take me a while to complete the processing of those feelings, but I am dealing with them at my own pace. At the very least, it allows me to pace myself most of the time. Sometimes those emotions just come out and I must take a break and work on them. It is just life. It is not predictable, and it is not always fun. But for me, working through these emotions helps me understand what is important in life. And for that I am grateful.
Where Were We?
As it has been a couple of weeks, I thought a quick catch up was warranted. The last time I wrote, I was working to figure out how to represent Indented Code Blocks whitespace properly, with Link Reference Definitions and Container Blocks on the horizon. I needed to deal with the possibility of the Indented Code Blocks splitting a tab character when it was set up properly.
The short story: things have progressed. The first thing that I was lucky to encounter was that, without containers, those complexities were avoided. Because the parser treats tab characters as advancing to tab stops, the only way to get to the four characters needed for an Indented Code Block without a container was up to three space characters followed by a tab character. Once the parser observed the tab character, the tab stop of four gave the Indented Code Block its required four characters, and any other characters went into the Code Block itself.
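To make the tab stop behavior concrete, here is a minimal sketch in Python. It is not the PyMarkdown implementation; the names `expand_tabs` and `TAB_STOP` are made up for this illustration, and it only assumes the tab stop of four described above.

```python
TAB_STOP = 4  # tab stop width assumed from the discussion above


def expand_tabs(line: str, tab_stop: int = TAB_STOP) -> str:
    """Replace each tab with the spaces needed to reach the next tab stop."""
    expanded = []
    column = 0
    for character in line:
        if character == "\t":
            # A tab advances to the next multiple of tab_stop.
            width = tab_stop - (column % tab_stop)
            expanded.append(" " * width)
            column += width
        else:
            expanded.append(character)
            column += 1
    return "".join(expanded)


# Zero to three leading spaces followed by a tab all land on column 4.
for leading_spaces in range(4):
    source = " " * leading_spaces + "\tcode"
    print(repr(source), "->", repr(expand_tabs(source)))
```

Every one of those four inputs expands to the same four columns of whitespace, which is why the tab never needs to be split when there is no container. For a single line, Python's built-in `str.expandtabs(4)` produces the same result.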
After making a bit of an attempt at working on the Link Reference Definitions, they proved too much for me to handle along with what was going on in life. That was not a big deal at the time. I knew that I would have to handle them, and I was okay with putting them aside. As they have an increased difficulty level, I felt that getting some more tab handling under my belt was a good thing to accomplish before dealing with them. They are still on the list as of the writing of this article, but they are now at the top of the list.
Dealing With Block Quotes
The list of elements to work on was shrinking to the point that only container blocks were left on the list. Looking at the choices of Lists and Block Quotes, I felt that the Block Quotes were the easier of the two to deal with first. When I say they were easier, I mean that the number of whitespace permutations tied to Block Quotes is two: the Block Quote character is either followed by a space character or it is not. The permutations for List elements and potential padding characters made for an extensive list. Better to handle those later.
But it meant dealing with split tab characters. I was not looking forward to that.
What is a split tab? When writing Markdown text for inclusion into a Block Quote container, the text `>{space}text` is as valid as `>text`. That is to say that a single space character before the text within the Block Quote line is optional. When dealing with HTML output, that is not an issue as the whitespace is thrown away. When dealing with space characters in Markdown output, that is not an issue as a space character is an atomic construct. Each space character will only ever be one space character.
The problem with the tab character is that it can represent between one and four space characters, depending on where it is situated in the line. If the text `>{tab}text` is presented to the parser, the tab character is expanded to three space characters, effectively becoming `>{space}{space}{space}text`. But when the parser goes to associate those space characters with specific Markdown elements, there is an ownership problem. Which element gets the tab character?
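The ownership problem can be sketched with a little arithmetic. This is only an illustration, assuming a tab stop of four; the helper names `tab_width_at` and `split_block_quote_tab` are hypothetical and do not come from the project.

```python
TAB_STOP = 4  # assumed tab stop width


def tab_width_at(column: int, tab_stop: int = TAB_STOP) -> int:
    """Number of columns a tab occupies when it starts at the given column."""
    return tab_stop - (column % tab_stop)


def split_block_quote_tab(line: str) -> tuple[int, int]:
    """Return (columns claimed by the Block Quote, columns left for the content).

    Only handles the single case of a line that starts with ">" and a tab.
    """
    if not line.startswith(">\t"):
        raise ValueError("expected a line starting with '>' and a tab")
    width = tab_width_at(1)  # the tab starts in column 1, right after ">"
    container_columns = 1    # the optional space that follows ">"
    return container_columns, width - container_columns


print(split_block_quote_tab(">\ttext"))
```

For `>{tab}text`, the tab covers three columns: one can be claimed by the Block Quote as its optional space, and two are left over for whatever is nested inside. A single character ends up serving two elements, which is the split.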
Doing my usual experimentation and research, I noticed that while there were complexities, they did not end up being as bad as I thought they would be. In the above case, I needed to do two things to properly split the tab. The first of those things was to place the tab character with the element that was nested inside of the container. While this did impact each Leaf Block element type, it was not that bad. And a simple search early in the matching process allowed me to detect the split tab case with ease.
The other part took a bit of thinking, but it was to remove the space that was at the end of the Block Quote `leading_space` field. As the Block Quote contents are being created, a `\n>{space}` is added to the owning single depth block quote element to denote the new line. The `\n` indicates that it is a new line and the `>{space}` provides for the normal use of the Block Quote with a space. As the HTML does not require the space and the Markdown token has the tab character, the Block Quote’s trailing space character can be removed.
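A rough sketch of that trimming step might look like the following. The `leading_space` field name comes from the description above, but the `BlockQuoteToken` class and its two methods are hypothetical stand-ins written for this example, not the project's actual token.

```python
from dataclasses import dataclass


@dataclass
class BlockQuoteToken:
    """Hypothetical stand-in for a Block Quote token; only leading_space is modeled."""

    leading_space: str = ""

    def add_leading_space(self, text: str) -> None:
        """Append the prefix for a new line, separated by a newline character."""
        if self.leading_space:
            self.leading_space += "\n" + text
        else:
            self.leading_space = text

    def remove_trailing_space(self) -> None:
        """Drop the final space when the nested element owns the tab instead."""
        if self.leading_space.endswith(" "):
            self.leading_space = self.leading_space[:-1]


token = BlockQuoteToken()
token.add_leading_space("> ")   # first line: ordinary ">{space}"
token.add_leading_space("> ")   # second line starts out the same way
token.remove_trailing_space()   # split tab detected on the second line
print(repr(token.leading_space))
```

After the trim, the second line's prefix is just the Block Quote character, and the tab character stays whole with the nested element.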
It was an easy answer, but one that I was not sure of at first. I needed to work through the first three element types nested within a Block Quote element to really get a sense that it was the right answer. And it was not that I did not trust my own research, but the answer seemed too simple. As I was going through a lot in my personal life, I was concerned that I was taking a short cut that I would not normally take.
But time bore out the research and the solution. The code to handle it was simple, somewhat elegant, and easy to match to each element.
Block Quotes Were Hard
With that out of the way, I proceeded to work on different permutations of the Block Quote elements and tab characters for each of the elements. There were a handful that were easy to get right, and another handful that had a lot of tricky scenarios that I had to work through. If memory serves, I do not believe any of them fell in the middle of that range.
Good examples of the difficult cases were Fenced Code Blocks and Indented Code Blocks. The bare implementation of these is easy. But when I placed them inside the Block Quote element, I had to rework a certain amount of the code. For Indented Code Blocks, the consuming of the first four whitespace characters caused a lot of split tab situations that I had to get right. For the Fenced Code Blocks, it was the same thing but depended on the indentation of the opening Fenced Code Block element. In both cases, I had to be sure to follow the specification on how to deal with them.
And just when I thought I was finished with Block Quotes, I hit myself with a set of tests that I needed to fix. For laughs and giggles, I added a series of tests with each Leaf Block element inside of a Block Quote, with every whitespace character replaced with a tab character. I was not laughing when most of the tests failed. In most cases, it was not something too serious to fix. It just took some extra time to work through those cases. And while I cleaned up those tests, I found a single test that I should have fixed earlier but had not. Another detour.
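The tests themselves are not shown here, but the idea of generating tab variants can be sketched as follows. The sample inputs and the `with_tabs` helper are invented for this illustration, and the sketch replaces only space characters while leaving newlines alone.

```python
# Hypothetical sample inputs: one Leaf Block element per Block Quote.
LEAF_BLOCK_SAMPLES = {
    "atx_heading": "> # heading",
    "thematic_break": "> * * *",
    "paragraph": "> some text",
    "fenced_code_block": "> ~~~\n> code\n> ~~~",
}


def with_tabs(markdown: str) -> str:
    """Replace every space character with a tab, leaving newlines alone."""
    return markdown.replace(" ", "\t")


for name, sample in LEAF_BLOCK_SAMPLES.items():
    print(name, repr(with_tabs(sample)))
```

Each variant would then be run through the parser and compared against the expected tokens and HTML, which is where the split tab handling gets exercised.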
Is It Worth It?
As I was getting ready to author this article tonight, I asked myself that question a couple of times. I personally do not include tab characters in my Markdown documents as they are not predictable. I realize that the parser interprets them as tab stops, but my mind is keyed to seeing them expanded into a constant set of space characters. That is just my opinion, not anyone else’s. When I am typing, I want to know exactly where the spacing is.
I could just get away without implementing proper support for tab characters, but that would not sit well with me either. It is part of the specification, and I would feel like I missed it on purpose. So that was out.
With those tasks mostly completed, my answer was: yes, it is worth it. I still have some way to go, but it is indeed worth it. For people that do not use tab characters, I have a solid linter that is compliant with the GitHub Flavored Markdown specification. If there are any issues with compliance, I will work to fix them. This was no different.
Was the decision to fix the tab characters as a post-process a good one? I am not 100% sure, but I believe so. While it may be taking extra work to get there, it leaves the core parser and linter intact unless the line being processed includes tab characters.
But from a personal pride point of view, I want to complete it for my own reasons. And for me, that is good enough.