It took a while to get there, but today I am happy to release the latest patch versions of the PyMarkdown project. And while I had a lot of life happening, as I talked about in these articles, I did make a lot of progress in dealing with things.
While there have been a handful of issues fixed in the last six months, the bulk of the work was aimed at shoring up two major areas: tabs and nested container blocks.
As I talked about in my previous article Making Progress With Tabs, I made a decision early in the project to handle tab characters as space characters rather than as tab characters. This made a lot of things easier, especially the computation of indents at the start of a line. But as I get closer to having a totally compliant parser, it made sense to go back and ensure that any tab characters were properly represented in the token format. That action also has the benefit of allowing the parser to emit the HTML output format that the project uses to verify the Markdown translation.
But that is not always an easy task. Markdown treats tab characters as tab stops: a single tab character will translate into either 1, 2, 3, or 4 space characters, depending on where it occurs in the current line. As such, there is no easy substitution that can be used to go back from space characters to tab characters. And because of that, there is the possibility of a tab character being expanded into multiple space characters that span elements. Consider the Markdown:
- first line
{tab}second line
The first line sets up an unordered list block, and the second line is translated into {space}{space}{space}{space}second line. The first two of those space characters belong to the list element, and the second two space characters belong to the paragraph element containing the text second line. Those situations cause what I call a “split tab” situation, and they are tricky to handle properly.
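To make the tab stop arithmetic concrete, here is a minimal Python sketch. It is not how PyMarkdown is implemented. The names `expand_tabs` and `split_leading_indent` are hypothetical, and the tab stop width of 4 and the list indent of 2 are simply taken from the example above.

```python
TAB_STOP = 4  # assumption: tab stops fall on every fourth column


def expand_tabs(line: str, tab_stop: int = TAB_STOP) -> str:
    """Replace each tab with the spaces needed to reach the next tab stop."""
    output = []
    column = 0
    for char in line:
        if char == "\t":
            # A tab is worth 1 to 4 spaces, depending on the current column.
            width = tab_stop - (column % tab_stop)
            output.append(" " * width)
            column += width
        else:
            output.append(char)
            column += 1
    return "".join(output)


def split_leading_indent(expanded: str, container_indent: int) -> tuple[str, str, str]:
    """Split an expanded line into (container whitespace, leftover whitespace, text)."""
    stripped = expanded.lstrip(" ")
    leading = len(expanded) - len(stripped)
    # The container (here, the list) only owns as much whitespace as its indent.
    owned = min(leading, container_indent)
    return expanded[:owned], expanded[owned:leading], stripped


expanded = expand_tabs("\tsecond line")
print(repr(expanded))                     # four spaces, then the text
print(split_leading_indent(expanded, 2))  # list part, paragraph part, text

# Different inputs can expand to the same output, which is why there is
# no simple way to go back from spaces to the original tab characters.
print(expand_tabs("a\tb") == expand_tabs("a   b"))
```

In this sketch, the single tab ends up divided between the first two values returned by `split_leading_indent`, which is the “split tab” case. The last line shows why the expansion cannot be undone by looking at the space characters alone.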
In the end, those issues are always resolvable; it just takes time. In most cases, I quickly figure out what needs to be done, but as they say, “the devil is in the details!” Finding the correct situations that solve a specific issue is not usually the problem. The problem arises when I try to make sure the new code only fires when it is supposed to and does not fire in the other cases. Finding the right combination of things to check for takes a lot of experimentation. And that just takes time.
Because I am trying to test some weird cases with tab characters, I also ran into issues with nested container blocks arising from combinations that I have not covered yet. This was not a surprise for me, but it did take some time to get my head around. The good news for those issues is that they mostly occurred when the parser moves from a sublist back to its parent list. The calculations for those were just slightly off. All those issues were worthwhile to find; they just took time.
And hopefully this does not sound like an excuse, but I do need a clear head to think through these issues and debug them properly. I had a lot going on since the previous release, and the amount of “quality” development time I had was limited. Not an excuse in my books, just life. And yes, I must remind myself that life happens, and not to be hard on myself for not getting farther faster.
I am starting to see the end of the tunnel for the main phase of this project. The first roadblock is proper support for tab characters, and I believe I am closing in on having that taken care of. Following that, there is testing that I want to do to ensure that the nested container blocks are working properly. Finally, after that, I have some work to do on properly handling container blocks that start at different positions and accounting for that.
For what it is worth, I also believe that is the correct order of issues to be working on. While they are not plentiful, I have started to see more Markdown documents that contain tab characters. The count of those documents is definitely greater than the documents that I have seen with deeply nested container blocks or starting lists and block quotes at different locations on a line. If I had to guess, I would suggest that when people are authoring Markdown documents, they want things to look “right”, and those two formatting choices are not conducive to looking “right”.
Regardless, those issues are on my list, and I will hopefully be picking up momentum on closing them out. There of course will be more refactoring that I want to do as I proceed through those areas, but I am hopeful that the refactoring will allow me to speed up the fixing of the other issues, not delay them. Stay tuned!