In my last article, I talked about making hard choices when it comes to projects. In this article, I talk about following through on the choice that I needed to make last week.
Having decided that I needed to replace the whitespace processing in the PyMarkdown parser, the follow-through was brutal. As of last week’s article, I had burnt through all except twenty-five scenario tests. Experience led me to believe that those remaining tests were going to be the hard ones. Either fortunately or unfortunately for me, depending on one’s viewpoint, my estimates on the PyMarkdown project and the work to be done have been spot on lately. It was going to be one hell of a week!
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the changes talked about in this article, please consult the commits that occurred between 05 Dec 2021 and 12 Dec 2021.
There are often diverse ways of saying the same thing. The gist of what I remember from one lecture that I attended is that the lecturer, someone smarter than me by far, was talking about how 80% of coding is just following established patterns. He further went on to say that the remaining 20% is a mixed bag of tasks that will either prove you to be a competent developer or force you to hurl your monitor out the window in frustration. For sure, that might seem like hyperbole to most. But if you ask most software developers to be honest about whether they have thought about throwing things around the office, including their computer or monitor, most will admit that they have had those thoughts.
That frustration is just a natural thing. Software does exactly what it is told to do, no more and no less. It then follows that if a developer has an incomplete picture of what they want to do, then the result is a program that is incomplete. And while the high-level picture of what is to be done is always clear, that picture gets grainier as the software developer zooms in to individual sections of that picture. As a result, the gaps between “the ideal” and “the reality” are what are known as bugs. The real question that faces most developers is whether they can find all the relevant bugs before the users of their programs do.
For the last week, I had been in a forest where there were tons of those bugs lurking around every corner. While it took a bit of work to get through them, I had managed to vanquish most of the bugs about whitespace. But the reality of software development is that the easy-to-find bugs are almost always the first to go, leaving the trickier bugs to diagnose and fix. That is where I started this week. With tricky bugs.
Dealing With Tabs¶
As I talked about in the last article, tabs in Markdown are treated as tab stops and not just blindly replaced with four space characters. With eight unresolved tests that dealt with tab characters, this was not something that I could delay addressing for much longer. There was no choice other than to tackle this issue head on.
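To make the difference between the two treatments concrete, here is a small sketch that uses Python's built-in `str.expandtabs`. It is not taken from PyMarkdown; the function names and sample strings are made up for illustration, and a tab size of four is assumed. With tab stops, the width of a tab depends on the column where it starts, which a blind replacement ignores.

```python
def blind_replace(text: str) -> str:
    """Replace every tab with exactly four spaces, ignoring its column."""
    return text.replace("\t", "    ")


def tab_stop_expand(text: str, tab_size: int = 4) -> str:
    """Pad each tab out to the next column that is a multiple of tab_size."""
    return text.expandtabs(tab_size)


# Illustrative inputs only: the same tab lands at different starting columns.
for sample in ["\tfoo", "a\tfoo", "ab\tfoo", "abc\tfoo"]:
    print(repr(blind_replace(sample)), repr(tab_stop_expand(sample)))
```

The two functions agree only when the tab starts exactly on a tab stop. In every other case the blind replacement pushes the following text too far to the right.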
Dear readers: I know you might think I am not being truthful about taking a good day and a half to think about the impact of tabs on the parser. But I did. In my usual fashion, I scribbled things out on paper and talked to myself, working through the various issues that could arise with different solutions. There were a half-dozen “half-solutions” that I produced, but each was quickly discarded. To be honest, calling them half-solutions is probably generous. In each case, I was trying to deal with the effects of having tabs without dealing with the actual tab characters themselves.
That did not work very well for me. Those solutions did take care of resolving the perceived size of each tab character, but they introduced more problems. Specifically, they often included passing extra lengths around and made some calculations a lot more difficult in the process. On top of that, even with helper functions, I found that I was coding variations of how to use those helper functions in multiple places. It was just messy and not very maintainable.
So, after deciding to use all those scribbles as kindling in our fireplace, I decided to go for what I consider to be the nuclear option. Instead of creating a maintenance nightmare in the rest of the code, I decided to add this code to the Container Block Processor:
    if "\t" in position_marker.text_to_parse:
        line_to_parse = ParserHelper.detabify_string(
            position_marker.text_to_parse
        )
        position_marker = PositionMarker(
            position_marker.line_number,
            position_marker.index_number,
            line_to_parse,
            position_marker.index_indent,
        )
Adding a new detabify_string function to the ParserHelper class, I called it from the start of the Container Block Processor and its handling of each line. I used it to replace any tab characters in the string in that one place. I used almost identical code once in the Markdown generator to process the test data before the Markdown comparison there.
And except for some altered test data that needed to be addressed, I was done. There was no “including tab counts” that needed to be altered and passed around. There was no need to know if I had preprocessed a given string or if I still had to do that. There were no weird if statements to deal with tab characters. One if statement in the parser code, and one if statement in the test code and it was done.
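The body of detabify_string is not shown in this article, so the following is only a guess at its shape, not the project's actual code. It assumes a tab size of four, and the `PositionMarker` below is a simplified stand-in with just the four fields used in the snippet above. The `normalize_marker` wrapper is a hypothetical name for the "expand once, at the entry point" idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionMarker:
    """Simplified stand-in for the parser's marker (assumed fields only)."""

    line_number: int
    index_number: int
    text_to_parse: str
    index_indent: int


def detabify_string(text: str, tab_size: int = 4) -> str:
    """Replace each tab with spaces up to the next tab stop (hypothetical body)."""
    parts = []
    column = 0
    for char in text:
        if char == "\t":
            # Width of this tab depends on the column where it starts.
            pad = tab_size - (column % tab_size)
            parts.append(" " * pad)
            column += pad
        else:
            parts.append(char)
            column += 1
    return "".join(parts)


def normalize_marker(marker: PositionMarker) -> PositionMarker:
    """Expand tabs once, up front, so later code never sees a tab character."""
    if "\t" not in marker.text_to_parse:
        return marker
    return PositionMarker(
        marker.line_number,
        marker.index_number,
        detabify_string(marker.text_to_parse),
        marker.index_indent,
    )
```

Because the expansion happens in one place before any other processing, nothing downstream needs to carry tab counts or ask whether a string has already been preprocessed.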
Why The Long Journey?¶
So why did it take me so long to come to that conclusion? Data integrity. At the start of this project, I wanted to ensure that the integrity of the parser was as high as possible at every stage. I had it in my head that translating the tab characters in any way was violating that rule. So, I tried to take the long way around and deal with the effect instead of the character itself.
And I guess in a certain way, it is changing things. But the question that I ended up asking myself was whether that violation was relevant to the tokenizer. I could easily argue that from the viewpoint of parsing the Markdown as an HTML parser, it was crossing the line. That was crystal clear. But from the point of view of a tokenizer, I ended up wavering on that viewpoint. The tokens are just truthful representations of what was parsed. As the specification is clear that tabs are to be interpreted as the corresponding number of space characters, there was a new question. That question was whether an uninterpreted or interpreted representation of that tab character was the better choice.
But it was not an easy journey or choice. And as with all hard choices, it just took a while for me to work through things and make sure that it was the right choice.
Dialing In The Remaining Tests¶
With the tab tests out of the way, I needed to apply my normal work ethic to get movement on the remaining issues. But with a firm guideline on how to approach the whitespace, it only took a couple of days before I had the remaining scenario tests passing, with caveats. I just focused on making sure that the right whitespace was being applied to the properly scoped token, and that helped a lot. It was still challenging work getting everything assigned properly, but a simpler set of rules made decisions easy to make.
What were the caveats? There are still a couple of scenario tests that I disabled instead of getting them to work, and they fall into two groups. The first group is a set of tests that have three or more levels of nesting. While there are some tests that “just work”, I want to make sure to spend some time and effort to properly scope and focus on each combination and make sure that it works. The second group is a set of scenario tests involving changing indentation, usually from the Block Quote characters. Once again, there are some simple combinations, but I feel that the right thing to do is to specifically focus on those combinations in their own scoped issues.
And to be clear, everything still parses properly. It is just the calculation of the whitespace lengths that is in question. And I want to get those right, not just kind of right. I passionately believe that it is a good enough reason to handle them separately.
With the remaining tests either passing or skipped, it was time to start cleaning up after the mess of making those changes. There were a handful of places where I had forgotten to change from my “in the moment” variable names (like gg) to variable names that describe their function. Using Sourcery, I was able to calculate a relative health of each function, and I kept on refactoring until that measurement was at least 40 for every function in the project. Using Code Inspector, I was able to quickly figure out what PyLint issues needed to be addressed.
To be honest, this part of the process was just relaxing. Each of the refactoring steps I took was easily tested, thanks to a complete set of passing scenario tests. If I made a change and one of the tests failed, I just rolled back the last change and tried again. While it did take three or four hours to complete, there was almost no stress associated with those changes. It was a pleasant change for working over the weekend.
Watching The Issues Fall¶
Finishing the prior work just before noon, I took a rather leisurely lunch and got some of my energy back. It was a hard slog getting everything finished and tidied up, but it was nice to get there.
The issues that I could choose to work on had as much potential to be easy issues as they had to be difficult issues. I knew that the first one, Issue 99, was going to be easy, as this work had all kicked off because of the problems I had addressing that issue. But what about the others?
As I made my way through the other issues registered around the same time as Issue 99, only one issue out of the seven logged was not resolved right away. I was fine leaving that issue for another week. It just felt good that the work that I had undertaken to correct the whitespace handling had cleaned up six other issues.
What Was My Experience So Far?¶
I know it may sound funny to some people, but I was sure that the heavy sigh I made when I finished working on the tests could be heard throughout my house. It was just a good feeling to simplify something that was complex and difficult to maintain. It was even better to see multiple tests resolving themselves because of my refactoring of the whitespace from the containers.
But I know that I need to focus more on cleaning up the remaining issues in both the GitHub issues list and my own Issues List. So hopefully I will be able to get both of those cleaned up. I know I am making timely progress though, and that gives me hope that I can clean things up nicely within a couple of weeks.
What is Next?¶
Having finished the changes to address whitespace issues, I am eager to get back to a mixture of refactoring code and fixing issues. Not sure what mix of those two is going to happen yet, but I am fairly sure I want a good balance of both. Stay tuned!
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.