In my last article, I talked about my progress in reducing the PyMarkdown project issue count, including the bug that almost knocked me down for the count! In this article, I talk about making hard choices when it comes to projects.
Having fixed some issues in the prior week, I decided to tackle another issue from the outstanding issues list. What I thought at first was a minor snag turned out to be a major issue. More than a major issue. More of a story.
What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves.
After taking a well-deserved night off, I started to work on one of the logged issues, Issue 99. There was no specific reason that I picked this issue, except for a feeling that it was a good starting issue for the week. It looked like a simple issue. When an HTML Block element was encountered in a List element, the List element was being closed prematurely.
Digging into the debug for the failing tests, I was able to quickly spot what the issue was, and subsequently fixed it. I manually verified the tokens, and everything looked good with each token. Then I did a mental conversion from tokens into HTML, and everything was also good with the conversion. The only things left were to verify the translation back into Markdown and to verify that the other consistency checks were passing.
Looking at the code in the transform_to_markdown.py module, I tried to add some simple code to adapt to the new condition, but it did not work on that first try. Come to think of it, it did not work on my second try either. Having had a long day, I chalked that experience up to tiredness, deciding to try again the next day.
Before sitting down at my computer for work that next evening, I made sure I had a healthy meal beforehand, I was all nice and showered, and that I was nice and relaxed for the most part. If nothing else, I wanted to start that evening’s work in a good mood to address the issue from the night before. I mean… how hard could it be?
After three hours of trying to get it working and failing, I had my answer. It was a tough problem to solve. And it had a lot to do with history and requirements.
Because of the way that Markdown is processed, there are only two real solutions to deal with containers and their impact on a given line. The first solution is to deal with containers after the leaf elements and inline elements have been dealt with. This is a lot less work in the end, but the setting up of each element must be precise. In addition to that, tabs within a Markdown document are a problem.
For reference purposes, Markdown parsers interpret tabs as tab stops of 4. Tab stops mean that the tab-to-space conversion ratio depends on where in the line the tab is. The value 4 signifies that, at most, a single tab character can be interpreted as 4 space characters. The way to think about it is this. If the index in the line modulo¹ 4 is equal to 0, then 4 spaces are added. If the index modulo 4 is 1, then 3 spaces are added. Similarly, if the index modulo 4 is 2, then 2 spaces are added, and if it is 3, then 1 space is added. This allows things to be aligned on tab stops that occur in the middle of lines. But these calculations assume that you know the index on the line before calculating the impact of the tab. As any container indents have not been applied at that point, that is not possible.
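To make the tab stop arithmetic concrete, here is a minimal sketch in Python. The function name `expand_tabs` and its parameters are hypothetical and are not taken from the PyMarkdown code base; the sketch only illustrates why the starting column has to be known before a tab can be converted.

```python
def expand_tabs(text: str, start_column: int = 0, tab_size: int = 4) -> str:
    """Replace each tab with the spaces needed to reach the next tab stop.

    `start_column` is the index on the line where `text` begins, such as
    the width of any container prefix that comes before it.
    """
    pieces = []
    column = start_column
    for character in text:
        if character == "\t":
            # A tab advances to the next multiple of tab_size.
            spaces = tab_size - (column % tab_size)
            pieces.append(" " * spaces)
            column += spaces
        else:
            pieces.append(character)
            column += 1
    return "".join(pieces)


# The same tab character becomes 4, 3, 2, or 1 spaces, depending on its index.
for index in range(4):
    print(index, len(expand_tabs("\t", start_column=index)))

# Parsed on its own, the tab is worth four spaces. Parsed after a
# two-character block quote prefix, the same tab is only worth two.
print(repr(expand_tabs("\tcode")))
print(repr(expand_tabs("\tcode", start_column=2)))
```

The last two calls show the problem described above: a leaf element that expands its tabs before the container indent is known assumes the wrong index, and therefore the wrong number of spaces.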
The other option is to handle the container elements and the other elements on a line-by-line basis. Because of the above issues with tab stops, that is the way I decided to do things. I knew that it was going to be a bit more work to keep track of both things at the same time. I needed to juggle distinct parts of the same line at the same time, ensuring that I knew what the effects of the container elements were before parsing the line. But to be blunt, it evolved in a bad way.
As I sat there that night, looking at the source code, I could not believe how convoluted it had become. There were more than three distinct types of merge functions, each for a specific case. And because I developed those functions as I progressed, these were organic, with exceptions to rules throughout those functions. It was just a mess to try and figure out.
I knew I needed to figure it out to move on, so I decided to take the next evening to focus on understanding the algorithms. Because I needed to ensure that I verified any recent changes before going on, I really needed to understand those functions and how to properly change them for the new data. In my mind, it was necessary.
While my optimism was a bit more deflated than the night before, I gave myself a similar starting position the next night. But even with that good starting point, I spent the next two hours figuring things out and hurling at least a half dozen “WTF?” insults at my monitor before I was done. I had scribbles down over a good five sheets of paper strewn all over my desk. And those were just the ones that I did not scribble on top of and then threw out because I got it wrong. I stopped counting those sheets after the first ten.
It had taken me two hours, but I had a good concept of what was going on. I just did not have any ideas on how to change the code to do what I needed it to do for the issue that I just fixed. Frankly, I was amazed that what I had figured out worked. I did not intentionally mean for the code to get like that, but the complex tangle of code was captured on those five sheets of paper. And they did not look neat and tidy either.
It was on a bit of a whim that I started sketching out a similar algorithm for managing the container text after the other elements had been processed. This was more in line with my design decision to keep container processing separate from the leaf processing, so that was one good sign. The other good sign was that within fifteen minutes, I had a solid design for the base algorithm. Twenty minutes later, I had a list of the changes I would need to make to the whitespace in the tokens to present the needed information required by the algorithm.
I knew that the new sketch would not be that easy to implement, as things rarely are as easy as they seem. But even with a bit of extra work, the new sketch design was simple to understand, and the changes to the tokens to support those changes were also easy to understand. There was only the one algorithm, with a couple of small algorithms on the side to calculate lists for the main algorithm. And I did not take care of translating tabs into spaces either. That was something else I would have to do.
That is where I got to the hard part.
It was now the start of Friday evening, and I had an exceedingly difficult choice to make. If I stuck with the current algorithm, it would take me at least another day or two to make my current change. I did not have any good estimate to base a guess on due to the state of the algorithms. I had a good feeling that if I had to make similar changes, as I expected to have to do for at least half of the remaining issues, I would have to take a similar length of time to make those changes.
On the other hand, I could change that algorithm to the proposed design, but there were no guarantees that it would fix things. At the very least, the new algorithm would be more understandable, and thus more maintainable. But that just meant that the work of addressing issues like these would shift towards making sure that the right whitespace is encapsulated in the tokens for the container algorithm to use. And I would still have to find a way to deal with tab characters.
That is when I realized: there just was no straightforward way to deal with this problem. Regardless of which option I chose, there would be a decent amount of work. So, any short-term considerations were out, as they were even. I had to start thinking about the long term, and how my choice would reflect on the future of the project.
Working things out in my head, I knew that I had made my decision when I started to try and convince myself that the current solution was good enough. I was not doing my usual process of evaluating multiple choices on their merit. Instead, I was arguing to myself that what I had met some minimum bar that was hard to define. To me, that meant that I had a clear winner in the other choice, and I was just worried about the cost.
That is when I had an objective talk with myself about the merit of the new approach versus the old approach. If I had a tough time thinking about how to modify the old algorithm now, how much would it cost to change it in another week or two? Would I have to go through the same process again? The more I asked myself questions like that, the more the cost of the new algorithm did not seem to be that expensive.
The more I thought about it, the clearer it became. It was a hard decision, but I needed to rewrite that part of the code. A simpler algorithm costing more now would easily save effort in the future. It sucked, but I knew I needed to make the hard decision.
I was going to rewrite the project’s Markdown generator.
With everyone else out of the house for most of the day, I had a rare Saturday to myself. Because I expected to throw even more “WTF?” comments at my monitor, it was a good thing I was alone in the house.
The first thing that I noticed is that while I needed to rewrite the Markdown generator, I did not have to rewrite most of it. With one or two exceptions, the handling of the non-container tokens was good. As for those small exceptions, I was sure that only the whitespace handling would need to change, and that I could manage those changes.
With that knowledge in hand, I turned off all existing checks, picked a simple scenario test with multiple lines and no container elements, and started to work on it. I was not trying to solve the entire problem, just get a good head start on rewriting the code.
It was not too long before I had a promising idea of what I needed to do, and I committed that code to my local repository. Then I enabled the Markdown generator check for only those scenario tests that did not contain any container elements. Working through those issues, I then added checking for scenario tests including only Block Quote elements and worked through those issues. Finally, I repeated that process with List elements, to get to where I am now.
As of the writing of this article, I have 25 scenario tests that I need to get working without enabling the remaining tests with both Block Quote elements and List elements. Turning those on, the total jumped to 53 scenario tests, which is still a manageable number. Some of those tests are going to be easy to resolve, and there are other tests that are going to tax me. But having the simplified design means that it is easy to defend, and that change makes it worthwhile.
It is worthwhile because I can easily maintain a picture in my head about what needs to happen. Every single whitespace character needs to be preserved in a token somewhere. Most of the time, that means that I need to ensure that the whitespace emitted in container tokens is correct. That is simply a fact because the new algorithm is simple, and I mean to keep it that way. Sure, it is shifting the effort to the tokens, but I feel good about that.
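As a rough illustration of that idea, and not the actual PyMarkdown implementation, the following sketch assumes that the leaf elements have already been turned back into Markdown text and that the container tokens have recorded the exact prefix for every line. The function name `apply_container_prefixes` and the shape of the data are hypothetical.

```python
from typing import List


def apply_container_prefixes(leaf_markdown: str, line_prefixes: List[str]) -> str:
    """Put the recorded container text back in front of each generated line.

    `leaf_markdown` is the Markdown produced for the non-container tokens.
    `line_prefixes` holds, for each line, the exact text (markers and
    whitespace) that the container tokens captured during parsing.
    """
    lines = leaf_markdown.split("\n")
    if len(lines) != len(line_prefixes):
        # A mismatch means some whitespace or line was not captured in a token.
        raise ValueError("every generated line needs exactly one recorded prefix")
    return "\n".join(prefix + line for prefix, line in zip(line_prefixes, lines))


# A Block Quote whose three lines each carry a different prefix.
original = "> first line\n>\n>   third line"
leaf_markdown = "first line\n\nthird line"
line_prefixes = ["> ", ">", ">   "]

regenerated = apply_container_prefixes(leaf_markdown, line_prefixes)
print(regenerated == original)
```

With a shape like this, the algorithm itself stays trivial. Any round-trip failure points at a token that recorded the wrong whitespace, which is where the effort shifts to.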
What Was My Experience So Far?
During the middle of the week, I was at a bit of a low point, mentally speaking. While I realize that I am not perfect, I do hope that I make good development choices that do not result in too much wasted effort. I do acknowledge that there are parts of the PyMarkdown code base that look busy, but usually I keep things grouped by functionality. In those cases, there is a small group of functions that provide support for each other and their common responsibility. In each of those cases, I can usually pick up a function and understand its purpose within five minutes, and what to change in fifteen minutes.
Not being able to do that is exactly what I had to come to terms with during this week. I had spent time maintaining code that became overly complex and spun out of control. It was far enough out of control that I did not think that there was an option other than to rewrite that functionality using a different algorithm. That hurt.
But as I author this article, I think my viewpoint has changed. With around fifty scenario tests left to convert and get working, I find myself having a renewed sense of optimism. The new code is cleaner and more direct on what it is trying to accomplish. The new code is independent of the processing for the other Markdown elements, so it just focuses on what it needs to do. And more importantly to me, the new code is easily more maintainable.
Did I throw away a Saturday on getting the new code to work? Yes. However, if I am honest, this new code just feels better. It is a bit more work right now to get back to a “stopping point” where I can start fixing issues again. But I am more confident that any later changes can be incorporated into the generation of the tokens themselves, not requiring any changes to the new functions for handling container elements.
And when it all comes down to it, that is what is important. I sometimes forget that every line of code that I write is an experiment. I make good guesses as to what I need, so my coding accuracy is decent, but it is still just a guess. In this case, an experiment failed, and I needed to try and find a better way to accomplish the same task.
So, yes, it sucks that I had to do a rewrite. But honestly, sometimes, there is no uncomplicated way, just difficult paths to follow. And I am okay with that… now.
What is Next?
Since I have started on the first part of this changeover, it makes sense that I keep on going until it is finished. Stay tuned!
¹ This is a fancy way of saying “what is the remainder after dividing by the other number?”