I have been busy at work on the PyMarkdown project since December 2019, or November 2019 if you include project pre-planning. During that time, I have established a very consistent schedule for planning the next week’s worth of work. Normally I start planning the next article a couple of days before I write the rough outline for the current article, ensuring continuity between the two articles. This occurs concurrently with the last couple of days of development work for the week, so I usually have a very healthy picture of where the project is and where the project is going.
This week was very different. For the first time since I started the project, I was unsure of the right direction to take. As I stated at the end of the previous article, I needed to answer one important question: should I write more rules or should I fix more issues?
While I did arrive at a final answer, it would eventually take me almost 2 weeks before I was able to properly answer that question. In addition to spending some time to take a good look at the project, I decided to use that time to knock some of the low priority issues off the technical debt list, hoping to gain some useful insight on how to answer the question along the way.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 08 May 2020 and 15 May 2020.
How Did I Get Here?¶
I believe that questions such as features versus issues are important questions for any project maintainers to ask of themselves and their project team. Even more important to me is the policy that any such questions and answers regarding a project must occur with full transparency to team members and other people interested in the project. I know I am not the first person to get stuck on an issue like this in their project, and I will not be the last. Because of that desire for transparency and openness, I wanted to document how I arrived at an answer, with explanations along the way to help others with their decisions regarding similar questions. Basically, I consider everyone reading this article to be an honorary team member.
As I alluded to in the introduction, most of my articles start off as ideas that I get while I am working on the project. When I am trying to figure out the direction that I want to go with on the project, I consider both my ability to code and test any changes along with how I am going to document those changes in an article. By the time the end of the “project week” starts to roll around, I have usually started to transition from having an idea of where I want the project to go, into a plan of what to do for the project in the next week, and how I will write the article about the execution of the plan the week after.
When I get to the last day of the week, three things happen one after the other. First, I perform a small project retrospective to figure out what parts of the last week went well, and where I can improve, being as honest with myself as I can. Secondly, I scan over my commits and notes from that week and come up with a good unifying theme for that week’s outline and sketch out a very rough outline for the future article. If things went well, that theme matches the project direction from a week before. If not, I just adjust as I go. Finally, I use both of those pieces of information to compose a plan on what needs to be done in the next week with respect to the project.
I have found this process very healthy as it asks three very important questions for the project. How did you do? What did you do? What do you plan to do? More specifically, by asking the questions in that order, my plan for the next week can be an informed plan, incorporating any “where can I improve” action items into the next week’s work, while they are still fresh from the retrospective.
Where Did This Process Fall Apart?¶
The current issue that I had with this process started when I asked myself the rules versus issues question because of one of those retrospectives. I remember sitting down at my desk, starting to type out a couple of notes to myself as part of my retrospective. There were two main themes that were a part of that retrospective. What went well was the writing of the rules and my belief that I had developed the right framework. What needed improving was less clear: my confidence in the framework was at a high level, but was it high enough for me to continue developing rules? I did not feel comfortable moving forward on the project without answering that question. But after a long period of time, I had not made any progress on answering it. I silently hoped that moving on to my next task would help.
Despite my hope, skipping to creating a rough outline of that week’s work did not help me to answer that question either. The prior few weeks had all been about proving that I had done enough work on the project that linting rules were both possible and easy to write. I sincerely believed that I had succeeded with that goal with room to spare. Mission achieved! But even with that important question solidly answered in the positive, it did not help me move forward on answering the question of whether the project would benefit more from more rules or fewer issues.
I then tried to force the issue by coming up with a plan for the next week, hoping it would jar something in my head that would “magically” resolve the issues I had with the retrospective and outline. It did not work. I just sat there and paused for a long time while staring at the screen. I tried bouncing between the retrospective, the outline, and the planning, but the results did not change. I was at a bit of an impasse. As far as I could figure, the scales seemed equally balanced. If the question had to do with some manner of data-driven metric, it would have been easy for me to come up with a go-no-go threshold that I could rely on. This decision was not going to be data-driven, but emotion-driven. That would make the go-no-go decision a lot more difficult to pin down to an answer that I felt good about.
After making a good-faith effort to arrive at an answer, I decided to choose neither answer. Instead, I decided to do two things. The first thing was to take a good look at the code over a few days, hoping to get extra insight as to where I felt the project was and how to better answer that question. The second thing was to fix some low-level issues that I had noticed and wanted to address. Nothing too earth-shattering, just some small cleanup issues that never seemed to get picked up in one of the other refactoring efforts. I hoped that this two-pronged approach would help me to pick up some other information that would tip the scales one way or the other.
In my mind, while it was not much of a plan, it was moving forward. It was taking a step forward in the hope of getting more information. At the very least, the project would benefit by a few issues being removed from the technical debt list. That was a lower level of success criteria for the week than I was used to, but I was fine with it. That is, if I figured out how to properly answer that question!
Taking Time Is Not Taking It Easy¶
While it is true that I have written every line of code in the project so far, I rarely have time to take a step back and see how things all fit together. To quote a professor of mine from university:
Sometimes you cannot see the forest because the trees get in the way.
As I was looking through the code, I was taking notes, mostly questions for myself. In all honesty, a couple of them were “does this really work?” which I resolved to answer for myself by the end of the week. Most of them though were simply me wondering out loud if there was a better way to do something or whether I could change something in a function to make it more readable or maintainable.
In the end, as would befit my university professor, I think I was better able to see both the forest and the trees after those couple of days. But while I had hoped that it would nudge me one way or the other in answering my main question, I remained at an impasse. I was able to answer all the questions I had asked myself as I was looking through the code, but none of those answers signified that I had done something wrong. They only suggested small improvements on how I wrote the code. None of those answers carried enough weight to tip the scales.
With one of my two paths not providing any information to sway the answer either way, it was time to switch to the second path: fixing issues.
Better Logging Command Line Support¶
One of the first things on my “never got around to it” list was to make sure that PyMarkdown has proper command line logging support. This level of logging support was already available through the test framework, but adding that same support to the command line seemed to always get pushed off. It was time to change that.
Adding that support was extremely simple, especially since I have added support like this to many command line programs. From experience, the two parts to adding this feature are adding the ability to control the default log level and adding the ability to redirect any logs that are produced to a file.
The actual core code to implement this feature was just 7 lines long, with the interpretation of the command line arguments being the bulk of this change. The existing interpretation code, which handles command line argument parsing through the argparse library, was changed to include support for a log level option and the --log-file option, including a default setting for the log level. To round out these changes, the log_level_type function was added to verify that any specified log level is a valid log level.
Those 7 core lines to add the logging itself may change from language to language, but they almost always are simple modifications of a common pattern. The first part of that pattern is dealing with writing to the console. As many logging frameworks will do this by default, the customization here is to ensure that the desired logging level is applied to the console log handler. The second part of that pattern is to add support for logging to a log file, usually requiring 2 to 3 discrete actions to customize the file logging to the proper settings for the application. As these are the two most frequently used logging modes for command line programs, most languages include good solid templates on how to add this for that specific language.
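As a rough illustration of that two-part pattern, here is a minimal Python sketch using argparse and the standard logging module. Only --log-file and the log_level_type name come from the text above; the --log-level flag name, the AVAILABLE_LOG_LEVELS table, the CRITICAL default, and the two helper functions are assumptions made for illustration, not PyMarkdown's actual code.

```python
import argparse
import logging

# Assumed table of accepted level names; the real project may accept others.
AVAILABLE_LOG_LEVELS = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def log_level_type(argument):
    """argparse `type` callable that rejects unknown log level names."""
    if argument in AVAILABLE_LOG_LEVELS:
        return argument
    raise argparse.ArgumentTypeError(f"Value '{argument}' is not a valid log level.")


def build_parser():
    """Build a parser with the two logging options (--log-level is an assumed name)."""
    parser = argparse.ArgumentParser(description="Logging options sketch.")
    parser.add_argument(
        "--log-level",
        dest="log_level",
        type=log_level_type,
        default="CRITICAL",  # assumed default level
        help="minimum level of messages to log",
    )
    parser.add_argument(
        "--log-file",
        dest="log_file",
        default=None,
        help="write log output to this file instead of the console",
    )
    return parser


def configure_logging(args):
    """Apply the chosen level to a console handler, or to a file handler if requested."""
    level = AVAILABLE_LOG_LEVELS[args.log_level]
    if args.log_file:
        handler = logging.FileHandler(args.log_file)
    else:
        handler = logging.StreamHandler()
    handler.setLevel(level)
    root_logger = logging.getLogger()
    root_logger.setLevel(level)
    root_logger.addHandler(handler)
    return handler
```

Because log_level_type raises argparse.ArgumentTypeError, an invalid level is reported by argparse itself as a usage error, which keeps the validation next to the option definition and makes the invalid-option case easy to test.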
The fun part for me is always in making sure that a change like this is tested properly, and this was not an exception. As this is a new set of command line options, existing tests that listed existing command line options were updated. Additional tests were added to test_main.py to specifically test the new options, including tests specifically around specifying invalid options.
I am not sure if it felt good to have this issue taken care of as much as I felt like I should have got to this one before that moment. With this being such a core fixture in other command line applications I have written, I just felt like this should have been addressed a lot sooner. Still, it was good to get it out of the way.
Better Logging Support¶
This change was a small-ish change, but it was one that I was overdue to explore. Back at the start of May, in an earlier article, I noted that while I might be able to get away with a single static logger variable at the top of my modules, I had not seen any good documentation on the “right” way to do it. When I looked at various examples, such as the one at Python Spot, the examples always seemed to show logging within a single-module application, and not within a multiple-module application like PyMarkdown. As such, I decided to add localized logger variables until I could do some research and figure out the proper way to add logging to each module.
It was not until I got the time to do more thorough research that I was able to find a good example of how to log with multiple modules. While the Python logging docs have a good section on “Logging from multiple modules”, it was actually the section titled Advanced Logging Tutorial that gave me the information I was looking for. While not an example, the guidance that is given near the top of this section is quite clear:
A good convention to use when naming loggers is to use a module-level logger, in each module which uses logging, named as follows:
logger = logging.getLogger(__name__)
This means that logger names track the package/module hierarchy, and it’s intuitively obvious where events are logged just from the logger name.
Yes! After searching for some clear guidance for weeks on this, I finally found something that was both authoritative and definitive. Even more, the last sentence contained a good, solid explanation of why this process should be followed.
In the grand scheme of things, this change took very little time. Instead of having a logger instance declared as logger within various classes and static methods, a new LOGGER instance was created at the top of each file per the guidance quoted above. The change from logger to LOGGER was a simple search-and-replace task for each file, quickly accomplished. The hard part here was removing any logger instances that were being passed into functions in favor of the LOGGER instance declared at the top of the file. Testing was also simple, as all I had to do was execute all of the tests again and make sure I did not miss anything.
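To make the convention concrete, here is a minimal sketch of a module that follows it. The count_words function and the "app" and "app.parser" logger names are placeholders invented for this example; they are not taken from the PyMarkdown source.

```python
import logging

# One logger per module, declared once at the top and named after the module.
LOGGER = logging.getLogger(__name__)


def count_words(text):
    """Hypothetical function: it uses the module-level LOGGER, so no
    logger instance has to be passed in as an argument."""
    words = text.split()
    LOGGER.debug("Counted %d words.", len(words))
    return len(words)


# Dotted logger names form a hierarchy, so "app.parser" is a child of "app".
# Both names are placeholders for a package and one of its modules.
package_logger = logging.getLogger("app")
module_logger = logging.getLogger("app.parser")
```

Since __name__ is the dotted module path, configuring the package-level logger once is enough for every module beneath it, which is why the same one-line declaration can simply be repeated at the top of each file.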
It felt good to get this one off the issues list and committed to the repository. If I had to guess, I think this one never made it into any of my refactoring lists because things were working okay, with no loss of functionality in PyMarkdown because of it. At the same time, there was this persistent nagging when looking at the issues list that I really needed to figure out the “right” way to do this… and now I know what that is.
But did this help me figure out the answer to my question? Nope. Taking some time to go back and look at my half-written notes, I was still at that same impasse. Nothing had changed. Hopefully, that would soon change.
Adjusting the Parsing of Whitespace in SetExt Tokens¶
Like the other issues that I addressed at this time, the effort to address this issue was small. Found during the development of rule MD023, this was a rare case where I felt that code added to a rule could have been positioned better in the core framework. While the added code in the rule was only a small amount of code, it was a case where I knew a better way of handling this whitespace was possible, as it was already being done for the related Paragraph token.
The two tokens that the SetExt heading token is related to are the Atx heading token and the Paragraph token. The SetExt heading token is related to the Atx heading token in that they are both heading tokens. The SetExt heading token is related to the Paragraph token as a SetExt token is created by changing an eligible Paragraph token with SetExt markings after it into a SetExt token. As such, when I wrote the code for rule MD023, I was surprised that the logic for detecting whitespace at the start of a Paragraph token was trivial but the similar logic for a SetExt token was more involved.
Digging into why those tokens were handled differently, I quickly determined that it was only a couple of small changes that separated the handling of those two tokens. Addressing these differences required a few simple changes during the coalesce phase and the inline processing phase, both ensuring that the processing afforded to Paragraph tokens was also being applied to its kindred SetExt tokens. That was immediately followed up by adding a couple of tests to make sure this change stuck, and then by a change to rule MD023 to make use of a more trivial calculation of leading whitespace.
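The benefit of that alignment can be sketched as follows. MD023 corresponds to markdownlint's check that headings start at the beginning of the line. The two dataclasses, their field names, and the helper function are hypothetical stand-ins used only to show the idea; they do not reproduce PyMarkdown's actual token classes.

```python
from dataclasses import dataclass


@dataclass
class ParagraphTokenSketch:
    """Hypothetical stand-in for a Paragraph token."""

    extracted_whitespace: str  # leading whitespace pulled out by the parser
    text: str


@dataclass
class SetextHeadingTokenSketch:
    """Hypothetical stand-in for a SetExt heading token."""

    extracted_whitespace: str  # stored the same way as for the paragraph
    text: str
    heading_character: str  # "=" or "-"


def starts_at_line_beginning(token):
    """With whitespace stored consistently, one trivial check covers both tokens."""
    return token.extracted_whitespace == ""
```

When both token types expose their leading whitespace in the same place, the rule no longer needs its own logic to reconstruct it for the SetExt case.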
Looking back at my notes while I am writing this article, I believe this was the start of me mentally tipping the scales towards spending time working on the issues. While this was not a big change, I believe that it represented a larger set of smaller things that I wanted to get right before moving on. I believe the change that was occurring was a subtle change in how I was weighing the various categories of issues.
That weighing was starting to put more emphasis on fixing issues, specifically issues with the parser. The previous two issues that were addressed, both dealing with logging, did not seem to affect my decision at all. This issue, dealing with the parser, moved the weighing enough that I noticed it. While it was only barely noticeable at this point, that feeling was going to get stronger as I addressed the next issue.
Properly Grouping Hard Line Break Whitespace¶
One of the things that I noticed when fixing the previous issue was that where hard line breaks were concerned, they did not have any leading whitespace embedded within them. It was not a big issue. If I was not looking at the test cases for the previous issue, I would not have seen this issue. It was just silently waiting to be discovered.
It may seem like a small thing, but I have a “rule” that any whitespace before any non-text token goes with the token that follows it. Just before I started writing rules, I noticed that many of the tokens were following this rule, so I decided to apply this pattern as a blanket rule over all the tokens. The benefit to this approach is that I have a consistent expectation of where the leading whitespace will be placed after it is extracted. That benefit allows me to write better, more consistent rules where I always know where to look for that whitespace.
The fix for this issue was almost trivial. For the most part, it was adding another
parameter to the
HeadBreakMarkdownToken constructor, passing it either the
line continuation character
\ or the leading whitespace that caused the hard line
break to occur. A bit of cleaning up with the resultant variables, and it was done.
But as with the previous issue, I could feel the weighing of my priorities changing.
This was another small thing, so small that it went undetected until I chanced upon
it. But the thing was, it started me thinking: If I was able to find this issue, what
other issues were lurking in the parser, waiting to be discovered?
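As a small sketch of the idea behind that fix, the token below records what caused the hard line break: either the backslash or the run of trailing spaces. The HardBreakTokenSketch class and the find_hard_break function are hypothetical and heavily simplified; they ignore details such as escaped backslashes and the last line of a paragraph, and they are not the project's real constructor or parsing code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HardBreakTokenSketch:
    """Hypothetical token that keeps the characters that triggered the break."""

    line_end: str  # either a backslash or the trailing run of spaces


def find_hard_break(line: str) -> Optional[HardBreakTokenSketch]:
    """Return a token if the line ends with a backslash or with 2+ spaces."""
    if line.endswith("\\"):
        return HardBreakTokenSketch("\\")
    stripped = line.rstrip(" ")
    trailing_spaces = line[len(stripped):]
    if len(trailing_spaces) >= 2:
        # The whitespace travels with the break token, not the preceding text.
        return HardBreakTokenSketch(trailing_spaces)
    return None
```

Keeping the whitespace on the token that it belongs to means a rule can always look in the same place for it, which is the consistency the blanket rule is meant to provide.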
Consistently Using the Word “Heading”¶
While I was very aware that this was a non-code related task, I felt that it was a good time to get it out of the way. During my documentation of the first three rules, I decided to choose “heading” over “header” for the reasons outlined in that article. However, even though I had made that change in my articles, I had not made the accompanying changes in the PyMarkdown source.
This was a quick search-and-replace, followed by running of tests to make sure things were good. Experiencing no bumps and no typos, everything went fine. While purely cosmetic, it felt good to make sure that the blog and the source code were in sync with each other. Making sure they were in sync just felt good.
I have long since got rid of the notes I used during the early days of writing the parser for PyMarkdown, so any attempt at figuring out why I added this class near the start of the project on 29 January 2020 is probably going to fail. My best guess, based on what I can see in the GitHub repository, is that perhaps I believed that having a specific end token for each start token type was the way to go. Regardless, since that time I have adopted an approach with a single EndMarkdownToken that refers to its MarkdownToken by name. This approach has proven to be quite practical, as most of the operations that rely on EndMarkdownToken instances do not need any contextual information except for the starting token’s name. As such, having a specific EndMarkdownToken class that matches each start Markdown token feels like overkill to me, with little benefit to show for the added complexity.
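A minimal sketch of the single end token approach might look like the following. The class names, the field names, and the "end-" naming prefix are assumptions chosen for illustration; the actual EndMarkdownToken in PyMarkdown may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class StartTokenSketch:
    """Hypothetical stand-in for any start token, identified by its name."""

    token_name: str


@dataclass
class EndTokenSketch:
    """Hypothetical single end token: it only remembers the start token's name."""

    start_token_name: str

    @property
    def token_name(self) -> str:
        # Assumed naming scheme: prefix the start token's name with "end-".
        return "end-" + self.start_token_name


def is_end_of(end_token: EndTokenSketch, start_token: StartTokenSketch) -> bool:
    """Match an end token to a start token using nothing but the name."""
    return end_token.start_token_name == start_token.token_name
```

With this shape, supporting a new start token type needs no new end token class, which is the reduction in complexity described above.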
Removing this token was easy. The class was removed, and the leaf_block_processor.py module was changed to add an EndMarkdownToken instance for the related SetExt heading. The rest of the changes in the commit for this change were holdovers from the previous changes, where I had forgotten to do a clean build and record the changes.
This change was cosmetic, but like other issues detailed in this article, it changed the weighing of the issues even more. Once again, the change was not a dramatic one, but it was enough that at this point, it was noticeable.
What Was My Choice?¶
Having addressed a good handful of small issues that did not make the big lists for refactorings, the balance of the scales had shifted enough that I knew that I had a good solid answer: fix more issues. I did not feel that I would be wrong in adding more rules, just that I wanted to focus on ensuring that a number of the smaller issues were given some attention to ensure they were resolved properly. It also did not feel like I had lost any confidence in writing rules; that was still at a healthy level.
In the end, I believe it came down to a solid understanding that if I was going to write more rules with this framework, I wanted to make sure that any obvious issues were dealt with. The largest of those issues that needed to be addressed was adding the proper line number and column number support to the tokens. But it also meant working through the issues that I found during the first 12 rules and either verifying they are issues and addressing them or explaining why they were okay.
What Was My Experience So Far?¶
I strongly believe that the process of taking my time and working through those low priority issues gave me some valuable insight into the project that I had missed before. While the primary catalyst for being able to properly answer the question was the set of parser issues that I resolved, I do not discount the insights provided by looking at the source code at a higher level than usual. I believe that by allowing myself time to absorb the project code at a higher level, it opened some doors in my mind that allowed me to be better influenced by the issues I fixed. I cannot prove that of course, but as with all feelings, it just is.
And while it was initially irritating that I could not answer the features versus issues question, I now believe it was inevitable. Especially since I am the only one working on this project, I do not have anyone to remind me to stop looking at the trees and focus on the forest. That change in perspective really helped me to get a clearer picture, and for that reminder, I am grateful.
One thing that I did not expect was that answering this question would take almost 2 weeks. I started the planning for this block of work on 02 May and it was 16 May before I had the planning in place for the next block of work. The thing is, at no time during the process did I want to timebox this process. It took 2 weeks because that is what I needed to properly answer this question. And I was fine with that.
For the most part, I am also okay with spending some time making sure I get the parser right before moving on to authoring more rules. Sure, it means I am extending the development cycle out by at least a couple of weeks, but I think that the time will be well spent.
What is Next?¶
Having answered the question of rules vs foundation, it was time to tackle one of the big issues that I had listed: line numbers and column numbers. I knew this was not going to be an easy change, but that just told me I should make sure I do it right the first time!
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.