Markdown Linter

Introduction¶

While it just happened to be the feature that was last on the list, I feel that it was kind of fitting that image links were the last feature to be added. Whether or not an author creates Markdown content for a blog, documentation, or some other purpose, a simple image can often improve the readability of each of those forms of documents. Imagine going to your favorite game blog or art blog and seeing very good descriptions of the topic being blogged about, but no pictures. It just makes sense to add images to documents, where needed, to enhance that document’s ability to convey the message that the author is delivering.

From there, there are just a small number of cleanups that I want to do before going on, so let’s go!

What Is the Audience For This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 27 March 2020 and 04 April 2020.

What Are Image Links?¶

As I have covered in previous articles, a simple inline link is constructed as:

[foo](/url "title")

and the corresponding reference link (and matching link reference definition) is constructed as:

[foo]: /url "title"

[foo]

both producing the following HTML:

<p><a href="/url" title="title">foo</a></p>

In both examples, by changing the opening character sequence [ to the opening character sequence ![, we transform the destination HTML into the following image link:

<p><img src="/url" alt="foo" title="title" /></p>

Besides the different starting characters, the first difference is that instead of the a or anchor tag being produced, the img or image tag is produced. Due to that change, the href attribute is replaced with the src attribute, the title attribute is left alone, and the inline text contained within the normal anchor link is replaced with the alt attribute text.

The second difference is in the text that is assigned to the alt attribute. For normal links, the link text or link label is interpreted as inline text and placed between the <a> and </a> tags. As this text has been moved to the image tag’s alt attribute, a transformation must occur to ensure that only valid text is assigned to the attribute. While not strictly specified as a “MUST” in the specification, the images section of the specification mentions:

Though this spec is concerned with parsing, not rendering, it is recommended that in rendering to HTML, only the plain string content of the image description be used.

This transformation is on display with example 581 as the Markdown text:

![foo *bar*]

[foo *bar*]: train.jpg "train & tracks"

is transformed into:

<p><img src="train.jpg" alt="foo bar" title="train &amp; tracks" /></p>

In the above example, the * character indicating emphasis is removed from the string foo *bar*, leaving the attribute value foo bar to be assigned to the alt attribute. For other inline processing sequences, similar transformations are made to ensure that the vital information is kept with the token.
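A minimal sketch of that transformation, assuming a simplified and hypothetical InlineNode type rather than the project's actual token classes: text and code nodes contribute their literal content, while emphasis and other container nodes contribute only the text of their children.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InlineNode:
    """Hypothetical inline node, used only for this illustration."""

    kind: str  # for example "text", "code", "emphasis", "link", or "image"
    text: str = ""  # literal content for "text" and "code" nodes
    children: List["InlineNode"] = field(default_factory=list)


def plain_text(nodes: List[InlineNode]) -> str:
    """Collapse inline nodes to the plain string content used for alt."""
    parts = []
    for node in nodes:
        if node.kind in ("text", "code"):
            parts.append(node.text)
        else:
            # Container nodes such as emphasis keep their text but lose their markup.
            parts.append(plain_text(node.children))
    return "".join(parts)


# The image description "foo *bar*" as a list of inline nodes.
description = [
    InlineNode("text", "foo "),
    InlineNode("emphasis", children=[InlineNode("text", "bar")]),
]
alt_text = plain_text(description)
```

With this input, alt_text holds the string foo bar, matching the alt attribute in the example above.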

The final difference is a small change to the rule that links may not contain other links. This rule is changed so that links may not contain other links unless those links are image links. Consider example 525:

[![moon](moon.jpg)](/uri)

which is transformed into:

<p><a href="/uri"><img src="moon.jpg" alt="moon" /></a></p>

While this may not look useful, consider the following Markdown:

[![moon](https://static.thenounproject.com/png/2852-200.png)](https://www.bing.com/search?q=the+moon)

Once transformed, that Markdown is rendered as an image of a moon.

That Markdown snippet produced an image tag within an anchor tag. This presents the reader with an image of a moon that may be clicked on. When that image is clicked on, the link that surrounds the image link is acted upon, navigating to the link supplied in the outer link. In this case, clicking on the image of the moon takes the reader to the Bing search page, already primed with the search text for the moon. This feature is often used to present styled buttons to a user instead of simple text for them to click on. Personally, I think the right image to click on makes more of an impact, and having that ability available to Markdown authors is a plus.
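To make the mapping between the two kinds of links concrete, here is a small sketch. The render_link and render_image helpers are hypothetical names for illustration and are not the project's functions. It is also simplified: it HTML-escapes attribute values with the standard library's html.escape but skips the URL normalization that a full implementation would need.

```python
import html
from typing import Optional


def _title_attribute(title: Optional[str]) -> str:
    """Return a ' title="..."' fragment, or an empty string without a title."""
    return f' title="{html.escape(title)}"' if title else ""


def render_link(href: str, inner_html: str, title: Optional[str] = None) -> str:
    """Anchor tag: the destination goes to href, the link text sits between the tags."""
    return f'<a href="{html.escape(href)}"{_title_attribute(title)}>{inner_html}</a>'


def render_image(src: str, alt: str, title: Optional[str] = None) -> str:
    """Image tag: the destination goes to src, the plain link text goes to alt."""
    return f'<img src="{html.escape(src)}" alt="{html.escape(alt)}"{_title_attribute(title)} />'


plain_link = render_link("/url", "foo", "title")
plain_image = render_image("/url", "foo", "title")

# An image link is allowed inside a normal link, so the image HTML
# simply becomes the inner content of the anchor.
image_inside_link = render_link("/uri", render_image("moon.jpg", "moon"))
```

Here image_inside_link reproduces the anchor-wrapped image tag from the moon example, without the surrounding paragraph tags.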

Simple Cleanups¶

With image links done, what remained were the various cleanups that I wanted to do before declaring the parser “done”. While each of these cleanups required code to be rewritten, my stated goal with these cleanups was not to fix bugs, but to get the code base in a more maintainable shape before fixing those bugs. It was my hope that by doing things in this order, it would make any bug fixing that would occur in the future easier to perform and easier to validate. We will see how that worked out in the next article! For now, on to the cleanups!

Simple Cleanup 1: Splitting Up the TokenizedMarkdown Class¶

The first cleanup on my list was to split up the TokenizedMarkdown class into more clearly defined classes. While I could have done this at an earlier stage of the project, I wanted to take a different approach with this project. While the ParserHelper and HtmlHelper classes were required for the early stages of the project, there were other possible groupings I wondered about. Basically, I wondered if I would make the same grouping decisions at the end of the parser phase as I would have at the start of that same parser phase. In short, I wanted to experiment. More on the results of that experiment at the end of this section!

Going through the code and separating each grouping out was a chore, but a useful one. The first effort was to come up with the larger groupings. Arriving at these groupings was a simple task, as the processing for the tokens is broken up into three sections: the functions for the preliminary tokenization of the input were left in the TokenizedMarkdown class; the functions combining contiguous text blocks were moved into the CoalesceProcessor class; and the functions to handle inline processing were moved into the InlineProcessor class. This was a major chore, but it reinforced the separation between the various processors and classes in the process, so all was good!
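As a rough illustration of what the middle stage does, the sketch below merges runs of adjacent text tokens. The tuple-based token shape and the coalesce_text_tokens name are assumptions made for this example. The project's CoalesceProcessor class works on its own token classes, and joining with a newline is likewise just a choice made for the sketch.

```python
from typing import List, Tuple

# Hypothetical, simplified token shape: (kind, text).
Token = Tuple[str, str]


def coalesce_text_tokens(tokens: List[Token]) -> List[Token]:
    """Merge each run of adjacent "text" tokens into a single token."""
    merged: List[Token] = []
    for kind, text in tokens:
        if kind == "text" and merged and merged[-1][0] == "text":
            # Extend the previous text token instead of starting a new one.
            merged[-1] = ("text", merged[-1][1] + "\n" + text)
        else:
            merged.append((kind, text))
    return merged


tokenized = [
    ("paragraph-start", ""),
    ("text", "first line"),
    ("text", "second line"),
    ("paragraph-end", ""),
]
coalesced = coalesce_text_tokens(tokenized)
```

Keeping this step separate means the later inline stage sees one text token per block and does not need to know how the block was split into lines.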

With those major regroupings undertaken, two of those three classes still contained huge blocks of code, so it made sense to do further refinements along more feature-based lines. The TokenizedMarkdown class was my first target and was broken down into the ContainerBlockProcessor class and the LeafBlockProcessor class, which just seemed like logical groups to extract from the main class. From there, it similarly seemed that the ContainerBlockProcessor class was still too large, so I further extracted the ListBlockProcessor class and the BlockQuoteProcessor class. At this point, I felt more confident that the different functions for tokenization were in solid, well-defined, well-sized groups.

Similarly, the InlineProcessor class was too large so I extracted functions into the EmphasisHelper class, the LinkHelper class and the InlineHelper class. The main brains of inline processing remained in InlineProcessor, coordinating when to apply inline processing to the various tokens that needed it. To allow for special processing to be handled properly, the coordination of the [, ![, and ] link strings, and the * and _ emphasis strings was also kept in the InlineProcessor class, while the actual processing of emphasis and link were moved to the EmphasisHelper class and the LinkHelper class. The remaining functions that implemented the less intensive inline processing were added to the InlineHelper class. Like my observation for the tokenization classes in the last paragraph, to me this just seemed to be the right groupings for these functions.

Comparing my choices against some notes I had scribbled down at the start of the project, I found that I was decent at projecting the larger strokes of the grouping, but really bad at my stab at the finer groups. While the names are different from the ones I used, I was spot on with the 3 high-level processors: tokenizing, coalescing, and inline. Furthermore, when it comes to the specific classes for container blocks and leaf blocks, I had those spot on.

And it was there I stopped. I had a couple of scribbles for emphasis being on its own with a question mark beside it, but that was it. The rest of the scribbles were all followed by question marks, including a guess that each leaf block and container block should be in its own class. If I had to guess as to why I thought that way, I would wager that I thought I would need more code for the inline processing while using less code than I anticipated for the block processing. When those assumptions on code amounts changed, it caused me to look at the rest of the code and extract more specific groups into the groups detailed above.

The end results? While I feel it was good to do a good high-level design of the function groupings, I need to be prepared that those plans are going to need to be revisited and redefined once I start writing code. But still, a useful experiment, and the project’s code base was also more maintainable at the same time!

Simple Cleanup 2: Reducing Complexity¶

Up to this point in the development of the project, I was more intent on completing the parser itself than on completing it with good organization and low complexity. While the simple reorganizations that I documented in the last section were a start, there were three sets of complexity warnings that I disabled until later: the too-many-branches warning, the too-many-statements warning, and the too-many-locals warning. While none of these affected the quality of the parsing, I knew that a result of disabling these warnings was that I would have to start resolving them at the end of the main block of parser development. As that time had arrived, my bar tab was now due.

For the most part, the resolution for all three warnings was the same: split the code that the warning was complaining about into smaller, more focused functions.

Too many branches? From experience, I have only run into 2 or 3 cases where having a method with more than 10 or 15 branches was the right thing to do. Most of the time, too many branches means you are doing too much in your function, and by splitting up the functionality into multiple functions, you keep the number of branches down and the comprehension on what each function does goes up. It’s almost always a win-win.

Too many statements? Pretty much the same argument, but exchange statements for branches. Once again from experience, if you have a function that has more than 25-30 statements in it, it is hard for most people to truly understand what that function is doing without a lot of comments. Splitting the functionality across multiple functions allows for those statements to be associated with named functions that describe their purpose, instead of the reader trying to figure out “what does that section do?”
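A small sketch of that kind of split, using hypothetical helper names and deliberately simplified checks that ignore details such as the specification's indentation limits: each test gets its own named function, and the top-level function is left with a short, readable list of decisions.

```python
def _is_blank(line: str) -> bool:
    """A line with only whitespace."""
    return not line.strip()


def _is_atx_heading(line: str) -> bool:
    """Simplified check: one to six '#' characters, then a space or end of line."""
    stripped = line.lstrip(" ")
    hashes = len(stripped) - len(stripped.lstrip("#"))
    return 1 <= hashes <= 6 and (len(stripped) == hashes or stripped[hashes] == " ")


def _is_block_quote(line: str) -> bool:
    """Simplified check: the first non-space character is '>'."""
    return line.lstrip(" ").startswith(">")


def classify_line(line: str) -> str:
    """Each branch delegates to a named helper instead of inlining its logic."""
    if _is_blank(line):
        return "blank"
    if _is_atx_heading(line):
        return "atx-heading"
    if _is_block_quote(line):
        return "block-quote"
    return "paragraph"
```

The helper names answer the "what does that section do?" question directly, and each helper can be tested on its own.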

Too many locals? This is a tricky one with parsers, so I left it for last. In most cases, what I described for branches and statements goes for locals. Too many of them gets in the way of a clear understanding of what a function does. From personal experience, when the number of locals in a function exceeds somewhere between 10 and 15 variables, I usually need good logging or a good debugger to really understand what each of the values should be at any point in the function. Under 10 and I am okay.

But parsers? They are often the exception to the rule for a lot of things. The amount of state needed to properly parse something often causes a lot of locals and arguments to be declared and passed around, many of them for temporary computations. For example, whenever a specification says “collect all X up to Y, except for”, it usually means that one variable is needed for collecting, possibly one variable to report errors, and the “except for” at the end most often means passing in some kind of state, often from another part of the program. Since parsers are more grammatically based than other types of programs, their need to interface with these things called “users” causes more variations that need to be dealt with than other types of programs. At least from my experience, that seems to be the case.
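One common way to keep that state manageable is to bundle related values into a small object instead of juggling separate locals and return values. The sketch below is an illustration of the "collect all X up to Y, except for" pattern under assumed names (CollectResult, collect_until) and is not the project's code. It collects characters up to a stop character, except when that character is escaped.

```python
from dataclasses import dataclass


@dataclass
class CollectResult:
    """Groups the values a collecting function needs to hand back."""

    collected: str  # the characters gathered so far
    next_index: int  # where the caller should resume parsing


def collect_until(source: str, start: int, stop_char: str, escape_char: str = "\\") -> CollectResult:
    """Collect characters from start up to stop_char, honoring escapes."""
    collected = []
    index = start
    while index < len(source) and source[index] != stop_char:
        if source[index] == escape_char and index + 1 < len(source):
            # Skip the escape so the following character is taken literally.
            index += 1
        collected.append(source[index])
        index += 1
    return CollectResult("".join(collected), index)


result = collect_until('abc\\"def"rest', 0, '"')
```

In this run, result.collected keeps the escaped quote as a literal character, and result.next_index points at the unescaped closing quote.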

For the most part, I managed to clean up all cases of the branch warnings and statement warnings and tried to address the local warnings. While I did not complete all of them, I felt it was a solid effort to get the parser into a more maintainable shape. And as with all the other cleanups documented here, I do believe that each one is contributing to the health of the project. And that is a good feeling in my books!

Simple Cleanup 3: PyCharm Static Code Analysis¶

When I am authoring projects, I typically have a specific development environment or editor for each language. For my C# projects I use Visual Studio by Microsoft, and for Java projects I use IntelliJ by JetBrains. In both cases, these editors are widely accepted as being among the best editors for that specific language. While Visual Studio can handle Python and JetBrains has a Python-specific editor (PyCharm), I find that I personally prefer a one language to one editor relationship. As such, I write most of my Python code in Visual Studio Code.

However, while I do not use either editor for Python development, I do find that PyCharm has some usefulness in running static analysis passes on my Python projects. From experience, the hints that PyCharm provides make the cost of the manual use of it as a Static Code Analysis tool worthwhile. And as I was finished with the bulk of the development on the parser part of the project, I thought it was a good time to look at the project through that lens.

One immediate benefit was that PyCharm has a very decent analysis engine, and it gave me hints on how to improve the flow of my functions. When none of the other functions that use my function make use of the function’s return value, PyCharm told me that it makes sense to remove the return value. When arguments or variables were not being used, PyCharm suggested that they should be removed. And when I got into the bad habit of incrementing a value by using a = a + 1, PyCharm pointed out that Python also has the += operator and I should rewrite the previous snippet as a += 1. All in all, a lot of small, but useful improvements to clean up the project.
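As a tiny illustration of those hints, the hypothetical function below is written in the "after" form, with the original habit noted in a comment.

```python
def count_leading_spaces(line: str) -> int:
    """Count the space characters at the start of a line."""
    count = 0
    for character in line:
        if character != " ":
            break
        # Previously written as: count = count + 1
        count += 1
    return count


indent = count_leading_spaces("   > quoted text")
```

The behavior is identical either way. The augmented assignment is simply the more idiomatic spelling.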

While my manual usage of PyCharm does not replace the need for running Flake8 and PyLint over the project before I commit any changes, I believe it offers me a fresh view of the code base and gives me hints on where it thinks I can do better. As an added bonus, with few exceptions, the changes are minor, and the entire batch of changes can be applied quickly. Personally, I consider using PyCharm in this manner akin to asking a friend to take a read over a document that I have been working on before I submit it. As such, it is not always about making the changes that PyCharm suggests but thinking about those changes and deciding if those changes are what is best for your project.

Simple Cleanup 4: Pulling Strings to The Top¶

I wanted to do this cleanup for a while, partially for code cleanliness reasons and partially for performance reasons. In writing the first iteration of the parser, I knew that I was focusing on getting the parser complete and not always following best practices. One of those best practices that I use, especially for parsers, is to make sure that strings that I know are going to be used over and over again are pulled out of their functions/methods and defined once at the top of the class. In the hurry of wanting to get the parser done, I had not kept this practice up. As such, there were multiple instances where I had certain strings, such as the block quote string >, in multiple places, instead of one place.

By aggregating these strings to the top of the modules where they were declared, I was able to clean up the code and impact performance. By moving the strings to the top of the modules, it made it easier for me to see the strings and concepts that were in each of the modules. This also has a nice benefit of having a variable name associated with the strings, allowing for easier searching through the project for those concepts. From a performance point of view, this also reduces the number of times that string is declared within the module, using references to the module’s instance of the string instead of creating a new instance every time that piece of code is executed.
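A minimal sketch of the pattern, using hypothetical names rather than the project's actual identifiers: the block quote string is defined once at the top of the module and every function refers to it by name.

```python
# Defined once at the top of the module. The name is an assumption for this sketch.
BLOCK_QUOTE_CHARACTER = ">"


def is_block_quote_start(line: str) -> bool:
    """Simplified check: the first non-space character is the block quote character."""
    return line.lstrip(" ").startswith(BLOCK_QUOTE_CHARACTER)


def strip_block_quote_marker(line: str) -> str:
    """Remove one leading block quote marker and one optional following space."""
    stripped = line.lstrip(" ")
    if not stripped.startswith(BLOCK_QUOTE_CHARACTER):
        return line
    remainder = stripped[len(BLOCK_QUOTE_CHARACTER):]
    return remainder[1:] if remainder.startswith(" ") else remainder
```

Searching the project for BLOCK_QUOTE_CHARACTER then finds every place that deals with block quote markers, which is much harder to do when searching for a bare > character.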

All in all, for both cleanliness and performance reasons, it was a good cleanup to do!

What Was My Experience So Far?¶

From a code completion point of view, I felt that I had done well. Sure, there are still 5 scenario tests that are disabled and a number of issues and bugs that I have logged for further research. But other than that, it was finished.

I am sure that some of the issues are just clarifications that I need to prove are done right, and that there are some real bugs that I am going to have to fix. I am also sure that those two lists are not going to be static and are going to interact with each other in some way. I am confident that while 1 of the disabled tests will require some interesting thinking on my part (nested image links), the other 4 disabled tests will require some serious changes to how I process lists. But even with these in mind, I also have faith that, due to the large number of scenario tests, all of these issues will probably only account for less than 2% of all of the scenarios I am ever going to run across, and some of the more obscure scenarios to boot. As such, I believe I can take my time to address them when I have time, and not rush to solve them right away.

From a code quality point of view, I was happy to have some time to do some clean up. In their own way, each of the cleanups helped the health of the project. The first cleanup added better organization of the code, the second and third cleanups increased the maintainability of the code, and the fourth cleanup added a bit of a performance improvement. Each of those cleanups moved the project to a happier place, quality-wise, in their own way. Big plus there for me!

From a development point of view, getting to this point was indeed a milestone, but it was with mixed feelings that I got here. With respect to the base parser, outside of some issues that I have previously noted, it was complete. Sure, I knew that I needed to fix those issues, but after 713 examples (with 5 examples skipped), it was done. Along the way, I supplemented the existing 648 examples with 70 new examples to properly test out the parser. And sometimes it was a hard journey to get those working.

Not to complain… but I initially felt that my fun was over. Could I write a parser to properly parse Markdown into intermediate tokens? Yes. Would I be able to finish it while keeping the code coverage and scenario coverage high? Yes. Would I then be able to take those tokens and not only use them to power rules, but to emit the specification’s HTML? Yes. For me, determining whether I can complete challenges like that is fun, and having arrived at the answer, I felt that a lot of the project’s fun was completed.

But in thinking about that some more, I realized that while one part of the fun for the project was over, it was up to me to decide whether I would still consider the project fun. As I largely equate challenges with fun, that means it is up to me to find the challenges left in the project, and to overcome them. Since this is my first real project in Python, at the very least there should be challenges around releasing it. Basically, I just need to take a better look at the project to determine where the challenges are. It just took me thinking about what perspective I wanted to have.

What is Next?¶

While I was anxious to get to writing rules, there were some things that needed to be addressed with the parser before I could continue. While I could have left them there and dealt with them later, I wanted to take a good crack at the top items on the list and deal with them now, before the rules were added. So next article: “Refactoring: The Sequel!”

So what do you think? Did I miss something? Is any part unclear? Leave your comments below.
