It All Takes Planning
When I started the PyMarkdown project, I just wanted to prove that I could write a parser in Python and have it be decent. While I have proven the first part, that I can write a parser in Python, I am still struggling with whether I would call it a decent parser. For me to call it a decent parser means that I have thought out and implemented a variety of things that I would say are essential to a parser.
Starting With The Specification and Specification Test Cases
For me, the start of any parser is having a good understanding of what it is that I am trying to parse. In the beginning, I do not need to understand the entire specification, but I do need to have read the entire thing from the first line to the last line. Call me foolish, but until I have the one complete pass through the specification, I do not believe I have an adequate picture to start with.
I know I am not going to remember the entire specification. I would be impressed by anyone who could remember the entire specification after just one pass. What I look for are those parts of the specification that I can tell will cause me issues. At the very least, I want to try to keep those in mind when designing the parser components. If possible, I write down scenario tests against the specification that I can use to properly test my parser. But to be honest, on a first pass, that is a nice-to-have, not a must-have.
For any Markdown parser, the golden specification is the GitHub Flavored Markdown specification. As a mature specification, it not only has the raw specification itself, but over six hundred scenarios and their expected HTML output. While many specifications have not included this information in the past, this is starting to become a more frequent occurrence. If not in the document itself, those test cases appear in a companion document and are being seen as the best way to concretely describe the specification. As a bonus, there are some rudimentary scenarios that the parser is expected to manage.
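The scenario format itself is simple: each example in the specification is a fenced block containing the Markdown input, a `.` separator line, and the expected HTML output. As a minimal sketch of pulling those pairs out of a spec-style file (my own illustration, not PyMarkdown's actual test harness), assuming the 32-backtick fences that the CommonMark and GFM spec documents use:

```python
import re

# CommonMark/GFM spec documents fence each example with 32 backticks.
FENCE = "`" * 32
SPEC_SNIPPET = f"{FENCE} example\n*foo*\n.\n<p><em>foo</em></p>\n{FENCE}\n"

def extract_examples(spec_text):
    """Yield (markdown, expected_html) pairs from a spec-style document."""
    pattern = re.compile(
        r"^`{32} example\n(.*?)\n\.\n(.*?)\n`{32}$",
        re.MULTILINE | re.DOTALL,
    )
    for match in pattern.finditer(spec_text):
        yield match.group(1), match.group(2)

examples = list(extract_examples(SPEC_SNIPPET))
print(examples)  # [('*foo*', '<p><em>foo</em></p>')]
```

With the pairs extracted, each one becomes a candidate scenario test: feed the Markdown half to the parser and compare against the HTML half.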
From a parser point of view, I believe that the PyMarkdown project has accomplished that goal in spades. With over five thousand scenario tests currently in the project, I can confidently state that over 99% of them pass, and I am actively getting the remaining tests passing. To be clear, there are four failing tests dealing with block quotes and lists starting in different columns, and they are on the near-future list of things to address.
Clear Goals and Requirements
Having one or more clear goals is important before working on any project. For the PyMarkdown project, the main goal was a simple one: to provide a framework with which I can create rules that can evaluate Markdown and provide feedback on its structure. From there, the main requirements were developed.
For those readers who do not want to read the entire article from December 2019, those requirements are:
- must be able to see an accurate tokenization of the Markdown document before translating to HTML
- all whitespace must be encoded in that token stream as-is
- initial tokenization for GitHub Flavored Markdown only, add others later
- must be able to provide a consistent lexical scan of the Markdown document from the command line
- extending the base linting rules should require very little effort
- written in Python
Coming up on three years since that article was written, it is fair to look back and figure out how the project is meeting those requirements.
From a project perspective, the good news is that the project itself has come a long way in meeting those requirements. Without reservation, I believe that the accurate tokenization, GFM parsing, consistent tokenization, and written-in-Python requirements have been solidly met. And to be clear, I do mean without reservation, not merely close with a few reservations. With the entire parser written in Python and over five thousand scenario tests passing, there is no doubt in my mind that it meets those requirements.
With respect to the whitespace encoding requirement, the project is most of the way there. To finish things up, I just need to change the whitespace tests that I recently wrote to properly look for tabs where tabs were specified. It is not a big change, but it will finish the adherence to this requirement.
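The shape of that change is simple: any tab character in the source must show up verbatim in the captured token text, rather than being silently expanded into the equivalent spaces. A tiny sketch of the assertion, using a hypothetical helper and hypothetical token text, not PyMarkdown's actual token format:

```python
def tabs_preserved(source_line, token_text):
    """Hypothetical check: every tab in the source line must survive
    into the token text as a literal "\t", not as expanded spaces."""
    return source_line.count("\t") == token_text.count("\t")

# A tab after the list marker must stay a tab in the token.
print(tabs_preserved("-\tlist item", "-\tlist item"))    # True
print(tabs_preserved("-\tlist item", "-    list item"))  # False: tab expanded
```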
That leaves the extending requirement, and it is an important enough requirement that I have added a full section for that below. Looking at how things are so far, that comes out to the main goal being met, four of six requirements being met, one of those requirements being within reach, and one to be talked about later. Not bad.
Clean Separation of Responsibilities
Call it an old habit of mine, but I believe that one of my successes in both the testing field and the development field is my strict regimen around keeping a clear separation of responsibilities between various parts of a project. Unless there is no other way to accomplish it, I passionately believe that each section of a project should be kept cleanly separate from the others. For the PyMarkdown project, I think I have made substantial progress in this area. To be blunt, though, I have probably crossed that line more often than I should have. Sometimes it was just more expedient to deal with things that way. But I also know that I usually pay for crossing that line later.
There are good examples of where I have implemented clean separations in the PyMarkdown project. The first is the clear separation of the three stages of parsing into three separate processors: the block processor, the coalesce processor, and the inline processor. The first processor takes care of any blocks and produces a token stream of those container blocks and leaf blocks. The second processor takes care of coalescing any consecutive text tokens to allow the third processor to do the inline processing without having to span text tokens. Keeping each of those processors focused on one main concept is one of the things that I believe made the parser work as well as it has.
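As an illustration of what the middle stage does (the names here are my own, not PyMarkdown's internal classes), coalescing simply merges runs of adjacent text tokens so that the inline processor can treat each run as a single unit:

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str       # e.g. "paragraph", "text", "end-paragraph"
    text: str = ""

def coalesce_text_tokens(tokens):
    """Merge consecutive text tokens into single text tokens."""
    merged = []
    for token in tokens:
        if merged and token.kind == "text" and merged[-1].kind == "text":
            merged[-1] = Token("text", merged[-1].text + token.text)
        else:
            merged.append(token)
    return merged

stream = [
    Token("paragraph"),
    Token("text", "some "),
    Token("text", "text"),
    Token("end-paragraph"),
]
print(coalesce_text_tokens(stream))
# the two text tokens collapse into one: Token("text", "some text")
```

With that guarantee in place, the inline processor can scan each text token in isolation, never having to stitch emphasis or code spans across token boundaries.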
Not everything is perfect though. While the inline processing works well, and the handling of leaf blocks works well, the handling of container blocks needs a lot of work. One of the reasons that I have invested time into writing more scenario tests for the container blocks is that they need work. And once I get all those tests passing, I want to take the time to see if I can break down the container handling and refactor it. There are just too many little fixes in it for my taste.
Extensibility is the one thing that is tricky to implement in a parser. The rules themselves are extensible by design, so I get some points there. But when it comes to the parser itself, there are glimpses of getting ready to extend the parser, but no concrete steps.
One reason for that is that adding something to a parser can often introduce scenarios that were not thought of originally. Let me use one of the feature requests as an example. One user wants support for LaTeX formulas, with whitespace detected properly, like how normal emphasis works. This is required because LaTeX formulas can have arbitrary spacing and can use the `*` character as part of a formula.
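To make that conflict concrete, here is a deliberately naive emphasis pass (my own toy code, nothing like PyMarkdown's real inline processing) tripping over the `*` characters inside two formulas:

```python
import re

def naive_emphasis(text):
    """Pair up "*" characters with no awareness of formula spans."""
    return re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)

line = "the product $a * b$ of $c * d$"
print(naive_emphasis(line))
# the product $a <em> b$ of $c </em> d$  -- the two formulas bleed together
```

A formula-aware parser would have to recognize the `$...$` spans first and exclude their contents from emphasis pairing, which is exactly the kind of new interaction a feature like this introduces.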
But adding support for LaTeX blocks comes with questions that may be difficult to answer. Are there solid specifications for LaTeX blocks in Markdown? What are the differences between those blocks and emphasis blocks or code span blocks? Would I need to add new support code, or could I leverage existing code? What are the border cases that I need to deal with? And those are only the questions that come up with the LaTeX formulas themselves. I also must figure out whether there are any interactions with other elements, and what those interactions are.
Even before that starts, I need to take the time to extract the non-container code for elements and make them more modular. That is going to have its own problems. It is something I know I need to do, but I have put it off until I can make sure that the existing code is very solid. But that time is just around the corner, so I need to get ready.
From a history of working on projects from a test and quality perspective, quality is always a concern of mine, even though I rarely state it explicitly. It is just part of how I work. If I do not have test coverage at or near 95 percent, I need a good reason. If I do not have good user stories and scenarios, I take the time to research them. It is just part of my normal development process.
One of the most important decisions on quality that I made was not to trust myself to write the best Python code possible. Do not get me wrong, I am very decent at writing Python code, but I would not categorize myself in the “best” category. That is why I use tools like Pyroma, Flake8, PyLint, and Sourcery to scan my code with each commit. Running these tools does not make my code perfect, but they do make my code better. And that is good enough for me.
Where Does That Leave The Project?
Let me recap where my thinking is so far:
- Clear specification and test cases. Check.
    - Need to work on the variable start position cases.
- Clear goal and clear requirements. Check.
    - Most of the requirements are already met, so getting the whitespace requirement completed is a priority.
- Separation of responsibilities. Check.
    - Most of the code is in decent shape, with the container block processing being one of the outliers.
    - It is “working”, but I want to take the time to make it cleaner.
- Extensibility. Meh.
    - Plugins are extensible, but the parser itself is not.
    - Need to move common elements out into their own modules.
- Quality. Check.
    - Continuous integration with checks, both locally and on GitHub.
To be honest, when I started this article, I was concerned that I was going to uncover more things to be done than I have just listed. That is a relief. But at the same time, it is a good working list that will help me move the project forward!
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.