## Introduction
As I documented at the end of the previous article, having arrived at the decision to improve the linter's foundation, it was time to tackle the largest issue on the stack: missing support for line numbers and column numbers. While at the time I thought of it as a glaring omission (see this article for more), the passage of time since that discovery has helped me see that omission in a better light. On its own, getting the tokenization correct was a large enough task; adding the determination of each token's line number and column number would have made it unbearably larger. Knowing myself as I do, even if I had discovered that requirement during the design phase, there is a really good chance that I would have relegated the determination of each token's original position to its own task. Regardless of what happened in the past, it was obvious that I needed to start working on that task now.
Starting to think about this issue, I had no doubt it was going to be a tough task to complete, and I knew that I needed to find a way to break things down even further. As I figured that adding line numbers and column numbers to every token was going to be too much, I reduced the scope by limiting the work to only block tokens. As every inline element is rooted in a block element, I was confident that this would ensure the line numbers and column numbers were solid for each block element before adding the inline elements into the mix. It was also a solid dividing line for the task that just made sense to me.
With those decisions on scope dealt with, it was time to delve into the depths of line numbers and column numbers.
## What Is the Audience for This Article?
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project's GitHub repository and consult the commits between 16 May 2020 and 26 May 2020.
## Why Was Reducing the Scope Good?
Each of the implemented and enabled parser test functions includes the list variable `expected_tokens` and the string variable `expected_gfm`. The `expected_tokens` variable is used to ensure that the tokens that are expected from the Markdown document are emitted by the parser. The `expected_gfm` variable is used to ensure that those tokens can use a simple algorithm[^1] to transform themselves into the HTML required by the specification. Prior to this work, their focus was simple: enable that forward translation from Markdown into HTML.
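To make the shape of these tests concrete, here is a minimal sketch of one. The test name, the helper functions, and the serialized token strings are illustrative stand-ins, not copied from the project source.

```python
import pytest


@pytest.mark.gfm
def test_atx_headings_sketch():
    """A sketch of a scenario test: tokens forward, then HTML forward."""

    # Arrange: a small Markdown sample and the results we expect from it.
    source_markdown = "# Hello"
    expected_tokens = [
        "[atx(1,1):1:0:]",
        "[text(1,3):Hello: ]",
        "[end-atx::]",
    ]
    expected_gfm = "<h1>Hello</h1>"

    # Act: hypothetical helpers standing in for the project's real entry points.
    actual_tokens = tokenize_markdown(source_markdown)
    actual_gfm = transform_to_gfm(actual_tokens)

    # Assert: the parser emits the expected tokens, and those tokens
    # translate into the HTML required by the specification.
    assert [str(token) for token in actual_tokens] == expected_tokens
    assert actual_gfm == expected_gfm
```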
Adding in support for line numbers and column numbers was going to be a lot of work, but it was working in a different direction: backwards towards the original Markdown document. There was no denying that this was a difficult task to complete, but the reduction in scope left me with a far simpler problem to solve. In addition, to protect the work already done in the forward direction, the existing battery of tests remained in place, diligently guarding that work.
With protection for the already completed work in place, and a clear definition of the work to be done in mind, it was time for me to figure out how to start the process off.
## Where to start?
To help me keep track of things, I added each of the block element names to the `readme.md` document where I track my issues. As I needed to represent every Markdown block, I moved the question in the issue document regarding the adding of a Link Reference Definition token to the start of this new section and transformed it into a statement. As each block element in the Markdown document needed to be tagged with its position, it was a necessity that Link Reference Definition tokens be added, even if they did not result in any HTML output.
Having all the block elements listed in that document, I then needed to pick a good place to start. I was aware that wherever I decided to start, that token was going to have a lot more work associated with it, as I tested out and implemented the foundation used to add line numbers and column numbers to the other tokens. Whatever process and code I used for that first block element, the plan was to repeat it in some closely related form for each of the other 10 block elements in the list. But which one to pick?
In the end, I decided to start with Atx headings. Unfortunately, I did not keep any notes as to why I picked this block element to start. My best guess is that after having a lot of exposure to heading code while implementing the last group of rules, I figured headings would be a good place to start. Following that line of thought, I believe that I also would have guessed that SetExt headings were going to be the more difficult token to change, leaving Atx headings as the best place to start. Either that, or I decided to go alphabetic and ‘a’ is the first letter of the alphabet. Either way, Atx headings were first, and it was going to be a fun ride.
## Adding in Position Markers
Taking a good look at the Atx headings and the information that was being passed around, the first thing that caught my eye was a good opportunity to refactor. As I was letting parts of the parser grow organically, many functions included the argument `line_to_parse` along with an argument that was usually named `start_index`. This organic pairing made sense in a lot of places, allowing for the easy manipulation of the current index and, occasionally, the `line_to_parse`. However, when adding in support for line numbers, I was going to have to pass in a third variable, meaning it was time to refactor this. As I have said previously:

> Write it once, write it neat. Write it twice, think about extracting it. Write it three times, extract it without a second thought.

And that is where I started to work on replacing the `line_to_parse` variable and the `start_index`-like variables with a `position_marker` variable and the `PositionMarker` class. This class was, and remains, very simple:
```python
class PositionMarker:
    """
    Class to provide an encapsulation of the location within the Markdown document.
    """

    def __init__(self, line_number, index_number, text_to_parse):
        self.line_number = line_number
        self.index_number = index_number
        self.text_to_parse = text_to_parse
        self.index_indent = 0
```
The `line_to_parse` variable was replaced by the `text_to_parse` member variable, the index-type variable was replaced with the `index_number` member variable, and the `line_number` member variable was introduced. As I knew I would have to support "indented" text, such as text after a block quote character or a list start sequence, I added `index_indent` at the same time.
And with that, the easy stuff was done. It was time to move on to the tougher stuff.
### Creating the marker
Wanting to start at the beginning, I made some modifications to the `__parse_blocks_pass` function in the `TokenizedMarkdown` class to track the line number. As I did not have to deal with Link Reference Definition tokens yet (spoilers!), this was a straightforward change, with the local `line_number` variable being initialized before the main loop and that same variable being incremented at the end of the loop. To encapsulate this information, I created a new position marker as follows:
```python
position_marker = PositionMarker(line_number, 0, token_to_use)
```
Instead of passing `token_to_use` to the `parse_line_for_container_blocks` function, I instead passed in the expression `position_marker.text_to_parse`. While I wanted to pass in `position_marker` by itself, I knew I had some important work to do before that would be possible.
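To make the shape of that change concrete, here is a simplified sketch of the line tracking described above, using the `PositionMarker` class shown earlier. The function name and loop structure are illustrative, not the project's actual `__parse_blocks_pass` code.

```python
def parse_blocks_pass_sketch(document_text):
    results = []
    line_number = 1
    for token_to_use in document_text.split("\n"):
        position_marker = PositionMarker(line_number, 0, token_to_use)
        # At this stage, only the text was handed down to the container
        # block processing; passing the whole marker came later.
        results.append((position_marker.line_number, position_marker.text_to_parse))
        line_number += 1
    return results
```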
### Stage 1 - Integrating the Marker Locally
Using my knowledge of the project, I knew that the path between the `__parse_blocks_pass` function in the `TokenizedMarkdown` class and the `parse_atx_headings` function in the `LeafBlockProcessor` class was going to go through a number of functions in the `ContainerBlockProcessor` class. Instead of using a lot of guesswork to determine what that path was, I decided to start at the `parse_atx_headings` function itself and work my way backwards to the `parse_line_for_container_blocks` function.
Instead of adapting the entire function in one pass, I decided to start by focusing on the passing of the position marker into the function. To facilitate this decision, in the `parse_atx_headings` function, I wrote some simple glue code to interface between the old style code and the new style code. Before the change, the code looked like this:
```python
def parse_atx_headings(
    line_to_parse, start_index, extracted_whitespace, close_open_blocks_fn
):
    """
    Handle the parsing of an atx heading.
    """
    new_tokens = []
```
After the change, the code looked like this:
```python
def parse_atx_headings(
    position_marker, extracted_whitespace, close_open_blocks_fn
):
    """
    Handle the parsing of an atx heading.
    """
    line_to_parse = position_marker.text_to_parse
    start_index = position_marker.index_number
    new_tokens = []
```
In the `__parse_line_for_leaf_blocks` function, where the `parse_atx_headings` function was called from, I created a new `PositionMarker` instance named `temp_marker`, with the information from the local function. That new `temp_marker` variable was then passed into the `parse_atx_headings` function instead of the previously used `line_to_parse` and `start_index` arguments.
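In other words, the calling side looked roughly like this. This is a sketch of the glue inside `__parse_line_for_leaf_blocks` under my assumptions; the real function passes more arguments than shown here.

```python
# Sketch: wrap the existing local variables in a temporary marker and hand
# that marker to the refactored function in place of the old two arguments.
temp_marker = PositionMarker(line_number, start_index, line_to_parse)
new_tokens = LeafBlockProcessor.parse_atx_headings(
    temp_marker, extracted_whitespace, close_open_blocks_fn
)
```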
I knew from the start that this approach was going to be more work. However, by isolating the change to the function’s signature, I knew I could keep the parser stable and usable at any point. By keeping the parser in that good state, I knew I could depend on the large bank of tests used by the project to verify any change was a successful change with no unintended side effects. The process was indeed slower than some other methods, but it was a process that I had absolute confidence that I could rely on.
### Stage 2 - Integrating the Marker Up the Stack
In this stage, the process that was applied to the `parse_atx_headings` function was applied again and again until the `__parse_blocks_pass` function was reached. With every step going back towards that function, I repeated the same actions that I performed in stage 1, but with the calling function: add glue code, change function arguments, create a temporary position marker, and pass that position marker into the newly changed function.
The almost continuous execution of tests at this point was essential. As with any parser, even slight changes in the content or interpretations of the content can cause small ripples in the output that may escape detection if not quickly located and addressed. While I was doing my best to try and ensure that all changes were kept as isolated as possible, it was only the rapid battery of tests that kept my confidence high that I was going in the right direction with no ripples.
Another thing that helped me immensely was using Git's staging ability. To facilitate clean commits, Git has a nice feature called a staging area, with moving changes into this area commonly referred to as staging. The nice part about Git staging is that, as this article describes, it is very easy to discard any changes made to your local files in favor of the state of that file with any committed and staged changes applied. As such, each time I applied this process to another function, I ensured that once the tests were all passing, I staged those changes in the repository. In the cases where I made mistakes, which did happen frequently, I knew that I could revert the relevant files back to their pre-change states and resume work from there. When it was time to commit those changes, they were already isolated in the staging area, ready to be committed.
### Stage 3 - Cleaning Up Going Back Down
Once I reached the `__parse_blocks_pass` function, I was able to connect with the real `position_marker` variable, using that variable instead of creating a new `temp_marker` variable. I then started working back towards the `parse_atx_headings` function, replacing `temp_marker` with `position_marker` as I went.
When I got to the `parse_atx_headings` function, I then proceeded to replace instances of `line_to_parse` within that function with `position_marker.text_to_parse` and instances of `start_index` with `position_marker.index_number`. As these changes were localized, they were easy to make, but I still ran frequent test passes just to make sure I was making the right changes. I do acknowledge that some of those test passes were executed out of paranoia, but in my mind, if running the tests an extra time allowed me to step forward with no reservations, it was worth it.
Finally, at the end of this process, I removed the `line_to_parse` variable and the `start_index` variable from the function, as they were no longer being used. There was no benefit to keeping those two variables around, so I just removed them.
I now had a solid function to parse for Atx headings, referencing the line number and
index number in a nice compact object. In addition, I was able to easily trace this
compact object from the main processing line of the parser directly to the function
itself. The only thing left to complete the change was to wire up the token.
### Stage 4 - Wiring up the Token
Now that I had the `position_marker` object being passed properly into the `parse_atx_headings` function, I needed to pass that object into the token's constructor.
The first change I made was to the base `MarkdownToken` object, changing its constructor to allow the passing in of a `line_number` argument and a `column_number` argument to be stored in the instance. To complement this change, I made some modifications to the `__str__` method to add the line number and column number to the returned string, but only if at least one of those two numbers was not 0. The benefit of this slight change was that the line number/column number pair would only show up in the function's output if that information was provided. As that information was only provided once I added support for that type of token, I only had to worry about the token output changing for the tokens I had already changed.
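A rough sketch of that kind of change is shown below. The member names other than `line_number` and `column_number`, and the serialized format, are illustrative rather than the project's exact code.

```python
class MarkdownToken:
    def __init__(self, token_name, extra_data="", line_number=0, column_number=0):
        self.token_name = token_name
        self.extra_data = extra_data
        self.line_number = line_number
        self.column_number = column_number

    def __str__(self):
        # Only include the position pair if at least one value is non-zero,
        # so tokens without position support keep their old output.
        if self.line_number or self.column_number:
            return "[{0}({1},{2}):{3}]".format(
                self.token_name, self.line_number, self.column_number, self.extra_data
            )
        return "[{0}:{1}]".format(self.token_name, self.extra_data)
```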
The second change was to modify the constructor for the `AtxHeadingMarkdownToken` class to accept a `position_marker` argument. That code was implemented as such:
```python
line_number = 0
column_number = 0
if position_marker:
    line_number = position_marker.line_number
    column_number = (
        position_marker.index_number + position_marker.index_indent + 1
    )
```
Basically, if the `position_marker` variable was not present, the default value of 0 was used for the `line_number` and `column_number` variables. If the `position_marker` variable was present, the `line_number` was moved over from the `position_marker` object into the function's `line_number` variable, and the `column_number` variable was assigned the sum of the `index_number` member variable, the `index_indent` variable (at this point, always 0), and 1. As index values are always 0 based and column numbers are always 1 based, the addition of 1 to the index number ensured it was based properly.
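As a small worked example of that arithmetic, using a hypothetical heading line of my own making:

```python
# For a hypothetical line "  # Hello", the Atx start sequence begins at
# 0-based index 2 with no container indent.
index_number = 2
index_indent = 0
column_number = index_number + index_indent + 1
assert column_number == 3  # the "#" sits in the third column, 1-based
```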
Finally, these newly calculated values for the `line_number` and the `column_number` were passed into the newly retooled `MarkdownToken` constructor. This was the point that I was preparing for, where everything came together. It was now time to see how close I got to the actual test results. If memory serves, I believe I actually closed my eyes after saving my changes and started the execution of the tests.
### Stage 5 - Addressing Test Results
Up until the very end of the previous step, the most frequent test command line that I used was:
```
pipenv run pytest -m gfm
```
This executed only the parser's scenario tests, reducing the test area to only those tests that were tagged as being part of the `gfm` group. As I was running the tests with each change, it was important to keep the scope, and therefore the execution time, of the tests to a minimum. And up to this point, it had worked well: the tests were solid, passing every time.
But as I made the connection between the `AtxHeadingMarkdownToken` constructor and the `MarkdownToken` constructor that it calls, I knew that everything I had just changed was just a good guess. I hoped that I had wired things up correctly, but at that point, I only knew two things for sure:

- due to the tests, I had not disrupted any of the parsed tokens for the test cases, including their order and content
- I was passing some transformation of the base `position_marker` into the token's constructor
With everything wired up, I used the above command line multiple times, each time picking off one of the failed tests to validate. With each test, I first calculated what I thought the line/column pair should be, noting it down in the function's comments section. I then validated it against the test results, trying my hardest to not peek at the test results before I did my calculations. Once I had all those tests cleaned up, I did an extra pass through the changed tokens, manually recalculating the line/column pair and checking to make sure they were right. Only after all that was addressed did I change the marker in the above command line from `gfm` to `rules`, cleaning up any test failures caused by line/column pairs that were now being displayed as part of the output.
## The Importance of Good Tests
While I can talk at length about good practices with respect to tests, I hope I can convey the sincere importance that I associate with having good tests and good test coverage on a project. It is true that I can often "eyeball" a simple change to a project, figuring out whether that change is correct. But in my mind, anything beyond a simple change requires solid, repeatable testing. Without that testing in place, you are simply betting that nothing will go wrong with your project.
Testing is not about finding bugs as much as it is about risk management. If you have a small project that rolls various forms of dice for a Dungeons and Dragons campaign, the impact of a failure is very light. Any person I know that would use such a project would also have some backup dice for "just in case". For the PyMarkdown project, the end goal is to have a tool that website owners can trust to keep their website's Markdown documents in a prescribed format and style. The impact of too many failures is a loss of trust in the project, and with enough loss of trust, those website owners will simply stop using the project. By having good testing of various forms embedded within the project, I can hopefully mitigate some amount of the loss of trust that any failure brings with it.
Since I have worked so hard to get the project to this state, I did not want to take any unnecessary risks with the project's stability. The tests are a solid tool that I use frequently to keep any such risk to a minimum, especially during refactoring, where I rely on them heavily. And because I rely on them so heavily, I am also continuously looking for better ways to test my projects.
## Lather-Rinse-Repeat
I honestly tried to find another heading for this section. Try as I might, no other section heading seemed to convey the repetition that I went through other than the phrase: Lather, Rinse, and Repeat. For each of the 10 remaining tokens, the 5 steps outlined above for Atx headings were repeated, with only a few changes to the process. Those changes are outlined in the following sections. Please note that the following sections are ordered with respect to the amount of work needed to resolve them, rather than my usual chronological ordering.
When I encountered questions about whether something related to a given token was done properly or not, those questions were added as action items to the `readme.md` file to be examined later. I knew this was going to be a slog, and a hard slog at that. It just seemed more efficient to me to note them and move on, circling back to deal with them later.
### HTML Blocks, SetExt Heading Blocks and Code Blocks
There were no changes introduced to the process for these elements.
### Thematic Breaks
There was one small change to the process for this element. That change was the delaying of the cleanup stage until a later time, as I wanted to get more breadth of tokens implemented to ensure I had the right foundation for the change.
The other change that I made was to the `MarkdownToken` class, thereby affecting all the other tokens. For this change, I moved the code to calculate the `line_number` variable and the `column_number` variable from the `AtxHeadingMarkdownToken` class to the base `MarkdownToken` class. Once this code was proven, it just made more sense to keep it in the base class than to repeat it for each token.
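Roughly, that relocation looks like the sketch below: the same calculation as before, but living in the base class so every token constructor can simply hand over a `position_marker`. The constructor signature and member names other than the position pair are illustrative, not the project's exact code.

```python
class MarkdownToken:
    def __init__(self, token_name, extra_data="", position_marker=None):
        # The calculation previously in AtxHeadingMarkdownToken, now shared.
        line_number = 0
        column_number = 0
        if position_marker:
            line_number = position_marker.line_number
            column_number = (
                position_marker.index_number + position_marker.index_indent + 1
            )
        self.token_name = token_name
        self.extra_data = extra_data
        self.line_number = line_number
        self.column_number = column_number
```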
### Blank Line Token
The change required for the Blank Line token was not with the token itself, but with the processing of block quote blocks. The `__handle_blank_line` function in the `TokenizedMarkdown` class was changed to accommodate the `position_marker` argument, which in turn required every call to that function to supply that argument. Other than that change, everything else was normal.
### Block Quote Blocks and List Blocks
Strangely enough, I thought these two types of blocks would take the most amount of work to address, but due to the way they were implemented, only a couple of small changes were required. In both cases, the block start algorithms have to deal with the possibility of a tab character (often denoted as `\t`) being used as the whitespace between the block start sequence and the rest of the block's data. Having already dealt with tab stops versus tab characters and fixed that issue once, I really did not want to fix it again. I just needed to ensure that the current fix and the previous fix were compatible with each other. To ensure that happened correctly, the modified line text and index numbers were passed to a newly created `PositionMarker` instance, created after the processing of the Block Quote block and the List blocks. This ensured that any line text modified by the processing of either container block would be retained, while adding the now normal processing using the `PositionMarker` class.
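A sketch of that idea follows; the variable names and the way `index_indent` is derived here are my own illustrative assumptions, not the project's exact handling.

```python
# Sketch: after the container (block quote/list) processing has consumed its
# start sequence, build a fresh marker from the modified text so downstream
# leaf processing sees the adjusted line.
new_marker = PositionMarker(
    position_marker.line_number, adjusted_index, modified_line_to_parse
)
# Record how much of the original line the container start consumed, so
# column numbers can still be reported against the original line.
new_marker.index_indent = len(original_line) - len(modified_line_to_parse)
```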
### Paragraph Blocks
As the catch-all element for Markdown documents, I had a feeling that these blocks would end up near the top of the “most effort” list, although I will admit that I guessed the reason wrong. I thought that it would have something to do with lists and block quotes, while the actual reason is due to Link Reference Definitions. More precisely, the reason is due to failed or partially failed Link Reference Definitions.
Due to its composition, it is impossible to properly determine whether a Link Reference Definition is valid without reading the next line. As documented in the article Adding Link Reference Definitions, its multiline nature and use of various forms of quoted sections mean that unless the parser encounters the end of the closing section or a blank line, it does not know whether it has reached the end of the Link Reference Definition. If it reaches the end and, for any reason, that Link Reference Definition is not valid, the logic in the parser starts adding lines back on to a list of lines that need to be reparsed without being misinterpreted as a Link Reference Definition.
It is precisely that requeuing logic that I needed to alter to work properly with the new `PositionMarker` class. While the `index_number` remained untouched, I had to make sure that the `line_number` variable was properly reset to account for the length of the `lines_to_requeue` variable when any requeuing occurred. I had a bit of a problem with the initial implementation of this code, so I wanted to take extra time and really boost my confidence that I was handling the change for this token properly.
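In essence, the adjustment boils down to something like this sketch of the idea, not the exact code:

```python
# Sketch: when a failed Link Reference Definition hands lines back to be
# reparsed, pull the line counter back by the number of requeued lines so
# the next pass reports positions against the right lines.
if lines_to_requeue:
    line_number -= len(lines_to_requeue)
```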
### Link Reference Definitions
This one should be very obvious… I needed to add the token itself! When designing the parser, I did not see any need for a token that literally has no impact on the output, so I did not add a token for it. While a link element that refers to a link reference definition will show up as a link in the HTML output, the link reference definition itself is not added to the HTML in any form. However, now that I was adding support for all Markdown block elements, I found myself in the position of adding the token to the parser.
Adding the token itself was not too difficult. That was accomplished by the usual process of finding the right method, the `__stop_lrd_continuation` method in this case, creating a new instance of the new `LinkReferenceDefinitionMarkdownToken` class in that function, and properly populating it. Then the process of adding the support for the `PositionMarker` class kicked in, and I quickly had a wired-up token with the proper data.
And then I went to test it… and I noticed that the tests failed in the transformation of the tokens into HTML. Looking into it, I quickly determined that I needed to make some additional changes to the `TransformToGfm` module. This was a simple change as, by definition, Link Reference Definitions have no effect on the output. To accommodate the new token, a simple handler was registered for that token that simply returns the same string that was given to the handler.
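Something along these lines; the handler name and parameter names are my guess at the shape of the module's `handle_*` functions, not its exact API:

```python
# Sketch: a do-nothing handler for the new token, so the HTML transformer
# knows about it but leaves the output untouched.
def handle_link_reference_definition_token(output_html, next_token, transform_state):
    return output_html
```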
And then I went to test it… and the line numbers and column numbers were incorrect. In about half of the tests, the pair of numbers was wildly different from what I had manually calculated. Double-checking my numbers, I then noticed a pattern: the numbers being reported were for the last line of the Link Reference Definition. Because of its multiline nature, the values associated with the `position_marker` variable were the ones in use when that Link Reference Definition was considered both valid and complete. To avoid passing a single-use variable around, I added the starting position to the `LinkDefinitionStackToken` instance at the top of the stack, and things looked better.
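In spirit, that fix looked like the line below; the attribute name is illustrative, not the project's actual field.

```python
# Sketch: capture the definition's starting position on the stack token when
# the potential Link Reference Definition begins, so the Markdown token can
# later be created with that position instead of the position of its last line.
link_def_stack_token.original_position_marker = position_marker
```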
As I looked at the token and its `__str__` function output, it looked very lean. Based on other tokens, I expected a lot more information to be stored in that token, to be analyzed later. Slowly going through the information gathered during the processing of the token, I ended up figuring out how to properly represent the token. Between the various whitespace sequences between parts of the definition, the various parts of the definition themselves, and both uninterpreted and interpreted parts, the data called for a class with 11 member variables.
Slowly but surely, I started working through each of the Link Reference Definition scenario tests, starting with the easier ones and working my way to the more difficult ones. It took me a while to work through each scenario and validate it, but I breathed a sigh of relief when I finished going through all the tests. In the end, it had taken 2 days to complete, along with countless executions of the scenario tests.
It was not until I looked at my notes that I remembered something: I had delayed some of the cleanup.
## Cleaning Up
Before calling this round of changes done, there was a decent amount of delayed cleanup to be performed. As part of Stage 3 of the process, there were quite a few times where that cleanup was delayed for one reason or another. As I was about to move on to another issue, I wanted to make sure that cleanup was performed.
That cleanup itself was easy, as I had repeated the “change-test-check” process so many times that I believe I could perform it while sleeping. Following a variation of that same process, the code was cleaned up in short order.
## What Was My Experience So Far?
As I finished the cleanup, I was happy to have the work of adding in the line numbers and column numbers behind me. After 10 days of work, that large task was no longer on my task list. But while I was happy, I was also concerned about the 20 new issues that I had added to my issues list. I was aware that at least a few of those items would be checked out and determined to be false alarms, but that probably still left somewhere between 16 and 18 issues to research and triage.
On top of that, there was also the fact that each line number/column number pair for the tokens was determined manually by me. While I have faith that I can count accurately for small sets of numbers, I do know that I make mistakes. Some of those mistakes I had already caught while going over the test cases before committing the changes for each of the elements. But it was entirely possible, and indeed probable, that I had missed at least 1 or 2 mistakes. This was not due to lack of confidence, but due to a sense of reality. I know that the project has great coverage and great scenarios, but I also know that I do not have every scenario represented, just the really important ones.
That got me thinking. If I had already found that many errors, was there a way to remove the need for me to manually count those values? Could I automate it? Would it be worth it? Would I find anything?
The more I thought about those questions, the more I realized that I needed to explore this area. If nothing else, I wanted to make sure I had not missed anything. After thinking about it for a while, I knew I was not willing to risk having missed something important.
## What is Next?
And here, ladies and gentlemen, is where I begin to go down the rabbit hole.
[^1]: At the current moment, the `transform_to_gfm.py` module is approximately 900 lines long. About 30% of that module is processing overhead, with about 18% dedicated to handling the somewhat complex issue of list looseness. The remaining 52% is consumed with the simple `handle_*` methods used to translate each token's start and end tokens into HTML. While the calculation of list looseness does add some complexity to the translation, the algorithm and implementation are both relatively simple.
## Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.