Summary¶
In my last article, I completed the addition of proper support for line and column numbers for the text token and emphasis tokens by finishing the consistency checks. In this article, I talk about the efforts and issues required to finish implementing the line and column numbers for the remaining inline tokens, including their consistency checks.
Introduction¶
From a holistic point of view, I felt that the accuracy and consistency of the tokens were getting more solid with each change. While I expected a fair number of tests to fail when I started to add the consistency checks, I was now at a point where a failing test would be a novel thing. And that was good! But even with that positive outlook on the project and the consistency checks, I knew I still had a way to go to finish things up properly with respect to the tokens.
After having finished adding the line/column numbers for the Emphasis tokens and the Text token, the remaining inline tokens were the only things left in the way of finishing that work. After the work I had done on that group of tokens, I was hoping that this would be an easy batch of work to complete. But only time would tell.
What Is the Audience for This Article?¶
While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 04 Sep 2020 and 09 Sep 2020.
Remaining Inline Tokens¶
Having taken care of the Emphasis tokens and the Text token, all the other inline tokens remained: Raw-HTML, Links, Images, Autolinks, Code Spans and Hard Line Breaks. Before starting work on each of these tokens, I was not sure if the effort required to implement each one would be more like the Emphasis tokens or more like the Text token. I hoped it would be a simple case and easy to work on, but there were no guarantees.
With some optimism in mind, and my fingers crossed, I started my work.
Raw HTML and Links¶
As I write this article and look back at my notes, I fully admit that I am a bit
stumped. Picking one of these two inline tokens to work on makes sense to me. I have
no notes to myself saying, “two for the price of one” or “these will be simple”.
I am left scratching my head as to why I decided to work on both at the same time.
Regardless of why I decided to do both, they were both completed.
I believed that working on both items at the same time would just be asking for something to go wrong, so I chose to focus first on the Raw HTML token. The initial change was easy, changing the creation of the token from:
RawHtmlMarkdownToken(valid_raw_html)
to:
RawHtmlMarkdownToken(valid_raw_html, line_number, column_number)
thereby passing the line number and the column number to the constructor for the
RawHtmlMarkdownToken class. Once that was done, another simple change was made
to the handle_angle_brackets function to pass the current line number and column
number as arguments, as such:
new_token, after_index = HtmlHelper.parse_raw_html(
    between_brackets,
    remaining_line,
    inline_request.line_number,
    inline_request.column_number,
)
Testing and Iterating¶
Running the tests for the Raw HTML token, it was immediately obvious to me that in
certain cases, the column number was off by a bit. After a bit of research, I noticed
that in cases where there was a Text token before the Raw HTML token, the new Raw HTML
token had the same line/column number as the Text token. Digging a bit deeper, it
appeared that in those cases, the remaining_line field of the inline_request object
had the correct number of characters to make up the difference, but they were not being
applied.
To address that inadequacy, I made a small change to the above example. Following the
logic of the inline algorithm, once a new token is created to be inserted, the text
leading up to that token is determined, largely based off the remaining_line
variable. While this seems slightly out of order, it ensures that the proper Text token
with the proper line/column number is inserted in the correct order. However, because
the new token is created before that Text token is inserted, it does not have the
right column number. By simply adding the length of the remaining_line variable to
the column number, the difference is accounted for. This was accomplished by first
calculating that new column number:
new_column_number = inline_request.column_number
new_column_number += len(inline_request.remaining_line)
and then passing that new_column_number value into the call to the
HtmlHelper.parse_raw_html function in the previous example instead of the
inline_request.column_number argument.
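As a rough illustration of that adjustment, here is a minimal sketch. The InlineRequest class below is a hypothetical stand-in that holds only the two fields discussed above; it is not the project's real class, and the sample values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class InlineRequest:
    """Hypothetical stand-in holding only the two fields discussed above."""

    column_number: int
    remaining_line: str


def column_for_new_token(inline_request: InlineRequest) -> int:
    """Skip past the pending text that will become the preceding Text token."""
    return inline_request.column_number + len(inline_request.remaining_line)


# For a line such as `abc <b>`, assume the text `abc ` is still pending at
# column 1 when the Raw HTML token is created.
request = InlineRequest(column_number=1, remaining_line="abc ")
print(column_for_new_token(request))  # prints 5
```

With the pending text accounted for, the Raw HTML token no longer shares a column number with the Text token that precedes it.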
New Lines Inside of the Token¶
After running through the tests again, almost all the tests were passing, except for the test for example 500:
[link](<foo
bar>)
This case may look weird, but everything is computed properly as a series of Text
tokens and a Raw HTML token. Because of
the newline in the URI, the text is not eligible to be a link, but since the URI part
is enclosed in “angle brackets”, it is eligible to be a Raw HTML token. But even with
the Raw HTML token being parsed, the Text token containing the trailing ) character was
off. Instead of being reported as (2,5), it was being reported as (1,17).
To investigate this further, I created the new scenario test test_raw_html_634a. This
new test was a more isolated
case of example 500, a copy of the test function test_raw_html_634 with a newline
character inserted inside of the b2 HTML tag, as follows:
<a /><b2
data="foo" ><c>
I started to look at this issue and it turned out to be an easy issue to overcome.
Fixing Newlines¶
With this isolated scenario test, it was very easy to see that the issue was with the
raw_tag field of the Raw HTML token. When the token contained a newline character, that newline
was treated as a normal character and added to the character count. What I needed to
do was to make sure that the algorithm understood that the newline character was special
and to handle it differently. So, to address that
behavior, I introduced some extra code to the handle_angle_brackets function:
if (
    new_token
    and new_token.token_name == MarkdownToken.token_inline_raw_html
    and "\n" in new_token.raw_tag
):
    split_raw_tag = new_token.raw_tag.split("\n")
    inline_response.delta_line_number += len(split_raw_tag) - 1
    length_of_last_elements = len(split_raw_tag[-1])
    inline_response.delta_column_number = -(length_of_last_elements + 2)
else:
    inline_response.delta_column_number = (
        inline_response.new_index - inline_request.next_index
    )
return inline_response
Basically, since the existing code already handled the case with zero newlines perfectly,
I did not need to change that aspect of the function. However, in the case of a Raw HTML
token that contained a newline in its raw_tag field, I needed special processing to
kick in. The first thing I
needed was a clear picture of the raw_tag field and its newline characters, so I
split the string on newline characters into the split_raw_tag variable. Then I
addressed the line number calculation first, correcting the line number calculation by
adding the number of newline characters found to the inline_response.delta_line_number
variable.[1]
After I was sure that the line number was being correctly calculated, it was time for me
to focus on the column number. While each of the lines in the raw_tag field was
important, their content was already mostly covered by the calculation for the change to
the line number. Each line except the last line, that is. That last line was the new
information that would lead the text after the Raw HTML token. As such, the column
number was at least as many characters along as the length of any text past that last
newline character, as calculated for the length_of_last_elements variable. With
that calculation completed, all that was required was to add 2 to that value for
constant element overhead: 1 for the length of the closing angle bracket (>) and 1 to
translate the value from an index to a position.[2]
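To make the arithmetic concrete, the calculation can be pulled out into a standalone function and run against the two examples from this section. This is only a sketch: the function name raw_html_deltas and its consumed_length parameter (standing in for the difference between new_index and next_index) are my own labels for illustration, not names from the project.

```python
from typing import Tuple


def raw_html_deltas(raw_tag: str, consumed_length: int) -> Tuple[int, int]:
    """Return (delta_line_number, delta_column_number) for a Raw HTML token.

    raw_tag is the text between the angle brackets. consumed_length is a
    hypothetical stand-in for `new_index - next_index` in the single-line case.
    """
    if "\n" in raw_tag:
        split_raw_tag = raw_tag.split("\n")
        # One extra line for every newline character inside the tag.
        delta_line_number = len(split_raw_tag) - 1
        # Text after the last newline, +1 for the closing `>` and +1 to turn
        # an index into a position. The result is negated, as in the code above.
        delta_column_number = -(len(split_raw_tag[-1]) + 2)
        return delta_line_number, delta_column_number
    return 0, consumed_length


# Example 500: the raw tag is `foo\nbar`, so the trailing `)` lands at (2,5).
print(raw_html_deltas("foo\nbar", 9))  # prints (1, -5)

# test_raw_html_634a: the `<c>` tag that follows starts at column 13 of line 2.
print(raw_html_deltas('b2\ndata="foo" ', 15))  # prints (1, -13)
```

For example 500, the 3 characters of bar plus the 2 characters of overhead give column 5, which matches the expected (2,5) for the trailing ) character.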
Conveying That Information Back to The Caller¶
With everything else completed, I then had to decide how I was going to get the newly calculated column number back to the calling function. According to the debug statements that I had added, the value was being calculated properly. While there were a couple of options on the table, I decided to go for a simple approach: a negative number.
I am not sure if this choice is pythonic or not, but I believe that it conveys the right information in an efficient manner. If the column number is zero or positive, it represents a simple change or delta to the column number, a simple value to be added to the current column number to arrive at the new column number. However, if the column number is negative, it represents an absolute number that should be used for the column number. For example, if the token contains a newline character, it makes sense that the returned value would indicate a value from the start of the new line, not from the last known position.
Why a negative number? While I could have returned an extra value that determined whether the number was referential or absolute, that seemed too bulky. For me, this was keeping it lean within its limited scope.
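On the receiving side, the convention boils down to a single check on the sign of the value. The following is a minimal sketch of that idea, using a hypothetical helper named apply_column_delta rather than the project's actual calling code.

```python
def apply_column_delta(column_number: int, delta_column_number: int) -> int:
    """Apply the sign convention described above (hypothetical helper).

    Zero or positive: a relative change added to the current column number.
    Negative: an absolute column number on the new line, stored negated.
    """
    if delta_column_number < 0:
        return -delta_column_number
    return column_number + delta_column_number


# Example 500: the `<` sits at column 8 and `<foo\nbar>` is 9 characters long.
# Treating the newline as a normal character produces the incorrect column 17.
print(apply_column_delta(8, 9))  # prints 17

# With the negative (absolute) value, the column is 5 on the following line.
print(apply_column_delta(8, -5))  # prints 5
```

The trade-off is that one integer carries two meanings, which is compact but relies on every caller knowing the convention.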
Adding Validation¶
After all that work, the validation was very anti-climactic. It may appear that I
cheated and copied the calculation from above as-is into the new
__verify_next_inline_raw_html function. Rather than being a cheat, I worked
through the calculations again on paper, making sure that I did not miss any weird
boundary conditions. After generating the algorithm in the
__verify_next_inline_raw_html function from scratch, I compared the two algorithms
together and the algorithms themselves were the same. Rather than cheating,
I considered it a validation that I had derived the right algorithm twice.
What About the Links?¶
As I mentioned at the start of this section, I am not sure why I decided to work on these two tokens together. I can only guess that perhaps I thought that adding line/column numbers to the Link tokens would uncover something important that adding line/column numbers to the Raw HTML tokens would not. The reality was that after completing the Raw HTML token work, the changes needed to implement the line/column numbers for Link tokens were trivially easy.
Unexpectedly, this would foreshadow the following work on the other inline tokens.
Autolinks, Code Spans, Images and Hard Line Breaks¶
I expected some difficulty in implementing the line/column numbers for these tokens; however, the previous work made the implementation of the new code easy. There were issues that needed to be properly addressed for each specific type of token, but the hard work had already been done. As such, the work was more akin to copy-and-paste-and-adjust than anything else.
For each of these tokens, the initial calculation included values for the length of the constant part of the element and the variable part of the element. Once that was complete and the easy tests were passing, any multiline parts were addressed, with progress being made to get closer to having the remaining scenario tests passing. To finish that work, consistency checks were added that simply verified the algorithms used previously.
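The shape of that calculation can be sketched in a generic form. This is an illustration of the constant-part plus variable-part idea only: the function position_after_element and its parameters are hypothetical, and it ignores token-specific details (such as how whitespace at the start of a continuation line is treated) that the real per-token code has to handle.

```python
from typing import Tuple


def position_after_element(
    line: int, column: int, leading: str, body: str, trailing: str
) -> Tuple[int, int]:
    """Return the (line, column) just past an inline element.

    leading and trailing are the constant parts of the element (for example
    the backticks of a Code Span); body is the variable part.
    """
    parts = (leading + body + trailing).split("\n")
    if len(parts) == 1:
        # Single line: advance by the constant and variable lengths.
        return line, column + len(parts[0])
    # Multiline: move down one line per newline and restart the column count
    # from whatever follows the last newline (+1 for index to position).
    return line + len(parts) - 1, len(parts[-1]) + 1


# A Code Span `foo` starting at (1,1) ends just before column 6.
print(position_after_element(1, 1, "`", "foo", "`"))  # prints (1, 6)

# The same Code Span with a newline in its body ends on the next line.
print(position_after_element(1, 1, "`", "foo\nbar", "`"))  # prints (2, 5)
```

Applied to the Raw HTML text from example 500, with < and > as the constant parts, the same function also arrives at (2,5).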
This process was a simple rehash of the work that I did for the Raw HTML token, and then again for the Link token. But it was working and working well. While a part of me was saying “this is too easy, what’s wrong?”, I double checked all my work to quiet that voice and assure myself that I had not missed anything.
While it was simple work, it did take a couple of days to complete. But at the end of that work, each relevant token had a line number and a column number, and they had been verified. Even more interesting, while some extra scenarios were added to deal with missing cases (mostly to deal with multiline element parts), no new issues had been found. Things were looking good. Almost.
Last Inline vs Next Block¶
With all the inline tokens supporting line/column numbers, I felt as if a bit of a load was taken off of my shoulders. I was not really worried that there was something wrong, but as I mentioned in the last article:
I know that I am fallible.
There was no proof that I could see that I had missed something, but I just had a nagging feeling that I had left something out. Trying to get rid of that feeling, I went through the work that I had just completed and checked it again, finding nothing out of the ordinary. On top of that, I had automation in place to catch any miscalculations that I made, something that was welcome.
After that extra checking was completed, I could not find anything wrong and I was ready to move on. But as I was getting ready to start working on some of the items in the issue list, I noticed something. Reading my previous article to gain some extra perspective on where I was in the project, I noticed the part where I stated:
So, whether I liked the idea or not, validation of the first element in the list was mandatory. The last element is a different story. While it would be nice to tie the last inline token to the following block token, I felt that it was not as important as the verification of the first element. However, I added in a placeholder to the code to make sure that I would follow up on it later.
I remembered!
The Other Anchor for the Inline Tokens¶
As I mentioned before, I started with the outer token and the first token because I wanted to ensure that I was able to anchor the list, and anchoring the first token was the easiest solution at the time. Having finished that task off and also having finished validation of the inline tokens within the list, it was now time to work on anchoring the other side of the list: the last inline token and the following block token. That is what the nagging feeling was! That is what I was trying to remember.
Starting to research what I needed to do to resolve this anchor issue, I came to an interesting observation. While all groups of inline tokens start after a block token, not all groups of inline tokens end with a block token. Because of the way tokenization is performed, I decided not to expose line/column numbers for any of the end tokens that did not add something to the data stream. This means that except for the Emphasis end token, none of the other end tokens have a line/column associated with them.
Why is that observation on tokenization important? A good example is the Markdown document for example 364:
foo*bar*
From that Markdown, I can surmise that the tokens for that document will start with a Paragraph
start token and end with a Paragraph end token. Inside of the paragraph, there will be
a Text token containing foo, an Emphasis start token, a Text token containing bar,
and an Emphasis end token. This is backed up by the realized tokens for the example,
which are:
expected_tokens = [
    "[para(1,1):]",
    "[text(1,1):foo:]",
    "[emphasis(1,4):1:*]",
    "[text(1,5):bar:]",
    "[end-emphasis(1,8)::1:*:False]",
    "[end-para:::True]",
]
From looking at this tokenization, the last token with a line/column number attached to it is the Emphasis end token, an inline token. Getting an actual block token to appear after those tokens is as simple as adding a trailing blank line to the Markdown:
foo*bar*
This adds a new Blank Line token to the end of the array, adding the tokenization
'[BLANK(2,1):]'. However, I knew the real trick would be to determine that value
without having to add that extra token.
Focusing on That Line Number¶
Working through the issue from the previous section helped me understand something else about the relationship between the last inline token and the possible following block token: only the line number was important. Because the inline tokens are always contained within a container block element or a leaf block element, the last token in an inline token group is guaranteed to be followed by either an end token for a previously started block element or a start token for a new block element.
If the next block token after the inline tokens is a block start token, because of the work up to this point, a line/column number is guaranteed. If the next block token is a block end token, one of two things happens. Either a block start token follows with the start of a new block element, or the end of the document is reached. If a block start token follows, the line/column number is guaranteed as above. In the case of the end of the document, as no token with a valid line/column number follows, some other calculation is needed to determine the line number to compare to.
The good news is that only the line number is important. Because only the line number is important, there is another available number that we can use: the number of lines in the Markdown document. As such, if there is a block start token after the inline block, I used the line number from that token as the line number to compare against. If no such token existed, I used the number of lines in the document.
I tested this approach with a handful of scenarios, some with eligible following block tokens and some with an absence of eligible following block tokens. On paper it seemed to work without fail. The only thing that was left was to test that approach with actual code.
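The selection rule itself is small enough to sketch. The BlockToken class below is a hypothetical stand-in for the project's tokens, and counting the document's lines by splitting on newline characters is my assumption for the example; the real code may count lines differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BlockToken:
    """Hypothetical stand-in: end tokens carry no line number."""

    name: str
    line_number: Optional[int] = None


def line_number_to_compare(
    tokens_after_inline: Sequence[BlockToken], source_markdown: str
) -> int:
    """Pick the line number that anchors the end of an inline token group."""
    for token in tokens_after_inline:
        if token.line_number is not None:
            # A following block start token provides the line number.
            return token.line_number
    # End of the document: fall back to the number of lines in the document.
    return len(source_markdown.split("\n"))


# Example 364 on its own: only a Paragraph end token follows.
print(line_number_to_compare([BlockToken("end-para")], "foo*bar*"))  # prints 1

# With a trailing blank line, the Blank Line token supplies the value.
tokens = [BlockToken("end-para"), BlockToken("BLANK", 2)]
print(line_number_to_compare(tokens, "foo*bar*\n"))  # prints 2
```

Either way, the caller ends up with a single line number to check the last inline token against, without needing an extra token in the stream.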
Completing the Checks¶
To calculate the line number to compare to, I added the __verify_last_inline
function with my usual development pattern. Following that pattern, I started
adding handlers for each of the inline tokens it encountered, just trying to
get to a steady state where all the handlers were present. Once that was achieved,
I started adding the content to each handler to calculate the height of the inline
token.
Now I wish I could say it was good planning, but it was just serendipity that I did this work right after the work on adding the line/column number support to most of the inline tokens. Based on that recent work, adding the calculations for the heights of each of the tokens was exceedingly easy. While it was easy, it took a couple of days for me to work through each case and verify each twist and turn of the token. But in the end, with only one planned failure[3] to address and one or two items to look at later, the token validation was complete!
What Was My Experience So Far?¶
Getting to this point in the PyMarkdown project was a momentous achievement for me, measured in months rather than days. Along the way, I had developed a Markdown parser, ensured that it emitted tokens that include line numbers and column numbers, verified its output against both expected HTML output and the original Markdown input, and had a good array of consistency checks on the line numbers and column numbers. Phew. It was a long list, but the project has come a long way.
I was relieved that I got to this point with my original design mostly intact. I was aware that I was going to have to do some refactoring in the future to make the code more modifiable, but I believe it is in a decent position to make that happen. Besides, when I start doing that, I have almost 1400 scenario tests that will make sure any changes are not negatively impacting the code.
With all that good stuff in mind, I started to look at the issues list, and paged through it. At just over 200 lines of text, it was a bit daunting to look at initially. But as I progressed with the project, any idea or question I had was put into that list. There were ideas for better tests, questions on whether something was done right, and notes on things I wanted to check because they looked weird in some way. And during the project’s development to date, I had taken proactive efforts to get any serious issues out of the way. Being the optimist, I hoped that I was left with a solid set of enhancements. Regardless of what remained in the list, I was sure that I could tackle it. And sure, there might be some rewrites that I would need to do, but they would make the project stronger, leaner, faster, and more maintainable.
So how was I feeling? Very optimistic. There were quite a number of items in the issues list, but if I tackled them one at a time, I could get through them. And going through them would either fix an issue or confirm that the project did not have that particular issue. And to me, both outcomes were positive.
What is Next?¶
Having completed all the consistency checks, I now had operating scenario tests with values and line/column numbers that I was very confident about. But along the way, I had accumulated a decent number of items in the issues list. Before getting back to filling out the set of rules, it was time to address those items.
[1] Python’s split function works as expected. If you have a string that does not contain the sequence to split on, it returns a list with one element. If you have a string that has one or more instances of the sequence to split on, it returns a list with each element being any text between those instances. As such, if a string has one newline character and is split, it will result in a list with a length of 2. Therefore, I used len(split_raw_tag) - 1 to figure out the number of newline characters found in the raw_tag field.
[2] By their nature, an index starts at 0 and a position starts at 1. As a column number is a position on a line but was being computed as an index, I needed to add 1 to the value to transition it into being a position.
[3] Ahead of time, I had already determined that the scenario test test_inline_links_518b was split over multiple lines and would be addressed by this validation.