Introduction

As part of the process of creating a Markdown Linter to use with my personal website, I firmly believe it is imperative that I have solid testing for both the linter and the tools I use to test it. In previous articles, I talked about the framework I use to scenario test Python scripts and how my current PyTest setup produces useful test reports, both human-readable and machine-readable. These two things allow me to properly test my Python scripts, to collect information on the tests used to verify those scripts, and to determine how well the collection of tests covers those scripts.

While the human-readable reports are very useful for digging into issues, I often find that I need a simple and concise “this is where you are now” summary that gives me the most pertinent information from those reports. Enter the next tool in my toolbox, a Python script that summarizes information from the machine-readable reports, unimaginatively called PyScan. While it is a simple tool, I constantly use it when writing new Python scripts and their tests to ensure the development is going in the direction that I want it to. This article describes how I use the tool and how it benefits my development process.

Why Not Discuss The Script Itself?

When coming up with the idea for this article, I had two beneficial paths available: focus on the code behind the PyScan tool or focus on the usage of the PyScan tool. Both paths have merit and benefit, and both paths easily provide enough substance for a full article. After a lot of thought, I decided to focus on the usage of this tool instead of the code itself. I made this decision primarily due to my heavy use of the PyScan tool and its significant benefit to my development process.

I rely on PyScan to give me an accurate summary of the tests used to verify any changes along with the impact on code coverage for each of those changes. While I can develop without PyScan, I find that using PyScan immediately increases my confidence in each change I make. When I make a given type of change to either the source code or the test code, I expect a related side effect to appear in the test results report and the test coverage report. By having PyScan produce summaries of the test results and test coverage, each side effect is more visible, therefore adding validation that the changes made are the right changes.

In the end, the choice became an easy one: focus on the choice with the most positive impact. I felt that documenting how I use this tool satisfied that requirement with room to spare. I also felt that if any readers are still interested in looking at the code behind the script, it’s easy enough to point them to the project’s GitHub repository and make sure it is well documented.

Setting Up PyScan For Its Own Project

Based on the setup from the last article, the PyTest command line options --junitxml=report/tests.xml and --cov-report xml:report/coverage.xml place the tests.xml file and the coverage.xml file in the report directory. Based on observation, the tests.xml file is in a JUnit XML format and the coverage.xml file is in a Cobertura XML format. The format of the tests.xml file is pretty obvious from the command line flag required to generate it. The format of the coverage.xml file took a bit more effort, but the following line of the file keyed me in to its format:

<!-- Based on https://raw.githubusercontent.com/cobertura/web/master/htdocs/xml/coverage-04.dtd -->
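
For reference, the full PyTest invocation that produces both report files looks something like the command below. The --cov=pyscan and --cov-branch options are my assumptions about the coverage setup, which was covered in the previous article, rather than a quote of the exact command:

pipenv run pytest --junitxml=report/tests.xml --cov=pyscan --cov-branch --cov-report xml:report/coverage.xml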

From within the project’s root directory, the main script is located at pyscan/main.py. Since the project uses pipenv, the command line to invoke the script is pipenv run python pyscan/main.py, and invoking the script with the --help option gives us the options that we can use. Following the information from the help text, the command line that I use from the project’s root directory is:

pipenv run python pyscan/main.py --junit report/tests.xml --cobertura report/coverage.xml

With everything set up properly, the output from that command looks like:

Test Results Summary
--------------------

Class Name                     Total Tests   Failed Tests   Skipped Tests
----------------------------  ------------  -------------  --------------
test.test_coverage_profiles              2              0               0
test.test_coverage_scenarios            12              0               0
test.test_publish_scenarios              9              0               0
test.test_results_scenarios             19              0               0
test.test_scenarios                      1              0               0
---                                     --              -               -
TOTALS                                  43              0               0

Test Coverage Summary
---------------------

Type           Covered   Measured   Percentage
------------  --------  ---------  -----------
Instructions       ---        ---        -----
Lines              505        507        99.61
Branches           158        164        96.34
Complexity         ---        ---        -----
Methods            ---        ---        -----
Classes            ---        ---        -----

Before We Continue…

To complete my setup, there are two more things that are needed. The first thing is that I primarily execute the tests from a simple Windows script called ptest.cmd. While there is a lot of code in the ptest.cmd script to handle errors and options, when the script is boiled down to its bare essence, the script runs tests and reports on those tests as follows:

pipenv run pytest
pipenv run python pyscan/main.py --only-changes --junit report/tests.xml --cobertura=report/coverage.xml

Note

I also have a Bash version called ptest.sh that I have experimented with locally, but it is not checked in to the project. If you are interested in this script, please let me know in the comments below.

Setting up a script like ptest keeps things simple and easy to use. One notable part of the script is that there is a little bit of logic to not summarize any coverage if there are any issues running the tests under PyTest. Call me a purist, but if the tests fail to execute or are not passing, any measurements of how well the tests cover the code are moot.
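
As a rough, cross-platform illustration of that gating logic, the following is a minimal Python sketch of what the script boils down to; it is my own sketch, not the contents of the actual ptest.cmd script:

import subprocess
import sys

# Run the tests first; if PyTest reports any failure, stop here so that no
# coverage summary is produced for a failing or broken test run.
test_run = subprocess.run(["pipenv", "run", "pytest"])
if test_run.returncode != 0:
    sys.exit(test_run.returncode)

# Only if the tests passed, summarize the test results and test coverage.
subprocess.run(
    [
        "pipenv", "run", "python", "pyscan/main.py",
        "--only-changes",
        "--junit", "report/tests.xml",
        "--cobertura", "report/coverage.xml",
    ]
)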

The other thing that I have set up is a small change to the command line for PyScan. In the “bare essence” text above, after the text pyscan/main.py, there is a new option used for PyScan: the --only-changes option. By adding the --only-changes option, PyScan restricts the output to only those items that show changes. If no changes are detected, it displays a simple line stating that no changes have been observed. In the case of the above output, the output with this new option is as follows:

Test Results Summary
--------------------

Test results have not changed since last published test results.

Test Coverage Summary
---------------------

Test coverage has not changed since last published test coverage.

To me, this gives a very clear indication that things have not changed. In the following sections, I go through different cases and explain what changes I made and what effects I expect to see summarized.

Introducing Changes and Observing Behavior

For this section of the article, I temporarily added a “phantom” feature called “nothing” to PyScan. This feature is facilitated by two code changes. In the __parse_arguments function, I added the following code:

        parser.add_argument(
            "--nothing",
            dest="do_nothing",
            action="store_true",
            default=False,
            help="do nothing and exit",
        )

and in the main function, I changed the code as follows:

        args = self.__parse_arguments()

        if args.do_nothing:
            print("noop")
            sys.exit(1)

Note that this feature is only present for the sake of these examples, and is not in the project’s code base.
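
For completeness, triggering this phantom feature from the command line is as simple as the following, which prints noop and exits with a return code of 1:

pipenv run python pyscan/main.py --nothing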

Adding New Code

When I added the above code for the samples, the output that I got after running the tests was:

Test Results Summary
--------------------

Test results have not changed since last published test results.

Test Coverage Summary
---------------------

Type       Covered   Measured     Percentage
--------  --------  ---------  -------------
Lines     507 (+2)   511 (+4)  99.22 (-0.39)
Branches  159 (+1)   166 (+2)  95.78 (-0.56)

Based on the introduced changes, this output was expected. In the Measured column, 4 new lines were added (1 in __parse_arguments and 3 in main) and the if args.do_nothing: line added 2 branches (one for True and one for False). In the Covered column, without any tests to exercise the new code, 2 lines are covered by default (1 in __parse_arguments and 1 in main) and 1 branch is covered by default (the False case of if args.do_nothing:).

Adding a New Test

Having added source code to the project, I added a test to address the new code. To start, I added this simple test function to the test_scenarios.py file:

def test_nothing():
    pass

This change is just a stub for a test function, so the expected change is that the number of tests for that module increases and there is no change in coverage. This effect is borne out by the output:

Test Results Summary
--------------------

Class Name            Total Tests   Failed Tests   Skipped Tests
-------------------  ------------  -------------  --------------
test.test_scenarios        2 (+1)              0               0
---                       --                   -               -
TOTALS                    44 (+1)              0               0

Test Coverage Summary
---------------------

Type       Covered   Measured     Percentage
--------  --------  ---------  -------------
Lines     507 (+2)   511 (+4)  99.22 (-0.39)
Branches  159 (+1)   166 (+2)  95.78 (-0.56)

Populating the Test Function

Now that a stub for the test is in place and registering, I added a real body to the test function as follows:

def test_nothing():

    # Arrange
    executor = MainlineExecutor()
    supplied_arguments = ["--nothing"]

    expected_output = """noop
"""
    expected_error = ""
    expected_return_code = 1

    # Act
    execute_results = executor.invoke_main(arguments=supplied_arguments, cwd=None)

    # Assert
    execute_results.assert_results(
        expected_output, expected_error, expected_return_code
    )

The code that I added at the start of this section is triggered by the command line argument --nothing, printing the simple response text noop and returning a return code of 1. This test code was crafted to trigger that code and to verify the expected output. Running the ptest script again produced the following summaries:

Test Results Summary
--------------------

Class Name            Total Tests   Failed Tests   Skipped Tests
-------------------  ------------  -------------  --------------
test.test_scenarios        2 (+1)              0               0
---                       --                   -               -
TOTALS                    44 (+1)              0               0

Test Coverage Summary
---------------------

Type       Covered   Measured     Percentage
--------  --------  ---------  -------------
Lines     509 (+4)   511 (+4)  99.61 ( 0.00)
Branches  160 (+2)   166 (+2)  96.39 (+0.04)

Based on the output from the test results summary, the test does verify that once triggered, the code is working as expected. If there was any issue with the test, the summary would include the text 1 (+1) in the Failed Tests column to denote the failure. As that text is not present, it is safe to assume that both tests in the test.test_scenarios module succeeded. In addition, based on the output from the test coverage summary, the new code added 4 lines and 2 branches to the code base, and the new test code covered all of those changes.

Establishing a New Baseline

With the new source code and test code in place, I needed to publish the results and set a new baseline for the project. To do this with the ptest script, I invoked the following command line:

ptest -p

Within this ptest script, the -p option was translated into the following command:

pipenv run python pyscan/main.py --publish

When executed, the publish/coverage.json and publish/test-results.json files were updated with the current summaries. From that point on, when the script is run, it reverts to the original output of:

Test Results Summary
--------------------

Test results have not changed since last published test results.

Test Coverage Summary
---------------------

Test coverage has not changed since last published test coverage.

This process can be repeated at any time to establish a solid baseline that any new changes can be measured against.

Refactoring Code - My Refactoring Process

In practice, I frequently do “cut-and-paste” development as part of my normal development process. However, I do this with a strict rule that I follow: “2 times on the fence, 3 times refactor, clean up later”. That rule breaks down as follows:

  • if I cut-and-paste code once, I then have 2 copies, and I should consider refactoring unless I have a good reason to delay
  • if I cut-and-paste that code again, I then have 3 copies, and that third copy must be refactored into a function that the other 2 copies get merged into
  • when I have solid tests in place and I am done with primary development, go back to all of the cases where I have 2 copies and condense them if beneficial

My rationale for this rule is as follows.

When you are creating code, you want the ideas to flow freely and fast, completing a good attempt at meeting your current goal in the most efficient way possible. While cut-and-paste as a long-term strategy is not good, I find that in the short term, it helps me in creating a new function, even if that function is a copy of something done before. To balance that, from experience, if I have pasted the same code twice (meeting the criteria for “3 times refactor”), there is a very good chance that I will use that code at least one more time, if not more. At that point, it makes more sense to refactor the code to encapsulate the functionality properly before the block of code becomes too unwieldy.

Finally, once I have completed the creation of the new source code, I go back and actively look for cases where I cut-and-pasted code and decide whether it is worth refactoring that code, leaning toward refactoring if I am on the fence. At the very least, refactoring code into a function almost always makes the code more readable and maintainable. Basically, by following the above rule for refactoring, I almost always change the code in a positive manner.

The summaries provided to me from PyScan help me with this refactoring in a big way. Most of the time, the main idea with refactoring is to change the code on the “inside” of the program or script without changing the “outside” of the program or script. If any changes are made to the “outside”, they are usually small changes with very predictable impacts. The PyScan summaries assist me in ensuring that any changes to the outside of the script are kept small and manageable while also measuring the improvements made to the inside of the script. Essentially, seeing both summaries helps me keep the code refactor of the script very crisp and on course.

Refactoring Code - Leveraging The Summaries

A good set of functions for me to look at for clean-up refactoring were the generate_test_report and generate_coverage_report functions. When I wrote those two functions, I wasn’t sure how much difference there was going to be between them, so I did an initial cut-and-paste (see “2 times on the fence”) and started making changes. As those parts of PyScan are now solid and tested, I went back (see “clean up later”) and compared the two functions to see what was safe to refactor.

The first refactor I performed was to extract the xml loading logic into a new __load_xml_docment function. While I admit I didn’t get it right the first time, the tests kept me in check and made sure that, after a couple of tries, I got it right. And when I say “tries”, I mean that I made a change, ran ptest, got some information, and diagnosed it… all within about 30-60 seconds per iteration. In the end, the summary looked like this:

Test Results Summary
--------------------

Test results have not changed since last published test results.

Test Coverage Summary
---------------------

Type        Covered   Measured     Percentage
--------  ---------  ---------  -------------
Lines     499 (-10)  501 (-10)  99.60 (-0.01)
Branches  154 ( -6)  160 ( -6)  96.25 (-0.14)

As expected, the refactor eliminated both lines of code and branches, with the measured values noted in the summary.
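
To give a feel for the shape of that refactor, the following is a minimal sketch of such a helper, assuming the standard library's xml.etree.ElementTree module; it is my own illustration and not the project's actual __load_xml_docment implementation:

import xml.etree.ElementTree as ET


def load_xml_document(xml_file_name):
    """Load one of the XML report files, returning its root element or None on failure."""
    try:
        return ET.parse(xml_file_name).getroot()
    except (ET.ParseError, OSError) as this_exception:
        print(f"Unable to load file '{xml_file_name}': {this_exception}")
        return None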

The second refactor I made was to extract the summary file writing logic into a new __save_summary_file function. I followed a similar pattern to the refactor for __load_xml_docment, but there was a small difference. In this case, I observed that for a specific error case, one function’s message specified test coverage while the other specified test summary. Seeing as consistent naming in output is always beneficial, I decided to change the error messages to be consistent with each other. The test coverage name used by the first function remained the same, but the test summary name was changed to test report, with the word summary now appended by the refactored function itself.
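
As with the first refactor, the following is only an illustrative sketch of what such a helper might look like; the function signature and message format here are my own, not the project's actual __save_summary_file code:

import json


def save_summary_file(summary_file_name, summary_object, report_name):
    """Write the computed summary to the named file, using a consistent error message on failure."""
    try:
        with open(summary_file_name, "w", encoding="utf-8") as summary_file:
            json.dump(summary_object, summary_file, indent=4)
        return True
    except OSError as this_exception:
        print(
            f"Project {report_name} summary file '{summary_file_name}' was not written ({this_exception})."
        )
        return False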

At this point, I knew that one test for each of the test results scenarios and test coverage scenarios was going to fail, but I knew that it would fail in a very specific manner. Based on the above changes, the text Project test summary file for the results scenario test should change to Project test report summary file and the text Project test coverage file for the coverage scenario test should change to Project test coverage summary file.

When I ran the tests after these changes, there were indeed 2 errors, specifically in the tests that I expected them to show up in. Once those 2 tests were changed to reflect the new consistent text, the tests were run again and produced the following output:

Test Results Summary
--------------------

Test results have not changed since last published test results.

Test Coverage Summary
---------------------

Type        Covered   Measured     Percentage
--------  ---------  ---------  -------------
Lines     491 (-18)  493 (-18)  99.59 (-0.01)
Branches  152 ( -8)  158 ( -8)  96.20 (-0.18)

Once again, the output matched my expectations. While it was not a large number of lines or branches, an additional 8 lines and 2 branches were refactored away.

Determining Additive Test Function Coverage

There are times after I have written a series of tests where I wonder how much actual coverage a given test contributes to the overall test coverage percentage. As test coverage is a collaborative effort of all of the tests, a single number that identifies the amount of code covered by a single test is not meaningful. However, a meaningful piece of information is what unique coverage a given test contributes to the collection of tests as a whole.

To demonstrate how I do this, I picked one of the tests that addresses one of the error conditions, the test_summarize_cobertura_report_with_bad_source function in the test_coverage_scenarios.py file. Before I changed anything, I made sure to publish the current state to use it as a baseline. To determine the additive coverage this test provides, I simply changed its name to xtest_summarize_cobertura_report_with_bad_source. As the pytest program only matches on functions that start with test_, the function was then excluded from the tests to be executed.
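
For clarity, the entire change is just that rename, which is enough for PyTest's default collection rules to skip the function:

# Renamed from test_summarize_cobertura_report_with_bad_source; PyTest only
# collects functions whose names start with "test_", so this one is now skipped.
def xtest_summarize_cobertura_report_with_bad_source():
    ...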

Upon running the ptest script, I got the following output:

Test Results Summary
--------------------

Class Name                     Total Tests   Failed Tests   Skipped Tests
----------------------------  ------------  -------------  --------------
test.test_coverage_scenarios       11 (-1)              0               0
---                                --                   -               -
TOTALS                             43 (-1)              0               0

Test Coverage Summary
---------------------

Type       Covered   Measured     Percentage
--------  --------  ---------  -------------
Lines     507 (-2)        511  99.22 (-0.39)
Branches  159 (-1)        166  95.78 (-0.60)

Interpreting this output, given what I documented earlier in this article, was pretty easy. As I “disabled” one of the coverage scenario tests in the test_coverage_scenarios.py file, the summary reports one less test in test.test_coverage_scenarios as expected. That disabled test added 2 lines of coverage and 1 branch of coverage to the overall effort, coverage that was now being reported as missing. As this test was added specifically to test a single error case, this was expected.

If I instead disable the test_junit_jacoco_profile test in the test_coverage_profiles.py file in the same manner, I get a different result:

Test Results Summary
--------------------

Class Name                    Total Tests   Failed Tests   Skipped Tests
---------------------------  ------------  -------------  --------------
test.test_coverage_profiles        1 (-1)              0               0
---                               --                   -               -
TOTALS                            43 (-1)              0               0

Test Coverage Summary
---------------------

Type       Covered   Measured     Percentage
--------  --------  ---------  -------------
Lines     501 (-8)        511  98.04 (-1.57)
Branches  152 (-8)        166  91.57 (-4.82)

Like the previous output, the disabled test shows up as being removed, but a lot more coverage was removed along with it. Strangely enough, this was also expected. As I also use PyScan to summarize test results from Java projects I work on, its coverage model includes all 6 coverage measurements available from Jacoco 1, not just the 2 measurements generated by PyTest for Python coverage. A quick look at the report/coverage/pyscan_model_py.html file confirmed that this was indeed the reason for the difference, with the disabled test exercising 4 additional paths in each of the serialization and deserialization functions. Basically, four paths of one line each, times two (one for serialization and one for deserialization), and the 8 lines/branches of missing coverage are explained.

Wrapping Up

I believe that my decision to talk about how I use my PyScan tool to summarize test results and test coverage was the right one. It is difficult for me to quantify exactly how much benefit PyScan has provided to my development process, but it is easily in the very positive to indispensable category. By providing a quick summary of the test results file and the test coverage file, I can ensure that any changes I make are having the proper effects on those two files at each stage of the change that I am making. I hope that by walking through this process and how it helps me, it will inspire others to adopt something similar in their development processes.


  1. For an example Jacoco HTML report that shows all 6 coverage measurements, check out the report trunk coverage for Jacoco
