Introduction¶
As part of creating a Markdown Linter for my personal website, I believe it is imperative to have solid testing on both the linter and the tools used to test it. In previous articles, I talked about the framework I use to scenario test Python scripts and how my current PyTest setup produces useful test reports, both human-readable and machine-readable. These two things allow me to properly test my Python scripts, to collect information on the tests used to verify those scripts, and to determine how well the collection of tests covers those scripts.
While the human-readable reports are very useful for digging into issues, I often find that I need a simple and concise “this is where you are now” summary that gives me the most pertinent information from those reports. Enter the next tool in my toolbox, a Python script that summarizes information from the machine-readable reports, unimaginatively called PyScan. While it is a simple tool, I constantly use it when writing new Python scripts and their tests to ensure the development is going in the direction that I want. This article describes how I use the tool and how it benefits my development process.
Why Not Discuss The Script Itself?¶
When coming up with the idea for this article, I had two beneficial paths available: focus on the code behind the PyScan tool or focus on the usage of the PyScan tool. Both paths have merit, and both easily provide enough substance for a full article. After a lot of thought, I decided to focus on the usage of this tool instead of the code itself. I made this decision primarily due to my heavy use of the PyScan tool and its significant benefit to my development process.
I rely on PyScan to give me an accurate summary of the tests used to verify any changes, along with the impact on code coverage for each of those changes. While I can develop without PyScan, I find that using it immediately increases my confidence in each change I make. When I make a given type of change to either the source code or the test code, I expect a related side effect to appear in the test results report and the test coverage report. By having PyScan produce summaries of the test results and test coverage, each side effect is more visible, adding validation that the changes made are the right changes.
In the end, the choice became an easy one: focus on the path with the most positive impact. I felt that documenting how I use this tool satisfied that requirement with room to spare. I also felt that if any readers are still interested in looking at the code behind the script, it is easy enough to point them to the project’s GitHub repository and make sure it is well documented.
Setting Up PyScan For Its Own Project¶
Based on the setup from
the last article, the PyTest command
line options --junitxml=report/tests.xml
and --cov-report xml:report/coverage.xml
place the tests.xml
file and the coverage.xml
file in the report
directory.
Based on observation, the tests.xml file is in a JUnit XML format and the coverage.xml file is in a Cobertura XML format. The format of the tests.xml file is pretty obvious from the command line flag required to generate it. The format of the coverage.xml file took a bit more effort, but the following line of the file keyed me in to its format:
<!-- Based on https://raw.githubusercontent.com/cobertura/web/master/htdocs/xml/coverage-04.dtd -->
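If you want to double-check those formats yourself, a few lines of Python are enough to peek at the root element of each report. The following is only a quick sketch for that check, not part of PyScan, and it assumes the report paths used above:
from xml.etree import ElementTree

def sniff_report_format(path):
    # Guess a report file's format from the tag of its root element.
    root_element = ElementTree.parse(path).getroot()
    if root_element.tag in ("testsuite", "testsuites"):
        return "JUnit XML"
    if root_element.tag == "coverage":
        return "Cobertura XML"
    return "unknown (root element <" + root_element.tag + ">)"

if __name__ == "__main__":
    for report_path in ("report/tests.xml", "report/coverage.xml"):
        print(report_path + ": " + sniff_report_format(report_path))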
From within the project’s root directory, the main script is located at pyscan/main.py.
Since the project uses pipenv
, the command line to invoke the script is
pipenv run python pyscan/main.py
and invoking the script with the --help
option
gives us the options that we can use. Following the information from the help text,
the command line that I use from the project’s root directory is:
pipenv run python pyscan/main.py --junit report/tests.xml --cobertura report/coverage.xml
With everything set up properly, the output from that command looks like:
Test Results Summary
--------------------
Class Name Total Tests Failed Tests Skipped Tests
---------------------------- ------------ ------------- --------------
test.test_coverage_profiles 2 0 0
test.test_coverage_scenarios 12 0 0
test.test_publish_scenarios 9 0 0
test.test_results_scenarios 19 0 0
test.test_scenarios 1 0 0
--- -- - -
TOTALS 43 0 0
Test Coverage Summary
---------------------
Type Covered Measured Percentage
------------ -------- --------- -----------
Instructions --- --- -----
Lines 505 507 99.61
Branches 158 164 96.34
Complexity --- --- -----
Methods --- --- -----
Classes --- --- -----
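As an aside, the Lines and Branches rows in that summary map directly onto attributes that the Cobertura format places on the root coverage element. The sketch below is only an illustration of where those numbers come from, not PyScan's actual implementation, and it assumes the report carries the lines-covered, lines-valid, branches-covered, and branches-valid attributes described by the coverage-04 DTD:
from xml.etree import ElementTree

def summarize_cobertura(path="report/coverage.xml"):
    # Read the line and branch totals straight from the root <coverage> element.
    root_element = ElementTree.parse(path).getroot()
    for label, covered_name, valid_name in (
        ("Lines", "lines-covered", "lines-valid"),
        ("Branches", "branches-covered", "branches-valid"),
    ):
        covered = int(root_element.get(covered_name))
        valid = int(root_element.get(valid_name))
        percentage = (100.0 * covered / valid) if valid else 0.0
        print("%-8s %8d %9d %11.2f" % (label, covered, valid, percentage))

if __name__ == "__main__":
    summarize_cobertura()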
Before We Continue…¶
To complete my setup, two more things are needed. The first is that I primarily execute the tests from a simple Windows script called ptest.cmd. While there is a lot of code in the ptest.cmd script to handle errors and options, when the script is boiled down to its bare essence, it runs the tests and reports on them as follows:
pipenv run pytest
pipenv run python pyscan/main.py --only-changes --junit report/tests.xml --cobertura=report/coverage.xml
Note
I also have a Bash version called ptest.sh which I have experimented with locally, but it is not checked in to the project. If you are interested in this script, please let me know in the comments below.
Setting up a script like ptest keeps things simple and easy to use. One notable part of the script is a little bit of logic that skips summarizing coverage if there are any issues running the tests under PyTest. Call me a purist, but if the tests fail to execute or are not passing, any measurements of how well the tests cover the code are moot.
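To make that “tests first, coverage summary second” behaviour concrete, here is a minimal sketch of the same idea in Python. It is only an illustration under my own assumptions about the workflow, not the actual contents of ptest.cmd or ptest.sh:
import subprocess
import sys

def run(command):
    # Echo the command, run it, and hand back its exit code.
    print("+ " + " ".join(command))
    return subprocess.call(command)

if __name__ == "__main__":
    # Run the tests first; if they do not pass, any coverage numbers are moot.
    test_exit_code = run(["pipenv", "run", "pytest"])
    if test_exit_code != 0:
        sys.exit(test_exit_code)

    # Only summarize the reports once the tests have passed.
    sys.exit(
        run(
            [
                "pipenv", "run", "python", "pyscan/main.py",
                "--only-changes",
                "--junit", "report/tests.xml",
                "--cobertura", "report/coverage.xml",
            ]
        )
    )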
The other thing that I have set up is a small change to the command line for PyScan. In the “bare essence” text above, after the text pyscan/main.py, there is a new option used for PyScan: the --only-changes option. By adding the --only-changes option, PyScan restricts the output to only those items that show changes. If no changes are detected, it displays a simple line stating that no changes have been observed. In the case of the above output, the output with this new option is as follows:
Test Results Summary
--------------------
Test results have not changed since last published test results.
Test Coverage Summary
---------------------
Test coverage has not changed since last published test coverage.
To me, this gives a very clear indication that things have not changed. In the following sections, I go through different cases and explain what changes I made and what effects I expect to see summarized.
Introducing Changes and Observing Behavior¶
For this section of the article, I temporarily added a “phantom” feature called
“nothing” to PyScan. This feature is facilitated by two code changes.
In the __parse_arguments
function, I added the following code:
parser.add_argument(
    "--nothing",
    dest="do_nothing",
    action="store_true",
    default=False,
    help="only_changes",
)
and in the main
function, I changed the code as follows:
args = self.__parse_arguments()
if args.do_nothing:
    print("noop")
    sys.exit(1)
Note that this feature is only present for the sake of these examples, and is not in the project’s code base.
Adding New Code¶
When I added the above code for the samples, the output that I got after running the tests was:
Test Results Summary
--------------------
Test results have not changed since last published test results.
Test Coverage Summary
---------------------
Type Covered Measured Percentage
-------- -------- --------- -------------
Lines 507 (+2) 511 (+4) 99.22 (-0.39)
Branches 159 (+1) 166 (+2) 95.78 (-0.56)
Based on the introduced changes, this output was expected. In the Measured
column,
4 new lines were added (1 in __parse_arguments
and 3 in main
) and the
if args.do_nothing:
line added 2 branches (one for True and one for False). In the
Covered
column, without any tests to exercise the new code, 2 lines are
covered by default (1 in __parse_arguments
and 1 in main
) and 1 branch is covered
by default (the False case of if args.do_nothing:
).
Adding a New Test¶
Having added source code to the project, I added a test to address the new code. To
start, I added this simple test function to the test_scenarios.py
file:
def test_nothing():
    pass
This change is just a stub for a test function, so the expected change is that the number of tests for that module increases and there is no change in coverage. This effect is borne out by the output:
Test Results Summary
--------------------
Class Name Total Tests Failed Tests Skipped Tests
------------------- ------------ ------------- --------------
test.test_scenarios 2 (+1) 0 0
--- -- - -
TOTALS 44 (+1) 0 0
Test Coverage Summary
---------------------
Type Covered Measured Percentage
-------- -------- --------- -------------
Lines 507 (+2) 511 (+4) 99.22 (-0.39)
Branches 159 (+1) 166 (+2) 95.78 (-0.56)
Populating the Test Function¶
Now that a stub for the test is in place and registering, I added a real body to the test function as follows:
def test_nothing():
    # Arrange
    executor = MainlineExecutor()
    supplied_arguments = ["--nothing"]

    expected_output = """noop
"""
    expected_error = ""
    expected_return_code = 1

    # Act
    execute_results = executor.invoke_main(arguments=supplied_arguments, cwd=None)

    # Assert
    execute_results.assert_results(
        expected_output, expected_error, expected_return_code
    )
The code that I added at the start of this section is triggered by the command line argument --nothing, printing the simple response text noop and returning a return code of 1. This test code was crafted to trigger that code and to verify the expected output. Running the tests again produced the following summary:
Test Results Summary
--------------------
Class Name Total Tests Failed Tests Skipped Tests
------------------- ------------ ------------- --------------
test.test_scenarios 2 (+1) 0 0
--- -- - -
TOTALS 44 (+1) 0 0
Test Coverage Summary
---------------------
Type Covered Measured Percentage
-------- -------- --------- -------------
Lines 509 (+4) 511 (+4) 99.61 ( 0.00)
Branches 160 (+2) 166 (+2) 96.39 (+0.04)
Based on the output from the test results summary, the test does verify that once
triggered, the code is working as expected. If there was any issue with the test,
the summary would include the text 1 (+1)
in the Failed Tests
column to denote
the failure. As that text is not present, it is safe to assume that both tests in
the test.test_scenarios
module succeeded. In addition, based on the output from the
test coverage summary, the new code added 4 lines and 2 branches to the code base, and
the new test code covered all of those changes.
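For anyone curious how a table like the test results summary can be produced from the tests.xml file, the sketch below tallies testcase elements by their classname attribute, counting failure or error children as failures and skipped children as skips. It is only an approximation of the idea, not PyScan's implementation:
from collections import defaultdict
from xml.etree import ElementTree

def tally_junit_report(path="report/tests.xml"):
    # Count total, failed, and skipped tests for each class in a JUnit XML report.
    totals = defaultdict(lambda: [0, 0, 0])
    for test_case in ElementTree.parse(path).getroot().iter("testcase"):
        counts = totals[test_case.get("classname")]
        counts[0] += 1
        if test_case.find("failure") is not None or test_case.find("error") is not None:
            counts[1] += 1
        elif test_case.find("skipped") is not None:
            counts[2] += 1
    return totals

if __name__ == "__main__":
    for class_name, (total, failed, skipped) in sorted(tally_junit_report().items()):
        print("%-30s %5d %5d %5d" % (class_name, total, failed, skipped))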
Establishing a New Baseline¶
With the new source code and test code in place, I needed to publish the results and
set a new baseline for the project. To do this with the ptest
script, I invoked the
following command line:
ptest -p
Within this ptest
script, the -p
option was translated into the following command:
pipenv run python pyscan/main.py --publish
When executed, the publish/coverage.json
and publish/test-results.json
files were
updated with the current summaries. From that point on, running the script reverted to the original output of:
Test Results Summary
--------------------
Test results have not changed since last published test results.
Test Coverage Summary
---------------------
Test coverage has not changed since last published test coverage.
This process can be repeated at any time to establish a solid baseline that any new changes can be measured against.
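The publish-and-compare cycle itself boils down to saving a summary and diffing against it on the next run. The sketch below captures only that idea; the dictionary keys, file name, and layout are made up for illustration and do not reflect the actual contents of publish/test-results.json or publish/coverage.json:
import json
import os

SUMMARY_PATH = "publish/example-summary.json"  # hypothetical file, not one PyScan writes

def publish_summary(summary):
    # Save the current summary as the new baseline.
    os.makedirs(os.path.dirname(SUMMARY_PATH), exist_ok=True)
    with open(SUMMARY_PATH, "w") as summary_file:
        json.dump(summary, summary_file)

def changes_since_baseline(summary):
    # Return only the values that differ from the published baseline.
    if not os.path.isfile(SUMMARY_PATH):
        return summary
    with open(SUMMARY_PATH) as summary_file:
        baseline = json.load(summary_file)
    return {key: value for key, value in summary.items() if baseline.get(key) != value}

if __name__ == "__main__":
    current_summary = {"total_tests": 44, "covered_lines": 509, "measured_lines": 511}
    changed_values = changes_since_baseline(current_summary)
    if not changed_values:
        print("Summary has not changed since last published summary.")
    else:
        print("Changed values: " + str(changed_values))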
Refactoring Code - My Refactoring Process¶
In practice, I frequently do “cut-and-paste” development as part of my normal workflow. However, I do this with a strict rule that I follow: “2 times on the fence, 3 times refactor, clean up later”. That rule breaks down as follows:
- if I cut-and-paste code once, I then have 2 copies, and I should consider refactoring unless I have a good reason to delay
- if I cut-and-paste that code again, I then have 3 copies, and that third copy must be refactored into a function that the other 2 copies get merged into
- when I have solid tests in place and I am done with primary development, I go back to all of the cases where I have 2 copies and condense them if beneficial
My rationale for this rule is as follows.
When you are creating code, you want the ideas to flow freely and fast, completing a good attempt at meeting your current goal in the most efficient way possible. While cut-and-paste as a long-term strategy is not good, I find that in the short term it helps me create a new function, even if that function is a copy of something done before. To balance that, from experience, if I have pasted the same code twice (meeting the criteria for “3 times refactor”), there is a very good chance that I will use that code at least one more time, if not more. At that point, it makes more sense to refactor the code to encapsulate the functionality properly before the block of code becomes too unwieldy.
Finally, once I have completed the new source code, I go back and actively look for cases where I cut-and-pasted code and decide whether it is worth refactoring, choosing to refactor if I am on the fence. At the very least, refactoring code into a function almost always makes the code more readable and maintainable. Basically, by following the above rule for refactoring, I almost always change the code in a positive manner.
The summaries provided to me from PyScan help me with this refactoring in a big way. Most of the time, the main idea with refactoring is to change the code on the “inside” of the program or script without changing the “outside” of the program or script. If any changes are made to the “outside”, they are usually small changes with very predictable impacts. The PyScan summaries assist me in ensuring that any changes to the outside of the script are kept small and manageable while also measuring the improvements made to the inside of the script. Essentially, seeing both summaries helps me keep the code refactor of the script very crisp and on course.
Refactoring Code - Leveraging The Summaries¶
A good set of functions for me to look at for clean-up refactoring was the generate_test_report and generate_coverage_report pair. When I wrote those two functions, I wasn’t sure how much difference there was going to be between them, so I did an initial cut-and-paste (see “2 times on the fence”) and started making changes. As those parts of PyScan are now solid and tested, I went back (see “clean up later”) and compared the two functions to see what was safe to refactor.
The first refactor I performed was to extract the XML loading logic into a new
__load_xml_docment
function. While I admit I didn’t get it right the first time, the
tests kept me in
check and made sure that, after a couple of tries, I got it right. And when I say
“tries”, I mean that I made a change, ran ptest
, got some information, and diagnosed
it… all within about 30-60 seconds per iteration. In the end, the summary looked like
this:
Test Results Summary
--------------------
Test results have not changed since last published test results.
Test Coverage Summary
---------------------
Type Covered Measured Percentage
-------- --------- --------- -------------
Lines 499 (-10) 501 (-10) 99.60 (-0.01)
Branches 154 ( -6) 160 ( -6) 96.25 (-0.14)
As expected, the refactor eliminated both lines of code and branches, with the measured values noted in the summary.
The second refactor I made was to extract the summary file writing logic into a new
__save_summary_file
function. I followed a similar pattern to the refactor for
__load_xml_docment
, but there was a small difference. In this case, I observed that
for a specific error case, one function specified test coverage
and the other function
specified test summary
. Seeing as consistent names in output is always beneficial,
I decided to change the error messages to be consistent with each other. The
test coverage
name for the first function remained the same, but the test summary
name was changed to test report
, with the text summary
added in the refactored
function.
At this point, I knew that one test for each of the test results scenarios and test
coverage scenarios was going to fail, but I knew that it would fail in a very specific
manner. Based on the above changes, the text Project test summary file
for the
results scenario test should change to Project test report summary file
and the text
Project test coverage file
for the coverage scenario test should change to
Project test coverage summary file
.
When I ran the tests after these changes, there were indeed 2 errors, specifically in the tests where I thought they would show up. Once those 2 tests were changed to reflect the new consistent text, the tests were run again and produced the following output:
Test Results Summary
--------------------
Test results have not changed since last published test results.
Test Coverage Summary
---------------------
Type Covered Measured Percentage
-------- --------- --------- -------------
Lines 491 (-18) 493 (-18) 99.59 (-0.01)
Branches 152 ( -8) 158 ( -8) 96.20 (-0.18)
Once again, the output matched my expectations. While it wasn’t a large number of lines or branches, an additional 8 lines and 2 branches were refactored away.
Determining Additive Test Function Coverage¶
There are times after I have written a series of tests where I wonder how much actual coverage a given test contributes to the overall test coverage percentage. As test coverage is a collaborative effort of all of the tests, a single number that identifies the amount of code covered by a single test is not meaningful. However, a meaningful piece of information is what unique coverage a given test contributes to the collection of tests as a whole.
To demonstrate how I do this, I picked one of the tests that addresses one of the error
conditions, the test_summarize_cobertura_report_with_bad_source
function in the
test_coverage_scenarios.py
file. Before I
changed anything, I made sure to publish the current state to use it as a baseline. To
determine the additive coverage this test provides, I simply changed its name to
xtest_summarize_cobertura_report_with_bad_source
. As the pytest
program only
matches on functions that start with test_
, the function was then excluded from the
tests to be executed.
Upon running the ptest
script, I got the following output:
Test Results Summary
--------------------
Class Name Total Tests Failed Tests Skipped Tests
---------------------------- ------------ ------------- --------------
test.test_coverage_scenarios 11 (-1) 0 0
--- -- - -
TOTALS 43 (-1) 0 0
Test Coverage Summary
---------------------
Type Covered Measured Percentage
-------- -------- --------- -------------
Lines 507 (-2) 511 99.22 (-0.39)
Branches 159 (-1) 166 95.78 (-0.60)
Interpreting this output, given what I documented earlier in this article, was pretty
easy. As I “disabled”
one of the coverage scenario tests in the test_coverage_scenarios.py
file, the summary
reports one less test in test.test_coverage_scenarios
as expected. That disabled
test added 2 lines of coverage and 1 branch of coverage to the overall effort, coverage
that was now being reported as missing. As this test was added specifically to test a
single error case, this was expected.
If instead I disable the xtest_junit_jacoco_profile
test in the
test_coverage_profiles.py
file, I get a different result:
Test Results Summary
--------------------
Class Name Total Tests Failed Tests Skipped Tests
--------------------------- ------------ ------------- --------------
test.test_coverage_profiles 1 (-1) 0 0
--- -- - -
TOTALS 43 (-1) 0 0
Test Coverage Summary
---------------------
Type Covered Measured Percentage
-------- -------- --------- -------------
Lines 501 (-8) 511 98.04 (-1.57)
Branches 152 (-8) 166 91.57 (-4.82)
Like the previous output, the disabled test shows up as being removed, but a lot more coverage was removed along with it. Strangely enough, this was also expected. As I also use PyScan to summarize test results from Java projects I work on, I used all 6 coverage measurements available from Jacoco [1] as a baseline for the 2 measurements generated by PyTest for Python coverage. A quick look at the report/coverage/pyscan_model_py.html file confirmed that this was indeed the reason for the difference, with the test exercising 4 additional paths in each of the serialization and deserialization functions. Basically, four paths of one line each, times two (one for serialization and one for deserialization), explains the 8 lines and 8 branches that are no longer covered.
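Another way to answer the “what does this one test uniquely cover” question is to diff two Cobertura reports directly: one generated with the test enabled and one with it disabled. The sketch below is only an illustration of that approach, and the two file names are hypothetical copies saved off between runs:
from xml.etree import ElementTree

def covered_lines(path):
    # Collect every (file name, line number) pair that has at least one hit.
    covered = set()
    for class_element in ElementTree.parse(path).getroot().iter("class"):
        file_name = class_element.get("filename")
        for line_element in class_element.iter("line"):
            if int(line_element.get("hits")) > 0:
                covered.add((file_name, int(line_element.get("number"))))
    return covered

if __name__ == "__main__":
    # Hypothetical copies of report/coverage.xml made before and after disabling the test.
    with_test = covered_lines("report/coverage-with-test.xml")
    without_test = covered_lines("report/coverage-without-test.xml")
    for file_name, line_number in sorted(with_test - without_test):
        print("only covered by the disabled test: " + file_name + ":" + str(line_number))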
Wrapping Up¶
I believe that my decision to talk about how I use my PyScan tool to summarize test results and test coverage was the right one. It is difficult for me to quantify exactly how much benefit PyScan has provided to my development process, but it is easily in the very positive to indispensable category. By providing a quick summary of the test results file and the test coverage file, I can ensure that any changes I make are having the proper effects on those two files at each stage of the change that I am making. I hope that by walking through this process and how it helps me, I will inspire others to adopt something similar in their own development processes.
[1] For an example Jacoco HTML report that shows all 6 coverage measurements, check out the report trunk coverage for Jacoco.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.