Markdown Linter - Road To Initial Release - Creating A Package

Summary¶

In my last article, I talked about resolving the remaining Priority 1 items from the issues list. In this article, I talk about how I worked through the issues in creating an installable package for the project.

Introduction¶

Having invested a lot of time getting the PyMarkdown project to a place where I feel confident in creating an initial release of the project, it was now time for me to create that release. To be honest, I was not sure what to expect out of the Python setup process. Creating releases for other languages is usually done with an add-on to the language, rather than with something that ships as part of the language, as is the case with Python. As such, I was genuinely interested in how the process would differ between Python and the other languages I have written installers for.

Like everything else in this project, this was going to be a learning experience, and I was eager to get underway!

What Is the Audience for This Article?¶

While detailed more eloquently in this article, my goal for this technical article is to focus on the reasoning behind my solutions, rather than the solutions themselves. For a full record of the solutions presented in this article, please go to this project’s GitHub repository and consult the commits between 01 Apr 2021 and 03 Apr 2021.

Where To Start?¶

While the changes that I needed to perform on the project to get it from its then state to a packaged state were small, the path to get there was anything but short. Having done my usual research, I ended up finding three sources that I thought would be helpful to my effort:

I liked Nicolas’ article on creating Python packages because it was the first article that I found in my searches that seemed to lay everything out on the table. It felt that it provided me with a lot of useful information in a concrete, easy to digest form. While I did have a couple of issues with his examples, I do believe that they were because I was trying to adapt his example as I went and messed things up. The FreeCodeCamp article was useful in filling in the gaps that I found in Nicolas’ article, especially when it came to what to do after you had a package. Finally, having the Python 3.8 library documentation helped me fill in the last bit of the knowledge that I needed to complete the setup process. Together, with just a dash of experimentation thrown in for good measure, I was confident that I could create a Python package. Even if that effort took a while.

Creating a New Setup.py¶

While I have had a local setup.py file on my machine for months, it was always something that I was toying around with, nothing concrete. As such, I found that it was more efficient to start from scratch and create a new setup.py file based mostly on Nicolas’s article. I do not have any issues with his use of mainline function calls, such as the ones that he uses to read the readme.md file from the directory, but I prefer things in functions. From my perspective, it just helps me to keep things readable. I did like the way he was organizing some of the values at the start of the module and decided to follow that approach. Furthermore, I decided that it was more readable to have every value in variables, instead of being somewhat hidden in the call to the setup function, so I also made that change.

The Most Important Parts of Setup¶

For me, the four most important parts of any setup are: name of the package, version of the package, minimum required platform, and a declaration of any dependencies. Others can disagree with me on whether these things are the most important parts of any setup script, but I believe I have a strong argument in my favor. It is a simple argument: without these four parts, the rest of the setup script is useless. Any documentation without something to document is pointless. Similarly, any declaration of what needs to be included in the package and how to access it is useless without that base declaration. At least in my mind, those four properties are always the foundation of any installation script.

Using the setup.py module from the article Nicolas wrote as a good set of crib notes, I created a very basic module:

import runpy
from setuptools import setup


def parse_requirements():
    # Read one requirement per line, skipping blank lines and comments.
    with open("install-requirements.txt", "r") as requirements_file:
        lineiter = [line.strip() for line in requirements_file]
    return [line for line in lineiter if line and not line.startswith("#")]


def get_semantic_version():
    # Execute version.py and pull the version string out of its globals.
    version_meta = runpy.run_path("./version.py")
    return version_meta["__version__"]


# These values come after the functions because one of them calls a function.
PACKAGE_NAME = "PyMarkdown"
SEMANTIC_VERSION = get_semantic_version()
MINIMUM_PYTHON_VERSION = "3.8.0"

setup(
    name=PACKAGE_NAME,
    version=SEMANTIC_VERSION,
    python_requires=">=" + MINIMUM_PYTHON_VERSION,
    install_requires=parse_requirements(),
)

It was not much, but it was a good start. Both the package name and the minimum Python version required are hardwired in as they are almost never going to change. The function get_semantic_version was written to encompass the code from the article to fetch the version number, and the parse_requirements function was written to encompass the requirements for the project.

Since I decided to specify the installation requirements for the project in the file install-requirements.txt, I added a very simple version of this file with a single line present:

Columnar
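
To make the filtering in parse_requirements concrete, here is a minimal sketch that applies the same rules to an in-memory string instead of a file. The helper name parse_requirement_lines and the comment line in the sample are made up for illustration; only the Columnar entry comes from the actual file.

```python
def parse_requirement_lines(text):
    """Return requirement specifiers, skipping blank lines and # comments."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


# Hypothetical file content; only "Columnar" is taken from the real file.
SAMPLE = """
# runtime dependencies
Columnar

"""

requirements = parse_requirement_lines(SAMPLE)
print(requirements)
```

The resulting list is what ends up in the install_requires argument, so comment lines and blank lines in the file never reach the setup function.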

Moving Version Information Into A Single Module¶

It took me a bit to warm up to this, but after reading PEP 396, it just made sense. If there is any reason to know the exact version of a Python library, the __version__ field applied to the library name should contain the definitive version for that library. Following this PEP did require some rearrangement of code in the project.

Previously, the only place where the version information was kept was in the __version_number field of the PyMarkdownLint class. While I debated an approach that would leverage that existing code, the simplicity of simply having a single version.py file just made more sense to me. With the get_semantic_version function already present in the setup.py module, as detailed in the last section, I added the following code to the PyMarkdownLint class to reference that same file:

    @staticmethod
    def __get_semantic_version():
        file_path = __file__
        assert os.path.isabs(file_path)
        file_path = file_path.replace(os.sep, "/")
        last_index = file_path.rindex("/")
        second_last_index = file_path.rindex("/", 0, last_index)
        file_path = file_path[0 : second_last_index + 1] + "version.py"
        version_meta = runpy.run_path(file_path)
        return version_meta["__version__"]

This code is effectively the same code as in the get_semantic_version function of the setup.py module. The only changes present were to deduce the executable path from the __file__ variable and to determine the relative location of the version.py file from where that executable is located.
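
The same lookup can also be written with pathlib instead of manual string slicing. This is only a sketch of an alternative, not the code in the project: the function name read_version is invented, and the temporary directory stands in for the real repository layout with version.py one level above the package directory.

```python
import runpy
import tempfile
from pathlib import Path


def read_version(module_file):
    """Read __version__ from version.py one directory above module_file's directory."""
    module_path = Path(module_file).resolve()
    # parent is the package directory, parent.parent is the project root.
    version_path = module_path.parent.parent / "version.py"
    return runpy.run_path(str(version_path))["__version__"]


# Throwaway layout: <root>/version.py and <root>/pymarkdown/main.py
with tempfile.TemporaryDirectory() as root:
    root_path = Path(root)
    (root_path / "version.py").write_text('__version__ = "0.5.0"\n')
    package_dir = root_path / "pymarkdown"
    package_dir.mkdir()
    main_file = package_dir / "main.py"
    main_file.write_text("")
    demo_version = read_version(main_file)

print(demo_version)
```

Stepping up with parent.parent matches the two rindex calls in the original function, which strip the file name and then the directory that contains it.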

After all this work, the only thing that was needed was a new version.py module:

"""
Library version information.
"""

__version__ = "0.5.0"

and a small change to the test_markdown_with_dash_dash_version test function to fetch the version from the version.py module.

Adding Documentation¶

With those basics out of the way, it was time to add the documentation basics to the setup.py module:

def load_readme_file():
    with open("README.md", "r") as readme_file:
        return readme_file.read()

AUTHOR = "Jack De Winter"
AUTHOR_EMAIL = "jack.de.winter@outlook.com"

ONE_LINE_DESCRIPTION = "A GitHub Flavored Markdown compliant Markdown linter."
LONG_DESCRIPTION = load_readme_file()
LONG_DESCRIPTION_CONTENT_TYPE = "text/markdown"
KEYWORDS = ["markdown", "linter", "markdown linter"]

PROJECT_CLASSIFIERS = [
    "Development Status :: 4 - Beta",
    "Programming Language :: Python :: 3.7",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]

setup(
    ...
    author=AUTHOR,
    author_email=AUTHOR_EMAIL,
    description=ONE_LINE_DESCRIPTION,
    long_description=LONG_DESCRIPTION,
    long_description_content_type=LONG_DESCRIPTION_CONTENT_TYPE,
    keywords=KEYWORDS,
    classifiers=PROJECT_CLASSIFIERS,
...

Most of these fields are self-explanatory and are simple string objects or lists of string objects. The three fields that stand apart from that are the LONG_DESCRIPTION field, the LONG_DESCRIPTION_CONTENT_TYPE field, and the PROJECT_CLASSIFIERS field.

The LONG_DESCRIPTION_CONTENT_TYPE field is the easiest of the three as it assumes that the README file for the project will always be README.md. As such, the MIME content type for the long description will always be text/markdown. For my projects, I feel that it is a good assumption to make, so that was an easy one to get out of the way. Then, to ensure that the LONG_DESCRIPTION field is always up to date, the load_readme_file function reads the contents of the README.md file and places them into the LONG_DESCRIPTION field. For me, these fields just make sense as I can keep the package description of the project and the GitHub description of the project in one place.

Finding the right values for the PROJECT_CLASSIFIERS field was the task that I had the hardest time with out of the three fields. With a seemingly endless page of available classifiers, it was hard to narrow down the classifiers to a small set. While I am not comfortable that I have the right set of classifiers for the project, I believe I have a good set to start with.

Looking at that work, the one thing that I needed to do to wrap it up was to make sure that the README.md file only contained information I wanted someone to see when they were having their initial look at the project. While I do not want to hide the project’s issues list, I did not want it to be the first thing people saw. As such, I moved it over into the new issues.md file.

Rounding Out The Setup Properties¶

According to my research, the only two other fields that I needed to add were the scripts field and the packages field. The packages field was the easy one to define out of those two: I simply needed to list all the packages for the project.¹ While both examples use the setuptools module and its find_packages function, I wanted to maintain fine-grained control over the packages. As such, I specified each package name separately.

setup(
    ...
    scripts=ensure_scripts(["scripts/pymarkdown"]),
    packages=[
        "pymarkdown",
        "pymarkdown.extensions",
        "pymarkdown.plugins",
        "pymarkdown.resources",
    ],
)
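
To illustrate the trade-off between automatic discovery and an explicit list, the sketch below is a hand-rolled, standard-library stand-in for the general idea behind package discovery: a directory counts as a package when it contains an __init__.py file. It is not the setuptools implementation, and the function name discover_packages and the temporary layout are assumptions made for the example.

```python
import os
import tempfile


def discover_packages(root):
    """List dotted package names under root: directories that contain __init__.py."""
    found = []

    def visit(directory, prefix):
        for entry in sorted(os.listdir(directory)):
            path = os.path.join(directory, entry)
            if os.path.isdir(path) and os.path.isfile(os.path.join(path, "__init__.py")):
                name = prefix + entry
                found.append(name)
                visit(path, name + ".")

    visit(root, "")
    return found


with tempfile.TemporaryDirectory() as root:
    for relative in ("pymarkdown", "pymarkdown/plugins", "pymarkdown/resources"):
        os.makedirs(os.path.join(root, relative))
    # Only two of the three directories get an __init__.py file.
    for relative in ("pymarkdown", "pymarkdown/plugins"):
        open(os.path.join(root, relative, "__init__.py"), "w").close()
    discovered = discover_packages(root)

print(discovered)
```

In this layout the data-only resources directory is not reported, which is the kind of difference an explicit list makes visible up front.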

For the specification on how to start the PyMarkdown application, it took me a while to decide on an action to use for that. During my research phase, I had three possibilities for how to interact with the project itself: py_modules, scripts, and entry_points. There was barely any information on entry_points and how to use them, so I decided to not use those unless I found enough information to warrant changing to them. Looking to my third reference source, the Python libraries documentation, I found this article on setup scripts. As that is what the standard libraries used, I decided that was the best way for this project.
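
For reference, the entry_points option that was set aside is normally declared through a console_scripts list, where each item has the form command = module:function. The sketch below only shows that shape and pulls it apart; it is not used by the project. The target pymarkdown.main:run assumes a module-level function named run, which is an invented name for this illustration.

```python
# Hypothetical alternative to the scripts= argument; not what the article uses.
# "pymarkdown.main:run" assumes a module-level function named run exists,
# which is a made-up name for this example.
ENTRY_POINTS = {
    "console_scripts": [
        "pymarkdown = pymarkdown.main:run",
    ],
}


def split_entry_point(spec):
    """Split 'command = module:attribute' into its three parts."""
    command, _, target = spec.partition("=")
    module, _, attribute = target.strip().partition(":")
    return command.strip(), module, attribute


parts = split_entry_point(ENTRY_POINTS["console_scripts"][0])
print(parts)
```

With this style, setuptools generates the launcher for each platform at install time, so no hand-written shell or batch script is involved.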

Looking at the example that Nicolas provided in his article, I quickly created my own script:

#!/usr/bin/env python

from pymarkdown import PyMarkdownLint
PyMarkdownLint.main()

but came across one glaring problem right away. That script would work well on Linux systems, but my development environment is a Windows machine. As I use the PyLint scanner on all my Python projects, I decided to look at how they solved this problem, and used their ensure_scripts function verbatim²:

def ensure_scripts(linux_scripts):
    """
    Creates the proper script names required for each platform (taken from PyLint)
    """
    if util.get_platform()[:3] == "win":
        return linux_scripts + [script + ".bat" for script in linux_scripts]
    return linux_scripts

It is wonderful in its simplicity! If the first three characters of the platform are win, then the function assumes that the list of scripts must refer to scripts that will work on a Windows machine. It accomplishes this by adding another list of scripts to the list, this new list being comprised of every element of the original list, but with a .bat appended to the end. With that, the last thing was to copy the .bat batch file format over from PyLint:

@echo off
rem Use python to execute the python script having the same name as this batch
rem file, but without any extension, located in the same directory as this
rem batch file
"%~dpn0" %*

I was not sure if that batch script was going to work, but if it was good enough for PyLint, I figured it was a good enough starting place for me.
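
To see what ensure_scripts returns on each platform without switching machines, the variant below accepts the platform string as an optional argument. That parameter is an addition for illustration only, and sysconfig.get_platform is swapped in here for the util.get_platform call used in the original function.

```python
import sysconfig


def ensure_scripts(linux_scripts, platform=None):
    """Return the script list, adding .bat companions on Windows platforms."""
    if platform is None:
        # Fall back to the platform of the running interpreter.
        platform = sysconfig.get_platform()
    if platform[:3] == "win":
        return linux_scripts + [script + ".bat" for script in linux_scripts]
    return linux_scripts


windows_scripts = ensure_scripts(["scripts/pymarkdown"], platform="win-amd64")
linux_scripts = ensure_scripts(["scripts/pymarkdown"], platform="linux-x86_64")
print(windows_scripts)
print(linux_scripts)
```

Note that on Windows both the extensionless script and the .bat file are listed, so both files must exist in the scripts directory.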

Almost Finished¶

Two simple things were left before my first attempt to compile my first Python package. The first thing was to add a simple LICENSE.txt file to the project to establish the use of the project. The other was to add a __init__.py module to the pymarkdown directory to make sure that the base of the project was considered a module for setup to pick up.

With those two things addressed and out of the way, it was time to compile the setup for the project!

The Fun Begins: Getting Packaging To Work¶

To start compiling the setup, I added the twine and setuptools packages to my development environment using pipenv install twine setuptools. Once that was complete, I added the following package.cmd script to the repository to make things easier:

rmdir /s /q dist
rmdir /s /q build
rmdir /s /q PyMarkdown.egg-info

pipenv run python setup.py sdist bdist_wheel
pipenv run twine check dist/*

It was nothing fancy, but it allowed me to repeatedly repackage the project to test any changes in an efficient manner. Basically, it removes any signs of a previous build before running the setup.py script and then the twine script. While it is not as fancy as the Gradle scripts I have for Java projects at work, I found that it is uncomplicated and works very well. I purposefully did not add any error handling to the batch script as I wanted to make sure I saw all the information that was reported, unfiltered.
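
For anyone not on Windows, the cleanup half of that batch script could be sketched in Python as follows. This is an assumed equivalent rather than something in the repository: the function name clean_build_artifacts is invented, and the temporary directory stands in for the project root.

```python
import shutil
import tempfile
from pathlib import Path

# The same three directories that package.cmd removes.
ARTIFACT_DIRECTORIES = ("dist", "build", "PyMarkdown.egg-info")


def clean_build_artifacts(project_root):
    """Delete leftover build directories and return the names that were removed."""
    removed = []
    for name in ARTIFACT_DIRECTORIES:
        target = Path(project_root) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed


with tempfile.TemporaryDirectory() as root:
    (Path(root) / "dist").mkdir()
    (Path(root) / "build").mkdir()
    first_pass = clean_build_artifacts(root)
    second_pass = clean_build_artifacts(root)

print(first_pass, second_pass)
```

Checking is_dir before deleting means a missing directory is skipped quietly, so the cleanup can be repeated safely.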

To assist in testing those changes, I created a new project pymtest at the same level as the PyMarkdown project and left it almost empty for now. I created that project to be my test installation environment, useful once I had a package to install. For now, I just wanted to get it ready for later. Thus, I created a simple refresh_package.cmd script with these contents:

pipenv uninstall PyMarkdown
pipenv install ..\pymarkdown\dist\PyMarkdown-0.5.0.tar.gz

Simply put, the script uninstalls any existing PyMarkdown package and installs a new one right from the dist directory of the PyMarkdown project.

Now on to the real work: debugging the install script.

Pass 1: Getting The Version Right¶

Executing the package.cmd script, everything worked fine, and I had a new package to test! Switching over to my test project, I executed the refresh_package.cmd batch script… and waited. Looking at the output, the uninstall command was completing in under a second, but the install command was taking its time on the Resolving phase of installing the package. It was agonizing!

But when it was done, it displayed the following error:

    ERROR: Command errored out with exit status 1:
...    
    FileNotFoundError: [Errno 2] No such file or directory: '..pip-req-build-mfg5j1bu\\version.py'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I tried a couple of different things with no luck before I opened the PyMarkdown-0.5.0.tar.gz archive file from the project’s dist directory and examined its contents. When I did that, I noticed that there was no version.py file anywhere in the archive.
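
That manual inspection can also be scripted with the standard tarfile module. The sketch below is an assumption about how one might automate the check, with an invented function name, and it builds a tiny stand-in archive so that it runs without a real package.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def archive_contains(archive_path, file_name):
    """Return True if any member of the .tar.gz archive has the given base name."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return any(Path(member).name == file_name for member in archive.getnames())


# Build a tiny stand-in archive so the sketch runs without a real sdist.
with tempfile.TemporaryDirectory() as root:
    demo_archive = Path(root) / "demo-0.0.0.tar.gz"
    with tarfile.open(demo_archive, "w:gz") as archive:
        payload = b"from setuptools import setup\n"
        info = tarfile.TarInfo("demo-0.0.0/setup.py")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    has_setup = archive_contains(demo_archive, "setup.py")
    has_version = archive_contains(demo_archive, "version.py")

print(has_setup, has_version)
```

Pointed at the real archive in the dist directory, a check like this would flag a missing version.py before the slow install step is attempted.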

At that point, I spent about an hour or so trying to figure out how to get that version.py file into the archive at the right place before deciding to go with a more intuitive approach. After looking at how the files were installed after the install pymarkdown command was completed, it was obvious that my current approach would necessitate copying the version.py file into the pymarkdown directory. So, instead of trying to figure out how to do that “complicated” action, I decided on the “simple” action to move the version.py file into the pymarkdown directory.

With that decision made, I rewrote the get_semantic_version function in the setup.py module as follows:

def get_semantic_version():
    version_meta = runpy.run_path("./pymarkdown/version.py")
    return version_meta["__version__"]

I also rewrote the __get_semantic_version function in the main.py module as follows:

    def __get_semantic_version():
        file_path = __file__
        if not os.path.isabs(file_path):
            assert False
        file_path = file_path.replace(os.sep, "/")
        last_index = file_path.rindex("/")
        file_path = file_path[0 : last_index + 1] + "version.py"
        version_meta = runpy.run_path(file_path)
        return version_meta["__version__"]

With the version.py file moved into the pymarkdown directory, and with both references to that file now looking for it in the new location, that error was now resolved.

Pass 2: File Name Casing Matters¶

After packaging the project again, I ran the refresh_package.cmd script and was now greeted with this error:

FileNotFoundError: [Errno 2] No such file or directory: 'README.md'

I examined the directory structure of the archive a good four or five times without any ideas coming to mind. I even looked at the Python install pages to see if I could find anything. But all I could find was a list of the files to distribute. This included other types of readme files, but not specifically the README.md file. Double checking the project that Nicolas set up, I saw that he was using README.md as a source for his long documentation without any apparent extra setup needed to include that file. So, I figured it must be something else.

That is when it hit me. Windows has many uses as an operating system³, but one of the things I do not like about it is the case-insensitivity of the file system. In this case, I had called the readme file readme.md instead of README.md. Simply correcting the case of the file name resolved this issue.
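
A casing mistake like this is hard to spot on Windows because an existence check succeeds for either spelling. One way to catch it, sketched below with an invented helper name, is to compare against the names exactly as the directory listing reports them.

```python
import os
import tempfile


def has_exact_name(directory, file_name):
    """Check for file_name with exact casing, even on case-insensitive file systems."""
    # os.listdir reports names as stored, so the comparison is case-sensitive.
    return file_name in os.listdir(directory)


with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "readme.md"), "w").close()
    lowercase_found = has_exact_name(root, "readme.md")
    uppercase_found = has_exact_name(root, "README.md")

print(lowercase_found, uppercase_found)
```

Running a check like this in the project root before packaging would have pointed at the lowercase file name right away.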

After a simple case of “cannot see the forest because of the trees”, it was on to the next issue.

Pass 3: Making Sure The Right Files Are Included¶

This time, when I executed the refresh_package.cmd script after repackaging the project, I was greeted with this error:

FileNotFoundError: [Errno 2] No such file or directory: 'install-requirements.txt'

With some newfound experience under my belt, I immediately opened the archive and found that the install-requirements.txt file was not in the archive. Thankfully, in looking for solutions for the last error, I came across a solution to include data files into the setup process using a MANIFEST.in file. The same section that details which files are automatically included in the setup archive also covers the MANIFEST.in file near its end. Following those instructions, I was quickly able to create a new MANIFEST.in file with the following contents:

include install-requirements.txt

After a quick repackaging and reinstalling, this error was indeed solved.

Pass 4: Lather, Rinse, Repeat¶

While that file was now present in the archive, the new error was complaining about a missing directory:

error: package directory 'pymarkdown\resources' does not exist

The main reason for this directory is to host the entities.json file. That file contains each of the named entities, with the corresponding Unicode character that each entity maps to. I tried adding an __init__.py and other such workarounds to get the file included, but nothing worked. Convinced that I had exhausted the other approaches, I followed the same approach as the last section, and added it to the MANIFEST.in file:

include pymarkdown/resources/entities.json

I do not want to make it sound like I dislike the MANIFEST.in approach to including files in the setup archive. I don’t. But to me, it feels like that file is the last option to include files, with all other options having been exhausted. For me, that is my own sniff test for whether the use of the MANIFEST.in file is warranted. For example, I would rather figure out that I need to change the readme.md file into the README.md file before I thought about adding it to the MANIFEST.in file. In this case, I was convinced that there was no other way to include the file, and as such, I had passed my own sniff test.

And It Was Done¶

With that change made, I was now seeing the refresh of the packaging complete without any errors:

Installing ..\pymarkdown\dist\PyMarkdown-0.5.0.tar.gz...
Adding PyMarkdown to Pipfile's [packages]...
Installation Succeeded
Pipfile.lock (db4242) out of date, updating to (29513d)...
Locking [dev-packages] dependencies...
Locking [packages] dependencies...
 Locking...Building requirements...
Resolving dependencies...
Success!
Updated Pipfile.lock (29513d)!
Installing dependencies from Pipfile.lock (29513d)...
  ================================ 1/1 - 00:00:05
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.

And Now, Verifying The Usage¶

With everything looking good in the packaging and installation, the next step was to test the usage of the newly installed library. With optimism in my heart, I went to execute my first test command, pipenv run pymarkdown --help, and I waited. After a good couple of minutes, I killed the script, checked things again, and everything seemed fine.

It seemed like I was not done debugging the setup process quite yet.

Pass 1: Proper Script Files¶

Having “imported” the script files from the PyLint project, I hoped they would work out of the box, but assumed that I would have to do some work to get them operational. I liked the idea of calling the pymarkdown script from the pymarkdown.bat script, but after 45 minutes and approximately 4 attempts at rewriting the scripts, I gave up. Just like before, I decided to go with simplicity for both files, the pymarkdown file:

#!/usr/bin/env python

from pymarkdown import PyMarkdownLint
PyMarkdownLint().main()

and the pymarkdown.bat file:

python -c "from pymarkdown import PyMarkdownLint; PyMarkdownLint().main()" %*

Instead of having one script call the other, I opted for matching the contents of both scripts as closely as possible. In the shell version, the shebang at the start of the script takes care of invoking Python and Python itself takes care of the command line arguments. In the batch script version, I needed to explicitly call Python with the -c argument to tell Python to execute the next argument as a Python script. Finally, the %* at the end of that line causes any arguments passed to the batch script to be passed to the Python program specified with the -c argument.

After a couple of tries, mostly due to small typing mistakes, when I executed the command line pipenv run pymarkdown --help, I was welcomed with the help documentation for the project. Success!

Pass 2: Init Files¶

With the batch script issue in the last section resolved, the execution of the test command pipenv run pymarkdown --help now yielded this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'PyMarkdownLint' from 'pymarkdown' (C:\Users\jackd\.virtualenvs\pymtest-W-bOTTm6\lib\site-packages\pymarkdown\__init__.py)

Perhaps it is my knowledge of other programming languages, but I favor direct imports in the files that need them over the use of __init__.py modules. For me, it just seems like overkill in 98% of the cases, leading to a hard-to-understand view of dependencies between files. In the case of creating a setup package, this turned out to be one of the 2% cases that I had not come across yet.

But, seeing as this was an obvious request for a proper __init__.py module, I added one to the pymarkdown package with the contents:

from pymarkdown.main import PyMarkdownLint

I do not use it in any of the other modules for the project, but it is there for the setup.py module and any others that need it. As such, I can stay true to how I use import statements while providing the information that the setup scripts need. For me, that is a win-win.
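
The effect of that one-line __init__.py module can be reproduced in isolation. The sketch below builds a throwaway package in a temporary directory; the names demo_pkg and DemoLint are made up and merely mirror the pymarkdown and PyMarkdownLint relationship.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    package_dir = Path(root) / "demo_pkg"
    package_dir.mkdir()
    # The class lives in a submodule, as PyMarkdownLint lives in main.py.
    (package_dir / "main.py").write_text(
        "class DemoLint:\n"
        "    def main(self):\n"
        "        return 'ran'\n"
    )
    # This re-export is what lets the class be reached from the package itself.
    (package_dir / "__init__.py").write_text("from demo_pkg.main import DemoLint\n")

    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        demo_pkg = importlib.import_module("demo_pkg")
        result = demo_pkg.DemoLint().main()
    finally:
        sys.path.remove(root)

print(result)
```

Without the line written into __init__.py, the attribute lookup on the package fails in the same way as the ImportError shown above.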

Pass 3: Including Data Files¶

With the pipenv run pymarkdown --help command now running without any issues, I wanted to include some more complex examples to test in the refresh_package.cmd script. To that end, I added the following lines to the end of that file:

pipenv run pymarkdown plugins list
pipenv run pymarkdown plugins info md048
pipenv run pymarkdown plugins info md047
pipenv run pymarkdown scan ..\blog-content\website\content\articles

Going through the reinstall process with the new version of this script, the installation and the first three commands all went off without any issues. However, when it got to the scan command, the following error was emitted:

BadTokenizationError encountered while initializing tokenizer:
Named character entity map file '..\lib\site-packages\pymarkdown\resources\entities.json' was not loaded ([Errno 2] No such file or directory: '..\\lib\\site-packages\\pymarkdown\\resources\\entities.json').

Going back to the useful files to distribute section, I quickly noticed that one of the items in the list was labelled Installing Additional Files. This seemed to fit the situation that I had before me exactly. Reading the information on the other side of that link, I knew what to do within a couple of minutes. Within a couple more minutes, I had this change coded up and inserted at the end of the setup function call in the setup.py module:

setup(
    ...
    data_files=[('Lib/site-packages/pymarkdown/resources', ['pymarkdown/resources/entities.json'])]
)

Going through the entire process again, everything worked fine, and I was now done with the test scenarios I had in mind. I tried a handful of additional scenarios to make sure I had them all covered, and each scenario worked as I expected it to. I had a fully functioning install script!
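
One thing to watch with the data_files approach is that the target directory is hard-coded, and Lib/site-packages is the Windows layout of a Python environment. A commonly used alternative is the package_data argument, whose paths are relative to the package directory. The sketch below shows only the shape of that argument together with a lookup relative to a module file. It has not been tested against this project, and the names SETUP_ARGUMENTS and resource_file_path are invented.

```python
import os

# Hypothetical alternative to data_files; the article's setup.py does not use it.
# The path is relative to the pymarkdown package directory, so no
# platform-specific install location is spelled out.
SETUP_ARGUMENTS = {
    "package_data": {"pymarkdown": ["resources/entities.json"]},
}


def resource_file_path(module_file, relative_name):
    """Build the path to a data file stored alongside the given module."""
    package_directory = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(package_directory, *relative_name.split("/"))


# A made-up module location, standing in for an installed pymarkdown/main.py.
module_location = os.path.join("site-packages", "pymarkdown", "main.py")
entities_path = resource_file_path(
    module_location, SETUP_ARGUMENTS["package_data"]["pymarkdown"][0]
)
print(entities_path)
```

Because the data file travels with the package, code that locates it relative to its own module file finds it wherever the package is installed.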

Pass 3.1: Cleanup¶

This was not really a pass on its own, but a little bit of cleanup that I wanted to do. While looking at various other Python setup articles and library packages, I decided to add three more arguments to the setup function call:

setup(
    ...
    maintainer=AUTHOR,
    maintainer_email=AUTHOR_EMAIL,
    url=PROJECT_URL,
    ...

Since I am both the author and the maintainer, it just made sense to set the maintainer fields to the same values as the author fields. I also wanted people to be able to get more information on the project, so setting the url field also made sense.

What Was My Experience So Far?¶

Based on my experience with other languages, creating an installation package for the project in Python was a walk in the park. There was no fancy extra packaging required, everything was written in Python. While it took me about four hours to make sure everything was working properly, I would estimate that a similar installer for C# or Java would easily take at least eight hours to get into a similarly finished form. For me, that is a win.

In general, I am very pleased with how this work went on getting the setup code into proper shape. There were some very good examples that I could lean on to get my code working, and the starting points were all well-defined. That made the distance I needed to travel from sample code to working code very short, which was very pleasant for once. During the creation of the setup script, I did notice a couple of extra things that I want to clean up before the initial release. But like before, they are all small and reasonable, so I am confident I can make short work of them.

What is Next?¶

With the setup packaging complete for now, I move on to simplifying the output from some of the commands and starting to update the rules for the initial release.

¹ I almost feel that a “duh?!” would be warranted here, but do not feel that it is appropriate.
² Since I took a look, someone refactored the setup code. Please look at this code, which is the code I cribbed from.
³ This comment is not meant to start a religious war. I firmly believe that there are many different jobs that need done, with some tools being the obvious choice for that job. There are other jobs where the tools that can be used are more on personal preference combined with the job at hand. For myself, operating systems are just that: tools.
