Preface
This is one of the articles in a series about how I fine-tuned my Pelican+Elegant website to make it “more mine”. For other articles in the series, click on the title of the article under the heading “Fine Tuning Pelican” on the right side of the webpage.
Unlike my series on Choosing and Setting Up Pelican, these articles are intended to be byte-sized, addressing specific fine-tunings I made to my own website. As such, the first article in this series contains a Disclaimer section with generic information that applies to each of the articles in the series.
Introduction
Many forms of website crawling are fraught with copyright issues and may be considered unethical, as discussed in this article on Data Crawling and Ethics. In contrast, there are legal and ethical uses for web crawling, such as providing data to search engines like Google and Bing. While Search Engine Registration and Optimization is covered in another article, it is worthwhile to ensure that the website is properly set up to regulate any web crawling that does occur. This article details the setup required to enable that regulation.
Why Use A Robots.txt File?
The Robots Exclusion Protocol has been around for almost as long as webservers. As described in the protocol, it manifests itself as a specially formatted robots.txt file located in the base directory of the webserver. While this protocol is not enforceable and remains a suggestion for web crawlers, it does provide a way to state the rules that you expect “proper” crawlers to follow when accessing your website.
For my website, this file exists in the content directory at extra/robots.txt and has the following configuration:
User-agent: *
Disallow:
Sitemap: https://jackdewinter.github.io/sitemap.xml
This informs any properly behaving web crawler of three important facts about the website. The first instruction states that crawlers representing any user agent are allowed to access the site. The second instruction states that there are no paths on the webserver that web crawlers are disallowed from accessing. Finally, the third instruction provides the web crawler with the location of the website’s sitemap, which details the location of each page on the website.
These pieces of information are important for different reasons. The first two pieces of information are the mechanism for restricting web crawlers from accessing the site, if the site owner chooses to do so. In the case of this configuration, the * value for the user-agent field means that the rules apply to all user agents, and the empty value for the disallow field means that no parts of the website are disallowed. Between these two instructions, a web crawler can correctly determine that it is allowed to access any webpage on the website, regardless of the type of web browser or web crawler it presents itself as.
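For contrast, a website that wanted to keep one particular crawler out of a private directory while leaving the rest of the site open could publish something like the following, where the ExampleBot name and the /private/ path are purely hypothetical:
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: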
How To Publish The Robots.txt File
Publishing the robots.txt file requires two separate bits of configuration to ensure it is done properly. The first bit of configuration modifies the existing STATIC_PATHS value to add the path extra/robots.txt to the list of directories and files to publish without modification. The second bit of configuration specifies that the file at the path extra/robots.txt, when published without any modifications, will be located at the path /robots.txt at the webserver’s root.
STATIC_PATHS = ['extra/robots.txt']
EXTRA_PATH_METADATA = {
'extra/robots.txt': {'path': '/robots.txt'}
}
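For a site that already publishes other static content, the new entry is added alongside the existing ones rather than replacing them. The sketch below assumes an existing images directory purely for illustration:
STATIC_PATHS = [
    'images',              # existing static content, assumed for illustration
    'extra/robots.txt',    # copy robots.txt through without processing
]
EXTRA_PATH_METADATA = {
    # place the copied file at the webserver's root as /robots.txt
    'extra/robots.txt': {'path': '/robots.txt'}
}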
Publishing the Sitemap
Generating a sitemap for Pelican is accomplished by adding the sitemap plugin to the PLUGINS configuration variable as follows:
PLUGINS = ['sitemap']
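If the sitemap plugin comes from a local checkout of the pelican-plugins repository rather than an installed package, Pelican also needs to be told where to find it through the PLUGIN_PATHS setting. The checkout location below is an assumption for illustration:
# location of a local pelican-plugins checkout, purely illustrative
PLUGIN_PATHS = ['plugins/pelican-plugins']
PLUGINS = ['sitemap']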
As detailed in the sitemap plugin documentation, the plugin does provide defaults, but it is better to specify values tailored to the specific website. The values used for my website are as follows:
SITEMAP = {
'format': 'xml',
'priorities': {
'articles': 0.6,
'indexes': 0.5,
'pages': 0.4
},
'changefreqs': {
'articles': 'weekly',
'indexes': 'weekly',
'pages': 'monthly'
}
}
In short, the configuration specifies that the format is xml, producing a /sitemap.xml file. The scanning priorities are articles first, then indexes, then pages, with change frequencies that roughly back up those priorities. For my website, the thought behind the values is that articles, and the indexes they are part of, will be updated on a weekly basis, while pages will rarely change.
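To give a feel for the result, each page on the website ends up as an entry in the generated /sitemap.xml roughly along these lines, where the URL and date are made up for illustration:
<url>
  <loc>https://jackdewinter.github.io/example-article.html</loc>
  <lastmod>2020-01-01</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.6</priority>
</url>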
What Was Accomplished
The purpose of this article was to detail the configuration for my website that supports crawling of the site for information. The first part of this configuration created a robots.txt file and published that file as part of the website. The second part of the configuration added the sitemap plugin and tailored the sitemap configuration to the specific balance of content on my website. Together, this configuration makes me confident that the website is well configured for web crawlers, specifically search engines.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.