Preface
This is one of the articles in a series about how I fine-tuned my Pelican+Elegant website to make it “more mine”. For other articles in the series, click on the title of the article under the heading “Fine Tuning Pelican” on the right side of the webpage.
Unlike my series on Choosing and Setting Up Pelican, these articles are intended to be byte-sized, addressing specific fine-tunings I made to my own website. As such, the first article in this series contains a Disclaimer section with generic information that applies to each of the articles in the series.
Introduction
Many forms of website crawling are fraught with copyright issues and may be considered unethical, as discussed in this article on Data Crawling and Ethics. In contrast, there are legal and ethical uses for web crawling, such as providing data to search engines like Google and Bing. While Search Engine Registration and Optimization is covered in another article, it is worthwhile to ensure that the website is properly set up to regulate any web crawling that does occur. This article details the setup required to enable that regulation.
Why Use A Robots.txt File?
The Robots Exclusion Protocol has been around for almost as long as webservers. As described in the protocol, it manifests itself as a specially formatted robots.txt file located in the base directory of the webserver. While this protocol is not enforceable and remains a suggestion for web crawlers, it does provide a way to state the rules that you expect “proper” crawlers to follow when accessing your website.
For my website, this file exists in the content directory at extra/robots.txt and has the following configuration:
User-agent: *
Disallow:
Sitemap: https://jackdewinter.github.io/sitemap.xml
This informs any properly behaving web crawler of three important facts about the website. The first instruction states that crawlers representing any user agent are allowed to access the site. The second instruction states that there are no paths on the webserver that web crawlers are disallowed from accessing. Finally, the third instruction provides the web crawler with the location of the website’s sitemap, which details the location of each page on the website.
These pieces of information are important for different reasons. The first two pieces of information are the mechanism for restricting web crawlers from accessing the site, if the site owner chooses to do so. In the case of this configuration, the * value for the user-agent field means that the rules apply to all user agents, and the empty value for the disallow field means that no parts of the website are disallowed. Between these two instructions, a web crawler can correctly determine that it is allowed to access any webpage on the website, regardless of the type of web browser or web crawler it presents itself as.
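For contrast, a website that wanted to keep one particular crawler out of a private directory while leaving the rest of the site open could publish something like the following, where the ExampleBot name and the /private/ path are purely hypothetical:
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: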
How To Publish The Robots.txt File
Publishing the robots.txt file requires two separate bits of configuration to ensure it is done properly. The first bit of configuration modifies the existing STATIC_PATHS value to add the path extra/robots.txt to the list of directories and files to publish without modification. The second bit of configuration specifies that the file at the path extra/robots.txt, when published without any modifications, will be located at the path /robots.txt at the webserver’s root.
STATIC_PATHS = ['extra/robots.txt']
EXTRA_PATH_METADATA = {
'extra/robots.txt': {'path': '/robots.txt'}
}
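For a site that already publishes other static content, the new entry is added alongside the existing ones rather than replacing them. The sketch below assumes an existing images directory purely for illustration:
STATIC_PATHS = [
    'images',              # existing static content, assumed for illustration
    'extra/robots.txt',    # copy robots.txt through without processing
]
EXTRA_PATH_METADATA = {
    # place the copied file at the webserver's root as /robots.txt
    'extra/robots.txt': {'path': '/robots.txt'}
}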
Publishing the Sitemap
Generating a sitemap for Pelican is accomplished by adding the sitemap plugin to the PLUGINS configuration variable as follows:
PLUGINS = ['sitemap']
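If the sitemap plugin comes from a local checkout of the pelican-plugins repository rather than an installed package, Pelican also needs to be told where to find it through the PLUGIN_PATHS setting. The checkout location below is an assumption for illustration:
# location of a local pelican-plugins checkout, purely illustrative
PLUGIN_PATHS = ['plugins/pelican-plugins']
PLUGINS = ['sitemap']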
As detailed in the sitemap plugin documentation, the plugin does provide defaults, but it is better to specify values tailored to the specific website. The values used for my website are as follows:
SITEMAP = {
'format': 'xml',
'priorities': {
'articles': 0.6,
'indexes': 0.5,
'pages': 0.4
},
'changefreqs': {
'articles': 'weekly',
'indexes': 'weekly',
'pages': 'monthly'
}
}
In short, the configuration specifies that the format is xml, producing a /sitemap.xml file. The scanning priorities are articles first, then indexes, then pages, with change frequencies that roughly back up those priorities. For my website, the thought behind the values is that articles, and the indexes they are part of, will be updated on a weekly basis, while pages will rarely change.
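To give a feel for the result, each page on the website ends up as an entry in the generated /sitemap.xml roughly along these lines, where the URL and date are made up for illustration:
<url>
  <loc>https://jackdewinter.github.io/example-article.html</loc>
  <lastmod>2020-01-01</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.6</priority>
</url>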
What Was Accomplished
The purpose of this article was to detail the configuration for my website that supports crawling of the site for information. The first part of this configuration created a robots.txt file and published that file as part of the website. The second part of the configuration added the sitemap plugin and tailored the sitemap configuration to the specific balance of content on my website. Together, this configuration makes me confident that the website is well configured for web crawlers, specifically search engines.
Comments
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.