Mandelbot is Fractle's web crawler. You can control which files Mandelbot crawls using a robots.txt file. This page describes Mandelbot's support for robots.txt, which is part of its support for the Robots Exclusion Protocol.
Mandelbot looks for a file named robots.txt in the root directory of your website. For example, for the url http://www.example.com/dogs/barking.html, Mandelbot would look for a corresponding robots.txt file at http://www.example.com/robots.txt.
Different domains, sub-domains, protocols (http and https), and ports each require their own robots.txt file in their respective root directory. The robots.txt file at http://www.example.com/robots.txt only applies to urls starting with http://www.example.com/. It will not apply to any urls starting with https://www.example.com/, http://subdomain.example.com/, or http://www.example.com:8080/.
Mandelbot will not look for your robots.txt file in sub-directories or under a different file name. The root directory is the only location Mandelbot checks; if no robots.txt file exists there, Mandelbot treats your site as having no robots.txt file at all.
If you create or change your robots.txt file, please allow at least 48 hours for Mandelbot to discover the changes; we cache the robots.txt file to avoid sending unnecessary requests to your server.
To temporarily stop Mandelbot from crawling your site, return a response containing a 503 Service Unavailable HTTP Status Code to any request made by Mandelbot. When such a response is received, Mandelbot will stop crawling and resume at some later time.
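Purely as an illustration (Mandelbot's documentation does not prescribe any particular server setup), here is a minimal sketch of how a site running a Python Flask application might return 503 to Mandelbot while still serving other visitors; the MAINTENANCE flag and the assumption that Mandelbot's User-Agent header contains the string "Mandelbot" are inventions of this example:
# Sketch only: pause Mandelbot with 503 responses while the site stays up.
# Assumes a Flask app and that Mandelbot identifies itself with a User-Agent
# header containing "Mandelbot"; adapt the idea to your own server or framework.
from flask import Flask, Response, request

app = Flask(__name__)

MAINTENANCE = True  # set to False when Mandelbot may resume crawling

@app.before_request
def pause_mandelbot():
    user_agent = request.headers.get("User-Agent", "")
    if MAINTENANCE and "Mandelbot" in user_agent:
        # A 503 response tells Mandelbot to stop crawling and try again later.
        return Response("Service temporarily unavailable", status=503)

@app.route("/")
def index():
    return "Hello, world"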
Do not change your robots.txt file to temporarily stop Mandelbot from crawling; it will not take effect until the cached copy expires and can negatively affect the indexing of your pages in Fractle.
The HTTP status code Mandelbot receives when it requests your robots.txt file determines how the response is interpreted:
If a 2xx Success status code is received, the response will be processed according to the rules on this page.
If a 3xx Redirection status code is received, the redirects will be followed and the end result treated as though it were the robots.txt response.
If a 4xx Client Error status code is received, it will be treated as though no robots.txt file exists and crawling of any url is allowed. This is true even for 401 Unauthorized and 403 Forbidden responses.
If a 5xx Server Error status code is received, it will be treated as a temporary server problem and the request will be retried later. If Mandelbot determines a server is incorrectly returning 5xx status codes instead of 404 status codes, it will treat 5xx errors as 404 errors.
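Purely as an illustration of these rules, here is a small sketch of the decision logic; the function name and the returned labels are invented for this example and are not Mandelbot's actual code:
def interpret_robots_response(status_code):
    # Sketch of the rules above; the returned strings are invented labels.
    if 200 <= status_code < 300:
        return "parse"            # process the body according to the rules on this page
    if 300 <= status_code < 400:
        return "follow-redirect"  # follow the redirect and interpret the final response
    if 400 <= status_code < 500:
        return "allow-all"        # treat as if no robots.txt exists (including 401 and 403)
    if 500 <= status_code < 600:
        return "retry-later"      # temporary server problem; retry the request later
    return "unspecified"          # behaviour for other codes is not described on this page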
The robots.txt file should be a plain text file consisting of lines separated by carriage returns, line feeds, or both.
A valid line consists of several ordered elements: a field, a colon, a value, and an optional comment prefixed by a hash: <field>:<value>(#<comment>). The <field> element is case-insensitive and any whitespace before or after any element is ignored. Invalid lines and valid lines with an invalid <field> are ignored; the remaining valid lines are processed according to the rules on this page. The treatment of valid lines with a valid <field> and an invalid <value> is undefined.
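To make the line format concrete, here is a small illustrative parser for a single line; the regular expression and function are inventions of this example rather than Mandelbot's actual code:
import re

# <field>:<value>(#<comment>): the field is case-insensitive, whitespace around
# elements is ignored, and a '#' starts a comment that runs to the end of the line.
LINE_PATTERN = re.compile(r"^\s*([A-Za-z-]+)\s*:\s*([^#]*?)\s*(?:#.*)?$")

def parse_line(line):
    """Return (field, value) for a field:value line, or None otherwise
    (for example, blank lines or standalone comment lines)."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    field, value = match.groups()
    return field.lower(), value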
A robots.txt file contains zero or more directive groups. A directive group consists of one or more user agent lines, followed by one or more directive lines.
A user agent line has <field> = User-agent and a case-insensitive <value> which represents the user agent's name. It is invalid to use the same user agent in multiple user agent lines within a robots.txt file.
Zero or one of the directive groups will apply to Mandelbot, which follows the directives for User-agent: Mandelbot and falls back to the directives for User-agent: * if there is no group specifically for Mandelbot.
In this example, Mandelbot follows its specific group:
User-agent: Mandelbot
Disallow: /private # Mandelbot follows this directive
User-agent: *
Disallow: /secret # Mandelbot ignores this directive
In this example, Mandelbot falls back to the default group:
User-agent: Anotherbot
Disallow: /private # Mandelbot ignores this directive
User-agent: *
Disallow: /secret # Mandelbot follows this directive
In this example, no group applies to Mandelbot:
User-agent: Anotherbot
Disallow: /private # Mandelbot ignores this directive
In this example, two user agents, including Mandelbot, share a group:
User-agent: Mandelbot
User-agent: Anotherbot
Disallow: /private # Mandelbot follows this directive
User-agent: *
Disallow: /secret # Mandelbot ignores this directive
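As a sketch of how a crawler could apply this group-selection rule (the data structure and function are invented for this example, and parsing the file into groups is assumed to have happened already):
def select_group(groups, crawler_name="Mandelbot"):
    """Pick the applicable directive group from a list of
    (user_agent_names, directives) pairs; return None if no group applies."""
    wanted = crawler_name.lower()
    fallback = None
    for names, directives in groups:
        lowered = {name.lower() for name in names}
        if wanted in lowered:
            return directives      # a group that names Mandelbot directly wins
        if "*" in lowered:
            fallback = directives  # remember the default group as a fallback
    return fallback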
A disallow directive is a type of directive line. It has <field> = Disallow and a <value> which is a relative path from the website root. The path must start with a forward slash and is case sensitive; special characters must be percent encoded.
Mandelbot will not crawl any url that is prefixed by the path of any disallow directive in the applicable directive group.
In this example, Mandelbot and other robots are blocked from crawling any url:
User-agent: *
Disallow: /
In this example, Mandelbot is blocked from crawling any url:
User-agent: Mandelbot
Disallow: /
In this example, Mandelbot is blocked from crawling any url prefixed by /secret or /hidden (e.g. /secret/doc.html is blocked, but /private/secret/doc.html is not):
User-agent: Mandelbot
Disallow: /secret
Disallow: /hidden
A special type of disallow directive is one with no path, which means allow everything. It is used to override the default directive group. In this example, Mandelbot may crawl any url while other robots are blocked from crawling any url:
User-agent: Mandelbot
Disallow:
User-agent: *
Disallow: /
An allow directive is a type of directive line. It has <field> = Allow and a <value> which is a relative path from the website root. The path must start with a forward slash and is case sensitive; special characters must be percent encoded.
Allow directives are used to override disallow directives. By default, all urls are allowed, so allow directives are only necessary when a disallow directive's scope needs to be reduced.
Mandelbot will crawl any url that is prefixed by the path of any allow directive in the applicable directive group.
In this example, Mandelbot is blocked from crawling any url prefixed by /secret except those prefixed by /secret/readme.txt (e.g. /secret/doc.html is blocked, but /secret/readme.txt and /secret/readme.txt?v=1 are not):
User-agent: Mandelbot
Disallow: /secret
Allow: /secret/readme.txt
The disallow and allow directives support wildcards. The paths may use a * to represent 0 or more characters and a $ as the last character to represent the end of a url.
In this example, Mandelbot is blocked from crawling the contents of any folder named private (e.g. /secret/private/doc.html is blocked, but /secret/private-stuff/doc.html is not):
User-agent: Mandelbot
Disallow: /*/private/
In this example, Mandelbot is blocked from crawling urls that end with a pdf file extension (e.g. /doc.pdf is blocked, but /doc.pdf?load=1 is not):
User-agent: Mandelbot
Disallow: /*.pdf$
To determine which directive applies, Mandelbot sorts the directives by path length in descending order, with allow directives taking precedence in the case of ties.
If the highest precedence directive that matches the url is an allow directive, Mandelbot will crawl the url; if it is a disallow directive, Mandelbot will not crawl the url; and if no directive matches, Mandelbot will crawl the url.
In this example, Mandelbot is blocked from crawling urls with a pdf file extension except those prefixed by /files (e.g. /doc.pdf is blocked, but /files.pdf is not), as /files is the same length as /*.pdf (allow directives take precedence in ties) and /doc has no effect as it is shorter than /*.pdf (longer directives take precedence):
User-agent: Mandelbot
Disallow: /*.pdf
Allow: /files
Allow: /doc
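For illustration, here is a sketch of how the wildcard and precedence rules described above could be implemented; the helper names are invented for this example and this is not Mandelbot's actual code:
import re

def path_to_regex(path):
    # '*' matches zero or more characters; a trailing '$' anchors to the end of the url.
    anchored = path.endswith("$")
    core = path[:-1] if anchored else path
    pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(pattern + ("$" if anchored else ""))

def is_allowed(url_path, rules):
    """rules is a list of (kind, path) pairs such as ("Disallow", "/*.pdf").
    Longer paths take precedence, Allow wins ties, and no match means crawling is allowed."""
    best = None  # (path_length, is_allow) of the highest-precedence matching rule
    for kind, path in rules:
        if path and path_to_regex(path).match(url_path):
            candidate = (len(path), kind == "Allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]
With the directive group above, this sketch would block /doc.pdf but allow /files.pdf, matching the example.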
Placing a * at the end of a path is usually redundant as it is implicit, but it has two use cases: allowing a regular dollar character at the end of a path and changing the precedence of a directive.
In this example, Mandelbot is blocked from crawling the url /money and any urls prefixed by /earn$ because the * in /earn$* turns the dollar character into a regular character:
User-agent: Mandelbot
Disallow: /money$
Disallow: /earn$*
In this example, Mandelbot is blocked from crawling urls with a pdf file extension except those prefixed by /doc (e.g. /files.pdf is blocked, but /doc.pdf is not), as the extra * characters at the end of the paths change their length and therefore their precedence:
User-agent: Mandelbot
Disallow: /*.pdf*
Allow: /files
Allow: /doc****
Mandelbot ignores comments.
Comments may be included after any valid line by prefixing the comment with a #. The hash and everything after it is ignored. A comment is also allowed on its own line.
In this example containing different types of comment, Mandelbot is blocked from crawling any url prefixed by /secret and other robots are blocked from crawling any url:
# A standalone comment on its own line
User-agent: Mandelbot # Whitespace is allowed but not required between elements
Disallow: /secret# This is a comment
User-agent: * # For other robots
Disallow: / # Block everything
A sitemap as defined on sitemaps.org provides a method for you to inform Mandelbot about all the pages on your site. Mandelbot may use it to make crawling more efficient.
You specify sitemaps in your robots.txt file by creating lines with <field> = Sitemap and a <value> which is an absolute url to a valid sitemap file (e.g. Sitemap: http://www.example.com/sitemap.xml).
Sitemap lines exist outside directive groups. They usually appear at the start of the robots.txt file, before any directive groups, but they may occur anywhere in the file. Sitemap lines are not associated with any specific user agent, so any crawler may use them regardless of where they appear. You can include multiple sitemap lines within your robots.txt file.
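For example, a robots.txt file listing two sitemaps alongside a directive group might look like this (the sitemap urls below are placeholders):
Sitemap: http://www.example.com/sitemap-pages.xml
Sitemap: http://www.example.com/sitemap-articles.xml
User-agent: *
Disallow: /secret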
Mandelbot doesn't currently support Crawl-delay directives in robots.txt files, but we can manually set the Crawl-delay used by Mandelbot for your website if you contact us.
Mandelbot's support of robots.txt is just part of its support for the Robots Exclusion Protocol. Mandelbot also supports Robot Tags, which provide control over what files are indexed.
For an overview of the protocol, additional information on Mandelbot's use of robots.txt, and details on interactions and conflicts between robots.txt and Robot Tags, read about the Robots Exclusion Protocol.