Being a Polite Crawler on the Web

by Slatian

What I've learned from running my own web search crawler


Why be Polite on the Web?


You might be aware that I'm building my own search engine for general-purpose web search. Most of what follows are lessons from efficiency improvements and from hosting my own web services.

Also, there should be no need to convince anyone that a bit of politeness is a good thing.

Beyond politeness being a pretty good default, there are a few other convincing reasons to be polite:

Efficiency
Being polite also means cooperating with the Server on the other end, which in almost all cases will result in your crawler being faster and more efficient.
Reputation
If your crawler becomes known as one that doesn't follow basic rules, crawling will become more difficult, as websites start treating your crawler as an unwelcome visitor.

What is Scraping and what does a Crawler do?

Scraping
Taking information that is intended for Human viewing (like an HTML page) and (trying to) extract machine readable information from it. Usually it is used to work around bad or non-existent interfaces for automatic information retrieval. Scraping is useful for something like a search tool or link previews.
Crawling
A loop of fetching a document (i.e. a web page), extracting links to other documents from it, and then repeating the process for the newly discovered links. The goal of crawling is to collect the found documents to extract data from them.

The two are usually used together: the link extraction in a crawler is typically done by scraping, that is, extracting the human-readable links. The "getting data from documents" step after crawling will also involve scraping to some degree.

The two are still independent: one can build a crawler without a scraper by targeting an API, or a scraper without a crawler if no document discovery mechanism is needed (e.g. for a link preview).
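The crawl loop described above can be sketched in a few lines. This is a minimal breadth-first version; `fetch` is an abstract stand-in for the real HTTP-and-scraping step, and the `limit` parameter is an illustrative safety valve:

```python
from collections import deque

def crawl(start_url, fetch, limit=100):
    """Breadth-first crawl: fetch a document, queue the newly
    discovered links, repeat. Returns the set of visited URLs."""
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        # fetch(url) is assumed to return the links found in that document
        for link in fetch(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

The `seen` set doubles as the "already discovered" check, which becomes important later when we talk about not refetching what the crawler already knows.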

Identify Your Crawler

Identify your crawler so that the Admin on the other side knows who is crawling what and why. Crude attempts at trying to look like a Browser probably won't last long. This is mainly because your crawler has a very different goal from the average website visitor.

Set the HTTP User-Agent header to a value containing the name of your site or project, and include a link to an information page that explains your crawler in more detail.

An example user agent could be: ExampleBot (https://example.org/about/example-bot)
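With Python's standard library, setting such a User-Agent could look like the following sketch; the bot name and info URL are the hypothetical ones from the example above:

```python
import urllib.request

# Name of the crawler plus a link to a page explaining it (hypothetical).
USER_AGENT = "ExampleBot (https://example.org/about/example-bot)"

def build_request(url: str) -> urllib.request.Request:
    """Build a request that identifies the crawler honestly
    instead of pretending to be a browser."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

req = build_request("https://example.org/")
```

Whatever HTTP client you use, the important part is that the same recognizable name appears on every request your crawler makes.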

Crawl at a Reasonable Speed

Crawling too fast can, depending on what the Server is doing, degrade the service for others: on smaller services your crawler could be responsible for a significant amount of load, even with what might not seem like a lot of requests to you.

Determining how fast is still okay can be difficult; treating every origin the same, with a fixed delay of a few seconds between requests, works pretty well.

The best way to find out what is acceptable is to read the Crawl-Delay from robots.txt.

Note though that blindly trusting the server on this value might not be desirable either, as the delay can end up being hours or even days this way. Capping it at around two minutes should be a reasonable compromise.
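Reading and capping Crawl-Delay can be done with Python's standard-library robots.txt parser. A sketch, where the two-minute cap and the fallback of a few seconds are the choices from this article (not a standard), and "ExampleBot" is a hypothetical crawler name:

```python
import urllib.robotparser

MAX_DELAY = 120.0     # cap: never wait more than two minutes
DEFAULT_DELAY = 5.0   # fixed fallback delay of a few seconds

def polite_delay(robots_txt: str, agent: str = "ExampleBot") -> float:
    """Return the delay to use between requests to this origin,
    honoring Crawl-Delay but capping unreasonable values."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    delay = rp.crawl_delay(agent)
    if delay is None:
        return DEFAULT_DELAY
    return min(float(delay), MAX_DELAY)
```

This way a robots.txt asking for a one-day delay still slows the crawler down, just not to the point of never finishing.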

Slow down if your crawler encounters a 429 (too many requests) status code. These responses sometimes come with a Retry-After header that tells your crawler how long the server wants it to wait before the next request.

Another mechanism one can implement is a dynamic delay based on a multiple of the response time. This way, if the server responds slower, the crawler also slows down, preventing it from overwhelming small or busy web hosts.
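The dynamic delay just described can be sketched as a tiny function; the factor of ten and the bounds are illustrative choices, not values from any standard:

```python
def next_delay(response_time: float,
               factor: float = 10.0,
               minimum: float = 1.0,
               maximum: float = 120.0) -> float:
    """Wait a multiple of the last response time, bounded so that a
    very fast server still gets a minimum pause and a very slow one
    doesn't stall the crawler indefinitely."""
    return max(minimum, min(response_time * factor, maximum))
```

A server that took two seconds to answer then earns a twenty-second pause, while a server answering in ten milliseconds still gets the one-second minimum.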

Don't Scrape what You could get Through an API

If you only need specific information that is available using a well documented API, strongly consider querying that API instead of scraping web pages.

Don't Crawl Things you shouldn't

While crawling the things you shouldn't crawl may seem interesting and appealing, it really isn't. In fact, you probably want to crawl even less than you are allowed to.

A Server can have many reasons to advise crawlers away from certain sections, from expensive endpoints to content that simply shouldn't end up in an index.

Your crawler can learn which paths it shouldn't crawl from robots.txt. For matching the user agent, use the same crawler name you set in the User-Agent header.
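Checking a path against robots.txt rules is also covered by Python's standard library. A sketch, matching on the same hypothetical "ExampleBot" name used in the User-Agent header (the rules below are made up for illustration):

```python
import urllib.robotparser

# An illustrative robots.txt as it might be served by a site.
robots_txt = """\
User-agent: ExampleBot
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Match using the crawler's own name, not a browser string.
allowed = rp.can_fetch("ExampleBot", "https://example.org/blog/")
blocked = rp.can_fetch("ExampleBot", "https://example.org/private/x")
```

In a real crawler you would fetch the file from `/robots.txt` on each origin and cache the parsed result per host.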

The same goes for the HTML <meta name="robots" … tag; it won't tell you anything in advance, but it is a useful additional tool for detecting content that shouldn't be indexed.
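Detecting this tag can be sketched with only the standard-library HTML parser; real-world pages may need more lenient handling, and the sample page below is made up:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            for token in a.get("content", "").split(","):
                self.directives.add(token.strip().lower())

def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
```

A crawler would then drop pages whose directives include "noindex" from its index and stop following links from pages marked "nofollow".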

There is also the HTTP X-Robots-Tag header, which works similarly to the HTML robots meta element, but with slightly different syntax.

Opinion: Implementing X-Robots-Tag is good for completeness but not a strict requirement for a new crawler that only indexes HTML. robots.txt and <meta name="robots" … are the essential ones to support.

Refetching the robots.txt file after crawling a site for multiple minutes is a good idea, as it allows the admin of the target server to update the crawling preferences when they notice the crawler going where it shouldn't.


Only Crawl what You Need

Do not fall for "I might need it later …": no you don't.

If you didn't plan for it, you'll probably have saved incomplete or wrong information, or will have already discarded what you didn't need to make space for other things.

Crawling what you don't need costs you bandwidth, storage and time for information that will never get used, or will be out of date by the time it is needed. Not crawling unnecessary information also means your crawler is faster, because it has to do less work.

The best case is that you think about what information you need before setting up a crawler.

In case you do need historical information Common Crawl and the Internet Archive are your friends.

Detect what Your Crawler Already Knows

Requesting resources that the crawler already knows wastes bandwidth and crawl time on things you already have; remembering a few bits of information can help avoid that.

The first step is to split up crawling and analyzing the data; this way you can iterate quickly on working with the data while avoiding unnecessary network requests.

The continuation is remembering the Last-Modified and ETag response headers and sending them back in the If-Modified-Since and If-None-Match request headers. If the resource didn't change, the Server will reply with a 304 (not modified) and skip sending the content.
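Building such a conditional request could look like the following sketch; the cache dictionary, URL and validator values are all illustrative:

```python
import urllib.request

# Hypothetical store of validators remembered from earlier responses.
cache = {
    "https://example.org/page": {
        "etag": '"abc123"',
        "last_modified": "Wed, 01 Jan 2025 00:00:00 GMT",
    },
}

def conditional_request(url: str) -> urllib.request.Request:
    """Send stored validators back so the server can answer
    304 (not modified) instead of resending the whole body."""
    headers = {"User-Agent": "ExampleBot (https://example.org/about/example-bot)"}
    known = cache.get(url)
    if known:
        headers["If-None-Match"] = known["etag"]
        headers["If-Modified-Since"] = known["last_modified"]
    return urllib.request.Request(url, headers=headers)

req = conditional_request("https://example.org/page")
```

On a 304 reply the crawler keeps its stored copy; on a 200 it stores the body together with the new validators.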

Use HEAD Requests

In case your crawler isn't sure whether it needs a resource, but could tell from the HTTP header information alone, use the HTTP HEAD method, which returns the same headers as a GET request, but without the content.

This is useful when some heuristic tells the crawler that a file is probably not interesting, but not for sure (e.g. a path ending in .png is likely an image, but could also be a page about the image file). After evaluating the HEAD response, the crawler can always issue a GET request if the content seems interesting.

Relevant headers one might evaluate on a HEAD response are Content-Type, Content-Length and Last-Modified.
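A sketch of the HEAD-first approach with the standard library; the content-type check is one illustrative heuristic, not the only sensible one:

```python
import urllib.request

def head_request(url: str) -> urllib.request.Request:
    """Build a HEAD request: same headers as GET, no body."""
    return urllib.request.Request(url, method="HEAD")

def looks_interesting(headers: dict) -> bool:
    """Illustrative heuristic: only HTML pages are worth a full GET
    for a crawler that indexes text."""
    content_type = headers.get("Content-Type", "")
    return content_type.startswith("text/html")
```

The crawler would send the HEAD request, inspect the returned headers with something like `looks_interesting`, and only then decide whether to follow up with a GET.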

Be Smart about URL Queries

URLs with the query part (everything after the first ?) set often link to slight variations of pages that simply aren't useful to crawl.

In practice, a sane default is to not crawl URLs with the query part set, and to define exceptions based on known site structure and carefully chosen heuristics.
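This default-plus-exceptions rule can be sketched as follows; the allow-list entry is a hypothetical host where queries are known to lead to useful pages:

```python
from urllib.parse import urlsplit

# Hypothetical origins whose query URLs are known to be worth crawling.
QUERY_ALLOWED_HOSTS = {"forum.example.org"}

def should_crawl(url: str) -> bool:
    """Skip URLs with a query string unless the host is on the
    explicit allow list of known-good site structures."""
    parts = urlsplit(url)
    if not parts.query:
        return True
    return parts.hostname in QUERY_ALLOWED_HOSTS
```

More refined heuristics could also strip known tracking parameters before deciding, but the plain skip-by-default rule already avoids most duplicate-page traps.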

Do have fun! Do be curious!

To summarize:

If you've read this far, past a lot of "don't"s, now is the time for the "do"s.

Writing a good crawler is hard, but writing an okay one is actually quite easy. The fact alone that you have put thought into how to make a polite crawler will probably lead to better code.

Be curious, mix in a bit of politeness and have fun with the web!