Being a Polite Crawler on the Web
What I've learned from running my own web search crawler
Table of Contents
- Why be Polite on the Web?
- What is Scraping and what does a Crawler do?
- Identify Your Crawler
- Crawl at a Reasonable Speed
- Don't Scrape what You could get Through an API
- Don't Crawl Things you shouldn't
- Only Crawl what You Need
- Detect what Your Crawler Already Knows
- Use HEAD Requests
- Be Smart about URL Queries
- Do have fun! Do be curious!
Why be Polite on the Web?
You might be aware that I'm building my own search engine for general-purpose web search. Most of the following are lessons from efficiency improvements and from the experience of hosting my own web services.
There should also be no need to convince anyone that a bit of politeness is a good thing. Beyond being a pretty good default, there are a few other convincing reasons to be polite:
- Efficiency
- Being polite means cooperating with the server on the other end, which in almost all cases makes your crawler faster and more efficient.
- Reputation
- If your crawler becomes known as one that doesn't follow basic rules, crawling will get more difficult, as websites will treat your crawler as an unwelcome visitor.
What is Scraping and what does a Crawler do?
- Scraping
- Taking information intended for human viewing (like an HTML page) and (trying to) extract machine-readable information from it. It is usually used to work around bad or non-existent interfaces for automated information retrieval. Scraping is useful for things like a search tool or link previews.
- Crawling
- A loop of fetching a document (i.e. a web page), extracting links to other documents from it, and then repeating the process for the newly discovered links. The goal of crawling is to collect the found documents in order to extract data from them.
The two are usually used together, as link extraction in a crawler is usually done by scraping, that is, extracting the human-readable links. The "getting data from documents" step after crawling will also involve scraping to some degree.
Still, they are independent: one can build a crawler without a scraper by targeting an API, or a scraper without a crawler if no document discovery mechanism is needed (e.g. for a link preview).
Identify Your Crawler
Identify your crawler so that the admin on the other side knows who is crawling what and why. Crude attempts at looking like a browser probably won't last long, mainly because your crawler has a very different goal from the average website visitor.
Set the HTTP User-Agent header to a value containing the name of your site or project, and include a link to an information page that explains in more detail:
- who owns the crawler
- contact information
- why the crawler exists
- a short explanation of how the crawler works
- how to opt out of crawling
- which control mechanisms are available
An example user agent could be: ExampleBot (https://example.org/about/example-bot)
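A minimal sketch of sending such a header with Python's standard library (the bot name and info URL are the placeholders from the example above):

```python
import urllib.request

# Placeholder identity from the example above -- substitute your own
# project name and information page.
USER_AGENT = "ExampleBot (https://example.org/about/example-bot)"

def build_request(url: str) -> urllib.request.Request:
    """Attach the identifying User-Agent header to every outgoing request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

req = build_request("https://example.org/")
print(req.get_header("User-agent"))
```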
Crawl at a Reasonable Speed
Crawling too fast can — depending on what the server is doing — degrade the service for others, because on smaller services your crawler can be responsible for a significant share of the load, even with what might not seem like a lot of requests to you.
Determining how fast is still okay can be difficult; treating every origin the same, with a fixed delay of a few seconds between requests, works pretty well.
The best way to find out what is acceptable is to read the Crawl-Delay from robots.txt.
Note, though, that blindly trusting the server on this value is not desirable either, as the delay can end up being hours or even days. Capping it at something like two minutes should be a reasonable compromise.
Slow down if your crawler encounters a 429 (Too Many Requests) status code. These responses sometimes come with a Retry-After header that tells your crawler how long the server wants it to wait before the next request.
Another mechanism one can implement is a dynamic delay based on a multiple of the response time. This way, if the server responds more slowly, the crawler also slows down, preventing it from overwhelming small or busy web hosts.
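Such an adaptive delay could look like the following sketch; the multiplier, floor, and cap are illustrative values, not a standard:

```python
def polite_delay(response_time: float, multiplier: float = 3.0,
                 minimum: float = 1.0, maximum: float = 120.0) -> float:
    """Delay (in seconds) before the next request to the same origin,
    derived from how long the server took to answer the previous one.
    A slow (likely busy) server automatically gets a longer pause;
    the cap keeps a single pathological response from stalling the crawl."""
    return min(max(response_time * multiplier, minimum), maximum)

# A fast server (0.1 s response) gets the 1 s floor between requests,
# while a struggling one (10 s response) is backed off to 30 s.
print(polite_delay(0.1), polite_delay(10.0))
```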
Don't Scrape what You could get Through an API
If you only need specific information that is available using a well documented API, strongly consider querying that API instead of scraping web pages.
Don't Crawl Things you shouldn't
While crawling the things you shouldn't crawl may seem interesting and appealing, it really isn't. In fact, you probably want to crawl even less than you are allowed to.
Possible reasons for a server to advise crawlers not to crawl certain sections are:
- They don't want to be indexed (which is their right and they're politely asking you to leave)
- There is a (near) infinite labyrinth of automatically generated pages somewhere. Crawling this would waste resources on both the server and your crawler.
- They are crawler traps that will lock your crawler out if it sends a request to them.
- They are access-denied pages.
- They contain large files that would consume storage space without much benefit on the crawler side.
Your crawler can learn which paths it shouldn't crawl from robots.txt. For matching the user agent, use the same crawler name you set in the User-Agent header.
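Python's standard library ships a robots.txt parser; a small sketch, using the hypothetical ExampleBot name from earlier and a made-up robots.txt:

```python
import urllib.robotparser

# An illustrative robots.txt -- in practice this would be fetched from
# https://<host>/robots.txt before crawling that host.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/
Crawl-delay: 5
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Match against the same name used in the User-Agent header.
print(rp.can_fetch("ExampleBot", "/private/secret.html"))  # disallowed
print(rp.can_fetch("ExampleBot", "/index.html"))           # allowed
print(rp.crawl_delay("ExampleBot"))
```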
The same goes for the HTML <meta name="robots" …> tag. It won't tell you anything in advance, but it can be a useful additional tool for detecting content that shouldn't be indexed.
There is also the HTTP X-Robots-Tag header, which works similarly to the HTML robots meta element but has a slightly different syntax.
Opinion: Implementing X-Robots-Tag is good for completeness, but it is not a strict requirement for a new crawler that only indexes HTML; robots.txt and <meta name="robots" …> are the essential ones to support.
Refetching robots.txt after crawling a site for more than a few minutes is also a good idea, as it allows the admin of the target server to update the crawling preferences when they notice the crawler going where it shouldn't.
Only Crawl what You Need
Do not fall for "I might need it later …": no, you don't.
If you didn't plan for it, you'll probably have saved incomplete or wrong information, or you'll likely have discarded what you didn't need to make space for other things.
Crawling what you don't need costs you bandwidth, storage, and time for information that will never get used, or that is out of date by the time it is needed. Not crawling unnecessary information also means your crawler is faster, because it has less work to do.
The best case is to think about what information you need before setting up the crawler.
In case you do need historical information Common Crawl and the Internet Archive are your friends.
Detect what Your Crawler Already Knows
Requesting resources the crawler already knows wastes bandwidth and crawl time on things you already have; remembering a few bits of information can help avoid that.
The first step is to split up crawling and analyzing the data. This way you can iterate quickly on working with the data while avoiding unnecessary network requests.
The continuation of that is remembering the Last-Modified and ETag response headers and sending them back in the If-Modified-Since and If-None-Match request headers. If the resource didn't change, the server will reply with a 304 (Not Modified) instead of the content.
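A sketch of building such conditional request headers, assuming the crawler keeps a per-URL cache of the validators it saw last time (an in-memory dict here; a real crawler would persist it):

```python
# Hypothetical validator cache: URL -> headers remembered from the last fetch.
cache = {
    "https://example.org/": {
        "ETag": '"abc123"',
        "Last-Modified": "Sat, 01 Jan 2022 00:00:00 GMT",
    }
}

def conditional_headers(url: str, cache: dict) -> dict:
    """Build If-Modified-Since / If-None-Match request headers from the
    Last-Modified / ETag values remembered for this URL, if any."""
    known = cache.get(url, {})
    headers = {}
    if "Last-Modified" in known:
        headers["If-Modified-Since"] = known["Last-Modified"]
    if "ETag" in known:
        headers["If-None-Match"] = known["ETag"]
    return headers

print(conditional_headers("https://example.org/", cache))
```

On a 304 response the crawler keeps its stored copy; on a 200 it stores the new validators along with the new content.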
Use HEAD Requests
If your crawler isn't sure whether it needs a resource, but could decide from the HTTP headers alone, use the HTTP HEAD method, which returns the same headers as a GET request but not the content.
This is useful when some heuristic tells the crawler that a file may not be interesting, but can't be sure (e.g. a path ending in .png is likely an image, but could also be a page about the image file). After evaluating the HEAD response, the crawler can still issue a GET request if the content seems interesting.
Relevant headers one might evaluate from a HEAD response are Content-Type, Content-Length, Last-Modified and ETag.
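A sketch of such a decision based on HEAD response headers; the content-type whitelist and the size cap are illustrative choices, not fixed rules:

```python
def worth_fetching(head_headers: dict, max_bytes: int = 5_000_000) -> bool:
    """Decide from HEAD response headers whether a full GET is worthwhile.
    Here: only HTML pages, and nothing larger than max_bytes."""
    # Content-Type may carry parameters like "; charset=utf-8" -- strip them.
    ctype = head_headers.get("Content-Type", "").split(";")[0].strip()
    if ctype not in ("text/html", "application/xhtml+xml"):
        return False
    # A missing Content-Length defaults to 0, i.e. "don't reject on size".
    length = int(head_headers.get("Content-Length", "0"))
    return length <= max_bytes

print(worth_fetching({"Content-Type": "text/html; charset=utf-8",
                      "Content-Length": "20480"}))
print(worth_fetching({"Content-Type": "image/png",
                      "Content-Length": "1048576"}))
```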
Be Smart about URL Queries
URLs with the query part (everything after the first ?) set often link to slight variations of pages that simply aren't useful to crawl.
In practice, a sane default is to not crawl URLs with a query part, and to define exceptions based on known site structure and carefully chosen heuristics.
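The default-plus-exceptions approach can be sketched like this; the exception list and hostnames are hypothetical:

```python
from urllib.parse import urlsplit

# Hypothetical hand-maintained exception list: hosts where query
# parameters are known to select genuinely distinct pages.
ALLOWED_QUERY_HOSTS = {"forum.example.org"}

def should_crawl(url: str) -> bool:
    """Default: skip URLs that carry a query string, unless the host
    is on the exception list."""
    parts = urlsplit(url)
    if not parts.query:
        return True
    return parts.hostname in ALLOWED_QUERY_HOSTS

print(should_crawl("https://example.org/page?session=42"))
print(should_crawl("https://forum.example.org/thread?id=7"))
```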
Do have fun! Do be curious!
To summarize:
- Communicate who is crawling why
- Crawl as fast as you can, but stay as slow as necessary
- Read and respect robots.txt
- Only crawl what you need
- Split up crawling and analyzing
If you've read this far past a lot of "don't"s, now is the time for the "do"s.
Writing a good crawler is hard, but writing an okay one is actually quite easy. The fact alone that you have put thought into how to make a polite crawler will probably lead to better code.
Be curious, mix in a bit of politeness and have fun with the web!