Interesting Resources for building Search Engines

Date: — Topic: — by Slatian

A collection of links to articles and documentation.

How Search works

Articles that dive into how a Search Engine works under the hood.

Datastructures

Datastructures useful for building search engines.

FST — Finite State Transducers

In a nutshell: FSTs use properties of state machines to store sorted sets of strings, map strings to numbers, and run fast prefix queries, but also other queries driven by automata, like regular expressions or even Levenshtein distance queries.

Index 1,600,000,000 Keys with Automata and Rust

An implementation is the Rust fst crate.
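To illustrate the sorted-set prefix query semantics described above (not the FST structure itself), here is a minimal sketch over a plain sorted list using two binary searches; the key set is made up for illustration. An FST answers the same query in a single walk of the automaton while storing the keys far more compactly.

```python
import bisect

# Hypothetical sorted key set; an FST stores such a set far more compactly.
keys = sorted(["apple", "apply", "banana", "band", "bandana", "zoo"])

def prefix_query(keys, prefix):
    """Return all keys starting with `prefix` using two binary searches:
    the same sorted-set query an FST answers by walking its states."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")
    return keys[lo:hi]

print(prefix_query(keys, "ban"))  # ['banana', 'band', 'bandana']
```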

Roaring Bitmaps

Roaring bitmaps are data structures that, from the outside, can be operated like giant bitmaps storing very large sets of numbers, but internally optimise themselves to use memory as efficiently as possible while staying fast.

Better bitmap performance with Roaring bitmaps — Research Paper (PDF)
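The core chunking idea can be sketched as follows; this is an assumed simplification for illustration, not the real container logic. Each 32-bit integer is split into a 16-bit chunk key (high bits) and a 16-bit value (low bits); a real roaring bitmap then switches each chunk between a sorted array, a bitmap, or a run container depending on density.

```python
from array import array
from bisect import bisect_left, insort

class ToyRoaring:
    """Toy sketch of the roaring chunking idea: partition values by their
    high 16 bits and keep the low 16 bits per chunk. Real roaring bitmaps
    pick the cheapest container type per chunk; this toy always uses a
    sorted array."""

    def __init__(self):
        self.chunks = {}  # high 16 bits -> sorted array of low 16 bits

    def add(self, x):
        hi, lo = x >> 16, x & 0xFFFF
        arr = self.chunks.setdefault(hi, array("H"))
        i = bisect_left(arr, lo)
        if i == len(arr) or arr[i] != lo:
            insort(arr, lo)  # keep each chunk's array sorted

    def __contains__(self, x):
        hi, lo = x >> 16, x & 0xFFFF
        arr = self.chunks.get(hi)
        if arr is None:
            return False
        i = bisect_left(arr, lo)
        return i < len(arr) and arr[i] == lo
```

Sparse chunks stay tiny (a few array entries) while dense chunks would, in the real structure, flip to a fixed-size bitmap, which is where the memory savings come from.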

Crawling strategies

Stract

Stract is mostly interesting because of its politeness algorithm, which in a nutshell is a politeness factor multiplied by either the response time or a minimum delay, whichever is larger, capped at a maximum delay of 180 s. That should be good for both the website and the crawler.

Note: While the algorithm is interesting, the robots.txt Crawl-delay should take priority if present (though that one should also be capped at a friendly value).

Stract Crawler — Webmaster Documentation
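The delay rule described above can be sketched in a few lines; the factor and bounds here are assumptions for illustration, not Stract's actual constants.

```python
# Sketch of the described politeness rule; factor, min_delay_s and the
# 180 s cap are illustrative defaults, not Stract's actual constants.
def politeness_delay(response_time_s, factor=10.0,
                     min_delay_s=1.0, max_delay_s=180.0):
    """Wait factor * max(response_time, min_delay), capped at max_delay."""
    return min(factor * max(response_time_s, min_delay_s), max_delay_s)

politeness_delay(0.2)   # fast site: 10 * 1.0 = 10.0
politeness_delay(30.0)  # very slow site: capped at 180.0
```

The nice property is that a slow (possibly overloaded) server automatically gets backed off harder, while the cap keeps the crawler from stalling entirely on one host.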

Server to Crawler Protocols

If you are crawling the web and care about admins not making your life difficult, you should respect web servers and their administrators. Respecting their signals also helps with gathering higher quality data.

The baseline is implementing and respecting robots.txt; other protocols may provide additional useful metadata depending on your use case.

An interesting overview is "How I block bots" by Seirdy.
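As a starting point for the robots.txt baseline, Python's standard library ships a parser; the rules below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Minimal robots.txt check using Python's standard library.
# The rules and the "mybot" user agent are illustrative.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private/",
])

rp.can_fetch("mybot", "https://example.com/page")       # allowed
rp.can_fetch("mybot", "https://example.com/private/x")  # disallowed
rp.crawl_delay("mybot")                                 # 5
```

In a real crawler you would fetch and cache the file per host (e.g. via `rp.set_url(...)` and `rp.read()`) and combine the Crawl-delay with your own politeness cap, as noted in the Stract section above.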

Relevancy and Ranking

A list of algorithms for ranking search results.

Bag of words ranking

The most straightforward way of measuring how well a document matches a query is word counting. Simple counters aren't very good at this on their own, but weighting them relative to statistics of your document corpus makes them very good at keyword-based search.

Usually the go-to is BM25, as it automatically lightens the impact of common words and rewards completeness over repeating the same words over and over again. It also has a theoretical maximum output, so one can normalise it.
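A minimal BM25 sketch over a toy corpus looks like this; `k1` and `b` are the usual default parameters, while the whitespace tokenisation and the corpus itself are made up for illustration.

```python
import math

# Toy corpus; a real engine would tokenise and normalise properly.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats living together".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N  # average document length

def idf(term):
    n = sum(term in d for d in docs)  # documents containing the term
    # BM25's idf lightens the impact of terms that occur everywhere.
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def bm25(query, doc, k1=1.2, b=0.75):
    score = 0.0
    for term in query:
        tf = doc.count(term)
        # Term frequency saturates, so repeating a word over and over
        # gains less and less; length normalisation is controlled by b.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(term) * tf * (k1 + 1) / (tf + norm)
    return score

ranked = sorted(docs, key=lambda d: bm25("cat dog".split(), d), reverse=True)
```

The document containing both query terms ranks first, which shows the completeness-over-repetition behaviour described above.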

PageRank

PageRank at its core measures a document's overall relevancy in the context of other documents that can link to each other, not its relevancy to a search query.

It's famous for being at the core of Google for many years.

The PageRank algorithm is described in the original Google paper from 1998, but also has its own dedicated paper.

In a nutshell: every page has a score, which it distributes via its links to other pages (keeping some for itself for dampening reasons); iterate a few times, and the scores represent how relevant a page is in terms of incoming links.

The PageRank Citation Ranking: Bringing Order to the Web
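The iteration described in the nutshell above can be sketched as a power iteration on a toy link graph; the graph is made up, and 0.85 is the conventional damping factor from the paper.

```python
# Toy link graph: page -> pages it links to (illustrative).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = list(links)
d = 0.85  # damping factor: how much score flows through links

rank = {p: 1 / len(pages) for p in pages}
for _ in range(50):
    # The (1 - d) share is the "keeps some for itself" part.
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            # Each page distributes its score evenly over its out-links.
            new[q] += d * rank[p] / len(outs)
    rank = new
```

Page "c" receives links from both "a" and "b", so after the iterations it ends up with the highest score.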

TextRank

TextRank is a variation of PageRank applied to texts.

It can be used for keyword extraction and simple automatic summaries, but also for augmenting bag-of-words algorithms with smarter weights.

TextRank can also be used to pull in knowledge from a knowledge graph when connecting words and sentences.

TextRank: Bringing Order into Text
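For keyword extraction, the TextRank idea can be sketched as follows: build a co-occurrence graph over a small window and run a PageRank-style iteration on it. The text, the window size of 2, and the whitespace tokenisation are illustrative; the real algorithm also filters candidate words by part of speech.

```python
from collections import defaultdict

# Illustrative text; real TextRank filters tokens by part of speech.
words = "search engines rank documents and search engines crawl documents".split()

# Undirected co-occurrence graph over a window of 2.
graph = defaultdict(set)
for i, w in enumerate(words):
    for x in words[i + 1 : i + 3]:
        if x != w:
            graph[w].add(x)
            graph[x].add(w)

# PageRank-style iteration over the word graph.
d = 0.85
score = {w: 1.0 for w in graph}
for _ in range(30):
    score = {
        w: (1 - d) + d * sum(score[x] / len(graph[x]) for x in graph[w])
        for w in graph
    }

keywords = sorted(score, key=score.get, reverse=True)
```

Well-connected words like "engines" end up scoring higher than words that co-occur with little else, which is exactly the weight one can feed back into a bag-of-words ranker.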

Text Processing

Text by itself is pretty messy; when building a search engine one probably wants to apply some normalization.

Snowball Stemming

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

The best thing about Snowball is that it can be compiled to other languages you are more likely to write your search engine in.

Snowball
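To show what a stemmer does (not how Snowball does it), here is a toy suffix-stripping sketch; real Snowball algorithms are far more careful, with measure conditions, exception lists, and multiple passes.

```python
# Toy suffix stripper, purely illustrative of what stemming does.
# Real Snowball stemmers use measure conditions and several passes,
# so their output differs from this toy (e.g. "running" -> "run").
def toy_stem(word):
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

toy_stem("cats")      # 'cat'
toy_stem("crawling")  # 'crawl'
```

The point of stemming in a search engine is that "crawl", "crawls" and "crawling" all land on the same index term, so a query for one matches documents using the others.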