Modern Web Search: Not Just an Index

by Slatian

Why an index isn't enough for a modern search engine.

What this post is about

This post is about the claim:

We need a sovereign European search index!

The short answer is: Yes, absolutely! But a search index alone won't be enough.

Note: I live in Europe, so I've started with the European version of the claim. But I hope it's obvious that everyone deserves a sovereign and local search engine; this is a global problem, not just a European one.

What is a search index?

To show why an index is not enough, we first have to answer what an index is:

A search index is a database of documents (web pages) reorganised so that they can be queried efficiently by their contents (i.e. the search term "cat" matches all documents that contain the word "cat" or "cats"). It can also include functionality for querying which words follow each other, or how important some arbitrary scoring algorithm determined a word or phrase to be.
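To make this concrete, here is a minimal sketch of an inverted index in Python. The documents and tokenisation are deliberately naive; note that the exact token "cat" does not match "cats", which already hints at why extra components are needed:

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {
    1: "the cat sat on the mat",
    2: "dogs and cats are pets",
    3: "a map of the city",
}
index = build_index(docs)
print(sorted(index["cat"]))   # only doc 1 contains the exact token "cat"
print(sorted(index["cats"]))  # doc 2 is missed without stemming or a dictionary
```

Real engines layer tokenisation, stemming and scoring on top of this structure, but the core idea is the same: look up documents by word, not words by document.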

So an index provides the basic functionality one thinks of when setting out to build a search engine.

While this is already pretty useful on a small scale, it isn't in the context of a large database with a diverse set of topics or languages; you need some extra components.

Why do we need something else too?

The problem with a search index is that it can search for words, but it neither considers that multiple, very different words can mean the same thing, nor that the same word can mean multiple things depending on the wider context. Those are two problems that require different solutions, which rely on other components.

The rest of the article will go through them component by component.

"Forbidden Knowledge"

The web of today is full of sites that, intentionally or not, don't follow web standards in their structure and content. To avoid filling the index with useless data, the crawlers collecting the data usually cheat by having additional knowledge about popular content management systems, wiki engines and services baked in, so they know what is worth visiting and what isn't, even though that information isn't communicated through mechanisms like robots.txt or sitemap.xml.
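Such baked-in knowledge might look like the following sketch: a table of per-CMS URL patterns that a crawler skips. The pattern strings here are illustrative examples of the kind of auto-generated pages these systems produce, not an authoritative list:

```python
# Hypothetical baked-in crawl rules; the path patterns are illustrative only.
CMS_SKIP_PATTERNS = {
    "mediawiki": ["?action=history", "?action=edit", "Special:"],
    "wordpress": ["/wp-admin/", "/wp-json/", "?replytocom="],
}

def worth_visiting(url, cms):
    """Skip URLs that a known CMS generates but that carry no indexable content."""
    for pattern in CMS_SKIP_PATTERNS.get(cms, []):
        if pattern in url:
            return False
    return True

print(worth_visiting("https://wiki.example/w/Some_Article?action=history", "mediawiki"))
print(worth_visiting("https://wiki.example/w/Some_Article", "mediawiki"))
```

A real crawler would combine rules like these with robots.txt and sitemap.xml, using the baked-in knowledge only where the site itself communicates nothing.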

Big search engines either guess that information or collect it via their proprietary web admin tools.

A Secondary Index

Every mainstream search engine nowadays isn't one search engine, but at least two: one with the index of all known web pages, and one that is carefully curated and displayed in the form of infoboxes from Wikipedia, answers from Stack Overflow (or somewhere else), or dictionary lookups.

Without those, the sources known to reliably provide useful results would often be drowned out by other pages that may be very relevant, but not what someone is looking for.
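The two-tier idea can be sketched as a simple routing step: answer from the curated index when it has a hit, otherwise fall back to the general index. The sample entries are made up for illustration:

```python
def search(query, curated, general):
    """Try the curated index first; fall back to the general web index."""
    key = query.lower()
    if key in curated:
        return {"source": "curated", "results": curated[key]}
    return {"source": "general", "results": general.get(key, [])}

# Made-up example data standing in for real curated and crawled indices.
curated = {"404 not found": ["Infobox: HTTP 404 means the server can't find the resource."]}
general = {
    "404 not found": ["Some blog post that happens to mention 404 errors."],
    "cat pictures": ["A photo gallery of cats."],
}

print(search("404 Not Found", curated, general)["source"])  # the curated answer wins
```

In a real engine the curated results are usually blended into the general results rather than replacing them outright, but the precedence is the same.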

This is not about LLM ("AI") generated answers, those are not a part of the searching machinery.

There is an ethical dimension to this too.

Both of these points are worth putting more thought into, and a search engine should be transparent about them.

A secondary use of the secondary index is as a reference dataset of texts that can be used to tune heuristics for the main index.

A Dictionary/Thesaurus

The "multiple words can mean the same thing" problem can be solved fairly easily by a synonym lookup. For that, one needs a dictionary of known words (important: not all possible words are known), which can then be used to broaden a search with few results or to uprank results that also mention synonyms.
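A synonym broadening step might look like this sketch, with a tiny hand-made thesaurus standing in for a real resource like WordNet:

```python
# Tiny illustrative thesaurus; a real one would come from e.g. WordNet or Wiktionary.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "automobile": {"car", "vehicle"},
}

def broaden(terms):
    """Expand query terms with known synonyms; unknown words pass through unchanged."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded.update(THESAURUS.get(term, set()))
    return expanded

print(sorted(broaden(["car", "rental"])))
```

Whether the expanded terms are OR-ed into the query or only used for reranking is a tuning decision; blindly OR-ing synonyms can flood a query that was already precise.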

This dictionary can also be used to improve language guessing. This is necessary because it is unfortunately not uncommon that the language annotation of a website can't be relied on, as it is often hard-coded in templates or an overlooked setting. An important consideration here is that the same sequence of characters can be a valid word in multiple languages.
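A crude dictionary-based language guess can be sketched by counting how many known words of each language appear in a text. The word lists here are tiny, made-up samples; note that "gift" is a valid word in both English and German (where it means poison), so a single word can't decide:

```python
# Tiny illustrative word lists; real ones would hold tens of thousands of entries.
DICTIONARIES = {
    "en": {"the", "and", "gift", "house"},
    "de": {"und", "der", "gift", "haus"},  # "gift" is poison in German
}

def guess_language(text):
    """Score each language by how many of its dictionary words occur in the text."""
    tokens = text.lower().split()
    scores = {lang: sum(token in words for token in tokens)
              for lang, words in DICTIONARIES.items()}
    return max(scores, key=scores.get)

print(guess_language("the gift and the house"))  # surrounding words break the tie
```

Production systems usually combine word lists with character n-gram statistics, but the principle of letting the whole text vote is the same.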

Another use for a dictionary is the classic task of resolving spelling errors, and also normalising alternate valid spellings (i.e. American English and British English).

Often the dictionary becomes visible as part of an autocompletion or query spellchecking feature.

Prominent examples that are usable for this purpose are Wiktionary and WordNet. There is also GermaNet for the German language.

A Knowledge Graph

While words can mean a lot, words and dictionary definitions don't make up reality. To "solve" the problem of what exactly a usually very short query is asking for, the search engine needs a knowledge graph that can disambiguate a short query string into a precise query the index can answer, resolving name and word collisions and determining the meaningful words in the query.

The knowledge graph is the most invisible of the additional components, though it powers things like query suggestions and infoboxes, may be intertwined with the secondary index, and so on.

The most complete general purpose and freely licensed knowledge graph is Wikidata.
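A toy version of the disambiguation step might look like this. The entities, ids and context sets below are entirely made up; a real graph like Wikidata links millions of entities to their types, aliases and related concepts:

```python
# Hand-made toy knowledge graph; entity ids and context sets are invented.
GRAPH = {
    "jaguar": [
        {"id": "E1", "label": "jaguar (animal)", "context": {"cat", "animal", "wildlife"}},
        {"id": "E2", "label": "Jaguar (car maker)", "context": {"car", "vehicle", "company"}},
    ],
}

def disambiguate(term, other_terms):
    """Pick the candidate entity whose context overlaps most with the rest of the query."""
    candidates = GRAPH.get(term, [])
    if not candidates:
        return None
    return max(candidates, key=lambda e: len(e["context"] & set(other_terms)))

print(disambiguate("jaguar", ["car", "dealer"])["label"])
print(disambiguate("jaguar", ["animal", "habitat"])["label"])
```

The surrounding query terms act as the "wider context" from earlier: the same string resolves to different entities, and therefore to different precise queries against the index.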

Geocoding

A lot of things humans search for day to day are location related, which is why geocoding is important not only for finding the place "near me", but also as a useful extension to the dictionary and knowledge graph. This is useful even without a mapping feature.

Usable datasources are GeoNames and OpenStreetMap Nominatim.
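Data from such sources could feed a gazetteer used at query time, as in this sketch. The place names and coordinates are a tiny hand-made sample, and the "near me" handling assumes the user has shared a location:

```python
# Tiny illustrative gazetteer; real data would come from e.g. GeoNames or Nominatim.
GAZETTEER = {
    "berlin": (52.52, 13.405),
    "paris": (48.857, 2.352),
}

def geocode_query(query, user_location=None):
    """Resolve 'near me' to the user's location, or look up place names in the query."""
    tokens = query.lower().split()
    if "near" in tokens and "me" in tokens:
        return user_location
    for token in tokens:
        if token in GAZETTEER:
            return GAZETTEER[token]
    return None

print(geocode_query("pizza near me", user_location=(52.52, 13.405)))
print(geocode_query("hotels paris"))
```

Even without rendering a map, the resolved coordinates let the engine rank location-bound results by distance.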

"Cheating"

There are a few ways to "cheat" and omit some of those components, but those come with downsides.

Single Domain, Single Language Index

Limiting the index scope to a single knowledge domain and a single or a few related languages will greatly reduce the opportunities for word collisions and thereby the need for a dictionary.

This will limit how large the index can grow.

LLM Embedding

Another common way to model meaning independently of the words used to express it is to generate embeddings using an LLM and build an index with a vector database.

This approach is very simple in terms of writing code, as this is a use case for which software already exists (often labelled "semantic search").
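Stripped of the model and the vector database, the core retrieval step is just nearest-neighbour search over vectors. This sketch uses made-up three-dimensional "embeddings" (a real model outputs hundreds of dimensions) and plain cosine similarity:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional document "embeddings" for illustration.
doc_vectors = {
    "feline care tips": [0.9, 0.1, 0.0],
    "car maintenance": [0.0, 0.2, 0.9],
}

def semantic_search(query_vector, docs):
    """Rank documents by cosine similarity to the query embedding."""
    return sorted(docs, key=lambda d: cosine(query_vector, docs[d]), reverse=True)

print(semantic_search([0.8, 0.2, 0.1], doc_vectors))  # a "cat"-like query vector
```

The opaqueness the next paragraph criticises lives entirely in how those vectors are produced: the search step itself is transparent, the model that assigns meaning to the numbers is not.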

However, when using it one should acknowledge that the LLM is a large blob that is difficult or impossible to verify or reproduce. It has a lot of hidden and unchangeable assumptions baked in. Those assumptions will affect search results and aren't great for a sovereign search index.

Another challenge with embeddings is the compute power needed to generate them.

TL;DR

For building a modern search engine an index alone isn't enough.

There are a few ways to cheat, but they come either with scaling limits or with giving up sovereignty.

One also needs curated secondary indices, dictionaries and a knowledge graph to resolve the query in the search box into a better, more precise query that will generate the actual results.

There are already projects out there that solve parts of the problem: search engines, dictionaries, knowledge graphs.

Let's build the future, together!