The 39c3 Search Engines Post

By Slatian

At 39c3 I visited with one main quest: Talk to as many beings who have anything to do with search engines as possible.

Overview

To summarise: it went very well, and I was pretty tired after congress.

Slatian interjects:

Future Slatian here: I was actually pretty unsure about this post and fell into a perfectionism trap, which is why it is a quarter of a year late. The weighting of the text is off and doesn't really match my real experience, and some things just feel rough. But it's real text of a real human experience, written by a real human who overthought it without actually improving it. So without any further improvements, here it is!

In case you missed it, I hosted two self-organized sessions: the Search Engine Creators Meetup and an ask me anything.

I'll organise such meetups at other events too.

I also spent some time at the Digitalcourage assembly.

And visited the beings behind the two search engines that made searching the file shares at congress possible, both built at 39c3 itself.

The Search Engine Creators Meetup

While shorter than I had hoped (I forgot the room reservation over Christmas), it was a success. Most beings who showed up were interested in searching personal or organisational knowledge bases. Since we didn't get much past the introduction round, I'll try to paste the notes here with some annotations that I remember.

Technologies used by people who were present:

Main topics people were interested in:

Tokenizing was interesting because it starts easy (just split on spaces and special characters) and then gets complicated pretty quickly, for example with compound words (very common in German). To drop some names of libraries that try to solve this: charabia from meilisearch, tantivy with its tokenization infrastructure, and of course I mentioned my own unobtanium-segmenter.
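To make the "starts easy, gets complicated" point concrete, here is a minimal Python sketch. It is not how charabia, tantivy or unobtanium-segmenter work; the function names and the tiny vocabulary are made up for illustration. The naive tokenizer splits on non-word characters, and the toy compound splitter assumes a known vocabulary and ignores real-world details such as German linking elements.

```python
import re


def naive_tokenize(text):
    """Lowercase the text and split on anything that is not a word character."""
    return re.findall(r"\w+", text.lower())


def split_compound(word, vocabulary):
    """Toy compound splitter (hypothetical helper, dictionary based).

    Tries the longest known prefix first and recurses on the rest.
    Returns [word] unchanged if no complete split into known words exists.
    """
    if word in vocabulary:
        return [word]
    for end in range(len(word) - 1, 0, -1):
        head = word[:end]
        if head in vocabulary:
            tail = split_compound(word[end:], vocabulary)
            if all(part in vocabulary for part in tail):
                return [head] + tail
    return [word]


# Illustrative vocabulary, assumed for this example only.
VOCABULARY = {"such", "maschinen", "betreiber"}

# The naive tokenizer keeps the compound as one token ...
tokens = naive_tokenize("Suchmaschinenbetreiber treffen sich!")
# ... and the splitter breaks it up only because the parts are in the vocabulary.
parts = split_compound(tokens[0], VOCABULARY)
```

Without the second step, a query for "Betreiber" would never match a document that only contains "Suchmaschinenbetreiber", which is exactly where the simple approach stops being enough.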

Vector search was mentioned multiple times, but I don't remember whether any discussion arose from that. I'd like to hear of any unconventional implementations, because my current opinion on vector search is that it depends on LLMs, which will always have hidden baked-in assumptions (intentional and unintentional) that may not align with one's mission of building an independent search system.

Metadata took the most time of the discussion and the main points were:

Someone also had a project to make OpenStreetMap searchable in an unusual way (I unfortunately forgot who that was).

In case you want to connect: There is the search-collab Mailing list for beings interested in building search engines and exchanging knowledge about search engines.

The ask me anything

The ask me anything was great: the room at the Free Knowledge Habitat was full despite the early hour (10:30 on day 2, thanks to heaven for supplying me with a cup of coffee).

The questions were mostly about search engines in general or about my specific search engine.

One very interesting question that I want to share here was:

Do you rank by accessibility?

This question is very good, because a search engine preferring more accessible sites is a very good incentive to improve web accessibility. The answer I had to give was an unfortunate "no". In general, search engines also make use of some accessibility features, so there is a bit of an overlap between what is accessible to the search engine and what is accessible to humans, but there is no accessibility ranking.

The answer I didn't have ready is that the reality of making things accessible beyond the basics is multidimensional and complicated. There are automatic accessibility checks, but those alone without good intentions on the other end only incentivize cheating the test.

Meeting c3search and c3ordnung

I've met the creators of both c3search and c3ordnung, the two search engines that were created for the fileshares at congress.

By coincidence, both use meilisearch and have a three-step crawl, index, search pipeline with no long-term persistence between crawling and indexing. Both were written very quickly, which is why there is no source code available for c3search (yet?).
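For readers who haven't built one: the shape of such a crawl, index, search pipeline can be sketched in a few lines of Python. This is only an illustration under loud assumptions. It is not the code of c3search or c3ordnung, it uses a plain in-memory inverted index instead of meilisearch, and the "file share" is a hypothetical dictionary of paths and text. The crawler hands documents straight to the indexer, so nothing is stored in between.

```python
import re
from collections import defaultdict


def crawl(file_share):
    """Step 1: yield (path, text) pairs. The dict stands in for a real file share."""
    for path, text in file_share.items():
        yield path, text


def build_index(documents):
    """Step 2: consume crawled documents directly into an inverted index."""
    index = defaultdict(set)
    for path, text in documents:
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(path)
    return index


def search(index, query):
    """Step 3: return the paths that contain every token of the query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(matches)


# Made-up example data.
share = {
    "talks/search.txt": "Building a search engine",
    "music/mix.txt": "Congress mix tape",
    "talks/engine.txt": "Engine room tour",
}
index = build_index(crawl(share))
hits = search(index, "search engine")
```

Skipping the persistence step keeps things simple for a four-day event, at the cost of having to recrawl everything whenever the index needs to be rebuilt.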

Anyway, if you're reading this: It has been nice to meet you!

Meeting Digitalcourage

I've made some noise at the Digitalcourage assembly about Europe currently being without a sovereign search engine and had the joy to exchange some knowledge about this with Rena Tangens (Huge thanks for taking the time!).

"No sovereign European search engine": There are currently 4 search engines on this planet that are truly large enough to be useful as go-to search engines for most of the world: Google, Bing, Yandex, Baidu … you may have noticed that all of those are located in countries we as Europeans shouldn't trust nowadays. The only large European search engine is Mojeek, but it's more on the scale of largest of the small ones.

Turns out that this was already a topic at Digitalcourage way back in 2017, and I want to contact the people behind the Open Web Index, even though the project seems to be mainly aimed at academics.

Update: I kind of did contact them, though, as things go, a lot of distractions showed up and I haven't gone further on this end yet.

Resources Discovered

Web Data Commons - University of Mannheim
"A large public corpus of web tables containing time and context metadata". Tables as in HTML tables that contain useful data. Someone dropped this at the ask me anything. Thank you!
Open Web Index
is a project by the University of Passau and the Open Search Foundation. I knew this one before but dismissed it; thanks to Rena Tangens for bringing it to my attention. The best description of the Open Web Index is: a research-focused CommonCrawl made in Europe that goes the extra mile (or rather, kilometre).
GermaNet
a German version of WordNet made at the University of Tübingen.
Website Taxonomy Extraction – Identifying and Extracting Navigation Structures from Websites
An interesting paper about extracting menu structure from HTML to make a whole new dimension of pages machine readable, discoverable and searchable.