Fixing synapse 1.128.0 with SQLite
Synapse 1.128.0 came with an unpleasant surprise; here is how I worked around it
The Problem
On 2025-04-08, Element HQ released the update to synapse 1.128.0.
Running my server the way I do, I got the update via my package manager like most other people. Also like most other people, I'm using the SQLite backend.
After the update my matrix clients reported that they were unable to connect and the synapse process used 100% of one CPU.
Not a great start for debugging, but fortunately I wasn't alone: other people had the same issue, also with SQLite, which got me looking at the changelog in detail for anything that could be database-related …
Long story short: I found out that the "Add a column participant to room_memberships table" background migration caused the problem with a VERY slow SQL query. On other databases this merely sucks up performance, but on SQLite it locks up the database thread and the service grinds to a halt.
The Solution
Almost every SQL performance problem can be worked around by throwing enough indices at it. This is the classic trade of storage space for compute time, and here it's worth it.
Also, as an admin, I have the benefit of being able to inject indices with the sqlite3 command without having to modify any application code.
So here are the three indices that speed the query up enough to not only let the server function, but also to finish the migration within a few minutes of a weekend (the index names are my own choice, pick whatever you like):

CREATE INDEX tmp_current_state_events_join
    ON current_state_events(room_id, state_key)
    WHERE membership = 'join';

CREATE INDEX tmp_events_messages
    ON events(room_id, sender)
    WHERE type = 'm.room.message' OR type = 'm.room.encrypted';

CREATE INDEX tmp_room_memberships_ordering
    ON room_memberships (event_stream_ordering, room_id, user_id);
You can check the status of the background jobs that synapse runs on the database with the following SQL query; an empty result means all jobs have finished:

SELECT * FROM background_updates;
Note: Whether you leave those indices in after the migration has finished is up to you; I've left them in.
The Problem in Detail
As said above, the problem is a slow query. I'll split this up into why slow queries are bad in SQLite, and then why this particular query is slow.
Why Slow Queries are bad in SQLite
SQLite is a simple and performant database. Unlike other databases, it can be included as a library that stores its data in a single file: no extra service needed, which is very welcome for spare-time admins who don't want to put in the time to keep a database service running.
Part of why SQLite is so simple yet fast is that it runs in a single thread: it does one thing start to finish, then the next, allowing it to mostly ignore the problem of multiple queries running into each other (note: oversimplified). The problem with that is: if one query takes very long, all the others have to wait very long. That is an issue when those other queries serve requests from clients with short timeouts, and the result looks like connectivity problems.
The Slow Query
Part of adding another column to the room_memberships table is populating it with the correct data, which is what the populate_participant_bg_update background update is for. Searching the code for that, one can quickly find out …
… it is mainly implemented using the following query:
UPDATE room_memberships
SET participant = True
FROM (
SELECT DISTINCT c.state_key, e.room_id
FROM current_state_events AS c
INNER JOIN events AS e ON c.room_id = e.room_id
WHERE c.membership = 'join'
AND c.state_key = e.sender
AND (
e.type = 'm.room.message'
OR e.type = 'm.room.encrypted'
)
) AS subquery
WHERE room_memberships.user_id = subquery.state_key
AND room_memberships.room_id = subquery.room_id
AND room_memberships.event_stream_ordering <= ?
AND room_memberships.event_stream_ordering > ?;
That's a little monster, but nothing too bad. The subquery is a potential source of slowness, but not in this case.
Speaking of the subquery, asking the database how it would execute the select …
EXPLAIN QUERY PLAN
SELECT DISTINCT c.state_key, e.room_id
FROM current_state_events AS c
INNER JOIN events AS e ON c.room_id = e.room_id
WHERE c.membership = 'join'
AND c.state_key = e.sender
AND (
e.type = 'm.room.message'
OR e.type = 'm.room.encrypted'
)
;
Result:
QUERY PLAN
|--SCAN c
|--SEARCH e USING INDEX events_room_stream (room_id=?)
`--USE TEMP B-TREE FOR DISTINCT
… tells us that it is:
- looking at all entries in the current_state_events table (alias c)
- querying the events table using an index that only contains the room id
- collecting all of that into a binary tree to return each result only once
For context: the current_state_events table contains around 64K entries and the events table about 1 million entries. (Yes, small server.)
The two problems here are:
- The scan on current_state_events happens without using an index, meaning the database has to look at every single entry.
- The events are only indexed by their room_id, leaving a lot of comparisons for the database to do on the fly (remember the high CPU usage?).
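This plan is easy to reproduce on a toy copy of the schema. The sketch below assumes stripped-down tables with only the columns the query touches; the real synapse tables are much wider, and the real events_room_stream index may include more columns.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE current_state_events (room_id TEXT, state_key TEXT, membership TEXT);
CREATE TABLE events (room_id TEXT, sender TEXT, type TEXT);
CREATE INDEX events_room_stream ON events(room_id);
""")

# Print only the human-readable detail column of each plan row.
plan = "\n".join(row[-1] for row in db.execute("""
    EXPLAIN QUERY PLAN
    SELECT DISTINCT c.state_key, e.room_id
    FROM current_state_events AS c
    INNER JOIN events AS e ON c.room_id = e.room_id
    WHERE c.membership = 'join'
      AND c.state_key = e.sender
      AND (e.type = 'm.room.message' OR e.type = 'm.room.encrypted')
"""))
print(plan)
```

The exact wording varies between SQLite versions, but a full SCAN of one of the tables shows up either way.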
Speeding Things up
The most efficient way to speed this up is to custom-tailor two indices to those queries so that the database has to do as little as possible when doing the actual work, meaning we need covering indices.
What's a covering index? A covering index is an index that contains all the data the database needs for a given query. This has the benefit that the database never has to leave the index, saving a lookup in the actual table. That means less code running, and access patterns that are easier to predict and cache, making it quite fast.
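A quick way to see a covering index in action with a throwaway table: the query below only touches columns a and b, both of which live inside the index, so the plan reports a covering index and the table itself is never read.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (a TEXT, b TEXT, c TEXT)")
db.execute("CREATE INDEX t_a_b ON t(a, b)")

# EXPLAIN QUERY PLAN returns one row here; keep its detail string.
(plan,) = [row[-1] for row in
           db.execute("EXPLAIN QUERY PLAN SELECT b FROM t WHERE a = 'x'")]
print(plan)
```

On recent SQLite versions this prints something along the lines of "SEARCH t USING COVERING INDEX t_a_b (a=?)".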
The first index is for the current_state_events table: it needs the state_key and the room_id, and only the rows where the membership column contains the string join:
CREATE INDEX tmp_current_state_events_join -- the name is my own choice
    ON current_state_events(room_id, state_key)
    WHERE membership = 'join';
I've put room_id first because that's needed for the join before the DISTINCT part happens. Maybe the other way around could have been optimized even better by SQLite, but the main point of this index is saving the comparison against the membership column.
The second index is for the events table: it needs the room_id for the join (which is why it comes first), the sender for filtering immediately after the join, and it should only contain the entries that the comparison against the type column is checking for:
CREATE INDEX tmp_events_messages -- the name is my own choice
    ON events(room_id, sender)
    WHERE type = 'm.room.message' OR type = 'm.room.encrypted';
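Re-running the query plan on a stripped-down toy schema (an assumption: only the columns the query touches, with index names of my own choice) shows the planner picking up the partial indices instead of scanning the raw table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE current_state_events (room_id TEXT, state_key TEXT, membership TEXT);
CREATE TABLE events (room_id TEXT, sender TEXT, type TEXT);
CREATE INDEX tmp_current_state_events_join
    ON current_state_events(room_id, state_key)
    WHERE membership = 'join';
CREATE INDEX tmp_events_messages
    ON events(room_id, sender)
    WHERE type = 'm.room.message' OR type = 'm.room.encrypted';
""")

plan = "\n".join(row[-1] for row in db.execute("""
    EXPLAIN QUERY PLAN
    SELECT DISTINCT c.state_key, e.room_id
    FROM current_state_events AS c
    INNER JOIN events AS e ON c.room_id = e.room_id
    WHERE c.membership = 'join'
      AND c.state_key = e.sender
      AND (e.type = 'm.room.message' OR e.type = 'm.room.encrypted')
"""))
print(plan)
```

The partial indices are usable because the query's WHERE clause contains exactly the conditions the indices were built with.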
Those two indices speed the query up enough for synapse to do its job as a server again, but of course there is the outer query, which can be sped up too.
Looking at the schema of the room_memberships table reveals …
.schema room_memberships
Output: (reformatted for readability)
CREATE TABLE room_memberships (
    event_id TEXT NOT NULL,
    user_id TEXT NOT NULL,
    sender TEXT NOT NULL,
    room_id TEXT NOT NULL,
    membership TEXT NOT NULL,
    forgotten INTEGER DEFAULT 0,
    display_name TEXT,
    avatar_url TEXT,
    event_stream_ordering BIGINT REFERENCES events(stream_ordering),
    participant BOOLEAN DEFAULT FALSE,
    UNIQUE (event_id)
);
CREATE INDEX … ON room_memberships (room_id);
CREATE INDEX … ON room_memberships (user_id);
CREATE INDEX … ON room_memberships (user_id, room_id) WHERE forgotten = 1;
[… left out …]
CREATE INDEX … ON room_memberships (user_id, room_id);
CREATE INDEX … ON room_memberships (event_stream_ordering);
Quite some columns, a trigger we're not interested in, and a lot of indices, but none that combines event_stream_ordering with the needed user_id and room_id, so let's create one:
CREATE INDEX tmp_room_memberships_ordering -- the name is my own choice
    ON room_memberships (event_stream_ordering, room_id, user_id);
Now one can watch the background job eat through the remaining work within a few minutes.
Closing Notes

This was the first time the synapse server brought a surprise with an update, so far every other update just worked™.
This little exercise in database performance allowed me to avoid downgrading to an older version and got my chat working again, though I wish it hadn't been necessary.

If you're now interested in learning how to speed up SQLite, I recommend reading the SQLite Indexing Tutorial and the SQLite Query Optimizer Overview.