LSM on mobile — New Journey

If you put a database on every device — which is what local-first means in practice — you immediately need a search index on every device. The user’s expectation is not “search across our cloud.” It is “search across what I have, instantly.” On a phone, “instantly” is roughly 10ms. Anything slower and the keystroke-by-keystroke feedback loop falls apart.

We thought hard about how to build that. The answer turned out to be a Log-Structured Merge tree, but built for the constraints of a phone rather than a server. This post walks through why, what it looks like, and how it stays small.

Why not SQLite FTS5?

The obvious starting point on mobile is SQLite’s FTS5 module. It is right there. It works. We started prototyping with it, and for a long time it was good enough.

It stopped being good enough for three reasons:

Updates are expensive. FTS5 stores the inverted index in a way that makes incremental updates write-amplifying. A workspace where issues, comments, and chat messages stream in continuously will rewrite a lot of bytes on every batch.
No native vector posting layout. When we started prototyping semantic search and field-weighted BM25-F, we had to layer additional tables alongside FTS5 and join across them. Latency stopped being predictable.
No control over compaction shape. FTS5’s segment management is an internal implementation detail. We could not tune it for the “many small writes, occasional bulk merge” pattern that local-first sync produces.

These are not bugs in FTS5. It is a great library for what it is. It is just not built for the workload of a CRDT-synced workspace, where the primary keys constantly change as histories merge.

So we built our own. The architecture is on-disk LSM, and every component has a reason.

The architecture, in one breath

The search slice is a memtable on top of a write-ahead log on top of a stack of immutable per-segment redb files, with a roaring bitmap of tombstones, a manifest that knows which segments are live, and a tiered compaction policy that periodically merges small segments into larger ones. ADRs 039 through 049 cover the design decisions.

That sentence does a lot of work. Let me unpack each piece.

Memtable + WAL

Writes go to two places: an in-memory memtable, and a write-ahead log on disk. The memtable is a sorted structure (we use a simple skiplist) that supports point lookups and range scans for queries that need to see fresh data. The WAL is append-only, which means writes are sequential, which means even a slow phone storage controller is happy.

When the memtable hits a size threshold, it is flushed: written out as an immutable segment file, the WAL is truncated, and a new empty memtable is allocated. Crash recovery is “replay the WAL into a fresh memtable.”

Immutable per-segment redb files

Each flushed segment is its own redb file. redb is a small, embedded, ACID key-value store written in Rust. We use it because:

Segment files are append-only after creation; redb’s MVCC fits this perfectly.
Each segment is self-contained, so corruption of one segment does not poison the others.
redb’s read path is tight — a segment lookup is a B-tree descent in mmap’d memory.

The “immutable” part is the load-bearing word. Once a segment is written, nothing modifies it. To delete a document, we do not edit the segment; we mark it tombstoned in a separate structure.

Roaring bitmap of tombstones

Deletions in an LSM are tricky. The naive approach is to write a “delete record” that overrides earlier entries. That works, but it makes every read consult every segment, and it makes “how big is this collection?” slow.

We instead keep a roaring bitmap of deleted document IDs, persisted alongside the manifest. On query, we filter results through the bitmap. Roaring bitmaps are the right shape for this: dense ranges (a community where most messages are kept) and sparse ranges (a project where most issues are deleted) are both compact, and set operations are fast.

When a segment is compacted away, the tombstones for documents that no longer exist anywhere are dropped. The bitmap stays small.

Manifest

The manifest is a tiny file — a single source of truth for “which segments are live, what tombstones apply, what’s the schema version.” It is updated atomically (write-and-rename) on every flush and every compaction. Recovery is “open the manifest, open the listed segments, replay the WAL.” This is the discipline LevelDB taught the world, applied at phone scale.

Tiered compaction

Without compaction, you eventually have hundreds of segments and every query consults all of them. With too much compaction, you spend battery rewriting bytes. We use a tiered policy: when N segments at level L exist, they are merged into one segment at level L+1. The level cutoffs are tuned so that, in steady state, a query touches at most a handful of segments.

Compaction runs in the background, never on the foreground UI thread, and never when the device is on battery and the user is active. We are conservative about waking the radios.

The parallel-vector posting layout

The other thing we built, which is novel enough to deserve its own callout, is the parallel-vector posting layout.

A traditional inverted index posting list looks like this: for term T, a list of (doc_id, term_frequency, positions) tuples. Reading it back means decoding all three fields together for every posting.

We split the layout: the doc IDs go in one vector, the term frequencies in another, the field flags in a third. All three vectors are aligned by index — posting i for term T is (docs[i], tfs[i], fields[i]). The result is that:

Doc-id-only queries (filtering, set intersection) read just one vector, in tight cache-friendly stripes.
Scoring (BM25-F) reads the doc-id and term-frequency vectors in lockstep, which the CPU prefetcher loves.
Field-restricted queries can skip whole runs by checking the field-flag vector first.

The numbers we measure on a recent iPhone, against a 10,000-message community: P50 of 1.6ms, P99 of 4.8ms. On a mid-range Android, the same query hits P99 of 8.9ms. We are comfortably inside the “feels instant” budget on every device class we test.

Staying small

A search index on a phone has to fight for its disk space. We have three levers:

Compaction reclaims tombstoned bytes. Once a deleted document falls out of every live segment, its bytes are gone.
Field weights, not field copies. BM25-F lets us score titles higher than bodies by weighting, not by duplicating the title into a separate posting list.
Optional cold-history pruning. A user can opt to keep only the last N months of message history fully indexed; older history is searchable by metadata only. This is a per-user setting, not a server policy.

In our internal dogfooding, the index size for a typical engineer’s workspace settles around 40–80MB, which is comparable to a few good photos. For mobile, that is a price worth paying for “search that actually feels native.”

What’s next

The LSM is the substrate; we are now layering richer features on top of it: federated search across communities (without breaking the encryption boundary), semantic ranking with a small on-device embedding model, and faceted filtering driven by the planning slice’s structured fields.

The point is: the boring part — fast, durable, compactable, on-device — is solved. Everything we add from here gets to assume single-digit-millisecond search as a baseline, which is a very different design space from “search is a network call that costs you 100ms.”

We will share more numbers as the slice gets more workload. For now, the best benchmark is the one you can feel: type into the New Journey search bar and notice what is missing — the wait.