Sean Bergman

7 Signals, 4 Scopes: How LibTrails Searches 100 Books

Search is one of those features that’s easy to build badly. You wire up a LIKE '%query%' clause, it kinda works, and then someone searches “dystopian fiction” and gets nothing because no book has that exact phrase in its title.

I ran into exactly this with LibTrails. The first version of search used FTS5 keyword matching on book metadata plus semantic search over the 121K extracted topics — two signals fused with Reciprocal Rank Fusion. It worked, but searching “dystopia” returned poor results because the LLM-generated book themes (like “dystopian young adult fiction”) were invisible to search — not in the FTS index, not embedded.

So I expanded it. What started as a 2-signal system is now a multi-scope hybrid search — 7 signals for books, 5 for clusters, with domain and universe search layered on top. All fused through RRF. Here’s how it works and why each piece matters.

The Problem

LibTrails has 100 books from Project Gutenberg, but those books carry a lot of structure underneath: 32K text chunks, 121K extracted topics, ~2.5K LLM-generated book themes, plus topic clusters and 26 domains layered on top.

A single search approach — keyword or semantic — can’t cover all the ways someone might look for a book. Someone typing “Orwell” needs exact keyword matching. Someone typing “authoritarian government” needs semantic understanding. And someone typing a character’s name needs full-text search across the actual book content.

The Architecture: 7 Signals

The system attacks each query from seven angles, then fuses the results into a single ranked list.

User Query: "dystopian fiction"

        ├──► FTS5 keyword search (3 signals)
        │    ├── Book metadata:  title, author, description, themes
        │    ├── Topic labels:   121K normalized topic labels
        │    └── Chunk text:     32K text chunks → mapped to books

        ├──► Semantic vector search (4 signals)
        │    ├── Topic vectors:      121K topic embeddings
        │    ├── Book theme vectors:  ~2.5K theme embeddings
        │    ├── Book vectors:        100 whole-book embeddings
        │    └── Chunk vectors:       32K chunk embeddings

        └──► Reciprocal Rank Fusion (k=60)
             └── Final ranked list

The Keyword Signals (BM25)

The first three signals use SQLite FTS5 with the Porter stemmer:

Book metadata search is the fast path. If someone types “Hunger Games” or “Orwell”, FTS5 matches against the title, author, description, and theme labels. This is where exact matches live.

Topic label search queries the 121K normalized topic labels extracted by the LLM pipeline. If the user types “machine learning”, it matches topics like “machine learning optimization techniques” directly. These topic matches get mapped back to books through the chunk-topic link table.

Chunk text search is the safety net. It searches the raw text of all 32K chunks, so if someone searches for a character name or a specific phrase that appears in the prose, it still finds the book.
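A minimal sketch of what one of these FTS5 signals looks like — table and column names here are illustrative, not LibTrails' actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical metadata index: one FTS5 row per book, Porter-stemmed.
conn.execute("""
    CREATE VIRTUAL TABLE books_fts USING fts5(
        title, author, description, themes, tokenize='porter'
    )
""")
conn.execute(
    "INSERT INTO books_fts VALUES (?, ?, ?, ?)",
    ("Nineteen Eighty-Four", "George Orwell",
     "A novel of state surveillance.", "dystopian fiction"),
)

# BM25 ranking is built into FTS5; lower bm25() means a better match.
rows = conn.execute(
    "SELECT rowid, bm25(books_fts) FROM books_fts "
    "WHERE books_fts MATCH ? ORDER BY bm25(books_fts)",
    ("orwell",),
).fetchall()
```

The Porter tokenizer is what lets "dystopias" match "dystopian" — stems are indexed, not surface forms.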

The Semantic Signals (Vectors)

The four semantic signals use sqlite-vec with BGE-small-en-v1.5 embeddings (384 dimensions):

Topic vector search is the dominant signal. The query gets embedded and compared against all 121K topic vectors. Because the topic vocabulary is so large, this consistently surfaces the most unique books. “Authoritarian government” finds topics about “totalitarian regimes”, “state surveillance”, and “political oppression” — without needing exact keyword matches.

Book theme vectors act as a lightweight ColBERT/MaxSim pattern. Each book is represented by multiple theme vectors (from the ~2.5K LLM-generated themes). The best-matching theme determines the book’s score. So “dystopia” matches theme labels like “dystopian young adult fiction” even when the word doesn’t appear in the title.
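The best-matching-theme scoring can be sketched in a few lines — pure-Python cosine over toy 2-d vectors here; the real system queries sqlite-vec over 384-dim embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def maxsim_score(query_vec, theme_vecs):
    """A book's score is its single best-matching theme (MaxSim), not an average."""
    return max(cosine(query_vec, t) for t in theme_vecs)

# Toy example: one theme is nearly parallel to the query, one is orthogonal.
# The close theme alone determines the book's score; the far one can't dilute it.
query = [1.0, 0.0]
themes = [[0.0, 1.0], [0.9, 0.1]]
score = maxsim_score(query, themes)
```

Taking the max instead of the mean is the whole point: a book with one strongly matching theme ranks high even if its other themes are unrelated to the query.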

Whole-book vectors capture each book’s overall identity. Rather than averaging chunk embeddings (which dilutes the signal — a 400-page novel about war averages out to generic “fiction”), each book gets a single vector built from title + author + description + themes. This is the holistic signal. “Epic fantasy series” ranks Lord of the Rings and Mistborn highly because their descriptions and themes align.

Chunk vectors search the 32K chunk embeddings for narrative-level matches. This has the highest overlap with other signals, but serves as a content-level safety net for matches that metadata misses.

Fusing It All Together: Reciprocal Rank Fusion

Here’s the key insight: BM25 scores and cosine similarities are on completely different scales. You can’t just add them together. Reciprocal Rank Fusion sidesteps this by ignoring raw scores entirely and using only rank position:

def rrf_fuse(ranked_lists, k=60):
    """Fuse multiple ranked lists into one, using only rank positions."""
    scores = {}
    for ranked in ranked_lists:
        for rank, (item_id, _score) in enumerate(ranked):
            # Each appearance contributes 1 / (k + rank + 1); raw scores are ignored.
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

A book at rank r in signal s contributes 1 / (k + r + 1) to its fused score. The k=60 parameter comes from Cormack et al., 2009 and controls how much rank position matters.
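A toy fusion makes the behavior concrete — the raw scores are deliberately on different scales (BM25 vs. cosine), and RRF never looks at them:

```python
def rrf_fuse(ranked_lists, k=60):
    # Same fusion as above: only rank positions matter, raw scores are ignored.
    scores = {}
    for ranked in ranked_lists:
        for rank, (item_id, _score) in enumerate(ranked):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

keyword  = [("book_A", 12.4), ("book_B", 9.1)]   # BM25 scores
semantic = [("book_B", 0.91), ("book_C", 0.88)]  # cosine similarities

fused = rrf_fuse([keyword, semantic])
order = [item for item, _ in fused]
# book_B appears in both lists, accumulates score from each, and wins —
# even though it was rank 2 in the keyword list.
```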

Why this works so well:

  1. Scale-free: BM25 scores and cosine similarities never need to be normalized against each other — only rank order matters.
  2. Consensus wins: a book that appears in several signals accumulates score from each, so agreement across signals pushes it up.
  3. Weak signals can't hurt: a signal that returns marginal results contributes only small reciprocal-rank values; it can't drag down a book that other signals agree on.

What Each Signal Actually Contributes

I ran diagnostics across six query types (dystopia, machine learning, cooking/food, quantum physics, dragons/magic, stoicism/ancient philosophy):

Signal          | Avg Candidates | Avg Exclusive Books | Role
Topic vectors   | ~100           | ~35                 | Dominant semantic signal
Book vectors    | ~50            | ~19                 | Holistic “vibe” matching
Chunk vectors   | ~30            | ~8                  | Content-level safety net
Theme vectors   | ~25            | ~8                  | Genre/theme matching
Chunk FTS       | ~40            | ~7                  | Specific factual content
Book FTS        | ~15            | ~5                  | Exact title/author hits
Topic FTS       | ~20            | ~4                  | Topic label keywords

“Exclusive books” means books found only by that signal. Topic vectors consistently find the most unique results because the 121K topic vocabulary casts the widest net. But every signal contributes — remove any one of them and some relevant books disappear from results.

Graceful Degradation

One design decision I’m happy with: all semantic search functions wrap their sqlite-vec queries in try/except blocks and return empty lists if the vector table doesn’t exist. This means search works at any build stage, for every scope — before embeddings are generated, the FTS signals still return results, and each semantic signal joins the fusion automatically once its vector table exists.

No feature flags, no conditional logic in the fusion layer. You get the best search your current data supports.
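The pattern, sketched with a hypothetical topic-vector query (the real function names and SQL differ; the MATCH syntax mimics sqlite-vec's):

```python
import sqlite3

def search_topic_vectors(conn, query_vec, limit=50):
    """Semantic signal that degrades to an empty list if vectors aren't built yet."""
    try:
        return conn.execute(
            "SELECT topic_id, distance FROM topic_vectors "
            "WHERE embedding MATCH ? ORDER BY distance LIMIT ?",
            (query_vec, limit),
        ).fetchall()
    except sqlite3.OperationalError:
        # Vector table doesn't exist (embeddings not generated yet):
        # contribute nothing to the fusion instead of crashing the search.
        return []

# Against a fresh database with no vector tables, the signal simply goes quiet:
conn = sqlite3.connect(":memory:")
results = search_topic_vectors(conn, b"\x00" * 1536)
```

Because RRF only sums over whatever lists it receives, an empty list from one signal just means the remaining signals decide the ranking.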

The Stack

Everything runs on SQLite — FTS5 for keyword search, sqlite-vec for vector search. The embedding model is BGE-small-en-v1.5 at 384 dimensions. The total database for the 100-book demo is about 1 GB — FTS indexes, vector tables, and all the metadata in a single SQLite file. The architecture scales linearly; larger libraries just mean more vectors.

There’s no Elasticsearch, no Pinecone, no external vector database. Just SQLite extensions doing both jobs in a single file. For a dataset of this size, it’s more than fast enough.

Beyond Books: Search at Every Scale

The 7-signal architecture handles book search, but LibTrails isn’t just a list of books — it’s a layered knowledge graph. Topics form clusters (via the Leiden algorithm), clusters roll up into domains (26 semantic super-categories), and the whole thing renders as a 3D galaxy. Search needs to work at every layer.

Cluster Search (5 Signals)

When you search on the Clusters tab, the question changes from “which books?” to “which topic clusters match?” This uses its own 5-signal RRF fusion:

Signal              | Type     | How it maps
FTS topic labels    | Keyword  | topics_fts → topics.cluster_id
Topic vectors       | Semantic | topic_vectors → topics.cluster_id
Cluster label match | Text     | LIKE search on LLM-generated cluster labels
FTS chunk text      | Keyword  | chunks_fts → chunk_topic_links → topics.cluster_id
Chunk vectors       | Semantic | chunk_vectors → chunk_topic_links → topics.cluster_id

The topic-based signals (1 & 2) are straightforward — topics already have a cluster_id from the Leiden partition. The chunk-based signals (4 & 5) take a 2-hop path: chunk → topic (via chunk_topic_links) → cluster (via topics.cluster_id). This works because chunks literally contain the topics that form the clusters.
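The 2-hop join is plain SQL. A sketch with toy data — chunk_topic_links and topics.cluster_id are the names from the post; everything else is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE topics (topic_id INTEGER PRIMARY KEY, cluster_id INTEGER);
    CREATE TABLE chunk_topic_links (chunk_id INTEGER, topic_id INTEGER);
    INSERT INTO topics VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO chunk_topic_links VALUES (100, 1), (100, 3), (101, 2);
""")

# 2-hop path: chunk -> topic (chunk_topic_links) -> cluster (topics.cluster_id).
# Given chunk IDs that matched the query, find the clusters they point at.
matched_chunks = [100, 101]
clusters = conn.execute(
    "SELECT DISTINCT t.cluster_id "
    "FROM chunk_topic_links l JOIN topics t ON t.topic_id = l.topic_id "
    "WHERE l.chunk_id IN (?, ?) ORDER BY t.cluster_id",
    matched_chunks,
).fetchall()
```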

The cluster label signal is intentionally simple — a LIKE search against the ~2,500 LLM-generated cluster labels. With that few rows, there’s no need for an FTS index. If you search “economics”, it directly matches a cluster labeled “Economic Theory & Market Systems” even before the vector signals fire.
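At that scale the whole signal is one query. A sketch (labels invented; the real implementation presumably normalizes the query before building the pattern):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters (cluster_id INTEGER, label TEXT)")
conn.executemany("INSERT INTO clusters VALUES (?, ?)", [
    (1, "Economic Theory & Market Systems"),
    (2, "Naval Warfare & Seafaring"),
])

# A few thousand rows: a LIKE scan is plenty fast, no FTS index needed.
# SQLite's LIKE is case-insensitive for ASCII by default.
hits = conn.execute(
    "SELECT cluster_id FROM clusters WHERE label LIKE ?",
    ("%economic%",),
).fetchall()
```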

Domain Search (Aggregation + Label Match)

Domain search doesn’t have its own signal pipeline. It’s a two-part composite:

  1. Cluster aggregation: Runs a full cluster search (5 signals, top 100 results), maps each cluster to its parent domain, and keeps the best cluster score per domain plus a count of how many clusters matched.
  2. Direct label match: LIKE search on the ~26 domain labels. Searching “economics” directly hits “Economics & Industry” as a domain — surfacing it even if none of its individual clusters scored high enough.

This layered approach means a broad query like “war” finds dozens of matching clusters across domains like “Heroic & Military Epic”, “Politics & Governance”, and “Death & Mortality” — while a narrow query like “stoicism” might only match a handful of clusters but still lights up “Literary Classics & Philosophy” as the top domain.
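The aggregation step reduces to a small fold over the cluster results. A sketch with hypothetical scores and a hypothetical cluster-to-domain rollup:

```python
# Input: (cluster_id, fused_score) pairs from the 5-signal cluster search,
# plus the cluster -> domain mapping from the rollup. All values invented.
cluster_results = [(1, 0.050), (2, 0.041), (3, 0.038)]
cluster_to_domain = {
    1: "Economics & Industry",
    2: "Economics & Industry",
    3: "Politics & Governance",
}

# Keep the best cluster score per domain, plus how many clusters matched.
domains = {}
for cluster_id, score in cluster_results:
    domain = cluster_to_domain[cluster_id]
    best, count = domains.get(domain, (0.0, 0))
    domains[domain] = (max(best, score), count + 1)
```

The match count is what lets a broad query like “war” rank a domain highly even when no single cluster dominates.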

Universe Search (3D Galaxy Highlighting)

Universe search wraps cluster search and returns the minimum payload needed for GPU rendering — just {cluster_id, score} pairs. The frontend uses these scores to highlight matching clusters in amber and dim everything else in the 3D galaxy view. It inherits all 5 cluster signals automatically.

One SearchBar, Four Scopes

A shared SearchBar component sits on every tab — debounced input, / keyboard shortcut to focus, Esc to clear, loading spinner, result count. The same query routes to the appropriate search scope based on context: books get ranked cards with match-type badges, clusters get reordered cards, domains get filtered, and the universe gets 3D highlights with a sidebar results list.

You can try it yourself at libtrails.app — search for anything and watch the results cascade across all four views.

What I Learned

Building this taught me a few things:

Don’t average embeddings to represent documents. Mean-pooling hundreds of chunk embeddings dilutes the signal into a generic blob. Build purpose-specific vectors from the information that actually defines the document.

More signals > better signals. Any individual signal has blind spots. The topic vector search is the strongest single signal, but it still misses books that the chunk FTS search catches. RRF makes it cheap to add more signals because weak signals can’t hurt the final ranking.

RRF transfers well across domains. I’d already been using RRF in RAG knowledge engine projects, typically followed by a multi-vector reranker to squeeze out the last bit of relevance. Here I deliberately skipped the reranker — LibTrails needs a search bar that feels instant, running on a single Lightsail instance with 1 GB of RAM. Keeping the small local embedding model (BGE-small, 384 dims) and letting RRF do the heavy lifting was the right trade-off: fast enough for real-time search, good enough that the top results consistently make sense.

