Sean Bergman

From 19 Seconds to 50 Milliseconds: Making LibTrails Fast on a $7 Server

LibTrails ran fine on my MacBook Pro. Every search returned in ~200ms, related books loaded instantly, the whole thing felt snappy. Then I deployed it to a $7/month AWS Lightsail instance with 1 GB of RAM, and the latency crept up.

Search took 3.5 seconds. Related books took 19 seconds. Everything worked, but it didn’t feel like a site you’d want to explore — too much delay between typing a query and seeing results. I wanted semantic search to feel instant, even on a tiny server. This post covers the three optimizations that got it there: replacing PyTorch with ONNX Runtime, pruning a redundant search signal, and rewriting the related books query.

The Constraint: 1 GB RAM, Single vCPU

The LibTrails demo runs on a Lightsail t3.micro — $7/month, 1 GB RAM, 1 vCPU, 1 GB swap. It hosts both the FastAPI backend and the Astro SSR frontend behind a Caddy reverse proxy. Every byte of RAM matters.

The 100-book demo database contains:

  - 100 book-level vectors and ~2,500 book theme vectors
  - ~32K chunk vectors, plus the chunk text itself
  - 121K topic vectors
  - FTS5 indexes over book metadata, topic labels, and chunk text

All stored in a single ~1 GB SQLite file with sqlite-vec for vector search and FTS5 for keyword search.
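The keyword side needs no extension at all — FTS5 ships with SQLite. A minimal sketch of how such an index behaves (table and column names here are illustrative, not LibTrails' actual schema):

```python
import sqlite3

# Toy FTS5 table: the default unicode61 tokenizer is case-insensitive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE books_fts USING fts5(title, author)")
conn.execute("INSERT INTO books_fts VALUES ('Dune', 'Frank Herbert')")
conn.execute("INSERT INTO books_fts VALUES ('Neuromancer', 'William Gibson')")

rows = conn.execute(
    "SELECT title FROM books_fts WHERE books_fts MATCH 'dune'"
).fetchall()
print(rows)  # [('Dune',)]
```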

Problem 1: The Embedding Model Was Too Heavy

LibTrails uses BGE-small-en-v1.5 to embed search queries at request time. The standard approach loads it via sentence-transformers, which depends on PyTorch — a ~200 MB library designed for GPU training that we don’t need at all for inference. We just need a single forward pass through a small transformer to produce a 384-dimensional vector.

On a 1 GB instance, the numbers don’t work:

| Metric | sentence-transformers | Target |
|---|---|---|
| Dependencies | ~300 MB (PyTorch + friends) | As small as possible |
| Process RSS | ~500 MB | Under 300 MB |
| Model load time | ~3s | Under 1s |
| Embedding latency | ~15ms/query | Similar |
| Swap pressure | Frequent | None |

The Fix: ONNX Runtime

ONNX Runtime is Microsoft’s optimized inference engine. It loads a pre-exported .onnx model file and runs inference with no PyTorch, no CUDA toolkit, no autograd. The relevant dependencies shrink to onnxruntime plus the standalone tokenizers library used below.

Total: ~20 MB vs. ~300 MB.

Exporting the Model

A one-time export converts the sentence-transformers model to ONNX format using HuggingFace’s optimum library:

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

# export=True converts the PyTorch weights to ONNX on load
model = ORTModelForFeatureExtraction.from_pretrained(model_path, export=True)
model.save_pretrained(onnx_output_dir)

# the tokenizer files ship alongside the model
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.save_pretrained(onnx_output_dir)

The exported model is ~33 MB. It gets SCP’d to the server alongside the tokenizer files.

Dual Backend Architecture

The embedding module auto-detects which backend to use:

  1. If models/bge-small-onnx/model.onnx exists and onnxruntime is installed → use ONNX
  2. Otherwise → fall back to sentence-transformers
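The detection logic above is small enough to sketch in full (the helper name and default path here are illustrative, mirroring the description rather than the actual module):

```python
from pathlib import Path

def pick_backend(model_dir: str = "models/bge-small-onnx") -> str:
    """Pick the embedding backend: ONNX if the exported model and
    runtime are both available, otherwise sentence-transformers."""
    try:
        import onnxruntime  # noqa: F401  (only checking availability)
    except ImportError:
        return "sentence-transformers"
    if (Path(model_dir) / "model.onnx").exists():
        return "onnx"
    return "sentence-transformers"
```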

Both backends produce identical embeddings — I verified this with a cosine similarity check that returned exactly 1.000000 between outputs. The ONNX path uses CLS pooling with L2 normalization, matching BGE’s default strategy:

def _onnx_encode(texts: list[str]) -> np.ndarray:
    encodings = _onnx_tokenizer.encode_batch(texts)

    input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
    attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
    token_type_ids = np.array([e.type_ids for e in encodings], dtype=np.int64)

    outputs = _onnx_session.run(None, {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    })

    # CLS pooling (index 0)
    cls_embeddings = outputs[0][:, 0, :]

    # L2 normalize
    norms = np.linalg.norm(cls_embeddings, axis=1, keepdims=True)
    return cls_embeddings / np.maximum(norms, 1e-12)
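The parity check between backends is just a cosine similarity between their outputs. A sketch (the two `_encode` calls in the comment are placeholders for whichever backend functions are loaded):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parity check sketch: identical inputs through both backends
# should score ~1.0, e.g.:
#   cosine(_onnx_encode(["query"])[0], _st_encode(["query"])[0])
```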

The server uses ONNX for all request-time embeddings. sentence-transformers stays available for bulk operations (indexing on a dev machine) without overriding the active backend.

Impact

| Metric | sentence-transformers | ONNX Runtime |
|---|---|---|
| Process RSS | ~500 MB | ~200 MB |
| Model load time | ~3s | ~0.5s |
| Embedding latency | ~15ms/query | ~12ms/query |
| PyTorch required | Yes (~200 MB) | No |
| Swap pressure (1 GB RAM) | Frequent | None |

The server now starts in under a second and stays well within the 1 GB memory budget.

Problem 2: The 121K Vector Scan

LibTrails’ hybrid search fuses multiple retrieval signals via Reciprocal Rank Fusion. The original architecture used 7 signals for book search — three keyword (FTS5) and four semantic (sqlite-vec):

  1. FTS5 book metadata
  2. FTS5 topic labels
  3. FTS5 chunk text
  4. Semantic topic vectors (121K embeddings)
  5. Semantic book theme vectors (~2.5K embeddings)
  6. Semantic book vectors (100 embeddings)
  7. Semantic chunk vectors (~32K embeddings)
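Reciprocal Rank Fusion itself is simple: each signal contributes 1/(k + rank) per document, and documents are sorted by summed score. A minimal sketch (k=60 is the common default from the RRF literature, not necessarily LibTrails' value):

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of ids via Reciprocal Rank Fusion."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # items ranked highly by multiple signals accumulate score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```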

Profiling on the server revealed that signal 4 — the topic vector scan — took 2.7 seconds per query. sqlite-vec performs exact KNN; it computes cosine distance against all 121,000 topic vectors sequentially. On a single-core t3.micro, that’s a lot of floating-point math.
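Exact KNN has no index to prune the search: it is one dot product per stored vector, every query. A NumPy sketch of what that scan amounts to:

```python
import numpy as np

def exact_knn(query: np.ndarray, vectors: np.ndarray, k: int = 10):
    """Brute-force cosine KNN: score every row, then take the top k."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                  # one dot product per stored vector
    top = np.argsort(-sims)[:k]   # full sort over all n rows
    return top, sims[top]
```

With n = 121,118 rows of 384 floats, that inner product alone is ~46M multiply-adds per query, all on one core.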

The other vector signals were fast because they searched much smaller tables:

| Table | Rows | Scan time |
|---|---|---|
| topic_vectors | 121,118 | 2.7s |
| chunk_vectors | 31,849 | ~0.5s |
| book_theme_vectors | ~2,500 | ~0.01s |
| book_vectors | 100 | ~0.005s |

Was the Signal Even Necessary?

The chunk vectors (31K rows, ~0.5s) already provide equivalent semantic coverage. Every chunk that mentions a concept has a nearby embedding, and the chunk→book mapping captures the same books that topic→book would. The topic vector signal was redundant with the chunk signal — but 5x more expensive.

I ran diagnostic queries across six query types (dystopia, machine learning, cooking/food, quantum physics, dragons/magic, stoicism) comparing results with and without the topic vector signal. No measurable difference in the top-10 results for any query.
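The comparison boils down to measuring top-10 overlap between the two configurations. A hypothetical helper for that diagnostic:

```python
def top_k_overlap(results_a, results_b, k=10):
    """Fraction of ids shared between the top-k of two result lists."""
    return len(set(results_a[:k]) & set(results_b[:k])) / k
```

An overlap of 1.0 across every test query is what justified dropping the signal.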

The Fix: Drop to 6 Signals

Removing the topic vector scan from both search paths dropped total search time from ~3.5s to ~0.8s. The chunk and theme vectors already cover the same semantic ground — every concept that exists as a topic also exists in the text chunks that spawned it.

If I move to a larger server or add approximate nearest neighbor (ANN) indexing, I could add the signal back. But for a $7/month instance, the tradeoff is clear.

Problem 3: The 19-Second Related Books Query

The find_related_books endpoint powers the “Related Books” section on every book detail page. The original implementation called the full hybrid_search_books function with a query constructed from the book’s title, author, and all themes — a 373-character query with 50+ tokens.

This was catastrophic: every book page view ran the entire 7-signal hybrid pipeline, and a 50+ token FTS query matches nearly everything, so the endpoint took roughly 19 seconds.

The Rewrite: 3 Fast Signals

Related books don’t need the full search pipeline. We already know which book we’re looking at — we just need to find similar ones. The rewrite uses only the fast vector signals over small tables:

def find_related_books(conn, book_id, limit=12):
    # title, author, themes come from the book's own row (lookup elided)
    # Build a short embedding query from title + top themes
    query = f"{title} by {author}. {', '.join(themes[:5])}"
    query_bytes = embedding_to_bytes(embed_text(query))

    # 3 fast signals only
    book_direct = _semantic_search_books_direct(conn, query_bytes)  # 100 rows
    theme_books = _semantic_search_book_themes(conn, query_bytes)   # ~2.5K rows
    fts_books = _fts_search_books(conn, title)                      # FTS on 100 books

    return rrf_fuse([book_direct, theme_books, fts_books])[:limit]

Three signals over small tables: 50 milliseconds total.

The quality is arguably better than the original. The old approach produced noisy results because the over-long FTS query matched too broadly — everything is a partial match when your query is 50+ tokens. The new approach uses a short, focused query that captures the book’s identity without the noise.

Summary

| Optimization | Before | After | Improvement |
|---|---|---|---|
| Embedding model (ONNX) | ~500 MB RSS | ~200 MB RSS | 60% less memory |
| Search (drop topic vectors) | ~3.5s | ~0.8s | 4.4x faster |
| Related books (3 signals) | ~19s | ~0.05s | 380x faster |

The whole thing runs on a $7/month server. The lesson is straightforward: profile before optimizing, and don’t assume that more signals always means better results. One redundant signal was costing 2.7 seconds per query, and the related books endpoint was doing 20x more work than necessary by reusing a general-purpose function for a specific task.

You can try the result at libtrails.app — search for anything and it should feel responsive despite running on a single-core t3.micro with 1 GB of RAM.

