LibTrails ran fine on my MacBook Pro. Every search returned in ~200ms, related books loaded instantly, the whole thing felt snappy. Then I deployed it to a $7/month AWS Lightsail instance with 1 GB of RAM, and the latency crept up.
Search took 3.5 seconds. Related books took 19 seconds. Everything worked, but it didn’t feel like a site you’d want to explore — too much delay between typing a query and seeing results. I wanted semantic search to feel instant, even on a tiny server. This post covers the three optimizations that got it there: replacing PyTorch with ONNX Runtime, pruning a redundant search signal, and rewriting the related books query.
## The Constraint: 1 GB RAM, Single vCPU
The LibTrails demo runs on a Lightsail t3.micro — $7/month, 1 GB RAM, 1 vCPU, 1 GB swap. It hosts both the FastAPI backend and the Astro SSR frontend behind a Caddy reverse proxy. Every byte of RAM matters.
The 100-book demo database contains:
- 121,118 topic embeddings (384 dimensions each)
- 31,849 chunk embeddings
- ~2,500 theme embeddings
- 100 whole-book embeddings
- FTS5 indexes over topics, chunks, and book metadata
All stored in a single ~1 GB SQLite file with sqlite-vec for vector search and FTS5 for keyword search.
## Problem 1: The Embedding Model Was Too Heavy
LibTrails uses BGE-small-en-v1.5 to embed search queries at request time. The standard approach loads it via sentence-transformers, which depends on PyTorch — a ~200 MB library designed for GPU training that we don’t need at all for inference. We just need a single forward pass through a small transformer to produce a 384-dimensional vector.
On a 1 GB instance, the numbers don’t work:
| Metric | sentence-transformers | Target |
|---|---|---|
| Dependencies | ~300 MB (PyTorch + friends) | As small as possible |
| Process RSS | ~500 MB | Under 300 MB |
| Model load time | ~3s | Under 1s |
| Embedding latency | ~15ms/query | Similar |
| Swap pressure | Frequent | None |
### The Fix: ONNX Runtime
ONNX Runtime is Microsoft’s optimized inference engine. It loads a pre-exported .onnx model file and runs inference with no PyTorch, no CUDA toolkit, no autograd. The relevant dependencies:
- `onnxruntime` (~15 MB)
- `tokenizers` (~6 MB, HuggingFace's Rust-based tokenizer)

Total: ~20 MB vs. ~300 MB.
### Exporting the Model
A one-time export converts the sentence-transformers model to ONNX format using HuggingFace’s optimum library:
```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model = ORTModelForFeatureExtraction.from_pretrained(model_path, export=True)
model.save_pretrained(onnx_output_dir)

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.save_pretrained(onnx_output_dir)
```
The exported model is ~33 MB. It gets SCP’d to the server alongside the tokenizer files.
### Dual Backend Architecture
The embedding module auto-detects which backend to use:
- If `models/bge-small-onnx/model.onnx` exists and `onnxruntime` is installed → use ONNX
- Otherwise → fall back to sentence-transformers
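The detection logic can be sketched as a small function; the path constant and function name here are illustrative, not the real module's:

```python
import importlib.util
from pathlib import Path

# Hypothetical sketch of the backend auto-detection described above;
# the real module's path and naming may differ.
ONNX_MODEL_PATH = Path("models/bge-small-onnx/model.onnx")

def pick_backend() -> str:
    """Prefer ONNX when both the exported model and the runtime are present."""
    has_runtime = importlib.util.find_spec("onnxruntime") is not None
    if ONNX_MODEL_PATH.exists() and has_runtime:
        return "onnx"
    return "sentence-transformers"
```

Because the check is a cheap filesystem-and-import probe, a dev machine without the exported model transparently falls back to sentence-transformers.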
Both backends produce identical embeddings — I verified this with a cosine similarity check that returned exactly 1.000000 between outputs. The ONNX path uses CLS pooling with L2 normalization, matching BGE’s default strategy:
```python
def _onnx_encode(texts: list[str]) -> np.ndarray:
    encodings = _onnx_tokenizer.encode_batch(texts)
    input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
    attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
    token_type_ids = np.array([e.type_ids for e in encodings], dtype=np.int64)

    outputs = _onnx_session.run(None, {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    })

    # CLS pooling (index 0)
    cls_embeddings = outputs[0][:, 0, :]

    # L2 normalize
    norms = np.linalg.norm(cls_embeddings, axis=1, keepdims=True)
    return cls_embeddings / np.maximum(norms, 1e-12)
```
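The parity check between the two backends boils down to a cosine similarity between their output vectors. A minimal NumPy version of that check, on dummy vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical embeddings score exactly 1.0
v = np.array([0.1, -0.4, 0.25])
print(round(cosine_similarity(v, v.copy()), 6))  # → 1.0
```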
The server uses ONNX for all request-time embeddings. sentence-transformers stays available for bulk operations (indexing on a dev machine) without overriding the active backend.
### Impact
| Metric | sentence-transformers | ONNX Runtime |
|---|---|---|
| Process RSS | ~500 MB | ~200 MB |
| Model load time | ~3s | ~0.5s |
| Embedding latency | ~15ms/query | ~12ms/query |
| PyTorch required | Yes (~200 MB) | No |
| Swap pressure (1 GB RAM) | Frequent | None |
The server now starts in under a second and stays well within the 1 GB memory budget.
## Problem 2: The 121K Vector Scan
LibTrails’ hybrid search fuses multiple retrieval signals via Reciprocal Rank Fusion. The original architecture used 7 signals for book search — three keyword (FTS5) and four semantic (sqlite-vec):
- FTS5 book metadata
- FTS5 topic labels
- FTS5 chunk text
- Semantic topic vectors (121K embeddings)
- Semantic book theme vectors (~2.5K embeddings)
- Semantic book vectors (100 embeddings)
- Semantic chunk vectors (~32K embeddings)
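Reciprocal Rank Fusion itself is simple: each result earns a score of 1/(k + rank) from every list it appears in, and the scores are summed. A minimal sketch (the real implementation's constant and any per-signal weights may differ):

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists; each item scores sum of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Items appearing high in multiple lists rise to the top
print(rrf_fuse([["a", "b", "c"], ["b", "c"]]))  # → ['b', 'c', 'a']
```

The appeal of RRF is that it only consumes ranks, so keyword and vector signals fuse without any score calibration between them.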
Profiling on the server revealed that signal 4 — the topic vector scan — took 2.7 seconds per query. sqlite-vec performs exact KNN; it computes cosine distance against all 121,000 topic vectors sequentially. On a single-core t3.micro, that’s a lot of floating-point math.
The other vector signals were fast because they searched much smaller tables:
| Table | Rows | Scan time |
|---|---|---|
| topic_vectors | 121,118 | 2.7s |
| chunk_vectors | 31,849 | ~0.5s |
| book_theme_vectors | ~2,500 | ~0.01s |
| book_vectors | 100 | ~0.005s |
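Exact KNN is conceptually just this scan: one dot product per stored vector. A NumPy sketch on a smaller table (10K rows here, versus 121K in production) shows why cost grows linearly with row count:

```python
import numpy as np

def exact_knn(query: np.ndarray, vectors: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact top-k by cosine similarity. Assumes rows are L2-normalized,
    so cosine reduces to a dot product: O(rows * dims) work per query."""
    sims = vectors @ query           # one pass over every stored vector
    return np.argsort(-sims)[:k]     # indices of the k most similar rows

rng = np.random.default_rng(0)
vecs = rng.standard_normal((10_000, 384)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

print(exact_knn(vecs[0], vecs)[0])  # the query's own row ranks first → 0
```

There is no index to prune the search, so a 121K-row table simply costs ~12x what a 10K-row table does on the same core.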
### Was the Signal Even Necessary?
The chunk vectors (31K rows, ~0.5s) already provide equivalent semantic coverage. Every chunk that mentions a concept has a nearby embedding, and the chunk→book mapping captures the same books that topic→book would. The topic vector signal was redundant with the chunk signal — but 5x more expensive.
I ran diagnostic queries across six query types (dystopia, machine learning, cooking/food, quantum physics, dragons/magic, stoicism) comparing results with and without the topic vector signal. No measurable difference in the top-10 results for any query.
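The comparison reduces to measuring overlap between two top-10 lists; a tiny helper of the sort used for such a check (the function name and sample IDs are illustrative):

```python
def top_k_agreement(with_signal: list[str], without_signal: list[str], k: int = 10) -> float:
    """Fraction of the top-k results shared between two rankings."""
    return len(set(with_signal[:k]) & set(without_signal[:k])) / k

# Same books, slightly reordered: full agreement on membership
a = ["dune", "1984", "brave-new-world", "fahrenheit-451"]
b = ["1984", "dune", "brave-new-world", "fahrenheit-451"]
print(top_k_agreement(a, b, k=4))  # → 1.0
```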
### The Fix: Drop to 6 Signals
Removing the topic vector scan from both search paths:
- Book search: 7 signals → 6
- Cluster search: 5 signals → 4
Total search time dropped from ~3.5s to ~0.8s. The chunk and theme vectors already cover the same semantic ground — every concept that exists as a topic also exists in the text chunks that spawned it.
If I move to a larger server or add approximate nearest neighbor (ANN) indexing, I could add the signal back. But for a $7/month instance, the tradeoff is clear.
## Problem 3: Related Books Took 19 Seconds
The find_related_books endpoint powers the “Related Books” section on every book detail page. The original implementation called the full hybrid_search_books function with a query constructed from the book’s title, author, and all themes — a 373-character query with 50+ tokens.
This was catastrophic:
- Every FTS5 query matched nearly everything (the query was too broad)
- The topic vector scan ran on top of that (2.7s)
- The chunk vector scan ran on all 32K embeddings (0.5s)
- All seven signals fired on an over-broad query
- Total: 19 seconds per request
### The Rewrite: 3 Fast Signals
Related books don’t need the full search pipeline. We already know which book we’re looking at — we just need to find similar ones. The rewrite uses only the fast vector signals over small tables:
```python
def find_related_books(conn, book_id, limit=12):
    # title, author, themes come from the book's own row (lookup elided here)
    # Build a short embedding query from title + top themes
    query = f"{title} by {author}. {', '.join(themes[:5])}"
    query_bytes = embedding_to_bytes(embed_text(query))

    # 3 fast signals only
    book_direct = _semantic_search_books_direct(conn, query_bytes)  # 100 rows
    theme_books = _semantic_search_book_themes(conn, query_bytes)   # ~2.5K rows
    fts_books = _fts_search_books(conn, title)                      # FTS on 100 books
    return rrf_fuse([book_direct, theme_books, fts_books])
```
Three signals over small tables: 50 milliseconds total.
The quality is arguably better than the original. The old approach produced noisy results because the over-long FTS query matched too broadly — everything is a partial match when your query is 50+ tokens. The new approach uses a short, focused query that captures the book’s identity without the noise.
## Summary
| Optimization | Before | After | Improvement |
|---|---|---|---|
| Embedding model (ONNX) | ~500 MB RSS | ~200 MB RSS | 60% less memory |
| Search (drop topic vectors) | ~3.5s | ~0.8s | 4.4x faster |
| Related books (3 signals) | ~19s | ~0.05s | 380x faster |
The whole thing runs on a $7/month server. The lesson is straightforward: profile before optimizing, and don't assume that more signals always mean better results. One redundant signal was costing 2.7 seconds per query, and the related books endpoint was doing 20x more work than necessary by reusing a general-purpose function for a specific task.
You can try the result at libtrails.app — search for anything and it should feel responsive despite running on a single-core t3.micro with 1 GB of RAM.