When I first ran the Leiden algorithm on LibTrails’ 121,000 extracted topics, it produced 2,468 clusters. Clean, mathematically sound groupings. And completely useless for browsing.
The problem wasn’t the algorithm — it was the mismatch between what graph partitioning optimizes for and what a user actually wants to explore. This post is about the journey from those 2,468 clusters to a three-tier hierarchy that actually makes sense when you click around libtrails.app.
The Starting Point: 2,468 Clusters
LibTrails extracts topics from 100 Project Gutenberg classics using a two-pass LLM pipeline (Gemma 27B for book-level themes, then 12B for chunk-level topics). The result is 121,118 deduplicated topics with 384-dimensional embeddings.
To find structure in that sea of topics, I built a KNN graph — each topic connected to its nearest neighbors by cosine similarity — and ran Leiden with CPM (Constant Potts Model) at resolution 0.001. The result: 2,468 clusters where tightly related topics land together. “Categorical imperative,” “moral duty,” and “Kantian ethics” in one cluster. “Photosynthesis,” “chloroplast structure,” and “plant energy conversion” in another.
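A brute-force sketch of that KNN step, assuming the embeddings sit in a NumPy array (at 121K topics the real pipeline would need an approximate-nearest-neighbor index; this O(n²) version is for illustration only, and `k` here is tiny for the toy data):

```python
import numpy as np

def knn_edges(embeddings: np.ndarray, k: int = 4) -> list[tuple[int, int]]:
    """Build KNN edges by cosine similarity: each node links to its k nearest neighbors."""
    # Normalize rows so plain dot products become cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-loops
    edges = set()
    for i in range(len(sims)):
        for j in np.argsort(sims[i])[-k:]:
            edges.add((min(i, int(j)), max(i, int(j))))  # undirected, deduplicated
    return sorted(edges)

# Toy example: two tight pairs in 2-D embedding space.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(knn_edges(emb, k=1))  # → [(0, 1), (2, 3)]
```

The resulting edge list is what gets handed to the community-detection step (in this pipeline, an igraph graph passed to leidenalg).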
At the cluster level, the groupings are excellent. But showing a user 2,468 clusters is like handing them an index instead of a table of contents.
Problem 1: Binary Membership Breaks Everything
Before tackling cluster granularity, I hit a more fundamental issue. I’d grouped the 2,468 clusters into 26 high-level themes (domains) using K-means on cluster centroids plus LLM-generated labels. The Themes page should show each theme with its most representative books.
Instead, every theme showed nearly every book.
A typical book has ~1,000 topic mentions scattered across hundreds of clusters. With binary “any topic link” membership, a book that has one topic in the “Russian Literature” cluster counts as a Russian Literature book. With 26 themes covering those clusters, every book ends up touching every theme. “On Cooking” appeared as Russian Literature. So did every fantasy novel.
The Fix: Concentration Scoring
The solution is weighted membership. Instead of “does this book have any topic in this theme?”, ask “what fraction of this book’s topics are in this theme?”
I built a book_domains bridge table that materializes each book’s concentration per theme — pre-computed during stats refresh, not on every API request. The scoring uses BM25 saturation combined with PPMI (Positive Pointwise Mutual Information) to weight topic relevance, then marks each book’s highest-scoring theme as its “primary” domain.
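The exact scoring formula isn't shown here, so the sketch below reconstructs the three ingredients separately: raw concentration as a fraction, a BM25-style saturation term, and a PPMI term. The `k1` constant is an assumption, and how the real scorer combines the pieces is not specified:

```python
import math

def concentration(book_theme_counts: dict[str, int]) -> dict[str, float]:
    """Fraction of a book's topic mentions landing in each theme."""
    total = sum(book_theme_counts.values())
    return {t: c / total for t, c in book_theme_counts.items()}

def bm25_saturation(count: float, k1: float = 1.2) -> float:
    """Diminishing returns: the 500th mention adds far less than the 5th."""
    return count / (count + k1)

def ppmi(joint: float, px: float, py: float) -> float:
    """Positive pointwise mutual information of a (book, theme) pair."""
    if joint == 0:
        return 0.0
    return max(0.0, math.log2(joint / (px * py)))

# A cookbook with 90 of 100 topic mentions in "Cooking", 10 in "Russian Literature".
conc = concentration({"Cooking": 90, "Russian Literature": 10})
primary = max(conc, key=conc.get)
print(conc, primary)
```

Whatever the weighting details, the decisive move is the same: a book's membership in a theme is graded by how much of the book points there, and the top-scoring theme becomes its primary domain.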
The difference was dramatic:
| Method | Avg books/theme | Min | Max |
|---|---|---|---|
| Binary (old) | 859 of 937 | 426 | 936 |
| Weighted ≥1% | 656 | 57 | 911 |
| Primary only | 32 | 1 | 147 |
That Russian Literature theme? Went from 426 books to 7 primary books — Chekhov, Grossman, Pasternak. The books that are actually about Russian literature.
Problem 2: Clusters Are Too Granular to Browse
With weighted membership solved at the theme level, I looked at the layer below: can users browse the 2,468 clusters directly?
The answer was no. Cluster-level concentration is almost nonexistent — 96% of clusters have zero books with even 5% concentration. A typical book’s topics are so spread across clusters that no single cluster captures a meaningful chunk of any book’s content. And asking users to navigate 2,468 items defeats the purpose of organizing things in the first place.
But the clusters are valuable — just not for browsing. When you search LibTrails for “Russian revolution,” the search system finds matching topics and groups them by cluster. The cluster provides the semantic neighborhood: “these 80 related topics all cluster together — Bolshevik movement, Tsarist collapse, revolutionary ideology.” That requires fine granularity.
What I needed was a middle layer: coarser than clusters, finer than themes, designed for browsing.
The Three-Tier Hierarchy
The solution runs Leiden at two different resolutions on the same topic KNN graph:
```
Topics (121,118)
  → KNN graph (k=4, embedding similarity edges)
  → Leiden CPM γ=0.001    → 2,468 clusters (search neighborhoods)
  → Leiden CPM γ=4.86e-5  →   202 communities (Topics page, Universe)
  → K-means on centroids  →    26 themes (Themes page)
```
| Tier | Count | What it does | User sees? |
|---|---|---|---|
| Themes | 26 | Broad theme browsing | Themes page |
| Communities | 202 | Specific topic exploration | Topics page, Universe 3D view |
| Clusters | 2,468 | Search neighborhoods | Search results only |
Each topic gets two assignments — a fine-grained cluster_id for search and a coarser community_id for browsing. Communities inherit theme assignments through majority vote: each community joins whichever theme most of its topics belong to.
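The majority-vote inheritance takes a few lines (a sketch, assuming each topic already carries a theme through its cluster's K-means assignment):

```python
from collections import Counter

def inherit_theme(topic_themes: list[str]) -> str:
    """A community joins whichever theme most of its topics belong to."""
    return Counter(topic_themes).most_common(1)[0][0]

# Community whose topics mostly carry the "Ethics" theme via their clusters.
print(inherit_theme(["Ethics", "Ethics", "Religion", "Ethics"]))  # → Ethics
```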
Finding the Right Resolution
The resolution parameter γ in CPM controls how fine-grained the communities are. Too high and you’re back to thousands of groups. Too low and you get a few giant blobs.
Sweeping the Parameter Space
I ran a 25-point logarithmic sweep from γ=1e-6 to 5e-4 at different KNN densities (k=2 through k=10). Each configuration produces a different partition — different community count, different max size, different quality score.
The key metrics to balance:
- Community count: ~100-250 (browsable without overwhelming)
- Max community size: Small enough to be coherent
- Modularity Q: Higher is better (stronger community structure)
- NMI stability: Partition shouldn’t change drastically with small parameter shifts
For the 100-book demo, k=4 with γ=4.86e-5 landed in the sweet spot: ~100 first-pass communities, good quality score, stable partitions.
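NMI between partitions at neighboring sweep points is the stability metric above. A self-contained version (the real sweep presumably used a library implementation such as scikit-learn's `normalized_mutual_info_score`):

```python
import numpy as np

def nmi(a, b) -> float:
    """Normalized mutual information between two labelings of the same nodes."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    cont = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(cont, (ai, bi), 1)          # joint counts
    pxy = cont / n
    px = pxy.sum(axis=1, keepdims=True)   # marginals
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    mi = float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
    hx = float(-(px[px > 0] * np.log(px[px > 0])).sum())
    hy = float(-(py[py > 0] * np.log(py[py > 0])).sum())
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 1.0

# NMI is invariant to relabeling: swapped community ids still score 1.0.
print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 3))  # → 1.0
```

A partition is considered stable when NMI against the partitions at adjacent γ values stays high; a sharp NMI drop flags a resolution where the community structure reorganizes.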
Two-Pass Splitting
One-pass Leiden at γ=4.86e-5 produces 101 communities — but some are massive. 48 communities exceeded 1,000 topics (a threshold I chose by proportional reasoning: 1,000/121K ≈ 0.8% of the corpus per community).
The second pass re-runs Leiden at 2× resolution (γ=9.72e-5) on just the oversized communities, with a hard cap of 1,000 topics. This uses leidenalg’s built-in max_comm_size parameter — a single pass with a constraint, not a recursive split.
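The control flow of the two passes, with the Leiden call abstracted behind a `resplit` callback (a sketch; in the real pipeline `resplit` would be `leidenalg.find_partition` at the doubled resolution with `max_comm_size=1000`, while the dummy below just halves):

```python
def two_pass_split(communities: dict[int, list[int]], max_size: int, resplit) -> dict[int, list[int]]:
    """Keep communities within max_size; re-partition only the oversized ones."""
    out, next_id = {}, 0
    for members in communities.values():
        parts = resplit(members) if len(members) > max_size else [members]
        for part in parts:
            out[next_id] = part
            next_id += 1
    return out

# Dummy resplit for illustration: halve an oversized community.
halve = lambda m: [m[:len(m) // 2], m[len(m) // 2:]]
first_pass = {0: list(range(10)), 1: list(range(10, 13))}
second = two_pass_split(first_pass, max_size=5, resplit=halve)
print({k: len(v) for k, v in second.items()})  # → {0: 5, 1: 5, 2: 3}
```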
```
Pass 1: 101 communities (CPM γ=4.86e-5)
  → 48 exceeded threshold of 1,000 topics
  → re-split at CPM γ=9.72e-5 with max_comm_size=1,000
Pass 2: 238 communities (split ratio 2.4×)
```
Naming 238 Communities
Every community needs a human-readable label. I fed the top topics from each community to gemma-3-27b-it running on an RTX 3090 via LM Studio — 8 parallel workers, ~2.4 names/second, 238 names generated in 97 seconds with zero failures.
The LLM produces names like “Kantian Ethics & Moral Philosophy,” “Victorian Social Norms,” “Trojan War & Greek Heroism.” Descriptive enough to browse, specific enough to be useful.
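A sketch of the fan-out, with the model call injected as `name_fn` (the real `name_fn` would POST each community's top topics to LM Studio's local OpenAI-compatible endpoint; the dummy here just joins topic strings):

```python
from concurrent.futures import ThreadPoolExecutor

def name_communities(top_topics: dict[int, list[str]], name_fn, workers: int = 8) -> dict[int, str]:
    """Label every community in parallel; name_fn maps a topic list to a label."""
    ids = list(top_topics)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        labels = pool.map(name_fn, (top_topics[i] for i in ids))
    return dict(zip(ids, labels))

# Dummy namer for illustration only.
dummy = lambda topics: " & ".join(topics[:2])
labels = name_communities({0: ["Kantian ethics", "moral duty"], 1: ["photosynthesis"]}, dummy)
print(labels)  # → {0: 'Kantian ethics & moral duty', 1: 'photosynthesis'}
```

Threads (rather than processes) fit here because the workers spend their time waiting on HTTP responses from the local model server.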
Cleanup: From 238 to 199
Raw Leiden output always needs curation. Some communities are noise (a single topic extracted incorrectly), others are too small to be useful on their own, and a few don’t land in the right theme.
Deleted 17 singletons — communities with exactly 1 topic, artifacts of extraction noise. Topics like “Ivar’s Birth & Lineage” or “Caius Blosius Studies” that didn’t connect to anything.
Absorbed 22 tiny communities (2-19 topics) into their nearest large neighbor by centroid cosine similarity. “Don Gregorio & Luis” (9 topics) merged into “Quixotic Chivalry & Satire” (similarity 0.795). “Renfield’s Psychological Profile” (11 topics) merged into “Dracula & Victorian Horror” (similarity 0.799). Each merge was validated by checking that the centroid similarity was high enough for the topics to genuinely belong together.
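The nearest-neighbor lookup behind those merges, sketched with toy 2-D centroids (the real centroids are 384-dimensional):

```python
import numpy as np

def absorb(tiny_centroid: np.ndarray, large_centroids: dict[str, np.ndarray]) -> tuple[str, float]:
    """Find the large community whose centroid is most cosine-similar to a tiny one."""
    best, best_sim = None, -1.0
    for name, c in large_centroids.items():
        sim = float(tiny_centroid @ c / (np.linalg.norm(tiny_centroid) * np.linalg.norm(c)))
        if sim > best_sim:
            best, best_sim = name, sim
    return best, best_sim

target, sim = absorb(np.array([1.0, 0.2]),
                     {"A": np.array([0.9, 0.1]), "B": np.array([0.0, 1.0])})
print(target, round(sim, 3))
```

In the real cleanup, a merge only goes through if this best similarity clears a sanity threshold; otherwise the tiny community would be glued onto a neighbor it doesn't belong with.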
Assigned 6 orphaned communities to themes via centroid similarity. Most were straightforward — “Russian Aristocratic Psychology & Faith” landed in “Psychological Realism in Literature” (similarity 0.942).
Applied 3 manual overrides where the algorithm’s suggestion was thematically wrong despite reasonable similarity scores. “Jordan Baker & Social Commentary” (5 topics) clearly belongs with “Gatsby & Roaring Twenties Society” even though its cosine similarity to another option was higher.
Structural Validation
After cleanup, I ran structural quality checks:
KNN edge retention: 79.5% of edges in the KNN graph stay within their community. This means most of each topic’s nearest neighbors are in the same community — the communities respect the underlying similarity structure.
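The retention metric itself is a small function over the KNN edge list (a sketch with assumed data shapes):

```python
def edge_retention(edges: list[tuple[int, int]], community: dict[int, int]) -> float:
    """Fraction of KNN edges whose two endpoints share a community."""
    kept = sum(1 for u, v in edges if community[u] == community[v])
    return kept / len(edges)

# 3 of 4 edges stay inside a community.
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
print(edge_retention([(0, 1), (1, 2), (3, 4), (2, 3)], comm))  # → 0.75
```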
Cluster exclusivity: 100% — all 2,442 clusters map to exactly one community. No cluster is split across communities.
Book overlap: Zero same-theme community pairs with Jaccard similarity above 0.50. Communities within the same theme represent genuinely different facets, not redundant groupings.
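The overlap check reduces to Jaccard similarity over each community's book set (book titles here are illustrative):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two communities' book sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Two same-theme communities sharing 1 of 5 distinct books.
print(jaccard({"Anna Karenina", "War and Peace", "Dead Souls"},
              {"Dead Souls", "Oblomov", "Fathers and Sons"}))  # → 0.2
```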
Centroid similarity: 66 same-theme community pairs have centroid cosine above 0.95 — semantically close but representing different aspects. “Trojan War & Greek Heroism” (0.974 similarity with “Trojan War & Aftermath”) covers the war itself while the other covers its consequences.
What I Learned
Leiden gives you mathematically correct partitions, not browsable ones. The algorithm optimizes for graph modularity. Users need balanced, well-labeled groups of things they can click through. Bridging the gap requires parameter sweeping, multi-pass splitting, and manual curation.
Binary membership is a trap. The moment you have more than a handful of categories and items with many tags, “any link” membership will put everything everywhere. Concentration scoring — asking what fraction of an item’s connections land in a category — is the baseline fix.
The right number of tiers depends on the data. 121K topics need three tiers: fine-grained for search, medium for exploration, coarse for orientation. Two tiers (just clusters and themes) left a usability gap. Four would be overkill for 100 books.
LLMs are excellent community labelers. Feeding the top topics from each community to a 27B model produces surprisingly good names at ~2.4 names/second. The manual override rate was 3 out of 238 — barely 1%.
You can explore the result at libtrails.app — the Topics page shows all 202 communities, the Universe page renders them as a 3D galaxy, and the Themes page groups them into 26 browsable themes.