It started with a practical question: why does the Leiden algorithm cluster my book topics the way it does?
I’d been building LibTrails — a semantic book discovery app that extracts 121,000+ topics from 100 classic books, embeds them as vectors, and uses the Leiden community detection algorithm to group them into ~2,400 thematic clusters. The system works. The clusters make sense. But I realized I was treating Leiden as a black box, and I wanted to understand the math underneath.
So I went looking for a map of the territory.
The Survey Paper That Started Everything
I found “Community detection in graphs” by Santo Fortunato (arXiv:0906.0612v2) — a 100+ page survey from 2010 that’s become the standard reference for the field. It covers everything: modularity, spectral methods, label propagation, overlapping communities, benchmarks, and the resolution limit problem that makes naive modularity optimization dangerous.
The very first sentence of the paper sent me on a detour:
“The modern science of networks has brought significant advances to our understanding of complex systems. One of the most relevant features of graphs representing real systems is community structure…”
The paper traces the origin of network science back to Euler and the Königsberg bridge problem of 1736. I’d heard of it before, but I’d never actually worked through the proof. So I used Claude to generate an interactive primer.
Euler’s Bridges: Where Graph Theory Begins
The problem is elegant: the city of Königsberg had four landmasses connected by seven bridges. Can you walk across every bridge exactly once and return to where you started?
Euler proved you can’t — and in doing so, invented graph theory. His key insight was abstraction: ignore the geometry of the city and reduce it to vertices (landmasses) and edges (bridges). Then the question becomes purely about the degree of each vertex.
The rule: an Eulerian circuit (crossing every edge exactly once, returning to the start) requires the graph to be connected and every vertex to have even degree. Königsberg’s graph is connected, but all four of its vertices have odd degree. The walk is impossible.
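The degree argument fits in a few lines of plain Python. Here’s a sketch that models the seven bridges as a multigraph edge list (landmasses labeled A–D, with A as the central island) and counts degrees:

```python
from collections import Counter

# Königsberg as a multigraph: four landmasses, seven bridges.
# A is the central island (Kneiphof), connected by five bridges.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd_vertices = [node for node, d in degree.items() if d % 2 == 1]
print(dict(degree))    # A has degree 5; B, C, D each have degree 3
print(odd_vertices)    # all four vertices are odd -> no Eulerian circuit
```

Geometry never enters the computation — once the city is reduced to an edge list, the impossibility falls out of a parity check.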
I had Claude generate a visual primer showing both the geographic layout and Euler’s abstract graph, with the degree analysis that proves impossibility.
View the Königsberg Bridge Problem primer →
It’s a small thing, but working through this made the rest of the survey paper click differently. Graph theory isn’t just a toolkit — it’s a way of seeing structure where you’d otherwise see geography, social networks, or in my case, a pile of book topics.
Building a Study Plan
With the Fortunato survey as my anchor text, I used ChatGPT to help me build a structured self-study plan — a graduate-level sequence for learning graph/network fundamentals and modern community detection, with an applied track for LibTrails.
The plan has 9 phases, each with readings (arXiv papers), key concepts to extract, coding checkpoints, and notes on how each topic applies to LibTrails specifically:
| Phase | Topic | Key Paper |
|---|---|---|
| 0 | Community detection mental model | Fortunato (survey) |
| 1 | Modularity fundamentals | Newman |
| 2 | Resolution limit & gamma sweeps | Fortunato & Barthélemy |
| 3 | Louvain mechanics | Blondel et al. |
| 4 | Leiden guarantees | Traag, Waltman & van Eck |
| 5 | Constant Potts Model (CPM) | Traag, Van Dooren & Nesterov |
| 6 | Flow-based communities (Infomap) | Rosvall & Bergstrom |
| 7 | Stochastic Block Models | Karrer & Newman |
| 8 | Overlap & validation (LFR benchmarks) | Ahn et al., Lancichinetti et al. |
The recommended pacing is ~8 weeks at 4–6 focused hours per week, though the fast path (phases 0–5 to get the Leiden + CPM core) is more like 5 weeks.
Why This Matters for LibTrails
LibTrails currently uses Leiden with modularity optimization — the standard default. But the study plan is already reshaping how I think about the pipeline:
The resolution limit is real. Modularity maximization can merge small, meaningful communities into larger ones. For a library app, this means niche topics (say, “Stoic meditation practices”) might get absorbed into a giant “Philosophy” cluster. Gamma sweeps and CPM (phase 5) offer a way to expose multiple resolution scales — micro, meso, macro — instead of forcing a single partition.
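To make the resolution idea concrete, here’s a toy sketch of the CPM quality function itself (not LibTrails code — a hand-rolled illustration). CPM scores a partition as the sum over communities of internal edges minus gamma times the possible internal pairs, so the same graph can prefer coarse or fine partitions depending on gamma:

```python
def cpm_quality(edges, partition, gamma):
    """Constant Potts Model quality: for each community,
    (internal edges) - gamma * (possible internal node pairs)."""
    communities = {}
    for node, comm in partition.items():
        communities.setdefault(comm, set()).add(node)
    q = 0.0
    for members in communities.values():
        internal = sum(1 for u, v in edges if u in members and v in members)
        n = len(members)
        q += internal - gamma * n * (n - 1) / 2
    return q

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
coarse = {i: "all" for i in range(6)}                      # one big cluster
fine = {0: "t1", 1: "t1", 2: "t1", 3: "t2", 4: "t2", 5: "t2"}  # two triangles

for gamma in (0.05, 0.5):
    print(gamma, cpm_quality(edges, coarse, gamma), cpm_quality(edges, fine, gamma))
```

At gamma = 0.05 the merged partition scores higher (6.25 vs 5.7); at gamma = 0.5 the two triangles win (3.0 vs −0.5). That crossover is exactly what a gamma sweep surfaces: micro, meso, and macro views of the same graph.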
Stability matters more than modularity score. A partition that shifts dramatically when you change the random seed or remove 1% of edges isn’t trustworthy. The study plan emphasizes running multi-resolution sweeps and tracking stability (NMI/VI across seeds) before shipping any clustering to users.
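NMI is simple enough to compute by hand, which helps demystify the stability check. A minimal implementation (square-root normalization; label lists as input — in practice you’d likely reach for `sklearn.metrics.normalized_mutual_info_score` instead):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Normalized mutual information between two partitions,
    given as parallel lists of cluster labels."""
    n = len(a)
    joint = Counter(zip(a, b))
    ca, cb = Counter(a), Counter(b)
    mi = sum((nij / n) * math.log((nij * n) / (ca[x] * cb[y]))
             for (x, y), nij in joint.items())
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)

# Same clustering under different label names still scores 1.0 --
# NMI compares the grouping, not the labels.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(nmi([0, 0, 1, 1], [0, 1, 1, 1]))  # < 1.0: one node moved
```

Running the same Leiden partition across several seeds and averaging pairwise NMI gives a single stability number; a score well below 1.0 is a warning that the partition depends on luck rather than structure.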
Overlap is natural in libraries. A book about the French Revolution belongs in both “Politics & Governance” and “Heroic & Military Epic.” Link communities and soft assignment (phase 8) could enable richer browsing — books appearing in multiple regions, chunks bridging themes.
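One lightweight route to overlap, sketched here with made-up 2-D vectors and cluster names (the real embeddings are high-dimensional and the centroids would come from the Leiden output): assign a book to every cluster whose centroid clears a similarity threshold, rather than only the nearest one.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def soft_assign(book_vec, centroids, threshold=0.75):
    """Return every cluster similar enough to the book -- overlap allowed."""
    return [name for name, c in centroids.items()
            if cosine(book_vec, c) >= threshold]

centroids = {                      # hypothetical cluster centroids
    "politics": [1.0, 0.0],
    "epic":     [0.7, 0.7],
    "botany":   [0.0, 1.0],
}
french_revolution = [0.8, 0.6]     # hypothetical book embedding

print(soft_assign(french_revolution, centroids))
# the book lands in both "politics" and "epic"
```

The threshold becomes a product knob: raise it for crisp shelves, lower it for richer cross-theme browsing.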
Learning in Public
I’m sharing this because the “learning in public” approach has been one of the most rewarding parts of building LibTrails. The project is as much about the journey as the destination — using AI tools to accelerate understanding, generating study materials, and then applying what I learn back to the product.
If you’re working with embeddings, similarity graphs, or clustering in any domain, the Fortunato survey is the place to start. It’s dense but well-written, and every section connects to practical decisions you’ll face when shipping community detection in production.
The arXiv papers are all free. The tools are open source (NetworkX, leidenalg, igraph). The only cost is time and curiosity.
Resources:
- Königsberg Bridge Problem primer — interactive visual explainer
- Graduate Self-Study Plan — 9-phase curriculum with arXiv readings and coding checkpoints
- Fortunato, “Community detection in graphs” (2010) — the survey paper that anchors the study
- LibTrails — see the Leiden clustering in action