Sean Bergman


Graph Networks, Community Detection, Louvain & Leiden — Applied to LibTrails Digital Libraries

Prepared for: Sean  |  Date: 2026-02-12


This plan turns our chat into a structured, graduate-level self-study sequence for learning graph/network fundamentals and modern community detection, with an applied track for building semantic community maps inside your LibTrails digital library pipeline (topic/chunk graphs, kNN similarity graphs, and navigation graphs).

What you'll be able to do by the end

  • Explain community detection objectives (modularity, CPM, Map Equation) and their assumptions.
  • Implement Louvain and Leiden on weighted graphs; interpret the role of the resolution parameter gamma.
  • Diagnose failure modes (resolution limit, disconnected communities, hubness in embedding graphs).
  • Design evaluation protocols (stability across seeds, NMI/VI, size distributions, synthetic benchmarks).
  • Ship an interpretable "library map" feature: multi-resolution thematic clusters + overlap handling + browse UX.

How to use this document

Each phase includes: (1) readings (with arXiv IDs), (2) the key ideas to extract, (3) a coding checkpoint, and (4) how it applies to LibTrails. If you do only one thing beyond Leiden: add CPM and do multi-resolution sweeps with stability evaluation.

Prerequisites and Setup

You can complete this plan with Python + NetworkX + (optional) igraph/leidenalg. For larger graphs, consider GPU acceleration via NVIDIA nx-cugraph.

  • Python stack: networkx, numpy, scipy, pandas, matplotlib; plus one of: (a) leidenalg + igraph, (b) graspologic (SBM), (c) infomap package.
  • Graph representations: store weighted edges; preserve node metadata (book_id, chunk_id, topic_id, embedding centroid, author, year).
  • Two graph types for LibTrails: (1) Similarity graphs (kNN on embeddings) and (2) Flow graphs (navigation/clickstream, citation, recommendation transitions).
  • Core practice: always run multi-resolution sweeps and stability checks; never trust a single partition.

Roadmap at a Glance

Phase | Goal | Key readings (arXiv) | Primary deliverable
0 | Community detection mental model + vocabulary | 0906.0612 | Notes: definitions, taxonomy, evaluation metrics
1 | Modularity fundamentals | cond-mat/0308217 | Implement modularity score + null model intuition
2 | Why gamma matters (resolution limit) | cond-mat/0603718 | Gamma sweep plan + expected effects
3 | Louvain mechanics | 0803.0476 | Run Louvain on your similarity graph; inspect pathologies
4 | Leiden guarantees + practice | 1810.08473 | Leiden baseline + stability across seeds
5 | CPM objective for similarity graphs | 1104.3083 | Leiden-CPM multi-resolution + interpretability checks
6 | Flow-based communities | 0906.1405 | Infomap on navigation/citation graphs; compare with Leiden
7 | Generative baseline (SBM) | 1008.3926 | Use SBM as a diagnostic: is structure real or artifact?
8 | Overlap + validation | 0903.3178, 0805.4770, 0709.2938 | Overlapping/edge communities + synthetic benchmark harness

Recommended pacing: ~8 weeks at 4-6 focused hours/week. To move faster, do phases 0-5 first (the Leiden+CPM core), then add flow-based communities (phase 6) once you have navigation data.

Phase 0 — Community Detection: What It Is (and Is Not)

Goal: Build a clean mental model: community definitions, taxonomy, evaluation, and why "the" correct partition rarely exists.

Readings

  • Fortunato, "Community detection in graphs" (arXiv:0906.0612)

What to extract

  • Different notions of community (density-based, flow-based, statistical/generative).
  • Hard partitions vs overlapping vs hierarchical communities.
  • Why evaluation is difficult without ground truth; which metrics are used anyway (NMI/VI, modularity, conductance).
  • Common failure modes: resolution limit, degeneracy (many near-optimal partitions), sensitivity to small perturbations.

Coding checkpoint

Write a one-page "community definition for LibTrails" that states: what nodes represent, what edges represent, and what a 'community' should mean for a user.

How this applies to LibTrails

LibTrails is inherently multi-theme. Treat community detection as a user-facing organizing lens, not as an objective truth. Your product goal is an interpretable, navigable map that remains stable under small changes.

Phase 1 — Modularity Fundamentals and the Null Model

Goal: Understand modularity as an objective and why its null model matters.

Readings

  • Newman, modularity and community structure (arXiv:cond-mat/0308217)

What to extract

  • Modularity as: observed intra-community edges minus expected intra-community edges under a degree-preserving null model.
  • The configuration model intuition: why degree distributions dominate expectations.
  • Weighted modularity: what changes when edges have weights.

Coding checkpoint

Implement modularity scoring for a partition of your weighted graph. Verify your score matches a library implementation on small graphs.
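
A minimal sketch of this checkpoint (assuming a recent NetworkX; the function name is illustrative). It scores a partition with the standard weighted-modularity formula, Q = Σ_c [L_c/m − (d_c/2m)²], and checks the result against the library implementation:

```python
import networkx as nx

def weighted_modularity(G, partition, weight="weight"):
    """Modularity of a partition (iterable of node sets) on a weighted graph."""
    m = G.size(weight=weight)            # total edge weight
    deg = dict(G.degree(weight=weight))  # weighted degrees
    Q = 0.0
    for comm in partition:
        comm = set(comm)
        # internal edge weight, each internal edge counted once
        l_c = sum(d.get(weight, 1) for u, v, d in G.edges(comm, data=True)
                  if u in comm and v in comm)
        d_c = sum(deg[n] for n in comm)  # total weighted degree of community nodes
        Q += l_c / m - (d_c / (2 * m)) ** 2
    return Q

# Sanity check against the library implementation on a small weighted graph
G = nx.karate_club_graph()
parts = nx.community.louvain_communities(G, seed=1)
assert abs(weighted_modularity(G, parts) - nx.community.modularity(G, parts)) < 1e-8
```

The (d_c/2m)² term is the degree-preserving null model in action: high-degree communities are expected to contain many edges by chance alone.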

How this applies to LibTrails

In embedding-derived kNN graphs, degree and weight distributions can be artifacts of geometry (hubness). Understanding the null model helps you avoid overclaiming semantic 'communities' that are actually degree effects.

Phase 2 — The Resolution Limit: Why Gamma Sweeps Are Mandatory

Goal: Learn why modularity can miss small communities and how resolution parameters control scale.

Readings

What to extract

  • Why modularity maximization can merge small real communities into larger ones.
  • How resolution interacts with graph size and total edge weight.
  • Practical implication: single-run clustering is not defensible.

Coding checkpoint

Design a gamma sweep grid and decide your tracking metrics: number of communities, size distribution, stability across random seeds, and a small sample of human-read 'cluster labels'.
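
One possible shape for the sweep harness, assuming NetworkX's built-in Louvain (names and the gamma grid are illustrative; swap in Leiden via leidenalg for production):

```python
import networkx as nx

def gamma_sweep(G, gammas, seeds=(0, 1, 2)):
    """For each resolution gamma, cluster under several seeds and record
    community counts and size distributions."""
    out = {}
    for g in gammas:
        runs = []
        for s in seeds:
            parts = nx.community.louvain_communities(G, resolution=g, seed=s)
            sizes = sorted((len(c) for c in parts), reverse=True)
            runs.append({"n_communities": len(parts), "sizes": sizes})
        out[g] = runs
    return out

G = nx.karate_club_graph()
sweep = gamma_sweep(G, gammas=[0.5, 1.0, 2.0])
```

Plotting `n_communities` and the size distribution against gamma makes the scale structure visible; stable plateaus are candidate zoom levels.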

How this applies to LibTrails

LibTrails will have both micro-topics (niche) and macro-themes (genres). Gamma sweeps let you expose a zoomable 'map' where users can traverse scales instead of forcing one partition.

Phase 3 — Louvain: Greedy Multi-Level Modularity Optimization

Goal: Learn the classic Louvain method and its multi-level aggregation mechanics.

Readings

What to extract

  • Two phases: local moving + aggregation; why it scales.
  • Sensitivity to node order / randomness; local optima.
  • Pathology: communities can be internally disconnected.

Coding checkpoint

Run Louvain on: (a) a toy graph where you know the answer and (b) your LibTrails similarity graph. Inspect for disconnected communities and unstable results across seeds.
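
The disconnected-community inspection can be a one-liner over induced subgraphs; a sketch (function name is illustrative):

```python
import networkx as nx

def internally_disconnected(G, partition):
    """Communities whose induced subgraph is not connected -- the classic
    Louvain pathology that Leiden's refinement step eliminates."""
    return [set(c) for c in partition if not nx.is_connected(G.subgraph(c))]

# A deliberately bad partition of the path graph 0-1-2-3: both blocks
# contain nodes with no internal path between them.
G = nx.path_graph(4)
bad = [{0, 2}, {1, 3}]
assert len(internally_disconnected(G, bad)) == 2
```

Run it on every Louvain output; any non-empty result is a concrete, user-visible defect ("incoherent shelf") worth logging.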

How this applies to LibTrails

Louvain is valuable as a baseline and a cautionary tale. If your clusters split into disconnected pieces, users will experience incoherent shelves. This motivates Leiden.

Phase 4 — Leiden: Connectivity Guarantees and Better Partitions

Goal: Understand how Leiden fixes Louvain and why it is the default for production.

Readings

  • Traag, Waltman & van Eck, the Leiden algorithm (arXiv:1810.08473)

What to extract

  • Refinement step and guarantees about well-connected communities.
  • Why convergence is more reliable than Louvain.
  • Practical parameters: random seed, iterations, resolution.

Coding checkpoint

Establish your production baseline: Leiden (modularity) with a documented gamma sweep. Record stability (NMI/VI) across seeds and small graph perturbations (e.g., remove 1% edges).
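
For the NMI part of the stability record, a self-contained stdlib implementation (so results don't depend on an extra dependency; labels are per-node community ids):

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information between two flat labelings."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((nij / n) * log(n * nij / (ca[a] * cb[b]))
             for (a, b), nij in joint.items())
    ha = -sum((c / n) * log(c / n) for c in ca.values())
    hb = -sum((c / n) * log(c / n) for c in cb.values())
    if ha == 0 or hb == 0:                  # degenerate one-cluster case
        return 1.0 if ha == hb else 0.0
    return mi / ((ha * hb) ** 0.5)

# NMI is invariant to relabeling: identical partitions score 1.0
assert abs(nmi([0, 0, 1, 1], [1, 1, 0, 0]) - 1.0) < 1e-12
```

Compute it pairwise across seed runs and across edge-perturbed reruns; consistently high values are the evidence a baseline needs before shipping.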

How this applies to LibTrails

Leiden is the right default for LibTrails similarity graphs. The refinement step typically yields more coherent thematic groupings and avoids 'split cluster' UX issues.

Phase 5 — Constant Potts Model (CPM): A Better Fit for Similarity Graphs

Goal: Add CPM as an alternative objective that often behaves more predictably on embedding graphs.

Readings

What to extract

  • CPM objective and why it avoids the modularity resolution limit in a precise sense.
  • How the CPM resolution parameter directly controls minimum internal density.
  • When CPM beats modularity: geometric graphs, kNN graphs, and graphs with strong local neighborhoods.
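
The CPM objective is simple enough to score by hand: each community contributes its internal weight minus gamma times its number of possible internal pairs, H = Σ_c [e_c − γ·n_c(n_c−1)/2]. A minimal sketch (function name is illustrative):

```python
import networkx as nx

def cpm_quality(G, partition, gamma, weight="weight"):
    """Constant Potts Model score of a partition (iterable of node sets)."""
    score = 0.0
    for comm in partition:
        comm = set(comm)
        # internal edge weight, each internal edge counted once
        e_c = sum(d.get(weight, 1) for u, v, d in G.edges(comm, data=True)
                  if u in comm and v in comm)
        n_c = len(comm)
        score += e_c - gamma * n_c * (n_c - 1) / 2
    return score

# Two triangles joined by a bridge: at gamma = 1 each triangle exactly
# pays for itself (3 edges vs 3 possible pairs), so the score is 0.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
assert cpm_quality(G, [{0, 1, 2}, {3, 4, 5}], gamma=1.0) == 0.0
```

This makes gamma's meaning concrete: it is the minimum internal edge density a group must exceed to be worth keeping as a community.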

Coding checkpoint

Run Leiden under both objectives (modularity vs CPM) across the same gamma grid. Compare: stability, community size distribution, and semantic coherence of labels.

How this applies to LibTrails

For LibTrails, CPM often yields more intuitive 'topic neighborhoods' on kNN similarity graphs, making it easier to produce clean shelves and map regions without surprise merges.

Phase 6 — Flow-Based Communities: Infomap and the Map Equation

Goal: Learn when you should prefer flow-based communities (navigation, citations, transitions) over density-based objectives.

Readings

What to extract

  • Why flow-based objectives capture 'paths' through a network.
  • Interpretation: compressing random walks (codebooks) instead of counting edges.
  • Why Infomap can reveal different, often more navigable, partitions than modularity/CPM.

Coding checkpoint

If you have navigation signals (clicks, reading sequences, recommendations), build a directed/weighted flow graph and run Infomap. Compare the resulting communities against Leiden on the same node set.
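
Building the flow graph itself is straightforward before Infomap enters the picture; a sketch assuming sessions arrive as ordered lists of item ids (names are illustrative):

```python
from collections import Counter
import networkx as nx

def flow_graph(sessions):
    """Directed, weighted transition graph from reading/click sequences;
    edge weight = number of observed u -> v transitions."""
    counts = Counter()
    for seq in sessions:
        counts.update(zip(seq, seq[1:]))  # consecutive pairs in each session
    G = nx.DiGraph()
    for (u, v), w in counts.items():
        G.add_edge(u, v, weight=w)
    return G

G = flow_graph([["a", "b", "c"], ["a", "b"]])
assert G["a"]["b"]["weight"] == 2
```

The resulting DiGraph can be exported to the infomap package, which consumes directed weighted edges natively.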

How this applies to LibTrails

LibTrails can combine two maps: a semantic similarity map (CPM/Leiden) and a user-flow map (Infomap). Agreement suggests robust themes; disagreement highlights 'semantic vs behavioral' differences worth surfacing.

Phase 7 — A Statistical Baseline: Degree-Corrected Stochastic Block Models

Goal: Adopt a generative lens to detect when communities are real vs artifacts of degree/geometry.

Readings

What to extract

  • Why standard SBM fails with heavy-tailed degrees; how degree correction fixes it.
  • Interpretation: communities as parameters of a generative process (likelihood).
  • Using SBM as a diagnostic rather than a production algorithm.

Coding checkpoint

Fit a simple SBM/DC-SBM (even on a subgraph) and compare partitions to Leiden. Use disagreements to diagnose hubness, overly generic topics, or edge-weight issues.

How this applies to LibTrails

Embedding graphs often contain hubs (generic chunks/topics). A generative sanity check helps you avoid building product features on clusters that are driven by hubs rather than meaning.

Phase 8 — Overlap and Validation: Link Communities, LFR Benchmarks, and Baselines

Goal: Round out your toolkit: overlap handling and evidence-based evaluation.

Readings

  • Ahn, Bagrow & Lehmann, link communities (arXiv:0903.3178)
  • Lancichinetti, Fortunato & Radicchi, the LFR benchmark (arXiv:0805.4770)
  • Raghavan, Albert & Kumara, label propagation (arXiv:0709.2938)

What to extract

  • Why overlap is natural in text libraries (books belong to multiple themes).
  • How LFR produces graphs with known ground truth for testing parameter sensitivity.
  • Why you should always keep a simple baseline (label propagation) to prevent over-fitting complexity.

Coding checkpoint

Build a small evaluation harness: generate LFR graphs; compare Louvain/Leiden/CPM/Infomap; report stability and accuracy vs ground truth. Then apply the same metrics to LibTrails graphs (without ground truth).
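
NetworkX ships an LFR generator, so the ground-truth side of the harness needs only a few lines (parameters below follow the NetworkX documentation example; adjust `mu` to control mixing difficulty):

```python
import networkx as nx

# LFR benchmark graph with planted communities stored as node attributes
G = nx.LFR_benchmark_graph(n=250, tau1=3, tau2=1.5, mu=0.1,
                           average_degree=5, min_community=20, seed=10)
G.remove_edges_from(nx.selfloop_edges(G))  # LFR can emit self-loops

# Planted ground truth: each node carries its community as a node set
truth = {frozenset(G.nodes[v]["community"]) for v in G}

# Candidate partition to score against the ground truth
found = nx.community.louvain_communities(G, seed=0)
```

From here, score `found` against `truth` with NMI/VI, then reuse exactly the same stability metrics (minus the accuracy column) on the LibTrails graphs.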

How this applies to LibTrails

Overlap support enables richer LibTrails UX: a book can appear in multiple regions; chunks can bridge themes; authors can sit at intersections. Your evaluation harness prevents 'pretty maps' that collapse under perturbations.

Applied Track — Building the LibTrails Library Map

Use this track alongside the phases above. The goal is to move from 'I can run Leiden' to 'I can ship a stable, interpretable, multi-resolution community map for a digital library'.

  • A. Define graph objects: Choose node types (chunks, books, topics, authors). Start with two graphs: chunk-similarity (kNN) and book-similarity (aggregate chunk signals).
  • B. Edge construction: Prefer mutual-kNN; store cosine similarity as weights; consider weight transforms (e.g., max(sim-τ,0)) to suppress weak ties.
  • C. Multi-resolution sweep: Run Leiden with modularity and CPM across gamma grid. Track: #communities, size distribution, stability (VI/NMI across seeds), and human interpretability (auto labels from top terms).
  • D. Stability-first selection: Pick 2-3 resolutions that are stable and useful (micro, meso, macro). Make these zoom levels in the UI.
  • E. Overlay flow communities (optional): If you capture user navigation, run Infomap and display 'paths' or 'corridors' between semantic regions.
  • F. Overlap: Add link-community or soft assignment (e.g., top-2 communities by affiliation strength) for boundary objects (books bridging themes).
  • G. Product UX: Provide a 'map' view + 'shelves' view. Each community should have: label, centroid exemplars, representative books, and edges to neighboring communities.
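
Step B above can be sketched in a few lines of NumPy + NetworkX; this is a brute-force O(n²) version for clarity (swap in an ANN index for large n; `k` and `tau` are the tuning knobs named in step B):

```python
import numpy as np
import networkx as nx

def mutual_knn_graph(X, k=10, tau=0.0):
    """Mutual-kNN similarity graph from an (n, d) embedding matrix.
    Edge weight = max(cosine_sim - tau, 0), suppressing weak ties."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows
    S = Xn @ Xn.T                                      # cosine similarities
    np.fill_diagonal(S, -np.inf)                       # never link node to itself
    nbrs = [set(np.argsort(-row)[:k]) for row in S]    # top-k neighbors per node
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    for i in range(len(X)):
        for j in nbrs[i]:
            if i in nbrs[j] and i < j:                 # mutual edges only, added once
                w = max(S[i, j] - tau, 0.0)
                if w > 0:
                    G.add_edge(i, int(j), weight=float(w))
    return G

rng = np.random.default_rng(0)
G = mutual_knn_graph(rng.normal(size=(40, 8)), k=6)
```

The mutual-kNN condition is what keeps hubs from wiring the whole graph together: a hub's one-way "everyone is my neighbor" edges are dropped unless reciprocated.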

Recommended evaluation signals (no ground truth)

  • Stability: partitions should be consistent across random seeds and small edge perturbations.
  • Coherence: community label quality (top keywords/topics) and exemplar similarity.
  • Separation: neighboring communities should have distinct labels; avoid duplicates caused by fragmentation.
  • Coverage: avoid a 'giant misc cluster'; tune gamma/graph construction instead.
  • User utility: can a user quickly find a relevant shelf/region and jump between adjacent themes?
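
The stability signal above can be measured without ground truth via pairwise partition agreement; a sketch using the Rand index across seed reruns (names are illustrative; substitute NMI/VI if you prefer):

```python
from itertools import combinations
import networkx as nx

def pair_agreement(p1, p2, nodes):
    """Rand index: fraction of node pairs on which two partitions agree
    (grouped together in both, or separated in both)."""
    m1 = {n: i for i, c in enumerate(p1) for n in c}
    m2 = {n: i for i, c in enumerate(p2) for n in c}
    pairs = list(combinations(nodes, 2))
    agree = sum((m1[u] == m1[v]) == (m2[u] == m2[v]) for u, v in pairs)
    return agree / len(pairs)

G = nx.karate_club_graph()
runs = [nx.community.louvain_communities(G, seed=s) for s in range(5)]
scores = [pair_agreement(runs[0], p, list(G)) for p in runs[1:]]
```

The same function works unchanged for the edge-perturbation check: delete ~1% of edges, recluster, and compare against the unperturbed partition.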

References

Paper | Authors | Link
Community detection in graphs (survey) | Fortunato | arXiv:0906.0612
Modularity and community structure | Newman | arXiv:cond-mat/0308217
Resolution limit in community detection | Fortunato & Barthelemy | arXiv:cond-mat/0603718
Louvain method | Blondel et al. | arXiv:0803.0476
Leiden algorithm | Traag, Waltman & van Eck | arXiv:1810.08473
Constant Potts Model (CPM) | Traag, Van Dooren & Nesterov | arXiv:1104.3083
Infomap / Map Equation | Rosvall & Bergstrom | arXiv:0906.1405
Degree-Corrected Stochastic Block Model | Karrer & Newman | arXiv:1008.3926
Link communities | Ahn, Bagrow & Lehmann | arXiv:0903.3178
LFR benchmark | Lancichinetti, Fortunato & Radicchi | arXiv:0805.4770
Label propagation | Raghavan, Albert & Kumara | arXiv:0709.2938