Data Pipeline

Pipeline logic is implemented in engine/ingest/ and called from both CLI and API surfaces. The pipeline transforms raw SRD JSON files and homebrew content into a queryable SQLite database and immutable progression JSON artifacts.

Flow

db-refresh (engine/ingest/refresh_sqlite.py): Scans srd/2014/ and srd/2024/ for files matching the 5e-SRD-*.json naming convention. Each file is read as a JSON array of records. Every record is normalized through family-specific normalizer functions (class, creature, feat, spell, etc.), then inserted into an edition-scoped SQLite table (e.g., srd_2014_classes, srd_2024_spells). Each row stores the original payload (source_payload), the normalized payload (normalized_payload), and a data column containing the normalized form used by API queries. An srd_datasets metadata table tracks every imported dataset with row counts and timestamps. For 2024 data, synthetic class records are built from the subclass file when a dedicated classes file is absent, ensuring the class table always exists for both editions.
Homebrew projection: Canonical homebrew records from homebrew/classes/ and homebrew/subclasses/ are read, validated against Pydantic schemas (CanonicalHomebrewClass, CanonicalHomebrewSubclass), and projected into the edition-specific tables. Merged views (e.g., srd_2014_classes_merged) union SRD rows with homebrew rows, giving downstream consumers a single query surface that includes both official and homebrew content.
generate-progressions (engine/ingest/generate_progressions.py): Reads from the SQLite database to build per-entity progression JSON files under progressions/classes/{edition}/ and progressions/subclasses/{edition}/{class_index}/{subclass_index}.json. Each progression describes the level-by-level feature grants, spell grants, mechanic changes, and metadata for a single class or subclass. The progressions/manifest.json catalog is regenerated with every run, listing all entities with their file paths, identity keys, and generation timestamps.
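The load and merge steps described above can be sketched with the standard-library sqlite3 module. This is a minimal illustration, not the code in refresh_sqlite.py: the helper names refresh_table and create_merged_view, the record_index column, the origin column, and the homebrew_2014_classes table name are all assumptions made for the example. Only the source_payload, normalized_payload, and data columns and the _merged view suffix come from the description above.

```python
import json
import sqlite3


def refresh_table(conn, table, records, normalize):
    """Drop, recreate, and fill one edition-scoped table (hypothetical helper)."""
    # Destructive refresh: the table is rebuilt from scratch on every run.
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(
        f'CREATE TABLE "{table}" ('
        "record_index TEXT PRIMARY KEY, "  # assumed key column
        "source_payload TEXT NOT NULL, "
        "normalized_payload TEXT NOT NULL, "
        "data TEXT NOT NULL)"
    )
    rows = []
    for record in records:
        source = json.dumps(record, sort_keys=True)
        normalized = json.dumps(normalize(record), sort_keys=True)
        # data currently mirrors normalized_payload but is stored separately.
        rows.append((record["index"], source, normalized, normalized))
    conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?, ?, ?)', rows)
    return len(rows)


def create_merged_view(conn, srd_table, homebrew_table):
    """Expose SRD and homebrew rows through one view (assumed layout)."""
    view = f"{srd_table}_merged"
    conn.execute(f'DROP VIEW IF EXISTS "{view}"')
    conn.execute(
        f'CREATE VIEW "{view}" AS '
        f"SELECT record_index, data, 'srd' AS origin FROM \"{srd_table}\" "
        "UNION ALL "
        f"SELECT record_index, data, 'homebrew' AS origin FROM \"{homebrew_table}\""
    )
    return view
```

With an in-memory connection, calling refresh_table once for an SRD table and once for a homebrew table, then create_merged_view, gives a single view that a consumer can query without knowing which rows are official.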

Primary Modules

engine/ingest/refresh_sqlite.py — SRD file scanning, table creation, record normalization, homebrew merging, and view generation.
engine/ingest/generate_progressions.py — Progression artifact generation from SQLite, manifest writing.
engine/ingest/normalizers.py — Family-specific normalizer functions that transform raw SRD JSON into consistent shapes. Handles classes, creatures, feats, spells, subclasses, species, and subspecies. Records that don't match a known family are passed through as deep copies.
engine/ingest/db_utils.py — Filename parsing utilities that derive collection names, table names, and SRD versions from file paths. Enforces the 5e-SRD- prefix convention.
engine/ingest/homebrew_store.py — Reads and validates canonical homebrew JSON files from disk.
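As an illustration of the filename parsing that db_utils.py is described as doing, the sketch below derives a table name from an SRD file path. The function name table_name_for and the slug rule for multi-word collections are assumptions. Only the 5e-SRD- prefix requirement and the srd_{edition}_{collection} table shape are taken from this document.

```python
import re
from pathlib import Path

SRD_PREFIX = "5e-SRD-"


def table_name_for(path):
    """Derive an edition-scoped table name from an SRD file path (sketch)."""
    path = Path(path)
    # Enforce the naming convention before deriving anything from the path.
    if path.suffix != ".json" or not path.name.startswith(SRD_PREFIX):
        raise ValueError(f"not an SRD dataset file: {path.name}")
    edition = path.parent.name  # e.g. "2014" or "2024"
    if not re.fullmatch(r"\d{4}", edition):
        raise ValueError(f"unexpected edition directory: {edition}")
    collection = path.stem[len(SRD_PREFIX):]
    # Assumed slug rule: lowercase, with non-alphanumeric runs collapsed to "_".
    slug = re.sub(r"[^a-z0-9]+", "_", collection.lower()).strip("_")
    return f"srd_{edition}_{slug}"
```

For example, table_name_for("srd/2024/5e-SRD-Spells.json") yields srd_2024_spells, while a file without the prefix raises ValueError instead of being silently imported.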

Design Rationale

The pipeline uses a two-phase architecture (refresh → generate) rather than a single pass because the phases have different invalidation cadences. db-refresh needs to run when source JSON files change (rare), while generate-progressions may need to rerun when the progression schema evolves or normalizer logic changes. Separating them allows selective re-execution.

SQLite was chosen as the intermediate store rather than in-memory data structures because it provides indexed queries, transactional consistency, and persistence between runs. The progression generator can query for specific classes or subclasses without loading the entire dataset into memory—important as the SRD grows across editions.

The normalizer architecture uses a family-dispatch pattern (normalize_srd_record delegates to normalize_class, normalize_creature, etc.) rather than a registry or plugin system. This keeps the normalization logic explicit and easy to audit. Each normalizer produces a flat, consistent dict shape that strips SRD-specific nesting (like proficiency.proficiency.name) into direct field access.
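A family-dispatch normalizer of this kind might look like the following sketch. The function names normalize_srd_record and normalize_class appear in this document, but the signatures, the field choices, and normalize_spell are assumptions made for illustration. The real normalizers cover more families and fields.

```python
import copy


def normalize_class(record):
    """Flatten a raw class record into a plain dict (illustrative fields)."""
    return {
        "index": record["index"],
        "name": record["name"],
        "hit_die": record.get("hit_die"),
        # Nested reference objects are reduced to their names.
        "proficiencies": [p["name"] for p in record.get("proficiencies", [])],
    }


def normalize_spell(record):
    """Flatten a raw spell record (hypothetical normalizer for this sketch)."""
    return {
        "index": record["index"],
        "name": record["name"],
        "level": record.get("level"),
        "school": (record.get("school") or {}).get("name"),
    }


def normalize_srd_record(family, record):
    """Dispatch with explicit branches; unknown families pass through."""
    if family == "classes":
        return normalize_class(record)
    if family == "spells":
        return normalize_spell(record)
    # No registry lookup: an unmatched family is returned as a deep copy.
    return copy.deepcopy(record)
```

The explicit if chain is the point of the design: every supported family is visible in one function, at the cost of editing that function whenever a family is added.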

Records are stored with three JSON representations: source_payload (the raw SRD JSON, for debugging and traceability), normalized_payload (the normalizer output), and data (the version used by API queries, currently identical to normalized_payload but decoupled to allow future API-specific transformations).

Assumptions & Constraints

File naming convention is authoritative: The pipeline derives table names, dataset names, and SRD versions entirely from file paths. A file at srd/2024/5e-SRD-Spells.json becomes table srd_2024_spells. Renaming or moving files changes the pipeline's behavior.
Destructive refresh: db-refresh drops and recreates every SRD table on each run. There is no incremental update or diff-based refresh. This is simple and correct but means every refresh is a full rebuild of the SQLite database.
2024 class synthesis: Because the 2024 SRD ships subclasses without a separate classes file, the pipeline constructs synthetic class records from the class reference carried by each subclass record. These synthetic records are marked is_partial: True and lack details such as hit die or proficiency choices.
Homebrew is optional: If no homebrew files exist, the pipeline runs identically—merged views simply contain only SRD rows.
Manifest is regenerated in full: manifest.json is rewritten from scratch on each generate-progressions run. There is no append or partial update.
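The 2024 class synthesis constraint can be sketched as follows. This assumes each subclass record carries a class reference object with index and name keys. That shape, and the function name synthesize_classes, are assumptions for the example, while the is_partial flag comes from the description above.

```python
def synthesize_classes(subclass_records):
    """Build partial class records from the class reference on each subclass."""
    classes = {}
    for subclass in subclass_records:
        ref = subclass.get("class") or {}  # assumed reference shape
        index = ref.get("index")
        if not index or index in classes:
            continue  # skip subclasses without a usable reference, and repeats
        classes[index] = {
            "index": index,
            "name": ref.get("name", index),
            # Marks the record as synthetic: no hit die, no proficiency choices.
            "is_partial": True,
        }
    return [classes[key] for key in sorted(classes)]
```

Several subclasses pointing at the same class produce a single synthetic record, so the class table ends up with one row per referenced class.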

Conceptual Model

The pipeline implements a classic ETL (Extract-Transform-Load) pattern with a materialized view layer:

Extract: Read SRD JSON arrays from srd/{edition}/5e-SRD-*.json files.
Transform: Normalize each record through family-specific functions that flatten nested references, standardize field names, and handle cross-edition inconsistencies.
Load: Insert into edition-scoped SQLite tables, storing each payload as JSON text in the source_payload, normalized_payload, and data columns.
Materialize: Generate immutable progression JSON files that serve as the read-optimized layer for the API and UI.

The boundary between "database content" and "progression artifacts" is significant: the database is the queryable source of truth for the SRD, while progressions are derived, denormalized views optimized for the level-up workflow. A progression file for a class contains everything needed to render a level-up preview without any further database queries.
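To make the materialize step concrete, the sketch below writes one self-contained JSON file per class and then rewrites the manifest in full. It is a simplified illustration: the function name write_progressions, the {class_index}.json file name, the manifest keys, and the shape of each progression dict are assumptions, and the real generator reads its input from SQLite rather than from an in-memory dict.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_progressions(root, edition, progressions):
    """Write one JSON file per class, then rewrite the manifest in full."""
    root = Path(root)
    generated_at = datetime.now(timezone.utc).isoformat()
    entries = []
    for class_index in sorted(progressions):
        # Assumed file layout: classes/{edition}/{class_index}.json
        path = root / "classes" / edition / f"{class_index}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(
            json.dumps(progressions[class_index], indent=2, sort_keys=True)
        )
        entries.append(
            {
                "kind": "class",
                "edition": edition,
                "index": class_index,
                "path": path.relative_to(root).as_posix(),
                "generated_at": generated_at,
            }
        )
    # The manifest is rewritten whole; nothing is appended to an earlier copy.
    manifest = {"entities": entries}
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return entries
```

Because each file holds the complete level-by-level data for its class, a consumer can load a single file from the path listed in the manifest and render a preview without opening the database.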
