graphify
graphify turns a corpus into a reconciled, ontology-typed knowledge graph. Most knowledge isn't documentary — it doesn't live as one fact in one file. It's entities and relations scattered across sources: the same person under three names in twenty-five books, a component named one way in a CSV registry and another way in a manual, a case that only makes sense once its evidence, motive, and method are linked. Prose and docs flatten that structure; a knowledge graph keeps it. graphify extracts canonical entities and typed relations, deduplicates and reconciles them across sources under a configurable ontology, and gives you back a queryable graph your assistant — or you, from the terminal — can reason over.

The flagship corpus: 1,193 canonical entities across 19 ontology types (Work, Saga, Case, Character, Evidence, Motive, ForensicMethod, DisguisePersona, Alias…) reconciled from 25 public-domain mystery works, clustered into 99 communities — here with the entity panel open on Sherlock Holmes. Explore the live studio → https://mystery-saga.sent-tech.ca/studio/
Why a knowledge graph, not more prose
Four things a graph gives you that documents can't:
- Queryable structure: ask graphify query "what connects Irene Adler to the Bohemia case?" and get a path through typed nodes and edges, not a wall of search hits to re-read.
- Entity reconciliation: "Holmes", "Mr. Sherlock Holmes", and a disguised persona collapse into one canonical entity with aliases and evidence refs, instead of staying three scattered mentions.
- Cross-source linking: a character appearing in several works, an author and their translator, a registry row and the extracted mention that matches it. Edges across sources are first-class, with provenance.
- Ontology-typed nodes: every node carries a type from a profile you control (Character, Case, Evidence, …), each type can pin a visual_encoding (shape + color), and relations are validated against allowed endpoints, so the graph stays a model, not a hairball.
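To make "a path through typed nodes and edges" concrete, here is a minimal TypeScript sketch of a breadth-first path search over a tiny typed graph. The `TypedNode` and `TypedEdge` shapes, the node ids, the relation names, and `findPath` are all hypothetical illustrations; they are not graphify's actual graph.json schema or query implementation.

```typescript
// Hypothetical shapes for illustration; graphify's real graph.json schema may differ.
interface TypedNode { id: string; type: string }
interface TypedEdge { source: string; target: string; relation: string }

// Breadth-first search that treats edges as undirected.
// Returns the node ids along a shortest path, or null when no path exists.
function findPath(edges: TypedEdge[], from: string, to: string): string[] | null {
  const neighbors = new Map<string, string[]>();
  const link = (a: string, b: string): void => {
    const list = neighbors.get(a) ?? [];
    list.push(b);
    neighbors.set(a, list);
  };
  for (const edge of edges) {
    link(edge.source, edge.target);
    link(edge.target, edge.source);
  }

  // previous[n] is the node we arrived from; the start node maps to null.
  const previous = new Map<string, string | null>([[from, null]]);
  const queue: string[] = [from];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    if (current === to) {
      const path: string[] = [];
      let step: string | null = current;
      while (step !== null) {
        path.unshift(step);
        step = previous.get(step) ?? null;
      }
      return path;
    }
    for (const next of neighbors.get(current) ?? []) {
      if (!previous.has(next)) {
        previous.set(next, current);
        queue.push(next);
      }
    }
  }
  return null;
}

// Hypothetical sample data, not taken from the real corpus.
const demoNodes: TypedNode[] = [
  { id: "irene-adler", type: "Character" },
  { id: "photograph", type: "Evidence" },
  { id: "scandal-in-bohemia", type: "Case" },
];
const demoEdges: TypedEdge[] = [
  { source: "photograph", target: "irene-adler", relation: "BELONGS_TO" },
  { source: "photograph", target: "scandal-in-bohemia", relation: "EVIDENCE_IN" },
];
const demoPath = findPath(demoEdges, "irene-adler", "scandal-in-bohemia");
```

In this sketch the answer to "what connects A to B?" is the list of intermediate nodes, each of which carries a type a reader or assistant can reason about.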
From code graphs to ontology graphs
graphify started life as a code knowledge graph: parse a repo, extract classes/functions/calls, cluster, report. No ontology, no entity management. The current product line keeps that (see Code knowledge graphs) but generalizes it:
| | Initial approach (code graph) | Now (ontology-driven entity graph) |
|---|---|---|
| Node types | Fixed (class, function, concept) | Configurable ontology profile per project |
| Same thing, many names | Stays duplicated | Canonical entities + reconciliation with a reviewable patch lifecycle |
| Sources | A codebase | Any corpus: books, manuals, registries (CSV/JSON/YAML), papers, images, transcripts |
| Rendering | One default style | Per-type visual_encoding (shape, color) carried by the profile |
| Review | Read the report | Reconciliation studio + audit trail of every accept/reject decision |
Proof: the graph pays for itself
A token benchmark prints after every run. Once built, each query reads the compact graph instead of re-reading raw files:
| Corpus | Files | Tokens per query vs raw | Worked example |
|---|---|---|---|
| Karpathy repos + 5 papers + 4 images | 52 | ~71.5× fewer | worked/karpathy-repos/ |
| graphify source + Transformer paper | 4 | ~5.4× fewer | worked/mixed-corpus/ |
| httpx (synthetic Python library) | 6 | ~1× | worked/httpx/ |
A tiny corpus already fits in context, so there's little to compress — the value there is structural clarity, not token savings. Each worked/ folder ships the raw inputs and the actual output so you can reproduce the numbers. Token figures are estimates unless backed by real model calls.
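The "× fewer" column is a simple ratio. A minimal sketch of the arithmetic, assuming the benchmark compares the tokens needed to read the raw files against the tokens read per graph query; `reductionFactor` and the sample counts are hypothetical and are not taken from any worked example:

```typescript
// Reduction factor = tokens to read the raw corpus / tokens read per graph query.
function reductionFactor(rawTokens: number, graphQueryTokens: number): number {
  if (rawTokens <= 0 || graphQueryTokens <= 0) {
    throw new RangeError("token counts must be positive");
  }
  return rawTokens / graphQueryTokens;
}

// Hypothetical counts, chosen only to show the arithmetic.
const demoFactor = reductionFactor(120000, 4000);
console.log(`~${demoFactor.toFixed(1)}x fewer tokens per query`);
```

A factor near 1 means the raw corpus is about as cheap to read as the graph, which is the small-corpus case described above.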
Quickstart
Requires: Node.js 20+ and one supported AI coding assistant (Claude Code, Codex, Gemini CLI, and others — see Reference).
npm install -g @sentropic/graphify
graphify install
Build your first graph from your assistant:
/graphify . # Claude Code / Gemini CLI / Copilot / Aider / OpenCode / others
$graphify . # Codex
This writes .graphify/:
.graphify/
├── graph.json persistent graph — query weeks later without re-reading
├── GRAPH_REPORT.md god nodes, surprising connections, suggested questions
├── graph.html local standalone HTML export (legacy viewer)
├── wiki/ optional LLM-readable wiki pages
└── cache/ local SHA256 cache (ignored)
Query it directly from the terminal, no assistant needed:
graphify query "what connects attention to the optimizer?" --graph .graphify/graph.json
graphify path "DigestAuth" "Response" --graph .graphify/graph.json
graphify explain "SwinTransformer" --graph .graphify/graph.json
graphify summary --graph .graphify/graph.json # compact first-hop orientation
(--graph is optional once .graphify/graph.json is the resolved default.)
Build options
The build is driven from the skill; common flags:
/graphify ./raw --directed # preserve source→target direction
/graphify ./raw --mode deep # more aggressive INFERRED edge extraction
/graphify ./raw --update # re-extract only changed files, merge into existing graph
/graphify ./raw --cluster-only # rerun clustering only, no re-extraction
/graphify ./raw --no-viz # skip HTML, just report + JSON
/graphify ./raw --svg # also export graph.svg
/graphify ./raw --graphml # also export graph.graphml (Gephi, yEd)
/graphify ./raw --neo4j-push bolt://localhost:7687 # push directly to a running Neo4j
graphify watch [path] keeps the graph live in a background terminal: code saves trigger an instant AST rebuild (no LLM), while doc/image changes set a flag and notify you to run --update for the LLM re-pass. For cross-repo work, graphify clone <url> builds a graph for a remote repo and graphify merge-graphs <graphs...> stitches several graphs together.
The ontology layer
Configurable ontology (profiles)
A project can pin an ontology profile that constrains the graph: allowed node types, relation types, citation requirements, review statuses, per-type visual_encoding (shape + color), and named registry bindings (CSV, JSON, or YAML). Profile mode is strictly additive — it activates only when graphify finds graphify.yaml, graphify.yml, .graphify/config.yaml, or .graphify/config.yml, or when you pass --config/--profile. Without it, normal graphify behavior is unchanged.
A minimal graphify.yaml:
version: 1
profile:
  path: graphify/ontology-profile.yaml # node/relation types, citation rules, statuses
inputs:
  corpus:
    - raw/manuals
  registries:
    - references/components.csv
dataprep:
  pdf_ocr: auto
  citation_minimum: page

graphify profile validate --config graphify.yaml
graphify profile dataprep . --config graphify.yaml
graphify profile report --profile-state .graphify/profile/profile-state.json \
  --graph .graphify/graph.json --out .graphify/profile/profile-report.md
Registries are normalized into ordinary extraction fragments with stable IDs and profile attributes, so external authoritative data and extracted mentions live in the same graph.
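One way to picture what a profile constrains is a small check of a proposed relation against the allowed node types and relation endpoints. The sketch below is only an illustration under stated assumptions: `OntologyProfileSketch`, `RelationRule`, `checkRelation`, and the relation name `EVIDENCE_IN` are hypothetical and do not reflect the real ontology-profile.yaml format.

```typescript
// Hypothetical profile shape: each relation type lists its allowed source and target node types.
interface RelationRule { from: string[]; to: string[] }
interface OntologyProfileSketch {
  nodeTypes: string[];
  relations: Record<string, RelationRule>;
}
interface ProposedRelation { relation: string; sourceType: string; targetType: string }

// Returns human-readable problems; an empty list means the relation is allowed.
function checkRelation(profile: OntologyProfileSketch, rel: ProposedRelation): string[] {
  const problems: string[] = [];
  for (const nodeType of [rel.sourceType, rel.targetType]) {
    if (!profile.nodeTypes.includes(nodeType)) problems.push(`unknown node type: ${nodeType}`);
  }
  const rule = profile.relations[rel.relation];
  if (rule === undefined) {
    problems.push(`unknown relation type: ${rel.relation}`);
    return problems;
  }
  if (!rule.from.includes(rel.sourceType)) {
    problems.push(`${rel.sourceType} cannot be the source of ${rel.relation}`);
  }
  if (!rule.to.includes(rel.targetType)) {
    problems.push(`${rel.targetType} cannot be the target of ${rel.relation}`);
  }
  return problems;
}

// Hypothetical profile using type names from this README.
const demoProfile: OntologyProfileSketch = {
  nodeTypes: ["Character", "Case", "Evidence"],
  relations: { EVIDENCE_IN: { from: ["Evidence"], to: ["Case"] } },
};
const okProblems = checkRelation(demoProfile, {
  relation: "EVIDENCE_IN", sourceType: "Evidence", targetType: "Case",
});
const badProblems = checkRelation(demoProfile, {
  relation: "EVIDENCE_IN", sourceType: "Character", targetType: "Case",
});
```

Collecting every problem instead of failing on the first keeps the result useful as a review report.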
Canonical entities and cross-source reconciliation
The same real-world thing is often mentioned differently across sources: a person named one way in a paper and another way in a dataset, a class in code and the concept describing it in a doc. graphify models a canonical entity (with a label, aliases, type, status, evidence refs) and links the variant mentions to it, so they collapse to a single node instead of staying scattered.
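A minimal sketch of that model, assuming the fields named above (label, aliases, type, status, evidence refs) and a simple normalized-label index that resolves any known variant to one canonical id. `CanonicalEntitySketch`, `normalizeLabel`, `buildAliasIndex`, the id format, and the status value are hypothetical; graphify's stored schema may differ.

```typescript
// Hypothetical model; field names follow the prose above, not graphify's actual schema.
interface CanonicalEntitySketch {
  id: string;
  label: string;
  aliases: string[];
  type: string;
  status: string;
  evidenceRefs: string[];
}

// Lowercase, replace punctuation with spaces, and collapse whitespace so variants compare equal.
// ASCII-only for brevity; real corpora would need Unicode-aware normalization.
function normalizeLabel(label: string): string {
  return label.toLowerCase().replace(/[^a-z0-9\s]/g, " ").replace(/\s+/g, " ").trim();
}

// Index every label and alias so any known variant resolves to one canonical id.
function buildAliasIndex(entities: CanonicalEntitySketch[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const entity of entities) {
    for (const variant of [entity.label, ...entity.aliases]) {
      index.set(normalizeLabel(variant), entity.id);
    }
  }
  return index;
}

const holmesEntity: CanonicalEntitySketch = {
  id: "character:sherlock-holmes", // hypothetical id format
  label: "Sherlock Holmes",
  aliases: ["Holmes", "Mr. Sherlock Holmes"],
  type: "Character",
  status: "accepted", // hypothetical status value
  evidenceRefs: [],
};
const aliasIndex = buildAliasIndex([holmesEntity]);
const resolvedId = aliasIndex.get(normalizeLabel("MR. SHERLOCK HOLMES"));
```

Three surface forms map to one id here, which is the "collapse to a single node" behavior in miniature.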
Reconciliation candidates are generated deterministically — entity_match candidates ranked by shared normalized terms and exact-label match, each carrying a score and a proposed patch operation:
graphify ontology candidates \
--profile-state .graphify/profile/profile-state.json \
--out .graphify/ontology/candidates.json
A reviewable patch lifecycle (propose → validate → dry-run → apply)
Reconciliation never edits derived files directly. .graphify/graph.json and .graphify/ontology/*.json are generated artifacts; every decision is a reviewable graphify_ontology_patch_v1 instead. A patch is validated against the active profile hash, graph hash, evidence refs, relation endpoint rules, status-transition policy, and a configured repository path jail.
Supported patch operations: accept_match, reject_match, create_canonical, merge_alias, set_status, add_relation, reject_relation, deprecate_entity, supersede_entity.
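The validate-then-apply idea can be sketched in a few lines. This is an illustration under stated assumptions, not the graphify_ontology_patch_v1 format: `OntologyPatchSketch`, `OntologyStateSketch`, `validatePatch`, and `applyPatch` are hypothetical, only two of the listed operations are modeled, and the evidence-ref, endpoint, status-transition, and path-jail checks are left out.

```typescript
// Hypothetical subset of a patch; the real graphify_ontology_patch_v1 format may differ.
type PatchOperation =
  | { op: "merge_alias"; entityId: string; alias: string }
  | { op: "set_status"; entityId: string; status: string };

interface OntologyPatchSketch {
  profileHash: string; // hash of the profile the patch was written against
  graphHash: string;   // hash of the graph the patch was written against
  operations: PatchOperation[];
}

interface OntologyStateSketch {
  profileHash: string;
  graphHash: string;
  entities: Record<string, { aliases: string[]; status: string }>;
}

// Collect every problem rather than stopping at the first one.
function validatePatch(state: OntologyStateSketch, patch: OntologyPatchSketch): string[] {
  const errors: string[] = [];
  if (patch.profileHash !== state.profileHash) errors.push("stale profile hash");
  if (patch.graphHash !== state.graphHash) errors.push("stale graph hash");
  for (const operation of patch.operations) {
    if (!(operation.entityId in state.entities)) {
      errors.push(`unknown entity: ${operation.entityId}`);
    }
  }
  return errors;
}

// Applies the patch to a deep copy and returns it. The input state is never mutated,
// so the caller can treat the result as a dry-run preview or persist it as a write.
function applyPatch(state: OntologyStateSketch, patch: OntologyPatchSketch): OntologyStateSketch {
  const errors = validatePatch(state, patch);
  if (errors.length > 0) throw new Error(`patch rejected: ${errors.join("; ")}`);
  const next: OntologyStateSketch = JSON.parse(JSON.stringify(state));
  for (const operation of patch.operations) {
    const entity = next.entities[operation.entityId];
    if (operation.op === "merge_alias") {
      if (!entity.aliases.includes(operation.alias)) entity.aliases.push(operation.alias);
    } else {
      entity.status = operation.status;
    }
  }
  return next;
}
```

Binding a patch to the profile and graph hashes means a patch written against an older build is rejected rather than silently applied to a graph it was never reviewed against.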
The safe workflow is validate first, dry-run before write, then write only after explicit approval:
graphify ontology patch validate \
--profile-state .graphify/profile/profile-state.json --patch patch.json
graphify ontology patch apply \
--profile-state .graphify/profile/profile-state.json --patch patch.json --dry-run
graphify ontology patch apply \
--profile-state .graphify/profile/profile-state.json --patch patch.json --write
Every applied or rejected patch is recorded; preview the trail without mutating files:
graphify ontology decision-log --profile-state .graphify/profile/profile-state.json
Reconciliation studio
graphify ontology studio starts a local studio over the same patch core. By default it serves a read-only API; --write enables the patch mutation routes (validate/dry-run/apply), bound to loopback and guarded by a bearer token. It also serves a Svelte studio SPA for working candidate queues, candidate/canonical comparison, evidence, audit trail, and patch preview — the screenshot at the top of this README is this studio over the mystery-saga corpus.
graphify ontology studio --config graphify.yaml # read-only API + SPA
graphify ontology studio --config graphify.yaml --write # token-gated apply, loopback only
The same write-guarded core is also exposed over MCP: the default graphify serve graph server is read-only, and mutation tools require the explicit graphify ontology serve --config graphify.yaml --write.
Multimodal ingestion
The same semantic pass handles non-code inputs:
| Type | Extensions | Extraction |
|---|---|---|
| Docs | .md .mdx .txt .rst .html | Concepts + relationships + design rationale via the platform model |
| Office | .docx .xlsx | Converted to markdown, then extracted |
| Papers | .pdf | Local preflight: text-layer PDFs become Markdown via unpdf/pdftotext; scanned/low-text PDFs can use mistral-ocr for Markdown + images |
| Images | .png .jpg .webp .gif | Multimodal vision: screenshots, diagrams, any language |
| Audio / Video | .mp4 .mov .webm .mkv .avi .m4v .mp3 .wav .m4a .ogg | Detected locally; downloaded with yt-dlp when needed, normalized with ffmpeg, transcribed via faster-whisper-ts, then fed through the same semantic path |
PDF OCR, audio/video transcription, and provider variables are detailed under Reference.
What you get
- God nodes: the highest-degree concepts everything connects through.
- Confidence scores: every INFERRED edge carries a confidence_score from 0 to 1; EXTRACTED edges are always 1.0.
- Hyperedges: group relationships connecting 3+ nodes that pairwise edges can't express (all classes implementing a protocol, all functions in an auth flow).
- Rationale comments: docstrings and inline # WHY: / # HACK: / # NOTE: markers extracted as rationale_for nodes (not just what the code does, but why).
- Surprising / INFERRED connections: ranked cross-source links (code↔paper rank above code↔code), each with a plain-English why.
- Community labels: Louvain clusters named so you can navigate the graph by topic.
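"Highest-degree" has a direct computational reading: count the edges touching each node and take the top few. A minimal sketch, assuming a plain list of edge endpoints; `EndpointPair` and `topDegreeNodes` are hypothetical names, and graphify's own ranking may use additional signals.

```typescript
// Hypothetical edge shape; only the endpoints matter for degree.
interface EndpointPair { source: string; target: string }

// Count how many edge endpoints touch each node, then return the k best-connected node ids.
// Ties are broken alphabetically by id so the result is deterministic.
function topDegreeNodes(edges: EndpointPair[], k: number): Array<{ id: string; degree: number }> {
  const degree = new Map<string, number>();
  for (const { source, target } of edges) {
    degree.set(source, (degree.get(source) ?? 0) + 1);
    degree.set(target, (degree.get(target) ?? 0) + 1);
  }
  return Array.from(degree, ([id, count]) => ({ id, degree: count }))
    .sort((a, b) => b.degree - a.degree || a.id.localeCompare(b.id))
    .slice(0, k);
}

// Hypothetical edges: "a" touches three edges, "b" and "c" two each, "d" one.
const demoGodNodes = topDegreeNodes(
  [
    { source: "a", target: "b" },
    { source: "a", target: "c" },
    { source: "a", target: "d" },
    { source: "b", target: "c" },
  ],
  2,
);
```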
Code knowledge graphs
Code was graphify's original use case and remains a first-class one: a codebase is itself a non-documentary corpus, and the graph answers "what calls this?", "what breaks if I change this?" better than grep. Code files go through a deterministic no-LLM AST pass (tree-sitter) that extracts classes, functions, imports, call graphs, docstrings, and rationale comments — no file contents leave your machine for code.
- ~20 languages via tree-sitter AST: Python, JS, TS, Go, Rust, Java, C, C++, Ruby, PHP, Lua, plus C#, Kotlin, Scala, Swift, Zig, PowerShell, Elixir, Objective-C, and Julia, whose grammars are optional dependencies that degrade gracefully when absent. Vue, Svelte, Blade, Dart, Verilog/SystemVerilog, and EJS use regex fallback extraction.
- Call graphs and flows: build a directed graph and derive execution flows from CALLS edges (graphify flows build).
- Review surfaces: graphify review-delta, graphify review-analysis, and graphify recommend-commits (advisory-only) give blast radius, bridge nodes, test-gap hints, and impacted communities for changed files. Review impact rules intentionally favor recall over precision: false positives are reported, not hidden. Review benchmarks (graphify review-eval) are deterministic local fixtures, not a universal quality guarantee. Token metrics are estimates unless backed by actual model calls.
- Git lifecycle: graphify hook install wires post-commit/checkout/merge/rewrite hooks plus a graphify-json merge driver that union-merges graph nodes when branches build the graph concurrently, so .graphify/graph.json survives merges instead of conflicting.
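The union-merge idea behind the merge driver can be sketched as follows. This is an assumption-laden illustration, not the driver's code: `GraphFileSketch` and `unionMerge` are hypothetical, the real graph.json carries more fields, and the collision rule (fields from the second graph win) is an arbitrary choice made for the sketch.

```typescript
// Hypothetical minimal graph file shape; graphify's real graph.json carries more fields.
type NodeRecord = { id: string; [key: string]: unknown };
type EdgeRecord = { source: string; target: string; relation: string };
interface GraphFileSketch { nodes: NodeRecord[]; edges: EdgeRecord[] }

// Union two graphs: keep every node id once and every (source, relation, target) triple once.
// On a node id collision, fields from `theirs` are layered over `ours`.
function unionMerge(ours: GraphFileSketch, theirs: GraphFileSketch): GraphFileSketch {
  const nodes = new Map<string, NodeRecord>();
  for (const node of [...ours.nodes, ...theirs.nodes]) {
    nodes.set(node.id, { ...nodes.get(node.id), ...node });
  }
  const edges = new Map<string, EdgeRecord>();
  for (const edge of [...ours.edges, ...theirs.edges]) {
    // A NUL separator keeps distinct triples from producing the same key.
    edges.set(`${edge.source}\u0000${edge.relation}\u0000${edge.target}`, edge);
  }
  return { nodes: Array.from(nodes.values()), edges: Array.from(edges.values()) };
}

// Hypothetical branches that each rebuilt the graph.
const oursGraph: GraphFileSketch = {
  nodes: [{ id: "a" }, { id: "b", label: "old" }],
  edges: [{ source: "a", target: "b", relation: "CALLS" }],
};
const theirsGraph: GraphFileSketch = {
  nodes: [{ id: "b", label: "new" }, { id: "c" }],
  edges: [
    { source: "a", target: "b", relation: "CALLS" },
    { source: "b", target: "c", relation: "CALLS" },
  ],
};
const mergedGraph = unionMerge(oursGraph, theirsGraph);
```

Because the merge is keyed by identity rather than by line position, two branches that both regenerate the file produce a combined graph instead of a textual conflict.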
Realization tracking: agent-stats
When several AI agents (Claude Code, Codex, Antigravity/Gemini) work the same repository, git authorship stops telling you who actually built what. graphify agent-stats indexes the agentic CLI conversation transcripts already on your machine — Claude Code project transcripts, Codex rollouts, Antigravity/Gemini chats — and attributes branches, commits, and work packages to the agent sessions that produced them.
Attribution is evidence-based, never git authorship: commit SHAs the session actually printed, h2a registry identity, and worktree×branch×time correlation. Facts are stored locally in .graphify/agents/facts.jsonl.
graphify agent-stats # per-agent table: sessions, tokens, commits, branches, WPs
graphify agent-stats sync # parse/refresh transcripts (incremental; --full to re-parse)
graphify agent-stats sessions # parsed sessions with their evidence-based agent identity
graphify agent-stats wp <WP-id> # conductor view: agents/sessions joined to a Track work package
The wp view can additionally attribute merged PRs via the gh CLI (skip with --no-pr when offline).
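Of the evidence signals listed above, "commit SHAs the session actually printed" is the easiest to illustrate. The sketch below is a guess at the general shape only: `extractShaCandidates` and `attributeCommits` are hypothetical, the transcript line is invented, and the registry-identity and worktree×branch×time signals are not modeled.

```typescript
// Pull anything that looks like a git SHA (7 to 40 lowercase hex characters) out of transcript text.
function extractShaCandidates(transcript: string): string[] {
  const matches = transcript.match(/\b[0-9a-f]{7,40}\b/g) ?? [];
  return Array.from(new Set(matches));
}

// Attribute a commit to the session only if the transcript printed a prefix of its full SHA.
function attributeCommits(transcript: string, repoCommits: string[]): string[] {
  const printed = extractShaCandidates(transcript);
  return repoCommits.filter((full) => printed.some((short) => full.startsWith(short)));
}

// Invented transcript line and commit ids, for illustration only.
const demoFullSha = "3f2a9c1" + "0".repeat(33);
const demoOtherSha = "b".repeat(40);
const demoAttributed = attributeCommits("[main 3f2a9c1] Fix parser", [demoFullSha, demoOtherSha]);
```

Matching on printed SHAs ties a commit to the session that produced it regardless of which git author name the commit carries.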
How it works
graphify combines a deterministic structural pass with a model-backed semantic pass:
- Structural pass (no LLM). Code is parsed with tree-sitter into classes, functions, imports, call graphs, and rationale comments. Docs, papers, Office files, and images are normalized into text or multimodal inputs (with local PDF preflight in between).
- Semantic pass. Platform-backed subagents extract concepts, relationships, and design rationale. Every relationship is tagged EXTRACTED (found in source), INFERRED (inference, with a confidence_score), or AMBIGUOUS (flagged for review), so you always know what was found vs guessed.
- Clustering. Results merge into a Graphology graph, clustered with Louvain community detection. Clustering is topology-based: no embeddings, no vector database. The model-extracted semantically_similar_to edges are already in the graph, so they influence communities directly.
- Exports. Interactive HTML, queryable JSON, a plain-language audit report, and optional SVG, GraphML (Gephi/yEd), Neo4j Cypher, an agent-crawlable wiki (--wiki), and an Obsidian vault (--obsidian).
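The three relationship tags lend themselves to a small typed model. The sketch below assumes a hypothetical `TaggedEdge` shape with a `provenance` field; the tag names and confidence_score come from the text above, while the field layout, `effectiveConfidence`, and `trustedEdges` are illustrative only.

```typescript
// Hypothetical edge shape built from the tags named above.
type EdgeProvenance = "EXTRACTED" | "INFERRED" | "AMBIGUOUS";
interface TaggedEdge {
  source: string;
  target: string;
  relation: string;
  provenance: EdgeProvenance;
  confidence_score?: number; // expected on INFERRED edges
}

// EXTRACTED edges count as 1.0; INFERRED edges use their own score;
// AMBIGUOUS edges are treated as 0 here so they stay out until reviewed (a choice made for this sketch).
function effectiveConfidence(edge: TaggedEdge): number {
  if (edge.provenance === "EXTRACTED") return 1.0;
  if (edge.provenance === "INFERRED") return edge.confidence_score ?? 0;
  return 0;
}

// Keep only edges at or above a caller-chosen confidence threshold.
function trustedEdges(edges: TaggedEdge[], minConfidence: number): TaggedEdge[] {
  return edges.filter((edge) => effectiveConfidence(edge) >= minConfidence);
}

// Hypothetical edges for illustration.
const demoTagged: TaggedEdge[] = [
  { source: "a", target: "b", relation: "CALLS", provenance: "EXTRACTED" },
  { source: "a", target: "c", relation: "semantically_similar_to", provenance: "INFERRED", confidence_score: 0.9 },
  { source: "a", target: "d", relation: "semantically_similar_to", provenance: "INFERRED", confidence_score: 0.4 },
  { source: "b", target: "d", relation: "CALLS", provenance: "AMBIGUOUS" },
];
const demoTrusted = trustedEdges(demoTagged, 0.7);
```

Keeping provenance on the edge lets a consumer decide per query how much inference to tolerate.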
Lineage & attribution
graphify builds on the foundational work of Safi Shamsi's graphify, extending it from code-structure graphs to a full knowledge & entity-reconciliation lifecycle. Selected review-workflow ideas were adapted from the code-review-graph comparison work (see spec/SPEC_CODE_REVIEW_GRAPH_OPPORUNITY.md). This repository is the maintained TypeScript product line, aligned against upstream Graphify where parity matters; see UPSTREAM_GAP.md for the tracked parity contract.
Reference
Supported assistants
graphify install writes assistant integrations. Pass --platform <name> for non-Claude clients: codex, gemini, copilot, vscode, aider, opencode, claw, droid, trae, trae-cn, cursor, hermes, kimi, kiro, antigravity, windows.
To make an assistant always prefer the graph, run the matching graphify <platform> install (e.g. graphify claude install writes a CLAUDE.md section plus a PreToolUse hook; graphify gemini install (or graphify install --platform gemini) writes GEMINI.md and registers the MCP server; graphify copilot install (or graphify install --platform copilot) installs the global skill for GitHub Copilot CLI). Platforms without PreToolUse hooks (Gemini, Aider, OpenCode, Trae, Droid, and others) use AGENTS.md as the always-on mechanism instead. Uninstall with the matching uninstall, or graphify uninstall to remove all detected integrations.
Invocation differs per client: /graphify . in Claude Code, Gemini CLI, Copilot, and most others, but $graphify in Codex. Codex can also register the read-only graph as an MCP server with codex mcp add graphify -- graphify serve /absolute/path/to/.graphify/graph.json.
Input scope
Scope-aware commands default to --scope auto (committed files plus .graphify/memory/* in a Git repo). --scope tracked adds staged files; --all (alias for --scope all) restores the full recursive folder walk for papers, notes, and media. Inspect before rebuilding:
graphify scope inspect . --scope auto
Add a .graphifyignore file (same syntax as .gitignore) to exclude folders.
MCP server
Expose graph.json as a read-only MCP server for structured graph access (query_graph, get_node, get_neighbors, shortest_path, plus resources like graphify://report, graphify://god-nodes, graphify://audit):
graphify serve .graphify/graph.json
PDF preflight and Mistral OCR
GRAPHIFY_PDF_OCR controls PDF handling: auto (default) runs a local unpdf preflight with pdftotext fallback and calls mistral-ocr only when a PDF has too little extractable text; off keeps the PDF as-is; always forces OCR; dry-run records the decision without calling the API. Mistral OCR requires MISTRAL_API_KEY (override the model with GRAPHIFY_PDF_OCR_MODEL); if missing in auto mode, graphify warns and leaves the source PDF in the semantic input. Sidecars are written under .graphify/converted/pdf/ with provenance back to the original.
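The four modes amount to a small decision table. A minimal sketch, assuming "too little extractable text" can be approximated by a character count; `decidePdfAction`, the action names, and the 500-character default are hypothetical and are not graphify's actual threshold or API.

```typescript
type PdfOcrMode = "auto" | "off" | "always" | "dry-run";
type PdfAction = "use-text-layer" | "run-ocr" | "keep-pdf" | "record-only";

// minChars is a hypothetical stand-in for "too little extractable text".
function decidePdfAction(
  mode: PdfOcrMode,
  extractedChars: number,
  hasApiKey: boolean,
  minChars: number = 500,
): PdfAction {
  switch (mode) {
    case "off":
      return "keep-pdf"; // PDF is passed through as-is
    case "dry-run":
      return "record-only"; // decision is recorded, the OCR API is not called
    case "always":
      return "run-ocr"; // OCR is forced; it still needs MISTRAL_API_KEY to succeed
    case "auto":
      if (extractedChars >= minChars) return "use-text-layer";
      // Without a key, auto mode warns and leaves the source PDF in the semantic input.
      return hasApiKey ? "run-ocr" : "keep-pdf";
  }
}
```

For example, `decidePdfAction("auto", 40, false)` yields `"keep-pdf"`, matching the documented fallback when the key is missing.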
Local audio/video transcription
Transcription uses the published faster-whisper-ts runtime (no Python). Defaults match upstream: Whisper model base, CPU device, int8 compute. Override with GRAPHIFY_WHISPER_MODEL, GRAPHIFY_WHISPER_MODEL_DIR, GRAPHIFY_WHISPER_MODEL_ID, GRAPHIFY_WHISPER_MODEL_REVISION, GRAPHIFY_WHISPER_DEVICE, and GRAPHIFY_WHISPER_COMPUTE_TYPE. URL ingestion goes through yt-dlp; transcripts land under .graphify/transcripts/ and are treated like regular documents.
Optional provider variables
For CI/headless text corpora, semantic extraction can be delegated to a direct provider with graphify extract --backend anthropic|openai|gemini|mistral|cohere|ollama (via the Vercel AI SDK). OLLAMA_BASE_URL overrides the local Ollama URL. Google Workspace export (.gdoc, .gsheet, .gslides) is enabled with GRAPHIFY_GOOGLE_WORKSPACE=1 and the relevant GOOGLE_OAUTH_* credentials. API keys are read only from environment variables and are never written to config, .graphify/, reports, or logs.
Privacy
graphify sends file contents to your assistant's underlying model API for semantic extraction of docs, papers, and images. Code files are processed locally via tree-sitter AST — no code contents leave your machine. Audio/video transcription and PDF text preflight run locally; Mistral OCR is the only PDF-specific network call, and only when OCR mode requires it. Agent-stats transcript parsing is entirely local — transcripts never leave your machine. No telemetry, usage tracking, or analytics. The only network calls are to your platform's model API during extraction, explicit direct-backend extraction, optional Mistral OCR, the optional gh PR lookup in agent-stats wp, and any URLs you explicitly ask graphify to ingest.
Tech stack
Graphology + Louvain (graphology-communities-louvain) + tree-sitter + vis-network, with regex-backed language fallbacks, unpdf, optional pdftotext, optional mistral-ocr, officeparser, turndown, the yt-dlp + ffmpeg + faster-whisper-ts transcription path, and optional Vercel AI SDK direct text backends. No Neo4j required; the default HTML output is fully static.
License
MIT. See LICENSE.
Contributing
Worked examples are the most trust-building contribution. Run the graphify skill on a real corpus, save output to worked/{slug}/, write an honest review.md evaluating what the graph got right and wrong, and submit a PR.
Extraction bugs — open an issue with the input file, the cache entry (.graphify/cache/), and what was missed or invented.
See ARCHITECTURE.md for module responsibilities and how to add a language.