Agentic Dataset Builder
Pure TypeScript CLI for turning local Pi, Codex, and Claude Code history into one validated dataset.parquet file.
Goal
Use this repo when you want an AI coding assistant to do one job end-to-end:
- discover local session history
- normalize it into the local Qwen35-compatible schema
- label records by training use
- write one final parquet dataset
The CLI is native Node.js + TypeScript. It does not require Python.
Fastest path
If the package is published on npm:
npx --registry=https://registry.npmjs.org/ agentic-dataset-builder@0.2.6 --output-root ./out
Default behavior now includes:
- sources: pi,codex,claude
- labels kept: cot_eligible,agent_only,prompt_only
If working from this repo locally:
npm install
npm run build
node dist/cli.js --output-root ./out
What the command does
The CLI will:
- detect local session roots for pi, codex, and claude
- read supported history files
- validate normalized records with Zod
- keep only the labels you requested
- write one final parquet file
- write a manifest and a run log
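The "keep only the labels you requested" step can be pictured as a plain filter over labeled records. The following is a minimal sketch, not the package's actual code: the record shape (LabeledRecord) and the function name (keepRequestedLabels) are hypothetical, and only the label and source names come from this README.

```typescript
// Hypothetical types for illustration; the real schema lives in the package.
type Label = "cot_eligible" | "agent_only" | "prompt_only" | "discard";

interface LabeledRecord {
  id: string;
  source: "pi" | "codex" | "claude";
  label: Label;
}

// Keep only the records whose label is in the requested set.
function keepRequestedLabels(
  records: LabeledRecord[],
  includeLabels: Label[],
): LabeledRecord[] {
  const wanted = new Set<Label>(includeLabels);
  return records.filter((record) => wanted.has(record.label));
}

// Made-up sample data, one record per source.
const sampleRecords: LabeledRecord[] = [
  { id: "a", source: "pi", label: "cot_eligible" },
  { id: "b", source: "codex", label: "agent_only" },
  { id: "c", source: "claude", label: "discard" },
];

const keptRecords = keepRequestedLabels(sampleRecords, [
  "cot_eligible",
  "agent_only",
]);
console.log(keptRecords.map((record) => record.id));
```

With the two labels requested here, the record labeled discard is dropped and the other two pass through unchanged.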
Default source behavior
- pi
  - full agent traces
  - can produce cot_eligible or agent_only
- codex
  - full agent traces
  - usually produces agent_only
- claude
  - local project traces with assistant messages, tool calls, tool results, and visible thinking when present
  - can produce cot_eligible, agent_only, or discard
Claude prompt-only fallback is no longer the default parser path.
Default output
Each run creates one directory:
<output-root>/agentic-dataset-<timestamp>/
  dataset.parquet
  manifest.json
  run.log
Files:
- dataset.parquet: final merged dataset
- manifest.json: source roots, source counts, labels kept, output path
- run.log: step-by-step execution log for debugging
Recommended commands
Pi + Codex + Claude prompt-only (default):
node dist/cli.js --output-root ./out
Pi + Codex only:
node dist/cli.js --output-root ./out --include-sources pi,codex --include-labels cot_eligible,agent_only
Codex + Claude:
node dist/cli.js --output-root ./out --include-sources codex,claude --include-labels cot_eligible,agent_only
Claude only:
node dist/cli.js --output-root ./out --include-sources claude --include-labels cot_eligible,agent_only,discard
Pi only:
node dist/cli.js --output-root ./out --include-sources pi --include-labels cot_eligible,agent_only
Important flags
- --output-root <dir>: required output root
- --include-sources <csv>: any of pi, codex, claude
- --include-labels <csv>: any of cot_eligible, agent_only, prompt_only, discard
  - prompt_only remains available for lossy prompt-history style inputs, but local Claude project traces are now usually labeled cot_eligible, agent_only, or discard
- --pi-root <dir>: override detected Pi session path
- --codex-root <dir>: override detected Codex session path
- --claude-root <dir>: override detected Claude project-history path
- --help: print CLI help
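The <csv> flags take a comma-separated list drawn from a fixed set of values. As a rough illustration of how such a value could be split and checked, here is a minimal sketch. parseCsvFlag is a hypothetical helper, not the CLI's actual parser, and rejecting unknown values with an error is an assumption about behavior that this README does not confirm.

```typescript
// Allowed values as listed in this README.
const SOURCES = ["pi", "codex", "claude"] as const;
const LABELS = ["cot_eligible", "agent_only", "prompt_only", "discard"] as const;

// Hypothetical helper: split a CSV flag value, trim whitespace, drop empty
// entries and duplicates, and reject anything outside the allowed set.
function parseCsvFlag<T extends string>(
  flagName: string,
  raw: string,
  allowed: readonly T[],
): T[] {
  const result: T[] = [];
  for (const piece of raw.split(",")) {
    const value = piece.trim();
    if (value.length === 0) continue;
    if (!(allowed as readonly string[]).includes(value)) {
      throw new Error(
        `${flagName}: unknown value "${value}" (allowed: ${allowed.join(", ")})`,
      );
    }
    if (!result.includes(value as T)) result.push(value as T);
  }
  return result;
}

const sources = parseCsvFlag("--include-sources", "pi,codex", SOURCES);
const labels = parseCsvFlag("--include-labels", "cot_eligible, agent_only", LABELS);
console.log(sources, labels);
```

Validating against a closed list up front means a typo such as "codx" fails immediately instead of silently producing an empty dataset.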
Auto-detected paths
The CLI tries OS-specific defaults automatically.
Typical paths:
- Pi: ~/.pi/agent/sessions
- Codex: ~/.codex/sessions
- Claude: ~/.claude/projects
On Windows it also checks APPDATA and LOCALAPPDATA variants.
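Path detection of this kind amounts to building a short list of candidate directories and probing them in order. The sketch below shows one way to assemble that list. The home-relative paths are the ones listed above. The layout under APPDATA and LOCALAPPDATA is not stated in this README, so reusing the same sub-path there is purely an assumption, and candidateRoots is a hypothetical name.

```typescript
import * as os from "node:os";
import * as path from "node:path";

type SourceName = "pi" | "codex" | "claude";

// Home-relative defaults taken from the "Typical paths" list.
const HOME_RELATIVE: Record<SourceName, string[]> = {
  pi: [".pi", "agent", "sessions"],
  codex: [".codex", "sessions"],
  claude: [".claude", "projects"],
};

// Build an ordered list of directories to probe for one source.
// home, env, and platform are parameters so the function is easy to test.
function candidateRoots(
  source: SourceName,
  home: string = os.homedir(),
  env: Record<string, string | undefined> = process.env,
  platform: string = process.platform,
): string[] {
  const candidates = [path.join(home, ...HOME_RELATIVE[source])];
  if (platform === "win32") {
    // Assumption: the same sub-path is tried under each Windows base dir.
    for (const key of ["APPDATA", "LOCALAPPDATA"]) {
      const base = env[key];
      if (base) candidates.push(path.join(base, ...HOME_RELATIVE[source]));
    }
  }
  return candidates;
}

console.log(candidateRoots("codex"));
```

A caller would then take the first candidate that exists on disk, falling back to --pi-root, --codex-root, or --claude-root when none does.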
Verification checklist
After a run, verify these three things:
- dataset.parquet exists
- manifest.json exists
- run.log does not end with an uncaught error
Typical quick check:
ls ./out/agentic-dataset-*/
Development notes
Useful development commands:
npm run check
npm run test
npm run build
Claude Code / AI assistant contributors should also read:
This repo currently includes:
- Zod validation for source events and final records
- Vitest coverage for core schema and labeling paths
- native parquet writing in TypeScript