Package Exports
- ppef
- ppef/aggregation
- ppef/aggregation/__tests__/aggregators.unit.test
- ppef/aggregation/__tests__/pipeline.unit.test
- ppef/aggregation/aggregators
- ppef/aggregation/index
- ppef/aggregation/pipeline
- ppef/claims
- ppef/claims/index
- ppef/collector
- ppef/collector/__tests__/result-collector.unit.test
- ppef/collector/__tests__/schema.unit.test
- ppef/collector/index
- ppef/collector/result-collector
- ppef/collector/schema
- ppef/executor
- ppef/executor/__tests__/binary-sut.unit.test
- ppef/executor/__tests__/checkpoint-hash-bug.diagnostic.test
- ppef/executor/__tests__/checkpoint-manager.integration.test
- ppef/executor/__tests__/checkpoint-manager.unit.test
- ppef/executor/__tests__/checkpoint-merge-bug.diagnostic.test
- ppef/executor/__tests__/checkpoint-merge-bug.unit.test
- ppef/executor/__tests__/checkpoint-storage.unit.test
- ppef/executor/__tests__/executor.unit.test
- ppef/executor/__tests__/memory-monitor.unit.test
- ppef/executor/__tests__/parallel-checkpoint-merge.integration.test
- ppef/executor/__tests__/parallel-executor.integration.test
- ppef/executor/__tests__/parallel-executor.unit.test
- ppef/executor/__tests__/resource-calculator.unit.test
- ppef/executor/__tests__/run-id.unit.test
- ppef/executor/__tests__/worker-entry.integration.test
- ppef/executor/__tests__/worker-entry.unit.test
- ppef/executor/__tests__/worker-threads-executor.unit.test
- ppef/executor/binary-sut
- ppef/executor/checkpoint-manager
- ppef/executor/checkpoint-storage
- ppef/executor/checkpoint-types
- ppef/executor/executor
- ppef/executor/index
- ppef/executor/memory-monitor
- ppef/executor/parallel-executor
- ppef/executor/resource-calculator
- ppef/executor/run-id
- ppef/executor/worker-entry
- ppef/executor/worker-executor
- ppef/executor/worker-threads-executor
- ppef/registry
- ppef/registry/case-registry
- ppef/registry/index
- ppef/registry/sut-registry
- ppef/renderers
- ppef/renderers/index
- ppef/renderers/latex-renderer
- ppef/renderers/types
- ppef/robustness
- ppef/robustness/__tests__/perturbations.unit.test
- ppef/robustness/index
- ppef/robustness/perturbations
- ppef/statistical
- ppef/types
- ppef/types/aggregate
- ppef/types/case
- ppef/types/claims
- ppef/types/evaluator
- ppef/types/index
- ppef/types/perturbation
- ppef/types/result
- ppef/types/sut
Readme
PPEF - Portable Programmatic Evaluation Framework
A claim-driven, deterministic evaluation framework for experiments. PPEF provides a structured approach to testing and validating software components through reusable test cases, statistical aggregation, and claim-based evaluation.
Published npm package with dual ESM/CJS output. Single runtime dependency: commander.
Features
- Type-safe: Strict TypeScript with generic SUT, Case, and Evaluator abstractions
- Registry: Centralized registries for Systems Under Test (SUTs) and evaluation cases with role/tag filtering
- Execution: Deterministic execution with worker threads, checkpointing, memory monitoring, and binary SUT support
- Statistical: Mann-Whitney U test, Cohen's d, confidence intervals
- Aggregation: Summary stats, pairwise comparisons, and rankings across runs
- Evaluation: Four built-in evaluators — claims, robustness, metrics, and exploratory
- Rendering: LaTeX table generation for thesis integration
- CLI: Five commands for running, validating, planning, aggregating, and evaluating experiments
Installation
# Install as a dependency
pnpm add ppef
# Or use locally for development
git clone https://github.com/Mearman/ppef.git
cd ppef
pnpm install
pnpm build

Development
pnpm install # Install dependencies
pnpm build # TypeScript compile + CJS wrapper generation
pnpm typecheck # Type-check only (tsc --noEmit)
pnpm lint # ESLint + Prettier with auto-fix
pnpm test # Run all tests with coverage (c8 + tsx + Node native test runner)

Run a single test file:
npx tsx --test src/path/to/file.test.ts

CLI (after build):
ppef run # Execute experiments
ppef validate # Validate configuration
ppef plan # Dry-run execution plan
ppef aggregate # Post-process results
ppef evaluate # Run evaluators on results

Architecture
Data Flow Pipeline
SUTs + Cases (Registries)
→ Executor (runs SUTs against cases, deterministic runIds)
→ EvaluationResult (canonical schema)
→ ResultCollector (validates + filters)
→ Aggregation Pipeline (summary stats, comparisons, rankings)
→ Evaluators (claims, robustness, metrics, exploratory)
→ Renderers (LaTeX tables for thesis)

Module Map (src/)
| Module | Purpose |
|---|---|
| `types/` | All canonical type definitions (result, sut, case, claims, evaluator, aggregate, perturbation) |
| `registry/` | SUTRegistry and CaseRegistry — generic registries with role/tag filtering |
| `executor/` | Orchestrator with worker threads, checkpointing, memory monitoring, binary SUT support |
| `collector/` | Result aggregation and JSON schema validation |
| `statistical/` | Mann-Whitney U test, Cohen's d, confidence intervals |
| `aggregation/` | computeSummaryStats(), computeComparison(), computeRankings(), pipeline |
| `evaluators/` | Four built-in evaluators + extensible registry (see below) |
| `claims/` | Claim type definitions |
| `robustness/` | Perturbation configs and robustness metric types |
| `renderers/` | LaTeX table renderer |
| `cli/` | Five commands with config loading, module loading, output writing |
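To make the effect-size measure in `statistical/` concrete, here is a standalone sketch of Cohen's d using the pooled standard deviation. The helper names (`mean`, `sampleVariance`, `cohensD`) are hypothetical and are not the ppef API; this only illustrates the underlying formula.

```typescript
// Standalone sketch of Cohen's d (pooled standard deviation).
// Hypothetical helper names -- not the ppef statistical/ API.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

// d = (mean(a) - mean(b)) / pooledStdDev
function cohensD(a: number[], b: number[]): number {
  const pooled = Math.sqrt(
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
      (a.length + b.length - 2),
  );
  return (mean(a) - mean(b)) / pooled;
}
```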
Key Abstractions
SUT (SUT<TInputs, TResult>): Generic System Under Test. Has id, config, and run(inputs). Roles: primary, baseline, oracle.
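A minimal sketch of what an SUT might look like. The interface shape below is inferred from this description (only `id`, `config`, `run`, and the role names are stated above), so treat it as an assumption rather than the published ppef type.

```typescript
// Hypothetical sketch of the SUT abstraction -- inferred from the README,
// not copied from the published ppef types.
interface SUT<TInputs, TResult> {
  id: string;
  role: 'primary' | 'baseline' | 'oracle';
  config: Record<string, unknown>;
  run(inputs: TInputs): TResult | Promise<TResult>;
}

// A trivial baseline SUT that sorts numbers with the native sort.
const baselineSort: SUT<number[], number[]> = {
  id: 'baseline-native-sort',
  role: 'baseline',
  config: {},
  run: (inputs) => [...inputs].sort((a, b) => a - b),
};
```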
CaseDefinition (CaseDefinition<TInput, TInputs>): Two-phase resource factory — getInput() loads a resource once, getInputs() returns algorithm-specific inputs.
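The two-phase factory might be used as sketched below; the field names other than `getInput`/`getInputs` are illustrative assumptions, not the real ppef type.

```typescript
// Hypothetical sketch of the two-phase case factory described above.
// Field names beyond getInput()/getInputs() are assumptions.
interface CaseDefinition<TInput, TInputs> {
  id: string;
  tags: string[];
  getInput(): TInput | Promise<TInput>; // phase 1: load the resource once
  getInputs(input: TInput): TInputs;    // phase 2: derive algorithm-specific inputs
}

// Example: load a text fixture once, then derive token arrays from it.
const tokenCase: CaseDefinition<string, string[]> = {
  id: 'tokens-small',
  tags: ['unit', 'text'],
  getInput: () => 'the quick brown fox',
  getInputs: (text) => text.split(' '),
};
```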
Evaluator (Evaluator<TConfig, TInput, TOutput>): Extensible evaluation with validateConfig(), evaluate(), summarize(). Four built-in types:
- ClaimsEvaluator — tests explicit hypotheses with statistical significance
- RobustnessEvaluator — sensitivity analysis under perturbations
- MetricsEvaluator — multi-criterion threshold/baseline/target-range evaluation
- ExploratoryEvaluator — hypothesis-free analysis (rankings, pairwise comparisons, correlations, case-class effects)
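A custom evaluator plugged into this extension point might look like the sketch below. The method signatures are guesses from the three method names listed above; the real generic parameters and return types in ppef may differ.

```typescript
// Hypothetical evaluator sketch -- method signatures are inferred from the
// README (validateConfig/evaluate/summarize), not the published ppef API.
interface Evaluator<TConfig, TInput, TOutput> {
  id: string;
  validateConfig(config: TConfig): string[]; // returns validation errors, if any
  evaluate(config: TConfig, input: TInput): TOutput;
  summarize(outputs: TOutput[]): string;
}

// A toy threshold check in the spirit of MetricsEvaluator.
const thresholdEvaluator: Evaluator<{ max: number }, number, boolean> = {
  id: 'toy-threshold',
  validateConfig: (c) =>
    Number.isFinite(c.max) ? [] : ['max must be a finite number'],
  evaluate: (c, value) => value <= c.max,
  summarize: (outputs) =>
    `${outputs.filter(Boolean).length}/${outputs.length} passed`,
};
```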
EvaluationResult: Canonical output schema capturing run identity (deterministic SHA-256 runId), correctness, metrics, output artefacts, and provenance.
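A deterministic SHA-256 run identifier of this kind can be sketched with Node's `node:crypto` module. The exact fields ppef hashes are not documented here, so the `parts` object below is purely illustrative.

```typescript
import { createHash } from 'node:crypto';

// Illustrative sketch of a deterministic runId: hash canonicalised inputs.
// The fields chosen here are assumptions, not ppef's actual hash inputs.
function computeRunId(parts: {
  sutId: string;
  caseId: string;
  config: unknown;
}): string {
  // Serialising with a fixed key order keeps the hash stable across runs.
  const canonical = JSON.stringify({
    sutId: parts.sutId,
    caseId: parts.caseId,
    config: parts.config,
  });
  return createHash('sha256').update(canonical).digest('hex');
}
```

Identical inputs always yield the same 64-character hex id, which is what makes re-runs and checkpoint merges comparable.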
Subpath Exports
Each module is independently importable:
import { SUTRegistry } from 'ppef/registry';
import { EvaluationResult } from 'ppef/types';
import { computeSummaryStats } from 'ppef/aggregation';

Available subpaths: ppef/types, ppef/registry, ppef/executor, ppef/collector, ppef/statistical, ppef/aggregation, ppef/evaluators, ppef/claims, ppef/robustness, ppef/renderers.
Conventions
- TypeScript strict mode, ES2023 target, ES modules
- Node.js native test runner (`node:test` + `node:assert`) — not Vitest/Jest
- Coverage via c8 (text + html + json-summary in `./coverage/`)
- Conventional commits enforced via commitlint + husky
- Semantic release from all branches
- No `any` types — use `unknown` with type guards
- Executor produces deterministic `runId` via SHA-256 hash of inputs
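The "no `any`, use `unknown` with type guards" convention looks like this in practice. The `MetricSample` shape below is a made-up example, not a ppef type.

```typescript
// Illustrative type guard matching the "unknown + type guards" convention.
// MetricSample is a hypothetical shape, not part of ppef.
interface MetricSample {
  name: string;
  value: number;
}

function isMetricSample(x: unknown): x is MetricSample {
  return (
    typeof x === 'object' &&
    x !== null &&
    typeof (x as Record<string, unknown>).name === 'string' &&
    typeof (x as Record<string, unknown>).value === 'number'
  );
}
```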
License
MIT