    PPEF - Portable Programmatic Evaluation Framework

    A claim-driven, deterministic evaluation framework for experiments. PPEF provides a structured approach to testing and validating software components through reusable test cases, statistical aggregation, and claim-based evaluation.

    Published npm package with dual ESM/CJS output. Single runtime dependency: commander.

    Features

    • Type-safe: Strict TypeScript with generic SUT, Case, and Evaluator abstractions
    • Registry: Centralized registries for Systems Under Test (SUTs) and evaluation cases with role/tag filtering
    • Execution: Deterministic execution with worker threads, checkpointing, memory monitoring, and binary SUT support
    • Statistical: Mann-Whitney U test, Cohen's d, confidence intervals
    • Aggregation: Summary stats, pairwise comparisons, and rankings across runs
    • Evaluation: Four built-in evaluators — claims, robustness, metrics, and exploratory
    • Rendering: LaTeX table generation for thesis integration
    • CLI: Five commands for running, validating, planning, aggregating, and evaluating experiments

    Installation

    # Install as a dependency
    pnpm add ppef
    
    # Or use locally for development
    git clone https://github.com/Mearman/ppef.git
    cd ppef
    pnpm install
    pnpm build

    Development

    pnpm install              # Install dependencies
    pnpm build                # TypeScript compile + CJS wrapper generation
    pnpm typecheck            # Type-check only (tsc --noEmit)
    pnpm lint                 # ESLint + Prettier with auto-fix
    pnpm test                 # Run all tests with coverage (c8 + tsx + Node native test runner)

    Run a single test file:

    npx tsx --test src/path/to/file.test.ts

    CLI (after build):

    ppef experiment.json   # Run experiment (default command)
    ppef run config.json   # Explicit run command
    ppef validate          # Validate configuration
    ppef plan              # Dry-run execution plan
    ppef aggregate         # Post-process results
    ppef evaluate          # Run evaluators on results

    Quick Start

    Create a minimal experiment with three files and a config:

    experiment.json

    {
      "experiment": {
        "name": "string-length",
        "description": "Compare string length implementations"
      },
      "executor": {
        "repetitions": 3
      },
      "suts": [
        {
          "id": "builtin-length",
          "module": "./sut.mjs",
          "exportName": "createSut",
          "registration": {
            "name": "Built-in .length",
            "version": "1.0.0",
            "role": "primary"
          }
        }
      ],
      "cases": [
        {
          "id": "hello-world",
          "module": "./case.mjs",
          "exportName": "createCase"
        }
      ],
      "metricsExtractor": {
        "module": "./metrics.mjs",
        "exportName": "extract"
      },
      "output": {
        "path": "./results"
      }
    }

    sut.mjs — System Under Test factory

    export function createSut() {
      return {
        id: "builtin-length",
        config: {},
        run: async (input) => ({ length: input.text.length }),
      };
    }

    case.mjs — Test case definition

    export function createCase() {
      return {
        case: {
          caseId: "hello-world",
          caseClass: "basic",
          name: "Hello World",
          version: "1.0.0",
          inputs: { text: "hello world" },
        },
        getInput: async () => ({ text: "hello world" }),
        getInputs: () => ({ text: "hello world" }),
      };
    }

    metrics.mjs — Metrics extractor

    export function extract(result) {
      return { length: result.length ?? 0 };
    }

    Run it:

    npx ppef experiment.json

    Workflows

    The typical pipeline chains CLI commands: validate, run, aggregate, then evaluate.

    ppef validate config.json
        → ppef run config.json
            → ppef aggregate results.json
                → ppef evaluate aggregates.json -t claims -c claims.json

    1. Validate Configuration

    Check an experiment config for errors before running:

    ppef validate experiment.json

    2. Preview Execution Plan

    See what would run without executing (SUTs × cases × repetitions):

    ppef plan experiment.json

    3. Run an Experiment

    Execute all SUTs against all cases with worker thread isolation:

    ppef run experiment.json
    ppef run experiment.json -o ./output -j 4 --verbose
    ppef run experiment.json --unsafe-in-process  # No worker isolation (debugging only)

    The output directory contains a results JSON and (by default) an aggregates JSON.

    4. Aggregate Results

    Compute summary statistics, pairwise comparisons, and rankings from raw results:

    ppef aggregate results.json
    ppef aggregate results.json -o aggregates.json --compute-comparisons

    5. Evaluate Results

    Run evaluators against aggregated (or raw) results. Each evaluator type takes a JSON config file.

    Claims — Test Explicit Hypotheses

    Test whether SUT A outperforms baseline B on a given metric with statistical significance:

    ppef evaluate aggregates.json -t claims -c claims.json -v

    claims.json:

    {
      "claims": [
        {
          "claimId": "C001",
          "description": "Primary has greater accuracy than baseline",
          "sut": "primary-sut",
          "baseline": "baseline-sut",
          "metric": "accuracy",
          "direction": "greater",
          "scope": "global"
        }
      ],
      "significanceLevel": 0.05
    }

    Metrics — Threshold, Baseline, and Range Criteria

    Evaluate metrics against fixed thresholds, baselines, or target ranges:

    ppef evaluate aggregates.json -t metrics -c metrics-config.json

    metrics-config.json:

    {
      "criteria": [
        {
          "criterionId": "exec-time",
          "description": "Execution time under 1000ms",
          "type": "threshold",
          "metric": "executionTime",
          "sut": "*",
          "threshold": { "operator": "lt", "value": 1000 }
        },
        {
          "criterionId": "f1-range",
          "description": "F1 score in [0.8, 1.0]",
          "type": "target-range",
          "metric": "f1Score",
          "sut": "*",
          "targetRange": { "min": 0.8, "max": 1.0, "minInclusive": true, "maxInclusive": true }
        }
      ]
    }

    Robustness — Sensitivity Under Perturbations

    Measure how performance degrades under perturbations at varying intensity levels:

    ppef evaluate results.json -t robustness -c robustness-config.json

    robustness-config.json:

    {
      "metrics": ["executionTime", "accuracy"],
      "perturbations": ["edge-removal", "noise", "seed-shift"],
      "intensityLevels": [0.1, 0.2, 0.3, 0.4, 0.5],
      "runsPerLevel": 10
    }

    Output Formats

    All evaluators support JSON and LaTeX output:

    ppef evaluate aggregates.json -t claims -c claims.json -f latex
    ppef evaluate aggregates.json -t metrics -c metrics.json -f json -o results.json

    Inline Evaluators

    Evaluator configs can be embedded directly in the experiment config via the optional evaluators field, making the config self-contained:

    {
      "experiment": { "name": "my-experiment" },
      "executor": { "repetitions": 10 },
      "suts": [ ... ],
      "cases": [ ... ],
      "metricsExtractor": { ... },
      "output": { "path": "./results" },
      "evaluators": [
        {
          "type": "claims",
          "config": {
            "claims": [ ... ]
          }
        }
      ]
    }

    JSON Schema Validation

    Experiment configs can reference the generated schema for IDE autocompletion:

    {
      "$schema": "./ppef.schema.json",
      "experiment": { ... }
    }

    Standalone evaluator configs reference schema $defs:

    {
      "$schema": "./ppef.schema.json#/$defs/ClaimsEvaluatorConfig",
      "claims": [ ... ]
    }

    Cross-Language Specification

    PPEF is designed for cross-language interoperability. A Python runner can produce results consumable by the TypeScript aggregator, and vice versa.

    The specification lives in spec/ and comprises three layers:

    • JSON Schema (ppef.schema.json): Machine-readable type definitions for all input and output types
    • Conformance Vectors (spec/conformance/): Pinned input/output pairs that any implementation must reproduce
    • Prose Specification (spec/README.md): Execution semantics, module contracts, statistical algorithms

    All output types are available as $defs in the schema, enabling validation from any language:

    ppef.schema.json#/$defs/EvaluationResult
    ppef.schema.json#/$defs/ResultBatch
    ppef.schema.json#/$defs/AggregationOutput
    ppef.schema.json#/$defs/ClaimEvaluationSummary
    ppef.schema.json#/$defs/MetricsEvaluationSummary
    ppef.schema.json#/$defs/RobustnessAnalysisOutput
    ppef.schema.json#/$defs/ExploratoryEvaluationSummary
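
    For example, an aggregation output file could be checked programmatically against one of these definitions. The following is a minimal sketch assuming the Ajv validator and local file paths; it is not part of the PPEF API.

    // Sketch: validate an aggregation output against the schema's $defs entries.
    // Assumes Ajv is installed and ppef.schema.json sits next to the results;
    // depending on the schema's declared draft, the Ajv 2020 build may be needed.
    import Ajv from "ajv";
    import { readFileSync } from "node:fs";

    const schema = JSON.parse(readFileSync("./ppef.schema.json", "utf8"));
    const output = JSON.parse(readFileSync("./aggregates.json", "utf8"));

    const ajv = new Ajv({ strict: false });
    ajv.addSchema(schema, "ppef");

    const validate = ajv.compile({ $ref: "ppef#/$defs/AggregationOutput" });
    if (!validate(output)) {
      console.error(validate.errors);
    }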

    Run ID generation uses RFC 8785 (JSON Canonicalization Scheme) for deterministic cross-language hashing. Libraries exist for Python (jcs), Rust (serde_jcs), Go (go-jcs), and others.
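
    As a rough illustration of that scheme, a run identifier can be derived by hashing JCS-canonicalized inputs. The sketch below uses the canonicalize npm package as one RFC 8785 implementation; the exact fields PPEF includes in the hash are assumptions.

    // Sketch: deterministic run ID as a SHA-256 hash of RFC 8785 (JCS)
    // canonicalized JSON. The 'canonicalize' package is one JS implementation;
    // the fields hashed here are illustrative, not PPEF's actual inputs.
    import { createHash } from "node:crypto";
    import canonicalize from "canonicalize";

    export function computeRunId(sutId: string, caseId: string, inputs: unknown): string {
      const canonical = canonicalize({ sutId, caseId, inputs }) ?? "";
      return createHash("sha256").update(canonical).digest("hex");
    }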

    Architecture

    Data Flow Pipeline

    SUTs + Cases (Registries)
        → Executor (runs SUTs against cases, deterministic runIds)
        → EvaluationResult (canonical schema)
        → ResultCollector (validates + filters)
        → Aggregation Pipeline (summary stats, comparisons, rankings)
        → Evaluators (claims, robustness, metrics, exploratory)
        → Renderers (LaTeX tables for thesis)

    Module Map (src/)

    • types/: All canonical type definitions (result, sut, case, claims, evaluator, aggregate, perturbation)
    • registry/: SUTRegistry and CaseRegistry — generic registries with role/tag filtering
    • executor/: Orchestrator with worker threads, checkpointing, memory monitoring, binary SUT support
    • collector/: Result aggregation and JSON schema validation
    • statistical/: Mann-Whitney U test, Cohen's d, confidence intervals
    • aggregation/: computeSummaryStats(), computeComparison(), computeRankings(), pipeline
    • evaluators/: Four built-in evaluators + extensible registry (see below)
    • claims/: Claim type definitions
    • robustness/: Perturbation configs and robustness metric types
    • renderers/: LaTeX table renderer
    • cli/: Five commands with config loading, module loading, output writing

    Key Abstractions

    SUT (SUT<TInputs, TResult>): Generic System Under Test. Has id, config, and run(inputs). Roles: primary, baseline, oracle.

    CaseDefinition (CaseDefinition<TInput, TInputs>): Two-phase resource factory — getInput() loads a resource once, getInputs() returns algorithm-specific inputs.
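
    To make the generic parameters concrete, here is the Quick Start example rewritten with explicit types. This is a sketch only: the interface shapes below are inferred from the descriptions above rather than copied from the package, so treat the exact signatures as assumptions.

    // Sketch only: these interface shapes are assumptions inferred from the
    // descriptions above, not copied from the package's own type definitions.
    interface SUT<TInputs, TResult> {
      id: string;
      config: Record<string, unknown>;
      run(inputs: TInputs): Promise<TResult>;
    }

    interface CaseDefinition<TInput, TInputs> {
      case: {
        caseId: string;
        caseClass: string;
        name: string;
        version: string;
        inputs: TInputs;
      };
      getInput(): Promise<TInput>; // loads a resource once
      getInputs(): TInputs;        // returns algorithm-specific inputs
    }

    type Inputs = { text: string };
    type Result = { length: number };

    export function createSut(): SUT<Inputs, Result> {
      return {
        id: "builtin-length",
        config: {},
        run: async (inputs) => ({ length: inputs.text.length }),
      };
    }

    export function createCase(): CaseDefinition<Inputs, Inputs> {
      return {
        case: { caseId: "hello-world", caseClass: "basic", name: "Hello World", version: "1.0.0", inputs: { text: "hello world" } },
        getInput: async () => ({ text: "hello world" }),
        getInputs: () => ({ text: "hello world" }),
      };
    }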

    Evaluator (Evaluator<TConfig, TInput, TOutput>): Extensible evaluation with validateConfig(), evaluate(), summarize(). Four built-in types:

    • ClaimsEvaluator — tests explicit hypotheses with statistical significance
    • RobustnessEvaluator — sensitivity analysis under perturbations
    • MetricsEvaluator — multi-criterion threshold/baseline/target-range evaluation
    • ExploratoryEvaluator — hypothesis-free analysis (rankings, pairwise comparisons, correlations, case-class effects)

    EvaluationResult: Canonical output schema capturing run identity (deterministic SHA-256 runId), correctness, metrics, output artefacts, and provenance.
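
    A custom evaluator follows the same validateConfig() / evaluate() / summarize() contract. The sketch below is illustrative: the method names come from the description above, while the concrete argument and return types (and the evaluator type name) are assumptions.

    // Sketch of a custom evaluator following the validateConfig / evaluate /
    // summarize contract described above; all types here are assumptions.
    interface LatencyBudgetConfig {
      maxMeanExecutionTime: number; // milliseconds
    }

    export const latencyBudgetEvaluator = {
      type: "latency-budget",

      validateConfig(config: unknown): config is LatencyBudgetConfig {
        return (
          typeof config === "object" &&
          config !== null &&
          typeof (config as LatencyBudgetConfig).maxMeanExecutionTime === "number"
        );
      },

      // `summary` stands in for whatever aggregated input the evaluator receives.
      evaluate(config: LatencyBudgetConfig, summary: { sutId: string; meanExecutionTime: number }) {
        return {
          sutId: summary.sutId,
          passed: summary.meanExecutionTime <= config.maxMeanExecutionTime,
        };
      },

      summarize(results: Array<{ passed: boolean }>) {
        return { passRate: results.filter((r) => r.passed).length / results.length };
      },
    };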

    Subpath Exports

    Each module is independently importable:

    import { SUTRegistry } from 'ppef/registry';
    import { EvaluationResult } from 'ppef/types';
    import { computeSummaryStats } from 'ppef/aggregation';

    Available subpaths: ppef/types, ppef/registry, ppef/executor, ppef/collector, ppef/statistical, ppef/aggregation, ppef/evaluators, ppef/claims, ppef/robustness, ppef/renderers.

    Conventions

    • TypeScript strict mode, ES2023 target, ES modules
    • Node.js native test runner (node:test + node:assert) — not Vitest/Jest
    • Coverage via c8 (text + html + json-summary in ./coverage/)
    • Conventional commits enforced via commitlint + husky
    • Semantic release from main branch
    • No any types — use unknown with type guards
    • Executor produces deterministic runId via SHA-256 hash of RFC 8785 (JCS) canonicalized inputs

    License

    MIT