Package Exports

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (e2e-engineering) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

e2e-engineering

Master engineering orchestrator — drives a Task from idea to passing E2E across three phases: pre-implementation (idea → approved PRD), implementation (vertical-slice TDD loop → green tests), post-implementation (review + human QA). Five hard gates, a depends_on slice DAG, and .e2e-engineering/ state files keep the flow honest. The essay below ("AI-Engineering") is the philosophy this skill encodes.

Install

npx e2e-engineering init                     # auto-detect the agent in this project
npx e2e-engineering init --target claude     # full skill → .claude/skills/e2e-engineering/
npx e2e-engineering init --target cursor     # .cursor/rules/e2e-engineering.mdc + AGENTS.md
npx e2e-engineering init --target codex      # AGENTS.md
npx e2e-engineering init --target opencode   # AGENTS.md
npx e2e-engineering init --target all        # everything

Flags: --dest <dir> · --force · --dry-run. Auto-detect: .claude/ → claude · .cursor/ → cursor · else → codex. An existing AGENTS.md is never clobbered (writes AGENTS.e2e-engineering.md).

In Claude Code: /e2e-engineering. Also triggers on "ship-it", "ship it", "implement feature X", "write e2e for X", "build this end to end", "run the full flow".

Fidelity

	Claude Code	Codex / OpenCode / Cursor
Phases, 5 gates, DAG, TDD loop, state files, constitution	yes	yes
Parallel slice execution (worktree fan-out)	yes	no — sequential
Subagent dispatch, 65% auto-checkpoint, `/run`+`/verify` gate 5	yes	manual

Portable targets run slices one at a time in dependency order; everything else is identical.

Claude marketplace

Plugin lives in dist/claude-plugin/. Once pushed to a GitHub repo: /plugin marketplace add <owner>/<repo> then /plugin install e2e-engineering@e2e-engineering.

MIT.

AI-Engineering

How to Build Software with AI Agents

Core principle

The main lesson across the files is simple: AI does not remove the need for software engineering discipline. It makes discipline more important.

The workflow is not “ask the AI to build everything and hope.” The workflow is:

Human clarifies the idea
Human and AI align on language and architecture
AI helps produce a PRD
PRD becomes small vertical issues
Agents implement with tests
Agents or humans review
Human performs QA
New issues are created
The loop repeats

AI changes the tools, but not the fundamentals: clear requirements, good modular design, small tasks, feedback loops, testing, QA, and review still matter.

Part 1 — Prepare your mindset: AI agents are not magic engineers

The files repeatedly describe agents as useful but constrained. They can write code quickly, explore repositories, implement issues, and even review each other’s work, but they do not naturally carry long-term memory across sessions. That means you need to give them process, structure, documentation, and feedback loops.

A helpful mental model is:

Human = strategic programmer
AI agent = tactical programmer

The human decides what matters, what trade-offs are acceptable, what the system should become, and where quality boundaries belong. The agent executes tactical work inside that structure. The “de-slop” file makes this very clear: architecture improvement is not something you simply run AFK; it requires judgment from the programmer above the agent.

Part 2 — Make your codebase ready for AI

2.1 Why architecture matters more with AI

A messy codebase makes AI worse. If the file system does not reflect the mental model of the application, the AI enters the repo with no prior memory and sees only scattered files. It does not automatically know which modules belong together, which concepts are central, or where responsibilities live.

So before expecting good AI output, you need a codebase that is:

Easy to navigate
Easy to test
Organized around meaningful modules
Built around clear interfaces
Protected by feedback loops

The files argue that the structure of the codebase is often more influential than prompts or instruction files. If the system is hard to change, the agent will struggle to change it safely.

2.2 Use deep modules

A central architectural idea is the deep module.

A module is a unit of application behavior: a group of components, functions, services, or capabilities. A module has an interface, which is what callers need to know to use it, and an implementation, which is the internal code that performs the work.

A deep module hides a lot of implementation behind a relatively simple interface. A shallow module exposes a complex interface while hiding very little implementation. Deep modules are better because they give the caller more capability with less surface area to understand.

A practical way to think about it:

Bad for AI:
Many tiny files
Unclear relationships
Hidden dependencies
Business rules spread everywhere

Good for AI:
Larger meaningful modules
Clear public interfaces
Tests around module boundaries
Implementation details hidden inside

Deep modules give you two major benefits: locality and leverage. Locality means related changes and bugs concentrate in one place. Leverage means callers get more behavior per unit of interface they need to learn.

2.3 Define seams and adapters

A seam is the boundary where one module talks to another. It is often the best place to test. For example, if a service depends on time, you can define a clock interface and use a real clock in production but a fake clock in tests. The fake clock is an adapter that satisfies the same interface.

This matters because agents need reliable places to test behavior. If your seams are unclear, the AI does not know where to write tests or how to isolate behavior.

A good module should therefore have:

A clear public interface
A small number of meaningful exported functions
Tests at the boundary
Adapters for external dependencies
Internal implementation hidden from callers

2.4 Run architecture improvement regularly

The “de-slop” workflow suggests using an architecture-improvement process to identify shallow modules, duplicated concepts, poor locality, missing seams, and untested parallel implementations. In the example, the AI identifies places where frontend and backend logic could drift because two parallel implementations lack a shared seam.

The important part: do not let the AI blindly refactor the whole codebase. Let it surface candidates, then you choose which refactor matters.

A useful prompt pattern:

Explore this codebase for architecture-deepening opportunities.
Look for shallow modules, duplicated business rules, unclear seams,
poor locality, and places where tests cannot easily be written.
Do not implement yet. Give me candidates and explain the trade-offs.

Then choose one candidate and ask the AI to propose:

The new module boundary
The public interface
The implementation location
The tests needed
The migration plan
The risks

Part 3 — Establish shared language before building

3.1 Why “Grill Me” is useful (@mattpocock)

The original grill-me skill asks the AI to interview the user relentlessly until both sides reach a shared understanding. It walks down the design tree and resolves dependencies between decisions one by one.

The goal is not to move fast immediately. The goal is to prevent the AI from implementing the wrong thing quickly.

A simple version of the prompt:

Interview me relentlessly about every aspect of this plan
until we reach a shared understanding.
Walk down each branch of the design tree.
Resolve dependencies between decisions one by one.
If a question can be answered by exploring the codebase,
explore the codebase instead of asking me.

Use this when an idea is still vague.

3.2 Prefer “Grill with Docs” when there is a codebase (@mattpocock)

The newer workflow replaces pure grill-me with grill-with-docs when a codebase exists. The problem with grill-me alone is that good terminology may emerge during the conversation but not get documented. Then the user has to re-explain the same domain concepts again in future sessions.

Grill-with-docs adds documentation to the alignment process. It looks for a context.md file, uses existing shared language, challenges fuzzy terms, cross-references with code, and updates the documentation as the conversation progresses.

Use this structure:

/context.md
  - Domain vocabulary
  - Core entities
  - Definitions
  - Relationships
  - Terms users see in the UI
  - Terms developers use in code

The purpose is to align:

Human language
Code language
Agent language
User-facing language

When all four match, the AI needs fewer words to understand your intent and is more likely to generate code that fits the domain.

3.3 Create ADRs for important decisions

Some decisions are not just vocabulary. They are architectural trade-offs. For those, use ADRs: architectural decision records.

The files suggest creating ADRs when a decision is:

Hard to reverse
Surprising without context
The result of a real trade-off
Likely to affect future implementation

This prevents future agents from undoing decisions because they do not understand why they were made.

A simple ADR template:

## ADR: [Decision title]

### Context

What problem or trade-off led to this decision?

### Decision

What did we decide?

### Consequences

What becomes easier?
What becomes harder?
What should future agents avoid changing casually?

Part 4 — Follow the 7 phases of AI-driven development

One file lays out seven phases of AI-driven development:

1. Idea
2. Research
3. Prototype
4. PRD
5. Implementation planning
6. Execution
7. QA

These phases can be used for a full app, a feature, a bug fix, or a refactor.

Phase 1 — Start with the idea

The idea can be broad or narrow. It might be a full application, a feature, a bug fix, or a refactor. The important thing is not to jump straight from idea to implementation. The idea is just the starting point.

Start by writing:

What I want to change:
Why I want to change it:
Who it affects:
What must remain true:
What I am unsure about:

Then run a grill-with-docs session.

Phase 2 — Research

Use research when the task depends on external APIs, unfamiliar libraries, complex integration details, or parts of the repo that are difficult to explore repeatedly. The research should be cached in a temporary asset like research.md, so future agents do not need to rediscover the same information from scratch.

But research can rot. The files warn that research usually belongs to the lifetime of a sprint or idea, not permanently. If it gets stale, it can mislead the agent.

A good research.md contains:

External API behavior
Relevant docs
Constraints
Known gotchas
Example calls
Integration risks
Decisions already made

Phase 3 — Prototype

Prototype when you need concrete feedback before writing the PRD. This is especially important for UI, UX, state machines, business logic, or external service integration.

The prototype is not the final implementation. It is a learning tool.

Use prototypes to answer questions like:

Which UI direction feels right?
Does this state machine make sense?
Can this API integration actually work?
Is this interaction too confusing?
What implementation path has the fewest unknowns?

The changelog file also describes a /prototype skill for throwaway prototypes, including UI variations and small terminal apps for testing logic. The core philosophy is: prototype first, then hand off to an implementation agent.

Phase 4 — Write the PRD

A PRD is the destination document. It describes where the work is going, not every tiny step to get there. The files describe PRDs as containing problem statements, proposed solutions, user stories, implementation decisions, and testing decisions.

A strong PRD should include:

## PRD: [Feature Name]

### Problem

What is broken, missing, annoying, or valuable?

### Goal

What should be true when this is complete?

### Non-goals

What are we intentionally not doing?

### User stories

As a [user], I want [behavior], so that [outcome].

### Implementation decisions

What has already been decided?
What constraints must be respected?

### Testing decisions

What behaviors must be tested?
Which tests should be unit, integration, or visual?

### Risks

What could go wrong?

### Acceptance criteria

How will we know this is done?

The files emphasize that testing decisions inside the PRD help agents follow TDD and create feedback loops during implementation.

Phase 5 — Turn the PRD into vertical issues

The PRD is the destination. The issues are the journey.

A major mistake is breaking work into horizontal layers:

Task 1: database
Task 2: backend
Task 3: frontend
Task 4: tests

This delays feedback. Instead, the files recommend vertical slices: each task should cut through the necessary layers and produce something testable.

A vertical slice might include:

Small schema change
Service function
UI behavior
Tests
Acceptance criteria

The files connect this to the “tracer bullet” idea: pick slices that reveal unknowns early. If a risky integration might fail, make that one of the first slices.

A good issue should include:

## Issue: [Small vertical slice]

### Parent PRD

Link to PRD

### What to build

Precise task description

### Acceptance criteria

- [ ] Behavior A works
- [ ] Behavior B is tested
- [ ] Existing behavior is preserved

### Testing instructions

What tests to add or run

### Blocking relationships

Blocked by:
Blocks:

### Notes for agent

Important context, files, constraints, and risks

Part 5 — Triage your backlog before agents touch it

The /triage workflow turns messy ideas, bug reports, and feature requests into actionable work. It uses labels as a state machine. Each issue should have a category and a state.

Common category labels:

bug
enhancement

Common state labels:

needs triage
needs info
ready for agent
ready for human
won’t fix

The key rule is: an issue should not be picked up by an AFK agent unless it is explicitly ready for agent.

This prevents the agent from wasting time on vague, low-quality, contradictory, or out-of-scope tasks.

A useful triage workflow:

1. Pull all untriaged issues.
2. Categorize each as bug or enhancement.
3. Decide the state.
4. If unclear, mark needs info.
5. If out of scope, mark won’t fix and document why.
6. If actionable, write an agent brief.
7. Mark ready for agent only when fully specified.

The files also recommend documenting “out of scope” decisions so future agents can reject similar ideas consistently.

Part 6 — Execute with TDD: Red, Green, Refactor

The TDD workflow is one of the strongest recommendations in the files. The agent should write a failing test first, then implement the minimum code to pass, then refactor.

The loop:

Red: write one failing test
Green: write the minimum implementation to pass
Refactor: clean up while tests remain green
Repeat

The important detail is one test at a time. The files warn that LLMs tend to create huge horizontal layers: many tests at once, then a massive implementation attempt. That often produces weak tests and messy code.

A good agent instruction:

Use red-green-refactor.
For each behavior:
1. Write exactly one failing test.
2. Run it and confirm it fails for the expected reason.
3. Implement the smallest change to pass.
4. Run the test again.
5. Only then move to the next behavior.
After all tests pass, look for refactor candidates.
Do not rewrite the test just to make the implementation pass.

This works especially well with agents because the human can see the test fail, then pass, which provides confidence that the implementation is grounded in real feedback.

Part 7 — Build feedback loops everywhere

The files repeat one message: without feedback loops, AI is coding blind.

Useful feedback loops include:

Unit tests
Integration tests
Type checking
Linting
Build checks
CI
Regression tests
Browser screenshots
Manual QA
Code review

For backend work, feedback is usually textual. Tests, logs, type errors, and build failures are easy for the AI to read. For frontend work, this is harder because the feedback is visual: spacing, layout, scrolling, animation, hover states, dark mode, and interaction feel.

So frontend agents need browser access. The files describe using Chrome DevTools-style tooling so the agent can open the local app, inspect pages, take screenshots, emulate dark mode, and verify rendering.

For frontend or full-stack work, add:

Browser automation
Screenshot inspection
Light/dark mode checks
Responsive layout checks
Ad hoc interaction testing
Accessibility checks when relevant

This makes the AI more like a human frontend developer because it can inspect the actual execution environment, not just the code.

Part 8 — Run agents safely with sandboxes

The Sandcastle file introduces a way to run agents AFK in isolated sandboxes. The problem it addresses is permissions: if agents constantly ask for permission, they cannot work autonomously; if you give them unrestricted access, they can do dangerous things. Sandboxing gives them a controlled environment.

Sandcastle is described as a TypeScript library for orchestrating coding agents in isolated sandboxes. It can run prompts with agents, use GitHub issues as a backlog manager, and run agents in parallel.

A typical Sandcastle-style setup has:

A .sandcastle directory
A Dockerfile or sandbox definition
Environment variables
A backlog source such as GitHub issues
A planner agent
One or more implementer agents
A reviewer agent
Possibly a merger agent

The workflow described in the file:

1. Planner reads open labeled issues.
2. Planner identifies unblocked tasks.
3. Implementer agents work in sandboxes.
4. Agents run tests and type checks.
5. Reviewer analyzes the changes.
6. Merger can combine or select branches.

The Sandcastle file also shows that agents can be prompted to use red-green-refactor during implementation, tying autonomous execution back to TDD.

Part 9 — Use worktrees for parallel development

Git worktrees let multiple branches of the same repository be checked out in separate folders. This allows multiple agents to work independently without interfering with each other.

The basic idea:

main repo
feature-worktree-1
feature-worktree-2
bugfix-worktree-3

Each worktree can have its own branch, its own changes, and its own agent.

The files describe this as a powerful way to make parallelization easier. One agent can work on one idea, another agent can work on another, and each can produce a PR back to main.

But there is an important warning: protect your main branch and make sure the agent pushes to the specific branch name. Otherwise, an agent may accidentally push work to main if the setup is wrong.

A safe instruction for agents:

You are working in a git worktree.
Before committing, run git status and confirm the branch name.
Do not push to main.
Push only to the current feature branch.
Open a PR back to main.
If branch identity is unclear, stop and report.

Part 10 — Review in a fresh context

The files recommend reviewing AI-generated code in a fresh context. If the same agent that wrote the code reviews it inside a bloated context, it may be less effective. A fresh context gives the reviewer a cleaner view.

The newer skills changelog also describes a planned /review skill with two parallel review modes:

Standards review:
Does the code follow repository conventions?

Spec review:
Does the implementation match the issue or PRD?

This distinction is useful. A change can be well-written but solve the wrong problem, or it can solve the right problem while violating project standards.

A good review prompt:

Review this PR in a fresh context.

Check two things separately:

1. Spec compliance:
- Does the implementation satisfy the issue?
- Are all acceptance criteria met?
- Are user stories preserved?

2. Code standards:
- Does the code match existing conventions?
- Are module boundaries respected?
- Are tests meaningful?
- Are there unnecessary abstractions?
- Are there risky changes outside scope?

Do not rewrite code yet. First produce findings ranked by severity.

Part 11 — Use handoff when context gets too large

Long sessions consume context. The /handoff skill creates a temporary handoff document that summarizes the current conversation, intent, artifacts, decisions, and suggested next skills. This lets another agent continue the work without carrying the entire original conversation.

Use handoff when:

The session is getting long
You want a fresh agent to continue
You want to delegate a subtask
You want another agent to review or prototype independently
You want to preserve intent without copying everything

A handoff document should include:

## Handoff

### Current goal

What are we trying to accomplish?

### Current state

What has been decided or built?

### Important artifacts

Links to PRD, issues, context.md, ADRs, prototypes, branches

### Domain language

Terms the next agent must understand

### Constraints

What must not change?

### Recommended next action

What should the next agent do?

### Suggested skill

grill-with-docs / prototype / tdd / review / triage / etc.

Part 12 — Human QA closes the loop

Even after AFK implementation, tests, and review, the human still performs QA. The seven-phase workflow explicitly ends with the agent producing a QA plan and the human walking through the completed work. That QA often creates more tickets, which go back into the implementation loop.

A good QA plan includes:

Core happy path
Edge cases
Regression checks
Visual checks
Data integrity checks
Error states
Performance concerns
Accessibility concerns if relevant
Manual steps to reproduce

The loop becomes:

Execute issue
Run tests
Review
Human QA
Find problems
Create new issues
Triage
Execute again

This is why the process is iterative, not one-shot.

The complete workflow

Here is the combined tutorial workflow from the files:

1. Prepare the codebase
   - Improve architecture
   - Create deep modules
   - Define seams and adapters
   - Add tests around boundaries

2. Establish shared language
   - Create or update context.md
   - Use grill-with-docs
   - Add ADRs for hard-to-reverse decisions

3. Start from an idea
   - Describe the goal
   - Explain why it matters
   - Identify uncertainty

4. Research when needed
   - Cache temporary research in research.md
   - Avoid stale permanent research

5. Prototype when taste or uncertainty matters
   - UI prototypes
   - Logic prototypes
   - API experiments

6. Write the PRD
   - Problem
   - Goal
   - User stories
   - Implementation decisions
   - Testing decisions
   - Acceptance criteria

7. Break into vertical issues
   - Avoid horizontal layers
   - Create tracer-bullet tasks
   - Add blocking relationships
   - Reference the parent PRD

8. Triage the backlog
   - Label category and state
   - Mark only clear tasks as ready for agent
   - Document out-of-scope decisions

9. Execute with agents
   - Use sandboxes
   - Use worktrees
   - Use one agent per unblocked task when useful
   - Protect main

10. Use TDD
   - One failing test
   - Minimal implementation
   - Refactor
   - Repeat

11. Add feedback loops
   - Tests
   - Type checks
   - Lint
   - Builds
   - Browser screenshots for frontend

12. Review in fresh context
   - Spec review
   - Standards review

13. Human QA
   - Walk through the completed work
   - Create new issues
   - Repeat the loop

Practical “minimum viable” version

If someone is not ready for the full multi-agent workflow, the simplest version is:

1. Use grill-with-docs to clarify the feature.
2. Write a PRD.
3. Break the PRD into 3–6 vertical issues.
4. Pick one issue.
5. Ask the agent to use red-green-refactor.
6. Run tests and type checks.
7. Review the diff.
8. QA manually.
9. Create follow-up issues.

This gives most of the benefit without needing a full Sandcastle-style AFK factory.

Advanced version: AI software factory

The advanced version combines everything:

Architecture-ready codebase
+
context.md and ADRs
+
PRDs
+
GitHub issues
+
triage labels
+
Sandcastle or equivalent sandbox orchestration
+
Git worktrees
+
TDD prompts
+
review agents
+
human QA

At that point, the human does the “day shift”: thinking, deciding, grilling, documenting, prioritizing, and reviewing. The agents do the “night shift”: implementing, testing, reviewing, and reporting. This “human day shift / AI night shift” idea appears as the final shape of the workflow.

Final takeaway

The combined message of the files is:

Do not use AI to avoid engineering.
Use engineering to make AI useful.

Good AI-driven development is not about the perfect prompt. It is about creating a system where agents can succeed:

Clear language
Clear architecture
Clear tasks
Clear tests
Clear feedback
Clear review
Clear human ownership

When those pieces are in place, AI agents can become genuinely powerful collaborators. When they are missing, AI simply accelerates entropy and produces code that is faster to write but harder to maintain.