Test-driven harness engineering for Python agents

Looplet

Own the loop. Test every change.

Keep prompts, tools, hooks, cases, and graders in code and files your team can review. Capture a failure, inspect the resulting world, and turn the behavior into a required pytest or CI contract.

Run the network-free proof | Install and configure | GitHub

Reviewable

Harness changes are ordinary Python, YAML, Markdown, and JSON.

Observable

Model calls and tool dispatches can become durable evidence.

Re-executable

Recorded responses can exercise fresh tool and hook code.

Gateable

Host-observed outcomes become required release checks.

$ uv run python examples/regression_demo/run_demo.py

1. CAPTURE v1 with fixed model responses
   collected profit: 200
   required eval: FAIL (0.00)

2. CHANGE one reviewable harness line
   - "profit": revenue + cost,
   + "profit": revenue - cost,

3. REPLAY with fresh v2 tool execution
   same model decisions: true
   collected profit: 40
   required eval: PASS (1.00)

No API key and no network. The response sequence stays fixed while changed tool code executes again and an independent collector checks the output. Read the proof and its limits.
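The shape of that demo can be sketched with the standard library alone. This is an illustration of the idea, not Looplet's API or the demo's actual source: `RECORDED_RESPONSES`, `replay`, `collect_profit`, and `grade` are hypothetical names, and the revenue and cost figures (120 and 80) are assumptions chosen so that the two tool versions produce 200 and 40.

```python
# Illustrative sketch only: none of these names come from Looplet.
# A fixed list of recorded model decisions is replayed against two tool versions.
from typing import Optional

RECORDED_RESPONSES = [
    {"tool": "compute_profit", "args": {"revenue": 120, "cost": 80}},
    {"tool": "done", "args": {}},
]


def compute_profit_v1(revenue: int, cost: int) -> dict:
    return {"profit": revenue + cost}  # the bug: cost is added


def compute_profit_v2(revenue: int, cost: int) -> dict:
    return {"profit": revenue - cost}  # the one-line fix


def replay(responses: list, tools: dict) -> dict:
    """Dispatch each recorded decision to fresh tool code; return the resulting world."""
    world: dict = {}
    for response in responses:
        if response["tool"] == "done":
            break
        world.update(tools[response["tool"]](**response["args"]))
    return world


def collect_profit(world: dict) -> Optional[int]:
    """Independent collector: reads the outcome instead of trusting the agent's report."""
    return world.get("profit")


def grade(world: dict, expected_profit: int) -> float:
    return 1.0 if collect_profit(world) == expected_profit else 0.0


world_v1 = replay(RECORDED_RESPONSES, {"compute_profit": compute_profit_v1})
world_v2 = replay(RECORDED_RESPONSES, {"compute_profit": compute_profit_v2})
score_v1 = grade(world_v1, expected_profit=40)
score_v2 = grade(world_v2, expected_profit=40)
print(world_v1["profit"], score_v1)  # 200 0.0
print(world_v2["profit"], score_v2)  # 40 1.0
```

Because the response list never changes between the two runs, any difference in the graded outcome is attributable to the tool code.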

Start with the job in front of you

I have a private tool loop

Adapt one tool, replace only the control loop, and establish parity before adding hooks or cartridges.

Migrate an existing loop | Build the first loop

I have a failure worth preserving

Capture the run, collect the real outcome, and decide whether replay, a mock, or fresh model samples answer the question.

Failure to regression | Choose an experiment

I need the exact interface

Find commands, Python entry points, artifact files, and operational controls without reading the package source.

CLI | Python API | Saved artifacts

One workflow from prototype to release

01

Build

Own the model, tools, state, and dispatch loop in Python or a cartridge.

02

Capture

Persist prompts, responses, steps, stop reasons, and metadata as readable files.

03

Test

Collect resulting world state and compare it with grader-only expectations.

04

Ship

Make required graders and thresholds fail closed in pytest or CI.
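"Fail closed" can be read as: a required grader that is missing, unusable, or below its threshold blocks the release, and nothing passes by default. A minimal sketch of such a gate follows. It assumes a hypothetical `gate` function and hypothetical grader names, and it is not Looplet's eval API.

```python
# Illustrative sketch only: a hypothetical release gate, not Looplet's API.

REQUIRED_THRESHOLDS = {"profit_correct": 1.0, "no_forbidden_writes": 1.0}


def gate(scores: dict, thresholds: dict) -> list:
    """Return a list of failures; an empty list means the release may proceed.

    Fail closed: a required grader that is missing, non-numeric, NaN, or
    below its threshold counts as a failure.
    """
    failures = []
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            failures.append(f"{name}: no usable score")
        elif not score >= minimum:  # written this way so NaN also fails
            failures.append(f"{name}: {score:.2f} < {minimum:.2f}")
    return failures


def test_release_gate():
    # In a real suite these scores would come from host-run collectors and graders.
    scores = {"profit_correct": 1.0, "no_forbidden_writes": 1.0}
    assert gate(scores, REQUIRED_THRESHOLDS) == []
```

Run under pytest, `test_release_gate` fails the build whenever `gate` reports any failure, including the case where a grader never ran.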

The execution boundary stays visible

owner_lookup.py

from looplet import OpenAIBackend, composable_loop, tool, tools_from


# Register a plain Python function as a tool the model can call.
@tool(description="Look up one service owner by name.")
def lookup_owner(service: str) -> dict:
    owners = {"payments": "fintech-platform", "search": "discovery"}
    # Unknown services yield owner=None rather than raising.
    return {"service": service, "owner": owners.get(service)}


# The caller drives the loop: each dispatch comes back as a Step.
for step in composable_loop(
    llm=OpenAIBackend.from_env(),
    tools=tools_from([lookup_owner], include_done=True),
    task={"goal": "Find the owner of payments, then finish."},
    max_steps=5,
):
    print(step.pretty())

Every dispatch returns to the caller as a typed Step. Hooks can observe or steer prompt construction, permissions, dispatch, completion, compaction, and lifecycle events without requiring a graph runtime. Cartridges are optional; they package the same harness as reviewable files when that helps distribution or code review.
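The pattern of yielding each dispatch back to the caller, with a hook able to veto a call before it runs, can be sketched in plain Python. This is not Looplet's implementation or its hook protocol: `Step`, `run_loop`, and `deny_writes` here are hypothetical stand-ins, and the model is replaced by a fixed list of decisions so the example runs without a provider.

```python
# Illustrative sketch only: hypothetical names, not Looplet's Step or hook protocol.
from dataclasses import dataclass
from typing import Any, Callable, Iterator, Optional


@dataclass(frozen=True)
class Step:
    number: int
    tool: str
    args: dict
    result: Any
    denied: bool = False


# A permission hook returns a reason string to block a call, or None to allow it.
PermissionHook = Callable[[str, dict], Optional[str]]


def deny_writes(tool: str, args: dict) -> Optional[str]:
    return "writes are not allowed" if tool == "write_file" else None


def lookup_owner(service: str) -> dict:
    owners = {"payments": "fintech-platform", "search": "discovery"}
    return {"service": service, "owner": owners.get(service)}


def run_loop(decisions: list, tools: dict, hooks: list, max_steps: int = 5) -> Iterator[Step]:
    """Yield one Step per decision so the caller sees every dispatch."""
    for number, (tool, args) in enumerate(decisions[:max_steps], start=1):
        reason = None
        for hook in hooks:
            reason = hook(tool, args)
            if reason is not None:
                break
        if reason is not None:
            # The tool is never executed when a hook blocks the call.
            yield Step(number, tool, args, result=reason, denied=True)
            continue
        yield Step(number, tool, args, result=tools[tool](**args))


decisions = [
    ("lookup_owner", {"service": "payments"}),
    ("write_file", {"path": "notes.txt", "text": "hello"}),
]
steps = list(run_loop(decisions, {"lookup_owner": lookup_owner}, [deny_writes]))
for step in steps:
    print(step)
```

Because the loop is a generator, the caller can log, display, or stop after any step without a separate runtime.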

Follow the quickstart | Read the hook protocol | Inspect cartridge boundaries

The harness can cross runtime boundaries

The shipped coder_portable cartridge is the complete coding-harness reference architecture with zero in-process portability blockers:

Boundary	What crosses it
MCP	All 16 coding tools
LEP	Permission, test, cache, stale-file, and linter hooks
SSP	Shared mutable file-cache state
MGP	Host model access for `web_fetch` and subagents

from looplet import bundled_cartridge_path
from looplet.cartridge import analyse_cartridge


coder = bundled_cartridge_path("coder_portable")
assert analyse_cartridge(coder).profile == "portable"

Portable means the loader does not import author-owned tool, hook, or state code. Three limits apply today: the bundled protocol servers are Python programs launched with the active Looplet interpreter; SSP and MGP use Unix sockets; and the full coder has not yet run on a production Rust, Go, or TypeScript loader. The Python-host coder remains the agent factory default and keeps host-owned eval and dynamic-memory behavior that the portable reference deliberately omits.

Study the portable coder | Inspect the cartridge format

Evidence has different jobs

Evidence	Use it for	Do not claim
Yielded `Step` stream	Live routing, approval, display, and instrumentation	Independent product correctness
Provenance trace	What the model saw, returned, and dispatched	That recorded prompts are safe to publish
Captured-response replay	Tool, hook, permission, state, and dispatch changes under fixed model responses	Better future model decisions
Outcome collector and grader	Whether the resulting file, command, record, or service state is correct	Isolation when the candidate owns the runner
Fresh sampled cases	Prompt, model, schema, and context changes that affect decisions	Universal performance from one sample

Looplet calls its replay "captured-response replay" because only the model responses are held fixed: tools, clocks, networks, randomness, and side effects execute again. Protected promotion oracles belong in a host-owned runner; arbitrary untrusted code also requires OS or process isolation.
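A small sketch shows why fixed responses do not imply identical outcomes. It assumes a hypothetical tool that reads a clock, with the clock injected so the effect is visible; none of the names below are Looplet's.

```python
# Illustrative sketch only: hypothetical names, not Looplet's API.
from typing import Callable

# The recorded model decision is identical for both replays.
RECORDED_RESPONSES = [{"tool": "stamp_report", "args": {"title": "weekly"}}]


def make_stamp_report(clock: Callable[[], int]) -> Callable[[str], dict]:
    def stamp_report(title: str) -> dict:
        # This body runs again on every replay, so its output follows the clock.
        return {"title": title, "generated_at": clock()}

    return stamp_report


def replay(responses: list, tools: dict) -> list:
    """Re-execute each recorded decision against the supplied tool code."""
    return [tools[r["tool"]](**r["args"]) for r in responses]


first = replay(RECORDED_RESPONSES, {"stamp_report": make_stamp_report(lambda: 1000)})
second = replay(RECORDED_RESPONSES, {"stamp_report": make_stamp_report(lambda: 2000)})

print(first == second)  # False: same decisions, different side-effect inputs
```

Pinning or injecting such inputs, as the lambdas do here, is one way to make two replays comparable; the grader should still read the resulting state rather than assume it.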

Capture and replay | Behavioral evals | Saved artifact reference

Designed for a specific team and stage

Looplet is a good fit when

one model calls tools until it is done;
your team already reviews Python, files, pytest, and CI;
prompt, tool, model, or hook changes need regression evidence;
exact interception points and local artifacts matter;
you want to own execution rather than adopt a hosted control plane.

Choose another layer when

the workflow is naturally a durable branching graph;
a managed control plane should be the source of truth;
you need a finished assistant, sandbox, or annotation product;
your main need is fleet analytics or a hosted experiment dashboard;
a small disposable loop is still enough.

Looplet can run inside a workflow engine and export to observability systems. It does not try to replace either one. Core uses only the Python standard library; provider SDKs are optional extras.

Read the selection guide | Check the FAQ | Operate a production loop