Pitfalls

Sharp edges worth knowing before you build on Looplet. Each has a principled fix that preserves an owned, inspectable, testable harness.

1. `max_steps` must match in config and state

# ✓ do this
N = 20
config = LoopConfig(max_steps=N)
state  = DefaultState(max_steps=N)

If the two differ, the loop warns and syncs the state to the LoopConfig value. Matching them up front silences the warning and makes the intent explicit.

2. `redact=` in provenance scrubs upstream BY DEFAULT

# ✓ do this - PII never reaches Anthropic OR the trace file
sink = ProvenanceSink(dir="traces/", redact=scrub_pii)
llm  = sink.wrap_llm(AnthropicBackend(...))

Do not double-wrap the LLM in a separate redactor outside the sink. The sink already scrubs the prompt before forwarding to the wrapped backend. If you want the legacy record-only behaviour (scrub the trace but forward the raw prompt to the provider), opt out:

sink = ProvenanceSink(dir="traces/", redact=scrub_pii, redact_upstream=False)

3. Use `HookDecision(stop="reason")` in `should_stop`

# ✓ do this - the reason string appears in EvalContext.stop_reason
def should_stop(self, state, step_num, new_entities):
    if self.tokens > self.cap:
        return HookDecision(stop="budget_exceeded")
    return False

A bare return True also stops the loop, but it records stop_reason="hook_stop", so evaluators cannot tell a budget stop from a timeout stop.
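
The payoff shows up when you aggregate runs. A minimal sketch, assuming each eval context exposes stop_reason as described above; the SimpleNamespace objects are stand-ins for real EvalContext instances, and stop_reason_histogram is a hypothetical helper, not part of Looplet:

```python
from collections import Counter
from types import SimpleNamespace


def stop_reason_histogram(contexts):
    """Count finished runs by their recorded stop reason."""
    return Counter(ctx.stop_reason for ctx in contexts)


# Stand-ins for EvalContext objects collected from four runs.
runs = [
    SimpleNamespace(stop_reason="done"),
    SimpleNamespace(stop_reason="budget_exceeded"),
    SimpleNamespace(stop_reason="timeout"),
    SimpleNamespace(stop_reason="budget_exceeded"),
]
histogram = stop_reason_histogram(runs)
print(histogram.most_common())
```

With named reasons each failure mode gets its own bucket. Had the hooks returned a bare True, the budget stops would all have been recorded under "hook_stop".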

4. `eval_discover` only collects functions defined in the eval file

# eval_my_agent.py
from looplet import eval_mark            # decorator - not collected
from my_helpers import eval_count_tools  # helper from another file - not collected

@eval_mark("verdict")
def eval_correct(ctx):                   # collected
    return ctx.final_output.get("answer") == ctx.task.get("expected")

This is intentional. Do not work around it by defining pass-through wrappers. Just import normally; the __module__ filter handles the rest.
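
To see why imports are ignored, here is a rough sketch of a __module__-based filter. It is not Looplet's actual eval_discover implementation: collect_local_evals and the eval_ name-prefix rule are assumptions made for illustration, and the two modules are built in memory to mimic an eval file that imports a helper.

```python
import inspect
import types


def collect_local_evals(module: types.ModuleType) -> list[str]:
    """Return names of eval_* functions whose __module__ is this module."""
    return sorted(
        name
        for name, obj in vars(module).items()
        if name.startswith("eval_")
        and inspect.isfunction(obj)
        and obj.__module__ == module.__name__
    )


# A throwaway "helper" module that defines eval_count_tools.
helpers = types.ModuleType("my_helpers")
exec("def eval_count_tools(ctx):\n    return 0\n", helpers.__dict__)

# A throwaway "eval file" that imports the helper and defines its own eval.
eval_file = types.ModuleType("eval_my_agent")
eval_file.eval_count_tools = helpers.eval_count_tools  # simulates the import
exec("def eval_correct(ctx):\n    return True\n", eval_file.__dict__)

collected = collect_local_evals(eval_file)
print(collected)
```

The imported helper keeps the __module__ of the file that defined it, so it is filtered out without any wrapper or rename.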

5. `should_stop` fires AFTER the current step

If a hook stops the loop, the trajectory may not end with a done() call. Trajectory evaluators must handle this via ctx.stop_reason, not by assuming a terminal done() step:

def eval_finished_cleanly(ctx):
    return ctx.completed         # True iff stop_reason == "done"

def eval_no_hard_timeout(ctx):
    return ctx.stop_reason != "timeout"

6. Tool errors should carry remediation

The LLM reads tool_result.error and tool_result.data verbatim. A good error includes both what went wrong and what to try next:

# ✓ do this
return {
    "error": "File not found: x.py",
    "remediation": "Use glob to list existing files, or write to create one.",
}

# ✗ not this
return {"error": "ENOENT"}
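
A small helper keeps the shape consistent across tools. This is a sketch only: tool_error and read_file are hypothetical names, and the dict keys simply mirror the example above.

```python
from pathlib import Path


def tool_error(message: str, remediation: str) -> dict:
    """Build an error payload that always carries a next step for the model."""
    return {"error": message, "remediation": remediation}


def read_file(path: str) -> dict:
    """Hypothetical tool body: return file content or an actionable error."""
    target = Path(path)
    if not target.is_file():
        return tool_error(
            f"File not found: {path}",
            "Use glob to list existing files, or write to create one.",
        )
    return {"content": target.read_text()}
```

Making remediation a required argument means a tool author cannot return an error without deciding what the model should try next.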

7. Do not swallow exceptions in hooks

A hook that eats KeyError can mask a real bug - for example, a missing tool_call.args key that should have surfaced as a prompt for the model. Let exceptions propagate unless you have a specific recovery.
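
A sketch of the distinction, using the post_dispatch signature shown elsewhere on this page. PathAuditHook and the metrics client with an incr method are hypothetical; only the narrow except clause is the point.

```python
import logging

log = logging.getLogger("hooks")


class PathAuditHook:
    def __init__(self, metrics_client):
        self.metrics = metrics_client  # hypothetical client exposing .incr(name)
        self.paths = []

    def post_dispatch(self, state, session_log, tool_call, tool_result, step_num):
        # No try/except here: a missing "path" key is a real bug and should surface.
        self.paths.append(tool_call.args["path"])
        try:
            self.metrics.incr("tool_calls")
        except ConnectionError:
            # Specific, understood failure with a clear recovery: skip the metric.
            log.warning("metrics backend unreachable; continuing without it")
        return None
```

The only exception caught is one the hook knows how to recover from. Everything else, including the KeyError, propagates to the loop.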

8. `composable_loop` is a generator

# ✓ do this
for step in composable_loop(...):
    ...

# ✓ or this if you do not care about streaming
list(composable_loop(...))

# ✗ this does nothing - the loop never runs
composable_loop(...)
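
This is ordinary Python generator behaviour, not something specific to Looplet. A toy generator (toy_loop is a stand-in, not the real composable_loop) makes it visible:

```python
def toy_loop(max_steps, log):
    """Stand-in for composable_loop: a generator that records each step."""
    for step in range(1, max_steps + 1):
        log.append(step)  # side effect happens only while iterating
        yield step


log = []

toy_loop(3, log)         # creates a generator object; the body never runs
untouched = list(log)    # still empty

list(toy_loop(3, log))   # draining the generator runs all three steps
```

After the bare call, untouched is empty. Only the list(...) call fills log.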

9. `generate_with_tools` is surfaced via hasattr

If you wrap an LLM backend yourself, forward generate_with_tools when the wrapped backend has it:

class MyWrapper:
    def __init__(self, inner):
        self._inner = inner
        if hasattr(inner, "generate_with_tools"):
            self.generate_with_tools = inner.generate_with_tools

    def generate(self, prompt, **kw):
        return self._inner.generate(prompt, **kw)

Otherwise native tool-calling silently falls back to JSON parsing.
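
Because the fallback is silent, it is worth pinning with a unit test. A sketch with two fake backends; NativeBackend, PlainBackend and the generate_with_tools signature are assumptions for illustration, and MyWrapper repeats the class above so the snippet runs on its own:

```python
class NativeBackend:
    """Fake backend that supports native tool-calling."""

    def generate(self, prompt, **kw):
        return "text"

    def generate_with_tools(self, prompt, tools, **kw):  # assumed signature
        return {"tool": tools[0], "args": {}}


class PlainBackend:
    """Fake backend with text generation only."""

    def generate(self, prompt, **kw):
        return "text"


class MyWrapper:
    def __init__(self, inner):
        self._inner = inner
        # Expose the method only when the wrapped backend actually has it.
        if hasattr(inner, "generate_with_tools"):
            self.generate_with_tools = inner.generate_with_tools

    def generate(self, prompt, **kw):
        return self._inner.generate(prompt, **kw)


native_ok = hasattr(MyWrapper(NativeBackend()), "generate_with_tools")
plain_ok = hasattr(MyWrapper(PlainBackend()), "generate_with_tools")
```

Here native_ok is True and plain_ok is False, which is the same signal the hasattr check described above relies on.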

10. Prefer Protocol-conforming classes over inheritance

All hooks, LLM backends, and states are @runtime_checkable Protocols. Any object with the right methods works. Do not subclass LoopHook or register your class anywhere - just implement the methods you need:

# ✓ do this
class MyHook:
    def post_dispatch(self, state, session_log, tool_call, tool_result, step_num):
        ...

# ✗ do not do this
class MyHook(LoopHook):         # unnecessary
    ...
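
The mechanism is the standard typing.Protocol machinery. A sketch with a local stand-in protocol; SupportsPostDispatch is hypothetical and is not one of Looplet's own Protocol classes:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsPostDispatch(Protocol):
    """Hypothetical stand-in for a hook Protocol with a single method."""

    def post_dispatch(self, state, session_log, tool_call, tool_result, step_num): ...


class MyHook:  # no base class, no registration
    def post_dispatch(self, state, session_log, tool_call, tool_result, step_num):
        return None


class NotAHook:
    pass


conforms = isinstance(MyHook(), SupportsPostDispatch)
rejected = isinstance(NotAHook(), SupportsPostDispatch)
```

Note that a runtime isinstance check against a Protocol only verifies that the method exists, not that its signature matches, so keep the parameter list in step with the documented one.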

11. Don't run linters / type-checkers / LSP after every `write`

A good editing trajectory looks like: write A, write B, edit C, then done(). The intermediate states almost always fail to compile or type-check - that's normal, the work isn't finished yet.

If you wire a post_dispatch hook that runs mypy / tsc / LSP diagnostics after every edit and injects the errors as InjectContext(...), the model receives a constant stream of "you broke it" feedback during a sequence of edits that, taken together, would have been correct. Models then abandon multi-step refactors and revert to single-edit-then-verify patterns that are objectively worse.

# ✗ do not do this
class LSPFeedback:
    def post_dispatch(self, state, session_log, tool_call, tool_result, step_num):
        if tool_call.tool in {"write", "edit"}:
            errors = run_typechecker()        # noisy mid-sequence
            if errors:
                return InjectContext(f"Type errors:\n{errors}")
        return None

# ✓ do this - only check at natural sync points (done() or explicit checkpoints)
class LSPFeedback:
    def check_done(self, state, session_log, context, step_num):
        errors = run_typechecker()
        if errors:
            return Block(f"Type errors before done():\n{errors}")
        return None

Credit to Mario Zechner's Pi write-up for naming this anti-pattern crisply.

12. Aggressive compaction silently destroys prompt caching

Anthropic and OpenAI cache prompt prefixes; the cache breakpoint moves forward as the conversation grows. If your compaction strategy rewrites the prefix on every turn (e.g. by pruning all tool results older than N tokens, or summarising older messages in place), the cache hit rate collapses and per-turn cost can rise 5–10×.

# ✗ silently cache-hostile - every turn rewrites the prefix
config = LoopConfig(
    compact_service=PruneToolResults(keep_recent_tool_results=2),
    cache_policy=CachePolicy(...),
)

# ✓ keep enough recent results to stay behind the cache breakpoint,
#    and only summarise on overflow, not every turn
config = LoopConfig(
    compact_service=DefaultCompactService(
        keep_recent=4,
        keep_recent_tool_results=10,    # ≥ what cache_policy expects to keep stable
    ),
    cache_policy=CachePolicy(...),
)

If you're unsure, run with MetricsHook for a few turns and inspect the cached-token count your provider reports: usage.cache_read_input_tokens on Anthropic, usage.prompt_tokens_details.cached_tokens on OpenAI's Chat Completions API. A healthy run shows that number climbing; a cache-hostile run shows it flat near zero.
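
A rough way to turn those numbers into one figure. This sketch assumes you have collected one Anthropic-style usage dict per turn (where input_tokens excludes cached tokens); how MetricsHook exposes them is not shown here, and cache_hit_ratio is a hypothetical helper:

```python
def cache_hit_ratio(usages):
    """Fraction of all input tokens that were served from the prompt cache."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(
        u.get("input_tokens", 0)
        + u.get("cache_creation_input_tokens", 0)
        + u.get("cache_read_input_tokens", 0)
        for u in usages
    )
    return read / total if total else 0.0


# Turn 1 writes the cache; turn 2 reads most of its prompt back from it.
usages = [
    {"input_tokens": 100, "cache_creation_input_tokens": 900, "cache_read_input_tokens": 0},
    {"input_tokens": 100, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 900},
]
ratio = cache_hit_ratio(usages)
```

Track the ratio before and after a compaction change. A sharp drop after the change is the symptom described above.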

13. Mid-loop PII redaction causes confident hallucinations

A natural-looking pattern is to write a hook that scrubs PII (emails, SSNs, names) from tool_result.data in post_dispatch so the LLM "never sees" sensitive values. This works for the trace file. It does not work for the LLM. When the model receives [EMAIL] instead of j.smith@example.com, it doesn't treat it as opaque - it invents a plausible-looking replacement (bhansen@corp.local) and then continues building a story around the invention. By the time the agent writes a structured report, every downstream value can be hallucinated.

This was found in dogfood round 15 (a SOC triage cartridge): the agent invented an entire username, host IP, and lateral-movement narrative, then confidently labelled it severity: critical / recommended_action: isolate_host against a host that didn't exist.

# ✗ Scrubs the LLM's own view → it invents replacements
class PIIRedactionHook:
    def post_dispatch(self, state, session_log, tool_call, tool_result, step_num):
        tool_result.data = scrub_pii(tool_result.data)   # don't do this
        return None

The right pattern is to scrub at the boundary, not in the loop:

- Trace-only redaction. Use ProvenanceSink(redact=scrub_pii, redact_upstream=False) - this rewrites what hits disk while forwarding the original prompt to the provider. This is an explicit opt-out from the safer upstream-redaction default; use it only when the provider is allowed to receive the raw values.
- Stable token substitution. If the LLM truly should never see the raw value, replace it with a consistent opaque token (USER_AC42F1, generated by hashing the original) BEFORE it ever enters the loop, and look the original up post-hoc when reporting. The model treats the token as an opaque reference instead of inventing a plausible-looking string.
- Deny tools that return PII. If a tool's output is too sensitive for the LLM to ever see, the right move is a permission rule that denies the tool entirely, not a hook that mutates its output.

The general principle: never silently change what the LLM sees mid-conversation. The model doesn't know the substitution happened and will reconstruct what it thinks the missing piece "should be."
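
Stable token substitution can be sketched with the standard library. Pseudonymizer, its salt parameter and the six-character digest are illustrative choices, not a Looplet API; a production version would want a longer digest and a keyed hash.

```python
import hashlib


class Pseudonymizer:
    """Map raw values to stable opaque tokens and remember the originals."""

    def __init__(self, salt: str):
        self._salt = salt
        self._originals: dict[str, str] = {}

    def token_for(self, value: str, prefix: str = "USER") -> str:
        digest = hashlib.sha256((self._salt + value).encode("utf-8")).hexdigest()
        token = f"{prefix}_{digest[:6].upper()}"  # short for readability; collisions are possible
        self._originals[token] = value
        return token

    def original_for(self, token: str) -> str:
        """Look the raw value up again when writing the final report."""
        return self._originals[token]


pseudo = Pseudonymizer(salt="per-deployment-secret")
token = pseudo.token_for("j.smith@example.com")
```

Apply token_for to the data before it enters the loop, and original_for only in host-side reporting code. The same input always yields the same token, so the model can refer to it consistently across steps.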

14. Default `context_window_steps=5` silently elides chained-tool source data

The loop's RECENT RESULTS block inlines the last N steps' tool results into every prompt. The default N is 5 (env CONTEXT_WINDOW_STEPS, set in looplet.context_budget). Older steps' data rolls out and is only summarized in SESSION LOG as the agent's reasoning text, not the actual tool output.

For chained tool-use cartridges where step M needs to reference data returned by step M-K (with K > 5), the LLM no longer sees the source-of-truth value. It does not say "I don't know" - it reconstructs a plausible value and proceeds. This was mis-diagnosed as model hallucination in the SOC-triage dogfood; the actual root cause was the agent's lookup_user(...) at step 8 referring to a username from get_alert(...) at step 1, which had already aged out of the recent-results window.

# config.yaml - declare a wider window when your agent chains tools
# across many steps:
context_window_steps: 30                   # default 5
context_window_total_chars: 60000          # default 20000
context_inline_per_step_chars: 5000        # default 3000

These map onto LoopConfig.context_window_steps etc., which override the env defaults for the run.

How to spot it: wrap your backend with ProvenanceSink and grep the trace for the value the agent later invents. If the value was in the prompt at the call where it first appeared correctly but isn't in the prompt at the call where it appears wrong, your window is too narrow.
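
The grep can be scripted once the prompts are loaded. This sketch assumes you have already extracted the prompts from the trace as a list of strings in call order; the trace file layout is not covered here, and calls_containing is a hypothetical helper:

```python
def calls_containing(prompts, needle):
    """Return the 1-based call numbers whose prompt contains `needle`."""
    return [i for i, prompt in enumerate(prompts, start=1) if needle in prompt]


# Toy prompts: the username is inlined for two calls, then ages out.
prompts = [
    "RECENT RESULTS: get_alert -> user=jdoe",
    "RECENT RESULTS: get_alert -> user=jdoe; list_hosts -> 3 hosts",
    "SESSION LOG: investigated the alert's user",
]
seen_in = calls_containing(prompts, "jdoe")
```

If the last call number in seen_in is lower than the call where the agent first gets the value wrong, the value aged out of the window before it was needed.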

15. `ctx.metadata` is per-dispatch - use a resource for cross-tool state

Every tool dispatch builds a fresh ToolContext and copies metadata from state.metadata (and LoopConfig.tool_metadata). Mutations a tool makes to ctx.metadata are local to that one call and do not survive to the next tool call. Cartridges that try to hand state from accumulator_tool to flush_tool via ctx.metadata.setdefault(...) will silently see an empty buffer when flush_tool runs.

The right pattern is a resource, which is constructed once per cartridge load and passed by reference to every dispatch:

# resources/ioc_buffer.py
def build():
    return {"iocs": []}        # mutable, shared by reference

# tools/normalize/tool.yaml
requires:
  - ioc_buffer

# tools/normalize/execute.py
def execute(ctx, *, iocs):
    ctx.resources["ioc_buffer"]["iocs"].extend(iocs)   # persists across calls

# tools/publish/execute.py
def execute(ctx):
    return {"count": len(ctx.resources["ioc_buffer"]["iocs"])}   # sees them all

Why the loop is built this way: ctx.metadata is meant for inputs to the call that the loop or caller wants to pin (task IDs, permission mode, request-scoped tags). Anything mutable a tool generates and a sibling tool needs to read is, by definition, shared agent state, and shared state belongs in the resource registry where it has an explicit name and an explicit requires: declaration. This makes it visible in cartridge_to_preset strict validation and impossible to accidentally lose to a fresh dispatch.

How to spot it: a "buffer" tool that reports total_buffered: 8 across three calls, followed by a "publish" tool that writes ioc_count: 0 to disk. The dispatcher is doing what it's supposed to; you used the wrong storage.
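
The difference comes down to copy versus reference. A toy model of the behaviour described above; toy_dispatch is a stand-in, not Looplet's dispatcher, and the real copy semantics may differ in detail:

```python
from types import SimpleNamespace


def toy_dispatch(tool, state_metadata, resources, **args):
    """Mimic the described behaviour: fresh metadata copy, shared resources."""
    ctx = SimpleNamespace(metadata=dict(state_metadata), resources=resources)
    return tool(ctx, **args)


def buffer_in_metadata(ctx, *, iocs):
    ctx.metadata.setdefault("iocs", []).extend(iocs)  # lands in the per-call copy


def buffer_in_resource(ctx, *, iocs):
    ctx.resources["ioc_buffer"]["iocs"].extend(iocs)  # lands in the shared object


state_metadata = {}
resources = {"ioc_buffer": {"iocs": []}}

toy_dispatch(buffer_in_metadata, state_metadata, resources, iocs=["198.51.100.7"])
toy_dispatch(buffer_in_resource, state_metadata, resources, iocs=["198.51.100.7"])
```

After both calls, state_metadata is still empty while the resource holds the one buffered value.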

16. Do not grade a preferred trajectory as task quality

An eval such as "pytest" in ctx.tool_sequence rewards one historical way of solving a task. A stronger model may produce the correct outcome through a different tool or fewer steps and receive a worse score.

import subprocess

# ✗ freezes one implementation path
def eval_ran_pytest(ctx):
    return "pytest" in ctx.tool_sequence

# ✓ checks the world after the agent stops
def collect_tests(state):
    # Exit code 0 means every collected test passed.
    proc = subprocess.run(["pytest", "-q"], capture_output=True)
    return {"tests_passing": proc.returncode == 0}

def eval_tests_pass(ctx):
    return ctx.artifacts["tests_passing"]

Trajectory checks are appropriate for harness plumbing (for example, "did the permission hook block this call?") and auditing. Keep them separate from outcome quality.

17. The agent must not own its release oracle

Case files seeded from task["files"] live in the agent's writable sandbox. Visible tests guide the agent; they cannot protect a release gate. Likewise, colocated cartridge evals are versioned self-tests, but a candidate that can edit its own cartridge can also edit those graders.

Use the top-level case expected field for grader-only data in ordinary self-tests. For promotion, run collector and grader code in a host-owned runner and do not pass oracle data, paths, callables, or capabilities through the candidate task, runtime, resources, tools, or files. If a generated harness can inspect or modify the evaluator that promotes it, the green result is not evidence. Separate directories are not a sandbox against arbitrary same-user code; use OS or process isolation for that threat.

18. Captured-response replay is not deterministic replay

replay_loop() fixes recorded model responses. It then invokes fresh tools, hooks, state, permissions, clocks, networks, randomness, and side effects.

Use it to isolate harness-runtime changes when model decisions should remain fixed. Mock or sandbox side effects when you need repeatability. Do not use it to claim what a changed prompt or model would have done; record fresh sampled runs for that question.
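
One way to get that repeatability is to inject the clock and random source into a tool instead of reaching for globals. A sketch only; make_ticket_tool and its parameters are hypothetical and not part of Looplet:

```python
import random
from datetime import datetime, timezone


def make_ticket_tool(now=None, rng=None):
    """Build a tool whose time and randomness sources can be pinned in replay."""
    now = now or (lambda: datetime.now(timezone.utc))
    rng = rng or random.Random()

    def execute(ctx=None):
        return {
            "created_at": now().isoformat(),
            "ticket_id": rng.randrange(10**6),
        }

    return execute


# During replay: pin the clock and seed the RNG so two runs agree.
fixed = datetime(2024, 1, 1, tzinfo=timezone.utc)
first = make_ticket_tool(now=lambda: fixed, rng=random.Random(0))()
second = make_ticket_tool(now=lambda: fixed, rng=random.Random(0))()
```

With both sources pinned, first and second are equal. In production the defaults fall back to the real clock and an unseeded generator.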
