Two Independent Stop Layers: Why One Budget Cap Isn't Enough for Autonomous Agents

2026.06.17 15 min

An autonomous agent that runs unattended has to stop on its own. Most designs give it exactly one way to do that: a budget cap in a config file. That cap is also the one thing a task author can edit. So the entire safety story rests on a single number that one typo, one copy-pasted config, or one self-modifying loop can set arbitrarily high.

auto-r-graph treats that as a structural flaw, not a tuning problem. It runs a fully-unattended propose -> validate -> train -> evaluate -> keep-or-discard ML research loop. A local LLM does the creative work; a deterministic rule engine owns all routing and stopping. Stopping gets two independent layers: per-task caps the run author controls, and a fixed governance pack the run cannot reach. This post walks through both layers, why they have to be independent, and the test that proves the second one fires when the first is deliberately disabled.

One Cap Is a Single Point of Failure

In that loop, the LLM writes code. It never decides whether the run continues. That decision belongs to the deterministic rule engine.

So the engine has to be the thing that stops the run. Here is the problem most agent designs walk straight into: the thing that stops the run is also the thing a task author can configure. One number in a YAML file. One brake.

The danger is not a malicious LLM. The danger is structural. If a single configurable cap is your only brake, your entire safety story rests on a number that a task author, or a self-modifying loop editing its own config, can set arbitrarily high. One typo. One edit. One generated config that picks a big number. The cap is one edit away from off, and nothing else notices.

This is a falsifiable claim, not a vibe. If your only brake is a number in a config file, set it to 10**9 and watch the run never stop.

auto-r-graph turns that exact scenario into a passing test. The per-task iteration cap lives in the keep_discard node (ar_graph/nodes/__init__.py:899-914). The proof test (tests/test_pack_runtime.py:234-258, test_fr11_pack_halt_proof) sets max_iterations=10**9, which neutralizes that node's own iteration halt entirely. The node does not stop the run: halt=False. Yet at iteration=51, the engine still returns exactly one HaltAction with reason pack:budget-iteration. Something else stopped it.

That something else is the fix, and the architecture this whole piece is about: not one cap, but two independent layers of brakes. Layer one is the per-task budgets the run author tunes (max_iterations, max_run_wall_s, plateau_n, target_score), enforced by the node. Layer two is a fixed governance pack (ar_graph/rules/ar.budget/pack.yaml) compiled into the same engine, with hard thresholds (>= 50 iterations, > 1800s of container time, > 2M tokens) that no task configuration can raise or bypass. When the first brake is disabled, the second one still stops the run.

The rest of this post is about why those two layers have to be independent, and how auto-r-graph keeps them that way.

Layer 1: The Caps You Tune

Layer 1 is the budget the run author owns. It lives entirely in the keep_discard node, which folds four independent stop criteria into a single halt bool and a single halt_reason string (ar_graph/nodes/__init__.py:899-914).

The four criteria evaluate in a locked order, and the node records the first one that fires:

Wallclock: elapsed >= task.max_run_wall_s -> 'wallclock'
Max iterations: iteration >= task.max_iterations -> 'max_iter'
Plateau: no_improve_streak >= task.plateau_n -> 'plateau'
Target: target_hit -> 'target'

If none match, halt=False and the loop continues. The ladder is a plain if/elif chain, so order is not incidental. It is a contract. Tests pin the top rung: test_wallclock_wins_over_plateau and test_wallclock_wins_over_target (tests/test_wall_budget.py:98-147) prove wallclock takes precedence over both a satisfied plateau and a hit target. Whichever rung matches first sets the reason you see; the rungs below it are never evaluated.
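The ladder can be sketched in a few lines of Python. This is a minimal illustration, not the code in ar_graph/nodes/__init__.py: the TaskCaps class, the layer1_halt function, and the default values on the dataclass are assumptions made for the example. Only the field names, the order, and the reason strings come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TaskCaps:
    # Hypothetical stand-in for the per-task fields named in the text.
    # Default values are illustrative only.
    max_run_wall_s: float = 1800.0
    max_iterations: int = 50
    plateau_n: int = 8


def layer1_halt(
    caps: TaskCaps,
    elapsed: float,
    iteration: int,
    no_improve_streak: int,
    target_hit: bool,
) -> Tuple[bool, Optional[str]]:
    """Return (halt, halt_reason); the first matching rung wins."""
    if elapsed >= caps.max_run_wall_s:
        return True, "wallclock"
    elif iteration >= caps.max_iterations:
        return True, "max_iter"
    elif no_improve_streak >= caps.plateau_n:
        return True, "plateau"
    elif target_hit:
        return True, "target"
    return False, None


# Wallclock, plateau, and target are all satisfied; wallclock is reported.
print(layer1_halt(TaskCaps(), elapsed=1800.0, iteration=3,
                  no_improve_streak=8, target_hit=True))
# -> (True, 'wallclock')
```

Because each rung returns immediately, reordering the branches would change which reason lands in the log. That is why the order is worth pinning with tests.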

These are the right knobs for a run author. They are per-task, set in the TaskSpec YAML, and they answer task-shaped questions. How long should this run? When has it converged? What score is good enough? The defaults are deliberate but unremarkable: max_iterations=50, plateau_n=8, max_wall_s=1800, max_tokens=2,000,000 (ar_graph/state.py:22-26). Tune them up for a hard problem, down for a quick probe. That is the design intent.

And that is exactly why Layer 1 cannot be where safety lives.

Every one of these caps is author-controlled by definition. The same field you raise to give a run more room is the field that, set wrong, gives it unlimited room. max_iterations=50 is a number in a YAML file. So is max_iterations=10**9. The node enforces whatever it is handed. It has no opinion about whether the value is sane. Anything you can tune up, you can tune off. A typo, a copy-pasted config from a different task, or an agent editing its own TaskSpec all land in the same place: a halt condition that never trips.

A cap you can edit is not a backstop. It is a setting. The thing that actually stops an autonomous agent has to be something the run cannot reach. That is Layer 2.

Layer 2: The Backstop You Can't Touch

Anything a task config can set, a task config can unset. So there is a second layer the task config cannot reach: the ar.budget governance pack.

Three fixed axes, three CLIPS rules

The pack hardcodes three thresholds as deterministic CLIPS rules (ar_graph/rules/ar.budget/pack.yaml:21-38):

Iterations: >= 50, written (state (iteration ?i&:(>= ?i 50))) -> pack:budget-iteration
Container wall-time: > 1800 cumulative seconds -> pack:budget-wall-seconds
AI tokens: > 2,000,000 cumulative tokens -> pack:budget-tokens

These are the three resources a runaway loop actually burns: how many times it spins, how much compute time it eats, and how many tokens it spends. Each gets its own independent halt rule. Exceeding any one is enough to stop the run.

The thresholds are literals in the rule patterns, not parameters. There is no config key that points at them. A task can raise its own max_iterations arbitrarily high and the pack's >= 50 still fires at exactly 50. The proof of that lives in the next section.
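To make "literals, not parameters" concrete, here is a Python mirror of the three comparisons. The real rules are CLIPS patterns in pack.yaml; the function name pack_halt_reasons and the constant names are hypothetical. The point of the sketch is what is absent: no argument carries a task config.

```python
from typing import List

# Literals on purpose: nothing here reads from a task config.
ITERATION_LIMIT = 50        # fires when iteration >= 50
WALL_SECONDS_LIMIT = 1800   # fires when cum_wall_s > 1800
TOKEN_LIMIT = 2_000_000     # fires when cum_tokens > 2,000,000


def pack_halt_reasons(iteration: int, cum_wall_s: int, cum_tokens: int) -> List[str]:
    """Return every pack halt reason triggered by the current state scalars."""
    reasons = []
    if iteration >= ITERATION_LIMIT:
        reasons.append("pack:budget-iteration")
    if cum_wall_s > WALL_SECONDS_LIMIT:
        reasons.append("pack:budget-wall-seconds")
    if cum_tokens > TOKEN_LIMIT:
        reasons.append("pack:budget-tokens")
    return reasons
```

Note the mixed operators: the iteration axis uses >=, while the wall-time and token axes use a strict >. A state sitting at exactly 1800 seconds or exactly 2,000,000 tokens does not trip the pack; a state at exactly 50 iterations does.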

One honest caveat about the verification status. Only the iteration axis has a full runtime halt proof: a test that drives a live engine and asserts a HaltAction with reason pack:budget-iteration fires. The wall-seconds and tokens axes are verified as installed, integer-typed rules sharing the same pattern shape, but their runtime halt is not yet exercised by a proof test (specs/02-ar-v2/soak-report.md flags the tokens axis as untested-live in its backlog). The architecture treats all three identically. The test coverage does not yet.

Defense-in-depth, not a side channel

The pack states its own purpose at the top of pack.yaml:

Defense-in-depth beyond keep_discard's own halt logic -- fires halt if any of the three budget axes is exceeded, and CANNOT be bypassed by node logic.

All three rules gate on the same node, (node-id (id keep_discard)), and each emits a halt reason that no node path can produce: pack:budget-iteration, pack:budget-wall-seconds, pack:budget-tokens. The node's own halt ladder emits a different vocabulary (max_iter, plateau, target, wallclock). The two namespaces never overlap.

That separation is the audit-log payoff. When a run stops, the halt_reason tells you which layer fired and why, with no ambiguity. A pack:budget-* reason is a string only the pack rule emits, so seeing it in the log is proof the backstop, not the tunable cap, brought the run down. The next section disables Layer 1 on purpose to make exactly that happen.
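Since the two vocabularies are disjoint, attributing a halt to its layer is a string check. A small sketch, assuming the reason strings listed above are the complete set; the helper name halt_layer and its return labels are invented for illustration.

```python
NODE_REASONS = frozenset({"wallclock", "max_iter", "plateau", "target"})
PACK_PREFIX = "pack:budget-"


def halt_layer(halt_reason: str) -> str:
    """Classify a halt_reason string by the layer that can emit it."""
    if halt_reason.startswith(PACK_PREFIX):
        return "layer2-pack"
    if halt_reason in NODE_REASONS:
        return "layer1-node"
    return "unknown"
```

An "unknown" result would itself be worth flagging in an audit, because it means a halt reason appeared that neither layer is documented to produce.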

Same Engine, Not a Bolt-On

A separate watchdog process is appealing until you count its failure modes. It can crash while the loop keeps running. It can desync from the state it's supposed to police. It can be skipped on a code path nobody tested. Each of those is a way for the backstop to be silently absent at the moment it's needed.

auto-r-graph avoids all of them by not building a separate watchdog. Both stop layers compile into the same CLIPS routing engine. The ar.budget pack is not a second evaluator running alongside the router. It is more rules in the router.

This works because packs share the inline RuleSpec grammar. The governance pack and the routing rules go through the identical compile/install path. The only difference is where the rule came from.

# ar_graph/graph/_fathom.py:426-428
# Compile + install governance pack rules into the SAME engine.
# Packs share the inline RuleSpec grammar, so the same compile/install path
# applies -- the only difference is the source (pack.yaml vs IR inline).

The consequence is structural: if routing runs, the backstop runs. There is no separate process to start, no second clock to keep in sync, no path that loads the router but forgets the watchdog. The pack's three rules pattern-match the same flat state scalars (iteration, cum_wall_s, cum_tokens) that the routing rules match, gated on the same (node-id (id keep_discard)) fact (ar_graph/rules/ar.budget/pack.yaml:21-38). They're indistinguishable from routing rules at runtime because, at runtime, that's what they are.

This is verified concretely, not asserted. The live engine holds exactly 13 compiled rules: 10 inline routing rules plus 3 governance-pack rules. (The graph YAML also contains node definitions and pack mounts, but those are not themselves compiled rules, so a reader counting - id: entries in stargraph.yaml will find more than 13.) tests/test_pack_runtime.py:161-192 (test_pack_rules_installed) reaches into engine._env.rules() and asserts each one is present: the pack's budget-cap-iterations, budget-cap-wall-seconds, budget-cap-tokens, alongside inline rules like r-continue-loop, r-halt, and r-done-halt. Then it asserts len(rule_names) == 13. The backstop isn't trusted to be installed. The test confirms it is sitting in the same rule set as the router.

That shared machinery is also why the two layers don't race when they both fire. When iteration hits the node cap and the pack threshold in the same tick, the engine emits both a node-driven GotoAction and the pack's HaltAction, and stargraph's fixed precedence collapses them to a single HaltAction. The full precedence chain is interrupt > halt > goto > parallel > continue (stargraph/src/stargraph/runtime/action.py:24,85); the slice that matters here is halt > goto. Same engine, one resolution path, no contention (tests/test_pack_runtime.py:278-298).
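The collapse is easy to picture as a fixed ranking over action kinds. The following is a toy model of that idea, not stargraph's translate_actions: the Action class and resolve function are hypothetical, and the real implementation may treat ties or unknown kinds differently. Only the precedence order is taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Highest precedence first, as stated in the text.
PRECEDENCE = ("interrupt", "halt", "goto", "parallel", "continue")


@dataclass(frozen=True)
class Action:
    kind: str
    detail: str = ""


def resolve(actions: List[Action]) -> Optional[Action]:
    """Return the single winning action, or None if the list is empty."""
    if not actions:
        return None
    # Lower index in PRECEDENCE means higher priority.
    return min(actions, key=lambda a: PRECEDENCE.index(a.kind))


both = [Action("goto", "done"), Action("halt", "pack:budget-iteration")]
print(resolve(both))
# -> Action(kind='halt', detail='pack:budget-iteration')
```

The ranking is static, so the outcome does not depend on which rule happened to fire first within the tick.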

The Proof: Neutralize Layer 1, Watch Layer 2 Fire

Two stop layers is a nice claim. A test that disables one and watches the other fire is a proof. auto-r-graph ships that test: tests/test_pack_runtime.py:234-258, test_fr11_pack_halt_proof.

The setup deliberately breaks Layer 1. The helper _high_iteration_cap_task (test_pack_runtime.py:223-231) builds a TaskSpec with max_iterations=10**9, a ceiling no real run reaches. With that cap, the keep_discard node's own iteration halt can never trip. The node-level brake is, for all practical purposes, gone.

Then the test builds a state at iteration 51 with halt=False and fires the router:

state = RunState(task=_high_iteration_cap_task(), iteration=51, halt=False)
actions = _fire_route(state)

halts = [a for a in actions if isinstance(a, HaltAction)]
assert len(halts) == 1, "exactly one HaltAction expected (the pack's)"
assert halts[0].reason == "pack:budget-iteration"   # test_pack_runtime.py:251

The node does not halt. halt=False says so. Yet the engine returns exactly one HaltAction, and its reason is exactly pack:budget-iteration, a string only the pack rule emits (ar_graph/rules/ar.budget/pack.yaml:21-23, (state (iteration ?i&:(>= ?i 50)))). No node halt reason (wallclock, max_iter, plateau, target) can produce that string. The reason is the attribution. It proves the halt came from the governance pack, at runtime, with Layer 1 neutralized.

Non-bypass at the extreme

10**9 is large. 10**18 is the extreme, and test_pack_thresholds_fixed_non_bypass (test_pack_runtime.py:314-346) uses it to make the non-bypass claim unarguable. A TaskSpec lifts its own iteration ceiling to 10**18; at iteration 51, the pack still halts with reason pack:budget-iteration. Task config can raise its own cap. It cannot raise the pack's. The hardcoded >= 50 holds no matter what the task asks for.

The negative control

A halt that always fires would pass the proof test trivially and prove nothing. So the suite pins the boundary. test_fr11_negative_control_below_cap (test_pack_runtime.py:260-275) runs at iteration 49, one below the cap, with halt=False. No HaltAction fires; the only action is the r-continue-loop goto back to propose. The same control repeats inside the 10**18 non-bypass test at iteration 49. The threshold is a real boundary at 50, not an always-on halt.
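The proof and its negative control fit in one toy model. This sketch is a stand-in for the real engine, written under stated assumptions: fire_route here is a hypothetical pure function, not the _fire_route helper from the test file, and actions are plain tuples rather than HaltAction or GotoAction objects. It keeps the one property that matters: the pack check never looks at max_iterations.

```python
from typing import List, Tuple

PACK_ITERATION_LIMIT = 50  # literal, mirrors the pack's ">= 50"


def fire_route(max_iterations: int, iteration: int) -> List[Tuple[str, str]]:
    """Toy router: return (kind, detail) actions for one keep_discard tick."""
    actions = []
    node_halt = iteration >= max_iterations      # Layer 1: author-tunable
    if node_halt:
        actions.append(("goto", "done"))         # node halted, route to done
    else:
        actions.append(("goto", "propose"))      # continue the loop
    if iteration >= PACK_ITERATION_LIMIT:        # Layer 2: ignores max_iterations
        actions.append(("halt", "pack:budget-iteration"))
    return actions


for cap in (10**9, 10**18):
    # Proof: Layer 1 is neutralized, Layer 2 still halts at iteration 51.
    assert ("halt", "pack:budget-iteration") in fire_route(cap, 51)
    # Negative control: one below the threshold, only the continue goto fires.
    assert fire_route(cap, 49) == [("goto", "propose")]
```

Without the second assertion, a rule that halted unconditionally would pass the first one. The pair together is what establishes a boundary at 50.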

Coexistence, not a race

The interesting case is when both layers fire on the same tick. test_node_halt_unaffected_by_pack (test_pack_runtime.py:278-298) sets max_iterations=50 and runs iteration 51 with halt=True. Now the node halts too: r-halt emits a goto done, and the pack independently emits its HaltAction. Both a GotoAction and a HaltAction are in the action set.

The engine resolves this with its fixed precedence, where halt > goto is the relevant slice. translate_actions(actions) collapses the pair to a single HaltAction with reason pack:budget-iteration. The pack does not race the node's halt path, and it does not weaken it. When both brakes are pressed, the run stops once. The second layer reinforces the first; it never competes with it.

That is the whole argument made falsifiable. Disable Layer 1. The run still halts at iteration 51, for the reason only Layer 2 can give.

Stopping Is the Last Line, Not the Only One

Budget caps are the outermost ring of a layered defense, not its core. By the time either stop layer fires, several cheaper constraints have already done most of the work. Stopping is what's left when narrowing the agent has run out.

Start with the cheapest filter. Before any container runs, LLM-generated code is parsed with the Python AST and checked against a fixed import allowlist:

ALLOWLIST = {"sklearn", "xgboost", "pandas", "numpy"}   # ar_graph/nodes/__init__.py:59
CAP = 3                                                  # per-iteration AST-repair attempts (:64)

Invalid code never reaches a container. On a parse failure or a disallowed import, the iteration increments repair_attempts and routes back to propose for up to 3 attempts before forcing the build-error path (ar_graph/nodes/__init__.py:456-461; routing rule r-edit-invalid-retry at ar_graph/graph/stargraph.yaml:77-79). Zero containers are spent on invalid code. The expensive resource, compute, is guarded by the cheap check first.
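A pre-filter of this kind can be built entirely on the standard library's ast module. The sketch below is one plausible shape for it, not the project's implementation: the function names disallowed_imports and is_valid are assumptions, and the real check may be stricter. In particular, this version inspects import statements only and would not catch a dynamic __import__("os") call.

```python
import ast
from typing import List

ALLOWLIST = {"sklearn", "xgboost", "pandas", "numpy"}


def disallowed_imports(source: str) -> List[str]:
    """Return root module names imported outside ALLOWLIST.

    Raises SyntaxError if the source does not parse.
    """
    tree = ast.parse(source)
    bad = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have no allowlisted root; treat them as disallowed.
            if node.module and node.level == 0:
                names = [node.module]
            else:
                names = ["<relative>"]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "sklearn.linear_model" -> "sklearn"
            if root not in ALLOWLIST:
                bad.append(root)
    return bad


def is_valid(source: str) -> bool:
    """True only if the source parses and every import is allowlisted."""
    try:
        return not disallowed_imports(source)
    except SyntaxError:
        return False
```

Both failure modes, a parse error and a disallowed import, surface as the same False, which matches the single retry path described above. The cost is one parse per proposal, with no container involved.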

Next, constrain what the agent can change. The LLM edits only two marked regions of a fixed template, MODEL_REGION and FEATURE_REGION. Data loading, the train/val split, and the scorer are byte-identical every run. The agent literally cannot touch the code that computes its own score, so there is no metric to game (README.md:19).
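One way to enforce a structural lock like this is to splice proposals into a template between fixed markers and reject anything that does not match exactly one region. The sketch below assumes comment markers of the form "# BEGIN MODEL_REGION" and "# END MODEL_REGION"; that marker syntax, the TEMPLATE contents, and the fill_region helper are all hypothetical, since the post names the regions but does not show how they are delimited.

```python
import re

# Hypothetical template: only the text between markers is replaceable.
TEMPLATE = """\
def load_data(): ...          # frozen
# BEGIN FEATURE_REGION
features = None
# END FEATURE_REGION
# BEGIN MODEL_REGION
model = None
# END MODEL_REGION
def score(model, X, y): ...   # frozen
"""


def fill_region(template: str, region: str, body: str) -> str:
    """Replace the body of one marked region; everything else is untouched."""
    pattern = re.compile(
        rf"(# BEGIN {re.escape(region)}\n).*?(# END {re.escape(region)}\n)",
        re.DOTALL,
    )
    # A function replacement avoids backslash escapes in body being interpreted.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + body.rstrip("\n") + "\n" + m.group(2),
        template,
    )
    if count != 1:
        raise ValueError(f"expected exactly one {region} block, found {count}")
    return new_text


print(fill_region(TEMPLATE, "MODEL_REGION", "model = 'ridge'"))
```

Because the proposal is spliced in rather than accepted as a whole file, the frozen lines are never under the model's control in the first place.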

Then take routing away from the agent entirely. The LLM never decides where the loop goes next. All transitions are deterministic Fathom rules, 10 inline routing rules in ar_graph/graph/stargraph.yaml:65-113, pattern-matching flat state scalars (README.md:24). Creativity stays in the proposals; control stays in the engine. A model that can only fill in two regions and cannot route has a very small surface to misbehave on.

Layered, the order is deliberate:

Pre-filter: AST parse plus import allowlist, up to 3 repairs, 0 containers on invalid code.
Structural lock: two editable regions; scorer, split, and data loading frozen.
Deterministic routing: all transitions are Fathom rules; the LLM never routes.
Stop layers: per-task caps, then the fixed ar.budget governance pack.

Each layer is cheaper and more certain than the one outside it. The pre-filter rejects bad code for the cost of a parse. The structural lock removes whole categories of cheating by construction. Deterministic routing makes the control flow auditable. Only after all of that do budget caps matter, and they matter precisely because no earlier layer is perfect. A self-modifying agent could, in principle, find a path the inner layers didn't anticipate. The budget pack is the backstop for exactly that case: it bounds the blast radius when everything closer in fails.

That's the right framing for budget caps. They are not the mechanism that keeps an autonomous agent safe. They are the last guarantee that holds when the mechanisms that were supposed to keep it safe don't. Narrow what the agent can do first. Cap what it can spend last.

Governed and Still Fast

The last objection to a fixed backstop is performance: a second stop layer must cost throughput. There are two separate claims here, and they are worth keeping apart.

The first is empirical, and a 60-minute throughput soak gate supports it. The governed loop, both stop layers active, sustains real autonomous throughput well above target, with zero incidents and zero leaks. The gate measures experiments per hour against a fixed floor of 12 and passed at exp_per_hr=19. That headline metric actually undercounts, because of a doc_id REPLACE collision in the reporting path. The checkpoint runs_history tells the truer story: 113 runs completed at an average 29.2 s/run, a sustained throughput of roughly 113 experiments/hour (a lower-bound proxy, assuming each completed run ran at least one experiment). Either way the loop cleared the floor. (Source: specs/02-ar-v2/soak-report.md.)
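The arithmetic behind those figures is short enough to check by hand. The snippet below uses only the numbers quoted above; the variable names are illustrative, and the busy-time line assumes runs executed one after another, which the report excerpt does not state.

```python
FLOOR_EXP_PER_HR = 12       # gate floor
reported_exp_per_hr = 19    # headline metric (undercounted)
runs_completed = 113        # from checkpoint runs_history
avg_s_per_run = 29.2
soak_minutes = 60

# Completed runs per hour over the soak window.
sustained_per_hr = runs_completed / (soak_minutes / 60)

# Total time spent inside runs, assuming they ran sequentially.
busy_minutes = runs_completed * avg_s_per_run / 60

print(sustained_per_hr)                                   # 113.0
print(round(busy_minutes, 1))                             # 55.0
print(round(reported_exp_per_hr / FLOOR_EXP_PER_HR, 2))   # 1.58
print(round(sustained_per_hr / FLOOR_EXP_PER_HR, 2))      # 9.42
```

Under that sequential assumption, about 55 of the 60 minutes were spent inside runs, which is consistent with 113 completions in the window. The two margins over the floor differ by a wide factor, but both are above 1.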

Reliability held under that load. Over the 60-minute run:

115 runs issued
113 runs completed (status=done)
0 incidents
0 orphaned ar-train containers

Zero orphaned containers matters here. Every experiment spins up a training container; a governed loop that leaked containers under load would trade one safety problem for another. It did not leak any.

The second claim is architectural, and it is reasoning rather than measurement. The soak is a single run of the full governed loop. It has no ungoverned baseline, so it cannot isolate the pack's marginal cost or prove it added "zero" overhead. But the design makes a near-zero cost the expected outcome. The per-task caps and the fixed ar.budget pack both compile into the same CLIPS engine and evaluate as deterministic rules on flat state scalars. There is no separate evaluator, no second pass, nothing extra to run each tick. Three more pattern-match rules in an engine that is already matching ten should not be a measurable tax, though the soak does not measure it directly, and the pack only emits a halt when a budget axis is actually exceeded.

So the "safety slows it down" trade is the wrong frame here. The governed loop demonstrably sustains throughput above target with no incidents and no leaks, and the backstop's runtime cost is, by construction, three extra rules in a pass the engine was already making.

The takeaway is simple. If your agent has exactly one budget cap, you have a single point of failure: one typo, one bad config, one self-modification away from an agent that no longer stops. Add a fixed, independent backstop compiled into the same engine, where no task config can raise or bypass it. Then write the test that deliberately neutralizes the first cap and proves the second one fires (tests/test_pack_runtime.py:234-258). Until that test is green, you don't have two stop layers. You have one, and a story.
