Air-Gapped Agents: A Deterministic, Rules-Routed Architecture for DoD, Cleared, and Offline Workloads

2026.06.17 37 min

The named components

A few named components carry the whole paper. Definitions first.

Stargraph is an agent-graph runtime (PyPI distribution stargraph, currently still imported in code as harbor during an in-progress rename). It runs graphs of nodes and decides transitions with rules, not a model.
Fathom is a deterministic rules engine built on CLIPS, a mature production-rule (if-this-then-that over facts) inference system. Fathom decides which node runs next.
Bosun is Stargraph's governance rule packs: budgets, retries, audit, safety. Shipped signed.
Nautilus is a policy-gated data broker that enforces a clearance/classification lattice and signs every request. Stargraph reaches it through a read-only tool.
DSPy is the framework Stargraph wraps for LLM nodes. It talks to any OpenAI-compatible endpoint.
cve_remediation is the reference workload in the mal-ellim repository: a CVE triage-and-remediation pipeline built on the full stack. It is the concrete artifact that makes the paper's claims checkable.

Two more terms recur. A one-way diode is a hardware device that physically permits network traffic in one direction only: data can be pushed out, but nothing can be pulled in. Provenance is the recorded origin of a piece of data: where it came from, who produced it, and on which run and step.

The problem: a router you can't replay

The hardest deployment for an agent is the one with no exit. Behind a one-way diode or on an isolated classified LAN, the question is not whether the agent usually behaves. It is whether you can show, after the fact, exactly what it did and why, and whether anything in its design could have reached past the boundary. Most agent frameworks answer neither question. The thing that decides what the agent does next is a language model, and a language model is a sample from a distribution, not a record you can replay.

The architecture here takes the model out of that position. Nodes do work: the LLM, the classical models, the tools. Routing moves to a deterministic rules engine over typed facts. Every external touchpoint is closed by default and opened only by an explicit, named flag. The claims rest on a reference workload that runs fully offline today, and the paper marks every place the documentation runs ahead of the shipped code. The point is not to sell a posture. It is to show what a code-enforced one looks like.

Abstract: why air-gapped agents need a deterministic spine

Most agent frameworks put a language model in the loop twice: once to do the work, once to decide what happens next. The second use is the problem. When an LLM picks the next node, routing becomes stochastic. The same inputs can take different paths on different runs. There is no stable artifact to audit and no fixed firing trace to diff. That disqualifies the stack for a one-way diode or an isolated classified LAN, where the operative question is not "does it usually behave" but "can we audit and reproduce exactly what it did."

The argument is simple and the paper grounds it in code. An agent stack is auditable and safe to run air-gapped when its decision layer is deterministic and every external touchpoint is closed by default rather than opened by convention. Both halves matter. Determinism without default-deny gives you a reproducible system that still has open doors. Closed gates without determinism give you a system whose behavior you cannot replay. These two properties are not the only path to air-gap safety. A non-deterministic stack behind a hard network namespace with full I/O recording would also never reach the network. The claim is narrower: they are a sufficient and auditable basis, and the reference workload demonstrates it.

Separate work from routing. Stargraph splits the two uses of an LLM apart. Nodes do work: LLM calls, classical ML, tool invocations, retrieval. Fathom decides transitions over provenance-typed facts. The routing layer is not a model playing router. It is a rules engine over typed facts, each carrying (origin, source, run_id, step, confidence, timestamp) with typed origins llm|tool|user|rule|model|external. The stack states the thesis directly: "Transitions between nodes are decided by Fathom (a CLIPS rules engine) over provenance-typed facts — not by an LLM playing router" (stargraph/README.md:6-9). A deterministic decision layer is what makes air-gap operation auditable.

The intended environment. Stargraph is built for "environments where auditability, determinism, and provenance matter more than ecosystem size (DoD, regulated, air-gapped, cleared workloads)" (README.md:11-12). It does not chase LangGraph or n8n on mindshare. It targets the deployment those tools cannot enter: behind one-way diodes or fully isolated LANs, where the design goal is zero outbound dependencies at run time (docs/guides/air-gap-deployment.md:16-17).

The evidence spine. Claims about air-gap safety are cheap. This paper grounds them in cve_remediation, which "runs fully offline and deterministically out of the box (no LLM, no network, no containers required)" and selectively lights up live infrastructure behind explicit, default-off env flags (mal-ellim/README.md:7-9). Live infrastructure is gated behind exactly three flags, all default-off when unset or empty:

LLM_BASE_URL points DSPy at a real OpenAI-compatible model
CVE_REM_LIVE_BROKER=1 enables a real Nautilus policy broker
HARBOR_SERVICENOW_LIVE=1 enables a real ServiceNow POST

The gates live at code choke points, not in a policy memo. Details are in the reference-workload section below.

Falsifiability, stated up front. The real spine is not a test count. Passing a suite once proves less about determinism than reproducing the same structure twice. The spine is structural reproducibility, checkable by re-running cve_remediation.run_demo --json:

170 fixed rule firings across 8 graph IRs, every IR exit code 0. Per IR: main graph 100, doctrine_ingest 10, offline_learning 14, drift_watch 11, tier_re_eval 5, audit_anchor 10, lab_leak_reaper 5, rolling_restart 15. Sum 170, matching README.md:26.
Stable structural graph hashes. Each IR emits a fixed hash per graph definition. The main graph (graph/harbor.yaml) hashes to 7f75a0673985476ed52373207604302a81e20cdc2f88a5eb37469b3a5e8ff358.

If routing were an LLM, the firing counts would not hold run-to-run. Because routing is Fathom over typed facts, they do. (The offline test suite is also network-free and passing. That number, its provenance, and a caveat about stale docs appear once, in the reference-workload section.)

Where the docs overstate the code. This paper marks the gaps rather than papering over them, and it does so once, in a closing section called Disclosed doc-vs-code deltas. Three deltas matter for an evaluator. The documented NFS/SMB hard startup refusal is a soft health warning in code. The documented cleared --workers > 1 refusal does not exist, because the serve CLI has no --workers option. Two of the cleared profile's four gate flags are declared but not yet wired into the runtime. Where the code is stronger than the doc, the paper says so. Where it is weaker, it says that too. The thesis stands on what the code enforces.

The rest walks the spine: the deterministic routing layer, the cleared profile's security gates and which are enforced today, the zero-outbound runtime posture, the cve_remediation workload as falsifiable proof, reproducibility and audit, and an accounting of positioning and gaps.

The deterministic core: rules, not LLMs, decide what happens

An agent stack is only as auditable as its least predictable component. In most agent frameworks that component is the router. An LLM is asked, in natural language, which tool to call or which node to run next, which makes the control flow itself a sample from a probability distribution. You cannot diff it and you cannot certify it for a classified LAN. Stargraph removes the LLM from that position.

The argument that rules route and LLMs do not is developed at length in the companion posts "LLMs Are Knowledge Engineers, Not Inference Engines" and "Trust as a Type." This section covers only what is load-bearing for air-gap operation.

Work and routing are separated by design

Stargraph splits an agent into two layers. Nodes do work: they call an LLM, run a classical-ML model, invoke a tool, or retrieve from a store. A separate engine decides what happens between them. That engine is Fathom, and it decides every node transition over provenance-typed facts.

The runtime stack is inspectable end to end (stargraph/README.md:54-66):

stargraph serve -> Graph -> skills -> Bosun rule packs -> Fathom CLIPS -> Nodes

Routing and governance live in the Bosun rule packs and Fathom, not in model weights. The maintainers state the premise directly: "let rules — not LLMs — decide what happens between tools" (stargraph/README.md:18-21).

This is verifiable in the runtime, not just the README. The per-node tick runs the node body, mirrors state into Fathom, then calls run.fathom.evaluate() and translates the returned CLIPS stargraph_action facts into the next transition (src/stargraph/runtime/dispatch.py:71-85; fathom/_adapter.py:121-131). The next node is chosen purely from CLIPS facts. No LLM appears anywhere in the routing path. When Fathom is not mounted, the loop walks static IR edges instead (dispatch.py:147), so routing is rules-or-static-edges, never an LLM. The same facts, fed to the same rule packs, fire the same rules in the same order every time. There is no temperature and no sampling in the control plane.

One nuance is worth naming. Stargraph ships a graph called ai-builder-router whose Classifier node is described as an "LLM route classifier." That is not a counterexample. The classifier is a worker node that asserts a route-classified fact; CLIPS rules pattern-match that fact and emit the transition. The LLM contributes a fact, the rules engine makes the decision. That is the architecture described here.
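The tick and the classifier nuance can be sketched in a few lines. This is a minimal illustration, not Stargraph's dispatch code: plain Python predicates stand in for CLIPS rules, and the names `RULES`, `STATIC_EDGES`, `route_is`, and `next_node`, the fact shape, and the node names are all hypothetical. Only the `route-classified` fact name comes from the text above.

```python
from typing import Callable, Optional

Fact = dict  # hypothetical fact shape: {"name": ..., "value": ..., "origin": ...}


def route_is(value: str) -> Callable[[list], bool]:
    """Build a predicate that matches a route-classified fact with this value."""
    def predicate(facts: list) -> bool:
        return any(f["name"] == "route-classified" and f["value"] == value for f in facts)
    return predicate


# Hypothetical rule table: (rule name, predicate over facts, target node).
# Rules are checked in a fixed order, so the same facts always pick the same target.
RULES = [
    ("route-remediate", route_is("remediate"), "remediate"),
    ("route-triage", route_is("triage"), "triage"),
]

# Static edges, walked only when no rules engine is mounted.
STATIC_EDGES = {"classifier": "triage"}


def next_node(current: str, facts: list, rules_mounted: bool = True) -> Optional[str]:
    """Pick the next node from facts alone; no model is consulted here."""
    if not rules_mounted:
        return STATIC_EDGES.get(current)
    for _name, predicate, target in RULES:
        if predicate(facts):
            return target
    return None  # in this sketch, no matching rule means no transition


# A worker node (here tagged as an LLM) contributes a fact; the rule table decides.
facts = [{"name": "route-classified", "value": "remediate", "origin": "llm"}]
decision = next_node("classifier", facts)
```

Calling `next_node` twice with the same facts returns the same target, which is the property the section relies on. The model's output is just one more input fact.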

Provenance-typed facts: the substrate rules reason over

Rules are only trustworthy if the facts they fire on are trustworthy. In Stargraph, trust is a type on every fact. Each fact carries (origin, source, run_id, step, confidence, timestamp). The origin is the typed source category (llm | tool | user | rule | model | external), source names the specific producer, and step indexes the position in the run (stargraph/README.md:32-34). A rule can therefore say "halt if a fact whose origin is llm is being used to clear a human approval gate." It reasons over the provenance of the data, not just its value.

This is not a feature you bolt on later. The provenance bundle is asserted as typed CLIPS slots on every fact, with coercion that rejects floats, None, and naive datetimes rather than silently stringifying them (src/stargraph/fathom/_provenance.py:25-33; _adapter.py:108-119). ADR 0007 records the reasoning: cleared deployments require auditability from the first fact, and provenance-as-an-optional-sidecar was considered and rejected for retrofit pain (design-docs/stargraph-adrs.md:133-147). Counterfactual replay depends on it too. You cannot re-derive a run deterministically if you do not know where each fact came from.
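A rough sketch of what a strictly coerced provenance bundle might look like follows. It is not the code in `_provenance.py`. The `Provenance` class, the rule function, and the sample values are hypothetical, and representing `confidence` as a `Decimal` is an assumption made here so that the float rejection described above can be shown.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal

ORIGINS = {"llm", "tool", "user", "rule", "model", "external"}


@dataclass(frozen=True)
class Provenance:
    origin: str
    source: str
    run_id: str
    step: int
    confidence: Decimal  # assumption: Decimal, so floats and None are rejected outright
    timestamp: datetime

    def __post_init__(self) -> None:
        if self.origin not in ORIGINS:
            raise ValueError(f"unknown origin: {self.origin!r}")
        if not isinstance(self.confidence, Decimal):
            raise TypeError("confidence must be a Decimal, not a float or None")
        if not isinstance(self.timestamp, datetime) or self.timestamp.utcoffset() is None:
            raise ValueError("timestamp must be a timezone-aware datetime")


def may_clear_approval_gate(p: Provenance) -> bool:
    """Example rule: an LLM-originated fact may not clear a human approval gate."""
    return p.origin != "llm"


# Hypothetical sample fact provenance from a tool node.
ok = Provenance("tool", "nvd-lookup", "run-1", 3, Decimal("0.9"), datetime.now(timezone.utc))
```

Because the check runs at construction, a malformed bundle never reaches a rule. The rule can then branch on `origin` without defensive parsing.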

Classical ML is a first-class node, so an LLM is optional

A deterministic control plane is only useful if you can keep the work plane deterministic where it matters. Stargraph treats classical ML (sklearn, XGBoost, ONNX, PyTorch) as a first-class node type (stargraph/README.md:36-38). A cheap deterministic model produces a confidence score, and Fathom routes on that score. High-confidence output proceeds; low-confidence output falls back to an LLM or a human gate. ADR 0008 motivates this directly. "Cleared environments often cannot run frontier LLMs," so classical ML as a first-class node "expands deployable surface to environments that disallow frontier LLMs."

This is backed at the dependency level. Stargraph's core dependencies contain no openai, anthropic, boto, or google-cloud SDK. LLM access is via DSPy over any OpenAI-compatible endpoint (a local shim, Ollama, or vLLM). A frontier-LLM-free deployment is the default, not a degraded mode. (The full dependency floor, including two network-capable libraries that warrant explanation, is detailed once, in the runtime-posture section.)

LLM nodes degrade to deterministic heuristics with no model present

When an LLM node is in the graph but no model endpoint is configured (offline test, cold start, or a diode-isolated LAN with no LM at all), the pipeline still produces a stable output. Stargraph's DSPy-based nodes degrade gracefully: with no LM configured they fall back to deterministic heuristics and tag the output's source ('lm' vs 'heuristic') so the provenance stays honest.

The cve_remediation vulnerability-class classifier is the concrete case. With no LM it falls back to a fixed CWE -> vuln_class heuristic dictionary and labels vuln_class_source accordingly (mal-ellim/cve_remediation/graph/nodes/_shared.py:1459-1583). The result is a pipeline that produces stable, attributable outputs with no model present, and a provenance trail recording exactly which outputs came from a heuristic. The LLM, when present, is a fallback for cases the deterministic path cannot resolve. It never decides what happens next.
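The degrade-and-tag pattern is small enough to sketch. The table below is a hypothetical three-entry excerpt, not the dictionary in `_shared.py`, and the class labels and the `classify_vuln_class` signature are illustrative assumptions. Only the `vuln_class_source` field and its 'lm' / 'heuristic' values come from the text.

```python
from typing import Callable, Optional

# Hypothetical excerpt of a CWE -> vuln_class table; class labels are illustrative.
CWE_TO_CLASS = {
    "CWE-79": "xss",
    "CWE-89": "sql-injection",
    "CWE-787": "memory-corruption",
}


def classify_vuln_class(cwe: str, lm: Optional[Callable[[str], str]] = None) -> dict:
    """Return a class plus a tag recording which path produced it."""
    if lm is not None:
        return {"vuln_class": lm(cwe), "vuln_class_source": "lm"}
    # No model configured: fixed dictionary lookup, same answer on every run.
    return {
        "vuln_class": CWE_TO_CLASS.get(cwe, "unknown"),
        "vuln_class_source": "heuristic",
    }


offline = classify_vuln_class("CWE-89")
```

The source tag travels with the value, so a downstream rule or an auditor can tell a dictionary lookup from a model output without re-running anything.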

That is the deterministic core: a CLIPS rule engine deciding transitions over provenance-typed facts, classical models doing confident work, LLMs demoted from router to optional, clearly-tagged contributor. Everything later rests on this one inversion. The decision layer does not sample. It fires rules.

Closed by default: the cleared deployment profile and its gates

A diode-side or isolated-LAN deployment is only as safe as its defaults. Policy memos do not survive a misconfigured boot. Stargraph encodes the air-gap posture in code: a first-class cleared deployment profile that sets every security gate flag closed at instantiation, denies capabilities unless explicitly granted, and refuses the boot-time escape hatches before any durable I/O happens.

Four gate flags, two enforced today

The profile is a Pydantic policy bundle. ClearedProfile sets all four bool gates True and defaults the auth provider to mTLS (src/stargraph/serve/profiles.py:202-219):

class ClearedProfile(Profile):
    name: str = "cleared"
    tls_required: bool = True
    signature_verify_mandatory: bool = True
    default_deny_capabilities: bool = True
    audit_required: bool = True
    auth_provider_factory = Field(default_factory=_cleared_auth_factory_default)  # mTLS

The contrast is the point. OssDefaultProfile (profiles.py:182-199) is the inverse: all four gates False, with a BypassAuthProvider default. Same field set, opposite values. The posture is a property of which class you instantiate, and select_profile() (profiles.py:222-239) activates ClearedProfile only on STARGRAPH_PROFILE=cleared.

Be precise about what is enforced. Of the four flags, two are wired into the runtime today:

signature_verify_mandatory is read in bosun/signing.py:385,510: _refuse_or_warn() raises PackSignatureError under cleared.
default_deny_capabilities is read in serve/api.py:396-412: the capability gate returns HTTPException(403, "capability '<cap>' not granted under cleared profile").

The other two, tls_required and audit_required, are declared but not yet read by the runtime. A grep across the source finds them only in profiles.py. None of serve/lifecycle.py, serve/api.py, or cli/serve.py consults them. The profiles module docstring labels itself a POC slice and defers TLS and audit-sink enforcement to a later phase. In the documented air-gap topology, mTLS is terminated at an Envoy/nginx edge and the app runs on plain localhost. So tls_required being inert in-process is consistent with edge-terminated TLS, but it is not an in-app refusal of plain HTTP. An evaluator should treat the TLS and audit gates as operator-enforced at the deployment layer, not code-enforced by the profile flag.

Default-deny in practice

Under the cleared profile, an unset capability returns HTTPException(403, "capability '<cap>' not granted under cleared profile"). The capability gate is uniform across every gated HTTP route. There is no per-route permissive carve-out. The default-deny mechanism applies identically to runs:start and runs:read. Those two "pass" only because the shipped grant sets (mTLS / bypass) include them, not because they are structurally exempt.

The HTTP routes that emit this 403 under cleared, by capability (serve/api.py):

runs:cancel (844), runs:pause (879), runs:respond (1019), runs:resume (1058)
counterfactual:run (1104)
artifacts:read (1167, 1188)
plus runs:start (686) and runs:read (707), permissive only via the shipped grant set

Two capabilities sometimes listed with these are not HTTP route gates and do not produce that 403. artifacts:write is engine-side, consumed by WriteArtifactNode. tools:broker_request is node-level, enforced in BrokerNode / broker_request by raising CapabilityError (auth.py:167-168; nodes/nautilus/broker_node.py:50; tools/nautilus/broker_request.py:47). They are real default-deny capabilities, but they fire deeper in the stack than the HTTP edge. The OSS-default profile flows unset capabilities straight through; the cleared profile denies them. Same code, two profiles, two postures.
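The "same code, two profiles" behavior can be illustrated with a small stand-alone gate. This is a sketch under stated assumptions, not the code in `serve/api.py`: the `Profile` dataclass here is a reduced stand-in, `CapabilityDenied` substitutes for the HTTP 403, and the profile name `oss-default` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    default_deny_capabilities: bool


class CapabilityDenied(Exception):
    """Hypothetical stand-in for the HTTP 403 described in the text."""


def require_capability(profile: Profile, granted: frozenset, cap: str) -> None:
    """Allow granted capabilities; deny ungranted ones only under default-deny."""
    if cap in granted:
        return
    if profile.default_deny_capabilities:
        raise CapabilityDenied(f"capability '{cap}' not granted under {profile.name} profile")
    # Permissive profile: an ungranted capability flows straight through.


cleared = Profile("cleared", default_deny_capabilities=True)
oss = Profile("oss-default", default_deny_capabilities=False)
grants = frozenset({"runs:start", "runs:read"})
```

With this grant set, `runs:start` passes under both profiles, while `runs:cancel` passes only under the permissive one. Nothing in the gate special-cases a capability; the grant set alone decides.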

Boot-time refusal of escape hatches

Two CLI flags exist for developer convenience: --allow-pack-mutation (runtime Bosun pack edits) and --allow-side-effects (nodes/tools declaring write/external effects). Both are forbidden under cleared. The startup gate raises ProfileViolationError and exits non-zero before uvicorn binds the socket and before any checkpoint-DB or artifact-store I/O. Operators see a clean failure on stderr rather than a gate firing mid-bootstrap with partial state (src/stargraph/cli/serve.py:213-227):

if selected.name == "cleared":
    if allow_side_effects:
        raise ProfileViolationError(
            "--allow-side-effects not permitted under cleared profile",
            profile="cleared", flag="--allow-side-effects")
    if allow_pack_mutation:
        raise ProfileViolationError(
            "--allow-pack-mutation not permitted under cleared profile",
            profile="cleared", flag="--allow-pack-mutation")

One precision. The gate runs after select_profile(), which constructs ClearedProfile() and reads a cwd-relative stargraph.toml if present, a read-only config stat/open. The refusal still happens before any durable or network I/O (no temp dir, no checkpoint DB, no socket bind on a rejected boot), but a read-only config discovery does run first. The flag help text marks both flags "FORBIDDEN under --profile cleared" (serve.py:99-125). A cleared boot cannot opt back into mutation or side effects through a command-line argument.

The broker is read-only by type

The single data path out of the agent toward an external broker is read-only by construction. stargraph.tools.nautilus.broker_request declares side_effects = SideEffects.read (src/stargraph/tools/nautilus/broker_request.py:14-16): "the broker is read-only from Stargraph's POV." There is no write surface to gate because none was built. The documentation also describes a cleared-mode engine-side refusal of any node or tool declaring write/external effects, layered behind the --allow-side-effects flag refusal, and that wiring is signposted in the source as the canonical runtime gate. Operators relying on the cleared posture should confirm the runtime side-effect refusal is active in their build, not only the CLI flag refusal. The flag refusal is confirmed in code; the engine-side runtime refusal is documented and partially wired.

The net result is a deployment profile where the air-gap posture is a property of the running process. It is enforced at instantiation for signature verification and the capability gate, and at boot for the escape hatches. TLS and audit-sink enforcement are still owed to the deployment layer until the profile flags are wired.

Pinning zero outbound network: wheelhouse, weights, and the deployment runbook

A cleared deployment runs behind a one-way diode or a fully isolated LAN. There is no PyPI, no Hugging Face Hub, no package mirror. The failure mode you cannot tolerate is a dependency silently reaching out to fill a gap. Stargraph's air-gap workflow turns every would-be network call into a hard, loud error at install time or boot time, never a quiet fetch at runtime. The stated goal is explicit: "Zero outbound HTTP at run time once the wheelhouse + embedding weights are staged. Audit, replay, and HITL all stay local" (docs/guides/air-gap-deployment.md:17-21).

This section walks the runbook: the offline wheelhouse, the SHA-256-pinned embedding weights, the embeddable local stores, and the deployment posture that makes the no-network claim checkable.

The dependency floor, including the network-capable libraries

The "zero outbound" claim has to survive an honest reading of the actual dependency list. Stargraph's core dependencies (pyproject.toml:24-55) are: pydantic, pluggy, prompt-toolkit, fathom-rules, jsonschema, pyyaml, structlog, anyio, uuid-utils, dspy, mcp, aiosqlite, asyncpg, rfc8785, orjson, typer, jsonpatch, cryptography, fastapi, uvicorn, httpx, cronsim, pyjwt, nautilus-rkm, blake3, argon2-cffi, jinja2, graphglot, redis, and python-docx.

Three things follow.

No cloud-LLM SDK is in core. There is no openai, anthropic, boto, or google-cloud. LLM access is via DSPy against any OpenAI-compatible endpoint, so a frontier-LLM-free deployment installs nothing extra.
lancedb and sentence-transformers are NOT core. They live in optional extras: stores (pyproject.toml:73) and skills-rag (pyproject.toml:93). The vector store and embedder do not ship in the default install. They are opt-in.
Two network-capable libraries DO ship in core: httpx and redis. Any paper claiming zero outbound has to address them. In the source tree, redis is declared but not imported anywhere under src/. It is a dependency floor entry, not an active runtime client in the shipped code. httpx is imported only inside specific tool implementations: the ServiceNow and CargoNet tool families, plus the CLI respond client. Those are exactly the egress-capable tools that the cleared side-effect posture and the workload's default-off flags are built to gate. Neither library makes an outbound call on the default offline path. Both are present because the tools that can talk to live infrastructure, when explicitly enabled, use httpx to do it. The "zero outbound at runtime" property is a property of the gating, not of the absence of HTTP clients.

Note the hard floor: Python >= 3.13 (pyproject.toml:10), inherited by any workload built on Stargraph.

The offline wheelhouse: an install that refuses to reach out

The wheelhouse is built on a connected machine and physically carried across the gap. Two pip flags do the load-bearing work, chosen for what they refuse to do.

--only-binary=:all: refuses any source distributions. An sdist can run arbitrary code in its build backend, which may itself try to fetch; binary-only takes that path off the table.
--no-index blocks PyPI lookups entirely. --find-links points pip at the local wheelhouse directory and nowhere else.

The failure mode is the part that matters (docs/guides/air-gap-deployment.md:46-73). A missing transitive dependency "triggers a clear No matching distribution error rather than a silent network reach-out." A gap in the wheelhouse stops the install with a named error you can read. It never degrades into an outbound request.
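As a concrete illustration, the two halves of the wheelhouse workflow might be driven as below. The pip flags are real, but the exact invocations in the air-gap guide are not reproduced in this paper, so the helper names, the paths, and the use of a requirements file are assumptions. The helpers only build argument lists; nothing is executed.

```python
import sys


def wheelhouse_download_argv(wheelhouse: str, requirements: str) -> list:
    """Connected side: collect binary wheels only into a local directory."""
    return [
        sys.executable, "-m", "pip", "download",
        "--only-binary=:all:",      # never fetch an sdist that would need a build step
        "-d", wheelhouse,
        "-r", requirements,
    ]


def offline_install_argv(wheelhouse: str, requirements: str) -> list:
    """Air-gapped side: install from the carried directory and nowhere else."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",               # no PyPI lookups at all
        "--find-links", wheelhouse, # resolve only against the local wheelhouse
        "--only-binary=:all:",
        "-r", requirements,
    ]


# Hypothetical paths for illustration.
install_cmd = offline_install_argv("/opt/wheelhouse", "requirements.txt")
```

Passing `install_cmd` to `subprocess.run(..., check=True)` would turn a missing wheel into a non-zero exit and a raised exception, which matches the fail-loud intent.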

Pinning embedding weights by SHA-256

The one component that historically wants to phone home is the embedder. Stargraph's reference embedder, MiniLMEmbedder, wraps sentence-transformers/all-MiniLM-L6-v2 (384-dim) and is wired for offline-only operation in two layers.

First, offline enforcement. When HF_HUB_OFFLINE=1 is set, the embedder calls huggingface_hub.snapshot_download(local_files_only=True). In code, the loader computes offline = os.environ.get("HF_HUB_OFFLINE") == "1", then local_only = offline or not allow_download, and passes local_only straight through (src/stargraph/stores/embeddings.py:145-154). If the local cache is incomplete, it fails loud. There is no silent network fallback. An integration test confirms it: HF_HUB_OFFLINE=1 plus an empty cache raises (tests/integration/test_minilm_offline_load.py). Be precise about scope. The constructor's allow_download defaults to True, so without HF_HUB_OFFLINE=1 (and without passing allow_download=False) the loader will fetch on first use. The offline guarantee is keyed on the env var, which the air-gap runbook exports process-wide so the downstream SentenceTransformer(model_dir) call also stays offline.
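The scope caveat is easiest to see as a truth table. The function below restates the two expressions quoted above. The name `resolve_local_only` and the injectable `env` parameter are additions for illustration only.

```python
import os
from typing import Mapping


def resolve_local_only(allow_download: bool = True,
                       env: Mapping[str, str] = os.environ) -> bool:
    """Mirror the quoted logic: offline env var OR downloads disallowed."""
    offline = env.get("HF_HUB_OFFLINE") == "1"
    return offline or not allow_download


# (allow_download, HF_HUB_OFFLINE) -> local_only
table = {
    (True, None): resolve_local_only(True, {}),
    (True, "1"): resolve_local_only(True, {"HF_HUB_OFFLINE": "1"}),
    (False, None): resolve_local_only(False, {}),
    (False, "1"): resolve_local_only(False, {"HF_HUB_OFFLINE": "1"}),
}
```

Only the first row, defaults with no env var, permits a fetch. Note that the quoted comparison is against the literal string "1", so a value such as "true" would not switch this expression to offline.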

Second, integrity. The POC MiniLM embedder pins its safetensors weights by SHA-256 and re-hashes the file on every real instantiation. There is no skip-if-already-verified cache. The pin is the default value of an overridable expected_sha256 constructor kwarg:

MINILM_SHA256 = "53aa51172d142c89d9012cce15ae4d6cc0ca6895895114379cacb4fab128d9db"

(src/stargraph/stores/embeddings.py:66). On drift it raises EmbeddingModelHashMismatch, a loud-fail StargraphError that aborts construction with no fallback (verified subclass of Exception, not Warning; errors/_hierarchy.py:179). That single check catches three failure classes at once: a wrong-revision pull (the revised bytes differ, so the content hash catches it), cache corruption, and in-place tampering. The cost is bounded, about 50 ms for a 90 MB file on every load. The guarantee is scoped to real MiniLMEmbedder construction; the test-only FakeEmbedder bypasses hashing. In an air-gapped tree where the weights blob is a static artifact under configuration control, re-hashing on each instantiation turns "are these the bytes we approved?" into a runtime invariant. The staging artifact is a SHA-256 manifest carried alongside the weights, so the pin the embedder enforces at load is the value an operator verifies when laying down the cache.
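The re-hash-on-load check is simple to sketch with the standard library. This is not the embedder's code: `WeightsHashMismatch` is a hypothetical stand-in for EmbeddingModelHashMismatch, the helper names are invented, and the demo uses a throwaway temp file rather than real weights.

```python
import hashlib
import tempfile
from pathlib import Path


class WeightsHashMismatch(Exception):
    """Hypothetical stand-in for EmbeddingModelHashMismatch."""


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large blob is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: Path, expected_sha256: str) -> None:
    """Raise on any drift; there is deliberately no cached 'already verified' path."""
    actual = sha256_file(path)
    if actual != expected_sha256:
        raise WeightsHashMismatch(f"{path}: expected {expected_sha256}, got {actual}")


# Demo on a throwaway file: the approved bytes pass, altered bytes fail loud.
with tempfile.TemporaryDirectory() as tmp:
    blob = Path(tmp) / "model.safetensors"
    blob.write_bytes(b"approved weights")
    pin = hashlib.sha256(b"approved weights").hexdigest()
    verify_weights(blob, pin)
    blob.write_bytes(b"tampered weights")
    try:
        verify_weights(blob, pin)
        tamper_detected = False
    except WeightsHashMismatch:
        tamper_detected = True
```

The same `sha256_file` helper can produce the manifest value on the connected side, so the staged pin and the load-time check are computed the same way.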

Stores that need no service to start

Zero outbound at runtime is only achievable if nothing in the data path expects a remote service. Stargraph's stores default to embeddable, zero-infra implementations behind Protocols:

Vector: LanceDB
Graph: RyuGraph (a community fork of Kuzu, kept after the Kuzu repo was archived; same Python API surface)
Doc / memory / fact: the SQLite trio

All are local-filesystem-only with no outbound network (docs/knowledge/air-gap.md:124-135; ADR 0006, "Ship default Providers that are embeddable and zero-infra"). Because they sit behind Protocols, an operator can deliberately swap in a networked backend. The default start path requires no external service, which is what makes offline deployment tractable rather than a custom-integration project.

The deployment posture: mTLS edge, single process, POSIX-local, UTC

With the wheelhouse and weights staged, the rest of the runbook is the runtime envelope (docs/guides/air-gap-deployment.md:109-267).

mTLS termination. Two supported topologies: an Envoy/nginx edge that terminates mutual TLS in front of the process, or FastAPI-direct mTLS. Under the cleared profile, mTLS is the default auth provider (profiles.py:202-219). As noted above, the app itself does not yet refuse plain HTTP in-process. TLS is terminated at the edge in the documented topology.
Single-process invariant. One process, one fsync'd JSONL audit sink, no multi-worker fan-out. Scale by running multiple instances, not multiple workers. This underpins both audit cohesion and replay determinism. (The documented --workers > 1 CLI refusal does not exist; see the closing deltas section. Enforce single-process at the supervisor, e.g. a systemd unit.)
POSIX-local-only state. All durable state lives on the local filesystem, which keeps audit, replay, and HITL local per the zero-outbound goal.
UTC timezone. The host and service run in UTC (with a worked systemd unit), removing local-time skew as a source of nondeterminism in timestamps and replay.

What the runbook is reaching for

The runbook's worked example concludes with "no outbound network observable on tcpdump." That is a documented result in the air-gap guide, not a measurement reproduced for this paper, and it is the right check to run before trusting a cleared deployment. What this paper can demonstrate is narrower and verified: the offline test path is hermetically network-free and the structural reproducibility (graph hashes, rule-firing counts) is stable. An operator validating a served, broker-connected, mTLS deployment should run their own packet capture. The design is what makes that capture come back clean: offline pip that errors instead of fetching, an embedder that fails loud, SHA-256 weights re-verified every instantiation, embeddable local stores, and egress confined to explicitly-flagged tools.

The reference workload: cve_remediation running fully offline and fail-closed

A cleared profile, an offline wheelhouse, and SHA-256-pinned weights are claims about the framework. cve_remediation makes the claim checkable. The pipeline runs the full architecture (8 graph IRs, 117 node implementations, deterministic CLIPS routing) and the test that matters is simple: pull the network cable and it must still run. It does.

The canonical evidence, in one place:

Property	Value	How to reproduce
Offline test suite	887 tests passing, network-free, ~44s	`.venv/bin/python -m pytest -q` in `/home/sean/leagues/mal-ellim`, observed 2026-06-17 (`887 passed, 27 warnings in 44.31s`)
Total rule firings (8 IRs)	170 (main graph 100)	`cve_remediation.run_demo --json`
Main graph hash	`7f75a0673985476ed52373207604302a81e20cdc2f88a5eb37469b3a5e8ff358`	same; structural hash of `graph/harbor.yaml`
Per-IR exit code	0 across all 8	same

Two qualifications on this table. First, the shipped README.md/RUNNING.md still state 546 tests. The suite has since grown to 887, so the documented count is stale. The 887 figure is a single local observation, dated above. Treat it as a measurement to re-run, not a committed CI artifact. Second, the load-bearing falsifiability is not "N tests pass." It is the structural reproducibility: the same 170 firings and the same graph hashes on every offline run. A graph hash is a hash of the graph definition (topology, node signatures, state schema, rule-pack versions), so its stability run-to-run is close to tautological for an unchanged file. The rule-firing counts are the real determinism signal, because they reflect what the engine actually did. If any non-deterministic primitive leaked into routing, a firing count would drift. None does.
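An evaluator re-running the demo could check for drift with something like the sketch below. The per-IR counts are the ones reported in this paper. The key names (for example `main`) and the assumption that a run can be reduced to a flat `{ir_name: firing_count}` mapping are hypothetical, since the `--json` output schema is not reproduced here.

```python
# Per-IR firing counts as reported in this paper; key names are illustrative.
EXPECTED_FIRINGS = {
    "main": 100,
    "doctrine_ingest": 10,
    "offline_learning": 14,
    "drift_watch": 11,
    "tier_re_eval": 5,
    "audit_anchor": 10,
    "lab_leak_reaper": 5,
    "rolling_restart": 15,
}


def firing_drift(expected: dict, observed: dict) -> dict:
    """Return {ir: (expected, observed)} for every IR whose count differs."""
    keys = sorted(set(expected) | set(observed))
    return {
        k: (expected.get(k), observed.get(k))
        for k in keys
        if expected.get(k) != observed.get(k)
    }


total = sum(EXPECTED_FIRINGS.values())

# Simulate a run where one IR fired an extra rule.
drifted_run = dict(EXPECTED_FIRINGS, drift_watch=12)
drift = firing_drift(EXPECTED_FIRINGS, drifted_run)
```

An empty result from `firing_drift` across two independent offline runs is the reproducibility signal. Any non-empty result names the IR to investigate.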

Offline-by-default is enforced at code choke points, not by convention

Live infrastructure lights up only behind the three explicit, default-off env flags named in the abstract. The enforcement is in the code path.

_dispatch_intent (cve_remediation/graph/nodes/_shared.py:242-346) builds a deterministic broker envelope and returns it with no external I/O unless CVE_REM_LIVE_BROKER=1 and a lifespan-active broker is registered. If the flag is set but no broker is reachable, it falls back to offline policy evaluation and stamps broker_unavailable=True into state. The gap stays observable instead of silently masked.

_servicenow_auth (_shared.py:349-366) is the harder choke point. It returns a non-empty error before resolving any credentials whenever live mode is off:

if not _servicenow_live_enabled():
    return None, headers, _SN_OFFLINE_ERR

Every direct-httpx caller short-circuits on a non-empty error, so even with SERVICENOW_BASE_URL plus valid credentials in the environment, no network call is made. The docstring names the contract: "the choke point that keeps 'offline' actually offline even when SERVICENOW_BASE_URL + creds are present in the environment."
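The caller-side contract can be sketched end to end. Everything here is illustrative: `live_enabled`, `servicenow_auth`, `post_ticket`, the error text, and the credential variable name are hypothetical, and treating only the literal value "1" as enabled is an assumption consistent with the flags as documented, not a reading of the real helper.

```python
from typing import Mapping, Optional, Tuple

_OFFLINE_ERR = "servicenow live mode disabled"  # hypothetical message text


def live_enabled(flag: str, env: Mapping[str, str]) -> bool:
    """Assumption: a flag is on only when set to "1"; unset or empty means off."""
    return env.get(flag, "") == "1"


def servicenow_auth(env: Mapping[str, str]) -> Tuple[Optional[str], dict, str]:
    headers: dict = {}
    if not live_enabled("HARBOR_SERVICENOW_LIVE", env):
        # Return before any credential or base URL is even looked at.
        return None, headers, _OFFLINE_ERR
    return env.get("SERVICENOW_BASE_URL"), headers, ""


def post_ticket(env: Mapping[str, str]) -> str:
    """Caller short-circuits on a non-empty error, so no request is built."""
    base_url, _headers, err = servicenow_auth(env)
    if err:
        return "skipped"
    return f"would POST to {base_url}"


# Base URL and a (hypothetically named) credential are present, but the flag is unset.
env_with_creds = {"SERVICENOW_BASE_URL": "https://example.invalid", "SERVICENOW_TOKEN": "x"}
result = post_ticket(env_with_creds)
```

Putting the check in the auth helper means a new caller inherits the offline behavior by default, rather than having to remember to add its own guard.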

Tests are hermetically forced offline

A runtime choke point is worthless if a developer's .env can leak live behavior into the suite. conftest.py closes that gap. Harbor calls load_dotenv() at import time, and load_dotenv defaults to override=False, so values already in os.environ win. conftest.py:23-41 exploits this. At import time, before any test module imports harbor, it force-empties the full live-mode toggle set: CVE_REM_LIVE_BROKER, HARBOR_SERVICENOW_LIVE, the SERVICENOW_* quad, and the LLM_* quad:

for _var in _OFFLINE_TEST_ENV:
    os.environ[_var] = ""

The SERVICENOW_* vars are cleared as defense-in-depth. The pipeline gates egress on HARBOR_SERVICENOW_LIVE, but the underlying CMDB read tools do not self-gate, so an empty base URL guarantees no live instance can be resolved. A developer .env cannot make the suite reach the network, regardless of contents.
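The ordering argument is easy to check in isolation. The sketch below uses a small stand-in for load_dotenv(override=False) rather than the real library, and a shortened, hypothetical _OFFLINE_TEST_ENV tuple. What it shows is that a key already present in the environment wins even when its value is the empty string.

```python
from typing import Mapping, MutableMapping

# Hypothetical subset; the source names the toggles but not the full tuple.
_OFFLINE_TEST_ENV = ("CVE_REM_LIVE_BROKER", "HARBOR_SERVICENOW_LIVE", "SERVICENOW_BASE_URL")


def force_offline(environ: MutableMapping[str, str]) -> None:
    """What conftest.py does at import time: pin every live toggle to empty."""
    for var in _OFFLINE_TEST_ENV:
        environ[var] = ""


def load_without_override(dotenv_values: Mapping[str, str], environ: MutableMapping[str, str]) -> None:
    """Stand-in for load_dotenv(override=False): keys already present are left alone."""
    for key, value in dotenv_values.items():
        if key not in environ:
            environ[key] = value


environ: dict[str, str] = {}
force_offline(environ)
# A developer .env that tries to turn live mode on.
load_without_override({"CVE_REM_LIVE_BROKER": "1", "LOG_LEVEL": "debug"}, environ)
```

After this runs, CVE_REM_LIVE_BROKER is still empty while the unrelated LOG_LEVEL value loads normally. The trick depends on conftest.py running before the first import of harbor. If that order flips, the .env values land first.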

Air-gap governance is deterministic CLIPS, not LLM judgment

The isolation policy lives in cve_rem.offline_isolation as Fathom CLIPS rules over declared network_edge / replica_load / redaction_pack facts (graph/rules/cve_rem.offline_isolation/rules.clp). Three halt families:

isolation-no-inbound-from-prod: any network_edge with direction "inbound" and source_zone "production" asserts a bosun.violation at severity "halt".
isolation-egress-only-to-approved-drop: any outbound edge whose dest_zone is not "approved-drop" halts.
isolation-replica-load-without-redaction-pack and its siblings: a replica load with an empty redaction-pack hash, a hash with no active matching redaction_pack, or a pack signed off the trusted-signer list all halt.

These are pattern matches against provenance-typed facts. A rules engine evaluates them identically every run. No stochastic routing, no prompt to argue with.
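For readers who do not read CLIPS, the three halt families can be approximated as plain predicates over typed facts. This is a Python stand-in, not the shipped rules.clp: the fact field names beyond direction, source_zone, and dest_zone are assumptions, and the three replica-load siblings are collapsed under one rule name for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkEdge:
    direction: str
    source_zone: str
    dest_zone: str


@dataclass(frozen=True)
class ReplicaLoad:
    redaction_pack_hash: str  # hypothetical field name


@dataclass(frozen=True)
class RedactionPack:
    pack_hash: str  # hypothetical field name
    signer: str     # hypothetical field name
    active: bool = True


def halt_violations(edges, loads, packs, trusted_signers):
    """Return the names of halt-severity rules that match, in fact order."""
    found = []
    for edge in edges:
        if edge.direction == "inbound" and edge.source_zone == "production":
            found.append("isolation-no-inbound-from-prod")
        if edge.direction == "outbound" and edge.dest_zone != "approved-drop":
            found.append("isolation-egress-only-to-approved-drop")
    active = {p.pack_hash: p for p in packs if p.active}
    for load in loads:
        pack = active.get(load.redaction_pack_hash)
        # Empty hash, no active matching pack, or an untrusted signer all halt.
        if not load.redaction_pack_hash or pack is None or pack.signer not in trusted_signers:
            found.append("isolation-replica-load-without-redaction-pack")
    return found
```

The function reads no clock, no randomness, and no model, so the same facts always yield the same violations. That is the property the CLIPS rules provide.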

DoD-style clearance lattice, taint-aware at the HITL boundary

Source access is gated by a clearance lattice (unclassified < cui < confidential < secret < top-secret, RUNNING.md:275) plus per-purpose allowlists. Sources in nautilus.yaml carry a classification and allowed_purposes; the agent's clearance (cui) must dominate a source's classification or the request is denied. _evaluate_nautilus_policy_offline (_shared.py:300-320) runs the same nautilus.yaml rules even when the live broker is unreachable, so policy is enforced offline rather than bypassed.
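The dominance check itself is small. A minimal sketch, assuming a source is a mapping with the classification and allowed_purposes keys named above; the purpose strings are hypothetical.

```python
# Lattice order as given in RUNNING.md, lowest to highest.
LATTICE = ("unclassified", "cui", "confidential", "secret", "top-secret")
RANK = {level: index for index, level in enumerate(LATTICE)}


def request_allowed(agent_clearance: str, purpose: str, source: dict) -> bool:
    """Allow only if clearance dominates classification AND the purpose is allowlisted."""
    dominates = RANK[agent_clearance] >= RANK[source["classification"]]
    return dominates and purpose in source["allowed_purposes"]


# Hypothetical source entries in the shape of nautilus.yaml.
cmdb = {"classification": "cui", "allowed_purposes": ["cve-triage"]}
intel = {"classification": "secret", "allowed_purposes": ["cve-triage"]}
```

With the agent at cui, a cve-triage request against cmdb passes, the same request against intel is denied on clearance, and any other purpose against cmdb is denied on the allowlist. An unknown level raises KeyError instead of defaulting to a rank, which keeps the check fail-closed.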

The taint policy is the deliberately air-gap-aware piece (cve_rem.taint_policy/pack.yaml:7-9): "only an explicit human /respond approval clears — offline-synthesized or env-default clearances never satisfy the gate. Tainted runs hold at HITL." An offline run can synthesize a clearance to keep evaluating, but that synthesized clearance can never satisfy a human approval gate. The fail-closed direction is correct. Offline convenience never escalates into authority.
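One way to make that rule hard to get wrong is to carry the origin of a clearance alongside its value. The enum and function below are illustrative names, not taken from the pack.

```python
from enum import Enum


class ClearanceOrigin(Enum):
    HUMAN_RESPOND = "human_respond"              # explicit /respond approval
    OFFLINE_SYNTHESIZED = "offline_synthesized"  # produced to keep an offline run evaluating
    ENV_DEFAULT = "env_default"                  # taken from an environment default


def hitl_gate_satisfied(origin: ClearanceOrigin) -> bool:
    """Only an explicit human approval clears the gate; everything else holds."""
    return origin is ClearanceOrigin.HUMAN_RESPOND
```

The check is an allowlist of one. Any origin added later holds at HITL until someone deliberately changes the gate.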

A confused-deputy gate that fail-closes in both modes

Before any broker work, _dispatch_intent checks the (intent, calling-node) pair against the _INTENT_GRANTS matrix (_shared.py:111) via the cve_rem.intent_authz pack. An unauthorized pairing fails closed:

if not authz["allow"]:
    raise HarborRuntimeError(
        f"intent_authz halt: node {caller!r} not authorized to "
        f"dispatch intent {intent_name!r} ({authz['reason']})"
    )

The grants matrix mirrors the real call sites, so legitimate flows always pass. The gate is cheap, deterministic, and enforces in both offline and live modes (_shared.py:282-299). A compromised or mis-wired node cannot borrow another node's authority to dispatch an intent it was never granted.
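A grants matrix of this kind can be as simple as a set of (intent, caller) pairs. The sketch below uses hypothetical intent and node names and a local exception class in place of HarborRuntimeError. The real _INTENT_GRANTS contents are not reproduced here.

```python
class IntentAuthzHalt(RuntimeError):
    """Local stand-in for the runtime error raised on an unauthorized pairing."""


# Hypothetical grants: each pair is (intent, calling node).
INTENT_GRANTS = frozenset({
    ("lookup_cve", "triage"),
    ("open_change_ticket", "remediate"),
})


def check_intent(intent: str, caller: str) -> dict:
    if (intent, caller) in INTENT_GRANTS:
        return {"allow": True, "reason": "granted"}
    return {"allow": False, "reason": "no grant for this intent/node pair"}


def dispatch(intent: str, caller: str) -> dict:
    authz = check_intent(intent, caller)
    if not authz["allow"]:
        # Fail closed before any envelope is built or broker is touched.
        raise IntentAuthzHalt(
            f"intent_authz halt: node {caller!r} not authorized to "
            f"dispatch intent {intent!r} ({authz['reason']})"
        )
    return {"intent": intent, "caller": caller}
```

Because the check is a set lookup with no mode parameter, offline and live behavior cannot diverge.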

Integrity is offline-verifiable end to end

Air-gapped means no online CA, no key server, no revocation fetch. Trust verification has to work with only the bytes on disk. The pipeline uses Ed25519 throughout.

Pack signing. sign_packs.py generates an in-process Ed25519 keypair and, per pack, writes a compact EdDSA-JWT manifest.jwt plus a <key_id>.pub.pem sidecar pinned on first use (TOFU, trust on first use: the first key seen for a pack is recorded and re-checked thereafter). Signing is idempotent. Re-running with no rule changes yields the same tree_hash (the JWT differs only in iat).
Doctrine ingest. Phase 0 (graph/phase0/doctrine_ingest.yaml:24) signs an Ed25519 manifest over node+edge SHA-256 and pins the hash to a boot-time allowlist, so doctrine the boot did not authorize cannot be loaded.
Ship quorum. Phase 6 (graph/phase6/offline_learning.yaml:29) gates ship on a Shamir 2-of-3 secret-sharing quorum. The r-shamir-fail rule halts when the quorum is not_reached. No single operator ships a model from the air-gapped host.

Every one of these checks is a hash comparison or a signature verification against local material. Establishing trust requires no outbound call, which is the only kind of trust that survives a one-way diode.
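The idempotence claim for pack signing rests on a content hash that ignores the signing outputs. A minimal sketch of such a tree hash follows. The length-prefixed encoding and the exclusion rule are assumptions, not the scheme sign_packs.py necessarily uses.

```python
import hashlib
from pathlib import Path
from typing import Mapping


def _is_signing_output(rel_path: str) -> bool:
    # Exclude artifacts the signer itself writes, so re-signing does not change the hash.
    return rel_path == "manifest.jwt" or rel_path.endswith(".pub.pem")


def tree_hash(files: Mapping[str, bytes]) -> str:
    """SHA-256 over (path, content) pairs in sorted path order."""
    digest = hashlib.sha256()
    for rel_path in sorted(files):
        if _is_signing_output(rel_path):
            continue
        name = rel_path.encode("utf-8")
        body = files[rel_path]
        # Length prefixes keep (name, body) boundaries unambiguous.
        digest.update(len(name).to_bytes(4, "big") + name)
        digest.update(len(body).to_bytes(8, "big") + body)
    return digest.hexdigest()


def read_pack(root: Path) -> dict[str, bytes]:
    """Load a pack directory into the mapping tree_hash expects."""
    return {p.relative_to(root).as_posix(): p.read_bytes() for p in root.rglob("*") if p.is_file()}
```

Sorting the paths removes any dependence on directory iteration order, so the same bytes on disk give the same hash on any host. Verification is then one comparison against the hash inside the signed manifest.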

Reproducibility and audit: byte-identical replay, counterfactuals, and STRIDE

An air-gapped run is only auditable if it can be reproduced exactly and inspected adversarially. Stargraph treats both as engineering invariants. Three mechanisms back this: a determinism scope that pins every non-deterministic primitive, counterfactual replay keyed on a structural graph hash, and a STRIDE threat model wired to code choke points. A single-process invariant holds them together.

The replay and counterfactual machinery is the subject of the companion post "Counterfactual Replay." This section covers what is load-bearing for air-gap audit.

Byte-identical replay

Non-determinism is the enemy of reproducible audit. Stargraph's DeterminismScope records and replays every non-deterministic primitive a run can touch: wall-clock reads, RNG draws, UUID generation, OS randomness, and secret-token minting (src/stargraph/replay/determinism.py:2-24). On replay, those primitives return their recorded values, so two runs of the same graph produce byte-identical output.

Two design decisions close the usual leaks. First, the IR forbids set/frozenset in state_schema, because hash-randomized iteration order would make state serialization non-deterministic across processes. The schema rejects these types rather than hoping authors avoid them. Second, recorded HTTP interactions replay from a cassette layer; in CI the cassette runs with record_mode='none', so a missing recording is a hard failure, not a silent live request. No "reproducible" test can quietly reach the network. This matters most where it is hardest to verify. An air-gapped replica has no upstream to diff against, so determinism is what lets you re-run a recorded session on a disconnected host and confirm bit-for-bit that nothing changed.
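The record-and-replay idea can be shown with a toy scope. This is not Stargraph's DeterminismScope. It is a minimal sketch with invented names that routes two primitives (a clock read and a UUID) through a tape.

```python
import time
import uuid
from typing import Any, Callable, Optional


class RecordingScope:
    """Records primitive results on first run; returns them verbatim on replay."""

    def __init__(self, tape: Optional[list] = None):
        self.replaying = tape is not None
        self.tape: list = list(tape) if tape is not None else []
        self._cursor = 0

    def call(self, name: str, primitive: Callable[[], Any]) -> Any:
        if self.replaying:
            if self._cursor >= len(self.tape) or self.tape[self._cursor][0] != name:
                # A missing or out-of-order recording is a hard failure, never a live call.
                raise LookupError(f"no recorded value for {name!r} at step {self._cursor}")
            value = self.tape[self._cursor][1]
            self._cursor += 1
            return value
        value = primitive()
        self.tape.append((name, value))
        return value


def run(scope: RecordingScope) -> str:
    started = scope.call("time", time.time)
    run_id = scope.call("uuid4", lambda: str(uuid.uuid4()))
    return f"{run_id}@{started!r}"
```

Running once with RecordingScope() and again with RecordingScope(tape=first.tape) yields identical strings. The replay failure mode mirrors the cassette rule above: when a recording is absent, the scope raises rather than falling through to the real primitive.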

Counterfactual replay and structural graph hashing

Stargraph names counterfactual replay as a first-class differentiator: deterministic re-execution from any checkpoint with a mutated rule, output, or fact (src/stargraph/replay/counterfactual.py:14-15,135). This is how an auditor asks "what would this run have done if this fact were different" without a live LLM and without re-running upstream work.

The safety mechanism is the structural graph hash: sha256(canonical(topology + node signatures + state schema + rule pack versions)) (ADR 0005, design-docs/stargraph-adrs.md:98-111). On resume, the engine recomputes the hash and refuses to continue on a mismatch unless a migrate block explicitly applies (src/stargraph/graph/run.py:146-167, _refuse_cf_prefix). A graph that changed shape since the checkpoint cannot silently resume against stale state. The auditor is forced to acknowledge the change. Note the scope. This hash protects against graph-definition drift between checkpoint and resume. It is not itself a proof that node execution was deterministic; the determinism scope and rule-firing counts carry that.
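A sketch of the hash and the resume check, assuming sorted-key compact JSON as the canonical form. The ADR's actual canonicalization is not reproduced here, and the field layout and exception name are illustrative.

```python
import hashlib
import json


def graph_hash(topology, node_signatures, state_schema, rule_pack_versions) -> str:
    """sha256 over a canonical serialization of the four structural inputs."""
    canonical = json.dumps(
        {
            "topology": topology,
            "node_signatures": node_signatures,
            "state_schema": state_schema,
            "rule_pack_versions": rule_pack_versions,
        },
        sort_keys=True,            # dict key order must not affect the hash
        separators=(",", ":"),     # no whitespace variation
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class GraphHashMismatch(RuntimeError):
    pass


def check_resume(checkpoint_hash: str, current_hash: str, migrate_applies: bool = False) -> None:
    """Refuse to resume against a graph whose structure changed since the checkpoint."""
    if checkpoint_hash != current_hash and not migrate_applies:
        raise GraphHashMismatch("graph definition changed since checkpoint")
```

Note that sort_keys normalizes mapping order only. A list of edges hashes in the order given, so a real canonical form also needs a rule for list ordering.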

STRIDE threat model

The cleared posture is backed by a real STRIDE threat model: 36 cells across 6 surfaces by 6 categories (docs/security/threat-model.md:7). The mitigations:

Spoofing. mTLS is the default auth provider under the cleared profile (serve/auth:MtlsProvider), closing the spoofed-client surface.
Tampering, pack signing. Verification is algorithm-strict: only Ed25519 is accepted, and alg:none, HS256, and RS256 are rejected at load (threat-model.md:36). Trust is anchored by a static pubkey allowlist with TOFU first-pin and documented pin-drift handling.
Repudiation, audit. The audit sink is an fsync'd, append-only JSONL log. (As noted in the cleared-profile section, the audit_required flag itself is not yet wired. The single-sink append-only design is the mechanism; mandatory enforcement is owed to a later phase.)
Trigger trust. Webhook triggers are untrusted by default and HMAC-gated; cron trusts nothing external (threat-model.md:96-112).
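Algorithm strictness is the cheapest of these mitigations to illustrate. The sketch below inspects only the JOSE header of a compact token and rejects anything that is not EdDSA (the JWS algorithm name under which Ed25519 signatures travel). Function names are invented, and a real verifier must still check the signature afterward.

```python
import base64
import json

ALLOWED_ALGS = frozenset({"EdDSA"})


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    # Compact JWS strips padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def require_allowed_alg(compact_jwt: str) -> str:
    """Reject at load time unless the header names an allowlisted algorithm."""
    parts = compact_jwt.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWS: expected three segments")
    header = json.loads(_b64url_decode(parts[0]))
    alg = header.get("alg")
    if alg not in ALLOWED_ALGS:
        raise ValueError(f"rejected alg {alg!r}")
    return alg


def unsigned_token_with_alg(alg: str) -> str:
    """Test helper: a structurally valid token with a chosen header and a dummy signature."""
    return ".".join([_b64url_encode(json.dumps({"alg": alg}).encode()), _b64url_encode(b"{}"), "c2ln"])
```

Using an allowlist rather than a denylist means an algorithm nobody thought to ban is rejected by default.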

The model is honest about its edges. It documents two named post-1.0 gaps: Fathom-pack hot-reload (a pack change requires a serve restart rather than a live swap) and a per-run wall-clock cap (threat-model.md:145-148). A threat model that names its own gaps is more trustworthy than one that claims none.

The single-process invariant

Audit cohesion depends on a structural choice: one process, one audit sink. This is locked Decision #5: one fsync'd JSONL audit sink, no multi-worker fan-out, scale by running multiple isolated instances. A single writer means the audit log is a true total order of decisions, and replay determinism holds because there is no inter-worker scheduling nondeterminism to record. The same invariant drives the rate limiter: in-memory only, single-process scheduler (NFR-14, serve/ratelimit.py:8; single-sink referenced in bosun/audit/__init__.py:16). The documented CLI refusal of --workers > 1 does not exist in the shipped code (see the closing deltas). Enforce single-process at the supervisor level.
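A single-writer, fsync'd, append-only JSONL sink is a small amount of code. The class below is a minimal sketch with invented names, not the Bosun audit module, and it assumes one instance per process.

```python
import json
import os


class AuditSink:
    """Append-only JSONL writer: one line per event, fsync'd before returning."""

    def __init__(self, path: str):
        # O_APPEND makes every write land at end-of-file; no seek, no truncate.
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._seq = 0

    def append(self, event: dict) -> int:
        self._seq += 1
        # The sink owns the sequence number, so callers cannot reorder history.
        line = json.dumps({**event, "seq": self._seq}, sort_keys=True) + "\n"
        os.write(self._fd, line.encode("utf-8"))
        os.fsync(self._fd)  # durable on disk before the decision is acted on
        return self._seq

    def close(self) -> None:
        os.close(self._fd)
```

With one writer, seq is a total order by construction. Two workers sharing the file would each hold their own counter, which is exactly the ambiguity the single-process invariant rules out.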

Falsifiable reproducibility evidence

Driving all 8 graph IRs of cve_remediation through the offline demo emits stable graph hashes and a fixed total of 170 rule firings:

Graph IR	Rule firings
main graph (`graph/harbor.yaml`)	100
`doctrine_ingest`	10
`offline_learning`	14
`drift_watch`	11
`tier_re_eval`	5
`audit_anchor`	10
`lab_leak_reaper`	5
`rolling_restart`	15
Total	170

Every IR exits 0, and the firing counts are stable run-to-run. Stability is the falsifiable part. If any non-deterministic primitive leaked into routing, a count would change between runs. None has. Reproducibility here is a measured property of the rules engine, not a promise.
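An evaluator can turn the table into a regression check. The expected counts below are copied from the table above. The drift function and the short key for the main graph are illustrative, and how observed counts are collected from a run is left to the evaluator.

```python
# Per-IR rule firings as reported in the table above.
EXPECTED_FIRINGS = {
    "main": 100,
    "doctrine_ingest": 10,
    "offline_learning": 14,
    "drift_watch": 11,
    "tier_re_eval": 5,
    "audit_anchor": 10,
    "lab_leak_reaper": 5,
    "rolling_restart": 15,
}


def firing_drift(expected: dict, observed: dict) -> dict:
    """Return {ir_name: (expected, observed)} for every IR whose count differs."""
    names = sorted(set(expected) | set(observed))
    return {
        name: (expected.get(name), observed.get(name))
        for name in names
        if expected.get(name) != observed.get(name)
    }


assert sum(EXPECTED_FIRINGS.values()) == 170
```

An empty result from firing_drift means the run matched. Any non-empty result names the IR to investigate, which is more useful to an auditor than a changed total.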

Related work, positioning, and honest trade-offs

Positioning: not a mindshare contest

Stargraph does not try to out-ecosystem LangGraph or n8n. It says so directly: "Not chasing LangGraph or n8n on mindshare. It competes on correctness, inspectability, and ability to run where those tools can't" (README.md:140-141). The target is named explicitly: DoD, regulated, air-gapped, cleared workloads. The technical wedge is architectural, not feature-count. Rules decide transitions over provenance-typed facts, which is the property the others do not offer and the reason Stargraph runs behind a one-way diode where an LLM-routed graph cannot make a deterministic guarantee.

The on-ramp is stated plainly. LangGraph is named as "the realistic audience" for a migration guide (design-docs/stargraph-roadmap.md:113). The YAML/JSON IR is the canonical, language-neutral graph definition specifically so that bidirectional conversion with LangGraph/Agno/CrewAI/n8n is tractable (ADR 0009).

Tool	Relationship	Source
LangGraph	Named non-competitor on mindshare; named migration on-ramp; IR conversion target	README.md:140; roadmap:113; ADR 0009
n8n	INSPIRATION.md peer; explicit non-goal ("parity with n8n"); IR conversion target	INSPIRATION.md; design §4; ADR 0009
Agno / CrewAI	IR bidirectional-conversion targets	ADR 0009
DSPy	Wrapped for LLM nodes; not replaced (non-goal)	README "What Stargraph is not"
LangChain	Store interop rejected (dependency surface)	ADR 0006

Intellectual honesty: the preregistered study refuted three of its own claims

A preregistered study in whitepaper-proof/ (verdicts against PREREGISTRATION.md, paired bootstrap 95% CIs, 10k resamples, * = CI excludes zero) refuted the intuitive positioning claims this project might otherwise have leaned on. They are reported here as the repo reports them. One limitation throughout: the study's per-cell sample sizes and total run count are not reproduced here, so the CIs below are interpretable only relative to one another, not as absolute power claims.
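For readers unfamiliar with the method, a paired bootstrap CI can be sketched as follows. This uses the plain percentile method, which is an assumption (the study may use a different variant), and the function is meant for toy inputs, not the study's data.

```python
import random


def paired_bootstrap_ci(a, b, resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(resamples))
    lo = means[int((alpha / 2) * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return sum(diffs) / n, lo, hi


def excludes_zero(lo: float, hi: float) -> bool:
    """The condition the * marker denotes: the whole interval sits on one side of zero."""
    return lo > 0 or hi < 0
```

Pairing matters because both arms run the same scenarios. Resampling the per-scenario differences keeps that correlation, which resampling the two arms independently would discard.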

"Fewer/one broker tool beats many specialized tools" was refuted (C2). The broker was no better at any tier and worse at two: weak −4.4pp [−7.6,−1.2]*, mid ±0.0, strong −2.4pp [−4.4,−0.8]*. The broker relocates the error rather than removing it. Choosing among named tools becomes phrasing an intent for a static keyword router, and the failures were zero-source routes from paraphrase mismatch ("location" ≠ "site"). The surviving claim is narrower: "a single broker tool with a competent router costs nothing measurable in task success and brings policy enforcement, attestation, and audit for free. 'Fewer tools beat more' is unsupported" (FINDINGS.md C2). An exploratory LLM intent router reached parity with specialized tools at all tiers and beat the keyword-routed broker at the strong tier by +2.4pp [+0.8,+4.4]* (a generator-vs-router contrast at fixed executor).

"Broad/intent prompts beat prescriptive prompts" was refuted (C3). Intent lost overall at every tier (weak −39.2pp*, mid −19.2pp*, strong −22.5pp*). The predicted crossover does exist, but between a hybrid (prescriptive rules plus a judgment escape clause) and prescriptive, not between intent and prescriptive. Hybrid matched prescriptive in-distribution and beat it on conflict scenarios (mid +15.0pp [+3.3,+26.7]*, strong +10.0pp [+1.7,+18.3]*). Reading: rules are load-bearing. Broad intent only buys an escape hatch, and only without paying the in-distribution cost if you keep the rules.

"Generator capability barely matters" was refuted (C4), and the external anchor with it. The repo's external anchor is Lin et al. (2026, arXiv:2605.30621) on self-improving agent harnesses. Stargraph replicated it (Exp 4) and refuted both predicted shapes. Generator capability matters where it counts: at the weak executor, the strong generator beats the weak one by +26.7pp [+13.3,+40.0]* (generator-vs-generator at a fixed weak executor; +16.7pp [+3.3,+30.0]* in the head-to-head contrast). The gain is non-monotonic, but the mid tier is a trough, not a peak, the opposite shape from Lin et al. A plausible boundary condition: their result may hold where the harness encodes procedural scaffolding the executor cannot infer, whereas here the update mostly transmits decision rules that strong executors already derive. The external anchor is a contrast point, not a confirmation.

The cross-cutting result is the honest takeaway: in three of four experiments the preregistered pure position lost and an instrumented middle won. "Pure positions ('fewer tools', 'broad intent', 'generator doesn't matter') refute; instrumented middles hold." The architecture this paper describes, broker plus competent router and hybrid rule packs, is the instrumented middle.

Disclosed doc-vs-code deltas

This is the single authoritative accounting of where the documentation runs ahead of the shipped code. Evaluators should audit the code, not the prose.

Delta	Doc says	Code does	Source
NFS/SMB filesystem	"refuses to start" via a `statfs(2)` magic-number table in `serve.lifecycle:_check_local_fs`	Returns a `health()` warning string for `{nfs,nfs4,smb,smbfs,cifs}`; `_detect_fs_type` walks `os.statvfs` + `/proc/mounts`; no `_check_local_fs`, no startup abort	doc `air-gap-deployment.md:216-224`; code `stores/_common.py:106-169` (confirmed absent from `serve/lifecycle.py`)
Cleared `--workers`	cleared profile rejects `--workers > 1` ("`--workers 4` exits with code 2")	serve CLI has no `--workers` option at all; invariant holds only via single-sink/rate-limiter design	doc `air-gap-deployment.md:194-214`; code `cli/serve.py` (confirmed absent)
Cleared gate flags	"All gates closed"	2 of 4 wired: `signature_verify_mandatory`, `default_deny_capabilities` enforced; `tls_required`, `audit_required` declared but not read by runtime	code `serve/profiles.py:202-219` (grep confirms only profiles.py references the two unwired flags)
Cleared 403 route list	7 mutation routes incl. `runs:counterfactual`, `artifacts:write`, `tools:broker_request`	HTTP route gates: cancel/pause/respond/resume/`counterfactual:run`/artifacts:read; `artifacts:write` and `tools:broker_request` are engine/node-level (`CapabilityError`), not HTTP 403	code `serve/api.py:686-1188`
Embedder default	offline-by-default	`allow_download=True` by default; offline requires `HF_HUB_OFFLINE=1` set process-wide	code `stores/embeddings.py:145-154`

The boot-time controls that are enforced remain so. The cleared startup gate raises ProfileViolationError with a non-zero exit before any durable I/O when --allow-side-effects or --allow-pack-mutation are passed under --profile cleared (cli/serve.py:213-227).
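Until the unwired items land, some of the deltas above can be covered by a pre-flight check run by the supervisor before launch. The sketch below is a hypothetical helper, not part of Stargraph. It checks the embedder variable and the two live-mode flags named in this paper, and it treats any non-empty flag value as suspicious.

```python
from typing import Mapping

# Live-mode flags named in this paper; any non-empty value is flagged.
LIVE_FLAGS = ("CVE_REM_LIVE_BROKER", "HARBOR_SERVICENOW_LIVE")


def preflight(env: Mapping[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the environment looks offline."""
    problems = []
    if env.get("HF_HUB_OFFLINE") != "1":
        problems.append("HF_HUB_OFFLINE is not 1: the embedder may attempt a download")
    for flag in LIVE_FLAGS:
        if env.get(flag):
            problems.append(f"{flag} is set: live infrastructure may be reachable")
    return problems
```

A supervisor would call preflight(os.environ) and refuse to start the process on a non-empty result. This does not replace a packet capture, but it catches the misconfiguration before the first import.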

Maturity caveats for evaluators

Version metadata drift. pyproject.toml:7 declares 0.4.0; README.md:10 still says "Status: v0.3.0 — Alpha." Treat version-derived claims with caution.
In-progress harbor→stargraph rename. The package installs as PyPI dist stargraph but is still imported as harbor. mal-ellim's code and README still use Harbor naming throughout (HARBOR_* env vars, harbor.tools.*, harbor run). The CHANGELOG notes the rename was wholesale with no compat shims, so import sites and env-var names are mid-migration.
Hard Python floor. Both projects require Python >= 3.13. No lower-version fallback.
Stale test count in mal-ellim docs. README/RUNNING state 546 offline tests; the suite now collects 887, observed passing offline 2026-06-17.
Two named post-1.0 threat-model gaps. Fathom-pack hot-reload is absent (a pack change requires a serve restart), and there is no per-run wall-clock cap. Both are disclosed in the threat model itself.

None of these caveats touch the core falsifiable claim: stable graph hashes and 170 fixed rule firings on every offline run, plus the two enforced cleared-profile gates (signature verification and capability default-deny) and the boot-time refusal of the escape hatches. They are disclosed because an evaluator deploying behind a diode should know exactly where the prose runs ahead of the code and where the version story is still settling.

What holds and what to watch

The argument is narrow and the evidence is concrete. Take the language model out of the router and decide transitions with a deterministic rules engine over provenance-typed facts, and the control plane stops being a sample and starts being a record: one you can diff, replay, and audit on a host with no upstream to consult. Close every external touchpoint by default and open it only through a named, observable flag, and the air-gap posture becomes a property of the running process rather than a checklist an operator is trusted to follow. The cve_remediation workload shows both halves working together today: 170 fixed rule firings and stable graph hashes on every offline run, with live infrastructure reachable only behind three default-off env flags, each enforced at a code choke point.

The honest accounting matters as much as the architecture. Two of the cleared profile's four gate flags are not yet wired. A documented NFS refusal is a soft warning. A documented --workers refusal does not exist. The embedder needs an env var, not just a profile, to stay offline. The test count in the docs is stale. These are listed not to undercut the thesis but because a paper claiming auditability cannot itself be unauditable. The deterministic spine (rules routing, typed facts, reproducible firings) is real and enforced. The runtime gates that depend on it are partly enforced and partly owed to the deployment layer and a later release. An evaluator who deploys behind a diode should enforce single-process operation, terminate mTLS at the edge, set HF_HUB_OFFLINE=1 process-wide, confirm the engine-side side-effect refusal in their build, and run their own packet capture. The architecture gives them a posture the code can defend. The gaps tell them exactly where to keep watch.
