Agents

Building Agents That Optimize AI

The job of an agent is to take a model that would have been wrong and make it right.

What an Agent Is For

A single LLM call answers a question. An agent is a small program whose only job is to make that answer trustworthy. Agents are the layer where directives, logic gates, retrievers, verifiers, and evaluators live. They turn a fluent guesser into a system you can put in front of users.

I build agents in three roles:

  • Director agents — carry the directive; decide what the next step should be.
  • Worker agents — do the narrow task (search, summarize, extract, translate, write SQL).
  • Verifier agents — their only job is to say no when an output violates the directive, the truth table, or the evidence.

Directive Engineering

A directive is not a prompt. A prompt is text. A directive is a small, versioned contract that names:

  • the system's purpose in one sentence;
  • the ontology — which entities exist and which properties they carry;
  • the allowed operations over those entities (ingest, expand, collapse, invoke);
  • the prohibitions — what the system must refuse to do, expressed as falsifiable rules;
  • the governing equation — the algebraic relationship every decision must satisfy. I use z = x · y: state is identity times behavior, and any output that cannot be expressed that way is rejected.

Directives are checked into source control alongside the code, version-tagged, and loaded by every agent at startup. When behavior drifts, I diff the directive — not the model weights.
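
A minimal sketch of how such a contract might be represented as data, assuming Python. The `Directive` class, its field names, and the record keys (`op`, `x`, `y`, `z`) are hypothetical and are not the ButterflyFX schema. Checking z = x · y as a numeric product is likewise an assumption made only so the example can run.

```python
import math
from dataclasses import dataclass
from typing import Callable

# Hypothetical directive shape; every field name here is illustrative.
@dataclass(frozen=True)
class Directive:
    version: str                    # tag from source control
    purpose: str                    # the one-sentence purpose
    ontology: dict[str, list[str]]  # entity name -> property names
    operations: frozenset[str]      # allowed operations
    # rule name -> predicate that returns True when the rule is violated
    prohibitions: dict[str, Callable[[dict], bool]]

    def violations(self, record: dict) -> list[str]:
        """Return the name of every rule the record breaks (empty means OK)."""
        broken = []
        if record.get("op") not in self.operations:
            broken.append("operation_not_allowed")
        if not _satisfies_equation(record):
            broken.append("governing_equation")
        broken += [name for name, rule in self.prohibitions.items() if rule(record)]
        return broken


def _satisfies_equation(record: dict) -> bool:
    # Assumption: z = x * y is treated as a plain numeric identity.
    try:
        return math.isclose(record["z"], record["x"] * record["y"])
    except (KeyError, TypeError):
        return False
```

Because each prohibition is a named predicate, a failed check reports which rule was broken, which is what makes the rules falsifiable in practice.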

Read the full ButterflyFX directive →

The Agent Loop

Every agent I build runs the same four-phase loop, borrowed from my game-engine work where 60fps gives you no room to improvise:

SAMPLE  →  EVALUATE  →  PUBLISH  →  VERIFY
  |           |            |          |
 read       decide       commit    reject if
 state      action       to log    directive
                                   violated

  1. SAMPLE — read the current state from the shared registry. No agent calls another agent directly. They publish coordinates; others read them.
  2. EVALUATE — run the directive's allowed operations. This is where logic gates and (if needed) an LLM call live.
  3. PUBLISH — write the result back to the registry as a typed record with full ancestry. No destructive updates.
  4. VERIFY — a separate verifier agent reads the published record and checks it against the directive's truth table. If it fails, the record is rolled back and the EVALUATE phase is retried with the failure as context.

The single-writer registry is the synchronization point. No callbacks, no event bus, no spaghetti.
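
The four phases can be sketched as follows, assuming Python and an in-memory registry. `Record`, `Registry`, `run_step`, and the `max_attempts` parameter are hypothetical names. The sketch also assumes that a rollback is recorded by marking the record as rejected rather than deleting it, so the log stays append-only.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass(frozen=True)
class Record:
    id: int
    parent: Optional[int]  # ancestry: id of the record this one was derived from
    payload: dict

@dataclass
class Registry:
    """Append-only store. Rollbacks are recorded as rejections, not deletions."""
    records: list = field(default_factory=list)
    rejected: set = field(default_factory=set)

    def sample(self) -> Optional[Record]:
        # Latest record that has not been rolled back.
        live = [r for r in self.records if r.id not in self.rejected]
        return live[-1] if live else None

    def publish(self, parent: Optional[Record], payload: dict) -> Record:
        rec = Record(len(self.records), parent.id if parent else None, payload)
        self.records.append(rec)
        return rec

    def roll_back(self, rec: Record) -> None:
        self.rejected.add(rec.id)


def run_step(
    registry: Registry,
    evaluate: Callable[[Optional[Record], Optional[str]], dict],
    verify: Callable[[Record], Optional[str]],
    max_attempts: int = 3,
) -> Optional[Record]:
    """Run one loop iteration; return the accepted record or None."""
    failure = None
    for _ in range(max_attempts):
        state = registry.sample()               # SAMPLE
        payload = evaluate(state, failure)      # EVALUATE (failure is retry context)
        rec = registry.publish(state, payload)  # PUBLISH
        failure = verify(rec)                   # VERIFY: None means accepted
        if failure is None:
            return rec
        registry.roll_back(rec)
    return None
```

Here `verify` returns a failure message instead of a boolean so that the retry can pass the reason back to `evaluate`.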

Evaluation Harnesses

An agent without an evaluation harness is folklore. I ship every agent with three fixtures:

  • Golden cases — canonical inputs with known correct outputs. These must always pass; a single failure blocks deployment.
  • Adversarial cases — inputs designed to elicit the failures we've actually seen in production. Every real incident becomes a permanent fixture.
  • Drift cases — held-out samples replayed weekly to detect silent regressions when a model provider updates a checkpoint.

The harness reports a single scalar score per directive version. That score is the only number that matters when deciding whether to ship a directive change.
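
A minimal sketch of such a harness, assuming Python. `Case`, `Report`, and `run_harness` are hypothetical names. The score here is the unweighted fraction of cases passed with exact-match comparison, which is an assumption; the text does not say how the real score is computed.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Case:
    kind: str      # "golden", "adversarial", or "drift"
    prompt: str
    expected: str

@dataclass(frozen=True)
class Report:
    directive_version: str
    score: float          # single scalar: fraction of all cases passed
    deploy_blocked: bool  # True if any golden case failed

def run_harness(
    agent: Callable[[str], str],
    cases: Sequence[Case],
    directive_version: str,
) -> Report:
    """Replay every fixture through the agent and summarize the result."""
    if not cases:
        raise ValueError("harness needs at least one case")
    passed = [agent(c.prompt) == c.expected for c in cases]
    golden_failed = any(
        c.kind == "golden" and not ok for c, ok in zip(cases, passed)
    )
    return Report(directive_version, sum(passed) / len(cases), golden_failed)
```

Keeping `deploy_blocked` separate from `score` means a high overall score cannot mask a golden-case failure.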

Optimization Patterns

  • Cheapest model first. Route by decision tree, truth table, and the Schwarz-D surface — promote to a larger model only when the small one's confidence falls below a gate threshold. See the routing page for the full algorithm and a live demo.
  • Manifold-focused retrieval. Pull only the slice of context whose coordinates sit inside the gradient region of interest. Most prompts get smaller, not larger, over time.
  • Verifier ensembles. Two verifiers with different prompts catch failure modes that a single verifier might share with the worker.
  • Deterministic shells. Wrap every LLM call in a parser, a schema validator, and a retry budget. The LLM never returns free text to the next stage.
  • Directive A/B. Two directive versions run in shadow on the same traffic; the harness picks the winner on the next deploy.
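
Two of the patterns above, the deterministic shell and cheapest-model-first routing, can be sketched together, assuming Python. The names `shell` and `route`, the JSON output format, and the self-reported `confidence` field are all assumptions for illustration. The decision tree, truth table, and Schwarz-D surface are not modeled here; only the confidence gate is.

```python
import json
from typing import Callable, Optional, Sequence

Model = Callable[[str], str]  # hypothetical: takes a prompt, returns raw text

def shell(
    model: Model, prompt: str, required: dict, retry_budget: int = 2
) -> Optional[dict]:
    """Return a validated dict, or None once the retry budget is spent."""
    for _ in range(retry_budget + 1):
        raw = model(prompt)
        try:
            data = json.loads(raw)  # parser
        except json.JSONDecodeError:
            continue                # malformed output costs one retry
        # Schema validator: every required key present with the right type.
        if isinstance(data, dict) and all(
            isinstance(data.get(key), typ) for key, typ in required.items()
        ):
            return data
    return None

def route(
    models: Sequence[Model], prompt: str, gate: float = 0.8
) -> Optional[dict]:
    """Try models from cheapest to largest; stop at the first confident answer."""
    schema = {"answer": str, "confidence": float}
    for model in models:
        result = shell(model, prompt, schema)
        if result is not None and result["confidence"] >= gate:
            return result
    return None
```

The next stage only ever receives a validated dict or `None`, never free text, so a larger model is called only when the smaller one fails the gate.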

Where This Has Shipped

The same patterns power the AI behind KensGames (decision-tree-driven NPCs over a manifold board state), TTLRecall (verified scenario generation), and the AI-accelerated DevOps and test-generation work referenced in my resume.