2026-06-17, 7 min read

AI models and harnesses

Model routing by task lane, agent surfaces we run, eval harness catalog, and promotion rules: models are rented, harnesses are owned.

operating stack, ai, frameworks

Tools are vendors. Models are inference choices. Harnesses are the rails that make models safe in production. This memo documents how Exiid routes models, which agent surfaces we run, and what must pass before autonomy moves up the ladder.

Who this is for

For: operators putting agents into revenue-bearing workflows.

Useful when: you need model transparency without a chatbot logo strip.

What must be true

Four conditions from AI Systems Design still gate every system:

Condition	The bar
Workflow	Documented and stable enough to measure
Volume	Enough repetitions to amortize eval cost
Ground truth	Outputs can be scored without arguing
Reversibility	Known undo path for worst-case output
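
The four conditions act as a conjunction: missing any one keeps a system out of scope. A minimal sketch of that gate, assuming a hypothetical `Readiness` record whose field names are illustrative and not taken from AI Systems Design:

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Readiness:
    """Hypothetical record of the four gating conditions for one workflow."""
    workflow_documented: bool   # Workflow: documented and stable enough to measure
    volume_sufficient: bool     # Volume: enough repetitions to amortize eval cost
    ground_truth_agreed: bool   # Ground truth: outputs can be scored without arguing
    undo_path_known: bool       # Reversibility: known undo path for worst-case output


def failed_conditions(r: Readiness) -> list[str]:
    """Return the names of unmet conditions; an empty list means the gate is open."""
    return [f.name for f in fields(r) if not getattr(r, f.name)]


candidate = Readiness(True, True, False, True)
print(failed_conditions(candidate))  # prints ['ground_truth_agreed']
```

Returning the failed names rather than a bare boolean keeps the refusal explainable to the operator.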

01 — Model routing matrix

Route by task lane, not one model for everything. Track cost per completed outcome.

Task	Default	Autonomy
Decode / research	Claude Sonnet	L1 Assist
Structured extraction	Haiku / 4o mini	L2 Reviewed
Code generation	Sonnet via Cursor	L2, L3 after CI green
High-stakes draft	Sonnet; Opus for hard calls	L2, human publishes
Irreversible actions	Human only	Above Approval Line
Embeddings / RAG	embedding-3-large or voyage-3	L4 with retrieval eval
Batch enrichment	mini at volume	L3 supervised
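
One way to read "cost per completed outcome" is total spend (model plus human review) divided by the outcomes that actually shipped. That definition is an assumption here, and the numbers below are illustrative only, not Exiid data:

```python
def cost_per_completed_outcome(model_spend: float, review_spend: float, completed: int) -> float:
    """Total spend divided by outcomes that were accepted (assumed definition)."""
    if completed <= 0:
        raise ValueError("no completed outcomes to divide by")
    return (model_spend + review_spend) / completed


# Illustrative figures: the cheaper model needs more review and completes fewer outcomes.
cheap = cost_per_completed_outcome(model_spend=4.0, review_spend=60.0, completed=80)
strong = cost_per_completed_outcome(model_spend=20.0, review_spend=30.0, completed=100)

print(f"cheap model:  {cheap:.2f} per outcome")   # 0.80
print(f"strong model: {strong:.2f} per outcome")  # 0.50
```

In this made-up case the model with the lower inference bill is the more expensive lane once review time and failed attempts are counted.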

Routing rules:

No frontier model on high-volume throughput work
No L4 without runtime checks (schema, bounds, allowlists)
Cheaper models only after eval corpus proves stability
Anthropic primary, OpenAI bench on client installs
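
The matrix and the L4 rule can be held as data rather than prose. The sketch below is one possible encoding, not Exiid's implementation: the lane keys, the `Lane` type, and `can_run` are hypothetical, and conditional levels such as "L3 after CI green" are reduced to their starting level.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lane:
    default: str              # default model or surface, as named in the matrix
    autonomy: Optional[int]   # starting autonomy level; None means human only


# Hypothetical keys; values transcribed from the routing matrix.
LANES = {
    "decode_research": Lane("Claude Sonnet", 1),
    "structured_extraction": Lane("Haiku / 4o mini", 2),
    "code_generation": Lane("Sonnet via Cursor", 2),
    "high_stakes_draft": Lane("Sonnet; Opus for hard calls", 2),
    "irreversible_actions": Lane("Human only", None),
    "embeddings_rag": Lane("embedding-3-large or voyage-3", 4),
    "batch_enrichment": Lane("mini at volume", 3),
}

# Rule: no L4 without runtime checks (schema, bounds, allowlists).
REQUIRED_L4_CHECKS = {"schema", "bounds", "allowlists"}


def can_run(lane_key: str, runtime_checks: set[str]) -> bool:
    """Apply only the human-only and L4 rules; other rules are out of scope here."""
    lane = LANES[lane_key]  # unknown lanes raise KeyError instead of guessing
    if lane.autonomy is None:
        return False
    if lane.autonomy >= 4 and not REQUIRED_L4_CHECKS <= runtime_checks:
        return False
    return True


print(can_run("embeddings_rag", {"schema"}))          # False: bounds and allowlists missing
print(can_run("embeddings_rag", REQUIRED_L4_CHECKS))  # True
```

Failing closed on unknown lanes means a new task type has to be added to the matrix before anything runs against it.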

02 — Agent surfaces we run

Model-agnostic where it helps. Gated everywhere it matters.

Surface	Role	Tier
Cursor	Primary coding harness	Runs
Claude	Decode, memos, reviewed automation	Runs
Codex	Bench coding failover	Bench
Gemini	Bench long-context decode	Bench
OpenClaw	Heartbeat ops on installs	Install
Pi	L1 quick capture	Bench
OpenCode	Headless CI agents	Install
Hermes	Strategy / CEO-class planning	Bench

Paperclip is Bench orchestration for Compounding Portfolio operators who need an org chart and per-agent budgets. Exiid still installs the Approval Line and eval harness on top.

03 — Harness catalog

Harness	Role
Context	AGENTS.md, rules, skills, content SSOT
Coding	CI + Playwright before merge
Eval	Golden set on every prompt/model change
Runtime	Schema, bounds, banned tools per output
Orchestration	n8n, Temporal, or Paperclip with approval nodes
Promotion	200 @ 98% before L3; demotion on suspicion
Drift	Weekly sampled trace review
Blast-radius	Action caps, draft-only side effects
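
To make the Runtime row concrete, here is a minimal sketch of a per-output check covering schema, bounds, and banned tools. The field names, the 0 to 15 bound, and the tool names are all hypothetical and chosen only for illustration:

```python
from typing import Any

# Hypothetical schema, bound, and banned tools for one workflow.
SCHEMA = {"customer_id": str, "discount_pct": float, "tool_calls": list}
MAX_DISCOUNT_PCT = 15.0
BANNED_TOOLS = {"delete_record", "send_payment"}


def check_output(output: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the output may proceed."""
    errors = []

    # Schema: every required field is present with the expected type.
    for key, expected in SCHEMA.items():
        if key not in output:
            errors.append(f"missing field: {key}")
        elif not isinstance(output[key], expected):
            errors.append(f"wrong type for {key}")

    # Bounds: numeric values stay inside an agreed range.
    pct = output.get("discount_pct")
    if isinstance(pct, float) and not 0.0 <= pct <= MAX_DISCOUNT_PCT:
        errors.append("discount_pct out of bounds")

    # Banned tools: refuse any output that requests a disallowed tool.
    calls = output.get("tool_calls")
    if isinstance(calls, list):
        for name in calls:
            if isinstance(name, str) and name in BANNED_TOOLS:
                errors.append(f"banned tool: {name}")

    return errors


bad = {"customer_id": "c-1", "discount_pct": 40.0, "tool_calls": ["send_payment"]}
print(check_output(bad))  # ['discount_pct out of bounds', 'banned tool: send_payment']
```

Collecting every violation instead of stopping at the first gives the weekly drift review more to work with per trace.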

04 — Eval loop cadence

Golden set on every change (50 cases minimum to automate)
Runtime checks on every run
Drift review weekly on production samples

Eval scores are promotion currency for the Autonomy Ladder. No score movement, no autonomy movement.
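
The promotion rule can be sketched as a small function. This assumes "200 @ 98%" means at least 200 scored runs with a pass rate of 98% or higher, and that "demotion on suspicion" drops exactly one level; both readings are assumptions, and `next_level` is a hypothetical name:

```python
MIN_RUNS_FOR_L3 = 200      # assumed reading of "200 @ 98%"
MIN_PASS_PCT_FOR_L3 = 98


def next_level(current: int, passed: int, total: int, suspicious: bool = False) -> int:
    """Return the autonomy level after an eval cycle (only the L2 to L3 gate is modeled)."""
    if suspicious:
        # Demotion on suspicion: drop one level, never below L1 (assumed step size).
        return max(1, current - 1)
    # Integer comparison avoids floating-point edge cases at exactly 98%.
    meets_bar = total >= MIN_RUNS_FOR_L3 and passed * 100 >= MIN_PASS_PCT_FOR_L3 * total
    if current == 2 and meets_bar:
        return 3
    # No score movement, no autonomy movement.
    return current


print(next_level(2, passed=196, total=200))                   # 3
print(next_level(2, passed=195, total=200))                   # 2
print(next_level(3, passed=200, total=200, suspicious=True))  # 2
```

Note that demotion is checked before the score, so a clean eval run cannot override a flagged trace.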

What we refuse

No autonomy without evals
No L4 without hard runtime checks
No "replace your team" theater on Client Services installs

Receipts

Method.app agent grid
AI Systems Design
AI-First Business Models: cost per completed outcome
One operator, many agents