AI models and harnesses
Model routing by task lane, agent surfaces we run, eval harness catalog, and promotion rules: models are rented, harnesses are owned.
Tools are vendors. Models are inference choices. Harnesses are the rails that make models safe in production. This memo documents how Exiid routes models, which agent surfaces we run, and what must pass before autonomy moves up the ladder.
Who this is for
For: operators putting agents into revenue-bearing workflows.
Useful when: you need model transparency without a chatbot logo strip.
What must be true
Four conditions from AI Systems Design still gate every system:
| Condition | The bar |
|---|---|
| Workflow | Documented and stable enough to measure |
| Volume | Enough repetitions to amortize eval cost |
| Ground truth | Outputs can be scored without arguing |
| Reversibility | Known undo path for worst-case output |
01 — Model routing matrix
Route by task lane rather than using one model for everything. Track cost per completed outcome.
| Task | Default | Autonomy |
|---|---|---|
| Decode / research | Claude Sonnet | L1 Assist |
| Structured extraction | Claude Haiku / GPT-4o mini | L2 Reviewed |
| Code generation | Sonnet via Cursor | L2, L3 after CI green |
| High-stakes draft | Sonnet; Opus for hard calls | L2, human publishes |
| Irreversible actions | Human only | Above Approval Line |
| Embeddings / RAG | text-embedding-3-large or voyage-3 | L4 with retrieval eval |
| Batch enrichment | mini at volume | L3 supervised |
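A minimal sketch of how cost per completed outcome could be tracked per lane. The `LaneUsage` type, its field names, the dollar unit, and the choice to count human review spend alongside inference spend are all assumptions made for illustration, not a definition taken from this memo.

```python
from dataclasses import dataclass


@dataclass
class LaneUsage:
    """Spend and results for one task lane over a reporting window (hypothetical shape)."""
    lane: str
    model_spend: float   # inference cost, assumed to be in dollars
    review_spend: float  # human review cost, assumed to count toward the total
    completed: int       # outcomes accepted as done, not merely attempted


def cost_per_completed_outcome(usage: LaneUsage) -> float:
    """Total spend divided by completed outcomes; infinite if nothing completed."""
    if usage.completed == 0:
        return float("inf")
    return (usage.model_spend + usage.review_spend) / usage.completed


# Illustrative numbers only, not measured data.
extraction = LaneUsage("Structured extraction", model_spend=12.0, review_spend=48.0, completed=400)
print(f"{extraction.lane}: ${cost_per_completed_outcome(extraction):.2f} per outcome")
# prints "Structured extraction: $0.15 per outcome"
```

Dividing by completed outcomes rather than by calls or tokens means a cheaper model that fails review more often can still come out more expensive.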
Routing rules:
- No frontier model on throughput lanes
- No L4 without runtime checks (schema, bounds, allowlists)
- Cheaper models only after eval corpus proves stability
- Anthropic primary, OpenAI bench on client installs
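The first three routing rules can be sketched as a config check that runs before a route ships. Everything named here is hypothetical: the `Route` type, the model identifiers, and which models count as frontier or cheaper and which lanes count as throughput are assumptions chosen to make the example runnable.

```python
from dataclasses import dataclass, field

# Assumptions for illustration; the memo does not enumerate these sets.
FRONTIER_MODELS = {"opus"}
CHEAPER_MODELS = {"haiku", "4o-mini"}
THROUGHPUT_LANES = {"Structured extraction", "Batch enrichment"}
REQUIRED_L4_CHECKS = {"schema", "bounds", "allowlist"}


@dataclass
class Route:
    lane: str
    model: str
    max_autonomy: int                                  # 1..4, mirroring L1..L4
    runtime_checks: set[str] = field(default_factory=set)
    eval_stable: bool = False                          # True once the eval corpus shows stability


def violations(route: Route) -> list[str]:
    """Return every routing rule the route breaks; an empty list means it passes."""
    problems = []
    if route.model in FRONTIER_MODELS and route.lane in THROUGHPUT_LANES:
        problems.append("frontier model on a throughput lane")
    if route.max_autonomy >= 4 and not REQUIRED_L4_CHECKS <= route.runtime_checks:
        problems.append("L4 without schema, bounds, and allowlist checks")
    if route.model in CHEAPER_MODELS and not route.eval_stable:
        problems.append("cheaper model without a stable eval corpus")
    return problems


rag = Route("Embeddings / RAG", "voyage-3", max_autonomy=4, runtime_checks={"schema"})
print(violations(rag))  # ['L4 without schema, bounds, and allowlist checks']
```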
02 — Agent surfaces we run
Model-agnostic where it helps. Gated everywhere it matters.
| Surface | Role | Tier |
|---|---|---|
| Cursor | Primary coding harness | Runs |
| Claude | Decode, memos, reviewed automation | Runs |
| Codex | Bench coding failover | Bench |
| Gemini | Bench long-context decode | Bench |
| OpenClaw | Heartbeat ops on installs | Install |
| Pi | L1 quick capture | Bench |
| OpenCode | Headless CI agents | Install |
| Hermes | Strategy / CEO-class planning | Bench |
Paperclip is Bench orchestration for Compounding Portfolio operators who need an org chart and per-agent budgets. Exiid still installs the Approval Line and the eval harness on top.
03 — Harness catalog
| Harness | Role |
|---|---|
| Context | AGENTS.md, rules, skills, content SSOT |
| Coding | CI + Playwright before merge |
| Eval | Golden set on every prompt/model change |
| Runtime | Schema, bounds, banned tools per output |
| Orchestration | n8n, Temporal, or Paperclip with approval nodes |
| Promotion | 200 @ 98% before L3; demotion on suspicion |
| Drift | Weekly sampled trace review |
| Blast-radius | Action caps, draft-only side effects |
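As one possible shape for the Runtime row, the sketch below checks a single model output for schema, bounds, and tool use before any side effect runs. The field names, the 500 cap, and the tool names are hypothetical, and the tool check is written as an allowlist, which is one way to enforce a banned-tools policy.

```python
import json

# Hypothetical policy values for illustration only.
ALLOWED_TOOLS = {"search_crm", "draft_email"}
REQUIRED_FIELDS = {"tool": str, "amount": (int, float), "note": str}
MAX_AMOUNT = 500.0


def check_output(raw: str) -> list[str]:
    """Return the failed checks; an empty list means the output may proceed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["schema: output is not valid JSON"]
    if not isinstance(data, dict):
        return ["schema: output is not a JSON object"]

    failures = []
    # Schema: every required field is present with the expected type.
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            failures.append(f"schema: field '{name}' missing or wrong type")
    # Bounds: numeric amounts must stay inside the cap.
    amount = data.get("amount")
    if isinstance(amount, (int, float)) and not 0 <= amount <= MAX_AMOUNT:
        failures.append("bounds: amount outside 0..500")
    # Tools: anything not explicitly allowed is treated as banned.
    tool = data.get("tool")
    if isinstance(tool, str) and tool not in ALLOWED_TOOLS:
        failures.append("tools: tool not on the allowlist")
    return failures


print(check_output('{"tool": "wire_funds", "amount": 9000, "note": "pay vendor"}'))
# ['bounds: amount outside 0..500', 'tools: tool not on the allowlist']
```

Returning every failure instead of stopping at the first gives the weekly drift review more to work with when it samples traces.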
04 — Eval loop cadence
- Golden set on every change (50 cases minimum to automate)
- Runtime checks on every run
- Drift review weekly on production samples
Eval scores are promotion currency for the Autonomy Ladder. No score movement, no autonomy movement.
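A hedged sketch of the promotion gate follows. It reads "200 @ 98%" from the harness catalog as at least 200 scored runs with at least 98% passing, and it treats demotion as a drop of one level; both readings are assumptions, as are the `EvalRecord` type and its field names. Only the L2 to L3 step is modeled because that is the only threshold the memo states.

```python
from dataclasses import dataclass

MIN_GOLDEN_CASES = 50     # from the cadence list: 50 cases minimum to automate
L3_MIN_RUNS = 200         # assumed reading of "200 @ 98%": 200 scored runs...
L3_MIN_PASS_PERCENT = 98  # ...with at least 98% of them passing


@dataclass
class EvalRecord:
    golden_cases: int        # size of the golden set
    scored_runs: int         # runs scored at the current level
    passed_runs: int         # scored runs that passed
    suspicion: bool = False  # set when a review flags a suspicious trace


def next_level(current: int, rec: EvalRecord) -> int:
    """Return the autonomy level (1..4) after applying the gate."""
    if rec.suspicion:
        return max(1, current - 1)  # demotion on suspicion (one level is assumed)
    if rec.golden_cases < MIN_GOLDEN_CASES:
        return current              # golden set too small to automate promotion
    if current == 2:
        enough_runs = rec.scored_runs >= L3_MIN_RUNS
        # Integer comparison avoids floating-point error at the 98% boundary.
        passing = rec.passed_runs * 100 >= rec.scored_runs * L3_MIN_PASS_PERCENT
        if enough_runs and passing:
            return 3
    return current                  # no score movement, no autonomy movement


print(next_level(2, EvalRecord(golden_cases=50, scored_runs=200, passed_runs=196)))  # 3
print(next_level(2, EvalRecord(golden_cases=50, scored_runs=200, passed_runs=195)))  # 2
```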
What we refuse
- No autonomy without evals
- No L4 without hard runtime checks
- No "replace your team" theater on Client Services installs
Receipts
- Method.app agent grid
- AI Systems Design
- AI-First Business Models: cost per completed outcome
- One operator, many agents