AI Systems Design
How Exiid designs operational AI systems for ventures: workflow-first decomposition, the Autonomy Ladder, eval loops, blast-radius containment, and where automation pays.
An operational AI system is not a model with a prompt. It is a workflow with named owners, measured failure modes, and a bounded blast radius. This note documents how Exiid designs AI systems inside ventures — what gets automated, what stays human, and how the loop stays honest.
Who this is for
For: operators putting agents into revenue-bearing workflows — support, sales ops, content, fulfillment, data — where a bad output costs money or trust.
Useful when: you are deciding what to automate, how much autonomy to grant, and how you will know when the system degrades.
What must be true
Before any AI system deserves build capital inside a venture, four conditions have to hold:
| Condition | The bar |
| --- | --- |
| Workflow | Documented and stable enough to measure. You cannot automate a process you cannot describe. |
| Volume | Enough repetitions to amortize build, eval, and review cost. |
| Ground truth | A way to score an output without arguing about it. |
| Reversibility | A known undo path for the worst plausible output. |
If any row fails, the work is premature. Demos do not change that.
01 — Design the workflow, not the agent
Most failed AI systems started life as agent projects. Someone built an autonomous loop, then went looking for a job it could hold. The order is backwards.
The unit of design is the workflow. Decompose it into steps, then classify each step into one of three lanes:
- Deterministic. Rules are known and stable. This is code, not a model. Using an LLM here adds cost and variance for nothing.
- Judgment-light. Pattern recognition with clear ground truth — classification, extraction, triage, drafting against a template. This is where models earn their keep.
- Judgment-heavy. Ambiguous inputs, irreversible outcomes, or taste. This stays human, with the model as an assist.
Agents are steps, not products. A well-designed system usually contains several small agents wired into one workflow — each with a narrow contract, a typed input, and a checkable output — rather than one general agent improvising across the whole process. Narrow contracts are what make the next two sections possible: you cannot evaluate or contain what you cannot specify.
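A minimal sketch of what a narrow step contract could look like, assuming plain dictionaries as the payload type. The names `Lane`, `Step`, and `run_workflow`, and the support-ticket example, are hypothetical illustrations and not Exiid's implementation; the model call is stubbed with a keyword rule so the example runs on its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Lane(Enum):
    DETERMINISTIC = "deterministic"    # known rules: plain code, no model
    JUDGMENT_LIGHT = "judgment_light"  # model does the work, output is checkable
    JUDGMENT_HEAVY = "judgment_heavy"  # human decides, model only assists


@dataclass(frozen=True)
class Step:
    name: str
    lane: Lane
    owner: str                     # named human owner of the step
    run: Callable[[dict], dict]    # input payload -> output payload
    check: Callable[[dict], bool]  # True if the output honors the contract


def run_workflow(steps: list[Step], payload: dict) -> dict:
    """Run steps in order; stop at the first output that fails its own check."""
    for step in steps:
        payload = step.run(payload)
        if not step.check(payload):
            raise ValueError(f"step '{step.name}' failed its output check")
    return payload


def normalize(p: dict) -> dict:
    # Deterministic lane: ordinary string handling.
    return {**p, "text": p["text"].strip().lower()}


def classify(p: dict) -> dict:
    # Judgment-light lane: stand-in for a model call (assumption: keyword stub).
    return {**p, "category": "billing" if "invoice" in p["text"] else "other"}


steps = [
    Step("normalize", Lane.DETERMINISTIC, "ops", normalize,
         lambda p: bool(p["text"])),
    Step("classify", Lane.JUDGMENT_LIGHT, "support-lead", classify,
         lambda p: p["category"] in {"billing", "other"}),
]
result = run_workflow(steps, {"text": "  Where is my INVOICE? "})
```

Each step carries its own check, so a failure is attributed to one named step and one owner rather than to "the agent".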
02 — Place every step on the Autonomy Ladder
Autonomy is not a binary. Exiid grades every step on a five-level ladder:
| Level | Name | Who acts | Who checks |
| --- | --- | --- | --- |
| L0 | Manual | Human | Human |
| L1 | Assist | Human, with model drafts | Human |
| L2 | Reviewed | Model | Human approves every output |
| L3 | Supervised | Model | Human samples a percentage |
| L4 | Closed-loop | Model | Automated evals gate every output |
Two rules govern movement on the ladder:
- Promotion is earned by eval evidence, never by enthusiasm. A step moves from L2 to L3 only after a defined run of consecutive reviewed outputs clears the quality threshold. We typically require 200 reviewed outputs at 98 percent acceptance before sampling replaces full review.
- Demotion is free. Any operator can knock a step down a level on suspicion alone, instantly, without a meeting. The asymmetry is deliberate. The cost of an unnecessary demotion is review labor. The cost of a late demotion is an incident.
New systems launch at L2 by default. L4 is reserved for steps with hard runtime checks — schema validation, numeric bounds, allowlisted actions — not just model self-assessment.
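The two movement rules can be sketched as follows. This is an illustration under assumptions: the 200-output window and 98 percent bar are the typical figures quoted above, the window is read as the most recent 200 reviewed outputs, and the names `Level`, `promote`, and `demote` are hypothetical. Only the L2 to L3 transition is modeled.

```python
from enum import IntEnum


class Level(IntEnum):
    L0_MANUAL = 0
    L1_ASSIST = 1
    L2_REVIEWED = 2
    L3_SUPERVISED = 3
    L4_CLOSED_LOOP = 4


WINDOW = 200             # reviewed outputs required (typical figure from the text)
MIN_ACCEPTANCE_PCT = 98  # acceptance bar in whole percent


def can_promote_to_supervised(reviews: list[bool]) -> bool:
    """True if the most recent WINDOW reviewed outputs meet the acceptance bar."""
    if len(reviews) < WINDOW:
        return False
    recent = reviews[-WINDOW:]
    # Integer comparison avoids floating-point edge cases at exactly 98 percent.
    return sum(recent) * 100 >= MIN_ACCEPTANCE_PCT * WINDOW


def promote(level: Level, reviews: list[bool]) -> Level:
    """Promotion requires evidence; anything other than L2 -> L3 is left unchanged."""
    if level == Level.L2_REVIEWED and can_promote_to_supervised(reviews):
        return Level.L3_SUPERVISED
    return level


def demote(level: Level) -> Level:
    """Demotion takes no evidence argument at all: drop one level, never below L0."""
    return Level(max(level - 1, Level.L0_MANUAL))
```

The asymmetry shows up in the signatures: `promote` needs a review history, `demote` needs nothing.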
03 — Draw the Approval Line
The human-in-the-loop question is not "should a human be involved" but "exactly which decisions does a human keep." We draw an explicit Approval Line per workflow. A decision stays above the line — human-owned — when any of these four tests is true:
- Irreversible. Money moves, a message reaches a customer, a contract gets signed, data gets deleted.
- Identity-bearing. The output speaks as the brand on a judgment call, not from a template.
- Material. The financial or legal exposure of a single decision exceeds a stated threshold.
- Novel. The input falls outside the distribution the evals cover. Out-of-distribution means out of autonomy.
Everything below the line gets automated aggressively. Everything above it gets tooling — better context, better drafts, faster retrieval — but not autonomy.
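As a rough sketch, the four tests can be written as a single gate that reports which tests fired. The `Decision` fields and the `MATERIALITY_THRESHOLD` value are assumptions made for illustration; the note only says that a threshold must be stated per workflow, and how each flag gets computed is left open here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    irreversible: bool      # money moves, a customer is messaged, data is deleted
    identity_bearing: bool  # speaks as the brand on a judgment call
    exposure: float         # financial or legal exposure of this single decision
    in_distribution: bool   # input resembles what the evals cover


# Hypothetical value; each workflow would state its own.
MATERIALITY_THRESHOLD = 500.0


def above_approval_line(d: Decision) -> list[str]:
    """Return the tests that fired; an empty list means the decision is below the line."""
    reasons = []
    if d.irreversible:
        reasons.append("irreversible")
    if d.identity_bearing:
        reasons.append("identity-bearing")
    if d.exposure > MATERIALITY_THRESHOLD:
        reasons.append("material")
    if not d.in_distribution:
        reasons.append("novel")
    return reasons
```

Returning the reasons rather than a bare boolean gives the human reviewer the context for why the decision landed with them.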
One trap deserves its own warning: human review must be designed as carefully as the automation. A reviewer rubber-stamping 400 approvals a day is not a control. It is latency with a salary. Review queues need volume caps, sampling logic, and escalation paths, or the human layer becomes the system's weakest component.
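One way to give a review queue a volume cap, sampling logic, and an escalation path is sketched below. Everything here is an assumption for illustration: sampling by hashing the output ID, the `ReviewQueue` class, and the choice to escalate overflow rather than let it pile up on one reviewer.

```python
import hashlib


def sampled_for_review(output_id: str, sample_pct: int) -> bool:
    """Deterministically select roughly sample_pct percent of outputs by hashing the ID."""
    first_byte = hashlib.sha256(output_id.encode()).digest()[0]
    bucket = first_byte * 100 // 256  # maps 0..255 onto 0..99
    return bucket < sample_pct


class ReviewQueue:
    """Caps what one reviewer sees per day; overflow is escalated, not silently queued."""

    def __init__(self, daily_cap: int):
        self.daily_cap = daily_cap
        self.pending: list[str] = []
        self.escalated: list[str] = []

    def submit(self, output_id: str) -> str:
        if len(self.pending) < self.daily_cap:
            self.pending.append(output_id)
            return "queued"
        self.escalated.append(output_id)
        return "escalated"
```

Hash-based sampling means the same output is always in or out of the sample, which makes the sampling decision auditable after the fact.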
04 — Run the eval loop or you are guessing
A system without evals is a system whose quality you learn about from customers. Every Exiid AI system ships with three measurement layers:
| Layer | What it is | Cadence |
| --- | --- | --- |
| Golden set | Scored test cases run against every prompt, model, or logic change | Every change |
| Runtime checks | Assertions on live outputs: schema, bounds, banned actions, confidence floors | Every run |
| Drift review | Human scoring of sampled production traces | Weekly |
Three operating rules keep the loop honest:
- No system launches without a golden set. If you cannot write 50 scored cases, you do not understand the task well enough to automate it.
- Every incident becomes a test case. The golden set is a fossil record of everything that has ever gone wrong.
- Eval scores are the promotion currency for the Autonomy Ladder. No score movement, no autonomy movement.
The question is never whether the agent will fail. It is how much it can break in the time before someone notices — and whether the system notices first.
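The first two measurement layers can be sketched in a few lines. This is a simplified illustration: scoring is exact match, and the allowlist, the refund bound, and the function names are hypothetical. Real golden sets often need graded or rubric-based scoring, which is not shown.

```python
from typing import Callable

# A golden case pairs an input with its expected output.
GoldenCase = tuple[str, str]


def golden_score(system: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases the system gets exactly right."""
    passed = sum(1 for given, expected in cases if system(given) == expected)
    return passed / len(cases)


def add_incident(cases: list[GoldenCase], given: str, expected: str) -> None:
    """Every incident becomes a test case: append it to the golden set."""
    cases.append((given, expected))


ALLOWED_ACTIONS = {"refund", "escalate", "reply"}  # hypothetical allowlist
MAX_REFUND = 100.0                                  # hypothetical numeric bound


def runtime_check(output: dict) -> list[str]:
    """Assertions on one live output; returns the list of violations found."""
    problems = []
    action = output.get("action")
    if not isinstance(action, str):
        problems.append("schema: 'action' must be a string")
    elif action not in ALLOWED_ACTIONS:
        problems.append("banned action")
    amount = output.get("amount", 0.0)
    if not isinstance(amount, (int, float)) or not 0 <= amount <= MAX_REFUND:
        problems.append("amount out of bounds")
    return problems
```

`golden_score` would run on every prompt, model, or logic change; `runtime_check` would run on every live output, and an empty list is the condition for letting that output proceed.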
05 — Contain failure before you scale it
Failure containment is a design input, not an ops afterthought. Five blast-radius rules apply to every automated step:
- Cap actions per run. Hard limits on spend, message volume, records touched, and retries. An agent in a loop should hit a ceiling, not a credit limit.
- Stage side effects. Outputs land in a draft state — draft email, pending record, proposed change — and commit in a separate, checkable step.
- Prefer reversible writes. Soft deletes, versioned records, idempotent operations. The undo path from the "What must be true" table gets built, not assumed.
- Kill switch per workflow. Any operator can halt one workflow without touching the rest of the platform. Granularity is what makes people willing to pull it.
- Keep the manual path alive. The team must still be able to run the process by hand. A manual path that has atrophied is not a fallback; it is a rumor.
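Three of the rules above (action caps, staged side effects, and a per-workflow kill switch) can be sketched as one small wrapper. The class and exception names are hypothetical, and the in-memory lists stand in for whatever draft and committed stores a real system would use.

```python
from typing import Callable


class CapExceeded(Exception):
    pass


class WorkflowHalted(Exception):
    pass


class ContainedWorkflow:
    """Caps actions per run, stages side effects as drafts, and honors a kill switch."""

    def __init__(self, name: str, max_actions: int):
        self.name = name
        self.max_actions = max_actions  # hard ceiling per run
        self.halted = False             # kill switch scoped to this workflow only
        self.drafts: list[dict] = []    # staged, not yet committed
        self.committed: list[dict] = []
        self._actions = 0

    def halt(self) -> None:
        self.halted = True

    def stage(self, effect: dict) -> None:
        """Record a proposed side effect as a draft; nothing is committed yet."""
        if self.halted:
            raise WorkflowHalted(self.name)
        if self._actions >= self.max_actions:
            raise CapExceeded(f"{self.name}: cap of {self.max_actions} actions reached")
        self._actions += 1
        self.drafts.append(effect)

    def commit(self, approve: Callable[[dict], bool]) -> int:
        """Separate, checkable step: commit approved drafts and discard the rest."""
        if self.halted:
            raise WorkflowHalted(self.name)
        approved = [d for d in self.drafts if approve(d)]
        self.committed.extend(approved)
        self.drafts.clear()
        return len(approved)
```

Because `halted` lives on the workflow object, pulling the switch on one workflow leaves every other instance running.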
06 — Where automation pays
The Payback Test is blunt: automation pays when volume, multiplied by the value of time saved plus error cost avoided per run, clears the combined cost of building, evaluating, and reviewing the system. Run the numbers per step, not per project.
| Pays | Does not pay |
| --- | --- |
| High-volume, low-variance work with clear ground truth | Low-frequency, high-stakes decisions |
| Triage, enrichment, extraction, drafting, QA, data hygiene | Strategy, pricing calls, partner negotiations |
| Steps tolerant of review latency | Relationship moments where trust is the product |
| Processes already measured and working | Broken processes: automation makes bad outcomes arrive faster |
That last row is the most common failure we see. Automating a process nobody measured does not fix it. It industrializes the defect.
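The Payback Test can be written as a per-step inequality. The sketch below is one reading of it, not a formula from the note: it assumes time saved is converted to money through a loaded cost per minute, that all figures cover the same period, and that the parameter names are illustrative.

```python
def pays_back(
    volume: int,                        # repetitions of this step over the period
    minutes_saved_per_run: float,
    cost_per_minute: float,             # loaded labor cost (assumption)
    error_cost_avoided_per_run: float,
    build_cost: float,
    eval_cost: float,
    review_cost: float,                 # ongoing human review over the same period
) -> bool:
    """True if the step's benefit over the period exceeds its total cost."""
    value_per_run = minutes_saved_per_run * cost_per_minute + error_cost_avoided_per_run
    benefit = volume * value_per_run
    cost = build_cost + eval_cost + review_cost
    return benefit > cost


# Same step, same costs, different volume.
high_volume = pays_back(10_000, 2.0, 1.0, 0.25, 12_000, 3_000, 5_000)  # 22,500 vs 20,000
low_volume = pays_back(500, 2.0, 1.0, 0.25, 12_000, 3_000, 5_000)      # 1,125 vs 20,000
```

With these illustrative numbers the step pays at 10,000 runs and does not at 500, which is why the test is run per step and why low-frequency work sits in the right-hand column.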
The operating picture
A venture-grade AI system, in one sentence: a decomposed workflow whose automated steps each hold a graded level of autonomy, earn promotion through evals, operate inside a bounded blast radius, and hand everything above the Approval Line to a human whose review load was designed, not inherited. Build that, and the system compounds. Skip any piece, and you have built a demo with production traffic.
Read next
- When AI actually fits a services sprint — the conditions under which AI deserves systems work at all.
- Recon before roadmap — the evidence gate that comes before any build plan, AI included.