2026-06-13 · 7 min read

AI Systems Design

How Exiid designs operational AI systems for ventures: workflow-first decomposition, the Autonomy Ladder, eval loops, blast-radius containment, and where automation pays.

ai systems, automation, operations

An operational AI system is not a model with a prompt. It is a workflow with named owners, measured failure modes, and a bounded blast radius. This note documents how Exiid designs AI systems inside ventures — what gets automated, what stays human, and how the loop stays honest.

Who this is for

For: operators putting agents into revenue-bearing workflows — support, sales ops, content, fulfillment, data — where a bad output costs money or trust.

Useful when: you are deciding what to automate, how much autonomy to grant, and how you will know when the system degrades.

What must be true

Before any AI system deserves build capital inside a venture, four conditions have to hold:

Condition	The bar
Workflow	Documented and stable enough to measure. You cannot automate a process you cannot describe.
Volume	Enough repetitions to amortize build, eval, and review cost.
Ground truth	A way to score an output without arguing about it.
Reversibility	A known undo path for the worst plausible output.

If any row fails, the work is premature. Demos do not change that.

01 — Design the workflow, not the agent

Most failed AI systems started life as agent projects. Someone built an autonomous loop, then went looking for a job it could hold. The order is backwards.

The unit of design is the workflow. Decompose it into steps, then classify each step into one of three lanes:

Deterministic. Rules are known and stable. This is code, not a model. Using an LLM here adds cost and variance for nothing.
Judgment-light. Pattern recognition with clear ground truth — classification, extraction, triage, drafting against a template. This is where models earn their keep.
Judgment-heavy. Ambiguous inputs, irreversible outcomes, or taste. This stays human, with the model as an assist.

Agents are steps, not products. A well-designed system usually contains several small agents wired into one workflow — each with a narrow contract, a typed input, and a checkable output — rather than one general agent improvising across the whole process. Narrow contracts are what make the next two sections possible: you cannot evaluate or contain what you cannot specify.
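
The lane classification and the narrow-contract idea can be sketched as a small data structure. This is a minimal illustration, not Exiid's actual tooling: the `Lane`, `Step`, and `executor_for` names, and the support-ticket workflow itself, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Lane(Enum):
    DETERMINISTIC = "deterministic"    # rules known and stable: plain code
    JUDGMENT_LIGHT = "judgment_light"  # clear ground truth: model does the work
    JUDGMENT_HEAVY = "judgment_heavy"  # ambiguous or irreversible: human decides


@dataclass(frozen=True)
class Step:
    name: str
    lane: Lane
    owner: str                     # named human owner of the step
    check: Callable[[dict], bool]  # the step's checkable-output contract


def executor_for(step: Step) -> str:
    """Map a step's lane to who performs it."""
    return {
        Lane.DETERMINISTIC: "code",
        Lane.JUDGMENT_LIGHT: "model",
        Lane.JUDGMENT_HEAVY: "human",
    }[step.lane]


# Hypothetical support-ticket workflow, decomposed into three steps.
workflow = [
    Step("dedupe_ticket", Lane.DETERMINISTIC, "ops",
         lambda out: "ticket_id" in out),
    Step("classify_intent", Lane.JUDGMENT_LIGHT, "support_lead",
         lambda out: out.get("intent") in {"billing", "bug", "other"}),
    Step("approve_refund", Lane.JUDGMENT_HEAVY, "finance",
         lambda out: "decision" in out),
]

plan = {step.name: executor_for(step) for step in workflow}
```

Each step carries its own `check`, so an output can be rejected at the step boundary instead of surfacing three steps later.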

02 — Place every step on the Autonomy Ladder

Autonomy is not a binary. Exiid grades every automated step on a five-level ladder:

Level	Name	Who acts	Who checks
L0	Manual	Human	Human
L1	Assist	Human, with model drafts	Human
L2	Reviewed	Model	Human approves every output
L3	Supervised	Model	Human samples a percentage
L4	Closed-loop	Model	Automated evals gate every output

Two rules govern movement on the ladder:

Promotion is earned by eval evidence, never by enthusiasm. A step moves from L2 to L3 after a defined run of consecutive outputs clears the quality threshold — we typically require 200 reviewed outputs at 98 percent acceptance before sampling replaces full review.
Demotion is free. Any operator can knock a step down a level on suspicion alone, instantly, without a meeting. The asymmetry is deliberate. The cost of an unnecessary demotion is review labor. The cost of a late demotion is an incident.

New systems launch at L2 by default. L4 is reserved for steps with hard runtime checks — schema validation, numeric bounds, allowlisted actions — not just model self-assessment.
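
The two movement rules can be written down as a short sketch. The 200-output and 98 percent figures come from the text above; the function names are hypothetical, and the choice to evaluate the most recent 200 reviews is an assumption, since the text does not say how the window is chosen.

```python
from enum import IntEnum


class Level(IntEnum):
    L0_MANUAL = 0
    L1_ASSIST = 1
    L2_REVIEWED = 2
    L3_SUPERVISED = 3
    L4_CLOSED_LOOP = 4


MIN_REVIEWED = 200       # from the text: 200 reviewed outputs
MIN_ACCEPTANCE_PCT = 98  # from the text: 98 percent acceptance
DEFAULT_LAUNCH_LEVEL = Level.L2_REVIEWED


def can_promote_l2_to_l3(reviews: list[bool]) -> bool:
    """True when the latest MIN_REVIEWED human reviews clear the bar.

    Assumption: the window is the most recent 200 reviews.
    Integer arithmetic avoids floating-point edge cases at exactly 98%.
    """
    if len(reviews) < MIN_REVIEWED:
        return False
    window = reviews[-MIN_REVIEWED:]
    return sum(window) * 100 >= MIN_ACCEPTANCE_PCT * len(window)


def demote(level: Level) -> Level:
    """Demotion is free: one level down, no evidence required, floor at L0."""
    return Level(max(level - 1, Level.L0_MANUAL))
```

Note the asymmetry in the signatures: promotion takes evidence as an argument, demotion takes none.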

03 — Draw the Approval Line

The human-in-the-loop question is not "should a human be involved" but "exactly which decisions does a human keep." We draw an explicit Approval Line per workflow. A decision stays above the line — human-owned — when any of these four tests is true:

Irreversible. Money moves, a message reaches a customer, a contract gets signed, data gets deleted.
Identity-bearing. The output speaks as the brand on a judgment call, not from a template.
Material. The financial or legal exposure of a single decision exceeds a stated threshold.
Novel. The input falls outside the distribution the evals cover. Out-of-distribution means out of autonomy.

Everything below the line gets automated aggressively. Everything above it gets tooling — better context, better drafts, faster retrieval — but not autonomy.
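
The four tests above reduce to a single predicate per decision. A minimal sketch, assuming each test can be expressed as a field on the decision; the `Decision` fields and the threshold value are hypothetical, and in practice the "novel" flag would come from whatever out-of-distribution check the workflow uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    irreversible: bool          # money moves, customer message, deletion
    identity_bearing: bool      # speaks as the brand on a judgment call
    exposure: float             # financial or legal exposure of this decision
    in_eval_distribution: bool  # input resembles cases the evals cover


def approval_line_reasons(d: Decision, exposure_threshold: float) -> list[str]:
    """Return every test that fired. Any non-empty result means human-owned."""
    reasons = []
    if d.irreversible:
        reasons.append("irreversible")
    if d.identity_bearing:
        reasons.append("identity_bearing")
    if d.exposure > exposure_threshold:
        reasons.append("material")
    if not d.in_eval_distribution:
        reasons.append("novel")
    return reasons


def is_human_owned(d: Decision, exposure_threshold: float) -> bool:
    return bool(approval_line_reasons(d, exposure_threshold))
```

Returning the reasons rather than a bare boolean keeps the audit trail: a reviewer sees why a decision landed above the line.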

One trap deserves its own warning: human review must be designed as carefully as the automation. A reviewer rubber-stamping 400 approvals a day is not a control. It is latency with a salary. Review queues need volume caps, sampling logic, and escalation paths, or the human layer becomes the system's weakest component.
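
One way to give a review queue sampling logic and a volume cap is sketched below. Both functions are hypothetical illustrations; the hash-based sampler is a common technique chosen here because it is deterministic and auditable, not something the text prescribes.

```python
import hashlib


def sampled_for_review(output_id: str, sample_rate_pct: int) -> bool:
    """Deterministic sampling: the same output id always gets the same answer,
    so the sample can be reproduced during an audit."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < sample_rate_pct


def route_review(queue_length: int, daily_cap: int) -> str:
    """Enforce the volume cap: past the cap, escalate instead of piling on."""
    return "review" if queue_length < daily_cap else "escalate"
```

The cap is the important half. A reviewer who never hits a ceiling has no signal that the queue has outgrown their attention.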

04 — Run the eval loop or you are guessing

A system without evals is a system whose quality you learn about from customers. Every Exiid AI system ships with three measurement layers:

Layer	What it is	Cadence
Golden set	Scored test cases run against every prompt, model, or logic change	Every change
Runtime checks	Assertions on live outputs — schema, bounds, banned actions, confidence floors	Every run
Drift review	Human scoring of sampled production traces	Weekly

Three operating rules keep the loop honest:

No system launches without a golden set. If you cannot write 50 scored cases, you do not understand the task well enough to automate it.
Every incident becomes a test case. The golden set is a fossil record of everything that has ever gone wrong.
Eval scores are the promotion currency for the Autonomy Ladder. No score movement, no autonomy movement.
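
The golden set and the runtime checks can be sketched in a few lines. The 50-case minimum comes from the rule above; everything else (function names, the output fields `action` and `confidence`, and the toy keyword classifier standing in for a model) is a hypothetical illustration.

```python
from typing import Callable

MIN_GOLDEN_CASES = 50  # from the text: at least 50 scored cases


def golden_set_score(predict: Callable[[str], str],
                     cases: list[tuple[str, str]]) -> float:
    """Fraction of golden cases where the prediction matches the label."""
    if len(cases) < MIN_GOLDEN_CASES:
        raise ValueError("golden set too small to launch")
    passed = sum(1 for text, expected in cases if predict(text) == expected)
    return passed / len(cases)


def runtime_check(output: dict, allowed_actions: set[str],
                  confidence_floor: float) -> list[str]:
    """Return the failed assertions for one live output (empty means pass)."""
    failures = []
    action = output.get("action")
    if not isinstance(action, str):
        failures.append("schema:action")
    elif action not in allowed_actions:
        failures.append("banned_action")
    confidence = output.get("confidence")
    if not isinstance(confidence, (int, float)):
        failures.append("schema:confidence")
    elif confidence < confidence_floor:
        failures.append("confidence_floor")
    return failures


def toy_predict(text: str) -> str:
    """Stand-in for a model: a keyword rule, used only to exercise the loop."""
    return "billing" if "invoice" in text else "other"


golden = [(f"invoice {i}", "billing") for i in range(25)]
golden += [(f"hello {i}", "other") for i in range(25)]
score_before = golden_set_score(toy_predict, golden)

# Every incident becomes a test case: a missed billing request joins the set.
golden.append(("refund my charge", "billing"))
score_after = golden_set_score(toy_predict, golden)
```

The score drops after the incident is added, which is the point: the golden set now fails until the system handles the case that hurt it.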

The question is never whether the agent will fail. It is how much it can break in the time before someone notices — and whether the system notices first.

05 — Contain failure before you scale it

Failure containment is a design input, not an ops afterthought. Five blast-radius rules apply to every automated step:

Cap actions per run. Hard limits on spend, message volume, records touched, and retries. An agent in a loop should hit a ceiling, not a credit limit.
Stage side effects. Outputs land in a draft state — draft email, pending record, proposed change — and commit in a separate, checkable step.
Prefer reversible writes. Soft deletes, versioned records, idempotent operations. The undo path from the "What must be true" table gets built, not assumed.
Kill switch per workflow. Any operator can halt one workflow without touching the rest of the platform. Granularity is what makes people willing to pull it.
Keep the manual path alive. The team must still be able to run the process by hand. A manual path that has atrophied is not a fallback; it is a rumor.
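
Three of these rules (action caps, staged side effects, and a per-workflow kill switch) fit naturally into one guard object. A minimal sketch with hypothetical names throughout; a real system would persist the halt flag and drafts rather than hold them in memory.

```python
from dataclasses import dataclass, field
from typing import Callable


class CapExceeded(RuntimeError):
    """Raised when a run tries to exceed its action ceiling."""


class WorkflowHalted(RuntimeError):
    """Raised when the workflow's kill switch has been pulled."""


@dataclass
class RunGuard:
    max_actions: int
    actions_taken: int = 0
    halted: bool = False
    drafts: list[dict] = field(default_factory=list)

    def kill(self) -> None:
        """Kill switch: halts this workflow only."""
        self.halted = True

    def stage(self, action: dict) -> None:
        """Record a proposed side effect as a draft; nothing is committed."""
        if self.halted:
            raise WorkflowHalted("workflow is halted")
        if self.actions_taken >= self.max_actions:
            raise CapExceeded("per-run action cap reached")
        self.actions_taken += 1
        self.drafts.append(action)

    def commit(self, approve: Callable[[dict], bool]) -> list[dict]:
        """Separate, checkable commit step: only approved drafts go out."""
        if self.halted:
            raise WorkflowHalted("workflow is halted")
        committed = [d for d in self.drafts if approve(d)]
        self.drafts.clear()
        return committed
```

Because one guard belongs to one workflow, pulling `kill()` stops that workflow and nothing else, which is the granularity the fourth rule asks for.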

06 — Where automation pays

The Payback Test is blunt: automation pays when volume, multiplied by the time saved and error cost avoided per run, clears the combined cost of building, evaluating, and reviewing the system. Run the numbers per step, not per project.
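
The test can be run as plain arithmetic. The sketch below makes two assumptions the text does not state: time saved is converted to money through an hourly labor cost, and costs are compared over a fixed horizon in months. All figures in the example are invented for illustration.

```python
def payback(runs_per_month: int,
            minutes_saved_per_run: float,
            hourly_labor_cost: float,
            error_cost_avoided_per_run: float,
            build_cost: float,
            eval_cost_per_month: float,
            review_cost_per_month: float,
            horizon_months: int) -> tuple[float, float]:
    """Return (benefit, cost) for one step over the horizon."""
    value_per_run = (minutes_saved_per_run * hourly_labor_cost / 60
                     + error_cost_avoided_per_run)
    benefit = runs_per_month * horizon_months * value_per_run
    cost = build_cost + horizon_months * (eval_cost_per_month
                                          + review_cost_per_month)
    return benefit, cost


def pays(benefit: float, cost: float) -> bool:
    return benefit > cost


# Illustrative step: 2,000 runs a month, 3 minutes saved per run.
high_volume = payback(2000, 3, 40, 0.5, 30_000, 500, 1_500, 12)

# Same step at 50 runs a month: identical build cost, far less volume.
low_volume = payback(50, 3, 40, 0.5, 30_000, 500, 1_500, 12)
```

With these invented figures the high-volume step clears its cost and the low-volume step does not, even though the per-run saving is identical. Volume is what amortizes the fixed build, eval, and review cost.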

Pays	Does not pay
High-volume, low-variance work with clear ground truth	Low-frequency, high-stakes decisions
Triage, enrichment, extraction, drafting, QA, data hygiene	Strategy, pricing calls, partner negotiations
Steps tolerant of review latency	Relationship moments where trust is the product
Processes already measured and working	Broken processes — automation makes bad outcomes arrive faster

That last row is the most common failure we see. Automating a process nobody measured does not fix it. It industrializes the defect.

The operating picture

A venture-grade AI system, in one sentence: a decomposed workflow whose automated steps each hold a graded level of autonomy, earn promotion through evals, operate inside a bounded blast radius, and hand everything above the Approval Line to a human whose review load was designed, not inherited. Build that, and the system compounds. Skip any piece, and you have built a demo with production traffic.
