Beyond the Demo: The Agentic Stack That Actually Ships
From chatbot novelty to dependable agents: architecture, evaluation, safety, cost-control and human oversight for production
Since publishing our book Building Creative Machines, which includes a hands-on chapter on agents, we’ve been collecting field notes and fresh insights. This update distils what we’ve learnt so far.
The thesis
Demos excite; production systems endure.
Agents are workflows with state, not personalities.
Reliability is a property of the system, not the model.
Shippable agents emerge when software engineering disciplines tame probabilistic behaviour.
The winning teams treat prompts as code, context as product, and evaluation as governance.
What the agentic stack is
Interaction layer — channels, schema-constrained prompts, and contracts for inputs and outputs.
Interfaces must constrain ambiguity, or ambiguity will tax every downstream layer.
Orchestration — planners, tools, memory, and recovery logic.
Deterministic skeleton; probabilistic muscles. Every action should be observable, interruptible, and retryable.
Knowledge and context — retrieval, business rules, and data governance.
Context beats scale when it is fresh, curated, and permissioned.
Models — a portfolio, not a monolith.
Small models for control, larger ones for hard reasoning, specialised ones for niche domains.
Safety and policy — guardrails, consent, and provenance.
Policy must be executable, testable, and versioned like code.
Evaluation and observability — offline tests, online counters, and human audits.
Metrics define reality; without them, you are demo-driven.
Cost and performance — budgets, latency targets, and graceful degradation paths.
Economics are design constraints, not finance afterthoughts.
Design rules that survive contact with production
Plan–Act–Observe–Correct must be explicit.
Agents that cannot critique their own steps will escalate errors, not value.
Schema-first, not model-first.
Define JSON outputs and tool signatures before prompts to eliminate fragile post-hoc parsing.
Tool use over monologue.
If an agent can call an API, it should not hallucinate an answer.
Constraints before creativity.
Budgets, rate limits, and timeouts are safety rails for intelligence, not shackles.
Human-on-call.
Escalation is a feature; supervision is a control; learning from interventions is the flywheel.
Version everything.
Prompts, retrieval indices, tools, and policies deserve semantic versions and rollback plans.
State is a risk factor.
Short-term scratchpads help reasoning; long-term memory demands consent, retention rules, and audit.
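The schema-first rule can be made concrete with a small sketch: declare the tool's contract before any prompt exists, and validate model output against it instead of parsing free text. The `issue_refund` tool and its fields are hypothetical, and the validation is deliberately minimal.

```python
import json

# Hypothetical tool contract, declared before any prompt is written.
REFUND_TOOL_SCHEMA = {
    "name": "issue_refund",
    "required": {"order_id": str, "amount_cents": int, "reason": str},
}

def validate_tool_call(raw: str, schema: dict) -> dict:
    """Reject a model-produced tool call unless it matches the declared contract."""
    call = json.loads(raw)  # malformed JSON fails here, loudly
    if call.get("tool") != schema["name"]:
        raise ValueError(f"unknown tool: {call.get('tool')!r}")
    args = call.get("args", {})
    for field, typ in schema["required"].items():
        if not isinstance(args.get(field), typ):
            raise ValueError(f"field {field!r} missing or not {typ.__name__}")
    return args

# A well-formed call passes; anything else is rejected before execution.
ok = validate_tool_call(
    '{"tool": "issue_refund", "args": '
    '{"order_id": "A-17", "amount_cents": 1299, "reason": "damaged"}}',
    REFUND_TOOL_SCHEMA,
)
```

In production you would use a proper schema library rather than hand-rolled checks, but the principle is the same: the contract exists, and is enforced, independently of the prompt.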
Metrics that matter
Task success rate: completed to spec, not just “answered”.
First-pass yield: proportion of cases with zero human edits.
Intervention rate and time-to-resolution: the human bottleneck you must design for.
Guardrail breach rate: policy violations per thousand actions, not per thousand requests.
Cost per successful task: your true unit economics.
Retrieval utility: fraction of citations that were necessary and sufficient.
Drift: change in performance across time, data freshness, and model updates.
What you measure compounds; what you ignore becomes incident reports.
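Three of these metrics fall out of the same event log. A minimal sketch, assuming each agent run is recorded with an outcome, an edit count, and a cost (the field names are illustrative):

```python
# Toy event log; one record per agent run.
runs = [
    {"success": True,  "human_edits": 0, "cost_usd": 0.04},
    {"success": True,  "human_edits": 2, "cost_usd": 0.07},
    {"success": False, "human_edits": 1, "cost_usd": 0.09},
    {"success": True,  "human_edits": 0, "cost_usd": 0.05},
]

task_success_rate = sum(r["success"] for r in runs) / len(runs)
first_pass_yield = sum(r["success"] and r["human_edits"] == 0 for r in runs) / len(runs)
# Cost per *successful* task: total spend divided by successes, so failed
# runs still count against your unit economics.
cost_per_success = sum(r["cost_usd"] for r in runs) / sum(r["success"] for r in runs)
```

Note that cost per successful task divides the whole spend, failures included, by successes only; that is what makes it an honest unit-economics number.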
Safety that scales with ambition
Safety is not a filter; it is an architecture.
Classify risks by capability (what the agent can do), data (what it can see), and interaction (who it can affect).
Gate high-impact tools behind checks, rate limits, and explicit consent.
Log every tool call with inputs, outputs, and provenance hashes for post-incident forensics.
Red-team the workflow, not just the model, because exploits live in orchestration seams.
Policy-as-code enables pre-deployment tests and in-production enforcement with the same rules.
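The tool-call logging above can be sketched with the standard library alone: hash the canonicalised inputs and outputs so a forensic review can detect tampering and replay decisions. The record layout is an assumption, not a standard.

```python
import hashlib
import json
import time

def log_tool_call(log: list, tool: str, inputs: dict, outputs: dict) -> str:
    """Append an audit record whose provenance hash covers the canonicalised
    inputs and outputs, enabling post-incident forensics."""
    payload = json.dumps(
        {"tool": tool, "inputs": inputs, "outputs": outputs},
        sort_keys=True,  # canonical key order -> stable hash
    ).encode()
    record = {
        "ts": time.time(),
        "tool": tool,
        "inputs": inputs,
        "outputs": outputs,
        "provenance": hashlib.sha256(payload).hexdigest(),
    }
    log.append(record)
    return record["provenance"]

audit_log: list = []
h1 = log_tool_call(audit_log, "search", {"q": "refund policy"}, {"hits": 3})
h2 = log_tool_call(audit_log, "search", {"q": "refund policy"}, {"hits": 3})
```

Identical calls produce identical hashes, so any divergence between the log and a replayed run is immediately visible.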
The economics of intelligence
Cheaper tokens are not a strategy; predictable margins are.
Cache aggressively where correctness is stable; retrieve aggressively where freshness matters.
Distil reasoning into lighter models for the 80%, and reserve heavy models for the 20% edge.
Batch where latency allows; stream where latency kills value.
Optimise for cost per resolved outcome, not cost per thousand tokens.
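"Cache aggressively where correctness is stable; retrieve aggressively where freshness matters" can be expressed as per-category TTLs. A minimal in-process sketch, with made-up categories and TTL values:

```python
import time

# Assumed categories: stable answers get a long TTL, volatile ones a short TTL.
CACHE_TTL_S = {
    "policy_lookup": 24 * 3600,  # correctness is stable: cache for a day
    "stock_level": 30,           # freshness matters: cache for seconds
}

_cache: dict = {}

def cached_call(kind: str, key: str, fetch) -> str:
    """Serve from cache within the category's TTL; otherwise call `fetch`,
    a zero-argument stand-in for the expensive model or tool call."""
    now = time.time()
    entry = _cache.get((kind, key))
    if entry and now - entry[0] < CACHE_TTL_S[kind]:
        return entry[1]
    value = fetch()
    _cache[(kind, key)] = (now, value)
    return value

calls = []
def fake_fetch():
    calls.append(1)
    return "answer"

first = cached_call("policy_lookup", "returns", fake_fetch)
second = cached_call("policy_lookup", "returns", fake_fetch)  # served from cache
```

The design choice is that TTL lives with the category, not the call site, so the economics are set in one reviewable place.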
Human-in-the-loop that teaches the machine
Shadow mode reveals failure modes without user risk.
Claim-check patterns let humans approve actions before the agent spends money or reputation.
Structured feedback (rubrics, tags, diffs) turns interventions into training data, not anecdotes.
Rotation and calibration keep reviewers consistent, which keeps metrics honest.
Promotion from assisted to autonomous should be gated by real-world evidence, not confidence on a slide.
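The claim-check pattern can be sketched as a small queue in which proposed actions are inert until a human approves the ticket. The class and its fields are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ClaimCheck:
    """Illustrative claim-check queue: high-impact actions wait for a human
    approval ticket before the agent may execute them."""
    pending: dict = field(default_factory=dict)
    next_id: int = 0

    def propose(self, action: str, cost_cents: int) -> int:
        self.next_id += 1
        self.pending[self.next_id] = {
            "action": action, "cost_cents": cost_cents, "approved": False,
        }
        return self.next_id

    def approve(self, ticket: int) -> None:
        self.pending[ticket]["approved"] = True

    def execute(self, ticket: int) -> str:
        item = self.pending[ticket]
        if not item["approved"]:
            raise PermissionError("human approval required before spending")
        return f"executed: {item['action']}"

gate = ClaimCheck()
ticket = gate.propose("issue_refund A-17", cost_cents=1299)
```

The key property is that execution fails closed: without an explicit approval, the agent cannot spend money or reputation.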
Patterns that ship
Retrieval-first assistants: opinions grounded in documents you can audit.
Toolformer-style agents: actions decomposed into verifiable calls with typed contracts.
Checkers and critics: second-pass validators specialised for policy, maths, or citations.
Safety sentinels: independent processes that can veto execution and revoke credentials.
Fallback ladders: from “best-effort agent” to “deterministic template” to “sorry, escalating”.
Elegance is a nice-to-have; escape hatches are a must-have.
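The fallback ladder is easy to sketch: each rung fails closed to the next, and the last rung never raises. The three rungs below are stand-ins (the "agent" is simulated as unavailable).

```python
def best_effort_agent(query: str) -> str:
    raise TimeoutError("model unavailable")  # simulate an outage

def deterministic_template(query: str) -> str:
    # A canned answer we trust for common cases; narrow but safe.
    if "refund" in query:
        return "Refunds are processed within 5 business days."
    raise KeyError("no template for this query")

def escalate(query: str) -> str:
    return "Sorry, escalating to a human agent."

FALLBACK_LADDER = [best_effort_agent, deterministic_template, escalate]

def answer(query: str) -> str:
    """Walk the ladder; every rung but the last may fail, the last may not."""
    for rung in FALLBACK_LADDER[:-1]:
        try:
            return rung(query)
        except Exception:
            continue
    return FALLBACK_LADDER[-1](query)
```

The escape hatch is structural: no matter which rungs break, the user always gets a defined response.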
A pragmatic build order
Start with a single job-to-be-done whose success can be unambiguously scored.
Codify the spec as tests before writing the prompt.
Instrument from day one; counters beat curiosity.
Ship a supervised version to expose real inputs, then automate only the stable parts.
Add tools one by one, each behind a feature flag and an evaluation gate.
Declare SLAs early; they will shape model choice, caching, and escalation design.
Write post-mortems for agent failures with the same discipline as production outages.
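Two of the steps above, the spec-as-tests and the evaluation gate behind each feature flag, can be combined in one sketch. Everything here is hypothetical: the calculator tool, the cases, and the threshold.

```python
# Spec codified as scored test cases, written before the tool ships.
EVAL_CASES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "10 / 4", "expected": "2.5"},
]

def calculator_tool(expr: str) -> str:
    # Candidate tool under evaluation; eval() is acceptable for this toy
    # arithmetic only, never for untrusted input.
    return str(eval(expr))

def passes_gate(tool, cases, threshold=1.0) -> bool:
    """The tool's feature flag opens only if it clears the scored test set."""
    score = sum(tool(c["input"]) == c["expected"] for c in cases) / len(cases)
    return score >= threshold

FEATURE_FLAGS = {"calculator": passes_gate(calculator_tool, EVAL_CASES)}
```

Because the flag is computed from the evaluation, "ship it anyway" requires visibly lowering the threshold, which is exactly the conversation you want to force.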
Choosing models without the theatre
Pick for fit, not for fashion.
Latency budgets, token windows, and tool-call accuracy often dominate raw benchmark scores.
Model heterogeneity de-risks supplier change and unlocks routing by task profile.
On-device or edge models reduce latency and leakage, but demand ruthless scoping.
The most strategic investment is not model selection; it is your data pipeline and policies.
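Routing by task profile can be sketched as an ordered rule table over the attributes that actually dominate fit: latency budget, tool use, difficulty. The model names and matching rules below are placeholders, not recommendations.

```python
# Ordered routing rules; first match wins. Names are placeholders.
ROUTES = [
    {"match": lambda t: t["needs_tools"] and t["latency_budget_ms"] < 500,
     "model": "small-fast"},
    {"match": lambda t: t["difficulty"] == "hard",
     "model": "large-reasoning"},
]
DEFAULT_MODEL = "small-fast"

def route(task: dict) -> str:
    """Pick a model by task profile rather than by benchmark fashion."""
    for rule in ROUTES:
        if rule["match"](task):
            return rule["model"]
    return DEFAULT_MODEL
```

Keeping the table ordered and declarative means a supplier change is a one-line edit, which is the de-risking the heterogeneity argument is about.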
Governance that earns trust
Traceability is your social licence to operate.
Every answer should be explainable by the inputs, tools, and policies that were in force at the time.
Users deserve transparent boundaries: what the agent can do, what it logs, and how to contest outcomes.
Compliance becomes lighter when an audit is built in, not reconstructed months later.
Trust is a UX: clear affordances for correction, consent, and escalation reduce cognitive load and legal risk.
The quiet differentiator: editorial judgement
Great agents are opinionated about what good looks like.
Style guides, domain rubrics, and example-based tests encode taste into the loop.
Consistency is a feature; teams remember tools that respect their standards under pressure.
The best prompt is the one your organisation can maintain when the author is on holiday.
The shipping test
Can we replay any decision with full provenance?
Do we know cost per successful task under real load?
What fails closed, and where do we degrade gracefully?
Which parts are safe to automate today, and which require human claim-check?
What evidence would justify the next autonomy step?
If you cannot answer these in one meeting, you have a demo, not a product.
If you disagree with any rule here, reply with your counterexample; we’ll learn faster together.


