Building Production AI Agents: Architecture, Evals, and What Actually Ships

"Agent" is the most overused word in AI right now. Half the things called "agents" are just chatbots with a function-calling layer; the other half are research demos with five hours of guard-railing duct-taped on top to keep them from filing a tax return for the wrong company. The actual category — a real, multi-step, autonomously-acting AI agent in production — is much smaller than the discourse suggests, and the engineering to ship one well is much less glamorous than the marketing.

We've shipped enough production AI to know which agents survive and which ones turn out to be slide-deck art. The version below is the engineering view: the architecture that holds up, the failure modes that kill most attempts, the evals that catch problems before users do, and the situations where you should genuinely build an agent instead of something simpler. We also do this work for clients as part of our AI integration practice, so the patterns here are what we actually use.

What "agent" actually means (and what it doesn't)

A real AI agent is something that can:

Take multi-step actions to achieve a goal, not just answer a single question.
Use tools — call APIs, query data, write files, send messages — based on the model's own decisions.
Maintain state across steps so it knows what it's already tried.
Recover from failure in some structured way (re-plan, escalate, hand off).

By that bar: most AI features called "agents" aren't. A support chatbot that calls a knowledge base via RAG is not an agent — it's a grounded chatbot. A code assistant that suggests one edit at a time isn't an agent — it's a copilot. Neither of these needs agent architecture, and pretending they do is how teams spend three months building infrastructure they didn't need.

The flip side: a feature that actually needs to plan, act, observe, re-plan — like an outbound sales agent that researches a prospect, drafts an email, hits a CRM API, and schedules a follow-up — is agent territory, and it does benefit from agent architecture.

Production agent architecture (the boring version)

Most working production agents share roughly the same shape:

1. An LLM core doing planning and decision-making. Usually a frontier model (GPT, Claude, Gemini) for the reasoning step; a cheaper, faster model for routine sub-tasks if cost matters.

2. A tool layer. A small, well-defined catalogue of functions the agent can call. Each tool has a typed schema, idempotency where possible, and a clean error contract. Too many tools and the model gets confused; too few and it can't actually achieve the goal. The sweet spot is usually 5–15 well-scoped tools.

3. Memory. Two flavours: working memory (what the agent has done in this run — required) and long-term memory (what the agent has learned across runs — optional, mostly hype). Working memory is what makes multi-step coherence possible. Long-term memory is what most teams over-invest in and under-use.

4. State and persistence. Every step the agent takes should land in durable storage — what it tried, what came back, what it decided next. This is what makes incidents debuggable. It's also what makes resuming after a crash possible, and what your eval pipeline reads from.

5. Guardrails. Layered: pre-action checks (is this tool call sensible? does it match the typed schema?), action-rate limits (the agent cannot call the same tool 200 times in a loop), and behavioural constraints (the system prompt and the tool catalogue together encode what the agent is and isn't supposed to do). More on guardrails in our LLM guardrails post.

6. An observability layer so you can see every step of every run — what the model saw, what it chose, what came back, what it decided next. Without this, an agent in production is a black box that occasionally burns money.

That's it. There's no special "agent framework" you need. Most production agents we've shipped are a few hundred lines of careful TypeScript or Python wrapped around an LLM provider, a tool schema, and a state machine.

The failure modes that kill most agents

Three categories cause more than 80% of production agent failures:

Looping. The model gets stuck in a thought-action-observation loop that never converges. Same tool call, slightly different arguments, indefinitely. The fix is at the architecture layer: hard step limits, repeated-call detection, and a graceful "I can't do this, escalating" branch.

Hallucinated tool calls. The model invents arguments to a tool that look plausible but aren't real (a customer ID that doesn't exist, a date in the wrong format). The fix is at the tool layer: typed schemas with strict validation, and a clean error contract that lets the model recover instead of compounding the mistake.

Behavioural drift. A change in the underlying model — a new GPT version, an Anthropic update — alters the agent's behaviour subtly. It still works, but it now refuses cases it used to handle, or escalates more aggressively, or formats outputs differently. The fix is the eval pipeline (next section) — without one, this drift goes unnoticed until a user reports it.

All three are normal engineering problems. They're not solved by the next model release. They're solved by the kind of careful production engineering that most teams skip because the discourse said the model would handle it.

Eval pipelines — how to actually test an agent

A production agent without an eval pipeline isn't a production agent; it's a prototype that happens to be live. The minimum viable eval setup looks like this:

A representative test set. 50–200 representative goals the agent should achieve, ranging from easy to genuinely hard, including known edge cases.
Trajectory scoring. For each goal, you score the agent's whole trajectory — not just the final answer, but the sequence of decisions and tool calls it made along the way.
Both deterministic and LLM-as-judge scoring. Deterministic where you can (did it call the right tool? did it land on the right outcome?), LLM-as-judge where you can't (was the reasoning sound? was the final action appropriate?).
CI integration. Every prompt change, model change, or tool change re-runs the eval set. A drop below threshold blocks the deploy.

Don't worry about hitting 100% — production agents rarely do. Worry about catching regressions. The eval pipeline is what tells you whether your last change was an improvement or a quiet step backwards. (For a deeper take, see our LLM evaluation guide.)

When NOT to build an agent (most of the time)

The single most important decision in agent engineering is: should this be an agent at all?

If the workflow is well-defined, the steps are predictable, and the order is fixed: don't build an agent. Build a pipeline. A pipeline is faster, cheaper, more reliable, and easier to debug. It just calls an LLM at the steps where intelligence is needed, and uses ordinary code for everything else.

If the workflow is open-ended, the steps aren't known in advance, and the next action depends on the result of the previous: yes, this is an agent.

For most products, the right architecture is a pipeline with an LLM stage or two. The "agent" architecture is justified for a narrow set of use cases where genuine autonomy adds value. Pretending otherwise is how teams spend Q3 building agent infrastructure for a workflow that would have been a 200-line pipeline.

Lexa as a narrow, well-bounded agent

Lexa — Pakistan's first AI legal chatbot — is a useful concrete example of where "agent" stops being marketing and starts being engineering. It's narrow on purpose: a focused conversational interface, grounded in legal text via retrieval (we covered the choice in our RAG vs Fine-Tuning post), with a tightly-scoped tool catalogue (retrieval, citation lookup, escalation to a human) and hard refusal behaviour for anything outside its domain.

That narrowness is what makes it dependable. A "legal agent" that tried to do everything — interpret laws, draft documents, advise on specific cases, route to lawyers — would be a much harder system to ship responsibly. The agent we built is the one that survives contact with real users, which is the only kind worth shipping.

What we'd tell anyone building a production agent

Don't build an agent if a pipeline would do.
Keep the tool catalogue small and well-typed.
Persist every step. Make incidents debuggable.
Build the eval pipeline before you scale users — not after.
Plan for behavioural drift. Models change.
Don't ship without guardrails.
Most importantly: ship the smallest useful agent first. Scope creep on agents is more expensive than on any other class of feature, because complexity scales non-linearly. The same principle applied more broadly is in our post on why most AI features fail in production.

FAQ

Do I need an "agent framework" like LangChain or AutoGen? No. A clean production agent is a few hundred lines of careful code around an LLM provider, a typed tool catalogue, and a state machine. Frameworks can help with prototyping but introduce coupling we usually end up regretting in production. We've shipped agents both with and without frameworks; the framework-free version is more debuggable.

What's the difference between an agent and a copilot? A copilot suggests; the user decides. An agent acts; the user sees the result. Most products want a copilot. Some genuinely want an agent. Choose deliberately.

How long does it take to ship a production agent? For a narrow, well-bounded agent: 6–10 weeks. For a broader agent with multiple tools and complex workflows: 12–20 weeks. Most of the time isn't the model — it's the tools, the state, the evals, and the guardrails.

Which model is best for agents? Today, the frontier reasoning models (Claude, GPT, Gemini) are all viable, with different strengths. The right choice depends on your tool catalogue, latency budget, and cost target. We pick per use case with a provider-agnostic layer so you're not locked in.

Can the agent learn from user feedback? Not without careful design. Continual learning sounds great and quietly introduces drift. Most production agents we ship don't "learn" autonomously — they get updated by the team based on what evals and observability reveal.

Building an AI agent and want a sober technical read on architecture, evals, and scope? Book a 30-minute call — senior engineer, no sales rep, written recommendation after the call.

Building Production AI Agents: Architecture, Evals, and What Actually Ships

What "agent" actually means (and what it doesn't)

Production agent architecture (the boring version)

The failure modes that kill most agents

Eval pipelines — how to actually test an agent

When NOT to build an agent (most of the time)

Lexa as a narrow, well-bounded agent

What we'd tell anyone building a production agent

FAQ

Need help building this?

Related posts

What is MCP (Model Context Protocol)? A Practical Guide for Product Teams

How to Evaluate LLM Outputs in Production (Eval Pipelines That Actually Work)

How Much Does It Cost to Build an AI Chatbot in 2026? An Honest UK Breakdown