The last eighteen months produced a wave of agentic demos, a smaller number of agentic prototypes, and a very small number of agents that actually run in production. The gap between the first and the last is not primarily about model capability. It is about the surrounding architecture, and the patterns that make autonomous systems reliable enough to trust.

Start with a narrow scope

The most common failure mode for production agents is overreach. A general-purpose agent handed a vague goal tends to wander into the long tail of edge cases, and debugging is painful because every run is a different walk through the state space. Agents that succeed almost always begin with a narrow, well-defined job where the success criteria can be checked automatically.

Breadth can come later. It rarely comes first.
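
One way to make "success criteria that can be checked automatically" concrete is to attach a checker function to each task definition. The sketch below is illustrative only: the `Task` class, the `is_valid_iso_date` checker, and the invoice-date job are all hypothetical names chosen for the example, not part of any particular framework.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class Task:
    """A narrow job paired with a machine-checkable success criterion (hypothetical)."""
    name: str
    goal: str
    check: Callable[[str], bool]


def is_valid_iso_date(output: str) -> bool:
    """Return True only if the agent output parses as an ISO calendar date."""
    try:
        date.fromisoformat(output.strip())
        return True
    except ValueError:
        return False


# Hypothetical narrow task: the output is either a valid date or it is not.
extract_invoice_date = Task(
    name="extract_invoice_date",
    goal="Return the invoice date as YYYY-MM-DD and nothing else.",
    check=is_valid_iso_date,
)
```

Because the check is a plain function, the same criterion can gate production runs and score offline evaluations.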

Planning as a separate concern

The agents that perform best treat planning as a distinct step from execution. A planner produces a sequence of steps, ideally structured and inspectable, and an executor carries them out with clear checkpoints. This separation is what makes agents debuggable. When something goes wrong, the failure can be localized to the plan or to a specific step, not to a continuous stream of tool calls that all look equally suspicious.

Replanning when something unexpected happens is the other half. A rigid plan that cannot adapt to new information is not an agent. It is a script with extra tokens.
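
The plan, execute, and replan split described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `Step`, `execute`, and the `replan` callback are hypothetical names, and a real planner would call a model where the example passes in a plain function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One inspectable unit of a plan (hypothetical structure)."""
    description: str
    action: Callable[[], str]
    status: str = "pending"
    result: Optional[str] = None


def execute(
    plan: List[Step],
    replan: Callable[[List[Step], List[Step]], List[Step]],
    max_replans: int = 2,
) -> List[Step]:
    """Run steps in order. On failure, ask `replan` for new remaining steps.

    Returns the full history, so a failure can be traced to one specific step.
    """
    history: List[Step] = []
    queue = list(plan)
    replans = 0
    while queue:
        step = queue.pop(0)
        try:
            step.result = step.action()
            step.status = "done"
            history.append(step)
        except Exception as exc:
            step.status = "failed"
            step.result = str(exc)
            history.append(step)
            if replans >= max_replans:
                break  # stop rather than loop forever on a bad plan
            replans += 1
            # The planner sees what happened and what was still pending.
            queue = replan(history, queue)
    return history
```

Each step records its own status and result, so the history doubles as a checkpoint log, and the replan cap keeps a persistently failing plan from retrying without end.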

Tool use needs type discipline

Tool calling works when the tools have small, well-defined interfaces with rigorous input validation and clear error messages. It fails when tools are loosely specified and the agent is asked to figure out the contract at runtime. Every production tool should be treated like a public API, with schemas, versioning, idempotency where possible, and observability for every call.

The corollary is that the set of tools should be curated. An agent with fifty poorly documented tools is almost always worse than the same agent with ten excellent ones.

Memory is a product decision

Long-term memory is where agents get interesting and where they get into the most trouble. Naive memory systems accumulate noise until retrieval becomes useless. Good memory systems are selective about what they store, structured about how they store it, and careful about how they surface it back into context. Summarization, deduplication, and explicit forgetting are all necessary.

The right memory model depends entirely on the use case. A research agent needs episodic memory of sources. A customer agent needs stable facts about the user. Conflating these produces memory that is vaguely useful for everything and reliably useful for nothing.
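
For the customer-agent case, a minimal sketch of selective memory might deduplicate on a normalized form of each fact, cap the store size by dropping the least recently written entry, and support explicit forgetting. The `FactMemory` class and its keyword-overlap recall are hypothetical simplifications. A real system would likely use a proper retrieval method, and summarization is not shown.

```python
from collections import OrderedDict
from typing import List


class FactMemory:
    """Small store of stable user facts with dedup, a size cap, and forgetting."""

    def __init__(self, max_items: int = 100) -> None:
        self.max_items = max_items
        self._facts: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so near-identical facts collapse.
        return " ".join(text.lower().split())

    def __len__(self) -> int:
        return len(self._facts)

    def remember(self, text: str) -> None:
        key = self._key(text)
        if not key:
            return  # never store empty noise
        self._facts[key] = text.strip()
        self._facts.move_to_end(key)  # mark as most recently written
        while len(self._facts) > self.max_items:
            self._facts.popitem(last=False)  # drop the oldest entry

    def forget(self, text: str) -> bool:
        """Explicitly remove a fact. Returns True if something was removed."""
        return self._facts.pop(self._key(text), None) is not None

    def recall(self, query: str, limit: int = 3) -> List[str]:
        """Return up to `limit` facts sharing at least one word with the query."""
        words = set(self._key(query).split())
        scored = [(len(words & set(k.split())), v) for k, v in self._facts.items()]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in scored[:limit]]
```

A research agent would want a different shape entirely, for example entries keyed by source and timestamp, which is the point of choosing the memory model per use case.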

Recovery and bounded autonomy

Every production agent needs bounds. Step limits, cost limits, and explicit halt conditions prevent the class of failures where the system burns through a budget chasing an impossible goal. Human-in-the-loop checkpoints for irreversible actions are not a weakness. They are a feature that buys trust.

Recovery is the other side of the same coin. An agent that can detect its own failure, roll back, and try a different approach is dramatically more valuable than one that plows forward regardless.
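
The bounds described above can be sketched as a loop that checks an explicit halt condition, a step limit, and a cost limit before every step. The names `run_bounded` and `RunResult`, and the convention that each step returns its own cost, are assumptions for illustration. Human approval checkpoints and rollback are not shown.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class RunResult:
    """Why the run stopped, plus what it consumed (hypothetical structure)."""
    reason: str  # "done", "step_limit", or "cost_limit"
    steps: int
    cost_usd: float
    state: Any


def run_bounded(
    step: Callable[[Any], Tuple[Any, float]],
    is_done: Callable[[Any], bool],
    max_steps: int,
    max_cost_usd: float,
) -> RunResult:
    """Run `step` until the goal is met or a limit is reached.

    `step` takes the current state and returns (new_state, cost_of_this_step).
    Limits are checked before each step, so the cost can overshoot by at most
    the cost of one step.
    """
    state: Any = None
    steps = 0
    cost = 0.0
    while True:
        if is_done(state):
            return RunResult("done", steps, cost, state)
        if steps >= max_steps:
            return RunResult("step_limit", steps, cost, state)
        if cost >= max_cost_usd:
            return RunResult("cost_limit", steps, cost, state)
        state, step_cost = step(state)
        steps += 1
        cost += step_cost
```

Returning the stop reason instead of raising makes the halt an ordinary outcome that the caller can log, alert on, or hand to a human.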

Observability built for non-determinism

Standard logging is not enough. An agent trace needs to capture the prompt, the reasoning, the tool call, the tool response, and the next decision, all in a form that is searchable and comparable across runs. Without this, debugging becomes archaeology. With it, the team can identify patterns across thousands of runs and improve the system methodically.
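
A trace along these lines could be modeled as one structured record per step, serialized to JSON Lines so that it is searchable, plus a small helper that compares two runs. The field names and the `first_divergence` helper are hypothetical. They mirror the five items listed above rather than any specific tracing product.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TraceEvent:
    """One agent step, captured in full (hypothetical schema)."""
    run_id: str
    step: int
    prompt: str
    reasoning: str
    tool_name: str
    tool_args: dict
    tool_response: str
    next_decision: str


def to_jsonl(events: List[TraceEvent]) -> str:
    """Serialize events as JSON Lines, one searchable record per step."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


def tool_sequence(events: List[TraceEvent]) -> List[str]:
    """The ordered list of tools a run called."""
    return [e.tool_name for e in sorted(events, key=lambda e: e.step)]


def first_divergence(run_a: List[TraceEvent], run_b: List[TraceEvent]) -> Optional[int]:
    """Index of the first step where two runs called different tools, else None."""
    seq_a, seq_b = tool_sequence(run_a), tool_sequence(run_b)
    for index, (tool_a, tool_b) in enumerate(zip(seq_a, seq_b)):
        if tool_a != tool_b:
            return index
    if len(seq_a) != len(seq_b):
        return min(len(seq_a), len(seq_b))  # one run simply kept going
    return None
```

Comparing a failing run against a passing one at the first point of divergence is often a faster starting place than reading either trace from the top.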

Production agents are not the models. They are everything around the models. The teams building agents that work are the ones treating the surrounding architecture with the seriousness it deserves.