The agentic AI hype cycle peaked around mid-2024. By late 2025, teams that had shipped production agentic systems were quiet - not because the technology failed, but because they had learned that building agentic AI in production is a systems engineering problem that most initial prototypes were never designed to handle.
I have built two agentic systems in production: a clinical document processing pipeline that coordinates retrieval, synthesis, and structured output across multiple specialized agents, and an internal research automation tool that orchestrates web retrieval, analysis, and report generation. Neither of them looks like the AutoGen multi-agent chat demos from 2023. Here is what production agentic architecture actually looks like.
When to Use Agents - And When Not To
Before the architecture: the most important agentic AI decision is whether to use agents at all. This sounds obvious, but the field spent roughly 18 months applying "multi-agent" to problems that needed a single well-structured prompt.
Use agents when:
- The task requires multiple discrete tool calls whose sequence cannot be fully predetermined
- The task requires dynamic planning that adapts based on intermediate results
- The task naturally decomposes into specialized subtasks with different tool requirements
- Failure recovery requires reasoning about what went wrong and trying a different approach
Do not use agents when:
- A single well-structured prompt with chain-of-thought handles the task reliably
- The workflow steps are fully predetermined (use a pipeline, not an agent)
- You need sub-100ms latency (agent overhead is measured in seconds, not milliseconds)
- The task does not require tool use - a sophisticated reasoning chain without tools is not an agent, it is a prompt
The practical rule: if you can write the workflow as a deterministic DAG without branching on LLM outputs, use a pipeline. If the workflow requires the LLM to decide what to do next based on what it found, use an agent.
Orchestration Patterns
ReAct (Reasoning + Acting)
ReAct is the foundational agentic pattern - the agent alternates between reasoning steps ("I need to find the patient's medication history") and action steps (call the EHR API, read the result, reason about it). It is the simplest pattern and handles a wide range of single-agent tasks well.
ReAct works well when: the task has a clear goal, the tool set is bounded, and the number of steps is small (under 10). It degrades when tasks require long-horizon planning, when intermediate results require significant interpretation before deciding the next action, or when failure in one step should trigger a fundamentally different approach rather than a retry.
Implementation recommendation: LangGraph's agent node with tool calling is the cleanest ReAct implementation I have used. It handles tool response parsing, error recovery, and state persistence more reliably than custom ReAct loops.
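The pattern itself is simple enough to sketch in a few lines. Below is a minimal, framework-free illustration of the reason/act/observe loop - `llm_decide`, the tool names, and the patient ID are all hypothetical stand-ins, not a real LLM or EHR integration:

```python
# Minimal ReAct loop: alternate reasoning and tool calls until the
# model emits a final answer or the step budget runs out.

def llm_decide(goal, history):
    # Stub for a real LLM call. A production system would prompt the
    # model with the goal plus the (thought, action, observation) history.
    if not history:
        return {"thought": "Look up the medication list first.",
                "action": "lookup_medications", "input": "patient-42"}
    return {"thought": "I have what I need.",
            "final_answer": f"Medications: {history[-1][2]}"}

TOOLS = {"lookup_medications": lambda pid: "metformin, lisinopril"}

def react_run(goal, max_steps=10):
    history = []  # (thought, action, observation) triples
    for _ in range(max_steps):
        step = llm_decide(goal, history)
        if "final_answer" in step:
            return step["final_answer"]
        observation = TOOLS[step["action"]](step["input"])
        history.append((step["thought"], step["action"], observation))
    raise RuntimeError("step budget exhausted - escalate")
```

Note the `max_steps` budget: even in the simplest pattern, an unbounded loop is a production incident waiting to happen.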
Plan-and-Execute
Plan-and-Execute separates planning from execution. A planner LLM generates a step-by-step plan for the task, and an executor LLM carries out each step. The key advantage over ReAct is that the plan is visible and auditable - you can inspect what the agent intends to do before it does it, which is critical for high-stakes applications.
In my clinical document processing pipeline, we use a plan-and-execute architecture where the planner generates a structured analysis plan ("Step 1: Extract all adverse event reports. Step 2: Identify any Grade 3+ events. Step 3: Cross-reference with protocol criteria...") that a human can review before the executor runs. This gate between planning and execution is not a feature we added for safety theater - it is genuinely useful for catching cases where the planner has misunderstood the task.
The weakness of plan-and-execute is brittleness to mid-plan surprises. If execution step 3 reveals information that makes the original plan wrong, a pure plan-and-execute system either fails or requires re-planning. Hybrid systems that allow limited re-planning at execution time handle this better.
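The shape of a hybrid system - plan, review gate, execute, with a bounded re-planning budget - can be sketched as follows. The `plan`, `execute_step`, and `needs_replan` functions are hypothetical stand-ins for LLM calls:

```python
# Plan-and-execute sketch with a review gate between planning and
# execution, and a limited re-planning budget for mid-plan surprises.

def plan(task):
    # Stub for a planner LLM call returning an ordered step list.
    return [f"extract data for {task}", f"summarize findings for {task}"]

def execute_step(step, prior_results):
    # Stub for an executor LLM call.
    return f"done: {step}"

def needs_replan(result):
    # Stub: a real check would have an LLM inspect the result against
    # the plan's assumptions.
    return False

def run(task, review=lambda steps: True, max_replans=1):
    steps = plan(task)
    if not review(steps):  # human or automated gate before execution
        raise ValueError("plan rejected before execution")
    results, replans, i = [], 0, 0
    while i < len(steps):
        result = execute_step(steps[i], results)
        if needs_replan(result) and replans < max_replans:
            steps = steps[:i + 1] + plan(task)  # limited re-planning
            replans += 1
        results.append(result)
        i += 1
    return results
```

The `max_replans` cap is the key design choice: unlimited re-planning reintroduces the unpredictability that plan-and-execute was chosen to avoid.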
Multi-Agent Patterns
Multi-agent systems coordinate multiple specialized agents with defined roles. The common patterns:
- Supervisor/worker: A supervisor agent decomposes tasks and delegates to specialized worker agents. Workers report results back; supervisor synthesizes. Clean separation of concerns, easy to test individual workers.
- Peer collaboration: Multiple agents with equal standing contribute to a shared task. Used in critique/refinement workflows where one agent generates and another reviews. CrewAI's default model.
- Hierarchical: Multi-level supervisor trees for very complex task decomposition. High overhead - avoid unless the task genuinely requires it.
Framework guidance: LangGraph is the most production-ready for supervisor/worker patterns because its graph model makes state flow explicit and debuggable. CrewAI is faster to prototype for peer collaboration patterns. AutoGen remains the most flexible for experimental multi-agent designs but has more operational overhead in production.
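The supervisor/worker separation of concerns is easy to see in a stripped-down sketch - the worker roles, the decomposition, and the synthesis step here are illustrative stubs for LLM calls, not a real framework API:

```python
# Supervisor/worker sketch: the supervisor decomposes a task, routes
# each subtask to a specialized worker, and synthesizes the results.

WORKERS = {
    "retrieval": lambda sub: f"[docs for {sub}]",
    "analysis":  lambda sub: f"[analysis of {sub}]",
}

def decompose(task):
    # Stub for a supervisor LLM call returning (worker, subtask) pairs.
    return [("retrieval", task), ("analysis", task)]

def supervise(task):
    # Delegate each subtask, then synthesize (here, a trivial join
    # standing in for a synthesis LLM call).
    results = [WORKERS[worker](sub) for worker, sub in decompose(task)]
    return " | ".join(results)
```

Because each worker is a plain function with a narrow contract, it can be unit-tested in isolation - the testability benefit mentioned above falls out of the structure.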
Tool Integration Architecture
Tools are the atoms of agentic systems. Every tool call is a potential failure point - API error, timeout, malformed response, rate limit, authentication failure. The architecture of your tool layer determines how gracefully the system handles these failures.
Key design principles for production tool integration:
- Tool schemas are documentation. Every tool needs a precise name, description, and parameter schema. The LLM uses this to decide when and how to call the tool. Vague tool descriptions lead to wrong or unnecessary tool calls. "search_database" is a bad tool name. "search_clinical_trials_by_indication_and_phase" is a good tool name.
- Tools should be idempotent where possible. An agent that calls a tool twice (due to error recovery or re-planning) should not create duplicate side effects. Design tools to be safely re-callable.
- Tool outputs need size limits. An agent that calls a tool and receives 200KB of text will either exceed the context window or waste tokens on irrelevant content. Every tool should return the minimum useful output - summaries over full documents, structured extracts over raw API responses.
- Build a tool registry. As tool count grows, management becomes a problem. A central tool registry with versioning, documentation, and usage analytics prevents tool proliferation and inconsistency.
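The principles above compose naturally into one layer. Here is a minimal registry sketch (all names and the character-based size limit are illustrative) that enforces schemas, idempotent re-calls via result caching, and output size limits at the point of every call:

```python
# Tool registry sketch enforcing the principles above: precise schemas,
# idempotent re-calls (via result caching), and output size limits.

MAX_OUTPUT_CHARS = 4000  # crude proxy for a token budget

class ToolRegistry:
    def __init__(self):
        self._tools, self._cache = {}, {}

    def register(self, name, fn, description, params):
        self._tools[name] = {"fn": fn, "description": description,
                             "params": params}

    def schemas(self):
        # What the LLM sees when deciding which tool to call.
        return {n: {"description": t["description"], "params": t["params"]}
                for n, t in self._tools.items()}

    def call(self, name, **kwargs):
        key = (name, tuple(sorted(kwargs.items())))
        if key in self._cache:        # idempotent re-call, no side effects
            return self._cache[key]
        out = str(self._tools[name]["fn"](**kwargs))
        out = out[:MAX_OUTPUT_CHARS]  # hard output size limit
        self._cache[key] = out
        return out
```

Caching by call signature is one way to make re-calls safe; tools with genuine side effects need idempotency built into the tool itself (e.g. upserts keyed on a request ID).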
Memory Systems
Agentic systems need memory at multiple timescales. Getting this architecture right is the difference between an agent that learns and adapts and one that starts from scratch on every invocation.
In-Context Memory
The conversation history and current task state within a single agent run. Managed by the framework (LangGraph's state graph, LangChain's memory classes). The constraint is context window size. For long-running tasks, you need a summarization strategy - compress older steps before they push critical information out of the context window.
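One summarization strategy is a rolling compression: when the step history exceeds a token budget, fold the oldest steps into a summary and keep the most recent steps verbatim. A minimal sketch, where `summarize` stands in for an LLM summarization call and the four-characters-per-token estimate is a rough heuristic:

```python
# Rolling compression of in-context step history: compress the oldest
# steps into a summary before they push recent context out of the window.

def estimate_tokens(text):
    return len(text) // 4  # rough heuristic, not a real tokenizer

def summarize(steps):
    # Stub for an LLM summarization call.
    return "SUMMARY: " + "; ".join(s[:20] for s in steps)

def compress_history(steps, budget_tokens=1000, keep_recent=5):
    if sum(estimate_tokens(s) for s in steps) <= budget_tokens:
        return steps
    old, recent = steps[:-keep_recent], steps[-keep_recent:]
    if not old:
        return steps  # nothing left to compress
    return [summarize(old)] + recent
```

Keeping the most recent steps verbatim matters: the agent's next decision usually depends on the last few observations, while older steps can survive as a compressed gist.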
Episodic Memory
Records of past agent runs that can be retrieved and used to inform current runs. "Last time I processed a clinical study report of this type, I found that the adverse events section was structured differently than the protocol template assumed - I should check for that." Implemented with a vector store and similarity retrieval. This is the memory layer most teams skip and most wish they had built earlier.
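The store-and-recall shape of this layer is simple. A production version would use embeddings and a vector store; the sketch below substitutes bag-of-words cosine similarity purely so the example is self-contained:

```python
# Episodic memory sketch: store lessons from past runs, retrieve the
# most similar ones at the start of a new run.
import math
from collections import Counter

class EpisodicMemory:
    def __init__(self):
        self._episodes = []  # (task_description, lesson) pairs

    def record(self, task, lesson):
        self._episodes.append((task, lesson))

    @staticmethod
    def _similarity(a, b):
        # Bag-of-words cosine similarity; a real system would compare
        # embedding vectors instead.
        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(va[w] * vb[w] for w in va)
        norm = (math.sqrt(sum(c * c for c in va.values()))
                * math.sqrt(sum(c * c for c in vb.values())))
        return dot / norm if norm else 0.0

    def recall(self, task, k=3):
        ranked = sorted(self._episodes,
                        key=lambda e: self._similarity(task, e[0]),
                        reverse=True)
        return [lesson for _, lesson in ranked[:k]]
```

At run start, the top-k recalled lessons get injected into the agent's context - which is exactly the "I should check for that" behavior described above.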
Semantic Memory
Factual knowledge the agent has access to - a knowledge base, a retrieval corpus, a structured database. Implemented with RAG. The design of this layer follows the RAG architecture principles covered separately (chunking, retrieval strategy, metadata filtering).
Procedural Memory
Learned behaviors and heuristics about how to accomplish tasks. In practice, this lives in the system prompt and tool descriptions. Updating procedural memory means updating prompts - which requires version control and testing.
Guardrails Architecture
Production agentic systems need guardrails at multiple layers. A single guardrail layer at the output stage catches problems too late and misses issues that occur mid-reasoning.
Layer 1 - Input guardrails: Validate and sanitize inputs before they reach the agent. Catch prompt injection, PII that should not be processed, malformed inputs that will waste agent compute.
Layer 2 - Tool call guardrails: Before any tool is called, validate that the call is appropriate given the current context. An agent should not call a "delete_record" tool in a read-only analysis context. These guardrails can be implemented as tool wrappers that check preconditions before execution.
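A precondition-checking tool wrapper for this layer can be sketched in a few lines - the tool names and the read-only flag are illustrative:

```python
# Tool-call guardrail sketch: a wrapper that checks preconditions
# (here, a read-only context flag) before any tool executes.

class GuardrailViolation(Exception):
    pass

WRITE_TOOLS = {"delete_record", "update_record"}

def guarded(tool_name, fn, context):
    def wrapper(*args, **kwargs):
        if context.get("read_only") and tool_name in WRITE_TOOLS:
            raise GuardrailViolation(
                f"{tool_name} blocked in read-only context")
        return fn(*args, **kwargs)
    return wrapper
```

Because the check runs before the tool body, a blocked call produces no side effects - the violation is caught at the boundary, not cleaned up after.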
Layer 3 - Output guardrails: Validate agent outputs before they are returned to users or used as inputs to downstream systems. For clinical outputs, this includes schema validation, clinical plausibility checks, and confidence thresholding. Tools like Guardrails AI and NeMo Guardrails provide frameworks for this layer.
Layer 4 - Loop detection: Production agents can get stuck in reasoning loops - calling the same tool repeatedly, re-planning endlessly, or cycling between two approaches. Implement step count limits, loop detection based on repeated tool calls, and escalation paths when limits are hit.
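Both checks - a hard step cap and repeated-identical-call detection - fit in a small guard object that the orchestration loop consults before every tool call. The limits below are illustrative defaults:

```python
# Loop-detection sketch: enforce a hard step limit and flag repeated
# identical tool calls, raising an error that triggers escalation.

class AgentLoopError(Exception):
    pass

class LoopGuard:
    def __init__(self, max_steps=25, max_repeats=3):
        self.max_steps, self.max_repeats = max_steps, max_repeats
        self.steps = 0
        self.call_counts = {}

    def check(self, tool_name, tool_args):
        self.steps += 1
        if self.steps > self.max_steps:
            raise AgentLoopError("step limit exceeded - escalate to human")
        # Key on tool name plus normalized arguments so only truly
        # identical calls count as repeats.
        key = (tool_name, repr(sorted(tool_args.items())))
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        if self.call_counts[key] > self.max_repeats:
            raise AgentLoopError(
                f"repeated identical call to {tool_name} - likely loop")
```

The escalation path is whatever catches `AgentLoopError`: a fallback to a human queue, a cheaper deterministic path, or a graceful failure response.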
Observability: The Non-Negotiable Layer
An agentic system that you cannot observe is an agentic system you cannot debug, improve, or trust. Observability for agentic AI is more complex than for standard software because the execution graph is not predetermined - it emerges from the agent's decisions.
What you need to trace for every agent run:
- Every LLM call: input tokens, output tokens, latency, model used, temperature settings
- Every tool call: tool name, inputs, outputs, latency, success/failure
- Full reasoning trace: each reasoning step the agent took, in order
- Branching decisions: where the agent chose between multiple options and why
- Final output: what was returned, how long the full run took, total token cost
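The minimal data structure behind all of this is an ordered event log per run. A sketch (field names are illustrative; a production system would emit these records through OpenTelemetry or a purpose-built tracer rather than hold them in memory):

```python
# Trace-capture sketch: record every LLM and tool call with timing and
# token counts so a run can be reconstructed after the fact.
import time

class RunTrace:
    def __init__(self, run_id):
        self.run_id, self.events = run_id, []

    def record(self, kind, name, inputs, output, tokens=0):
        self.events.append({
            "ts": time.time(), "kind": kind, "name": name,
            "inputs": inputs, "output": output, "tokens": tokens,
        })

    def total_tokens(self):
        return sum(e["tokens"] for e in self.events)

    def replay(self):
        # Reconstruct the run step by step, in recorded order.
        return [(e["kind"], e["name"]) for e in self.events]
```

The `replay` view is the point: when an incident comes in, you walk the ordered events to answer "why did the agent produce this output" - which is the bar set below for production readiness.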
LangSmith handles this well for LangChain/LangGraph-based systems. For other frameworks, OpenTelemetry with a custom AI tracing layer is the most portable option. Arize Phoenix and Weights & Biases are strong alternatives for teams that want purpose-built AI observability.
The operational rule I follow: if you cannot reconstruct why an agent produced a specific output, the agent is not production-ready. Every production incident investigation starts with the trace.
What I Would Build Differently
Looking back at the production systems I have shipped: I would invest earlier in the tool registry, the episodic memory layer, and the planning/execution separation. These are the architectural decisions that become expensive to retrofit once the system is in production. The guardrails and observability layers were built in from the start and have paid off every time something went wrong - which, in agentic systems, is not a question of if but when.
The systems that have degraded gracefully in production share one property: explicit state management. When something goes wrong and you need to understand what the agent was doing and why, explicit state is your only debugging tool. Implicit state - reasoning that happens inside a black-box LLM call without tracing - is where agentic systems become unmaintainable. Design for the debug experience from day one.