When AI Agents Actually Work in Production (and When They Don’t)
Multi-agent demos look magical. Production agents look very different. Here's the dividing line we've learned over 18 months of shipping autonomous systems for enterprise customers.
Every week, a Twitter demo shows three AI agents collaborating to build something impressive — a website, a research report, a working app. Every week, we get a call from a CTO asking why the agent that worked perfectly in a sandbox falls over the moment it touches a real workflow.
The honest answer: most AI agent systems aren't failing because of the model. They're failing because the surrounding system is wrong for what an agent actually is.
The shape of the problem
An LLM agent is, fundamentally, a system that decides what to do next based on incomplete information, an unbounded action space, and probabilistic reasoning. That's a fine description of a strategist. It's a terrible description of a deterministic piece of enterprise software. When teams treat agents like the latter, they get flakiness; when they treat agents like the former, they get value.
Where agents work in production
1. Bounded research and synthesis
Tasks like “find the three most relevant clauses across these 200 contracts” or “summarize what changed in this PR review thread.” The agent has a clear input, a defined corpus, and a clear stop condition. Recovery from a wrong path is cheap.
2. Triage, not execution
The agent labels, routes, drafts. A human approves before action. Customer support ticket triage. Sales lead enrichment. Compliance pre-screening. Anywhere the cost of a wrong action is high but the cost of a wrong recommendation is low.
3. Stateless, idempotent operations
An agent that can call a tool 12 times is fine if calling the same tool 12 times is safe. Agents that issue real-world side effects (send emails, charge cards, deploy code) need explicit safeguards — typically a confirmation gate, a budget cap, and a strict tool allowlist.
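As a rough sketch of what that confirmation gate can look like, here is a minimal Python example. The tool names (`send_email`, `lookup_account`), their implementations, and the approval flow are hypothetical placeholders under these assumptions, not any particular framework's API.
```python
# A minimal sketch of a confirmation gate for side-effecting tools.
# Tool names, implementations, and the approval flow are placeholders.

SIDE_EFFECT_TOOLS = {"send_email", "charge_card", "deploy_code"}

def send_email(to: str, body: str) -> str:
    return f"would send to {to}"        # placeholder side effect

def lookup_account(account_id: str) -> dict:
    return {"id": account_id}           # placeholder read-only tool

TOOLS = {"send_email": send_email, "lookup_account": lookup_account}

def call_tool(name: str, args: dict, approved: bool = False) -> dict:
    """Run read-only tools directly; queue side effects for human approval."""
    if name in SIDE_EFFECT_TOOLS and not approved:
        return {"status": "pending_approval", "tool": name, "args": args}
    return {"status": "ok", "result": TOOLS[name](**args)}

# The read-only call runs; the email waits for explicit human approval.
print(call_tool("lookup_account", {"account_id": "42"}))
print(call_tool("send_email", {"to": "a@example.com", "body": "hi"}))
```
The gate doesn't make the agent smarter; it makes a wrong recommendation cheap and a wrong action impossible without a human in the loop.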
Where agents fail in production
- Long-horizon, multi-step decisions with branching consequences. The longer the trajectory, the more compounding error you accumulate. By step 12, the agent has talked itself into a path that no human would take.
- Tasks where the failure mode is invisible. If a wrong answer looks like a right answer, the agent will produce wrong answers indefinitely. Most enterprise data tasks are like this.
- Anywhere SLAs matter. Agents are non-deterministic. A flow with a 5-second p50 latency can have a 90-second p99. Customer-facing flows die on this.
The architecture we use now
For every agent system we ship, we constrain it along three axes (sketched in code after the list):
- Tool allowlist. Not “any tool the LLM can imagine.” A specific list of well-tested, idempotent tools with strict input schemas.
- Step budget. Hard cap on agent loop iterations. Force a graceful handoff to a human when exceeded.
- Trace-everything telemetry. Every model call, every tool call, every prompt template version, captured with cost. Without this, debugging is impossible.
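Here is a minimal sketch of how the three constraints can sit in one agent loop. The `call_model` function, the tool names, and the in-memory `TRACE` list are stand-ins for your model client, your tool layer, and your telemetry pipeline; they are assumptions for illustration, not a specific library's API.
```python
import time

# Minimal sketch: tool allowlist + step budget + trace-everything telemetry.
# `call_model`, the tools, and TRACE are placeholders, not a framework's API.

ALLOWED_TOOLS = {
    "search_contracts": lambda query: ["clause A", "clause B"],   # placeholder tool
    "summarize": lambda text: f"summary of {len(text)} chars",    # placeholder tool
}
MAX_STEPS = 8        # step budget: hard cap on agent loop iterations
TRACE = []           # telemetry sink: one record per model call

def call_model(messages: list) -> dict:
    # Placeholder model: immediately returns a final answer with no tool call.
    return {"tool": None, "args": {}, "answer": "done", "prompt_version": "v1"}

def run_agent(task: str) -> dict:
    messages = [{"role": "user", "content": task}]
    for step in range(MAX_STEPS):
        start = time.time()
        decision = call_model(messages)
        TRACE.append({                      # capture every call with version and timing
            "step": step,
            "decision": decision,
            "prompt_version": decision["prompt_version"],
            "latency_s": round(time.time() - start, 3),
        })

        if decision["tool"] is None:        # model says it is finished
            return {"status": "done", "answer": decision["answer"]}

        if decision["tool"] not in ALLOWED_TOOLS:
            # Allowlist violation: stop and hand off rather than improvise.
            return {"status": "handoff", "reason": f"unlisted tool {decision['tool']}"}

        result = ALLOWED_TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": str(result)})

    # Step budget exhausted: graceful handoff to a human instead of looping forever.
    return {"status": "handoff", "reason": "step budget exceeded"}

print(run_agent("find the three most relevant clauses"))
```
The point isn't this exact loop; it's that the allowlist, the budget, and the trace live in code you control rather than in the prompt.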
The agents that ship to production look much less impressive than the agents in demos. They're also the only ones still running six months later.
What this means for your roadmap
If your AI agent project is stuck at “works in demo, fails in pilot,” you're almost certainly missing one of: clear bounds on the action space, a human-in-the-loop checkpoint, or production-grade observability. The model isn't the bottleneck. The system around the model is.
We help enterprise teams across financial services, healthcare and education ship production agents without learning these lessons the hard way. If that's where you're stuck, get in touch.
