Failure mode: tool sprawl
If you give an agent twenty tools, you gave it twenty ways to hurt you. We narrow tools to verbs your business already understands: read invoice, create ticket, post credit under policy. Each tool has typed arguments validated server-side, not “stringly typed” JSON blobs interpreted by hope.
Failure mode: runaway loops
Agents retry when they are confused. Retries without caps become storms. We enforce max steps, max tool calls, wall-clock timeouts, and spend caps on paid APIs. When budgets trip, the agent stops and asks for a human with a structured summary. It does not improvise a new plan forever.
Failure mode: silent partial success
APIs return 200 with incomplete payloads. We treat idempotency keys and post-condition checks as mandatory. The agent surfaces receipts: what it attempted, what the system returned, and what changed. Users should never hear “done” when half the write failed.
Failure mode: prompt injection via tools and retrieved text
Untrusted content is not instructions. We isolate system prompts, sanitize tool arguments, and filter retrieval with ACLs. Tools cannot fetch arbitrary URLs unless you explicitly want that risk.
Opinion: the best agent UX is boring receipts
Flashy streaming tokens impress demos. Operators need receipts. Links to tickets, IDs for transactions, and trace IDs for support. We bias UX toward auditability over charisma, because your security team will not fall in love with a typing indicator.
How we evaluate agents before expanding traffic
We run red-team prompts against tool policies, replay production traces in shadow mode, and measure containment without harming customers. Expansion happens when error budgets say so, not when marketing schedules say so.
What you should ask any vendor
Ask for the policy matrix, the kill switch behavior, and the last incident postmortem. If a vendor cannot show how an agent fails safely, they cannot show how it succeeds safely either.
Next step
If you have a queue that should be agent-shaped, start with a Rapid POC that implements two tools and one retrieval source, then measure containment and escalation quality for two weeks. Evidence first, expansion second.