10 Essential Insights on Coordinating Multiple AI Agents at Scale

Coordinating multiple AI agents in a complex system is widely regarded as one of the toughest engineering challenges today. In a recent podcast, Intuit's Chase Roossin (group engineering manager) and Steven Kulesza (staff software engineer) shared their hard-won experience on this topic. Whether you're building a multi-agent orchestration platform or simply scaling up your AI workflows, these ten insights will help you avoid common pitfalls and build systems that truly collaborate. From communication protocols to failure handling, here's what you need to know.

1. Define Clear Agent Boundaries and Responsibilities

Just like in a well-organized team, each AI agent needs a clearly scoped domain. Without explicit boundaries, agents can step on each other's toes, duplicate work, or even contradict each other's outputs. Start by mapping out the overall workflow and assigning each agent a specific function, such as data retrieval, decision-making, or output generation. This not only reduces conflicts but also makes debugging much easier, because a bad output can be traced to the one agent that owns that step. Chase Roossin emphasizes that when every agent knows its lane, you can scale horizontally without chaos. Use a central registry or a configuration file to document these boundaries, and update it as the system evolves.
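As a rough illustration of the registry idea, the sketch below records which agent owns which task type and refuses overlapping claims. The names AgentSpec, AgentRegistry, and the example agents are hypothetical, not something described in the podcast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical description of one agent's lane."""
    name: str
    responsibility: str
    handles: frozenset  # task types this agent owns


class AgentRegistry:
    """Maps each task type to exactly one owning agent."""

    def __init__(self):
        self._owners = {}  # task type -> AgentSpec

    def register(self, spec: AgentSpec) -> None:
        # Check every task type first so a rejected spec leaves no partial state.
        for task_type in spec.handles:
            owner = self._owners.get(task_type)
            if owner is not None and owner.name != spec.name:
                raise ValueError(
                    f"{task_type!r} is already owned by {owner.name!r}"
                )
        for task_type in spec.handles:
            self._owners[task_type] = spec

    def owner_of(self, task_type: str) -> AgentSpec:
        return self._owners[task_type]


registry = AgentRegistry()
registry.register(AgentSpec("retriever", "fetch source data", frozenset({"fetch"})))
registry.register(AgentSpec("decider", "choose an action", frozenset({"decide"})))
```

Registering a second agent that also claims "fetch" raises a ValueError, which surfaces a boundary overlap at startup rather than in production.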

Source: stackoverflow.blog

2. Implement a Robust Communication Protocol

Agents need a common language to share data and signals. Whether it's a RESTful API, a message queue (like Kafka), or a shared knowledge graph, the protocol must be reliable and low-latency. Steven Kulesza points out that asynchronous communication often works best for larger deployments because it prevents one slow agent from blocking the entire pipeline. Ensure your protocol supports versioning, timeouts, and retries. Consider using a standardized format such as JSON or Protocol Buffers to keep parsing consistent across diverse agents. The goal is to make each agent a black box that can be replaced or upgraded without breaking the overall system.
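One way to picture a versioned, standardized message format is a small JSON envelope like the one below. The field names (schema_version, message_id, timeout_s, and so on) are assumptions made for this sketch, not a published schema.

```python
import json
import uuid
from dataclasses import dataclass, asdict

# Versions this (hypothetical) agent knows how to parse.
SUPPORTED_VERSIONS = {1}


@dataclass
class Envelope:
    schema_version: int
    message_id: str
    sender: str
    task_type: str
    payload: dict
    timeout_s: float = 30.0  # how long the sender is willing to wait


def encode(envelope: Envelope) -> str:
    """Serialize the envelope to a JSON string."""
    return json.dumps(asdict(envelope))


def decode(raw: str) -> Envelope:
    """Parse a JSON string, rejecting versions this agent does not understand."""
    data = json.loads(raw)
    version = data.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return Envelope(**data)


msg = Envelope(1, str(uuid.uuid4()), "retriever", "fetch", {"query": "invoices"})
round_tripped = decode(encode(msg))
```

Because the version check happens at the boundary, an agent can be upgraded to a newer schema without silently misreading messages from older peers.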

3. Use a Central Orchestrator (Sparingly)

A central orchestrator can simplify coordination by managing the order of agent execution and handling failures. However, it can also become a bottleneck and a single point of failure. The Intuit engineers recommend using a lightweight orchestrator that only manages high-level steps, while allowing agents to make local decisions autonomously. For example, a distributed task queue like Celery or a workflow engine like Temporal can provide the necessary coordination without micromanaging. The key is to strike a balance: enough orchestration to avoid deadlocks, but not so much that you lose the benefits of agent independence.
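To make "lightweight" concrete, here is a minimal sketch in plain Python rather than Celery or Temporal. The orchestrator only sequences named steps and records which one failed, and each agent decides what to do with the context. The function and agent names are hypothetical.

```python
from typing import Callable, Dict, List

Agent = Callable[[dict], dict]


def run_pipeline(steps: List[str], agents: Dict[str, Agent], context: dict) -> dict:
    """Run agents in order; stop at the first failure and report it."""
    for step in steps:
        try:
            # Each agent receives a copy, so a failing agent cannot leave
            # the shared context half-modified.
            context = agents[step](dict(context))
        except Exception as exc:
            return {**context, "status": "failed", "failed_step": step, "error": str(exc)}
    return {**context, "status": "ok"}


def retrieve(ctx: dict) -> dict:
    ctx["documents"] = ["invoice-1", "invoice-2"]
    return ctx


def decide(ctx: dict) -> dict:
    # A local decision the orchestrator knows nothing about.
    ctx["action"] = "summarize" if len(ctx["documents"]) > 1 else "forward"
    return ctx


def broken(ctx: dict) -> dict:
    raise RuntimeError("model unavailable")


AGENTS = {"retrieve": retrieve, "decide": decide, "broken": broken}
result = run_pipeline(["retrieve", "decide"], AGENTS, {"request_id": "r-1"})
```

Here the orchestrator never inspects documents or actions; swapping the decide agent for a different implementation requires no orchestrator change.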

4. Design for Idempotency and Retry

In distributed systems, failures are inevitable. An agent might crash, a network call might time out, or a database might lock. Every agent operation should be idempotent, meaning that performing the same action multiple times leaves the system in the same state as performing it once. This allows you to retry failed tasks safely. Steven Kulesza suggests building in a retry mechanism with exponential backoff and jitter to avoid thundering herd problems. Additionally, each agent should output a unique transaction ID so that downstream agents can deduplicate. Without these safeguards, a simple glitch can cascade into a system-wide inconsistency.
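The two ideas above, retry with backoff and deduplication by transaction ID, can be sketched roughly as follows. The "full jitter" variant (a random delay between zero and the exponential cap) is one common choice and an assumption here, as are all names and default values.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Real code should catch only errors known to be transient; this sketch
    catches Exception to stay short.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential cap: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))


class IdempotentHandler:
    """Applies each transaction ID at most once and replays the stored result."""

    def __init__(self):
        self._results = {}  # transaction_id -> result
        self.applied = 0    # how many times the side effect actually ran

    def handle(self, transaction_id, payload):
        if transaction_id not in self._results:
            self.applied += 1
            self._results[transaction_id] = {"total": sum(payload)}
        return self._results[transaction_id]
```

With this shape, a retried delivery of the same transaction ID returns the stored result instead of repeating the side effect. A production version would keep the seen IDs in durable storage rather than in memory.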

5. Monitor Agent Behavior with Distributed Tracing

When multiple agents interact, a single request can span dozens of services. Per-agent logs alone then tell you little, because no single log shows the whole request. Instead, implement distributed tracing (e.g., with OpenTelemetry) to follow the entire lifecycle of a task. Chase Roossin notes that this is the only way to identify which agent is slowing down the pipeline or returning errors. Attach metadata like agent version, input hash, and decision path to each span. Set up dashboards that show agent-level latency and error rates. With good observability, you can quickly pinpoint whether the problem is a misbehaving agent or a resource constraint.
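The sketch below illustrates the core mechanics of tracing using only the standard library: every span carries the trace ID of the request plus a pointer to its parent span. It is deliberately not the OpenTelemetry API; Span, start_span, and FINISHED_SPANS are hypothetical stand-ins, and a real system would use an OpenTelemetry SDK and exporter instead.

```python
import hashlib
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional

FINISHED_SPANS = []  # stand-in for a tracing backend


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    attributes: dict = field(default_factory=dict)
    duration_s: float = 0.0


@contextmanager
def start_span(name, parent=None, **attributes):
    """Open a span; child spans inherit the parent's trace ID."""
    span = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        name=name,
        attributes=dict(attributes),
    )
    started = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span.attributes["error"] = repr(exc)
        raise
    finally:
        span.duration_s = time.perf_counter() - started
        FINISHED_SPANS.append(span)


def input_hash(payload: str) -> str:
    """Short, stable fingerprint of an agent's input."""
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


with start_span("handle_request") as root:
    with start_span("retriever", parent=root, agent_version="1.4.0",
                    input_hash=input_hash("invoices")) as child:
        child.attributes["decision_path"] = "cache_miss>db"
```

Grouping finished spans by trace_id and sorting by duration_s is then enough to see which agent dominated a slow request.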

6. Use a Shared Context or Knowledge Graph

Agents often need to share state information, such as user preferences, intermediate results, or environmental conditions. Instead of passing large payloads between agents, use a shared, persistent store, such as a knowledge graph held in a graph database, that all agents can read from and write to. This reduces direct coupling between agents and makes the system more resilient to agent failures, because the state survives when an individual agent crashes. Steven Kulesza explains that Intuit uses a lightweight in-memory cache for high-frequency data and a SQL database for persistent context. Ensure proper locking or optimistic concurrency control to prevent race conditions.
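Optimistic concurrency control can be sketched with a version number per key: a write succeeds only if the writer saw the latest version. The in-memory SharedContext class below is a hypothetical stand-in for whatever store is actually used; in a SQL database the same check is typically an UPDATE guarded by a version column.

```python
import threading


class VersionConflict(Exception):
    """Raised when a writer's view of a key is stale."""


class SharedContext:
    def __init__(self):
        self._data = {}  # key -> (version, value)
        self._lock = threading.Lock()  # protects the dict itself

    def read(self, key):
        """Return (version, value); version 0 means the key does not exist yet."""
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Store value only if nobody has written since expected_version was read."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"{key!r}: expected v{expected_version}, found v{current_version}"
                )
            new_version = current_version + 1
            self._data[key] = (new_version, value)
            return new_version
```

An agent that receives VersionConflict re-reads the key, reapplies its change to the fresh value, and tries again, so no update is silently overwritten.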

7. Implement Graceful Degradation and Fallbacks

No matter how well you design the system, some agents will fail. Instead of crashing the entire pipeline, build fallback logic. For instance, if a recommendation agent times out, serve a default set of recommendations. If a data‑scrubbing agent is unavailable, the system can use raw data (with a warning). Chase Roossin stresses the importance of defining clear Service Level Objectives (SLOs) for each agent. When an agent can't meet its SLO, the orchestrator should route the request to a simpler, more robust agent. This ensures that the overall system remains functional, even if not at peak accuracy.
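The recommendation example above might look roughly like this. The timeout value, the default list, and the helper name call_with_fallback are all assumptions for illustration; the timeout plays the role of a per-agent latency SLO.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)

DEFAULT_RECOMMENDATIONS = ["top-seller-1", "top-seller-2"]


def call_with_fallback(agent, request, timeout_s, fallback):
    """Return the agent's answer, or the fallback if it is too slow or fails."""
    future = _executor.submit(agent, request)
    try:
        return {"result": future.result(timeout=timeout_s), "degraded": False}
    except FutureTimeout:
        # Best effort only: cancel() cannot stop a call that is already running.
        future.cancel()
        return {"result": fallback, "degraded": True, "reason": "timeout"}
    except Exception as exc:
        return {"result": fallback, "degraded": True, "reason": repr(exc)}


def slow_recommender(request):
    time.sleep(0.3)  # simulates an agent that misses its latency target
    return ["personalised-1"]


def fast_recommender(request):
    return ["personalised-1"]


degraded = call_with_fallback(slow_recommender, {"user": 42}, 0.05, DEFAULT_RECOMMENDATIONS)
healthy = call_with_fallback(fast_recommender, {"user": 42}, 0.05, DEFAULT_RECOMMENDATIONS)
```

The degraded flag lets callers attach the warning mentioned above and lets dashboards count how often the fallback path is taken.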

8. Version and Test Agents Independently

Treat each agent as a deployable microservice. Use containerization (e.g., Docker) and CI/CD pipelines to test and deploy each agent on its own. This allows you to roll out changes to one agent without redeploying the entire platform. Steven Kulesza advises maintaining a suite of integration tests that simulate multi‑agent interactions, using mocks for upstream and downstream agents. Also, include canary deployments to test new agent versions in production with a small fraction of traffic. Only promote an agent to full production after it has proven stable in the canary environment.
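A canary split is often implemented by hashing a stable request or user ID into a bucket, so the same caller consistently sees the same version. The following is a minimal sketch of that idea; route_version and the percentage parameter are hypothetical, and real deployments usually do this in the load balancer or service mesh.

```python
import hashlib


def route_version(request_id: str, canary_percent: int) -> str:
    """Deterministically send roughly canary_percent of IDs to the canary."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first two bytes of the hash onto buckets 0..99.
    bucket = int.from_bytes(digest[:2], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


sample = [route_version(f"user-{i}", canary_percent=5) for i in range(10_000)]
canary_share = sample.count("canary") / len(sample)
```

For this sample, canary_share should land close to 0.05. Raising canary_percent in steps promotes the new version gradually, and setting it to 0 is an immediate rollback.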

9. Plan for Conflict Resolution

When multiple agents produce different answers for the same query, you need a conflict resolution strategy. Common approaches include majority voting, weighted scoring, or human-in-the-loop arbitration. For example, if three different summarization agents generate summaries, the system can compare them and select the one with the highest confidence score. Chase Roossin mentions that Intuit sometimes uses a meta-agent with a reinforcement learning policy to decide which agent's output to trust. Whatever method you choose, ensure that the resolution logic is transparent and auditable, especially in regulated domains like finance or healthcare.
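Weighted scoring with an audit trail could be sketched as below. The resolve function, the per-agent weights, and the tuple layout are assumptions for illustration; this is not the meta-agent approach mentioned above. With all weights left at 1.0 and equal confidences, it reduces to majority voting.

```python
from collections import defaultdict


def resolve(candidates, weights=None):
    """Pick the answer with the highest weighted confidence.

    candidates: list of (agent_name, answer, confidence) tuples.
    weights:    optional per-agent trust multipliers (default 1.0).
    Ties go to the answer that appeared first in candidates.
    """
    weights = weights or {}
    scores = defaultdict(float)
    supporters = defaultdict(list)
    for agent, answer, confidence in candidates:
        scores[answer] += weights.get(agent, 1.0) * confidence
        supporters[answer].append(agent)
    winner = max(scores, key=scores.get)
    # The audit record keeps every score, so the decision can be reviewed later.
    audit = {"winner": winner, "scores": dict(scores), "supporters": dict(supporters)}
    return winner, audit


candidates = [
    ("agent_a", "approve", 0.6),
    ("agent_b", "approve", 0.5),
    ("agent_c", "reject", 0.9),
]
winner, audit = resolve(candidates)
trusted_winner, _ = resolve(candidates, weights={"agent_c": 2.0})
```

In this example two moderately confident agents outvote one confident agent, unless the third agent is given a higher trust weight. Persisting the audit record is what makes the outcome explainable afterwards.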

10. Continuously Optimize and Retire Underperforming Agents

Agent performance can degrade over time due to data drift, changing user behavior, or model staleness. Regularly evaluate each agent's accuracy, latency, and cost. If an agent consistently underperforms, consider replacing it with a newer model or merging its function with another agent. Steven Kulesza recommends keeping a registry of agent metrics and setting up automated alerts when key metrics fall below thresholds. Additionally, maintain a lifecycle policy: every agent should have a designated owner, a review cadence, and a clear sunset process. This prevents the system from accumulating “zombie agents” that nobody maintains.
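A metrics registry with threshold alerts might be sketched like this. The metric names, threshold values, and example agents are invented for illustration and would need to be tuned per system.

```python
from dataclasses import dataclass


@dataclass
class AgentMetrics:
    name: str
    owner: str  # who gets the alert
    accuracy: float
    p95_latency_ms: float
    cost_per_1k_requests: float


# Hypothetical thresholds for this sketch.
THRESHOLDS = {
    "min_accuracy": 0.90,
    "max_p95_latency_ms": 800.0,
    "max_cost_per_1k_requests": 2.50,
}


def breaches(m: AgentMetrics) -> list:
    """Return a human-readable line for every threshold the agent violates."""
    found = []
    if m.accuracy < THRESHOLDS["min_accuracy"]:
        found.append(f"accuracy {m.accuracy:.2f} below {THRESHOLDS['min_accuracy']:.2f}")
    if m.p95_latency_ms > THRESHOLDS["max_p95_latency_ms"]:
        found.append(f"p95 latency {m.p95_latency_ms:.0f} ms above limit")
    if m.cost_per_1k_requests > THRESHOLDS["max_cost_per_1k_requests"]:
        found.append(f"cost {m.cost_per_1k_requests:.2f} per 1k requests above limit")
    return found


def alerts(registry: list) -> dict:
    """Map 'agent (owner)' to its breaches, skipping healthy agents."""
    return {f"{m.name} ({m.owner})": b for m in registry if (b := breaches(m))}


REGISTRY = [
    AgentMetrics("summarizer", "team-docs", 0.94, 420.0, 1.10),
    AgentMetrics("classifier", "team-risk", 0.85, 950.0, 0.80),
]
open_alerts = alerts(REGISTRY)
```

Recording the owner next to the metrics ties each alert to the lifecycle policy: an alert with no reachable owner is a strong hint that the agent is a sunset candidate.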

Conclusion

Getting multiple AI agents to play nicely at scale is not a one-time setup; it's an ongoing discipline. By defining clear boundaries, implementing robust communication, designing for failure, and continuously monitoring performance, you can build a multi-agent system that is both powerful and resilient. As Chase Roossin and Steven Kulesza highlighted, the challenges are real, but so are the solutions. Start with these ten principles, and you'll be well on your way to orchestrating AI agents that truly work together. For a deeper dive, listen to the full podcast.
