AI agents aren’t failing. The coordination layer is failing

opinion
Apr 10, 2026 · 8 mins

When multiple AI agents compete instead of collaborating, the problem isn’t the agents — it’s the missing spine that connects them.

Credit: Studio Romantic / Shutterstock

Our multi-agent AI system was impressive in demos. One agent handled customer inquiries. Another managed scheduling. A third processed documents. Each worked beautifully in isolation.

In production, they fought each other. The scheduling agent would book appointments while the inquiry agent was still gathering requirements. The document agent processed files using outdated context from a conversation that had moved on two turns ago. End-to-end latency crept from roughly 200 milliseconds to nearly 2.4 seconds as agents waited on each other through ad-hoc API calls that nobody had designed for scale.

The problem was not the agents. Every individual agent performed well within its domain. The problem was the missing coordination infrastructure between them: what I now call the “Event Spine,” a layer that lets agents work as a system rather than as a collection of individuals competing for the same resources.

Among my peers running production AI at enterprise scale across telecommunications and healthcare, the same pattern keeps surfacing. Agent proliferation is real. The tooling to coordinate those agents is not keeping up.

Figure 1: Multi-Agent Coordination — Before (N² connections) vs After (Event Spine)

Sreenivasa Reddy Hulebeedu Reddy

Why direct agent-to-agent communication breaks down

The intuitive approach to multi-agent systems is direct communication. Agent A calls Agent B’s API. Agent B responds. Simple point-to-point integration. It mirrors how most teams build microservices initially, and it works fine when you have two or three agents.

The math stops working quickly. As agent count grows, connection count grows quadratically. Five agents need 10 connections. Ten agents need 45. Twenty agents need 190. Each connection is a potential failure point, a latency source and a coordination challenge that someone must maintain.
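The growth is the handshake formula: n agents with point-to-point links need n(n−1)/2 connections. A few lines make the trend concrete:

```python
def point_to_point_connections(n: int) -> int:
    """Number of distinct pairwise links among n agents (n choose 2)."""
    return n * (n - 1) // 2

for n in (5, 10, 20, 50):
    print(n, point_to_point_connections(n))
# 5 10
# 10 45
# 20 190
# 50 1225
```

At 50 agents you are maintaining over a thousand bilateral contracts, which is why point-to-point integration stops scaling long before the agents themselves do.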

Worse, direct communication creates hidden dependencies. When Agent A calls Agent B, A needs to understand B’s state, availability and current workload. That knowledge couples the agents tightly, defeating the entire purpose of distributing capabilities across specialized components. Change B’s API contract, and every agent that calls B needs updating.

We have seen this movie before. Microservices went through the same evolution — from direct service-to-service calls to message buses, to service meshes. AI agents are following the same trajectory, just compressed into months instead of years.

The Event Spine pattern

The Event Spine is a centralized coordination layer with three properties designed specifically for multi-agent AI systems.

First, ordered event streams. Every agent action produces an event with a global sequence number. Any agent can reconstruct the current system state by reading the event stream. This eliminates the need for agents to query each other directly, which is where the latency was hiding in our system.
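The replay property can be pictured with an append-only log. This is a minimal sketch, not the production design: the EventLog class and the "last writer wins per key" merge rule are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Event:
    seq: int       # global sequence number assigned at append time
    type: str
    payload: dict

class EventLog:
    """Append-only log; 'seq' gives a total order over all agent actions."""
    def __init__(self):
        self.events: list[Event] = []

    def append(self, type: str, payload: dict) -> Event:
        event = Event(seq=len(self.events) + 1, type=type, payload=payload)
        self.events.append(event)
        return event

def reconstruct_state(log: EventLog) -> dict:
    """Rebuild current state by replaying events in sequence order."""
    state: dict = {}
    for event in sorted(log.events, key=lambda e: e.seq):
        state.update(event.payload)  # last writer wins per key (illustrative)
    return state

log = EventLog()
log.append("inquiry.update", {"phone": "555-0100"})
log.append("inquiry.update", {"phone": "555-0199"})  # customer's correction
print(reconstruct_state(log))  # {'phone': '555-0199'}
```

Because any agent can derive current state from the log alone, no agent ever needs to ask another agent "what do you know right now?"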

Second, context propagation. Each event carries a context envelope that includes the originating user request, current session state and any constraints or deadlines. When an agent receives an event, it has the full picture without making additional calls. In our previous architecture, agents were making three to five round-trip calls just to assemble enough context to act on a single request.

Third, coordination primitives. The spine provides built-in support for common patterns: sequential handoffs between agents, parallel fan-out with aggregation, conditional routing based on confidence scores and priority preemption when urgent requests arrive. These patterns would otherwise need to be implemented independently by each agent pair, duplicating logic and introducing inconsistency.

from collections import defaultdict
from dataclasses import dataclass
import time

@dataclass
class Event:
    """One agent action on the spine, with its global ordering position."""
    seq: int
    type: str
    payload: dict
    context: dict
    timestamp: float

class EventSpine:
    def __init__(self):
        self.sequence = 0  # global sequence number across all agents
        self.subscribers = defaultdict(list)

    def publish(self, event_type, payload, context):
        self.sequence += 1
        event = Event(
            seq=self.sequence,
            type=event_type,
            payload=payload,
            context=context,
            timestamp=time.time(),
        )
        for handler in self.subscribers[event_type]:
            handler(event)
        return event

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)
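Of the coordination primitives above, parallel fan-out with aggregation is the least obvious. Here is a minimal synchronous sketch; the MiniSpine class, the event name `route.score` and the scorer lambdas are illustrative assumptions, not the production implementation.

```python
from collections import defaultdict

class MiniSpine:
    """Toy synchronous bus, used only to illustrate fan-out with aggregation."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Deliver to every subscriber and collect their answers
        return [handler(payload) for handler in self.subscribers[event_type]]

def fan_out(spine, event_type, payload, aggregate=max):
    """Send one event to all subscribers, then aggregate the responses."""
    results = spine.publish(event_type, payload)
    return aggregate(results)

spine = MiniSpine()
# Two hypothetical scoring agents subscribed to the same request type
spine.subscribe("route.score", lambda payload: 0.4)
spine.subscribe("route.score", lambda payload: 0.9)
print(fan_out(spine, "route.score", {"text": "reschedule my visit"}))  # 0.9
```

In a real system the aggregation step would also handle partial failures and deadlines; the point is that the pattern lives in the spine once, not in every agent pair.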
Figure 2: Event Spine Architecture — Request Flow with Ordered Events and Context Propagation


Three problems the Event Spine solves

Problem one: race conditions between agents. Without coordination, our scheduling agent would book meetings before the inquiry agent had finished collecting requirements. Customers received calendar invitations for appointments that were missing critical details. The Event Spine solved this by enforcing sequential processing for dependent operations. The scheduling agent subscribes to requirement-complete events and only acts after receiving confirmation that the inquiry agent has gathered everything needed.

Problem two: context staleness. Agents making decisions based on outdated information was our second most common failure mode. A customer would correct their phone number during a conversation, but the document agent — which had pulled context three turns earlier — would generate paperwork with the old number. The Event Spine solved this by attaching the current context envelope to every event. When the inquiry agent publishes an update, the attached context reflects the latest state. Downstream agents never operate on stale data.
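The envelope mechanism amounts to snapshotting session state at publish time rather than at some earlier fetch. A minimal sketch, with `make_envelope` and its fields as illustrative assumptions:

```python
import time

def make_envelope(session_state, request, deadline_s=None):
    """Snapshot the *current* session state into the event's context envelope."""
    return {
        "request": request,
        "session": dict(session_state),  # copy, so later edits don't leak in
        "deadline": deadline_s,
        "created_at": time.time(),
    }

session = {"phone": "555-0100"}
stale = make_envelope(session, {"action": "generate_docs"})
session["phone"] = "555-0199"  # customer corrects their number mid-conversation
fresh = make_envelope(session, {"action": "generate_docs"})

print(stale["session"]["phone"], fresh["session"]["phone"])  # 555-0100 555-0199
```

A downstream agent that acts only on the envelope attached to the event it received, rather than on data it fetched turns earlier, cannot use the stale number.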

Problem three: cascading failures. When one agent in a direct-call chain fails, downstream agents either hang waiting for a response or fail themselves. A single document processing timeout would cascade into scheduling failures and inquiry timeouts. The Event Spine introduced dead-letter queues, timeout policies and fallback routing. When the document processing agent experienced latency spikes, events automatically rerouted to a simplified fallback handler that queued work for later processing instead of failing the entire pipeline.
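The fallback mechanics can be sketched as a handler wrapper plus a dead-letter queue. This is a simplification under stated assumptions: the names are hypothetical, and the timeout here is checked after the handler returns rather than preemptively, as a real spine would do.

```python
import time
from collections import deque

DEAD_LETTERS = deque()  # work queued for later processing instead of failing

def dead_letter(event, reason):
    """Fallback route: park the event rather than cascade the failure."""
    DEAD_LETTERS.append({**event, "fallback_reason": reason})
    return None

def with_timeout(handler, timeout_s, fallback):
    """Wrap a handler; on error or overrun, reroute instead of propagating."""
    def wrapped(event):
        start = time.monotonic()
        try:
            result = handler(event)
        except Exception:
            return fallback(event, reason="error")
        if time.monotonic() - start > timeout_s:
            return fallback(event, reason="timeout")
        return result
    return wrapped

def slow_document_handler(event):
    time.sleep(0.05)  # simulated latency spike in document processing
    return "processed"

handler = with_timeout(slow_document_handler, timeout_s=0.01, fallback=dead_letter)
handler({"doc_id": 42})
print(len(DEAD_LETTERS))  # 1 — the event was rerouted, not dropped
```

The inquiry and scheduling pipelines never see the document agent's slow day; they see either a result or a dead-lettered event they can retry.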

Figure 3: Three Problems the Event Spine Solves — Before and After Comparison


Results

The combined impact reshaped our system’s performance profile. End-to-end latency dropped from approximately 2.4 seconds back to roughly 180 milliseconds. The improvement came primarily from eliminating the cascading round-trip calls between agents. Instead of five agents making point-to-point requests to build context, each agent receives exactly what it needs through the event stream.

Agent-related production incidents dropped 71 percent in the first quarter after deployment. Most of the eliminated incidents were race conditions and stale context bugs that are structurally impossible with event-based coordination.

Agent CPU utilization decreased approximately 36 percent because agents stopped performing redundant work. In the old architecture, multiple agents would independently fetch the same customer data from shared services. With context propagation through the spine, that data is fetched once and shared through the event envelope.

Duplicate processing was eliminated entirely. Our previous architecture had no reliable way to detect when two agents were acting on the same request simultaneously. The Event Spine’s global sequence numbering provides natural deduplication.

Developer productivity improved as well, though this is harder to quantify. Adding a new agent to the system now requires subscribing to relevant events rather than integrating point-to-point with every existing agent. Our most recent agent addition took two days from prototype to production, compared to the two weeks our previous additions required.
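Sequence-number deduplication can be as simple as a consumer-side seen-set. A minimal sketch (the DedupingConsumer class is illustrative, not the production code):

```python
class DedupingConsumer:
    """Skips events already seen, keyed by the spine's global sequence number."""
    def __init__(self):
        self.processed = set()
        self.handled = []

    def handle(self, event):
        if event["seq"] in self.processed:
            return False  # duplicate delivery: ignore it
        self.processed.add(event["seq"])
        self.handled.append(event)
        return True

consumer = DedupingConsumer()
event = {"seq": 101, "type": "scheduling.request"}
print(consumer.handle(event), consumer.handle(event))  # True False
```

Because every event carries exactly one global sequence number, "have I already acted on this?" becomes a set lookup rather than a cross-agent negotiation.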

# Example: Sequential handoff with fallback
class AgentCoordinator:
    def __init__(self, spine: EventSpine):
        self.spine = spine
        spine.subscribe('inquiry.complete', self.on_inquiry_complete)
        spine.subscribe('scheduling.complete', self.on_scheduling_complete)
        spine.subscribe('agent.timeout', self.on_agent_timeout)

    def on_inquiry_complete(self, event):
        # Sequential: scheduling only starts after inquiry
        self.spine.publish(
            'scheduling.request',
            payload=event.payload,
            context=event.context  # Fresh context propagated
        )

    def on_scheduling_complete(self, event):
        # Hand off to downstream document processing (handler body illustrative)
        self.spine.publish(
            'document.request',
            payload=event.payload,
            context=event.context
        )

    def on_agent_timeout(self, event):
        # Fallback: route to dead-letter queue
        self.spine.publish(
            'fallback.process',
            payload=event.payload,
            context={**event.context, 'fallback_reason': 'timeout'}
        )

What this means for enterprises scaling AI agents

Multi-agent AI is not a future concept. It is a present reality in any enterprise running more than two AI-powered capabilities. If you have a chatbot, a recommendation engine and a document processor, you have a multi-agent system, whether you designed it as one or not.

The coordination challenge will only grow as agent counts increase. Every enterprise I speak with is adding AI capabilities faster than they are adding coordination infrastructure. That gap is where production incidents live.

The Event Spine pattern provides the architectural foundation that prevents agent proliferation from becoming agent chaos. It is the same lesson the industry learned with microservices a decade ago: distributed systems need explicit coordination infrastructure, not just well-intentioned API contracts between teams.

The enterprises that will scale AI successfully are the ones investing in coordination architecture now, before the complexity becomes unmanageable. The ones that wait will eventually build it anyway — just under more pressure, with more incidents and at higher cost.

This article is published as part of the Foundry Expert Contributor Network.


Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer at AT&T Services Inc., where he architects AI-driven customer interaction platforms. His work focuses on the practical challenges of deploying AI at enterprise scale — balancing performance, cost and reliability for systems serving millions of users.

With over 13 years of experience spanning telecommunications, healthcare and financial services, Sreenivasa has led teams through complex technical transformations, including migrations to cloud-native architectures and the integration of large language models into production workflows. He also serves as a peer reviewer for Wiley-IEEE Press and Manning Publications and is an IEEE senior member.
