The Multi-Agent RAG Problem Nobody Talks About

When multiple agents share a retrieval pipeline, they retrieve the same chunks, contaminate each other's context, and produce one-sided answers. Here's the coordination problem underneath.

Zaher Khateeb
7 min read

You have three agents working on a research question: a researcher, an analyst, and a writer. The researcher retrieves context from your knowledge base. The analyst retrieves context from the same knowledge base. They both run the same vector search against the same query.

They get the same five chunks.

The researcher summarizes those chunks. The analyst analyzes those same chunks from a different angle. The writer synthesizes two perspectives that are really one perspective, dressed up differently. Your multi-agent system just spent 3x the tokens to produce a single-agent answer.

This isn't a retrieval problem. Your embedding model is fine. Your chunking strategy is fine. The problem is that nobody designed the coordination between retrieval and multi-agent execution. Each agent treats retrieval as a private, independent operation — and when multiple agents independently retrieve against the same query, they converge on the same context and produce redundant work.

I've been building multi-agent systems long enough to know where the real bottlenecks hide. They're rarely in the model. They're in the coordination — and RAG pipelines have a coordination problem that most teams don't notice until they're debugging why their three-agent system produces answers no better than one agent working alone.

The Context Collision Problem

Standard RAG has a well-understood architecture: chunk your documents, embed them, retrieve by similarity, stuff into a prompt. It works for single-agent systems. The moment you add a second agent, three failure modes emerge that don't exist in single-agent RAG.

Redundant retrieval. Two agents querying the same vector store with the same (or similar) queries retrieve near-identical chunk sets. This is by design — vector similarity is deterministic. Given the same query embedding, you get the same nearest neighbors. But it means your multi-agent system is doing redundant work: multiple agents processing the same information and producing overlapping insights.

Context contamination. In systems with shared memory or blackboard patterns, Agent A's retrieved context bleeds into Agent B's reasoning. If Agent A retrieves five chunks about mesh topology benefits and writes them to shared state, Agent B — tasked with finding counterarguments — now has a context window primed toward mesh topology benefits. The shared context biases subsequent retrieval and reasoning.

Coverage gaps. The most insidious failure. When all agents converge on the same high-similarity chunks, the slightly less similar but critically relevant chunks go unretrieved. A query about "mesh vs hub-and-spoke topology" might surface five chunks about mesh — because your knowledge base has more mesh content — while the hub-and-spoke perspective sits at rank 6-10, just below the top-k cutoff. Every agent misses it independently, and no agent is tasked with noticing the gap.

[Figure: naive shared retrieval (both agents get the same 5 chunks and produce a one-sided answer) vs coordinated scoped retrieval (a query decomposer assigns different sub-queries to each agent, producing complementary context and a balanced comparison)]

These failures are invisible in evaluation. Precision@5 measures whether your top 5 chunks are relevant — it doesn't measure whether your top 5 chunks across all agents provide diverse coverage. You can have perfect per-agent retrieval metrics and still produce a multi-agent system that's no better than a single agent.

Why This Happens

The root cause is a mismatch between retrieval architecture and execution architecture.

RAG was designed for single-agent use: one query, one retrieval, one generation. The entire pipeline assumes a single consumer of the retrieved context. Cosine similarity, top-k selection, reranking — all optimized for a single pass.

Multi-agent systems assume a different execution model: multiple agents with different roles, different information needs, and different perspectives. But when these agents share a retrieval pipeline that was designed for single use, the pipeline can't serve their distinct needs. It gives every agent the same answer because it was asked the same question.

The fix isn't better embeddings or fancier reranking. It's coordination between the retrieval step and the agent orchestration step.

Three Patterns That Actually Work

Pattern 1: Query Decomposition Before Retrieval

The simplest and most effective pattern. Before any agent retrieves, a decomposition step breaks the query into sub-queries scoped to each agent's role.

# Instead of: every agent retrieves against the same query
query = "Compare fault tolerance of mesh vs hub-and-spoke"

# Decompose into role-scoped sub-queries
sub_queries = {
    "researcher": "mesh network fault tolerance mechanisms and resilience",
    "analyst": "hub-and-spoke topology failure modes and single points of failure",
    "writer": "tradeoff comparison frameworks for network topologies",
}

# Each agent retrieves against its own sub-query.
# (Wrapped in an async entry point, since agent.run is awaited.)
async def run_pipeline() -> None:
    for agent_role, sub_query in sub_queries.items():
        context = retriever.retrieve(sub_query, top_k=5)
        await agents[agent_role].run(context=context)

This is embarrassingly simple, and it solves the redundancy problem entirely. Each agent gets different chunks because each agent asks a different question. The decomposition can be done by an LLM ("break this query into three sub-queries for a researcher, analyst, and writer") or by deterministic rules based on agent roles.

The key insight: the decomposition happens before retrieval, not after. If you decompose after retrieval (i.e., give all agents the same chunks and ask them to focus on different aspects), you're still limited by the chunks you retrieved. If the hub-and-spoke perspective wasn't in the top-k, no amount of agent-level decomposition will find it.
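The deterministic-rules variant of decomposition can be sketched in a few lines. This is a minimal illustration, not a library API: `ROLE_TEMPLATES` and `decompose` are hypothetical names, and real role templates would be tuned to your agents.

```python
# Rule-based decomposition: each agent role gets a query template.
# ROLE_TEMPLATES and decompose() are illustrative names, not a real API.
ROLE_TEMPLATES = {
    "researcher": "{topic} supporting mechanisms and evidence",
    "analyst": "{topic} failure modes and counterarguments",
    "writer": "{topic} comparison frameworks and tradeoffs",
}

def decompose(topic: str) -> dict[str, str]:
    """Expand one topic into role-scoped sub-queries."""
    return {role: tpl.format(topic=topic) for role, tpl in ROLE_TEMPLATES.items()}

subs = decompose("mesh vs hub-and-spoke fault tolerance")
# Each role now retrieves against a distinct query string.
```

The rule-based version is cheaper and fully deterministic; the LLM version adapts better to queries your templates didn't anticipate.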

Pattern 2: Retrieved-Set Deduplication

When agents must retrieve against similar queries (because their tasks genuinely overlap), deduplicate at the coordination layer.

# Track what's been retrieved across all agents
retrieved_ids: set[str] = set()
 
async def coordinated_retrieve(query: str, top_k: int = 5) -> list[Chunk]:
    # Retrieve more than needed
    candidates = retriever.retrieve(query, top_k=top_k * 3)
 
    # Filter out chunks already claimed by other agents
    fresh = [c for c in candidates if c.id not in retrieved_ids]
 
    # Claim the chunks we're using
    selected = fresh[:top_k]
    retrieved_ids.update(c.id for c in selected)
 
    return selected

Agent A gets the top 5 chunks. Agent B gets chunks 6-10 (the next most relevant ones that Agent A didn't claim). This forces coverage diversity — each agent sees different parts of the knowledge base, even when their queries overlap.

The tradeoff: Agent B's chunks are less similar to the original query than Agent A's. But in multi-agent systems, breadth of coverage usually matters more than marginal similarity gains. A chunk at 0.78 similarity that provides new information is more valuable than a chunk at 0.85 similarity that repeats what another agent already found.

Pattern 3: Retrieval-Aware Agent Protocols

The most sophisticated pattern. Define the retrieval strategy as part of the agent coordination protocol — not as an independent step each agent runs on its own.

With agenticraft-foundation, you can model a multi-agent RAG pipeline as a formal protocol and verify that the retrieval coordination is correct before running it:

from agenticraft_foundation import (
    process, event, parallel, verify_deadlock_free,
)
 
# Define the protocol events
decompose = event("decompose")
retrieve_researcher = event("retrieve_researcher")
retrieve_analyst = event("retrieve_analyst")
synthesize = event("synthesize")
 
# Decomposer controls the sequence
decomposer = process(decompose >> retrieve_researcher
    >> retrieve_analyst >> synthesize)
 
# Each agent waits for decompose, then retrieves independently
researcher = process(decompose >> retrieve_researcher)
analyst = process(decompose >> retrieve_analyst)
 
# Compose: synchronize on decompose, scope retrievals per agent
system = parallel(
    decomposer,
    parallel(researcher, analyst, sync_on={decompose}),
    sync_on={decompose, retrieve_researcher, retrieve_analyst, synthesize},
)
 
# Prove: no deadlocks, synthesis always reachable
assert verify_deadlock_free(system)

This proves that the protocol — decompose first, then scoped retrieval, then synthesis — can't deadlock. The decompose event synchronizes all agents before retrieval begins. Retrieval events are scoped to each agent (no collision). Synthesis happens only after both retrievals complete.

The value isn't just verification. Writing the protocol forces you to think about retrieval coordination as an explicit design decision. Most multi-agent RAG bugs exist because retrieval coordination was never designed — it just happened implicitly, and nobody noticed the collision until the outputs were one-sided.

Honest Caveat

Formal verification proves the protocol is correct — that decomposition happens before retrieval, that synthesis waits for all agents. It doesn't prove that the sub-queries are good or that the retrieved chunks are relevant. Those are quality problems, not coordination problems. Both matter.

Reranking Changes the Equation

Multi-agent RAG shifts the reranking problem. In single-agent RAG, you rerank to find the most relevant chunks for one consumer. In multi-agent RAG, you rerank to find the most complementary chunks across consumers.

A cross-encoder that scores (query, chunk) pairs independently can't do this. It doesn't know what other agents have already retrieved. Two approaches work:

Marginal relevance reranking. After each agent's retrieval, rerank the candidate set with a diversity penalty for chunks similar to what other agents already have. This is Maximal Marginal Relevance (MMR) applied across agents instead of within a single retrieval.
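Cross-agent MMR can be sketched as a greedy loop that penalizes candidates similar to chunks other agents already claimed. This is a minimal sketch under stated assumptions: `cross_agent_mmr` and its tuple-based candidate format are hypothetical, and a real system would operate on your chunk and embedding types.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cross_agent_mmr(
    candidates: list[tuple[str, list[float], float]],  # (chunk_id, embedding, query relevance)
    taken: list[list[float]],                          # embeddings other agents already hold
    k: int = 5,
    lam: float = 0.7,                                  # weight: relevance vs. diversity
) -> list[str]:
    """Greedy MMR: score = lam * relevance - (1 - lam) * redundancy,
    where redundancy is similarity to any already-claimed chunk."""
    selected: list[str] = []
    pool = list(candidates)
    claimed = list(taken)
    while pool and len(selected) < k:
        def score(c: tuple[str, list[float], float]) -> float:
            _, emb, rel = c
            redundancy = max((cosine(emb, t) for t in claimed), default=0.0)
            return lam * rel - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best[0])
        claimed.append(best[1])   # later picks are also penalized against this one
        pool.remove(best)
    return selected
```

With `lam = 0.7`, a fresh chunk at 0.8 relevance outranks a duplicate at 0.9 relevance, which is exactly the tradeoff Pattern 2 argued for.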

LLM-as-reranker with context. Pass the LLM the query, the candidate chunks, and a summary of what other agents have already retrieved. Ask it to select chunks that complement rather than duplicate. More expensive, but significantly better at ensuring coverage.
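The prompt-assembly half of the LLM-as-reranker approach might look like this. The wording and `build_rerank_prompt` name are illustrative assumptions; adapt the instructions and output format to your model.

```python
def build_rerank_prompt(
    query: str,
    candidates: list[str],
    already_retrieved: list[str],
) -> str:
    """Assemble a reranking prompt that tells the LLM what other agents
    already have, so it can select complementary rather than duplicate chunks.
    The prompt text is illustrative, not a tested template."""
    lines = [
        f"Query: {query}",
        "",
        "Chunks other agents have already retrieved (do NOT duplicate these):",
        *[f"- {s}" for s in already_retrieved],
        "",
        "Candidate chunks:",
        *[f"[{i}] {c}" for i, c in enumerate(candidates)],
        "",
        "Select the candidate indices that complement the already-retrieved chunks.",
        "Return a JSON list of indices, e.g. [0, 2].",
    ]
    return "\n".join(lines)
```

Summarizing the already-retrieved chunks (rather than passing them verbatim) keeps the reranking call cheap as the number of agents grows.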

The practical takeaway: if you're running multi-agent RAG, your reranker needs to be coordination-aware. A reranker that optimizes per-agent relevance independently will converge on the same chunks for every agent.

When Single-Agent RAG Is Better

Multi-agent RAG adds coordination overhead. It's not always worth it.

Single-agent RAG wins when:

  • The query is simple and factual ("What is the default timeout for OpenAI?")
  • One chunk contains the full answer
  • Speed matters more than coverage
  • Your knowledge base is small enough that top-5 captures most relevant information

Multi-agent RAG wins when:

  • The query requires multiple perspectives ("Compare X and Y")
  • The answer spans multiple documents or topics
  • Thoroughness matters more than speed
  • Different agents have genuinely different information needs

The decision shouldn't be static. A well-designed system assesses query complexity at runtime — simple queries take the fast single-agent path, complex queries get coordinated multi-agent retrieval. You pay the coordination cost only when the query demands it.
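A runtime router can be as simple as a few cheap heuristics in front of the pipeline. This is a sketch: the marker list and word-count threshold are illustrative assumptions, and production systems often replace them with a small LLM classifier.

```python
# Heuristic runtime router: cheap signals decide single- vs multi-agent retrieval.
# COMPARATIVE_MARKERS and the thresholds are illustrative, not tuned values.
COMPARATIVE_MARKERS = ("compare", " vs ", "versus", "tradeoff", "pros and cons")

def needs_multi_agent(query: str) -> bool:
    q = query.lower()
    if any(m in q for m in COMPARATIVE_MARKERS):
        return True   # comparative queries require multiple perspectives
    if len(q.split()) > 20:
        return True   # long, multi-part questions tend to span documents
    return False      # default: fast single-agent path

def route(query: str) -> str:
    return "multi-agent" if needs_multi_agent(query) else "single-agent"
```

The point is structural: the router runs before any retrieval, so simple queries never pay the decomposition and coordination cost.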

What the Tutorials Don't Tell You

Token cost multiplies. If three agents each retrieve 5 chunks and stuff them into separate prompts, you're paying for 15 chunks of context across 3 LLM calls. With coordinated retrieval, the same information coverage might require 10 unique chunks across 3 calls — 33% less context cost because there's no duplication.

Evaluation needs to be system-level. Per-agent precision@5 can be perfect while system-level coverage is terrible. Add a coverage metric: across all agents, what fraction of the relevant documents in your knowledge base were retrieved? If three agents collectively retrieved from only 5 unique documents when 12 were relevant, your per-agent metrics are misleading.
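The coverage metric described above is a one-liner once you track retrieval per agent. A minimal sketch, assuming document-level relevance judgments are available:

```python
def system_coverage(
    retrieved_by_agent: dict[str, set[str]],  # agent -> retrieved document ids
    relevant: set[str],                       # relevant document ids for the query
) -> float:
    """Fraction of relevant documents retrieved by ANY agent (system-level recall)."""
    union = set().union(*retrieved_by_agent.values()) if retrieved_by_agent else set()
    return len(union & relevant) / len(relevant) if relevant else 0.0
```

With the numbers from the paragraph above (5 unique documents retrieved, 12 relevant), coverage is 5/12, roughly 0.42, regardless of how good each agent's precision@5 looks in isolation.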

Chunk size is a per-agent decision. A researcher agent benefits from larger chunks (more context per retrieval). An analyst agent benefits from smaller chunks (more precise, less noise). A single chunk size applied across all agents is a compromise that serves none of them well.

Shared memory is a double-edged sword. Writing retrieved chunks to a shared blackboard lets agents build on each other's findings. It also introduces ordering effects — the first agent to write shapes the context for every subsequent agent. If your system isn't deterministic about agent execution order, the same query can produce different answers on different runs.


The multi-agent RAG problem isn't about retrieval quality. It's about retrieval coordination. When multiple agents independently retrieve from the same knowledge base, they converge on the same context and produce redundant work. The fix is designing retrieval as part of the agent protocol — with query decomposition, deduplication, and coordination-aware reranking.

The retrieval pipeline and the coordination protocol are not separate concerns. In multi-agent systems, they're the same concern. Building them independently is how you end up with a three-agent system that produces a one-agent answer.

Zaher Khateeb, Founder & CTO at AgentiCraft

Building the infrastructure layer between AI agent logic and production. Distributed systems, multi-agent coordination, and making unreliable components work together reliably at scale.
