Debugging Multi-Agent AI: A Step-by-Step Guide to Automated Failure Attribution
Overview
Large language model (LLM) multi-agent systems are powerful but notoriously fragile. When a multi-agent task fails, developers face a daunting question: which agent caused the failure, and at what point? Sifting through lengthy interaction logs manually is like hunting for a needle in a haystack—time-consuming and error-prone.

Researchers from Penn State University, Duke University, Google DeepMind, and others have introduced a novel solution: automated failure attribution. They created the first benchmark dataset, Who&When, and developed multiple attribution methods. This tutorial walks you through using their open-source framework to pinpoint failure causes in your own multi-agent systems. By the end, you'll be able to set up, run, and interpret automated attribution to accelerate debugging and improve system reliability.
Prerequisites
Knowledge Requirements
- Familiarity with LLM-based multi-agent architectures (e.g., agent roles, communication loops).
- Basic Python programming (pip, virtual environments, reading code).
- Understanding of model evaluation metrics (precision, recall, accuracy).
Software and Hardware
- Python 3.9+ installed.
- Git for cloning the repository.
- Access to an LLM API (e.g., OpenAI, Anthropic, or local model via Ollama). The framework supports GPT-4, Claude, and others.
- GPU recommended but not required: the attribution methods run via LLM inference, so a hosted API is enough; a GPU only matters if you serve a model locally.
Step-by-Step Instructions
1. Clone the Repository and Set Up Environment
Start by obtaining the official code from GitHub:
git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
cd Agents_Failure_Attribution
Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
The requirements.txt includes libraries for JSON handling, API requests, and basic ML tooling. Ensure your Python version matches the project requirement (3.9+).
2. Understand the Dataset: Who&When
Download the benchmark dataset from Hugging Face:
from datasets import load_dataset
dataset = load_dataset("Kevin355/Who_and_When", split="train")
The dataset contains multi-agent trajectories with labeled failures. Each sample includes:
- Interaction log: Full conversation history among agents.
- Agent roles: e.g., planner, executor, critic.
- Ground truth: Which agent caused the failure (ID) and the timestamp (step number).
The failures are categorized into types: error propagation, miscommunication, incorrect reasoning, and external tool misuse. Spend time exploring a few samples to get familiar with the data format (JSON lines).
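To get oriented, here is a minimal sketch of inspecting one labeled sample. The field names below (conversation, mistake_agent, mistake_step) are illustrative assumptions, so check dataset.features for the actual schema:

```python
# Sketch of exploring one Who&When-style sample. The field names
# (conversation, mistake_agent, mistake_step) are assumptions for
# illustration; inspect dataset.features for the real schema.

sample = {
    "conversation": [
        {"agent": "planner", "step": 0, "content": "Plan: add the numbers."},
        {"agent": "executor", "step": 1, "content": "Sum: 3+5=7"},
    ],
    "mistake_agent": "executor",   # hypothetical ground-truth field
    "mistake_step": 1,             # hypothetical ground-truth field
}

def summarize(sample):
    """Return a one-line summary of a labeled failure trajectory."""
    n = len(sample["conversation"])
    return (f"{n} messages; ground truth blames "
            f"{sample['mistake_agent']} at step {sample['mistake_step']}")

print(summarize(sample))
```

Printing a handful of these summaries is usually enough to internalize the log structure before running attribution.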
3. Choose an Attribution Method
The framework implements four methods:
- Trace-back: Replays the log and flags the first deviation from expected output.
- Critic LLM: Uses a separate LLM to analyze the log and assign blame.
- Contrastive Attribution: Replaces each agent’s output with a correct version and measures impact on final outcome.
- Counterfactual Reasoning: Simulates alternative decisions to identify what changed the result.
For this guide, we'll use the Critic LLM method because it's straightforward and doesn't require multiple simulations.
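The core of the Critic LLM method is a single prompt that hands the critic the full log and asks for a verdict. The sketch below shows one way such a prompt could be assembled; the wording is an assumption, not the framework's exact template:

```python
def build_critic_prompt(conversation, task):
    """Format the full log into one prompt asking a critic LLM to name
    the faulty agent and step. The prompt wording is an illustrative
    assumption, not the framework's actual template."""
    log = "\n".join(
        f"[step {m['step']}] {m['agent']}: {m['content']}" for m in conversation
    )
    return (
        f"Task: {task}\n\nConversation log:\n{log}\n\n"
        "The task failed. Reply with JSON: "
        '{"blamed_agent": ..., "blamed_step": ..., "explanation": ...}'
    )

conversation = [
    {"agent": "planner", "step": 0, "content": "We need to calculate sum."},
    {"agent": "executor", "step": 1, "content": "Sum: 3+5=7"},
]
prompt = build_critic_prompt(conversation, "compute addition")
print(prompt)
# Send `prompt` to your configured LLM and parse the JSON reply.
```

Because the critic sees the whole log in one call, this method needs no re-execution of the agents, which is exactly why it is the cheapest starting point.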
4. Configure the Environment Variables
Create a .env file (or export directly) with your API keys:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
LLM_MODEL=gpt-4 # or claude-3-5-sonnet-20241022
If using a local model (e.g., Llama 3 via Ollama), set the endpoint accordingly.
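A minimal config loader along these lines can catch missing keys early. The variable names follow the .env example above; the fallback logic is an illustrative assumption rather than the framework's actual behavior:

```python
import os

# Minimal config check; variable names follow the .env example above.
# For a local endpoint (e.g. Ollama), pointing the client at a custom
# base URL is a common pattern, but check the framework's own config.
def load_llm_config():
    model = os.environ.get("LLM_MODEL", "gpt-4")
    key = os.environ.get("OPENAI_API_KEY") or os.environ.get("ANTHROPIC_API_KEY")
    if key is None:
        raise RuntimeError("No API key found in environment")
    return {"model": model, "api_key": key}

os.environ.setdefault("OPENAI_API_KEY", "sk-example")  # placeholder for illustration
print(load_llm_config()["model"])
```

Failing fast here beats a cryptic authentication error halfway through a batch run.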
5. Run Attribution on a Single Trajectory
Use the provided script run_attribution.py:
python run_attribution.py --method critic --input sample_log.json --output attribution_result.json
Input file format: a JSON object with fields conversation (list of messages) and agents (list of agent IDs). Example snippet:
{
  "conversation": [
    {"agent": "planner", "step": 0, "content": "We need to calculate sum."},
    {"agent": "executor", "step": 1, "content": "Sum: 3+5=7"},
    ...
  ],
  "task": "compute addition",
  "final_result": "incorrect"
}
Note that the executor's message at step 1 contains the arithmetic error (3+5 is 8, not 7). JSON does not support comments, so keep such annotations out of real input files.
The script outputs a JSON:
{
  "blamed_agent": "executor",
  "blamed_step": 1,
  "explanation": "Executor provided wrong arithmetic; planner's instruction was correct."
}
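In practice you will want to jump straight from the verdict to the offending message. A small sketch, assuming the input format from step 5:

```python
import json

# Parse the attribution output and locate the blamed message in the log.
result_json = '''{
  "blamed_agent": "executor",
  "blamed_step": 1,
  "explanation": "Executor provided wrong arithmetic; planner's instruction was correct."
}'''

result = json.loads(result_json)
conversation = [
    {"agent": "planner", "step": 0, "content": "We need to calculate sum."},
    {"agent": "executor", "step": 1, "content": "Sum: 3+5=7"},
]
blamed = next(m for m in conversation if m["step"] == result["blamed_step"])
print(f"Blamed: {result['blamed_agent']} -> {blamed['content']}")
```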
6. Evaluate Attribution Accuracy
To measure performance on the benchmark, run:
python evaluate.py --method critic --dataset who_and_when
This prints metrics: Agent Accuracy (correct agent), Step Accuracy (correct step within ±1), and Combined Accuracy (both agent and step). Compare different methods to choose the best for your use case.
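The three metrics can be reproduced in a few lines. This sketch assumes predictions and ground truth are (agent, step) pairs and implements the ±1 step tolerance described above:

```python
# Sketch of the three reported metrics, assuming predictions and ground
# truth are lists of (agent, step) pairs. Step accuracy uses the +/-1
# tolerance described above.

def attribution_metrics(preds, truths):
    agent_ok = step_ok = both_ok = 0
    for (pa, ps), (ta, ts) in zip(preds, truths):
        a = pa == ta              # correct agent
        s = abs(ps - ts) <= 1     # correct step within +/-1
        agent_ok += a
        step_ok += s
        both_ok += a and s
    n = len(truths)
    return {"agent_acc": agent_ok / n, "step_acc": step_ok / n,
            "combined_acc": both_ok / n}

preds  = [("executor", 1), ("planner", 0), ("critic", 4)]
truths = [("executor", 2), ("executor", 0), ("critic", 4)]
print(attribution_metrics(preds, truths))
# agent_acc 2/3, step_acc 1.0, combined_acc 2/3
```

Combined accuracy is the strictest of the three, so expect it to be the lowest number in any comparison table.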
7. Interpret and Act on Results
Once you have attribution results:
- Check the explanation for context; sometimes the blamed agent is downstream of an earlier error. Use the trace-back method as a cross-check.
- Update the agent’s prompt or logic to fix the issue. For example, if the executor miscalculates, add explicit step-by-step reasoning instructions.
- Re-run the task to verify the fix.
Common Mistakes
Ignoring the Temporal Dimension
New users often focus only on which agent, forgetting when. A late mistake may be caused by earlier miscommunication. Always examine the blamed step and the surrounding context. The dataset includes step numbers for a reason—use them.
Using Attribution on Incomplete Logs
If your logs lack full inter-agent dialogue (e.g., only final outputs), attribution methods will be inaccurate. Ensure you capture all messages in real time. The framework expects a chronological list of agent utterances.
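One low-effort safeguard is to record every message as it is produced. A minimal in-memory logger that emits the input format from step 5 might look like this (the class and method names are illustrative, not part of the framework):

```python
import json

# Minimal in-memory message logger: record every inter-agent message as
# it is produced, so attribution later sees the full chronological
# dialogue rather than only final outputs.
class ConversationLogger:
    def __init__(self):
        self.messages = []

    def record(self, agent, content):
        self.messages.append({
            "agent": agent,
            "step": len(self.messages),  # chronological step index
            "content": content,
        })

    def dump(self, task, final_result):
        # Matches the input format shown in step 5.
        return json.dumps({"conversation": self.messages,
                           "task": task, "final_result": final_result})

log = ConversationLogger()
log.record("planner", "We need to calculate sum.")
log.record("executor", "Sum: 3+5=7")
print(log.dump("compute addition", "incorrect"))
```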
Overlooking Tool Interactions
Many multi-agent systems use external tools (calculators, search APIs). If a tool returns an unexpected result, the agent using it may be blamed incorrectly. Attribute tool calls separately if possible; the Contrastive method can help isolate tool vs. agent errors.
Confidence Overreliance
The Critic LLM method outputs a confidence score (0-1). Don't treat borderline scores (e.g., 0.5) as reliable. When confidence is low, use a second method or manual inspection. The benchmark provides a baseline—use it to calibrate your own thresholds.
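A simple confidence gate makes the fallback policy explicit. The 0.75 threshold below is an illustrative starting point, not a value from the paper; calibrate it against the benchmark:

```python
# Confidence gate: accept high-confidence verdicts, escalate borderline
# ones to a second method or manual review. The 0.75 threshold is an
# illustrative starting point; calibrate against the benchmark.

def triage(result, threshold=0.75):
    conf = result.get("confidence", 0.0)  # missing score -> escalate
    if conf >= threshold:
        return "accept"
    return "escalate"  # cross-check with another method or inspect manually

print(triage({"blamed_agent": "executor", "confidence": 0.91}))
print(triage({"blamed_agent": "executor", "confidence": 0.52}))
```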
Summary
Automated failure attribution transforms debugging LLM multi-agent systems from a manual nightmare into a systematic, data-driven process. By leveraging the Who&When dataset and the open-source attribution framework, you can quickly identify which agent caused a failure and at what step. This tutorial covered setup, method selection, execution, evaluation, and common pitfalls. Start by cloning the repo, run attribution on a sample trajectory, and iterate. As you integrate this into your development cycle, you'll reduce downtime and build more robust agent collaborations.
For deeper dives, refer to the original paper (Spotlight at ICML 2025) and the dataset page.