Automated Failure Diagnosis for Multi-Agent Systems: A Step-by-Step Guide

Introduction

Multi-agent systems powered by large language models (LLMs) are increasingly used to tackle complex collaborative tasks, but failures remain common and notoriously difficult to debug. When your multi-agent system malfunctions, you're left sifting through endless interaction logs to determine which agent caused the failure and at what point—a process researchers have dubbed “automated failure attribution.” A team from Penn State, Duke, Google DeepMind, and other institutions developed a benchmark dataset (Who&When) and several automated attribution methods, accepted as a spotlight at ICML 2025. This guide walks you through applying those methods to your own multi-agent system, helping you pinpoint failures quickly and move from manual log archaeology to efficient, data-driven diagnosis.

Source: syncedreview.com

What You Need

  1. A Python environment with pip and git installed
  2. The Hugging Face CLI (huggingface-cli) for downloading the dataset
  3. Interaction logs from your own multi-agent system
  4. Optionally, a GPU if you plan to run the AttributionLM method

Step-by-Step Guide

Step 1: Download the Who&When Dataset and Code

Begin by obtaining the benchmark dataset and open-source code from the official repositories. This ensures you have the right reference data and attribution tools.

  1. Clone the GitHub repository: git clone https://github.com/mingyin1/Agents_Failure_Attribution
  2. Install required dependencies: pip install -r requirements.txt
  3. Download the dataset from Hugging Face: huggingface-cli download Kevin355/Who_and_When --local-dir ./data

Step 2: Understand the Benchmark Structure

The Who&When dataset contains multi-agent interaction logs annotated with ground-truth failure points: which agent was responsible and at which turn the failure originated. Study the dataset structure to map your own logs appropriately.
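Before mapping your own logs, it helps to check a sample record against the expected shape. The validator below is a sketch based on the schema described in Step 3 (task, agents, turns, failure); the downloaded benchmark files may use different field names, so adjust after inspecting them yourself.

```python
EXPECTED_KEYS = {"task", "agents", "turns", "failure"}

def schema_problems(record):
    """Return a list of deviations from the log schema this guide expects.

    Field names here follow Step 3 of this guide; verify them against
    the actual downloaded files, since the benchmark may differ.
    """
    problems = [f"missing key: {k}" for k in sorted(EXPECTED_KEYS - record.keys())]
    for i, turn in enumerate(record.get("turns", [])):
        for field in ("agent", "message"):
            if field not in turn:
                problems.append(f"turn {i} missing '{field}'")
    return problems

# Example: a minimal well-formed record yields no problems.
sample = {
    "task": "demo",
    "agents": ["planner", "solver"],
    "turns": [{"agent": "planner", "message": "Plan the task."}],
    "failure": False,
}
print(schema_problems(sample))  # []
```

Running this over a handful of benchmark files (and later over your own converted logs) catches schema mismatches before the attribution scripts do.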

Step 3: Prepare Your Own Multi-Agent Logs

To diagnose failures in your system, you need to format its logs in the same schema as the benchmark. This ensures the attribution methods can process them.

  1. Export your system’s interaction data: each turn should capture the agent that spoke, the message content, and a timestamp or turn number.
  2. For each task execution, create a JSON object with keys: task, agents (list of agent names), turns (list of turn objects with agent and message), and failure (boolean, set to false for unlabeled data).
  3. Save your logs in a folder (e.g., ./my_logs/).
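The steps above can be sketched as a small conversion script. This assumes your raw logs can be reduced to ordered (agent, message) pairs; the helper names (to_benchmark_record, save_records) are illustrative, not part of the repository.

```python
import json
from pathlib import Path

def to_benchmark_record(task, agents, raw_turns):
    """Package one task execution in the schema listed above.

    raw_turns: ordered (agent_name, message) pairs.
    'failure' stays False because our own logs are unlabeled.
    """
    return {
        "task": task,
        "agents": list(agents),
        "turns": [{"agent": a, "message": m} for a, m in raw_turns],
        "failure": False,
    }

def save_records(records, out_dir="./my_logs"):
    """Write one JSON file per task execution into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, rec in enumerate(records):
        (out / f"log_{i:04d}.json").write_text(json.dumps(rec, indent=2))

# Example conversion of a single three-turn execution.
record = to_benchmark_record(
    task="Summarize the quarterly report",
    agents=["planner", "researcher", "writer"],
    raw_turns=[
        ("planner", "Split the report into sections."),
        ("researcher", "Key figures: revenue up 4%."),
        ("writer", "Draft: revenue grew 4% this quarter."),
    ],
)
```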

Step 4: Run Automated Failure Attribution

Use the provided scripts to apply the attribution method of your choice to your logs. The code includes a command-line interface for each method.

  1. Choose a method: DirectPrompt is lightweight but less accurate; AttributionLM offers the best performance if you can run inference on a GPU.
  2. Run the attribution script:
    python run_attribution.py --method AttributionLM --logs ./my_logs --output ./results
  3. The script returns a JSON file per log with responsible_agent and failure_turn predictions.
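Once the results folder is populated, a small helper can tally predictions across runs so repeat offenders stand out. The field names responsible_agent and failure_turn match the output described above; everything else (paths, function names) is an illustrative assumption.

```python
import json
from collections import Counter
from pathlib import Path

def load_predictions(results_dir="./results"):
    """Read every per-log result file produced by the attribution script."""
    return [json.loads(p.read_text()) for p in sorted(Path(results_dir).glob("*.json"))]

def blame_counts(predictions):
    """Count how often each agent is predicted responsible across runs."""
    return Counter(p["responsible_agent"] for p in predictions)

# Example with inline predictions (normally from load_predictions()).
preds = [
    {"responsible_agent": "researcher", "failure_turn": 3},
    {"responsible_agent": "researcher", "failure_turn": 5},
    {"responsible_agent": "writer", "failure_turn": 2},
]
print(blame_counts(preds).most_common(1))  # [('researcher', 2)]
```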

Step 5: Interpret the Results and Debug

Now you have a clear hypothesis: which agent likely caused the failure and at what step. Use this to focus your debugging efforts.

  1. Check the predicted failure turn in your original logs: examine the messages exchanged around that point.
  2. Verify whether the identified agent made an incorrect reasoning step or misunderstood another agent’s output.
  3. If the prediction seems off, cross-reference with the attribution confidence score (output by the model). Consider re-running with a different method (e.g., TraceEval) to triangulate.
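To inspect the messages around the predicted failure point, slice a window out of the original log. A minimal sketch, assuming turns are 0-indexed in both the log and the prediction; confirm the indexing convention against the benchmark before relying on it.

```python
def failure_context(record, failure_turn, window=2):
    """Return the turns surrounding the predicted failure turn,
    clipped to the bounds of the conversation."""
    turns = record["turns"]
    lo = max(0, failure_turn - window)
    hi = min(len(turns), failure_turn + window + 1)
    return turns[lo:hi]

# Example: a six-turn log with a predicted failure at turn 3.
log = {"turns": [{"agent": f"a{i}", "message": f"msg {i}"} for i in range(6)]}
for turn in failure_context(log, failure_turn=3, window=1):
    print(turn["agent"], "->", turn["message"])
```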

Step 6: Improve Your System Based on Findings

The ultimate goal is to fix the failure and prevent recurrence. The attribution gives you a starting point for system iteration.

