Automated Failure Attribution in LLM Multi-Agent Systems: A Practical Guide Using the Who&When Benchmark
Overview
Large Language Model (LLM) multi-agent systems have become indispensable for tackling complex tasks through collaborative workflows. However, these systems often fail—sometimes unexpectedly—leaving developers to waste hours sifting through logs to pinpoint the responsible agent and the moment of failure. This guide introduces Automated Failure Attribution, a novel approach developed by researchers from Penn State University, Duke University, Google DeepMind, and other institutions. By leveraging the open-source Who&When benchmark and associated attribution methods, you can systematically diagnose failures in multi-agent systems, accelerate debugging, and improve system reliability. This tutorial will walk you through the core concepts, setup, and application of these tools.

Prerequisites
Knowledge Requirements
- Familiarity with Python programming (version 3.8 or higher)
- Basic understanding of LLMs and multi-agent architectures (e.g., agent communication, tool use)
- Experience with common machine learning libraries (PyTorch, Transformers)
Software and Hardware
- Python environment with pip/conda
- Access to an LLM API (e.g., OpenAI, Anthropic) or a local model (e.g., via Ollama) for attribution experiments
- Sufficient GPU memory (16GB+ recommended) if running local models
- Git to clone the repository
Dataset Access
The Who&When dataset is hosted on Hugging Face. You will need an internet connection to download it.
Step-by-Step Instructions
1. Set Up the Environment
Start by cloning the official repository from GitHub, which contains the code, pre-trained models, and evaluation scripts.
git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
cd Agents_Failure_Attribution
Create and activate a virtual environment, then install dependencies:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Note: The requirements.txt includes packages for PyTorch, transformers, datasets, and logging utilities. If you encounter version conflicts, create a fresh environment or use a compatible Python version (e.g., 3.10).
2. Understand the Who&When Dataset
The benchmark contains simulated multi-agent task logs where failures occur. Each log includes:
- Agent interactions: messages, tool calls, and outputs
- Ground-truth labels: which agent caused the failure (Who) and at which step (When)
- Task metadata: task description, success/failure flag
Download the dataset using the Hugging Face datasets library:
from datasets import load_dataset
dataset = load_dataset("Kevin355/Who_and_When", split="train")
print(dataset[0]) # Inspect a sample
The dataset provides separate splits; for quick experiments, work with a small slice (e.g., dataset.select(range(50))) rather than the full set.
3. Implement or Use Pre-Built Attribution Methods
The repository provides several automated attribution methods. Two primary approaches are:
- Log Embedding + Classifier: Encode agent activity logs into vector embeddings, then train a classifier to predict the failing agent and step.
- LLM-based Reasoning: Prompt an LLM to analyze the log and output the attribution directly.
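To make the LLM-based approach concrete, the sketch below builds an attribution prompt from a log. This is an illustrative prompt builder under the log schema used later in this guide, not the repository's actual prompt or format:

```python
def build_attribution_prompt(log, task_description):
    """Build an illustrative prompt asking an LLM to name the failing
    agent and step. Field names follow this guide's log schema and may
    differ from the repository's internal format."""
    lines = [
        "You are debugging a multi-agent system that failed at this task:",
        f"Task: {task_description}",
        "Interaction log:",
    ]
    for entry in log:
        lines.append(
            f"step {entry['step']} | agent {entry['agent_id']} | "
            f"{entry['action']} -> {entry['output']}"
        )
    lines.append(
        'Respond with JSON: {"who": "<agent_id>", "when": <step>} '
        "identifying the first agent and step that deviated."
    )
    return "\n".join(lines)

log = [
    {"agent_id": "A", "step": 0, "action": "receive_task", "output": "ok"},
    {"agent_id": "B", "step": 1, "action": "query_database",
     "output": "Error: timeout"},
]
prompt = build_attribution_prompt(log, "Find the shortest path")
print(prompt)
```

The returned string would then be sent to your chosen LLM, with the JSON reply parsed into the (who, when) prediction.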
To run the LLM-based method with OpenAI's GPT-4:
python run_attribution.py --method llm --model gpt-4 --api_key YOUR_API_KEY
For the embedding method, train and evaluate using:
python run_attribution.py --method embedding --model sentence-transformers/all-mpnet-base-v2
Both scripts will output accuracy metrics (Who accuracy, When accuracy, and combined F1 score).
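You can reproduce the core metrics for your own predictions. The sketch below computes Who accuracy, When accuracy, and an exact-match rate over (who, when) pairs; the scripts' combined F1 may be aggregated differently:

```python
def attribution_accuracy(predictions, labels):
    """Compare predicted (who, when) pairs against ground truth.

    Who/When accuracy score each field independently; "both" requires
    the agent and the step to be correct simultaneously.
    """
    n = len(labels)
    who_acc = sum(p[0] == l[0] for p, l in zip(predictions, labels)) / n
    when_acc = sum(p[1] == l[1] for p, l in zip(predictions, labels)) / n
    both_acc = sum(p == l for p, l in zip(predictions, labels)) / n
    return {"who": who_acc, "when": when_acc, "both": both_acc}

preds = [("A", 3), ("B", 1), ("C", 2)]
gold = [("A", 3), ("B", 2), ("A", 2)]
print(attribution_accuracy(preds, gold))
```

Note that exact-match ("both") is always the strictest of the three, which is why Who and When accuracies are usually reported separately.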
4. Evaluate Performance on Your Own Multi-Agent System
To apply these methods to a custom system, you must log interactions in a compatible format. The expected log structure is a list of dictionaries, each containing:

- agent_id: identifier string
- step: integer index
- action: description of the action (e.g., "called get_weather API", "sent message to agent2")
- output: result or response
- task_context: optional global context
Example:
log = [
{"agent_id": "A", "step": 0, "action": "receive_task", "output": "Find the shortest path"},
{"agent_id": "B", "step": 1, "action": "query_database", "output": "Error: timeout"},
...
]
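Because schema mismatches tend to fail silently (see Common Mistakes below), it is worth validating logs before running attribution. A small checker against this guide's schema; the required fields mirror the list above, not an official specification:

```python
# Required fields and types from this guide's log schema (illustrative).
REQUIRED_FIELDS = {"agent_id": str, "step": int, "action": str, "output": str}

def validate_log(log):
    """Return a list of schema problems; an empty list means the log
    conforms to the expected structure."""
    problems = []
    for i, entry in enumerate(log):
        for field, ftype in REQUIRED_FIELDS.items():
            if field not in entry:
                problems.append(f"entry {i}: missing '{field}'")
            elif not isinstance(entry[field], ftype):
                problems.append(f"entry {i}: '{field}' should be {ftype.__name__}")
    # Steps should already be in ascending order.
    steps = [e["step"] for e in log if isinstance(e.get("step"), int)]
    if steps != sorted(steps):
        problems.append("steps are not in ascending order")
    return problems

good_log = [
    {"agent_id": "A", "step": 0, "action": "receive_task",
     "output": "Find the shortest path"},
    {"agent_id": "B", "step": 1, "action": "query_database",
     "output": "Error: timeout"},
]
print(validate_log(good_log))   # empty list: log conforms
print(validate_log([{"agent_id": "A", "step": "0", "action": "x", "output": "y"}]))
```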
Once you have logs, you can use the provided attribution_pipeline.py:
from attribution_pipeline import predict_failure
result = predict_failure(log, method="llm", model="gpt-4")
print(f"Failing agent: {result['who']}, failure step: {result['when']}")
5. Interpret and Act on Results
The attribution output indicates which agent first deviated from the correct path and when. Use this information to:
- Review the specific agent’s logic or configuration at that step
- Improve inter-agent communication protocols
- Re-train or fine-tune the offending agent model
For example, if agent B fails at step 3 due to an API timeout, you might add retry logic or enhance the agent’s error handling.
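For the timeout scenario above, one common remediation is a retry wrapper with exponential backoff around the agent's tool call. This is a minimal, generic sketch (the flaky_query function simulates a tool that times out twice), not code from the repository:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn(), retrying on exceptions with exponential backoff.
    Re-raises the last exception if all attempts fail."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def flaky_query():
    """Simulated agent tool call: times out twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("query_database timed out")
    return "result"

print(with_retries(flaky_query))
```

In a real agent, the wrapper would sit around the specific tool call the attribution pointed to, leaving the rest of the agent's logic untouched.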
Common Mistakes
- Ignoring task context: Attribution methods perform best when the log includes the original task description. Without it, the model may misinterpret actions.
- Normalizing logs inconsistently: Ensure all logs use the same schema. Missing fields or inconsistent data types can cause embedding methods to fail silently.
- Overusing brute force methods: Manually reading entire logs is error-prone and slow. Rely on automated attribution for systematic debugging.
- Forgetting to set API keys: The LLM-based method requires a valid API key. Set it via the OPENAI_API_KEY environment variable or the command-line argument.
- Using imbalanced datasets: If your own data has many successful runs but few failures, consider oversampling failure cases when training an embedding classifier.
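The class-imbalance point can be handled by duplicating failure examples before training the classifier. A minimal random-oversampling sketch; the boolean "failed" field is illustrative, not part of the Who&When schema:

```python
import random

def oversample_failures(records, label_key="failed", seed=0):
    """Duplicate failure records (sampled with replacement) until the
    two classes are the same size. `label_key` names a boolean field;
    the field name here is illustrative."""
    rng = random.Random(seed)
    failures = [r for r in records if r[label_key]]
    successes = [r for r in records if not r[label_key]]
    if not failures or not successes or len(failures) >= len(successes):
        return list(records)
    extra = [rng.choice(failures) for _ in range(len(successes) - len(failures))]
    return records + extra

data = [{"failed": False}] * 8 + [{"failed": True}] * 2
balanced = oversample_failures(data)
print(sum(r["failed"] for r in balanced), "failures of", len(balanced), "records")
```

Plain oversampling is the simplest option; class weights in the classifier's loss are a common alternative when duplicating logs is undesirable.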
Summary
Automated failure attribution addresses the critical challenge of diagnosing errors in LLM multi-agent systems. By using the Who&When benchmark and the open-source tools described, developers can quickly identify the culprit agent and failure step, drastically reducing debugging time. This guide walked you through environment setup, dataset understanding, method implementation (embedding- and LLM-based), evaluation on custom logs, and interpretation of results. Adopting these techniques will make your multi-agent systems more reliable and your development cycle more efficient. For further details, refer to the original paper and the GitHub repository.