Automating Intellectual Toil: How I Built eval-agents with GitHub Copilot

As an AI researcher on Microsoft's Copilot Applied Science team, I found myself repeatedly analyzing hundreds of thousands of lines of agent trajectory data. This repetitive intellectual work begged for automation. Using GitHub Copilot, I created a tool called eval-agents that not only automated my own analysis but also empowered my entire team to do the same. The questions below walk through that journey.

What motivated the creation of eval-agents?

As an AI researcher, I analyze coding agent performance using standardized evaluations like TerminalBench2 or SWEBench-Pro. Each evaluation run produces trajectories: detailed .json files that record the agent's thought processes and actions for every task. With dozens of tasks per benchmark and multiple runs per day, I was drowning in hundreds of thousands of lines of JSON. I had to hunt for patterns in this data by hand, which was both tedious and time-consuming. The engineer in me saw a repetitive mental task ripe for automation. I wanted to eliminate the intellectual toil of pattern-spotting and let agents do the heavy lifting. That desire sparked eval-agents, a system that automates the analysis of agent performance and frees me for more creative work.
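To make the data concrete, here is a minimal sketch of loading one run's trajectories. The schema (a `task_id`, a `resolved` flag, and a `steps` list with `thought` and `action` fields) is an assumption for illustration, and so are the names `Trajectory`, `Step`, and `load_run`. This article does not describe the actual trajectory format.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Step:
    thought: str  # what the agent said it was thinking (assumed field)
    action: str   # what the agent did next (assumed field)


@dataclass
class Trajectory:
    task_id: str
    resolved: bool
    steps: list[Step]


def parse_trajectory(raw: dict) -> Trajectory:
    """Build a Trajectory from one decoded JSON object (hypothetical schema)."""
    steps = [
        Step(s.get("thought", ""), s.get("action", ""))
        for s in raw.get("steps", [])
    ]
    return Trajectory(raw["task_id"], bool(raw.get("resolved", False)), steps)


def load_run(run_dir: Path) -> list[Trajectory]:
    """Load every *.json trajectory file in a run directory, sorted by file name."""
    return [
        parse_trajectory(json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(run_dir.glob("*.json"))
    ]
```

Typed records like these make it easier to ask questions across a whole run, such as which tasks were left unresolved, without rereading raw JSON.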

Source: github.blog

How were agent trajectories analyzed before automation?

Before eval-agents, analyzing trajectories was a manual, cyclic process. Each task in an evaluation dataset generates its own trajectory, typically a JSON file hundreds of lines long. When a new benchmark run arrived, I would examine these trajectories one by one, looking for common patterns or anomalies. I used GitHub Copilot to help surface those patterns, but I still had to mentally combine insights across dozens or hundreds of trajectories. It was like reading a thousand books and trying to recall every theme. Copilot cut what I needed to read from hundreds of thousands of lines to a few hundred, but the mental effort remained high. The repetitive loop (use Copilot to find patterns, then investigate manually) was crying out for automation. That frustration pushed me to build a tool that could automate the entire analysis pipeline.

What role did GitHub Copilot play in analyzing trajectories?

GitHub Copilot was my daily partner in analyzing agent trajectories. When given a set of trajectory files, Copilot could quickly identify recurring patterns—like common failure modes or unique decision points—and summarize them. I would ask Copilot to highlight specific behaviors or to compare different agents' actions. This slashed the time I spent reading raw JSON from hours to minutes. However, I still had to initiate each inquiry myself, repeating the same steps for every new benchmark run. Copilot acted as a powerful accelerator for my intellectual work, but it didn't eliminate the repetition. That experience taught me the potential of combining Copilot's pattern-matching with autonomous agents to remove the manual loop entirely. It became the cornerstone of eval-agents, which harnesses Copilot's capabilities inside automated workflows.
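As a rough illustration of what "recurring patterns across trajectories" means, the sketch below counts how many trajectories show each failure mode. It is a plain keyword heuristic, not how Copilot finds patterns. The `observation` field and the signatures in `FAILURE_SIGNATURES` are hypothetical.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical failure signatures; real ones would come from inspecting the data.
FAILURE_SIGNATURES = {
    "timeout": "timed out",
    "missing_command": "command not found",
    "permission": "permission denied",
}


def failure_modes(trajectory: dict) -> set[str]:
    """Return the failure labels whose signature appears in any step's output."""
    found = set()
    for step in trajectory.get("steps", []):
        output = step.get("observation", "").lower()
        for label, needle in FAILURE_SIGNATURES.items():
            if needle in output:
                found.add(label)
    return found


def summarize(trajectories: list[dict]) -> Counter:
    """Count how many trajectories exhibit each failure mode (once per trajectory)."""
    counts = Counter()
    for trajectory in trajectories:
        counts.update(failure_modes(trajectory))
    return counts
```

Counting each label once per trajectory keeps one noisy task from dominating the summary, which is the kind of cross-run view I previously had to assemble in my head.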

What is eval-agents and how does it work?

eval-agents is a system I built to automate the analysis of coding agent performance. It turns the manual process of examining trajectories into an unattended, agent-driven pipeline. Here's how it works: when a new benchmark run is completed, eval-agents automatically ingests all trajectory files. It then uses GitHub Copilot to generate reports and identify key patterns across the entire dataset. Instead of me manually asking Copilot to find specific issues, the agents pre-analyze the data and present actionable insights. The system is designed so that anyone on my team, even without deep AI knowledge, can run analyses and author new agents to adapt the tool to their specific needs. Think of it as an analysis assistant that the team keeps extending: it automates the repetitive cognitive tasks that used to consume my day, and it lets us focus on improving the underlying agents instead of just reviewing their outputs.
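The flow described above (ingest a run, hand it to each analysis agent, collect a report) can be sketched as follows. This is an illustration under stated assumptions, not the eval-agents implementation. An agent is modeled as a name plus a prompt, and the model call is an injected `ask_model` function so that the sketch does not depend on any real Copilot API.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for whatever sends a prompt to the model and returns text.
AskModel = Callable[[str], str]


def ingest(run_dir: Path) -> list[dict]:
    """Read every trajectory JSON file produced by a benchmark run."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(run_dir.glob("*.json"))
    ]


def analyze_run(run_dir: Path, agents: dict[str, str], ask_model: AskModel) -> str:
    """Run each analysis agent over the whole run and join the findings into a report."""
    trajectories = ingest(run_dir)
    payload = json.dumps(trajectories)
    sections = [f"Run: {run_dir.name} ({len(trajectories)} trajectories)"]
    for name, instructions in agents.items():
        finding = ask_model(f"{instructions}\n\nTrajectories:\n{payload}")
        sections.append(f"[{name}]\n{finding}")
    return "\n\n".join(sections)
```

Passing the whole run in one prompt keeps the sketch short. With hundreds of thousands of lines per run, a real pipeline would likely need to chunk or pre-filter the trajectories first.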

What were the three main design goals for eval-agents?

I built eval-agents with three core principles in mind, drawn from my experience as an open-source maintainer on the GitHub CLI:

  1. Easy to share and use: The tool had to be straightforward for the entire Copilot Applied Science team to adopt, with minimal setup. Sharing agents means sharing productivity gains.
  2. Easy to author new agents: Team members with varying expertise should be able to create new analytical agents for emerging needs without learning complex frameworks. Low barrier to entry was key.
  3. Coding agents as the primary vehicle: I wanted the main way to contribute improvements to be through code—by building or modifying agents. This puts automation at the center of our workflow, encouraging an agent-driven development culture.

These goals ensured that the tool not only solved my immediate problem but also scaled across the team, fostering a collaborative environment where everyone could automate their own intellectual toil.
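One way to keep authoring cheap is a small registry, where adding an agent means writing one decorated function. The sketch below is a hypothetical design, not the eval-agents API. The `agent` decorator, the `AGENTS` registry, and the trajectory fields `task_id` and `resolved` are all assumptions.

```python
from __future__ import annotations

from typing import Callable

# An agent takes the trajectories of one run and returns a text finding.
Agent = Callable[[list], str]

AGENTS: dict[str, Agent] = {}  # shared registry: agent name -> function


def agent(name: str) -> Callable[[Agent], Agent]:
    """Register a function as a named analysis agent."""
    def register(fn: Agent) -> Agent:
        if name in AGENTS:
            raise ValueError(f"agent {name!r} is already registered")
        AGENTS[name] = fn
        return fn
    return register


@agent("unresolved-tasks")
def unresolved_tasks(trajectories: list) -> str:
    """Example agent: list the tasks the coding agent failed to resolve."""
    failed = sorted(t["task_id"] for t in trajectories if not t.get("resolved"))
    if not failed:
        return "all tasks resolved"
    return f"{len(failed)} unresolved: {', '.join(failed)}"


def run_all(trajectories: list) -> dict[str, str]:
    """Run every registered agent and collect its finding by name."""
    return {name: fn(trajectories) for name, fn in AGENTS.items()}
```

Because registration happens at import time, sharing an agent with the team amounts to committing one more function to the shared module.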

How does eval-agents enable team collaboration?

Before eval-agents, each researcher on the Copilot Applied Science team analyzed agent trajectories individually, often duplicating effort. Now, a single instance of eval-agents can be shared across the team. When someone creates a new analysis agent (for example, one that detects a specific type of agent failure), that agent becomes available to everyone. This shared library of agents means that insights are pooled, not siloed. New team members can quickly ramp up by using existing agents and then contribute their own. The system is designed to be iteratively improved: as benchmarks evolve, agents can be updated collaboratively. I now maintain the tool rather than performing manual analysis myself, and I see my peers building agents to address their unique challenges. This shift means the team automates its analysis collectively, spending less time on repetitive review and more on innovative research.
