APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Austin Bridges, Jesse Boyle, Koby Twist, Zach Richards, Chirag Mahapatra, Brendan Foody, Osvald Nitski · Jan 20, 2026 · Citations: 0

Abstract

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open source Archipelago, our infrastructure for agent execution and evaluation.

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Rubric Rating, Expert Verification
  • Rater population: Domain Experts
  • Unit of annotation: Multi Dim Rubric
  • Expertise required: Law

Evaluation Lens

  • Evaluation modes: Simulation Env
  • Agentic eval: Long Horizon
  • Quality controls: Not reported
  • Confidence: 0.65
  • Flags: None

Research Summary

Contribution Summary

  • We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers.
  • APEX-Agents requires agents to navigate realistic work environments with files and tools.
  • We test eight agents for the leaderboard using Pass@1.
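The Pass@1 metric mentioned above can be illustrated with a minimal sketch. It assumes each task gets exactly one attempt and that a grader has already reduced that attempt to a single pass/fail outcome; how APEX-Agents derives that outcome from its rubrics is not described here. The `TaskRun` type, the `pass_at_1` function, and the sample data are all hypothetical and not part of the benchmark's released code.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One attempt at one task (hypothetical structure, for illustration only)."""
    task_id: str
    passed: bool  # assumed to come from a grader; not the benchmark's actual schema


def pass_at_1(runs: list[TaskRun]) -> float:
    """Return the fraction of tasks whose single attempt passed."""
    if not runs:
        raise ValueError("no runs to score")
    return sum(run.passed for run in runs) / len(runs)


# Made-up outcomes for four tasks, one attempt each.
runs = [
    TaskRun("task-1", True),
    TaskRun("task-2", False),
    TaskRun("task-3", False),
    TaskRun("task-4", True),
]
print(f"Pass@1 = {pass_at_1(runs):.1%}")
```

Under this reading, a reported score of 24.0% would mean that roughly one task in four was passed on the first and only attempt.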

Why It Matters For Eval

  • We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers.
  • APEX-Agents requires agents to navigate realistic work environments with files and tools.

Related Papers