Skip to content
← Back to explorer

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Zirui Tang, Xuanhe Zhou, Yumou Liu, Linchun Li, Weizheng Wang, Hongzhang Huang, Jun Zhou, Jiachen Song, Shaoli Yu, Jinqi Wang, Zihang Zhou, Hongyi Zhou, Yuting Lv, Jinyang Li, Jiashuo Liu, Ruoyu Chen, Chunwei Liu, GuoLiang Li, Jihua Kang, Fan Wu · May 5, 2026 · Citations: 0

How to use this page

Moderate trust

Use this for comparison and orientation, not as your only source.

Best use

Secondary protocol comparison source

What to verify

Validate the evaluation procedure and quality controls in the full paper before operational use.

Evidence quality

Moderate

Derived from extracted protocol signals and abstract evidence.

Abstract

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only 68.7%, substantially below the human result of 80.7%, and the average performance across agents is only 47.4%.

Low-signal caution for protocol decisions

Use this page for context, then validate protocol choices against stronger HFEPX references before implementation decisions.

  • The abstract does not clearly describe the evaluation setup.

Should You Rely On This Paper?

This paper has useful evaluation signal, but protocol completeness is partial; pair it with related papers before deciding implementation strategy.

Best use

Secondary protocol comparison source

Use if you need

Background context only.

Main weakness

The abstract does not clearly describe the evaluation setup.

Trust level

Moderate

Usefulness score

50/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Weak / implicit signal

Usefulness for eval research

Moderate-confidence candidate

Extraction confidence 55%

What We Could Verify

These are the protocol signals we could actually recover from the available paper metadata. Use them to decide whether this paper is worth deeper reading.

Human Feedback Types

strong

Rubric Rating

Directly usable for protocol triage.

"Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively."

Evaluation Modes

missing

None explicit

Validate eval design from full paper text.

"Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively."

Quality Controls

missing

Not reported

No explicit QC controls found.

"Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively."

Benchmarks / Datasets

strong

Workspace Bench

Useful for quick benchmark comparison.

"To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies."

Reported Metrics

missing

Not extracted

No metric anchors detected.

"Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively."

Human Feedback Details

  • Uses human feedback: Yes
  • Feedback types: Rubric Rating
  • Rater population: Not reported
  • Unit of annotation: Multi Dim Rubric
  • Expertise required: General

Evaluation Details

  • Evaluation modes:
  • Agentic eval: None
  • Quality controls: Not reported
  • Evidence quality: Moderate
  • Use this page as: Secondary protocol comparison source

Protocol And Measurement Signals

Benchmarks / Datasets

Workspace-Bench

Reported Metrics

No metric terms were extracted from the available abstract.

Research Brief

Metadata summary

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively.

Based on abstract + metadata only. Check the source paper before making high-confidence protocol decisions.

Key Takeaways

  • Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively.
  • Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored.
  • To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies.

Researcher Actions

  • Compare this paper against nearby papers in the same arXiv category before using it for protocol decisions.
  • Check the full text for explicit evaluation design choices (raters, protocol, and metrics).
  • Use related-paper links to find stronger protocol-specific references.

Caveats

  • Generated from abstract + metadata only; no PDF parsing.
  • Signals below are heuristic and may miss details reported outside the abstract.

Recommended Queries

Research Summary

Contribution Summary

  • To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies.
  • We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%.
  • We evaluate 4 popular agent harnesses and 7 foundation models.

Why It Matters For Eval

  • To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies.
  • We evaluate 4 popular agent harnesses and 7 foundation models.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Rubric Rating

  • Gap: Evaluation mode is explicit

    No clear evaluation mode extracted.

  • Gap: Quality control reporting appears

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: Workspace-Bench

  • Gap: Metric reporting is present

    No metric terms extracted.

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.