CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation

Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig · Jan 28, 2025 · Citations: 0

Automatic Metrics · Demonstrations · General · Pairwise Preference · Web Browsing

Abstract

While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and modeling user preference. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research in how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html
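
The interaction pattern described in the abstract, where the agent proposes a next step and the user may accept it, substitute an action of their own, or pause, can be sketched as a small loop. This is a minimal illustration under stated assumptions, not CowPilot's implementation: `Step`, `collaborate`, the `propose` and `review` callables, and the `"accept"` / `"override"` / `"pause"` decision strings are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    actor: str   # "agent" or "human"
    action: str  # free-text description of the browser action


# Hypothetical callables (assumptions, not the paper's API):
#   propose(history) -> the agent's suggested next action, or None when it is done
#   review(suggestion) -> ("accept", None), ("override", human_action) or ("pause", None)
Propose = Callable[[List[Step]], Optional[str]]
Review = Callable[[str], Tuple[str, Optional[str]]]


def collaborate(propose: Propose, review: Review, max_steps: int = 20) -> List[Step]:
    """Interleave agent suggestions with human decisions and record who acted."""
    history: List[Step] = []
    while len(history) < max_steps:
        suggestion = propose(history)
        if suggestion is None:          # agent has nothing more to propose
            break
        decision, human_action = review(suggestion)
        if decision == "accept":
            history.append(Step("agent", suggestion))
        elif decision == "override" and human_action is not None:
            history.append(Step("human", human_action))
        else:                           # "pause" or an unrecognized decision stops the loop
            break
    return history


# Toy run with a scripted agent and a user who corrects one step.
PLAN = ["open site", "type query", "click first result"]


def demo_propose(history: List[Step]) -> Optional[str]:
    return PLAN[len(history)] if len(history) < len(PLAN) else None


def demo_review(suggestion: str) -> Tuple[str, Optional[str]]:
    if suggestion == "type query":
        return ("override", "type corrected query")
    return ("accept", None)


trace = collaborate(demo_propose, demo_review)
for step in trace:
    print(step.actor, "->", step.action)
```

Recording the actor on every step is the design choice that matters here: it is what would later allow a trace to be summarized as the share of steps performed by the human.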

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

65/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Human Data Lens

Uses human feedback: Yes
Feedback types: Pairwise Preference, Demonstrations
Rater population: Unknown
Unit of annotation: Unknown
Expertise required: General
Extraction source: Runtime deterministic fallback

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: Web Browsing
Quality controls: Not reported
Confidence: 0.70
Flags: runtime_fallback_extraction

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

success rate, task success

Research Brief

Deterministic synthesis

While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and modeling user preference. HFEPX signals include Pairwise Preference, Demonstrations, Automatic Metrics with confidence 0.70. Updated from current HFEPX corpus.

Generated Mar 3, 2026, 8:35 PM · Grounded in abstract + metadata only

Key Takeaways

While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world…
We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.

Researcher Actions

Compare its human-feedback setup against pairwise and rubric hubs.
Identify benchmark choices from full text before operationalizing conclusions.
Validate metric comparability (success rate, task success).
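
When validating metric comparability, it may help to make the aggregation explicit. The sketch below computes a success rate and a human step share from per-task logs. The `Episode` record, the function names, and the sample numbers are hypothetical and are not taken from the paper. The abstract does not say whether the reported 15.2% is pooled over all steps or averaged per task, so both variants are shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Episode:
    succeeded: bool   # binary task success
    agent_steps: int  # steps executed by the agent
    human_steps: int  # steps executed by the human


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes whose task succeeded."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.succeeded for e in episodes) / len(episodes)


def pooled_human_share(episodes: List[Episode]) -> float:
    """Human steps divided by all steps, pooled across episodes."""
    total = sum(e.agent_steps + e.human_steps for e in episodes)
    if total == 0:
        raise ValueError("no steps recorded")
    return sum(e.human_steps for e in episodes) / total


def per_episode_human_share(episodes: List[Episode]) -> float:
    """Mean of each episode's own human share; empty episodes are skipped."""
    shares = [
        e.human_steps / (e.agent_steps + e.human_steps)
        for e in episodes
        if e.agent_steps + e.human_steps > 0
    ]
    if not shares:
        raise ValueError("no steps recorded")
    return sum(shares) / len(shares)


# Made-up logs for illustration only.
logs = [
    Episode(True, 8, 2),
    Episode(True, 9, 1),
    Episode(False, 5, 0),
    Episode(True, 6, 4),
]

print(f"success rate:            {success_rate(logs):.3f}")
print(f"human share (pooled):    {pooled_human_share(logs):.3f}")
print(f"human share (per-task):  {per_episode_human_share(logs):.3f}")
```

On these made-up logs the two human-share variants differ, which is why the aggregation rule should be confirmed from the full text before comparing numbers across papers.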

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Extraction confidence is probabilistic and should be validated for critical decisions.

Recommended Queries

human-eval protocol design · agent eval benchmark comparison · inter-rater agreement adjudication

Research Summary

Contribution Summary

While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and modeling user preference.
We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.
We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps.

Researcher Checklist

Pass: Human feedback protocol is explicit
Detected: Pairwise Preference, Demonstrations

Pass: Evaluation mode is explicit
Detected: Automatic Metrics

Gap: Quality control reporting not detected
No calibration/adjudication/IAA control explicitly detected.

Gap: Benchmark or dataset anchors not detected
No benchmark/dataset anchor extracted from abstract.

Pass: Metric reporting is present
Detected: success rate, task success

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks · Protocol Overlap

Citations: 0 · Relevance: 7.80 · Shared tags: Pairwise Preference, Web Browsing
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

Modeling Distinct Human Interaction in Web Agents · Protocol Overlap

Citations: 0 · Relevance: 7.80 · Shared tags: Pairwise Preference, Web Browsing
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation · Protocol Overlap

Citations: 0 · Relevance: 7.80 · Shared tags: Demonstrations, Web Browsing
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

Oracular Programming: A Modular Foundation for Building LLM-Enabled Software · Protocol Overlap

Citations: 0 · Relevance: 7.80 · Shared tags: Demonstrations, Web Browsing
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization · Protocol Overlap

Citations: 0 · Relevance: 5.00 · Shared tag: Pairwise Preference
- Shared HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL · Protocol Overlap

Citations: 0 · Relevance: 4.60 · Shared tag: Web Browsing
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions

Mind the Style: Impact of Communication Style on Human-Chatbot Interaction · Protocol Overlap

Citations: 0 · Relevance: 4.60 · Shared tag: Web Browsing
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions

Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Pairwise Preference
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Pairwise Preference
- Shared HFEPX protocol tags
- Aligned human feedback protocol

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Demonstrations
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Pairwise Preference
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Pairwise Preference
- Shared HFEPX protocol tags
- Aligned human feedback protocol
