HFEPX Hub

Multi Agent Or Web Browsing Papers

Updated from current HFEPX corpus (Apr 12, 2026). 257 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 257 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Adjudication. Frequently cited benchmark: WebArena. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 22, 2026.

Papers: 257 Last published: Mar 22, 2026 Global RSS Tag RSS

Multi AgentWeb Browsing

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: High .

Analysis blocks below are computed from the currently loaded sample (60 of 257 total papers in this hub).

All Sampled Papers (60) Replication-Ready Only (12)

High-Signal Coverage

100.0%

60 / 60 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

12 papers are replication-ready (benchmark + metric + explicit evaluation mode).
1 papers support judge-vs-human agreement analysis.
5 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

32% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 42.8% of papers in this hub.
WebArena is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

1 sampled papers report both human evaluation and LLM-as-judge, supporting direct agreement checks.
Most common quality-control signal is adjudication (2.3% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Benchmark Interpretation

WebArena appears in 2.3% of hub papers (4/257); use this cohort for benchmark-matched comparisons.
OSWorld appears in 1.7% of hub papers (3/257); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 36% of hub papers (63/257); compare with a secondary metric before ranking methods.
cost is reported in 13.7% of hub papers (24/257); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Moderate: Papers with explicit human feedback

Coverage is usable but incomplete (32% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (5.1% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (21.1% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (66.9% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (17.7% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (18.9% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
Agentic evaluation appears in 100% of papers.

Known Gaps

Only 5.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (17.7% coverage).
Annotation unit is under-specified (18.9% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (WebArena vs OSWorld) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries (Expanded)

Recommended Queries

Judge vs Human Agreement Benchmark Slice: WebArena Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabe…

Highest protocol score with explicit human/eval signal plus WebArena.

Strongest benchmark reference

SODIUM: From Open Web Data to Queryable Databases

Sodium-Bench with accuracy gives a fast comparison anchor.

Strongest recent paper

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible B…

Useful for current practice scanning; published Feb 27, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Mar 22, 2026 · Citations: 0 · Score: 10.0

HF: Demonstrations · Eval: Human Eval, Llm As Judge · Benchmark: WebArena · Metric: Precision
SODIUM: From Open Web Data to Queryable Databases
Mar 19, 2026 · Citations: 0 · Score: 8.0

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Sodium Bench · Metric: Accuracy
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Feb 27, 2026 · Citations: 0 · Score: 7.5

HF: Red Team · Eval: Llm As Judge · Benchmark: AdvBench · Metric: Success rate
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Feb 2, 2026 · Citations: 0 · Score: 7.5

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Vdr Bench · Metric: Not Reported
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
Mar 3, 2026 · Citations: 0 · Score: 7.5

HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Kernelbench · Metric: Success rate
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Feb 14, 2026 · Citations: 0 · Score: 7.5

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: MT Bench · Metric: Elo

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling Mar 22, 2026	Yes Demonstrations	Human Eval , Llm As Judge	WebArena , ToolBench	Precision , Pass@1	Not Reported
SODIUM: From Open Web Data to Queryable Databases Mar 19, 2026	Yes Expert Verification	Automatic Metrics	Sodium Bench	Accuracy	Not Reported
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking Feb 27, 2026	Yes Red Team	Llm As Judge	AdvBench , Jbf Eval	Success rate , Jailbreak success rate	Not Reported
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models Feb 2, 2026	Yes Expert Verification	Automatic Metrics	Vdr Bench	Not Reported	Adjudication
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning Mar 3, 2026	Yes Rubric Rating	Automatic Metrics	Kernelbench	Success rate	Not Reported
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment Feb 14, 2026	Yes Pairwise Preference	Automatic Metrics	MT Bench , AlpacaEval	Elo	Not Reported
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks Feb 18, 2026	Yes Pairwise Preference	Automatic Metrics	Memoryarena	Recall	Not Reported
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling Feb 18, 2026	No Not Reported	Automatic Metrics	LiveCodeBench	Accuracy	Calibration
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification Jul 15, 2025	Yes Pairwise Preference	Automatic Metrics , Simulation Env	VisualWebArena , OSWorld	Accuracy	Not Reported
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation Apr 1, 2026	Yes Critique Edit	Simulation Env	WebArena , Interruptbench	Not Reported	Not Reported
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation Mar 19, 2026	Yes Demonstrations	Simulation Env	Mapg Bench	Not Reported	Not Reported
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments May 28, 2025	Yes Red Team	Automatic Metrics	Rtc Bench	Jailbreak success rate	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AgentHER: Hindsight Experience Replay for LLM Agent…	SODIUM: From Open Web Data to Queryable Databases	Jailbreak Foundry: From Papers to Runnable Attacks…
Human Feedback	Demonstrations	Expert Verification	Red Team
Evaluation Modes	Human Eval, Llm As Judge	Automatic Metrics	Llm As Judge
Benchmarks	WebArena, ToolBench	Sodium Bench	AdvBench, Jbf Eval
Metrics	Precision, Pass@1	Accuracy	Success rate, Jailbreak success rate
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Domain Experts	Unknown
Annotation Unit	Trajectory	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (21)
Expert Verification (13)
Demonstrations (10)
Critique Edit (6)

Evaluation Modes

Automatic Metrics (110)
Simulation Env (46)
Llm As Judge (12)
Human Eval (6)

Top Benchmarks

WebArena (4)
OSWorld (3)
BIRD (2)
Paperbench (2)

Top Metrics

Accuracy (63)
Cost (24)
Success rate (9)
Precision (8)

Rater Population Mix

Domain Experts (31)

Quality Controls

Adjudication (6)
Calibration (3)

Coverage diagnostics (sample-based): human-feedback 80.0% · benchmarks 31.7% · metrics 63.3% · quality controls 8.3%.

Top Papers

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0

Demonstrations Human EvalLlm As Judge Long Horizon

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav · Jul 15, 2025 · Citations: 0

Pairwise Preference Automatic MetricsSimulation Env Long Horizon

We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents.
CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng · Mar 31, 2026 · Citations: 0

Rubric RatingExpert Verification Human Eval Web Browsing

The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu · Jun 25, 2025 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources.
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0

Red Team Llm As Judge Multi Agent

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen · Feb 2, 2026 · Citations: 0

Expert Verification Automatic Metrics Web Browsing

However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations.
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner · Mar 29, 2026 · Citations: 0

Expert Verification Human EvalAutomatic Metrics Multi Agent

In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025 · Citations: 0

Expert Verification Llm As JudgeSimulation Env Multi Agent

We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Zelai Xu, Ruize Zhang, Chao Yu, Huining Yuan, Xiangmin Yi · Feb 4, 2025 · Citations: 0

Demonstrations Automatic MetricsSimulation Env Multi Agent

We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement…
SODIUM: From Open Web Data to Queryable Databases
Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang · Mar 19, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy.
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong · Mar 3, 2026 · Citations: 0

Rubric Rating Automatic Metrics Multi Agent

To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it…
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Jing Zhao, Ting Zhen, Junwei Bao, Hongfei Jiang, Yang Song · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability.
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou · Apr 1, 2026 · Citations: 0

Critique Edit Simulation Env Long Horizon

As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid-task execution…
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0

Demonstrations Simulation Env Multi Agent

To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Web Browsing

Existing evaluations of agents with memory typically assess memorization and action in isolation.
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0

Red Team Automatic Metrics Web Browsing

Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities.
I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems
Vedanta S P, Ponnurangam Kumaraguru · Mar 19, 2026 · Citations: 0

Rubric Rating Simulation Env Multi Agent

Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority.
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Llm As JudgeSimulation Env Long Horizon

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang · Feb 26, 2026 · Citations: 0

Expert Verification Simulation Env Multi Agent

As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety the quality of interaction patterns that unfold across conversations rather than the correctness…
EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation
Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Jialin Liu · Aug 8, 2025 · Citations: 0

Pairwise Preference Llm As Judge Multi Agent

Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation.
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue · Feb 16, 2026 · Citations: 0

Llm As Judge Web Browsing

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation.
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory
Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang · Mar 27, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians.
A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment
Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei · Mar 23, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis.
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu · Mar 4, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Web Browsing

We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous…
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya, Yuichi Kitagawa · Oct 30, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition.
Role-Augmented Intent-Driven Generative Search Engine Optimization
Xiaolu Chen, Haojie Wu, Jie Bao, Zhen Chen, Yong Liao · Aug 15, 2025 · Citations: 0

Rubric Rating Automatic Metrics Web Browsing

To better evaluate the method under realistic settings, we address the benchmarking limitations of prior work by: (1) extending the GEO dataset with diversified query variations reflecting real-world search scenarios and (2) introducing…
Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning
Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji · May 7, 2025 · Citations: 0

Demonstrations Automatic Metrics Long Horizon

The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors.
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0

Automatic Metrics Multi Agent

Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures.
Go-Browse: Training Web Agents with Structured Exploration
Apurva Gandhi, Graham Neubig · Jun 4, 2025 · Citations: 0

Simulation Env Web Browsing

To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments.
BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan · Feb 13, 2026 · Citations: 0

Automatic MetricsSimulation Env Web Browsing

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0

Demonstrations Simulation Env Multi Agent

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference
Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du · Mar 11, 2026 · Citations: 0

Red Team Automatic Metrics Multi Agent

Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains insufficiently studied.
Sabiá-4 Technical Report
Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás · Mar 10, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Tool Use

The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal…
BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
Xinyu Gao, Gang Chen, Javier Alonso-Mora · Mar 10, 2026 · Citations: 0

Automatic MetricsSimulation Env Web Browsing

As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans.
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0

Red Team Automatic Metrics Web Browsing

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
Modeling Distinct Human Interaction in Web Agents
Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Web Browsing

In this work, we introduce the task of modeling human intervention to support collaborative web task execution.
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.
MARS: toward more efficient multi-agent collaboration for LLM reasoning
Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao · Sep 24, 2025 · Citations: 0

Critique Edit Automatic Metrics Multi Agent

Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents.
LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Ziming Zhu, Chenglong Wang, Haosong Xv, Shunjie Xing, Yifu Huo · Aug 26, 2025 · Citations: 0

Demonstrations Automatic Metrics Multi Agent

In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge.
MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision
Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin · May 21, 2025 · Citations: 0

Critique Edit Automatic Metrics Multi Agent

Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks.
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou · Jan 28, 2025 · Citations: 0

Pairwise PreferenceDemonstrations Automatic Metrics Web Browsing

We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Multi Agent

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo
Ojas Jain, Dhruv Kumar · Apr 7, 2026 · Citations: 0

Simulation Env Multi Agent

We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning…
DongYuan: An LLM-Based Framework for Integrative Chinese and Western Medicine Spleen-Stomach Disorders Diagnosis
Hua Li, Yingying Li, Xiaobin Feng, Xinyi Fu, Lifeng Dong · Mar 30, 2026 · Citations: 0

Pairwise Preference Web Browsing

While large language models (LLMs) offer new potential for medical applications, they face three major challenges in the context of integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models capable…
World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026 · Citations: 0

Llm As JudgeSimulation Env Multi Agent

To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback-driven action refinement.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang · Apr 7, 2026 · Citations: 0

Automatic MetricsSimulation Env Multi Agent

Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0

Simulation Env Long Horizon

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications
Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang · Mar 23, 2026 · Citations: 0

Simulation Env Long Horizon

To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing…
EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery
Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo · Mar 9, 2026 · Citations: 0

Human Eval Multi Agent

To address this, we introduce EvoScientist, an evolving multi-agent AI scientist framework that continuously improves research strategies through persistent memory and self-evolution.
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025 · Citations: 0

Simulation Env Long Horizon

Extensive experiments on the AerialVLN and OpenFly benchmark validate the effectiveness of our method.
Aligning Large Language Models with Searcher Preferences
Wei Wu, Peilun Zhou, Liyi Chen, Qimeng Wang, Chengqiang Lu · Mar 11, 2026 · Citations: 0

Pairwise Preference Web Browsing

This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs.
TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0

Demonstrations Web Browsing

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher · Mar 2, 2026 · Citations: 0

Pairwise Preference Multi Agent

When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI…
Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Wei Chen · Mar 2, 2026 · Citations: 0

Expert Verification Multi Agent

Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs.
Toward Expert Investment Teams:A Multi-Agent LLM System with Fine-Grained Trading Tasks
Kunihiro Miyazaki, Takanobu Kawahara, Stephen Roberts, Stefan Zohren · Feb 26, 2026 · Citations: 0

Pairwise Preference Multi Agent

While mainstream approaches deploy multi-agent systems mimicking analyst and manager roles, they often rely on abstract instructions that overlook the intricacies of real-world workflows, which can lead to degraded inference performance and…
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus
Anna Van Elst, Kerrian Le Caillec, Igor Colin, Stephan Clémençon · Feb 26, 2026 · Citations: 0

Pairwise Preference Multi Agent

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical…
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0

Pairwise Preference Multi Agent

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and…

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now