HFEPX Hub

Multi Agent Or Tool Use Papers

Updated from current HFEPX corpus (Feb 27, 2026). 52 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 25, 2026.

Papers: 52 Last published: Feb 25, 2026 Global RSS Tag RSS

Multi AgentTool Use

Research Narrative

Grounded narrative Model: deterministic-grounded

Updated from current HFEPX corpus (Feb 27, 2026). This page covers 52 papers centered on Multi Agent Or Tool Use Papers. Common evaluation modes include Automatic Metrics, Simulation Env, with benchmark emphasis on Retrieval, LiveCodeBench. Metric concentration includes accuracy, cost, and the agentic footprint highlights Multi Agent, Tool Use. Use the anchored takeaways below to compare protocol choices, quality-control patterns, and evidence depth before allocating new eval budget.

Why This Matters For Eval Research

Evaluation emphasis: Automatic Metrics and Simulation Env appear frequently in this slice.

Evidence: Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference , Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning , Training Generalizable Collaborative Agents via Strategic Risk Aversion
Benchmark concentration: Retrieval, LiveCodeBench helps control cross-paper variance.

Evidence: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning , Training Generalizable Collaborative Agents via Strategic Risk Aversion , The Headless Firm: How AI Reshapes Enterprise Boundaries
Metric concentration: accuracy, cost is repeatedly reported in this group.

Evidence: Training Generalizable Collaborative Agents via Strategic Risk Aversion , The Headless Firm: How AI Reshapes Enterprise Boundaries , A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

Protocol Takeaways

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Evidence: The Headless Firm: How AI Reshapes Enterprise Boundaries , A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives , A Benchmark for Deep Information Synthesis
Stratify by benchmark (Retrieval vs LiveCodeBench) before comparing methods.

Evidence: A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives , A Benchmark for Deep Information Synthesis , SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
Papers with explicit human feedback is visible in approximately 23.1% of papers in this set.

Evidence: A Benchmark for Deep Information Synthesis , SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery , Cooperative-Competitive Team Play of Real-World Craft Robots

Benchmark Interpretation

Retrieval appears as a recurring benchmark anchor in this page.
4 papers (7.7%) mention Retrieval.
Most common evaluation modes: Automatic Metrics, Human Eval.

Metric Interpretation

accuracy is a common reported metric and should be paired with protocol context before ranking methods.
8 papers (15.4%) mention accuracy.
Most common evaluation modes: Automatic Metrics.

Researcher Checklist

Papers with explicit human feedback: Coverage is a replication risk (23.1% vs 45% target).
Papers reporting quality controls: Coverage is a replication risk (5.8% vs 30% target).
Papers naming benchmarks/datasets: Coverage is usable but incomplete (25% vs 35% target).
Papers naming evaluation metrics: Coverage is strong (40.4% vs 35% target).
Papers with known rater population: Coverage is a replication risk (15.4% vs 35% target).
Papers with known annotation unit: Coverage is usable but incomplete (21.2% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (23.1% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (5.8% vs 30% target).

Papers naming benchmarks/datasets

Coverage is usable but incomplete (25% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (40.4% vs 35% target).

Papers with known rater population

Coverage is a replication risk (15.4% vs 35% target).

Papers with known annotation unit

Coverage is usable but incomplete (21.2% vs 35% target).

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper method details may be missing.
Extraction fields are conservative and can under-report implicit protocol details.
Cross-page comparisons should control for benchmark and metric mismatch.

Research Utility Links

Judge vs Human Agreement - Compares papers that evaluate with both human raters and LLM judges.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs llm_as_judge

both=0, left_only=2, right_only=2

0 papers use both Human Eval and Llm As Judge.

human_eval vs automatic_metrics

both=1, left_only=1, right_only=35

1 papers use both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=2, right_only=36

0 papers use both Llm As Judge and Automatic Metrics.

Top Papers

Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0

Automatic Metrics Tool Use

Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20\% to 40\%.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
Training Generalizable Collaborative Agents via Strategic Risk Aversion
Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar · Feb 25, 2026 · Citations: 0

Automatic Metrics Multi Agent

Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals.
The Headless Firm: How AI Reshapes Enterprise Boundaries
Tassilo Klein, Sebastian Wieczorek · Feb 24, 2026 · Citations: 0

Automatic Metrics Multi Agent

We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n^2) in the number of components); in protocol-mediated agentic systems, inte
A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0

Human EvalAutomatic Metrics Tool Use

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
Cooperative-Competitive Team Play of Real-World Craft Robots
Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang · Feb 24, 2026 · Citations: 0

Simulation Env Multi Agent

Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years.
Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence
ChengYou Li, XiaoDong Liu, XiangBao Meng, XinYu Zhao · Feb 24, 2026 · Citations: 0

Simulation Env Multi Agent

The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems.While current research primarily focuses on scaling context windows or optimizing prompt engi
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang · Feb 24, 2026 · Citations: 0

Simulation Env Tool Use

Agentic systems increasingly rely on reusable procedural capabilities, \textit{a.k.a., agentic skills}, to execute long-horizon workflows reliably.
PyVision-RL: Forging Open Agentic Vision Models via RL
Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng · Feb 24, 2026 · Citations: 0

Automatic Metrics Tool Use

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior.
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
Rakshit Trivedi, Kartik Sharma, David C Parkes · Feb 24, 2026 · Citations: 0

Demonstrations Automatic Metrics Multi Agent

Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts.
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation
Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026 · Citations: 0

Automatic Metrics Multi Agent

This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations.
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0

Automatic Metrics Multi Agent

We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem
Lichang Song, Ting Long, Yi Chang · Feb 21, 2026 · Citations: 0

Automatic Metrics Multi Agent

To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-ma
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026 · Citations: 0

Automatic Metrics Multi Agent

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI
Dusan Bosnjakovic · Feb 19, 2026 · Citations: 0

Automatic Metrics Multi Agent

As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatur
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani · Feb 18, 2026 · Citations: 0

Simulation Env Multi Agent

MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation.
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026 · Citations: 0

Llm As JudgeSimulation Env Multi Agent

Web agents based on large language models have demonstrated promising capability in automating web tasks.
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and informati
Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud · Feb 16, 2026 · Citations: 0

Simulation Env Multi Agent

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks.
OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape · Feb 16, 2026 · Citations: 0

Simulation Env Tool Use

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks.
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Human Eval Multi Agent

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
Ming Li, Xirui Li, Tianyi Zhou · Feb 15, 2026 · Citations: 0

Simulation Env Multi Agent

As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems?
MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents
Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir · Feb 15, 2026 · Citations: 0

Automatic Metrics Tool Use

The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enable third-party servers.
OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery
Qi Liu, Ruochen Hao, Can Li, Wanjing Ma · Feb 14, 2026 · Citations: 0

Simulation Env Multi Agent

We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments.
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026 · Citations: 0

Automatic Metrics Tool Use

To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026 · Citations: 0

Pairwise Preference Simulation Env Tool Use

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution
Isaac Picov, Ritesh Goru · Feb 6, 2026 · Citations: 0

Automatic Metrics Tool Use

Explaining closed-source Large Language Model (LLM) outputs is challenging because API access prevents gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text.
VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration
Jaeyoon Jung, Yejun Yoon, Kunwoo Park · Feb 4, 2026 · Citations: 0

Automatic Metrics Multi Agent

This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration.
OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering
Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong · Feb 3, 2026 · Citations: 0

Automatic Metrics Tool Use

To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning.
STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models
Jiliang Ni, Jiachen Pu, Zhongyi Yang, Jingfeng Luo, Conggang Hu · Feb 3, 2026 · Citations: 0

Automatic Metrics Tool Use

The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating transferring their capabilities into smaller ones.
Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026 · Citations: 0

Simulation Env Multi Agent

Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026 · Citations: 0

Red Team Automatic Metrics Tool Use

This paper presents a comprehensive empirical study on the safety alignment capabilities.
From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems
Brendan Gho, Suman Muppavarapu, Afnan Shaik, Tyson Tsay, Atharva Mohan · Nov 18, 2025 · Citations: 0

Automatic Metrics Multi Agent

As foundation models are increasingly deployed as interacting agents in multi-agent systems, their collective behavior raises new challenges for trustworthiness, transparency, and accountability.
From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen · Oct 29, 2025 · Citations: 0

Automatic Metrics Multi Agent

To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.
SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0

Demonstrations Simulation Env Multi Agent

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025 · Citations: 0

Expert Verification Llm As JudgeSimulation Env Multi Agent

We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and
Collaborative Document Editing with Multiple Users and AI Agents
Florian Lehmann, Krystsina Shauchenka, Daniel Buschek · Sep 15, 2025 · Citations: 0

Simulation Env Multi Agent

We propose integrating AI agents directly into collaborative writing environments.
CogniAlign: Survivability-Grounded Multi-Agent Moral Reasoning for Safe and Transparent AI
Hasin Jawad Ali, Ilhamul Azam, Ajwad Abrar, Md. Kamrul Hasan, Hasan Mahmud · Sep 14, 2025 · Citations: 0

Automatic Metrics Multi Agent

The challenge of aligning artificial intelligence (AI) with human values persists due to the abstract and often conflicting nature of moral principles and the opacity of existing approaches.
Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems
Jingyu Guo, Yingying Xu · Aug 27, 2025 · Citations: 0

Automatic Metrics Multi Agent

While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases.
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning
Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap · Aug 11, 2025 · Citations: 0

Automatic Metrics Multi Agent

We introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence
CoAct-1: Computer-using Multi-Agent System with Coding Actions
Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi · Aug 5, 2025 · Citations: 0

Automatic Metrics Long Horizon

Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks.
GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning
Jie Peng, Jiarui Ji, Runlin Lei, Zhewei Wei, Yongchao Liu · Jul 4, 2025 · Citations: 0

Automatic Metrics Multi Agent

Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation.
Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu · Apr 26, 2025 · Citations: 0

Automatic Metrics Multi Agent

Accurately identifying the synthesis conditions of metal-organic frameworks (MOFs) is essential for guiding experimental design, yet remains challenging because relevant information in the literature is often scattered, inconsistent, and di
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025 · Citations: 0

Expert Verification Automatic Metrics Tool Use

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
Can Multimodal LLMs Perform Time Series Anomaly Detection?
Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao · Feb 25, 2025 · Citations: 0

Automatic Metrics Multi Agent

One natural way for humans to detect time series anomalies is through visualization and textual description.
Should You Use Your Large Language Model to Explore or Exploit?
Keegan Harris, Aleksandrs Slivkins · Jan 31, 2025 · Citations: 0

Automatic Metrics Tool Use

We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff.
Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management
M. Saifullah, K. G. Papakonstantinou, A. Bhattacharya, S. M. Stoffels, C. P. Andriotis · Jan 23, 2024 · Citations: 0

Simulation Env Multi Agent

To tackle the high dimensionality of state and action spaces, we propose DDMAC-CTDE, a Deep Decentralized Multi-Agent Actor-Critic (DDMAC) reinforcement learning architecture with Centralized Training and Decentralized Execution (CTDE).