Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025 · Citations: 0
Pairwise PreferenceAutomatic MetricsGeneral
We additionally contribute a CAD dataset with human preference annotations.
Experiments with proprietary models (GPT-4o, Gemini, etc) show large gains, with GPT-4o (Omni) achieving up to +23.4 absolute accuracy points on the human-preference benchmark.
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Dong Wang · Aug 28, 2025 · Citations: 0
Automatic MetricsGeneral
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems.
However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction.
While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases.
Previous studies have focused on biases inherited from training data, but whether stereotypes can emerge spontaneously in AI agent interactions merits further exploration.
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo · Aug 26, 2025 · Citations: 0
Automatic MetricsGeneral
Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning.
Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso · Aug 18, 2025 · Citations: 0
Critique EditAutomatic MetricsGeneral
To address this, we present TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table extraction system that converts highly unstructured, multi-page, heterogeneous tables into normalized,
Our Recommender Agent reviews unmatched outputs and proposes schema revisions, enabling TASER to outperform vision-based table detection models such as Table Transformer by 10.1%.
Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap · Aug 11, 2025 · Citations: 0
Automatic MetricsGeneral
We introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence
Experiments on the ConfAIde and PrivacyLens benchmark with several open-source and closed-sourced LLMs demonstrate that our best multi-agent configuration substantially reduces private information leakage (\textbf{18\%} on ConfAIde and \tex
Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li · Aug 5, 2025 · Citations: 0
Automatic MetricsGeneral
Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks.
While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency.
Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Loza Vera, Muhammad Dehan Al Kautsar · Jul 31, 2025 · Citations: 0
Red TeamAutomatic MetricsGeneral
Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints.
David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler · Jul 11, 2025 · Citations: 0
Pairwise PreferenceAutomatic MetricsGeneral
There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation.
The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available.
Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Yuqi Ren · Jun 30, 2025 · Citations: 0
Pairwise PreferenceAutomatic MetricsGeneral
Conducting supervised and preference fine-tuning of large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values.
To address these challenges, we propose the \underline{\textbf{Ta}}xonomy-Guided \underline{\textbf{P}}reference Data Generation (TaP) framework for automated, scalable preference dataset construction across languages.
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025 · Citations: 0
Red TeamAutomatic MetricsGeneral
In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries.
Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil · Jun 2, 2025 · Citations: 0
Automatic MetricsSimulation EnvGeneral
We benchmark the model over different experiments, analyzing the method's accuracy across varying qubit counts and circuit depths, showcasing the ability of the method to outperform existing approaches in gate counts and under noisy conditi
Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the disti
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su · May 28, 2025 · Citations: 0
Red TeamSimulation EnvGeneral
Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection.
Current evaluations of this threat either lack support realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces.
Mengyang Qiu, Zoe Brisebois, Siena Sun · May 22, 2025 · Citations: 0
Pairwise PreferenceSimulation EnvGeneral
Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.
We evaluated 34 distinct models across 45 configurations from major closed-source and open-source providers, and compared outputs to responses from 106 human participants.
Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu · May 21, 2025 · Citations: 0
Pairwise PreferenceAutomatic MetricsGeneral
However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, leaving a critical gap in our ability to evaluate verification systems used in reason
In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems.