A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
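The hierarchical planning and iterative self-reflection loop can be pictured with a minimal sketch; this is an illustrative plan-generate-reflect skeleton, not MM-WebAgent's actual implementation, and `llm`, the prompts, and the page/element split are all assumptions.

```python
# Minimal plan -> generate -> reflect loop (illustrative only, not MM-WebAgent's code).
# `llm` stands in for any text-generation callable.
from typing import Callable, List

def generate_page(llm: Callable[[str], str], brief: str, max_rounds: int = 3) -> str:
    # High-level plan: decompose the page brief into element-level tasks.
    plan = llm(f"List the sections and assets needed for this webpage: {brief}")
    elements: List[str] = [
        llm(f"Generate HTML for this section: {line}")
        for line in plan.splitlines() if line.strip()
    ]
    page = "\n".join(elements)
    # Iterative self-reflection: critique the assembled page and revise it.
    for _ in range(max_rounds):
        critique = llm(f"Critique this webpage against the brief '{brief}':\n{page}")
        if "no issues" in critique.lower():
            break
        page = llm(f"Revise the webpage to address this critique:\n{critique}\n\n{page}")
    return page
```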
Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry containing 30,664 poems in total, of which 10,276 are human-written and 20,388 are generated by four popular LLMs.
To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression-ranking objective to provide fine-grained evaluation of reasoning paths.
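A hybrid regression-ranking objective typically combines a pointwise term against scalar quality labels with a pairwise term over preferred/rejected pairs; the sketch below shows that combination in PyTorch, with the weighting and margin as placeholder values rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

# Illustrative hybrid regression-ranking loss for a discriminative scorer, assuming
# scalar quality labels and preferred/rejected reasoning-path pairs.
def hybrid_loss(scores, labels, chosen_scores, rejected_scores,
                alpha: float = 0.5, margin: float = 0.1) -> torch.Tensor:
    # Regression term: match predicted scores to fine-grained quality labels.
    reg = F.mse_loss(scores, labels)
    # Ranking term: preferred paths should outscore rejected ones by a margin.
    rank = F.margin_ranking_loss(
        chosen_scores, rejected_scores,
        target=torch.ones_like(chosen_scores), margin=margin,
    )
    return alpha * reg + (1 - alpha) * rank
```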
Traditionally, this requires costly, large-scale human data collection.
Our problem formulation, metrics, methodology, and dataset open a new direction for cost-effective survey development and a deeper understanding of how LLMs simulate human survey responses.
These high-quality transcripts are used by an LLM to grade verbal skills, developmental milestones, reading, and comprehension, with results that align closely with human evaluators.
We first develop a comprehensive, multi-faceted evaluation framework that includes: (i) a formal conformity score, (ii) automated quality assessment using LLMs, (iii) human evaluation, and (iv) classification-based probing tasks.
Finally, we propose a Generate-Critic architecture in which the evaluation framework functions as an automated critic.
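In schematic form, a Generate-Critic loop alternates drafting and automated critique until the critic's scores pass a threshold; the sketch below assumes hypothetical `generate` and `critic` callables and a 0-1 score scale, and is not the paper's exact architecture.

```python
from typing import Callable, Dict

# Schematic Generate-Critic loop: the evaluation framework acts as an automated
# critic that scores each draft and feeds its report back to the generator.
def generate_with_critic(generate: Callable[[str, str], str],
                         critic: Callable[[str], Dict[str, float]],
                         prompt: str, threshold: float = 0.8,
                         max_iters: int = 3) -> str:
    feedback = ""
    draft = generate(prompt, feedback)
    for _ in range(max_iters):
        report = critic(draft)                 # e.g. conformity and quality scores
        if min(report.values()) >= threshold:  # accept when all facets pass
            break
        feedback = f"Improve these aspects: {report}"
        draft = generate(prompt, feedback)
    return draft
```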
Speech is essential for realistic role-playing, yet existing work on role-playing agents largely centers on text, leaving Speech Role-Playing Agents (SRPAs) underexplored and without systematic evaluation.
SpeechRole-Eval offers a multidimensional benchmark that directly evaluates generated speech, preserving paralinguistic cues and measuring interaction ability, speech expressiveness, and role-playing fidelity.
User simulators are essential to conversational AI, enabling scalable agent development and evaluation through simulated interactions.
Moreover, we establish comprehensive evaluation metrics for measuring goal alignment in user simulators, and demonstrate that our approach yields substantial improvements across two benchmarks (MultiWOZ 2.4 and τ-Bench).
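One simple way to operationalize goal alignment for a user simulator is the fraction of goal slot-value pairs the simulated user actually surfaces during the dialogue; the toy metric below mirrors MultiWOZ-style goals and uses naive string matching, which is a simplification rather than the paper's metric.

```python
from typing import Dict, List

# Toy goal-alignment metric: share of goal slot values the simulated user mentioned.
def goal_alignment(goal: Dict[str, str], user_turns: List[str]) -> float:
    dialogue = " ".join(turn.lower() for turn in user_turns)
    expressed = sum(1 for value in goal.values() if value.lower() in dialogue)
    return expressed / len(goal) if goal else 1.0

# Example: the simulator mentioned the area and food type but never the price range.
score = goal_alignment(
    {"area": "centre", "food": "italian", "pricerange": "cheap"},
    ["I need an italian restaurant in the centre.", "Book a table for two."],
)
print(score)  # 0.666...
```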
These include natural language interaction with humans, goal adaptation, contextual reasoning, external tool use, and the handling of unforeseen ethical dilemmas, enabled in part by multimodal large language models (LLMs).
These developments highlight not only a gap between technical autonomy and the broader cognitive and social capacities required for human-centered mobility, but also the emergence of a form of vehicle intelligence that currently lacks a…
We develop an attribution-aware evaluation framework to measure speaker-level inclusion and misrepresentation in debate summaries.
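A speaker-level inclusion score can be read as, for each speaker, the share of their key points recoverable from the summary; the sketch below uses word overlap as a stand-in matcher, whereas a real attribution framework would likely use entailment or alignment models.

```python
from typing import Dict, List

# Simplified speaker-level inclusion: per speaker, the fraction of their key
# points covered by the summary (word-overlap matching as a crude stand-in).
def speaker_inclusion(points_by_speaker: Dict[str, List[str]],
                      summary: str, overlap_threshold: float = 0.5) -> Dict[str, float]:
    summary_words = set(summary.lower().split())
    scores = {}
    for speaker, points in points_by_speaker.items():
        covered = 0
        for point in points:
            words = set(point.lower().split())
            if words and len(words & summary_words) / len(words) >= overlap_threshold:
                covered += 1
        scores[speaker] = covered / len(points) if points else 0.0
    return scores  # low scores flag speakers under-represented in the summary
```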
These findings underscore the need for domain-sensitive evaluation metrics and ethical oversight in the deployment of LLMs for multilingual democratic applications.
There are currently two main paradigms for evaluating large language models (LLMs): reference-based evaluation and preference-based evaluation.
The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available.
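The contrast between the two paradigms can be made concrete with a minimal sketch: reference-based scoring compares outputs to pre-defined references (exact match here for simplicity), while preference-based scoring needs no references and instead counts how often a judge prefers the model's output. The `judge` callable is an assumption standing in for a human or LLM judge.

```python
from typing import Callable, List, Tuple

# Reference-based: compare outputs to pre-defined reference executions.
def reference_based_score(outputs: List[str], references: List[str]) -> float:
    hits = sum(o.strip() == r.strip() for o, r in zip(outputs, references))
    return hits / len(references)

# Preference-based: a judge picks the better of two outputs per prompt; the
# model is scored by its win rate (judge returns the index of the winner).
def preference_based_score(pairs: List[Tuple[str, str]],
                           judge: Callable[[str, str], int]) -> float:
    wins = sum(judge(a, b) == 0 for a, b in pairs)
    return wins / len(pairs)
```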
Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity.
To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs.
Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking…
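One common calibration step for an LLM-as-a-judge is to query the judge twice with the answer order swapped and only accept verdicts that survive the swap, which reduces position bias; the sketch below shows that generic technique and is not necessarily the LLMEval-Fair pipeline.

```python
from typing import Callable

# Position-swap calibration for an LLM judge. `judge` returns "A", "B", or "tie".
def calibrated_verdict(judge: Callable[[str, str, str], str],
                       question: str, answer_a: str, answer_b: str) -> str:
    first = judge(question, answer_a, answer_b)    # original order
    second = judge(question, answer_b, answer_a)   # same prompt, order swapped
    flipped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == flipped else "tie"    # disagreement falls back to a tie
```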
In response, we introduce WebDS, the first end-to-end web-based data science benchmark.
For instance, Browser Use, which accomplishes 80% of tasks on WebVoyager, completes only 15% of tasks in WebDS, which our analysis suggests is due to new failure modes, such as poor information grounding, repetitive behavior and…
Extensive experiments on five benchmark datasets show the superiority of EviOmni, which provides compact and high-quality evidence, enhances the accuracy of downstream tasks, and supports both traditional and agentic RAG systems.
Our evaluation demonstrates that ToolRegistry achieves a 60-80% reduction in tool integration code, up to 3.1x performance improvements through concurrent execution, and broad compatibility with OpenAI function calling standards.
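The general idea of registering plain Python functions, exposing OpenAI-style function-calling schemas, and running tool calls concurrently can be sketched as below; this is a minimal illustration, not ToolRegistry's actual API, and the class and method names are hypothetical.

```python
import inspect
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Minimal tool-registry sketch (not ToolRegistry's actual API).
class SimpleToolRegistry:
    def __init__(self):
        self._tools: Dict[str, Callable] = {}

    def register(self, fn: Callable) -> Callable:
        self._tools[fn.__name__] = fn
        return fn

    def schemas(self) -> List[dict]:
        # Derive a minimal OpenAI function-calling schema from each signature;
        # all parameters are typed as strings for simplicity.
        out = []
        for name, fn in self._tools.items():
            params = list(inspect.signature(fn).parameters)
            out.append({
                "type": "function",
                "function": {
                    "name": name,
                    "description": fn.__doc__ or "",
                    "parameters": {
                        "type": "object",
                        "properties": {p: {"type": "string"} for p in params},
                        "required": params,
                    },
                },
            })
        return out

    def execute_all(self, calls: List[dict]) -> List[object]:
        # Run independent tool calls in parallel threads.
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(self._tools[c["name"]], **c["arguments"])
                       for c in calls]
            return [f.result() for f in futures]
```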
DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench.
However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, hindering comprehensive…
To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond.
Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior.
This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants' ratings of interpretable features.
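In outline, a modal difference vector is the gap between mean LM activations for two modal categories, and new sentences are projected onto it and compared against human ratings; the sketch below assumes illustrative category names and placeholder arrays rather than the paper's data.

```python
import numpy as np

# Difference between mean activations of two modal categories, e.g. epistemic
# vs. deontic uses; each input is (n_sentences, hidden_dim).
def modal_difference_vector(acts_epistemic: np.ndarray,
                            acts_deontic: np.ndarray) -> np.ndarray:
    return acts_epistemic.mean(axis=0) - acts_deontic.mean(axis=0)

def correlate_with_ratings(activations: np.ndarray, direction: np.ndarray,
                           human_ratings: np.ndarray) -> float:
    # Project each sentence onto the difference direction, then correlate the
    # projections with per-sentence human ratings of an interpretable feature.
    projections = activations @ direction
    return float(np.corrcoef(projections, human_ratings)[0, 1])
```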
Digitization projects in humanities often generate vast quantities of page images from historical documents, presenting significant challenges for manual sorting and analysis.