A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths.
In our experiments, we demonstrate how standard reward modeling (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators.
To address these issues, we develop methods for identifying diverging preferences to mitigate their influence in evaluations and during LLM training.
To answer these questions, we introduce a benchmark for standardizing research on two-player, sequential, language-based games.
Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents in various economic contexts; (ii) evaluate agents in both individual and collective performance…
Emergent Communication (EmCom) investigates how agents develop symbolic communication through interaction without predefined language.
In this work, we propose the SimSiam Naming Game (SSNG), a feedback-free EmCom framework that replaces sampling-based updates with a symmetric, self-supervised representation alignment objective between autonomous agents.
We discuss how speaker effects serve as indices for assessing language development and social cognition, and we encourage future research to extend these findings to the emerging domain of artificial intelligence (AI) speakers, as AI agents…
This flexibility in handling various text interventions is a key advancement in causal estimation for textual data, offering opportunities to better understand human behaviors and develop effective interventions within social systems.
Models fine-tuned with Waffle show up to 9.00 pp (percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing benchmark Design2Code,…
The results position inference-time textual guidance as a scalable and controllable approach to distillation for interactive AI agents in marketing settings.
Ultimately, by fine-tuning optimizable parameters with appropriate regularization, NTK-CL achieves state-of-the-art performance on established PEFT-CL benchmarks.
Then, a comprehensive benchmark is proposed for evaluating the instruction-following and reasoning abilities of LLM4AD systems, which includes LaMPilot-Bench, CARLA Leaderboard 1.0 Benchmark in simulation and NuPlanQA for multi-view visual…
Finally, the main challenges of LLM4AD are discussed, including latency, deployment, security and privacy, safety, trust and transparency, and personalization.
Among them, the score-matching method has demonstrated superior performance across various evaluation metrics, particularly for the commonly encountered Additive Nonlinear Causal Models.
In this work we present a framework for robots to continually learn tasks and visuo-motor skills and query for novel skills via dialog interactions with human users.
Moreover, with our IRB approved human-subjects study we demonstrate that our dialog based continual learning framework allows users to teach robots cooking skills successfully (100%) while spending a higher ratio of time on finishing an…
Here, we propose MAEnvs4VRP library, a unified framework for multi-agent vehicle routing environments that supports classical, dynamic, stochastic, and multi-task problem variants within a single modular design.
It follows the Agent Environment Cycle ("AEC") games model and features an intuitive API, enabling rapid adoption and seamless integration into existing reinforcement learning frameworks.
Supporting the health and well-being of dynamic populations around the world requires governmental agencies, organizations and researchers to understand and reason over complex relationships between human behavior and local contexts in…
We evaluate the effectiveness of our approach by benchmarking it on 27 downstream tasks spanning three distinct domains: health indicators, socioeconomic factors, and environmental measurements.
Neural language models (LMs) have been shown to capture complex linguistic patterns, yet their utility in understanding human language and more broadly, human cognition, remains debated.
While existing work in this area often evaluates human-machine alignment, few studies attempt to translate findings from this enterprise into novel insights about humans.