A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
We propose MM-WebAgent, an agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry containing 30,664 poems in total: 10,276 human-written and 20,388 generated by four popular LLMs.
To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression-ranking objective to provide fine-grained evaluation of reasoning paths.
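As a concrete reading of that objective, here is a minimal sketch of a hybrid regression-ranking loss; the interpolation weight, margin, and exact functional form are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_regression_ranking_loss(scores, targets, margin=0.1, alpha=0.5):
    """Combine pointwise regression with pairwise ranking.

    scores:  (B,) predicted quality scores for reasoning paths
    targets: (B,) gold scalar quality labels
    alpha:   interpolation weight (assumed; not specified in the abstract)
    """
    # Pointwise term: regress predicted scores onto gold labels.
    reg_loss = F.mse_loss(scores, targets)

    # Pairwise term: for every pair where target_i > target_j,
    # require score_i to exceed score_j by a margin.
    diff_t = targets.unsqueeze(1) - targets.unsqueeze(0)  # (B, B)
    diff_s = scores.unsqueeze(1) - scores.unsqueeze(0)    # (B, B)
    pair_mask = (diff_t > 0).float()
    rank_loss = (F.relu(margin - diff_s) * pair_mask).sum() / pair_mask.sum().clamp(min=1)

    return alpha * reg_loss + (1 - alpha) * rank_loss
```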
A Proximal Policy Optimization (PPO) agent is trained using a risk-sensitive reward that penalizes reserve shortfall, capital inefficiency, and breaches of a volatility-adjusted solvency floor, with tail risk explicitly controlled through…
To reflect regulatory stress-testing practice, the agent is trained under a regime-aware curriculum and evaluated using both regime-stratified simulations and fixed-shock stress scenarios.
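To make the reward structure concrete, here is a hedged sketch of how the three named penalties might compose into a scalar reward; all weights and functional forms are assumptions, and the truncated tail-risk term is omitted.

```python
def risk_sensitive_reward(reserve, liability, capital_held, capital_required,
                          solvency_ratio, solvency_floor,
                          w_shortfall=1.0, w_capital=0.1, w_breach=5.0):
    """Scalar reward penalizing the three failure modes named in the abstract.
    Weights and the exact functional form are illustrative assumptions.
    """
    shortfall = max(liability - reserve, 0.0)                   # reserve shortfall
    excess_capital = max(capital_held - capital_required, 0.0)  # capital inefficiency
    breach = 1.0 if solvency_ratio < solvency_floor else 0.0    # solvency-floor breach

    return -(w_shortfall * shortfall
             + w_capital * excess_capital
             + w_breach * breach)
```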
We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs.
First, we develop benchmarks derived from key problems in economics -- procurement, scheduling, and pricing -- that test an LLM's ability to learn from the environment in context.
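A minimal sketch of what one round of such an in-context benchmark could look like, using a hypothetical pricing environment; the elasticity model, noise level, and placeholder policy are all illustrative assumptions.

```python
import random

def pricing_round(price, true_elasticity=-1.5, base_demand=100.0, noise=5.0):
    """Simulated demand response (hypothetical environment, for illustration)."""
    demand = max(base_demand * (price ** true_elasticity) + random.gauss(0, noise), 0.0)
    return price * demand  # revenue observed by the agent

history = []
for t in range(10):
    # In the benchmark setting, an LLM would read `history` from its context
    # and choose the next price; a fixed heuristic stands in here.
    price = 5.0 + t * 0.5
    revenue = pricing_round(price)
    history.append({"round": t, "price": price, "revenue": revenue})
```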
Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across multiple tissue types, while more closely aligning with the cognitive logic of manual annotation.
To address this question, we introduce goal decoding tasks and evaluation frameworks that use large-scale English eye-tracking-while-reading data spanning hundreds of text-specific information-seeking tasks.
As large language models (LLMs) and LLM-based agents increasingly interact with humans in decision-making contexts, understanding the trust dynamics between humans and AI agents becomes a central concern.
Using established behavioral theories, we develop an approach that studies whether LLMs' trust depends on three major trustworthiness dimensions of the human subject: competence, benevolence, and integrity.
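One plausible instantiation, sketched under assumptions: vary a single trustworthiness dimension in the partner's description and measure how much the LLM entrusts in a standard investment (trust) game. The template wording below is hypothetical.

```python
# Hypothetical probe: vary one trustworthiness dimension at a time in the
# description of a human partner, then measure the amount the LLM sends
# in a standard investment (trust) game. Template text is illustrative.
DIMENSIONS = {
    "competence": ("a highly skilled financial analyst",
                   "someone with no relevant experience"),
    "benevolence": ("someone known to look out for others' interests",
                    "someone known to act purely out of self-interest"),
    "integrity": ("someone who always keeps their promises",
                  "someone who has broken agreements in the past"),
}

def trust_game_prompt(partner_description, endowment=10):
    return (
        f"You have ${endowment}. You may send any amount to your partner, "
        f"{partner_description}. Whatever you send is tripled; your partner "
        "then decides how much to return. How much do you send? "
        "Answer with a number only."
    )

prompts = {
    (dim, level): trust_game_prompt(desc)
    for dim, (high, low) in DIMENSIONS.items()
    for level, desc in (("high", high), ("low", low))
}
```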
Developing and integrating emotion-understanding models is essential for a wide range of human-computer interaction tasks, including customer feedback analysis, marketing research, and social media monitoring.
Furthermore, we benchmark state-of-the-art encoder-only Pretrained Language Models (PLMs) and Large Language Models (LLMs) on this enriched dataset.
Using the BatchTopK crosscoder, we identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as false information and personal questions, along with multiple…
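For reference, the BatchTopK activation (as described in the sparse-autoencoder literature) keeps the k·B largest pre-activations across a batch rather than k per sample; the sketch below shows the mechanism only, with the surrounding crosscoder omitted.

```python
import torch

def batch_topk(acts: torch.Tensor, k: int) -> torch.Tensor:
    """BatchTopK activation over pre-activations of shape (B, n_latents):
    keep the k*B largest values across the whole batch, letting per-sample
    sparsity vary. Assumes k <= n_latents; ties may keep a few extra units.
    """
    B = acts.shape[0]
    flat = acts.flatten()
    threshold = flat.topk(k * B).values.min()  # (k*B)-th largest value
    return acts * (acts >= threshold).float()
```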
Building on recent work (Newman et al., 2024), we move beyond oracle settings by (i) simulating well-specified yet schema-agnostic user demands that avoid leaking gold column names or values, (ii) explicitly modeling retrieval noise via…
To support reproducible evaluation, we introduce arXiv2Table, a benchmark of 1,957 tables referencing 7,158 papers, with human-verified distractors and rewritten, schema-agnostic user demands.
To address this, we propose an Adaptive Social Learning (ASL) framework aimed at improving the adaptive reasoning ability of language agents in dynamic social interactions.
Extensive experiments in a benchmark social-intelligence environment show that ASL achieves 15.6% higher task performance than GPT-4o.
We formalize this iterative, multi-turn paradigm as codeflow and introduce CodeFlowBench, the first benchmark designed to comprehensively evaluate LLMs' ability to perform codeflow: implementing new functionality by reusing existing…
Furthermore, a novel evaluation framework featuring a dual assessment protocol and structural metrics derived from dependency trees is introduced.
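As an illustration of what dependency-tree structural metrics could look like, here is a small sketch computing depth and fan-out; the benchmark's actual metrics are not specified in the excerpt.

```python
from collections import defaultdict

def tree_metrics(edges):
    """Structural metrics over a dependency tree given as (parent, child)
    pairs. Depth and fan-out are illustrative choices, not the paper's.
    """
    children = defaultdict(list)
    child_set = set()
    for parent, child in edges:
        children[parent].append(child)
        child_set.add(child)
    roots = {p for p, _ in edges} - child_set  # parents that are never children

    def depth(node):
        return 1 + max((depth(c) for c in children[node]), default=0)

    max_depth = max((depth(r) for r in roots), default=0)
    max_fanout = max((len(v) for v in children.values()), default=0)
    return {"max_depth": max_depth, "max_fanout": max_fanout}
```

For example, `tree_metrics([("main", "parse"), ("main", "run"), ("run", "step")])` returns a max depth of 3 and a max fan-out of 2.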
Furthermore, we propose "connection thinking" to bridge the semantic gap between rigid formal logic and fluent human-like reasoning, ensuring coherent logical transitions.
Extensive experiments demonstrate that training models on our synthesized dataset, GeoTrust, substantially enhances deep geometric reasoning capabilities and yields significant performance gains across out-of-distribution (OOD) benchmarks,…
However, benchmarking on large-scale real-world data such as electronic health records (EHRs) is critical, as clinical decisions are directly informed by these sources, yet current evaluations remain limited.
To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks drawn from real-world clinical data sources across nine languages.
This preliminary study aims to identify optimal practices that support speech-language pathologists in diagnosing DLD more efficiently, with the active involvement of human specialists.
First, we construct Fin-R1-Data, a high-quality financial dataset consisting of 60,091 chain-of-thought (CoT) samples, distilled and filtered from multiple authoritative benchmarks to ensure consistency and reliability.
Despite its relatively small parameter scale, Fin-R1 achieves competitive empirical performance across established financial benchmarks and demonstrates practical utility in compliance checking and robo-advisory.