A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths.
Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.
Existing methods attempt to improve faithfulness through external interventions, such as specialized prompting, decoding-based calibration, or preference optimization.
Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.
We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark.
In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods.
This marks a conceptual departure from traditional automated text scoring by prioritising information gain over fidelity to expert rubrics or human-annotated data.
To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals.
Our evaluation reveals critical limitations in current LLMs' context-dependent reasoning.
The aim is to determine whether model accuracy approximates human competence and whether it is shaped primarily by linguistic complexity or by the size of the linguistic community, which affects the quantity of available training data.
Consistent with previous research, the results show that the models are able to generalize morphological processes to unseen words with human-like accuracy.
Concept bottleneck models tie predictions to human concepts in vision, but NLP versions either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept…
CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules that state how concepts affect each other and the final decision.
In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training.
Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved data efficiency.
GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines.
We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks.
We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.
Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072…
Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune achieves remarkable token reduction of 65.73%--88.50% compared to conventional consensus…