A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi, Yulia Tsvetkov · Feb 14, 2026
Citations: 0
Rubric Rating · Automatic Metrics · Coding
However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility.
Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%.
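For context, the baseline setting described above can be made concrete with a minimal sketch of a reference-free, rubric-free pairwise judge; the `complete` callable and prompt wording are illustrative assumptions, not FLIP's actual protocol:

```python
def judge_pairwise(complete, prompt: str, response_a: str, response_b: str) -> str:
    """Ask a judge model which response better answers the prompt.

    Returns "A" or "B". No reference answer or rubric is supplied, which is
    exactly the setting the abstract says limits smaller judge models.
    """
    judge_prompt = (
        "You are an impartial judge. Given a user prompt and two responses, "
        "reply with the single letter of the better response (A or B).\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Better response:"
    )
    verdict = complete(judge_prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```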
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li · Feb 12, 2026
Citations: 0
Automatic Metrics · Tool Use · Coding
To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA examples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global–regional "zoo…
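The zoom primitive being internalized is simple to picture. A minimal sketch, assuming a PIL image and an illustrative region box (not the paper's implementation), of the crop-and-upsample step that Region-to-Image Distillation moves from inference time to training time:

```python
from PIL import Image

def zoomed_region(image_path: str, box: tuple[int, int, int, int],
                  out_size: tuple[int, int] = (448, 448)) -> Image.Image:
    """Crop a region of interest and upsample it to the model's input size.

    This is the "zoom" operation: at training time the distillation target
    sees this crop, so the student learns to recover regional detail in a
    single full-image forward pass. `box` is (left, upper, right, lower).
    """
    image = Image.open(image_path).convert("RGB")
    crop = image.crop(box)
    return crop.resize(out_size, Image.BICUBIC)
```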
GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.
In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications.
Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim · Feb 9, 2026
Citations: 0
Rubric Rating · Automatic Metrics · Coding
However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision.
Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le, Daixuan Cheng · Feb 3, 2026
Citations: 0
Simulation Env · Long Horizon · Coding
In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents.
SWE-Master systematically explores the complete agent development pipeline, including teacher-trajectory synthesis and data curation, long-horizon SFT, RL with real execution feedback, and inference framework design.
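One stage of that pipeline, RL with real execution feedback, reduces to a reward function over the repository after the agent's patch is applied. A minimal sketch; the pytest command and binary pass/fail reward are assumptions for illustration, not SWE-Master's actual scoring:

```python
import subprocess

def execution_reward(repo_dir: str, test_cmd: list[str] | None = None,
                     timeout: int = 600) -> float:
    """Binary reward from real execution: 1.0 if the repo's test suite
    passes after the agent's patch is applied, else 0.0."""
    cmd = test_cmd or ["python", "-m", "pytest", "-q"]
    try:
        result = subprocess.run(cmd, cwd=repo_dir, capture_output=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0  # hanging tests count as failure
    return 1.0 if result.returncode == 0 else 0.0
```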
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026
Citations: 0
Simulation Env · Long Horizon · Coding
While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment.
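A minimal sketch of such an intent-to-sub-goal loop, with a hypothetical `env` interface (observe / is_admissible / execute) and `llm_plan` callable standing in for the dynamic environment and the planner; none of these names come from the paper:

```python
def plan_and_act(llm_plan, env, intent: str, max_steps: int = 20) -> bool:
    """Decompose a high-level intent into sub-goals while respecting the
    logic of the observed environment.

    `llm_plan(intent, observation)` returns an ordered list of sub-goal
    strings; the loop only executes sub-goals the environment admits.
    """
    for _ in range(max_steps):
        obs = env.observe()
        subgoals = llm_plan(intent, obs)
        if not subgoals:
            return True  # planner reports the intent is satisfied
        # Execute the first sub-goal the environment currently admits.
        admissible = next((g for g in subgoals if env.is_admissible(g)), None)
        if admissible is None:
            return False  # plan is inconsistent with the observed state
        env.execute(admissible)
    return False
```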
Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which align poorly with long-horizon task success.
To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent.
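A rollout under this decoupling might look like the following sketch; the agent interfaces and the end-of-episode success reward are illustrative assumptions, not GOPO's published training loop:

```python
def gopo_episode(expert_agent, service_agent, env, max_turns: int = 10):
    """One dialogue rollout in the decoupled hierarchy: the Expert Agent
    picks a high-level strategy each turn, the Customer Service Agent
    realizes it as a response, and only end-of-task success (not token
    likelihood) provides the reward signal."""
    trajectory = []
    state = env.reset()
    for _ in range(max_turns):
        strategy = expert_agent.plan(state)                 # high-level action
        response = service_agent.generate(state, strategy)  # low-level realization
        state, done = env.step(response)
        trajectory.append((strategy, response))
        if done:
            break
    reward = env.task_success()  # long-horizon goal signal
    return trajectory, reward
```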
Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026
Citations: 0
Rubric Rating · Critique Edit · Automatic Metrics · Coding
Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs.
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
Recognizing linguistic diversity, we construct the benchmark in both English and Hindi, establishing a framework that is open for further extension to other regional languages.
Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025
Citations: 0
Automatic Metrics · Long Horizon · Coding
On the Episodic Memory Benchmark (EpBench; Huet et al., 2025), comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG-based baselines by up to 20%.
More broadly, GSW offers a concrete blueprint for endowing LLMs with human-like episodic memory, paving the way for more capable agents that can reason over long horizons.
NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji · Oct 28, 2025
Citations: 0
Simulation Env · Long Horizon · Coding · Multilingual
These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nv
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025
Citations: 0
Human Eval · Automatic Metrics · Coding
We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuances.
Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English.
Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang · Oct 27, 2025
Citations: 0
Pairwise Preference · Human Eval · Coding
Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation.
However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation.
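This motivates augmenting the judge with programmatic verification. A minimal sketch, where the constraint spec and the specific checks are hypothetical examples rather than the paper's method:

```python
def verify_constraints(response: str, constraints: dict) -> dict:
    """Programmatic checks that an intrinsic, text-only judge tends to get
    wrong. `constraints` is an illustrative spec, e.g.
    {"max_words": 100, "must_include": ["O(n log n)"],
     "sum_equals": ("12 + 30", 42)}."""
    results = {}
    if "max_words" in constraints:
        results["max_words"] = len(response.split()) <= constraints["max_words"]
    if "must_include" in constraints:
        results["must_include"] = all(s in response
                                      for s in constraints["must_include"])
    if "sum_equals" in constraints:
        expr, expected = constraints["sum_equals"]
        # Exact arithmetic instead of the judge "eyeballing" the computation.
        results["sum_equals"] = eval(expr, {"__builtins__": {}}) == expected
    return results
```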
Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong · Oct 14, 2025
Citations: 0
Pairwise Preference · Automatic Metrics · Coding
Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.
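The first of those downstream tasks, preference data synthesis, is easy to sketch: assuming hypothetical `generate` and `judge` callables (not the paper's interfaces), a scorer can label sampled responses into DPO-style pairs:

```python
def synthesize_preference_pair(generate, judge, prompt: str, n_samples: int = 4):
    """Turn a scoring judge into a preference-data factory: sample candidate
    responses, rank them, and emit a (prompt, chosen, rejected) triple."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    ranked = sorted(candidates, key=lambda r: judge(prompt, r))
    chosen, rejected = ranked[-1], ranked[0]
    if chosen == rejected:
        return None  # degenerate: all samples judged identical
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```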
Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun, Chunping Li · Oct 5, 2025
Citations: 0
Pairwise Preference · Automatic Metrics · Coding
On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture.
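For reference, STS benchmarks conventionally report Spearman's rank correlation scaled by 100, so 48.18 corresponds to a raw coefficient of about 0.48. A minimal computation sketch:

```python
from scipy.stats import spearmanr

def sts_spearman(predicted: list, gold: list) -> float:
    """Spearman rank correlation between predicted and gold similarity
    scores, scaled by 100 as reported on C-STS-style leaderboards."""
    rho, _ = spearmanr(predicted, gold)
    return 100.0 * rho
```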
These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined.