- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0
Human Eval · Automatic Metrics · General
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
- PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0
Automatic Metrics · Coding
We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.
- A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li · Sep 17, 2025 · Citations: 0
Automatic Metrics · Law
This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
- Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
Simona-Vasilica Oprea, Adela Bâra · Apr 1, 2026 · Citations: 0
Automatic Metrics · General
Using the Anthropic HH-RLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task.
- IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin · Mar 11, 2026 · Citations: 0
Automatic Metrics · General
Instruction hierarchy (IH) is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections.
- Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda · Mar 7, 2026 · Citations: 0
Automatic Metrics · General
Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale,…
- Robust Preference Alignment via Directional Neighborhood Consensus
Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei · Oct 23, 2025 · Citations: 0
Automatic Metrics · General
To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method that leverages directional neighborhood consensus.
- Contextualized Privacy Defense for LLM Agents
Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie · Mar 3, 2026 · Citations: 0
Simulation Env · General
LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability.
- Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu, Alexander Robey, Changliu Liu · Feb 28, 2025 · Citations: 0
General
To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues.
- Towards Automated Community Notes Generation with Large Vision Language Models for Combating Contextual Deception
Jin Ma, Jingwen Yan, Mohammed Aldeen, Ethan Anderson, Taran Kavuru · Mar 23, 2026 · Citations: 0
Automatic Metrics · General
However, its reliance on human contributors limits both timeliness and scalability.
- One Model for All: Multi-Objective Controllable Language Models
Qiang He, Yucheng Yang, Tianyi Zhou, Meng Fang, Mykola Pechenizkiy · Apr 6, 2026 · Citations: 0
- FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation
Juhyun Oh, Nayeon Lee, Chani Jung, Jiho Jin, Junho Myung · Mar 4, 2026 · Citations: 0