- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations Human Eval Llm As Judge Long Horizon
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
- Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0
Red Team Llm As Judge Multi Agent
Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.
- VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Zelai Xu, Ruize Zhang, Chao Yu, Huining Yuan, Xiangmin Yi · Feb 4, 2025 · Citations: 0
Demonstrations Automatic Metrics Simulation Env Multi Agent
We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement…
- TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0
Red Team Automatic Metrics Long Horizon
As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
- Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0
Demonstrations Simulation Env Multi Agent
To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
- AJAR: Adaptive Jailbreak Architecture for Red-teaming
Yipu Dou, Wang Yang · Jan 16, 2026 · Citations: 0
Red Team Simulation Env
Large language model (LLM) safety evaluation is moving from content moderation to action security as modern systems gain persistent state, tool access, and autonomous control loops.
- A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel · Feb 4, 2026 · Citations: 0
Red Team Llm As Judge
Automated LLM-as-a-Judge frameworks have become the de facto standard for scalable evaluation across natural language processing.
- RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0
Red Team Automatic Metrics Web Browsing
Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities.
- SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0
Demonstrations Simulation Env Long Horizon
To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.
- RAPTOR: A Foundation Policy for Quadrotor Control
Jonas Eschmann, Dario Albani, Giuseppe Loianno · Sep 15, 2025 · Citations: 0
Demonstrations Simulation Env Long Horizon
Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car.
- Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning
Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji · May 7, 2025 · Citations: 0
Demonstrations Automatic Metrics Long Horizon
The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors.
- SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions
Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou · Mar 14, 2026 · Citations: 0
Red Team Automatic Metrics
The benchmark is constructed from U.S.
- Watch and Learn: Learning to Use Computers from Online Videos
Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva · Oct 6, 2025 · Citations: 0
Demonstrations Long Horizon
Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data.
- Efficient Agent Training for Computer Use
Yanheng He, Jiahe Jin, Pengfei Liu · May 20, 2025 · Citations: 0
Demonstrations Long Horizon
We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations.
- MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0
Demonstrations Simulation Env Long Horizon
Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
- SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0
Demonstrations Simulation Env Multi Agent
Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
- Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng · Mar 4, 2026 · Citations: 0
Demonstrations Simulation Env
Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects…
- Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025 · Citations: 0
Red Team Llm As Judge
We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal-compliance direction.
- DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao · Oct 10, 2025 · Citations: 0
Demonstrations Simulation Env
Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped.
- Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0
Demonstrations Simulation Env
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
- IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
Aayush Mishra, Daniel Khashabi, Anqi Liu · Sep 26, 2025 · Citations: 0
Demonstrations Automatic Metrics
Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families.
- Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
Qingyang Xu, Yaling Shen, Stephanie Fong, Zimu Wang, Yiwen Jiang · Apr 6, 2026 · Citations: 0
Red Team Simulation Env
The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions.
- Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models
Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · Mar 26, 2026 · Citations: 0
Red Team Llm As Judge
In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while…
- Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies
Siddharth Srikanth, Freddie Liang, Ya-Chuan Hsu, Varun Bhatt, Shihan Zhao · Mar 12, 2026 · Citations: 0
Red Team Simulation Env
Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates.
- Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation
Eeham Khan, Luis Rodriguez, Marc Queudot · Mar 10, 2026 · Citations: 0
Demonstrations Automatic Metrics
We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets.
- WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference
Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du · Mar 11, 2026 · Citations: 0
Red Team Automatic Metrics Multi Agent
Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains insufficiently studied.
- MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0
Red Team Automatic Metrics Web Browsing
Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
- What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026 · Citations: 0
Red Team Automatic Metrics Tool Use
This paper presents a comprehensive empirical study on the safety alignment capabilities.
- A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li · Sep 17, 2025 · Citations: 0
Red Team Automatic Metrics
This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
- Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0
Red Team Automatic Metrics
These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
- LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Ziming Zhu, Chenglong Wang, Haosong Xv, Shunjie Xing, Yifu Huo · Aug 26, 2025 · Citations: 0
Demonstrations Automatic Metrics Multi Agent
In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge.
- CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou · Jan 28, 2025 · Citations: 0
Pairwise Preference Demonstrations Automatic Metrics Web Browsing
We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0
Demonstrations Long Horizon
Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
- Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore · Feb 23, 2026 · Citations: 0
Red Team Simulation Env
Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue.
- On Discovering Algorithms for Adversarial Imitation Learning
Shashank Reddy Chirra, Jayden Teoh, Praveen Paruchuri, Pradeep Varakantham · Oct 1, 2025 · Citations: 0
Demonstrations Simulation Env
RA functions in AIL are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity.
- RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le · Mar 5, 2026 · Citations: 0
Demonstrations Long Horizon
To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones.
- TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0
Demonstrations Web Browsing
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
- IROSA: Interactive Robot Skill Adaptation using Natural Language
Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp · Mar 4, 2026 · Citations: 0
Demonstrations Long Horizon
We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and…
- Continual Robot Skill and Task Learning via Dialogue
Weiwei Gu, Suresh Kondepudi, Anmol Gupta, Lixiao Huang, Nakul Gopalan · Sep 5, 2024 · Citations: 0
Demonstrations Simulation Env
In this work we present a framework for robots to continually learn tasks and visuo-motor skills and query for novel skills via dialog interactions with human users.
- Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao · Apr 7, 2025 · Citations: 0
Red Team Automatic Metrics
We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies…
- State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation
Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi · Apr 7, 2026 · Citations: 0
Demonstrations Automatic Metrics
Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by…
- Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling
Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty · Mar 15, 2026 · Citations: 0
Red Team Automatic Metrics
While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed…
- IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin · Mar 11, 2026 · Citations: 0
Red Team Automatic Metrics
IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections.
- Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda · Mar 7, 2026 · Citations: 0
Pairwise Preference Red Team Automatic Metrics
Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale,…
- IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation
Md Mofijul Islam, Md Sirajus Salekin, Joe King, Priyashree Roy, Vamsi Thilak Gudi · Feb 26, 2026 · Citations: 0
Demonstrations Automatic Metrics
We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging…
- Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan · Feb 26, 2026 · Citations: 0
Red Team Automatic Metrics
Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.
- MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026 · Citations: 0
Red Team Automatic Metrics
We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold.
- FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026 · Citations: 0
Red Team Automatic Metrics
A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.
- Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026 · Citations: 0
Demonstrations Automatic Metrics
Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
- Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025 · Citations: 0
Red Team Automatic Metrics
Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.
- RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li · Oct 29, 2025 · Citations: 0
Red Team Automatic Metrics
As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs.
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
Zherui Li, Zheng Nie, Zhenhong Zhou, Yue Liu, Yitong Zhang · Sep 29, 2025 · Citations: 0
Red Team Automatic Metrics
Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final…
- Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao · Jul 8, 2025 · Citations: 0
Pairwise Preference Red Team Automatic Metrics
Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization…
- When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025 · Citations: 0
Red Team Automatic Metrics
In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
- Incentivizing Strong Reasoning from Weak Supervision
Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao · May 26, 2025 · Citations: 0
Demonstrations Automatic Metrics
Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks.
- Maximizing Asynchronicity in Event-based Neural Networks
Haiqing Hao, Nikola Zubić, Weihua He, Zhipeng Sui, Davide Scaramuzza · May 16, 2025 · Citations: 0
Demonstrations Automatic Metrics
Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML).
- Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek · Feb 28, 2026 · Citations: 0
Rubric Rating Demonstrations
GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
- SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes · Feb 24, 2026 · Citations: 0
Rubric Rating Red Team
We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items.
- Oracular Programming: A Modular Foundation for Building LLM-Enabled Software
Jonathan Laurent, André Platzer · Feb 7, 2025 · Citations: 0
Demonstrations Web Browsing
We propose oracular programming: a foundational paradigm for integrating traditional, explicit computations with inductive oracles such as LLMs.
- Learning to Answer from Correct Demonstrations
Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma · Oct 17, 2025 · Citations: 0
Demonstrations Automatic Metrics
We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time.