HFEPX Hub

Demonstrations Or Red Team Papers

Updated from current HFEPX corpus (Apr 12, 2026). 130 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 130 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: AdvBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 22, 2026.

Papers: 130 Last published: Mar 22, 2026 Global RSS Tag RSS

DemonstrationsRed Team

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Medium .

Analysis blocks below are computed from the currently loaded sample (60 of 130 total papers in this hub).

All Sampled Papers (60) Replication-Ready Only (8)

High-Signal Coverage

100.0%

60 / 60 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

8 papers are replication-ready (benchmark + metric + explicit evaluation mode).
1 papers support judge-vs-human agreement analysis.
1 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by demonstration data.
automatic metrics appears in 25.4% of papers in this hub.
AdvBench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

1 sampled papers report both human evaluation and LLM-as-judge, supporting direct agreement checks.
Most common quality-control signal is rater calibration (0.8% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Benchmark Interpretation

AdvBench appears in 1.5% of hub papers (2/130); use this cohort for benchmark-matched comparisons.
DROP appears in 1.5% of hub papers (2/130); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 6.9% of hub papers (9/130); compare with a secondary metric before ranking methods.
jailbreak success rate is reported in 6.9% of hub papers (9/130); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Strong: Papers with explicit human feedback

Coverage is strong (100% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0.8% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (16.2% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (30.8% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (13.8% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (13.1% vs 35% target).

Strengths

Strong human-feedback signal (100% of papers).
Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 0.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (13.8% coverage).
Annotation unit is under-specified (13.1% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (AdvBench vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and jailbreak success rate.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries (Expanded)

Recommended Queries

Judge vs Human Agreement Benchmark Slice: AdvBench Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabe…

Highest protocol score with explicit human/eval signal plus WebArena.

Strongest benchmark reference

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step To…

Tracesafe-Bench with accuracy gives a fast comparison anchor.

Strongest recent paper

SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions

Useful for current practice scanning; published Mar 14, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Mar 22, 2026 · Citations: 0 · Score: 10.0

HF: Demonstrations · Eval: Human Eval, Llm As Judge · Benchmark: WebArena · Metric: Precision
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Apr 8, 2026 · Citations: 0 · Score: 8.0

HF: Red Team · Eval: Automatic Metrics · Benchmark: Tracesafe Bench · Metric: Accuracy
SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions
Mar 14, 2026 · Citations: 0 · Score: 8.0

HF: Red Team · Eval: Automatic Metrics · Benchmark: Semeval · Metric: F1
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Feb 27, 2026 · Citations: 0 · Score: 7.5

HF: Red Team · Eval: Llm As Judge · Benchmark: AdvBench · Metric: Success rate
AJAR: Adaptive Jailbreak Architecture for Red-teaming
Jan 16, 2026 · Citations: 0 · Score: 7.5

HF: Red Team · Eval: Simulation Env · Benchmark: Harmbench · Metric: Success rate
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Mar 19, 2026 · Citations: 0 · Score: 6.5

HF: Demonstrations · Eval: Simulation Env · Benchmark: Mapg Bench · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling Mar 22, 2026	Yes Demonstrations	Human Eval , Llm As Judge	WebArena , ToolBench	Precision , Pass@1	Not Reported
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories Apr 8, 2026	Yes Red Team	Automatic Metrics	Tracesafe Bench	Accuracy	Not Reported
SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions Mar 14, 2026	Yes Red Team	Automatic Metrics	Semeval	F1 , F1 macro	Not Reported
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking Feb 27, 2026	Yes Red Team	Llm As Judge	AdvBench , Jbf Eval	Success rate , Jailbreak success rate	Not Reported
AJAR: Adaptive Jailbreak Architecture for Red-teaming Jan 16, 2026	Yes Red Team	Simulation Env	Harmbench	Success rate , Jailbreak success rate	Not Reported
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation Mar 19, 2026	Yes Demonstrations	Simulation Env	Mapg Bench	Not Reported	Not Reported
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments May 28, 2025	Yes Red Team	Automatic Metrics	Rtc Bench	Jailbreak success rate	Not Reported
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness Sep 17, 2025	Yes Red Team	Automatic Metrics	AdvBench	Helpfulness	Not Reported
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP Aug 28, 2025	Yes Red Team	Automatic Metrics	DROP	Accuracy	Not Reported
A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness Feb 4, 2026	Yes Red Team	Llm As Judge	Reliablebench	Not Reported	Not Reported
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks Mar 4, 2026	Yes Demonstrations	Simulation Env	MiniWoB++	Not Reported	Not Reported
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning Sep 26, 2025	Yes Demonstrations	Automatic Metrics	Not Reported	Accuracy , Cost	Calibration

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AgentHER: Hindsight Experience Replay for LLM Agent…	TraceSafe: A Systematic Assessment of LLM Guardrail…	SemEval-2026 Task 6: CLARITY -- Unmasking Political…
Human Feedback	Demonstrations	Red Team	Red Team
Evaluation Modes	Human Eval, Llm As Judge	Automatic Metrics	Automatic Metrics
Benchmarks	WebArena, ToolBench	Tracesafe Bench	Semeval
Metrics	Precision, Pass@1	Accuracy	F1, F1 macro
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Domain Experts
Annotation Unit	Trajectory	Trajectory	Unknown

Research Utility Snapshot

Human Feedback Mix

Demonstrations (69)
Red Team (61)
Pairwise Preference (7)
Rubric Rating (2)

Evaluation Modes

Automatic Metrics (33)
Simulation Env (16)
Llm As Judge (5)
Human Eval (1)

Top Benchmarks

AdvBench (2)
DROP (2)
HotpotQA (2)
Windowsagentarena (2)

Top Metrics

Accuracy (9)
Jailbreak success rate (9)
Success rate (9)
Cost (6)

Rater Population Mix

Domain Experts (16)
Mixed (2)

Quality Controls

Calibration (1)

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 26.7% · metrics 61.7% · quality controls 1.7%.

Top Papers

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0

Demonstrations Human EvalLlm As Judge Long Horizon

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0

Red Team Llm As Judge Multi Agent

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Zelai Xu, Ruize Zhang, Chao Yu, Huining Yuan, Xiangmin Yi · Feb 4, 2025 · Citations: 0

Demonstrations Automatic MetricsSimulation Env Multi Agent

We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement…
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0

Red Team Automatic Metrics Long Horizon

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0

Demonstrations Simulation Env Multi Agent

To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
AJAR: Adaptive Jailbreak Architecture for Red-teaming
Yipu Dou, Wang Yang · Jan 16, 2026 · Citations: 0

Red Team Simulation Env

Large language model (LLM) safety evaluation is moving from content moderation to action security as modern systems gain persistent state, tool access, and autonomous control loops.
A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel · Feb 4, 2026 · Citations: 0

Red Team Llm As Judge

Automated LLM-as-a-Judge frameworks have become the de facto standard for scalable evaluation across natural language processing.
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0

Red Team Automatic Metrics Web Browsing

Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities.
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0

Demonstrations Simulation Env Long Horizon

To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.
RAPTOR: A Foundation Policy for Quadrotor Control
Jonas Eschmann, Dario Albani, Giuseppe Loianno · Sep 15, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car.
Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning
Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji · May 7, 2025 · Citations: 0

Demonstrations Automatic Metrics Long Horizon

The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors.
SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions
Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou · Mar 14, 2026 · Citations: 0

Red Team Automatic Metrics

The benchmark is constructed from U.S.
Watch and Learn: Learning to Use Computers from Online Videos
Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva · Oct 6, 2025 · Citations: 0

Demonstrations Long Horizon

Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data.
Efficient Agent Training for Computer Use
Yanheng He, Jiahe Jin, Pengfei Liu · May 20, 2025 · Citations: 0

Demonstrations Long Horizon

We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0

Demonstrations Simulation Env Multi Agent

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng · Mar 4, 2026 · Citations: 0

Demonstrations Simulation Env

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects…
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025 · Citations: 0

Red Team Llm As Judge

We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction.
DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao · Oct 10, 2025 · Citations: 0

Demonstrations Simulation Env

Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped.
Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0

Demonstrations Simulation Env

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
Aayush Mishra, Daniel Khashabi, Anqi Liu · Sep 26, 2025 · Citations: 0

Demonstrations Automatic Metrics

Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families.
Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
Qingyang Xu, Yaling Shen, Stephanie Fong, Zimu Wang, Yiwen Jiang · Apr 6, 2026 · Citations: 0

Red Team Simulation Env

The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions.
Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models
Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · Mar 26, 2026 · Citations: 0

Red Team Llm As Judge

In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while…
Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies
Siddharth Srikanth, Freddie Liang, Ya-Chuan Hsu, Varun Bhatt, Shihan Zhao · Mar 12, 2026 · Citations: 0

Red Team Simulation Env

Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates.
Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation
Eeham Khan, Luis Rodriguez, Marc Queudot · Mar 10, 2026 · Citations: 0

Demonstrations Automatic Metrics

We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and rerank- ing under constrained token budgets.
WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference
Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du · Mar 11, 2026 · Citations: 0

Red Team Automatic Metrics Multi Agent

Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains insufficiently studied.
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0

Red Team Automatic Metrics Web Browsing

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026 · Citations: 0

Red Team Automatic Metrics Tool Use

This paper presents a comprehensive empirical study on the safety alignment capabilities.
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li · Sep 17, 2025 · Citations: 0

Red Team Automatic Metrics

This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0

Red Team Automatic Metrics

These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Ziming Zhu, Chenglong Wang, Haosong Xv, Shunjie Xing, Yifu Huo · Aug 26, 2025 · Citations: 0

Demonstrations Automatic Metrics Multi Agent

In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge.
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou · Jan 28, 2025 · Citations: 0

Pairwise PreferenceDemonstrations Automatic Metrics Web Browsing

We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0

Demonstrations Long Horizon

Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore · Feb 23, 2026 · Citations: 0

Red Team Simulation Env

Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue.
On Discovering Algorithms for Adversarial Imitation Learning
Shashank Reddy Chirra, Jayden Teoh, Praveen Paruchuri, Pradeep Varakantham · Oct 1, 2025 · Citations: 0

Demonstrations Simulation Env

RA functions in AIL are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity.
RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le · Mar 5, 2026 · Citations: 0

Demonstrations Long Horizon

To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones.
TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0

Demonstrations Web Browsing

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
IROSA: Interactive Robot Skill Adaptation using Natural Language
Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp · Mar 4, 2026 · Citations: 0

Demonstrations Long Horizon

We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and…
Continual Robot Skill and Task Learning via Dialogue
Weiwei Gu, Suresh Kondepudi, Anmol Gupta, Lixiao Huang, Nakul Gopalan · Sep 5, 2024 · Citations: 0

Demonstrations Simulation Env

In this work we present a framework for robots to continually learn tasks and visuo-motor skills and query for novel skills via dialog interactions with human users.
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao · Apr 7, 2025 · Citations: 0

Red Team Automatic Metrics

We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies…
State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation
Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi · Apr 7, 2026 · Citations: 0

Demonstrations Automatic Metrics

Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by…
Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling
Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty · Mar 15, 2026 · Citations: 0

Red Team Automatic Metrics

While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed…
IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin · Mar 11, 2026 · Citations: 0

Red Team Automatic Metrics

IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections.
Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda · Mar 7, 2026 · Citations: 0

Pairwise PreferenceRed Team Automatic Metrics

Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale,…
IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation
Md Mofijul Islam, Md Sirajus Salekin, Joe King, Priyashree Roy, Vamsi Thilak Gudi · Feb 26, 2026 · Citations: 0

Demonstrations Automatic Metrics

We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging…
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan · Feb 26, 2026 · Citations: 0

Red Team Automatic Metrics

Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026 · Citations: 0

Red Team Automatic Metrics

We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold.
FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026 · Citations: 0

Red Team Automatic Metrics

A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026 · Citations: 0

Demonstrations Automatic Metrics

Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025 · Citations: 0

Red Team Automatic Metrics

Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.
RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
André V. Duarte, Xuying li, Bin Zeng, Arlindo L. Oliveira, Lei Li · Oct 29, 2025 · Citations: 0

Red Team Automatic Metrics

As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs.
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
Zherui Li, Zheng Nie, Zhenhong Zhou, Yue Liu, Yitong Zhang · Sep 29, 2025 · Citations: 0

Red Team Automatic Metrics

Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final…
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao · Jul 8, 2025 · Citations: 0

Pairwise PreferenceRed Team Automatic Metrics

Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization…
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025 · Citations: 0

Red Team Automatic Metrics

In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
Incentivizing Strong Reasoning from Weak Supervision
Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao · May 26, 2025 · Citations: 0

Demonstrations Automatic Metrics

Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks.
Maximizing Asynchronicity in Event-based Neural Networks
Haiqing Hao, Nikola Zubić, Weihua He, Zhipeng Sui, Davide Scaramuzza · May 16, 2025 · Citations: 0

Demonstrations Automatic Metrics

Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML).
Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek · Feb 28, 2026 · Citations: 0

Rubric RatingDemonstrations

GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes · Feb 24, 2026 · Citations: 0

Rubric RatingRed Team

We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items.
Oracular Programming: A Modular Foundation for Building LLM-Enabled Software
Jonathan Laurent, André Platzer · Feb 7, 2025 · Citations: 0

Demonstrations Web Browsing

We propose oracular programming: a foundational paradigm for integrating traditional, explicit computations with inductive oracles such as LLMs.
Learning to Answer from Correct Demonstrations
Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma · Oct 17, 2025 · Citations: 0

Demonstrations Automatic Metrics

We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now