HFEPX Hub

CS.AI + Demonstrations Papers

Updated from current HFEPX corpus (Apr 9, 2026). 48 papers are grouped in this hub page. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: Windowsagentarena. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 22, 2026.

Papers: 48 · Last published: Mar 22, 2026

CS.AI · Demonstrations

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage: 100.0% (48 / 48 sampled papers are not low-signal flagged).
Replication-Ready Set: benchmark + metric + eval mode explicitly present.
Judge/Human Comparability: papers containing both `human_eval` and `llm_as_judge`.

1 paper is replication-ready (benchmark + metric + explicit evaluation mode).
1 paper supports judge-vs-human agreement analysis.
1 paper reports explicit quality controls (calibration/adjudication/IAA).
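
The three counts above follow from simple filters over per-paper protocol fields. The sketch below is one way to express them, assuming a hypothetical `PaperRecord` structure (not the HFEPX schema) and three rows transcribed by hand from the protocol matrix on this page.

```python
from dataclasses import dataclass, field


@dataclass
class PaperRecord:
    """Hypothetical per-paper record; field names are illustrative only."""
    title: str
    eval_modes: set = field(default_factory=set)
    benchmarks: set = field(default_factory=set)
    metrics: set = field(default_factory=set)
    quality_controls: set = field(default_factory=set)


def is_replication_ready(p: PaperRecord) -> bool:
    # Benchmark + metric + explicit evaluation mode must all be present.
    return bool(p.benchmarks and p.metrics and p.eval_modes)


def supports_judge_human_comparison(p: PaperRecord) -> bool:
    # Both human_eval and llm_as_judge must be reported.
    return {"human_eval", "llm_as_judge"} <= p.eval_modes


def reports_quality_controls(p: PaperRecord) -> bool:
    return bool(p.quality_controls)


# Three rows copied from the protocol matrix (short titles for brevity).
papers = [
    PaperRecord("AgentHER", {"human_eval", "llm_as_judge"},
                {"WebArena", "ToolBench"}, {"precision", "pass@1"}),
    PaperRecord("IA2", {"automatic_metrics"}, set(),
                {"accuracy", "cost"}, {"calibration"}),
    PaperRecord("Schema for In-Context Learning", set(), {"GPQA"}),
]

replication_ready = [p.title for p in papers if is_replication_ready(p)]
comparable = [p.title for p in papers if supports_judge_human_comparison(p)]
with_qc = [p.title for p in papers if reports_quality_controls(p)]
```

Keeping each criterion as its own predicate makes it easy to tighten one definition (for example, requiring a named rater population) without touching the others.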

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

100% of papers report explicit human-feedback signals, led by demonstration data.
Simulation environments appear in 20.8% of papers in this hub.
Windowsagentarena is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

1 sampled paper reports both human evaluation and LLM-as-judge, supporting direct agreement checks.
The most common quality-control signal is rater calibration (2.1% of papers).
Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.

Benchmark Interpretation

Windowsagentarena appears in 4.2% of hub papers (2/48); use this cohort for benchmark-matched comparisons.
ALFWorld appears in 2.1% of hub papers (1/48); use this cohort for benchmark-matched comparisons.
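
The cohort sizes and percentages above can be reproduced by grouping papers by benchmark. This is a minimal sketch, assuming a hypothetical title-to-benchmarks mapping transcribed from the protocol matrix rows on this page; `HUB_SIZE` is the 48-paper hub total, and the function names are illustrative.

```python
from collections import defaultdict

HUB_SIZE = 48  # total papers in this hub

# Hypothetical input: benchmark names per paper, copied from the protocol matrix.
paper_benchmarks = {
    "Watch and Learn": ["OSWorld", "Windowsagentarena"],
    "Efficient Agent Training for Computer Use": ["Windowsagentarena"],
    "Structured Agent Distillation": ["ALFWorld", "WebShop"],
}


def build_cohorts(mapping):
    """Group paper titles by benchmark for benchmark-matched comparisons."""
    cohorts = defaultdict(list)
    for title, benchmarks in mapping.items():
        for name in benchmarks:
            cohorts[name].append(title)
    return dict(cohorts)


def hub_share(count, total=HUB_SIZE):
    """Share of hub papers as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


cohorts = build_cohorts(paper_benchmarks)
for name in ("Windowsagentarena", "ALFWorld"):
    size = len(cohorts[name])
    print(f"{name}: {size}/{HUB_SIZE} = {hub_share(size)}%")
```

With cohorts this small (two papers and one paper), a benchmark-matched comparison is a starting point for reading rather than a basis for ranking methods.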

Metric Interpretation

Cost is reported in 8.3% of hub papers (4/48); compare with a secondary metric before ranking methods.
Precision is reported in 4.2% of hub papers (2/48); compare with a secondary metric before ranking methods.

Researcher Checklist

Strong: Papers with explicit human feedback. Coverage is strong (100% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (2.1% vs 30% target).
Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (16.7% vs 35% target).
Gap: Papers naming evaluation metrics. Coverage is a replication risk (18.8% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (20.8% vs 35% target).
Moderate: Papers with known annotation unit. Coverage is usable but incomplete (22.9% vs 35% target).
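
The checklist labels compare observed coverage against a per-field target. The page does not state where the boundary between Gap and Moderate lies; the sketch below assumes a hypothetical cutoff at 60% of the target, chosen only because it reproduces the six labels listed here. Treat `moderate_ratio` as a guess, not as the hub's rule.

```python
def coverage_band(coverage_pct: float, target_pct: float,
                  moderate_ratio: float = 0.6) -> str:
    """Label coverage relative to its target.

    moderate_ratio is an assumed cutoff (not documented by the hub):
    coverage at or above the target is "Strong", at or above
    moderate_ratio * target is "Moderate", and anything lower is "Gap".
    """
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (field, coverage %, target %) as listed in the checklist.
checklist = [
    ("explicit human feedback", 100.0, 45.0),
    ("quality controls", 2.1, 30.0),
    ("benchmarks/datasets", 16.7, 35.0),
    ("evaluation metrics", 18.8, 35.0),
    ("known rater population", 20.8, 35.0),
    ("known annotation unit", 22.9, 35.0),
]

bands = {name: coverage_band(cov, target) for name, cov, target in checklist}
```

Any cutoff between roughly 0.60 and 0.65 of the target would produce the same labels for these six rows, so the exact value cannot be recovered from this page alone.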

Strengths

Strong human-feedback signal (100% of papers).
Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
Agentic evaluation appears in 31.3% of papers.

Known Gaps

Only 2.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (20.8% coverage).
Annotation unit is under-specified (22.9% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (Windowsagentarena vs ALFWorld) before comparing methods.
Track metric sensitivity by reporting both cost and precision.
Add inter-annotator agreement checks when reproducing these protocols.
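
For the agreement analyses suggested above, one common statistic is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below uses made-up verdict lists (illustrative only, not data from any paper in this hub); the same function can serve as an inter-annotator agreement check between two human raters.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative per-trajectory verdicts (1 = success, 0 = failure).
human_verdicts = [1, 1, 0, 0, 1, 0, 1, 1]
judge_verdicts = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(human_verdicts, judge_verdicts)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.75"
```

Here the raters agree on 7 of 8 items (0.875 observed) against 0.5 expected by chance, giving 0.75. Computing kappa separately per benchmark cohort is one way to see whether judge-human agreement differs across settings.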

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: Windowsagentarena
Metric Slice: cost
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabe…

Highest protocol score with explicit human/eval signal plus WebArena.

Strongest benchmark reference

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vi…

Mapg-Bench gives a fast comparison anchor.

Strongest recent paper

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying M…

Useful for current practice scanning; published Mar 4, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Mar 22, 2026 · Citations: 0 · Score: 10.0
HF: Demonstrations · Eval: Human Eval, LLM as Judge · Benchmark: WebArena · Metric: Precision

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Mar 19, 2026 · Citations: 0 · Score: 6.5
HF: Demonstrations · Eval: Simulation Env · Benchmark: Mapg Bench · Metric: Not reported

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Mar 4, 2026 · Citations: 0 · Score: 6.0
HF: Demonstrations · Eval: Simulation Env · Benchmark: MiniWoB++ · Metric: Not reported

IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
Sep 26, 2025 · Citations: 0 · Score: 6.0
HF: Demonstrations · Eval: Automatic Metrics · Benchmark: Not reported · Metric: Accuracy

Automatic In-Domain Exemplar Construction and LLM-Based Refinement of Multi-LLM Expansions for Query Expansion
Feb 9, 2026 · Citations: 0 · Score: 6.0
HF: Demonstrations · Eval: Not reported · Benchmark: TREC · Metric: Not reported

A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems
Mar 23, 2026 · Citations: 0 · Score: 6.0
HF: Demonstrations · Eval: Not reported · Benchmark: Not reported · Metric: Precision

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Published	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling	Mar 22, 2026	Yes (Demonstrations)	Human Eval, LLM as Judge	WebArena, ToolBench	Precision, Pass@1	Not reported
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation	Mar 19, 2026	Yes (Demonstrations)	Simulation Env	Mapg Bench	Not reported	Not reported
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks	Mar 4, 2026	Yes (Demonstrations)	Simulation Env	MiniWoB++	Not reported	Not reported
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning	Sep 26, 2025	Yes (Demonstrations)	Automatic Metrics	Not reported	Accuracy, Cost	Calibration
Automatic In-Domain Exemplar Construction and LLM-Based Refinement of Multi-LLM Expansions for Query Expansion	Feb 9, 2026	Yes (Demonstrations)	Not reported	TREC	Not reported	Not reported
A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems	Mar 23, 2026	Yes (Demonstrations)	Not reported	Not reported	Precision	Not reported
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework	Feb 17, 2026	Yes (Demonstrations)	Automatic Metrics	Not reported	Cost	Not reported
Schema for In-Context Learning	Oct 14, 2025	Yes (Demonstrations)	Not reported	GPQA	Not reported	Not reported
Watch and Learn: Learning to Use Computers from Online Videos	Oct 6, 2025	Yes (Demonstrations)	Not reported	OSWorld, Windowsagentarena	Not reported	Not reported
Efficient Agent Training for Computer Use	May 20, 2025	Yes (Demonstrations)	Not reported	Windowsagentarena	Not reported	Not reported
Structured Agent Distillation for Large Language Model	May 20, 2025	Yes (Demonstrations)	Simulation Env	ALFWorld, WebShop	Not reported	Not reported
Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning	May 7, 2025	Yes (Demonstrations)	Automatic Metrics	Not reported	Win rate	Not reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AgentHER: Hindsight Experience Replay for LLM Agent…	Meanings and Measurements: Multi-Agent Probabilisti…	Dual-Modality Multi-Stage Adversarial Safety Traini…
Human Feedback	Demonstrations	Demonstrations	Demonstrations
Evaluation Modes	Human Eval, LLM as Judge	Simulation Env	Simulation Env
Benchmarks	WebArena, ToolBench	Mapg Bench	MiniWoB++
Metrics	Precision, Pass@1	Not reported	Not reported
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Trajectory	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Demonstrations (48)
Pairwise Preference (3)
Rubric Rating (1)

Evaluation Modes

Simulation Env (10)
Automatic Metrics (8)
Human Eval (1)
LLM as Judge (1)

Top Benchmarks

Windowsagentarena (2)
ALFWorld (1)
DROP (1)
GPQA (1)

Top Metrics

Cost (4)
Precision (2)
Win rate (2)
Accuracy (1)

Rater Population Mix

Domain Experts (9)
Mixed (1)

Quality Controls

Calibration (1)

Coverage diagnostics (sample-based): human-feedback 100.0% · benchmarks 16.7% · metrics 18.8% · quality controls 2.1%.

Top Papers

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0

Demonstrations Human Eval LLM as Judge Long Horizon

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Zelai Xu, Ruize Zhang, Chao Yu, Huining Yuan, Xiangmin Yi · Feb 4, 2025 · Citations: 0

Demonstrations Automatic Metrics Simulation Env Multi Agent

We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement…
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0

Demonstrations Simulation Env Multi Agent

To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
RAPTOR: A Foundation Policy for Quadrotor Control
Jonas Eschmann, Dario Albani, Giuseppe Loianno · Sep 15, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car.
Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning
Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji · May 7, 2025 · Citations: 0

Demonstrations Automatic Metrics Long Horizon

The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors.
Watch and Learn: Learning to Use Computers from Online Videos
Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva · Oct 6, 2025 · Citations: 0

Demonstrations Long Horizon

Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data.
Efficient Agent Training for Computer Use
Yanheng He, Jiahe Jin, Pengfei Liu · May 20, 2025 · Citations: 0

Demonstrations Long Horizon

We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0

Demonstrations Simulation Env Multi Agent

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng · Mar 4, 2026 · Citations: 0

Demonstrations Simulation Env

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects…
Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0

Demonstrations Simulation Env

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
Aayush Mishra, Daniel Khashabi, Anqi Liu · Sep 26, 2025 · Citations: 0

Demonstrations Automatic Metrics

Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families.
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou · Jan 28, 2025 · Citations: 0

Pairwise Preference Demonstrations Automatic Metrics Web Browsing

We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0

Demonstrations Long Horizon

Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
On Discovering Algorithms for Adversarial Imitation Learning
Shashank Reddy Chirra, Jayden Teoh, Praveen Paruchuri, Pradeep Varakantham · Oct 1, 2025 · Citations: 0

Demonstrations Simulation Env

RA functions in AIL are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity.
RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le · Mar 5, 2026 · Citations: 0

Demonstrations Long Horizon

To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones.
TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0

Demonstrations Web Browsing

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
IROSA: Interactive Robot Skill Adaptation using Natural Language
Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp · Mar 4, 2026 · Citations: 0

Demonstrations Long Horizon

We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and…
Continual Robot Skill and Task Learning via Dialogue
Weiwei Gu, Suresh Kondepudi, Anmol Gupta, Lixiao Huang, Nakul Gopalan · Sep 5, 2024 · Citations: 0

Demonstrations Simulation Env

In this work we present a framework for robots to continually learn tasks and visuo-motor skills and query for novel skills via dialog interactions with human users.
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026 · Citations: 0

Demonstrations Automatic Metrics

Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
Incentivizing Strong Reasoning from Weak Supervision
Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao · May 26, 2025 · Citations: 0

Demonstrations Automatic Metrics

Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks.
Maximizing Asynchronicity in Event-based Neural Networks
Haiqing Hao, Nikola Zubić, Weihua He, Zhipeng Sui, Davide Scaramuzza · May 16, 2025 · Citations: 0

Demonstrations Automatic Metrics

Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML).
Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek · Feb 28, 2026 · Citations: 0

Rubric Rating Demonstrations

GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
Oracular Programming: A Modular Foundation for Building LLM-Enabled Software
Jonathan Laurent, André Platzer · Feb 7, 2025 · Citations: 0

Demonstrations Web Browsing

We propose oracular programming: a foundational paradigm for integrating traditional, explicit computations with inductive oracles such as LLMs.
Learning to Answer from Correct Demonstrations
Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma · Oct 17, 2025 · Citations: 0

Demonstrations Automatic Metrics

We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time.
Automatic In-Domain Exemplar Construction and LLM-Based Refinement of Multi-LLM Expansions for Query Expansion
Minghan Li, Ercong Nie, Siqi Zhao, Tongna Chen, Huiping Huang · Feb 9, 2026 · Citations: 0

Demonstrations

We present an automated, domain-adaptive QE framework that builds in-domain exemplar pools by harvesting pseudo-relevant passages using a BM25-MonoT5 pipeline.
Schema for In-Context Learning
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung · Oct 14, 2025 · Citations: 0

Demonstrations

Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce Schema-Activated In-Context…
A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems
Seou Choi, Sachin Vaidya, Caio Silva, Shiekh Zia Uddin, Sajib Biswas Shuvo · Mar 23, 2026 · Citations: 0

Demonstrations

In this work, we present a robotics framework for the autonomous construction, alignment, and maintenance of precision optical systems.
Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
Michael Cuccarese · Apr 7, 2026 · Citations: 0

Demonstrations

This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization.
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu · Feb 26, 2026 · Citations: 0

Demonstrations

Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation.
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data
Yuanfu Wang, Zhixuan Liu, Xiangtian Li, Chaochao Lu, Chao Yang · Feb 12, 2026 · Citations: 0

Demonstrations

The prevailing paradigm for training large reasoning models--combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)--is fundamentally constrained by its reliance on high-quality, human-annotated…
End-to-End Low-Level Neural Control of an Industrial-Grade 6D Magnetic Levitation System
Philipp Hartmann, Jannick Stranghöner, Klaus Neumann · Sep 1, 2025 · Citations: 0

Demonstrations

Magnetic levitation is poised to revolutionize industrial automation by integrating flexible in-machine product transport and seamless manipulation.
Training with Pseudo-Code for Instruction Following
Prince Kumar, Rudra Murthy, Riyaz Bhat, Danish Contractor · May 23, 2025 · Citations: 0

Demonstrations

We evaluate our method on 12 publicly available benchmarks spanning instruction-following, mathematical reasoning, and commonsense reasoning, across six base models.
Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment
Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu · Mar 23, 2025 · Citations: 0

Pairwise Preference Demonstrations

Existing techniques are either reward-based (training a reward model on preference pairs and optimizing with reinforcement learning) or reward-free (directly fine-tuning on ranked outputs).
From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences
Yi-Chih Huang · Feb 19, 2026 · Citations: 0

Demonstrations

Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences.
In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
Charlotte Pouw, Hosein Mohebbi, Afra Alishahi, Willem Zuidema · Apr 7, 2026 · Citations: 0

Demonstrations

In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain.
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
Yuning Wu, Ke Wang, Devin Chen, Kai Wei · Mar 11, 2026 · Citations: 0

Demonstrations

To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO).
COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Kartik Sharma, Rakshit S. Trivedi · Mar 6, 2026 · Citations: 0

Pairwise Preference Demonstrations

Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples compared to the best baseline.
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu · Mar 3, 2026 · Citations: 0

Demonstrations

Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD.
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Adam Dejl, Deniz Gorur, Francesca Toni · Feb 27, 2026 · Citations: 0

Demonstrations

Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by…
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
Rakshit Trivedi, Kartik Sharma, David C Parkes · Feb 24, 2026 · Citations: 0

Demonstrations

Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts.
ViPRA: Video Prediction for Robot Actions
Sandeep Routray, Hengkai Pan, Unnat Jain, Shikhar Bahl, Deepak Pathak · Nov 11, 2025 · Citations: 0

Demonstrations

Videos, including those of humans or teleoperated robots, capture rich physical interactions.
EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering
Haolei Xu, Xinyu Mei, Yuchen Yan, Rui Zhou, Wenqi Zhang · Sep 29, 2025 · Citations: 0

Demonstrations

We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM.
Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers
Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova · Sep 26, 2025 · Citations: 0

Demonstrations

The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning.
CausalARC: Abstract Reasoning with Causal World Models
Jacqueline Maasch, John Kalantari, Kia Khezeli · Sep 3, 2025 · Citations: 0

Demonstrations

As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4)…
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng · Jul 11, 2025 · Citations: 0

Demonstrations

The model is trained on a dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents.
Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktäschel · Jun 23, 2025 · Citations: 0

Demonstrations

Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily 'programmed' into LLMs through PBB, with important…
MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning
Yihong Guo, Yu Yang, Pan Xu, Anqi Liu · Jun 10, 2025 · Citations: 0

Demonstrations

We evaluate MOBODY on a wide range of MuJoCo and Adroit benchmarks, demonstrating that it outperforms state-of-the-art off-dynamics RL baselines as well as policy learning methods based on different dynamics learning baselines, with…
