- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations Human Eval Llm As Judge Long Horizon
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
- Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0
Demonstrations Simulation Env Multi Agent
To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
- SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0
Demonstrations Simulation Env Long Horizon
To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.
- Efficient Agent Training for Computer Use
Yanheng He, Jiahe Jin, Pengfei Liu · May 20, 2025 · Citations: 0
Demonstrations Long Horizon
We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations.
- Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng · Mar 4, 2026 · Citations: 0
Demonstrations Simulation Env
Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects…
- DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao · Oct 10, 2025 · Citations: 0
Demonstrations Simulation Env
Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped.
- Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0
Demonstrations Simulation Env
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
- IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
Aayush Mishra, Daniel Khashabi, Anqi Liu · Sep 26, 2025 · Citations: 0
Demonstrations Automatic Metrics
Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families.
- Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation
Eeham Khan, Luis Rodriguez, Marc Queudot · Mar 10, 2026 · Citations: 0
Demonstrations Automatic Metrics
We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets.
- LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Ziming Zhu, Chenglong Wang, Haosong Xv, Shunjie Xing, Yifu Huo · Aug 26, 2025 · Citations: 0
Demonstrations Automatic Metrics Multi Agent
In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge.
- CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou · Jan 28, 2025 · Citations: 0
Pairwise Preference Demonstrations Automatic Metrics Web Browsing
We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency.
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0
Demonstrations Long Horizon
Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
- TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0
Demonstrations Web Browsing
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
- IROSA: Interactive Robot Skill Adaptation using Natural Language
Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp · Mar 4, 2026 · Citations: 0
Demonstrations Long Horizon
We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and…
- Continual Robot Skill and Task Learning via Dialogue
Weiwei Gu, Suresh Kondepudi, Anmol Gupta, Lixiao Huang, Nakul Gopalan · Sep 5, 2024 · Citations: 0
Demonstrations Simulation Env
In this work we present a framework for robots to continually learn tasks and visuo-motor skills and query for novel skills via dialog interactions with human users.
- State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation
Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi · Apr 7, 2026 · Citations: 0
Demonstrations Automatic Metrics
Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by…
- IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation
Md Mofijul Islam, Md Sirajus Salekin, Joe King, Priyashree Roy, Vamsi Thilak Gudi · Feb 26, 2026 · Citations: 0
Demonstrations Automatic Metrics
We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging…
- Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026 · Citations: 0
Demonstrations Automatic Metrics
Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
- Incentivizing Strong Reasoning from Weak Supervision
Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao · May 26, 2025 · Citations: 0
Demonstrations Automatic Metrics
Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks.
- Maximizing Asynchronicity in Event-based Neural Networks
Haiqing Hao, Nikola Zubić, Weihua He, Zhipeng Sui, Davide Scaramuzza · May 16, 2025 · Citations: 0
Demonstrations Automatic Metrics
Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML).
- Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek · Feb 28, 2026 · Citations: 0
Rubric Rating Demonstrations
GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
- AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman · Feb 26, 2026 · Citations: 0
Demonstrations
We introduce AuditBench, an alignment auditing benchmark.
- FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0
Demonstrations
In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
- Schema for In-Context Learning
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung · Oct 14, 2025 · Citations: 0
Demonstrations
Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce Schema-Activated In-Context…
- Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
Michael Cuccarese · Apr 7, 2026 · Citations: 0
Demonstrations
This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization.
- A Survey of On-Policy Distillation for Large Language Models
Mingyang Song, Mao Zheng · Apr 1, 2026 · Citations: 0
Expert Verification Demonstrations
We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
- RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection
Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin, Pin-Yu Chen, Hong-Yan Huang · Jan 7, 2026 · Citations: 0
Demonstrations Critique Edit
On a fake news detection benchmark, RADAR consistently outperforms strong retrieval-augmented trainable baselines, as well as general-purpose LLMs with retrieval.
- AmbiSQL: Interactive Ambiguity Detection and Resolution for Text-to-SQL
Zhongjun Ding, Yin Lin, Tianjing Zeng, Rong Zhu, Bolin Ding · Aug 21, 2025 · Citations: 0
Demonstrations
We provide 40 ambiguous queries collected from two real-world benchmarks that SIGMOD'26 attendees can use to explore how disambiguation improves SQL generation quality.
- XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models
Zhu Liu, Zhen Hu, Lei Dai, Yu Xuan, Ying Liu · Jul 5, 2025 · Citations: 0
Demonstrations
We introduce XISM, an interactive system that combines data-driven inference with expert knowledge.
- Training with Pseudo-Code for Instruction Following
Prince Kumar, Rudra Murthy, Riyaz Bhat, Danish Contractor · May 23, 2025 · Citations: 0
Demonstrations
We evaluate our method on 12 publicly available benchmarks spanning instruction-following, mathematical reasoning, and commonsense reasoning, across six base models.
- Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment
Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu · Mar 23, 2025 · Citations: 0
Pairwise Preference Demonstrations
Existing techniques are either reward-based (training a reward model on preference pairs and optimizing with reinforcement learning) or reward-free (directly fine-tuning on ranked outputs).
- From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences
Yi-Chih Huang · Feb 19, 2026 · Citations: 0
Demonstrations
Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences.
- In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
Charlotte Pouw, Hosein Mohebbi, Afra Alishahi, Willem Zuidema · Apr 7, 2026 · Citations: 0
Demonstrations
In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain.
- Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
Yuning Wu, Ke Wang, Devin Chen, Kai Wei · Mar 11, 2026 · Citations: 0
Demonstrations
To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO).
- COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Kartik Sharma, Rakshit S. Trivedi · Mar 6, 2026 · Citations: 0
Pairwise Preference Demonstrations
Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples compared to the best baseline.
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu · Mar 3, 2026 · Citations: 0
Demonstrations
Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD.
- Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering
Xufei Lv, Jiahui Yang, Haoyuan Sun, Xialin Su, Zhiliang Tian · Mar 2, 2026 · Citations: 0
Demonstrations
Based on this insight, we propose AT2QA, an Autonomous and Training-free Agent for TKG Question Answering.
- ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Adam Dejl, Deniz Gorur, Francesca Toni · Feb 27, 2026 · Citations: 0
Demonstrations
Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by…
- Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Chungpa Lee, Jy-yong Sohn, Kangwook Lee · Feb 26, 2026 · Citations: 0
Demonstrations
We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning.
- Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0
Demonstrations
Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.
- Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
Rakshit Trivedi, Kartik Sharma, David C Parkes · Feb 24, 2026 · Citations: 0
Demonstrations
Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts.
- Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Tim Fischer, Chris Biemann · Feb 17, 2026 · Citations: 0
Demonstrations
This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
- Language Steering for Multilingual In-Context Learning
Neeraja Kirtane, Kuan-Hao Huang · Feb 2, 2026 · Citations: 0
Demonstrations
We propose language vectors, computed as the mean activation difference between parallel source and target language examples at a particular layer, and added as an offset to hidden states at inference time to shift the model's internal…
- Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer
Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu · Jan 9, 2026 · Citations: 0
Demonstrations
Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code.
- AITutor-EvalKit: Exploring the Capabilities of AI Tutors
Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025 · Citations: 0
Demonstrations
We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization.
- ViPRA: Video Prediction for Robot Actions
Sandeep Routray, Hengkai Pan, Unnat Jain, Shikhar Bahl, Deepak Pathak · Nov 11, 2025 · Citations: 0
Demonstrations
Videos, including those of humans or teleoperated robots, capture rich physical interactions.
- Mapping Semantic & Syntactic Relationships with Geometric Rotation
Michael Freenor, Lauren Alvarez · Oct 10, 2025 · Citations: 0
Demonstrations
We introduce Rotor-Invariant Shift Estimation (RISE), a geometric approach that represents semantic-syntactic transformations as consistent rotational operations in embedding space, leveraging the manifold structure of modern language…
- EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering
Haolei Xu, Xinyu Mei, Yuchen Yan, Rui Zhou, Wenqi Zhang · Sep 29, 2025 · Citations: 0
Demonstrations
We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM.
- Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers
Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova · Sep 26, 2025 · Citations: 0
Demonstrations
The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning.
- CausalARC: Abstract Reasoning with Causal World Models
Jacqueline Maasch, John Kalantari, Kia Khezeli · Sep 3, 2025 · Citations: 0
Demonstrations
As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4)…
- NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng · Jul 11, 2025 · Citations: 0
Demonstrations
The model is trained on a dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents.
- Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktäschel · Jun 23, 2025 · Citations: 0
Demonstrations
Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily 'programmed' into LLMs through PBB, with important…
- REFLEX: Metacognitive Reasoning for Reflective Zero-Shot Robotic Planning with Large Language Models
Wenjie Lin, Jin Wei-Kocsis, Jiansong Zhang, Byung-Cheol Min, Dongming Gan · May 20, 2025 · Citations: 0
Demonstrations
Inspired by human metacognitive learning and creative problem-solving, we address this limitation by exploring a fundamental question: Can LLMs be empowered with metacognitive capabilities to reason, reflect, and create, thereby enhancing…