- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations Human EvalLlm As Judge Long Horizon
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
- Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0
Demonstrations Simulation Env Multi Agent
To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
- SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0
Demonstrations Simulation Env Long Horizon
To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.
- Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng · Mar 4, 2026 · Citations: 0
Demonstrations Simulation Env
Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects…
- TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino · Mar 5, 2026 · Citations: 0
Demonstrations Web Browsing
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes?
- IROSA: Interactive Robot Skill Adaptation using Natural Language
Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp · Mar 4, 2026 · Citations: 0
Demonstrations Long Horizon
We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and…
- State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation
Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi · Apr 7, 2026 · Citations: 0
Demonstrations Automatic Metrics
Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by…
- Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026 · Citations: 0
Demonstrations Automatic Metrics
Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
- Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek · Feb 28, 2026 · Citations: 0
Rubric RatingDemonstrations
GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
- AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman · Feb 26, 2026 · Citations: 0
Demonstrations
We introduce AuditBench, an alignment auditing benchmark.
- FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0
Demonstrations
In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
- Automatic In-Domain Exemplar Construction and LLM-Based Refinement of Multi-LLM Expansions for Query Expansion
Minghan Li, Ercong Nie, Siqi Zhao, Tongna Chen, Huiping Huang · Feb 9, 2026 · Citations: 0
Demonstrations
We present an automated, domain-adaptive QE framework that builds in-domain exemplar pools by harvesting pseudo-relevant passages using a BM25-MonoT5 pipeline.
- A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems
Seou Choi, Sachin Vaidya, Caio Silva, Shiekh Zia Uddin, Sajib Biswas Shuvo · Mar 23, 2026 · Citations: 0
Demonstrations
In this work, we present a robotics framework for the autonomous construction, alignment, and maintenance of precision optical systems.
- Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu · Feb 26, 2026 · Citations: 0
Demonstrations
Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation.
- RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection
Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin, Pin-Yu Chen, Hong-Yan Huang · Jan 7, 2026 · Citations: 0
DemonstrationsCritique Edit
On a fake news detection benchmark, RADAR consistently outperforms strong retrieval-augmented trainable baselines, as well as general-purpose LLMs with retrieval.
- In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
Charlotte Pouw, Hosein Mohebbi, Afra Alishahi, Willem Zuidema · Apr 7, 2026 · Citations: 0
Demonstrations
In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain.
- Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
Yuning Wu, Ke Wang, Devin Chen, Kai Wei · Mar 11, 2026 · Citations: 0
Demonstrations
To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO).
- COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Kartik Sharma, Rakshit S. Trivedi · Mar 6, 2026 · Citations: 0
Pairwise PreferenceDemonstrations
Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves upto 95% steering effectiveness while using 50 times fewer samples compared to the best baseline.
- ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Adam Dejl, Deniz Gorur, Francesca Toni · Feb 27, 2026 · Citations: 0
Demonstrations
Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by…
- Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Chungpa Lee, Jy-yong Sohn, Kangwook Lee · Feb 26, 2026 · Citations: 0
Demonstrations
We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning.
- Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0
Demonstrations
Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.
- Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Tim Fischer, Chris Biemann · Feb 17, 2026 · Citations: 0
Demonstrations
This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.