Research Utility Snapshot
Evaluation Modes
- Automatic Metrics (9)
- Simulation Env (2)
FewMMBench: A Benchmark for Multimodal Few-Shot Learning Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0
Demonstrations · Automatic Metrics · General
- In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen, Tianyi Zhang · Feb 25, 2026 · Citations: 0
Demonstrations · Automatic Metrics · General
- Extensive experiments on five KGQA benchmark datasets demonstrate that our method achieves, to the best of our knowledge, state-of-the-art performance, outperforming not only open-source but also closed-source LLMs.
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination Rakshit Trivedi, Kartik Sharma, David C Parkes · Feb 24, 2026 · Citations: 0
Demonstrations · Automatic Metrics · Coding
- Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts.
- Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors.
From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences Yi-Chih Huang · Feb 19, 2026 · Citations: 0
Demonstrations · Automatic Metrics · Coding
- Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences.
- Positioned as a "methodological experiment," this study proposes an AI Agent-based collaborative research workflow (Agentic Workflow) for humanities and social science research.
Perspectives: Interactive Document Clustering in the Discourse Analysis Tool Suite Tim Fischer, Chris Biemann · Feb 17, 2026 · Citations: 0
Demonstrations · Automatic Metrics · General
- This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
- Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities.
Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang, Li Qing · Feb 17, 2026 · Citations: 0
Demonstrations · Automatic Metrics · General
- Existing approaches either rely on modular system designs with extensive agent orchestration or employ oversimplified instruction schemas, providing limited guidance and poor generalizability.
- We first define the components and evaluation metrics for TOFs, then formalize a cost-efficient flowchart construction algorithm to abstract procedural knowledge from service dialogues.
AITutor-EvalKit: Exploring the Capabilities of AI Tutors Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025 · Citations: 0
Demonstrations · Automatic Metrics · General
- We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors and provides software for demonstration and evaluation, as well as model inspection and data visualization.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang, Huang Huang · Oct 21, 2025 · Citations: 0
Demonstrations · Simulation Env · General
- Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
- This challenge intensifies for multi-step bimanual mobile manipulation, where humans must teleoperate both the mobile base and two high-DoF arms.
SPACeR: Self-Play Anchoring with Centralized Reference Models Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka, Yihan Hu · Oct 20, 2025 · Citations: 0
Demonstrations · Simulation Env · General
- Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
- Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings.
Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel, Jakob Foerster · Jun 23, 2025 · Citations: 0
Demonstrations · Automatic Metrics · Coding
- Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily 'programmed' into LLMs through PBB, with important implications.
Oracular Programming: A Modular Foundation for Building LLM-Enabled Software Jonathan Laurent, André Platzer · Feb 7, 2025 · Citations: 0
Demonstrations · Automatic Metrics · Coding