- CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery
Mengze Hong, Di Jiang, Chen Jason Zhang, Zichang Guo, Yawen Li · Feb 26, 2026
Simulation Env General
In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements.
- VeRO: An Evaluation Harness for Agents to Optimize Agents
Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan, Xue · Feb 25, 2026
Automatic Metrics Coding
An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles.
- When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning
Muku Akasaka, Soyeon Caren Han · Feb 25, 2026
Automatic Metrics General
In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks.
- Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
Dhita Putri Pratama, Soyeon Caren Han, Yihao Ding · Feb 24, 2026
Automatic Metrics Coding
Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning.
- CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis · Feb 24, 2026
Automatic Metrics Coding
Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only \textbf{1\%} of the KV cache, delivers low-latency stable inference with up to \textbf{4.56$\times$} higher throughput, and consistently outperforms other str
- AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang · Feb 24, 2026
Simulation Env General
The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution.
- KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026
Automatic Metrics Math
Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp
- How Retrieved Context Shapes Internal Representations in RAG
Samuel Yeh, Sharon Li · Feb 23, 2026
Automatic Metrics General
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial.
- ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting
Yuxing Tian, Fengran Mo, Weixu Zhang, Yiyan Qi, Jian-Yun Nie · Feb 23, 2026
Automatic Metrics General
The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking task.
- Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026
Automatic Metrics Coding
Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings.
- Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
Melkamu Abay Mersha, Jugal Kalita · Feb 18, 2026
Automatic Metrics Coding
Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret.
- Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
Matteo Rinaldi, Rossella Varvara, Viviana Patti · Feb 16, 2026
Automatic Metrics General
We present "Testimole-conversational" a massive collection of discussion boards messages in the Italian language.
- propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries · Feb 12, 2026
Human Eval Multilingual
We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose,
- SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation
Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You · Jan 6, 2026
Automatic Metrics Coding
While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory.
- The Invisible Hand of AI Libraries Shaping Open Source Projects and Communities
Matteo Esposito, Andrea Janes, Valentina Lenarduzzi, Davide Taibi · Jan 5, 2026
Automatic Metrics Coding
In the early 1980s, Open Source Software emerged as a revolutionary concept amidst the dominance of proprietary software.
- RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang · Dec 31, 2025
Automatic Metrics General
While large language models (LLMs) have shown significant results on relevance task, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics acr
- On the Existence and Behavior of Secondary Attention Sinks
Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu · Dec 22, 2025
Automatic Metrics General
Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance.
- OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models
Michael Siebenmann, Javier Argota Sánchez-Vaquerizo, Stefan Arisona, Krystian Samp, Luis Gisler · Nov 30, 2025
Automatic Metrics Coding
The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs.
- From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen · Oct 29, 2025
Automatic Metrics Medicine
To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.
- AgentDR: Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents
Mingdai Yang, Nurendra Choudhary, Jiangshu Du, Edward W. Huang, Philip S. Yu · Oct 7, 2025
Automatic Metrics General
Recent agent-based recommendation frameworks aim to simulate user behaviors by incorporating memory mechanisms and prompting strategies, but they struggle with hallucinating non-existent items and full-catalog ranking.
- A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives
Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He · Jun 19, 2025
Automatic Metrics Medicine
Evaluations were heterogeneous: intrinsic metrics (27.1\%), human-in-the-loop assessments (44.1\%), and LLM-based evaluations (13.6\%).
- PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025
Automatic Metrics General
To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.