- SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park, Jueun Kim, Wook-Shin Han · Feb 26, 2026 · Citations: 0
Automatic Metrics
Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in n
- CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery
Mengze Hong, Di Jiang, Chen Jason Zhang, Zichang Guo, Yawen Li · Feb 26, 2026 · Citations: 0
Simulation Env
In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements.
- MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026 · Citations: 0
Automatic Metrics
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
- Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026 · Citations: 0
Automatic Metrics
In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
- LiCQA : A Lightweight Complex Question Answering System
Sourav Saha, Dwaipayan Roy, Mandar Mitra · Feb 25, 2026 · Citations: 0
Automatic Metrics
The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
- Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access
Touseef Hasan, Laila Cure, Souvika Sarkar · Feb 25, 2026 · Citations: 0
Simulation Env
We conduct a pilot evaluation study using community-sourced queries to examine system behavior in realistic scenarios.
- Revisiting RAG Retrievers: An Information Theoretic Benchmark
Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert · Feb 25, 2026 · Citations: 0
Automatic Metrics
Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves.
- Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Barah Fazili, Koustava Goswami · Feb 25, 2026 · Citations: 0
Automatic Metrics
This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models.
- Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas · Feb 25, 2026 · Citations: 0
Automatic Metrics
Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly.
- Revisiting Text Ranking in Deep Research
Chuan Meng, Litu Ou, Sean MacAvaney, Jeff Dalton · Feb 25, 2026 · Citations: 0
Automatic Metrics
To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it.
- A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
- Multi-Vector Index Compression in Any Modality
Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz · Feb 24, 2026 · Citations: 0
Automatic Metrics
We study efficient multi-vector retrieval for late interaction in any modality.
- A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0
Human EvalAutomatic Metrics Tool Use
Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
- Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Sanket Badhe, Deep Shah · Feb 24, 2026 · Citations: 0
Automatic Metrics
These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-v
- Position-Aware Sequential Attention for Accurate Next Item Recommendations
Timur Nabiev, Evgeny Frolov · Feb 24, 2026 · Citations: 0
Automatic Metrics
Experiments on standard next-item prediction benchmarks show that our positional kernel attention consistently improves over strong competing baselines.
- HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao · Feb 24, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints.
- Generative Pseudo-Labeling for Pre-Ranking with LLMs
Junyu Bi, Xinting Niu, Daixuan Cheng, Kun Yuan, Tao Wang · Feb 24, 2026 · Citations: 0
Automatic Metrics
Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking.
- E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications
Jiwoo Kang, Yeon-Chang Lee · Feb 24, 2026 · Citations: 0
Automatic Metrics
Multimodal recommender systems (MMRSs) enhance collaborative filtering by leveraging item-side modalities, but their reliance on a fixed set of modalities and task-specific objectives limits both modality extensibility and task generalizati
- RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition
Kun Ran, Marwah Alaofi, Danula Hettiachchi, Chenglong Ma, Khoi Nguyen Dinh Anh · Feb 24, 2026 · Citations: 0
Automatic Metrics
R2RAG won the Best Dynamic Evaluation award in the Open Source category, demonstrating high effectiveness with careful design and efficient use of resources.
- KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026 · Citations: 0
Automatic Metrics
Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp
- NanoKnow: How to Know What Your Language Model Knows
Lingwei Gu, Nour Jedidi, Jimmy Lin · Feb 23, 2026 · Citations: 0
Automatic Metrics
Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre
- Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval
Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao · Feb 23, 2026 · Citations: 0
Automatic Metrics
We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retr
- Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo · Feb 23, 2026 · Citations: 0
Automatic Metrics
Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications.
- Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation
Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong · Feb 23, 2026 · Citations: 0
Expert Verification Automatic Metrics
Additionally, we present \textbf{HyperDocRED}, a rigorously annotated benchmark for document-level knowledge hypergraph extraction.
- PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
Isun Chehreh, Ebrahim Ansari · Feb 22, 2026 · Citations: 0
Automatic Metrics
Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification.
- Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Long Horizon
Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
- Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026 · Citations: 0
Automatic Metrics
Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
- VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha · Feb 20, 2026 · Citations: 0
Automatic Metrics
Existing Cultural benchmarks are (i) Manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured.
- RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026 · Citations: 0
Automatic Metrics
Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
- Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026 · Citations: 0
Automatic Metrics
Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings.
- CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello · Feb 19, 2026 · Citations: 0
Automatic Metrics
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts.
- Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026 · Citations: 0
Automatic Metrics Multi Agent
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
- WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović · Feb 19, 2026 · Citations: 0
Automatic Metrics
We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages.
- ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models
Antoine Chaffin, Luca Arnaboldi, Amélie Chatelain, Florent Krzakala · Feb 18, 2026 · Citations: 0
Automatic Metrics
Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models.
- Variable-Length Semantic IDs for Recommender Systems
Kirill Khrylchenko · Feb 18, 2026 · Citations: 0
Automatic Metrics
In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions.
- ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan · Feb 16, 2026 · Citations: 0
Automatic Metrics
ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
- Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation
Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao · Feb 16, 2026 · Citations: 0
Automatic Metrics
News recommendation plays a critical role in online news platforms by helping users discover relevant content.
- Query as Anchor: Scenario-Adaptive User Representation via Large Language Model
Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao · Feb 16, 2026 · Citations: 0
Automatic Metrics
Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment.
- Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026 · Citations: 0
Automatic Metrics
16.1\% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2\% vs.
- AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers
Prachuryya Kaushik, Ashish Anand · Jan 15, 2026 · Citations: 0
Automatic Metrics
We introduce \textbf{AWED-FiNER}, an open-source collection of agentic tool, web application, and 53 state-of-the-art expert models that provide Fine-grained Named Entity Recognition (FgNER) solutions across 36 languages spoken by more than
- Neurosymbolic Retrievers for Retrieval-augmented Generation
Yash Saxena, Manas Gaur · Jan 8, 2026 · Citations: 0
Automatic Metrics
Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency.
- The Invisible Hand of AI Libraries Shaping Open Source Projects and Communities
Matteo Esposito, Andrea Janes, Valentina Lenarduzzi, Davide Taibi · Jan 5, 2026 · Citations: 0
Automatic Metrics
In the early 1980s, Open Source Software emerged as a revolutionary concept amidst the dominance of proprietary software.
- RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang · Dec 31, 2025 · Citations: 0
Automatic Metrics
While large language models (LLMs) have shown significant results on relevance task, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics acr
- OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models
Michael Siebenmann, Javier Argota Sánchez-Vaquerizo, Stefan Arisona, Krystian Samp, Luis Gisler · Nov 30, 2025 · Citations: 0
Automatic Metrics
The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs.
- Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025 · Citations: 0
Automatic Metrics
Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coh
- LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels
I-Fan Lin, Faegheh Hasibi, Suzan Verberne · Oct 16, 2025 · Citations: 0
Automatic Metrics
Our evaluation on four benchmark sets shows that our approach achieves competitive results, better than recent state-of-the-art baselines, while avoiding the need to estimate the number of clusters during embedding refinement, as is require
- FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao · Oct 10, 2025 · Citations: 0
Automatic Metrics
We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings.
- Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction
Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu · Oct 7, 2025 · Citations: 0
Human EvalAutomatic Metrics
Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
- AgentDR: Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents
Mingdai Yang, Nurendra Choudhary, Jiangshu Du, Edward W. Huang, Philip S. Yu · Oct 7, 2025 · Citations: 0
Automatic Metrics
Recent agent-based recommendation frameworks aim to simulate user behaviors by incorporating memory mechanisms and prompting strategies, but they struggle with hallucinating non-existent items and full-catalog ranking.
- TASER: Table Agents for Schema-guided Extraction and Recommendation
Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso · Aug 18, 2025 · Citations: 0
Critique Edit Automatic Metrics
To address this, we present TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table extraction system that converts highly unstructured, multi-page, heterogeneous tables into normalized,
- PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025 · Citations: 0
Automatic Metrics
We evaluate our system on three benchmarks: TriviaQA, HotpotQA, DiaASQ and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.
- Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang · Jun 19, 2025 · Citations: 0
Automatic Metrics
We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones.
- Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun · Jun 5, 2025 · Citations: 0
Automatic Metrics
Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors.
- Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0
Pairwise Preference Automatic Metrics Multi Agent
These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
- Diffusion Generative Recommendation with Continuous Tokens
Haohao Qu, Shanru Lin, Yujuan Ding, Yiqi Wang, Wenqi Fan · Apr 16, 2025 · Citations: 0
Pairwise Preference Automatic Metrics
Specifically, ContRec consists of two key modules: a sigma-VAE Tokenizer, which encodes users/items with continuous tokens; and a Dispersive Diffusion module, which captures implicit user preference.
- PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025 · Citations: 0
Automatic Metrics
To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.
- Causal Claims in Economics
Prashant Garg, Thiemo Fetzer · Jan 12, 2025 · Citations: 0
Automatic Metrics
As economics scales, a key bottleneck is representing what papers claim in a comparable, aggregable form.
- Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli · Jun 7, 2024 · Citations: 0
Automatic Metrics
MRAG integrates seamlessly with existing RAG frameworks and benchmarks.
- Augmenting Lateral Thinking in Language Models with Humor and Riddle Data for the BRAINTEASER Task
Mina Ghashami, Soumya Smruti Mishra · May 16, 2024 · Citations: 0
Automatic Metrics
The SemEval 2024 BRAINTEASER task challenges language models to perform lateral thinking -- a form of creative, non-linear reasoning that remains underexplored in NLP.