HFEPX Hub

CS.IR Human Feedback And Eval Papers

Updated from current HFEPX corpus (Feb 27, 2026). 61 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 61 Last published: Feb 26, 2026 Global RSS

Cs.IR

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 61 papers for CS.IR Human Feedback And Eval Papers. Dominant protocol signals include automatic metrics, human evaluation, simulation environments, with frequent benchmark focus on Retrieval, BrowseComp and metric focus on accuracy, relevance. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

9.8% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders , SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , MoDora: Tree-Based Semi-Structured Document Analysis System
automatic metrics appears in 96.7% of papers in this hub.

Evidence: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training , Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access

Protocol Takeaways

Most common quality-control signal is rater calibration (1.6% of papers).

Evidence: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.

Evidence: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: A Benchmark for Deep Information Synthesis , SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , MoDora: Tree-Based Semi-Structured Document Analysis System

Benchmark Interpretation

Retrieval appears in 50.8% of hub papers (31/61); use this cohort for benchmark-matched comparisons.
BrowseComp appears in 1.6% of hub papers (1/61); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 18% of hub papers (11/61); compare with a secondary metric before ranking methods.
relevance is reported in 13.1% of hub papers (8/61); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (9.8% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (1.6% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (57.4% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (54.1% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (6.6% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (19.7% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (9.8% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (1.6% vs 30% target).

Papers naming benchmarks/datasets

Coverage is strong (57.4% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (54.1% vs 35% target).

Papers with known rater population

Coverage is a replication risk (6.6% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (19.7% vs 35% target).

Known Limitations

Only 1.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.6% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=2, left_only=0, right_only=57

2 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=59, right_only=2

0 papers use both Automatic Metrics and Simulation Env.

human_eval vs simulation_env

both=0, left_only=2, right_only=2

0 papers use both Human Eval and Simulation Env.

Benchmark Brief

Retrieval

Coverage: 31 papers (50.8%)

31 papers (50.8%) mention Retrieval.

Examples: MoDora: Tree-Based Semi-Structured Document Analysis System , Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Benchmark Brief

BrowseComp

Coverage: 1 papers (1.6%)

1 papers (1.6%) mention BrowseComp.

Examples: Revisiting Text Ranking in Deep Research

Benchmark Brief

DROP

Coverage: 1 papers (1.6%)

1 papers (1.6%) mention DROP.

Examples: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Metric Brief

accuracy

Coverage: 11 papers (18%)

11 papers (18%) mention accuracy.

Examples: MoDora: Tree-Based Semi-Structured Document Analysis System , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training , NanoKnow: How to Know What Your Language Model Knows

Metric Brief

relevance

Coverage: 8 papers (13.1%)

8 papers (13.1%) mention relevance.

Examples: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration , Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

Metric Brief

latency

Coverage: 7 papers (11.5%)

7 papers (11.5%) mention latency.

Examples: Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators , LiCQA : A Lightweight Complex Question Answering System , Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , MoDora: Tree-Based Semi-Structured Document Analysis System

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park, Jueun Kim, Wook-Shin Han · Feb 26, 2026 · Citations: 0

Automatic Metrics

Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in n
CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery
Mengze Hong, Di Jiang, Chen Jason Zhang, Zichang Guo, Yawen Li · Feb 26, 2026 · Citations: 0

Simulation Env

In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements.
MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026 · Citations: 0

Automatic Metrics

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026 · Citations: 0

Automatic Metrics

In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
LiCQA : A Lightweight Complex Question Answering System
Sourav Saha, Dwaipayan Roy, Mandar Mitra · Feb 25, 2026 · Citations: 0

Automatic Metrics

The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access
Touseef Hasan, Laila Cure, Souvika Sarkar · Feb 25, 2026 · Citations: 0

Simulation Env

We conduct a pilot evaluation study using community-sourced queries to examine system behavior in realistic scenarios.
Revisiting RAG Retrievers: An Information Theoretic Benchmark
Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert · Feb 25, 2026 · Citations: 0

Automatic Metrics

Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves.
Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Barah Fazili, Koustava Goswami · Feb 25, 2026 · Citations: 0

Automatic Metrics

This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models.
Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas · Feb 25, 2026 · Citations: 0

Automatic Metrics

Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly.
Revisiting Text Ranking in Deep Research
Chuan Meng, Litu Ou, Sean MacAvaney, Jeff Dalton · Feb 25, 2026 · Citations: 0

Automatic Metrics

To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it.
A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
Multi-Vector Index Compression in Any Modality
Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz · Feb 24, 2026 · Citations: 0

Automatic Metrics

We study efficient multi-vector retrieval for late interaction in any modality.
A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0

Human EvalAutomatic Metrics Tool Use

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Sanket Badhe, Deep Shah · Feb 24, 2026 · Citations: 0

Automatic Metrics

These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-v
Position-Aware Sequential Attention for Accurate Next Item Recommendations
Timur Nabiev, Evgeny Frolov · Feb 24, 2026 · Citations: 0

Automatic Metrics

Experiments on standard next-item prediction benchmarks show that our positional kernel attention consistently improves over strong competing baselines.
HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao · Feb 24, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints.
Generative Pseudo-Labeling for Pre-Ranking with LLMs
Junyu Bi, Xinting Niu, Daixuan Cheng, Kun Yuan, Tao Wang · Feb 24, 2026 · Citations: 0

Automatic Metrics

Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking.
E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications
Jiwoo Kang, Yeon-Chang Lee · Feb 24, 2026 · Citations: 0

Automatic Metrics

Multimodal recommender systems (MMRSs) enhance collaborative filtering by leveraging item-side modalities, but their reliance on a fixed set of modalities and task-specific objectives limits both modality extensibility and task generalizati
RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition
Kun Ran, Marwah Alaofi, Danula Hettiachchi, Chenglong Ma, Khoi Nguyen Dinh Anh · Feb 24, 2026 · Citations: 0

Automatic Metrics

R2RAG won the Best Dynamic Evaluation award in the Open Source category, demonstrating high effectiveness with careful design and efficient use of resources.
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026 · Citations: 0

Automatic Metrics

Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp
NanoKnow: How to Know What Your Language Model Knows
Lingwei Gu, Nour Jedidi, Jimmy Lin · Feb 23, 2026 · Citations: 0

Automatic Metrics

Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre
Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval
Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao · Feb 23, 2026 · Citations: 0

Automatic Metrics

We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retr
Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo · Feb 23, 2026 · Citations: 0

Automatic Metrics

Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications.
Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation
Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong · Feb 23, 2026 · Citations: 0

Expert Verification Automatic Metrics

Additionally, we present \textbf{HyperDocRED}, a rigorously annotated benchmark for document-level knowledge hypergraph extraction.
PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
Isun Chehreh, Ebrahim Ansari · Feb 22, 2026 · Citations: 0

Automatic Metrics

Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification.
Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026 · Citations: 0

Automatic Metrics

Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha · Feb 20, 2026 · Citations: 0

Automatic Metrics

Existing Cultural benchmarks are (i) Manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured.
RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026 · Citations: 0

Automatic Metrics

Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026 · Citations: 0

Automatic Metrics

Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings.
CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello · Feb 19, 2026 · Citations: 0

Automatic Metrics

HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts.
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026 · Citations: 0

Automatic Metrics Multi Agent

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović · Feb 19, 2026 · Citations: 0

Automatic Metrics

We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages.
ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models
Antoine Chaffin, Luca Arnaboldi, Amélie Chatelain, Florent Krzakala · Feb 18, 2026 · Citations: 0

Automatic Metrics

Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models.
Variable-Length Semantic IDs for Recommender Systems
Kirill Khrylchenko · Feb 18, 2026 · Citations: 0

Automatic Metrics

In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions.
ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan · Feb 16, 2026 · Citations: 0

Automatic Metrics

ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation
Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao · Feb 16, 2026 · Citations: 0

Automatic Metrics

News recommendation plays a critical role in online news platforms by helping users discover relevant content.
Query as Anchor: Scenario-Adaptive User Representation via Large Language Model
Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao · Feb 16, 2026 · Citations: 0

Automatic Metrics

Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment.
Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026 · Citations: 0

Automatic Metrics

16.1\% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2\% vs.
AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers
Prachuryya Kaushik, Ashish Anand · Jan 15, 2026 · Citations: 0

Automatic Metrics

We introduce \textbf{AWED-FiNER}, an open-source collection of agentic tool, web application, and 53 state-of-the-art expert models that provide Fine-grained Named Entity Recognition (FgNER) solutions across 36 languages spoken by more than
Neurosymbolic Retrievers for Retrieval-augmented Generation
Yash Saxena, Manas Gaur · Jan 8, 2026 · Citations: 0

Automatic Metrics

Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency.
The Invisible Hand of AI Libraries Shaping Open Source Projects and Communities
Matteo Esposito, Andrea Janes, Valentina Lenarduzzi, Davide Taibi · Jan 5, 2026 · Citations: 0

Automatic Metrics

In the early 1980s, Open Source Software emerged as a revolutionary concept amidst the dominance of proprietary software.
RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang · Dec 31, 2025 · Citations: 0

Automatic Metrics

While large language models (LLMs) have shown significant results on relevance task, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics acr
OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models
Michael Siebenmann, Javier Argota Sánchez-Vaquerizo, Stefan Arisona, Krystian Samp, Luis Gisler · Nov 30, 2025 · Citations: 0

Automatic Metrics

The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs.
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025 · Citations: 0

Automatic Metrics

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coh
LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels
I-Fan Lin, Faegheh Hasibi, Suzan Verberne · Oct 16, 2025 · Citations: 0

Automatic Metrics

Our evaluation on four benchmark sets shows that our approach achieves competitive results, better than recent state-of-the-art baselines, while avoiding the need to estimate the number of clusters during embedding refinement, as is require
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao · Oct 10, 2025 · Citations: 0

Automatic Metrics

We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings.
Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction
Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu · Oct 7, 2025 · Citations: 0

Human EvalAutomatic Metrics

Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
AgentDR: Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents
Mingdai Yang, Nurendra Choudhary, Jiangshu Du, Edward W. Huang, Philip S. Yu · Oct 7, 2025 · Citations: 0

Automatic Metrics

Recent agent-based recommendation frameworks aim to simulate user behaviors by incorporating memory mechanisms and prompting strategies, but they struggle with hallucinating non-existent items and full-catalog ranking.
TASER: Table Agents for Schema-guided Extraction and Recommendation
Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso · Aug 18, 2025 · Citations: 0

Critique Edit Automatic Metrics

To address this, we present TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table extraction system that converts highly unstructured, multi-page, heterogeneous tables into normalized,
PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025 · Citations: 0

Automatic Metrics

We evaluate our system on three benchmarks: TriviaQA, HotpotQA, DiaASQ and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.
Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang · Jun 19, 2025 · Citations: 0

Automatic Metrics

We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones.
Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun · Jun 5, 2025 · Citations: 0

Automatic Metrics

Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors.
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.
Diffusion Generative Recommendation with Continuous Tokens
Haohao Qu, Shanru Lin, Yujuan Ding, Yiqi Wang, Wenqi Fan · Apr 16, 2025 · Citations: 0

Pairwise Preference Automatic Metrics

Specifically, ContRec consists of two key modules: a sigma-VAE Tokenizer, which encodes users/items with continuous tokens; and a Dispersive Diffusion module, which captures implicit user preference.
PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025 · Citations: 0

Automatic Metrics

To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.
Causal Claims in Economics
Prashant Garg, Thiemo Fetzer · Jan 12, 2025 · Citations: 0

Automatic Metrics

As economics scales, a key bottleneck is representing what papers claim in a comparable, aggregable form.
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli · Jun 7, 2024 · Citations: 0

Automatic Metrics

MRAG integrates seamlessly with existing RAG frameworks and benchmarks.
Augmenting Lateral Thinking in Language Models with Humor and Riddle Data for the BRAINTEASER Task
Mina Ghashami, Soumya Smruti Mishra · May 16, 2024 · Citations: 0

Automatic Metrics

The SemEval 2024 BRAINTEASER task challenges language models to perform lateral thinking -- a form of creative, non-linear reasoning that remains underexplored in NLP.

CS.IR Human Feedback And Eval Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers

Related Hubs