
HFEPX Hub

CS.IR + General Papers


Updated from the current HFEPX corpus (Apr 12, 2026). This hub page groups 23 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Adjudication. Frequently cited benchmark: Innoeval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 19, 2026.

Papers: 23 · Last published: Mar 19, 2026 · Tags: cs.IR, General

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

23 / 23 sampled papers are not flagged as low-signal.

Replication-Ready Set

2

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

0

Papers containing both `human_eval` and `llm_as_judge`.

  • 2 papers are replication-ready (benchmark + metric + explicit evaluation mode); see the filtering sketch below.
  • 0 papers support judge-vs-human agreement analysis.
  • 1 paper reports explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.
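
As a concrete starting point, the replication-ready filter above can be expressed in a few lines. This is a minimal sketch, assuming hub metadata is exported as a list of records; the field names are illustrative, not the actual HFEPX schema:

```python
# Triage sketch: keep only "replication-ready" papers, i.e. papers that
# explicitly name at least one benchmark, one metric, and one eval mode.
# Record shape and field names are illustrative, not the HFEPX schema.

papers = [
    {"title": "SODIUM: From Open Web Data to Queryable Databases",
     "benchmarks": ["Sodium Bench"], "metrics": ["accuracy"],
     "eval_modes": ["automatic_metrics"]},
    {"title": "AgenticRec: End-to-End Tool-Integrated Policy Optimization",
     "benchmarks": [], "metrics": [], "eval_modes": []},
]

def is_replication_ready(paper: dict) -> bool:
    # All three protocol anchors must be explicitly present (non-empty).
    return all(paper.get(key) for key in ("benchmarks", "metrics", "eval_modes"))

ready = [p["title"] for p in papers if is_replication_ready(p)]
print(f"{len(ready)} / {len(papers)} replication-ready:", ready)
```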


Why This Matters For Eval Research

  • 73.9% of papers report explicit human-feedback signals, led by pairwise preferences.
  • Automatic metrics appear in 52.2% of papers in this hub.
  • Innoeval is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

  • The most common quality-control signal is adjudication (4.3% of papers).
  • Rater context is mostly domain experts, and the annotation unit is commonly pairwise; use this to scope replication staffing.
  • Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift (a minimal agreement sketch follows this list).
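
For that last comparison, Cohen's kappa is the standard first agreement check. A minimal sketch, assuming binary pairwise verdicts from humans and an LLM judge aligned on the same items (the labels below are hypothetical):

```python
# Judge-vs-human agreement sketch using Cohen's kappa on aligned binary
# pairwise verdicts. In practice you first align human_eval and
# llm_as_judge decisions on the same comparison items.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n        # raw agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement implied by the two marginal label distributions
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (observed - expected) / (1 - expected)

human = ["A", "A", "B", "A", "B", "B", "A", "A"]  # human pairwise winners
judge = ["A", "B", "B", "A", "B", "A", "A", "A"]  # LLM-judge winners
print(f"kappa = {cohens_kappa(human, judge):.3f}")  # 0.467: moderate agreement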

Benchmark Interpretation

  • Innoeval appears in 4.3% of hub papers (1/23); use this cohort for benchmark-matched comparisons.
  • Scirepeval appears in 4.3% of hub papers (1/23); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • Accuracy is reported in 17.4% of hub papers (4/23); compare with a secondary metric before ranking methods.
  • Cost is reported in 13.0% of hub papers (3/23); compare with a secondary metric before ranking methods (a rank-stability sketch follows this list).
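
One quick way to act on this advice is to compare the method rankings that the primary and secondary metrics induce. A minimal Kendall tau sketch with made-up scores:

```python
# Metric-sensitivity sketch: compare the method rankings induced by a
# primary and a secondary metric with Kendall's tau. Scores are made up;
# a tau far below 1.0 means the ranking depends on the metric you pick.

def kendall_tau(x, y):
    # Kendall rank correlation over paired scores (no tie correction).
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            concordant += s > 0
            discordant += s < 0
    return (concordant - discordant) / (n * (n - 1) / 2)

accuracy = [0.81, 0.78, 0.74, 0.69]  # primary metric, one score per method
neg_cost = [-1.2, -0.7, -0.9, -0.4]  # secondary metric (negated cost)
print(f"tau = {kendall_tau(accuracy, neg_cost):.2f}")  # -0.67: report both
```

A tau near 1.0 suggests the ordering is robust to metric choice; a low or negative tau means any single-metric ranking is fragile.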

Researcher Checklist

  • Strong: Papers with explicit human feedback

    Coverage is strong (73.9% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (4.3% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (17.4% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (60.9% vs 35% target).

  • Moderate: Papers with known rater population

    Coverage is usable but incomplete (26.1% vs 35% target).

  • Strong: Papers with known annotation unit

    Coverage is strong (43.5% vs 35% target). The Strong/Moderate/Gap banding rule is sketched below.
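
For readers reproducing this checklist, the bands can be approximated with a simple threshold rule. The cutoffs below are assumptions inferred from the reported bands, not a published formula:

```python
# Banding sketch for the checklist above. Assumed thresholds: at or above
# target reads as Strong, within ~70% of target as Moderate, below that
# as a Gap (replication risk). The hub does not publish its exact rule.

def coverage_band(coverage, target, moderate_ratio=0.7):
    if coverage >= target:
        return "Strong"
    if coverage >= moderate_ratio * target:
        return "Moderate"
    return "Gap"

checks = {
    "human feedback":   (73.9, 45.0),
    "quality controls": (4.3, 30.0),
    "benchmarks":       (17.4, 35.0),
    "metrics":          (60.9, 35.0),
    "rater population": (26.1, 35.0),
    "annotation unit":  (43.5, 35.0),
}
for name, (cov, tgt) in checks.items():
    print(f"{name}: {coverage_band(cov, tgt)} ({cov}% vs {tgt}% target)")
```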

Strengths

  • Strong human-feedback signal (73.9% of papers).
  • Includes both human-eval and LLM-as-judge protocols (in separate papers), allowing cross-paper methodology comparison.
  • Agentic evaluation appears in 47.8% of papers.

Known Gaps

  • Only 4.3% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Benchmark coverage is thin (17.4% of papers mention benchmarks/datasets).
  • LLM-as-judge is used without sufficient inter-annotator agreement reporting.

Suggested Next Analyses

  • Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
  • Stratify by benchmark (Innoeval vs Scirepeval) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and cost.
  • Add inter-annotator agreement checks when reproducing these protocols (a Fleiss' kappa sketch follows this list).
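
For that inter-annotator agreement item, Fleiss' kappa is a common choice when more than two raters label the same items. A minimal sketch with a hypothetical rating matrix:

```python
# Inter-annotator agreement sketch: Fleiss' kappa for several raters
# labeling the same items. The rating matrix is hypothetical; rows are
# items, columns are per-category vote counts summed over raters.

def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumes the same rater count per item
    totals = [sum(col) for col in zip(*counts)]
    p_cat = [t / (n_items * n_raters) for t in totals]  # category shares
    p_bar = sum(                                        # mean per-item agreement
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    p_exp = sum(p * p for p in p_cat)                   # chance agreement
    return (p_bar - p_exp) / (1 - p_exp)

# 5 items, 3 raters, 2 categories (e.g. "win" / "lose" in a pairwise task)
ratings = [[3, 0], [2, 1], [3, 0], [1, 2], [0, 3]]
print(f"Fleiss kappa = {fleiss_kappa(ratings):.3f}")  # 0.444
```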

Recommended Queries

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).
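
A ranking of this kind can be approximated as a weighted presence score over those ingredients. This is a sketch under assumed weights and field names; the hub's actual scoring formula is not published on this page:

```python
# Ranking sketch: weighted protocol-completeness score over the four
# ingredients named above. Weights and field names are assumptions.

def completeness_score(p):
    score = 0
    score += 2 if p.get("human_signal") else 0      # explicit human feedback
    score += 2 if p.get("benchmarks") else 0        # named benchmark anchor
    score += 2 if p.get("metrics") else 0           # named metric anchor
    score += 3 if p.get("quality_controls") else 0  # QC evidence is rarest here
    # judge/human overlap: both modes reported in the same paper
    modes = set(p.get("eval_modes", []))
    score += 3 if {"human_eval", "llm_as_judge"} <= modes else 0
    return score

papers = [
    {"title": "Paper A", "human_signal": True, "benchmarks": ["Scirepeval"],
     "metrics": ["recall"], "quality_controls": [],
     "eval_modes": ["automatic_metrics"]},
    {"title": "Paper B", "human_signal": False, "benchmarks": ["Innoeval"],
     "metrics": [], "quality_controls": ["adjudication"],
     "eval_modes": ["llm_as_judge"]},
]
for p in sorted(papers, key=completeness_score, reverse=True):
    print(p["title"], completeness_score(p))
```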

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper (Date) | HF Signal | Eval Modes | Benchmarks | Metrics | QC
SODIUM: From Open Web Data to Queryable Databases (Mar 19, 2026) | Yes | Automatic Metrics | Sodium Bench | Accuracy | Not Reported
Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching (Apr 7, 2026) | Yes | Automatic Metrics | Scirepeval | Recall | Not Reported
Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE (Mar 31, 2026) | Yes | Automatic Metrics | Not Reported | Ndcg, Cost | Not Reported
Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning (Apr 2, 2026) | Yes | Automatic Metrics | Not Reported | Relevance | Not Reported
OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework (Mar 25, 2026) | Yes | Automatic Metrics | Not Reported | Latency, Relevance | Not Reported
Automatic In-Domain Exemplar Construction and LLM-Based Refinement of Multi-LLM Expansions for Query Expansion (Feb 9, 2026) | Yes | Not Reported | TREC | Not Reported | Not Reported
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem (Feb 16, 2026) | No | Llm As Judge | Innoeval | Not Reported | Adjudication
HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders (Feb 24, 2026) | Yes | Not Reported | Not Reported | Latency, Cost | Not Reported
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition (Apr 26, 2025) | Yes | Automatic Metrics | Not Reported | Hit@5 | Not Reported
Role-Augmented Intent-Driven Generative Search Engine Optimization (Aug 15, 2025) | Yes | Automatic Metrics | Not Reported | Perplexity | Not Reported
TaoSR1: The Thinking Model for E-commerce Relevance Search (Aug 17, 2025) | Yes | Human Eval | Not Reported | Relevance | Not Reported
AgenticRec: End-to-End Tool-Integrated Policy Optimization for Ranking-Oriented Recommender Agents (Mar 23, 2026) | Yes | Not Reported | Not Reported | Not Reported | Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal | SODIUM: From Open Web Data to Queryable Databases | Beyond Paper-to-Paper: Structured Profiling and Rub… | Aligning Multimodal Sequential Recommendations via…
Human Feedback | Expert Verification | Rubric Rating | Pairwise Preference
Evaluation Modes | Automatic Metrics | Automatic Metrics | Automatic Metrics
Benchmarks | Sodium Bench | Scirepeval | Not reported
Metrics | Accuracy | Recall | Ndcg, Cost
Quality Controls | Not reported | Not reported | Not reported
Rater Population | Domain Experts | Domain Experts | Domain Experts
Annotation Unit | Unknown | Multi Dim Rubric | Pairwise

Suggested Reading Order

  1. Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics + rubric ratings. Focus: Scirepeval / recall. Abstract: It first performs hybrid retrieval that combines semantic…

  2. Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics + pairwise preferences. Focus: relevance. Abstract: To bridge this gap, we introduce ReRanking Preference Optimization (RRPO)…

  3. Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

    Start here for detailed protocol reporting and quality-control evidence. Signals: automatic metrics + pairwise preferences. Focus: ndcg. Abstract: Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise…

  4. TaoSR1: The Thinking Model for E-commerce Relevance Search

    Include a human-eval paper to calibrate against judge-based evaluation settings. Signals: human evaluation + pairwise preferences. Focus: relevance. Abstract: Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning…

  5. SODIUM: From Open Web Data to Queryable Databases

    Include a human-eval paper to calibrate against judge-based evaluation settings. Signals: automatic metrics + expert verification. Focus: Sodium-Bench / accuracy. Abstract: During research, domain experts often ask analytical…

  6. InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

    Include an LLM-as-judge paper to test judge design and agreement assumptions. Signals: LLM-as-judge. Focus: Innoeval. Abstract: However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened…

  7. Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

    Adds automatic metrics with pairwise preferences for broader protocol coverage within this hub. Signals: automatic metrics + pairwise preferences. Focus: hit@5. Abstract: These domains typically involve fixed content…

  8. Role-Augmented Intent-Driven Generative Search Engine Optimization

    Adds automatic metrics with rubric ratings for broader protocol coverage within this hub. Signals: automatic metrics + rubric ratings. Focus: perplexity. Abstract: To better evaluate the method under…

Known Limitations

  • Only 4.3% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Benchmark coverage is thin (17.4% of papers mention benchmarks/datasets).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot

Human Feedback Mix

  • Pairwise Preference (9)
  • Critique Edit (3)
  • Rubric Rating (3)
  • Expert Verification (2)

Evaluation Modes

  • Automatic Metrics (12)
  • Human Eval (2)
  • Llm As Judge (1)

Top Benchmarks

  • Innoeval (1)
  • Scirepeval (1)
  • Sodium Bench (1)
  • TREC (1)

Top Metrics

  • Accuracy (4)
  • Cost (3)
  • Latency (3)
  • Relevance (3)

Rater Population Mix

  • Domain Experts (6)

Quality Controls

  • Adjudication (1)

Coverage diagnostics (sample-based): human-feedback 73.9% · benchmarks 17.4% · metrics 60.9% · quality controls 4.3%.
