
HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-23

Updated from the current HFEPX corpus (Apr 12, 2026). This daily page groups 61 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: ContentBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 23, 2026.

Papers: 61 · Last published: Feb 23, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 61 papers).

High-Signal Coverage: 100.0% (60 of 60 papers are not flagged as low-signal)

Benchmark Anchors: 15.0% (papers with benchmark/dataset mentions in extraction output)

Metric Anchors: 41.7% (papers with reported metric mentions in extraction output)

  • 3 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons (see the filtering sketch below).

Primary action: use this slice for trend comparison; review the top papers first, then validate shifts in the protocol matrix.
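
A minimal Python sketch of that anchor filter, assuming each paper is a plain dict; the field names ("title", "benchmarks", "metrics") and the helper are assumptions, since this page does not document an export schema.

```python
# Minimal sketch: keep only papers that carry both benchmark and metric anchors.
# Field names are assumed; they are not the HFEPX export schema.

def papers_with_both_anchors(papers):
    """Return papers that name at least one benchmark and at least one metric."""
    return [p for p in papers if p.get("benchmarks") and p.get("metrics")]

# Two rows transcribed from the protocol matrix further down this page.
slice_papers = [
    {"title": "Pyramid MoA", "benchmarks": ["GSM8K"], "metrics": ["Accuracy", "Precision"]},
    {"title": "No One Size Fits All: QueryBandits", "benchmarks": [], "metrics": ["Win rate"]},
]

for paper in papers_with_both_anchors(slice_papers):
    print(paper["title"])  # only the paper with both anchors is printed
```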

Why This Time Slice Matters

  • 9.8% of papers report explicit human-feedback signals, led by expert verification.
  • Automatic metrics appear in 37.7% of papers in this hub.
  • ContentBench is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • 1 sampled paper reports both human evaluation and LLM-as-judge, supporting direct agreement checks (an agreement-check sketch follows this list).
  • The most common quality-control signal is rater calibration (3.3% of papers).
  • Raters are mostly domain experts and the most common annotation unit is ranking; use this to scope replication staffing.
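
For that agreement check, a minimal sketch computing Cohen's kappa between paired human and LLM-judge verdicts; the labels are illustrative and the helper is not taken from any paper in this slice.

```python
from collections import Counter

def cohen_kappa(human, judge):
    """Cohen's kappa between two parallel label sequences (assumes some disagreement exists)."""
    assert len(human) == len(judge) and human
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum((human_freq[l] / n) * (judge_freq[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Illustrative verdicts only; real inputs would be per-item labels released with a paper.
human_labels = ["good", "bad", "good", "good", "bad", "good"]
judge_labels = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohen_kappa(human_labels, judge_labels), 3))
```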

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.
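
The exact ranking formula is not published on this page; the sketch below shows one plausible completeness score that simply counts how many protocol ingredients a paper reports. The field names and the scoring rule are assumptions.

```python
# Hypothetical completeness score: one point per reported protocol ingredient.
FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

def protocol_completeness(paper):
    """Count protocol fields that are reported (not empty and not 'Not reported')."""
    return sum(1 for field in FIELDS if paper.get(field) not in (None, [], "Not reported"))

# Example transcribed from the matrix below: KNIGHT reports all four ingredients.
knight = {
    "eval_modes": ["Automatic Metrics"],
    "benchmarks": ["MMLU"],
    "metrics": ["Cost", "Relevance"],
    "quality_controls": ["Calibration"],
}
print(protocol_completeness(knight))  # 4
```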

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

All papers in this matrix are dated Feb 23, 2026.

Paper | Eval Modes | Benchmarks | Metrics | Quality Controls
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration | Automatic Metrics | MMLU | Cost, Relevance | Calibration
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference | Automatic Metrics | GSM8K | Accuracy, Precision | Calibration
Can Large Language Models Replace Human Coders? Introducing ContentBench | Automatic Metrics | ContentBench | Agreement, Cost | Not reported
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models | Automatic Metrics | Not reported | F1, Precision | Gold Questions
NanoKnow: How to Know What Your Language Model Knows | Automatic Metrics | NQ, SQuAD | Accuracy | Not reported
KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge | Automatic Metrics | KGHaluBench | Accuracy, Coherence | Not reported
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation | Automatic Metrics | Not reported | Accuracy | Not reported
Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems | LLM-as-Judge | Not reported | Precision | Not reported
Natural Language Processing Models for Robust Document Categorization | Automatic Metrics | Not reported | Accuracy, Throughput | Not reported
No One Size Fits All: QueryBandits for Hallucination Mitigation | Automatic Metrics | Not reported | Win rate | Not reported

Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (9.8% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (6.6% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (1.6% vs 35% target).

  • Gap: Papers naming evaluation metrics

    Coverage is a replication risk (11.5% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (8.2% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (11.5% vs 35% target). A coverage-vs-target sketch follows this checklist.
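
The checklist above compares observed coverage against fixed targets. A minimal sketch of that gap check follows, with the percentages and targets copied from the bullets; the dimension keys and the flagging rule are assumptions.

```python
# Targets as stated in the checklist above; the keys and the flagging rule are assumed.
TARGETS = {
    "human_feedback": 0.45,
    "quality_controls": 0.30,
    "benchmarks": 0.35,
    "metrics": 0.35,
    "rater_population": 0.35,
    "annotation_unit": 0.35,
}

def replication_risks(coverage):
    """Return dimensions whose observed coverage falls below the target."""
    return {dim: (obs, TARGETS[dim])
            for dim, obs in coverage.items()
            if obs < TARGETS.get(dim, 0.0)}

observed = {"human_feedback": 0.098, "quality_controls": 0.066, "benchmarks": 0.016,
            "metrics": 0.115, "rater_population": 0.082, "annotation_unit": 0.115}
for dim, (obs, target) in replication_risks(observed).items():
    print(f"Gap: {dim}: {obs:.1%} vs {target:.0%} target")
```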

Strengths

  • Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

  • Only 6.6% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (8.2% coverage).
  • Annotation unit is under-specified (11.5% coverage).

Suggested Next Analyses

  • Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
  • Track metric sensitivity by reporting both accuracy and cost; a Pareto-style sketch follows this list.
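
One way to operationalize the accuracy/cost suggestion is to keep only the runs on the accuracy-cost Pareto frontier. The sketch below uses illustrative numbers, since this slice does not expose paired accuracy/cost values.

```python
def pareto_frontier(points):
    """Keep (name, accuracy, cost) entries that no other entry dominates
    (higher-or-equal accuracy and lower-or-equal cost, strictly better in one)."""
    frontier = []
    for name, acc, cost in points:
        dominated = any(a >= acc and c <= cost and (a > acc or c < cost)
                        for _, a, c in points)
        if not dominated:
            frontier.append((name, acc, cost))
    return frontier

# Illustrative values only; they are not drawn from the papers in this slice.
runs = [("system_a", 0.81, 1.00), ("system_b", 0.78, 0.40), ("system_c", 0.74, 0.55)]
print(pareto_frontier(runs))  # system_c is dominated by system_b
```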

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (23)
  • Simulation Env (4)
  • LLM-as-Judge (2)
  • Human Eval (1)

Top Metrics

  • Accuracy (4)
  • Cost (2)
  • F1 (2)
  • Agreement (1)

Top Benchmarks

  • ContentBench (1)

Quality Controls

  • Calibration (2)
  • Gold Questions (1)
  • Inter-Annotator Agreement Reported (1)

