- Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks
Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik · Mar 6, 2026 · Citations: 0
Pairwise Preference Expert Verification
This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods.
- Reforming the Mechanism: Editing Reasoning Patterns in LLMs with Circuit Reshaping
Zhenyu Lei, Qiong Wu, Jianxiong Dong, Yinhan He, Emily Dodwell · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Dynamic Self-Evolving Extraction System
Moin Amin-Naseri, Hannah Kim, Estevam Hruschka · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Language Shapes Mental Health Evaluations in Large Language Models
Jiayi Xu, Xiyang Hu · Mar 6, 2026 · Citations: 0
This study investigates whether large language models (LLMs) exhibit cross-linguistic differences in mental health evaluations.
- MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning
Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour · Mar 6, 2026 · Citations: 0
Expert Verification
Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity.
- LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal · Mar 6, 2026 · Citations: 0
Multi Agent
Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes.
- Symmetry-Constrained Language-Guided Program Synthesis for Discovering Governing Equations from Noisy and Partial Observations
Mirza Samad Ahmed Baig, Syeda Anshrah Gillani · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Counting on Consensus: Selecting the Right Inter-annotator Agreement Metric for NLP Annotation and Evaluation
Joseph James · Mar 6, 2026 · Citations: 0
Human annotation remains the foundation of reliable and interpretable data in Natural Language Processing (NLP).
- Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers
David Heye, Karl Kindermann, Robin Decker, Johannes Lohmöller, Anastasiia Belova · Mar 6, 2026 · Citations: 0
Artifact Evaluation (AE) is essential for ensuring the transparency and reliability of research; closing the gap between exploratory work and real-world deployment is especially important in cybersecurity, particularly in IoT and CPSs,…
- Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records
Brian E. Perron, Dragan Stoll, Bryan G. Victor, Zia Qia, Andreas Jud · Mar 6, 2026 · Citations: 0
Expert human review of 900 stratified cases assessed classification precision, recall, and inter-method reliability (Cohen's kappa).
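The entry above reports inter-method reliability via Cohen's kappa, the standard chance-corrected agreement statistic. A minimal sketch of that computation (the raters, labels, and data below are illustrative, not from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), agreement corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical raters over 10 items with three labels.
a = ["x", "x", "y", "y", "z", "x", "y", "z", "z", "x"]
b = ["x", "x", "y", "z", "z", "x", "y", "z", "y", "x"]
print(round(cohens_kappa(a, b), 3))  # → 0.697
```

Raw agreement here is 0.8, but kappa discounts the 0.34 agreement expected by chance, which is why kappa rather than raw accuracy is reported for inter-method reliability.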
- "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
Roshni Lulla, Fiona Collins, Sanaya Parekh, Thilo Hagendorff, Jonas Kaplan · Mar 6, 2026 · Citations: 0
Pairwise Preference
The alignment problem concerns ensuring that powerful AI systems remain compatible with human preferences and values as their capabilities increase.
- KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection
Archie Sage, Salvatore Greco · Mar 6, 2026 · Citations: 0
Among encoder-based models, RoBERTa-large achieves the strongest results on the public test set, while zero-shot GPT-5.2 generalises better on the hidden evaluation set.
- Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning
Yuchen Zhang, Haralambos Mouratidis, Ravi Shekhar · Mar 6, 2026 · Citations: 0
Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality.
- Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul · Mar 6, 2026 · Citations: 0
Long Horizon
We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis…
- COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Kartik Sharma, Rakshit S. Trivedi · Mar 6, 2026 · Citations: 0
Pairwise Preference Demonstrations
Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline.
- NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches
Ethan Smith · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei · Mar 6, 2026 · Citations: 0
Pairwise Preference
To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives.
- Abductive Reasoning with Syllogistic Forms in Large Language Models
Hirohiko Abe, Risako Ando, Takanobu Morishita, Kentaro Ozeki, Koji Mineshima, Mitsuhiro Okada · Mar 6, 2026 · Citations: 0
Research in AI using Large-Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key concern.
- From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring
Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen · Mar 6, 2026 · Citations: 0
Pairwise Preference
On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning…
- Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task
Hirohiko Abe, Kentaro Ozeki, Risako Ando, Takanobu Morishita, Koji Mineshima · Mar 6, 2026 · Citations: 0
In humans, reasoning often performs well in domain-specific settings, particularly in normative rather than purely formal contexts.
- Transparent AI for Mathematics: Transformer-Based Large Language Models for Mathematical Entity Relationship Extraction with XAI
Tanjim Taharat Aurpa · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary · Mar 6, 2026 · Citations: 0
Critique Edit
We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii)…
- The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks
Anca Dinu, Andreiana Mihail, Andra-Maria Florescu, Claudiu Creanga · Mar 6, 2026 · Citations: 0
The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions.
- Continual Adaptation for Pacific Indigenous Speech Recognition
Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI
Giovanni Servedio, Potito Aghilar, Alessio Mattiace, Gianni Carmosino, Francesco Musicco · Mar 6, 2026 · Citations: 0
At inference, EpisTwin enables complex reasoning over the personal semantic graph via an agentic coordinator that combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement, dynamically re-grounding symbolic entities…
- Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion
Hari Shankar, Vedanta S P, Sriharini Margapuri, Debjani Mazumder, Ponnurangam Kumaraguru · Mar 6, 2026 · Citations: 0
We further show that downstream evaluations on bias benchmarks (such as CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ) reveal persistent harms and under-representation in sensitive contexts.
- SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
Yunlong Chu, Minglai Shao, Yuhang Liu, Bing Hao, Yumeng Lin · Mar 6, 2026 · Citations: 0
Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
- FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang · Mar 6, 2026 · Citations: 0
Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences.
- LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki · Mar 6, 2026 · Citations: 0
Long Horizon
To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic,…
- Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol & Validation Using Eleven Large Language Models
Luis de-Marcos, Manuel Goyanes, Adrián Domínguez-Díaz · Mar 6, 2026 · Citations: 0
Large-scale content analysis is increasingly limited by the absence of observable ground truth or gold-standard labels, as creating such benchmarks through extensive human coding becomes impractical for massive datasets due to high time,…
- MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan · Mar 6, 2026 · Citations: 0
Long Horizon
We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns.
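The MAPO entry above propagates long-horizon effects through Monte Carlo returns over dense judge-model rewards. A minimal sketch of that standard computation, not the paper's actual algorithm (the discount factor and per-turn rewards are illustrative):

```python
def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return at each step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Sweep backwards so each step's return folds in all future rewards.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Hypothetical dense per-turn judge rewards for a 4-turn dialogue.
print(monte_carlo_returns([0.0, 0.5, 0.0, 1.0], gamma=0.5))
# → [0.375, 0.75, 0.5, 1.0]
```

Because each return sums all discounted future rewards, early turns receive credit for a payoff many turns later, which is how a critic-free method can still capture long-horizon effects.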
- CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation
Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad · Mar 6, 2026 · Citations: 0
Pairwise Preference
We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety.
- Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning
Claire Roman, Philippe Meyer · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss · Mar 6, 2026 · Citations: 0
Pairwise Preference Long Horizon
We present RAPTOR (Representation-Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14…
- A Causal Graph Approach to Oppositional Narrative Analysis
Diego Revilla, Martin Fernandez-de-Retana, Lingfeng Chen, Aritz Bilbao-Jayo, Miguel Fernandez-de-Retana · Mar 6, 2026 · Citations: 0
Current methods for textual analysis rely on data annotated within predefined ontologies, often embedding human bias within black-box models.
- Diffusion Language Models Are Natively Length-Aware
Vittorio Rossi, Giacomo Cirò, Davide Beltrame, Luca Gandolfi, Paul Röttger · Mar 6, 2026 · Citations: 0
We evaluate our approach on four benchmarks with diverse tasks -- GSM8K (reasoning), HumanEval (code generation), IFEval (instruction following), and LongFormQA (question answering) -- revealing massive efficiency gains at minimal…
- Making Implicit Premises Explicit in Logical Understanding of Enthymemes
Xuyao Feng, Anthony Hunter · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model
Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin · Mar 6, 2026 · Citations: 0
To evaluate the performance of our model, we develop a comprehensive depth question answer benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios.
- Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality
Xi Wang, Mengdie Zhuang, Jiqun Liu · Mar 6, 2026 · Citations: 0
Human problem-solving is enriched by a diversity of styles and personality traits, yet the development of Large Language Models (LLMs) has largely prioritized uniform performance benchmarks that favour specific behavioural tendencies such…
- Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring
Jonas Kubesch, Lena Huber, Clemens Havas · Mar 6, 2026 · Citations: 0
Rubric Rating
This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation.
- ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning
Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing
Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu · Mar 6, 2026 · Citations: 0
Multi Agent
Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration.
- Track-SQL: Enhancing Generative Language Models with Dual-Extractive Modules for Schema and Context Tracking in Multi-turn Text-to-SQL
Bingfeng Chen, Shaobin Shi, Yongqi Luo, Boyan Xu, Ruichu Cai · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Imagine How To Change: Explicit Procedure Modeling for Change Captioning
Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Who We Are, Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models
Nikita Soni, August Håkan Nilsson, Syeda Mahwish, Vasudha Varadarajan, H. Andrew Schwartz · Mar 6, 2026 · Citations: 0
These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.
- Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling
Chanhui Zhu · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Addressing the Ecological Fallacy in Larger LMs with Human Context
Nikita Soni, Dhruv Vijay Kunjadiya, Pratham Piyush Shah, Dikshya Mohanty, H. Andrew Schwartz · Mar 6, 2026 · Citations: 0
We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (HuFT:Human-aware Fine-Tuning).
- Learning Next Action Predictors from Human-Computer Interaction
Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber · Mar 6, 2026 · Citations: 0
Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively).
- InfoGatherer: Principled Information Seeking via Evidence Retrieval and Strategic Questioning
Maksym Taranukhin, Shuyue Stella Li, Evangelos Milios, Geoff Pleiss, Yulia Tsvetkov · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions
Hussein Ghaly · Mar 6, 2026 · Citations: 0
We introduce two evaluation metrics: Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), in order to avoid hallucinations and unnecessary additions or omissions to the input text beyond the task requirement.
- Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li · Mar 6, 2026 · Citations: 0
Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored.
- VerChol -- Grammar-First Tokenization for Agglutinative Languages
Prabhu Raja · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation
Changcheng Li, Jiancan Wu, Hengheng Zhang, Zhengsu Chen, Guo An · Mar 6, 2026 · Citations: 0
Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
- ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning
Mingluo Su, Huan Wang · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning
Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim · Mar 6, 2026 · Citations: 0
Long Horizon
Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80%…
- Orion: Characterizing and Programming Apple's Neural Engine for LLM Training and Inference
Ramchand Kumaresan · Mar 6, 2026 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.