HFEPX Archive Slice

HFEPX Weekly Archive: 2026-W15

Updated from current HFEPX corpus (Apr 9, 2026). 315 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 9, 2026). 315 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: ALFWorld. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 8, 2026.

Papers: 315 Last published: Apr 8, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

Analysis blocks are computed from the loaded sample (60 of 315 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

16.7%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

50.0%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

7.9% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 28.9% of papers in this hub.
ALFWorld is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (2.2% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Apr 8, 2026 · Citations: 0 · Score: 7.5

Eval: Human Eval, Automatic Metrics · Metrics: Accuracy, Helpfulness
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Apr 8, 2026 · Citations: 0 · Score: 7.5

Eval: Automatic Metrics · Metrics: Accuracy
SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)
Apr 8, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: F1
ReDAct: Uncertainty-Aware Deferral for LLM Agents
Apr 8, 2026 · Citations: 0 · Score: 6.5

Eval: Simulation Env · Metrics: Cost, Token cost
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
Apr 8, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy, Latency
MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
Apr 8, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization Apr 8, 2026	Human Eval, Automatic Metrics	Rewardbench	Accuracy, Helpfulness	Not reported
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories Apr 8, 2026	Automatic Metrics	Tracesafe Bench	Accuracy	Not reported
SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA) Apr 8, 2026	Automatic Metrics	Semeval	F1	Not reported
ReDAct: Uncertainty-Aware Deferral for LLM Agents Apr 8, 2026	Simulation Env	ALFWorld	Cost, Token cost	Not reported
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models Apr 8, 2026	Automatic Metrics	GSM8K, TruthfulQA	Accuracy, Latency	Not reported
MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors Apr 8, 2026	Automatic Metrics	Meddialbench	Accuracy	Not reported
How Much LLM Does a Self-Revising Agent Actually Need? Apr 8, 2026	Automatic Metrics	Not reported	F1, Win rate	Not reported
Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering Apr 8, 2026	Automatic Metrics	Not reported	Accuracy, F1	Not reported
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models Apr 8, 2026	Llm As Judge	IFEval, Healthbench	Not reported	Not reported
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images Apr 8, 2026	Llm As Judge, Automatic Metrics	Not reported	Accuracy, Exact match	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (7.9% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (3.5% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (3.8% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (20% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (7.9% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (9.5% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 3.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.9% coverage).
Annotation unit is under-specified (9.5% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (ALFWorld vs BFCL) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: ALFWorld Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 3.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.9% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (91)
Simulation Env (11)
Human Eval (6)
Llm As Judge (6)

Top Metrics

Accuracy (30)
Cost (16)
Recall (8)
F1 (6)

Top Benchmarks

ALFWorld (1)
BFCL (1)
Full Duplex Bench (1)
Healthbench (1)

Quality Controls

Calibration (7)
Inter Annotator Agreement Reported (3)
Adjudication (2)
Gold Questions (2)

Papers In This Archive Slice

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0

Pairwise PreferenceRubric Rating

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou · Apr 8, 2026 · Citations: 0

We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations.
Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction
Jackson Petty, Jaulie Goe, Tal Linzen · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
Jianhui Liu, Haoze Sun, Wenbo Li, Yanbing Zhang, Rui Yang · Apr 8, 2026 · Citations: 0

Spatial understanding is a fundamental cornerstone of human-level intelligence.
Why teaching resists automation in an AI-inundated era: Human judgment, non-modular work, and the limits of delegation
Songhee Han · Apr 8, 2026 · Citations: 0

More fundamentally, teaching and learning are shaped by human cognition, behavior, motivation, and social interaction in ways that cannot be fully specified, predicted, or exhaustively modeled.
A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman, Sajal Chandra Banik · Apr 8, 2026 · Citations: 0

This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus.
ClickGuard: A Trustworthy Adaptive Fusion Framework for Clickbait Detection
Chhavi Dhiman, Naman Chawla, Riya Dhami, Gaurav Kumar, Ganesh Naik · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent
Bingxuan Li, Simo Du, Yue Guo · Apr 8, 2026 · Citations: 0

Long Horizon

We propose SEA, a self-learning diagnostic agent with cognitively inspired dual-memory module.
Efficient Learned Data Compression via Dual-Stream Feature Decoupling
Huidong Ma, Xinyan Shi, Hui Sun, Xiaofei Yue, Xiaoguang Liu · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
On the Price of Privacy for Language Identification and Generation
Xiaoyu Li, Andi Han, Jiaojiao Jiang, Junbin Gao · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
How Much LLM Does a Self-Revising Agent Actually Need?
Seongwoo Jeong, Seonil Son · Apr 8, 2026 · Citations: 0

Critique Edit

Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop.
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0

Red Team Long Horizon

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics
Kosmas Pinitas, Ilias Maglogiannis · Apr 8, 2026 · Citations: 0

Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI.
Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery
Jia Yu, Weiwei Yu, Pengfei Xiao, Fukun Xing · Apr 8, 2026 · Citations: 0

We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the…
Dynamic Context Evolution for Scalable Synthetic Data Generation
Ryan Lingo, Rajeev Chhajer · Apr 8, 2026 · Citations: 0

Tool Use

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Language Bias under Conflicting Information in Multilingual LLMs
Robert Östling, Murathan Kurfalı · Apr 8, 2026 · Citations: 0

To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of…
Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews
Ehsan Barkhordar, Abdulfattah Safa, Verena Blaschke, Erika Lombart, Marie-Catherine de Marneffe · Apr 8, 2026 · Citations: 0

We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37…
Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering
Elyas Irankhah, Samah Fodeh · Apr 8, 2026 · Citations: 0

Expert Verification

Third, results on the development set show that alignment accuracy is mainly limited by reasoning.
The Impact of Steering Large Language Models with Persona Vectors in Educational Applications
Yongchao Wu, Aron Henriksson · Apr 8, 2026 · Citations: 0

We study persona vectors for seven character traits in short-answer generation and automated scoring on the ASAP-SAS benchmark across three models spanning two architectures.
STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems
Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu · Apr 8, 2026 · Citations: 0

To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with…
Selective Neuron Amplification for Training-Free Task Enhancement
Ryyan Akhtar · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Multilingual Embedding Probes Fail to Generalize Across Learner Corpora
Laurits Lyngbaek, Ross Deans Kristensen-McLachlan · Apr 8, 2026 · Citations: 0

Under in-distribution evaluation, probes achieve strong performance (QWK\approx0.7), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions.
Is Cross-Lingual Transfer in Bilingual Models Human-Like? A Study with Overlapping Word Forms in Dutch and English
Iza Škrjanec, Irene Elisabeth Winther, Vera Demberg, Stefan L. Frank · Apr 8, 2026 · Citations: 0

However, their alignment with human processing seems to critically depend on how lexical overlap is encoded, possibly limiting their explanatory adequacy as models of bilingual reading.
SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)
Liang-Chih Yu, Jonas Becker, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Lung-Hao Lee · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
IndoBERT-Sentiment: Context-Conditioned Sentiment Classification for Indonesian Text
Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja · Apr 8, 2026 · Citations: 0

In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points.
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
Xuanbo Su, Wenhao Hu, Le Zhan, Yanqi Yang, Leo Huang · Apr 8, 2026 · Citations: 0

We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable…
ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0

Long Horizon

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
Md Motaleb Hossen Manik, Ge Wang · Apr 8, 2026 · Citations: 0

We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and…
Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation
Philipp D. Siedler · Apr 8, 2026 · Citations: 0

Multi Agent

We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation.
MARS: Enabling Autoregressive Models Multi-Token Generation
Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun · Apr 8, 2026 · Citations: 0

When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks.
Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl
Juan-José Guzman-Landa, Juan-Manuel Torres-Moreno, Graham Ranger, Miguel Figueroa-Saavedra, Martha-Lorena Avendaño-Garrido · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DTCRS: Dynamic Tree Construction for Recursive Summarization
Guanran Luo, Zhongquan Jian, Wentao Qiu, Meihong Wang, Qingqiang Wu · Apr 8, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Continuous Interpretive Steering for Scalar Diversity
Ye-eun Cho · Apr 8, 2026 · Citations: 0

However, evaluations of pragmatic inference in large language models (LLMs) often rely on prompt-based manipulations.
ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals
Yihao Wang, Zijian He, Jie Ren, Keze Wang · Apr 8, 2026 · Citations: 0

We introduce ChunQiuTR, a time-keyed retrieval benchmark built from the Spring and Autumn Annals and its exegetical tradition.
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026 · Citations: 0

Pairwise PreferenceRubric Rating

We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era
Rudra Jadhav, Janhavi Danve · Apr 8, 2026 · Citations: 0

We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs -- LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash -- across 263 text-based tasks spanning all 35 skills in the U.S.
Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus
Aidan Mannion, Cécile Macaire, Armand Violle, Stéphane Ohayon, Xavier Tannier · Apr 8, 2026 · Citations: 0

Our methodology encompasses the collection and refinement of high-quality French biomedical texts, the exploration of causal language modeling approaches using DAPT, and conducting extensive comparative evaluations.
iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations
Wenshuo Wang, Boyu Cao, Nan Zhuang, Wei Li · Apr 8, 2026 · Citations: 0

This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.
Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han · Apr 8, 2026 · Citations: 0

Extensive evaluations across three tasks demonstrate that our approach reduces prefilling FLOPs by 27.48\% while maintaining competitive accuracy.
Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings
Mingchen Li, Wajdi Aljedaani, Yingjie Liu, Navyasri Meka, Xuan Lu · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models
Ane G. Domingo-Aldama, Iker De La Iglesia, Maitane Urruela, Aitziber Atutxa, Ander Barrena · Apr 8, 2026 · Citations: 0

BACKGROUND: Recent studies have shown that domain-adapted large language models (LLMs) do not consistently outperform general-purpose counterparts on standard medical benchmarks, raising questions about the need for specialized clinical…
MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
Xiaotian Luo, Xun Jiang, Jiangcheng Wu · Apr 8, 2026 · Citations: 0

Interactive medical dialogue benchmarks have shown that LLM diagnostic accuracy degrades significantly when interacting with non-cooperative patients, yet existing approaches either apply adversarial behaviors without graded severity or…
HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues
Yijie Zhong, Yunfan Gao, Haofen Wang · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
On the Step Length Confounding in LLM Reasoning Data Selection
Bing Wang, Rui Miao, Chen Shen, Shaotian Yan, Kaiyuan Liu · Apr 8, 2026 · Citations: 0

Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.
Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM
Chengyue Wu, Shiyi Lan, Yonggan Fu, Sensen Gao, Jin Wang · Apr 8, 2026 · Citations: 0

Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality.
WRAP++: Web discoveRy Amplified Pretraining
Jiang Zhou, Yunhao Wang, Xing Wu, Tinghao Yu, Feng Zhang · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models
Paula Dodig, Boshko Koloski, Katarina Sitar Šuštar, Senja Pollak, Matthew Purver · Apr 8, 2026 · Citations: 0

The dataset, derived from the MaCoCu Slovene news collection, combines large language model (LLM)-assisted filtering with human annotation of company-related ESG content.
SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Usman Naseem, Robert Geislinger, Juan Ren, Sarah Kohail, Rudy Garrido Veliz · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation
Guanran Luo, Wentao Qiu, Wanru Zhao, Wenhan Lv, Zhongquan Jian · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning
Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong · Apr 8, 2026 · Citations: 0

Long Horizon

Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer.
Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions
Parth Patil, Dhruv Kumar, Yash Sinha, Murari Mandal · Apr 8, 2026 · Citations: 0

Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause.
GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering
Guanran Luo, Wentao Qiu, Zhongquan Jian, Meihong Wang, Qingqiang Wu · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Video-guided Machine Translation with Global Video Context
Jian Chen, JinZe Lv, Zi Long, XiangHua Fu · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
From Perception to Autonomous Computational Modeling: A Multi-Agent Approach
Daniel N. Wilke · Apr 8, 2026 · Citations: 0

Multi Agent

We present a solver-agnostic framework in which coordinated large language model (LLM) agents autonomously execute the complete computational mechanics workflow, from perceptual data of an engineering component through geometry extraction,…
When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning
Yang Xiang, Yixin Ji, Ruotao Xu, Dan Qiao, Zheming Yang · Apr 8, 2026 · Citations: 0

Inspired by human metacognition, DTSR operates in two stages: (1) Reflection Signal Monitoring, which identifies reflection signals as potential cues for early exit, and (2) Thought Sufficiency Check, which evaluates whether the current CoT…
Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation
Zhiyu Cao, Peifeng Li, Qiaoming Zhu · Apr 8, 2026 · Citations: 0

Pairwise Preference

Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation.
Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search
Zhiyu Cao, Peifeng Li, Qiaoming Zhu · Apr 8, 2026 · Citations: 0

Pairwise Preference

To address this issue, we propose Multi-Faceted Self-Consistent Preference Aligned CQR (MSPA-CQR).
Geometric Properties of the Voronoi Tessellation in Latent Semantic Manifolds of Large Language Models
Marshall Brett · Apr 8, 2026 · Citations: 0

Fisher damage remains constant at ~5,300 positions across the validated range (λ = 0.15-0.6), achieving +28% median margin improvement at λ = 0.6 with invariant downstream benchmarks - a geometric reorganization that compresses the…
TeamLLM: A Human-Like Team-Oriented Collaboration Framework for Multi-Step Contextualized Tasks
Xiangyu Wang, Jin Wu, Haoran Shi, Wei Xia, Jiarui Yu · Apr 8, 2026 · Citations: 0

Long Horizon

To address this issue, we propose TeamLLM, a human-like Team-Oriented Multi-LLM Collaboration Framework.
Multilingual Cognitive Impairment Detection in the Era of Foundation Models
Damar Hoogland, Boshko Koloski, Jaya Caporusso, Tine Kolenik, Ana Zwitter Vitez · Apr 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote