HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-25

Updated from current HFEPX corpus (Apr 12, 2026). 92 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 92 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: MMLU. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 25, 2026.

Papers: 92 Last published: Feb 25, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

Analysis blocks are computed from the loaded sample (60 of 92 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

13.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

45.0%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

12% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 44.6% of papers in this hub.
MMLU is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is inter-annotator agreement reporting (2.2% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Improving Parametric Knowledge Access in Reasoning Language Models
Feb 25, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Recall
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Feb 25, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Pass@1, Latency
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Feb 25, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Cost
Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Feb 25, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Faithfulness
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Feb 25, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Feb 25, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy, F1

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Improving Parametric Knowledge Access in Reasoning Language Models Feb 25, 2026	Automatic Metrics	SimpleQA, NQ	Recall	Not reported
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents Feb 25, 2026	Automatic Metrics	SWE Bench, SWE Bench Verified	Pass@1, Latency	Not reported
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference Feb 25, 2026	Automatic Metrics	MMLU	Accuracy, Cost	Not reported
Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models Feb 25, 2026	Automatic Metrics	DROP	Accuracy, Faithfulness	Not reported
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models Feb 25, 2026	Automatic Metrics	MMLU, MMLU Pro	Accuracy	Not reported
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection Feb 25, 2026	Automatic Metrics	Not reported	Accuracy, F1	Inter Annotator Agreement Reported
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models Feb 25, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs Feb 25, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video Feb 25, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models Feb 25, 2026	Automatic Metrics	Not reported	Accuracy	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (12% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (3.3% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (7.6% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (17.4% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (10.9% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (10.9% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 3.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.9% coverage).
Annotation unit is under-specified (10.9% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (MMLU vs SWE-bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: MMLU Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 3.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.9% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (41)
Simulation Env (6)
Human Eval (2)
Llm As Judge (1)

Top Metrics

Accuracy (11)
Cost (3)
Success rate (3)
Pass@1 (2)

Top Benchmarks

MMLU (2)
SWE Bench (2)
SWE Bench Verified (2)
Arlarena (1)

Quality Controls

Inter Annotator Agreement Reported (2)
Calibration (1)

Papers In This Archive Slice

Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles, Patrick Schrempf, David Harris-Birtill · Feb 25, 2026 · Citations: 0

We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical…
Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs
Jiří Milička, Hana Bednářová · Feb 25, 2026 · Citations: 0

The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons.
VeRO: An Evaluation Harness for Agents to Optimize Agents
Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan, Xue · Feb 25, 2026 · Citations: 0

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles.
Mind the Gap in Cultural Alignment: Task-Aware Culture Management for Large Language Models
Binchi Zhang, Xujiang Zhao, Jundong Li, Haifeng Chen, Zhengzhang Chen · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads
Shaswat Patel, Vishvesh Trivedi, Yue Han, Yihuai Hong, Eunsol Choi · Feb 25, 2026 · Citations: 0

Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuaD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTH induces bigger performance drop than masking Retrieval Heads (RH).
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026 · Citations: 0

Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohens kappa, and AUC-ROC.
How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026 · Citations: 0

Long Horizon

First, we observe pervasive shortcut behavior, where they achieve high accuracy without relying on latent reasoning.
Causality $\neq$ Invariance: Function and Concept Vectors in LLMs
Gustaw Opiełka, Hannes Rosenbusch, Claire E. Stevenson · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context
Aishwarya Verma, Laud Ammah, Olivia Nercy Ndlovu Lucas, Andrew Zaldivar, Vinodkumar Prabhakaran · Feb 25, 2026 · Citations: 0

Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage.
Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework
Rakib Ullah, Mominul islam, Md Sanjid Hossain, Md Ismail Hossain · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts
Arno Simons · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Decoder-based Sense Knowledge Distillation
Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis · Feb 25, 2026 · Citations: 0

Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Hanna Yukhymenko, Anton Alexandrov, Martin Vechev · Feb 25, 2026 · Citations: 0

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
SumTablets: A Transliteration Dataset of Sumerian Tablets
Cole Simmons, Richard Diehl Martinez, Dan Jurafsky · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Improving Parametric Knowledge Access in Reasoning Language Models
Melody Ma, John Hewitt · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0

Long Horizon

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
LiCQA : A Lightweight Complex Question Answering System
Sourav Saha, Dwaipayan Roy, Mandar Mitra · Feb 25, 2026 · Citations: 0

The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads
Kunpeng Zhang, Poppy Zhang, Shawndra Hill, Amel Awadelkarim · Feb 25, 2026 · Citations: 0

Traditional methods often miss the nuanced interplay of these components, requiring advanced frameworks for thorough evaluation.
DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs
Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen · Feb 25, 2026 · Citations: 0

Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with…
Dynamic Personality Adaptation in Large Language Models via State Machines
Leon Pielage, Ole Hätscher, Mitja Back, Bernhard Marschall, Benjamin Risse · Feb 25, 2026 · Citations: 0

This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.
When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages
Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan · Feb 25, 2026 · Citations: 0

Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers.
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0

Long Horizon

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0

Tool Use

Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20\% to 40\%.
Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek · Feb 25, 2026 · Citations: 0

Theory of Mind (ToM) refers to an agent's ability to model the internal states of others.
DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain
Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
Louis Estève, Christophe Servan, Thomas Lavergne, Agata Savary · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PatchDenoiser: Parameter-efficient multi-scale patch learning and fusion denoiser for Low-dose CT imaging
Jitindra Fartiyal, Pedro Freire, Sergei K. Turitsyn, Sergei G. Solovski · Feb 25, 2026 · Citations: 0
CxMP: A Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models
Miyu Oba, Saku Sugawara · Feb 25, 2026 · Citations: 0

Most existing benchmarks focus on judging grammatical acceptability, whereas the ability to interpret meanings conveyed by grammatical forms has received much less attention.
RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning
Bo Xue, Yuan Jin, Luoyi Fu, Jiaxin Ding, Xinbing Wang · Feb 25, 2026 · Citations: 0

Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more…
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0

Expert Verification

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
Large Language Models are Algorithmically Blind
Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote · Feb 25, 2026 · Citations: 0

Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best…
Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
Bitan Majumder, Anirban Sen · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li · Feb 25, 2026 · Citations: 0

Pairwise Preference

Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged…
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0

Pairwise Preference

This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
Personalized Graph-Empowered Large Language Model for Proactive Information Access
Chia Cheng Chang, An-Zi Yen, Hen-Hsen Huang, Hsin-Hsi Chen · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
Dezhi Kong, Zhengzhao Feng, Qiliang Liang, Hao Wang, Haofei Sun · Feb 25, 2026 · Citations: 0

To overcome these challenges, we introduce ProactiveMobile, a comprehensive benchmark designed to systematically advance research in this domain.
Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026 · Citations: 0

Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)).
FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0

Demonstrations

In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
Scalable Kernel-Based Distances for Statistical Inference and Integration
Masha Naslidnyk · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
xai-cola: A Python library for sparsifying counterfactual explanations
Lin Zhu, Lei You · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning
Ruichen Xu, Ying-Jun Angela Zhang, Jianwei Huang · Feb 25, 2026 · Citations: 0

Extensive evaluations on MNIST and CIFAR-10 demonstrate that JSAM achieves up to 15% improvement in test accuracy compared to existing unbiased selection mechanisms while maintaining cost efficiency across varying data heterogeneity levels.
DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion
Marcel Lamott, Saifullah Saifullah, Nauman Riaz, Yves-Noel Weweler, Tobias Alt-Veit · Feb 25, 2026 · Citations: 0

We evaluate across eleven benchmarks spanning key information extraction, question answering, document classification, and document layout analysis.
Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026 · Citations: 0

Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0

Long Horizon

In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
Improving Implicit Discourse Relation Recognition with Natural Language Explanations from LLMs
Heng Wang, Changxing Wu · Feb 25, 2026 · Citations: 0

Experimental results on PDTB demonstrate that our approach significantly improves IDRR performance, while human evaluation further confirms that the generated explanations enhance model interpretability.
fEDM+: A Risk-Based Fuzzy Ethical Decision Making Framework with Principle-Level Explainability and Pluralistic Validation
Abeer Dyoub, Francesca A. Lisi · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems
Hyo Jin Kim · Feb 25, 2026 · Citations: 0

Pairwise Preference

Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.
Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
MD. Sagor Chowdhury, Adiba Fairooz Chowdhury · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0

Demonstrations

Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.
Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson · Feb 25, 2026 · Citations: 0

Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular.
Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach
Xu Yang, Chenhui Lin, Xiang Ma, Dong Liu, Ran Zheng · Feb 25, 2026 · Citations: 0

Considering the operational scenarios and requirements in real-world ADNs, in this paper, we propose a hybrid knowledge-data-driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement…
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0

Expert Verification

Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu · Feb 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Long Horizon

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation
Duc Trung Vu, Pham Khanh Chi, Dat Phi Van, Linh Ngo Van, Sang Dinh · Feb 25, 2026 · Citations: 0

Extensive experiments across diverse NLP benchmarks demonstrate that DWA-KD outperforms state-of-the-art KD baselines, while ablation studies confirm the complementary contributions of entropy-based token weighting and embedding and final…
Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis
Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang · Feb 25, 2026 · Citations: 0

Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists.
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0

Pairwise Preference

Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now