- An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data
Trinh Pham, Thanh Tam Nguyen, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen · Mar 8, 2026 · Citations: 0
Without timely evaluation, organizations cannot approve releases or detect failures early.
- AI Steerability 360: A Toolkit for Steering Large Language Models
Erik Miehling, Karthikeyan Natesan Ramamurthy, Praveen Venkateswaran, Irene Ko, Pierre Dognin · Mar 8, 2026 · Citations: 0
Use-case classes (for defining tasks) and a benchmark class (for comparing performance on a given task) facilitate comprehensive evaluation and comparison of steering methods and pipelines.
- DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation
Bo Jiang · Mar 8, 2026 · Citations: 0
We introduce a taxonomy of three defense categories -- output perturbation, data poisoning, and information throttling -- and evaluate nine defense configurations using a standardized pipeline with Qwen3-14B as teacher and…
- Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation
David Beauchemin, Richard Khoury · Mar 8, 2026 · Citations: 0
In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks.
- Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context
Ashish Pandey, Tek Raj Chhetri · Mar 8, 2026 · Citations: 0
Using a Croissant-compliant dataset of 2,400+ stereotypical and anti-stereotypical sentence pairs on gender roles across social domains, we implement an evaluation framework, Dual-Metric Bias Assessment (DMBA), combining two metrics: (1)…
- Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems
Zongqian Li, Tengchao Lv, Shaohan Huang, Yixuan Su, Qinzheng Sun · Mar 8, 2026 · Citations: 0
Evaluations on strictly unseen LiveCodeBench demonstrate that MicroCoder achieves 3x larger performance gains within 300 training steps compared to widely-used baseline datasets of comparable size, with consistent advantages under both GRPO…
- Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models
Zongqian Li, Shaohan Huang, Zewen Chi, Yixuan Su, Lexin Zhou · Mar 8, 2026 · Citations: 0
MicroCoder-GRPO achieves up to 17.6% relative improvement over strong baselines on LiveCodeBench v6, with more pronounced gains under extended context evaluation.
- ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs
Yuzhuang Xu, Xu Han, Yuxuan Li, Wanxiang Che · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis
A. J. W. de Vink, Filippos Karolos Ventirozos, Natalia Amat-Lefort, Lifeng Han · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Whitening Reveals Cluster Commitment as the Geometric Separator of Hallucination Types
Matic Korun · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- 3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models
Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li · Mar 8, 2026 · Citations: 0
Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning.
- Large Language Model for Discrete Optimization Problems: Evaluation and Step-by-step Reasoning
Tianhao Qian, Guilin Qi, Z. Y. Wu, Ran Gu, Xuanyi Liu · Mar 8, 2026 · Citations: 0
It aims to (1) provide an overview of LLMs' abilities on large-scale problems, (2) offer suggestions to those who want to solve discrete optimization problems automatically, and (3) establish the results as a benchmark for future research.
- Scalable Training of Mixture-of-Experts Models with Megatron Core
Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Ref-DGS: Reflective Dual Gaussian Splatting
Ningjing Fan, Yiqun Wang, Dongming Yan, Peter Wonka · Mar 8, 2026 · Citations: 0
- KohakuRAG: A simple RAG framework with hierarchical document indexing
Shih-Ying Yeh, Yueh-Feng Ku, Ko-Wei Huang, Buu-Khang Tu · Mar 8, 2026 · Citations: 0
We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with ±0.1% numeric tolerance and exact source attribution.
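The ±0.1% numeric tolerance above can be illustrated with a minimal scoring check. This is a hypothetical sketch, not the challenge's official scorer; interpreting the tolerance as relative to the gold value's magnitude is an assumption.

```python
def within_tolerance(pred: float, gold: float, rel_tol: float = 0.001) -> bool:
    """Accept a predicted numeric answer if it lies within +/-0.1% of the
    gold value's magnitude; require an exact match when the gold is zero."""
    if gold == 0:
        return pred == 0
    return abs(pred - gold) <= rel_tol * abs(gold)
```

Under this reading, a prediction of 100.05 against a gold answer of 100.0 would be accepted, while 100.2 would not.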
- StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control
Haishu Zhao, Aokai Hao, Yuan Ge, Zhenqiang Hong, Tong Xiao · Mar 8, 2026 · Citations: 0
However, there remains a lack of systematic benchmarks that quantify and evaluate style-intensity control in conversations.
- KCoEvo: A Knowledge Graph Augmented Framework for Evolutionary Code Generation
Jiazhen Kang, Yuchen Lu, Chen Jiang, Jinrui Liu, Tianhao Zhang · Mar 8, 2026 · Citations: 0
Both modules are trained with synthetic supervision automatically derived from real-world API diffs, ensuring scalability and minimal human effort.
- A Systematic Comparison of Training Objectives for Out-of-Distribution Detection in Image Classification
Furkan Genç, Onat Özdemir, Emre Akbaş · Mar 8, 2026 · Citations: 0
- Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR
Rishikesh Kumar Sharma, Safal Narshing Shrestha, Jenny Poudel, Rupak Tiwari, Arju Shrestha · Mar 8, 2026 · Citations: 0
In this work, we introduce Nwāchā Munā, a newly curated 5.39-hour manually transcribed Devanagari speech corpus for Nepal Bhasha, and establish the first benchmark using script-preserving acoustic modeling.
- Learning-free L2-Accented Speech Generation using Phonological Rules
Thanathai Lertpetchpun, Yoonjeong Lee, Jihwan Lee, Tiantian Feng, Dani Byrd · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs
Abdessalam Bouchekif, Shahd Gaben, Samer Rashwani, Somaya Eltanbouly, Mutaz Al-Khatib · Mar 8, 2026 · Citations: 0
To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline.
- Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data
Thanathai Lertpetchpun, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd · Mar 8, 2026 · Citations: 0
Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
- TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning
Mingyue Cheng, Shuo Yu, Chuang Jiang, Xiaoyu Tao, Qingyang Mao · Mar 8, 2026 · Citations: 0
To address these limitations, we previously proposed TableMind as a tuning-based autonomous programmatic agent that simulates human-like interaction within a lightweight large language model (LLM).
- Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech
Tajamul Ashraf, Burhaan Rasheed Zargar, Saeed Abdul Muizz, Ifrah Mushtaq, Nazima Mehdi · Mar 8, 2026 · Citations: 0
The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers.
- SeDa: A Unified System for Dataset Discovery and Multi-Entity Augmented Semantic Exploration
Kan Ling, Zhen Qin, Yichi Zhu, Hengrun Zhang, Huiqun Yu · Mar 8, 2026 · Citations: 0
- A Joint Neural Baseline for Concept, Assertion, and Relation Extraction from Clinical Text
Fei Cheng, Ribeka Tanaka, Sadao Kurohashi · Mar 8, 2026 · Citations: 0
We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in-domain contextual embeddings.
- Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs
Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee · Mar 8, 2026 · Citations: 0
Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping.
- Cross-Modal Taxonomic Generalization in (Vision-) Language Models
Tianyang Xu, Marcelo Sandoval-Castaneda, Karen Livescu, Greg Shakhnarovich, Kanishka Misra · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling
J. Clayton Kerce, Alexis Fox · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Image Generation Models: A Technical History
Rouzbeh Shirvani · Mar 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Dial: A Knowledge-Grounded Dialect-Specific NL2SQL System
Xiang Zhang, Hongming Xu, Le Zhou, Wei Zhou, Xuanhe Zhou · Mar 8, 2026 · Citations: 0
We construct DS-NL2SQL, a benchmark covering six major database systems with 2,218 dialect-specific test cases.
- Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning
Guoli Wang, Haonan Shi, Tu Ouyang, An Wang · Mar 8, 2026 · Citations: 0
Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data.
- Generalization in Online Reinforcement Learning for Mobile Agents
Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang · Mar 8, 2026 · Citations: 0
Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen.
- Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka · Mar 8, 2026 · Citations: 0
- AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
Jihyoung Jang, Hyounghun Kim · Mar 8, 2026 · Citations: 0
Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies.
- Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams
Jiyeon Kim, Hyunji Lee, Dylan Zhou, Sue Hyun Park, Seunghyun Yoon · Mar 8, 2026 · Citations: 0
We introduce Online Adaptation to Continual Knowledge Streams (OAKS) to evaluate this capability, establishing a benchmark for online adaptation over streaming, continually updating knowledge.
- SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions
Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali · Mar 7, 2026 · Citations: 0
Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies.
- Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios
Namrata Patil Gurav, Akashdeep Ranu, Archchana Sindhujan, Diptesh Kanojia · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Position: LLMs Must Use Functor-Based and RAG-Driven Bias Mitigation for Fairness
Ravi Ranjan, Utkarsh Grover, Agorista Polyzou · Mar 7, 2026 · Citations: 0
Biases in large language models (LLMs) often manifest as systematic distortions in associations between demographic attributes and professional or social roles, reinforcing harmful stereotypes across gender, ethnicity, and geography.
- RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts
Darya Kharlamova, Irina Proskurina · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Latent Generative Models with Tunable Complexity for Compressed Sensing and other Inverse Problems
Sean Gunn, Jorio Cocola, Oliver De Candido, Vaggos Chatziafratis, Paul Hand · Mar 7, 2026 · Citations: 0
- How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
Nouran Khallaf, Serge Sharoff · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise
Nouran Khallaf, Serge Sharoff · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The Third Ambition: Artificial Intelligence and the Science of Human Behavior
W. Russell Neuman, Chad Coleman · Mar 7, 2026 · Citations: 0
Contemporary artificial intelligence research has been organized around two dominant ambitions: productivity, which treats AI systems as tools for accelerating work and economic output, and alignment, which focuses on ensuring that…
- Adversarial Latent-State Training for Robust Policies in Partially Observable Domains
Angad Singh Ahuja · Mar 7, 2026 · Citations: 0
- Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin
Po-Chun Hsu, Meng-Hsi Chen, Tsu Ling Chao, Chia Tien Han, Da-shan Shiu · Mar 7, 2026 · Citations: 0
To address these gaps, we introduce TS-Bench (Taiwan Safety Benchmark), a standardized evaluation suite for assessing safety performance in Taiwanese Mandarin.
- Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster
Minu Kim, Hoirin Kim, David R. Mortensen · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing
Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah · Mar 7, 2026 · Citations: 0
As Large Language Models (LLMs) transition into autonomous agentic roles, the risk of deception, defined behaviorally as the systematic provision of false information to satisfy external incentives, poses a significant challenge to AI safety.
- Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice
Yuxu Ge · Mar 7, 2026 · Citations: 0
- The DIME Architecture: A Unified Operational Algorithm for Neural Representation, Dynamics, Control and Integration
Ionel Cristian Vladu, Nicu Bizdoaca, Ionica Pirici, Tudor-Adrian Balseanu, Eduard Nicusor Bondoc · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Fine-Grained Table Retrieval Through the Lens of Complex Queries
Wojciech Kosiuk, Xingyu Ji, Yeounoh Chung, Fatma Özcan, Madelon Hulsebos · Mar 7, 2026 · Citations: 0
Our analyses over industry-aligned benchmarks illustrate the robustness of DCTR for highly composite queries and densely connected databases.
- Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language
Yoshiki Tanaka, Ryuichi Uehara, Koji Inoue, Michimasa Inaba · Mar 7, 2026 · Citations: 0
Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions.
- Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu · Mar 7, 2026 · Citations: 0
- Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information
Yoshiki Tanaka, Takumasa Kaneko, Hiroki Onozeki, Natsumi Ezure, Ryuichi Uehara · Mar 7, 2026 · Citations: 0
In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG.
- Grounding Machine Creativity in Game Design Knowledge Representations: Empirical Probing of LLM-Based Executable Synthesis of Goal Playable Patterns under Structural Constraints
Hugh Xuechen Liu, Kıvanç Tatar · Mar 7, 2026 · Citations: 0
Using 26 goal pattern instantiations, we compare a direct generation baseline (natural language -> C# -> Unity) with pipelines conditioned on a human-authored Unity-specific intermediate representation (IR), across three IR configurations…
- Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang · Mar 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Entropy-Aware On-Policy Distillation of Language Models
Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou · Mar 7, 2026 · Citations: 0
Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods.
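The Pass@8 figures above follow the standard pass@k protocol: sample n generations per problem, count the c that pass, and estimate the chance that at least one of k samples is correct. A minimal sketch of the widely used unbiased estimator (the paper's exact sampling setup, e.g. its value of n, is not stated in the summary):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c of the n
    generations passed. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must hit a success.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with n=16 generations of which c=1 is correct, pass@8 evaluates to 0.5.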
- CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs
Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li · Mar 7, 2026 · Citations: 0
Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy.
- Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision
Shreyas Gopal, Donghang Wu, Ashutosh Anshul, Yeo Yue Heng, Yizhou Peng · Mar 7, 2026 · Citations: 0
We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with high-quality TTS questions.
- Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment
Junming Liu, Yuqi Li, Shiping Wen, Zhigang Zeng, Tingwen Huang · Mar 7, 2026 · Citations: 0
In this paper, we propose Hit-RAG, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline.