HFEPX Archive Slice

HFEPX Fortnight Archive: 2026-F01

Updated from current HFEPX corpus (Apr 12, 2026). 93 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 93 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: APPS. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Jan 11, 2026.

Papers: 93 Last published: Jan 11, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

Analysis blocks are computed from the loaded sample (60 of 93 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

13.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

40.0%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

17.2% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 33.3% of papers in this hub.
APPS is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

1 sampled papers report both human evaluation and LLM-as-judge, supporting direct agreement checks.
Most common quality-control signal is rater calibration (2.2% of papers).
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching
Jan 11, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Recall, Mrr
From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text
Jan 6, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Agreement
†DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems
Jan 11, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy
EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation
Jan 10, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy, Precision
FormationEval, an open multiple-choice benchmark for petroleum geoscience
Jan 5, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy, Cost
DeCode: Decoupling Content and Delivery for Medical QA
Jan 5, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Relevance

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching Jan 11, 2026	Automatic Metrics	Medieval	Recall, Mrr	Not reported
From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text Jan 6, 2026	Automatic Metrics	Not reported	Accuracy, Agreement	Calibration, Inter Annotator Agreement Reported
†DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems Jan 11, 2026	Automatic Metrics	DROP	Accuracy	Not reported
EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation Jan 10, 2026	Automatic Metrics	Evm Questbench	Accuracy, Precision	Not reported
FormationEval, an open multiple-choice benchmark for petroleum geoscience Jan 5, 2026	Automatic Metrics	Formationeval	Accuracy, Cost	Not reported
DeCode: Decoupling Content and Delivery for Medical QA Jan 5, 2026	Automatic Metrics	Healthbench	Relevance	Not reported
Distilling Feedback into Memory-as-a-Tool Jan 9, 2026	Automatic Metrics	Not reported	Cost, Inference cost	Not reported
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue Jan 9, 2026	Human Eval, Llm As Judge	Not reported	Agreement	Not reported
CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models Jan 8, 2026	Automatic Metrics	Not reported	Relevance	Not reported
What Matters For Safety Alignment? Jan 7, 2026	Automatic Metrics	Not reported	Success rate, Jailbreak success rate	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (17.2% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (2.2% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (5.4% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (18.3% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (8.6% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (14% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 2.2% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.6% coverage).
Annotation unit is under-specified (14% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (APPS vs evm-questbench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: APPS Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 2.2% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.6% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (31)
Llm As Judge (5)
Human Eval (1)
Simulation Env (1)

Top Metrics

Accuracy (7)
Cost (5)
Agreement (3)
Recall (3)

Top Benchmarks

APPS (1)
Evm Questbench (1)
FCMBench (1)
Jmedethicbench (1)

Quality Controls

Calibration (2)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

Task Arithmetic with Support Languages for Low-Resource ASR
Emma Rafkin, Dan DeGenaro, Xiulin Yang · Jan 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching
Stephen Gadd · Jan 11, 2026 · Citations: 0

Expert Verification

Trained on 32.7 million triplet samples drawn from 67 million toponyms spanning GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names, the Student achieves the highest Recall@1 (85.2%) and Mean Reciprocal Rank (90.8%) on the…
†DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems
Zabir Al Nazi, Shubhashis Roy Dipta, Sudipta Kar · Jan 11, 2026 · Citations: 0

To systematically study this challenge, we introduce DISTRACTMATH-BN, a Bangla benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information.
A Mind Cannot Be Smeared Across Time
Michael Timothy Bennett · Jan 11, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Burn-After-Use for Preventing Data Leakage through a Secure Multi-Tenant Architecture in Enterprise LLM
Qiang Zhang, Elena Emma Wang, Jiaming Li, Xichun Wang · Jan 10, 2026 · Citations: 0
EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation
Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang · Jan 10, 2026 · Citations: 0

Long Horizon

Existing evaluations often overlook execution accuracy and safety.
LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models
Pan Liao, Feng Yang, Di Wu, Jinwen Yu, Yuhua Zhu · Jan 10, 2026 · Citations: 0

To address this, we introduce Grand-SMOT, a large-scale, open-world benchmark providing high-density, dual-stream narratives that comprehensively decouple individual behaviors from environmental contexts.
NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
Robert J. Moore, Sungeun An, Farhan Ahmed, Jay Pankaj Gala · Jan 10, 2026 · Citations: 0

The Natural Conversation Benchmark (NC-Bench) introduces a new approach to evaluating the general conversational competence of large language models (LLMs).
Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective
Feilong Liu · Jan 9, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Distilling Feedback into Memory-as-a-Tool
Víctor Gallego · Jan 9, 2026 · Citations: 0

Rubric RatingCritique Edit

We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls.
Pantagruel: Unified Self-Supervised Encoders for French Text and Speech
Phuong-Hang Le, Valentin Pelloin, Arnault Chatelain, Maryem Bouziane, Mohammed Ghennai · Jan 9, 2026 · Citations: 0

Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech.
FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG
Maxime Dassen, Rebecca Kotula, Kenton Murray, Andrew Yates, Dawn Lawrie · Jan 9, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer
Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu · Jan 9, 2026 · Citations: 0

Demonstrations

Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code.
HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation
Rongxin Chen, Tianyu Wu, Bingbing Xu, Jiatang Luo, Xiucheng Xu · Jan 9, 2026 · Citations: 0

High-fidelity agent initialization is crucial for credible Agent-Based Modeling across diverse domains.
Classroom AI: Large Language Models as Grade-Specific Teachers
Jio Oh, Steven Euijong Whang, James Evans, Jindong Wang · Jan 9, 2026 · Citations: 0

Evaluations across multiple datasets with 208 human participants demonstrate substantial improvements in grade-level alignment, achieving a 35.64 percentage point increase compared to prompt-based methods while maintaining response…
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong · Jan 9, 2026 · Citations: 0

Pairwise PreferenceRubric Rating

Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong · Jan 9, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Over-Searching in Search-Augmented Large Language Models
Roy Xie, Deepak Gopinath, David Qiu, Dong Lin, Haitian Sun · Jan 9, 2026 · Citations: 0
The Illusion of AI Expertise Under Uncertainty: Navigating Elusive Ground Truth via a Probabilistic Paradigm
Aparna Elangovan, Lei Xu, Mahsa Elyasi, Ismail Akdulum, Mehmet Aksakal · Jan 9, 2026 · Citations: 0
A Two-Stage Multitask Vision-Language Framework for Explainable Crop Disease Visual Question Answering
Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Siam Ansary · Jan 8, 2026 · Citations: 0

Without fine-tuning, the model further generalizes well to the external PlantVillageVQA benchmark, achieving 83.18% micro accuracy in the VQA task.
Token-Level LLM Collaboration via FusionRoute
Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang · Jan 8, 2026 · Citations: 0
Key-Value Pair-Free Continual Learner via Task-Specific Prompt-Prototype
Haihua Luo, Xuming Ran, Zhengji Li, Huiyan Xue, Tingting Jiang · Jan 8, 2026 · Citations: 0
Projected Autoregression: Autoregressive Language Generation in Continuous State Space
Oshri Naparstek · Jan 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DR-LoRA: Dynamic Rank LoRA for Fine-Tuning Mixture-of-Experts Models
Guanzhi Deng, Bo Li, Ronghao Chen, Xiujin Liu, Zhuo Han · Jan 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
Yifan Le, Yunliang Li · Jan 8, 2026 · Citations: 0

Pairwise Preference

Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance.
LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence
Hyeongkeun Lee, Jongmin Choi, KiHyun Nam, Joon Son Chung · Jan 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Neurosymbolic Retrievers for Retrieval-augmented Generation
Yash Saxena, Manas Gaur · Jan 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Political Alignment in Large Language Models: A Multidimensional Audit of Psychometric Identity and Behavioral Bias
Adib Sakhawat, Tahsin Islam, Takia Farhin, Syed Rifat Raiyan, Hasan Mahmud · Jan 8, 2026 · Citations: 0

These findings suggest that single-axis evaluations are insufficient and that multidimensional auditing frameworks are important to characterize alignment behavior in deployed LLMs.
Identifying Good and Bad Neurons for Task-Level Controllable LLMs
Wenjie Li, Guansong Pang, Hezhe Qiao, Debin Gao, David Lo · Jan 8, 2026 · Citations: 0
CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts
Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik · Jan 8, 2026 · Citations: 0

Multi Agent

To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics.
Vision-Language Agents for Interactive Forest Change Analysis
James Brock, Ce Zhang, Nantheera Anantrasirichai · Jan 8, 2026 · Citations: 0

To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks.
Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
San Kim, Gary Geunbae Lee · Jan 7, 2026 · Citations: 0

However, their reliance on large-scale datasets-often collected from human or web sources-makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors.
Interpreting Transformers Through Attention Head Intervention
Mason Kadem, Rong Zheng · Jan 7, 2026 · Citations: 0
RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection
Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin, Pin-Yu Chen, Hong-Yan Huang · Jan 7, 2026 · Citations: 0

DemonstrationsCritique Edit

On a fake news detection benchmark, RADAR consistently outperforms strong retrieval-augmented trainable baselines, as well as general-purpose LLMs with retrieval.
What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026 · Citations: 0

Red Team Tool Use

This paper presents a comprehensive empirical study on the safety alignment capabilities.
IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
Wei Long, Haifeng Wu, Shiyin Jiang, Jinhua Zhang, Xinchun Ji · Jan 7, 2026 · Citations: 0
Compact Example-Based Explanations for Language Models
Loris Schoenegger, Benjamin Roth · Jan 7, 2026 · Citations: 0

As humans cannot interpret thousands of documents, only a small subset of the training data can be presented as an explanation.
EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning
Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu · Jan 6, 2026 · Citations: 0

Long Horizon

Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference.
Prompting Underestimates LLM Capability for Time Series Classification
Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin, Anthony Rios · Jan 6, 2026 · Citations: 0

Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure.
AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation
Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert · Jan 6, 2026 · Citations: 0
One Sample to Rule Them All: Extreme Data Efficiency in Multidiscipline Reasoning with Reinforcement Learning
Yiyuan Li, Zhen Huang, Yanan Wu, Weixun Wang, Xuefeng Li · Jan 6, 2026 · Citations: 0

Across various reasoning benchmarks, polymath learning achieves stronger performance than larger datasets, demonstrating that reasoning structure and skills in samples, rather than quantity, may be the key to unlock enhanced reasoning…
Enhancing Moral Diagnosis and Correction in Large Language Models
Bocheng Chen, Xi Chen, Han Zi, Haitao Mao, Zimo Qi · Jan 6, 2026 · Citations: 0

Red Team

Identifying specific moral errors in an input and generating appropriate corrections require moral sensitivity in large language models (LLMs), which is fundamental for developing their moral performance, yet a challenging task.
SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering
Junli Liang, Pengfei Zhou, Wangqiu Zhou, Wenjie Qing, Qi Zhao · Jan 6, 2026 · Citations: 0

Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.
Towards Faithful Reasoning in Comics for Small MLLMs
Chengcheng Feng, Haojie Yin, Yucheng Jin, Kaizhu Huang · Jan 6, 2026 · Citations: 0

Extensive experiments on five benchmarks spanning comic understanding and broader humor-centric and abstract visual reasoning tasks demonstrate that our framework achieves strong results in the \leq 4B regime, surpasses several 7B…
LLM-Augmented Changepoint Detection: A Framework for Ensemble Detection and Automated Explanation
Fabian Lukassen, Christoph Weisser, Michael Schlee, Manish Kumar, Anton Thielmann · Jan 6, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion
Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee · Jan 6, 2026 · Citations: 0

Pairwise Preference

To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds.
From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text
Shinwoo Park, Yo-Sub Han · Jan 6, 2026 · Citations: 0

Rubric Rating

Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for trained readers, who can over-trust surface well-formedness.
Beyond the Black Box: A Survey on the Theory and Mechanism of Large Language Models
Zeyu Gan, Ruifeng Ren, Wei Yao, Xiaolin Hu, Gengze Xu · Jan 6, 2026 · Citations: 0

To address this theoretical fragmentation, this survey proposes a unified lifecycle-based taxonomy that organizes the research landscape into six distinct stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and…
Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models
Seunghwan Jang, SooJean Han · Jan 6, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation
Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You · Jan 6, 2026 · Citations: 0

While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory.
When Do Tools and Planning Help Large Language Models Think? A Cost- and Latency-Aware Benchmark
Subha Ghoshal, Ali Al-Bustami · Jan 6, 2026 · Citations: 0

Tool Use

We benchmark this behavior on two real-world settings: event-centric question answering over graph-structured knowledge (Event-QA) and persuasive response generation in Reddit ChangeMyView (CMV).
Embedding Retrofitting: Data Engineering for better RAG
Anantha Sharma · Jan 6, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Improved Evidence Extraction and Metrics for Document Inconsistency Detection with LLMs
Nelvin Tan, Yaowen Zhang, James Asikin Cheung, Fusheng Liu, Yu-Ching Shih · Jan 6, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation
Hyeong Kyu Choi, Sharon Li · Jan 5, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Estimating Text Temperature with Language Models
Nikolay Mikhaylovskiy · Jan 5, 2026 · Citations: 0

Following it, we propose a procedure to estimate the temperature of any text, including ones written by humans, with respect to a given language model.
From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality
Fabian Lukassen, Jan Herrmann, Christoph Weisser, Benjamin Saefken, Thomas Kneib · Jan 5, 2026 · Citations: 0

Using G-Eval, an LLM-as-a-judge evaluation method, with dual LLM judges and four evaluation criteria, we evaluate 660 explanations for time-series forecasting.
FormationEval, an open multiple-choice benchmark for petroleum geoscience
Almaz Ermilov · Jan 5, 2026 · Citations: 0

This paper presents FormationEval, an open multiple-choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines.
DeCode: Decoupling Content and Delivery for Medical QA
Po-Jen Ko, Chen-Han Tsai, Yu-Shao Peng · Jan 5, 2026 · Citations: 0

We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess clinical relevance and validity of LLM responses.
Agentic Retoucher for Text-To-Image Generation
Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan · Jan 5, 2026 · Citations: 0

Pairwise Preference

To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop.
Output Embedding Centering for Stable LLM Pretraining
Felix Stollenwerk, Anna Lokrantz, Niclas Hertzberg · Jan 5, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now