HFEPX Metric Hub

Accuracy Metric Papers

Updated from current HFEPX corpus (2026-04-13). This page tracks 60 papers for Accuracy.

Use it to compare how accuracy is measured across human feedback and evaluation studies.

Papers: 60 · Last published: Apr 9, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: High.

Metric Coverage

100.0%

60 sampled papers include metric names.

Benchmark Anchoring

16.7%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

5.0%

3 papers report calibration/adjudication/IAA controls.

None of the 60 papers in this sample is flagged as low-signal.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Use the top metric-reliable papers first, then compare benchmark context in the matrix before drawing conclusions.

Why This Matters For Eval Research

Use this page to compare how accuracy is operationalized across benchmarks and rater setups.

Metric-Driven Protocol Takeaways

Accuracy is most often reported alongside automatic metrics and human evaluation.

Metric Interpretation

accuracy: 60 papers
cost: 6 papers
latency: 4 papers
coherence: 3 papers

Benchmark Context

GSM8K: 2 papers
AoT-PsyPhyBENCH: 1 paper
ARC-Challenge: 1 paper

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
Apr 6, 2026 · Citations: 0 · Score: 9.5

Metrics: Accuracy · Eval: Automatic Metrics
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Apr 8, 2026 · Citations: 0 · Score: 9.0

Metrics: Accuracy, Helpfulness · Eval: Human Eval, Automatic Metrics
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Apr 8, 2026 · Citations: 0 · Score: 9.0

Metrics: Accuracy · Eval: Automatic Metrics
AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages
Apr 9, 2026 · Citations: 0 · Score: 8.0

Metrics: Accuracy · Eval: Automatic Metrics
Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
Apr 9, 2026 · Citations: 0 · Score: 8.0

Metrics: Accuracy · Eval: Automatic Metrics
Training Data Size Sensitivity in Unsupervised Rhyme Recognition
Apr 9, 2026 · Citations: 0 · Score: 8.0

Metrics: Accuracy, Agreement · Eval: Automatic Metrics

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Date	Metrics	Benchmarks	Eval Modes	Quality Controls
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale	Apr 6, 2026	Accuracy	OmniDocBench	Automatic Metrics	Adjudication
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization	Apr 8, 2026	Accuracy, Helpfulness	RewardBench	Human Eval, Automatic Metrics	Not reported
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories	Apr 8, 2026	Accuracy	TraceSafe Bench	Automatic Metrics	Not reported
AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages	Apr 9, 2026	Accuracy	Not reported	Automatic Metrics	Calibration
Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents	Apr 9, 2026	Accuracy	GSM8K	Automatic Metrics	Not reported
Training Data Size Sensitivity in Unsupervised Rhyme Recognition	Apr 9, 2026	Accuracy, Agreement	Not reported	Automatic Metrics	Inter-annotator agreement reported
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models	Apr 8, 2026	Accuracy, Latency	GSM8K, TruthfulQA	Automatic Metrics	Not reported
MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors	Apr 8, 2026	Accuracy	MedDialBench	Automatic Metrics	Not reported
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents	Apr 8, 2026	Accuracy	GAIA, HumanEval+	Automatic Metrics	Not reported
SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation	Apr 8, 2026	Accuracy	Spider, SQLStructEval	Automatic Metrics	Not reported
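
One way to apply the matrix is to group papers by shared benchmark, so that accuracy figures are only set side by side when they come from the same dataset. The sketch below is a minimal illustration in Python, not part of the HFEPX tooling: the rows are a hand-copied subset of the matrix with shortened titles, and the `comparable_groups` helper is a hypothetical name. A shared benchmark is a necessary condition for comparison, not a sufficient one, since eval mode and quality controls may still differ.

```python
from collections import defaultdict

# Hand-copied subset of the matrix above: (short title, benchmarks).
# An empty list stands for "Not reported".
matrix_rows = [
    ("Don't Overthink It", ["GSM8K"]),
    ("Gemma 4, Phi-4, and Qwen3", ["GSM8K", "TruthfulQA"]),
    ("Select-then-Solve", ["GAIA", "HumanEval+"]),
    ("AfriVoices-KE", []),
]


def comparable_groups(rows):
    """Map each benchmark to the papers reporting on it, keeping only
    benchmarks shared by at least two papers."""
    by_benchmark = defaultdict(list)
    for paper, benchmarks in rows:
        for benchmark in benchmarks:
            by_benchmark[benchmark].append(paper)
    return {b: papers for b, papers in by_benchmark.items() if len(papers) >= 2}


groups = comparable_groups(matrix_rows)
for benchmark, papers in groups.items():
    print(f"{benchmark}: {', '.join(papers)}")
```

In this subset only GSM8K is shared by two papers; papers with no reported benchmark drop out of every group, which matches the page's advice to check benchmark context before comparing.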

Researcher Workflow (Detailed)

Checklist

Gap: Human feedback

Human feedback is present in 7 of 60 papers.
Gap: Quality controls

Quality controls are present in 3 of 60 papers.
Gap: Benchmarks

Benchmarks are present in 10 of 60 papers.
Strong: Metrics

Metrics are present in 60 of 60 papers.
Gap: Known rater population

Known rater population is present in 9 of 60 papers.
Gap: Known annotation unit

Known annotation unit is present in 10 of 60 papers.
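
The checklist counts can be turned into coverage percentages with simple arithmetic. The sketch below copies the counts from the checklist and assumes the denominator is the 60 sampled papers; the `coverage_percent` helper is a hypothetical name used only for illustration.

```python
# Counts copied from the checklist above; the total is the 60 sampled papers.
TOTAL_PAPERS = 60
checklist_counts = {
    "Human feedback": 7,
    "Quality controls": 3,
    "Benchmarks": 10,
    "Metrics": 60,
    "Known rater population": 9,
    "Known annotation unit": 10,
}


def coverage_percent(count, total=TOTAL_PAPERS):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


coverage = {field: coverage_percent(n) for field, n in checklist_counts.items()}

# List the weakest-covered fields first.
for field, pct in sorted(coverage.items(), key=lambda item: item[1]):
    print(f"{field}: {pct}%")
```

Under these assumptions the results agree with the summary cards near the top of the page: 100.0% for metrics, 16.7% for benchmarks, and 5.0% for quality controls.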

Strengths

Metrics are present in 60 of 60 papers.

Known Gaps

Human feedback is present in 7 of 60 papers.
Quality controls are present in 3 of 60 papers.
Benchmarks are present in 10 of 60 papers.

Suggested Next Analyses

Review the most recent accuracy papers first, then compare benchmark context before reusing the metric.

Known Limitations

This synthetic persisted page is generated from extraction data because the cached metric payload was missing for accuracy.

Research Utility Snapshot (Detailed)

Top Metrics

Accuracy (60)
Cost (6)
Latency (4)
Coherence (3)

Evaluation Modes

Automatic Metrics (60)
Human Eval (3)
LLM-as-Judge (2)
Simulation Env (1)

Top Benchmarks

GSM8K (2)
AoT-PsyPhyBENCH (1)
ARC-Challenge (1)
DROP (1)

Agentic Mix

None (52)
Long Horizon (5)
Tool Use (2)
Multi Agent (1)

Top Papers Reporting This Metric

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang · Apr 9, 2026 · Citations: 0

Automatic Metrics General

The advent of agentic multimodal models has empowered systems to actively interact with external environments.
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Jiayuan Ye, Vitaly Feldman, Kunal Talwar · Apr 9, 2026 · Citations: 0

Automatic Metrics Law

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
What do Language Models Learn and When? The Implicit Curriculum Hypothesis
Emmy Liu, Kaiser Sun, Millicent Li, Isabelle Lee, Lindia Tjuatja · Apr 9, 2026 · Citations: 0

Automatic Metrics Math Law

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng · Jun 1, 2025 · Citations: 0

Automatic Metrics General

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results.
The Detection-Extraction Gap: Models Know the Answer Before They Can Say It
Hanyang Wang, Mingxuan Zhu · Apr 8, 2026 · Citations: 0

Automatic Metrics Coding

Across five model configurations, two families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix.
Human-computer interactions predict mental health
Veith Weilnhammer, Jefferson Ortega, David Whitney · Nov 25, 2025 · Citations: 0

Automatic Metrics Medicine

Here, we show that everyday human-computer interactions encode mental health with biomarker accuracy.
AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages
Lilian Wanzare, Cynthia Amol, Ezekiel Maina, Nelson Odhiambo, Hope Kerubo · Apr 9, 2026 · Citations: 0

Automatic Metrics Multilingual

Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy.
KV Cache Offloading for Context-Intensive Tasks
Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov · Apr 9, 2026 · Citations: 0

Automatic Metrics General

Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context.
Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
Khushal Sethi · Apr 9, 2026 · Citations: 0

Automatic Metrics Math

We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement.
Stacked from One: Multi-Scale Self-Injection for Context Window Extension
Wei Han, Pan Zhou, Soujanya Poria, Shuicheng Yan · Mar 5, 2026 · Citations: 0

Automatic Metrics General

Across a comprehensive suite of long-context modeling and understanding benchmarks, the proposed model achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy.
Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
Joshua Ashkinaze, Ruijia Guan, Laura Kurek, Eytan Adar, Ceren Budak · Jul 4, 2024 · Citations: 0

Human Eval Automatic Metrics General

We evaluate LLMs' capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy.
HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0

LLM-as-Judge Automatic Metrics General

Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang · Jan 15, 2026 · Citations: 0

Automatic Metrics Simulation Env Coding

We present HumanLLM, a framework treating psychological patterns as interacting causal forces.
Training Data Size Sensitivity in Unsupervised Rhyme Recognition
Petr Plecháč, Artjoms Šeļa, Silvie Cinková, Mirella De Sisto, Lara Nugues · Apr 9, 2026 · Citations: 0

Automatic Metrics Multilingual

This complicates automated rhymed recognition and evaluation, especially in multilingual context.
Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection
Khalid Zaman, Melike Sah, Anuwat Chaiwongyenc, Cem Direkoglu · Apr 9, 2026 · Citations: 0

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving
Xinkai Zhang, Jingtao Zhan, Yiqun Liu, Qingyao Ai · Apr 8, 2026 · Citations: 0

Automatic Metrics General

Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments.
Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation
Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Yuxi Zhang, Huimin Wang · Apr 9, 2026 · Citations: 0

Automatic Metrics General

Experiments on five QA benchmarks demonstrate that GuarantRAG improves accuracy by up to 12.1% and reduces hallucinations by 16.3% compared to standard and dynamic RAG baselines.
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao · Apr 6, 2026 · Citations: 0

Automatic Metrics General

At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while mitigating distribution shift;…
Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech
Fabian Retkowski, Alexander Waibel · Dec 30, 2025 · Citations: 0

Automatic Metrics General

First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task.
Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
George Fountzoulas · Apr 9, 2026 · Citations: 0

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking
Weiyang Huang, Xuefeng Bai, Kehai Chen, Xinyang Chen, Yibin Chen · Apr 9, 2026 · Citations: 0

Automatic Metrics General

Experiments across 9 LRMs and 7 benchmarks show that SAT achieves up to 40% reduction in reasoning tokens while generally maintaining or improving accuracy.
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa · Oct 30, 2025 · Citations: 0

Automatic Metrics Coding

We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans.
Hallucination Detection and Evaluation of Large Language Model
Chenggong Zhang, Haopeng Wang, Hexi Meng · Dec 27, 2025 · Citations: 0

Automatic Metrics General

To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high…
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0

Human Eval Automatic Metrics General

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou · Apr 8, 2026 · Citations: 0

LLM-as-Judge Automatic Metrics General

We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations.
Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction
Jackson Petty, Jaulie Goe, Tal Linzen · Apr 8, 2026 · Citations: 0

Automatic Metrics Multilingual

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman, Sajal Chandra Banik · Apr 8, 2026 · Citations: 0

Automatic Metrics Medicine

This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus.
ClickGuard: A Trustworthy Adaptive Fusion Framework for Clickbait Detection
Chhavi Dhiman, Naman Chawla, Riya Dhami, Gaurav Kumar, Ganesh Naik · Apr 8, 2026 · Citations: 0

Automatic Metrics Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent
Bingxuan Li, Simo Du, Yue Guo · Apr 8, 2026 · Citations: 0

Automatic Metrics Medicine

We propose SEA, a self-learning diagnostic agent with cognitively inspired dual-memory module.
UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song · Jul 29, 2025 · Citations: 0

Automatic Metrics Coding

The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities.
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0

Automatic Metrics General

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics
Kosmas Pinitas, Ilias Maglogiannis · Apr 8, 2026 · Citations: 0

Automatic Metrics General

Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI.
Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering
Elyas Irankhah, Samah Fodeh · Apr 8, 2026 · Citations: 0

Automatic Metrics Medicine

Third, results on the development set show that alignment accuracy is mainly limited by reasoning.
Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents
Hanlin Cai, Houtianfu Wang, Haofan Dong, Kai Li, Sai Zou · Nov 10, 2025 · Citations: 0

Automatic Metrics General

Internet of Agents (IoA) envisions a unified, agent-centric paradigm where heterogeneous large language model (LLM) agents can interconnect and collaborate at scale.
IndoBERT-Sentiment: Context-Conditioned Sentiment Classification for Indonesian Text
Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja · Apr 8, 2026 · Citations: 0

Automatic Metrics General

In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points.
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
Md Motaleb Hossen Manik, Ge Wang · Apr 8, 2026 · Citations: 0

Automatic Metrics Math

We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and…
MARS: Enabling Autoregressive Models Multi-Token Generation
Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun · Apr 8, 2026 · Citations: 0

Automatic Metrics General

When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks.
iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations
Wenshuo Wang, Boyu Cao, Nan Zhuang, Wei Li · Apr 8, 2026 · Citations: 0

Automatic Metrics General

This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.
Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han · Apr 8, 2026 · Citations: 0

Automatic Metrics General

Extensive evaluations across three tasks demonstrate that our approach reduces prefilling FLOPs by 27.48% while maintaining competitive accuracy.
MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
Xiaotian Luo, Xun Jiang, Jiangcheng Wu · Apr 8, 2026 · Citations: 0

Automatic Metrics Medicine

Interactive medical dialogue benchmarks have shown that LLM diagnostic accuracy degrades significantly when interacting with non-cooperative patients, yet existing approaches either apply adversarial behaviors without graded severity or…
Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning
Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong · Apr 8, 2026 · Citations: 0

Automatic Metrics Math

Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer.
Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions
Parth Patil, Dhruv Kumar, Yash Sinha, Murari Mandal · Apr 8, 2026 · Citations: 0

Automatic Metrics General

Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause.
SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication
Nguyen Le Hoang, Tadahiro Taniguchi, Fang Tianwei, Akira Taniguchi · Oct 29, 2024 · Citations: 0

Automatic Metrics General

Emergent Communication (EmCom) investigates how agents develop symbolic communication through interaction without predefined language.
How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Minzhu Tu, Shiyu Ni, Keping Bi · Apr 8, 2026 · Citations: 0

Human Eval Automatic Metrics Math

Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases.
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
Heng Zhou, Zelin Tan, Zhemeng Zhang, Yutao Fan, Yibing Lin · Apr 8, 2026 · Citations: 0

Automatic Metrics General

When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it?
PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim · Mar 11, 2026 · Citations: 0

Automatic Metrics Medicine

We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses.
SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation
Yixi Zhou, Fan Zhang, Zhiqiao Guo, Yu Chen, Haipeng Zhang · Apr 8, 2026 · Citations: 0

Automatic Metrics Coding

Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable.
Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs
Haoyue Liu, Zhichao Wang, Yongxin Guo, Haoran Shou, Xiaoying Tang · Apr 8, 2026 · Citations: 0

Automatic Metrics General

Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average, and reduces optimization cost by 45–87% tokens on…
A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM
Bo Wang, Jing Ma, Hongzhan Lin, Zhiwei Yang, Ruichao Yang · Apr 8, 2026 · Citations: 0

Automatic Metrics General

Explainable fake news detection aims to assess the veracity of news claims while providing human-friendly explanations.
Feedback Adaptation for Retrieval-Augmented Generation
Jihwan Bang, Seunghan Yang, Kyuhong Shim, Simyung Chang, Juntae Lee · Apr 8, 2026 · Citations: 0

Automatic Metrics General

Existing evaluation protocols focus on overall accuracy and fail to capture how systems adapt after feedback is introduced.
SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning
Zhengyang Ai, Zikang Shan, Xiaodong Ai, Jingxian Tang, Hangkai Hu · Apr 8, 2026 · Citations: 0

Automatic Metrics Math

Extensive experiments in math reasoning across three base models and five benchmarks demonstrate that SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.
DiffuMask: Diffusion Language Model for Token-level Prompt Pruning
Caleb Zheng, Jyotika Singh, Fang Tu, Weiyi Sun, Sujeeth Bharadwaj · Apr 8, 2026 · Citations: 0

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs
Maotian Ma, Zheni Zeng, Zhenghao Liu, Yukun Yan · Apr 8, 2026 · Citations: 0

Automatic Metrics Medicine Coding

Though scientific theories and rules can efficiently direct the behaviors of human manipulators, LLMs still do not utilize these highly-condensed knowledge sufficiently through training or prompting.
Does a Global Perspective Help Prune Sparse MoEs Elegantly?
Zeliang Zhang, Nikhil Ghosh, Jiani Liu, Bin Yu, Xiaodong Liu · Apr 8, 2026 · Citations: 0

Automatic Metrics Law

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PACIFIC: Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs
Tianyu Zhao, Siqi Li, Yasser Shoukry, Salma Elmalaki · Feb 6, 2026 · Citations: 0

Automatic Metrics General

Based on these findings, we introduce PACIFIC (Preference Alignment Choices Inference for Five-factor Identity Characterization), a personality-labeled preference dataset containing 1200 preference statements spanning diverse domains (e.g.,…
ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs
Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka · Apr 7, 2026 · Citations: 0

Automatic Metrics General

Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized.
Multi-objective Evolutionary Merging Enables Efficient Reasoning Models
Mario Iacobelli, Adrian Robert Minut, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli · Apr 7, 2026 · Citations: 0

Automatic Metrics Math

Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the…
Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection
Afroza Nowshin, Prithweeraj Acharjee Porag, Haziq Jeelani, Fayeq Jeelani Syed · Apr 7, 2026 · Citations: 0

Automatic Metrics Multilingual

Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy-fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by…
Team Fusion@ SU@ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking
Georgi Grazhdanski, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva · Apr 7, 2026 · Citations: 0

Automatic Metrics Multilingual

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao · Apr 7, 2026 · Citations: 0

Automatic Metrics General

These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly…
