Metric Hub

Accuracy + Simulation Env Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 18 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Most frequently cited benchmark: BrowseComp. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 25, 2026.

Papers: 18 · Last published: Feb 25, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 18 papers under Accuracy + Simulation Env Metric Papers. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation, with benchmark focus on BrowseComp and GSM8K and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

5.6% of papers (1 of 18) report explicit human-feedback signals, led by pairwise preferences.

Evidence: BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning; Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text; MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification; Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Automatic metrics appear in 100% of papers in this hub.

Evidence: Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text; MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification; Counterfactual Simulation Training for Chain-of-Thought Faithfulness; SPQ: An Ensemble Technique for Large Language Model Compression

BrowseComp is the named benchmark anchor for cross-paper comparisons on this page, though it appears in only 1 of 18 papers (5.6%).

Evidence: Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text; MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification; Counterfactual Simulation Training for Chain-of-Thought Faithfulness; SPQ: An Ensemble Technique for Large Language Model Compression

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text; MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification; Counterfactual Simulation Training for Chain-of-Thought Faithfulness; SPQ: An Ensemble Technique for Large Language Model Compression

Where rater context is reported, it is mostly domain experts, and annotation units are commonly mixed; use this to scope replication staffing.

Evidence: BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents; KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models; Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text; MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification; Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text; Counterfactual Simulation Training for Chain-of-Thought Faithfulness; SPQ: An Ensemble Technique for Large Language Model Compression

Benchmark Interpretation

BrowseComp appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.
GSM8K appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 100% of hub papers (18/18); compare with a secondary metric before ranking methods.
cost is reported in 11.1% of hub papers (2/18); compare with a secondary metric before ranking methods.

Abstract Evidence Highlights

Direct snippets from paper abstracts to ground protocol and benchmark interpretation.

Protocol MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

Human-eval abstract signal: Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce.

Protocol Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text

Human-eval abstract signal: Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability.

Benchmark MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

Benchmark signal: The dataset contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity.

Metric Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text

accuracy metric signal: The results indicate that the smaller, sequentially fine-tuned DistilBERT model achieved the highest overall accuracy of 84%, outperforming all of the LLMs in zero and few-shot set ups, using minimal LLM generated code-mixed data...

Metric MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

accuracy metric signal: Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy.

Protocol Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Protocol abstract signal: Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output.

Protocol SPQ: An Ensemble Technique for Large Language Model Compression

Protocol abstract signal: This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization.

Protocol Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

Protocol abstract signal: Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation.
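
The MixSarc snippet above contrasts micro-F1 with exact match accuracy. The following is a minimal sketch of why the two can diverge, assuming a multi-label setup in which each sentence carries a set of positive labels. The label names come from the abstract snippet; the gold and predicted label sets are hypothetical toy data, not results from the paper.

```python
# Toy illustration: micro-F1 gives credit for partially correct label sets,
# while exact match only counts sentences whose full label set is correct.
# The gold/pred data below is hypothetical, made up for this sketch.

LABELS = ("humor", "sarcasm", "offensiveness", "vulgarity")


def micro_f1(gold, pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in gold
        fp += len(p - g)   # labels predicted but absent from gold
        fn += len(g - p)   # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def exact_match(gold, pred):
    """Fraction of items whose predicted label set equals the gold set exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = [{"humor", "sarcasm"}, {"sarcasm"}, {"offensiveness", "vulgarity"}, {"humor"}]
pred = [{"humor"}, {"sarcasm", "humor"}, {"offensiveness"}, {"humor"}]

print(f"micro-F1:    {micro_f1(gold, pred):.3f}")
print(f"exact match: {exact_match(gold, pred):.3f}")
```

In this toy case three of the four predictions are only partly right, so micro-F1 stays fairly high while exact match drops to one in four. This is one reason to read accuracy alongside a secondary metric before ranking methods.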

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (5.6% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (16.7% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (11.1% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).
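
The checklist percentages can be reproduced with a short sketch. The paper counts below are inferred from the reported percentages over 18 papers (for example, 5.6% corresponds to 1 of 18), and the rule that coverage at or above the target counts as "strong" is an assumption; the page does not state how the labels are assigned. The `CoverageCheck` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class CoverageCheck:
    """One checklist row (hypothetical helper, not part of HFEPX)."""
    name: str
    papers_with_signal: int   # inferred from the reported percentage
    total_papers: int
    target_pct: float

    @property
    def coverage_pct(self) -> float:
        return round(100 * self.papers_with_signal / self.total_papers, 1)

    @property
    def status(self) -> str:
        # Assumed rule: meeting or exceeding the target is "strong".
        return "strong" if self.coverage_pct >= self.target_pct else "replication risk"


checks = [
    CoverageCheck("explicit human feedback", 1, 18, 45.0),
    CoverageCheck("quality controls", 0, 18, 30.0),
    CoverageCheck("benchmarks/datasets", 3, 18, 35.0),
    CoverageCheck("evaluation metrics", 18, 18, 35.0),
    CoverageCheck("known rater population", 2, 18, 35.0),
    CoverageCheck("known annotation unit", 0, 18, 35.0),
]

for c in checks:
    print(f"{c.name}: {c.coverage_pct}% vs {c.target_pct:g}% target -> {c.status}")
```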

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (11.1% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: BrowseComp - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=0, right_only=17

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=18, left_only=0, right_only=0

18 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=1, left_only=17, right_only=0

1 paper uses both Simulation Env and Human Eval.
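
The both / left_only / right_only figures in the pairwise comparisons above read as set overlaps between papers tagged with each evaluation mode. A minimal sketch of that computation follows, assuming each mode is represented as a set of paper identifiers. The identifiers `p01` to `p18` and the function name `overlap_counts` are hypothetical.

```python
def overlap_counts(left: set[str], right: set[str]) -> dict[str, int]:
    """Count papers tagged with both modes, only the left mode, or only the right mode."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical paper IDs chosen to mirror the human_eval vs automatic_metrics row.
all_papers = {f"p{i:02d}" for i in range(1, 19)}
human_eval = {"p01"}
automatic_metrics = set(all_papers)

print(overlap_counts(human_eval, automatic_metrics))
```

With these toy sets the counts come out as both=1, left_only=0, right_only=17, matching the first comparison.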

Benchmark Brief

BrowseComp

Coverage: 1 paper (5.6%) mentions BrowseComp.

Examples: BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

Benchmark Brief

GSM8K

Coverage: 1 paper (5.6%) mentions GSM8K.

Examples: SPQ: An Ensemble Technique for Large Language Model Compression

Benchmark Brief

Retrieval

Coverage: 1 paper (5.6%) mentions Retrieval.

Examples: Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

Metric Brief

accuracy

Coverage: 18 papers (100%) mention accuracy.

Examples: Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text; MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification; Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Metric Brief

cost

Coverage: 2 papers (11.1%) mention cost.

Examples: DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation; EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science

Metric Brief

exact match

Coverage: 2 papers (11.1%) mention exact match.

Examples: MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification; Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text; MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification; Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Top Papers Reporting This Metric

Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
Bitan Majumder, Anirban Sen · Feb 25, 2026

Automatic Metrics, Simulation Env · Coding, Multilingual

Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability.

MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification
Kazi Samin Yasar Alam, Md Tanbir Chowdhury, Tamim Ahmed, Ajwad Abrar, Md Rafid Haque · Feb 25, 2026

Human Eval, Automatic Metrics · Coding

We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting.

Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Peter Hase, Christopher Potts · Feb 24, 2026

Automatic Metrics, Simulation Env · Coding

Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output.

SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026

Automatic Metrics, Simulation Env · Math, Coding

Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.

Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · Feb 20, 2026

Automatic Metrics, Simulation Env · Coding

Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation.

Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026

Automatic Metrics, Simulation Env · General

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.

Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Subrit Dikshit · Feb 18, 2026

Automatic Metrics, Simulation Env · Law, Coding

The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter

Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri · Feb 16, 2026

Automatic Metrics, Simulation Env · Multilingual

Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces.

Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts
Kais Allkivi · Feb 13, 2026

Automatic Metrics, Simulation Env · General

Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets.

BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan · Feb 13, 2026

Automatic Metrics, Simulation Env · General

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.

KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026

Automatic Metrics, Simulation Env · Coding

Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.

DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull · Dec 23, 2025

Automatic Metrics, Simulation Env · General

Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.

BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025

Automatic Metrics, Simulation Env · General

Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.

EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science
Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park · Oct 8, 2025

Automatic Metrics, Simulation Env · General

To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals.

On the Inference (In-)Security of Vertical Federated Learning: Efficient Auditing against Inference Tampering Attack
Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei · Jul 3, 2025

Automatic Metrics, Simulation Env · General

Vertical Federated Learning (VFL) is an emerging distributed learning paradigm for cross-silo collaboration without accessing participants' data.

Synthesis of discrete-continuous quantum circuits with multimodal diffusion models
Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil · Jun 2, 2025

Automatic Metrics, Simulation Env · General

We benchmark the model over different experiments, analyzing the method's accuracy across varying qubit counts and circuit depths, showcasing the ability of the method to outperform existing approaches in gate counts and under noisy conditi

HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang · May 23, 2025

Automatic Metrics, Simulation Env · Coding

Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language.

Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations
Zijie Liu, Xinyu Zhao, Jie Peng, Zhuangdi Zhu, Qingyu Chen · Jan 29, 2025

Automatic Metrics, Simulation Env · Medicine

These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information.
