Metric Hub

Cost In CS.LG Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 30 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: GSM8K. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 25, 2026.

Papers: 30 · Last published: Feb 25, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 30 papers for Cost In CS.LG Papers. Dominant protocol signals are automatic metrics and simulation environments, with benchmark focus on GSM8K and NyayaBench and metric focus on cost and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

3.3% of papers report explicit human-feedback signals, led by expert verification.

Evidence: Multi-Objective Alignment of Language Models for Personalized Psychotherapy; How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning

Automatic metrics appear in 93.3% of papers in this hub.

Evidence: How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning; From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators

GSM8K serves as a benchmark anchor for cross-paper comparisons on this page, although only 1 of the 30 papers (3.3%) mentions it.

Evidence: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning

Protocol Takeaways

The most common quality-control signal is rater calibration (6.7% of papers).

Evidence: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning

Raters are mostly domain experts, and annotation is commonly at the trajectory level; use this to scope replication staffing.

Evidence: SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; Multi-Objective Alignment of Language Models for Personalized Psychotherapy; How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?; JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning

Stratify by benchmark (GSM8K vs NyayaBench) before comparing methods.

Evidence: How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning; From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators

Benchmark Interpretation

GSM8K appears in 3.3% of hub papers (1/30); use this cohort for benchmark-matched comparisons.
NyayaBench appears in 3.3% of hub papers (1/30); use this cohort for benchmark-matched comparisons.
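
The percentages on this page appear to be simple paper counts divided by the hub size of 30, rounded to one decimal place. A minimal sketch of that arithmetic follows; the function name `coverage_pct` and the rounding rule are assumptions for illustration, not the hub's actual implementation.

```python
def coverage_pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


HUB_SIZE = 30  # number of papers in this hub, per the page header

print(coverage_pct(1, HUB_SIZE))   # GSM8K or NyayaBench: 3.3
print(coverage_pct(10, HUB_SIZE))  # accuracy: 33.3
print(coverage_pct(30, HUB_SIZE))  # cost: 100.0
```

With a single paper per benchmark, each "cohort" is one data point, so a benchmark-matched comparison here is a pointer to a paper more than a statistical sample.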

Metric Interpretation

cost is reported in 100% of hub papers (30/30); compare with a secondary metric before ranking methods.
accuracy is reported in 33.3% of hub papers (10/30); compare with a secondary metric before ranking methods.
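
One way to pair cost with a secondary metric before ranking is to keep only the methods that no other method beats on both axes (a Pareto filter). The sketch below assumes lower cost and higher accuracy are better; the `Result` type, the method labels, and all numbers are hypothetical placeholders, not values from any paper in this hub.

```python
from typing import NamedTuple


class Result(NamedTuple):
    method: str      # hypothetical method label
    cost: float      # lower is better (units as reported by the paper)
    accuracy: float  # higher is better


def pareto_front(results):
    """Keep results that no other result matches or beats on both metrics."""
    front = []
    for r in results:
        dominated = any(
            o.cost <= r.cost
            and o.accuracy >= r.accuracy
            and (o.cost < r.cost or o.accuracy > r.accuracy)
            for o in results
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.cost)


# Placeholder numbers for illustration only; not taken from any paper.
demo = [
    Result("A", cost=1.0, accuracy=0.70),
    Result("B", cost=2.0, accuracy=0.85),
    Result("C", cost=2.5, accuracy=0.80),  # costs more than B and is less accurate
    Result("D", cost=4.0, accuracy=0.90),
]

print([r.method for r in pareto_front(demo)])  # ['A', 'B', 'D']
```

Ranking by cost alone would place C ahead of D, while the filter shows that C is never the better choice once accuracy is considered.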

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (3.3% vs 45% target).
Close the gap on papers reporting quality controls: coverage is a replication risk (6.7% vs 30% target).
Close the gap on papers naming benchmarks/datasets: coverage is a replication risk (13.3% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (100% vs 35% target).
Close the gap on papers with known rater population: coverage is a replication risk (10% vs 35% target).
Close the gap on papers with known annotation unit: coverage is a replication risk (13.3% vs 35% target).
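
The checklist labels can be reproduced by comparing each observed coverage with its target. The sketch below assumes that meeting or exceeding the target counts as "strong" and anything below it as a "replication risk"; that threshold rule is inferred from the six rows above, and the names `coverage_status` and `CHECKLIST` are hypothetical.

```python
def coverage_status(observed_pct: float, target_pct: float) -> str:
    # Assumed rule: at or above the target is "strong", otherwise a risk.
    return "strong" if observed_pct >= target_pct else "replication risk"


# (signal, observed %, target %) copied from the checklist above.
CHECKLIST = [
    ("explicit human feedback", 3.3, 45.0),
    ("quality controls", 6.7, 30.0),
    ("benchmarks/datasets named", 13.3, 35.0),
    ("evaluation metrics named", 100.0, 35.0),
    ("known rater population", 10.0, 35.0),
    ("known annotation unit", 13.3, 35.0),
]

# Print the largest shortfalls first so the biggest gaps lead the list.
for signal, observed, target in sorted(CHECKLIST, key=lambda row: row[1] - row[2]):
    gap = observed - target
    print(f"{signal}: {coverage_status(observed, target)} (gap {gap:+.1f} points)")
```

Sorting by the signed gap puts explicit human feedback first, which matches the page's emphasis on that signal as the widest shortfall.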

Known Limitations

Only 6.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: GSM8K - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: cost - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

Both: 0 · Automatic Metrics only: 28 · Simulation Env only: 2

No paper uses both Automatic Metrics and Simulation Env.
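
The three counts above are what plain set operations give when each evaluation mode is treated as a set of paper IDs, with Automatic Metrics on the left and Simulation Env on the right. The sketch below uses hypothetical IDs chosen only to reproduce the reported counts; the function name `overlap_counts` is an assumption.

```python
def overlap_counts(left: set, right: set) -> dict:
    """Count items shared by both sets and items unique to each side."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical paper IDs, chosen only to reproduce the counts on this page.
automatic_metrics = {f"paper-{i:02d}" for i in range(1, 29)}  # 28 papers
simulation_env = {"paper-29", "paper-30"}                     # 2 papers

print(overlap_counts(automatic_metrics, simulation_env))
# {'both': 0, 'left_only': 28, 'right_only': 2}
```

Because the two sets are disjoint, no paper in this hub can be used to compare an automatic metric against a simulation-environment result on the same system.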

Benchmark Brief

GSM8K

Coverage: 1 paper (3.3%)

Examples: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Benchmark Brief

NyayaBench

Coverage: 1 paper (3.3%)

Examples: Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning

Benchmark Brief

Retrieval

Coverage: 1 paper (3.3%)

Examples: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG

Metric Brief

cost

Coverage: 30 papers (100%)

Examples: How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning

Metric Brief

accuracy

Coverage: 10 papers (33.3%)

Examples: How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?; JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning; From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators

Metric Brief

inference cost

Coverage: 6 papers (20%)

Examples: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Luna-2: Scalable Single-Token Evaluation with Small Language Models; Sink-Aware Pruning for Diffusion Language Models

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026

Automatic Metrics Coding

Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026

Automatic Metrics Coding

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning
Ruichen Xu, Ying-Jun Angela Zhang, Jianwei Huang · Feb 25, 2026

Automatic Metrics General

Extensive evaluations on MNIST and CIFAR-10 demonstrate that JSAM achieves up to 15% improvement in test accuracy compared to existing unbiased selection mechanisms while maintaining cost efficiency across varying data heterogeneity levels.
From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators
Zhihao Li, Yu Feng, Zhilu Lai, Wei Wang · Feb 25, 2026

Automatic Metrics General

On standard PDE benchmarks and real datasets, our method attains state-of-the-art competitive accuracy while providing intrinsic interpretability.
Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni · Feb 24, 2026

Automatic Metrics General

Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.
Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei · Feb 24, 2026

Automatic Metrics Medicine Multilingual

Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP).
Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026

Automatic Metrics Math Coding

Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.
Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning
Zhangjie Xia, Yu Yang, Pan Xu · Feb 24, 2026

Simulation Env General

Off-dynamics offline reinforcement learning (RL) aims to learn a policy for a target domain using limited target data and abundant source data collected under different transition dynamics.
Motivation is Something You Need
Mehdi Acheli, Walid Gaaloul · Feb 24, 2026

Automatic Metrics General

Inspired by the interplay of emotions and cognition in the human brain and more specifically the SEEKING motivational state, we design a dual-model framework where a smaller base model is trained continuously, while a larger motivated model
Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference
Anna Hart, Chi Han, Jeonghwan Kim, Huimin Zhao, Heng Ji · Feb 24, 2026

Automatic Metrics General

Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties.
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026

Automatic Metrics Math

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026

Automatic Metrics Multilingual

Personal AI agents incur substantial cost via repeated LLM calls.
Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026

Automatic Metrics Coding

Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026

Automatic Metrics Coding

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning.
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Jyotin Goel, Souvik Maji, Pratik Mazumder · Feb 19, 2026

Automatic Metrics General

Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates.
Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
Akira Sakai, Yuma Ichikawa · Feb 19, 2026

Automatic Metrics General

Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck.
Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026

Automatic Metrics Medicine

While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
MAEB: Massive Audio Embedding Benchmark
Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha · Feb 17, 2026

Simulation Env Coding Multilingual

We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages.
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao · Dec 29, 2025

Automatic Metrics General

Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance.
Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan · Dec 8, 2025

Automatic Metrics Math Law

We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions.
Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu · Oct 30, 2025

Automatic Metrics Coding

Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size.
mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
Guy Dar · Sep 27, 2025

Automatic Metrics General

We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data.
Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie · Aug 19, 2025

Automatic Metrics General

Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth-the hardest pr
DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging
Neha Verma, Kenton Murray, Kevin Duh · Jul 6, 2025

Automatic Metrics General

Structured pruning methods designed for Large Language Models (LLMs) generally focus on identifying and removing the least important components to optimize model size.
Complexity-aware fine-tuning
Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev · Jun 26, 2025

Automatic Metrics General

General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains.
Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang · Jun 5, 2025

Automatic Metrics Coding

Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities.
vCache: Verified Semantic Prompt Caching
Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu · Feb 6, 2025

Automatic Metrics General

We release the vCache implementation and four benchmarks to support future research.
GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression
Kainan Liu, Yong Zhang, Ning Cheng, Zhitao Li, Shaojun Wang · Dec 31, 2024

Automatic Metrics General

Recent studies have demonstrated that many layers are functionally redundant in large language models (LLMs), enabling model compression by removing these layers to reduce inference cost.
Renaissance: Investigating the Pretraining of Vision-Language Encoders
Clayton Fields, Casey Kennington · Nov 11, 2024

Automatic Metrics Coding

To conduct these experiments, we introduce a VL evaluation framework called Renaissance.
Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning
Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh · Feb 24, 2024

Automatic Metrics Coding

While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training.
