- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026
Automatic Metrics Math, Coding
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA
Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao · Feb 26, 2026
Automatic Metrics General
Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79% table compression.
- Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue
Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang · Feb 26, 2026
Automatic Metrics General
The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents.
- Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026
Automatic Metrics General
Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.
- Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi · Feb 26, 2026
Automatic Metrics Coding
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models.
- How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026
Automatic Metrics Coding
Latent reasoning has recently been proposed as a reasoning paradigm that performs multi-step reasoning by generating steps in the latent space instead of the textual space.
- When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang · Feb 25, 2026
Automatic Metrics General
Large Language Models (LLMs) are increasingly used to "professionalize" workplace communication, often at the cost of linguistic identity.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026
Automatic Metrics Coding
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026
Automatic Metrics General
Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.
- Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
Minhao Jiang, Zhikai Li, Xuewen Liu, Jing Zhang, Mengjuan Chen · Feb 25, 2026
Automatic Metrics Math
Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency.
- Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas · Feb 25, 2026
Automatic Metrics Multilingual
Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly.
- Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni · Feb 24, 2026
Automatic Metrics General
Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.
- Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei · Feb 24, 2026
Automatic Metrics Medicine, Multilingual
Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP).
- HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao · Feb 24, 2026
Automatic Metrics Coding
Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints.
- CAMEL: Confidence-Gated Reflection for Reward Modeling
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026
Automatic Metrics General
Reward models play a fundamental role in aligning large language models with human preferences.
- Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference
Anna Hart, Chi Han, Jeonghwan Kim, Huimin Zhao, Heng Ji · Feb 24, 2026
Automatic Metrics General
Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties.
- KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026
Automatic Metrics Math
Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks.
- To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering
Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen · Feb 23, 2026
Automatic Metrics Medicine
Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA.
- Structured Prompt Language: Declarative Context Management for LLMs
Wen G. Gong · Feb 23, 2026
Automatic Metrics Multilingual
SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama -> OpenRouter -> self-healing retry) fully transparent to the .spl script.
- Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina · Feb 23, 2026
Automatic Metrics Multilingual
We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best.
- Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding
Roberto Tacconelli · Feb 23, 2026
Automatic Metrics Coding
An out-of-distribution (OOD) evaluation on a document published after the model's training cutoff confirms these gains are not memorization artifacts, achieving 0.723 bpb on unseen text.
- Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026
Automatic Metrics Math
In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
- Can Large Language Models Replace Human Coders? Introducing ContentBench
Michael Haman · Feb 23, 2026
Automatic Metrics Coding
This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
- Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content
Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger · Feb 22, 2026
Automatic Metrics General
The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift.
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026
Automatic Metrics Math, Coding
Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
- Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026
Automatic Metrics Multilingual
Personal AI agents incur substantial cost via repeated LLM calls.
- Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026
Automatic Metrics Math, Coding
LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026
Automatic Metrics Coding
Real-time guardrails require evaluation that is accurate, cheap, and fast, yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
- Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning
Tao Wu, Adam Kapelner · Feb 20, 2026
Automatic Metrics General
In summary, we demonstrate that a modern embedding model built on a neural network architecture, when guided by human supervision, yields a low-cost, large supply of near-perfect contexts for teaching vocabulary across a variety of target words.
- Information-Theoretic Storage Cost in Sentence Comprehension
Kohei Kajikawa, Shinnosuke Isono, Ethan Gotlieb Wilcox · Feb 20, 2026
Automatic Metrics General
Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input.
- Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026
Automatic Metrics Coding
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning.
- Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Jyotin Goel, Souvik Maji, Pratik Mazumder · Feb 19, 2026
Automatic Metrics General
Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates.
- Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
Akira Sakai, Yuma Ichikawa · Feb 19, 2026
Automatic Metrics General
Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck.
- ReIn: Conversational Error Recovery with Reasoning Inception
Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma · Feb 19, 2026
Automatic Metrics Law, Medicine
Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors.
- BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
Ahmed Rafid, Rumman Adib, Fariya Ahmed, Ajwad Abrar, Mohammed Saidul Islam · Feb 18, 2026
Automatic Metrics Medicine, Multilingual
However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries.
- Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Wenxuan Ding, Nicholas Tomlin, Greg Durrett · Feb 18, 2026
Simulation Env Coding
Each problem has a latent environment state that can be reasoned about via a prior, which is passed to the LLM agent.
- Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić · Feb 18, 2026
Human Eval Coding, Multilingual
Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data.
- TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif · Feb 18, 2026
Automatic Metrics General
Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, and gating.
- Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026
Automatic Metrics Medicine
While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
- MAEB: Massive Audio Embedding Benchmark
Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha · Feb 17, 2026
Simulation Env Coding, Multilingual
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages.
- *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu · Feb 17, 2026
Automatic Metrics General
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods.
- Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026
Automatic Metrics General
Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
- Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement
Stephan Ludwig, Peter J. Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin · Feb 17, 2026
Automatic Metrics Coding
Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice.
- Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik · Feb 16, 2026
Automatic Metrics General
Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models.
- Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech
Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang Gu · Feb 16, 2026
Automatic Metrics Medicine, Coding
Early diagnosis of Alzheimer's Disease (AD) is crucial for delaying its progression.
- Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026
Automatic Metrics Coding
16.1% (+14.5pp); on CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2%.
- Buy versus Build an LLM: A Decision Framework for Governments
Jiahao Lu, Ziwei Xu, William Tjhi, Junnan Li, Antoine Bosselut · Feb 13, 2026
Automatic Metrics General
This paper provides a strategic framework for making this decision by evaluating these options across dimensions including sovereignty, safety, cost, resource capability, cultural fit, and sustainability.
- Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026
Simulation Env Math, Coding
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
- The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems
Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu · Feb 5, 2026
Simulation Env General
Model collaboration -- systems where multiple language models (LMs) collaborate -- combines the strengths of diverse models, at the cost of loading multiple LMs.
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026
Simulation Env Coding
While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
- Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis
Gaurav Negi, MA Waskow, John McCrae, Paul Buitelaar · Jan 23, 2026
Human Eval General
Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications.
- Fast-weight Product Key Memory
Tianyu Zhao, Llion Jones · Jan 2, 2026
Automatic Metrics General
Notably, in Needle-in-a-Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
- Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao · Dec 29, 2025
Automatic Metrics General
Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance.
- DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull · Dec 23, 2025
Automatic Metrics, Simulation Env General
Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
- Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Khushboo Thaker, Yony Bresler · Dec 18, 2025
Automatic Metrics General
Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance.
- QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Maximilian Kreutner, Jens Rupprecht, Georg Ahnert, Ahmed Salem, Markus Strohmaier · Dec 9, 2025
Automatic Metrics Coding
QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods.
- Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan · Dec 8, 2025
Automatic Metrics Math, Law
We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions.
- Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li · Dec 3, 2025
Automatic Metrics Coding
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths.
- PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
Robert Belanec, Branislav Pecher, Ivan Srba, Maria Bielikova · Nov 26, 2025
Simulation Env General
Despite the advances in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce.
- Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models
Saurabh Srivastava, Janit Bidhan, Hao Yan, Abhishek Dey, Tanu Kansal · Nov 6, 2025
Automatic Metrics General
Across 13 diverse benchmarks with DeepSeek-R1 and OpenAI-o1, batch prompting reduces reasoning tokens by 76% (2,950 → 710) on average, while preserving or improving accuracy.