Skip to content
← Back to explorer

OckBench: Measuring the Efficiency of LLM Reasoning

Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu · Nov 7, 2025 · Citations: 0

Abstract

Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage. The token efficiency is highly variable in practical. Models solving the same problem with similar accuracy can exhibit up to a \textbf{5.0$\times$} difference in token length, leading to massive gap of model reasoning ability. Such variance exposes significant redundancy, highlighting the critical need for a standardized benchmark to quantify the gap of token efficiency. Thus, we introduce OckBench, the first benchmark that jointly measures accuracy and token efficiency across reasoning and coding tasks. Our evaluation reveals that token efficiency remains largely unoptimized across current models, significantly inflating serving costs and latency. These findings provide a concrete roadmap for the community to optimize the latent reasoning ability, token efficiency. Ultimately, we argue for an evaluation paradigm shift: tokens must not be multiplied beyond necessity. Our benchmarks are available at https://ockbench.github.io/.

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: Coding

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.45
  • Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

  • Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation.
  • Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage.
  • The token efficiency is highly variable in practical.

Why It Matters For Eval

  • Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage.
  • Such variance exposes significant redundancy, highlighting the critical need for a standardized benchmark to quantify the gap of token efficiency.

Related Papers