No verified implementation yet
Pretrained Models Available

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li +2 more

December 16, 2025
arXiv: 2512.14698

0 repos
~a few days to reproduce

Abstract

This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation i...

Summary

TimeLens rethinks video temporal grounding (VTG) for multimodal LLMs by positioning itself as a clear, non-novel baseline that emphasizes data quality, prompt design, and training recipes over new architectures. The work introduces re-annotated evaluation data (TimeLens-Bench), a large-scale training set (TimeLens-100K), an interleaved textual timestamp representation, and a thinking-free RLVR training paradigm. TimeLens-7B achieves state-of-the-art open-source VTG performance and strong IoU on VUE-TR while preserving general video understanding.

Key Contributions

Positions TimeLens as a straightforward but carefully engineered baseline for video temporal grounding with multimodal LLMs rather than a new architecture.
Introduces TimeLens-Bench by re-annotating three popular VTG benchmarks under stricter quality criteria to fix dataset issues.
Releases TimeLens-100K, a large-scale high-quality VTG training set generated by an automated re-annotation pipeline to reduce noisy labels.
Proposes an interleaved textual encoding for timestamps in prompts, identified via ablations as a simple and effective time representation.
Applies a thinking-free reinforcement learning with verifiable rewards (RLVR) training paradigm adapted specifically to video temporal grounding.
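
The interleaved textual timestamp idea can be illustrated with a minimal sketch. This is not the paper's exact prompt format, which is not given on this page. The function name build_interleaved_prompt, the "<frame>" placeholder, the one-decimal "s" suffix, and the instruction wording are all assumptions made for illustration. The only point shown is that each sampled frame is preceded by its timestamp written as plain text.

```python
from typing import List


def build_interleaved_prompt(
    frame_times: List[float],
    query: str,
    frame_token: str = "<frame>",  # hypothetical placeholder for one frame's visual tokens
) -> str:
    """Interleave a plain-text timestamp with each frame placeholder (illustrative only)."""
    parts = []
    for t in frame_times:
        # The timestamp is ordinary text placed right before the frame it labels.
        parts.append(f"{t:.1f}s {frame_token}")
    video_part = " ".join(parts)
    # The instruction wording below is an assumption, not taken from the paper.
    return (
        f"{video_part}\n"
        f"Query: {query}\n"
        "Answer with the start and end time in seconds."
    )


prompt = build_interleaved_prompt([0.0, 2.0, 4.0], "the person opens the door")
print(prompt)
```

In a real pipeline the placeholder would be replaced by whatever frame tokens the chosen model's processor expects; that mapping is model-specific and is not covered here.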

Reproducibility Notes

No official or maintained codebase is available; reproduction is paper-only.
Dataset re-annotation details must be inferred from the paper and citations.
RLVR training specifics (rewards, schedules) require careful reverse-engineering.
Expect multi-day compute and iteration before matching reported VTG results.
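
Because the RLVR reward is not specified on this page, a reasonable starting point when reverse-engineering it is a reward equal to the temporal IoU between the predicted and ground-truth segments. The sketch below is built on that assumption and is not the paper's confirmed reward. The helper names (parse_span, temporal_iou, grounding_reward) and the accepted answer format ("start - end" or "start to end", in seconds) are hypothetical. "Thinking-free" is read here as the model emitting the span directly, which is also an assumption.

```python
import re
from typing import Optional, Tuple

Span = Tuple[float, float]

# Assumed answer format: "<start> - <end>" or "<start> to <end>", optional "s"/"seconds".
_SPAN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:seconds|s)?\s*(?:-|to)\s*(\d+(?:\.\d+)?)"
)


def parse_span(text: str) -> Optional[Span]:
    """Extract the first (start, end) pair from model output, or None if absent or invalid."""
    match = _SPAN.search(text)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    return (start, end) if end > start else None


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(output: str, gt: Span) -> float:
    """Verifiable reward: IoU of the parsed span, 0.0 when no valid span is found."""
    pred = parse_span(output)
    return 0.0 if pred is None else temporal_iou(pred, gt)


print(grounding_reward("5.0 to 15.0", (10.0, 20.0)))
```

A reward of this shape is checkable from the output string alone, which is what makes it "verifiable"; any format bonus, clipping, or schedule would need to come from the paper itself.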

Results & Benchmarks

Task	Model	Metric	Value
Video understanding / reasoning	Qwen2.5-VL-7B	IoU on VUE-TR	36.0
Video understanding / reasoning	GPT-4o	IoU on VUE-TR	34.5
Video understanding / reasoning	Gemini-2.5-Pro	IoU on VUE-TR	41.6

Hardware Requirements

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Best Implementation

Maintained implementation evidence is not confirmed for this paper yet.

Use the Implementation Status and Reproduction Path sections below for the current action plan.

Reproduction Path

Follow this baseline workflow to decide if this paper is worth immediate prototyping.

1. No maintained paper-verified implementation was found; start with the closest related repositories below.
2. Compare repo methods against the paper equations/algorithm before trusting metrics.
3. Create a minimal baseline implementation from the paper and use adjacent repos as references.
4. Prioritize reproducing the core method first: reinforcement learning.

Time to first repro: a few days. The recommended repository is adjacent and not paper-verified, and its match confidence is low.

Related Implementations

These are not paper-verified. Use them as reference points when no direct implementation is available.

TencentARC/TimeLens 123
Strong overlap with paper title keywords

Additional Implementations

No additional verified repositories beyond the primary recommendation.

Hugging Face Artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts.

Curated Related

TencentARC/TimeLens-8B
417 9
TencentARC/TimeLens-7B
157 6

Research Context