TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li +2 more
Abstract
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation i...
Summary
TimeLens rethinks video temporal grounding (VTG) for multimodal LLMs by positioning itself as a clear, non-novel baseline that emphasizes data quality, prompt design, and training recipes over new architectures. The work introduces re-annotated evaluation data (TimeLens-Bench), a large-scale training set (TimeLens-100K), an interleaved textual timestamp representation, and a thinking-free RLVR training paradigm. TimeLens-7B achieves state-of-the-art open-source VTG performance and strong IoU on VUE-TR while preserving general video understanding.
Key Contributions
- Positions TimeLens as a straightforward but carefully engineered baseline for video temporal grounding with multimodal LLMs rather than a new architecture.
- Introduces TimeLens-Bench by re-annotating three popular VTG benchmarks under stricter quality criteria to fix dataset issues.
- Releases TimeLens-100K, a large-scale high-quality VTG training set generated by an automated re-annotation pipeline to reduce noisy labels.
- Proposes an interleaved textual encoding for timestamps in prompts, identified via ablations as a simple and effective time representation.
- Applies a thinking-free reinforcement learning with verifiable rewards (RLVR) training paradigm adapted specifically to video temporal grounding.
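To make the interleaved textual timestamp idea concrete, here is a minimal sketch of how a prompt could place a plain-text timestamp next to each sampled frame. The `<frame>` placeholder, the `1.0s`-style formatting, the choice to put the timestamp before the frame, and the instruction wording are all assumptions made for illustration; the summary above does not specify the exact template TimeLens uses.

```python
from typing import List


def format_timestamp(seconds: float) -> str:
    """Render a frame time as plain text, e.g. 2.0 -> '2.0s' (assumed format)."""
    return f"{seconds:.1f}s"


def interleave_timestamps(frame_times: List[float], frame_token: str = "<frame>") -> str:
    """Put a textual timestamp directly before each frame placeholder.

    `frame_token` is a hypothetical stand-in for whatever visual tokens
    the model actually consumes for one frame.
    """
    return " ".join(f"{format_timestamp(t)} {frame_token}" for t in frame_times)


def build_prompt(frame_times: List[float], query: str) -> str:
    """Assemble a grounding prompt: interleaved video part, then the query."""
    video_part = interleave_timestamps(frame_times)
    return (
        f"{video_part}\n"
        f"Query: {query}\n"
        "Answer with the start and end time in seconds."
    )


if __name__ == "__main__":
    print(build_prompt([0.0, 2.0, 4.0], "the person opens the door"))
```

Because the timestamps are ordinary text tokens, this representation needs no new embeddings or architectural changes, which fits the paper's baseline-first framing.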
Reproducibility Notes
- No official or maintained codebase is available; reproduction is paper-only.
- Dataset re-annotation details must be inferred from the paper and citations.
- RLVR training specifics (rewards, schedules) require careful reverse-engineering.
- Expect multi-day compute and iteration before matching reported VTG results.
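Since the reward design has to be reconstructed, a reasonable starting point is a verifiable reward based on temporal IoU between the predicted and annotated spans. The sketch below is an assumption, not the paper's confirmed reward: the answer format, the regular expression, and the helper names `parse_span`, `temporal_iou`, and `grounding_reward` are hypothetical. It scores the model's direct answer with no reasoning trace, in the spirit of a thinking-free setup.

```python
import re
from typing import Optional, Tuple

# Assumed answer format: two numbers separated by "-", "to", or "and",
# e.g. "5.0 to 15.0 seconds" or "5.0s - 15.0s".
_SPAN_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:seconds|s)?\s*(?:-|to|and)\s*(\d+(?:\.\d+)?)"
)


def parse_span(text: str) -> Optional[Tuple[float, float]]:
    """Extract the first (start, end) span in seconds, or None if absent/invalid."""
    match = _SPAN_RE.search(text)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    if end <= start:
        return None
    return start, end


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(response: str, gt: Tuple[float, float]) -> float:
    """Reward in [0, 1]: IoU of the parsed span, or 0.0 for unparseable output."""
    span = parse_span(response)
    if span is None:
        return 0.0
    return temporal_iou(span, gt)


if __name__ == "__main__":
    print(grounding_reward("From 5.0 to 15.0 seconds", (10.0, 20.0)))
```

Returning 0.0 for unparseable output keeps the reward fully checkable from the text alone; whether the paper adds format or length terms would need to be confirmed against the full text.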
Results & Benchmarks
| Task | Model | Dataset | Metric | Value |
|---|---|---|---|---|
| Video understanding / reasoning | Qwen2.5-VL-7B | VUE-TR | IoU | 36.0 |
| Video understanding / reasoning | GPT-4o | VUE-TR | IoU | 34.5 |
| Video understanding / reasoning | Gemini-2.5-Pro | VUE-TR | IoU | 41.6 |
Hardware Requirements
- Expect multi-day setup and compute for a meaningful reproduction.
Best Implementation
No maintained implementation has been confirmed for this paper yet.
See the Reproduction Path section below for the current action plan.
Reproduction Path
Follow this baseline workflow to decide if this paper is worth immediate prototyping.
1. No maintained, paper-verified implementation was found; start with the closest related repositories below.
2. Compare each repository's method against the paper's equations and algorithm before trusting its metrics.
3. Create a minimal baseline implementation from the paper, using adjacent repositories as references.
4. Prioritize reproducing the core method first: reinforcement learning (the thinking-free RLVR training stage).
Related Implementations
These are not paper-verified. Use them as reference points when no direct implementation is available.
Additional Implementations
No additional verified repositories beyond the primary recommendation.
Hugging Face Artifacts
No direct paper-linked artifacts were found. Showing strongest curated related artifacts.
- TencentARC/TimeLens-8B
- TencentARC/TimeLens-7B