Skip to content
← Back to explorer

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia · Mar 2, 2026 · Citations: 0

Abstract

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy, which first tries with the abstract Symbolic Schema and progressively "drills down" to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at https://github.com/EliSpectre/MM-Mem.

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

25/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: Coding
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Long Horizon
  • Quality controls: Not reported
  • Confidence: 0.45
  • Flags: ambiguous, runtime_fallback_extraction

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

latency

Research Brief

Deterministic synthesis

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive… HFEPX signals include Automatic Metrics, Long Horizon with confidence 0.45. Updated from current HFEPX corpus.

Generated Mar 4, 2026, 1:19 PM · Grounded in abstract + metadata only

Key Takeaways

  • While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and…
  • To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory.

Researcher Actions

  • Treat this as method context, then pivot to protocol-specific HFEPX hubs.
  • Identify benchmark choices from full text before operationalizing conclusions.
  • Validate metric comparability (latency).

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive…
  • To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory.
  • Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization.

Why It Matters For Eval

  • While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive…
  • Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization.

Researcher Checklist

  • Gap: Human feedback protocol is explicit

    No explicit human feedback protocol detected.

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting appears

    No calibration/adjudication/IAA control explicitly detected.

  • Gap: Benchmark or dataset anchors are present

    No benchmark/dataset anchor extracted from abstract.

  • Pass: Metric reporting is present

    Detected: latency

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.