Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives
Kai Jiang, Siqi Huang, Xiangyu Chen, Jiawei Shao, Hongyuan Zhang, Ping Luo, Xuelong Li · Nov 23, 2025 · Citations: 0
Abstract
Multimodal large language models (MLLMs) deployed on devices must adapt to continuously changing visual scenarios, such as variations in background and perspective, to effectively perform complex visual tasks. To investigate catastrophic forgetting under real-world scenario shifts, we construct a multimodal visual understanding dataset (MSVQA) covering four distinct scenarios and perspectives: high-altitude, underwater, low-altitude, and indoor environments. Furthermore, we propose UNIFIER (mUltimodal coNtInual learning with MLLMs From multi-scenarIo pERspectives), a continual learning (CL) framework designed to address visual discrepancies while learning different scenarios. Compared to existing CL methods, UNIFIER enables knowledge accumulation within the same scenario and mutual enhancement across different scenarios via Vision Representation Expansion (VRE) and Vision Consistency Constraint (VCC). Experimental results show that UNIFIER improves the last-step VQA scores by 2.70%~10.62% and the last-step F1 scores by 3.40%~7.69% compared to the state-of-the-art method, QUAD, on 20-step cross-scenario continual learning tasks. The MSVQA dataset is available at https://huggingface.co/datasets/Kaij00/MSVQA.
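The abstract does not spell out how the Vision Consistency Constraint is formulated, so the following is only a minimal PyTorch sketch of one plausible reading: a feature-distillation penalty that discourages the vision representations from drifting between continual-learning steps. The function name `vcc_loss`, the frozen-encoder setup, and the cosine-distance form are assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch of a vision-consistency-style regularizer (assumed, not from the paper).
import torch
import torch.nn.functional as F


def vcc_loss(current_feats: torch.Tensor, frozen_feats: torch.Tensor) -> torch.Tensor:
    """Penalize drift of vision representations across scenarios.

    current_feats: features from the vision encoder being trained, shape (B, D)
    frozen_feats:  features from a frozen snapshot taken at the previous CL step, shape (B, D)
    """
    # 1 - cosine similarity per sample, averaged over the batch
    return (1.0 - F.cosine_similarity(current_feats, frozen_feats, dim=-1)).mean()


# Assumed usage inside a training step:
#   with torch.no_grad():
#       frozen_feats = frozen_encoder(images)   # snapshot from the previous scenario
#   loss = task_loss + lambda_vcc * vcc_loss(encoder(images), frozen_feats)
```

Under this reading, the constraint acts like standard feature distillation in continual learning, with the weight `lambda_vcc` trading off plasticity on the new scenario against stability on earlier ones.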