Skip to content

Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking

Chan-Wei Hu, Zhengzhong Tu

2026-04-07T00:05:12Z

Abstract

Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performances by increasing conditional Recall@1 by up to 20%. These results show the great promise of query-side adaptation as a simple but effective way to strengthen MM-RAG re-ranking.

Full analysis loading… Code implementations, benchmark data, and reproduction guides are being assembled. Please check back shortly.

Browse all papers

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.