FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao, Yufei Huang · Jan 12, 2026 · Citations: 0
How to use this page
Low trustUse this as background context only. Do not make protocol decisions from this page alone.
Best use
Background context only
What to verify
Validate the evaluation procedure and quality controls in the full paper before operational use.
Evidence quality
Low
Derived from extracted protocol signals and abstract evidence.
Abstract
Scientific compound figures combine multiple labeled panels into a single image. However, in a PMC-scale crawl of 346,567 compound figures, 16.3% have no caption and 1.8% only have captions shorter than ten words, causing them to be discarded by existing caption-decomposition pipelines. We propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the image, converting otherwise unusable figures into aligned panel-text pairs for downstream pretraining and retrieval. To mitigate linguistic variance in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively controls how caption features condition the detection query space, and employ a staged SFT+RL strategy with CLIP-based alignment and BERTScore-based semantic rewards. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. FigEx2 achieves 0.728 mAP@0.5:0.95 for detection, outperforms Qwen3-VL-8B by 0.44 in METEOR and 0.22 in BERTScore, and transfers zero-shot to out-of-distribution scientific domains without fine-tuning.