Cross-View Association, Reasoning, and Explainability

Speaker: Safwan Wshah

Time: 11:00 - 11:40

Abstract:

Traditional cross-view geo-localization focuses primarily on latent feature matching, often relying on low-level visual details that fail to capture fine-grained semantic correspondences between viewpoints. This talk introduces a new paradigm that aims to align objects, structures, and relations between ground-level and aerial/satellite imagery. We will explore how recent advances in multimodal learning and Large Multimodal Models (LMMs) enable higher-level reasoning across views, allowing models to infer spatial layouts, understand functional context, and make multi-step localization decisions (e.g., “a building next to a river and a road intersection”). By shifting from pixel-level similarity to semantic association and reasoning, cross-view geo-localization can become more robust, more interpretable, and ultimately closer to human-level spatial understanding.
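
To make the shift from pixel-level similarity to semantic association concrete, here is a minimal toy sketch (not from the talk; all names, data structures, and the matching rule are hypothetical illustrations). It filters candidate aerial tiles by whether their extracted object relations satisfy a ground-level query such as the one quoted in the abstract:

```python
# Toy sketch (hypothetical): semantic association for cross-view
# geo-localization, matching relation triples instead of pixel features.
from dataclasses import dataclass, field

@dataclass
class AerialTile:
    tile_id: str
    # Spatial relations extracted from the tile (e.g., by a detector or
    # an LMM), encoded as (subject, relation, object) triples.
    relations: set = field(default_factory=set)

def semantic_match(query_relations: set, tiles: list) -> list:
    """Keep only the tiles whose extracted relations satisfy every
    relation stated in the ground-level query."""
    return [t for t in tiles if query_relations <= t.relations]

# Ground-level query: "a building next to a river and a road intersection".
query = {
    ("building", "next_to", "river"),
    ("building", "next_to", "road_intersection"),
}

candidates = [
    AerialTile("tile_17", relations={
        ("building", "next_to", "river"),
        ("building", "next_to", "road_intersection"),
    }),
    AerialTile("tile_42", relations={("building", "next_to", "road")}),
]

print([t.tile_id for t in semantic_match(query, candidates)])  # ['tile_17']
```

In a real system the hard work lies in extracting reliable relation triples from both views; the point of the sketch is only that the matching step then operates over interpretable semantic constraints rather than latent feature distances, which is what makes the localization decision explainable.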