Large Vision-Language Models for Cross-View Geo-Localization
Abstract:
This talk focuses on the application of large vision-language models to cross-view geo-localization, the task of matching ground-level images with overhead (aerial or satellite) views. We will explore how these models leverage both visual features and language-based descriptions to bridge the substantial appearance gap between viewpoints. Key techniques, such as joint image-text embeddings and contrastive learning, will be discussed, along with their effectiveness in geo-localization tasks. We will also highlight the challenges and opportunities in applying vision-language models to this domain, with examples from recent research.
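
To make the contrastive-learning idea concrete, here is a minimal sketch in plain Python. It assumes that ground-level and overhead views have already been mapped into a shared embedding space by some encoder; the toy vectors, the function names, and the temperature value of 0.07 are illustrative assumptions, not details taken from the talk. The loss is a symmetric InfoNCE-style objective of the kind commonly used for joint embeddings: matching ground/overhead pairs are pulled together and non-matching pairs in the batch are pushed apart.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(ground, overhead):
    """Entry [i][j] is the cosine similarity of ground i and overhead j."""
    g = [l2_normalize(v) for v in ground]
    o = [l2_normalize(v) for v in overhead]
    return [[sum(x * y for x, y in zip(gi, oj)) for oj in o] for gi in g]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(ground, overhead, temperature=0.07):
    """Average of ground-to-overhead and overhead-to-ground losses.

    Assumes ground[i] and overhead[i] depict the same location.
    The temperature is a hypothetical default, not a value from the talk.
    """
    sim = cosine_similarity_matrix(ground, overhead)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


def retrieve(ground, overhead):
    """For each ground embedding, return the index of the closest overhead one."""
    sim = cosine_similarity_matrix(ground, overhead)
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Toy embeddings standing in for encoder outputs (purely illustrative).
ground_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
overhead_emb = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned_loss = symmetric_contrastive_loss(ground_emb, overhead_emb)
shuffled_loss = symmetric_contrastive_loss(ground_emb, overhead_emb[::-1])
matches = retrieve(ground_emb, overhead_emb)
```

With these toy vectors, the correctly paired batch yields a lower loss than the batch whose overhead views are reversed, and nearest-neighbor retrieval by cosine similarity recovers the matching overhead view for each ground query. In a vision-language setting, text descriptions could be embedded into the same space and scored the same way, though that step is not shown here.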