Aydın, Kerem and Aptoula, Erchan (2025) Domain generalized remote sensing scene captioning via country-level geographic information. In: 2025 Joint Urban Remote Sensing Event (JURSE), Tunis, Tunisia
Full text not available from this repository.
Official URL: https://dx.doi.org/10.1109/JURSE60372.2025.11076018
Abstract
In this study, we explored the performance impact of incorporating country-level, text-based geographical information into a large-scale vision language model fine-tuned for captioning optical remote sensing images. We hypothesized that a model trained on country-level textual geographical context alongside visual scenes would exhibit improved captioning capabilities when confronted with images from previously unseen countries or even continents, coupled with their respective geographical context. A large language and vision assistant (LLaVA) was fine-tuned on optical images from European countries and tested on images from other continents to evaluate its generalization capabilities. We report results of experiments conducted across 175 countries via the newly published SkyScript dataset, demonstrating that even superficial geographical information obtained from Wikipedia articles can mitigate the cross-country domain shift by several points of accuracy. This multimodal approach, combining textual geographical context with visual data, shows significant potential for improving the generalization capabilities of vision language models on tasks involving diverse and previously unseen geographical regions.
| Item Type: | Papers in Conference Proceedings |
| --- | --- |
| Uncontrolled Keywords: | Domain Adaptation; Remote Sensing and Open Vocabulary Classification; Scene Captioning |
| Divisions: | Faculty of Engineering and Natural Sciences |
| Depositing User: | Erchan Aptoula |
| Date Deposited: | 08 Sep 2025 14:39 |
| Last Modified: | 08 Sep 2025 14:39 |
| URI: | https://research.sabanciuniv.edu/id/eprint/52159 |