Loulou, Asmaa and Ünel, Mustafa (2025) PoseViTNet: multi-scene absolute pose regression using vision transformers. In: IEEE Intelligent Vehicles Symposium (IV), Cluj-Napoca, Romania.
Full text not available from this repository.
Official URL: https://dx.doi.org/10.1109/IV64158.2025.11097720
Abstract
Accurate camera pose estimation is crucial for autonomous driving and vehicle networking. Traditional pipelines based on geometric models and feature matching struggle in dynamic or featureless environments, which are common in practice. Inspired by the success of vision transformers (ViT), our approach uses a ViT backbone with an attention-based mask to extract a global image descriptor, which is then passed through fully connected layers for pose regression. The multi-headed self-attention in ViT helps the model learn scene layouts and focus on relevant features. We introduce an attention mask to improve performance in challenging scenes, especially dynamic or featureless ones. We compare three backbones: ViT (multi-headed self-attention throughout), ConViT (self-attention in the last two layers, gated positional self-attention elsewhere), and ResNet (pure convolution). We evaluate our model on two commonly used benchmarks for outdoor and indoor localization and show that the ViT-backbone variant achieves state-of-the-art results on both indoor and outdoor multi-scene absolute localization benchmarks.
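As a rough illustration of the pipeline the abstract describes, the sketch below is a hypothetical PyTorch/timm implementation, not the authors' code: a ViT backbone's pooled global descriptor feeds fully connected heads that regress a translation vector and a unit quaternion. The `PoseRegressor` name and layer sizes are assumptions, and the paper's attention-based mask and multi-scene handling are omitted.

```python
# A minimal sketch of ViT-based absolute pose regression (not PoseViTNet
# itself); class name, hidden size, and head design are illustrative.
import torch
import torch.nn as nn
import timm


class PoseRegressor(nn.Module):
    def __init__(self, backbone_name="vit_base_patch16_224", hidden_dim=1024):
        super().__init__()
        # num_classes=0 makes timm return a pooled global image descriptor
        # instead of classification logits (set pretrained=True to load
        # ImageNet weights).
        self.backbone = timm.create_model(
            backbone_name, pretrained=False, num_classes=0
        )
        feat_dim = self.backbone.num_features
        # Shared MLP followed by separate heads for position (x, y, z) and
        # orientation (quaternion), a common absolute-pose-regression design.
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.GELU())
        self.fc_trans = nn.Linear(hidden_dim, 3)
        self.fc_rot = nn.Linear(hidden_dim, 4)

    def forward(self, img):
        desc = self.backbone(img)             # (B, feat_dim) global descriptor
        h = self.mlp(desc)
        t = self.fc_trans(h)                  # translation estimate
        q = self.fc_rot(h)
        q = q / q.norm(dim=-1, keepdim=True)  # normalize to a unit quaternion
        return t, q


model = PoseRegressor()
t, q = model(torch.randn(1, 3, 224, 224))
print(t.shape, q.shape)  # torch.Size([1, 3]) torch.Size([1, 4])
```

Swapping the backbone string (e.g. a ConViT or ResNet model name in timm) would mimic the backbone comparison the abstract mentions, since only the descriptor extractor changes while the regression heads stay fixed.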
| Item Type: | Papers in Conference Proceedings |
|---|---|
| Divisions: | Faculty of Engineering and Natural Sciences |
| Depositing User: | Mustafa Ünel |
| Date Deposited: | 10 Sep 2025 10:54 |
| Last Modified: | 10 Sep 2025 10:54 |
| URI: | https://research.sabanciuniv.edu/id/eprint/52251 |