PoseViTNet: multi-scene absolute pose regression using vision transformers


Loulou, Asmaa and Ünel, Mustafa (2025) PoseViTNet: multi-scene absolute pose regression using vision transformers. In: IEEE Intelligent Vehicles Symposium (IV), Cluj-Napoca, Romania.

Full text not available from this repository.

Abstract

Accurate camera pose estimation is crucial for autonomous driving and vehicle networking. Traditional pipelines based on geometric models and feature matching struggle in dynamic or featureless environments, which are common in real-world driving scenarios. Inspired by the success of vision transformers (ViT), our approach uses a ViT backbone with an attention-based mask to extract a global image descriptor, which is then passed through fully connected layers for pose regression. The multi-headed self-attention in ViT helps the model learn scene layouts and focus on relevant features. We introduce an attention mask to improve performance in challenging scenes, especially dynamic or featureless ones. We compare three backbones: ViT (multi-headed self-attention throughout), ConViT (self-attention in the last two layers, gated positional self-attention elsewhere), and ResNet (pure convolution). We evaluate our model on two commonly used benchmarks for outdoor and indoor localization and show that the variant with a ViT backbone achieves state-of-the-art results on both indoor and outdoor multi-scene absolute localization benchmarks.
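To make the described pipeline concrete, the sketch below shows one plausible reading of the architecture: a ViT-style encoder produces patch tokens, a learned attention mask re-weights those tokens into a single global image descriptor, and fully connected heads regress position and orientation (as a quaternion). All module names, dimensions, and the specific masking scheme here are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PoseRegressorSketch(nn.Module):
    """Minimal sketch of a ViT-based absolute pose regressor (assumed design)."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Patch embedding via a non-overlapping convolution, as in standard ViT.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Attention-based mask: one scalar score per token, softmax-normalized,
        # so the global descriptor can down-weight uninformative regions
        # (e.g. dynamic objects or textureless areas).
        self.mask = nn.Linear(dim, 1)
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.head_t = nn.Linear(dim, 3)  # translation: x, y, z
        self.head_q = nn.Linear(dim, 4)  # orientation: unit quaternion

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, D)
        tokens = self.encoder(tokens)
        w = torch.softmax(self.mask(tokens), dim=1)   # (B, N, 1) token weights
        g = (w * tokens).sum(dim=1)                   # global image descriptor
        g = self.fc(g)
        q = self.head_q(g)
        return self.head_t(g), q / q.norm(dim=-1, keepdim=True)

# Usage: one forward pass on a dummy batch.
model = PoseRegressorSketch()
t, q = model(torch.randn(2, 3, 224, 224))
print(t.shape, q.shape)  # torch.Size([2, 3]) torch.Size([2, 4])
```

Normalizing the quaternion output is a common design choice in pose regression, since it keeps the predicted orientation on the unit sphere regardless of the raw head output; whether the paper uses this exact parameterization is not stated in the abstract.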
Item Type: Papers in Conference Proceedings
Divisions: Faculty of Engineering and Natural Sciences
Depositing User: Mustafa Ünel
Date Deposited: 10 Sep 2025 10:54
Last Modified: 10 Sep 2025 10:54
URI: https://research.sabanciuniv.edu/id/eprint/52251
