Loulou, Asmaa and Ünel, Mustafa (2024) RelViTNet: relative camera pose estimation network using vision transformers. In: 50th Annual Conference of the IEEE Industrial Electronics Society (IECON), Chicago, USA.
PDF: RelViTNet_IECON_final.pdf (1MB)
Abstract
Relative camera pose regressors estimate the relative pose between two cameras from two input images. Typically, a convolutional network with a multi-layer perceptron head is trained per scene with ground-truth relative poses. However, such methods still suffer from limited accuracy and generalization. Inspired by the success of vision transformers on computer vision tasks, we propose to learn the relative pose between two cameras using only a vision transformer backbone with fully connected layers. The multi-headed self-attention mechanism of the vision transformer allows our model to attend to the full image even from the lowest layers, enabling it to learn the layout of the scene and focus only on the features that are relevant to our task. We evaluate our model on one outdoor and two indoor datasets and show that it achieves competitive accuracy on both outdoor and indoor multi-scene relative localization benchmarks. We further compare our pose estimation results to those obtained with recent local keypoint-based approaches and show that our model outperforms these methods, particularly for frames with small translation, where such methods mostly fail.
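For illustration, here is a minimal PyTorch sketch of the general idea the abstract describes: a shared vision transformer backbone encodes each of the two images, and fully connected layers regress the relative pose. The backbone name, layer sizes, and pose parameterization (translation vector plus unit quaternion) are assumptions for this sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import timm  # assumption: a timm ViT is used as the backbone


class RelPoseViT(nn.Module):
    """Hypothetical ViT-based relative pose regressor (sketch only)."""

    def __init__(self, backbone_name="vit_base_patch16_224"):
        super().__init__()
        # Shared ViT backbone; num_classes=0 makes timm return the
        # pooled feature embedding instead of classification logits.
        self.backbone = timm.create_model(
            backbone_name, pretrained=True, num_classes=0
        )
        feat_dim = self.backbone.num_features
        # Fully connected layers over the concatenated pair embedding.
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 512), nn.GELU())
        self.fc_t = nn.Linear(512, 3)  # relative translation
        self.fc_q = nn.Linear(512, 4)  # relative rotation as a quaternion

    def forward(self, img_a, img_b):
        # Encode both images with the same backbone and concatenate.
        f = torch.cat([self.backbone(img_a), self.backbone(img_b)], dim=-1)
        h = self.head(f)
        t = self.fc_t(h)
        q = self.fc_q(h)
        # Normalize so the quaternion represents a valid rotation.
        q = q / q.norm(dim=-1, keepdim=True).clamp(min=1e-8)
        return t, q


# Usage on a dummy image pair (224x224 RGB, as the backbone expects):
# model = RelPoseViT()
# t, q = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```

Because self-attention mixes information across all patches from the first layer onward, even this simple pairwise encoding can, in principle, capture scene-level layout cues that a purely local convolutional encoder would aggregate only in deeper layers.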
| Field | Value |
|---|---|
| Item Type | Papers in Conference Proceedings |
| Uncontrolled Keywords | Localization, Deep Learning, Vision Transformers |
| Subjects | T Technology > TJ Mechanical engineering and machinery > TJ163.12 Mechatronics |
| Divisions | Faculty of Engineering and Natural Sciences |
| Depositing User | Mustafa Ünel |
| Date Deposited | 29 Sep 2024 13:40 |
| Last Modified | 29 Sep 2024 13:40 |
| URI | https://research.sabanciuniv.edu/id/eprint/50288 |