Frame-first AR: egocentric scene graphs with VLM-augmented semantics and spatial reasoning

Badali Naghadeh, Faraz and Balcısoy, Selim (2026) Frame-first AR: egocentric scene graphs with VLM-augmented semantics and spatial reasoning. In: IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), Osaka, Japan

Full text not available from this repository.

Abstract

This paper presents an opportunistic, frame-first augmented reality (AR) pipeline for first-responder support that departs from scan-first approaches. The system anchors real-time object detections into world coordinates via the Meta Quest 3 depth API and maintains an egocentric scene graph encoding human-object and object-object relations. A multimodal vision-language model (VLM) operates over this structured snapshot to infer geometric relations, functional groupings, and semantic context beyond object detection alone. Unlike scan-first designs, our pipeline avoids long initialization delays by incrementally building spatial awareness from partial observations while leveraging schematic priors. We report stage-wise and end-to-end latency, and comparative evaluations showing that geometry-augmented VLM reasoning improves spatial accuracy and relational reliability relative to VLM-only baselines. Together, these findings demonstrate that grounding semantics in geometry offers a practical path to responsive, low-latency AR assistance in safety-critical scenarios.

Item Type: Papers in Conference Proceedings
Uncontrolled Keywords: Augmented Reality (AR); Computer Vision; Spatial Awareness; Vision-Language Model
Divisions: Faculty of Engineering and Natural Sciences
Depositing User: Selim Balcısoy
Date Deposited: 14 May 2026 13:05
Last Modified: 14 May 2026 13:05
URI: https://research.sabanciuniv.edu/id/eprint/54081
