Badali Naghadeh, Faraz and Balcısoy, Selim (2026) Frame-first AR: egocentric scene graphs with VLM-augmented semantics and spatial reasoning. In: IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), Osaka, Japan
Full text not available from this repository.
Official URL: https://dx.doi.org/10.1109/AIxVR67263.2026.00048
Abstract
This paper presents an opportunistic, frame-first augmented reality (AR) pipeline for first-responder support that departs from scan-first approaches. The system anchors real-time object detections into world coordinates via the Meta Quest 3 depth API and maintains an egocentric scene graph encoding human-object and object-object relations. A multimodal vision-language model (VLM) operates over this structured snapshot to infer geometric relations, functional groupings, and semantic context beyond object detection alone. Unlike scan-first designs, our pipeline avoids long initialization delays by incrementally building spatial awareness from partial observations while leveraging schematic priors. We report stage-wise and end-to-end latency, and comparative evaluations showing that geometry-augmented VLM reasoning improves spatial accuracy and relational reliability relative to VLM-only baselines. Together, these findings demonstrate that grounding semantics in geometry offers a practical path to responsive, low-latency AR assistance in safety-critical scenarios.
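The anchoring step described in the abstract (lifting a 2D detection into world coordinates using depth, then relating objects to the user and to each other) can be illustrated with a minimal sketch. This is not the authors' implementation and does not call the Meta Quest 3 depth API. It assumes a plain pinhole camera model, and every function name, the intrinsics `fx, fy, cx, cy`, and the `near_m` threshold are hypothetical choices made for illustration only.

```python
import math

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera frame.

    Assumes a pinhole model; the intrinsics are hypothetical inputs.
    """
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)

def to_world(p_cam, rotation, translation):
    """Apply a camera-to-world pose (3x3 rotation, 3-vector translation)."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def build_scene_graph(objects, user_pos, near_m=1.5):
    """Build (subject, relation, object, distance) edges.

    objects: dict mapping a label to its world position.
    near_m: illustrative threshold for an object-object "near" relation.
    """
    names = sorted(objects)
    edges = []
    # Human-object edges: distance from the user to every anchored object.
    for name in names:
        edges.append(("user", "distance_to", name, math.dist(user_pos, objects[name])))
    # Object-object edges: only pairs closer than the threshold.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = math.dist(objects[a], objects[b])
            if d <= near_m:
                edges.append((a, "near", b, d))
    return edges

if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # A detection at the image centre, 2 m away, with the camera offset 1 m along x.
    p_world = to_world(
        backproject(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0),
        identity,
        (1.0, 0.0, 0.0),
    )
    graph = build_scene_graph(
        {"extinguisher": p_world, "door": (1.0, 0.0, 3.0)},
        user_pos=(0.0, 0.0, 0.0),
    )
    for edge in graph:
        print(edge)
```

A structured snapshot of this kind (labels, world positions, and explicit relations) is the sort of input the abstract describes passing to the VLM, in place of raw pixels alone. The serialization format the paper actually uses is not stated in this record.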
| Item Type: | Papers in Conference Proceedings |
|---|---|
| Uncontrolled Keywords: | Augmented Reality (AR); Computer Vision; Spatial Awareness; Vision-Language Model |
| Divisions: | Faculty of Engineering and Natural Sciences |
| Depositing User: | Selim Balcısoy |
| Date Deposited: | 14 May 2026 13:05 |
| Last Modified: | 14 May 2026 13:05 |
| URI: | https://research.sabanciuniv.edu/id/eprint/54081 |

