A synergistic multi-stage RAG architecture for boosting context relevance in data science literature

Aytar, Ahmet Yasin and Kaya, Kamer and Kılıç, Kemal (2025) A synergistic multi-stage RAG architecture for boosting context relevance in data science literature. Natural Language Processing Journal, 13. ISSN 2949-7191

Full text not available from this repository.

Abstract

Navigating the voluminous and rapidly evolving data science literature presents a significant bottleneck for researchers and practitioners. Standard Retrieval-Augmented Generation (RAG) systems often struggle to retrieve precisely relevant context from this dense academic corpus. This paper introduces a synergistic multi-stage RAG architecture specifically tailored to overcome these challenges. Our approach integrates structured document parsing (GROBID), domain-specific embedding fine-tuning derived from textbooks, and semantic chunking for coherence, and it proposes a novel 'Abstract First' retrieval strategy that prioritizes concise, high-signal summaries. Through rigorous evaluation using the RAGAS framework and a custom data science query set, we demonstrate that this integrated architecture increases Context Relevance more than 15-fold compared to baseline RAG, surpassing configurations that use only subsets of these enhancements. These findings underscore the critical importance of multi-stage optimization and highlight the surprising efficacy of the abstract-centric retrieval method for specialized academic domains, offering a validated pathway to more effective literature navigation in data science.
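
The abstract does not spell out how the 'Abstract First' strategy is implemented, so the following is only a minimal sketch of one plausible reading: rank papers by the similarity of their abstracts to the query, then retrieve body chunks only from the shortlisted papers. The function names (`embed`, `cosine`, `abstract_first_retrieve`), the `PAPERS` sample data, and the bag-of-words similarity are all hypothetical stand-ins; the paper itself uses fine-tuned domain embeddings, GROBID parsing, and semantic chunking, none of which are reproduced here.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def abstract_first_retrieve(query, papers, top_papers=1, top_chunks=2):
    """Two-stage retrieval: shortlist papers by abstract, then rank their chunks."""
    q = embed(query)
    # Stage 1: score each paper by its abstract only.
    ranked = sorted(papers, key=lambda p: cosine(q, embed(p["abstract"])), reverse=True)
    shortlist = ranked[:top_papers]
    # Stage 2: score body chunks, restricted to the shortlisted papers.
    candidates = [(p["title"], chunk) for p in shortlist for chunk in p["chunks"]]
    candidates.sort(key=lambda tc: cosine(q, embed(tc[1])), reverse=True)
    return candidates[:top_chunks]


# Hypothetical sample corpus, invented for illustration only.
PAPERS = [
    {
        "title": "Gradient boosting survey",
        "abstract": "a survey of gradient boosting methods for tabular data",
        "chunks": [
            "gradient boosting builds trees sequentially",
            "regularization in boosting reduces overfitting",
        ],
    },
    {
        "title": "Image segmentation",
        "abstract": "convolutional networks for image segmentation",
        "chunks": [
            "boosting accuracy of segmentation with gradient tricks",
            "u-net architectures use skip connections",
        ],
    },
]

results = abstract_first_retrieve("gradient boosting for tabular data", PAPERS)
for title, chunk in results:
    print(f"{title}: {chunk}")
```

In this toy corpus, a chunk from the segmentation paper shares query terms such as "gradient" and "boosting", but it is never considered because its abstract does not make the shortlist. That filtering effect is the intuition the sketch is meant to convey; whether the published system works this way is an assumption.
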
Item Type: Article
Uncontrolled Keywords: Academic insights; Data science; Large Language Models (LLM); Literature retrieval; Retrieval-Augmented Generation (RAG)
Divisions: Faculty of Engineering and Natural Sciences
Depositing User: Kamer Kaya
Date Deposited: 16 Feb 2026 11:15
Last Modified: 16 Feb 2026 11:15
URI: https://research.sabanciuniv.edu/id/eprint/53103
