Aytar, Ahmet Yasin and Kaya, Kamer and Kılıç, Kemal (2025) A synergistic multi-stage RAG architecture for boosting context relevance in data science literature. Natural Language Processing Journal, 13. ISSN 2949-7191
Full text not available from this repository.
Official URL: https://dx.doi.org/10.1016/j.nlp.2025.100179
Abstract
Navigating the voluminous and rapidly evolving data science literature presents a significant bottleneck for researchers and practitioners. Standard Retrieval-Augmented Generation (RAG) systems often struggle to retrieve precisely relevant context from this dense academic corpus. This paper introduces a synergistic multi-stage RAG architecture specifically tailored to overcome these challenges. Our approach integrates structured document parsing (GROBID), domain-specific embedding fine-tuning derived from textbooks, and semantic chunking for coherence, and proposes a novel 'Abstract First' retrieval strategy that prioritizes concise, high-signal summaries. Through rigorous evaluation using the RAGAS framework and a custom data science query set, we demonstrate that this integrated architecture boosts Context Relevance by over 15-fold compared to baseline RAG, surpassing configurations that use only subsets of these enhancements. These findings underscore the critical importance of multi-stage optimization and highlight the surprising efficacy of the abstract-centric retrieval method for specialized academic domains, offering a validated pathway to more effective literature navigation in data science.
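Since the full text is not available from this repository, the following is only a minimal, hypothetical Python sketch of what an 'Abstract First' retrieval stage could look like: abstracts are scored against the query first, and full-text chunks are retrieved only from the top-ranked papers. The bag-of-words similarity, corpus structure, and cutoff parameters here are illustrative assumptions, not the authors' implementation, which fine-tunes a domain-specific embedding model and uses semantic chunking.

```python
from dataclasses import dataclass
from collections import Counter
import math

@dataclass
class Paper:
    title: str
    abstract: str
    chunks: list  # full-text passages (the paper uses semantic chunking; assumption here)

def embed(text: str) -> Counter:
    # Placeholder bag-of-words "embedding"; stands in for the paper's
    # domain-specific fine-tuned dense embeddings (assumption).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def abstract_first_retrieve(query: str, corpus: list, top_papers: int = 2, top_chunks: int = 3):
    q = embed(query)
    # Stage 1: rank papers by abstract similarity (concise, high-signal summaries).
    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p.abstract)), reverse=True)
    candidates = ranked[:top_papers]
    # Stage 2: retrieve chunks only from the top-ranked papers.
    scored = [(cosine(q, embed(c)), c) for p in candidates for c in p.chunks]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [c for _, c in scored[:top_chunks]]

if __name__ == "__main__":
    corpus = [
        Paper("Gradient boosting", "Gradient boosting for tabular data.",
              ["Boosting builds trees sequentially.", "The learning rate controls shrinkage."]),
        Paper("RAG survey", "Retrieval-augmented generation for question answering.",
              ["RAG retrieves context before generation.", "Chunking affects retrieval quality."]),
    ]
    print(abstract_first_retrieve("How does retrieval-augmented generation work?", corpus))
```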
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | Academic insights; Data science; Large Language Models (LLM); Literature retrieval; Retrieval-Augmented Generation (RAG) |
| Divisions: | Faculty of Engineering and Natural Sciences |
| Depositing User: | Kamer Kaya |
| Date Deposited: | 16 Feb 2026 11:15 |
| Last Modified: | 16 Feb 2026 11:15 |
| URI: | https://research.sabanciuniv.edu/id/eprint/53103 |

