A high performance CPU-GPU database for streaming data analysis
Abdennebi, Anes (2020) A high performance CPU-GPU database for streaming data analysis. [Thesis]
Official URL: https://risc01.sabanciuniv.edu/record=b2486355
The outstanding spread of database management system architectures in the last decade, together with the ever-increasing growth, volume, and velocity of data, known nowadays as “Big Data”, continuously urges researchers and companies to build robust and scalable database management systems (DBMSs) and to improve them so that they adjust smoothly to the evolution of data. At the same time, there is a tendency to support the conventional processing units (PUs), the Central Processing Units (CPUs), with additional computing power such as the emerging Graphics Processing Units (GPUs). The research community has recognized the potential of this computing power for data-intensive applications, and several studies in recent years have produced remarkable DBMSs by integrating GPUs under different workload distribution algorithms and query optimization protocols. We address a new approach by building a hybrid columnar high-performance database management system, which we call DOLAP, that adopts the Online Analytical Processing (OLAP) infrastructure. Unlike previous hybrid DBMSs, DOLAP relies on Bloom filters while performing different operations on data (ingesting, checking, modifying, and deleting). We implement this probabilistic data structure in DOLAP to prevent unnecessary memory accesses when checking the database’s data records; this method proves useful, reducing total running times by 35%. Moreover, since the two main PUs, the CPU and the GPU, have different characteristics, a workload distribution model that effectively decides a query’s executing unit at any given time must be defined to improve the efficiency of the system. We therefore propose three load balancing models: the Random-based, the Algorithm-based, and the Improved Algorithm-based models.
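To illustrate the idea of a Bloom filter as a pre-check that avoids unnecessary memory accesses, the sketch below shows a minimal, self-contained Bloom filter in Python. It is not the thesis's implementation; the sizes, hash scheme (double hashing over a SHA-256 digest), and names are illustrative assumptions. The key property is that a negative answer is definitive, so the expensive record lookup can be skipped, while a positive answer only means the record might exist.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into a bit array of size m.
    Illustrative sketch only; parameters are arbitrary assumptions."""

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, key):
        # Derive k bit positions via double hashing of a SHA-256 digest.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False -> key is definitely absent: skip the storage access.
        # True  -> key is possibly present: fall through to the real lookup.
        return all(self.bits[pos] for pos in self._positions(key))

# Usage: ingest a record key, then pre-check membership before touching storage.
bf = BloomFilter()
bf.add("record-42")
present = bf.might_contain("record-42")   # True: proceed to the real lookup
```

Because the filter never produces false negatives, every skipped lookup is a guaranteed saving; the only cost is an occasional false positive that falls through to the normal access path.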
We ran our tests on the Chicago Taxi Driver dataset taken from Kaggle. Among the three load balancing models, the Improved Algorithm-based model proves the most effective at distributing the query load between the CPUs and GPUs, outperforming the other models in nearly all test runs.
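Of the three models, only the Random-based one is simple enough to sketch from the abstract alone: each incoming query is assigned to a processing unit uniformly at random. The snippet below is a purely illustrative baseline under that reading; the function and unit names are hypothetical and not taken from the thesis.

```python
import random

# Hypothetical processing-unit labels; illustrative only.
UNITS = ["CPU", "GPU"]

def random_dispatch(queries, seed=None):
    """Random-based baseline: assign each query to a PU uniformly at random."""
    rng = random.Random(seed)
    return {q: rng.choice(UNITS) for q in queries}

# Usage: dispatch a small batch of queries.
assignment = random_dispatch(["Q1", "Q2", "Q3"], seed=0)
```

Such a baseline ignores query cost and device load entirely, which is presumably why the Algorithm-based and Improved Algorithm-based models, which do account for those factors, outperform it.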