Metaheuristic approach for optimal data pre-processing method selection case study: Missing values handling

Farham Nia, Saied (2022) Metaheuristic approach for optimal data pre-processing method selection case study: Missing values handling. [Thesis]


Abstract

The current big data era has given rise to many pioneering opportunities both in research and in practice. However, despite the potential benefits, there are also significant challenges in employing the observed data for mining information and creating value based on informed decisions. Indeed, the quality of datasets, as a crucial factor, has become a major challenge and a focus area beyond the fields of database management systems and data engineering. Handling missing values in datasets, a pervasive and unavoidable phenomenon, is still the subject of active research. While scientists and practitioners in the fields of statistics and machine learning have introduced various approaches and developed methods, there is still great room for improvement. In this research, a systematic approach for handling missing values is proposed in which the appropriate method for each feature of a dataset is selected automatically according to the downstream data analytic task. To this end, a simulated annealing based metaheuristic has been developed which assigns to each feature the most appropriate of seven commonly used missing value handling methods, namely Mean, Mode, and Median Imputation, Hot-Deck, K-NN, Bayesian Ridge Regression Imputation, and Random Forest Regression Imputation. Experimental analyses are conducted on four different datasets, and the performance of the proposed approach is tested at different levels of missingness. The results demonstrate that the proposed approach outperforms the seven methods when they are employed separately. The results imply that a wholesale approach based on choosing the single best missing values handling method for a particular dataset should be made more granular, with features addressed separately during the missing data handling stage.
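
As a rough illustration of the per-feature selection idea described in the abstract, the following minimal Python sketch runs simulated annealing over assignments of one imputation method to each feature. It is not the thesis implementation: the function names (`impute`, `anneal`, `toy_cost`), the cooling schedule, the toy data, and the cost function are all assumptions made for illustration. Only three of the seven methods (mean, median, mode) are included, and the cost here is a squared error against known hidden values, standing in for the downstream analytic task score used in the thesis.

```python
import math
import random
import statistics

# Hypothetical subset of the seven methods; Hot-Deck, K-NN, Bayesian Ridge
# and Random Forest regression imputation are omitted to keep the sketch short.
METHODS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}

def impute(columns, assignment):
    """Fill None entries of each column using the method assigned to it."""
    filled = []
    for col, method in zip(columns, assignment):
        observed = [v for v in col if v is not None]
        fill = METHODS[method](observed)
        filled.append([fill if v is None else v for v in col])
    return filled

def anneal(columns, cost, t0=1.0, cooling=0.95, steps=200, seed=0):
    """Simulated annealing over per-feature method assignments (assumed schedule)."""
    rng = random.Random(seed)
    names = list(METHODS)
    current = [rng.choice(names) for _ in columns]
    current_cost = cost(impute(columns, current))
    best, best_cost = list(current), current_cost
    temp = t0
    for _ in range(steps):
        # Neighbour: re-draw the method of one randomly chosen feature.
        candidate = list(current)
        candidate[rng.randrange(len(columns))] = rng.choice(names)
        cand_cost = cost(impute(columns, candidate))
        delta = cand_cost - current_cost
        # Accept improvements always; accept worse moves with prob. exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temp *= cooling  # geometric cooling
    return best, best_cost

# Toy data (assumed): two features, one missing entry each, true values known.
columns = [
    [1, 1, 2, 3, 100, None],
    [5, 5, 7, 9, 11, None],
]
truth = {(0, 5): 2, (1, 5): 5}  # (column, row) -> hidden true value

def toy_cost(filled):
    """Stand-in for a downstream task score: squared error on hidden entries."""
    return sum((filled[c][r] - v) ** 2 for (c, r), v in truth.items())

best, best_cost = anneal(columns, toy_cost)
print(best, best_cost)
```

In this toy setting the two features favour different methods, which mirrors the abstract's point that a single method chosen for the whole dataset can be suboptimal. In practice the cost function would be replaced by a validation score of the downstream model.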
Item Type: Thesis
Uncontrolled Keywords: Data Pre-Processing; Missing Values Handling; Metaheuristic Approach; Simulated Annealing Algorithm (Turkish: Veri Ön İşleme; Eksik Değerleri İşleme; Üst sezgisel yanaşmak; benzetilmiş tavlama algoritması)
Subjects: T Technology > T Technology (General) > T055.4-60.8 Industrial engineering. Management engineering
Divisions: Faculty of Engineering and Natural Sciences > Academic programs > Industrial Engineering
Faculty of Engineering and Natural Sciences
Depositing User: Dila Günay
Date Deposited: 12 Jul 2023 15:07
Last Modified: 12 Jul 2023 15:07
URI: https://research.sabanciuniv.edu/id/eprint/47493
