Farham Nia, Saied (2022) Metaheuristic approach for optimal data pre-processing method selection case study: Missing values handling. [Thesis]
PDF
10521406.pdf
Download (1MB)
Abstract
The current big data era has given rise to many pioneering opportunities both in research and in practice. However, despite the potential benefits, there are also significant challenges in using observed data to mine information and create value through informed decisions. Indeed, dataset quality has become a major challenge and a focus area beyond the fields of database management systems and data engineering. Handling missing values, a pervasive and unavoidable phenomenon in datasets, is still the subject of active research. Although scientists and practitioners in statistics and machine learning have introduced various approaches and developed methods, there is still considerable room for improvement. In this research, a systematic approach for handling missing values is proposed in which the appropriate method for each feature of a dataset is selected automatically according to the downstream data analytic task. To this end, a simulated annealing based metaheuristic has been developed which assigns to each feature the most appropriate of seven commonly used missing value handling methods, namely Mean, Mode, and Median Imputation, Hot-Deck, K-NN, Bayesian Ridge Regression Imputation, and Random Forest Regression Imputation. Experimental analyses are conducted on four different datasets, and the performance of the proposed approach is tested at different levels of missingness. The results demonstrate that the proposed approach outperforms each of the seven methods applied on its own. The results imply that a wholesale approach, which chooses a single best missing value handling method for an entire dataset, should be made more granular, and that features should be addressed separately during the missing data handling stage.
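The per-feature selection idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function names (`impute`, `anneal`), the `score` callback, the cooling schedule, and all parameter values are assumptions made for illustration, and only three of the seven methods (mean, median, mode) are included to keep the example dependency-free. In the thesis the objective is tied to the downstream analytic task; here the caller supplies any cost function to minimise.

```python
import math
import random
import statistics

# Illustrative subset of the methods named in the abstract (assumption:
# the thesis uses seven methods; three simple ones are shown here).
IMPUTERS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}

def impute(columns, assignment):
    """Fill None entries in each column with the statistic chosen for it."""
    filled = []
    for col, method in zip(columns, assignment):
        observed = [v for v in col if v is not None]
        fill = IMPUTERS[method](observed)
        filled.append([fill if v is None else v for v in col])
    return filled

def anneal(columns, score, steps=200, t0=1.0, cooling=0.97, seed=0):
    """Search for a per-feature assignment that minimises score(filled)."""
    rng = random.Random(seed)
    methods = list(IMPUTERS)
    current = [rng.choice(methods) for _ in columns]
    current_cost = score(impute(columns, current))
    best, best_cost = list(current), current_cost
    temp = t0
    for _ in range(steps):
        # Neighbour: re-draw the method of one randomly chosen feature.
        candidate = list(current)
        candidate[rng.randrange(len(columns))] = rng.choice(methods)
        cost = score(impute(columns, candidate))
        # Accept improvements always; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if cost <= current_cost or rng.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
        temp *= cooling  # geometric cooling (hypothetical schedule)
    return best, best_cost
```

A toy call might look as follows, where `None` marks a missing entry and the cost is the absolute error against values that were hidden on purpose (again, a stand-in for a real downstream metric):

```python
cols = [[1.0, 2.0, 2.0, None, 100.0], [5.0, 5.0, None, 7.0, 6.0]]
cost_fn = lambda filled: abs(filled[0][3] - 2.0) + abs(filled[1][2] - 5.0)
assignment, cost = anneal(cols, cost_fn)
```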
Item Type: | Thesis |
---|---|
Uncontrolled Keywords: | Data Pre-Processing. -- Missing Values Handling. -- Metaheuristic Approach. -- Simulated Annealing Algorithm. -- Veri Ön İşleme. -- Eksik Değerleri İşleme. -- Üst sezgisel yanaşmak. -- benzetilmiş tavlama algoritması. |
Subjects: | T Technology > T Technology (General) > T055.4-60.8 Industrial engineering. Management engineering |
Divisions: | Faculty of Engineering and Natural Sciences > Academic programs > Industrial Engineering; Faculty of Engineering and Natural Sciences |
Depositing User: | Dila Günay |
Date Deposited: | 12 Jul 2023 15:07 |
Last Modified: | 12 Jul 2023 15:07 |
URI: | https://research.sabanciuniv.edu/id/eprint/47493 |