Year | 2020-09-10 |
---|---|
Specialization | Master's in Software Engineering |
Title | Prediction of missing data technique to improve big data classification |
Main supervisor | عايش منور هويشل الحروب | Aysh M. Alhroob |
Co-supervisor | | |
Student name | هدى حسين | Huda Hussain |
Abstract | Designing machine-learning-based early prediction systems for diabetes is an emerging research area that grows day by day as diabetes cases increase around the world. Missing values in medical datasets in general, and in diabetes datasets in particular, are a problem for machine learning models and case studies. An imputation method that estimates the missing values is therefore needed as a preprocessing step, applied before the cases in the dataset are classified. In this study, a new imputation algorithm based on the Firefly Algorithm (FA) is proposed, called Imputation Algorithm based Firefly Algorithm (IFA). To evaluate the proposed IFA, a classifier is needed as the fitness function; it produces the classification accuracy of the generated dataset, which should be maximized. Accuracy is therefore obtained using three different classifiers: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Naïve Bayesian Classifier (NBC). The Pima Indian Diabetes Disease (PIDD) dataset is the main dataset used in this study for estimating the missing values and evaluating IFA. The proposed algorithm is evaluated in two types of experiments. The first validates the generated datasets using k-fold cross-validation (k = 5). The second uses holdout validation, in which the generated dataset is divided into a training set (65%) and a testing set (35%). The results showed that IFA-SVM ranked best based on the average of ten runs, while IFA-NBC ranked worst. Moreover, IFA with all classifiers achieved better accuracy than the four popular techniques it was compared against, which shows that, in this study, an optimization algorithm used for imputation outperforms the statistical methods. In conclusion, FA can be used for estimating missing values in PIDD and in medical datasets in general. |
Publications derived from the thesis |
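
The abstract describes imputation as an optimization problem: candidate fill values are scored by the classification accuracy of the completed dataset, and that accuracy is maximized. The record does not give IFA's update rule or parameters, so the following is only a minimal sketch under stated assumptions: it uses the textbook firefly move, a plain k-NN classifier under 5-fold cross-validation as the fitness function, and a tiny made-up dataset in place of PIDD. All function names and parameter values (`firefly_impute`, `beta0`, `gamma`, `alpha`, and so on) are hypothetical.

```python
import math
import random


def knn_cv_accuracy(X, y, k=3, folds=5):
    """Accuracy of a plain k-NN classifier under k-fold cross-validation."""
    n = len(X)
    correct = 0
    for f in range(folds):
        test_idx = list(range(f, n, folds))  # interleaved folds
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        for t in test_idx:
            nearest = sorted(train_idx, key=lambda i: math.dist(X[i], X[t]))[:k]
            votes = [y[i] for i in nearest]
            if max(set(votes), key=votes.count) == y[t]:
                correct += 1
    return correct / n


def fill(X, holes, values):
    """Return a copy of X with each missing cell replaced by a candidate value."""
    filled = [row[:] for row in X]
    for (r, c), v in zip(holes, values):
        filled[r][c] = v
    return filled


def firefly_impute(X, y, n_fireflies=8, iters=15,
                   beta0=1.0, gamma=0.01, alpha=0.2, seed=0):
    """Search for fill values that maximize k-NN cross-validation accuracy.

    Assumption: standard firefly update, not necessarily the rule used by IFA.
    """
    rng = random.Random(seed)
    holes = [(r, c) for r, row in enumerate(X)
             for c, v in enumerate(row) if v is None]
    # Each missing cell is bounded by the observed min/max of its column.
    bounds = []
    for _, c in holes:
        col = [row[c] for row in X if row[c] is not None]
        bounds.append((min(col), max(col)))

    def fitness(values):
        return knn_cv_accuracy(fill(X, holes, values), y)

    # One firefly = one vector of candidate fill values.
    swarm = [[rng.uniform(lo, hi) for lo, hi in bounds]
             for _ in range(n_fireflies)]
    light = [fitness(s) for s in swarm]

    for _ in range(iters):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if light[j] > light[i]:  # move i toward the brighter firefly j
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    swarm[i] = [
                        min(hi, max(lo, a + beta * (b - a)
                                    + alpha * (rng.random() - 0.5) * (hi - lo)))
                        for a, b, (lo, hi) in zip(swarm[i], swarm[j], bounds)
                    ]
                    light[i] = fitness(swarm[i])

    best = max(range(n_fireflies), key=light.__getitem__)
    return fill(X, holes, swarm[best]), light[best]


# Made-up two-class toy data with two missing cells (None); not PIDD.
X = [[1.0, 1.2], [0.9, None], [1.1, 0.8], [1.2, 1.0], [0.8, 1.1],
     [5.0, 5.2], [None, 4.9], [5.1, 5.0], [4.8, 5.3], [5.2, 4.7]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

filled, acc = firefly_impute(X, y)
print(filled[1], filled[6], acc)
```

Swapping `knn_cv_accuracy` for another scoring function is how the different classifiers (KNN, SVM, NBC) or the 65%/35% holdout split mentioned in the abstract would plug into the same search loop.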