Optimizing COVID-19 diagnosis with feature selection and data balancing: A comparative study
List of Authors
  • Fauzan Iliya Khalid, Mokhairi Makhtar

Keywords
  • COVID-19, imbalanced class, classification, feature selection

Abstract
  • The COVID-19 pandemic has created an urgent need for accurate diagnosis using machine learning techniques. However, the performance of machine learning models for COVID-19 diagnosis is often hindered by class imbalance, where the number of positive cases is much smaller or much larger than the number of negative cases. This paper examines the impact of imbalanced and balanced classes on the performance of machine learning models for COVID-19 diagnosis, using a publicly available COVID-19 symptom-based dataset. Two experiments were conducted: one with the imbalanced dataset and one with a balanced dataset obtained through oversampling with the Synthetic Minority Oversampling Technique (SMOTE) and undersampling with SpreadSubsample. In both experiments, data pre-processing, including data cleaning, and feature selection using WrapperSubsetEval were performed. The performance of several classifiers, including Decision Tree (J48), Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbors (IBk), and Sequential Minimal Optimization (SMO), was evaluated using accuracy as the evaluation metric. In Experiment A, the SVM classifier performed well on the imbalanced dataset, achieving an accuracy of 98.27%. In Experiment B, the IBk classifier performed best on the balanced dataset, achieving an accuracy of 98.81%. Our results demonstrate that the choice of class balancing technique and feature selection method can substantially ease the problem of capturing the minority class and improve the performance of machine learning models for COVID-19 diagnosis. These findings have important implications for the development of effective classification methods for COVID-19 diagnosis.
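The two balancing strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' Weka workflow: the functions `smote` and `undersample`, their parameters, and the toy data are hypothetical names introduced here, and the sketch assumes purely numeric features. The first function follows the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours); the second performs plain random undersampling under a cap on class size, which is only loosely comparable to a spread-based subsampling filter.

```python
import math
import random
from collections import Counter


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def undersample(rows, labels, max_ratio=1.0, rng=None):
    """Randomly drop rows so no class exceeds max_ratio times the rarest class."""
    if rng is None:
        rng = random.Random(0)
    counts = Counter(labels)
    cap = int(min(counts.values()) * max_ratio)
    kept = []
    for cls in counts:
        members = [r for r, y in zip(rows, labels) if y == cls]
        rng.shuffle(members)
        kept.extend((r, cls) for r in members[:cap])
    return kept


# Toy data (made up for illustration): two numeric features per record.
negatives = [(float(i % 5), float(i // 5)) for i in range(20)]
positives = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5), (12.0, 11.0)]

# Oversampling: grow the minority class until it matches the majority class.
extra = smote(positives, n_new=len(negatives) - len(positives), k=3)
oversampled = positives + extra

# Undersampling: shrink the majority class down to the minority class size.
rows = negatives + positives
labels = ["neg"] * len(negatives) + ["pos"] * len(positives)
undersampled = undersample(rows, labels)

print(len(negatives), len(oversampled))
print(Counter(y for _, y in undersampled))
```

Because every synthetic point lies on a segment between two existing minority records, oversampling of this kind adds no information outside the region the minority class already occupies. Undersampling instead discards majority records, which is why the two approaches can lead to different classifier rankings.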
