Handling Missing Data in Administrative Records Through Imputation Methods for Data Quality Improvement
List of Authors
  • Nur Azmina Osman, Nurulhuda Firdaus Mohd Azmi, Suraya Yaacob

Keywords
  • Data quality, Data completeness, Missing data, Imputation, Administrative records

Abstract
  • Administrative data is a valuable source for supporting research, policymaking, and decision-making. However, missing data remains a significant challenge that can compromise the completeness, quality, and reliability of that data. Missing values, whether caused by errors in data collection or by other factors, can lead to biased results if not properly addressed. A common approach is to discard incomplete rows or columns, but this loses valuable information. Imputation offers a better alternative by estimating missing values from the observed data. This study presents an experimental evaluation of four widely used imputation methods: mean imputation, median imputation, K-Nearest Neighbors (KNN), and Multiple Imputation using Chained Equations (MICE). The methods were applied to simulated administrative data consisting of continuous variables, with values removed completely at random at three levels of missingness (5%, 20%, and 50%). Performance was assessed using Normalized Root Mean Squared Error (NRMSE) to quantify imputation accuracy, and distribution plots to visually compare the imputed data against the original data. The results show that increasing dataset size reduces average NRMSE. Among the methods, MICE consistently achieves the highest accuracy and best preserves the data distribution across all dataset sizes and missingness levels. KNN also performs well but is outperformed by mean imputation at 50% missingness in larger datasets. Overall, advanced methods such as MICE and KNN preserve the original distribution more effectively than simpler approaches such as mean and median imputation. These findings highlight the importance of selecting robust imputation techniques for improving data quality, especially in administrative datasets. Future research could explore other missing data mechanisms and extend the evaluation to categorical and mixed-type data.
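
The evaluation loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, covers only the mean and median baselines, and rests on several assumptions. The simulated column (a log-normal "income" variable), the helper names `mask_mcar`, `impute`, and `nrmse`, and the choice to normalize RMSE by the standard deviation of the true values are all hypothetical, since the abstract does not specify the simulated variables or the NRMSE normalization.

```python
import math
import random
import statistics


def mask_mcar(values, rate, rng):
    """Return a copy of `values` with a fraction `rate` of entries set to None.

    Positions are drawn uniformly at random, independent of the values,
    which is what "missing completely at random" (MCAR) means.
    """
    n_missing = round(len(values) * rate)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute(values, strategy):
    """Fill None entries with the mean or median of the observed entries."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def nrmse(truth, masked, imputed):
    """RMSE over the masked cells, divided by the std. dev. of the true values.

    The normalization is an assumption; other definitions divide by the
    range or the mean instead.
    """
    sq_errors = [(t - p) ** 2
                 for t, m, p in zip(truth, masked, imputed) if m is None]
    rmse = math.sqrt(sum(sq_errors) / len(sq_errors))
    return rmse / statistics.pstdev(truth)


rng = random.Random(0)
# Hypothetical skewed continuous variable standing in for an administrative field.
truth = [rng.lognormvariate(8.0, 0.5) for _ in range(1000)]

results = {}
for rate in (0.05, 0.20, 0.50):  # the three missingness levels in the abstract
    masked = mask_mcar(truth, rate, rng)
    for strategy in ("mean", "median"):
        score = nrmse(truth, masked, impute(masked, strategy))
        results[(rate, strategy)] = score
        print(f"missing={rate:.0%} strategy={strategy:<6} NRMSE={score:.3f}")
```

Lower NRMSE indicates imputed values closer to the originals. KNN and MICE need more machinery than fits here; one option, again an assumption rather than the study's stated tooling, is scikit-learn's `KNNImputer` and its experimental `IterativeImputer`, plugged into the same mask, impute, and score loop.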
