Comparison of Selected Multiple Imputation Methods for Continuous Variables – Preliminary Simulation Study Results
DOI: https://doi.org/10.18778/0208-6018.339.05
Keywords: incomplete data, multiple imputation, principal component analysis, missForest
Abstract
The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not confined to any particular scientific domain; it arises in economics, sociology, education, behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable included in the analysis, and the typical approach to missing data is simply to delete the incomplete cases. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state‑of‑the‑art technique for handling missing data is multiple imputation. This paper considers selected multiple imputation methods, with special attention paid to principal component analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations compared with two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data in 10 complete data sets from the UCI repository of machine learning databases. Missing values were then imputed with MICE, missForest and the PCA‑based method (MIPCA), and the normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as the multiple imputation method providing the lowest imputation error for all types of missingness. PCA‑based imputation does not perform well in terms of accuracy.
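The evaluation loop described in the abstract (delete values from complete data, impute, score with NRMSE) can be sketched as follows. This is only an illustration under stated assumptions, not the study's code: the data are synthetic rather than UCI data sets, only a completely-at-random deletion mechanism is simulated, column-mean imputation stands in for MICE, missForest and MIPCA, and the NRMSE formula (root of mean squared error over the variance of the true deleted values) is one common definition that may differ from the one used in the paper. All function names are hypothetical.

```python
import numpy as np


def make_mcar_mask(shape, proportion, rng):
    """Boolean mask, True where a value is deleted completely at random (assumed mechanism)."""
    return rng.random(shape) < proportion


def mean_impute(x_missing):
    """Placeholder imputer (NOT MICE, missForest or MIPCA): fill NaNs with column means."""
    col_means = np.nanmean(x_missing, axis=0)
    return np.where(np.isnan(x_missing), col_means, x_missing)


def nrmse(x_true, x_imputed, mask):
    """Assumed NRMSE: sqrt(MSE over deleted entries / variance of the true deleted values)."""
    diff = x_true[mask] - x_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(x_true[mask])))


rng = np.random.default_rng(0)
x_true = rng.normal(size=(200, 5))  # synthetic stand-in for a complete data set

results = {}
for proportion in (0.1, 0.3, 0.5):  # within the 10-50% range named in the abstract
    mask = make_mcar_mask(x_true.shape, proportion, rng)
    x_missing = np.where(mask, np.nan, x_true)
    results[proportion] = nrmse(x_true, mean_impute(x_missing), mask)

for proportion, score in results.items():
    print(f"missing={proportion:.0%}  NRMSE={score:.3f}")
```

To compare real methods, `mean_impute` would be swapped for each imputer under test while the mask and the scoring function stay fixed, so that every method is evaluated on the same deleted entries.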