The Problem of Redundant Variables in Random Forests
DOI:
https://doi.org/10.18778/0208-6018.339.01
Keywords:
random forests, redundant variables, feature selection, clustering of features
Abstract
Random forests are currently one of the most popular supervised learning methods among practitioners. Their popularity stems partly from the fact that they can be applied without a time‑consuming pre‑processing step. Random forests can handle mixed types of features, irrespective of their distributions. The method is robust to outliers, and feature selection is built into the learning algorithm. However, classification accuracy can decrease in the presence of redundant variables. In this paper, we discuss two approaches to the problem of redundant variables. We consider two strategies of searching for the best feature subset as well as two formulas for aggregating the features in clusters. In the empirical experiment, we generate collinear predictors and add them to real datasets. Dimensionality reduction methods usually improve the accuracy of random forests, but none of them clearly outperforms the others.
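The idea of injecting collinear predictors and then collapsing clusters of redundant features can be illustrated with a small sketch. The abstract does not specify how the collinear predictors are generated, which clustering method is used, or which two aggregation formulas are compared, so everything below is an assumption made for illustration: noisy copies of a column stand in for collinear predictors, a greedy correlation threshold stands in for feature clustering, and the row-wise mean stands in for one possible aggregation formula. All function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


def add_collinear_copies(columns, source, n_copies, noise_sd, rng):
    """Append noisy copies of one column (assumed way of creating redundancy)."""
    out = dict(columns)
    for k in range(n_copies):
        out[f"{source}_copy{k}"] = [v + rng.gauss(0, noise_sd) for v in columns[source]]
    return out


def cluster_by_correlation(columns, threshold):
    """Greedy grouping: a feature joins the first cluster whose first member
    it correlates with (in absolute value) at or above the threshold."""
    clusters = []
    for name, values in columns.items():
        for cluster in clusters:
            if abs(pearson(values, columns[cluster[0]])) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def aggregate_by_mean(columns, clusters):
    """Replace each cluster by the row-wise mean of its members
    (an assumed aggregation formula, keyed by the first member's name)."""
    reduced = {}
    for cluster in clusters:
        members = [columns[name] for name in cluster]
        reduced[cluster[0]] = [mean(row) for row in zip(*members)]
    return reduced


# Toy data: two independent features, then three noisy copies of the first.
rng = random.Random(0)
n = 200
columns = {
    "x1": [rng.gauss(0, 1) for _ in range(n)],
    "x2": [rng.gauss(0, 1) for _ in range(n)],
}
augmented = add_collinear_copies(columns, "x1", n_copies=3, noise_sd=0.1, rng=rng)
clusters = cluster_by_correlation(augmented, threshold=0.9)
reduced = aggregate_by_mean(augmented, clusters)
print(clusters)
print(len(augmented), "features reduced to", len(reduced))
```

The reduced feature set could then be passed to any random forest implementation in place of the augmented one; comparing the accuracy of the two is the kind of experiment the abstract describes, though the paper's actual procedure may differ.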