The Problem of Redundant Variables in Random Forests

Mariusz Kubus

doi:10.18778/0208-6018.339.01

Author

Mariusz Kubus, Opole University of Technology, Faculty of Production Engineering and Logistics, Department of Mathematics and Applied Computer Science

Keywords:

random forests, redundant variables, variable selection, feature clustering

Abstract

Random forests are currently among the supervised classification methods most favored by practitioners. Their popularity stems from the fact that they can be applied without time-consuming preprocessing of the data. A random forest can be used with variables of different types, regardless of their distributions. The method is robust to outliers and has a built-in variable selection mechanism. However, classification accuracy can be seen to drop when redundant variables are present. The article discusses two approaches to the problem of redundant variables. Two search strategies are considered within the approach based on variable selection, and two ways of constructing synthetic variables are considered within the approach based on variable clustering. In the experiment, linearly dependent predictors are generated and added to real data sets. Dimensionality reduction methods usually improve the accuracy of random forests, but none of them shows a clear advantage.
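The experimental setup described in the abstract, adding linearly dependent predictors to an existing data set, can be illustrated with a minimal sketch. The abstract does not state how the redundant predictors were generated, so everything below is an assumption made for illustration: the function name `add_redundant_predictors`, the choice of combining exactly two source columns, the weight range, and the optional Gaussian noise are all hypothetical and are not the author's procedure.

```python
import random


def add_redundant_predictors(rows, n_new, noise_sd=0.0, seed=0):
    """Append n_new redundant columns to a data set (illustrative only).

    rows: list of equal-length lists of floats, one list per observation,
          with at least two columns.
    Each new column is a * x_j + b * x_k (+ optional Gaussian noise),
    where j and k are two distinct original columns chosen at random.
    Returns (augmented_rows, recipes); recipes holds one (j, k, a, b)
    tuple per new column. The input rows are not modified.
    """
    rng = random.Random(seed)
    n_orig = len(rows[0])
    augmented = [list(r) for r in rows]
    recipes = []
    for _ in range(n_new):
        # Assumed scheme: two distinct source columns, positive weights.
        j, k = rng.sample(range(n_orig), 2)
        a = rng.uniform(0.5, 2.0)
        b = rng.uniform(0.5, 2.0)
        recipes.append((j, k, a, b))
        for src, dst in zip(rows, augmented):
            dst.append(a * src[j] + b * src[k] + rng.gauss(0.0, noise_sd))
    return augmented, recipes


# Usage: three observations with three original predictors,
# extended by two linearly dependent predictors.
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
augmented, recipes = add_redundant_predictors(data, n_new=2)
```

With `noise_sd=0.0` the new columns are exact linear combinations of the originals, so they carry no additional information for a classifier; a positive `noise_sd` gives strongly correlated rather than perfectly collinear predictors.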
