Will a Computer Recognize a Hater? Using Machine Learning (ML) in Qualitative Data Analysis

Marek Troszyński; Aleksander Wawer

doi:10.18778/1733-8069.13.2.04

Authors

Marek Troszyński, Collegium Civitas, Plac Defilad 1, 00-901 Warszawa
Aleksander Wawer, Instytut Podstaw Informatyki PAN, ul. Jana Kazimierza 5, 01-248 Warszawa

Keywords:

machine learning, qualitative data analysis, hate speech, intercoder agreement

Abstract

The purpose of this article is to present the process of automatically tagging hate speech in social media. Implementing this process makes it possible to treat qualitative methods quantitatively: to analyze corpora of hundreds of thousands of texts on the basis of their meaning. The process relies on machine learning (ML) algorithms. We present the example of a project that identified hate speech in texts from Polish online forums. The key issue is the precise conceptualization and operationalization of the category “hate speech.” This makes it possible to prepare specific instructions and to train the coding team, which in turn yields higher rates of inter-coder agreement. The marked texts are then used as training data for automated categorization methods based on ML algorithms. We then describe the course of machine coding. The article also identifies problems associated with the automatic coding of hate speech and proposes solutions. In summary, we point out the factors that are crucial to a research process that uses machine learning.

Author Biographies

Marek Troszyński, Collegium Civitas, Plac Defilad 1, 00-901 Warszawa

Marek Troszyński holds a PhD in sociology and heads the Digital Civilization Observatory (Obserwatorium Cywilizacji Cyfrowej) at Collegium Civitas, where he is also an assistant professor. Research interests: sociology of culture and the use of automatic natural language processing (NLP) methods in sociological research on discourse.

Aleksander Wawer, Instytut Podstaw Informatyki PAN, ul. Jana Kazimierza 5, 01-248 Warszawa

Aleksander Wawer holds a PhD in technical sciences in the field of computer science and is a graduate in sociology and computer science. He is an assistant professor in the Linguistic Engineering Group (Zespół Inżynierii Lingwistycznej) at the Institute of Computer Science, Polish Academy of Sciences (Instytut Podstaw Informatyki PAN). His research interests include selected problems in natural language processing, in particular sentiment analysis, relation extraction, and deep learning.
