Sentiment Classification of Bank Clients’ Reviews Written in the Polish Language

It is estimated that approximately 80% of all data gathered by companies are text documents. This article is devoted to one of the most common problems in text mining: text classification in sentiment analysis, which focuses on determining the sentiment of a document. The lack of a defined structure in text makes this problem more challenging and has led to the development of various techniques for determining the sentiment of a document. In this paper, a comparative analysis of two sentiment classification methods, a naive Bayes classifier and logistic regression, was conducted. The texts analysed are bank clients' reviews written in the Polish language. Classification was conducted by means of a bag-of-n-grams approach, in which a text document is represented as a set of terms, each term consisting of n words. The results show that logistic regression performed better.


Introduction
Approximately 80% of all data gathered by companies has textual form (Sullivan, 2001): e-mails, memos, reports, research, reviews, strategy and marketing plans, etc. All of these textual forms provide a rich and extensive source of valuable (but undiscovered) information. The amount of available data is overwhelming, hence analysing it manually might be ineffective or even impossible. On the other hand, such a collection of data cannot be processed with typical techniques because of its unstructured form. Fortunately, there are several text mining applications available for deriving high-quality information from text documents. This creates an opportunity to take advantage of data to improve decision-making processes in companies.
Text classification in sentiment analysis is one of the applications of text mining; it can provide answers to questions such as: "Do clients like my product (or service)?" or "Which aspects of my product (or service) do clients like or dislike?" It is also helpful in tracking and evaluating customer satisfaction. This type of text analysis focuses on detecting an author's attitude (called sentiment) toward entities and their attributes.
In this paper, sentiment classification of bank clients' reviews written in the Polish language is examined in a comparative analysis of two methods. In Section 2, sentiment analysis and document sentiment classification are introduced. The next section presents the idea of the bag-of-n-grams approach, the naive Bayes classifier, and logistic regression. Section 4 contains the algorithm used to evaluate the above-mentioned methods, a data overview, and the results of the comparison. Finally, conclusions are stated at the end.

Sentiment analysis
Sentiment analysis (opinion mining) focuses on analysing textual data in order to assess an author's attitude toward entities and their attributes. This type of analysis is interdisciplinary in its nature, as it combines research and applications in such fields as: natural language processing (NLP), data mining, web mining, and information retrieval. It is presumed that the terms sentiment analysis and opinion mining were first introduced in (Dave, Lawrence, Pennock, 2003) and (Nasukawa, Yi, 2003) respectively, but research regarding sentiment and opinion emerged a few years earlier (Wiebe, 2000; Das, Chen, 2001; Tong, 2001; Morinaga et al., 2002; Pang, Lee, Vaithyanathan, 2002; Turney, 2002).
It is worth mentioning that there is no clear distinction between sentiment analysis and opinion mining among researchers and practitioners. In this paper, the two terms are used interchangeably. Sentiment analysis can be performed at three levels of granularity (Liu, 2015):
1) document level - the objective is to classify a whole opinion document as having positive or negative sentiment;
2) sentence level - the main task is to assign a sentiment (positive or negative) to each sentence; sentences without an opinion are considered neutral;
3) aspect level - this type of analysis focuses on finding opinions concerning entities or their aspects and then assigning sentiment to them; for example, the opinion "I love this restaurant, but the prices are too high" has overall positive sentiment, but this does not mean that its author is positive about every aspect of the restaurant; to obtain such details, one needs to apply aspect-level analysis.

Document sentiment classification
Document sentiment classification is one of the most studied topics in the field of sentiment analysis. Its task is to assess the overall sentiment about an entity based on the opinion document evaluating that entity. In other words, the goal of document sentiment classification is to assign one label (positive, negative or neutral) to a document. Document sentiment classification does not take into account the individual aspects discussed in the opinion document or seek sentiments regarding them; hence, it is considered document-level analysis. There is a great deal of research devoted to sentiment classification, studying various types of data and various types of techniques. Turney (2002) used data from the Epinions.com website containing reviews sampled from four domains: cars, banks, movies, and travel destinations. He calculated the Semantic Orientation (SO) of a term by means of the number of hits returned by a search engine for queries referencing the words poor and excellent:

SO(term) = \log_2 \frac{hits(term\ \mathrm{NEAR}\ excellent) \cdot hits(poor)}{hits(term\ \mathrm{NEAR}\ poor) \cdot hits(excellent)}

A document is labelled as positive if the averaged SO is positive; otherwise, the document is labelled as negative. Pang, Lee, and Vaithyanathan (2002) used film reviews from the Internet Movie Database (IMDb). Their study mostly utilises unigrams and bigrams with term presence as features. Na, Khoo, and Wu (2005) examined unigrams, as well as unigrams with part-of-speech (POS) tags, with different weighting schemes (term presence, term frequency, and term frequency-inverse document frequency), using on-line product reviews downloaded from the Review Centre (https://www.reviewcentre.com/). Many researchers value messages (tweets) from Twitter as a source of data; e.g. Asur and Huberman (2010) classified film-related tweets using an n-grams approach in order to improve box-office revenue forecasts for movies. Tweets regarding the 2011 Irish general election were utilised in a unigram approach.
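Turney's semantic orientation can be illustrated with a short sketch (Python is used here purely for illustration; the hit counts, the smoothing constant `eps`, and the function name are assumptions, not part of the original study):

```python
import math

def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, eps=0.01):
    """Turney-style (2002) semantic orientation of a phrase computed
    from search-engine hit counts; eps smooths zero counts."""
    return math.log2(((hits_near_excellent + eps) * hits_poor) /
                     ((hits_near_poor + eps) * hits_excellent))

# A phrase co-occurring far more often with "excellent" than with "poor"
# receives a positive orientation.
so = semantic_orientation(hits_near_excellent=200, hits_near_poor=10,
                          hits_excellent=1_000_000, hits_poor=1_000_000)
```

Averaging such per-term scores over a document, as described above, then yields the document's label.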
Hanbury and Nopp (2015) employed sentiment analysis in risk assessment for Eurozone banks. The authors evaluated CEO letters and Outlook sections (usually part of the management report) by means of a finance-oriented list of sentiment words. This finance-specific word list comes from Loughran and McDonald's (2011) work. Selected studies, with methods and accuracy, are given in Table 1.

Classification algorithms
To employ a particular classification algorithm, the opinion documents analysed were expressed in a bag-of-n-grams fashion. In this kind of document representation, a document consists of a set of terms (features), where n stands for the number of words in a term, e.g. unigram, bigram, etc. Given this, the documents can be presented as the following document-term matrix (DTM):

x = [x_{ij}]_{i = 1, \ldots, I;\; j = 1, \ldots, J}

where: x is the document-term matrix; x_{ij} is the number of times the j-th term occurs in the i-th document; I is the total number of documents in the training set; J is the total number of terms in the training set.
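The bag-of-n-grams representation above can be sketched as follows (an illustrative Python toy, not the R pipeline used in the study; the helper names are assumptions):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word terms of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_term_matrix(docs, n=1):
    """Build a bag-of-n-grams DTM: x[i][j] counts occurrences
    of the j-th term in the i-th document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenised for t in ngrams(toks, n)})
    index = {term: j for j, term in enumerate(vocab)}
    x = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenised):
        for term, count in Counter(ngrams(toks, n)).items():
            x[i][index[term]] = count
    return vocab, x

vocab, x = document_term_matrix(["good bank", "bad bank service"], n=1)
```

Passing `n=2` instead would produce the bigram variant; binary, TF, and TFIDF versions of the matrix are then simple transformations of these raw counts.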

Naive Bayes
Bayes' rule (Domański, Pruska, 2000) for document sentiment classification defines the conditional probability that document x_i belongs to class C_k:

P(C_k \mid x_i) = \frac{p_k\, f(x_i \mid C_k)}{\sum_{l=1}^{K} p_l\, f(x_i \mid C_l)}

where: C_k is the k-th class, k = 1, …, K; x_i is the i-th document with J features; p_k is the a priori probability that a document belongs to class C_k; f(x_i \mid C_k) is the probability of occurrence of document x_i, given that it belongs to class C_k.
A naive Bayes (NB) classifier assigns document x_i to class C_k if

P(C_k \mid x_i) = \max_{l = 1, \ldots, K} P(C_l \mid x_i),

which, since the denominator of Bayes' rule is the same for every class, is equivalent to

p_k\, f(x_i \mid C_k) = \max_{l = 1, \ldots, K} p_l\, f(x_i \mid C_l).

The above classification rule assumes that the terms x_j are independently distributed given the k-th class:

f(x_i \mid C_k) = \prod_{j=1}^{J} f(x_{ij} \mid C_k).

In order to train a naive Bayes classifier, p_k is calculated using relative-frequency estimation:

p_k = \frac{n_k}{I},

where n_k is the number of documents belonging to the k-th class, while f(x_{ij} \mid C_k) is calculated either by relative-frequency estimation (for term presence or TF):

f(x_{ij} \mid C_k) = \frac{n_{ijk}}{n_{jk}},

or by fitting a normal distribution (for TFIDF):

f(x_{ij} \mid C_k) = \frac{1}{s_{jk}\sqrt{2\pi}} \exp\!\left(-\frac{(x_{ij} - m_{jk})^2}{2 s_{jk}^2}\right),

where: n_{ijk} is the frequency of the i-th value of the j-th term in the k-th class; n_{jk} is the frequency of the j-th term in the k-th class; m_{jk} is the mean of TFIDF for the j-th term in the k-th class; s_{jk} is the standard deviation of TFIDF for the j-th term in the k-th class.
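A minimal sketch may make the training and classification rules concrete for term-presence features (illustrative Python, not the e1071 implementation used in the study; the Laplace smoothing added here to avoid zero probabilities is an assumption beyond the plain relative-frequency estimation described above):

```python
import math

def train_nb(x, y):
    """Train a term-presence naive Bayes classifier on DTM rows x
    with class labels y. Priors are relative class frequencies;
    per-term presence rates are Laplace-smoothed (an added assumption)."""
    classes = sorted(set(y))
    n_k = {k: y.count(k) for k in classes}
    prior = {k: n_k[k] / len(y) for k in classes}
    J = len(x[0])
    presence = {k: [0] * J for k in classes}
    for row, label in zip(x, y):
        for j, v in enumerate(row):
            if v > 0:
                presence[label][j] += 1
    rate = {k: [(presence[k][j] + 1) / (n_k[k] + 2) for j in range(J)]
            for k in classes}
    return classes, prior, rate

def classify_nb(doc, classes, prior, rate):
    """Pick the class maximising log prior + sum of per-term log likelihoods."""
    def score(k):
        s = math.log(prior[k])
        for j, v in enumerate(doc):
            p = rate[k][j]
            s += math.log(p if v > 0 else 1 - p)
        return s
    return max(classes, key=score)

# Toy DTM: two positive and two negative training documents over three terms.
x = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
y = ["pos", "pos", "neg", "neg"]
model = train_nb(x, y)
label = classify_nb([1, 0, 0], *model)
```

Working in log space avoids numerical underflow when the product over many terms is evaluated.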

Logistic regression
Let us assume that C is a Bernoulli random variable, with C = 1 if a document has positive sentiment and C = 0 otherwise. Then the logistic regression model (Hosmer, Lemeshow, Sturdivant, 2013) can be written as follows:

P(C = 1 \mid x_i) = \frac{e^{\beta_0 + \boldsymbol{\beta}' x_i}}{1 + e^{\beta_0 + \boldsymbol{\beta}' x_i}},

where β_0 is an intercept and β is a vector of estimated parameters. It is convenient to apply the logit transformation to the model above in order to obtain some desirable properties of a linear model:

\mathrm{logit}\, P(C = 1 \mid x_i) = \ln \frac{P(C = 1 \mid x_i)}{1 - P(C = 1 \mid x_i)} = \beta_0 + \boldsymbol{\beta}' x_i.

In particular, the above equation is linear in its parameters, hence the betas have a handy interpretation in terms of the odds ratio: if the feature x_j increases by 1 unit (ceteris paribus), the odds will be multiplied by e^{\beta_j}. This means that the odds that a document has positive sentiment (given the increased x_j) increase (decrease) by (e^{\beta_j} - 1) \cdot 100\%. P(C = 1 \mid x_i) is the probability that document x_i has positive sentiment; thus the probability that document x_i has negative sentiment is calculated by the following equation:

P(C = 0 \mid x_i) = 1 - P(C = 1 \mid x_i).

Document x_i is classified as negative if the following condition is satisfied:

P(C = 1 \mid x_i) < P(C = 0 \mid x_i);

otherwise, the document is considered positive. The model parameters can be estimated by the maximum likelihood method, maximising the likelihood function

L(\beta_0, \boldsymbol{\beta}) = \prod_{i=1}^{I} P(C = 1 \mid x_i)^{c_i} \left(1 - P(C = 1 \mid x_i)\right)^{1 - c_i}

with respect to β_0 and β, where c_i is the observed class of document x_i.

Evaluation

Experimental set-up
In order to evaluate the naive Bayes classifier and logistic regression in document sentiment classification, an experiment was conducted in line with the algorithm presented in Figure 1. All calculations were made in R. First, the documents analysed are read into memory and then initially processed, i.e. unwanted numbers, punctuation, and words are deleted. Lemmatisation is also a very important part of this step. Lemmatisation groups together the inflected forms of a word so that they can be analysed as a single item (the word's lemma), e.g. płakać is the lemma of płakał, płakaliśmy, płacze. This is especially important in the case of Polish, which is a highly inflected language. Lemmatisation is done by means of the tm package in R. This step can have a crucial impact on the features (and on the number of features) in the document-term matrix. For the purpose of this study, unigrams and bigrams are considered. The DTM is calculated using the hashmap, tm, and text2vec packages. After the DTM is created, three versions of the document-term matrix are calculated (binary, TF and TFIDF), employing the RWeka and tm packages. Then the matrix is used in 10-fold cross-validation, according to Figure 1, where the naive Bayes classifier and logistic regression are trained on a training sample and classification is evaluated on a validation sample. This part of the algorithm is handled by the e1071 and gmodels packages. Classification is evaluated by means of accuracy:

Accuracy = \frac{TP + TN}{I}

where: TP is the number of documents with positive sentiment classified as positive; TN is the number of documents with negative sentiment classified as negative; I is the number of all documents.
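The cross-validation loop and the accuracy measure can be sketched as follows (an illustrative Python toy, not the R code used in the study; the parity-based classifier in the usage example is a hypothetical stand-in for NB or logistic regression):

```python
import random

def accuracy(true, pred):
    """(TP + TN) / I: the share of documents classified correctly."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def cross_validate(x, y, train, predict, folds=10, seed=1):
    """Plain k-fold cross-validation: train on k-1 folds,
    score the held-out fold, and average the fold accuracies."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        held = set(idx[f::folds])                 # indices of the held-out fold
        x_tr = [x[i] for i in idx if i not in held]
        y_tr = [y[i] for i in idx if i not in held]
        model = train(x_tr, y_tr)
        preds = [predict(x[i], model) for i in held]
        scores.append(accuracy([y[i] for i in held], preds))
    return sum(scores) / folds

# Usage with a hypothetical rule-based classifier that reads the parity
# of the single feature, so every fold is classified perfectly.
x = [[i] for i in range(20)]
y = [i % 2 for i in range(20)]
acc = cross_validate(x, y,
                     train=lambda xs, ys: None,
                     predict=lambda row, model: row[0] % 2)
```

Averaging over the folds, as done here, gives the single accuracy figure reported per method and weighting scheme.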

The data
The data consist of 1,559 documents that are clients' reviews concerning one of the Polish banks. Each document is labelled with positive or negative sentiment (positive or negative class). These labels were assigned manually by the opinion holders (by choosing a sad or happy face icon). There were 786 negative and 773 positive documents. Words with the highest frequency in each class (red for negative and green for positive) are shown in Figure 2.

Results
Figures 3 and 4 show the classification results for the above-mentioned data set for unigrams and bigrams, respectively. Document sentiment classification was conducted by means of the naive Bayes classifier (NB) and logistic regression (GLM). Both classification methods outperformed the 50% random-choice baseline, with results ranging from 51.06% to 82.81%. The highest accuracy was observed for logistic regression (unigram DTM with TFIDF) and the lowest for the naive Bayes classifier (bigram DTM with TFIDF). Results for unigrams are quite similar for the binary and TF transformations, ranging from 76.91% to 77.81%, but for TFIDF the differences are greater, i.e. the naive Bayes classifier with TFIDF (64.14%) performs worse than NB and GLM with binary or TF weighting. Also, in terms of accuracy, NB is worse than any logistic regression variant. In fact, GLM with TFIDF has the highest percentage of correctly classified documents (82.81%). As for bigrams, logistic regression performed better than the naive Bayes classifier, yielding roughly 78% correctly classified documents. The accuracy of NB was about 9 p.p. lower than that of GLM for binary and TF. NB with TFIDF has the lowest accuracy (only 51.06%), only about 1 p.p. above the random-choice baseline.

Conclusions
In this paper, a naive Bayes classifier and logistic regression were examined in document sentiment classification performed for the Polish language. This problem was found by researchers (Pang, Lee, Vaithyanathan, 2002) to be more challenging than traditional topic-based classification, which relies on keywords that help identify topics. Document sentiment classification is more complex because sentiment (rather than a topic) can be expressed in a more subtle manner.
The results presented in section 4.3 indicate that the performance of the naive Bayes classifier and logistic regression applied to customer reviews written in Polish is high. In all cases, the accuracy is higher than the random-choice baseline, and it is also in line with the accuracy that researchers obtained in earlier studies (see Table 1). Logistic regression with TFIDF yielded the highest accuracy, i.e. 82.81%.
When it comes to the TFIDF transformation, the accuracy of the naive Bayes classifier was undoubtedly poorer than in the case of the other approaches. The reason for this drop in performance is that the distribution of TFIDF features does not necessarily follow the normal density assumed when fitting f(x_i | C_k).
It is worth mentioning that the parameter estimates of the above-mentioned methods are highly influenced by the sparsity of the DTM matrix. Thus, the performance of the classifiers considered is driven by the non-occurrence rather than the occurrence of features obtained from the training set. Saif, He, and Alani (2012) proposed two effective approaches to dealing with the sparsity of the DTM matrix.
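The sparsity in question is easy to quantify: it is the share of zero entries in the DTM, i.e. the share of (document, term) pairs in which the term is absent. A small illustrative sketch (the function name and toy matrix are assumptions, not from the study):

```python
def dtm_sparsity(x):
    """Fraction of zero entries in a document-term matrix."""
    total = sum(len(row) for row in x)
    zeros = sum(1 for row in x for v in row if v == 0)
    return zeros / total

# Toy DTM: three documents over five terms; 11 of the 15 entries are zero.
x = [[1, 0, 0, 2, 0],
     [0, 0, 1, 0, 0],
     [0, 3, 0, 0, 0]]
sparsity = dtm_sparsity(x)
```

With real review corpora and bigram features, this fraction is typically very close to 1, which is why non-occurrence dominates the estimates.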
The results (considered high in terms of accuracy) presented in this article cannot be generalised to all types of documents written in the Polish language, because: (1) each type of data has its own specific way of expressing sentiment; (2) most document sentiment classification research is conducted on documents written in English, whereas Polish is an inflected language, which affects the DTM matrix and may add complexity to the way sentiment is expressed. All in all, it seems that more studies on documents in the Polish language are needed.