Sentiment Analysis in Czech

using deep learning

nlp
python
Published: November 27, 2020

Introduction

Sentiment analysis is a task where the goal is to classify text as having either positive, neutral or negative sentiment (sometimes the neutral class is omitted). Most NLP studies are conducted on English text. However, the rise of multilingual models allows us to easily perform such tasks in other languages as well. The goal of this project is therefore to perform sentiment analysis on a differently structured language - Czech.

In order to accomplish this goal, we will use a recent, best-performing multilingual deep learning model: XLM-RoBERTa, specifically its base variant. It has already been pretrained on 100 languages, including Czech, and shows promising fine-tuning results on downstream tasks, significantly better than multilingual BERT.

Terminology

  • Sentence - doesn’t necessarily mean a linguistic sentence. We use it as a synonym for a piece of text.

Data

We will perform the sentiment analysis on three Czech datasets: the CSFD dataset (film reviews), the Mall dataset (product reviews) and the Facebook dataset (posts). The CSFD dataset consists of 30 897 positive, 30 768 neutral and 29 716 negative reviews, the Mall dataset consists of 102 977 positive, 31 943 neutral and 10 387 negative reviews, and finally the Facebook dataset consists of 2 587 positive, 5 174 neutral and 1 991 negative posts.

Examples

The datasets have obviously not been cleaned by a human, since they contain a lot of errors. The sentiment labels were probably obtained automatically from the ratings (e.g. CSFD uses one to five stars, so a review with a high star rating would automatically be labeled as positive, etc.). However, the datasets are too large, so we won’t correct the labels manually.

The datasets also contain some sentences with unclear sentiment that are marked as non-neutral. This could be seen as an error in the data too, but I will mention it separately.

This kind of error probably arises when the sentiment cannot be inferred from the comment alone and the corresponding score would also need to be included. Since the score is not included in the data, the label appears incorrect. This seems to be mainly a problem of the CSFD dataset.

Model

As already stated, the model is the multilingual pretrained XLM-RoBERTa base model. On top of this model I added Flatten, Dropout and Dense (softmax) layers as the classification head. This model was chosen because it was the SOTA multilingual model at the time of writing this article. It shows a significant fine-tuning performance improvement over multilingual BERT, even reaching the performance of monolingual models.
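
A minimal sketch of this classification head, assuming TensorFlow/Keras and the Hugging Face transformers library; anything not named in the text (input names, optimizer, loss) is illustrative, not the exact original implementation:

```python
# Sketch of the classification head on top of XLM-RoBERTa base.
# Assumes TensorFlow/Keras and Hugging Face `transformers`; details beyond
# the Flatten/Dropout/Dense(softmax) layers named in the text are illustrative.
import tensorflow as tf
from transformers import TFXLMRobertaModel

MAX_LEN = 128      # token limit discussed later in the text
NUM_CLASSES = 3    # positive / neutral / negative

backbone = TFXLMRobertaModel.from_pretrained("xlm-roberta-base")

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Last hidden states: (batch, MAX_LEN, hidden_size)
hidden_states = backbone(input_ids, attention_mask=attention_mask)[0]

x = tf.keras.layers.Flatten()(hidden_states)
x = tf.keras.layers.Dropout(0.5)(x)   # dropout rate from the Experiments section
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # learning rate from the text
    loss="sparse_categorical_crossentropy",   # assumes integer-encoded labels
    metrics=["accuracy"],
)
```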

The large version of this model has 550M parameters, therefore only the base variant with 270M parameters was used.

XLM-RoBERTa expects its input to be tokenized before it is passed to the model. The proper tokenizer is therefore employed, but no further data preprocessing is necessary. Most preprocessing could even do more harm than good, since information the model could exploit would be lost in the process. Thanks to the subword tokenization, the model can handle unknown words and misspellings. The model can also handle cased text, since it was pretrained on such text.
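
A minimal tokenization sketch, assuming the Hugging Face tokenizer; the example sentence is one of the reviews shown later in this article:

```python
# Tokenization sketch using the XLM-RoBERTa tokenizer from `transformers`.
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

encoded = tokenizer(
    "Výborný produkt, jsem maximálně spokojen.",
    padding="max_length",
    truncation=True,
    max_length=128,          # token limit discussed below
    return_tensors="tf",
)
print(encoded["input_ids"].shape)       # (1, 128)
print(encoded["attention_mask"].shape)  # (1, 128)
```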

Due to memory limitations, only the first 128 tokens are taken. If we consider that each word consists on average of 2 tokens, then 64 words are used, which should be sufficient for most sentences. The batch size also has to be kept rather small for the same reason, hence a batch size of 16.

Experiments

The datasets were split randomly into three parts, train/val/test, with ratios 0.7/0.15/0.15. One additional dataset, called the All dataset, was created by combining the CSFD, Mall and Facebook datasets. A separate model was trained for every dataset: CSFD, Mall, Facebook and All.
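
A sketch of how such a split could be done, assuming a pandas DataFrame with text and label columns; the tiny inline DataFrame and the label encoding are placeholders for one of the real datasets:

```python
# 0.7/0.15/0.15 train/val/test split sketch; `df` stands in for a loaded dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "text": ["Výborný produkt.", "Nic moc.", "Hrozné.", "Super film.", "Průměr.", "Nuda."],
    "label": [2, 1, 0, 2, 1, 0],   # 0=negative, 1=neutral, 2=positive (assumed encoding)
})

# First carve off 30 %, then split that half-and-half into val and test.
train_df, rest_df = train_test_split(df, test_size=0.3, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)
print(len(train_df), len(val_df), len(test_df))
```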

The models were trained (fine-tuned) using the following parameters:

  • Learning rate - \(1e^{-5}\)
  • Dropout - \(0.5\)
  • Epochs - \(30\)

In order to save time, training was stopped early once the evaluation metrics had stagnated for 3 epochs. The following metrics were calculated on each dataset’s respective test set. Each metric is weighted by the class weights to account for the skewness caused by class imbalance.
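
A sketch of the early stopping and the weighted test-set metrics, assuming a Keras callback and scikit-learn; monitoring validation loss is an assumption, and the label/prediction arrays below are placeholders, not real results:

```python
# Early stopping after 3 stagnating epochs + class-weighted metrics.
import numpy as np
import tensorflow as tf
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",              # assumed stopping criterion
    patience=3,                      # 3 epochs without improvement
    restore_best_weights=True,
)
# model.fit(..., epochs=30, batch_size=16, callbacks=[early_stopping])

# Weighted metrics on a test set (placeholder labels and predictions).
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"   # weight each class by its frequency
)
print(f"acc={accuracy:.4f}  f1={f1:.4f}  precision={precision:.4f}  recall={recall:.4f}")
```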

Evaluation on the test set.
Dataset Accuracy F1 Precision Recall Epochs
CSFD 0.8354 0.8355 0.8599 0.8071 3
Mall 0.8525 0.8499 0.8556 0.8474 23
Facebook 0.8016 0.8017 0.8084 0.7934 5
All 0.8385 0.8367 0.8396 0.8371 27

Unlike in the previous attempt at semantic segmentation in Czech, the All model doesn’t sacrifice any accuracy. The size of the model allows it to generalize well even across three different types of text. I assume that even if it were fed more and more varied data, the performance would not suffer. This suggests that XLM-RoBERTa is indeed a good way to approach sentiment analysis.

We can look at the All model’s performance on particular sentences (a minimal inference sketch follows these examples).

  • Výborný produkt, jsem maximálně spokojen. (transl: Excellent product, I am extremely satisfied.)
    • The model managed to classify it correctly. This is a simple sentence and should be fairly easy, since it just needs to know that “výborný” and “spokojen” are positive words.
  • Jsem velmi spokojená, nejsou žádné mínusy. (transl: I am very satisfied, there are no downsides.)
    • The model managed to classify this correctly, too. This sentence is not that easy, since the word “mínusy” is in the sentence, which could lead to classifying it as a neutral or negative sentence.
  • Vydržel jsem u této sračky cca 15 minut, a celou dobu se zamýšlel nad tím proč některé filmy jsou o násilí a proč některé vyvolávají násilnické sklony i v tak mírumilovném člověku jako jsem já - tu slepici, kterou představovala ta slepice Barrymore, jsem totiž s každou ubíhající minutou filmu (až do těch 15) měl větší a větší chuť políbit…baseballovou pálkou. (transl: I lasted about 15 minutes with this crap, and the whole time I was wondering why some movies are about violence and why some evoke violent tendencies even in such a peaceful person as me - because with every passing minute of the film (up to those 15), I had a greater and greater urge to kiss the hen that that hen Barrymore played… with a baseball bat.)
    • This was correctly classified as negative.
  • Naprosto perfektní výtěr prdele všemi pláštěnkami v tom nejzábavnějším a nejoriginálnějším stylu, navíc s výborně zahranými postava, kde doslova ani jedna nenudí a všechny zaujmou a to převážně v čele a Karlem Urbanem, který je ztělesněním charismatu. V Hollywoodu jen těžko najdwte většího sympaťáka. Starr překvapil, je řádně ďábelský az slizký a Deep ukázal jasně, proč byl celé dekády (transl: An absolutely perfect ass-wiping of all the capes in the most entertaining and most original style, moreover with excellently acted characters, where literally not a single one is boring and all of them are engaging, above all led by Karel Urban, who is the embodiment of charisma. You will hardly find a more likeable guy in Hollywood. Starr surprised me, he is properly devilish and slimy, and Deep showed clearly why he has been for whole decades)
    • Despite containing rude words and being cut off (due to the 128-token limit), this sentence was correctly classified as positive.
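
A minimal sketch of how a single sentence could be classified with the fine-tuned model, reusing the tokenizer and model objects from the earlier sketches; the label order is an assumption:

```python
# Single-sentence inference sketch; `tokenizer` and `model` come from the
# sketches above, and the label order is assumed, not taken from the original code.
import numpy as np

labels = ["negative", "neutral", "positive"]

encoded = tokenizer(
    "Výborný produkt, jsem maximálně spokojen.",
    padding="max_length", truncation=True, max_length=128, return_tensors="tf",
)
probs = model.predict([encoded["input_ids"], encoded["attention_mask"]])[0]
print(labels[int(np.argmax(probs))])
```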

The model seems to be able to capture sentiment even in more complicated sentences, where it cannot be determined from the sentiment of the individual words alone (as e.g. Naive Bayes would do).
