CLASSIFICATION OF TWEETS USING MULTIPLE THRESHOLDS WITH SELF-CORRECTION AND WEIGHTED CONDITIONAL PROBABILITIES

  • Tariq Ahmad

Student thesis: PhD

Abstract

Emotion analysis aims to recognise emotions such as anger, joy and trust from texts. It is a trending topic because it can be applied in important areas such as marketing, healthcare and customer services. Current state-of-the-art solutions are based on supervised models trained on manually annotated examples, and manual annotation is subjective, expensive and time-consuming. This thesis explores the problem of multi-label emotion classification of tweets. The task is particularly difficult because tweets are noisy and may contain unstructured text, abbreviations, slang, acronyms, emoticons and incorrect grammar and spelling. Furthermore, even tweets that have none of these issues are usually short and often contain little context, which makes them difficult to work with.
To overcome some of these problems we propose a new type of corpus. We investigate strategies for linking news articles to create news-stories, linking tweets to create tweet-stories, and then linking the news-stories to the tweet-stories to create a corpus of linked tweets that contain emotion-bearing markers. We describe the process of collecting tweets and news articles, the annotation process and the problems therein, and show that a thematically-linked corpus aids the classification process.
Preprocessing is an important step in classification, but there is no standard set of steps. We therefore analyse a number of preprocessing steps, evaluating each to establish its contribution, and form the best combination of steps to carry forward into later experiments. We consider both Arabic and English tweets. Whilst there are well-established Natural Language Processing (NLP) tools for English, the same is not true for Arabic. As part of this work we also evaluate a new Arabic tagger specifically for tweets, and a stemmer, and compare the results to other methods.
The major contribution of this thesis is a new type of classifier based on conditional probabilities, which are used to build a lexicon of scores that indicate the importance of a word to a specific emotion. We show that incorporating automatic self-correction, which removes words that are unhelpful for an emotion, and calculating an individual threshold for each emotion both improve classifier performance. To the best of our knowledge, this is the first time these ideas have been explored. The results of this classifier, named CENTEMENT, are compared to other common algorithms such as k-nearest neighbours (KNN), support vector machines (SVM), and two different configurations of neural networks. We also evaluate the classifier on a number of other datasets and demonstrate that it is robust and performs consistently well. The results are encouraging: our approach led to appreciably better performance than currently established classifiers and also many of the latest state-of-the-art classifiers. To further test the robustness of the classifier, it was entered into the worldwide emotion-classification competition, SemEval-2018, where it came second (out of thirteen) classifying Arabic tweets and twelfth (out of thirty-four) classifying English tweets.
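The general idea of a conditional-probability lexicon with one threshold per emotion can be illustrated with a minimal sketch. This is not the CENTEMENT implementation: the function names, the toy data, the use of an unweighted P(emotion | word), the mean-score rule and the F1 grid search are all assumptions made for illustration, and the weighting and self-correction steps described in the abstract are omitted.

```python
from collections import Counter, defaultdict

def build_lexicon(tweets):
    """Return {emotion: {word: P(emotion | word)}} from (tokens, emotions) pairs.

    Assumption: plain relative frequencies; the thesis uses weighted scores.
    """
    word_counts = Counter()
    joint_counts = defaultdict(Counter)
    for tokens, emotions in tweets:
        for word in set(tokens):  # count each word once per tweet
            word_counts[word] += 1
            for emotion in emotions:
                joint_counts[emotion][word] += 1
    return {
        emotion: {w: c / word_counts[w] for w, c in counts.items()}
        for emotion, counts in joint_counts.items()
    }

def score(tokens, lexicon, emotion):
    """Mean lexicon score of a tweet's tokens for one emotion (assumed rule)."""
    if not tokens:
        return 0.0
    table = lexicon.get(emotion, {})
    return sum(table.get(w, 0.0) for w in tokens) / len(tokens)

def f1(predicted, gold):
    """F1 over two parallel lists of booleans."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def fit_thresholds(tweets, lexicon):
    """Pick, for each emotion separately, the threshold with the best F1."""
    candidates = [i / 20 for i in range(1, 20)]  # hypothetical search grid
    thresholds = {}
    for emotion in lexicon:
        gold = [emotion in emotions for _, emotions in tweets]
        scores = [score(tokens, lexicon, emotion) for tokens, _ in tweets]
        thresholds[emotion] = max(
            candidates, key=lambda t: f1([s >= t for s in scores], gold)
        )
    return thresholds

def classify(tokens, lexicon, thresholds):
    """Multi-label prediction: every emotion whose score reaches its threshold."""
    return {e for e, t in thresholds.items() if score(tokens, lexicon, e) >= t}

# Toy training data, invented for this example only.
TRAIN = [
    (["i", "love", "this"], {"joy"}),
    (["love", "and", "trust"], {"joy", "trust"}),
    (["i", "hate", "this"], {"anger"}),
    (["hate", "and", "rage"], {"anger"}),
]
lexicon = build_lexicon(TRAIN)
thresholds = fit_thresholds(TRAIN, lexicon)
print(classify(["love", "trust"], lexicon, thresholds))
```

Because each emotion gets its own threshold, a tweet can receive zero, one or several labels, which is what makes the scheme multi-label rather than a single arg-max decision.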
Date of Award: 1 Aug 2020
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: Allan Ramsay

Keywords

  • multi-emotion classification
  • emotion analysis
