This thesis explores the task of analysing the linguistic structure of Arabic tweets. Arabic tweets raise many challenges that make Natural Language Processing (NLP) tasks difficult: they present the same linguistic issues as any other language, together with problems specific to the genre. Tweets are difficult to process because they do not always follow formal grammar or correct spelling, and abbreviations are often used to stay within length restrictions. Arabic tweets also exhibit linguistic phenomena such as the use of different dialects, Romanised Arabic and borrowed foreign words. These characteristics of the microblogging genre make NLP tasks on Twitter very different from their counterparts on more formal texts. Most NLP systems include several early stages, such as tagging, stemming and parsing, that may need to be redesigned to take the characteristics of tweets into account before their important linguistic features can be extracted. To meet this need, three of the most fundamental parts of the linguistic pipeline, namely part-of-speech (POS) tagging, stemming and parsing, have been revisited for Arabic tweets. To the best of our knowledge, this is the first attempt to carry out this task for Arabic tweets. We investigate the challenges of processing Arabic tweets, studying a number of standard Arabic processing tools and highlighting their limitations when applied to Arabic tweets. We make three state-of-the-art POS taggers for Modern Standard Arabic (MSA) robust to noise when applied to Arabic tweets. We develop the first fast and robust POS tagger for Arabic tweets and create the first POS-tagged corpus of Arabic tweets.
We also develop two approaches to stemming Arabic tweet words, a heavy stemmer and a light stemmer, and we find that the light stemmer is the more suitable of the two because it does not use dictionaries, is fast, and yields greater accuracy than the heavy stemmer and MSA stemmers. We automatically create the first dependency treebank from unlabelled tweets using two approaches: a rule-based parser alone, and a rule-based parser combined with a data-driven parser in a bootstrapping technique. We then train a data-driven base parsing model on the resulting treebank data to parse Arabic tweets. The findings are encouraging. We improve POS tagging accuracy on Arabic tweets from 49% to 74.0%. Experimental results show that the light stemmer achieves 77.9% accuracy, outperforming three well-known stemmers for Arabic. Our parser reaches 71.0% accuracy, which is better than the performance of French parsing on social media data and not far behind English parsing of tweets.
Date of Award: 3 Jan 2018
Institution: The University of Manchester
Supervisors: Allan Ramsay and Liping Zhao
Keywords: POS tagging; Arabic tweets