(University of Hamburg)
A Swahili Twitter corpus
This paper presents the results of a small-scale study that aimed to build a Swahili Twitter corpus using a modified version of Twython, the Twitter-for-linguists script by Scheffler (2013). It will be shown that one major component, the language identification module (LangID; Lui & Baldwin 2012), does not meet the demands of identifying Swahili texts retrieved from the Twitter API. It will be suggested that the byte n-gram-based method of LangID reaches its limits under two conditions: a) when the training data for a given language or style are scarce and LangID has not been systematically tested for languages like Swahili; and b) most likely when texts are short, so that the data are insufficient for a reliable calculation. As a consequence, Swahili text is often misclassified as, for example, Spanish, Indonesian or other languages that exhibit similar phonotactic properties. This presentation will discuss an alternative approach, which is based on morpho-lexical patterns of Swahili and uses a large pre-defined indexing dictionary. It will be shown that this approach helps to collect data more reliably (accuracy: 83–88%). The findings call attention to the potentially overlooked significance of indexing “graph units” for the purpose of language identification, especially for strongly agglutinating languages.
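The dictionary-and-pattern idea described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the tiny lexicon, the single verb template (subject prefix + tense marker + stem), the token regex and the 0.5 threshold are all assumptions chosen for illustration, whereas the study relied on a large pre-defined indexing dictionary.

```python
import re

# Hypothetical mini-lexicon standing in for the large indexing dictionary.
SWAHILI_LEXICON = {
    "na", "ya", "wa", "kwa", "ni", "habari", "leo",
    "sana", "asante", "rafiki", "nyumbani", "kesho",
}

# Illustrative morphological template (an assumption, far from complete):
# subject prefix + tense marker + stem of 2+ letters + final vowel.
VERB_PATTERN = re.compile(r"^(ni|u|a|tu|m|wa)(na|li|ta|me)[a-z]{2,}[aei]$")

# Tweet-specific noise that should not count as words.
NOISE = re.compile(r"https?://\S+|[@#]\w+")
TOKEN = re.compile(r"[a-z']+")


def swahili_score(text):
    """Return the share of tokens matched by the lexicon or the verb template."""
    cleaned = NOISE.sub(" ", text.lower())
    tokens = TOKEN.findall(cleaned)
    if not tokens:
        return 0.0
    hits = sum(
        1 for tok in tokens
        if tok in SWAHILI_LEXICON or VERB_PATTERN.match(tok)
    )
    return hits / len(tokens)


def is_swahili(text, threshold=0.5):
    """Accept a tweet when enough of its tokens look like Swahili."""
    return swahili_score(text) >= threshold
```

Because the score is computed per token rather than over byte n-grams, a short tweet with a few recognizable word forms can still be accepted, and a code-switched tweet such as "Leo tunaenda shopping" passes the threshold even though one token is English.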
Moreover, data from the resulting Swahili Twitter corpus reveal that many Swahili tweets are code-switched with English. The data therefore also provide valuable material for the analysis of code-switching, and the approach offers a strategy for retrieving such data in addition to user comments on newspaper websites (Cotterell et al. 2014), transcriptions of interviews (Lyu et al. 2015) or dump downloads from Twitter (Çetinoğlu 2016). A further evaluation showed that combined automatic POS tagging with TreeTagger (Schmid 1994) can trace code-switching in the Swahili Twitter corpus with good accuracy (89%). However, owing to the imprecise treatment of proper nouns, spelling variation (in both English and Swahili) and the anonymization of Twitter user information, classifying code-switching as intraphrasal or interphrasal still requires some manual correction (accuracy of automatic processing: 68–74%). Among the detected instances of Swahili-English code-switching in the Twitter corpus, intraphrasal code-switching is the predominant type, even when only those cases that involve at least three words are considered.
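The distinction between intraphrasal and interphrasal code-switching can be sketched as follows. The sketch assumes that each tweet has already been segmented into phrases and that every token carries a language tag ("sw", "en" or "other"). The abstract does not specify how phrases are delimited or how tags are derived from the POS-tagger output, so the input format and the function names are hypothetical, and the three-word condition mentioned above is not modelled.

```python
def phrase_languages(phrase):
    """Collect the languages found in one phrase, ignoring 'other' tokens
    (e.g. proper nouns or anonymized user names)."""
    return {lang for _, lang in phrase if lang in ("sw", "en")}


def classify_switching(phrases):
    """Label a tweet given as a list of phrases, each a list of
    (token, language) pairs. Returns a set of labels, which is empty
    when the tweet is monolingual."""
    labels = set()
    per_phrase = [phrase_languages(p) for p in phrases]

    # Intraphrasal: both languages occur inside a single phrase.
    if any(len(langs) > 1 for langs in per_phrase):
        labels.add("intraphrasal")

    # Interphrasal: two neighbouring monolingual phrases differ in language.
    mono = [next(iter(langs)) for langs in per_phrase if len(langs) == 1]
    if any(a != b for a, b in zip(mono, mono[1:])):
        labels.add("interphrasal")

    return labels
```

For example, a single phrase such as `[("nimefika", "sw"), ("home", "en")]` is labelled intraphrasal, whereas a Swahili phrase followed by an English phrase is labelled interphrasal. Mis-tagged proper nouns or spelling variants would change the tags fed into such a function, which is consistent with the need for manual correction noted above.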
Cotterell, Ryan et al. 2014. An Algerian Arabic-French Code-Switched Corpus. In LREC Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools.
Çetinoğlu, Özlem 2016. A Turkish-German Code-Switching Corpus. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference, Portorož, Slovenia.
Lui, Marco & Timothy Baldwin 2012. langid.py: An Off-the-shelf Language Identification Tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea.
Lyu, Dau-Cheng et al. 2015. Mandarin-English code-switching speech corpus in South-East Asia: SEAME. Language Resources and Evaluation 49: 581–600.
Scheffler, Tatjana 2013. Erstellung eines deutschen Twitterkorpus. (German) In DGfS-CL Postersession, 35. Tagung der Deutschen Gesellschaft für Sprachwissenschaft, 14.3.2013, Potsdam.
Schmid, Helmut 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of International Conference on New Methods in Language Processing,