Henning Schreiber

(University of Hamburg)

Processing constructions: corpus-based approaches to linguistic analysis in language documentation

Limited grammatical description and missing data are the most frequent obstacles to typological analysis. Particularly for less studied linguistic phenomena, grammars only occasionally provide exactly the information and data of interest, sometimes because of the authors' strongly theoretical orientation, sometimes because of terminology or the authors' very particular view of what grammar is. While Comrie and Smith (1977) and the accompanying grammar series addressed this issue by proposing a comprehensive typological questionnaire, Himmelmann (1998) and others developed the corpus-oriented language documentation approach. The rationale behind language documentation is that extensive annotated collections of primary data from different text and discourse genres, with strong reference to ethnographic description, are more valuable and sustainable sources, in the sense that such corpus data allow various analyses independent of the collectors' interests and of current trends in linguistic analysis. Though language description has also shifted from elicitation and grammaticality judgments towards more frequency-based analysis, the fundamental condition of all research is sound linguistic data annotation. In the language documentation framework, linguistic analysis itself thus appears above all as annotated data.

The development of computer-aided techniques for data annotation, such as ELAN, was also the result of a demand for effective corpus-oriented methods in this framework. Yet corpus linguistics and language description still differ considerably in orientation and method. Corpus linguistics is concerned not with the analysis of language-specific features but with generic methods of processing natural language data. The training of statistical models for automatic part-of-speech taggers, for example, can hardly be applied to small-scale corpora. Moreover, the outcome, the statistical language model, is of little use for linguistic analysis and description. Nevertheless, since language documentation generally faces the challenge of describing as much data as possible with comparatively few resources in a limited period of time, effective computational bootstrapping methods are needed. More flexible approaches have recently been developed with NLTK and LingPy, and scripting languages such as Perl and Python have long been suited to rule-based data processing.

We will briefly present our processing pipeline for the Kyanga/Shanga documentation project, which is based on regular expression rules implemented in Python. It will be shown that rule-based approaches can be used at the same time for the semi-automatic annotation of data, for modelling speaker-internal variation, and for phonological analysis. We will also show how annotated data serve the analysis of grammatical constructions.
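As a rough illustration of what rule-based annotation with regular expressions can look like in Python, consider the following minimal sketch. It is not the project's actual pipeline: the rule list, the gloss labels, the function names, and all word forms are invented placeholders and do not represent Kyanga or Shanga data. Each rule pairs a pattern with a gloss template; the character class in the second rule treats two hypothetical prefix variants as realisations of the same morpheme, which is one simple way a rule can encode variation.

```python
import re

# Hypothetical rules: (compiled pattern, gloss template).
# All forms and glosses are invented placeholders, NOT Kyanga/Shanga data.
RULES = [
    # stem followed by an assumed plural suffix "-ra"
    (re.compile(r"^(?P<stem>\w+)-ra$"), "{stem}-PL"),
    # assumed 1SG prefix with two variants, "na-" and "ne-"
    (re.compile(r"^n[ae]-(?P<stem>\w+)$"), "1SG-{stem}"),
]


def annotate(token):
    """Return (gloss, rule_index) for the first matching rule.

    Tokens that no rule covers yield (None, None), so that they can be
    passed on to a human annotator (the 'semi-automatic' part).
    """
    for index, (pattern, template) in enumerate(RULES):
        match = pattern.match(token)
        if match:
            return template.format(**match.groupdict()), index
    return None, None


def annotate_line(line):
    """Annotate every whitespace-separated token of a transcribed line."""
    return [(token, *annotate(token)) for token in line.split()]


if __name__ == "__main__":
    for token, gloss, rule in annotate_line("na-kuta ne-kuta kuta-ra bo"):
        print(token, gloss, rule)
```

Recording the index of the rule that fired keeps each automatic gloss traceable to the rule behind it, so a wrong analysis can be corrected by revising one rule and re-running the script over the corpus.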


Comrie, Bernard, and Norval Smith. 1977. Lingua descriptive studies: questionnaire. Lingua 42:1-72.

Himmelmann, Nikolaus P. 1998. Documentary and descriptive linguistics. Linguistics 36:161-195.