Annotation in Corpus Linguistics

Apart from the pure text, a corpus can also be provided with additional linguistic information called annotation. The most common type of annotated corpus is the grammatically tagged corpus, in which each word has been assigned a word class label (part-of-speech tag).

How to annotate a corpus?

First we will discuss the importance of associating metadata (data about data) with corpus files. We will also explore how to insert other types of information into a corpus as linguistic annotations. Annotating a corpus adds value to it by widening the range of questions that it can be used to investigate. We will also look at how to check the reliability of annotations.

Corpus annotations

Raw data collected in a corpus are not always adequate for answering research questions, because they can give biased results. For example, if we count the occurrences of the French nouns ferme and car in raw data, the result will be biased, because these words have other uses apart from being nouns: ferme is also a conjugated form of the verb fermer, and car is also a coordinating conjunction.
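The bias described above can be illustrated with a minimal sketch. The token list and the tag names (NOUN, VERB, CONJ, etc.) are invented for this example and do not come from a real tagger; the point is only that a raw count and a tag-aware count differ for a word like ferme.

```python
from collections import Counter

# Hypothetical hand-tagged tokens (word, tag); the tag names are assumptions.
tagged_tokens = [
    ("la", "DET"), ("ferme", "NOUN"), ("est", "VERB"), ("grande", "ADJ"),
    ("il", "PRON"), ("ferme", "VERB"), ("la", "DET"), ("porte", "NOUN"),
    ("car", "CONJ"), ("il", "PRON"), ("part", "VERB"),
    ("le", "DET"), ("car", "NOUN"), ("arrive", "VERB"),
]

def count_by_tag(tokens, word):
    """Count how often `word` occurs with each part-of-speech tag."""
    return Counter(tag for form, tag in tokens if form == word)

# Raw count: every occurrence of the string, whatever its word class.
raw_count = sum(1 for form, _ in tagged_tokens if form == "ferme")
# Tag-aware count: only the occurrences annotated as nouns.
noun_count = count_by_tag(tagged_tokens, "ferme")["NOUN"]
print(raw_count, noun_count)  # 2 1
```

With annotation, the verb use of ferme and the conjunction use of car can simply be filtered out of a study on nouns.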

The great advantage of part-of-speech tagging is that it can be done automatically with almost the accuracy of manual annotation, regardless of the amount of text to be annotated. In fact, part-of-speech tagging was applied to the Google Books corpus, which contains billions of words from different languages, and this has made it possible to refine research on language evolution. For example, Lin et al. (2012) found that the regular past participle burned replaced the irregular form burnt in verbal uses several decades before the change affected its adjectival uses, thus suggesting that verbs played a special role in this evolution.

The basic drawback of annotating a raw corpus is that the corpus becomes less flexible and less readable. In a grammatically tagged corpus, every word carries a code for its grammatical category, for example "manage_V3PP" to indicate a verb in the third person of the present, and the corpus becomes difficult to read. To avoid this problem, it is important for the annotation to remain separate from the corpus, so that the corpus is always accessible as plain text.
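One way to keep annotation separate from the text is to store it as character offsets pointing into the untouched plain text (often called stand-off annotation). The following is a minimal sketch under that assumption; the sentence and all tags except V3PP are made up for illustration.

```python
text = "They manage the farm."

# Stand-off annotations: (start offset, end offset, tag). The raw text is untouched.
# The tag names other than V3PP are invented for this example.
annotations = [
    (0, 4, "PRON"),
    (5, 11, "V3PP"),
    (12, 15, "DET"),
    (16, 20, "N"),
]

def inline_view(text, annotations):
    """Build a word_TAG view on demand from the plain text and the offsets."""
    return " ".join(f"{text[start:end]}_{tag}" for start, end, tag in annotations)

print(text)                            # the plain text stays readable
print(inline_view(text, annotations))  # They_PRON manage_V3PP the_DET farm_N
```

The tagged view is generated only when needed, so the same text can carry several independent annotation layers.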

Finally, the main objection is that annotation is subjective. It implies making choices about the categories to be annotated, and these choices are never completely neutral. Furthermore, annotation has a hermeneutic dimension: it reflects the annotators' point of view about the text.

To better illustrate the term annotation, let us take the example of language acquisition. Kyratzis (1990) stated that children's utterances containing causal connectives often occur in subjective form, in question-answer exchanges. Take the example of Lea, a 2-year-old in New York:

Grandmother: why that?

Lea: because I felt like that.

Evers-Vermeul and Sanders (2011) as well as Zufferey (2010) found that children first produce objective causal relations, which describe facts or events in the world. Take the example of Lea again: an objective relation occurs at 2 years and 10 months, and a subjective relation occurs at 3 years and 10 months.

Lea: I’m taking my bottle because I’m thirsty.

Lea: come because it is getting late.

So the sequence is:

Question-answer based relations → objective causal relations → subjective causal relations.

Different types of annotations

Phonological annotations

These annotations concern the transcription of spoken corpora. They indicate prosodic phenomena such as pauses, hesitations and prosodic parsing. They also support the study of notions such as fluency or the interface between syntax and discourse.

Morphological annotations

These annotations deal with word forms. Words are the linguistic elements most often subjected to annotation in a corpus. Two operations are involved: tokenization and lemmatization.

Tokenization includes punctuation identification, elision processing (in French), or identification of numbers and dates.

Lemmatization refers to the act of associating every word occurrence in a corpus with its basic morphological form. For example, the French adjectives gentil and gentille are the masculine and feminine variants of the same word from a morphological point of view. This canonical form of a word is called the lemma. Similarly, mouse and mice are forms of the lemma mouse, and eat, eating and eaten are conjugated forms of the lemma eat. Lemmatization is one of the types of annotation that can be done automatically with considerable accuracy. A lemma is different from a lexeme.
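The two operations can be sketched as follows. This is a toy illustration, not a real tool: the regular expression only handles simple cases (it splits after an apostrophe, as in French elision, and isolates punctuation), and the lemma lexicon is a hand-written assumption limited to the examples in the text.

```python
import re

# Toy lemma lexicon; a real lemmatizer relies on a full dictionary or a trained model.
LEMMAS = {"mice": "mouse", "eating": "eat", "eaten": "eat", "gentille": "gentil"}

def tokenize(text):
    """Split text into elided forms (l'), words or numbers, and punctuation marks."""
    return re.findall(r"\w+'|\w+|[^\w\s]", text)

def lemmatize(token):
    """Return the lemma of a token, falling back to the lowercased form itself."""
    lower = token.lower()
    return LEMMAS.get(lower, lower)

tokens = tokenize("L'enfant est gentille.")
print(tokens)                          # ["L'", 'enfant', 'est', 'gentille', '.']
print([lemmatize(t) for t in tokens])  # ["l'", 'enfant', 'est', 'gentil', '.']
```

The fallback to the lowercased form is a simplification: unknown words are treated as their own lemma, which a real lemmatizer would handle with morphological rules.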

Lexical annotation

These annotations deal with lexical items (lexemes). For example, bat is a lemma which corresponds to two different lexemes, as this word is polysemous and may refer either to a flying mammal or to an object.

Semantic annotations

These annotations deal with word meaning. Words can be assigned to semantic categories. For example, tennis can be associated with sports. Such annotation also provides training and testing data for automatic word sense disambiguation.

Syntactic annotations

These annotations deal with the structure of sentences. The most common annotation is syntactic parsing. Because of their tree representation, parsed corpora are often called "treebanks". Taylor (2003) describes syntactically annotated English-language corpora produced at the University of Pennsylvania, including annotations of 3 million words from a corpus of 7 million words from different genres. This type of annotation was also carried out for French by Anne Abeillé's team in Paris on approximately 1 million words from the Le Monde corpus (Abeillé et al. 2003). As an example, take the sentence "The teacher congratulates the student."

Tree diagrams cannot be inserted directly into a corpus, so brackets or parentheses are used to delimit phrases instead:

[[[The]_Det [teacher]_N]_SN [[congratulates]_V [[the]_Det [student]_N]_SN]_SV]_Ph
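To make the link between a tree and its bracketed form concrete, here is a small sketch that stores the example sentence as a nested structure and renders it as labelled brackets. The data structure (a tuple of label and children) is an assumption chosen for illustration; treebanks use their own file formats.

```python
# A constituent is (label, list of children); a leaf is (label, word).
tree = ("Ph", [
    ("SN", [("Det", "The"), ("N", "teacher")]),
    ("SV", [
        ("V", "congratulates"),
        ("SN", [("Det", "the"), ("N", "student")]),
    ]),
])

def to_brackets(node):
    """Render a tree as labelled brackets of the form [content]_Label."""
    label, content = node
    if isinstance(content, str):
        return f"[{content}]_{label}"
    inner = " ".join(to_brackets(child) for child in content)
    return f"[{inner}]_{label}"

print(to_brackets(tree))
# [[[The]_Det [teacher]_N]_SN [[congratulates]_V [[the]_Det [student]_N]_SN]_SV]_Ph
```

Each closing bracket carries the label of the constituent it closes, so the nesting of the brackets mirrors the depth of the tree.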

Syntactic annotation can also show the dependency relations between words or syntactic constituents. Dependency analysis was performed on Google Books (Lin et al. 2012). Lin et al. (2012) report a 97.3% accuracy rate for part-of-speech tagging against only 84.7% for dependency analysis. Sentences can also be analyzed from the point of view of the semantic relations between their constituents, as well as the thematic role of each of them. These roles include the agent, the patient and the cause. Finally, sentences may carry a pragmatic annotation of the speech act involved (e.g. a question, a request or a confirmation).

Standardization of annotation schemes

Technical committee no. 37 of the International Organization for Standardization (ISO), dedicated to language resources management, has produced around 20 draft standards for linguistic annotation. Bunt (2012), for example, developed an ISO standard for dialogue act annotation. In many cases, however, standards are established de facto. Prasad (2008) developed a new version of the Penn Discourse Treebank. Each word belongs to a grammatical category and every sentence communicates a speech act. Crible (2018) annotated the use of discourse markers in comparable French and English corpora.

The stages of the annotation process

Tags are one of the basic components of the annotation process. The annotation of speech acts requires setting up a list of the acts to be annotated, for example requesting, promising, asserting, threatening, etc. Similarly, a semantic annotation of verb types could differentiate their aspects (state or event verbs). Finally, an annotation of the grammatical categories associated with each word requires establishing a list of these categories. The tag set should be balanced: unwanted amalgamation of categories must be avoided, but some simplification is necessary.

Before annotating, the annotator must define each category of annotation. For example, when annotating speech acts in a corpus, a tag like question could be interpreted very differently by different annotators if no explanation is added. Consider three sentences:

Who will come to the party? (Direct Interrogative)
I wonder who will come to the party. (Indirect Interrogative)
I am dying to know the guest list. (Assertive Declarative)

To avoid these problems, a set of tags should be accompanied by a list of criteria specifying in which context each tag should be used. Once the tags have been defined, the corpus processing phase can begin. The first step is to identify which occurrences will be annotated in the corpus. For some phenomena, such as the annotation of morphosyntactic categories or speech acts, every word or sentence in the corpus will be involved.

Whatever the strategy used for retrieving data, the results obtained will in most cases require validation and manual sorting in order to eliminate irrelevant occurrences. Take the discourse markers you know and I mean: a search will also return occurrences in which these words are not used as discourse markers. For example:

“You ought to show me the door today: but I don’t believe you know!”
“Perhaps it means just what I mean when I want to shout out that I am grateful to the Magic”

To eliminate these occurrences, the data must be assessed and sorted manually. Once the corpus to be annotated contains only the relevant occurrences, the annotation process itself can begin. Whatever the annotation considered and the tag set chosen, the annotation of the first occurrences is generally difficult, and many problems and borderline cases arise.
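The retrieval step that precedes manual sorting can be sketched with a simple pattern search. This is only an illustration: the second sentence in the sample is invented, the function name is hypothetical, and the search deliberately returns every match, since deciding which hits are real discourse markers remains a manual task.

```python
import re

sentences = [
    "You ought to show me the door today: but I don't believe you know!",
    "It was, you know, a long day.",  # invented example of a true discourse marker
    "Perhaps it means just what I mean when I want to shout out.",
]

def find_candidates(sentences, marker, width=20):
    """Return (sentence index, left context, match, right context) for each hit."""
    pattern = re.compile(r"\b" + re.escape(marker) + r"\b", re.IGNORECASE)
    hits = []
    for i, sentence in enumerate(sentences):
        for m in pattern.finditer(sentence):
            left = sentence[max(0, m.start() - width):m.start()]
            right = sentence[m.end():m.end() + width]
            hits.append((i, left, m.group(), right))
    return hits

# Every hit is only a candidate; a human still has to keep or reject each one.
for hit in find_candidates(sentences, "you know"):
    print(hit)
```

Showing a little context on each side of the match, as a concordancer does, is what makes the manual decision possible.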

The stages of the annotation process are:

Stage 1: Theoretical definition of the categories depending on the literature.
Stage 2: Annotation of a corpus data sample.
Stage 3: Category refining from data.
Stage 4: Annotation of the whole corpus, based on the refined categories.

Annotation tools

On the one hand, there are tools that make it possible to carry out annotation automatically, by means of part-of-speech tagging or parsing. On the other hand, there are tools that facilitate the manual annotation of data by providing an interface for performing the annotations as well as a format for representing them. Part-of-speech tagging is so useful for corpus studies that this annotation is now directly embedded into some corpus creation tools. POS tag sets vary from language to language, and the tags can be searched with a concordancer.

Sketch Engine uses CQL (Corpus Query Language) for searching by grammatical category or lemma. For example, to search for the lemma acteur followed by an adjective, the query is [lemma="acteur"][tag="ADJ"]. Brat is an online tool that makes it possible to annotate not only entities in a corpus but also the relations between them. Another example is the EXMARaLDA tool, which has been specifically developed to assist in the transcription and annotation of spoken corpora.
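To give an idea of what a query such as [lemma="acteur"][tag="ADJ"] does, the following sketch matches a sequence of attribute constraints against a list of tagged tokens. It is not Sketch Engine's implementation: the token data are hand-assigned, the function is hypothetical, and only exact matching is supported, whereas real CQL also accepts regular expressions.

```python
# Each token is a dict of attributes; the tags and lemmas are hand-assigned assumptions.
tokens = [
    {"word": "un", "lemma": "un", "tag": "DET"},
    {"word": "acteur", "lemma": "acteur", "tag": "NOUN"},
    {"word": "célèbre", "lemma": "célèbre", "tag": "ADJ"},
    {"word": "et", "lemma": "et", "tag": "CONJ"},
    {"word": "les", "lemma": "le", "tag": "DET"},
    {"word": "acteurs", "lemma": "acteur", "tag": "NOUN"},
    {"word": "connus", "lemma": "connu", "tag": "ADJ"},
]

# Counterpart of [lemma="acteur"][tag="ADJ"]: one constraint dict per token position.
query = [{"lemma": "acteur"}, {"tag": "ADJ"}]

def search(tokens, query):
    """Return the word sequences whose consecutive tokens satisfy each constraint."""
    results = []
    for i in range(len(tokens) - len(query) + 1):
        window = tokens[i:i + len(query)]
        if all(tok.get(key) == value
               for tok, cond in zip(window, query)
               for key, value in cond.items()):
            results.append(" ".join(tok["word"] for tok in window))
    return results

print(search(tokens, query))  # ['acteur célèbre', 'acteurs connus']
```

Because the query targets the lemma rather than the word form, both the singular acteur and the plural acteurs are retrieved.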

One solution would be to retrieve the relevant data from the corpus, and then annotate them separately.

Measuring the quality and reliability of an annotation

In terms of quality, we check the accuracy of the annotation using two measures: recall and precision. Recall measures the proportion of the occurrences of a category present in the corpus that were correctly found by the system. Precision, on the other hand, measures the proportion of occurrences properly tagged (for example, as nouns) among all the ones the system tagged with that category. The two measures are then combined by computing their harmonic mean. Overall, POS tagging reaches a 95% accuracy level.
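As a worked illustration, precision, recall and their harmonic mean (usually called the F-measure or F1) can be computed for one category from a reference annotation and a system output. The two tag lists below are invented for the example, and the function name is hypothetical.

```python
def precision_recall_f1(gold, predicted, category):
    """Compute precision, recall and F1 for one category from two aligned tag lists."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == category and p == category)
    predicted_pos = sum(1 for p in predicted if p == category)  # tagged by the system
    gold_pos = sum(1 for g in gold if g == category)            # present in the reference
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented reference (gold) tags and system output for six tokens.
gold      = ["NOUN", "VERB", "NOUN", "CONJ", "NOUN", "NOUN"]
predicted = ["NOUN", "NOUN", "NOUN", "CONJ", "VERB", "VERB"]

p, r, f = precision_recall_f1(gold, predicted, "NOUN")
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.5 0.571
```

Here the system tagged three words as nouns and two of them were right (precision 2/3), but it found only two of the four real nouns (recall 1/2).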

Human annotation requires a certain amount of interpretation, which may vary from annotator to annotator, and humans make mistakes due to fatigue, lack of attention or a poor understanding of the annotation instructions.

The reliability of an annotated corpus is very important: it determines how far the annotations can be trusted. The basic solution is to compare the work of two annotators.

To assess the reliability of annotations, we can use the kappa coefficient (Cohen 1960). It measures the agreement between two annotators on a binary classification, taking into account an estimate of the probability of agreement which can be obtained by chance.

$$K = \frac{P(O) - P(E)}{1 - P(E)}$$

P(O) corresponds to the proportion of agreement observed during the classification task and P(E) corresponds to the statistical expectation of agreement. The value of P(O) is obtained by dividing the number of matching responses by the total number of responses. The value of P(E) is obtained by estimating the probability of concordant classifications when the proportion of each class is fixed to the value observed for each annotator. The value of the kappa coefficient ranges from -1 to 1. It can be calculated automatically with online statistical software such as VassarStats. Krippendorff (1980) states that a result greater than or equal to 0.8 reflects reliable agreement, whereas a result between 0.67 and 0.8 makes it possible to establish the probable presence of agreement.
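The calculation can be followed step by step in a short sketch. The two label lists are invented (ten items classified as objective or subjective), and the function is a plain implementation of the definitions of P(O) and P(E) given above, not the code of any particular statistics package.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # P(O): matching responses divided by the total number of responses.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P(E): chance agreement, from each annotator's observed class proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b))
    if p_e == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_o - p_e) / (1 - p_e)

# Invented annotations of ten causal relations by two annotators.
annotator_1 = ["obj", "obj", "subj", "subj", "obj", "subj", "obj", "obj", "subj", "obj"]
annotator_2 = ["obj", "obj", "subj", "obj", "obj", "subj", "obj", "subj", "subj", "obj"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))  # 0.583
```

In this example the annotators agree on 8 items out of 10, so P(O) = 0.8, while P(E) = 0.6 × 0.6 + 0.4 × 0.4 = 0.52, which gives a kappa of about 0.58, below the 0.67 threshold mentioned above.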

Sharing your annotations

Annotators should share their annotations for further use, together with an annotation manual that allows other researchers to understand and reuse them. The manual should contain two things. First, it should give a list of the tags used with their definitions, in the manner of a mini-glossary. For example, for the annotation of speech acts or grammatical categories, we might use:

Req_dir: direct request

Req_ind: indirect request

Que_dir: direct question

ADJM1: masculine singular adjective

ADJM2: masculine plural adjective

ADJF1: feminine singular adjective

Second, annotation must be done systematically, and the manual should specify the manner in which the corpus has been segmented into words, sentences, utterances or discourse segments. Finally, when automatic processing tools have been used for preparing and annotating the data in the corpus, they must be clearly indicated.

Conclusion

We first reviewed the different types of annotation, ranging from phonemes to discourse relations. We then detailed the different stages that make up an annotation process and stressed the importance of sound methodological practices, so that the annotation is as valid and reusable as possible. Then we discussed tools that make it possible to produce annotations automatically. We discussed the quality and reliability of annotations and the methods used to measure them. Finally, we presented some recommendations for sharing annotations.
