University of Nijmegen, Department of English - University of Nijmegen, Department of Language and Speech - University of Nijmegen
The Feasibility of Incremental Linguistic Annotation
Hans van Halteren
University of Nijmegen
Keywords: linguistic annotation, corpus linguistics, syntactic analysis
This paper examines the feasibility of incremental annotation, i.e. using existing annotation on a text as the basis for further annotation rather than starting the new annotation from scratch. It contains a theoretical component, describing basic methodology and potential obstacles, as well as a practical component, describing experimental use of incremental annotation.
In both the linguistic and the language engineering communities it is generally accepted that corpora are important resources and that their usefulness increases with the presence of linguistic annotation. The added value of annotation depends not only on what type of markup is present (morpho-syntactic, syntactic, semantic, etc.) but also on the quality of that markup:
- how (un)ambiguous is the markup, i.e. have only the contextually appropriate markers been selected from among the potential ones or has some of the ambiguity been retained (either explicitly or in the form of underspecification)?
- how consistent is the annotation, e.g. is it in accordance with an annotation manual?
- how correct is the annotation, i.e. will others agree with the applied markup (given the stated meaning of the markers)?
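The three aspects listed above can be made a little more concrete with a rough sketch. The data model (one set of candidate tags per token), the function names and the tag labels are assumptions made purely for illustration, and checking tags against a tagset is only a crude stand-in for conformance to a full annotation manual.

```python
from typing import List, Set

# Hypothetical data model: each token carries the set of tags the annotator left on it.
Annotation = List[Set[str]]


def ambiguity_rate(annotation: Annotation) -> float:
    """Fraction of tokens that still retain more than one candidate tag."""
    if not annotation:
        raise ValueError("annotation is empty")
    return sum(len(tags) > 1 for tags in annotation) / len(annotation)


def manual_conformance(annotation: Annotation, tagset: Set[str]) -> float:
    """Fraction of tokens whose tags all occur in the manual's tagset.

    This is only a crude proxy for consistency with an annotation manual.
    """
    if not annotation:
        raise ValueError("annotation is empty")
    return sum(tags <= tagset for tags in annotation) / len(annotation)


def agreement(annotation: Annotation, reference: List[str]) -> float:
    """Fraction of tokens where a second annotator's choice is among the retained tags."""
    if not annotation or len(annotation) != len(reference):
        raise ValueError("annotation and reference must be non-empty and equally long")
    hits = sum(ref in tags for tags, ref in zip(annotation, reference))
    return hits / len(annotation)


# Illustrative (invented) data: four tokens, one ambiguous, one with an unknown tag.
example = [{"PRON"}, {"VERB"}, {"NOUN", "VERB"}, {"ADJV"}]
manual_tags = {"PRON", "VERB", "NOUN", "ADJ"}
second_opinion = ["PRON", "VERB", "NOUN", "ADJ"]

print(ambiguity_rate(example))
print(manual_conformance(example, manual_tags))
print(agreement(example, second_opinion))
```

None of these figures is a quality measure in its own right; they merely show that each of the three questions can be asked separately of the same annotated material.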
When we examine the demands that linguistic and language engineering research makes on the annotation with regard to these three points, we see that fully automatic annotation is generally not an option. Beyond morpho-syntax (i.e. wordclass tagging), the currently available computer software does not contain sufficient knowledge about language to pinpoint the contextually appropriate markup for a large enough percentage of most types of text.
This means that linguistic annotation of corpora entails human involvement and, given the size of present-day corpora, an enormous amount of it. In recognition of the fact that the amount of work that needs to be done usually exceeds the amount that can be done during a project (because of a lack of manpower, funding or other resources), the international community is promoting the reuse of corpus resources. Users are encouraged to use annotated corpora already in existence and annotators are encouraged to perform their annotation in such a way that reuse is possible. An important factor in reusability is obviously standardization of annotation practices (as far as this is feasible), a fact which has led to initiatives such as EAGLES (cf. Calzolari and McNaught, 1996).
If the principle of reusability really works, one can imagine taking a well-annotated corpus and adding a further layer of annotation, which of course should itself also be reusable. This can then be repeated, leading to a cyclic process which in the end yields a corpus which is annotated for a very large number of aspects. We call this process incremental annotation.
Incremental annotation seems to be the ideal solution for a widespread problem: researchers can produce the data they need with much less work. In practice, unfortunately, there are still some obstacles to overcome. When somebody wants to add a new layer of annotation to an already annotated corpus, the question always is to what degree the existing annotation is of any real use. Two properties of the existing annotation are most decisive: quality (i.e. (un)ambiguity, consistency and correctness) and compatibility with the projected new annotation.
The importance of quality is obvious: if the existing annotation cannot be trusted, checking and correcting it may be as much work as starting from scratch. Furthermore, since quality is extremely complicated to measure, it often really is a question of trust. It would be good if all definitions of annotation standards also included a clear-cut procedure for measuring the quality of an annotated corpus which uses that standard. Until such measurements become available, anyone planning to reuse an annotated corpus had better take some random samples from it and judge for themselves whether the quality is sufficiently high.
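The random sampling suggested here could be organized along the following lines. This is a minimal sketch under stated assumptions: the unit identifiers, the sample size and the use of a Wilson score interval to turn a manual error count into an estimated error range are all our own choices for illustration, not a procedure prescribed by any annotation standard.

```python
import math
import random
from typing import Iterable, List, Tuple


def draw_sample(unit_ids: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Draw k distinct annotation units (e.g. utterance ids) for manual checking.

    A fixed seed keeps the sample reproducible, so that a second checker
    can inspect exactly the same units.
    """
    rng = random.Random(seed)
    return rng.sample(list(unit_ids), k)


def wilson_interval(errors: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the error rate, given `errors` faulty units out of n.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical usage: 5,000 units in the corpus, 200 of them checked by hand,
# 10 of which turned out to contain an annotation error.
sample = draw_sample((f"unit-{i}" for i in range(5000)), k=200)
low, high = wilson_interval(errors=10, n=200)
print(f"estimated error rate between {low:.3f} and {high:.3f}")
```

Whether the resulting range is acceptable remains a judgement that depends on the requirements of the new annotation layer.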
A high-quality annotated corpus is no guarantee of unproblematic reuse, though. Even unambiguous, consistent and correct annotation is only useful if it provides the kind of information which is needed for the new layer. Insufficient information can be supplemented, of course (cf. Black, 1994), but contradictory information will tend to be more of a problem. For example, the Lancaster Treebankers always mark the word "it" as a noun phrase, but this may lead to problems if the new annotation is supposed to describe an anticipatory "it" as a syntactic marker. Compatibility is as hard to measure as quality, maybe even harder (cf. Atwell et al., 1994). Incompatibilities between annotation schemes are often found at a level of detail which goes beyond superficial documentation and are usually highly context dependent. As a result, only outright incompatibility can be recognized easily and quickly, whereas partial incompatibility will only be noticed after substantial work has already been done.
The final complication in judging the usefulness of an existing annotation is that quality and compatibility are not independent. It is here that the difference between correctness and consistency becomes relevant. If the existing annotation has to be adapted to be useful, it may be more important that it is consistent than that it is correct. In order for the existing annotation to be useful, adaptations should preferably be made automatically and this is very difficult if there is a high level of inconsistency.
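Why consistency matters for automatic adaptation can be illustrated with a small sketch. We assume, purely for illustration, that a hand-checked sample is available in which each existing label is paired with the label the new scheme requires; the function name and the labels are hypothetical. An existing label that always corresponds to the same new label can be converted by a simple rule, even if the existing label is "wrong" from the point of view of the new scheme. An existing label that corresponds to several new labels cannot.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def induce_mapping(
    aligned_pairs: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Split existing labels into automatically convertible ones and conflicts.

    aligned_pairs: (existing_label, required_new_label) pairs from a hand-checked sample.
    Returns (mapping, conflicts): `mapping` holds labels with exactly one observed
    target, `conflicts` holds labels observed with more than one target.
    """
    seen = defaultdict(set)
    for old, new in aligned_pairs:
        seen[old].add(new)
    mapping = {old: next(iter(news)) for old, news in seen.items() if len(news) == 1}
    conflicts = {old: sorted(news) for old, news in seen.items() if len(news) > 1}
    return mapping, conflicts


# Invented labels: a consistently used label "N" can be rewritten by rule ...
consistent = [("N", "NP"), ("N", "NP"), ("J", "AJP")]
# ... whereas a label used for two different constructions needs human attention.
inconsistent = [("N", "NP"), ("N", "AJP"), ("J", "AJP")]

print(induce_mapping(consistent))
print(induce_mapping(inconsistent))
```

Note that a conflict found in this way does not by itself show inconsistency: the existing label may also simply be less specific than the new scheme requires. Either way, the affected material cannot be adapted by relabelling alone.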
The deliberations above may well appear to stress potential problems for incremental annotation over potential gains. If this is so, it is because we feel the gains are already obvious. We certainly do not want to give the impression that incremental annotation is a hopeless cause and should not even be attempted. However, we do want to temper the unbridled optimism that tends to accompany references to the reusability principle. The choice to commit oneself to incremental annotation should be made only after an increase in efficiency and/or quality for any new annotation has been demonstrated. The feasibility of such an increase depends to a large extent on the way in which the incremental annotation is implemented. In general, we can distinguish two methodologically different approaches to incremental annotation: the planned and the opportunistic approach.
In the planned approach, all layers of annotation are designed to be compatible (which includes being sufficiently consistent and correct). This will usually mean that more work will have to be put into layer X in order to make it compatible with layers X+1, X+2, etc., but the extra work is amply repaid by the reduction in work for those layers. Obviously, the planned approach can only be used (fully) if one starts out with a raw corpus. Furthermore, there should be a certain amount of confidence that all layers of annotation will eventually be applied as planned, since otherwise the extra effort for the initial layers may be lost. Such confidence can be boosted by making the annotation design into a standard, but for the time being such cross-layer standards are not to be expected, given the lack of consensus for most types of linguistic annotation.
The opportunistic approach is less structured. Its basic tenet is that any existing annotation can be useful. Following the opportunistic approach means looking for the most promising data available and using that as a starting point. After the data has been located, there are two ways of using it. One could design the new annotation layer to be compatible with the existing annotation, in effect a post hoc planned approach. Usually, however, one will already have one's own ideas about what the new annotation should look like. These ideas tend to imply specific requirements for the existing annotation, which will then have to be adapted, corrected and extended in order to serve as the foundation for the new annotation layer. As already indicated above, such reuse can lead to a tremendous gain over annotation from scratch but can equally well lead to complete disaster.
In order to illustrate the difference between the approaches we have performed an experiment in which parts of the Spoken English Corpus (MARSEC; cf. Arnfield, 1996 and UCREL, 1996) are annotated with TOSCA/ICE syntactic analysis trees. The planned approach is represented by the use of the traditional TOSCA analysis system (cf. van Halteren and Oostdijk, 1993) for this material. The opportunistic approach is represented by the use of an adapted and extended version of that same analysis system which takes the Lancaster Treebank analyses (cf. Leech and Garside, 1991) of the same portion of MARSEC as input.
This paper describes the activities involved in the adaptation, examines the experiences with both approaches and evaluates whether the use of the Treebank data as the starting point for the analysis indeed leads to a gain over the traditional method.
Arnfield, S. (1996), MARSEC: The Machine Readable Spoken English Corpus, http://midwich.reading.ac.uk/speechlab/marsec/marsec.html
Atwell, E., J. Hughes and C. Souter (1994), AMALGAM: Automatic Mapping Among Lexico-Grammatical Annotation Models, In Klavans, J. (ed), Proceedings of the ACL Workshop on The Balancing Act: Combining Symbolic and Statistical Approaches to Language, New Jersey: ACL.
Black, E. (1994), An experiment in customizing the Lancaster Treebank, In Oostdijk, N. and P. de Haan (eds), Corpus-based research into language, Amsterdam/Atlanta: Rodopi.
Calzolari, N. and J. McNaught (1996), EAGLES Editor's Introduction (EAG-EB-FR1), http://www.ilc.pi.cnt.it/EAGLES96/edintro/edintro.html
van Halteren, H. and N. Oostdijk (1993), Towards a syntactic database: the TOSCA analysis system, In Aarts, J., P. de Haan and N. Oostdijk (eds), English Language Corpora: design, analysis and exploitation, Amsterdam/Atlanta: Rodopi.
Leech, G. and R. Garside (1991), Running a grammar factory: The production of syntactically analysed corpora or "treebanks", In Johansson, S. and A. Stenström (eds), English Computer Corpora, Berlin/New York: Mouton de Gruyter.
UCREL (1996), UCREL Projects: The Machine Readable Spoken English Corpus, http://www.comp.lancs.ac.uk/computing/users/paul/ucrel/project.html#marsec
Hosted at Queen's University
Kingston, Ontario, Canada
June 3, 1997 - June 7, 1997
Conference website: https://web.archive.org/web/20010105065100/http://www.cs.queensu.ca/achallc97/