Machine Learning Support for Evaluation and Quality Control

Hans van Halteren

University of Nijmegen

Annotated material which is to be evaluated and possibly upgraded is used as training and test data for a machine learning system. The portion of the material for which the output of the machine learning system disagrees with the human annotation is then examined in detail. This portion is shown to contain a higher percentage of annotation errors than the material as a whole, and hence to be a suitable subset for limited quality improvement. In addition, the types of disagreement may identify the main inconsistencies in the annotation so that these can then be investigated systematically.


In many humanities projects today, we see that large textual resources are manually annotated with markup symbols, as these are deemed necessary for efficient future research with those resources. The reason that the annotation is applied manually is that there is, for the time being, no automatic procedure which can apply the annotation with an acceptable degree of correctness, typically because the annotation requires detailed knowledge of language or even of the world to which the resources refer.

The choice of human annotators may be unavoidable, but it is also one which has a severe disadvantage. Human annotators cannot sustain the concentration that correct annotation demands for the length of time it takes to annotate such enormous amounts of data (cf. e.g. Marcus et al. 1993; Baker 1997). Loss of concentration, even if only partial and temporary, is bound to lead to a loss of correctness in the annotation. Awareness of this problem has led to the use of quality control procedures in large scale annotation projects. Such procedures generally consist of spot checks by more experienced annotators or double blind annotation of a percentage of the material. The lessons learned from such checks lead to additional instruction of the annotators, and, if the observed errors are systematic and/or severe enough, to correction of previously annotated material. Even with excellent quality control measures during annotation, though, it is likely that the end result will not be fully correct, and the measure of correctness can, at most, be estimated from the observations made in quality control. Obviously, it would be enormously helpful if there were automatic procedures to support large scale evaluation and upgrade of annotated material.


Unfortunately, as mentioned above, automatic procedures are currently unable to deal with natural language to a sufficient degree to correctly apply most types of annotation. However, although automatic procedures cannot provide correctness, they are undoubtedly well-equipped to provide consistency. Now consistency and correctness are not the same, but both are desirable qualities and, unlike other pairs of desirable qualities such as high precision and recall, they are not in opposition. Complete correctness is bound to be consistent at some level of reference and complete consistency at a sufficiently deep level of reference is bound to be correct. More practically, a highly correct annotation can be assumed to agree most of the time with a highly consistent annotation, which means that disagreement between the two will tend to indicate instances with a high likelihood of error.

An example is provided by Van Halteren et al. (Forthcoming). One of the constructed wordclass taggers is trained and tested on Wall Street Journal material tagged with the Penn Treebank tagset. In comparison with the benchmark, the tagger provides the same tag in 97.23% of the cases. When the disagreements are checked manually for 1% of the corpus, it turns out that out of 349 disagreements, 97 are in fact errors in the benchmark. Unless this is an unfortunate coincidence, it would mean that we can remove about 10,000 errors by checking fewer than 40,000 words, a much less formidable task than checking the whole 1Mw corpus. In addition, in 44% of the cases where the tagger is wrong, the error appears to be caused by inconsistencies in the training data, e.g. the word "about" in "about 20" or "about $20" is tagged as a preposition 648 times and as an adverb 612 times. Such observations are slightly harder to use systematically, but can again serve to adjust inconsistent and/or incorrect annotation.
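The extrapolation in this example can be retraced with a few lines of arithmetic. The sketch below uses only the figures quoted above and assumes that the manually checked 1% sample is representative of the whole corpus, so that counts scale linearly; the variable names are illustrative.

```python
# Figures quoted in the text for the manually checked 1% of the corpus.
sample_percent = 1
disagreements_in_sample = 349
benchmark_errors_in_sample = 97

# Assumption: the sample is representative, so counts scale linearly.
scale = 100 // sample_percent
est_disagreements = disagreements_in_sample * scale
est_benchmark_errors = benchmark_errors_in_sample * scale

# Share of tagger/benchmark disagreements that are benchmark errors.
error_share = benchmark_errors_in_sample / disagreements_in_sample

# Split of the two competing tags for "about" before a number or amount.
about_prep, about_adv = 648, 612
majority_share = max(about_prep, about_adv) / (about_prep + about_adv)

print(f"words to recheck: about {est_disagreements:,}")           # 34,900
print(f"benchmark errors among them: about {est_benchmark_errors:,}")  # 9,700
print(f"errors per disagreement: {error_share:.1%}")               # 27.8%
print(f"majority tag share for 'about': {majority_share:.1%}")     # 51.4%
```

The estimates of 34,900 words to recheck and 9,700 errors are the figures that the text rounds to "fewer than 40,000" and "about 10,000". The near-even split for "about" shows why a learner cannot settle on a single tag in that context.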

In principle, the use of such a comparison methodology is not limited to wordclass tagging. Any annotation task which can be expressed as classification on the basis of a (preferably small) number of information units (e.g. for wordclass tagging the information units could be the word, two disambiguated preceding classes and two undisambiguated following classes) is amenable to handling by a machine learning system. Such a system attempts to identify regularities in the relation between the set of information units and the assigned class, and uses these regularities to classify previously unseen cases (cf. e.g. Langley 1996; Carbonell 1990). Several machine learning systems are freely available for research purposes, e.g. the memory-based learning system TiMBL and the decision tree system C5.0. If we have a machine learning system and if we can translate the annotation task into a classification task, we can train the system on the annotated material and then compare the system's output with the human annotation. The instances where the two disagree can then (a) be used as prime candidates for rechecking correctness and (b) point to systematic inconsistencies to be reconsidered.
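The comparison procedure described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the setup used in the paper: the learner is a toy nearest-neighbour classifier with a simple feature-overlap measure (a stand-in, not TiMBL or C5.0), the data are invented, the function names are hypothetical, and held-out folds are used for prediction even though the choice between full training and cross-validation is one of the open questions.

```python
from collections import Counter

def overlap(a, b):
    """Number of feature positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b))

def classify(train, features):
    """Majority label among the training instances most similar to `features`."""
    best = max(overlap(f, features) for f, _ in train)
    votes = Counter(label for f, label in train if overlap(f, features) == best)
    return votes.most_common(1)[0][0]

def flag_disagreements(data, folds=3):
    """Predict each instance from the other folds and return the indices
    where the learner disagrees with the human label."""
    flagged = []
    for k in range(folds):
        test_idx = range(k, len(data), folds)
        held_out = set(test_idx)
        train = [inst for i, inst in enumerate(data) if i not in held_out]
        for i in test_idx:
            features, human_label = data[i]
            if classify(train, features) != human_label:
                flagged.append(i)
    return sorted(flagged)

# Invented toy data: (word, class of next token) -> Penn Treebank tag
# (RB = adverb, IN = preposition). Instance 7 is a planted inconsistency.
data = [
    (("about", "NUM"), "RB"), (("about", "NUM"), "RB"), (("about", "NUM"), "RB"),
    (("about", "DET"), "IN"), (("about", "DET"), "IN"), (("about", "DET"), "IN"),
    (("about", "NUM"), "RB"), (("about", "NUM"), "IN"), (("about", "NUM"), "RB"),
    (("about", "PRON"), "IN"), (("about", "PRON"), "IN"), (("about", "PRON"), "IN"),
]

for i in flag_disagreements(data):
    print(i, data[i])  # prints: 7 (('about', 'NUM'), 'IN')
```

Only the flagged instances would then be passed to a human for rechecking. A nearest-neighbour learner trained on the full material would find each instance among its own neighbours and so be biased towards the human label, which is why the sketch predicts from held-out folds.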

Overview of the Paper

Using various types of annotated material and machine learning systems, this paper will attempt to answer the following questions:

For which types of annotation is this method useful?
How does the error rate in the 'highlighted' portion of the material compare to the overall error rate?
At which levels of correctness of the annotation is the method useful?
Are some machine learning systems better than others for the purpose at hand?
Can we benefit from the fact that we have more than one system at our disposal and, if so, how?
Should we use the full material in the training phase or is it better to use cross-validation?

References

Baker, J. P. (1997). Consistency and Accuracy in Correcting Automatically Tagged Data. In R. Garside, G. Leech and A. P. McEnery (eds) Corpus Annotation. Addison Wesley Longman, London. 243-250.
Carbonell, J. (ed) (1990). Machine Learning: Paradigms and Methods. MIT Press, Cambridge, MA.
van Halteren, H., Daelemans, W. and Zavrel, J. (Forthcoming). Improving Accuracy in NLP through Combination of Machine Learning Systems.
Langley, P. (1996). Elements of Machine Learning. Morgan Kaufmann, Los Altos, CA.
Marcus, M., Santorini, B. and Marcinkiewicz, M. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2). 313-330.

Conference Info

Hosted at University of Glasgow

Glasgow, Scotland, United Kingdom

July 21, 2000 - July 25, 2000

Series: ALLC/EADH (27), ACH/ICCH (20), ACH/ALLC (12)

Organizers: ACH, ALLC

  • Language: English