Experimental Design: Syntactic Annotation as Words

Hans van Halteren

Department of Language and Speech, University of Nijmegen


The first step in the setup of our experiment was the selection of suitable data. Because our focus was on the difference between authors rather than the difference between genres, we decided to take two samples from the Nijmegen corpus (Keulen 1986) which are of the same genre, crime fiction. The two samples each consisted of 20,000 words of running text and were taken from M. Innes' The Bloody Wood (henceforth Sample A) and M. Allingham's The Mind Readers (henceforth Sample B).

The Nijmegen corpus has been syntactically annotated with two different analysis systems, the CCPP system (cf. Keulen 1986) and the TOSCA system (cf. Oostdijk 1991). The TOSCA analysis is the more detailed one and uses a more consistent description model. We therefore selected the TOSCA analysis as the one to be used in our experiment.


Figure 1 exemplifies the syntactic annotation assigned to the sentence "He walks his dog in the park.". On each node in the analysis tree, we find labels for syntactic function, for syntactic category, and for additional attributes. Consider, for instance, the node immediately to the left of the word "park". For this node, the function is 'Noun Phrase Head' (NPHD), the category is 'Noun' (N), and the attributes are 'Common' (com) and 'Singular' (sing).

There are many different aspects of syntactic analysis trees that might be exploited for purposes of authorship attribution. We opt to translate part of the information present in each analysis tree into a pseudo-word sequence. The crucial question is which part. We have used two criteria to decide which information to include: a) focus on the most important information, and b) keep the resulting pseudo-words as similar to normal words as possible, so that any greater accuracy of the syntax-based methods can only be attributed to a higher information content (with regard to the problem at hand) of the pseudo-words.

Criterion b) led us to exploit the individual rewrites (combinations of a node and its immediate constituents), since these are the building blocks of the tree, just as words are building blocks of the sentence. Criterion a) led us to focus first on the category label (e.g., NP, 'Noun Phrase'), then on the function label (e.g., SU, 'Subject') and only last on the attribute labels (e.g., sing) on the nodes.

For an exact choice of the information to use we counted the number of pseudo-word tokens and types. The total number of rewrite tokens in the two samples is 46402. Using only the category labels, e.g.,

NP -> DTP + N

(where NP, N, and DTP denote 'Noun Phrase', 'Noun', and 'Determiner Phrase' respectively), leads to 2318 types. Adding the function labels on the right-hand side, e.g.,


(where DET denotes 'Determiner' and NPHD 'Noun Phrase Head'), increases this number to 2732. Adding the function label on the left-hand side as well,


(where PC denotes 'Prepositional Complement'), brings this number up to 4194. As the resulting type-token ratio is fairly close to that for the normal words of our samples, this is the labeling we have decided to use.
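The three labeling levels can be sketched in Python as follows. This is only an illustration, not the software used in the experiment: the tuple representation, the function names `render` and `type_token_counts`, and the four toy rewrites are assumptions made for the example. Only the labels and the `function:category` notation are taken from the text.

```python
from collections import Counter

# Hypothetical toy data: (parent function, parent category, children),
# where each child is a (function, category) pair. Invented for illustration.
REWRITES = [
    ("PC", "NP", [("DT", "DTP"), ("NPHD", "N")]),
    ("SU", "NP", [("DT", "DTP"), ("NPHD", "N")]),
    ("SU", "NP", [("NPHD", "PN")]),
    ("PC", "NP", [("DT", "DTP"), ("NPHD", "N")]),
]

def render(rewrite, level):
    """Render one rewrite as a string.

    level 1: category labels only
    level 2: adds function labels on the right-hand side
    level 3: also adds the function label on the left-hand side
    """
    parent_fun, parent_cat, children = rewrite
    if level == 1:
        rhs = " + ".join(cat for _, cat in children)
    else:
        rhs = " + ".join(f"{fun}:{cat}" for fun, cat in children)
    lhs = f"{parent_fun}:{parent_cat}" if level == 3 else parent_cat
    return f"{lhs} -> {rhs}"

def type_token_counts(rewrites, level):
    """Return (number of types, number of tokens) at the given level."""
    counts = Counter(render(r, level) for r in rewrites)
    return len(counts), sum(counts.values())

for level in (1, 2, 3):
    print(level, type_token_counts(REWRITES, level))
```

On the toy data the number of types grows from 2 to 3 once the left-hand function label is included, which mirrors on a very small scale the growth from 2318 to 4194 types reported for the real samples.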

The most frequent rewrites are present in both samples. The first one missing in Sample A is the 59th most frequent one,


(UTT: 'Utterance'; COORD: 'Coordination'; CJ: 'Conjoin'; S: 'Sentence'; COOR: 'Coordinator'; CONJN: 'Conjunction') which occurs 85 times in Sample B. The first one missing in Sample B is the 231st most frequent one,


(RPDU: 'Reported Utterance'; CLOID: 'Clausoid'; DIFU: 'Discourse Function'; REACT: 'Reaction Signal'; PUNC: 'Punctuation'; PM: 'Punctuation Mark') which occurs 14 times in Sample A. These simple numbers already suggest that there are marked differences in the way the authors of Samples A and B make use of syntactic rewrite rules.
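A comparison of this kind can be sketched as below. This is a rough illustration under stated assumptions: the function name `first_missing`, the alphabetical tie-breaking, and the toy counts are hypothetical, and the rewrite names `r1` to `r4` are placeholders rather than real rewrites.

```python
from collections import Counter

def first_missing(counts_a, counts_b):
    """Rank rewrites by combined frequency and report, for each sample,
    the (rank, rewrite) of the first rewrite that sample lacks."""
    total = counts_a + counts_b
    # Descending frequency; ties broken alphabetically (an assumption).
    ranked = sorted(total, key=lambda r: (-total[r], r))

    def first_absent(counts):
        for rank, rewrite in enumerate(ranked, start=1):
            if counts[rewrite] == 0:
                return rank, rewrite
        return None  # every rewrite occurs in this sample

    return first_absent(counts_a), first_absent(counts_b)

# Toy counts, invented for illustration.
sample_a = Counter({"r1": 5, "r2": 3, "r4": 1})
sample_b = Counter({"r1": 4, "r2": 1, "r3": 2})
print(first_missing(sample_a, sample_b))
```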

We have translated the syntactic rewrite information from the samples into pseudo-words. The main reason for this is that existing software is likely to expect words rather than the long and complex expressions that rewrites are. For the translation, we have sorted the rewrites according to their frequency (cumulative over both samples) and named them accordingly. Thus, the most frequent rewrite becomes W0001, the second most frequent one W0002, etc.:

Translated  Frequency  Rewrite
W0001       4670       V:VP -> MVB:LV
W0002       3566       SU:NP -> NPHD:PN
W0003       2674       DT:DTP -> DTCE:ART
W0004       1948       A:AVP -> AVHD:ADV
W0005       1729       A:PP -> P:PREP + PC:NP
W0006       1435       V:VP -> OP:AUX + MVB:LV
W0007       1395       NPPR:AJP -> AJHD:ADJ
W0008       1172       DT:DTP -> DTCE:PN
W0009       1017       PC:NP -> DT:DTP + NPHD:N
W0010       1016       -:TXTU -> UTT:S + PUNC:PM
[V: 'Verb'; VP: 'Verb Phrase'; MVB: 'Main Verb'; LV: 'Lexical Verb'; PN: 'Pronoun'; DTCE: 'Central Determiner'; ART: 'Article'; A: 'Adverbial'; AVP: 'Adverb Phrase'; AVHD: 'Adverb Phrase Head'; ADV: 'Adverb'; PP: 'Prepositional Phrase'; P: 'Preposition'; PREP: 'Preposition'; OP: 'Operator'; AUX: 'Auxiliary'; NPPR: 'Noun Phrase Premodifier'; AJP: 'Adjective Phrase'; AJHD: 'Adjective Phrase Head'; ADJ: 'Adjective'; TXTU: 'Textual Unit'.]

The translated rewrites were presented in the original order in which they appear in the samples. In addition, text unit separators were inserted to indicate which pseudo-words together form a pseudo-sentence (i.e., which rewrites jointly form an analysis tree). As a result, the experimenters received the following kind of data:

S W0084 W3165 W0048 S W0021 W0061 W0002 W0001 W0031 W0019 S W0010 ...
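The translation step can be sketched in Python as follows. This is a minimal sketch under assumptions, not the original tooling: the function names `build_names` and `encode`, the list-of-lists input format, and the alphabetical tie-breaking between equally frequent rewrites are all hypothetical, since the text does not describe them.

```python
from collections import Counter

def build_names(all_trees):
    """Map each rewrite to a pseudo-word (W0001, W0002, ...) by descending
    frequency, counted over all trees passed in (i.e. both samples)."""
    freq = Counter(rewrite for tree in all_trees for rewrite in tree)
    # Ties are broken alphabetically; this is an assumption.
    ranked = sorted(freq, key=lambda r: (-freq[r], r))
    return {rewrite: f"W{i:04d}" for i, rewrite in enumerate(ranked, start=1)}

def encode(trees, names, separator="S"):
    """Emit the pseudo-words of each tree in original order, with a
    text unit separator in front of every tree."""
    tokens = []
    for tree in trees:
        tokens.append(separator)
        tokens.extend(names[rewrite] for rewrite in tree)
    return tokens

# Toy input: each tree is the list of its rewrites in original order.
trees_a = [["V:VP -> MVB:LV", "SU:NP -> NPHD:PN"], ["V:VP -> MVB:LV"]]
trees_b = [["SU:NP -> NPHD:PN", "V:VP -> MVB:LV"]]

names = build_names(trees_a + trees_b)  # cumulative over both samples
print(" ".join(encode(trees_a, names)))
```

Building the name table once from both samples and then encoding each sample separately keeps a given pseudo-word tied to the same rewrite in both pseudo-texts.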

The data were provided to the experimenters in two different forms. For general statistical operations, the two complete pseudo-texts were available. For authorship attribution techniques, the two texts were available in the form of fourteen labeled samples and six unlabeled samples.

Unknown to the experimenters, the two pseudo-texts were both divided into ten parts, such that a new part was initiated at the first text unit separator after 2500 pseudo-words (including separators). All parts were about the same size, except for the tenth part of Sample B, which contained only 2254 pseudo-words. The first seven parts of each pseudo-text were provided as labeled samples (SA1-SA7 and SB1-SB7). The remaining six parts were provided as unlabeled samples (SQ1 (=SA10), SQ2 (=SB10), SQ3 (=SB8), SQ4 (=SA8), SQ5 (=SB9) and SQ6 (=SA9)). All correspondence information was withheld from the experimenters.
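The partitioning rule can be sketched as below. The function name `split_parts` is hypothetical, and the reading of "after 2500 pseudo-words" as "once the current part holds at least 2500 tokens" is an assumption; the text does not settle whether the boundary is inclusive.

```python
def split_parts(tokens, size=2500, separator="S"):
    """Split a pseudo-text into parts. A new part starts at the first
    separator seen once the current part holds at least `size` tokens
    (separators included), so parts never break inside a pseudo-sentence."""
    parts, current = [], []
    for token in tokens:
        if token == separator and len(current) >= size:
            parts.append(current)
            current = []
        current.append(token)
    if current:
        parts.append(current)  # the last part may be shorter
    return parts

# Small demonstration with size=4 instead of 2500.
demo = "S a b S c d e S f S g h".split()
for part in split_parts(demo, size=4):
    print(" ".join(part))
```

Because a boundary is only placed at a separator, every part is slightly longer than the threshold, except possibly the last one.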

Without a priori knowledge of how many of the six unlabeled samples SQ1-SQ6 should be attributed to A or B (this number can vary between zero and six), the probability of finding the correct assignment by chance equals (1/2)^6 = 1/64 ≈ 0.016. The probability of correctly assigning at least five samples equals 7/64 ≈ 0.109, since there is one assignment with all six correct and there are six assignments with exactly five correct. These probabilities show that our experiment is statistically non-trivial: the experimenters are not likely to arrive at the correct solution by chance.
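These chance probabilities can be checked with a few lines of Python. The helper name `p_at_least` is hypothetical, and the sketch assumes, as the text does, that each of the six samples is assigned to A or B independently with probability 1/2.

```python
from math import comb

def p_at_least(n, k):
    """Probability of at least k correct out of n independent
    two-way guesses, each correct with probability 1/2."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

print(round(p_at_least(6, 6), 3))  # all six correct: 1/64
print(round(p_at_least(6, 5), 3))  # at least five correct: 7/64
```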

Keulen, F. (1986). The Dutch computer corpus pilot project. In: J. Aarts and W. Meijs (Eds.), Corpus Linguistics II. New studies in the analysis and exploitation of computer corpora. Amsterdam: Rodopi.

Oostdijk, N. (1991). Corpus Linguistics and the Automatic Analysis of English. Amsterdam: Rodopi.


Conference Info

Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

Conference website: https://web.archive.org/web/19990224202037/www.hd.uib.no/allc-ach96.html

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC
