Yeats Over The Years

paper
Authorship
  1. Richard Forsyth

    University of the West of England

Work text

1. Introduction
Methods of assigning dates to texts have long been of interest to literary scholars. Apart from authorship attribution, date assignment (stylochronometry) is the most commonly reported task in the stylometric literature. Yardi [13], Brainerd [2], Ule [12] and Temple [11] are just some of the researchers who have addressed this problem.
The four studies cited above deal with the plays of Shakespeare, the writings of Christopher Marlowe and the Platonic corpus -- cases where a true chronology will never be known with certainty. The present study differs from these prior investigations in two respects. Firstly, it applies a novel technique called Monte-Carlo Feature-Finding (Forsyth & Holmes, 1996) to test its suitability in this area. Secondly, the test case chosen -- the verse of William Butler Yeats -- is one where the correct dating is (comparatively) well attested.
Dating Yeats makes an interesting initial test because over the course of a long poetic career his style changed noticeably; yet scholars do not agree on the nature of that change. In this context, it is pertinent to quote Jaynes [6], who also studied the evolution of Yeats's poetic style:
"Yeats's syntactic style is quite stable over his career." ([6] p. 13)
as well as Martindale [9], who investigated historical trends in poetic language in both English and French over several centuries:
"the content of most poets' verse does not change massively across the course of a lifetime".
However, Yeats himself firmly believed that his syntax and diction evolved as he grew older (a view shared by many of his readers), as the following quotations testify.
"My own verse has more and more adopted -- seemingly without any will of mine -- the syntax and vocabulary of common personal speech." (Yeats, 1926; in [6], p. 11)
"It was a long time before I had made a language to my liking." (Yeats, 1937; in [7], p. 265)
2. Method
The main aim of the present study was to find out whether an automated method of characterizing textual differences which has shown promise in other areas [4] could be used to analyze the development of Yeats's language. This method, Monte-Carlo Feature-Finding, is simply a random search for substrings that exist in the training data. It is implemented by a program called TEFF (Text Extending Feature Finder). This finds markers merely by searching through a given set of training texts.
TEFF randomly picks many substrings (4096 in the experiment reported here) from the combined training text. The length of each substring is also a random number from 1 to 8. All distinct substrings thus found are saved and then ranked according to a distinctiveness measure, i.e. according to a measure of their differential rate of occurrence in the different text categories for the problem concerned. Chi-squared is used to measure distinctiveness.
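The random search and ranking described above can be sketched in a few lines of Python. This is only an illustration, not the TEFF program: the function names are hypothetical, the chi-squared form (observed counts against counts expected from the relative size of each category) is an assumption, and the refinements mentioned in the next paragraph are not included, so it should not be expected to reproduce the scores in the tables.

```python
import random


def sample_substrings(text, n_samples=4096, max_len=8, rng=None):
    """Pick random substrings of random length 1..max_len; keep the distinct ones."""
    if not text:
        raise ValueError("text must not be empty")
    rng = rng or random.Random(0)
    found = set()
    for _ in range(n_samples):
        length = rng.randint(1, min(max_len, len(text)))
        start = rng.randrange(0, len(text) - length + 1)
        found.add(text[start:start + length])
    return found


def chi_squared(count_a, count_b, size_a, size_b):
    """Assumed distinctiveness measure: goodness of fit of the two observed
    counts to the counts expected from the relative category sizes."""
    total = count_a + count_b
    if total == 0:
        return 0.0
    expected_a = total * size_a / (size_a + size_b)
    expected_b = total - expected_a
    return ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)


def rank_markers(text_a, text_b, n_samples=4096, max_len=8, seed=0):
    """Return (score, substring, count_a, count_b) rows, most distinctive first."""
    rng = random.Random(seed)
    candidates = sample_substrings(text_a + text_b, n_samples, max_len, rng)
    rows = []
    for s in candidates:
        # str.count counts non-overlapping occurrences.
        count_a, count_b = text_a.count(s), text_b.count(s)
        score = chi_squared(count_a, count_b, len(text_a), len(text_b))
        rows.append((score, s, count_a, count_b))
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows
```

Calling `rank_markers(younger_text, older_text)[:40]` on two training files (both names hypothetical) would give a listing of the same general shape as the one reported in the Results section.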
Earlier trials of Monte-Carlo Feature-Finding [3] indicated that the basic procedure, as described above, tends to generate substrings that are fragmented at what seem linguistically inappropriate boundary points, even when they prove effective as discriminators. This weakness, and others, have since been rectified. For technical details see [5,4].
3. Materials
A random sample of 142 poems by W.B. Yeats was used as training data. This sample was divided into two portions:
"Younger Yeats" (YY), 72 poems, 18,360 words;
"Older Yeats" (OY), 70 poems, 18,668 words.
The dividing date was 1915. Thus the YY sample was written in 1914 or before and the OY sample from 1916 onwards.
The TEFF program was used to find substrings which distinguished these two classes. Then the efficacy of these substrings as distinctive markers was tested on three types of unseen material: (1) 10 other poems, absent from the training sample; (2) two early poems that Yeats later heavily revised; and (3) two prose extracts. Details of these tests are given in the next section.
4. Results
The following listing (Table 1) shows the 40 most distinctive substrings found by the TEFF program when given samples of poetry written by William Butler Yeats before 1915 (`Younger Yeats') or after 1915 (`Older Yeats').
Roughly speaking, a string is a Younger-Yeats marker if the first of its two frequency counts in Table 1 is higher than the second; so, for example, `murmur', at rank 39, is a Younger-Yeats marker (n=21 versus n=2).
Table 1 -- TEFF Substrings.
TEFF output; date: 10/25/97 21:27:58

Rank  Substring  Chi-score    YY count  OY count
   1  `what      35.2977123        30.      100.
   2  ` can      34.5481048        21.       82.
   3  ` can      32.1377981        13.       66.
   4  `hat       29.8459659       132.      245.
   5  `hat       28.7157873       126.      235.
   6  ` whi      25.8148745        67.       21.
   7  `s, an     25.3604502        63.       19.
   8  ` sea      23.8708713        52.       13.
   9  ` that     23.0786873       114.      206.
  10  `?         22.3848552        30.       83.
  11  ` with     22.242135        139.       74.
  12  ` int      21.9169178        12.       51.
  13  `. ii      21.4710501         0.       23.
  14  ` stars    20.9510074        26.        2.
  15  ` that     20.8584369       114.      200.
  16  `ith       20.3048191       134.       72.
  17  `s of      20.0069369       105.       52.
  18  `e that    19.4328782        18.       58.
  19  `though    18.6119958        30.       77.
  20  ` you      18.4021636       120.       65.
  21  `ou        18.338578         70.       29.
  22  `e that    17.6827819        18.       56.
  23  `at        17.6339928       223.      332.
  24  `, and     17.5216778       173.      108.
  25  `ith       17.5057402       143.       84.
  26  ` tha      17.4767001       127.      209.
  27  `ping      17.3800959        34.        7.
  28  ` you      17.1449342        68.       29.
  29  `your      16.933698         52.       19.
  30  ` we       16.8084892       141.       83.
  31  `, an      16.6399743       176.      112.
  32  ` your     16.5048223        49.       17.
  33  `woo       16.2479445        32.        7.
  34  `ck        15.9644919        51.      103.
  35  `low       15.935057         70.       32.
  36  `wed       15.7755436        23.        3.
  37  ` the w    15.7503404       103.       56.
  38  `w the     15.5959647        31.        7.
  39  `murmur    15.5603536        21.        2.
  40  ` of o     15.5309625        27.        5.


A minus-sign in front of a rank number indicates a marker which is a proper substring of another higher up in the table, and which will therefore not be saved on file. This removes some of the redundancy from the list, but not all of it. So a further program has been written to filter out substrings that duplicate the effect of those higher in the list. The result of running this latter program is shown in Table 2.

Table 2 -- Marker Substrings.
filters.spt date: 10/25/97 22:07:45
1 C:\BM95\WY.Yx, 96547 bytes.
2 C:\BM95\WY.Ox, 100322 bytes.
proportion in class 1 = 0.4904112, proportion in class 2 = 0.5095863

Rank  Substring  Chi-score    YY count  OY count
   1  `what      35.1125492        30.      100.
   2  ` can      34.3848949        21.       82.
   3  `s, an     25.4882859        63.       19.
   4  ` whi      25.4359635        67.       21.
   5  ` with     22.3028559       139.       74.
   6  `?         21.9694592        30.       83.
   7  ` sea      21.9076975        49.       13.
   8  ` int      21.814545         12.       51.
   9  `. ii      21.4093664         0.       23.
  10  ` stars    21.0202576        26.        2.
  11  `ck        20.7388396        35.       87.
  12  ` we       20.6404509       139.       76.
  13  `hat       20.6188054       113.      200.
  14  `s of      19.2516067       104.       52.
  15  `though    18.5468022        30.       77.
  16  ` you      18.1662318       115.       62.
  17  ` that     17.7369649         3.       27.
  18  `ping      16.5610752        33.        7.
  19  `woo       16.3187661        32.        7.
  20  ` dee      16.2970505        27.        4.


It should be noted that these are strings, not words, so item 16 ` you' (a YY marker) would cover words such as `young', `younger', `youth' and `your' as well as the word `you' itself.
The presence of `? ' on this list suggests that this method is capable of detecting syntactic change as well as change in vocabulary.
Only the top twenty markers have been shown here, to save space; and only these will be used in the analysis that follows. Note that the sole element of human judgement involved in choosing these markers was deciding how many to use, 20 being a convenient number.
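The point that the markers are strings rather than words can be made concrete with a short counting sketch. The two lists below are transcribed from Table 3; trailing spaces may have been lost in transcription, so they should be read as approximations. The helper name `count_markers`, the lower-case folding and the sample sentence are assumptions made for illustration only.

```python
# Marker lists as printed in Table 3 (approximate; see the note above).
YY_MARKERS = ["s, an", " whi", " with", " sea", " stars", " we",
              "s of", " you", "ping ", "woo", " dee"]
OY_MARKERS = ["what", " can", "? ", " int", ". ii", "ck",
              "hat ", "though", " that"]


def count_markers(text, markers):
    """Count non-overlapping occurrences of each marker substring in text."""
    folded = text.lower()  # assumption: the texts were case-folded
    return {marker: folded.count(marker) for marker in markers}


sample = "When you are young, your youth is yours."  # invented sentence
yy_counts = count_markers(sample, YY_MARKERS)
print(yy_counts[" you"])  # 5: you, young, your, youth, yours
```

A whole-word count of `you' in the same sentence would give 1, which is why marker counts cannot be read directly as word frequencies.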
5. Some Marks of Time
As an illustration, the table below shows the results of counting the occurrences of eleven Younger-Yeats and nine Older-Yeats marker substrings in a pair of poems written 50 years apart.

This shows a clear preponderance of `younger' markers in the earlier poem and an even clearer preponderance of `older' markers in the later poem. If either of these texts had just been rediscovered, we could with reasonable confidence allocate Down by the Salley Gardens to Yeats's early career and Politics to his later years.
Perhaps it is worth noting here that stylometers have generally found chronology a trickier subject than authorship attribution [3], and that they have very rarely dared to categorize text segments as short as these 2 poems -- a notable exception being Simonton [10], who analyzed, among other things, word usage in the final couplets of Shakespeare's sonnets, averaging 17.6 words in length.
However, this pair of poems was present in the training files. Thus Table 3 is merely illustrative of this method.

Table 3 -- Frequencies of Substrings in 2 Short Poems

Marker Substrings   Salley Gardens 1888  Politics 1938
Younger-Yeats strings:
  `s, an`                             0              0
  ` whi`                              0              0
  ` with`                             2              0
  ` sea`                              0              0
  ` stars`                            0              0
  ` we`                               1              1
  `s of`                              0              0
  ` you`                              2              1
  `ping `                             0              0
  `woo`                               0              0
  ` dee`                              0              0
  Total =                             5              2
Older-Yeats strings:
  `what`                              0              2
  ` can`                              0              1
  `? `                                0              1
  ` int`                              0              0
  `. ii`                              0              0
  `ck`                                0              0
  `hat `                              0              6
  `though`                            0              1
  ` that`                             0              4
  Total =                             0             15



Table 4 -- Substring Counts in 10 Unseen Poems.

Poem                                            Date  Words  YY markers  OY markers
A Faery Song                                    1891    104          10           1
The Lover Tells of the Rose in his Heart        1892    114          10           4
The Hosting of the Sidhe                        1893    124           6           2
The Host of the Air                             1893    310          23           4
To Some I have Talked with by the Fire          1895    139           9           0

A Woman Young and Old Parting                   1926     79           4           6
In memory of Eva Gore-Booth and Con Markiewicz  1927    190           8           6
Quarrel in Old Age                              1931     72           1          11
Parnell's Funeral, Part I                       1933    247           4          33
A Model for the Laureate                        1937    142           4          18


5.1 Unseen Trial
As a genuine test, 10 more poems were chosen, this time completely at random, five written before 1915 and five afterwards -- with the proviso that they were not already present in the training sample. These poems, and their dates of composition, are listed in Table 4, along with counts of the YY and OY substrings found in each.
In nine of the 10 poems the count is higher in the appropriate age category. The probability of 9 or more correct binary choices from 10, under a Null Hypothesis that there is an even chance of being right or wrong, is 11/1024 (p=0.0107), suggesting that short substrings can indeed be useful in this sort of problem.
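The tail probability quoted above can be checked with exact arithmetic. The sketch below assumes only the null hypothesis stated in the text (ten independent choices, each with an even chance of being right); the function name is illustrative.

```python
from fractions import Fraction
from math import comb


def binomial_tail(successes, trials, p=Fraction(1, 2)):
    """Exact probability of at least `successes` correct out of `trials`."""
    return sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(successes, trials + 1))


tail = binomial_tail(9, 10)
print(tail, float(tail))  # 11/1024 0.0107421875
```

The 11 favourable outcomes are the 10 ways of getting exactly nine right plus the single way of getting all ten right.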
5.2 A Youthful Yeatsian Index
Of course, ideally in stylochronometry we would want not just to classify texts as early versus late, but to assign an estimated date to each text. With this object in mind, a `youthful Yeatsian index' (YYIX) was defined as follows:
YYIX = (YY - OY) / (YY + OY)
where YY is the number of younger-Yeats markers found and OY the number of older-Yeats markers. (This time only 19 substrings were used -- omitting `hat ' from the list shown in Table 3 on the grounds that both `what' and `that ' were already present -- to avoid any suggestion of double-counting.)
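As a worked illustration, the index can be computed with a small function. The example counts are taken from Table 4, which was compiled with all 20 markers, so the values printed here are indicative only and are not the 19-marker values plotted in Figure 1. The function name and the handling of the zero-marker case are assumptions.

```python
def yyix(yy, oy):
    """Youthful Yeatsian index: +1 if only younger markers occur, -1 if only older ones."""
    total = yy + oy
    if total == 0:
        raise ValueError("YYIX is undefined when no markers are found")
    return (yy - oy) / total


# Counts from Table 4 (20 markers, so illustrative only).
print(f"{yyix(10, 1):+.3f}")   # A Faery Song, 1891: +0.818
print(f"{yyix(4, 33):+.3f}")   # Parnell's Funeral, Part I, 1933: -0.784
```

Because the index is a ratio it does not depend directly on poem length, but it becomes coarse when a poem contains only a handful of markers.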
In addition, three more unseen poems were added, from Yeats's middle period, to the 10 listed in Table 4. These were The Ragged Wood (1904, 105 words), All Things can Tempt me (1908, 92 words) and The Scholars (1915, 73 words).
Figure 1, below, shows a plot of YYIX against date of composition. The correlation of YYIX with date is r = -0.844, which is very highly significant (p < 0.001). A regression using YYIX to predict date accounts for over 71% of the variance (r squared = 0.712).

Figure 1. Plot of Youthful Yeatsian Index against Date
5.3 Visions and Revisions
Another test of this approach involved looking at poems revised by Yeats in his later years. In my Everyman edition [1] there are just two such poems where the text of both versions is given in full. These are The Lamentation of the Old Pensioner (original 1890, revised 1925) and The Sorrow of Love (original 1891, revised 1924). Table 5 gives the frequencies of YY and OY substrings in both versions of both these poems.
Table 5 -- Substring Frequencies in Revised Poems.

Poem                Version       YY Markers  OY Markers
The Lamentation     1890 version           2           2
                    1925 version           0          11
The Sorrow of Love  1891 version          13           0
                    1924 version           2           6

In each case Yeats's process of revision increases the number of `older-Yeats' markers and decreases the number of `younger-Yeats' markers.
5.4 Prose Trial
Although the training files used by TEFF to find distinctive substrings were entirely composed of poetry, it was thought worthwhile to look briefly at the frequencies of these marker substrings in prose passages as well, to gain an initial idea of the robustness of this approach when genre assumptions are violated. Accordingly, two short extracts of prose were taken from the first and last essay by Yeats in the collection of [7], namely the first 40 lines of The Irish National Literary Society (446 words, dated 1892) and the first 44 lines of Ireland after the Revolution (435 words, dated 1938).

In the earlier extract, there were 21 YY markers and 11 OY markers. The later passage contained 11 YY markers and 36 OY markers. Of the 11 YY substrings, 7 were more frequent in the earlier piece than the later (with 3 tied in frequency). Of the 9 OY substrings, 4 were more frequent in the later piece than the earlier (with 4 tied). To put this another way: only 2 of the 20 markers pointed in the `wrong' direction.
6. Discussion
Counting `badges' is a rather unsophisticated method of text classification, so the performance of the marker substrings found by TEFF on the trials described above is quite impressive, especially bearing in mind the unreliability of most previous stylometric techniques on samples as small as those analyzed in this study -- as witnessed (for example) by the following quotation.
"Even using 500 word samples we should anticipate a great deal of unevenness" [8].
Weaknesses still remain. Firstly, the presence of `hat ' as well as ` that' in the list of OY markers suggests that the post-filtering program still needs improvement. There is clear overlap between these two substrings, which introduces an undesirable element of double-counting. Secondly, and perhaps more important, interpretation of markers like `though' is problematic. The question arises: does a group of words such as `though', `although', `thought' and `thoughtful' constitute a natural kind? And if not, are we justified in relying on such a heterogeneous grouping? The answer depends, in large part, on what we want to do with the texts under scrutiny. If insight is our aim, then a KWIC index based on the substrings in question should enlighten us about just what verbal habits are being detected.
7. Conclusion
To conclude: assigning short poems (median length = 114 words) by W.B. Yeats to their correct chronological period is a non-trivial task. Nevertheless, a simple count of distinctive substrings found by the TEFF program led to the right assignment in 9 out of 10 unseen cases. Moreover, these substring frequencies were sensitive enough to detect authorial revision in two early poems revised by Yeats many years after he originally wrote them, and robust enough to classify a pair of short prose extracts correctly as well.
The performance in this pilot study of short substrings found by a Monte-Carlo process suggests that such strings warrant further investigation as stylistic indicators.
References
1. Albright, D. (1990) ed. The Poems: W.B. Yeats. Everyman/Dent, London.
2. Brainerd, B. (1980). The Chronology of Shakespeare's Plays: a Statistical Study. Computers & the Humanities, 14, pp. 221-230.
3. Forsyth, R.S. (1995). Stylistic Structures: a Computational Approach to Text Classification. Unpublished PhD thesis, Faculty of Science, University of Nottingham.
4. Forsyth, R.S. (1997). Deriving Document Descriptors from Data. In: L. Dorfman et al. (eds.) Emotion, Creativity and Art. Perm, Russia.
5. Forsyth, R.S. & Holmes, D.I. (1996). Feature-Finding for Text Classification. Literary & Linguistic Computing, 11(4), pp. 163-174.
6. Jaynes, J.T. (1980). A Search for Trends in the Poetic Style of W.B. Yeats. ALLC Journal, 1, pp. 11-18.
7. Jeffares, A.N. (1964) ed. Yeats: Selected Criticism. Macmillan & Co., London.
8. Ledger, G.R. & Merriam, T.V.N. (1994). Shakespeare, Fletcher, and the Two Noble Kinsmen. Literary & Linguistic Computing, 9(4), pp. 235-248.
9. Martindale, C. (1990). The Clockwork Muse. Basic Books, New York.
10. Simonton, D.K. (1990). Lexical Choices and Aesthetic Success: a Computer Content Analysis of 154 Shakespeare Sonnets. Computers & the Humanities, 24, pp. 251-264.
11. Temple, J.T. (1996). A Multivariate Synthesis of Published Platonic Stylometric Data. Literary & Linguistic Computing, 11(2), pp. 67-75.
12. Ule, L. (1982). Recent Progress in Computer Methods of Authorship Determination. ALLC Bulletin, 10(3), pp. 73-89.
13. Yardi, M.R. (1946). A Statistical Approach to the Problem of the Chronology of Shakespeare's Plays. Sankhya: Indian J. Statistics, 7(3), pp. 263-268.


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1998
"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

109 works by 129 authors indexed

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC
