The Style-Marker Mapping Project: a Rationale and Progress Report

Joseph Rudman
Carnegie Mellon University

This paper explicates the what, why, and how of a substantially completed but ongoing project to identify and categorize all style-markers in written English that are quantifiable (e.g. type/token ratios, word length distributions, word length correlations, hapax legomena).

Section I treats the what (defining the project) and then addresses the why (the rationale for the project). Section II outlines the how. Section III gives a status report of the project.

Although the final mapping will have value in various disciplines (e.g. stylistics, corpus linguistics, computational linguistics, and computer science), the impetus for the project comes from non-traditional authorship attribution studies. Non-traditional attribution practitioners define style within the seemingly narrow framework of only those stylistic traits that are quantifiable.

The main hypothesis behind this project (and all non-traditional authorship attribution studies) is that every author has a verifiably unique style. If we look at style as an organism, style-markers are its genetic material, which makes this project analogous to the human genome project. The analogy is something of a stretch, because each style-marker is analyzed in an independent study, whereas all of the loci of an autoradiogram are obtained in one scientific analysis.

The identification of a quantifiable style-marker does not necessarily mean that that particular style-marker should be included in an authorship study (e.g. the orthography might be dictated by an editor or typesetter).

Section I

This section treats the what - defining the project - and then moves into the why of the project.

A short overview of what is in this section follows:

The style-marker mapping project is a study to identify every style-marker in written English that can be quantified. The project began in a preliminary fashion in 1983 when I started recording the various style-markers that were used in non-traditional authorship attribution studies so that I could use them in my studies of the canon of Daniel Defoe. The project continued in this vein (along with a few attempts on my part to come up with "new" style-markers) until five years ago when I realized the importance of identifying all of the quantifiable style-markers.

No single style-marker, or even a combination of several style-markers, has proven to be a definitive discriminator in all non-traditional authorship studies. What works in one case often does not work in others. Word length distributions and sentence length distributions are two examples of style-markers that seemingly work in some cases but not in others. The idea is to look at style as a combination of all of the quantifiable style-markers and then to do the analysis as if each style-marker were a locus in the autoradiogram (see RUDMAN for a more detailed explanation).
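As an illustration only, a few of the style-markers named in this abstract (type/token ratio, hapax legomena, word length distribution, sentence length) can be computed and gathered into one profile, so that each marker is treated as one position among many. The following Python sketch is not the project's method: the tokenizer, the sentence splitter, and the names `tokenize` and `style_profile` are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters,
    # optionally with one internal apostrophe (e.g. "don't").
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def style_profile(text):
    """Return a small dictionary of quantifiable style-markers for one text."""
    tokens = tokenize(text)
    n_tokens = len(tokens)
    if n_tokens == 0:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)

    # Sentence lengths in words; splitting on . ! ? is a simplification.
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    sentence_lengths = [len(tokenize(s)) for s in sentences]

    # Word length distribution as relative frequencies, keyed by length
    # in characters (an apostrophe counts as a character here).
    length_counts = Counter(len(t) for t in tokens)

    return {
        "type_token_ratio": len(counts) / n_tokens,
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "word_length_distribution": {
            k: length_counts[k] / n_tokens for k in sorted(length_counts)
        },
        "mean_sentence_length": sum(sentence_lengths) / len(sentence_lengths),
    }
```

Comparing two texts would then mean comparing their profiles marker by marker; which statistics to use for that comparison is outside the scope of this sketch.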

Another reason for using all of the quantifiable style-markers is to eliminate any charges of statistical cherry-picking.

Section II

This section treats the how. References to all of the literature and techniques will appear in the final report. For example, HOLMES, DELCOURT, and ELLIOTT AND VALENZA are three of the references under non-traditional authorship studies.

1) Search the literature, e.g.:

Non-traditional Authorship Attribution Studies
Discourse Analysis

2) Query the practitioners in all of the above fields, e.g.:

Professor Erwin R. Steinberg of Carnegie Mellon for stylistics
Professor Paul G. Hopper of Carnegie Mellon for grammar
All of the active practitioners in non-traditional authorship studies

3) Establish a clearinghouse on a web page that allows anyone to query the up-to-date mapping and allows anyone to suggest "new" quantifiable style-markers that would be added by the curator. This will lead to a continually updated list. Negotiations are under way to make this site an extension of the Carnegie Mellon University English Web Site.

4) Use various strategies to identify new style-markers. This is where the innovative work supplements the drudge work, e.g.:

Neural networks
Pattern searching programs
Brainstorming sessions

Section III

This section gives a status report of the project and reports a timeline for its "completion." References will be given for all of the style-markers in the final report, e.g. MOSTELLER AND WALLACE is one of the function word references. Only a few representative examples of each section are listed for this abstract.


Part of speech
Function Words
Most frequent words
Rhetorical Devices
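
Two of the categories listed above, function words and most frequent words, reduce to simple counts. The sketch below is a hedged illustration rather than the project's procedure: the short `FUNCTION_WORDS` list is a hypothetical sample chosen for the example, not a list taken from the cited references, and the names `function_word_rates` and `most_frequent_words` are likewise assumptions. Rates are expressed per thousand tokens so that texts of different lengths can be compared.

```python
import re
from collections import Counter

# Hypothetical sample list for illustration; a real study would justify its own.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "upon")


def _tokens(text):
    # Crude tokenizer (an assumption): lowercase runs of letters only.
    return re.findall(r"[a-z]+", text.lower())


def function_word_rates(text, words=FUNCTION_WORDS):
    """Occurrences of each listed word per 1000 tokens of the text."""
    tokens = _tokens(text)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: 1000.0 * counts[w] / total for w in words}


def most_frequent_words(text, n=10):
    """The n most frequent word types with their raw counts."""
    return Counter(_tokens(text)).most_common(n)
```

For example, `function_word_rates(sample)["upon"]` gives the rate of "upon" in a text held in a (hypothetical) variable `sample`, and `most_frequent_words(sample, 50)` gives its fifty most frequent word types.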

The questions "Can this project ever be completed?" and "Is the number of style-markers infinite?" are addressed.

The success of this project will not solve all of the problems of non-traditional authorship studies. This project does not address the problems that gender, genre, time constraints, or conscious vs. unconscious style bring to the table. Nor does it treat the problem of lemmatization.

Identifying the style-markers addresses only a small part of the overall problem with non-traditional authorship attribution studies. The statistics that should be used for each of these style-markers in any study, and the statistics for combining all of the markers into a "final" answer, are the subject of another ongoing project.


