Authorship attribution - the case of lexical innovations
Michal Ephratt
University of Haifa
Keywords: authorship, lexical-innovations

Much effort is devoted in both conventional linguistics and computational linguistics to the development of algorithms for determining the authorship of anonymous texts (plays, poems, documents, letters, etc.).
The rationale for such algorithms is that a text makes use of all linguistic domains: semantics, syntax, the lexicon, phonology (orthography) and morphology. Each of these domains is rule governed, yet within these rules and among the components the grammar offers the writer choices. The text as an end product is the outcome of the particular choices made by its author. This is why each specific text carries the fingerprints of its creator. The algorithms look for these fingerprints to reveal the authorship of the anonymous text in question. Most such algorithms combine stylistic criteria with statistical tools.

The rationale for authorship attribution comes from the following premises:

1. that there is a specific single author;

2. that there are choices to be made;

3. that the author is consistent in his/her preferred choices, and

4. that these choices are present and could be detected in all end products of that creator.

The larger the unit, the greater its compositionality, and hence the greater the number of choices to be made. The maximal linguistic unit seems to be the text (texts themselves vary from novels to letters).

In highly synthetic languages such as Hebrew, lexemes are derived from lower-level morphemes (roots and patterns), affixes, and higher-level components, each drawing on and emphasizing different morphosemantic strategies, so one and the same concept can be lexicalized in many ways. Different persons (authors, members of the Academy of the Hebrew Language, as well as naive speakers) can come up with different lexical innovations for the same concept. Several competing lexical innovations can thus be elicited for one signified. Such lexical innovations can be found in literature (originals or translations), in technical publications, or even in glossaries - lists of terms - for specific domains. As is traditionally held for texts, we claim here that each of these newly coined lexemes may also be subject to questions of authorship. Because of its compositionality and the choices involved, it too may carry the fingerprints of its creator.

The precise attribution of authorship to such innovations seems to fall within the duties and responsibility of the lexicographer. In contrast to the situation described for texts, no models exist for attributing authorship to lexemes. Lexemes lie at the opposite extreme from texts: being the minimal composed linguistic units, their compositionality is much smaller in terms of both quantity and quality (having no context, they exclude syntax altogether).

This paper is the outcome of our attempt to construct such a model. The need for the model proved crucial to our project, which is devoted to formulating the morphosemantic rules by which the late Hebrew author and translator Yonatan Ratosh coined over 4000 lexical innovations. Because there is no historical-etymological dictionary for Modern Hebrew, and because Ratosh's innovations are used and listed anonymously in a variety of sources, singling out his lexical innovations became our basic problem, and its solution a precondition for proceeding with the project.

Our model for attribution of authorship to lexemes, or rather lexical innovations, makes use of two independent sources:

1. Intralinguistic source:
As mentioned, much effort has been devoted to the development of algorithms for the attribution of authorship to anonymous texts. We first studied reports on such algorithms to elicit from them criteria that could be applicable to the word (lexical innovation) level.
Clearly, all methods that rely on context-sensitive information, where the scope of the context reaches beyond the lexeme level, are excluded as unsuitable for our purpose. Examples are matters of inflection, syntactic roles, alliteration, connectives and punctuation. The rare cases in which pure word-level criteria (words in isolation) are used for determining text authorship were extracted and adopted as is. These are criteria that look into word structure and the phonetic characteristics of words (e.g., clusters, syllable count).
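
To make the word-in-isolation criteria concrete, the following minimal sketch computes two of the features named above, syllable count and consonant clusters, for a single word. It is an illustration only, not the procedure used in the project. It assumes a romanized (transliterated) input, approximates a syllable as a maximal run of the letters a, e, i, o, u, and works on letters rather than sounds, so a digraph that stands for one sound would be reported as a cluster. The function names are hypothetical.

```python
import re

# Assumption: input is a romanized string and these five letters are the vowels.
VOWELS = "aeiou"


def syllable_count(word):
    """Approximate the number of syllables as the number of vowel-letter runs."""
    return len(re.findall(rf"[{VOWELS}]+", word.lower()))


def consonant_clusters(word):
    """Return every run of two or more adjacent consonant letters."""
    # The negated class excludes vowels, non-word characters, digits and "_",
    # which leaves consonant letters only.
    return re.findall(rf"[^{VOWELS}\W\d_]{{2,}}", word.lower())


def word_profile(word):
    """Collect the isolated-word features into one record."""
    return {
        "length": len(word),
        "syllables": syllable_count(word),
        "clusters": consonant_clusters(word),
    }


print(word_profile("strength"))
```

A profile of this kind can be computed for every candidate innovation and compared with the profiles of innovations whose author is known. A real implementation for Hebrew would need a proper phonological representation instead of the letter-based heuristic assumed here.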

To these we added other diachronic and synchronic criteria that such reports mention for levels higher than the word and that therefore had to be adapted to characterize words in isolation. The classical word-frequency and favorite-words tests were modified to morpheme-frequency and favorite-morphemes tests. Etymology of words was modified to etymology, or origin, of morphemes. The study of hapax legomena was recast as a test of unique morphemes, and the common-vocabulary test (used in text algorithms to eliminate bias from the salience of words belonging to the specific subject matter) was replaced here with a common-stems test. Use of proper names in corpora may supply hints regarding the time and source of an innovation. The use of proper names as the basis for lexical innovations may therefore be quite indicative.
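
The morpheme-level adaptations of the frequency, favorite-item and hapax tests can be sketched as follows. This is a hedged illustration, not the model itself. It assumes that each lexical innovation has already been segmented into morphemes by hand, and it uses invented placeholder labels (`root:R1`, `pattern:P1`, `innovator_A`, and so on) rather than real data. The scoring function is one simple possibility among many.

```python
from collections import Counter


def morpheme_profile(innovations):
    """Count morphemes over a list of segmented lexemes (one morpheme list each)."""
    return Counter(m for lexeme in innovations for m in lexeme)


def favorite_morphemes(profile, n=3):
    """The n most frequent morphemes (analogue of the favorite-words test)."""
    return [m for m, _ in profile.most_common(n)]


def unique_morphemes(profile):
    """Morphemes used exactly once (analogue of hapax legomena)."""
    return {m for m, count in profile.items() if count == 1}


def relative_frequency_score(lexeme, profile):
    """Mean relative frequency, in the profile, of the lexeme's morphemes."""
    total = sum(profile.values())
    if total == 0 or not lexeme:
        return 0.0
    return sum(profile[m] for m in lexeme) / (total * len(lexeme))


# Hypothetical, hand-made data: placeholder labels, not attested innovations.
known = {
    "innovator_A": [["root:R1", "pattern:P1"],
                    ["root:R2", "pattern:P1"],
                    ["root:R3", "pattern:P1", "suffix:S1"]],
    "innovator_B": [["root:R1", "pattern:P2"],
                    ["root:R4", "pattern:P2"],
                    ["root:R5", "pattern:P3"]],
}
profiles = {name: morpheme_profile(items) for name, items in known.items()}

# An anonymous innovation is scored against each known profile.
anonymous = ["root:R6", "pattern:P1"]
scores = {name: relative_frequency_score(anonymous, p)
          for name, p in profiles.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy data the anonymous item shares its pattern with the favorite morpheme of `innovator_A`, so that profile scores higher. A common-stems control could be approximated by removing from every profile the morphemes shared by all innovators before scoring; that step is omitted here for brevity.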

The last group of criteria was not found in the above-mentioned reports but grew out of our ongoing work on the derivational morphology of synthetic languages. Examples are the use of non-productive rules; a consistent preference of the innovator for lexical innovations that match one specific transparency scale; and explicit statements by innovators that indicate their methods of innovation.

2. Extralinguistic source:
Word marks. Registered word marks (used in marketing goods and services) are words specially protected both as intellectual property and under tort law. The proprietor of the word mark enjoys monopoly rights in the mark, meaning that s/he has the right to exclude others from the use of the word.
Justice Parker stated that "... apart from the law as to trade marks, no one can claim monopoly rights in the use of a word."

To legally secure the proprietor's rights in his/her mark, to secure other traders' right to describe their goods freely, and to secure the speakers of the common language their right in the language, the law and the courts have to provide clear tools for determining ownership of a word mark. The court, of course, does so not for the sake of linguistics but for reasons of fair trade. For the court this is a practical rather than a theoretical issue, one that calls for a clear verdict.

From the legislation of both Trade Mark Acts (UK 1994; Israel 1972), and from court appeals and the verdicts on them, we have drawn some criteria for exclusively characterizing words, such as distinctiveness, descriptiveness, genericness, and invented words. These too were incorporated in our set of criteria for attributing authorship to (anonymous) lexical innovations.

* This project is fully sponsored by a grant of the Israel Science Foundation.

