(Re)connecting Complex Lexical Data: Updating the Historical Thesaurus of English

paper, specified "short paper"
Authorship
  1. Brian Aitken

    University of Glasgow

  2. Marc Alexander

    University of Glasgow

  3. Fraser Dallachy

    University of Glasgow

Work text


The University of Glasgow’s Historical Thesaurus of English (HT) arranges all the recorded words in the last millennium of English into almost a quarter of a million concepts. The work of half a century, it is available online (at www.ht.ac.uk) and a second edition is underway. This edition draws upon editorial work conducted by the Oxford English Dictionary (OED) in its ongoing third edition, and thus a crucial activity in creating the new edition of the Thesaurus is the meshing of the Glasgow database with the separate data held by the OED. This paper describes the processes developed by the HT editorial team to tackle the complex task of linking these datasets, allowing rapid updates to be made to the HT and the OED. In doing so, it illustrates challenges associated with linking complex, structured data comprising a mixture of text and numerical information, discusses methods developed by the team, and evaluates the effectiveness of the different linking techniques.

Background
The HT’s database evolved over forty years of digital humanities work at Glasgow, pre-dating even the concept of a relational database or a primary key (Kay and Chase, 1987; Wotherspoon, 1992). At the time the data was sent to the OED for publication of the first edition, no unique IDs had yet been assigned to pieces of Thesaurus data. As a result, the production of the online HT and the ‘Historical Thesaurus’ feature of the OED Online involved the respective teams creating separate key-indexed versions of the database. The linking of HT word senses to OED word senses was a painstaking process that at times required case-by-case analysis by OED editorial staff, and occasionally resulted in minor alterations to the thesaurus structure or the reassignment of word senses to different categories. In the rare instances in which a headword had more or fewer senses in the HT than in the OED, the OED editors had to decide how best to combine the relevant data. This work allowed a one-way connection from the original HT to OED data, but not back to the online HT produced at Glasgow.

However, while the OED’s data is more up-to-date, in places the HT’s data is richer, such as in its 800,000 manually created complex statements of usage dates. A reciprocal arrangement between the OED and the HT means that the editorial teams now share data, with the result that the OED has access to the classification experience of the HT editors, whereas the HT team receives updates on word senses and their dates of activity as established by the OED team. These data updates form the basis of the Thesaurus’ second edition, and so in order to bring such updates into the HT database, the editorial team must create a data linkage between the HT and the OED which had not previously been attempted.

The Linking Process
A multi-stage linking process was developed. The first stage involved aligning the hierarchies of the HT and OED datasets and verifying this alignment; the second stage consisted of finding matches for the categories which appeared not to have been successfully aligned during the first stage. A third stage will involve linking lexemes within categories. The initial alignment was achieved using the category code numbers which exist in both the HT and the OED data (e.g. ‘01.05 Animals’, ‘01.05.17 Reptiles’ in the HT data). The OED editorial team had made adjustments to some areas of the HT hierarchy (e.g. ‘01.10 The Universe’ in the HT became ‘01.01 The Universe’ in the OED structure). These known changes were accounted for by adjusting the HT category numbers accordingly when aligning them to the OED categories. An initial script checked for matches in the category code, part of speech, and text heading; this process confirmed over 200,000 category matches, leaving around 27,000 categories which either had a provisional match that could not be confirmed in this manner or had no provisional match at all between the two datasets.
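The initial alignment check described above can be sketched roughly as follows. This is an illustrative sketch only, not the project's actual script: the `Category` record, the `KNOWN_RENUMBERINGS` table, and the function names are hypothetical, and the case-insensitive heading comparison is an assumption. Only the ‘01.10’ to ‘01.01’ renumbering is taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    code: str     # hierarchical category code, e.g. "01.05.17"
    pos: str      # part of speech, e.g. "n"
    heading: str  # text heading, e.g. "Reptiles"


# Hypothetical table of known renumberings (HT prefix -> OED prefix).
# Only the 01.10 -> 01.01 change is mentioned in the paper.
KNOWN_RENUMBERINGS = {"01.10": "01.01"}


def to_oed_code(ht_code: str) -> str:
    """Rewrite an HT code to the code expected on the OED side."""
    for old, new in KNOWN_RENUMBERINGS.items():
        # Match the prefix only at a tier boundary ("01.10" or "01.10.xx").
        if ht_code == old or ht_code.startswith(old + "."):
            return new + ht_code[len(old):]
    return ht_code


def align(ht_cats, oed_cats):
    """Split HT categories into confirmed, unconfirmed and unmatched."""
    oed_index = {(c.code, c.pos): c for c in oed_cats}
    confirmed, unconfirmed, unmatched = [], [], []
    for ht in ht_cats:
        oed = oed_index.get((to_oed_code(ht.code), ht.pos))
        if oed is None:
            unmatched.append(ht)            # no provisional match at all
        elif oed.heading.strip().lower() == ht.heading.strip().lower():
            confirmed.append((ht, oed))     # code, POS and heading agree
        else:
            unconfirmed.append((ht, oed))   # provisional match, heading differs
    return confirmed, unconfirmed, unmatched
```

Indexing the OED categories by (code, part of speech) keeps each lookup constant-time, which matters at the scale of roughly a quarter of a million categories; the three output lists correspond to the confirmed matches, the unconfirmed provisional matches, and the categories with no provisional match.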
The number of confirmed matches between the aligned hierarchies was increased using a suite of techniques allowing for known variation between the HT and OED category headings (e.g. OED’s house style recommends ‘relating to’ where the HT prefers ‘pertaining to’; OED uses ‘/’ where the HT uses ‘or’). For quality assurance, matches confirmed in this way were required to meet additional criteria, including a minimum number of shared lexemes within the category and matches between elements of lexeme metadata (cf. Fig. 1, below). Following these processes, approximately 5,000 categories containing lexemes remained, either with unconfirmed matches or entirely without a potential match.
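One way to allow for such house-style variation is to normalise both headings before comparing them, and then to apply a shared-lexeme check as a safeguard. The sketch below is a minimal illustration under stated assumptions: the function names, the `min_shared` threshold of 2, and the choice to normalise towards the OED wording are hypothetical, and the lexeme metadata checks mentioned above are not modelled.

```python
import re


def normalise_heading(heading: str) -> str:
    """Reduce a category heading to a house-style-neutral form."""
    text = heading.lower().strip()
    # HT 'pertaining to' vs OED 'relating to'
    text = text.replace("pertaining to", "relating to")
    # '/' in one dataset vs 'or' in the other
    text = re.sub(r"\s*/\s*", " or ", text)
    # collapse any repeated whitespace
    return re.sub(r"\s+", " ", text)


def is_confirmed(ht_heading, oed_heading, ht_lexemes, oed_lexemes, min_shared=2):
    """Confirm a match only if the normalised headings agree AND enough
    lexemes are shared (min_shared is an assumed, illustrative threshold)."""
    if normalise_heading(ht_heading) != normalise_heading(oed_heading):
        return False
    return len(set(ht_lexemes) & set(oed_lexemes)) >= min_shared
```

Requiring shared lexemes in addition to a heading match guards against categories that happen to carry the same generic heading in different parts of the hierarchy.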

Figure 1 Sample view of category lexeme and metadata used in linking and QA processes
In the second phase, new potential matches for these 5,000 categories were sought. Methods used included looking for matches with ‘sibling’ categories at the same tier in the HT hierarchy (e.g. HTOED subcategory ‘03.01.06|05 elevate or raise to a higher position’ matches HT ‘03.01.06|08 elevate/raise to a higher position’). Further methods used lexemes which appeared only once in each of the unmatched datasets, as well as ‘date profiles’ created for categories from the first citation dates of their constituent words. An intractable residue of approximately 1,000 categories was manually matched by project assistants.
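The unique-lexeme and date-profile ideas might be sketched along the following lines. This is a hedged illustration rather than the team's implementation: the data layout (a dict from category ID to a list of lexemes), the function names, and the overlap score used to compare date profiles are all assumptions made for the example.

```python
from collections import Counter


def _singletons(cats):
    """Map each lexeme occurring in exactly one category to that category."""
    counts = Counter(lex for lexemes in cats.values() for lex in set(lexemes))
    return {
        lex: cat_id
        for cat_id, lexemes in cats.items()
        for lex in set(lexemes)
        if counts[lex] == 1
    }


def unique_lexeme_candidates(ht_unmatched, oed_unmatched):
    """Count, per (HT category, OED category) pair, the lexemes that occur
    only once in each unmatched dataset and appear in both categories."""
    ht_once = _singletons(ht_unmatched)
    oed_once = _singletons(oed_unmatched)
    return Counter(
        (ht_once[lex], oed_once[lex]) for lex in ht_once.keys() & oed_once.keys()
    )


def date_profile(first_dates):
    """Multiset of first citation dates for the words in a category."""
    return Counter(first_dates)


def profile_overlap(a, b):
    """Share of dates common to both profiles (an assumed similarity score)."""
    shared = sum((a & b).values())  # multiset intersection
    total = max(sum(a.values()), sum(b.values()))
    return shared / total if total else 0.0
```

Pairs with a high vote count from `unique_lexeme_candidates`, or a high `profile_overlap`, would be candidate matches for review rather than confirmed links; any threshold for accepting them would need to be tuned against manually checked data.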

The linking of the datasets is a work in progress; at the time of abstract preparation, work has concentrated on matching at the category level. The next phase will match lexemes within the now-linked categories, allowing for the fact that some lexemes may have been moved between categories as a result of editorial work conducted by OED staff in preparing the OED’s third edition. By July 2019 this lexeme-matching phase should be complete, and the final paper will also address the methods and challenges involved in this part of the matching process.
When both categories and their lexemes are confidently matched, the HT team can begin the important work of using OED3 data to both update the dates for which words can be evidenced as active, and introduce words which have been added to the OED but which are not, as yet, represented in the HT. Accurate linking between the two datasets will allow automation of the update process in future as OED3 updates are periodically released, thus ensuring the continued accuracy and utility of the resource for academic and public use.

Bibliography

Kay, C. and Chase, T. J. P. (1987). Constructing a thesaurus database. Literary and Linguistic Computing, 2(3): 161-163.

Wotherspoon, I. (1992). Historical thesaurus database using Ingres. Literary and Linguistic Computing, 7(4): 218-225.


Conference Info

In review

ADHO - 2019
"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019


Series: ADHO (14)

Organizers: ADHO