Mining the Cultural Memory of Irish Industrial Schools Using Word Embedding and Text Classification

Long paper
Authorship
  1. Susan Leavy, University College Dublin
  2. Emilie Pine, University College Dublin
  3. Mark T. Keane, University College Dublin


Introduction

The Industrial Memories project aims for new distant (i.e., text analytic) and close readings (i.e., witnessing) of the 2009 Ryan Report, the report of the Irish Government's investigation into abuse at Irish Industrial Schools. The project has digitised the Report and used techniques such as word embedding and automated text classification using machine learning to re-present the Report's key findings in novel ways that better convey its contents. The Ryan Report exposes the horrific details of systematic abuse of children in Irish industrial schools between 1920 and 1990. It contains 2,600 pages with over 500,000 words detailing evidence from the 9-year-long investigation. However, the Report's narrative form and its sheer length effectively make many of its findings quite opaque. The Industrial Memories project uses text analytics to examine the language of the Report, to identify recurring patterns and extract key findings. The project re-presents the Report via an exploratory web-based interface that supports further analysis of the text. The methodology outlined is scalable and suggests new approaches to such voluminous state documents.

Method

A web-based exploratory interface was designed to enable searching and analysis of the contents of the Report represented within a relational database. The relational structure detailed the categories of knowledge contained in the Report along with key information extracted from the text (Figure 1). The Ryan Report is composed of paragraphs containing an average of 87 words. These paragraphs were represented as database instances, and annotations detailing semantic content were linked through the relational structure. Named entities were automatically extracted using NLTK (Loper and Bird, 2002).

Figure 1: Knowledge Database Relational Structure

Classifying Paragraphs into Different Knowledge Categories

The Ryan Report describes key elements of an enduring system of abuse that operated in Irish industrial schools. Its paragraphs tend to focus on particular topics, allowing them to be classified and annotated. For instance, some cover the extent and nature of abuse, others present witness testimony, report on institutional oversight or on how clergy were moved from one school to another in response to allegations. By classifying paragraphs in terms of these high-level knowledge categories it becomes easier to put a shape on many of the report's findings and to analyse it to provide new readings.

Some of these paragraph-categories were identified using automated text classification. Others were extracted using a rule-based search (e.g., excerpts on institutional oversight). In building classification models, a variety of feature sets were examined using a random forest classifier along with manually selected test data. A bag-of-words approach to feature selection yielded results that were over-fitted due to the small samples of training data. However, feature selection based on context-specific semantic lexicons, generated from a sample of seed-words using a word embedding algorithm, was found to yield accurate results. Lexicons were generated using the word2vec algorithm developed by Mikolov et al. (2013), following an approach to identifying synonyms outlined by Chanen (2016).
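The lexicon-expansion idea can be sketched as follows. This is a minimal illustration only: the hand-made toy vectors stand in for a word2vec model trained on the Report, and the vocabulary and `expand_lexicon` helper are invented for the example, not taken from the project's code.

```python
import numpy as np

# Toy 3-d vectors standing in for a word2vec model trained on the Report.
vocab = ["transfer", "dismiss", "sack", "move", "reassign", "dinner", "chapel"]
vectors = np.array([
    [0.90, 0.10, 0.00],  # transfer
    [0.80, 0.20, 0.10],  # dismiss
    [0.85, 0.15, 0.05],  # sack
    [0.70, 0.30, 0.00],  # move
    [0.75, 0.20, 0.10],  # reassign
    [0.00, 0.10, 0.90],  # dinner
    [0.10, 0.00, 0.80],  # chapel
])

def expand_lexicon(seeds, vocab, vectors, top_k=2):
    """Return the top_k non-seed words most similar (by cosine) to any seed."""
    index = {w: i for i, w in enumerate(vocab)}
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = {}
    for seed in seeds:
        sims = unit @ unit[index[seed]]  # cosine similarity to every word
        for word, sim in zip(vocab, sims):
            if word not in seeds:
                scores[word] = max(scores.get(word, 0.0), float(sim))
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

lexicon = expand_lexicon(["transfer", "dismiss", "sack"], vocab, vectors)
print(lexicon)  # semantically close verbs rank ahead of unrelated words
```

In the project the neighbours came from a word2vec model, but the selection principle is the same: words close to the seed verbs in the embedding space join the lexicon, while unrelated vocabulary is excluded.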

Movements of Staff and Clergy (Transfer Paragraphs)
An important paragraph-category covers those dealing with the Catholic Church's response to allegations of abuse. The typical response to discovered abuse was to transfer clergy from one institution to another, only for the abuse to re-occur (e.g., "...Br Adrien was removed from Artane and transferred to another institution..." (CICA Vol. 1, Chapter 7, Paragraph 829)). Such transfers are described in many different ways in language that often obscures what was happening (e.g., transfers out of the Order, effectively sackings, are described as "dispensations to be released from vows"). We carried out a "by-hand" analysis to find transfer-paragraphs using verb-searches and then expanded this set using machine-learning classifiers.

Initial readings of the Report suggested a set of verbs frequently used to describe the transfer of staff and clergy, including 'transfer', 'dismiss' and 'sack'. The highest-ranking similar words recurring over five word2vec models were then identified. Features based on this lexicon, along with names of schools and clergy, were extracted from 250 training examples (Table 1). A classification model then classified unseen text from the Report.
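The agreement-across-models filter can be sketched as below; the neighbour lists are invented for illustration (the project derived them from five separately trained word2vec models), but the filtering logic is the same: keep only words that recur in the top ranks of all models.

```python
from collections import Counter

# Top-ranked neighbours of the seed verbs from five separately trained
# word2vec models (illustrative word lists, not the project's actual output).
model_neighbours = [
    ["remove", "relocate", "dispense", "appoint"],
    ["remove", "relocate", "appoint", "release"],
    ["remove", "relocate", "dispense", "release"],
    ["remove", "relocate", "release", "appoint"],
    ["remove", "relocate", "dispense", "appoint"],
]

def stable_neighbours(neighbour_lists, min_models=5):
    """Keep words appearing in the top ranks of at least min_models models."""
    counts = Counter(w for words in neighbour_lists for w in set(words))
    return sorted(w for w, c in counts.items() if c >= min_models)

lexicon = stable_neighbours(model_neighbours)
print(lexicon)  # only words agreed on by all five models survive
```

Requiring agreement across independently trained models filters out neighbours that appear near the seeds only through the randomness of a single training run.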

Text Category                  | Features Extracted
Direct Speech                  | Reporting verbs, personal pronouns, punctuation (colons, quotation marks, commas, question marks, contractions), newlines
Movements of Staff and Clergy  | Transfer verbs, names of clergy, schools
Descriptions of Abusive Events | Clergy and staff, parts of body, abusive actions, emotions and implements associated with abuse

Table 1: Optimal Features Extracted from Report

Witness Testimony (Witnessing Paragraphs)
Witness testimonies in the Ryan Report are indicated through reporting verbs and structural speech markers (e.g., punctuation). Using these features, Schoch et al. (2016) achieved an accuracy of 84.1 percent in automatically classifying direct speech. Reporting verbs in the Ryan Report are often specific to its context, such as apology, allegation or concession. To extract these from the text, the highest-ranking similar words across multiple word embedding models were identified, based on seed terms generated from WordNet: 'said', 'told' and 'explained'. The resulting context-specific synonyms, combined with WordNet synonyms, formed a lexicon of reporting verbs tailored to the language of the Report (Table 2). A classification model was developed using these features along with punctuation information, using 500 training examples.

Seed Words: said, told, explained, answered, learned, confirmed, described, say, tell, state, posit, posited, submit, submitted, express, expressed, narrate, narrated, recount, recite, recited

Context Specific Lexicon for Reported Speech: alleged, warned, recounted, claimed, surmised, denied, relieved, asserted, protested, witnessed, stating, called, describes, informed, agreed, said, admitted, explained, convinced, advised, presumed, assured, screams, tells, reported, requested, complained, heard, asking, says, commented, confessed, questioned, remarked, accepted, recollection, suggested, enounced, verbalise, believed, verbalised, added, assure, replied, articulate, thought, apologise, knew, pardon, felt, pardoned, recalled, remember, saying, articulated, enounce, thinks, condoned, remembered, condone, conceded, saw, realised, explicate, stated, explicated, insisted, apology, guarantee, concluded, asked, mentioned

Table 2: Context Specific Synonyms Using Word Embedding

Descriptions of Abusive Events (Abuse Paragraphs)
To evaluate the scale of abuse throughout the industrial school system, excerpts from the Report detailing abusive events were extracted. The language describing abuse incorporates a broad range of linguistic features, and a set of seed-words on which to base a semantic lexicon for feature extraction was not immediately apparent on reading the Report. A support vector machine algorithm was therefore used to extract the most discriminative features based on a sample set of 200 paragraphs.

Analysis of the support vectors showed that the terms distinguishing excerpts describing abuse formed five categories: abusive actions, body parts, emotions engendered in the victims, implements, and names of staff and clergy. Sample words associated with each category were then used as seed-words in word embedding models to extract similar terms from the Report. Features based on these five lexicons, combined with names of clergy and staff, were then used to generate a predictive model of abusive events.

Findings and Conclusions

This research demonstrates how word embedding can be used to compile context-specific semantic lexicons to extract features for text classification. These features allowed paragraphs for each knowledge category to be automatically classified based on manually selected training data.

Classification   | No. Classified Excerpts
Clergy Movements | 1,340
Direct Speech    | 1,920
Abusive Events   | 1,365

Table 3: Total Number of Classified Paragraphs

The performance of classifiers was evaluated using 10-fold cross-validation on the training data and showed high levels of accuracy in categorisations (Table 4).
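As a sketch, 10-fold cross-validation of a random forest can be run with scikit-learn. The synthetic two-feature data below merely stands in for the lexicon-based features described above; only the evaluation procedure mirrors the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the labelled training paragraphs: two feature
# columns (e.g. transfer-verb count, clergy-name count), two classes.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=2.0, scale=0.5, size=(20, 2)),  # "transfer" paragraphs
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),  # everything else
])
y = np.array([1] * 20 + [0] * 20)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)  # accuracy on each of 10 folds
print(scores.mean())
```

Each of the ten folds holds out a tenth of the training data for testing, so the reported accuracy is averaged over ten train/test splits rather than a single one.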

Classification   | Precision | Recall | F-Score | Accuracy
Clergy Movements | .91       | .91    | .91     | 91.2%
Direct Speech    | .94       | .93    | .93     | 93.6%
Abusive Events   | .93       | .93    | .93     | 93.3%

Table 4: Performance on Training Data: Random Forest Classifier, Weighted Average Results Using 10-fold Cross-Validation

The classification models were applied to unseen data and performance was evaluated by manually inspecting the classifications of 600 randomly selected excerpts from the Report, as shown in Table 5. Though overall accuracy levels remained high, the precision of the classifications fell somewhat, especially in relation to identifying speech and transfers.

Classification   | Precision | Recall | F-Score | Accuracy
Clergy Movements | .58       | 1.0    | .73     | 92%
Direct Speech    | .84       | .92    | .88     | 94%
Abusive Events   | .86       | .88    | .87     | 95%

Table 5: Performance on Unseen Text: Classification of Report Evaluated on Random Samples

Error analysis showed that incorrectly classified excerpts (false positives and negatives) were commonly those where the meaning of the language was subtle or vague. Paragraphs incorrectly classified as quoted speech, for instance, were in fact quotations from letters and diary entries. Unidentified speech excerpts all consisted of short quoted phrases.

Transfers of clergy were reliably detected. However, there was a high rate of false positives because the transfer of children throughout the school system is described using similar language (e.g., "The witness remembered ... when he was leaving Artane at nine years of age..." (CICA Vol. 1, Ch. 7, Paragraph 466)). Classifying excerpts describing abuse yielded few false positives, but it also returned the highest level of false negatives. In these instances, references to abuse were subtle or addressed emotional abuse. As such, it was necessary to manually filter results.

This paper has demonstrated that machine learning can be used to classify text based on a limited number of examples, when used in conjunction with word embedding to generate context-specific semantic lexicons. Re-presenting the Ryan Report in the form of a relational database with a web-based exploratory interface has facilitated comprehensive analysis of the Report, and has exposed new insights about the dynamics of the system of child abuse in Irish industrial schools. In reformulating how the Ryan Report can be presented, this research presents a scalable approach to digital analysis of state reports.

Acknowledgements

This research is part of the Industrial Memories project funded by the Irish Research Council under New Horizons 2015.

Bibliography

Chanen, A. (2016). Deep learning for extracting word-level meaning from safety report narratives. In Integrated Communications Navigation and Surveillance (ICNS), 2016 (pp. 5D2-1). IEEE.

Loper, E. and Bird, S. (2002). NLTK: The Natural Language Toolkit. Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1 (ETMTNLP '02). Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 63-70.

Mikolov, T., et al. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26 (NIPS 2013), pp. 3111-3119.

Schoch, C., Schlor, D., Popp, S., Brunner, A., Henny, U. and Tello, J. C. (2016). Straight Talk! Automatic Recognition of Direct Speech in Nineteenth-Century French Novels. In Digital Humanities 2016 (pp. 346-353).


Conference Info


ADHO - 2017
"Access/Accès"

Hosted at McGill University, Université de Montréal

Montréal, Canada

Aug. 8, 2017 - Aug. 11, 2017
