In the digitization era, documents provided by historical archives are at prompt disposal of digital humanities researchers that could directly employ them in text mining analyses. However, the high degree of textual variability and domain specificity of such materials pose several methodological and technical issues for scholars aiming to automatically extract information from historical collections
(Sporleder, 2010, Van Hooland et al., 2013
Ehrmann et al., 2016)
This abstract reports on a recently concluded project that develop
for extracting events, participants and their roles from a digitized corpus of Italian memoirs of Resistance members during the Second World War. In particular, in our work we have adopted and adapted resources, techniques and tools from research literature in information extraction to provide advanced semantic access to the collection. We chose events as structural concept for extracting and representing textual information as they offer a “natural” pivot, keeping together semantic information at different levels (type of the event, time, space, role of the involved entities). We are also encouraged on this way by the findings of a recent work
(Sprugnoli and Tonelli, 2017)
, which suggested,
, the necessity to expand the inventory of event types (compared to existing annotation schemas like ACE, ERE
(Aguilar et al., 2014)
(Pustejovsky et al., 2005)
The textual dataset is composed of 25 digitized memoirs of Italian partisans. Since our pipeline relies upon a pattern-based method for the initial event extraction, we integrate the resource with two other subcorpora on the same topic, obtained from the web: all texts in the Italian Wikipedia category “World War II Resistance movements”
and a set of biographies of Italian partisans from Wikisource
. In addition, three more books have been digitized that compose the “Memoirs-test” subcorpus and are used for evaluation.
Composition of textual corpus for event extraction.
Extraction of events, participants and roles
The methodology developed in our project combines Named Entity Disambiguation (NED), Semantic Tagging and Role Labeling: as a first step, frame-like event patterns are collected across the whole corpus. Given the domain at hand (war memoirs), we mainly focus on movements of persons and artifacts, conflictual events and events related to organizations.
Adopted Semantic Types (plotted by similarity).
The event extraction
consists of two macro-steps: first, NED is performed against a set of three gazetteers of Persons, Places and Organizations; also, all non-named entities (nouns) are tagged with semantic classes (Figure 1) by means of a semantic tagger based on word embeddings.
The output of this first stage is then used as input for a pattern matching algorithm that, based on a previously collected dictionary of lexico-semantic patterns of argument structures, extracts event mentions from text, classifies them into discrete classes and assigns the detected participants a role with respect to the event itself. The pattern dictionary built for the purpose counts 246 lexical units, i.e., event-denoting lemmas: 124 verbs, 77 nouns and 45 multi-word verbal expressions; such lexical units map overall to 88 Event Classes, where the relationship between lexical units and event classes is many-to-many. The system is designed to separate high and low confidence event mentions based on the number of correctly labeled arguments.
A detalied account of the adopted methodology can be found in Rovera et al., 2018.
The evaluation of the system is performed on a set of 300 manually annotated sentences taken from the “Memoirs-test” corpus, which has not been used for collecting patterns. In particular, we evaluate the performance of the system for
event detection, where we achieve .78 Precision and .50 Recall, and
in event classification, where we score .73 Precision (.89 where only high confidence events are taken into account).
In order to display the extracted information, we represent events, participants and roles as a network. Events and entities are represented as nodes (labeled with class and type, respectively) linked by edges, each labeled with the role that the entity plays in the correspondent event.
Visualization of a network of events, entities and roles.
The linking of a relevant number of named entities enables a rich visual representation (Figure 2) of the extracted knowledge, along with the capability to build queries based either on a specific entity, on a class of events, or on a more complex combination of roles, events and entities.
By presenting our work at the conference we hope to foster the use of event, participants and roles for obtaining a semantically advanced access to historical corpora.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
Hosted at Utrecht University
July 9, 2019 - July 12, 2019
436 works by 1162 authors indexed
Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html
Series: ADHO (14)