Converting Medieval Documents into a Searchable Database

poster / demo / art installation
Authorship
  1. Caroline Sporleder

     Universität Trier

  2. Susanne Fertmann

     Universität des Saarlandes (Saarland University)

  3. Tim Krones

     Universität des Saarlandes (Saarland University)

  4. Robert Kolatzek

     Universität des Saarlandes (Saarland University)

  5. Isolde Teufel

     Library - Albert-Ludwigs-Universität Freiburg (University of Freiburg)

Work text

1 Introduction
More and more historical documents are being digitised. In their raw form, however, digitised sources are usually unstructured texts, on which only keyword search is possible. This limits their usefulness, even when searching for entities, because a person may be referred to in a variety of ways, e.g. by name (Henry VIII), by title (Lord of Ireland), by ancestry (son of Henry VII) or by role (the undersigned). Furthermore, spellings of proper names vary frequently in historical documents (Greiffenclau, Griffenclae), and persons can also be referred to by pronouns. Searching for all variants is not a perfect solution either, since noun phrases such as the King of England can, of course, refer to different entities in different contexts. Finding a particular entity is thus a laborious and error-prone task. If one is interested in certain types of events, for instance all SELLING events [Fertmann, 2013], it becomes even more difficult to locate the desired information. Hence, digitised sources should ideally be enhanced with a semantic-structural annotation layer. Unfortunately, doing this manually is time-consuming, sometimes prohibitively so [Schäfer et al., 2012].

We describe our work on automatically converting a manually compiled collection of historical manuscripts, first into XML-annotated (structured) text and then into a searchable database with a web-based front end. In the conversion process, we exploit typographic and structural cues as well as heuristics based on domain knowledge. The methods we describe are not in themselves entirely novel; detecting structure in superficially un- or semi-structured texts has been the topic of various research projects. The aim of this paper is thus not so much to introduce novel techniques as to show how relatively simple techniques can be applied to a particular type of collection and thereby make it much more accessible and useful.

2 The Data
The data are a collection of regesta, i.e. summaries of medieval charters, pertaining to the City of Saarbrücken and covering a period of nearly 950 years, from 601 to 1545 AD [Eder-Stein, 2012]. Compiling this collection, originally on file cards and later electronically in a Microsoft Word document, started in the late 1950s and ended in 2011. Initially, only an open access book publication was intended, hence the use of a Word document to collect the data, which is, of course, suboptimal from a processing point of view.

Since the manuscript was intended for off-line reading rather than electronic processing, the structure of each regest remained largely implicit, being signalled mainly by typographical means. Figure 1 shows an example. The first line is printed in bold face and contains the year and sometimes also the place of issue, as well as additional information, e.g. regarding the reliability of the dating. This line is also used as a unique identifier for the regest in the collection. Then follows the modern German summary of the manuscript. Some passages, often names, are left in their original form and printed in italics, e.g. Sarebrugka debellatur. Various types of metadata follow, including the original dating, signatories, and archival information. Not all metadata are available for every regest, but when they do occur, their order is fixed and they are usually preceded by keywords such as Druck (print).

Fig. 1: Regest
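Because the ordering of these elements is fixed and the cues are typographical, the segmentation can be pictured as a single pass over the formatting runs of the Word file. The following Python sketch assumes that runs with bold/italic flags have already been extracted (e.g. with a library such as python-docx); the function names, keyword list and example regest are illustrative, not the authors' actual code or data.

    from dataclasses import dataclass

    @dataclass
    class Run:
        text: str
        bold: bool = False
        italic: bool = False

    # "Druck" (print) is the metadata keyword named in the text;
    # a real keyword list would contain one entry per metadata type.
    METADATA_KEYWORDS = ("Druck",)

    def segment_regest(runs):
        """Split a regest into header, summary and metadata parts."""
        header, summary, metadata = [], [], []
        in_metadata = False
        for run in runs:
            if run.bold and not summary and not in_metadata:
                header.append(run.text)      # bold first line: year/place
            elif any(run.text.lstrip().startswith(k)
                     for k in METADATA_KEYWORDS):
                in_metadata = True
                metadata.append(run.text)
            elif in_metadata:
                metadata.append(run.text)
            else:
                summary.append(run.text)     # italic runs are original passages
        return "".join(header), "".join(summary), "".join(metadata)

    # Hypothetical example regest, loosely modelled on the description above.
    runs = [Run("1234 Saarbrücken", bold=True),
            Run("Modern German summary with "),
            Run("Sarebrugka debellatur", italic=True),
            Run("Druck: ...")]
    print(segment_regest(runs))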

The book also contains an extensive index, which not only lists named entities (NEs) referred to in the texts but also provides valuable additional information, e.g. about alternative spellings, family relationships (Metze, Witwe des Schultheiß Nikolaus “Metze, widow of ...”), relationships between locations and persons (Einwohner “residents”), titles and roles of persons (Ritter “knight”, herrschaftlicher Schneider “stately tailor”), localisation of place names (Kelz, Dorf (Dep. Haute-Saône, F.)), and contexts in which an entity was mentioned in the text (Besuch in Saarbrücken “visit to Saarbrücken”). Figure 2 shows three index entries.

3 Related Work
Several projects are dedicated to digitising medieval charters and making them available via sophisticated web interfaces. The best known is Regesta Imperii, which provides electronic access to a collection of charters from the period of the Holy Roman Empire [Kuczera, 2005]. Another project is the Charters Encoding Initiative, which has been running since 2004. However, as far as we know, in all of these projects the underlying database is built manually rather than by extracting information from existing texts.

On the other hand, there is a large body of work concerned with determining structure in texts using supervised [Borkar et al., 2001, Viola and Narasimhan, 2005] or unsupervised [Grenager et al., 2005] machine learning, as well as bootstrapping from existing resources [Canisius and Sporleder, 2007]. Our methods are not as sophisticated, nor do they have to be, since we can infer a lot of information from typographic cues and domain knowledge. Also related are studies which aim at identifying structure in published papers [Schäfer et al., 2012]. Typically, these, too, employ heuristics [Schäfer and Weitz, 2012].

Fig. 2: Index Entries

4 From Unstructured Text to Searchable Database
The structure of regest texts and index entries is only implicitly encoded by formatting (type face, level of indentation) and by the ordering of elements. Many of these devices are ambiguous; e.g. italics are predominantly used to indicate original passages, but metadata information is also sometimes set in italics (e.g. vgl. “cf.”). We implemented heuristics to determine the function of a piece of text depending on its typographical properties and its position relative to other text elements. We also made use of a limited set of manually supplied keywords, such as titles, honorifics and words for groups of persons, mainly for parsing the index entries. This procedure should be relatively easy to adapt to other, similar collections. Once the main structural blocks had been identified, index and regest entries could be linked. Because the index lists NEs and links them to the regesta, we could also relatively easily identify and disambiguate the entities referred to in the regesta themselves, thus avoiding a separate NE recognition step.
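A minimal sketch of what such keyword-and-position heuristics might look like for index entries; the keyword sets, function name and category labels are illustrative assumptions drawn from the examples above, not the paper's actual implementation.

    # Illustrative keyword sets; real lists would be larger and curated.
    TITLES = {"Ritter", "Graf", "Schultheiß"}     # titles and honorifics
    GROUP_WORDS = {"Einwohner", "Witwe"}          # groups/relation words

    def classify_index_token(token, indent_level):
        """Guess the function of an index-entry token from keywords
        and its position (indentation level) within the entry."""
        if indent_level == 0:
            return "headword"         # top-level line names the entity
        if token in TITLES:
            return "title"
        if token in GROUP_WORDS:
            return "relation"
        if token.startswith("("):
            return "localisation"     # e.g. "(Dep. Haute-Saône, F.)"
        return "context"

    print(classify_index_token("Ritter", 1))      # -> "title"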

Fig. 3: XML Mark-Up in the Index

An XML schema was designed for representing the collection. We tried to comply with the TEI guidelines [Burnard and Bauman, 2013] whenever possible; however, since TEI does not explicitly cover medieval charters, we had to deviate from them occasionally. In general, we designed the schema in such a way that it is extensible. In particular, all automatically inferable information types should be encodable, even if we do not extract them immediately. For example, we refrained from automatically identifying the issuer of a document; however, the issuer can normally be identified relatively reliably, as it is typically the first person named in the text. Figure 3 shows part of an (instantiated) example of an index entry of type LOCATION. We encode the type of settlement (Dorf “village”), whether it is abandoned or not, the name and alternative names, and the area, district and region, if provided.
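Since Figure 3 is not reproduced in this copy, the following is a hypothetical reconstruction of what such a LOCATION entry could look like; the element and attribute names are assumptions based on the fields just listed, not the schema's actual vocabulary.

    <!-- Hypothetical LOCATION index entry; names are illustrative. -->
    <place type="Dorf" abandoned="false">
      <name>Kelz</name>
      <altName>...</altName>
      <district>...</district>
      <region>Dep. Haute-Saône, F.</region>
    </place>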

Using the heuristics, an XML marked-up version of the summaries and index was automatically generated from the original Word document. Additionally, the extracted information was stored in a database. We consider the XML file the primary format. However, a database is useful because it can be employed as a straightforward backend for a web application. Furthermore, if the collection is extended at a later stage, a database offers a simple interface for data entry. Additions to the collection can be entered directly in the database, thus rendering the automatic conversion from unstructured text to structured XML unnecessary. The updated database can easily be exported to XML, and the proofs for future editions of the book can be generated directly from the XML file. Finally, we created a web interface for searching and browsing the collection, which also implements additional features, such as a timeline allowing users to see all regesta from a pre-selected period.
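The database-to-XML export mentioned above is straightforward to sketch; the table and element names below are hypothetical, as the paper does not describe its database schema.

    import sqlite3
    import xml.etree.ElementTree as ET

    # Assumed table: regest(id, date, summary); adapt to the real schema.
    conn = sqlite3.connect("regesta.db")
    root = ET.Element("regesta")
    for rid, date, summary in conn.execute(
            "SELECT id, date, summary FROM regest ORDER BY date"):
        reg = ET.SubElement(root, "regest", id=str(rid))
        ET.SubElement(reg, "date").text = date
        ET.SubElement(reg, "summary").text = summary
    ET.ElementTree(root).write("regesta.xml", encoding="utf-8",
                               xml_declaration=True)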

5 Conclusion and Future Work
Where originally only keyword search could be performed on the collection, automatic processing has greatly enhanced its utility. Linking NEs to index entries now allows entity-based search. Furthermore, users can search explicitly for certain information types, e.g. archival information. The explicit mark-up of the documents makes it trivial to implement more complex, logical search options, such as searching for co-occurring entities. It is also possible to link to external resources such as maps, or to show family trees based on the index information. Finally, the database makes the data more easily extensible and allows for error and consistency checking.
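As an illustration of such a logical search, a co-occurrence query can be expressed as a self-join over a table linking entities to regesta; the table and column names here are assumptions, not the project's actual schema.

    import sqlite3

    # Assumed table: mention(regest_id, entity_id).
    conn = sqlite3.connect("regesta.db")
    query = """
        SELECT DISTINCT m1.regest_id
        FROM mention AS m1
        JOIN mention AS m2 ON m1.regest_id = m2.regest_id
        WHERE m1.entity_id = ? AND m2.entity_id = ?
    """
    entity_a, entity_b = 42, 99   # hypothetical entity ids
    for (regest_id,) in conn.execute(query, (entity_a, entity_b)):
        print(regest_id)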

With the basic structure in place, one can go further and identify co-reference chains, semantic argument structures or document topics, or extract information, e.g. about all financial transactions in which a given monastery was involved or about the contexts in which certain groups of people are mentioned.

The enhanced collection is not only useful for historians. It is also a valuable resource for students and pupils studying medieval or local history. Historical linguists have also expressed an interest in using it to study historical place names. Moreover, we expect significant demand from historically inclined laypersons, especially those living in the region covered by the collection.

References
Vinayak Borkar, Kaustubh Deshmukh, and Sunita Sarawagi (2001). Automatic segmentation of text into structured records. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 175–186.

Lou Burnard and Syd Bauman, editors (2013). TEI P5: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative Consortium, Charlottesville, Virginia.

Sander Canisius and Caroline Sporleder (2007). Bootstrapping information extraction from field books. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 827–836, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

Irmtraud Eder-Stein, editor (2012). Regesten zur Geschichte der Stadt Saarbrücken (bis 1545). Publikationen der Saarländischen Universitäts- und Landesbibliothek. universaar: Universitätsverlag des Saarlandes. Bearbeitet unter Verwendung von Vorarbeiten von Hanns Klein.

Susanne Fertmann (2013). Extraction of selling events from historical documents. Bachelor thesis, Saarland University.

Trond Grenager, Dan Klein, and Christopher Manning (2005). Unsupervised learning of field segmentation models for information extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 371–378, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.

Andreas Kuczera (2005). Die Regesta Imperii Online. In Deutsche Kommission für die Bearbeitung der Regesta Imperii e.V. bei der Akademie der Wissenschaften und der Literatur in Verbindung mit der Bayerischen Staatsbibliothek in München, editor, Workshop Buch und Internet - Aufbereitung historischer Quellen im digitalen Zeitalter, pages 3–4.

Ulrich Schäfer and Benjamin Weitz (2012). Combining OCR outputs for logical document structure markup. Technical background to the ACL 2012 contributed task. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, pages 104–109, Jeju Island, Korea, July 2012. Association for Computational Linguistics.

Ulrich Schäfer, Jonathon Read, and Stephan Oepen (2012). Towards an ACL Anthology Corpus with logical document structure. An overview of the ACL 2012 contributed task. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, pages 88–97, Jeju Island, Korea, July 2012. Association for Computational Linguistics.

Paul Viola and Mukund Narasimhan (2005). Learning to extract information from semistructured text using a discriminative context free grammar. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 330–337.


Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO