Encoding metaknowledge for historical databases

Paper, specified as "long paper"
Authorship
  1. Marc-Antoine Nuessli

    École Polytechnique Fédérale de Lausanne (EPFL)

  2. Frédéric Kaplan

    École Polytechnique Fédérale de Lausanne (EPFL)

Work text

Motivation
Historical knowledge is fundamentally uncertain. A given account of a historical event is typically based on a series of sources and on sequences of interpretation and reasoning built upon those sources. Generally, the product of this historical research takes the form of a synthesis, such as a narrative or a map, which does not give a precise account of the intellectual process that led to this result.

Our project consists of developing a methodology, based on semantic web technologies, to encode historical knowledge while documenting, in detail, the intellectual sequences linking the historical sources to a given encoding, also known as paradata [1]. More generally, the aim of this methodology is to build systems capable of representing multiple historical realities by documenting the underlying processes through which possible knowledge spaces are constructed.

Overview of the Approach
Semantic web technologies, with formal languages like RDF and OWL, offer relevant solutions for deploying sustainable, large-scale and collaborative historical databases (see for instance [2]). Compared to traditional relational databases, these technologies offer more flexibility and scalability, avoiding the painful problem of large-scale schema migrations. They are grounded in logic and thus make it easy to conduct semantic inferences. Some very stable ontologies like CIDOC-CRM, now an ISO standard, have been used successfully in the cultural heritage domain for about 20 years [3].

However, the languages used in semantic web technologies have a major limitation that prevents their direct use for encoding metahistorical information. Expressed knowledge is typically formalised with RDF triplets, which are not first-class objects of the same order as the knowledge content (RDF resources identified with URIs) that they link. For example, it is difficult to document the source, the author or the uncertainty of a given RDF statement.

One way to compensate for this flaw while respecting W3C standards, known as RDF reification, consists of transforming each RDF triplet (subject predicate object) into three triplets: (statement rdf:subject subject), (statement rdf:predicate predicate), (statement rdf:object object). Using this approach, it becomes possible to add new triplets with a given statement as subject, documenting additional paradata about this statement. The resulting knowledge base can include metahistorical information, i.e. information about the processes through which historical information is created. This metainformation can document the choice of sources, transcription phases, coding strategies, interpretation methods, and whether these steps are carried out by humans or machines. Thus, each historical database designed following this methodology integrates two levels of knowledge: the first level provides documentation about the origin, the nature and the formalisation used to encode historical data, while the second level codes for the historical data itself.
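
As a minimal illustration of this reification mechanism, the following sketch (using the Python rdflib library) decomposes a single triplet into its reified form and attaches illustrative paradata to it. All names in the ex: namespace are placeholders introduced here for illustration, not terms from the paper.

    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/")   # placeholder namespace
    g = Graph()

    stmt = EX.statement1   # resource standing for the reified triplet

    # Reified form of the triplet (ex:someSubject ex:somePredicate ex:someObject)
    g.add((stmt, RDF.type, RDF.Statement))
    g.add((stmt, RDF.subject, EX.someSubject))
    g.add((stmt, RDF.predicate, EX.somePredicate))
    g.add((stmt, RDF.object, EX.someObject))

    # Additional triplets documenting paradata about the statement itself
    # (hypothetical property names, for illustration only)
    g.add((stmt, EX.enteredBy, EX.someResearcher))
    g.add((stmt, EX.basedOnSource, EX.someArchivalDocument))
    g.add((stmt, EX.uncertainty, Literal("possible")))

    print(g.serialize(format="turtle"))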

The Knowledge Construction Vocabulary (KCV)
We are working on a specific RDF vocabulary, called the Knowledge Construction Vocabulary (KCV), which will enable us to implement this two-level organisation using the standards of the semantic web. KCV RDF statements represent knowledge construction steps, while effective historical knowledge is only expressed through reified triplets. An important concept in this vocabulary is the notion of knowledge spaces. A knowledge space designates a closed set of coherent knowledge, typically based on a defined set of sources and methods. Examples of knowledge spaces include documentary spaces (e.g. a defined corpus of sources) and fictional spaces (e.g. a coherent world typically described in a book).

Figure 1 shows an example of the kind of graph that can be built using the KCV vocabulary. In this example, two knowledge spaces have been defined: one documentary space (DHLABDocuments) and one so-called fictional space (HistoireVenise_S1). Each of these two spaces is defined as a unique resource with an associated URI. A statement (Statement1) stands for a reified triplet stating that HistoireVenise is a kind of Book, and is linked to the documentary space. The KCV vocabulary allows us to document who entered the information (fournier) and the creation time of the statement (May 6th). To formalise the fact that the book HistoireVenise is used as a knowledge source, a specific resource, HistoireVenise_KS, is created and linked both to the book HistoireVenise and to the general documentary space DHLABDocuments.

Fig. 1: A "toy" example of the use of the KCV vocabulary to code historical and metahistorical information

In the fictional space HistoireVenise_S1, a statement (Statement2) codes for a reified triplet indicating that the reconstruction of the Rialto bridge occurred during the period 1588-1591. Information about the author, the creation date and the reliability of Statement2 is documented using various KCV triplets. The link between the documentary space and the fictional space is encoded through the knowledge source HistoireVenise_KS and a statement of type interpretedtextknowledge, Statement2_origin, which is itself linked to Statement2.
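
To make the structure of this toy example more concrete, the sketch below expresses a fragment of the Figure 1 graph around Statement2 with rdflib. The kcv: namespace URI and all class and property names (kcv:DocumentarySpace, kcv:author, kcv:reliability, etc.) are assumptions made here for illustration, not the published KCV terms, and the literal values are likewise illustrative.

    from rdflib import Graph, Literal, Namespace, RDF

    KCV = Namespace("http://example.org/kcv#")       # hypothetical KCV namespace
    EX = Namespace("http://example.org/venice/")     # placeholder namespace for the data

    g = Graph()

    # Two knowledge spaces, each a resource with its own URI
    g.add((EX.DHLABDocuments, RDF.type, KCV.DocumentarySpace))
    g.add((EX.HistoireVenise_S1, RDF.type, KCV.FictionalSpace))

    # The book HistoireVenise used as a knowledge source in the documentary space
    g.add((EX.HistoireVenise_KS, RDF.type, KCV.KnowledgeSource))
    g.add((EX.HistoireVenise_KS, KCV.describes, EX.HistoireVenise))
    g.add((EX.HistoireVenise_KS, KCV.inSpace, EX.DHLABDocuments))

    # Statement2: reified triplet "the Rialto bridge reconstruction occurred in 1588-1591"
    g.add((EX.Statement2, RDF.type, RDF.Statement))
    g.add((EX.Statement2, RDF.subject, EX.RialtoBridgeReconstruction))
    g.add((EX.Statement2, RDF.predicate, EX.occurredDuring))
    g.add((EX.Statement2, RDF.object, Literal("1588-1591")))
    g.add((EX.Statement2, KCV.inSpace, EX.HistoireVenise_S1))

    # KCV paradata about Statement2 (illustrative values)
    g.add((EX.Statement2, KCV.author, EX.fournier))
    g.add((EX.Statement2, KCV.creationDate, Literal("2014-05-06")))
    g.add((EX.Statement2, KCV.reliability, Literal("high")))

    # Statement2_origin links the statement back to its knowledge source
    g.add((EX.Statement2_origin, RDF.type, KCV.InterpretedTextKnowledge))
    g.add((EX.Statement2_origin, KCV.about, EX.Statement2))
    g.add((EX.Statement2_origin, KCV.source, EX.HistoireVenise_KS))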

We can make three remarks:

This is obviously a "toy" example (real graphs encoding historical data are typically much larger), but it illustrates how historical and metahistorical information can be coded with a linked data approach. This allows us to envision queries mixing historical and metahistorical requests, for instance reconstructing a historical context based only on certain kinds of sources, or excluding information that was provided to the database by particular authors (a query of this kind is sketched after these remarks).
The kinds of intellectual processes documented by KCV can easily include algorithmic steps such as digitisation, optical character recognition pipelines, text mining, semantic disambiguation, etc. The version and the author of the algorithms used can likewise be recorded using KCV statements. This kind of documentation permits us to exclude historical information produced by early versions of algorithms that may have "polluted" the data, an important prerequisite for building sustainable databases in the long term.
Documenting metahistorical information using KCV may look like a tedious process; however, in most cases, this information can be inserted automatically through a higher-level interface. A database interface in which the user is logged in makes it easy to produce historical data based on the KCV vocabulary, in the form of reified RDF triplets, while automatically documenting the author, the date and the methods used.
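
As an illustration of the first remark, the following SPARQL sketch retrieves reified historical triplets while excluding those entered by a given author. It assumes the hypothetical kcv: terms and the rdflib graph g of the previous sketch.

    # Reusing the graph g from the previous sketch
    query = """
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX kcv: <http://example.org/kcv#>
        PREFIX ex:  <http://example.org/venice/>

        SELECT ?s ?p ?o
        WHERE {
            ?stmt rdf:subject ?s ;
                  rdf:predicate ?p ;
                  rdf:object ?o .
            # Metahistorical filter: drop statements entered by a given author
            FILTER NOT EXISTS { ?stmt kcv:author ex:fournier . }
        }
    """
    for s, p, o in g.query(query):
        print(s, p, o)
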
Ontology Matching
The KCV approach for encoding historical databases is also interesting from the perspective of ontology alignment, a notoriously difficult issue [4]. Each research group tends to code historical data using its own local ontologies, adapted to its research approach. The metahistorical documentation provided by the KCV vocabulary enables us to envision strategies for mapping such ontologies to a pivot ontology. Figure 2 shows this general process, in which several knowledge spaces are linked. Each group locally describes the source documents used (1), transcribes their content (2) and finally codes/interprets this content (3). In this process, the two groups produce two independent custom ontologies (A and B). The alignment then proceeds in two additional steps: first, both local ontologies are mapped onto a general content ontology (4) (for instance CIDOC-CRM, but not necessarily); then, once expressed in this common conceptual model, the information contained in the graphs is aligned and the content is merged (5).

Fig. 2: The general process of ontology matching
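
A rough sketch of steps (4) and (5), under the assumption that standard OWL alignment properties are used for the mapping; the namespace URIs and class names below are placeholders, not terms from the paper.

    from rdflib import Graph, Namespace, OWL, RDFS

    ONTO_A = Namespace("http://example.org/groupA#")   # local ontology of group A
    ONTO_B = Namespace("http://example.org/groupB#")   # local ontology of group B
    PIVOT = Namespace("http://example.org/pivot#")     # pivot content ontology

    # Step (4): map classes of both local ontologies onto the pivot ontology
    alignment = Graph()
    alignment.add((ONTO_A.Edifice, OWL.equivalentClass, PIVOT.Building))
    alignment.add((ONTO_B.Construction, OWL.equivalentClass, PIVOT.Building))
    # A weaker, partial correspondence can be expressed as a subclass relation
    alignment.add((ONTO_A.Bridge, RDFS.subClassOf, PIVOT.Building))

    # Step (5): once both groups' data are expressed in the common model,
    # the graphs can simply be merged (set union of their triples)
    content_a, content_b = Graph(), Graph()   # graphs produced by groups A and B
    merged = content_a + content_b + alignment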

Figure 3 gives a more detailed account of the final step: first knowledge sources are mapped, then types, and finally predicates. In some cases, only a partial level of correspondence can be reached. These steps can be done manually or automatically and are, of course, subject to errors. It is therefore crucial to document the authors of these matching steps, whether they are humans or algorithms. This is why the authors are, like all the other steps, described using the KCV vocabulary.

Fig. 3: Detail of the ontology matching process
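
Since it is crucial to record who or what produced each correspondence, one possible way to do so, consistent with the reification approach used throughout, is sketched below; the kcv: property names and the literal values are again hypothetical.

    from rdflib import BNode, Graph, Literal, Namespace, OWL, RDF

    KCV = Namespace("http://example.org/kcv#")        # hypothetical KCV namespace
    ONTO_A = Namespace("http://example.org/groupA#")
    PIVOT = Namespace("http://example.org/pivot#")
    EX = Namespace("http://example.org/")

    g = Graph()

    # The correspondence itself is reified so that it can be documented
    match = BNode()
    g.add((match, RDF.type, RDF.Statement))
    g.add((match, RDF.subject, ONTO_A.Edifice))
    g.add((match, RDF.predicate, OWL.equivalentClass))
    g.add((match, RDF.object, PIVOT.Building))

    # Paradata: this mapping was proposed by an algorithm, in a given version,
    # with a partial level of confidence (hypothetical properties and values)
    g.add((match, KCV.author, EX.matchingAlgorithm))
    g.add((match, KCV.algorithmVersion, Literal("0.3")))
    g.add((match, KCV.confidence, Literal(0.87)))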

Conclusion
The approach briefly presented in this paper enables us to encode historical and metahistorical data in a unified framework. The method we describe is fully compliant with the current technologies and standards of the semantic web (RDF, SPARQL, etc.). It does not impose a unified historical terminology but can be used in conjunction with existing standards. For instance, CIDOC-CRM can be used to describe historical knowledge extracted from archival documents (e.g. events, people, places) using RDF triplets, while KCV can be used to code information about the CIDOC-CRM triplets themselves, such as documenting who entered a particular triplet. The originality of our proposal comes from the introduction of this second, metahistorical level on top of the existing RDF ontologies. This does not necessarily impose an additional burden on the person encoding the historical data: using a dedicated web interface, the metahistorical information can be added automatically as the data is progressively entered.

Coding metahistorical information by making explicit the many underlying modelling processes allows us to prepare for possible ontology evolution and makes ontology matching easier. More importantly, our approach does not impose the search for a global truth (a unique and common version of historical events) but pushes towards making explicit the intellectual and technical processes involved in historical research, thus opening the possibility of fully documented historical reconstructions.

References
1. Bentkowska-Kafel, A., Denard, H., and Baker, D. (2012). Paradata and Transparency in Virtual Heritage. Ashgate Publishing, Ltd.

2. Ide, N., and Woolner, D. (2007). Historical Ontologies. Words and Intelligence II: 137–152.

3. Doerr, M. (2003). The CIDOC CRM – an Ontological Approach to Semantic Interoperability of Metadata. AI Magazine, 24(3).

4. Shvaiko, P., and Euzenat, J. (2013). Ontology matching: state of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering, 25(1): 158–176.


Conference Info


ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO