XML2RDF – Extracting RDF Statements From XML Resources With XTriples

workshop / tutorial
Authorship
  1. 1. Max Grüntgens

    Akademie der Wissenschaften und der Literatur (ADW) Mainz

  2. 2. Thomas Kollatz

    Akademie der Wissenschaften und der Literatur (ADW) Mainz

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


The tutorial focuses on scholars, who want to acquaint themselves with a “low threshold” — generic, simple yet powerful, and explorative — workflow of modelling, extracting, processing (queries, visualizations), and publishing of “structured data” (LOD, Semantic Web) from heterogeneous XML-sources by means of XPath. The instructors assume basic acquaintance with XML and XPath (both will be quickly revised, “cheat sheets” will be provided).

Number of Participants
12 persons.

Special Requirements
Own laptop with an XML-Editor (we recommend the oXygen XML editor with a trial license) and a modern web browser.

Focal points
Triple-Statement-Extraction from heterogeneous XML-Resources, Basic-Research-Workflow (Statement-Formulation, Statement-Extraction, Triple-Store-Import, SPARQL-Querying, Visualization).

Introduction
RDF is an abstract and high level data model and RDF statements themselve are therefore to such an extent abstractly formulated that they inevitably transcend the data heterogeneity ingrained in individualized notations and practices essential to XML encoded resources. Printed as well as digital scholarly editions in this regard for the most part oscillate between more data centered and more narrative centered verbose approaches. RDF may thus function as a bridge technology providing the means to mirror several heterogeneous XML resources onto one common abstract (meta) data layer of interconnected as well as outward pointing data points, that is easily and efficiently queryable and visualizable. In this way RDF and LOD are extending the capabilities of XML resources without cluttering of the underlying documents or imposing of external restrictions for the encoding scholars (see Brodhun, de la Iglesia, Moretton (2015), Grüntgens, Schrade (2016), 61–62); also compare Ciotti, Tomasi (2016), Ciotti (2018)).
One of the main focus points of the workshop is thus to make it clear to the participants that it is pivotal to make an effort to recognize the ‘implicit semantic potential’ already inherent in human-readable XML-resources, e.g. in the form of markup like

(‘implicit’ in this context means not being formally expressed as RDF triples in a RDF serialization format, like e.g. RDF-XML; for more detailed examples see Grüntgens, Schrade (2016), 54, Ciotti, Tomasi (2016), pars. 46–49). This potential accordingly “[has] to be transformed into explicit semantic annotations (e.g. RDF) to make the data usable for Semantic Web approaches [...]. On the whole, the process of translating XML to RDF is therefore mainly focused on the determination of general statement patterns, which can then be applied to and extracted from all resources of the data set in question.” (Grüntgens, Schrade (2016), 56).

Relevance
In the above mentioned context, many scholars may well enough know about the obvious advantages of supplementing their XML resources with an additional RDF layer and nevertheless may find it difficult to bridge the gap between XML and RDF. A common obstacle is in this regard, that statement extraction from XML may deteriorate into overly complex or semi-manual data wrangling. Additionally, most ontologies, like CIDOC-CRM, are — or rather may at least be perceived as being — too complicated and intricate for a
first “low-threshold” explorative approach to statement extraction in particular and the structured data ecosystem in general.

The tutorial thus wants to mitigate both hurdles by teaching how to utilize T. Schrade’s
XTriples web service for RDF statement extraction (see Grüntgens, Schrade (2016)). The web service provides a low-threshold approach to statement extraction from
any well formed XML resource — be it a full-fledged REST-API serving TEI-XML or the entrails of a word processor file like .DOC, .ODF, or .FODT — via XPath and triple building via simple XML statements. With the aid of a configuration template written in simple XML scholars are quickly enabled to build first statements for prototypical research that are “flat”— e.g. not build upon a complex ontology — and in this regard to gradually acquaint themselves further with the extended toolbox around RDF (different serialization formats, triplestores, SPARQL, ontologies, …). These “flat” patterns, however, have nevertheless to be properly formalized and operationalized in regard to the research question. Thus, after having worked out a prototype, the scholar may (and should) subsequently lift the prototypical extraction on to a distributable level by incorporating more complex ontologies into the statements described in the configuration template hereby providing the semantic interoperability needed, e.g. for LOD (concerning this necessity see f.e. Eide, Ore (2019), 194–195).

Tutorial
The tutorial will start with a short introduction into the basic assumption behind the RDF data model, a short overview of a workflow in regard to statement extraction with XTriples, and a quick recap of the basics of XPath.
The tutorial will then use research data provided by the API of the scholarly edition “Der Sturm: Digitale Quellenedition zur Geschichte der internationalen Avantgarde” (
; see Trautmann (2017)). The edition provides several APIs that provide data about encapsulating persons, places, works, letters, and files. The participants will form small work groups, evaluate the APIs contents, and will henceforth formulate whiteboard-friendly “flat” and verbose statements about the
resources, e.g. not corresponding to an established ontology.

Subsequently the participants will inscribe these statement patterns into the XTriples configuration file in order to automatically extract the desired statements by means of form-style POST requests.

The participants will subsequently reiterate the process of adding statements to the configuration, extracting, and subsequently evaluating the statements. When a (first) sound basis has been achieved, the participants will make a final extraction and then import their RDF-XML files into the web based RDF4j triple store (provided for the tutorial by the instructors), where the separate graphs will be merged automatically.

Then the participants will be introduced to the basics of SPARQL. Subsequently the groups will — instructed by the instructors — perform some simple queries to become acquainted with SPARQL and the corresponding query endpoint.

Further more complicated queries that aggregate information in a tabular form, may be executed from ‘saved queries’, written earlier by the lecturers, in order to facilitate the process.

The aggregate will then be visualized with the web based service
https://rawgraphs.io. To make it clear: This section’s focal point lies not on performing the most intricate queries, but rather on showing in a lucid way how and based on what kind of technological foundation a simple aggregation and subsequent visualization of RDF data may be performed, thus clearly outlining the conceptual side of the workflow as a whole.

Outline

Time
Topic
Mode
Note

10:00–10:30
Short introduction to RDF and the basic structure of an extraction workflow via XTriples; very short recap of basics of XML & XPath.
lecture
Slides; XPath cheat sheet

10:30–11:45
Modelling statements and writing of the XTriples configuration file. First extractions.
Group work (instructed); supported by the instructors
Via web based Form-style POST request on

11:45–12:00
Final statement extraction
Group work

12:00–13:00
Lunchbreak

13:00–13:15
Import into Triplestore
Group work (instructed)
Web based RDF4j-triplestore provided by instructors.

13:15–13:45
Basic querying of Triplestore via SPARQL endpoint
Group work (instructed)
SPARQL-endpoint provided by instructors; SPARQL cheat sheet

13:45–14:00
Visualization via rawgraphs.io
Group work (instructed)

Bibliography

Brodhun, M., de la Iglesia, M., Moretto, N.
(2015). Metadaten, LOD und der Mehrwert standardisierter und vernetzter Daten. In Neuroth, H. et al. (eds),
TextGrid: Von der Community – für die Community
, Göttingen: Universitätsverlag Göttingen, pp. 91–102,

Ciotti, F.
(2018). A Formal Ontology for the Text Encoding Initiative.
Umanistica Digitale
, 3,

Ciotti F. and Tomasi, F.
(2016). Formal Ontologies, Linked Data, and TEI Semantics.
Journal of the Text Encoding Initiative
,
9,

Eide, Ø., Ore, C.-E. (2019). 8. Ontologies and Data Modeling. In Flanders, J., Jannidis, F. (eds),
The Shape of Data in Digital Humanities Modeling Texts and Text-based Resources. London, pp. 178–196.

Grüntgens, M., Schrade, T.
(2016). Data repositories in the Humanities and the Semantic Web: modelling, linking, visualising. In Alessandro, A. et al. (eds),
Proceedings of the 1st Workshop on Humanities in the Semantic Web co-located with 13th ESWC Conference 2016 (ESWC 2016),
Anissaras, pp. 53–64 URL:

Schrade, T.
(2016).
XTriples. A generic webservice for RDF lifting from XML resources
,

Trautmann, M., Schrade, T.
(2018).
DER STURM. Digitale Quellenedition zur Geschichte der internationalen Avantgarde
. Mainz: Akademie der Wissenschaften und der Literatur,

Trautmann, M.
(2017).
Eine digitale Edition. Ausgewählte Briefe von Jacoba van Heemskerck und Franz Marc an Herwarth Walden (1914–1915)
. Mainz,

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019
"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Series: ADHO (14)

Organizers: ADHO