A novel approach for a reusable federation of research data within the arts and humanities

paper, specified "long paper"
Authorship
  1. Tobias Gradl

    Otto-Friedrich-University of Bamberg

  2. Andreas Henrich

    Otto-Friedrich-University of Bamberg

Work text

1 Introduction

In the distributed systems literature, the orthogonal but interdependent characteristics of autonomy, distribution and heterogeneity are used to classify distributed systems [1,2]. From a holistic perspective on the arts and humanities, collections have evolved over decades or centuries within highly autonomous disciplines and institutions and are widely distributed, which has resulted in heterogeneous perspectives and data models [3]. Despite its negative connotation as a data integration problem, the term heterogeneity also reflects the diversity of research methodologies within the disciplines of the arts and humanities. Resolving heterogeneity hence implies abstracting from the specifics that are valuable for focused disciplinary and interdisciplinary research projects.
Our approach presents a novel concept for data federation in the arts and humanities, which focuses on the needs of research projects as well as on interdisciplinary and broad use-cases. We especially address the reusability of explicated knowledge on correlations between schemata and digital collections and show where domain experts are required to bridge semantic gaps.
2 Background

Approaches to data integration often follow the theoretical foundation expressed in [4] by employing the concept of a global view. While being highly distinctive in terms of their underlying concepts, established examples such as ISIDORE [5], OAIster [6] and Europeana [7] share the goal of facilitating access to a wide range of research data through integrated schemata or ontologies. Aside from such broad services, an integration need that focuses on a specific topic and related research questions is addressed by the Steinheim Institute, which provides a search in the context of German-Jewish history and Judaism [8].
Apart from the usability concerns of having to identify relevant services and, accordingly, collections, the recurring need to overcome the same aspects of heterogeneity in reaction to new use-cases is one of the problems we address.
2.1 Use cases

Our federation concept is primarily focused on the realization of the DARIAH-DE Generic Search (http://dev3.dariah.eu/search), which supports queries over large sets of unrelated collections (broad search) and over tightly correlated data (deep search), each with different information needs in mind:
– Broad search: Due to their quantity and distribution, the relevance of digital collections for particular research questions is not easily assessed. A broad view assists scholars in finding and evaluating possibly relevant data. Figure 1 shows an exemplary, collection-level aggregation of results based on term statistics in the prototype of our search, which will be continuously extended by other relevant visualization techniques (e.g. with respect to spatial and temporal aspects).
– Deep search: If the granularity of local data models can be used to formulate more specific queries targeting structure and content (e.g. in search facets), the deep search utilizes mappings specified in the DARIAH-DE Schema Registry. Broad search gradually fades into deep search with an increasing number and richness of mappings and hence typically smaller sets of selected collections.
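
A collection-level aggregation of the kind mentioned for the broad search could be sketched roughly as follows. This is only an illustration: the hit list, the collection identifiers and the function name are hypothetical, and simple term matching stands in for whatever term statistics the prototype actually uses.

```python
from collections import Counter

# Hypothetical search hits: (collection identifier, matched text).
HITS = [
    ("collection-a", "letters from Bamberg"),
    ("collection-b", "Bamberg cathedral records"),
    ("collection-a", "Bamberg city archive"),
    ("collection-c", "maps of Lausanne"),
]

def aggregate_by_collection(hits, term):
    """Count, per collection, the hits whose text contains the query term."""
    counts = Counter()
    needle = term.lower()
    for collection, text in hits:
        # Whitespace tokenization is an assumption made for brevity.
        if needle in text.lower().split():
            counts[collection] += 1
    # Collections with the most matching hits come first.
    return counts.most_common()

summary = aggregate_by_collection(HITS, "Bamberg")
```

Such a summary lets a scholar judge which collections are worth a closer look before any schema-level mappings are involved.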

Fig. 1: Result aggregation in the generic search
Beyond the focus on the generic search with its virtual integration at query time, the proposed concept also addresses requirements of a materialized integration of data:
– Data migration and consolidation: Traditional applications of data integration often do not require a dynamic adaptation to selected collections, but determine a set of relevant data sources and an appropriate integration schema or ontology [4]. Examples include data migration induced by the introduction of new information systems (e.g. replacement of outdated archive information software) or the consolidation of selected data sources under a merged schema for the purpose of interdisciplinary analysis and visualization, e.g. in the DARIAH-DE GeoBrowser [9].
2.2 Problem Definition

The common objective of data integration approaches is to resolve heterogeneity on various levels: Syntactical aspects such as the existence of different access and encoding methods can be solved by technical means, whereas structural and semantic heterogeneity depend on the application of background knowledge [1]. Despite continuing efforts in the fields of schema and ontology matching, the manual intervention of domain experts has proven essential for generating high-quality results, especially for the large or complex schemata and ontologies often found in the arts and humanities [10].
The correlation of the used schemata and ontologies is an inherently complex manual task in our context, which depends on the fragmented and distributed knowledge of individual disciplines, collections and scholars. Requiring a common understanding, research projects concentrate knowledge about schemata and semantics used in relevant collections and specify meanings and correlations. In order to integrate the described data and establish technical interoperability, an application of digital methods and tools is required.
3 Concept

Abstracting from aspects of technical and syntactical heterogeneity concerned with accessing, preprocessing and integrating data in a generic fashion, we aim to enable researchers to focus on those aspects of integration that depend on their knowledge and expertise: the description and correlation of schemata and ontologies. Beyond the immediate benefit for individual integration tasks, the centralized formalization and explication of semantics yields the significant advantage of knowledge reusability.
3.1 Semantic cluster

The logical architecture of our idea is represented by a directed, weighted graph, in which schemata and ontologies are described by vertices and the mappings between them are symbolized by edges. Whereas correlations between structural elements symbolize a relation of the described concepts (e.g. persons, locations) and could be considered undirected, the more specific rules required for data transformation can be composed of non-reversible functions (e.g. the concatenation of fields). For that reason, parallel edges are required to describe both mapping directions. Differences between schemata in terms of their complexity and expressiveness reduce the achievable level of accumulated completeness, which is represented by the value of cohesion.
Figure 2 indicates how the cohesion between schemata can be utilized to suggest semantic clusters: C1, C2 and C3 could be the result of research projects, which needed a high level of mapping completeness between relevant schemata. By interrelating clusters or generically used schemata (S10), the expressed semantics can be reused in other contexts.
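
The graph model can be illustrated with a minimal sketch. The schema names, the cohesion values, the threshold and the clustering rule (connected components over edges that are strong in both directions) are assumptions made for this example and are not taken from the described system.

```python
from collections import defaultdict

# Hypothetical directed mappings: (source schema, target schema) -> cohesion in [0, 1].
# Both directions are stored separately because a mapping need not be reversible.
MAPPINGS = {
    ("S1", "S3"): 0.9, ("S3", "S1"): 0.7,
    ("S2", "S3"): 0.8, ("S3", "S2"): 0.8,
    ("S4", "S5"): 0.9, ("S5", "S4"): 0.6,
    ("S3", "S10"): 0.3, ("S10", "S3"): 0.2,
    ("S5", "S10"): 0.3, ("S10", "S5"): 0.3,
}

def suggest_clusters(mappings, threshold):
    """Group schemata linked by mappings whose cohesion meets the threshold in both directions."""
    neighbours = defaultdict(set)
    schemata = set()
    for (source, target), cohesion in mappings.items():
        schemata.update((source, target))
        reverse = mappings.get((target, source), 0.0)
        if min(cohesion, reverse) >= threshold:
            neighbours[source].add(target)
            neighbours[target].add(source)
    clusters, seen = [], set()
    for schema in sorted(schemata):
        if schema in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [schema], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

clusters = suggest_clusters(MAPPINGS, threshold=0.6)
```

With these invented values, S10 stays outside the clusters because its mappings are too coarse, which mirrors its role as a generically used schema that connects clusters rather than belonging to one.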
3.2 Use-case orientation

Our example indicates the difference to the commonly found integration pattern of a global ontology or schema. Despite its theoretical foundation, simplicity and proven applicability for broad integration use-cases [5,6,7], we consider the approach to be impracticable for a holistic context of the arts and humanities because a global structure would either have to be an abstraction from collection or discipline specifics or unmanageably complex.
Narrowing this context to individual domains or research projects, standards could be elected as appropriate integrative structures. As exemplified in Figure 2, the schemata S3, S5 and S8 form the integration baseline within our clusters due to their cohesion with other schemata. Considering our deep search as well as our data migration and consolidation use-cases, these schemata can be utilized to generate a fine-grained view over selected collections accessible within the cluster. In order to support interdisciplinary use-cases, clusters can be combined (symbolized by the strong cohesion between S5 and S8) to resolve semantic gaps.
For broad use-cases we rely on the collaborative and continuous emergence of schemata or ontologies (compare S10) within our federation that are used to connect the clusters on the coarse levels sufficient for broad use-cases.
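
How an integration baseline might be identified within a cluster can be sketched as follows. The selection rule (the schema with the highest summed incoming cohesion) and all values are assumptions for illustration; the text only states that baseline schemata stand out through their cohesion with other schemata.

```python
# Hypothetical directed cohesion values between the schemata of one cluster.
COHESION = {
    ("S1", "S3"): 0.9, ("S3", "S1"): 0.7,
    ("S2", "S3"): 0.8, ("S3", "S2"): 0.8,
    ("S1", "S2"): 0.4, ("S2", "S1"): 0.3,
}

def integration_baseline(cluster, cohesion):
    """Return the schema onto which the other cluster members map most completely."""
    def incoming(target):
        # Sum the cohesion of all mappings that point at the candidate schema.
        return sum(cohesion.get((source, target), 0.0)
                   for source in cluster if source != target)
    # Sorting first makes the choice deterministic if two candidates tie.
    return max(sorted(cluster), key=incoming)

baseline = integration_baseline({"S1", "S2", "S3"}, COHESION)
```

Other selection rules, for instance a minimum cohesion over all members, would be equally conceivable; which one is appropriate is ultimately a judgment for the domain experts.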

Fig. 2: Semantic clusters of schemata
3.3 Scalability considerations

The simplicity of traditional data integration becomes apparent when new local schemata are added to the system and an appropriate mapping target needs to be identified: with a single global schema, the target is predetermined. To ensure extensibility and scalability, our proposed federation concept depends on two strategies:
Cluster globals: The concept of semantic clusters builds on the existence or advancement of standards that are considered as appropriate common perspectives by research communities. Although clusters are not predetermined but expected to evolve, established standards such as the CIDOC Conceptual Reference Model (CIDOC CRM) or the Text Encoding Initiative (TEI) Guidelines could be identified as initial cluster schemata, which can be mapped in a generic fashion [11]. As new schemata need to be added, the standard that promises to achieve the highest completeness is selected as the mapping target.
Model inheritance: Our proposal includes an approach to specify the actual usage of schemata more precisely than is possible at the level of generic crosswalks. Figure 3 shows the exemplary derivation of the Dublin Core element dc:coverage to resolve an encapsulated substructure. Mappings are inherited to correlate the refined elements or to specify detailed data transformation rules. As derived schemata are related to their parent, generic mappings remain valid and can be utilized if specific rules are missing.
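
The fallback behaviour of model inheritance can be sketched as below. Only dc:coverage comes from the text; the class, the refined element names (coverage/place, coverage/period) and the target element names are hypothetical and serve purely as an illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Schema:
    name: str
    parent: Optional["Schema"] = None
    # Element name -> element of the parent schema that it refines.
    refines: Dict[str, str] = field(default_factory=dict)
    # Element name -> mapped element in some target schema.
    mappings: Dict[str, str] = field(default_factory=dict)

    def resolve(self, element: str) -> Optional[str]:
        """Look up a mapping, falling back to the parent schema's generic mapping."""
        if element in self.mappings:
            return self.mappings[element]
        if self.parent is not None:
            # Translate the refined element to its parent element before delegating.
            return self.parent.resolve(self.refines.get(element, element))
        return None

# Generic crosswalk at the level of plain Dublin Core (target names are invented).
dublin_core = Schema("Dublin Core", mappings={"dc:coverage": "target:coverage"})

# Derived schema that splits dc:coverage into two hypothetical sub-elements.
derived = Schema(
    "Dublin Core (derived)",
    parent=dublin_core,
    refines={"coverage/place": "dc:coverage", "coverage/period": "dc:coverage"},
    mappings={"coverage/place": "target:place"},
)

specific = derived.resolve("coverage/place")    # a specific rule exists
fallback = derived.resolve("coverage/period")   # inherited generic mapping is used
```

The specific rule wins where one has been defined, and the generic parent mapping remains usable everywhere else, so a derived schema never loses the interoperability already achieved by its parent.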

Fig. 3: Exemplary derived version of Dublin Core
4 Conclusion

As we abstract from technical aspects of heterogeneity and reuse the valuable disciplinary knowledge explicated in terms of correlations, processing and transformation rules, the efforts required for integrating research data can be significantly reduced. Another important aspect that is currently being evaluated concerns appropriate techniques for the visualization of our federation concept and system. After all, domain experts need to be able to recognize clusters, important schemata and ontologies as well as their correlations in order to identify semantic gaps and to collaboratively fill them.
References

1. Sheth, A.P., Kashyap, V. (1993): So Far (Schematically) yet So Near (Semantically). In: Proceedings of the IFIP WG 2.6 Database Semantics Conference on Interoperable Database Systems (DS-5), Amsterdam, The Netherlands, North-Holland Publishing Co 283–31
2. Busse, S., Kutsche, R.D., Leser, U., Weber, H. (1999): Federated Information Systems: Concepts, Terminology and Architectures
3. Henrich, A., Gradl, T. (2013): DARIAH(-DE): Digital Research Infrastructure for the Arts and Humanities – Concepts and Perspectives. International Journal of Humanities and Arts Computing 7(supplement) 47–5
4. Lenzerini, M. (2002): Data Integration: A Theoretical Perspective. In: Abiteboul, S., ed.: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, New York, NY, ACM 23
5. Pouyllau, S. (2011): ISIDORE : acces to open data of arts & humanities
6. Hagedorn, K. (2003): OAIster: a "no dead ends" OAI service provider. Library Hi Tech 21(2) 170–181
7. Peroni, S., Tomasi, F., Vitali, F. (2013): Reflecting on the Europeana Data Model. In: Agosti, M., Esposito, F., Ferilli, S., Ferro, N., eds.: Digital Libraries and Archives. Volume 354 of Communications in Computer and Information Science. Springer Berlin Heidelberg, Berlin, Heidelberg 228–24
8. Lordick, H. (2013): Vieles finden – die Suchmaschine im Steinheim-Institut
9. Romanello, M. (2013): DARIAH Geo-browser: Exploring Data through Time and Space
10. Rahm, E. (2011): Towards Large-Scale Schema and Ontology Matching. In: Bellahsene, Z., Bonifati, A., Rahm, E., eds.: Schema Matching and Mapping. Springer Berlin Heidelberg, Berlin, Heidelberg 3–27
11. Baca, M., Harpring, P., Ward, J., Beecroft, A.: Metadata Standards Crosswalk

Conference Info

Complete

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO