Four Years of correspSearch – Challenges, Potentials and Lessons of Open Data Aggregation

paper, specified "short paper"
Authorship
  1. 1. Stefan Dumont

    Berlin-Brandenburgische Akademie der Wissenschaften (BBAW) (Berlin-Brandenburg Academy of Sciences and Humanities)

  2. 2. Sascha Grabsch

    Berlin-Brandenburgische Akademie der Wissenschaften (BBAW) (Berlin-Brandenburg Academy of Sciences and Humanities)

  3. 3. Jonas Müller-Laackman

    Berlin-Brandenburgische Akademie der Wissenschaften (BBAW) (Berlin-Brandenburg Academy of Sciences and Humanities)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


The main goal of the

correspSearch web service

https://www.correspsearch.net
, developed and hosted by Berlin Brandenburg Academy of Sciences and Humanities, is the aggregation of open and freely accessible metadata from printed and digital editions of correspondences.

For a general overview see Dumont (2016).
Over the last four years we have successfully aggregated metadata for about 52.000 letters. Most of the data is obtained from external contributions ranging from a wide variety of scholarly editions and institutions. With this recap we aim to infer successful recipes and practices for the decentralized aggregation of domain specific open metadata. Furthermore we will show the possibilities which arise from the aggregation of such metadata on a bigger scale and discuss ways to manage as well as explore the complex realities of our data.

Gathering a large quantity of suitable metadata about published letters is the precondition and one of the basic functions of the web service
correspSearch. The format used in a data aggregation project such as ours has to fulfill a wide array of requirements, ranging from interoperability with existing standards to ease of use and a straightforward creation process. The TEI Correspondence SIG

https://www.tei-c.org/activities/sig/correspondence/
developed a format for the purpose of simplifying correspondence metadata and thus offering a way for a standardized exchange of data, the Correspondence Metadata Interchange Format (CMIF)

https://correspsearch.net/index.xql?id=participate_cmi-format&l=de
. The flat hierarchy and simple node structure of CMIF offers an easy way to process metadata not only in a machine readable way but also in a way that does not require much prior knowledge to get in touch with (see also Stadler et al., 2016). Given the specific nature of correspondence metadata, the focus of CMIF lies on people, places and dates. The format also encourages the extensive usage of authority control data. By relying on metadata as the core of the exchange format, it is not only of use for editions that are already present in a digital form, but can also be employed to print editions (see Stadler, 2014).

Since the aggregated data is stored decentralized with each edition, participating editions and data providers preserve full ownership and control over their data. Based on the TEI-XML standard, CMIF follows strong guidelines and utilizes the strengths of XML to prevent faulty or ambiguous data while taking into account the heterogeneity of metadata. Furthermore, CMIF requires the usage of authority files for names and places in order to account for their ambiguity and linguistic heterogeneity (see Stadler, 2012). All in all, CMIF successfully provides a model for describing correspondences in an easy and rather flat hierarchical way, while relying on a strict ruleset in favor of standardized and machine-processable data.
As the metadata aggregated by our service is licensed under CC BY, it remains open and thus stands for an Open Access-based approach to correspondence metadata aggregation. Open Data provides an easier ground on which research can be conducted without having to take care of licensing beforehand and with a much larger data pool. The analysis of a specific network of letters, as for example in
csLink, can only benefit from this approach. In addition and in accordance to the licensing for the aggregated data, all software developed by
correspSearch is published as Free Software.

The

CMIF Creator

https://correspsearch.net/creator/index.xql, https://github.com/correspSearch/CMIF-Creator
that was released in its second version in 2018 pursues our approach to an easy handling of CMIF-based data. It enables the creation of correspondence metadata without any prerequisite knowledge about XML/TEI or the CMI format. With the
CMIF Creator we offer a clean graphical user interface to enter available metadata and transform the entered data to valid XML. Lowering the barrier of generating and contributing data has been a key factor for successful and lively external data contributions in the last years. As the
CMIF Creator is implemented as a browser-based application, all entered and processed data is saved locally and thus stays within the control of the user at any time. As CMIF heavily relies in authority control data, authority files data for names and places can be acquired directly from GND

https://www.dnb.de/
and GeoNames

https://www.geonames.org/
via the lobid

https://lobid.org/gnd/api
and GeoNames APIs (see Tasovac et al., 2016). The
CMIF Creator also offers a validation service as well as the option to locally save drafts in JSON format so that work can be continued at any time. The final CMIF files can then be provided for aggregation into the
correspSearch database. The benefits of using the
CMIF Creator are obvious when it comes to the actual experiences we made: Besides the low barriers for data entry – it does not require any prior knowledge and experience of TEI-XML – the average time for a student assistant to process and enter the metadata of a single letter out of a printed letter edition is approximately 30 to 60 seconds, depending on the necessity to further disambiguate authority control data. Thus, large quantities of letters can be processed in a reasonable amount of time. The output format is standardized and does not deviate from TEI specifications, which reduces the incidence of errors in the final XML.

Another potential of the open data approach with a rather analytical purpose is followed in our development of the application

csLink

https://github.com/correspSearch/csLink
.
csLink is a widget for websites that can be implemented and included in existing digital editions of letters. It establishes a “network of letters”, displaying the correspondence network of a certain person and reaching beyond the scope of a single edition of letters. Customized by optional parameters given,
csLink provides a list of other letters from the same network of letters, as well as a list of persons, who are part of the wider correspondence network. By relying on CC BY licensed metadata the widget is available for anyone interested in such a visual representation of the correspondence network belonging to a person. Being able to acquire a visual impression of different letter networks and the corresponding persons offers immediate epistemological gains in the study of complex network relationships. Utilizing the aggregated metadata enables
csLink to situate letters in a single edition within a wider context of correspondence and communication. Examples for applications of
csLink are the digital editions

humboldt digital

https://edition-humboldt.de

, as well as

Weber Gesamtausgabe

https://weber-gesamtausgabe.de/de/Index
. We are currently developing additional ways to visualize the aggregated data and to make it more accessible for analysis and valuable for academic purposes. This necessarily includes a critical discussions of the interpretational character of visualisations, as for example in which ways a network visualisation already is making statements based on the composition of nodes on a canvas (see Dunne, 2013; Biehl et al., 2015; and Szafir, 2018).

In opposition to closed data services, the open data approach of CMIF and its potential in enabling any edition (be it printed or digital) to provide their very own correspondence metadata is extremely beneficial to mapping wider networks of communication. Since any letter network to be explored (for example in
csLink) only shows the information that is available as data already entered into
correspSearch from letters that are already edited and in some way published, a larger base of people actually committing reliable data substantially improves the database and reliability of epistemological gains from this data (see Grüntgens/Schrade, 2016). The development and usage of the
CMIF Creator has proven very valuable for increasing the amount of aggregated metadata. Further addition of data and increasing connections between the letters in our data form a kind of crowd based validated open metadata that does not rely on a single contributor or institution. It is thus not only beneficial for network analysis in a strict analytical sense but also contributes to and implements an agenda to further establish open data principles in the digital humanities.

As a leading provider for the decentralized aggregation of metadata from letter editions with the purpose to facilitate research on letter editions and correspondence networks on the basis of a standardised and open XML-format,
correspSeach, together with
CMIF and
correspDesc, were awarded with the Rahtz Price for TEI Ingenuity 2018.

Bibliography

Biehl, T., Lorenz, A. and Osierenski, D. (2015). Exilnetz33. Ein Forschungsportal als Such- und Visualisierungsinstrument. In
Grenzen und Möglichkeiten der Digital Humanities, edited by Constanze Baum and Thomas Stäcker. Sonderband der Zeitschrift für digitale Geisteswissenschaften 1.

http://dx.doi.org/10.17175/sb001_011
.

Dumont, S. (2016). correspSearch – Connecting Scholarly Editions of Letters“. In
Journal of the Text Encoding Initiative Issue 10 (2016),

http://journals.openedition.org/jtei/1742
.

Dunne, C. (2013). Measuring and Improving the Readability of Network Visualizations. University of Maryland. 2013. http://www.cs.umd.edu/hcil/trs/2013-14/2013-14.pdf (accessed 11/27/2018).

Grüntgens, M. and Schrade,T. (2016). Data repositories in the Humanities and the Semantic Web: modelling, linking, visualising. In Adamou, A., Daga, E. and Isaksen, L. (eds):
WHiSe 2016. Humanities in the Semantic Web. (CEUR Workshop Proceedings, Bd. 1608). Aachen 2016, p. 53-64. URL:

<http://ceur-ws.org/Vol-1608/#paper-07>
.

Stadler, P. (2012). Normdateien in der Edition. In
editio 26 (2012), S. 174–183.

Stadler, P. (2014). Interoperabilität von digitalen Briefeditionen. In Wolzogen, D. and Falk, R. (eds): Fontanes Briefe ediert, Fontaneana 12, Würzburg: Königshausen & Neumann 2014, S. 278–287. (doi:10.1515/editio-2012-0013).

Stadler, P., Illetschko, M. and Seifert, S. (2016). Towards a Model for Encoding Correspondence in the TEI: Developing and Implementing <correspDesc>. In
Journal of the Text Encoding Initiative [Online] 9.

https://dx.doi.org/10.4000/jtei.1742
.

Szafir, D. (2018). The good, the bad, and the biased: five ways visualizations can mislead (and how to fix them). Interactions 25, 4 (June 2018), 26-33. (

https://doi.org/10.1145/3231772
).

Tasovac, T., V.Barbaresi, T., Clérice, T., Edmond, J., Ermolaev, N., Garnett, and Wulfman, C. (2016). APIs in Digital Humanities: The Infrastructural Turn. In
Digital Humanities 2016, 93–96. Digital Humanities 2016 Conference Abstracts. Cracovie, Poland.

https://hal.archives-ouvertes.fr/hal-01348706
.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.