Open Data, Open Edition: How Can the Inferences Between Scientific Papers and Evidence Be Managed?

panel / roundtable
Authorship
  1. Adeline Joffres
     Huma-Num - CNRS (Centre national de la recherche scientifique)

  2. Xavier Rodier
     Ecole Polytechnique de l'Université de Tours

  3. Olivier Baude
     Huma-Num - CNRS (Centre national de la recherche scientifique)

  4. Stéphane Pouyllau
     Huma-Num - CNRS (Centre national de la recherche scientifique)

  5. Olivier Marlet
     Ecole Polytechnique de l'Université de Tours

  6. Pierre-Yves Buard
     Université de Caen Normandie (University of Caen)

  7. Christophe Parisse
     Université Paris-Ouest Nanterre (Paris Nanterre University)

  8. Carole Etienne
     Ecole Nationale Supérieure des Arts et Techniques du Théâtre

  9. Céline Poudat
     Université de Nice Sophia Antipolis (University of Nice)

  10. Fatiha Idmhand
      Université de Poitiers (University of Poitiers)

  11. Thomas Lebarbé
      Université Grenoble Alpes

  12. Paul Bertrand
      Katholieke Universiteit (KU) Leuven (Catholic University of Louvain)

  13. Nicolas Perreaux
      Johann-Wolfgang-Goethe-Universität Frankfurt am Main (Goethe University of Frankfurt)

  14. Eliana Magnani
      Université Paris 1 Panthéon-Sorbonne

  15. Florent Laroche
      École Centrale de Nantes

  16. Xavier Granier
      Université de Bordeaux

  17. Mehdi Chayani
      Université de Bordeaux

  18. Pierre Mounier
      Infrastructure - OpenEdition

  19. Nathalie Fargier
      Infrastructure - Persée



Recent years have been marked by the emergence of the ‘Open Science’ movement, which encourages, among other practices, ‘open data’, ‘open publication’, and ‘open edition’. Boosted by the digital turn, it has radically transformed research practices and academic communities. The EU now requires funded scientific papers to be published open access and made fully available so that anyone can reuse them. The EU's deadline is 2020, but many consider it unrealistic. Other initiatives, by NASA in the USA for example, have decided that the time for debating Open Access is past and that the discussion must now focus on how to achieve it in practice. What is the situation in the SSH?

This panel proposes to question the existing scientific publishing paradigm in the light of the changing relationship, from an SSH perspective, between the necessary publication of primary data and review papers. The challenge is to break with the descriptive model that simply links raw data and original research articles, and instead to combine data and review papers so as to preserve the link with evidence and to highlight interpretation and reasoning.
It is widely recognized that the number of papers currently published in the SSH is such that we only consult some of them, following our own selection strategies. However, we still write as if our work were going to be read in full, without any attempt to redraft it from an alternative perspective. Digital publishing does not solve this problem but makes it worse, while the increasing accessibility of online data is not sufficiently exploited in conjunction with publications.
Corpora made available for the sharing, interoperability and reuse of data sets are rarely considered in connection with scientific publications. Yet these are the bodies of evidence on which the published analyses and demonstrations are based. A special challenge when consulting publications is to quickly assess their relevance to our work, and to do so, to have easy access to the scientific reasoning and the provenance of the diverse items (observations, comparisons, references) on which it is based. This new paradigm raises a new challenge: how to provide papers linked to open access datasets.
A spectrum of possibilities is available, ranging from short, scholarly publications that describe datasets accessible online, to full papers that integrate datasets from repositories (either as part of the paper or externally maintained) or datasets linked to data papers.

By bringing together representatives of various SSH communities and infrastructures, including through the SSHOC H2020 project, this panel aims to reflect on these questions. The introduction will briefly outline the points presented so as to bring out what the disciplines have in common and how the proposed tools can bring communities together around the practice of data publications. Each talk will try to move in this direction without using too much disciplinary jargon, so as to be understandable by all.

The panel will review the specificities of each research community and how they are facing these challenges, focusing on the particular case of the French infrastructure Huma-Num and its consortia. Experts from two research infrastructures for digital publishing in the SSH will also contribute to this reflection. Is it possible to build a standardized model for the SSH? How are French infrastructures dealing with scientific (digital) publishing? Are the research communities and their infrastructures capable of working together to link and structure their syntheses with their data, so as to highlight the chain of inference to evidence?

Talk 1 = “Simplifying the Writing of Logicist Publications in Archaeology: An Attempt by the French MASA Consortium through LogicistWriter”

Within the MASA Consortium, the Digital Document Centre of the Caen MRSH and CITERES-LAT are collaborating on the electronic publication of the Rigny excavation (Indre-et-Loire, France) by Elisabeth Zadora-Rio, in the logicist format developed by Jean-Claude Gardin. The objective is to make the chain of inferences explicit so that the evidence behind the archaeological synthesis can be traced. As the archaeological experiment is not reproducible, field records are the main data on which researchers will rely. The publication of archaeological data is therefore essential, since these data constitute the evidence on which the reasoning that led to the synthesis is based.

The logicist publishing interface set up by the Digital Document Centre of the MRSH of Caen provides different levels of access to content, allowing both quick reading and in-depth consultation. It is thus possible to visualize all the inference chains in the reasoning structure through synthetic diagrams, but also to consult the ArSol database containing the field records that provide evidence for the initial propositions. The XML file containing the whole argumentation structure is based on the entities of CIDOC CRM to ensure the interoperability of this publication within the semantic web. The inference chains are mapped to CRMinf (the CIDOC CRM extension dedicated to modelling reasoning) and the ArSol records are mapped to CIDOC CRM and its extensions CRMSci (scientific observations) and CRMArchaeo (archaeology).
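To make this mapping more concrete, here is a minimal sketch, not the MASA implementation, of how one inference step could be expressed with CRMinf classes in RDF using Python and rdflib. The namespace URIs, resource identifiers and proposition labels are hypothetical placeholders and should be checked against the CRMinf version actually used by the project.

```python
# Minimal sketch (not the MASA implementation): one inference step expressed with
# CRMinf classes in RDF using rdflib. Namespace URIs, identifiers and labels are
# hypothetical placeholders for illustration only.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

CRMINF = Namespace("http://www.cidoc-crm.org/extensions/crminf/")  # assumed base URI
EX = Namespace("http://example.org/rigny/")                        # hypothetical project namespace

g = Graph()
g.bind("crminf", CRMINF)
g.bind("ex", EX)

# Initial proposition (the evidence), backed in practice by an ArSol field record.
g.add((EX.belief_1, RDF.type, CRMINF["I2_Belief"]))
g.add((EX.belief_1, RDFS.label, Literal("Layer 1024 contains 9th-century pottery")))

# Final proposition (part of the synthesis).
g.add((EX.belief_2, RDF.type, CRMINF["I2_Belief"]))
g.add((EX.belief_2, RDFS.label, Literal("The building phase dates to the Carolingian period")))

# The inference step linking premise and conclusion.
g.add((EX.inference_1, RDF.type, CRMINF["I5_Inference_Making"]))
g.add((EX.inference_1, CRMINF["J1_used_as_premise"], EX.belief_1))
g.add((EX.inference_1, CRMINF["J2_concluded_that"], EX.belief_2))

print(g.serialize(format="turtle"))
```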

For the implementation of the electronic publication of Rigny's archaeological excavations in logicist format, the researcher must record all of his/her logicist propositions in an XML-TEI file. From this XML-TEI source file, the Digital Document Centre of the Caen MRSH has set up a tool to automatically generate the reasoning graph in SVG (Scalable Vector Graphics) format.
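As an illustration of the general idea, and not of the Caen MRSH tool itself, the following sketch renders a reasoning graph as SVG from a hypothetical TEI encoding; the element and attribute choices (a <seg> per proposition, with @xml:id and @corresp pointing to its premises) are assumptions made for the example.

```python
# Minimal sketch (not the MRSH Caen tool): render a reasoning graph as SVG from a
# hypothetical TEI encoding in which each proposition is a <seg> carrying an
# @xml:id and an optional @corresp listing the propositions it is inferred from.
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def tei_to_svg(tei_path, svg_path):
    tree = ET.parse(tei_path)
    boxes, links = {}, []
    for i, seg in enumerate(tree.iter(TEI + "seg")):
        pid = seg.get(XML_ID)
        boxes[pid] = (40, 40 + i * 70)                      # naive vertical layout
        for premise in (seg.get("corresp") or "").split():  # e.g. "#p1 #p2"
            links.append((premise.lstrip("#"), pid))
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="600" height="%d">'
             % (len(boxes) * 70 + 60)]
    for src, dst in links:                                   # premise -> conclusion edges
        if src in boxes and dst in boxes:
            (x1, y1), (x2, y2) = boxes[src], boxes[dst]
            parts.append('<line x1="%d" y1="%d" x2="%d" y2="%d" stroke="black"/>'
                         % (x1 + 100, y1 + 40, x2 + 100, y2))
    for pid, (x, y) in boxes.items():                        # one box per proposition
        parts.append('<rect x="%d" y="%d" width="200" height="40" fill="none" stroke="black"/>'
                     % (x, y))
        parts.append('<text x="%d" y="%d">%s</text>' % (x + 10, y + 25, pid))
    parts.append("</svg>")
    with open(svg_path, "w", encoding="utf-8") as f:
        f.write("\n".join(parts))
```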

However, writing logicist propositions in an XML-TEI file is difficult for the researcher. In order to simplify the writing of archaeological publications that respect the precepts of logicism, the MASA Consortium provides an online tool to assist in logicist formalization, based on a more intuitive process: LogicistWriter. In this application, the researcher begins by writing propositions in a schematic form and linking them together in a graphical interface. The researcher can thus set up the structure of logicist propositions by building the tree structure of his/her reasoning visually, linking the propositions to each other from the initial propositions (the evidence) to the final propositions (the synthesis) in the form of a graph. An XML-TEI file containing the entire logicist structure of the reasoning is created on the fly from this graph. The generated XML-TEI file can then be expanded in an XML editor to enrich the propositions with text, illustrations, bibliography or cross-references.

Although corpora of archaeological data do not allow the experiment to be replayed as in the experimental sciences, logicism highlights the sequence of reasoning and will ultimately make it possible to produce reusable databases of inferences.

Talk 2 = “Validation of scientific results and FAIR principles in linguistic research within the French CORLI consortium”

Corpus research has a long history in linguistics, but it has induced huge changes in the community over the last 30 years (Laks, 2011). The advent of high-speed Internet connections and large storage capacities, together with the generalization of mathematical models and statistical software, has made it easier to ground linguistic research in data attested by language corpora.

There is a classical opposition in linguistics between theories based on examples (introspective or not) and theories based on data and corpora. The second approach corresponds to a large body of authors and linguistic models. The use of corpora makes it theoretically possible to control or reproduce published research, which is not the case in studies based on examples. Such control might also be possible in linguistic fields that adopt an experimental framework similar to that of other experimental sciences, such as cognitive psychology.

In the field of First Language Acquisition, the use of corpora has been mandatory since the 1970s, and researchers rapidly realized the value of sharing corpora and working hand in hand, given the high development cost of such corpora. For instance, Pine et al. (2005) were able to respond to the work of Schütze and Wexler (1996) because the three corpora used by Schütze and Wexler were available on CHILDES (MacWhinney, 2000). Pine et al.’s response was based on a corpus that is also available on CHILDES (Manchester Corpus, Theakston et al., 2001). The existence and reusability of the different corpora is a crucial element here, making it possible to replay or enrich a previous work. More importantly, these corpora follow the same well-known guidelines (the CHAT format) to ensure interoperability, and they have been freely available on the CHILDES website for 30 years for the oldest data and for 15 years for the more recent ones, ensuring their accessibility, two key concepts in the CHILDES initiative.
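To illustrate what this interoperability means in practice, here is a minimal sketch of how a CHAT transcript can be processed with a few lines of Python. The file name is a placeholder; real transcripts are distributed by CHILDES on childes.talkbank.org.

```python
# Minimal sketch: count utterances per speaker in a CHILDES transcript (CHAT format).
# In CHAT, header lines start with "@", main speaker tiers with "*" (e.g. "*CHI:"),
# and dependent annotation tiers (morphology, syntax, ...) with "%".
from collections import Counter

def count_utterances(chat_path):
    counts = Counter()
    with open(chat_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("*"):                 # main tier, e.g. "*CHI:\tthat's a ball ."
                speaker = line[1:].split(":", 1)[0]
                counts[speaker] += 1
    return counts

# print(count_utterances("transcript.cha"))          # placeholder file name
```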

These new approaches are an important step for the linguistic field, as they make it possible to control, replicate, and enrich research to a significant extent. Controlling the proof is also something that might be possible today, if not only the corpus but also the methods and the tools are made publicly available. This is an important point we are trying to develop and reflect on within CORLI, the French consortium for linguistics. What are the conditions of such a procedure? How can colleagues be encouraged to deposit their corpora? What anonymization procedures need to be developed in the case of sensitive data? What are the legal frameworks for data?

In this paper, we will address these questions, particularly in the framework of Huma-Num and CORLI in France, and of CLARIN in Europe. Indeed, they follow the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) to promote and help develop the use of clear and rich metadata, common open formats, accessible tools, open archives and the correct citation of data and tools. Such methods and practices may benefit linguistic research by saving time and by encouraging researchers to reuse existing, documented and multi-level annotated data, as well as tools, in order to increase our knowledge of languages.

Talk 3 = “Dealing with Literary Scholars’ Data and Evidence: the Perspective of the French CAHIER Consortium”

With digital editions of texts, sources and scientific literature, literary scholars are dealing with unprecedented changes, in particular regarding the publication of the data their analyses are based on. In order to manage their evidence, they have to tackle two major challenges: on the one hand, the publication of all their data and, on the other, its storage. The two aspects cannot be separated. What solutions are available to literary scholars? How are they dealing with these challenges?

In the French context, part of the digital humanities data storage problem has been solved thanks to the Huma-Num infrastructure: Huma-Num provides a GitLab instance to store software code and a technical tool, Nakala, for SSH data storage. Nakala also assigns each resource a Uniform Resource Identifier (URI) that identifies it unambiguously. Thus, one might think that literary scholars' papers could simply be based on these URIs and Git repositories, and that it would be enough to have more data papers, with more reproducible experiments and new ways of writing papers. However, it is not that easy. That is why CAHIER has organised three actions in order to facilitate:

a) data interoperability, by sharing models (such as the XML-TEI schema) and organising data visibility by providing sitemaps (for XML-TEI encoded texts, for example) and OAI repositories for projects (a minimal harvesting sketch follows this list);
b) data storage by helping researchers to model their data in an interoperable format from the very beginning of their project;
c) data reuse by promoting data deposit and its visibility.
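As an indication of what machine access to such OAI repositories involves, here is a minimal sketch that harvests Dublin Core records over the standard OAI-PMH protocol. The endpoint URL is a placeholder, not an actual CAHIER or Huma-Num address.

```python
# Minimal sketch: harvest Dublin Core records from an OAI-PMH endpoint and print
# each record's identifier and title. The endpoint URL below is a placeholder.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def list_records(base_url):
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as response:
        root = ET.fromstring(response.read())
    for record in root.iter(OAI + "record"):
        identifier = record.find(OAI + "header").findtext(OAI + "identifier")
        title = record.findtext(".//" + DC + "title")
        print(identifier, "-", title)

# list_records("https://example.org/oai")   # placeholder endpoint
```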

For each action, we have to deal with different challenges because of the complexity of the scientific questions raised in literary research, which includes a hermeneutical framework. One way of solving the problem might be to provide highly flexible data models and to develop researchers' skills. That is why we are also investing in training courses, but lack of time is the main obstacle for researchers. Moreover, if we consider that we could have a pool of experts, where are the journals and data journals that would enable them to publish their results?

Talk 4 = “Towards New Forms of Integrated Publications in Medieval History: the Exploratory Research of the COSME Consortium”

The construction of digital corpora of medieval sources has always been associated with specific scientific methodologies. These are related to particular historical approaches, such as quantitative codicology 25 years ago, deep searching of Medieval Latin texts in recent years or, very recently, studies of medieval spaces based on GIS or, more broadly, geomatics.
Currently, corpora of raw data and corpora of qualified data (metadata/referentials) are being built, and cross-comparisons are increasingly being carried out. Rather than developing hand-tagged textual/documentary corpora, the automatic detection and extraction of named entities based on repositories (established within the framework of the COSME consortium in particular: names of persons, places, topics, values...) makes it possible to envision multiple, specific and automatic enrichments of these corpora, as well as their historical exploitation. However, these qualified and requalified corpora are where the sources reside, and they must in each case be associated with the published results.
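To give an idea of what entity extraction based on such referentials can look like, here is a minimal sketch, not the COSME pipeline, that matches the tokens of a charter-like text against small reference lists of person and place names. The sample sentence and the lists are invented for illustration; real medieval Latin would additionally require lemmatisation of inflected forms.

```python
# Minimal sketch (not the COSME pipeline): tag entities in a text by matching
# tokens against small "referentials" of person and place names. All names and
# the sample sentence are invented for illustration.
import re

REFERENTIALS = {
    "person": {"Hugo", "Odilo", "Maiolus"},
    "place": {"Cluniacus", "Matisco"},
}

def tag_entities(text):
    hits = []
    for token in re.findall(r"\w+", text):
        for category, names in REFERENTIALS.items():
            if token in names:
                hits.append((token, category))
    return hits

sample = "Ego Hugo dono ecclesiae sancti Petri Cluniacus terram iuxta Matisco."
print(tag_entities(sample))   # [('Hugo', 'person'), ('Cluniacus', 'place'), ('Matisco', 'place')]
```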
As part of the development of next-generation open archives (https://www.coar-repositories.org/files/NGR-Final-Formatted-Report_english-version.pdf), each publication should have its own metadata/referential datasets, specific and appropriately qualified corpora, and intermediate data (extracted data tables). The traditional publishing model must therefore be completely overhauled, with the implementation of the necessary permanent, open repositories capable of preserving and visualizing complex content. Several test solutions are envisaged: the development of the TELMA platform for the electronic edition and publication of medieval sources; the creation of an overlay journal dedicated to medieval sources and to studies on literacy and medieval writing practices; or the connected and joint development of these solutions.

The paper will review the development of these solutions, their advantages and disadvantages, and possible new complementary solutions (in relation to the EquipEx Biblissima, for example).

Talk 5 = “How does 3D Influence Scientific Research and Publications in Digital Humanities?”

3D technologies have offered researchers in the Humanities (archaeologists, anthropologists, architects, art historians...) new and effective tools to process, analyse and disseminate their scientific data.
As a true research tool, 3D enables one to examine and visualize digital data and also facilitates dialogue and exchange between researchers by providing digital models and visual support to test different hypotheses and confront different documentary resources to find solutions to historical questions that traditional methods of investigation cannot solve. It also raises new questions.
3D makes it possible to present information in a more coherent form and to support a demonstration by making it more comprehensible for the observer. Thus, the 3D solution presented in a publication clarifies the scientific argument of a text or presentation that might otherwise remain too abstract without this visual transcription.
Thanks to their flexibility, 3D models can be used to follow the historical evolution of an object from cultural heritage or more specifically from an archaeological site: such a digital replica can be updated at will and viewed from different angles.
With the online publication of these 3D models, researchers have rapid access to information through an organized database synthesizing all the scientific documentation. Moreover, through this digital medium, researchers can continue their investigation without needing to remain on site.
Furthermore, 3D models can also be used to support an information system, by attaching a coherent spatial coordinate system, in order to offer researchers all the data associated with a humanities object of study. However, exploiting the large amount of 3D data remains difficult for humanities scholars who are not used to manipulating such data.

In order to help the digital community, the author will detail the results of the ReSeed project, currently funded by the French National Research Agency and associated with one of the axes of the 3D-SSH consortium. The project aims to develop a new technology: a tool and an interoperable format for digitizing both historical semantic data and 3D physical objects. ReSeed will implement an ethical code designed to guarantee the authenticity and uniqueness of the future semantically augmented digital objects.

Talk 6 = “From Papers to Data in the Open Access Context: Use Cases from OpenEdition Platforms”

OpenEdition is the French national infrastructure for open scholarly communication in the humanities and social sciences. The OpenEdition portal brings together four platforms dedicated to open access digital resources in the SSH: OpenEdition Books (monographs and edited volumes); Hypotheses (research blogs); Calenda (a calendar of academic events); and OpenEdition Journals (academic journals).

With 20 years of experience and more than 500 open access journals and 7,000 books in all disciplines of the humanities and the social sciences, the OpenEdition platforms are an interesting vantage point from which to observe researchers' needs in terms of linking their data to their publications.

My presentation will examine several cases of academic publications disseminated on OpenEdition platforms, from different disciplines (such as geography, sociology, history and archaeology), that reflect the variety of ways in which authors and editors try to create links with data even though no specific feature for this exists on the platforms. It will also be an opportunity to see how these actors work around the technical constraints imposed by the existing publishing tools in order to elaborate an extended scientific argumentation that integrates direct access to data. This communication will mark the start of a collaborative project with Huma-Num on this topic, both at the national level and within the framework of the SSHOC project.

Talk 7 = “Linked Open Data for Heritage Content: an example of implementation within the Persée infrastructure”

Persée is a digital platform for the digitization, structural markup, online publishing and long-term preservation of heritage content. Persée deals mainly with “collections of old, rare and valuable documents housed by libraries and archives”: in particular, though not exclusively, academic journals, serials, books, proceedings, grey literature, maps and iconography. The originality of the Persée platform lies in the accuracy and standardization of item description and structuring (article, chapter, illustration, named entity, etc.) and in the use of standards (TEI, METS, MODS, MADS, DC, OAI-PMH).

Currently, the Persée portal (www.persee.fr) provides open access to archives of scholarly publications in French in the humanities and social sciences, with more than 740,000 documents available. The oldest article dates from 1837 and the most recent one from 2017. This set of documents can be considered from a double point of view: on the one hand, they are published research results that constitute a documentary resource which is still relevant for students and researchers; on the other hand, they are a mass of structured and qualified data that researchers can consider as a digital corpus.

The presentation will focus on the methods used by Persée to ensure the usability of these datasets outside the digital library framework and their sustainability for digital humanists. Persée provides web services and implements the methods and techniques of Linked Open Data. Data Persée (data.persee.fr) gathers all the metadata produced by Persée and makes it available in a structured way (as an RDF graph) according to the principles of the semantic web (DCMI, FRBR, FOAF, CITO, BIBO, SKOS). Mapping to international information systems makes it possible to explore and link not only the Persée databases but also data offered by the library community (IdRef, data.bnf.fr), the scientific community (Cairo Gazetteer, GBIF) and other crowd-sourced databases (DBpedia). To ensure quality and relevance, we have prioritized human expertise over automated processes, so all the links are checked. In addition, Persée is involved in a long-term preservation programme to guarantee the durability of the links and of the identifiers.
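To make this Linked Open Data access more tangible, here is a minimal sketch of how such an RDF graph could be queried over SPARQL from Python. It assumes that Data Persée exposes a public SPARQL endpoint; the endpoint URL and the choice of the Dublin Core title property are assumptions to verify against the Data Persée documentation.

```python
# Minimal sketch, assuming Data Persée exposes a public SPARQL endpoint:
# list a few documents with their titles. The endpoint URL is an assumption.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://data.persee.fr/sparql"   # assumed endpoint URL

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?doc ?title WHERE {
        ?doc dcterms:title ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["doc"]["value"], "-", binding["title"]["value"])
```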
