Humboldt-Universität zu Berlin (Humboldt University)
Independent Consultant
Universidad de Alicante
Institute of Polish Language - Polish Academy of Sciences
Universität Trier
Introduction
The COST Action Distant Reading for European Literary History is a collaborative, interdisciplinary network which aims “to facilitate the creation of a broader, more inclusive and better-grounded account of European literary history and cultural identity”. The network consists of European researchers from different disciplines and research fields such as computational linguistics, corpus linguistics, and (digital) literary studies. Currently, over 100 researchers from 30 different countries are working together in the Action. With this poster, we present our strategy for developing a key output of the project: the corpus that serves as its empirical basis.
Basic idea of ELTeC
The Distant Reading Action aims to “develop the resources and methods necessary to change the way European literary history is written” (“Memorandum of Understanding”). To that end, the Action will create a diachronic, multilingual, medium-sized, open-access benchmark corpus of novels from 1840-1919, called ELTeC (European Literary Text Collection). A working group dedicated to ‘Scholarly Resources’ is collecting sets of 100 novels in at least 10 European languages. These novels are encoded according to the recommendations of the TEI (Text Encoding Initiative; see TEI Consortium 2018 and Burnard 2014) and made freely available via GitHub and Zenodo (the ELTeC repositories will be archived via Zenodo). ELTeC is the basis for work in the Distant Reading network on evaluating and improving methods and tools that support text annotation and analysis for Digital Literary Studies. It also provides the material for the network's work on theoretical concerns and use cases from European literary history. Hence, building ELTeC is a key activity of the Distant Reading network.
Corpus design
The key challenge for the corpus design and annotation schemas of ELTeC is the tension between valid, meaningful criteria and their operationalisation for texts from many different cultures, a challenge typical of large-scale DH projects. The corpus design is therefore metadata-based: it defines a set of parameters instead of relying on literary canons. As a result, we include famous, canonical novels as well as forgotten, non-canonical ones. We represent the variation in production and aim to maximize variety within each time period. In this respect, the design of ELTeC resembles that of reference corpora intended to serve as an empirical base for different research approaches, and we have put particular focus on the composition of the corpus. The corpus will be balanced with regard to parameters that include language, publication date, author gender, length and reprint count. It also includes ‘smaller’ varieties of a language so that the high variance of the vernacular languages across Europe can be taken into account.
For further details, see our white papers “Canonicity and corpus design parameters in the ELTeC context” and “Sampling criteria for the ELTeC”.
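To illustrate what a metadata-based, balanced design means in practice, the following minimal sketch tallies a small set of novels by time period, author gender and length class. The records, the field names, the 20-year period width and the length thresholds are hypothetical assumptions chosen for illustration; they are not the actual ELTeC metadata schema or sampling criteria, which are set out in the white papers.

```python
from collections import Counter

# Hypothetical metadata records; field names and values are illustrative
# assumptions, not the actual ELTeC metadata schema.
novels = [
    {"title": "Novel A", "year": 1845, "author_gender": "F", "words": 48000},
    {"title": "Novel B", "year": 1868, "author_gender": "M", "words": 120000},
    {"title": "Novel C", "year": 1891, "author_gender": "F", "words": 210000},
    {"title": "Novel D", "year": 1912, "author_gender": "M", "words": 75000},
]


def time_slot(year, start=1840, width=20):
    """Map a publication year to a period label such as '1840-1859'.

    The 20-year width is an assumption made for this sketch.
    """
    lower = start + ((year - start) // width) * width
    return f"{lower}-{lower + width - 1}"


def length_class(words, short_max=50000, long_min=200000):
    """Classify a novel by word count; both thresholds are assumed values."""
    if words < short_max:
        return "short"
    if words >= long_min:
        return "long"
    return "medium"


def balance_report(records):
    """Count records per value of each balancing parameter."""
    return {
        "time_slot": Counter(time_slot(r["year"]) for r in records),
        "author_gender": Counter(r["author_gender"] for r in records),
        "length": Counter(length_class(r["words"]) for r in records),
    }


report = balance_report(novels)
for parameter, counts in report.items():
    print(parameter, dict(counts))
```

A report of this kind makes gaps visible, for instance a period with no novels by women, so that the selection for a language collection can be adjusted before encoding begins.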
Corpus annotation
The ELTeC is encoded using TEI XML, customised for Distant Reading methods and tools. “Our main goal has been to identify a small core set of textual features which can be readily (preferably automatically) identified in existing digital transcriptions, or easily and consistently provided by new transcriptions” (Section ‘Principles’ of the white paper “Encoding Guidelines for the ELTeC”). Unlike many Digital Scholarly Editing projects, the intention is not to support a rich representation of the full complexities of an original source, but rather to represent in a consistent and economical way an agreed minimum of the textual features most relevant to Distant Reading practices. We distinguish a basic encoding (level 0), a richer TEI encoding (level 1) and a linguistic encoding (level 2). A single TEI ODD is defined, from which we derive schemas and documentation for each level of annotation.
The encoding scheme is formally documented using the TEI ODD system. In 2019, we started to develop a customization for level 2, which includes linguistic annotation.
At level 0, only chapters, headings, and paragraphs are distinguished. At level 1, more semantic encoding is introduced, for example distinguishing the functions of highlighting, such as to mark foreign words or emphasis. Additional markup for corrections, footnotes and some structural features is also included. Contributors to the ELTeC may choose to submit materials using either level of encoding, depending on what is most appropriate for them.
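As a rough illustration of how little markup a level-0 text needs, the sketch below parses a tiny TEI-like fragment and counts its chapters, headings and paragraphs. The element names (div with type="chapter", head, p) follow general TEI conventions and are assumptions here; the authoritative ELTeC schema is the one derived from the project's ODD. The sample text is invented, and the fragment omits the TEI header, so it is not a complete TEI document.

```python
import xml.etree.ElementTree as ET

# The TEI namespace URI is standard; the prefix "tei" is only a local label.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample with level-0-style markup: chapters, headings, paragraphs.
# The header is omitted for brevity, so this is not a valid TEI document.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="chapter">
        <head>Chapter 1</head>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </div>
      <div type="chapter">
        <head>Chapter 2</head>
        <p>Third paragraph.</p>
      </div>
    </body>
  </text>
</TEI>"""


def count_structure(xml_text):
    """Count chapters, headings and paragraphs in the body of a document."""
    root = ET.fromstring(xml_text)
    body = root.find(".//tei:body", TEI_NS)
    return {
        "chapters": len(body.findall(".//tei:div[@type='chapter']", TEI_NS)),
        "headings": len(body.findall(".//tei:head", TEI_NS)),
        "paragraphs": len(body.findall(".//tei:p", TEI_NS)),
    }


counts = count_structure(SAMPLE)
print(counts)
```

Because every text in a collection shares the same small set of elements, counts like these can be computed uniformly across languages, which is the kind of consistency that Distant Reading methods depend on.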
Data sources, collaboration and publication
We integrate existing textual resources, whether already TEI-encoded or not, e.g. from Project Gutenberg, TextGrid or the Cervantes Virtual Library. Project members also digitize and encode new texts, especially non-canonical novels, and make them available for research for the first time. As far as possible, ELTeC corpora include a digital edition of the first edition of each novel.
The collaborative creation, documentation and publication of ELTeC take place in our GitHub organisation (COST-ELTeC), where each language collection is stored in a single repository with all available encoding levels. For each language, a specific group of researchers is responsible for selecting and annotating the novels. However, as we are working on an open platform, the remaining members of the project as well as other colleagues are welcome to review and improve the text selection and annotation.
Acknowledgements
Distant Reading for European Literary History (CA16204) is a COST Action funded by the COST Association through the Horizon 2020 Framework Programme of the EU.
Bibliography
Burnard, L. What is the Text Encoding Initiative? How to add intelligent markup to digital resources. Marseille: OpenEdition Press, 2014.
“Canonicity and corpus design parameters in the ELTeC context”. Working paper, last updated Sept. 2018.
“Encoding Guidelines for the ELTeC”. Draft.
“Memorandum of Understanding for the implementation of the COST Action Distant Reading for European Literary History (DISTANT-READING) CA16204”. Brussels, COST Association. June 2017.
“Sampling criteria for the ELTeC”. Discussion paper. January 2018.
TEI Consortium (eds.) 2018. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.4.0. TEI Consortium.
Hosted at Utrecht University
Utrecht, Netherlands
July 9, 2019 - July 12, 2019
Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html
References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html
Series: ADHO (14)
Organizers: ADHO