Institut de recherche et d'histoire des textes (IRHT) - CNRS (Centre national de la recherche scientifique)
Institut de recherche et d'histoire des textes (IRHT) - CNRS (Centre national de la recherche scientifique)
Institute of Polish Language - Polish Academy of Sciences
Institut de recherche et d'histoire des textes (IRHT) - CNRS (Centre national de la recherche scientifique)
The medieval civilization can only be investigated by means of the study of traces that have survived to our times. The best source of our knowledge is the texts, preserved in huge quantity and variety. Written mainly in Medieval Latin, they have not benefited from recent advances in computational linguistics and digital humanities in general. The paper briefly presents and reports on the early development of the project
VELUM – Visualisation, Exploration et Liaison de ressources innovantes pour le latin médiéval.
The ANR-funded project (2018-2022) is a first step towards an innovative digital environment for the study of the language and culture of medieval Europe. It has a challenging and ambitious goal: to build foundations for empirical research in medieval culture, language and history, by providing an appropriate environment for textual source analysis. We aim at building, firstly, a large (100 M tokens) balanced corpus of Medieval Latin texts composed between 500 and 1500 AD all across Europe. In order to assess the population that we would like our corpus to reflect, we rely mainly on the most comprehensive list of Medieval Latin texts, namely the
Index Scriptorum Mediae Latinitatis (Blatt, 1973; Bon, 2005).
A representative corpus should make it possible to avoid the bias that affect existing text collections, resulting from uneven and unjustified distribution of sources of different date, genre and place of composition. For example, medievalists have at their disposal big digital text collections, but none has been conceived in order to give an idea of the whole production in Medieval Latin in its variety (see References: Text Corpora). They tend to focus either on specific domains (e.g. theological texts vs. diplomas and charters) or territories (e.g. Catalan vs. Polish texts). As to the chronological variation, the texts written before 1200 are generally far better represented than the texts of the Later Middle Ages. The corpus will also attempt at remedying another shortcoming of existing text collections: since they usually are not balanced, they do not lend themselves to any sound statistical analysis (both syn- and diachronic). Apart from wide geographical and temporal coverage, the corpus will also reflect the variety of genres practised in the Middle Ages, as well as the functional richness of the medieval textual culture. In order to enable automatic processing, the texts will be annotated with part-of-speech, lemma, time and place labels. The compilation and annotation of the corpus, albeit extremely important, will be only a first step of the project.
Secondly, a corpus search engine will be built with the help of the CQP-Web software. The users will be able to query the texts and benefit from their rich linguistic annotation through a user-friendly interface. Thirdly, the project aims at developing a set of efficient statistical analysis and data visualisation tools that researchers would embed in their own workflows. Written mostly in R, scripts, programs, wrapper functions will allow for advanced study of Medieval Latin vocabulary, but will be applicable to other languages as well. Both the texts and the tools will be made freely available to the scientific community through the project’s website and public code repositories.
After the brief presentation of the project’s goals, we will report on the workflow and the first results of the project that has started in March 2018. During less than two years, we intend to cover every aspect of corpus compilation from digitizing paper edition to linguistic annotation of the machine-readable texts. The first stage started with intensive work of cleaning and structuring the texts; some of them already are available in interoperable formats, but many are not. Our first work consisted, then, in selecting and retrieving about 1500 scholarly editions of Medieval Latin texts, among others from the Web. Image files which drastically differed in quality needed to be cleaned and standardized. To that purpose we employed a set of optimization scripts and tools such as ScanTailor. The TIFF files were next sent to the OCR engine. We chose the Tesseract-4 as our OCR engine for both its accurateness and performance. The quality of the OCRed text was optimized through error profiling and selective manual proofreading with the latest version of the Post Correction Tool software. The PoCoTo software addresses the problem of recurrent OCR errors by making easier both their detection and batch correction by even less technically-oriented users.
Figure : Using PoCoTo for OCR correction
The next phase of this early development will consist mainly in separating Latin from non-Latin text. The text blocks will be automatically annotated with ‘Latin’ or ‘Non-Latin’ labels by using a simple classifier. We will next employ Transkribus to verify the results of the automatic classification and to manually remove redundant text blocks such as editorial introductions, footnotes, indexes etc. After that, the texts will be exported to the TEI-XML format and annotated with metadata. Once a rough representativeness and balance of this corpus is achieved, the texts will be lemmatized and PoS-tagged using the Treetagger with the parameters developed for Medieval Latin during the “Omnia” Project (
http://glossaria.eu).
Bibliography
B
latt, F. (1973).
Index Scriptorum novus Mediae Latinitatis. Hafniae: Ejnar Munksgaard.
Bon
,
B
. (2005).
Index Scriptorum novus Mediae Latinitatis, Supplementum (1973-2005). Genève: Librairie Droz.
Corpora.
Catalan texts:
Corpora.
Diplomas & charters:
https://deeds.library.utoronto.ca,
http://philologic.cbma-project.eu
Corpora.
Polish texts:
Corpora.
Theological texts:
http://documentacatholicaomnia.eu,
Tools.
PoCoTo:
Tools.
ScanTailor:
http://scantailor.org
Tools.
Tesseract:
Tools.
Transkribus:
Tools.
Treetagger:
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
In review
Hosted at Utrecht University
Utrecht, Netherlands
July 9, 2019 - July 12, 2019
436 works by 1162 authors indexed
Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html
References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html
Series: ADHO (14)
Organizers: ADHO