Reusability of Literary Corpora: the "Montaigne at work" Project

Marie-Luce Demonet

Authorship

1. Marie-Luce Demonet

unités mixtes de recherche - CNRS (Centre national de la recherche scientifique), Université Francois-Rabelais

Work text

Marie-Luce Demonet, Centre d'Etudes Supérieures de la Renaissance, UMR-CNRS , Université Francois-Rabelais, Tours, France, marie-luce.demonet@univ-tours.fr
Introduction
I shall examine to what extent the BVH (Virtual Humanistic Libraries) project on Montaigne could be considered not simply another electronic edition, but also a component of a digital humanities infrastructure, observing the keywords of an integrated search: reliability, sustainability, dissemination, and above all, reusability. Is a project about Montaigne's work compatible with the "genericity" required for an undertaking that concerns a wide community?

"Montaigne at Work" in the BVH Website
The Bibliothèques Virtuelles Humanistes (http://www.bvh.univ-tours.fr) offers facsimiles (jpeg and light pdf) of books or manuscripts, extracted graphics with their indexing systems, and a textual database called Epistemon. It offers two types of digital surrogates of the book: the facsimile and, for some documents written in French, the corresponding transcription without modifications.

In itself, the idea of digitizing Montaigne's complete works is not original. A "Corpus Montaigne" already exists on CD-ROM (Corpus Montaigne, Paris: Champion-Garnier électronique, 1999), but no online version is available at present, and access is limited to the few libraries that could afford it. The "Montaigne project" in Chicago (P. Desan; http://www.lib.uchicago.edu/efts/ARTFL/projects/montaigne/) offers many documents related to Montaigne: it displays the "Villey" edition with every page of the "Bordeaux copy" (Exemplaire de Bordeaux, the so-called "EB"), but the distinction between the three main layers (1580-1582, 1588, 1588-EB) does not comply with the requirements of modern philology and scholarship: cancellations are not visible, and the editor modified the punctuation as well as the spelling. The 1595 (posthumous) edition is already available in HTML format at the mysterious Trismegiste website (http://www.bribes.org/trismegiste/es1ch03.htm), but there is no XML encoding, and the spelling is regularized. In our project, all the editions will offer the double display of original/regularized spelling, indexes of names, places, and errata, and a basic encoding appropriate for early printed books and manuscripts. Users will be able to retrieve both versions easily in the format they prefer (XML/TEI, HTML, PDF), according to the principle of reusability.

As we share our expertise with cultural institutions, we borrow our techniques and methods of digitization of cultural heritage objects, such as rare books collections, from libraries and archive repositories: digitization, metadata organization and catalogs, and database management. Our membership in the European Library (Europeana, http://www.europeana.eu/portal/) helps us to understand the difference between a cultural heritage attitude and a research project. The parallel display of the facsimiles and their transcriptions, TEI encoding, tools for scholarly annotations, and an accurate query system are not simple challenges to take on: the uniqueness of every work of art and the complexity of the process of writing seem incompatible with the unified view of textual databases usually found in library websites (e.g. Shakespeare at the British Library) or linguistic corpora (Frantext database or ARTFL in Chicago). Scholarly annotation will be minimal, and limited to the accuracy of the transcription, in order to provide a basis for further commentary, encyclopedic information, and glossaries. The very process of building the corpus for online publication is a field of new research in this case, for it combines ergonomic full display and retrieval, complex and relevant extraction procedures, and the treatment of texts and graphics.

The "Montaigne at work" project aims to support both the reading and the mining of the text, and to render the chronology. Our new data, expertise, and tools will try to fulfill the main goal we have always had of understanding Montaigne's Essays better: 1) to offer a genetic edition of the "Bordeaux copy," containing several layers of handwritten additions that reveal up to seven moments of writing or re-writing; 2) to give access to what is left of the famous "Librairie de Montaigne". The main corpus (all the editions from 1580 to 1595, with their transcriptions) will be enshrined inside a wider set of later editions of the Essais (Marie de Gournay's copy, Rousseau's copy), of several other works of Montaigne (the translation of the Theologie naturelle by Raymond Sebond, the Journal de voyage), of all his surviving manuscripts (marginalia, letters and Parliament archives), and of facsimiles of about a hundred identified sources (mainly classics, but also books by his contemporaries).

Genetic Encoding
The genetic edition of the Bordeaux copy, compatible with the TEI schemas for manuscripts and prints and with the "TEI Renaissance encoding" protocol developed in Tours (Manuel d'encodage TEI-Renaissance, 2009, http://www.bvh.univ-tours.fr/XML-TEI/index.asp), raises the question of the relevance of such an undertaking. It must start with a benchmarking of other websites offering open access to digitized works of the late Medieval and Early Modern periods (Chaucer, Dante, Shakespeare, Cervantes, Descartes, Molière, ...). What kind of textual properties do these sites represent? Do they use several models? Exclusive tools? Many literary projects, particularly in France, do not use TEI encoding (Flaubert in Rouen, Montesquieu in Lyons, Stendhal in Grenoble), and scholarly corpora seem to be specific to each author.

Classicists and Medievalists have opened many doors, and they know quite well how to refine an ultra-diplomatic encoding and display. Rendering the writing process requires an edition adequate to feed every hypothesis about the moments of the gesture itself, the "traits de plume" (pen strokes), and the modifications the printing press of the time forced upon the original. Special software designed by our computer science partners (in Tours, Paris, Rouen, La Rochelle) is currently being developed to detect image similarity. Thus, Montaigne's different "hands" could be classified according to time and language, with the expert help of Alain Legros (researcher in Tours, and an expert in Montaigne's handwriting; see Legros, Alain, Montaigne manuscrit, Paris: Garnier, 2010). We also need the clearest visualization of the readable parts, and the possibility of displaying either a smoothed text or a page that represents all the complex arrangements of the words in a spatio-temporal order. No models seem to be directly reusable: ours would take place between the very precise reconstitution of all the spellings of Medieval texts (e.g. the Actes des Apôtres project, http://eserve.org.uk/anr/) and the Madame Bovary digital edition of manuscripts at the University of Rouen (http://bovary.univ-rouen.fr/), but with a display system that would look like the Deutsches Textarchiv (DTA, http://www.deutschestextarchiv.de/): the facsimile of the page linked to the HTML text, and to the XML/TEI source, searchable with PhiloLogic (Mark Olsen, University of Chicago) and other NLP tools, with the XTF search engine (http://xtf.cdlib.org/documentation/programming-guide/). All the quotations of the Bordeaux copy will be fully referenced and translated into French.

In Tours, we have already begun the keyboarding and encoding of the main editions (1580-82, 1588, 1595). The genetic edition of the Bordeaux copy will be based on the principles of the ITEM laboratory (École Normale Supérieure, Paris), a leader in genetic analysis; these principles are compatible with the TEI tagging of the main operations (addition, deletion, inversion, etc.), according to the latest documentation of the TEI consortium. The COST "Interedition" project (funded by Europe; http://www.interedition.eu/) offers several tools to test (e.g. CollateX for the main editions), and discusses some issues close to our preoccupations, such as the limits of crowdsourcing: we plan to use collaborative annotation by scholars for the correction of errors.
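To make the idea of collating the main editions more concrete, here is a minimal sketch of a word-level alignment of two witnesses. It does not use CollateX: Python's standard `difflib` module stands in for it, and the function name `collate` is hypothetical. The first witness is a phrase from the 1580 quotation discussed in this abstract; the second is a hand-modernized rendering standing in for another witness, not an actual reading of a later edition.

```python
from difflib import SequenceMatcher


def collate(witness_a: str, witness_b: str) -> list[tuple[str, str, str]]:
    """Align two witnesses word by word and return an alignment table.

    Each row is (operation, reading in A, reading in B), where the
    operation is one of difflib's tags: equal, replace, delete, insert.
    """
    tokens_a = witness_a.split()
    tokens_b = witness_b.split()
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    table = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        reading_a = " ".join(tokens_a[a_start:a_end])
        reading_b = " ".join(tokens_b[b_start:b_end])
        table.append((tag, reading_a, reading_b))
    return table


# Witness A: phrase from the 1580 quotation; witness B: illustrative stand-in.
rows = collate("c'est moy que ie peins", "c'est moi que je peins")
for tag, reading_a, reading_b in rows:
    if tag != "equal":
        print(f"{tag}: {reading_a!r} / {reading_b!r}")
```

A dedicated collation tool handles transpositions and more than two witnesses, which this pairwise sketch does not attempt.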

Every layer of text must be retrievable, to avoid incompatibility between the genetic and the generic, and to guarantee reusability to anyone who wishes to process the text (with permission) for other purposes. Ideally, a collaborative edition of Montaigne's Essais would blossom out of the accurate transcription of the Bordeaux copy, and/or of the posthumous 1595 edition: the debate is still pending among specialists.
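As an illustration of what "retrievable layers" could mean in practice, the following sketch reads a TEI-like fragment and renders the text as it stood before or after a given layer of revision. It is only a sketch under stated assumptions: the fragment omits the TEI namespace, its wording is placeholder text rather than a transcription, the use of a `change` attribute with the value `#eb` to label a layer is an assumed convention, and the function name `render` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder fragment (not a real transcription); "#eb" is an assumed
# label for the handwritten layer of the Bordeaux copy.
SAMPLE = (
    '<p>printed text <del change="#eb">cancelled words</del>'
    '<add change="#eb">handwritten addition</add> more printed text</p>'
)


def render(element: ET.Element, layers: set[str]) -> str:
    """Return the text of `element` once the given layers are applied.

    An <add> is shown only if its layer has been applied; a <del> is
    shown only while its layer has not been applied yet.
    """
    applied = element.get("change") in layers
    if element.tag == "add":
        visible = applied
    elif element.tag == "del":
        visible = not applied
    else:
        visible = True

    parts = []
    if visible:
        parts.append(element.text or "")
        for child in element:
            parts.append(render(child, layers))
    # The tail follows the element in its parent, so it is always kept.
    parts.append(element.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
print(render(root, set()))      # printed state, before handwritten revision
print(render(root, {"#eb"}))    # state after the "#eb" layer
```

Passing a set of layer labels rather than a single label leaves room for several successive moments of rewriting to be applied cumulatively.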

Automatic Regularization
We will generate three levels of transcripts:

the "quasi-diplomatic" transcript, crucial for the comparison between the typeset and the handwritten passages (the spelling of which has never been thoroughly studied);
the "cultural heritage" transcription, which regularizes the distinction of I/J and U/V, expands the brevigraphs, and normalizes line endings, so that the corpus can be processed by NLP tools and parallel corpora analysis;
the modernized version, so that powerful search engines can offer accurate results to anybody.
A prototype of the I/J and U/V normalization tool has already been developed in Tours and Poitiers, with a set of rules and specific dictionaries; the modernization tool is in progress, and requires another set of rules and other dictionaries.
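The combination of rules and dictionaries can be sketched as follows. This is not the Tours and Poitiers prototype: the two rules (initial i before a vowel becomes j, u between vowels becomes v), the macron expansion, the toy lexicon, and all function names are assumptions made for illustration. The sample phrase comes from the 1580 quotation discussed in this section.

```python
import re
import unicodedata

MACRON = "\u0304"  # combining macron, taken here as the nasal abbreviation mark

# Hypothetical toy lexicon of regularized forms; a real tool would rely on
# full dictionaries of Early Modern French.
LEXICON = {"je", "avoir", "jamais", "jouer"}


def expand_brevigraphs(text: str) -> str:
    """Expand a vowel carrying a macron into vowel + nasal consonant."""
    decomposed = unicodedata.normalize("NFD", text)
    # Simplified rule: m before b, p or m; n everywhere else.
    decomposed = re.sub(f"([aeiou]){MACRON}(?=[bpm])", r"\1m", decomposed)
    decomposed = re.sub(f"([aeiou]){MACRON}", r"\1n", decomposed)
    return unicodedata.normalize("NFC", decomposed)


def regularize_ij_uv(word: str) -> str:
    """Propose j/v spellings and keep one only if the lexicon confirms it."""
    lower = word.lower()
    j_form = re.sub(r"^i(?=[aeiou])", "j", lower)
    jv_form = re.sub(r"(?<=[aeiou])u(?=[aeiou])", "v", j_form)
    for candidate in (jv_form, j_form):
        if candidate != lower and candidate in LEXICON:
            return candidate.capitalize() if word[0].isupper() else candidate
    return word  # unknown forms are left untouched


def regularize(text: str) -> str:
    """Apply brevigraph expansion, then I/J and U/V rules word by word."""
    expanded = expand_brevigraphs(text)
    return re.sub(
        r"[^\W\d_]+", lambda match: regularize_ij_uv(match.group(0)), expanded
    )


print(regularize("Ie veus qu'ō m'y voie en ma façō simple"))
```

Checking each candidate against a lexicon keeps the rules from over-applying: a form such as "iouer" becomes "jouer" rather than "jover", and obsolete morphology such as "veus" is deliberately left for the separate modernization step.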
The development of these tools benefits from one of the two Google awards that the University of Tours obtained in December 2010 for "Full-text retrieval and indexation for Early Modern French documents". New software will process a sentence such as:

Ie veus qu'ō m'y voie en ma façō simple, naturelle & ordinaire, sans estu de & artifice: car c'est moy que ie peins (Montaigne, Essais, 1580)

With these spellings, the user who is not a specialist will find only a few results for a word or string query because of typographic abbreviations (façō for façon), obsolete morphology (veus for veux), and the frequent lack of hyphenation. In modern editions that follow the former spellings, one will easily find « estude »; but if one looks for « étude » (in modern French), the old spelling will not be offered, and one will miss the variant « estude » in the corpus, where moreover the word is split at the line break without a hyphen. Cf. the Old English variation analysis in the York-Helsinki corpora (http://www.helsinki.fi/varieng/CoRD/corpora/YCOE/index.html).
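The retrieval problem described here can be sketched with a small index keyed by modern forms, so that a query for « étude » also returns « estude », including an occurrence split without a hyphen. The mapping `MODERN_FORMS` and the function names are hypothetical; a real system would derive the mapping from modernization rules and dictionaries, and the rejoining heuristic below is deliberately naive.

```python
import re
from collections import defaultdict

# Hypothetical mapping from old spellings to modern forms.
MODERN_FORMS = {"estude": "étude", "veus": "veux", "moy": "moi", "ie": "je"}


def build_index(text: str) -> dict[str, list[tuple[int, str]]]:
    """Index each token position under its modern form.

    Two adjacent tokens are rejoined when their concatenation is a known
    old spelling, which covers words split without a hyphen.
    """
    tokens = re.findall(r"[^\W\d_]+", text)
    index = defaultdict(list)
    position = 0
    while position < len(tokens):
        token = tokens[position]
        if position + 1 < len(tokens):
            following = tokens[position + 1]
            joined = (token + following).lower()
            if joined in MODERN_FORMS:
                surface = f"{token} {following}"
                index[MODERN_FORMS[joined]].append((position, surface))
                position += 2
                continue
        key = MODERN_FORMS.get(token.lower(), token.lower())
        index[key].append((position, token))
        position += 1
    return index


def search(index: dict[str, list[tuple[int, str]]], query: str) -> list[tuple[int, str]]:
    """Return (token position, original spelling) pairs for a modern query."""
    return index.get(query.lower(), [])


index = build_index("sans estu de & artifice: car c'est moy que ie peins")
print(search(index, "étude"))
print(search(index, "moi"))
```

Keeping the original surface form in each index entry lets the interface display the old spelling while the user searches in modern French.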

Montaigne's Library
Thus, Montaigne's library itself can be rebuilt through the comprehensive digitization of what remains of the hundred known copies with his signatures and annotations: 33 are preserved at the French National Library, 30 at the Bordeaux Public Library, others in at least 15 other libraries and private collections. Samples of his handwriting will be analyzed and compared to non-attributed manuscripts, in order to confirm or exclude dubious documents.

Such a project will enlarge the knowledge we already have of Montaigne's method of writing, within the context of his favorite readings. Where other projects provide data, this one also offers reusable sets of transcriptions, facsimiles, and new tools for further analysis.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

Complete

ADHO - 2011

"Big Tent Digital Humanities"

Hosted at Stanford University

Stanford, California, United States

June 19, 2011 - June 22, 2011

151 works by 361 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (still needs to be added)

Conference website: https://dh2011.stanford.edu/

Series: ADHO (6)

Organizers: ADHO
