Indexing and Linking Text in a Large Body of Family Writings

poster / demo / art installation
  1. 1. Beatrice Dal Bo

    Université Paul-Valéry Montpellier

  2. 2. Francesca Frontini

    Université Paul-Valéry Montpellier

  3. 3. Giancarlo Luxardo

    Université Paul-Valéry Montpellier

  4. 4. Agnès Steuckardt

    Université Paul-Valéry Montpellier

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Corpus 14, a corpus of correspondences from World War I
The processing of correspondences for a digital edition presents similarities with other textual genres, and therefore common methodologies and tools can be applied. However, as correspondence corpora display a reticular presentational organisation and since the related projects often require to progressively gather various sources, publishing through digital media has several advantages with respect to print publishing.
As opposed to most correspondence projects, dedicated to well-known authors or scholarly writings

For a list of projects see Stadler
et al., 2016.

, the
Corpus 14 project deals with correspondences from ordinary people

Another project dealing with the digitisation of this kind of correspondences is
Digitising experiences of migrations:
The development of interconnected letter collections, by Moreton and Nesi (2013-2014, For further information, see also Moreton, 2016; Moreton
et al., 2014.

. In conjunction with the commemoration of the centenary of WWI, the project coincides with the French initiative to appeal to families in order to gather documents or objects related to the war (the
Grande Collecte was initiated by the
Mission Centenaire

in 2013). The project allowed for the digitisation and the publication of a vast amount of data (letters, postcards) aimed at specialists in two fields:

the study of the Great War and its legacy on social memory, including the cultural transformations that occurred during and because of the war,
the evolution of the linguistic uses that happened at that time, in particular among less-literate people (from different areas of France, mostly with rural origins), the propagation of a slang (the so-called
“argot des Poilus”) or the influence of regional variations, especially in areas characterised by diglossia situations.

As of August 2018, the corpus is comprised of almost 1800 correspondences written by 37 writers in 11 areas, for a total of almost 500,000 tokens

The current size of the corpus is limited, but future additions are planned; the effort was so far concentrated on establishing a methodology, selecting relevant texts that have been accurately transcribed.
. The construction of the corpus has been carried out with the following requirements:

the selection of exclusively less-literate writers,
the preference for writings that can be followed up over time for a network of related writers (although it is generally impossible to build a complete set of correspondences),
the accuracy of the transcription in terms of the encountered spelling, punctuation or syntax as well as characteristics of the sources (legibility, layout).

The transcripts were encoded in conformance with the TEI guidelines


, which preserves the alignment between the text and the facsimiles, to mark up the logical structure of the text as well as features of readability. Two versions of the text are kept: one accurate and one with a normalised orthography.

The metadata, described by the
TEIheader, includes the
correspDesc element (as defined by the TEI SIG on Correspondence and introduced in the 2.8.0 version of TEI P5), allowing for the identification of sender(s), addressee(s), relationship, date and place of sending.

In addition to the methodological approach followed in the transcription process,
Corpus 14 relies on the data representation model provided by the TXM platform

You can access the
Corpus 14 from

, which is used for text querying and analysis. TXM distinguishes properties for structural units (either inherited from the TEI encoding or defined by the user annotation) and lexical units (through an automated morphosyntactic tagging and lemmatisation, which in fact is of limited usefulness for non-standard language).

The corpus is also available for download from the Ortolang platform

Access via permanent identifier

, a repository for language resources, which provides long-term archiving and allows for metadata harvesting by means of the OAI-PMH protocol.

Semantic annotation of persons and places
Recent developments of the project mainly concern the semantic annotation of Named Entities, especially persons and places. These are identified following a workflow including both an automated tool, REDEN Online (Frontini
et al., 2016) and a manual annotation, and eventually marked up according to the guidelines of the TEI module for names, dates, people and places. Place references are linked to an external knowledge bases (DBpedia) whenever possible, whereas persons are referenced to an internal index. NEs in the
correspDesc and in the
body are annotated in the same way, but their position in the TEI tree determines their role, so that a simple XPath can return for instance only people and places that are mentioned in a letter. Such analysis shows that places and persons evoked in the correspondences between soldiers and their families are more likely to refer to their previous lives at home. This may be due to the wish of soldiers to maintain the connection with their network and their link to the household environment, which is one of the main functions of these letters (Caffarena, 2005; Gibelli, 2014).

Ongoing work: organisation of correspondence networks
Various ways of visualising information related to people, places and dates in correspondences, especially when it concerns metadata, have been developed by several projects. We cite among others
The Migrant letter (O'Leary and Moreton
, 2017);
Visual Correspondence

Mapping the Republic of Letters

Early Modern Letters Online

Clavius on the Web

. Specifically, the investigation of spatial references can be facilitated by a cartographic visualisation. Our main aim in developing our own visualisation interface is to adhere to and exploit existing standards, such as TEI and Semantic Web technologies, so as to create a framework that could ideally be reused for similar corpora.

The visualisation proposed for the
Corpus 14 project displays each set of correspondences (a soldier and his wife/relatives at home) on the map, with its sender and receiver addresses identified by two distinct colors, as well as markers for places mentioned in the letter

The visualisation (which will be available as a demo on the project's website) was developed in collaboration with Pietro Barbieri, Chiara Capone and Luca Ciccone, MSc students in computer science, supervised by Marina Ribaudo, associate professor at DIBRIS, Università degli Studi di Genova.


This way of visualising spatial and personal information contained in letters brings an additional layer to the reading, as well as to the analysis of
Corpus 14 correspondences, which have already been carried out in Steuckardt
et al. (2015) and Roynette
et al. (2017)
inter alia.


Caffarena, F.
Lettere dalla Grande Guerra. Scritture del quotidiano, monumenti della memoria, fonti per la storia. Il caso italiano
. Milano: Unicopli.

Frontini, F., Brando, C. and Ganascia, J. G.
(2016). REDEN ONLINE: Disambiguation, Linking and Visualisation of References in TEI Digital Editions.
Digital Humanities 2016: Conference Abstracts
. Kraków: Jagiellonian University & Pedagogical University.

(accessed 21 November 2018).

Gibelli, A.
La guerra grande: storie di gente comune
. Bari: Laterza

Moreton, E.
(2016). Ph.D. thesis: The emigrant letter digitised: markup and analysis. Birmingham: University of Birmingham. Available from:

(accessed 18 March 2019)

Moreton, E., O'Leary, N. and O'Sullivan, P.
(2014). Visualising the emigrant letter.
Revue Européene des Migrations Internationales. Special issue: Traces of Dispersion
, vol.
30 - n° 3 et 4.

(accessed 18 March 2019).

O'Leary, N. and Moreton, E.
(2017). The Migrant Letter Digitised: Visualising Metadata.
Journal of Cultural Analytics
. doi:


(accessed 19 March 2019).

Corpus 14


Roynette, O., Siouffi, G. and Steuckardt, A. (eds).
La langue sous le feu : mots, textes, discours de la Grande Guerre
. (Histoire). Rennes: Presses universitaires de Rennes.

Stadler, P., Illetschko, M. and Seifert, S.
(2016). Towards a Model for Encoding Correspondence in the TEI: Developing and Implementing <correspDesc>.
Journal of the Text Encoding Initiative


(accessed 21 November 2018).

Steuckardt, A. (ed)
(2015). Entre villages et tranchées: l’écriture de poilus ordinaires. Uzès: Inclinaison.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.