Building spell-checking facilities for ancient Spanish

  1. Alejandro Bia

    Libraries - University of Alicante

  2. Manuel Sanchez-Quero

    Libraries - University of Alicante


The rapid development of information technology has given rise to a new type of library, the digital library (Arms, 2000). The Miguel de Cervantes Digital Library is one of the most ambitious projects of its kind ever undertaken in the Spanish-speaking world, with more than 4000 digital books at present. These digitised works are mostly Hispanic classics from the 12th to the 20th century. Producing these digital books requires a great deal of care in correction and editing, but once corrected they can be processed in a massive, uniform way to produce the different publication formats and services offered to readers.

Concerning the human resources involved in the project, the biggest group by far is the correction and markup staff (Bia and Pedreño, 2000), who are in charge of the hardest-to-automate part of the production process: reading and correcting digitisation errors, structurally marking up the texts, and taking important editing decisions that affect both the rendering and the functionality of the hypertext documents to be published. These humanists are highly skilled people with at least a bachelor's degree in philology or another humanistic discipline. We want them to devote their time to higher intellectual tasks, such as taking editing or markup decisions or preparing the texts for interesting Internet services (like text analysis or concordance queries), rather than spending their energy on the tedious, mechanical task of correction, which is the main bottleneck in our production workflow and by far the most time-consuming task.

In the case of contemporary works, spell-checkers turned out to be a useful aid to the correction process, but for literary works written in ancient Spanish, commercially available modern spell-checkers may introduce more mistakes than they prevent. The reason is that spell-checker dictionaries include only modern uses of the language, so when they are applied to old texts they take correct ancient word forms for mistakes and try to correct them. Unable to use spell-checking as an aid, correctors have to compare the original and the digitised texts side by side to detect the errors.

Being aware of the usefulness of spell-checkers in the correction of modern works, and lacking this facility for ancient texts, we decided to build dictionaries for ancient Spanish. This decision led to new problems and new questions. As there is no such thing as a single ancient Spanish, but rather a dynamically evolving language that changes through the centuries, how many old-Spanish dictionaries should we build? Should we set arbitrary chronological limits?

Taking advantage of the 4000 books already digitised and corrected at the Miguel de Cervantes Digital Library, as a corpus covering several centuries of Spanish writings, we’ve built a time-aware system of dictionaries that takes into account the temporal dynamics of language, to help solve the problem of ancient Spanish spell-checking.
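The general idea of a time-aware dictionary can be illustrated with a minimal sketch. It is not the system described in this paper: the function names, the grouping by century, the `window` parameter and the toy two-line corpus are all assumptions made purely for illustration. The sketch collects the word forms attested in corrected texts of each century, and then flags in a new text only those words that are not attested in the centuries around its date.

```python
import re
from collections import defaultdict

# Runs of Unicode letters (keeps accented letters and "ñ" inside a word).
WORD_RE = re.compile(r"[^\W\d_]+")


def century_of(year):
    """Return the century a year belongs to (1605 -> 17). Hypothetical helper."""
    return (year - 1) // 100 + 1


def build_period_lexicons(corpus):
    """Build one word list per century from (year, corrected_text) pairs.

    Grouping by century is an illustrative assumption, not the paper's choice.
    """
    lexicons = defaultdict(set)
    for year, text in corpus:
        words = (w.lower() for w in WORD_RE.findall(text))
        lexicons[century_of(year)].update(words)
    return lexicons


def unknown_words(text, year, lexicons, window=1):
    """List the words of `text` not attested within `window` centuries of `year`."""
    century = century_of(year)
    accepted = set()
    for c in range(century - window, century + window + 1):
        accepted |= lexicons.get(c, set())
    return [w for w in WORD_RE.findall(text) if w.lower() not in accepted]


# Toy data invented for this example, not taken from the library's corpus.
toy_corpus = [
    (1605, "fermosa dueña dixo"),
    (1900, "hermosa dueña dijo"),
]
toy_lexicons = build_period_lexicons(toy_corpus)

# Old forms are accepted for a 17th-century text but flagged for a 19th-century one.
print(unknown_words("dixo fermosa", 1610, toy_lexicons))
print(unknown_words("dixo fermosa", 1890, toy_lexicons))
```

Under these assumptions, widening `window` makes the checker more tolerant of forms that survive across neighbouring centuries, while narrowing it makes the check stricter; where to draw such limits is exactly the kind of question raised above.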

In this paper we present the problems we have found, the decisions we have made, and the conclusions and results we have arrived at. We have also been able to extract statistical information on the evolution of the Spanish language through time. The final section of the paper deals with the technical details of this project and the innovative application of digital methods such as the use of TEI and XML markup.


