Re-Engineering a War-Machine: ARTFL's Encyclopedie

Mark Olsen

Authorship

1. Mark Olsen

ARTFL Project - University of Chicago

Parent session

TEI/SGML (a), Allen Renear

Original URL

http://web.archive.org/web/19991001092819/http://lingua.arts.klte.hu/allcach98/abst/abs32.htm

Work text


Introduction
The Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers, par une Société de Gens de lettres was published under the direction of Diderot, with 17 volumes of text and 11 volumes of plates between 1751 and 1772. Contributors included the most prominent philosophes: Voltaire, Rousseau, d'Alembert, Marmontel, d'Holbach and Turgot, to name only a few. These great minds (and some lesser ones) collaborated in the goal of assembling and disseminating in clear, accessible prose the fruits of accumulated knowledge and learning. Containing 72,000 articles written by more than 140 contributors, the Encyclopédie was a massive reference work for the arts and sciences, as well as a machine de guerre which served to propagate Enlightened ideas.
Due to problems of censorship, successive volumes of the Encyclopédie appeared at an irregular pace. The first seven volumes were issued, one per year, from 1751 to 1757. Distribution of the ten remaining volumes took place in 1766. The volumes of plates, relatively unaffected by censorship, were released at the rate of roughly one per year from 1761 to 1772. About 4,000 copies of the original printing were made.
The impact of the Encyclopédie was enormous, not only in its original edition, but also in multiple reprintings in smaller formats and in later adaptations. It was hailed, and also persecuted, as the sum of modern knowledge and as a monument to the progress of reason in the eighteenth century. Through its attempt to classify learning and to open all domains of human activity to its readers, the Encyclopédie gave expression to many of the most important intellectual and social developments of its time.
The Encyclopédie presents an interesting set of design and implementation problems arising from the substance of this compendium of Enlightened knowledge and the editors' view that this body of knowledge had a structure of inter-relations that must be preserved and exploited by scholars in various ways. I would like to present an overview of the ARTFL Encyclopédie with particular focus on the design principles, new style of search engine, and some preliminary applications that our alpha test users have worked out.
ARTFL Encyclopédie: Scale of the Project
The Encyclopédie is a large undertaking. It comprises 17 volumes of articles and 11 volumes of plate legends: roughly 18,000 folio pages of text, typically in two columns, or just around 21 million words (20,736,912 tokens, 391,893 types). The internal structure is complicated. We have, to date, identified 76,242 textual objects, including 44,632 main articles, 28,366 subarticles, and 2,575 plate legends, together with 64,000 cross-references and 12,000 article-to-plate references. Each textual object has a significant number of important attributes, including the headword(s), author(s), class of knowledge, and part of speech.
While ARTFL received a very generous grant from the SCALER Foundation, budgetary considerations remained of primary importance to the project. As the costs of data entry, new servers, and programming were calculated to absorb most of the grant, we decided early on to attempt to build the Encyclopédie with a minimum of human intervention at every stage. We have built the entire system, with its current levels of functionality, without human editing and with relatively minimal tagging.
Design Overview
In order to keep the cost of data entry as low as possible, we decided to digitize each page of the Encyclopédie. This allowed us to adopt a simple, low-cost data entry specification. Elements that could not be keyboarded easily are indicated by an omit tag; these include a wide range of materials such as tables, mathematical formulae, musical scores, characters from many alphabets, and so on. And since we assumed the existence of linked digital images of pages, we did not encode font or page layout specifications except for those which aided in subsequent automatic structural tagging. Finally, the large number of inline graphics and other untypable materials convinced us that it would be far more cost-effective to simply digitize entire pages, and link them to text pages dynamically, than to provide the inline images as discrete files.
By utilizing digital page images as an immediate adjunct to the full text, we were able to develop a simple and cheap typographic tag set. Keyboard operators did not perform any substantive tagging, such as article or plate legend identification, which is frequently ambiguous; this lowered data entry costs further. We anticipated that we would be able to identify all required structures automatically.
Automated Tagging
The process of loading the data capture image of the Encyclopédie consists of a series of programs that use typographic conventions to parse the document and add feature elements using fairly simple rules and pattern recognition. While a more exact description of these procedures will be given in the full talk, it is important to note that this approach proved to be remarkably successful, with better than 99% of articles and subarticles properly identified, and only slightly lower rates of identification for authors, classes of knowledge, and parts of speech. Owing to their variable nature, cross-references are recognized at a slightly lower rate, still well over 95%, and plate references at a rate somewhat lower than that. While structural data is written to the text data file during the loading process, for housekeeping purposes, the final representation is linked to an external database, since that can be edited easily by hand at a later date and is required for Mixed Database Searching (see the next section). Cross-references are stored as internal arguments, but are implemented as headword searches on the database, allowing us to build full hypertextual referencing automatically, rather than encoding references directly.
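The kind of rule-based recognition described above, and the resolution of cross-references as headword searches, can be illustrated with a minimal Python sketch. The typographic conventions it encodes (a capitalized headword followed by a comma, a class of knowledge in parentheses, cross-references introduced by "Voyez") and all function names are assumptions made for illustration; they are not the actual ARTFL recognizers, which are not described in this abstract.

```python
import re

# Assumed conventions, for illustration only (not the ARTFL rules):
#   - an article opens with a headword in capitals followed by a comma
#   - an optional class of knowledge follows in parentheses, e.g. "(Astron.)"
#   - a cross-reference is "Voyez" followed by a headword in capitals
ARTICLE_RE = re.compile(r"^(?P<head>[A-ZÀ-Ý][A-ZÀ-Ý' -]+),\s*(?:\((?P<cls>[^)]+)\))?")
XREF_RE = re.compile(r"Voyez\s+([A-ZÀ-Ý][A-ZÀ-Ý'-]+)")


def tag_articles(lines):
    """Split keyboarded lines into article records using the patterns above."""
    articles = []
    current = None
    for line in lines:
        match = ARTICLE_RE.match(line)
        if match:
            # A new headword closes the previous article and opens a new one.
            current = {
                "headword": match.group("head").strip(),
                "class": match.group("cls"),
                "xrefs": [],
                "text": [],
            }
            articles.append(current)
        if current is not None:
            current["text"].append(line)
            current["xrefs"].extend(XREF_RE.findall(line))
    return articles


def resolve_xrefs(articles):
    """Resolve each cross-reference as a headword lookup, not a stored link."""
    by_headword = {}
    for position, article in enumerate(articles):
        by_headword.setdefault(article["headword"], []).append(position)
    links = []
    for position, article in enumerate(articles):
        for target in article["xrefs"]:
            # An empty list marks a reference whose target was not found.
            links.append((position, target, by_headword.get(target, [])))
    return links


# Invented sample lines, loosely in the style of the text.
sample = [
    "COMETE, (Astron.) corps céleste. Voyez PLANETE.",
    "Les cometes décrivent des orbites. Voyez ORBITE.",
    "PLANETE, (Astron.) astre errant.",
]
articles = tag_articles(sample)
links = resolve_xrefs(articles)
```

Because the links are computed from headwords at load time, correcting a recognizer and rerunning the load regenerates every link; unresolved targets (such as ORBITE in the sample) surface as empty lists that can be inspected.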
Automated tagging has two additional benefits. As a part of a database load, we developed a quick search engine that allowed us to navigate the database immediately, which proved extraordinarily helpful for debugging. In this environment, we could make corrections to the recognizers and rerun the database load, allowing us to make rapid refinements to tagging and structures. To date, we still reload the database from the data capture image.
Mixed Database Search Engine
ARTFL search engines have, for many years, used a two-stage mechanism for searching large databases. The first stage queries a multi-field database to create a sub-corpus object list; the second passes that list to the index engine. A query for the phrase "opinion publique" in texts from 1780-1802, for example, would first query a bibliographic database to build a list of texts to search, which is then passed to the index engine to identify the locations of corresponding citations in that set of documents. Some non-trivial implementation modifications to this model were required to enable full "sub-corpus" searching of the Encyclopédie using all of the attributes of the 76,000 identified objects. Thus, for example, one may search articles by d'Alembert with "astron" in the class of knowledge field for the word "Newton". If no keyword is specified, the object list is simply passed to a routine that provides links to the corresponding articles. The importance of this model cannot be overstated, since textual objects of an arbitrary size or make-up can be selected by using a typical boolean database query. Textual object management databases will become increasingly important in full-text systems as databases grow in size and become more complex in structural representation.
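The two-stage model can be sketched in a few lines of Python. This is a simplified illustration, not the ARTFL engine: the records, field names, and functions are invented, the metadata query is a plain substring match rather than a real database query, and the word index returns object identifiers rather than the locations of citations.

```python
from collections import defaultdict

# Invented records standing in for identified textual objects.
OBJECTS = [
    {"id": 1, "author": "d'Alembert", "knowledge_class": "Astronomie",
     "text": "Newton a démontré la loi de la gravitation"},
    {"id": 2, "author": "d'Alembert", "knowledge_class": "Géométrie",
     "text": "Newton et Leibniz"},
    {"id": 3, "author": "Diderot", "knowledge_class": "Astronomie",
     "text": "les cometes selon Newton"},
    {"id": 4, "author": "d'Alembert", "knowledge_class": "Astronomie physique",
     "text": "le mouvement des planetes"},
]


def build_index(objects):
    """Map each lowercased word to the set of object ids containing it."""
    index = defaultdict(set)
    for obj in objects:
        for token in obj["text"].lower().split():
            index[token].add(obj["id"])
    return index


def select_objects(objects, **criteria):
    """Stage 1: build the object list from a query on the metadata fields."""
    return [
        obj["id"]
        for obj in objects
        if all(value.lower() in obj[field].lower()
               for field, value in criteria.items())
    ]


def search(objects, index, keyword=None, **criteria):
    """Stage 2: restrict the word search to the objects chosen in stage 1."""
    selected = select_objects(objects, **criteria)
    if keyword is None:
        # No keyword: the object list itself is the result.
        return selected
    hits = index.get(keyword.lower(), set())
    return [obj_id for obj_id in selected if obj_id in hits]


INDEX = build_index(OBJECTS)
with_keyword = search(OBJECTS, INDEX, keyword="Newton",
                      author="d'Alembert", knowledge_class="astron")
without_keyword = search(OBJECTS, INDEX,
                         author="d'Alembert", knowledge_class="astron")
```

Here `with_keyword` holds only the d'Alembert astronomy article that mentions Newton, while `without_keyword` holds every object matching the metadata query, mirroring the case in which the object list is passed directly to a routine that links to the articles.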
Sample Applications
The preliminary implementation of the Encyclopédie is being used for some courses and by a team of alpha-testers. Already, it is resulting in several important research projects. Studies are beginning to examine economic and political ideas in the Enlightenment based on this new resource. My own interest is in tracing the cross-references, now implemented as hypertext links, to see the degree to which the various classes of knowledge are interlinked; these classes are a vital substructure of the organization of the Encyclopédie and, I will assert, of the structure of 18th-century Enlightened knowledge. The database readily supports statistical analysis of these linkages and should result in a visual representation of the web of disciplinary connections that the editors and authors of the Encyclopédie thought were critical to proper understanding of all of the sciences, arts and métiers.
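One simple form such a statistical analysis could take is a count of cross-references between classes of knowledge. The following Python sketch assumes each article record carries a headword, a class, and a list of cross-referenced headwords; the records and the function name are invented for illustration and do not reflect the actual database schema or any results.

```python
from collections import Counter

# Invented article records; classes and cross-references are illustrative.
ARTICLES = [
    {"headword": "COMETE", "class": "Astronomie",
     "xrefs": ["PLANETE", "ATTRACTION"]},
    {"headword": "PLANETE", "class": "Astronomie", "xrefs": ["ORBITE"]},
    {"headword": "ATTRACTION", "class": "Physique", "xrefs": ["PLANETE"]},
    {"headword": "ORBITE", "class": "Géométrie", "xrefs": []},
]


def class_links(articles):
    """Count cross-references by (source class, target class) pair."""
    class_of = {a["headword"]: a["class"] for a in articles}
    counts = Counter()
    for article in articles:
        for target in article["xrefs"]:
            # Skip references whose target headword was not identified.
            if target in class_of:
                counts[(article["class"], class_of[target])] += 1
    return counts


link_counts = class_links(ARTICLES)
```

The resulting counts form a directed, weighted graph over the classes of knowledge, which is the kind of structure a visual representation of disciplinary connections could be drawn from.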
Conclusion and Current Project Status
Developing large, complex databases, in the humanities at least, is always a balance between available funding and an ideal implementation. Many of the design and implementation procedures we developed were in response to our desire to build as complete a system as possible, as quickly as possible, within a fairly limited budget. Reliance on a complete digital image facsimile of the Encyclopédie, for example, allowed us to adopt a very simple and light tagging scheme which was completely automated. Even such limited encoding, however, required significant modifications to our search engines in order to provide the kinds of functionality required.
The most important general implication of this talk is the assertion that reliance on digital imaging and automated mechanisms to identify many structural elements in databases may call into question much of the discussion I have heard surrounding the Text Encoding Initiative and projects using TEI guidelines. There are more cost-effective and less labor-intensive ways to create complex databases. Thus, while the design and implementation of the ARTFL Encyclopédie is, I hope, of substantive interest to researchers in humanities computing, it is my belief that reliance on the TEI approach to encoding and preserving textual resources may blind other scholars to better ways of building and using textual databases. Of particular note is the fact that database creation and search/analysis engine development need to be related activities; the failure to relate them will, I believe, become more evident when the TEI turns 20 in ten years' time.
The ARTFL Encyclopédie is in preliminary alpha-test mode. The project overview and a relatively limited sample are publicly available at:
http://humanities.uchicago.edu/ARTFL/projects/encyc/

Full text license: This text is republished here with permission from the original rights holder.


Conference Info


ACH/ALLC / ACH/ICCH / ALLC/EADH - 1998

"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

109 works by 129 authors indexed

Conference website: https://web.archive.org/web/19991022041140/http://lingua.arts.klte.hu/allcach98/

References: http://web.archive.org/web/19990225164509/http://lingua.arts.klte.hu/allcach98/abst/jegyzek.htm

Attendance: ~60 (https://web.archive.org/web/19990128030244/http://lingua.arts.klte.hu/allcach98/listpar3.htm)

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC
