Knowledge Representation: Old, New, and Automated Indexing

paper, specified "short paper"
  1. Peter M Logan

    Temple University

  2. Jane Greenberg

    Drexel University


While historical documents are common sources for digital humanities collections, many digital projects still do not use controlled terminologies to represent their content and aid search, discoverability, and use. There are often pragmatic reasons for this. Identifying appropriate descriptive terms is a specialized skill, and too few DH projects have access to information specialists who can index their documents for them. We are addressing these challenges with the Nineteenth-Century Knowledge Project, an ongoing initiative to create the first standards-compliant digital version of historical editions of the Encyclopedia Britannica. The sheer scale of the project precludes human indexing, because it would take an estimated six to eight years to read through all of the entries. Instead, we use an innovative method to add automatically generated content metadata using linked open terminologies and the HIVE approach. This method has allowed us to experiment with which controlled vocabulary is optimal for indexing historical documents. Our presentation will focus on the results of this experiment.

HIVE stands for Helping Interdisciplinary Vocabulary Engineering; it allows standardized, linked open vocabularies to be used for automatically indexing digitized text. The Knowledge Project demonstrates how large corpora can use automated indexing and still garner the benefits of controlled terminology. As part of the undertaking, we need to optimize the fit between the material and the controlled vocabulary we choose for the indexing. What is the best vocabulary to use? We will compare the outputs generated from current and historical controlled vocabularies. The question we are trying to answer is whether a historical vocabulary that was current at the time of publication produces significantly different results from the present-day Library of Congress Subject Headings.
The topic is critical for the Knowledge Project in particular. We are examining a 120-year span of historical editions, with an eye on identifying changes large and small in the construction of knowledge over time. Researchers know that knowledge changed dramatically between 1790 and 1910, and part of that change was a shift in attitudes about what counted as “official” knowledge and what did not. As social beliefs change, so too do the culture’s ideas about what matters and even about who has the authority to define knowledge. In the nineteenth century, that authority decisively shifted out of the hands of religious authorities and into those of scientists and professionals. The history of the earth changed, from a narrative based in scripture to an evolving narrative dependent on the discoveries of geologists and biologists, like Charles Darwin. Social beliefs at the time also defined the selection of articles; there were few topics specific to women, as we would expect in a society that devalued their contributions. And social beliefs also shaped the content of the articles that were included: those on India and Africa reflect a colonizer’s perspective and represent indigenous people in ways we recognize as offensive stereotypes. Britannica was the most authoritative general reference source in the English language in the nineteenth century, largely because it faithfully represented the idea of knowledge at the time, in the most comprehensive fashion possible, and so it documents for us today the many problems inherent in the nineteenth-century concept of knowledge itself.
This historical collection is not what we would call “knowledge” today, and that is exactly the point. It serves as a viable stand-in for what constituted knowledge in a previous century, and thus provides us with an important data set to explore the changing structure of knowledge over time. Within this corpus, we can trace the emergence of the twentieth-century concept of knowledge. But to do this accurately, we need to identify the older structure of knowledge and preserve its internal integrity as a historical object.
Figure: Example of knowledge organization in 1728. Source: Chambers, E. (1728), Preface, Cyclopaedia, or, An universal dictionary of arts and sciences, 2 vols., London: Knapton, p. ii.

Like other comprehensive representations of knowledge, older vocabularies are also socially constituted, rather than neutral categorizations of knowledge topics. The Dewey Decimal Classification illustrates this principle well; in its early formation, the only two categories for African Americans were “slave” and “colonial subject.” This left no way to indicate books that had been authored by African American writers, and so it mirrored the same racial stereotyping present in the nineteenth-century Encyclopedia editions. Both are products of the same society and so both embody parallel issues in representing knowledge as a system.
In theory, by imposing a controlled vocabulary from the twenty-first century onto older historical materials, we are distorting the historical structure of knowledge and muddying the waters, precisely when we need to see most clearly the difference between current and historical concepts. In theory, using a historically appropriate vocabulary will generate metadata that better captures the older structure of knowledge by not translating it into the terms of a current, equally contingent vocabulary.

There is little evidence that this theory has ever been tested. We are running a study on these entries using the HIVE automated keyword generator with two controlled vocabularies. The first is the current Library of Congress Subject Headings. The second is the controlled vocabulary created by Ephraim Chambers for his Cyclopaedia (2 vols., 1728). This was the ontology used by the Britannica for its third edition of 1790, so it is the ideal vocabulary for the material. In this short presentation, we would like to review the experiment, compare the automated index terms output from the two vocabularies, and present our findings on the real-life consequences of indexing older materials with historical vocabularies.
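
To make the comparison concrete, the following is a minimal sketch of how index terms drawn from two vocabularies might be compared for a single entry. It is not HIVE's algorithm or API: the phrase matcher, the function names, the two miniature vocabularies, and the sample entry text are all hypothetical placeholders invented for illustration, and Jaccard overlap is only one of several measures that could be used.

```python
import re

def index_terms(text, vocabulary):
    """Return the vocabulary terms that occur in the text as whole words or phrases."""
    # Lowercase and strip punctuation so matching is on word boundaries only.
    normalized = " ".join(re.findall(r"[a-z]+", text.lower()))
    padded = f" {normalized} "
    return {term for term in vocabulary if f" {term.lower()} " in padded}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical miniature vocabularies (placeholders, not real LCSH or Chambers data).
MODERN_VOCAB = {"geology", "chemistry", "natural history"}
HISTORICAL_VOCAB = {"natural history", "natural philosophy", "chymistry"}

# Invented sample sentence, not a quotation from any Britannica edition.
ENTRY = "Geology is that branch of natural history which treats of the structure of the earth."

modern = index_terms(ENTRY, MODERN_VOCAB)
historical = index_terms(ENTRY, HISTORICAL_VOCAB)
overlap = jaccard(modern, historical)

print(sorted(modern))
print(sorted(historical))
print(overlap)
```

A low overlap score across many entries would suggest that the two vocabularies describe the same material in substantially different terms, which is the kind of difference the experiment sets out to measure.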


Conference Info

In review

ADHO - 2019

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019


Series: ADHO (14)

Organizers: ADHO