Linking and Gathering: Automatic Hypertext in the Perseus Digital Library

David A. Smith

Northeastern University, Tufts University

The Web of Text and Hypertext

Our research aims to augment electronic documents by automatically generating hypertexts from the contents of a digital library. Although hypertext developed from paper practices of citation and annotation, technical considerations have until recently made linking to other documents, rather than gathering information into an enriched reading space, the primary hypertextual gesture. An integrated digital library, however, can facilitate automatic, bi-directional links as well as the creation of composite documents for richer contextualization. Our testbed is the Perseus digital library, which focuses on digitizing and delivering the primary materials for scholarship on classical antiquity and beyond. In particular, much of our recent work has concentrated on the Bolles collection of books, maps, and prints related to early modern London.

The Perseus Project has not confined itself to making information accessible; most of our research has focused on connecting information to enhance scholarship. Reliable links were one of the preconditions of modern scholarship: once page numbers and other citation schemes became widely used with the printing revolution of the fifteenth century, scholars could base their arguments on the corpus of all previously published materials and assume that a sufficiently energetic reader, with access to a library, could check the sources (Eisenstein, 1983). This system presumed that the stability of links would be preserved by the multiplication of copies with the same structure rather than by the preservation of authoritative artifacts.1

Creators of electronic texts can draw not only on the practice of bibliographic citation but also on the even older tradition of annotation, practiced since the invention of writing. In addition to the private marginalia that persist to this day, the scarcity of books in a scribal culture meant that annotations would be more public. Others' annotations, rather than being, as now, mostly of fetishizing or specialist interest, were highly prized by later readers and transferred when new copies of manuscripts were made. More public annotations persisted into print with the publication of manuscript scholia and commonplace books, as well as certain augmentations to printed works. Besides commenting in the margins, bibliophiles could have extra sheets for notes or illustrations bound into existing books, a process known as "extra-illustration" or "grangerizing". James Granger (1723-1776) had published his Biographical History of England, from Egbert the Great to the Revolution ... adapted to a Methodical Catalogue of Engraved British Heads in 1769 with blank pages for readers to fill with engravings of their own, a practice which the publishers of later editions admit had substantially reduced the affordability and condition of old books of prints.

Grangerization had physical limits for some enthusiasts, however. Around the turn of the twentieth century, former Tufts professor and chaplain Edwin C. Bolles assembled a collection on the history and topography of London, including hundreds of books, prints, and maps. Bolles, who owned a copy of Granger's history, started binding engravings of buildings and historical personages into his growing collection of books. By the time he started to augment Thornbury's Old and New London, however, Bolles had to resort to underlining passages in the text or to writing anchor points in the margin with numerical pointers to clippings and illustrations kept in separate boxes. When Thornbury introduces a section on "Temple Bar", for example, Bolles underlined the phrase and linked to twenty prints, mounted on cards, depicting that structure. Titus Oates, Jack Cade, and the Duchess of Marlborough get similar treatment, as do dividend day at the Bank of England and the coronation of Charles II. Such grangerized texts, perhaps, could be added to canonical hypertext harbingers like Agostino Ramelli's sixteenth-century book wheel and Vannevar Bush's Memex.

Much electronic hypertext, from the 1970s on, developed within the annotation paradigm (Catano, 1979). The reader could write marginalia to be laid over the base text or map trails through the information landscape. The expense of early computers reinforced the trend towards centralized storage and editing, even for users' annotations, and there has also always been an "encyclopedic" urge to organize information in some central repository (Bolter, 1991, p. 99f.). The advent of the World Wide Web reversed some of these trends: linking and copying replaced direct annotation, although the separation of text and images allowed for some primitive distributed documents.

Automatic Linking and Gathering

Part of the initial impetus for building Perseus was a desire to enable, in an electronic environment, the kinds of links that developed for commentaries on canonical texts like the Iliad and the Torah. A basic function of our digital library, therefore, has been to reify the links implicit in paper citations (Smith et al., 2000); the reader of a text in Perseus can see immediately which passages are cited by other texts. In Perseus' Greco-Roman collection -- the deepest and most mature in the digital library -- there are over 600,000 links from commentaries, grammars, and dictionaries to 133 megabytes of primary data, for an average of about nine links per page displayed, and over 100 links per page of highly canonical texts like the Iliad.
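The idea of reifying citations as links that can be followed from the cited passage can be illustrated with a small sketch. It is not the Perseus implementation: the record layout, the document identifiers, and the function name `build_backlinks` are all assumptions made for illustration. The sketch simply inverts one-way citations into an index keyed by the cited passage.

```python
from collections import defaultdict

# Hypothetical citation records: (citing document, cited work, cited passage).
# The identifiers are illustrative and do not reflect Perseus's citation scheme.
citations = [
    ("commentary-A", "Iliad", "1.1"),
    ("lexicon-B", "Iliad", "1.1"),
    ("grammar-C", "Iliad", "1.5"),
    ("commentary-A", "Odyssey", "1.1"),
]


def build_backlinks(records):
    """Invert one-way citations into a (work, passage) -> citing documents index."""
    index = defaultdict(list)
    for source, work, passage in records:
        index[(work, passage)].append(source)
    return index


backlinks = build_backlinks(citations)

# A reader viewing Iliad 1.1 can now be shown every text that cites it.
print(backlinks.get(("Iliad", "1.1"), []))
```

With such an index in place, each forward link from a commentary or dictionary yields a reverse link from the primary text at no extra authoring cost.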

In addition to leveraging explicitly intertextual print resources, we have worked on information extraction techniques to make new connections and visualizations through document features such as personal and place names, dates, and technical terms.2 Space and time are particularly interesting to exploit, since they admit generalized visualization interfaces with maps and timelines. Dates, furthermore, are generally quite easy to recognize in the texts that we deal with. Place names, on the other hand, pose some difficulties: particularly in North America, common words and personal names can be used for places, for example "Christmas, Florida" and "John, Louisiana". Even once a name has been recognized as a place, about 90% of the toponyms in our texts may refer to more than one place, as with Athens (Greece or Georgia) or Springfield (Massachusetts, Illinois, or Missouri, to name a few of the thirty-eight). Using local and global context markers, we are able to disambiguate most place names and get over 90% recall and over 85% precision for toponyms in our corpus of texts, from ancient Greece to nineteenth-century North America. Readers can plot all of the sites mentioned in the current page or chapter, or in the entire document, or call up a timeline of all the mentioned dates.
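One way to picture disambiguation by local and global context is the toy sketch below. It is only an assumption about how such markers might be combined, not the method used in Perseus: the gazetteer, the two-token window, and the voting rule are invented for illustration. A region named right after a toponym counts as local context, and the regions of unambiguous toponyms elsewhere in the document count as global context. A small helper also shows how precision and recall are computed against hand-checked answers.

```python
import re
from collections import Counter

# Toy gazetteer (hypothetical): toponym -> candidate regions.
GAZETTEER = {
    "Athens": ["Greece", "Georgia"],
    "Springfield": ["Massachusetts", "Illinois", "Missouri"],
    "Sparta": ["Greece"],
    "Corinth": ["Greece"],
    "Boston": ["Massachusetts"],
}


def disambiguate(text):
    """Return (toponym, region or None) pairs in order of appearance."""
    tokens = re.findall(r"[A-Za-z]+", text)
    positions = [i for i, tok in enumerate(tokens) if tok in GAZETTEER]

    # Global context: regions of toponyms that have only one candidate.
    votes = Counter(
        GAZETTEER[tokens[i]][0]
        for i in positions
        if len(GAZETTEER[tokens[i]]) == 1
    )

    resolved = []
    for i in positions:
        name = tokens[i]
        candidates = GAZETTEER[name]
        # Local context: a candidate region named within the next two tokens.
        window = tokens[i + 1 : i + 3]
        local = [c for c in candidates if c in window]
        if local:
            choice = local[0]
        elif len(candidates) == 1:
            choice = candidates[0]
        else:
            best = max(candidates, key=lambda c: votes[c])
            # Leave the toponym unresolved if the document offers no evidence.
            choice = best if votes[best] > 0 else None
        resolved.append((name, choice))
    return resolved


def precision_recall(predicted, gold):
    """Score a set of predicted (toponym, region) pairs against a gold set."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


print(disambiguate("From Sparta the army marched to Corinth and on to Athens."))
print(disambiguate("She moved from Athens, Georgia to Springfield, not far from Boston."))
```

In this sketch the local marker takes priority over the global vote, so "Athens, Georgia" resolves to Georgia even in a document otherwise about Greece.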

Pointing out connections and visualizing patterns, however, are not enough for a truly augmented scholarly environment. Much has been made of how little information is available to users wishing to follow links on the web, and some link annotation schemes have been proposed (Weinreich and Lamersdorf, 1999). The practice of paper annotation and extra-illustration also suggests that readers can benefit from importing outside information into the reading space. Perseus already automatically generates maps of all places mentioned in texts, and we are experimenting with embedding small, intelligible maps directly in the text.3 For works such as encyclopedias and archaeological catalogues, where each article is generally about a single topic, we illustrate the entries with images to which the author did not explicitly link. In the less restricted texts in the London collection, we are exploiting Bolles' maps and prints, as well as the illustrations in books, to enrich Bolles' hypertext and to extend his grangerization to books that he did not touch. In particular, the reference works we have digitized, such as the Dictionary of National Biography and Crutchley's London street guide, have provided useful authority lists for tying together personal and place names. Finally, we have produced several virtual-reality walkthroughs of nineteenth-century London based on storefront elevations that Bolles collected. In addition to offering an immersive environment, these walkthroughs may serve as peripheral cues, along with maps, for improving spatial comprehension; we are investigating this use, although some previous work suggests that it would be better accomplished with discrete steps rather than continuous motion (Maglio and Campbell, 2000).
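The role of an authority list in gathering illustrations can be sketched as follows. This is a minimal illustration under stated assumptions, not the Perseus code: the authority entries, the canonical identifiers, the print identifiers, and the function names are hypothetical. Variant forms of a name are mapped to one canonical identifier, and an entry is then illustrated with every image whose caption resolves to the same identifier.

```python
# Hypothetical authority list: variant name -> canonical identifier.
AUTHORITY = {
    "temple bar": "place:temple-bar",
    "the temple bar": "place:temple-bar",
    "titus oates": "person:titus-oates",
    "oates, titus": "person:titus-oates",
}


def canonical(name):
    """Normalize case and whitespace, then look the name up in the authority list."""
    return AUTHORITY.get(" ".join(name.lower().split()))


def gather_images(headword, images):
    """Return ids of images whose caption names the same entity as the headword."""
    target = canonical(headword)
    if target is None:
        return []
    return [image_id for image_id, caption in images if canonical(caption) == target]


# Hypothetical image records: (image id, caption).
images = [
    ("print-001", "Temple Bar"),
    ("print-002", "The Temple Bar"),
    ("print-003", "Oates, Titus"),
]

print(gather_images("Temple Bar", images))
```

Exact lookup after normalization is the simplest possible matching rule; captions that mention a name only in passing would need the kind of name recognition described earlier.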

Drawing on traditions of practice for connecting information on paper, we have worked to augment access to documents through automatically linking to and gathering in information relevant to what the reader is reading. As the architectural deficiencies of the web are redressed by XML and by various citation services (Hitchcock et al., 2000) and linkbases (DeRose, 1999), digital libraries can help authors on the web create not just bi-directional links but composite, augmented, distributed documents.

Notes

1 See Lesk (1997) for a chart of the declining longevity and growing availability of information media.

2 For an overview of information extraction techniques, see Wilks and Catizone (1999).

3 For a hand-crafted realization of embedded maps, see Strassler (1996).

References

Jay David Bolter. Writing Space: The Computer, Hypertext, and the History of Writing. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1991.
James V. Catano. Poetry and computers: Experimenting with the communal text. Computers and the Humanities, 13(9):269-275, 1979.
Steven J. DeRose. XML linking. ACM Computing Surveys, 31(4), 1999. Available at http://www.cs.brown.edu/memex/ACM_HypertextTestbed/papers/47.html.
Elizabeth L. Eisenstein. The Printing Revolution in Early Modern Europe. Cambridge University Press, Cambridge, 1983.
Steve Hitchcock, Les Carr, Zhuoan Jiao, Donna Bergmark, Wendy Hall, Carl Lagoze, and Stevan Harnad. Developing services for open eprint archives: Globalisation, integration and the impact of links. In Proceedings of the Fifth ACM Conference on Digital Libraries, pages 143-151, ACM Press, New York, 2000.
Michael Lesk. Practical Digital Libraries: Books, Bytes, and Bucks. Morgan Kaufmann Publishers, San Francisco, 1997.
Paul P. Maglio and Christopher S. Campbell. Tradeoffs in displaying peripheral information. In Proceedings of CHI 2000, pages 241-248, ACM Press, New York, 2000.
David A. Smith, Jeffrey A. Rydberg-Cox, and Gregory R. Crane. The Perseus Project: A digital library for the humanities. Literary and Linguistic Computing, 15(1):15-25, 2000.
Robert B. Strassler, ed. The Landmark Thucydides: A Comprehensive Guide to the Peloponnesian War. Free Press, New York, 1996.
Harald Weinreich and Winfried Lamersdorf. Concepts for improved visualization of Web link attributes. In Proceedings of the Ninth International World Wide Web Conference. Available at http://www.www9.org/w9cdrom/index.html.
Yorick Wilks and Roberta Catizone. Can we make information extraction more adaptive? In Maria Teresa Pazienza (ed.), Information Extraction: Towards Scalable, Adaptable Systems, pages 1-16, Springer, New York, 1999.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2001

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

Conference website: https://web.archive.org/web/20011127030143/http://www.nyu.edu/its/humanities/ach_allc2001/

Attendance: 289 (https://web.archive.org/web/20011125075857/http://www.nyu.edu/its/humanities/ach_allc2001/participants.html)

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC
