Dept of Mathematics and Computer Science - Duquesne University
Back-of-the-book indexing is the process of generating a list of relevant terms from a corpus and providing the user with the page references of these terms. It differs from web indexing in its ability to identify synonyms and subcategories. The indexing process has become somewhat automated through computer applications, most of which can at best generate a concordance, or list of keywords in the text. The difficulty lies in making intelligent decisions about which words or phrases to include, a step still performed primarily by human indexers. The biggest drawback is the time required to perform this task: it is estimated that every hundred pages of text takes one week to index manually (Chicago, 2003). The challenge, which the authors of this paper hope to have met, is to develop a software program that bridges the gap between computerized concordances and manual indexing, providing a much more robust draft index that a human indexer can refine in a fraction of the time. This application takes the ideas put forth in (Juola, 2005) and extends them into a working prototype system.
There are many different types of software available
today for performing parts of the indexing process.
Natural language parsers determine which words or phrases
are subjects. Concordance-generation packages, such as AntConc, use a cluster analysis technique to study the relationships between words and their surrounding text, but still require the user to provide the specific words to search. Another application that uses cluster analysis, Grokker, aids in searches of text by creating visual maps
of related terms. Many indexers use products like
CINDEX, MACREX, and SKY Index, which function like
databases and can be integrated with Microsoft Access and dBASE III. Each row of data consists of a heading, a
subheading (if applicable), and the page number (locator). These programs take user-entered index information and provide an interface for grouping, censoring, sorting, and
finally generating a clean index as output. No single
application that is commercially available today, however,
handles all of the processes described above. It is the
intent of the computer-aided application described herein
to do all of these things.
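To make the parsing step concrete, here is a minimal sketch of the kind of noun-phrase extraction such parsers perform, using NLTK and a hand-written chunk grammar purely as a generic stand-in; it is not the interface of any of the products named above, nor of the MontyLingua parser cited in the references.

```python
# Illustrative only: extract candidate noun phrases from one sentence using
# NLTK's tokenizer/tagger and a simple chunk grammar (the grammar and the
# choice of NLTK are assumptions for this sketch, not our actual parser).
import nltk

# One-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"  # optional determiner, adjectives, one or more nouns

def candidate_phrases(sentence):
    """Return the noun phrases in a sentence as candidate index terms."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]

print(candidate_phrases("The latent semantic analysis of the corpus reveals related terms."))
# e.g. ['The latent semantic analysis', 'the corpus', 'related terms']
```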
Our software application takes advantage of existing natural language parsers and uses a factor analysis technique, based on the principles of Latent Semantic Analysis (Landauer, 1998; Wiemer-Hastings, 2004) and Latent Semantic Indexing (Deerwester, 1990), to study the relationships among nouns and noun phrases and their surrounding text. This type of multivariate analysis uses matrix manipulation of the words and their locations in sentences to create clusters of related words and contexts, and assigns probabilities to these clusters that can be translated into degrees of importance. A cutoff level of importance is then used to limit the size of the resulting list. The proximity of words to other words also helps to determine which words appear to be synonyms (terms that correlate highly in every dimension) and which seem to be subcategories of broader terms (hyponyms, which correlate highly in one or more dimensions). In our application, LSA is used to generate the synonyms that will translate to cross-references and the hyponyms that will translate to subheadings of terms in the resulting index.
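As a rough illustration of this factor-analysis step, the following sketch builds a term-by-sentence count matrix, applies a truncated singular value decomposition (the matrix technique underlying LSA; see also Press, 1992), and uses cosine similarity in the reduced space to flag highly correlated term pairs. The dimension k, the importance cutoff, and the similarity threshold are illustrative values only, not the parameters of our application.

```python
# A rough sketch of the LSA step (our reading of Deerwester, 1990 and
# Landauer, 1998): term-by-sentence counts, truncated SVD, and cosine
# similarity in the reduced space.  k, the importance cutoff, and the
# synonym threshold are illustrative assumptions, not our actual settings.
import numpy as np
from collections import Counter

def lsa_term_vectors(sentences, k=2):
    """sentences: list of term lists, one list per sentence/context."""
    vocab = sorted({t for s in sentences for t in s})
    index = {t: i for i, t in enumerate(vocab)}
    # Term-by-context matrix A: A[i, j] = count of term i in context j.
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for t, c in Counter(s).items():
            A[index[t], j] = c
    U, S, _ = np.linalg.svd(A, full_matrices=False)
    return vocab, U[:, :k] * S[:k]     # term coordinates in k latent dimensions

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

contexts = [["index", "heading", "locator"],
            ["index", "cross-reference", "synonym"],
            ["synonym", "cross-reference"],
            ["heading", "subheading", "locator"]]
vocab, vecs = lsa_term_vectors(contexts)

importance = np.linalg.norm(vecs, axis=1)                     # cluster weight per term
keep = [t for t, w in zip(vocab, importance) if w > 0.5]      # illustrative cutoff
pairs = [(vocab[i], vocab[j])
         for i in range(len(vocab)) for j in range(i + 1, len(vocab))
         if cosine(vecs[i], vecs[j]) > 0.9]                   # candidate synonyms

print(keep)    # terms retained for the draft index
print(pairs)   # highly correlated pairs: cross-reference / subheading candidates
```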
There are several resources currently available that define what makes a good index. One such resource is the NISO technical reference, “Guidelines for Indexes and Related
Information Retrieval Devices” (Anderson, 1997).
Anderson defines an index as “a systematic guide designed
to indicate topics or features of documents in order to
facilitate retrieval of documents or parts of documents.” He
further states that an index should include the following components: (a) terms representing the topics or features of documentary units; (b) a syntax for combining terms
into headings in order to represent compound or
complex topics; (c) cross-references among synonymous
and other related terms; (d) a procedure for linking
headings with particular documentary units; (e) a
systematic ordering of headings (in displayed indexes). It is the objective of our application to successfully
incorporate each of these components.
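One way to see how these components map onto a data structure is the sketch below: headings and subheadings cover (a) and (b), cross-references cover (c), locators tie headings to documentary units for (d), and sorted output gives the systematic ordering of (e). The field and function names are illustrative, not those of our implementation.

```python
# One possible record structure for components (a)-(e): headings and
# subheadings, cross-references, locators, and a systematically ordered
# display.  Names here are illustrative, not those of our implementation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class IndexEntry:
    heading: str
    subheading: str = ""                                  # empty for a main heading
    locators: List[int] = field(default_factory=list)     # page numbers (component d)
    see_also: List[str] = field(default_factory=list)     # cross-references (component c)

def render(entries):
    """Sort entries systematically (component e) and print a displayed index."""
    for e in sorted(entries, key=lambda e: (e.heading.lower(), e.subheading.lower())):
        label = f"  {e.subheading}" if e.subheading else e.heading
        pages = ", ".join(str(p) for p in e.locators)
        xref = f" (see also {', '.join(e.see_also)})" if e.see_also else ""
        print(f"{label}, {pages}{xref}" if pages else f"{label}{xref}")

render([
    IndexEntry("indexing", locators=[1, 12], see_also=["concordance"]),
    IndexEntry("indexing", "automatic", [12, 47]),
    IndexEntry("concordance", locators=[3]),
])
```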
Evaluating our application's overall performance (i.e., how closely the resulting indexes fulfill the guidelines above) is a difficult task. We address it by comparing, side by side, an index generated for a given corpus by our application with one produced for the same corpus by a typical human indexer. From the two indexes we build a matrix recording (i) what we and the indexer both included, (ii) what we included but the indexer did not, (iii) what the indexer included but we did not, and (iv) what neither of us included. We also note the relative times required to perform the task in each case. Our feeling is that if we can generate a similar index in a fraction of the time, then this application is a success.
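Assuming both indexes are reduced to sets of normalized heading strings, the comparison matrix described above can be tabulated as in the following sketch (the term sets shown are invented examples):

```python
# Sketch of the evaluation matrix described above, assuming each index is
# reduced to a set of normalized heading strings and that a shared candidate
# vocabulary stands in for "what neither of us included".
def comparison_matrix(ours, theirs, candidates):
    ours, theirs, candidates = set(ours), set(theirs), set(candidates)
    return {
        "both included":    ours & theirs,
        "ours only":        ours - theirs,
        "indexer only":     theirs - ours,
        "neither included": candidates - ours - theirs,
    }

cells = comparison_matrix(
    ours={"indexing", "concordance", "parser"},
    theirs={"indexing", "concordance", "locator"},
    candidates={"indexing", "concordance", "parser", "locator", "preface"},
)
for cell, terms in cells.items():
    print(f"{cell}: {sorted(terms)}")
```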
It is our hope that this all-inclusive software, a “one-stop shop” for back-of-the-book indexing needs, will revolutionize the indexing industry by providing an easy-to-use, accurate means of generating an index in a drastically reduced amount of time.
References
Anthony, L. (2006). AntConc 3.1.2 concordance generation software, http://www.antlab.sci.waseda.ac.jp/
American Society of Indexers. (2006). http://www.asindexing.org/site/index.html.
Anderson, J. (1997). NISO Technical Report 2: Guidelines for Indexes and Related Information Retrieval Devices. Bethesda: NISO Press.
Deerwester, S., et al. (1990). “Indexing by latent semantic
analysis.” Journal of the American Society for
Information Science, 41(6), 391-407. Wiley.
Groxis, Inc. Grokker software, http://www.groxis.com. San Francisco: Groxis, Inc.
Indexing Research. CINDEX Indexing Software, http://www.indexres.com. New York: Indexing Research.
Juola, P. (2005). “Towards an Automatic Index Generation Tool.” Proceedings of the 2005 Joint Annual Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (ACH/ALLC 2005).
Jurafsky, D., Martin, J. (2000). “Word Sense
Disambiguation and Information Retrieval.” Speech and Language Processing, 631 – 666. NJ: Prentice Hall.
Landauer, T., et al. (1998). “An Introduction to Latent Semantic Analysis.” Discourse Processes, 25: 259 – 284. Mahwah, NJ: Erlbaum Associates.
Liu, H., MIT Media Lab. (2004). MontyLingua: An end-to-end natural language processor with
common sense, http://web.media.mit.edu/~hugo/montylingua/.
Macrex Indexing Services. (2005). Macrex Indexing Program, http://www.macrex.com. Daly City, CA: MACREX.
O’Grady, W., et al. (2001). “Computational Linguistics.” Contemporary Linguistics 4th Edition, 663 – 703. Boston: Bedford/St. Martin’s.
Press, W., et al. (1992). “Singular Value Decomposition.”
Numerical Recipes in C: The Art of Scientific
Computing, 59 – 70. Cambridge: Cambridge
University Press.
SKY Software. SKY Index 6.0 Professional Edition, http://www.sky-software.com. Stephens City, VA: SKY Software.
Smith, L. (2002). “A Tutorial on Principal Components Analysis.” www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf.
University of Chicago Press Staff. (2003). The Chicago Manual of Style, 15th Edition, Chicago: University of Chicago Press.
Wiemer-Hastings, P. (2004). “Latent Semantic Analysis.” Encyclopedia of Language and Linguistics. Oxford: Elsevier.