An open source implementation of PhiloLogic for large TEI-Lite document collections

Mark Olsen; Robert Voyer; Orion Montoya; Leonid Andreev

Authorship

1. Mark Olsen

Department of Romance Languages - University of Chicago

2. Robert Voyer

University of Chicago

3. Orion Montoya

University of Chicago

4. Leonid Andreev

Harvard University

Original URL

http://web.archive.org/web/20040903094353/http://www.hum.gu.se/allcach2004/AP/html/prop100.html

Work text

The wide array of XML data specifications and the recent deployment of basic XML processing tools provide an important opportunity for the collaborative development of higher-level, interoperable tools for Humanities Computing applications. The sophistication and power of the TEI-XML encoding specification support extremely rich textual data representations that encourage, if not require, sets of tools that exploit features of encoded text to perform particular tasks. One general tool may never fit all possible uses for encoded documents, but a set of more specialized, interoperable tools may provide a mechanism for the cost-effective deployment of end-user applications.

As the ARTFL Project's contribution to the collaborative development of these tools, this paper will outline recent work on PhiloLogic [2] to support a wide variety of TEI-Lite (XML and SGML) encoded documents, optionally using the Unicode character specification. We will present a general design overview of this implementation, indicating its current features and limitations. The ARTFL Project is releasing the implementation under the GNU General Public License [3]. We feel that Humanities Computing applications are particularly well suited to open source development, because the community has wide-ranging technical abilities and is not well supported by the commercial sector.

PhiloLogic is the primary full-text search, retrieval and analysis tool developed by the ARTFL Project and Digital Library Development Center (DLDC) at the University of Chicago. Originally implemented to support large databases of French literature, PhiloLogic has been extended to support a wide variety of textual and hypermedia databases in collaboration with numerous academic institutions and, more recently, commercial organizations [4]. PhiloLogic is a modular system in which a textbase is treated as a set of coordinated, related databases. These typically include an object database (units of text such as a letter, scene, or document), a word forms database, a word concordance index mapped to textual objects, and an object manager that maps text objects to byte offsets in data files. Each of these databases is stored and managed by its own subsystem.
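To make the division of labor concrete, the following is a minimal sketch of the four coordinated tables, not PhiloLogic's actual storage format. It assumes, purely for illustration, that text objects are paragraphs separated by blank lines; the function names `build_textbase` and `fetch_object` are hypothetical.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def build_textbase(data: bytes):
    """Build toy versions of the object manager, word forms table and concordance.

    Assumption for this sketch: objects are blank-line separated paragraphs.
    """
    objects = {}                      # object id -> (start byte, end byte)
    concordance = defaultdict(list)   # word form -> [(object id, byte offset)]
    pos = 0
    for obj_id, chunk in enumerate(data.split(b"\n\n")):
        start, end = pos, pos + len(chunk)
        objects[obj_id] = (start, end)
        text = chunk.decode("utf-8")
        for match in WORD.finditer(text):
            # Convert the character position into a byte offset in the data file.
            byte_offset = start + len(text[:match.start()].encode("utf-8"))
            concordance[match.group().lower()].append((obj_id, byte_offset))
        pos = end + 2                 # skip the two-byte separator
    word_forms = {form: len(hits) for form, hits in concordance.items()}
    return objects, word_forms, dict(concordance)


def fetch_object(data: bytes, objects, obj_id) -> str:
    """Retrieve one text object using only its stored byte offsets."""
    start, end = objects[obj_id]
    return data[start:end].decode("utf-8")


DATA = "Le roi est mort.\n\nVive le roi!".encode("utf-8")
OBJECTS, WORD_FORMS, CONCORDANCE = build_textbase(DATA)
```

A search in this sketch consults `WORD_FORMS` and `CONCORDANCE` first and touches the data file only when `fetch_object` slices out the object to display.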

Textual metadata, for example, is extracted from the database and loaded into a subsystem that handles delimited field data with appropriate functionality, including arithmetic, boolean, and regular expression searching. This subsystem may be implemented using a variety of systems, including standard open-source packages such as MySQL or PostgreSQL. This model also supports extensive use of standoff markup [5], since objects in the document tree may be linked to extensive records in relational databases describing these objects.
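As an illustration of such a metadata subsystem, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for MySQL or PostgreSQL. The table name, column names, object identifiers and sample rows are all hypothetical; the point is only that a record keyed by an object identifier can be searched with boolean, numeric and regular expression conditions and then linked back to an object in the document tree.

```python
import re
import sqlite3


def _regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" as regexp(pattern, value).
    return int(value is not None and re.search(pattern, value) is not None)


def open_metadata_db(rows):
    """Load delimited metadata fields into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("REGEXP", 2, _regexp)
    conn.execute(
        "CREATE TABLE object_metadata ("
        "object_id TEXT PRIMARY KEY, author TEXT, title TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO object_metadata VALUES (?, ?, ?, ?)", rows)
    return conn


def find_objects(conn, author_pattern, first_year, last_year):
    """Return ids of objects whose metadata satisfies all conditions."""
    cursor = conn.execute(
        "SELECT object_id FROM object_metadata "
        "WHERE author REGEXP ? AND year BETWEEN ? AND ? "
        "ORDER BY object_id",
        (author_pattern, first_year, last_year),
    )
    return [row[0] for row in cursor]


# Invented sample records; the ids stand for addresses in an object tree.
ROWS = [
    ("1.1", "Dupont", "Lettre I", 1750),
    ("1.2", "Dupont", "Lettre II", 1762),
    ("2.1", "Durand", "Scène I", 1755),
]
CONN = open_metadata_db(ROWS)
```

Because the records live outside the encoded text and refer to it only by object identifier, they behave like standoff annotations: new descriptive fields can be added without touching the source documents.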

As with most full text search and retrieval systems, the amount of text processing actually involved in a user search is very limited, typically being only that required to extract an element from a document and format it on output. Similarly, building a PhiloLogic database typically requires a data pass to extract, in a consistent way, structured data from the text. The addition of TEI-Lite support for PhiloLogic required creation of new loaders and output formatters. We are using various approaches to both tasks and will discuss the strengths and weaknesses of using SGML/XML-aware tools as opposed to more general programming techniques. It appears that certain tasks, particularly those involving data extraction from heavily nested elements, are better implemented using a tree-based approach, while others, such as calculating the position of words and byte offsets in an abstract document object hierarchy, are better approached as streams. Output formatting of objects also appears to present similar options, depending on the complexity of the encoding and display requirements.
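The contrast between the two approaches can be sketched with Python's standard library; this is an illustration under simplifying assumptions, not the loaders described here. The tree-based function walks nested header elements with `xml.etree.ElementTree`, while the stream-based function uses `xml.parsers.expat` to record the byte offset and running word position of each paragraph. The sample document and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

SAMPLE = (
    b"<TEI.2><teiHeader><fileDesc><titleStmt>"
    b"<title>Sample</title><author>Anon</author>"
    b"</titleStmt></fileDesc></teiHeader>"
    b"<text><body><div1><p>one two</p><p>three</p></div1></body></text></TEI.2>"
)


def header_fields(xml_bytes):
    """Tree approach: navigate nested header elements by path."""
    root = ET.fromstring(xml_bytes)
    stmt = root.find("teiHeader/fileDesc/titleStmt")
    return {"title": stmt.findtext("title"), "author": stmt.findtext("author")}


def paragraph_positions(xml_bytes):
    """Stream approach: (byte offset of <p>, index of its first word) pairs."""
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True          # deliver adjacent text in one callback
    state = {"in_text": False, "words": 0}
    positions = []

    def start(name, attrs):
        if name == "text":
            state["in_text"] = True
        elif name == "p" and state["in_text"]:
            # CurrentByteIndex points at the start of the tag being reported.
            positions.append((parser.CurrentByteIndex, state["words"]))

    def chars(data):
        if state["in_text"]:
            state["words"] += len(data.split())

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.Parse(xml_bytes, True)
    return positions
```

The tree version needs the whole document in memory but makes deep navigation trivial. The stream version keeps only a word counter and reads positions directly from the parser, which suits offset calculation over large files.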

The current development version of PhiloLogic is also able to process Unicode in the form of UTF-8 [6]. The internal word and object indexing system of PhiloLogic has been UTF-8 capable for some time, since it is independent of language or character specifications. Word searches are performed independently of look-ups in the word occurrence indexes. We are using several different approaches to Unicode support based on a multifield word management subsystem. While index entries are stored in UTF-8, search fields can be configured for various languages and combined with slightly modified regular expression matching, which allows for searching on Unicode representations or on various simplified representations. Romanizations, for our current experiments, are performed by a Perl module (Obliterator) developed by ARTFL and the DLDC for transliterating and transcoding among ISCII, Unicode, and a host of romanizations of Indic scripts [7]. This model may be extended to handle many more writing systems.
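A multifield word table of this kind might look roughly like the sketch below. It is an assumption-laden illustration: the simplified field is produced by stripping combining accents with `unicodedata`, which merely stands in for a real romanization step such as Obliterator, and the field names `form` and `simple` are invented.

```python
import re
import unicodedata


def simplify(word):
    """Lowercase a word and strip combining accents (stand-in simplification)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_word_table(words):
    """One row per distinct word form, with an extra simplified search field."""
    return [{"form": w, "simple": simplify(w)} for w in sorted(set(words))]


def match_forms(table, pattern, field="form"):
    """Match a regular expression against one field; return the indexed forms."""
    regex = re.compile(pattern)
    return [row["form"] for row in table if regex.fullmatch(row[field])]


TABLE = build_word_table(["été", "ete", "étés", "Été", "hiver"])
```

Whichever field is searched, the result is always a list of the original UTF-8 forms, so the later look-up in the word occurrence index does not need to know how the user typed the query.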

We are basing this paper on the 2t/2e series of the PhiloLogic engine. While fast, robust, and well proven, it is based on a fixed object depth word indexing architecture. This has two distinct limitations: it flattens object depths, and it does not provide a word "attribute" field to indicate that a word belongs to particular types of objects that users may want to include in or exclude from searches (such as notes or stage directions). We will describe current work on the 3t generation of the engine and hope to be able to include a development version of this variant in 2004. Even then, the many variations of document encoding possible within the TEI-Lite specification may not be treated in the ways that the original encoders had in mind.
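The two limitations can be illustrated with a toy index record. The layout below is hypothetical and is not the 2t/2e or 3t index format: it simply assumes a fixed address width of three levels, and adds an invented `attribute` field to show what filtering by object type would look like.

```python
from dataclasses import dataclass

FIXED_DEPTH = 3  # assumed width of a word address in this sketch


def flatten_address(path, depth=FIXED_DEPTH):
    """Pad shallow paths with zeros and truncate deep ones to a fixed width."""
    path = tuple(path)
    return path[:depth] + (0,) * max(0, depth - len(path))


@dataclass(frozen=True)
class Hit:
    word: str
    address: tuple
    attribute: str = "text"   # hypothetical type label, e.g. "note" or "stage"


def search(hits, word, exclude=()):
    """Return hits for a word, skipping those in excluded object types."""
    return [h for h in hits if h.word == word and h.attribute not in exclude]


HITS = [
    Hit("roi", flatten_address((1, 2))),
    Hit("roi", flatten_address((1, 2, 3, 4)), "note"),
    Hit("roi", flatten_address((1, 2, 3, 5)), "stage"),
]
```

In this sketch the two fourth-level objects end up with the same flattened address, which is the loss of depth described above, and only the extra attribute field lets a search leave out notes or stage directions.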

We are planning for a base release of PhiloLogic with examples of how to implement a wide variety of options required for different document types and collections. Based on the long history of PhiloLogic use and development at ARTFL and among various collaborators, we are planning that the base release will include as many features as possible while not requiring significant administrative or development work to use effectively.

Humanities Computing needs to foster collaborative tool development. It is our belief that these tools will be specialized, reflecting the theoretical orientations and practical experience of various humanities computing organizations. PhiloLogic is a result of ARTFL's need for tools to handle large amounts of relatively lightly encoded text, with a significant orientation to the manipulation of large amounts of descriptive and analytical metadata. Encouraging interoperability and collaborative development, particularly the use of standard processing tools and encoding systems, will provide a way to leverage development work being done at various institutions.

Some sample prototypes of PhiloLogic for TEI (XML/SGML) are available at:

http://www.lib.uchicago.edu/efts/ARTFL/philologic/tei-samples/

Notes

1. Contact: Mark Olsen, mark@barkov.uchicago.edu. Affiliations: Andreev, Virtual Data Center, Harvard University; Montoya, Digital Library Development Center, University of Chicago; Olsen and Voyer, ARTFL Project, University of Chicago.
2. A general description and the PhiloLogic manual are available at http://www.lib.uchicago.edu/efts/ARTFL/philologic/. More recent development descriptions and technical overviews are found in Olsen, "Words, Objects, and Attributes: Leveraging the Full Power of TEI Encoding in Database Searching", October 9, 2002, Second Annual TEI Consortium Meeting, Newberry Library, Chicago (http://barkov.uchicago.edu/talks/TEI2002/) and "Rich Textual Metadata: Implementation and Theory", Poster Session, ALLC/ACH 2002 Conference, University of Tübingen, July 24-29, 2002 (http://barkov.uchicago.edu/talks/ACH2002/rich-metadata.html).
3. The GNU General Public License (http://www.gnu.org/licenses/gpl.html) allows unrestricted use, but requires that distribution of the software be accompanied by the source code (or an offer for it), and does not allow anyone other than the copyright holder to add or remove restrictions on redistribution.
4. Academic collaborations include Opera del Vocabolario Italiano (http://www.lib.uchicago.edu/efts/ARTFL/projects/OVI/), The Abraham Lincoln Historical Digitization Project (http://lincoln.lib.niu.edu/) and the Artemene Project (http://www.artamene.org/). Commercial collaborations include Alexander Street Press (http://www.alexanderst.com/) and Editions Champion (http://www.lib.uchicago.edu/efts/ARTFL/databases/champion/basile/).
5. There are a number of ways to implement standoff markup. See, for example, http://www.tei-c.org/Activities/SO/sow06.html. Our implementation uses a non-standard model of relational links from databases in SQL to objects identified in PhiloLogic object trees.
6. See http://www.unicode.org.
7. Obliterator evolved from stream-oriented Perl scripts that used hashes to transpose source and target characters. It now treats strings as objects, initialized with a source string, with methods that return the various target strings. We are currently replacing the internal hash-table approach with SWIG (http://www.swig.org/) wrappers around IBM's International Components for Unicode (http://oss.software.ibm.com/icu/).

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Conference website: http://web.archive.org/web/20040815075341/http://www.hum.gu.se/allcach2004/

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
