Organising Material in a De-centred Environment

  1. Chris Stephens

    Humanities Computing Unit - Oxford University

Ordering Information in a De-centered Environment
This paper looks at the problem of providing users with convenient and up-to-date web resources for work in the humanities. I will try to address the inherent failings of the Search Engine and the HTML-based Information Gateway. I will also show how the HUMBUL Gateway, as an ongoing project, is trying to take some aspects of each and produce a readily expandable, searchable, and easily maintained set of refereed resources.
We can only speculate about the number of pages currently on the WWW. Recent estimates have put the figure at around 320,000,000 [1], a number which is thought to double every three months. The samizdat spirit of the WWW means that only a fraction of these pages is going to be useful to the serious scholar. Discovering the useful resources in the midst of so much material is a growing problem for those who use the WWW in their work. As the web continues to expand, the problem can only become more acute. The tools and techniques we currently use will become increasingly ineffective as they are overwhelmed by the sheer size of the task.
A recent post to the Humanist email list pointed out the shortcomings of AltaVista, often regarded as the most comprehensive of the search engines. AltaVista claims to index some 31,000,000 pages held on 476,000 servers. At the above estimate this would constitute only about 10% of the whole. AltaVista, like many other search engines, works on principles of automated indexing. Web robots or crawlers scan the web continuously for new sites to add to the AltaVista index; however, the material which is indexed constitutes "only a small, flawed, arbitrary, and not even random sample of what is on the web today." [2] If AltaVista cannot index everything, and if the robots are mechanically indiscriminate about the material they do pick up, one has to wonder how fit for the task of scholarly resource discovery such a tool can be.
An alternative way of organizing the material is the Information or Subject Gateway: a collection of links, usually within a specific subject area, gathered, maintained, and often annotated by someone with a knowledge of the subject. Alan Liu's excellent 'Voice of the Shuttle' is one such gateway, covering a wide range of Humanities resources. Alan runs the site as a one-man effort which, he feels, gives him a fine degree of control over its running and development. Voice of the Shuttle and other such sites bring the benefits of human critical judgement to resource discovery. Users can be sure of finding worthwhile sites through these gateways, and most of the gateways allow users to suggest further sites for inclusion.
As a Subject Gateway based on pages of collected links starts to grow, it faces problems in the volume of maintenance required. Keeping the links up to date, fixing broken links, and deleting redundant ones used to involve a process of checking links, searching for lost sites, and updating the HTML pages by hand. A couple of software tools have since become available which help in this task. LinkBot or Infoeval under Win95 are automated link verifiers which check the integrity of the links and generate a report. Improvements in HTML editing packages make the task of updating the pages somewhat easier. The task remains a time-consuming and laborious one which "involves an almost daily process of collecting links for the 'new' page, moving previously collected links from the 'new' page to the regular pages, fixing broken links, and corresponding with users." [3]
Information Gateways, if they cover a number of subjects arranged hierarchically, also face the problem of how to organize the material locally. Classification standards based on principles used in physical libraries are starting to be applied, but many sites still suffer from an idiosyncratic organization method which can be less than intuitive to the user. The problem here comes partly from the static nature of HTML pages. Links to resources are fixed within the category decided by the author of the gateway and, unless duplicated in pages relating to other categories, are unavailable to a user whose primary interest may initially lead them elsewhere. This can leave the user with the task of browsing a number of pages in the search for relevant materials. Resource discovery, in these circumstances, can still account for a significant proportion of the user's online time.
We have tried to address some of these issues in the redesign of the HUMBUL Gateway. The existing HUMBUL suffered from the problems of fixed categories. To free it from these restrictions, we decided to move its collection of links from static HTML pages into a database format. The scripts to access the database were written in Perl, and the interface between Perl and the MSQL server consists of a publicly available module. A major problem in the early stages of development was to get the Perl scripts to talk to the MSQL server. UNIX, in all its many flavours, requires some localization of MSQL and the MSQL Perl interface. Once these problems had been overcome, scripts were written to read from and write to the database. We wanted to simplify how the user contributed resources to HUMBUL while retaining the ability to screen submissions. We did this by having the resource and comment submission forms write their information into a holding database. These records are then read into a password-protected form from which they can either be transferred to the main HUMBUL database or discarded.
We decided to keep a set of subject headings, based on the original HUMBUL categories, as a rough and ready way of navigating the Gateway. However, since the resources are no longer fixed within these categories, the user has the opportunity to search across the database, possibly returning results from several sections. For example, searching for the keyword "Art" returns results not only from the "Visual Arts" section but also turns up resources held within the "Anthropology" section, the "Philosophy" section, and others. The search mechanism is still a simple one, allowing the user to search in either the links and comments or the conference diary, but not both. We are in the process of refining the search mechanism to include the ability to search across the entire database and to allow the use of Boolean operators.
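As a rough illustration of why a database frees resources from their fixed categories, the sketch below runs one keyword query over every record regardless of section. It uses Python and SQLite rather than the Perl and MSQL of the actual system, and the table layout and sample records are invented for the example.

```python
import sqlite3


def build_demo_db():
    """Build a small in-memory table of link records (sample data is invented)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resources (title TEXT, section TEXT, comment TEXT)")
    conn.executemany(
        "INSERT INTO resources VALUES (?, ?, ?)",
        [
            ("Gallery index", "Visual Arts", "Links to museum art collections"),
            ("Rock art archive", "Anthropology", "Prehistoric art from field surveys"),
            ("Aesthetics reader", "Philosophy", "Essays on the philosophy of art"),
            ("Latin word list", "Classics", "Vocabulary lists for students"),
        ],
    )
    return conn


def search(conn, keyword):
    """Return (section, title) pairs whose title or comment contains the keyword.

    The section column is not part of the filter, so hits can come from
    any section. SQLite's LIKE is case-insensitive for ASCII text.
    """
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT section, title FROM resources "
        "WHERE title LIKE ? OR comment LIKE ? "
        "ORDER BY section, title",
        (pattern, pattern),
    )
    return rows.fetchall()
```

Substring matching of this kind is crude: a search for "art" would also match "article" or "start", which is one reason a controlled keyword scheme and Boolean operators are natural refinements.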
Storing the resources in the form of a database also allows for the association of a given record with further information, such as extended descriptions and user-submitted comments. The extended information forms part of the searchable material and would not be practical to implement with an HTML-based collection. Such extended information is intended to give the user an overview of the site being linked to and can often show them what others think of the site. Each record will also be associated with a set of keywords. At the time of writing the keyword scheme is still in development. We are trying to establish some sort of controlled vocabulary to describe web resources. It is intended that sites will be classified by the geographical area of the material they cover, the historical period, and some free keywords. We hope to enlist the aid of library professionals when it comes to classifying the link records. We intend to assign particular HUMBUL sections to subject librarians with the appropriate expertise for evaluation and classification.
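Since the keyword scheme was still in development at the time of writing, the following is only a hypothetical sketch of how a record might combine controlled terms (area and period) with free keywords and comments. The vocabulary lists, class name, and field names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabularies; the paper does not list the real terms.
REGIONS = {"Europe", "Asia", "Africa", "Americas", "Oceania"}
PERIODS = {"Ancient", "Medieval", "Early Modern", "Modern"}


@dataclass
class LinkRecord:
    url: str
    title: str
    region: str
    period: str
    free_keywords: list = field(default_factory=list)
    comments: list = field(default_factory=list)

    def __post_init__(self):
        # Controlled fields must come from the agreed vocabulary.
        if self.region not in REGIONS:
            raise ValueError(f"unknown region: {self.region!r}")
        if self.period not in PERIODS:
            raise ValueError(f"unknown period: {self.period!r}")

    def searchable_text(self):
        """Join every descriptive field so that comments and keywords are
        searched together with the title."""
        parts = [self.title, self.region, self.period]
        parts += self.free_keywords + self.comments
        return " ".join(parts).lower()
```

Rejecting unknown region or period terms at creation time keeps the classification consistent across records, while the free keywords remain unconstrained.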
The links database has greatly simplified the task of maintaining HUMBUL. The administrator can access and edit any of the database fields through a series of Web forms and has no need to know any HTML. The integrity of existing links is checked with the freeware link checker Infoeval. This program checks the links on a page based on the URL of the page rather than a physical document. This means that HUMBUL's dynamically generated pages are still amenable to automated link checking. In practical terms, a task which formerly occupied a couple of days a week can now be performed in about half an hour.
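The idea of checking links by URL, so that it does not matter whether a page is a static file or generated on request, can be sketched as follows. This is not how Infoeval works internally, which the paper does not describe; it is a minimal Python illustration using the standard library, and the function names are assumptions.

```python
import urllib.error
import urllib.request


def check_link(url, timeout=10, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if it cannot be reached.

    The request goes to the URL itself, so a dynamically generated page
    is checked in the same way as a static one.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (OSError, ValueError):
        return None  # unreachable host, timeout, or malformed URL


def broken_links(urls, checker=check_link):
    """Map each failing URL to its status (None means no response)."""
    broken = {}
    for url in urls:
        status = checker(url)
        if status is None or status >= 400:
            broken[url] = status
    return broken
```

Some servers reject HEAD requests, so a more careful checker might retry with GET before reporting a link as broken.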
While the database has certainly improved the efficiency and function of HUMBUL compared with its HTML-based incarnation, there are some drawbacks to the database method. The first was the difficulty of the initial coding. I am not a programmer, but had an interest in learning the Perl language. HUMBUL was my first major project in this language and, as such, the code is perhaps less than optimal and still contains a few bugs. The database system also implies a slight performance overhead compared with plain HTML. The client has to connect to the database server, submit a query, wait for the query to be processed, and then wait for the results to be formatted before being returned via the web server. This process is generally almost instant; however, when demand on the database server is high, it can slow the response quite noticeably. Another problem, which stems in part from my inexperience as a programmer, is that it is impossible to predict what people are going to type into the fields on the various public submission forms. Certain ASCII characters and combinations of characters (such as & or ' or @) have a particular meaning in Perl and will affect how the form data is processed by the associated scripts. I have tried to cover most of these situations with a Perl technique called 'escaping' the characters. From time to time, however, someone will enter something I had not allowed for, and difficulties can arise. I have managed so far to correct such submissions with a little command-line tweaking of the database interface, but this requires a bit of specialist knowledge and runs against our intention with HUMBUL to produce a system usable by anyone.
Overall, we feel that the design of the new HUMBUL goes some way towards mitigating the problems involved in running an information gateway. While the search mechanism in HUMBUL is still a long way from being as sophisticated as that of the major search engines, its future development should give it a similar flexibility. The design philosophy behind HUMBUL has tried to incorporate the best features of both: the dynamic nature of the Search Engine and the critical evaluation applied to resources held in an Information Gateway.
1. Source of figures: NUA Internet Survey.
2. John Pike, Federation of American Scientists, in a post to the Red Rock Eater list, reposted to Humanist by Matthew Kirschenbaum, University of Virginia.
3. Alan Liu, University of California, Santa Barbara, in email.

Conference Info

In review

"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

109 works by 129 authors indexed

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC