The Search Engine and the User Interface

  1. Mirko Tavoni

    Università di Pisa

  2. Elena Pierazzo

    King's College London, Università di Pisa, Université Grenoble Alpes

  3. Letizia Leoncini

    Università di Pisa

  4. Paolo Ferragina

    Scuola Normale Superiore di Pisa

  5. Ivan Boscaino

    Università di Pisa

  6. Mirko Tavosanis

    Università di Pisa


1. The XCDE Search Engine
The Linguistic Laboratory web site employs the XCDE Search Engine, which is based on the XCDE Library. Both tools were developed by Paolo Ferragina and his co-workers. The XCDE Library is a native system, written in C, for compressing and indexing XML files. It includes an API (C functions that execute several operations) for storing, indexing and compressing XML documents, and some commands for implementing higher-level queries and for compressing and decompressing documents. State-of-the-art algorithms and data structures were adapted to achieve good time/space performance and to support efficiently some innovative features, such as:

extraction of well-formed portions of XML documents (snippets);
proximity queries on words;
structural queries.
The library is modular. It is not in the public domain, but it is open source and free for non-commercial purposes. More documentation and the Search Engine itself are available from the web site (

2. The User Interface
The user interface was designed to meet two main requirements:

allowing users who do not know the TEI encoding scheme at all to exploit the semantic data structure of TEI-encoded texts;
querying user-defined sub-corpora of the texts stored in the system.
The user interface was developed by Ivan Boscaino and Elena Pierazzo, and it is based on scripts written by Paolo Ferragina and Andrea Mastroianni.

2.1 User Interface: client side
2.1.1 User
To access the document search area, the user needs to enter a login and a password. A generic login and password (user:user) is available, which allows querying all the freely accessible documents in the system. Registered users with a personal account can also query restricted-access documents and corpora, according to their needs.

After logging in, two kinds of queries are available:

the lemmatized corpus of Dante's texts;
other corpora.
To query the lemmatized texts by Dante, the user first chooses a grammatical category, either a Latin or a vernacular one. Depending on the selected category, the search can be filtered by setting some characteristics of the lemma or of the form. For example, for a Latin noun it is possible to specify the gender, number and case of the form, or the declension of the lemma.
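The filtering described above can be pictured as a match between the requested features and the annotations attached to each form. The following is a minimal sketch in Python, not the system's actual implementation: the record layout, the field names and the function `filter_forms` are assumptions made for illustration only.

```python
def filter_forms(forms, category, **features):
    """Keep annotated forms of one grammatical category whose
    annotations agree with every requested feature."""
    return [
        form for form in forms
        if form["category"] == category
        and all(form.get(name) == wanted for name, wanted in features.items())
    ]

# Illustrative records only; the real annotations live in the encoded texts.
FORMS = [
    {"form": "rosam", "lemma": "rosa", "category": "noun",
     "gender": "feminine", "number": "singular", "case": "accusative",
     "declension": 1},
    {"form": "dominus", "lemma": "dominus", "category": "noun",
     "gender": "masculine", "number": "singular", "case": "nominative",
     "declension": 2},
    {"form": "amat", "lemma": "amo", "category": "verb", "number": "singular"},
]

feminine_accusatives = filter_forms(
    FORMS, "noun", gender="feminine", case="accusative"
)
```

Features that are not requested are simply ignored, which mirrors a form in which every filter is optional.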

Once the query has been composed, the system returns the list of the texts containing the searched pattern and the number of occurrences in each text. The results can be displayed in two ways: either all the occurrences in all documents (with the option of enlarging the context), or just the occurrences in a single text. In the latter case the screen is divided into two frames: the upper frame lists the occurrences in a minimal context; the lower frame displays the full text, with the occurrences highlighted. At this point the user has a further way to analyse the document: moving the mouse pointer over any form in the lower frame displays a tooltip containing the corresponding lemma and grammatical category.

To query the other corpora and texts, the user is first asked to choose whether to query the whole text base or just a selection of it. Once the documents to be queried have been chosen, the user sets the search criteria, which are available from three drop-down menus. Each criterion corresponds to a specific TEI encoding that the system administrator has "translated" into an easily understood definition (alias).
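An alias can be thought of as an entry in a lookup table that maps a reader-friendly label to a TEI element, optionally qualified by an attribute and a value. The sketch below, in Python, illustrates the idea only: the labels, the TEI names in the table and the function `resolve_alias` are hypothetical.

```python
from typing import NamedTuple, Optional

class Triplet(NamedTuple):
    """A search criterion: element name plus optional attribute and value."""
    element: str
    attribute: Optional[str] = None
    value: Optional[str] = None

# Hypothetical aliases; labels and TEI names are illustrative only.
ALIASES = {
    "Personal name": Triplet("name", "type", "person"),
    "Place name": Triplet("name", "type", "place"),
    "Verse line": Triplet("l"),
}

def resolve_alias(label: str) -> Triplet:
    """Translate a menu label into the criterion used for the search."""
    try:
        return ALIASES[label]
    except KeyError:
        raise ValueError(f"unknown alias: {label!r}") from None
```

With such a table, the drop-down menus only need to show the labels, and the TEI names never reach the user.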

Advanced TEI users can bypass the administrator's mediation through the advanced search option, which allows the search criteria to be specified by typing the names of the elements and, optionally, the names of the attributes and their values. The three criteria are nested: the second criterion has to be nested in the first, and the third in the second. The search string can be specified as a full word, string, prefix, suffix or regular expression. If more than one string is specified, the proximity between the strings must be indicated (with a proximity of 0, only contiguous strings are matched).
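The string modes and the proximity setting can be illustrated with a small sketch in Python. It is not the XCDE engine's algorithm, which works on compressed indexes. It scans a plain token list and leaves the nesting of the criteria aside. It assumes that proximity means the maximum number of tokens allowed between two matches and that the second string must follow the first. The function and mode names are hypothetical.

```python
import re

def token_matches(token: str, pattern: str, mode: str) -> bool:
    """Test one token against a search string in the given mode."""
    if mode == "word":
        return token == pattern
    if mode == "string":
        return pattern in token
    if mode == "prefix":
        return token.startswith(pattern)
    if mode == "suffix":
        return token.endswith(pattern)
    if mode == "regex":
        return re.fullmatch(pattern, token) is not None
    raise ValueError(f"unknown mode: {mode!r}")

def proximity_hits(tokens, first, second, proximity):
    """Return (i, j) pairs where `first` matches tokens[i], `second`
    matches tokens[j], j > i, and at most `proximity` tokens lie between.
    `first` and `second` are (pattern, mode) pairs."""
    hits = []
    for i, token in enumerate(tokens):
        if not token_matches(token, *first):
            continue
        # Positions i+1 .. i+1+proximity are close enough to tokens[i].
        last = min(i + 2 + proximity, len(tokens))
        for j in range(i + 1, last):
            if token_matches(tokens[j], *second):
                hits.append((i, j))
    return hits

tokens = "nel mezzo del cammin di nostra vita".split()
contiguous = proximity_hits(tokens, ("mezzo", "word"), ("cammin", "word"), 0)
one_apart = proximity_hits(tokens, ("mezzo", "word"), ("cammin", "word"), 1)
```

Here `contiguous` is empty because "del" stands between the two words, while `one_apart` contains the pair of positions (1, 3).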

Once the criteria have been set and the query submitted, the results are displayed in the same way as for the lemmatized works by Dante.

2.1.2 Administrator
A user logged in as an administrator can manage the texts and the users. The possible operations are:

creation of new users;
attribution of rights (typically, users can be allowed to see and query the whole text base or just a selection of it);
upload of new documents;
declaration and administration of aliases (an alias is the description that, on the user side, replaces an element-attribute-value triplet);
creation and administration of collections;
insertion of uploaded texts into collections.
In particular, when a new document is uploaded to the system, the text is first parsed to verify that the file is well formed; the document is then indexed and stored by the search engine; finally, all the elements present in the text, together with their attributes and values, are displayed. At that point the administrator can choose which elements and which element-attribute-value triplets will receive a definition (alias). Once all the aliases have been set, the document must be stored in at least one collection (if none of the existing collections suits the document, a new one can be created). Finally, it is necessary to verify that all the interested users have the rights to access that collection.
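The first and last steps of the upload (the well-formedness check and the inventory of elements, attributes and values) can be sketched with Python's standard XML parser. This is an illustration under stated assumptions, not the parser the system actually calls. The function name `inventory` is hypothetical, and indexing and storage are left out.

```python
import xml.etree.ElementTree as ET

def inventory(xml_text: str):
    """Parse a document and list its distinct elements and its distinct
    element-attribute-value triplets, both sorted.
    Raises ET.ParseError if the document is not well formed."""
    root = ET.fromstring(xml_text)
    elements, triplets = set(), set()
    for node in root.iter():
        elements.add(node.tag)
        for attribute, value in node.attrib.items():
            triplets.add((node.tag, attribute, value))
    return sorted(elements), sorted(triplets)

sample = (
    '<text><div type="canto">'
    '<l n="1">Nel mezzo</l><l n="2">mi ritrovai</l>'
    '</div></text>'
)
elements, triplets = inventory(sample)
```

The two lists correspond to what the administrator would see before deciding which elements and triplets receive an alias. A document that is not well formed stops at the parsing step with an error.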

2.2 User Interface: server side
The user interface works through:

a CGI (Common Gateway Interface) program that manages all the operations;
a database that stores user and text data.
To query the system, the user fills in an HTML form. The request is handled by the CGI (written in Perl), which performs different operations depending on the request. In detail, the CGI accesses, queries and updates the database, or calls other external programs, in order to process the user's request and return the response.

The system architecture can be represented as follows:

Figure 1

The database manages the users (account creation, rights management), the texts (upload, association with collections), the collections (creation, update), the aliases (creation, update) and the sessions (login, duration, logout). The external programs called by the CGI are an XML parser (which verifies that the file to be uploaded is well formed), the programs that index and store the texts, and the search engine that queries the texts.

All the responses, both those of the database and those of the other programs (error messages included), are sent to the user as HTML pages dynamically generated from templates. The data coming from the database or the programs are inserted into different templates, according to the operation performed, by substituting special tags. Finally, the composed page is sent to the user's browser.
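The template mechanism can be illustrated with a short sketch. The actual CGI is written in Perl and its tag syntax is not documented here, so the example uses Python and an assumed placeholder syntax of the form `[%NAME%]`. The function name `render` is likewise hypothetical.

```python
import html
import re

# Assumed placeholder syntax, e.g. [%TITLE%]; the real special tags may differ.
PLACEHOLDER = re.compile(r"\[%(\w+)%\]")

def render(template: str, data: dict) -> str:
    """Fill every placeholder with the matching value, escaped for HTML."""
    def substitute(match):
        name = match.group(1)
        if name not in data:
            raise KeyError(f"no value for placeholder {name!r}")
        return html.escape(str(data[name]))
    return PLACEHOLDER.sub(substitute, template)

page = render(
    "<p>[%COUNT%] occurrences in [%TITLE%]</p>",
    {"COUNT": 3, "TITLE": "Vita & Nova"},
)
```

Escaping the substituted values keeps text coming from the database or from external programs, error messages included, from being interpreted as markup by the browser.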


Conference Info



Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC

  • Language: English