TauRo - A search and advanced management system for XML documents

poster / demo / art installation
Authorship
  1. Alida Isolani

    Scuola Normale Superiore di Pisa

  2. Dianella Lombardini

    Scuola Normale Superiore di Pisa

  3. Paolo Ferrargina

    Scuola Normale Superiore di Pisa

  4. Tommaso Schiavinotto

    Scuola Normale Superiore di Pisa

Work text

With the advent of Web 2.0 we have seen a radical change
in the use of the Internet, which is no longer seen only as a
tool from which to draw information produced by others, but
also as a means to collaborate and to share ideas and content
(examples are Wikipedia, YouTube, Flickr, MySpace, LinkedIn,
etc.). It is with this in mind that TauRo [1] was designed: a
user-friendly tool with which the user can create, manage,
share, and search digital collections of XML documents via the
Web. TauRo is a collaborative system through which any user
with Internet access and a browser can publish and share
their own XML documents simply and efficiently, thus creating
personal and/or shared thematic collections. Furthermore,
TauRo provides advanced search mechanisms thanks to its
search engine for XML documents, TauRo-core, an open source
software package whose search and analysis functions are
designed to meet the current needs of humanities text
encoding.
TauRo-core: the search engine
The success of XML as an online data exchange format on the
one hand, and the success of search engines such as Google on
the other, offer a stimulating technological opportunity to use
great quantities of data on standard PCs or on portable
devices such as smartphones and PDAs. The TauRo-core search
engine is an innovative, modular software tool that offers the
user compressed storage and efficient analysis and search of
arbitrary patterns in large collections of XML documents,
whether these are held on a single PC or distributed among
several PCs located in different parts of the network. The
flexibility of TauRo-core's modular architecture, along with
the use of advanced compression techniques for storing
documents and indexes, makes it suitable for the various
scenarios illustrated in Figure 1.

In particular, the centralized mode, in which both the
documents and the engine are located on the same server, is
already operational and has been tested through the Web
deployment of the system (TauRo). We are currently working on
a Web services layer, corresponding to the distributed mode,
to provide collection creation, document submission, search,
and consultation services. Experiments have also been run on
consulting collections from a PDA or smartphone: through a
dedicated interface the user can submit a query and consult
the documents efficiently on a Nokia Tablet 770.
This way, we can also evaluate the behavior of the software
with reduced computational and storage resources.
Compared to the search engines available internationally,
TauRo-core offers additional search and analysis functions
to meet the current needs of humanities text encoding.
Indeed, such texts may be marked up in ways that make them
difficult to analyze with standard search engines, whether
these are designed for unstructured documents (e.g. Web
search engines), for highly structured documents (e.g.
databases), or for semi-structured documents (e.g. XML search
engines) without any assumptions about the semantics of the
mark-up itself.
TauRo-core, instead, allows the user to index XML texts for
which appropriate tag categories have been defined. These
tags are called smart-tags [2], and they are associated with
specific management and search rules. To illustrate the
flexibility of the smart-tag concept, the classification is
given below:
• jump-tag: the tags of this group indicate a temporary
change in context, as in the case of a tag that marks a
note. The tag content is kept distinct from the actual text,
and the search distinguishes between the two semantic planes.
• soft-tag: these tags involve a change of context; if the
start or end tag occurs inside a character string that
contains no space, the string still forms a single word.
• split-tag: tags that are given a meaning similar to that of
a word separator fall within this category. The context
therefore does not change, and the words are considered
separate.
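The effect of the three categories on word extraction can be illustrated with a minimal Python sketch. This is not TauRo-core code: the element names (note, hi, lb), the SMART_TAGS mapping, the tokenize function, and the choice to treat unclassified tags as split-tags are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical classification; the element names are illustrative only.
SMART_TAGS = {"note": "jump", "hi": "soft", "lb": "split"}


def tokenize(xml_text, smart_tags=SMART_TAGS):
    """Return (main_words, side_contexts) for one XML fragment."""
    side_contexts = []

    def flatten(elem):
        # Collect the character stream of one context as a list of pieces.
        pieces = [elem.text or ""]
        for child in elem:
            # Assumption: tags without a category behave like split-tags.
            kind = smart_tags.get(child.tag, "split")
            if kind == "jump":
                # The content goes to its own plane; the main stream skips it.
                side_contexts.append("".join(flatten(child)).split())
            elif kind == "soft":
                # The tag boundaries add no separator, so words stay whole.
                pieces.extend(flatten(child))
            else:
                # split: the tag acts like a word separator.
                pieces.append(" ")
                pieces.extend(flatten(child))
                pieces.append(" ")
            pieces.append(child.tail or "")
        return pieces

    root = ET.fromstring(xml_text)
    main_words = "".join(flatten(root)).split()
    return main_words, side_contexts


fragment = "<p>philo<hi>sophia</hi> est <note>see Plato</note>ars<lb/>vitae</p>"
main_words, side_contexts = tokenize(fragment)
print(main_words)     # ['philosophia', 'est', 'ars', 'vitae']
print(side_contexts)  # [['see', 'Plato']]
```

In this sketch the soft-tag keeps "philosophia" as one word, the split-tag separates "ars" from "vitae", and the note is indexed on its own plane.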
Furthermore, TauRo-core offers its own query language,
called TRQL, which is powerful enough to allow the user
to carry out complex text analyses that take into account the
above classification and the relationship between the content
and structure of the documents. TRQL operates on document
collections, handles smart-tags, and implements the main
full-text search functions required by the W3C
specifications [3] for XQuery.
This flexibility allows TauRo-core to be used also in contexts
other than the strictly literary one, for example
public administration documents, biological archives, manuals,
legislative collections, news, etc.
The literary context, however, remains the most complex and,
because of its characteristic lack of uniformity, constitutes
a more stimulating and meaningful test.
TauRo: the system on the Web
TauRo is a collaborative system that allows any Web user,
after free registration, to create and share collections of XML
documents, and to exploit the potential of TauRo-core to run
full-text searches by regular expression and by similarity, as
well as searches within the XML document structure. These
searches can be run on a single collection at a time or on
several collections simultaneously, regardless of their
nature. With the aid of screenshots, we show below some
characteristics of the system.
Figure 2 – TauRo home page
The collections
Each registered user can upload their own collection of XML
documents to TauRo. Once uploaded, the collection is indexed
by TauRo-core and made available for subsequent search
operations. At any time the user may modify a collection by
adding or deleting documents, move documents from one
collection to another, share documents between collections,
or change the status of a collection, which can be:
• private: accessible and searchable only by the owner;
• public: searchable by all users after registration and
modifiable only by the owner;
• invite-only: these collections can be subscribed to only
by invitation from one of the collection administrators.
However, a user may request an invitation.
Figure 3 – Collection edit form.
While uploading or modifying a collection, the user can set
some parameters, such as the smart-tag and page-break tag
specifications, in order to exploit the search engine's
potential to the fullest. A further customization option
offered by TauRo is to associate each collection with its
own XSL stylesheet [4], used to display the results of
searches run on that collection in a readable form.
The documents
The system provides the user with a set of functions to
upload, classify, and remove XML documents in the user's
collections. During the upload, the system tries to
recognize automatically the DTD and the entity files used in
the XML document by comparing their public identifiers with
those previously saved. If recognition fails, the user is given
the option of entering the information manually. Every document
can either be made freely downloadable by anyone or be placed
under one of the Creative Commons [5] licenses, which protect
the owner against improper use.
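The recognition step described above can be sketched as a lookup of the public identifier found in the document type declaration. The sketch below is an assumption about how such a lookup might work, not the TauRo implementation: the KNOWN_DTDS registry, its single example entry, and the recognize_dtd function are hypothetical, and only DOCTYPE declarations that use the PUBLIC keyword are handled.

```python
import re

# Matches <!DOCTYPE root PUBLIC "public-id" ...> and captures the public id.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[^\s>\[]+)\s+PUBLIC\s+'
    r'(?P<q>["\'])(?P<public>.*?)(?P=q)'
)

# Hypothetical registry of previously saved DTDs, keyed by public identifier.
KNOWN_DTDS = {"-//Example//DTD Sample 1.0//EN": "sample.dtd"}


def recognize_dtd(xml_text, known=KNOWN_DTDS):
    """Return the saved DTD for the document, or None if it is not recognized."""
    match = DOCTYPE_RE.search(xml_text)
    if match is None:
        return None  # no PUBLIC identifier: the user would be asked instead
    return known.get(match.group("public"))


doc = '<!DOCTYPE text PUBLIC "-//Example//DTD Sample 1.0//EN" "sample.dtd"><text/>'
print(recognize_dtd(doc))  # sample.dtd
```

A None result corresponds to the case in which recognition fails and the user supplies the information.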
Figure 4 – In the foreground is the document edit form,
and in the background the list of documents.
Search
TauRo offers two search modes. The first is designed
to search for words within documents (basic search);
the second also allows the user to construct structural
queries, namely on the elements of the mark-up (tags and
attributes), via a graphic interface (advanced search). In both
cases the queries are translated into a syntax that TauRo-core
can interpret and are then sent to it. The search result is the
list of documents in the collection that satisfy the query,
together with the distribution of the results within the
documents themselves. By selecting a document, the user
accesses a list of contextualized occurrences, each shown
within a text fragment that contains it, with the possibility
of directly accessing the full text.
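A list of contextualized occurrences of this kind can be sketched as a keyword-in-context listing. The function name contexts, the width parameter, and the representation of a document as a plain list of words are assumptions made for illustration; the source does not say how TauRo builds its fragments.

```python
def contexts(words, hits, width=3):
    """Return one text fragment per hit, with up to `width` words on each side."""
    fragments = []
    for i in hits:
        start = max(0, i - width)  # do not run past the start of the text
        fragments.append(" ".join(words[start:i + width + 1]))
    return fragments


words = "in principio erat verbum et verbum erat apud deum".split()
print(contexts(words, hits=[3, 5], width=1))
# ['erat verbum et', 'et verbum erat']
```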
In both cases the search can be exact, by prefix, by suffix, by
regular expression, or by difference. The user can specify
several words, and, in this case, they will appear next to the
document. A basic search can also be run on several
collections simultaneously.
Figure 5 – Search results
The search results again consist of a list of collection
documents that satisfy the query. By selecting a document, the
user accesses the list of occurrences (Figure 5), which is the
starting point for accessing the text.
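Four of the match modes named in this section (exact, prefix, suffix, and regular expression) can be sketched in a few lines of Python. The function names, the mode strings, and the word-list representation are assumptions made for illustration and say nothing about how TauRo-core or TRQL evaluate queries; the search "by difference" is left out because the text does not define it.

```python
import re


def matches(word, pattern, mode="exact"):
    """Test one word against a pattern under the given match mode."""
    if mode == "exact":
        return word == pattern
    if mode == "prefix":
        return word.startswith(pattern)
    if mode == "suffix":
        return word.endswith(pattern)
    if mode == "regex":
        # The whole word must match the regular expression.
        return re.fullmatch(pattern, word) is not None
    raise ValueError(f"unsupported mode: {mode}")


def search(words, pattern, mode="exact"):
    """Return the positions of all words that match."""
    return [i for i, w in enumerate(words) if matches(w, pattern, mode)]


words = ["ars", "artis", "pars", "arte"]
print(search(words, "ar", "prefix"))          # [0, 1, 3]
print(search(words, "ars", "suffix"))         # [0, 2]
print(search(words, "art(is|e)", "regex"))    # [1, 3]
```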
The project has been designed to allow any user to try out and
exploit the potential of the search engine from their own PC,
without having to install complex software. Thanks to TauRo,
any user anywhere in the world can create and manage their
own XML text collection via the Web.
Notes
[1] http://tauro.signum.sns.it
[2] L. Lini, D. Lombardini, M. Paoli, D. Colazzo, C. Sartiani, XTReSy: A Text
Retrieval System for XML Documents. In Augmenting Comprehension:
Digital Tools for the History of Ideas, ed. by H. Short, G. Pancaldi, D.
Buzzetti, July 2001. Office for Humanities Communication
Publications 2001.
[3] http://www.w3.org/TR/xquery-full-text-requirements/ Language
specification and query requirements, XQuery.
[4] eXtensible Stylesheet Language (XSL), a language for expressing
stylesheets.
[5] http://www.creativecommons.it/


Conference Info

Complete

ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

135 works by 231 authors indexed

Conference website: http://www.ekl.oulu.fi/dh2008/

Series: ADHO (3)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None