Linguistic Description and Exploration using RDF

Cecilia S. M. Wong; Jonathan J. Webster

Authorship

1. Cecilia S. M. Wong

City University of Hong Kong
2. Jonathan J. Webster

City University of Hong Kong

Introduction

In addition to the typical methods employed in information retrieval systems, e.g. calculating keyword frequency and keyword-based pattern matching, we propose an approach to information search and retrieval that draws on two kinds of information. The first is the basic element set known as the Dublin Core Metadata Element Set (DCMES), which represents the content or bibliographical information of the data. The second is linguistic information about the rhetorical structure of the text, which may be inferred from identified linguistic cues and which represents the textual information of the data. Both types of information are tagged using RDF (Resource Description Framework). The cues and criteria for identifying rhetorical structure information are based on those developed by Corston-Oliver (1998).

The text base in question consists of abstracts of linguistics journal articles drawn from a collection of over three hundred papers on the topic of Chinese Linguistics. Our text base includes abstracts from linguistics journals in both Chinese and English. Information retrieval is web-based. Besides offering a search and retrieval capability, a web interface is also being developed for authors or publishers to submit their abstracts to the text base.

Role of RDF in the research

Tim Berners-Lee (1998) describes the Semantic Web as "a web of data, in some ways like a global database." RDF makes it possible to declare a knowledge base, which may be further extended through inferencing. In many ways, RDF brings together the advantages of an object database and the power of a logic programming language like Prolog. Using RDF, linguistic data may be encoded in a machine-readable format from which inferences about the structure and meaning of texts can be drawn by machine.
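
As a minimal sketch of what extending a knowledge base through inferencing can mean, the Python fragment below stores statements as (subject, predicate, object) triples and applies a single RDFS-style rule: if a resource has type C, and C is a subclass of D, then the resource also has type D. The "ex:" prefix and the resource ex:abstract42 are hypothetical, and plain tuples stand in for an RDF toolkit; this is an illustration, not the system described in this paper.

```python
# Triples are (subject, predicate, object); all "ex:" names are hypothetical.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer_types(triples):
    """Return the triples plus every rdf:type statement implied by rdfs:subClassOf."""
    closed = set(triples)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        # Iterate over snapshots, because `closed` grows inside the loops.
        for s, p, o in list(closed):
            if p != TYPE:
                continue
            for c, q, d in list(closed):
                if q == SUBCLASS and c == o and (s, TYPE, d) not in closed:
                    closed.add((s, TYPE, d))
                    changed = True
    return closed


knowledge_base = {
    ("ex:Abstract", SUBCLASS, "ex:Abstracts"),
    ("ex:abstract42", TYPE, "ex:Abstract"),
}
# Statements that were not declared but follow from the rule.
inferred = infer_types(knowledge_base) - knowledge_base
```

Here the only inferred statement is that ex:abstract42 is also of type ex:Abstracts, although that fact was never declared explicitly.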

The research employs three important standards: two encoding standards, the Dublin Core Metadata Element Set (DCMES) and the Resource Description Framework (RDF), and one linguistic theory, Rhetorical Structure Theory (RST). The information about the abstracts consists of both bibliographical and textual information. This knowledge base is represented using RDF, which is discussed in detail in this paper.

Representing the bibliographical information using Dublin Core and RDF

Dublin Core is a metadata encoding standard using XML syntax whose elements are divided into three groups: Content, Intellectual Property and Instantiation. Dublin Core provides the underlying metadata framework for describing our collection of abstracts in terms of Title, Creator, Subject, etc., that is, the bibliographical information. The advantage of Dublin Core is that it uses "a common vocabulary for classifying information" (Laurent and Biggar, 1999, p. 225). It attempts to serve as a framework for interdisciplinary information, so that interoperability can be achieved. Moreover, Dublin Core can be easily extended using two types of extensions, one for refining or enhancing the meaning of elements, the other for refining or enhancing the interpretation of values (Miller, Miller & Brickley, 1999).

According to Miller and Weibel ("Metadata With a Mission: Dublin Core" published in XML.com Oct. 25 2000), "members of the RSS community have recently been advocating RDF as a powerful, modular means of combining semantics defined by Dublin Core with additional vocabularies (syndication, aggregation, threading) to produce effective site summaries and syndication services".

In the RDF schema proposed here, there are seven classes: Abstracts, Abstract, Clause, Journal, Person, Publisher and Source. Abstracts is the root class, and Abstract is a sub-class of it.

In addition to attributes like abstractNumber, content, comment, keyword and language, the Abstract class has the attributes creator, clause and source, which point to the classes holding the corresponding information: Person, Clause and Source respectively.
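
The schema just described can be sketched as a list of triples. The class and attribute names below are the ones given in the text; the "ex:" prefix, the use of rdfs:domain and rdfs:range, and the choice of rdfs:Literal for the simple attributes are our assumptions for illustration, since the paper does not spell out the serialization.

```python
# Class and property names come from the text; the "ex:" prefix and the
# domain/range statements are illustrative assumptions.
CLASSES = ["Abstracts", "Abstract", "Clause", "Journal", "Person", "Publisher", "Source"]
LITERAL_PROPERTIES = ["abstractNumber", "content", "comment", "keyword", "language"]
OBJECT_PROPERTIES = {"creator": "Person", "clause": "Clause", "source": "Source"}


def build_schema():
    """Build the schema as (subject, predicate, object) triples."""
    triples = [(f"ex:{name}", "rdf:type", "rdfs:Class") for name in CLASSES]
    triples.append(("ex:Abstract", "rdfs:subClassOf", "ex:Abstracts"))
    for prop in LITERAL_PROPERTIES:
        triples.append((f"ex:{prop}", "rdfs:domain", "ex:Abstract"))
        triples.append((f"ex:{prop}", "rdfs:range", "rdfs:Literal"))
    for prop, target in OBJECT_PROPERTIES.items():
        triples.append((f"ex:{prop}", "rdfs:domain", "ex:Abstract"))
        triples.append((f"ex:{prop}", "rdfs:range", f"ex:{target}"))
    return triples


def range_of(schema, prop):
    """Look up the class (or rdfs:Literal) that a property points to."""
    return next(o for s, p, o in schema if s == f"ex:{prop}" and p == "rdfs:range")


schema = build_schema()
creator_range = range_of(schema, "creator")
```

Keeping the pointer attributes (creator, clause, source) separate from the literal ones makes it easy to see which attributes lead to further resources that can themselves be described and queried.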

Representing textual information described by Rhetorical Structure Theory using RDF

Rhetorical Structure Theory (RST) was developed at the USC Information Sciences Institute by William C. Mann and Sandra A. Thompson. RST is primarily aimed at describing those functions and structures that make texts effective and comprehensible tools for human communication (Mann et al. 1992:43). It aims to describe natural texts, "characterizing their structure primarily in terms of relations that hold between parts of the text" (Thompson and Mann 1987:1).

Using RST for text analysis, clauses or text spans are classified as being nuclei or satellites, clearly identifying which information is considered by the analyst to be more central. Also, the relations among the clauses or text spans are identified using RST in order to illustrate the development of a coherent text. By identifying the different relations in a text, one can better ascertain how the various spans in a text combine to form a coherent and meaningful text.

RST offers a systematic approach to the interpretation of textual meaning. A recent attempt at automatic RST analysis (Corston-Oliver 1998) has illustrated how linguistic cues, e.g. voice, grammatical dependency relations and conjunctions, may be used to determine the rhetorical structure. For example, the use of consecutive conjunctions or adverbials, like firstly, secondly, next, then, finally, indicates a sequential structure. In addition, because the same linguistic cue can indicate more than one relation, there are criteria that must be fulfilled before text spans are judged to stand in a certain relation. Corston-Oliver also introduced a heuristic scoring procedure to aid in the determination of text relations. For example, if text spans contain the subordinating conjunction "whereas", 30 points would be assigned to indicate the likelihood that the AsymmetricContrast relation, one of the rhetorical relations defined in RST, holds between the text spans. We intend to tag these linguistic cues and criteria in the texts as the basis for subsequent exploration of rhetorical relations using the inferencing capability of RDF.
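
The scoring idea can be sketched as follows. Only the pairing of "whereas" with 30 points for AsymmetricContrast is taken from the description above; the relation name Sequence, its score of 10, and the function name score_relations are placeholder assumptions. The sketch also omits the criteria that must be checked before a relation is accepted, so it should be read as a simplification and not as Corston-Oliver's procedure.

```python
import re
from collections import defaultdict

# (cue words, relation, points). Only "whereas" -> AsymmetricContrast = 30 is
# taken from the text; the Sequence entry and its score are assumptions.
CUE_RULES = [
    ({"whereas"}, "AsymmetricContrast", 30),
    ({"firstly", "secondly", "next", "then", "finally"}, "Sequence", 10),
]


def score_relations(span_a, span_b):
    """Accumulate heuristic scores for candidate relations between two spans."""
    words = set(re.findall(r"[a-z]+", f"{span_a} {span_b}".lower()))
    scores = defaultdict(int)
    for cues, relation, points in CUE_RULES:
        if words & cues:  # at least one cue for this relation is present
            scores[relation] += points
    return dict(scores)


scores = score_relations(
    "Mandarin marks aspect with particles",
    "whereas English relies on verbal inflection",
)
# The candidate relation with the highest accumulated score.
best = max(scores, key=scores.get)
```

For this pair of spans the only cue found is "whereas", so AsymmetricContrast is the sole candidate with a score of 30.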

Linguistics exploration through RDF

The rhetorical structure of texts also plays a significant role in facilitating search and retrieval. By identifying the nuclei and satellites among the spans of a text, the core information of the text can be differentiated from the more peripheral information. Mann and Thompson point out that if all the satellite spans were deleted, the remainder would still form a coherent text (1988:267). Since the contents of nuclear spans are the major concern of a text, keywords occurring only in satellite spans are less likely to be among the major findings or conclusions of the paper. Corston-Oliver suggests that the retrieval result from a statistical approach can be improved by weighting in favor of the nuclear spans (1999:238). Through the process of determining the rhetorical relations among text spans, it will be possible to identify nuclear spans.
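
A minimal sketch of weighting in favor of nuclear spans is given below. The weights 2.0 and 0.5, the sample spans and the function name weighted_score are all illustrative assumptions; neither this paper nor the cited work specifies particular values.

```python
import re

# Illustrative weights only; no specific values are given in the text.
ROLE_WEIGHT = {"nucleus": 2.0, "satellite": 0.5}


def weighted_score(spans, keyword):
    """Sum keyword occurrences over spans, weighted by each span's RST role."""
    keyword = keyword.lower()
    total = 0.0
    for role, text in spans:
        hits = re.findall(r"[a-z]+", text.lower()).count(keyword)
        total += ROLE_WEIGHT[role] * hits
    return total


# Two hypothetical abstracts, each given as (role, text) spans.
abstract_a = [
    ("nucleus", "Tone sandhi is analysed."),
    ("satellite", "Earlier work is reviewed."),
]
abstract_b = [
    ("nucleus", "Word order is analysed."),
    ("satellite", "Tone is mentioned in passing."),
]
ranking = sorted(
    [("A", weighted_score(abstract_a, "tone")), ("B", weighted_score(abstract_b, "tone"))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Both abstracts contain the keyword exactly once, so a plain frequency count would not separate them; with role weighting, the abstract that mentions it in a nuclear span ranks first.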

Certain rhetorical relations appear more prominently in abstracts, e.g. solutionhood and interpretation. Typically, the nuclear span in an interpretation relationship highlights major findings and the conclusion. On the other hand, the nucleus in a solutionhood relationship may refer to the method of investigation, while the satellite(s) may refer to the theoretical framework. Our goal is to ascertain the kind of relations operating between text spans through a kind of linguistic exploration based on encoded linguistic cues. Exploration into possible rhetorical relations between text spans together with the identification of nuclear spans will enable search and retrieval of information along the lines of such queries as: 'retrieve all abstracts whose theoretical approach is based on optimality theory'; 'retrieve all abstracts whose methodology is qualitative and/or ethnographic'.

We are attempting to construct the RST structure of our data using the criteria and cues developed by Corston-Oliver.

Linguistics exploration through inferencing

Drawing inferences can help achieve improved results from user queries. In this research, inferences can be drawn from two different kinds of tagged information: (i) bibliographical details (i.e. following the standard Dublin Core set), and (ii) textual and grammatical information. In this paper, we focus our discussion on the application of textual and grammatical information. On the basis of this information, inferences about the rhetorical structure will be drawn.

Various studies have been conducted into the discourse patterns typical of abstracts. Santos (1996), for example, based on his investigation of linguistics abstracts, notes that procedures may mark the onset of the description of the author's methodology. Tentatively, if we can show that a sequential/chronological structure typifies spans in which one finds the author's account of their procedures, then the linguistic features which characterize 'sequential/chronological'-oriented discourse may point to discourse about the author's methodology, and should therefore be assigned a score with an appropriate weighting.

Moreover, given the assignment of relations as part of the rhetorical structure analysis of the text, we are looking into the possibility that where two spans are related in terms of Solutionhood, the nuclear span in that relationship may also point to the section of the abstract dealing with methodology. Since methodology is all about finding the solution to the research question, this seems plausible, and it is one reason why we assign to this rhetorical pattern a weighting which suggests it deals with methodology. How much weighting to assign is still being investigated.

In a scenario where the user is searching for abstracts containing keywords in the context of discussion about the author's methodology, having some indication of which span of text deals with that methodology may help to obtain a better result.
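
To make this scenario concrete, the sketch below restricts a keyword search to spans that carry enough evidence of dealing with methodology. Because the actual weighting is still under investigation, every number here is a placeholder, and the Span record, the threshold and the sample data are hypothetical.

```python
from dataclasses import dataclass

# Placeholder weights: the text says the actual weighting is still being investigated.
METHODOLOGY_EVIDENCE = {
    ("Solutionhood", "nucleus"): 20,
    ("Sequence", "nucleus"): 10,
}
THRESHOLD = 10  # hypothetical cut-off for treating a span as methodology


@dataclass
class Span:
    abstract_id: str
    relation: str
    role: str
    text: str


def methodology_score(span):
    """Score how strongly a span's relation and role suggest methodology."""
    return METHODOLOGY_EVIDENCE.get((span.relation, span.role), 0)


def search_methodology(spans, keyword):
    """Return ids of abstracts whose likely-methodology spans mention the keyword."""
    keyword = keyword.lower()
    return sorted({
        s.abstract_id
        for s in spans
        if methodology_score(s) >= THRESHOLD and keyword in s.text.lower()
    })


spans = [
    Span("a1", "Solutionhood", "nucleus", "We conducted ethnographic interviews."),
    Span("a1", "Solutionhood", "satellite", "Code-switching is poorly understood."),
    Span("a2", "Interpretation", "nucleus", "The ethnographic record suggests variation."),
]
hits = search_methodology(spans, "ethnographic")
```

Both sample abstracts mention "ethnographic", but only the first does so in a span scored as methodology, so only that abstract is returned.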

Conclusion

We are investigating to what extent a knowledge base which includes information about rhetorical structure can contribute to web-based information retrieval. As the information retrieved in the web-based application is in machine-readable format, we call it 'smart' retrieval. We are focusing on improving this kind of 'smart' retrieval through inferencing from a knowledge base which encodes information about linguistic features, and we hope that the precision and recall of the retrieval will improve as a result.

References

1. Berners-Lee, T. (1998.10.14). Semantic Web Road map [On-line]. Available HTTP: http://www.w3.org/DesignIssues/Semantic.html

2. Corston-Oliver, S. (1998). Computing representations of the structure of written discourse. U.S.A.: UMI Company.

3. Laurent, S. & Biggar, R. (1999). Organizing information: RDF and Dublin Core. In Inside XML DTDs. U.S.A.: McGraw-Hill Companies Ltd.

4. Mann, W. (1999.9.10). RST, Programs and Tools [On-line]. Available HTTP: http://www.sil.org/linguistics/rst/toolnote.htm

5. Mann, W. (1999.11.23). The Two Frameworks Text [On-line]. Available HTTP: http://www.sil.org/linguistics/rst/2framewk/index.htm

6. Mann, W., & Matthiessen, C. (1991.12). Functions of language in two frameworks. Word, 42(3), 231-249.

7. Mann, W., Matthiessen, C., & Thompson, S. (1992). Rhetorical Structure Theory and text analysis. In Mann, W., & Thompson, S. (eds.) Discourse Description: Diverse linguistic analyses of a fund-raising text. U.S.A.: John Benjamins Publishing Co.

8. Mann, W., & Thompson, S. (1988). Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3), 243-281.

9. Miller, & Weibel (2000.10.25). Metadata With a Mission: Dublin Core [On-line]. Available HTTP: http://www.xml.com

10. Miller, E., Miller, P., & Brickley, D. (1999.7.1). Guidance on expressing the Dublin Core within the Resource Description Framework (RDF) [On-line]. Available HTTP: http://www.ukoln.ac.uk/metadata/resources/dc/datamodel/WD-dc-rdf/

11. Santos, M. (1996). The textual organization of research paper abstracts in applied linguistics. Text, 16(4), 481-499.

12. Thompson, S., & Mann, W. (1987.7). Rhetorical Structure Theory: A framework for the analysis of texts. IPrA-Papers-in-Pragmatics, 1(1), 79-105.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2001

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

Conference website: https://web.archive.org/web/20011127030143/http://www.nyu.edu/its/humanities/ach_allc2001/

Attendance: 289 (https://web.archive.org/web/20011125075857/http://www.nyu.edu/its/humanities/ach_allc2001/participants.html)

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC
