Exploring the Generic Structure of Scientific Articles in a Contrastive and Corpus-Based Perspective

Noëlle Serpollet; Céline Poudat

Authorship

1. Noëlle Serpollet

University of Orléans

2. Céline Poudat

University of Orléans

Work text


This paper will describe and analyse the generic structure of linguistics articles, using a corpus-based methodology within a contrastive (English-French) perspective. The main question we wish to answer is the following: “Do scientific articles, and more particularly linguistics articles, have a generic structure, and to what extent does this structure vary from one language to another?”
We will answer this linguistic question using techniques from computational and corpus linguistics.
The notion of genre is increasingly present in linguistics as well as in information retrieval and didactics. Genres and texts are intimately connected, as genres cannot be tackled within the restricted framework of the word or the sentence. Indeed, genres can only be perceived through text corpora that are both generically homogeneous and representative of the genre studied. Progress in information technology and digitization has made it possible to gather homogeneous and synchronic corpora of written texts in order to analyse and characterize genres.
Moreover, the development of computational linguistics, of linguistic statistics and more generally of corpus linguistics has led to tools and methods for processing large corpora, which now make it possible to detect linguistic phenomena and regularities that could not have been traced before. In that sense, inductive typological methods and multi-dimensional statistical methods (see Biber, 1988) seem crucial for bringing out more clearly the criteria that define genres.
While literary genres have been extensively explored, the study of academic, scientific and professional genres has been undertaken mainly over the last thirty years or so, within a more applied trend. English for Specific Purposes is a rhetorical-functional trend concerned with macro-textual descriptions and with describing genres from a phrasal or propositional point of view. The description of rhetorical moves (see Swales, 1990) is qualitative rather than quantitative, as the moves can scarcely be identified automatically, although several studies have set out to demonstrate that they can be partially identified by training classifiers on manually annotated corpora (Kando, 1999; Langer et al., 2004).
Our perspective is however different, as we do not start from a set of predefined moves: our objective is to describe the genre of the article and its structure from a quantitative perspective, starting from three levels of description: the structural, the morphosyntactic and the lexical.
The study is based on a generically homogeneous corpus composed of French and English journal articles that all belong to the domain of linguistics, chosen because it is the field in which we have the most expertise. The French corpus is made up of 32 issues of 11 linguistics journals, amounting to 224 articles, whereas the English corpus includes 100 articles, drawn from 16 issues of 4 linguistics journals. All texts were published between 1995 and 2001 to limit the possibility of diachronic variation.
In order to describe the document structure of scientific articles, we first marked up the document structure and the article constituents according to the Text Encoding Initiative Guidelines (Sperberg-McQueen and Burnard, 2002), to ensure the reusability of the corpus and its comparability with other corpora: the article sections were taken into account (introduction, body, divisions, conclusion), as well as titles, subtitles, and specific components (examples, citations, appendices, etc.).
This XML markup, processed with XSL stylesheets, enabled us to obtain the main characteristics of article structure and organization in the two languages (number of sections, structure depth, etc.) and to assess their stability and differences.
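As an illustration of the kind of structural measures mentioned here (number of sections, structure depth), the following is a minimal Python sketch. It is not the XSL stylesheets used in the study: the sample document, the use of nested `div` elements with a `type` attribute, and the function names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-like article; element and attribute
# names are assumptions, not the markup scheme of the actual corpus.
SAMPLE = """<text>
  <body>
    <div type="introduction"><head>Introduction</head><p>text</p></div>
    <div type="section">
      <head>1. Corpus</head>
      <div type="subsection"><head>1.1. French texts</head><p>text</p></div>
      <div type="subsection"><head>1.2. English texts</head><p>text</p></div>
    </div>
    <div type="conclusion"><head>Conclusion</head><p>text</p></div>
  </body>
</text>"""


def div_depth(element):
    """Return the deepest level of <div> nesting at or below `element`."""
    child_depth = max((div_depth(child) for child in element), default=0)
    return child_depth + (1 if element.tag == "div" else 0)


def structure_stats(xml_string):
    """Collect a few structural indicators for one marked-up article."""
    root = ET.fromstring(xml_string)
    body = root.find("body")
    top_level = body.findall("div")  # direct children only
    return {
        "top_level_sections": len(top_level),
        "total_divisions": len(body.findall(".//div")),  # all nested divs
        "max_depth": div_depth(body),
        "section_types": [d.get("type") for d in top_level],
    }


stats = structure_stats(SAMPLE)
```

Computing such a dictionary for every article in each language would give per-language distributions that can then be compared for stability and differences.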
Once these characteristics were established, we focused on the article sections: as neither French nor English linguistics articles are subject to an IMRAD structure (Introduction, Materials and methods, Results, Analysis, Discussion), only introductions and conclusions could be directly observed and compared. Indeed, it would have been irrelevant to analyse “third sections”, as many texts are divided into only two main parts.
The linguistic properties of introductions and conclusions were described at two levels, the lexical and the morphosyntactic, which did not require the same processing. The lexical characteristics of the sections were first obtained using Alceste and its descending hierarchical classification.
We then concentrated on the morphosyntactic level, on the one hand because morphosyntactic variables lend themselves easily to large volumes of data, as they are formal enough to be tagged and counted, and on the other hand because various studies have demonstrated their efficiency in genre processing (Karlgren & Cutting, 1994; Kessler et al., 1997; Malrieu & Rastier, 2001; Poudat, 2003).
Although several taggers are available, they are generally poorly suited to the processing of scientific texts; for instance, the French Inalf Institute trained the Brill tagger on 19th-century novels and Le Monde articles. Most English taggers are trained on the Penn TreeBank corpus and use very robust tagsets whose descriptive interest is very limited. As many available taggers are trainable (Brill Tagger, TreeTagger, TnT tagger, etc.), we decided to develop our own tagset and to generate a new tagger devoted to the processing of scientific texts. We then used a specific tagset of 136 descriptors (described in Poudat, 2004) to process the French corpus. The tagset is devoted to the characteristics of scientific discourse, and brings together the general descriptive hypotheses put forward in the literature on scientific discourse. Among the very specific variables we developed, we can mention symbols, title cues (such as 1.1.), modals, connectives, dates, and two categories for the personal pronoun IL, in order to distinguish between the French anaphoric and impersonal IL.
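To make two of these variables concrete, here is a minimal Python sketch of how title cues (such as 1.1.) and dates might be recognized at the token level. The regular expressions, the restriction of dates to four-digit years, and the tag names `TITLE_CUE`, `DATE` and `OTHER` are assumptions made for illustration; they are not taken from the 136-descriptor tagset.

```python
import re

# Hypothetical patterns, for illustration only.
TITLE_CUE = re.compile(r"^\d+(\.\d+)*\.$")  # "1." or "1.1." or "2.3.1."
DATE = re.compile(r"^(1[5-9]|20)\d{2}$")     # four-digit years 1500-2099


def pre_tag(token):
    """Assign a coarse, hypothetical tag to a single token."""
    if TITLE_CUE.match(token):
        return "TITLE_CUE"
    if DATE.match(token):
        return "DATE"
    return "OTHER"


tokens = ["1.1.", "Corpus", "1988", "2.", "42"]
tags = [pre_tag(t) for t in tokens]
```

In a trained tagger such cues would be learned from the annotated corpus together with context; the rule-based form above only shows what the categories capture.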
The training task is very costly, as it requires building a manually annotated training corpus large enough to enable the system to generate tagging rules. For this reason, it was carried out only on the French corpus, using the TnT tagger. We then adapted the tags and the outputs of CLAWS (the Constituent Likelihood Automatic Word-tagging System developed at Lancaster University; see Garside, 1987) to obtain comparable data.
The morphosyntactic characteristics of the French and English introductions and conclusions were then determined using statistical methods.
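The statistical methods are not specified here. As a minimal sketch of one common first step, the Python example below maps tagger output onto a shared coarse tagset and computes relative tag frequencies per 1,000 tokens, so that sections of different lengths and languages can be compared. The mapping table, the shared category names and the sample tokens are assumptions made for illustration, not the adaptation of CLAWS output actually carried out in the study.

```python
from collections import Counter

# Hypothetical mapping from a few CLAWS-style tags to shared categories.
CLAWS_TO_SHARED = {
    "NN1": "NOUN",
    "NN2": "NOUN",
    "VM": "MODAL",
    "CC": "CONNECTIVE",
    "PPH1": "PRONOUN",
}


def tag_profile(tagged_tokens, mapping):
    """Return the frequency of each shared tag per 1,000 tokens."""
    shared = [mapping.get(tag, "OTHER") for _, tag in tagged_tokens]
    total = len(shared)
    if total == 0:
        return {}
    counts = Counter(shared)
    return {tag: 1000 * n / total for tag, n in counts.items()}


# Invented sample: (word, tag) pairs from a tagged introduction.
intro = [("it", "PPH1"), ("may", "VM"), ("results", "NN2"), ("and", "CC")]
profile = tag_profile(intro, CLAWS_TO_SHARED)
```

Profiles of this kind, computed per section and per language, are the sort of input on which multi-dimensional or discriminant analyses can then be run.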
After describing our methodology, we will present the results it yielded. The last part of our paper will discuss the conclusions that can be drawn from these findings.
References
Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press.
Garside, R. (1987). The CLAWS Word-tagging System. In Garside, R., Leech, G. and Sampson, G. (eds), The Computational Analysis of English: A Corpus-based Approach. London: Longman.
Kando, N. (1999). Text Structure Analysis as a Tool to Make Retrieved Documents Usable. In Proceedings of the 4th International Workshop on Information Retrieval with Asian Languages, Taipei, Taiwan, Nov. 11-12, pp. 126-135.
Karlgren, J. and Cutting, D. (1994). Recognizing Text Genres with Simple Metrics Using Discriminant Analysis. Proceedings of COLING 94, Kyoto, pp. 1071-75.
Kessler, B., Nunberg, G. and Schütze, H. (1997). Automatic Detection of Genre. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the 8th Meeting of the European Chapter of the Association for Computational Linguistics. San Francisco CA: Morgan Kaufmann Publishers, pp. 32-38.
Langer, H., Lüngen, H. and Bayerl, P. S. (2004). Towards Automatic Annotation of Text Type Structure. Experiments Using an XML-annotated Corpus and Automatic Text Classification Methods. Proceedings of the LREC-Workshop on XML-based richly annotated corpora, Lisbon, Portugal, pp. 8-14.
Malrieu, D. and Rastier, F. (2001). Genres et variations morphosyntaxiques. TAL, 42(2): 547-578.
Poudat, C. (2003). Characterization of French Linguistic Research Papers with Morphosyntactic Variables. In Fløttum, K. and Rastier, F. (eds), Academic discourses - Multidisciplinary Approaches. Oslo: Novus, pp. 77-96.
Poudat, C. (2004). Une annotation de corpus dédiée à la caractérisation du genre de l’article scientifique. Workshop TCAN - La construction du savoir scientifique dans la langue, Maison des Sciences Humaines - Alpes, Grenoble, 20 octobre 2004.
Sperberg-McQueen, C.M. and Burnard, L. (eds) (2002). TEI P4: Guidelines for Electronic Text Encoding and Interchange. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium. XML Version. http://www.tei-c.org/Guidelines2/index.xml.ID=P4 and http://www.tei-c.org/P4X/ (accessed 14 November 2005).
Swales, J. (1990). Genre Analysis: English in Academic and Research Settings. Cambridge: Cambridge University Press.


Conference Info

ACH/ALLC / ACH/ICCH / ADHO / ALLC/EADH - 2006

Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006

The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.

Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/

Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC
