The Royal Society Corpus: Towards a high-quality corpus for studying diachronic variation in scientific writing

Hannah Kermes; Stefania Degaetano-Ortlieb; Ashraf Khamis; Jörg Knappen

Authorship

1. Hannah Kermes

Universität des Saarlandes (Saarland University)
2. Stefania Degaetano-Ortlieb

Universität des Saarlandes (Saarland University)
3. Ashraf Khamis

Universität des Saarlandes (Saarland University)
4. Jörg Knappen

Universität des Saarlandes (Saarland University)

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Big data are a potential source for quantitative research in the humanities, but typically they do not contain all relevant contextual meta-data (time, register/genre, social group, author, etc.) to be readily usable for social, historical or philological studies (cf. Schöch, 2013). Small corpora, in contrast, are typically carefully hand-crafted and provide rich meta-data as well as structural and linguistic data, but the application of data-driven analysis techniques is impeded by their small size.
We introduce a diachronic corpus of English scientific writing - the Royal Society Corpus (RSC) - adopting a middle ground between big and ‘poor’ and small and ‘rich’ data. The corpus has been built from an electronic version of the Transactions and Proceedings of the Royal Society of London and comprises c. 35 million tokens from the period 1665- 1869 (see Table 1). The motivation for building a corpus from this material is to investigate the diachronic development of written scientific English.

Table 1: Material used for the RSC

In terms of corpus building (see Figure 1 for a schematic overview), the sources for the RSC were obtained from JSTOR and include some but not all relevant meta-data (year of publication and authors, but not disciplines), structural data is partial and erroneous (e.g. scrambled pages, text duplicates), and the base text contains OCR errors. To move towards a cleaner and richer version of the corpus, an approach is needed that allows obtaining good-quality base-text data and relevant meta-data as well as structural and linguistic data with affordable effort. For this purpose, we use a combination of pattern-based techniques (e.g. by adapting the patterns for OCR corrections made available by Underwood and Auvil)

and data-mining methods (e.g. topic modelling
(Blei et al.,
2003)to approximate disciplines; cf.
McFarland et al.
(2013)for an overview of types of topic models applied to capture differentiation in scientific language). Additionally, to enrich the RSC with basic linguistic annotations, we build on existing tools adapting them to the diachronic material. For normalization we use VARD
(Baron and Rayson,
2008)with a model we trained on a manually normalized subset of the RSC, and for tokenization, lemmatization, segmentation and part-of-speech annotation we useTreeTagger
(Schmid,
1994)on the normalized texts. Inspired by the idea of Agile Software Development (Cockburn, 2001), we intertwine the actual corpus building with corpus annotation and analysis, continuously building new versions of the corpus whenever we see a recurrent problem in data quality. We work with a dedicated pipeline and keep the corpus-building process as modular and automatic as possible, applying manual work before the first automatic step. In the last step, the corpus is encoded in CQP format (cf. IMS Open Corpus Workbench (CWB) (Evert and Hardie, 2011)) and can be accessed via a CQPweb interface (Hardie, 2012)

Figure 1: Corpus building steps

In terms of analysis, our main assumption is that due to specialization, scientific texts exhibit greater encoding density over time
(Halliday and Martin,
1993),resulting in a specific discourse type characterized by high information density (Crocker et al., 2015) that is functional for expert communication (but rather inaccessible to lay persons). Linguistically, this may be reflected in lexical compression (e.g. compounding, derivation) and syntactic reduction (e.g. relativizer omission, contractions). For instance, there is evidence from the Thesaurus of the OED (Oxford English Dictionary)

that affixation rises considerably as a means of word formation in scientific texts in the mid-17th century. For the identification that affixation rises considerably as a means of word formation in scientific texts in the mid-17th century. For the identification of further linguistic features possibly involved in denser encoding, we draw, on the one hand, on existing literature (e.g. Harris, 1991) and, on the other hand, on exploratory data-mining techniques (e.g. pattern mining as in Vreeken, 2010).

In the poster, we show the corpus-building process and selected analyses of diachronic development in the RSC with dedicated visualizations (Fankhauser et al., 2014).

Bibliography

Baron, A. and Rayson,
P. (2008). VARD 2: A tool for dealing with spelling variation in historical corpora. In
Proceedings of

the

Postgraduate

Conference

Corpus

Linguistics, Birmingham, UK.

Blei, D. M., Ng, A.
Y.,
and Jordan, M. I. (2003). Latent Dirichlet Allocation.
Journal of Machine Learning Research,
3: 993–1022.

Cockburn, A. (2001).
Agile Software Development. Addison-Wesley Professional, Boston, USA.

Crocker, M.
W.,
Demberg,
V.
and
Teich,
E. (2015). Information density and linguistic encoding (IDeaL).
KI - Künstliche Intelligenz, pp. 1–5.

Evert, S. and Hardie, A. (2011). Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In
Proceedings of the Corpus Linguistics Conference, Birmingham, UK.

Fankhauser,
P.,
Kermes, H. and
Teich,
E. (2014). Combining Macro- and Microanalysis for Exploring the Construal of Scientific Disciplinarity. In
Digital Humanities, Lausanne, Switzerland.

Halliday, M. & Martin, J. (1993).
Writing science: literacy and discursive power. Falmer Press, London.

Hardie, A. (2012). International Journal of Corpus Linguistics 17(3): 380-409.
CQPweb – combining power, flexibility and usability in a corpus analysis tool.

Harris, Z. S. (1991).
A theory of language and information: a mathematical approach. Oxford University Press, USA.

McFarland,

A.,

Ramage,

D.,

Chuang,

J.,

Heer,

J.,

Manning,

and

Jurafsky,

D. (2013). Differentiating language usage through topic models.
Poetics, 41(6): 607–25.

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2016

"Digital Identities: the Past and the Future"

Hosted at Jagiellonian University, Pedagogical University of Krakow

Kraków, Poland

July 11, 2016 - July 16, 2016

454 works by 1072 authors indexed

Conference website: https://dh2016.adho.org/

Series: ADHO (11)

Organizers: ADHO

The Royal Society Corpus: Towards a high-quality corpus for studying diachronic variation in scientific writing

1. Hannah Kermes

2. Stefania Degaetano-Ortlieb

3. Ashraf Khamis

4. Jörg Knappen

ADHO - 2016

"Digital Identities: the Past and the Future"