Text and data mining for East Asian sources in classical Chinese

workshop / tutorial
  1. Donald Sturgeon

    Durham University


Brief description
Substantial volumes of primary sources important to the historical written record of China and other East Asian civilizations have been scanned and made available through online databases. Amongst these, the contents of many important sources have been transcribed into textual form, while many more remain available only as images with uncorrected OCR transcriptions. A small but growing number of texts have been semantically annotated, with named entities explicitly marked in the texts and linked to open knowledge bases. Using the Chinese Text Project (https://ctext.org) as a source, this interactive workshop introduces participants to ways of working efficiently with digitized and annotated historical texts. It also demonstrates how to improve the state of digitization of such texts in a crowdsourced environment that supports manual correction of OCR, semantic annotation of named entities, and construction and use of a Linked Open Data knowledge graph.

This session will introduce participants to:

Basic navigation of this large and moderately complex digital library – e.g. handling of multiple editions, complex metadata, etc.
Text mining using openly available browser-based tools that use interactive visualizations to allow user-driven exploration of the contents of both this digital library and arbitrary user-supplied materials in any language.
Hands-on introduction to crowdsourced editing to correct errors in textual transcriptions – such as errors introduced through OCR – and principles of versioned textual repositories.
Semantic annotation and knowledge base construction. This will introduce the motivation for semantic annotation with concrete examples, and equip participants with the tools to contribute directly to the annotation of classical Chinese sources through crowdsourcing, as well as to the construction of a crowdsourced knowledge graph of data extracted from these same materials.
Basic knowledge graph querying and data mining. The knowledge graph introduced supports online querying, the semantics and use of which will be explained in this section.
Introduction to querying the knowledge graph with RDF and SPARQL. The knowledge graph introduced closely follows the design principles used in Wikidata, and as such has an RDF representation which can be queried in substantially the same way using SPARQL. This section will provide a brief introduction to this process.
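
As a rough illustration of the graph-pattern matching that underlies SPARQL queries of this kind, the following Python sketch evaluates a small set of triple patterns against a toy in-memory graph. It is not the ctext.org or Wikidata implementation. Every identifier in it (ex:E1, ex:authorOf, ex:instanceOf, and so on) is hypothetical and chosen only for illustration, and the SPARQL text is shown for comparison rather than executed.

```python
# Toy triple store. Every identifier (ex:E1, ex:authorOf, ...) is hypothetical
# and does NOT correspond to real ctext.org or Wikidata identifiers.
TRIPLES = [
    ("ex:E1", "rdfs:label", "司馬遷"),
    ("ex:E1", "ex:instanceOf", "ex:Person"),
    ("ex:E1", "ex:authorOf", "ex:E2"),
    ("ex:E2", "rdfs:label", "史記"),
    ("ex:E2", "ex:instanceOf", "ex:Work"),
    ("ex:E3", "rdfs:label", "班固"),
    ("ex:E3", "ex:instanceOf", "ex:Person"),
    ("ex:E3", "ex:authorOf", "ex:E4"),
    ("ex:E4", "rdfs:label", "漢書"),
    ("ex:E4", "ex:instanceOf", "ex:Work"),
]

# The same question written as SPARQL, for comparison only (not executed here).
SPARQL_EQUIVALENT = """
SELECT ?authorLabel ?workLabel WHERE {
  ?author ex:instanceOf ex:Person .
  ?author ex:authorOf   ?work .
  ?author rdfs:label    ?authorLabel .
  ?work   rdfs:label    ?workLabel .
}
"""

PATTERNS = [
    ("?author", "ex:instanceOf", "ex:Person"),
    ("?author", "ex:authorOf", "?work"),
    ("?author", "rdfs:label", "?authorLabel"),
    ("?work", "rdfs:label", "?workLabel"),
]


def unify(pattern, triple, binding):
    """Match one pattern against one triple, extending the variable binding.

    Returns the extended binding, or None if the triple does not fit.
    """
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable: bind it, or check it agrees with an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant that does not match this triple.
            return None
    return extended


def select(patterns, triples):
    """Return every variable binding that satisfies all patterns (a join)."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = unify(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


results = [(b["?authorLabel"], b["?workLabel"]) for b in select(PATTERNS, TRIPLES)]
for author, work in results:
    print(author, work)
```

Each pattern narrows the set of candidate bindings, which is why a shared variable such as ?author ties the patterns together. A real SPARQL engine does the same job with indexing and query optimization rather than nested loops.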

Participants are encouraged to create a free account on ctext.org prior to the workshop (https://ctext.org/account.pl?if=en).

Target audience
Scholars of East Asian history in fields where important source materials are written in classical Chinese (including in particular: China, Japan, Korea, and Vietnam), with interests in any period from around the first millennium BC to 1911 AD. Note that although the source materials used are in classical Chinese, all software used has complete English and Chinese interfaces, and the workshop content should be intelligible to anyone with a minimal degree of Chinese language ability, and/or familiarity with any language written using Sinitic characters (e.g. modern Japanese). Due to the regional importance of classical Chinese historically, sources written in the classical Chinese language remain important in many East Asian historical domains of study.

While many researchers working with these materials are likely to have used the Chinese Text Project before – it is accessed by over 30,000 unique users each day – most will not have experience of either the text mining or data mining extensions available, which require a greater investment of effort to meaningfully engage with.
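
To give a sense of the simplest kind of text mining involved, the following Python sketch counts character n-grams in a short classical Chinese passage. It is an illustrative sketch only and is not the method used by Text Tools. The sample passage (the opening of the Analects) is hard-coded, the function name is invented for this example, and only the basic CJK Unified Ideographs block is treated as text.

```python
import re
from collections import Counter

# Hard-coded sample: the opening passage of the Analects.
SAMPLE = "子曰：「學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？人不知而不慍，不亦君子乎？」"


def char_ngrams(text, n):
    """Count character n-grams, never spanning punctuation or whitespace.

    Assumption: only characters in the basic CJK Unified Ideographs block
    (U+4E00 to U+9FFF) count as text; rarer characters in extension blocks
    would be treated as separators by this simple pattern.
    """
    counts = Counter()
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        for i in range(len(run) - n + 1):
            counts[run[i:i + n]] += 1
    return counts


bigrams = char_ngrams(SAMPLE, 2)
print(bigrams.most_common(3))
```

Splitting on punctuation first keeps n-grams from straddling clause boundaries, so a pair such as 曰學 is never counted. On this passage the only repeated bigram is 不亦, reflecting the parallel structure of the three rhetorical questions.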

1. ~20 mins Introduction and overview
2. ~40 mins Interactive text mining using Text Tools (Sturgeon 2018a) and the ctext API (Sturgeon 2021a)
3. ~20 mins Collaborative editing and correcting errors
4. ~40 mins Semantic annotation and knowledge base construction
5. ~30 mins Basic knowledge graph querying and data mining
6. ~30 mins Brief introduction to querying the knowledge graph with RDF and SPARQL
A number of previous workshops run by the same instructor have covered many aspects of the material in parts 1 through 5 above.

Online written tutorials (created in part for use in previous workshops) exist for much of the content in parts 1, 2, and 3, and are available in English, Japanese, and Chinese.

Part 6 of the tutorial will be entirely new, as RDF serialization of the knowledge graph is a relatively new feature, and previous shorter workshops have lacked sufficient time to cover this aspect.

Donald Sturgeon is Assistant Professor of Computer Science at Durham University, and the creator of ctext.org. His research interests include digital libraries, text and data mining, natural language processing of premodern Chinese, and classical Chinese philosophy.


Sturgeon, D. (2018a). Digital Approaches to Text Reuse in the Early Chinese Corpus. Journal of Chinese Literature and Culture, 5(2). Duke University Press: 186–213.

Sturgeon, D. (2018b). Large-scale Optical Character Recognition of Pre-modern Chinese Texts. International Journal of Buddhist Thought and Culture doi:10.16893/IJBTC.2018.

Sturgeon, D. (2018c). Unsupervised identification of text reuse in early Chinese literature. Digital Scholarship in the Humanities, 33(3): 670–84 doi:10.1093/llc/fqx024.

Sturgeon, D. (2020). Digitizing Premodern Text with the Chinese Text Project. Journal of Chinese History, 4(2). Cambridge University Press: 486–98 doi:10.1017/jch.2020.19.

Sturgeon, D. (2021a). Chinese Text Project: A dynamic digital library of premodern Chinese. Digital Scholarship in the Humanities, 36(Supplement_1): i101–12 doi:10.1093/llc/fqz046.

Sturgeon, D. (2021b). Constructing a crowdsourced linked open knowledge base of Chinese history. 2021 Pacific Neighborhood Consortium Annual Conference and Joint Meetings (PNC). pp. 1–6 doi:10.23919/PNC53575.2021.9672294.

Conference Info

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO