A Novel Semi-supervised Framework to Identify Military Documents: A Quantitative Analysis on Military Records in "Ming Shi-Lu"

paper, specified "long paper"
  1. You-Jun Chen

    Center for GIS, Research Center for Humanities and Social Sciences, Academia Sinica, Taiwan; Department of Mathematics, University of California, Los Angeles, CA, USA

  2. Hsin-Yi Hsieh

    Center for GIS, Research Center for Humanities and Social Sciences, Academia Sinica, Taiwan; Department of Computer Science and Information Engineering, National Central University, Taiwan

  3. Richard Tzong-Han Tsai

    Center for GIS, Research Center for Humanities and Social Sciences, Academia Sinica, Taiwan; Department of Computer Science and Information Engineering, National Central University, Taiwan


Faced with an extensive digitized corpus, a historian may find text analysis or concordance software such as AntConc, which relies on a context-independent, word-count approach, desirable for identifying the sentences or documents in which words of interest are located. However, a potential pitfall of such a methodology is that the presence or absence of a particular keyword does not necessarily indicate whether a subject is discussed (Bingham, 2010), especially for those exploring specific phenomena or broad social or cultural themes.

This work presents a semi-supervised deep neural framework that leverages contextualized representation learning techniques to automatically identify military documents in Ming Shi-Lu without previously labeled data. We aim to expedite the onerous process of compiling relevant and consistent documents about the phenomenon under investigation. In particular, we leverage a weakly supervised classification model (WSM) built upon the study by Meng et al. (2020), which emulates how humans categorize documents into named categories, to generate high-confidence labeled data. The quantitative results of our analysis, which contribute another dimension to the study of the Ming military, lend credence to the effectiveness of our approach and shed light on the development and collapse of the Ming empire.


Ming Shi-Lu (MSL), composed of 208,522 records, is an official annalistic work centering on the Ming emperors, compiled by officials of the Ming Dynasty (1368 A.D. to 1644 A.D.). Preserving an enormous number of original documents, including edicts, decrees, and records of political, military, socio-economic, and other major events, MSL plays an essential role in the historical reconstruction of the diverse East Asian societies and polities. Our study focuses on military records in MSL, aiming to unveil their underlying historical, academic, and documentary value through natural language processing techniques.

Even though the emergence of pre-trained language models (LMs) has drastically reduced the amount of training data needed for supervised methods, experiments show that the training data must still amount to 4-5% of the entire dataset (Grießhaber et al., 2020), which in our case is approximately 5,400-6,900 labeled documents, to yield stable and satisfactory accuracy.
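As a rough check of this estimate, the following minimal sketch computes 4% and 5% of the corpus. It assumes the 136,427-document total reported in the dataset description; the function name is hypothetical and not part of the original study.

```python
# Corpus size as reported in the dataset description of this paper.
TOTAL_DOCUMENTS = 136_427


def labeled_documents_needed(total: int, fraction: float) -> int:
    """Return the number of labeled documents implied by a corpus fraction."""
    return round(total * fraction)


# Lower and upper bounds of the 4-5% range cited from Grießhaber et al. (2020).
low = labeled_documents_needed(TOTAL_DOCUMENTS, 0.04)
high = labeled_documents_needed(TOTAL_DOCUMENTS, 0.05)
print(f"{low:,} to {high:,} labeled documents")
```

Both bounds fall inside the rounded range quoted in the text.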
The proposed framework overcomes the need for previously labeled data (Fig. 1). The included steps are: (1) select a small set of seed words describing the categories to be classified, (2) use a WSM to produce labeled data according to the selected words, (3) rearrange the resulting labeled data into a training set, and (4) use a supervised classification model (SCM) to categorize the entire dataset.
Our WSM is based on LOTClass (Meng et al., 2020), which leverages the bert-base-chinese language model (Devlin et al., 2019) both as the general knowledge base for understanding category names and as the feature representation learning model for classification. The bert-base-chinese model is also the backbone of our SCM.

Fig. 1. The generic framework in this study.
(1) The WSM will (i) learn a category-indicative vocabulary from the user-provided seed words of each class (Table 2 and 3), (ii) find the category-indicative words (w) in the text and train the model, via cross-entropy loss with a classifier on top of each w's contextualized embedding, to predict their implied categories, (iii) generalize through a self-training mechanism, and (iv) make predictions (Table 3). (2) The process of supervised document classification is to (i) take a set of documents from the prediction results of the WSM as ground truth labels, (ii) fine-tune a pre-trained LM on a classification task with the labeled data, (iii) evaluate the model, and (iv) use the trained model to predict the remaining documents.

Dataset and definition of military documents
We define military documents as records describing offensive or defensive operations, of either a combat or a non-combat nature, and consider only documents involving human activities in MSL (Institute of History and Philology, Academia Sinica, 1984), which amount to 136,427 documents in total.

Weakly labeled data generation
We use trivial military action words such as attack and defense as seed words for military documents. To capture non-military documents, we also select 20 different classes of seed words (Table 1). The WSM will generate a set of category indicative vocabulary (Table 2) based on the input seed words for each class and leverage the contextualized category indicative words to classify the documents (Table 3).

Table 1. User-provided seed words of the categories Lawbreaking, Repair, and Military in our study.
We carefully choose and expand univocal characters as seed words by inspecting the classification results of the WSM.

Table 2. Category vocabulary of the categories Lawbreaking, Repair, and Military in our study.

Table 3. Examples of prediction results by the WSM for the categories Lawbreaking, Repair, and Military.
(1) Examples 1, 5, and 6 contain no user-provided seed words or category-indicative vocabulary. This implies that the model can identify a document's category without trivial keywords, surmounting the limitation of the keyword search approach. (2) We take 4,000 documents from Category Military and 4,000 from the remaining, non-military categories as training data.
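Note (2) above can be sketched as follows. This is a minimal illustration, not the authors' code: the source does not state how the 4,000 documents per side were chosen, so uniform random sampling with a fixed seed is an assumption, as are the function name and the (doc_id, category) input format.

```python
import random


def build_balanced_training_set(predictions, per_class=4000, seed=0):
    """Sample an equal number of military and non-military documents.

    predictions: list of (doc_id, category) pairs produced by the weak model.
    Returns a shuffled list of (doc_id, label) with label 1 for military.
    """
    military = [doc for doc, cat in predictions if cat == "Military"]
    other = [doc for doc, cat in predictions if cat != "Military"]
    if len(military) < per_class or len(other) < per_class:
        raise ValueError("not enough documents to sample per_class from each side")

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = [(doc, 1) for doc in rng.sample(military, per_class)]
    sample += [(doc, 0) for doc in rng.sample(other, per_class)]
    rng.shuffle(sample)
    return sample


# Toy weak-model output; the study itself samples 4,000 documents per side.
toy_predictions = [
    ("d1", "Military"), ("d2", "Military"), ("d3", "Military"),
    ("d4", "Lawbreaking"), ("d5", "Repair"), ("d6", "Repair"),
]
training_set = build_balanced_training_set(toy_predictions, per_class=2)
print(training_set)
```

Collapsing all non-military categories into a single negative label matches the binary task described in the next section.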

BERT-based supervised document classification
To initiate the supervised document classification task, we take 8,000 labeled documents generated by the WSM. A manual examination of 5% of these labeled documents shows that the WSM achieves around 87.3% accuracy. We then fine-tune the bert-base-chinese model for a binary classification task, randomly taking 80% of the labeled data for training and 20% for validation, and evaluate the model's performance via precision, recall, and F1 scores (Table 4). Subsequently, we use the trained binary classifier to predict the remaining 128,427 documents.
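The split and the evaluation metrics described above can be illustrated with a small, dependency-free sketch. The function names, the fixed random seed, and the toy labels are assumptions for illustration; the actual fine-tuning of bert-base-chinese is not shown, and the toy scores are not results from this study.

```python
import random


def train_validation_split(examples, train_fraction=0.8, seed=0):
    """Randomly split examples into training and validation subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Bookkeeping with the figures reported in the text.
labeled = [(i, i % 2) for i in range(8000)]  # placeholder (doc_id, label) pairs
train, validation = train_validation_split(labeled)
remaining = 136_427 - len(labeled)  # documents left for the classifier
print(len(train), len(validation), remaining)

# Toy predictions to exercise the metric function.
precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(precision, recall, f1)
```

With 8,000 labeled documents, an 80/20 split yields 6,400 training and 1,600 validation documents, and the remainder agrees with the 128,427 documents mentioned in the text.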

Table 4. Precision, recall, and F1 scores of the BERT-based binary classification model

Result and analysis

Comparison with the distribution of war frequency in the Ming Dynasty

As armed forces are primarily intended for warfare, we compare the distribution of the number of military documents with the distribution of war frequency (W) (Editorial Committee of Chinese Military History, 1985) in the Ming Dynasty (Fig. 2). It can be seen that, up to the end of the period covered by the official MSL (1627 A.D.), the fluctuations of the two trends match, evincing the robustness of our framework.

Fig. 2. Distributions of the Number of Military Documents in Ming Shi-Lu and Number of Wars in the Ming Dynasty.
(1) The x-axis labels show the duration of reign for each Ming Emperor. (2) The divergence of the two trends after the 1630s may be explained by two factors: (i) The official MSL only covers reigns from the Hongwu Emperor (1368 A.D. - 1398 A.D.) to the Tianqi Emperor (1620 A.D. - 1627 A.D.). Records of the Chongzhen Emperor (1627 A.D. - 1644 A.D.), the last Ming emperor, are from Chongzhen Shi-Lu and Chongzhen Chang-Bian, placed in the appendix of MSL; even though they provide an account of the reign, they are significantly fewer in number than the records of other reigns. (ii) The last interval of the Chinese war data (W) to which we have access ranges from 1640 to 1649, extending beyond the rule of the Ming empire.

Evaluation of military document ratio distribution in Ming Shi-Lu

To identify high-density periods of military documents, we convert the absolute number of military documents into a ratio distribution calculated over 5-year intervals (Fig. 3).
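A minimal sketch of such a conversion is given below. The source does not spell out the denominator or the alignment of the intervals, so this sketch assumes that the ratio is the share of military documents among all documents in each 5-year bin and that bins start at 1368, the first year of the dynasty; the function name and toy records are hypothetical.

```python
from collections import defaultdict


def military_ratio_by_interval(records, start_year=1368, width=5):
    """Return {bin_start_year: military_docs / all_docs} per interval.

    records: iterable of (year, is_military) pairs, one per document.
    """
    totals = defaultdict(int)
    military = defaultdict(int)
    for year, is_military in records:
        # Map the year to the first year of its interval.
        bin_start = start_year + ((year - start_year) // width) * width
        totals[bin_start] += 1
        if is_military:
            military[bin_start] += 1
    return {b: military[b] / totals[b] for b in sorted(totals)}


# Toy records: three documents in 1368-1372 and three in 1373-1377.
toy_records = [
    (1368, True), (1369, False), (1372, True),
    (1373, False), (1374, False), (1376, True),
]
ratios = military_ratio_by_interval(toy_records)
print(ratios)
```

Normalizing by the number of documents in each bin keeps reigns with sparse records from being mistaken for quiet periods, which is one reason a ratio can be more informative than raw counts.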

Fig. 3. Military Document Ratio Distribution in Ming Shi-Lu.
We marked the peaks and listed the corresponding major military events below: (A) The founding of the Ming dynasty. (B) The Jingnan campaign (1399-1402) and the Yongle Emperor’s campaigns against the Mongols (1410-1424). (C) The War of Luchuan, the Tumu Crisis (1449), and the Defense of Jingshi (1449). (D) [...] raids and [...] raids. (E) The Jiajing wokou raids and the War of Gengxu (1550). (F) The Bozhou campaign (1589-1600), the Ningxia campaign (1592), and the Imjin War (1592-1598). (G) The Battle of Sarhū and the collapse of the Ming dynasty. Note (2)(i) of Fig. 2 can explain the anomalous decline in the ratio during 1640-1644.

The successive peaks (B - E) coincide with major military events of profound influence on the development of the Ming Dynasty. The Jingnan campaign (B) contributed to a drastic change in the country’s military defense system. The incredible feats achieved by the Ming in the War of Luchuan (C) indirectly influenced the emergence of civil officials exercising military power (Li, 2003). The War of Gengxu (E), which occurred while Ming armies were suffering repeated defeats against the Wokou and Mongol raids on Ming territory, led to the abolishment of the Superintendent of the Integrated Division, which had been in place for a century (Wu, 2021).
During the later reign of the Wanli Emperor, the increased military expenditure and the worsening fiscal crisis resulting from the military campaigns (F) brought about the downfall of the Ming Dynasty (Zhao et al., 2016). Additionally, peasant uprisings instigated by the severe droughts of 1627-1643 and the ensuing famine, together with the southward migration of the Mongols caused by the effects of the Little Ice Age, precipitated the collapse of the Ming Empire (Zheng et al., 2014; Sun and Zhang, 2018).

This work introduces a semi-supervised framework that identifies military documents without any labeled data, significantly reducing the manual labeling effort required of domain experts. The empirical results of our analysis, which align with the occurrence of major campaigns, demonstrate the robustness of our approach. For future work, we would like to continue exploring the potential of this framework and apply it to existing Asian corpora such as the Veritable Records of the Joseon Dynasty, contributing to the reconstruction of diverse Asian history. Additionally, we plan to conduct an in-depth investigation of the military documents in MSL to substantiate perceived historical hypotheses with quantitative, temporal, or geographical evidence.


Bingham, A. (2010). ‘The Digitization of Newspaper Archives: Opportunities and Challenges for Historians’. Twentieth Century British History, 21(2): 225–31 doi:10.1093/tcbh/hwq007.

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–86 doi:10.18653/v1/N19-1423.

Editorial Committee of Chinese Military History (1985). Tabulation of Wars in Ancient China. People’s Liberation Army Press, Beijing.

Grießhaber, D., Maucher, J. and Vu, N. T. (2020). Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning. Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, pp. 1158–71 doi:10.18653/v1/2020.coling-main.100.

Institute of History and Philology, Academia Sinica (1984). Scripta Sinica.

Li, F. (2003). 明代文人對軍隊的統領論析 [The Civil Officers’ Command of the Army in the Ming Dynasty]. Master’s thesis. http://cnki.sris.com.tw/kns55/brief/result.aspx?dbPrefix=CMFD.

Meng, Y., Zhang, Y., Huang, J., Xiong, C., Ji, H., Zhang, C. and Han, J. (2020). Text Classification Using Label Names Only: A Language Model Self-Training Approach. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 9006–17 doi:10.18653/v1/2020.emnlp-main.724.

Sun, C. and Zhang, Q. (2018). 氣候變遷、政府能力與王朝興衰——基于中國兩千年來歷史經驗的實證研究 [Climate Change, State Capacity, and the Rise and Fall of Dynasties——An Empirical Study Based on Chinese Historical Experience in the Past 2000 Years]. China Economic Quarterly, 18(1): 311–36.

Wu, Y. (2021). 六師之任:明代協理京營戎政與北京防禦 [Governor-General of Capital Defenses and Military Affairs of Beijing Area in Ming China]. Ph.D. thesis, National Taiwan Normal University.

Zhao, Y., Fu, W. and Hei, W. (2016). 試析播州之役對明朝的影響 [The Influence of the Bozhou rebellion on the Ming Dynasty]. Wenjiao Ziliao, 19: 72–73.

Zheng, J., Xiao, L., Fang, X., Hao, Z., Ge, Q. and Li, B. (2014). How climate change impacted the collapse of the Ming dynasty. Climatic Change, 127(2): 169–82 doi:10.1007/s10584-014-1244-7.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022


Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO