Inspired by the increasing influence of the China Biographical Database (CBDB; <https://projects.iq.harvard.edu/cbdb/home>) on historical studies (Huang and Luo, 2018), researchers are building biographical databases for Japan (Born, 2018; Japan Biographical Database: <https://www.network-studies.org/#!/>), Singapore (Singapore Biographical Database on Chinese Pioneers: <http://www.nlb.gov.sg/resourceguides/category/singapore-bio-db-on-chinese-pioneers/>), and Taiwan (Taiwan Biographical Database: <http://tbdb.ntnu.edu.tw/>). We extract information from various types of historical documents, and organize the extracted information for further queries and analyses. In CBDB and the Taiwan Biographical Database (Sie, Ke, and Chang, 2017), data are stored in relational databases.
We report our experience in producing a chronological listing that accounts for an individual’s life.
In the Chinese community, this is called 年譜 /nian2 pu3/ or 年表 /nian2 biao3/.
The chronicles record the main achievements of individuals in selected years. The statements summarize yearly achievements, and are usually brief and informative.
To this end, we locate descriptions of an individual’s yearly work in source biographies, and aim to produce shorter descriptions for the chronicles. One may accomplish the latter sub-goal with several techniques, including statistics, extractive summarization, abstractive summarization, and sentence compression (Knight and Marcu, 2000; Liu and Liu, 2009). We focus on our experience in sentence compression that utilizes parsed structures (Filippova and Strube, 2008). Results of empirical evaluation suggest that we might produce reasonable assisting software for compiling chronicles, but fully automatic compilation of chronicles without large human-annotated data can be challenging (Hasegawa, Kikuchi, Takamura, and Okumura, 2017).
Data Sources and Main Ideas
We use the texts in two volumes of biographies that belong to the official gazetteers published by the Taipei city government (Chen, 2017), and we refer to this collection as STCG henceforth. The biography part of STCG includes more than 479 thousand Chinese characters and punctuation marks for 319 persons. Each biography runs between one and five pages. Although the contents of the biographies differ, many of the statements follow a typical style (we had opportunities to contact the editorial board of the gazetteers, who confirmed that the authors did agree on a format to follow whenever reasonably possible), so it is possible to extract statements about yearly accomplishments from those relatively long biographies.
Figure 1: a constituency tree
Figure 2: a dependency structure
We then process the statements with constituency and dependency parsers. We use the Stanford package, which includes several components, including constituency and dependency parsers (<https://nlp.stanford.edu/software/>). Figures 1 and 2 show the output of the parsers for a sample Chinese statement; the figures were produced by the Stanford CoreNLP 3.9.1 GUI (updated 2018/04/05). An English translation of this Chinese statement is “Mr. Zhang (張先生) served as (擔任) the sales manager (銷售經理) of the Taiwan Power Company (臺灣電力公司).” We chose not to show the constituency and dependency trees for this English translation, mainly because the grammars and word orders of Chinese and English differ; such trees might not really help reviewers and readers who do not read Chinese to appreciate the meaning and applicability of the Chinese trees.
The constituency tree shows the syntactic relationships among the words, and the dependency tree shows the functional relationships among the words. In Figure 1, we can see that the constituency parser recognizes “張先生” as the main noun phrase (NP) and “擔任” as the main verb of a verb phrase (VP). In Figure 2, the dependency parser identifies “張先生” and “銷售經理”, respectively, as the subject (nsubj) and the object (dobj) of the main verb.
Given a dependency structure, we may design a computational procedure to shorten the original statement while attempting to keep the core information of the statement. For the above example, if we can identify the main verb in the dependency structure, keep the nsubj and the dobj nodes and the associated compound:nn nodes, and drop all other nodes, we will produce “張先生擔任銷售經理” (in English, “Mr. Zhang serves as the sales manager”), which is a good compressed expression of the original statement.
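The pruning idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parse below is hand-written rather than real parser output, the tokenization (including the particle 的) and the nmod:assmod and case labels are guesses made so that the example runs, and the function name `compress` is hypothetical. Only the kept labels (nsubj, dobj, compound:nn) and the expected result come from the text.

```python
# Hand-written toy parse (an assumption, not output of the Stanford parser).
# Each entry: (token index, word, head index, relation); head 0 marks the root.
TOY_PARSE = [
    (1, "張先生", 2, "nsubj"),
    (2, "擔任", 0, "root"),
    (3, "臺灣電力公司", 6, "nmod:assmod"),
    (4, "的", 3, "case"),
    (5, "銷售", 6, "compound:nn"),
    (6, "經理", 2, "dobj"),
]


def compress(parse, keep_args=("nsubj", "dobj"), keep_mods=("compound:nn",)):
    """Keep the main verb, its selected arguments, and their compound modifiers."""
    root = next(i for i, _, head, _ in parse if head == 0)
    # Arguments attached directly to the main verb.
    args = {i for i, _, head, rel in parse if head == root and rel in keep_args}
    # Compound modifiers attached directly to those arguments.
    mods = {i for i, _, head, rel in parse if head in args and rel in keep_mods}
    kept = {root} | args | mods
    # Emit the surviving words in their original order.
    return "".join(word for i, word, _, _ in parse if i in kept)


print(compress(TOY_PARSE))  # 張先生擔任銷售經理
```

The label sets are parameters so that other choices of dependency relations can be tried without changing the procedure.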
To evaluate the effectiveness of the main ideas, we implemented a prototype for creating personal chronicles from a given biography.
Extracting Statements about Yearly Achievements
The first step is to extract statements that begin at a time stamp and end at another time stamp. For instance, we can extract the statements between “In 1931” and “In 1932”, and assume that the statements are about the biographee’s status in 1931. Although this assumption might not work for all genres of texts, we actually extracted 4948 such statements from our corpora.
Our current method for identifying time stamps needs to be strengthened. Not all time expressions include numbers, e.g., “next year” and “two years later”. We need to build a full-fledged capability to identify all time stamps and extract the statements perfectly.
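A minimal sketch of the extraction step follows. It assumes, purely for illustration, that a time stamp is a four-digit year followed by 年; the source does not state the actual pattern used. The sample text is invented, and the function name `split_by_year` is hypothetical. As a simplification, the last statement runs to the end of the text instead of ending at another time stamp.

```python
import re

# Assumption: a time stamp is a four-digit year followed by 年.
YEAR = re.compile(r"(\d{4})年")


def split_by_year(text):
    """Return (year, statement) pairs; each statement runs up to the next time stamp."""
    matches = list(YEAR.finditer(text))
    pairs = []
    for k, m in enumerate(matches):
        # The statement ends where the next time stamp begins (or at the end of text).
        end = matches[k + 1].start() if k + 1 < len(matches) else len(text)
        pairs.append((int(m.group(1)), text[m.end():end].strip("，。 ")))
    return pairs


# Invented sample text, not taken from STCG.
sample = "1931年，進入公司服務。1932年，升任經理，次年退休。"
for year, statement in split_by_year(sample):
    print(year, statement)
```

In this sample, the relative expression 次年 ("next year") stays inside the 1932 statement, which illustrates the limitation noted above: time expressions without numbers are not detected by such a pattern.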
Figure 3: Distribution of sentence lengths
Segmenting and Sampling Chinese Sentences
Although modern Chinese texts use a punctuation mark to end a sentence, such a Chinese sentence may actually be separated by a few Chinese commas, and these separated segments may correspond to multiple English sentences. Treating statements that are delimited by Chinese commas and/or Chinese periods as sentences is a common practice. We analyze the numbers of characters in the sentences, and Figure 3 shows the distribution of the sentence lengths.
We chose to work on the sentences that have between 7 and 20 characters, and ignored those that have no verbs or have multiple verbs in the current study. Sentences that are very short do not need to be shortened, and, normally, parsers cannot yet handle very long Chinese sentences correctly.
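The segmentation and filtering steps can be sketched as follows. The function names are hypothetical, the sample text is assembled for illustration from two segments quoted in this paper plus one invented short segment, and the verb tag set (Penn Chinese Treebank tags VV, VA, VC, VE) is an assumption about how verbs might be counted; the source does not say which tags were used.

```python
import re
from collections import Counter


def segment(text):
    """Split on Chinese commas and periods; drop empty pieces."""
    return [s for s in re.split("[，。]", text) if s]


def length_distribution(sentences):
    """Count how many sentences have each length (in characters)."""
    return Counter(len(s) for s in sentences)


def select_by_length(sentences, shortest=7, longest=20):
    """Keep sentences whose length falls in the range used in this study."""
    return [s for s in sentences if shortest <= len(s) <= longest]


# Assumption: verbs are tokens tagged with Penn Chinese Treebank verb tags.
VERB_TAGS = {"VV", "VA", "VC", "VE"}


def has_single_verb(tagged):
    """tagged: (word, tag) pairs produced by some part-of-speech tagger."""
    return sum(tag in VERB_TAGS for _, tag in tagged) == 1


text = "出任東方出版社董事長職務，後退休，病逝於美國洛杉磯。"
sentences = segment(text)
print(length_distribution(sentences))
print(select_by_length(sentences))
```

Here the three-character segment is dropped by the length filter, while the other two remain candidates for compression.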
Sentence Compression using Heuristics
We can employ and evaluate heuristic rules for the compression task. An intuitive rule is to start from the main verb in a sentence. We retain the dobj nodes that connect to the main verb in a dependency structure. We keep the tmod nodes that are with the dobj nodes. We also keep the neg nodes and similar nodes because they all carry crucial information, e.g., auxpass indicates the passive voice. Definitions of the dependency labels are available at <https://nlp.stanford.edu/software/dependencies_manual.pdf>.
Alternatively, we compute the word vectors for the words in the corpora with Gensim (<https://radimrehurek.com/gensim/>; window size = 5, vector size = 250, CBOW) (Mikolov, Sutskever, Chen, Corrado, and Dean, 2013). For typical simple sentences, we could find the top-level NP and VP (Wang, Chen, and Liu, 2018), and identify the main verb and the main noun for the main VP in a constituency tree. We calculate the average vector for these core words, and repeatedly select the next word whose inclusion maximizes the cosine similarity between the average vector and the sentence vector of the original sentence, until the similarity reaches 0.95.
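The greedy selection can be sketched as below. Several details are assumptions made only for illustration: the sentence vector is taken to be the mean of all word vectors, the three-dimensional vectors are toy values rather than trained Gensim embeddings, the word list is a simplified tokenization of one of the sentences quoted in this paper, and the function name `greedy_compress` is hypothetical.

```python
import math


def mean(vectors):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_compress(words, vectors, core, threshold=0.95):
    """Grow the core word set until its mean vector is close to the sentence vector."""
    # Assumption: the sentence vector is the mean of all word vectors.
    target = mean([vectors[w] for w in words])

    def score(indices):
        return cosine(mean([vectors[words[i]] for i in indices]), target)

    kept = set(core)
    while score(kept) < threshold and len(kept) < len(words):
        # Add the word that raises the similarity the most.
        candidates = [i for i in range(len(words)) if i not in kept]
        kept.add(max(candidates, key=lambda i: score(kept | {i})))
    return [words[i] for i in sorted(kept)]


# Toy vectors (hypothetical values, not trained embeddings).
words = ["大多", "演出", "甘草", "角色"]
vectors = {
    "大多": [0.1, 0.1, 0.1],
    "演出": [1.0, 0.0, 0.0],
    "甘草": [0.0, 0.0, 1.0],
    "角色": [0.0, 1.0, 0.0],
}
print(greedy_compress(words, vectors, core=[1, 3]))
```

With these toy values the core words 演出 and 角色 alone score about 0.82, so one more word is added; 甘草 gives the largest gain and lifts the similarity above the threshold.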
Among the shortened sentences, we randomly chose ten samples for each sentence length, and judged the appropriateness of the sampled sentences ourselves. We repeated the evaluation five times, and accepted 317 of the 700 test sentences (about 45%) when we tried the first heuristic. Figure 4 shows the average acceptance rates.
We list some pairs of the original sentence and the shortened sentence for readers’ reference.
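The sampling and scoring procedure can be sketched as follows, under the assumption that judgments are recorded as (sentence length, accepted) pairs; the function names and the seed parameter are hypothetical. The total of 700 test sentences is consistent with 14 sentence lengths (7 to 20 characters), ten samples per length, and five rounds.

```python
import random
from collections import defaultdict


def sample_by_length(pairs, per_length=10, seed=0):
    """pairs: (original, shortened) tuples; draw up to per_length pairs per original length."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for original, shortened in pairs:
        groups[len(original)].append((original, shortened))
    return {n: rng.sample(g, min(per_length, len(g))) for n, g in sorted(groups.items())}


def acceptance_rates(judgments):
    """judgments: (length, accepted) tuples; return the acceptance rate per length."""
    totals, accepted = defaultdict(int), defaultdict(int)
    for length, ok in judgments:
        totals[length] += 1
        accepted[length] += bool(ok)
    return {n: accepted[n] / totals[n] for n in sorted(totals)}


# 14 lengths x 10 samples x 5 rounds matches the 700 test sentences in the text.
assert len(range(7, 21)) * 10 * 5 == 700
print(round(317 / 700, 3))  # 0.453
```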
Accepted: (出任東方出版社董事長職務, 出任董事長職務), (大多演出詼諧逗趣的甘草角色, 演出甘草角色), (總統府頒授特種領綬景星勳章, 總統府頒授景星勳章), (於永和開設東方藝術舞蹈研究社, 開設舞蹈研究社), (中華民國營養學會頒發營養特殊貢獻獎, 營養學會頒發特殊貢獻獎), (板橋林本源家族在大稻埕開設林本源博愛醫院, 本源家族開設博愛醫院)
Failed: (在台山東菁英籌組中華齊魯工商文教協會, 在菁英籌組文教協會), (于北投中華佛教文化館創辦中華佛學研究所, 佛教文化館創辦佛學研究所), (於該刊發表多篇關於中醫理論與實務的文章, 發表文章), (首度由雨傘節蛇毒分離出甲型雨傘節神經毒素, 分離出神經毒素), (病逝於美國洛杉磯, 病逝), (更進一步進行全國律師界之改革, 進行改革)
We repeated the same procedure to evaluate the effects of using the second heuristic. Figure 5 shows the average acceptance rates when we asked three different persons to judge the acceptability of the compressed sentences.
Figure 4. Average acceptance rates of the shortened sentences for the original sentences of varying lengths (the red dashed line shows the trend)
Figure 5. Average acceptance rates judged by three persons (using the second heuristic; the dashed lines show the trends)
The research was supported in part by contracts MOST-104-2221-E-004-005-MY3 and MOST-107-2221-E-004-009-MY3 of the Ministry of Science and Technology of Taiwan and in part by projects 107H121-08 and 108H121-08 of the National Chengchi University.
Born, L. (2018) Leveraging the Japanese biographical database as a digital resource for education and research. Proceedings of the 2018 Annual Meeting of the Japanese Association for Digital Humanities.
Chen, T.-l. (Ed.) (2017) A Sequel of Taipei City Gazetteers–Volume 8: Biographies (續修台北市志人物志), Taipei: Taipei City Government. <https://www.chr.gov.taipei/cp.aspx?n=EEBCFC3935F53838>
Filippova, K. and Strube, M. (2008) Dependency tree based sentence compression. Proceedings of the Fifth International Natural Language Generation Conference, 25–32.
Hasegawa, S., Kikuchi, Y., Takamura, H., and Okumura, M. (2017) Japanese sentence compression with a large training dataset. Proceedings of the Fifty-fifth Annual Meeting of the Association for Computational Linguistics, 281–286.
Huang, J. and Luo, T. (2018) Computing len for exploring the historical people's social network. Proceedings of 2018 Sixth International Conference on Future Internet of Things and Cloud Workshops, 95–101.
Knight, K. and Marcu, D. (2000) Statistics-based summarization - step one: Sentence compression. Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, 703–710.
Liu, F. and Liu, Y. (2009) From extractive to abstractive meeting summaries: Can it be done by sentence compression? Proceedings of 2009 ACL-IJCNLP Conference (short papers), 261–264.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, 3111–3119.
Sie, S.-H., Ke, H.-R., and Chang, S.-B. (2017) Development of a text retrieval and mining system for Taiwanese historical people. Proceedings of 2017 Pacific Neighborhood Consortium Annual Conference and Joint Meetings, 56–62.
Wang, Y.-S., Chen, W.-R., and Liu, C.-L. (2018) Extracting social information and chronicles from personal biographies in Taipei city gazetteers. Proceedings of the Ninth International Conference of Digital Archives and Digital Humanities, 359–405. (in Chinese)
Hosted at Utrecht University
July 9, 2019 - July 12, 2019
Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html
Series: ADHO (14)