CCVG Data: A Unique, Curated, and Searchable Chinese Village Dataset for Chinese Study Scholars

paper, specified "long paper"
  1. Daqing He

    University of Pittsburgh

  2. Rongqian Ma

    University of Pittsburgh

  3. Ruoyun Zheng

    University of Pittsburgh

  4. Haihui Zhang

    University of Pittsburgh

Work text

Chinese local gazetteers are unique primary resources with a long history in China. According to Endymion Wilkinson, “[Gazetteers] form one of the most important sources for the study of Chinese history in the past one thousand years” (2000, p. 154). Over the past 30 years, North American-centered research on China has developed rapidly, yet scholars recognize a continuing need for more ways to pursue systematic research, and therefore for more reliable information sources. Although some data on Chinese rural areas are available, they are scattered across different resources at varying levels of accessibility.
The Contemporary Chinese Village Gazetteer Data Project (CCVG Data) was initiated and is conducted by the East Asian Library (EAL) of the University of Pittsburgh Library System (ULS). CCVG Data, the first of its kind in the world, is designed to establish a dataset of significant humanities and social science value based on the East Asian Library’s extensive collection of village gazetteers. Village gazetteers report quantitative and qualitative data for China’s most basic administrative unit, covering topics such as local history, genealogy, economics, education, politics and management, and public health. Data at this level of detail can only be found in village gazetteers. In 2018, CCVG Data was awarded a one-year, $35,100 Pitt Seed Project grant supported by the University of Pittsburgh’s Office of the Chancellor. In August 2019, the CCVG Data project completed its first stage as a pilot project and made the first 500 villages’ data public for teaching and research in Chinese studies. By the end of September 2021, data for about 1,500 villages had been extracted, entered, and opened to the public. To enhance the accessibility of the data and to improve user experience, the CCVG team initiated a collaborative project with the School of Computing and Information at the University of Pittsburgh in January 2020. Two important tasks are under way to increase online accessibility to CCVG data: first, storing the extracted village gazetteer data in a database, and second, providing powerful search interfaces for scholars.
In the presentation, we will briefly introduce the background and rationale of the CCVG Data project, including its data extraction, quality control, and data dictionary, and then focus on our ongoing efforts to design and construct the database and search interface. Currently, CCVG data is stored in a MySQL database, in which 38 tables hold the data related to the villages’ 12 thematic topics, such as gazetteer information, village basis, natural environment, natural disaster, ethnic groups, economy, and education. Two online search interfaces provide access to CCVG data: single village search for basic search and multiple village search for advanced search. The single village search enables scholars to access and download information on all 12 topics for one village. The multiple village search provides advanced search capabilities: several villages can be selected with filters on province, city, and county. The 12 topics are presented as a tree structure that scholars can browse and select from, and each selection acts as a further filter on the relevant CCVG data. Scholars can also select a single year or enter a year range to narrow the results. The resulting data can be downloaded. Our initial assessments of the search interfaces indicated that scholars find them useful.
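To make the multiple village search more concrete, the following is a minimal sketch of how filters on province, city, county, and year range might be combined into one parameterized query. It is an illustration under stated assumptions, not the project's implementation: the table names (`village`, `population`), the column names, and all rows are hypothetical placeholders, and Python's built-in `sqlite3` module stands in for the MySQL database so that the example is self-contained.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema.

    The real CCVG database is described as MySQL with 38 tables; these two
    tables and their rows are invented placeholders for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE village (
            village_id INTEGER PRIMARY KEY,
            name TEXT, province TEXT, city TEXT, county TEXT
        );
        CREATE TABLE population (
            village_id INTEGER REFERENCES village(village_id),
            year INTEGER, total INTEGER
        );
    """)
    conn.executemany("INSERT INTO village VALUES (?, ?, ?, ?, ?)", [
        (1, "Village A", "Province X", "City P", "County M"),
        (2, "Village B", "Province X", "City P", "County N"),
        (3, "Village C", "Province Y", "City Q", "County O"),
    ])
    conn.executemany("INSERT INTO population VALUES (?, ?, ?)", [
        (1, 1980, 1200), (1, 1990, 1350), (2, 1985, 800), (3, 1985, 2100),
    ])
    return conn


def search_villages(conn, province=None, city=None, county=None,
                    year_from=None, year_to=None):
    """Return (village name, year, total) rows matching the given filters.

    Each filter is optional. A single-year selection is expressed by passing
    the same value for year_from and year_to.
    """
    clauses, params = [], []
    # Column names are fixed here, never taken from user input; only the
    # filter values are passed as bound parameters.
    for column, value in (("v.province", province),
                          ("v.city", city),
                          ("v.county", county)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    if year_from is not None:
        clauses.append("p.year >= ?")
        params.append(year_from)
    if year_to is not None:
        clauses.append("p.year <= ?")
        params.append(year_to)
    where = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = f"""
        SELECT v.name, p.year, p.total
        FROM village AS v
        JOIN population AS p ON p.village_id = v.village_id
        {where}
        ORDER BY v.name, p.year
    """
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in search_villages(demo, province="Province X",
                               year_from=1985, year_to=1990):
        print(row)
```

In this sketch, each topic chosen from the topic tree would correspond to another joined table or selected column, and the returned rows could be written out with the standard `csv` module to support download.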
The CCVG Data project has received much attention from scholars and researchers in a variety of disciplines. Inquiries about using CCVG Data have come from many areas of the humanities and social sciences, including religion, education, economics, family planning and public health, and local management. Following the overview of the CCVG Data interface and its usage among scholars, we will also discuss the value of the CCVG Data project and the collaboration model that has contributed to its progress.
From books to a dataset, CCVG Data is an experimental project with significant value to the humanities and social sciences. It supports Chinese studies in fields such as politics, economics, sociology, environmental science, history, and public health, and is a meaningful exploration in Digital Humanities (DH). More specifically, with the rapid development of East Asian DH, the CCVG Data project marks a milestone in this emerging field from multiple perspectives (Vierthaler, 2020). On the one hand, the CCVG Data project demonstrates libraries’ leadership roles in DH work and represents a deeper, more well-rounded collaboration model among librarians, humanities scholars, social scientists, and computer scientists. The collaboration covers nearly every step of the DH work, including data collection and processing, database and platform design, user research, and programming and development of digital tools. As collaboration becomes widely recognized as a significant factor in the success of DH research, the CCVG Data project suggests diverse roles that information professionals can serve in DH scholarship (Poremski, 2017; Richardson & Eichmann-Kalwara, 2017; Risam et al., 2017). On the other hand, the CCVG Data project contributes to DH scholarship with infrastructural innovation. Unlike the extensive bibliographic databases, tools, and textual corpora available for East Asian studies, the CCVG Data project focuses on numeric datasets and leverages computational methods (e.g., searchable interface, visual analytic tools) to facilitate users’ and researchers’ interaction with the data.

CCVG Data website:

Poremski, M. D. (2017). Evaluating the landscape of digital humanities librarianship. College & Undergraduate Libraries, 24(2–4), 140–154.

Richardson, H. A. H., & Eichmann-Kalwara, N. (2017). Process and collaboration: Assessing digital humanities work through an embedded lens. College & Undergraduate Libraries, 24(2–4), 595–615.

Risam, R., Snow, J., & Edwards, S. (2017). Building an ethical digital humanities community: Librarian, faculty, and student collaboration. College & Undergraduate Libraries, 24(2–4), 337–349.

Vierthaler, P. (2020). Digital humanities and East Asian studies in 2020. History Compass, 18(11), e12628.

Wilkinson, E. P. (2000). Chinese History: A Manual, Revised and Enlarged. Cambridge, Mass: Harvard University Asia Center for the Harvard-Yenching Institute.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website:

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO