Georgia Institute of Technology, United States of America
Georgia Institute of Technology, United States of America
Georgia Institute of Technology, United States of America
In traditional, paper-based archives, finding aids afford basic means for discovery, but tend to return results shaped by researchers’ preexisting knowledge and queries. With the digitization of archives, new analysis and visualization methods allow researchers to textually map large corpora and not only pinpoint specific materials, but also open new ways of navigation.
ENDNOTES
See Graham, S., Milligan, I., and Weingart, S. (2015),
Exploring Big Historical Data: The Historian's Macroscope
, Imperial College Press, London; Morrissey, R. (2015), “Archives of connection: ‘Whole network’ analysis and social history,”
Historical Methods
, vol. 48, no. 2, pp. 67–79; Duff, W., and Haskell, J. (2015), “New uses for old records: A rhizomatic approach to archival access,”
American Archivist
, vol. 78, no. 1, pp. 38-58; Putnam, L. (2016), “The transnational and the text-searchable: Digitized sources and the shadows they cast,”
American Historical Review
, vol. 121, no. 2, pp. 377–402; Edelstein, D., Findlen, P., Ceserani, G., Winterer, C., and Coleman, N. (2017), “Historical research in a digital age: Reflections from the Mapping the Republic of Letters project,”
American Historical Review
, vol. 122, no. 2, pp. 400–424; and Hoekstra, R. and Koolen, M. (2018), “Data scopes for digital history research,”
Historical Methods: A Journal of Quantitative and Interdisciplinary History
, vol. 52, no. 2, pp. 79–94.
However, effective use of computational tools including natural language processing (NLP), named entity recognition (NER), and knowledge graphing often requires considerable technical sophistication, forming an access barrier for many researchers.
Piotrowski, M. (2012),
Natural Language Processing for Historical Texts
, Synthesis Lectures on Human Language Technologies, vol. 17, Morgan & Claypool Publishers, San Rafael; Marrero, M., Urbano, J., Sánchez-Cuadrado, S., Morato, J., and Gómez-Berbís, J.M. (2013), “Named Entity Recognition: fallacies, challenges and opportunities,”
Computer Standards and Interfaces
, vol. 35, no. 5, pp. 482–489; Shahin, S. (2016), “When scale meets depth: Integrating Natural Language Processing and textual analysis for studying digital corpora,”
Communication Methods and Measures
, vol. 10, no. 1, pp. 28–50; Srinivasa-Desikan, B. (2018),
Natural Language Processing and Computational Linguistics: A Practical Guide to Text Analysis with Python, Gensim, spaCy, and Keras
, Packt Publishing, Birmingham and Mumbai; Ehrmann, M., Romanello, M., Flückiger, A. and Clematide, S. (2020), “Named Entity Recognition and linking on historical newspapers,” in Arampatzis, A. et al. (eds.),
Experimental IR Meets Multilinguality, Multimodality, and Interaction
, Springer International, Cham, pp. 288–310.
To help bridge this technical gap, we have been developing a toolset, currently funded by an NEH-ODH Digital Humanities Advancement Grant, that integrates large-scale text processing and data visualization capabilities into the open-source
Omeka
We developed
Archiviz
for the Omeka Classic version; technical and user support for our plugin will be available in February 2023 in the
Omeka plugin library
. Additional availability is planned through github and a Docker container for ease of use.
content management platform. With the tool, users can upload batches of documents, perform NER on them, and manipulate an interactive graph that displays extracted entities (people, places, organizations, etc.) from the collection and how they interconnect in its component documents. The toolset’s components
The primary tools and plug-ins used to construct Archiviz include Harvard’s
Elasticsearch
Omeka plugin for enhanced search capabilities, Google
Tesseract
for OCR,
spaCy
’s NLP functionality, particularly its NER functionality, with current development focusing on the integration of the Science Museum Group’s
Heritage Connector
for better entity identification, linking, and disambiguation. The graphing of this information is primarily performed with the force-directed graphing of
d3.js
.
were all specifically developed for (and tested by) non-technical researchers and community advocates, with all computational work taking place in a simple graphical user interface.
Our
Archiviz
toolset produces social network-style visualizations that connect results on the basis of the important people, places, organizations, and other entities mentioned. By applying NER to a collection, users can see the full breadth of their research subject at a glance, and explore connections they may not have known exist.
For a similar application, see Tumbe, C. (2019), "Corpus linguistics, newspaper archives and historical research methods,"
Journal of Management History
, vol. 25, no. 4, pp. 533–549.
What our implementation offers that is new, is the efficient display of interrelationships between large numbers of nodes (named entities) combined with the possibility of rapid navigation to search results (documents). In addition, the interface moves beyond traditional query-based archival searching, “flattening” out a collection to show results beyond a researcher’s interests or knowledge.
We are continuing to refine
Archiviz
in ways intended to be intuitive, usable, and readily adoptable by users from a variety of different backgrounds. A primary design goal of the project has been to make it accessible not just to relatively tech-savvy academics, but more importantly, communities beyond the academy. It is our hope that the tool may particularly benefit disadvantaged communities—which in the United States have often faced pressure from gentrification and urban redevelopment, with consequent coercive displacement from historical neighborhoods. Residents in such areas often lack the infrastructure to collect, preserve, and interpret local history. As such, we have partnered with various community groups and activists throughout the process to ensure that the tool can be useful to them. Most ambitiously, in 2019 we hosted a “Community Researcher Workshop” for Atlanta-based librarians, archivists, community organizers, nonprofit staffers, and students to explore the
Mayor Ivan Allen Digital Archive
that has served as our first test corpus, using a prototype of our toolset.
On the issues here, see Flinn, A., Stevens, M., and Shepherd, E. (2009), “Whose memories, whose archives? Independent community archives, autonomy and the mainstream,”
Archival Science
, vol. 9, no. 71, pp. 71–86.
User testing of the interface during this workshop produced very positive feedback, with participants calling the platform “incredibly useful,” with the “potential to break down traditional barriers of why people are hesitant to use archives.” A GLAM (Gallery, Library, Archive, Museum) researcher noted that it “is something almost any archive or library could utilize to their advantage,” while a community advocate stated that she “wanted the information for myself, but also…to share with my fellow residents.” With an additional two years of grant-funded development since this workshop, we are excited to share our work with the DH community. While we have integrated much of the feedback from the workshop, future plans include capabilities to process a wider spectrum of digital, textualized media—documents, video and audio converted to text, and even images identified and categorized with computer vision making our tool relevant for a more diverse array of communities and collections.
Wiriyathammabhum, P., Summers-Stay, D., Fermüller, C., and Aloimonos, Y. (2017), “Computer vision and Natural Language Processing: Recent approaches in multimedia and robotics,”
ACM Computing Surveys
, vol. 49, no. 4, pp. 1–44.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
In review
Tokyo, Japan
July 25, 2022 - July 29, 2022
361 works by 945 authors indexed
Held in Tokyo and remote (hybrid) on account of COVID-19
Conference website: https://dh2022.adho.org/
Contributors: Scott B. Weingart, James Cummings
Series: ADHO (16)
Organizers: ADHO