Complexities in the Use, Analysis, and Representation of Historical Digital Periodicals

Clifford Edward Wulfman; Sinai Rusinek; Zef Segal; Nanette Rißler-Pipka; Sarah Ketchley; Torsten Roeder; Estelle Bunout; Marten Düring

Authorship

1. Clifford Edward Wulfman
Libraries - Princeton University

2. Sinai Rusinek
Open University, University of Haifa

3. Zef Segal
Open University, University of Haifa

4. Nanette Rißler-Pipka
Karlsruher Institut für Technologie / Karlsruhe Institute of Technology (KIT)

5. Sarah Ketchley
University of Washington

6. Torsten Roeder
Leopoldina National Academy of Sciences - Martin-Luther-Universität Halle-Wittenberg

7. Estelle Bunout
Luxembourg Centre for Contemporary and Digital History (C2DH) - University of Luxembourg / Universität Luxemburg

8. Marten Düring
Luxembourg Centre for Contemporary and Digital History (C2DH) - University of Luxembourg / Universität Luxemburg

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Panel introduction
The theme of this conference is complexities, and there are few printed media as complex as newspapers and magazines. They are, generally, serial publications (and thus have a complex temporal dimension not often found in other publications), frequently with miscellaneous content and complex page layout, and often entailing complex publication relationships among authors, editors, and publishers. They have generally been poorly served by conventional research libraries, which have often discarded covers and advertising in the desire to conserve space and whose catalogues seldom represent the actual nature of their holdings.
The digital remediation of periodicals has also been complex. Digitization is a complex term with a variety of meanings and consequently a variety of products, each with different stakeholders. Librarians want simple formats that can be produced, catalogued, stored, and delivered in standard ways. Scholars of visual arts want page images of very high resolution and chromatic fidelity; many linguists and textual scholars don't care about pages at all but are keenly invested in machine-readable transcriptions of complete texts that are often spread across different pages and different issues. Scholars of print need information about paper types and typography. Repositories need information about copyright.
Although most text-based scholars (literature, history, sociology, cultural studies) seldom venture beyond the downloaded PDF and token search, some have begun to engage digital textuality in more complex and sophisticated ways, and these engagements have exposed some of the limitations of standard approaches to digitization, which often focus on graphical user interfaces to page images and the production of efficient inverted indexes at the expense of the periodical corpus.
This panel follows a session at the 2014 Digital Humanities Conference in Lausanne, Switzerland, entitled "Remediating 20th-Century Magazines of the Arts: Approaches, Methods, Possibilities," which stimulated an ongoing series of discussions and calls for the formation of an ADHO SIG devoted to periodicals. Since then, projects large and small have begun to go beyond the retrieval paradigm to explore other methods and approaches to newspapers: Europeana Newspapers (http://www.europeana-newspapers.eu), the Chronicling America Data Challenge (https://www.neh.gov/news/press-release/2016-07-25), Translantis (https://translantis.wp.hum.uu.nl), the Viral Texts Project (https://viraltexts.org) and, more recently, impresso (https://impresso-project.ch/) and NewsEye (https://www.newseye.eu/). This panel showcases the complexity of periodical studies and the many ways digital technologies enable them today.

„Horizontal“ reading in a corpus of 19th century German music magazines and daily newspapers: analysing the debates on Verdi’s Messa da Requiem

Perhaps even more than today, musical concerts in the 19th century were highly significant social and cultural events. Whenever a new musical piece was performed, detailed reviews were published in local newspapers, discussing musical ideas as well as the quality of composition and performance. The discussions were usually published in the feuilleton and in many cases even on the title page. Music criticism was part of a general cultural discourse, connected to political opinions, religious beliefs and social imprints, and newspapers were a relevant forum for these debates.

In the late 1860s, the number of German newspapers increased immensely, owing both to less restrictive press laws and to advances in printing techniques. In the mid-1870s, the newspapers announced a new piece by the Italian opera composer Giuseppe Verdi: the grand Messa da Requiem, a commemorative work for orchestra, choir and soloists, written for the deceased Italian writer Alessandro Manzoni. The first performances in Austria and Germany prompted a discussion about the (in)appropriateness of operatic style in the genre of sacred music, and about whether church music should ideally be free of realism and of overwhelming or untrue emotion.

The research described here focuses on analyzing the ideals of requiem music from the perspective of German music criticism and compares them on regional and local levels. The analysis was supported by TEI-encoded transcriptions of all relevant German articles between 1874 and 1878, resulting in a corpus of 330 texts and approximately one million characters. The markup focuses on semantics: about 8,000 entities and relations were identified, among them more than 500 individual persons, 140 performances, 100 geographic items and 135 musical works, as well as 100 cases of text reuse.
The corpus was then analyzed from different perspectives. Based on the semantic markup, each partial analysis starts from a single entry point (e.g. Kyrie, the first part of Verdi’s composition), extracts all occurrences of that semantic unit together with some context, and compares them with respect to journal, place and date of publication and, where possible, the author’s background. A delicate matter is the extent of the context; in most cases, phrase segments are sufficient. This method of extracting the contexts of a single semantic unit and reading them in parallel, metaphorically »horizontal« reading, helps to identify patterns in the reception. For example, the Agnus Dei, a very popular piece that was mentioned in more than 60 texts, was strikingly often compared with other known opera numbers for its tune, with the suggestion that Verdi could have copied the melody. While musicological research has so far been limited to individual critics, the new method can be used to compare different receptions with each other without requiring background information on the individual authors, which is often not available. However, this method requires accessible digital copies of newspapers, reliable metadata, high-quality OCR or transcriptions, phrase segmentation, and tagged semantic entities.
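The extraction step behind »horizontal« reading can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: it assumes that phrase segments are encoded as TEI <seg> elements, that tagged entities are <rs> elements with a key attribute, and that journal, place and date are available per article. The key value agnus_dei, the two sample documents and their metadata are invented placeholders.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented stand-in articles: (metadata, TEI string). The use of <seg> for phrase
# segments and <rs key="..."> for entities is an assumption made for this sketch.
SAMPLE_DOCS = [
    (
        {"journal": "Journal B", "place": "City B", "date": "1876-03-02"},
        """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
        <seg>The choir sang with precision.</seg>
        <seg>In the <rs type="work-part" key="agnus_dei">Agnus Dei</rs> the tune seemed familiar.</seg>
        </p></body></text></TEI>""",
    ),
    (
        {"journal": "Journal A", "place": "City A", "date": "1875-06-12"},
        """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
        <seg>The <rs type="work-part" key="agnus_dei">Agnus Dei</rs> was warmly received.</seg>
        </p></body></text></TEI>""",
    ),
]


def horizontal_read(docs, key):
    """Collect every phrase segment that mentions the entity `key`, with metadata."""
    rows = []
    for meta, xml_text in docs:
        root = ET.fromstring(xml_text)
        for seg in root.iterfind(".//tei:seg", TEI_NS):
            mentions = seg.iterfind(".//tei:rs", TEI_NS)
            if any(rs.get("key") == key for rs in mentions):
                # Flatten the segment to plain text and normalize whitespace.
                context = " ".join("".join(seg.itertext()).split())
                rows.append({**meta, "context": context})
    # Sort by ISO date so that the contexts can be read side by side in order.
    return sorted(rows, key=lambda row: row["date"])


for row in horizontal_read(SAMPLE_DOCS, "agnus_dei"):
    print(row["date"], row["journal"], row["place"], "|", row["context"])
```

Using the phrase segment as the unit of context mirrors the observation above that phrase segments suffice in most cases; a wider window would only require collecting the neighbouring segments as well.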

Re-discovering the „hidden women“ of 19th century Egyptology through thematic analysis of historical newspapers
The end of the 19th and beginning of the 20th centuries saw great archaeological activity in Egypt, a period that came to be known as the “Golden Age” of Egyptology. My research seeks to re-examine the ambiguous and liminal position of women attached to these archaeological digs and expeditions. The starting point was the unpublished travel journals of Mrs. Emma B. Andrews, who for over two decades between 1889 and 1914 traveled the Nile with the millionaire lawyer turned archaeologist, Theodore M. Davis. Andrews was present when Davis discovered eighteen of the forty-two tombs now known in the Valley of the Kings, and her writings provide a detailed record of excavation often lacking in contemporary publications.
My initial work included transcribing and encoding the diaries in TEI XML, tagging named entities that have formed the basis of a research database and a digital edition. The research corpus has expanded to include material written by the wives, relatives, and friends of expedition directors, archaeologists, artists, and photographers working in Egypt at that time, and comprises diaries, correspondence, historical ephemera and a large number of historical newspapers. These documents give the historian an overview of the social, geographical and political history of Egypt at the time within the broader context of the histories of archaeology and Egyptology, of gender studies, and of the social, cultural and political history of the Victorian era.
Working with a large corpus of digitized newspapers has shifted the work away from granular markup towards broader thematic analysis, including topic modeling and ngram analysis. This paper focuses on the process and challenges of using computational methods to extract significant information from collections of digitized newspapers of varying quality and content. The material is drawn from archives including Chronicling America, Gale Primary Sources, Europeana Newspapers, and the British Library.
Relevant newspaper articles, advertisements, and editorials have been compiled from keyword searches, based on lists generated from the named entities captured during the markup of diaries and correspondence. This has been a simple yet effective way to create relevant datasets - a task that would otherwise have been slow and unwieldy given the scope of the source material.
The main challenge has been poor-quality OCR, which has affected returned search results. This has led to the development of strategies for cleaning bad OCR programmatically using NLTK and RegEx, as well as with digital tools developed for the purpose, including Lexos and Gale’s Digital Scholar Lab.
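A rule-based cleaning pass of the kind mentioned above might look like the following. This is only a sketch using Python's standard re module (NLTK is not needed for these rules); the specific rules and the sample string are assumptions chosen for illustration, not the strategies actually developed in this research.

```python
import re


def clean_ocr(text):
    """Apply a few conservative regex repairs to raw OCR text."""
    # Rejoin words hyphenated across a line break: "archae-\nological" -> "archaeological".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Turn the remaining line breaks into spaces.
    text = re.sub(r"\s*\n\s*", " ", text)
    # Drop stray symbols, keeping letters, digits, whitespace and basic punctuation.
    text = re.sub(r"[^\w\s.,;:!?'\"()-]", "", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


raw = "The archae-\nological ex~pedition  at\nThebes|"
print(clean_ocr(raw))
```

Rules like these trade recall for safety: rejoining at line breaks also merges genuine compounds such as "well-\nknown", so each rule is worth checking against a sample of the corpus before it is applied at scale.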
Work beyond text cleaning has focused on analyzing curated newspaper datasets in an effort to identify prevalent cultural and historical themes using topic modeling tools and ngram analysis. The aim is to explore Egyptological women’s writing within this cultural context and to develop a picture of everyday life from multiple historical primary sources, a topic that is often obscured from an historian’s view.
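The ngram side of this analysis can be illustrated with a short, self-contained sketch. It is not the tooling used in the project: the tokenizer is a simple regular expression, the optional stopword set is an assumption, and the two sample sentences are invented. Topic modeling is not shown here.

```python
import re
from collections import Counter


def ngram_counts(texts, n=2, stopwords=frozenset()):
    """Count word n-grams across a collection of cleaned texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphabetic tokens only (Unicode letters included).
        tokens = [t for t in re.findall(r"[^\W\d_]+", text.lower()) if t not in stopwords]
        # Zipping n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


sample = ["The Valley of the Kings", "the valley of the Queens"]
for ngram, count in ngram_counts(sample, n=2).most_common(3):
    print(" ".join(ngram), count)
```

Note that stopwords are removed before the n-grams are formed, so tokens that were separated by a stopword become neighbours; whether that is desirable depends on the question being asked.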

The periodical as a geographical space: the 19th century Hebrew HaZefira as a case study

A historical periodical, just like any historical event or phenomenon, is defined in space and time. Its institutions operate from actual buildings; it is circulated via transportation routes; and through its reader network it connects places and people. However, the relation between the two concepts, “geographical space” and “periodical,” is not one-way. Just as the periodical exists within a geographical space, one can and should explore the ways in which geographical spaces exist within periodicals and, most importantly, are created by them. Many periodicals produce an internal geographical space, whose attributes and connections result from the text and inner conventions of the periodical, which might include the distinction between different pages, articles, and sections, or the distinction between places that appear in the headlines and places that appear in the body of the text. Geographical places, such as cities, mountains, rivers, states and continents, are located on paper rather than on the globe; their sizes (or altitudes) are defined by the frequency of their references, and the distances between the places are defined by proximity on the line, page or article. These geographical spaces that emerge from the text are mediated geographical spaces with completely different shapes and topographies than the “normal” world map. They are influenced by cultural, political, ideological and economic perceptions of writers and readers of the periodical, but they also reinforce such perceptions and legitimize them.
Recent advances in digitization, automatic annotation, morphological and lexical analysis, as well as geographic and network mapping, have made the exploration of such uncharted and chaotic spaces more feasible than before. This paper will use the nineteenth-century Hebrew newspaper HaZefira as a case study for different geographical spaces that originate from and within the newspaper. By alternating between various ways of representing the geographical data in the journalistic text, I will show that spaces are constantly being deterritorialized and reterritorialized.
HaZefira, which operated between 1862 and 1931, was established in order to educate its readers in worldly knowledge. Its articles discussed global news, travel stories, and scientific discoveries, all of which involved spatial references. This paper will discuss the different ways of exploring the relations between textual spaces (within the journal) and their spatial references, the digital methodologies that assist the unveiling of these spaces, and the problems that are still left to be dealt with.
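The idea of a textual geography, in which a place's size is its frequency of reference and the distance between places follows from their proximity in the text, can be sketched as a simple counting exercise. The example below is an illustration only: it uses the article as the unit of proximity, assumes that the text has already been tokenized and matched against a gazetteer (for Hebrew this would require the morphological analysis mentioned above), and runs on invented token lists.

```python
from collections import Counter
from itertools import combinations


def textual_geography(articles, gazetteer):
    """Return (mention counts per place, co-occurrence counts per pair of places)."""
    mentions = Counter()
    cooccurrence = Counter()
    for tokens in articles:
        places = [t for t in tokens if t in gazetteer]
        # "Size": how often a place is referred to at all.
        mentions.update(places)
        # "Proximity": how often two places share an article (unordered pairs).
        for pair in combinations(sorted(set(places)), 2):
            cooccurrence[pair] += 1
    return mentions, cooccurrence


# Invented, pre-tokenized articles and a toy gazetteer.
GAZETTEER = {"Warsaw", "Paris", "London"}
ARTICLES = [
    ["Warsaw", "news", "Paris", "Warsaw"],
    ["Paris", "London"],
    ["Warsaw", "Paris"],
]

sizes, links = textual_geography(ARTICLES, GAZETTEER)
print(sizes.most_common())
print(links.most_common())
```

Switching the unit of proximity from the article to the page or the line only changes how the input is split, which makes it easy to compare the different textual spaces that each choice produces.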

Reconceptualizing resources: transforming Blue Mountain into TEI-encoded editions and a bibliographic knowledge base
This paper presents work being done by the Blue Mountain Project to create TEI-based editions of its magazines and a bibliographic knowledge base of avant-garde publication that leverages its detailed bibliographic metadata and the resources of the semantic web.
The Blue Mountain corpus of European and North American Avant-Garde periodicals currently contains full runs of thirty-four titles in eleven languages, with full text encoded in METS/ALTO and descriptive metadata in MODS. Blue Mountain's rich and extensive metadata are not well represented in the content models predominant in digital library repositories, models based on page images of books and manuscripts. Furthermore, traditional bibliographic metadata often fails to capture important information about periodicals, including periodicity, issuance, and republication.
The paper describes the reconception of Blue Mountain's digital objects from library resources to digital editions, and of its metadata from descriptive catalog records to a bibliographic knowledge base. It describes our transformation of METS/ALTO documents into TEI-encoded XML and IIIF manifests, as well as our conversion of Blue Mountain's MODS data to PRESSoo, part of a family of models that enable analysts to capture the temporal nature of publication as well as the inter-relationships among a variety of agents and entities, allowing researchers to ask questions not easily answered before.
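To give a sense of what the first of these transformations involves, the sketch below converts the text content of an ALTO document into a bare TEI body. It is a hedged illustration, not Blue Mountain's actual transformation: it maps each ALTO TextBlock to a TEI ab element and each TextLine to an lb milestone followed by the line's words, ignores hyphenation (HYP) and spacing (SP) elements, and leaves out the METS structure, the IIIF manifests and the PRESSoo conversion entirely. The sample input is invented.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented ALTO fragment; TextBlock, TextLine, String and CONTENT are ALTO names.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="B1">
      <TextLine><String CONTENT="Blue"/><SP/><String CONTENT="Mountain"/></TextLine>
      <TextLine><String CONTENT="Project"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local(tag):
    """Strip the namespace so that any ALTO schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def alto_to_tei_body(alto_xml):
    """Map ALTO text blocks and lines to TEI <ab> elements with <lb/> milestones."""
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    alto = ET.fromstring(alto_xml)
    body = ET.Element(f"{{{TEI_NS}}}body")
    for block in alto.iter():
        if local(block.tag) != "TextBlock":
            continue
        ab = ET.SubElement(body, f"{{{TEI_NS}}}ab")
        if block.get("ID"):
            ab.set("n", block.get("ID"))
        for line in block:
            if local(line.tag) != "TextLine":
                continue
            lb = ET.SubElement(ab, f"{{{TEI_NS}}}lb")
            words = [s.get("CONTENT", "") for s in line if local(s.tag) == "String"]
            lb.tail = " ".join(words) + "\n"
    return ET.tostring(body, encoding="unicode")


print(alto_to_tei_body(SAMPLE_ALTO))
```

Keeping the line breaks as milestones preserves a link back to the page layout, which matters when the edition is meant to sit alongside page images rather than replace them.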

A generous design of interfaces as an answer to the complexity of digitized newspaper collections
The digitization of historical sources does not automatically translate into accessibility for historical research. The publication of digitized newspapers and periodicals still leaves researchers who wish to use them hamstrung by interfaces that offer little interaction other than search functionalities. The insufficient interactions offered by current interfaces derive from their technical and material background but also from the limited information collected so far on the actual needs of the users, especially for research. Research in the humanities is characterized by iteration, discovery and exploration, especially given the huge volume and inherent complexity of historical newspapers. Digitized newspapers offer opportunities for enrichment which in turn should allow the development of novel research workflows which go beyond mere content retrieval by means of keyword searches and get us closer to humanist research practices.
The ambition to offer room for more flexible and open interaction with digitized cultural heritage collections has been captured by Whitelaw (DHQ, 2015) under his call for more “generous interfaces”. The interdisciplinary project “impresso - Media Monitoring of the Past” aims at materializing this call with the development of a methodologically reflected technological framework to enable new ways of engaging with multilingual digitized historical newspapers, mainly from Switzerland and Luxembourg. In other words, impresso works on the conception of a generous interface by fostering synergies between computational linguists, designers, humanist researchers and institutions holding the collections. Our proposition for the implementation of generosity is to enable users to interact with primary sources in a digital environment using all possible entry points resulting from this transformation. In our case, such interactions are made possible by OCR improvement, annotations of named entities, topics and text reuse detection as well as close and persistent cooperation with historians and designers.
We make use of newspaper metadata provided by the partnering libraries (who printed it, when, where), the metadata produced during the compilation of digital collections (e.g. the number of words, pages, and issues), as well as annotations such as the segmentation of the document, named entities, topics and text reuse. We also integrate indicators of the quality of the digitization, e.g. OCR quality and missing issues or pages. Each of these items opens a distinct entry point for search, exploration and analysis.
These added layers of information require a suitable interface to be usable for researchers. This interface can facilitate the navigation of large-scale collections, providing an overview both of the collection as a whole and of the content of a particular query or of a particular collection of articles curated by a user. The impresso interface provides space for a visual exploration of the collections, using filters to customize the visualization based on the mentioned metadata, as well as automatically extracted named entities, topics and text similarities. This push towards the design of a “generous” interface should encourage novel research workflows and a more reflective use of historical newspapers.
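The notion that every layer of metadata and annotation becomes an entry point can be made concrete with a small faceting sketch. This is not the impresso data model or API: the Article fields, the 0 to 1 OCR quality score and the three records are hypothetical and serve only to show how filters and overview counts can be derived from the same enriched records.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Article:
    newspaper: str
    year: int
    ocr_quality: float  # hypothetical score in [0, 1]
    entities: frozenset = field(default_factory=frozenset)
    topics: frozenset = field(default_factory=frozenset)


def select(articles, newspaper=None, years=None, min_ocr=0.0, entity=None, topic=None):
    """Keep the articles that satisfy every filter that was supplied."""
    return [
        a for a in articles
        if (newspaper is None or a.newspaper == newspaper)
        and (years is None or years[0] <= a.year <= years[1])
        and a.ocr_quality >= min_ocr
        and (entity is None or entity in a.entities)
        and (topic is None or topic in a.topics)
    ]


def facet(articles, attribute):
    """Count the values of one attribute, e.g. to draw an overview chart."""
    counts = Counter()
    for a in articles:
        value = getattr(a, attribute)
        counts.update(value if isinstance(value, frozenset) else [value])
    return counts


# Invented records for illustration.
ARTICLES = [
    Article("Paper A", 1905, 0.92, frozenset({"Geneva"}), frozenset({"trade"})),
    Article("Paper A", 1931, 0.55, frozenset({"Luxembourg"}), frozenset({"politics"})),
    Article("Paper B", 1910, 0.81, frozenset({"Geneva", "Luxembourg"}), frozenset({"politics"})),
]

hits = select(ARTICLES, entity="Geneva", min_ocr=0.8)
print(len(hits), facet(hits, "topics"))
```

Exposing OCR quality as a filter alongside content facets lets a researcher see how much of a result set rests on poorly recognized text, which supports the more reflective use of sources that the interface aims for.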

Opportunities and limitations of research on historical periodicals in the era of digitization: usability of digital collections in the light of different modes of reception
This paper reflects the initial discussions of the newspaper and magazine working group in the Digital Humanities in the German-Speaking World Association (https://dig-hum.de/ag-zeitungen-zeitschriften).

Before the new opportunities of digitization emerged in libraries all over the world, historical periodicals were a rare and difficult field for both researchers and librarians. In cultural and media history, only a few titles were the focus of research. In literary studies, periodicals were only consulted for first editions by canonical authors. In economics, media studies and social sciences, historical periodicals were studied for evidence of reader and consumer behavior. For linguistics, historical periodicals showed changes in non-fictional language, though building up a corpus was very difficult before digitization (Burr 1997).
Today, we can observe two ways of dealing with mass digitization. For libraries, there is always the question of emphasizing the quantity or the quality of digital editions. Both strategies have advantages: emphasizing quantity yields access to a wider range of titles; emphasizing quality leads to better access, better searchability, and greater re-usability.
Likewise, researchers take advantage of the opportunities offered by digitization in two ways. Those simply looking to read content in hard-to-find titles are happy with the quantitative approach: a badly scanned page image is sufficient for their needs. Researchers in a DH context, however, whose research entails establishing author networks, layout analysis, or topic modeling, are frustrated by the quantitative approach and dependent upon it at the same time. Still other researchers are hampered by the loss of materiality inherent in digital editions.
It is time to open up a discussion about how the different players in the process of digitization could work together to overcome these obstructions. We need to establish and compare the requirements of the different stakeholders as well as the status of the current landscape of digitized historical periodicals. At the moment there is a very active movement in research on historical periodicals, and some are already considering digital collections as knowledge systems and not only as containers of text, images, and other pieces of information.

Bibliography

Allen, Robert B. (2015). Improving Access to Digitized Historical Newspapers with Text Mining, Coordinated Models, and Formative User Interface Design. ArXiv:1502.03943 [Cs], February 13, 2015. http://arxiv.org/abs/1502.03943

Blevins, Cameron. (2014). Space, Nation, and the Triumph of Region: A View of the World from Houston. The Journal of American History 101(1): 122-147.

Burr, Elisabeth. (1997). Neutral oder stereotyp. Referenz auf Frauen und Männer in der italienischen Tagespresse. In Dahmen, W., Holtus, G., Kramer, J., Metzeltin, M., Schweickard, W. and Winkelmann, O. (eds), Sprache und Geschlecht in der Romania. Romanistisches Kolloquium X (= TBL 417). Tübingen: Narr, 133-179.

Erlin, Matt and Tatlock, Lynne. (2014). Distant Readings. Topologies of German Culture in the Long Nineteenth Century. Rochester: Camden House.

Pierazzo, Elena. (2016). Textual Scholarship and Text Encoding. In Schreibman, S., Siemens, R. and Unsworth, J. (eds), A New Companion to Digital Humanities. Wiley Blackwell.

Schöch, Christof. (2013). Big? Smart? Clean? Messy? Data in the Humanities. Journal of Digital Humanities 2(3). http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/ (accessed 3 May 2019).

Soffer, Oren. (2007). Ein lefalpel! Iton HaZefira ve-ha-Modernizatsia shel haSiakh ha-hevrati ve-ha-politi. Jerusalem: Mosad Bialik.

Van Galen, Quintus and Nicholson, Bob. (2018). In search of America: topic modelling nineteenth-century newspaper archives. Digital Journalism: 1-21. https://doi.org/10.1080/21670811.2018.1512879

Whitelaw, Mitchell. (2015). Generous interfaces for digital cultural collections. Digital Humanities Quarterly 9(1). http://www.digitalhumanities.org/dhq/vol/9/1/000205/000205.html (accessed 3 May 2019).

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

In review

ADHO - 2019

"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html

References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html

Series: ADHO (14)

Organizers: ADHO