How big can a static site be? Staticizing a census database

Long paper
Authorship
  1. Martin Holmes, University of Victoria
  2. Greg Newton, University of Victoria

Project Endings and static websites

It has long been recognized that DH web applications built on complex computing stacks requiring regular updates present an overwhelming long-term maintenance and archivability problem once their funding ends (Nowviskie and Porter 2010; Dombrowski 2019; Smithies et al. 2019). Project Endings is a collaboration between DH scholars, librarians and programmers that aims to create tools and recommendations for building extremely low-maintenance, easily archivable, but still highly functional digital edition projects (Goddard 2018; Holmes and Takeda 2019a). Starting in 2016, the Endings team have converted a number of high-traffic, well-known digital edition projects that previously ran on XML databases and similar back-end infrastructure into entirely static websites built with HTML5, CSS and client-side JavaScript (Holmes 2017; Arneil, Holmes and Newton 2019); the Endings project site maintains the full list of staticized projects. We have also developed a set of Principles to guide us in converting existing projects and creating new ones, as well as staticSearch, a client-side pure-JavaScript search engine (Holmes and Takeda 2020a, 2020b).

At the outset, we assumed that only relatively small digital editions would be suitable candidates for complete staticization, and we began with sites consisting of only a few hundred pages (see, for example, My Norse Digital Image Repository or The Robert Graves Diary).

However, seeing how smoothly the process worked with smaller sites, we took on some of our larger projects, including The Map of Early Modern London (13,086 pages), The Colonial Despatches (10,826 pages), and Digital Victorian Periodical Poetry (20,685 pages). The results were very encouraging: the new sites were faster and more responsive than the old, and the staticSearch engine performed very effectively even at these scales.

But there must, presumably, be some practical limits on the scale of a DH project that can be effectively converted to the static model, and it would be helpful to know where those limits lie. We reviewed our own catalogue and found a candidate that is precisely the kind of project suited to testing this: VIHistory.

The VIHistory project

VIHistory, a PostgreSQL/PHP project created about 15 years ago, presents census data from the City of Victoria and Vancouver Island, drawn from censuses taken in 1871, 1881, 1891, 1892, 1901, and 1911 and comprising nearly 150,000 individual census records, along with associated tables of occupations, familial relationships, locations, addresses, religions, languages, nationalities and other related concepts. In terms of the number of individual HTML pages required for a static site, it is between five and ten times the size of the largest project we have previously staticized (nearly 150,000 census records, against Digital Victorian Periodical Poetry's 20,685 pages). VIHistory has undergone successive infrastructure migrations over the years, and various features have broken as a result. A looming server migration will render it non-functional, so we must act within a short period. Furthermore, a new dataset (an addendum to the 1901 census) is now available for addition to the collection. Rather than invest more time in patching the existing site, we are instead creating a static version, with a completion deadline of April 2022.

Census data brings with it a range of challenges, particularly when multiple censuses are to be presented as an integrated dataset. From one census to another, the range of data collected varies; ward boundaries change; and descriptors such as nationality, race, occupation, familial relationship and language mutate as social and political norms evolve (see, for example, Stanger-Ross 2008, which discusses the evolution of ethnicity in census data). The old version of the VIHistory site addressed these issues through a set of PostgreSQL views, which merged the disparate datasets into a normalized form that could be queried more easily. One of our challenges will be to accomplish the same normalization in static form, as sketched below.
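
Since a static site has no database engine to consult at run time, the merging work that those views performed must move into the build pipeline. The following is a minimal sketch of that idea in Python; the column headings, file paths and field vocabulary are entirely hypothetical, and the project's actual tooling will differ:

    import csv

    # Per-census mappings from original column headings to a shared
    # normalized vocabulary: the role the old PostgreSQL views played.
    # All headings here are invented for illustration.
    COLUMN_MAPS = {
        "1891": {"Name": "name", "Birth Place": "birthplace", "Trade": "occupation"},
        "1901": {"NAME": "name", "BIRTHPLACE": "birthplace", "OCCUPATION": "occupation"},
    }

    def normalize(census_year, row):
        """Rename one raw row's columns into the shared schema,
        recording the source census for provenance."""
        mapping = COLUMN_MAPS[census_year]
        out = {norm: row.get(raw, "") for raw, norm in mapping.items()}
        out["census"] = census_year
        return out

    def load_census(census_year, path):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yield normalize(census_year, row)

    # A merged sequence of records from all censuses, analogous to a
    # UNION across the old database views.
    records = (list(load_census("1891", "data/census1891.csv"))
               + list(load_census("1901", "data/census1901.csv")))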

Another challenge will be to devise an appropriate granularity for the HTML pages which constitute the site. We expect to create a single HTML page for every distinct census entry, but other pages will be constructed to bring together records which share a feature (people with the same occupation, on the same street, with the same nationality, and so on) in order to provide a useful way of browsing the data, something lacking in the existing site, which offers only a search interface. The search itself will need to be carefully constructed so that all the features of the existing site search are preserved, and we expect to add further search options as well.
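
A schematic build step along these lines might emit one page per entry plus one listing page per distinct value of each browsable field. The record fields (id, name, occupation, birthplace), output layout and HTML are invented for illustration, and a real build would slugify values before using them as file names:

    from collections import defaultdict
    from pathlib import Path

    SITE = Path("site")

    def entry_page(rec):
        """Render one census entry as a simple standalone table."""
        rows = "".join(f"<tr><th>{k}</th><td>{v}</td></tr>" for k, v in rec.items())
        return f"<!DOCTYPE html><html><body><table>{rows}</table></body></html>"

    def listing_page(field, value, group):
        """Render a browse page linking every entry sharing one value."""
        items = "".join(f'<li><a href="../entries/{r["id"]}.html">{r["name"]}</a></li>'
                        for r in group)
        return (f"<!DOCTYPE html><html><body><h1>{field}: {value}</h1>"
                f"<ul>{items}</ul></body></html>")

    def build(records):
        (SITE / "entries").mkdir(parents=True, exist_ok=True)
        for rec in records:
            (SITE / "entries" / f"{rec['id']}.html").write_text(entry_page(rec))
        # One listing page per distinct value of each browsable field.
        for field in ("occupation", "birthplace"):
            groups = defaultdict(list)
            for rec in records:
                if rec.get(field):
                    groups[rec[field]].append(rec)
            outdir = SITE / field
            outdir.mkdir(exist_ok=True)
            for value, group in groups.items():
                (outdir / f"{value}.html").write_text(listing_page(field, value, group))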

Plan, approach, and prospects

We initially considered converting all the existing census data into XML as the first phase of the project, but we have not found any XML standard suitable for this historical census data. We then considered converting the data directly into HTML5, but found it more effective to design an intermediate custom XML schema, which enables records from all the different censuses to be encoded in a single flexible structure, and also permits us to apply datatype constraints and catch errors. The XML is then converted into XHTML5 for the website.
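
The intermediate encoding might take a shape like the following sketch, which serializes a normalized record as custom XML and then renders it as XHTML5. The element and attribute names are invented; in the real pipeline the datatype constraints would live in the schema itself, and the transformation to XHTML5 would more likely be XSLT than Python:

    import xml.etree.ElementTree as ET

    def record_to_xml(rec):
        """Encode one census entry as, e.g.:
        <entry census="1901" id="e12345">
          <field name="name">John Smith</field>
          ...
        </entry>"""
        entry = ET.Element("entry", census=rec["census"], id=rec["id"])
        for key, value in rec.items():
            if key not in ("census", "id"):
                field = ET.SubElement(entry, "field", name=key)
                field.text = value
        return entry

    def xml_to_xhtml(entry):
        """Stand-in for the real XML-to-XHTML5 transformation step."""
        rows = "".join(f'<tr><th>{f.get("name")}</th><td>{f.text or ""}</td></tr>'
                       for f in entry.findall("field"))
        return ('<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml">'
                f"<body><table>{rows}</table></body></html>")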

Each census entry page will present a standardized tabular view of the data from the census record, with explanations to clarify differences between the census datasets. Each page will have a condensed single-line title capturing its essential data, used for display in search results and listing pages. Values for datapoints such as nationality, language and so on will be constrained by the schema, and all pages will be validated during the build process. Detailed diagnostics (Holmes and Takeda 2019b) will expose inconsistencies and errors in the dataset. Metadata will be encoded in the head of each HTML page and used to create search filters, but the text of the pages will also be indexed for a new full-text search feature (the old site search allows only metadata filters).
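
To illustrate both the condensed titles and the filter metadata, a build step might emit a page head like the one below. Expressing filters as classed <meta> elements follows staticSearch's general convention; the particular filter names, title format and record fields are our own invention:

    def head_metadata(rec):
        """Emit a condensed single-line title plus classed <meta>
        elements from which search filters can be built."""
        title = f"{rec['name']} ({rec['census']} census)"
        metas = [
            f"<title>{title}</title>",
            f'<meta name="Census" class="staticSearch_desc" content="{rec["census"]}"/>',
            f'<meta name="Occupation" class="staticSearch_desc" content="{rec.get("occupation", "")}"/>',
        ]
        return "<head>" + "".join(metas) + "</head>"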

We fully expect the conversion to succeed, but we also know that we will be pushing the practical limits of this approach, and will need to devise optimizations and workarounds to make the site, and particularly the static search engine, usable. Our paper will report on this work, and present recommendations on strategies and limitations for creating static resources on this scale.
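
One reason for cautious optimism about scale is the general technique underlying staticSearch: rather than one monolithic index, the inverted index is broken into many small per-token JSON files which the client fetches on demand, so the cost of a query grows with the number of search terms rather than with the size of the collection. A schematic of that technique, not the project's actual code (a real build would stem and sanitize tokens before using them as file names):

    import json
    from collections import defaultdict
    from pathlib import Path

    def build_index(pages, outdir="site/idx"):
        """pages maps page id -> extracted text. Writes one small
        JSON postings file per token."""
        postings = defaultdict(list)
        for page_id, text in pages.items():
            for token in set(text.lower().split()):
                postings[token].append(page_id)
        out = Path(outdir)
        out.mkdir(parents=True, exist_ok=True)
        for token, ids in postings.items():
            (out / f"{token}.json").write_text(json.dumps(ids))

    # At query time, client-side JavaScript fetches idx/<term>.json for
    # each search term and intersects the resulting id lists, so a
    # 150,000-page collection costs no more per query than a small one.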

Bibliography

Arneil, Stewart, Martin Holmes, and Greg Newton. 2019. “Clearing the Air for Maintenance and Repair: Strategies, Experiences, Full Disclosure; Paper Three: Ruthless Principles for Digital Longevity.” Presented at Digital Humanities 2019, Utrecht, the Netherlands.

Dombrowski, Quinn. 2019. “Sorry for all the Drupal: Reflections on the 3rd anniversary of ‘Drupal for Humanists.’” Quinn Dombrowski (blog), November 8, 2019.

Goddard, Lisa. 2018. “The Endings Project @ UVic: Concluding, Archiving, and Preserving Digital Projects for Long-Term Usability.” Presented at @Risk North 2: Digital Collections, Montreal, Canada.

Holmes, Martin. 2017. “Selecting Technologies for Long-Term Survival.” Presented at the SHARP Conference 2017: Technologies of the Book, Victoria, BC, Canada.

Holmes, Martin. 2021. “Using ODD for HTML.” The Journal of the Text Encoding Initiative. Text Encoding Initiative Consortium.

Holmes, Martin, and Joseph Takeda. 2019a. “The Prefabricated Website: Who needs a server anyway?” Presented at the Text Encoding Initiative Conference, Graz, Austria.

Holmes, Martin, and Joseph Takeda. 2019b. “Beyond Validation: Using Programmed Diagnostics to Learn About, Monitor, and Successfully Complete Your DH Project.” Digital Scholarship in the Humanities. Oxford University Press/EADH.

Holmes, Martin, and Joey Takeda. 2020a. “Static Search: An Archivable and Sustainable Search Engine for the Digital Humanities.” Presented at the Digital Humanities Summer Institute (DHSI) Colloquium (#VirtualDHSI).

Holmes, Martin, and Joey Takeda. 2020b. “Nine Projects, One Codebase: A Static Search Engine for Digital Editions.” Presented at the COLLABORATION Digital Humanities Conference, University of British Columbia / online.

Nowviskie, Bethany, and Dot Porter. 2010. “Graceful Degradation: Results of the Survey.” Presented at Digital Humanities 2010, King’s College London.

Smithies, James, Carina Westling, Anna-Maria Sichani, Pam Mellen, and Arianna Ciula. 2019. “Managing 100 Digital Humanities Projects: Digital Scholarship & Archiving in King’s Digital Lab.” Digital Humanities Quarterly 13, no. 1.

Stanger-Ross, Jordan. 2008. “Citystats and the History of Community and Segregation in Post-WWII Urban Canada.” Journal of the Canadian Historical Association 19, no. 2: 3-22.
