Resolving South Asian Orthographic Indeterminacy In Colonial-Era Archives

One challenge of archival research in colonial-era Indian print archives is orthography. A substantial number of Indian newspapers produced under British colonial rule have now been digitized and are accessible through services such as Readex’s “South Asian Newspapers” archive, the Digital Library of India, the Panjab Digital Library, and others.
Within the English-language material, searchability is limited, in large part because of idiosyncratic choices made by editors and authors in rendering words from South Asian languages in Roman script. Thus, the pioneering feminist doctor whose name is usually rendered as “Rukhmabai” by present-day scholars was quite often represented as “Rukmabai,” “Rukmibai,” and “Rukhmibai” in English-language newspapers from the British colonial era. The Roman renderings of Bengali-language names such as “Chatterjee” and “Tagore” show similar indeterminacy (“Chatterjee” could be rendered in Indian print archives as “Chatterji,” “Chaterjee,” or “Chattopadhyay”; “Tagore” could be “Thakur”).
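To make the search problem concrete, here is a minimal sketch of one generic workaround: fuzzy string matching with Python's standard-library difflib. It is not the method proposed in this abstract and is not drawn from any of the archives named above. The function name fuzzy_variants, the sample index, and the 0.75 cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Spellings quoted in the text, plus one unrelated name as a control.
index_terms = ["Rukmabai", "Rukmibai", "Rukhmibai", "Tagore"]

def fuzzy_variants(query, terms, cutoff=0.75):
    """Return indexed terms whose similarity ratio to `query` is at least `cutoff`.

    The cutoff is an illustrative assumption, not a tuned value.
    """
    return get_close_matches(query, terms, n=len(terms), cutoff=cutoff)

# Look up the present-day scholarly spelling against the historical spellings.
variants = fuzzy_variants("Rukhmabai", index_terms)
print(variants)
```

Character-level similarity of this kind helps with near-identical spellings, but it would likely not link pairs such as “Tagore” and “Thakur” or “Chatterjee” and “Chattopadhyay,” which differ too much on the surface. That limitation is one reason to look at tools built for spelling variation.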
The orthographic indeterminacies extend beyond the rendering of authors’ names; the issue arises especially in the representation of South Asian vowel forms (“i” vs. “ee”; “u” vs. “oo”), aspirated consonants (“d” vs. “dh”; “t” vs. “th”; “b” vs. “bh”), and labials (“b” vs. “v”). Because these archives tend to have simple search features without intelligent spelling correction, searching for topics of historical interest (“sati” or “satee” or “suttee”?) can return highly incomplete results.
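The alternations listed above can be sketched as a rule-based search key, so that spellings differing only by those alternations fall into the same group. This is a hypothetical illustration, not the Corella approach: the names RULES, search_key, and group_variants, the choice of canonical forms, and the rule order are all assumptions.

```python
import re
from collections import defaultdict

# Alternations named in the text, each collapsed to one arbitrary canonical
# form. The canonical choice and the rule order are assumptions.
RULES = [
    (r"ee", "i"),   # vowel forms: "i" vs. "ee"
    (r"oo", "u"),   # vowel forms: "u" vs. "oo"
    (r"dh", "d"),   # aspirated vs. unaspirated consonants
    (r"th", "t"),
    (r"bh", "b"),
    (r"v", "b"),    # labials: "b" vs. "v"
]

def search_key(word):
    """Lowercase a word and apply each rule in turn to get a comparison key."""
    key = word.lower()
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key

def group_variants(words):
    """Group words that share the same search key, preserving input order."""
    groups = defaultdict(list)
    for word in words:
        groups[search_key(word)].append(word)
    return dict(groups)

groups = group_variants(["sati", "satee", "suttee"])
print(groups)
```

Under these rules “sati” and “satee” share a key, but “suttee” does not, because the “a” vs. “u” difference and the doubled “t” are not among the listed alternations. Hand-written rules of this kind also risk false matches on ordinary English words containing “th” or “v,” which suggests why a tool trained on attested variants may be preferable.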
Finally, orthographic indeterminacy can be an issue within and across South Asian languages themselves. “V” sounds in the Punjabi language, for instance, are frequently pronounced and spelled with “b” or “bh” in Hindi. The “ā” vowel sound common in many north Indian languages is rendered as “ɒ” (that is to say, a soft “o” sound) in Bengali.
A possible solution to the South Asian orthographic indeterminacy problem might be found by adapting tools developed by digital humanists in Early Modern studies. A team at the University of Newcastle, Australia, led by pioneering DH scholar Hugh Craig, has developed a tool called Corella, which is designed to help resolve orthographic indeterminacies in early modern English corpora (Craig 2010). Here, we propose to use a limited corpus drawn from an existing archive of texts by British authors in India (the Kipling family) as well as a selection of Indian authors (the aforementioned Rukhmabai and several others). We will aim to train Craig’s Corella tool to work with Roman-script renderings of Indian languages rather than with early modern English orthography. This will allow us to address linguistic indeterminacies of the kind described above. Can the searchability of these archives be improved through the use of such tools? What are the prospects of training tools such as Corella to work with larger corpora?

