Linking and Querying Ancient Texts: a case study with three epigraphic/papyrological datasets

Gabriel Bodard; Tobias Blanke; Mark Hedges

The OGSA-DAI (Open Grid Services Architecture Data Access
and Integration, http://www.ogsadai.org.uk/)
project supports the exposure of data resources,
such as relational or XML databases, on to grids. Various interfaces are provided and many database management
systems are supported, with a particular view to
querying, transforming and delivering data in different
ways via a simple toolkit for developing client applications.
OGSA-DAI is designed to be extensible, so users
can provide their own additional functionality.
Colleagues at the Edinburgh Parallel Computing Centre
and the Centre for e-Research at KCL have been funded
to carry out a small case study applying the OGSA-DAI
platform to three datasets of ancient texts in different
formats. The Heidelberger Gesamtverzeichnis der
griechischen Papyrusurkunden Ägyptens (HGV) is a
collection of metadata (largely bibliographic, geographical,
and dating) for 65000 Greek papyri from Egypt,
stored in a large FileMaker Pro database. Projet
Volterra is a database of legal texts from the Roman
empire, currently in the low tens of thousands but very
much in progress, stored in a series of themed tables in
MS Access. The Inscriptions of Aphrodisias (IAph) is
a corpus of just under 2000 ancient Greek inscriptions
from a single city in Asia Minor, published in TEI XML.
These collections span roughly the same period - the first
five centuries or so of the Roman Empire - and also overlap
in terms of places and people, although their contents
are otherwise quite different. The provision of an integrated
view would thus be fruitful for the researcher. A
particularly challenging issue being investigated is that
of handling different levels of uncertainty in temporal
data: some dates are extremely precise – even to the day
– whereas many others are very vague – perhaps to a
span of 50 or 100 years.
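
One way to compare dates of such different precision is to treat every
date as an inclusive range and to test ranges for overlap. The following
is a minimal sketch of that idea, not the method used by the project: the
DateRange class and its field names are hypothetical, and dates are
simplified to whole years (negative for BCE), although the source data can
be precise to the day.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DateRange:
    """Inclusive range of years (hypothetical structure; negative = BCE)."""
    not_before: int
    not_after: int

    def __post_init__(self):
        # Reject ranges whose bounds are the wrong way round.
        if self.not_before > self.not_after:
            raise ValueError("not_before must not exceed not_after")

    def overlaps(self, other: "DateRange") -> bool:
        # Two inclusive ranges overlap when each starts before the other ends.
        return (self.not_before <= other.not_after
                and other.not_before <= self.not_after)

    def span(self) -> int:
        # Number of years covered; a rough measure of how vague the date is.
        return self.not_after - self.not_before + 1


# Illustrative values only, not taken from any of the three datasets.
precise = DateRange(212, 212)   # dated to a single year
vague = DateRange(200, 300)     # dated only to a century
later = DateRange(301, 350)
```

Under this representation a precisely dated text and a vaguely dated one
can be matched by the same test, and span() could be used to rank matches
so that more precise dates are listed first.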
These datasets are all freely available in one form or another,
and the scholars who own the databases are happy
for us to re-use them in this way and publish the results
of our aggregation and federated querying. In an ideal
world, of course, we should not have to seek permission
from the owners at all in order to re-use and re-purpose
their published data. The IAph texts are all published under
a Creative Commons-Attribution licence (CC-BY),
so re-use is not only permitted but encouraged (in fact
Bodard is one of the authors of this dataset, but in any
case we can use these texts for anything we like without
asking or even informing the authors so long as we attribute
the original material to the copyright holders). A
transformation of the HGV data into EpiDoc XML has
likewise been published under CC-BY, although it is
the master database that interests us for this project, and
that is not publicly available in its raw form (although an
HTML version is online and free). There is also a free,
web-available version of the Volterra data (although the
website is down at time of writing), but the database itself
was acquired for this project with the permission of
the editors.
As mentioned above, the contents of these three datasets
vary quite widely, but there is sufficient overlap to enable
a certain amount of cross-database searching to be
feasible, at least as a proof of principle. For instance,
although the Volterra database specifically addresses legal
texts, it contains some papyri and thus possibly references
to places that also occur in the HGV metadata.
Likewise, although the Volterra texts do not include any
inscriptions from Aphrodisias, there may be attestations
of persons that appear in both the Volterra and IAph texts
(especially in the late antique period, which is where the
Aphrodisias material is most richly annotated). IAph and
HGV do not directly share any content, but the categories
that are used to organize the texts have a certain overlap,
for example letters, decrees, honours, contracts. As mentioned
above, all three datasets overlap fairly closely in
date, and have similar (but not identical) mechanisms for
recording dates, date-ranges, periods, and uncertain dating.
Cross-corpus search in all of these areas or combinations
of them should test the OGSA-DAI software and
demonstrate the validity and usefulness of this approach.
OGSA-DAI is considered to be a standard for database
integration in Grid environments, which enable virtualisation
and sharing of resources via the Internet, as well
as in a purely web-service environment. Until now, the
OGSA-DAI technology has been used mainly to provide
integrated views of relational databases with different
schemas, and the LaQuAT demonstrator will, to begin
with, use it in this way with the two database resources,
HGV and Projet Volterra. Subsequently the work will be
extended to integrate the IAph XML files, providing
an integrated view over the three structured data
resources. The project will also produce significant enhancements
to the OGSA-DAI software, specifically in
its handling of XML resources, which is currently more
restricted than its features for database integration. OGSA-DAI will then support integration of multiple database and XML data resources. The LaQuAT demonstrator will
use a recent extension to OGSA-DAI called OGSA-DQP,
which is a service-based distributed query processor,
to produce queries across these data resources.
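
The idea of an integrated view over databases with different schemas can
be illustrated with a small, self-contained sketch. It uses Python's
standard sqlite3 module and does not use or imitate the OGSA-DAI or
OGSA-DQP interfaces. The table names, column names and rows are all
invented for illustration and do not reflect the real HGV or Projet
Volterra schemas.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create two toy tables with different schemas and one unified view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Hypothetical papyrus metadata: dates stored as a year range.
        CREATE TABLE papyri (
            id TEXT, provenance TEXT, year_from INTEGER, year_to INTEGER);
        -- Hypothetical legal texts: a single year and a differently named
        -- place column.
        CREATE TABLE legal_texts (ref TEXT, place TEXT, date_year INTEGER);

        -- Invented toy rows.
        INSERT INTO papyri VALUES ('P1', 'Oxyrhynchus', 200, 250);
        INSERT INTO papyri VALUES ('P2', 'Arsinoe', 300, 350);
        INSERT INTO legal_texts VALUES ('L1', 'Oxyrhynchus', 212);
        INSERT INTO legal_texts VALUES ('L2', 'Rome', 320);

        -- The view maps both schemas onto one shared set of columns.
        CREATE VIEW all_texts AS
            SELECT 'papyri' AS source, id AS text_id, provenance AS place,
                   year_from AS not_before, year_to AS not_after
            FROM papyri
            UNION ALL
            SELECT 'legal_texts', ref, place, date_year, date_year
            FROM legal_texts;
    """)
    return conn


def texts_at(conn: sqlite3.Connection, place: str, year: int):
    """Return (source, text_id) pairs attested at a place in a given year."""
    rows = conn.execute(
        "SELECT source, text_id FROM all_texts "
        "WHERE place = ? AND not_before <= ? AND not_after >= ? "
        "ORDER BY source, text_id",
        (place, year, year),
    )
    return rows.fetchall()


conn = build_demo()
matches = texts_at(conn, "Oxyrhynchus", 212)
```

Here a single query over the view returns matching rows from both toy
tables, while the underlying tables are left unchanged. In a federated
setting the sources would remain on separate servers rather than in one
in-memory database.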
One project output will thus be an openly available
demonstrator allowing an integrated view over these
three datasets. However, the resources selected are just examples from among numerous others to which the LaQuAT approach could be applied. In the fields of archaeology and classics alone, there are numerous datasets, often small and isolated, that would be of great utility if the information they contained could be integrated. Four points to note about many of these resources are that:
• Formats are very diverse. The databases rarely follow
standardised database schemas, so typically
any two schemas will be different. Moreover, use of
mark-up can vary significantly, particularly in older
resources created before much effort had been directed towards
standardisation (such as EpiDoc), but stylistic variation
may occur even when standards are applied.
• Resources are not easily available for use; they may
be locked away on local or departmental machines, or
“published” on a website in a way that is not particularly
usable by a researcher.
• Even when a resource is available it is often available
only in isolation. Many of these resources may
be regarded as fragments of a larger picture, with
vastly more value if researchers could access this
larger picture rather than just the parts.
• Resources may be owned by different communities
and subject to different rights; the scholars who created
them may be unwilling to accept anything that
affects the integrity of the original resources. Consequently,
any integration initiative must respect this
autonomy and integrity, if it is to be successful.
The ability to link up such diverse data resources, in
a way that respects the original data resources and the
communities responsible for them, is a pressing need
among humanities researchers. The LaQuAT project is
developing a software demonstrator utilising a small set
of resources in a particular discipline; however, the solution
developed will have a lifespan beyond the initial
project and will provide a framework into which other
researchers will be able to attach resources of interest,
thus building up a critical mass of related material whose
utility as a research tool will be significantly greater than
that of the sum of its parts. We see this project as providing
an opportunity to start building a more extensive
e-infrastructure for advanced research in the (digital) humanities.
Once humanities scholars are persuaded of the
feasibility of this approach, there are many other datasets,
in France, Italy, Germany and the US, among others,
which could be exploited in such a way, building up a
critical mass of material that will enable new connections
to be made. The data-silo mentality could be gently
undermined once scholars can see their own construct
as remaining identifiable, while at the same time greatly
enriched. The infrastructure will be sustained initially by
King’s College London and the UK National Grid Service
(NGS), and subsequently as part of the European
infrastructure being developed by the DARIAH project
funded by the EU FP7 programme.

Full text license: This text is republished here with permission from the original rights holder.

ADHO - 2009