This paper details work that has been conducted
through a collaborative project between JSTOR,
the Frick Collection and the Metropolitan
Museum of Art. This work, funded by the Andrew W. Mellon Foundation, aimed to understand how auction catalogs can best be
preserved for the long term and made most
easily accessible for scholarly use. Auction
catalogs are vital for provenance research as
well as for the study of art markets and the
history of collecting. Initially a set of 1604
auction catalogs, over 100,000 catalog pages,
was digitised – these catalogs date from the 18th through the early 20th century.
An auction catalog is a structured set of records
describing items or lots offered for sale at an
auction. The lots are grouped into sections, such as works by a particular artist; the sections are in turn grouped into a particular sale, the actual event that took place in the sale room; and these sales are grouped together in the auction catalog. The auction catalog
document also generally includes handwritten
marginalia added to record other details about
the actual transaction such as the sale price and
the buyer.
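
To make this catalog, sale, section, lot hierarchy concrete, a minimal Python sketch of the data model follows; the class and field names are illustrative assumptions, not the project's actual schema.

    from dataclasses import dataclass, field
    from typing import List, Optional

    # Illustrative model of the catalog > sale > section > lot hierarchy;
    # all names are assumptions, not the project's actual schema.

    @dataclass
    class Lot:
        number: str                  # lot number as printed in the catalog
        description: str             # OCR text describing the item offered
        price: Optional[str] = None  # handwritten marginalia: sale price
        buyer: Optional[str] = None  # handwritten marginalia: buyer's name

    @dataclass
    class Section:
        heading: str                 # e.g. works by a particular artist
        lots: List[Lot] = field(default_factory=list)

    @dataclass
    class Sale:
        date: str                    # the actual event in the sale room
        sections: List[Section] = field(default_factory=list)

    @dataclass
    class Catalog:
        title: str
        sales: List[Sale] = field(default_factory=list)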
A repository was constructed – this holds and
provides access to page images, optical character
recognition (OCR) text and database records
from the digitised auction catalogs. In addition, a website was created that provides public access
to the catalogs and automatically generated
links to other collections. This site offers the
ability to search and browse the collection and
allows users to add their own content to the site.
When searching, a user may be interested in only a single item within a larger catalog; to facilitate searching, therefore, the logical structure of the catalog needs to be determined so that the catalog can be segmented into items. The catalogs
are extremely variable in structure, format and
language, and there are no standard rules that
can divide the catalog into the lots, sections and
sales. Therefore, machine-learning techniques
are used to generate the segmentation rules
from a number of catalogs that have been
marked up by hand. These rules are then applied
to classify and segment the remaining catalogs.
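
As an illustration of the hand mark-up such a learner is trained on, each line of catalog text can be paired with the entity type a human assigned to it; the lines and label names below are invented for the example.

    # Invented example of a hand-marked catalog: each OCR line is
    # paired with the entity type assigned to it by a human annotator.
    labelled_lines = [
        ("CATALOGUE OF A VALUABLE COLLECTION OF PICTURES", "catalog"),
        ("FIRST DAY'S SALE, MONDAY, 12TH MAY",             "sale"),
        ("WORKS BY REMBRANDT",                             "section"),
        ("1  Portrait of an old man     £210  Smith",      "lot"),
        ("2  Study of a head            £95   Colnaghi",   "lot"),
    ]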
The focus of this paper is the research and creation of a system to automatically process digitised auction catalog documents, with the aim of automatically segmenting and labelling the entities within each document and creating a logical structure for it. The catalogs
processed are in an XML format produced
from physical documents via an OCR process.
The segmentation and assignment of entity types will facilitate deep searching, browsing,
annotation and manipulation activities over the
collection. The ability to automatically label
previously unseen documents will enable the
production of other large-scale collections where
the hand labelling of the collection content is
highly expensive or unfeasible.
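
The paper does not specify the OCR XML schema, so the sketch below assumes a simplified, hypothetical layout in which page elements contain line elements; it extracts the sequence of text lines that later stages segment and label.

    import xml.etree.ElementTree as ET

    def extract_lines(path):
        """Yield (page_number, line_text) pairs from an OCR XML file.

        Assumes a hypothetical layout:
            <document><page n="1"><line>...</line>...</page></document>
        The real OCR schema used in the project is not specified here.
        """
        tree = ET.parse(path)
        for page in tree.getroot().iter("page"):
            for line in page.iter("line"):
                text = (line.text or "").strip()
                if text:
                    yield page.get("n"), text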
The catalog, sale, section, lot model requires
that the content of the document be distributed
between these entities, which are themselves
distributed over the pages of the document.
Each line of text is assigned to a single entity,
whole entities may be contained within other
entities (a logical hierarchy), and a parent
entity may generate content both before and
after its child entities in the text sequence.
This hierarchical organisation differentiates
the problem of automatically labelling auction
catalog document content from other semantic
labelling tasks such as Part of Speech labelling
(Lafferty et al., 2001) or Named Entity
Recognition (McCallum and Li, 2003). In these
tasks the classes or states can be thought of
as siblings in the text sequence, rather than
as having hierarchical relationships. Hence, the
digitisation of auction catalog documents may
require a different set of procedures to that
applied to, for example, the digitisation of
magazine collections (Yacoub et al., 2005) or
scholarly articles (Lawrence et al., 1999).
Although a particular document model is
assumed throughout this work, the theory
and tools detailed can be applied to arbitrary
document models that incorporate hierarchical
organisation of entities.
Techniques that are successfully applied
to other Natural Language Processing and
document digitisation tasks may be applied
to this problem. Specifically, we have
developed task-appropriate feature extraction
and normalisation procedures to produce
parameterisations of catalog document content
suitable for use with statistical modelling
techniques. The statistical modelling technique
applied to these features, Conditional Random
Fields (CRFs) (Sutton and McCallum, 2007),
models the dependence structure between
the different states (which relate to the
logical entities in the document) graphically.
A diagrammatic representation of the auction
catalog document model is given in figure 1a
and an example of a model topology that might
be derived from it is given in figure 1b. It should
be noted that CRFs are discriminative models,
rather than generative models like Hidden Markov Models (HMMs), a
property that may be advantageous when such
models are applied to NLP tasks (Lafferty et al.,
2001).
Figure 1: Document model for an auction catalog (a: document data) and a simple Finite State Transducer topology derived from it (b: FST transition grammar)
Figure 2: FST transition grammar extended to
incorporate start and continue states for each entity type
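
The paper does not name a CRF implementation; as a minimal sketch of the pipeline, the example below uses the third-party sklearn-crfsuite library, parameterising each line as a feature dictionary. The features shown (lot numbers, currency marks, capitalised headings) are plausible guesses at useful cues, not the project's actual feature set.

    import sklearn_crfsuite  # third-party library: pip install sklearn-crfsuite

    def line_features(line):
        # Illustrative feature extraction and normalisation for one line;
        # these cues are guesses, not the project's actual feature set.
        tokens = line.split()
        first = tokens[0] if tokens else ""
        return {
            "first_token_is_digit": first.rstrip(".").isdigit(),
            "all_caps": line.isupper(),
            "contains_currency": any(c in line for c in "£$"),
        }

    # Tiny invented training set: one catalog, one label per line.
    lines = ["WORKS BY REMBRANDT", "1  Portrait of an old man  £210"]
    labels = ["section", "lot"]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit([[line_features(l) for l in lines]], [labels])

    # Label the lines of a previously unseen catalog.
    print(crf.predict([[line_features("2  Study of a head  £95")]]))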
The application of such techniques to hierarchically structured documents requires the logical structure of a document to be recoverable from a simple sequence of state labels. The basic transition grammar shown in figure 1b is not appropriate for this task, as it is impossible to differentiate consecutive lines of a single entity from consecutive lines belonging to two entities of the same type, and the relationships between entities cannot be recovered. The transition grammar is therefore extended to incorporate start and continue states for each entity type, as shown in figure 2.
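
To illustrate why start and continue states make the logical structure recoverable, the sketch below (with invented label names such as lot_start and lot_continue) rebuilds the hierarchy from a flat label sequence: a start label opens a new entity at its level in the catalog > sale > section > lot order, while a continue label returns to the open entity of that type, which also allows a parent entity to contribute content after its child entities.

    # Recover the logical hierarchy from one label per line. Assumes a
    # well-formed sequence that never continues an entity before it starts.
    DEPTH = {"catalog": 0, "sale": 1, "section": 2, "lot": 3}

    def build_tree(labelled_lines):
        root = {"type": "catalog", "lines": [], "children": []}
        stack = [root]
        for text, label in labelled_lines:
            etype, action = label.rsplit("_", 1)   # e.g. "lot_start"
            if action == "start":
                # close any open entity at the same or a deeper level
                while DEPTH[stack[-1]["type"]] >= DEPTH[etype]:
                    stack.pop()
                node = {"type": etype, "lines": [], "children": []}
                stack[-1]["children"].append(node)
                stack.append(node)
            else:
                # "continue": return to the open entity of this type, so a
                # parent can generate content after its child entities
                while stack[-1]["type"] != etype:
                    stack.pop()
            stack[-1]["lines"].append(text)
        return root

    tree = build_tree([
        ("FIRST DAY'S SALE",         "sale_start"),
        ("WORKS BY REMBRANDT",       "section_start"),
        ("1  Portrait of a man",     "lot_start"),
        ("   signed and dated 1632", "lot_continue"),
        ("2  Study of a head",       "lot_start"),
    ])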