The Ghost in the Manuscript: Hyperspectral Text Recovery and Segmentation

paper
Authorship
  1. 1. Patrick Shiel

    Maynooth University (National University of Ireland, Maynooth)

  2. 2. John G. Keating

    Maynooth University (National University of Ireland, Maynooth)

  3. 3. Malte Rehbein

    National University of Ireland, Galway (NUI Galway)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

The condition of medieval manuscripts ranges from
those that are fully legible to those which can only
be read in part, and their legibility is determined by
the manner in which they were preserved and treated
throughout the ages. In some cases deterioration is due
to processes such as fading or staining; in others, the
text may have been interfered with in some way. For
instance, in the Irish context, the oldest surviving manuscript
written entirely in Irish, Leabhar na hUidhre (12th
century) was subject to part-erasure and rewriting by a
scribe who was active at some point between the 12th
and the 14th centuries. The Yellow Book of Lecan (circa
1400) was overwritten in part by an antiquarian scholar
active in the 18th century, while the use of chemical
agents to enhance readings by scholars working in the
19th century has had a long-term detrimental effect on a
number of other manuscripts. Two central research questions
of interest to all scholars, therefore, are:
• To what extent is text recovery in key medieval texts
possible using current technologies, e.g. palimpsest,
fading, deliberate removal?
• To what is it possible to establish irrefutable scientific
evidence for interpretation of questioned documents,
e.g., identify the different hands (inks)?
In this paper, we illustrate how to provide answers to
these questions using modern scientific techniques and
emerging forensic technology, i.e. hyperspectral imaging
[1] and associated image processing techniques [2].
The acquisition of a hyperspectral scanner by An Foras Feasa (AFF), a Forensic XP-4010, has presented an
opportunity to subject damaged or illegible texts to a
modern scientific re-examination. The scanner has the
potential to read various different layers of a manuscript
in a manner not possible to the human eye and to analyse
elements of its composition. As such it presents the
possibility of retrieving text that has been lost through
fading, staining, overwriting or other forms of erasure.
In addition, it offers the prospect of distinguishing different
ink-types, and furnishing us with details of the
manuscript’s composition, all of which are refinements,
which can be used to answer questions about date and
provenance. This process marks a new departure for
the study of manuscripts and may provide answer many
long-standing questions posed by palaeographers and by
scholars in a variety of disciplines. Furthermore, through
text retrieval, it holds out the prospect of adding considerably
to the existing corpus of texts and to providing
very many new research opportunities for coming generations
of scholars. In this introductory paper on hyperspectral
imaging we concentrate on two key processes:
text recovery and text segmentation.
Methodology
The investigative and analytic methods described here
are based on a novel and highly specialised technique
called Hyperspectral Imaging. Hyperspectral imaging
is a non-destructive optical technique that measures reflectance
(fraction of light reflected) characteristics of
a document with high spatial and spectral resolution.
An hyperspectral imaging device records a sequence
of digital images of the selected manuscript area (with
maximum dimensions 50 mm x 50 mm) illuminated
with monochromatic light from a tuneable light source
from 350 nm (near-UV) through the entire visible range
and up to 2400 nm (infrared). The value of each image
pixel in the recorded image sequence represents an accurate
measurement of the reflectance curve for a tiny,
13 micron square, area on the document. Analysis of all
spectral curves, essentially a cube of information, provides
information about the physical characteristics of
questioned manuscripts.
Hyperspectral imaging, together with modern twodimensional
spectrum software and three-dimensional
image and visualisation software, provides modern
researchers working in the field of historic documents
analysis with opportunities for forensic examination that
were heretofore unavailable. Methodologically, there are
two main fields of applications of this technique: (i) the
extraction of relevant historic, diplomatic and palaeographic
information from documents and (ii) the investigation
of the impact of environmental conditions on document
condition and of degradation effects on writing
materials and substrates. In particular, reflectance curves
found in different sections of the manuscripts can be
compared with each other in order to determine whether
different types of inks had been used during text composition
or to identify modifications that occurred during
the manuscripts’ history. Light spectroscopy analyses
may also be conducted to aid recovery and segmentation.
Fluorescence occurs when an object emits a high
wavelength (low energy light) following illumination by
a shorter wavelength (higher energy light) due to molecular
absorption of part of the incident light. Furthermore,
the spectral curves may be compared with those in
international databases containing typical ink spectra to
determine and date the kind of ink or pigment used. The
image cube recorded using the technique may be used to
enhance the visibility of hidden material such as palimpsest
or erased text.
This methodology for manuscript analysis is of significant
interest to archivists. For example, The National
Archive of the Netherlands in The Hague recently commissioned
the development of a dedicated hyperspectral
imaging instrument specifically designed to analyze
archival documents and to monitor their aging process
during storage and exhibitions. Furthermore, at the 16th
ICA (International Congress on Archives) Congress in
Kuala Lumpur from 21–27 July 2008, two special sessions
were devoted to the application of hyperspectral
imaging for historic document analysis and its use in
providing quantitative, objective criteria for balancing
the competing interests of document preservation and
presentation. This proposal is therefore informed by current
international research and best practice in document
analysis and textual imaging.
Example: Text Recovery of hidden text
Figure 1(a) shows a 16C book cover that has become
degraded with time and contains mould in places. The
interior cover’s structure, shown in Figure 1(b) consists
of an underlying text which has been pasted over with
a clean blank faced sheet of paper. Using fluorescence
spectroscopy, i.e, light induced fluorescence in the pages
it is possible to reveal the underlying text as shown in
Figure 1(c) and 1(d). Here, we will describe the underlying
fluorescence and reflectance spectroscopy principles
and techniques used to reveal hidden text of this nature,
and in several other examples. A complete description of
the techniques used for this particular recovery is given
in another paper in this conference session.
Example: Text Segmentation of three
different inks
The general objectives of cursive text segmentation include tasks such as word spotting, text/image alignment,
authentication and extraction of specific fields [3]. An
important step associate with all of these tasks document
segmentation into text lines. In general, this is diffcult
due of the low quality and the complexity of these documents,
and automatic text line segmentation is an open
research field. In general, sophisticated image processing
of single-image documents is the norm [3].Here we
describe recent approach to the segmentation problem,
which we refer to as hyperspectral segmentation; the
technique is based on the separation and segmentation
of differing inks by recording and analysing the reflectance
properties of different inks. This technique is particularly
useful for the segmentation of texts that have
been edited by various authors over a long time period.
It is also possible to date the editions by comparing the
segments with known dates, or using repositories containing
hyperspectral properties of different materials
(e.g. inks, paper, etc.). Figure 2, below, shows the results
of the technique when used to segment some text with
two later annotations (giving three primary additions or
segments). Three different inks are used, although the
all appear similar in visible light, as shown in Figure
2(a). Figure 2(b) shows the reflectance curves for the
two annotations and how it is possible, using appropriate
image processing to colour-code the different inks.
Figure 2(c)demonstrates how it is possible to foreground
the original text, and Figure 2(d) demonstrates that is is
possible to isolate, or segment, the annotations. A complete
description of the potential for these hyperspectral
segmentation techniques for dating and segmenting the
Göttingen kundige bok 2 is given in another paper in this
conference session.
Conclusion
The recovery of illegible or deleted text, the analysis and
dating of ink-types, and the minute analyses of scribal
hands, segmentation of cursive test, are activities in
palaeographic and manuscript studies, and their subsequent
textual encoding. This abstract describes how hyperspectral
imaging can be used to perform quality text
recovery, segmentation and dating of historical documents.
Our paper will provide a complete overview of
the acquisition (reflectance spectroscopy, fluorescence)
, computational and segmentation techniques used for
(i) a 16C pastedown cover, and (ii) a three-ink example
typical of that found in Göttingen kundige bok 2.
References
1. Chang, C.I.: Hyperspectral Imaging: Techniques for
Spectral Detection and Classification. Kluwer Academic,
New York, N. Y. (2003) . 2. Chang, C.I.: Hyperspectral Imaging: Signal Processing
Algorithm Design and Anal¬ysis. John Wiley and
Sons, New York, N. Y. (2007)
3. Likforman-Sulem, L., Zahour, A., Taconet, B.: Text
line segmentation of historical documents: a survey. International
Journal on Document Analysis and Recognition
9(2-4) (April 2006) 123–138.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

176 works by 303 authors indexed

Series: ADHO (4)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None