The Reading Database of Syllable Structure

Authorship
  1. Erik Fudge

    Department of Linguistic Science - University of Reading

  2. Linda Shockey

    Department of Linguistic Science - University of Reading


Every description of a language includes statements of vowel and consonant inventories. Databases to deal with such matters (e.g. UPSID) have
been in existence for some time.
Some language descriptions also include phonotactic statements, i.e. statements of what sequences
and other combinations may or may not occur in words of the language.
Many treatments, however, restrict themselves to stating inventories of vowels and consonants, and possibly
tones and/or accents. Even so, phonotactic statements can
in fact be made for all languages.
1. Syllable-structure
The content of phonotactic statements varies
greatly in detail from language to language: simple
v. complex branching, small v. large inventories.
The first author has been collecting relevant data,
and formulating such statements where possible,
for a number of years (we now have information
for 200+ languages). In spite of the variation between languages, it soon became clear that a common general phonotactic framework can be set up
(Fudge, 1969, 1987) and stored in computer files.
The aim of our database, then, is to set up such a
framework for phonotactic statements for as many
languages as possible. The most important units
for establishing this framework are syllables.
Terminology for parts of syllable structure has
been worked out: Onset, Rhyme, Peak (or Nucleus), Coda. Phrase-structure relationships analogous to those of syntax are recognised between
and within these parts; in fact, phrase-structure
rules can be written to generate the occurring
combinations.
Each structural place may be occupied by one
element selected from the inventory of available
sounds. Rather than a single inventory of sounds,
or even an inventory divided into vowels and
consonants, syllable structure may require different inventories to be available at different places
in the structure. Typically, relations of overlapping or inclusion will hold between these different
inventories (e.g. m, n, w, l, r, j occur in both Initial and
Post-initial for English; for justification of this
duplication, see Fudge, 1969: 274, 280).
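The idea of separate inventories for separate structural places can be sketched in Python. This is only an illustration: the six shared sounds come from the text, but the other members of Initial, the restriction of Post-initial to the shared sounds, and the names INVENTORIES and onset_ok are assumptions made for the example, not the contents of the database.

```python
# Illustrative inventories for two Onset places in English.
# Only the six shared sounds are taken from the text; the rest is a sample.
SHARED = {"m", "n", "w", "l", "r", "j"}
INVENTORIES = {
    "Initial": {"p", "t", "k", "b", "d", "g", "f"} | SHARED,
    "Post-initial": set(SHARED),  # assumption: limited to the shared sounds
}


def onset_ok(sounds):
    """Check an Onset of one or two sounds place by place."""
    if len(sounds) not in (1, 2):
        return False
    places = ["Initial", "Post-initial"][:len(sounds)]
    return all(s in INVENTORIES[p] for p, s in zip(places, sounds))


# The two inventories overlap in exactly the shared sounds.
overlap = INVENTORIES["Initial"] & INVENTORIES["Post-initial"]
```

Because each place is checked independently, this sketch still accepts sequences such as l + l; co-occurrence restrictions between places are not modelled here.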
2. Above the Syllable: the Word
Syllable-structure is, of course, only part of the
story: there are larger units, particularly the word
(Fudge, 1969: 258f), which are needed for a full
statement of the constraints. As yet this has not
been incorporated in the database. Some examples
of word-based phonotactic statements would be:
(a) Some Coda phenomena are restricted to word-final position.
(b) Stressed syllables may permit bigger vowel
inventories than unstressed syllables.
(c) Corresponding parts of successive syllables
may exhibit constraints for or against co-occurrence (e.g. vowel harmony).
(d) Some Codas may need to refer to the Onset of
the following syllable.
3. Problems
Three main problem areas are worth special attention:
(a) form of source descriptions;
(b) systematic or accidental gap?;
(c) status of loanwords.
Each will be discussed further in a subsection.
3.1 Form of source descriptions
In spite of our claim that phonotactic statements
can be made for all languages, there is no guarantee that any descriptive article(s) on which the
phonotactic description is to be based will be cast
in anything like the same form as the ideal phonotactic statements. Where a description contains no
statement at all of limitations on what sequences
of consonants may occur, word-by-word inspection of data cited becomes necessary to establish
such statements. In order to maintain consistency
of descriptive format, it has therefore been necessary to devise a standard method for proceeding
from source material to standard format.
3.2 Systematic or accidental gap?
It is sometimes also necessary to reach a decision
on whether absence of some combination represents a systematic gap or merely an accidental gap.
For example, English words can begin with /st/ or
/tw/: is the absence of /stw-/ systematic or accidental? For all other consonants X and Y, if /sX-/ and
/XY-/ are both possible, then /sXY-/ is also possible - this suggests the absence of /stw-/ is accidental (perhaps due to the low frequency of /tw-/).
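The reasoning about /stw-/ can be tried out on a small list of onsets. The list below is a hand-picked sample for illustration (one character per phoneme), not data from the database, and the function name unexplained_gaps is hypothetical.

```python
# Hand-picked sample of English onsets, one character per phoneme.
ONSETS = {"sp", "st", "sk", "pl", "pr", "tr", "tw", "kr", "kw",
          "spl", "spr", "str", "skr", "skw"}


def unexplained_gaps(onsets):
    """Return /sXY-/ clusters that are missing although /sX-/ and /XY-/ occur."""
    gaps = []
    for sx in sorted(o for o in onsets if len(o) == 2 and o[0] == "s"):
        x = sx[1]
        for xy in sorted(o for o in onsets if len(o) == 2 and o[0] == x):
            if "s" + xy not in onsets:
                gaps.append("s" + xy)
    return gaps


print(unexplained_gaps(ONSETS))  # → ['stw']
```

On this sample, /stw-/ is the only cluster for which the implication fails, which is the pattern that suggests an accidental rather than a systematic gap.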
3.3 Loanwords
The status of “loanwords” is always difficult.
Some loans are immediately “assimilated” to the
patterns of the “borrowing” language, e.g. any
word beginning with /st/ borrowed into Spanish
will split this unpermitted cluster by prefixing /e/,
in effect putting the /s/ into a separate syllable of
its own. Others, however, cause new phonotactic
patterns to arise, e.g. many languages of the Philippines permit only single-consonant Onsets, but
have imported loanwords from Spanish and English with clusters /pl-/, /tr-/ etc.: at what stage of
the loaning process can we conclude that the
language has developed branching Onsets?
So much for the theoretical background and the
aims of this database. What of the methods by
which these aims have been achieved?
4. Preparation
Two aspects of this are discussed:
(a) the format of language files, and
(b) the database software used.
4.1 Language Files
Information about permissible syllable structures
in over 200 languages was gathered from linguistic literature as described above. These languages
were from a wide variety of language families and
from all over the globe. Information was initially
stored on filing cards as formulae of the type
normally used by linguists. The phonemic inventory of each language was noted as well. These
cards were then used as the basis for language files
which were entered on a computer. Only 191 of
these files were created, as complete information
was not available in all cases. Entering new
languages is an ongoing and open-ended task.
4.2 Database Software
Commercially-available database software is designed to create tables and to perform arithmetic
operations on the material stored in the rows and
columns. Queries about whether something is present in the database are answered by a simple
lookup procedure. Clearly this form of data representation is not amenable to the storage of formulae. In using the latter it is necessary to ascertain
whether a sequence or structure which is being
sought can be generated by the syllable grammar,
i.e. by expanding the formula.
For example, to know whether the syllable [qi] is
possible in a language or language family, one has
to find out whether these phonemes are present in
the inventory and whether each of them is possible
in combination with the other and in that order,
as well as making sure that the CV structure is
permitted. ([q] might, for example, not be permitted syllable-initially or before front vowels, even
if it does exist in the phonemic inventory).
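The three checks just listed (inventory membership, co-occurrence in the given order, and a permitted CV shape) can be sketched as follows. The Language record, its field names, and the toy language are assumptions made for the example; they do not describe the database's actual file format.

```python
from dataclasses import dataclass, field


@dataclass
class Language:
    inventory: set   # all phonemes of the language
    vowels: set      # the subset of the inventory that counts as V
    shapes: set      # permitted syllable shapes, e.g. {"CV"}
    banned_pairs: set = field(default_factory=set)  # (C, V) pairs that do not co-occur


def cv_possible(lang, c, v):
    """True if consonant c followed by vowel v is a possible CV syllable."""
    in_inventory = c in lang.inventory and v in lang.inventory
    right_classes = c not in lang.vowels and v in lang.vowels
    return (in_inventory and right_classes
            and "CV" in lang.shapes
            and (c, v) not in lang.banned_pairs)


# Toy language: [q] exists but is not permitted before the front vowel [i].
toy = Language(inventory={"q", "k", "i", "a"}, vowels={"i", "a"},
               shapes={"CV"}, banned_pairs={("q", "i")})
```

Here cv_possible(toy, "q", "a") holds while cv_possible(toy, "q", "i") does not, even though both phonemes are in the inventory.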
It would, of course, have been possible to expand
each grammar ourselves and put the resulting list
in the database. The expansion could be done automatically by computer, and, as storing and searching
are becoming daily easier, it would provide a
viable but brute-force solution. Generating all possible syllables would also be a practical way to
avoid writing rules for co-occurrence restrictions.
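A minimal sketch of this brute-force alternative, assuming a formula is given as a sequence of slots, each with its own inventory and a flag saying whether the slot is optional. The function expand and the toy formula (C) V (C) are illustrative only.

```python
from itertools import product


def expand(slots):
    """Expand a formula given as a list of (inventory, optional) slots."""
    choices = [sorted(inv) + ([""] if optional else [])
               for inv, optional in slots]
    return {"".join(combo) for combo in product(*choices)}


# Toy formula (C) V (C) over two consonants and two vowels.
C = {"p", "t"}
V = {"a", "i"}
syllables = expand([(C, True), (V, False), (C, True)])
print(len(syllables))  # → 18
```

Looking up a syllable is then a plain set-membership test, but the list grows multiplicatively with the size of each inventory, which is why this counts as a brute-force solution.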
In the end, we decided to use software which
worked with syllable grammars, even though it
may be more difficult. We wished to take advantage of linguistic knowledge in the system, using
what is known about natural classes and universal
phonological constraints. The program which was
used is one which was loaned to us for research
purposes by Bird, Ellison, and Klein of the University of Edinburgh Centre for Cognitive
Science. This is based on finite-state automata as
described in Bird and Ellison (1994), and matches
input against a grammar written in regular expressions. If there is a nonempty intersection between
the structure specified in the query and the grammar, the program reports a success, otherwise a
failure.
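The success criterion, a nonempty intersection between query and grammar, can be illustrated with a small product construction over deterministic automata. This is not the Edinburgh program, whose internals are not described here; the automaton encoding and the function name intersection_nonempty are assumptions for the sketch.

```python
from collections import deque


def intersection_nonempty(a, b):
    """a, b: automata as (start, accepting_states, transitions), where
    transitions maps (state, symbol) -> state; missing entries reject."""
    (sa, fa, ta), (sb, fb, tb) = a, b
    seen = {(sa, sb)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if p in fa and q in fb:
            return True  # some string is accepted by both automata
        for (state, sym), p2 in ta.items():
            if state == p and (q, sym) in tb:
                nxt = (p2, tb[(q, sym)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False


# Toy grammar C V (C) and two queries over the class symbols "C" and "V".
grammar = (0, {2, 3}, {(0, "C"): 1, (1, "V"): 2, (2, "C"): 3})
cvc = (0, {3}, {(0, "C"): 1, (1, "V"): 2, (2, "C"): 3})
ccv = (0, {3}, {(0, "C"): 1, (1, "C"): 2, (2, "V"): 3})
```

With these toy automata the query CVC succeeds against the grammar and the query CCV fails, mirroring the success/failure report described above.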
5. Data Entry
At the outset of the project, we were faced with
having to represent the symbols of the International Phonetic Alphabet using an ordinary typewriter keyboard. No commercially-available font
could be found which would allow both for storage and editing of data: we could create files containing non-ASCII symbols within specified programs, but could not search for these symbols using
either another program (such as an editor) or an
operating system. In order that our results be maximally usable by other scientists, we chose to use
the three-number codes suggested by the International Phonetic Association (Esling and Gaylord, 1993).
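As a sketch of how such codes keep the data within plain ASCII, the table below maps three codes to symbols and back. Only the three codes glossed elsewhere in this paper (132 for /s/, 110 for /g/, 103 for /t/) are included; the table and the function names are illustrative, not the project's actual files.

```python
# The three codes glossed in this paper; a full table would cover the whole IPA.
IPA_NUMBER = {"132": "s", "110": "g", "103": "t"}


def decode(codes):
    """Turn a space-separated string of three-digit codes into symbols."""
    return [IPA_NUMBER[c] for c in codes.split()]


def encode(symbols):
    """Turn a sequence of symbols back into space-separated codes."""
    reverse = {sym: num for num, sym in IPA_NUMBER.items()}
    return " ".join(reverse[s] for s in symbols)


print(decode("132 103"))  # → ['s', 't']
```

Since the codes consist of digits only, they can be searched for with any editor or operating-system tool.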
6. Setting up a Database
Before you make a query, you need several kinds
of information entered into files as well as the
software for matching patterns with syllable grammars:
(a) A list of all the symbols you wish to use for
any and all languages, and the classes they fall
into, for example, vowels, high vowels, front
vowels.
This list must include modified symbols: nasalised, breathy, velarised, and all other options count as separate units (i.e. [i] will not
match with nasalised [i] unless a query is very
carefully worded). This makes an enormous
list and accounts to some degree for the
slowness with which the program runs.
(b) A phonemic inventory for each language.
(Both (a) and (b) are stored in the same file, which
contains all the classes you are ever planning to
work with. We will refer to this as the Class File).
(c) A syllable grammar for each language, written
as a regular expression.
Here is the grammar for English:
{ ((132) C (Engel)) V ((Engel1) Engc1 )
(Engf (Engf)) & [English]* & $1}"
where (132) is /s/, C is any English consonant, and
Engel, Engel1, etc. are subclasses of phonemes
which are listed in the Class File. Each grammar
is kept in a separate file, which is named after the
language.
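The role of the Class File in (a), and the point that modified symbols count as separate units, can be sketched as follows. For readability the sketch uses IPA characters rather than number codes; the class names other than H (high vowel, as used in the queries) and the contents of each class are assumptions for illustration.

```python
# Illustrative Class File: class name -> set of symbols.
# Plain [i] and nasalised [ĩ] are listed as separate units.
CLASS_FILE = {
    "V": {"i", "ĩ", "u", "a"},
    "H": {"i", "ĩ", "u"},      # high vowels
    "Front": {"i", "ĩ"},
    "Nasalised": {"ĩ"},
}


def matches(symbol, spec):
    """spec is either a class name from the Class File or a literal symbol."""
    if spec in CLASS_FILE:
        return symbol in CLASS_FILE[spec]
    return symbol == spec
```

A query for the literal [i] therefore does not match nasalised [ĩ], whereas a query worded in terms of a class such as H matches both.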
7. Using the Database
The naming convention for the language grammar
files allows us to recall them using any of several
fields. “French.Romance.IndoEuropean.all”
would be accessed in a query involving any of its
subparts or in a query about all languages in the
database.
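The naming convention lends itself to a simple selection step, sketched here on the assumption that the fields are separated by dots with no spaces. Only the French file name comes from the text; the other two names and the function select are hypothetical.

```python
def select(filenames, field):
    """Return the grammar files whose dot-separated name contains `field`."""
    return [name for name in filenames if field in name.split(".")]


FILES = [
    "French.Romance.IndoEuropean.all",   # from the text
    "Spanish.Romance.IndoEuropean.all",  # hypothetical
    "Finnish.Uralic.all",                # hypothetical
]
```

Selecting on "Romance" returns the first two files, and selecting on "all" returns every file. Abbreviated field names are not handled in this sketch.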
It is possible to enquire about structures, specific
sounds, sounds with a given feature, or any combination of the above. For example:
(a) IE “CVC” looks for CVC structures in all of
the Indo-European languages.
(b) all “110 H” looks for /g/ + high vowel sequences in the
whole database.
(c) Romance “C 103 H” looks for a consonant followed by /t/ + high vowel
in the Romance languages.
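The queries above combine a language scope with a pattern of class names and number codes. A minimal sketch of splitting such a query into its parts is given below; the exact query syntax of the program is not documented here, so the use of straight quotes, single-letter class names, and the function parse_query are all assumptions.

```python
import re


def parse_query(query):
    """Split a query like 'Romance "C 103 H"' into (scope, tokens)."""
    m = re.fullmatch(r'\s*(\S+)\s+"([^"]*)"\s*', query)
    if m is None:
        raise ValueError(f"not a recognisable query: {query!r}")
    scope, pattern = m.groups()
    # Three-digit runs are number codes; single letters are class names.
    tokens = re.findall(r"\d{3}|[A-Za-z]", pattern)
    return scope, tokens


print(parse_query('Romance "C 103 H"'))  # → ('Romance', ['C', '103', 'H'])
```

Each token could then be resolved against the Class File (for class names) or the code table (for numbers) before matching.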
8. Some Minor Problems
(a) We have not yet adequately confronted the
details of phonotaxis, though the broad generalisations are in place. In English, for example, our current grammar gives “sdring” and
“ssming” the stamp of approval, though it
clearly should not. It will be possible to include constraints in the grammars to filter out
false hits. We are somewhat hampered in less
well-known languages by lack of information
about which sequences do not occur.
(b) The program which does the matching of input
with syllable grammars came to us in a compiled form, and this prevents us from making
some changes which we hope might help to
make the program run faster. Searches are at
present very time-consuming.
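The filtering of false hits mentioned in (a) could take the form of a list of banned adjacent pairs applied to an Onset after the grammar has matched. The two pairs below are chosen only to exclude the onsets of “sdring” and “ssming”; they are an illustration (one character per phoneme), not the constraints actually planned for the English grammar.

```python
# Illustrative co-occurrence filter for English Onsets.
BANNED_ONSET_PAIRS = {("s", "d"), ("s", "s")}


def passes_filter(onset):
    """True if no adjacent pair of segments in the Onset is banned."""
    pairs = zip(onset, onset[1:])
    return not any(pair in BANNED_ONSET_PAIRS for pair in pairs)
```

With this filter the onsets "sdr" and "ssm" are rejected while "str" still passes.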
9. Conclusion
We are now in a position to investigate the type of
syllable universal suggested by Greenberg (1965)
and to re-evaluate the work of Hooper (1976) on
the sonority hierarchy of syllable structure as well
as to ask a variety of new questions. We hope to
improve our ability to specify co-occurrence restrictions within the syllable, though this is necessary for only a fairly small subset of the languages
included. The problems we confront at the moment seem more related to implementation than to
content.
Statistics about syllable shapes of the languages in
our database will be presented in the oral version
of the paper.
References
Bird, Steven, and Mark Ellison. 1994. One-Level
Phonology: Autosegmental Representations
and Rules as Finite-State Automata. Computational Linguistics 20,1: 55-90.
Esling, John, and Harry Gaylord. 1993. Computer
Codes for Phonetic Symbols. Journal of the
International Phonetic Association 23:83-97.
Fudge, E.C. 1969. Syllables. Journal of Linguistics 5:253-286.
Fudge, E.C. 1987. Branching Structures within the
Syllable. Journal of Linguistics 23:359-377.
Greenberg, Joseph. 1965. Some Generalisations
Concerning Initial and Final Consonant Sequences. Linguistics 18:5-34.
Hooper, Joan Bybee. 1976. An Introduction to
Natural Generative Phonology. Academic
Press.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1996

Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

Conference website: https://web.archive.org/web/19990224202037/www.hd.uib.no/allc-ach96.html

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC
