CSIC, Spain
1. Polyphemus, a lexicographic database of Greek papyri
At present, there is no way to search the corpus of Greek papyri by lemma or for specific grammatical forms of a word, much less for examples of a grammatical category. Polyphemus addresses these shortcomings, and more.
For this purpose we have processed all the papyrus texts from PapyInfo. This processing runs alongside the processing that produces the Callimachus database, which we also present at this Congress. I summarize below the procedure by which we obtain the Polyphemus database.
A) First we analyze each line of papyrus and differentiate the actual full words from the gaps or non-textual elements.
B) Then we identify the complete words and separate them from the fragments.
C) We then lemmatize each word and determine its part of speech and its morphological analysis. All this is done with the help of the Madrid list, discussed below. For text fragments (incomplete words), we try to determine whether they can be ascribed to a root. We also separate proper nouns from common nouns.
D) Lemma assignment and POS tagging are performed in two phases. In a first pass we tag the forms with the highest frequency of occurrence. We then label all the remaining forms using the Madrid Wordlist.
E) All this information is transferred to an SQL database and linked to the data on the papyri that we obtained when creating the Callimachus database. In this way, for each lexical form we obtain a lemma, a non-disambiguated morphological analysis, and a translation or gloss. Each of these parameters can be searched in combination with the more than fifty categories available to us thanks to Callimachus, such as date, origin, category, extension, subject, etc.
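Steps A and B can be illustrated with a minimal sketch. It is not the project's actual code: the three labels, the rule that any token mixing Greek letters with editorial signs counts as a fragment, and the sample line are all assumptions made for illustration.

```python
import re
import unicodedata

# Characters from the Greek and Greek Extended Unicode blocks, plus combining
# marks (e.g. the underdot editors place under uncertain letters).
GREEK_ONLY = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF\u0300-\u036F]+")
HAS_GREEK = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")


def classify_token(token):
    """Label a whitespace-delimited token as 'word', 'fragment' or 'gap'.

    Hypothetical rule set: no Greek at all means a gap or non-textual
    element; Greek only means a complete word; Greek mixed with brackets
    or other signs means a fragment.
    """
    token = unicodedata.normalize("NFC", token)
    if not HAS_GREEK.search(token):
        return "gap"
    if GREEK_ONLY.fullmatch(token):
        return "word"
    return "fragment"


def classify_line(line):
    """Split a line on whitespace and classify every token."""
    return [(tok, classify_token(tok)) for tok in line.split()]


# Invented sample line, not taken from any real papyrus.
sample = "Ἀπολλώνιος στρατη[γῷ] [- ca.10 -] χαίρειν"
for tok, kind in classify_line(sample):
    print(kind, tok)
```

A real pipeline would also have to handle punctuation, abbreviations and editorial regularizations, which this sketch ignores.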
To date, we have been able to analyze 95% of the complete words, including proper names, which are very numerous.
2. The Madrid Ancient Greek Word List
Lemmatization and part-of-speech (POS) tagging are performed by comparing each record in our database with the records of a word list that we have created over the last 3 years, which we have called the Madrid Ancient Greek Wordlist.
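The lookup, together with the two tagging phases of step D, might be sketched as follows. This is only an illustration under stated assumptions: the `FREQUENT` table (standing in for whatever resource is used in the first pass), the toy `WORDLIST` entries, and the `(lemma, POS, morphology)` tuple layout are hypothetical and do not reproduce the real Madrid Wordlist format.

```python
import unicodedata
from collections import Counter


def nfc(text):
    """Normalize to NFC so identical forms always compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical first-pass table covering only very frequent forms.
FREQUENT = {nfc("καί"): [("καί", "conjunction", "")]}

# Hypothetical wordlist excerpt: form -> all possible analyses.
WORDLIST = {
    nfc("λόγον"): [("λόγος", "noun", "acc sg masc")],
    nfc("τοῦ"): [("ὁ", "article", "gen sg masc"), ("ὁ", "article", "gen sg neut")],
}


def tag_forms(tokens, frequent, wordlist):
    """Tag forms in two passes and report token-level coverage."""
    counts = Counter(nfc(t) for t in tokens)
    tagged = {}
    # Pass 1: frequent forms, taken in decreasing order of frequency.
    for form, _ in counts.most_common():
        if form in frequent:
            tagged[form] = frequent[form]
    # Pass 2: every remaining form is looked up in the wordlist.
    for form in counts:
        if form not in tagged and form in wordlist:
            tagged[form] = wordlist[form]
    analysed = sum(n for form, n in counts.items() if form in tagged)
    coverage = analysed / sum(counts.values()) if counts else 0.0
    return tagged, coverage


tagged, coverage = tag_forms(["καί", "τοῦ", "λόγον", "καί", "ξξξ"], FREQUENT, WORDLIST)
print(f"{coverage:.0%} of tokens analysed")
```

All analyses of an ambiguous form are kept, in line with the non-disambiguated morphological analysis described above.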
Most Ancient Greek wordlists are evolutions, simplifications, or improvements of the Morpheus list; Morpheus is a "rule-based morphological analyzer". Our list also starts with Morpheus, but has been enriched with our own treebank (cf. Riaño 2006); the digital versions of the Greek-English Lexicon of Liddell-Scott-Jones (LSJ) and of Bailly; and about 100,000 proper names from The Lexicon of Greek Personal Names and the Trismegistos repository of papyrological and epigraphic resources. All these data were processed to obtain morphological information: I have automatically generated the Attic and Ionic paradigms for each nominal entry in LSJ and Bailly.
The lemmas are assigned a translation, or rather a gloss, mainly from the Greek-English Lexicon of Liddell-Scott-Jones and S.C. Woodhouse's English-Greek Dictionary.
3. Polyphemus interface
Polyphemus can be consulted online. It currently contains about 4,600,000 words from Ancient Greek papyri. POS tagging and lemmatization allow the user to query the database for any morphological feature, lemma, or translation. Because these data can be combined with the data on the formal content of the papyri provided by the sister database Callimachus, the database can be queried using more than 80 search criteria.
Since both the original readings and editorial regularizations are preserved, the researcher can use Polyphemus to search for phonetic or morphological features of the papyri. Some searches that can be performed using Polyphemus are the following:
a) Texts containing a Greek word that translates as “poison”, “medicine”, “praetor”, “water”, etc.
b) Texts from Elephantine, dated between the 2nd century BC and the 3rd century AD, in which a given lemma (word) appears in a specific grammatical form.
c) All adjectives in the accusative plural, or the optative of verbs in -μι, in all texts.
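Searches of this kind could be expressed against a relational schema roughly as sketched below. The table and column names, the encoding of dates as signed years (negative for BC), and the toy rows are all assumptions for illustration; they do not describe the actual Polyphemus or Callimachus schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE papyri (id INTEGER PRIMARY KEY, name TEXT, provenance TEXT,
                     year_from INTEGER, year_to INTEGER);
CREATE TABLE tokens (id INTEGER PRIMARY KEY, papyrus_id INTEGER REFERENCES papyri(id),
                     form TEXT, lemma TEXT, pos TEXT, gram_case TEXT,
                     number TEXT, gloss TEXT);
""")
# Toy rows invented for this sketch; they are not real papyrological data.
conn.executemany("INSERT INTO papyri VALUES (?, ?, ?, ?, ?)", [
    (1, "toy-1", "Elephantine", -150, -100),
    (2, "toy-2", "Oxyrhynchus", 150, 200),
])
conn.executemany(
    "INSERT INTO tokens (papyrus_id, form, lemma, pos, gram_case, number, gloss) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)", [
        (1, "φάρμακον", "φάρμακον", "noun", "acc", "sg", "medicine"),
        (1, "καλούς", "καλός", "adjective", "acc", "pl", "beautiful"),
        (2, "ὕδωρ", "ὕδωρ", "noun", "nom", "sg", "water"),
        (2, "ἀγαθάς", "ἀγαθός", "adjective", "acc", "pl", "good"),
    ])


def search_tokens(conn, provenance=None, year_from=None, year_to=None, **features):
    """Return (papyrus name, form) pairs matching token and papyrus filters."""
    allowed = {"lemma", "pos", "gram_case", "number", "gloss"}
    clauses, params = [], []
    for column, value in features.items():
        if column not in allowed:
            raise ValueError(f"unknown token feature: {column}")
        clauses.append(f"t.{column} = ?")
        params.append(value)
    if provenance is not None:
        clauses.append("p.provenance = ?")
        params.append(provenance)
    if year_from is not None:
        clauses.append("p.year_from >= ?")
        params.append(year_from)
    if year_to is not None:
        clauses.append("p.year_to <= ?")
        params.append(year_to)
    where = " AND ".join(clauses) or "1 = 1"
    sql = ("SELECT p.name, t.form FROM tokens t "
           "JOIN papyri p ON p.id = t.papyrus_id "
           f"WHERE {where} ORDER BY t.id")
    return conn.execute(sql, params).fetchall()


# (a) texts with a word glossed as "water"
print(search_tokens(conn, gloss="water"))
# (b) a given lemma, in a given form, from Elephantine, 2nd c. BC to 3rd c. AD
print(search_tokens(conn, lemma="καλός", gram_case="acc", number="pl",
                    provenance="Elephantine", year_from=-200, year_to=300))
# (c) all adjectives in the accusative plural
print(search_tokens(conn, pos="adjective", gram_case="acc", number="pl"))
```

Values are passed as bound parameters and column names are checked against a fixed set, so user input never reaches the SQL string directly.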
Bibliography
Bohnet, Bernd and Joakim Nivre 2012 “A Transition-Based System for Joint Part-of-Speech Tagging and Labeled Non-Projective Dependency Parsing” EMNLP-CoNLL, pp. 1455-1465 [https://aclanthology.org/D12-1133]
Celano, Giuseppe G.A., Gregory Crane and Saeed Majidi 2016 “Part of Speech Tagging for Ancient Greek” Open Linguistics 2, pp. 393–399 [https://doi.org/10.1515/opli-2016-0020]
Crane, Gregory 1991 “Generating and Parsing Classical Greek” Literary and Linguistic Computing 6:4, pp. 243–245 [https://doi.org/10.1093/llc/6.4.243]
Riaño Rufilanchas, Daniel 2006 El complemento directo en griego antiguo. Anejos de Emerita, XLVII. Madrid: CSIC
In review
Tokyo, Japan
July 25, 2022 - July 29, 2022
Held in Tokyo and remote (hybrid) on account of COVID-19
Conference website: https://dh2022.adho.org/
Contributors: Scott B. Weingart, James Cummings
Series: ADHO (16)
Organizers: ADHO