Extracting bilingual dictionaries from corpora can be seen as a very fine-grained alignment process, where the aligned units are not paragraphs or sentences but words and phrases. Most approaches to this problem rely on statistical means to build translation lexicons from bilingual texts, and they fall roughly into two categories: the hypothesis-testing approach and the estimation approach. Each type of approach has its pros and cons, some of which are discussed in (Hiemstra 1997).
Our method does not fit neatly into either of these two categories, but it is closer in spirit to the hypothesis-testing approach: it first generates a list of translation equivalence candidates and then extracts, in several steps, the most probable translation-equivalence pairs. The candidate list is constructed from the aligned translation units (TUs).
2 A baseline and the iterative algorithm
The input data for the extraction algorithm is a sentence-aligned parallel text (bi-text) with each word in both parts of the bi-text lemmatized and tagged. The algorithm is sensitive to tagging, lemmatization and alignment errors, but these affect only the recall, not the precision. We showed that using a tiered-tagging approach with combined language models (Tufiş 2000; Tufiş et al. 2000) one can construct, from clean training corpora, high-accuracy taggers (close to 99%). For many languages, lemmatization following accurate POS tagging is also very accurate, and sentence aligners that do an almost perfect job (at least for some types of texts) are not rare. Therefore, the prerequisites of our algorithm are not difficult to meet.
Based on the alignment, the first step is to compute a list of translation equivalence candidates (TECs). This list (TECL) contains several sub-lists, one for each POS considered in the extraction procedure. The TECs are generated by the Cartesian product of the sets of tokens (of the same POS) in the two parts of each TU. Finally, the pairs generated from all TUs are uniquely sorted, each attached to the number of occurrences of the respective association. The core of the baseline algorithm is a selection procedure, which prunes the TECs considered to have occurred just by chance. This hypothesis may be tested with different statistical tests (we used various measures such as chi-square, Dice, mutual information, log-likelihood and left-sided Fisher). One can also require a minimal number of occurrences for a <TS, TT> pair (usually, as in our case, 3).
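The candidate-generation step described above can be sketched as follows. This is a minimal Python illustration, not the authors' Perl implementation; the data layout, the function names and the frequency-threshold handling are our assumptions.

```python
from collections import Counter
from itertools import product

def build_tecl(aligned_tus):
    """Build the translation equivalence candidate list (TECL).

    `aligned_tus` is assumed to be a list of aligned translation units,
    each a pair of token lists [(lemma, pos), ...] for source and target.
    Returns one Counter per POS, mapping (src_lemma, tgt_lemma) pairs to
    their association counts across all TUs.
    """
    tecl = {}  # POS -> Counter of candidate pairs
    for src_tokens, tgt_tokens in aligned_tus:
        # Group each side's lemmas by part of speech.
        src_by_pos, tgt_by_pos = {}, {}
        for lemma, pos in src_tokens:
            src_by_pos.setdefault(pos, set()).add(lemma)
        for lemma, pos in tgt_tokens:
            tgt_by_pos.setdefault(pos, set()).add(lemma)
        # The Cartesian product of same-POS token sets yields the TECs.
        for pos in src_by_pos.keys() & tgt_by_pos.keys():
            counter = tecl.setdefault(pos, Counter())
            for pair in product(src_by_pos[pos], tgt_by_pos[pos]):
                counter[pair] += 1
    return tecl

def prune_rare(tecl, min_count=3):
    """Drop candidates occurring fewer than `min_count` times."""
    return {pos: Counter({p: n for p, n in c.items() if n >= min_count})
            for pos, c in tecl.items()}
```

Counting by POS keeps each sub-list small, which matters later when the contingency tables are built per POS.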
Our iterative algorithm is also very simple, but significantly faster than the baseline. The algorithm takes as input the aligned parallel corpus and the maximum number of iterations. At each iteration step, the pairs that pass the selection (see below) are removed from TECL, so the list is shortened after each step and may eventually be emptied. From TECL, a contingency table (TBLk) is constructed for each POS, as shown in Table 1. The rows of the table are indexed by the distinct source lemmas and the columns by the distinct target lemmas (of the same POS).
Table 1: TBLk - Contingency table with counts for TECs at step k
Each cell (i,j) contains the number of occurrences nij of the <TSi, TTj> pair. The marginal counts are ni* = Σj nij, n*j = Σi nij and n** = Σi Σj nij.
Equation (EQ1), nij ≥ nip for all p ≠ j and nij ≥ nqj for all q ≠ i, expresses the selection condition and represents the key idea of the extraction algorithm. In order to select, at step k, the pair <TSi, TTj> as a translation equivalence pair (TEP), the number of associations of TSi with TTj must be higher than (or at least equal to) its number of associations with any other TTp (p ≠ j), and vice versa. If TSi is translated in more than one way, the remaining translations will be found in subsequent steps (if frequent enough).
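The selection condition and the iterative pruning can be sketched as follows; again an illustrative Python sketch under our own data-layout assumptions (candidate counts for one POS held in a dictionary), with hypothetical function names rather than the authors' code.

```python
from collections import Counter

def select_teps(candidates):
    """One application of the selection condition (EQ1).

    `candidates` maps (src, tgt) pairs to counts n_ij (one POS at a time).
    A pair <TSi, TTj> is selected iff n_ij is maximal both in row i
    (over all targets TTp) and in column j (over all sources TSq).
    """
    row_max, col_max = {}, {}
    for (s, t), n in candidates.items():
        row_max[s] = max(row_max.get(s, 0), n)
        col_max[t] = max(col_max.get(t, 0), n)
    return {(s, t) for (s, t), n in candidates.items()
            if n == row_max[s] and n == col_max[t]}

def extract(candidates, max_steps=4):
    """Iterate: select TEPs, remove them from the candidate list,
    and repeat until the list is exhausted or the step limit is hit."""
    candidates = Counter(candidates)
    teps = []
    for _ in range(max_steps):
        selected = select_teps(candidates)
        if not selected:
            break
        teps.extend(selected)
        for pair in selected:
            del candidates[pair]  # shorten TECL for the next step
    return teps
```

Removing the selected pairs before recomputing the row and column maxima is what lets alternative translations of the same source lemma surface in later steps.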
We conducted experiments on one of the few publicly available multilingual aligned corpora, namely the "1984" multilingual corpus (Dimitrova et al. 1998), containing six translations of the English original. This corpus was developed within the Multext-East project, published on a CD-ROM (Erjavec et al. 1998) and recently improved within the CONCEDE project (to be released soon on CONCEDE's homepage: www.itri.brighton.ac.uk/projects/concede/). Each monolingual part of the corpus (Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene) was tokenised, lemmatised, tagged and sentence-aligned to the English hub.
Table 2: The lemmatized monolingual "1984" overview (per language: no. of distinct words, no. of lemmas, no. of lemmas with more than 2 occurrences*)
* the number of lemmas does not include interjections, particles, residuals
The number of lemmas in each monolingual part of the multilingual corpus, as well as the number of lemmas that occurred more than twice, are shown in Table 2. For validation purposes we set the step limit of the algorithm to 4. The evaluation was done fully for Estonian, Hungarian and Romanian and partially for Slovene (the first step was fully evaluated, while randomly selected pairs were evaluated from the remaining steps). The algorithm outputs a list of TEPs and up to 10 examples of translation units in which the respective pair appeared. The figures in Table 3 show the results and the evaluation. The precision was computed as usual, but the recall (Rec*) was computed as the number of correct TEPs divided by the number of lemmas in the source language with more than 3 occurrences.
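The two evaluation measures reduce to simple ratios; the following small sketch fixes the definitions (the function name is ours, and the denominator in the usage comment is a hypothetical illustration, not a figure from the paper).

```python
def precision_and_rec_star(correct_teps, extracted_teps, frequent_src_lemmas):
    """Precision as usual: correct pairs over extracted pairs.

    Rec* divides the correct TEPs by the number of source-language
    lemmas above the occurrence threshold, a proxy for true recall
    when no gold-standard lexicon is available.
    """
    precision = correct_teps / extracted_teps
    rec_star = correct_teps / frequent_src_lemmas
    return precision, rec_star

# Illustration with the unlimited Ro-En run reported below
# (2765 extracted pairs, 113 wrong; 3000 is a made-up denominator):
p, r = precision_and_rec_star(2765 - 113, 2765, 3000)
```

With the reported Ro-En figures, 2652/2765 indeed gives the 95.91% precision quoted in the text.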
As one can see from the figures in Table 3, the precision is higher than 98% for Romanian and Slovene, almost 97% for Hungarian and more than 96% for Estonian. The recall (our Rec*) ranges from 50.92% (Slovene) to 63.90% (Estonian). We also ran the extractor on the Ro-En bi-text without imposing a step limit. The program stopped after 25 steps with 2765 extracted pairs, out of which 113 were wrong. The precision decreased to 95.91%, but the recall improved significantly, to 86.89%.
Table 3: The results after 4 iteration steps and partial evaluation
In an initial version of this algorithm we used a chi-square test (as in the baseline algorithm) to check the selected TEPs. Since the selection condition (EQ1) is very powerful, the vast majority of the selected TEPs passed the chi-square test, while many pairs that passed the chi-square threshold did not pass condition (EQ1); we therefore eliminated the time-consuming statistical tests.
The extraction program is written in Perl and runs on practically any platform (Perl implementations exist not only for UNIX/Linux but also for Windows and MacOS). Table 4 shows the running time (Pentium III/600 MHz with 96 MB RAM) for each bi-text in the "1984" corpus.
Table 4: Extraction time (in seconds) for each of the bilingual lexicons
The 6 bilingual lexicons as well as a 7-languages lexicon are available at www.racai.ro/~tufis.
A quite similar approach to ours (also Perl-based) is presented in (Ahrenberg et al. 1998) and (Ahrenberg et al. 2000). The languages considered in their experiments are English and Swedish. For a novel about half the length of Orwell's "1984", their algorithm needed 55 minutes on an UltraSPARC 1 with 320 MB RAM, and the best results reported are 96.7% precision and 54.6% recall. For a computer manual containing about 45% more tokens than our corpus, their algorithm needed 4.5 hours, with the best results being 85.6% precision and 67.1% recall. Unlike us, they do not rely on tagging and lemmatisation, although they use a "morphology" module that performs stemming and grouping of inflectional variants. This strategy makes it possible to link low-frequency source expressions belonging to the same suffix paradigm. The obvious advantage of not using POS categorisation is that their approach can identify TEPs where the POS of the source token differs from the POS of the target token. The disadvantage is that the search space is extremely large. An explanation of the much better response time in our case, besides not using statistical tests (they use the t-test), is that our search space is probably several orders of magnitude smaller.
4 Conclusions and further work
We presented a simple but very effective algorithm for extracting bilingual lexicons, based on a word-to-word model of translation equivalence. If a language-specific tokeniser is able to recognise and "pack" compounds in a pre-processing phase, the 1:1 model will obviously discover multi-word translation equivalences. The MULTEXT tokeniser, for instance, allows recognition of generic multi-word expressions (dates, literally expressed numbers) or of specific ones based on external resources containing lists of compounds, proper names and idiomatic expressions. If the compounds cannot be dealt with in the tokenising pre-processing phase, one may consider either extending the bilingual lexicon extractor's model to an explicit N:M paradigm or using monolingual tools to recognise the compounds in both languages. We are currently considering both options. For the first, we have started implementing a new language-independent tokeniser, including statistical collocation recognition (we use S. Banerjee's and T. Pedersen's BSP - Bigram Statistical Package). For the second option, we are carrying out preliminary experiments with a version of the program presented in this paper. Conceptually, the modified version of the program can be seen as receiving the same text as source and target input file, with all the sentence alignments being 1:1.
Ahrenberg L, Andersson M, Merkel M, 1998. A Simple Hybrid Aligner for Generating Lexical Correspondences in Parallel Texts. In Proceedings of COLING'98, Montreal, pp 29-35.
Ahrenberg L, Andersson M, Merkel M, 2000. A knowledge-lite approach to word alignment, in Véronis J (ed), Parallel Text Processing. Text, Speech and Language Technology Series, Kluwer Academic Publishers, pp 97-116.
Dimitrova L, Erjavec T, Ide N, Kaalep H, Petkevic V, Tufiş D, 1998. Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and East European Languages. In Proceedings of COLING'98, Montreal, pp 315-319.
Erjavec T, Lawson A, Romary L, 1998. East Meet West: A Compendium of Multilingual Resources. TELRI-MULTEXT EAST CD-ROM, 1998, ISBN: 3-922641-46-6.
Hiemstra D, 1997. Deriving a bilingual lexicon for cross language information retrieval. In Proceedings of Gronics, pp 21-26.
Tufiş D, 2000. Using a Large Set of Eagles-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging. In Proceedings of LREC2000, Athens, pp.1105-1112.
Tufiş D, Dienes P, Oravecz C, Váradi T, 2000. Principled Hidden Tagset Design for Tiered Tagging of Hungarian. In Proceedings of LREC'2000, Athens, pp 1421-1426.
Hosted at New York University, New York, NY, United States, July 13-16, 2001.