Linguistic Data Consortium - University of Pennsylvania
"Low Density" languages are those for which relatively few computational resources are available. Not being majority languages, they have received less attention in resource collection. But increasingly, substantial resources can be found on the Internet for many intermediate-sized and even small-sized languages.
At the Linguistic Data Consortium (LDC), we are creating basic language 'resource kits' for a number of low density languages. Each such resource kit is to include monolingual texts (in a standard encoding), bilingual texts (in English and the target language), a lexicon (which may be a simple bilingual word list), and where relevant, a morphological parser.
Building a resource kit for a particular language involves locating appropriate resources on the web, downloading them, converting the HTML-tagged text to a form more useful for the end purpose (which typically means stripping out the HTML tags and tokenizing the text), converting the text to a standard encoding where necessary, merging multiple resources, and assigning metadata.
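The tag-stripping and tokenization step can be sketched with the Python standard library. This is a minimal illustration, not our production pipeline; the crude regular-expression tokenizer in particular would need language-specific rules:

```python
# Minimal sketch of the download-and-clean step: strip HTML tags,
# then tokenize the remaining text.
import re
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of a page, ignoring tags and scripts."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

def strip_and_tokenize(html):
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    # Crude whitespace/punctuation tokenization; a real pipeline
    # needs language-aware rules (clitics, punctuation, digits).
    return re.findall(r"\w+", text, re.UNICODE)

# Usage (URL is a placeholder):
# tokens = strip_and_tokenize(
#     urllib.request.urlopen(url).read().decode("utf-8"))
```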
In this paper, I describe the first step in this process. Specifically, I focus on collection techniques which have proven useful--and some which have not worked as well as we had hoped. We are using these techniques on languages ranging from Hindi (360 million+ speakers) to Chechen (one million speakers). I have also experimented with the use of search techniques on much smaller languages (Tzeltal, a Mayan language with 200 thousand speakers; and Shuar, a language of Ecuador with 30 thousand speakers), with surprisingly fruitful results.
For text resources and some kinds of lexicons, one of the most useful search techniques is to enter a few common words in the desired language into a search engine, such as Google. Under good circumstances, this can result in excellent recall and precision. Unfortunately, circumstances are not always good; problems include:
Multiple encodings (Hindi);
Dialectal variation (Quechua);
Non-standard spellings or orthographies (Chechen);
Short words (which tend to result in false hits in other languages);
Web pages which are purely graphical, with no searchable text (some Burmese sites); and
The existence of closely related languages, which can greatly lower the precision of the search, particularly where the related language has a larger web presence than the target language (hits for Bahasa Indonesia are returned in much greater numbers than hits for the related language Aceh).
Another circumstance which might be thought to make this kind of search work poorly--strongly inflected languages--turns out not to be as bad as one might think, in part because nearly all such languages have some words which do not inflect. However, if the uninflected words are uniformly short (as often happens with pre-/post-positions), inflection can still be problematic. Alternatively, one can 'OR' together a few inflected forms of a single word in the search query.
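The 'OR' trick amounts to a trivial query builder. A sketch follows; the word forms shown are illustrative possessive inflections of the Quechua word 'wasi' (house), not terms taken from our searches:

```python
# Hypothetical helper that joins several inflected forms of one word
# into a single OR query, as suggested above.
def or_query(forms):
    # Quote each form so the engine matches it exactly.
    return " OR ".join('"%s"' % form for form in forms)

# Illustrative inflected forms of Quechua 'wasi' (house):
print(or_query(["wasiy", "wasiyki", "wasin"]))
# → "wasiy" OR "wasiyki" OR "wasin"
```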
There are several ways of obtaining common words for search purposes. The simplest is to bring up an already found web page in the language, copy a few words, and paste them into a search engine. If one does not know the language in question (this is common in our work at the LDC), it may not be obvious which words are the most common, but a crude heuristic is to copy the shortest words.
Another way to find seed terms is to key in a few words from a printed dictionary (assuming that the encoding is simple, or that one has the language-specific keyboard installed on the computer). The best method is to use a simple computer program to tokenize one (or better, several) already-found web pages and sort the resulting word list by frequency.
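The frequency-sorting method can be sketched in a few lines of Python; this is an illustration of the idea, not the program we use:

```python
# Sketch of the frequency-based seed-word method: tokenize a few pages
# known to be in the target language and rank word tokens by count.
import re
from collections import Counter

def seed_words(pages, n=10):
    """Return the n most frequent word tokens across the given texts."""
    counts = Counter()
    for text in pages:
        counts.update(w.lower() for w in re.findall(r"\w+", text))
    return [word for word, _ in counts.most_common(n)]
```

Short function words typically dominate such a list, which suits the search heuristic described earlier.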
As mentioned, searching for terms in the target language does not work well if web pages in the language do not use a standard encoding. Roman writing systems are reasonably standardized, although issues still arise when the orthography includes non-ASCII characters--even such trivial things as accented characters. Likewise, although there are several competing encodings for Cyrillic orthographies, they tend to be fairly well documented, so that one can transliterate search terms from one encoding to another.
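For Cyrillic, the re-encoding step is mechanical once both encodings are identified. A minimal sketch, using two common legacy Cyrillic encodings (the Russian word is an arbitrary example):

```python
# Re-encode a Cyrillic search term from one legacy encoding to another.
# KOI8-R and Windows-1251 are two of the competing, but well-documented,
# Cyrillic encodings; the example word is arbitrary.
term_koi8 = "слово".encode("koi8_r")  # term as KOI8-R bytes
term_cp1251 = term_koi8.decode("koi8_r").encode("cp1251")
# term_cp1251 can now be used to search sites serving cp1251 pages.
```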
But very real problems with encodings surface with many languages which have non-Roman (and non-Cyrillic) orthographies, including languages of India, Ethiopia and Eritrea, and southeast Asia. While Unicode is intended to solve the problem, the fact is that there are multiple competing encodings for many of these languages, and they are often completely undocumented. The result is that there is at present no simple solution for searching across multiple web sites in languages with the multiple encoding problem.
For some low density languages, few if any "ordinary" web pages exist. This may be because the country where the language is spoken discourages its use (non-official languages of Indonesia), or because there is very little Internet use in the country (many countries of Africa), or because the language lacks a standardized writing system (regional varieties of Arabic). Nevertheless, other sorts of language resources can sometimes be found for such languages; these may include weblogs (Aceh, a language of Indonesia), chat rooms (Arabic 'dialects'), web pages maintained by expatriate speakers (Rwandan), and text collections by other researchers (Mayan languages).
Among the techniques which seem obvious, but which turn out not to work well, is searching for websites according to the country suffix (e.g. '.tr' for Turkey). This results in both poor recall (since many of the websites we have found for low density languages are not hosted in the country where that language is primarily spoken) and poor precision (because most countries have websites in multiple languages, usually including English or some other regionally dominant language).
Another technique which we have tried is to search for the name of the language. While this may result in many hits, precision is often poor, leaving an overwhelming task of sifting through the results. Combining the language name with other search terms (e.g. 'dictionary' or 'lexicon') yields fewer hits, but is still likely to return many uninteresting links. Additionally, languages often have multiple names; the Ethnologue (http://www.ethnologue.com) lists seven names for Cebuano (a Philippine language), and our initial search missed an on-line lexicon in this language because the web page title used one of the alternative names.
The Open Language Archives Community (OLAC, http://www.language-archives.org/) is intended to be a repository of the sort of resources we are looking for. Other language resource aggregators include Seven Tones (http://linguistlist.org/~zheng/7tones/), the Yamada Language Center (http://babel.uoregon.edu), Rosetta (http://www.rosettaproject.org), and the collections at many Internet search engines (such as Google and Yahoo). However, significant resources--particularly text resources--are missed by all these sites. We view our search techniques as supplements to these portals, and intend that our results will be made available for inclusion in OLAC's database.
Hosted at Göteborg University (Gothenburg)
Gothenburg, Sweden
June 11, 2004 - June 16, 2004
Conference website: http://web.archive.org/web/20040815075341/http://www.hum.gu.se/allcach2004/