Alignment and Browsing of the English-Norwegian Parallel Corpus

Knut Hofland; Jarle Ebeling

Authorship

1. Knut Hofland

Norwegian Computing Centre for the Humanities - University of Bergen

2. Jarle Ebeling

Department of British and American Studies - University of Oslo

Original URL

https://web.archive.org/web/19981206202418/http://gonzo.hit.uib.no/allc-ach96/demos/hofland.html

Work text

Project Background
The English-Norwegian Parallel Corpus (ENPC) project started at the beginning of 1994 and is expected to end in 1996. The parallel corpus is meant to be a research tool for linguists and students interested in contrastive linguistics. It will contain English and Norwegian original texts together with their translations (both English-to-Norwegian and Norwegian-to-English).

The corpus will contain only written material, but both fiction and non-fiction will be included. To include as many different writers and translators as possible, text extracts of 10,000-15,000 words will be selected rather than complete books. Each extract will start at the beginning of the book and, if possible, end at a chapter boundary. The finished corpus will consist of 100 pairs of texts with a total of about 2.5 million words. The texts are marked up according to TEI P3. For the structure of the corpus and the projected uses of the material, see Johansson and Hofland (1994) and Johansson, Ebeling, and Hofland (1996).

Alignment program
The alignment program was written by Knut Hofland. It relies on a simple bilingual lexicon of anchor words, supplemented by further cues: proper nouns, special characters and tags, cognates, and sentence length in characters. Statistics based on half the corpus give an error rate of approximately 2 per cent. The program has also been used to align texts in other language pairs, such as French-Norwegian, English-French/German/Polish/Swedish/Finnish, and Swedish-Estonian. It produces output in several formats, including a TEI-recommended format and a format suitable for use with ParaConc and WordCruncher for Windows.
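The cues named above can be illustrated with a minimal Python sketch. It is not the ENPC alignment program: the anchor list, the prefix matching of anchor words, the cognate test (shared first four letters of words with six or more letters), and the unweighted sum of cues are all assumptions made for illustration.

```python
import re

# Hypothetical anchor list: English word -> Norwegian stems (assumption).
ANCHORS = {"house": ["hus"], "father": ["far"], "not": ["ikke"]}

def tokens(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence)

def anchor_hits(en_tokens, no_tokens, anchors):
    """Count English anchor words with a listed Norwegian stem on the other side."""
    no_lower = [t.lower() for t in no_tokens]
    hits = 0
    for word in {t.lower() for t in en_tokens}:
        for stem in anchors.get(word, []):
            if any(t.startswith(stem) for t in no_lower):
                hits += 1
                break
    return hits

def proper_noun_hits(en_tokens, no_tokens):
    """Count capitalised, non-initial tokens that appear on both sides."""
    en_caps = {t for t in en_tokens[1:] if t[0].isupper()}
    no_caps = {t for t in no_tokens[1:] if t[0].isupper()}
    return len(en_caps & no_caps)

def cognate_hits(en_tokens, no_tokens, prefix=4):
    """Count long words that share their first `prefix` letters."""
    en_pre = {t[:prefix].lower() for t in en_tokens if len(t) >= 6}
    no_pre = {t[:prefix].lower() for t in no_tokens if len(t) >= 6}
    return len(en_pre & no_pre)

def length_score(en, no):
    """Ratio of the shorter to the longer sentence, measured in characters."""
    return min(len(en), len(no)) / max(len(en), len(no), 1)

def pair_score(en, no, anchors=ANCHORS):
    """Unweighted sum of all cues; higher means a more plausible pair."""
    en_t, no_t = tokens(en), tokens(no)
    return (anchor_hits(en_t, no_t, anchors)
            + proper_noun_hits(en_t, no_t)
            + cognate_hits(en_t, no_t)
            + length_score(en, no))

def best_match(en, candidates, anchors=ANCHORS):
    """Index of the candidate translation with the highest score."""
    return max(range(len(candidates)),
               key=lambda i: pair_score(en, candidates[i], anchors))
```

For example, `best_match("My father is not in the house.", ["Det regner i Bergen.", "Faren min er ikke i huset."])` selects the second candidate, because three anchor words match. A real aligner would also have to handle one-to-many sentence pairs and search over whole texts, which this sketch leaves out.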

Browsing tool
The browsing tool was written by Jarle Ebeling. The aligned and proofread texts are indexed, so that the text database can be searched for words in one language and the matching sentences retrieved together with their translations in the other language. Words in the two languages can be combined with the AND or the NOT operator, so that only sentence pairs are found in which a specific word in the first language occurs together with (or without) another word in the second language. Words can also be truncated, and a distance (in number of words) between particular words can be specified.
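The query options described above can be sketched in a few lines of Python. This is an illustration only, not the ENPC browser: the trailing `*` for truncation, the function names, and the symmetric reading of word distance are assumptions, and the sketch scans a small list of sentence pairs instead of using an index.

```python
import re

def tokens(sentence):
    """Lower-cased word tokens of a sentence."""
    return [t.lower() for t in re.findall(r"\w+", sentence)]

def positions(pattern, toks):
    """Token positions matching a pattern; a trailing '*' means truncation."""
    pattern = pattern.lower()
    if pattern.endswith("*"):
        stem = pattern[:-1]
        return [i for i, t in enumerate(toks) if t.startswith(stem)]
    return [i for i, t in enumerate(toks) if t == pattern]

def within(sentence, first, second, max_distance):
    """True if the two words occur at most `max_distance` words apart."""
    toks = tokens(sentence)
    return any(0 < abs(j - i) <= max_distance
               for i in positions(first, toks)
               for j in positions(second, toks))

def search(pairs, source_word, target_word=None, operator="and"):
    """Return sentence pairs whose source side contains `source_word`.

    With `target_word`, operator "and" keeps pairs whose target side also
    contains it; operator "not" keeps pairs whose target side lacks it.
    """
    results = []
    for source, target in pairs:
        if not positions(source_word, tokens(source)):
            continue
        if target_word is not None:
            found = bool(positions(target_word, tokens(target)))
            if found != (operator == "and"):
                continue
        results.append((source, target))
    return results

# Invented sample data, not taken from the corpus.
PAIRS = [
    ("She thought about the house.", "Hun tenkte på huset."),
    ("I think so.", "Jeg tror det."),
    ("He thinks it is good.", "Han synes det er bra."),
]
```

With this sample data, `search(PAIRS, "think*", "tror", "and")` returns only the second pair, while the same query with `"not"` returns only the third, which is the kind of contrast a user would look for when studying how one English verb is rendered in Norwegian.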

References
Web page with more information and articles: http://www.hd.uib.no/enpc.html

Johansson, S., and Hofland, K. (1994), 'Towards an English-Norwegian parallel corpus', in U. Fries, G. Tottie, and P. Schneider (eds.), Creating and Using English Language Corpora (Amsterdam), 25-37.

Johansson, S., Ebeling, J., and Hofland, K. (1996), 'Coding and Aligning the English-Norwegian Parallel Corpus', in K. Aijmer, B. Altenberg, and M. Johansson (eds.), Papers from Symposium on Text-based Cross-linguistic Studies (Lund), 87-112.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1996

Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

Conference website: https://web.archive.org/web/19990224202037/www.hd.uib.no/allc-ach96.html

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC