University of Bergen
The Bergen Corpus of London Teenage Language (COLT) has been transcribed, and the cassette tapes have been digitized to Windows WAV files (9 GB). The texts have been time-aligned at the word level with the sound files by the company Softsound in the UK. The poster will describe how this material is made available through the Corpus WorkBench from IMS in Stuttgart. The user can search the corpus by means of a Web browser and, from the resulting concordance, play the sound corresponding to each occurrence (5-15 seconds). For this purpose a program was written to deliver small pieces of a sound file across the Web. These sound extracts can be saved by the user and analyzed further with signal processing programs.
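The delivery of small pieces of a sound file can be illustrated with a minimal sketch. The abstract does not describe how the original program was implemented, so the function name `extract_clip`, its parameters, and the use of Python's standard `wave` module are assumptions made only for illustration. The sketch copies the frames between two time offsets into a new, standalone WAV file held in memory, which a Web server could then return to the browser.

```python
import io
import wave


def extract_clip(wav_path, start_s, end_s):
    """Return a standalone WAV file (as bytes) covering start_s..end_s.

    Hypothetical helper: not the program described in the abstract.
    """
    with wave.open(wav_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert seconds to frame indices and clamp them to the file length.
        first = max(0, min(total, int(start_s * rate)))
        last = max(first, min(total, int(end_s * rate)))
        src.setpos(first)
        frames = src.readframes(last - first)

        # Write the selected frames with the same audio parameters.
        buf = io.BytesIO()
        with wave.open(buf, "wb") as dst:
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
    return buf.getvalue()
```

Because the result is a complete WAV file rather than a raw byte range, the user can save it and open it directly in a signal processing program.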
Two Norwegian spoken corpora are also available for searching in this way. In one corpus, a time mark was inserted manually in the transcripts every 10 seconds, and a program then generated an interpolated time stamp for each word. In the other corpus, the program SyncWriter was used while transcribing the text. This program keeps track of time information for each transcribed unit, and this information can be extracted from the data file together with the text. The time stamp for each word is interpolated between these values, and the text and time information are indexed by the search software.
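The interpolation step can be sketched as follows. The abstract does not state which interpolation scheme was used, so this example assumes simple linear interpolation by word position between two known time marks; the function name `interpolate_word_times` and the segment layout are hypothetical.

```python
def interpolate_word_times(segments):
    """Assign an estimated start time to every word.

    segments: list of (start_s, end_s, words), where start_s and end_s are
    the known time marks around a stretch of transcript and words is the
    list of words between them.
    Assumption: words are spread evenly over the interval.
    """
    timed = []
    for start_s, end_s, words in segments:
        count = len(words)
        for index, word in enumerate(words):
            offset = (end_s - start_s) * index / count
            timed.append((word, start_s + offset))
    return timed


# Four words between the marks at 0 s and 10 s.
print(interpolate_word_times([(0.0, 10.0, ["well", "I", "dunno", "really"])]))
# → [('well', 0.0), ('I', 2.5), ('dunno', 5.0), ('really', 7.5)]
```

Even spacing ignores word length and pauses, so the estimates are only approximate; this fits the use described above, where each hit is played back as an extract of several seconds rather than as a single word.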
Softsound Speech/Text alignment <http://www.softsound.com/SpeechText.html>
Corpus WorkBench <http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/index.html>
Demo concordance <http://helmer.hit.uib.no/test-of-sound.html>
Hosted at University of Glasgow
Glasgow, Scotland, United Kingdom
July 21, 2000 - July 25, 2000
Conference website: https://web.archive.org/web/20190421230852/https://www.arts.gla.ac.uk/allcach2k/