REF - Green Database Description (FINAL).pdf (167.37 kB)

A spoken corpus of Cameroon Pidgin English: pilot study

online resource

posted on 2023-06-09, 03:47 authored by Melanie Green, Miriam Ayafor, Gabriel Ozon

This resource is a 240,000-word corpus of spoken Cameroon Pidgin English (CPE), a widely used yet stigmatised and largely uncodified pidgin/creole variety. The corpus consists of transcriptions of private and public dialogues and monologues, with mark-up and POS-tagging, together with accompanying sound files. The recordings were made in five locations in Cameroon (Bamenda, Buea, Douala, Kumba and Yaounde), allowing some insight into regional variation. Text categories and the proportions of monologue and dialogue are guided by those of the International Corpus of English (ICE) project, which makes the corpus immediately comparable with existing corpora of post-colonial varieties of English.

• Spelling: since there is no standardised orthography for CPE, the orthography adopted for this project is based on that developed by Ayafor (2014), which was kept under review during the course of the project.
• Annotation: annotation was added to the transcriptions following the ICE guidelines for spoken texts. Standard mark-up symbols denote text units, speaker identification, overlapping speech, unclear words, uncertain transcriptions, anthropophonics, editorial comments, foreign words and indigenous language words.
• Tagging: a tagset for CPE was devised based on CLAWS 5. Tagging was initially conducted manually, and then by means of TreeTagger. A third of the corpus has been post-checked, with an accuracy rate of 94%.

The corpus is intended as a resource for linguistic description and comparison. It allows linguists to identify and describe recurring grammatical patterns, as well as the phonology of the language (given that sound files are deposited with the text files). It also allows comparison of CPE with other pidgin/creole languages, other Cameroonian and West African languages, and other varieties of post-colonial English. Furthermore, the corpus provides an exceptional resource for the study of general/theoretical linguistics, creolistics, typology, language contact and change, sociolinguistics and discourse analysis.

The corpus contains 80 sound recordings of monologues (scripted and unscripted) and dialogues (public and private). Each sound file (in .wav format) is 10-15 minutes in length. Each recording has been transcribed (approximately 3,000 words per transcription) and annotated. Transcriptions are provided in two formats: (a) a plain transcription, with basic markup indicating speaker turns, overlaps, etc., and (b) a POS-tagged version, which adds POS-tags to the plain transcription. The language of the monologues is Cameroon Pidgin English, with codeswitching into English, French, and indigenous Cameroonian languages.
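The figures quoted in the description can be cross-checked with a little arithmetic. The following is a minimal sketch in Python, not part of the deposited resource: the constant and function names are hypothetical, the per-file figures are the approximate values stated above, and the post-checked word count assumes that "a third of the corpus" is measured in words.

```python
# Figures taken from the description; names are illustrative only.
N_RECORDINGS = 80                 # sound recordings in the corpus
WORDS_PER_TRANSCRIPT = 3_000      # approximate length of each transcription
MINUTES_PER_RECORDING = (10, 15)  # stated duration range of each .wav file


def corpus_estimates():
    """Derive rough corpus-level totals from the per-file figures."""
    total_words = N_RECORDINGS * WORDS_PER_TRANSCRIPT
    min_hours = N_RECORDINGS * MINUTES_PER_RECORDING[0] / 60
    max_hours = N_RECORDINGS * MINUTES_PER_RECORDING[1] / 60
    # Assumption: "a third of the corpus" refers to a third of the words.
    checked_words = total_words // 3
    return {
        "total_words": total_words,
        "min_hours": min_hours,
        "max_hours": max_hours,
        "checked_words": checked_words,
    }


if __name__ == "__main__":
    for name, value in corpus_estimates().items():
        print(f"{name}: {value}")
```

Under these assumptions the per-file figures reproduce the stated 240,000-word total and imply roughly 13 to 20 hours of audio, with about 80,000 words post-checked for tagging accuracy.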

Funding

A spoken corpus of Cameroon Pidgin English (pilot study); G1418; BRITISH ACADEMY; SG140663

History

Publication status

Published

File Version

Other

Publisher

University of Oxford Text Archive

Publisher URL

http://ota.ox.ac.uk/desc/2563

Place of publication

Oxford Text Archive

Department affiliated with

English Publications

Full text available

Yes

Contributors

Sarah FitzGerald

Legacy Posted Date

2016-10-28

Keywords

Linguistic corpora, Corpus, Speech-Research, Linguistics analysis, Speech-Synthesis, Pidgin English, Code switching

Licence

Copyright not evaluated
