Journal Article ● Published 2024

Speech Recognition Datasets for Low-Resource Congolese Languages

Kimanuka, U., Ciira wa Maina, Büyük, O.

Elsevier Data in Brief

Vol. 52, Article 109796

📚 20 citations (Google Scholar)

Abstract

Large pre-trained Automatic Speech Recognition (ASR) models have shown improved performance in low-resource languages due to the increased availability of benchmark corpora and the advantages of transfer learning. However, only a limited number of languages possess ample resources to fully leverage transfer learning. In such contexts, benchmark corpora become crucial for advancing methods. In this article, we introduce two new benchmark corpora designed for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus, with 4 hours of labelled audio, and the Congolese Speech Radio Corpus, which offers 741 hours of unlabelled audio spanning four significant low-resource languages of the region. The Lingala Read Speech Corpus was recorded from thirty-two distinct adult speakers, each in a unique context, under various settings and with different accents. Concurrently, the raw data for the Congolese Speech Radio Corpus were taken from the archive of a broadcast station and then passed through a purpose-designed curation process. The datasets, freely accessible to all researchers, serve as a valuable resource for investigating and developing monolingual and multilingual approaches for linguistically similar and distant languages. With supervised and self-supervised learning techniques, they enable the first benchmarking of speech recognition systems for Lingala and the first multilingual model tailored to four Congolese languages spoken by a combined population of 95 million.

Keywords

ASR, low-resource languages, Congolese languages, Lingala, speech corpus, transfer learning, self-supervised learning, DRC
