Conference Paper · Published · 2023

MasakhaNEWS: News Topic Classification for African Languages

Adelani, D. I., Masiak, M., Azime, I. A., Alabi, J., Tonja, A. L., Mwase, C., Ogundepo, O., Kimanuka, U., et al.

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pp. 144–159

55 citations (Google Scholar) · arXiv:2304.09972

Abstract

Despite representing roughly a fifth of the world's population, African languages are underrepresented in NLP research, in part due to a lack of datasets. While there are individual language-specific datasets for several tasks, only a handful of tasks (e.g. named entity recognition and machine translation) have datasets covering geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, the largest dataset for news topic classification covering 16 languages widely spoken in Africa. We provide and evaluate a set of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives for transfer learning to improve classification in low-resource settings.

Keywords

news classification, African languages, NLP, Masakhane, multilingual, text classification