Inter‐Model Feature Fusion for Robust Low‐Resource Speech Recognition
Kimanuka, U., Ciira wa Maina, Büyük, O.
Applied AI Letters
Abstract
Substantial improvements in automatic speech recognition performance have been realized through supervised fine-tuning after self-supervised pre-training of a speech foundation model. The large size of foundation models, together with their varied losses and objective functions, makes it difficult to obtain optimal results with these models, and fine-tuning each model independently for several downstream tasks is prohibitively expensive. To address this, we propose an inter-model feature-fusion methodology consisting of three phases: feature extraction, feature fusion, and prediction. During feature extraction, several pre-trained models, each with different losses and objective functions, are used to derive representations. During feature fusion, a co-attentional fusion mechanism enables the network to adaptively weight different fusion operations and learn representations shared across models. Finally, a connectionist temporal classification (CTC) layer is used to generate transcription predictions. The proposed self-supervised feature-fusion transformer block (SSF-FT), which incorporates inter-model techniques, effectively captures both shared and distinctive information across all fused representations. We conducted an interpretability study in high-resource (English) and low-resource (Congolese) scenarios. In both settings, we observe that features performing well with shallow ensemble methods also perform well with attention-weighted soft mixing. Experimental results demonstrate that our approach offers complementary strengths to existing ensemble techniques, with particular improvements in acoustically challenging and low-resource scenarios.
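To make the three-phase pipeline concrete, the following is a minimal PyTorch sketch, not the authors' implementation: two stand-in feature streams (representing frame-level outputs of two frozen pre-trained encoders such as wav2vec 2.0 and HuBERT) are fused with cross-attention and a learned soft-mixing weight, then passed to a linear CTC head. All module names, shapes, and hyperparameters here are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's SSF-FT block):
# phase 1: feature extraction (stand-in tensors for frozen encoder outputs)
# phase 2: co-attentional fusion with attention-weighted soft mixing
# phase 3: CTC prediction
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionFusion(nn.Module):
    """Fuse two feature streams by letting each attend to the other,
    then mixing the two fused views with a learned softmax weight."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn_a_to_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_b_to_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mix_logits = nn.Parameter(torch.zeros(2))  # soft-mixing weights

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a, feats_b: (batch, time, dim), assumed time-aligned for simplicity
        a_ctx, _ = self.attn_a_to_b(feats_a, feats_b, feats_b)  # stream A queries B
        b_ctx, _ = self.attn_b_to_a(feats_b, feats_a, feats_a)  # stream B queries A
        w = torch.softmax(self.mix_logits, dim=0)
        return w[0] * a_ctx + w[1] * b_ctx


class FusionCTCModel(nn.Module):
    """Fusion block followed by a linear projection to CTC output classes."""

    def __init__(self, dim: int = 768, vocab_size: int = 32):
        super().__init__()
        self.fusion = CoAttentionFusion(dim)
        self.ctc_head = nn.Linear(dim, vocab_size)  # index 0 reserved for CTC blank

    def forward(self, feats_a, feats_b):
        fused = self.fusion(feats_a, feats_b)
        return F.log_softmax(self.ctc_head(fused), dim=-1)


if __name__ == "__main__":
    # Stand-ins for frame-level features from two pre-trained models;
    # in practice these would come from frozen self-supervised encoders.
    feats_a = torch.randn(2, 100, 768)
    feats_b = torch.randn(2, 100, 768)
    model = FusionCTCModel()
    log_probs = model(feats_a, feats_b)             # (batch, time, vocab)

    targets = torch.randint(1, 32, (2, 20))         # dummy label sequences
    input_lengths = torch.full((2,), 100)
    target_lengths = torch.full((2,), 20)
    ctc = nn.CTCLoss(blank=0)
    loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
    print(loss.item())
```

In this sketch the learned mixing weights play the role of the attention-weighted soft mixing mentioned in the abstract; a fuller implementation would fuse more than two encoders and stack such fusion layers inside a transformer block.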