Self-Supervised Learning Approaches for Content-Factor Extraction from Raw Speech

Varun Krishna, P S

Abstract

The rapid expansion of digital data has led to a growing interest in self-supervised learning (SSL) techniques, particularly for speech processing tasks where labeled data is often scarce. SSL enables models to learn meaningful representations directly from raw data by capturing inherent structures and patterns without requiring explicit supervision. These learned representations can then be fine-tuned for downstream tasks such as automatic speech recognition (ASR) and spoken language modeling. To be effective, speech representations must not only capture content-related information (phonetic, lexical, and semantic features) but also remain robust against speaker variations, co-articulation effects, channel distortions, and background noise. However, existing SSL models face limitations in effectively encoding content while maintaining invariance to non-semantic variations. This thesis proposes novel frameworks to enhance the robustness of SSL-based representations in extracting semantic content from raw speech.

The first contribution of this thesis is the development of the Hidden Unit Clustering (HUC) framework, which integrates contrastive learning with deep clustering techniques to enhance representation quality. A speaker normalization strategy is incorporated to mitigate speaker variability, ensuring that the extracted representations focus primarily on content-related information. Additionally, a heuristic data sampling method is introduced to generate pseudo-targets for deep clustering, further refining the learned representations. The framework is evaluated across multiple SSL models, demonstrating significant improvements on phonetic and semantic benchmarks, as well as superior performance in low-resource ASR settings.

The second contribution focuses on improving context-invariant representations to address challenges posed by co-articulation effects and variations in speaker and channel characteristics. To achieve this, a pseudo-con loss framework is proposed, leveraging pseudo-targets to guide the contrastive learning and enhance robustness. This approach serves as a lightweight yet effective auxiliary module that can be seamlessly integrated into models based on deep clustering. Extensive evaluations demonstrate state-of-the-art performance across multiple ZeroSpeech 2021 sub-tasks and context-invariance benchmarks, as well as significant improvements in phoneme recognition and ASR performance, particularly in resource-constrained settings.

The final contribution explores the integration of adversarial learning to further enhance the quality of semantic representation. A gradient-reversal mechanism is employed to explicitly suppress non-semantic variations within SSL models, thereby refining the learned representations. This adversarial approach effectively disentangles content from non-semantic factors, leading to more robust semantic representations. Experimental results on various ZeroSpeech tasks and in resource-constrained settings confirm that the proposed method enhances the ability of speech processing models to generalize across different acoustic conditions while preserving critical linguistic information.

URI

https://etd.iisc.ac.in/handle/2005/7003

Collections

Electrical Engineering (EE) [451]