Show simple item record

dc.contributor.advisor: Kumar, Udaya
dc.contributor.author: Varun Krishna, P S
dc.date.accessioned: 2025-07-23T05:32:58Z
dc.date.available: 2025-07-23T05:32:58Z
dc.date.submitted: 2025
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/7003
dc.description.abstract: The rapid expansion of digital data has led to a growing interest in self-supervised learning (SSL) techniques, particularly for speech processing tasks where labeled data is often scarce. SSL enables models to learn meaningful representations directly from raw data by capturing inherent structures and patterns without requiring explicit supervision. These learned representations can then be fine-tuned for downstream tasks such as automatic speech recognition (ASR) and spoken language modeling. To be effective, speech representations must not only capture content-related information, such as phonetic, lexical, and semantic features, but also remain robust against speaker variations, co-articulation effects, channel distortions, and background noise. However, existing SSL models face limitations in effectively encoding content while maintaining invariance to non-semantic variations. This thesis proposes novel frameworks to enhance the robustness of SSL-based representations in extracting semantic content from raw speech.

The first contribution of this thesis is the development of the Hidden Unit Clustering (HUC) framework, which integrates contrastive learning with deep clustering techniques to enhance representation quality. A speaker normalization strategy is incorporated to mitigate speaker variability, ensuring that the extracted representations focus primarily on content-related information. Additionally, a heuristic data sampling method is introduced to generate pseudo targets for deep clustering, further refining the learned representations. The framework is evaluated across multiple SSL models, demonstrating significant improvements on phonetic and semantic benchmarks, as well as superior performance in low-resource ASR settings.

The second contribution focuses on improving context-invariant representations to address challenges posed by co-articulation effects and variations in speaker and channel characteristics. To achieve this, a pseudo-con loss framework is proposed, leveraging pseudo-targets to guide contrastive learning and enhance robustness. This approach serves as a lightweight yet effective auxiliary module that can be seamlessly integrated into models based on deep clustering. Extensive evaluations demonstrate state-of-the-art performance across multiple ZeroSpeech 2021 sub-tasks and context-invariance benchmarks, as well as significant improvements in phoneme recognition and ASR performance, particularly in resource-constrained settings.

The final contribution explores the integration of adversarial learning to further enhance the quality of semantic representations. A gradient-reversal mechanism is employed to explicitly suppress non-semantic variations within SSL models, thereby refining the learned representations. This adversarial approach effectively disentangles content from non-semantic factors, leading to more robust semantic representations. Experimental results on various ZeroSpeech and resource-constrained settings confirm that the proposed method enhances the ability of speech processing models to generalize across different acoustic conditions while preserving critical linguistic information. [en_US]
dc.language.iso: en_US [en_US]
dc.relation.ispartofseries: ;ET01011
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. [en_US]
dc.subject: Self-Supervised learning [en_US]
dc.subject: speech recognition [en_US]
dc.subject: automatic speech recognition [en_US]
dc.subject: Hidden Unit Clustering [en_US]
dc.subject: HUC [en_US]
dc.subject: ZeroSpeech [en_US]
dc.subject: Non-Semantic Speech Benchmark [en_US]
dc.subject: GramVaani Hindi ASR [en_US]
dc.subject.classification: Research Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics::Electrical engineering [en_US]
dc.title: Self-Supervised Learning Approaches for Content-Factor Extraction from Raw Speech [en_US]
dc.type: Thesis [en_US]
dc.degree.name: PhD [en_US]
dc.degree.level: Doctoral [en_US]
dc.degree.grantor: Indian Institute of Science [en_US]
dc.degree.discipline: Engineering [en_US]
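
Illustrative note on the adversarial component described in the abstract: the gradient-reversal mechanism mentioned in the final contribution is commonly realized as a function that acts as the identity in the forward pass and negates gradients in the backward pass, so that training a speaker classifier on top of the encoder pushes the encoder to discard speaker cues. The sketch below shows one minimal way this could look in PyTorch; the names (SpeakerAdversary, ssl_features) and the mean-pooling choice are assumptions made for illustration and are not taken from the thesis.

    # Minimal sketch of a gradient-reversal (adversarial) branch, assuming PyTorch.
    # Names such as SpeakerAdversary and ssl_features are illustrative only.
    import torch
    import torch.nn as nn


    class GradReverse(torch.autograd.Function):
        """Identity in the forward pass; scales and negates gradients in the backward pass."""

        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse the gradient flowing into the encoder so that minimizing the
            # speaker-classification loss removes speaker cues from the features.
            return -ctx.lambd * grad_output, None


    class SpeakerAdversary(nn.Module):
        """Speaker classifier placed behind a gradient-reversal layer."""

        def __init__(self, feat_dim: int, num_speakers: int, lambd: float = 1.0):
            super().__init__()
            self.lambd = lambd
            self.classifier = nn.Sequential(
                nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_speakers)
            )

        def forward(self, ssl_features: torch.Tensor) -> torch.Tensor:
            # ssl_features: (batch, time, feat_dim) frame-level representations.
            reversed_feats = GradReverse.apply(ssl_features, self.lambd)
            pooled = reversed_feats.mean(dim=1)  # utterance-level mean pooling
            return self.classifier(pooled)       # speaker logits


    if __name__ == "__main__":
        feats = torch.randn(4, 100, 768, requires_grad=True)  # dummy SSL features
        adversary = SpeakerAdversary(feat_dim=768, num_speakers=10)
        logits = adversary(feats)
        loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (4,)))
        loss.backward()  # gradients reaching `feats` are sign-reversed

In a full training setup, the speaker-classification loss from such a branch would typically be added with a small weight to the main SSL objective, so the encoder is encouraged to suppress speaker information while the content-related losses remain unchanged.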

