dc.contributor.advisor | Kumar, Udaya | |
dc.contributor.author | Varun Krishna, P S | |
dc.date.accessioned | 2025-07-23T05:32:58Z | |
dc.date.available | 2025-07-23T05:32:58Z | |
dc.date.submitted | 2025 | |
dc.identifier.uri | https://etd.iisc.ac.in/handle/2005/7003 | |
dc.description.abstract | The rapid expansion of digital data has led to a growing interest in self-supervised learning (SSL)
techniques, particularly for speech processing tasks where labeled data is often scarce. SSL
enables models to learn meaningful representations directly from raw data by capturing inherent
structures and patterns without requiring explicit supervision. These learned representations
can then be fine-tuned for downstream tasks such as automatic speech recognition (ASR)
and spoken language modeling. To be effective, speech representations must not only capture
content-related information—such as phonetic, lexical, and semantic features—but also remain
robust against speaker variations, co-articulation effects, channel distortions, and background
noise. However, existing SSL models face limitations in effectively encoding content while
maintaining invariance to non-semantic variations. This thesis proposes novel frameworks to
enhance the robustness of SSL-based representations in extracting semantic content from raw
speech.
The first contribution of this thesis is the development of the Hidden Unit Clustering (HUC)
framework, which integrates contrastive learning with deep clustering techniques to enhance
representation quality. A speaker normalization strategy is incorporated to mitigate speaker
variability, ensuring that the extracted representations focus primarily on content-related information. Additionally, a heuristic data sampling method is introduced to generate pseudo-targets for deep clustering, further refining the learned representations. The framework is
evaluated across multiple SSL models, demonstrating significant improvements on phonetic and semantic benchmarks, as well as superior performance in low-resource ASR settings.
The second contribution focuses on improving context-invariant representations to address
challenges posed by co-articulation effects and variations in speaker and channel characteristics. To achieve this, a pseudo-con loss framework is proposed, leveraging pseudo-targets to
guide the contrastive learning and enhance robustness. This approach serves as a lightweight
yet effective auxiliary module that can be seamlessly integrated into models based on deep
clustering. Extensive evaluations demonstrate state-of-the-art performance across multiple ZeroSpeech 2021 sub-tasks and context-invariance benchmarks, along with significant improvements
in phoneme recognition and ASR performance, particularly in resource-constrained settings.
The final contribution explores the integration of adversarial learning to further enhance
the quality of semantic representations. A gradient-reversal mechanism is employed to explicitly suppress non-semantic variations within SSL models, thereby refining the learned representations. This adversarial approach effectively disentangles content from non-semantic factors,
leading to more robust semantic representations. Experimental results on various ZeroSpeech benchmarks and in resource-constrained settings confirm that the proposed method enhances the ability of speech processing models to generalize across different acoustic conditions while preserving critical linguistic information. | en_US |
dc.language.iso | en_US | en_US |
dc.relation.ispartofseries | ;ET01011 | |
dc.rights | I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. | en_US |
dc.subject | Self-Supervised learning | en_US |
dc.subject | speech recognition | en_US |
dc.subject | automatic speech recognition | en_US |
dc.subject | Hidden Unit Clustering | en_US |
dc.subject | HUC | en_US |
dc.subject | ZeroSpeech | en_US |
dc.subject | Non-Semantic Speech Benchmark | en_US |
dc.subject | GramVaani Hindi ASR | en_US |
dc.subject.classification | Research Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics::Electrical engineering | en_US |
dc.title | Self-Supervised Learning Approaches for Content-Factor Extraction from Raw Speech | en_US |
dc.type | Thesis | en_US |
dc.degree.name | PhD | en_US |
dc.degree.level | Doctoral | en_US |
dc.degree.grantor | Indian Institute of Science | en_US |
dc.degree.discipline | Engineering | en_US |