dc.contributor.advisor | Kumar, Udaya | |
dc.contributor.author | Varun Krishna, P S | |
dc.date.accessioned | 2025-07-23T05:32:58Z | |
dc.date.available | 2025-07-23T05:32:58Z | |
dc.date.submitted | 2025 | |
dc.identifier.uri | https://etd.iisc.ac.in/handle/2005/7003 | |
dc.description.abstract | The rapid expansion of digital data has led to a growing interest in self-supervised learning (SSL)
techniques, particularly for speech processing tasks where labeled data is often scarce. SSL
enables models to learn meaningful representations directly from raw data by capturing inherent
structures and patterns without requiring explicit supervision. These learned representations
can then be fine-tuned for downstream tasks such as automatic speech recognition (ASR)
and spoken language modeling. To be effective, speech representations must not only capture
content-related information—such as phonetic, lexical, and semantic features—but also remain
robust against speaker variations, co-articulation effects, channel distortions, and background
noise. However, existing SSL models face limitations in effectively encoding content while
maintaining invariance to non-semantic variations. This thesis proposes novel frameworks to
enhance the robustness of SSL-based representations in extracting semantic content from raw
speech.
The first contribution of this thesis is the development of the Hidden Unit Clustering (HUC)
framework, which integrates contrastive learning with deep clustering techniques to enhance
representation quality. A speaker normalization strategy is incorporated to mitigate speaker
variability, ensuring that the extracted representations focus primarily on content-related information. Additionally, a heuristic data sampling method is introduced to generate pseudo-targets for deep clustering, further refining the learned representations. The framework is
evaluated across multiple SSL models, demonstrating significant improvements on phonetic and semantic benchmarks, as well as superior performance in low-resource ASR settings.
The second contribution focuses on improving context-invariant representations to address
challenges posed by co-articulation effects and variations in speaker and channel characteristics. To achieve this, a pseudo-con loss framework is proposed, leveraging pseudo-targets to
guide the contrastive learning and enhance robustness. This approach serves as a lightweight
yet effective auxiliary module that can be seamlessly integrated into models based on deep
clustering. Extensive evaluations demonstrate state-of-the-art performance across multiple ZeroSpeech 2021 sub-tasks and context-invariance benchmarks, along with significant improvements
in phoneme recognition and ASR performance, particularly in resource-constrained settings.
The final contribution explores the integration of adversarial learning to further enhance
the quality of semantic representations. A gradient-reversal mechanism is employed to explicitly suppress non-semantic variations within SSL models, thereby refining the learned representations. This adversarial approach effectively disentangles content from non-semantic factors,
leading to more robust semantic representations. Experimental results on various ZeroSpeech benchmarks and in resource-constrained settings confirm that the proposed method enhances the ability of speech processing models to generalize across different acoustic conditions while preserving critical linguistic information. | en_US |
dc.language.iso | en_US | en_US |
dc.relation.ispartofseries | ;ET01011 | |
dc.rights | I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. | en_US |
dc.subject | Self-Supervised learning | en_US |
dc.subject | speech recognition | en_US |
dc.subject | automatic speech recognition | en_US |
dc.subject | Hidden Unit Clustering | en_US |
dc.subject | HUC | en_US |
dc.subject | ZeroSpeech | en_US |
dc.subject | Non-Semantic Speech Benchmark | en_US |
dc.subject | GramVaani Hindi ASR | en_US |
dc.subject.classification | Research Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics::Electrical engineering | en_US |
dc.title | Self-Supervised Learning Approaches for Content-Factor Extraction from Raw Speech | en_US |
dc.type | Thesis | en_US |
dc.degree.name | PhD | en_US |
dc.degree.level | Doctoral | en_US |
dc.degree.grantor | Indian Institute of Science | en_US |
dc.degree.discipline | Engineering | en_US |