
    Self-Supervised Learning Approaches for Content-Factor Extraction from Raw Speech

    Author
    Varun Krishna, P S
    Abstract
    The rapid expansion of digital data has led to a growing interest in self-supervised learning (SSL) techniques, particularly for speech processing tasks where labeled data is often scarce. SSL enables models to learn meaningful representations directly from raw data by capturing inherent structures and patterns without requiring explicit supervision. These learned representations can then be fine-tuned for downstream tasks such as automatic speech recognition (ASR) and spoken language modeling. To be effective, speech representations must not only capture content-related information, such as phonetic, lexical, and semantic features, but also remain robust against speaker variations, co-articulation effects, channel distortions, and background noise. However, existing SSL models face limitations in effectively encoding content while maintaining invariance to non-semantic variations. This thesis proposes novel frameworks to enhance the robustness of SSL-based representations in extracting semantic content from raw speech.

    The first contribution of this thesis is the development of the Hidden Unit Clustering (HUC) framework, which integrates contrastive learning with deep clustering techniques to enhance representation quality. A speaker normalization strategy is incorporated to mitigate speaker variability, ensuring that the extracted representations focus primarily on content-related information. Additionally, a heuristic data sampling method is introduced to generate pseudo-targets for deep clustering, further refining the learned representations. The framework is evaluated across multiple SSL models, demonstrating significant improvements on phonetic and semantic benchmarks, as well as superior performance in low-resource ASR settings.

    The second contribution focuses on improving context-invariant representations to address challenges posed by co-articulation effects and variations in speaker and channel characteristics. To achieve this, a pseudo-con loss framework is proposed, leveraging pseudo-targets to guide the contrastive learning and enhance robustness. This approach serves as a lightweight yet effective auxiliary module that can be seamlessly integrated into models based on deep clustering. Extensive evaluations demonstrate state-of-the-art performance across multiple ZeroSpeech 2021 sub-tasks and context-invariance benchmarks, as well as significant improvements in phoneme recognition and ASR performance, particularly in resource-constrained settings.

    The final contribution explores the integration of adversarial learning to further enhance the quality of semantic representations. A gradient-reversal mechanism is employed to explicitly suppress non-semantic variations within SSL models, thereby refining the learned representations. This adversarial approach effectively disentangles content from non-semantic factors, leading to more robust semantic representations. Experimental results on various ZeroSpeech and resource-constrained settings confirm that the proposed method enhances the ability of speech processing models to generalize across different acoustic conditions while preserving critical linguistic information.
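
    As a rough illustration of the gradient-reversal idea mentioned in the abstract, the sketch below shows the general technique on a toy scalar model. It is not taken from the thesis: the scalar encoder weight `w`, the adversary weight `v`, the nuisance label `s` (standing in for, say, a speaker attribute), the squared-error loss, and the scale `lam` are all hypothetical choices made only for this example. The layer is the identity in the forward pass and multiplies the gradient by `-lam` in the backward pass, so the adversary is trained to predict the nuisance label while the encoder is pushed in the opposite direction.

```python
def grl_forward(z):
    """Gradient-reversal layer, forward pass: identity."""
    return z


def grl_backward(grad_out, lam=1.0):
    """Gradient-reversal layer, backward pass: flip sign and scale by lam."""
    return -lam * grad_out


def adversarial_grads(w, v, x, s, lam=1.0):
    """Toy example (all names hypothetical, not the thesis model).

    Encoder:   z = w * x
    Adversary: s_hat = v * GRL(z), trained with 0.5 * (s_hat - s)^2
    Returns the adversary loss, the gradient reaching the encoder weight w
    (after reversal), and the ordinary gradient for the adversary weight v.
    """
    z = w * x                      # encoder output
    z_rev = grl_forward(z)         # unchanged in the forward pass
    s_hat = v * z_rev              # adversary's guess of the nuisance label
    err = s_hat - s
    loss = 0.5 * err ** 2

    grad_v = err * z_rev                     # adversary descends its own loss
    grad_z = grl_backward(err * v, lam)      # reversed before reaching the encoder
    grad_w = grad_z * x                      # chain rule through z = w * x
    return loss, grad_w, grad_v


loss, grad_w, grad_v = adversarial_grads(w=1.0, v=2.0, x=3.0, s=1.0, lam=1.0)
```

    Applying a gradient-descent step with `grad_w` increases the adversary's loss, which is what discourages the encoder from carrying the nuisance information; `grad_v` is left untouched so the adversary keeps improving. In this toy setup, `lam` controls how strongly the encoder is penalized.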
    URI
    https://etd.iisc.ac.in/handle/2005/7003
    Collections
    • Electrical Engineering (EE) [358]

    Related items

    Showing items related by title, author, creator and subject.

    • Demodulation of Narrowband Speech Spectrograms 

      Aragonda, Haricharan (2017-11-22)
      Speech is a non-stationary signal and contains modulations in both spectral and temporal domains. Based on the type of modulations studied, most speech processing algorithms can be classified into short-time analysis ...
    • Joint Evaluation Of Multiple Speech Patterns For Speech Recognition And Training 

      Nair, Nishanth Ulhas (2009-12-09)
      Improving speech recognition performance in the presence of noise and interference continues to be a challenging problem. Automatic Speech Recognition (ASR) systems work well when the test and training conditions match. ...
    • Spectro-Temporal Features For Robust Automatic Speech Recognition 

      Suryanarayana, Venkata K (2011-01-18)
      The speech signal is inherently characterized by its variations in time, which get reflected as variations in frequency. The spectro-temporal changes are due to changes in vocal tract, intonation, co-articulation and successive ...

    etd@IISc is a joint service of SERC & J R D Tata Memorial (JRDTML) Library || Powered by DSpace software || DuraSpace