dc.description.abstract | Automatic Speech Recognition (ASR) systems, pivotal in applications ranging from voice assistants to transcription services and assistive technologies, require improvement when confronted with diverse variations in user voice patterns. Addressing this challenge involves enhancing existing systems through the integration of user voice adaptation techniques. Implementing these techniques directly on edge devices presents inherent challenges, but it ensures the privacy of the sensitive information in spoken utterances. Our research attempts to bridge the gap between theoretical advancements and real-world implementation by focusing on translating user voice adaptation techniques into fully operational systems. By navigating the challenges posed by edge devices, our work aims to contribute to the development of robust, real-world ASR systems that can serve as a cornerstone for the evolution of improved speech recognition systems.
In our first work, we propose a resource-aware framework for user voice personalization of ASR models on constrained edge devices such as mobile phones. We consider the memory and battery capabilities of the devices to make informed decisions and choose the most suitable sub-model for training when resources are limited. In our second work, we introduce a new Federated Learning (FL) framework designed for edge devices to collaboratively train ASR models. We elaborate on the entire methodology for deploying the model with FL functionality and provide a thorough evaluation of the framework in a real-world setup using actual mobile phones as client devices. Following this, in our third work, we introduce a client selection algorithm for FL that optimizes waiting time by considering system resources, including the computation, storage, power, and phone-specific capabilities of client devices. Our algorithm dynamically adjusts the number of training epochs for selected clients based on their available resources, thereby minimizing waiting times in the FL process. In our fourth work, we introduce a novel semi-asynchronous FL framework for edge devices. We calculate the time for aggregating the weights on the server with the help of our resource-aware work allocation algorithm and a partial modeling approach. This strategy helps mitigate staleness in practical scenarios within the asynchronous FL setup. Our next work concentrates on addressing ASR errors by enhancing the decoding algorithm and introducing an error correction algorithm that utilizes token-based language models and pronunciation models. The errors frequently observed in ASR output include mistakes related to word boundary disambiguation, phonetically ambiguous words, spelling errors, and others. Driven by the limitations of the current standard evaluation metrics for ASR tasks, we present two unique approaches aimed at developing improved evaluation metrics for ASR systems. Finally, we put forward two metrics, Heval and SeMaScore, and demonstrate their effectiveness in evaluating ASR systems, particularly when confronted with atypical speech patterns. | en_US |