Research Statement
Research Vision:
My research is driven by the ambition to advance audio and speech processing through deep learning, developing novel algorithms and architectures that significantly improve audio quality and strengthen audio-visual speech processing. In an increasingly
digital and interconnected world, the ability to capture, process, and interpret audio signals with
high fidelity is crucial. My vision is to contribute to the evolution of this field by addressing the
key challenges posed by real-world environments, such as noise, reverberation, and variability in
speaking styles, and by integrating multi-modal data to enhance the robustness and accuracy of
speech processing systems.
Research Interests:
1. Audio Quality Enhancement: Improving audio quality in real-world scenarios is at the
core of my research. I am particularly interested in developing advanced speech
enhancement algorithms that can effectively mitigate the adverse effects of noise and
reverberation. This includes exploring novel architectures such as convolutional U-Nets, self-attention mechanisms, and Transformer networks, which I have already proposed and which have shown promise in real-time speech enhancement tasks. My goal is to further
refine these architectures to enhance their performance while minimizing computational
complexity, making them suitable for deployment in resource-constrained environments
such as mobile devices.
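As a concrete illustration of this direction, the PyTorch sketch below shows a minimal convolutional U-Net that predicts a time-frequency mask for a noisy magnitude spectrogram. The layer widths, depth, and sigmoid masking objective are illustrative assumptions, not the configuration of my published models.

```python
# Minimal convolutional U-Net sketch for spectrogram mask estimation.
# Layer sizes are illustrative placeholders; assumes even freq/time dims.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)
        self.enc2 = nn.Sequential(nn.ReLU(),
                                  nn.Conv2d(ch * 2, ch * 2, 3, padding=1),
                                  nn.ReLU())
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        # The decoder sees the skip connection concatenated with the
        # upsampled bottleneck features.
        self.dec = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, padding=1),
                                 nn.ReLU(),
                                 nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, mag):            # mag: (batch, 1, freq, time)
        s1 = self.enc1(mag)            # skip connection
        b = self.enc2(self.down(s1))   # bottleneck at half resolution
        u = self.up(b)
        mask = self.dec(torch.cat([u, s1], dim=1))
        return mag * mask              # masked magnitude spectrogram
```

The skip connection is the key design choice here: it lets the decoder recover the fine spectral detail that the downsampling path discards, which matters for perceived audio quality.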
2. Deep Learning for Robust Speech Processing: The integration of deep learning
techniques into speech processing has opened new avenues for improving the robustness
of speech recognition and verification systems. I am interested in leveraging advanced
neural network architectures, such as noise-aware extended U-Nets, which incorporate
noise information into the speech enhancement process, and hybrid models like ViT-GRU
for anomaly detection in medical imaging, which can be adapted for speech anomaly
detection. My research will focus on optimizing these models for various speech
processing tasks, including speaker verification, speech emotion recognition, and
dysarthric speech recognition, aiming to push the boundaries of what is possible in noisy
and challenging acoustic environments.
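One common way to realize the noise-aware idea is sketched below: form an explicit noise estimate from non-speech frames (here selected by an assumed voice-activity mask) and present it to the enhancement network as a second input channel. The exact conditioning mechanism in my extended U-Net differs, so this is only a schematic.

```python
# Schematic noise-aware conditioning: stack an explicit noise estimate
# alongside the noisy spectrogram as a second input channel.
import torch

def noise_aware_input(noisy_mag: torch.Tensor, vad: torch.Tensor) -> torch.Tensor:
    """noisy_mag: (batch, freq, time); vad: (batch, time) float mask, 1.0 = speech."""
    non_speech = (1.0 - vad).unsqueeze(1)                     # (batch, 1, time)
    # Average the mixture over non-speech frames to estimate the noise
    # spectrum; clamp avoids division by zero if every frame is speech.
    weights = non_speech / non_speech.sum(-1, keepdim=True).clamp(min=1.0)
    noise_est = (noisy_mag * weights).sum(-1, keepdim=True)   # (batch, freq, 1)
    noise_map = noise_est.expand_as(noisy_mag)                # broadcast over time
    # The enhancement network now sees both the mixture and the noise estimate.
    return torch.stack([noisy_mag, noise_map], dim=1)         # (batch, 2, freq, time)
```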
3. Audio-Visual Speech Processing: As part of my broader vision, I am deeply interested
in the interaction between audio and visual data in speech processing. The human ability
to understand speech is not limited to auditory information; visual cues from lip
movements, facial expressions, and gestures play a crucial role. I aim to develop
algorithms that integrate these multi-modal inputs to create more robust and accurate
speech processing systems. This includes exploring techniques for speech emotion recognition that convert facial expressions into phonograms and extract features using an improved Transformer-structure-based network. By combining audio and visual
data, my research seeks to improve speech recognition accuracy, particularly in noisy
environments where audio-only systems might struggle.
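To make the fusion concept concrete, the sketch below encodes each modality separately and concatenates the resulting embeddings before a joint classification head. The GRU encoders, feature dimensions, and seven-class output are illustrative assumptions, not a description of a specific published system.

```python
# Schematic mid-level audio-visual fusion: per-modality encoders whose
# final states are concatenated and classified jointly.
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    def __init__(self, audio_dim=80, video_dim=128, hidden=128, n_classes=7):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_enc = nn.GRU(video_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)  # e.g., emotion classes

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, T_a, audio_dim), e.g., log-mel frames
        # video_feats: (batch, T_v, video_dim), e.g., lip-region features
        _, h_a = self.audio_enc(audio_feats)  # final hidden state per modality
        _, h_v = self.video_enc(video_feats)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)
        return self.head(fused)
```

Fusing at the embedding level rather than at the raw-signal level lets the visual stream carry more of the decision when the acoustic channel is degraded.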
4. Multimodal and Multisensory Fusion: Beyond traditional audio-visual fusion, I am also
interested in exploring multisensory fusion techniques, such as integrating bone-conduction (BC) and air-conduction (AC) microphones for speech enhancement. The novel time-domain BC and AC speech fusion enhancement method that I am developing reflects this interest. It leverages an LSTM and Kalman filtering to predict
and enhance clean speech signals, showing promise in reducing noise and improving
speech quality. My future research will continue to explore these innovative fusion
techniques to further enhance the performance of speech processing systems in diverse
and challenging environments.
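As a self-contained illustration of the filtering step, the sketch below fuses BC and AC observations of the same clean sample with a scalar Kalman filter, weighting each sensor by an assumed observation-noise variance. In my actual method the state prediction comes from an LSTM; a simple random-walk model stands in here so the example runs on its own.

```python
# Toy scalar Kalman filter fusing bone-conducted (BC) and air-conducted
# (AC) observations of the same underlying clean speech signal.
import numpy as np

def kalman_fuse(bc: np.ndarray, ac: np.ndarray,
                r_bc: float = 0.05, r_ac: float = 0.2,
                q: float = 1e-3) -> np.ndarray:
    """bc, ac: equal-length 1-D float signals; r_bc, r_ac: assumed
    observation-noise variances; q: process-noise variance.
    Returns the fused clean-speech estimate."""
    x, p = 0.0, 1.0                      # state estimate and its variance
    out = np.empty_like(bc)
    for t in range(len(bc)):
        p = p + q                        # predict (random-walk stand-in
                                         # for the LSTM prediction step)
        for z, r in ((bc[t], r_bc), (ac[t], r_ac)):
            k = p / (p + r)              # Kalman gain for this sensor
            x = x + k * (z - x)          # sequential measurement update
            p = (1.0 - k) * p
        out[t] = x
    return out
```

Because the BC channel is assigned a lower noise variance, the filter naturally leans on it in loud environments while the AC channel restores the high-frequency content it lacks.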
5. Applications in Real-World Scenarios: While theoretical advancements are important,
I am committed to ensuring that my research has practical applications. I aim to develop
speech processing systems that can be deployed in real-world scenarios, such as
telecommunication systems, hearing aids, assistive devices for individuals with speech
impairments, and smart home devices. This involves not only improving the algorithms
themselves but also addressing the challenges of deploying these systems in
environments with limited computational resources, ensuring they are efficient, reliable,
and user-friendly.
Future Directions:
Looking ahead, I envision my research contributing to the development of next-generation
speech processing systems that are not only more accurate and robust but also more intuitive
and capable of understanding the context in which speech occurs. This could involve the
integration of advanced AI techniques, such as reinforcement learning and transfer learning, to
create adaptive systems that learn and improve over time. Additionally, I am interested in
exploring the ethical and social implications of speech processing technologies, ensuring that
they are developed and used in ways that are fair, inclusive, and beneficial to all members of
society.