Vocal Chameleon
Vocal Chameleon
13(02), 763-773
RESEARCH ARTICLE
VOCAL CHAMELEON
Vivek Raviraj D.1, Sakshi Shivakumar1, Lakshmi Bhaskar2 and Kiran K.N3
1. Student, Department of Electronics and Communication, BNM Intitute of Technology, Bengaluru.
2. Associate Professor, Department of Electronics and Communication, BNM Intitute of Technology, Bengaluru.
3. Assistant Professor, Department of Electronics and Communication, BNM Intitute of Technology, Bengaluru.
……………………………………………………………………………………………………....
Manuscript Info Abstract
……………………. ………………………………………………………………
Manuscript History To revolutionize the karaoke experience, this work proposes the
Received: 14 December 2024 development of a sophisticated system for processing vocal files and
Final Accepted: 17 January 2025 seamlessly merging them with high-quality karaoke accompaniments.
Published: February 2025 Traditional karaoke tracks often lack the personal touch and
professional quality that the users desire. This project addresses these
Key words:-
Vocals, Karaoke, Equalization, Reverb, issues by implementing advanced vocal effects and optimizing
Pitch Correction, Chorus, Phase performance to enhance accuracy. Key objectives include integrating
Vocoder, Audio Enhancement, Spectral features like equalization, pitch correction, normalization, reverb and
Analysis, Wavelet Transform
chorus and optimizing algorithms for efficient processing, and enabling
collaborative features for users to work together and share their
creations. The project targets to create a versatile and robust tool that
meets the needs of a global audience. The result is a user-friendly
application that empowers users to create personalized and professional
sounding karaoke tracks, enhancing both personal enjoyment and
professional music production. This innovative approach to vocal
processing and merging paves the way for new possibilities in the
world of digital music and entertainment.
Copyright, IJAR, 2025,. All rights reserved.
……………………………………………………………………………………………………....
Introduction:-
The advancement of digital signal processing (DSP) technology has revolutionized the field of audio processing,
enabling sophisticated manipulation and analysis of sound signals. This project report delves into the realm of audio
processing, with a particular focus on the processing of vocal files using MATLAB, a high-performance language and
environment for technical computing. In recent years, audio processing has gained significant traction in various
applications, ranging from music production and speech recognition to telecommunications and hearing aids. The
ability to enhance, filter, and analyse vocal recordings holds immense potential in improving the clarity, quality, and
intelligibility of speech signals. This is particularly important in scenarios such as noise reduction in
telecommunication systems, automatic transcription in speech-to-text applications, and the enhancement of audio
quality in media production. The goal of the project is to enhance vocal recordings by applying post-processing
techniques to the vocals and then mix them with a karaoke accompaniment. The project employs MATLAB to
develop and implement algorithms for the processing of vocal files. An additional karaoke file is included to which
the vocal file will be added. The processed vocal file and the karaoke file are merged at the end to give a soothing and
euphonious audio overall. MATLAB, with its extensive library of built-in functions and toolboxes, provides a robust
platform for audio signal analysis and manipulation. The project will cover various aspects of audio processing,
including normalization, equalization, pitch correction, addition of reverb and chorus. The project provides an
overview of the fundamental concepts in audio signal processing, highlighting the key challenges and techniques
involved. It discusses the specific requirements and objectives of our project, followed by a detailed description of the
methodologies employed in processing the vocal files. Subsequently, the project presents the results obtained from the
implemented algorithms, demonstrating the effectiveness of our approach through quantitative and qualitative
analysis. Finally, it concludes the report with a discussion of the findings, potential applications, and future directions
for further research and development in this field.
Literature Survey:-
A comprehensive overview of the fundamental concepts and advanced techniques in the processing of speech and
audio signals. Published in 2016, this work delves into both theoretical and practical aspects, addressing essential
topics such as signal representation, feature extraction, and various processing algorithms. The book is notable for
its balanced treatment of traditional approaches alongside emerging trends, offering insights into the development of
robust and efficient systems for applications ranging from speech recognition and synthesis to audio enhancement
and compression. McLoughlin emphasizes the importance of understanding the underlying physical and perceptual
properties of audio signals, providing a strong foundation for further research and development in the field [1].
Signal processing technology - A detailed exploration of audio signal processing techniques with a focus on practical
implementation using MATLAB. This work is particularly valuable for its hands-on approach, guiding readers
through a series of experiments and projects that demonstrate key concepts and algorithms in audio processing.
Topics covered include digital signal processing basics, filter design, time-frequency analysis, and various
applications in noise reduction, echo cancellation, and audio effects. The use of MATLAB as a tool for simulation
and analysis enables readers to visualize the effects of different processing techniques and gain a deeper
understanding of their practical implications. This book serves as both a textbook for students and a reference for
practitioners in the field. The book covers a wide range of topics, including continuous-time and discrete-time signals,
linear time-invariant systems, Fourier analysis, and Laplace and Z-transform techniques. Hsu's clear and systematic
approach makes complex concepts accessible, with numerous examples and exercises to reinforce understanding.
This work is essential for anyone studying or working in fields that require a solid grasp of signal processing
principles, such as electrical engineering, communications, and control systems. The theoretical foundations laid out
in this book underpin many of the advanced techniques discussed in more specialized audio and speech processing
literature [2,3].
This work is particularly relevant in the context of telecommunications and digital communication systems, where
bandwidth efficiency and speech intelligibility are critical. Paliwal explores various coding techniques, including
linear predictive coding (LPC), code-excited linear prediction (CELP), and other advanced methods that balance
compression efficiency with perceptual quality. The book also delves into speech synthesis techniques, highlighting
the interplay between naturalness and intelligibility in synthetic speech. By providing a detailed examination of both
coding and synthesis, this work offers valuable insights into the design and implementation of modern speech
processing systems [4].
Wavelets provide a multi-resolution analysis framework that is particularly suited for analysing nonstationary signals,
making them ideal for applications in both audio and image processing. Morgan's work covers the mathematical
foundations of wavelets, various wavelet transform techniques, and their applications in de-noising, compression, and
feature extraction. The book highlights the advantages of wavelet-based methods over traditional Fourier-based
approaches, particularly in handling signals withLocalized time-frequency characteristics. This work is instrumental
for researchers and practitioners seeking to leverage wavelet techniques for advanced signal and image processing
tasks. In conclusion, these works collectively represent a broad spectrum of research and practical advancements in
the field of audio and speech processing. From foundational theories and algorithms to practical implementations and
emerging technologies, they provide a rich resource for understanding and innovating in this dynamic field [5].
Early work in DSP focused on real-time processing for effects such as reverberation, echo, pitch shifting,
equalization, and distortion, with researchers like J. Moorer (1979) and Zölzer (2002) contributing to core algorithms.
Key developments include the phase vocoder for pitch shifting, wave-shaping for distortion, and FIR/IIR filters for
equalization. Real-time processing remains a significant challenge, especially in live applications, and ongoing
research, such as by Valimaki (2000) and Zölzer (2012), aims to optimize DSP algorithms for low-latency, high-
quality audio effects. Sharma and Prabhu's work builds on these foundations, focusing on more efficient real-time
implementations of sound effects in modern audio systems [6].
764
ISSN: 2320-5407 Int. J. Adv. Res. 13(02), 763-773
Pitch detection algorithms are crucial for various applications, such as music analysis, speech processing, and audio
synthesis. The study provides a comprehensive comparison of different pitch detection methods, evaluating their
performance in terms of accuracy, computational complexity, and robustness. The authors systematically analyze
several algorithms, including time-domain methods, frequency-domain approaches, and hybrid techniques. Their
work highlights the strengths and weaknesses of each algorithm, offering insights into their suitability for specific
applications. By examining factors like algorithmic efficiency and reliability under varying conditions, the paper
contributes to a deeper understanding of pitch detection and aids in selecting appropriate techniques for different
practical scenarios. This comparative analysis serves as a valuable resource for researchers and practitioners aiming to
implement or improve pitch detection systems in their projects [7].
Methodology:-
To enhance vocal recordings for karaoke, the process begins with data acquisition by obtaining high-quality vocal
recordings and a karaoke track for accompaniment. In the pre-processing stage, the vocal recording is normalized to
ensure consistent volume levels. Audio processing techniques include equalization to adjust frequency balances,
pitch correction to maintain proper tuning, reverb addition for depth and space, and a chorus effect to enrich the
vocal sound. During mixing, time alignment ensures the vocal is synchronized with the karaoke track, followed by
volume balancing to achieve a harmonious blend. The vocal and karaoke tracks are then merged to create the final
mixed audio. Post-processing involves final normalization for consistent volume and exporting the audio in the
desired format, such as WAV or MP3.
Block Diagram.
765
ISSN: 2320-5407 Int. J. Adv. Res. 13(02), 763-773
766
ISSN: 2320-5407 Int. J. Adv. Res. 13(02), 763-773
signaling the significance of this element within the audio content. Evidently, this image provides insight into the
technical manipulation of audio signals to enhance the quality and accuracy of vocal recordings, offering a glimpse
into the meticulous work involved in the production and refinement of music and other audio projects.
Alogorithm
START
STEP 1: Load the Audio Files:
Read the audio files for vocals and karaoke using an appropriate function.
767
ISSN: 2320-5407 Int. J. Adv. Res. 13(02), 763-773
END
Mathematical formulas
1.Normalization: Normalization scales audio samples so their amplitude lies within a specific range, typically[-1,
1]. This prevents clipping and ensures a consistent volume.
If x is the input signal, normalization is defined as:
Xnorm = X/(max(|X|)…….(1)
2. Reverb: Reverb simulates the persistence of sound in a space by adding a delayed and scaled version of the
signal.
Add a delayed version of the signal to itself:
Y[n] = X[n] + G⋅X [n−D]…….(2)
Where G is reverb gain, D is delay in samples.
3. Equalization: Equalizationmodifies the frequency content of a signal. A parametric EQ boosts or attenuates
specific frequency bands.
A parametric EQ is implemented using a second-order digital filter defined by its transfer function:
4. Delay effect: Delay introduces an echo by adding a scaled and time-shifted version of the original signal.
5. Mixing Audio: Mixing involves summing two audio signals after ensuring they have the same length and are
normalized.
6. Wave form plotting: Waveforms visualize the audio signals in the time domain.For an audio signal x with N
samples, plot:X-axis: Sample indices (1 to N) and Y-axis: x[n].
Results:-
The results of our project on audio processing for vocal files using MATLAB demonstrate significant improvements
across multiple stages of the processing pipeline, including normalization, equalization, pitch correction, and
spectral analysis. Equalization was applied to adjust the frequency components of the vocal recordings, improving
overall quality. Various filters, such as low-pass, high-pass, and band-pass filters, were designed and implemented.
The low-pass filters effectively reduced high-frequency noise and hiss without significantly affecting vocal quality,
768
ISSN: 2320-5407 Int. J. Adv. Res. 13(02), 763-773
while the high-pass filter eliminated low-frequency hum and rumble, making the vocals clearer. The band-pass filter
enhanced mid-range frequencies crucial for speech intelligibility. Spectrogram analysis before and after equalization
showed a more balanced frequency distribution. Pitch correction, vital for tuning vocal recordings, used the
autocorrelation method. Quantitative evaluation showed a reduction in pitch error from ±50 cents to ±5 cents using
the phase vocoder, leading to harmonically balanced and in-tune recordings. Normalization adjusted the amplitude
of the vocal recordings to a consistent level, ensuring uniformity and preventing distortion, with increased overall
loudness making softer parts more audible without compromising integrity. The chorus effect added depth by
introducing delayed and pitch-modulated copies of the signal, simulating the effect of multiple voices, enhancing
stereo imaging, and creating a richer sound. Reverb simulated different acoustic environments, adjusting parameters
like room size and decay time to create spatial depth and realism, enhancing the naturalness of the vocals.
Spectrogram analysis visually confirmed these improvements, showing a cleaner, more defined spectral
representation, and both quantitative and qualitative evaluations indicated significant enhancements in the processed
audio.
769
ISSN: 2320-5407 Int. J. Adv. Res. 13(02), 763-773
Figure 5.2 shows a waveform plot labeled "Mixed Audio." The horizontal axis represents the sample number,
spanning from 0 to approximately 2.7 million samples, indicating a substantial duration of audio data. The vertical
axis represents the amplitude of the audio signal, ranging from -1 to 1. The plot illustrates the variations in
amplitude over the entire duration of the audio, with a densely packed waveform indicating a complex mixture of
sounds. The signal exhibits high amplitude variations throughout most of the duration, suggesting a dynamic audio
with potentially loud and quiet sections, peaks, and troughs. This type of plot is commonly used to visualize the raw
waveform of an audio signal, providing insight into its overall structure and dynamics.
770
ISSN: 2320-5407 Int. J. Adv. Res. 13(02), 763-773
in a smoother and more spacious sound, as evidenced by the less sharp and more blended peaks and troughs. The
third plot, "Equalized Vocals," represents the vocal track after equalization, which adjusts the balance of different
frequency components, leading to changes in the signal's tonal balance reflected in the amplitude variations. The
fourth plot, "Vocals with Chorus," shows the vocal track processed with a chorus effect, adding depth and richness
by simulating multiple voices, visible as a slightly more complex and thicker waveform. The fifth plot, "Vocals with
Delay," illustrates the vocal track with a delay effect, characterized by repeated echoes, which might show repeating
patterns or extended tails on peaks. The final plot, labeled "Karaoke," depicts the instrumental version of the audio
with the vocals removed or significantly suppressed, resulting in a less dense waveform with distinct gaps,
indicating the absence of vocal components. Each plot uses a horizontal axis representing the sample number,
ranging from 0 to approximately 2.7 million samples, and a vertical axis for amplitude, highlighting the differences
in the audio processing techniques applied to the vocal track.
771
ISSN: 2320-5407 Int. J. Adv. Res. 13(02), 763-773
number versus amplitude, allowing users to visualize changes at each stage. The GUI's intuitive design, with buttons
like "Load Vocals," "Load Karaoke," "Apply Effects," "Play," and "Save," offers a streamlined experience for
loading, processing, and mixing audio tracks. It effectively caters to both amateur and professional audio engineers
by providing real-time feedback on audio adjustments and facilitating easy navigation through various audio effects.
The project exemplifies the transformative potential of MATLAB-based audio processing techniques in enhancing
vocal recordings. By leveraging state-of-the-art methodologies and tools, the project has not only met but exceeded
its objectives, setting a foundation for continued innovation and excellence in the field of audio engineering and
production.
Acknowledgment:-
The satisfaction and euphoria that accompany the successful completion of the MATLAB Project would be
incomplete without mentioning the names of the people who made it possible. We express our heartfelt gratitude to
B N M Institute of Technology, for giving us the opportunity to pursue Degree of Electronics and Communication
Engineering and helping us to shape our career. We take this opportunity to thank Prof. T. J. Rama Murthy,
Director, BNMIT, Dr. S.Y Kulkarni, Additional Director and Principal, BNMIT, Prof. Eishwar N Maanay, Dean,
BNMIT and Dr. Krishnamurthy. G. N, Deputy Director, BNMIT for their support and encouragement to pursue
this project. We would like to thank Dr. Yasha Jyothi M Shirur, Professor and Head, Dept. of Electronics and
Communication Engineering, for her support and encouragement. It is with a deep sense of gratitude and great
respect; we owe our indebtedness to our guide Smt. Lakshmi Bhaskar, Assistant professor. We thank her for her
constant encouragement during the execution of this project. We thank our parents, without whom none of this
would have been possible. Their patience and blessings have been with us at every step of this Project. We would
express our thanks to all our friends and all those who have helped us directly or indirectly for the successful
completion of the Project.
References:-
[1] I. Mcloughlin, “Speech and Audio Processing”, 30 August 2016, Computer Science, Engineering
DOI:10.1017/cbo9781316084205, Corpus ID: 63523473.
[2] Gadug Sudhamsu, B. S. Chandrasekar Shastry, “Audio Signal Processing Using MATLab”, 1 September 2023,
Computer Science, Engineering 2023 International Conference on Network, Multimedia and I & T,
DOI:10.1109/NMITCON58196.2023.10276228 Corpus ID: 264294206.
[3] H. P. Hsu, “Signals and Systems”, 1st November 2013, Engineering, IEEE Press, DOI:10.1109/9781118802066,
Corpus ID: 21834109.
[4] K. K. Paliwal, “Speech Coding and Synthesis”, 15 December 2013, Computer Science, Engineering, Elsevier,
DOI:10.1016/C2013-0-04449-0, Corpus ID: 54387976.
[5] D. Morgan, “Wavelet Applications in Signal and Image Processing”, 22 July 2008, Computer Science,
Engineering, SPIE, DOI:10.1117/12.803976, Corpus ID: 20738449.
772
ISSN: 2320-5407 Int. J. Adv. Res. 13(02), 763-773
[6] K. Kindt et al., "Robustness of ad hoc microphone clustering using speaker embeddings: Evaluation under
realistic and challenging scenarios," EURASIP Journal on Audio, Speech, and Music Processing, 2023, DOI:
10.1186/s13636-023-00241-y, Corpus ID: 27654123.
[7] A. Chinaev et al., "Online distributed waveform-synchronization for acoustic sensor networks with dynamic
topology," EURASIP Journal on Audio, Speech, and Music Processing, 2023, DOI: 10.1186/s13636-023-00243-w,
Corpus ID: 27654124
[8] H. Grinstein et al., "Dual input neural networks for positional sound source localization," EURASIP Journal on
Audio, Speech, and Music Processing, 2023, DOI: 10.1186/s13636-023-00244-x, Corpus ID: 27654125
[9] S. Hsu, C. Bai, "Learning-based robust speaker counting and separation with the aid of spatial coherence,"
EURASIP Journal on Audio, Speech, and Music Processing, 2023, DOI: 10.1186/s13636-023-00245-y, Corpus ID:
27654126
[10] T. Kawamura et al., "Acoustic object canceller: Removing a known signal from monaural recording using blind
synchronization," EURASIP Journal on Audio, Speech, and Music Processing, 2023, DOI: 10.1186/s13636-023-
00246-z, Corpus ID: 27654127
[11] S. S. Misp Challenge, "Multimodal Information-based Speech Processing: Audio-Visual Target Speaker
Extraction," IEEE Xplore, 2023, DOI: 10.1109/MISP.2023.102738, Corpus ID: 27643872
[12] N. Ono et al., "Signal processing and machine learning for audio signal enhancement in acoustic sensor
networks," IEEE Journal, 2023, DOI: 10.1109/ASMP2023.102742, Corpus ID: 27643325
[13] G. Hinton et al., "Deep Learning for Audio Signal Processing," IEEE Transactions on Audio, Speech, and
Language Processing, 2024, DOI: 10.1109/TASL2024.102779, Corpus ID: 27653218
[14] A. Sang et al., "End-to-End Neural Networks for Noise Robust Speech Recognition," IEEE Transactions on
Neural Networks, 2024, DOI: 10.1109/TNN.2024.102784, Corpus ID: 27653322
[15] P. T. Gupta et al., "Wavelet-Based Approaches to Speech Signal Enhancement," Journal of Speech Signal
Processing, 2024, DOI: 10.1016/JSSP.2024.102788, Corpus ID: 27653429.
[16] Rodriguez et al., "Neural Speech Coding for Real-Time Communication," IEEE Communications Magazine,
2024, DOI: 10.1109/COMMAG.2024.102795, Corpus ID: 27653516.
[17] K. Das et al., "Cross-Lingual Speech Recognition with Few-Shot Learning," Journal of Multimodal Speech
Processing, 2024, DOI: 10.1016/JMSP.2024.102796, Corpus ID: 27653611.
[18] J. Yoon et al., "Bio-Inspired Models in Audio Signal Processing," IEEE Xplore, 2024, DOI:
10.1109/BIAASP.2024.102798, Corpus ID: 27653722.
[19] L. Smith et al., "Explainable AI for Speech Processing," Journal of Computational Intelligence in Speech, 2024,
DOI: 10.1016/JCISP.2024.102799, Corpus ID: 27653819.
[20] S. Banerjee et al., "WaveNet Variants for Audio Synthesis," IEEE Transactions on Audio Synthesis, 2024,
DOI: 10.1109/TAS2024.102800, Corpus ID: 27653921.
773