
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Belgaum-590018

A PROJECT REPORT (21CSP76) ON

“AUDIO-BASED INTERVIEW SIMULATOR WITH FEEDBACK


ON CONFIDENCE, ACCURACY AND GRAMMAR USING ML
MODELS”
Submitted in Partial fulfillment of the Requirements for the Degree of

Bachelor of Engineering in Computer Science & Engineering

By
VISHAL PREMNATH (1CR21CS211)

VARUN TEJAS (1CR21CS207)

Under the Guidance of,


Mrs. Krishna Sowjanya K
Assistant Professor, Dept. of CSE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CMR INSTITUTE OF TECHNOLOGY

#132, AECS LAYOUT, IT PARK ROAD, KUNDALAHALLI, BANGALORE-560037


CMR INSTITUTE OF TECHNOLOGY
#132, AECS LAYOUT, IT PARK ROAD, KUNDALAHALLI, BANGALORE-560037

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE
Certified that the project work entitled “Audio-based Interview Simulator with Feedback on
Confidence, Accuracy and Grammar using ML models” carried out by Mr. Vishal P, USN
1CR21CS211, Mr. Varun Tejas, USN 1CR21CS207, bonafide students of CMR Institute of
Technology, in partial fulfillment for the award of Bachelor of Engineering in Computer Science and
Engineering of the Visvesvaraya Technological University, Belgaum during the year 2023-2024. It is
certified that all corrections/suggestions indicated for Internal Assessment have been incorporated in the
Report deposited in the departmental library.

The project report has been approved as it satisfies the academic requirements in respect of Project work
prescribed for the said Degree.

Mrs. Krishna Sowjanya K, Assistant Professor, Dept. of CSE, CMRIT
Dr. Keshavamoorthy, Professor & Head, Dept. of CSE, CMRIT
Dr. Sanjay Jain, Principal, CMRIT

External Viva

Name of the Examiners Signature with Date

1.

2.

DECLARATION

We, the students of Computer Science and Engineering, CMR Institute of Technology, Bangalore
declare that the work entitled " Audio-based Interview Simulator with Feedback on
Confidence, Accuracy and Grammar using ML models " has been successfully completed
under the guidance of Mrs. Krishna Sowjanya K, Computer Science and Engineering Department,
CMR Institute of technology, Bangalore. This dissertation work is submitted in partial fulfillment
of the requirements for the award of Degree of Bachelor of Engineering in Computer Science and
Engineering during the academic year 2023 - 2024. Further the matter embodied in the project
report has not been submitted previously by anybody for the award of any degree or diploma to
any university.

Place: Bangalore

Date:

Team members: Signature

VISHAL P (1CR21CS211) __________________

VARUN TEJAS (1CR21CS207) __________________

ABSTRACT

The AI-Enhanced Virtual Interview Simulation Platform offers users a holistic and
interactive environment in which to prepare for job interviews in technical or human resources
domains. It uses NLP and machine learning technologies to assess user responses and
provides constructive feedback on how confident, accurate, grammatically correct, and
effective in communication they have been. Through detailed performance analysis, users can
recognize what has to be improved and increase their self-confidence. The platform features
an intuitive front-end for easy navigation, while a Django back-end forms the basis
of the AI-reinforced assessment processes. Acting as a sophisticated, highly individualized
interview coach, the platform is designed to help users improve their capabilities and prepare
for successful interviews.

ACKNOWLEDGEMENT

I take this opportunity to express my sincere gratitude and respect to CMR Institute of
Technology, Bengaluru, for providing me with a platform to pursue my studies and carry out my
final year project.
It gives me great pleasure to express my deep sense of gratitude to Dr. Sanjay Jain,
Principal, CMRIT, Bangalore, for his constant encouragement.
I would like to thank Dr. Keshavamoorthy, Professor and Head, Department of
Computer Science and Engineering, CMRIT, Bangalore, who has been a constant source of
support and encouragement throughout the course of this project.
I consider it a privilege and an honor to express my sincere gratitude to my guide,
Mrs. Krishna Sowjanya K, Assistant Professor, Department of Computer Science and
Engineering, for her valuable guidance throughout this project.
I also extend my thanks to all the faculty of Computer Science and Engineering who
directly or indirectly encouraged me.
Finally, I would like to thank my parents and friends for all the moral support they have
given me during the completion of this work.

TABLE OF CONTENTS

Page No.
Certificate ii
Declaration iii
Abstract iv
Acknowledgement v
Table of contents vi
List of Figures viii
1 INTRODUCTION 1
1.1 Relevance of the Project 2
1.2 Problem Statement 2
1.3 Objective 3
1.4 Scope of the Project 4
2 LITERATURE SURVEY 5
2.1 Android-Based Speech-to-Text System 5
2.2 Speech-to-Text Conversion Using Gaussian Mixture Model (GMM) 6
2.3 Speech-to-Text System Using Hidden Markov Model (HMM) 6
2.4 Speech Emotion Recognition 7
2.5 Sentiment-Aware Automatic Speech Recognition Pre-Training for 8
Enhanced Speech Emotion Recognition
2.6 Speech Emotion Recognition Using Dialogue Emotion Decoder 9
and CNN Classifier
2.7 Unsupervised Pretraining for CNNs Using Wav2Vec with Contrastive 10
Loss
2.8 Sequential Pattern Analysis of Emotion in Speech With LSTM 11
2.9 Speech Recognition Application with Tone Analyzer 11
2.10 AI-Enhanced Natural Language Processing: Techniques for 12
Automated Text Analysis, Sentiment Detection, and Conversational
Agents

3 PROPOSED MODEL 14
3.1 System Architecture 14
3.2 Flowchart 16
4 IMPLEMENTATION 18
4.1 Implementation Details 18
4.2 Algorithm Used 20
4.3 Comparative Study 21
5 RESULTS AND DISCUSSION 24
6 TESTING 26
7 CONCLUSION 29
7.1 Conclusion 29
7.2 Future Scope 29
REFERENCES 30
APPENDIX 31

LIST OF FIGURES

Page No.
Fig 3.1 System Architecture 14
Fig 3.2 Flowchart for the project 17
Fig 5.1 Accuracy vs Trees 24
Fig 5.2 Confusion Matrix 25
Fig 6.1 Homepage 26
Fig 6.2 Select Type of Interview 26
Fig 6.3 Questions displayed sequentially 27
Fig 6.4 Completion of the interview 28
Fig 6.5 Feedback page 28


CHAPTER 1

INTRODUCTION
The AI-Powered Interview Simulation Platform meets a pressing need of the increasingly
competitive job market: preparing job seekers for interviews. Traditional interview
preparation methods, like mock interviews or simple study questions, lack personal
feedback and real-time interaction. People tend to falter under pressure on aspects such as
communication, confidence, and response planning, all of which are necessary to succeed
in an interview.
The platform addresses the challenge of obtaining timely feedback on interview
performance. In most cases, applicants rarely get an opportunity to work with an interviewer
or receive a comprehensive assessment of their language, content, and delivery style.
As a result, job seekers feel unprepared and nervous, especially when applying for
jobs they really want.
To address this, the platform uses NLP and algorithms that simulate real interview
conditions. Participants answer technical, HR, and behavioral questions. The system
examines the responses and gives feedback on language, content accuracy, and presentation.
The frontend is developed using Bootstrap (CSS) and vanilla JavaScript for an interactive
and responsive user experience, while the Django-powered backend is responsible for
AI-driven response evaluation. This project helps users improve their interview skills,
enhance their confidence, and ultimately increase their chances of receiving job offers.


1.1 Relevance of the Project

The AI-Powered Virtual Interview Simulator is highly relevant in today's competitive
job market, responding to the growing demand for effective interview preparation. It
bridges the skill gap by helping candidates improve their communication, confidence,
and delivery of responses through personalized, AI-driven feedback. Going beyond
traditional methods, it simulates real interviews and provides actionable insights.
Accessible and cost-effective, it offers an alternative to professional coaching across
multiple roles. The system thus increases users' readiness for technology-led
recruitment processes through action-oriented feedback.

1.2 Problem Statement

The AI-Powered Interview Simulation Platform aims to change the interview
preparation process by offering users a realistic and intelligent environment in which to
refine their skills. It models real-world situations and provides personalized feedback on key
areas such as confidence, accuracy, and grammar. By leveraging AI-powered analytics
and encouraging consistent practice, the platform builds users' trust and improves
interview preparation.
Traditional interview preparation methods that use mock interviews or question
banks fail to identify individual weaknesses and do not allow for direct interaction.
They also do not provide consistent feedback on key skills such as confidence,
accuracy of answers, and grammatical clarity, leaving users unprepared for important
interviews. The difficulty and expense of hiring expert interviewers or specially designed
training facilities further reduces the effectiveness of traditional methods.
This project uses advanced technologies like natural language processing
and machine learning to address these shortcomings, providing quick and concrete
feedback so that individuals can understand and improve their performance in areas
where they are weak. This provides a secure, consistent, and effective experience for
job seekers who want to make the most of their interviews. The platform, which fosters
the development of communication skills and the growth of self-confidence, prepares
users for numerous interview types, such as technical, human
resources, and behavioral evaluations, while enhancing the chances of successful


results. By coupling artificial intelligence-driven insights with an intuitive interface,
the system acts as a high-tech, reliable resource for the current needs in interview
preparation.

1.3 Objectives

The objectives of the AI-Powered Interview Simulation Platform are to give the
user a comprehensive and effective tool to prepare for an interview.
• Artificial Intelligence-based Framework Mirroring Real-Time Interview Scenarios:
Design an AI framework capable of mirroring realistic scenarios of job interviews
related to several fields, for instance technical, HR, or behavioral interviews. This is
expected to give users a better practice experience.
• NLP and Machine Learning: Use advanced NLP techniques and machine learning
algorithms to evaluate user responses in real time. The system will provide personalized
feedback, focusing on critical components such as confidence, precision, and
grammatical correctness, which in turn enhances the overall quality of the responses.
• Communication and Confidence: Provide users with tips so that they can
understand areas that need improvement and, with constant practice, develop
confidence. Personal weaknesses in communication, accuracy, and presentation can
then be addressed, leading to better performance during interviews.
• Create an Accessible, User-Centric Platform: Design an engaging and interactive
interface with the use of modern frontend technologies to ensure a captivating user
experience. The backend infrastructure, powered by Django, will allow for seamless
AI-driven analysis, thereby ensuring continuous engagement and personalized
feedback throughout the interview preparation process.

1.4 Scope of the project

The scope of this project covers the design and development of an AI-based
interview simulation platform to improve the interview preparation process. It ranges
from a deep analysis of interview preparation techniques already in use, and their
shortcomings, to the application of cutting-edge technologies such as NLP and machine
learning algorithms.
These technologies will be applied to mimic real-life interview settings across
technical, human resource, and behavioral interviews. The platform will be built to give
individuals customized feedback on aspects such as confidence, accuracy, and grammar,
helping them improve their performance. The project also focuses on accessibility and
user-experience improvements through an interactive frontend interface and a robust
Django backend framework. In addition, the scope includes refining and optimizing the
feedback mechanism to ensure that it provides real-time, actionable insights that enable
constant improvement. The project explores the use of AI-driven analytical techniques
to provide accurate feedback and monitor user progress, addressing deficiencies in
current interview preparation strategies, including the lack of personalized, instant
feedback and customized practice sessions. The purpose of this project is to transform
interview preparation through the successful integration of AI with user-centric design,
offering users more efficient, accessible, and effective tools for building confidence,
improving communication, and increasing their chances of success in job interviews.


CHAPTER 2

LITERATURE SURVEY
Recent research on AI-based interview preparation stresses the use of Natural
Language Processing and machine learning technologies to simulate real-life interview
conditions. Techniques, including speech-to-text and emotion analysis, give contextual
feedback in areas such as confidence, accuracy, and grammatical correctness. While
these advancements are notable, challenges such as continuous speech
recognition and the accuracy of real-time feedback remain. This section
discusses the approaches, methods, and limitations of AI-based interview training
systems.

2.1 Android-Based Speech-to-Text System


[1] This system implements speech-to-text conversion on an Android platform using
Google’s cloud-based speech recognition, powered by Hidden Markov Models
(HMM). The process begins by capturing live speech through the device's microphone,
which is then sent to Google’s servers for processing. The system compares the audio
input against pre-trained speech models to transcribe the spoken words into text
accurately.
The application is designed to assist users with disabilities, offering a hands-free
solution for tasks like creating and sending SMS messages via voice commands. This
is especially beneficial for individuals with limited mobility or those unable to interact
with the screen manually. Developed on the Android platform, the app is compatible
with a wide range of devices, making it easily accessible.
However, the system requires an active internet connection, as the processing is done
remotely on Google’s cloud servers. This ensures the system can access Google’s
updated speech models, maintaining high accuracy rates. Despite this dependency on
the internet, the application offers a practical, efficient solution for users seeking a
seamless, voice-driven experience to manage daily tasks.


2.2 Speech-to-Text Conversion Using Gaussian Mixture


Model (GMM)
[2] In this approach, a Gaussian Mixture Model (GMM) combined with Mel Frequency
Cepstral Coefficients (MFCC) is employed to convert isolated spoken words into text.
First, the system pre-processes the speech signal, removing silence and background noise
so that only the relevant speech segments are analysed. This step improves the clarity of
the speech input and reduces errors caused by non-speech sounds. MFCC feature
extraction is then applied to map the input audio onto the mel scale. MFCC captures the
relevant characteristics of speech, such as pitch, volume, and timbre, which are critical
for accurate recognition.
These features are fed to the GMM, a probabilistic model that classifies them according
to their likelihood under several speech patterns. The GMM is trained with the
expectation-maximization algorithm, which improves accuracy in the presence of
differences in articulation, pitch, and speech dynamics. The recognized spoken words
are then output as text.
This method works well in low-noise environments and is most effective for single-word
recognition. However, it is not suitable for continuous speech recognition, since it cannot
handle complex speech patterns.
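
The paper itself does not publish code, but the pipeline it describes (silence trimming, MFCC extraction, one GMM per word trained with EM) can be sketched roughly as below, assuming the librosa and scikit-learn packages; the file names and word list are illustrative.

# Sketch of an isolated-word recognizer using MFCC features and per-word GMMs.
# Assumes librosa and scikit-learn; file names and the word list are illustrative.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=16000, n_mfcc=13):
    # Load audio, trim leading/trailing silence, and extract MFCC frames.
    audio, _ = librosa.load(path, sr=sr)
    audio, _ = librosa.effects.trim(audio, top_db=25)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

# One GMM per word, trained with the EM algorithm on that word's recordings.
training = {"yes": ["yes_01.wav", "yes_02.wav"], "no": ["no_01.wav", "no_02.wav"]}
models = {}
for word, files in training.items():
    feats = np.vstack([mfcc_features(f) for f in files])
    models[word] = GaussianMixture(n_components=4, covariance_type="diag").fit(feats)

def recognize(path):
    # Pick the word whose GMM assigns the highest average log-likelihood.
    feats = mfcc_features(path)
    return max(models, key=lambda w: models[w].score(feats))

print(recognize("unknown_word.wav"))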

2.3 Speech-to-Text System Using Hidden Markov Model


(HMM)
[3] This Speech-to-Text (STT) system utilizes the Hidden Markov Model (HMM) to
recognize speech patterns in isolated spoken words, with MATLAB as the development
environment. The process begins with end-point detection, which removes background
noise and segments the speech signal, ensuring that only relevant speech data is processed.
Mel Frequency Cepstral Coefficients (MFCC) are then applied for feature extraction,
transforming the speech into a series of coefficients that capture the essential characteristics
of the spoken word, such as pitch, tone, and duration.

These extracted coefficients serve as inputs to the HMM, which compares them to
predefined speech models to identify the most likely word. The HMM's performance can
be optimized by adjusting the number of states, balancing model complexity with accuracy.


This approach results in a high recognition rate for isolated words, making it particularly
effective in controlled environments.

The system is designed with a focus on education, specifically to assist hearing-impaired


students by enabling clearer communication. Its accuracy and efficiency make it a valuable
tool for enhancing accessibility in educational settings.

2.4 Speech Emotion Recognition

[4] The methodology for Speech Emotion Recognition (SER) in this study follows a well-
structured approach, incorporating multiple stages including emotional speech input, feature
extraction, feature selection, and classification. The system starts by extracting prosodic features
such as pitch, energy, MFCC (Mel Frequency Cepstral Coefficients), and LPCC (Linear
Predictive Cepstral Coefficients), which are significant indicators of various emotions. These
features are then filtered during feature selection to enhance the accuracy of classification by
eliminating irrelevant or redundant data.

For emotion classification, the system uses several classifiers, including Gaussian Mixture
Model (GMM), Hidden Markov Model (HMM), Support Vector Machine (SVM), Artificial
Neural Network (ANN), and K-Nearest Neighbours (KNN). These classifiers are used to
recognize emotions such as anger, happiness, and sadness. The performance of each classifier
is evaluated under both speaker-dependent and speaker-independent models, assessing their
ability to generalize across different speakers. The system's evaluation metrics mainly focus on
classification accuracy, with real-world applications in several fields.

Examples of these applications include psychiatric diagnosis and emotion monitoring in


machine systems. This approach helps achieve effective emotion recognition in speech, which
has much potential for enhancing human-computer interaction and also providing an avenue for
mental health evaluations.
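
A classical SER pipeline of the kind surveyed here (prosodic/spectral features followed by a conventional classifier) could be sketched as follows. This is a minimal illustration, assuming librosa and scikit-learn; the labelled clips are placeholders, not a real dataset.

# Sketch of a classical SER pipeline: pitch/energy/MFCC features + SVM classifier.
# Assumes librosa and scikit-learn; the (path, emotion) pairs are illustrative.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def emotion_features(path, sr=16000):
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    pitch = librosa.yin(audio, fmin=60, fmax=400, sr=sr)      # fundamental frequency track
    energy = librosa.feature.rms(y=audio)[0]                  # frame-level energy
    # Summarize variable-length frame sequences into one fixed-size vector.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [pitch.mean(), pitch.std(), energy.mean(), energy.std()]])

samples = [("clip_angry_01.wav", "anger"), ("clip_happy_01.wav", "happiness"),
           ("clip_sad_01.wav", "sadness")]  # ... more labelled clips in practice
X = np.array([emotion_features(p) for p, _ in samples])
y = [label for _, label in samples]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))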

2.5 Sentiment-Aware Automatic Speech Recognition Pre-


Training for Enhanced Speech Emotion Recognition
[5] The methodology presented in this paper introduces a multi-task pre-training
framework designed to enhance Speech Emotion Recognition (SER) by making the
Automatic Speech Recognition (ASR) model "emotion-aware." The approach begins
by pre-training the ASR model alongside sentiment classification. Sentiment labels are

generated from a text-to-sentiment model applied to a large dataset, allowing the ASR
model to learn both speech recognition and sentiment analysis simultaneously. This
sentiment-aware ASR (SA2SR) model is trained using a combination of Connectionist
Temporal Classification (CTC) loss for speech recognition and cross-entropy loss for
sentiment classification.
Once pre-trained, the model is fine-tuned on the MSP-Podcast dataset, which is
specifically used to optimize the model’s ability to predict emotional dimensions like
activation, valence, and dominance (AVD). These dimensions represent core emotional
states that are crucial for accurate emotion detection in speech. The experimental results
demonstrate significant improvements in SER performance, particularly in predicting
valence, which refers to the degree of positivity or negativity in speech.
This innovative multi-task approach not only improves the accuracy of emotion
recognition but also integrates sentiment analysis directly into the ASR process,
offering a more holistic understanding of spoken language and emotion.

2.6 Speech Emotion Recognition Using Dialogue Emotion


Decoder and CNN Classifier

[6] The paper presents a methodology that combines the Dialogue Emotion Decoder
(DED) with a Convolutional Neural Network (CNN) classifier to enhance Speech
Emotion Recognition (SER). The process begins with the preprocessing of audio data
from the IEMOCAP dataset, which includes cleaning and preparing the data for
emotion analysis. DED is then used for feature extraction, where it utilizes the context
of previous utterances to interpret the emotions in the current speech. This contextual
analysis improves the system’s ability to recognize emotions more accurately by taking
into account the emotional flow of the conversation.
This approach enables the categorization of emotions into five key states: anger,
happiness, neutrality, sadness, and excitement. Once the features are extracted, the
CNN classifier processes them to predict the emotional class. The CNN model is
designed with convolutional layers to capture local patterns in the data, followed by
pooling layers to reduce dimensionality and enhance generalization. The network
concludes with fully connected layers that combine the features and output the
predicted emotion.


The system produces both true and false classifications, indicating the predicted
emotional state and its corresponding confidence level. This methodology effectively
captures complex emotional patterns, making it highly useful for accurate emotion
recognition in spoken language.

2.7 Unsupervised Pretraining for CNNs Using Wav2Vec with


Contrastive Loss
[7] The paper introduces wav2vec, an innovative unsupervised pre-training framework
designed to improve speech recognition models. The method uses vast amounts of raw,
unlabelled audio data to train a multi-layer convolutional neural network (CNN), allowing it to
learn generalized feature representations that are transferable to downstream tasks. These
features are then fine-tuned using a smaller, labelled dataset to improve the performance of
supervised acoustic models.
The wav2vec architecture consists of two networks:
• Encoder: transforms raw audio waveforms into latent feature representations.
• Context network: learns contextual embeddings by capturing temporal dependencies in the
latent representations.
Contrastive loss is used during training to separate true latent audio samples from distractors,
forcing the model to develop discriminative features. Pre-training is framed as a contrastive
task in which the system predicts future latent features given their context, using both local
and global temporal information in the audio signal.
Pre-processing steps also include normalization and segmentation of the raw audio, ensuring
clean input data for learning. The framework relies on tools such as FAIRSEQ to process data
and train models efficiently.
The key improvements of the wav2vec model appear in comparisons with traditional
supervised models in terms of Word Error Rate (WER) in low-resource scenarios. Benchmark
evaluations on datasets such as WSJ (Wall Street Journal) and TIMIT confirm that wav2vec achieves
state-of-the-art performance with minimal reliance on labeled data.
This study highlights the effectiveness of combining unsupervised pre-training with
contrastive loss for advancing speech recognition, emphasizing its potential for reducing the
need for extensive labeled datasets in resource-constrained scenarios.
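
To make the training objective concrete, the sketch below shows an InfoNCE-style contrastive loss of the kind wav2vec builds on: the context vector at time t must score the true future latent higher than sampled distractors. It assumes PyTorch; the shapes, dot-product similarity, and distractor-sampling scheme are illustrative simplifications, not the authors' exact implementation.

# Sketch of an InfoNCE-style contrastive loss over encoder/context outputs.
# Assumes PyTorch; shapes and distractor sampling are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(context, latents, k=1, num_distractors=10):
    # context, latents: (batch, time, dim) outputs of the context network / encoder.
    B, T, D = latents.shape
    c = context[:, :T - k, :]            # predictions made at each time step t
    pos = latents[:, k:, :]              # true future latents z_{t+k}
    pos_score = (c * pos).sum(-1)        # (B, T-k) dot-product similarity

    # Sample distractor latents uniformly from the same utterance.
    idx = torch.randint(0, T, (B, T - k, num_distractors))
    neg = torch.gather(latents.unsqueeze(2).expand(B, T, num_distractors, D),
                       1, idx.unsqueeze(-1).expand(-1, -1, -1, D))
    neg_score = (c.unsqueeze(2) * neg).sum(-1)        # (B, T-k, num_distractors)

    logits = torch.cat([pos_score.unsqueeze(-1), neg_score], dim=-1)
    labels = torch.zeros(B, T - k, dtype=torch.long)  # index 0 is the positive sample
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

loss = contrastive_loss(torch.randn(2, 100, 256), torch.randn(2, 100, 256))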


2.8 Sequential Pattern Analysis of Emotion in Speech With


LSTM

[8] The paper explores Speech Emotion Recognition (SER) using a Long Short-Term Memory
(LSTM) model. The system employs datasets with labeled emotional speech samples and
extracts key features such as Mel Frequency Cepstral Coefficients (MFCC), pitch, and energy
to train the model. To improve the quality of the dataset, pre-processing steps like noise removal
and silence elimination are performed, ensuring that only the relevant speech data is retained for
analysis.

The model is built using Recurrent Neural Networks (RNNs), optimized specifically for
handling temporal sequence data. An attention mechanism is incorporated into the model to
capture both temporal patterns and feature-specific information, which enhances the model’s
ability to focus on important aspects of the speech signal over time. The process involves feature
extraction using tools like LIBROSA, a popular library for audio processing, to extract the
relevant acoustic features from the speech data.

The training is conducted iteratively, with continuous parameter tuning to optimize the model’s
accuracy. These steps ensure that the LSTM model effectively recognizes and classifies
emotional speech, achieving high accuracy in detecting emotions like anger, happiness, and
sadness. The approach showcases the effectiveness of combining LSTM models with temporal
attention mechanisms for accurate SER.

2.9 Speech Recognition Application with Tone Analyzer

[9] The thesis introduces a speech recognition application enhanced with a tone analyzer to
address common challenges in speech interpretation, such as accent, tone, and
mispronunciation. The methodology involves using CMU Sphinx for speech-to-text
conversion, along with Python libraries like Librosa for feature extraction, which are crucial for
analyzing speech characteristics. The system is trained and evaluated using the Surrey Audio-
Visual Expressed Emotion (SAVEE) dataset, which provides emotional speech samples for
more accurate emotion detection.

The system integrates multiple algorithms, including Hidden Markov Models (HMMs),
Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs), and Deep Neural

Networks (DNNs), to perform acoustic modeling and pitch tracking. These algorithms work
together to improve the accuracy of speech recognition and emotional tone analysis. A flowchart
is provided in the thesis to illustrate the Speech Emotion Recognition (SER) process, from the
initial input of audio to the final text transcription and tone analysis.

The experimental results demonstrate the system's ability to accurately transcribe speech while
also analyzing emotional cues, making it highly useful for applications like customer service
and language learning, where understanding tone and emotion is crucial.

2.10 AI-Enhanced Natural Language Processing: Techniques for


Automated Text Analysis, Sentiment Detection, and
Conversational Agents

[10] The paper titled "Techniques for Automated Text Analysis, Sentiment Detection, and
Conversational Agents" explores a variety of AI-driven approaches, focusing on machine
learning (ML) and deep learning (DL) models, which are applied to natural language processing
(NLP) tasks such as part-of-speech tagging, named entity recognition (NER), sentiment
analysis, and conversational agents. These methods leverage ML algorithms to process large
datasets, identifying language patterns for classification and extraction tasks, thereby improving
efficiency in NLP applications.

In particular, deep learning architectures, such as Recurrent Neural Networks (RNNs) and Long
Short-Term Memory (LSTM) networks, are utilized to capture dependencies within text
sequences. This helps to significantly enhance the accuracy of tasks like sentiment analysis and
machine translation by understanding context over longer passages of text.

The paper also discusses practical challenges faced in NLP applications, including issues with
data quality, bias, and computational requirements. Strategies to address these challenges, such
as model compression and bias reduction, are also covered, aiming to make NLP models more
efficient and fair in real-world applications.


CHAPTER 3

PROPOSED MODEL
Advanced machine learning and deep learning models are employed to analyse user
responses and provide personalized feedback. The system utilizes Whisper for speech-
to-text conversion, Wav2Vec for feature extraction and confidence analysis, and
Gemini for evaluating tone, grammar, and accuracy, ensuring a comprehensive
assessment of interview performance.

3.1 System Architecture


The diagram represents an AI-Powered Virtual Interview Simulator designed to
provide detailed feedback on a user's interview performance. This system integrates
advanced AI models like Wav2Vec, Gemini, and Whisper, ensuring robust and
personalized evaluation.

Fig 3.1 System Architecture


The Figure 3.1 shows the system architecture of an automated interview assessment
system. The system utilizes a combination of AI-powered components and human
interaction to generate questions, evaluate responses, and deliver tailored feedback.


The AI-Powered Virtual Interview Assessment System is designed to provide


a comprehensive and personalized evaluation of a user's interview performance. This
system integrates several advanced AI models, including Wav2Vec, Gemini, and
Whisper, to ensure accurate transcription, dynamic question generation, and detailed
feedback. It creates an adaptive learning environment where users can practice
interviews in different domains, such as technical, HR, or behavioural, and receive
tailored feedback that helps them refine their interview skills.
The key components and their interactions within the system are as follows:
• User Interaction: The user begins by selecting the type of interview (e.g., technical,
HR, behavioural) they wish to practice. They then provide audio responses to the
dynamically generated questions, which are processed and analysed for feedback.
• Interview Platform: Serving as the user interface, the Interview Platform delivers the
generated questions to the user and receives their audio responses. These responses
are sent to the Feedback Mechanism for in-depth analysis.
• Question Generator (Gemini API): The Gemini API plays a crucial role in
dynamically generating relevant interview questions based on the selected domain.
This ensures that the questions are tailored to the user’s level, helping to maintain
the interview's relevance and challenge.
• Feedback Mechanism: The heart of the system, this component processes user
responses by integrating multiple advanced AI models:
o Speech-to-Text Conversion (Whisper): The Whisper model is used for robust
and accurate transcription of the user’s audio responses into text, regardless
of background noise or acoustic conditions.
o Natural Language Processing (Gemini): Once the text is transcribed, Gemini
evaluates it for accuracy, content relevance, and clarity. It ensures that the
answers are meaningful and appropriate for the given context.
o Confidence Analysis (Wav2Vec & SER): Wav2Vec, combined with Speech
Emotion Recognition (SER), analyses the user's confidence and emotional
delivery. This ensures that not only the content but also the delivery of the
response is evaluated.
o Grammar and Language Evaluation (Gemini): Gemini’s NLP capabilities
assess the grammatical correctness, fluency, and structure of the transcribed
text, helping improve overall language proficiency.


• Feedback Parameters: The feedback mechanism assesses user performance based on


several key parameters:
o Accuracy & Content: Evaluates the correctness and depth of the user's
responses.
o Confidence: Assesses the emotional delivery and confidence level using
Wav2Vec and SER models.
o Grammar & Language Proficiency: Analyses the fluency, grammar, and
structure of the user’s language using Gemini’s NLP capabilities.
• Feedback Loop: The system establishes a continuous feedback loop, allowing users
to receive detailed insights into their strengths and areas for improvement.
By integrating Wav2Vec, Gemini, and Whisper, this architecture provides an
adaptive, scalable, and comprehensive interview training environment. It ensures that
users receive personalized feedback to improve both their technical knowledge and soft
skills, helping them achieve success in real-world interviews.

3.2 Flowchart
This section discusses the different steps involved in the AI-Powered Virtual
Interview Assessment System. It includes the following steps:

• User Interaction: The user selects the type of interview they wish to practice, such as
technical, HR, or behavioural. The system generates relevant questions based on the
selected interview type.

• Response Collection: The user provides verbal responses to the generated questions.
These responses are captured as audio.

• Speech-to-Text Conversion: The captured audio responses are converted into text
using the Whisper model, ensuring accurate transcription even in varied acoustic
conditions.

• Natural Language Processing (NLP): The transcribed text is processed using Gemini
for analysis. The content is evaluated for accuracy, relevance, and clarity.

• Confidence Analysis: The system assesses the user’s confidence in their responses
using Wav2Vec and Speech Emotion Recognition (SER) techniques.

• Grammar & Language Evaluation: The system analyses the grammatical correctness,

fluency, and overall language proficiency of the transcribed text using Gemini's NLP
capabilities.

• Performance Evaluation: Based on the analysis, the system’s feedback mechanism


evaluates the user's responses against predefined criteria such as accuracy,
confidence, and grammar.

• Feedback Generation: The system provides a comprehensive summary or score


based on the evaluation, offering valuable insights into the user's interview readiness
and suggesting areas for improvement.

Fig 3.2 Flowchart for the Project

The above Figure 3.2 shows the Flowchart for the project. This flowchart depicts the
core workflow of an automated interview assessment system, showcasing how the
proposed system interacts with the user, processes responses, and provides feedback to
enhance interview preparation.
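
To make the flow in Figure 3.2 concrete, the sketch below strings the stages together as plain Python functions. The helpers are hypothetical stubs standing in for the Gemini, Whisper, and Wav2Vec components described above; they are not the project's actual code.

# High-level sketch of the assessment flow in Fig 3.2, with stub components.
def generate_questions(interview_type):       # Gemini: question generation
    return [f"Sample {interview_type} question 1", f"Sample {interview_type} question 2"]

def transcribe(audio_path):                   # Whisper: speech-to-text
    return "transcribed answer text"

def evaluate_content(question, transcript):   # Gemini: accuracy / relevance
    return {"score": 7, "comments": "covers the main points"}

def evaluate_grammar(transcript):             # Gemini: grammar / fluency
    return {"score": 8, "corrections": []}

def analyse_confidence(audio_path):           # Wav2Vec + SER: delivery
    return {"label": "High Confidence", "probability": 0.81}

def run_interview(interview_type, answer_files):
    report = []
    for question, audio_path in zip(generate_questions(interview_type), answer_files):
        transcript = transcribe(audio_path)
        report.append({
            "question": question,
            "content": evaluate_content(question, transcript),
            "grammar": evaluate_grammar(transcript),
            "confidence": analyse_confidence(audio_path),
        })
    return report                              # feeds the feedback page

print(run_interview("technical", ["answer1.wav", "answer2.wav"]))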


CHAPTER 4
IMPLEMENTATION

4.1 Implementation Details

1. Creating an Interview
The platform uses the Gemini API to generate personalized interview questions based on
the user's needs. Users can choose between entering a job description, manually selecting a
topic, or selecting a company-specific interview. In the case of a job description, the input
text is run through natural language processing (NLP) to extract relevant information, such
as the role, responsibilities, and required skills. These details are then inserted into the prompt
sent to the Gemini API. For topic- or company-specific interviews, prompts are designed to elicit
highly relevant questions. The questions returned by the API are organized and displayed
next to an "Answer" button, which users can click to record their audio responses.
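
A minimal sketch of this question-generation step is shown below, assuming the google-generativeai package; the model name, prompt wording, and parsing are illustrative assumptions rather than the project's exact configuration.

# Sketch of question generation via the Gemini API.
# The API key placeholder, model name and prompt wording are illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder
model = genai.GenerativeModel("gemini-1.5-flash")

def generate_questions(topic: str, count: int = 5) -> list[str]:
    prompt = (f"You are an interviewer. Generate {count} {topic} interview questions, "
              "one per line, without numbering or extra commentary.")
    response = model.generate_content(prompt)
    return [line.strip() for line in response.text.splitlines() if line.strip()]

questions = generate_questions("data science")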

2. Capturing and Managing Responses


Each question has an interactive button that starts recording, allowing users to answer
the questions. The recorder uses browser tools such as the Web Audio API and the
MediaRecorder interface to capture audio. The recorded responses are stored temporarily and
linked to their questions for easy management. The platform provides users with controls,
such as the ability to pause, review, or re-record a response, to improve the user experience.
When the user clicks the button to end the interview, all recorded responses are processed
and sent to the backend for analysis.

3. Methodology
To assess whether the answers are correct, the recorded audio is first transcribed into text
using the OpenAI Whisper model. Whisper's robustness allows it to cope with different
speaking rates, pitches, and levels of background noise while preserving transcription quality.
The transcript is then sent to the Gemini API together with the original question, and the
answer is evaluated through a structured prompt. The Gemini API analyzes the answer for
relevance and accuracy and returns feedback that includes a score and recommendations.
These comments indicate where responses can be improved or expanded to increase
accuracy.
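
A minimal sketch of this transcription-plus-evaluation step is given below, assuming the openai-whisper and google-generativeai packages; the prompt wording and expected response format are illustrative assumptions.

# Sketch of the accuracy check: Whisper transcription followed by a Gemini prompt.
import whisper
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")           # hypothetical placeholder
stt_model = whisper.load_model("base")
llm = genai.GenerativeModel("gemini-1.5-flash")

def evaluate_answer(question: str, audio_path: str) -> str:
    transcript = stt_model.transcribe(audio_path)["text"]
    prompt = (
        "Evaluate the following interview answer for relevance and accuracy.\n"
        f"Question: {question}\n"
        f"Answer: {transcript}\n"
        "Reply with a score out of 10 and two short suggestions for improvement."
    )
    return llm.generate_content(prompt).text

print(evaluate_answer("Explain overfitting.", "answer_01.wav"))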


4. Grammar Review
For grammar checking, the platform again uses the Gemini API, this time on the transcript
produced by OpenAI Whisper. The transcribed text is sent to Gemini with instructions to
check grammar, sentence structure, and word choice. The API returns feedback, including
corrections and suggestions to improve clarity and language quality. The site displays this
feedback in an easy-to-understand format, highlighting specific grammatical errors in the
text. This way, users can identify and correct their mistakes and improve their language
skills over time.
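
A small sketch of this grammar prompt, again assuming the google-generativeai package with an illustrative prompt:

# Sketch of the grammar check prompt sent to Gemini for a Whisper transcript.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")           # hypothetical placeholder
llm = genai.GenerativeModel("gemini-1.5-flash")

def grammar_feedback(transcript: str) -> str:
    prompt = ("Check the following interview answer for grammar, sentence structure "
              "and word choice. List each error with a corrected version, then give "
              "one overall suggestion.\n\n" + transcript)
    return llm.generate_content(prompt).text

print(grammar_feedback("I has worked on three data science project last year."))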

5. Reliability Analysis
The reliability (confidence) analysis is performed with pre-trained Wav2Vec models
fine-tuned on speech data. The recorded responses are processed to extract characteristics
such as pitch, loudness, articulation, and pauses. These features are analyzed to identify
signs of confidence or hesitation in the user's voice. The model generates a confidence
score and flags specific issues such as monotonous speech, excessive pauses, or overly fast
speech. The report gives users a clear understanding of how confident they sound and
indicates areas to work on to improve their delivery.
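
Confidence scoring with a Wav2Vec2 classification head could be sketched as below. This assumes the transformers, torch, and librosa packages; the checkpoint shown is a generic base model rather than a confidence-tuned one, so the labels and probabilities here are illustrative only.

# Sketch of confidence scoring with a Wav2Vec2 sequence-classification head.
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

CHECKPOINT = "facebook/wav2vec2-base"   # a fine-tuned confidence model would go here
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
labels = {0: "Low Confidence", 1: "High Confidence"}

def confidence_score(audio_path: str) -> dict:
    speech, sr = librosa.load(audio_path, sr=16000)      # Wav2Vec2 expects 16 kHz audio
    inputs = extractor(speech, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    idx = int(probs.argmax())
    return {"label": labels[idx], "probability": float(probs[idx])}

print(confidence_score("answer_01.wav"))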

6. Collecting and displaying responses


Once the platform has processed the recorded responses, the results of the accuracy, grammar,
and confidence analysis are compiled into a detailed report. The report is divided into
sections, each containing explanations and supporting details. Visual elements such as
charts and highlighted text help users quickly identify their strengths and weaknesses.
Suggestions for improvement are included, providing concrete guidance for refining future
answers. These structured, comprehensive reports ensure that users receive actionable
information from every practice interview.


4.2 Algorithm Used

1. Load Dataset
• Import dataset from CSV file with each row containing an audio file path and a confi-
dence label.
• Define confidence labels as numerical values (e.g., High Confidence = 1, Low
Confidence = 0).

2. Preprocess Data

• For each audio file:


o Load the audio waveform x(t), where t represents time samples.
o Transform the raw waveform into input embeddings:
▪ Use a feature extractor f(⋅) to convert x(t) into embeddings zi, where zi = f(xi).
▪ Normalize the embeddings to ensure consistent scaling.

3. Initialize Model

• Load a pre-trained Wav2Vec model with added classification head h(⋅) for confidence
prediction.
• Model outputs logits, y=h(z), where y is the raw prediction before applying activation.

4. Define Loss Function

• Use Cross-Entropy Loss, L, to measure the prediction error:

L = -(1/N) Σi [ yi log(ŷi) + (1 - yi) log(1 - ŷi) ]

where yi is the true label, ŷi is the predicted probability after applying a softmax
activation, and N is the number of samples in the batch.


5. Configure Training Parameters


• Set parameters such as:
o Learning rate α
o Batch size B
o Number of epochs N

6. Train Model
• For each batch:
o Pass embeddings zi through the model.
o Calculate loss L for the batch.
o Update model parameters using gradient descent:

θ ← θ − α ∇θ L

where θ represents the model parameters and α is the learning rate.

7. Evaluate Model
• Pass evaluation data through the model.
• Compute accuracy A on the evaluation set:

A = (number of correct predictions) / (total number of predictions)
8. Deploy and Test


• For a new audio input:
o Extract embeddings z from audio.
o Pass z through the model to obtain prediction y.
o Map y back to confidence label (e.g., High Confidence or Low Confidence).
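
The eight steps above can be sketched end-to-end as follows. This is a minimal illustration, assuming the torch, transformers, librosa, and pandas packages; the CSV layout (columns "path" and "label") and all hyperparameters are assumptions rather than the project's actual configuration.

# Minimal sketch of steps 1-8: fine-tuning Wav2Vec2 with a classification head
# for confidence prediction on a CSV of (path, label) pairs.
import pandas as pd
import torch
import librosa
from torch.utils.data import DataLoader
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base",
                                                          num_labels=2)

df = pd.read_csv("confidence_dataset.csv")     # columns: path, label (0 = low, 1 = high)

def collate(rows):
    # Load each clip at 16 kHz, extract padded input embeddings, attach labels.
    audio = [librosa.load(r.path, sr=16000)[0] for r in rows]
    batch = extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
    batch["labels"] = torch.tensor([r.label for r in rows])
    return batch

loader = DataLoader(list(df.itertuples()), batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        out = model(**batch)        # cross-entropy loss is computed from "labels"
        out.loss.backward()         # gradient step: theta <- theta - lr * grad
        optimizer.step()
        optimizer.zero_grad()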


4.3 Comparative Study

The below table shows the different models used for comparison:
Table 4.1 Comparison of Models

[1] Technique: Android-based Google speech recognition with Hidden Markov Models (HMM). Objective/Feature: Convert speech to text for users with disabilities, enabling SMS sending by voice. Existing algorithms: HMM. Advantages: Accessible for users with disabilities; cloud-based processing. Demerits: Requires internet and is limited to English.

[2] Technique: Gaussian Mixture Models (GMM) with MFCC for feature extraction. Objective/Feature: Recognize isolated words for users with disabilities. Existing algorithms: GMM, MFCC, Expectation-Maximization (EM). Advantages: High accuracy in low-noise environments. Demerits: Limited to isolated words; not suitable for continuous speech.

[3] Technique: MFCC and HMM for feature extraction and recognition. Objective/Feature: Assist hearing-impaired individuals by converting isolated words to text, with an educational focus. Existing algorithms: HMM with forward algorithm, MFCC. Advantages: Reliable in controlled settings, achieving 87.6% accuracy. Demerits: Degrades in noisy conditions and with complex state models.

[4] Technique: Feature extraction (MFCC, LPCC) and machine learning classifiers (e.g., GMM, SVM). Objective/Feature: Improve SER by analyzing prosodic and spectral features in speech. Existing algorithms: GMM, HMM, SVM, ANN. Advantages: High accuracy in speaker-dependent SER; potential real-world uses in psychiatry and automated systems. Demerits: Lower accuracy in speaker-independent models; challenges in transient emotion detection.

[5] Technique: Multi-task pre-training with ASR and sentiment classification. Objective/Feature: Enhance SER by making ASR models sentiment-aware for better emotion prediction. Existing algorithms: ASR, bidirectional LSTM, CTC loss. Advantages: Better SER accuracy, especially for valence; reduced need for labeled data. Demerits: High computational cost; limited improvements in some emotion dimensions.

[6] Technique: DED for contextual feature extraction and CNN for classification. Objective/Feature: Improve SER by considering conversational context. Existing algorithms: CNN, IEMOCAP dataset, utterance-based emotion detection. Advantages: Better accuracy with context-based emotions; robust classification. Demerits: Limited to certain emotion classes; dependent on contextual data quality.

[7] Technique: wav2vec unsupervised CNN-based pre-training with contrastive loss. Objective/Feature: Enhance speech recognition by using large-scale unlabeled audio data. Existing algorithms: Hidden Markov Models (HMM), Deep Speech, RNN-based ASR. Advantages: Reduced word error rate; effective on limited labeled data. Demerits: Requires high computational resources for pre-training.

[8] Technique: LSTM-based deep learning for sequence data. Objective/Feature: Classify emotions in speech by analyzing acoustic features. Existing algorithms: SVM, Gaussian Mixture Models (GMM), HMM. Advantages: High accuracy with temporal data; noise-resistant. Demerits: High computational requirements; depends on data quality.

[9] Technique: CMU Sphinx for transcription, Librosa for feature extraction, and machine learning (HMMs, GMMs, SVMs, DNNs) for emotion detection. Objective/Feature: Accurately transcribe speech and analyze tone for applications like customer service. Existing algorithms: HMMs, GMMs, SVMs, DNNs. Advantages: Accurate transcription and real-time emotion analysis. Demerits: Performance may vary with accents and noise; high computational cost.

[10] Technique: RNNs, LSTMs, and transformer models (e.g., GPT) for tasks like text classification and sentiment analysis. Objective/Feature: Enhance natural language processing for applications in sentiment detection and conversational agents. Existing algorithms: Lexicon-based methods, SVMs, HMMs, CRFs, LSTMs, transformer models. Advantages: High accuracy, adaptability, and scalability across languages. Demerits: Computationally intensive; challenges with sarcasm and bias.


CHAPTER 5

RESULTS AND DISCUSSION


The model's performance is evaluated using metrics such as accuracy, loss, and the
confusion matrix. The main goal is to minimize the loss, as it represents the overall model
error and indicates how well the model has been trained. Accuracy, on the other hand,
measures the proportion of correct predictions out of all predictions made. For a deeper
understanding, an accuracy plot and a confusion matrix are presented.

Fig.5.1 Accuracy vs Trees

The graph in Fig 5.1 shows the relationship between the number of trees in the
random forest model and the accuracy achieved. Overall accuracy increases with the number
of trees, although the gains become smaller in some regions. The highest accuracy achieved
is about 0.84 when the number of trees is between 100 and 200. Beyond a certain threshold,
adding more trees yields diminishing returns.


Fig 5.2. Confusion Matrix

The confusion matrix shown in Fig 5.2 provides a detailed breakdown of the model's
classification performance.
• The matrix indicates that Class 1 achieved 86 correct predictions but had 32
misclassifications into Class 2.
• Class 2 exhibited moderate performance with 73 correct predictions but 19 instances
incorrectly classified as Class 1.
• Class 3 demonstrated the most accurate predictions, with 112 out of 112 instances
correctly classified, showing no misclassifications.
The confusion matrix emphasizes that while the overall model accuracy is
commendable, misclassifications occur primarily between Class 1 and Class 2. This
suggests opportunities for further optimization, particularly in distinguishing features
between these two classes.
In conclusion, the accuracy trends and confusion matrix collectively demonstrate the
model's performance. While the results are promising, minor misclassifications indicate
areas for improvement, such as fine-tuning hyperparameters or enhancing feature
selection for better class separation.
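
For reference, curves like those in Fig 5.1 and Fig 5.2 could be produced with the short sketch below, assuming scikit-learn; the synthetic X and y stand in for the project's extracted audio features and labels, which are not reproduced here.

# Sketch of an accuracy-vs-trees sweep and confusion matrix for a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)   # placeholder features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for n_trees in (10, 50, 100, 150, 200, 300):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_tr, y_tr)
    print(n_trees, "trees -> accuracy", accuracy_score(y_te, clf.predict(X_te)))

# Confusion matrix for the last model (rows: true class, columns: predicted class).
print(confusion_matrix(y_te, clf.predict(X_te)))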


CHAPTER 6
TESTING
This section highlights some testcases that are used to check the results in the
application developed.

Fig 6.1. Homepage

Figure 6.1 displays the Interview Simulator's welcoming screen, featuring a clean design. The
"Start Practicing" button invites users to dive into their interview preparation.

Fig 6.2. Select type of Interview

Figure 6.2 depicts the page for selecting the type of interview. The Interview Simulator offers three
ways to customize the interview experience. Users can manually enter specific topics, upload a
job description for tailored questions, or simulate interviews with particular companies. The
screenshot shows the user has entered "data science" as a topic and has clicked "Generate
Questions."

Fig 6.3. Questions displayed sequentially

The above Figure 6.3 shows an interview question within the simulator. The question asks
the user to describe their experience with various data science techniques and methodologies,
providing specific examples of how they've been applied to solve real-world problems. Buttons
for "Answer" and "Skip" are provided. The design is clean and user-friendly.


Fig 6.4. Completion of the interview

The above Figure 6.4 depicts the screen that is displayed when the user is done
answering all questions.

Fig 6.5. Feedback page


The above Figure 6.5 shows the feedback screen of the Interview Simulator. After the
user provides an answer, the simulator analyzes their response and delivers feedback
on various aspects, such as accuracy, grammar, and speech confidence. This
information allows users to identify areas for improvement and refine their interview
skills.


CHAPTER 7

CONCLUSION

7.1 Conclusion

The audio-based interview simulator addresses modern interview preparation challenges by
leveraging advanced machine learning and natural language processing to provide users
with a conversational environment that mimics real-life scenarios while delivering
detailed, accurate, and grammatically sound feedback. By integrating cutting-edge AI
tools such as Whisper, Gemini, and Wav2Vec, the platform goes beyond traditional
chatbot designs, enabling self-assessment and encouraging continuous improvement in
communication skills. It helps users identify and correct weaknesses in their responses,
enhancing their confidence and communication. Additionally, the system facilitates
real-time analysis and correction, transforming interview preparation into an engaging
and valuable process and helping candidates improve across technical, HR, and
behavioural interviews.

7.2 Future Scope

The audio-based interview trainer can be enhanced with multilingual support, adaptive
learning with individualized coaching, and specialized communication courses.
Advanced emotional intelligence and behavioural analysis could provide deeper insight
into non-verbal communication, while gamification could increase user engagement. A
mobile-friendly version with an offline mode would further increase accessibility. In
addition, integration with job portals and continuous updates to the AI models would
keep the platform aligned with market needs, making it a strong tool for preparing for
interviews worldwide.


REFERENCES

[1] M. S. Miraskar, S. M. Mali, and P. R. Fulpagare, "Speech to Text Conversion Using


Android Platform," International Journal of Scientific & Engineering Research, vol. 4, no. 5, pp.
50–56, May 2013.
[2] A. Sharma and R. K. Agrawal, "Speech to Text Converter Using Gaussian Mixture Model
(GMM)," International Journal of Computer Applications, vol. 45, no. 10, pp. 27–30, May 2012.
[3] S. R. Verma and R. Gupta, "Speech-To-Text Conversion (STT) System Using Hidden
Markov Model (HMM)," International Journal of Electronics, Communication and Soft
Computing Science & Engineering, vol. 3, no. 6, pp. 15–19, Jun. 2014.
[4] A. B. Ingale and D. S. Chaudhari, "Speech Emotion Recognition," International Journal
of Soft Computing and Engineering (IJSCE), vol. 2, no. 1, pp. 235–238, Mar. 2012.
[5] A. Ghriss, B. Yang, V. Rozgic, E. Shriberg, and C. Wang, "Sentiment-Aware Automatic
Speech Recognition Pre-Training for Enhanced Speech Emotion Recognition," in 2022 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7347–
7351.
[6] S. N. Atkar, R. Agrawal, C. Dhule, N. Chavan Morris, P. Saraf, and K. Kalbande, "Speech
Emotion Recognition Using Dialogue Emotion Decoder and CNN Classifier," in Proceedings of
the Second International Conference on Applied Artificial Intelligence and Computing (ICAAIC
2023), 2023, pp. 94–99.
[7] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised Pre-training
for Speech Recognition," Facebook AI Research, 2019. [Online]. Available:
https://github.com/pytorch/fairseq.
[8] D. Upadhyay, A. Tiwari, and H. Niathani, "Sequential Pattern Analysis of Emotion in
Speech With LSTM," in Proc. 2024 2nd Int. Conf. Device Intell., Comput. Commun. Technol.
(DICCT), 2024, pp. 120–125, doi: 10.1109/DICCT61038.2024.10532789.
[9] T. Ricketts, "Speech Recognition Application with Tone Analyzer," M.S. thesis, Dept. of
Computer Science, Alabama A&M Univ., Normal, AL, USA, 2023.
[10] S. P. Pattyam, “AI-Enhanced Natural Language Processing: Techniques for Automated
Text Analysis, Sentiment Detection, and Conversational Agents,” J. Artif. Intell. Res. Appl., vol.
1, no. 1, pp. 371–406, Jan.–Jun. 2021.


APPENDIX

Datasets
The project utilizes the following datasets to train and evaluate the system effectively:

1. The LJ Speech Dataset: This dataset contains 13,100 short audio clips of a single
speaker reading passages from various texts, paired with text transcripts. It is
instrumental for training speech-to-text models and evaluating grammar,
pronunciation, and accuracy.
(Source: https://www.kaggle.com/datasets/mathurinache/the-lj-speech-dataset)

2. Voice-Based Confidence Recognizer: This dataset includes audio recordings


labelled with confidence scores and emotional tones, aiding in the training and
analysis of speech confidence and delivery patterns.
(Source: https://www.kaggle.com/datasets/swarupakulkarni/voice-based-confidence-recognizer)

These datasets provide diverse, high-quality audio samples and annotations essential
for building and refining the interview simulation platform.

Figure 1. The LJ Speech Dataset


Figure 2. Voice-Based Confidence Recognizer Dataset

