Final Project Report
By
VISHAL PREMNATH (1CR21CS211)
CERTIFICATE
Certified that the project work entitled “Audio-based Interview Simulator with Feedback on
Confidence, Accuracy and Grammar using ML models” carried out by Mr. Vishal P, USN
1CR21CS211, Mr. Varun Tejas, USN 1CR21CS207, bonafide students of CMR Institute of
Technology, in partial fulfillment for the award of Bachelor of Engineering in Computer Science and
Engineering of the Visvesvaraya Technological University, Belgaum during the year 2023-2024. It is
certified that all corrections/suggestions indicated for Internal Assessment have been incorporated in the
Report deposited in the departmental library.
The project report has been approved as it satisfies the academic requirements in respect of Project work
prescribed for the said Degree.
External Viva
1.
2.
DECLARATION
We, the students of Computer Science and Engineering, CMR Institute of Technology, Bangalore,
declare that the work entitled "Audio-based Interview Simulator with Feedback on
Confidence, Accuracy and Grammar using ML models" has been successfully completed
under the guidance of Mrs. Krishna Sowjanya K, Computer Science and Engineering Department,
CMR Institute of Technology, Bangalore. This dissertation work is submitted in partial fulfillment
of the requirements for the award of the Degree of Bachelor of Engineering in Computer Science and
Engineering during the academic year 2023-2024. Further, the matter embodied in the project
report has not been submitted previously by anybody for the award of any degree or diploma to
any university.
Place: Bangalore
Date:
ABSTRACT
The AI-Enhanced Virtual Interview Simulation Platform offers users a holistic and
interactive environment in which to prepare for job interviews in technical or human resources
domains. It uses NLP and machine learning technologies to assess responses from users and
provides them with constructive feedback on how confident, accurate, grammatically correct, and
effective in communication they have been. Through detailed performance analysis, users can
recognize what has to be improved and increase their self-confidence. The platform pairs an
intuitive front end for easy navigation with a Django back end that forms the basis of the
AI-reinforced assessment process. Acting as a sophisticated, highly individualized interview coach,
it is set to help users improve their capabilities and prepare for successful interviews.
ACKNOWLEDGEMENT
I take this opportunity to express my sincere gratitude and respect to CMR Institute of
Technology, Bengaluru, for providing me a platform to pursue my studies and carry out my
final year project.
It gives me great pleasure to express my deep sense of gratitude to Dr. Sanjay Jain,
Principal, CMRIT, Bangalore, for his constant encouragement.
I would like to thank Dr. Keshavamoorthy, Professor and Head, Department of
Computer Science and Engineering, CMRIT, Bangalore, who has been a constant source of
support and encouragement throughout the course of this project.
I consider it a privilege and an honor to express my sincere gratitude to my guide,
Mrs. Krishna Sowjanya K, Assistant Professor, Department of Computer Science and
Engineering, for her valuable guidance throughout the tenure of this project.
I also extend my thanks to all the faculty of Computer Science and Engineering who
directly or indirectly encouraged me.
Finally, I would like to thank my parents and friends for all the moral support they have
given me during the completion of this work.
TABLE OF CONTENTS
Page No.
Certificate ii
Declaration iii
Abstract iv
Acknowledgement v
Table of contents vi
List of Figures viii
1 INTRODUCTION 1
1.1 Relevance of the Project 2
1.2 Problem Statement 2
1.3 Objective 3
1.4 Scope of the Project 4
2 LITERATURE SURVEY 5
2.1 Android-Based Speech-to-Text System 5
2.2 Speech-to-Text Conversion Using Gaussian Mixture Model (GMM) 6
2.3 Speech-to-Text System Using Hidden Markov Model (HMM) 6
2.4 Speech Emotion Recognition 7
2.5 Sentiment-Aware Automatic Speech Recognition Pre-Training for 8
Enhanced Speech Emotion Recognition
2.6 Speech Emotion Recognition Using Dialogue Emotion Decoder 9
and CNN Classifier
2.7 Unsupervised Pretraining for CNNs Using Wav2Vec with Contrastive 10
Loss
2.8 Sequential Pattern Analysis of Emotion in Speech With LSTM 11
2.9 Speech Recognition Application with Tone Analyzer 11
2.10 AI-Enhanced Natural Language Processing: Techniques for 12
Automated Text Analysis, Sentiment Detection, and Conversational
Agents
3 PROPOSED MODEL 14
3.1 System Architecture 14
3.2 Flowchart 16
4 IMPLEMENTATION 18
4.1 Implementation Details 18
4.2 Algorithm Used 20
4.3 Comparative Study 21
5 RESULTS AND DISCUSSION 24
6 TESTING 26
7 CONCLUSION 29
7.1 Conclusion 29
7.2 Future Scope 29
REFERENCES 30
APPENDIX 31
LIST OF FIGURES
Page No.
Fig 3.1 System Architecture 14
Fig 3.2 Flowchart for the project 17
Fig 5.1 Accuracy vs Trees 24
Fig 5.2 Confusion Matrix 25
Fig 6.1 Homepage 26
Fig 6.2 Select Type of Interview 26
Fig 6.3 Questions displayed sequentially 27
Fig 6.4 Completion of the interview 28
Fig 6.5 Feedback page 28
CHAPTER 1
INTRODUCTION
The AI-Powered Interview Simulation Platform meets a pressing need of the increasingly
competitive job market: preparing job seekers for interviews. Traditional interview
preparation methods, like mock interviews or simple study questions, lack personal
feedback and real-time interaction. People tend to falter under pressure in areas such as
communication, confidence, and response planning, all of which are necessary to succeed
in an interview.
The platform addresses the challenge of obtaining timely feedback on interview
performance. In most cases, applicants rarely get an opportunity to work with an
interviewer or receive a comprehensive assessment of their language, content, and
delivery style. As a result, job seekers feel unprepared and nervous, especially when
applying for the jobs they most want.
Therefore, this platform uses NLP and algorithms that simulate real interview
conditions. Participants answer technical, HR, and behavioral questions. The system
examines responses and gives feedback on language, content accuracy, and presentation.
The frontend will be developed using Bootstrap (CSS) and vanilla JavaScript for an
interactive and responsive user experience. The backend, powered by Django, will take
responsibility for the AI-driven response evaluation. This project will help users improve
their interview skills, enhance their confidence, and ultimately increase their chances of
getting job offers.
1.3 Objectives
The objectives of the AI-Powered Interview Simulation Platform are to give the
user a comprehensive and effective tool to prepare for an interview.
• Artificial Intelligence-based Framework Mirroring Real-Time Interview Scenarios:
Design an AI framework capable of mirroring realistic scenarios of job interviews
related to several fields, for instance technical, HR, or behavioral interviews. This is
expected to give users a better practice experience.
• NLP and Machine Learning: Use advanced NLP techniques and machine learning
algorithms to evaluate user feedback in real time. The system will provide personalized
feedback, focusing on critical components such as confidence, precision, and
grammatical correctness, which in turn enhances the overall quality of the responses.
• Communication and Confidence: Provide users with tips so that they can understand
areas that need improvement and, with constant practice, develop confidence. Personal
weaknesses pertaining to communication, accuracy, and presentation can then be
addressed, resulting in stronger performance during interviews.
• Create an Accessible, User-Centric Platform: Design an engaging and interactive
interface with the use of modern frontend technologies to ensure a captivating user
experience. The backend infrastructure, powered by Django, will allow for seamless
AI-driven analysis, thereby ensuring continuous engagement and personalized
feedback throughout the interview preparation process.
1.4 Scope of the Project
The scope of this project ranges from designing and developing an AI-based
interview simulation platform to improving the interview preparation process. It
extends from a deep analysis of interview preparation techniques already in use,
acknowledging their shortcomings, to applying cutting-edge technologies such
as NLP and machine learning algorithms. These technologies will be applied to help
mimic real-life interview settings across technical, human resource, and behavioral
interviews. The platform will be built to give individuals customized feedback on aspects
such as confidence, accuracy, and grammar, helping them improve their performance.
The project will also focus on improving accessibility and user experience through an
interactive frontend interface and a strong Django backend framework. In addition, the
scope includes refining and optimizing the
feedback mechanism to ensure that it provides real-time, actionable insights that enable
constant improvement. This study aims to explore the use of AI-driven analytical
techniques to provide accurate feedback and monitor user advancement in addressing
deficiencies in current interview preparation strategies, including the lack of
personalized, instant feedback and customized practice sessions. The purpose of this
project is to revolutionize interview preparation through successful integration of AI
with user-centric design, thus offering users more efficient, accessible, and effective
tools for building confidence, improving communication, and increasing chances of
success in job interviews.
CHAPTER 2
LITERATURE SURVEY
Recent research on AI-based interview preparation stresses the use of Natural
Language Processing and machine learning technologies to simulate real-life interview
conditions. Techniques, including speech-to-text and emotion analysis, give contextual
feedback in areas such as confidence, accuracy, and grammatical correctness. While
these advancements are notable, challenges in the form of uninterrupted speech
recognition and the accuracy of real-time feedback are still prevalent. This section
discusses the approaches, methods, and limitations of AI-based interview training
systems.
These extracted coefficients serve as inputs to the HMM, which compares them to
predefined speech models to identify the most likely word. The HMM's performance can
be optimized by adjusting the number of states, balancing model complexity with accuracy.
This approach results in a high recognition rate for isolated words, making it particularly
effective in controlled environments.
[4] The methodology for Speech Emotion Recognition (SER) in this study follows a well-
structured approach, incorporating multiple stages including emotional speech input, feature
extraction, feature selection, and classification. The system starts by extracting prosodic features
such as pitch, energy, MFCC (Mel Frequency Cepstral Coefficients), and LPCC (Linear
Predictive Cepstral Coefficients), which are significant indicators of various emotions. These
features are then filtered during feature selection to enhance the accuracy of classification by
eliminating irrelevant or redundant data.
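To make this kind of prosodic and spectral feature extraction concrete, the hedged sketch below uses the librosa library to compute MFCC, pitch, and energy features from an audio clip; the file name and parameter values are illustrative assumptions rather than details taken from the paper.

```python
# Hedged illustration of feature extraction with librosa; the file name and
# parameter values are placeholders, not details from the surveyed paper.
import librosa
import numpy as np

audio, sr = librosa.load("utterance.wav", sr=16000)

# 13 Mel Frequency Cepstral Coefficients, averaged over time.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).mean(axis=1)

# Fundamental frequency (pitch) contour estimated with the YIN algorithm.
f0 = librosa.yin(audio, fmin=50, fmax=300, sr=sr)
pitch_mean = float(np.nanmean(f0))

# Short-term energy approximated by the mean root-mean-square amplitude.
energy = float(librosa.feature.rms(y=audio).mean())

features = np.concatenate([mfcc, [pitch_mean, energy]])
print(features.shape)  # (15,) feature vector passed on to a classifier
```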
For emotion classification, the system uses several classifiers, including Gaussian Mixture
Model (GMM), Hidden Markov Model (HMM), Support Vector Machine (SVM), Artificial
Neural Network (ANN), and K-Nearest Neighbours (KNN). These classifiers are used to
classify emotions such as anger, happiness, and sadness. The performance of each classifier
is evaluated under both speaker-dependent and speaker-independent models, assessing their
ability to generalize across different speakers. The system's evaluation metrics mainly focus
on classification accuracy, with potential real-world applications in fields such as psychiatry
and automated systems.
[5] The paper proposes a sentiment-aware pre-training strategy for Automatic Speech
Recognition (ASR), in which sentiment labels are generated from a text-to-sentiment model
applied to a large dataset, allowing the ASR model to learn both speech recognition and
sentiment analysis simultaneously. This sentiment-aware ASR (SA2SR) model is trained
using a combination of Connectionist Temporal Classification (CTC) loss for speech
recognition and cross-entropy loss for sentiment classification.
Once pre-trained, the model is fine-tuned on the MSP-Podcast dataset, which is
specifically used to optimize the model’s ability to predict emotional dimensions like
activation, valence, and dominance (AVD). These dimensions represent core emotional
states that are crucial for accurate emotion detection in speech. The experimental results
demonstrate significant improvements in SER performance, particularly in predicting
valence, which refers to the degree of positivity or negativity in speech.
This innovative multi-task approach not only improves the accuracy of emotion
recognition but also integrates sentiment analysis directly into the ASR process,
offering a more holistic understanding of spoken language and emotion.
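To make the multi-task objective concrete, the hedged PyTorch sketch below combines a CTC loss for transcription with a cross-entropy loss for sentiment; the tensor shapes, the three-class sentiment head, and the 0.3 weighting factor are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of a joint objective: CTC loss for speech recognition plus
# cross-entropy loss for sentiment. All shapes and weights are illustrative.
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss = nn.CrossEntropyLoss()

T, N, V = 100, 8, 32                                   # frames, batch size, vocabulary size
log_probs = torch.randn(T, N, V, requires_grad=True).log_softmax(dim=-1)  # ASR head output
targets = torch.randint(1, V, (N, 20))                 # token targets (index 0 is the CTC blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

sentiment_logits = torch.randn(N, 3, requires_grad=True)   # negative / neutral / positive
sentiment_labels = torch.randint(0, 3, (N,))

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths) \
       + 0.3 * ce_loss(sentiment_logits, sentiment_labels)
loss.backward()  # in a real model both terms back-propagate into a shared encoder
```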
[6] The paper presents a methodology that combines the Dialogue Emotion Decoder
(DED) with a Convolutional Neural Network (CNN) classifier to enhance Speech
Emotion Recognition (SER). The process begins with the preprocessing of audio data
from the IEMOCAP dataset, which includes cleaning and preparing the data for
emotion analysis. DED is then used for feature extraction, where it utilizes the context
of previous utterances to interpret the emotions in the current speech. This contextual
analysis improves the system’s ability to recognize emotions more accurately by taking
into account the emotional flow of the conversation.
This approach enables the categorization of emotions into five key states: anger,
happiness, neutrality, sadness, and excitement. Once the features are extracted, the
CNN classifier processes them to predict the emotional class. The CNN model is
designed with convolutional layers to capture local patterns in the data, followed by
pooling layers to reduce dimensionality and enhance generalization. The network
concludes with fully connected layers that combine the features and output the
predicted emotion.
The system produces both true and false classifications, indicating the predicted
emotional state and its corresponding confidence level. This methodology effectively
captures complex emotional patterns, making it highly useful for accurate emotion
recognition in spoken language.
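As a rough illustration of such a classifier (not the exact architecture reported in the paper), the PyTorch sketch below stacks convolutional, pooling, and fully connected layers to map a spectrogram-like input to five emotion logits; the layer sizes are assumptions.

```python
# Illustrative CNN emotion classifier with five output classes
# (anger, happiness, neutrality, sadness, excitement); layer sizes are assumptions.
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolution captures local patterns
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling reduces dimensionality
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),                     # fully connected layers output the emotion
        )

    def forward(self, x):                                 # x: (batch, 1, freq_bins, time_frames)
        return self.classifier(self.features(x))

logits = EmotionCNN()(torch.randn(2, 1, 40, 100))
print(logits.shape)  # torch.Size([2, 5])
```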
[8] The paper explores Speech Emotion Recognition (SER) using a Long Short-Term Memory
(LSTM) model. The system employs datasets with labeled emotional speech samples and
extracts key features such as Mel Frequency Cepstral Coefficients (MFCC), pitch, and energy
to train the model. To improve the quality of the dataset, pre-processing steps like noise removal
and silence elimination are performed, ensuring that only the relevant speech data is retained for
analysis.
The model is built using Recurrent Neural Networks (RNNs), optimized specifically for
handling temporal sequence data. An attention mechanism is incorporated into the model to
capture both temporal patterns and feature-specific information, which enhances the model’s
ability to focus on important aspects of the speech signal over time. The process involves feature
extraction using tools like LIBROSA, a popular library for audio processing, to extract the
relevant acoustic features from the speech data.
The training is conducted iteratively, with continuous parameter tuning to optimize the model’s
accuracy. These steps ensure that the LSTM model effectively recognizes and classifies
emotional speech, achieving high accuracy in detecting emotions like anger, happiness, and
sadness. The approach showcases the effectiveness of combining LSTM models with temporal
attention mechanisms for accurate SER.
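A minimal sketch of this idea, assuming MFCC input frames and a simple soft-attention pooling layer (the paper's exact attention formulation may differ), could look as follows:

```python
# Illustrative LSTM with temporal attention over MFCC frames; dimensions,
# emotion classes, and the attention form are assumptions.
import torch
import torch.nn as nn

class AttentiveLSTM(nn.Module):
    def __init__(self, n_mfcc: int = 13, hidden: int = 64, n_emotions: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)              # scores each time step
        self.out = nn.Linear(hidden, n_emotions)      # e.g. anger / happiness / sadness

    def forward(self, x):                              # x: (batch, time, n_mfcc)
        h, _ = self.lstm(x)                            # (batch, time, hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # temporal attention weights
        context = (weights * h).sum(dim=1)             # attention-weighted summary
        return self.out(context)                       # emotion logits

logits = AttentiveLSTM()(torch.randn(4, 200, 13))      # 4 clips, 200 MFCC frames each
print(logits.shape)  # torch.Size([4, 3])
```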
[9] The thesis introduces a speech recognition application enhanced with a tone analyzer to
address common challenges in speech interpretation, such as accent, tone, and
mispronunciation. The methodology involves using CMU Sphinx for speech-to-text
conversion, along with Python libraries like Librosa for feature extraction, which are crucial for
analyzing speech characteristics. The system is trained and evaluated using the Surrey Audio-
Visual Expressed Emotion (SAVEE) dataset, which provides emotional speech samples for
more accurate emotion detection.
The system integrates multiple algorithms, including Hidden Markov Models (HMMs),
Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs), and Deep Neural
Networks (DNNs), to perform acoustic modeling and pitch tracking. These algorithms work
together to improve the accuracy of speech recognition and emotional tone analysis. A flowchart
is provided in the thesis to illustrate the Speech Emotion Recognition (SER) process, from the
initial input of audio to the final text transcription and tone analysis.
The experimental results demonstrate the system's ability to accurately transcribe speech while
also analyzing emotional cues, making it highly useful for applications like customer service
and language learning, where understanding tone and emotion is crucial.
[10] The paper titled "Techniques for Automated Text Analysis, Sentiment Detection, and
Conversational Agents" explores a variety of AI-driven approaches, focusing on machine
learning (ML) and deep learning (DL) models, which are applied to natural language processing
(NLP) tasks such as part-of-speech tagging, named entity recognition (NER), sentiment
analysis, and conversational agents. These methods leverage ML algorithms to process large
datasets, identifying language patterns for classification and extraction tasks, thereby improving
efficiency in NLP applications.
In particular, deep learning architectures, such as Recurrent Neural Networks (RNNs) and Long
Short-Term Memory (LSTM) networks, are utilized to capture dependencies within text
sequences. This helps to significantly enhance the accuracy of tasks like sentiment analysis and
machine translation by understanding context over longer passages of text.
The paper also discusses practical challenges faced in NLP applications, including issues with
data quality, bias, and computational requirements. Strategies to address these challenges, such
as model compression and bias reduction, are also covered, aiming to make NLP models more
efficient and fair in real-world applications.
CHAPTER 3
PROPOSED MODEL
3.1 System Architecture
Advanced machine learning and deep learning models are employed to analyse user
responses and provide personalized feedback. The system utilizes Whisper for speech-
to-text conversion, Wav2Vec for feature extraction and confidence analysis, and
Gemini for evaluating tone, grammar, and accuracy, ensuring a comprehensive
assessment of interview performance.
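A high-level sketch of how these components could be wired together is shown below; the Whisper model size, the Gemini model name, the prompt wording, the API key placeholder, and the analyse_confidence helper are illustrative assumptions rather than the project's actual code.

```python
# Hedged sketch of the assessment pipeline: Whisper for transcription, a
# Wav2Vec-based classifier for confidence, and Gemini for content/grammar feedback.
import whisper
import google.generativeai as genai

def assess_answer(audio_path: str, question: str, analyse_confidence) -> dict:
    # 1. Speech-to-text with Whisper.
    transcript = whisper.load_model("base").transcribe(audio_path)["text"]

    # 2. Confidence score from a Wav2Vec-based classifier (hypothetical helper).
    confidence = analyse_confidence(audio_path)

    # 3. Content and grammar feedback from Gemini.
    genai.configure(api_key="YOUR_API_KEY")
    prompt = (f"Question: {question}\nAnswer: {transcript}\n"
              "Evaluate accuracy, relevance and grammar; give a score out of 10.")
    feedback = genai.GenerativeModel("gemini-1.5-flash").generate_content(prompt).text

    return {"transcript": transcript, "confidence": confidence, "feedback": feedback}
```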
3.2 Flowchart
This section discusses the different steps involved in the AI-Powered Virtual
Interview Assessment System. It includes the following steps:
• User Interaction: The user selects the type of interview they wish to practice, such as
technical, HR, or behavioural. The system generates relevant questions based on the
selected interview type.
• Response Collection: The user provides verbal responses to the generated questions.
These responses are captured as audio.
• Speech-to-Text Conversion: The captured audio responses are converted into text
using the Whisper model, ensuring accurate transcription even in varied acoustic
conditions.
• Natural Language Processing (NLP): The transcribed text is processed using Gemini
for analysis. The content is evaluated for accuracy, relevance, and clarity.
• Confidence Analysis: The system assesses the user’s confidence in their responses
using Wav2Vec and Speech Emotion Recognition (SER) techniques.
• Grammar & Language Evaluation: The system analyses the grammatical correctness,
fluency, and overall language proficiency of the transcribed text using Gemini's NLP
capabilities.
The above Figure 3.2 shows the Flowchart for the project. This flowchart depicts the
core workflow of an automated interview assessment system, showcasing how the
proposed system interacts with the user, processes responses, and provides feedback to
enhance interview preparation.
CHAPTER 4
IMPLEMENTATION
4.1 Implementation Details
1. Create an Interview
The platform uses the Gemini API to generate personalized interview questions based on
the user’s needs. Users can choose between entering a job description, manually selecting a
topic, or selecting a company-specific interview. In the case of a job description, the input
text is run through natural language processing (NLP) to extract relevant information, such
as the role, responsibilities, and required skills. These keywords are then inserted into the
prompt sent to the Gemini API. For project or company interviews, prompts are designed to elicit
highly relevant questions. The questions returned by the API are organized and displayed
next to an “Answer” button, which users can interact with to record their audio responses.
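A hedged sketch of this question-generation step, assuming the google-generativeai Python client (the model name, prompt wording, and job description are placeholders), is shown below.

```python
# Illustrative sketch: generating interview questions from a job description.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

job_description = "Data scientist role requiring Python, SQL and ML model deployment."
prompt = (
    "You are an interviewer. Based on the job description below, "
    "generate 5 technical interview questions, one per line.\n\n"
    f"Job description: {job_description}"
)

questions = model.generate_content(prompt).text.strip().splitlines()
for q in questions:
    print(q)   # each question is rendered next to an "Answer" button in the UI
```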
3. Methodology
To ensure that answers are assessed correctly, the recorded audio is first transcribed into
text using the OpenAI Whisper model. Whisper's advanced features enable it to adapt to
different speaking rates, pitches, and levels of background noise, producing reliable
transcripts even from imperfect audio. The transcript is then sent to the Gemini API
together with the original question, and the response is evaluated through a prompt. The
Gemini API analyzes the answer for relevance and accuracy and returns feedback that
includes a score and recommendations for improvement. These comments indicate areas
where responses can be improved or expanded to increase accuracy.
4. Grammar Review
For grammar checking, the platform uses the Gemini API to analyze the transcript
produced by OpenAI Whisper. The transcribed text is sent to Gemini with instructions
to check grammar, sentence structure, and word choice. The API provides feedback,
including corrections and suggestions to improve comprehension and language quality.
The platform displays the feedback in an easy-to-understand format, highlighting specific
grammatical errors in the text. This way, users can identify and correct their mistakes and
improve their language skills over time.
5. Reliability Analysis
The reliability analysis will be performed with pre-trained Wav2Vec models adapted to
the speech data. The recorded responses are processed to extract important cues such as
pitch, loudness, articulation, and pauses. These features will be analyzed to identify signs
of confidence or hesitation in the user's voice. The model will generate a confidence score
and flag specific issues such as monotonous speech, excessive pauses, or overly fast
speech. The report will give users a clear understanding of how confident they sound and
indicate areas for improving their delivery.
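A minimal inference sketch of such a Wav2Vec-based confidence classifier, assuming the Hugging Face transformers library (the checkpoint and the two-label scheme are placeholders, and in practice the classification head must first be trained on labelled confidence data), is given below.

```python
# Illustrative confidence scoring with Wav2Vec 2.0 plus a classification head.
# Checkpoint, file name, and label scheme (0 = low, 1 = high) are assumptions.
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2
)
model.eval()

audio, sr = librosa.load("answer.wav", sr=16000)       # placeholder file name
inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                    # shape: (1, 2)

confidence = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"Estimated confidence: {confidence:.2f}")
```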
4.2 Algorithm Used
1. Load Dataset
• Import the dataset from a CSV file, with each row containing an audio file path and a
confidence label.
• Define confidence labels as numerical values (e.g., High Confidence = 1, Low
Confidence = 0).
2. Preprocess Data
3. Initialize Model
• Load a pre-trained Wav2Vec model with added classification head h(⋅) for confidence
prediction.
• Model outputs logits, y=h(z), where y is the raw prediction before applying activation.
5. Define Loss Function
• Use the cross-entropy loss L = −(1/N) Σi [yi log(ŷi) + (1 − yi) log(1 − ŷi)], where yi is the
true label and ŷi is the predicted probability after applying a softmax activation.
6. Train Model
• For each batch:
o Pass embeddings zi through the model.
o Calculate loss L for the batch.
o Update model parameters using gradient descent: θ ← θ − η ∇θL, where η is the
learning rate.
7. Evaluate Model
• Pass evaluation data through the model.
• Compute accuracy A on the evaluation set: A = (number of correct predictions) / (total
number of evaluation samples).
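A condensed sketch of steps 1 to 7, assuming the Hugging Face Wav2Vec 2.0 implementation and illustrative CSV column names, batch size, and learning rate (none of these values are taken from the project), is given below.

```python
# Condensed, illustrative training/evaluation loop for the confidence classifier.
# CSV columns (path, label), checkpoint, batch size, and learning rate are assumptions.
import pandas as pd
import torch
import librosa
from torch.utils.data import DataLoader
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

df = pd.read_csv("confidence_dataset.csv")              # columns: path, label (1 = high, 0 = low)
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

def collate(rows):
    # Load audio, pad to a common length, and attach labels.
    audio = [librosa.load(r.path, sr=16000)[0] for r in rows]
    batch = extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
    batch["labels"] = torch.tensor([r.label for r in rows])
    return batch

loader = DataLoader(list(df.itertuples()), batch_size=4, collate_fn=collate, shuffle=True)

model.train()
for batch in loader:
    labels = batch.pop("labels")
    logits = model(**batch).logits                      # y = h(z)
    loss = loss_fn(logits, labels)                      # cross-entropy loss L
    loss.backward()
    optimizer.step()                                    # gradient-descent parameter update
    optimizer.zero_grad()

# Evaluation: accuracy A = correct predictions / total predictions.
model.eval()
correct = 0
with torch.no_grad():
    for batch in loader:
        labels = batch.pop("labels")
        preds = model(**batch).logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
print("accuracy:", correct / len(df))
```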
4.3 Comparative Study
The table below shows the different models used for comparison:
Table 4.1 Comparison of Models
[4] Technique: Feature extraction (MFCC, LPCC) with machine learning classifiers (e.g.,
GMM, SVM). Objective: Improve SER by analyzing prosodic and spectral features in
speech. Models compared: GMM, HMM, SVM, ANN. Strengths: High accuracy in
speaker-dependent SER, with potential real-world uses in psychiatry and automated
systems. Limitations: Lower accuracy in speaker-independent models and challenges in
transient emotion detection.
CHAPTER 5
RESULTS AND DISCUSSION
The graph in Fig 5.1 shows the relationship between the number of trees in the
random forest model and the accuracy achieved. Overall accuracy increases with the
number of trees, although the gains are smaller in some ranges. The highest accuracy
achieved is about 0.84 when the number of trees is between 100 and 200. Beyond a
certain threshold, however, adding more trees yields little or no further improvement.
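For reference, a curve like the one in Fig 5.1 can be produced by sweeping the number of trees and recording the test accuracy, as in the hedged scikit-learn sketch below; the synthetic data stands in for the project's extracted audio features and confidence labels.

```python
# Illustrative accuracy-vs-trees sweep for a random forest; synthetic data only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=15, n_classes=3,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for n_trees in (10, 50, 100, 150, 200, 300):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"{n_trees:>3} trees -> accuracy {acc:.3f}")
```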
The confusion matrix shown in Fig 5.2 provides a detailed breakdown of the model's
classification performance.
• The matrix indicates that Class 1 achieved 86 correct predictions but had 32
misclassifications into Class 2.
• Class 2 exhibited moderate performance with 73 correct predictions but 19 instances
incorrectly classified as Class 1.
• Class 3 demonstrated the most accurate predictions, with 112 out of 112 instances
correctly classified, showing no misclassifications.
The confusion matrix emphasizes that while the overall model accuracy is
commendable, misclassifications occur primarily between Class 1 and Class 2. This
suggests opportunities for further optimization, particularly in distinguishing features
between these two classes.
In conclusion, the accuracy trends and confusion matrix collectively demonstrate the
model's performance. While the results are promising, minor misclassifications indicate
areas for improvement, such as fine-tuning hyperparameters or enhancing feature
selection for better class separation.
CHAPTER 6
TESTING
This section highlights some test cases that were used to check the results in the
developed application.
Figure 6.1 displays the Interview Simulator's welcoming screen, featuring a clean design. The
"Start Practicing" button invites users to dive into their interview preparation.
Figure 6.2 depicts the page for selecting the type of interview. The Interview Simulator offers
three ways to customize the interview experience. Users can manually enter specific topics, upload a
job description for tailored questions, or simulate interviews with particular companies. The
screenshot shows the user has entered "data science" as a topic and has clicked "Generate
Questions."
The above Figure 6.3 shows an interview question within the simulator. The question asks
the user to describe their experience with various data science techniques and methodologies,
providing specific examples of how they've been applied to solve real-world problems. Buttons
for "Answer" and "Skip" are provided. The design is clean and user-friendly.
The above Figure 6.4 depicts the screen that is displayed when the user is done
answering all questions.
CHAPTER 7
CONCLUSION
7.1 Conclusion
The audio-based interview training platform can be enhanced with multilingual support,
flexible learning with individualized teaching, and specialized communication courses.
Advanced emotional intelligence and behavioral analysis can provide deep insight into
non-verbal communication, while gamification can increase user engagement. A mobile-
friendly version with an offline interface will increase accessibility. In addition, integration
with job portals and continuous updates to the AI model will keep the platform aligned
with the needs of the market, making it an ideal tool for international interview
preparation.
REFERENCES
APPENDIX
Datasets
The project utilizes the following datasets to train and evaluate the system effectively:
1. The LJ Speech Dataset: This dataset contains 13,100 short audio clips of a single
speaker reading passages from various texts, paired with text transcripts. It is
instrumental for training speech-to-text models and evaluating grammar,
pronunciation, and accuracy.
(Source: https://www.kaggle.com/datasets/mathurinache/the-lj-speech-dataset)
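As a hedged illustration of how this dataset can be loaded (the local path is a placeholder), LJ Speech ships a pipe-delimited metadata.csv that maps each clip ID to its raw and normalized transcripts:

```python
# Illustrative loader for the LJ Speech metadata file; the path is a placeholder.
import csv
import pandas as pd

meta = pd.read_csv(
    "LJSpeech-1.1/metadata.csv",
    sep="|",
    header=None,
    names=["id", "transcript", "normalized_transcript"],
    quoting=csv.QUOTE_NONE,   # transcripts may contain quote characters
)
print(len(meta), "clips")                 # approximately 13,100
print(meta.iloc[0]["normalized_transcript"])
```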
These datasets provide diverse, high-quality audio samples and annotations essential
for building and refining the interview simulation platform.