Batch 13
BACHELOR OF TECHNOLOGY
IN COMPUTER SCIENCE AND ENGINEERING
Submitted By
Professor
CERTIFICATE
This is to certify that the project work entitled “CHATBOT USING FINE TUNED
RANDOM FOREST” is a bonafide work carried out by Ms. G. VARA LAKSHMI
(18KT1A0517), Mr. K. J HARSHA VARDHAN (18KT1A0524), Ms. M. YESASWINI
(18KT1A0530), Mr. V. PRUDHVI CHARAN (18KT1A0559), Ms. M. DEEPIKESWARI
(18KT1A0532) in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in COMPUTER
SCIENCE AND ENGINEERING of Jawaharlal Nehru Technological University, Kakinada, during
the year 2021-2022. It is certified that all corrections/suggestions indicated for internal assessment
have been incorporated in the report. The project report has been approved as it satisfies the academic
requirements in respect of project work prescribed for the above degree.
External Examiner
ACKNOWLEDGEMENT
We owe a great many thanks to the many people who helped, supported, and guided
us at every step.
We are grateful for the support of our principal, Dr. K. Sri Rama Krishna, who inspired
us with his words of dedication and discipline towards work.
We express our gratitude towards Dr. A. Pathanjali Sastri, Professor & HoD of CSE, for
extending his support through training classes, which were a major source of help in carrying out our
project.
We are very much thankful to Dr. Shaik Akbar, Professor and guide of our project, for guiding
us and correcting our various documents with attention and care. He has taken pains to go through
the project and make the necessary corrections as and when needed.
Finally, we thank one and all who directly or indirectly helped us to complete our project
successfully.
Project Associates
M. YESASWINI (18KT5A0530)
M. DEEPIKESWARI (18KT1A0532)
DECLARATION
This is to declare that the project entitled “CHATBOT USING FINE TUNED
RANDOM FOREST”, submitted by us in partial fulfillment of the requirements for the award
of the degree of Bachelor of Technology in Computer Science & Engineering at Potti
Sriramulu Chalavadi Mallikarjuna Rao College of Engineering and Technology, is a
bonafide record of project work carried out by us under the supervision and guidance of Dr.
Shaik Akbar, Professor of CSE. To the best of our knowledge, this work has not been submitted to
any other institute or university for any other degree.
Project Associates
M. YESASWINI (18KT5A0530)
M. DEEPIKESWARI (18KT1A0532)
ABSTRACT
Question answering is one of the most researched topics in Machine
Learning. Answer selection in Community Question Answering (CQA) is
a complex and crucial problem in the development of an autonomous Question
Answering system in a Natural Language Processing (NLP) environment. QA systems allow
users to express a question in natural language and get an immediate and brief response. A
Question Answering system is an information retrieval system in which a direct response is
expected for a submitted query, as opposed to a list of references that may contain
the answers. It is a medium for human-machine communication. QA systems in Natural
Language Processing are designed to provide learners with accurate responses to their
inquiries.
1. INTRODUCTION
1.1.1 Scope
1.1.2 Purpose
2. SYSTEM ANALYSIS
2.4 Methodologies
2.4.1 Preprocessing
3. SYSTEM DESIGN
3.5 Dataset (Sample)
4. SYSTEM IMPLEMENTATION
4.3 Results
5. TESTING
6. CONCLUSION
7. REFERENCES
8. BIBLIOGRAPHY
9. APPENDIX
LIST OF FIGURES
S.NO   Fig. NO   NAME OF THE FIGURE
1.     2.1       Data Transformation
1. INTRODUCTION
1.1 BRIEF OVERVIEW OF THE PROJECT
The main aim of this project is to build an efficient question answering system with various Machine
Learning techniques and to carry out a comparative study of which technique works best for the task.
In this project, we built a QA system that can answer questions based on a confirmed
knowledge base developed from texts posted to a social media list. We apply Naïve
Bayes, Support Vector Machine, and Random Forest algorithms to the collected data and check
which algorithm is more accurate for the question answering task. We then use Python to host the
web application and design our own question answering system.
1.1.1 Scope
The question answering system using NLP exists to give accurate answers to the
questions raised by the user and provides more relevant answers than other systems. Modern
information retrieval systems allow us to locate documents or websites that might contain the associated
information, but most of them leave it to the user to extract the useful information from an
ordered list. In this system, we used machine learning algorithms and achieved 85% accuracy. Question
answering is in itself the intersection of Natural Language Processing, Information Retrieval, Machine
Learning, Knowledge Representation, Logic and Inference, and Semantic Search.
1.1.2 Purpose
The main aim of this question answering system is to return an accurate answer. The QA system
enables users to access knowledge resources in a natural way (i.e., by asking questions) and to
get back a relevant and concise response. The goal of QA systems is to understand natural
language questions correctly and deduce their precise meaning in order to retrieve exact responses. The
processing in a question answering system has three stages: analysing and classifying the question,
identifying the answers related to the question, and selecting the most accurate and efficient
answer.
Nowadays, chatbots play an important role in various sectors. Consumers are demanding
round-the-clock assistance in areas ranging from banking and finance to health and
wellness. Because of this demand, chatbots are increasing in popularity among businesses and
consumers alike. The objective of our project is to review various ML techniques and understand
their limitations and advantages, and to compile a thorough list of the relevant Natural Language Processing
methods. Finally, our main motive is to build an efficient system and choose the best algorithms for
the QA model.
[1] Jafar A. Alzubi et al. identified the dangers caused by COVID-19, a disease caused by a novel
coronavirus that leads to severe acute respiratory syndrome. They proposed COBERT, a retriever-reader dual algorithmic
system that answers complex queries by searching a corpus of 59K coronavirus-related
articles made available through the COVID-19 Open Research Dataset Challenge (CORD-19).
The retriever is made up of a TF-IDF vectorizer that captures the top 500 documents with the
highest scores. The reader is a Bidirectional Encoder Representations from Transformers (BERT)
model pre-trained on SQuAD 1.1 dev. Using the DistilBERT version of BERT in the reader
phase on CORD-19 provides an accuracy of 70% when compared to BERT. Only the generated
embedding is used by the proposed model to capture context-sensitive features. Their system has
been pre-trained on a limited domain of COVID-19 literature, all of which is in English only.
[2] Darshana V. Vekariya et al. presented a versatile global T-max pooling and DeepLSTM approach for
quality answer prediction in this research paper. They also used an efficient DFM to forecast good
answers; DFM is especially useful for ranking purposes. In comparison to the other existing
strategies, this approach is based solely on a neural structure and an RNN. They used DeepLSTM,
which is extremely simple to train. For testing, they used four datasets: STSB, MRPC, SICK, and
a Wikipedia QA dataset. STSB, SICK, and the Wikipedia dataset are used to compute semantic similarity,
while the MRPC dataset is used for paraphrase identification. SICK is made up of 10,000 English sentence
pairs. The results of the experiments reveal that the proposed method provides exceptional overall
performance, with an accuracy of approximately 79%.
[3] Emmanuel Mutabazi et al. provided a study of medical textual question-answering systems
based on deep learning methodologies which can express the exact meaning of a statement.
Various researchers have employed machine learning and NLP methodologies integrated with
probabilistic, algebraic, and neural network models such as CNNs and RNNs to solve various
answer-processing challenges. MedQuAD and emrQA are the datasets used in this paper. As a result,
compared to traditional methods, complicated medical or clinical questions can be answered
with a high degree of accuracy, although there is a delay in retrieving relevant and reliable answers.
[4] Md. Rafiuzzaman Bhuiyan et al. proposed a paper in which they used a sequence-to-sequence
architecture to create an automatic context-based question answering system. An encoder
layer, a bi-directional LSTM, and a decoder layer are used, followed by an attention
mechanism. The main challenge of this work is data collection, as well as finding the appropriate
vocabulary for word mapping and other tasks. Their model's primary function is to provide answers
to questions. They were successful in answering questions and reducing the training loss to
0.003. They created their own dataset from the web. The main limitation is that there are not
enough Bengali word vectors, and there is no lemmatizer for Bengali.
[5] In this research, Mrs. Kavita Moholkar et al. used an LSTM model, a hybrid LSTM-
CNN model, and a Multilayer Perceptron (MLP) model to present an ensemble
strategy for predicting responses to multiple choice questions. To begin, the LSTM and hybrid LSTM-
CNN models are trained in parallel. A Multilayer Perceptron is employed to predict the choice
for the training dataset individually. The 8thGr-NDMC datasets are chosen
for model evaluation and comparison, and the eighth-grade GR-NDMC set is used for testing purposes. The
obtained findings show that the proposed approach outperforms several alternative single
forecasting models.
At present there are a large number of documents, web pages, and files. Previous systems only
return several documents related to the question; they do not provide the exact document
that answers it. With these existing systems, only information retrieval is possible, and
there is also a possibility that unauthorized people can access the data in the system. With this traditional
approach there is no confirmation that the data we get is correct. Our question answering system will
therefore help people get accurate and efficient answers.
2. SYSTEM ANALYSIS
System Analysis is the process of analyzing a system with the potential goal of improving or
modifying it. Analysis means breaking down the problem into smaller elements of study and
ultimately providing a better solution. Analysis is an important aspect of the system development
process. It involves gathering and interpreting facts, diagnosing the problem, and using the
information to recommend improvements to the system; ultimately, the goal is to deliver a computerized
solution.
Economic analysis is most frequently used to evaluate the effectiveness of a system. More
commonly known as cost/benefit analysis, the procedure is to determine the benefits and savings that
are expected from a system and compare them with the costs; based on this comparison, a decision is made to design and implement
the system. This part of the feasibility study gives top management the economic justification for
the new system. A simple economic analysis that gives an actual comparison of costs and benefits
is much more meaningful in such cases. For this system, the organization is satisfied with the economic
feasibility, because implementing this system requires no additional
hardware resources and it will save a lot of time.
Functional user requirements are high-level statements of what the system should do. The following non-functional requirements describe the qualities the system must exhibit.
• Reliability
The system is more reliable because it uses APIs developed by Google that work even in noisy
environments. Also, on the receiver side the Python platform is used, which makes the code more reliable.
• Performance
This system exhibits high performance because it is well optimised and developed using
high-level languages, which gives a response to the end user in very little time.
• Supportability
This system is designed to be cross-platform supportable. The system is supported on a wide range
of hardware and on any software platform. It also uses Python and hence is highly portable.
• Flexibility
If we intend to increase or extend the functionality of the software after it is deployed, this should
be planned from the beginning. New modules can be easily integrated into our system without
disturbing the existing modules or modifying the local schema of existing applications.
The System Requirements Specification (SRS) is the requirements work product that formally specifies
the system-level requirements of a single system or application. The System Requirements
Specification identifies, defines and clarifies the requirements that, when satisfied through
development, meet the operational/functional need identified in the Project Concept Proposal,
Project Business Case, and Project Charter. Approval of this document constitutes agreement that
the developed system satisfying these requirements will be accepted.
➢ Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python.
It provides a selection of efficient tools for machine learning and statistical modeling, including
classification, regression, clustering, and dimensionality reduction, via a consistent interface in
Python. This library, which is largely written in Python, is built upon NumPy, SciPy, and
Matplotlib.
➢ Pandas
Pandas is an open-source Python package that is most widely used for data science/data
analysis and machine learning tasks. It is built on top of another package named NumPy, which
provides support for multi-dimensional arrays. As one of the most popular data wrangling
packages, Pandas works well with many other data science modules inside the Python
ecosystem and is typically included in every Python distribution, from those that come with
your operating system to commercial vendor distributions like ActiveState's ActivePython.
➢ Numpy
NumPy (Numerical Python) is a linear algebra library in Python. It is a very important library
on which almost every data science or machine learning Python package, such as SciPy
(Scientific Python), Matplotlib (the plotting library), and Scikit-learn, depends to a reasonable
extent. NumPy is very useful for performing mathematical and logical operations on arrays.
It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python.
➢ Seaborn
Seaborn is a Python data visualization library built on top of Matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics and integrates closely with Pandas
data structures. Matplotlib itself is a low-level graph plotting library in Python that serves as a
visualization utility; it is open source and mostly written in Python, with a few segments written in C,
Objective-C and JavaScript for platform compatibility.
➢ Word Cloud
Word Cloud is a data visualization technique used for representing text data in which the size
of each word indicates its frequency or importance. Significant textual data points can be
highlighted using a word cloud. Word clouds are widely used for analysing data from social
network websites.
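As a minimal sketch of the toolset just described, the imports below show how these libraries are typically brought into a Python session (the exact set and aliases used in the project may differ; this is an assumption):

# Illustrative imports for the libraries described above
import numpy as np                     # numerical arrays and linear algebra
import pandas as pd                    # tabular data loading and wrangling
import matplotlib.pyplot as plt        # low-level plotting
import seaborn as sns                  # statistical visualization on top of Matplotlib
from wordcloud import WordCloud        # word-frequency visualization of text
from sklearn.ensemble import RandomForestClassifier   # one of the scikit-learn models used later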
2.4 METHODOLOGIES
2.4.1 Pre-processing
Data pre-processing is the process of transforming raw data into an understandable format.
It is also an important step in data mining, as we cannot work with raw data; the quality
of the data should be checked before applying machine learning or data mining algorithms.
Previous studies have shown that data cleaning and dimensionality reduction have an
impact on the efficiency of machine learning and deep learning systems. In the process of data
cleaning, the proposed system identified missing data and uses the traditional
imputation technique: since most of the attributes do not exhibit skewness, all the missing
values are replaced with their respective mean values. The next important factor in pre-
processing is data transformation. Numerical values play a more important role in the computation time
than categorical data, and most of the attributes in the dataset have few unique values
in their respective columns.
Data transformation is also a key pre-processing step, applied as standard feature scaling
in the proposed system: the difference between the attribute value and the attribute mean is divided
by the standard deviation, i.e. z = (x − mean) / standard deviation.
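To make the mean imputation and standard feature scaling described above concrete, the following sketch uses scikit-learn's SimpleImputer and StandardScaler on a hypothetical numeric feature table (the column names and values are illustrative assumptions, not the project dataset):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric attributes with missing values
df = pd.DataFrame({"length": [12.0, np.nan, 9.0, 15.0],
                   "score":  [0.4, 0.9, 0.7, np.nan]})

# Replace missing values with the column mean (mean imputation)
imputed = SimpleImputer(strategy="mean").fit_transform(df)

# Standard scaling: (value - column mean) / column standard deviation
scaled = StandardScaler().fit_transform(imputed)
print(scaled)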
Constructing Data Cube: Data are transformed into appropriate forms for mining. Data
transformation involves the following:
1. Normalisation, where the attribute data are scaled to fall within a small specified range, such
as -1.0 to 1.0, or 0 to 1.0.
2. Smoothing, which works to remove the noise from the data. Such techniques include binning, clustering,
and regression.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, daily
sales data may be aggregated so as to compute monthly and annual total amounts. This step is
typically used in constructing a data cube for analysis of the data at multiple granularities.
4. Generalisation of the data, where low-level or primitive/raw data are replaced by higher-level concepts
through the use of concept hierarchies. For example, categorical attributes are generalised to
higher-level concepts, e.g., street into city or country. Similarly, the values for numeric attributes may
be mapped to higher-level concepts, e.g., age into young, middle-aged, or senior.
As we know, many Machine Learning algorithms and almost all Deep Learning
architectures are not capable of processing strings or plain text in their raw form. In a broad sense,
they require numbers as inputs to perform any sort of task, such as classification,
regression, or clustering. Also, given the huge amount of data that exists in text format, it is
imperative to extract some knowledge out of it and build useful applications. In short, we can
say that to build any model in machine learning or deep learning, the final-level data has to be in
numerical form, because models do not understand text or image data directly as humans do.
Count Vectorizer
1. It is one of the simplest ways of doing text vectorization.
2. It creates a document term matrix, which is a set of dummy variables that indicates if a particular
word appears in the document.
3. Count vectorizer will fit and learn the word vocabulary and try to create a document term matrix
in which the individual cells denote the frequency of that word in a particular document, which is
also known as term frequency, and the columns are dedicated to each word in the corpus.
The dictionary created contains the list of unique tokens (words) present in the corpus, say N
tokens, and the size of the Count Vector matrix M formed for D documents is D x N. Each row
in the matrix M describes the frequency of the tokens present in the corresponding document.
A column of the matrix M can also be understood as a word vector for the corresponding word.
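A small sketch of the document-term matrix described above, using scikit-learn's CountVectorizer on a toy corpus (the sentences are illustrative, not taken from the project data):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the chatbot answers questions",
          "the user asks the chatbot questions"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)          # D x N sparse count matrix

print(vectorizer.get_feature_names_out())       # the N unique tokens (the dictionary)
print(dtm.toarray())                            # each row: token frequencies for one document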
2.4.3 Classification
Classification is a technique where we categorize data into a given number of classes. The main
goal of a classification problem is to identify the category/class under which a new observation falls.
Classifier: an algorithm that maps the input data to a specific category.
In machine learning and statistics, classification is a supervised learning approach in which the
computer program learns from the input data given to it and then uses this learning to classify new
observations.
Some classification techniques are given below:
• Support Vector Machine
• Naïve Bayes
• Random Forest
Support Vector Machine Algorithm:
Step 1: SVM algorithm predicts the classes. One of the classes is identified as 1 while the other is
identified as -1.
Step 2: As with all machine learning algorithms, SVM converts the business problem into a mathematical
equation involving unknowns. These unknowns are then found by converting the problem into an
optimization problem. As optimization problems always aim at maximizing or minimizing
something while tweaking the unknowns, in the case of the SVM classifier a loss
function known as the hinge loss is used and tweaked to find the maximum margin.
Step 3: For ease of understanding, this loss function can also be called a cost function whose cost is
0 when no class is incorrectly predicted. However, if this is not the case, the error/loss is calculated.
The problem with this scenario is that there is a trade-off between maximizing the margin and
the loss generated if the margin is maximized to a very large extent. To bring these concepts together in
theory, a regularization parameter is added.
Step 4: As is the case with most optimization problems, weights are optimized by calculating the
gradients using advanced mathematical concepts of calculus viz. partial derivatives.
Step 5: The gradients are updated using only the regularization parameter when there is no error
in the classification, while the loss function is also used when misclassification happens.
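The hinge-loss-with-regularization optimization sketched in these steps is what scikit-learn's linear SVM solves internally; below is a minimal, illustrative sketch of training such a classifier on toy two-class data (the dataset and parameter values are assumptions, not the project's):

from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy binary classification data standing in for the +1 / -1 classes
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# C is the regularization parameter trading margin width against training loss
svm = LinearSVC(C=1.0, max_iter=5000)
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))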
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each
tree and, based on the majority vote of predictions, predicts the final output. A greater
number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
The working of the Random Forest algorithm can be summarized in the following steps:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N trees have been built.
Step-5: For a new data point, find the prediction of each decision tree and assign the new data
point to the category that wins the majority of votes.
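To make the majority-voting idea concrete, the sketch below trains a small random forest on toy data and inspects the votes of its individual trees for one sample (the data and the number of trees are illustrative assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# N = 25 decision trees, each fitted on a bootstrap subset of the data
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each tree casts a vote for the first sample; the forest reports the majority class
votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_], dtype=int)
print("class votes:", np.bincount(votes))
print("forest prediction:", forest.predict(X[:1])[0])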
Accuracy: Accuracy describes the overall correctness of the model and is estimated as the
proportion of correct predictions out of the total number of predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
In the proposed model, we applied the model on a dataset. The Social Media dataset contains
200 records. The confusion matrix for the Social Media dataset for the intelligent ensemble
algorithm can be described as follows, and the computed accuracy is:
= 85%
Recall: The ability of a model to find all the relevant cases within a dataset. Mathematically,
recall is defined as follows:
Recall = TP / (TP + FN)
F-Measure: A measure that combines precision and recall; it is the harmonic mean of
precision and recall, the traditional F-measure:
F-measure = 2 * (Precision * Recall) / (Precision + Recall) = 2*TP / (2*TP + FP + FN)
= 85%
B. ROC curve:
An ROC curve (receiver operating characteristic curve) is a graph showing the performance
of a classification model at all classification thresholds. This curve plots two parameters: True
Positive Rate and False Positive Rate.
True Positive Rate (TPR) is a synonym for recall and is therefore defined as
follows: TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows: FPR = FP / (FP + TN)
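The metrics above can be computed directly from a model's predictions; a minimal sketch with scikit-learn is shown below (the label and prediction lists are placeholders, not the project's actual test split):

from sklearn.metrics import (accuracy_score, recall_score, f1_score,
                             confusion_matrix, roc_auc_score)

# Illustrative ground-truth labels and predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))            # [[TN FP], [FN TP]]
print("accuracy:", accuracy_score(y_true, y_pred))
print("recall:  ", recall_score(y_true, y_pred))   # TP / (TP + FN)
print("F1:      ", f1_score(y_true, y_pred))       # 2*TP / (2*TP + FP + FN)
print("ROC AUC: ", roc_auc_score(y_true, y_pred))  # area under the TPR/FPR curve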
3. SYSTEM DESIGN
3.1 ABOUT SYSTEM DESIGN
System design is the process of designing the elements of a system such as the architecture, modules,
components, the different interfaces of those components, and the data that passes through the system.
The purpose of the system design process is to provide sufficiently detailed data and information
about the system and its elements to enable an implementation consistent with the architectural
entities defined in the models and views of the system architecture.
Elements of a System:
Architecture - The conceptual model that defines the structure, behavior and other views of
a system. We can use flowcharts to illustrate the architecture.
Modules - These are components that handle one specific task in a system. A combination of the
modules makes up the system.
Components - These provide a particular function or group of related functions. They are made up
of modules.
Interfaces - The shared boundary across which the components of the system exchange
information and relate.
A system architecture or systems architecture is the conceptual model that defines the structure,
behavior, and other views of a system. An architecture description is a formal description and
representation of a system, organized in a way that supports reasoning about the structures and
behaviors of the system.
An architecture description also indicates how nonfunctional requirements will be satisfied. For
example:
➢ Safety integrity. Elements of the design that reduce the risk that the system will cause (or
allow causation of) harm to property and human beings.
➢ Fault tolerance. Elements of the design that allow the system to continue to operate if some
components fail (e.g. no single point of failure).
➢ Consideration of product evolution. The facility for individual components to be modified
or dynamically reconfigured without the need for major modification of the system as a whole.
Further, the ability to add functionality with new components in a cost-effective manner.
➢ Consideration of the emergent qualities of the system as a whole when components are
assembled and operated by human beings. For example, can the missile launch system be effectively
operated in a high-stress combat situation?
UML stands for Unified Modelling Language. UML is a standardized general-purpose modelling
language in the field of object-oriented software engineering. The standard is managed, and was
created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-oriented
computer software. In its current form UML comprises two major components: a Meta-model
and a notation. In the future, some form of method or process may also be added to, or associated
with, UML.
The UML represents a collection of best engineering practices that have proven successful in the
modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the software
development process. The UML uses mostly graphical notations to express the design of software
projects.
GOALS:
1. Provide users a ready-to-use, expressive visual modeling Language so that they can develop and
exchange meaningful models.
6. Support higher level development concepts such as collaborations, frameworks, patterns and
components.
Use Case diagrams identify the functionality provided by the system (use cases), the users
who interact with the system (actors), and the association between the users and the functionality.
Use Cases are used in the Analysis phase of software development to articulate the high-level
requirements of the system. The primary goals of Use Case diagrams include:
➢ Providing a high-level view of what the system does.
➢ Identifying the users ("actors") of the system.
➢ Determining areas needing human-computer interfaces.
Graphical Notation: The basic components of Use Case diagrams are the Actor, the Use Case,
and the Association.
Behind each Use Case is a series of actions to achieve the proper functionality, as well as alternate
paths for instances where validation fails or errors occur. These actions can be further defined in
a Use Case description. Because this is not addressed in UML, there are no standards for Use Case
descriptions; however, there are some common templates one can follow, and whole books have
been written on the subject of Use Case descriptions.
Sequence diagrams document the interactions between classes to achieve a result, such as a
use case. Because UML is designed for object-oriented programming, these communications
between classes are known as messages. The Sequence diagram lists objects horizontally and time
vertically, and models these messages over time. Graphical Notation: In a Sequence diagram,
classes and actors are listed as columns, with vertical lifelines indicating the lifetime of the object
over time.
Lifeline: The Lifeline identifies the existence of the object over time.
Sequence Diagram:
Action State: An action state represents the execution of an atomic action, typically the
invocation of an operation. An action state is a simple state with an entry action whose only
exit transition is triggered by the implicit event of completing the execution of the entry
action.
Transition: A transition is a directed relationship between a source state vertex and a target
state vertex. It may be part of a compound transition, which takes the state machine from one
state configuration to another, representing the complete response of the state machine to a
particular event instance.
Final state: A final state represents the last or "final" state of the enclosing composite state.
There may be more than one final state at any level, signifying that the composite state can
end in different ways or conditions. When a final state is reached and there are no other
enclosing states, it means that the entire state machine has completed its transitions and no
more transitions can occur.
Decision: A state diagram (and, by derivation, an activity diagram) expresses a decision when
guard conditions are used to indicate different possible transitions that depend on Boolean
conditions of the owning object.
Activity Diagram:
A data flow diagram (DFD) maps out the flow of information for any process or system.
It uses defined symbols like rectangles, circles and arrows, plus short text labels, to show data
inputs, outputs, storage points and the routes between each destination. Data flow diagrams
provide a graphical representation of how information moves between processes in a system
Data flowcharts can range from simple, even hand-drawn process overviews, to in-depth,
multi-level DFDs that dig progressively deeper into how the data is handled. They can be used to
analyze an existing system or model a new one.
The objective of a DFD is to show the scope and boundaries of a system as a whole. It
may be used as a communication tool between a systems analyst and any person who plays a part
in the system, and it acts as a starting point for redesigning the system.
RULES OF DFD:
The basic rules for DFD are as follows:
• Each data store should have at least one data flow in and one data flow out.
• Remember that a DFD is not a flowchart. Arrows in a flowchart represent the order of events;
arrows in a DFD represent flowing data. A DFD does not imply any order of events.
• Do not become bogged down with details. Defer error conditions and error handling until the end
of the analysis.
SYMBOLS IN DFD:
1. Process — represents any transformative process of the incoming flow of information into the
outgoing workflow. The process receives input and generates an output;
2. Data flow — represents the movement of information within the system between external entities,
data stores, and processes. Reflects the nature of the data used in the system;
3. Data store — represents repositories for data that is not moving at the moment. It may be a
buffer or a queue holding data for later use. Most commonly it is either database tables or membership
forms;
4. External entity — represents sources or destination points of information outside the boundaries
of the described system. Entities either provide data to the system or receive data from its processes.
External entities usually reside on the edges of the diagram.
DFD Level 1 provides a more detailed breakout of pieces of the Context Level Diagram. You will
highlight the main functions carried out by the system, as you break down the high-level process
of the Context Diagram into its subprocesses.
DFD Level 2 then goes one step deeper into parts of Level 1. It may require more text to reach the
necessary level of detail about the system’s functioning.
3.5 Dataset (Sample):
4. SYSTEM IMPLEMENTATION
4.1 ABOUT IMPLEMENTATION:
Implementation is the most crucial stage in achieving a successful system and giving users
confidence that the new system is workable and effective. Here it involves the implementation of
a modified application to replace an existing one. This type of conversion is relatively easy to handle,
provided there are no major changes in the system.
Each program was tested individually at the time of development using test data, and it has been
verified that the programs link together in the way specified in the program specifications; the
computer system and its environment were tested to the satisfaction of the user. The system that has
been developed is accepted and proved to be satisfactory for use, and so the system is going
to be implemented very soon. A simple operating procedure is included so that the user can
understand the different functions clearly and quickly.
DATA COLLECTION:
Code:
# DATA READING
import pandas as pd

# Load the chatbot responses and resume sheets (Latin-1 encoded CSV files)
chatbot = pd.read_csv(r"C:\Users\LENOVO\Downloads\STACK\Sheet_1.csv",
                      usecols=['response_id', 'class', 'response_text'],
                      encoding='latin-1')
resume = pd.read_csv(r"C:\Users\LENOVO\Downloads\STACK\Sheet_2.csv", encoding='latin-1')

# Inspect the first five rows of each frame
chatbot.head(5)
resume.head(5)
DATA VISUALIZATION:
Word cloud:
Word Cloud is a data visualization technique used for representing text data in which the size of
each word indicates its frequency or importance. Significant textual data points can be highlighted
using a word cloud. Word clouds are widely used for analyzing data from social network websites.
Code:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def cloud(text):
    # Generate and display a word cloud from the joined response texts
    wc = WordCloud(background_color='white').generate(' '.join(text))
    plt.imshow(wc)
    plt.axis("off")

cloud(chatbot['response_text'])
RESULT:
Code:
from sklearn.manifold import TSNE

# Reduce the TF-IDF matrix (Tf_idf) to 3 dimensions for visualization
tsne = TSNE(n_components=3, init='random', random_state=101, method='barnes_hut',
            n_iter=250, verbose=2, angle=0.5).fit_transform(Tf_idf.toarray())
RESULT:
Code:
import plotly.graph_objs as go
import plotly.offline as py

# 3-D scatter plot of the t-SNE embedding of the response texts
trace1 = go.Scatter3d(x=tsne[:, 0], y=tsne[:, 1], z=tsne[:, 2], mode='markers',
                      marker=dict(sizemode='diameter', color='red', colorscale='Portland',
                                  colorbar=dict(title='Text'),
                                  line=dict(color='rgb(255, 255, 255)'), opacity=0.75))
data = [trace1]
layout = go.Layout(margin=dict(l=0, r=0, b=0, t=0))   # layout was not shown in the report
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='3DBubble')
RESULT:
DATA PREPROCESSING:
Transformers:
Transformers provides APIs to easily download and train state-of-the-art pretrained models. Using
pretrained models can reduce your compute costs and carbon footprint, and save you the time of
training a model from scratch. The models can be used across different modalities; for text they
support tasks such as text classification, information extraction, question answering, and summarization.
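As an illustration of the Transformers library just described (not part of the project's own pipeline), a pretrained extractive question answering model can be used in a few lines; the checkpoint name below is an assumed, publicly available model:

from transformers import pipeline

# Downloads a pretrained extractive QA model on first use (assumed checkpoint)
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = "Random Forest is an ensemble of decision trees that votes on the final class."
print(qa(question="What does a Random Forest consist of?", context=context))

The project's own preprocessing, shown next, instead uses scikit-learn's CountVectorizer and TfidfTransformer.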
Code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

count_vect = CountVectorizer()
x = chatbot['response_text']
y = chatbot['class']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

# Bag-of-words counts for train and test, then TF-IDF weighting of the training counts
X_train_counts = count_vect.fit_transform(x_train)
X_test_counts = count_vect.transform(x_test)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
RESULT:
0.75
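The report lists 0.75 as this model's accuracy, but the classifier that produced the score is not shown in the snippet above. A minimal sketch of how a linear SVM could be fitted on the TF-IDF features to obtain such a score (the choice of LinearSVC here is an assumption):

from sklearn.svm import LinearSVC
from sklearn import metrics

# Fit a linear SVM on the TF-IDF training features built above
svm = LinearSVC()
svm.fit(X_train_tfidf, y_train)

# Apply the same TF-IDF transform to the test counts before predicting
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
y_pred = svm.predict(X_test_tfidf)
print(metrics.accuracy_score(y_test, y_pred))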
B) Naïve Bayes:
The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem
and used for solving classification problems. It is mainly used in text classification with
high-dimensional training datasets. The Naïve Bayes classifier is one of the simplest and most effective
classification algorithms and helps in building fast machine learning models that can make
quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
Code:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

x = chatbot['response_text']
y = chatbot['class']                      # target label column of the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

# Document-term matrices for the train and test splits
vect = CountVectorizer()
x_train_dtm = vect.fit_transform(x_train)
x_test_dtm = vect.transform(x_test)

# Fit a multinomial Naïve Bayes classifier and score it on the test split
NB = MultinomialNB()
NB.fit(x_train_dtm, y_train)
y_predict = NB.predict(x_test_dtm)
metrics.accuracy_score(y_test, y_predict)
RESULT:
0.7
C)Random Forest:
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model. As the name suggests, "Random
Forest is a classifier that contains a number of decision trees on various subsets of the given dataset
and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one
decision tree, the random forest takes the prediction from each tree and, based on the majority vote
of predictions, predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
Code:
from sklearn.ensemble import RandomForestClassifier

# Random forest trained on the same document-term matrices used for Naïve Bayes
rf = RandomForestClassifier(max_depth=10, max_features=10)
rf.fit(x_train_dtm, y_train)
rf_predict = rf.predict(x_test_dtm)
metrics.accuracy_score(y_test, rf_predict)
RESULT:
0.8
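Since the project title refers to a fine-tuned Random Forest, the following sketch shows how its hyperparameters could be tuned with a grid search over the same document-term features; the parameter grid is an assumption, not the values the authors actually searched:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values to search (illustrative grid)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],
    "max_features": ["sqrt", 10, 20],
}

search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid, cv=5, scoring="accuracy")
search.fit(x_train_dtm, y_train)

print("best parameters:", search.best_params_)
print("held-out accuracy:", search.score(x_test_dtm, y_test))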
5. TESTING
Software Testing is an investigation conducted to provide stakeholders with information about the
quality of the product or service under test. Software testing can also provide an objective,
independent view of the software to allow the business to appreciate and understand the risks of
software implementation. Test techniques include, but are not limited to, the process of executing a
program or application with the intent of finding software bugs (errors or other defects). It involves
the execution of a software component or system component to evaluate one or more properties of
interest. In general, these properties indicate the extent to which the component or system under test
meets the requirements that guide its design and development.
As the number of possible tests for even simple software components is practically infinite, all
software testing uses some strategy to select tests that are feasible for the available time and resources.
As a result, software testing typically (but not exclusively) attempts to execute a program or
application with the intent of finding software bugs (errors or other defects).
Software Testing can provide objective, independent information about the quality of software and
risk of its failure to users and/or sponsors.
Software testing can be conducted as soon as executable software (even if partially complete) exists.
The overall approach to software development often determines when and how testing is conducted.
For example, in a phased process, most testing occurs after system requirements have been defined
and then implemented in testable programs. In contrast, under an Agile approach, requirements,
programming, and testing are often done concurrently.
There are many approaches to software testing. Reviews, walkthroughs, or inspections are
referred to as static testing, whereas actually executing programmed code with a given set of test
cases is referred to as dynamic testing. Static testing is often implicit, as with proofreading, and when
programming tools/text editors check source code structure or compilers (pre-compilers) check
syntax and data flow as static program analysis. Dynamic testing takes place when the program
itself is run. Dynamic testing may begin before the program is 100% complete in order to test
particular sections of code, applied to discrete functions or modules. Typical techniques for
this are either using stubs/drivers or execution from a debugger environment.
Static testing involves verification, whereas dynamic testing involves validation. Together
they help improve software quality. Among the techniques for static analysis, mutation testing can
be used to ensure the test cases will detect errors which are introduced by mutating the source code.
Software testing methods are traditionally divided into white- and black-box testing. These two
approaches are used to describe the point of view that a test engineer takes when designing test
cases.
While white-box testing can be applied at the unit, integration and system levels of the software
testing process, it is usually done at the unit level. It can test paths within a unit, paths between
units during integration, and between subsystems during a system-level test.
Though this method of test design can uncover many errors or problems, it might not detect
unimplemented parts of the specification or missing requirements.
➢ API testing: It tests the application using public and private APIs.
➢ Code coverage: It creates tests to satisfy some criteria of code coverage.
Code coverage tools can evaluate the completeness of a test suite that was created with any method,
including black-box testing. This allows the software team to examine parts of a system that are
rarely tested and ensures that the most important function points have been tested.
Specification-based testing aims to test the functionality of software according to the applicable
requirements. This level of testing usually requires thorough test cases to be
provided to the tester, who then can simply verify that for a given input, the output value (or behavior)
either "is" or "is not" the same as the expected value specified in the test case. Test cases are built
around specifications and requirements, i.e., what the application is supposed to do. It uses external
descriptions of the software, including specifications, requirements, and designs to derive test cases.
These tests can be functional or non-functional, though usually functional.
One advantage of the black box technique is that no programming knowledge is required. Whatever
biases the programmers may have had, the tester likely has a different set and may emphasize different
areas of functionality. On the other hand, black-box testing has been said to be “like a walk in a dark
labyrinth without a flashlight.” Because they do not examine the source code, there are situations
when a tester writes many test cases to check something that could have been tested by only one test
case, or leaves some parts of the program untested.
This method of test can be applied to all levels of software testing: unit, integration, system and
acceptance. It typically comprises most if not all testing at higher levels, but can also dominate unit
testing as well.
These types of tests are usually written by developers as they work on code (white-box style), to ensure
that the specific function is working as expected. One function might have multiple tests, to catch
corner cases or other branches in the code. Unit testing alone cannot verify the
functionality of a piece of software, but rather is used to ensure that the building blocks of the
software work independently from each other.
Unit testing is a software development process that involves synchronized application of a broad
spectrum of defect prevention and detection strategies in order to reduce software development
risks, time, and costs. It is performed by the software developer or engineer during the
construction phase of the software development lifecycle. Rather than replace traditional QA
focuses, it augments it. Unit testing aims to eliminate construction errors before code is promoted
to QA; this strategy is intended to increase the quality of the resulting software as well as the
efficiency of the overall development and QA process.
Depending on the organization's expectations for software development, unit testing might
include static code analysis, data flow analysis, metrics analysis, peer code reviews, code
coverage analysis and other software verification practices.
Integration testing works to expose defects in the interfaces and interaction between integrated
components (modules). Progressively larger groups of tested software components
corresponding to elements of the architectural design are integrated and tested until the software
works as a system.
Component interface testing is a variation of black-box testing, with the focus on the
data values beyond just the related actions of a subsystem component.
In addition, software testing should ensure that the program, as well as working as expected,
does not destroy or partially corrupt its operating environment or cause other processes
within that environment to become inoperative (this includes not corrupting shared memory, not
consuming or locking up excessive resources, and leaving any parallel processes unharmed by its
presence).
Software testing verification and validation are the most important components to be considered.
In this section we discuss the details of the verification and validation part of software testing.
Software verification and validation activities confirm that the software is aligned with its specification.
Every project should validate and verify the software it produces.
Project management is accountable for organizing the classification of software verification and
validation roles, software verification and validation activities, and the allotment of employees to
those roles.
Whatever the size of the project, software verification and validation greatly affect
software quality. People are not infallible, and software that has not been verified has little
chance of working correctly. Typically, 2-5 errors per 1,000 lines of code are found
during development, and 0.15-4 errors per 1,000 lines of code remain even after testing of the
system. Each error might lead to an operational failure or non-compliance with a requirement.
The purpose of software verification and validation is to reduce software errors to an acceptable
level. The effort needed can vary from 30% to 90% of the total project resources, depending on the
complexity and criticality of the software.
Verification ensures that the product is being built so as to deliver all the intended functionality to the client.
It is done at the start of the development process. It includes reviews, meetings,
walkthroughs, and inspections to evaluate documents, plans, code, requirements and specifications.
It checks whether we are building the product right.
It also checks that the right data is accessed, in the right place and in the right way.
Verification is a low-level activity.
It is performed during development through walkthroughs, reviews, inspections, mentor
feedback, training, checklists and standards.
It demonstrates the consistency, completeness, and correctness of the software at every phase and
between phases of the development life cycle.
According to the Capability Maturity Model, we can also describe verification as the process of
evaluating software to determine whether the products of a given development phase satisfy the
conditions imposed at the start of that phase.
Validation determines how well the system complies with the requirements and performs the functions for
which it is intended, and whether it meets the organization's goals and user requirements.
It checks whether we are building the right product.
It also checks that the correct data is accessed.
Validation is a high-level activity.
It also determines the correctness of the final software product of a development project with respect
to the user's needs and requirements.
According to the Capability Maturity Model, we can also describe validation as the process of
evaluating software during the development process to determine whether it satisfies the specified
requirements.
6. CONCLUSION
The project outcome is a web application that provides an interactive question-to-answer
system, where the user can experience virtual answering from the system. First, we perform
data preprocessing and clean the data by handling the missing values. Then we use word embedding
to convert the categorical data into a binary, machine-understandable format. Here
we apply machine learning models such as Random Forest, SVM, and Naïve Bayes algorithms to the data to
further develop a robust and accurate question answering system.
7. REFERENCES
[1] Alzubi, J.A., Jain, R., Singh, A. et al. COBERT: COVID-19 Question Answering System Using
BERT. Arab J Sci Eng (2021). https://doi.org/10.1007/s13369-021-05810-5
[2] D. V. Vekariya and N. R. Limbasiya, "A Novel Approach for Semantic Similarity Measurement
for High Quality Answer Selection in Question Answering using Deep Learning Methods," 2020
6th International Conference on Advanced Computing and Communication Systems (ICACCS),
2020, pp. 518-522, doi: 10.1109/ICACCS48705.2020.9074471.
[3] Mutabazi, E.; Ni, J.; Tang, G.; Cao, W. A Review on Medical Textual Question Answering
Systems Based on Deep Learning Approaches. Appl. Sci. 2021, 11, 5456. https://doi.org/10.3390/app11125456.
[5] K. Moholkar and S. H. Patil, "Multiple Choice Question Answer System using Ensemble Deep
Neural Network," 2020 2nd International Conference on Innovative Mechanisms for Industry
Applications (ICIMIA), 2020, pp. 762-766, doi: 10.1109/ICIMIA48430.2020.9074855.
[6] Thomala, L. L. Number of new hospital beds to be added in the designated hospitals after the
coronavirus covid-19 outbreak in Wuhan, China as of February 2, 2020. Statista.
https://www.statista.com/statistics/1095434/china-changes-inthe-number-of-hospital-beds-in-designated-hospitals-after-coronavirus-outbreakin-wuhan/.
[7] Bogage, J. Tesla unveils ventilator prototype made with car parts on youtube. Wash. Post.
https://www.washingtonpost.com/business/2020/04/06/tesla-coronavirusventilators-musk/ (2020).
[8] Day, M. & Soper, S. Amazon is prioritizing essential products as online orders spike. Bloomberg.
https://www.bloomberg.com/news/articles/2020-03-17/amazon-prioritizing-essentials-medical-goods-in-virus-response.
[9] Acter, T.; Uddin, N.; Das, J.; Akhter, A.; Choudhury, T.R.; Kim, S.: Evolution of severe acute
respiratory syndrome coronavirus 2 (SARS-CoV-2) as coronavirus disease 2019 (COVID-19)
pandemic: A global health emergency. Science of the Total Environment p. 138996 (2020)
[10] Jeyaprakash, K.; Velavan, S.; Ganesan, S.; Arjun, P.: Covid-19- molecular transmission and
diagnosis-Review article. Asian J. Innov. Res. 5(2), 1 (2020)
[11] da Silveira, M.P.; da Silva Fagundes, K.K.; Bizuti, M.R.; Starck, É.; Rossi, R.C.; e Silva,
D.T.d.R.: Physical exercise as a tool to help the immune system against COVID-19: an integrative
review of the current literature. Clinical and experimental medicine p. 1 (2020)
[12] Shen, I.; Zhang, L.; Lian, J.; Wu, C.H.; Fierro, M.G.; Argyriou, A.; Wu, T.: In search for a cure:
recommendation with knowledge graph on CORD-19. In: Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, p. 3519 (2020)
8. BIBLIOGRAPHY
➢ Meta Learning: How To Learn Deep Learning And Thrive In The Digital World by
Radek Osmulski.
➢ Ensemble Learning Algorithms with Python by Jason Brownlee.
➢ Head First Python, 2nd Edition by Paul Barry.
APPENDIX
Python Introduction
Python is a general-purpose high level programming language that is being increasingly used in
data science and in designing machine learning algorithms.
• Python is Interpreted − Python is processed at runtime by the interpreter. You do not need to
compile your program before executing it. This is similar to PERL and PHP.
• Python is Interactive − You can actually sit at a Python prompt and interactwith the interpreter
directly to write your programs.
• Python is a Beginner's Language − Python is a great language for the beginner- level
programmers and supports the development of a wide range of applications from simple text
processing to WWW browsers to games.
History of Python
Python was developed by Guido van Rossum in the late eighties and early nineties at the National
Research Institute for Mathematics and Computer Science in the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
SmallTalk, and Unix shell and other scripting languages.
Python is copyrighted. Like Perl, Python source code is now available under the GNU General
Public License (GPL).
Python is now maintained by a core development team at the institute, although Guido van
Rossum still holds a vital role in directing its progress.
Python Features
Python's features include −
• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined syntax. This
allows the student to pick up the language quickly.
• A broad standard library − Python's bulk of the library is very portable and cross-platform
compatible on UNIX, Windows, and Macintosh.
• Interactive Mode − Python has support for an interactive mode which allows interactive testing
and debugging of snippets of code.
• Portable − Python can run on a wide variety of hardware platforms and has the same interface on
all platforms.
• Extendable − You can add low-level modules to the Python interpreter. These modules enable
programmers to add to or customize their tools to be more efficient.
• GUI Programming − Python supports GUI applications that can be created and ported to many
system calls, libraries and windows systems, such as Windows MFC, Macintosh, and the X Window
system of Unix.
• Scalable − Python provides a better structure and support for large programs than shell scripting.
Apart from the above-mentioned features, Python has a big list of good features, few are listed below
−
• It can be used as a scripting language or can be compiled to byte-code for building large applications.
• It provides very high-level dynamic data types and supports dynamic type checking.
Python is a popular platform used for research and development of production systems. It is a vast
language with a large number of modules, packages and libraries that provide multiple ways of achieving
a task.
Python and its libraries like NumPy, SimpleITK, TensorFlow, Keras, and Pandas are used in data
science and data analysis. They are also extensively used for creating scalable machine learning
algorithms. Python implements popular machine learning techniques such as classification,
regression, recommendation, and clustering.
Python offers ready-made frameworks for performing data mining tasks on large volumes of data
effectively in less time. It includes several implementations achieved through algorithms such
as linear regression, logistic regression, Naïve Bayes, k-means, k-nearest neighbors, and Random
Forest.
Python has libraries that enable developers to use optimized algorithms. It implements popular
machine learning techniques such as recommendation, classification, and clustering. Therefore, it
is necessary to have a brief introduction to machine learning before we move further.
Machine learning is a discipline that deals with programming the systems so as to make them
automatically learn and improve with experience. Here, learning implies recognizing and
understanding the input data and taking informed decisions based on the supplied data. It is very
difficult to consider all the decisions based on all possible inputs. To solve this problem, algorithms
are developed that build knowledge from a specific data and past experience by applying the
principles of statistical science, probability, logic, mathematical optimization, reinforcement
learning, and control theory.
Machine learning is applied in areas such as:
• Vision processing
• Language processing
• Pattern recognition
• Games
• Data mining
• Expert systems
• Robotics
A machine learning project typically involves the following steps:
• Defining a Problem
• Preparing Data
• Evaluating Algorithms
• Improving Results
• Presenting Results
The best way to get started using Python for machine learning is to work through a project end-to-
end and cover the key steps like loading data, summarizing data, evaluating algorithms and making
some predictions. This gives you a replicable method that can be used dataset after dataset. You can
also add further data and improve the results.
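A compact sketch of those key steps on a built-in dataset (the iris data is used purely as a stand-in; the project's own data would be loaded instead):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data and summarize its shape
X, y = load_iris(return_X_y=True)
print("samples, features:", X.shape)

# Split, fit a model, and evaluate its predictions
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))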
NumPy: It is the fundamental package for scientific computing with Python. Among its important
features are a powerful N-dimensional array object and tools for linear algebra, Fourier transforms,
and random number generation.
TensorFlow: TensorFlow is an end-to-end open source platform for machine learning. It has a
comprehensive, flexible ecosystem of tools, libraries and community resources that lets
researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered
applications.
SimpleITK: It provides an abstraction layer to ITK that enables developers and users to access
the powerful features of the Insight Toolkit in an easy-to-use manner for biomedical image analysis.