
A PROJECT REPORT ON

CHATBOT USING FINE TUNED RANDOM FOREST

A Major Project Submitted to Jawaharlal Nehru Technological University, Kakinada in
Partial Fulfillment of the Requirements for the Award of the Degree of

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE AND ENGINEERING

Submitted By

Ms.G. VARA LAKSHMI (18KT1A0517)

Mr.K. J HARSHA VARDHAN (18KT1A0524)

Ms.M. YESASWINI (18KT1A0530)

Mr.V. PRUDHVI CHARAN (18KT1A0559)

Ms.M. DEEPIKESWARI (18KT1A0532)

Under the Esteemed Guidance of

Dr. SHAIK AKBAR

Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

POTTI SRIRAMULU CHALAVADI MALLIKARJUNA RAO

COLLEGE OF ENGINEERING & TECHNOLOGY
(Approved by AICTE New Delhi, Affiliated to JNTU-Kakinada)
KOTHAPET, VIJAYAWADA-520001, A.P
2018-2022
POTTI SRIRAMULU CHALAVADI MALLIKARJUNA RAO

COLLEGE OF ENGINEERING & TECHNOLOGY

KOTHAPET, VIJAYAWADA-520001.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the project work entitled “CHATBOT USING FINE TUNED
RANDOM FOREST” is a bonafide work carried out by Ms. G. VARA LAKSHMI
(18KT1A0517), Mr. K. J HARSHA VARDHAN (18KT1A0524), Ms. M. YESASWINI
(18KT1A0530), Mr. V. PRUDHVI CHARAN (18KT1A0559), Ms. M. DEEPIKESWARI
(18KT1A0532) in partial fulfillment for the award of the degree of Bachelor of Technology in COMPUTER
SCIENCE AND ENGINEERING of Jawaharlal Nehru Technological University, Kakinada during
the year 2021-2022. It is certified that all corrections/suggestions indicated for internal assessment
have been incorporated in the report. The project report has been approved as it satisfies the academic
requirements in respect of project work prescribed for the above degree.

Project Guide Head of the Department

External Examiner
ACKNOWLEDGEMENT

We owe many thanks to the many people who helped, supported and advised us at
every step.

We are glad to have had the support of our principal, Dr. K. Sri Rama Krishna, who inspired
us with his words filled with dedication and discipline towards work.

We express our gratitude to Dr. A. Pathanjali Sastri, Professor & HoD of CSE, for
extending his support through training classes, which were a major resource in carrying out our
project.

We are very thankful to Dr. Shaik Akbar, Professor and guide of our project, for guiding
us and correcting our documents with attention and care. He took pains to go through
the project and make the necessary corrections as and when needed.

Finally, we thank one and all who directly and indirectly helped us to complete our project
successfully.

Project Associates

G. VARA LAKSHMI (18KT1A0517)

K. J HARSHA VARDHAN (18KT5A0524)

M. YESASWINI (18KT5A0530)

V. PRUDHVI CHARAN (18KT1A0559)

M. DEEPIKESWARI (18KT1A0532)
DECLARATION

This is to declare that the project entitled “CHATBOT USING FINE TUNED
RANDOM FOREST”, submitted by us in partial fulfillment of the requirements for the award
of the degree of Bachelor of Technology in Computer Science & Engineering at Potti
Sriramulu Chalavadi Mallikarjuna Rao College of Engineering and Technology, is
a bonafide record of project work carried out by us under the supervision and guidance of Dr.
Shaik Akbar, Professor of CSE. To the best of our knowledge, the work has not been submitted to
any other institute or university for any other degree.

Project Associates

G. VARA LAKSHMI (18KT1A0517)

K. J HARSHA VARDHAN (18KT5A0524)

M. YESASWINI (18KT5A0530)

V. PRUDHVI CHARAN (18KT1A0559)

M. DEEPIKESWARI (18KT1A0532)
ABSTRACT
Question answering (QA) is one of the most researched topics among Machine
Learning problems. Answer selection in Community Question Answering (CQA) is
a complex and crucial problem in the development of an autonomous Question
Answering System in a Natural Language Processing (NLP) environment. QA systems allow
a user to express a question in natural language and get an immediate and brief response. A
Question Answering System is an information retrieval system in which a direct response is
expected to a submitted query, as opposed to a list of references that may contain
the answers. It is a tool for human-machine communication. QA systems in Natural
Language Processing are designed to provide learners with accurate responses to their
inquiries.

Unlike the majority of information retrieval systems, QA systems strive to retrieve
point-by-point responses as opposed to a deluge of documents or even matching sections.
The most difficult aspect of a question answering system is providing reliable responses from
the vast amount of data. Taking these challenges into account, our proposed system is designed
to work efficiently on large datasets.
CONTENTS

1. INTRODUCTION
1.1 Brief Overview of the Project
1.1.1 Scope
1.1.2 Purpose
1.1.3 Objective of the Study
1.1.4 Literature Survey
1.2 Problem Statement
1.3 Proposed System

2. SYSTEM ANALYSIS
2.1 System Study
2.1.1 Feasibility Study
2.2 Requirement Analysis
2.2.1 Functional Requirements
2.2.2 Non Functional Requirements
2.3 System Requirement Specification
2.3.1 Hardware Requirements
2.3.2 Software Requirements
2.3.3 Libraries Required
2.4 Methodologies
2.4.1 Preprocessing
2.4.2 Word Embeddings
2.4.3 Classification
2.4.4 Performance Metrics

3. SYSTEM DESIGN
3.1 About System Design
3.1.1 Initialize Design Definition
3.1.2 Establish Design Characteristics
3.1.3 Assess Alternatives for System Elements
3.1.4 Manage the Design
3.2 System Architecture
3.3 Unified Modelling Language (UML) Diagrams
3.4 Data Flow Diagrams (DFD)
3.5 Dataset (Sample)

4. SYSTEM IMPLEMENTATION
4.1 About Implementation
4.2 Source Code
4.3 Results

5. TESTING
5.1 About Testing
5.2 Testing Methods
5.2.1 Static vs Dynamic Testing
5.2.2 The Box Approach
5.2.3 Testing Levels
5.3 Validation and Verification
5.3.1 Verification in Software Testing
5.3.2 Validation in Software Testing
5.4 Test Cases

6. CONCLUSION

7. REFERENCES

8. BIBLIOGRAPHY

9. APPENDIX
LIST OF FIGURES

1. Fig. 2.1 Data Transformation
2. Fig. 2.2 SVM Classification using Hyperplane
3. Fig. 2.3 Random Forest
4. Fig. 3.1 System Architecture for CHATBOT
5. Fig. 3.2 Use Case Diagram for System
6. Fig. 3.3 Sequence Diagram for System
7. Fig. 3.4 Activity Diagram for System
8. Fig. 3.5 Data Flow Diagram for Level-0
9. Fig. 3.6 Data Flow Diagram for Level-1
10. Fig. 3.7 Data Flow Diagram for Level-2
11. Fig. 3.8 Dataset
12. Fig. 5.1 Black Box Testing
13. Fig. 5.2 Levels of Testing
14. Fig. 5.3 Verification and Validation

LIST OF TABLES

1. Table 1.1 Existing System
2. Table 5.1 Test Cases

CHATBOT USING FINE TUNED RANDOM FOREST

1. INTRODUCTION
1.1 BRIEF OVERVIEW OF THE PROJECT
The main aim of this project is to build an efficient question answering (QA) system using several
Machine Learning techniques and to carry out a comparative study of which technique suits the task best.
In this project, we built a QA system that answers questions from a confirmed
knowledge base, which was developed from texts posted to a social media list. We apply the Naïve
Bayes, Support Vector Machine and Random Forest algorithms to the collected data and check
which algorithm is the most accurate for the QA system. We then use Python to host the
web application and design our own question answering system.

1.1.1 Scope

The question answering system using NLP is intended to give accurate answers to the
questions raised by the user, and to answer more efficiently than other systems. Modern
information retrieval systems allow us to locate documents or websites that might contain the relevant
information, but the majority of them leave it to the user to extract the useful information from an
ordered list. In this system, we used Machine Learning algorithms and obtained 85% accuracy. Question
Answering lies at the intersection of Natural Language Processing, Information Retrieval, Machine
Learning, Knowledge Representation, Logic and Inference, and Semantic Search.

1.1.2 Purpose
The main aim of this question answering system is to give an accurate answer. The QA system
enables users to access knowledge resources in a natural way (i.e., by asking questions) and to
get back a relevant and proper response in concise words. A QA system has to understand natural
language questions correctly and deduce their precise meaning in order to retrieve exact responses. The
processing of a question answering system has three stages: analyzing and classifying the
question, identifying the answers related to the question, and selecting the most accurate and efficient
answer.


1.1.3 Objective of the study

Nowadays chatbots play an important role in various sectors. Consumers are demanding
round-the-clock service for assistance in areas ranging from banking and finance to health and
wellness. Because of this demand, chatbots are increasing in popularity among businesses and
consumers alike. The objective of our project is to review various ML techniques and understand
their limitations and advantages, and to compile a thorough list of Natural Language Processing
methods. Finally, our main motive is to build an efficient system and choose the best algorithms for
the QA model.


1.1.4 Literature Survey

In this section, we briefly discuss the literature available on question
answering systems using NLP.

[1] Jafar A. Alzubi et al. addressed the dangers posed by COVID-19, a disease caused by a novel
coronavirus and associated with severe acute respiratory illness. They proposed COBERT, a
retriever-reader dual algorithmic system that answers complex queries by searching a collection
of 59K coronavirus-related documents made available through the Coronavirus Open Research
Dataset Challenge (CORD-19). The retriever is made up of a TF-IDF vectorizer that captures the
top 500 documents with the highest scores. The reader is a Bidirectional Encoder
Representations from Transformers (BERT) model pre-trained on SQuAD 1.1 dev. Using the
DistilBERT version of BERT in the reader phase on CORD-19 gives an accuracy of 70%, which the
authors compare against BERT. The proposed model uses only the generated embedding to capture
context-sensitive features. Their system has been pre-trained on the limited domain of COVID-19
literature, all of which is in English.

[2] Darshana V. Vekariya et al. presented versatile global T-max pooling and DeepLSTM for
quality answer prediction. They also used an efficient Deep Factorization Machine (DFM) to predict
good answers; DFM is especially useful for ranking. In contrast to the other existing
strategies, this approach is based solely on neural structures and RNNs. They used DeepLSTM,
which is simple to train. For testing, they used four datasets: STSB, MRPC, SICK and
Wikipedia QA. STSB, SICK and Wikipedia QA are used to compute semantic similarity, and
MRPC is used for paraphrase identification. SICK is made up of 10,000 English sentence
pairs. The experimental results show that the proposed method performs well overall,
with an accuracy of approximately 79%.

[3] Emmanuel Mutabazi et al. provided a study of medical textual question answering systems
based on deep learning methodologies, which can capture the exact meaning of a statement.
Researchers have combined machine learning and NLP methodologies with
probabilistic, algebraic and neural network models such as CNN and RNN to solve various
answer-processing challenges. MedQuAD and emrQA are the datasets used in this paper. As a result,
compared to traditional methods, complicated medical or clinical questions can be answered
with a high degree of accuracy. The drawback is a delay in retrieving relevant and reliable answers.


[4] Md. Rafiuzzaman Bhuiyan et al. used a sequence-to-sequence
architecture to create an automatic context-based question answering system. The model uses an
encoder layer, a bi-directional LSTM and a decoder layer, followed by an attention
mechanism. The main challenges of this work are data collection and finding an appropriate
vocabulary for word mapping and other tasks. The model's primary function is to provide answers
to questions. They were successful in answering questions and reduced their training loss to
0.003. They created their own dataset from the web. The stated disadvantage concerns
word-to-vector and lemmatizer support for Bengali.

[5] Mrs. Kavita Moholkar et al. used an LSTM model, a hybrid LSTM-Convolutional NN
model and a Multilayer Perceptron (MLP) model to present an ensemble
strategy for predicting answers to multiple choice questions. First, the LSTM and hybrid
LSTM-CNN models are trained in parallel. A Multilayer Perceptron is then employed to predict
the choice individually on the training dataset. The 8thGr-NDMC datasets are chosen
for model evaluation and comparison, and are also used for testing purposes. The
findings show that the proposed approach outperforms several alternative single
forecasting models.


Table 1.1.4.1 Existing system

S.No | Author | Algorithms | Advantages | Disadvantages
1 | Darshana V. Vekariya | T-Max Pooling and DeepLSTM, DFM (Deep Factorization Machine) | DeepLSTM is powerful for capturing long-term dependencies. | LSTM requires a much larger amount of data than other techniques to perform efficiently.
2 | Jafar A. Alzubi | Deep learning algorithm: BERT (Bidirectional Encoder Representations from Transformers) | DistilBERT captures the feature dependencies of long documents in a better way. | DistilBERT captures context-sensitive features only from the generated embedding, and the system is pre-trained on a closed domain (COVID-19).
3 | Emmanuel Mutabazi | LSTM (Long Short-Term Memory), RNN, CNN | This QA system provides answers within a short span. | In this QA system, document summarization takes a long time to generate.
4 | Md. Rafiuzzaman Bhuiyan | BiLSTM (Bidirectional LSTM) | Training loss is very low in this QA system (0.003). | Training this QA system requires more memory.
5 | Mrs. Kavita Moholkar | LSTM and LSTM-CNN (parallel) | The approach performs better than some other single forecasting models. | This QA system gives lower accuracy for bulky data.


1.2 PROBLEM STATEMENT

At present there is a very large number of documents, webpages and files. Previous systems
return only a set of documents related to the question; they do not identify the document
that actually answers it. With these existing systems only information retrieval is possible.
There is also a possibility that unauthorized people can access the data in the system. With this
traditional approach there is no confirmation that the data we get is correct. Our question answering
system will therefore help people get accurate and efficient answers.

1.3 PROPOSED SYSTEM


Question answering is a critical NLP problem and a long-standing artificial intelligence milestone.
QA systems allow a user to express a question in natural language and get an immediate and brief
response. QA systems are now found in search engines and phone conversational interfaces, and
they are fairly good at answering simple questions with short snippets of information. Reading
comprehension is the ability to read a piece of text and then answer questions about it. It is difficult
for machines because it requires both natural language understanding and knowledge of the world.
In our proposed system, we took an unsupervised dataset, from which our chatbot answers the
questions raised by the user. First, we preprocess the data using a data transformation
technique. The data then goes through the corpus engine, which compares keywords and
analyses the text patterns by checking the phrases. In the next phase, count vectorization performs the
basic encoding, i.e. it converts the text to a basic binary form, and the TF-IDF vectorizer then identifies
patterns by binding words to form a single string. We used a fine-tuned Random Forest algorithm
to obtain more accurate results.
The steps involved in the proposed system are:

• Preprocessing the dataset
• Word embedding
• Classification
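
The word embedding step listed above relies on count and TF-IDF vectorization. The following is a minimal pure-Python sketch of TF-IDF weighting, assuming the textbook form tf × log(N / df). The report does not state which TF-IDF variant or library settings were used, so the formula, the `tokenize` and `tf_idf` helper names and the two sample questions are illustrative assumptions only.

```python
import math
import re


def tokenize(text):
    # Lower-case the text and keep alphabetic runs only (an assumed, simple tokenizer).
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    weights = []
    for tokens in tokenized:
        row = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)   # relative term frequency
            idf = math.log(n_docs / df[term])       # rarer terms get a larger idf
            row[term] = tf * idf
        weights.append(row)
    return weights


# Hypothetical user questions, not taken from the project dataset.
docs = ["how do I reset my password", "how do I change my email"]
weights = tf_idf(docs)
```

In this sketch, words shared by every document (such as "how") receive a weight of zero, while words that distinguish one question from another (such as "password") receive a positive weight, which is why TF-IDF features are commonly fed to a classifier in place of raw counts.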


2. SYSTEM ANALYSIS
System analysis is the process of analyzing a system with the goal of improving or
modifying it. Analysis breaks the problem down into smaller elements of study and
ultimately provides a better solution. Analysis is an important part of system development.
It involves gathering and interpreting facts, diagnosing the problem and using the
information to recommend improvements to the system. Ultimately, the goal is to give a computerized
solution.

2.1 SYSTEM STUDY

2.1.1 Feasibility Study


It is wise to think about the feasibility of any problem we take on. Feasibility is the study of the
impact that the development of a system has on the organization. The impact can be either positive or
negative. When the positive dominates the negative, the system is considered feasible. Here the
feasibility study is performed in four ways: economic feasibility, technical feasibility,
operational feasibility and behavioral feasibility.

2.1.1.1 Operational Feasibility:


A proposed system is beneficial only if it can be turned into a system that meets the
operating requirements. The test of feasibility asks whether the system will work when it is
developed and installed. The purpose of the operational feasibility study is to determine whether the
new system will be used if it is developed and implemented. The proposed system will be used
frequently since it satisfies all the communication needs; hence operational feasibility is assured.

2.1.1.2 Technical Feasibility:


According to the feasibility analysis procedure, the technical feasibility of the system is analyzed and the
technical requirements such as software facilities, procedures and inputs are identified. It is also one of
the important phases of the system development activities. All resources needed for the development
and maintenance of the software are available. Here we are utilizing resources
that are already available.


2.1.1.3 Behavioral Feasibility:


People are inherently resistant to change, and computers have been known to facilitate change. An
estimate should be made of how strong a reaction the users are likely to have towards the development of a
computerized system. There are various levels of users in order to ensure proper authentication,
authorization and security of the sensitive data of the organization.

2.1.1.4 Financial and Economic Feasibility:

Economic analysis is the most frequently used method for evaluating the effectiveness of a system. More
commonly known as cost/benefit analysis, the procedure is to determine the benefits and savings that
are expected from a system and compare them with the costs; the decision to design and implement
the system is made on that basis. This part of the feasibility study gives top management the economic
justification for the new system. A simple economic analysis that gives the actual comparison of costs
and benefits is much more meaningful in such cases. For this system, economic feasibility is well
satisfied: if the organization implements the system, it does not require any additional
hardware resources, and it will save a lot of time.

2.2 REQUIREMENT ANALYSIS


This approach overcomes the tedious manual procedure by generating reports automatically
and providing the required information.

2.2.1 Functional Requirement


A functional requirement document defines the functionality of a system or one of its subsystems.
It also depends upon the type of software, the expected users and the type of system where the software
is used.

Functional user requirements may be high-level statements of what the system should do, while
functional system requirements describe the system services in detail.

2.2.2 Non Functional Requirement


In systems engineering and requirements engineering, a non-functional requirement is a requirement
that specifies criteria that can be used to judge the operation of a system, rather than specific
behaviors.
• Usability
This section includes all the requirements that affect usability. The system will be very easy to use
for the naïve user. The software has a simple, user-friendly interface, so that the user saves time and
avoids confusion.


• Reliability
The system is reliable because it uses APIs developed by Google that work even in noisy
environments. On the receiver side, the Python platform is used, which makes the code more reliable.
• Performance
This system exhibits high performance because it is well optimised and developed using
high-level languages, and it responds to the end user in very little time.
• Supportability
This system is designed to be cross-platform. It is supported on a wide range
of hardware and software platforms. It also uses Python and hence is highly portable.
• Flexibility
If we intend to increase or extend the functionality of the software after it is deployed, this should
be planned from the beginning. New modules can be easily integrated into our system without
disturbing the existing modules or modifying the local schema of existing applications.

2.3 SYSTEM REQUIREMENT SPECIFICATION

A System Requirements Specification (SRS) is the requirements work product that formally specifies
the system-level requirements of a single system or an application. The System Requirements
Specification identifies, defines and clarifies the requirements that, when satisfied through
development, meet the operational/functional need identified in the Project Concept Proposal,
Project Business Case, and Project Charter. Approval of this document constitutes agreement that
a developed system satisfying these requirements will be accepted.

2.3.1 Hardware Requirements


➢ Processor: i5 Processor
➢ Input Devices: Keyboard, Mouse
➢ RAM: MIN 8 GB
➢ Hard disk: 500 GB

2.3.2 Software Requirements


➢ Operating System: Windows 10
➢ Coding Language: Python 3.9
➢ Spyder 5.1.5
➢ Jupyter Notebook 6.4.11


2.3.3 Libraries Required

➢ Sklearn

Scikit-learn (Sklearn) is a widely used and robust library for machine learning in Python.
It provides a selection of efficient tools for machine learning and statistical modeling, including
classification, regression, clustering, and dimensionality reduction, via a consistent interface in
Python. This library, which is largely written in Python, is built upon NumPy, SciPy, and
Matplotlib.

➢ Pandas

Pandas is an open source Python package that is widely used for data science, data
analysis and machine learning tasks. It is built on top of another package named NumPy, which
provides support for multi-dimensional arrays. As one of the most popular data wrangling
packages, Pandas works well with many other data science modules inside the Python
ecosystem, and is included in many Python distributions, including commercial vendor
distributions such as ActiveState’s ActivePython.

➢ Numpy

NumPy (Numerical Python) is a numerical computing library for Python. It is a very important library
on which almost all data science and machine learning Python packages, such as SciPy
(Scientific Python), Matplotlib (plotting library) and Scikit-learn, depend to a considerable
extent. NumPy is very useful for performing mathematical and logical operations on arrays.
It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python.

➢ Seaborn

Seaborn is a library for making statistical graphics in Python. It builds on top
of matplotlib and integrates closely with pandas data structures. Seaborn helps you explore and
understand your data. Its plotting functions operate on data frames and arrays containing whole
datasets and internally perform the necessary semantic mapping and statistical aggregation to
produce informative plots. Its dataset-oriented, declarative API lets you focus on what the
different elements of your plots mean, rather than on the details of how to draw them.

➢ Matplotlib

Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
Matplotlib was created by John D. Hunter. It is open source and can be used freely.
Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and
JavaScript for platform compatibility.

➢ Word Cloud

Word Cloud is a data visualization technique used for representing text data in which the size
of each word indicates its frequency or importance. Significant textual data points can be
highlighted using a word cloud. Word clouds are widely used for analysing data from social
network websites.

2.4 METHODOLOGIES

2.4.1 Pre-processing

Data pre-processing is the process of transforming raw data into an understandable format.
It is an important step in data mining, as we cannot work with raw data. The quality
of the data should be checked before applying machine learning or data mining algorithms.

Previous studies have shown that data cleaning and dimensionality reduction affect
the efficiency of machine learning and deep learning systems. During data
cleaning, the proposed system identified missing data and handled it with the traditional
imputation technique: since most of the attributes do not exhibit skewness, all missing
values are replaced with the mean of their respective attribute. The next important factor in
pre-processing is data transformation. Numerical values affect the computation time
more than categorical data do, and most of the attributes in the dataset have only a few unique values
in their respective columns.

Data transformation is also a key pre-processing step. In the proposed system it is applied as standard
feature scaling, calculated as the difference between the attribute value and the attribute
mean, divided by the standard deviation.
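
The two operations described above, mean imputation and standard feature scaling, can be sketched in a few lines of standard-library Python. This is an illustration rather than the project's code: the function names and sample column are hypothetical, missing values are assumed to be encoded as `None`, and the population standard deviation is assumed because the report does not say which form was used.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def standardize(column):
    """Standard scaling: (value - mean) / standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


raw = [2.0, None, 4.0, 6.0]      # hypothetical attribute with one missing value
filled = impute_mean(raw)        # the gap is filled with the mean of 2, 4 and 6
scaled = standardize(filled)     # result has mean 0 and unit standard deviation
```

After scaling, every attribute has zero mean and unit variance, so attributes measured on large numeric ranges do not dominate those measured on small ones.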


2.4.1.1 Data Transformation

Figure 2.1: Data Transformation

Data are transformed into forms appropriate for mining. Data
transformation involves the following:
1. In Normalisation, the attribute data are scaled to fall within a small specified range, such
as -1.0 to 1.0, or 0 to 1.0.
2. Smoothing removes noise from the data. Such techniques include binning, clustering,
and regression.
3. In Aggregation, summary or aggregation operations are applied to the data. For example, daily
sales data may be aggregated so as to compute monthly and annual total amounts. This step is
typically used in constructing a data cube for analysis of the data at multiple granularities.
4. In Generalisation of the data, low-level or primitive/raw data are replaced by higher-level concepts
through the use of concept hierarchies. For example, a categorical attribute such as street is
generalised to a higher-level concept such as city or country. Similarly, the values of a numeric
attribute such as age may be mapped to higher-level concepts such as young, middle-aged, or senior.
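
Three of the transformations listed above (normalisation, aggregation and generalisation) can be sketched with standard-library Python. The helper names, the sample values, the `YYYY-MM-DD` date format and the age cut-offs of 30 and 60 are all assumptions chosen for illustration; the report does not specify them.

```python
from collections import defaultdict


def min_max(values, low=0.0, high=1.0):
    """Normalisation: rescale values linearly into the range [low, high]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [low for _ in values]  # constant column: no spread to rescale
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def monthly_totals(daily_sales):
    """Aggregation: sum (date, amount) pairs per month, keyed by 'YYYY-MM'."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount
    return dict(totals)


def age_group(age):
    """Generalisation: map a numeric age to a higher-level concept."""
    if age < 30:          # assumed cut-off
        return "young"
    if age < 60:          # assumed cut-off
        return "middle-aged"
    return "senior"


unit_range = min_max([10, 20, 30])                  # scaled into 0 to 1.0
signed_range = min_max([10, 20, 30], -1.0, 1.0)     # scaled into -1.0 to 1.0
sales = [("2022-01-05", 100.0), ("2022-01-20", 50.0), ("2022-02-01", 70.0)]
by_month = monthly_totals(sales)
groups = [age_group(a) for a in (22, 45, 71)]
```

Smoothing is left out of the sketch because binning, clustering and regression each need their own parameters, which the report does not give.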

2.4.2 Word Embeddings:

Many Machine Learning algorithms and almost all Deep Learning
architectures are not capable of processing strings or plain text in their raw form. In a broad sense,
they require numbers as inputs to perform any sort of task, such as classification,
regression or clustering. Also, given the huge amount of data that exists in text format, it is
imperative to extract knowledge from it and build useful applications. In short,
to build any model in machine learning or deep learning, the final data has to be in
numerical form, because models do not understand text or image data directly as humans do.

To convert text data into numerical data, we need techniques known
as vectorization or, in the NLP world, word embeddings. Vectorization
or word embedding is thus the process of converting text data to numerical vectors. Those vectors
are later used to build various machine learning models. This is referred to as extracting features
from text, with the aim of building natural language processing models.

Count Vectorizer
1. It is one of the simplest ways of doing text vectorization.
2. It creates a document-term matrix, which is a set of dummy variables that indicate whether a particular
word appears in the document.
3. The count vectorizer fits and learns the word vocabulary and creates a document-term matrix
in which each cell denotes the frequency of a word in a particular document, which is
also known as term frequency. Each column is dedicated to one word in the corpus.

2.4.2.1 Matrix Formulation


Consider a corpus C containing D documents {d1, d2, ..., dD} from which we extract N unique
tokens. The dictionary consists of these N tokens, and the Count Vector matrix M formed has
size D × N. Row i of the matrix M holds the frequencies of the tokens present in
document di.

Let’s consider the following example:

Document-1: He is a smart boy. She is also smart.
Document-2: Chirag is a smart person.

The dictionary created contains the list of unique tokens (words) present in the corpus. In this
example the common words ‘is’, ‘a’ and ‘also’ are left out of the dictionary.

Unique Words: [‘He’, ‘She’, ‘smart’, ‘boy’, ‘Chirag’, ‘person’]

Here, D = 2 and N = 6.

So the count matrix M of size 2 × 6 is represented as:


      He   She   smart   boy   Chirag   person
D1    1    1     2       1     0        0
D2    0    0     1       0     1        1

A column can also be understood as the word vector for the corresponding word in the matrix
M. For example, for the above matrix, the word vectors are:

Vector for ‘smart’ is [2, 1],
Vector for ‘Chirag’ is [0, 1], and so on.
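
The worked example above can be reproduced with a small pure-Python sketch. The vocabulary is passed in explicitly, in the same order as the Unique Words list, so that words outside the dictionary are simply not counted. The `count_matrix` helper is hypothetical, written for illustration; it is not the library vectorizer used in the project.

```python
import re


def count_matrix(docs, vocabulary):
    """Build a D x N matrix of token counts, one row per document."""
    matrix = []
    for doc in docs:
        tokens = re.findall(r"[A-Za-z]+", doc)  # case-sensitive word tokens
        matrix.append([tokens.count(word) for word in vocabulary])
    return matrix


docs = [
    "He is a smart boy. She is also smart.",
    "Chirag is a smart person.",
]
vocabulary = ["He", "She", "smart", "boy", "Chirag", "person"]

M = count_matrix(docs, vocabulary)

# A column of M is the word vector of the corresponding vocabulary entry.
smart_vector = [row[vocabulary.index("smart")] for row in M]
chirag_vector = [row[vocabulary.index("Chirag")] for row in M]
```

Here `M` has D = 2 rows and N = 6 columns, and reading down a column gives the word vectors quoted in the text.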

2.4.3 Classification

Classification is a technique in which we categorize data into a given number of classes. The main
goal of a classification problem is to identify the category/class into which new data will fall.
Classifier: An algorithm that maps the input data to a specific category.
In machine learning and statistics, classification is a supervised learning approach in which the
computer program learns from the input data given to it and then uses this learning to classify new
observations.
Some classification techniques are given below:
• Support Vector Machine
• Naive Bayes
• Random Forest

2.4.3.1 Support Vector Machine


Support Vector Machine (SVM) is one of the most popular supervised learning algorithms. It is used
for both classification and regression problems, but primarily for classification problems in
machine learning.
The goal of the SVM algorithm is to create the best line or decision boundary that segregates
n-dimensional space into classes, so that a new data point can easily be placed in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the diagram below, in which two different categories are separated by a decision boundary
or hyperplane:


Figure 2.2: SVM classification using hyperplane.

Algorithm:
Step 1: The SVM algorithm predicts the classes. One of the classes is labelled 1 while the other is
labelled -1.
Step 2: Like all machine learning algorithms, SVM converts the business problem into a mathematical
equation involving unknowns. These unknowns are found by casting the problem as an optimization
problem, which maximizes or minimizes some quantity while adjusting the unknowns. For the SVM
classifier, a loss function known as the hinge loss is used and tuned to find the maximum margin.
Step 3: For ease of understanding, this loss function can also be called a cost function whose cost is
0 when no class is incorrectly predicted. Otherwise, the error/loss is calculated. There is a
trade-off between maximizing the margin and the loss incurred when the margin is made very large.
To balance the two, a regularization parameter is added.
Step 4: As with most optimization problems, the weights are optimized by calculating gradients,
i.e. the partial derivatives of the cost with respect to the weights.
Step 5: When a point is classified correctly, the weights are updated using only the regularization
parameter; when a misclassification happens, the loss function is used as well.
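The steps above can be sketched as a small sub-gradient descent loop. This is a minimal illustration and not the implementation used later in this report: the toy data, the learning rate lr, the regularization strength lam and the number of epochs are all assumed values chosen for the example.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * (w.x + b)) for one sample with label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on lam * ||w||^2 + hinge loss (illustrative settings)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if hinge_loss(w, b, x, y) == 0.0:
                # Correct with enough margin: only the regularization term contributes.
                w = [wi - lr * 2 * lam * wi for wi in w]
            else:
                # Inside the margin or misclassified: the loss term contributes as well.
                w = [wi - lr * (2 * lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (made up for illustration).
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

The two branches of the update correspond to Step 5: the regularization term is always applied, and the hinge-loss term is added only when a point violates the margin.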

2.4.3.2 Naive Bayes

Naive Bayes is a supervised learning algorithm. Naive Bayes classifiers are a collection of
classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of
algorithms that share a common principle: every pair of features being classified is independent
of each other, given the class.

Bayes Theorem:
Bayes’ Theorem finds the probability of an event occurring given the probability of another event
that has already occurred. Bayes’ theorem is stated mathematically as the following equation:

P(A|B) = P(B|A) × P(A) / P(B)

where A and B are events and P(B) ≠ 0.


P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
The naive assumption added to Bayes’ theorem is that the features are independent of each other,
given the class.
Algorithm:
• Convert the given data into a frequency table.
• Generate a likelihood table by finding the probabilities of the given features.
• Use Bayes’ theorem to calculate the posterior probability.
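The three steps can be illustrated with a few lines of standard-library Python. The training sentences and class names below are made up for the example, and add-one (Laplace) smoothing is an assumption added so that unseen words do not produce a zero probability.

```python
from collections import Counter, defaultdict

# Toy training data (made up for illustration): (tokens, class label).
training = [
    (["feeling", "sad", "today"], "flagged"),
    (["need", "help", "sad"], "flagged"),
    (["nice", "weather", "today"], "not_flagged"),
    (["help", "with", "homework"], "not_flagged"),
    (["nice", "day"], "not_flagged"),
]

# Step 1: frequency tables for the classes and for the words within each class.
class_counts = Counter(label for _, label in training)
word_counts = defaultdict(Counter)
for tokens, label in training:
    word_counts[label].update(tokens)
vocabulary = {token for tokens, _ in training for token in tokens}


def likelihood(word, label):
    """Step 2: P(word | class) with add-one (Laplace) smoothing."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total + len(vocabulary))


def posterior(tokens):
    """Step 3: P(class | tokens) via Bayes' theorem under the naive assumption."""
    scores = {}
    for label, count in class_counts.items():
        score = count / len(training)          # prior P(class)
        for token in tokens:
            score *= likelihood(token, label)  # product of per-word likelihoods
        scores[label] = score
    evidence = sum(scores.values())            # P(tokens), the normalizing constant
    return {label: score / evidence for label, score in scores.items()}


result = posterior(["sad", "today"])
print(result)
```

Multiplying the per-word likelihoods is exactly where the naive independence assumption enters; dividing by the evidence makes the posteriors sum to one.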

2.4.3.3 Random Forest


Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each
tree and predicts the final output based on the majority vote of those predictions. A greater
number of trees in the forest generally leads to higher accuracy, up to a point of diminishing
returns, and reduces the risk of overfitting compared with a single decision tree.

The below diagram explains the working of the Random Forest algorithm:


Figure 2.3 : Random Forest

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes.
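A minimal sketch of these steps follows. To keep it short, each "tree" is a one-split decision stump on a single numeric feature, and the K points are drawn with replacement; both choices are assumptions made for illustration, as are the toy data and the function names.

```python
import random
from collections import Counter


def fit_stump(points):
    """Pick the threshold on a 1-D feature that misclassifies the fewest sampled points."""
    best = None
    for threshold, _ in points:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != label for x, label in points)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def fit_forest(data, n_trees, k, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):                           # Steps 3 and 4: repeat for N trees
        subset = [rng.choice(data) for _ in range(k)]  # Step 1: K random points
        trees.append(fit_stump(subset))                # Step 2: build a tree on the subset
    return trees


def predict(trees, x):
    votes = Counter(tree(x) for tree in trees)         # Step 5: majority vote
    return votes.most_common(1)[0][0]


# Toy 1-D data (made up): small values are class 0, large values are class 1.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
forest = fit_forest(data, n_trees=25, k=6)
print(predict(forest, 0.5), predict(forest, 9.5))
```

Individual stumps trained on an unlucky subset can be wrong, but the vote across many of them is much more stable, which is the point of the ensemble.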

2.4.4 Performance Metrics:

Accuracy: Accuracy is the proportion of predictions that the model gets right, estimated as the
number of correct predictions divided by the total number of records. In the proposed model, we
have applied the model on a dataset. The Social Media dataset contains 200 records. The confusion
matrix for the Social Media dataset for the intelligent ensemble algorithm can be described as
follows, and the accuracy is computed as shown in the equation.

Accuracy = (True Churn + True NonChurn) / Total Records = 85%
Recall: The ability of a model to find all the relevant cases within a dataset. Mathematically,
recall is defined as follows:

Recall = TP / (TP + FN)

Precision: The ability of a classification model to identify only the relevant data points.
Precision is defined as follows:

Precision = TP / (TP + FP)

F-Measure: A measure that combines precision and recall is the harmonic mean of precision and
recall, the traditional F-measure.

F-measure = 2 × (Precision × Recall) / (Precision + Recall)
          = 2 × TP / (2 × TP + FP + FN)
          = 85%
ROC curve:

An ROC curve (receiver operating characteristic curve) is a graph showing the performance
of a classification model at all classification thresholds. This curve plots two parameters: True
Positive Rate and False Positive Rate.

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

TPR = TP / (TP + FN)

False Positive Rate (FPR) is defined as follows:

FPR = FP / (FP + TN)
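The metrics in this section can all be computed from the four confusion-matrix counts. The sketch below is illustrative: the function name is hypothetical and the counts are made-up values, not the confusion matrix of this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics defined above from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)            # also the true positive rate (TPR)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)               # false positive rate
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "fpr": fpr,
    }


# Hypothetical counts for a 100-record test set (not taken from the report).
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Note that the two forms of the F-measure agree: with these counts, 2 × TP / (2 × TP + FP + FN) is 80 / 110, the same value the harmonic-mean form produces.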


3. SYSTEM DESIGN
3.1 ABOUT SYSTEM DESIGN
System design is the process of designing the elements of a system, such as the architecture,
modules, components, the interfaces of those components, and the data that flows through the system.

The purpose of the system design process is to provide sufficiently detailed data and information
about the system and its elements to enable an implementation consistent with the architectural
entities defined in the models and views of the system architecture.

Elements of a System:
Architecture - The conceptual model that defines the structure, behavior and other views of
a system. Flowcharts can be used to illustrate the architecture.

Modules - Components that each handle one specific task in a system. A combination of the
modules makes up the system.

Components - Each provides a particular function or group of related functions. They are made up
of modules.

Interfaces - The shared boundary across which the components of the system exchange
information and relate.

Data - The management of the information and data flow.

3.1.1 Initialize design definition


➢ Plan for and identify the technologies that will compose and implement the system elements
and their physical interfaces.
➢ Determine which technologies and system elements risk becoming obsolete or evolving
during the operation stage of the system. Plan for their potential replacement.
➢ Document the design definition strategy, including the need for and requirements of any
enabling systems, products, or services to perform the design.


3.1.2 Establish design characteristics


➢ Define the design characteristics relating to the architectural characteristics and check that
they are implementable.
➢ Define the interfaces that were not defined by the System Architecture process or that need to
be refined as the design details evolve.
➢ Define and document the design characteristics of each system element.

3.1.3 Assess alternatives for obtaining system elements


➢ Assess the design options
➢ Select the most appropriate alternatives.
➢ If the decision is made to develop the system element, the rest of the design definition process
and the implementation process are used. If the decision is to buy or reuse a system element, the
acquisition process may be used to obtain the system element.

3.1.4 Manage the design


➢ Capture and maintain the rationale for all selections among alternatives and decisions for
the design and architecture characteristics.
➢ Assess and control the evolution of the design characteristics.

3.2 SYSTEM ARCHITECTURE

A system architecture or systems architecture is the conceptual model that defines the structure,
behavior, and more views of a system. An architecture description is a formal description and
representation of a system, organized in a way that supports reasoning about the structures and
behaviors of the system.

An architecture description also indicates how nonfunctional requirements will be satisfied. For
example:

➢ Specification of system/component performance. For example, data throughput and response
times as a function of concurrent users.
➢ Consideration of scalability. For example, can an air traffic control system designed to
manage 100 aircraft be extended to manage 1000 aircraft?
➢ System availability. For example, elements of the design that enable a system to operate 24/7.


➢ Safety integrity. Elements of the design that reduce the risk that the system will cause (or
allow causation of) harm to property and human beings.
➢ Fault tolerance. Elements of the design that allow the system to continue to operate if some
components fail (e.g. no single point of failure).
➢ Consideration of product evolution. The facility for individual components to be modified
or dynamically reconfigured without the need for major modification of the system as a whole.
Further, the ability to add functionality with new components in a cost-effective manner.
➢ Consideration of the emergent qualities of the system as a whole when components are
assembled and operated by human beings. For example, can the missile launch system be effectively
operated in a high-stress combat situation?

Fig 3.1 : System Architecture for Chat Bot


3.3 UNIFIED MODELLING LANGUAGE (UML) DIAGRAMS:

UML stands for Unified Modelling Language. UML is a standardized general-purpose modelling
language in the field of object-oriented software engineering. The standard is managed, and was
created by, the Object Management Group.

The goal is for UML to become a common language for creating models of object-oriented
computer software. In its current form, UML comprises two major components: a meta-model
and a notation. In the future, some form of method or process may also be added to, or associated
with, UML.

The Unified Modeling Language is a standard language for specifying, visualizing, constructing
and documenting the artifacts of software systems, as well as for business modeling and other
non-software systems.

The UML represents a collection of best engineering practices that have proven successful in the
modeling of large and complex systems.

The UML is a very important part of developing object-oriented software and of the software
development process. The UML uses mostly graphical notations to express the design of software
projects.

GOALS:

The Primary goals in the design of the UML are as follows:

1. Provide users a ready-to-use, expressive visual modeling Language so that they can develop and
exchange meaningful models.

2. Provide extensibility and specialization mechanisms to extend the core concepts.

3. Be independent of particular programming languages and development processes.

4. Provide a formal basis for understanding the modeling language.

5. Encourage the growth of OO tools market.

6. Support higher level development concepts such as collaborations, frameworks, patterns and
components.

7. Integrate best practices.


3.3.1 Use case Diagram:

Use Case diagrams identify the functionality provided by the system (use cases), the users
who interact with the system (actors), and the association between the users and the functionality.
Use Cases are used in the Analysis phase of software development to articulate the high-level
requirements of the system. The primary goals of Use Case diagrams include:
➢ Providing a high-level view of what the system does.
➢ Identifying the users ("actors") of the system.
➢ Determining areas needing human-computer interfaces.

Graphical Notation: The basic components of Use Case diagrams are the Actor, the Use Case,
and the Association.

Actor: An Actor, as mentioned, is a user of the system, and is depicted using a stick figure.
The role of the user is written beneath the icon. Actors are not limited to humans. If a system
communicates with another application, and expects input or delivers output, then that
application can also be considered an actor.

Use case: A Use Case is functionality provided by the system. Use Cases are depicted with an
ellipse. The name of the use case is written within the ellipse.

Directed Association: These Associations are used to link Actors with Use Cases, and indicate
that an Actor participates in the Use Case in some form.


Behind each Use Case is a series of actions to achieve the proper functionality, as well as alternate
paths for instances where validation fails or errors occur. These actions can be further defined in
a Use Case description. Because this is not addressed in UML, there are no standards for Use Case
descriptions. However, there are some common templates one can follow, and whole books on the
subject of writing Use Case descriptions.

USE CASE DIAGRAM:

Fig 3.2: Use case diagram for System


3.3.2 Sequence Diagram

Sequence diagrams document the interactions between classes to achieve a result, such as a
use case. Because UML is designed for object-oriented programming, these communications
between classes are known as messages. The Sequence diagram lists objects horizontally and time
vertically, and models these messages over time.

Graphical Notation: In a Sequence diagram, classes and actors are listed as columns, with vertical
lifelines indicating the lifetime of the object over time.

Object: Objects are instances of classes, and are arranged horizontally. The pictorial
representation for an Object is a class (a rectangle) with the name prefixed by the object name
(optional) and a colon.

Lifeline: The Lifeline identifies the existence of the object over time. The notation for a
Lifeline is a vertical dotted line extending from an object.

Activation: Activations, modeled as rectangular boxes on the lifeline, indicate when the object is
performing an action.

Message: Messages, modeled as horizontal arrows between Activations, indicate the
communications between objects.


Sequence Diagram:

Fig 3.3: Sequence diagram for System

3.3.3 Activity Diagram


This shows the flow of events within the system. The activities that occur within a use case or
within an object's behavior typically occur in a sequence. An activity diagram is designed to be a
simplified look at what happens during an operation or a process.

Each activity is represented by a rounded rectangle. The processing within an activity runs to
completion, and then an automatic transition to the next activity occurs. An arrow represents the
transition from one activity to the next. An activity diagram describes a system in terms of
activities. Activities are states that represent the execution of a set of operations. These
diagrams are similar to flow charts and data flow diagrams.

Initial state: The state in which the process starts.


Action State: An action state represents the execution of an atomic action, typically the
invocation of an operation. An action state is a simple state with an entry action whose only
exit transition is triggered by the implicit event of completing the execution of the entry
action.

Transition: A transition is a directed relationship between a source state vertex and a target
state vertex. It may be part of a compound transition, which takes the state machine from one
state configuration to another, representing the complete response of the state machine to a
particular event instance.

Final state: A final state represents the last or "final" state of the enclosing composite state.
There may be more than one final state at any level, signifying that the composite state can
end in different ways or conditions. When a final state is reached and there are no other
enclosing states, the entire state machine has completed its transitions and no more transitions
can occur.

Decision: A state diagram (and by derivation an activity diagram) expresses a decision when
guard conditions are used to indicate different possible transitions that depend on Boolean
conditions of the owning object.


Activity Diagram:

Fig 3.4: Activity diagram for System


3.4 DATA FLOW DIAGRAMS:

A data-flow diagram is a way of representing a flow of data through a process or a system.
The DFD also provides information about the outputs and inputs of each entity and the process
itself. A data-flow diagram has no control flow: there are no decision rules and no loops.

A data flow diagram (DFD) maps out the flow of information for any process or system.
It uses defined symbols like rectangles, circles and arrows, plus short text labels, to show data
inputs, outputs, storage points and the routes between each destination. Data flow diagrams
provide a graphical representation of how information moves between processes in a system.

Data flow diagrams can range from simple, even hand-drawn process overviews, to in-depth,
multi-level DFDs that dig progressively deeper into how the data is handled. They can be used to
analyze an existing system or model a new one.

The objective of a DFD is to show the scope and boundaries of a system as a whole. It
may be used as a communication tool between a systems analyst and any person who plays a part
in the system, and it acts as a starting point for redesigning a system.

RULES OF DFD:
The basic rules for DFD are as follows:

• Each process should have at least one input and an output.

• Each data store should have at least one data flow in and one data flow out.

• Data stored in a system must go through a process.

• All processes in a DFD go to another process or a data store.

The following observations about DFDs are essential:


• All names should be unique. This makes it easier to refer to elements in the DFD.

• Remember that a DFD is not a flow chart. Arrows in a flow chart represent the order of events;
arrows in a DFD represent flowing data. A DFD does not involve any order of events.


• Do not become bogged down with details. Defer error conditions and error handling until the end
of the analysis.

SYMBOLS IN DFD:

1. Process — represents any transformative process of the incoming flow of information into the
outgoing workflow. The process receives input and generates an output;

2. Data flow — represents the movement of information within the system between external entities,
data stores, and processes. Reflects the nature of the data used in the system;

3. Datastore — represents repositories for data that is not moving at the moment. It may be either a
just in case buffer or queue for later use. Most commonly it is either database tables or membership
forms;

4. External entity — represents sources or destination points of information outside the boundaries
of the described system. Entities either provide data to the system or receive from the processes.
EE usually resides on the edges of the diagram.


3.4.1 DFD Level 0:


DFD Level 0 is also called a Context Diagram. It’s a basic overview of the whole system or
process being analyzed or modeled. It’s designed to be an at-a-glance view, showing the system
as a single high-level process, with its relationship to external entities. It should be easily
understood by a wide audience, including stakeholders, business analysts, data analysts and
developers.

Figure 3.5: Level-0 Data Flow Diagram

3.4.2 DFD Level 1

DFD Level 1 provides a more detailed breakout of pieces of the Context Level Diagram. You will
highlight the main functions carried out by the system, as you break down the high-level process
of the Context Diagram into its subprocesses.

Figure 3.6: Level-1 Data Flow Diagram


3.4.3 DFD Level 2

DFD Level 2 then goes one step deeper into parts of Level 1. It may require more text to reach the
necessary level of detail about the system’s functioning.

Figure 3.7: Level-2 Data Flow Diagram


3.5 Dataset (Sample):

Figure 3.8: Data Set


4. SYSTEM IMPLEMENTATION
4.1 ABOUT IMPLEMENTATION:

Implementation is the most crucial stage in achieving a successful system and in giving the users
confidence that the new system is workable and effective. Here, implementation means putting a
modified application in place of an existing one. This type of conversion is relatively easy to
handle, provided there are no major changes in the system.

Each program is tested individually at the time of development using test data, and it is verified
that the programs link together in the way specified in the program specification. The computer
system and its environment are tested to the satisfaction of the user. The system that has been
developed has been accepted and proved satisfactory for use, and so it is going to be implemented
very soon. A simple operating procedure is included so that the user can understand the different
functions clearly and quickly.

4.2 SOURCE CODE FOR CHATBOT USING FINE TUNED RANDOM FOREST:

DATA COLLECTION:

Code:

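# DATA READING
import pandas as pd

# Load the chatbot responses, keeping only the columns used later
chatbot = pd.read_csv(r"C:\Users\LENOVO\Downloads\STACK\Sheet_1.csv",
                      usecols=['response_id', 'class', 'response_text'],
                      encoding='latin-1')

# Load the resume data
resume = pd.read_csv(r"C:\Users\LENOVO\Downloads\STACK\Sheet_2.csv", encoding='latin-1')

# Preview the first five rows of each dataset
chatbot.head(5)
resume.head(5)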

RESULT:

DATA VISUALIZATION:

Word cloud:

Word Cloud is a data visualization technique used for representing text data in which the size of
each word indicates its frequency or importance. Significant textual data points can be highlighted
using a word cloud. Word clouds are widely used for analyzing data from social network websites.
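The idea that word size follows word frequency can be sketched without any plotting library. The responses and the stop-word set below are made up for illustration and merely stand in for the response_text column.

```python
from collections import Counter
import re

# Made-up responses standing in for the response_text column.
responses = [
    "I would listen and help my friend",
    "Help them talk to a friend or a counsellor",
    "Listen first, then help",
]
stopwords = {"i", "and", "my", "them", "to", "a", "or", "then", "would"}

# Lower-case the text, split it into words and drop the stop words.
words = re.findall(r"[a-z]+", " ".join(responses).lower())
frequencies = Counter(word for word in words if word not in stopwords)

# The most frequent words would be drawn largest in the cloud.
print(frequencies.most_common(3))
```

A word cloud library performs essentially this counting step and then maps each count to a font size.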

Code:

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stop = set(STOPWORDS)  # assumed: the source does not show how `stop` was defined

def cloud(text):
    # Join all responses (upper-cased) into one string and build the word cloud
    wordcloud = WordCloud(background_color="blue", stopwords=stop).generate(
        " ".join([i for i in text.str.upper()]))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title("Chat Bot Response")

cloud(chatbot['response_text'])

RESULT:

TSNE (T-distributed Stochastic Neighbour Embedding):


t-SNE [1] is a tool to visualize high-dimensional data. It converts similarities between data points
to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint
probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost
function that is not convex, i.e. with different initializations we can get different results.
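The quantity that t-SNE minimizes can be illustrated on its own. The sketch below only evaluates the Kullback-Leibler divergence between two small discrete distributions; it does not perform the embedding, and the probability values are made up for the example.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up pairwise-similarity distributions over the same three point pairs.
p_high = [0.7, 0.2, 0.1]      # similarities in the original high-dimensional space
q_good = [0.65, 0.25, 0.10]   # an embedding that preserves them well
q_poor = [0.2, 0.3, 0.5]      # an embedding that distorts them

print(kl_divergence(p_high, q_good))
print(kl_divergence(p_high, q_poor))
```

The embedding whose similarities stay close to the original ones gives the smaller divergence, which is the direction in which t-SNE moves the low-dimensional points.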

Code:

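from sklearn.manifold import TSNE

# Project the TF-IDF matrix (Tf_idf, built elsewhere) into 3 dimensions.
# n_iter is kept as in the source; newer scikit-learn releases call this parameter max_iter.
tsne = TSNE(n_components=3, init='random', random_state=101, method='barnes_hut',
            n_iter=250, verbose=2, angle=0.5).fit_transform(Tf_idf.toarray())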

RESULT:


Code:

import plotly.graph_objs as go
import plotly.offline as py  # assumed: the source does not show how `py` was imported

# 3-D scatter plot of the t-SNE embedding
trace1 = go.Scatter3d(x=tsne[:, 0], y=tsne[:, 1], z=tsne[:, 2], mode='markers',
                      marker=dict(sizemode='diameter', color='red', colorscale='Portland',
                                  colorbar=dict(title='Text'),
                                  line=dict(color='rgb(255, 255, 255)'), opacity=0.75))
data = [trace1]
layout = dict(height=800, width=800, title='test')
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='3DBubble')

RESULT:

DATA PREPROCESSING:

Transformers:

Transformers provides APIs to easily download and train state-of-the-art pretrained models. Using
pretrained models can reduce your compute costs and carbon footprint, and save you the time of
training a model from scratch. The models can be used across different modalities such as:
• Text: text classification, information extraction, question answering, summarization,
translation, and text generation in over 100 languages.
• Images: image classification, object detection, and segmentation.
• Audio: speech recognition and audio classification.
• Multimodal: table question answering, optical character recognition, information extraction
from scanned documents, video classification, and visual question answering.

Code:

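from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

count_vect = CountVectorizer()
x = chatbot['response_text']
y = chatbot['class']
# Hold out a test split (scikit-learn keeps 25% for testing by default)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
# Learn the vocabulary on the training text only, then reuse it for the test text
X_train_counts = count_vect.fit_transform(x_train)
X_test_counts = count_vect.transform(x_test)
# Re-weight the raw counts with TF-IDF
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)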

Machine Learning Model Results:


A) SVM:
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression. Although it can handle regression problems as well, it is best suited
for classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional
space that distinctly classifies the data points. The dimension of the hyperplane depends upon the
number of features. If the number of input features is two, the hyperplane is just a line. If the
number of input features is three, the hyperplane becomes a 2-D plane. It becomes difficult to
visualize when the number of features exceeds three.
Code:
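import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

# Linear SVM (hinge loss) trained with SGD on TF-IDF features.
# alpha is kept as printed in the source; 1e-3 is a more typical value.
svm = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e3,
                                          n_iter_no_change=5, random_state=42))])
svm_fit = svm.fit(x_train, y_train)
svm_predict = svm.predict(x_test)
np.mean(svm_predict == y_test)  # accuracy on the test split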


RESULT:
0.75

B) Naive Bayes:
Naive Bayes is a supervised learning algorithm that is based on Bayes’ theorem and used for
solving classification problems. It is mainly used in text classification, which involves a
high-dimensional training dataset. The Naive Bayes classifier is one of the simplest and most
effective classification algorithms, and it helps in building fast machine learning models that
can make quick predictions. It is a probabilistic classifier, which means it predicts on the basis
of the probability of an object.

Code:

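from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

vect = CountVectorizer()  # assumed: `vect` is not defined in the source
NB = MultinomialNB()      # assumed: `NB` is not defined in the source
x = chatbot.response_text
y = chatbot.Label  # assumes a 'Label' column derived from 'class' elsewhere
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
x_train_dtm = vect.fit_transform(x_train)  # document-term matrix for training
x_test_dtm = vect.transform(x_test)
NB.fit(x_train_dtm, y_train)
y_predict = NB.predict(x_test_dtm)
metrics.accuracy_score(y_test, y_predict)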

RESULT:
0.7

C) Random Forest:
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model. As the name suggests, "Random
Forest is a classifier that contains a number of decision trees on various subsets of the given dataset
and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one
decision tree, the random forest takes the prediction from each tree and predicts the final output
based on the majority vote of those predictions. A greater number of trees in the forest generally
leads to higher accuracy, up to a point of diminishing returns, and reduces the risk of overfitting
compared with a single decision tree.

Code:

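from sklearn.ensemble import RandomForestClassifier

# Limit tree depth and the features tried per split, then train on the document-term matrix
rf = RandomForestClassifier(max_depth=10, max_features=10)
rf.fit(x_train_dtm, y_train)
rf_predict = rf.predict(x_test_dtm)
metrics.accuracy_score(y_test, rf_predict)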

RESULT:

0.8


5. TESTING

5.1 ABOUT TESTING

Software Testing is an investigation conducted to provide stakeholders with information about the
quality of the product or service under test. Software testing can also provide an objective,
independent view of the software to allow the business to appreciate and understand the risks of
software implementation. Test techniques include, but are not limited to, the process of executing a
program or application with the intent of finding software bugs (errors or other defects). It involves
the execution of a software component or system component to evaluate one or more properties of
interest. In general, these properties indicate the extent to which the component or system under test:

➢ Meets the requirements that guided its design and development.
➢ Responds correctly to all kinds of inputs.
➢ Performs its functions within an acceptable time.
➢ Is sufficiently usable.
➢ Can be installed and run in its intended environments.

As the number of possible tests for even simple software components is practically infinite, all
software testing uses some strategy to select tests that are feasible for the available time and resources.
As a result, software testing typically (but not exclusively) attempts to execute a program or
application with the intent of finding software bugs (errors or other defects).

Software Testing can provide objective, independent information about the quality of software and
risk of its failure to users and/or sponsors.

Software testing can be conducted as soon as executable software (even if partially complete) exists.
The overall approach to software development often determines when and how testing is conducted.
For example, in a phased process, most testing occurs after system requirements have been defined
and then implemented in testable programs. In contrast, under an Agile approach, requirements,
programming, and testing are often done concurrently.


5.2 TESTING METHODS (levels of Testing)

5.2.1 Static vs. Dynamic Testing

There are many approaches to software testing. Reviews, walkthroughs, and inspections are
referred to as static testing, whereas executing programmed code with a given set of test cases is
referred to as dynamic testing. Static testing is often implicit, as in proofreading, or when
programming tools and text editors check source code structure, or compilers (pre-compilers) check
syntax and data flow as static program analysis. Dynamic testing takes place when the program
itself is run. It may begin before the program is 100% complete in order to test particular sections
of code, and is applied to discrete functions or modules. Typical techniques for this are using
stubs/drivers or executing from a debugger environment.

Static testing involves verification, whereas dynamic testing involves validation. Together
they help improve software quality. In addition, mutation testing can be used to check that the
test cases detect errors that are introduced by mutating the source code.
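The idea of mutation testing can be shown with a small sketch. The function `is_adult`, its mutant and the two test suites are hypothetical examples written for this illustration, not code from the project: a mutant changes one operator, and a good test suite should fail on it.

```python
def is_adult(age):
    """Original implementation: adults are 18 or older."""
    return age >= 18

def is_adult_mutant(age):
    """Mutant: the comparison operator '>=' has been changed to '>'."""
    return age > 18

def weak_suite(func):
    """Test suite without a boundary case."""
    return func(30) is True and func(5) is False

def strong_suite(func):
    """Test suite that also checks the boundary value 18."""
    return weak_suite(func) and func(18) is True

# A suite "kills" a mutant when it passes on the original but fails on the mutant.
weak_kills = weak_suite(is_adult) and not weak_suite(is_adult_mutant)
strong_kills = strong_suite(is_adult) and not strong_suite(is_adult_mutant)
print(weak_kills, strong_kills)
```

The weak suite passes on both versions, so the mutant survives and reveals a gap in the tests; adding the boundary case closes it.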

5.2.2 The Box Approach

Software testing methods are traditionally divided into white- and black-box testing. These two
approaches are used to describe the point of view that a test engineer takes when designing test
cases.

5.2.2.1 White-Box Testing


White-box testing (also known as clear box testing, glass box testing, transparent box testing and
structural testing) tests internal structures or workings of a program, as opposed to the functionality
exposed to the end-user. In white-box testing an internal perspective of the system, as well as
programming skills, are used to design test cases. The tester chooses inputs to exercise paths
through the code and determine the appropriate outputs. This is analogous to testing nodes in a
circuit, e.g. in-circuit testing (ICT).

While white-box testing can be applied at the unit, integration and system levels of the software
testing process, it is usually done at the unit level. It can test paths within a unit, paths between
units during integration, and between subsystems during a system-level test.


Though this method of test design can uncover many errors or problems, it might not detect
unimplemented parts of the specification or missing requirements.

Techniques used in white-box testing include:

➢ API testing: It tests the application using public and private APIs.
➢ Code coverage: It creates tests to satisfy some criteria of code coverage.

Code coverage tools can evaluate the completeness of a test suite that was created with any method,
including black-box testing. This allows the software team to examine parts of a system that are
rarely tested and ensures that the most important function points have been tested.

Code coverage as a software metric can be reported as a percentage for:

• Function coverage, which reports on functions executed
• Statement coverage, which reports on the number of lines executed to complete the test

100% statement coverage ensures that every statement is executed at least once. It does not by
itself guarantee that every branch or path (in terms of control flow) has been taken, and even full
coverage is not sufficient to ensure correct functionality, since the same code may process
different inputs correctly or incorrectly.
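Statement coverage can be measured in a rough way with Python's built-in tracing hook. The sketch below is only an illustration: `apply_discount` is a hypothetical function, and `executed_lines` is a helper written for this example that records which lines of the function run for a given input (real projects would normally use a dedicated coverage tool).

```python
import sys

def apply_discount(price, is_member):
    discount = 0.0
    if is_member:
        discount = 0.1
    return price * (1 - discount)

def executed_lines(func, *args):
    """Return the line offsets (relative to the 'def' line) executed by func(*args)."""
    code = func.__code__
    seen = set()

    def tracer(frame, event, arg):
        if frame.f_code is code and event == "line":
            seen.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return seen

member_lines = executed_lines(apply_discount, 100.0, True)
guest_lines = executed_lines(apply_discount, 100.0, False)
print(sorted(member_lines), sorted(guest_lines))
```

The member test alone executes every statement of the function, yet it never exercises the case where the condition is false; only the second test does, which is why full statement coverage does not imply that every branch outcome was tested.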

5.2.2.2 Black-Box Testing


Black-box testing treats the software as a "black box", examining functionality without any
knowledge of internal implementation. The testers are only aware of what the software is supposed
to do, not how it does it. Black-box testing methods include: equivalence partitioning, boundary
value analysis, all-pairs testing, state transition tables, decision table testing, fuzz testing,
model-based testing, use case testing, exploratory testing and specification-based testing.

Figure 5.1: Black-box testing
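Two of the black-box methods listed above, equivalence partitioning and boundary value analysis, can be sketched as follows. The length rule (1 to 200 characters) and the function names are assumptions made for this example only; the point is that test inputs are derived from the specification, not from the code.

```python
def boundary_values(low, high):
    """Return the classic boundary test inputs for the valid range [low, high]."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]

def is_valid_length(n, low=1, high=200):
    """Hypothetical rule: a chatbot message must be 1 to 200 characters long."""
    return low <= n <= high

# Three equivalence partitions: too short, valid, too long.
# Boundary value analysis picks inputs at the edges of those partitions.
results = [(n, is_valid_length(n)) for n in boundary_values(1, 200)]
for n, ok in results:
    print(n, ok)
```

Six inputs are enough to probe both edges of the valid partition, where off-by-one mistakes are most likely.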

Specification-based testing aims to test the functionality of software according to the applicable
requirements. This level of testing usually requires thorough test cases to be provided to the
tester, who can then simply verify that for a given input, the output value (or behavior) either
"is" or "is not" the same as the expected value specified in the test case. Test cases are built
around specifications and requirements, i.e., what the application is supposed to do. It uses
external descriptions of the software, including specifications, requirements, and designs, to
derive test cases. These tests can be functional or non-functional, though usually functional.

Specification-based testing may be necessary to assure correct functionality, but it is insufficient to
guard against complex or high-risk situations.

One advantage of the black box technique is that no programming knowledge is required. Whatever
biases the programmers may have had, the tester likely has a different set and may emphasize different
areas of functionality. On the other hand, black-box testing has been said to be “like a walk in a dark
labyrinth without a flashlight.” Because they do not examine the source code, there are situations
when a tester writes many test cases to check something that could have been tested by only one test
case, or leaves some parts of the program untested.

This method of testing can be applied to all levels of software testing: unit, integration, system and
acceptance. It typically comprises most if not all testing at higher levels, but can also dominate
unit testing.

5.2.2.3 Testing Levels


There are generally four recognized levels of tests: unit testing, integration testing, system testing,
and acceptance testing. Tests are frequently grouped by where they are added in the software
development process, or by the level of specificity of the test. The main levels during the development
process as defined by the SWEBOK guide are unit-, integration-, and system testing that are
distinguished by the test target without implying a specific process model. Other test levels are
classified by the testing objective.


Development phase     Corresponding test level
Client Needs          Acceptance Testing
Requirements          System Testing
Design                Integration Testing
Code                  Unit Testing

Figure 5.2: Levels of testing

5.2.2.4 Unit Testing


Unit testing, also known as component testing, refers to tests that verify the functionality of a specific
section of code, usually at the function level. In an object-oriented environment, this is usually at the
class level, and the minimal unit tests include the constructors and destructors.

These types of tests are usually written by developers as they work on code (white-box style), to ensure
that the specific function is working as expected. One function might have multiple tests, to catch
corner cases or other branches in the code. Unit testing alone cannot verify the functionality of a
piece of software, but rather is used to ensure that the building blocks of the software work
independently from each other.

Unit testing is a software development process that involves synchronized application of a broad
spectrum of defect prevention and detection strategies in order to reduce software development
risks, time, and costs. It is performed by the software developer or engineer during the
construction phase of the software development lifecycle. Rather than replacing traditional QA
activities, it augments them. Unit testing aims to eliminate construction errors before code is promoted
to QA; this strategy is intended to increase the quality of the resulting software as well as the
efficiency of the overall development and QA process.

Depending on the organization's expectations for software development, unit testing might
include static code analysis, data flow analysis, metrics analysis, peer code reviews, code
coverage analysis and other software verification practices.
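A unit test for a single function might look like the sketch below, written with Python's standard `unittest` module. The `normalize` function is a hypothetical text preprocessing step assumed for this example; it is not taken from the project code.

```python
import unittest

def normalize(text):
    """Hypothetical preprocessing step: lower-case and collapse whitespace."""
    return " ".join(text.lower().split())

class NormalizeTests(unittest.TestCase):
    def test_lowercases_input(self):
        self.assertEqual(normalize("Hello"), "hello")

    def test_collapses_whitespace(self):
        self.assertEqual(normalize("  what   is  AI? "), "what is ai?")

    def test_empty_string(self):
        self.assertEqual(normalize(""), "")

# Load and run the test case explicitly so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```

Each test method checks one behaviour of the unit, including a corner case (the empty string), so a failure points directly at the broken behaviour.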

5.2.2.5 Integration Testing


Integration testing is any type of software testing that seeks to verify the interfaces between
components against a software design. Software components may be integrated in an iterative
way or all together (“big bang”). Normally the former is considered a better practice since it
allows interface issues to be located more quickly and fixed.

Integration testing works to expose defects in the interfaces and interaction between integrated
components (modules). Progressively larger groups of tested software components
corresponding to elements of the architectural design are integrated and tested until the software
works as a system.

5.2.2.6 Component Interface Testing


The practice of component interface testing can be used to check the handling of data passed
between various units, or subsystem components, beyond full integration testing between those
units. The data being passed can be considered as “message packets” and the range or data types
can be checked, for data generated from one unit, and tested for validity before being passed into
another unit. One option for interface testing is to keep a separate log file of data items being
passed, often with a time stamp logged to allow analysis of thousands of cases of data passed
between units for days or weeks. Tests can include checking the handling of some extreme data
values while other interface variables are passed as normal values. Unusual data values in an
interface can help explain unexpected performance in the next unit. Component interface testing is
a variation of black-box testing, with the focus on the data values beyond just the related actions
of a subsystem component.
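The checks described above (data types, value ranges and a time-stamped log of what crosses the interface) can be sketched as follows. The `Prediction` message and its fields are assumptions made for this illustration of data passed from a classifier unit to a reply unit.

```python
import logging
from dataclasses import dataclass

# Time-stamped log of every item that crosses the interface.
logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)
log = logging.getLogger("interface")

@dataclass
class Prediction:
    """Hypothetical message passed from the classifier unit to the reply unit."""
    intent: str
    confidence: float

def check_prediction(msg):
    """Return a list of interface violations; an empty list means the message is valid."""
    problems = []
    if not isinstance(msg.intent, str) or not msg.intent:
        problems.append("intent must be a non-empty string")
    if not isinstance(msg.confidence, float) or not 0.0 <= msg.confidence <= 1.0:
        problems.append("confidence must be a float in [0, 1]")
    log.info("checked %r -> %s", msg, problems or "ok")
    return problems

normal = check_prediction(Prediction("greeting", 0.8))   # normal value
extreme = check_prediction(Prediction("greeting", 1.7))  # extreme value
print(normal, extreme)
```

Validating the message before it reaches the next unit keeps an out-of-range value from showing up later as unexplained behaviour downstream.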

5.2.2.7 System Testing


System testing, or end-to-end testing, tests a completely integrated system to verify that it meets
its requirements. For example, a system test might involve testing a logon interface, then creating
and editing an entry, plus sending or printing results, followed by summary processing or deletion
(or archiving) of entries, then logoff.

In addition, software testing should ensure that the program, as well as working as expected,
does not destroy or partially corrupt its operating environment or cause other processes
within that environment to become inoperative (this includes not corrupting shared memory, not
consuming or locking up excessive resources, and leaving any parallel processes unharmed by its
presence).

Verification and validation are the most important components of software testing. The next
section discusses them in detail.

5.3 VALIDATION AND VERIFICATION

Software verification and validation activities check the software against its specification. Every
project should verify and validate the software it produces. This is done by:

• Verifying that each software item meets specified requirements.
• Verifying each software item before it is used as an input to another activity.
• Ensuring that checks on each software item are done, as far as possible, by someone other than
the developer.
• Ensuring that the amount of verification and validation effort is adequate to show that each
software item is suitable for operational use.

Project management is responsible for organizing software verification and validation activities,
defining software verification and validation roles, and allocating staff to those roles.


Whatever the size of the project, software verification and validation greatly affects software
quality. People are not infallible, and software that has not been verified has little chance of
working. Typically, an error rate of 2%-5% per 1000 lines of code is found during development, and
0.15%-4% per 1000 lines of code remains after system testing. Each error could lead to an
operational failure or non-compliance with a requirement. The purpose of software verification and
validation is to reduce software errors to an acceptable level. The effort needed can range from
30% to 90% of the total project resources, depending upon the complexity and criticality of the
software. The following diagram shows the process flow.

Figure 5.3: Verification and Validation

5.3.1 Verification in Software Testing:


The main points about verification in software testing are:

• Verification ensures that the product is designed to deliver all functionality to the customer.
• It is done at the start of the development process. It includes reviews and meetings,
walkthroughs, inspections, etc. to evaluate documents, plans, code, requirements and specifications.
• It checks that the product is being built right.
• It also checks that the data is accessed correctly, in the right place and in the right way.
• Verification is a low-level activity.
• It is performed during development through walkthroughs, reviews and inspections, mentor
feedback, training, checklists and standards.
• It demonstrates the consistency, completeness, and correctness of the software at each stage and
between each stage of the development life cycle.

According to the Capability Maturity Model, verification can also be described as the process of
evaluating software to determine whether the products of a given development phase satisfy the
conditions imposed at the start of that phase.

5.3.2 Validation in Software Testing:

The following points describe the validation process in software testing:

• Validation determines whether the system complies with the requirements, performs the functions
for which it is intended, and meets the organization's goals and user needs.
• It checks that the right product is being built.
• It also checks that the correct data is accessed.
• Validation is a high-level activity.
• It also determines the correctness of the final software product with respect to the user needs
and requirements.

According to the Capability Maturity Model, validation can also be described as the process of
evaluating software during the development process to determine whether it satisfies specified
requirements.


5.4 TEST CASES

Test Case ID   Test Case                     Expected Behaviour        Exhibited Behaviour            Result
1              Greetings/Wishes              Hello                     Hi                             Pass
2              What is your Name?            Chatbot                   Chatbot                        Pass
3              What are your fav subjects?   French, English           NLP, Machine learning          Fail
4              What is A.I?                  Artificial Intelligence   Machine powered Intelligence   Pass
5              What is today?                Friday                    Beautiful day                  Fail
Table 5.1: Test cases
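The outcomes in Table 5.1 can be recorded as data and summarised with a few lines of Python. This is only a bookkeeping sketch: the rows are copied from the table, and the Pass/Fail verdicts are those reported there, not recomputed by the code.

```python
# Rows of Table 5.1: (id, test case, expected, exhibited, result)
test_cases = [
    (1, "Greetings/Wishes", "Hello", "Hi", "Pass"),
    (2, "What is your Name?", "Chatbot", "Chatbot", "Pass"),
    (3, "What are your fav subjects?", "French, English", "NLP, Machine learning", "Fail"),
    (4, "What is A.I?", "Artificial Intelligence", "Machine powered Intelligence", "Pass"),
    (5, "What is today?", "Friday", "Beautiful day", "Fail"),
]

passed = [case_id for case_id, *_, result in test_cases if result == "Pass"]
failed = [case_id for case_id, *_, result in test_cases if result == "Fail"]
pass_rate = len(passed) / len(test_cases)

print(f"passed: {passed}, failed: {failed}, pass rate: {pass_rate:.0%}")
```

Three of the five test cases pass, giving a pass rate of 60% on this small set.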


6. CONCLUSION

The outcome of the project is an interactive, question-and-answer based web application in
which the user receives answers from a virtual assistant. First, we preprocessed the data and
cleaned it by finding and handling missing values. Next, we used word embedding to convert the
categorical text data into a machine-readable numeric format. Finally, we applied machine learning
models such as Random Forest, SVM, and Naïve Bayes to the data in order to develop a robust and
accurate question answering system.


7. REFERENCES

[1] Alzubi, J.A., Jain, R., Singh, A. et al. COBERT: COVID-19 Question Answering System Using
BERT. Arab J Sci Eng (2021). https://doi.org/10.1007/s13369-021-05810-5

[2] D. V. Vekariya and N. R. Limbasiya, "A Novel Approach for Semantic Similarity Measurement
for High Quality Answer Selection in Question Answering using Deep Learning Methods," 2020
6th International Conference on Advanced Computing and Communication Systems (ICACCS),
2020, pp. 518-522, doi: 10.1109/ICACCS48705.2020.9074471.

[3] Mutabazi, E.; Ni, J.; Tang, G.; Cao, W. A Review on Medical Textual Question Answering
Systems Based on Deep Learning Approaches. Appl. Sci. 2021, 11, 5456. https://doi.org/10.3390/
app11125456.

[4] M. R. Bhuiyan, A. K. M. Masum, M. Abdullahil-Oaphy, S. A. Hossain and S. Abujar, "An


Approach for Bengali Automatic Question Answering System using Attention Mechanism," 2020
11th International Conference on Computing, Communication and Networking Technologies
(ICCCNT), 2020, pp. 1-5, doi: 10.1109/ICCCNT49239.2020.9225264.

[5] K. Moholkar and S. H. Patil, "Multiple Choice Question Answer System using Ensemble Deep
Neural Network," 2020 2nd International Conference on Innovative Mechanisms for Industry
Applications (ICIMIA), 2020, pp. 762-766, doi: 10.1109/ICIMIA48430.2020.9074855.

[6] Thomala, L. L. Number of new hospital beds to be added in the designated hospitals after the
coronavirus covid-19 outbreak in Wuhan, China as of February 2, 2020. Statista.
https://www.statista.com/statistics/1095434/china-changes-inthe-number-of-hospital-beds-in-
designated-hospitals-after-coronavirus-outbreakin-wuhan/.

[7] Bogage, J. Tesla unveils ventilator prototype made with car parts on youtube. Wash. Post.
https://www.washingtonpost.com/business/2020/04/06/tesla-coronavirusventilators-musk/ (2020).

[8] Day, M. & Soper, S. Amazon is prioritizing essential products as online orders spike. Bloomberg.
https://www.bloomberg.com/news/articles/2020-03-17/ amazon-prioritizing-essentials-medical-
goods-in-virus-response.

[9] Acter, T.; Uddin, N.; Das, J.; Akhter, A.; Choudhury, T.R.; Kim, S.: Evolution of severe acute
respiratory syndrome coronavirus 2 (SARS-CoV-2) as coronavirus disease 2019 (COVID-19)
pandemic: A global health emergency. Science of the Total Environment p. 138996 (2020)

[10] Jeyaprakash, K.; Velavan, S.; Ganesan, S.; Arjun, P.: Covid-19- molecular transmission and
diagnosis-Review article. Asian J. Innov. Res. 5(2), 1 (2020)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Page 51


CHATBOT USING FINE TUNED RANDOM FOREST

[11] da Silveira, M.P.; da Silva Fagundes, K.K.; Bizuti, M.R.; Starck, É.; Rossi, R.C.; e Silva,
D.T.d.R.: Physical exercise as a tool to help the immune system against COVID-19: an integrative
review of the current literature. Clinical and experimental medicine p. 1 (2020)

[12] Shen, I.; Zhang, L.; Lian, J.; Wu, C.H.; Fierro, M.G.; Argyriou, A.; Wu, T.: In search for a cure:
recommendation with knowledge graph on CORD-19. In: Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, p. 3519 (2020)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Page 52


CHATBOT USING FINE TUNED RANDOM FOREST

8. BIBLIOGRAPHY

➢ HTML 101 by Jo Foster.

➢ The Guide of HTML5 and JAVASCRIPT by Jeanine Meyer.

➢ PHP for the Web: Visual QuickStart Guide by Larry Ullman.

➢ Head First PHP & MySQL (A Brain-Friendly Guide) by Lynn Beighley.

➢ The Joy of PHP: A Beginner's Guide to Programming Interactive Web Applications
with PHP and MySQL by Alan Forbes.

➢ Introduction to Machine Learning with Python by Andreas C. Müller, Sarah Guido.

➢ Meta Learning: How to Learn Deep Learning and Thrive in the Digital World by
Radek Osmulski.

➢ Ensemble Learning Algorithms with Python by Jason Brownlee.

➢ Head First Python, 2nd Edition by Paul Barry.


APPENDIX

Python Introduction

Python is a general-purpose, high-level programming language that is increasingly used in
data science and in designing machine learning algorithms.

Python is a high-level, interpreted, interactive and object-oriented scripting language. It is
designed to be highly readable: it frequently uses English keywords where other languages use
punctuation, and it has fewer syntactical constructions than other languages.

• Python is Interpreted − Python is processed at runtime by the interpreter. You do not need to
compile your program before executing it. This is similar to Perl and PHP.

• Python is Interactive − You can sit at a Python prompt and interact with the interpreter
directly to write your programs.

• Python is Object-Oriented − Python supports the object-oriented style of programming,
which encapsulates code within objects.

• Python is a Beginner's Language − Python is a great language for beginner-level
programmers and supports the development of a wide range of applications, from simple text
processing to WWW browsers to games.

History of Python
Python was developed by Guido van Rossum in the late eighties and early nineties at the National
Research Institute for Mathematics and Computer Science in the Netherlands.

Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
Smalltalk, the Unix shell and other scripting languages.

Python is copyrighted. Its source code is available under the Python Software Foundation License,
an open-source license that is compatible with the GNU General Public License (GPL).

Python is now maintained by a core development team, with Guido van Rossum having directed its
progress for most of its history.


Python Features
Python's features include −

• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined syntax. This
allows the student to pick up the language quickly.

• Easy-to-read − Python code is clearly defined and easy on the eyes.

• Easy-to-maintain − Python's source code is fairly easy to maintain.

• A broad standard library − The bulk of Python's library is very portable and cross-platform
compatible on UNIX, Windows, and Macintosh.

• Interactive Mode − Python has support for an interactive mode which allows interactive testing
and debugging of snippets of code.

• Portable − Python can run on a wide variety of hardware platforms and has the same interface on
all platforms.

• Extendable − You can add low-level modules to the Python interpreter. These modules enable
programmers to add to or customize their tools to be more efficient.

• Databases − Python provides interfaces to all major commercial databases.

• GUI Programming − Python supports GUI applications that can be created and ported to many
system calls, libraries and windows systems, such as Windows MFC, Macintosh, and the X Window
system of Unix.

• Scalable − Python provides a better structure and support for large programs than shell scripting.

Apart from the above-mentioned features, Python has many other good features; a few are listed below:

• It supports functional and structured programming methods as well as OOP.

• It can be used as a scripting language or can be compiled to byte-code for building large applications.

• It provides very high-level dynamic data types and supports dynamic type checking.

• It supports automatic garbage collection.

• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
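The support for structured, functional and object-oriented styles mentioned in the list above can be seen in a short sketch that computes the same value three ways. The word list and the `WordStats` class are made up for this illustration.

```python
from functools import reduce

words = ["chatbot", "random", "forest"]

# Structured (procedural) style: an explicit loop and accumulator.
total = 0
for w in words:
    total += len(w)

# Functional style: map and reduce, no mutation of shared state.
total_functional = reduce(lambda acc, n: acc + n, map(len, words), 0)

# Object-oriented style: data and behaviour encapsulated in a class.
class WordStats:
    def __init__(self, words):
        self.words = list(words)

    def total_length(self):
        return sum(len(w) for w in self.words)

total_oop = WordStats(words).total_length()
print(total, total_functional, total_oop)
```

All three versions give the same total; which style to use is a matter of readability and of how the rest of the program is organised.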

Python is a popular platform for research and for the development of production systems. It is a
vast language with a large number of modules, packages and libraries that provide multiple ways of
achieving a task.

Python libraries such as NumPy, SimpleITK, TensorFlow, Keras and Pandas are used in data
science and data analysis. They are also extensively used for creating scalable machine learning
algorithms and provide implementations of popular machine learning techniques such as
classification, regression, recommendation, and clustering.

Python also offers ready-made frameworks for performing data mining tasks on large volumes of data
effectively and in less time. These include implementations of algorithms such as linear
regression, logistic regression, Naïve Bayes, k-means, k-nearest neighbours, and Random Forest.

Because these libraries give developers access to optimized machine learning algorithms, a brief
introduction to machine learning is useful before moving further.

What is Machine Learning?


Data science, machine learning and artificial intelligence are some of the top trending topics in
the tech world today. Data mining and Bayesian analysis are also trending, which adds to the demand
for machine learning. This section gives a brief introduction to machine learning.

Machine learning is a discipline that deals with programming systems so that they automatically
learn and improve with experience. Here, learning means recognizing and understanding the input
data and making informed decisions based on it. It is very difficult to specify all the decisions
for all possible inputs. To solve this problem, algorithms are developed that build knowledge from
specific data and past experience by applying the principles of statistical science, probability,
logic, mathematical optimization, reinforcement learning, and control theory.


Applications of Machine Learning Algorithms


The developed machine learning algorithms are used in various applications such as:

• Vision processing
• Language processing
• Forecasting things like stock market trends and weather
• Pattern recognition
• Games
• Data mining
• Expert systems
• Robotics

Steps Involved in Machine Learning


A machine learning project involves the following steps −

• Defining a Problem
• Preparing Data
• Evaluating Algorithms
• Improving Results
• Presenting Results

The best way to get started using Python for machine learning is to work through a project end-to-
end and cover the key steps like loading data, summarizing data, evaluating algorithms and making
some predictions. This gives you a replicable method that can be used dataset after dataset. You can
also add further data and improve the results.
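The steps above can be walked through on a toy problem using only the standard library. Everything in this sketch is an assumption made for illustration: the eight labelled messages are invented, and the two "algorithms" are deliberately trivial stand-ins for real models.

```python
import random
from collections import Counter

# 1. Define the problem: label a message as "greeting" or "question".
# 2. Prepare data: a tiny, made-up labelled dataset, shuffled and split.
data = [
    ("hello", "greeting"), ("hi there", "greeting"), ("good morning", "greeting"),
    ("hey", "greeting"), ("what is ai?", "question"), ("who are you?", "question"),
    ("what is your name?", "question"), ("how old are you?", "question"),
]
random.Random(0).shuffle(data)
train, test = data[:6], data[6:]

# 3. Evaluate algorithms: compare two very simple candidates.
def majority_baseline(train_rows):
    """Always predict the most frequent label seen in training."""
    most_common = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda text: most_common

def punctuation_rule(train_rows):
    """Predict 'question' when the message ends with a question mark."""
    return lambda text: "question" if text.endswith("?") else "greeting"

def accuracy(model, rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(model(text) == label for text, label in rows) / len(rows)

scores = {
    "baseline": accuracy(majority_baseline(train), test),
    "rule": accuracy(punctuation_rule(train), test),
}

# 4-5. Improve and present results: report the scores and pick the best model.
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Comparing every candidate against a simple baseline on held-out data is the part of this workflow that carries over unchanged to real datasets and real models.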

Libraries and Packages and Datasets


To understand machine learning, you need to have basic knowledge of Python programming. In
addition, there are a number of libraries and packages generally used in performing various
machine learning tasks as listed below −

NumPy: It is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the fundamental
package for scientific computing with Python. It contains various features, including these
important ones:

• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random number capabilities

Pandas: It is a high-level data manipulation tool developed by Wes McKinney. It is built on the
NumPy package and its key data structure is called the DataFrame. DataFrames allow you to store
and manipulate tabular data in rows of observations and columns of variables.

TensorFlow: TensorFlow is an end-to-end open source platform for machine learning. It has a
comprehensive, flexible ecosystem of tools, libraries and community resources that lets
researchers push the state of the art in ML and developers easily build and deploy ML-powered
applications.

Keras: It is an open-source neural-network library written in Python. It is capable of running on
top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast
experimentation with deep neural networks, it focuses on being user-friendly, modular, and
extensible.

SimpleITK: It provides an abstraction layer to ITK that enables developers and users to access
the powerful features of the Insight Toolkit in an easy-to-use manner for biomedical image analysis.
