Final Last

This document provides an overview of machine learning and introduces some key concepts: [1] Machine learning allows computers to learn from data and experiences to predict outcomes without being explicitly programmed. It works by building prediction models from historical data that can then be used to predict new data. [2] There are three main types of machine learning: supervised learning which uses labeled data to predict outputs, unsupervised learning which finds hidden patterns in unlabeled data, and reinforcement learning which learns optimal behavior from interactions. [3] Python is introduced as a programming language for machine learning due to its useful data science libraries like NumPy, Pandas, Matplotlib, and Scikit-learn which support data analysis and modeling tasks. Common

Uploaded by

akhil

AN INTERNSHIP REPORT ON

MACHINE LEARNING USING PYTHON


An internship report is submitted in partial fulfillment of the requirement
for the award of the degree of
Bachelor of Technology
In
ELECTRONICS AND COMMUNICATION ENGINEERING

SUBMITTED
NALLALA HARI VIJAY RAM GOPAL
Regd.no:208297603033

Under the esteemed guidance of

Mr.B. Krishna M.Tech


Assistant Professor, Dept of ECE

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING


UNIVERSITY COLLEGE OF ENGINEERING

ADIKAVI NANNAYA UNIVERSITY: RAJAMAHENDRAVARAM, AP-533296

A.Y: 2020-2024

ADIKAVI NANNAYA UNIVERSITY: RAJAMAHENDRAVARAM

UNIVERSITY COLLEGE OF ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

CERTIFICATE

This is to certify that this internship report entitled “MACHINE LEARNING USING PYTHON” is the
bonafide work of NALLALA HARI VIJAY RAM GOPAL, Regd.no:208297603033, submitted in
partial fulfillment of the requirements for the award of the Degree of B. Tech (ECE) during the period
2020-2024. This work was carried out at EASY SHIKSHA.

INTERNSHIP GUIDE HEAD OF THE DEPARTMENT/

COURSE COORDINATOR

INTERNSHIP CERTIFICATE

DECLARATION

I, NALLALA HARI VIJAY RAM GOPAL, Reg.no:208297603033, hereby declare that the
internship project report entitled “MACHINE LEARNING USING PYTHON”, done by me under
the guidance of Mr.B. Krishna M.Tech, Assistant Professor, Department of Electronics and
Communication Engineering, University College of Engineering, Adikavi Nannaya
University, RAJAMAHENDRAVARAM, is submitted in partial fulfillment of the requirements for
the award of the degree of Bachelor of Technology in Electronics and Communication Engineering in
the academic year 2020-2024.

NALLALA HARI VIJAY RAM GOPAL

208297603033

ACKNOWLEDGEMENT

I would like to take this opportunity to express my deep gratitude to the members who assisted me
directly and indirectly in the completion of this project work.

I feel fortunate to pursue my Bachelor Degree from the campus of Adikavi Nannaya University. It
provided all the facilities in the areas of Electronics and Communication Engineering.

I profusely thank Dr. V PERSIS, Principal, University College of Engineering for all the
encouragement and support.

I also thank Mr. B. SUDHA KIRAN M.Tech, Course Coordinator, Electronics and Communication
Engineering for the guidance throughout the academic year.

It is a genuine pleasure to express my deep sense of thanks and gratitude to my mentor and project
guide Mr.B. Krishna M.Tech, Assistant Professor, Department of Electronics and Communication
Engineering, for his excellent guidance right from the selection of the internship.

I wish to express my deep sense of gratitude to the management of EASY SHIKSHA for giving me
an opportunity to complete my internship on MACHINE LEARNING WITH PYTHON in partial
fulfillment of my degree of Bachelor of Technology in Electronics and Communication
Engineering.

A great deal of thanks goes to review committee members and the entire faculty members for their
support throughout the project.

N. HARI VIJAY RAM GOPAL

208297603033

ABSTRACT

Machine learning (ML) is the scientific study of algorithms and statistical models that
computer systems use to perform a specific task without being explicitly programmed. Every
time a web search engine like Google is used to search the internet, one of the reasons it
works so well is that a learning algorithm has learned how to rank web pages. These
algorithms are used for various purposes such as data mining, image processing, and predictive
analytics, to name a few. The main advantage of using machine learning is that, once an
algorithm learns what to do with data, it can do its work automatically.

This course covers:

1. Introduction to Machine Learning and structure of the data that is being fed to ML
model.

2. Basics of Python to get started with Machine Learning and implementing the concept
using hands on examples.

3. Understanding the 5 most important Machine Learning algorithms to implement the
predictive model.

COMPANY PROFILE

INTRODUCTION:

Mr. SUNIL SHARMA, our founder & CEO, started Easy Shiksha as a blog
with a mission to bring a culture of meaningful internships to India. For the first two
years, he hired only virtual interns.

After building a small team, we then launched our website with just one goal - to
equip every student in India with their dream internship. And we did it all for free.

The next big step could not have been anything other than launching our very own
Android app, bringing Easy Shiksha in the ‘hands’ of the students.

After many successful years as an internship platform, our motivation to upskill the
students only increased, and that’s when we kickstarted a new journey with Easy Shiksha
Trainings.

With Fresher jobs, we embarked on a journey filled with newer challenges, which
allowed us to provide bigger & better opportunities to graduates with 0-2 years of
experience.

With an insight that more than 90% of the graduates in India start their careers with
a job that pays less than 3LPA, we came up with Jobs Oriented Specialization programs
to help the students start their careers in their dream profiles.

Sunil Sharma, the one who started it all. It takes a lot of guts to create a startup,
and he's the one with all of it. Meet the CEO. Yes, he's got loads of patience and humility.
But on that once-in-a-blue-moon day when he breathes fire, no one dares to tread his
territory. He's a master manager and rightfully at the helm of affairs at Easy Shiksha.

INDEX

CHAPTER – 1: INTRODUCTION 1-2


1.1 WHAT IS MACHINE LEARNING?
1.2 CLASSIFICATION OF MACHINE LEARNING

CHAPTER – 2: PYTHON 3-5


2.1 DATA TYPES

2.2 STRUCTURED DATA AND UNSTRUCTURED DATA

2.3 ADVANTAGES

2.4 PYTHON IDE

CHAPTER – 3: PYTHON LIBRARIES 6-9

3.1 NUMPY

3.2 PANDAS

3.3 MATPLOTLIB

3.4 SCIKIT LEARN

CHAPTER – 4: MACHINE LEARNING 10-16

4.1 TYPES OF MACHINE LEARNING

4.2 TYPES OF SUPERVISED LEARNING

4.3 TYPES OF UNSUPERVISED LEARNING

4.4 APPLICATIONS OF MACHINE LEARNING

CHAPTER – 5: DATA EXPLORATION AND PREPROCESSING 17-20

5.1 STEPS IN DATA EXPLORATION

5.2 SPLITTING OF DATA

5.3 DATA PRE-PROCESSING

CHAPTER – 6: DECISION TREE 21-22

6.1 DECISION TREE TERMINOLOGIES

6.2 ATTRIBUTE SELECTION MEASURES (ASM)

CHAPTER – 7: PROJECT 23-24


7.1 PROJECT DESCRIPTION

7.2 TRAINING AND TESTING

CHAPTER – 8: CONCLUSION 25

REFERENCES

CHAPTER 1

INTRODUCTION

In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which work on our
instructions. But can a machine also learn from experiences or past data like a human does? So here
comes the role of Machine Learning.

1.1 WHAT IS MACHINE LEARNING?

Machine Learning is a subset of artificial intelligence that is mainly concerned with
the development of algorithms which allow a computer to learn from data and past experiences
on its own. The term machine learning was first introduced by Arthur Samuel in 1959. It can be
summarized as:

“Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.”

How does Machine Learning work?

A Machine Learning system learns from historical data, builds prediction models, and, whenever
it receives new data, predicts the output for it. The accuracy of the predicted output depends on the
amount and quality of the data: a larger, representative dataset generally helps to build a better
model that predicts the output more accurately.

Suppose we have a complex problem where we need to make some predictions. Instead of
writing code for it, we feed the data to generic algorithms, and with the help of these
algorithms the machine builds the logic from the data and predicts the output. The block diagram
below explains the working of a Machine Learning algorithm:

Fig 1.1 : Working of Machine Learning

1.2 CLASSIFICATION OF MACHINE LEARNING

At a broad level, machine learning can be classified into three types:

I. Supervised learning
II. Unsupervised learning
III. Reinforcement learning

Fig 1.2 : Classification of Machine Learning

Applications of Machine Learning

• Facebook Newsfeed
• Facebook photo auto tagging feature
• Product Recommendations by shopping portals
• Weather report Prediction

CONCLUSION

Machine learning is a field of artificial intelligence that deals with the design and
development of algorithms that can learn from and make predictions on data. The aim of machine
learning is to automate analytical model building and enable computers to learn from data without
being explicitly programmed to do so.

CHAPTER 2

PYTHON

INTRODUCTION

Python is a dynamic, interpreted (bytecode-compiled) language. Type declarations for variables,
parameters, functions, or methods are not required in source code. This makes the code
short and flexible, but you lose the compile-time type checking of the source code. Python tracks
the types of all values at runtime and flags code that does not make sense as it runs.
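
The runtime type checking described above can be seen in a few lines. This is a minimal sketch; the names `value` and `add` are made up for illustration.

```python
# A name can be rebound to values of different types; nothing is declared.
value = 42            # int
value = "forty-two"   # the same name now refers to a str

def add(a, b):
    # No parameter types are declared; the check happens when '+' runs.
    return a + b

total = add(1, 2)          # int + int works
joined = add("ab", "cd")   # str + str works

try:
    add(1, "2")            # int + str makes no sense, so Python raises at runtime
    failed = False
except TypeError:
    failed = True

print(total, joined, failed)
```

The mistake in the last call is only reported when that line actually executes, which is the trade-off against compile-time checking mentioned above.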

2.1 DATA TYPES

Data types determine whether an object can do something, or whether it just would not make
sense. Other programming languages often determine whether an operation makes sense for an
object by making sure the object can never be stored somewhere the operation will be performed on
the object (this type system is called static typing). Python does not do that. Instead, it stores the type
of an object with the object, and checks when the operation is performed whether that operation
makes sense for that object. Python has many native data types. Here are the important ones:

• Booleans are either True or False

• Numbers can be integers (1 and 2), floats (1.1 and 1.2), fractions, or even complex
numbers

• Strings are sequences of Unicode characters, e.g., an HTML document

• Bytes and byte arrays, e.g., a JPEG image file.

• Lists are ordered sequences of values.

• Tuples are ordered, immutable sequences of values.

• Sets are unordered collections of unique values.
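
The native types listed above can be tried out directly. The following is only an illustrative sketch; all variable names and values are invented for the example.

```python
from fractions import Fraction

flag = True                      # Boolean
count, ratio = 2, 1.1            # integer and float
third = Fraction(1, 3)           # exact fraction
z = 3 + 4j                       # complex number
text = "<html></html>"           # string of Unicode characters
raw = b"\xff\xd8"                # bytes (a JPEG file starts with these two bytes)
items = [3, 1, 2]                # list: ordered and mutable
point = (1.0, 2.0)               # tuple: ordered and immutable
unique = {1, 2, 2, 3}            # set: the duplicate 2 collapses

items.append(4)                  # lists can grow in place
try:
    point[0] = 5.0               # tuples cannot be changed in place
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

print(type(flag).__name__, type(third).__name__, len(unique), tuple_is_mutable)
```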

2.2 STRUCTURED DATA AND UNSTRUCTURED DATA

Fig 2.2 : Structured data vs Unstructured data

Structured Data:

This type of data consists of numbers or words organized in a fixed format, typically a table of
rows and columns. Some fields take numerical values on which mathematical operations are not
meaningful (for example, an ID or a phone number).

Unstructured Data:

This type of data does not have the proper format and is therefore known as unstructured
data. This comprises textual data, sounds, images, videos, etc.

2.3 ADVANTAGES

• Object-Oriented Programming Language

• Simple Syntax

• Interpreter-based language

• Robust Standard Libraries

• Open Source

• Compatible with different platforms like Windows, Mac, Linux, Raspberry Pi etc.

2.4 PYTHON IDE

An IDE (Integrated Development Environment) is a software suite that consolidates the basic
tools that developers need to write and test software. We need not only Python but also an IDE to
manage code, files, etc. An IDE consolidates basic tools such as a code editor, an interpreter or
compiler, and a debugger.

Some commonly used tools are Jupyter Notebook and Spyder (both included in the Anaconda
distribution) and PyCharm.

Anaconda Jupyter Notebook

• It is a collection of code, documents, and visualizations in one place.

• Versatile and shareable.

• Open-source web application.

• Ability to execute a specific part of the code without having to execute the entire code.

CONCLUSION

Python is a free and open-source programming language and is widely regarded as one of the easier
languages to learn. Its English-like syntax helps non-technical learners pick up the language quickly. A
newcomer can always turn to the large and active Python community for assistance. Python also does
not have to be installed locally to get started: it can be tried in a browser using Google
Colab or Kaggle. Python also creates a considerable number of jobs as its popularity and
usability grow.

CHAPTER – 3

PYTHON LIBRARIES

INTRODUCTION

A Python library is a reusable chunk of code that you may want to include in your programs or
projects. Unlike in languages like C++ or C, the term ‘library’ does not pertain to any specific
construct in Python. Here, a ‘library’ loosely describes a collection of modules. A package is a
library that can be installed using a package manager such as pip (comparable to RubyGems for
Ruby or npm for JavaScript).

3.1 NUMPY

Fig 3.1 : Numpy

• NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays.
• It is a library for scientific computing.

• NumPy is an open-source library that integrates easily with Python code.

• NumPy supports working with multiple datasets at once.

NUMPY FUNCTIONS:

1. np.array()

2. np.ones()

3. np.zeros()

4. np.arange(10)

5. np.linspace(1,3,5)

6. np.sum()

7. np.max()

8. np.min()

9. np.mean()

NumPy is a Python library used for working with arrays. It also has functions for
working in domain of linear algebra, Fourier transform, and matrices.
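
The functions listed above can be exercised on small arrays. This is a minimal sketch with made-up values; only the function names come from the list.

```python
import numpy as np

a = np.array([1, 2, 3, 4])     # array built from a Python list
ones = np.ones((2, 3))         # 2x3 array filled with 1.0
zeros = np.zeros(3)            # length-3 array filled with 0.0
r = np.arange(10)              # integers 0 through 9
grid = np.linspace(1, 3, 5)    # 5 evenly spaced values from 1 to 3, both ends included

# Reductions over the whole array
print(np.sum(a), np.max(a), np.min(a), np.mean(a))
print(grid)
```

Note that `np.arange(10)` stops before 10, whereas `np.linspace(1, 3, 5)` includes the end point 3.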

3.2 PANDAS

Fig 3.2 : Pandas

Pandas is a software library written for the Python programming language for data
manipulation and analysis. It offers data structures and operations for manipulating numerical
tables and time series. It is free software released under the three-clause BSD license.

Example:

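Load a CSV file into a Pandas DataFrame:

import pandas as pd

# Read the CSV file into a DataFrame ('data.csv' must exist in the working directory)
df = pd.read_csv('data.csv')
# Print every row and column instead of a truncated preview
print(df.to_string())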

Advantages:
• Fast and efficient for manipulating and analysing data.
• Data from different file objects can be loaded.
• Easy handling of missing data (represented as NaN) in floating-point as well as
non-floating-point data.
• Size mutability: columns can be inserted into and deleted from DataFrames and
higher-dimensional objects.
• Data set merging and joining.
• Flexible reshaping and pivoting of data sets.
• Provides time-series functionality.
• Powerful group-by functionality for performing split-apply-combine operations on
data sets.
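
A few of these advantages (missing-data handling, column insertion and deletion, and group-by) can be sketched on a tiny table. The column names and values below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sales records; one amount is missing (None becomes NaN).
df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "amount": [10.0, None, 30.0, 50.0],
})

# Insert a new column with the gap filled by the mean of the known amounts
df["amount_filled"] = df["amount"].fillna(df["amount"].mean())
# Split by city, apply a sum, combine into one result
totals = df.groupby("city")["amount_filled"].sum()
# Delete the original column
df = df.drop(columns=["amount"])

print(totals)
```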

3.3 MATPLOTLIB

Fig 3.3 : Matplotlib

Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.

3.4 SCIKIT LEARN

Fig 3.4 : Scikit Learn

Scikit-learn is a free software machine learning library for the Python programming language.
It features various classification, regression and clustering algorithms, including support-vector
machines.

Key concepts and features include algorithmic decision-making methods such as:

Classification: identifying and categorizing data based on patterns.

Scikit-learn allows us to define machine learning algorithms and compare them to one another, and
it offers tools to pre-process data. K-means clustering, Random Forests, Support Vector
Machines, and many other commonly used machine learning models are included
in Scikit-learn.
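
As a rough illustration of comparing models and pre-processing data with Scikit-learn, the sketch below trains two classifiers on the Iris dataset bundled with the library. The choice of dataset, the 70/30 split, and the two models are assumptions made for this example, not part of the internship project.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Pre-processing: fit the scaler on the training data only, then apply to both parts
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

scores = {}
models = [("svm", SVC()), ("random_forest", RandomForestClassifier(random_state=0))]
for name, model in models:
    model.fit(X_train_s, y_train)
    scores[name] = model.score(X_test_s, y_test)   # accuracy on held-out data

print(scores)
```

Because every estimator exposes the same `fit` and `score` methods, swapping in another model only changes one entry of the `models` list.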

CONCLUSION

Python libraries are essential for carrying out specific operations in Python. No single
library contains all the functionality, so several libraries, such as those discussed above,
are used together.

CHAPTER – 4

MACHINE LEARNING

4.1 TYPES OF MACHINE LEARNING

The types of machine learning algorithms differ in their approach, the type of data they input
and output, and the type of task or problem that they are intended to solve. Broadly Machine
Learning can be categorized into four categories.
1. Supervised
2. Unsupervised
3. Reinforcement
4. Semi-supervised

Machine learning enables analysis of massive quantities of data. While it generally delivers
faster, more accurate results in order to identify profitable opportunities or dangerous risks, it may
also require additional time and resources to train it properly.

Fig 4.1.1 : Types of Machine Learning

1. Supervised Learning

Supervised learning is one of the most basic types of machine learning. In this type, the
machine learning algorithm is trained on labelled data. Even though the data needs to be labelled
accurately for this method to work, supervised learning is extremely powerful when used in the
right circumstances.

In supervised learning, the ML algorithm is given a training dataset to work with. This
training dataset is a part of the bigger dataset and serves to give the algorithm a basic
idea of the problem, solution, and data points to be dealt with. The training dataset is also very
similar to the final dataset in its characteristics and provides the algorithm with the labelled
examples required for the problem.

Fig 4.1.2 : Supervised Learning Process

The algorithm then finds relationships between the parameters given, essentially learning
a mapping from the input variables to the output variable in the dataset (a statistical association,
not necessarily cause and effect). At the end of the training, the algorithm has an idea of how the
data works and the relationship between the input and the output.

2. Unsupervised Learning

Unsupervised machine learning holds the advantage of being able to work with unlabelled
data. This means that human labour is not required to make the dataset machine-readable,
allowing much larger datasets to be worked on by the program.

In supervised learning, the labels allow the algorithm to find the exact nature of the relationship
between any two data points. However, unsupervised learning does not have labels to work off
of, resulting in the creation of hidden structures. Relationships between data points are perceived
by the algorithm in an abstract manner, with no input required from human beings.

Fig 4.1.3 : Working of unsupervised learning

The creation of these hidden structures is what makes unsupervised learning algorithms versatile.
Instead of a defined and set problem statement, unsupervised learning algorithms can adapt to
the data by dynamically changing hidden structures.

3. Reinforcement Learning

Reinforcement learning is a sub-branch of Machine Learning that trains a model to return an
optimum solution for a problem by taking a sequence of decisions by itself.

We model an environment after the problem statement. The model interacts with this
environment and comes up with solutions all on its own, without human interference.

4. Semi-supervised

Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised
and unsupervised methods, as it operates on both labeled and unlabeled data. Thus, it falls between
learning “without supervision” and learning “with supervision”. In the real world, labeled data can
be rare in several contexts while unlabeled data are plentiful; this is where semi-supervised learning is
useful. The ultimate goal of a semi-supervised learning model is to provide a better prediction
outcome than a model trained on the labeled data alone.

4.2 TYPES OF SUPERVISED LEARNING

There are two types of supervised learning

I. Regression
II. Classification

Regression:
Regression is a technique for investigating the relationship between independent
variables or features and a dependent variable or outcome. It's used as a method for predictive
modelling in machine learning, in which an algorithm is used to predict continuous outcomes.
• Linear Regression
• Decision Tree Regression
• Random Forest Regression
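
To make the idea of predicting a continuous outcome concrete, here is a minimal linear regression sketch using NumPy's least-squares line fit. The data are hypothetical and generated from y = 2x + 1 without noise, so the fitted line recovers those values.

```python
import numpy as np

# Hypothetical feature and outcome values following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a straight line; coefficients come back highest degree first
slope, intercept = np.polyfit(x, y, deg=1)

# Continuous prediction for an unseen input x = 5
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```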

Classification:
Classification is a technique for determining which class the dependent variable belongs to, based
on one or more independent variables.

• Logistic Regression

• Decision Tree

• Random Forest

• KNN

Logistic regression is a supervised learning classification algorithm used to predict the probability of
a target variable. The nature of target or dependent variable is dichotomous, which means there
would be only two possible classes.

In simple words, the dependent variable is binary in nature having data coded as either 1
(stands for success/yes) or 0 (stands for failure/no).
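
The way logistic regression turns an input into a probability and then into a 0/1 label can be sketched as follows. The weight and bias are hand-picked for illustration (they are not learned from data), and the 0.5 threshold is the conventional default, assumed here.

```python
import math

def sigmoid(z):
    # Maps any real number to a value strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weight, bias):
    # Estimated probability that the example belongs to class 1
    return sigmoid(weight * x + bias)

# Hypothetical parameters, chosen by hand rather than fitted
weight, bias = 1.5, -3.0

probabilities = [predict_proba(x, weight, bias) for x in (0.0, 2.0, 4.0)]
labels = [1 if p >= 0.5 else 0 for p in probabilities]   # threshold at 0.5
print(probabilities, labels)
```

In practice the weight and bias are estimated from labelled training data; only the prediction step is shown here.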

4.3 TYPES OF UNSUPERVISED LEARNING

The unsupervised learning algorithm can be further categorized into two types of problems:

• Clustering
• Association

Clustering (k-means)

Clustering or Cluster Analysis is a machine learning technique, which groups the unlabeled dataset.
It can be defined as:

"A way of grouping the data points into different clusters, consisting of similar data points. The
objects with the possible similarities remain in a group that has less or no similarities with another
group."

Clustering Algorithms

The choice of clustering algorithm depends on the kind of data that we are using. For example, some
algorithms need the number of clusters in the given dataset to be specified or guessed in advance,
whereas others work from the minimum distance between the observations of the dataset.

K-Means Clustering Algorithm:

K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering
problems. It groups the unlabeled dataset into different clusters. “It is an iterative algorithm that
divides the unlabelled dataset into k different clusters in such a way that each data point belongs to only
one group that has similar properties.”

What is K-Means Algorithm?

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
this algorithm is to minimize the sum of squared distances between the data points and the centroids
of their corresponding clusters.

The k-means clustering algorithm mainly performs two tasks:

• Determines the best positions for the K center points or centroids by an iterative process.
• Assigns each data point to its closest centroid. The data points which are nearest to a
particular centroid form a cluster.

The below diagram explains the working of the K-means Clustering Algorithm:

Fig 4.3 : K-means
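
The two tasks described above (assigning points to the nearest centroid and re-computing the centroids) can be sketched in a few lines of NumPy. This is a simplified illustration: the function name `k_means`, the six sample points, and the choice of starting centroids are all assumptions, and the sketch does not handle the case of a cluster becoming empty.

```python
import numpy as np

def k_means(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        centroids = np.array(
            [points[labels == k].mean(axis=0) for k in range(len(centroids))])
    return labels, centroids

points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
initial = points[[0, 3]]   # assumption: one starting centroid taken from each group
labels, centroids = k_means(points, initial)
print(labels, centroids)
```

A fixed number of iterations is used for brevity; library implementations typically stop once the assignments no longer change.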

4.4 APPLICATIONS OF MACHINE LEARNING:

Machine learning is a buzzword in today’s technology, and it is growing very rapidly day by
day. We use machine learning in our daily life, often without knowing it, in products such as Google
Maps, Google Assistant, Alexa, etc. Below are some of the most popular real-world applications of
Machine Learning:

Image Recognition:

Fig 4.4.1 : Image Recognition

Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, etc. in digital images. A popular use case of image recognition
and face detection is automatic friend-tagging suggestion.

Self-driving cars:

Fig 4.4.2 : Self driving car

One of the most exciting applications of machine learning is self-driving cars, in which machine
learning plays a significant role. Tesla, a well-known car manufacturing
company, is working on self-driving cars. Such systems use learned models to detect
people and objects while driving.

Medical Diagnosis:

Fig 4.4.3 : Medical Diagnosis


In medical science, machine learning is used for disease diagnosis. With this, medical
technology is growing very fast and is able to build 3D models that can predict the position of
lesions in the brain.

CONCLUSION
Machine learning is a powerful tool for making predictions from data. However, it is
important to remember that machine learning is only as good as the data that is used to train the
algorithms. In order to make accurate predictions, it is important to use high-quality data that is
representative of the real-world data that the algorithm will be used on.

CHAPTER – 5
DATA EXPLORATION AND PREPROCESSING

INTRODUCTION

Data exploration is a key aspect of data analysis and model building. Without spending
significant time on understanding the data and its patterns, one cannot expect to build efficient
predictive models. Data exploration, which comprises data cleaning and preprocessing, takes a major
chunk of the time in a Machine Learning project.

5.1 STEPS IN DATA EXPLORATION

The key steps involved in data exploration are:

• Identify variables
• Variable analysis
• Handling missing values
• Handling outliers
• Feature engineering

1. Identifying Variables:

Variables can be of different types such as character, numeric, categorical, and continuous.
Identifying the predictor and target variables is also a key step in model building. The target is the
dependent variable and a predictor is an independent variable on which the prediction is
based. Categorical or discrete variables are those on which arithmetic is not meaningful. They are
made up of a fixed set of values such as 0 and 1.

2. Variable Analysis:
Variable analysis can be done in three ways: univariate analysis, bivariate analysis, and multivariate
analysis.

Univariate analysis

• It is used to highlight missing and outlier values. Here each variable is analyzed on its own
for range and distribution.

• Univariate analysis differs for categorical and continuous variables.

• For categorical variables, you can use a frequency table to understand the distribution of each
category.

• For continuous variables, you have to understand the central tendency and spread of the
variable. Central tendency can be measured using the mean, median, and mode, and spread using the
range, variance, or standard deviation.
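
A univariate summary of one categorical and one continuous variable can be sketched with the standard library alone. The two sample columns (`colors`, `ages`) are hypothetical.

```python
from collections import Counter
import statistics

colors = ["red", "blue", "red", "green", "red", "blue"]   # categorical variable
ages = [22, 25, 25, 30, 38]                                # continuous variable

# Categorical: frequency table
frequency = Counter(colors)

# Continuous: central tendency
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

# Continuous: spread
value_range = max(ages) - min(ages)
spread = statistics.pstdev(ages)    # population standard deviation

print(frequency, mean_age, median_age, mode_age, value_range, spread)
```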

Bivariate Analysis

Analysis can be performed for combination of categorical and continuous variables.

o It is used to find the relationship between two variables.

o Scatter plot is suitable for analyzing two continuous variables. It indicates the linear
or non-linear relationship between the variables.

o Bar chart helps to understand relation between two categorical variables.

o Matplotlib and Seaborn libraries can be used to plot different relational graphs that
help visualizing bivariate relationship between different types of variables.

3. Handling Missing Values:

Missing values in the dataset can reduce model fit. They can lead to a biased model, as the data cannot be
analysed completely: behaviour and relationships with other variables cannot be deduced correctly, which
can lead to wrong predictions or classifications. Missing values can be treated by deletion,
mean/mode/median imputation, transformation, and binning.
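
Two of the treatments named above, deletion and median/mode imputation, can be sketched with pandas. The table and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical records with one missing value in each column
df = pd.DataFrame({"age": [20.0, None, 40.0, 60.0],
                   "city": ["A", "B", None, "B"]})

missing_per_column = df.isna().sum()     # how many values are missing per column

# Deletion: keep only the rows that are complete
dropped = df.dropna()

# Imputation: median for the numeric column, mode for the categorical one
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(missing_per_column, dropped, imputed, sep="\n\n")
```

Deletion is simple but discards half of this small table, which is why imputation is often preferred when data are scarce.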

4. Handling Outliers:

Outliers can occur naturally in data or can be due to data entry errors. They can drastically change
the results of data analysis and statistical modelling. Outliers are easily detected by visualization
methods such as box plots, histograms, and scatter plots. Outliers are handled like missing values: by
deleting them, transforming them, binning or grouping them, treating them as a separate group,
or imputing values.
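
Besides visual inspection, outliers can be flagged numerically. The sketch below uses the interquartile-range rule that underlies box-plot whiskers; the 1.5 multiplier is the usual convention and the sample values are invented, so treat both as assumptions.

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])   # 95 looks suspicious

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # fences used by box plots

is_outlier = (values < lower) | (values > upper)
cleaned = values[~is_outlier]                     # option 1: delete the outliers
capped = np.clip(values, lower, upper)            # option 2: cap them at the fences

print(is_outlier, cleaned, capped)
```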

5. Feature Engineering:

Feature engineering is the process of extracting more information from existing data. Feature
selection also can be part of it. Two common techniques of feature engineering are variable
transformation and variable creation. In variable transformation, an existing variable is transformed
using certain functions. For example, a number can be replaced by its logarithmic value. In variable
creation, a new variable is derived from existing ones.
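
Both techniques can be shown on toy data. The income, height, and weight values below are hypothetical, and body-mass index is used only as a familiar example of a derived variable.

```python
import math

# Variable transformation: replace a skewed variable by its logarithm
incomes = [1_000, 10_000, 100_000]
log_incomes = [math.log10(v) for v in incomes]

# Variable creation: derive a new variable from two existing ones
heights_m = [1.60, 1.80]
weights_kg = [64.0, 81.0]
bmi = [w / h ** 2 for w, h in zip(weights_kg, heights_m)]

print(log_incomes, bmi)
```

After the transformation the incomes are evenly spaced, which is often easier for a model to work with than values spanning several orders of magnitude.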

5.2 SPLITTING OF DATA

It is suggested to divide your dataset into three parts to avoid overfitting and model selection bias:

• Training set (has to be the largest set)
• Cross-validation set (also called the dev set)
• Testing set

We build a model on the training set, tune hyperparameters on the cross-validation (dev) set as
much as possible, and then, once the model is ready, evaluate it on the testing set.

Fig 5.2: Splitting of Data
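
A three-way split can be sketched with the standard library. The 60/20/20 proportions, the helper name `split_dataset`, and the fixed seed are assumptions for this example; the report does not prescribe particular ratios.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    # Shuffle a copy so the caller's data is untouched, then cut it in three
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * validation)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

# 100 hypothetical row identifiers stand in for a real dataset
train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before cutting avoids a split that merely reflects the order in which the data were collected.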

5.3 DATA PRE-PROCESSING

Data pre-processing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning model.

Why do we need Data Pre-processing?

Real-world data generally contains noise and missing values, and may be in an unusable format that
cannot be directly used for machine learning models. Data pre-processing is required for
cleaning the data and making it suitable for a machine learning model, which also increases the
accuracy and efficiency of the model.

Steps in Data Pre-processing

It involves below steps:

• Getting the dataset
• Importing libraries
• Importing datasets
• Finding Missing Data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling
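
Two of the steps above, encoding categorical data and feature scaling, can be sketched in plain Python. One-hot encoding and min-max scaling are used here as common choices, not as the only options; the sample values are invented.

```python
# Encoding categorical data: one column per category (one-hot encoding)
colors = ["red", "green", "red", "blue"]
categories = sorted(set(colors))                 # fixed column order
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Feature scaling: min-max scaling maps values into the range [0, 1]
ages = [20.0, 30.0, 40.0]
lo, hi = min(ages), max(ages)
scaled = [(a - lo) / (hi - lo) for a in ages]

print(categories, one_hot, scaled)
```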

CONCLUSION

Data exploration is very helpful whenever the need is to gain new insights from data. It can
be the most effective – and sometimes only – approach when the data suffers from quality problems
(e.g., gaps, anomalies, outliers) and other real-world complexities such as process changes or when it
is not practical to formulate a precise question beforehand that can be answered by a numeric result.
In this respect, research & development, engineering as well as data science are among those fields
that can benefit a lot from data exploration. With today’s computing power and the support of
modern analytics, interactive data exploration can be an exciting and engaging experience for
everyone to discover unexpected value in large amounts of complex data.

CHAPTER 6

DECISION TREE

INTRODUCTION

Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. “It is a graphical
representation for getting all the possible solutions to a problem/decision based on given conditions”.

6.1 DECISION TREE TERMINOLOGIES

• Root Node: The root node is where the decision tree starts. It represents the entire dataset,
which is then divided into two or more homogeneous sets.

• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split any further
once a leaf node is reached.

• Branch/Sub Tree: A section of the tree formed by splitting a node.

• Parent/Child node: A node that is split into sub-nodes is the parent of those sub-nodes, and the
sub-nodes are its child nodes. The root node is the parent of the nodes directly below it.

• Splitting: Splitting is the process of dividing a decision node or the root node into sub-nodes
according to the given conditions.

• Pruning: Pruning is the process of removing unwanted branches from the tree.
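The terminology above can be made concrete with a small data structure. The following is a minimal sketch, not a full implementation: the `Node` class, the `predict` function, and the features `salary` and `distance_km` are hypothetical names chosen for illustration, and the tree is written by hand rather than learned from data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical structure for illustration)."""
    feature: Optional[str] = None       # attribute tested at a decision node
    threshold: Optional[float] = None   # samples with value <= threshold go left
    left: Optional["Node"] = None       # child node for samples that meet the condition
    right: Optional["Node"] = None      # child node for the remaining samples
    prediction: Optional[str] = None    # set only on leaf nodes

    def is_leaf(self) -> bool:
        return self.prediction is not None


def predict(node: Node, sample: dict) -> str:
    """Walk from the root to a leaf, following the branch chosen by each condition."""
    while not node.is_leaf():
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.prediction


# The root node tests "salary". Its right child is a decision node that heads
# a sub-tree, and the three nodes that carry a prediction are leaf nodes.
root = Node(
    feature="salary", threshold=50000,
    left=Node(prediction="decline"),
    right=Node(
        feature="distance_km", threshold=20,
        left=Node(prediction="accept"),
        right=Node(prediction="decline"),
    ),
)

result = predict(root, {"salary": 60000, "distance_km": 10})
```

In this structure, pruning would correspond to replacing a decision node and its sub-tree with a single leaf node.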

6.2 ATTRIBUTE SELECTION MEASURES (ASM)

When implementing a decision tree, the main issue is how to select the best attribute for the root
node and for the sub-nodes. This is solved with a technique called an Attribute Selection Measure
(ASM), which makes it easy to select the best attribute for each node of the tree. There are two
popular techniques for ASM:

1. Information Gain
2. Gini Index

1. Information Gain:

• Information gain measures the change in entropy after a dataset is segmented on an attribute.
• It calculates how much information a feature provides about a class.
• The node is split and the decision tree is built according to the value of information gain.
• A decision tree algorithm always tries to maximize information gain, so the node/attribute with
the highest information gain is split first. It can be calculated using the formula below:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
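As a worked illustration of the formula, the sketch below computes entropy with the standard definition, the negative sum of p * log2(p) over the class proportions p, and then subtracts the weighted average entropy of the subsets produced by an attribute. The function names and the small weather-style dataset are assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p)) over class proportions."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy(S) minus the weighted average entropy of the subsets split on `attribute`."""
    parent_entropy = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted_avg = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent_entropy - weighted_avg


# Hypothetical dataset: should we play outside?
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": True, "play": "yes"},
]

# The attribute with the highest information gain is chosen for the split.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
```

Here `outlook` separates the classes much better than `windy`, so it has the higher information gain and would be split first.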

2. Gini Index:

• The Gini index is a measure of impurity or purity used while creating a decision tree in the
CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high Gini index.
• The CART algorithm uses the Gini index to create binary splits. The Gini index can be calculated
using the formula below, where $P_j$ is the proportion of samples that belong to class $j$:
$\text{Gini Index} = 1 - \sum_{j} P_j^{2}$
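The Gini formula can be checked with a short sketch. The function names and the example labels below are assumptions for illustration; the weighted combination of the two child nodes is one common way to score a binary split, not necessarily the exact procedure of any particular library.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(left, right):
    """Gini index of a binary split: child impurities weighted by child size."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


mixed = gini(["yes", "yes", "no", "no"])     # evenly mixed node
pure = gini(["yes", "yes", "yes", "yes"])    # node with a single class

# Two candidate binary splits of the same four samples.
split_a = weighted_gini(["yes", "yes"], ["no", "no"])   # separates the classes
split_b = weighted_gini(["yes", "no"], ["yes", "no"])   # leaves both children mixed
```

The split with the lower weighted Gini index (`split_a` in this example) is the one that would be preferred.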

CONCLUSION

As one of the most important supervised learning algorithms, the decision tree plays a vital role in
real-life decision analysis. As a predictive model, it is used in many areas because its splitting
approach helps identify solutions under different conditions, through either classification or
regression.

CHAPTER – 7
PROJECT

7.1 PROJECT DESCRIPTION

Multiple linear regression is a statistical technique that models the relationship between two
or more independent variables and one dependent variable. It is used when there are two or more
x variables.
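In standard notation, a multiple linear regression model is usually written as below. The symbols are the conventional ones and are not taken from the original project: $y$ is the dependent variable, $x_1, \dots, x_n$ are the independent variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_n$ are the coefficients, and $\varepsilon$ is the error term.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
```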
I was instructed to carry out a project on SENTIMENT ANALYSIS, which was presented under
multiple linear regression, although predicting a sentiment label is a classification task.
For the project, I was given a dataset of restaurant reviews, which is an example of
sentiment analysis. I developed code to predict the sentiment of the customers' reviews and
measured the accuracy score of the predictions. Finally, I designed a web app for
analyzing a review.

Sentiment analysis:

Sentiment analysis is the process of computationally identifying and categorizing opinions in
a piece of text in order to determine whether the writer’s attitude towards a particular topic
or product is positive, negative, or neutral. It is performed in the following steps:

• Tokenization
• Cleaning the data (removing the special characters)
• Removing stop words
• Classification (applying a supervised algorithm for classification)

For example, take the sentence “the android system is so good.” It passes through the steps
above and is labelled positive, negative, or neutral (+1, -1, 0).
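The steps above can be sketched with a small Naive Bayes classifier written in plain Python. This is an illustrative sketch and not the code from the project: the stop-word list, the six training reviews, and the function names are assumptions, and only the positive (+1) and negative (-1) labels are modelled.

```python
import re
from collections import Counter, defaultdict
from math import log

# Tiny illustrative stop-word list (a real list would be much longer).
STOP_WORDS = {"the", "is", "so", "a", "was", "and", "this", "it", "i"}


def preprocess(text):
    """Tokenize, strip special characters, and drop stop words."""
    tokens = text.lower().split()                           # tokenization
    cleaned = [re.sub(r"[^a-z]", "", t) for t in tokens]    # cleaning
    return [t for t in cleaned if t and t not in STOP_WORDS]  # stop-word removal


def train(examples):
    """Count words per label for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of reviews
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(preprocess(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability under Naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = log(label_counts[label] / total_docs)        # log prior
        total_words = sum(word_counts[label].values())
        for token in preprocess(text):
            # Laplace (add-one) smoothing so unseen words do not give log(0).
            score += log((word_counts[label][token] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labelled reviews: +1 is positive, -1 is negative.
reviews = [
    ("The food was great and tasty", 1),
    ("I loved the good service", 1),
    ("The staff is so friendly and good", 1),
    ("The food was awful", -1),
    ("Terrible service and bad taste", -1),
    ("This place is bad and slow", -1),
]

model = train(reviews)
label = classify("the android system is so good.", model)
```

With these toy reviews, the word "good" appears only in positive examples, so the example sentence is labelled +1. A neutral class could be handled the same way by adding reviews labelled 0.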
Twitter sentiment analysis works in the same way as for restaurant reviews: the sentiment of
each Twitter comment is identified.
Naive Bayes algorithms tied for the highest accuracy among the 12 Twitter sentiment
analysis approaches tested. Sentiment analysis of Twitter data has many organizational
benefits, such as understanding your brand more deeply, growing your influence, and improving
your customer service.

7.2 TRAINING AND TESTING

Fig 7.2 : Training and Testing Models

Training data and test data are two important concepts in machine learning.

Training Data
The observations in the training set form the experience that the algorithm uses to learn. In
supervised learning problems, each observation consists of an observed output variable and one or
more observed input variables.

Test Data
The test set is a set of observations used to evaluate the performance of the model using some
performance metric. It is important that no observations from the training set are included in the test
set. If the test set does contain examples from the training set, it will be difficult to assess whether
the algorithm has learned to generalize from the training set or has simply memorized it.
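The difference between memorizing and generalizing can be illustrated with a minimal sketch. Everything here is an assumption made for the example: the noisy "hours studied" data, the 25% test fraction, and the use of a one-nearest-neighbour rule, which simply memorizes the training set.

```python
import random


def split(data, test_fraction, seed):
    """Shuffle a copy of the data and return disjoint (train, test) lists."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def nearest_neighbour(train, x):
    """Predict the label of the training observation whose input is closest to x."""
    return min(train, key=lambda obs: abs(obs[0] - x))[1]


def accuracy(train, evaluation):
    """Fraction of observations in `evaluation` that are predicted correctly."""
    correct = sum(nearest_neighbour(train, x) == y for x, y in evaluation)
    return correct / len(evaluation)


# Hypothetical noisy data: hours studied -> pass (1) or fail (0).
rng = random.Random(42)
hours = [i * 0.5 for i in range(20)]
data = [(h, int(h + rng.uniform(-2, 2) > 5)) for h in hours]

train, test = split(data, test_fraction=0.25, seed=0)

# Evaluating on the training set only shows that the model memorized it.
train_accuracy = accuracy(train, train)
# Evaluating on held-out observations estimates how well it generalizes.
test_accuracy = accuracy(train, test)
```

Because every training observation is its own nearest neighbour, the training accuracy is perfect by construction and says nothing about new data. Only the accuracy on the held-out test set is a meaningful estimate.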

Some training sets may contain only a few hundred observations; others may include millions.
Inexpensive storage, increased network connectivity, the ubiquity of sensor-packed smartphones, and
shifting attitudes towards privacy have contributed to the contemporary state of big data, or training
sets with millions or billions of examples.

Many supervised training sets are prepared manually, or by semi-automated processes. Creating a
large collection of supervised data can be costly in some domains. Fortunately, several datasets are
bundled with scikit-learn, allowing developers to focus on experimenting with models instead.

CHAPTER – 8

CONCLUSION

Machine learning is a powerful tool for making predictions from data. It is a technique for
training machines to perform activities that a human brain can do, in some cases faster and better
than an average human being. However, it is important to remember that machine learning is only as
good as the data used to train the algorithms. To make accurate predictions, it is important to use
high-quality data that is representative of the real-world data on which the algorithm will
be used.

I understand that machine learning can be supervised or unsupervised. When clearly labelled
training data is available, even in smaller amounts, supervised learning is the appropriate choice.
Unsupervised learning is generally better suited to large data sets that have no labels.

