
Unit 3

Model Construction
3.1 Machine Learning Concepts – An Overview
Machine Learning is defined as a technology used to train machines to perform various actions such as predictions, recommendations, estimations, etc., based on historical data or past experience.
Machine learning enables computers to behave like human beings by training them with the help of past experience and data.
There are three key aspects of Machine Learning, which are as follows:
o Task: A task is the main problem in which we are interested. This task/problem can be related to predictions, recommendations, estimations, etc.
o Experience: Experience is learning from historical or past data, which is used to estimate and resolve future tasks.
o Performance: Performance is the capacity of a machine to resolve a machine learning task or problem and provide the best outcome. Performance depends on the type of machine learning problem.
How does machine learning work?

Machine learning uses two main techniques: supervised learning, which trains a model on known input and output data to predict future outputs, and unsupervised learning, which finds hidden patterns or intrinsic structures in the input data.



Techniques in Machine Learning

Machine Learning techniques are divided mainly into the following 4 categories:

1. Supervised Learning

Supervised machine learning creates a model that makes predictions based on evidence in the presence of uncertainty. A supervised learning algorithm takes a known set of input data and known responses to the data (output) and trains a model to generate reasonable predictions for the response to new data.

When to use: Use supervised learning if you have known data for the output
you are trying to estimate.

For example,

• Whether an email is genuine or spam, or whether a tumor is cancerous or benign. Typical applications include medical imaging, speech recognition, and credit scoring.
• Physicians want to predict whether someone will have a heart attack within
a year. They have data on previous patients, including age, weight, height,
and blood pressure. They know if previous patients had had a heart attack
within a year. So the problem is to combine existing data into a model that
can predict whether a new person will have a heart attack within a year.

Supervised learning uses classification and regression techniques to develop machine learning models.

Supervised learning is applicable when a machine has sample data, i.e., input as well as output data with correct labels; the labels are used to check the correctness of the model. The supervised learning technique helps us to predict future events with the help of past experience and labeled examples. Initially, it analyses the known training dataset, and later it produces an inferred function that makes predictions about output values. It also detects errors during this learning process and corrects them through the learning algorithm.

Example: Let's assume we have a set of images tagged as "dog". A machine learning algorithm trained with these dog images can then distinguish whether a new image is a dog or not.

2. Unsupervised Learning

Unsupervised learning detects hidden patterns or internal structures in data. It is used to draw inferences from datasets containing input data without labeled responses.



Clustering is a common unsupervised learning technique. It is used for
exploratory data analysis to find hidden patterns and clusters in the data.
Applications for cluster analysis include gene sequence analysis, market research,
and commodity identification.

For example, if a cell phone company wants to optimize the locations where they build towers, they can use machine learning to estimate how many people depend on each tower. A phone can only talk to one tower at a time, so the team uses clustering algorithms to design good placement of cell towers that optimizes signal reception for groups or clusters of customers.

In unsupervised learning, a machine is trained with input samples only, while the output is not known. The training information is neither classified nor labeled; hence, a machine may not always provide correct output compared to supervised learning.

Although unsupervised learning is less common in practical business settings, it helps in exploring the data and can draw inferences from datasets to describe hidden structures in unlabeled data.

Example: Let's assume a machine is trained with a set of documents of different categories (Type A, B, and C), and we have to organize them into appropriate groups. Because the machine is provided only with input samples and no outputs, it can organize these documents into type A, type B, and type C categories, but there is no guarantee that the grouping is correct.

3. Reinforcement Learning
Reinforcement Learning is a feedback-based machine learning technique. In this type of learning, agents (computer programs) need to explore the environment, perform actions, and on the basis of their actions receive rewards as feedback. For each good action they get a positive reward, and for each bad action a negative reward. The goal of a reinforcement learning agent is to maximize



the positive rewards. Since there is no labeled data, the agent is bound to learn by
its experience only.

Example: Imagine a mouse in a maze trying to find hidden pieces of cheese. At first, the mouse may move randomly, but after a while it learns which actions bring it closer to the cheese. The more times we expose the mouse to the maze, the better it gets at finding the cheese.

When to use RL?

You can use RL when you have little or no historical data about a problem, as it does not require prior information (unlike traditional machine learning methods). In the RL framework, you learn from the data as you go. Not surprisingly, RL is particularly successful with games, especially games of "perfect information" such as chess and Go. With games, feedback from the environment comes quickly, allowing the model to learn faster. The downside of RL is that it can take a very long time to train if the problem is complex.

4. Semi-supervised Learning

Semi-supervised learning is an intermediate technique between supervised and unsupervised learning. It performs actions on datasets that have a few labeled examples alongside a larger amount of unlabeled data. Because labels are costly, working mostly with unlabeled data reduces the cost of the machine learning model, while the few labels available still increase its accuracy and performance.



3.2 Commonly used Machine Learning Algorithms
3.2.1 Linear Regression
Linear Regression is one of the simplest and most popular machine learning algorithms. It is used for predictive analysis, making predictions for continuous real-valued variables such as experience, salary, cost, etc.
It is a statistical approach that models the linear relationship between a dependent variable and one or more independent variables; hence the name Linear Regression. It shows how the value of the dependent variable changes with respect to the independent variable, and the fitted straight line is called the line of regression.
A regression line summarises the linear relationship (or linear trend) between the two variables in a linear regression analysis, from the bivariate data collected. It is an estimate of the line that describes the true, but unknown, linear relationship between the two variables. The equation of the regression line is used to predict (or estimate) the value of the response variable from a given value of the explanatory variable. A regression line indicates a linear relationship between the dependent variable on the y-axis and the independent variable on the x-axis.

If Y is the dependent variable and X is the independent variable, the regression equation of Y on X is: Y = a + bX + ɛ
Example

The actual weights and self-perceived ideal weights of a random sample of 40 female students enrolled in an introductory Statistics course at the University of Auckland were displayed on a scatter plot (not reproduced here), with a regression line drawn. The equation of the regression line is:

predicted y = 0.6089x + 18.661, or
predicted ideal weight = 0.6089 × actual weight + 18.661
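To read this equation with an illustrative value (not from the original data): a student whose actual weight is 60 kg would have a predicted ideal weight of about 0.6089 × 60 + 18.661 ≈ 55.2 kg.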



3.2.2 Logistic Regression
Logistic Regression is a subset of the supervised learning technique. It helps us to predict the output of a categorical dependent variable using a given set of independent variables. The output can be binary (0 or 1) or Boolean (true/false), but instead of giving an exact value, logistic regression gives a probabilistic value between 0 and 1. It is similar to linear regression in how it is used in a machine learning model: as linear regression is used for solving regression problems, logistic regression is helpful for solving classification problems.
Logistic regression can be expressed as an S-shaped curve called the sigmoid function, whose predictions are bounded by the two limiting values 0 and 1.
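For reference, the standard definition of the sigmoid function (not reproduced in the original text) is:

σ(z) = 1 / (1 + e^(−z))

which maps any real-valued input z to a value between 0 and 1 that can be interpreted as a probability.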
3.2.3 KNN (K-Nearest Neighbors)
It is also one of the simplest machine learning algorithms that come under
supervised learning techniques. It is helpful for solving regression as well as
classification problems. It assumes the similarity between the new data and
available data and puts the new data into the category that is most similar to the
available categories. It is also known as a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset, and at
the time of classification, it performs an action on the dataset. Let's suppose we
have a few sets of images of cats and dogs and want to identify whether a new
image is of a cat or dog. Then KNN algorithm is the best way to identify the cat
from available data sets because it works on similarity measures. Hence, the KNN
model will compare the new image with available images and put the output in
the cat's category.



3.2.4 K-Means Clustering
K-Means Clustering is a subset of unsupervised learning techniques. It helps us
to solve clustering problems by means of grouping the unlabeled datasets into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: for K=2 there will be two clusters, for K=3 there will be three clusters, and so on.
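A minimal Python sketch of K-Means using scikit-learn follows; the toy data here is an assumption chosen only to illustrate the idea of K=2 clusters.

import numpy as np
from sklearn.cluster import KMeans

# A hypothetical example: six 2-D points forming two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# K=2, so the algorithm will create two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two cluster centres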
3.2.5 Decision Tree
Decision Tree is also another type of Machine Learning technique that comes
under Supervised Learning. Similar to KNN, the decision tree also helps us to
solve classification as well as regression problems, but it is mostly preferred to
solve classification problems. It is called a decision tree because it is a tree-structured classifier in which attributes are represented by internal nodes, decision rules are represented by branches, and the outcomes of the model are represented by the leaves of the tree. The tree starts from the decision node, also known as the root node, and ends with the leaf nodes.
Decision nodes help us to make any decision, whereas leaves are used to
determine the output of those decisions.
A Decision Tree is a graphical representation for getting all the possible outcomes
to a problem or decision depending on certain given conditions.

3.2.6 Random Forest


Random Forest is also one of the most preferred machine learning algorithms that come under the supervised learning technique. Like KNN and the decision tree, it can solve classification as well as regression problems, and it is preferred whenever we need to solve a complex problem or to improve the performance of the model.



A random forest algorithm is based on the concept of ensemble learning, which
is a process of combining multiple classifiers.
A random forest classifier is made from a combination of a number of decision trees trained on various subsets of the given dataset. The combination averages the predictions from all trees, which improves the accuracy of the model. A greater number of trees in the forest leads to higher accuracy and helps prevent the problem of overfitting. Further, it also takes less training time as compared to some other algorithms.



3.3 Linear regression and multiple regression models
What is a Regression?
In regression, we fit a line or curve that best matches the given data points, so that the machine learning model can deliver predictions regarding the data. In simple words, "Regression shows a line or curve that passes through the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." It is used principally for prediction, forecasting, time series modeling, and determining cause-and-effect relationships between variables.

Types of Regression models


1. Linear Regression
2. Polynomial Regression
3. Logistic Regression
Linear Regression
Linear regression is a simple statistical regression method used for predictive analysis that shows the relationship between continuous variables. It models the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis); hence it is called linear regression. If there is a single input variable (x), it is called simple linear regression, and if there is more than one input variable, it is called multiple linear regression. The linear regression model gives a sloped straight line describing the relationship between the variables: as the value of x (the independent variable) increases, the value of y (the dependent variable) likewise increases. The fitted line is referred to as the best-fit straight line; based on the given data points, we try to plot the line that models the points best.



Writing the model as y = a0 + a1x:

y = dependent variable
x = independent variable
a0 = intercept of the line
a1 = linear regression coefficient (slope)

A regression line can show either a positive linear relationship or a negative linear relationship.

The goal of the linear regression algorithm is to find the best values for a0 and a1, giving the best-fit line. The best-fit line should have the least error, meaning the error between the predicted values and the actual values should be minimized.



Cost Function
o Different values for the weights or coefficients of the line (a0, a1) give different lines of regression, and the cost function is used to estimate the values of the coefficients for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the hypothesis function.

For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For the linear equation above, the MSE can be calculated using the formula:

MSE = (1/N) Σᵢ (yᵢ − (a1·xᵢ + a0))²

where N is the total number of observations, yᵢ is the actual value, and a1·xᵢ + a0 is the predicted value for the i-th observation.

How to reduce the MSE?

• Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
• A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
• This is done by randomly selecting initial values for the coefficients and then iteratively updating the values to reach the minimum of the cost function.
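A minimal Python sketch of this procedure follows; the toy data, learning rate, and iteration count are assumptions for illustration, not part of the original text.

import numpy as np

# Toy data (assumed): least squares gives a1 = 0.6, a0 = 2.2
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

a0, a1 = 0.0, 0.0    # coefficients, initialised arbitrarily
lr = 0.01            # learning rate (assumed)
n = len(x)

for _ in range(10000):
    error = (a0 + a1 * x) - y              # predicted minus actual
    grad_a0 = (2 / n) * error.sum()        # dMSE/da0
    grad_a1 = (2 / n) * (error * x).sum()  # dMSE/da1
    a0 -= lr * grad_a0                     # update step
    a1 -= lr * grad_a1

print(a0, a1)  # converges towards the least-squares values (about 2.2 and 0.6)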

Linear Regression Model Performance

The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various models is called optimization. It can be assessed by the method below:

R-squared method:

R-squared is a statistical method that determines the goodness of fit. It measures the strength of the relationship between the dependent and independent variables on a scale of 0-100%. A high value of R-squared means less difference between the predicted values and the actual values, and hence represents a good model.

It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.

Example of SLR
(Adapted from: https://www.javatpoint.com/simple-linear-regression-in-machine-learning)

The key point in simple linear regression is that the dependent variable must be a continuous/real value. The independent variable, however, can be measured on continuous or categorical values.

The simple linear regression algorithm has mainly two objectives:

o Model the relationship between the two variables, such as the relationship between income and expenditure, or experience and salary.
o Forecast new observations, such as weather forecasting according to temperature, or the revenue of a company according to the investments in a year.

Here we are taking a dataset that has two variables: salary (dependent variable)
and experience (Independent variable). The goal of this problem is:

o We want to find out if there is any correlation between these two variables.
o We will find the best-fit line for the dataset.
o We will see how the dependent variable changes as the independent variable changes.

Solution:

Step 1: Data Pre-processing

The first step for creating the Simple Linear Regression model is data pre-
processing. We have already done it earlier in this tutorial. But there will be some
changes, which are given in the below steps:

o First, we will import the three important libraries, which will help us for
loading the dataset, plotting the graphs, and creating the Simple Linear
Regression model.
o Next, we will load the dataset into our code:



o After that, we need to extract the dependent and independent variables
from the given dataset. The independent variable is years of experience,
and the dependent variable is salary.
o Next, we will split both variables into the test set and training set. We have
30 observations, so we will take 20 observations for the training set and 10
observations for the test set. We are splitting our dataset so that we can
train our model using a training dataset and then test the model using a test
dataset. The code for this is given below:
o On executing the above code, we get the test dataset and the training dataset.

Step 2: Fitting the Simple Linear Regression to the Training Set:

Now the second step is to fit our model to the training dataset. To do so, we import the LinearRegression class of the linear_model library from scikit-learn. After importing the class, we create an object of the class named regressor and call its fit() method to fit our simple linear regression object to the training set. To the fit() function we pass x_train and y_train, the training data for the independent and dependent variables, so that the model can learn the correlations between the predictor and target variables. (The complete code for all the steps is collected at the end of this example.)

Step 3: Prediction of the test set result:

The model has now been fitted on the dependent variable (salary) and the independent variable (experience), so it is ready to predict the output for new observations. In this step, we provide the test dataset (new observations) to the model and check whether it can predict the correct output or not.

We create two prediction vectors, y_pred and x_pred, which contain the predictions for the test dataset and for the training set respectively. On executing the code, the two variables y_pred and x_pred appear in the variable explorer, containing the salary predictions for the test set and the training set.



Step 4: Visualizing the training set results:

Now in this step, we will visualize the training set result. To do so, we will use
the scatter() function of the pyplot library, which we have already imported in the
pre-processing step. The scatter () function will create a scatter plot of
observations.

On the x-axis we plot the years of experience of the employees, and on the y-axis their salary. To the function we pass the real values of the training set: the years of experience x_train, the training-set salaries y_train, and the color of the observations. Here we take green for the observations, but any color can be chosen.

Now we need to plot the regression line. For this we use the plot() function of the pyplot library, passing the years of experience of the training set, the predicted salaries for the training set x_pred, and the color of the line.

Next, we give the plot a title using the title() function of the pyplot library, passing the name "Salary vs Experience (Training Dataset)".

After that, we assign labels for the x-axis and y-axis using the xlabel() and ylabel() functions.

Output:

Executing these lines of code produces the training-set plot: a scatter of the observations with the fitted regression line (plot not reproduced here).



Step 5: Visualizing the test set results:

In the previous step, we visualized the performance of our model on the training set. Now we do the same for the test set. The complete code remains the same as above, except that we use x_test and y_test instead of x_train and y_train. We also change the color of the observations to differentiate between the two plots, but this is optional.

# Loading important libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Loading the dataset
data_set = pd.read_csv('Salary_Data.csv')

# Extract the dependent and independent variables from the dataset.
# The independent variable is years of experience, the dependent variable is salary.
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 1].values

# Splitting the dataset into training and test set (20 train / 10 test observations)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

# Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Prediction of test and training set results
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)

# Visualizing the training set results
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary (In Rupees)")
mtp.show()

# Visualizing the test set results
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary (In Rupees)")
mtp.show()
Multiple Regression Model

Multiple regression explains the relationship between multiple independent (predictor) variables and one dependent (criterion) variable. The dependent variable is modeled as a function of several independent variables with corresponding coefficients, along with a constant term. Multiple regression requires two or more predictor variables, which is why it is called multiple regression.

The multiple regression equation takes the following form:

y = b1x1 + b2x2 + … + bnxn + c

Here x1, x2, …, xn are the n independent variables, y is the dependent variable, and the bi (i = 1, 2, …, n) are the regression coefficients, each representing the amount by which the criterion variable changes when the corresponding predictor variable changes by one unit.

As an example, let’s say that the test score of a student in an exam will be
dependent on various factors like his focus while attending the class, his intake
of food before the exam and the amount of sleep he gets before the
exam. Using this test one can estimate the appropriate relationship among these
factors.

Multiple regression is like linear regression, but with more than one independent
value, meaning that we try to predict a value based on two or more variables.

Example 2: Predicting CO Emission based on Engine CC and Weight of the car.

https://www.w3schools.com/python/python_ml_multiple_regression.asp

We can predict the CO2 emission of a car based on the size of the engine, but
with multiple regression we can throw in more variables, like the weight of the
car, to make the prediction more accurate. R(Car, Model,Volume,Weight,CO2)



Py Code
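A minimal sketch reconstructing the code from the linked W3Schools example; the file data.csv (with columns Car, Model, Volume, Weight, CO2) is assumed to be the one downloadable from that page.

import pandas as pd
from sklearn import linear_model

df = pd.read_csv("data.csv")

X = df[["Weight", "Volume"]]   # independent variables
y = df["CO2"]                  # dependent variable

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Predict the CO2 emission of a car weighing 2300 kg with a 1300 cm3 engine
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)

# The coefficient values of weight and volume (discussed below)
print(regr.coef_)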

Output: PredictedCO2 = 107.208 g

We have predicted that a car with a 1.3 liter (1300 cm3) engine and a weight of 2300 kg will release approximately 107 grams of CO2 for every kilometer it drives. What happens if we increase the weight by 1000 kg?



Coefficient

A coefficient is a factor that describes the relationship with an unknown variable. For example, if x is a variable, then 2x is x two times; x is the unknown variable, and the number 2 is the coefficient.

In this case, we can ask for the coefficient value of weight against CO2, and of volume against CO2. The answers we get tell us what would happen if we increase or decrease one of the independent values.

The result array represents the coefficient values of weight and volume:

Weight: 0.00755095
Volume: 0.00780526

These values tell us that if the weight increases by 1 kg, the CO2 emission increases by 0.00755095 g, and if the engine size (volume) increases by 1 cm3, the CO2 emission increases by 0.00780526 g.
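This also answers the earlier question: increasing the weight by 1000 kg adds roughly 1000 × 0.00755095 ≈ 7.55 g, raising the prediction from about 107.2 g to about 114.75 g (the figures follow directly from the coefficients above).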



3.4 K-Nearest Neighbor(KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification
but mostly it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from
the training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets
new data, then it classifies that data into a category that is much similar to
the new data.

Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.



How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:

o Step 1: Select the number K of neighbors.
o Step 2: Calculate the Euclidean distance from the new data point to the training points.
o Step 3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step 4: Among these K neighbors, count the number of data points in each category.
o Step 5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step 6: Our model is ready.
Sample Explanation
Suppose we have a new data point and we need to put it in the required category (the accompanying image is not reproduced here).

o First, we choose the number of neighbors: we will choose k = 5.
o Next, we calculate the Euclidean distance between the data points. The Euclidean distance between two points (x1, y1) and (x2, y2), familiar from geometry, is:

d = √((x2 − x1)² + (y2 − y1)²)

o By calculating the Euclidean distances we find the nearest neighbors: three nearest neighbours in Category A and two nearest neighbours in Category B.



As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some values to find the best among them. The most commonly used value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and expose the model to the effects of outliers.
o Larger values for K reduce the effect of noise, but a value that is too large may smooth over genuine class boundaries.
o It is recommended to choose an odd value for K to avoid ties in classification.

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because the distance must be calculated between the new point and all the training samples.

KNN Example
https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning

Steps to implement the K-NN algorithm:


o Data Pre-processing step
o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

Distance Metrics Used in KNN Algorithm


As we know, the KNN algorithm helps us identify the nearest points or groups for a query point. But to determine the closest groups or nearest points, we need a metric. For this purpose, we use the distance metrics below:

• Euclidean Distance
• Manhattan Distance
• Minkowski Distance

Breaking It Down – Pseudo Code of KNN


We can implement a KNN model by following the steps below:

1. Load the data
2. Initialise the value of k
3. To get the predicted class, iterate from 1 to the total number of training data points:
o Calculate the distance between the test data and each row of the training dataset. Here we use Euclidean distance as our distance metric, since it is the most popular method. Other distance functions or metrics that can be used are Manhattan distance, Minkowski distance, Chebyshev, cosine, etc. If there are categorical variables, Hamming distance can be used.
o Sort the calculated distances in ascending order by distance value
o Get the top k rows from the sorted array
o Get the most frequent class of these rows
o Return the predicted class
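A minimal Python sketch of the pseudo code above follows; the toy data, the query point, and the value of k are assumptions for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training row
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances (the k nearest neighbours)
    nearest = np.argsort(distances)[:k]
    # Most frequent class among the k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Usage: two 2-D categories, A and B
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # prints "A"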

3.5 Training Dataset Construction (Cross Validation)


Cross-validation is a technique used in machine learning and statistical modeling to assess the
performance of a model and to prevent overfitting. It involves dividing the dataset into multiple
subsets, using some for training the model and the rest for testing, multiple times to obtain
reliable performance metrics.

Types of Cross-validation
1. K-Fold Cross-Validation: The data is divided into K subsets (or “folds”). The model is
trained K times, using K-1 folds for training and one-fold for testing in each iteration.
2. Leave-One-Out Cross-Validation (LOOCV): K-Fold CV with K equal to the number of
data points, i.e., each data point is used once as a test set, and the model is trained K times.
3. Stratified K-Fold Cross-Validation: It ensures that the class distribution remains similar
in each fold, important when dealing with imbalanced datasets.
4. Time Series Cross-Validation: For time-dependent data, it uses a series of temporally
ordered training and testing sets, preventing the use of future data for training.
5. Shuffle-Split Cross-Validation: Randomly shuffles the data, and then splits it into
training and testing sets multiple times.
6. Group K-Fold Cross-Validation: Useful when the data contains groups, like multiple samples from the same subject; it ensures all samples from a group are kept together in the same fold.
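A minimal scikit-learn sketch of k-fold cross-validation follows; the dataset and model here are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# cv=5 performs 5-fold cross-validation and returns one accuracy score per fold
scores = cross_val_score(model, X, y, cv=5)
print(scores)
print("Mean accuracy:", scores.mean())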

Cross Validation Steps

Cross-validation is a technique in which we train our model using a subset of the dataset and then evaluate it using the complementary subset. The three steps involved in cross-validation are as follows:

1. Reserve some portion of the sample dataset.
2. Train the model using the rest of the dataset.
3. Test the model using the reserved portion of the dataset.

Example: Consider the training and evaluation subsets generated in k-fold cross-validation with 25 instances in total and k = 5. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (observations 0-4 for testing, 5-24 for training); in the second iteration we use the second subset of 20 percent for evaluation and the remaining subsets for training (observations 5-9 for testing, 0-4 and 10-24 for training), and so on.

Total instances: 25, value of k: 5

Iteration | Training set observations                                 | Testing set observations
1         | [5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] | [0 1 2 3 4]
2         | [0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] | [5 6 7 8 9]
3         | [0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24]      | [10 11 12 13 14]
4         | [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24]      | [15 16 17 18 19]
5         | [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]      | [20 21 22 23 24]
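A minimal sketch reproducing the 5-fold split above with scikit-learn (KFold with shuffling disabled produces exactly these contiguous folds):

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(kf.split(np.arange(25)), start=1):
    print(i, "train:", train_idx, "test:", test_idx)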



Advantages of Cross Validation:

1. Overcoming Overfitting: Cross validation helps to prevent overfitting by providing a more robust estimate of the model's performance on unseen data.
2. Model Selection: Cross validation can be used to compare different models and select
the one that performs the best on average.
3. Hyperparameter tuning: Cross validation can be used to optimize the
hyperparameters of a model, such as the regularization parameter, by selecting the
values that result in the best performance on the validation set.
4. Data Efficient: Cross validation allows the use of all the available data for both training
and validation, making it a more data-efficient method compared to traditional
validation techniques.

Disadvantages of Cross Validation:

1. Computationally Expensive: Cross validation can be computationally expensive, especially when the number of folds is large or when the model is complex and requires a long time to train.
2. Time-Consuming: Cross validation can be time-consuming, especially when there are
many hyperparameters to tune or when multiple models need to be compared.
3. Bias-Variance Tradeoff: The choice of the number of folds in cross validation can
impact the bias-variance tradeoff, i.e., too few folds may result in high variance, while
too many folds may result in high bias.

What is bias vs variance in machine learning?

A model with high variance may represent the data set accurately but can overfit to noisy or otherwise unrepresentative training data. In comparison, a model with high bias may underfit the training data: the model is too simple and overlooks regularities in the data. Bias creates consistent errors in the ML model, reflecting a model too simple for the specific requirement. Variance, on the other hand, creates errors that lead to incorrect predictions by seeing trends or data points that do not exist.

What is Overfitting?
Overfitting is an undesirable machine learning behavior that occurs when the machine learning
model gives accurate predictions for training data but not for new data. When data
scientists use machine learning models for making predictions, they first train the model on a
known data set. Then, based on this information, the model tries to predict outcomes for new
data sets. An overfit model can give inaccurate predictions and cannot perform well for all
types of new data.

Why does overfitting occur?



You only get accurate predictions if the machine learning model generalizes to all types of data
within its domain. Overfitting occurs when the model cannot generalize and fits too closely to
the training dataset instead. Overfitting happens due to several reasons, such as:

• The training data size is too small and does not contain enough data samples to accurately
represent all possible input data values.
• The training data contains large amounts of irrelevant information, called noisy data.
• The model trains for too long on a single sample set of data.
• The model complexity is high, so it learns the noise within the training data.

What is underfitting?
Underfitting is another type of error that occurs when the model cannot determine a meaningful
relationship between the input and output data. You get underfit models if they have not trained
for the appropriate length of time on a large number of data points.

Underfitting vs. overfitting

Underfit models experience high bias: they give inaccurate results for both the training data and the test set. On the other hand, overfit models experience high variance: they give accurate results for the training set but not for the test set. More model training results in less bias, but variance can increase. Data scientists aim to find the sweet spot between underfitting and overfitting when fitting a model. A well-fitted model can quickly establish the dominant trend for both seen and unseen data sets.

3.7 Regression Line

A regression line is a line used to describe the behaviour of a set of data. In other words, it gives the best trend of the given data. Regression lines are useful in forecasting procedures: their purpose is to describe the interrelation of the dependent variable (the y variable) with one or many independent variables (the x variable). Using the equation obtained from the regression line, an analyst can forecast future behaviour of the dependent variable by inputting different values for the independent ones.



Regression Line Formula:
y = a + bx + u

Multiple Regression Line Formula:
y = a + b1x1 + b2x2 + b3x3 + … + btxt + u

There can be two cases of simple linear regression:


1. The equation is Y on X, where the value of Y changes with a variation in the value of
X.
2. The equation is X on Y, where the change in X variable depends upon the Y variable’s
deviation.

3.8 Least Squares Regression

Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable), say Y, and one or more explanatory variables (or independent variables), say X.

Regression Line: If our data shows a linear relationship between X and Y, the straight line which best describes the relationship is the regression line. It is the straight line that passes as close as possible to the points in the graph, minimizing the squared vertical distances.

Performing Linear Regression

Our aim is to calculate the values m (slope) and b (y-intercept) in the equation y = mx + b, where:

• y = how far up
• x = how far along
• m = slope or gradient (how steep the line is)
• b = the y-intercept (where the line crosses the Y axis)

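The worked formulas from the original pages are not reproduced here; for reference, the standard least-squares estimates are:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
b = ȳ − m·x̄

where x̄ and ȳ are the means of the x and y values.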
3.9 Standard Error of Estimate
The standard error of the estimate is an estimation of the accuracy of the predictions made by a regression line. It is denoted SEE. The regression line minimizes the sum of squared deviations of the predictions, also known as the sum of squares error (SSE). SEE is the square root of the average squared deviation of the observed values from the regression line:

SEE = √( Σ(yᵢ − ŷᵢ)² / (n − 2) )

where yᵢ are the observed values, ŷᵢ are the values predicted by the regression line, and n is the sample size (n − 2 accounts for the two estimated coefficients).

The standard error of estimate measures the accuracy of the predictions made by a
regression model. In other words, it determines how well the regression line describes the
values of a data set. If you have a collection of data from an experiment, survey, or other
source, follow along with us below to learn how to calculate your data set’s standard error of
estimate.
Example
Consider the following Data pairs: (1,2) (2,4) (3,5) (4,4) (5,5)

Solution: Calculate the regression line (as seen in the previous example), compute the predicted value and residual for each pair, find the sum of the squared errors (SSE), and take SEE = √(SSE / (n − 2)).
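A minimal Python sketch carrying out this computation for the five pairs above:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Least-squares slope and intercept
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - m * x.mean()          # m = 0.6, b = 2.2

residuals = y - (m * x + b)
sse = (residuals ** 2).sum()         # SSE = 2.4
see = np.sqrt(sse / (len(x) - 2))    # SEE ≈ 0.894
print(m, b, sse, see)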



3.10 Interpretation of R2

The R squared (R2) value in machine learning is referred to as the coefficient of determination, or the coefficient of multiple determination in the case of multiple regression. R squared acts as an evaluation metric for regression: it measures the scatter of the data points around the fitted regression line, i.e., the percentage of the variation of the dependent variable that the model explains.

R-squared method:

R-squared is a statistical method that determines the goodness of fit. It measures the strength of the relationship between the dependent and independent variables on a scale of 0-100%. A high value of R-squared means less difference between the predicted values and the actual values, and hence represents a good model. It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The value of R-squared lies between 0 and 100%:

• 0% corresponds to a model that explains none of the variability of the response data around its mean; the model predicts no better than simply using the mean of the dependent variable.
• 100% corresponds to a model that explains all of the variability of the response variable around its mean.



If your value of R2 is large, you have a better chance of your regression model
fitting the observations.

Interpretation of the R2 value

Below is an example based on a graph (not reproduced here) showing how the number of lectures per day affects the number of hours spent at university per day, for the four data points (2,2), (3,4), (4,6) and (6,7). The regression line drawn on the graph has, consistent with the residuals computed below, the equation:

predicted y = 1.229x + 0.143

Solution:

Step 1: For each point compute the residual = actual Y − predicted Y.

For the point (2,2), substituting x = 2 into the regression line gives predicted Y = 1.229 × 2 + 0.143 ≈ 2.6. The actual value of Y at (2,2) is 2, so the residual = 2 − 2.6 = −0.6.

Similarly, the residuals of the other points are:

For (3,4) = 0.17
For (4,6) = 0.941
For (6,7) = −0.517

Step 2: Compute the sum of squared residuals: SSE = (−0.6)² + (0.17)² + (0.941)² + (−0.517)² ≈ 1.54.

Step 3: Compute the total sum of squares about the mean ȳ = 4.75: SST = Σ(y − ȳ)² = 14.75.

Step 4: Compute R² = 1 − SSE/SST = 1 − 1.54/14.75 ≈ 0.895.

Step 5: R-squared = 89.5%, which indicates a good value.
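A minimal Python sketch verifying this R-squared computation for the four points above:

import numpy as np

x = np.array([2, 3, 4, 6], dtype=float)
y = np.array([2, 4, 6, 7], dtype=float)

m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - m * x.mean()

sse = ((y - (m * x + b)) ** 2).sum()   # sum of squared residuals
sst = ((y - y.mean()) ** 2).sum()      # total sum of squares
print(1 - sse / sst)                   # ≈ 0.895, i.e. 89.5%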



3.11 Regression Towards the Mean

Regression to the mean refers to the idea that rare or extreme events are
likely to be followed by more typical ones. Over time, outcomes regress to the
average or mean.

How Regression To The Mean Works

Regression to the mean is a statistical phenomenon in which an extreme outcome is likely to be followed by one closer to the average. This is often because the outlier experience involves a series of variables, either good or bad, occurring simultaneously to create a great or poor result, like flipping a coin and getting heads multiple times. This is why an extreme outcome is typically followed by an average result.

The pattern is always the same: an "extreme" event (in either a good or bad direction) is followed by a more typical one. For example, regression to the mean explains why the second time you visit a restaurant you thought so highly of last time, it fails to live up to your expectations.

Consider two students, Jane and Joe. In year one, Jane does horribly but Joe is outstanding: Jane is ranked at the 1st percentile while Joe is ranked at the 99th. If their results were entirely due to talent, there would be no regression: Jane should be as bad in year two, and Joe should be as good (possibility A). If their results were equal parts luck and talent, we would expect halfway regression: Jane should rise to around the 25th percentile and Joe should fall to around the 75th (possibility B). If their results were caused entirely by luck (e.g. flipping a coin), then in year two we would expect both Jane and Joe to regress all the way back to the 50th percentile (possibility C).



Why Regression to the Mean Is Important
Regression to the mean describes the feature that extreme outcomes tend to
be followed by more normal ones. It’s a statistical concept that is both easy
to understand and easy to forget. When we witness extreme events, such as
unlikely successes or failures, we forget how rare such events are. When these
events are followed by more normal events, we try to explain why these
“normal” events happened. We forget that these “normal” events are normal
and that we should expect them to happen. This often leads us to attribute
causal powers to people, events, and interventions that may have played no
role in bringing about that normal event.

**End of Unit 3**

