Ds Module 4

The document discusses machine learning techniques including supervised learning, unsupervised learning and reinforcement learning. It also discusses classification and regression problems in machine learning and how to evaluate classifiers using accuracy, confusion matrix and other metrics. Key machine learning algorithms like linear regression, logistic regression are also explained.

Uploaded by

Prathik Srinivas
2023-2024

21IS5C05
Data Science

Module 4
Rampur Srinath
NIE, Mysuru
Machine Learning

Machine learning involves coding programs that automatically adjust their performance in accordance with their exposure to information in data.

Machine learning can be considered a subfield of artificial intelligence (AI). We can roughly divide the field into the following three major classes:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised learning: Algorithms which learn from a training set of labeled examples to generalize to the set of all possible inputs. Examples of techniques in supervised learning: logistic regression, support vector machines, decision trees, random forest, etc.

Unsupervised learning: Algorithms that learn from a training set of unlabeled examples. Used to explore data according to some statistical, geometric or similarity criterion. Examples of unsupervised learning include k-means clustering and kernel density estimation.

Reinforcement learning: Algorithms that learn via reinforcement from criticism that provides information on the quality of a solution, but not on how to improve it. Improved solutions are achieved by iteratively exploring the solution space.
• As a data scientist, the first step when facing a problem is to identify the question to be answered. The type of answer we are seeking points us directly to a certain set of techniques.
• If our question is answered by YES/NO, we are facing a classification problem. Classifiers are also the tools to use if our question admits only a discrete set of answers, i.e., we want to select from a finite number of choices.
  – Given the results of a clinical test, does this patient suffer from diabetes?
  – Given a magnetic resonance image, does the image show a tumor?
  – Given the past activity associated with a credit card, is the current operation fraudulent?
• If our question is a prediction of a real-valued quantity, we are faced with a regression problem.
  – Given the description of an apartment, what is the expected market value of the flat? What will the value be if the apartment has an elevator?
  – Given the past records of user activity on Apps, how long will a certain client be connected to our App?
  – Given my skills and marks in computer science and maths, what mark will I achieve in a data science course?
• Classification is the natural choice of machine learning tools for prediction with discrete known outcomes. According to the cardinality of the target set, one usually distinguishes between binary classifiers, when the target output takes only two values (i.e., the classifier answers questions with a yes or a no), and multiclass classifiers, for a larger number of classes.
• We can encode both target states of a binary problem in a numerical variable, e.g., the target takes the value +1 for a successful loan and −1 otherwise.
A problem in Scikit-learn is modeled as follows:
Input data is structured in NumPy arrays. The size of the array is expected to be [n_samples, n_features]:
• n_samples: The number of samples (n). Each sample is an item to process (e.g., classify). A sample can be a document, a picture, an audio file, a video, an astronomical object, a row in a database or whatever you can describe with a fixed set of quantitative traits.
• n_features: The number of features (d) or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be Boolean, discrete-valued or even categorical.
Considering data arranged as in the previous matrices we refer to:
• the columns as features, attributes, dimensions, regressors,
covariates, predictors, or independent variables;
• the rows as instances, examples, or samples;
• the target as the label, outcome, response, or dependent variable.

All objects in Scikit-learn share a uniform and limited API consisting of three complementary interfaces:
• an estimator interface for building and fitting models (fit());
• a predictor interface for making predictions (predict());
• a transformer interface for converting data (transform()).
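The data layout and the three interfaces above can be sketched on a small example. The toy values below are made up for illustration, and the choice of StandardScaler and LogisticRegression is only an assumption to show one transformer and one estimator/predictor.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 6 samples (rows) described by 2 features (columns).
X = np.array([[1.0, 20.0], [2.0, 22.0], [3.0, 25.0],
              [7.0, 60.0], [8.0, 65.0], [9.0, 70.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # one target label per sample

n_samples, n_features = X.shape

# Transformer interface: fit() learns the column means and scales,
# transform() applies them to the data.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Estimator interface: fit() builds the model from the training data.
clf = LogisticRegression()
clf.fit(X_scaled, y)

# Predictor interface: predict() returns one label per sample.
y_hat = clf.predict(X_scaled)
```

Every other model in this module follows the same pattern: construct the object, call fit(), then call predict() or transform().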
The basic measure of performance of a classifier is its accuracy. This is defined as the number of correctly predicted examples divided by the total number of examples. Accuracy is related to the error as follows: acc = 1 − err.

Each estimator has a score() method that invokes the default scoring metric (for classifiers, this is the accuracy).
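As a small illustration of the definition of accuracy, the sketch below computes it by hand and with scikit-learn's accuracy_score. The label vectors are invented for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical ground truth and predictions for 8 examples.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy: correctly predicted examples / total number of examples.
acc = np.mean(y_true == y_pred)
err = 1.0 - acc

# The same value computed by scikit-learn.
acc_sklearn = accuracy_score(y_true, y_pred)
```

Here 6 of the 8 predictions match, so the accuracy is 0.75 and the error is 0.25.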
• Although accuracy is the most common metric for evaluating classifiers, there are cases when the business value of correctly predicting elements from one class is different from the value of correctly predicting elements of another class.
• In those cases, accuracy is not a good performance metric and a more detailed analysis is needed. The confusion matrix enables us to define different metrics for such scenarios.
The confusion matrix considers the concepts of the classifier outcome and the actual ground truth or gold standard. In a binary problem, there are four possible cases:
• True positives (TP): When the classifier predicts a sample as positive and it really is positive.
• False positives (FP): When the classifier predicts a sample as positive but in fact it is negative.
• True negatives (TN): When the classifier predicts a sample as negative and it really is negative.
• False negatives (FN): When the classifier predicts a sample as negative but in fact it is positive.
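The four cases can be counted directly with scikit-learn. This is a minimal sketch on invented labels, with 1 taken as the positive class; for binary labels 0/1, confusion_matrix returns the counts in the order [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are the actual classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy can be recovered from the four counts.
acc = (tp + tn) / (tp + tn + fp + fn)
```

With these labels there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative.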
Training, Validation and Test

• Test data is used exclusively for assessing performance at the end of the process and will never be used in the learning process.
• Validation data is used explicitly to select the parameters/models with the best performance according to an estimation of the generalization error. This is a form of learning.
• Training data are used to learn the instance of the model from a model class.
1. Split the original dataset into training and test data. For
example, use 30% of the original dataset for testing purposes.
This data is held back and will only be used to assess the
performance of the method.
2. Use the remaining training data to select the
hyperparameters by means of cross-validation.
3. Train the model with the selected parameter and assess the
performance using the test dataset.
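The three steps above can be sketched as follows. The synthetic dataset, the choice of LogisticRegression, and the candidate values for its hyperparameter C are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the original dataset (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 1: hold back 30% of the data for the final test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 2: select the hyperparameter C by 5-fold cross-validation,
# using the training data only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

# Step 3: GridSearchCV refits the best model on all the training data;
# the held-out test set is used once, to assess performance.
test_accuracy = search.score(X_test, y_test)
```

The test data never enters fit(), so test_accuracy is an estimate of performance on unseen data.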
Regression Analysis

• Regression is concerned with making predictions about real-world quantities, such as those alluded to in the following questions:
• How does sales volume change with changes in price?
• How is sales volume affected by the weather?
• How does the title of a book affect its sales?
• How does the amount of a drug absorbed vary with the patient’s body weight, and does this relationship depend on blood pressure?
• How many customers can I expect today?
• At what time should I go home to avoid traffic jams?
• What is the chance of rain on the next two Mondays, and what is the expected temperature?
All these questions have a common structure: they ask for a response that can be expressed as a combination of one or more (independent) variables (also called covariates or predictors).

The role of regression is to build a model to predict the response from the variables. This process involves the transition from data to model.
More specifically, the model can be useful in different tasks, such as the following:
(1) analyzing the behavior of data (the relation between the response and the variables),
(2) predicting data values (whether continuous or discrete),
(3) finding important variables for the model.
• In order to understand how a regression model can be suitable for tackling these tasks, we will introduce three practical cases for which we use three real datasets and solve different questions. These practical cases will motivate:
• simple linear regression,
• multiple linear regression,
• logistic regression.
Linear Regression

• The objective of performing a regression is to build a model to express the relation between the response and a combination of one or more (independent) variables.
• The model allows us to predict the response y from the variables.
• Two quantities are correlated if there is a relationship between the two variables.
• The simplest model which can be considered is a linear model, where the response y depends linearly on the d variables $x_i$:

$y = Xw$,

where X is the matrix holding one sample per row and one variable per column, and w is the vector of model coefficients (weights).
Correlation Coefficient

• The correlation coefficient measures the degree of linear relationship between variables.
• In a correlation analysis we estimate a value bounded between -1 and 1 and we call it the correlation coefficient. This coefficient tells us the strength of the linear association between the two variables.
• If the two quantities vary in tandem (if one increases/decreases, the other one does too), the correlation coefficient is positive.
• It is negative when the two quantities vary out of sync (if one decreases, the other one increases).
• It is important to remember that the correlation coefficient measures only the strength of the linear relationship between the variables.
• A value of zero does not mean that there is no relationship at all. It simply indicates that there is no linear relation between the variables in question.
• Correlation does not imply causation: the fact that people use umbrellas when it rains does not mean that umbrellas cause rain to fall.
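The behaviour of the correlation coefficient can be checked numerically. The sketch below uses NumPy's corrcoef on made-up data: a rising line, a falling line, and a parabola, which depends strongly on x but not linearly.

```python
import numpy as np

# 21 points placed symmetrically around zero.
x = np.linspace(-1.0, 1.0, 21)

# Quantities that vary in tandem: coefficient of +1.
r_pos = np.corrcoef(x, 2.0 * x + 1.0)[0, 1]

# Quantities that vary out of sync: coefficient of -1.
r_neg = np.corrcoef(x, -3.0 * x)[0, 1]

# y = x^2 is fully determined by x, yet the relation is not linear,
# so the coefficient is (numerically) zero.
r_zero = np.corrcoef(x, x ** 2)[0, 1]
```

The third case shows why a coefficient of zero only rules out a linear relation, not a relation in general.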
Types of Linear Regression

Linear regression can be further divided into two types of algorithm:
• Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
• Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Linear Regression Line

• A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:
Positive Linear Relationship:

• If the dependent variable (Y-axis) increases as the independent variable (X-axis) increases, the relationship is termed a positive linear relationship.
Negative Linear Relationship:

• If the dependent variable (Y-axis) decreases as the independent variable (X-axis) increases, the relationship is called a negative linear relationship.
Finding the best fit line:

• When working with linear regression, our main goal is to find the best fit line, meaning that the error between the predicted values and the actual values should be minimized. The best fit line has the least error.
• Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line. To calculate them we use a cost function.
Cost function

• Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
• The regression coefficients or weights are chosen by optimizing the cost function, which measures how well a linear regression model is performing.
• We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the hypothesis function.
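MSE

• For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$$

• For the regression line $\hat{y}_i = a_0 + a_1 x_i$, the MSE can be calculated as:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - (a_0 + a_1 x_i)\right)^2$$

where N is the total number of observations, $y_i$ is the actual value and $a_0 + a_1 x_i$ is the predicted value.
• Residuals: The distance between the actual value and the predicted value is called the residual. If the observed points are far from the regression line, the residuals will be large, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small, and hence so will the cost function.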
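To make the cost function concrete, the sketch below computes the residuals and the MSE for one candidate line and then for the least-squares line. The four data points and the candidate coefficients are invented; np.polyfit with degree 1 is used here as one convenient way to obtain the least-squares coefficients.

```python
import numpy as np

# Hypothetical observations.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 10.0])

def mse(a0, a1):
    """Mean squared error of the line y_hat = a0 + a1 * x on the data."""
    residuals = y - (a0 + a1 * x)
    return np.mean(residuals ** 2)

# A candidate line chosen by eye: a0 = 1, a1 = 2.
residuals_guess = y - (1.0 + 2.0 * x)
mse_guess = mse(1.0, 2.0)

# Least-squares line; polyfit returns the slope first, then the intercept.
a1_best, a0_best = np.polyfit(x, y, 1)
mse_best = mse(a0_best, a1_best)
```

The candidate line passes through three of the four points and misses the last one by 1, giving an MSE of 0.25. The least-squares line spreads the error over all points and reaches a lower MSE.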
Simple Linear Regression

• Simple linear regression considers n samples of a single variable x and describes the relationship between the variable and the response with the model:

$y = a_0 + a_1 x$,

where the parameter $a_0$ is called the intercept or the constant term and $a_1$ is the slope.
• Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear, i.e., a sloped straight line, hence the name.
• The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be continuous or categorical.
Simple Linear Regression has two main objectives:
• Modeling the relationship between two variables, such as the relationship between income and expenditure, or experience and salary.
• Forecasting new observations, such as forecasting the weather according to temperature, or the revenue of a company according to its investments in a year.
Implementation

# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Loading the dataset
data_set = pd.read_csv('Salary_Data.csv')

# Extracting the independent (x) and dependent (y) variables
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 1].values

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

# Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Predicting the test set and training set results
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)

# Visualizing the training set results
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

# Visualizing the test set results (the red line is the same fitted line)
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()
Multiple Linear Regression

• There may be various cases in which the response variable is affected by more than one predictor variable; for such cases, the Multiple Linear Regression (MLR) algorithm is used.
• Multiple Linear Regression is one of the important regression algorithms. It models the linear relationship between a single continuous dependent variable and more than one independent variable.
• For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or independent variables may be continuous or categorical.
• Each feature variable should have a linear relationship with the dependent variable.
• MLR tries to fit a regression line (in general, a hyperplane) through a multidimensional space of data points.
Assumptions for Multiple Linear Regression:
• A linear relationship should exist between the target and predictor variables.
• The regression residuals must be normally distributed.
• MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.
Implementation

# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('50_CompList.csv')

# Extracting the independent and dependent variables
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 4].values

# Categorical data: one-hot encode column 3.
# OneHotEncoder(categorical_features=...) was removed from scikit-learn,
# so the column is selected with a ColumnTransformer instead. The encoded
# columns are placed first; the remaining columns are passed through.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
transformer = ColumnTransformer([("onehot", OneHotEncoder(), [3])],
                                remainder="passthrough", sparse_threshold=0)
x = transformer.fit_transform(x).astype(float)

# Avoiding the dummy variable trap: drop the first dummy column
x = x[:, 1:]

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Fitting the MLR model to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Predicting the test set result
y_pred = regressor.predict(x_test)

# score() returns the coefficient of determination (R^2) for regressors
print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))

Train Score: 0.9501847627493607
Test Score: 0.9347068473282446
Polynomial Regression

• Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:

$$y = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$$

• It is also called a special case of Multiple Linear Regression, because we add polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.
• "In Polynomial regression, the original features are converted into Polynomial features of required degree (2,3,..,n) and then modeled using a linear model."
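The conversion of the original feature into polynomial features can be seen on a tiny example. The data below are generated from a known quadratic, chosen only so that the result is easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 1 + 2x + 3x^2.
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2

# Degree-2 polynomial features: each row becomes [1, x, x^2].
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)

# An ordinary linear model is then fitted on the expanded features.
model = LinearRegression()
model.fit(x_poly, y)

# New inputs must go through the same feature expansion before predict().
y_new = model.predict(poly.transform(np.array([[5.0]])))
```

The model is still linear in its coefficients; only the inputs have been expanded. For x = 5 the prediction recovers 1 + 2·5 + 3·25 = 86.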
Implementation

# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('Position_Salaries.csv')

# Extracting the independent and dependent variables
x = data_set.iloc[:, 1:2].values
y = data_set.iloc[:, 2].values

# Fitting the Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_regs = LinearRegression()
lin_regs.fit(x, y)

# Fitting the Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_regs = PolynomialFeatures(degree=2)
x_poly = poly_regs.fit_transform(x)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(x_poly, y)

# Visualizing the result for the Linear Regression model
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_regs.predict(x), color="red")
mtp.title("Bluff detection model(Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()

# Visualizing the result for the Polynomial Regression model
# (x_poly already holds the polynomial features of x, so no refit is needed)
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_reg_2.predict(x_poly), color="red")
mtp.title("Bluff detection model(Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()
For degree = 3:
If we change the degree to 3, the fitted curve follows the data points more closely.
Degree = 4: If we change the degree to 4, the curve follows the data points more closely still. Increasing the degree of the polynomial therefore improves the fit to the given data, although a very high degree can overfit, so the result should be checked on data not used for fitting.
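The effect of the degree can be examined numerically rather than by eye. The sketch below is an illustration on synthetic noisy data (an assumption, not the Position_Salaries data): it fits degrees 1 to 4 on half of the points and records the mean squared error both on those points and on the held-out half.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic noisy data (assumption): a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0.0, 0.2, 30)

# Alternate points go to the fitting set and to the held-out set.
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

train_mse, val_mse = {}, {}
for degree in (1, 2, 3, 4):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    train_mse[degree] = mean_squared_error(
        y_train, model.predict(poly.transform(x_train)))
    val_mse[degree] = mean_squared_error(
        y_val, model.predict(poly.transform(x_val)))
```

The error on the fitting set cannot increase as the degree grows, because each lower-degree model is a special case of the higher-degree one. The held-out error in val_mse is the quantity to compare when choosing the degree.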
Logistic Regression

• Logistic regression is one of the most popular machine learning algorithms and belongs to the supervised learning techniques. It is used for predicting a categorical dependent variable from a given set of independent variables.
• Because logistic regression predicts a categorical dependent variable, the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. Instead of giving the exact values 0 and 1, however, it gives probabilistic values which lie between 0 and 1.
• Logistic regression is very similar to linear regression except in how it is used. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function whose output is bounded by the two extreme values 0 and 1.
• Logistic regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete data.
Logistic Function (Sigmoid Function):

• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within the range of 0 to 1.
• The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so it forms an "S"-shaped curve. This S-shaped curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value to decide between the classes 0 and 1: values above the threshold are assigned to 1, and values below the threshold are assigned to 0.
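A minimal sketch of the sigmoid function and of thresholding is given below. The input values and the threshold of 0.5 are assumptions chosen for illustration; in practice the threshold depends on the application.

```python
import numpy as np

def sigmoid(z):
    """Map any real value z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical outputs of the linear part of the model.
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(z)

# Assumed threshold: probabilities at or above it are assigned to class 1.
threshold = 0.5
labels = (p >= threshold).astype(int)
```

Large negative inputs give probabilities close to 0, large positive inputs give probabilities close to 1, and an input of 0 gives exactly 0.5.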
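The mathematical steps to get the Logistic Regression equation are given below:
• We know the equation of the straight line can be written as:

$$y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_d x_d$$

• In Logistic Regression y can be between 0 and 1 only, so let us divide y by (1 − y):

$$\frac{y}{1 - y}, \quad \text{which is } 0 \text{ for } y = 0 \text{ and tends to infinity as } y \to 1$$

• But we need a range from −∞ to +∞, so we take the logarithm, and the equation becomes:

$$\log\left(\frac{y}{1 - y}\right) = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_d x_d$$

• Solving this equation for y gives $y = 1 / \left(1 + e^{-(a_0 + a_1 x_1 + \dots + a_d x_d)}\right)$, which is the sigmoid function applied to the linear combination.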
Implementation

# Data pre-processing step
# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('user_data.csv')

# Extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling: fit the scaler on the training set only,
# then apply the same transformation to the test set
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

# Predicting the test set result
y_pred = classifier.predict(x_test)

# Creating the confusion matrix from the true and predicted test labels
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Thank You
