Machine Learning 2


PRACTICAL:4

Write a model to demonstrate the working of Simple Linear Regression and
Multiple Linear Regression

Simple Linear Regression


Simple Linear Regression is a type of regression algorithm that models the
relationship between a dependent variable and a single independent variable. The
relationship shown by a Simple Linear Regression model is linear (a sloped
straight line), hence it is called Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be
a continuous/real value. The independent variable, however, can be either
continuous or categorical.

The Simple Linear Regression algorithm has two main objectives:

o Model the relationship between the two variables. Such as the relationship
between Income and expenditure, experience and Salary, etc.
o Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year,
etc.

Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:

y = a0 + a1x + ε

Where,

a0 = the intercept of the regression line (obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is increasing or
decreasing
ε = the error term (negligible for a good model)
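Before turning to the library implementation, a small worked sketch may help show how a0 and a1 are obtained by least squares. The data values below are made up for illustration and are not from the tutorial's dataset:

import numpy as nm

x = nm.array([1.0, 2.0, 3.0, 4.0, 5.0])       # e.g. years of experience
y = nm.array([40.0, 45.0, 52.0, 60.0, 63.0])  # e.g. salary in thousands

# Closed-form least-squares estimates for y = a0 + a1*x
a1 = nm.sum((x - x.mean()) * (y - y.mean())) / nm.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
print("intercept a0:", a0, "slope a1:", a1)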
Implementation of Simple Linear Regression Algorithm using Python

Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: salary (dependent variable) and
experience (independent variable). The goals of this problem are:

o To find out whether there is any correlation between these two variables.
o To find the best-fit line for the dataset.
o To see how the dependent variable changes as the independent variable changes.

In this section, we will create a Simple Linear Regression model to find out the best
fitting line for representing the relationship between these two variables.

To implement the Simple Linear regression model in machine learning using Python,
we need to follow the below steps:

Step-1: Data Pre-processing

The first step for creating the Simple Linear Regression model is data pre-
processing. We have already done it earlier in this tutorial. But there will be some
changes, which are given in the below steps:

o First, we will import the three important libraries, which will help us for
loading the dataset, plotting the graphs, and creating the Simple Linear
Regression model.

import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Next, we will load the dataset into our code:

data_set = pd.read_csv('Salary_Data.csv')

By executing the above line of code (Ctrl+Enter), we can view the dataset in the
Spyder IDE by clicking on the variable explorer option.
The above output shows the dataset, which has two variables: Salary and Experience.

After that, we need to extract the dependent and independent variables from the
given dataset. The independent variable is years of experience, and the dependent
variable is salary. Below is code for it:

x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 1].values

In the above lines of code, for the x variable we used -1 as the column bound because
we want to drop the last column (Salary) from the dataset. For the y variable we used
1 as the index, since we want the second column and indexing starts from zero.

By executing the above lines of code, we will get the output for the X and Y variables.
In the output image, we can see that the X (independent) and Y (dependent) variables
have been extracted from the given dataset.
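As a quick illustration (toy values; the real Salary_Data.csv is assumed to have a YearsExperience column followed by a Salary column), the iloc slicing above behaves like this:

import pandas as pd

df = pd.DataFrame({'YearsExperience': [1.1, 2.0, 3.2],
                   'Salary': [39343.0, 43525.0, 60150.0]})
x = df.iloc[:, :-1].values   # every column except the last -> 2-D array
y = df.iloc[:, 1].values     # the second column (index 1) -> 1-D array
print(x.shape, y.shape)      # (3, 1) (3,)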

o Next, we will split both variables into the test set and training set. We have 30
observations, so we will take 20 observations for the training set and 10
observations for the test set. We are splitting our dataset so that we can train
our model using a training dataset and then test the model using a test dataset.
The code for this is given below:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

By executing the above code, we will get the x_train, x_test, y_train, and y_test
datasets. Consider the below images:

Test-dataset:
Training Dataset:

o For Simple Linear Regression, we will not use feature scaling, because the
scikit-learn implementation does not require it for this case. Our dataset is now
well prepared, and we can start building the Simple Linear Regression model
for the given problem.
Step-2: Fitting the Simple Linear Regression to the Training Set:

Now the second step is to fit our model to the training dataset. To do so, we will
import the LinearRegression class of the linear_model module from scikit-learn.
After importing the class, we will create an object of the class named regressor.
The code for this is given below:

#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

In the above code, we have used the fit() method to fit our Simple Linear Regression
object to the training set. In the fit() function, we have passed x_train and y_train,
our training data for the independent and dependent variables. We have fitted the
regressor object to the training set so that the model can learn the correlations
between the predictor and target variables. After executing the above lines of code,
we will get the below output.

Output:

Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
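As an optional check (not a step in the original tutorial), the learned slope and intercept can be read off the fitted object; coef_ and intercept_ are standard attributes of scikit-learn's LinearRegression:

# a1 (slope) and a0 (intercept) learned from the training data
print('slope a1    :', regressor.coef_[0])
print('intercept a0:', regressor.intercept_)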

Step: 3. Prediction of test set result:

The model is now trained on the dependent (Salary) and independent (Experience)
variables, so it is ready to predict the output for new observations. In this step, we
will provide the test dataset (new observations) to the model to check whether it can
predict the correct output or not.

We will create prediction vectors y_pred and x_pred, which will contain the
predictions for the test dataset and the training set, respectively.

#Prediction of Test and Training set result
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)
On executing the above lines of code, two variables named y_pred and x_pred will
appear in the variable explorer: y_pred contains the salary predictions for the test
set, and x_pred contains the predictions for the training set.

Output:

You can check the variable by clicking on the variable explorer option in the IDE,
and also compare the result by comparing values from y_pred and y_test. By
comparing these values, we can check how good our model is performing.
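One optional way to quantify this comparison (our addition, using standard scikit-learn metrics) is:

from sklearn.metrics import mean_absolute_error, r2_score

# Average absolute error and coefficient of determination on the test set
print('MAE:', mean_absolute_error(y_test, y_pred))
print('R^2:', r2_score(y_test, y_pred))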

Step: 4. Visualizing the Training set results:

Now in this step, we will visualize the training set result. To do so, we will use the
scatter() function of the pyplot library, which we have already imported in the pre-
processing step. The scatter() function will create a scatter plot of observations.

On the x-axis, we will plot the Years of Experience of employees, and on the y-axis,
the salary of employees. To the function, we will pass the real values of the training
set, i.e., years of experience x_train and salaries y_train, along with the color of the
observations. Here we are taking green for the observations, but it can be any color
of your choice.

Now, we need to plot the regression line, so for this, we will use the plot()
function of the pyplot library. In this function, we will pass the years of experience
for training set, predicted salary for training set x_pred, and color of the line.

Next, we will give the title for the plot. Here, we will use the title() function of
the pyplot library and pass the title "Salary vs Experience (Training Dataset)".

After that, we will assign labels for the x-axis and y-axis using the xlabel() and
ylabel() functions.

Finally, we will render the plot using show(). The code is given below:

mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary (In Rupees)")
mtp.show()

Output:

By executing the above lines of code, we will get the below graph plot as an output.

In the above plot, the real observations are shown as green dots and the predicted
values are traced by the red regression line. The regression line shows the correlation
between the dependent and independent variable.

The goodness of fit can be judged from the difference between the actual and
predicted values. As we can see in the above plot, most of the observations are close
to the regression line, hence our model fits the training set well.

Step: 5. Visualizing the Test set results:

In the previous step, we visualized the performance of our model on the training set.
Now, we will do the same for the test set. The complete code will remain the same
as above, except that we will use x_test and y_test instead of x_train and y_train.

Here we are also changing the color of observations and regression line to
differentiate between the two plots, but it is optional.
#visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary (In Rupees)")
mtp.show()

Output:

By executing the above line of code, we will get the output as:

In the above plot, the observations are shown in blue, and the prediction is given by
the red regression line. As we can see, most of the observations are close to the
regression line, so we can say our Simple Linear Regression model is good and able
to make reliable predictions.

Multiple Linear Regression

Multiple Linear Regression is one of the important regression algorithms; it models
the linear relationship between a single continuous dependent variable and more
than one independent variable.
Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.

Some key points about MLR:

o For MLR, the dependent or target variable (Y) must be continuous/real, but
the predictor or independent variables may be continuous or categorical.
o Each feature variable must model a linear relationship with the dependent
variable.
o MLR tries to fit a regression line through a multidimensional space of data
points.

MLR equation:

In Multiple Linear Regression, the target variable (Y) is a linear combination of
multiple predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple
Linear Regression, the same form carries over, and the equation becomes:

Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ............... (a)

Where,

Y = Output/Response variable

b0, b1, b2, b3, ..., bn = Coefficients of the model

x1, x2, x3, x4, ... = Various independent/feature variables

Assumptions for Multiple Linear Regression:


o A linear relationship should exist between the target and predictor
variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the
independent variables) in the data; a quick correlation check is sketched below.
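A quick way to check the multicollinearity assumption is to look at the pairwise correlations between the independent variables. The sketch below uses a few rows of R&D Spend, Administration Spend, and Marketing Spend values from the dataset printed later in this practical; anything near +1 or -1 would signal multicollinearity:

import numpy as nm

X = nm.array([[165349.2, 136897.8, 471784.1],
              [162597.7, 151377.59, 443898.53],
              [153441.51, 101145.55, 407934.54],
              [144372.41, 118671.85, 383199.62]])
# Correlation matrix of the columns (rowvar=False treats columns as variables)
print(nm.corrcoef(X, rowvar=False).round(2))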
Implementation of Multiple Linear Regression model using Python:

To implement MLR using Python, we have below problem:

Problem Description:

We have a dataset of 50 start-up companies. This dataset contains five main
columns: R&D Spend, Administration Spend, Marketing Spend, State, and Profit
for a financial year. Our goal is to create a model that can easily determine which
company has the maximum profit, and which factor affects a company's profit the
most.

Since we need to predict Profit, it is the dependent variable, and the other four
variables are independent variables. Below are the main steps of deploying the MLR
model:

1. Data Pre-processing steps
2. Fitting the MLR model to the training set
3. Predicting the result of the test set

Step-1: Data Pre-processing Step:

The very first step is data pre-processing, which we have already discussed in this
tutorial. This process contains the below steps:

o Importing libraries: Firstly, we will import the libraries which will help in
building the model. Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Importing dataset: Now we will import the dataset(50_CompList), which
contains all the variables. Below is the code for it:

#importing datasets
data_set = pd.read_csv('50_CompList.csv')
Output:
We will get the dataset as:

In the above output, we can clearly see that there are five variables, of which four
are continuous and one is categorical.

o Extracting dependent and independent Variables:

#Extracting Independent and dependent Variable
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 4].values

Output:

Out[5]:

array([[165349.2, 136897.8, 471784.1, 'New York'],
[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],
[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],
[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)

As we can see in the above output, the last column contains categorical variables
which are not suitable to apply directly for fitting the model. So we need to encode
this variable.

Encoding Dummy Variables:

Since we have one categorical variable (State) that cannot be applied directly to the
model, we will encode it. To encode the categorical variable into numbers, we will
use the LabelEncoder class. But this is not sufficient, because the encoded numbers
still carry a relational order, which may produce a wrong model. So, to remove this
problem, we will use OneHotEncoder, which creates the dummy variables. Below
is the code for it:

#Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x = LabelEncoder()
x[:, 3] = labelencoder_x.fit_transform(x[:, 3])
# Note: categorical_features belongs to older scikit-learn releases;
# a current-version equivalent is sketched below.
onehotencoder = OneHotEncoder(categorical_features=[3])
x = onehotencoder.fit_transform(x).toarray()

Here we are only encoding one independent variable, State, as the other variables
are continuous.
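Note that the categorical_features argument shown above belongs to older scikit-learn releases; it was removed in version 0.22. On current versions, a minimal equivalent sketch uses ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 3 (State) and pass the numeric columns through;
# the dummy columns come first in the result, as with the older API.
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough')
x = ct.fit_transform(x)
# Depending on the version this can return a sparse matrix; use .toarray() if needed.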

Output:
As we can see in the above output, the State column has been converted into dummy
variables (0 and 1), with each dummy column corresponding to one state. We can
verify this by comparing with the original dataset: the first column corresponds to
California, the second column to Florida, and the third column to New York.

o Now, we are writing a single line of code just to avoid the dummy variable
trap:

#avoiding the dummy variable trap:
x = x[:, 1:]

If we do not remove the first dummy variable, then it may introduce multicollinearity
in the model.

As we can see in the above output image, the first column has been removed.

o Now we will split the dataset into training and test set. The code for this is
given below:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Output: The above code splits the dataset into a training set and a test set. You can
check the output via the variable explorer option in the Spyder IDE. The test set and
training set will look like the below image:

Test set:

Training set:
Step: 2- Fitting our MLR model to the Training set:

Our dataset is now well prepared for training, so we will fit our regression model to
the training set, just as we did in the Simple Linear Regression model. The code for
this will be:

#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

Output:

Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now, we have successfully trained our model using the training dataset. In the next
step, we will test the performance of the model using the test dataset.

Step: 3- Prediction of Test set results:

The last step for our model is checking the performance of the model. We will do it
by predicting the test set result. For prediction, we will create a y_pred vector.
Below is the code for it:

#Predicting the Test set result
y_pred = regressor.predict(x_test)

By executing the above lines of code, a new vector will be generated under the
variable explorer option. We can test our model by comparing the predicted values
and test set values.

Output:
In the above output, we have the predicted result set and the test set. We can check
the model's performance by comparing these two sets value by value. For example,
the first index has a predicted profit of $103,015 against a test/real profit of $103,282.
The difference is only $267, which is a good prediction, so our model is complete
here.

o We can also check the score for training dataset and test dataset. Below is the
code for it:

print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))

Output: The score is:

Train Score: 0.9501847627493607
Test Score: 0.9347068473282446

Note that regressor.score() returns R² (the coefficient of determination), so the above
scores tell us that the model explains about 95% of the variance on the training
dataset and about 93% on the test dataset.
PRACTICAL:5

Write a model to demonstrate the working of Logistic Regression

Logistic Regression in Machine Learning


o Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for
predicting the categorical dependent variable using a given set of independent
variables.
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value: Yes or No, 0 or
1, True or False, etc. But instead of giving the exact values 0 and 1, it gives
probabilistic values which lie between 0 and 1.
o Logistic Regression is much like Linear Regression, except in how the two are
used: Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something
such as whether the cells are cancerous or not, a mouse is obese or not based
on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has
the ability to provide probabilities and classify new data using continuous and
discrete datasets.
o Logistic Regression can be used to classify the observations using different
types of data and can easily determine the most effective variables used for
the classification. The below image is showing the logistic function:
Logistic Function (Sigmoid Function): The sigmoid function is a mathematical
function used to map predicted values to probabilities.

o It maps any real value to a value within the range of 0 and 1.
o The output of logistic regression must lie between 0 and 1 and cannot go
beyond this limit, so it forms an "S"-shaped curve. The S-form curve is called
the sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines
the boundary between 0 and 1: values above the threshold tend to 1, and values
below the threshold tend to 0. A short sketch of the sigmoid and the threshold
follows.
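A minimal sketch of the sigmoid and a 0.5 threshold (the z values below are illustrative):

import numpy as nm

def sigmoid(z):
    return 1.0 / (1.0 + nm.exp(-z))

z = nm.array([-2.0, -0.5, 0.0, 1.5])
p = sigmoid(z)                    # probabilities strictly between 0 and 1
labels = (p >= 0.5).astype(int)   # above the threshold -> 1, below -> 0
print(p.round(3), labels)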

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are given
below:

o We know the equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above
by (1 - y):

y / (1 - y); 0 for y = 0, and infinity for y = 1

o But we need a range between -[infinity] and +[infinity], so taking the logarithm
of the equation, it becomes:

log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.

Python Implementation of Logistic Regression (Binomial)

To understand the implementation of Logistic Regression in Python, we will use the
below example:

Example: There is a given dataset containing information about various users,
obtained from a social networking site. A car manufacturer has recently launched a
new SUV, and the company wants to check how many users from the dataset want
to purchase the car.

For this problem, we will build a Machine Learning model using the Logistic
regression algorithm. The dataset is shown in the below image. In this problem, we
will predict the purchased variable (Dependent Variable) by using age and
salary (Independent variables).

Steps in Logistic Regression: To implement Logistic Regression using Python, we
will use the same steps as in the previous regression topics. Below are the steps:
o Data Pre-processing step
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so
that we can use it in our code efficiently. It will be the same as we have done in the
Data Pre-processing topic. The code for this is given below:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set = pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output. Consider
the given image:
Now, we will extract the dependent and independent variables from the given
dataset. Below is the code for it:

#Extracting Independent and dependent Variable
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

In the above code, we have taken [2, 3] for x because our independent variables are
age and salary, which are at index 2, 3. And we have taken 4 for y variable because
our dependent variable is at index 4. The output will be:

Now we will split the dataset into a training set and test set. Below is the code for it:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

The output for the training set:

In logistic regression, we will do feature scaling because we want accurate prediction
results. Here we will only scale the independent variables, because the dependent
variable has only 0 and 1 values. Below is the code for it:

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

The scaled output is given below:
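To see what the scaling actually does, the following sketch (with made-up age/salary pairs) confirms that each scaled column has roughly zero mean and unit variance:

import numpy as nm
from sklearn.preprocessing import StandardScaler

sample = nm.array([[19, 19000], [35, 20000], [26, 43000], [27, 57000]], dtype=float)
scaled = StandardScaler().fit_transform(sample)
print(scaled.mean(axis=0).round(6))  # approximately [0. 0.]
print(scaled.std(axis=0).round(6))   # approximately [1. 1.]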

2. Fitting Logistic Regression to the Training set:

We have prepared our dataset well, and now we will train the model using the
training set. To fit the model to the training set, we will import
the LogisticRegression class of the sklearn library.

After importing the class, we will create a classifier object and use it to fit the logistic
regression model to the training data. Below is the code for it:

#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the below output:

Out[5]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by
using test set data. Below is the code for it:

#Predicting the test set result
y_pred = classifier.predict(x_test)

In the above code, we have created a y_pred vector to predict the test set result.

Output: By executing the above code, a new vector (y_pred) will be created under
the variable explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to
purchase or not purchase the car.

4. Test Accuracy of the result

Now we will create the confusion matrix to check the accuracy of the classification.
To create it, we need to import the confusion_matrix function of the sklearn library,
call it, and store the result in a new variable cm. The function takes two parameters,
y_true (the actual values) and y_pred (the predicted values returned by the
classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:

By executing the above code, a new confusion matrix will be created. Consider the
below image:

We can find the accuracy of the predicted result by interpreting the confusion matrix.
From the above output, we can interpret that 65 + 24 = 89 predictions are correct
and 8 + 3 = 11 are incorrect.
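The accuracy follows directly from these counts; a one-line check (our addition) is:

# 89 correct out of 100 test observations -> 89% accuracy
correct, incorrect = 65 + 24, 8 + 3
print(correct / (correct + incorrect))   # 0.89
# Equivalently, from the matrix itself: cm.trace() / cm.sum()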
5. Visualizing the training set result

Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class of the Matplotlib
library to create the colormap for visualizing the result. We have created two new
variables, x_set and y_set, from x_train and y_train. After that, we have used
the nm.meshgrid command to create a rectangular grid that runs from 1 below each
feature's minimum to 1 above its maximum, with pixel points at a resolution of 0.01.

To create a filled contour, we have used the mtp.contourf command; it creates
regions of the provided colors (purple and green). We have passed classifier.predict
so that each grid point is colored by the class the classifier predicts for it.

Output: By executing the above code, we will get the below output:
The graph can be explained in the below points:

o In the above graph, we can see that there are some Green points within the
green region and Purple points within the purple region.
o All these data points are the observation points from the training set, which
shows the result for purchased variables.
o This graph is made by using two independent variables i.e., Age on the x-
axis and Estimated salary on the y-axis.
o The purple point observations are for which purchased (dependent variable)
is probably 0, i.e., users who did not purchase the SUV car.
o The green point observations are for which purchased (dependent variable)
is probably 1 means user who purchased the SUV car.
o We can also estimate from the graph that the users who are younger with low
salary, did not purchase the car, whereas older users with high estimated salary
purchased the car.
o But there are some purple points in the green region (buying the car) and some
green points in the purple region (not buying the car). These are observations
that go against the general trend, such as younger users with a high estimated
salary who purchased the car, or older users with a low estimated salary who
did not.
The goal of the classifier:

We have successfully visualized the training set result for the logistic regression, and
our goal for this classification is to separate the users who purchased the SUV car
from those who did not. From the output graph, we can clearly see the two regions
(purple and green) with the observation points: the purple region is for the users who
didn't buy the car, and the green region is for the users who did.

Linear Classifier:

As we can see from the graph, the classifier is a straight line, i.e., linear in nature, as
we have used a linear model for Logistic Regression. In further topics, we will learn
about non-linear classifiers.

Visualizing the test set result:

Our model is well trained using the training dataset. Now, we will visualize the result
for new observations (the test set). The code for the test set will remain the same as
above, except that here we will use x_test and y_test instead of x_train and y_train.
Below is the code for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the test set result. As we can see, the graph is divided into
two regions (purple and green), with the green observations in the green region and
the purple observations in the purple region. So we can say it is a good prediction
and a good model. Some of the green and purple data points fall in the wrong regions,
which we can accept since we have already counted this error using the confusion
matrix (11 incorrect outputs).

Hence our model is pretty good and ready to make new predictions for this
classification problem.
PRACTICAL:6

Write a model to demonstrate the working of Decision tree

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes: the Decision Node and the
Leaf Node. Decision nodes are used to make decisions and have multiple
branches, whereas leaf nodes are the outputs of those decisions and do not
contain any further branches.
o The decisions or the test are performed on the basis of features of the given
dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it
further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
How the Decision Tree Algorithm Works

In a decision tree, to predict the class of a given record, the algorithm starts from the
root node of the tree. It compares the value of the root attribute with the
corresponding attribute of the record and, based on the comparison, follows a branch
and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other
sub-nodes and moves further. It continues the process until it reaches a leaf node of
the tree. The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best
attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where you
cannot further classify the nodes; the final node is then called a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the decision
tree starts with the root node (Salary attribute by ASM). The root node splits further
into the next decision node (distance from the office) and one leaf node based on the
corresponding labels. The next decision node further gets split into one decision
node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offers and Declined offer). Consider the below diagram:
Attribute Selection Measures

While implementing a Decision tree, the main issue is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique
called the Attribute Selection Measure, or ASM. With this measurement, we can
easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information
gain, and a node/attribute having the highest information gain is split first. It
can be calculated using the below formula:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
the randomness in the data. Entropy can be calculated as:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,

o S = Total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
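A small worked sketch of these formulas (the class counts are illustrative, not from user_data.csv): a parent node with 9 yes / 5 no is split into subsets of 8 and 6 samples.

import math

def entropy(p_yes, p_no):
    h = 0.0
    for p in (p_yes, p_no):
        if p > 0:                 # define 0 * log2(0) = 0
            h -= p * math.log2(p)
    return h

parent = entropy(9/14, 5/14)                                        # ~0.940
weighted = (8/14) * entropy(6/8, 2/8) + (6/14) * entropy(3/6, 3/6)  # weighted child entropy
print(round(parent - weighted, 3))                                  # information gain ~0.048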

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision
tree in the CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high
Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to
create binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 - ∑j Pj²
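For example, a node holding 70% of one class and 30% of the other (illustrative numbers) has:

p = [0.7, 0.3]                       # class probabilities at the node
gini = 1 - sum(pj ** 2 for pj in p)  # 1 - (0.49 + 0.09) = 0.42
print(gini)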


Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.

A too-large tree increases the risk of overfitting, while a small tree may not capture
all the important features of the dataset. A technique that decreases the size of the
learning tree without reducing accuracy is therefore known as pruning. There are
mainly two types of tree pruning techniques:

o Cost Complexity Pruning
o Reduced Error Pruning
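In scikit-learn (versions 0.22 and later), cost complexity pruning is exposed through the ccp_alpha parameter of the tree classes; the alpha value below is illustrative and would normally be tuned:

from sklearn.tree import DecisionTreeClassifier

# ccp_alpha > 0 prunes subtrees whose complexity outweighs their accuracy gain;
# candidate alphas can be obtained via clf.cost_complexity_pruning_path(...)
pruned_clf = DecisionTreeClassifier(criterion='entropy', ccp_alpha=0.01, random_state=0)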

Python Implementation of Decision Tree

Now we will implement the Decision tree using Python. For this, we will use the
dataset "user_data.csv," which we have used in previous classification models. By
using the same dataset, we can compare the Decision tree classifier with other
classification models such as KNN, SVM, and Logistic Regression.

Steps will also remain the same, which are given below:

o Data Pre-processing step
o Fitting a Decision-Tree algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.

1. Data Pre-Processing Step:

Below is the code for the pre-processing step:


# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set = pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

In the above code, we have pre-processed the data. Where we have loaded the
dataset, which is given as:
2. Fitting a Decision-Tree algorithm to the Training set

Now we will fit the model to the training set. For this, we will import
the DecisionTreeClassifier class from sklearn.tree library. Below is the code for
it:

#Fitting Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have created a classifier object, in which we have passed two
main parameters:

o criterion='entropy': the criterion used to measure the quality of a split,
calculated here by the information gain given by entropy.
o random_state=0: for generating reproducible random states.

Below is the output for this:

Out[8]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=0, splitter='best')
3. Predicting the test result

Now we will predict the test set result. We will create a new prediction
vector y_pred. Below is the code for it:

#Predicting the test set result
y_pred = classifier.predict(x_test)

Output:

In the below output image, the predicted output and real test output are given. We
can clearly see that there are some values in the prediction vector, which are different
from the real vector values. These are prediction errors.

4. Test accuracy of the result (Creation of Confusion matrix)

In the above output, we have seen that there were some incorrect predictions, so if
we want to know the number of correct and incorrect predictions, we need to use the
confusion matrix. Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:

In the above output image, we can see the confusion matrix, which has 6 + 3 = 9
incorrect predictions and 62 + 29 = 91 correct predictions. Therefore, we can say
that, compared to other classification models, the Decision Tree classifier made
good predictions.

5. Visualizing the training set result:

Here we will visualize the training set result. To visualize the training set result we
will plot a graph for the decision tree classifier. The classifier will predict yes or No
for the users who have either Purchased or Not purchased the SUV car as we did
in Logistic Regression. Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above output is completely different from the other classification models: it has
both vertical and horizontal boundaries splitting the dataset according to the age and
estimated salary variables.

As we can see, the tree is trying to capture every data point, which is a case of
overfitting; one possible remedy is sketched below.
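One possible remedy (our suggestion, not a step in the original tutorial) is to constrain the tree so it cannot memorize individual points; the parameter values below are illustrative:

from sklearn.tree import DecisionTreeClassifier

# A shallower tree with a minimum leaf size gives smoother decision regions
shallow = DecisionTreeClassifier(criterion='entropy', max_depth=4,
                                 min_samples_leaf=5, random_state=0)
shallow.fit(x_train, y_train)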

6. Visualizing the test set result:

Visualization of the test set result will be similar to that of the training set, except
that the training set will be replaced with the test set.
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

As we can see in the above image, there are some green data points within the purple
region and vice versa. These are the incorrect predictions which we discussed in the
confusion matrix.
PRACTICAL:7

Write a model to demonstrate the working of k-nearest neighbours (K-NN)
Suppose there are two categories, Category A and Category B, and we have a new
data point x1: in which of these categories will this data point lie? To solve this type
of problem, we need a K-NN algorithm. With the help of K-NN, we can easily
identify the category or class of a particular data point. Consider the below diagram:

The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each
category.
o Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. Between points A(x1, y1) and B(x2, y2) it is calculated as:

Euclidean Distance = sqrt((x2 - x1)² + (y2 - y1)²)

o By calculating the Euclidean distance, we get the nearest neighbors: three
nearest neighbors in category A and two nearest neighbors in category B.
Consider the below image:
o As we can see, the 3 nearest neighbors are from category A, hence this new
data point must belong to category A. A minimal end-to-end sketch of this
procedure is given below.
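The whole procedure can be condensed into a minimal sketch (all points and labels below are made up; the query deliberately lands where 3 of its 5 nearest neighbors are in category A):

import numpy as nm
from collections import Counter

def knn_predict(X, y, query, k=5):
    # Step 2-3: Euclidean distance from the query to every point, keep the k nearest
    dists = nm.sqrt(((X - query) ** 2).sum(axis=1))
    nearest = nm.argsort(dists)[:k]
    # Step 4-5: count the neighbors per category and take the majority
    votes = Counter(y[nearest])
    return votes.most_common(1)[0][0]

X = nm.array([[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 8]])
y = nm.array(['A', 'A', 'A', 'B', 'B', 'B'])
print(knn_predict(X, y, nm.array([2, 2])))   # -> 'A'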
