
LAB Manual

Machine Learning using


Python Lab

Sri. B N V Narasimha Raju

Sri. S Suryanarayana Raju

Sri. V V Sivarama Raju


Course Objectives
1. Implement different mechanisms for data pre-processing, model evaluation, and implementation.
2. Implement different dimensionality reduction techniques.
3. Implement different clustering & classification techniques.
4. Evaluate the model.
5. Implement simple linear regression, logistic regression, and a Feed-Forward Network.

Course Outcomes
1. Design pre-processing models for their own data sets.
2. Apply dimensionality reduction techniques to their own datasets.
3. Develop different clustering & classification techniques.
4. Evaluate the model with Lasso and Ridge Regularization.
5. Design neural networks for structured and unstructured data classification and regression.

List of Experiments
1.​ Vector addition.
2.​ Data pre-processing: Handling missing values, handling categorical data, bringing
features to the same scale, and selecting meaningful features.
3.​ Regression model.
4.​ Write a program to implement the KNN classifier and logistic regression for binary
classification and multiclass classification.
5.​ Ensemble Learning, grid search and learning, and validation curves.
6.​ Write a program for Data Clustering (K-Means) and evaluate the clustering model.
7.​ Compressing data via dimensionality reduction: PCA, LDA.
8.​ Model Evaluation and Optimization: K-fold cross-validation.
9.​ Write a program to reduce the variance of a linear regression model using Lasso
and Ridge Regularization.
10.​Perceptron for digits.
11.​ Feed-Forward Network for wheat seeds dataset.
12.​Write a program to implement a neural network for regression.
13.​Write a program to save and load a trained machine learning model.

Additional Experiments
1.​ Write a program to implement data pre-processing techniques like data sampling,
data discretization, and data augmentation.
2.​ Write a program to implement a Naïve Bayes algorithm.
3.​ Write a program to implement classification using SVM.
4.​ Write a program to implement a regression tree.
5.​ Write a program to implement Boosting techniques.
6.​ Write a program to implement Hierarchical clustering.
7.​ Write a program to implement a multilayer perceptron.

Session #1
Vector Addition
Learning Objective
To implement the vector addition.

Learning Context
Here we shall learn how to perform vector addition and subtraction in Python. A vector, in programming terms, refers to a one-dimensional array. An array is a data structure that stores similar elements, i.e., elements having the same data type. The general features of an array include:

● An array can hold many values under a single name.
● Accessing the elements is based on the index number. If the array size is "n", the last index value is [n-1], and the starting index is always [0].
● We can also slice the elements in the array as [start:end] based on the start and end positions.

Addition and Subtraction of Vectors in Python

Now let's learn how to perform basic mathematical operations, such as addition and subtraction, on arrays in Python. To perform this task, we need to know about the NumPy module in Python. NumPy (Numerical Python) has several inbuilt methods that will make our task easier. The easiest and simplest way to create an array in Python is by placing comma-separated literals inside square brackets. For example

A = [1, 2, 3]

B = [4, 5, 6]

We can even create multidimensional arrays, for example, a two-dimensional array as shown below:
A = ([1, 2, 3], [4, 5, 6])

B = ([2, -4, 7], [5, -20, 3])

To use this module, we need to import it. The variables A and B are used to store the array elements. To perform the addition, we need to call the add() method of the NumPy module as NP.add(). Here we have aliased numpy as NP, which is not necessary; without the alias we would write numpy.add(). To perform subtraction on the same array elements, we just need to write another line of code invoking the subtract method, i.e., NP.subtract(), and print the result obtained after the subtraction.
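As a small extension of the example above (not part of the original exercise), the same element-wise operations work on two-dimensional arrays when they are created with numpy.array rather than plain tuples of lists; the values below are the ones shown earlier in this section:

import numpy as np

# The 2-D values from the examples above, created as proper NumPy arrays
A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[2, -4, 7], [5, -20, 3]])

print(np.add(A, B))       # element-wise addition: [[ 3 -2 10] [ 9 -15  9]]
print(np.subtract(A, B))  # element-wise subtraction: [[-1  6 -4] [-1 25  3]]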

Exercise
Write a program for vector addition.

Solution
import numpy as NP
A = [4, 8, 7]
B = [5, -4, 8]
print("The input arrays are :\n","A:",A ,"\n","B:",B)
Res= NP.add(A,B)
print("After addition the resulting array is :",Res)

The input arrays are :


A: [4, 8, 7]
B: [5, -4, 8]
After addition the resulting array is : [ 9 4 15]

import numpy as NP
A = [4, 8, 7]
B = [5, -4, 8]
print("The input arrays are :\n","A:",A ,"\n","B:",B)
Res1= NP.add(A,B)
Res2= NP.subtract(A,B)

print("Result of Addition is :",Res1,"\nResult of Subtraction is:",Res2)

The input arrays are :


A: [4, 8, 7]
B: [5, -4, 8]
Result of Addition is : [ 9 4 15]
Result of Subtraction is: [-1 12 -1]

# Create a vector
from numpy import array
v = array([1, 2, 3])
print(v)

[1 2 3]

from numpy import array


a = array([1, 2, 3])
print(a)
b = array([1, 2, 3])
print(b)
c = a + b
print(c)

[1 2 3]
[1 2 3]
[2 4 6]

from numpy import array


a = array([1, 2, 3])
print(a)
b = array([0.5, 0.5, 0.5])
print(b)
c = a - b
print(c)

[1 2 3]

[0.5 0.5 0.5]
[0.5 1.5 2.5]

# multiply vectors
from numpy import array
a = array([1, 2, 3])
print(a)
b = array([1, 2, 3])
print(b)
c = a * b
print(c)

[1 2 3]
[1 2 3]
[1 4 9]

from numpy import array


a = array([1, 2, 3])
print(a)
b = array([1, 2, 3])
print(b)
c = a / b
print(c)

[1 2 3]
[1 2 3]
[1. 1. 1.]

# dot product vectors


from numpy import array
a = array([1, 2, 3])
print(a)
b = array([1, 2, 3])
print(b)
c = a.dot(b)

print(c)

[1 2 3]
[1 2 3]
14

# VECTOR AND SCALAR


from numpy import array
a = array([1, 2, 3])
print(a)
s = 0.5
print(s)
c = s * a
print(c)

[1 2 3]
0.5
[0.5 1. 1.5]

Session #2
Data pre-processing
Learning Objective
To perform data pre-processing like handling missing values, handling categorical data,
bringing features to the same scale, and selecting meaningful features.

Learning Context
Checking Null Values

The DataFrame.isnull() function detects missing values in the given object. It returns a Boolean object of the same size indicating which values are NA: isnull() returns True where a value is null and False otherwise.

Handling Missing Values

Missing data is defined as values that are not stored (or not present) for some variables in the given dataset. The first step in handling missing values is to look at the data carefully and find the missing values using isnull(). Common measures to handle them are dropna() and fillna().

Handling Categorical Data

When your data has categories represented by strings, it is difficult to use them to train machine learning models, which often accept only numeric data. Instead of ignoring the categorical data and excluding the information from our model, you can transform the data so it can be used in your models. Categorical data is a type of data that is used to group information with similar characteristics, while numerical data is a type of data that expresses information in the form of numbers. Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values. The performance of many algorithms even varies based on how the categorical variables are encoded.

Categorical values are divided into a range of features in a dataset. Real-world datasets
often contain features in two categories.

● Nominal - no particular order.
● Ordinal - there is some order between the values.

Bringing Features to the Same Scale

Feature scaling is the process of normalizing features that vary in magnitude, range, and units. For machine learning models to interpret these features on the same scale, we need to perform feature scaling. Feature scaling is a technique to standardize the independent features present in the data within a fixed range, and it is performed during data pre-processing. More specifically, we will be looking at three different scalers in the scikit-learn library for feature scaling (a short sketch using the Standard Scaler follows the list):

● Standard Scaler
● Min-Max Scaler
● Robust Scaler
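A minimal sketch of the Standard Scaler (the exercise below uses the Min-Max Scaler instead); the toy feature matrix here is invented purely for illustration:

from sklearn.preprocessing import StandardScaler
import numpy as np

# Toy data: two features with very different ranges (illustrative values only)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaler = StandardScaler()           # standardize to zero mean, unit variance
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and std 1
print(X_scaled)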

Selecting Meaningful Features

The iloc() function in Python is defined in the Pandas module and helps us select a
specific row or column from the data set. Using the iloc method in Python, we can easily
retrieve any particular value from a row or column by using index values.

Syntax: pandas.dataset.iloc[row, column]

parameters:

The iloc function in Python takes two parameters. However, both parameters of the iloc()
method are optional. Let us discuss both of these parameters:

●​ The row parameter is an optional parameter that specifies the index position of
the row in the form of an integer or list of integers.
●​ The column parameter is also an optional parameter that specifies the index
position of the column in the form of an integer or list of integers.

If we specify only a row value, then the iloc function returns a Pandas series. If we specify
the row value and column value, then the iloc function returns all the content of the
specified cell. If we specify a list of values, the Python iloc function returns a Pandas
DataFrame.
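A brief illustration of these three cases; the small DataFrame below is invented for demonstration and is not the exercise dataset:

import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]})

print(df.iloc[0])        # single row index  -> returns a pandas Series
print(df.iloc[0, 1])     # row and column    -> returns the single cell value (1)
print(df.iloc[[0, 2]])   # list of indices   -> returns a pandas DataFrame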

Exercise
Data pre-processing: handling missing values, handling categorical data, bringing
features on the same scale, and selecting meaningful features.

Dataset

Solution
# Importing the necessary libraries and Datasets
import pandas as pd
import numpy as nm
data_set= pd.read_csv('dataset.csv')

print(data_set)

Country Age Salary Purchased


0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes

x= data_set.iloc[:,:-1].values
print(x)

[['France' 44.0 72000.0]


['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 nan]
['France' 35.0 58000.0]
['Spain' nan 52000.0]
['France' 48.0 79000.0]

['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]

data_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 10 non-null object
1 Age 9 non-null float64
2 Salary 9 non-null float64
3 Purchased 10 non-null object
dtypes: float64(2), object(2)
memory usage: 448.0+ bytes

# Checking for the null values in the data


data_set.isnull().sum()

Country 0
Age 1
Salary 1
Purchased 0
dtype: int64

# Dropping the null records from the original data
x=data_set.dropna()
print(x)

Country Age Salary Purchased


0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
5 France 35.0 58000.0 Yes
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes

# Filling the missing values with mean of the data
data_set['Age'].fillna(value=data_set['Age'].mean(), inplace=True)
print(data_set)

Country Age Salary Purchased


0 France 44.000000 72000.0 No
1 Spain 27.000000 48000.0 Yes
2 Germany 30.000000 54000.0 No
3 Spain 38.000000 61000.0 No
4 Germany 40.000000 NaN Yes
5 France 35.000000 58000.0 Yes
6 Spain 38.777778 52000.0 No
7 France 48.000000 79000.0 Yes
8 Germany 50.000000 83000.0 No
9 France 37.000000 67000.0 Yes

data_set['Salary'].fillna(value=data_set['Salary'].mean(), inplace=True)
print(data_set)

Country Age Salary Purchased


0 France 44.000000 72000.0 No
1 Spain 27.000000 48000.0 Yes
2 Germany 30.000000 54000.0 No
3 Spain 38.000000 61000.0 No
4 Germany 40.000000 63777.777778 Yes
5 France 35.000000 58000.0 Yes
6 Spain 38.777778 52000.0 No
7 France 48.000000 79000.0 Yes
8 Germany 50.000000 83000.0 No
9 France 37.000000 67000.0 Yes

# Handling Categorical Data for Country and Purchased variable

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data_set['Country']=le.fit_transform(data_set['Country'])
data_set['Purchased']=le.fit_transform(data_set['Purchased'])
print(data_set)

Country Age Salary Purchased

0 0 44.000000 72000.000000 0
1 2 27.000000 48000.000000 1
2 1 30.000000 54000.000000 0
3 2 38.000000 61000.000000 0
4 1 40.000000 63777.777778 1
5 0 35.000000 58000.000000 1
6 2 38.777778 52000.000000 0
7 0 48.000000 79000.000000 1
8 1 50.000000 83000.000000 0
9 0 37.000000 67000.000000 1

from sklearn.preprocessing import OneHotEncoder


onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(data_set[['Country']])
onehot_df = pd.DataFrame(onehot_encoded,
columns=onehot_encoder.get_feature_names_out(['Country']))
final_df = pd.concat([data_set, onehot_df], axis=1).drop('Country',
axis=1)
print("\nDataFrame after One-Hot Encoding:")
print(final_df)

DataFrame after One-Hot Encoding:


Age Salary Purchased Country_France Country_Germany
0 44.000000 72000.0 No 1.0 0.0
1 27.000000 48000.0 Yes 0.0 0.0
2 30.000000 54000.0 No 0.0 1.0
3 38.000000 61000.0 No 0.0 0.0
5 35.000000 58000.0 Yes 0.0 0.0
6 38.777778 52000.0 No 1.0 0.0
7 48.000000 79000.0 Yes 0.0 1.0
8 50.000000 83000.0 No 1.0 0.0
9 37.000000 67000.0 Yes NaN NaN
4 NaN NaN NaN 1.0 0.0

Country_Spain
0 0.0
1 1.0
2 0.0
3 1.0
5 1.0
6 0.0
7 0.0
8 0.0

9 NaN
4 0.0

from sklearn.preprocessing import MinMaxScaler


""" MIN MAX SCALER """
min_max = MinMaxScaler(feature_range =(0, 1))

# Scaled feature
x1= min_max.fit_transform(data_set)
name=data_set.columns
data=pd.DataFrame(x1, columns=name)
print(data)

Country Age Salary Purchased


0 0.0 0.739130 0.685714 0.0
1 1.0 0.000000 0.000000 1.0
2 0.5 0.130435 0.171429 0.0
3 1.0 0.478261 0.371429 0.0
4 0.5 0.565217 0.450794 1.0
5 0.0 0.347826 0.285714 1.0
6 1.0 0.512077 0.114286 0.0
7 0.0 0.913043 0.885714 1.0
8 0.5 1.000000 1.000000 0.0
9 0.0 0.434783 0.542857 1.0

from sklearn.feature_selection import SelectKBest


from sklearn.feature_selection import chi2
bestfit=SelectKBest(score_func=chi2,k=3)
fit=bestfit.fit(data.iloc[:,0:-1],data.iloc[:,-1])
pd.DataFrame({"columns":["Country","Age","Salary"], "Scores"
:fit.scores_})

columns​ Scores
0​ Country​ 0.500000
1​ Age​ 0.070076
2​ Salary​ 0.007011

Session #3
Regression model
Learning Objective
To Implement the Regression model.

Learning Context
This is about the basics of linear regression and its implementation in the Python
programming language. Linear regression is a statistical method for modeling
relationships between a dependent variable with a given set of independent variables.

In this, we refer to dependent variables as responses and independent variables as


features for simplicity. In order to provide a basic understanding of linear regression, we
start with the most basic version of linear regression, i.e. Simple linear regression.

Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear
function that predicts the response value (y) as accurately as possible as a function of the
feature or independent variable (x). Let us consider a dataset where we have a value of
response y for every feature x:

For generality, we define x as the feature vector, i.e., x = [x1, x2, ..., xn], and y as the response vector, i.e., y = [y1, y2, ..., yn], for n observations (in the above example, n = 10). A scatter plot of such a dataset is shown in Figure 5.1.

Figure 5.1 Scatter Plot

Now, the task is to find a line that fits best in the above scatter plot so that we can predict
the response for any new feature values. (i.e., a value of x not present in a dataset). This
line is called a regression line. The equation of the regression line is represented as:
h(x_i) = β0 + β1·x_i

Here h(x_i) represents the predicted response value for the ith observation, and β0 and β1 are the regression coefficients representing the y-intercept and the slope of the regression line, respectively. To create our model, we must learn or estimate the values of the regression coefficients β0 and β1; once we have estimated these coefficients, we can use the model to predict responses.

In this, we are going to use the principle of least squares. Now consider:
y_i = β0 + β1·x_i + ε_i = h(x_i) + ε_i, which gives ε_i = y_i − h(x_i)

Here ε_i is the residual error in the ith observation. So, our aim is to minimize the total residual error. We define the squared error or cost function J as:

J(β0, β1) = (1/(2n)) Σ_{i=1..n} ε_i²

Our task is to find the values of β0 and β1 for which J(β0, β1) is minimum. Without going into the mathematical details, we present the result here:

β1 = SS_xy / SS_xx

β0 = ȳ − β1·x̄

where SS_xy is the sum of cross-deviations of y and x:

SS_xy = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) = Σ_{i=1..n} x_i·y_i − n·x̄·ȳ

and SS_xx is the sum of squared deviations of x:

SS_xx = Σ_{i=1..n} (x_i − x̄)² = Σ_{i=1..n} x_i² − n·(x̄)²
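A minimal NumPy sketch (not part of the original exercise) of these closed-form estimates, using a small made-up dataset, can make the formulas concrete:

import numpy as np

# Made-up sample data for illustration only
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9], dtype=float)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

SS_xy = np.sum(x * y) - n * x_bar * y_bar   # sum of cross-deviations
SS_xx = np.sum(x * x) - n * x_bar ** 2      # sum of squared deviations

b1 = SS_xy / SS_xx          # slope (β1)
b0 = y_bar - b1 * x_bar     # intercept (β0)
print("slope:", b1, "intercept:", b0)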

Exercise
Linear Regression model

Dataset​

Solution
import pandas as pd
dataset=pd.read_csv('Realestate.csv')
dataset
x = dataset.iloc[:,[2,3,4]]
y = dataset.iloc[:,-1]
print(x)
print(y)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=
0.1, random_state=0)
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(x_train, y_train)
print("regression score is", regr.score(x_test, y_test))
y_pred = regr.predict(x_test)
print(y_pred)

X2 house age X3 distance to the nearest MRT station \


0 32.0 84.87882

1 19.5 306.59470
2 13.3 561.98450
3 13.3 561.98450
4 5.0 390.56840
.. ... ...
409 13.7 4082.01500
410 5.6 90.45606
411 18.8 390.96960
412 8.1 104.81010
413 6.5 90.45606

X4 number of convenience stores


0 10
1 9
2 5
3 5
4 5
.. ...
409 0
410 9
411 7
412 5
413 9
[414 rows x 3 columns]
0 37.9
1 42.2
2 47.3
3 54.8
4 43.1
...
409 15.4
410 50.0
411 40.6
412 52.5
413 63.9
Name: Y house price of unit area, Length: 414, dtype: float64
regression score is 0.5426765635381208
[40.17828284 12.37676547 40.40657076 12.00511074 39.33690269 41.83214774
42.30423375 32.814078 48.27632758 43.11734031 47.70290475 44.41485275
39.68413595 42.83940406 48.48806133 34.24805299 41.42344916 41.56285853
43.66607406 43.87882365 53.35285918 33.66027278 36.51025303 43.93766323
46.70312464 46.56116249 46.78952159 28.33096273 46.17036843 23.58223041
46.91771177 33.42828847 42.51214771 33.07857161 46.91888719 33.60852654
47.49031425 37.22494294 52.95653384 5.93663422 53.51265583
32.57340028]

Session #4
Logistic Regression
Learning Objective
To implement the KNN Classifier, logistic regression for binary and multiclass
classification

Learning Context
Logistic Regression: Logistic regression is one of the most popular machine learning algorithms; it comes under the supervised learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables. Logistic regression predicts the output of a categorical dependent variable, so the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values that lie between 0 and 1. Logistic regression is much like linear regression except in how they are used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.

In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1). The equation of the straight line can be represented as:

y = b0 + b1x1 + b2x2 + ... + bnxn

In logistic regression, y can be between 0 and 1 only, so we divide the above equation by (1 − y):

y / (1 − y)

But we need a range between −infinity and +infinity, so taking the logarithm of the equation, it becomes:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
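As a short illustrative sketch (not part of the original solutions), the logistic (sigmoid) function is what maps such a linear score to a probability between 0 and 1; the scores below are made up:

import numpy as np

def sigmoid(z):
    # logistic function: maps any real-valued score to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores b0 + b1*x1 + ... for a few samples
scores = np.array([-3.0, 0.0, 2.5])
print(sigmoid(scores))   # approximately [0.047, 0.5, 0.924]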

Type of Logistic Regression

On the basis of the categories, Logistic Regression can be classified into three types:

●​ Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
●​ Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
●​ Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High"

K-Nearest Neighbours Classifier

K-Nearest Neighbours is one of the simplest machine learning algorithms, based on the supervised learning technique. The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories. The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm. K-NN can be used for regression as well as classification, but it is mostly used for classification problems. K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.

Exercise
Write a program to implement KNN Classifier, logistic regression for binary and multiclass
classification

Dataset

Solutions
# Binary Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
iris =datasets.load_iris()
features = iris.data[:100,:]
target =iris.target[:100]
scaler =StandardScaler()
features_standardized =scaler.fit_transform(features)
logistic_regression = LogisticRegression(random_state=0)
model = logistic_regression.fit(features_standardized,target)
y_pred=model.predict(features_standardized)
print(metrics.accuracy_score(y_pred,target))

1.0

# Multinomial Logistic Regression
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
iris=datasets.load_iris()
iris_data=iris.data
iris_data=pd.DataFrame(iris_data,columns=iris.feature_names)
iris_data['species']=iris.target
iris_data['species'].unique()
features =iris.feature_names
target ='species'
X=iris_data[features]
y=iris_data[target]
lr_iris=LogisticRegression()  # the default for a multi-class problem is multinomial
lr_iris =lr_iris.fit(X,y)
y_pred=lr_iris.predict(X)
print(metrics.accuracy_score(y_pred,y))

0.9733333333333334

Dataset
# Write a program to implement the k-Nearest Neighbor algorithm to classify the iris dataset. Print both correct and wrong predictions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Load the Iris dataset
data = pd.read_csv('IRIS.csv')

# Prepare the data


X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25, random_state=0)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train the k-NN classifier


classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski',
p=2)
classifier.fit(X_train, y_train)

# Predict the test set results


y_pred = classifier.predict(X_test)

# Print correct and wrong predictions


correct_predictions = []
wrong_predictions = []

for i in range(len(y_test)):
    if y_test[i] == y_pred[i]:
        correct_predictions.append((X_test[i], y_test[i], y_pred[i]))
    else:
        wrong_predictions.append((X_test[i], y_test[i], y_pred[i]))

print("\nCorrect Predictions:")
for item in correct_predictions:

24
print(f"Features: {item[0]}, True Label: {item[1]}, Predicted Label:
{item[2]}")

print("\nWrong Predictions:")
for item in wrong_predictions:
print(f"Features: {item[0]}, True Label: {item[1]}, Predicted Label:
{item[2]}")

# Print the confusion matrix and accuracy


cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)
print("\nAccuracy:", accuracy)

Correct Predictions:
Features: [-0.09984503 -0.57982483 0.72717965 1.51147115], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [ 0.13072494 -1.96153508 0.11355956 -0.28533458], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-0.44569998 2.64416573 -1.33681519 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [ 1.62942973 -0.34953979 1.39658338 0.74141155], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-1.0221249 0.80188541 -1.28103155 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [0.47657989 0.57160037 1.22923245 1.63981441], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-1.0221249 1.03217045 -1.39259884 -1.18373745], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [0.93771983 0.11103029 0.50404507 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 1.05300481 -0.57982483 0.55982872 0.22803848], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 0.24600992 -0.57982483 0.11355956 0.09969522], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 0.24600992 -1.04039491 1.00609787 0.22803848], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [0.59186487 0.34131533 0.39247778 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 0.24600992 -0.57982483 0.50404507 -0.02864805], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 0.70714986 -0.57982483 0.44826143 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 0.24600992 -0.34953979 0.50404507 0.22803848], True Label:

Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-1.13740989 0.11103029 -1.28103155 -1.44042398], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [ 0.13072494 -0.34953979 0.39247778 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-0.44569998 -1.04039491 0.33669414 -0.02864805], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-1.25269487 -0.11925475 -1.33681519 -1.18373745], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [-0.56098497 1.95331061 -1.39259884 -1.05539418], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [-0.330415 -0.57982483 0.61561236 0.99809808], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-0.330415 -0.11925475 0.39247778 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-1.25269487 0.80188541 -1.05789697 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [-1.71383481 -0.34953979 -1.33681519 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [ 0.36129491 -0.57982483 0.55982872 0.74141155], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-1.48326484 1.26245549 -1.55994977 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [-0.90683992 1.72302557 -1.05789697 -1.05539418], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [ 0.36129491 -0.34953979 0.28091049 0.09969522], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-1.0221249 -1.73125004 -0.27692595 -0.28533458], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-1.0221249 0.80188541 -1.2252479 -1.05539418], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [0.59186487 0.11103029 0.95031423 0.74141155], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-0.56098497 -0.11925475 0.39247778 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-0.79155494 1.03217045 -1.28103155 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [ 0.24600992 -0.11925475 0.61561236 0.74141155], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [ 0.59186487 -0.57982483 1.00609787 1.25478461], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-0.79155494 -0.81010987 0.05777592 0.22803848], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-0.21513002 1.72302557 -1.16946426 -1.18373745], True Label:
Iris-setosa, Predicted Label: Iris-setosa

Wrong Predictions:
Features: [ 0.13072494 -0.81010987 0.72717965 0.48472502], True Label:
Iris-versicolor, Predicted Label: Iris-virginica

Confusion Matrix:
[[13 0 0]
[ 0 15 1]
[ 0 0 9]]

Accuracy: 0.9736842105263158

Session #5
Data Clustering and Evaluation
Learning Objective
To implement K-Means Clustering and evaluate the clustering model.

Learning Context
K-Means

K-means is a data clustering approach for unsupervised machine learning that can
separate unlabeled data into a predetermined number of disjoint groups of equal
variance—clusters—based on their similarities. It's a popular algorithm thanks to its ease
of use and speed on large datasets. K-Means Clustering is an unsupervised learning
algorithm that is used to solve clustering problems in machine learning or data science.
In this topic, we will learn what K-means clustering algorithm is, how the algorithm works,
and the Python implementation of K-means clustering.

K-Means Clustering is an unsupervised learning algorithm that groups the unlabeled dataset into different clusters. Here K defines the number of predefined clusters that need to be created in the process: if K = 2, there will be two clusters; for K = 3, there will be three clusters; and so on. It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group of points with similar properties. It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids. The k-means clustering algorithm mainly performs two tasks (a short from-scratch sketch follows the list):

1.​ Determines the best value for K center points or centroids by an iterative process.
2.​ Assigns each data point to its closest k-center. Those data points that are near the
particular k-center create a cluster.
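A minimal from-scratch NumPy sketch of these two steps (assignment to the nearest centroid, then centroid update), written purely for illustration and not part of the original solution:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick k initial centroids at random from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # task 2: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # task 1: recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Tiny made-up 2-D dataset with two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)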

The below diagram explains the working of the K-means Clustering Algorithm

Predicting optimal clusters is of utmost importance in Cluster Analysis. For given data, we
need to evaluate which Clustering model will best fit the data or which parameters of the
model will give optimal clusters. We often need to compare two clusters or analyze which
model would be optimal to deal with outliers. Different performance and evaluation
metrics are used to evaluate clustering methods. This silhouette index is one of these
evaluation metrics.

Silhouette Index

The Silhouette score is a measure of how similar a data point is to its own cluster compared to other clusters. A higher Silhouette score indicates that the data point is well matched to its own cluster and poorly matched to other clusters. The best score value is 1, and -1 is the worst.

● 1: Means clusters are well apart from each other and clearly distinguished.
● 0: Means clusters are indifferent, or we can say that the distance between clusters is not significant.
● -1: Means clusters are assigned in the wrong way.

Exercise
Write a program to evaluate the clustering model.

Dataset

Solutions
# KMeans Clustering
import numpy as nm
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
data=pd.read_csv("Iris.csv")
print(data.head(),"\n")
x=data.iloc[:,3:5].values
wcss=[]
for i in range(1, 10):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=100,
                    n_init=10, random_state=0).fit(x)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 10), wcss, 'bx-')
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


kmeans=KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
y_predict=kmeans.fit_predict(x)
print(" Y predict is \n", y_predict)
print("\n Cluster centers are \n", kmeans.cluster_centers_,"\n")
plt.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c =
'red', label = 'Iris-setosa')
plt.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c =
'blue', label = 'Iris-versicolour')
plt.scatter(x[y_predict == 2, 0], x[y_predict== 2, 1], s = 100, c =
'green', label = 'Iris-virginica')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1],
s = 100, c = 'black', label = 'Centroids')
plt.legend()

Y predict is
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2
2 2 2 0 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 2 0 0 0
0
0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
0
0 0]

Cluster centers are
[[5.59583333 2.0375 ]
[1.464 0.244 ]
[4.26923077 1.34230769]]

<matplotlib.legend.Legend at 0x7f33aa0061d0>

import numpy as np
from sklearn.metrics import silhouette_score
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
features,_=make_blobs(n_samples=1000,n_features=10,centers=2,cluster_std
=0.5,shuffle=True,random_state=1)
model=KMeans(n_clusters=2,random_state=1).fit(features)
target_predicted=model.labels_
silhouette_score(features,target_predicted)

0.891626556407141

Session #6
Ensemble learning, grid search and
learning, and validation curves
Learning Objective
To implement Ensemble Learning, grid search, learning and validation curves.

Learning Context
Random Forest Classifier

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML.
It is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
"Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive accuracy of
that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and based on the majority votes of predictions, it predicts the
final output. The greater number of trees in the forest leads to higher accuracy and
prevents the problem of overfitting.

Grid Search

Exhaustive search over specified parameter values for an estimator. GridSearchCV


implements a “fit” and a “score” method. It also implements “score_samples”, “predict”,
“predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are
implemented in the estimator used. The parameters of the estimator used to apply these
methods are optimized by cross-validated grid-search over a parameter grid.

Validation Curve

To validate a model we need a scoring function (see Metrics and scoring: quantifying the
quality of predictions), for example accuracy for classifiers. The proper way of choosing
multiple hyperparameters of an estimator is of course grid search or similar methods (see
Tuning the hyper-parameters of an estimator) that select the hyperparameter with the
maximum score on a validation set or multiple validation sets. Note that if we optimize
the hyperparameters based on a validation score the validation score is biased and not a
good estimate of the generalization any longer. To get a proper estimate of the
generalization we have to compute the score on another test set. However, it is
sometimes helpful to plot the influence of a single hyperparameter on the training score
and the validation score to find out whether the estimator is overfitting or underfitting for
some hyperparameter values.
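The session title also mentions learning curves, which the exercise below does not plot; a brief sketch using scikit-learn's learning_curve is given here, assuming the same Wine.csv file and column layout as the exercise:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

df = pd.read_csv("Wine.csv")            # assumed: same file as the exercise below
X, y = df.iloc[:, :-1], df.iloc[:, -1]

train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

plt.plot(train_sizes, train_scores.mean(axis=1), label="Training Score")
plt.plot(train_sizes, test_scores.mean(axis=1), label="Cross Validation Score")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.show()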

Exercise
Ensemble learning, grid search and learning, and validation curves

Dataset

Solutions
import numpy as nm
import matplotlib.pyplot as plt
import pandas as pd
data=pd.read_csv("Wine.csv")
cols=['Alcohol','Color_Intensity','Proline','Ash_Alcanity']
from sklearn.preprocessing import MinMaxScaler
mms=MinMaxScaler(feature_range=(0,1))
data[cols]=mms.fit_transform(data[cols])
x=data.drop('Customer_Segment',axis=1)
y=data['Customer_Segment']
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
preds=clf.predict(X_test)
print(preds)
from sklearn.metrics import accuracy_score
accuracy_test_DT=accuracy_score(y_test,preds)
train_preds=clf.predict(X_train)
accuracy_train_DT=accuracy_score(y_train,train_preds)
print('accuracy_train_DT',accuracy_train_DT)
print('accuracy_test_DT',accuracy_test_DT)

[1 1 3 1 2 1 2 3 2 3 1 3 1 2 1 2 2 2 1 2 1 2 2 3 3 3 2 2 2 1 1 2 3 1 1 1
3
3 2 3 1 2 2 2 3 1 2 2 3 1 2 1 1 3]

accuracy_train_DT 1.0
accuracy_test_DT 1.0

#Grid Search
from sklearn.model_selection import GridSearchCV
grid_param={ 'n_estimators':[100,500,800],
'criterion':['gini','entropy'], 'bootstrap':[True,False] }
gd_sr=GridSearchCV(estimator=clf, param_grid=grid_param, scoring='accuracy', cv=5)
gd_sr.fit(X_train,y_train)
best_parameters=gd_sr.best_params_
print(best_parameters)
best_result=gd_sr.best_score_
print(best_result)

{'bootstrap': True, 'criterion': 'gini', 'n_estimators': 100}


0.968

#Validation Curve
import numpy as np
import pandas as pd
import matplotlib.pyplot as mtp
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve
df = pd.read_csv("Wine.csv") # Loading the data
X = df.iloc[:,:-1] # Feature matrix in pd.DataFrame format
y = df.iloc[:,-1] # Target vector in pd.Series format
param_range=np.arange(1, 11)
# Making a Random Forest Classifier object
rf = RandomForestClassifier(n_estimators=100, criterion='gini')
train_score, test_score=validation_curve(rf, X, y,
param_name="max_depth",param_range=param_range, cv=10,
scoring="accuracy")
# Plot the validation curve
mean_train_score = np.mean(train_score, axis = 1)

print("Mean of train score \n", mean_train_score)
mean_test_score = np.mean(test_score, axis = 1)
print("Mean of test score \n",mean_test_score)
mtp.plot(param_range, mean_train_score, label = "Training Score", color
= 'b')
mtp.plot(param_range, mean_test_score,label = "Cross Validation Score",
color = 'g')
mtp.title("Validation Curve with randomforest Classifier")
mtp.xlabel("max depth")
mtp.ylabel("Accuracy")
mtp.tight_layout()
mtp.legend(loc = 'best')
mtp.show()

Mean of train score


[0.98002329 0.98627329 0.99751165 1. 1. 1.
1. 1. 1. 1. ]
Mean of test score
[0.96666667 0.97222222 0.97222222 0.97777778 0.98888889 0.97777778
0.96666667 0.98333333 0.97222222 0.98333333]

Session #7
Compressing data via dimensionality
reduction: PCA, LDA
Learning Objective
To implement Compressing data via dimensionality reduction like PCA and LDA

Learning Context
Principal Component Analysis

The Principal Component Analysis (PCA) is a popular unsupervised learning technique for reducing the dimensionality of data. It increases interpretability yet, at the same time, minimizes information loss. It helps to find the most significant features in a dataset and makes the data easy to plot in 2D and 3D. PCA works by considering the variance of each attribute, because an attribute with high variance shows a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it keeps the important variables and drops the least important ones. (A short sketch of choosing the number of components follows.)
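A short sketch (not part of the original exercise) of one common way to choose the number of components: keep the smallest number whose cumulative explained-variance ratio crosses a threshold. The 95% threshold and the Iris data here are illustrative choices only:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                 # any numeric feature matrix works here

pca = PCA().fit(X)                   # fit with all components first
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumvar >= 0.95)) + 1   # smallest k reaching 95% variance
print(cumvar)
print("keep", n_components, "components")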

Linear Discriminant Analysis

LDA is a supervised classification technique that is considered a part of crafting competitive machine learning models. This category of dimensionality reduction is used in areas like image recognition and predictive analysis in marketing. Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction techniques in machine learning to solve more-than-two-class classification problems. It is also known as Normal Discriminant Analysis (NDA) or Discriminant Function Analysis (DFA). It can be used to project the features of a higher-dimensional space into a lower-dimensional space in order to reduce resources and dimensional costs. In this topic, "Linear Discriminant Analysis (LDA) in machine learning", we will discuss the LDA algorithm for classification predictive modeling problems, the limitations of logistic regression, the representation of the linear discriminant analysis model, how to make a prediction using LDA, how to prepare data for LDA, extensions to LDA, and much more. So, let's start with a quick introduction to Linear Discriminant Analysis (LDA) in machine learning.

Exercise
Compressing data via dimensionality reduction: PCA, LDA

Dataset
#PCA
import pandas as pd
d = pd.read_csv('wineQualityReds.csv')
print(d.head())
x = d.iloc[:,:-1]
y = d.iloc[:,-1]
from sklearn.preprocessing import MinMaxScaler
s = MinMaxScaler()
x = pd.DataFrame(s.fit_transform(x),columns = x.columns)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)

from sklearn.decomposition import PCA


pca = PCA(n_components = 8)

X_train = pca.fit_transform(x_train)
X_test = pca.transform(x_test)
explained_variance = pca.explained_variance_ratio_
print("PCA Variance \n", explained_variance)
from sklearn.neighbors import KNeighborsClassifier
KNN_mod = KNeighborsClassifier(n_neighbors=10)
KNN_mod.fit(X_train, y_train)
pred = KNN_mod.predict(X_test)
from sklearn.metrics import confusion_matrix,accuracy_score
print("PCA Accuracy score is \n", accuracy_score(y_test,pred)*100)

#LDA
import pandas as pd
d = pd.read_csv('wineQualityReds.csv')
x = d.iloc[:,:-1]
y = d.iloc[:,-1]
from sklearn.preprocessing import MinMaxScaler
s = MinMaxScaler()
x = pd.DataFrame(s.fit_transform(x),columns = x.columns)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=5)
X_train = lda.fit_transform(x_train,y_train)
X_test = lda.transform(x_test)
explained_variance = lda.explained_variance_ratio_
print("LDA Variance is \n", explained_variance)
from sklearn.neighbors import KNeighborsClassifier
KNN_mod = KNeighborsClassifier(n_neighbors=10)
KNN_mod.fit(X_train,y_train)
pred = KNN_mod.predict(X_test)
from sklearn.metrics import confusion_matrix,accuracy_score
print("LDA Accuracy score is \n", accuracy_score(y_test,pred)*100)

fixed.acidity volatile.acidity citric.acid residual.sugar chlorides


\
0 7.4 0.70 0.00 1.9
0.076
1 7.8 0.88 0.00 2.6
0.098
2 7.8 0.76 0.04 2.3

0.092
3 11.2 0.28 0.56 1.9
0.075
4 7.4 0.70 0.00 1.9
0.076

free.sulfur.dioxide total.sulfur.dioxide density pH sulphates


\
0 11.0 34.0 0.9978 3.51 0.56
1 25.0 67.0 0.9968 3.20 0.68
2 15.0 54.0 0.9970 3.26 0.65
3 17.0 60.0 0.9980 3.16 0.58
4 11.0 34.0 0.9978 3.51 0.56

alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
PCA Variance
[0.36091181 0.19343588 0.15019667 0.07150031 0.05365209 0.05066068
0.04207002 0.03260641]
PCA Accuracy score is
57.8125
LDA Variance is
[0.83468126 0.12153022 0.02544075 0.01069278 0.00765499]
LDA Accuracy score is
57.49999999999999

Session #8
Model Evaluation and Optimization
Learning Objective
To implement Model Evaluation and optimization by using K-fold cross-validation.

Learning Context
Splitting of data into training and test data

Data splitting is when data is divided into two or more subsets. Typically, with a two-part
split, one part is used to evaluate or test the data and the other to train the model. Data
splitting is an important aspect of data science, particularly for creating models based on
data. This technique helps ensure the creation of data models and processes that use
data models, such as machine learning are accurate.

In a basic two-part data split, the training data set is used to train and develop models.
Training sets are commonly used to estimate different parameters or to compare
different model performances. The testing data set is used after the training is done. The
training and test data are compared to ensure that the final model works correctly.
Scikit-learn, alias sklearn, is the most useful and robust library for machine learning in Python. The scikit-learn library provides us with the model_selection module, in which we have the splitter function train_test_split().

Syntax:train_test_split(*arrays,test_size=None,train_size=None,random_state=None,
shuffle=True,stratify=None)

Parameters:

1. *arrays: inputs such as lists, arrays, data frames, or matrices
2. test_size: a float value between 0.0 and 1.0 representing the proportion of the data to put in the test split. Its default value is None.
3. train_size: a float value between 0.0 and 1.0 representing the proportion of the data to put in the train split. Its default value is None.
4. random_state: this parameter is used to control the shuffling applied to the data before applying the split. It acts as a seed.
5. shuffle: this parameter is used to shuffle the data before splitting. Its default value is True.
6. stratify: this parameter is used to split the data in a stratified fashion (a short sketch follows this list).
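A brief illustrative sketch of these parameters (the arrays below are made up); stratify=y keeps the class proportions the same in both splits:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)          # 10 made-up samples, 2 features
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y)

print(X_train.shape, X_test.shape)        # (8, 2) (2, 2)
print(y_test)                             # one sample from each class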

Cross Validation

Cross-validation is a technique for validating the model's efficiency by training it on a subset of the input data and testing it on a previously unseen subset of the input data. We can also say that it is a technique to check how a statistical model generalizes to an independent dataset. In machine learning, there is always the need to test the stability of the model; we cannot judge it based on the training dataset alone. For this purpose, we reserve a particular sample of the dataset which was not part of the training dataset. After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is something different from the general train-test split.

K-Fold Cross-Validation: K-fold cross-validation approach divides the input dataset into K
groups of samples of equal sizes. These samples are called folds. For each learning set,
the prediction function uses k-1 folds, and the rest of the folds are used for the test set.
This approach is a very popular CV approach because it is easy to understand, and the
output is less biased than other methods. The steps for k-fold cross-validation are:

1. Split the input dataset into K groups.
2. For each group:
a. Take one group as the reserve or test data set.
b. Use the remaining groups as the training dataset.
c. Fit the model on the training set and evaluate the performance of the model using the test set.

Exercise
Model Evaluation and optimization: K-fold cross-validation.

Dataset
import pandas as pd
import numpy as np
dataset = pd.read_csv("wineQualityReds.csv", sep=',')
dataset.head()
X = dataset.iloc[:, 0:11].values
y = dataset.iloc[:, 11].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10, random_state=0)
from sklearn.preprocessing import StandardScaler
feature_scaler = StandardScaler()
X_train = feature_scaler.fit_transform(X_train)
X_test = feature_scaler.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=300, random_state=0)
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
scores = []
kFold = KFold(n_splits=10, random_state=42, shuffle=True)
for train_index, test_index in kFold.split(X):
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    classifier.fit(X_train, y_train)
    scores.append(classifier.score(X_test, y_test))
print('Cross validation scores are: \n', cross_val_score(classifier, X, y, cv=10))
classifier.fit(X_train, y_train)
print("ACCURACY OF THE MODEL:", classifier.score(X_test, y_test))

Cross validation scores are:


[0.53125 0.58125 0.54375 0.55 0.575 0.6125
0.575 0.6 0.60625 0.58490566]
ACCURACY OF THE MODEL: 0.7547169811320755

Session #9
Regularization
Learning Objective
To implement for reducing the variance of a linear regression model using Lasso and
Ridge Regularization evaluation of clustering model.

Learning Context
Regularization is one of the most important concepts of machine learning. It is a technique to prevent the model from overfitting by adding extra information to it. Sometimes the machine learning model performs well with the training data but does not perform well with the test data: the model is not able to predict the output when dealing with unseen data because it has also fitted the noise, and hence the model is called overfitted. Regularization can be applied in such a way that it allows us to keep all variables or features in the model while reducing the magnitude of their coefficients. Hence, it maintains accuracy as well as the generalization of the model. It mainly regularizes or reduces the coefficients of the features toward zero. In simple words, "in the regularization technique, we reduce the magnitude of the features' coefficients while keeping the same number of features."

How does Regularization Work?

Regularization works by adding a penalty or complexity term to the complex model. Let's consider the simple linear regression equation:

y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b

In the above equation, y represents the value to be predicted and x1, x2, ..., xn are the features for y. β0, β1, ..., βn are the weights or magnitudes attached to the features, and b represents the intercept (bias) of the model.

Linear regression models try to optimize the coefficients and b to minimize the cost function. The equation for the cost function for the linear model is given below, where ŷ_i denotes the value predicted by the model for the ith observation:

Cost = Σ_{i=1..n} (y_i − ŷ_i)²

Techniques of Regularization

There are mainly two types of regularization techniques, which are given below:

1.​ Ridge Regression


2.​ Lasso Regression

Ridge Regression

Ridge regression is one of the types of linear regression in which a small amount of bias is introduced so that we can get better long-term predictions. Ridge regression is a regularization technique used to reduce the complexity of the model; it is also called L2 regularization. In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty. We can calculate it by multiplying lambda by the squared weight of each individual feature. The equation for the cost function in ridge regression will be:

Cost = Σ_{i=1..n} (y_i − ŷ_i)² + λ Σ_{j=1..n} β_j²

As we can see from the above equation, if the value of λ tends to zero, the equation becomes the cost function of the linear regression model. Hence, for a minimum value of λ, the model will resemble the linear regression model.

Lasso Regression

Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator. It is similar to Ridge Regression except that the penalty term contains only the absolute values of the weights instead of their squares. Since it takes absolute values, it can shrink a coefficient all the way to 0, whereas Ridge Regression can only shrink it close to 0. It is also called L1 regularization. The equation for the cost function of Lasso regression will be:

Cost = Σ_{i=1..n} (y_i − ŷ_i)² + λ Σ_{j=1..n} |β_j|

Ridge regression is mostly used to reduce overfitting in the model, and it includes all the features present in the model; it reduces the complexity of the model by shrinking the coefficients. Lasso regression helps reduce overfitting in the model and also performs feature selection. (A short sketch contrasting the two follows.)
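A small sketch (not part of the original exercise) illustrating this contrast on the diabetes data used below: as alpha grows, Lasso drives some coefficients exactly to zero while Ridge only shrinks them toward zero:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

for alpha in (0.1, 1.0, 10.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    ridge = Ridge(alpha=alpha).fit(X, y)
    # count how many coefficients each model has forced exactly to zero
    print(f"alpha={alpha}: Lasso zeros={np.sum(lasso.coef_ == 0)}, "
          f"Ridge zeros={np.sum(ridge.coef_ == 0)}")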

Exercise
Write a program to reduce variance of a linear regression model using Lasso and Ridge
Regularization
from sklearn.linear_model import Lasso
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
diabetes=load_diabetes()
features = diabetes.data
target = diabetes.target
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
regression = Lasso(alpha =0.5)
model =regression.fit(features_standardized, target)
print(model.coef_)

[ -0.         -10.28809083  24.98390085  14.66877173  -7.76136877
  -0.          -8.44971502   3.28432608  24.95304334   2.90702924]

from sklearn.linear_model import Ridge


from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
diabetes=load_diabetes()
features = diabetes.data
target = diabetes.target
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
regression = Ridge(alpha =0.5)
model =regression.fit(features_standardized, target)
print(model.coef_)

[ -0.45160034 -11.36796054  24.75403875  15.39943346 -33.44143756
  19.31313255   2.93639178   7.91553777  34.12396693   3.24318124]

Session #10
Perceptron for digits
Learning Objective
To implement perceptron for digits.

Learning Context
Perceptron

A perceptron consists of a weighted summation followed by an activation (threshold) function; it takes n inputs and generates one output. Frank Rosenblatt (1928 – 1971) was an American psychologist notable in the field of Artificial Intelligence. In 1957 he started something really big: he "invented" a Perceptron program, on an IBM 704 computer at Cornell Aeronautical Laboratory. Scientists had discovered that brain cells (neurons) receive input from our senses by electrical signals. The neurons, in turn, use electrical signals to store information and to make decisions based on previous input. Frank had the idea that perceptrons could simulate brain principles, with the ability to learn and make decisions. The original perceptron was designed to take a number of binary inputs and produce one binary output (0 or 1). The idea was to use different weights to represent the importance of each input, and that the sum of the values should be greater than a threshold value before making a decision like yes or no (true or false) (0 or 1).

Frank Rosenblatt suggested this algorithm (a short sketch of these steps follows the list):

1. Set a threshold value
2. Multiply all inputs with their weights
3. Sum all the results
4. Activate the output
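A minimal sketch of these four steps as a Python function; the inputs, weights, and threshold below are made-up values for illustration:

import numpy as np

def perceptron_output(inputs, weights, threshold):
    # steps 2-3: multiply all inputs with their weights and sum the results
    total = np.dot(inputs, weights)
    # step 4: activate the output - fire 1 only if the sum exceeds the threshold (step 1)
    return 1 if total > threshold else 0

# Made-up example: three binary inputs with hand-picked weights
inputs = np.array([1, 0, 1])
weights = np.array([0.7, 0.6, 0.5])
print(perceptron_output(inputs, weights, threshold=1.0))   # 0.7 + 0.5 = 1.2 > 1.0 -> 1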

Exercise
Write a program to implement Perceptron for digits
from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron
X, y = load_digits(return_X_y=True)
clf = Perceptron(tol=1e-3, random_state=0)
clf.fit(X, y)
clf.score(X, y)

0.9393433500278241

Session #11
Feed-Forward Network
Learning Objective
To implement Feed-Forward Network for wheat seed dataset.

Learning Context
We are still making use of a gradient descent optimization algorithm, which acts to minimize the error of our model by iteratively moving in the direction of steepest descent, the direction which updates the parameters of our model while ensuring minimal error. It is important to recognize this in the subsequent training of our neural network; recognition is done by dividing our data samples through some decision boundary. "The process of receiving an input to produce some kind of output to make some kind of prediction is known as Feed Forward." The feed-forward neural network is the core of many other important neural networks, such as the convolutional neural network. In a feed-forward neural network, there are no feedback loops or connections in the network: there is simply an input layer, a hidden layer, and an output layer. We will talk more about optimization algorithms and backpropagation later.

Exercise
Write a program to implement a Feed-Forward Network for the wheat seeds dataset.

Dataset
import pandas as pd
df=pd.read_csv("wheat.csv",index_col=None)
X = df.iloc[:, 0:7].values
y = df.iloc[:, 7].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,random_state=0)
from sklearn.preprocessing import StandardScaler
feature_scaler = StandardScaler()
X_train = feature_scaler.fit_transform(X_train)
X_test = feature_scaler.transform(X_test)
from sklearn.preprocessing import LabelBinarizer
lb=LabelBinarizer()
y_train = lb.fit_transform(y_train)
y_test=lb.transform(y_test)
from sklearn.neural_network import MLPClassifier
#Initializing the MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(100,50), max_iter=300,activation
= 'relu',solver='adam',random_state=1)
# Fit data onto the model
clf.fit(X_train,y_train)
ypred=clf.predict(X_test)
# Import accuracy score
from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy_score(y_test,ypred)

0.9523809523809523

Session #12
Neural Network for Regression
Learning Objective
To implement a neural network for regression.

Learning Context
Keras is an open-source high-level Neural Network library that is written in Python and
capable enough to run on Theano, TensorFlow, or CNTK. It was developed by one of the
Google engineers, Francois Chollet. It is made user-friendly, extensible, and modular for
facilitating faster experimentation with deep neural networks. It not only supports
Convolutional Networks and Recurrent Networks individually but also in combination. It
cannot handle low-level computations, so it makes use of the Backend library to resolve
it. The backend library acts as a high-level API wrapper for the low-level API, which lets it
run on TensorFlow, CNTK, or Theano.

Exercise
Write a program to implement a neural network for regression.
#Load libraries
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras import models
from keras import layers
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Set random seed
np.random.seed(0)

# Generate features matrix and target vector


features, target = make_regression(n_samples = 10000,
n_features = 3,
n_informative = 3,
n_targets = 1,
noise = 0.0,
random_state = 0)

# Divide our data into training and test sets


features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.33, random_state=0)

# Start neural network


network = models.Sequential()

# Add fully connected layer with a ReLU activation function


network.add(layers.Dense(units=32,activation="relu",input_shape=
(features_train.shape[1],)))

# Add fully connected layer with a ReLU activation function


network.add(layers.Dense(units=32, activation="relu"))

# Add fully connected layer with no activation function


network.add(layers.Dense(units=1))

# Compile neural network


network.compile(loss="mse", # Mean squared error
optimizer="RMSprop", # Optimization algorithm
metrics=["mse"]) # Mean squared error

# Train neural network


history = network.fit(features_train, # Features

target_train, # Target vector
epochs=10, # Number of epochs
verbose=0, # No output
batch_size=100, # Number of observations per batch
validation_data=(features_test, target_test))

predicted_target = network.predict(features_test)
print(predicted_target)
from sklearn.metrics import r2_score
print("RMS: %r " % np.sqrt(np.mean((predicted_target -
target_test) ** 2)))
print("R2= ", r2_score(predicted_target,target_test))
# print(r2_score(features_train,target_train))

104/104 [==============================] - 0s 1ms/step


[[ 89.74974]
[ -88.54554]
[-195.74356]
...
[-183.51932]
[ -51.0432 ]
[ 92.63008]]
RMS: 190.00101997820917
R2= 0.9886432717990226

Session #13
Machine Learning Model
Learning Objective
To implement a save and load trained machine learning model.

Learning Context
JOBLIB

Joblib is a Python library for running computationally intensive tasks in parallel. It provides a set of operations for working on large datasets in parallel and for caching the results of computationally expensive functions. Joblib is a set of tools to provide lightweight pipelining in Python, and it is useful if you are running heavy grid search cross-validation or other forms of hyperparameter tuning (scikit-learn + Joblib). Joblib is part of the SciPy ecosystem and provides utilities for pipelining Python jobs. It provides utilities for saving and loading Python objects that make use of NumPy data structures efficiently.

Saving and Loading

In machine learning, while working with the scikit-learn library, we need to save trained models to a file and restore them in order to reuse them, to compare the model with other models, and to test the model on new data. Saving the data is called serialization, while restoring the data is called deserialization. Also, we deal with different types and sizes of data. Some datasets are easily trained, i.e., they take less time to train, but datasets whose size is large (more than 1 GB) can take a very long time to train on a local machine even with a GPU. When we need the same trained model in some different project or at a later time, storing the trained model avoids wasting the training time, so that it can be used anytime in the future.

Exercise
Write a program to save and load a trained machine learning model.
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import joblib
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a random forest classifier object
classifier = RandomForestClassifier()
# Train model
model = classifier.fit(features, target)
# Save model as a pickle file
joblib.dump(model, "model.pkl")
# Load the saved model back from disk
classifier = joblib.load("model.pkl")
new_observation = [[5.2, 3.2, 1.1, 0.1]]
output_class = classifier.predict(new_observation)
print(output_class)
output_class = classifier.predict(iris.data[101].reshape(1, -1))
print(output_class)
print(iris.target[101])

[0]
[2]
2

