ML Lab Manual
Course Outcomes
1. Design a pre-processing model for their own datasets.
2. Apply dimensional reduction techniques for their own datasets.
3. Develop different clustering & classification techniques.
4. Evaluate the model with Lasso and Ridge Regularization.
5. Design neural networks for structured and unstructured data classification and regression.
List of Experiments
1. Vector addition.
2. Data pre-processing: Handling missing values, handling categorical data, bringing
features to the same scale, and selecting meaningful features.
3. Regression model.
4. Write a program to implement the KNN classifier and logistic regression for binary
classification and multiclass classification.
5. Ensemble learning, grid search, and learning and validation curves.
6. Write a program for Data Clustering (K-Means) and evaluate the clustering model.
7. Compressing data via dimensionality reduction: PCA, LDA.
8. Model Evaluation and Optimization: K-fold cross-validation.
9. Write a program to reduce the variance of a linear regression model using Lasso
and Ridge Regularization.
10. Perceptron for digits.
11. Feed-Forward Network for wheat seeds dataset.
12. Write a program to implement a neural network for regression.
13. Write a program to save and load a trained machine learning model.
Additional Experiments
1. Write a program to implement data pre-processing techniques like data sampling,
data discretization, and data augmentation.
2. Write a program to implement a Naïve Bayes algorithm.
3. Write a program to implement classification using SVM.
4. Write a program to implement a regression tree.
5. Write a program to implement Boosting techniques.
6. Write a program to implement Hierarchical clustering.
7. Write a program to implement a multilayer perceptron.
Session #1
Vector Addition
Learning Objective
To implement vector addition.
Learning Context
Here we shall learn how to perform vector addition and subtraction in Python. A vector, in
programming terms, refers to a one-dimensional array. An array is a data structure that
stores similar elements, i.e., elements having the same data type. The general features of
an array are that all of its elements share one data type, it has a fixed size, and each
element can be accessed by an integer index.
Now let’s learn how to perform basic mathematical operations, such as addition and
subtraction, on arrays in Python. To perform this task, we use the NumPy (Numerical
Python) module, which has several built-in methods that make the task easier. The easiest
and simplest way to create an array in Python is by placing comma-separated literals
inside square brackets. For example:
A = [1, 2, 3]
B = [4, 5, 6]
A two-dimensional array can be written as a list of lists, for example A = [[1, 2, 3], [4, 5, 6]].
To use this module, we need to import it. The variables A and B store the array elements.
To perform the addition, we call the add() method of the NumPy module as NP.add(). Here
we have aliased NumPy as NP, which is not necessary; we could equally write numpy.add().
To perform subtraction on the same array elements, we just need to write another line of
code invoking the subtract method, i.e., NP.subtract(), and print the result obtained after
the subtraction.
Exercise
Write a program for vector addition.
Solution
import numpy as NP
A = [4, 8, 7]
B = [5, -4, 8]
print("The input arrays are :\n","A:",A ,"\n","B:",B)
Res= NP.add(A,B)
print("After addition the resulting array is :",Res)
import numpy as NP
A = [4, 8, 7]
B = [5, -4, 8]
print("The input arrays are :\n","A:",A ,"\n","B:",B)
Res1= NP.add(A,B)
Res2= NP.subtract(A,B)
print("Result of Addition is :",Res1,"\nResult of Subtraction is:",Res2)
# Create a vector
from numpy import array
v = array([1, 2, 3])
print(v)
[1 2 3]
# multiply vectors
from numpy import array
a = array([1, 2, 3])
print(a)
b = array([1, 2, 3])
print(b)
c = a * b
print(c)
[1 2 3]
[1 2 3]
[1 4 9]
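Beyond element-wise addition and multiplication, NumPy arrays also support element-wise division, the dot product, and multiplication by a scalar. A minimal sketch of these operations (not part of the original listing; the expected outputs are shown as comments):
# Further vector operations with NumPy
from numpy import array
a = array([1, 2, 3])
b = array([1, 2, 3])
print(a / b)         # element-wise division -> [1. 1. 1.]
print(a.dot(b))      # dot product -> 14
print(0.5 * a)       # scalar multiplication -> [0.5 1.  1.5]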
Session #2
Data pre-processing
Learning Objective
To perform data pre-processing like handling missing values, handling categorical data,
bringing features to the same scale, and selecting meaningful features.
Learning Context
Checking Null Values
The DataFrame.isnull() function detects missing values in the given object. It returns a
Boolean same-sized object indicating the values of NA.isnull() is the method that returns
true if the value is null and false otherwise.
Missing data is defined as the values or data that is not stored (or not present) for some
variables in the given dataset. The first step in handling missing values is to look at the
data carefully and find out the missing values using isnull(). Some of the measures to
handle it by using dropna() and fillna()
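As a minimal sketch (using a small hypothetical DataFrame, not the session's dataset), these calls can be combined as follows:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 31], 'Salary': [50000, 60000, np.nan]})
print(df.isnull().sum())       # count missing values per column
print(df.dropna())             # drop rows containing any missing value
print(df.fillna(df.mean()))    # fill missing values with the column mean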
When your data has categories represented by strings, it will be difficult to use them to
train machine learning models, which often only accept numeric data. Instead of ignoring
the categorical data and excluding the information from our model, you can transform the
data so it can be used in your models. Categorical data is a type of data that is used to
group information with similar characteristics, while numerical data is a type of data that
expresses information in the form of numbers. Most machine learning algorithms cannot
handle categorical variables unless we convert them to numerical values. Many
algorithms' performance even varies with how the categorical variables are encoded.
Real-world datasets typically contain categorical features of two kinds: nominal features,
which have no inherent order (such as country names), and ordinal features, which do.
Feature scaling is the process of normalizing features that vary in degrees of magnitude,
range, and units. Therefore, for machine learning models to interpret these features on
the same scale, we need to perform feature scaling. Feature scaling is a technique to
standardize the independent features present in the data within a fixed range. It is
performed during data pre-processing. More specifically, we will be looking at three
different scalers in the scikit-learn library for feature scaling:
● Standard Scaler
● Min-Max Scaler
● Robust Scaler
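A minimal sketch contrasting two of these scalers on a small hypothetical feature matrix (not the session's dataset):
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(StandardScaler().fit_transform(X))   # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))     # each column rescaled to the [0, 1] range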
The iloc() function in Python is defined in the Pandas module and helps us select a
specific row or column from the data set. Using the iloc method in Python, we can easily
retrieve any particular value from a row or column by using index values.
Parameters:
The iloc function in Python takes two parameters. However, both parameters of the iloc()
method are optional. Let us discuss both of these parameters:
● The row parameter is an optional parameter that specifies the index position of
the row in the form of an integer or list of integers.
● The column parameter is also an optional parameter that specifies the index
position of the column in the form of an integer or list of integers.
If we specify only a row value, then the iloc function returns a Pandas series. If we specify
the row value and column value, then the iloc function returns all the content of the
specified cell. If we specify a list of values, the Python iloc function returns a Pandas
DataFrame.
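For illustration, a small hypothetical example of iloc-based selection:
import pandas as pd

df = pd.DataFrame({'Country': ['France', 'Spain'], 'Age': [44, 27], 'Salary': [72000, 48000]})
print(df.iloc[0])         # first row, returned as a pandas Series
print(df.iloc[0, 2])      # value in the first row, third column (Salary)
print(df.iloc[:, :-1])    # all rows, all columns except the last, as a DataFrame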
Exercise
Data pre-processing: handling missing values, handling categorical data, bringing
features on the same scale, and selecting meaningful features.
Dataset
Solution
# Importing the necessary libraries and Datasets
import pandas as pd
import numpy as nm
data_set= pd.read_csv('dataset.csv')
print(data_set)
x= data_set.iloc[:,:-1].values
print(x)
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
data_set.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 10 non-null object
1 Age 9 non-null float64
2 Salary 9 non-null float64
3 Purchased 10 non-null object
dtypes: float64(2), object(2)
memory usage: 448.0+ bytes
# Count missing values per column (call reconstructed to match the output below)
data_set.isnull().sum()
Country 0
Age 1
Salary 1
Purchased 0
dtype: int64
# Filling the missing values with mean of the data
data_set['Age'].fillna(value=data_set['Age'].mean(), inplace=True)
print(data_set)
data_set['Salary'].fillna(value=data_set['Salary'].mean(), inplace=True)
print(data_set)
# Encoded data (Country and Purchased converted to integer codes):
   Country        Age        Salary  Purchased
0        0  44.000000  72000.000000          0
1        2  27.000000  48000.000000          1
2        1  30.000000  54000.000000          0
3        2  38.000000  61000.000000          0
4        1  40.000000  63777.777778          1
5        0  35.000000  58000.000000          1
6        2  38.777778  52000.000000          0
7        0  48.000000  79000.000000          1
8        1  50.000000  83000.000000          0
9        0  37.000000  67000.000000          1
# One-hot encoding additionally produces indicator columns such as Country_Spain,
# holding 1.0 for rows whose Country is Spain and 0.0 otherwise.
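The encoding step itself does not appear in the listing; a minimal sketch that produces output of this kind (column names assumed from the dataset) is:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

le = LabelEncoder()
data_set['Country'] = le.fit_transform(data_set['Country'])      # France/Germany/Spain -> 0/1/2
data_set['Purchased'] = le.fit_transform(data_set['Purchased'])  # No/Yes -> 0/1
# One-hot (dummy) encoding of the original country names could instead be obtained with:
# dummies = pd.get_dummies(data_set['Country'], prefix='Country')
print(data_set)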
# Scaled feature (Min-Max scaling of all columns to the [0, 1] range)
from sklearn.preprocessing import MinMaxScaler  # import and scaler object were missing from the listing
min_max = MinMaxScaler()
x1= min_max.fit_transform(data_set)
name=data_set.columns
data=pd.DataFrame(x1, columns=name)
print(data)
columns Scores
0 Country 0.500000
1 Age 0.070076
2 Salary 0.007011
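The exercise also asks for selecting meaningful features. One way to obtain per-feature scores like those above is univariate selection; a minimal sketch (the scoring function and k are assumptions, not taken from the manual):
from sklearn.feature_selection import SelectKBest, f_classif

X_feat = data_set.drop('Purchased', axis=1)
y_feat = data_set['Purchased']
selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 highest-scoring features
selector.fit(X_feat, y_feat)
print(pd.DataFrame({'columns': X_feat.columns, 'Scores': selector.scores_}))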
Session #3
Regression model
Learning Objective
To implement a regression model.
Learning Context
This session covers the basics of linear regression and its implementation in the Python
programming language. Linear regression is a statistical method for modeling the
relationship between a dependent variable and a given set of independent variables.
Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related; hence, we try to find a linear
function that predicts the response value (y) as accurately as possible as a function of the
feature or independent variable (x). Let us consider a dataset where we have a value of
the response y for every value of the feature x. For generality, we define x as the feature
vector, i.e., x = [x₁, x₂, …, xₙ], and y as the response vector, i.e., y = [y₁, y₂, …, yₙ], for n
observations (in the example considered here, n = 10). A scatter plot of such a dataset is
shown in Figure 5.1.
Figure 5.1 Scatter Plot
Now, the task is to find a line that fits best in the above scatter plot so that we can predict
the response for any new feature values. (i.e., a value of x not present in a dataset). This
line is called a regression line. The equation of the regression line is represented as:
h(xᵢ) = β₀ + β₁xᵢ
Here h(xᵢ) represents the predicted response value for the i-th observation. β₀ and β₁ are
the regression coefficients and represent the y-intercept and slope of the regression line,
respectively. To create our model, we must learn or estimate the values of the regression
coefficients β₀ and β₁; once we have estimated these coefficients, we can use the model to
predict responses.
To estimate them, we use the principle of least squares. Writing the observed response as
yᵢ = β₀ + β₁xᵢ + εᵢ = h(xᵢ) + εᵢ,
the residual error of the i-th observation is εᵢ = yᵢ − h(xᵢ). Our aim is to minimize the total
residual error. We define the squared-error (cost) function J as:
J(β₀, β₁) = (1 / 2n) · Σᵢ₌₁ⁿ εᵢ²
Our task is to find the values of β₀ and β₁ for which J(β₀, β₁) is minimum. Without going
into the mathematical details, we present the result here:
β₁ = SSxy / SSxx
β₀ = ȳ − β₁x̄
where SSxy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ), SSxx = Σᵢ₌₁ⁿ (xᵢ − x̄)², and x̄, ȳ are the means of the x
and y values.
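A minimal NumPy sketch of these closed-form estimates, using small hypothetical data (not the session's Realestate.csv):
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

SS_xy = np.sum((x - x.mean()) * (y - y.mean()))
SS_xx = np.sum((x - x.mean()) ** 2)
b1 = SS_xy / SS_xx             # slope
b0 = y.mean() - b1 * x.mean()  # intercept
print("slope:", b1, "intercept:", b0)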
Exercise
Linear Regression model
Dataset
Solution
import pandas as pd
dataset=pd.read_csv('Realestate.csv')
dataset
x = dataset.iloc[:,[2,3,4]]
y = dataset.iloc[:,-1]
print(x)
print(y)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=
0.1, random_state=0)
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(x_train, y_train)
print("regression score is", regr.score(x_test, y_test))
y_pred = regr.predict(x_test)
print(y_pred)
(Output: the selected feature columns x and the target y are printed for all rows, followed by the regression score on the test set and the predicted values y_pred.)
Session #4
Logistic Regression
Learning Objective
To implement the KNN Classifier, logistic regression for binary and multiclass
classification
Learning Context
Logistic Regression: Logistic regression is one of the most popular machine learning
algorithms; it comes under the supervised learning technique. It is used for predicting a
categorical dependent variable from a given set of independent variables. Because logistic
regression predicts the output of a categorical dependent variable, the outcome must be a
categorical or discrete value: Yes or No, 0 or 1, True or False, and so on. Instead of giving
the exact values 0 and 1, however, it gives probabilistic values that lie between 0 and 1.
Logistic regression is much like linear regression except in how it is used: linear regression
is used for solving regression problems, whereas logistic regression is used for solving
classification problems.
In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, which predicts two maximum values (0 or 1). The equation of the straight line
can be represented as:
y = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ
In logistic regression, y can be between 0 and 1 only, so we divide the above equation
by (1 − y):
y / (1 − y);   0 for y = 0 and infinity for y = 1
But we need a range between −infinity and +infinity, so taking the logarithm of the
equation, it becomes:
log[ y / (1 − y) ] = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ
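Equivalently, the predicted probability is obtained by passing the linear combination z = b₀ + b₁x₁ + … + bₙxₙ through the sigmoid function; a minimal sketch with hypothetical coefficients:
import numpy as np

def sigmoid(z):
    # maps any real value to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -1.0, 2.0             # hypothetical coefficients
x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(b0 + b1 * x))    # probabilities between 0 and 1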
On the basis of the categories, Logistic Regression can be classified into three types:
● Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
● Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
● Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High"
Exercise
Write a program to implement KNN Classifier, logistic regression for binary and multiclass
classification
Dataset
Solutions
# Binary Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
iris =datasets.load_iris()
features = iris.data[:100,:]
target =iris.target[:100]
scaler =StandardScaler()
features_standardized =scaler.fit_transform(features)
logistic_regression = LogisticRegression(random_state=0)
model = logistic_regression.fit(features_standardized,target)
y_pred=model.predict(features_standardized)
print(metrics.accuracy_score(y_pred,target))
1.0
# Multinomial Logistic Regression
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
iris=datasets.load_iris()
iris_data=iris.data
iris_data=pd.DataFrame(iris_data,columns=iris.feature_names)
iris_data['species']=iris.target
iris_data['species'].unique()
features =iris.feature_names
target ='species'
X=iris_data[features]
y=iris_data[target]
lr_iris=LogisticRegression() # default value for multi class problem is
multinomial
lr_iris =lr_iris.fit(X,y)
y_pred=lr_iris.predict(X)
print(metrics.accuracy_score(y_pred,y))
0.9733333333333334
Dataset
# Write a program to implement k-Nearest Neighbor algorithm to classify
the iris data set. Print both correct and wrong predictions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
# Load the Iris dataset (assumes the CSV holds the four measurements followed by the species column)
data = pd.read_csv('IRIS.csv')
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Train/test split (split ratio and random_state assumed; they reproduce a 38-sample test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit the k-NN classifier and predict (k assumed)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Separate correct and wrong predictions
correct_predictions = []
wrong_predictions = []
for i in range(len(y_test)):
    if y_test[i] == y_pred[i]:
        correct_predictions.append((X_test[i], y_test[i], y_pred[i]))
    else:
        wrong_predictions.append((X_test[i], y_test[i], y_pred[i]))

print("\nCorrect Predictions:")
for item in correct_predictions:
    print(f"Features: {item[0]}, True Label: {item[1]}, Predicted Label: {item[2]}")

print("\nWrong Predictions:")
for item in wrong_predictions:
    print(f"Features: {item[0]}, True Label: {item[1]}, Predicted Label: {item[2]}")

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
Correct Predictions:
Features: [-0.09984503 -0.57982483 0.72717965 1.51147115], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [ 0.13072494 -1.96153508 0.11355956 -0.28533458], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-0.44569998 2.64416573 -1.33681519 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [ 1.62942973 -0.34953979 1.39658338 0.74141155], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-1.0221249 0.80188541 -1.28103155 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [0.47657989 0.57160037 1.22923245 1.63981441], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-1.0221249 1.03217045 -1.39259884 -1.18373745], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [0.93771983 0.11103029 0.50404507 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 1.05300481 -0.57982483 0.55982872 0.22803848], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 0.24600992 -0.57982483 0.11355956 0.09969522], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 0.24600992 -1.04039491 1.00609787 0.22803848], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [0.59186487 0.34131533 0.39247778 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 0.24600992 -0.57982483 0.50404507 -0.02864805], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 0.70714986 -0.57982483 0.44826143 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [ 0.24600992 -0.34953979 0.50404507 0.22803848], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-1.13740989 0.11103029 -1.28103155 -1.44042398], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [ 0.13072494 -0.34953979 0.39247778 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-0.44569998 -1.04039491 0.33669414 -0.02864805], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-1.25269487 -0.11925475 -1.33681519 -1.18373745], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [-0.56098497 1.95331061 -1.39259884 -1.05539418], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [-0.330415 -0.57982483 0.61561236 0.99809808], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-0.330415 -0.11925475 0.39247778 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-1.25269487 0.80188541 -1.05789697 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [-1.71383481 -0.34953979 -1.33681519 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [ 0.36129491 -0.57982483 0.55982872 0.74141155], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-1.48326484 1.26245549 -1.55994977 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [-0.90683992 1.72302557 -1.05789697 -1.05539418], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [ 0.36129491 -0.34953979 0.28091049 0.09969522], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-1.0221249 -1.73125004 -0.27692595 -0.28533458], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-1.0221249 0.80188541 -1.2252479 -1.05539418], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [0.59186487 0.11103029 0.95031423 0.74141155], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-0.56098497 -0.11925475 0.39247778 0.35638175], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-0.79155494 1.03217045 -1.28103155 -1.31208072], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Features: [ 0.24600992 -0.11925475 0.61561236 0.74141155], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [ 0.59186487 -0.57982483 1.00609787 1.25478461], True Label:
Iris-virginica, Predicted Label: Iris-virginica
Features: [-0.79155494 -0.81010987 0.05777592 0.22803848], True Label:
Iris-versicolor, Predicted Label: Iris-versicolor
Features: [-0.21513002 1.72302557 -1.16946426 -1.18373745], True Label:
Iris-setosa, Predicted Label: Iris-setosa
Wrong Predictions:
Features: [ 0.13072494 -0.81010987 0.72717965 0.48472502], True Label:
Iris-versicolor, Predicted Label: Iris-virginica
Confusion Matrix:
[[13 0 0]
[ 0 15 1]
[ 0 0 9]]
Accuracy: 0.9736842105263158
Session #5
Data Clustering and Evaluation
Learning Objective
To implement K-Means Clustering and evaluate the clustering model.
Learning Context
K-Means
K-means is a data clustering approach for unsupervised machine learning that can
separate unlabeled data into a predetermined number of disjoint groups of equal
variance—clusters—based on their similarities. It's a popular algorithm thanks to its ease
of use and speed on large datasets. K-Means Clustering is an unsupervised learning
algorithm that is used to solve clustering problems in machine learning or data science.
In this topic, we will learn what K-means clustering algorithm is, how the algorithm works,
and the Python implementation of K-means clustering. The algorithm mainly performs
two tasks:
1. It determines the best value for the K center points (centroids) by an iterative process.
2. It assigns each data point to its closest centroid; the data points near a particular
centroid form a cluster.
The below diagram explains the working of the K-means Clustering Algorithm
Predicting optimal clusters is of utmost importance in Cluster Analysis. For given data, we
need to evaluate which Clustering model will best fit the data or which parameters of the
model will give optimal clusters. We often need to compare two clusters or analyze which
model would be optimal to deal with outliers. Different performance and evaluation
metrics are used to evaluate clustering methods. This silhouette index is one of these
evaluation metrics.
Silhouette Index
The Silhouette score is a measure of how similar a data point is to its own cluster
compared with other clusters. A higher silhouette value indicates that the data point is
well matched to its own cluster and poorly matched to other clusters. The best score
value is 1, and -1 is the worst.
● 1: Mean clusters are well apart from each other and clearly distinguished.
● 0: This means clusters are indifferent, or we can say that the distance between
clusters is not significant.
● -1: Means clusters are assigned in the wrong way.
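For a single point i, with a(i) the mean distance from i to the other points in its own cluster and b(i) the mean distance from i to the points of the nearest other cluster, the silhouette value is s(i) = (b(i) − a(i)) / max(a(i), b(i)); the score reported by silhouette_score is the mean of s(i) over all points.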
Exercise
Write a program to evaluate the clustering model.
Dataset
Solutions
# KMeans Clustering
import numpy as nm
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
data=pd.read_csv("Iris.csv")
print(data.head(),"\n")
x=data.iloc[:,3:5].values
wcss=[]
for i in range(1,10):
kmeans=KMeans(n_clusters = i, init = 'k-means++', max_iter = 100,
n_init = 10, random_state = 0).fit(x)
wcss.append(kmeans.inertia_)
plt.plot(range(1, 10), wcss, 'bx-', color='red')
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
kmeans=KMeans(n_clusters=3,init='k-means++',random_state=42)
y_predict=kmeans.fit_predict(x)
print(" Y predict is \n", y_predict)
print("\n Cluster centers are \n", kmeans.cluster_centers_,"\n")
plt.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c =
'red', label = 'Iris-setosa')
plt.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c =
'blue', label = 'Iris-versicolour')
plt.scatter(x[y_predict == 2, 0], x[y_predict== 2, 1], s = 100, c =
'green', label = 'Iris-virginica')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1],
s = 100, c = 'black', label = 'Centroids')
plt.legend()
Y predict is
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2
2 2 2 0 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 2 0 0 0
0
0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
0
0 0]
Cluster centers are
[[5.59583333 2.0375 ]
[1.464 0.244 ]
[4.26923077 1.34230769]]
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to
'auto' in 1.4. Set the value of `n_init` explicitly to suppress the
warning
warnings.warn(
<matplotlib.legend.Legend at 0x7f33aa0061d0>
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
features,_=make_blobs(n_samples=1000,n_features=10,centers=2,cluster_std
=0.5,shuffle=True,random_state=1)
model=KMeans(n_clusters=2,random_state=1).fit(features)
target_predicted=model.labels_
silhouette_score(features,target_predicted)
0.891626556407141
Session #6
Ensemble learning, grid search, and learning and validation curves
Learning Objective
To implement Ensemble Learning, grid search, learning and validation curves.
Learning Context
Random Forest Classifier
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML.
It is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
"Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive accuracy of
that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and based on the majority votes of predictions, it predicts the
final output. The greater number of trees in the forest leads to higher accuracy and
prevents the problem of overfitting.
Grid Search
Grid search is a hyperparameter-tuning technique: the model is trained and evaluated
(typically with cross-validation) for every combination of candidate hyperparameter values
supplied in a parameter grid, and the combination that gives the best score is retained. In
scikit-learn this is provided by GridSearchCV.
Validation Curve
To validate a model we need a scoring function (see Metrics and scoring: quantifying the
quality of predictions), for example accuracy for classifiers. The proper way of choosing
multiple hyperparameters of an estimator is of course grid search or similar methods (see
Tuning the hyper-parameters of an estimator) that select the hyperparameter with the
maximum score on a validation set or multiple validation sets. Note that if we optimize
the hyperparameters based on a validation score the validation score is biased and not a
good estimate of the generalization any longer. To get a proper estimate of the
generalization we have to compute the score on another test set. However, it is
sometimes helpful to plot the influence of a single hyperparameter on the training score
and the validation score to find out whether the estimator is overfitting or underfitting for
some hyperparameter values.
Exercise
Ensemble learning, grid search, and learning and validation curves
Dataset
Solutions
import numpy as nm
import matplotlib.pyplot as plt
import pandas as pd
data=pd.read_csv("Wine.csv")
cols=['Alcohol','Color_Intensity','Proline','Ash_Alcanity']
from sklearn.preprocessing import MinMaxScaler
mms=MinMaxScaler(feature_range=(0,1))
data[cols]=mms.fit_transform(data[cols])
x=data.drop('Customer_Segment',axis=1)
y=data['Customer_Segment']
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_
state=42)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
preds=clf.predict(X_test)
print(preds)
from sklearn.metrics import accuracy_score
accuracy_test_DT=accuracy_score(y_test,preds)
train_preds=clf.predict(X_train)
accuracy_train_DT=accuracy_score(y_train,train_preds)
print('accuracy_train_DT',accuracy_train_DT)
print('accuracy_test_DT',accuracy_test_DT)
[1 1 3 1 2 1 2 3 2 3 1 3 1 2 1 2 2 2 1 2 1 2 2 3 3 3 2 2 2 1 1 2 3 1 1 1
3
3 2 3 1 2 2 2 3 1 2 2 3 1 2 1 1 3]
accuracy_train_DT 1.0
accuracy_test_DT 1.0
#Grid Search
from sklearn.model_selection import GridSearchCV
grid_param={ 'n_estimators':[100,500,800],
'criterion':['gini','entropy'], 'bootstrap':[True,False] }
gd_sr=GridSearchCV(estimator=clf,param_grid=grid_param,scoring='accuracy
',cv=5)
gd_sr.fit(X_train,y_train)
best_parameters=gd_sr.best_params_
print(best_parameters)
best_result=gd_sr.best_score_
print(best_result)
#Validation Curve
import numpy as np
import pandas as pd
import matplotlib.pyplot as mtp
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve
df = pd.read_csv("Wine.csv") # Loading the data
X = df.iloc[:,:-1] # Feature matrix in pd.DataFrame format
y = df.iloc[:,-1] # Target vector in pd.Series format
param_range=np.arange(1, 11)
# Making a Random Forest Classifier object
rf = RandomForestClassifier(n_estimators=100, criterion='gini')
train_score, test_score=validation_curve(rf, X, y,
param_name="max_depth",param_range=param_range, cv=10,
scoring="accuracy")
# Plot the validation curve
mean_train_score = np.mean(train_score, axis = 1)
print("Mean of train score \n", mean_train_score)
mean_test_score = np.mean(test_score, axis = 1)
print("Mean of test score \n",mean_test_score)
mtp.plot(param_range, mean_train_score, label = "Training Score", color
= 'b')
mtp.plot(param_range, mean_test_score,label = "Cross Validation Score",
color = 'g')
mtp.title("Validation Curve with randomforest Classifier")
mtp.xlabel("max depth")
mtp.ylabel("Accuracy")
mtp.tight_layout()
mtp.legend(loc = 'best')
mtp.show()
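The session also mentions learning curves; a minimal sketch using sklearn.model_selection.learning_curve on the same X, y, and rf objects (an addition, not part of the original listing):
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(
    rf, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=10, scoring="accuracy")
# Plot mean training and cross-validation accuracy against training-set size
mtp.plot(train_sizes, np.mean(train_scores, axis=1), label="Training Score", color='b')
mtp.plot(train_sizes, np.mean(test_scores, axis=1), label="Cross Validation Score", color='g')
mtp.title("Learning Curve with random forest Classifier")
mtp.xlabel("Training set size")
mtp.ylabel("Accuracy")
mtp.legend(loc='best')
mtp.show()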
Session #7
Compressing data via dimensionality
reduction: PCA, LDA
Learning Objective
To implement Compressing data via dimensionality reduction like PCA and LDA
Learning Context
Principal Component Analysis
Principal Component Analysis (PCA) is an unsupervised dimensionality reduction
technique. It transforms the original features into a new set of uncorrelated components
(the principal components), ordered by the amount of variance in the data they explain;
keeping only the leading components compresses the data while retaining most of its
information.
Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction
techniques in machine learning to solve more-than-two-class classification problems. It is
also known as Normal Discriminant Analysis (NDA) or Discriminant Function Analysis
(DFA). This can be
used to project the features of higher dimensional space into lower-dimensional space in
order to reduce resources and dimensional costs. In this topic, "Linear Discriminant
Analysis (LDA) in machine learning”, we will discuss the LDA algorithm for classification
predictive modeling problems, limitation of logistic regression, representation of linear
Discriminant analysis model, how to make a prediction using LDA, how to prepare data
for LDA, extensions to LDA and much more. So, let's start with a quick introduction to
Linear Discriminant Analysis (LDA) in machine learning.
Exercise
Compressing data via dimensionality reduction: PCA, LDA
Dataset
#PCA
import pandas as pd
d = pd.read_csv('wineQualityReds.csv')
print(d.head())
x = d.iloc[:,:-1]
y = d.iloc[:,-1]
from sklearn.preprocessing import MinMaxScaler
s = MinMaxScaler()
x = pd.DataFrame(s.fit_transform(x),columns = x.columns)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
from sklearn.decomposition import PCA  # import and PCA object were missing from the listing
pca = PCA(n_components=8)   # number of components assumed from the printed variance ratios
X_train = pca.fit_transform(x_train)
X_test = pca.transform(x_test)
explained_variance = pca.explained_variance_ratio_
print("PCA Variance \n", explained_variance)
from sklearn.neighbors import KNeighborsClassifier
KNN_mod = KNeighborsClassifier(n_neighbors=10)
KNN_mod.fit(X_train,y_train)
pred = KNN_mod.predict(X_test)
from sklearn.metrics import confusion_matrix,accuracy_score
print("PCA Accuracy score is \n", accuracy_score(y_test,pred)*100)
#LDA
import pandas as pd
d = pd.read_csv('wineQualityReds.csv')
x = d.iloc[:,:-1]
y = d.iloc[:,-1]
from sklearn.preprocessing import MinMaxScaler
s = MinMaxScaler()
x = pd.DataFrame(s.fit_transform(x),columns = x.columns)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
# Remaining LDA steps (reconstructed to match the printed output; hyperparameters assumed)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=5)
X_train = lda.fit_transform(x_train, y_train)
X_test = lda.transform(x_test)
print("LDA Variance is \n", lda.explained_variance_ratio_)
KNN_mod = KNeighborsClassifier(n_neighbors=10)
KNN_mod.fit(X_train, y_train)
pred = KNN_mod.predict(X_test)
print("LDA Accuracy score is \n", accuracy_score(y_test,pred)*100)
Output of d.head() (only the last two columns of the first five rows are reproduced here):
   alcohol  quality
0      9.4        5
1      9.8        5
2      9.8        5
3      9.8        6
4      9.4        5
PCA Variance
[0.36091181 0.19343588 0.15019667 0.07150031 0.05365209 0.05066068
0.04207002 0.03260641]
PCA Accuracy score is
57.8125
LDA Variance is
[0.83468126 0.12153022 0.02544075 0.01069278 0.00765499]
LDA Accuracy score is
57.49999999999999
Session #8
Model Evaluation and Optimization
Learning Objective
To implement Model Evaluation and optimization by using K-fold cross-validation.
Learning Context
Splitting of data into training and test data
Data splitting is when data is divided into two or more subsets. Typically, with a two-part
split, one part is used to evaluate or test the data and the other to train the model. Data
splitting is an important aspect of data science, particularly for creating models based on
data. This technique helps ensure that data models, and the processes that use data
models such as machine learning, are accurate.
In a basic two-part data split, the training data set is used to train and develop models.
Training sets are commonly used to estimate different parameters or to compare
different model performances. The testing data set is used after the training is done. The
training and test data are compared to ensure that the final model works correctly.
Scikit-learn, also called sklearn, is the most useful and robust library for machine learning
in Python. The scikit-learn library provides us with the model_selection module, in which
we have the splitter function train_test_split().
Syntax: train_test_split(*arrays, test_size=None, train_size=None, random_state=None,
shuffle=True, stratify=None)
Parameters:
● *arrays: the feature and target sequences to be split.
● test_size / train_size: the proportion (or absolute number) of samples placed in the
test and training sets.
● random_state: seed that makes the shuffling, and hence the split, reproducible.
● shuffle: whether to shuffle the data before splitting (True by default).
● stratify: if set to the target array, the split preserves the class proportions.
Cross Validation
K-Fold Cross-Validation: K-fold cross-validation approach divides the input dataset into K
groups of samples of equal sizes. These samples are called folds. For each learning set,
the prediction function uses k-1 folds, and the rest of the folds are used for the test set.
This approach is a very popular CV approach because it is easy to understand, and the
output is less biased than other methods. The steps for k-fold cross-validation are:
1. Split the dataset into K groups (folds).
2. For each group:
a. Take one group as the reserve or test data set.
b. Use remaining groups as the training dataset
c. Fit the model on the training set and evaluate the performance of the
model using the test set.
Exercise
Model Evaluation and optimization: K-fold cross-validation.
Dataset
import pandas as pd
import numpy as np
dataset = pd.read_csv("wineQualityReds.csv", sep=',')
dataset.head()
X = dataset.iloc[:, 0:11].values
y = dataset.iloc[:, 11].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10, random_state=0)
from sklearn.preprocessing import StandardScaler
feature_scaler = StandardScaler()
X_train = feature_scaler.fit_transform(X_train)
X_test = feature_scaler.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=300, random_state=0)
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
scores=[]
kFold=KFold(n_splits=10,random_state=42,shuffle=True)
for train_index,test_index in kFold.split(X):
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    classifier.fit(X_train, y_train)
    scores.append(classifier.score(X_test, y_test))
print('Cross validation scores are: \n', cross_val_score(classifier, X, y, cv=10))
classifier.fit(X_train,y_train)
print("ACCURACY OF THE MODEL:", classifier.score(X_test,y_test))
Session #9
Regularization
Learning Objective
To reduce the variance of a linear regression model using Lasso and Ridge regularization.
Learning Context
Regularization is one of the most important concepts of machine learning. It is a
technique to prevent the model from overfitting by adding extra information to it.
Sometimes a machine learning model performs well with the training data but does not
perform well with the test data: the model fails to predict the output for unseen data
because it has also fitted the noise in the training output, and such a model is called
overfitted. Regularization allows us to keep all the variables or features in the model while
reducing the magnitude of their coefficients, and hence it maintains accuracy as well as
the generalization of the model. It mainly regularizes or reduces the coefficients of the
features toward zero. In simple words, "in the regularization technique, we reduce the
magnitude of the features' coefficients while keeping the same number of features."
Regularization works by adding a penalty or complexity term to the model. Let's consider
the simple linear regression equation:
y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ⋯ + βₙxₙ + b
In the above equation, y represents the value to be predicted and x₁, x₂, …, xₙ are the
features for y. β₀, β₁, …, βₙ are the weights or magnitudes attached to the features; β₀
represents the bias of the model, and b represents the intercept.
Linear regression models try to optimize the coefficients and b to minimize the cost
function. The cost function for the linear model (the residual sum of squares) is:
Cost = Σᵢ₌₁ᵐ (yᵢ − ŷᵢ)²
where m is the number of training samples and ŷᵢ is the value predicted for the i-th sample.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
Ridge Regression
Ridge regression is a type of linear regression in which a small amount of bias is
introduced so that we can get better long-term predictions. It is a regularization technique
used to reduce the complexity of the model and is also called L2 regularization. In this
technique, the cost function is altered by adding a penalty term to it. The amount of bias
added to the model is called the ridge regression penalty; it is calculated by multiplying λ
by the squared weight of each individual feature. The cost function in ridge regression is
therefore:
Cost = Σᵢ₌₁ᵐ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ⁿ βⱼ²
As we can see from the above equation, if the value of λ tends to zero, the equation
becomes the cost function of the linear regression model. Hence, for a very small value
of λ, the model resembles the linear regression model.
Lasso Regression
Ridge regression reduces overfitting while keeping all the features present in the model;
it reduces the model's complexity only by shrinking the coefficients. Lasso (L1) regression
also helps reduce overfitting, but its penalty term uses the absolute values of the weights
instead of their squares, so some coefficients can be shrunk exactly to zero; it therefore
also performs feature selection. The cost function in lasso regression is:
Cost = Σᵢ₌₁ᵐ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ⁿ |βⱼ|
Exercise
Write a program to reduce variance of a linear regression model using Lasso and Ridge
Regularization
from sklearn.linear_model import Lasso
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
diabetes=load_diabetes()
features = diabetes.data
target = diabetes.target
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
regression = Lasso(alpha =0.5)
model =regression.fit(features_standardized, target)
print(model.coef_)
-0. -8.44971502 3.28432608 24.95304334 2.90702924]
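The exercise also calls for Ridge regularization; a minimal sketch on the same standardized diabetes features (the alpha value is an assumption) is:
from sklearn.linear_model import Ridge

# Ridge (L2) regularized regression on the same standardized features
ridge = Ridge(alpha=0.5)
ridge_model = ridge.fit(features_standardized, target)
print(ridge_model.coef_)   # coefficients are shrunk but, unlike lasso, not set exactly to zero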
Session #10
Perceptron for digits
Learning Objective
To implement perceptron for digits.
Learning Context
Perceptron
A perceptron consists of a weighted summation of its inputs followed by an activation
function; it takes n inputs and generates one output. Frank Rosenblatt (1928 – 1971) was
an American psychologist notable in the field
of Artificial Intelligence. In 1957 he started something really big. He "invented" a
Perceptron program, on an IBM 704 computer at Cornell Aeronautical Laboratory.
Scientists had discovered that brain cells (Neurons) receive input from our senses by
electrical signals. The Neurons, then again, use electrical signals to store information,
and to make decisions based on previous input. Frank had the idea that Perceptrons
could simulate brain principles, with the ability to learn and make decisions. The original
Perceptron was designed to take a number of binary inputs, and produce one binary
output (0 or 1).The idea was to use different weights to represent the importance of each
input, and that the sum of the values should be greater than a threshold value before
making a decision like yes or no (true or false) (0 or 1).
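The decision rule described above can be sketched as follows (the weights and threshold are hypothetical):
import numpy as np

def perceptron_output(inputs, weights, threshold):
    # fire (output 1) only if the weighted sum of the inputs exceeds the threshold
    return 1 if np.dot(inputs, weights) > threshold else 0

x = np.array([1, 0, 1])              # binary inputs
w = np.array([0.7, 0.6, 0.5])        # importance (weights) of each input
print(perceptron_output(x, w, 1.0))  # -> 1, since 0.7 + 0.5 = 1.2 > 1.0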
Exercise
Write a program to implement Perceptron for digits
from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron
X, y = load_digits(return_X_y=True)
clf = Perceptron(tol=1e-3, random_state=0)
clf.fit(X, y)
clf.score(X, y)
0.9393433500278241
Session #11
Feed-Forward Network
Learning Objective
To implement a feed-forward network for the wheat seeds dataset.
Learning Context
We make use of a gradient descent optimization algorithm, which minimizes the error of
our model by iteratively moving in the direction of steepest descent, i.e., the direction in
which updating the parameters of the model reduces the error the most. During training
the network learns a decision boundary that divides the data samples into classes. The
process of passing an input through the network to produce an output, and hence a
prediction, is known as feed forward. The feed-forward neural network is the core of many
other important neural networks, such as the convolutional neural network. In a
feed-forward neural network there are no feedback loops or connections: there is simply
an input layer, one or more hidden layers, and an output layer, and information flows in
one direction from input to output. We will talk more about optimization algorithms and
backpropagation later.
Exercise
Write a program to implement a feed-forward network for the wheat seeds dataset.
Dataset
import pandas as pd
df=pd.read_csv("wheat.csv",index_col=None)
X = df.iloc[:, 0:7].values
y = df.iloc[:, 7].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,random_state=0)
from sklearn.preprocessing import StandardScaler
feature_scaler = StandardScaler()
X_train = feature_scaler.fit_transform(X_train)
X_test = feature_scaler.transform(X_test)
from sklearn.preprocessing import LabelBinarizer
lb=LabelBinarizer()
y_train = lb.fit_transform(y_train)
y_test=lb.transform(y_test)
from sklearn.neural_network import MLPClassifier
#Initializing the MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(100,50), max_iter=300,activation
= 'relu',solver='adam',random_state=1)
# Fit data onto the model
clf.fit(X_train,y_train)
ypred=clf.predict(X_test)
# Import accuracy score
from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy_score(y_test,ypred)
0.9523809523809523
Session #12
Neural Network for Regression
Learning Objective
To implement a neural network for regression.
Learning Context
Keras is an open-source high-level Neural Network library that is written in Python and
capable enough to run on Theano, TensorFlow, or CNTK. It was developed by one of the
Google engineers, Francois Chollet. It is made user-friendly, extensible, and modular for
facilitating faster experimentation with deep neural networks. It not only supports
Convolutional Networks and Recurrent Networks individually but also in combination. It
cannot handle low-level computations, so it makes use of the Backend library to resolve
it. The backend library acts as a high-level API wrapper for the low-level API, which lets it
run on TensorFlow, CNTK, or Theano.
Exercise
Write a program to implement a neural network for regression.
#Load libraries
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras import models
from keras import layers
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Set random seed
np.random.seed(0)

# Generate a regression dataset and split it (these steps were missing from the
# listing; the parameters below are assumptions chosen to make it runnable)
features, target = make_regression(n_samples=10000, n_features=3, n_informative=3,
                                   n_targets=1, noise=0.0, random_state=0)
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.33, random_state=0)

# Build a feed-forward network with a single linear output unit for regression
# (architecture assumed)
network = models.Sequential()
network.add(layers.Dense(units=32, activation='relu',
                         input_shape=(features_train.shape[1],)))
network.add(layers.Dense(units=32, activation='relu'))
network.add(layers.Dense(units=1))
network.compile(loss='mse', optimizer='RMSprop', metrics=['mse'])

# Train the network
history = network.fit(features_train,  # Features
                      target_train,    # Target vector
                      epochs=10,       # Number of epochs
                      verbose=0,       # No output
                      batch_size=100,  # Number of observations per batch
                      validation_data=(features_test, target_test))
# Flatten the predictions to 1-D so the metrics below compare element-wise
predicted_target = network.predict(features_test).flatten()
print(predicted_target)
from sklearn.metrics import r2_score
print("RMS: %r " % np.sqrt(np.mean((predicted_target - target_test) ** 2)))
print("R2= ", r2_score(predicted_target, target_test))
Session #13
Machine Learning Model
Learning Objective
To save and load a trained machine learning model.
Learning Context
JOBLIB
In machine learning, while working with the scikit-learn library, we need to save trained
models to a file and restore them in order to reuse them, to compare them with other
models, and to test them on new data. Saving the data is called serialization, while
restoring the data is called deserialization. We also deal with different types and sizes of
data: some datasets are trained easily, i.e., they take little time to train, but datasets of
large size (more than 1 GB) can take a very long time to train on a local machine, even
with a GPU. When we need the same trained model in a different project, or at a later
time, we store the trained model to avoid wasting the training time, so that it can be
reused anytime in the future.
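Besides joblib (used in the solution below), Python's built-in pickle module can serialize a trained model in the same way; a minimal sketch, where model is a fitted estimator such as the one created in the solution and the file name is arbitrary:
import pickle

# Serialize (save) a fitted model to disk ...
with open("model.pickle", "wb") as f:
    pickle.dump(model, f)

# ... and deserialize (load) it again later
with open("model.pickle", "rb") as f:
    restored_model = pickle.load(f)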
Exercise
Write a program to save and load a trained machine learning model.
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import joblib
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create random forest classifier object
classifer = RandomForestClassifier()
# Train model
model = classifer.fit(features, target)
# Save model as pickle file
joblib.dump(model, "model.pkl")
classifer = joblib.load("model.pkl")
new_observation = [[ 5.2, 3.2, 1.1, 0.1]]
output_class=classifer.predict(new_observation)
print(output_class)
output_class=classifer.predict(iris.data[101].reshape(1,-1))
print(output_class)
print(iris.target[101])
[0]
[2]
2