ML Lab

The document outlines a series of practical exercises for a Machine Learning laboratory course at I. K. Gujral Punjab Technical University, focusing on algorithms such as decision trees, random forests, and naive Bayes classifiers. Each practical includes aims, steps, and example datasets to help students implement and understand machine learning concepts using Python and Java. The exercises cover data handling, model training, accuracy computation, and visualization techniques.


ML Practical - vfx

Machine Learning Laboratory (I. K. Gujral Punjab Technical University)


INDEX

1. Read the numeric data from a .CSV file and use some basic operations on it.
2. Write a program to demonstrate the working of the decision tree algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new sample.
3. Write a program to demonstrate the working of the Random Forest algorithm.
4. Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
5. Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and recall for your data set.
6. Write a program to construct a Bayesian network considering medical data. Use this model to demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set. You can use Java/Python ML library classes/API.
7. Write a program to implement the k-Nearest Neighbour algorithm to classify the iris data set. Print both correct and wrong predictions. Java/Python ML library classes can be used for this problem.
8. Write a program to demonstrate the working of the K-means clustering algorithm.
9. Write a program to demonstrate the working of the Support Vector Machine for Classification algorithm.
10. Write a program to demonstrate the working of Hierarchical Clustering.


Practical – 1

Aim: Read the numeric data from .CSV file and use some basic operations on it.
Data in the form of tables is also called CSV (comma-separated values) – literally "comma-separated values." CSV files can be easily read and processed by Python.

Step 1: Create a tabular data in Excel

Step 2: Save this tabular data with .CSV extension

Step 3: Upload the .CSV file to Google Colab


Basic Operations:

Reading .csv file

1. Import the csv library.

2. Open the CSV file.

The open() function in Python is used to open files and returns a file object.

The returned object's type is "_io.TextIOWrapper", the file object produced by open().

3. Use the csv.reader object to read the CSV file.


csvreader = csv.reader(file)

4. Extract the field names.

Use the next() function to obtain the header.

next() returns the current row and moves the reader to the next row.

The first time you call next(), it returns the header; the next call returns the first record, and

so on.


5. Extract the rows/records. Create an empty list called rows, iterate through the csvreader object,

and append each row to the rows list.

6. Close the file.

.close() method is used to close the opened file. Once it is closed, we cannot perform any operations on it.
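A complete sketch of steps 1-6 (a minimal example, assuming a file named dataset.csv in the working directory):

import csv

file = open("dataset.csv", "r")       # step 2: open the CSV file
csvreader = csv.reader(file)          # step 3: create the reader object

header = next(csvreader)              # step 4: the first row holds the field names

rows = []                             # step 5: collect the records
for row in csvreader:
    rows.append(row)

file.close()                          # step 6: close the file
print(header)
print(rows)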

Read CSV Files Using Pandas


1. Import pandas library.

2. Load CSV files to pandas using read_csv().

Basic Syntax: pandas.read_csv(filename, delimiter=',')


3. Extract the field names.

.columns is used to obtain the header/field names.

4. Extract the rows.

All the data of a data frame can be accessed using the field names.
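A minimal pandas sketch of steps 1-4, again assuming dataset.csv; the column name "outlook" is only an illustration:

import pandas as pd

df = pd.read_csv("dataset.csv", delimiter=",")
print(df.columns)        # step 3: the header/field names
print(df["outlook"])     # step 4: access data through a field name (hypothetical column)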

Write to a .CSV file:

Using csv.writer
1. The csv.writer() function returns a writer object that converts the input data into a delimited string.
For example, let’s assume we are recording the data of 3 students (Name, M1 Score, M2 Score)

2. Import csv library:

3.
a) Define a filename and Open the file using open().
b) Create a csvwriter object using csv.writer().
c) Write the header.
d) Write the rest of the data.

Code for steps a-d
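A sketch of steps a-d for the 3-student example; the names and scores are hypothetical:

import csv

filename = "students.csv"
header = ["Name", "M1 Score", "M2 Score"]
rows = [["Alice", 62, 80], ["Bob", 45, 56], ["Charlie", 77, 82]]

with open(filename, "w", newline="") as file:   # a) open the file
    csvwriter = csv.writer(file)                # b) create the writer object
    csvwriter.writerow(header)                  # c) write the header
    csvwriter.writerows(rows)                   # d) write the rest of the data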


Write CSV Using Pandas

1. Import pandas library.

2. Create a pandas dataframe using pd.DataFrame.

Syntax: pd.DataFrame(data, columns)

The data parameter takes the records/observations, and the columns parameter takes the columns/field
names.
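A matching pandas sketch with the same hypothetical records, written out via to_csv:

import pandas as pd

rows = [["Alice", 62, 80], ["Bob", 45, 56], ["Charlie", 77, 82]]
df = pd.DataFrame(rows, columns=["Name", "M1 Score", "M2 Score"])
df.to_csv("students.csv", index=False)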


Practical – 2

Aim: Write a program to demonstrate the working of the decision tree algorithm. Use an appropriate data
set for building the decision tree and apply this knowledge to classify a new sample.

Dataset:

The PlayTennis dataset is saved as a .csv (comma-separated values) file named dataset.csv in the
current working directory; otherwise, use the complete path of the dataset in the program.

outlook temperature humidity wind answer


sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no

Input:

1. Import libraries and read data using read_csv() function. Remove the target from the data and store
attributes in the features variable.

2. Create a class named Node with four members children, value, isLeaf and pred.
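A minimal sketch of the Node class with the four members named in the step:

class Node:
    def __init__(self, value=""):
        self.children = []   # child nodes (branches / subtrees)
        self.value = value   # attribute name (internal node) or attribute value (branch)
        self.isLeaf = False  # True when the node carries a final prediction
        self.pred = ""       # predicted class label for a leaf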


3.

4. Define a function called entropy to find the entropy of the dataset.

5. Define a function named info_gain to find the gain of the attribute.
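A sketch of steps 4-5, assuming the dataset is loaded with pandas and the target column is named "answer" as in the table above:

import math

def entropy(examples):
    # Entropy of the binary target column: -p*log2(p) - n*log2(n).
    pos = (examples["answer"] == "yes").sum()
    neg = (examples["answer"] == "no").sum()
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / (pos + neg)
    n = neg / (pos + neg)
    return -p * math.log2(p) - n * math.log2(n)

def info_gain(examples, attr):
    # Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v).
    gain = entropy(examples)
    for v in examples[attr].unique():
        subset = examples[examples[attr] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain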


6. Define a function named ID3 to get the decision tree for the given dataset
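A recursive sketch of ID3, building on the Node, entropy, and info_gain sketches above:

def ID3(examples, attrs):
    root = Node()
    # Pick the attribute with the highest information gain as the split.
    best = max(attrs, key=lambda a: info_gain(examples, a))
    root.value = best
    for v in examples[best].unique():
        subset = examples[examples[best] == v]
        branch = Node(v)
        if entropy(subset) == 0.0:
            # Pure subset: the branch becomes a leaf with a fixed prediction.
            branch.isLeaf = True
            branch.pred = subset["answer"].iloc[0]
        else:
            remaining = [a for a in attrs if a != best]
            branch.children.append(ID3(subset, remaining))
        root.children.append(branch)
    return root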


7. Define a function named printTree to draw the decision tree

8. Define a function named classify to classify the new example

9. Finally, call the ID3, printTree and classify functions
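A sketch of steps 7-9, continuing the names above. For the PlayTennis data, ID3 is expected to split on outlook at the root (overcast → yes; sunny → humidity; rain → wind):

def printTree(root, depth=0):
    print("  " * depth + str(root.value), "->", root.pred if root.isLeaf else "")
    for child in root.children:
        printTree(child, depth + 1)

def classify(root, sample):
    # Follow the branch whose value matches the sample's value for the split attribute.
    for branch in root.children:
        if branch.value == sample[root.value]:
            if branch.isLeaf:
                return branch.pred
            return classify(branch.children[0], sample)

import pandas as pd
data = pd.read_csv("dataset.csv")
features = [c for c in data.columns if c != "answer"]
tree = ID3(data, features)
printTree(tree)
new_sample = {"outlook": "sunny", "temperature": "cool",
              "humidity": "high", "wind": "strong"}
print("Prediction:", classify(tree, new_sample))   # expected: no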


Output:


Practical – 3

Aim: Write a program to demonstrate the working of the Random forest Algorithm.

1.Data Pre-Processing Step:

Below is the code for the pre-processing step:
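A hedged sketch of the pre-processing step, assuming the user_data.csv file referenced later in the text (Age and EstimatedSalary as features, Purchased as the target, in the usual column order):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = pd.read_csv("user_data.csv")
x = dataset.iloc[:, [2, 3]].values    # Age, EstimatedSalary
y = dataset.iloc[:, 4].values         # Purchased

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

sc = StandardScaler()                 # feature scaling
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)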



2. Fitting the Random Forest algorithm to the training set:

Now we will fit the Random Forest algorithm to the training set. To fit it, we will import
the RandomForestClassifier class from the sklearn.ensemble library. The code is given after the
parameter list below. In that code, the classifier object takes the following parameters:

o n_estimators= The required number of trees in the Random Forest. The default value is 10. We can
choose any number but need to take care of the overfitting issue.
o criterion= It is a function to analyze the accuracy of the split. Here we have taken "entropy" for the
information gain.
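A sketch of the fitting step, continuing the x_train/y_train names from the pre-processing sketch (the parameter values match the printed output below):

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10, criterion="entropy")
classifier.fit(x_train, y_train)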

Output:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

3. Predicting the Test Set result

Since our model is fitted to the training set, we can now predict the test result. For prediction, we will
create a new prediction vector y_pred. Below is the code for it:
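A one-line sketch of the prediction step:

y_pred = classifier.predict(x_test)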

Output:

The prediction vector is given as:

By checking the above prediction vector and test set real vector, we can determine the incorrect predictions
done by the classifier.


4. Creating the Confusion Matrix

Now we will create the confusion matrix to determine the correct and incorrect predictions. Below is the
code for it:
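A sketch of the confusion-matrix step:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)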

Output:

As we can see in the above matrix, there are 4+4= 8 incorrect predictions and 64+28= 92 correct predictions.


5. Visualizing the training set result

Here we will visualize the training set result. To do so, we will plot a graph for the Random Forest
classifier. The classifier predicts Yes or No for users who have either purchased or not purchased
the SUV car, as we did in Logistic Regression. Below is the code for it:
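A plotting sketch in the tutorial's style, continuing the earlier names (x_train, y_train, classifier); the grid step size and colors are choices, not requirements:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(
    np.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
    np.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(("purple", "green")))
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(("purple", "green"))(i), label=j)
plt.title("Random Forest Algorithm (Training set)")
plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.legend()
plt.show()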

Output:

The above image is the visualization result for the Random Forest classifier working with the training set
result. It is very similar to the Decision tree classifier. Each data point corresponds to a user of the
user_data set, and the purple and green regions are the prediction regions. The purple region is classified
for users who did not purchase the SUV car, and the green region is for users who purchased the SUV.

So, in the Random Forest classifier, we have taken 10 trees that predict Yes or No for the Purchased
variable. The classifier takes the majority of the predictions and provides the result.


6. Visualizing the test set result

Now we will visualize the test set result. Below is the code for it:
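A minimal continuation, reusing the plotting routine from the training-set sketch with the test split:

x_set, y_set = x_test, y_test
# ...repeat the meshgrid/contourf/scatter code from the training-set sketch...
plt.title("Random Forest Algorithm (Test set)")
plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.legend()
plt.show()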

Output:

The above image is the visualization result for the test set. We can check that there is a minimal number of
incorrect predictions (8) and no overfitting issue. We will get different results by changing the number
of trees in the classifier.


Practical – 4

Aim: Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a
.CSV file. Compute the accuracy of the classifier, considering a few test data sets.

The Naïve Bayes classifier is a supervised machine learning algorithm, which is used for classification tasks,
like text classification.

By Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B)

Where,

P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.

P(A) is the prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the marginal probability: the probability of the evidence.

Sample Data


Input:
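A hedged sketch (scikit-learn's GaussianNB stands in for a from-scratch classifier), assuming a numeric dataset.csv whose last column is the class label:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

data = pd.read_csv("dataset.csv")
X = data.iloc[:, :-1]           # attributes
y = data.iloc[:, -1]            # class label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
model = GaussianNB()
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))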


Output:


Practical – 5

Aim: Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to
perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy,
precision, and recall for your data set.

1. Collect all words, punctuation, and other tokens that occur in Examples:

• Vocabulary ← the set of all distinct words and other tokens occurring in any text document from

Examples

2. Calculate the required P(vj) and P(wk|vj) probability terms:

 For each target value vj in V do

 docsj ← the subset of documents from Examples for which the target value is vj

 P(vj) ← |docsj| / |Examples|

 Textj ← a single document created by concatenating all members of docsj

 n ← total number of distinct word positions in Textj

 for each word wk in Vocabulary

 nk ← number of times word wk occurs in Textj

 P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)

CLASSIFY_NAIVE_BAYES_TEXT (Doc): Return the estimated target value for the document Doc. ai

denotes the word found in the ith position within Doc.

• positions ← all word positions in Doc that contain tokens found in Vocabulary

• Return vNB, where vNB = argmax over vj in V of P(vj) · Π (over i in positions) P(ai|vj)


Dataset:


Input:
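A hedged sketch using scikit-learn (the aim permits built-in classes/APIs); the file name document.csv and its columns (message, label with values pos/neg) are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

msgs = pd.read_csv("document.csv", names=["message", "label"])
msgs["labelnum"] = msgs.label.map({"pos": 1, "neg": 0})
X_train, X_test, y_train, y_test = train_test_split(msgs.message,
                                                    msgs.labelnum,
                                                    random_state=0)
cv = CountVectorizer()                       # bag-of-words vocabulary
clf = MultinomialNB()
clf.fit(cv.fit_transform(X_train), y_train)
pred = clf.predict(cv.transform(X_test))
print("Accuracy :", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall   :", recall_score(y_test, pred))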


Output:


Practical – 6

Aim: Write a program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using standard Heart Disease Data Set. You can use Java/Python
ML library classes/API.

Dataset:

Title: Heart Disease Databases

The Cleveland database contains 76 attributes, but all published experiments refer to using a
subset of 14 of them. In particular, the Cleveland database is the only one that has been used
by ML researchers to date. The "Heart disease" field refers to the presence of heart disease
in the patient. It is integer valued from 0 (no presence) to 4.

Class:       0    1   2   3   4   Total
Cleveland:  164   55  36  35  13   303
Attribute Information:

1. age: age in years


2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
 Value 1: typical angina
 Value 2: atypical angina
 Value 3: non-anginal pain
 Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
 Value 0: normal
 Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or
depression of > 0.05 mV)
 Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
 Value 1: upsloping
 Value 2: flat
 Value 3: downsloping
12. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
13. Heartdisease: It is integer valued from 0 (no presence) to 4.


Input:
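A hedged sketch with the pgmpy library, assuming a file heart.csv holding the 14-attribute Cleveland subset with a heartdisease column; the chosen edges are illustrative, not the dataset's definitive structure (older pgmpy versions name the class BayesianModel):

import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

data = pd.read_csv("heart.csv")
model = BayesianNetwork([("age", "heartdisease"), ("sex", "heartdisease"),
                         ("exang", "heartdisease"), ("cp", "heartdisease"),
                         ("heartdisease", "restecg"), ("heartdisease", "chol")])
model.fit(data, estimator=MaximumLikelihoodEstimator)  # learn the CPDs

infer = VariableElimination(model)
# Diagnose given evidence, e.g. an abnormal resting ECG result.
q = infer.query(variables=["heartdisease"], evidence={"restecg": 1})
print(q)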


Output:


Practical – 7

Aim: Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set. Print both
correct and wrong predictions. Java/Python ML library classes can be used for this problem.

K-Nearest Neighbor Algorithm

Training algorithm:
 For each training example (x, f(x)), add the example to the list training_examples.

Classification algorithm:
 Given a query instance xq to be classified,
Let x1 . . . xk denote the k instances from training_examples that are nearest to xq.
Return f̂(xq) ← argmax over v in V of Σ (i = 1 to k) δ(v, f(xi)), where δ(a, b) = 1 if a = b and 0 otherwise.

For a continuous-valued target function, f̂(xq) is instead the mean value of f(xi) over the k nearest training examples.

Data Set:
Iris Plants Dataset: Dataset contains 150 instances (50 in each of three classes) Number of Attributes: 4
numeric, predictive attributes and the Class


Input:
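A minimal sketch, assuming scikit-learn's bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.3,
                                                    random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
for x, p, t in zip(X_test, pred, y_test):
    # Print each sample with whether its prediction was correct or wrong.
    print("Correct" if p == t else "Wrong  ", x, "predicted:", p, "actual:", t)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))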


Output:

sepal-length sepal-width petal-length petal-width


[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
. . . . .
. . . . .

[6.2 3.4 5.4 2.3]


[5.9 3. 5.1 1.8]]

class: 0-Iris-Setosa, 1- Iris-Versicolour, 2- Iris-Virginica


[0 0 0 ………0 0 1 1 1 …………1 1 2 2 2 ………… 2 2]

Confusion Matrix

Accuracy Metrics

Precision recall f1-score support

0 1.00 1.00 1.00 20


1 0.91 1.00 0.95 10
2 1.00 0.93 0.97 15

avg / total 0.98 0.98 0.98 45


Practical – 8

Aim: Write a program to demonstrate the working of the K-means clustering algorithm.
Working of K-Means Algorithm
 Step 1 − First, we need to specify the number of clusters, K, that need to be generated by this algorithm.
 Step 2 − Next, randomly select K data points and assign each data point to a cluster. In simple words,
partition the data based on the number of clusters.
 Step 3 − Now compute the cluster centroids.
 Step 4 − Next, keep iterating the following until we find the optimal centroids, that is, until the
assignment of data points to the clusters no longer changes −

4.1 − First, the sum of squared distances between data points and centroids is computed.

4.2 − Now, assign each data point to the cluster whose centroid is closest.

4.3 − At last, compute the centroid of each cluster by taking the average of all data points in that cluster.

Implementing K-Means clustering algorithm:

We first generate a 2-D dataset containing four different blobs, and after that we apply the k-means
algorithm to see the result.

First, we will start by importing the necessary packages −
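A sketch of the imports (seaborn only sets the plot style):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs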

The following code will generate the 2-D dataset containing four blobs –
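A sketch of the blob generation, continuing the imports above; the sample count and spread are assumptions:

X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60,
                       random_state=0)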

Next, the following code will help us to visualize the dataset –
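A matching visualization sketch:

plt.scatter(X[:, 0], X[:, 1], s=20)
plt.show()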


Next, create an object of KMeans, providing the number of clusters; train the model and make the
predictions as follows –
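A sketch of the fit/predict step with four clusters:

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)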

Now, with the help of following code we can plot and visualize the cluster’s centers picked by k-means
Python estimator −
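A sketch of the cluster-center plot; cluster_centers_ holds the centroids found by KMeans:

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap="viridis")
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c="black", s=200, alpha=0.5)
plt.show()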

Output:


Practical – 9

Aim: Write a program to demonstrate the working of the Support Vector Machine for Classification
Algorithm.

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used
for Classification as well as Regression problems. However, primarily, it is used for Classification problems
in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.

Use of a kernel in the SVM algorithm:


The SVM kernel is a function that takes low-dimensional input space and transforms it into a higher-
dimensional space, i.e., it converts non-separable problems into separable problems. It is mostly useful in
non-linear separation problems. Simply put, the kernel performs some extremely complex data
transformations and then finds the process to separate the data based on the labels or outputs defined.


Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same dataset user_data,
which we have used in Logistic regression and KNN classification.

o Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:
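A hedged sketch of the shared pre-processing step, assuming the same user_data.csv layout as before:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = pd.read_csv("user_data.csv")
x = dataset.iloc[:, [2, 3]].values    # Age, EstimatedSalary
y = dataset.iloc[:, 4].values         # Purchased
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)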

After executing the above code, we will pre-process the data. The code will give the dataset as:


Fitting the SVM classifier to the training set:

Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will
import SVC class from Sklearn.svm library. Below is the code for it:
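A sketch of the fitting step, continuing the names above:

from sklearn.svm import SVC

classifier = SVC(kernel="linear", random_state=0)
classifier.fit(x_train, y_train)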

In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data.
However, we can change it for non-linear data. We then fitted the classifier to the training
dataset (x_train, y_train).

Output:

The model performance can be altered by changing the value of C (the regularization factor), gamma, and the
kernel.

o Predicting the test set result:


Now, we will predict the output for the test set. For this, we will create a new vector y_pred. Below is the
code for it:
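A one-line sketch of the prediction step:

y_pred = classifier.predict(x_test)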

After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference
between the actual value and predicted value.

o Creating the confusion matrix:

Now we will see the performance of the SVM classifier: how many incorrect predictions it makes as
compared to the Logistic regression classifier. To create the confusion matrix, we need to import
the confusion_matrix function from the sklearn library. After importing the function, we will call it and store
the result in a new variable cm. The function takes two main parameters, y_true (the actual values) and
y_pred (the values returned by the classifier). Below is the code for it:
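A sketch of the confusion-matrix step:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)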


o Visualizing the training set result:

Now we will visualize the training set result, below is the code for it:
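A plotting sketch in the same style as the earlier classifiers, continuing the names above; the colors and grid step are choices, not requirements:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(
    np.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
    np.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(("red", "green")))
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(("red", "green"))(i), label=j)
plt.title("SVM classifier (Training set)")
plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.legend()
plt.show()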

Output:


As we can see, the above output appears similar to the Logistic regression output. In the output, we got
a straight line as the hyperplane because we used a linear kernel in the classifier. And as discussed
above, for 2-D space the hyperplane in SVM is a straight line.

o Visualizing the test set result:
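A minimal continuation, reusing the plotting routine above with the test split:

x_set, y_set = x_test, y_test
# ...repeat the meshgrid/contourf/scatter code from the training-set sketch...
plt.title("SVM classifier (Test set)")
plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.legend()
plt.show()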


Practical – 10

Aim: Write a program to demonstrate the working of the Hierarchical Clustering


Hierarchical clustering is an unsupervised learning method for clustering data points. The algorithm builds
clusters by measuring the dissimilarities between data. Unsupervised learning means that a model does not
have to be trained, and we do not need a "target" variable. This method can be used on any data to visualize
and interpret the relationship between individual data points.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all
data points as single clusters and merging them until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is a top-down approach.

Hierarchical Clustering with Python

In Python, the Scipy and Scikit-Learn libraries have defined functions for hierarchical clustering.
The below examples use these library functions to illustrate hierarchical clustering in Python.
First, we'll import NumPy, matplotlib, and seaborn (for plot styling):
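A sketch of the imports (seaborn only supplies the plot styling):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()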

Next, we'll define a small sample data set:
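A hypothetical sample data set (the manual's original values are not reproduced here, so your exact results may differ):

X1 = np.array([[0, 0], [1, 1], [1, 0], [8, 8], [9, 8],
               [8, 9], [0, 8], [1, 9], [9, 0]])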


Graph this data set as a scatter plot:
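A matching scatter-plot sketch:

plt.scatter(X1[:, 0], X1[:, 1], s=50)
plt.show()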

We will use this data set to perform hierarchical clustering.


Hierarchical Clustering using Scipy

The Scipy library has the linkage function for hierarchical (agglomerative) clustering.
The linkage function has several methods available for calculating the distance between
clusters: single, average, weighted, centroid, median, and ward. We will compare these methods below.
For more details on the linkage function, see the docs.
To draw the dendrogram, we'll use the dendrogram function. Again, for more details of
the dendrogram function, see the docs.
First, we will import the required functions, and then we can form linkages with the various methods:
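A sketch forming linkages with each method, continuing the X1 array above:

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

methods = ["single", "average", "weighted", "centroid", "median", "ward"]
linkages = {m: linkage(X1, method=m) for m in methods}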

Now, by passing the linkage results to the dendrogram function, we can view a plot of these linkages with matplotlib:
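A sketch drawing one dendrogram per method:

for m in methods:
    dendrogram(linkages[m])
    plt.title(m)
    plt.show()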


Notice that each distance method produces different linkages for the same data.
Finally, let's use the fcluster function to find the clusters for the Ward linkage:
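A sketch with fcluster on the Ward linkage; cutting the tree into two clusters is an assumption, and the printed labels depend on the sample data used:

f1 = fcluster(linkages["ward"], t=2, criterion="maxclust")
print(f1)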


The fcluster function gives the labels of clusters at the same indexes as the data set X1. For example,
index one of f1 is 2, which is the cluster for the entry at index one of X1, which is [1, 1].
