ML Lab
ML Practical
INDEX
Practical – 1
Aim: Read the numeric data from .CSV file and use some basic operations on it.
Data in the form of tables is also called CSV (comma-separated values) – literally "comma-separated values." CSV files can be easily read and processed by Python.
Basic Operations:
The open() function in Python is used to open a file; it returns a file object.
The returned object is of type "_io.TextIOWrapper", the file object type returned by open().
Create a csv.reader object from the file object, and create an empty list called header. Use the next() method to obtain the header.
The next() method returns the current row and moves the reader to the next row.
The first time you call next(), it returns the header row; the next call returns the first record, and so on.
Extract the rows/records: create an empty list called rows, iterate through the csvreader object, and append each row to the list.
The close() method is used to close the opened file. Once a file is closed, no further operations can be performed on it.
All the data of a data frame can be accessed using the field names.
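A minimal sketch of the reading steps above, assuming the records are stored in a file named students.csv (the file name is only illustrative):
import csv

# open the file; open() returns a file object of type _io.TextIOWrapper
file = open("students.csv", "r")
csvreader = csv.reader(file)

# the first call to next() returns the header row
header = next(csvreader)

# extract the remaining rows/records into a list
rows = []
for row in csvreader:
    rows.append(row)

print(header)
print(rows)

# close the file; no further operations are possible after this
file.close()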
Using csv.writer
1. The csv.writer() function returns a writer object that converts the input data into a delimited string.
For example, let's assume we are recording the data of 3 students (Name, M1 Score, M2 Score). The steps are:
a) Define a filename and Open the file using open().
b) Create a csvwriter object using csv.writer().
c) Write the header.
d) Write the rest of the data.
When the same records are loaded into a pandas DataFrame, the data parameter takes the records/observations and the columns parameter takes the column/field names.
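A sketch of the writing steps, with three illustrative student records; the file name university_records.csv and the values are assumptions for demonstration:
import csv
import pandas as pd

# a) define a filename and the data to be written
filename = "university_records.csv"
header = ["Name", "M1 Score", "M2 Score"]
data = [["Alex", 62, 80], ["Brad", 45, 56], ["Joey", 85, 98]]

# b) open the file and create a csvwriter object
with open(filename, "w", newline="") as file:
    csvwriter = csv.writer(file)
    # c) write the header
    csvwriter.writerow(header)
    # d) write the rest of the data
    csvwriter.writerows(data)

# the same records can be loaded into a pandas DataFrame, where the data
# parameter takes the records and the columns parameter takes the field names
df = pd.DataFrame(data, columns=header)
print(df)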
Practical – 2
Aim: Write a program to demonstrate the working of the decision tree algorithm. Use an appropriate data
set for building the decision tree and apply this knowledge to classify a new sample.
Dataset:
The PlayTennis dataset is saved as a .csv (comma-separated values) file in the current working directory; otherwise, use the complete path of the dataset in the program. Save this dataset as dataset.csv.
Input:
1. Import libraries and read data using read_csv() function. Remove the target from the data and store
attributes in the features variable.
2. Create a class named Node with four members children, value, isLeaf and pred.
3.
4. Define a function called entropy to find the entropy of the dataset.
6. Define a function named ID3 to build the decision tree for the given dataset.
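A condensed sketch of these steps is given below. It assumes the PlayTennis data is stored in dataset.csv with the target column named "answer"; the column name and the simplified tree representation are assumptions for illustration.
import math
import pandas as pd

# step 1: read the data and separate the attributes from the target
data = pd.read_csv("dataset.csv")          # assumed file name
features = [col for col in data.columns if col != "answer"]

# step 2: a Node stores its children, its value, whether it is a leaf,
# and the predicted class (for leaf nodes)
class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""

# step 4: entropy of a set of examples with respect to the target column
def entropy(examples):
    total = len(examples)
    counts = examples["answer"].value_counts()
    return -sum((c / total) * math.log2(c / total) for c in counts)

# information gain obtained by splitting the examples on one attribute
def info_gain(examples, attr):
    gain = entropy(examples)
    for value in examples[attr].unique():
        subset = examples[examples[attr] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# step 6: recursively build the decision tree with the ID3 procedure
def ID3(examples, attrs):
    root = Node()
    if entropy(examples) == 0:             # all examples share one class -> leaf
        root.isLeaf = True
        root.pred = examples["answer"].iloc[0]
        return root
    if not attrs:                          # no attributes left -> majority-class leaf
        root.isLeaf = True
        root.pred = examples["answer"].mode()[0]
        return root
    best = max(attrs, key=lambda a: info_gain(examples, a))
    root.value = best
    for value in examples[best].unique():
        subtree = ID3(examples[examples[best] == value],
                      [a for a in attrs if a != best])
        # each child is stored as a (branch value, subtree) pair for brevity
        root.children.append((value, subtree))
    return root

tree = ID3(data, features)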
Output:
Practical – 3
Aim: Write a program to demonstrate the working of the Random forest Algorithm.
Now we will fit the Random Forest algorithm to the training set. To fit it, we will import the RandomForestClassifier class from the sklearn.ensemble library; a sketch of the fitting code is given after the parameter list below. The classifier object takes the following parameters:
o n_estimators = the required number of trees in the Random Forest. The default value is 10. We can choose any number, but we need to take care of the overfitting issue.
o criterion = the function used to measure the quality of a split. Here we have taken "entropy" for the information gain.
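A sketch of the fitting step with these parameters; x_train and y_train are assumed to come from the same pre-processing and train/test split used in the Logistic Regression example:
# fit the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy")
classifier.fit(x_train, y_train)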
Output:
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
Since our model is fitted to the training set, we can now predict the test result. For prediction, we will create a new prediction vector y_pred. Below is the code for it:
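A minimal sketch of the prediction step (x_test is the test split produced earlier):
# predict the test set results
y_pred = classifier.predict(x_test)
print(y_pred)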
Output:
By checking the above prediction vector and test set real vector, we can determine the incorrect predictions
done by the classifier.
Now we will create the confusion matrix to determine the correct and incorrect predictions. Below is the
code for it:
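A sketch of this step:
# build the confusion matrix from the actual and predicted test labels
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)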
Output:
As we can see in the above matrix, there are 4+4= 8 incorrect predictions and 64+28= 92 correct predictions.
Here we will visualize the training set result. To visualize the training set result, we will plot a graph for the Random Forest classifier. The classifier will predict Yes or No for the users who have either purchased or not purchased the SUV car, as we did in Logistic Regression. Below is the code for it:
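A sketch of the plotting step, assuming x_train holds the two scaled features (Age and Estimated Salary) as NumPy arrays, as in the Logistic Regression example:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# build a fine grid over the two features and colour it by the predicted class
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     np.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(("purple", "green")))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())

# plot the training points on top of the prediction regions
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=["purple", "green"][i], label=j)
plt.title("Random Forest Algorithm (Training set)")
plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.legend()
plt.show()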
Output:
The above image is the visualization result for the Random Forest classifier working with the training set
result. It is very much similar to the Decision tree classifier. Each data point corresponds to each user of the
user_data, and the purple and green regions are the prediction regions. The purple region is classified for the
users who did not purchase the SUV car, and the green region is for the users who purchased the SUV.
So, in the Random Forest classifier, we have taken 10 trees that have predicted Yes or NO for the Purchased
variable. The classifier took the majority of the predictions and provided the result.
Now we will visualize the test set result. Below is the code for it:
Output:
The above image is the visualization result for the test set. We can check that there is a minimum number of
incorrect predictions (8) without the Overfitting issue. We will get different results by changing the number
of trees in the classifier.
Practical – 4
Aim: Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
The Naïve Bayes classifier is a supervised machine learning algorithm, which is used for classification tasks,
like text classification.
The classifier is based on Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood probability: the probability of the evidence B given that the hypothesis A is true.
Sample Data
Input:
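One possible sketch using scikit-learn's GaussianNB; the file name data.csv, the assumption that the last column holds the class label, and the use of numeric attributes are all illustrative (categorical attributes would first need to be label-encoded):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# read the sample data; the last column is assumed to hold the class label
data = pd.read_csv("data.csv")
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# fit the naive Bayes classifier and compute its accuracy on the test set
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))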
Output:
Practical – 5
Aim: Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to
perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy,
precision, and recall for your data set.
1. Collect all words, punctuation, and other tokens that occur in Examples
• Vocabulary ← the set of all distinct words and other tokens occurring in any text document from Examples
2. Calculate the required probability terms
• Docs_j ← the subset of documents from Examples for which the target value is v_j
• P(w_k | v_j) ← (n_k + 1) / (n + |Vocabulary|)
CLASSIFY_NAIVE_BAYES_TEXT (Doc): return the estimated target value for the document Doc
• positions ← all word positions in Doc that contain tokens found in Vocabulary
Dataset:
Input:
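A sketch of one possible implementation with scikit-learn (Python) rather than Java; the file name document.csv, the column layout (message, label), and the pos/neg labels are assumptions for illustration. MultinomialNB applies the same Laplace smoothing, (n_k + 1) / (n + |Vocabulary|), described above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

# read the documents; first column is the text, second the pos/neg label
msg = pd.read_csv("document.csv", names=["message", "label"])
msg["labelnum"] = msg["label"].map({"pos": 1, "neg": 0})

X_train, X_test, y_train, y_test = train_test_split(msg["message"], msg["labelnum"], random_state=0)

# build the Vocabulary and the word-count matrix for each document
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# train the multinomial naive Bayes classifier and evaluate it
clf = MultinomialNB().fit(X_train_counts, y_train)
pred = clf.predict(X_test_counts)
print("Accuracy :", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall   :", recall_score(y_test, pred))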
Output:
Practical – 6
Aim: Write a program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using standard Heart Disease Data Set. You can use Java/Python
ML library classes/API.
Dataset:
The Cleveland database contains 76 attributes, but all published experiments refer to using a
subset of 14 of them. In particular, the Cleveland database is the only one that has been used
by ML researchers to this date. The "Heart disease" field refers to the presence of heart disease
in the patient. It is integer valued from 0 (no presence) to 4.
Database: 0 1 2 3 4 Total
Cleveland: 164 55 36 35 13 303
Attribute Information:
Input:
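A sketch using the pgmpy library; the file name heart.csv, the column names, the evidence value, and the network structure below are assumptions for illustration (older pgmpy releases name the model class BayesianModel instead of BayesianNetwork):
import pandas as pd
from pgmpy.models import BayesianNetwork            # BayesianModel in older pgmpy releases
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

# load the 14-attribute Cleveland subset; column names are assumed
heart = pd.read_csv("heart.csv")

# assumed network structure: a few attributes influencing heartdisease
model = BayesianNetwork([("age", "heartdisease"),
                         ("sex", "heartdisease"),
                         ("cp", "heartdisease"),
                         ("heartdisease", "restecg"),
                         ("heartdisease", "chol")])

# learn the conditional probability tables from the data
model.fit(heart, estimator=MaximumLikelihoodEstimator)

# query the network: distribution of heartdisease given some evidence
infer = VariableElimination(model)
q = infer.query(variables=["heartdisease"], evidence={"restecg": 1})
print(q)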
Output:
Practical – 7
Aim: Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set. Print both
correct and wrong predictions. Java/Python ML library classes can be used for this problem.
Training algorithm:
For each training example (x, f(x)), add the example to the list training_examples.
Classification algorithm:
Given a query instance xq to be classified,
Let x1 . . . xk denote the k instances from training_examples that are nearest to xq
Return f̂(xq), the most common value of f(xi) among these k instances (for a discrete-valued target such as the iris class); for a real-valued target, f̂(xq) would be the mean of the f(xi) values of the k nearest training examples.
Data Set:
Iris Plants Dataset: the dataset contains 150 instances (50 in each of three classes).
Number of Attributes: 4 numeric, predictive attributes and the class.
Input:
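A sketch using scikit-learn's built-in copy of the iris data; k = 3 and the 70/30 split are illustrative choices:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

# load the iris data and split it into training and test sets
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.3, random_state=0)

# fit a k-nearest-neighbour classifier with k = 3
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# print each prediction, marking it as correct or wrong
for actual, predicted in zip(y_test, y_pred):
    result = "Correct" if actual == predicted else "Wrong"
    print("Actual:", iris.target_names[actual],
          "Predicted:", iris.target_names[predicted], "->", result)

print("Confusion Matrix")
print(confusion_matrix(y_test, y_pred))
print("Accuracy Metrics")
print(classification_report(y_test, y_pred))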
Output:
Confusion Matrix
Accuracy Metrics
Practical – 8
Aim: Write a program to demonstrate the working of the K-means clustering algorithm.
Working of K-Means Algorithm
Step 1 − First, we need to specify the number of clusters, K, that need to be generated by this algorithm.
Step 2 − Next, randomly select K data points and assign each data point to a cluster. In simple words,
classify the data based on the number of data points.
Step 3 − Now it will compute the cluster centroids.
Step 4 − Next, keep iterating the following steps until we find the optimal centroids, i.e., until the assignment of data points to clusters no longer changes −
4.1 − First, the sum of squared distance between data points and centroids would be computed.
4.2 − Now, we have to assign each data point to the cluster whose centroid is closest to it.
4.3 − At last compute the centroids for the clusters by taking the average of all data points of that cluster.
We first generate a 2D dataset containing 4 different blobs and then apply the k-means algorithm to see the result.
The following code will generate the 2D dataset containing four blobs –
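A sketch of the data-generation step using scikit-learn's make_blobs; the sample size, spread, and random seed are illustrative:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# generate a 2D dataset with four blobs
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.show()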
Next, make a KMeans object, providing the number of clusters, then train the model and make the prediction as follows –
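Continuing with the X generated above, a minimal sketch is:
from sklearn.cluster import KMeans

# create the KMeans object with the chosen number of clusters,
# train the model, and predict the cluster label of every point
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)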
Now, with the help of the following code, we can plot and visualize the cluster centers picked by the k-means Python estimator −
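Continuing from the previous sketch:
import matplotlib.pyplot as plt

# plot the points coloured by their assigned cluster and overlay the centres
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap="viridis")
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c="black", s=200, alpha=0.5)
plt.show()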
Output:
Practical – 9
Aim: Write a program to demonstrate the working of the Support Vector Machine for Classification
Algorithm.
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used
for Classification as well as Regression problems. However, primarily, it is used for Classification problems
in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.
Now we will implement the SVM algorithm using Python. Here we will use the same dataset user_data,
which we have used in Logistic regression and KNN classification.
Up to the data pre-processing step, the code remains the same as in those examples. Below is the code:
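A sketch of the pre-processing step, assuming user_data.csv has Age and EstimatedSalary in columns 2 and 3 and Purchased in column 4, as in the Logistic Regression example:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# import the dataset (illustrative file name and column layout)
data_set = pd.read_csv("user_data.csv")
x = data_set.iloc[:, [2, 3]].values     # Age and EstimatedSalary
y = data_set.iloc[:, 4].values          # Purchased

# split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)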
After executing the above code, we will pre-process the data. The code will give the dataset as:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will import the SVC class from the sklearn.svm library. Below is the code for it:
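A minimal sketch of this step:
from sklearn.svm import SVC

# create the SVM classifier with a linear kernel and fit it to the training set
classifier = SVC(kernel="linear", random_state=0)
classifier.fit(x_train, y_train)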
In the above code, we have used kernel='linear', since here we are creating an SVM for linearly separable data; however, we can change it for non-linear data. We then fitted the classifier to the training dataset (x_train, y_train).
Output:
The model performance can be altered by changing the value of C (the regularization factor), gamma, and the kernel.
After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference
between the actual value and predicted value.
Now we will see the performance of the SVM classifier, i.e., how many incorrect predictions it makes compared to the Logistic Regression classifier. To create the confusion matrix, we need to import the confusion_matrix function from the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two main parameters: y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
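A sketch covering both the prediction and the confusion matrix:
from sklearn.metrics import confusion_matrix

# predict the test set results and build the confusion matrix
y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)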
Now we will visualize the training set result, below is the code for it:
Output:
As we can see, the above output appears similar to the Logistic Regression output. In the output, we got a straight line as the hyperplane because we have used a linear kernel in the classifier. We have also discussed above that for 2D space, the hyperplane in SVM is a straight line.
Practical – 10
Aim: Write a program to demonstrate the working of the Hierarchical Clustering algorithm.
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all
data points as single clusters and merging them until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is a top-down approach.
In Python, the Scipy and Scikit-Learn libraries have defined functions for hierarchical clustering.
The below examples use these library functions to illustrate hierarchical clustering in Python.
First, we'll import NumPy, matplotlib, and seaborn (for plot styling):
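A sketch of the setup; the small 2D dataset X1 below is an illustrative assumption (the original data is not reproduced here), chosen so that the entry at index one is [1, 1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")   # seaborn is used only for plot styling

# a small illustrative 2D dataset with three loose groups of points
X1 = np.array([[0, 0], [1, 1], [1, 0],
               [5, 5], [6, 5], [5, 6],
               [10, 10], [11, 10], [10, 11]])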
The Scipy library has the linkage function for hierarchical (agglomerative) clustering.
The linkage function has several methods available for calculating the distance between
clusters: single, average, weighted, centroid, median, and ward. We will compare these methods below.
For more details on the linkage function, see the docs.
To draw the dendrogram, we'll use the dendrogram function. Again, for more details of
the dendrogram function, see the docs.
First, we will import the required functions, and then we can form linkages with the various methods:
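Continuing with the X1 defined above:
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# form linkages with the different distance methods
Z_single   = linkage(X1, method="single")
Z_average  = linkage(X1, method="average")
Z_weighted = linkage(X1, method="weighted")
Z_centroid = linkage(X1, method="centroid")
Z_median   = linkage(X1, method="median")
Z_ward     = linkage(X1, method="ward")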
Now, by calling the dendrogram function (which draws with matplotlib), we can view a plot of these linkages:
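A sketch that draws one dendrogram per linkage method:
# draw a dendrogram for each linkage method on its own axes
methods = {"single": Z_single, "average": Z_average, "weighted": Z_weighted,
           "centroid": Z_centroid, "median": Z_median, "ward": Z_ward}
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (name, Z) in zip(axes.ravel(), methods.items()):
    dendrogram(Z, ax=ax)
    ax.set_title(name)
plt.show()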
Notice that each distance method produces different linkages for the same data.
Finally, let's use the fcluster function to find the clusters for the Ward linkage:
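A minimal sketch; the number of flat clusters is an illustrative choice, and the exact labels depend on the data used:
# cut the Ward linkage into three flat clusters and print the label of each point
f1 = fcluster(Z_ward, t=3, criterion="maxclust")
print(f1)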
The fcluster function gives the labels of clusters in the same index order as the data set X1. For example, index one of f1 is 2, which is the cluster label for the entry at index one of X1, which is [1, 1].