
LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING

(AUTONOMOUS)

Department of Computer Science & Engineering

20CS58 Data Mining Using Python Lab

Name of the Student:

Registered Number:

Branch & Section:

Academic Year: 2022-23


LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING
(AUTONOMOUS)

CERTIFICATE

Certified that this is a bona fide record of the practical work

done in ______________________________________________________________Laboratory

by __________________________________ with Regd. No. ______________________________

of __________________B.Tech Course__________ ___________Semester in

______________________________________________________ Branch during the

Academic Year 2022-23

No. of Experiments held: ________

No. of Experiments Done: ________

2023 Signature of the Faculty

INTERNAL EXAMINER EXTERNAL EXAMINER


Module 1
1.Loading the dataset:
Dataset:
A dataset is a collection of data that is used to train the model. A dataset acts as an example to
teach the machine learning algorithm how to make predictions.
Program

import pandas as pd
data=pd.read_csv(r"D:\New folder\emp.csv")
data

Output

   Country  Gender   Age   Salary Purchased
0   France    Male  44.0  72000.0        No
1    Spain  Female  27.0  48000.0       Yes
2  Germany    Male  30.0  54000.0        No
3    Spain    MAle  38.0  61000.0        No
4  Germany  Female  40.0      NaN       Yes
5   France  Female  35.0  58000.0       Yes
6    Spain  Female   NaN  52000.0        No
7   France    Male  48.0  79000.0       Yes
8  Germany    Male  50.0  83000.0        No
9   France    Male  37.0  67000.0       Yes

2.Identifying dependent and Independent variables


Dependent Variables:
The values of this variable depend on other variables. It is the outcome that you're studying. It's
also known as the response variable, outcome variable, or left-hand variable. Statisticians commonly
denote it using Y. Traditionally, graphs place dependent variables on the vertical, or Y, axis.
Independent Variables:
The name helps you understand their role in statistical analysis. These variables are
independent: they stand alone, and other variables in the model do not influence them. The
researchers are not seeking to understand what causes the independent variables to change.
Program

x=data.iloc[:,:-1].values
y=data.iloc[:,-1].values
print(x)
print(y)
Output

[['France' 'Male' 44.0 72000.0]
['Spain' 'Female' 27.0 48000.0]
['Germany' 'Male' 30.0 54000.0]
['Spain' 'MAle' 38.0 61000.0]
['Germany' 'Female' 40.0 nan]
['France' 'Female' 35.0 58000.0]
['Spain' 'Female' nan 52000.0]
['France' 'Male' 48.0 79000.0]
['Germany' 'Male' 50.0 83000.0]
['France' 'Male' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

3.Accessing specific columns

Program

x=data.iloc[:,1:-1].values
y=data.iloc[:,[0,4]].values
print(x)
print(y)

Output

[['Male' 44.0 72000.0]
['Female' 27.0 48000.0]
['Male' 30.0 54000.0]
['MAle' 38.0 61000.0]
['Female' 40.0 nan]
['Female' 35.0 58000.0]
['Female' nan 52000.0]
['Male' 48.0 79000.0]
['Male' 50.0 83000.0]
['Male' 37.0 67000.0]]
[['France' 'No']
['Spain' 'Yes']
['Germany' 'No']
['Spain' 'No']
['Germany' 'Yes']
['France' 'Yes']
['Spain' 'No']
['France' 'Yes']
['Germany' 'No']
['France' 'Yes']]

4.Dealing with missing values


Missing Values:
Missing data is defined as the values or data that are not stored (or not present) for some
variable(s) in the given dataset. In the emp.csv data loaded earlier, for example, the 'Age' and 'Salary'
columns each contain a missing entry. In Pandas, missing values are usually represented by NaN,
which stands for Not a Number.
Handling Missing Values:
 Deleting the Missing Values.
 Imputing the Missing Values
Deleting the Missing Values:
This approach is not recommended; it is one of the quick and dirty techniques one can
use to deal with missing values. If the missing value is of the type Missing Not At Random
(MNAR), then it should not be deleted. A minimal deletion sketch is shown below.
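As a quick illustration of the deletion approach (a minimal sketch using pandas' dropna; the toy
values are made up to mirror the Age/Salary gaps in emp.csv, and this is not part of the lab program):

import numpy as np
import pandas as pd
df = pd.DataFrame({'Age': [44.0, 27.0, np.nan], 'Salary': [72000.0, np.nan, 54000.0]})
print(df.dropna())                 # drops every row containing any NaN
print(df.dropna(subset=['Age']))   # drops rows only where 'Age' is missing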
Imputing Missing Values:
This approach focuses on replacing the missing data with the mean or the most frequent value of
that feature.
Program
import numpy as np

# Importing the SimpleImputer class
from sklearn.impute import SimpleImputer

# Imputer object using the mean strategy and
# missing_values type for imputation
imputer = SimpleImputer(missing_values = np.nan,
                        strategy ='mean')

data = [[12, np.nan, 34], [10, 32, np.nan],
        [np.nan, 11, 20]]

print("Original Data : \n", data)

# Fitting the data to the imputer object
imputer = imputer.fit(data)

# Imputing the data
data = imputer.transform(data)

print("Imputed Data : \n", data)

Output

Original Data :
[[12, nan, 34], [10, 32, nan], [nan, 11, 20]]
Imputed Data :
[[12. 21.5 34. ]
[10. 32. 27. ]
[11. 11. 20. ]]

Module-2
Demonstrate the following data preprocessing tasks using python libraries.
Data Preprocessing:
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning
model. It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, we do not always come across clean and formatted
data, and before doing any operation with data, it is necessary to clean it and put it in a formatted way.
This is why we use data preprocessing tasks.
1.Dealing with the Categorical Data:
Categorical data:
Categorical data is statistical data consisting of variables whose values are divided into
categories.
One-hot Encoding:
In this method, each category is mapped to a vector of 1s and 0s denoting the
presence or absence of the feature. The number of vectors depends on the number of categories of
the feature.
Program:
#loading the libraries
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#x is assumed to already hold a feature matrix whose fourth column (index 3) is
#categorical; the output below corresponds to a startups-style dataset with a
#categorical 'State' column, not the emp.csv data loaded in Module 1
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[3])],remainder='passthrough')
x=np.array(ct.fit_transform(x))
print(x)

Output
[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
[1.0 0.0 0.0 162597.7 151377.59 443898.53]
[0.0 1.0 0.0 153441.51 101145.55 407934.54]
[0.0 0.0 1.0 144372.41 118671.85 383199.62]
[0.0 1.0 0.0 142107.34 91391.77 366168.42]
[0.0 0.0 1.0 131876.9 99814.71 362861.36]
[1.0 0.0 0.0 134615.46 147198.87 127716.82]
[0.0 1.0 0.0 130298.13 145530.06 323876.68]
[0.0 0.0 1.0 120542.52 148718.95 311613.29]
[1.0 0.0 0.0 123334.88 108679.17 304981.62]
[0.0 1.0 0.0 101913.08 110594.11 229160.95]
[1.0 0.0 0.0 100671.96 91790.61 249744.55]
[0.0 1.0 0.0 93863.75 127320.38 249839.44]
[1.0 0.0 0.0 91992.39 135495.07 252664.93]
[0.0 1.0 0.0 119943.24 156547.42 256512.92]
[0.0 0.0 1.0 114523.61 122616.84 261776.23]
[1.0 0.0 0.0 78013.11 121597.55 264346.06]
[0.0 0.0 1.0 94657.16 145077.58 282574.31]
[0.0 1.0 0.0 91749.16 114175.79 294919.57]
[0.0 0.0 1.0 86419.7 153514.11 0.0]
[1.0 0.0 0.0 76253.86 113867.3 298664.47]
[0.0 0.0 1.0 78389.47 153773.43 299737.29]
[0.0 1.0 0.0 73994.56 122782.75 303319.26]
[0.0 1.0 0.0 67532.53 105751.03 304768.73]
[0.0 0.0 1.0 77044.01 99281.34 140574.81]
[1.0 0.0 0.0 64664.71 139553.16 137962.62]
[0.0 1.0 0.0 75328.87 144135.98 134050.07]
[0.0 0.0 1.0 72107.6 127864.55 353183.81]
[0.0 1.0 0.0 66051.52 182645.56 118148.2]
[0.0 0.0 1.0 65605.48 153032.06 107138.38]
[0.0 1.0 0.0 61994.48 115641.28 91131.24]
[0.0 0.0 1.0 61136.38 152701.92 88218.23]
[1.0 0.0 0.0 63408.86 129219.61 46085.25]
[0.0 1.0 0.0 55493.95 103057.49 214634.81]
[1.0 0.0 0.0 46426.07 157693.92 210797.67]
[0.0 0.0 1.0 46014.02 85047.44 205517.64]
[0.0 1.0 0.0 28663.76 127056.21 201126.82]
[1.0 0.0 0.0 44069.95 51283.14 197029.42]
[0.0 0.0 1.0 20229.59 65947.93 185265.1]
[1.0 0.0 0.0 38558.51 82982.09 174999.3]
[1.0 0.0 0.0 28754.33 118546.05 172795.67]
[0.0 1.0 0.0 27892.92 84710.77 164470.71]
[1.0 0.0 0.0 23640.93 96189.63 148001.11]
[0.0 0.0 1.0 15505.73 127382.3 35534.17]
[1.0 0.0 0.0 22177.74 154806.14 28334.72]
[0.0 0.0 1.0 1000.23 124153.04 1903.93]
[0.0 1.0 0.0 1315.46 115816.21 297114.46]
[1.0 0.0 0.0 0.0 135426.92 0.0]
[0.0 0.0 1.0 542.05 51743.15 0.0]
[1.0 0.0 0.0 0.0 116983.8 45173.06]]

Label Encoder
In label encoding, each category is assigned an integer value from 0 through N-1, where N is the
number of categories for the feature. There is no relation or order implied by these assignments.

Program
#importing libraries
import pandas as pd
from sklearn import preprocessing
df=pd.read_csv(r"D:\New folder\iris.csv")
# label_encoder object knows
# how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'variety'.
df['variety']= label_encoder.fit_transform(df['variety'])

df['variety']

Output
0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: variety, Length: 150, dtype: int32

2.Feature Scaling:
When your data has different values, and even different measurement units, it can be difficult to
compare them. What is kilograms compared to meters? Or altitude compared to time? The answer to
this problem is scaling. We can scale data into new values that are easier to compare.
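A note on the program below: StandardScaler standardizes each column as z = (x - mean) / standard
deviation, so every feature ends up with zero mean and unit variance.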
Program:
import pandas
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]
scaledX = scale.fit_transform(X)
print(scaledX)
Output:
[[-2.10389253 -1.59336644]
[-0.55407235 -1.07190106]
[-1.52166278 -1.59336644]
[-1.78973979 -1.85409913]
[-0.63784641 -0.28970299]
[ 0.15800719 -0.0289703 ]
[ 0.3046118 -0.0289703 ]
[-0.05142797 1.53542584]
[-0.72580918 -0.0289703 ]
[ 0.14962979 1.01396046]
[ 1.2219378 -0.0289703 ]
[ 0.5685001 1.01396046]
[ 0.3046118 1.27469315]
[ 0.51404696 -0.0289703 ]
[ 0.51404696 1.01396046]
[ 0.72348212 -0.28970299]
[ 0.8281997 1.01396046]
[ 1.81254495 1.01396046]
[ 0.96642691 -0.0289703 ]
[ 1.72877089 1.01396046]]
3.Write a program for splitting the data into training and testing
Training Data:
The training data is the biggest (in size) subset of the original dataset, which is used to train or fit the
machine learning model. The training data is fed to the ML algorithm, which lets it learn how to make
predictions for the given task.
Testing Data:
This dataset evaluates the performance of the model and ensures that the model can generalize well to
new or unseen data. The test dataset is another subset of the original data, which is independent of the
training dataset.
Splitting:
For splitting the dataset, we can use the train_test_split function of scikit-learn.
Program:
#importing the packages
import numpy as np
from sklearn.model_selection import train_test_split
#splitting the data into training and testing
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
print(x_train)
print(x_test)
print(y_train)
print(y_test)

Output
[[ 3 4]
[13 14]
[ 1 2]
[15 16]
[23 24]
[19 20]
[17 18]
[11 12]]
[[ 5 6]
[ 7 8]
[ 9 10]
[21 22]]
[1 0 0 1 0 0 1 0]
[1 0 1 1]

Module-3
1.Similarity and Dissimilarity Measures:
Similarity Measure:
Measures of similarity provide a numerical value which indicates the strength of association between
objects or variables. The extent to which the variables correspond with each other is usually indicated
between “0” and “1”, where “0” means no similarity (or exclusion) and “1” means perfect similarity (or identity).
Dissimilarity Measure:
A numerical measure of how different two data objects are, ranging from 0 (objects are alike) to ∞
(objects are different).

Pearson's coefficient:
The Pearson coefficient is a type of correlation coefficient that represents the relationship between two
variables measured on the same interval or ratio scale. The Pearson coefficient is a measure of
the strength of the association between two continuous variables.

Program

from scipy.stats import pearsonr


x=[-2,-1,0,1,2]
y=[4,1,3,2,0]
corr=pearsonr(x,y)
print('pearson correlation: ',corr)

Output

pearson correlation: PearsonRResult(statistic=-0.7000000000000001, pvalue=0.1881204043741873)

Cosine similarity:
Cosine similarity is a metric used to measure the similarity of two vectors. Specifically, it
measures the similarity in the direction or orientation of the vectors ignoring differences in their
magnitude or scale. Both vectors need to be part of the same inner product space, meaning they must
produce a scalar through inner product multiplication. The similarity of two vectors is measured by the
cosine of the angle between them.

Program

import numpy as np
from numpy.linalg import norm
#defining two arrays
A=np.array([2,1,2,3,3,9])
B=np.array([3,4,2,4,5,5])
print("A: ",A)
print("B: ",B)
#compute cosine similarity (note the parentheses around the product of the norms)
cosine=np.dot(A,B)/(norm(A)*norm(B))
print("cosine Similarity :",cosine)

Output

A: [2 1 2 3 3 9]
B: [3 4 2 4 5 5]
cosine Similarity : 0.8490334
Jaccard similarity:
Jaccard Similarity is a common proximity measurement used to compute the similarity between
two objects, such as two text documents. Jaccard similarity can be used to find the similarity between
two asymmetric binary vectors.

Program
import numpy as np
from scipy.spatial.distance import jaccard
A=np.array([1,0,0,1,1,1])
B=np.array([0,0,1,1,1,1])
d=jaccard(A,B)
print("distance:",d)

Output

distance: 0.4
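Note that scipy's jaccard function returns the Jaccard distance (a dissimilarity); the corresponding
Jaccard similarity here is 1 - 0.4 = 0.6.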

Euclidean:
The Euclidean distance between two points is the length of the straight-line segment connecting
them. For two real-valued vectors, it is the square root of the sum of the squared differences between
the corresponding elements.
Program

from sklearn.metrics.pairwise import euclidean_distances


x=[[0,1],[1,1]]
euclidean_distances(x,x)
#calculating euclidean distance between two vectors
from scipy.spatial.distance import euclidean
row1=[10,20,15,10,5]
row2=[12,24,18,8,7]
dist=euclidean(row1,row2)
print(dist)

Output

6.082762530298219
Minkowski Distance:
Minkowski distance calculates the distance between two real-valued vectors. It is a
generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the
“order” or “p”, that allows different distance measures to be calculated.
Program

from scipy.spatial import minkowski_distance


row1=[10,20,15,10,5]
row2=[12,24,18,8,7]
#calculate distance (p=1)
dist=minkowski_distance(row1,row2,1)
print(dist)
#calculate distance (p=2)
dist=minkowski_distance(row1,row2,2)
print(dist)

Output
13.0
6.082762530298219

Manhattan Distance:
This determines the absolute difference between a pair of coordinates. Suppose we have
two points P and Q; to determine the distance between these points we simply calculate the sum of
the absolute differences of their coordinates along each axis. In a plane with P at coordinate (x1, y1)
and Q at (x2, y2), the distance is |x1 - x2| + |y1 - y2|.
Program:
#cityblock from scipy computes the Manhattan (city block) distance
from scipy.spatial.distance import cityblock
x1 = (1,2,3,4,5,6)
x2 = (10,20,30,1,2,3)
print(cityblock(x1, x2))
Output:
63

Module-4
Build a model using linear regression algorithm on any dataset:
Linear Regression:
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric
variables such as sales, salary, age, product price, etc.
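The first program below fits the line y = w0 + w1*x by ordinary least squares, using the closed-form
estimates w1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and w0 = ȳ - w1*x̄, where x̄ and ȳ are the means of x and y.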
Program:
import pandas as pd
import numpy as np
ds2 = pd.read_csv(r'C:\21761A05F6_PYTHON_SCRIPTS\SalaryInfo.csv')
print(ds2)
x = ds2.iloc[:,0].values
y = ds2.iloc[:,1].values
mean_x = np.mean(x)
mean_y = np.mean(y)
print("mean of Experience: ",mean_x)
print("mean of Salary: ",mean_y)
sum_add = 0
denom = 0
for (i,j) in zip(x,y):
    sum_add = sum_add+((i-mean_x)*(j-mean_y))
    denom = denom + (i-mean_x)*(i-mean_x)
w1_val = sum_add/denom
print("w1 Value is: ",w1_val)
w0_val = mean_y-(w1_val*mean_x)
print("w0 Value is: ",w0_val)
experience = int(input("Enter the Experience: "))
salary_predict = w0_val + w1_val*experience
print("Predicted Salary for ",experience," years experience is: ",salary_predict)

Output:
YEARS_EXPERIENCE SALARY(in $1000)
0 3 30
1 8 57
2 9 64
3 13 72
4 3 36
5 6 43
6 11 59
7 21 90
8 1 20
9 16 83
mean of Experience: 9.1
mean of Salary: 55.4
w1 Value is: 3.5374756199498467
w0 Value is: 23.208971858456394
Enter the Experience: 3
Predicted Salary for 3 years experience is: 33.821398718305936
Program:
#import required packages
import matplotlib.pyplot as pt
import numpy as np
from sklearn import datasets,linear_model
from sklearn.metrics import mean_squared_error,r2_score

#Building a linear regression model for diabetes dataset


diabetes_X,diabetes_Y = datasets.load_diabetes(return_X_y=True)
print("X is: \n",diabetes_X)
print("Y is: \n",diabetes_Y)
diabetes_X = diabetes_X[:,np.newaxis,2]

#split data into training and testing sets

#split the independent variables


diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

#split the dependent variables


diabetes_Y_train = diabetes_Y[:-20]
diabetes_Y_test = diabetes_Y[-20:]

#create an object for linear regression


regr = linear_model.LinearRegression()

#train the model using the training sets


regr.fit(diabetes_X_train,diabetes_Y_train)

#make predictions using the testing set


diabetes_Y_pred = regr.predict(diabetes_X_test)

print(diabetes_Y_pred)

#plotting the result


pt.scatter(diabetes_X_test,diabetes_Y_test,color='red')
pt.plot(diabetes_X_test,diabetes_Y_pred,color='green',linewidth=2,marker='o',markerfacecolor='yellow')
pt.grid()
pt.show()

#calculating the r2 score


print('R2 Score: %f' %r2_score(diabetes_Y_test,diabetes_Y_pred))

Output:
[225.9732401 115.74763374 163.27610621 114.73638965 120.80385422
158.21988574 236.08568105 121.81509832 99.56772822 123.83758651
204.73711411 96.53399594 154.17490936 130.91629517 83.3878227
171.36605897 137.99500384 137.99500384 189.56845268 84.3990668 ]
Plot:
Module-5
Build a classification model using Decision Tree algorithm on iris dataset
Classification:
Classification in data mining is a common technique that separates data points into different
classes. It allows you to organize data sets of all sorts, including complex and large datasets as well as
small and simple ones. It primarily involves using algorithms that you can easily modify to improve the
data quality.
Decision Tree Algorithm:
 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
 Step-3: Divide S into subsets that contain possible values for the best attribute.
 Step-4: Generate the decision tree node, which contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset created in step-3.
Continue this process until a stage is reached where you cannot further classify the nodes;
the final node is then called a leaf node.
Attribute Selection Measures:
While implementing a decision tree, the main issue that arises is how to select the best attribute
for the root node and for sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure, or ASM. With this measurement, we can easily select the best attribute for
the nodes of the tree. The two popular techniques for ASM are Information Gain and the Gini Index;
Information Gain is used here:
Information Gain:

Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.

Information Gain = Entropy(S) - [(Weighted Avg) × Entropy(each feature)]

Entropy:
Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness
in the data. Entropy can be calculated as:
Entropy(S) = -P(yes)·log2 P(yes) - P(no)·log2 P(no)

Where,

o S = the total set of samples
o P(yes) = probability of yes
o P(no) = probability of no

Program:
#importing libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
iris=load_iris()
#splitting data

x=iris.data
y=iris.target

#splitting data into training and testing

X_train,X_test,Y_train,Y_test=train_test_split(x,y,test_size=0.2,random_state=None)
tree_clf=DecisionTreeClassifier(max_depth=6)
tree_clf.fit(X_train,Y_train)
y_pred=tree_clf.predict(X_test)

#determining Accuracy

accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy :",accuracy)
plt.figure(figsize=(15,10))

#plotting tree

plot_tree(tree_clf,filled=True,feature_names=iris.feature_names,class_names=iris.target_names)
plt.show()

Output

Accuracy : 1.0

Plot:
2. Diabetes dataset

Program

#loading libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets,linear_model
from sklearn.metrics import mean_squared_error,r2_score
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X=diabetes_X[:,np.newaxis,2]
diabetes_x_train=diabetes_X[:-20]
diabetes_x_test=diabetes_X[-20:]
diabetes_y_train=diabetes_y[:-20]
diabetes_y_test=diabetes_y[-20:]
regr=linear_model.LinearRegression()
regr.fit(diabetes_x_train,diabetes_y_train)
diabetes_y_pred=regr.predict(diabetes_x_test)
print(diabetes_y_pred)

#building a decision tree and finding Accuracy

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
pima=pd.read_csv(r"D:\New folder\diabetes.csv")
feature_cols=["Pregnancies","Glucose","BloodPressure","SkinThickness",
              "DiabetesPedigreeFunction","BMI","Insulin","Age"]
X=pima[feature_cols]
y=pima.Outcome
X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=0.3,random_state=None)
clf=DecisionTreeClassifier(max_depth=6)
clf=clf.fit(X_train,Y_train)
y_pred=clf.predict(X_test)
print("Accuracy :",metrics.accuracy_score(Y_test,y_pred))

Output

[225.9732401 115.74763374 163.27610621 114.73638965 120.80385422


158.21988574 236.08568105 121.81509832 99.56772822 123.83758651
204.73711411 96.53399594 154.17490936 130.91629517 83.3878227
171.36605897 137.99500384 137.99500384 189.56845268 84.3990668 ]

Accuracy:1.00000000000000
Module-6

1.Demonstrate the Naive Bayes algorithm using any dataset


Naive Bayes Algorithm:
The Naive Bayes algorithm is a supervised learning algorithm based on Bayes' theorem
and used for solving classification problems. It is mainly used in text classification with high-
dimensional training datasets. The Naive Bayes classifier is one of the simplest and most effective
classification algorithms; it helps in building fast machine learning models that can make quick
predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
Bayes Theorem:

Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
Bayes Theorem Formula:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis is
true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
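As a quick numeric illustration of the formula (a minimal sketch with made-up probabilities, not part
of the lab exercise):

# assumed prior P(A)=0.3, likelihood P(B|A)=0.8, evidence P(B)=0.5
p_a, p_b_given_a, p_b = 0.3, 0.8, 0.5
p_a_given_b = (p_b_given_a * p_a) / p_b   # posterior via Bayes' theorem
print(p_a_given_b)                        # prints 0.48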

Program

#loading libraries
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset=pd.read_csv(r"D:\New folder\iris.csv")
print(dataset)
dataset.info()

#loading data into x and y

X=dataset.iloc[:,:4].values
Y=dataset['variety'].values
print(X)
print(Y)

#splitting the data into training and testing

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,random_state=None,test_size=0.3)
classifier=GaussianNB()
classifier.fit(X_train,Y_train)
print(X_test[0])
y_pred=classifier.predict(X_test)
print(y_pred)
accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy :",accuracy)

#making a dataframe of real and predicted values

df=pd.DataFrame({'Real Values':Y_test,'predicted_values':y_pred})
print(df)

Output
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica

[150 rows x 5 columns]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal.length 150 non-null float64
1 sepal.width 150 non-null float64
2 petal.length 150 non-null float64
3 petal.width 150 non-null float64
4 variety 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

[5.2 4.1 1.5 0.1]


[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.2]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.6 1.4 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4. 1.3]
[6.5 2.8 4.6 1.5]
[5.7 2.8 4.5 1.3]
[6.3 3.3 4.7 1.6]
[4.9 2.4 3.3 1. ]
[6.6 2.9 4.6 1.3]
[5.2 2.7 3.9 1.4]
[5. 2. 3.5 1. ]
[5.9 3. 4.2 1.5]
[6. 2.2 4. 1. ]
[6.1 2.9 4.7 1.4]
[5.6 2.9 3.6 1.3]
[6.7 3.1 4.4 1.4]
[5.6 3. 4.5 1.5]
[5.8 2.7 4.1 1. ]
[6.2 2.2 4.5 1.5]
[5.6 2.5 3.9 1.1]
[5.9 3.2 4.8 1.8]
[6.1 2.8 4. 1.3]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[6.4 2.9 4.3 1.3]
[6.6 3. 4.4 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3. 5. 1.7]
[6. 2.9 4.5 1.5]
[5.7 2.6 3.5 1. ]
[5.5 2.4 3.8 1.1]
[5.5 2.4 3.7 1. ]
[5.8 2.7 3.9 1.2]
[6. 2.7 5.1 1.6]
[5.4 3. 4.5 1.5]
[6. 3.4 4.5 1.6]
[6.7 3.1 4.7 1.5]
[6.3 2.3 4.4 1.3]
[5.6 3. 4.1 1.3]
[5.5 2.5 4. 1.3]
[5.5 2.6 4.4 1.2]
[6.1 3. 4.6 1.4]
[5.8 2.6 4. 1.2]
[5. 2.3 3.3 1. ]
[5.6 2.7 4.2 1.3]
[5.7 3. 4.2 1.2]
[5.7 2.9 4.2 1.3]
[6.2 2.9 4.3 1.3]
[5.1 2.5 3. 1.1]
[5.7 2.8 4.1 1.3]
[6.3 3.3 6. 2.5]
[5.8 2.7 5.1 1.9]
[7.1 3. 5.9 2.1]
[6.3 2.9 5.6 1.8]
[6.5 3. 5.8 2.2]
[7.6 3. 6.6 2.1]
[4.9 2.5 4.5 1.7]
[7.3 2.9 6.3 1.8]
[6.7 2.5 5.8 1.8]
[7.2 3.6 6.1 2.5]
[6.5 3.2 5.1 2. ]
[6.4 2.7 5.3 1.9]
[6.8 3. 5.5 2.1]
[5.7 2.5 5. 2. ]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3. 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6. 2.2 5. 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[7.7 2.8 6.7 2. ]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6. 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3. 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2. ]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
['Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica']

[5.4 3.9 1.3 0.4]


['Setosa' 'Virginica' 'Virginica' 'Versicolor' 'Setosa' 'Setosa'
'Versicolor' 'Setosa' 'Versicolor' 'Versicolor' 'Setosa' 'Virginica'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Setosa' 'Versicolor'
'Setosa' 'Virginica' 'Setosa' 'Setosa' 'Versicolor' 'Setosa' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Virginica' 'Setosa' 'Setosa' 'Virginica' 'Virginica'
'Setosa' 'Virginica' 'Versicolor' 'Versicolor' 'Setosa' 'Virginica'
'Setosa' 'Versicolor' 'Versicolor']
Accuracy : 0.9333333333333333

Real Values predicted_values


0 Setosa Setosa
1 Virginica Virginica
2 Virginica Virginica
3 Versicolor Versicolor
4 Setosa Setosa
5 Setosa Setosa
6 Versicolor Versicolor
7 Setosa Setosa
8 Virginica Versicolor
9 Versicolor Versicolor
10 Setosa Setosa
11 Virginica Virginica
12 Versicolor Versicolor
13 Versicolor Versicolor
14 Virginica Versicolor
15 Versicolor Versicolor
16 Setosa Setosa
17 Versicolor Versicolor
18 Setosa Setosa
19 Virginica Virginica
20 Setosa Setosa
21 Setosa Setosa
22 Versicolor Versicolor
23 Setosa Setosa
24 Versicolor Versicolor
25 Versicolor Versicolor
26 Versicolor Versicolor
27 Versicolor Versicolor
28 Versicolor Versicolor
29 Versicolor Versicolor
30 Versicolor Versicolor
31 Virginica Virginica
32 Setosa Setosa
33 Setosa Setosa
34 Virginica Virginica
35 Virginica Virginica
36 Setosa Setosa
37 Virginica Virginica
38 Virginica Versicolor
39 Versicolor Versicolor
40 Setosa Setosa
41 Virginica Virginica
42 Setosa Setosa
43 Versicolor Versicolor
44 Versicolor Versicolor

Program:
#importing Libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import numpy as np

#loading iris dataset

iris=load_iris()
X=iris.data
Y=iris.target
le=LabelEncoder()
Y=le.fit_transform(Y)

#splitting dataset into training and testing

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3)
nb_model=GaussianNB()

#Fitting data and predicting on the test set

nb_model.fit(X_train,Y_train)
Y_pred=nb_model.predict(X_test)

#calculating accuracy

accuracy=accuracy_score(Y_test,Y_pred)
print("Accuracy ",accuracy)
new_observation=np.array([[5.8,3.0,4.5,1.5]])
predicted_class=nb_model.predict(new_observation)
predicted_class=le.inverse_transform(predicted_class)
print("predicted class ",predicted_class)

Output:
Accuracy 1.0
predicted class [1]

2.Write a program for calculating the entropies of a dataset:


Entropy:
Entropy is the measure of uncertainty in the data. The effort is to reduce the entropy and
maximize the information gain. The feature providing the most information is considered the most
important by the algorithm and is used for training the model. By using information gain you are
actually using entropy.
Program:
#importing dependencies

from sklearn import tree


import pandas as pd
import numpy as np

#loading dataset

df=pd.read_csv(r"D:New folder\playtennis.csv")
print(df)
df.value_counts()

#calculating Entropy for Play Tennis and printing those records

Entropy_play=-(9/14)*np.log2(9/14)-(5/14)*np.log2(5/14)
print(Entropy_play)

#calculating Entropy for Outlook Sunny and printing those records

df[df.Outlook=='Sunny']
Entropy_play_Outlook_Sunny=-(3/5)*np.log2(3/5)-(2/5)*np.log2(2/5)
print(Entropy_play_Outlook_Sunny)

#calculating Entropy for Outlook Overcast and printing those records

df[df["Outlook"]=='Overcast']
#a pure node (all four Overcast records are 'Yes') has zero entropy;
#evaluating -(0/4)*np.log2(0/4) directly would produce nan
Entropy_play_outlook_overcast=0
print(Entropy_play_outlook_overcast)

#calculating the entropy for Outlook Rain and printing those records

df[df.Outlook=='Rain']
Entropy_play_Outlook_Rain=-(2/5)*np.log2(2/5)-(3/5)*np.log2(3/5)
Entropy_play_Outlook_Rain

#calculating the gain

Gain=Entropy_play-(5/14)*Entropy_play_Outlook_Sunny-(4/14)*0-(5/14)*Entropy_play_Outlook_Rain
Gain
Output

Outlook Temperature Humidity Wind Play Tennis


0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
5 Rain Cool Normal Strong No
6 Overcast Cool Normal Strong Yes
7 Sunny Mild High Weak No
8 Sunny Cool Normal Weak Yes
9 Rain Mild Normal Weak Yes
10 Sunny Mild Normal Strong Yes
11 Overcast Mild High Strong Yes
12 Overcast Hot Normal Weak Yes
13 Rain Mild High Strong No

#Entropy for play

0.9402859586706311

#Entropy for sunny

   Outlook Temperature Humidity    Wind Play Tennis
0    Sunny         Hot     High    Weak          No
1    Sunny         Hot     High  Strong          No
7    Sunny        Mild     High    Weak          No
8    Sunny        Cool   Normal    Weak         Yes
10   Sunny        Mild   Normal  Strong         Yes

0.9709505944546686

#Entropy for Overcast

     Outlook Temperature Humidity    Wind Play Tennis
2   Overcast         Hot     High    Weak         Yes
6   Overcast        Cool   Normal  Strong         Yes
11  Overcast        Mild     High  Strong         Yes
12  Overcast         Hot   Normal    Weak         Yes

0
#Entropy for rain

   Outlook Temperature Humidity    Wind Play Tennis
3     Rain        Mild     High    Weak         Yes
4     Rain        Cool   Normal    Weak         Yes
5     Rain        Cool   Normal  Strong          No
9     Rain        Mild   Normal    Weak         Yes
13    Rain        Mild     High  Strong          No

0.9709505944546686

#Gain

0.24674981977443933

Module-7

1.Apply Apriori algorithm on market basket data

Using our own data to apply the Apriori algorithm:

Apriori Algorithm:
The Apriori algorithm determines how two or more objects are related to one another. In other
words, we can say that the apriori algorithm is an association rule learning method that analyzes
whether people who bought product A also bought product B.
Support:
Support refers to the default popularity of any product. You find the support by dividing
the number of transactions comprising that product by the total number of transactions.
Confidence:
Confidence refers to the possibility that customers who bought biscuits also bought
chocolates. You need to divide the number of transactions that comprise both biscuits and
chocolates by the number of transactions comprising biscuits to get the confidence.
Lift:
Lift refers to the increase in the ratio of the sale of chocolates when you sell
biscuits.
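As a quick illustration of the three measures (a minimal sketch over a made-up four-transaction list,
not part of the lab program), computing them by hand for the rule biscuits -> chocolates:

transactions = [{"biscuits","chocolates"}, {"biscuits"}, {"chocolates"}, {"biscuits","chocolates"}]
n = len(transactions)
support_ab = sum({"biscuits","chocolates"} <= t for t in transactions) / n   # 2/4 = 0.5
support_a  = sum("biscuits" in t for t in transactions) / n                  # 3/4 = 0.75
support_b  = sum("chocolates" in t for t in transactions) / n                # 3/4 = 0.75
confidence = support_ab / support_a                                          # about 0.667
lift       = confidence / support_b                                          # about 0.889
print(support_ab, confidence, lift)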
Program:

#importing libraries

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori,association_rules
data=[["milk","Diapers","Beer","Cola"],
["bread","Milk","Diapers","Beer","Cola","Eggs"],
["Beer","Milk","Bread","Diapers"],
["Beer","Milk","Cola"]
]

#creating object for Transaction encoder

te=TransactionEncoder()
te_ary=te.fit(data).transform(data)
df=pd.DataFrame(te_ary,columns=te.columns_)
print(df)
frequent_itemsets=apriori(df,min_support=0.75,use_colnames=True)
print(frequent_itemsets)
itemsets=apriori(df,min_support=0.75)
print(itemsets)
rules=association_rules(frequent_itemsets,metric="confidence",min_threshold=0.7)
print(rules)

#Accessing specific columns in the rules

selected_columns=['antecedents','consequents','antecedent support',
                  'consequent support','support','confidence']
print(rules[selected_columns])

Output

Beer Bread Cola Diapers Eggs Milk bread milk


0 True False True True False False False True
1 True False True True True True True False
2 True True False True False True False False
3 True False True False False True False False
support itemsets
0 1.00 (Beer)
1 0.75 (Cola)
2 0.75 (Diapers)
3 0.75 (Milk)
4 0.75 (Beer, Cola)
5 0.75 (Diapers, Beer)
6 0.75 (Beer, Milk)
support itemsets
0 1.00 (0)
1 0.75 (2)
2 0.75 (3)
3 0.75 (5)
4 0.75 (0, 2)
5 0.75 (0, 3)
6 0.75 (0, 5)
antecedents consequents antecedent support consequent support support \
0 (Beer) (Cola) 1.00 0.75 0.75
1 (Cola) (Beer) 0.75 1.00 0.75
2 (Diapers) (Beer) 0.75 1.00 0.75
3 (Beer) (Diapers) 1.00 0.75 0.75
4 (Beer) (Milk) 1.00 0.75 0.75
5 (Milk) (Beer) 0.75 1.00 0.75

confidence lift leverage conviction zhangs_metric


0 0.75 1.0 0.0 1.0 0.0
1 1.00 1.0 0.0 inf 0.0
2 1.00 1.0 0.0 inf 0.0
3 0.75 1.0 0.0 1.0 0.0
4 0.75 1.0 0.0 1.0 0.0
5 1.00 1.0 0.0 inf 0.0

antecedents consequents antecedent support consequent support support \


0 (0) (2) 1.00 0.75 0.75
1 (2) (0) 0.75 1.00 0.75
2 (0) (3) 1.00 0.75 0.75
3 (3) (0) 0.75 1.00 0.75
4 (0) (5) 1.00 0.75 0.75
5 (5) (0) 0.75 1.00 0.75

confidence
0 0.75
1 1.00
2 0.75
3 1.00
4 0.75
5 1.00

Apriori Program on Market Basket Data:

Program:
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('store_data.csv')
transactions=[]
for i in range(0, 7500):
    transactions.append([str(dataset.values[i,j]) for j in range(0,20)])
from apyori import apriori
rules= apriori(transactions= transactions, min_support=0.0045, min_confidence = 0.2,
               min_lift=3, min_length=2, max_length=2)
results= list(rules)
print(len(results))
print(results)
for item in results:
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    print("Support: " + str(item[1]))
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

Output

7
[RelationRecord(items=frozenset({'chicken', 'light cream'}), support=0.004533333333333334,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}),
items_add=frozenset({'chicken'}), confidence=0.2905982905982906, lift=4.843304843304844)]),
RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}),
support=0.005733333333333333,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}),
items_add=frozenset({'escalope'}), confidence=0.30069930069930073, lift=3.7903273197390845)]),
RelationRecord(items=frozenset({'escalope', 'pasta'}), support=0.005866666666666667,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}),
items_add=frozenset({'escalope'}), confidence=0.37288135593220345, lift=4.700185158809287)]),
RelationRecord(items=frozenset({'ground beef', 'herb & pepper'}), support=0.016,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb & pepper'}),
items_add=frozenset({'ground beef'}), confidence=0.3234501347708895, lift=3.2915549671393096)]),
RelationRecord(items=frozenset({'tomato sauce', 'ground beef'}), support=0.005333333333333333,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'tomato sauce'}),
items_add=frozenset({'ground beef'}), confidence=0.37735849056603776, lift=3.840147461662528)]),
RelationRecord(items=frozenset({'olive oil', 'whole wheat pasta'}), support=0.008,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'whole wheat pasta'}),
items_add=frozenset({'olive oil'}), confidence=0.2714932126696833, lift=4.130221288078346)]),
RelationRecord(items=frozenset({'shrimp', 'pasta'}), support=0.005066666666666666,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'shrimp'}),
confidence=0.3220338983050848, lift=4.514493901473151)])]

Rule: chicken -> light cream


Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
=====================================
Rule: escalope -> mushroom cream sauce
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
=====================================
Rule: escalope -> pasta
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
=====================================
Rule: ground beef -> herb & pepper
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
=====================================
Rule: tomato sauce -> ground beef
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
=====================================
Rule: olive oil -> whole wheat pasta
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
=====================================
Rule: shrimp -> pasta
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
=====================================

Module-8
1.Apply K-Means clustering algorithm on any dataset
K-Means Algorithm:
K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, then go to step-4; else go to FINISH.
Step-7: The model is ready.
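A note on the program below: sklearn's KMeans class performs steps 2-6 internally (using the
k-means++ centroid initialization by default), so only the number of clusters has to be supplied.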
Program:
#K-Means Clustering:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
x = np.array([[1,1],[1.5,2],[3,4],[5,7],[3.5,5],[4.5,5],[3.5,4.5]])
print(x)

plt.scatter(x[:,0],x[:,1])

kmeans = KMeans(n_clusters=2)
kmeans.fit(x)

print("CLusters: ",kmeans.cluster_centers_)
print("Labels: ",kmeans.labels_)

plt.scatter(x[:,0],x[:,1],c=kmeans.labels_,cmap="rainbow")

Output:
[[1. 1. ]
[1.5 2. ]
[3. 4. ]
[5. 7. ]
[3.5 5. ]
[4.5 5. ]
[3.5 4.5]]
Clusters: [[1.25 1.5 ]
[3.9 5.1 ]]
Labels: [0 0 1 1 1 1 1]

Plot:
Module-9
1.Apply Hierarchical Clustering algorithm on any dataset
Hierarchical Clustering Algorithm:
Hierarchical clustering is another unsupervised machine learning algorithm, used to
group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis, or
HCA. Sometimes the results of K-means clustering and hierarchical clustering may look similar, but
they differ in how they work. Unlike the K-Means algorithm, there is no requirement to predetermine
the number of clusters.
The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts by
taking all data points as single clusters and merges them until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down
approach.

Program:
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from scipy.spatial.distance import squareform,pdist
import pandas as pd
a = np.random.random_sample(size=5)
b = np.random.random_sample(size=5)
point = ['P1','P2','P3','P4','P5']
data = pd.DataFrame({'Point':point,'a':np.round(a,2),'b':np.round(b,2)})
data = data.set_index('Point')
print(data)
plt.figure(figsize=(8,5))
plt.xlabel('Column a')
plt.ylabel('Column b')

plt.title('Scatter plot of x and y')


plt.scatter(data['a'],data['b'],c='r',marker='*')
for j in data.itertuples():
    plt.annotate(j.Index,(j.a,j.b),fontsize=15)
dist=pd.DataFrame(squareform(pdist(data[['a','b']],'euclidean')),
                  columns=data.index.values,index=data.index.values)
plt.figure(figsize=(12, 5))
plt.title('Dendrogram with Single linkage')
dend = shc.dendrogram(shc.linkage(data[['a','b']],method='single'),labels=data.index)
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2,affinity='euclidean',linkage='single')
print(cluster.fit_predict(data))
Output:
          a     b
Point
P1     0.86  0.13
P2     0.27  0.04
P3     0.89  0.65
P4     0.05  0.43
P5     0.44  0.05
[0 0 1 0 0]

Plots:
Module-10
1. Apply DBSCAN clustering algorithm on any dataset.
DBSCAN Clustering Algorithm:
DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are
dense regions in space separated by regions of lower density. It groups 'densely grouped' data
points into a single cluster.
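A note on the two parameters used in the program below: eps sets the radius of the neighbourhood
around each point, and min_samples sets the minimum number of points required within that radius for
a point to count as a core point; points that belong to no dense region are labelled -1 (noise).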
Program:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

centers = [[0.5,2],[-1,-1],[1.5,-1]]
x,y = make_blobs(n_samples=100,centers=centers,cluster_std=0.5,random_state=0)
print(x,y)

db = DBSCAN(eps=0.4,min_samples=5)
print(db.fit(x))

labels = db.labels_
n_clusters_ = len(set(labels))-(1 if -1 in labels else 0)
print("Estimated number of clusters: %d" %n_clusters_)
y_pred = db.fit_predict(x)

print(db.labels_)
plt.figure(figsize=(6,4))
plt.scatter(x[:,0],x[:,1],c=y_pred,cmap='Paired')
plt.title("Clusters determined by DBSCAN")
plt.savefig("DBSCAN.jpg")

from sklearn.cluster import KMeans

plt.scatter(x[:,0],x[:,1])
kmeans = KMeans(n_clusters=2)
kmeans.fit(x)

print("Clusters: ",kmeans.cluster_centers_)
print("Labels: ",kmeans.labels_)

plt.title("K-Means for DBSCAN Algorithm")


plt.scatter(x[:,0],x[:,1],c=kmeans.labels_,cmap="Paired")
plt.show()
plt.savefig("KMeansDBSCAN.jpg")
Output:
[[ 0.57747371 2.18908126]
[ 1.1781908 -2.11170158]
[ 0.53325861 2.15123595]
[ 0.94780833 -0.97391746]
[ 1.27223375 -0.99126042]
[ 1.26638961 2.73467938]
[ 0.58871307 1.79910953]
[ 1.18207696 -0.66178335]
[-0.41061021 -1.08996242]
[-0.3531351 2.9753877 ]
[ 2.0940149 -0.84152869]
[-1.45364918 -0.9740273 ]
[-0.66385262 -0.79626908]
[ 1.45077374 -1.33173914]
[ 2.58161797 -0.33173603]
[ 0.57202179 2.72713675]
[ 1.4658792 -0.14332864]
[-0.69296031 -0.53889666]
[ 1.78829541 -1.10414938]
[-1.34728393 -1.07481727]
[-1.38495804 -0.7303754 ]
[ 0.24459743 1.40968391]
[-0.89586251 -0.51168048]
[-0.79882918 -1.34240505]
[-0.8218168 -0.64671342]
[ 0.9322181 1.62891749]
[-1.20158847 -0.38877746]
[ 1.433779 1.51136106]
[-1.58257492 -0.54958676]
[-0.05842465 -1.67387953]
[ 1.11514534 2.60118992]
[ 1.63487731 1.27281716]
[-1.5865617 -0.02818941]
[ 1.81261573 -1.80102883]
[ 1.88589528 -0.58824792]
[ 0.99989233 -1.77238555]
[ 0.85357155 -0.86647457]
[ 0.09342686 1.1368587 ]
[ 0.98287858 -0.65920274]
[ 2.69157239 -0.52776026]
[-0.77649491 2.3268093 ]
[ 2.06331796 -1.53996575]
[ 0.84204629 -1.2307923 ]
[ 1.96042941 -0.84063617]
[-0.85088091 -0.33680705]
[ 0.98936899 3.1204466 ]
[-0.2558739 -0.05205541]
[ 0.97504421 1.9243214 ]
[-0.99474999 -0.10706475]
[ 1.69800336 -1.54653075]
[ 0.32604393 2.07817448]
[-0.43029966 -1.61741291]
[ 0.30633659 1.84884862]
[ 1.38202617 2.2000786 ]
[ 0.65653385 1.57295213]
[ 1.48035859 -1.58404675]
[-0.76716878 -1.76812184]
[ 1.09829517 -1.34477489]
[ 0.88728224 -0.57781851]
[ 1.92841531 -1.3255128 ]
[-1.21757678 -0.07536814]
[-0.06622052 -0.54697767]
[-1.20680949 -1.37372741]
[-1.6352425 -0.51530165]
[-0.31509917 2.23139113]
[ 0.18283895 1.81862942]
[ 0.48590889 2.21416594]
[ 1.12762259 -1.41321927]
[-1.13400169 -0.5987718 ]
[-0.02427648 1.28999103]
[ 0.05226672 2.19345125]
[-0.03852899 -0.2597426 ]
[ 0.16376978 1.82022342]
[ 1.76163833 -1.08577317]
[ 0.44839057 2.20529925]
[-0.30694892 1.89362986]
[-0.12639768 2.38874518]
[-1.43061284 -0.04496752]
[ 0.05610713 1.00960177]
[-0.81178723 -1.5497004 ]
[-1.33716633 -0.98408472]
[ 1.58333675 -0.68248428]
[ 1.9747104 -0.95622438]
[ 0.88051886 2.06083751]
[ 1.32300304 -1.68747565]
[ 0.92626567 -1.21891002]
[ 0.24517391 1.78096285]
[ 1.25098377 -0.03523397]
[ 0.72193162 2.16683716]
[ 1.1302185 -0.2284927 ]
[ 1.24703954 1.89742087]
[-1.15577627 -0.97191733]
[-1.43539857 -1.28942483]
[ 0.7543712 -0.78030415]
[ 0.52287926 1.90640807]
[-1.53537631 -0.47277414]
[ 1.04358889 -0.44149186]
[-0.93654395 -0.79900532]
[-0.52637402 -1.07750505]
[-0.63545472 -0.93550854]] [0 2 0 2 2 0 0 2 1 0 2 1 1 2 2 0 2 1 2 1 1 0 1 1 1 0 1 0 1 1 0 0 1 2 2 2 2
 0 2 2 0 2 2 2 1 0 1 0 1 2 0 1 0 0 0 2 1 2 2 2 1 1 1 1 0 0 0 2 1 0 0 1 0 2
0 0 0 1 0 1 1 2 2 0 2 2 0 2 0 2 0 1 1 2 0 1 2 1 1 1]
DBSCAN(eps=0.4)
Estimated number of clusters: 3
[ 0 -1 0 1 1 -1 0 1 2 -1 1 2 2 1 -1 -1 -1 2 1 2 2 0 2 2
2 0 2 -1 2 -1 -1 -1 2 1 1 1 1 -1 1 -1 -1 1 1 1 2 -1 -1 0
2 1 0 -1 0 -1 0 1 -1 1 1 1 2 -1 2 2 0 0 0 1 2 -1 0 -1
0 1 0 -1 0 2 -1 -1 2 1 1 0 1 1 0 -1 0 1 0 2 2 1 0 2
1 2 2 2]
Clusters: [[ 0.49210963 2.00254955]
[ 0.25333922 -0.89314775]]
Labels: [0 1 0 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1
 0 1 1 0 1 1 1 1 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 1 0 1
0 0 0 1 0 1 1 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1]

Plot:
