DWDM Lab Manual
Exercise 1: INTRODUCTION TO PYTHON LIBRARIES FOR DATA MINING – NumPy, Pandas, Matplotlib etc.
Write a Python program to do the following operations: Library: NumPy
a) Create multi-dimensional arrays and find its shape and dimension
b) Create a matrix full of zeros and ones
c) Reshape and flatten data in the array
d) Append data vertically and horizontally
e) Apply indexing and slicing on array
f) Use statistical functions on array - Min, Max, Mean, Median and Standard Deviation
PROCEDURE:
1. Create: Open a new file in the Python editor (IDLE), write the program and save it with a .py extension.
2. Execute: Go to Run -> Run Module (F5).
a) Create multi-dimensional arrays and find its shape and dimension
import numpy as np
#creation of multi-dimensional array
a=np.array([[1,2,3],[2,3,4],[3,4,5]])
#shape
b=a.shape
print("shape:")
print(a.shape)
#dimension
c=a.ndim
print("dimensions:")
print(a.ndim)
b) Create a matrix full of zeros and ones
import numpy as np
#matrix full of zeros
z=np.zeros((2,2))
print("zeros:")
print(z)
#matrix full of ones
o=np.ones((2,2))
print("ones:")
print(o)
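Parts (c), (d) and (e) are not shown above. A minimal sketch of those operations, using standard NumPy routines (reshape, flatten, vstack/hstack and ordinary slicing), could look like this:
import numpy as np
a=np.array([[1,2,3],[2,3,4],[3,4,5]])
#c) reshape the 3x3 array into 1x9 and flatten it into a 1-D array
print("reshaped:", a.reshape(1,9))
print("flattened:", a.flatten())
#d) append data vertically and horizontally
b=np.array([[7,8,9]])
print("vertical append:")
print(np.vstack((a,b)))
print("horizontal append:")
print(np.hstack((a,b.T)))
#e) indexing and slicing
print("element at row 0, column 2:", a[0,2])
print("first two rows:")
print(a[:2])
print("second column:", a[:,1])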
f) Use statistical functions on array - Min, Max, Mean, Median and Standard Deviation
import numpy as np
#min for finding minimum of an array
a=np.array([[1,3,-1,4],[3,-2,1,4]])
b=a.min()
print("minimum:",b)
#max for finding maximum of an array
c=a.max()
print("maximum:",c)
#mean
a=np.array([1,2,3,4,5])
d=a.mean()
print("mean:",d)
#median
e=np.median(a)
print("median:",e)
#standard deviation
f=a.std()
print("standard deviation:",f)
Exercise 2: UNDERSTANDING DATA
Write Python programs to do the following operations:
1. Loading data from CSV file
2. Compute the basic statistics of given data - shape, no. of columns, mean
3. Splitting a data frame on values of categorical variables
4. Visualize data using Scatter plot
Dataset: brain_size.csv
Library: Pandas, matplotlib
1. Loading data from CSV file
import pandas as pd
a=pd.read_csv("D:/data.csv")
print(a)
2. Compute the basic statistics of given data - shape, no. of columns, mean
import pandas as pd
a=pd.read_csv("D:/data.csv")
print('shape :',a.shape)
#no of columns
cols=len(a.axes[1])
print('no of columns:',cols)
#mean of data
m=a["marks"].mean()
print('mean of marks:',m)
3. Splitting a data frame on values of categorical variables
import pandas as pd
a=pd.read_csv("D:/data.csv")
print("Before:")
print(a)
a_split=a['address'].str.split(' ', n=1)
a['district']=a_split.str.get(0)
a['state']=a_split.str.get(1)
del(a['address'])
print("After:")
print(a)
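The code above splits a single string column into two new columns. If the goal is instead to split the rows of the data frame on the values of a categorical variable, a grouping-based sketch (the column names below are assumptions, not from the lab dataset) would be:
import pandas as pd
#hypothetical data frame with a categorical column 'state'
df=pd.DataFrame({'rollno':[1,2,3,4],'marks':[78,85,62,90],'state':['AP','TS','AP','TS']})
#one sub-frame per distinct value of the categorical variable
groups={value:sub for value,sub in df.groupby('state')}
print(groups['AP'])
print(groups['TS'])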
4. Visualize data using Scatter plot
import pandas as pd
import matplotlib.pyplot as plt
a=pd.read_csv("D:/data.csv")
print("Before:",a)
a_split=a['address'].str.split(',', n=1)
a['district']=a_split.str.get(0)
a['state']=a_split.str.get(1)
del(a['address'])
print("After=",a)
a.plot(kind='scatter',x='marks',y='rollno',c='red')
plt.show()
# To find the correlation among columns using the Pearson method
# (df is the DataFrame loaded from the dataset, e.g. with pd.read_csv)
print(df.corr(method ='pearson'))
# using the 'kendall' method
print(df.corr(method ='kendall'))
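As a self-contained illustration (the small DataFrame below is made up for demonstration and is not the lab dataset), the correlation methods supported by pandas can be compared directly:
import pandas as pd
#small made-up DataFrame used only to demonstrate corr()
df=pd.DataFrame({'hours':[1,2,3,4,5],'marks':[35,48,55,70,83]})
print(df.corr(method='pearson'))   #linear correlation
print(df.corr(method='kendall'))   #rank correlation (Kendall's tau)
print(df.corr(method='spearman'))  #rank correlation (Spearman's rho)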
Exercise 4: DATA PREPROCESSING – HANDLING MISSING VALUES
Write a python program to impute missing values with various techniques on a given dataset.
1. Remove rows/ attributes
2. Replace with mean or mode
3. Write a python program to perform transformation of data using Discretization (Binning) and normalization (MinMaxScaler or MaxAbsScaler) on a given dataset.
1. Remove rows/ attributes
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("D:/diabetes.csv")
# filling missing values with 0 using fillna()
print(df.fillna(0))
# filling missing values with the previous value (forward fill)
print(df.ffill())
# filling missing values with the next value (backward fill)
print(df.bfill())
# filling null values in a single column using fillna()
print(df["gender"].fillna("No Gender"))
# replace NaN values in the dataframe with the value -99
print(df.replace(to_replace = np.nan, value = -99))
# using dropna() to remove rows having at least one NaN
print(df.dropna())
# using dropna() to remove rows where all values are NaN
print(df.dropna(how = 'all'))
# using dropna() to remove columns having at least one NaN
print(df.dropna(axis = 1))
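Task 2 (replace with mean or mode) is not shown explicitly above. A minimal sketch, assuming the data frame has a numeric column 'Glucose' and a categorical column 'gender' (the column names are assumptions), is:
import pandas as pd
df = pd.read_csv("D:/diabetes.csv")
#replace missing numeric values with the column mean (assumed column 'Glucose')
df['Glucose']=df['Glucose'].fillna(df['Glucose'].mean())
#replace missing categorical values with the column mode (assumed column 'gender')
df['gender']=df['gender'].fillna(df['gender'].mode()[0])
print(df.isnull().sum())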
Binning (smoothing) by bin means:
import math
from collections import OrderedDict

x = list(map(float, input().split()))
bi = int(input())

X_dict = OrderedDict()
x_old = {}
x_new = {}

for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]

# list of bin values (one mean per bin)
binn = []
avrg = 0

i = 0
k = 0
num_of_data_in_each_bin = int(math.ceil(len(x)/bi))

# performing binning: accumulate the sum of each bin, then store its mean
# (for classical equal-frequency binning the data is usually sorted first)
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        avrg = avrg + h
        i = i + 1
    elif i == num_of_data_in_each_bin:
        k = k + 1
        i = 0
        binn.append(round(avrg / num_of_data_in_each_bin, 3))
        avrg = 0
        avrg = avrg + h
        i = i + 1

# mean of the last (possibly smaller) bin
rem = len(x) % bi
if rem == 0:
    binn.append(round(avrg / num_of_data_in_each_bin, 3))
else:
    binn.append(round(avrg / rem, 3))

# replace every value with the mean of its bin
i = 0
j = 0
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        x_new[g] = binn[j]
        i = i + 1
    else:
        i = 0
        j = j + 1
        x_new[g] = binn[j]
        i = i + 1

print("number of data in each bin")
print(math.ceil(len(x)/bi))
for i in range(len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))
Binning (smoothing) by bin medians:
import statistics
import math
from collections import OrderedDict

x = list(map(float, input().split()))
bi = int(input())

X_dict = OrderedDict()
x_old = {}
x_new = {}

for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]

# list of bin values (one median per bin)
binn = []
avrg = []

i = 0
k = 0
num_of_data_in_each_bin = int(math.ceil(len(x)/bi))

# performing binning: collect the members of each bin and store its median
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        avrg.append(h)
        i = i + 1
    elif i == num_of_data_in_each_bin:
        k = k + 1
        i = 0
        binn.append(statistics.median(avrg))
        avrg = []
        avrg.append(h)
        i = i + 1
binn.append(statistics.median(avrg))

# replace every value with the median of its bin
i = 0
j = 0
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        x_new[g] = round(binn[j], 3)
        i = i + 1
    else:
        i = 0
        j = j + 1
        x_new[g] = round(binn[j], 3)
        i = i + 1

print("number of data in each bin")
print(math.ceil(len(x)/bi))
for i in range(len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))
Binning (smoothing) by bin boundaries:
import math
from collections import OrderedDict

x = list(map(float, input().split()))
bi = int(input())

X_dict = OrderedDict()
x_old = {}
x_new = {}

for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]

# list of bins; each bin stores its [minimum, maximum] boundary
binn = []
avrg = []

i = 0
k = 0
num_of_data_in_each_bin = int(math.ceil(len(x)/bi))

# performing binning: record the boundaries (min, max) of each bin
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        avrg.append(h)
        i = i + 1
    elif i == num_of_data_in_each_bin:
        k = k + 1
        i = 0
        binn.append([min(avrg), max(avrg)])
        avrg = []
        avrg.append(h)
        i = i + 1
binn.append([min(avrg), max(avrg)])

# replace every value with the nearer of its bin's two boundaries
i = 0
j = 0
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        if abs(h - binn[j][0]) >= abs(h - binn[j][1]):
            x_new[g] = binn[j][1]
        else:
            x_new[g] = binn[j][0]
        i = i + 1
    else:
        i = 0
        j = j + 1
        if abs(h - binn[j][0]) >= abs(h - binn[j][1]):
            x_new[g] = binn[j][1]
        else:
            x_new[g] = binn[j][0]
        i = i + 1

print("number of data in each bin")
print(math.ceil(len(x)/bi))
for i in range(len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))
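For comparison, pandas also provides built-in discretization; a short sketch (the input values below are made up) of equal-width binning with pd.cut and equal-frequency binning with pd.qcut is:
import pandas as pd
values=pd.Series([5,7,9,13,18,22,35,40,44,55])
#equal-width bins
print(pd.cut(values,bins=3,labels=['low','medium','high']))
#equal-frequency bins
print(pd.qcut(values,q=3,labels=['low','medium','high']))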
Normalization (MinMaxScaler or MaxAbsScaler):
The motivation for using this scaling includes robustness to very small standard deviations of features and preservation of zero entries in sparse data.
# example of a normalization
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
# define data
data = asarray([[100, 0.001],[8, 0.05],[50, 0.005],[88, 0.07],[4, 0.1]])
print(data)
# define the MinMaxScaler and transform the data
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
print(scaled)
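The exercise also mentions MaxAbsScaler, which scales every feature by its maximum absolute value and therefore preserves zero entries in sparse data; a minimal sketch on the same array is:
# example of scaling with MaxAbsScaler
from numpy import asarray
from sklearn.preprocessing import MaxAbsScaler
data = asarray([[100, 0.001],[8, 0.05],[50, 0.005],[88, 0.07],[4, 0.1]])
scaler = MaxAbsScaler()
print(scaler.fit_transform(data))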
Exercise 5: ASSOCIATION RULE MINING- APRIORI
Write a python program to find rules that describe associations by using Apriori algorithm
Steps in Apriori:
1. Set a minimum value for support and confidence. This means that we are only interested in rules for items that occur frequently enough in the data (support) and co-occur with other items often enough (confidence).
2. Extract all the itemsets whose support is higher than the minimum threshold.
3. Select all the rules from these itemsets whose confidence is higher than the minimum threshold.
Example:
#CASE 1: run Apriori with default thresholds
from apyori import apriori
# transactions is assumed to be a list of transactions, each a list of item names
results = list(apriori(transactions))
association_results = list(results)
print(results[0])
#CASE 2: run Apriori with minimum support 0.5 and minimum confidence 0.8
results = list(apriori(transactions, min_support=0.5, min_confidence=0.8))
association_results = list(results)
print(len(results))
print(association_results)
OUTPUT: 5
RelationRecord(items=frozenset({'beer'}), support=1.0,
ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'beer'}),
confidence=1.0, lift=1.0)])
Case 2:
[RelationRecord(items=frozenset({'beer'}), support=1.0,
ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'beer'}),
confidence=1.0, lift=1.0)]),
RelationRecord(items=frozenset({'cheese', 'beer'}), support=0.5,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'cheese'}), items_add=frozenset({'beer'}),
confidence=1.0, lift=1.0)]),
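A self-contained version of the above (the transaction list here is a made-up example chosen only to be consistent with the sample output, where 'beer' has support 1.0 and 'cheese' 0.5) could look like this:
from apyori import apriori
#hypothetical transaction data: every transaction contains 'beer', half also contain 'cheese'
transactions=[['beer','cheese'],['beer','bread'],['beer','cheese','bread'],['beer','milk']]
results=list(apriori(transactions,min_support=0.5,min_confidence=0.8))
print(len(results))
for record in results:
    print(record.items,record.support,record.ordered_statistics)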
The association rules produced by Apriori are evaluated with three measures:
• Support
• Confidence
• Lift
Suppose we have a record of 1,000 customer transactions and consider two items, e.g. burgers and ketchup. Out of the 1,000 transactions, 100 contain ketchup while 150 contain a burger. Of the 150 transactions in which a burger is purchased, 50 also contain ketchup. Using this data, find the support, confidence and lift.
Support:
Support measures how frequently an item occurs in the data. For instance, if 100 out of 1,000 transactions contain Ketchup, the support for Ketchup is calculated as:
Support(Ketchup) = (Transactions containing Ketchup) / (Total transactions) = 100/1000 = 10%
Confidence:
Confidence refers to the likelihood that item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by the total number of transactions where A is bought.
There are 50 transactions where Burger and Ketchup were bought together, while a burger is bought in 150 transactions. The likelihood of buying ketchup when a burger is bought is the confidence of Burger -> Ketchup and can be written as:
Confidence(Burger -> Ketchup) = (Transactions containing both Burger and Ketchup) / (Transactions containing Burger) = 50/150 = 33.3%
Lift:
Lift(A -> B) refers to the increase in the ratio of the sale of B when A is sold. It is calculated by dividing Confidence(A -> B) by Support(B). Mathematically:
Lift(A -> B) = Confidence(A -> B) / Support(B)
For the Burger and Ketchup problem, Lift(Burger -> Ketchup) is calculated as:
Lift(Burger -> Ketchup) = Confidence(Burger -> Ketchup) / Support(Ketchup) = 0.333 / 0.10 = 3.33
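The same three measures can be verified with a few lines of Python using the counts from the worked example above:
#counts taken from the worked example above
total_transactions=1000
ketchup_transactions=100
burger_transactions=150
burger_and_ketchup=50
support_ketchup=ketchup_transactions/total_transactions              #0.10
confidence_burger_ketchup=burger_and_ketchup/burger_transactions     #0.333...
lift_burger_ketchup=confidence_burger_ketchup/support_ketchup        #3.333...
print(support_ketchup,confidence_burger_ketchup,lift_burger_ketchup)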
Exercise 6: CLASSIFICATION – DECISION TREE
2. Training with various split measures (Gini index, Entropy and Information Gain)
3. Compare the accuracy
A decision tree is a machine learning algorithm that uses a tree-like model of decisions and their possible consequences to arrive at a particular decision. It is a supervised machine learning model in which the data is repeatedly split according to a certain parameter until a final decision is made.
Usually, a decision tree is drawn upside down, with the root node at the top and the leaf nodes at the
bottom. A decision tree usually contains 3 types of nodes.
1. Root node: The very top node that represents the entire population or sample.
2. Decision nodes: Sub-nodes that split from the root node.
3. Leaf nodes: Nodes with no children, also known as terminal nodes.
Build a Decision Tree using IRIS dataset in Python:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
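The body of the program is not reproduced in this copy. A minimal sketch that builds on the imports above, trains one tree with the Gini index and one with entropy (information gain), and compares their accuracy (the 70/30 split and random_state are assumptions) is:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
#load the IRIS dataset
iris=load_iris()
X,y=iris.data,iris.target
#hold out 30% of the data for testing
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1)
#train one tree per split measure and compare the accuracies
for criterion in ('gini','entropy'):
    clf=DecisionTreeClassifier(criterion=criterion,random_state=1)
    clf.fit(X_train,y_train)
    y_pred=clf.predict(X_test)
    print(criterion,"accuracy:",accuracy_score(y_test,y_pred))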
(Figure: Decision Tree)
Exercise 7: CLASSIFICATION –BAYESIAN NETWORK
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms and helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
Naïve Bayes is based on Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
PROGRAM:-
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset=pd.read_csv("Iris.csv")
X=dataset.iloc[:,:4].values
Y=dataset['Species'].values
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3)
classifier=GaussianNB()
classifier.fit(X_train,Y_train)
print(X_test[0])
y_pred=classifier.predict(X_test)
print(y_pred)
cm=confusion_matrix(Y_test,y_pred)
print("Confusion matrix:")
print(cm)
accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy:",accuracy)
Exercise 8: CLUSTERING – K-MEANS
1. To perform Preprocessing
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
X=np.array([[1,1],[1.5,2],[3,4],[5,7],[3.5,5],[4.5,5],[3.5,4.5]])
print(X)
plt.scatter(X[:,0],X[:,1])
kmeans=KMeans(n_clusters=2)
kmeans.fit(X)
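The program stops here in this copy. A self-contained sketch, assuming the goal is simply to inspect and plot the two clusters found by KMeans (the plotting choices and n_init value are assumptions), is:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
X=np.array([[1,1],[1.5,2],[3,4],[5,7],[3.5,5],[4.5,5],[3.5,4.5]])
kmeans=KMeans(n_clusters=2,n_init=10)
kmeans.fit(X)
centroids=kmeans.cluster_centers_
labels=kmeans.labels_
print("centroids:",centroids)
print("labels:",labels)
#plot the points coloured by cluster label and mark the centroids
plt.scatter(X[:,0],X[:,1],c=labels)
plt.scatter(centroids[:,0],centroids[:,1],marker='x',s=150,c='red')
plt.show()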