
Data Mining

Assignment no. 1
Question no. 1:
Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X), and Height (5 to 6 ft). 15 of these 30 play cricket in their leisure time. Now, we want to create a model to predict who will play cricket during the leisure period. In this problem, we need to segregate the students who play cricket in their leisure time based on the most significant of the three input variables.

Answer:

1. Entropy
First, we need to find the entropy of the parent node. For a node where a fraction p of the students play and a fraction q = 1 - p do not, Entropy = -p log2(p) - q log2(q).
Entropy for Parent Node = -(15/30) log2(15/30) - (15/30) log2(15/30) = 1
The entropy is 1, the maximum possible, which shows that the parent node is completely impure (a 50/50 split).

For Split on gender:


Entropy for Female node = -(2/10) log2(2/10) - (8/10) log2(8/10) = 0.72
Entropy for Male node = -(13/20) log2(13/20) - (7/20) log2(7/20) = 0.93
Weighted entropy for split on Gender = (10/30)*0.72 + (20/30)*0.93 = 0.86
Information Gain for split on Gender = 1 - 0.86 = 0.14

For Split on Class:


Entropy for Class IX node = -(6/14) log2(6/14) - (8/14) log2(8/14) = 0.99
Entropy for Class X node = -(9/16) log2(9/16) - (7/16) log2(7/16) = 0.99
Weighted entropy for split on Class = (14/30)*0.99 + (16/30)*0.99 = 0.99

Information Gain for split on Class = 1 - 0.99 = 0.01


Above, you can see that the Information Gain for the split on Gender is the higher of the two, so the tree will split on Gender first.
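
To double-check these hand calculations, here is a minimal Python sketch (assuming only the counts given above) that recomputes the two entropies and information gains:

import math

def entropy(p, n):
    """Entropy of a node with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        frac = count / total
        if frac > 0:
            result -= frac * math.log2(frac)
    return result

parent = entropy(15, 15)  # = 1.0

# Gender split: 10 girls (2 play cricket), 20 boys (13 play cricket)
gender = (10/30) * entropy(2, 8) + (20/30) * entropy(13, 7)
# Class split: 14 in Class IX (6 play), 16 in Class X (9 play)
klass = (14/30) * entropy(6, 8) + (16/30) * entropy(9, 7)

print("IG(Gender) =", round(parent - gender, 2))  # 0.14
print("IG(Class)  =", round(parent - klass, 2))   # 0.01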

2. Gini
Secondly, we score the same splits with the Gini measure. It works with a categorical target variable ("Success" or "Failure"); for a node, the Gini score is the sum of the squared class probabilities, Gini = p^2 + q^2.

For Split on gender:


Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
Weighted Gini for split on Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59

For Split on Class:


Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
Weighted Gini for split on Class = (14/30)*0.51 + (16/30)*0.51 = 0.51

Above, you can see that the Gini score for the split on Gender is higher than that for the split on Class, and a higher Gini score means purer sub-nodes; hence the node split will take place on Gender.
If we want the Gini impurity instead, it is simply the complement of this score: Gini Impurity = 1 - Gini.
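
The same sketch style works here (again assuming only the counts above) to verify the weighted Gini scores and the corresponding impurities:

def gini(p, n):
    """Gini score (sum of squared class probabilities) of a node."""
    total = p + n
    return (p / total) ** 2 + (n / total) ** 2

# Weighted Gini score for each candidate split
gender = (10/30) * gini(2, 8) + (20/30) * gini(13, 7)
klass = (14/30) * gini(6, 8) + (16/30) * gini(9, 7)

print("Gini(Gender) =", round(gender, 2), "impurity =", round(1 - gender, 2))  # 0.59, 0.41
print("Gini(Class)  =", round(klass, 2), "impurity =", round(1 - klass, 2))    # 0.51, 0.49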

Question no. 2:
Draw a decision tree using pen and paper and then share it via email. Do it for the given dataset:

– Create a decision tree for the dataset at the following link

Answer:
As it was a big dataset and too hard to draw the tree on paper, I decided to write a program that uses several Python libraries to build a decision tree on the given "Diabetes" dataset.

# -*- coding: utf-8 -*-
"""
Created on Mon Oct 28 19:47:33 2019

@author: sgfghh
"""

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

col_names = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age', 'diabetes']
# Load dataset
pima = pd.read_csv("C:/Users/sgfghh/Downloads/data/data/diabetes_data.csv", header=None,
                   names=col_names)

print(pima.head())

feature_cols = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age']
# feature_cols = ['triceps', 'insulin', 'bmi', 'dpf', 'age']  # alternative, smaller feature set
X = pima[feature_cols]  # Features
y = pima.diabetes       # Target variable

# Split dataset into training set and test set (70% training, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Render the tree to a PNG with Graphviz
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six has been removed from newer scikit-learn
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True, special_characters=True,
                feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
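
If Graphviz or pydotplus is not available, a lighter alternative (a sketch, assuming scikit-learn 0.21 or newer and matplotlib are installed) is the built-in plot_tree helper, which renders the same fitted tree without external tools:

import matplotlib.pyplot as plt
from sklearn import tree

# Draw the fitted tree directly with matplotlib; no Graphviz needed
plt.figure(figsize=(20, 10))
tree.plot_tree(clf, feature_names=feature_cols, class_names=['0', '1'],
               filled=True, rounded=True)
plt.savefig('diabetes.png')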
OUTPUT:

Question no. 3:
Further, implement it in Python. Apply decision tree, logistic regression, and nearest neighbor to the dataset. Then compare the accuracy and the confusion matrix.

Code in Python:
"""
Spyder Editor

This is a temporary script file.


"""

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the diabetes dataset
dataset = pd.read_csv('C:/Users/sgfghh/Downloads/data/data/diabetes_data.csv')
X = dataset.iloc[:, 1:8].values  # columns 1-7 as features (column 0, pregnancies, is left out here)
Y = dataset.iloc[:, 8].values    # column 8 is the diabetes label

# print(dataset.head())
# print("Diabetes data set dimensions : {}".format(dataset.shape))
# print(dataset.isnull().sum())
# print(dataset.isna().sum())

# Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Using Logistic Regression algorithm on the training set
# (note: each classifier below overwrites the previous one, so only the last
#  model fitted, the Random Forest, is evaluated at the end of the script)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, Y_train)
# Using KNeighborsClassifier from the neighbors module (Nearest Neighbor algorithm)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, Y_train)

# Using SVC from the svm module (Support Vector Machine algorithm)
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, Y_train)
# Using SVC with kernel='rbf' (Kernel SVM algorithm; SVC is already imported above)
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, Y_train)
# Using GaussianNB from the naive_bayes module (Naive Bayes algorithm)
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, Y_train)
# Using DecisionTreeClassifier from the tree module (Decision Tree algorithm)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)

# Using RandomForestClassifier from the ensemble module (Random Forest algorithm)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)

# Predictions of the last-fitted model (the Random Forest) on the test set
print("Predictions of the Random Forest model:", Y_pred)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
print("Confusion matrix", cm)

Output:

Predictions of the Random Forest model: [1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0
 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0
 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 1 0]

Confusion matrix [[117  14]
 [ 31  30]]
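
The script above only evaluates the last-fitted model. To actually compare all the models, as the question asks, here is a sketch that keeps one fitted classifier per model and evaluates each in turn (my own scaffolding, reusing the scaled X_train/X_test and Y_train/Y_test from the script above; the dictionary and its label names are assumptions, not part of the original):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# One entry per model used in the assignment; the keys are just labels
models = {
    'Logistic Regression': LogisticRegression(random_state=0),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
    'SVM (linear)': SVC(kernel='linear', random_state=0),
    'SVM (RBF)': SVC(kernel='rbf', random_state=0),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=10, criterion='entropy',
                                            random_state=0),
}

# Fit each model on the scaled training data, then report its accuracy
# and confusion matrix on the held-out test data
for name, model in models.items():
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", round(accuracy_score(Y_test, pred), 3))
    print(confusion_matrix(Y_test, pred))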
