Data Mining Assignment No. 1
Question no. 1:
Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 of these 30 students play cricket in their leisure time. Now, we want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate the students who play cricket in their leisure time based on the most significant input variable among the three.
Answer:
1. Entropy
First, we need to find the entropy of the parent node:
Entropy for Parent Node = -(15/30) log2(15/30) - (15/30) log2(15/30) = 1
The entropy is 1, the maximum possible for a binary split, which shows that the parent node is completely impure (a 50/50 split).
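The same parent-node entropy can be checked with a short Python snippet (a quick sketch, not part of the hand calculation above):

import math

# Parent node: 15 of 30 students play cricket, 15 do not
p_play, p_not = 15/30, 15/30

# Entropy = -sum(p * log2(p)) over the two classes
entropy = -(p_play * math.log2(p_play) + p_not * math.log2(p_not))
print(entropy)  # 1.0 -> maximally impure node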
2. Gini
Secondly, we compute the Gini impurity for the given data. Gini works with a categorical target variable such as "Success" or "Failure". For the parent node, Gini = 1 - [(15/30)^2 + (15/30)^2] = 0.5, again the most impure a binary node can be. To find the most significant input variable, we would split on Gender, Class and Height in turn and pick the variable whose split gives the lowest weighted Gini (equivalently, the highest gain).
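The parent-node Gini can likewise be verified in Python (a small sketch; the per-variable counts needed to compare Gender, Class and Height are not given here, so only the parent node is computed):

# Gini impurity of the parent node: 1 - sum(p_i^2)
p_play, p_not = 15/30, 15/30
gini_parent = 1 - (p_play**2 + p_not**2)
print(gini_parent)  # 0.5 -> a 50/50 split, the most impure a binary node can be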
Question no. 2:
Draw a decision tree using pen and paper and then share it via email. Do it for the given dataset.
Answer:
As the dataset was large and it was too hard to draw the tree on paper, I decided to write a program that uses different Python libraries to build a decision tree on the given "Diabetes" dataset.
# Load libraries
import pydotplus
import pandas as pd
from io import StringIO
from IPython.display import Image
from sklearn.tree import DecisionTreeClassifier, export_graphviz  # Decision Tree Classifier and DOT exporter
from sklearn.model_selection import train_test_split  # Import train_test_split function
from sklearn import metrics  # Import scikit-learn metrics module for accuracy calculation

# Column names for the diabetes dataset
col_names = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age', 'diabetes']

# Load dataset
pima = pd.read_csv("C:/Users/sgfghh/Downloads/data/data/diabetes_data.csv", header=None,
                   names=col_names)
print(pima.head())

# Split into features/target and fit the classifier (these steps were missing from the excerpt)
feature_cols = col_names[:-1]
X = pima[feature_cols]
y = pima['diabetes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# Export the fitted tree to DOT format and render it as a PNG image
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
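The metrics module imported above is not used in the script itself; a minimal sketch of how the test-set accuracy could be checked, continuing from the clf, X_test and y_test variables defined above:

# Predict on the held-out test set and report accuracy
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))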
OUTPUT: (decision tree visualization, saved as diabetes.png)
Question no. 3:
Further, implement it in Python. Apply a decision tree, logistic regression and nearest neighbor classifier to the dataset. Then compare their accuracy and confusion matrices.
Code in Python:
"""
Spyder Editor
#print(dataset.head())
#print("Cancer data set dimensions : {}".format(dataset.shape))
#print(dataset.isnull().sum())
#print(dataset.isna().sum())
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
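The model fitting and comparison that the question asks for is not included in the excerpt above. Below is a minimal, self-contained sketch of how the three classifiers could be applied and compared, using scikit-learn's built-in breast cancer dataset as a stand-in for the cancer CSV referenced in the comments (the original file and its load step are not shown):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Stand-in data: scikit-learn's built-in breast cancer dataset (target already 0/1)
data = load_breast_cancer()
X, Y = data.data, data.target

# Same split and scaling as in the script above
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit each classifier, then report its accuracy and confusion matrix
models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Logistic regression": LogisticRegression(random_state=0, max_iter=1000),
    "Nearest neighbor": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    print(name)
    print("  Accuracy:", accuracy_score(Y_test, Y_pred))
    print("  Confusion matrix:\n", confusion_matrix(Y_test, Y_pred))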
Output: