
Data Mining

Assignment no. 1
Question no. 1:
Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X), and Height (5 to 6 ft). 15 of these 30 play cricket in their leisure time. Now, we want to create a model to predict who will play cricket during the leisure period. In this problem, we need to segregate the students who play cricket in their leisure time based on the most significant of the three input variables.

Answer:

1. Entropy
First, we need to find the entropy of the parent node. For a node where a fraction p of the students play and a fraction q = 1 - p do not, Entropy = -p log2(p) - q log2(q).
Entropy for Parent Node = -(15/30) log2(15/30) - (15/30) log2(15/30) = 1
The entropy is 1, the maximum possible, which shows that the parent node is completely impure (a 50/50 split).

For Split on gender:


Entropy for Female node = -(2/10) log2(2/10) - (8/10) log2(8/10) = 0.72
Entropy for Male node = -(13/20) log2(13/20) - (7/20) log2(7/20) = 0.93
Weighted entropy for split on Gender = (10/30)*0.72 + (20/30)*0.93 = 0.86
Information Gain for split on Gender = 1 - 0.86 = 0.14

For Split on Class:


Entropy for Class IX node = -(6/14) log2(6/14) - (8/14) log2(8/14) = 0.99
Entropy for Class X node = -(9/16) log2(9/16) - (7/16) log2(7/16) = 0.99
Weighted entropy for split on Class = (14/30)*0.99 + (16/30)*0.99 = 0.99

Information Gain for split on Class = 1 - 0.99 = 0.01


Above, you can see that the Information Gain for the split on Gender is the higher of the two, so the tree will split on Gender first.
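
To double-check these hand calculations, here is a minimal Python sketch (assuming only the counts given above) that recomputes the two entropies and information gains:

import math

def entropy(p, n):
    """Entropy of a node with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        frac = count / total
        if frac > 0:
            result -= frac * math.log2(frac)
    return result

parent = entropy(15, 15)  # = 1.0

# Gender split: 10 girls (2 play cricket), 20 boys (13 play cricket)
gender = (10/30) * entropy(2, 8) + (20/30) * entropy(13, 7)
# Class split: 14 in Class IX (6 play), 16 in Class X (9 play)
klass = (14/30) * entropy(6, 8) + (16/30) * entropy(9, 7)

print("IG(Gender) =", round(parent - gender, 2))  # 0.14
print("IG(Class)  =", round(parent - klass, 2))   # 0.01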

2. Gini
Secondly, we score the same splits with the Gini measure. It works with a categorical target variable ("Success" or "Failure"); for a node, the Gini score is the sum of the squared class probabilities, Gini = p^2 + q^2.

For Split on gender:


Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
Weighted Gini for split on Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59

For Split on Class:


Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
Weighted Gini for split on Class = (14/30)*0.51 + (16/30)*0.51 = 0.51

Above, you can see that the Gini score for the split on Gender is higher than that for the split on Class, and a higher Gini score means purer sub-nodes; hence the node split will take place on Gender.
If we want the Gini impurity instead, it is simply the complement of this score: Gini Impurity = 1 - Gini.
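
The same sketch style works here (again assuming only the counts above) to verify the weighted Gini scores and the corresponding impurities:

def gini(p, n):
    """Gini score (sum of squared class probabilities) of a node."""
    total = p + n
    return (p / total) ** 2 + (n / total) ** 2

# Weighted Gini score for each candidate split
gender = (10/30) * gini(2, 8) + (20/30) * gini(13, 7)
klass = (14/30) * gini(6, 8) + (16/30) * gini(9, 7)

print("Gini(Gender) =", round(gender, 2), "impurity =", round(1 - gender, 2))  # 0.59, 0.41
print("Gini(Class)  =", round(klass, 2), "impurity =", round(1 - klass, 2))    # 0.51, 0.49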

Question no. 2:
Draw a decision tree using pen and paper and then share it via email. Do it for the given dataset:

– Create a decision tree for the dataset at the following link

Answer:
As it was a big dataset and too hard to draw the tree on paper, I decided to write a program that uses several Python libraries to build a decision tree on the given "Diabetes" dataset.

# -*- coding: utf-8 -*-
"""
Created on Mon Oct 28 19:47:33 2019

@author: sgfghh
"""

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

col_names = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age', 'diabetes']
# Load dataset
pima = pd.read_csv("C:/Users/sgfghh/Downloads/data/data/diabetes_data.csv", header=None,
                   names=col_names)

print(pima.head())

feature_cols = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age']
# feature_cols = ['triceps', 'insulin', 'bmi', 'dpf', 'age']  # alternative, smaller feature set
X = pima[feature_cols]  # Features
y = pima.diabetes       # Target variable

# Split dataset into training set and test set (70% training, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Render the tree to a PNG with Graphviz
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six has been removed from newer scikit-learn
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True, special_characters=True,
                feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
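
If Graphviz or pydotplus is not available, a lighter alternative (a sketch, assuming scikit-learn 0.21 or newer and matplotlib are installed) is the built-in plot_tree helper, which renders the same fitted tree without external tools:

import matplotlib.pyplot as plt
from sklearn import tree

# Draw the fitted tree directly with matplotlib; no Graphviz needed
plt.figure(figsize=(20, 10))
tree.plot_tree(clf, feature_names=feature_cols, class_names=['0', '1'],
               filled=True, rounded=True)
plt.savefig('diabetes.png')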
OUTPUT:

Question no. 3:
Further, implement it in Python. Apply decision tree, logistic regression, and nearest neighbor to the dataset. Then compare the accuracy and the confusion matrix.

Code in Python:
"""
Spyder Editor

This is a temporary script file.


"""

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the diabetes dataset
dataset = pd.read_csv('C:/Users/sgfghh/Downloads/data/data/diabetes_data.csv')
X = dataset.iloc[:, 1:8].values  # columns 1-7 as features (column 0, pregnancies, is left out here)
Y = dataset.iloc[:, 8].values    # column 8 is the diabetes label

# print(dataset.head())
# print("Diabetes data set dimensions : {}".format(dataset.shape))
# print(dataset.isnull().sum())
# print(dataset.isna().sum())

# Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Using Logistic Regression algorithm on the training set
# (note: each classifier below overwrites the previous one, so only the last
#  model fitted, the Random Forest, is evaluated at the end of the script)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, Y_train)
# Using KNeighborsClassifier from the neighbors module (Nearest Neighbor algorithm)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, Y_train)

# Using SVC from the svm module (Support Vector Machine algorithm)
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, Y_train)
# Using SVC with kernel='rbf' (Kernel SVM algorithm; SVC is already imported above)
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, Y_train)
# Using GaussianNB from the naive_bayes module (Naive Bayes algorithm)
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, Y_train)
# Using DecisionTreeClassifier from the tree module (Decision Tree algorithm)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)

# Using RandomForestClassifier from the ensemble module (Random Forest algorithm)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)

# Predictions of the last-fitted model (the Random Forest) on the test set
print("Predictions of the Random Forest model:", Y_pred)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
print("Confusion matrix", cm)

Output:

Predictions of the Random Forest model: [1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0
 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0
 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 1 0]

Confusion matrix [[117  14]
 [ 31  30]]
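
The script above only evaluates the last-fitted model. To actually compare all the models, as the question asks, here is a sketch that keeps one fitted classifier per model and evaluates each in turn (my own scaffolding, reusing the scaled X_train/X_test and Y_train/Y_test from the script above; the dictionary and its label names are assumptions, not part of the original):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# One entry per model used in the assignment; the keys are just labels
models = {
    'Logistic Regression': LogisticRegression(random_state=0),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
    'SVM (linear)': SVC(kernel='linear', random_state=0),
    'SVM (RBF)': SVC(kernel='rbf', random_state=0),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=10, criterion='entropy',
                                            random_state=0),
}

# Fit each model on the scaled training data, then report its accuracy
# and confusion matrix on the held-out test data
for name, model in models.items():
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", round(accuracy_score(Y_test, pred), 3))
    print(confusion_matrix(Y_test, pred))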
