DMMLASSIGNMENT

The document is a project report submitted to the University of Liverpool for a course on data mining and machine learning. It describes building a decision tree model on a dataset about car purchases. It involves calculating Gini impurity measures to determine the nodes of the decision tree, fully growing the tree, and pruning it to address overfitting. It then tests the decision tree on a separate test dataset and evaluates the results using a confusion matrix.

DATA MINING AND MACHINE LEARNING COURSEWORK
Article · March 2022
Author: Shivani Mishra (Xalient)
Available at: https://www.researchgate.net/publication/359511158
Uploaded by the author on 28 March 2022.


A PROJECT REPORT
Submitted to

University of Liverpool

In Partial Fulfillment of the Requirement for the module of

DATA MINING AND MACHINE LEARNING


(EBUS537)
BY
SHIVANI MISHRA (201602773)

MANAGEMENT SCHOOL
UNIVERSITY OF LIVERPOOL

1|Page
CONTENTS
PART 1
Sl.no. Topic Pg.no.
1 Building Decision Tree 4
2 Post pruning activities 10
3 Testing Dataset 12
4 Applications of the decision tree in management 14

PART 2
Sl.no. Topic Pg.no.
1 Executive summary 16
2 Introduction 17
3 Implementation 18
Dataset- cleaning & clustering
K-means clustering
Code snippet
4 Evaluating Results 25
5 Novelty And Significance 29
6 References 30
7 Appendix 1 32

LIST OF FIGURES
PART 1
Fig.no. Topic Pg.no.
1 Decision Tree for root node after Gini impurity calculation 5
2 Decision Tree for step 2 after ‘safety=high’ Gini impurity calculation 7
3 Decision Tree for step 2 after ‘safety=medium’ Gini impurity calculation 8
4 Decision Tree after step 3 Gini impurity calculation 9
5 Fully grown decision tree 10
6 Pruned decision tree 11

PART 2
Fig.no. Topic Pg.no.
1 Data flow diagram for the clustering algorithm 19
2 Methodology of proposed research work 20
3 Scatter plot for total crimes against women year-wise 22
4 Bar plot for total crimes against women year-wise using KMeans 22
5 Scatter plot for total crimes against women state-wise 23
6 Bar plot for total crimes against women state-wise 24
7 Bar plot for rape crimes state-wise 25
8 Bar plot for kidnapping & abduction crimes state-wise 26
9 Bar plot for dowry deaths state-wise 27
10 Bar plot for assault of women state-wise 27
11 Bar plot for the insult to the modesty of women crimes, state-wise 28
12 Bar plot for cruelty crimes by husband or relatives against women, state-wise 28

PART 1
Training Dataset
Given the following dataset – “myCarTrainDataset.csv.”
https://drive.google.com/file/d/1-R1cJ0Rs6W2kX1Rl2J9d4slcUj7nk5cF/view?usp=sharing

STEP 1: Calculating the Gini impurity measures of the attributes to decide the nodes of the decision
tree:

1. For Buying

Gini(High) = 1 - (46/204)^2 - (158/204)^2 = 0.349


Gini(Med) = 1 - (47/107)^2 - (60/107)^2 = 0.492
Gini(Low) = 1 - (38/89)^2 - (51/89)^2 = 0.489
Ginisplit (Buying) = 204/400 * 0.349 + 107/400 * 0.492 + 89/400 * 0.489 = 0.4184

2. For Maintenance

Gini(High) = 1 - (49/202)^2 - (153/202)^2 = 0.367


Gini(Low) = 1 - (82/198)^2 - (116/198)^2 = 0.485
Ginisplit (Maintenance) = 202/400 * 0.367 + 198/400 * 0.485 = 0.4254

3. For Doors

Gini(3) = 1 - (61/202)^2 - (141/202)^2 = 0.421


Gini(5) = 1 - (70/198)^2 - (128/198)^2 = 0.457
Ginisplit (Doors) = 202/400 * 0.421 + 198/400 * 0.457 = 0.4388

4. For Safety

Gini(High) = 1 - (75/134)^2 - (59/134)^2 = 0.493


Gini(Med) = 1 - (56/133)^2 - (77/133)^2 = 0.487
Gini(Low) = 1 - (0/133)^2 - (133/133)^2 = 0
Ginisplit(Safety) = 134/400 * 0.493 + 133/400 * 0.487 + 133/400 * 0 = 0.327

As per the greedy strategy, ‘Safety’ is selected as the root node of the decision tree, since it has the lowest Ginisplit value (0.327).
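These calculations can be verified with a short Python helper. The (acc, unacc) counts per Safety value are taken from the tables above; the function names are illustrative, not part of the original coursework code.

```python
def gini(counts):
    """Gini impurity for one attribute value, given per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(groups):
    """Weighted Gini impurity over the groups produced by splitting on an attribute."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

# (acc, unacc) counts per value of Safety from the training data above
safety = [(75, 59), (56, 77), (0, 133)]   # high, medium, low
print(round(gini_split(safety), 3))       # 0.327, the lowest split, so Safety is the root
```

The same two helpers reproduce every Ginisplit value in this step, e.g. `gini((46, 158))` gives the 0.349 quoted for Buying=High.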

Fig 1: Decision Tree for root node after Gini impurity calculation

Step 2: Calculating Gini splits for Safety=High and Safety=Med

(‘Safety=Low’ is a homogeneous node, so its class label is unacc.)

1. For Safety= High


a. Buying

Gini(High) = 1 - (30/66)^2 - (36/66)^2 = 0.4958


Gini(Med) = 1 - (27/39)^2 - (12/39)^2 = 0.426
Gini(Low) = 1 - (18/29)^2 - (11/29)^2 = 0.4708
Ginisplit (Buying) = 66/134 * 0.4958 + 39/134 * 0.426 + 29/134 * 0.4708 =
0.47

b. Maintenance

Gini(High) = 1 - (29/63)^2 - (34/63)^2 = 0.4968


Gini(Low) = 1 - (46/71)^2 - (25/71)^2 = 0.4562
Ginisplit (Maintenance) = 63/134 * 0.4968 + 71/134 * 0.4562 = 0.4752

c. Doors

Gini(3) = 1 - (40/69)^2 - (29/69)^2 = 0.4873


Gini(5) = 1 - (35/65)^2 - (30/65)^2 = 0.497
Ginisplit (Doors) = 69/134 * 0.4873 + 65/134 * 0.497 = 0.492

For ‘Safety=High’, the next-level node is ‘Buying’, since it has the lowest Ginisplit value (0.47).

Fig 2: Decision Tree for step 2 after ‘safety=high’ Gini impurity calculation

2. For Safety=Medium
a. Buying

Gini(High) = 1 - (16/68)^2 - (52/68)^2 = 0.3598


Gini(Med) = 1 - (20/34)^2 - (14/34)^2 = 0.4844
Gini(Low) = 1 - (20/31)^2 - (11/31)^2 = 0.4578
Ginisplit (Buying) = 68/133 * 0.3598 + 34/133 * 0.4844 + 31/133 * 0.4578 = 0.4145

b. Maintenance

Gini(High) = 1 - (20/63)^2 - (43/63)^2 = 0.433


Gini(Low) = 1 - (36/71)^2 - (35/71)^2 = 0.499
Ginisplit (Maintenance) = 63/133 * 0.433 + 71/133 * 0.499 = 0.4715

c. Doors

Gini(3) = 1 - (21/65)^2 - (44/65)^2 = 0.4374


Gini(5) = 1 - (35/68)^2 - (33/68)^2 = 0.499
Ginisplit (Doors) = 65/133 * 0.4374 + 68/133 * 0.499 = 0.4689

For ‘Safety=Medium’, the next-level node is ‘Buying’, since it has the lowest Ginisplit value (0.4145).

Fig 3: Decision Tree for step 2 after ‘safety=medium’ Gini impurity calculation

Step 3: Calculating further Gini splits, as no nodes are homogeneous yet.

[ Gini impurity calculations for this step are mentioned in Appendix I. ]

Fig 4: Decision Tree after step 3 Gini impurity calculation

Step 4: Creating a fully grown decision tree

This decision tree has an ambiguous output for ‘Safety=High, Buying=Low, Maintenance=High, Doors=5’: the leaf counts are acc=3 and unacc=3. To decide the class label, according to Hunt's algorithm, majority voting is used, and the class distribution of the parent node, Maintenance=High (acc: 9, unacc: 7), determines the label of the child. Hunt's approach develops a decision tree recursively by separating the training records into ever purer subsets, repeating the procedure on each subset until all records in a subset belong to the same class. [Humanoriented.com, 2022] The final output is therefore ‘acc’.
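This tie-break can be sketched directly in Python. The counts come from the tree above; the function name is hypothetical.

```python
def leaf_label(leaf_counts, parent_counts):
    """Majority vote at a leaf; on a tie, fall back to the parent's majority
    class (the tie-break used with Hunt's algorithm above)."""
    acc, unacc = leaf_counts
    if acc != unacc:
        return 'acc' if acc > unacc else 'unacc'
    p_acc, p_unacc = parent_counts
    return 'acc' if p_acc >= p_unacc else 'unacc'

# Safety=High, Buying=Low, Maintenance=High, Doors=5 is a tie at (3, 3);
# its parent Maintenance=High has (acc: 9, unacc: 7), so the label is 'acc'
print(leaf_label((3, 3), (9, 7)))
```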

Fig 5: Fully grown decision tree

Step 5: Pruning the fully grown decision tree

Pruning is applied: subtrees whose leaves all carry the same class label are collapsed, which addresses overfitting and produces a more concise version of the decision tree.

Fig 6: Pruned decision tree

Testing Dataset
Given the following test dataset – “myCarTestDataset.csv”

Buying Maintenance Doors Safety Acceptance


high low 3 med acc
low high 5 low unacc
high high 5 high unacc
low high 5 high unacc
med low 5 med acc
high low 3 med unacc
med low 5 med acc
high high 3 low unacc
high low 5 high unacc
low high 5 high unacc
high low 5 med acc
high high 3 med unacc
low low 5 low unacc
high high 3 med unacc
high high 5 high unacc
high high 3 med acc
med high 5 med unacc
low low 3 low unacc
high low 3 high unacc
med high 3 med acc

A confusion matrix is used to test the decision tree on the above dataset.

A confusion matrix is a summary of the prediction outcomes of a classification problem: the numbers of correct and incorrect predictions are totalled and broken down by class. This is the key value of the confusion matrix: it shows not only the errors made by the classifier but also the types of errors being made, which addresses the drawback of relying solely on classification accuracy. [Jason Brownlee, 2016]

The outputs of the pruned decision tree are compared against the labels of the test dataset to populate the matrix.

From the confusion matrix (shown as a figure in the original report), the following results are obtained:

Accuracy = (TP + TN) / total = 13/20 = 0.65 = 65%

Error rate = (FP + FN) / total = 7/20 = 0.35 = 35%

The decision tree obtained is therefore a 65% accurate model with a 35% error rate.

Precision = TP / (TP + FP) = 3/7 ≈ 0.43 = 43%

Recall = TP / (TP + FN) = 3/6 = 0.5 = 50%

Precision is a valuable metric in cases where False Positives are a more significant concern than False Negatives.

Here, 43% of the cases predicted as unaccepted were truly unaccepted, while 50% of the unaccepted cases were successfully identified by the model.

In order to improve the model's precision, recall suffers, and vice versa. The F1-score captures both trends in a single number:

F1 Score = (2 × Recall × Precision) / (Recall + Precision) = 0.428/0.928 ≈ 0.46 = 46%

Because the F1-score is a harmonic mean of Precision and Recall, it provides a composite picture of the two measures; it reaches its peak when Precision equals Recall.

However, there is a catch: the F1-score has low interpretability, i.e. it gives no indication of whether the classifier is optimising Precision or Recall. Consequently, it is combined with other assessment indicators to get a holistic view of the outcome.
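The metrics above can be reproduced from the four confusion-matrix cells implied by the quoted figures (TP = 3, TN = 10, FP = 4, FN = 3); this is a verification sketch, not the report's original calculation.

```python
tp, tn, fp, fn = 3, 10, 4, 3           # cells implied by the report's figures
total = tp + tn + fp + fn              # 20 test records

accuracy  = (tp + tn) / total          # 13/20 = 0.65
error     = (fp + fn) / total          # 7/20  = 0.35
precision = tp / (tp + fp)             # 3/7, about 0.43
recall    = tp / (tp + fn)             # 3/6 = 0.50
f1 = 2 * precision * recall / (precision + recall)  # about 0.46

print(accuracy, round(precision, 2), round(recall, 2), round(f1, 2))
```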

Applications of the decision tree in the management field
A Decision Tree is a supervised machine learning algorithm that makes judgments based on a set of rules, much as a person would.

In recent decades, many businesses have built databases to support customer service. Decision trees are a means of extracting meaningful information from such databases, and they have previously been used in various business and management applications. Decision tree modelling is widely used in customer relationship management, fraud detection, and healthcare management. [What-when-how.com, 2022]

1. Customer Relationship Management


Investigating how people use internet services is a common strategy for managing customer interactions. This type of inquiry involves gathering and evaluating individual user data and making suggestions based on the data gathered.
One study used decision trees to examine the connections between client wants and preferences and online shopping success. The frequency with which users shop online was used as a label to divide users into two groups: (a) users who seldom shop online and (b) users who shop online frequently. For the former, the model indicates that the amount of time customers must spend on a transaction and the urgency with which they need to acquire a product are the essential factors. For the latter, the model reveals that the most significant aspects are pricing and the degree of human resources involved. The developed decision trees also imply that the success of online shopping depends heavily on the frequency of consumers' purchases and on product prices. The information gleaned through decision trees can help businesses better understand their consumers' requirements and preferences. [Lee et al., 2007]

2. Fraud Statement Detection


Identifying falsified financial statements (FFS) is another application in which decision trees are extensively utilised. Such an application is crucial because the existence of FFS may reduce the government's tax revenue [Spathis et al., 2003]. Statistical approaches are one specific tool for identifying FFS, but because many assumptions must be established and the relationships among the vast number of variables in a financial statement must be redefined, it is not easy to uncover all concealed information. Previous research has shown that developing a decision tree is a viable solution to this problem, since a tree can consider all factors throughout the model-creation process.
For one study, 76 Greek manufacturing companies were chosen, and their public financial records, including balance sheets and income statements, were collected for modelling purposes. All non-fraud cases and 92 per cent of fraud instances were accurately categorised by the tree model that was built. According to this result, decision trees can contribute substantially to FFS identification due to their high accuracy rate. [Kirkos et al., 2007]

3. Healthcare Management
Because decision tree modelling can be used to make predictions, many researchers are investigating how decision trees can be utilised in healthcare management.
One study used 516 cases to create a decision tree model that investigates hidden information in the medical histories of developmentally delayed children. The developed model predicts delays in cognitive development, language development, and motor development with 77.3%, 97.8%, and 88.6% accuracy, respectively. These findings might assist healthcare practitioners in providing early intervention so that developmentally delayed children can catch up to their peers in development and growth. [Chang, 2007]
Decision tree calculations have also been implemented to predict the survival of breast cancer
patients. Research in this area showed that using a decision tree to locate and analyse hidden
information in healthcare management is a good idea. [ Delen et al.,2005]

PART 2
EXECUTIVE SUMMARY
Violence against women has become a major topic of discussion in India. The Indian government and media have given this issue considerable attention due to the continually growing crime rate. Many crimes have been perpetrated across India's various states. This study focuses primarily on pattern identification using machine learning to examine trends in crime against women across Indian states. Machine learning is a subfield of artificial intelligence in computer science that uses statistical methods to allow computers to "learn" (i.e., improve performance on a given job) from data without being explicitly programmed. This aids in the proper analysis of data about crime against women, and it may also assist the government in effective policymaking for prevention efforts. This report covers the development of a clustering methodology, including steps such as cleaning the dataset and clustering it. The performance of the algorithm is evaluated, and the configuration with the highest crime-detection accuracy is identified. [Shivani Mishra, ResearchGate, 2019]

• Application area: Crimes against women in India


• Machine learning tool: Clustering method- K-Means algorithm

INTRODUCTION
India keeps track of all reported crimes against women and regularly publishes a list of them. Due to high rates of sexual assault, lack of access to justice in rape cases, underage marriage, female foeticide, and human trafficking, India was named the most hazardous country for women in a study published by the Reuters news agency in June 2018, ranking above war-torn countries like Syria and Afghanistan. The poll was a rerun of one conducted in 2011, which concluded that the most hazardous nations for women were Afghanistan, the Democratic Republic of Congo, Pakistan, India, and Somalia. Experts say India's ascension to the top of the poll shows that not enough is being done to address the threats women face; more than five years earlier, the murder and rape of a student on a Delhi bus had made violence against women a national concern. [Thomson Reuters Foundation, 2018]

Due to the rising influence of the media and the advancement of technology, these concerns have recently gained prominence and been widely discussed and explored. According to the 548 experts on women's issues who were surveyed, India topped the list because its government had done too little to safeguard women. Rapes, female foeticide, sexual assault, and harassment are still occurring. According to the research, documented incidents of crimes against women increased by 83% between 2007 and 2016, with four cases of rape occurring every hour. Even though recorded rapes are on the rise in India, the incidence of rape per lakh people remains far lower than in several Western nations, such as the United States, which experts attribute to years of fear and under-reporting. Understanding crime in India requires analysing the data offered by these massive data sets. By statistically examining the data, researchers may gain insight into the underlying causes of crimes, the criminal mindset, and potential indicators of future crimes. All this categorisation and analysis falls under the umbrella of data science, a discipline that studies massive datasets using probability and statistics and draws valuable conclusions. This report analyses India's most recent data and makes some predictions concerning crime with the help of machine learning techniques.

IMPLEMENTATION
The procedures utilised to acquire the results will be discussed in this section.

1. The Data Set

The data set is accessible to the general public via the website https://data.gov.in/

Cruelty by husbands and relatives, dowry, immoral trafficking, kidnapping, molestation, rape, and sexual harassment are among the crimes included in the dataset. The database records each crime along with the number of victims in each state or union territory. Spyder (Python 3.8) was used to turn this data into clustering-analysis diagrams via k-means, which produced a separate chart for each crime; each state and union territory also received its cluster assignment. This allows a specific crime to be investigated in more depth across the country: the clusters show which regions require more care and which areas are in better shape.

Link to dataset: Crime_Dataset.csv


Source of dataset:
https://data.gov.in/ (District-wise crimes committed against women, 2001-2015)
https://www.kaggle.com/datasets (Crimes against women in India, 2016-2020)

2. Cleaning the dataset

Simple Python code was used to tidy up the data set. Extra headers had to be eliminated for proper graph plotting, so aggregate headings such as ‘All India’ and ‘TOTAL Crime’ were removed from the set. Cleaning code was written for each state separately. After cleaning, a basic graph plot was created, and this basic plot was used for the rest of the procedures.
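The cleaning step above can be sketched as follows. The row labels ‘All India’ and ‘TOTAL Crime’ come from the text; the data structure and counts are hypothetical placeholders, not values from the real dataset.

```python
# Drop aggregate rows such as 'All India' and 'TOTAL Crime' so that only
# state/UT rows remain for plotting (a sketch of the cleaning step).
EXTRA_HEADERS = {'All India', 'TOTAL Crime'}

def clean(rows):
    """rows: list of (area, count) pairs parsed from the raw CSV."""
    return [(area, count) for area, count in rows
            if area not in EXTRA_HEADERS]

# Hypothetical parsed rows, for illustration only
raw = [('All India', 1000), ('Uttar Pradesh', 400),
       ('TOTAL Crime', 1000), ('Lakshadweep', 5)]
print(clean(raw))  # only the two state/UT rows survive
```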

3. Clustering of data

Clustering analysis seems to be the most appropriate strategy for this investigation. It is one of the most widely used analytical methods in data mining, and the choice of clustering algorithm directly impacts the clustering outcomes [Xumin, Yong, 2010]. K-means was chosen because it is easy to apply in software such as RapidMiner, Weka, KNIME, and Orange. The k-means algorithm scales to larger data sets, guarantees convergence, and can warm-start the centroid positions. It adapts quickly to new cases because it generalises to clusters of various shapes and sizes, such as elliptical clusters. Clustering can verify business assumptions about the sorts of groups that exist and can find unknown groups in large data sets. Once the algorithm has run and the groups have formed, any additional data can be readily allocated to the right group, making this a flexible method for categorising data in various ways. Additionally, monitoring whether a data point transitions between groups over time can be used to discover relevant changes in the data. Two challenges we faced when using k-means were scaling with the number of dimensions and clustering data of varied sizes and densities; the value of k had to be chosen manually, and the clustering was not dynamic, since it depended on the initial value of k. [Rishabh, Vidhi, Prathamesh, 2020]

Practical clustering methods employ an iterative algorithm (e.g., k-means, EM) that converges to one of many local minima, and these iterative approaches are particularly sensitive to the initial starting conditions [Motwani, Arora, Gupta, 2019]. K-means separates n observations into k clusters, with each observation belonging to the cluster with the nearest mean (the cluster centre, or centroid), which acts as the cluster's prototype.

Fig 1: Data flow diagram for the clustering algorithm

Fig 2: Methodology of proposed research work.

4. Code snippet
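The code snippet appears only as an image in the source. As a minimal, self-contained sketch of the k-means step it describes, the following implements Lloyd's algorithm in plain Python on hypothetical 1-D crime totals; the function name, data, and fixed initial centroids are illustrative assumptions, not the report's actual code (which likely called a library implementation).

```python
def kmeans_1d(values, centroids, iters=20):
    """Lloyd's algorithm on 1-D data: assign each point to its nearest
    centroid, then move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep a centroid in place if its cluster is empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical state-wise totals: a low-crime group and a high-crime group
totals = [5, 8, 12, 480, 510, 530]
centroids, clusters = kmeans_1d(totals, centroids=[0.0, 1000.0])
print(centroids)   # settles on roughly the means of the two groups
```

With k chosen manually (here k=2 via two initial centroids), the loop converges to the two groups visible in the bar plots; a different initialisation can reach a different local minimum, which is the sensitivity discussed above.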

Fig 3: Scatter plot for Total crimes against women year-wise.

Fig 4: Bar plot for total crimes against women year-wise using KMeans

Fig 5: Scatter plot for total crimes against women state-wise

Fig 6: Bar plot for total crimes against women state-wise

EVALUATING RESULTS
Exceptions will always occur with every machine learning method, so outcome analysis is required to find the best result. Every aspect of this investigation into crimes against women in India shows disturbing results. Figure 6 shows that India's overall crime rate places the country in a high-risk zone: the state of Andhra Pradesh has the highest rate of total crimes, while the union territory of Lakshadweep has the lowest. The results of K-means clustering suggest that the states of Andhra Pradesh, Uttar Pradesh, Madhya Pradesh, Rajasthan, and Maharashtra are the most vulnerable to crime.

Fig 7: Bar plot for rape crimes state-wise

The red bar in the graph above marks the most significant number of rapes, in the state of Madhya Pradesh, while the blue bars mark the lowest incidence of rapes, in Lakshadweep and Daman & Diu.

Fig 8: Bar plot for kidnapping & abduction crimes state-wise

The blue bars in the graph above represent the most significant numbers of kidnappings and abductions, in Uttar Pradesh and Rajasthan.

Similarly, we have obtained the below graphs using the K-means clustering algorithm, which
depict the rate of different types of crimes against women in several states of India.

Fig 9: Bar plot for dowry deaths state-wise

Fig 10: Bar plot for assault of women state-wise

Fig 11: Bar plot for the insult to the modesty of women crimes, state-wise

Fig 12: Bar plot for cruelty crimes by husband or relatives against women, state-wise

NOVELTY AND SIGNIFICANCE
Not much research has been done on the rate of crime against women in India. Last year, the NCRB (National Crime Records Bureau), which is part of the Union home ministry of India, recorded a total of 371,503 incidents of crime against women across the nation, compared with 405,326 in 2019 and 378,236 in 2018. Cases of crime against women in cities were down by 8.3% in 2020 compared with 2019. This report analyses all recent data, and Figure 4 shows the corresponding reduction in bar height for 2020.

The crime-rate case study demonstrates that machine learning is effective for analysing data on various crimes by state and year. Rapes, kidnappings, assaults, and molestation are rising, and women are becoming increasingly vulnerable. The law-and-order situation in the states with the most significant crime rates should be improved, and the government should adopt social awareness initiatives that reduce these crimes by teaching people to respect women's dignity, especially in the states with higher crime rates. The machine learning approach used here can be combined with various other algorithms to find information about crimes in different areas of different states. Though the clustering analysis has provided sufficient clarity on crime against women and its influence on society, more work can still be done on the same dataset. Future research directions are as follows:

1. For further predictions, classification and regression analysis can be performed on the same
dataset. Advanced technologies such as machine learning and artificial intelligence (AI) can
aid in crime analysis prediction.

2. The women's crime information can be combined with the population density and literacy
rates of specific Indian states. Appropriate statistical analyses can determine whether literacy
contributes to various types of crimes.

3. The clustering technique can be advanced further to compare crime clusters to other nations.

REFERENCES
1. Chang, C.-L. (2007). A study of applying data mining to early intervention for developmentally delayed children. Expert Systems with Applications, 33(2), pp.407–412. Available at: https://www.sciencedirect.com/science/article/abs/pii/S0957417406001552 [Accessed 11 Jan. 2022].

2. Delen, D., Walker, G. and Kadam, A. (2005). Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine, 34(2), pp.113–127. Available at: https://www.sciencedirect.com/science/article/pii/S0933365704001010 [Accessed 11 Jan. 2022].

3. Humanoriented.com. (2022). Decision Tree Classifier. Available at: http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/decisionTree.html [Accessed 11 Jan. 2022].

4. Brownlee, J. (2016). What is a Confusion Matrix in Machine Learning? Machine Learning Mastery. Available at: https://machinelearningmastery.com/confusion-matrix-machine-learning/ [Accessed 11 Jan. 2022].

5. Kirkos, E., Spathis, C. and Manolopoulos, Y. (2005). Detection of fraudulent financial statements through the use of data mining techniques. In: Proceedings of the 2nd International Conference on Enterprise Systems and Accounting, Thessaloniki, Greece, pp.310–325. [Accessed 10 Jan. 2022].

6. Kirkos, E., Spathis, C. and Manolopoulos, Y. (2007). Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications, 32(4), pp.995–1003. Available at: https://www.sciencedirect.com/science/article/abs/pii/S0957417406000765 [Accessed 11 Jan. 2022].

7. Lee, S., Lee, S. and Park, Y. (2007). A prediction model for the success of services in e-commerce using decision tree: E-customer's attitude towards online service. Expert Systems with Applications, 33(3), pp.572–581. Available at: https://www.sciencedirect.com/science/article/abs/pii/S0957417406001825 [Accessed 11 Jan. 2022].

8. Motwani, M., Arora, N. and Gupta, A. (2019). A study on initial centroids selection for partitional clustering algorithms. In: Software Engineering, pp.211–220. Springer, Singapore.

9. Na, S., Xumin, L. and Yong, G. (2010). Research on k-means clustering algorithm: an improved k-means clustering algorithm. In: 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, pp.63–67. IEEE.

10. Reddy, R., Kapoor, V. and Prathamesh (2020). K-means clustering analysis of crimes on Indian women. ResearchGate. Available at: https://www.researchgate.net/publication/343501538_K-means_Clustering_Analysis_of_Crimes_on_Indian_Women [Accessed 6 Jan. 2022].

11. Mishra, S. and Kumar, S. (2019). A comparative study of crimes against women based on machine learning using big data techniques. ResearchGate. Available at: https://www.researchgate.net/publication/357810730_A_comparative_study_of_crimes_against_women_based_on_Machine_Learning_using_Big_Data_techniques_ABSTRACT [Accessed 6 Jan. 2022].

12. Thomson Reuters Foundation (2018). India most dangerous country for women with sexual violence rife - global poll. 26 June 2018. Available at: https://www.reuters.com/article/women-dangerous-poll-idINKBN1JM076 [Accessed 5 Jan. 2022].

13. What-when-how.com. (2022). Decision Tree Applications for Data Modelling (Artificial Intelligence). Available at: http://what-when-how.com/artificial-intelligence/decision-tree-applications-for-data-modelling-artificial-intelligence/ [Accessed 11 Jan. 2022].

APPENDIX I
Step 3: Calculating further Gini splits, as no nodes are homogeneous yet.

1. For Safety= High and Buying= High


a. Maintenance
High (30) Low (36)
acc 8 acc 22
unacc 22 unacc 14

Gini(High) = 1 - (8/30)^2 - (22/30)^2 = 0.3911


Gini(Low) = 1 - (22/36)^2 - (14/36)^2 = 0.4753
Ginisplit (Maintenance) = 30/66 * 0.3911 + 36/66 * 0.4753 = 0.437

b. Doors
3(34) 5 (32)
acc 16 acc 14
unacc 18 unacc 18

Gini(3) = 1 - (16/34)^2 - (18/34)^2 = 0.4983


Gini(5) = 1 - (14/32)^2 - (18/32)^2 = 0.4922
Ginisplit (Doors) = 34/66 * 0.4983 + 32/66 * 0.4922 = 0.4953

The next level node for ‘Safety= High and Buying= High’ would be ‘Maintenance’ since it
has the lowest Ginisplit value.

2. For Safety= High and Buying= Medium


a. Maintenance
High (17) Low (22)
acc 12 acc 15
unacc 5 unacc 7

Gini(High) = 1 - (12/17)^2 - (5/17)^2 = 0.4152


Gini(Low) = 1 - (15/22)^2 - (7/22)^2 = 0.4339
Ginisplit (Maintenance) = 17/39 * 0.4152 + 22/39 * 0.4339 = 0.4257

b. Doors
3(18) 5 (21)
acc 13 acc 14
unacc 5 unacc 7

Gini(3) = 1 - (13/18)^2 - (5/18)^2 = 0.4012


Gini(5) = 1 - (14/21)^2 - (7/21)^2 = 0.444
Ginisplit (Doors) = 18/39 * 0.4012 + 21/39 * 0.444 = 0.4242

The next level node for ‘Safety= High and Buying= Medium’ would be ‘Doors’ since it has
the lowest Ginisplit value.

3. For Safety= High and Buying= Low


a. Maintenance
High (16) Low (13)
acc 9 acc 9
unacc 7 unacc 4

Gini(High) = 1 - (9/16)^2 - (7/16)^2 = 0.4922


Gini(Low) = 1 - (9/13)^2 - (4/13)^2 = 0.426
Ginisplit (Maintenance) = 16/29 * 0.4922 + 13/29 * 0.426 = 0.4625

b. Doors
3(17) 5 (12)
acc 11 acc 7
unacc 6 unacc 5

Gini(3) = 1 - (11/17)^2 - (6/17)^2 = 0.4567


Gini(5) = 1 - (7/12)^2 - (5/12)^2 = 0.4861
Ginisplit (Doors) = 17/29 * 0.4567 + 12/29 * 0.4861 = 0.4689

The next level node for ‘Safety= High and Buying= Low’ would be ‘Maintenance’ since it
has the lowest Ginisplit value.

4. For Safety= Med and Buying= High
a. Maintenance
High (32) Low (36)
acc 3 acc 13
unacc 29 unacc 23

Gini(High) = 1 - (3/32)^2 - (29/32)^2 = 0.1699


Gini(Low) = 1 - (13/36)^2 - (23/36)^2 = 0.4614
Ginisplit (Maintenance) = 32/68 * 0.1699 + 36/68 * 0.4614 = 0.3242

b. Doors
3(35) 5 (33)
acc 7 acc 9
unacc 28 unacc 24

Gini(3) = 1 - (7/35)^2 - (28/35)^2 = 0.32


Gini(5) = 1 - (9/33)^2 - (24/33)^2 = 0.3967
Ginisplit(Doors) = 35/68 * 0.32 + 33/68 * 0.3967 = 0.3572

The next level node for ‘Safety= Medium and Buying= High’ would be ‘Maintenance’ since
it has the lowest Ginisplit value.

5. For Safety= Med and Buying= Med


a. Maintenance
High (18) Low (16)
acc 9 acc 11
unacc 9 unacc 5

Gini(High) = 1 - (9/18)^2 - (9/18)^2 = 0.5


Gini(Low) = 1 - (11/16)^2 - (5/16)^2 = 0.4297
Ginisplit(Maintenance) = 18/34 * 0.5 + 16/34 * 0.4297 = 0.467

b. Doors
3(13) 5 (21)
acc 8 acc 12
unacc 5 unacc 9

Gini(3) = 1 - (8/13)^2 - (5/13)^2 = 0.4733


Gini(5) = 1 - (12/21)^2 - (9/21)^2 = 0.4898
Ginisplit(Doors) = 13/34 * 0.4733 + 21/34 * 0.4898 = 0.4835

The next level node for ‘Safety= Medium and Buying= Medium’ would be ‘Maintenance’
since it has the lowest Ginisplit value.

6. For Safety= Med and Buying= Low


a. Maintenance
High (13) Low (18)
acc 8 acc 12
unacc 5 unacc 6

Gini(High) = 1 - (8/13)^2 - (5/13)^2 = 0.4733


Gini(Low) = 1 - (12/18)^2 - (6/18)^2 = 0.44
Ginisplit (Maintenance) = 13/31 * 0.4733 + 18/31 * 0.44 = 0.4539

b. Doors
3(17) 5 (14)
acc 6 acc 14
unacc 11 unacc 0

Gini(3) = 1 - (6/17)^2 - (11/17)^2 = 0.4567


Gini(5) = 1 - (14/14)^2 - (0/14)^2 = 0
Ginisplit (Doors) = 17/31 * 0.4567 + 14/31 * 0 = 0.2504

The next level node for ‘Safety= Medium and Buying= Low’ would be ‘Doors’ since it has
the lowest Ginisplit value.
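As a cross-check, the appendix splits can be recomputed in one loop. The (acc, unacc) counts are copied from the tables above; the helper names are illustrative.

```python
def gini(counts):
    """Gini impurity for one attribute value, given per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(groups):
    """Weighted Gini impurity over the groups produced by a split."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

# (acc, unacc) counts per attribute value, copied from the appendix tables
splits = {
    'S=High, B=High / Maintenance': [(8, 22), (22, 14)],   # 0.4370
    'S=High, B=High / Doors':       [(16, 18), (14, 18)],  # 0.4953
    'S=Med,  B=Low  / Doors':       [(6, 11), (14, 0)],    # 0.2504
}
for name, groups in splits.items():
    print(f'{name}: {gini_split(groups):.4f}')
```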

