DM Cia 4
DATA MINING
3 BBA FMA A
BY:
Bengaluru-73
TABLE OF CONTENTS
1. Classification of Data
2. About the Dataset
3. Classification Algorithms
4. Decision Tree
5. Naïve Bayes
6. K-Nearest Neighbours (KNN)
7. Bibliography
1. CLASSIFICATION OF DATA
Classification of data is a type of supervised machine learning in which the outcome to be predicted is already defined; in simpler terms, the categories into which the data is to be classified are known in advance. The model is trained on features paired with previously tagged labels, known as the output. It learns relationships in the data and then applies those relationships to assign new data to the right class. Classification may be binary, meaning it puts objects or phenomena into one of two categories, or multiclass, meaning it puts them into more than two categories.
For instance, in a medical diagnosis system, classification helps decide whether a patient should be classified as having diabetes or not, given features such as age, blood sugar level and BMI. The training set would contain these features for each patient, coupled with the correct label indicating whether or not that patient has diabetes. In essence, once trained, the model can predict whether or not new patients have diabetes from their health indicators.
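As an illustrative sketch (not part of the original report), the diabetes example above could be trained with scikit-learn; the patient records below are invented purely for demonstration:

```python
# Illustrative sketch: a binary classifier predicting diabetes from
# age, blood sugar level and BMI. The records are made-up examples.
from sklearn.tree import DecisionTreeClassifier

# Each row: [age, blood sugar (mg/dL), BMI]; labels: 1 = diabetes, 0 = no diabetes
X = [[25, 85, 22.0], [50, 160, 31.5], [38, 140, 28.0],
     [60, 180, 33.0], [30, 90, 24.5], [45, 100, 26.0]]
y = [0, 1, 1, 1, 0, 0]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Predict for a new, unseen patient
new_patient = [[55, 170, 32.0]]
print(model.predict(new_patient))
```

Once fitted, the same `predict` call can be applied to any batch of new patient records with the same three features.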
Classification of data is used because it makes the process of decision-making easier through the labeling of data. In numerous real-life cases, people have to assess objects, events or situations in order to analyze them, anticipate subsequent events or provide an adequate response. For instance, in fraud detection the output is either “fraud” or “not fraud”, and the bank aims to reduce its losses. In medical diagnosis, classified patient records assist doctors in giving the correct treatment at the right time. Classification makes it possible to accomplish these tasks in a short span of time, saving time and resources as well as minimizing errors made by human beings.
Improved Decision-Making: Classification helps decision makers in organizations gain better and faster insight by organizing information in whatever manner they deem most efficient.
Automation of Complex Tasks: It makes it possible to automate tasks such as identifying a fraudulent case, rejecting spam mail or detecting a potential disease in a patient, which could otherwise be a tiresome exercise.
Enhanced Accuracy: Classification models deliver more consistent predictions and outcomes, significantly reducing human error.
Scalability: Classification algorithms can easily cope with big data, meaning they can be scaled up to suit higher volumes of data or the changing needs of a business.
Cost and Time Efficiency: Automated classification saves practitioners time and resources when completing large classification tasks, allowing the business to focus on more valuable work while maintaining a consistently high level of classification quality.
Users of data classification are found across industries and sectors, because it is widely used for decision-making, prediction and risk assessment:
Healthcare Professionals: Doctors and researchers use classification in the medical field to identify diseases from symptoms and to form a prognosis from medical records and test results.
Financial Institutions: Commercial banks and insurance companies use transaction data to identify fraud, assess credit risk and approve loans based on customer data and transaction-type analysis.
Retailers: Marketing professionals sort customers on their internet platforms according to their behavior in order to offer relevant recommendations about commodities they are interested in.
Government and Law Enforcement: Agencies sort through data relevant to surveillance, crime and national security by mining massive amounts of publicly and privately generated data.
Tools for data classification offer extensive features and capabilities, including the following:
TensorFlow: An open-source machine learning library from Google that makes use of deep learning algorithms and neural networks for classification tasks.
Scikit-learn: One of the most powerful and popular Python libraries, containing various data classification tools such as decision trees, k-NN and support vector machines.
IBM Watson: An AI platform that provides effective data categorization, giving organizations critical insights in business intelligence, customer engagement and healthcare.
Weka: A complete set of machine learning algorithms and utilities for data classification that remains widely used for educational and research purposes.
RapidMiner: An easy-to-use data science tool for creating classification models, usable both by professionals and by those with no programming background.
2. ABOUT THE DATASET
This data is inspired by CRICBUZZ and is assumed to represent the playing conditions on the days of the past 6 test matches played at CHINNASWAMY STADIUM, BENGALURU.
Explanation of columns:
Row No:
This is just an identifier having a sequential number from 1 to 20 for every observation or record in
the dataset.
Play (Target):
This is the dependent variable whose relationship with the independent variables the model seeks to find. It tells us whether a game or activity is possible, with Yes meaning the activity is possible and No meaning it isn’t. This makes it a binary classification problem.
Temperature (°F):
The temperature on the day of the event, presented in degrees Fahrenheit. This numeric value may influence the decision to play, because too high or too low a temperature may lead to the match not being played.
Humidity (%):
The humidity percentage, which influences human comfort when performing outdoor activities. A highly humid environment makes hot conditions feel even more uncomfortable for play. It varies between 55% and 95% in the data.
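A minimal sketch of how the first few rows of the dataset described above might be represented in pandas; the values here are hypothetical, not the actual CRICBUZZ data:

```python
# Hypothetical rows matching the column descriptions above (the real
# dataset has 20 rows; these values are invented for illustration).
import pandas as pd

df = pd.DataFrame({
    "Row No": [1, 2, 3, 4],               # sequential identifier
    "Temperature": [85, 72, 78, 90],       # °F
    "Humidity": [65, 95, 70, 55],          # %, between 55% and 95%
    "Play": ["Yes", "No", "Yes", "Yes"],   # binary target
})
print(df)
```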
3. CLASSIFICATION ALGORITHMS
Classification algorithms are basic techniques in machine learning and are especially useful in problem-solving tasks that aim to assign input data to specific categories or labels. They fall under the category of supervised learning, as they make use of training data to make forecasts on unknown data. Popular classification methods include the Decision Tree, which builds a tree-structured model that sorts data by asking questions about feature values: every node stands for a test on an attribute, while the branches represent the possible outcomes, until a classification decision is reached at a leaf. Random Forest, an extension of decision trees, trains many decision trees and, at prediction time, returns the mode of their classes, which improves accuracy and reduces overfitting. k-Nearest Neighbors (k-NN) is a simple, instance-based algorithm that classifies new instances according to their k nearest neighbors in the feature space; it is an intuitive method that is reasonably efficient for many datasets, especially when the dimensionality is not high. The Naive Bayes Classifier is another important algorithm, and also one of the simplest, because it relies on the assumption that all features are independent; it has low complexity and is well suited for text classification tasks such as spam detection. In real problems, the choice of classification algorithm often depends on the nature of the data and the application, the size and structure of the dataset, the difficulty of the task and the resources available. These algorithms can be applied in almost any area and organization, from disease diagnosis in health care and identity theft detection in finance to recommendation systems in e-commerce, making them a critical piece of data science and artificial intelligence.
4. DECISION TREE
A Decision Tree is perhaps one of the most intuitive classification and regression algorithms in machine learning. It has a flowchart-like structure, where each internal node represents a decision on a feature, or attribute, each branch represents the outcome of that decision, and each leaf node represents a final classification or output value. The tree is constructed by recursively splitting the dataset into subsets, based upon the most significant feature at each node, so as to maximize the distinction between the target classes. The key metrics used in decision trees to decide on the best split are Gini Impurity and Information Gain (derived from entropy), which compute the quality of a split by quantifying how well the data points in a subset are separated by the chosen feature. Decision trees are quite easy to interpret, because the rules for classification or regression are simple and can be visualized, which makes them of great use in understanding the decision process; however, they suffer from the problem of overfitting, where the tree becomes too complex and learns the noise in the training data, causing poor generalization on unseen data. To overcome this, methods such as pruning are applied in order to remove unnecessary branches, thereby reducing the complexity of the tree. Decision trees can handle both numerical and categorical data and do not require features to be scaled or normalized, which makes them very flexible. One of the benefits of decision trees is that they can capture complex relationships without major preprocessing of the dataset. However, they are sensitive to small variations in the data, which can lead to different tree structures. As a whole, the decision tree is a very powerful baseline algorithm, which is often used together with other techniques to improve performance on a multitude of real-world machine learning tasks.
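To make the split metric concrete, here is a small sketch (with invented toy data, not the match dataset) of computing Gini impurity and fitting a shallow scikit-learn tree:

```python
# Gini impurity and a toy decision tree; the [temperature °F, humidity %]
# rows below are invented purely to illustrate the mechanics.
from collections import Counter
from sklearn.tree import DecisionTreeClassifier, export_text

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["Yes", "Yes", "No", "No"]))  # maximally mixed subset
print(gini(["Yes", "Yes", "Yes"]))       # pure subset

# criterion="gini" picks, at each node, the split that most reduces impurity.
X = [[85, 85], [80, 90], [83, 78], [70, 96],
     [68, 80], [65, 70], [72, 95], [75, 70]]
y = ["No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes"]
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["Temperature", "Humidity"]))
```

`export_text` prints the learned rules in the same nested if/else form that RapidMiner's tree view conveys graphically.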
STEP 1: Import the dataset from Excel into “Altair RapidMiner”. Insert a ‘Select Attributes’ operator to exclude ‘Row No’.
STEP 2: Insert the “Set Role” operator and set “Match Timing” as the label, so the decision tree will be built on its basis.
STEP 3: Insert the “Decision Tree” operator, connect its ports, and run the process.
Step 4: Result
INTERPRETATIONS: Temperature is the most significant factor influencing match timing.
Temperature as the Key Determinant: The root node splits on whether the temperature is ≤ 86.5°F, which indicates temperature is the most important factor in deciding match timing; cricket is an outdoor game, and extreme temperatures during play are not suitable for players.
Temperature > 86.5°F: When the temperature is greater than 86.5°F, the tree directly classifies the match as being played in the Morning. This decision is represented as: “Morning {Afternoon=0, Morning=2, Evening=0}”. That is, when the temperature is very high, mornings are preferred to avoid the hottest time of day, and no matches are scheduled in the afternoon or evening.
Temperature ≤ 86.5°F: If the temperature is less than or equal to 86.5°F, the tree goes deeper and checks whether the temperature is greater than 77°F, leading to two further branches.
Temperature > 77°F: For temperatures greater than 77°F and less than or equal to 86.5°F, the tree predicts that the match is most likely to be held in the Afternoon, with a few occurrences in the Evening. The decision text reads “Afternoon {Afternoon=6, Morning=0, Evening=3}”, which shows that for this temperature range, 6 matches were scheduled in the afternoon and 3 in the evening. So for this range of temperatures it is optimal to play in the afternoon, but if it is a little too warm then evening games are preferred.
Temperature ≤ 77°F: When the temperature is less than or equal to 77°F, the tree classifies the match as being in the Morning, while also predicting some matches in the Evening and a few in the Afternoon. “Morning {Afternoon=1, Morning=5, Evening=3}” means a total of 5 matches in the morning, 3 in the evening and 1 in the afternoon under cooler temperatures. So under cooler temperatures, morning matches are preferred; the early hours of the day may offer better visibility, possibly because of higher humidity later in the day.
For temperatures lower than or equal to 77°F, the model introduces a third decision criterion, based on the Row No.
This additional criterion is likely being used as a way to handle remaining ambiguities in the dataset.
Rows numbered higher than 5 are associated with Morning matches, while rows numbered 5 or
lower are linked with Afternoon matches.
Row No. > 5:
If the row number is greater than 5 and the temperature is low (≤ 77°F), the match timing is predicted
to be Morning.
This covers 4 Morning matches and 3 Evening matches.
Row No. ≤ 5:
If the row number is 5 or lower and the temperature is low (≤ 77°F), the match timing is predicted as
Afternoon.
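The splits described above can be transcribed as simple nested rules; this is a hand-written rendering of the tree's decisions, not RapidMiner output:

```python
def predict_match_timing(temperature_f: float, row_no: int) -> str:
    """Hand-coded transcription of the decision tree splits described above."""
    if temperature_f > 86.5:
        return "Morning"    # Morning {Afternoon=0, Morning=2, Evening=0}
    if temperature_f > 77:
        return "Afternoon"  # Afternoon {Afternoon=6, Morning=0, Evening=3}
    # Temperature <= 77°F: the tree falls back on Row No
    if row_no > 5:
        return "Morning"    # majority Morning, some Evening
    return "Afternoon"

print(predict_match_timing(90, 1))   # hot day
print(predict_match_timing(80, 3))   # moderate day
```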
Detailed Insights:
Morning Matches:
The tree predicts morning matches in two distinct temperature ranges:
When the temperature is above 86.5°F, all the matches are played in the morning, possibly because of extreme heat: the morning is the only time before the temperature rises to its peak.
When the temperature is 77°F or less, morning matches are again the most frequent, likely because it is cooler early in the day.
Afternoon Matches:
Afternoon games are most frequently predicted when the temperature is between 77°F and 86.5°F. In this range, it is neither too cold nor too hot, so the afternoon is the optimal time for these games. We do also see games happening in the evening, likely because the afternoon can get slightly too hot.
Evening Matches:
Evening matches occur in two cases: in fairly warm temperatures (over 77°F but below 86.5°F) and in cool temperatures (≤ 77°F). Nevertheless, evening matches are less common than morning or afternoon matches. They look more like a last-resort option, used when the temperatures are still playable but other factors make it hard to schedule a match during the day, and playing at noon is to be avoided.
Conclusion:
Analyzing the features that influence the decision of when exactly a cricket match is to take place, the decision tree reveals that temperature makes the greatest impact. The tree splits at key temperature thresholds, showing that:
If temperatures go beyond 86.5°F, the majority of matches occur in the morning, to avoid the heat.
In moderate temperatures (between 77°F and 86.5°F), matches are mainly in the afternoon, but there are also matches in the evening.
At temperatures of 77°F or below, morning games are favored once more, but there can be games in the evening and even the afternoon.
This decision-making exercise is consistent with the way many people would think about cricket fixtures: scheduling the match at a time that is comfortable for the players, under conditions that are suitable for the game. Matches played early in the morning are preferable if the temperature is very high or low, while afternoon matches are preferred if the temperature is warm but not extreme. Evening matches act as an alternative, most probably when the conditions in the afternoon are not good enough. The tree offers a good guide on how decisions on match timing can be reached based on the temperature factor alone.
5. NAÏVE BAYES
Naïve Bayes is a probability-based classifier algorithm built on Bayes’ theorem, widely used in machine learning due to its simplicity and effectiveness. It presupposes that all features are independent of each other conditioned on the class label; although this is a very strong assumption, in practice it yields very good results in many cases. The algorithm calculates the posterior probability of each class given the input features from the prior probability of the class and the likelihood of each feature, and predicts the class with the highest posterior. Naïve Bayes classifiers are suitable for big data, and the most suitable applications include text mining, spam filtering, sentiment analysis and diagnosis datasets. There are three main types of Naïve Bayes classifiers, each designed for a different data type: Gaussian, Multinomial and Bernoulli. Gaussian Naïve Bayes is used for continuous data, as it assumes the data is normally distributed. Multinomial Naïve Bayes is ideal for discrete data such as word counts, while Bernoulli Naïve Bayes is ideal when the feature vectors are binary or boolean. The key strength of the model is that it can handle a large number of features and compute predictions quickly. The “naïve” independence assumption turns out to be astonishingly effective much of the time, particularly where the dependency of features on one another contributes nothing for or against classification. However, its accuracy degrades when the independence assumption is violated, or when dealing with small datasets where it may not estimate probabilities properly. To conclude, Naïve Bayes is a robust and effective method that can be applied to many real classification problems where the speed and simplicity of a classifier matter.
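A minimal Gaussian Naïve Bayes sketch in scikit-learn; the feature values and cloud-cover labels below are invented, in the spirit of the analysis that follows:

```python
# Gaussian Naive Bayes on toy continuous [temperature °F, humidity %]
# features with invented cloud-cover class labels.
from sklearn.naive_bayes import GaussianNB

X = [[85, 60], [72, 95], [78, 70], [90, 55], [68, 92], [75, 88]]
y = ["Low", "High", "Low", "Low", "High", "Medium"]

nb = GaussianNB()
nb.fit(X, y)
print(nb.predict([[70, 93]]))        # most probable cloud-cover class
print(nb.predict_proba([[70, 93]]))  # posterior probability per class
```

`predict_proba` exposes the per-class posteriors that the classifier compares before outputting the most probable class, mirroring the interpretation given below.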
STEP 1: Retrieve the dataset into the process. Add a ‘Split Data’ operator with partition ratios 0.6 and 0.4.
STEP 2: Add the “Set Role” operator and set the label to “Cloud Cover”.
STEP 3: Add the “Naïve Bayes” operator and the “Apply Model” operator and connect the ports as follows.
STEP 4: Run the process.
Interpretation:
The Naive Bayes classifier operates by computing the probabilities of Low, Medium and High cloud cover given the input data, then outputs the most probable class.
Low Cloud Cover: This is the most common class in the dataset, with an occurrence rate of 0.50. That is, 50% of the matches are played under low cloud cover conditions. The distributions under this class indicate that low cloud cover is spread fairly evenly across match timings. Cricket matches under low cloud tend to correlate with better forecast conditions, probably contributing to the decision to play more often at those times of the day, morning or afternoon, that typically provide better weather for a game of cricket. The density plot for low cloud cover shows a larger spread of match instances, covering almost all parts of the day, which is why this class is the most common.
High Cloud Cover: The high cloud cover class has a probability of 0.25, indicating that 25% of the matches are played under these conditions. While still small, this shows it is not as rare as one might have thought. High cloud cover usually comes with rain or overcast conditions, which can influence the chances of play, especially during the early morning or evening when the temperatures are low and unsuitable for play. The density plot for high cloud cover is somewhat more compact than that of low cloud cover, indicating that these matches occur within a much smaller range and may be linked to particular weather patterns such as high humidity and low temperatures.
Medium Cloud Cover: As in the high cloud cover case, its probability is 0.25, which means it accounts for a quarter of the matches. The distributions for this class indicate that the number of matches under medium cloud cover is as significant as that under high cloud cover. Medium cloud cover means it is partly cloudy, so matches can be played during such conditions; it sits mid-way between clearer and overcast conditions. The density plot for medium cloud cover has a narrower bell curve, which suggests that these matches cluster more closely around this particular weather condition, for example in the afternoon when the sky is neither too clear nor too cloudy.
In this particular case, the Naive Bayes model makes it easy to comprehend how levels of cloud cover affect match timings. The density plot helps show which cloud cover conditions occur most frequently and how these conditions are spread across the different matches. For example, although the peak of the low cloud cover curve is not very high, it indicates more numerous and varied cricket match occurrences. On the other hand, the high and medium cloud cover classes are more concentrated in terms of frequency and dispersion, since these classes represent instances where clouds are abundant, as opposed to the low cloud cover class.
Conclusion:
The detailed distributions, in addition to the Naive Bayes classifier, show the effect of weather conditions, especially cloud cover, on the scheduling of cricket matches. This analysis enables future match timings to be predicted with reference to cloud cover and other comparable variables, so it can be a useful tool for players and match organizers. Evaluating the results depicted in this data makes it possible to predict how often matches will take place under specific weather conditions, which makes it easier to plan and prepare for the matches.
6. K-NEAREST NEIGHBOURS (KNN)
K-Nearest Neighbors (KNN) is one of the simplest and yet most popular algorithms, and it can be used for both classification and regression. It works by finding the ‘k’ nearest data points (neighbors) to a query instance, where distance is defined by a distance metric such as Euclidean distance, and then making a prediction according to the majority class or mean of these neighbors. In classification, the class occurring most frequently among the k neighbors becomes the predicted class for the query instance; in regression, the average or weighted average of the neighbors’ values is used as the prediction. KNN is easy to implement because, due to its non-parametric nature, it makes no assumptions about the data, which can be an advantage when learning from different kinds of datasets. Another strength of this algorithm is that KNN supports multi-class classification and can handle continuous data, making it versatile in areas such as recommendation systems, pattern recognition and medical diagnosis. Further, KNN can be improved with some variations: weighted KNN gives more importance to nearer neighbors than to farther ones, and methods such as Principal Component Analysis (PCA) can be applied to reduce dimensionality and improve computation time. Like any other distance-based approach, KNN is sensitive to the proportion of records among the classification classes; however, KNN excels on small, clean datasets, as it is easily understandable since the decision boundary is formed directly from the data points. Because the algorithm does not learn a model and only performs its computations at query time, it is called lazy, which makes it computationally costly during inference. On the other hand, it does not carry the uncertainty of model assumptions or approximations. In general, KNN is an easy-to-apply and easily understood algorithm, suitable for small-scale data or for cases where interpretability and comprehensibility are prioritized over the time and energy needed for accurate model training.
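A minimal k-NN sketch with invented play/no-play data (k = 3 neighbours voting by majority, using scikit-learn's default Euclidean distance):

```python
# k-NN classification of play/no-play from [temperature °F, humidity %];
# the training rows are invented for illustration.
from sklearn.neighbors import KNeighborsClassifier

X = [[85, 60], [70, 95], [80, 70], [68, 90], [88, 55], [72, 92]]
y = ["Yes", "No", "Yes", "No", "Yes", "No"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[82, 65]]))  # warm and relatively dry query point
```

Note that `fit` here only stores the training points; the distance computations happen at `predict` time, which is exactly the lazy behaviour described above.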
STEP 1: Retrieve the dataset twice and insert ‘Generate ID’ twice. Add ‘Set Role’ to the first branch and mark “Weather Condition” as the label.
STEP 2: Add the ‘KNN’ operator and then the “Apply Model” operator.
STEP 3: Result
INTERPRETATIONS:
The scatter plot in the image compares the weather conditions against the temperature for each match timing, namely morning, afternoon and evening. The size of each bubble is proportional to the humidity percentage of the match, with a bigger bubble indicating a greater humidity level.
Weather Condition:
The X-axis categorizes the data points into three distinct weather conditions: Sunny, Rainy, and Overcast. All of these weather types affect play, as some weather, such as rain, may make the game difficult to hold. The analysis shows that “No” results, where the game is not played, are coupled with rainy conditions, while “Yes” results, where the game is played, are related to sunny and overcast conditions.
Temperature:
The second factor, plotted on the Y-axis, is the temperature in degrees Fahrenheit, against which whether the game is played can be read. Conditions are usually moderately hot, with occasional high heat, and temperatures fall between 68°F and 88°F. The data extracted from the plot show that temperatures above 80°F are more favorable to games being played, especially under sunny or overcast conditions. On the other hand, games are seldom played at lower temperatures, around 70°F, combined with dampness.
Humidity:
The size of each bubble indicates the humidity level, with larger bubbles signifying higher humidity. The relative humidity percentages recorded are particularly high under rainy weather, between 85% and 95%. However, some games do take place in high humidity under sunny or overcast conditions, showing that humidity alone is not the root cause, though high humidity coupled with rain results in a high “No” play percentage.
The dataset includes instances tagged as ‘Yes’, the game was played, or ‘No’, the game was not played. The model assesses a new game by matching the given conditions to the closest counterparts in the dataset and assigns the majority class label (Yes or No) to the new example. Based on the trends derived from the analysis, one would expect the model to label new games as “No” especially under rain, high humidity and cooler temperatures, and “Yes” under sunny or overcast conditions, moderate temperatures and lower humidity.
Conclusion:
From the k-NN classification and analysis of the visualization results, weather condition, temperature and humidity are the key drivers of whether games are played. Cloudy, cold and highly humid conditions, especially during the rainy period, are conducive to cancellation of games, while relatively warm and sunny or overcast weather favors the games.
7. BIBLIOGRAPHY
1. https://atlan.com/what-is-data-classification/?form=MG0AV3
2. https://dataaspirant.com/classification-algorithms/?form=MG0AV3
3. https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
4. https://www.javatpoint.com/machine-learning-naive-bayes-classifier
5. https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
6. RapidMiner
7. Cricbuzz