
Classification


UNIT 2

CLASSIFICATION
• What is Classification in Data Mining?
• Classification is a technique in data mining that involves categorizing or
classifying data objects into predefined classes, categories, or groups based
on their features or attributes.
• It is a supervised learning technique that uses labelled data to build a model
that can predict the class of new, unseen data.
• It is an important task in data mining because it enables organizations to
make informed decisions based on their data.
• There are two main types of classification: binary classification and
multi-class classification.
• Binary classification involves classifying instances into two classes, such as
“spam” or “not spam”.
• Multi-class classification involves classifying instances into more than two
classes.
• Steps to Build a Classification Model
• There are several steps involved in building a classification model, as listed
below:
• Data preparation - The first step in building a classification model is to
prepare the data. This involves collecting, cleaning, and transforming the
data into a suitable format for further analysis.
• Feature selection - The next step is to select the most important and
relevant features that will be used to build the classification model. This can
be done using various techniques, such as correlation, feature importance
analysis, or domain knowledge.
• Prepare train and test data - Once the data is prepared and relevant features
are selected, the dataset is divided into two parts: a training set and a test set.
The training set is used to build the model, while the test set is used to
evaluate the model's performance.
• Model selection - Many algorithms can be used to build a classification
model, such as decision trees, logistic regression, k-nearest neighbors, and
neural networks. The choice of algorithm depends on the type of data, the
number of features, and the desired accuracy.
• Model training - Once the algorithm is selected, the model is trained on the
training dataset. This involves adjusting the model parameters to minimize
the error between the predicted and actual class labels.
• Model evaluation - The model's performance is evaluated using the test
dataset. The accuracy, precision, recall, and F1 score are commonly used
metrics to evaluate the model performance.
• Model tuning - If the model's performance is not satisfactory, the model can
be tuned by adjusting the parameters or selecting a different algorithm. This
process is repeated until the desired performance is achieved.
• Model deployment - Once the model is built and evaluated, it can be
deployed in production to classify new data. The model should be monitored
regularly to ensure its accuracy and effectiveness over time.
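The data-splitting and evaluation steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the helper names `train_test_split` and `evaluate`, the 25% test fraction, and the example labels are all assumptions made for this sketch.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def evaluate(actual, predicted, positive):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical true labels and model predictions for six emails.
actual = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]
metrics = evaluate(actual, predicted, positive="spam")
print(metrics)
```

In practice the predictions would come from a model trained on the training split only, and the metrics would be computed on the held-out test split.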
• Classification Vs. Regression in Data Mining

| Factors | Classification | Regression |
|---|---|---|
| Task/Objective | Identifying or assigning the class label of a new observation based on its features. | Estimating a continuous or discrete value for a new observation based on its features. |
| Outcome | Categorical variable, i.e., a class label or category. | Continuous or discrete variable, i.e., a numeric value. |
| Evaluation | Accuracy, precision, recall. | Mean squared error, root mean squared error, correlation coefficient. |
| Algorithms | Decision trees, rule-based systems, neural networks, support vector machines, k-nearest neighbors. | Linear regression, logistic regression, polynomial regression, time series analysis, neural networks. |
| Examples | Spam email classification, sentiment analysis, fraud detection, etc. | Housing price prediction, stock price prediction, predicting a customer's purchase amount or sale, etc. |
• Real-Life Examples
• There are many real-life examples and applications of classification in data
mining. Some of the most common examples of applications include -
• Email spam classification - This involves classifying emails as spam or
non-spam based on their content and metadata.
• Image classification - This involves classifying images into different
categories, such as animals, plants, buildings, and people.
• Medical diagnosis - This involves classifying patients into different categories
based on their symptoms, medical history, and test results.
• Credit risk analysis - This involves classifying loan applications into different
categories, such as low-risk, medium-risk, and high-risk, based on the
applicant's credit score, income, and other factors.
• Sentiment analysis - This involves classifying text data, such as reviews or
social media posts, into positive, negative, or neutral categories based on the
language used.
• Customer segmentation - This involves classifying customers into predefined
segments based on their demographic information, purchasing behavior, and
other factors.
• Statistical Based Algorithms:

• Statistical Based Algorithms are a fundamental component of data mining and
machine learning, particularly in the context of classification tasks.
• Commonly used Statistical Based Algorithms are:
• Regression Analysis
• Bayesian Classification
• What is Regression in Data Mining?
• Regression in data mining is a statistical technique that is used to model the
relationship between a dependent variable and one or more independent
variables.
• The goal is to predict the value of the dependent variable based on the
values of the independent variables.
• The dependent variable is also called the response variable, while the
independent variable(s) is also known as the predictor(s).
• There are several types of regression models, including linear
regression, logistic regression, and polynomial regression.
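As an illustration of the simplest case, linear regression with one independent variable, the sketch below fits a straight line by ordinary least squares. The function name `fit_line` and the sample points are hypothetical and chosen only so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_line(xs, ys)

# Predict the dependent variable for a new value of the predictor.
prediction = intercept + slope * 5
print(intercept, slope, prediction)
```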
• Bayes Theorem
• Bayes' theorem is a theorem in probability and statistics, named after the
Reverend Thomas Bayes, that gives the probability of an event conditioned on
another event that has already occurred.
• Bayes' rule states that the conditional probability of an event A, given the
occurrence of another event B, is equal to the product of the likelihood of B
given A and the probability of A, divided by the probability of B. It is given as:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
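Bayes' rule as stated above translates directly into a one-line computation. The sketch below is only an illustration: the function name `bayes_posterior` and the spam-filter probabilities are hypothetical values, not figures from this unit.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical numbers: A = "email is spam", B = "email contains 'offer'".
p_spam = 0.2               # P(A)
p_offer_given_spam = 0.5   # P(B|A)
p_offer = 0.125            # P(B)

p_spam_given_offer = bayes_posterior(p_spam, p_offer_given_spam, p_offer)
print(round(p_spam_given_offer, 3))
```

With these assumed inputs the posterior works out to 0.5 × 0.2 / 0.125 = 0.8.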
• Naïve Bayes Classifier Algorithm
• Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
• Why is it called Naïve Bayes?
• The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be
described as:
• Naïve: It is called Naïve because it assumes that, given the class, the
occurrence of a certain feature is independent of the occurrence of other
features. For example, if a fruit is identified on the basis of color, shape, and
taste, then a red, spherical, and sweet fruit is recognized as an apple. Each
feature contributes individually to identifying it as an apple, without
depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
• Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether or not we
should play on a particular day according to the weather conditions.
To solve this problem, we follow the steps below:
• Convert the given dataset into frequency tables.
• Generate a likelihood table by finding the probabilities of the given features.
• Now, use Bayes' theorem to calculate the posterior probability.
• Problem: If the weather is sunny, should the player play or not?
• Solution: To solve this, first consider the dataset below:
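The dataset itself is not reproduced in this text, so the sketch below uses a small hypothetical set of (outlook, play) records, invented purely for illustration, to walk through the three steps: frequency tables, likelihoods, and the posterior from Bayes' theorem. All variable and function names are assumptions of this sketch.

```python
from collections import Counter

# Hypothetical records (NOT the original dataset): (outlook, play).
records = (
    [("Sunny", "Yes")] * 3 + [("Sunny", "No")] * 2
    + [("Overcast", "Yes")] * 4
    + [("Rainy", "Yes")] * 2 + [("Rainy", "No")] * 3
)

# Step 1: frequency tables.
joint_counts = Counter(records)                              # (outlook, play)
class_counts = Counter(play for _, play in records)          # play
outlook_counts = Counter(outlook for outlook, _ in records)  # outlook
total = len(records)


def posterior_play(play, outlook):
    """P(play | outlook) via Bayes' theorem from the frequency tables."""
    # Step 2: likelihood P(outlook | play), prior P(play), evidence P(outlook).
    likelihood = joint_counts[(outlook, play)] / class_counts[play]
    prior = class_counts[play] / total
    evidence = outlook_counts[outlook] / total
    # Step 3: posterior probability.
    return likelihood * prior / evidence


p_yes = posterior_play("Yes", "Sunny")
p_no = posterior_play("No", "Sunny")
decision = "Yes" if p_yes > p_no else "No"
print(round(p_yes, 2), round(p_no, 2), decision)
```

For these made-up counts, P(Yes | Sunny) = 0.6 and P(No | Sunny) = 0.4, so the player would play. With a single feature the independence assumption is not exercised; with several features, the per-feature likelihoods would be multiplied together.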
• Distance Based Algorithm:
• It is a type of classification algorithm that uses similarity or distance measures.
• Similarity measures can be used to determine which items are more similar
to each other and which items are less similar, which is useful for clustering
or classification.
• Example: Imagine a website that sells various products and wants to recommend
similar items to customers based on their browsing history.
• In this scenario, the website can use a distance based algorithm to measure
the similarity between the products a customer has viewed and other items
in the database.
• If a customer has looked at a particular pair of running shoes, the website
can use similarity measures to find other shoes that are similar in terms of
style, brand or other features.
• The two methods are:
• Simple Approach
• KNN (K Nearest Neighbors) Algorithm

• Simple Approach
• Euclidean Distance Formula
• The formula for Euclidean distance in two dimensions is:
$$D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
• where $D$ is the Euclidean distance, and $(x_1, y_1)$ and $(x_2, y_2)$ are the
Cartesian coordinates of the two points.
• Assign the new item to the class whose centroid is closest to it.
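A minimal sketch of the simple approach, assuming two-dimensional points: compute each class centroid as the mean of its points, then assign a new point to the class with the nearest centroid by Euclidean distance. The function names and the sample points are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two 2-D points p and q."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def centroid(points):
    """Mean point of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def nearest_centroid(point, classes):
    """Return the label of the class whose centroid is closest to point."""
    centroids = {label: centroid(pts) for label, pts in classes.items()}
    return min(centroids, key=lambda label: euclidean(point, centroids[label]))


# Hypothetical training points for two classes.
classes = {
    "A": [(1, 1), (2, 1), (1, 2)],
    "B": [(8, 8), (9, 8), (8, 9)],
}
label = nearest_centroid((2, 2), classes)
print(label)
```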
• K-Nearest Neighbor (KNN) Algorithm for Machine Learning
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms
based on the Supervised Learning technique.
• The K-NN algorithm measures the similarity between the new case/data and the
available cases and puts the new case into the category it is most similar
to.
• Example: Suppose we have an image of a creature that looks similar to both a
cat and a dog, and we want to know whether it is a cat or a dog. For this
identification, we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will compare the features of the new image with those
of the cat and dog images and, based on the most similar features, put it
in either the cat or the dog category.
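The cat-or-dog idea can be sketched as follows. This is an illustrative sketch, not a tuned implementation: it assumes each animal is described by two numeric features (weight in kg, height in cm), uses Euclidean distance, and takes a majority vote among the k nearest training examples. The feature values and the choice k = 3 are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, training, k=3):
    """Label query by majority vote among its k nearest training examples."""
    # Sort training examples by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: ((weight_kg, height_cm), label).
training = [
    ((4.0, 25.0), "cat"), ((5.0, 23.0), "cat"), ((3.5, 24.0), "cat"),
    ((20.0, 55.0), "dog"), ((25.0, 60.0), "dog"), ((18.0, 50.0), "dog"),
]

prediction = knn_predict((6.0, 28.0), training, k=3)
print(prediction)
```

Because the features are on different scales, real uses of KNN usually normalize them first; that step is left out here to keep the sketch short.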
• Decision Tree Classification Algorithm
• Decision Tree is a Supervised learning technique that can be used
for both classification and Regression problems, but mostly it is
preferred for solving Classification problems.
• It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
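To make the structure concrete, the sketch below represents a tree as nested dictionaries: each internal node names the feature it tests, each branch is keyed by a feature value, and each leaf is a plain class label. The tree, its feature names, and the `classify` helper are hypothetical and hand-written for illustration, not learned from data.

```python
# Hypothetical hand-built tree: internal nodes are dicts, leaves are labels.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "feature": "windy",
            "branches": {"true": "no", "false": "yes"},
        },
    },
}


def classify(node, example):
    """Follow branches from the root until a leaf (a label) is reached."""
    while isinstance(node, dict):
        value = example[node["feature"]]   # feature tested at this node
        node = node["branches"][value]     # decision rule: follow the branch
    return node


example = {"outlook": "sunny", "humidity": "normal", "windy": "false"}
print(classify(tree, example))
```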
• How does the Decision Tree algorithm Work?
• Attribute Selection Measures
• While implementing a Decision tree, the main issue is how to select
the best attribute for the root node and for the sub-nodes.
• To solve this problem, there is a technique called Attribute
Selection Measure, or ASM.
• Using this measure, we can easily select the best attribute for the nodes of
the tree. There are three popular techniques for ASM, which are:
• Information Gain
• Gain Ratio
• Gini Index
• Information Gain
• Information gain is a measure used to determine which feature should be
used to split the data at each internal node of the decision tree. It is
calculated using entropy.
• Entropy:
• Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. In a decision tree, the goal is to decrease the entropy
of the dataset by creating more pure subsets of data. Since entropy is a
measure of impurity, by decreasing the entropy, we are increasing the
purity of the data.
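A short sketch of both quantities, using the standard definitions: entropy H(S) = −Σ pᵢ log₂ pᵢ over the class proportions pᵢ, and information gain as the entropy of the parent set minus the size-weighted entropy of the subsets produced by a split. The function names and the four-row dataset are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, feature_index, label_index=-1):
    """Entropy of all labels minus weighted entropy after splitting."""
    labels = [row[label_index] for row in rows]
    # Group the labels by the value of the chosen feature.
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[label_index])
    weighted = sum(len(group) / len(rows) * entropy(group)
                   for group in groups.values())
    return entropy(labels) - weighted


# Hypothetical rows: (outlook, windy, play).
rows = [
    ("sunny", "true", "no"),
    ("sunny", "false", "no"),
    ("rainy", "true", "yes"),
    ("rainy", "false", "yes"),
]
gain_outlook = information_gain(rows, 0)
gain_windy = information_gain(rows, 1)
print(gain_outlook, gain_windy)
```

In this toy data, splitting on outlook yields perfectly pure subsets, while splitting on windy leaves each subset as mixed as before, so outlook would be chosen for the node.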
• Pruning: Getting an Optimal Decision tree
• Pruning is a process of deleting the unnecessary nodes from a tree in order to
get the optimal decision tree.
• A too-large tree increases the risk of overfitting, and a small tree may not
capture all the important features of the dataset.
• Therefore, a technique that decreases the size of the learning tree without
reducing accuracy is known as Pruning. There are mainly two types of
tree pruning techniques used:
• Cost Complexity Pruning
• Reduced Error Pruning.
