
U4 CLASSIFICATION & PREDICTION DATA MINING

There are two forms of data analysis that can be used to extract models describing important classes or predict
future data trends. These two forms are as follows:

1. Classification

2. Prediction

We use classification and prediction to extract models that represent the data classes and forecast future data
trends. Classification predicts categorical class labels, while prediction estimates continuous values. Together,
these analyses give us a good understanding of the data at a large scale.

Classification models predict categorical class labels, and prediction models predict continuous-valued
functions. For example, we can build a classification model to categorize bank loan applications as either safe or
risky or a prediction model to predict the expenditures in dollars of potential customers on computer equipment
given their income and occupation.

What is Classification?

Classification is to identify the category or the class label of a new observation. First, a set of data is used as
training data. The set of input data and the corresponding outputs are given to the algorithm. So, the training
data set includes the input data and their associated class labels. Using the training dataset, the algorithm derives
a model or the classifier. The derived model can be a decision tree, mathematical formula, or a neural network.
In classification, when unlabeled data is given to the model, it should find the class to which it belongs. The new
data provided to the model is the test data set.

Classification is the process of classifying a record. One simple example of classification is to check whether it
is raining or not. The answer can either be yes or no. So, there is a particular number of choices. Sometimes
there can be more than two classes to classify. That is called multiclass classification.
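The train-then-classify flow described above can be sketched in a few lines of Python. The humidity readings and the 1-nearest-neighbour rule below are illustrative assumptions for the sketch, not a prescribed method:

```python
def train(records):
    """'Training' for 1-nearest-neighbour simply stores the labeled records."""
    return list(records)

def classify(model, observation):
    """Assign the class label of the closest training record."""
    closest = min(model, key=lambda rec: abs(rec[0] - observation))
    return closest[1]

# Training data set: (humidity, class label) pairs.
training_data = [(90, "rain"), (85, "rain"), (30, "no rain"), (40, "no rain")]
model = train(training_data)

print(classify(model, 80))   # a humid day is classified as "rain"
```

As in the text, the training set (inputs plus class labels) produces a model, and new unlabeled observations are then assigned to one of the known classes.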

The bank needs to analyze whether giving a loan to a particular customer is risky or not. For example, based on
observable data for multiple loan borrowers, a classification model may be established that forecasts credit risk.
The data could track job records, homeownership or leasing, years of residency, number and type of deposits,
historical credit ranking, etc. The target would be the credit ranking, the predictors would be the other characteristics,
and the data would represent a case for each consumer. In this example, a model is constructed to find the
categorical label. The labels are risky or safe.

How does Classification Work?

The functioning of classification was illustrated above with the bank loan application. There are two stages in a
data classification system: creating the classifier (model creation) and applying the classifier for classification.

1. Developing the Classifier or model creation: This level is the learning stage or the learning process.
The classification algorithms construct the classifier in this stage. A classifier is constructed from a
training set composed of database records and their corresponding class labels. Each record in the
training set belongs to a predefined category or class. We may also refer to these records as
samples, objects, or data points.

2. Applying the classifier for classification: The classifier is used for classification at this level. The test
data are used here to estimate the accuracy of the classification algorithm. If the accuracy is deemed
sufficient, the classification rules can be applied to new data records. Common applications include:

o Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring. We can
use it to extract social media insights. With advanced machine learning algorithms, we can build
sentiment analysis models that read and analyze even misspelled words. Accurately trained
models provide consistently accurate outcomes in a fraction of the time.

o Document Classification: We can use document classification to organize documents into
sections according to their content. Document classification refers to text classification; we can
classify the words in the entire document, and with the help of machine learning classification
algorithms, we can execute it automatically.

o Image Classification: Image classification assigns an image to one of a set of trained categories,
such as a caption, a statistical value, or a theme. You can tag images to train your model for the
relevant categories by applying supervised learning algorithms.

o Machine Learning Classification: It uses statistically demonstrable algorithm rules to execute
analytical tasks that would take humans hundreds of hours to perform.

3. Data Classification Process: The data classification process can be divided into five steps:

o Define the goals, strategy, workflows, and architecture of data classification.

o Identify the confidential details that we store.

o Apply labels by tagging the data.

o Use the results to improve protection and compliance.

o Data is always changing, so classification is a continuous process.

What is Data Classification Lifecycle?

The data classification life cycle provides an excellent structure for controlling the flow of data through an enterprise.
Businesses need to account for data security and compliance at each level. Data classification lets us do this
at every stage, from origin to deletion. The data life cycle has the following stages:

1. Origin: Sensitive data is produced in various formats: emails, Excel, Word, Google documents,
social media, and websites.

2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it
according to in-house protection policies and compliance rules.

3. Storage: The collected data is stored with access controls and encryption.

4. Sharing: Data is continually distributed among agents, consumers, and co-workers from various
devices and platforms.

5. Archive: Here, data is eventually archived within an industry's storage systems.

6. Publication: Through publication, data can reach customers, who can then view and
download it in the form of dashboards.

What is Prediction?

Another process of data analysis is prediction. It is used to find a numerical output. As in classification, the
training dataset contains the inputs and the corresponding numerical output values. The algorithm derives the model
or predictor from the training dataset. The model should find a numerical output when new data is
given. Unlike classification, this method does not have a class label; the model predicts a continuous-valued
function or ordered value.

Regression is generally used for prediction. Predicting the value of a house from facts such as the
number of rooms, the total area, etc., is an example of prediction.

For example, suppose a marketing manager needs to predict how much a particular customer will spend at his
company during a sale. In this case, we need to forecast a numerical value, so this data-processing activity is an
example of numeric prediction. A model or predictor will be developed that forecasts a continuous-valued or
ordered function.
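A minimal sketch of numeric prediction is simple linear regression fitted by least squares. The room counts and prices below are made-up illustration data:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared prediction error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Training data: (number of rooms, price in $1000s); exactly price = 40*rooms + 30.
rooms = [2, 3, 4, 5]
prices = [110, 150, 190, 230]

slope, intercept = fit_line(rooms, prices)
print(slope * 6 + intercept)   # predicted price of a 6-room house: 270.0
```

Unlike the classification sketch earlier, the output here is a continuous value, not a class label.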

Classification and Prediction Issues

The major issue is preparing the data for Classification and Prediction. Preparing the data involves the following
activities, such as:

1. Data Cleaning: Data cleaning involves removing the noise and treatment of missing values. The noise
is removed by applying smoothing techniques, and the problem of missing values is solved by
replacing a missing value with the most commonly occurring value for that attribute.

2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is used to
know whether any two given attributes are related.

3. Data Transformation and reduction: The data can be transformed by any of the following methods.

o Normalization: The data is transformed using normalization. Normalization involves scaling
all values for a given attribute so that they fall within a small specified range. Normalization
is used when neural networks or methods involving distance measurements are used in the
learning step.

o Generalization: The data can also be transformed by generalizing it to the higher concept.
For this purpose, we can use the concept hierarchies.

NOTE: Data can also be reduced by some other methods such as wavelet transformation, binning, histogram
analysis, and clustering.
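The data cleaning and normalization steps described above can be sketched in a few lines of Python. The toy attribute values are illustrative assumptions:

```python
from collections import Counter

def impute_mode(values):
    """Data cleaning: replace missing entries (None) with the attribute's
    most commonly occurring value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation: scale all values of an attribute so they fall
    within a small specified range, here [0, 1] by default."""
    old_min, old_max = min(values), max(values)
    return [new_min + (v - old_min) / (old_max - old_min) * (new_max - new_min)
            for v in values]

print(impute_mode(["red", "blue", None, "red"]))   # missing value -> "red"
print(min_max_normalize([10, 20, 30, 50]))         # [0.0, 0.25, 0.5, 1.0]
```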

Comparison of Classification and Prediction Methods

Here are the criteria for comparing the methods of Classification and Prediction, such as:

o Accuracy: The accuracy of the classifier can be referred to as the ability of the classifier to predict the
class label correctly, and the accuracy of the predictor can be referred to as how well a given predictor
can estimate the unknown value.

o Speed: The speed of the method depends on the computational cost of generating and using the
classifier or predictor.

o Robustness: Robustness is the ability of the classifier or predictor to make correct predictions or
classifications from noisy data or previously unseen data.

o Scalability: Scalability refers to the ability to construct the classifier or predictor efficiently as the
amount of data grows.

o Interpretability: Interpretability is how readily we can understand the reasoning behind predictions or
classification made by the predictor or classifier.

Difference between Classification and Prediction

Classification:

o Classification is the process of identifying which category a new observation belongs to, based on a
training data set containing observations whose category membership is known.

o The accuracy depends on finding the class label correctly.

o The model is known as the classifier.

o A model or classifier is constructed to find categorical labels.

o For example, grouping patients based on their medical records can be considered classification.

Prediction:

o Prediction is the process of identifying the missing or unavailable numerical data for a new observation.

o The accuracy depends on how well a given predictor can guess the value of the predicted attribute for
new data.

o The model is known as the predictor.

o A model or predictor is constructed that predicts a continuous-valued function or ordered value.

o For example, we can think of prediction as predicting the correct treatment for a particular disease for
a person.

Decision Tree Induction

Decision Tree is a supervised learning method used in data mining for classification and regression methods. It
is a tree that helps us in decision-making purposes. The decision tree creates classification or regression models
as a tree structure. It separates a data set into smaller subsets, and at the same time, the decision tree is steadily
developed. The final tree is a tree with decision nodes and leaf nodes. A decision node has at least two
branches, and the leaf nodes show a classification or decision. No further splits can be made on a leaf node.
The uppermost decision node in a tree, which corresponds to the best predictor, is called the root node.
Decision trees can deal with both categorical and numerical data.

Key factors:

Entropy:

Entropy refers to a common way to measure impurity. In the decision tree, it measures the randomness or
impurity in data sets.

Information Gain:

Information Gain refers to the decline in entropy after the dataset is split; it is also called entropy reduction.
Building a decision tree is all about discovering the attributes that return the highest information gain.

In short, a decision tree is just like a flow chart diagram with the terminal nodes showing decisions. Starting
with the dataset, we can measure the entropy to find a way to segment the set until the data belongs to the same
class.
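These two key factors can be computed directly. The sketch below measures entropy as an impurity measure and information gain as the entropy reduction after a split; the tiny label lists are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a set of class labels: -sum(p * log2(p))."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(labels, subsets):
    """Parent entropy minus the size-weighted entropy of the split subsets."""
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
print(entropy(labels))                                           # 1.0, maximally impure
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # 1.0, a pure split
```

A split that separates the classes perfectly recovers all of the parent's entropy as information gain, which is why tree building keeps choosing the attribute with the highest gain.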

Why are decision trees useful?

Decision trees enable us to analyze the possible consequences of a decision thoroughly.

They provide us with a framework to measure the values of outcomes and the probabilities of achieving them.

They help us make the best decisions based on existing data and informed estimates.

In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an
extensive collection of records into smaller sets of classes by applying a sequence of simple decision
rules. A decision tree model comprises a set of rules for partitioning a large heterogeneous population into
smaller, more homogeneous, mutually exclusive classes. The attributes of the classes can be any type of variable,
from nominal, ordinal, and binary to quantitative values; in contrast, the classes must be of a qualitative type, such
as categorical, ordinal, or binary. In brief, given data described by attributes together with class labels, a decision
tree creates a set of rules that can be used to identify the class. One rule is applied after another, resulting in a
hierarchy of segments within segments. The hierarchy is known as the tree, and each segment is called a node.
With each progressive division, the members of the subsequent sets become more and more similar to each
other. Hence, the algorithm used to build a decision tree is referred to as recursive partitioning. A well-known
such algorithm is CART (Classification and Regression Trees).

Consider the example of a factory deciding whether to expand:



Expanding the factory costs $3 million. The probability of a good economy is 0.6 (60%), which leads to $8 million
profit, and the probability of a bad economy is 0.4 (40%), which leads to $6 million profit.

Not expanding the factory costs $0. The probability of a good economy is 0.6 (60%), which leads to $4 million
profit, and the probability of a bad economy is 0.4 (40%), which leads to $2 million profit.

The management teams need to take a data-driven decision to expand or not based on the given data.

Net Expand = (0.6 * 8 + 0.4 * 6) - 3 = $4.2M

Net Not Expand = (0.6 * 4 + 0.4 * 2) - 0 = $3.2M

$4.2M > $3.2M, therefore the factory should be expanded.
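This expected-value arithmetic can be checked in a few lines (amounts in millions of dollars; note that 0.6*4 + 0.4*2 works out to exactly 3.2):

```python
def expected_net(cost, outcomes):
    """Expected net value of a decision branch.
    outcomes: list of (probability, profit) pairs."""
    return sum(p * profit for p, profit in outcomes) - cost

net_expand = expected_net(3, [(0.6, 8), (0.4, 6)])
net_not_expand = expected_net(0, [(0.6, 4), (0.4, 2)])

print(round(net_expand, 2), round(net_not_expand, 2))   # 4.2 3.2 -> expand
```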

Decision tree Algorithm:

The decision tree algorithm may appear long, but it is quite simple. The basic algorithm technique is as follows:

The algorithm is based on three parameters: D, attribute_list, and Attribute _selection_method.

Generally, we refer to D as a data partition.

Initially, D is the entire set of training tuples and their associated class labels (the input training data).

The parameter attribute_list is a set of attributes defining the tuples.

Attribute_selection_method specifies a heuristic process for choosing the attribute that "best" discriminates
the given tuples according to class.

Attribute_selection_method process applies an attribute selection measure.
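A pure-Python sketch of this recursive procedure is shown below. Here Attribute_selection_method is stubbed as a trivial placeholder (a real implementation would apply an attribute selection measure such as information gain), and the tiny example data is made up:

```python
from collections import Counter

def majority_class(data):
    """Most common class label in a partition."""
    return Counter(label for _, label in data).most_common(1)[0][0]

def attribute_selection_method(data, attribute_list):
    """Placeholder heuristic: pick the first remaining attribute.
    A real implementation would use information gain or the Gini index."""
    return attribute_list[0]

def generate_decision_tree(data, attribute_list):
    """data (partition D): list of (attribute_dict, class_label) tuples."""
    labels = {label for _, label in data}
    if len(labels) == 1:                  # all tuples in the same class: leaf
        return labels.pop()
    if not attribute_list:                # no attributes left: majority leaf
        return majority_class(data)
    attr = attribute_selection_method(data, attribute_list)
    remaining = [a for a in attribute_list if a != attr]
    return {(attr, value): generate_decision_tree(
                [(row, lbl) for row, lbl in data if row[attr] == value],
                remaining)
            for value in {row[attr] for row, _ in data}}

data = [({"outlook": "sunny"}, "no"), ({"outlook": "overcast"}, "yes")]
tree = generate_decision_tree(data, ["outlook"])
print(tree[("outlook", "sunny")])   # "no"
```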



Advantages of using decision trees:

A decision tree does not need scaling of information.

Missing values in data also do not influence the process of building a choice tree to any considerable extent.

A decision tree model is intuitive and simple to explain to the technical team as well as to stakeholders.

Compared to other algorithms, decision trees need less exertion for data preparation during pre-processing.

A decision tree does not require a standardization of data.

Naïve Bayes Classifier Algorithm

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used
for solving classification problems.

o It is mainly used in text classification that includes a high-dimensional training dataset.

o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps
build fast machine learning models that can make quick predictions.

o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.

o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis, and
classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of
the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste,
then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually
contributes to identifying it as an apple, without depending on the others.

o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability
of a hypothesis with prior knowledge. It depends on the conditional probability.

o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)



Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.
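The theorem is a one-liner in code. The probabilities below are arbitrary illustrative values:

```python
def bayes_posterior(likelihood, prior, evidence):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# e.g. P(B|A) = 0.8, P(A) = 0.1, P(B) = 0.2  ->  P(A|B) = 0.4
print(bayes_posterior(0.8, 0.1, 0.2))
```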

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this
dataset, we need to decide whether we should play on a particular day according to the weather
conditions. To solve this problem, we follow the steps below:

1. Convert the given dataset into frequency tables.

2. Generate Likelihood table by finding the probabilities of given features.

3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the weather conditions:

Weather    Yes    No
Overcast   5      0
Rainy      2      2
Sunny      3      2
Total      10     4

Likelihood table for the weather conditions:

Weather    No            Yes           P(Weather)
Overcast   0             5             5/14 = 0.35
Rainy      2             2             4/14 = 0.29
Sunny      2             3             5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No)= 2/4=0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
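The same calculation can be reproduced end-to-end in Python from exact counts (the figures above are rounded to two decimals, which is why the hand computation gives 0.41 rather than the exact 0.4):

```python
# The 14-row weather dataset from the example: (Outlook, Play) pairs.
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Sunny", "No"), ("Rainy", "Yes"),
        ("Sunny", "Yes"), ("Overcast", "Yes"), ("Rainy", "No"),
        ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

def posterior(outlook, play):
    """P(play | outlook) = P(outlook | play) * P(play) / P(outlook)."""
    n = len(data)
    class_count = sum(1 for _, p in data if p == play)
    likelihood = sum(1 for o, p in data if o == outlook and p == play) / class_count
    prior = class_count / n
    evidence = sum(1 for o, _ in data if o == outlook) / n
    return likelihood * prior / evidence

p_yes = posterior("Sunny", "Yes")   # (3/10 * 10/14) / (5/14) = 0.6
p_no = posterior("Sunny", "No")     # (2/4  *  4/14) / (5/14) = 0.4
print(p_yes > p_no)                 # True -> play on a sunny day
```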

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.

o It can be used for Binary as well as Multi-class Classifications.

o It performs well in Multi-class predictions as compared to the other Algorithms.

o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship
between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.

o It is used in medical data classification.

o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.

o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.

o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed.
It is primarily used for document classification problems, i.e., deciding which category a particular
document belongs to, such as sports, politics, education, etc.
The classifier uses the frequency of words as the predictors.

o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor
variables are independent Boolean variables, such as whether a particular word is present in a
document or not. This model is also well known for document classification tasks.

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one
cluster and dissimilar objects are grouped in another cluster.

What is Clustering?

Clustering is the process of making a group of abstract objects into classes of similar objects.

Points to Remember

 A cluster of data objects can be treated as one group.


 While doing cluster analysis, we first partition the set of data into groups based on data similarity and
then assign labels to the groups.
 The main advantage of clustering over classification is that it is adaptable to changes and helps single
out useful features that distinguish different groups.

Applications of Cluster Analysis

 Clustering analysis is broadly used in many applications such as market research, pattern recognition,
data analysis, and image processing.
 Clustering can also help marketers discover distinct groups in their customer base. And they can
characterize their customer groups based on the purchasing patterns.
 In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionalities and gain insight into structures inherent to populations.
 Clustering also helps in identification of areas of similar land use in an earth observation database. It
also helps in the identification of groups of houses in a city according to house type, value, and
geographic location.
 Clustering also helps in classifying documents on the web for information discovery.
 Clustering is also used in outlier detection applications such as detection of credit card fraud.

 As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to
observe characteristics of each cluster.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −

 Scalability − We need highly scalable clustering algorithms to deal with large databases.
 Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to
any kind of data, such as interval-based (numerical), categorical, and binary data.
 Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting
clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only
spherical clusters of small size.
 High dimensionality − The clustering algorithm should not only be able to handle low-dimensional
data but also the high dimensional space.
 Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
 Interpretability − The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods

Clustering methods can be classified into the following categories −

 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method

Partitioning Method

Suppose we are given a database of ‘n’ objects, and the partitioning method constructs ‘k’ partitions of the data.
Each partition represents a cluster, and k ≤ n. This means it classifies the data into k groups, which satisfy the
following requirements −

 Each group contains at least one object.


 Each object must belong to exactly one group.

Points to remember −

 For a given number of partitions (say k), the partitioning method creates an initial partitioning.
 It then uses an iterative relocation technique to improve the partitioning by moving objects from one
group to another.
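A classic partitioning method is k-means. The sketch below shows the iterative relocation idea on 1-D points; the data and the naive initialization are illustrative assumptions:

```python
def kmeans_1d(points, k, iterations=10):
    """k-means on 1-D points: assign each point to its nearest center,
    relocate each center to the mean of its group, and repeat."""
    centers = points[:k]                          # naive initialization
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]   # relocation step
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
centers, clusters = kmeans_1d(points, k=2)
print(sorted(centers))                            # roughly [1.0, 10.0]
```

Note that both requirements hold: every point belongs to exactly one group, and (after convergence here) each group contains at least one object.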

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical
methods on the basis of how the hierarchical decomposition is formed. There are two approaches here −

 Agglomerative Approach
 Divisive Approach

Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start with each object forming a separate
group. It keeps on merging the objects or groups that are close to one another. It keeps doing so until all of the
groups are merged into one or until the termination condition holds.

Divisive Approach

This approach is also known as the top-down approach. In this, we start with all of the objects in the same
cluster. In each iteration, a cluster is split into smaller clusters. This continues until each object is in its own
cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can
never be undone.
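The bottom-up merging can be sketched in pure Python. Single linkage on 1-D points is an illustrative assumption of this sketch:

```python
def agglomerative(points, target_clusters):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the closest pair (single-linkage distance on 1-D points) until
    only target_clusters clusters remain."""
    clusters = [[p] for p in points]          # each object starts alone
    while len(clusters) > target_clusters:
        best = None                           # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge; this cannot be undone
    return clusters

print(agglomerative([1, 2, 9, 10], 2))       # [[1, 2], [9, 10]]
```

The irreversible `pop`-and-merge step illustrates the rigidity noted above: once two clusters are merged, the decision is never revisited.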

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of hierarchical clustering −

 Perform careful analysis of object linkages at each hierarchical partitioning.


 Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group
objects into micro-clusters, and then performing macro-clustering on the micro-clusters.

Density-based Method

This method is based on the notion of density. The basic idea is to continue growing the given cluster as long as
the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the
radius of a given cluster has to contain at least a minimum number of points.

Grid-based Method

In this, the objects together form a grid. The object space is quantized into a finite number of cells that form a
grid structure.

Advantages

 The major advantage of this method is fast processing time.


 It is dependent only on the number of cells in each dimension in the quantized space.

Model-based methods

In this method, a model is hypothesized for each cluster to find the best fit of the data to the given model. This
method locates the clusters using the density function, which reflects the spatial distribution of the data points.

This method also provides a way to automatically determine the number of clusters based on standard statistics,
taking outlier or noise into account. It therefore yields robust clustering methods.

Constraint-based Method

In this method, the clustering is performed by the incorporation of user or application-oriented constraints. A
constraint refers to the user expectation or the properties of desired clustering results. Constraints provide us
with an interactive way of communication with the clustering process. Constraints can be specified by the user
or the application requirement.
