U4 Classification and Prediction
There are two forms of data analysis that can be used to extract models describing important classes or predict
future data trends. These two forms are as follows:
1. Classification
2. Prediction
We use classification and prediction to extract a model representing the data classes and to predict future data
trends. These forms of analysis help us understand the data at a large scale.
Classification models predict categorical class labels, and prediction models predict continuous-valued
functions. For example, we can build a classification model to categorize bank loan applications as either safe or
risky or a prediction model to predict the expenditures in dollars of potential customers on computer equipment
given their income and occupation.
What is Classification?
Classification is the task of identifying the category or class label of a new observation. First, a set of data is used as
training data. The set of input data and the corresponding outputs are given to the algorithm. So, the training
data set includes the input data and their associated class labels. Using the training dataset, the algorithm derives
a model or the classifier. The derived model can be a decision tree, mathematical formula, or a neural network.
In classification, when unlabeled data is given to the model, it should find the class to which it belongs. The new
data provided to the model is the test data set.
Classification is the process of assigning a record to a class. One simple example of classification is checking
whether it is raining or not; the answer can be either yes or no, so there is a fixed number of choices. Sometimes
there are more than two classes to choose from. That is called multiclass classification.
The bank needs to analyze whether giving a loan to a particular customer is risky or not. For example, based on
observable data for multiple loan borrowers, a classification model may be established that forecasts credit risk.
The data could track job records, homeownership or leasing, years of residency, number and type of deposits,
historical credit ranking, and so on. The target would be the credit ranking, the predictors would be the other
attributes, and the data would represent a case for each customer. In this example, a model is constructed to find the
categorical label. The labels are risky or safe.
The functioning of classification has been illustrated above with the bank loan application. There are two stages
in the data classification process: developing the classifier (model creation) and applying the classifier for
classification.
1. Developing the Classifier or model creation: This level is the learning stage or the learning process.
The classification algorithms construct the classifier in this stage. A classifier is constructed from a
training set composed of database records and their corresponding class labels. Each record that makes up
the training set is associated with a category or class. We may also refer to these records as
samples, objects, or data points.
2. Applying classifier for classification: The classifier is used for classification at this level. The test
data are used here to estimate the accuracy of the classification algorithm. If the accuracy is deemed
sufficient, the classification rules can be applied to new data records. Applications include:
o Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring. We can
use it to extract social media insights. We can build sentiment analysis models to read and
analyze even misspelled words with advanced machine learning algorithms. Well-trained
models provide consistently accurate outcomes in a fraction of the time.
o Document Classification: We can use document classification to organize the documents into
sections according to the content. Document classification refers to text classification; we can
classify the words in the entire document. And with the help of machine learning classification
algorithms, we can execute it automatically.
o Image Classification: Image classification assigns an image to one of a set of trained categories.
These could be the caption of the image, a statistical value, or a theme. You can tag images to
train your model for relevant categories by applying supervised learning algorithms.
3. Data Classification Process: The data classification process includes steps such as:
o Defining the goals, strategy, workflows, and architecture of data
classification.
The data classification life cycle provides an excellent structure for controlling the flow of data in an enterprise.
Businesses need to account for data security and compliance at each level. With the help of data classification,
we can enforce this at every stage, from origin to deletion. The data life cycle has the following stages:
1. Origin: Sensitive data is produced in various formats, such as emails, Excel, Word, Google documents,
social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it based on in-
house protection policies and agreement rules.
3. Storage: The collected data is stored with access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from various
devices and platforms.
5. Publication: Through publication, data can reach customers, who can then view and
download it in the form of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. As in classification, the
training dataset contains the inputs and corresponding numerical output values. The algorithm derives the model
or a predictor according to the training dataset. The model should find a numerical output when the new data is
given. Unlike in classification, this method does not have a class label. The model predicts a continuous-valued
function or ordered value.
Regression is generally used for prediction. Predicting the value of a house based on facts such as the
number of rooms, the total area, and so on, is an example of prediction; a minimal sketch follows.
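A sketch of such a regression-based predictor using scikit-learn's LinearRegression; the feature values and prices below are invented purely for illustration:

# Predicting house value from simple features with linear regression.
from sklearn.linear_model import LinearRegression

# Features: [number of rooms, total area in square meters]; invented data.
X = [[3, 80], [4, 120], [2, 55], [5, 150], [3, 95]]
y = [150_000, 230_000, 110_000, 300_000, 180_000]  # sale prices in dollars

model = LinearRegression().fit(X, y)  # derive the predictor from training data

# The model outputs a continuous value for unseen data, not a class label.
print(model.predict([[4, 100]]))  # estimated price for a 4-room, 100 m^2 house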
For example, suppose the marketing manager needs to predict how much a particular customer will spend at his
company during a sale. In this case, we are interested in forecasting a numerical value, so this data analysis task
is an example of numeric prediction. Here, a model or predictor is developed that forecasts a continuous-valued
or ordered function.
A major issue is preparing the data for classification and prediction. Preparing the data involves the following
activities (a short data-cleaning sketch follows the list):
1. Data Cleaning: Data cleaning involves removing noise and handling missing values. Noise
is removed by applying smoothing techniques, and the problem of missing values is solved by
replacing a missing value with the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is used to
determine whether any two given attributes are related.
3. Data Transformation and Reduction: The data can be transformed by methods such as the following.
o Generalization: The data can be transformed by generalizing it to higher-level concepts.
For this purpose, we can use concept hierarchies.
NOTE: Data can also be reduced by some other methods such as wavelet transformation, binning, histogram
analysis, and clustering.
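A minimal sketch of the cleaning step in pandas, assuming a toy table whose column names are invented for illustration; a missing categorical value is filled with the attribute's most common value, and a numeric attribute is smoothed by binning:

# Data cleaning: mode imputation for a missing value, then binning-based smoothing.
import pandas as pd

df = pd.DataFrame({
    "occupation": ["teacher", "engineer", None, "engineer", "teacher"],
    "income": [28_000, 52_000, 31_000, 49_000, 90_000],
})

# Replace the missing value with the most commonly occurring value of the attribute.
df["occupation"] = df["occupation"].fillna(df["occupation"].mode()[0])

# Smooth the numeric attribute: equal-width bins, each value replaced by its bin mean.
df["income_bin"] = pd.cut(df["income"], bins=3)
df["income_smooth"] = df.groupby("income_bin", observed=True)["income"].transform("mean")
print(df)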
Here are the criteria for comparing classification and prediction methods:
o Accuracy: The accuracy of the classifier can be referred to as the ability of the classifier to predict the
class label correctly, and the accuracy of the predictor can be referred to as how well a given predictor
can estimate the unknown value.
o Speed: The speed of the method depends on the computational cost of generating and using the
classifier or predictor.
o Robustness: Robustness is the ability to make correct predictions or classifications. In the context of
data mining, robustness is the ability of the classifier or predictor to make correct predictions given
noisy data or data with missing values.
o Interpretability: Interpretability is how readily we can understand the reasoning behind predictions or
classification made by the predictor or classifier.
Classification vs. Prediction:

o Classification is the process of identifying which category a new observation belongs to, based on a
training data set containing observations whose category membership is known. Prediction is the
process of identifying the missing or unavailable numerical data for a new observation.
o In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy
depends on how well a given predictor can guess the value of a predicted attribute for new data.
o In classification, the model is known as the classifier. In prediction, the model is known as the predictor.
o In classification, a model or classifier is constructed to find categorical labels. In prediction, a model or
predictor is constructed that predicts a continuous-valued function or ordered value.
o For example, grouping patients based on their medical records can be considered classification, while
predicting the correct treatment for a particular disease for a person can be thought of as prediction.
Decision Tree
A decision tree is a supervised learning method used in data mining for classification and regression tasks. It is
a tree that helps us in decision-making. A decision tree creates classification or regression models in the form of
a tree structure. It separates a data set into smaller subsets, and at the same time, the tree is steadily
developed. The final tree contains decision nodes and leaf nodes. A decision node has at least two branches,
while leaf nodes show a classification or decision and cannot be split any further. The uppermost decision node
in a tree, which corresponds to the best predictor, is called the root node. Decision trees can deal
with both categorical and numerical data.
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the randomness or
impurity in data sets.
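For a dataset $S$ with $c$ classes, where $p_i$ is the proportion of records belonging to class $i$, the standard formula is:

$$\text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

Entropy is 0 for a pure set (all records in one class) and maximal when the classes are evenly mixed.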
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split on an attribute. It is also called Entropy
Reduction. Building a decision tree is all about discovering attributes that return the highest information gain,
computed as shown below.
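For a split of $S$ on attribute $A$ into subsets $S_v$, one per value $v$ of $A$, the standard formula is:

$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$$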
In short, a decision tree is just like a flowchart diagram, with the terminal nodes showing decisions. Starting
with the whole dataset, we measure the entropy to find a way to segment the set, and repeat until the data in
each segment belongs to the same class.
It provides us with a framework to measure the values of outcomes and the probabilities of achieving them, and
it helps us make the best decisions based on existing data and informed assumptions.
In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an
extensive collection of records into smaller sets of the class by implementing a sequence of simple decision
rules. A decision tree model comprises a set of rules for partitioning a large heterogeneous population into
smaller, more homogeneous, mutually exclusive classes. The attributes of the classes can be any variables
with nominal, ordinal, binary, or quantitative values; in contrast, the classes must be of a qualitative type, such as
categorical, ordinal, or binary. In brief, given data of attributes together with their classes, a decision tree
creates a set of rules that can be used to identify the class. One rule is implemented after another, resulting in a
hierarchy of segments within a segment. The hierarchy is known as the tree, and each segment is called a node.
With each progressive division, the members from the subsequent sets become more and more similar to each
other. Hence, the algorithm used to build a decision tree is referred to as recursive partitioning. A well-known
such algorithm is CART (Classification and Regression Trees).
Consider an example: expanding the factory costs $3 million; the probability of a good economy is 0.6 (60%),
which leads to $8 million profit, and the probability of a bad economy is 0.4 (40%), which leads to $6 million
profit. Not expanding the factory costs $0; the probability of a good economy is 0.6 (60%), which leads to $4
million profit, and the probability of a bad economy is 0.4 (40%), which leads to $2 million profit.
The management team needs to make a data-driven decision on whether to expand, based on the given data.
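Working this out as expected values (assuming the stated profits are gross, so the $3 million expansion cost must be subtracted):

$$E[\text{expand}] = 0.6 \times 8 + 0.4 \times 6 - 3 = 4.8 + 2.4 - 3 = 4.2 \text{ (\$ million)}$$

$$E[\text{don't expand}] = 0.6 \times 4 + 0.4 \times 2 = 2.4 + 0.8 = 3.2 \text{ (\$ million)}$$

Under these assumptions, the expected profit from expanding ($4.2 million) exceeds that from not expanding ($3.2 million), so the data favors expansion.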
The decision tree algorithm may appear long, but it is quite simple. The basic algorithm technique is as follows:
Initially, D is the entire set of training tuples and their associated class labels (the input training data).
Attribute_selection_method specifies a heuristic procedure for choosing the attribute that "best" discriminates
the given tuples according to class.
Missing values in the data do not influence the process of building a decision tree to any considerable extent.
A decision tree model is automatic and simple to explain to the technical team as well as stakeholders.
Compared to other algorithms, decision trees need less effort for data preparation during pre-processing; a
minimal sketch follows.
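A sketch of decision tree classification with scikit-learn, echoing the bank-loan example from earlier; the features and labels are invented purely for illustration:

# Classifying loan applications as safe/risky with a CART-style decision tree.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [income in $1000s, years at current job, number of existing debts]
X = [[30, 1, 2], [80, 6, 0], [45, 3, 1], [25, 0, 3], [95, 10, 1], [60, 4, 0]]
y = ["risky", "safe", "safe", "risky", "safe", "safe"]

# criterion="entropy" selects splits by information gain, as described above.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# The derived model is an explicit, explainable set of if-then rules.
print(export_text(clf, feature_names=["income", "years_job", "debts"]))
print(clf.predict([[50, 2, 1]]))  # classify a new, unlabeled applicant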
Naïve Bayes Classifier
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used
for solving classification problems.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms which helps in
building the fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis, and
classifying articles.
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of
the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste,
then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually
contributes to identifying it as an apple, without depending on the others.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability
of a hypothesis with prior knowledge. It depends on the conditional probability:

P(A|B) = P(B|A) * P(A) / P(B)

Where,
P(A|B) is Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: the probability of the evidence.
The working of the Naïve Bayes Classifier can be understood with the help of the example below.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this
dataset, we need to decide whether we should play on a particular day according to the weather
conditions. To solve this problem, we follow these steps:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table of the weather conditions:

Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4
Likelihood table of the weather conditions:

Weather     No            Yes
Overcast    0             5             5/14 = 0.36
Rainy       2             2             4/14 = 0.29
Sunny       2             3             5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41

Since P(Yes|Sunny) > P(No|Sunny), on a sunny day the player can play.
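A minimal sketch that reproduces this hand calculation in plain Python; the two lists encode the 14-row weather dataset above:

# Computing P(label | weather) directly from the dataset via Bayes' theorem.
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

def posterior(weather, label):
    # Bayes' theorem: P(label | weather) = P(weather | label) * P(label) / P(weather)
    rows = [w for w, p in zip(outlook, play) if p == label]
    likelihood = rows.count(weather) / len(rows)      # P(weather | label)
    prior = len(rows) / len(play)                     # P(label)
    evidence = outlook.count(weather) / len(outlook)  # P(weather)
    return likelihood * prior / evidence

print(round(posterior("Sunny", "Yes"), 2))  # 0.6
print(round(posterior("Sunny", "No"), 2))   # 0.4 (0.41 above comes from rounding)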
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship
between features.
o It can be used for real-time predictions because the Naïve Bayes Classifier is an eager learner.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed.
It is primarily used for document classification problems, i.e., determining which category a particular
document belongs to, such as Sports, Politics, or Education.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the predictor
variables are the independent Booleans variables. Such as if a particular word is present or not in a
document. This model is also famous for document classification tasks.
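A minimal sketch of the Gaussian variant with scikit-learn, assuming two continuous features; the numbers and class names are invented for illustration:

# Gaussian Naive Bayes: continuous features modeled as per-class normal distributions.
from sklearn.naive_bayes import GaussianNB

# Features: [height in cm, weight in kg]; invented two-class data.
X = [[180, 80], [175, 77], [160, 55], [158, 52], [182, 85], [163, 58]]
y = ["adult", "adult", "teen", "teen", "adult", "teen"]

model = GaussianNB().fit(X, y)   # estimates a mean and variance per feature, per class
print(model.predict([[170, 65]]))         # most probable class for a new observation
print(model.predict_proba([[170, 65]]))   # posterior probability of each class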
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one
cluster, and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of grouping abstract objects into classes of similar objects.
Points to Remember
Clustering analysis is broadly used in many applications such as market research, pattern recognition,
data analysis, and image processing.
Clustering can also help marketers discover distinct groups in their customer base, and they can
characterize their customer groups based on purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionalities and gain insight into structures inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth observation database. It
also helps in the identification of groups of houses in a city according to house type, value, and
geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to
observe characteristics of each cluster.
The following points throw light on why clustering is required in data mining −
Scalability − We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to
any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting
clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only
spherical clusters of small size.
High dimensionality − The clustering algorithm should be able to handle not only low-dimensional
data but also high-dimensional data.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and usable.
Clustering Methods
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects, and the partitioning method constructs ‘k’ partitions of the data.
Each partition will represent a cluster, and k ≤ n. This means that it will classify the data into k groups, which
satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects from one
group to another, as in the k-means sketch below.
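A minimal sketch of a partitioning method using scikit-learn's KMeans; the 2-D points are invented for illustration:

# Partitioning n objects into k clusters with k-means (iterative relocation).
from sklearn.cluster import KMeans

# Six 2-D points forming two loose groups; k = 2 <= n = 6.
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each object
print(km.cluster_centers_)  # the k cluster centroids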
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical
methods on the basis of how the hierarchical decomposition is formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate
group. It keeps on merging the objects or groups that are close to one another, and it keeps doing so until all of
the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same
cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in
its own cluster or until the termination condition holds. Hierarchical methods are rigid, i.e., once a merging or
splitting is done, it can never be undone.
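A minimal sketch of the bottom-up (agglomerative) approach with scikit-learn; the points are invented for illustration:

# Agglomerative (bottom-up) hierarchical clustering: start with singleton groups
# and repeatedly merge the closest pair of groups.
from sklearn.cluster import AgglomerativeClustering

X = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 9.1], [8.9, 9.0]]

# Stop merging once 3 clusters remain (the termination condition).
agg = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print(agg.labels_)  # e.g. [0, 0, 1, 1, 2, 2] up to a permutation of the labels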
Here are the two approaches that are used to improve the quality of hierarchical clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group
objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given cluster as long as
the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of points.
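A minimal sketch of this idea using scikit-learn's DBSCAN, a well-known density-based algorithm; eps is the neighborhood radius, min_samples the minimum point count, and the data is invented for illustration:

# Density-based clustering: grow clusters while the eps-neighborhood of each point
# contains at least min_samples points; isolated points are labeled noise (-1).
from sklearn.cluster import DBSCAN

X = [[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # dense region
     [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],               # another dense region
     [9.0, 1.0]]                                        # isolated noise point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # e.g. [0 0 0 0 1 1 1 -1]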
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of cells that form a
grid structure.
Advantages
The major advantage of this method is fast processing time, which depends only on the number of cells in each
dimension of the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given model. This
method locates the clusters by clustering the density function. It reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on standard statistics,
taking outlier or noise into account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented constraints. A
constraint refers to the user expectation or the properties of desired clustering results. Constraints provide us
with an interactive way of communication with the clustering process. Constraints can be specified by the user
or the application requirement.