
Unit: 5

Classification and Prediction


Content
• Classification vs. prediction; issues regarding classification and prediction
• Statistical-based algorithms, distance-based algorithms, decision tree-based algorithms, neural network-based algorithms, rule-based algorithms, combining techniques
• Accuracy and error measures; evaluation of the accuracy of a classifier or predictor
• Neural network prediction methods: linear and nonlinear regression, logistic regression
• Introduction of tools such as DB Miner / WEKA / DTREG DM tools
Formal Classification and Prediction Definition:

• Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.

• Such analysis can help provide us with a better understanding of the data at large.

• Classification predicts categorical (discrete, unordered) labels; prediction models continuous-valued functions.
Classification
• The goal of data classification is to organize and categorize data into distinct classes.
• A model is first created based on the data distribution.
• The model is then used to classify new data.
• Given the model, a class can be predicted for new data.
• Generally speaking, classification is used for discrete and nominal values.
Prediction
• The goal of prediction is to forecast or deduce the value of an attribute based on the values of other attributes.

• A model is first created based on the data distribution.

• The model is then used to predict future or unknown values.
Summarization of Classification and Prediction:

• If forecasting a discrete value → Classification

• If forecasting a continuous value → Prediction

Understanding Classification and Prediction
Cont..
Classification:
• Suppose from your past data (train data) you come to know that your best friend likes the above movies.
• Now a new movie (test data) is released, and you want to know whether your best friend will like it or not.
• If you are confident about the chances of your friend liking that movie, you can take your friend to the movie this weekend.
• If you observe the problem closely, it is simply whether your friend will like the movie or not.
• Finding a solution to this type of problem is called classification. This is because we are classifying things into the classes they belong to (yes or no, like or dislike).
Cont..
• Keep in mind, here we are forecasting a discrete value (classification), and this classification belongs to supervised learning.
• This is because you are learning it from your train data.
• Most classification is binary classification, in which we have to predict whether the output belongs to class 1 or class 2 (class 1: yes, class 2: no).
• We can use classification for predicting more classes too, e.g., colors: RED, GREEN, BLUE, YELLOW, ORANGE.
Cont..
Prediction:
• Suppose from your past data (train data) you come to know that your best friend liked the above movies, and you also know how many times your friend watched each particular movie.
• Now a new movie (test data) is released, and you are going to find out how many times your friend will watch this newly released movie: 5 times, 6 times, 10 times, anything.
• If you observe the problem closely, it is about finding a count; sometimes we describe this as predicting a value.
Cont..
• Keep in mind, here we are forecasting a continuous value (prediction), and this prediction also belongs to supervised learning.
• This is because you are learning it from your train data.
Difference between Discrete data values
and Continuous data values
• Discrete data values can only take certain values.

• For example, you can have a certain number of friends, like 4 or 5, but you can't have 4.5 (four and a half) friends.

• These types of data values are called discrete.

• The weight of an object or the height of a person can take any value in a range; these types of data are called continuous data.
How Does Classification Work?

• With the help of a bank loan application, let us understand the working of classification.

• The data classification process includes two steps:

• Building the Classifier or Model (Learning)

• Using the Classifier for Classification (Classification)
How Does Classification Work?
• Building the Classifier or Model

• This step is the learning step or the learning phase.

• In this step, the classification algorithms build the classifier.
• The classifier is built from the training set, made up of database tuples and their associated class labels.
• Each tuple that constitutes the training set is referred to as a training tuple.
• These tuples can also be referred to as samples, objects, or data points.
How Does Classification Work?
• Using the Classifier for Classification

• In this step, the classifier is used for classification.

• Here, the test data are used to estimate the accuracy of the classification rules.
• The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
How Does Classification Work?
Learning Phase
[Figure: training data are analyzed by a classification algorithm, producing the model as classification rules.]
How Does Classification Work?
• Learning Phase:

• Training data are analyzed by a classification algorithm.

• Here, the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules.
How Does Classification Work?
Classification Phase
[Figure: the classification rules are applied to test data, and then to new data tuples.]
How Does Classification Work?
• Classification:

• Test data are used to estimate the accuracy of the classification rules.

• If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
Classification and Prediction Issues
• The major issue is preparing the data for
Classification and Prediction. Preparing the data
involves the following activities:

• Data Cleaning
• Relevance Analysis
• Data Transformation and reduction
– Normalization
– Generalization
Classification and Prediction Issues
• Data Cleaning:

• Data cleaning involves removing the noise and treating missing values.
• The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
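
As an illustration, here is a minimal sketch of this mode-based imputation using pandas; the attribute names and values are made up, not from the slides' data set:

```python
import pandas as pd

# Illustrative training data with a missing income value
df = pd.DataFrame({
    "income": ["high", "medium", None, "medium", "low"],
    "buys_computer": ["no", "yes", "yes", "yes", "no"],
})

# Replace each missing value with the most commonly
# occurring value (the mode) of that attribute
mode_value = df["income"].mode()[0]          # "medium"
df["income"] = df["income"].fillna(mode_value)
print(df)
```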
Classification and Prediction Issues
• Relevance Analysis:

• The database may also have irrelevant attributes, and the data may be redundant.

• Correlation analysis is used to know whether any two given attributes are related.

• Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task.
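
A small sketch of correlation analysis with pandas; the table and attribute names are assumed for illustration:

```python
import pandas as pd

# Illustrative numeric attributes
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 62],
    "income": [30, 42, 61, 66, 80],   # in thousands
    "id":     [4, 1, 5, 2, 3],        # record id, carries no signal
})

# Pearson correlation between every pair of attributes:
# a strongly correlated pair (age/income here) suggests redundancy,
# while an attribute uncorrelated with everything (id) is a
# candidate for removal during attribute subset selection
print(df.corr())
```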
Classification and Prediction Issues
• Data Transformation and Reduction − The data can be transformed by any of the following methods.

• Normalization − The data is transformed using normalization.
• Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
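
A small sketch of min-max normalization as described above; the income values are illustrative:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

incomes = [12000, 35000, 58000, 98000]
print(min_max_normalize(incomes))   # [0.0, 0.267..., 0.534..., 1.0]
```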
Classification and Prediction Issues
• Data Transformation and Reduction −

• Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose, we can use concept hierarchies.

• For example, numeric values for the attribute income can be generalized to discrete ranges, such as low, medium, and high.
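
A minimal sketch of such a concept-hierarchy generalization for income; the cut points are assumed for illustration:

```python
def generalize_income(value):
    """Map a numeric income to a discrete range (concept hierarchy).
    The cut points used here are hypothetical."""
    if value < 40000:
        return "low"
    elif value < 80000:
        return "medium"
    return "high"

print([generalize_income(v) for v in [12000, 58000, 98000]])
# ['low', 'medium', 'high']
```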
Classification and Prediction Methods

Classification:
- Decision Tree
- Bayesian Classification
- Rule Based Classification
- Neural Network
Types of Algorithms
• Statistical-Based Algorithms: Naïve Bayes, Linear Discriminant
Analysis (LDA)
• Distance-Based Algorithms: k-Nearest Neighbors (k-NN),
Support Vector Machines (SVM)
• Decision Tree-Based Algorithms: ID3, C4.5, CART, Random
Forest
• Neural Network-Based Algorithms: Multi-Layer Perceptron
(MLP), Deep Neural Networks (DNN)
• Rule-Based Algorithms: Rule Induction, Associative
Classification
• Combining Techniques: Ensemble Methods (Bagging, Boosting,
Stacking)
Decision Tree Induction
 Used in classification or in regression.
 A decision tree is a structure that includes a root node, branches, and leaf nodes.
 Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label.
 The topmost node in the tree is the root node.
 Types –
- ID3 (Iterative Dichotomiser 3)
- C4.5
- CART (Classification and Regression Trees)
Decision Tree Representation
[Figure: a tree with a root node at the top, branches leading down, and leaf nodes at the bottom; each leaf holds one of the set of possible answers (class labels).]

Flow
• Dataset → Algorithm → Model/Classifier → Class Label

Example (splitting node tested over the entire data set D):
• Root / splitting node: Employed?
– no → test credit score: High → Approve, Low → Reject
– yes → test income: High → Approve, Low → Reject
Decision Tree (cont..)
• The following decision tree is for the concept buys_computer; it indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute (a feature/test attribute). Each leaf node represents a class.
Classification by Decision Tree Induction
•Decision tree
◦ Flowchart-like structure
◦ Internal node denotes a test on an attribute
◦ Branch represents an outcome of the test
◦ Leaf nodes represent class labels or class distribution

•Decision tree generation consists of two phases


◦ Tree construction
◦ At start, all the training examples are at the root
◦ Partition examples recursively based on selected attributes
◦ Tree pruning
◦ Identify and remove branches that reflect noise or outliers
Important Terms for Decision Tree
Attribute Selection Measures
- It's a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes.
- Also known as splitting rules.
• Information Gain
• Entropy
• Gini Index
Information Gain
• Information gain can be used for continuous-valued (numeric) attributes.

• The attribute which has the highest information gain is selected for the split.

• Assume that there are two classes, P (positive) and N (negative).

• Suppose we have S samples; out of these, p samples belong to class P and n samples belong to class N.

• The amount of information needed to decide whether an arbitrary sample in S belongs to P or N is defined as

I(p, n) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
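
A small Python sketch of this measure; the 9 yes / 5 no class counts are taken from the buys_computer example used later in these slides:

```python
from math import log2

def info(p, n):
    """Expected information I(p, n) needed to classify a sample in S,
    where p samples belong to class P and n to class N."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                      # treat 0 * log2(0) as 0
            frac = count / total
            result -= frac * log2(frac)
    return result

# buys_computer data: 9 'yes' and 5 'no' tuples
print(round(info(9, 5), 3))            # 0.940
```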
Entropy (E)
• Entropy is the measure of impurity, disorder, or uncertainty in a bunch of examples.

• What does entropy basically do?

• Entropy controls how a decision tree decides to split the data.
• It actually affects how a decision tree draws its boundaries.
Gini Index
• An alternative method to information gain is called the Gini index.
• Gini is used in CART (Classification and Regression Trees).
• If a data set T contains examples from n classes, the Gini index, gini(T), is defined as

Gini(T) = 1 − Σ_j (p_j)^2

• n: the number of classes

• p_j: the relative frequency (probability) that an example in T belongs to class j
Gini Index (cont…)
• After splitting T into two subsets T1 and T2 with sizes N1 and N2 (N = N1 + N2), the Gini index of the split data is defined as

Gini_split(T) = (N1/N) · Gini(T1) + (N2/N) · Gini(T2)
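
A minimal sketch of both formulas; the class counts in the demo are assumed from the standard buys_computer table used in the worked example below (9 yes / 5 no overall; income ∈ {low, medium} giving 7 yes / 3 no, and the rest 2 yes / 2 no):

```python
def gini(counts):
    """Gini(T) = 1 - sum(p_j^2) for the class counts in partition T."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(part1, part2):
    """Weighted Gini index after a binary split into T1 and T2."""
    n1, n2 = sum(part1), sum(part2)
    n = n1 + n2
    return (n1 / n) * gini(part1) + (n2 / n) * gini(part2)

print(round(gini([9, 5]), 3))                 # 0.459
print(round(gini_split([7, 3], [2, 2]), 3))   # 0.443
```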


Decision Tree Example
[Worked decision-tree examples appear as figures in the original slides.]
Gini Index
• It is used in CART.
• The Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 − Σ_{i=1..m} (p_i)^2

• where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_i,D|/|D|.
• The sum is computed over m classes.
Gini Index Example
Gini Index
• Let D be the training data of the table, where 9 tuples belong to the class buys_computer = yes and the remaining 5 tuples belong to the class buys_computer = no.

• A (root) node N is created for the tuples in D. We first use the equation for the Gini index to compute the impurity of D:

Gini(D) = 1 − (9/14)^2 − (5/14)^2 = 0.459
Gini Index
• To find the splitting criterion for the tuples in D, we need to
compute the Gini index for each attribute.

• Let’s start with the attribute income and consider each of the
possible splitting subsets.

• Consider the subset {low, medium}.

• This would result in 10 tuples in partition D1 satisfying the


condition “income ∈ {low, medium}.”

• The remaining four tuples of D would be assigned to partition D2.


Gini Index
• The Gini index value computed based on this partitioning is

Gini_{income ∈ {low, medium}}(D) = (10/14) · Gini(D1) + (4/14) · Gini(D2)
= (10/14) · (1 − (7/10)^2 − (3/10)^2) + (4/14) · (1 − (2/4)^2 − (2/4)^2)
= 0.443
= Gini_{income ∈ {high}}(D)
Gini Index
• Similarly, the Gini index values for splits on the remaining subsets are: 0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}).

• Therefore, the best binary split for attribute income is on {low, medium} (or {high}) because it minimizes the Gini index.

• Evaluating age, we obtain {youth, senior} (or {middle_aged}) as the best split for age, with a Gini index of 0.357; the attributes student and credit_rating are both binary, with Gini index values of 0.367 and 0.429, respectively.
Gini Index
• The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102.

• The binary split "age ∈ {youth, senior}" results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion.

• Node N is labeled with the criterion, two branches are grown from it, and the tuples are partitioned accordingly.

• Hence, the Gini index selects a binary split on age at the root node, unlike the multiway split on age produced by information gain.
Bayesian Classification
• “What are Bayesian classifiers?”

• Bayesian classifiers are statistical classifiers.

• They can predict class membership probabilities, such as the


probability that a given tuple belongs to a particular class.

• Bayesian classification is based on Bayes’ theorem.

• Bayesian classifiers have also exhibited high accuracy and speed


when applied to large databases.
Bayesian Classification
• Naïve Bayesian classifiers assume that

"The effect of an attribute value on a given class is independent of the values of the other attributes."

• This assumption is called class conditional independence.
Bayesian Algorithm
• Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)

• The naïve Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors.

• This assumption is called class conditional independence.
Bayesian Algorithm

• P(c|x) is the posterior probability of class (target) given predictor


(attribute).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood which is the probability of predictor given
class.
• P(x) is the prior probability of predictor.
Bayesian Example
• Predicting a class label using naïve Bayesian classification.

• We wish to predict the class label of a tuple using naïve Bayesian


classification, given the same training data “buys_computer”

• The data tuples are described by the attributes age, income,


student, and credit rating.

• The class label attribute, buys_computer, has two distinct values, namely {yes, no}.

• Let C1 correspond to the class buys_computer = yes.

• Let C2 correspond to the class buys_computer = no.
Bayesian Example
• Predicting a class label using naïve Bayesian classification.

• The tuple we wish to classify is

X = (age = youth, income = medium, student = yes, credit_rating = fair)

We need to maximize P(X|Ci)P(Ci), for i = 1, 2.

P(Ci), the prior probability of each class, can be computed based on the training tuples:

P(buys_computer = yes) = 9/14 = 0.643

P(buys_computer = no) = 5/14 = 0.357
Bayesian Example
• Predicting a class label using naïve Bayesian classification.

To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:

P(age = youth | buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.600

P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400

P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200

P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400
Bayesian Example
Using the above probabilities, we obtain

P(X | buys_computer = yes) = P(age = youth | yes) * P(income = medium | yes) *
P(student = yes | yes) * P(credit_rating = fair | yes)
= 0.222 * 0.444 * 0.667 * 0.667
= 0.044

Similarly,

P(X | buys_computer = no) = 0.600 * 0.400 * 0.200 * 0.400
= 0.019
Bayesian Example
To find the class, Ci, that maximizes P(X|Ci)P(Ci), we compute

P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 * 0.643 = 0.028

P(X | buys_computer = no) P(buys_computer = no) = 0.019 * 0.357 = 0.007

Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
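
The same computation as a short Python sketch, reusing the probabilities from the example above:

```python
# Priors and conditional probabilities from the worked example
p_yes, p_no = 9/14, 5/14

likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)  # age, income, student, credit
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)

score_yes = likelihood_yes * p_yes   # ~0.028
score_no  = likelihood_no  * p_no    # ~0.007

label = "yes" if score_yes > score_no else "no"
print(f"buys_computer = {label}")    # yes
```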
Bayesian Classification Example
[The weather ("play tennis") training table is shown as a figure in the original slides.]
Bayesian Algorithm
• The posterior probability can be calculated by first constructing a frequency table for each attribute against the target,

• then transforming the frequency tables into likelihood tables,

• and finally using the naïve Bayesian equation to calculate the posterior probability for each class.

• The class with the highest posterior probability is the outcome of the prediction.
Bayesian Algorithm
[The frequency tables and the likelihood tables for all four predictors are shown as figures in the original slides.]
Bayesian Algorithm
• An unseen sample X = <rain, hot, high, false>

• P(X|p)·P(p) = P(rain|p) * P(hot|p) * P(high|p) * P(false|p) * P(p)
= (3/9) * (2/9) * (3/9) * (6/9) * (9/14)
= 0.333 * 0.222 * 0.333 * 0.667 * 0.643
≈ 0.0106

• P(X|n)·P(n) = P(rain|n) * P(hot|n) * P(high|n) * P(false|n) * P(n)
= (2/5) * (2/5) * (4/5) * (2/5) * (5/14)
= 0.4 * 0.4 * 0.8 * 0.4 * 0.357
≈ 0.0183

• Since P(X|n)·P(n) > P(X|p)·P(p), sample X is classified in class n (don't play).
Rule Based Classification
• In this section, we look at rule-based classifiers, where the learned
model is represented as a set of IF-THEN rules.

• We first examine how such rules are used for classification.

Using IF-THEN Rules for Classification

• Rules are a good way of representing information or bits of


knowledge.
• A rule-based classifier uses a set of IF-THEN rules for classification.

• An IF-THEN rule is an expression of the form

IF condition THEN conclusion


Rule Based Classification
• An example is rule R1,

R1: IF age = youth AND student = yes THEN buys computer = yes.

• The “IF”-part (or left-hand side) of a rule is known as the rule


antecedent or precondition.

• The “THEN”-part (or right-hand side) is the rule consequent.

• In the rule antecedent, the condition consists of one or more


attribute tests (such as age = youth, and student = yes) that are
logically ANDed.

• The rule’s consequent contains a class prediction (in this case, we


are predicting whether a customer will buy a computer).
Rule Based Classification
• R1 can also be written as

R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)

• If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple,

• we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.
Cont..
• General form: IF <condition> AND <condition> THEN <conclusion>
• R1: IF outlook = sunny AND humidity = high THEN play = no
• R2: IF outlook = sunny THEN play = yes
Rule Based Classification
• A rule R can be assessed by its coverage and accuracy.

• Given a tuple, X, from a class-labeled data set, D:
• let n_covers be the number of tuples covered by R,
• let n_correct be the number of tuples correctly classified by R,
• and let |D| be the number of tuples in D.

• We can define the coverage and accuracy of R as

coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
Rule Based Classification

• That is, a rule’s coverage is the percentage of tuples that are


covered by the rule.
(i.e., whose attribute values hold true for the rule’s antecedent).

• For a rule’s accuracy, we look at the tuples that it covers and see
what percentage of them the rule can correctly classify.
Rule Based Classification

• Consider rule R1 above, which covers 2 of the 14 tuples. It can


correctly classify both tuples.

• Therefore, coverage(R1) = (2/14)*100


= 14.28%

accuracy(R1) = (2/2)*100
= 100%.
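
A minimal sketch of these two measures; the three-tuple data set is made up for illustration (the slides' example uses the 14-tuple buys_computer table):

```python
def evaluate_rule(rule, dataset, target):
    """Coverage and accuracy of an IF-THEN rule over a labeled data set.
    `rule` is a pair (antecedent_predicate, predicted_class)."""
    antecedent, predicted = rule
    covered = [t for t in dataset if antecedent(t)]
    correct = [t for t in covered if t[target] == predicted]
    coverage = len(covered) / len(dataset)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# R1: IF age = youth AND student = yes THEN buys_computer = yes
r1 = (lambda t: t["age"] == "youth" and t["student"] == "yes", "yes")

data = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]
print(evaluate_rule(r1, data, "buys_computer"))  # (0.333..., 1.0)
```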
Neural Network
• A "neural network" (NN) is a mathematical or computational model based on biological neural networks.

• It is a set of connected input/output units in which each connection has a weight associated with it.

• Advantages:

◦ Prediction accuracy is generally high

◦ Fast evaluation of the learned target function
Neural Network
• Disadvantages:

◦ long training time


◦ difficult to understand the learned function (weights)
◦ not easy to incorporate domain knowledge
Neurons
[Figure: a biological neuron, showing the cell body, dendrites, axon, and axon terminals.]
Neural Network
[Figure: a single unit weights each input (x1·w1, x2·w2, ...), sums them, and applies an activation function F to produce the output y.]
Neural Network
• Network Training:

• The ultimate objective of training is to obtain a set of weights


that makes almost all the tuples in the training data classified
correctly.
•Steps (a minimal sketch follows this list):
◦ Initialize weights with random values.
◦ Feed the input tuples into the network one by one.
◦ For each unit
◦ Compute the net input to the unit as a linear combination of all the
inputs to the unit
◦ Compute the output value using the activation function
◦ Compute the error
◦ Update the weights and the bias
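
Below is a toy implementation of these steps for a single sigmoid unit (not a full multilayer backpropagation network); learning the AND function is an assumed example:

```python
import random
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def train(samples, labels, epochs=100, lr=0.1):
    n = len(samples[0])
    # Initialize weights and bias with random values
    weights = [random.uniform(-0.5, 0.5) for _ in range(n)]
    bias = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, target in zip(samples, labels):   # feed tuples one by one
            # Net input: linear combination of all inputs to the unit
            net = sum(w * xi for w, xi in zip(weights, x)) + bias
            out = sigmoid(net)                    # activation function
            err = target - out                    # compute the error
            grad = err * out * (1 - out)          # sigmoid derivative term
            # Update the weights and the bias
            weights = [w + lr * grad * xi for w, xi in zip(weights, x)]
            bias += lr * grad
    return weights, bias

# Toy example: learn the AND function
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = train(X, y, epochs=5000, lr=0.5)
print([round(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)) for x in X])
# typically [0, 0, 0, 1]
```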
Other Classification Methods
•k-nearest neighbor classifier
•case-based reasoning
•Genetic algorithm

•Rough set approach

•Fuzzy set approaches


K-nearest neighbor classifier
• A powerful classification algorithm used in pattern recognition.

• K nearest neighbors stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function).

KNN: Classification Approach

• An object (a new instance) is classified by a majority vote of its neighbors' classes.

• The object is assigned to the most common class amongst its K nearest neighbors (as measured by a distance function).
K-nearest neighbor classifier
• Distance is measured for continuous variables using the Euclidean distance formula:

d(X, Y) = sqrt( (x1 − y1)^2 + (x2 − y2)^2 + ... + (xn − yn)^2 )
K-nearest neighbor Algorithm
• All the instances correspond to points in an n-dimensional feature space.

• Each instance is represented with a set of numerical attributes.

• Each training example consists of a feature vector and an associated class label.

• Classification is done by comparing the feature vector of a new example E with those of the K nearest training points:

• Select the K nearest examples to E in the training set.

• Assign E to the most common class among its K nearest neighbors (a sketch follows below).
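
A minimal sketch of the algorithm; the 2-D points and labels are made up for illustration:

```python
from collections import Counter
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, labels, query, k=3):
    """Assign `query` the majority class among its k nearest neighbors."""
    order = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D feature vectors
X = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.2), (5.8, 6.1)]
y = ["A", "A", "B", "B"]
print(knn_classify(X, y, (1.1, 0.9)))   # 'A'
```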


K-nearest neighbor Algorithm
• Strengths of KNN

• Very simple.
• Can be applied to the data from any distribution.
• Good classification if the number of samples is large enough.

• Weaknesses of KNN

• Takes more time to classify a new example:
• it needs to calculate and compare the distance from the new example to all other examples.
• Choosing k may be tricky.
• Needs a large number of samples for accuracy.
Prediction
• Regression:

• Regression is a data mining technique used to predict a range of


numeric values (also called continuous values), given a particular
dataset.

• For example, regression might be used to predict the cost of a


product or service, given other variables.

• Regression is used across multiple industries for business and


marketing planning, financial forecasting, environmental modeling
and analysis of trends.
Prediction
• Regression involves ,
Predictor variable (the values which are known) and
Response variable (values to be predicted).

There are 2 types of regression:

1) Linear Regression
2) Multiple Regression
Linear regression
• It is the simplest form of regression. Linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data.

• Linear regression attempts to find the mathematical relationship between variables.

• If the outcome is a straight line, then it is considered a linear model; if it is a curved line, then it is a nonlinear model.

• The relationship between the dependent variable and the single independent variable is given by a straight line:

Y = α + βX
Linear regression
• The model 'Y' is a linear function of 'X'.

• The value of 'Y' increases or decreases in a linear manner as the value of 'X' changes.
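
A small sketch of fitting Y = α + βX by least squares with NumPy; the observations are illustrative:

```python
import numpy as np

# Illustrative observations of X and Y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Least-squares fit of a degree-1 polynomial: returns [beta, alpha]
beta, alpha = np.polyfit(x, y, deg=1)
print(f"Y = {alpha:.2f} + {beta:.2f} X")   # roughly Y = 0.15 + 1.95 X
```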
Multiple Regression
• Multiple linear regression is an extension of linear regression analysis.

• It uses two or more independent variables to predict a single continuous dependent variable.

Y = a0 + a1·X1 + a2·X2 + ... + ak·Xk + e

where,
'Y' is the response variable,
X1, X2, ..., Xk are the independent predictors,
'e' is the random error,
a0, a1, a2, ..., ak are the regression coefficients.
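
A minimal sketch of estimating the coefficients a0, a1, a2 by least squares; the data are made up for illustration:

```python
import numpy as np

# Two independent predictors X1, X2 and a continuous response Y
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
Y = np.array([7.1, 6.9, 14.2, 13.8, 20.1])

# Prepend an intercept column and solve for a0, a1, a2
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a0, a1, a2 = coef
print(f"Y = {a0:.2f} + {a1:.2f} X1 + {a2:.2f} X2")
```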
Accuracy and Error Measures
• Accuracy, Precision, Recall, F1-Score
• Mean Squared Error (MSE), Root Mean Squared Error (RMSE)
• Receiver Operating Characteristic (ROC) Curve, Area Under Curve (AUC)

Evaluating Classifier or Predictor Accuracy
• Cross-Validation (k-fold, Leave-One-Out)
• Confusion Matrix
• Bias-Variance Analysis
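
A small sketch of the confusion-matrix-based measures listed above (MSE/RMSE and ROC/AUC are omitted for brevity); the labels are illustrative:

```python
def classification_metrics(y_true, y_pred, positive="yes"):
    """Confusion-matrix-based accuracy, precision, recall, and F1."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

print(classification_metrics(["yes", "no", "yes", "no"],
                              ["yes", "yes", "yes", "no"]))
# (0.75, 0.666..., 1.0, 0.8)
```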
Data Mining Tools
• WEKA
• DB Miner
• DTREG
 Thank You 
