Unit-5 3161610
Classification and Prediction
Classification
• The goal of data classification is to organize
and categorize data into distinct classes.
• A model is first created based on the
data distribution.
• The model is then used to classify new data.
• Given the model, a class can be predicted for
new data.
• Generally speaking, classification is used for discrete
and nominal values.
Prediction
• The goal of prediction is to forecast or deduce the
value of an attribute based on values of other
attributes.
Preparing the data for classification and prediction:
• Data Cleaning
• Relevance Analysis
• Data Transformation and Reduction
– Normalization (a small sketch follows below)
– Generalization
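The normalization step above can be illustrated with a short sketch: a minimal min-max normalization in plain Python. The helper name and the sample income values are made up for the example.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Rescale numeric attribute values to the range [new_min, new_max].
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        return [new_min for _ in values]   # constant attribute
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

# Hypothetical 'income' attribute rescaled to [0, 1]
incomes = [12000, 35000, 54000, 98000]
print(min_max_normalize(incomes))          # [0.0, 0.267..., 0.488..., 1.0]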
Classification and Prediction Issues
• Data Cleaning:
• The database may also contain irrelevant attributes, and the data
may be redundant.
Classification
• Decision Tree
• Bayesian Classification
• Neural Network
Types of Algorithms
• Statistical-Based Algorithms: Naïve Bayes, Linear Discriminant
Analysis (LDA)
• Distance-Based Algorithms: k-Nearest Neighbors (k-NN),
Support Vector Machines (SVM)
• Decision Tree-Based Algorithms: ID3, C4.5, CART, Random
Forest
• Neural Network-Based Algorithms: Multi-Layer Perceptron
(MLP), Deep Neural Networks (DNN)
• Rule-Based Algorithms: Rule Induction, Associative
Classification
• Combining Techniques: Ensemble Methods (Bagging, Boosting,
Stacking)
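As a rough illustration of how some of these families look in practice, here is a minimal sketch, assuming scikit-learn is available; the iris dataset and the particular models are chosen only for the example.

# One model each from the statistical, distance-based, and decision-tree families
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB            # statistical-based
from sklearn.neighbors import KNeighborsClassifier    # distance-based
from sklearn.tree import DecisionTreeClassifier       # decision tree-based

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (GaussianNB(), KNeighborsClassifier(n_neighbors=3),
              DecisionTreeClassifier(max_depth=3)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))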
Decision Tree Induction
Used for classification or regression.
A decision tree is a structure that includes a root node,
branches, and leaf nodes.
Each internal node denotes a test on an attribute, each
branch denotes the outcome of a test, and each leaf
node holds a class label.
The topmost node in the tree is the root node.
Types –
- ID3 (Iterative Dichotomiser 3)
- C4.5
- CART (Classification and Regression Trees)
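A minimal sketch of decision tree induction, assuming scikit-learn (whose DecisionTreeClassifier implements an optimized CART variant); the tiny encoded dataset is made up for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoded attributes: [age, income, student, credit_rating]
X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0],
     [2, 0, 1, 0], [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0]]
y = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no']

tree = DecisionTreeClassifier(criterion='gini', max_depth=3).fit(X, y)
print(export_text(tree, feature_names=['age', 'income', 'student', 'credit']))
print(tree.predict([[1, 1, 1, 0]]))        # classify a new, unseen tuple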
Decision Tree Representation
[Figure: example decision tree with a root node, branches, and test attributes (features) such as income and credit score.]
Classification by Decision Tree Induction
•Decision tree
◦ Flowchart-like structure
◦ Internal node denotes a test on an attribute
◦ Branch represents an outcome of the test
◦ Leaf nodes represent class labels or class distribution
• Let’s start with the attribute income and consider each of the
possible splitting subsets.
• Node N is labeled with the criterion, two branches are grown from
it, and the tuples are partitioned accordingly.
• Hence, the Gini index has selected income instead of age at the
root node, unlike the (non-binary) tree created by information gain.
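To make the Gini-based selection concrete, here is a minimal sketch of the Gini index, Gini(D) = 1 - sum(p_i^2), and the weighted Gini index of a binary split; the class counts are made up for illustration.

def gini(counts):
    # Impurity of a partition given its per-class tuple counts.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    # Weighted Gini index of a binary split into two partitions.
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

# Hypothetical split on income into {low, medium} vs {high}
print(gini([9, 5]))                 # impurity of the full data set D
print(gini_split([6, 1], [3, 4]))   # weighted Gini index of the candidate split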
Bayesian Classification
• “What are Bayesian classifiers?”
• The class label attribute, buys computer, has two distinct values
namely {yes, no}.
P(Ci), the prior probability of each class, can be computed based on the
training tuples.
The class-conditional probabilities are then multiplied; for example, for a
tuple X in the buys_computer data:
P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
Similarly,
• P(X|p)·P(p) = P(rain|p) · P(hot|p) · P(high|p) · P(false|p) · P(p)
= (2/9) · (2/9) · (3/9) · (6/9) · (9/14)
≈ 0.0071
• P(X|n)·P(n) = P(rain|n) · P(hot|n) · P(high|n) · P(false|n) · P(n)
= (3/5) · (2/5) · (4/5) · (2/5) · (5/14)
≈ 0.0274
• Since P(X|n)·P(n) > P(X|p)·P(p), sample X is classified in class n (don’t play).
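A minimal sketch that redoes the arithmetic above with exact fractions, using only the Python standard library; the conditional counts are the ones quoted in the slide.

from fractions import Fraction as F

# P(X|p)·P(p) and P(X|n)·P(n) for X = (rain, hot, high, false)
p_play = F(2, 9) * F(2, 9) * F(3, 9) * F(6, 9) * F(9, 14)
p_dont = F(3, 5) * F(2, 5) * F(4, 5) * F(2, 5) * F(5, 14)

print(float(p_play))                                  # ≈ 0.0071
print(float(p_dont))                                  # ≈ 0.0274
print('play' if p_play > p_dont else "don't play")    # -> don't play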
Rule Based Classification
• In this section, we look at rule-based classifiers, where the learned
model is represented as a set of IF-THEN rules.
R1: IF age = youth AND student = yes THEN buys computer = yes.
• If the rule antecedent holds for a given tuple, we say that the rule
antecedent is satisfied (or simply, that the rule is satisfied) and that
the rule covers the tuple.
Cont..
• Rules have the form IF condition AND condition THEN conclusion, for example:
• R1: IF outlook = sunny AND humidity = high THEN play = no
• R2: IF outlook = sunny THEN play = yes
Rule Based Classification
• A rule R can be assessed by its coverage and accuracy:
coverage(R) = n_covers / |D|, accuracy(R) = n_correct / n_covers
where n_covers is the number of tuples covered by R, n_correct is the number
of those that R classifies correctly, and |D| is the number of tuples in the data set.
• For a rule’s accuracy, we look at the tuples that it covers and see
what percentage of them the rule can correctly classify.
Rule Based Classification
accuracy(R1) = (2/2)*100
= 100%.
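The coverage and accuracy computation can be sketched directly from the definitions above; the four-tuple weather table here is made up for illustration.

data = [  # (outlook, humidity, play)
    ('sunny', 'high', 'no'),
    ('sunny', 'high', 'no'),
    ('sunny', 'normal', 'yes'),
    ('rain', 'high', 'yes'),
]

def evaluate(condition, predicted_class, dataset):
    # coverage(R) = n_covers / |D|, accuracy(R) = n_correct / n_covers
    covered = [t for t in dataset if condition(t)]
    correct = [t for t in covered if t[-1] == predicted_class]
    coverage = len(covered) / len(dataset)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# R1: IF outlook = sunny AND humidity = high THEN play = no
print(evaluate(lambda t: t[0] == 'sunny' and t[1] == 'high', 'no', data))
# -> (0.5, 1.0): R1 covers 2 of the 4 tuples and classifies both correctly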
Neural Network
• A “Neural Network” (NN) is a mathematical or computational
model inspired by biological neural networks.
• Advantages:
[Figure: biological neuron showing dendrites, an axon, and axon terminals.]
Neural Network
[Figure: artificial neuron model with weighted inputs (x1·w1, …) and activation function F(y).]
Neural Network
• Network Training:
• The training data consists of a set of input vectors, each with an
associated class label.
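A minimal sketch of such training, assuming scikit-learn's MLPClassifier; the two-feature vectors, labels, and network settings are made up for the example.

from sklearn.neural_network import MLPClassifier

X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.2], [0.8, 0.1]]   # training vectors
y = ['low', 'low', 'high', 'high']                      # class label per vector

# One small hidden layer; the lbfgs solver suits a tiny dataset like this.
net = MLPClassifier(hidden_layer_sizes=(4,), solver='lbfgs',
                    max_iter=2000, random_state=1)
net.fit(X, y)                          # weights are adjusted during training
print(net.predict([[0.85, 0.15]]))     # expected to come out as ['high']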
k-Nearest Neighbors (KNN)
• Strengths of KNN (a classification sketch follows below):
– Very simple.
– Can be applied to data from any distribution.
– Good classification if the number of samples is large enough.
• Weaknesses of KNN:
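A minimal pure-Python sketch of k-nearest-neighbor classification, the distance-based method referred to above; the training points, labels, and k value are made up for the example.

from collections import Counter
from math import dist                  # Euclidean distance (Python 3.8+)

train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'),
         ((5.0, 5.2), 'B'), ((4.8, 5.0), 'B')]

def knn_classify(x, train, k=3):
    # Sort the stored tuples by distance to x and vote among the k nearest.
    neighbors = sorted(train, key=lambda pair: dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9), train))   # -> 'A'
print(knn_classify((4.9, 5.1), train))   # -> 'B'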
Regression
1) Linear Regression
2) Multiple Regression
Linear regression
• It is the simplest form of regression. Linear regression attempts to
model the relationship between two variables by fitting a linear
equation to observed data.
Y = α + βX
Multiple regression
• 'Y' is modeled as a linear function of several predictor variables X1, ..., Xk:
Y = a0 + a1X1 + a2X2 + ... + akXk + e
where a0, a1, ..., ak are the regression coefficients and e is the error term.
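A minimal sketch, assuming numpy, of estimating the coefficients a0, ..., ak above by least squares; the small X and Y values are made up for illustration.

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # X1, X2
Y = np.array([6.1, 5.0, 11.2, 10.1])

# Prepend a column of ones so the intercept a0 is estimated as well.
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
a0, a1, a2 = coeffs
print(a0, a1, a2)                 # estimated regression coefficients
print(a0 + a1 * 2.5 + a2 * 2.5)   # predicted Y for a new tuple (X1, X2) = (2.5, 2.5)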