Classification Algorithm
UNIT-III
• There are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data
trends. These two forms are as follows −
– Classification
– Prediction
• Classification models predict categorical class labels;
– For example, we can build a classification model to categorize bank loan
applications as either safe or risky.
• Prediction models predict continuous-valued functions.
– Example: a prediction model to predict the expenditures in dollars of
potential customers on computer equipment, given their income and occupation.
What is classification?
The following are examples of cases where the data analysis task is
classification −
• A bank loan officer wants to analyze the data in order to know
which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze a customer
with a given profile and predict whether that customer will buy a
new computer.
• In both of the above examples,
a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and
yes or no for marketing data.
What is prediction?
• The following are examples of cases where the data analysis task
is prediction −
• Suppose the marketing manager needs to predict how much a
given customer will spend during a sale at his company. In this
example we are asked to predict a numeric value, so the data
analysis task is an example of numeric prediction. In this case,
a model or a predictor is constructed that predicts a
continuous-valued function, or ordered value.
• Note − Regression analysis is a statistical methodology that is
most often used for numeric prediction.
How Does Classification Work?
• With the help of the bank loan application that we have discussed
above, let us understand the working of classification. The Data
Classification process includes two steps −
– Building the Classifier or Model
– Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database
tuples and their associated class labels.
• Each tuple in the training set is assumed to belong to a
predefined category or class, as given by its class label. These
tuples can also be referred to as samples, objects, or data points.
Using Classifier for Classification
• In this step, the classifier is used for classification. Here the test
data is used to estimate the accuracy of classification rules. The
classification rules can be applied to the new data tuples if the
accuracy is considered acceptable.
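As an illustration, estimating accuracy amounts to comparing predicted labels against the known labels of the test tuples. A minimal Python sketch, with a hypothetical classify() rule standing in for any trained classifier and made-up test data:

def classify(t):
    # Hypothetical stand-in for a trained classifier (illustrative rule only).
    return "safe" if t["income"] > 40000 else "risky"

test_set = [  # labeled test tuples (made-up data)
    {"income": 50000, "label": "safe"},
    {"income": 20000, "label": "risky"},
    {"income": 30000, "label": "safe"},
]
correct = sum(1 for t in test_set if classify(t) == t["label"])
print(f"estimated accuracy: {correct / len(test_set):.2f}")  # 2 of 3 -> 0.67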
Classification and Prediction Issues
• The major issue is preparing the data for Classification and Prediction. Preparing
the data involves the following activities −
• Data Cleaning − Data cleaning involves removing the noise and treating
missing values. The noise is removed by applying smoothing techniques, and the
problem of missing values is solved by replacing a missing value with the most
commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes.
Correlation analysis is used to know whether any two given attributes are
related.
• Data Transformation and Reduction − The data can be transformed by any of the
following methods.
– Normalization − The data is transformed using normalization. Normalization involves
scaling all values of a given attribute so that they fall within a small specified
range. Normalization is used when the learning step uses neural networks or
methods involving distance measurements.
– Generalization − The data can also be transformed by generalizing it to a higher-level
concept. For this purpose we can use concept hierarchies.
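For instance, min-max normalization rescales the values of an attribute into a range such as [0, 1]. A minimal Python sketch (the income values are made up for illustration):

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Rescale each value v of an attribute into [new_min, new_max].
    old_min, old_max = min(values), max(values)
    return [new_min + (v - old_min) * (new_max - new_min) / (old_max - old_min)
            for v in values]

incomes = [60000, 95000, 125000, 220000]  # illustrative attribute values
print(min_max_normalize(incomes))          # 60000 -> 0.0, ..., 220000 -> 1.0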
Comparison of Classification and Prediction Methods
• Here are the criteria for comparing the methods of classification and
prediction −
• Accuracy − Accuracy of a classifier refers to its ability to
predict the class label correctly; accuracy of a predictor
refers to how well a given predictor can guess the value of the
predicted attribute for new data.
• Speed − This refers to the computational cost in generating and
using the classifier or predictor.
• Robustness − It refers to the ability of the classifier or predictor to
make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the
classifier or predictor efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or
predictor can be understood.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Decision Tree Induction
• Decision Tree is a supervised learning method used in data
mining for classification and regression tasks.
• The decision tree creates classification or regression models
as a tree structure. It separates a data set into smaller
subsets, and at the same time, the decision tree is steadily
developed.
• The final result is a tree with decision nodes and leaf
nodes.
• A decision node has at least two branches. The leaf nodes
show a classification or decision.
• The uppermost decision node in a tree, which corresponds to the
best predictor, is called the root node.
• Decision trees can deal with both categorical and numerical
data.
• During tree construction, attribute selection measures are
used to select the attribute that best partitions the tuples into
distinct classes.
Apply Model to Test Data
• The learned tree (from the loan/cheat example), with Refund, Marital Status
(MarSt), and Taxable Income (TaxInc) as splitting attributes:

  Refund?
    Yes → NO
    No  → MarSt?
            Single, Divorced → TaxInc?
                                 <= 80K → NO
                                 >  80K → YES
            Married → NO

• Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
• Start from the root of the tree: Refund = No, so follow the No branch to the
MarSt node. Marital Status = Married, so follow the Married branch, reaching a
leaf labeled NO. Assign Cheat to "No".
Decision Tree Classification Task
• Training set (induction): the class-labeled tuples below are fed to a tree
induction algorithm, which learns a model (the decision tree).

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

• Test set (deduction): the learned model is then applied to unlabeled tuples
to predict their class labels.

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?
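A minimal runnable sketch of this induction/deduction workflow, using scikit-learn's DecisionTreeClassifier on the two tables above (the numeric encoding of Attrib1 and Attrib2 is an illustrative assumption, not part of the slides):

from sklearn.tree import DecisionTreeClassifier

# Encode the categorical attributes numerically for illustration:
# Attrib1: Yes -> 1, No -> 0; Attrib2: Small/Medium/Large -> 0/1/2.
size = {"Small": 0, "Medium": 1, "Large": 2}

def encode(rows):
    return [[1 if a == "Yes" else 0, size[b], c] for a, b, c in rows]

train = [("Yes", "Large", 125), ("No", "Medium", 100), ("No", "Small", 70),
         ("Yes", "Medium", 120), ("No", "Large", 95), ("No", "Medium", 60),
         ("Yes", "Large", 220), ("No", "Small", 85), ("No", "Medium", 75),
         ("No", "Small", 90)]
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

model = DecisionTreeClassifier(criterion="gini")  # induction: learn the model
model.fit(encode(train), labels)

test = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]
print(model.predict(encode(test)))                # deduction: labels for Tids 11-15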
Example of a Decision Tree
• [Figure: the training table with attributes Refund (categorical), Marital
Status (categorical), Taxable Income (continuous), and class label Cheat;
Refund, Marital Status, and Taxable Income serve as the splitting attributes
of the tree shown in the walkthrough above.]
Decision Tree Induction Algorithm
• A machine learning researcher named J. Ross Quinlan developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser) in
1980.
• Later, he presented C4.5, the successor of ID3.
• ID3 and C4.5 adopt a greedy approach, with no backtracking;
• the trees are constructed in a top-down, recursive,
divide-and-conquer manner.
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.
  CarType → Family | Sports | Luxury
• Binary split: divides values into two subsets; need to find optimal partitioning.
  CarType → {Sports, Luxury} | {Family}   OR   CarType → {Family, Luxury} | {Sports}
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.
  Size → Small | Medium | Large
• Binary split: divides values into two subsets; need to find optimal partitioning.
  Size → {Small, Medium} | {Large}   OR   Size → {Small} | {Medium, Large}
Basic Algorithm for Decision Tree Induction
• [Figure: a decision tree for the concept buy_computer.]
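The step numbers cited in the walkthrough below refer to the basic top-down induction algorithm (Generate_decision_tree in Han and Kamber). The following is a simplified runnable Python sketch of that algorithm, not the textbook pseudocode itself; the Gini-based select_attribute, the dict-based tree, and the tiny dataset are illustrative assumptions:

from collections import Counter

def majority_class(D):
    # Most common class label among the tuples in D (used for majority voting).
    return Counter(t["class"] for t in D).most_common(1)[0][0]

def gini(D):
    n = len(D)
    return 1 - sum((c / n) ** 2 for c in Counter(t["class"] for t in D).values())

def select_attribute(D, attrs):
    # Stand-in Attribute_selection_method: pick the discrete attribute whose
    # multi-way split yields the lowest weighted Gini index.
    def weighted_gini(a):
        parts = {}
        for t in D:
            parts.setdefault(t[a], []).append(t)
        return sum(len(p) / len(D) * gini(p) for p in parts.values())
    return min(attrs, key=weighted_gini)

def generate_decision_tree(D, attrs, domains):
    node = {}                                          # step 1: create node N
    if len({t["class"] for t in D}) == 1:              # steps 2-3: all tuples share
        return {"leaf": D[0]["class"]}                 #   one class -> N is a leaf
    if not attrs:                                      # steps 4-5: no attributes left
        return {"leaf": majority_class(D)}             #   -> majority voting
    a = select_attribute(D, attrs)                     # step 6: splitting criterion
    node["test"] = a                                   # step 7: label N with it
    remaining = [x for x in attrs if x != a]           # steps 8-9: drop the discrete
                                                       #   splitting attribute
    for value in domains[a]:                           # step 10: a branch per outcome
        Dj = [t for t in D if t[a] == value]           # step 11: partition D
        if not Dj:                                     # steps 12-13: empty partition
            node[value] = {"leaf": majority_class(D)}  #   -> leaf, majority class in D
        else:                                          # step 14: recurse on Dj
            node[value] = generate_decision_tree(Dj, remaining, domains)
    return node

# Tiny illustrative run (hypothetical buy_computer-style tuples):
data = [{"student": "yes", "credit": "fair", "class": "yes"},
        {"student": "no", "credit": "fair", "class": "no"},
        {"student": "yes", "credit": "excellent", "class": "yes"},
        {"student": "no", "credit": "excellent", "class": "no"}]
attrs = ["student", "credit"]
domains = {a: {t[a] for t in data} for a in attrs}
print(generate_decision_tree(data, attrs, domains))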
• The tree starts as a single node, N, representing the training tuples
in D (step 1).
• If the tuples in D are all of the same class, then node N becomes a
leaf and is labeled with that class (steps 2 and 3).
• Steps 4 and 5 are terminating conditions.
• The algorithm calls the attribute selection method to determine the
splitting criterion. The splitting criterion tells us which attribute to
test at node N by determining the "best" way to separate or
partition the tuples in D into individual classes (step 6).
• The splitting criterion also tells us which branches to grow from
node N with respect to the outcomes of the chosen test.
• The splitting criterion indicates the splitting attribute and may also
indicate either a split-point or a splitting subset. The splitting
criterion is determined so that, ideally, the resulting partitions at
each branch are as "pure" as possible.
• A partition is pure if all of the tuples in it belong to the same class.
• The node N is labeled with the splitting criterion,
which serves as a test at the node (step 7).
• A branch is grown from node N for each of the
outcomes of the splitting criterion. The tuples
in D are partitioned accordingly (steps 10 to 11).
• There are three possible scenarios. Let A be the
splitting attribute:
1. A is discrete-valued:
• In this case, the outcomes of the test at node N correspond directly to the
known values of A in the training set.
• A branch is created for each value aj of the attribute A, and the branch is
labeled with that value.
• There are as many branches as there are values of A in the training data.
2. A is continuous-valued:
• In this case, the test at node N has two possible outcomes, corresponding to
the conditions A <= split_point and A > split_point.
• The value split_point is returned by Attribute_selection_method as part of
the splitting criterion.
• In practice, the split-point is often taken as the midpoint of two known adjacent
values of A
• Therefore the split-point may not actually be a preexisting value of A from the
training data.
• Two branches are grown from N and labeled A <= split_point and A > split_point.
• The tuples (the table at node N) are partitioned into two sub-tables, D1 and D2.
• D1 holds the subset of class-labeled tuples in D for which A <= split_point,
and D2 holds the rest.
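As a quick illustration of the midpoint rule just described, candidate split-points can be generated from the sorted distinct values of A (the values here are made up):

values = sorted({75, 85, 90, 95, 100})  # known values of a continuous attribute A
candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
print(candidates)  # [80.0, 87.5, 92.5, 97.5] -- need not be preexisting values of A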
3. A is discrete-valued and a binary tree must be produced:
• The test at node N is of the form "A ∈ SA?", where SA is the splitting
subset for A.
• SA is returned by Attribute_selection_method as part of the splitting
criterion, and is a subset of the known values of A.
• If a given tuple has value aj of A and aj belongs to SA, then the test at
node N is satisfied.
• Two branches are grown from N.
• The left branch out of N is labeled yes so that D1 corresponds to the
subset of class-labeled tuples in D that satisfy the test.
• The right branch out of N is labeled no so that D2 corresponds to the
subset of class-labeled tuples from D that do not satisfy the test.
• The algorithm uses the same process recursively to form a decision
tree for the tuples at each resulting partition, Dj of D (step 14).
TERMINATING CONDITIONS
• The recursive partitioning stops only when any one of the following
terminating conditions is true:
1. All of the tuples in partition D (represented at node N)
belong to the same class (steps 2 and 3), or
2. There are no remaining attributes on which the tuples may be
further partitioned (step 4). In this case, majority voting is
employed (step 5). This involves converting node N into a leaf and
labeling it with the most common class in D.
3. There are no tuples for a given branch, i.e., a partition Dj is
empty (step 12). In this case, a leaf is created with the majority class
in D (step 13).
How to Determine the Best Split
• Before splitting: 10 records of class 0 and 10 records of class 1.
• The greedy strategy prefers splits whose resulting nodes have a homogeneous
class distribution: a non-homogeneous node has a high degree of impurity,
while a homogeneous node has a low degree of impurity.
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
How to Find the Best Split
• Before splitting, compute the impurity M0 of the parent node from its class
counts (C0: N00, C1: N01).
• For each candidate attribute test (e.g., A? or B?, each with yes/no
outcomes), compute the impurity of every child node (M1 and M2 for A; M3 and
M4 for B) and combine the children into a single weighted impurity (M12 for
A; M34 for B).
• Choose the test with the higher gain: compare Gain = M0 − M12 vs. M0 − M34.
Measure of Impurity: GINI
• Gini index for a given node t:
  GINI(t) = 1 − Σj [p(j | t)]²
  where p(j | t) is the relative frequency of class j at node t.
• Examples (two classes, six tuples):
  C1 = 0, C2 = 6: Gini = 0.000
  C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444
  C1 = 3, C2 = 3: Gini = 0.500
Examples for Computing GINI
  GINI(t) = 1 − Σj [p(j | t)]²
• Parent node: C1 = 6, C2 = 6, so Gini = 0.500.
• A binary split B? produces Node N1 with (C1 = 5, C2 = 2) and Node N2 with
(C1 = 1, C2 = 4):
  Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
  Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
  Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
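A few lines of Python reproduce these numbers (a minimal sketch; each argument is the list of per-class tuple counts at a node):

def gini(counts):
    # GINI(t) = 1 - sum_j [p(j|t)]^2, with p(j|t) estimated from class counts.
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

g1, g2 = gini([5, 2]), gini([1, 4])      # nodes N1 and N2 from the example
print(round(g1, 3), round(g2, 3))         # 0.408 0.32
print(round((7 * g1 + 5 * g2) / 12, 3))   # weighted Gini of the children: 0.371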
Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the dataset.
• Use the count matrix to make decisions.
• Two options: a multi-way split (one partition per distinct value) or a
two-way split (find the best binary partition of the values).
Continuous Attributes: Computing Gini Index
• For efficient computation: for each attribute,
– Sort the attribute values.
– Linearly scan these values, each time updating the count matrix and
computing the Gini index.
– Choose the split position that has the least Gini index.
• [Worked example (figure): Taxable Income sorted in increasing order; the
class count matrix and Gini index are computed at every candidate split
position. The Gini values across the positions are 0.420, 0.400, 0.375,
0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420; the minimum, 0.300,
identifies the best split.]
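A runnable sketch of this sorted linear scan, using the Taxable Income values and Cheat labels from the training table earlier (the helper names are illustrative):

def gini_from_counts(c):
    n = sum(c)
    return 1 - sum((x / n) ** 2 for x in c) if n else 0.0

def best_split(values, labels, classes=("Yes", "No")):
    # Sort (value, label) pairs once, then scan left to right, moving one
    # tuple at a time from the right partition into the left partition.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    best = (float("inf"), None)
    for i in range(n - 1):
        left[pairs[i][1]] += 1
        right[pairs[i][1]] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # can only split between two distinct adjacent values
        split = (pairs[i][0] + pairs[i + 1][0]) / 2  # candidate midpoint
        w = (i + 1) / n
        g = (w * gini_from_counts(list(left.values()))
             + (1 - w) * gini_from_counts(list(right.values())))
        best = min(best, (g, split))
    return best  # (least weighted Gini, best split point)

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(incomes, cheat))  # -> (0.3, 97.5), matching the 0.300 minimum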
Alternative Splitting Criteria Based on INFO
• Entropy at a given node t:
  Entropy(t) = − Σj p(j | t) log p(j | t)
• Information gain of a split of the parent node p (with n records) into k
partitions, where ni is the number of records in partition i:
  GAINsplit = Entropy(p) − Σi=1..k (ni / n) Entropy(i)
• Gain ratio normalizes information gain by the split information, penalizing
splits into a large number of small partitions:
  GainRATIOsplit = GAINsplit / SplitINFO,
  where SplitINFO = − Σi=1..k (ni / n) log(ni / n)
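A sketch of these three quantities in Python (log base 2 is assumed, as is conventional for information gain; each node is given as its list of per-class counts):

import math

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) * log2 p(j|t)
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def info_gain(parent, children):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

def gain_ratio(parent, children):
    # GainRATIO_split = GAIN_split / SplitINFO, where
    # SplitINFO = -sum_i (n_i / n) * log2(n_i / n)
    n = sum(parent)
    split_info = -sum(sum(ch) / n * math.log2(sum(ch) / n) for ch in children)
    return info_gain(parent, children) / split_info

parent = [6, 6]              # e.g. C1 = 6, C2 = 6 at the parent
children = [[5, 2], [1, 4]]  # the two partitions of a binary split
print(round(info_gain(parent, children), 3))   # 0.196
print(round(gain_ratio(parent, children), 3))  # 0.2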
• For comparison, another worked Gini example (a binary split): Node N1 has
(C1 = 3, C2 = 0) and Node N2 has (C1 = 4, C2 = 3).
  Gini(N1) = 1 − (3/3)² − (0/3)² = 0
  Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
  Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that
optimizes a certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records
belong to the same class