Unit-IV Classification Part 1

The document discusses classification and prediction techniques in data mining. It covers classification using decision trees, Bayesian classification, rule-based classification, and prediction using linear regression. It then provides more details on decision tree classification, including decision tree induction, attribute selection measures, tree pruning, and enhancing decision trees for continuous attributes and missing values. It also discusses scaling decision tree classification to large databases.


III (b): Classification and Prediction

 Classification by Decision Tree Induction


 Bayesian Classification
 Rule-based Classification
 Prediction: Linear Regression



Classification Vs Prediction
 Two forms of data analysis that extract models describing important data
   classes or predict future data trends (intelligent decision making).
 Classification:
   Predicts categorical labels (discrete or nominal).
   Constructs a model based on the training set and the values (class labels)
     of a classifying attribute, and uses it to classify new data.
 Prediction:
   Models continuous-valued functions, i.e., predicts unknown or missing values.
 E.g., categorize bank loan applications as either safe or risky, or build a
   prediction model to estimate a customer’s expenditure on computer equipment
   given their income and occupation.
 Typical Applications
 Credit Approval, Target Marketing, Medical Diagnosis

 Fraud Detection, Performance Prediction, Manufacturing



Supervised Vs Unsupervised Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to
   establish the existence of classes or clusters in the data



Process (1): Model Construction

Training Data  -->  Classification Algorithm  -->  Classifier (Model)

NAME     RANK            YEARS   TENURED
Mike     Assistant Prof    3     no
Mary     Assistant Prof    7     yes
Bill     Professor         2     yes
Jim      Associate Prof    7     yes
Dave     Assistant Prof    6     no
Anne     Associate Prof    3     no

Learned model (as a classification rule):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

Testing Data / Unseen Data  -->  Classifier  -->  predicted class label

Testing data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof    2     no
Merlisa  Associate Prof    7     no
George   Professor         5     yes
Joseph   Assistant Prof    7     yes

Unseen data: (X, Professor, 4)  --  Tenured?
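As a minimal plain-Python sketch of the two steps above, the snippet below hard-codes the training table and expresses the learned model as the rule shown on the slide, checks it against the training data (Process 1), and then applies it to the unseen tuple (X, Professor, 4) (Process 2). The function and variable names are illustrative only.

training_data = [
    # (name, rank, years, tenured)
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def classify(rank, years):
    """The learned model expressed as the classification rule from the slide."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Process (1): check how well the model fits the training data.
correct = sum(classify(rank, years) == tenured
              for _, rank, years, tenured in training_data)
print(f"training accuracy: {correct}/{len(training_data)}")

# Process (2): use the model on unseen data.
print("Tenured?", classify("Professor", 4))   # -> yes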
III (b): Classification and Prediction
 Classification by Decision Tree Induction
 Bayesian Classification
 Rule-based Classification
 Prediction: Linear Regression



Classification by Decision Tree Induction

i. Decision Tree Induction

ii. Attribute Selection Measures

iii. Tree pruning

iv. Scalability

v. Visual Mining for Decision Tree Induction



i. Decision Tree Induction
 Decision Tree (J. Ross Quinlan, 1970s-1980s)
 A flowchart-like tree structure, where each internal node denotes a test
   on an attribute, each branch represents an outcome of the test, and each
   leaf node holds a class label. The topmost node in a tree is the root node.
 Decision Tree Induction
 The learning of decision trees from class-labeled training tuples.

 Decision Trees Usage


 Given a tuple X, for which the associated class label is unknown,

the attribute values of the tuple are tested against the decision
tree.
 A path is traced from the root to a leaf node, which holds the

class prediction for that tuple.


 Decision trees can easily be converted to classification rules.



Decision Tree Induction
 Popularity
 Requires no domain knowledge
 Handles high-dimensional data
 Easy to interpret
 Good accuracy

 Applications
 Medicine
 Manufacturing
 Production
 Financial analysis
 Astronomy
 Molecular Biology



Decision Tree Induction: Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no



Output: A Decision Tree for “buys_computer”
Age?
  <=30   -> Student?
              no  -> No
              yes -> Yes
  31..40 -> Yes
  >40    -> Credit Rating?
              excellent -> No
              fair      -> Yes
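Traced as code, the tree above becomes a nested set of tests; the path from the root to a leaf yields the class. This is a minimal sketch: the numeric cut-offs for the age branches are assumed from the labels <=30, 31..40, >40.

def classify_buys_computer(age, student, credit_rating):
    """Trace the decision tree: root test on age, then student or credit_rating."""
    if age <= 30:
        return "yes" if student == "yes" else "no"
    elif age <= 40:            # the 31..40 branch is a pure leaf
        return "yes"
    else:                      # age > 40
        return "yes" if credit_rating == "fair" else "no"

print(classify_buys_computer(25, "yes", "fair"))      # yes
print(classify_buys_computer(45, "no", "excellent"))  # no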


Comparing Attribute Selection Measures

 The three measures, in general, return good results, but:


 Information gain:
 biased towards multivalued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one
partition is much smaller than the others
 Gini index:
 biased to multivalued attributes
 has difficulty when # of classes is large
 tends to favor tests that result in equal-sized
partitions and purity in both partitions
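The sketch below computes the three measures for the age split of the buys_computer training set shown earlier, using the standard textbook formulas. It is a simplified illustration in plain Python, not a library implementation.

from collections import Counter
from math import log2

# The 14-tuple buys_computer data as (age, income, student, credit_rating, class).
data = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split(data, attr_index):
    """Partition the class labels by the values of one attribute."""
    parts = {}
    for row in data:
        parts.setdefault(row[attr_index], []).append(row[-1])
    return parts

def info_gain(data, attr_index):
    labels = [row[-1] for row in data]
    parts = split(data, attr_index)
    remainder = sum(len(p) / len(data) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def gain_ratio(data, attr_index):
    parts = split(data, attr_index)
    split_info = entropy([v for v, p in parts.items() for _ in p])
    return info_gain(data, attr_index) / split_info

print(round(info_gain(data, 0), 3))               # Gain(age)       ~ 0.246
print(round(gain_ratio(data, 0), 3))              # GainRatio(age)  ~ 0.156
print(round(gini([row[-1] for row in data]), 3))  # Gini(D)         ~ 0.459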
Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm, measure based on χ2 test
for independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistics: has a close approximation to χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
 The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results; none is significantly superior to the others



iii. Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to noise or
outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
 Use a set of data different from the training data to decide
which is the “best pruned tree”
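One possible postpruning sketch, using scikit-learn's cost-complexity pruning rather than the textbook's exact procedure: grow a full tree, enumerate progressively pruned candidates, and keep the one with the best accuracy on a held-out validation set, as the slide suggests. The synthetic data set is only for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# One candidate tree per pruning level; keep the one with the best validation accuracy.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.4f}, validation accuracy={best_score:.3f}")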
Enhancements to Basic Decision Tree Induction

 Allow for continuous-valued attributes


 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication
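A small pandas sketch of the first two enhancements, assuming a toy DataFrame: a continuous age attribute is partitioned into discrete intervals, and a missing categorical value is replaced by the attribute's most common value.

import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 41, 29, None, 52],                       # continuous attribute
    "income": ["high", None, "low", "medium", "medium", "high"],  # has a missing value
})

# Continuous -> discrete intervals (the dynamically defined discrete-valued attribute).
df["age_interval"] = pd.cut(df["age"], bins=[0, 30, 40, 120],
                            labels=["<=30", "31..40", ">40"])

# Missing categorical value -> most common value of the attribute.
df["income"] = df["income"].fillna(df["income"].mode()[0])
print(df)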
Classification in Large Databases
 Classification—a classical problem extensively studied by
statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
 Why decision tree induction in data mining?
 relatively faster learning speed (than other classification

methods)
 convertible to simple and easy to understand
classification rules
 can use SQL queries for accessing databases

 comparable classification accuracy with other methods



iv. Scalable Decision Tree Induction Methods
 SLIQ (EDBT’96 — Mehta et al.)
 Builds an index for each attribute and only class list and

the current attribute list reside in memory


 SPRINT (VLDB’96 — J. Shafer et al.)
 Constructs an attribute list data structure

 PUBLIC (VLDB’98 — Rastogi & Shim)


 Integrates tree splitting and tree pruning: stop growing

the tree earlier


 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
 Builds an AVC-list (attribute, value, class label)

 BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)


 Uses bootstrapping to create several small samples



Rainforest: Training Set and Its AVC Sets
Training examples: the 14-tuple buys_computer training set shown earlier.

AVC-set on age:
age       buys_computer = yes   buys_computer = no
<=30              2                     3
31..40            4                     0
>40               3                     2

AVC-set on income:
income    buys_computer = yes   buys_computer = no
high              2                     2
medium            4                     2
low               3                     1

AVC-set on student:
student   buys_computer = yes   buys_computer = no
yes               6                     1
no                3                     4

AVC-set on credit_rating:
credit_rating   buys_computer = yes   buys_computer = no
fair                    6                     2
excellent               3                     3
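Conceptually, an AVC-set is just a per-attribute table of (attribute value, class) counts. The pandas sketch below rebuilds two of the AVC-sets above from the training data; it illustrates the data structure only and is not the RainForest algorithm itself.

import pandas as pd

# Two columns of the 14-tuple buys_computer training set, plus the class label.
df = pd.DataFrame({
    "age": ["<=30","<=30","31..40",">40",">40",">40","31..40",
            "<=30","<=30",">40","<=30","31..40","31..40",">40"],
    "student": ["no","no","no","no","yes","yes","yes",
                "no","yes","yes","yes","no","yes","no"],
    "buys_computer": ["no","no","yes","yes","yes","no","yes",
                      "no","yes","yes","yes","yes","yes","no"],
})

# An AVC-set is one crosstab of attribute value vs. class label.
for attr in ["age", "student"]:
    print(f"AVC-set on {attr}:")
    print(pd.crosstab(df[attr], df["buys_computer"]), "\n")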
Data Cube-Based Decision-Tree Induction
 Integration of generalization with decision-tree induction
(Kamber et al.’97)
 Classification at primitive concept levels
 E.g., precise temperature, humidity, outlook, etc.
 Low-level concepts, scattered classes, bushy
classification-trees
 Semantic interpretation problems
 Cube-based multi-level classification
 Relevance analysis at multi-levels
 Information-gain analysis with dimension + level
BOAT (Bootstrapped Optimistic Algorithm
for Tree Construction)

 Use a statistical technique called bootstrapping to create


several smaller samples (subsets), each fits in memory
 Each subset is used to create a tree, resulting in several
trees
 These trees are examined and used to construct a new
tree T’
 It turns out that T’ is very close to the tree that would
be generated using the whole data set together
 Adv: requires only two scans of DB, an incremental alg.

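A rough illustration of the bootstrapping idea only, not the full BOAT algorithm: draw several with-replacement samples that each fit in memory, grow one tree per sample, and compare the root split chosen by each tree. The data set and sample sizes are arbitrary.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=1)
rng = np.random.default_rng(1)

root_features = []
for _ in range(5):                                    # several small samples
    idx = rng.choice(len(X), size=400, replace=True)  # bootstrap sample
    tree = DecisionTreeClassifier(random_state=1).fit(X[idx], y[idx])
    root_features.append(tree.tree_.feature[0])       # attribute tested at the root

print("root split attribute per bootstrap tree:", root_features)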


2. Bayesian Classification

a. Bayes’ Theorem

b. Naïve Bayesian Classification



Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities.
 Based on Bayes’ Theorem.
 Performance: the Naïve Bayesian classifier has accuracy comparable to
decision tree and neural network classifiers, and is also fast and scalable.
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with
observed data
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
(a) Bayes’ Theorem: Basics
 Let X be a data sample : class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that
the hypothesis holds given the observed data sample X
 P(H) (prior probability), the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (posteriori probability), the probability of observing
the sample X, given that the hypothesis holds
 E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
Bayesian Theorem
 Given training data X, posteriori probability of a
hypothesis H, P(H|X), follows the Bayes theorem
       P(H | X) = P(X | H) P(H) / P(X)
 Informally, this can be written as
posteriori = likelihood x prior/evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
 Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
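A small numeric check of the theorem, plugging in the prior and likelihood values that appear later in this deck's naïve Bayes example; the numbers are reused here purely for illustration.

# Prior P(H), likelihood P(X|H), and evidence P(X) give the posterior P(H|X).
p_h = 0.643            # e.g. P(buys_computer = yes) from the training data
p_x_given_h = 0.044    # P(X | buys_computer = yes), computed in a later slide
p_x_given_not_h = 0.019
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # total probability

posterior = p_x_given_h * p_h / p_x
print(round(posterior, 3))   # P(buys_computer = yes | X) ~ 0.81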
(b) Naïve Bayesian Classifier
 Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
 This can be derived from Bayes’ theorem:

       P(Ci | X) = P(X | Ci) P(Ci) / P(X)

 Since P(X) is constant for all classes, only

       P(X | Ci) P(Ci)

   needs to be maximized
Derivation of Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):

       P(X | Ci) = Π(k=1..n) P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
 This greatly reduces the computation cost: Only counts
the class distribution
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
   Gaussian distribution with mean μ and standard deviation σ:

       g(x, μ, σ) = (1 / (√(2π) · σ)) · exp( -(x - μ)² / (2σ²) )

   and P(xk|Ci) = g(xk, μCi, σCi)
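A minimal sketch of the two per-attribute estimates, using hypothetical class-Ci value lists: a relative-frequency count for a categorical attribute and the Gaussian density g(x, μ, σ) for a continuous one, with μ and σ estimated from the class-Ci values.

from math import sqrt, pi, exp
from statistics import mean, stdev

def categorical_likelihood(values_in_class, xk):
    """P(xk | Ci) for a categorical attribute: relative frequency within Ci."""
    return values_in_class.count(xk) / len(values_in_class)

def gaussian_likelihood(values_in_class, xk):
    """P(xk | Ci) for a continuous attribute via the Gaussian density g(x, mu, sigma)."""
    mu, sigma = mean(values_in_class), stdev(values_in_class)
    return (1 / (sqrt(2 * pi) * sigma)) * exp(-((xk - mu) ** 2) / (2 * sigma ** 2))

# Hypothetical class-Ci values for one categorical and one continuous attribute.
print(categorical_likelihood(["fair", "fair", "excellent", "fair"], "fair"))  # 0.75
print(round(gaussian_likelihood([38, 41, 35, 44, 40], 42), 3))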
Naïve Bayesian Classifier: Training Dataset
Class:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31..40   high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31..40   low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31..40   medium   no       excellent      yes
31..40   high     yes      fair           yes
>40      medium   no       excellent      no
Naïve Bayesian Classifier: An Example
 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357

 Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)


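The computation above can be reproduced directly from the training table with a few lines of counting; the sketch below estimates P(Ci) and each P(xk|Ci) by frequency and compares P(X|Ci)·P(Ci) for the two classes.

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
X = ("<=30", "medium", "yes", "fair")   # the unseen tuple from the slide

for c in ("yes", "no"):
    rows = [r for r in data if r[-1] == c]
    prior = len(rows) / len(data)                # P(Ci)
    likelihood = 1.0
    for k, xk in enumerate(X):                   # P(X|Ci) = product over k of P(xk|Ci)
        likelihood *= sum(r[k] == xk for r in rows) / len(rows)
    print(c, round(likelihood, 3), round(likelihood * prior, 3))
# -> yes 0.044 0.028  /  no 0.019 0.007, so X is classified as buys_computer = yes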
Avoiding the 0-Probability Problem
 Naïve Bayesian prediction requires each conditional prob. be non-
zero. Otherwise, the predicted prob. will be zero
       P(X | Ci) = Π(k=1..n) P(xk | Ci)
 Ex. Suppose a dataset with 1000 tuples, income=low (0), income=
medium (990), and income = high (10),
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case

Prob(income = low) = 1/1003


Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
 The “corrected” prob. estimates are close to their “uncorrected”

counterparts
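The correction can be sketched in a few lines; the counts below are the ones from the example (1000 tuples in the class, with a zero count for income = low).

# Laplacian correction: add 1 to each value count so that a zero count
# (income = low here) no longer forces the whole product P(X|Ci) to zero.
counts = {"low": 0, "medium": 990, "high": 10}   # 1000 tuples in class Ci

n = sum(counts.values())
num_values = len(counts)
for value, count in counts.items():
    uncorrected = count / n
    corrected = (count + 1) / (n + num_values)   # 1/1003, 991/1003, 11/1003
    print(f"P(income={value}|Ci): {uncorrected:.4f} -> {corrected:.4f}")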
Naïve Bayesian Classifier: Comments
 Advantages
 Easy to implement; in theory has the minimum error rate compared to other classifiers

 Good results obtained in most of the cases

 Disadvantages
 Assumption: class conditional independence, therefore

loss of accuracy
 Practically, dependencies exist among variables

 E.g., hospitals: patients: Profile: age, family history, etc.


Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
 Dependencies among these cannot be modeled by Naïve Bayesian

Classifier
 How to deal with these dependencies?
 Bayesian Belief Networks



III (b): Classification and Prediction
 Classification by Decision Tree Induction
 Bayesian Classification
 Rule-based Classification
 Prediction: Linear Regression



3. Rule-based Classification
i. Using IF-THEN Rules for Classification

ii. Rule Extraction from a Decision Tree

iii. Rule Induction Using a Sequential Covering Algorithm


 Rule Quality Measures
 Rule Pruning



i. Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
 If more than one rule is triggered, need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirement (i.e., with the most attribute tests)
 Class/Rule ordering: decreasing order of misclassification cost per class or
rules are organized into one long priority list, according to some measure
of rule quality or by experts
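A small sketch of coverage(R) and accuracy(R) for the example rule R, evaluated against the 14-tuple buys_computer training set used earlier; only the attributes the rule tests are kept, and "youth" is read as the "<=30" age group.

data = [  # (age, student, buys_computer)
    ("<=30","no","no"), ("<=30","no","no"), ("31..40","no","yes"), (">40","no","yes"),
    (">40","yes","yes"), (">40","yes","no"), ("31..40","yes","yes"), ("<=30","no","no"),
    ("<=30","yes","yes"), (">40","yes","yes"), ("<=30","yes","yes"), ("31..40","no","yes"),
    ("31..40","yes","yes"), (">40","no","no"),
]

# R: IF age = youth AND student = yes THEN buys_computer = yes
covered = [row for row in data if row[0] == "<=30" and row[1] == "yes"]
correct = [row for row in covered if row[2] == "yes"]

print("coverage(R) =", len(covered), "/", len(data))     # 2/14
print("accuracy(R) =", len(correct), "/", len(covered))  # 2/2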
ii. Rule Extraction from a Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction; the leaf holds
   the class prediction
 Rules are mutually exclusive and exhaustive
 Example: rule extraction from our buys_computer decision tree (age at the
   root, then student and credit_rating)

1. IF age = young AND student = no THEN buys_computer = no
2. IF age = young AND student = yes THEN buys_computer = yes
3. IF age = mid-age THEN buys_computer = yes
4. IF age = old AND credit_rating = excellent THEN buys_computer = no
5. IF age = old AND credit_rating = fair THEN buys_computer = yes


iii. Rule Extraction from the Training Data

 Sequential covering algorithm: Extracts rules directly from training data


 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover many
tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 The process repeats on the remaining tuples unless termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold
 Compared with decision-tree induction, which learns a set of rules simultaneously, sequential covering learns rules one at a time
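A bare-bones sketch of the loop described above, not FOIL, AQ, CN2, or RIPPER themselves: learn_one_rule and the rule object's covers/accuracy methods are hypothetical placeholders for whatever single-rule learner (e.g., greedy attribute-test growth) is plugged in.

def sequential_covering(tuples, target_class, learn_one_rule, min_accuracy=0.8):
    """Learn rules for target_class one at a time, removing covered tuples each round."""
    rules, remaining = [], list(tuples)
    while any(t["class"] == target_class for t in remaining):
        rule = learn_one_rule(remaining, target_class)   # hypothetical rule learner
        if rule is None or rule.accuracy(remaining) < min_accuracy:
            break                                        # rule quality below threshold
        rules.append(rule)
        # Remove the tuples covered by the newly learned rule and repeat.
        remaining = [t for t in remaining if not rule.covers(t)]
    return rules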