Unit-IV Classification Part 1

The document discusses classification and prediction techniques in data mining. It covers classification using decision trees, Bayesian classification, rule-based classification, and prediction using linear regression. It then provides more details on decision tree classification, including decision tree induction, attribute selection measures, tree pruning, and enhancing decision trees for continuous attributes and missing values. It also discusses scaling decision tree classification to large databases.


III (b): Classification and Prediction

 Classification by Decision Tree Induction


 Bayesian Classification
 Rule-based Classification
 Prediction: Linear Regression



Classification Vs Prediction
 Two forms of data analysis that extract models describing important data
   classes or predict future data trends (intelligent decision making).
 Classification:
   Predicts categorical labels (discrete or nominal).
   Constructs a model based on the training set and the values (class labels)
     of a classifying attribute, and uses it to classify new data.
 Prediction:
   Models continuous-valued functions, i.e., predicts unknown or missing values.
 E.g., categorize bank loan applications as either safe or risky, or build a
   prediction model to estimate a customer’s expenditure on computer equipment
   given their income and occupation.
 Typical Applications
 Credit Approval, Target Marketing, Medical Diagnosis

 Fraud Detection, Performance Prediction, Manufacturing



Supervised Vs Unsupervised Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to
   establish the existence of classes or clusters in the data



Process (1): Model Construction

Training Data  -->  Classification Algorithm  -->  Classifier (Model)

NAME     RANK            YEARS   TENURED
Mike     Assistant Prof    3     no
Mary     Assistant Prof    7     yes
Bill     Professor         2     yes
Jim      Associate Prof    7     yes
Dave     Assistant Prof    6     no
Anne     Associate Prof    3     no

Learned model (as a classification rule):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

Testing Data / Unseen Data  -->  Classifier  -->  predicted class label

Testing data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof    2     no
Merlisa  Associate Prof    7     no
George   Professor         5     yes
Joseph   Assistant Prof    7     yes

Unseen data: (X, Professor, 4)  --  Tenured?
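As a minimal plain-Python sketch of the two steps above, the snippet below hard-codes the training table and expresses the learned model as the rule shown on the slide, checks it against the training data (Process 1), and then applies it to the unseen tuple (X, Professor, 4) (Process 2). The function and variable names are illustrative only.

training_data = [
    # (name, rank, years, tenured)
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def classify(rank, years):
    """The learned model expressed as the classification rule from the slide."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Process (1): check how well the model fits the training data.
correct = sum(classify(rank, years) == tenured
              for _, rank, years, tenured in training_data)
print(f"training accuracy: {correct}/{len(training_data)}")

# Process (2): use the model on unseen data.
print("Tenured?", classify("Professor", 4))   # -> yes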
III (b): Classification and Prediction
 Classification by Decision Tree Induction
 Bayesian Classification
 Rule-based Classification
 Prediction: Linear Regression



Classification by Decision Tree Induction

i. Decision Tree Induction

ii. Attribute Selection Measures

iii. Tree pruning

iv. Scalability

v. Visual Mining for Decision Tree Induction



i. Decision Tree Induction
 Decision Tree (J. Ross Quinlan, 1970s-1980s)
 A flowchart-like tree structure, where each internal node denotes a test
   on an attribute, each branch represents an outcome of the test, and each
   leaf node holds a class label. The topmost node in a tree is the root node.
 Decision Tree Induction
 The learning of decision trees from class-labeled training tuples.

 Decision Trees Usage


 Given a tuple X, for which the associated class label is unknown,

the attribute values of the tuple are tested against the decision
tree.
 A path is traced from the root to a leaf node, which holds the

class prediction for that tuple.


 Decision trees can easily be converted to classification rules.



Decision Tree Induction
 Popularity
 Requires no domain knowledge
 Handles high-dimensional data
 Easy to interpret
 Good accuracy

 Applications
 Medicine
 Manufacturing
 Production
 Financial analysis
 Astronomy
 Molecular Biology



Decision Tree Induction: Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no



Output: A Decision Tree for “buys_computer”
Age?
  <=30   -> Student?
              no  -> No
              yes -> Yes
  31..40 -> Yes
  >40    -> Credit Rating?
              excellent -> No
              fair      -> Yes
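Traced as code, the tree above becomes a nested set of tests; the path from the root to a leaf yields the class. This is a minimal sketch: the numeric cut-offs for the age branches are assumed from the labels <=30, 31..40, >40.

def classify_buys_computer(age, student, credit_rating):
    """Trace the decision tree: root test on age, then student or credit_rating."""
    if age <= 30:
        return "yes" if student == "yes" else "no"
    elif age <= 40:            # the 31..40 branch is a pure leaf
        return "yes"
    else:                      # age > 40
        return "yes" if credit_rating == "fair" else "no"

print(classify_buys_computer(25, "yes", "fair"))      # yes
print(classify_buys_computer(45, "no", "excellent"))  # no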


Comparing Attribute Selection Measures

 The three measures, in general, return good results, but:


 Information gain:
 biased towards multivalued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one
partition is much smaller than the others
 Gini index:
 biased to multivalued attributes
 has difficulty when # of classes is large
 tends to favor tests that result in equal-sized
partitions and purity in both partitions
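The sketch below computes the three measures for the age split of the buys_computer training set shown earlier, using the standard textbook formulas. It is a simplified illustration in plain Python, not a library implementation.

from collections import Counter
from math import log2

# The 14-tuple buys_computer data as (age, income, student, credit_rating, class).
data = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split(data, attr_index):
    """Partition the class labels by the values of one attribute."""
    parts = {}
    for row in data:
        parts.setdefault(row[attr_index], []).append(row[-1])
    return parts

def info_gain(data, attr_index):
    labels = [row[-1] for row in data]
    parts = split(data, attr_index)
    remainder = sum(len(p) / len(data) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def gain_ratio(data, attr_index):
    parts = split(data, attr_index)
    split_info = entropy([v for v, p in parts.items() for _ in p])
    return info_gain(data, attr_index) / split_info

print(round(info_gain(data, 0), 3))               # Gain(age)       ~ 0.246
print(round(gain_ratio(data, 0), 3))              # GainRatio(age)  ~ 0.156
print(round(gini([row[-1] for row in data]), 3))  # Gini(D)         ~ 0.459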
Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm, measure based on χ2 test
for independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistics: has a close approximation to χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
 The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results; none is significantly superior to the others



iii. Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to noise or
outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
 Use a set of data different from the training data to decide
which is the “best pruned tree”
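One possible postpruning sketch, using scikit-learn's cost-complexity pruning rather than the textbook's exact procedure: grow a full tree, enumerate progressively pruned candidates, and keep the one with the best accuracy on a held-out validation set, as the slide suggests. The synthetic data set is only for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# One candidate tree per pruning level; keep the one with the best validation accuracy.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.4f}, validation accuracy={best_score:.3f}")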
Enhancements to Basic Decision Tree Induction

 Allow for continuous-valued attributes


 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication
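A small pandas sketch of the first two enhancements, assuming a toy DataFrame: a continuous age attribute is partitioned into discrete intervals, and a missing categorical value is replaced by the attribute's most common value.

import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 41, 29, None, 52],                       # continuous attribute
    "income": ["high", None, "low", "medium", "medium", "high"],  # has a missing value
})

# Continuous -> discrete intervals (the dynamically defined discrete-valued attribute).
df["age_interval"] = pd.cut(df["age"], bins=[0, 30, 40, 120],
                            labels=["<=30", "31..40", ">40"])

# Missing categorical value -> most common value of the attribute.
df["income"] = df["income"].fillna(df["income"].mode()[0])
print(df)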
Classification in Large Databases
 Classification—a classical problem extensively studied by
statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
 Why decision tree induction in data mining?
 relatively faster learning speed (than other classification

methods)
 convertible to simple and easy to understand
classification rules
 can use SQL queries for accessing databases

 comparable classification accuracy with other methods



iv. Scalable Decision Tree Induction Methods
 SLIQ (EDBT’96 — Mehta et al.)
 Builds an index for each attribute and only class list and

the current attribute list reside in memory


 SPRINT (VLDB’96 — J. Shafer et al.)
 Constructs an attribute list data structure

 PUBLIC (VLDB’98 — Rastogi & Shim)


 Integrates tree splitting and tree pruning: stop growing

the tree earlier


 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
 Builds an AVC-list (attribute, value, class label)

 BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)


 Uses bootstrapping to create several small samples



Rainforest: Training Set and Its AVC Sets
Training examples: the 14-tuple buys_computer training set shown earlier.

AVC-set on age:
age       buys_computer = yes   buys_computer = no
<=30              2                     3
31..40            4                     0
>40               3                     2

AVC-set on income:
income    buys_computer = yes   buys_computer = no
high              2                     2
medium            4                     2
low               3                     1

AVC-set on student:
student   buys_computer = yes   buys_computer = no
yes               6                     1
no                3                     4

AVC-set on credit_rating:
credit_rating   buys_computer = yes   buys_computer = no
fair                    6                     2
excellent               3                     3
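Conceptually, an AVC-set is just a per-attribute table of (attribute value, class) counts. The pandas sketch below rebuilds two of the AVC-sets above from the training data; it illustrates the data structure only and is not the RainForest algorithm itself.

import pandas as pd

# Two columns of the 14-tuple buys_computer training set, plus the class label.
df = pd.DataFrame({
    "age": ["<=30","<=30","31..40",">40",">40",">40","31..40",
            "<=30","<=30",">40","<=30","31..40","31..40",">40"],
    "student": ["no","no","no","no","yes","yes","yes",
                "no","yes","yes","yes","no","yes","no"],
    "buys_computer": ["no","no","yes","yes","yes","no","yes",
                      "no","yes","yes","yes","yes","yes","no"],
})

# An AVC-set is one crosstab of attribute value vs. class label.
for attr in ["age", "student"]:
    print(f"AVC-set on {attr}:")
    print(pd.crosstab(df[attr], df["buys_computer"]), "\n")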
Data Cube-Based Decision-Tree Induction
 Integration of generalization with decision-tree induction
(Kamber et al.’97)
 Classification at primitive concept levels
 E.g., precise temperature, humidity, outlook, etc.
 Low-level concepts, scattered classes, bushy
classification-trees
 Semantic interpretation problems
 Cube-based multi-level classification
 Relevance analysis at multi-levels
 Information-gain analysis with dimension + level
BOAT (Bootstrapped Optimistic Algorithm
for Tree Construction)

 Use a statistical technique called bootstrapping to create


several smaller samples (subsets), each fits in memory
 Each subset is used to create a tree, resulting in several
trees
 These trees are examined and used to construct a new
tree T’
 It turns out that T’ is very close to the tree that would
be generated using the whole data set together
 Adv: requires only two scans of DB, an incremental alg.

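A rough illustration of the bootstrapping idea only, not the full BOAT algorithm: draw several with-replacement samples that each fit in memory, grow one tree per sample, and compare the root split chosen by each tree. The data set and sample sizes are arbitrary.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=1)
rng = np.random.default_rng(1)

root_features = []
for _ in range(5):                                    # several small samples
    idx = rng.choice(len(X), size=400, replace=True)  # bootstrap sample
    tree = DecisionTreeClassifier(random_state=1).fit(X[idx], y[idx])
    root_features.append(tree.tree_.feature[0])       # attribute tested at the root

print("root split attribute per bootstrap tree:", root_features)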


2. Bayesian Classification

a. Bayes’ Theorem

b. Naïve Bayesian Classification



Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities.
 Based on Bayes’ Theorem.
 Performance: the Naïve Bayesian classifier has accuracy comparable to
decision tree and neural network classifiers, and is also fast and scalable.
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with
observed data
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
(a) Bayes’ Theorem: Basics
 Let X be a data sample : class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that
the hypothesis holds given the observed data sample X
 P(H) (prior probability), the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (posteriori probability), the probability of observing
the sample X, given that the hypothesis holds
 E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
Bayesian Theorem
 Given training data X, posteriori probability of a
hypothesis H, P(H|X), follows the Bayes theorem
       P(H | X) = P(X | H) P(H) / P(X)
 Informally, this can be written as
posteriori = likelihood x prior/evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
 Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
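A small numeric check of the theorem, plugging in the prior and likelihood values that appear later in this deck's naïve Bayes example; the numbers are reused here purely for illustration.

# Prior P(H), likelihood P(X|H), and evidence P(X) give the posterior P(H|X).
p_h = 0.643            # e.g. P(buys_computer = yes) from the training data
p_x_given_h = 0.044    # P(X | buys_computer = yes), computed in a later slide
p_x_given_not_h = 0.019
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # total probability

posterior = p_x_given_h * p_h / p_x
print(round(posterior, 3))   # P(buys_computer = yes | X) ~ 0.81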
(b) Naïve Bayesian Classifier
 Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
 This can be derived from Bayes’ theorem:

       P(Ci | X) = P(X | Ci) P(Ci) / P(X)

 Since P(X) is constant for all classes, only

       P(X | Ci) P(Ci)

   needs to be maximized
Derivation of Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):

       P(X | Ci) = Π(k=1..n) P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
 This greatly reduces the computation cost: Only counts
the class distribution
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
   Gaussian distribution with mean μ and standard deviation σ:

       g(x, μ, σ) = (1 / (√(2π) · σ)) · exp( -(x - μ)² / (2σ²) )

   and P(xk|Ci) = g(xk, μCi, σCi)
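A minimal sketch of the two per-attribute estimates, using hypothetical class-Ci value lists: a relative-frequency count for a categorical attribute and the Gaussian density g(x, μ, σ) for a continuous one, with μ and σ estimated from the class-Ci values.

from math import sqrt, pi, exp
from statistics import mean, stdev

def categorical_likelihood(values_in_class, xk):
    """P(xk | Ci) for a categorical attribute: relative frequency within Ci."""
    return values_in_class.count(xk) / len(values_in_class)

def gaussian_likelihood(values_in_class, xk):
    """P(xk | Ci) for a continuous attribute via the Gaussian density g(x, mu, sigma)."""
    mu, sigma = mean(values_in_class), stdev(values_in_class)
    return (1 / (sqrt(2 * pi) * sigma)) * exp(-((xk - mu) ** 2) / (2 * sigma ** 2))

# Hypothetical class-Ci values for one categorical and one continuous attribute.
print(categorical_likelihood(["fair", "fair", "excellent", "fair"], "fair"))  # 0.75
print(round(gaussian_likelihood([38, 41, 35, 44, 40], 42), 3))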
Naïve Bayesian Classifier: Training Dataset
Class:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31..40   high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31..40   low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31..40   medium   no       excellent      yes
31..40   high     yes      fair           yes
>40      medium   no       excellent      no
Naïve Bayesian Classifier: An Example
 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357

 Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)


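The computation above can be reproduced directly from the training table with a few lines of counting; the sketch below estimates P(Ci) and each P(xk|Ci) by frequency and compares P(X|Ci)·P(Ci) for the two classes.

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
X = ("<=30", "medium", "yes", "fair")   # the unseen tuple from the slide

for c in ("yes", "no"):
    rows = [r for r in data if r[-1] == c]
    prior = len(rows) / len(data)                # P(Ci)
    likelihood = 1.0
    for k, xk in enumerate(X):                   # P(X|Ci) = product over k of P(xk|Ci)
        likelihood *= sum(r[k] == xk for r in rows) / len(rows)
    print(c, round(likelihood, 3), round(likelihood * prior, 3))
# -> yes 0.044 0.028  /  no 0.019 0.007, so X is classified as buys_computer = yes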
Avoiding the 0-Probability Problem
 Naïve Bayesian prediction requires each conditional prob. be non-
zero. Otherwise, the predicted prob. will be zero
       P(X | Ci) = Π(k=1..n) P(xk | Ci)
 Ex. Suppose a dataset with 1000 tuples, income=low (0), income=
medium (990), and income = high (10),
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case

Prob(income = low) = 1/1003


Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
 The “corrected” prob. estimates are close to their “uncorrected”

counterparts
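The correction can be sketched in a few lines; the counts below are the ones from the example (1000 tuples in the class, with a zero count for income = low).

# Laplacian correction: add 1 to each value count so that a zero count
# (income = low here) no longer forces the whole product P(X|Ci) to zero.
counts = {"low": 0, "medium": 990, "high": 10}   # 1000 tuples in class Ci

n = sum(counts.values())
num_values = len(counts)
for value, count in counts.items():
    uncorrected = count / n
    corrected = (count + 1) / (n + num_values)   # 1/1003, 991/1003, 11/1003
    print(f"P(income={value}|Ci): {uncorrected:.4f} -> {corrected:.4f}")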
Naïve Bayesian Classifier: Comments
 Advantages
 Easy to implement; in theory has the minimum error rate compared to other classifiers

 Good results obtained in most of the cases

 Disadvantages
 Assumption: class conditional independence, therefore

loss of accuracy
 Practically, dependencies exist among variables

 E.g., hospitals: patients: Profile: age, family history, etc.


Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
 Dependencies among these cannot be modeled by Naïve Bayesian

Classifier
 How to deal with these dependencies?
 Bayesian Belief Networks



III (b): Classification and Prediction
 Classification by Decision Tree Induction
 Bayesian Classification
 Rule-based Classification
 Prediction: Linear Regression



3. Rule-based Classification
i. Using IF-THEN Rules for Classification

ii. Rule Extraction from a Decision Tree

iii. Rule Induction Using a Sequential Covering Algorithm


 Rule Quality Measures
 Rule Pruning



i. Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
 If more than one rule is triggered, need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirement (i.e., with the most attribute tests)
 Class/Rule ordering: decreasing order of misclassification cost per class or
rules are organized into one long priority list, according to some measure
of rule quality or by experts
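A small sketch of coverage(R) and accuracy(R) for the example rule R, evaluated against the 14-tuple buys_computer training set used earlier; only the attributes the rule tests are kept, and "youth" is read as the "<=30" age group.

data = [  # (age, student, buys_computer)
    ("<=30","no","no"), ("<=30","no","no"), ("31..40","no","yes"), (">40","no","yes"),
    (">40","yes","yes"), (">40","yes","no"), ("31..40","yes","yes"), ("<=30","no","no"),
    ("<=30","yes","yes"), (">40","yes","yes"), ("<=30","yes","yes"), ("31..40","no","yes"),
    ("31..40","yes","yes"), (">40","no","no"),
]

# R: IF age = youth AND student = yes THEN buys_computer = yes
covered = [row for row in data if row[0] == "<=30" and row[1] == "yes"]
correct = [row for row in covered if row[2] == "yes"]

print("coverage(R) =", len(covered), "/", len(data))     # 2/14
print("accuracy(R) =", len(correct), "/", len(covered))  # 2/2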
ii. Rule Extraction from a Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction; the leaf holds
   the class prediction
 Rules are mutually exclusive and exhaustive
 Example: rule extraction from our buys_computer decision tree (age at the
   root, then student and credit_rating)

1. IF age = young AND student = no THEN buys_computer = no
2. IF age = young AND student = yes THEN buys_computer = yes
3. IF age = mid-age THEN buys_computer = yes
4. IF age = old AND credit_rating = excellent THEN buys_computer = no
5. IF age = old AND credit_rating = fair THEN buys_computer = yes


iii. Rule Extraction from the Training Data

 Sequential covering algorithm: Extracts rules directly from training data


 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover many
tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 The process repeats on the remaining tuples unless termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold
 Compared with decision-tree induction, which learns a set of rules simultaneously, sequential covering learns rules one at a time
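A bare-bones sketch of the loop described above, not FOIL, AQ, CN2, or RIPPER themselves: learn_one_rule and the rule object's covers/accuracy methods are hypothetical placeholders for whatever single-rule learner (e.g., greedy attribute-test growth) is plugged in.

def sequential_covering(tuples, target_class, learn_one_rule, min_accuracy=0.8):
    """Learn rules for target_class one at a time, removing covered tuples each round."""
    rules, remaining = [], list(tuples)
    while any(t["class"] == target_class for t in remaining):
        rule = learn_one_rule(remaining, target_class)   # hypothetical rule learner
        if rule is None or rule.accuracy(remaining) < min_accuracy:
            break                                        # rule quality below threshold
        rules.append(rule)
        # Remove the tuples covered by the newly learned rule and repeat.
        remaining = [t for t in remaining if not rule.covers(t)]
    return rules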