0% found this document useful (0 votes)
3 views67 pages

Unit5 Ch8 - DataMiningClassification

Machine Learning is a branch of Artificial Intelligence that enables computers to learn from data without explicit programming. It includes techniques such as supervised, unsupervised, semi-supervised, and reinforcement learning, with applications in various fields like finance, marketing, and healthcare. Key concepts include classification, numeric prediction, and decision tree algorithms, which help in making predictions based on training data.

Uploaded by

akashilay0701
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views67 pages

Unit5 Ch8 - DataMiningClassification

Machine Learning is a branch of Artificial Intelligence that enables computers to learn from data without explicit programming. It includes techniques such as supervised, unsupervised, semi-supervised, and reinforcement learning, with applications in various fields like finance, marketing, and healthcare. Key concepts include classification, numeric prediction, and decision tree algorithms, which help in making predictions based on training data.

Uploaded by

akashilay0701
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 67

Machine Learning

❖Machine Learning is the field of study that gives computers the capability to learn without
being explicitly programmed
❖The primary aim is to allow the computers learn automatically without human intervention or
assistance and adjust actions accordingly
❖It is a branch of Artificial Intelligence
❖Machine Learning Techniques
❖Supervised learning
❖Unsupervised learning
❖Semi-supervised learning
❖Re-inforcement machine learning

12/17/2022 MIT-WPU, DMDW 2


Terms
❑Class label attribute
❑In a dataset one column is a class label which classifies the tuple into categories.
❑Example of class labels – red/blue/green , bird/animal, safe/risky, yes/no
❑Labelled dataset - Dataset in which every tuple contains class label attribute
❑Training data - From the dataset, major portion of data is fed to the classification
algorithm . This data is labelled which enables the algorithm to construct a model or a
classifier

12/17/2022 MIT-WPU, DMDW 3


Terms
❑Test data (Validation set) - From the dataset, remaining portion of data is used to test the
classifier. The results are compared with its labels to find the accuracy of the classifier
❑Ground truth - Ground truth is a term used in statistics and machine learning that means
checking the results of machine learning for accuracy against the real world
❑Training tuples - Individual tuples making up the training set are referred to as training
tuples . They are also referred to as samples, examples, instances, data points or objects

12/17/2022 MIT-WPU, DMDW 4


Supervised vs. Unsupervised Learning
Supervised learning (classification/prediction)
◦ The training data (observations, measurements, etc.) are accompanied by labels indicating
the class of the observations
◦ New data is classified based on the training set
◦ Ground truth is available
◦ Techniques for Supervised Learning – Classification and Numeric Prediction
Unsupervised learning (clustering)
◦ The class labels of training data is unknown
◦ Its goal is to infer the natural structure present within a set of data points
◦ Exploratory analysis
◦ In situations where it is either impossible or impractical for a human to propose trends in the
data, unsupervised learning can provide initial insights that can then be used to test individual
hypotheses.

12/17/2022 MIT-WPU, DMDW 5


12/17/2022 MIT-WPU, DMDW 6
Classification
Form of data analysis that extracts models describing important data classes
Such models are called as classifiers, which are built to predict classes
Applications
➢ A bank loan officer wants to analyze the data in order to know which customer (loan
applicant) are risky or which are safe.
➢ A marketing manager at a company needs to analyze a customer with a given profile,
who will buy a new computer.
➢ Image classification, Face recognition
➢ Spam filtering, Cancer diagnosis
➢ Sentiment analysis, Mood detection
➢ Prediction of rainfall

7
Definition - Classification
Classification is the problem of identifying to which of a set of categories (subpopulations),
a new observation belongs to, on the basis of a training set of data containing
observations and whose categories membership is known.

12/17/2022 MIT-WPU, DMDW 8


Numeric Prediction
In this case, a model or a predictor is constructed that predicts a continuous-
valued-function or ordered value. Regression analysis is a statistical
methodology that is most often used for numeric prediction.
Examples/ applications of prediction:
✓Marketing manager needs to predict how much a given customer will spend during a sale at his
company.
✓Predict the agricultural yield
✓Predict the salary of the customer
✓Predict the number of runs by a cricketer
✓Predict stock market value

9
Classification—A Two-Step Process
Learning phase or Model construction: describing a set of predetermined classes
◦ Each tuple/sample is assumed to belong to a predefined class, as determined by the class
label attribute
◦ The set of tuples used for model construction is training set
◦ The model is represented as classification rules, decision trees, or mathematical formulae
Classification phase or Model usage: for classifying future or unknown objects
◦ Estimate accuracy of the model
◦ The known label of test sample is compared with the classified result from the model
◦ If the accuracy is acceptable, use the model to classify new data

Note: If the test set is used to select models, it is called validation set

12/17/2022 MIT-WPU, DMDW 10 10


Step (1): Model Construction
Classification
Training Algorithms
Data

Classifier
(Model)

IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
12/17/2022 MIT-WPU, DMDW 11 11
Step (2): Using the Model in Prediction

Classifier

Testing
Unseen Data
Data
(Jeff, Professor, 4)
Tenured?

12/17/2022 MIT-WPU, DMDW 12 12


Classification Techniques
✓Decision tree based methods
✓Rule based methods
✓Memory based reasoning
✓Neural Networks
✓Genetic algorithms
✓Bayesian networks
✓Support vector machines
12/17/2022 MIT-WPU, DMDW 13
Decision Tree – Whether to play or not?
Outlook

Sunny Overcast Rain

Yes
Humidity Wind

High Normal Strong Weak

No Yes No Yes
Decision Tree : Another Example
❑ Training data set: Buys_computer
❑ Resulting tree:
age?

<=30 31..40 >40

student? yes credit rating?

no yes excellent fair

no yes no yes
12/17/2022 MIT-WPU, DMDW 15 15
Decision Tree Induction
❑Decision tree induction is learning of decision tree from class-labeled training
tuples
❑In decision tree, each internal node denotes a test on an attribute, each
branch represents an outcome of test, and each leaf node holds a class label
❑For an unknown tuple X, path is traced from root to leaf node for prediction of
class.
❑Decision tree can be converted to classification rules.

12/17/2022 MIT-WPU, DMDW 16


Algorithms
❖ID3 (Iterative Dichotomizer)
❖C4.5 (Successor of ID3)
❖CART (Classification and Regression Tree)
❖SLIQ
❖SPRINT

12/17/2022 MIT-WPU, DMDW 17


ID3 Algorithm
Input :
Data partition D, which is a set of training tuples and their associated
class labels
Attribute_list: the set of candidate attributes
Attribute_selection_method: a procedure to determine selection
criterion that best partitions data tuples into individual classes
Output: A decision tree

12/17/2022 MIT-WPU, DMDW 18


Method:
Create a node N
If tuples in D are all of class C then
return N as a leaf node labeled class C
If attribute_list is empty then
return N as a leaf node labeled with the majority class in D
Apply Attribute_Selection_Method(D, attribute_list )
to find the best splitting criterion

12/17/2022 MIT-WPU, DMDW 19


Label node N with splitting criterion
If splitting_attribute is discrete valued and
multiway split is allowed then
attribute_list = attribute_list – splitting attribute ;
For each outcome j of splitting_criterion
// partition the tuples and grow subtrees
let Dj be the set of tuples in D satisfying j
If Dj is empty then
attach a leaf labeled with majority class in D to node N

12/17/2022 MIT-WPU, DMDW 20


Else attach the node returned by
Generate_decision_tree(Dj, attribute_list) to node N
Endfor
Return N;
End Algo

Time Complexity: O(m * |D| * log |D|)


m= number of attributes
|D| = number of records in dataset

12/17/2022 MIT-WPU, DMDW 21


Attribute Selection
1. Attribute selection method specifies a heuristic procedure for selecting the attribute that
best discriminates the given tuples according to class
2. The measure so selected determines the splitting criterion. Splitting criterion tells us which
attribute to test at node N
3. It determines the best way to separate or partition tuples in D into individual classes
4. It indicates the splitting attribute such that the resulting partitions at each branch are as pure
as possible
5. A partition is pure if all the tuples in it belong to the same class

12/17/2022 MIT-WPU, DMDW 22


Cases
a) A is discrete
b) A is continuous
c) A is discrete but
binary tree should
be produced

12/17/2022 MIT-WPU, DMDW 23


Attribute selection measure
oInformation Gain
oGain Ratio
oGini Index
oChi-square test
oC-SEP
oMinimum description length
Attribute selection measures are also known as splitting rules because
they determine how the tuples at a given node are to be split

12/17/2022 MIT-WPU, DMDW 24


Information Gain
Based on Information theory in Physics which studies information content of messages
Entropy -> measure of disorder in the dataset
Information gain = measure of decrease in disorder achieved by partitioning the original
dataset
Information gain = difference between original information requirement and new
requirement after partitioning on a certain attribute
Information gain = Entropy for dataset - Entropy for partition
The attribute with highest information gain is chosen as the splitting attribute for a node
This attribute minimizes the information needed to classify tuples in the resulting
partition, and reflects the least impurity/randomness in these partition

12/17/2022 MIT-WPU, DMDW 25


Entropy - Example

12/17/2022 MIT-WPU, DMDW 26


Partitioning on color to reduce the disorder

12/17/2022 MIT-WPU, DMDW 27


Attribute Selection Measure: Information Gain (ID3/C4.5)

▪Select the attribute with the highest information gain reflecting least impurity
▪Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
▪Let m = number of distinct values of class label attribute
▪Expected information (entropy of D) needed to classify a tuple in D:

▪Information needed (after using A to split D into v partitions) to classify D:

▪Information gained by branching on attribute A

12/17/2022 MIT-WPU, DMDW 28 28


Day Outlook Temperature Humidity Wind PlayGolf

D1 Sunny Hot High Weak No


D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes


D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes

D8 Sunny Mild High Weak No


D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No


12/17/2022 MIT-WPU, DMDW 29
Info (D) = 0.940 bits

12/17/2022 MIT-WPU, DMDW 30


Now, consider attribute outlook
For outlook, there are 3 categories: sunny, overcast, rainy

12/17/2022 MIT-WPU, DMDW 31


Calculating entropy for each attribute
➢The dataset is split on the different attributes.
➢The entropy for each branch is calculated.
➢Then it is added proportionally, to get total entropy for the split.
➢The resulting entropy is subtracted from the entropy before the split.
➢The result is the Information Gain, or decrease in entropy.

12/17/2022 MIT-WPU, DMDW 32


Gain when split on different attributes

12/17/2022 MIT-WPU, DMDW 33


Choose attribute with the largest information gain as the decision node, divide
the dataset by its branches and repeat the same process on every branch.

12/17/2022 MIT-WPU, DMDW 34


A branch with entropy of 0 is a leaf node. For outlook=overcast, the
entire partition has play golf = yes

12/17/2022 MIT-WPU, DMDW 35


A branch with entropy more than 0 needs further splitting.

The ID3 algorithm is run recursively on the non-leaf branches, until all data is
classified.
12/17/2022 MIT-WPU, DMDW 36
Final Decision Tree

12/17/2022 MIT-WPU, DMDW 37


Decision Tree to Decision Rules

12/17/2022 MIT-WPU, DMDW 38


12/17/2022 MIT-WPU, DMDW 39
Limitation
Limitations of information gain measure is biased toward tests with many outcomes.
That is, it prefers to select attributes having a large number of values.
For example, consider an attribute that acts as a unique identifier such as product ID. A split on
product ID would result in a large number of partitions (as many as there are values), each one
containing just one tuple. Because each partition is pure, the information required to classify
data set D based on this partitioning would be Infoproduct ID(D) = 0. Therefore, the informa-
tion gained by partitioning on this attribute is maximal.

12/17/2022 MIT-WPU, DMDW 40


Gain Ratio
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which
attempts to overcome this bias.
It applies a kind of normalization to information gain using a “split information” value defined
analogously with Info(D) as

12/17/2022 MIT-WPU, DMDW 41


Gain Ratio
Gain Ratio is defined as

12/17/2022 MIT-WPU, DMDW 42


12/17/2022 MIT-WPU, DMDW 43
Solution
Computation of gain ratio for the attribute income.
❖A test on income splits the data of Table into three partitions, namely low, medium, and high,
containing four, six, and four tuples, respectively.
❖To compute the gain ratio of income, we first use Split Info Eq. to obtain

❖we have Gain(income) = 0.029. Therefore, GainRatio(income) = 0.029/1.557 = 0.019.

12/17/2022 MIT-WPU, DMDW 44


Gini Index
❖The Gini index is used in CART. Using the notation previously described, the Gini index measures the
impurity of D, a data partition or set of training tuples.

❖where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. The sum
is computed over m classes.
❖The Gini index considers a binary split for each attribute.
❖For example, if income has three possible values, namely {low, medium, high}, then the possible subsets
are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}.
❖We exclude the power set, {low, medium, high}, and the empty set from consideration since, conceptu-
ally, they do not represent a split.
❖Therefore, there are 2v − 2 possible ways to form two partitions of the data, D, based on a binary split
on A.
12/17/2022 MIT-WPU, DMDW 45
Gini Index
❖When considering a binary split, we compute a weighted sum of the impurity of each resulting
partition.
❖For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that
partitioning is

❖a possible split-point of A, D1 is the set of tuples in D satisfying A ≤ split point, and D2 is the set
of tuples in D satisfying A > split point.
❖The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-
valued attribute A is

12/17/2022 MIT-WPU, DMDW 46


Tree Pruning
❖Tree pruning addresses the problem of overfitting the data
❖Tree overfitting can occur due to irrelevant attributes, noisy data or too
little training data.
❖As depth of the tree increases, training error reduces but test error
increases
❖Statistical measures to remove least reliable branches
❖Approaches – prepruning and postpruning
❖Prepruning – construction halted early, node converted to leaf
❖Postpruning – removes subtrees from a fully grown tree

12/17/2022 MIT-WPU, DMDW 47


Advantages of Decision Tree
1. Simple to understand and interpret. Trees can also be displayed graphically in a way
that is easy for non-experts to interpret
2. Able to handle both numerical and categorical data
3. Requires little data preparation. Other techniques often require data normalization
4. Uses a white box model. By contrast, in a black box model, the explanation for the
results is typically difficult to understand, for example with an artificial neural network
5. Possible to validate a model using statistical tests, hence reliable
6. Makes no assumptions of the training data
7. Performs well with large datasets

12/17/2022 MIT-WPU, DMDW 48


Limitations of Decision Tree
1. A small change in the training data can result in a large change in the tree and consequently
the final predictions.
2. Decision-tree learners can create over-complex trees that do not generalize well from the
training data. (overfitting)

12/17/2022 MIT-WPU, DMDW 49


Implementations
R, Weka, Orange, scikit (python), IBM SPSS, SAS, Rapidminer, Matlab

12/17/2022 MIT-WPU, DMDW 50


Bayes Classification Methods
❖Bayesian classifiers are statistical classifiers
❖They predict class membership probabilities – probability that a tuple belongs
to a particular class
❖Classification is based on Bayes’ theorem
❖High accuracy and speed
❖Naïve Bayesian Classifiers assume that effect of an attribute value on a given
class is independent of the values of other attributes → class conditional
independence
❖Application: Chinese word segmentation

12/17/2022 MIT-WPU, DMDW 51


Bayes’ Theorem
X data tuple or evidence (contains n attributes)
H hypothesis (e.g. data tuple X belongs to class C)
P(H/X) probability that hypothesis H holds given evidence X [classification problem]
It is called “Posterior / Posteriori Probability” of H conditioned on X
P(H) probability of hypothesis irrespective of any condition (Prior / Priori probability)

P(X/H) posterior probability of X conditioned on H


P(X) prior probability of X
Bayes’ Theorem :
P(H/X) = [ P(X/H) P(H)] / P(X)
Suppose the fraction of UG students who play cricket is 15% and
fraction of graduate students playing cricket is 23%. If one-fifth of
the college students are graduate students and rest are UG, what is
the probability that a student who plays cricket is a graduate
student?
P(C|U) = 0.15
P(C|G)=0.23
P(G)=0.2
P(U)=0.8
To find P(G|C)
P(G|C) = [P(C|G) * P(G)]/ P(C) = (0.23 * 0.2 )/ (0.15*0.8 + 0.23 *
0.2) = 0.277
12/17/2022 MIT-WPU, DMDW 53
Working of Naïve Bayesian Classifier
❑Let D – training set of tuples, and their associated class labels
❑ X = {x1, x2, …, xn}
❑There are m classes, C1, C2, … ,Cm
❑Given tuple X, classifier predicts that it belongs to class having highest posterior probability

❑ P(Ci/X) > P(Cj/X) for 1 ≤ j ≤ m, j ≠ i


Thus, we have to maximize)
❑Class Ci for which P(Ci/X) is maximized is called “maximum posteriori hypothesis”
❑ P(Ci/X) = [ P(X/Ci)P(Ci)] / P(X) …… Bayes’ theorem
Continued…
❑P(X) is constant for all classes, hence only P(X/Ci)P(Ci) is to be
maximized.
❑If class probabilities are not known, then it is assumed that the classes
are equally likely. i.e. P(C1) = P(C2) = …=P(Cm)
Hence only maximize P(X/Ci)
❑With class conditional independence:
P(X/Ci) = ∏ P(xk/Ci) ….. k=1 to n
= P(x1/ Ci) * P(x2/ Ci) * … * P(xn/ Ci)
RID Age Income Student Credit-rating Class: buys_
computer
1 Youth High No Fair No
2 Youth High No Excellent No
3 Middle-aged High No Fair Yes
4 Senior Medium No Fair Yes
5 Senior Low Yes Fair Yes
6 Senior Low Yes Excellent No
7 Middle-aged Low Yes Excellent Yes

8 Youth Medium No Fair No


9 Youth Low Yes Fair Yes
10 Senior Medium Yes Fair Yes
11 Youth Medium Yes Excellent Yes
12 Middle-aged Medium No Excellent Yes

13 Middle-aged High Yes Fair Yes

14 Senior Medium No Excellent no


Problem statement
Given the training data, classify the tuple :

X =( age=youth, income=medium, student=yes, credit_rating=fair)


Let C1: buys_computer = yes, C2: buys_computer = no
We need to maximize P(X|Ci) for i=1 and i=2
P(C1) = P(buys_computer=yes) = 9/14 = 0.643
P(C2) = P(buys_computer=no) = 5/14 = 0.357

12/17/2022 MIT-WPU, DMDW 57


Computing conditional probabilities:
P(age=youth|buys_computer=yes) = 2/9 = 0.222
P(age=youth|buys_computer=no) = 3/5 = 0.600
P(income=medium | buys_computer=yes) = 4/9 = 0.444
P(income=medium | buys_computer=no) = 2/5 = 0.400
P(student=yes | buys_computer=yes) = 6/9 = 0.667
P(student=yes | buys_computer=no) = 1/5 = 0.2
P(credit_rating=fair | buys_computer=yes) = 6/9 = 0.667
P(credit_rating=fair | buys_computer=no) = 2/5 = 0.4

P(X|buys_computer=yes) = P(age=youth|buys_computer=yes) *
P(income=medium|buys_computer=yes) *
P(student=yes|buys_computer=yes) *
P(credit_rating=fair|buys_computer=yes)
= 0.222 * 0.444 * 0.667 * 0.667
= 0.044

12/17/2022 MIT-WPU, DMDW 58


Similarly,
P (X|buys_computer=no) = 0.6 * 0.4 * 0.2 * 0.4 = 0.019

To find the class Ci that maximizes P(X|Ci).P(Ci):


P (X|buys_computer=yes).P(buys_computer=yes)= 0.044 * 0.643 = 0.028
P (X|buys_computer=no).P(buys_computer=no)= 0.019 * 0.357 = 0.007

Thus, Naïve Bayesian Classifier predicts buys_computer = YES for tuple X


If X =( age=youth, income=medium, student=yes, credit_rating=fair)
Then buys_computer = yes

12/17/2022 MIT-WPU, DMDW 59


Try it out on Weather Data
X =( Outlook = sunny , temperature = cool, humidity = high, windy =
strong, play = ?)

12/17/2022 MIT-WPU, DMDW 60


Zero probability value – Laplace correction
If no training tuples for particular attribute value, then its probability is zero, result in product
probability =0
Thus we would have P(X|Ci)=0, cancelling the effect of other probabilities
To avoid this problem, assume that D is very large and add one to each count to avoid zero
probability.
This technique is known as Laplacian correction or Laplace estimator.
Example: In a training database D, for class buys_computer =yes,
0 tuples with income=yes, 990 tuples with income = medium and 10 tuples with income=high
Probabilities of events without Laplacian correction: 0, 0.990, 0.010
Using Laplace correction, add 1 tuple of each type in D
New probabilities: 1/1003, 991/ 1003, 11/1003 i.e., 0.001, 0.988, 0.011

12/17/2022 MIT-WPU, DMDW 61


Model Evaluation and Selection
Terminology
➢Positive tuples- tuples of main class of interest
➢Negative tuples – all other tuples
➢True Positives (TP) – Positive tuples that were correctly labelled by the classifier
➢True Negatives (TN) – Negative tuples that were correctly labelled by the classifier
➢False Positives (FP) – Negative tuples that were incorrectly labelled as positives
➢False Negatives (FN) – Positive tuples that were incorrectly labelled as negative

12/17/2022 MIT-WPU, DMDW 62


Confusion Matrix
Confusion Matrix – method to analyze how well classifier can recognize
tuples of different classes

Predicted Class
C1 (yes / +ve) C2 (no / -ve)

Actual Class C1 True positive False negative

C2 False positive True negative


Example of confusion matrix

12/17/2022 MIT-WPU, DMDW 64


Evaluation Measures - Accuracy
1. Accuracy - Accuracy of a classifier on a given test set is percentage of test set tuples that are
correctly classified by the classifier. It is termed as overall Recognition rate
2. Class imbalance problem –
◦ In classification involving skewed or highly imbalanced data, e.g., network intrusion, financial fraud
detections or detection of cancer we are interested only in the minority class.
◦ High accuracy does not mean any intrusion is detected.
◦ Hence accuracy is not suitable in some applications

Classes Cancer=yes Cancer=no Total Find the overall


accuracy of the
Cancer=yes 90 210 300 classifier to detect
cancer patients
Cancer-=no 140 9560 9700

Total 230 9770 10000

12/17/2022 MIT-WPU, DMDW 65


Precision and Recall measures
TP TP
p= . r= .
TP + FP TP + FN
◼ Precision p is the number of correctly classified positive
examples divided by the total number of examples that
are classified as positive. [Exactness]
◼ Recall r is the number of correctly classified positive
examples divided by the total number of actual positive
examples in the test set. [Completeness]
o Find the precision and recall for cancer patients in
previous slide
CS583, Bing Liu, UIC 66
Calculate accuracy , precision and recall from the foll. confusion
matrix

Predicted Yes Predicted No Total


Actual Yes 100 5 105
Actual No 10 50 60
Total 110 55 165

12/17/2022 MIT-WPU, DMDW 67


Other parameters to compare classifiers
1. Speed – computational costs involved in generating and using the classifier
2. Robustness – ability of classifier to make correct predictions by handling noise and missing
values
3. Scalability – ability to construct classifier efficiently given large amounts of data
4. Interpretability - understandable and insight provided by the model

CS583, BING LIU, UIC 68

You might also like