Classification: Decision Trees, Naive Bayes, kNN
Classification
Data Cleaning − Noise is removed by applying smoothing techniques, and missing values are handled by replacing each one with the most commonly occurring value for the given attribute (see the preprocessing sketch after this list).
Relevance Analysis − The database may contain irrelevant attributes. Correlation analysis is used to find out whether any two given attributes are related.
Data Transformation and Reduction − The data can be transformed by the following two methods.
Normalization − The normalization method scales all values of a given attribute so that they fall within a small specified range.
Generalization − The data can also be transformed by generalizing it to a higher-level concept. Concept hierarchies can be used for this purpose.
Data can also be reduced by some other methods such as wavelet transformation, binning,
histogram analysis, and clustering.
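A rough sketch of the cleaning and normalization steps above, assuming a small hypothetical pandas table (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical table with one missing categorical value and a numeric attribute.
df = pd.DataFrame({
    "credit_rating": ["fair", "excellent", None, "fair"],
    "income": [30000, 48000, 72000, 54000],
})

# Data cleaning: fill the missing value with the most commonly occurring value (mode).
df["credit_rating"] = df["credit_rating"].fillna(df["credit_rating"].mode()[0])

# Normalization: min-max scale income into the small specified range [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

print(df)
```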
Comparison of Classification and Prediction Methods
Accuracy − The accuracy of a classifier is how well it predicts the class label correctly, and the accuracy of a predictor is how well it predicts the value of the target attribute for new data.
Speed − This refers to the computational cost of generating and using the classifier or predictor.
Robustness − This refers to the capability of the classifier or predictor to make correct predictions from noisy data.
Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
Interpretability − This refers to the extent to which the classifier or predictor is understood by humans.
The predictor and the classifier are alike in that both follow the same two-step process: first a model is constructed, and then the model is used to predict values for unknown objects. They differ in that classification predicts a categorical class label, whereas prediction predicts a continuous value.
Decision Tree Induction
The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for
exploratory knowledge discovery.
Decision trees can handle multidimensional data.
Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans.
The learning and classification steps of decision tree induction are
simple and fast.
In general, decision tree classifiers have good accuracy.
However, successful use may depend on the data at hand.
ID3, C4.5 and CART
ID3 (Iterative Dichotomiser): During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser).
This work expanded on earlier work on concept learning systems, described by E.
B. Hunt, J. Marin, and P. T. Stone.
Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to
which newer supervised learning algorithms are often compared.
In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone)
published the book Classification and Regression Trees (CART), which described
the generation of binary decision trees.
Attribute Selection Measures
How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by
Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)
The term |Dj| / |D| acts as the weight of the jth partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
The smaller the expected information (still) required, the greater the purity of the
partitions.
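A minimal sketch of Info(D) and Info_A(D), assuming the tuples are given as parallel lists; the helper names and the age/class lists (a reconstruction of the textbook-style 14-tuple example with 9 "yes" and 5 "no") are our own:

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): expected information (entropy) needed to classify a tuple in D."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_a(values, labels):
    """Info_A(D): weighted entropy of the partitions induced by attribute A."""
    total = len(labels)
    result = 0.0
    for v in set(values):
        part = [lab for val, lab in zip(values, labels) if val == v]
        result += (len(part) / total) * info(part)
    return result

# Reconstruction of the 14-tuple example: 9 "yes" / 5 "no", partitioned by age.
age = ["youth"] * 5 + ["middle_aged"] * 4 + ["senior"] * 5
buys = ["no", "no", "no", "yes", "yes"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]

print(round(info(buys), 3))                      # Info(D)      ≈ 0.940
print(round(info_a(age, buys), 3))               # Info_age(D)  ≈ 0.694
print(round(info(buys) - info_a(age, buys), 3))  # Gain(age)    ≈ 0.246
```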
Information Gain (V)
Because age has the highest information gain among the attributes, it is selected as the
splitting attribute.
Node N is labeled with age, and branches are grown for each of the attribute’s values.
The tuples are then partitioned accordingly, as shown in the figure.
Notice that the tuples falling into the partition for age = middle_aged all belong to the
same class. Because they all belong to class “yes,” a leaf should therefore be created at
the end of this branch and labeled “yes.”
Information Gain-Algorithm
If the values of A are sorted in advance, then determining the best split for A requires only
one pass through the values.
For each possible split-point for A, we evaluate Info_A(D), where the number of partitions is two, that is, v = 2 (or j = 1, 2).
The point with the minimum expected information requirement for A is selected as the split
point for A. D1 is the set of tuples in D satisfying A ≤ split point, and D2 is the set of tuples
in D satisfying A > split point.
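A sketch of the split-point search described above for a continuous attribute; the helper names and the toy ages/labels are hypothetical:

```python
from collections import Counter
from math import log2

def info(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_split_point(values, labels):
    """Return (expected information, split point) minimising Info_A(D) for numeric A."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # equal adjacent values give no new candidate midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for val, lab in pairs if val <= split]
        right = [lab for val, lab in pairs if val > split]
        expected = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        best = min(best, (expected, split))
    return best

# Hypothetical ages with class labels.
print(best_split_point([23, 25, 31, 35, 40, 52], ["no", "no", "yes", "yes", "yes", "no"]))
```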
Gain Ratio (I)
The information gain measure is biased toward tests with many outcomes.
That is, it prefers to select attributes having a large number of values.
For example, consider an attribute that acts as a unique identifier such as
product ID. A split on product ID would result in a large number of
partitions (as many as there are values), each one containing just one tuple.
Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_product_ID(D) = 0.
Therefore, the information gained by partitioning on this attribute is
maximal. Clearly, such a partitioning is useless for classification.
Gain Ratio (II)
The split information value, defined as
SplitInfo_A(D) = − Σ_{j=1..v} (|Dj| / |D|) × log2(|Dj| / |D|),
represents the potential information generated by splitting the training data set, D, into v partitions corresponding to the v outcomes of a test on attribute A.
Note that, for each outcome, it considers the number of tuples having that
outcome with respect to the total number of tuples in D. It differs from
information gain, which measures the information with respect to classification
that is acquired based on the same partitioning.
Gain Ratio (III)
The attribute with the maximum gain ratio is selected as the splitting attribute.
Note, however, that as the split information approaches 0, the ratio becomes
unstable.
A constraint is added to avoid this, whereby the information gain of the test
selected must be large—at least as great as the average gain over all tests
examined.
Problem − Calculate the gain ratio for the attribute income.
GainRatio(A) = Gain(A) / SplitInfo_A(D)
SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14)
4/14 = 0.2857, 6/14 = 0.4286
log2(4/14) = log2(4) − log2(14) = 2 − 3.8074 = −1.8074
log2(6/14) = log2(6) − log2(14) = 2.5850 − 3.8074 = −1.2224
SplitInfo_income(D) = (4/14)(1.8074) + (6/14)(1.2224) + (4/14)(1.8074) = 0.516 + 0.524 + 0.516 ≈ 1.557
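The arithmetic can be double-checked in a few lines; note that the Gain(income) ≈ 0.029 used for the final ratio is taken from the same textbook-style example and is an assumption, not derived in the slides above:

```python
from math import log2

# income has 4 "low", 6 "medium" and 4 "high" tuples out of 14.
counts = [4, 6, 4]
total = sum(counts)

split_info = -sum((c / total) * log2(c / total) for c in counts)
print(round(split_info, 3))                 # SplitInfo_income(D) ≈ 1.557

# GainRatio(income) = Gain(income) / SplitInfo_income(D).
gain_income = 0.029                         # assumed from the textbook-style example
print(round(gain_income / split_info, 3))   # ≈ 0.019
```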
Gini Index (I)
For data set D (9 tuples of class “yes” and 5 of class “no”), Gini(D) = 1 − (9/14)^2 − (5/14)^2 = 0.459.
Considering binary splits on income, the Gini index values are:
0.458 for the subsets {low, high} and {medium},
0.450 for the subsets {medium, high} and {low}, and
0.443 for the subsets {low, medium} and {high}.
Therefore, the best binary split for attribute income is on {low, medium} (and {high}) because it minimizes the Gini index. ΔGini(income) = 0.459 − 0.443 = 0.016.
Evaluating the attribute age, we obtain {youth, senior} (or {middle_aged}) as the best split for age, with a Gini index of 0.357;
the attributes student and credit_rating are both binary, with Gini index values of 0.367 and 0.429, respectively.
The attribute age and splitting subset {youth, senior} therefore give the
minimum Gini index overall, with a reduction in impurity of
0.459-0.357=0.102.
The binary split “age ∈ {youth, senior}” results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion.
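A small sketch of the Gini computations above; the per-partition class counts for income (7 "yes"/3 "no" vs. 2 "yes"/2 "no") are assumed from the textbook-style data:

```python
def gini(counts):
    """Gini impurity from a list of per-class tuple counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index of a binary split; each partition is a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))                   # Gini(D) ≈ 0.459
# income in {low, medium} (7 yes / 3 no) vs {high} (2 yes / 2 no) -- counts assumed.
print(round(gini_split([[7, 3], [2, 2]]), 3))   # ≈ 0.443, so ΔGini ≈ 0.016
```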
Tree Pruning
Sometimes, when decision trees are built, many of the branches may reflect anomalies in the training data due to noise or outliers.
Tree pruning is a method to remove the least-reliable branches from the
decision tree.
Two approaches:
Pre-pruning − halt tree construction early, e.g., by not splitting a node further when the improvement falls below a threshold.
Post-pruning − remove subtrees from a fully grown tree (a hedged scikit-learn sketch follows below).
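As a hedged illustration only (the slides do not prescribe a library), scikit-learn's DecisionTreeClassifier exposes post-pruning through cost-complexity pruning (ccp_alpha), while pre-pruning can be approximated with stopping parameters such as max_depth; the data below is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: branches may reflect noise or outliers in the training data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning analogue: stop growth early with max_depth / min_samples_leaf.
# Post-pruning: cost-complexity pruning; larger ccp_alpha removes less reliable branches.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print(full.get_n_leaves(), full.score(X_test, y_test))
print(pruned.get_n_leaves(), pruned.score(X_test, y_test))
```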
Naïve Bayes Classification
For a tuple X = (x1, x2, …, xn), the class-conditional independence assumption gives
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
Two cases for estimating P(xk|Ci):
1. If Ak is categorical, P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D| (the number of tuples of class Ci in D).
2. If Ak is continuous-valued, the probability is estimated with a Gaussian (normal) probability density function, using the mean and standard deviation of Ak for the tuples of class Ci.
X = (age = youth, income = medium, student = yes, credit_rating = fair)
m = 2 classes
P(C1 = yes) = 9/14 = 0.643
P(C2 = no) = 5/14 = 0.357
P(X|C1) = P(age|C1) × P(income|C1) × P(student|C1) × P(credit_rating|C1)
        = 2/9 × 4/9 × 6/9 × 6/9 = 0.044
P(X|C2) = 0.019
Since P(X|C1)P(C1) = 0.044 × 0.643 = 0.028 is greater than P(X|C2)P(C2) = 0.019 × 0.357 = 0.007, the classifier predicts class C1 = “yes” for X.
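The same posterior comparison in code; the class-conditional probabilities for C2 = "no" (3/5, 2/5, 1/5, 2/5) are assumed from the textbook-style data, since only their product 0.019 appears above:

```python
from math import prod

p_yes, p_no = 9 / 14, 5 / 14                 # P(C1), P(C2) as above

cond_yes = [2 / 9, 4 / 9, 6 / 9, 6 / 9]      # P(age|C1), P(income|C1), P(student|C1), P(credit|C1)
cond_no = [3 / 5, 2 / 5, 1 / 5, 2 / 5]       # corresponding values for C2 (assumed)

px_yes, px_no = prod(cond_yes), prod(cond_no)            # ≈ 0.044 and ≈ 0.019
print(round(px_yes * p_yes, 3), round(px_no * p_no, 3))  # ≈ 0.028 vs ≈ 0.007 -> predict "yes"
```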
Advantages of Naïve Bayes
It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
When the assumption of independence holds, a Naïve Bayes classifier performs better compared to other models such as logistic regression, and it needs less training data.
It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Disadvantages of Naïve Bayes
•When a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns it a zero probability and is unable to make a prediction.
•This is often known as the “zero frequency” problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation (see the sketch after this list).
•Naïve Bayes is also known to be a poor probability estimator.
•Naïve Bayes assumes independent predictors. In real life, it is almost impossible to obtain a set of predictors that are completely independent.
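A minimal sketch of Laplace (add-one) estimation for the zero-frequency case; the counts and the number of categories are hypothetical:

```python
def laplace_prob(count, class_total, n_categories):
    """Add-one (Laplace) estimate of P(value | class)."""
    return (count + 1) / (class_total + n_categories)

# Hypothetical: "income = high" never occurs with class "no" in training (0 of 5 tuples),
# and income has 3 possible categories (low, medium, high).
print(laplace_prob(0, 5, 3))   # 0.125 instead of a hard zero
```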
Applications of Naive Bayes Algorithms
Real-time Prediction: Naïve Bayes is an eager learning classifier and it is quite fast, so it is used for making predictions in real time.
Multi-class Prediction: we can predict the probability of multiple classes of the target variable.
Text Classification / Spam Filtering / Sentiment Analysis: Naïve Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence rule) and have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to find out positive and negative customer sentiments).
Recommendation Systems: predict whether a user would like a given resource or not.
Lazy Learners: kNN Classifier
Eager learners are those that, when given a set of training tuples, construct a classification model before receiving the test tuples to classify.
Lazy learners, on the other hand, store the training tuples (or do only minor processing) and wait until the last minute before doing any model construction to classify test tuples.
Lazy learners do less work when training tuples are presented and more
work when test tuples are presented.
Lazy learners are also called instance-based learners.
When the kNN algorithm is used
dist(X1, X2) = sqrt( Σ_{i=1..n} (x1i − x2i)² )   (Euclidean distance)
It is a lazy algorithm.
No intermediate model is built.
The result is generated only after analysis of the stored training data.
KNN Algorithm is based on feature similarity
KNN stores the entire training dataset which it uses as its representation.
KNN does not learn any model.
KNN makes predictions just-in-time by calculating the similarity between an
input sample and each training instance.
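A minimal kNN sketch matching this description: store the training tuples and, at prediction time, take a majority vote among the k nearest neighbours by Euclidean distance (the toy points are hypothetical):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training instances closest to x."""
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Hypothetical 2-D training points with two classes.
train_X = [(1.0, 1.0), (1.2, 0.8), (3.0, 3.2), (3.1, 2.9)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (2.9, 3.0)))   # -> "B"
```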
Why KNN is Used
Regression is a technique that models the relationship between a variable “y” and the values of a variable “x”.
The regression technique is used, for example, to understand the relationship between product price and sales.
Suppose we have a dataset of patients with and without heart disease. There are features like tobacco consumption, cholesterol, alcohol consumption, type-A personality traits, adiposity value, obesity measures, and systolic blood pressure, to name a few.
Correlation can be computed among all features to study their dependencies on each other (see the sketch after these questions). For example,
Is an obese person with high adiposity prone to heart disease?
Does cholesterol have any impact?
What are the chances that a person’s tobacco consumption is associated with heart disease?
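As an illustration of exploring such dependencies, a pandas correlation matrix over hypothetical patient features (the values are invented):

```python
import pandas as pd

# Hypothetical patient records; the column names follow the features listed above.
df = pd.DataFrame({
    "tobacco":       [0.0, 4.1, 12.0, 0.5, 7.5],
    "cholesterol":   [3.2, 5.7, 6.1, 4.0, 6.8],
    "adiposity":     [18.0, 28.9, 33.1, 21.5, 30.2],
    "sbp":           [118, 140, 160, 124, 150],
    "heart_disease": [0, 1, 1, 0, 1],
})

# Pairwise Pearson correlations between all features (and the target).
print(df.corr())
```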
Linear Regression
Linear regression is applied to models that have two types of variables: we use one variable to forecast the value of another. These variables are called the explanatory (independent) variable and the dependent variable.
Crop yield on rainfall: yield is the dependent variable (the output we forecast); rainfall is the explanatory variable.
Marks on activities: marks are the dependent variable and activities are explanatory.
Products on sales: sales are the explanatory variable.
Predicting house prices from an increase in house size (see the sketch below).
The relationship between the hours of study a student puts in and the exam results.
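A minimal least-squares sketch for the house-price example; the sizes, prices, and units are hypothetical:

```python
import numpy as np

# Explanatory variable: house size (sq. ft.); dependent variable: price (in thousands).
size = np.array([650.0, 800.0, 1100.0, 1400.0, 1700.0])
price = np.array([30.0, 38.0, 52.0, 64.0, 78.0])

# Fit price = slope * size + intercept by ordinary least squares.
slope, intercept = np.polyfit(size, price, 1)

print(slope, intercept)
print(slope * 1200 + intercept)   # predicted price for a hypothetical 1200 sq. ft. house
```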
THANK YOU