Data Mining Unit 3

Unit 3

Classification and prediction


• Basic concepts
• Decision tree induction
• Bayesian classification
• Rule-based classification
• Lazy learner
Classification and prediction
• There are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. These two forms are as follows −
• Classification
• Prediction
Classification models predict categorical class labels, while prediction models predict continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to estimate the expenditure in dollars of potential customers on computer equipment, given their income and occupation.
What is classification?
• Following are examples of cases where the data analysis task is Classification −
• A bank loan officer wants to analyze loan application data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to determine whether a customer with a given profile will buy a new computer.
• In both of the above examples, a model or classifier is constructed to predict categorical labels. These labels are "risky" or "safe" for the loan application data and "yes" or "no" for the marketing data.
What is prediction?
• Following are examples of cases where the data analysis task is Prediction −
• Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we need to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function or ordered value.
• Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
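As an illustration of numeric prediction, the sketch below fits a simple regression model to estimate customer spend from income. The income and spend figures are made up for illustration; only the general scikit-learn API is assumed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: annual income (in $1000s) and the amount spent
# on computer equipment (in $). The values are illustrative only.
income = np.array([[30], [45], [60], [75], [90], [120]])
spend = np.array([400, 650, 900, 1100, 1400, 1900])

# Fit a linear regression model, i.e., a simple numeric predictor.
predictor = LinearRegression().fit(income, spend)

# Predict the expenditure of a new customer with an income of $80k.
print(predictor.predict(np.array([[80]])))  # a continuous value, not a class label
```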
How Does Classification Work?
• With the help of the bank loan application example discussed above, let us understand how classification works. The data classification process includes two steps −
• Building the Classifier or Model
• Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the
classifier.
• The classifier is built from the training set made up of
database tuples and their associated class labels.
• Each tuple in the training set belongs to a predefined category or class, indicated by its class label attribute. The tuples themselves are also referred to as samples, examples, objects, or data points.
Using Classifier for Classification

• In this step, the classifier is used for classification.
• Here the test data is used to estimate the accuracy of the classification rules.
• The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
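A minimal sketch of this two-step process using scikit-learn: the classifier is built (learned) from labeled training tuples, and held-out test tuples are then used to estimate its accuracy before it is applied to new data. The built-in dataset and the 70/30 split are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled tuples: feature vectors X and their associated class labels y.
X, y = load_iris(return_X_y=True)

# Step 1: build the classifier from a training set (70% of the tuples here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: use held-out test tuples to estimate accuracy; if it is acceptable,
# the classifier can be applied to new, unlabeled tuples.
print("estimated accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```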
Classification and Prediction Issues
• The major issue is preparing the data for Classification and Prediction.
Preparing the data involves the following activities −
• Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute (see the sketch after this list).
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods.
• Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is useful when, in the learning step, neural networks or methods involving distance measurements are used.
• Generalization − The data can also be transformed by generalizing it to higher-level concepts. For this purpose we can use concept hierarchies.
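A small sketch, using pandas, of the preparation steps just listed: replacing a missing value with the attribute's most frequent value (data cleaning), checking whether two attributes are correlated (relevance analysis), and min-max normalization (data transformation). The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical loan-application data with a missing value in 'employment'.
df = pd.DataFrame({
    "income":      [30, 45, 60, 75, 90],
    "loan_amount": [5, 10, 12, 20, 25],
    "employment":  ["salaried", "salaried", None, "self-employed", "salaried"],
})

# Data cleaning: fill the missing value with the most common value of that attribute.
df["employment"] = df["employment"].fillna(df["employment"].mode()[0])

# Relevance analysis: correlation between two numeric attributes.
print(df["income"].corr(df["loan_amount"]))

# Data transformation: min-max normalization scales 'income' into the range [0, 1].
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(df)
```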
Comparison of Classification and Prediction Methods
• Here are the criteria for comparing methods of Classification and Prediction −
• Accuracy − Accuracy of a classifier refers to its ability to predict the class label correctly; accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently given a large amount of data.
• Interpretability − This refers to the level of understanding and insight that the classifier or predictor provides.
Classification by Decision Tree Induction
• Decision tree induction is the learning of decision trees from class-
labeled training tuples.
• A decision tree is a flowchart-like tree structure, where each internal
node (nonleaf node) denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node (or terminal
node) holds a class label. The topmost node in a tree is the root node.
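To make this structure concrete, the sketch below trains a small decision tree and prints it in flowchart form: each internal node is a test on an attribute, each branch an outcome of the test, and each leaf a class label. The built-in dataset and depth limit are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each "<=" line is a test on an attribute (internal node); the indented lines
# below it are the branches (test outcomes); "class:" lines are the leaf nodes.
print(export_text(tree, feature_names=list(iris.feature_names)))
```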
Decision Tree Induction
• During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in
machine learning, developed a decision tree algorithm known as ID3 (Iterative
Dichotomiser).
• Quinlan later presented C4.5 (a successor of ID3), which became a benchmark
to which newer supervised learning algorithms are often compared.
• Classification and Regression Trees (CART), published by L. Breiman, J. Friedman, R. Olshen, and C. Stone, described the generation of binary decision trees.
• ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which
decision trees are constructed in a top-down recursive divide-and-conquer
manner.
Information gain
• ID3 uses information gain as its attribute selection measure.
• Let node N represent or hold the tuples of partition D. The attribute
with the highest information gain is chosen as the splitting attribute
for node N. This attribute minimizes the information needed to
classify the tuples in the resulting partitions and reflects the least
randomness or “impurity” in these partitions.
• The expected information needed to classify a tuple in D is given by

Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)

where p_i is the probability that an arbitrary tuple in D belongs to class C_i and m is the number of classes. Info(D) is just the average amount of information needed to identify the class label of a tuple in D; it is also known as the entropy of D.
• Now, suppose we were to partition the tuples in D on some attribute
A having v distinct values, {a1, a2,..., av}, as observed from the
training data. If A is discrete-valued, these values correspond directly
to the v outcomes of a test on A. Attribute A can be used to split D
into v partitions or subsets, {D1, D2,..., Dv}, where Dj contains those
tuples in D that have outcome aj of A. These partitions would
correspond to the branches grown from node N.
• How much more information would we still need (after the partitioning) in order to arrive at an exact classification? This amount is measured by

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

The term |D_j| / |D| acts as the weight of the jth partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
• Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) − Info_A(D)
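A short sketch of these formulas in Python: Info(D) is computed from the class distribution, Info_A(D) from the weighted partitions, and the gain is their difference. The tuples and labels below are made-up illustrative data.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Gain(attr) = Info(D) - Info_attr(D) for a discrete-valued attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# Hypothetical tuples (age, income) with their class labels.
rows   = [("youth", "high"), ("youth", "low"), ("senior", "low"), ("senior", "high")]
labels = ["no", "yes", "yes", "no"]
print(info_gain(rows, 0, labels))  # gain of splitting on 'age'
print(info_gain(rows, 1, labels))  # gain of splitting on 'income'
```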
Gain ratio
• The information gain measure is biased toward tests with many outcomes.
That is, it prefers to select attributes having a large number of values.
• For example, consider an attribute that acts as a unique identifier, such as
product ID.
• A split on product ID would result in a large number of partitions (as many
as there are values), each one containing just one tuple.
• Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_{product ID}(D) = 0. Therefore, the information gained by partitioning on this attribute is maximal.
• Clearly, such a partitioning is useless for classification.
• C4.5, a successor of ID3, uses an extension to information gain known as the gain ratio, which normalizes the information gain by the split information of the attribute.
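A minimal sketch of the gain ratio, assuming the standard C4.5 definition GainRatio(A) = Gain(A) / SplitInfo_A(D), where SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j|/|D|) log2(|D_j|/|D|). It continues the previous sketch and reuses the Counter and log2 imports and the info_gain, rows, and labels defined there.

```python
def split_info(rows, attr_index):
    """SplitInfo_A(D): potential information generated by splitting on the attribute."""
    n = len(rows)
    counts = Counter(row[attr_index] for row in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(rows, attr_index, labels):
    # Normalize the information gain by the split information.
    si = split_info(rows, attr_index)
    return info_gain(rows, attr_index, labels) / si if si > 0 else 0.0

print(gain_ratio(rows, 1, labels))  # gain ratio of splitting on 'income'
```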
Gini index
• The Gini index is used in CART and considers only binary splits. It measures the impurity of a data set D of training tuples as

Gini(D) = 1 − Σ_{i=1}^{m} p_i^2

where p_i is the probability that a tuple in D belongs to class C_i. For a binary split of D by attribute A into partitions D_1 and D_2, the Gini index is given by:

Gini_A(D) = (|D_1| / |D|) Gini(D_1) + (|D_2| / |D|) Gini(D_2)

The attribute (and split point) that minimizes Gini_A(D) is chosen as the splitting attribute.
Example:
Let's consider the dataset below and build a decision tree using the Gini index.

Index   A     B     C     D     E
1       4.8   3.4   1.9   0.2   positive
2       5.0   3.0   1.6   1.2   positive
3       5.0   3.4   1.6   0.2   positive
4       5.2   3.5   1.5   0.2   positive
5       5.2   3.4   1.4   0.2   positive
6       4.7   3.2   1.6   0.2   positive
7       4.8   3.1   1.6   0.2   positive
8       5.4   3.4   1.5   0.4   positive
9       7.0   3.2   4.7   1.4   negative
10      6.4   3.2   4.7   1.5   negative
11      6.9   3.1   4.9   1.5   negative
12      5.5   2.3   4.0   1.3   negative
13      6.5   2.8   4.6   1.5   negative
14      5.7   2.8   4.5   1.3   negative
15      6.3   3.3   4.7   1.6   negative
16      4.9   2.4   3.3   1.0   negative
• For the Gini index, candidate split values are chosen to convert each continuous attribute into a binary test; the Gini index is then computed for each candidate split, and the attribute and split value with the lowest Gini_A(D) are selected for the node.
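A sketch of the Gini computation on this dataset for one candidate binary split. The threshold of 2.5 on attribute C is just an illustrative choice (it happens to separate the two classes here); in practice every candidate split value of every attribute would be evaluated and the split with the lowest Gini_A(D) chosen.

```python
def gini(labels):
    """Gini(D) = 1 - sum(p_i^2) over the classes present in the partition."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(values, labels, threshold):
    """Gini_A(D) for the binary split 'value <= threshold' vs 'value > threshold'."""
    left  = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Attribute C and class label E from the table above.
C = [1.9, 1.6, 1.6, 1.5, 1.4, 1.6, 1.6, 1.5, 4.7, 4.7, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3]
E = ["positive"] * 8 + ["negative"] * 8

print(gini(E))                # impurity of the whole dataset: 0.5
print(gini_split(C, E, 2.5))  # 0.0 -> a pure split, so "C <= 2.5" is an excellent test
```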
Tree Pruning

• Pruning is the method of removing unwanted branches from the decision tree. Some branches of the decision tree might represent outliers or noisy data.
• Tree pruning reduces the unwanted branches of the tree. This reduces the complexity of the tree and helps in effective predictive analysis. It reduces overfitting because it removes unimportant branches from the tree.
There are two ways of pruning the tree:
• #1) Prepruning: In this approach, the construction of the decision tree is stopped early, i.e., it is decided not to partition the branches further. The last node constructed becomes a leaf node, and this leaf node may hold the most frequent class among the tuples.
• The attribute selection measures are used to assess the goodness of a split. Threshold values are prescribed to decide which splits are regarded as useful. If partitioning a node would result in a split whose measure falls below the threshold, further partitioning is halted.
• #2) Postpruning: This method removes outlier branches from a fully grown tree. The unwanted branches are removed and replaced by a leaf node denoting the most frequent class label. This technique requires more computation than prepruning; however, it is more reliable.
• Pruned trees are more precise and compact than unpruned trees, but decision trees can still suffer from the problems of replication and repetition.
• Repetition occurs when the same attribute is tested again and again along a branch of the tree. Replication occurs when duplicate subtrees are present within the tree. These issues can be addressed by using multivariate splits.
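As an illustration of postpruning, the sketch below uses scikit-learn's cost-complexity pruning: a fully grown tree is compared with a pruned version obtained via the ccp_alpha parameter. This is one concrete postpruning technique, not the only one, and the dataset and alpha value are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown (unpruned) tree.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The cost-complexity pruning path gives candidate alpha values; a larger
# ccp_alpha removes more branches from the fully grown tree.
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alphas[len(alphas) // 2]).fit(X_train, y_train)

print("unpruned: nodes =", full.tree_.node_count, " test accuracy =", full.score(X_test, y_test))
print("pruned:   nodes =", pruned.tree_.node_count, " test accuracy =", pruned.score(X_test, y_test))
```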
