
Divya S R, Assistant Professor, Department of Computer Science, AES National Degree College, Gauribidanur.

Unit 4

Classification
What is classification:
Classification refers to the process of categorizing data into predefined classes or groups
based on certain characteristics or features. It is a supervised learning technique, meaning it
requires labeled data for training.

Decision Tree Induction


A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each
leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer; it indicates whether a customer at
a company is likely to buy a computer or not. Each internal node represents a test on an attribute,
and each leaf node represents a class.

The benefits of having a decision tree are as follows −


 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.
Note: *) A decision tree is a flow-chart-like tree structure.
*) It supports decision making.
*) It defines rules visually in the form of a tree.


Types of nodes
1. Root node – Main Question
2. Branch node – Intermediate Node
3. Leaf node – Answer

Attribute selection measures


 Information Gain: how much information the answer to a specific attribute test (question) provides
about the class, i.e., how much the uncertainty about the class is reduced by the split.
 Entropy: measures the amount of uncertainty (impurity) in the data; the more mixed the class labels
are, the higher the entropy. As the information provided by an attribute increases, the remaining
entropy decreases.
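
To make these measures concrete, here is a minimal sketch in Python (not from the original notes; the example labels are invented for illustration) of how entropy and information gain could be computed for a candidate split:

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels: -sum(p * log2(p)) over the classes.
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    # Entropy of the parent minus the weighted entropy of the child subsets.
    total = len(parent_labels)
    weighted_children = sum((len(group) / total) * entropy(group)
                            for group in child_label_groups)
    return entropy(parent_labels) - weighted_children

# Hypothetical split of 10 training examples on some attribute:
parent = ["yes"] * 6 + ["no"] * 4
children = [["yes"] * 5 + ["no"], ["yes"] + ["no"] * 3]
print(round(entropy(parent), 3))                     # 0.971 (high uncertainty)
print(round(information_gain(parent, children), 3))  # 0.256 -> a useful split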

Decision Tree Induction classification:


Decision tree induction classification is a supervised learning method where a tree-like model
is built from labeled data to predict the class label of new instances. It involves recursively
partitioning the data based on features to create decision boundaries, enabling accurate
classification.
Decision tree induction classification with key points:
1. Supervised Learning: Decision tree induction is a supervised learning method, meaning it
requires labeled training data consisting of input features and corresponding class labels.
2. Classification Task: It is primarily used for classification tasks, where the goal is to categorize
input instances into predefined classes or categories.
3. Tree-like Model: Decision tree induction constructs a tree-like model from the training data. Each
internal node of the tree represents a feature, and each leaf node represents a class label.
4. Recursive Partitioning: The process involves recursively partitioning the dataset based on the
values of input features. This creates decision boundaries that separate different classes.
5. Impurity Reduction: The algorithm selects the best feature at each node to split the data,
aiming to minimize impurity or maximize information gain.
6. Interpretability: Decision trees are highly interpretable, allowing users to understand the
decision-making process easily by visualizing the tree structure.
7. Prediction: Once the decision tree is built, it can be used to predict the class label of new
instances by traversing the tree from the root node to a leaf node based on the input features.
8. Wide Applicability: Decision tree induction is widely used in various domains due to its simplicity,
effectiveness, and ability to handle both numerical and categorical data.


Example:
Credit score rating
A = Average, B = Bad, C = Good, D = Excellent

Rules that can be defined:

If age < 30 and income < 30K, then credit score = Bad
If age < 30 and income >= 30K, then credit score = Good
Using a decision tree makes it possible to take quick decisions and classify data.
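
As an illustration only (not part of the original notes), these two rules could be written directly as code; the thresholds and class names simply mirror the example above:

def credit_score(age, income):
    # Toy rule set mirroring the example above (illustrative only).
    if age < 30 and income < 30_000:
        return "Bad"
    if age < 30 and income >= 30_000:
        return "Good"
    return "Unknown"  # the example defines no rule for age >= 30

print(credit_score(25, 20_000))  # Bad
print(credit_score(25, 45_000))  # Good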

Decision Tree Induction Algorithm


A widely used decision tree induction algorithm is CART (Classification and Regression Trees), a
fundamental method for building decision trees. Here's a simplified overview of the algorithm:
1. Selecting the Root Node:
o Choose the best attribute to split the data based on some criterion (e.g., information gain,
Gini impurity, entropy).
o Calculate the chosen criterion for each attribute and select the one that maximizes the gain
or minimizes impurity.
2. Splitting the Data:
o Partition the dataset into subsets based on the values of the selected attribute.
o Create a branch for each possible value of the attribute.


3. Recursive Tree Building:


o For each subset created by the split:
 If all instances belong to the same class, create a leaf node with that class label.
 If the subset is empty or if all attributes have been used for splitting, create a leaf
node with the majority class label.
 Otherwise, recursively apply the above steps to the subset to continue building the
tree.
4. Stopping Criteria:
o Define stopping criteria to halt the tree-building process. Common criteria include:
 Maximum tree depth.
 Minimum number of instances in a leaf node.
 Minimum gain in impurity or information.
5. Pruning (Optional):
o After the tree is constructed, pruning techniques can be applied to reduce overfitting and
improve generalization.
o Pruning involves removing branches or nodes that do not contribute significantly to the
predictive accuracy of the tree.
6. Making Predictions:
o Once the tree is built, it can be used to make predictions on new instances.
o Traverse the tree from the root node to a leaf node based on the values of the input
features, following the decision paths.
o The class label associated with the leaf node reached is the predicted class for the input
instance.
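
A minimal sketch of these steps using scikit-learn's DecisionTreeClassifier (assuming scikit-learn is installed; the tiny dataset below is invented for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset: features are [age, income in thousands]; labels are credit classes.
X = [[22, 20], [25, 45], [40, 30], [35, 80], [28, 25], [50, 60]]
y = ["Bad", "Good", "Good", "Excellent", "Bad", "Excellent"]

# criterion can be "gini" or "entropy"; max_depth is one possible stopping criterion.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))  # the learned decision rules
print(tree.predict([[27, 28]]))  # traverse the tree to predict a new instance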

Decision Tree Induction issues


Some of the common issues with decision tree induction:
1. Overfitting: Decision trees can memorize noise in the data, leading to poor generalization on
unseen data.
2. High Variance: Small changes in the training data can result in significantly different tree
structures, leading to instability.
3. Bias-Variance Tradeoff: Balancing between a too simple (high bias) and too complex (high
variance) tree is crucial for optimal performance.
4. Data Imbalance: Decision trees may be biased towards dominant classes in imbalanced datasets,
leading to poor performance on minority classes.
5. Handling Continuous Attributes: Traditional methods for handling continuous attributes may
not be optimal.


6. Handling Missing Values: Decision tree algorithms may not handle missing values effectively,
potentially leading to biased models.
7. Prone to Local Optima: Greedy algorithms used in decision tree induction may lead to
suboptimal splits.
8. Interpretability vs. Accuracy: While decision trees are interpretable, their simplicity may limit
predictive accuracy compared to more complex models.

Addressing these issues often involves using techniques such as pruning, ensemble methods,
regularization, and careful hyper-parameter tuning.
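
For example, overfitting and high variance are often controlled through pruning-related hyperparameters; a hedged sketch with scikit-learn (the dataset is synthetic and the parameter grid is illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters that limit tree growth or prune the fitted tree:
param_grid = {
    "max_depth": [3, 5, None],        # maximum tree depth
    "min_samples_leaf": [1, 5, 10],   # minimum instances per leaf node
    "ccp_alpha": [0.0, 0.01, 0.1],    # cost-complexity (post-)pruning strength
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))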

Decision Tree Induction advantages and disadvantages


Advantages:
1. Interpretability: Decision trees are easy to understand and interpret.
2. No Data Assumptions: They don't require assumptions about data distributions.
3. Handle Non-Linearity: Can capture non-linear relationships.
4. Handle Both Numerical and Categorical Data: Can process both types of features.
5. Feature Importance: Provide insight into feature importance.
6. Fast Training: Quick to train on large datasets.
7. Robust to Outliers: Resilient to outliers and noisy data.

Disadvantages:
1. Overfitting: Prone to overfitting, especially with deep trees.
2. High Variance: Sensitive to variations in the training data.
3. Bias-Variance Tradeoff: Balancing bias and variance can be challenging.
4. Instability: Small changes in data can lead to different tree structures.
5. Limited Expressiveness: May struggle to capture complex relationships.
6. Greedy Nature: May not find the globally optimal tree structure.
7. Difficulty with Imbalanced Data: May perform poorly on imbalanced datasets.


Bayes Classification Methods


Bayesian classification methods are a family of probabilistic algorithms based on Bayes' theorem,
which provides a framework for calculating conditional probabilities. These methods are commonly
used for classification tasks in machine learning and statistics.

1. Bayes' theorem is expressed mathematically by the following equation:

P(X|Y) = [P(Y|X) × P(X)] / P(Y)

where X and Y are events and P(Y) ≠ 0.


 P(X|Y) is the conditional probability of event X occurring given that Y is true.
 P(Y|X) is the conditional probability of event Y occurring given that X is true.
 P(X) and P(Y) are the probabilities of observing X and Y independently of each other; each is known
as a marginal probability.
For example (with invented numbers): if 1% of emails are spam, 90% of spam emails contain the word
"offer", and 5% of all emails contain "offer", then P(spam | "offer") = (0.90 × 0.01) / 0.05 = 0.18.

2. Naive Bayes Classifier: The Naive Bayes classifier is one of the simplest Bayesian classifiers. It
assumes that the features are conditionally independent given the class label. Despite this strong
assumption (which is often not true in practice), Naive Bayes classifiers are surprisingly effective
and efficient for many real-world problems. Common variants include:
o Gaussian Naive Bayes: Assumes that the likelihood of the features follows a Gaussian
distribution.
o Multinomial Naive Bayes: Suitable for features that represent counts or frequencies (e.g.,
text classification).
o Bernoulli Naive Bayes: Assumes that features are binary.

3. Bayesian Network Classifiers: These are more complex models that represent the dependencies
between features using a directed acyclic graph (DAG). Each node in the graph represents a
random variable (feature), and edges represent probabilistic dependencies between them.
Bayesian network classifiers can capture more complex relationships between features but require
more sophisticated inference algorithms.


4. Bayesian Logistic Regression: This is a probabilistic approach to logistic regression. Instead of
directly estimating the parameters of the logistic regression model, Bayesian logistic regression
assigns a prior distribution to these parameters and then updates this distribution based on the
observed data, resulting in a posterior distribution over the parameters.

Bayesian classification methods are particularly useful when dealing with small datasets or when
interpretability is important. They provide a principled way to incorporate prior knowledge into the
classification process and can produce well-calibrated probability estimates. However, they may not always
perform as well as more complex models like deep neural networks on very large datasets.
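
As a minimal illustration (assuming scikit-learn is available; the data is synthetic), a Gaussian Naive Bayes classifier can be trained and queried like this:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Continuous-valued features, so the Gaussian variant is a reasonable choice.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)

print(round(model.score(X_test, y_test), 3))  # accuracy on held-out data
print(model.predict_proba(X_test[:3]))        # per-class posterior probabilities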

Rule-Based Classification
Rule-based classification is a method of classifying data based on a set of predefined rules. These rules
are typically derived from domain knowledge or extracted from the data itself. Rule-based classification
systems make decisions by applying these rules to the input data and assigning a class label based on the
conditions satisfied.
Here's an overview of how rule-based classification works:
1. Rule Representation: Rules are typically represented in the form of "if-then" statements. Each
rule consists of a condition (antecedent) and an action (consequent). For example:
o If (age < 30) and (income > $50,000), then class = "young and affluent"
o If (age >= 30) and (age < 50), then class = "middle-aged"
o If (age >= 50) and (income < $30,000), then class = "senior with low income"
2. Rule Induction: Rule-based classification systems can be built using rule induction algorithms,
which automatically generate rules from training data. These algorithms analyze the data to
identify patterns and relationships between features and class labels. Common rule induction
algorithms include:
o Decision Trees: Decision trees can be converted into rule sets by traversing the tree from
the root to the leaf nodes, where each path represents a rule.
o Sequential Covering Algorithms: These algorithms iteratively generate rules to cover
different subsets of the data, often using techniques like exhaustive search or heuristics to
select the best rule at each step.
o Association Rule Mining: Association rule mining algorithms, such as Apriori, identify
rules that describe relationships between different attributes in the data.


3. Rule Refinement: Once initial rules are generated, they may be refined to improve their accuracy
or interpretability. This can involve pruning redundant or irrelevant rules, optimizing rule
conditions, or combining rules to create more general or specific rules.
4. Rule Evaluation: Rule-based classifiers are evaluated using metrics such as accuracy, precision,
recall, and F1-score on a held-out test dataset. The performance of the classifier can be assessed
by comparing the predicted class labels with the true class labels in the test data.
5. Interpretability: One of the key advantages of rule-based classification is its interpretability.
Since the classification decisions are based on explicit rules, it is easy to understand why a
particular decision was made. This makes rule-based classifiers particularly useful in domains
where interpretability is important, such as healthcare and finance.

While rule-based classification systems have the advantage of interpretability, they may struggle to
capture complex relationships in the data compared to more flexible models like neural networks.
Additionally, designing effective rule sets can require significant domain expertise, and rule-based
classifiers may not perform as well as other methods on very large or high-dimensional datasets.
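
A small sketch (not from the original notes) of how such "if-then" rules might be applied in code; the rule conditions and class names below mirror the earlier examples and are purely illustrative:

# Each rule is a (condition, class label) pair; rules are checked in order.
rules = [
    (lambda r: r["age"] < 30 and r["income"] > 50_000, "young and affluent"),
    (lambda r: 30 <= r["age"] < 50, "middle-aged"),
    (lambda r: r["age"] >= 50 and r["income"] < 30_000, "senior with low income"),
]

def classify(record, default="other"):
    # Return the label of the first rule whose condition is satisfied.
    for condition, label in rules:
        if condition(record):
            return label
    return default  # no rule fired, fall back to a default class

print(classify({"age": 25, "income": 60_000}))  # young and affluent
print(classify({"age": 45, "income": 20_000}))  # middle-aged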

Lazy Learners (or Learning from your Neighbours)


Lazy learning, also known as instance-based learning or memory-based learning, is a machine learning
approach where the system generalizes new instances based on similar instances seen during training.
The key idea is to delay the learning process until a new query instance is encountered, at which point the
system finds the most similar instances from the training data and makes predictions based on them.
The most common lazy learning algorithm is the k-nearest neighbors (k-NN) algorithm. Here's how it
works:
1. Training Phase: In lazy learning, there is typically no explicit training phase. Instead, the
algorithm simply memorizes the training data.
2. Prediction Phase:
o When a new query instance is received, the algorithm identifies the k nearest neighbors of
that instance from the training data. The "nearest" neighbors are usually determined based
on a distance metric, such as Euclidean distance or cosine similarity.
o Once the nearest neighbors are identified, the algorithm makes predictions by aggregating
the labels or values of those neighbors. For classification tasks, this could involve taking a
majority vote among the class labels of the neighbors (for example, if most of the neighbors
belong to class A, predict class A for the query instance). For regression tasks, this could
involve averaging the target values of the neighbors.
o The value of k (the number of neighbors to consider) is a hyperparameter that needs to be
tuned. A smaller value of k leads to a more flexible model that may capture local patterns

well but might be sensitive to noise, while a larger value of k leads to a smoother decision boundary but
may miss local variations.
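
To make the role of k concrete, here is a minimal k-NN sketch with scikit-learn (assumed available); the data is synthetic and the tested values of k are arbitrary:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 15):  # small k: flexible but noise-sensitive; large k: smoother boundary
    knn = KNeighborsClassifier(n_neighbors=k)  # "training" just stores the data
    knn.fit(X_train, y_train)
    print(k, round(knn.score(X_test, y_test), 3))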

Advantages of lazy learning


1. Flexibility in adapting to changes in the data.
2. Simple implementation and interpretability.
3. Ability to handle complex and non-linear decision boundaries.
4. Effective in small to moderately sized datasets.
5. No training phase, allowing for quick deployment.
Limitations of lazy learning
1. High computational complexity.
2. Large storage requirements.
3. Slow prediction times.
4. Vulnerability to the curse of dimensionality.
5. Sensitivity to noise and outliers.
6. Lack of model representation.
7. Bias towards local patterns.

Examples of Real-World Lazy Learning Applications


1. Recommendation Systems: Recommending movies or products based on similar users'
preferences.
2. Spam Detection: Classifying emails as spam or non-spam based on similarity to previously
classified emails.
3. Medical Diagnosis: Classifying patients into disease categories based on symptoms and medical
history.
4. Anomaly Detection: Identifying unusual behavior in datasets, such as detecting fraudulent
transactions.
5. Handwriting Recognition: Recognizing handwritten characters or digits based on similar
patterns.
6. Predictive Maintenance: Predicting machine failures based on similar past behavior of machines.
7. Natural Language Processing (NLP): Classifying text documents or sentiment analysis based
on similar language usage patterns.


K-Nearest Neighbour (KNN) Prediction


1. K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised
learning technique.
2. The K-NN algorithm assumes similarity between the new case/data and the available cases and puts
the new case into the category that is most similar to the available categories.
3. The K-NN algorithm stores all the available data and classifies a new data point based on similarity.
This means that when new data appears, it can easily be classified into a well-suited category using
the K-NN algorithm.
4. The K-NN algorithm can be used for regression as well as classification, but it is mostly used for
classification problems.
5. K-NN is a non-parametric algorithm, which means it does not make any assumption about the
underlying data.
6. It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and, at the time of classification, performs an action on
the dataset.
7. At the training phase, the KNN algorithm just stores the dataset, and when it gets new data it
classifies that data into the category most similar to the new data.

Why do we need a K-NN Algorithm?


Suppose there are two categories, Category A and Category B, and we have a new data point x1; in
which of these categories does this data point lie? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.


How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:
 Step-1: Select the number K of neighbors.
 Step-2: Calculate the Euclidean distance from the new data point to the training data points.
 Step-3: Take the K nearest neighbors as per the calculated Euclidean distances.
 Step-4: Among these K neighbors, count the number of data points in each category.
 Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
 Step-6: Our model is ready.

Suppose we have a new data point that we need to put into the required category.

 Firstly, we will choose the number of neighbors; here we choose k = 5.
 Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is
the distance between two points, which we have already studied in geometry. For two points
(x1, y1) and (x2, y2) it can be calculated as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)


 By calculating the Euclidean distances we find the nearest neighbors: three nearest neighbors in
Category A and two nearest neighbors in Category B.
 Since the majority (3 of 5) of the nearest neighbors are from Category A, this new data point must
belong to Category A.
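
The steps above can also be written out directly; this from-scratch sketch (point coordinates and labels are invented) reproduces the 3-versus-2 vote described in the example:

import math
from collections import Counter

# Labelled training points: ((x, y), category)
training = [((1, 2), "A"), ((2, 3), "A"), ((3, 3), "A"),
            ((6, 5), "B"), ((7, 7), "B"), ((8, 6), "B")]

def euclidean(p, q):
    # Step 2: distance between two points.
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def knn_predict(query, k=5):
    # Step 3: take the k nearest neighbors by Euclidean distance.
    nearest = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    # Steps 4-5: count categories among the neighbors and take the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((3, 4), k=5))  # "A": 3 of the 5 nearest neighbors are in category A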


Advantages of KNN Algorithm:


 It is simple to implement.
 It is robust to noisy training data.
 It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


 The value of K always needs to be determined, which may be complex at times.
 The computation cost is high because the distance between the new data point and all the
training samples must be calculated.

Prediction – Accuracy, Precision and Recall


This section explains the concepts of accuracy, precision, and recall in the context of prediction,
especially within the scope of classification problems.
1. Accuracy
Definition: Accuracy is the ratio of correctly predicted instances to the total instances.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

Interpretation: Accuracy tells us how often the classifier is correct overall.

2. Precision
Definition: Precision is the ratio of correctly predicted positive observations to the total predicted
positives.
Formula: Precision = TP / (TP + FP)

Interpretation: Precision answers the question: "What proportion of positive identifications was actually
correct?"

3. Recall (Sensitivity or True Positive Rate)


Definition: Recall is the ratio of correctly predicted positive observations to all the observations in the
actual positive class.
Formula: Recall = TP / (TP + FN)


Interpretation: Recall answers the question: "What proportion of actual positives was identified
correctly?"

Relationships and Trade-offs:


 Precision vs. Recall: There is often a trade-off between precision and recall. Increasing
precision typically reduces recall and vice versa. A high precision score means that the model
has a low false positive rate, while a high recall score means that the model has a low false
negative rate. The balance between these two depends on the specific context and what is more
critical for the application.
 F1 Score: The F1 Score is the harmonic mean of precision and recall and provides a single metric
that balances both concerns.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)

Confusion Matrix
To make these concepts clearer, consider the confusion matrix, which is a summary of prediction results
on a classification problem.

Actual \ Predicted Positive (P) Negative (N)

Positive (P) TP FN

Negative (N) FP TN

 True Positives (TP): Correctly predicted positive instances.


 True Negatives (TN): Correctly predicted negative instances.
 False Positives (FP): Incorrectly predicted positive instances (Type I error).
 False Negatives (FN): Incorrectly predicted negative instances (Type II error).
Example Calculation
Suppose we have a confusion matrix for a binary classifier:

Actual \ Predicted Positive (P) Negative (N)

Positive (P) 50 10

Negative (N) 5 35

 TP = 50
 TN = 35
 FP = 5
 FN = 10


From these values:

Accuracy = (50 + 35) / (50 + 35 + 5 + 10) = 85 / 100 = 0.85
Precision = 50 / (50 + 5) = 50 / 55 ≈ 0.909
Recall = 50 / (50 + 10) = 50 / 60 ≈ 0.833
F1 Score = 2 × (0.909 × 0.833) / (0.909 + 0.833) ≈ 0.870

These metrics provide a comprehensive evaluation of a model's performance in classification tasks.
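
A quick sketch to verify these numbers with plain Python arithmetic (scikit-learn provides equivalent functions such as accuracy_score, precision_score, recall_score and f1_score):

TP, TN, FP, FN = 50, 35, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.85 0.909 0.833 0.87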

