
Chapter 6: Classification and Prediction

This chapter focuses on techniques used to classify data and make predictions based on learned models.
The main topics include:

● Bayesian Classification
● Instance-Based Methods
● Classification Accuracy

Bayesian Classification: Why?

Bayesian classification is a probabilistic approach used in machine learning and statistics. Key points are:

1. Probabilistic learning:
○ It calculates explicit probabilities for a hypothesis.
○ Useful for practical learning problems where probabilistic approaches provide reliable results.
2. Incremental:
○ Bayesian methods allow incremental learning.
○ With each training sample, the probability of a hypothesis being correct can be updated.
3. Probabilistic prediction:
○ Allows multiple hypotheses, weighted based on their probabilities.
4. Standard:
○ Even when computationally intractable, Bayesian methods provide a standard of optimal decision-making against which other methods can be measured.
Key Takeaways:

● Bayesian classification assigns a sample to the class with the highest posterior probability.
● It simplifies to maximizing the product of likelihood P(X∣Ci) and prior P(Ci).
● Computing likelihood directly can be difficult in practice.


Naïve Bayesian Classifier: Comments

Advantages

1. Easy to implement:
○ The algorithm is straightforward and simple to apply.
2. Good results in most cases:
○ Despite its assumptions, the Naïve Bayesian Classifier performs well in many practical
applications.

Disadvantages

1. Class Conditional Independence Assumption:


○ Assumes that the features (attributes) are conditionally independent given the class.
○ In real-world scenarios, this assumption often does not hold, reducing accuracy.
2. Dependencies Between Variables:
○ Practically, dependencies exist among variables.
○ Example:
■ Hospitals: Patient profiles depend on age, family history, symptoms, and diseases.
■ Dependencies such as fever and cough jointly indicating a disease like lung cancer cannot be captured by the Naïve Bayesian Classifier.
3. Inability to Model Dependencies:
○ The classifier fails to account for interdependencies among variables.

Solution: Bayesian Belief Networks

To address dependencies:

● Use Bayesian Belief Networks.

Bayesian Networks

Overview:

● A Bayesian Belief Network allows a subset of variables to be conditionally independent.

Key Points:

1. Graphical Model:
○ Represents causal relationships among variables.
○ Nodes: Random variables.
○ Links: Dependencies (arrows indicate relationships).
2. Features:
○ Specifies joint probability distributions.
○ Handles dependencies between variables.
○ Has no loops or cycles (acyclic graph).
3. Example:
○ X and Y are parents of Z.
○ Y is the parent of P.
○ There is no direct dependency between Z and P.
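For the example network above, the joint distribution factors as P(X, Y, Z, P) = P(X) P(Y) P(Z | X, Y) P(P | Y). A minimal sketch of that factorization using plain Python dictionaries follows; all CPT numbers are made-up illustrations.

```python
# Sketch: joint probability from CPTs for the example network
# X -> Z, Y -> Z, Y -> P (boolean variables; CPT values are hypothetical).
P_X = {True: 0.3, False: 0.7}
P_Y = {True: 0.6, False: 0.4}
P_Z_given_XY = {  # P(Z=True | X, Y)
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.1,
}
P_P_given_Y = {True: 0.8, False: 0.2}  # P(P=True | Y)

def joint(x, y, z, p):
    """P(X=x, Y=y, Z=z, P=p) = P(x) * P(y) * P(z|x,y) * P(p|y)."""
    pz = P_Z_given_XY[(x, y)] if z else 1 - P_Z_given_XY[(x, y)]
    pp = P_P_given_Y[y] if p else 1 - P_P_given_Y[y]
    return P_X[x] * P_Y[y] * pz * pp

print(joint(True, True, True, False))  # one cell of the joint distribution
```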

Learning Bayesian Networks

Several Cases for Learning Bayesian Networks

1. Given Network Structure and All Variables Observable:


○ Only the Conditional Probability Tables (CPTs) need to be learned.
○ This is the simplest case because the structure (relationships) is already provided.
2. Network Structure Known, Some Hidden Variables:
○ Use methods like gradient descent to learn parameters.
○ This approach is similar to training neural networks.
3. Network Structure Unknown, All Variables Observable:
○ Requires searching through the model space to reconstruct the network structure (graph
topology).
○ Algorithms optimize based on observed data to determine the best graph structure.
4. Unknown Structure and All Hidden Variables:
○ This is the most complex case.
○ Currently, no good algorithms exist to effectively solve this problem due to its complexity.

Instance-Based Methods

What are Instance-Based Methods?

● Definition:
Instance-based learning stores training examples and delays processing (lazy evaluation) until a
new instance needs to be classified.
● Key Feature:
The method compares a new instance to existing training examples to make a classification decision.

Key Approach: k-Nearest Neighbor (k-NN)

1. Representation:
○ Instances are treated as points in a Euclidean space (geometric space).
2. How it Works:
○ Given a new instance, calculate the distance between the new instance and all training
examples.
○ Find the k closest neighbors (based on Euclidean distance).
○ Assign the class based on the majority class of the nearest neighbors.
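A minimal k-NN classification sketch, assuming numeric features and Euclidean distance; the 2-D toy points are hypothetical.

```python
# Minimal k-NN classification sketch using Euclidean distance (illustrative).
import math
from collections import Counter

def knn_classify(query, train_points, train_labels, k=3):
    # Distance from the query to every stored training example (lazy evaluation).
    dists = [(math.dist(query, p), label) for p, label in zip(train_points, train_labels)]
    dists.sort(key=lambda t: t[0])
    # Majority vote among the k closest neighbors.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D toy data.
points = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (3.8, 4.0)]
labels = ["A", "A", "B", "B"]
print(knn_classify((1.1, 1.0), points, labels, k=3))  # -> "A"
```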

Memory-Based Reasoning (MBR)

Concept:

MBR mirrors human reasoning by identifying similar examples from the past and applying what is
known/learned to solve a new problem.

1. Examples in Daily Life:


○ Recognizing traffic patterns/routes.
○ Meeting new people by recalling past experiences.
○ Trying new food based on past preferences.
2. Terminology:
In MBR, the similar examples are referred to as neighbors.

Applications of MBR

1. Fraud Detection:
○ Identify fraudulent activities by comparing with known cases.
2. Customer Response Prediction:
○ Predict customer behavior based on historical patterns.
3. Medical Treatments:
○ Suggest treatments by matching symptoms with similar past cases.
4. Classifying Responses:
○ Process free-text responses and assign codes, often used in natural language processing
tasks.

The k-Nearest Neighbor (k-NN) Algorithm

Key Concept

● k-NN is an instance-based learning algorithm where:


○ All training instances are represented as points in an n-dimensional space.
○ The classification or prediction of a new instance is based on the distance to its nearest
neighbors.

Target Functions in k-NN

1. Discrete Target Function (Classification):


○ k-NN returns the most common class (majority vote) among the k-nearest training examples.
○ Example: Predict if an object is "yes" or "no" based on the majority class.
2. Continuous Target Function (Regression/Numerical Prediction):
○ k-NN returns the mean value of the k-nearest neighbors.
○ Example: Predicting a numerical value like temperature, house price, etc.
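The two cases differ only in how the retrieved neighbors' target values are combined. A short sketch of both combination rules, assuming the neighbor values have already been found:

```python
# Combining the k nearest neighbors' target values (sketch).
from collections import Counter

def combine_discrete(neighbor_labels):
    # Classification: majority vote.
    return Counter(neighbor_labels).most_common(1)[0][0]

def combine_continuous(neighbor_values):
    # Regression / numeric prediction: mean of the neighbors' values.
    return sum(neighbor_values) / len(neighbor_values)

print(combine_discrete(["yes", "no", "yes"]))   # -> "yes"
print(combine_continuous([21.0, 23.5, 22.0]))   # -> 22.166...
```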
Robustness to Noise

● k-NN is robust to noisy data because averaging over the k-nearest neighbors smooths the impact of outliers.

Summary

The k-NN algorithm is:

● Non-parametric (no assumptions about the data distribution).


● Lazy learning (stores the training data and delays computation until prediction).
● Used for both classification and regression tasks.

MBR Challenges

1. Choosing Appropriate Historical Data


○ Ensuring relevant and high-quality data is used for training the system.
2. Efficient Representation of Training Data
○ Lazy learning delays computation until prediction.
○ Efficient indexing methods are critical for quick retrieval of neighbors.
3. Choosing the Number of Neighbors (k)
○ Small k:
■ Too noisy; sensitive to outliers.
○ Large k:
■ Approaches the size of the dataset, reducing specificity and leading to the same result
for all cases.
Discussion on the k-NN Algorithm

1. Choosing the Distance Function


○ Common measures:
■ Euclidean Distance: Most widely used.
■ Manhattan Distance: Sum of absolute differences.
2. Impact of Variable Units
○ Units of variables can distort distance calculations.
○ Solution:
■ Scaling/Normalization: Standardize variables to eliminate unit dependency.
■ Methods:
■ z-scores: Scale values to have mean 0 and standard deviation 1.
■ Min-max scaling: Normalize values between 0 and 1.
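A short sketch of both scaling methods with NumPy; the age/income columns are hypothetical values chosen to show how differing units would otherwise dominate the distance.

```python
# Sketch: removing unit effects before computing distances (hypothetical values).
import numpy as np

X = np.array([[35, 72000.0],      # e.g. [age in years, income in dollars]
              [52, 48000.0],
              [29, 95000.0]])

# z-scores: each column scaled to mean 0 and standard deviation 1.
z_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: each column mapped to [0, 1].
minmax_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(z_scaled)
print(minmax_scaled)
```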

Choosing the Combination Function

1. Categorical Target Variables (Classification)


○ Majority Rule: Class with the highest frequency among neighbors.
○ Weighted Voting:
■ Neighbors closer to the query point have higher weights.
2. Numerical Target Variables (Regression)
○ Average: Mean of the values of the k-nearest neighbors.
○ Weighted Average:
■ Assign higher weights to closer neighbors (inverse of the distance).

Weighted k-NN

1. Why Weight Neighbors?


○ Not all neighbors contribute equally.
○ Neighbors closer to the query point should have a greater impact.
2. Weight Formula:
w = 1 / d(x_q, x_i)^2
Here:
○ d(x_q, x_i) is the distance between the query point x_q and the training instance x_i.
3. Distance-Weighted Nearest Neighbor
○ Assign weights inversely proportional to distance.
○ Helps improve prediction accuracy for both:
■ Classification (categorical).
■ Regression (numerical).
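A sketch of distance-weighted k-NN implementing w = 1 / d(x_q, x_i)^2 for both cases; the toy points and the small epsilon guarding against a zero distance are illustrative assumptions.

```python
# Sketch of distance-weighted k-NN with w = 1 / d(x_q, x_i)^2 (illustrative).
import math
from collections import defaultdict

def weighted_knn(query, train_points, train_values, k=3, classify=True):
    dists = sorted((math.dist(query, p), v) for p, v in zip(train_points, train_values))
    neighbors = dists[:k]
    if classify:
        # Weighted voting: closer neighbors count more.
        votes = defaultdict(float)
        for d, label in neighbors:
            votes[label] += 1.0 / (d ** 2 + 1e-12)  # epsilon avoids division by zero
        return max(votes, key=votes.get)
    # Weighted average for numeric targets.
    weights = [1.0 / (d ** 2 + 1e-12) for d, _ in neighbors]
    return sum(w * v for w, (_, v) in zip(weights, neighbors)) / sum(weights)

points = [(1.0, 1.0), (1.5, 1.2), (4.0, 4.0)]
print(weighted_knn((1.1, 1.0), points, ["A", "A", "B"], k=3))                      # classification
print(weighted_knn((1.1, 1.0), points, [10.0, 12.0, 40.0], k=3, classify=False))   # regression
```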

Weighing Variables

1. Purpose:
○ Similar to clustering, weighing variables helps address relevance during scaling or
standardization.
2. Need:
○ Some variables require higher weights (e.g., standardized income vs. standardized age).
3. Outcome:
○ Balances the impact of variables during distance-based methods.

Curse of Dimensionality

1. Problem:
○ High-dimensional data can cause distances to be dominated by irrelevant attributes.
2. Solutions:
○ Stretch axes or eliminate irrelevant attributes.
○ Use different scaling factors.
○ Cross-validation to determine the best factor.
3. Key Strategy:
○ Assign small weights to irrelevant variables, eliminating them (zero weight) if necessary.

Lazy Learning

1. Concept:
○ No explicit training phase; all historical data acts as the training set.
○ Minimal training cost.
2. Drawbacks:
○ Classifying a new instance requires real-time computation, which can be time-consuming.
3. Key Operation:
○ Finding the k-nearest neighbors for new observations.

MBR Strengths

1. Advantages:
○ Uses data "as is" with distance functions and combination functions.
○ Adapts easily with new data.
○ Produces results without lengthy training.
2. Real-Time Utility:
○ Efficient for applications requiring constant updates.


Classification Accuracy

Classification accuracy is a measure of how accurately the model predicts the output label.

Holdout Estimation

● Concept:
○ When data is limited, the holdout method splits the dataset into:
■ Training Set: Builds the model.
■ Testing Set: Evaluates model performance.
○ Typical split: 1/3 for testing and the rest for training.
● Problem:
○ Samples may not be representative.
○ Example: A class might be missing in the test set.
● Advanced Method:
○ Stratification: Ensures balanced representation of each class in both training and test sets.
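A minimal holdout sketch, assuming scikit-learn is available; the feature and label arrays are hypothetical toy data.

```python
# Sketch: holdout split with stratification (scikit-learn assumed available).
from sklearn.model_selection import train_test_split

X = [[0.2], [1.1], [0.9], [3.2], [2.8], [0.4], [2.9], [1.0], [3.0], [0.3]]  # toy features
y = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]                                          # toy labels

# 1/3 held out for testing; stratify=y keeps the class proportions roughly
# equal in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

print(len(X_train), len(X_test))
```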

Repeated Holdout Method

● Definition: The holdout estimate becomes more reliable by repeating the process multiple times
using different subsamples of data.
● Steps:
○ In each iteration, a portion of data is randomly selected for training.
■ Stratification: Ensures equal representation of each class in samples.
○ Error rates from different iterations are averaged to calculate an overall error rate.
● Issues:
○ Test Set Overlap: Repeated random selection may lead to overlapping test sets.
○ Question: Can overlapping be avoided?

Cross-Validation

1. Definition:
○ Cross-validation avoids overlapping test sets by systematically splitting the data into subsets.
○ It is an improvement over the simple holdout method.
2. Steps:
○ Step 1: Split data into k subsets of equal size.
○ Step 2: Use each subset in turn for testing, while the remaining subsets are used for training.
○ This is called k-fold cross-validation.
3. Key Points:
○ Stratification ensures equal class proportions in each fold.
○ The error estimates from each iteration are averaged to produce the overall error estimate.
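A sketch of stratified k-fold cross-validation with averaged error estimates, assuming scikit-learn; the data and the choice of a 3-NN classifier are hypothetical.

```python
# Sketch: stratified k-fold cross-validation (scikit-learn assumed available).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # hypothetical labels

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in skf.split(X, y):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))

# Overall error estimate = average of the per-fold error rates.
print(f"estimated error rate: {np.mean(errors):.3f}")
```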

More on Cross-Validation

1. Stratified Ten-Fold Cross-Validation:


○ Why 10 folds?:
■ Extensive experiments show that 10 folds provide the most accurate estimate.
■ Supported by theoretical evidence.
○ Stratification:
■ Reduces variance by ensuring balanced class representation in each fold.
2. Improvement:
○ Repeated Stratified Cross-Validation:
■ Ten-fold cross-validation is repeated multiple times.
■ Results are averaged to further reduce variance.

Cross-Validation Example

1. Scenario:
○ Data size: 1000 records.
○ k = 10 folds:
■ Randomize data to eliminate biases.
■ Split data into 10 equal subsets (folds) of 100 records each.
2. Process:
○ Fold 1: Used as test set; remaining 9 folds are for training.
○ Fold 2: Used as test set; remaining 9 folds for training.
○ Repeat this process until all folds have been used for testing.
3. Result:
○ Each record gets tested once.
○ Final error estimate is the average of all fold results.
Leave-One-Out Cross-Validation (LOOCV)

1. Definition:
○ A specific form of cross-validation where the number of folds equals the number of training
instances.
○ For n training instances, the classifier is trained n times.
2. Key Points:
○ Makes the best use of data.
○ No random subsampling is involved.
○ It is computationally expensive because the model is trained multiple times.
■ Exception: Nearest Neighbor (NN) methods.
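A short LOOCV sketch, again assuming scikit-learn; the 30-instance dataset is hypothetical and deliberately small, since n models must be trained.

```python
# Sketch: leave-one-out cross-validation (n folds for n instances).
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))                # hypothetical data; LOOCV trains 30 models
y = (X[:, 0] > 0).astype(int)

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])

print(f"LOOCV accuracy: {correct / len(y):.3f}")
```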

The Bootstrap

1. Definition:
○ A resampling technique where instances are selected with replacement.
○ Bootstrap creates multiple datasets from the original data for training/testing.
2. Process:
○ Sample a dataset of n instances repeatedly to create a new training set.
○ Instances not selected form the test set.
3. Difference from Cross-Validation:
○ Cross-validation uses without replacement sampling, while bootstrap uses with replacement.

The 0.632 Bootstrap

1. Concept:
○ A particular instance has a probability of (1 - 1/n) of not being picked in a single draw.
○ Over n draws with replacement, the probability that it is never picked is (1 - 1/n)^n, which approaches e^-1 ≈ 0.368 for large n.
2. Implications:
○ Around 63.2% of the instances are included in the training data.
○ The remaining 36.8% of the instances appear in the test data.

Example of Bootstrap
1. Dataset Size:
○ The original dataset has 1000 observations.
2. Process:
○ Create a training set by sampling with replacement 1000 times.
○ The size of the training set remains 1000, but:
■ Some observations appear multiple times.
■ Some observations do not appear at all.
3. Test Set:
○ Observations not appearing in the training set form the test set.
○ The size of the test set is variable.
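A quick sketch of one bootstrap sample over 1000 observations, checking the roughly 63.2% / 36.8% split empirically with NumPy.

```python
# Sketch: one bootstrap sample from 1000 observations and its out-of-bag test set.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)

# Training set: n draws with replacement (some indices repeat, some never appear).
train_idx = rng.choice(indices, size=n, replace=True)
# Test set: the out-of-bag observations that were never drawn.
test_idx = np.setdiff1d(indices, train_idx)

print(f"unique in training: {len(np.unique(train_idx)) / n:.3f}")  # ~0.632
print(f"out-of-bag (test):  {len(test_idx) / n:.3f}")              # ~0.368
```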

Bagging and Boosting

Bagging

1. Definition:
○ Bagging stands for Bootstrap Aggregating.
○ It improves accuracy by reducing variance in model predictions.
2. Steps:
○ Generate t bootstrap samples (with replacement) from the original dataset.
○ Train a new classifier C_t for each bootstrap sample S_t.
○ For classification problems:
■ Combine predictions using the majority vote rule.
○ For regression problems:
■ Combine results by averaging predictions.
3. Outcome:
○ The final classifier C* is an aggregated version of all the individual classifiers.
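A hand-rolled bagging sketch for the classification case, assuming scikit-learn decision trees as the base learners; the data and t = 25 samples are hypothetical.

```python
# Sketch: bagging by hand -- t bootstrap samples, one classifier each, majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # hypothetical data
y = (X[:, 0] ** 2 + X[:, 1] > 0.5).astype(int)

t = 25
classifiers = []
for _ in range(t):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap sample S_t
    classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x_new):
    votes = np.array([clf.predict(x_new) for clf in classifiers])  # shape (t, n_queries)
    # Majority vote across the t classifiers (mean > 0.5 for binary labels).
    return (votes.mean(axis=0) > 0.5).astype(int)

print(bagged_predict(np.array([[0.0, 1.0], [0.0, -2.0]])))
```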

Boosting

1. Definition:
○ Boosting builds multiple classifiers sequentially, where each classifier pays more attention to
misclassified examples.
2. Steps:
○ Learn a series of classifiers.
○ Misclassified examples in each iteration receive higher weight for the next classifier.
3. Characteristics:
○ Boosting works well with decision trees or Bayesian classifiers.
○ Requires linear time and constant space.

Key Points

1. Boosting improves classification accuracy by:


○ Focusing on misclassified examples in subsequent iterations.
○ Combining multiple weak classifiers into a strong ensemble model.
2. The final hypothesis is a weighted combination of all classifiers, with more accurate classifiers
receiving higher weights.
3. Boosting is iterative and adapts the model during each step.
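A boosting sketch using AdaBoost from scikit-learn (assumed available); by default its weak learner is a depth-1 decision tree, and the toy data here is hypothetical.

```python
# Sketch: boosting with AdaBoost over decision stumps (scikit-learn assumed).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                  # hypothetical data
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Each round re-weights the training data so misclassified examples get more
# attention; the final prediction is a weighted vote of the weak classifiers
# (the default weak learner is a depth-1 decision tree, i.e. a stump).
boosted = AdaBoostClassifier(n_estimators=50).fit(X, y)

print(f"training accuracy: {boosted.score(X, y):.3f}")
```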

Evaluating Numeric Prediction

1. Mean Squared Error (MSE):

● Penalizes larger errors more heavily due to the squaring of differences.
● Easy to manipulate mathematically.
● Widely used for regression problems.

2. Root Mean Squared Error (RMSE):

● Gives an interpretable error in the same scale as the original values.
● Sensitive to outliers (like MSE).

3. Mean Absolute Error (MAE):

● Less sensitive to outliers compared to MSE.
● Represents the average "absolute" error.

4. Relative Error:

● Definition: Expresses errors as a percentage of the target value, useful for understanding errors in
relative terms.
● Example:
○ If an error of 50 occurs while predicting 500, the relative error is 10%.

Key Differences:

1. MSE and RMSE: Both emphasize large errors due to squaring, but RMSE makes the error more
interpretable.
2. MAE: Less sensitive to large outliers as it does not square the errors.
3. Relative Error: Provides a percentage-based understanding of error magnitude, useful for scaled
evaluation.
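A short sketch computing all four measures with NumPy; the actual and predicted values are hypothetical.

```python
# Sketch: computing MSE, RMSE, MAE, and relative error (hypothetical values).
import numpy as np

actual = np.array([500.0, 200.0, 350.0, 80.0])
predicted = np.array([550.0, 190.0, 300.0, 100.0])
errors = predicted - actual

mse = np.mean(errors ** 2)                                    # squares emphasize large errors
rmse = np.sqrt(mse)                                           # same scale as the original values
mae = np.mean(np.abs(errors))                                 # average absolute error
relative = np.mean(np.abs(errors) / np.abs(actual)) * 100     # percentage of the target value

print(f"MSE={mse:.1f}  RMSE={rmse:.1f}  MAE={mae:.1f}  relative error={relative:.1f}%")
```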

Lift Charts: Explanation and Practical Use

Definition

A lift chart is a visual tool used to compare the performance of predictive models or decisions. It shows the
"lift," or improvement, of targeting a subset of the population (using a model or strategy) versus random
selection.

Why Use Lift Charts?

1. Costs are unknown: In practice, you may not always have exact cost figures for decision-making.
2. Comparing Scenarios: Instead of relying solely on cost-based analysis, decisions are compared
based on how effective they are relative to the baseline.

How a Lift Chart Works

1. X-Axis: Proportion of the population targeted (e.g., 10%, 20%, etc.).


2. Y-Axis: Proportion of positive responses achieved (e.g., purchases, clicks, or any positive outcome).
3. Baseline (Random): A straight line representing expected results if no model is used (random
selection).
4. Lift Curve: The curve showing the actual performance of the model.
○ A steeper lift curve means the model is better at identifying promising subsets of the
population.

Practical Interpretation

● The higher the curve above the baseline, the better the predictive model or decision strategy.
● Decision-makers can visually identify the trade-offs:
○ How much of the population to target.
○ What percentage of responses to expect.
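A sketch of the numbers behind a lift chart: sort the population by model score, then compare the cumulative fraction of responders captured against the fraction targeted. The scores and outcomes below are simulated, hypothetical data.

```python
# Sketch: cumulative gains / lift from model scores (hypothetical scores and outcomes).
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1000)                          # model's predicted probability of response
responded = rng.random(1000) < scores * 0.4        # toy outcomes correlated with the scores

order = np.argsort(-scores)                        # target highest-scoring prospects first
cum_responses = np.cumsum(responded[order]) / responded.sum()

for frac in (0.1, 0.2, 0.3, 0.5):
    n_targeted = int(frac * len(scores))
    captured = cum_responses[n_targeted - 1]
    # Lift = fraction of responders captured / fraction of population targeted;
    # the random-selection baseline has lift 1.0 at every point.
    print(f"target {frac:.0%} of population -> capture {captured:.0%} of responders "
          f"(lift {captured / frac:.2f})")
```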
