Chapter 6: Classification and Prediction
This chapter focuses on techniques used to classify data and make predictions based on learned models.
The main topics include:
● Bayesian Classification
● Instance-Based Methods
● Classification Accuracy
Bayesian classification is a probabilistic approach used in machine learning and statistics. Key points are:
1. Probabilistic learning:
○ It calculates explicit probabilities for a hypothesis.
○ Useful for practical learning problems where probabilistic approaches provide reliable results.
2. Incremental:
○ Bayesian methods allow incremental learning.
○ With each training sample, the probability of a hypothesis being correct can be updated.
3. Probabilistic prediction:
○ Allows multiple hypotheses, weighted based on their probabilities.
4. Standard:
○ Even when computationally intractable, Bayesian methods provide a standard of optimal decision-making against which other methods can be measured.
Key Takeaways:
● Bayesian classification assigns a sample to the class with the highest posterior probability.
● It simplifies to maximizing the product of likelihood P(X∣Ci) and prior P(Ci).
● Computing likelihood directly can be difficult in practice.
Advantages
1. Easy to implement:
○ The algorithm is straightforward and simple to apply.
2. Good results in most cases:
○ Despite its assumptions, the Naïve Bayesian Classifier performs well in many practical
applications.
Disadvantages
1. Class conditional independence assumption:
○ Naïve Bayes assumes attributes are independent given the class; this rarely holds in practice, so accuracy is lost when variables are dependent.
To address dependencies:
Bayesian Networks
Overview:
Key Points:
1. Graphical Model:
○ Represents causal relationships among variables.
○ Nodes: Random variables.
○ Links: Dependencies (arrows indicate relationships).
2. Features:
○ Specifies joint probability distributions.
○ Handles dependencies between variables.
○ Has no loops or cycles (acyclic graph).
3. Example:
○ X and Y are parents of Z.
○ Y is the parent of P.
○ There is no direct dependency between Z and P.
Learning Bayesian Networks
Instance-Based Methods
● Definition:
Instance-based learning stores training examples and delays processing (lazy evaluation) until a
new instance needs to be classified.
● Key Feature:
The method compares a new instance to existing training examples to make a classification decision.
1. Representation:
○ Instances are treated as points in a Euclidean space (geometric space).
2. How it Works:
○ Given a new instance, calculate the distance between the new instance and all training
examples.
○ Find the k closest neighbors (based on Euclidean distance).
○ Assign the class based on the majority class of the nearest neighbors.
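The three steps above can be sketched directly in code. A minimal k-NN classifier over points in Euclidean space (the training points and labels are made up for illustration):

```python
import math
from collections import Counter

def knn_classify(new_point, training, k=3):
    """training: list of (point, label) pairs.
    Returns the majority class among the k nearest neighbors."""
    # distance from the new instance to every training example
    dists = sorted((math.dist(new_point, p), label) for p, label in training)
    # majority vote over the k closest
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify((1.5, 1.5), training, k=3))  # -> A
```

Sorting all distances is O(n log n) per query; for large training sets, spatial indexes such as k-d trees are used instead.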
Concept:
MBR mirrors human reasoning by identifying similar examples from the past and applying what is
known/learned to solve a new problem.
Applications of MBR
1. Fraud Detection:
○ Identify fraudulent activities by comparing with known cases.
2. Customer Response Prediction:
○ Predict customer behavior based on historical patterns.
3. Medical Treatments:
○ Suggest treatments by matching symptoms with similar past cases.
4. Classifying Responses:
○ Process free-text responses and assign codes, often used in natural language processing
tasks.
Key Concept
● k-NN is robust to noisy data because averaging over the k nearest neighbors smooths the impact of outliers.
Summary
MBR Challenges
Weighted k-NN
Weighing Variables
1. Purpose:
○ Similar to clustering, weighing variables helps address relevance during scaling or
standardization.
2. Need:
○ Some variables require higher weights (e.g., standardized income vs. standardized age).
3. Outcome:
○ Balances the impact of variables during distance-based methods.
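Variable weighting can be folded directly into the distance function. A minimal sketch — the weight values here are illustrative placeholders, not recommendations:

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with a per-variable weight to balance each
    variable's influence (e.g. after standardization)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# e.g. down-weight the second variable relative to the first
a, b = (0.2, 1.5), (0.4, 0.5)
print(weighted_distance(a, b, weights=(1.0, 0.25)))
```

Setting a variable's weight to zero removes it from the distance entirely, which is one way to drop irrelevant attributes.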
Curse of Dimensionality
1. Problem:
○ High-dimensional data can cause distances to be dominated by irrelevant attributes.
2. Solutions:
○ Stretch axes or eliminate irrelevant attributes.
○ Use different scaling factors.
○ Cross-validation to determine the best factor.
3. Key Strategy:
○ Assign small weights to irrelevant variables, eliminating them (weight zero) if necessary.
Lazy Learning
1. Concept:
○ No explicit training phase; all historical data acts as the training set.
○ Minimal training cost.
2. Drawbacks:
○ Classifying a new instance requires real-time computation, which can be time-consuming.
3. Key Operation:
○ Finding the k-nearest neighbors for new observations.
MBR Strengths
1. Advantages:
○ Uses data "as is" with distance functions and combination functions.
○ Adapts easily with new data.
○ Produces results without lengthy training.
2. Real-Time Utility:
○ Efficient for applications requiring constant updates.
Holdout Estimation
● Concept:
○ When data is limited, the holdout method splits the dataset into:
■ Training Set: Builds the model.
■ Testing Set: Evaluates model performance.
○ Typical split: 1/3 for testing and the rest for training.
● Problem:
○ Samples may not be representative.
○ Example: A class might be missing in the test set.
● Advanced Method:
○ Stratification: Ensures balanced representation of each class in both training and test sets.
Repeated Holdout
● Definition: The holdout estimate can be made more reliable by repeating the process multiple times with different subsamples of the data.
● Steps:
○ In each iteration, a portion of data is randomly selected for training.
■ Stratification: Ensures equal representation of each class in samples.
○ Error rates from different iterations are averaged to calculate an overall error rate.
● Issues:
○ Test Set Overlap: Repeated random selection may lead to overlapping test sets.
○ Question: Can overlapping be avoided?
Cross-Validation
1. Definition:
○ Cross-validation avoids overlapping test sets by systematically splitting the data into subsets.
○ It is an improvement over the simple holdout method.
2. Steps:
○ Step 1: Split data into k subsets of equal size.
○ Step 2: Use each subset in turn for testing, while the remaining subsets are used for training.
○ This is called k-fold cross-validation.
3. Key Points:
○ Stratification ensures equal class proportions in each fold.
○ The error estimates from each iteration are averaged to produce the overall error estimate.
More on Cross-Validation
Cross-Validation Example
1. Scenario:
○ Data size: 1000 records.
○ k = 10 folds:
■ Randomize data to eliminate biases.
■ Split data into 10 equal subsets (folds) of 100 records each.
2. Process:
○ Fold 1: Used as test set; remaining 9 folds are for training.
○ Fold 2: Used as test set; remaining 9 folds for training.
○ Repeat this process until all folds have been used for testing.
3. Result:
○ Each record gets tested once.
○ Final error estimate is the average of all fold results.
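The fold construction in this example can be sketched as below (index generation only; training and evaluation are left as comments):

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle record indices and split them into k equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # randomize to eliminate ordering bias
    fold_size = n // k
    return [idx[i * fold_size:(i + 1) * fold_size] for i in range(k)]

folds = kfold_indices(1000, 10)
for test_fold in folds:
    train = [j for f in folds if f is not test_fold for j in f]
    # train the model on `train`, evaluate on `test_fold`, record the error
print(len(folds), len(folds[0]))  # -> 10 100
```

Every record lands in exactly one fold, so each record is tested exactly once and the fold errors can be averaged into the overall estimate.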
Leave-One-Out Cross-Validation (LOOCV)
1. Definition:
○ A specific form of cross-validation where the number of folds equals the number of training
instances.
○ For n training instances, the classifier is trained n times.
2. Key Points:
○ Makes the best use of data.
○ No random subsampling is involved.
○ It is computationally expensive because the model is trained multiple times.
■ Exception: Nearest Neighbor (NN) methods.
The Bootstrap
1. Definition:
○ A resampling technique where instances are selected with replacement.
○ Bootstrap creates multiple datasets from the original data for training/testing.
2. Process:
○ Sample n times with replacement from the original dataset of n instances to form a new training set.
○ Instances not selected form the test set.
3. Difference from Cross-Validation:
○ Cross-validation uses without replacement sampling, while bootstrap uses with replacement.
1. Concept:
○ A particular instance has a probability of (1 − 1/n) of not being picked in a single draw.
○ Over n draws with replacement, the probability of never being picked is (1 − 1/n)^n, which for large n approaches e⁻¹ ≈ 0.368.
2. Implications:
○ Around 63.2% of the distinct instances appear in the training data.
○ The remaining 36.8% of the instances form the test data.
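The 0.368 figure is easy to verify numerically, since (1 − 1/n)^n converges to e⁻¹:

```python
import math

n = 1000
# probability that a given instance is in none of the n bootstrap draws
p_never_picked = (1 - 1 / n) ** n
print(round(p_never_picked, 4), round(math.exp(-1), 4))  # both ≈ 0.368
```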
Example of Bootstrap
1. Dataset Size:
○ The original dataset has 1000 observations.
2. Process:
○ Create a training set by sampling with replacement 1000 times.
○ The size of the training set remains 1000, but:
■ Some observations appear multiple times.
■ Some observations do not appear at all.
3. Test Set:
○ Observations not appearing in the training set form the test set.
○ The size of the test set is variable.
Bagging
1. Definition:
○ Bagging stands for Bootstrap Aggregating.
○ It improves accuracy by reducing variance in model predictions.
2. Steps:
○ Generate t bootstrap samples (with replacement) from the original dataset.
○ Train a new classifier C_t on each sample S_t.
○ For classification problems:
■ Combine predictions using the majority vote rule.
○ For regression problems:
■ Combine results by averaging predictions.
3. Outcome:
○ The final classifier C* is an aggregated version of all the individual classifiers.
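The bagging loop for classification can be sketched as below. The 1-nearest-neighbour base learner and the 1-D data are invented for illustration; any classifier could be plugged in as `train_fn`:

```python
import random
from collections import Counter

def bagging_predict(x, dataset, train_fn, t=10, seed=0):
    """Train t classifiers on bootstrap samples and combine by majority vote."""
    rng = random.Random(seed)
    n = len(dataset)
    votes = []
    for _ in range(t):
        # bootstrap sample S_t: n draws with replacement
        sample = [dataset[rng.randrange(n)] for _ in range(n)]
        classifier = train_fn(sample)          # C_t
        votes.append(classifier(x))
    return Counter(votes).most_common(1)[0][0]  # majority vote -> C*

# illustrative base learner: 1-nearest neighbour on 1-D points
def train_fn(sample):
    return lambda x: min(sample, key=lambda r: abs(r[0] - x))[1]

data = [(1.0, "A"), (1.2, "A"), (5.0, "B"), (5.3, "B")]
print(bagging_predict(1.1, data, train_fn))
```

For regression, the final line would average the individual predictions instead of taking a majority vote.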
Boosting
1. Definition:
○ Boosting builds multiple classifiers sequentially, where each classifier pays more attention to
misclassified examples.
2. Steps:
○ Learn a series of classifiers.
○ Misclassified examples in each iteration receive higher weight for the next classifier.
3. Characteristics:
○ Boosting works well with decision trees or Bayesian classifiers.
○ Requires linear time and constant space.
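The reweighting step at the heart of boosting can be sketched as below. This is a simplified scheme (misclassified examples simply get a fixed boost factor); AdaBoost derives the factor from each classifier's error rate:

```python
def reweight(weights, misclassified, factor=2.0):
    """Increase the weight of misclassified examples, then renormalize
    so the weights again sum to 1."""
    new = [w * factor if i in misclassified else w
           for i, w in enumerate(weights)]
    total = sum(new)
    return [w / total for w in new]

weights = [0.25, 0.25, 0.25, 0.25]
weights = reweight(weights, misclassified={3})
print(weights)  # example 3 now carries twice the weight of each other example
```

The next classifier in the series trains against these weights, so it concentrates on the examples the previous one got wrong.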
Key Points
Evaluating Numeric Prediction
Characteristics:
4. Relative Error:
● Definition: Expresses errors as a percentage of the target value, useful for understanding errors in
relative terms.
● Example:
○ If an error of 50 occurs while predicting a value of 500, the relative error is 50/500 = 10%.
Key Differences:
1. MSE and RMSE: Both emphasize large errors due to squaring, but RMSE is expressed in the same units as the target, making it easier to interpret.
2. MAE: Less sensitive to large outliers as it does not square the errors.
3. Relative Error: Provides a percentage-based understanding of error magnitude, useful for scaled
evaluation.
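The four measures compared above can be computed side by side (the actual/predicted values below are made up to echo the relative-error example):

```python
import math

def error_metrics(actual, predicted):
    """Return (MSE, RMSE, MAE, mean relative error) for paired values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n            # squares emphasize large errors
    rmse = math.sqrt(mse)                           # back in the target's units
    mae = sum(abs(e) for e in errors) / n           # no squaring: outlier-tolerant
    rel = sum(abs(e) / abs(a)                       # error as a fraction of target
              for e, a in zip(errors, actual)) / n
    return mse, rmse, mae, rel

mse, rmse, mae, rel = error_metrics([500, 200], [450, 210])
print(mse, rmse, mae, rel)
```

On this toy data the errors are 50 and −10, so MSE is dominated by the larger error while MAE treats both linearly.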
Definition
A lift chart is a visual tool used to compare the performance of predictive models or decisions. It shows the
"lift," or improvement, of targeting a subset of the population (using a model or strategy) versus random
selection.
1. Costs are unknown: In practice, you may not always have exact cost figures for decision-making.
2. Comparing Scenarios: Instead of relying solely on cost-based analysis, decisions are compared
based on how effective they are relative to the baseline.
Practical Interpretation
● The higher the curve above the baseline, the better the predictive model or decision strategy.
● Decision-makers can visually identify the trade-offs:
○ How much of the population to target.
○ What percentage of responses to expect.
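The lift number itself is just a ratio of response rates; the figures below are invented for illustration:

```python
def lift(targeted_response_rate, baseline_response_rate):
    """Lift = response rate in the targeted subset divided by the overall
    (random-selection) response rate."""
    return targeted_response_rate / baseline_response_rate

# model targets the top-scored customers: 30% of them respond,
# versus 5% of the population under random selection
print(lift(0.30, 0.05))  # ≈ 6x improvement over the baseline
```

A full lift chart repeats this calculation at each cut-off (top 10%, top 20%, ...) and plots the cumulative responses captured against the fraction of the population targeted.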