
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CS3491 & ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

UNIT IV ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING

Combining multiple learners: Model combination schemes, Voting, Ensemble
Learning - bagging, boosting, stacking, Unsupervised learning: K-means,
Instance Based Learning: KNN, Gaussian mixture models and Expectation
maximization.

Combining multiple learners:


Combining multiple learners, often referred to as ensemble
learning, is a popular technique in machine learning. Ensemble methods
can improve predictive performance and robustness compared to individual
models. Here are some common approaches to combining multiple
learners:

1. Voting Classifiers:
 Hard Voting: Each model in the ensemble "votes" for a class, and the
class with the majority of votes is chosen.
 Soft Voting: Each model provides a probability estimate for each
class, and the class with the highest average probability is chosen.
2. Bagging (Bootstrap Aggregating):
 Train multiple instances of the same learning algorithm on different
random subsets of the training data and combine their predictions.
Random Forest is an example where decision trees are combined
using bagging.
3. Boosting:
 Train multiple weak learners sequentially, with each subsequent
learner focusing on the mistakes made by the previous ones.
AdaBoost and Gradient Boosting are popular boosting algorithms.
4. Stacking:
 Train several models and use another model (meta-model or blender)
to combine their predictions. The meta-model takes the outputs of
the base models as input and produces the final prediction.
5. Random Subspace Method:
 Similar to bagging, but instead of using random subsets of the data,
different subsets of features are used for training each model.
6. Ensemble of Different Models:
 Combine predictions from different types of models to benefit from
their diverse strengths. For example, combining decision trees with
support vector machines or neural networks.
7. Weighted Averaging:
 Assign different weights to the predictions of individual models
based on their performance or confidence, and combine them with a
weighted average.
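
To make the voting and weighted-averaging ideas above concrete, here is a
minimal NumPy sketch; the three models, their probability outputs, and the
weights are made up purely for illustration:

import numpy as np

# Hypothetical class-probability outputs of three models for one sample
# (columns: Class A, Class B). The numbers are illustrative only.
probs = np.array([
    [0.8, 0.2],   # model 1
    [0.6, 0.4],   # model 2
    [0.7, 0.3],   # model 3
])

# Hard voting: each model votes for its most probable class and the
# majority class wins.
votes = probs.argmax(axis=1)                   # [0, 0, 0] -> all vote Class A
hard_prediction = np.bincount(votes).argmax()  # index 0 (Class A)

# Weighted averaging: models judged more reliable get larger weights
# before the probabilities are averaged (a soft, weighted combination).
weights = np.array([0.5, 0.2, 0.3])            # assumed model weights
avg_probs = np.average(probs, axis=0, weights=weights)
soft_prediction = avg_probs.argmax()           # index of the highest weighted average

print(hard_prediction, soft_prediction, avg_probs)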

The choice of ensemble method depends on the nature of the problem, the
data, and the base learners being used. Ensemble methods are particularly
effective when the individual models have different strengths and
weaknesses, as they can compensate for each other.

Keep in mind that while ensemble methods can enhance performance, they
also come with the cost of increased computational complexity and training
time.

Model combination schemes:


Model combination schemes refer to the strategies and techniques
used to effectively combine the outputs or predictions of multiple models. These
schemes are essential in ensemble learning and are designed to leverage the
strengths of individual models while mitigating their weaknesses. Here are some
common model combination schemes:

1. Voting Schemes:
 Majority Voting (Hard Voting): The class that receives the majority of votes
from the individual models is chosen as the final prediction.
 Weighted Voting: Assign different weights to the votes of each model,
considering their individual performance.
2. Averaging Schemes:
 Simple Averaging: Combine predictions by taking the average of the outputs
from individual models.
 Weighted Averaging: Assign different weights to the predictions of each
model and calculate a weighted average.
3. Stacking:
 Train a meta-model that takes the outputs of individual models as inputs and
learns to make the final prediction. Stacking involves using the predictions of
base models as features for the meta-model.
4. Boosting Schemes:
 AdaBoost: Combine the predictions of weak learners by assigning different
weights to the training instances based on their classification errors.
 Gradient Boosting: Sequentially train models, with each model focusing on
the mistakes of the previous ones.
5. Bagging Schemes:
 Bootstrap Aggregating (Bagging): Train multiple instances of the same
model on different random subsets of the training data and combine their
predictions.
6. Randomization Schemes:
 Random Forest: A specific form of bagging where decision trees are trained
on random subsets of features, enhancing diversity.
7. Stacked Generalization (Blending):
 Combine predictions from different models using a meta-model, often a
simpler model like linear regression.
8. Dynamic Model Selection:
 Dynamically select the best-performing model for a given instance or
problem. This approach adapts to the characteristics of the data.
9. Bayesian Model Averaging:
 Assign probabilities to different models and combine their predictions based
on these probabilities.
10. Ensemble of Ensembles:
 Combine multiple ensembles of models to create a higher-level ensemble.
This is done to further enhance diversity and improve robustness.
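
As a concrete illustration of the stacking and blending schemes above (items
3 and 7), here is a brief scikit-learn sketch; the synthetic data, the choice
of base models, and the logistic-regression meta-model are assumptions made
only for the example:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models produce predictions; a simple meta-model (blender) learns
# how to combine those predictions into the final one.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print("stacking accuracy:", round(stack.score(X_test, y_test), 3))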


Choosing the appropriate combination scheme depends on the
characteristics of the data, the individual models involved, and the specific
goals of the ensemble learning task. Experimentation and cross-validation
are often used to determine the most effective combination strategy for a
given problem.

Voting, Ensemble Learning:
Voting in the context of ensemble learning involves combining the
predictions of multiple individual models to make a final decision. It is a
common technique used in ensemble methods, and it can be categorized
into two main types: Hard Voting and Soft Voting.

1. Hard Voting:
 In hard voting, each model in the ensemble "votes" for a specific
class, and the class that receives the majority of votes is chosen as the
final prediction.
 This approach is effective when the individual models are diverse and
can complement each other's strengths and weaknesses.
 Hard voting is often used in classification tasks.
Example:
 Model 1 predicts Class A
 Model 2 predicts Class B
 Model 3 predicts Class A
The final prediction by hard voting would be Class A.
2. Soft Voting:
 In soft voting, each model provides a probability estimate for each
class, and the final prediction is based on the average probability for
each class across all models.
 This approach is useful when the models can output probability
scores, allowing for a more nuanced decision-making process.
 Soft voting is also commonly used in classification tasks.
Example:
 Model 1 predicts: Class A (0.8), Class B (0.2)
 Model 2 predicts: Class A (0.6), Class B (0.4)
 Model 3 predicts: Class A (0.7), Class B (0.3)
Averaging the probabilities gives Class A (0.8 + 0.6 + 0.7) / 3 = 0.7 and
Class B 0.3, so the final prediction by soft voting is Class A, the class
with the highest average probability.
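
Both voting styles can be reproduced with scikit-learn's VotingClassifier; a
minimal sketch, in which the synthetic dataset and the three base models are
arbitrary choices for the example:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(random_state=1)),
]

# Hard voting: majority of the predicted class labels.
hard_vote = VotingClassifier(estimators=estimators, voting="hard").fit(X, y)

# Soft voting: average of the predicted class probabilities.
soft_vote = VotingClassifier(estimators=estimators, voting="soft").fit(X, y)

print(hard_vote.predict(X[:5]))
print(soft_vote.predict_proba(X[:5]).round(2))
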
Ensemble learning, on the other hand, is a broader concept that involves
building and combining multiple models to achieve better overall
performance than individual models. Voting is just one technique within the
ensemble learning paradigm. Other ensemble methods include bagging,
boosting, stacking, and more.

Common types of ensembles include:

 Bagging (Bootstrap Aggregating): Combining predictions from multiple
instances of the same model trained on different subsets of the training
data (e.g., Random Forest).
 Boosting: Sequentially training multiple weak learners, with each
subsequent model focusing on the mistakes made by the previous ones
(e.g., AdaBoost, Gradient Boosting).
 Stacking: Combining predictions from multiple models using another
model (meta-model) that learns to make the final prediction based on the
outputs of the base models.

Ensemble learning is a powerful technique to improve the overall
robustness and generalization of machine learning models, especially when
individual models may have limitations or biases.

Bagging, boosting, stacking, Unsupervised learning:

Bagging, boosting, stacking, and unsupervised learning are all distinct concepts in
the field of machine learning. Let's briefly cover each of them:

1. Bagging (Bootstrap Aggregating):
 Definition: Bagging is an ensemble learning technique that involves training
multiple instances of the same learning algorithm on different random subsets
(with replacement) of the training data.
 Purpose: It helps reduce overfitting and variance, as the combination of
multiple models can provide a more stable and accurate prediction.
 Example: Random Forest is a popular ensemble method that uses bagging to
train multiple decision trees and combine their predictions.
2. Boosting:
 Definition: Boosting is an ensemble learning technique that trains multiple
weak learners sequentially. Each subsequent model focuses on the mistakes
made by the previous ones, with the goal of improving overall predictive
performance.
 Purpose: Boosting is effective for reducing bias and improving the accuracy
of the ensemble. It gives more weight to instances that are misclassified by
earlier models.
 Examples: AdaBoost (Adaptive Boosting) and Gradient Boosting are common
boosting algorithms.
3. Stacking:
 Definition: Stacking, or stacked generalization, involves training multiple
models and using another model (meta-model or blender) to combine their
predictions.
 Purpose: Stacking aims to capture the strengths of individual models and
create a meta-model that can make more accurate predictions.
 Process: Base models produce predictions, and a meta-model is trained using
these predictions as features. The meta-model learns to combine the base
models effectively.
 Example: A combination of decision trees, support vector machines, and
neural networks can be stacked together.
4. Unsupervised Learning:
 Definition: Unsupervised learning is a type of machine learning where the
algorithm is given input data without explicit output labels. The system tries to
learn the patterns, relationships, or structure within the data.
 Purpose: Unsupervised learning is used for tasks such as clustering,
dimensionality reduction, and density estimation. It helps find hidden patterns
in data.
 Examples:
 Clustering: K-Means clustering, Hierarchical clustering.
 Dimensionality Reduction: Principal Component Analysis (PCA), t-
distributed Stochastic Neighbor Embedding (t-SNE).
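
To make items 1 and 2 above concrete, here is a minimal scikit-learn sketch
of bagging and boosting (stacking was sketched in an earlier section); the
synthetic dataset and the hyperparameters are illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=12, random_state=0)

# Bagging: by default, many decision trees, each trained on a bootstrap
# sample (a random subset drawn with replacement) of the training data.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Random Forest: bagging plus random feature subsets at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting: weak learners trained sequentially, each one paying more
# attention to the examples the previous ones misclassified.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("random forest", forest), ("adaboost", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))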

K-means:
K-means is a popular unsupervised machine learning algorithm used
for clustering data. It falls under the category of partitioning methods and
is widely used for grouping similar data points into clusters. Here's an
overview of how K-means works:

K-means Clustering Algorithm:


1. Initialization:
 Choose the number of clusters (K) you want to identify in the data.
 Randomly initialize K cluster centroids in the feature space.
2. Assignment Step:
 Assign each data point to the nearest centroid, based on some
distance metric (commonly Euclidean distance).
3. Update Step:
 Recalculate the centroid of each cluster by taking the mean of all the
data points assigned to that cluster.
4. Iteration:
 Repeat steps 2 and 3 until convergence criteria are met. Convergence
can be reached when the centroids do not change significantly
between iterations or when a predetermined number of iterations is
reached.
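
The four steps above can be written down in a few lines. A minimal NumPy
sketch, assuming Euclidean distance, randomly generated two-dimensional toy
data, and a fixed iteration cap:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K data points at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: label each point with its nearest centroid
        #    (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Iterate until the centroids stop moving (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)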

Key Concepts:
 Centroids: The center point of each cluster, representing the mean of the
data points assigned to that cluster.
 Euclidean Distance: The most common distance metric used to measure
the similarity between data points and cluster centroids. Other distance
metrics can also be used based on the nature of the data.
 Sum of Squared Distances (Inertia): The objective of K-means is to
minimize the sum of squared distances between data points and their
assigned cluster centroids. This is often referred to as the inertia or within-
cluster sum of squares.

Applications:
1. Image Compression:
 Reduce the number of colors in an image by clustering similar pixel
values together.
2. Customer Segmentation:
 Identify groups of customers with similar purchasing behavior for
targeted marketing.
3. Anomaly Detection:
 Detect unusual patterns or outliers in datasets.
4. Document Classification:
 Group similar documents together based on their content.
5. Genomic Data Analysis:
 Identify patterns in gene expression data.

Limitations and Considerations:


 Sensitive to Initial Centroid Positions: K-means results can depend on
the initial placement of centroids. Multiple runs with different initializations
are often performed, and the best result is chosen.
 Assumes Spherical Clusters of Similar Sizes: K-means works well when
clusters are roughly spherical and have similar sizes. It may struggle with
elongated or irregularly shaped clusters.
 Requires Predefined Number of Clusters (K): The user needs to specify
the number of clusters in advance, which might not always be known.
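
The initialization issue above is commonly handled by running the algorithm
several times and keeping the lowest-inertia result, which scikit-learn's
KMeans does through its n_init parameter. A short sketch with illustrative
data and parameter values:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with three well-separated clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 runs K-means from 10 different random initializations and
# keeps the run with the lowest inertia (within-cluster sum of squares).
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("inertia:", round(km.inertia_, 2))
print("cluster sizes:", [int((km.labels_ == j).sum()) for j in range(3)])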

Despite these limitations, K-means remains a widely used and efficient
algorithm for clustering tasks in various domains.

Instance Based Learning:

Instance-Based Learning, also known as Instance-Based Methods or
Memory-Based Learning, is a type of machine learning where the system
learns directly from specific examples in the training data. Instead of
explicitly building a model, the algorithm memorizes the training instances
and makes predictions based on the similarity between new instances and
those stored in the training set. Two common types of instance-based
learning methods are k-Nearest Neighbors (k-NN) and Case-Based
Reasoning (CBR).

1. k-Nearest Neighbors (k-NN):


 Operation:
 Given a new instance to classify, k-NN identifies the k training
instances closest to it in the feature space.
 The class label of the majority of these k neighbors is assigned to the
new instance.
 Parameters:
 k: The number of nearest neighbors to consider. This parameter
influences the smoothness of the decision boundary.
 Characteristics:
 Non-parametric: No assumptions about the underlying data
distribution.
 Lazy Learning: No explicit training phase; the model is built at
prediction time.
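
The lazy, memory-based character of k-NN is visible in a few lines: nothing
is fitted up front, and all the work happens when a query arrives. A minimal
NumPy sketch, assuming Euclidean distance, k = 3, and a tiny made-up training
set:

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    # No training phase: just measure distances to every stored example.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]        # indices of the k closest points
    votes = y_train[nearest]                   # their class labels
    return np.bincount(votes).argmax()         # majority vote

# Tiny illustrative training set: two classes in 2-D.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.2])))  # expected: class 0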

2. Case-Based Reasoning (CBR):


 Operation:
 CBR solves new problems by retrieving and adapting solutions from
similar problems encountered in the past.
 It involves a cycle of four steps: Retrieve, Reuse, Revise, and Retain.
 Steps:
1. Retrieve: Find similar cases in the memory based on the current
problem.
2. Reuse: Apply the solution from the retrieved cases to the current
problem.
3. Revise: If necessary, modify the solution to fit the current context.
4. Retain: Store the new solution for future use.
 Characteristics:
 Adaptable to changing environments.
 Well-suited for problem-solving in dynamic and uncertain domains.

Key Considerations for Instance-Based Learning:


 Memory Requirements: The method relies on storing the entire training
dataset, which can be computationally expensive for large datasets.
 Distance Metric: The choice of distance metric (Euclidean distance,
Manhattan distance, etc.) plays a crucial role in determining similarity
between instances.
 Sensitivity to Feature Scaling: Feature scaling is often necessary to ensure
that all features contribute equally to the distance calculation.
 Impact of Outliers: Outliers can significantly affect predictions, especially
in k-NN where distances are crucial.
 Computational Efficiency: While k-NN is simple and intuitive, it may
become computationally expensive as the dataset size increases.

Instance-Based Learning methods are particularly useful when the
underlying structure of the data is complex or not well understood. They
provide a flexible and adaptive approach to learning patterns in the
absence of a predefined model. However, their computational requirements
and sensitivity to certain parameters should be considered based on the
characteristics of the data.

KNN:
K-Nearest Neighbors (k-NN) is a popular instance-based learning
algorithm used for both classification and regression tasks in machine
learning. It is a simple and intuitive method that makes predictions based
on the majority class (for classification) or the average of neighboring
values (for regression) of the k nearest data points in the feature space.
Here's an overview of how k-NN works:

K-Nearest Neighbors Algorithm:


1. Training:
 Store all training examples.
2. Prediction:
 Given a new, unseen instance to classify (or predict for regression),
find the k training instances closest to it in the feature space. The
proximity is often measured using distance metrics like Euclidean
distance.
3. Classification:
 For classification, assign the class label that is most frequent among
the k nearest neighbors. This is often done by majority voting.
4. Regression:
 For regression, predict the average value of the target variable
among the k nearest neighbors.
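
A short scikit-learn sketch of these steps; the Iris dataset, k = 5, and the
scaling step are illustrative choices rather than requirements:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale first, because k-NN distances are sensitive to feature scale;
# then classify a query by majority vote among its 5 nearest neighbours.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)  # "training" just stores the (scaled) examples
print("test accuracy:", round(knn.score(X_test, y_test), 3))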

Key Parameters:
 k (Number of Neighbors):
 The choice of the parameter k influences the smoothness of the
decision boundary. Smaller values of k lead to a more flexible model
but can be sensitive to noise, while larger values of k result in
smoother decision boundaries but might miss local patterns.
 Distance Metric:
 Common distance metrics include Euclidean distance, Manhattan
distance, Minkowski distance, etc. The choice of distance metric
depends on the nature of the data.

Characteristics and Considerations:


 Non-Parametric:
 k-NN is a non-parametric algorithm, meaning it doesn't make
assumptions about the underlying data distribution. The model is
built during prediction time.
 Lazy Learning:
 k-NN is considered a lazy learning algorithm because it doesn't
explicitly build a model during training. The model is constructed at
prediction time based on the stored training instances.
 Impact of Feature Scaling:
 Feature scaling is often necessary as k-NN is sensitive to the scale of
features. If features have different scales, the algorithm may give
more weight to features with larger magnitudes.
 Computational Cost:
 The main computational cost occurs during prediction when
distances between the new instance and all training instances need to
be calculated. This can be expensive for large datasets.
 Curse of Dimensionality:
 As the number of features increases, the data becomes more sparse
in the feature space, and the notion of proximity becomes less
meaningful. This is known as the "curse of dimensionality."

Use Cases:
 Classification: Identifying the class or category of a new instance based on
its similarity to existing instances.
 Regression: Predicting a continuous value for a new instance based on the
average of its k nearest neighbors.
 Anomaly Detection: Identifying instances that deviate significantly from
the majority of the data.
 Pattern Recognition: Discovering patterns in data based on the
distribution of instances in feature space.

K-NN is a versatile algorithm and is often used as a baseline model for
comparison in various machine learning tasks. Its simplicity and
interpretability make it a good choice for certain scenarios, especially when
the underlying structure of the data is complex and not well understood.
However, its computational cost and sensitivity to parameter choices
should be considered in practice.

Gaussian mixture models:


Gaussian Mixture Models (GMMs) are probabilistic models that
represent a mixture of Gaussian (normal) distributions. They are commonly
used for modeling complex probability distributions, especially in situations
where the data may be generated by multiple underlying processes. GMMs
are flexible and can model a wide variety of data distributions.

Key Concepts:
1. Mixture of Gaussians:
 A GMM is a weighted sum of multiple Gaussian distributions, where
each Gaussian represents one component of the mixture.
 Each component is characterized by its mean, covariance matrix, and
weight.
2. Parameters:
 Means: The average values for each dimension in each Gaussian
component.
 Covariance Matrices: Describing the shape and orientation of the
distribution for each component.
 Weights: The relative importance of each component in the mixture.
3. Probability Density Function (PDF):
 The probability of a data point being generated by the GMM is
calculated as the weighted sum of the probabilities from each
Gaussian component.
4. Expectation-Maximization (EM) Algorithm:
 The training of GMMs often involves the EM algorithm.
 Expectation (E-step): Calculate the probability that each data
point belongs to each component (responsibilities).
 Maximization (M-step): Update the parameters (means,
covariances, and weights) based on the responsibilities.
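
As a concrete illustration, the sketch below fits a two-component mixture
with scikit-learn's GaussianMixture (which trains by EM) and reads off the
soft assignments; the one-dimensional toy data and the number of components
are assumptions made for the example:

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data drawn from two Gaussians, reshaped into a column vector.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 200),
                    rng.normal(3, 1.0, 200)]).reshape(-1, 1)

# The mixture density is p(x) = sum_k weight_k * N(x | mean_k, cov_k);
# fitting with EM estimates the weights, means and covariances.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print("weights:", gmm.weights_.round(2))       # mixing proportions
print("means:", gmm.means_.ravel().round(2))   # component means
# Soft assignment: probability that a point belongs to each component.
print("responsibilities for x = 0:", gmm.predict_proba([[0.0]]).round(3))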

Use Cases:
1. Clustering:
 GMMs can be used for clustering when the underlying data
distribution is assumed to be a mixture of Gaussians.
2. Density Estimation:
 GMMs can be employed to estimate the probability density function
of the observed data.
3. Anomaly Detection:
 By modeling the normal behavior of the data with a GMM, deviations
from this model can be used to detect anomalies.
4. Image and Signal Processing:
 GMMs are applied in image and signal processing for tasks such as
segmentation and noise reduction.
5. Speech Recognition:
 GMMs have been historically used in modeling the distribution of
speech features.

Characteristics:
 Flexibility:
 GMMs are highly flexible and can model complex data distributions.
 Soft Clustering:
 Unlike k-means, GMMs provide "soft" assignments, meaning each
data point is assigned a probability of belonging to each cluster.
 Model Complexity:
 The number of components in the mixture is a critical factor in
GMMs. Selecting an appropriate number of components is essential
and can be determined using techniques like the Bayesian
Information Criterion (BIC) or the Akaike Information Criterion (AIC).
 Sensitivity to Initialization:
 GMMs are sensitive to the choice of initial parameters, and multiple
runs with different initializations may be necessary.
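
Selecting the number of components, as noted above, is often done by
comparing information criteria across candidate models; a brief sketch using
BIC, where the data and the candidate range 1-6 are arbitrary choices for
the example:

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data drawn from three Gaussians.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-3, 1, 150),
                    rng.normal(0, 1, 150),
                    rng.normal(4, 1, 150)]).reshape(-1, 1)

# Fit mixtures with 1..6 components and keep the one with the lowest BIC.
bics = {k: GaussianMixture(n_components=k, random_state=1).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
print("BIC per k:", {k: round(v, 1) for k, v in bics.items()})
print("selected number of components:", best_k)
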
GMMs are particularly useful in scenarios where the underlying structure of
the data is not well-defined or when there are multiple patterns present in
the data. They provide a probabilistic framework for modeling complex
relationships in the data, making them valuable in various machine learning
applications.

Expectation maximization:
Expectation-Maximization (EM) is a general framework for finding
maximum likelihood estimates of parameters in models with latent
variables. It is an iterative optimization algorithm commonly used in
machine learning, statistics, and signal processing. The EM algorithm
consists of two main steps: the Expectation (E) step and the Maximization
(M) step. It is particularly useful when dealing with incomplete or missing
data.

Expectation-Maximization (EM) Algorithm:


1. Expectation (E-step):
 In the E-step, the algorithm estimates the expected values of the
latent variables given the observed data and the current parameter
estimates.
 It calculates the "responsibilities" or the probability that each data
point belongs to each cluster or component in the model.
2. Maximization (M-step):
 In the M-step, the algorithm maximizes the expected log-likelihood
with respect to the model parameters.
 It updates the parameters based on the responsibilities obtained
from the E-step.
 The goal is to find the parameters that maximize the expected log-
likelihood of the observed data.
3. Iteration:
 The E-step and M-step are repeated iteratively until the algorithm
converges to a solution.
 Convergence is typically determined by observing changes in the log-
likelihood or other convergence criteria.
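
For concreteness, here is a minimal NumPy sketch of the E-step and M-step
for a two-component, one-dimensional Gaussian mixture; the toy data, the
initial guesses, and the fixed iteration count are assumptions made for the
illustration:

import numpy as np

def gauss_pdf(x, mean, std):
    # Density of a univariate Gaussian, written out explicitly.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Toy data: a mixture of two 1-D Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.8, 300), rng.normal(2, 1.2, 700)])

# Initial parameter guesses (weights, means, standard deviations).
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities = probability that each point was generated
    # by each component, under the current parameter estimates.
    dens = w * gauss_pdf(x[:, None], mu, sigma)       # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities.
    nk = resp.sum(axis=0)                             # effective counts per component
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

# Log-likelihood under the final parameters (useful as a convergence check).
dens = w * gauss_pdf(x[:, None], mu, sigma)
log_likelihood = np.log(dens.sum(axis=1)).sum()
print("weights:", w.round(2), "means:", mu.round(2), "sigmas:", sigma.round(2))
print("log-likelihood:", round(log_likelihood, 1))
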
Key Concepts:
 Latent Variables:
 Latent variables are variables that are not directly observed in the
data but are inferred from the observed variables.
 Incomplete Data:
 EM is particularly useful when dealing with incomplete data or
situations where some variables are not observed.
 Application to Gaussian Mixture Models (GMMs):
 EM is commonly used in the context of Gaussian Mixture Models
(GMMs), where the latent variables are the cluster assignments of
data points.
 Likelihood Maximization:
 EM seeks to maximize the likelihood of the observed data given the
model, even in cases where direct maximization is computationally
intractable due to the presence of latent variables.

Use Cases:
 Clustering:
 EM is frequently applied to cluster data when there is uncertainty
about the cluster assignments of data points (soft clustering).
 Density Estimation:
 It is used for estimating the parameters of probability distributions
when some data points have missing or unobserved values.
 Hidden Markov Models (HMMs):
 EM is employed in training Hidden Markov Models for time-series
data.
 Image and Signal Processing:
 EM has applications in image and signal processing, such as in image
segmentation or the estimation of signal parameters.

Characteristics:
 Iterative Optimization:
 EM is an iterative optimization algorithm that seeks to maximize the
likelihood of the observed data given the model.
 Local Maximum:
 EM can converge to local maxima, and the choice of initial parameter
values can impact the final result.
 Initialization Sensitivity:
 The algorithm is sensitive to the choice of initial parameters, and
multiple initializations may be necessary.
 Convergence Criteria:
 Convergence is typically assessed by monitoring changes in the log-
likelihood or other criteria.

Expectation-Maximization is a versatile and widely used algorithm in
various fields. Its ability to handle missing data and estimate parameters in
the presence of latent variables makes it particularly valuable in scenarios
where direct optimization is challenging.
