M5
Clustering
Introduction
• Clustering, or cluster analysis, is a machine learning technique
that groups an unlabelled dataset. It can be defined as "a way of
grouping the data points into different clusters consisting of
similar data points. The objects with possible similarities remain
in a group that has few or no similarities with another group."
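As a minimal illustration of this idea, the sketch below groups unlabelled 1-D values into two clusters of similar points using a few iterations of k-means (the function name and data are illustrative, not from any particular library):

```python
# Illustrative sketch: group unlabelled 1-D points into k clusters
# of similar values with a few iterations of k-means.

def kmeans_1d(points, k=2, iters=10):
    # initialise centroids with the first k points
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
clusters, centroids = kmeans_1d(data)
# points near 1 and points near 8 end up in separate clusters
```

Points similar to each other (near 1, near 8) land in the same cluster, while the two clusters have little similarity with each other, matching the definition above.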
Application of clustering
• Market Segmentation – Businesses use clustering to group their customers and
use targeted advertisements to reach a wider audience.
• Market Basket Analysis – Shop owners analyze their sales to figure out which
items are most often bought together by customers. For example, according to a
study in the USA, diapers and beer were frequently bought together by fathers.
• Social Network Analysis – Social media sites use your data to understand your
browsing behaviour and provide you with targeted friend recommendations or
content recommendations.
• Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic
images like X-rays.
• Anomaly Detection – To find outliers in a stream of real-time dataset or forecasting
fraudulent transactions we can use clustering to identify them.
• Simplify working with large datasets – Each cluster is given a cluster ID after
clustering is complete. An entire feature set can then be condensed into its
cluster ID. Clustering is effective when it can represent a complicated case with a
straightforward cluster ID; using this principle, clustering can make
complex datasets simpler to work with.
Requirements of clustering
• Scalability
• Dealing with different types of attributes
• Discovery of clusters with arbitrary shape
• Avoiding domain knowledge to determine input parameter
• Handling noisy data
• Incremental clustering
• Insensitivity to input order
• Handling high dimensional data
• Handling constraints
• Interpretability and usability
Types of clustering
Partitioning method
Border Point: A point that is not a core point but lies within the 𝜖-radius
of a core point. It belongs to a cluster but does not contribute to
expanding it.
Noise Point: A point that is neither a core point nor a border point.
Considered an outlier.
Cluster: Formed by connecting core points that are within the 𝜖-radius
of each other, along with any border points associated with those core
points.
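The definitions above can be sketched as a small classifier of points into core, border, and noise roles. This is an illustrative sketch, not a full density-based clustering implementation; `eps` and `min_pts` are assumed parameter names:

```python
import math

# Classify each point as "core", "border", or "noise" under the
# definitions above (illustrative sketch; eps and min_pts are
# assumed parameter names).

def classify_points(points, eps=1.5, min_pts=3):
    def neighbours(i):
        # indices within the eps-radius of point i (includes i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    core = {i for i in range(len(points)) if len(neighbours(i)) >= min_pts}
    labels = {}
    for i in range(len(points)):
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in neighbours(i)):
            labels[i] = "border"   # within eps of a core point
        else:
            labels[i] = "noise"    # neither core nor border: an outlier
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2.5, 1), (10, 10)]
labels = classify_points(pts)
# the dense square is core, (2.5, 1) is border, (10, 10) is noise
```

A full clustering would then connect mutually reachable core points into clusters, as the Cluster definition describes.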
Grid-Based Clustering
If the query point exactly matches one of the training instances, the
denominator is zero; in this case, we assign the query point the value of
that training instance.
Locally Weighted Regression (LWR) / Locally
Weighted Linear Regression (LWLR)
• LWR fits a separate linear regression model for each query point based on the weights
assigned to the training data points.
• The weights assigned to each training data point are inversely proportional to their
distance from the query point.
• Training data points that are closer to the query point will have a higher weight and
contribute more to the linear regression model.
• LWLR is useful when a global linear model does not capture the relationship
between the input and output variables well. The goal is to capture local
patterns in the data.
Key features
Locality:
• Instead of using the entire dataset to fit a single model, LWR focuses on
the data points near the target input value.
• Nearby points are given higher importance (weight), and farther points
are given lower weight.
Weights:
• LWR assigns weights to data points based on their distance from the
query point.
• A common weighting function is the Gaussian kernel:
w(i) = exp( −(x(i) − xq)² / (2τ²) )
Here, xq is the query point, and τ (bandwidth) controls the influence of distant points.
Local Model:
• For a given query point xq, LWR fits a simple linear regression model (or any desired
regression model) weighted by w(i).
• The model minimizes the following weighted loss function:
J(θ) = Σi w(i) (y(i) − θᵀx(i))²
Predictions:
• The fitted model at xq is used to predict yq.
• This process is repeated for every query point, making the method computationally
expensive for large datasets.
Algorithm Steps:
1. Choose a query point xq.
2. Compute weights w(i) for all points in the dataset, using a kernel function.
3. Fit a weighted linear regression model using the weights.
4. Predict the output yq for xq using the fitted model.
5. Repeat for all desired query points.
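The steps above can be sketched as a minimal pure-Python implementation for one query point, fitting a 1-D line y = a + b·x by weighted least squares with Gaussian weights (illustrative function and variable names):

```python
import math

# Sketch of the LWR steps for a single query point x_q:
# Gaussian weights (step 2), weighted line fit (step 3), predict (step 4).

def lwr_predict(xs, ys, x_q, tau=0.2):
    # Step 2: Gaussian kernel weights from distance to the query point
    w = [math.exp(-((x - x_q) ** 2) / (2 * tau ** 2)) for x in xs]
    # Step 3: weighted least squares for a line y = a + b*x
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, xs, ys))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, xs))
    b = num / den
    a = ybar - b * xbar
    # Step 4: predict the output at the query point
    return a + b * x_q

xs = [i * 0.1 for i in range(63)]          # inputs 0.0 .. 6.2
ys = [math.sin(x) for x in xs]             # non-linear target
pred = lwr_predict(xs, ys, math.pi / 2, tau=0.2)   # close to sin(pi/2) = 1
```

Repeating this for every query point (step 5) is what makes the method expensive on large datasets, since each prediction solves a fresh weighted fit.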
Advantages:
• Flexibility: Captures non-linear relationships in data.
• Local Adaptation: Adapts to varying trends across different regions of the input space.
Disadvantages:
• Computationally Expensive: Each prediction requires fitting a new model, making it
inefficient for large datasets.
• Bandwidth Sensitivity: The choice of τ significantly affects the model's performance.
• Small τ: too sensitive, prone to overfitting.
• Large τ: too smooth, prone to underfitting.
Applications:
• Data visualization.
• Non-linear regression when a global model is inadequate.
• Situations where interpretability is important locally.
Radial basis function (RBF)
• Kernels play a fundamental role in transforming data into higher-dimensional
spaces, enabling algorithms to learn complex patterns and relationships.
Among the diverse kernel functions, the Radial Basis Function (RBF) kernel
stands out as a versatile and powerful tool. The Radial Basis Function (RBF)
kernel, also known as the Gaussian kernel, is one of the most widely used
kernel functions. It operates by measuring the similarity between data points
based on their Euclidean distance in the input space. Mathematically, the RBF
kernel between two data points x and x′ is defined as
K(x, x′) = exp( −∣x–x′∣² / (2σ²) )
where ∣x–x′∣² represents the squared Euclidean distance between the two data
points and σ is a free parameter that controls the width of the kernel.
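The kernel can be written out directly as below. This is a hedged sketch; the common reparameterisation gamma = 1/(2σ²) is an assumption of this example, not something fixed by the text:

```python
import math

# RBF (Gaussian) kernel: similarity based on squared Euclidean distance.
# gamma = 1 / (2 * sigma**2) is an illustrative parameterisation.

def rbf_kernel(x, x_prime, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))  # ||x - x'||^2
    return math.exp(-gamma * sq_dist)

rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points -> similarity 1.0
```

The similarity is 1 for identical points and decays toward 0 as the Euclidean distance between the points grows, with gamma (equivalently σ) controlling how quickly.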
Case-Based Reasoning (CBR)
2. Case Base: A repository or database where cases are stored. These past
cases are used to guide the reasoning process.
3. Reasoning Process in CBR: The CBR cycle often consists of four steps:
1. Retrieve: Identify and retrieve the most similar past cases from the case base.
2. Reuse: Adapt the solutions from the retrieved cases to solve the new problem.
3. Revise: Test the solution in the real world and revise it if necessary.
4. Retain: If the solution works well, save it as a new case in the case base for future reference.
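The four-step cycle can be sketched on a toy case base as below. The case format and the distance-based similarity measure are illustrative assumptions, and the Reuse/Revise steps are reduced to simple reuse:

```python
# Hedged sketch of the CBR cycle: Retrieve, Reuse, Revise, Retain.
# Case format and similarity measure are illustrative assumptions.

case_base = [
    {"problem": {"temp": 39.0, "cough": True}, "solution": "flu treatment"},
    {"problem": {"temp": 36.8, "cough": True}, "solution": "cold remedy"},
]

def similarity(p, q):
    # crude similarity: negated distance over shared numeric/boolean features
    return -sum(abs(float(p[k]) - float(q[k])) for k in p)

def cbr_solve(new_problem):
    # 1. Retrieve: find the most similar past case
    best = max(case_base, key=lambda c: similarity(c["problem"], new_problem))
    # 2. Reuse: adapt (here, simply reuse) the retrieved solution
    solution = best["solution"]
    # 3. Revise: in practice, test the solution and correct it if needed
    # 4. Retain: store the solved case for future reference
    case_base.append({"problem": new_problem, "solution": solution})
    return solution
```

A new problem similar to a stored case retrieves that case's solution, and the case base grows with each solved problem, which is how the system "learns from new experiences".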
Benefits of CBR:
• Leverages past knowledge: CBR builds on existing knowledge,
avoiding the need to reinvent the wheel for every problem.
• Handles complex problems: It can provide solutions for complex,
poorly understood, or incomplete problems.
• Adaptive: The system evolves by learning from new experiences.
Applications:
1. Medical Diagnosis: Using past cases to diagnose and recommend
treatments for diseases.
2. Legal Reasoning: Applying precedents from previous cases to
resolve legal disputes.
3. Customer Support: Solving customer queries based on similar past
issues.
4. Engineering Design: Reusing solutions for similar design challenges.
5. Education: Adaptive tutoring systems that recommend solutions
based on past learner behavior.