ML U5

5.0 Unsupervised Learning

Unsupervised Learning
Unsupervised learning is a type of machine learning where algorithms discover
patterns, relationships, or groupings within data without prior knowledge of labeled
outcomes.
Types of Unsupervised Learning

1. Clustering : Grouping similar data points.


2. Dimensionality Reduction : Reducing data complexity.
3. Anomaly Detection : Identifying unusual data points.
4. Association Rule Learning : Discovering relationships.

Algorithms

1. K-Means Clustering : Simple, efficient clustering.


2. Hierarchical Clustering : Visualizing cluster hierarchy.
3. Principal Component Analysis (PCA) : Dimensionality reduction (see the short sketch after this list).
4. t-Distributed Stochastic Neighbor Embedding (t-SNE) : Non-linear dimensionality
reduction.
5. DBSCAN : Density-based clustering.
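
As a concrete illustration of dimensionality reduction, the following minimal sketch projects the 4-dimensional iris dataset onto 2 principal components with scikit-learn; the dataset and the two-component choice are assumptions made purely for illustration.

# Minimal PCA sketch: project 4-D iris measurements onto 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 samples, 4 features
pca = PCA(n_components=2)                 # keep the two directions of largest variance
X_2d = pca.fit_transform(X)               # shape: (150, 2)
print(pca.explained_variance_ratio_)      # fraction of variance retained per component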

Applications

1. Customer Segmentation : Clustering customers by behavior.


2. Image Compression : Reducing image dimensions.
3. Fraud Detection : Identifying anomalous transactions.
4. Recommendation Systems : Suggesting products based on user behavior.
5. Genomics : Identifying gene expression patterns.

Evaluation Metrics

1. Silhouette Coefficient : Measuring cluster separation.


2. Calinski-Harabasz Index : Evaluating cluster quality.
3. Elbow Method : Determining optimal cluster number.
5.1 Compare Supervised and Unsupervised learning

Supervised vs Unsupervised Learning


Supervised Learning

 Goal : Predict target variable based on labeled data.


 Data : Labeled examples (input-output pairs).
 Algorithms : Linear Regression, Decision Trees, Support Vector Machines.
 Evaluation : Accuracy, Precision, Recall, F1-score.

Unsupervised Learning

 Goal : Discover patterns, relationships, or groupings in unlabeled data.


 Data : Unlabeled examples.
 Algorithms : Clustering, Dimensionality Reduction, Anomaly Detection.
 Evaluation : Silhouette Coefficient, Calinski-Harabasz Index, Elbow Method.

Key Differences

1. Label Availability : Supervised learning requires labeled data, while unsupervised


learning uses unlabeled data.
2. Learning Objective : Supervised learning predicts target variables, while
unsupervised learning discovers patterns.
3. Algorithm Complexity : Supervised learning algorithms tend to be more complex.
4. Evaluation Metrics : Different evaluation metrics are used.

Real-World Examples
Supervised Learning:

 Image classification (e.g., object detection)


 Sentiment analysis (e.g., movie reviews)

Unsupervised Learning:

 Customer segmentation (e.g., clustering)


 Anomaly detection (e.g., fraud detection)

5.2 Explain different types of clustering techniques

Clustering Techniques
Clustering is a type of unsupervised learning that groups similar data points into
clusters.
Types of Clustering Techniques

1. Hierarchical Clustering
 Agglomerative (bottom-up)
 Divisive (top-down)
 Visualizes cluster hierarchy using dendrograms

2. K-Means Clustering
 Partitional clustering
 Simple, efficient, and widely used
 Requires specifying number of clusters (k)

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)


 Density-based clustering
 Handles noise and outliers
 Does not require specifying number of clusters
4. K-Medoids
 Similar to K-Means, but uses medoids (objects) instead of centroids
 More robust to outliers

5. Expectation-Maximization (EM) Clustering


 Probabilistic clustering
 Handles missing data
 Assumes Gaussian distribution

6. Fuzzy C-Means Clustering


 Soft clustering (allows partial membership)
 Handles overlapping clusters

7. Gaussian Mixture Model (GMM) Clustering


 Probabilistic clustering
 Assumes Gaussian distribution
 Handles complex cluster structures

8. Self-Organizing Map (SOM) Clustering


 Neural network-based clustering
 Preserves data topology
 Visualizes high-dimensional data

9. Bisecting K-Means Clustering


 Divisive variant of K-Means (recursive two-way splits)
 Splits clusters recursively

10. Clustering using Neural Networks


 Autoencoders
 Convolutional Neural Networks (CNNs)

Evaluation Metrics for Clustering

1. Silhouette Coefficient
2. Calinski-Harabasz Index
3. Davies-Bouldin Index
4. Elbow Method
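
A minimal sketch of how the first three metrics can be computed with scikit-learn, assuming a toy dataset and a K-Means clustering chosen purely for illustration:

# Illustrative sketch: internal clustering metrics on a toy K-Means result.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)          # toy data
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print("Silhouette:       ", silhouette_score(X, labels))              # closer to 1 is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))       # higher is better
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))          # lower is better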

5.2.1 Partitioning Methods

Partitioning Methods
Partitioning methods are a type of clustering algorithm that divide data into non-
overlapping subsets or clusters.
Types of Partitioning Methods

1. K-Means Clustering
 Simple, efficient, and widely used
 Requires specifying number of clusters (k)
 Sensitive to initial centroids

2. K-Medoids
 Similar to K-Means, but uses medoids (objects) instead of centroids
 More robust to outliers

3. K-Modes
 Extension of K-Means for categorical data
 Uses modes instead of means

4. Fuzzy C-Means (FCM)


 Soft clustering (allows partial membership)
 Handles overlapping clusters

5. CLARA (Clustering LARge Applications)
 Applies K-Medoids (PAM) to random samples of the data
 Scales medoid-based clustering to large datasets

6. Partitioning Around Medoids (PAM)


 The classic K-Medoids algorithm
 Swaps medoids with non-medoids to minimize total dissimilarity
 Computationally expensive on very large datasets (CLARA addresses this)

Advantages

1. Efficient: Fast computation


2. Simple: Easy to implement
3. Effective: Good clustering quality

Disadvantages

1. Sensitive to Initial Centroids: K-Means


2. Requires Specifying k: Number of clusters
3. Not Suitable for Complex Clusters: Non-convex or varying densities

Real-World Applications

1. Customer Segmentation: K-Means


2. Image Segmentation: FCM
3. Gene Expression Analysis
5.2.2 Hierarchical Methods

Hierarchical Methods
Hierarchical methods are a type of clustering algorithm that build a hierarchy of
clusters by merging or splitting existing clusters.
Types of Hierarchical Methods

1. Agglomerative Hierarchical Clustering (AHC)


 Bottom-up approach
 Starts with individual data points
 Merges closest clusters

2. Divisive Hierarchical Clustering (DHC)


 Top-down approach
 Starts with entire dataset
 Splits clusters recursively

3. Ward's Method
 AHC with Ward's linkage
 Minimizes within-cluster variance

4. Single Linkage
 AHC with single linkage
 Merges clusters based on closest points

5. Complete Linkage
 AHC with complete linkage
 Merges clusters based on farthest points

6. Average Linkage
 AHC with average linkage
 Merges clusters based on average distance
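
The following sketch, assuming a small synthetic dataset, shows how SciPy builds an agglomerative hierarchy and draws the dendrogram; swapping the linkage method changes how the distance between clusters is computed.

# Illustrative sketch: build and plot a dendrogram with SciPy's hierarchical clustering.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])  # two toy groups

Z = linkage(X, method="ward")                     # also try "single", "complete", "average"
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters

dendrogram(Z)                                     # visualize the merge hierarchy
plt.title("Ward linkage dendrogram")
plt.show()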

Advantages

1. Visualizes Cluster Hierarchy : Dendrograms


2. No Need to Specify k : Number of clusters
3. Robust to Outliers : AHC

Disadvantages

1. Computationally Expensive : O(n^2 log n)


2. Sensitive to Linkage Method : Different methods yield different results
3. Difficult to Interpret : Large datasets

Real-World Applications

1. Gene Expression Analysis : AHC


2. Image Segmentation : DHC
3. Social Network Analysis : AHC

5.2.3 Density based Methods

Density-Based Methods

Density-based methods are a type of clustering algorithm that groups data points into
clusters based on their density and proximity.

Types of Density-Based Methods

1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)


 Handles noise and outliers
 Does not require specifying number of clusters
 Parameters: ε (eps), min_samples (see the sketch after this list)

2. OPTICS (Ordering Points To Identify the Clustering Structure)


 Extension of DBSCAN
 Handles varying densities
 Parameters: min_samples, eps

3. DENCLUE (DENsity-based CLUstEring)


 Uses kernel density estimation (typically with Gaussian kernels)
 Handles noisy data
 Parameters: σ (sigma), threshold

4. DBCLASD (Distribution-Based Clustering Algorithm for Large Spatial Databases)


 Designed for large spatial datasets
 Assumes points within a cluster are uniformly distributed
 Incremental; requires no user-specified parameters
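
A minimal DBSCAN sketch with scikit-learn; the eps and min_samples values and the two-moons toy data are illustrative assumptions that would normally be tuned for the dataset at hand.

# Illustrative DBSCAN sketch: density-based clustering with noise detection.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)  # non-convex toy shapes

db = DBSCAN(eps=0.3, min_samples=5).fit(X)   # eps = neighborhood radius
labels = db.labels_                          # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", list(labels).count(-1))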

Advantages

1. Handles Noise and Outliers : DBSCAN


2. No Need to Specify k : Number of clusters
3. Robust to Varying Densities : OPTICS
4. Efficient for Large Datasets : DBCLASD
Disadvantages

1. Sensitive to Parameters : ε, min_samples


2. Difficult to Choose Parameters : Requires expertise
3. Computationally Expensive : O(n^2)

Real-World Applications

1. Geographic Information Systems (GIS) : DBSCAN


2. Image Segmentation : OPTICS
3. Customer Segmentation : DENCLUE
4. Network Analysis : DBCLASD

5.3 Analyze Clustering Algorithms

Analyzing Clustering Algorithms


Clustering algorithms group similar data points into clusters. Here's an analysis of
popular clustering algorithms:
Criteria for Evaluation

1. Cluster Quality : Homogeneity, separation, and consistency.


2. Scalability : Handling large datasets.
3. Complexity : Computational efficiency.
4. Robustness : Handling noise and outliers.
5. Interpretability : Easy to understand results.

Clustering Algorithms Analysis


5.3.1 K-Means algorithm

K-Means Algorithm
K-Means is a popular unsupervised learning algorithm for clustering data points into
K distinct groups.
How K-Means Works

1. Initialization : Randomly select K centroids.


2. Assignment : Assign each data point to the closest centroid.
3. Update : Update centroids as the mean of assigned data points.
4. Repeat : Repeat steps 2-3 until convergence.
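
A minimal scikit-learn sketch of these four steps; the synthetic data and K = 3 are assumptions made for illustration.

# Illustrative K-Means sketch: fit, inspect centroids, and assign new points.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)   # toy data

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)          # steps 1-4 run until convergence

print("centroids:\n", km.cluster_centers_)
print("within-cluster SSE (inertia):", km.inertia_)
print("cluster of a new point:", km.predict([[0.0, 0.0]]))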

5.3.1.1 Elbow Method

Elbow Method
The Elbow Method is a visual technique used to determine the optimal number of
clusters (K) in K-Means clustering.
How Elbow Method Works

1. Compute Distortion : Calculate the sum of squared errors (SSE) or distortion for
different values of K.
2. Plot Elbow Curve : Plot SSE against K.
3. Identify Elbow Point : Choose K at the elbow point, where the rate of decrease of
SSE becomes less pronounced.
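
A minimal sketch of the Elbow Method using K-Means inertia (SSE) from scikit-learn; the toy dataset and the K range 1-10 are assumptions.

# Illustrative elbow-method sketch: plot SSE (inertia) against K and look for the bend.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Sum of squared errors (inertia)")
plt.title("Elbow curve")
plt.show()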

5.3.1.2 Strengths and Weaknesses of the K-Means Algorithm

Strengths of K-Means Algorithm

1. Efficient : Fast computation, especially for large datasets.


2. Simple : Easy to implement and understand.
3. Effective : Good clustering quality for spherical clusters.
4. Scalable : Handles high-dimensional data.
5. Interpretable : Cluster centers are easily interpretable.

Weaknesses of K-Means Algorithm

1. Sensitive to Initial Centroids : Affects clustering quality.


2. Requires Specifying K : Number of clusters must be predefined.
3. Not Suitable for Complex Clusters : Non-convex, varying densities, or nested
clusters.
4. Sensitive to Outliers : Outliers can affect cluster centers.
5. Assumes Equal Cluster Sizes : Not suitable for clusters with varying sizes.

5.3.2 k-Medoids Algorithm

K-Medoids Algorithm
K-Medoids is a clustering algorithm that partitions data into K clusters based on
similarity.
How K-Medoids Works

1. Initialize Medoids : Select K initial medoids (representative data points).


2. Assign Data Points : Assign each data point to the closest medoid.
3. Update Medoids : Replace medoids with data points that minimize total distance.
4. Repeat : Repeat steps 2-3 until convergence.
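
Core scikit-learn does not ship a K-Medoids estimator (the separate scikit-learn-extra package provides one), so the following is a naive NumPy sketch of the assign/update loop above, for illustration only.

# Naive K-Medoids sketch (illustration only): alternate assignment and medoid update.
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)         # step 1: initial medoids
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances

    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoid_idx], axis=1)            # step 2: assign to closest medoid
        new_idx = medoid_idx.copy()
        for c in range(k):                                         # step 3: pick the member that
            members = np.where(labels == c)[0]                     # minimizes total distance
            costs = dist[np.ix_(members, members)].sum(axis=0)
            new_idx[c] = members[np.argmin(costs)]
        if np.array_equal(new_idx, medoid_idx):                    # step 4: stop at convergence
            break
        medoid_idx = new_idx
    return medoid_idx, labels

X = np.random.default_rng(1).normal(size=(100, 2))
medoids, labels = k_medoids(X, k=3)
print("medoid indices:", medoids)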

5.3.3 Hierarchical clustering Algorithm

Hierarchical Clustering Algorithm


Hierarchical clustering is a type of unsupervised learning algorithm that builds a
hierarchy of clusters by merging or splitting existing clusters.
Types of Hierarchical Clustering

1. Agglomerative Hierarchical Clustering (AHC) : Bottom-up approach, merging


clusters.
2. Divisive Hierarchical Clustering (DHC) : Top-down approach, splitting clusters.

How Hierarchical Clustering Works

1. Initialize Clusters : Each data point is a cluster.


2. Compute Distance : Calculate distance between clusters.
3. Merge/Split Clusters : Merge or split clusters based on distance.
4. Repeat : Repeat steps 2-3 until desired number of clusters.

5.3.3.1 Agglomerative clustering

Agglomerative Clustering
Agglomerative clustering is a type of hierarchical clustering algorithm that builds
clusters by merging existing clusters.
How Agglomerative Clustering Works

1. Initialize Clusters : Each data point is a cluster.


2. Compute Distance : Calculate distance between clusters.
3. Merge Closest Clusters : Merge closest clusters based on distance.
4. Repeat : Repeat steps 2-3 until desired number of clusters.
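
A minimal scikit-learn sketch of agglomerative clustering; the synthetic data, the cluster count, and Ward linkage are illustrative assumptions.

# Illustrative agglomerative clustering sketch with scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# linkage can be "ward", "complete", "average", or "single"
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

print("cluster sizes:", [list(labels).count(c) for c in range(3)])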

5.3.3.2 Divisive Clustering

Divisive Clustering
Divisive clustering is a type of hierarchical clustering algorithm that splits data into
smaller clusters by dividing them into more homogeneous subsets. Unlike
agglomerative clustering, which starts with individual data points and merges them,
divisive clustering begins with the entire dataset and recursively divides it.
Key characteristics:

1. Top-down approach
2. Divides the dataset into smaller clusters
3. Focuses on splitting clusters based on dissimilarity

Algorithms:
Some popular divisive clustering algorithms include:

1. DIANA (DIvisive ANAlysis)


2. Bisecting K-Means (recursive two-way K-Means splits)
3. MONA (MONothetic Analysis; splits on one binary variable at a time)

Advantages:

1. Uses global information about the data when making the top-level splits


2. Can stop as soon as the desired number of high-level clusters is reached
3. Bisecting variants scale well to large datasets

Applications:

1. Image segmentation
2. Gene expression analysis
3. Customer segmentation
4. Network analysis

Challenges:

1. Choosing the optimal number of clusters


2. Dealing with high-dimensional data
3. Interpreting results
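
A minimal sketch of the top-down idea using repeated two-way K-Means splits (the Bisecting K-Means strategy listed above); always splitting the largest cluster is a simplifying assumption made for illustration.

# Illustrative divisive sketch: recursively bisect the largest cluster with 2-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def bisecting_kmeans(X, n_clusters, seed=0):
    labels = np.zeros(len(X), dtype=int)          # start: one cluster containing everything
    while labels.max() + 1 < n_clusters:
        biggest = np.bincount(labels).argmax()    # choose the largest cluster to split
        idx = np.where(labels == biggest)[0]
        sub = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        labels[idx[sub == 1]] = labels.max() + 1  # one half keeps the old id, the other gets a new id
    return labels

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = bisecting_kmeans(X, n_clusters=4)
print("cluster sizes:", np.bincount(labels))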

5.4 Analyze Association Algorithm

Association analysis algorithms uncover relationships between variables in large
datasets, revealing patterns and connections.
What is Association Analysis?
Association analysis is a data mining technique that identifies associations, correlations, or
relationships between variables in a dataset. It aims to discover patterns, such as:

1. Frequent itemsets (e.g., products often bought together)


2. Rules (e.g., if-then statements)
3. Correlations (e.g., relationships between variables)

Popular Association Analysis Algorithms:

1. Apriori Algorithm
2. Eclat Algorithm
3. FP-Growth Algorithm
4. Association Rule Mining (ARM)
5. Correlation Analysis (e.g., Pearson, Spearman)

Key Concepts:

1. Support: Frequency of itemset or rule


2. Confidence: Strength of rule (e.g., if-then)
3. Lift: Measure of correlation between variables
4. Frequent Itemset: Itemset appearing frequently

Applications:

1. Market Basket Analysis (e.g., product recommendations)


2. Recommendation Systems
3. Fraud Detection
4. Medical Diagnosis
5. Social Network Analysis

Benefits:

1. Identifies hidden patterns


2. Improves prediction accuracy
3. Enhances decision-making
4. Optimizes business processes

Challenges:

1. Handling large datasets


2. Dealing with noisy or missing data
3. Interpreting results
4. Avoiding false discoveries

Tools and Libraries:


1. Python: scikit-learn, pandas, numpy
2. R: arules, arulesViz
3. SQL: association rule mining extensions

Real-World Examples:

1. Amazon's "Frequently Bought Together" feature


2. Netflix's recommendation system

5.4.1 Common terms for association rule (pattern, itemset, support, count)

Here are common terms related to association rules:


Key Terms:

1. Item: A single element or attribute in the dataset (e.g., product, feature).


2. Itemset: A set of one or more items (e.g., {milk, bread}).
3. Pattern: A frequent itemset or association rule (e.g., "customers who buy milk also
buy bread").
4. Support: The proportion of transactions containing an itemset (e.g., 60% of
customers buy milk and bread together).
5. Count: The number of transactions containing an itemset (e.g., 100 customers bought
milk and bread together).
6. Confidence: The probability of finding the consequent (right-hand side) given the
antecedent (left-hand side) in an association rule (e.g., 80% of milk buyers also buy
bread).
7. Lift: Measures the strength of association between items (e.g., 2.5 times more likely
to buy bread with milk).
8. Rule: An implication of the form "Antecedent → Consequent" (e.g., "milk →
bread").

Additional Terms:

1. Frequent Itemset: An itemset appearing frequently in the data (above a minimum


support threshold).
2. Closed Itemset: An itemset with no proper supersets having the same support.
3. Maximal Frequent Itemset: A frequent itemset none of whose proper supersets is frequent.
4. Generator: A minimal itemset with no proper subsets having the same support.
5. Consequent: The item(s) on the right-hand side of an association rule.
6. Antecedent: The item(s) on the left-hand side of an association rule.
7. Transaction: A single data point or record (e.g., customer purchase).

Association Rule Metrics:


1. Support: Measures frequency (e.g., 60%)
2. Confidence: Measures probability (e.g., 80%)
3. Lift: Measures strength of association (e.g., 2.5)
4. Conviction: Measures rule reliability (e.g., 1.25)
5. Interest Factor: Measures unexpectedness (e.g., 2.1)

Understanding these terms will help you work effectively with association rule mining
algorithms and interpret results.
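
A small worked example that computes count, support, confidence, and lift by hand for the rule {milk} → {bread}; the five toy transactions are assumptions.

# Worked example: count, support, confidence, and lift for the rule {milk} -> {bread}.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
n = len(transactions)

count_milk       = sum(1 for t in transactions if {"milk"} <= t)           # 4
count_bread      = sum(1 for t in transactions if {"bread"} <= t)          # 4
count_milk_bread = sum(1 for t in transactions if {"milk", "bread"} <= t)  # 3

support    = count_milk_bread / n               # P(milk and bread)      = 0.60
confidence = count_milk_bread / count_milk      # P(bread | milk)        = 0.75
lift       = confidence / (count_bread / n)     # confidence / P(bread)  = 0.9375

print(f"count={count_milk_bread}, support={support:.2f}, "
      f"confidence={confidence:.2f}, lift={lift:.2f}")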

5.4.2 Association rule

Association Rule Mining


Association Rule: An implication of the form "Antecedent → Consequent" that
identifies patterns and relationships between items in a dataset.
Components:

1. Antecedent (LHS): The item(s) on the left-hand side.


2. Consequent (RHS): The item(s) on the right-hand side.
3. Support: Proportion of transactions containing both antecedent and consequent.
4. Confidence: Probability of finding consequent given antecedent.
5. Lift: Measures strength of association.

Example: {Milk, Bread} → {Butter} (Support: 60%, Confidence: 80%, Lift: 2.5)
Interpretation: 60% of all transactions contain Milk, Bread, and Butter; 80% of
transactions containing Milk and Bread also contain Butter; and Butter is 2.5 times
more likely to appear alongside Milk and Bread than its overall baseline rate would suggest.
Types of Association Rules:

1. Single-item rules: {A} → {B}


2. Multi-item rules: {A, B} → {C}
3. Composite rules: {A} → {B, C}

Evaluation Metrics:

1. Support
2. Confidence
3. Lift
4. Conviction
5. Interest Factor

Algorithms:
1. Apriori
2. Eclat
3. FP-Growth
4. ARM (Association Rule Mining)

Applications:

1. Market Basket Analysis


2. Recommendation Systems
3. Fraud Detection
4. Medical Diagnosis
5. Social Network Analysis

Benefits:

1. Identifies hidden patterns


2. Improves prediction accuracy
3. Enhances decision-making
4. Optimizes business processes

Challenges:

1. Handling large datasets


2. Dealing with noisy or missing data
3. Interpreting results
4. Avoiding false discoveries

Tools and Libraries:

1. Python: scikit-learn, pandas, numpy


2. R: arules, arulesViz
3. SQL: association rule mining extensions

5.4.3 Apriori algorithm

Apriori Algorithm
Overview:
The Apriori algorithm is a popular association rule mining technique used to discover
frequent itemsets and generate association rules. It's an efficient, scalable, and
widely used algorithm.
Key Steps:
1. Data Preparation: Transform data into transactions (e.g., customer purchases).
2. Itemset Generation: Find frequent itemsets (1-item, 2-item, ...).
3. Rule Generation: Derive association rules from frequent itemsets.
4. Rule Filtering: Filter rules based on support, confidence, and lift.
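
A minimal sketch of these steps using the mlxtend library (one possible choice; any association rule mining library would do), with illustrative transactions and thresholds.

# Illustrative Apriori sketch with mlxtend: encode transactions, mine itemsets, derive rules.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["milk", "bread", "butter"],
                ["milk", "bread"],
                ["bread", "butter"],
                ["milk", "eggs"],
                ["milk", "bread", "eggs"]]

# Step 1: one-hot encode transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Step 2: frequent itemsets above a minimum support threshold
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)

# Steps 3-4: generate rules and filter them by confidence
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])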
5.4.4 Strengths and Weaknesses of Apriori algorithm

Here are the strengths and weaknesses of the Apriori algorithm:


Strengths:

1. Handles large transactional datasets: Apriori was designed for market-basket style


transaction data.
2. Prunes the search space: The downward-closure property discards candidates whose subsets are infrequent.
3. Produces interpretable results: Rules with support and confidence are easy to read.
4. Easy to implement: Simple to understand and implement.
5. Widely available: Implemented in most data mining tools and libraries.
6. Flexible: Can generate various types of association rules.

Weaknesses:
1. Generates many redundant rules: Apriori produces multiple rules with similar
antecedents and consequents.
2. Requires careful parameter tuning: Minimum support and confidence thresholds
need careful adjustment.
3. Expensive candidate generation: Requires multiple database scans and can produce huge
candidate sets when the minimum support is low.
4. Sensitive to minimum support threshold: Setting threshold too low/high affects
results.
5. Does not handle sequential patterns: Apriori focuses on co-occurrence, not
sequential patterns.
6. Limited handling of categorical variables: Apriori assumes binary (0/1) variables.

5.5 List the applications of Unsupervised learning

Unsupervised learning has numerous applications across various industries:


Industry Applications:

1. Customer Segmentation: Identify customer groups with similar behavior.


2. Market Research: Uncover hidden patterns in customer preferences.
3. Recommendation Systems: Suggest products based on user behavior.
4. Fraud Detection: Identify anomalous transactions.
5. Image and Video Analysis: Object detection, facial recognition.
6. Natural Language Processing (NLP): Text clustering, sentiment analysis.
7. Predictive Maintenance: Identify equipment anomalies.
8. Genomics and Proteomics: Analyze gene expression.
9. Social Network Analysis: Identify communities, influencers.
10. Cybersecurity: Detect intrusions, malware.

Business Applications:

1. Customer Profiling: Understand customer demographics.


2. Marketing Personalization: Tailor messages to customer segments.
3. Supply Chain Optimization: Identify bottlenecks.
4. Risk Management: Identify potential risks.
5. Competitive Intelligence: Analyze market trends.

Scientific Applications:

1. Climate Modeling: Identify patterns in climate data.


2. Genomics: Analyze gene expression.
3. Proteomics: Identify protein structures.
4. Neuroscience: Analyze brain activity.
5. Astronomy: Identify patterns in celestial data.

Healthcare Applications:

1. Disease Diagnosis: Identify patterns in medical images.


2. Patient Stratification: Group patients by disease progression.
3. Treatment Optimization: Identify effective treatments.
4. Medical Imaging: Analyze images for abnormalities.
5. Personalized Medicine: Tailor treatments to individual patients.

Financial Applications:

1. Portfolio Optimization: Identify optimal investment portfolios.


2. Risk Management: Identify potential risks.
3. Credit Scoring: Evaluate creditworthiness.
4. Trading Strategy: Identify profitable trades.
5. Asset Management: Optimize asset allocation.

Other Applications:

1. Speech Recognition: Identify spoken words.


2. Text Summarization: Summarize documents.
3. Image Compression: Compress images.
4. Time Series Analysis: Forecast future values.
5. Network Analysis: Identify network structures.
