DA-Unit V
Clustering Algorithms
● Clustering Algorithms:
○ K-Means, Hierarchical Clustering,
○ Time-series analysis.
● Introduction to Text Analysis:
○ Text-preprocessing, Bag of words, TF-IDF and topics.
● Need and Introduction to social network analysis
● Introduction to business analysis.
● Model Evaluation and Selection:
○ Metrics for Evaluating Classifier Performance, Holdout Method and Random Subsampling
○ Parameter Tuning
○ Clustering and Time-series analysis using scikit-learn.
○ Metrics: Confusion matrix, AUC-ROC curves, Elbow plot.
Clustering
● Trying to determine the appropriate audience for the product
● Using clustering algorithms on the customer base
● Selling the products to the targeted audience
Clustering Algorithms
❏ K-Means
Unsupervised learning algorithm
Here K defines the number of predefined clusters to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
Clustering Algorithms
❏ K-Means
● It is an iterative algorithm that divides the unlabeled dataset into K different clusters
● in such a way that each data point belongs to only one group with similar properties.
● It allows us to cluster the data into different groups, and it is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
● It is a centroid-based algorithm, where each cluster is associated with a centroid.
● Determines the best value for K center points or centroids by an iterative process.
● Assigns each data point to its closest k-center; the data points near a particular k-center form a cluster.
● Basically, K-Means runs on distance calculations, using the "Euclidean Distance" to calculate the distance between two given instances.
● For given instances (X1, Y1) and (X2, Y2), the formula is d = √((X2 − X1)² + (Y2 − Y1)²).
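In code, this distance can be computed directly; a minimal plain-Python sketch:

```python
import math

def euclidean(p, q):
    # Euclidean distance between two instances, e.g. (X1, Y1) and (X2, Y2).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((1, 2), (4, 6)))  # sqrt(3^2 + 4^2) = 5.0
```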
How does the K-Means Algorithm Work?
Step 1: Decide the number of clusters, K.
Step 2: Select K random points as the initial centroids (these need not be from the input dataset).
Step 3: Assign each data point to its closest centroid, forming K clusters.
Step 4: Recompute the centroid (the mean) of each cluster.
Step 5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Elbow Method
● Plot the within-cluster sum of squares (WCSS) against the number of clusters K; the K at the "elbow" of the curve, where the decrease sharply slows, is chosen as the optimal number of clusters.
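The steps above, together with the WCSS quantity used in the elbow plot, can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation; for simplicity it initializes the centroids with the first K points rather than random ones.

```python
import math

def kmeans(points, k, iters=100):
    # Naive initialization: take the first k points as centroids.
    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

def wcss(centroids, clusters):
    # Within-cluster sum of squared distances: the elbow-plot quantity.
    return sum(math.dist(p, c) ** 2
               for c, cl in zip(centroids, clusters) for p in cl)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
for k in (1, 2, 3):
    cents, cls = kmeans(points, k)
    print(k, round(wcss(cents, cls), 2))
```

On this toy data the WCSS drops steeply from K=1 to K=2 and barely changes afterwards, so the elbow is at K=2.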
❏ Hierarchical Clustering
● Agglomerative (bottom-up): the algorithm considers each data point as a single cluster at the beginning, and then starts combining the closest pairs of clusters.
● It does this until all the clusters are merged into a single cluster that contains all the data points.
Divisive Hierarchical clustering
● This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster.
● It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied,
e.g., a desired number of clusters, or the diameter of each cluster being within a certain threshold.
● The distance between two clusters is crucial for hierarchical clustering.
● There are various ways to calculate the distance between two clusters, and these ways (the linkage methods) decide the rule for clustering.
Agglomerative Hierarchical clustering
Linkage Methods
Single Linkage: the shortest distance between the closest points of the two clusters.
Complete Linkage: the farthest distance between two points of two different clusters.
● It is one of the popular linkage methods, as it forms tighter clusters than single linkage.
Average Linkage: the distance between each pair of data points (one from each cluster) is added up and then divided by the total number of pairs to calculate the average distance between two clusters.
Centroid Linkage: the distance between the centroids of the two clusters is calculated.
How does Agglomerative Hierarchical Clustering work?
Step 1: Treat each data point as a single cluster and compute the proximity (distance) matrix.
Step 2: Merge the two closest clusters into one cluster.
Step 3: Update the proximity matrix, recomputing the distance from the merged cluster to every other cluster, and again merge the closest pair.
Example: to decide the distance between the merged cluster (1,2) and cluster 3, check the proximity matrix; with single linkage we take the minimum:
min(d(1,3), d(2,3)) = min(18, 21) = 18
Step 4: Repeat Steps 2–3 until only a single cluster (or the desired number of clusters) remains.
Dendrogram Representation
Number of Clusters
● Decide a threshold on the dendrogram; consider threshold = 12.
● The number of clusters will be the number of vertical lines which are intersected by the horizontal line drawn at the threshold.
● Here the red line intersects 2 vertical lines, so we will have 2 clusters.
● One cluster will have samples (1, 2, 4) and the other will have samples (3, 5).
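The merge loop described above can be sketched in plain Python using single linkage (minimum pairwise distance), stopping at a chosen number of clusters. This is an illustrative sketch, not a library implementation; SciPy's `linkage`/`dendrogram` are the usual tools in practice.

```python
import math

def single_linkage(points, num_clusters):
    # Agglomerative, bottom-up: start with each point in its own cluster.
    clusters = [[p] for p in points]

    def dist(a, b):
        # Single linkage: shortest distance between the closest members.
        return min(math.dist(x, y) for x in a for y in b)

    while len(clusters) > num_clusters:
        # Scan the proximity matrix for the closest pair of clusters...
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        # ...and merge them.
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
print(single_linkage(pts, 2))
```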
Agglomerative vs. Divisive
● Agglomerative: initially each item is in its own cluster; clusters are then merged bottom-up.
● Divisive: initially all items are in one cluster, which is split top-down.
Clustering Algorithms
❏ Time-series analysis
● Time series is a sequence of data points in chronological order, most often gathered at regular intervals.
● It can be applied to any variable that changes over time; generally speaking, data points that are closer together in time are more similar than those further apart.
● It is a way of studying the characteristics of the response variable with respect to time, with time as the independent variable.
● To estimate the target variable when predicting or forecasting, the time variable is used as the point of reference.
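As a small illustration of working with such data, a moving average smooths a series so the underlying trend is easier to see (the sales numbers below are made up):

```python
def moving_average(series, window):
    # Each output value is the mean of the previous `window` observations,
    # which smooths short-term irregularity and exposes the trend.
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# Hypothetical monthly sales with an upward trend plus noise.
sales = [100, 104, 98, 110, 108, 115, 112, 120, 118, 125]
print(moving_average(sales, 3))
```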
Example: stock price
Timestamp            Stock Price
2015-10-11 09:00:00  100
Components of time series
● Trend
● Seasonality
● Cyclical
● Irregularity
Trend: in which there is no fixed interval, and any divergence within the given dataset is over a continuous timeline.
Seasonality: in which regular or fixed-interval shifts occur within the dataset over a continuous timeline.
● Identifying seasonality in time series data is important for the development of a useful time series model.
● Tools that are useful for detecting seasonality in time series data:
○ Time series plots
○ Statistical analysis and tests
Cyclical: in which there is no fixed interval; there is uncertainty in the movement and its pattern.
Time series analysis can be classified as:
❏ Time-series analysis
Techniques used for time series analysis
ARIMA Models (AutoRegressive Integrated Moving Average): combine autoregressive (AR) terms, differencing (I), and moving-average (MA) terms to model and forecast a series.
Text Analysis
● Text data examples: keywords, names, survey responses.
● Core operations: Tokenization, POS Tagging.
Tokenization
● Tokenization splits text into smaller units (tokens), such as sentences or words.
● Text may contain stop words such as is, am, are, this, a, an, the, etc., which carry little meaning and are usually removed during preprocessing.
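A minimal sketch of tokenization plus stop-word removal in plain Python (NLTK ships a much fuller stop-word list; the tiny set here is hand-made for illustration):

```python
STOP_WORDS = {"is", "am", "are", "this", "a", "an", "the"}

def preprocess(text):
    # Lowercase, tokenize on whitespace, strip punctuation, drop stop words.
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

print(preprocess("This is an example of the text preprocessing step."))
```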
Text Analysis Operations using Natural Language Toolkit (NLTK)
POS Tagging: assigning a part-of-speech label (noun, verb, adjective, etc.) to each token.
Term Frequency
Formula: TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in d)
Inverse Document Frequency (IDF)
● Some words, such as 'of', 'and', etc., can be the most frequently present but are of little significance.
● IDF provides a weight for each word based on its frequency in the corpus D.
Formula: IDF(t) = log(N / df(t)), where N is the number of documents in the corpus D and df(t) is the number of documents containing the term t.
TF-IDF
● TF-IDF gives more weight to a word that is rare in the corpus (across all the documents).
● TF-IDF gives more importance to a word that is more frequent in a particular document.
Formula: TF-IDF(t, d) = TF(t, d) × IDF(t)
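Putting TF and IDF together in plain Python, using the common log(N/df) variant of IDF (scikit-learn's `TfidfVectorizer` uses a slightly smoothed formula):

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of the term / total terms in the document.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency over corpus D (log(N / df) variant).
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "dog"],
]
# "the" occurs in every document, so its IDF (hence TF-IDF) is 0;
# rarer words such as "cat" receive positive weight.
print(tf_idf("the", corpus[0], corpus), tf_idf("cat", corpus[0], corpus))
```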
Disadvantage of TF-IDF: it is based on raw counts, so it captures neither word order nor the semantic similarity between words.
Social Network Analysis
Reference: https://www.latentview.com/blog/a-guide-to-social-network-analysis-and-its-use-cases/
A graph is made up of vertices (also called nodes) that are connected by edges (also called links or relationships).
Edge relationships can be:
● Symmetric and Asymmetric (Directionality)
● Binary and Valued (Weight)
Examples of edge relationships:
● If the relationship between nodes is 'child of', then the relationship is asymmetric: if A is the child of B, then B is not the child of A.
● The same holds if someone follows someone else on Twitter: following is not necessarily mutual.
Network Density
● 5 nodes
● Potential edges = 5(5−1)/2 = 20/2 = 10
● Actual edges = 9
● Density = 9/10 = 90%
● Hence it is a high-density network.
● 5 nodes
● Potential edges = 5(5−1)/2 = 20/2 = 10
● Actual edges = 4
● Density = 4/10 = 40%
● Hence it is a low-density network.
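The density calculation above is one line of plain Python (the edge list below is illustrative, matching the 5-node, 4-edge case):

```python
def density(num_nodes, edges):
    # Undirected graph: density = actual edges / potential edges,
    # where potential edges = n(n - 1) / 2.
    potential = num_nodes * (num_nodes - 1) / 2
    return len(edges) / potential

# A 5-node network with 4 edges.
edges = [(1, 2), (1, 3), (3, 4), (4, 5)]
print(density(5, edges))  # 4 / 10 = 0.4
```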
● Closeness Centrality: closeness measures how close a node is to the rest of the network. It is the ability of the node to reach the other nodes in the network.
● It is calculated as the inverse of the sum of the distances between a node and the other nodes in the network.
● Example: the sum of distances from node 1 to all other nodes is 16, so the closeness score for node 1 will be 1/16.
● The standardized score is calculated by multiplying the score by (n − 1).
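A minimal sketch in plain Python, using breadth-first search for shortest-path distances on an unweighted graph. The chain graph here is hypothetical (not the slide's network, whose node 1 had a distance sum of 16); NetworkX provides `closeness_centrality` for real use.

```python
from collections import deque

def closeness(graph, node):
    # BFS gives shortest-path distances on an unweighted graph;
    # closeness = 1 / (sum of distances to all other nodes).
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return 1 / sum(dist.values())

# Hypothetical 5-node chain: 1-2-3-4-5.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
score = closeness(graph, 3)
print(score, score * (len(graph) - 1))  # standardized: multiply by (n - 1)
```

The middle node 3 scores higher than the end node 1, as expected: it can reach every other node in fewer steps.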
Business analysis
● Validation data: used for validating the performance of the model.
Model Evaluation & Selection
Cross Validation
● It is a resampling procedure used to evaluate machine learning models and assess how the model will perform on an independent test dataset.
● In the case of holdout cross-validation, the dataset is randomly split into training and validation data.
● The more data that is used to train the model, the better the model tends to be.
Cons: for the holdout cross-validation method, a good amount of data is isolated from training.
❏ In the example, the dataset is split to create test data with a size of 30% and train data with a size of 70%. The random_state number ensures the split is deterministic in every run.
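The described 70/30 split can be sketched in plain Python as a stand-in for scikit-learn's `train_test_split` (which takes the same `test_size`/`random_state` idea; the function name here is our own):

```python
import random

def holdout_split(data, test_size=0.3, random_state=42):
    # Shuffle deterministically (random_state fixes the split across runs),
    # then hold out the last `test_size` fraction for testing.
    rng = random.Random(random_state)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

train, test = holdout_split(range(10), test_size=0.3)
print(len(train), len(test))  # 7 3
```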
● Random subsampling: unlike k-fold cross-validation, the dataset is not split into groups or folds; each split is made at random.
Pros
1. The proportion of train and validation splits is not dependent on the number of iterations or partitions.
Cons
1. Some samples may not be selected for either training or validation.
2. Not suitable for an imbalanced dataset.
Model Evaluation & Selection
● Models all differ in some way or the other, but what makes them different is nothing but the input parameters (hyperparameters) of the model.
● The best part about these is that you get a choice to select them for your model.
Parameter Tuning
● We are not aware of the optimal values for the hyperparameters that would generate the best model output.
● So we tell the model to explore and select the optimal model architecture automatically.
❏ Questions that hyperparameter tuning will answer for us:
● What should be the value for the maximum depth of the Decision Tree?
● How many trees should I select in a Random Forest model?
● Should we use a single-layer or multi-layer Neural Network? If multiple layers, how many layers should there be?
● How many neurons should I include in the Neural Network?
● What should be the minimum sample split value for Decision Tree?
● What value should I select for the minimum sample leaf for my Decision Tree?
❏ Common strategies for hyperparameter tuning:
● Manual Search
● Random Search
● Grid Search
● Manual Search
❏ We select some hyperparameters for a model based on our gut feeling and experience.
❏ Based on these parameters, the model is trained, and its performance measures are checked.
❏ This process is repeated with another set of values for the same hyperparameters until optimal accuracy is achieved, or the model reaches minimal error.
❏ This might not be of much help, as human judgment is biased and human experience plays a significant role here.
● Random Search
❏ Instead of doing multiple rounds of this process manually, it is better to give multiple values for all the hyperparameters in one go, sample random combinations, and let the search decide which one best suits the model.
● Grid Search
❏ Every combination of the supplied hyperparameter values is evaluated exhaustively, and the best-performing combination is selected.
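The exhaustive search can be sketched in plain Python (scikit-learn's `GridSearchCV` is the practical tool). The scoring function below is a made-up stand-in that pretends accuracy peaks at `max_depth=5`, `n_trees=100`:

```python
from itertools import product

def grid_search(param_grid, score_fn):
    # Evaluate every combination of hyperparameter values and
    # return the best-scoring combination.
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function: highest (0) at depth 5 and 100 trees.
def score_fn(p):
    return -abs(p["max_depth"] - 5) - abs(p["n_trees"] - 100) / 100

grid = {"max_depth": [3, 5, 10], "n_trees": [50, 100, 200]}
print(grid_search(grid, score_fn))
```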
Confusion Matrix
● A table that summarizes a classifier's predictions against the actual labels, with counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
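The four counts, and the metrics derived from them, can be computed directly in plain Python (the labels below are made up):

```python
def confusion_matrix(y_true, y_pred):
    # Counts for a binary classifier: TP, FP, FN, TN.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, accuracy, precision, recall)
```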
ROC-AUC Curve
● The Receiver Operator Characteristic (ROC) curve is an evaluation metric for binary
classification problems.
● It is a probability curve that plots the TPR against FPR at various threshold values
and essentially separates the ‘signal’ from the ‘noise’.
● The Area Under the Curve (AUC) is the measure of the ability of a classifier to
distinguish between classes and is used as a summary of the ROC curve.
● The higher the AUC, the better the performance of the model at distinguishing
between the positive and negative classes.
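One way to see what AUC measures: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting as half). A plain-Python sketch with made-up scores:

```python
def roc_auc(y_true, scores):
    # AUC via the pairwise-ranking interpretation: fraction of
    # positive/negative pairs the classifier orders correctly.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(roc_auc(y_true, scores))
```

A perfect classifier scores every positive above every negative and gets AUC = 1.0; random scoring hovers around 0.5.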