Clustering Financial Data
(Unsupervised Learning)
Clustering
Clustering is the process of dividing a dataset into groups (also known as clusters) based on the patterns in the data.
For example, a bank may segment its customers into different groups: High Income, Average Income and Low Income.
There are situations where we do not have any target variable to predict. Such problems, without a fixed target variable, are known as unsupervised learning problems: we only have the independent variables and no target/dependent variable. Clustering is therefore an unsupervised machine learning technique.
Properties of Clusters:
All the data points within a cluster should be as similar to each other as possible.
Data points from different clusters should be as different from each other as possible.
Clustering
Applications of Clustering:
Customer Segmentation
Document Clustering
Image Segmentation
Recommendation Engines

Types of clustering:
Partitional clustering
Hierarchical clustering
Density-based clustering
Clustering
Partitional clustering:
Divides data objects into non-overlapping groups. In other words, no object can be a member of more than one cluster, and every cluster must have at least one object. These techniques require the user to specify the number of clusters, indicated by the variable k.
These algorithms are typically nondeterministic, meaning they can produce different results from two separate runs even if the runs are based on the same input.

Strengths:
They work well when the clusters have a roughly spherical shape.
They are scalable with respect to algorithm complexity.

Weaknesses:
They are not well suited for clusters with complex shapes and different sizes.
They break down when used with clusters of different densities.
Clustering - KMeans
The main objective of the K-Means algorithm is to minimize the sum of distances between the points and
their respective cluster centroid.
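Formally, with clusters C_1, …, C_k and centroids μ_1, …, μ_k, this objective is the within-cluster sum of squares, which in standard notation reads:

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i .$$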
Steps:
1. Choose the number of clusters k and initialize k centroids (for example by picking k random points).
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat the assignment and update steps until the cluster assignments no longer change (or a maximum number of iterations is reached).
Clustering - KMeans
Movement of the cluster means at each step of the k-means algorithm: the first plot shows the initial random assignment, followed by three corrective steps; light circles show the former centroid positions.
Clustering - KMeans
Smarter initialization (K-Means++):
The first cluster centroid is chosen uniformly at random from the data points that we want to cluster. This is similar to what we do in K-Means, but instead of randomly picking all the centroids, we pick only one centroid here.
Next, we compute the distance D(x) of each data point x from the nearest cluster centroid that has already been chosen.
Then, we choose the next cluster centroid from the data points, with the probability of picking x proportional to D(x)^2.
These last two steps are repeated until k centroids have been chosen; a short R sketch of this seeding scheme follows.
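As an illustration, here is a minimal R sketch of this seeding scheme (commonly known as k-means++); the data matrix X, the cluster count k and the helper name kmeanspp_centers are illustrative assumptions, and the resulting centres can be passed to the built-in kmeans() function.

```r
# Minimal k-means++-style seeding sketch (assumes a numeric matrix X and a cluster count k)
kmeanspp_centers <- function(X, k) {
  n <- nrow(X)
  centers <- X[sample(n, 1), , drop = FALSE]          # first centre: uniform at random
  while (nrow(centers) < k) {
    # squared distance D(x)^2 of every point to its nearest already-chosen centre
    d2 <- apply(X, 1, function(x) min(colSums((t(centers) - x)^2)))
    # next centre: sampled with probability proportional to D(x)^2
    centers <- rbind(centers, X[sample(n, 1, prob = d2), , drop = FALSE])
  }
  centers
}

set.seed(1)
X   <- matrix(rnorm(200), ncol = 2)                   # toy data
fit <- kmeans(X, centers = kmeanspp_centers(X, k = 3))
```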
Clustering - KMeans
How to Choose the Right Number of Clusters in K-Means Clustering?
Cross Validation
Elbow Method
Silhouette Method
X-means Clustering

Cross Validation: A commonly used method for determining the value of k. It divides the data into X parts, trains the model on X-1 parts and validates (tests) the model on the remaining part. The model is validated by checking the value of the sum of squared distances to the centroids, and this final value is averaged over the X parts. In practice, we perform cross validation for different values of k and then choose the value which returns the lowest error.

Elbow Method: This method finds the best k value by considering the percentage of variance explained as a function of the number of clusters. It results in a plot similar to PCA's scree plot. The logic behind selecting the best number of clusters is similar to PCA: in PCA we select the number of components that together explain most of the variance in the data, and in the elbow plot we select the value of k beyond which adding more clusters no longer explains much additional variance (the "elbow" of the curve).

Silhouette Method: It returns a value between -1 and 1 based on the similarity of an observation with its own cluster. The observation is also compared with the other clusters to derive the similarity score: a high value indicates a good match with its own cluster, and vice versa. Any distance metric can be used to calculate the silhouette score.

X-means Clustering: This method is a modification of the k-means technique. In simple words, it starts from k = 1 and continues to divide the set of observations into clusters until the best split is found or the stopping criterion is reached. How does it find the best split? It uses the Bayesian information criterion (BIC) to decide.
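A minimal R sketch of the elbow method, assuming a generic numeric data matrix X: kmeans() is run for a range of k values and the within-cluster sum of squares (WCSS) is plotted against k, looking for the point where the curve flattens.

```r
set.seed(42)
X <- matrix(rnorm(300), ncol = 2)                    # toy data; replace with your own feature matrix

wcss <- sapply(1:10, function(k) {
  kmeans(X, centers = k, nstart = 25)$tot.withinss   # total within-cluster sum of squares for this k
})

plot(1:10, wcss, type = "b",
     xlab = "Number of clusters k", ylab = "WCSS",
     main = "Elbow method")                          # the "elbow" suggests a reasonable k
```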
Clustering - KMeans
Different Evaluation Metrics:

Inertia: Inertia calculates the sum of distances of all the points within a cluster from the centroid of that cluster. Another term for it is the Within-Cluster Sum of Squares (WCSS). We graph the relationship between the number of clusters and the WCSS. The distances between points within a cluster should be as low as possible: inertia tries to minimize the intra-cluster distance.

Dunn Index: Different clusters should be as different from each other as possible. Along with the distance between the centroid and the points, the Dunn index also takes into account the distance between two clusters; the distance between the centroids of two different clusters is known as the inter-cluster distance.

Dunn Index = min(inter-cluster distance) / max(intra-cluster distance)

The Dunn index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. We want to maximize the Dunn index: the higher its value, the better the clusters.
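As a hedged illustration, the inertia (WCSS) of a k-means fit is available directly from kmeans(), and the Dunn index can be computed with the dunn() function from the clValid package (assumed to be installed); X is again a generic data matrix.

```r
library(clValid)

km <- kmeans(X, centers = 3, nstart = 25)
km$tot.withinss                               # inertia / within-cluster sum of squares (lower is better)

d <- dist(X)                                  # pairwise distances between observations
dunn(distance = d, clusters = km$cluster)     # Dunn index (higher is better)
```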
Clustering - KMeans
Different Evaluation Metrics :
Silhouette coefficient:
How well a data point fits into its assigned cluster, based on two factors:
How close the data point is to the other points in its own cluster.
How far away the data point is from the points in other clusters.
Silhouette coefficient values range between -1 and 1. Larger values indicate that samples are closer to their own clusters than they are to other clusters.
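A brief sketch using the silhouette() function from the cluster package (assumed available), again on a generic data matrix X with a k-means fit km:

```r
library(cluster)

km  <- kmeans(X, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(X))   # one row per observation: cluster, neighbour, sil_width
mean(sil[, "sil_width"])                 # average silhouette coefficient for this choice of k
plot(sil)                                # silhouette plot, one band per cluster
```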
Clustering - KMeans
Clustering of stocks by return and volatility
Outlier treatment
When creating the clusters, outliers are detected in a scatter plot. Outliers are data points that are
significantly different from the rest of the data points in the data set. Often, they can lead to inaccurate
results when using an algorithm, since they don’t fit the same pattern as the other data points. Therefore,
it is important to segregate and remove outliers to improve the accuracy of the model.
Outlier removal can help the algorithm focus on the most representative data points and reduce the
effect of outliers on the results. This can help increase the accuracy of the model and ensure that the data
points are grouped correctly.
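A minimal R sketch of this workflow; the ticker list is purely illustrative (a real application would use a much larger universe), and the 3-z-score rule used to flag outliers is just one simple choice.

```r
library(quantmod)

tickers <- c("AAPL", "MSFT", "JPM", "XOM", "KO")                 # illustrative tickers
prices  <- do.call(merge, lapply(tickers, function(t)
  Ad(getSymbols(t, auto.assign = FALSE, from = "2020-01-01"))))
rets <- na.omit(diff(log(prices)))                               # daily log returns

# Annualized return and volatility per stock
feat <- data.frame(ret = colMeans(rets) * 252,
                   vol = apply(rets, 2, sd) * sqrt(252))

# Simple outlier rule: drop stocks more than 3 z-scores from the mean on either feature
z    <- scale(feat)
keep <- apply(abs(z) < 3, 1, all)
feat <- feat[keep, ]

km <- kmeans(scale(feat), centers = 2, nstart = 25)              # cluster by return and volatility
plot(feat$vol, feat$ret, col = km$cluster, pch = 19,
     xlab = "Annualized volatility", ylab = "Annualized return")
```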
Gaussian Graphical Models (GGM)
Exploratory data analyses are an important first step in scientific research. Exploratory analyses provide a first
understanding of the relationships between items and variables included in a study, which enables researchers to
better understand the data before opting for more complicated and sophisticated analyses.
Exploratory analyses are of particular relevance in so-called problem-oriented fields such as environmental psychology,
where researchers often study how variables from different theories can help to explain a phenomenon to help solve a
problem. Furthermore, applied psychologists may often work on large projects in which people from different (sub)
disciplines collaborate in understanding climate-change related topics (or other complex challenges). Such problem-
oriented approaches often aim to examine multiple research questions and test multiple hypotheses and theories,
typically with questionnaire studies. This can result in large multivariate datasets. In such situations, researchers
would profit from exploratory methods and analyses that help them get a “feel” for patterns in their dataset in an
intuitive manner.
In such cases, exploratory analyses may involve three steps. First, relationships between items included in a study can
be explored to get some initial insights into whether items that are assumed to measure the same underlying
construct are indeed correlated. Second, after aggregating individual items into relevant scales, researchers can
explore relationships between variables, as they would expect on the basis of theory. Third, in cases where the dataset
comprises multiple groups, exploratory analyses are helpful for examining similarities and differences in the relationships between these variables across groups.
Gaussian Graphical Models (GGM)
A Gaussian graphical model comprises a set of items or variables, depicted by circles, and a set of lines that visualize the relationships between those items or variables. The thickness of a line represents the strength of the relationship, and consequently the absence of a line implies no or only a very weak relationship between the relevant items or variables.
Notably, in a Gaussian graphical model these lines capture partial correlations, that is, the correlation between two items or variables when controlling for all other items or variables included in the data set. A key advantage of partial correlations is that they avoid spurious correlations.
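Concretely, the partial correlation drawn on an edge can be read off the precision matrix Ω = Σ⁻¹ (with entries ω_ij):

$$\rho_{ij \cdot \text{rest}} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}},$$

so a zero entry in the precision matrix corresponds to a zero partial correlation, i.e. no line between variables i and j.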
While this visual representation of relationships can facilitate getting a first feel of the data, Gaussian graphical models
can still be hard to read when the estimated graphs are dense and contain a large number of lines. In fact, due to
sampling variation, truly zero partial correlations are rarely observed, and, as a consequence, graphs can be very
dense and consist of spurious relationships.
In Gaussian graphical models, the glasso algorithm is a commonly used method to obtain a sparser graph. This
algorithm forces small partial correlation coefficients to zero and thus induces sparsity.
Gaussian Graphical Models (GGM)
The amount of sparsity in the graph is controlled by a tuning parameter and different values of the tuning parameter
result in different graphs.
Low values of the tuning parameter will result in dense graphs and high values of the tuning parameter will result in
sparse graphs. Typically, the extended Bayesian information criterion (EBIC) is used to select an optimal setting of the tuning parameter such that the strongest relationships are retained in the graph (maximizing true positives).
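A small R sketch of this tuning step, using the EBICglasso() helper from the qgraph package (assumed to be installed); R_mat and n_obs are placeholders for the correlation matrix of the items and the sample size.

```r
library(qgraph)

# R_mat: correlation matrix of the observed items/variables, n_obs: number of observations
pcor_net <- EBICglasso(S = R_mat, n = n_obs, gamma = 0.5)  # gamma controls how strongly EBIC favours sparsity
qgraph(pcor_net, layout = "spring")                        # draw the resulting partial-correlation network
```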
Covariance, Precision & Adjacency Matrices
The covariance matrix, denoted as Σ, is a symmetric matrix that quantifies the pairwise relationships between
variables in a multivariate dataset. Each element in the covariance matrix represents the covariance between two
variables. In the context of portfolio analysis, the covariance matrix describes the variability and co-movements of the
returns of different assets in the portfolio.
The precision matrix, denoted as Ω or Θ, is the inverse of the covariance matrix. It quantifies the partial correlations
between variables after accounting for the influence of all other variables. A non-zero entry in the precision matrix
indicates a direct conditional dependence between two variables, given all other variables in the model. In the
context of GGMs, the precision matrix represents the structure of conditional dependencies among variables.
The adjacency matrix represents the structure of conditional dependencies among variables in the GGM. It is
obtained from the precision matrix by thresholding or binarizing the non-zero elements. Each non-zero entry in the
adjacency matrix corresponds to an edge in the graphical representation of the GGM, indicating a direct conditional
dependence between the corresponding pair of variables.
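A compact R sketch relating the three matrices, assuming rets is a matrix of asset returns (rows are dates, columns are assets, with more rows than columns); the 0.01 cut-off used to binarize is an arbitrary illustrative threshold.

```r
Sigma <- cov(rets)                          # covariance matrix
Omega <- solve(Sigma)                       # precision matrix (inverse covariance)

# Partial correlations implied by the precision matrix
pcor <- -cov2cor(Omega)
diag(pcor) <- 1

# Adjacency matrix: set small partial correlations to zero and binarize the rest
A <- (abs(pcor) > 0.01) * 1
diag(A) <- 0
```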
Covariance, Precision & Adjacency Matrices
In the glasso algorithm, the goal is to estimate a sparse precision matrix from observed data.
By imposing a penalty on the L1 norm of the precision matrix, the glasso algorithm encourages sparsity, leading to a
sparse representation of conditional dependencies.
The resulting sparse precision matrix provides insights into the conditional relationships between variables, with
many entries being zero, indicating conditional independence.
Precision Matrix - Usage
1. Modeling Conditional Dependencies: The precision matrix captures the conditional dependencies between assets in a portfolio. Each element of the precision matrix represents the partial correlation between two assets while controlling for the effects of all other assets. A high absolute value of an entry in the precision matrix indicates a strong conditional relationship between the corresponding pair of assets.
2. Portfolio Optimization: The precision matrix plays a crucial role in portfolio optimization, where the goal is to construct a portfolio that maximizes returns while minimizing risk. By incorporating the conditional dependencies between assets represented in the precision matrix, investors can build more efficient portfolios that account for the interrelationships among assets (a worked formula follows this list).
3. Risk Management: Understanding the conditional dependencies between assets is essential for effective risk management in financial markets. The precision matrix helps identify which assets are likely to move together or diverge under different market conditions, allowing investors to manage portfolio risk more effectively.
4. Factor Modeling: The precision matrix can also be used in factor modeling, where the goal is to identify common factors driving asset returns. By examining the structure of the precision matrix, analysts can identify clusters of assets that are strongly related and potentially driven by common underlying factors.
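As an illustration of the portfolio-optimization point above: under standard mean-variance assumptions, the weights of the global minimum-variance portfolio depend on the covariance matrix only through its inverse, the precision matrix Ω = Σ⁻¹:

$$w_{\text{GMV}} = \frac{\Omega\,\mathbf{1}}{\mathbf{1}^{\top}\Omega\,\mathbf{1}},$$

where 1 is a vector of ones, so a well-estimated (and, with glasso, sparse) precision matrix feeds directly into portfolio construction.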
Covariance - Visualization
By visualizing the covariance matrix, analysts can gain insights into the relationships between assets in the portfolio
and assess the diversification benefits.
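For example, a heatmap of the corresponding correlation matrix can be drawn with the corrplot package (assumed installed), with rets again denoting a matrix of asset returns:

```r
library(corrplot)

Sigma <- cov(rets)                                   # covariance matrix of asset returns
corrplot(cov2cor(Sigma), method = "color",
         order = "hclust", tl.cex = 0.7)             # correlation heatmap, assets grouped by similarity
```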
Glasso Algorithm
The graphical lasso (glasso) algorithm is a method used for estimating sparse inverse covariance matrices from data.
It is particularly useful in situations where the number of variables (or features) is large compared to the number of
observations and where there are conditional dependencies among the variables. In financial applications, the glasso
algorithm can be used to model the relationships between different assets, considering the conditional
dependencies among them.
Glasso Algorithm
Objective:
The goal of the glasso algorithm is to estimate a sparse inverse covariance matrix (also known as the precision matrix) from observed data. It assumes that the data follow a multivariate normal distribution and seeks the precision matrix that best represents the conditional dependencies between variables.
Regularization:
1. The glasso algorithm incorporates a penalty term, typically the L1 norm (lasso penalty), to promote sparsity in the precision matrix.
2. By penalizing the absolute values of the elements of the precision matrix, the glasso algorithm encourages many elements to be exactly zero, resulting in a sparse structure (the penalized objective is written out below).
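Written out, the glasso estimate is the positive-definite precision matrix Θ that maximizes the L1-penalized Gaussian log-likelihood, with S the sample covariance matrix and ρ ≥ 0 the regularization parameter:

$$\hat{\Theta} = \arg\max_{\Theta \succ 0}\; \log\det\Theta \;-\; \operatorname{tr}(S\Theta) \;-\; \rho\,\lVert \Theta \rVert_{1}.$$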
Glasso Algorithm
Optimization:
1. The glasso algorithm minimizes the negative log-likelihood of the data subject to
the L1 penalty on the precision matrix.
2. This optimization problem is typically solved using efficient optimization
techniques such as coordinate descent or block coordinate descent.
Estimation:
1. The output of the glasso algorithm is the estimated sparse inverse covariance
matrix, which provides insights into the conditional dependencies between
variables.
2. The precision matrix captures partial correlations between variables after
accounting for the influence of all other variables in the dataset.
Glasso Algorithm
Steps:
1. Retrieve historical stock price data using the getSymbols function from the quantmod package.
2. Calculate the log returns of the stock prices to obtain a time series of returns for each asset.
3. Apply the glasso algorithm to estimate the sparse precision matrix from the covariance matrix of the returns data.
4. Visualize the estimated precision matrix using a heatmap to observe the conditional dependencies between the assets.
A sketch of these steps in R follows.
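A hedged end-to-end sketch of these steps; the tickers, date range and penalty value rho = 0.01 are illustrative choices, and the glasso package is assumed to be installed.

```r
library(quantmod)
library(glasso)

tickers <- c("AAPL", "MSFT", "GOOG", "JPM", "GS", "XOM")     # illustrative universe
prices  <- do.call(merge, lapply(tickers, function(t)
  Ad(getSymbols(t, auto.assign = FALSE, from = "2021-01-01"))))

rets <- na.omit(diff(log(prices)))                           # log returns
colnames(rets) <- tickers

S   <- cov(rets)                                             # sample covariance matrix of returns
fit <- glasso(S, rho = 0.01)                                 # L1-penalized fit; fit$wi is the sparse precision matrix

Theta <- fit$wi
rownames(Theta) <- colnames(Theta) <- tickers
heatmap(Theta, symm = TRUE, main = "Estimated sparse precision matrix")
```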
Wishart Distribution
The Wishart distribution is commonly used in financial applications, particularly in portfolio optimization and
risk management. It's a probability distribution that describes the distribution of sample covariance matrices,
which are fundamental in understanding the relationship between multiple assets in a portfolio.
In R, you can simulate draws from a Wishart distribution using the rWishart() function from the stats package (included with base R).
In financial applications, you might use the Wishart distribution to simulate possible scenarios for the
covariance matrix of asset returns. This can be useful in portfolio optimization, risk assessment, and
estimating the uncertainty associated with your portfolio's performance. For example, you could use these
simulated covariance matrices in a Monte Carlo simulation to assess the potential distribution of portfolio
returns under different market conditions.
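A minimal sketch using rWishart(); the toy returns, degrees of freedom and number of draws are illustrative, and each rescaled draw can be treated as one covariance-matrix scenario in a Monte Carlo study.

```r
set.seed(7)
p     <- 4                                             # number of assets
rets  <- matrix(rnorm(252 * p, sd = 0.01), ncol = p)   # toy daily returns
Sigma <- cov(rets)                                     # scale matrix for the Wishart
df    <- 252                                           # degrees of freedom ~ number of observations

draws <- rWishart(n = 1000, df = df, Sigma = Sigma)    # p x p x 1000 array of Wishart draws
covs  <- draws / df                                    # rescale each draw to a covariance-matrix scenario
dim(covs)
```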
Wishart Distribution
Properties:
The mode of the distribution depends on the degrees of freedom and the scale matrix: for ν ≥ p + 1 it equals (ν − p − 1)Σ, so, after rescaling by the degrees of freedom, the mode approaches the scale matrix as ν grows.
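For a Wishart-distributed matrix W ~ W_p(ν, Σ) with dimension p, degrees of freedom ν and scale matrix Σ, the mean and mode are:

$$\mathbb{E}[W] = \nu\,\Sigma, \qquad \operatorname{mode}(W) = (\nu - p - 1)\,\Sigma \quad \text{for } \nu \ge p + 1.$$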
Wishart Distribution
The Wishart distribution is a probability distribution that describes the distribution of sample
covariance matrices
Multivariate Data: The Wishart distribution is defined for multivariate data, meaning data with
more than one dimension or variable. It's commonly used when dealing with multiple
correlated variables simultaneously, such as asset returns in finance, pixel intensities in image
processing, or genetic data in biology.
Wishart Distribution
Parameters: The Wishart distribution depends on two main parameters:
Degrees of Freedom (df): This parameter determines the shape of the distribution and is usually denoted by ν. It represents the number of observations used to estimate the covariance matrix. The more degrees of freedom, the more information the estimate is based on, so (after rescaling by ν) the sample covariance matrix concentrates more tightly around the population covariance matrix.
Scale Matrix (Σ): This is the true covariance matrix of the variables being studied. It is typically denoted by Σ and represents the variability of, and the relationships between, the variables.
Wishart Distribution
Applications:
Portfolio Optimization: In finance, the Wishart distribution is used to model the distribution
of covariance matrices of asset returns. This is essential for portfolio optimization, where
investors aim to construct portfolios with optimal risk-return characteristics.
Wishart Distribution
Statistical Inference: The Wishart distribution is the sampling distribution of the sample covariance matrix when the underlying data are multivariate normal, and it serves as the conjugate prior for the precision matrix in Bayesian inference.