Clustering Financial Data
(Unsupervised Learning)
Clustering
Clustering is the process of dividing a dataset into groups (also known as clusters) based on the patterns in the data.
For example, a bank may segment its customers into different groups: High Income, Average Income and Low Income.
There are situations where we do not have any target variable to predict. Such problems, without a fixed target variable, are known as unsupervised learning problems: we only have the independent variables and no target/dependent variable. Clustering is therefore an unsupervised machine learning technique.
Properties of Clusters:
All the data points within a cluster should be as similar to each other as possible.
Data points from different clusters should be as different from each other as possible.
Clustering
Applications of Clustering:
Customer Segmentation
Document Clustering
Image Segmentation
Recommendation Engines

Types of clustering:
Partitional clustering
Hierarchical clustering
Density-based clustering
Clustering
Partitional clustering:
Divides data objects into non-overlapping groups. In other words, no object can be a member of more than one cluster, and every cluster must have at least one object. These techniques require the user to specify the number of clusters, indicated by the variable k.
These algorithms are typically nondeterministic, meaning they can produce different results from two separate runs even if the runs are based on the same input.

Strengths:
They work well when the clusters have a roughly spherical shape.
They are scalable with respect to algorithm complexity.

Weaknesses:
They are not well suited for clusters with complex shapes and different sizes.
They break down when used with clusters of different densities.
Clustering - KMeans
The main objective of the K-Means algorithm is to minimize the sum of distances between the points and
their respective cluster centroid.
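Formally, with clusters C_1, …, C_k and centroids μ_1, …, μ_k, this objective is the within-cluster sum of squares, which in standard notation reads:

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i .$$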
Steps:
1. Choose the number of clusters k and initialize k centroids (for example by picking k random points).
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat the assignment and update steps until the cluster assignments no longer change (or a maximum number of iterations is reached).
Clustering - KMeans
Movement of the cluster means at each step of the k-means algorithm: the first plot shows the initial random assignment, followed by three corrective steps; light circles show the former centroid positions.
Clustering - KMeans
Smarter initialization (K-Means++):
The first cluster centroid is chosen uniformly at random from the data points that we want to cluster. This is similar to what we do in K-Means, but instead of randomly picking all the centroids, we pick only one centroid here.
Next, we compute the distance D(x) of each data point x from the nearest cluster centroid that has already been chosen.
Then, we choose the next cluster centroid from the data points, with the probability of picking x proportional to D(x)^2.
These last two steps are repeated until k centroids have been chosen; a short R sketch of this seeding scheme follows.
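As an illustration, here is a minimal R sketch of this seeding scheme (commonly known as k-means++); the data matrix X, the cluster count k and the helper name kmeanspp_centers are illustrative assumptions, and the resulting centres can be passed to the built-in kmeans() function.

```r
# Minimal k-means++-style seeding sketch (assumes a numeric matrix X and a cluster count k)
kmeanspp_centers <- function(X, k) {
  n <- nrow(X)
  centers <- X[sample(n, 1), , drop = FALSE]          # first centre: uniform at random
  while (nrow(centers) < k) {
    # squared distance D(x)^2 of every point to its nearest already-chosen centre
    d2 <- apply(X, 1, function(x) min(colSums((t(centers) - x)^2)))
    # next centre: sampled with probability proportional to D(x)^2
    centers <- rbind(centers, X[sample(n, 1, prob = d2), , drop = FALSE])
  }
  centers
}

set.seed(1)
X   <- matrix(rnorm(200), ncol = 2)                   # toy data
fit <- kmeans(X, centers = kmeanspp_centers(X, k = 3))
```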
Clustering - KMeans
How to Choose the Right Number of Clusters in K-Means Clustering?
Cross Validation
Elbow Method
Silhouette Method
X-means Clustering

Cross Validation: A commonly used method for determining the value of k. It divides the data into X parts, trains the model on X-1 parts and validates (tests) the model on the remaining part. The model is validated by checking the value of the sum of squared distances to the centroids, and this final value is averaged over the X parts. In practice, we perform cross validation for different values of k and then choose the value which returns the lowest error.

Elbow Method: This method finds the best k value by considering the percentage of variance explained as a function of the number of clusters. It results in a plot similar to PCA's scree plot. The logic behind selecting the best number of clusters is similar to PCA: in PCA we select the number of components that together explain most of the variance in the data, and in the elbow plot we select the value of k beyond which adding more clusters no longer explains much additional variance (the "elbow" of the curve).

Silhouette Method: It returns a value between -1 and 1 based on the similarity of an observation with its own cluster. The observation is also compared with the other clusters to derive the similarity score: a high value indicates a good match with its own cluster, and vice versa. Any distance metric can be used to calculate the silhouette score.

X-means Clustering: This method is a modification of the k-means technique. In simple words, it starts from k = 1 and continues to divide the set of observations into clusters until the best split is found or the stopping criterion is reached. How does it find the best split? It uses the Bayesian information criterion (BIC) to decide.
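A minimal R sketch of the elbow method, assuming a generic numeric data matrix X: kmeans() is run for a range of k values and the within-cluster sum of squares (WCSS) is plotted against k, looking for the point where the curve flattens.

```r
set.seed(42)
X <- matrix(rnorm(300), ncol = 2)                    # toy data; replace with your own feature matrix

wcss <- sapply(1:10, function(k) {
  kmeans(X, centers = k, nstart = 25)$tot.withinss   # total within-cluster sum of squares for this k
})

plot(1:10, wcss, type = "b",
     xlab = "Number of clusters k", ylab = "WCSS",
     main = "Elbow method")                          # the "elbow" suggests a reasonable k
```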
Clustering - KMeans
Different Evaluation Metrics:

Inertia: Inertia calculates the sum of distances of all the points within a cluster from the centroid of that cluster. Another term for it is the Within-Cluster Sum of Squares (WCSS). We graph the relationship between the number of clusters and the WCSS. The distances between points within a cluster should be as low as possible: inertia tries to minimize the intra-cluster distance.

Dunn Index: Different clusters should be as different from each other as possible. Along with the distance between the centroid and the points, the Dunn index also takes into account the distance between two clusters; the distance between the centroids of two different clusters is known as the inter-cluster distance.

Dunn Index = min(inter-cluster distance) / max(intra-cluster distance)

The Dunn index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. We want to maximize the Dunn index: the higher its value, the better the clusters.
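As a hedged illustration, the inertia (WCSS) of a k-means fit is available directly from kmeans(), and the Dunn index can be computed with the dunn() function from the clValid package (assumed to be installed); X is again a generic data matrix.

```r
library(clValid)

km <- kmeans(X, centers = 3, nstart = 25)
km$tot.withinss                               # inertia / within-cluster sum of squares (lower is better)

d <- dist(X)                                  # pairwise distances between observations
dunn(distance = d, clusters = km$cluster)     # Dunn index (higher is better)
```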
Clustering - KMeans
Different Evaluation Metrics :
Silhouette coefficient:
How well a data point fits into its assigned cluster, based on two factors:
How close the data point is to the other points in its own cluster.
How far away the data point is from the points in other clusters.
Silhouette coefficient values range between -1 and 1. Larger values indicate that samples are closer to their own clusters than they are to other clusters.
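A brief sketch using the silhouette() function from the cluster package (assumed available), again on a generic data matrix X with a k-means fit km:

```r
library(cluster)

km  <- kmeans(X, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(X))   # one row per observation: cluster, neighbour, sil_width
mean(sil[, "sil_width"])                 # average silhouette coefficient for this choice of k
plot(sil)                                # silhouette plot, one band per cluster
```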
Clustering - KMeans
Clustering of stocks by return and volatility
Outlier treatment
When creating the clusters, outliers are detected in a scatter plot. Outliers are data points that are
significantly different from the rest of the data points in the data set. Often, they can lead to inaccurate
results when using an algorithm, since they don’t fit the same pattern as the other data points. Therefore,
it is important to segregate and remove outliers to improve the accuracy of the model.
Outlier removal can help the algorithm focus on the most representative data points and reduce the
effect of outliers on the results. This can help increase the accuracy of the model and ensure that the data
points are grouped correctly.
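A minimal R sketch of this workflow; the ticker list is purely illustrative (a real application would use a much larger universe), and the 3-z-score rule used to flag outliers is just one simple choice.

```r
library(quantmod)

tickers <- c("AAPL", "MSFT", "JPM", "XOM", "KO")                 # illustrative tickers
prices  <- do.call(merge, lapply(tickers, function(t)
  Ad(getSymbols(t, auto.assign = FALSE, from = "2020-01-01"))))
rets <- na.omit(diff(log(prices)))                               # daily log returns

# Annualized return and volatility per stock
feat <- data.frame(ret = colMeans(rets) * 252,
                   vol = apply(rets, 2, sd) * sqrt(252))

# Simple outlier rule: drop stocks more than 3 z-scores from the mean on either feature
z    <- scale(feat)
keep <- apply(abs(z) < 3, 1, all)
feat <- feat[keep, ]

km <- kmeans(scale(feat), centers = 2, nstart = 25)              # cluster by return and volatility
plot(feat$vol, feat$ret, col = km$cluster, pch = 19,
     xlab = "Annualized volatility", ylab = "Annualized return")
```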
Gaussian Graphical Models (GGM)
Exploratory data analyses are an important first step in scientific research. Exploratory analyses provide a first
understanding of the relationships between items and variables included in a study, which enables researchers to
better understand the data before opting for more complicated and sophisticated analyses.
Exploratory analyses are of particular relevance in so-called problem-oriented fields such as environmental psychology,
where researchers often study how variables from different theories can help to explain a phenomenon to help solve a
problem. Furthermore, applied psychologists may often work on large projects in which people from different (sub)
disciplines collaborate in understanding climate-change related topics (or other complex challenges). Such problem-
oriented approaches often aim to examine multiple research questions and test multiple hypotheses and theories,
typically with questionnaire studies. This can result in large multivariate datasets. In such situations, researchers
would profit from exploratory methods and analyses that help them get a “feel” for patterns in their dataset in an
intuitive manner.
In such cases, exploratory analyses may involve three steps. First, relationships between items included in a study can
be explored to get some initial insights into whether items that are assumed to measure the same underlying
construct are indeed correlated. Second, after aggregating individual items into relevant scales, researchers can
explore relationships between variables, as they would expect on the basis of theory. Third, in cases where the dataset
comprises multiple groups, exploratory analyses are helpful for examining similarities and differences in the relationships between these variables across groups.
Gaussian Graphical Models (GGM)
A Gaussian graphical model comprises a set of items or variables, depicted by circles, and a set of lines that visualize the relationships between those items or variables. The thickness of a line represents the strength of the relationship, and consequently the absence of a line implies no or only a very weak relationship between the relevant items or variables.
Notably, in a Gaussian graphical model these lines capture partial correlations, that is, the correlation between two items or variables when controlling for all other items or variables included in the data set. A key advantage of partial correlations is that they avoid spurious correlations.
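Concretely, the partial correlation drawn on an edge can be read off the precision matrix Ω = Σ⁻¹ (with entries ω_ij):

$$\rho_{ij \cdot \text{rest}} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}},$$

so a zero entry in the precision matrix corresponds to a zero partial correlation, i.e. no line between variables i and j.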
While this visual representation of relationships can facilitate getting a first feel of the data, Gaussian graphical models
can still be hard to read when the estimated graphs are dense and contain a large number of lines. In fact, due to
sampling variation, truly zero partial correlations are rarely observed, and, as a consequence, graphs can be very
dense and consist of spurious relationships.
In Gaussian graphical models, the glasso algorithm is a commonly used method to obtain a sparser graph. This
algorithm forces small partial correlation coefficients to zero and thus induces sparsity.
Gaussian Graphical Models (GGM)
The amount of sparsity in the graph is controlled by a tuning parameter and different values of the tuning parameter
result in different graphs.
Low values of the tuning parameter will result in dense graphs and high values of the tuning parameter will result in
sparse graphs. Typically, the extended Bayesian information criterion (EBIC) is used to select an optimal setting of the tuning parameter such that the strongest relationships are retained in the graph (maximizing true positives).
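A small R sketch of this tuning step, using the EBICglasso() helper from the qgraph package (assumed to be installed); R_mat and n_obs are placeholders for the correlation matrix of the items and the sample size.

```r
library(qgraph)

# R_mat: correlation matrix of the observed items/variables, n_obs: number of observations
pcor_net <- EBICglasso(S = R_mat, n = n_obs, gamma = 0.5)  # gamma controls how strongly EBIC favours sparsity
qgraph(pcor_net, layout = "spring")                        # draw the resulting partial-correlation network
```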
Covariance, Precision & Adjacency Matrices
The covariance matrix, denoted as Σ, is a symmetric matrix that quantifies the pairwise relationships between
variables in a multivariate dataset. Each element in the covariance matrix represents the covariance between two
variables. In the context of portfolio analysis, the covariance matrix describes the variability and co-movements of the
returns of different assets in the portfolio.
The precision matrix, denoted as Ω or Θ, is the inverse of the covariance matrix. It quantifies the partial correlations
between variables after accounting for the influence of all other variables. A non-zero entry in the precision matrix
indicates a direct conditional dependence between two variables, given all other variables in the model. In the
context of GGMs, the precision matrix represents the structure of conditional dependencies among variables.
The adjacency matrix represents the structure of conditional dependencies among variables in the GGM. It is
obtained from the precision matrix by thresholding or binarizing the non-zero elements. Each non-zero entry in the
adjacency matrix corresponds to an edge in the graphical representation of the GGM, indicating a direct conditional
dependence between the corresponding pair of variables.
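A compact R sketch relating the three matrices, assuming rets is a matrix of asset returns (rows are dates, columns are assets, with more rows than columns); the 0.01 cut-off used to binarize is an arbitrary illustrative threshold.

```r
Sigma <- cov(rets)                          # covariance matrix
Omega <- solve(Sigma)                       # precision matrix (inverse covariance)

# Partial correlations implied by the precision matrix
pcor <- -cov2cor(Omega)
diag(pcor) <- 1

# Adjacency matrix: set small partial correlations to zero and binarize the rest
A <- (abs(pcor) > 0.01) * 1
diag(A) <- 0
```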
Covariance, Precision & Adjacency Matrices
In the glasso algorithm, the goal is to estimate a sparse precision matrix from observed data.
By imposing a penalty on the L1 norm of the precision matrix, the glasso algorithm encourages sparsity, leading to a
sparse representation of conditional dependencies.
The resulting sparse precision matrix provides insights into the conditional relationships between variables, with
many entries being zero, indicating conditional independence.
Precision Matrix - Usage
1. Modeling Conditional Dependencies: The precision matrix captures the conditional dependencies between assets in a portfolio. Each element of the precision matrix represents the partial correlation between two assets while controlling for the effects of all other assets. A high absolute value of an entry in the precision matrix indicates a strong conditional relationship between the corresponding pair of assets.
2. Portfolio Optimization: The precision matrix plays a crucial role in portfolio optimization, where the goal is to construct a portfolio that maximizes returns while minimizing risk. By incorporating the conditional dependencies between assets represented in the precision matrix, investors can build more efficient portfolios that account for the interrelationships among assets (a worked formula follows this list).
3. Risk Management: Understanding the conditional dependencies between assets is essential for effective risk management in financial markets. The precision matrix helps identify which assets are likely to move together or diverge under different market conditions, allowing investors to manage portfolio risk more effectively.
4. Factor Modeling: The precision matrix can also be used in factor modeling, where the goal is to identify common factors driving asset returns. By examining the structure of the precision matrix, analysts can identify clusters of assets that are strongly related and potentially driven by common underlying factors.
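As an illustration of the portfolio-optimization point above: under standard mean-variance assumptions, the weights of the global minimum-variance portfolio depend on the covariance matrix only through its inverse, the precision matrix Ω = Σ⁻¹:

$$w_{\text{GMV}} = \frac{\Omega\,\mathbf{1}}{\mathbf{1}^{\top}\Omega\,\mathbf{1}},$$

where 1 is a vector of ones, so a well-estimated (and, with glasso, sparse) precision matrix feeds directly into portfolio construction.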
Covariance - Visualization
By visualizing the covariance matrix, analysts can gain insights into the relationships between assets in the portfolio
and assess the diversification benefits.
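For example, a heatmap of the corresponding correlation matrix can be drawn with the corrplot package (assumed installed), with rets again denoting a matrix of asset returns:

```r
library(corrplot)

Sigma <- cov(rets)                                   # covariance matrix of asset returns
corrplot(cov2cor(Sigma), method = "color",
         order = "hclust", tl.cex = 0.7)             # correlation heatmap, assets grouped by similarity
```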
Glasso Algorithm
The graphical lasso (glasso) algorithm is a method used for estimating sparse inverse covariance matrices from data.
It is particularly useful in situations where the number of variables (or features) is large compared to the number of
observations and where there are conditional dependencies among the variables. In financial applications, the glasso
algorithm can be used to model the relationships between different assets, considering the conditional
dependencies among them.
Glasso Algorithm
Objective:
The goal of the glasso algorithm is to estimate a sparse inverse covariance matrix (also known as the precision matrix) from observed data. It assumes that the data follow a multivariate normal distribution and seeks the precision matrix that best represents the conditional dependencies between variables.
Regularization:
1. The glasso algorithm incorporates a penalty term, typically the L1 norm (lasso penalty), to promote sparsity in the precision matrix.
2. By penalizing the absolute values of the elements of the precision matrix, the glasso algorithm encourages many elements to be exactly zero, resulting in a sparse structure (the penalized objective is written out below).
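Written out, the glasso estimate is the positive-definite precision matrix Θ that maximizes the L1-penalized Gaussian log-likelihood, with S the sample covariance matrix and ρ ≥ 0 the regularization parameter:

$$\hat{\Theta} = \arg\max_{\Theta \succ 0}\; \log\det\Theta \;-\; \operatorname{tr}(S\Theta) \;-\; \rho\,\lVert \Theta \rVert_{1}.$$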
Glasso Algorithm
Optimization:
1. The glasso algorithm minimizes the negative log-likelihood of the data subject to
the L1 penalty on the precision matrix.
2. This optimization problem is typically solved using efficient optimization
techniques such as coordinate descent or block coordinate descent.
Estimation:
1. The output of the glasso algorithm is the estimated sparse inverse covariance
matrix, which provides insights into the conditional dependencies between
variables.
2. The precision matrix captures partial correlations between variables after
accounting for the influence of all other variables in the dataset.
Glasso Algorithm
Steps:
1. Retrieve historical stock price data using the getSymbols function from the quantmod package.
2. Calculate the log returns of the stock prices to obtain a time series of returns for each asset.
3. Apply the glasso algorithm to estimate the sparse precision matrix from the covariance matrix of the returns data.
4. Visualize the estimated precision matrix using a heatmap to observe the conditional dependencies between the assets.
A sketch of these steps in R follows.
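A hedged end-to-end sketch of these steps; the tickers, date range and penalty value rho = 0.01 are illustrative choices, and the glasso package is assumed to be installed.

```r
library(quantmod)
library(glasso)

tickers <- c("AAPL", "MSFT", "GOOG", "JPM", "GS", "XOM")     # illustrative universe
prices  <- do.call(merge, lapply(tickers, function(t)
  Ad(getSymbols(t, auto.assign = FALSE, from = "2021-01-01"))))

rets <- na.omit(diff(log(prices)))                           # log returns
colnames(rets) <- tickers

S   <- cov(rets)                                             # sample covariance matrix of returns
fit <- glasso(S, rho = 0.01)                                 # L1-penalized fit; fit$wi is the sparse precision matrix

Theta <- fit$wi
rownames(Theta) <- colnames(Theta) <- tickers
heatmap(Theta, symm = TRUE, main = "Estimated sparse precision matrix")
```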
Wishart Distribution
The Wishart distribution is commonly used in financial applications, particularly in portfolio optimization and
risk management. It's a probability distribution that describes the distribution of sample covariance matrices,
which are fundamental in understanding the relationship between multiple assets in a portfolio.
In R, you can simulate draws from a Wishart distribution using the rWishart() function from the stats package (included with base R).
In financial applications, you might use the Wishart distribution to simulate possible scenarios for the
covariance matrix of asset returns. This can be useful in portfolio optimization, risk assessment, and
estimating the uncertainty associated with your portfolio's performance. For example, you could use these
simulated covariance matrices in a Monte Carlo simulation to assess the potential distribution of portfolio
returns under different market conditions.
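A minimal sketch using rWishart(); the toy returns, degrees of freedom and number of draws are illustrative, and each rescaled draw can be treated as one covariance-matrix scenario in a Monte Carlo study.

```r
set.seed(7)
p     <- 4                                             # number of assets
rets  <- matrix(rnorm(252 * p, sd = 0.01), ncol = p)   # toy daily returns
Sigma <- cov(rets)                                     # scale matrix for the Wishart
df    <- 252                                           # degrees of freedom ~ number of observations

draws <- rWishart(n = 1000, df = df, Sigma = Sigma)    # p x p x 1000 array of Wishart draws
covs  <- draws / df                                    # rescale each draw to a covariance-matrix scenario
dim(covs)
```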
Wishart Distribution
Properties:
The mode of the distribution depends on the degrees of freedom and the scale matrix: for ν ≥ p + 1 it equals (ν − p − 1)Σ, so, after rescaling by the degrees of freedom, the mode approaches the scale matrix as ν grows.
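For a Wishart-distributed matrix W ~ W_p(ν, Σ) with dimension p, degrees of freedom ν and scale matrix Σ, the mean and mode are:

$$\mathbb{E}[W] = \nu\,\Sigma, \qquad \operatorname{mode}(W) = (\nu - p - 1)\,\Sigma \quad \text{for } \nu \ge p + 1.$$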
Wishart Distribution
The Wishart distribution is a probability distribution that describes the distribution of sample
covariance matrices
Multivariate Data: The Wishart distribution is defined for multivariate data, meaning data with
more than one dimension or variable. It's commonly used when dealing with multiple
correlated variables simultaneously, such as asset returns in finance, pixel intensities in image
processing, or genetic data in biology.
Wishart Distribution
Parameters: The Wishart distribution depends on two main parameters:
Degrees of Freedom (df): This parameter determines the shape of the distribution and is usually denoted by ν. It represents the number of observations used to estimate the covariance matrix. The more degrees of freedom, the more information the estimate is based on, so (after rescaling by ν) the sample covariance matrix concentrates more tightly around the population covariance matrix.
Scale Matrix (Σ): This is the true covariance matrix of the variables being studied. It is typically denoted by Σ and represents the variability of, and the relationships between, the variables.
Wishart Distribution
Applications:
Portfolio Optimization: In finance, the Wishart distribution is used to model the distribution
of covariance matrices of asset returns. This is essential for portfolio optimization, where
investors aim to construct portfolios with optimal risk-return characteristics.
Wishart Distribution
Statistical Inference: The Wishart distribution is the sampling distribution of the sample covariance matrix when the underlying data are multivariate normal, and it serves as the conjugate prior for the precision matrix in Bayesian inference.