Correlation is a statistical tool that is used to analyze and measure the degree of relationship or degree of association between two or more variables. There are generally three types of correlation:
- Positive correlation: When the value of one variable increases, the value of the other variable tends to increase as well.
- Negative correlation: When the value of one variable increases, the value of the other variable tends to decrease.
- Zero correlation: When a change in the value of one variable is not accompanied by any substantial change in the other.
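The three cases can be illustrated with a small sketch in base R. The vectors below are made up for illustration and are not part of the dataset used later; `cor()` computes the Pearson coefficient by default.

```r
# Illustrative data (hypothetical, chosen so each case is exact)
x      <- 1:10
y_pos  <- 2 * x + 1        # rises as x rises
y_neg  <- 20 - 3 * x       # falls as x rises
y_zero <- (x - 5.5)^2      # symmetric around the mean of x

r_pos  <- cor(x, y_pos)    # positive correlation
r_neg  <- cor(x, y_neg)    # negative correlation
r_zero <- cor(x, y_zero)   # zero correlation (up to floating-point error)

print(c(positive = r_pos, negative = r_neg, zero = r_zero))
```

Note that `y_zero` is fully determined by `x` and still has zero correlation with it, because the coefficient only measures linear association.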

Pearson Correlation:
Pearson correlation is the most common way of calculating correlation. It is denoted by r. For two variables x and y with n paired observations, it is given by the following formula:
r = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})} {\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2 \sum_{i=1}^{n} (y_i - \overline{y})^2}}
A value of -1 represents a perfect negative correlation, 0 represents no linear correlation, and 1 represents a perfect positive correlation. Values closer to -1 or 1 indicate a stronger relationship.
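As a quick check of the formula, the sketch below computes r term by term and compares it with R's built-in `cor()`. The helper name `pearson_r` is made up for this example; `mpg` and `wt` come from the `mtcars` dataset that ships with R.

```r
# Pearson r computed directly from the formula
pearson_r <- function(x, y) {
  dx <- x - mean(x)                 # deviations of x from its mean
  dy <- y - mean(y)                 # deviations of y from its mean
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(mtcars$mpg, mtcars$wt)
r_builtin <- cor(mtcars$mpg, mtcars$wt)   # method = "pearson" is the default

print(r_manual)                           # about -0.87
print(all.equal(r_manual, r_builtin))
```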
The Pearson correlation coefficient is a good estimator of the correlation between two variables when the data are normally distributed. However, it does not meet the criteria of a robust estimator, because it lacks both of the following properties:
- Resistance: changing a small fraction of the data, even by a huge amount, does not considerably affect the value of the estimate.
- Robustness of efficiency: the statistic has high efficiency in a variety of situations rather than in only one. Efficiency means that the estimate is close to the optimal estimate we could obtain if we knew the distribution the data come from.
Efficiency can be measured using the following formula:
Efficiency = \frac{\text{lowest possible variance}}{\text{actual variance}}
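The ratio above can be estimated by simulation. The sketch below is only an illustration under assumed settings (standard normal data, sample size 50, 5000 replications, none of which come from the article's dataset). It compares the sample median with the sample mean, which attains the lowest possible variance among unbiased estimators of the centre of a normal distribution.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only
n    <- 50                         # assumed sample size
reps <- 5000                       # assumed number of simulated samples

# Each row is one simulated sample from a standard normal distribution
samples <- matrix(rnorm(n * reps), nrow = reps)

var_mean   <- var(rowMeans(samples))           # variance of the optimal estimator
var_median <- var(apply(samples, 1, median))   # variance of the estimator under study

efficiency_median <- var_mean / var_median
print(efficiency_median)
```

For large normal samples this ratio approaches 2/pi, about 0.64, so the median gives up roughly a third of the efficiency of the mean when the data really are normal.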
Percentage Bend Correlation:
Percentage bend correlation was proposed by Shoemaker and Hettmansperger in 1982 and is also described by Wilcox in his book. This correlation is both resistant and robust in efficiency.
The steps to compute the percentage bend correlation of two variables X and Y, each with n observations, are:
- Set m = \lfloor (1-\beta) n + 0.5 \rfloor, that is, (1-\beta) n rounded to the nearest integer. Here, \beta is between 0 and 0.5.
- Take W_{i} = |X_{i} - M_{x}| for i = 1, 2, ..., n, where M_x is the median of X.
- Sort the W_i in ascending order to obtain W_{(1)} \le W_{(2)} \le ... \le W_{(n)}.
- Set \hat{W}_{x} = W_{(m)}, which is the estimate of the (1-\beta) quantile of W.
- Sort the X values to obtain X_{(1)} \le X_{(2)} \le ... \le X_{(n)}.
- Count the values \frac{X_{i} - M_{x}}{\hat{W}_{x}} that are less than -1 and store the count in i_1, then count the values that are greater than +1 and store the count in i_2. Then compute the following:
S_{x} = \sum_{i=i_1+1}^{n-i_2}{X_{(i)}}
\hat{\phi}_{x} = \frac{\hat{W}_{x}(i_2 - i_1) + S_{x}}{n - i_1 - i_2}
U_{i} = \frac{X_{i} - \hat{\phi}_{x}}{\hat{W}_{x}}
- Repeat the above steps for Y to get \hat{W}_{y}, \hat{\phi}_{y} and V_i.
- Define the function:
\Psi(x) = \max[-1, \min(1,x)]
and use it to compute:
A_i = \Psi (U_i), B_i = \Psi (V_i)
- Calculate the percent bend correlation:
\rho_{pb} = \frac{\sum_{i=1}^{n}{A_{i}B_{i}}} {\sqrt{\sum_{i=1}^{n}{A_{i}^2}\sum_{i=1}^{n}{B_{i}^2}}}
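The steps above can be sketched in base R as follows. This is a minimal illustration, not the implementation used by any package: the function name `percent_bend_cor`, the default `beta = 0.2` and the rounding rule for m are assumptions of this sketch, so its results may differ slightly from packaged versions.

```r
# Minimal sketch of the percentage bend correlation (hypothetical helper)
percent_bend_cor <- function(x, y, beta = 0.2) {
  stopifnot(length(x) == length(y), beta > 0, beta <= 0.5)

  standardize <- function(v) {
    n     <- length(v)
    m     <- floor((1 - beta) * n + 0.5)   # (1 - beta) * n rounded to nearest integer
    med   <- median(v)
    w_hat <- sort(abs(v - med))[m]         # m-th smallest absolute deviation
    # Note: w_hat can be 0 for heavily tied data; this sketch does not handle that
    z  <- (v - med) / w_hat
    i1 <- sum(z < -1)                      # count of low outlying values
    i2 <- sum(z > 1)                       # count of high outlying values
    s  <- sum(v[z >= -1 & z <= 1])         # equals the sum of sorted v from i1+1 to n-i2
    phi_hat <- (w_hat * (i2 - i1) + s) / (n - i1 - i2)
    (v - phi_hat) / w_hat                  # U_i (or V_i for the second variable)
  }

  a <- pmax(-1, pmin(1, standardize(x)))   # Psi(U_i)
  b <- pmax(-1, pmin(1, standardize(y)))   # Psi(V_i)
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

x <- mtcars$wt
y <- mtcars$mpg
y_bad <- y
y_bad[1] <- 1000                           # corrupt a single observation

print(c(pearson = cor(x, y),     percent_bend = percent_bend_cor(x, y)))
print(c(pearson = cor(x, y_bad), percent_bend = percent_bend_cor(x, y_bad)))
```

Corrupting one value out of 32 pulls the Pearson coefficient towards zero, while the percentage bend estimate stays strongly negative, which is the resistance property described earlier.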
Winsorized Correlation:
Standard correlation measures such as Pearson's are sometimes heavily influenced by extreme values. The Winsorized correlation addresses this by setting the tail values equal to a certain percentile value.
For example, for a 90% Winsorized correlation, the bottom 5% of the values are set equal to the value at the 5th percentile, while the top 5% of the values are set equal to the value at the 95th percentile. The standard correlation is then computed on the adjusted values.
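The 90% example can be sketched in base R as shown below. The helper names `winsorize` and `winsorized_cor` are made up for this illustration, and the percentiles come from `quantile()` with its default interpolation, so packaged implementations may cap the tails at slightly different values.

```r
# Cap both tails of a vector at the given percentiles (hypothetical helper)
winsorize <- function(v, tail = 0.05) {
  limits <- quantile(v, probs = c(tail, 1 - tail), names = FALSE)
  pmin(pmax(v, limits[1]), limits[2])
}

# Pearson correlation of the Winsorized variables
winsorized_cor <- function(x, y, tail = 0.05) {
  cor(winsorize(x, tail), winsorize(y, tail))
}

x <- mtcars$wt
y <- mtcars$mpg
y_bad <- y
y_bad[1] <- 1000                 # corrupt a single observation

print(c(pearson = cor(x, y_bad), winsorized = winsorized_cor(x, y_bad)))
```

With `tail = 0.05` the corrupted value is pulled back to the 95th percentile before the correlation is computed, so a single extreme value no longer dominates the result.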
Implementation:
- In this implementation, we will use the Motor Trend Car Road Tests dataset (mtcars), which ships with R in the datasets package. It is popular and readily available. The dataset contains 32 observations of 11 variables related to cars. We will compute the Pearson, percentage bend and Winsorized correlations between these variables and plot them.
R
# Install the required packages
install.packages("dplyr")
install.packages("correlation")
install.packages("see")
# import required packages
library(dplyr)
library(correlation)
library(see)
# Load data
data("mtcars")
# check help for mtcars data
?mtcars
## Description
# The data was extracted from the 1974 Motor Trend US magazine,
# and comprises fuel consumption and 10 aspects of automobile
# design and performance for 32 automobiles
#(1973–74 models).
## Usage
# mtcars
## Format
# A data frame with 32 observations on 11 (numeric) variables.
#
# [, 1] mpg Miles/(US) gallon
# [, 2] cyl Number of cylinders
# [, 3] disp Displacement (cu.in.)
# [, 4] hp Gross horsepower
# [, 5] drat Rear axle ratio
# [, 6] wt Weight (1000 lbs)
# [, 7] qsec 1/4 mile time
# [, 8] vs Engine (0 = V-shaped, 1 = straight)
# [, 9] am Transmission (0 = automatic, 1 = manual)
# [,10] gear Number of forward gears
# [,11] carb Number of carburetors
## Source
# Henderson and Velleman (1981), Building multiple regression
# models interactively. Biometrics, 37, 391–411.
# compute each correlation and print its summary
# pearson correlation (the default method)
pearson_corr <- correlation(mtcars)
pearson_summary <- summary(pearson_corr)
print(pearson_summary)
# percentage bend correlation
pbc_corr <- correlation(mtcars, method = "percentage")
pbc_summary <- summary(pbc_corr)
print(pbc_summary)
# winsorized correlation
wins_corr <- correlation(mtcars, winsorize = 0.2)
winsor_summary <- summary(wins_corr)
print(winsor_summary)
# plot each correlation matrix
pearson_summary %>% plot()
pbc_summary %>% plot()
winsor_summary %>% plot()
# Correlation Matrix (pearson-method)
Parameter | carb | gear | am | vs | qsec | wt | drat | hp | disp | cyl
---------------------------------------------------------------------------------------------------------------------
mpg | -0.55* | 0.48 | 0.60** | 0.66** | 0.42 | -0.87*** | 0.68*** | -0.78*** | -0.85*** | -0.85***
cyl | 0.53* | -0.49 | -0.52* | -0.81*** | -0.59* | 0.78*** | -0.70*** | 0.83*** | 0.90*** |
disp | 0.39 | -0.56* | -0.59* | -0.71*** | -0.43 | 0.89*** | -0.71*** | 0.79*** | |
hp | 0.75*** | -0.13 | -0.24 | -0.72*** | -0.71*** | 0.66** | -0.45 | | |
drat | -0.09 | 0.70*** | 0.71*** | 0.44 | 0.09 | -0.71*** | | | |
wt | 0.43 | -0.58* | -0.69*** | -0.55* | -0.17 | | | | |
qsec | -0.66** | -0.21 | -0.23 | 0.74*** | | | | | |
vs | -0.57* | 0.21 | 0.17 | | | | | | |
am | 0.06 | 0.79*** | | | | | | | |
gear | 0.27 | | | | | | | | |
p-value adjustment method: Holm (1979)
# Correlation Matrix (percentage-method)
Parameter | carb | gear | am | vs | qsec | wt | drat | hp | disp | cyl
----------------------------------------------------------------------------------------------------------------------
mpg | -0.64** | 0.55* | 0.58** | 0.68*** | 0.48 | -0.90*** | 0.68*** | -0.90*** | -0.88*** | -0.91***
cyl | 0.58* | -0.55* | -0.52* | -0.81*** | -0.60** | 0.85*** | -0.72*** | 0.91*** | 0.94*** |
disp | 0.47 | -0.61** | -0.60** | -0.73*** | -0.50 | 0.88*** | -0.74*** | 0.89*** | |
hp | 0.70*** | -0.37 | -0.40 | -0.79*** | -0.69*** | 0.80*** | -0.59** | | |
drat | -0.11 | 0.78*** | 0.73*** | 0.47 | 0.13 | -0.76*** | | | |
wt | 0.53* | -0.64** | -0.76*** | -0.57* | -0.26 | | | | |
qsec | -0.68*** | -0.13 | -0.17 | 0.80*** | | | | | |
vs | -0.62** | 0.27 | 0.17 | | | | | | |
am | -0.07 | 0.80*** | | | | | | | |
gear | 0.11 | | | | | | | | |
p-value adjustment method: Holm (1979)
# Winsorized Correlation Matrix
Parameter | carb | gear | am | vs | qsec | wt | drat | hp | disp | cyl
---------------------------------------------------------------------------------------------------------------------
mpg | -0.63** | 0.65** | 0.55* | 0.70*** | 0.49 | -0.86*** | 0.67*** | -0.88*** | -0.87*** | -0.93***
cyl | 0.60** | -0.68*** | -0.52* | -0.81*** | -0.60** | 0.87*** | -0.74*** | 0.90*** | 0.94*** |
disp | 0.45 | -0.74*** | -0.57* | -0.72*** | -0.51* | 0.85*** | -0.74*** | 0.89*** | |
hp | 0.69*** | -0.56* | -0.37 | -0.79*** | -0.63** | 0.77*** | -0.60** | | |
drat | -0.12 | 0.88*** | 0.72*** | 0.50* | 0.22 | -0.76*** | | | |
wt | 0.53* | -0.69*** | -0.78*** | -0.56* | -0.29 | | | | |
qsec | -0.61** | 0.15 | -0.12 | 0.84*** | | | | | |
vs | -0.62** | 0.45 | 0.17 | | | | | | |
am | -0.11 | 0.78*** | | | | | | | |
gear | -0.03 | | | | | | | | |
p-value adjustment method: Holm (1979)
[Plot: Pearson correlation]
[Plot: Percentage bend correlation]
[Plot: Winsorized correlation]