
Computational Statistics Capstone Report

On
Karl Pearson’s Coefficient of Correlation

MTech Year I Semester-I


Computer Science and Engineering (Data Science)
By
Sushant Kothari (24CD1005)

Supervisor
Prof. Chandrakant Gaikwad
Theory
Karl Pearson's Coefficient of Correlation (r) is a widely used statistical
measure that quantifies the strength and direction of a linear relationship
between two variables. It is a value between -1 and 1, and the closer the
value is to either extreme, the stronger the relationship between the two
variables. It was developed by the British statistician Karl Pearson in 1896
and has since become a fundamental tool in statistics and data analysis.

Key Concepts in Pearson's Correlation:

• Linearity: Pearson’s r measures only linear relationships between two
variables. A linear relationship means that a change in one variable is
accompanied by a proportional change in the other, following a straight-line
pattern. If the relationship is non-linear (e.g., exponential, quadratic),
Pearson’s r may fail to capture the strength or direction of the relationship.

• Range: The coefficient (r) lies in the interval −1 ≤ r ≤ 1:

• r = 1: Indicates a perfect positive linear relationship. As one variable
increases, the other increases in a perfectly predictable manner.
• r = −1: Indicates a perfect negative linear relationship. As one variable increases, the
other decreases in a perfectly predictable manner.
• r = 0: Implies no linear relationship between the two variables. However, this does not
mean the variables are unrelated; they could have a non-linear relationship.
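
The three cases above can be illustrated with a minimal sketch. The data values and the helper name pearson_r are illustrative assumptions, not part of the original report. The quadratic case shows that r can be 0 even when Y depends exactly on X, because the dependence is not linear.

```python
import numpy as np

# Symmetric x values chosen for illustration (hypothetical data).
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])


def pearson_r(a, b):
    """Return Pearson's r, read from NumPy's 2x2 correlation matrix."""
    return np.corrcoef(a, b)[0, 1]


r_pos = pearson_r(x, 2.0 * x + 1.0)    # points on a rising straight line
r_neg = pearson_r(x, -0.5 * x + 4.0)   # points on a falling straight line
r_quad = pearson_r(x, x ** 2)          # exact dependence, but not linear

print(f"rising line:  r = {r_pos:.3f}")
print(f"falling line: r = {r_neg:.3f}")
print(f"y = x^2:      r = {r_quad:.3f}")
```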

Formula:
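The Pearson correlation coefficient r between two variables X and Y is
given by the formula:

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

Where:
• x_i and y_i are the i-th paired observations of X and Y
• \bar{x} and \bar{y} are the means of X and Y
• n is the number of paired observations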
Derivation of Pearson’s Correlation Coefficient:
Step 1: Concept of Covariance:
Covariance is a measure of how two variables vary together. If two
variables tend to increase together, the covariance is positive. If one variable
increases while the other decreases, the covariance is negative. The formula
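for covariance is:

\[
\operatorname{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
\]

Where:
• x_i and y_i are the i-th paired observations of X and Y
• \bar{x} and \bar{y} are the means of X and Y
• n is the number of paired observations

Dividing by n − 1 instead of n gives the sample covariance. The choice does
not affect r, provided the variances use the same divisor.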
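
As a minimal sketch of Step 1, covariance can be computed as the average product of deviations from the means. The data values are hypothetical, and the divide-by-n convention is an assumption made to match the "average" wording used for variance in this report.

```python
import numpy as np

# Hypothetical paired observations, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations of each observation from its variable's mean.
dx = x - x.mean()
dy = y - y.mean()

# Average product of deviations (divides by n, i.e. population covariance).
cov_xy = np.mean(dx * dy)

# Cross-check against NumPy; bias=True also divides by n.
cov_np = np.cov(x, y, bias=True)[0, 1]

print(f"Cov(X, Y) = {cov_xy:.4f}")
```

A positive result here reflects that larger x values tend to pair with larger y values in this small dataset.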

Step 2: Variance of Each Variable


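The variance measures the spread or dispersion of a set of values. It is the
average of the squared deviations from the mean. The variances of X and Y
are:

\[
\sigma_X^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
\sigma_Y^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2
\]

The standard deviation is simply the square root of the variance:

\[
\sigma_X = \sqrt{\sigma_X^2}, \qquad \sigma_Y = \sqrt{\sigma_Y^2}
\]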
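
Step 2 can be sketched as follows, assuming the same hypothetical data as before and the divide-by-n definition of variance described in the text (NumPy's default, ddof=0).

```python
import numpy as np

# Hypothetical paired observations, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Variance as the average squared deviation from the mean.
var_x = np.mean((x - x.mean()) ** 2)
var_y = np.mean((y - y.mean()) ** 2)

# Standard deviation is the square root of the variance.
std_x = np.sqrt(var_x)
std_y = np.sqrt(var_y)

print(f"Var(X) = {var_x:.4f}, SD(X) = {std_x:.4f}")
print(f"Var(Y) = {var_y:.4f}, SD(Y) = {std_y:.4f}")
```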

Step 3: Normalization by Standard Deviations

The Pearson correlation coefficient normalizes the covariance by dividing it
by the product of the standard deviations of X and Y. This scales the
covariance so that the resulting correlation coefficient is dimensionless and
bounded between -1 and 1. Hence, the formula for Pearson’s (r) becomes:

\[
r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
\]

where \sigma_X and \sigma_Y are the standard deviations of X and Y.

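Substituting the expressions for covariance and standard deviations, the
formula becomes:

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

Any common divisor (n or n − 1) used in the covariance and the variances
cancels between the numerator and the denominator. This formula gives the
final value of Pearson’s correlation coefficient.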
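
The derivation can be checked numerically with a minimal sketch: build r from the covariance and the two standard deviations, then compare it with NumPy's built-in result. The data values are hypothetical and the variable names are illustrative.

```python
import numpy as np

# Hypothetical paired observations, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

dx = x - x.mean()
dy = y - y.mean()

# Steps 1 and 2: covariance and standard deviations (all dividing by n).
cov_xy = np.mean(dx * dy)
std_x = np.sqrt(np.mean(dx ** 2))
std_y = np.sqrt(np.mean(dy ** 2))

# Step 3: normalize the covariance by the product of standard deviations.
r_steps = cov_xy / (std_x * std_y)

# Substituted form: sums of deviations only, the divisor n has cancelled.
r_sums = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Reference value from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"r (steps) = {r_steps:.4f}, r (sums) = {r_sums:.4f}, r (NumPy) = {r_numpy:.4f}")
```

All three routes should agree up to floating-point rounding, which is the point of the substitution step.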

Step 4: Interpretation of the Formula

The closer (r) is to 1 or -1, the stronger the linear relationship between X
and Y, and the closer it is to 0, the weaker the linear relationship.
Python Simulation and Output
Let’s now use Python to calculate Pearson’s correlation coefficient for
two variables with a linear relationship.

Python Code:
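
A minimal sketch of such a simulation is given here. The slope, intercept, noise level, sample size, and random seed are all hypothetical choices, and the data are not the report's dataset, so the printed value need not match the r = 0.98 discussed in the following sections.

```python
import numpy as np

# Fixed seed so the simulated data are repeatable (assumed value).
rng = np.random.default_rng(42)

# X on an evenly spaced grid; Y follows a straight line plus Gaussian noise.
# Slope 3.0, intercept 2.0 and noise SD 1.0 are illustrative assumptions.
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 2.0 + rng.normal(loc=0.0, scale=1.0, size=x.size)


def pearson_r(a, b):
    """Pearson's r from sums of deviations about the means."""
    da = a - a.mean()
    db = b - b.mean()
    return np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2))


r = pearson_r(x, y)
print(f"Pearson correlation coefficient r = {r:.4f}")
```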
Output:
Understanding the Significance of r=0.98:
When the Pearson correlation coefficient, (r), is calculated as 0.98, it
implies an extremely strong positive linear relationship between the
two variables X and Y. Let’s break this down in detail:
1. Strength of the Relationship
• A correlation value of r=0.98 is very close to 1.0, which is the theoretical
maximum for Pearson’s correlation coefficient.
• This indicates that the data points almost perfectly align along a straight
line with a positive slope, suggesting minimal deviation from the
expected linear trend.
• In practical terms, such a strong value means that the changes in X almost
entirely explain the changes in Y, with very little variation left
unaccounted for.
2. Direction of the Relationship
• The positive sign of (r) signifies that as the values of X increase, the
values of Y also increase. This is called a direct relationship.
o For instance, if X represents advertising spend and Y represents sales
revenue, a correlation of 0.98 would indicate that increased spending on
advertising is strongly associated with increased revenue.
• Conversely, if the correlation were negative (e.g., r= −0.98), it would
mean that as X increases, Y decreases in a nearly perfect linear pattern.
3. Scatterplot Analysis
• A scatterplot of the data for r=0.98 would show that almost all data points
lie very close to or directly on a straight line with a positive slope.
• This alignment visually confirms the nearly perfect relationship. Unlike
lower correlation values where data points might show a more scattered
pattern, here the clustering along the line is extremely tight.
Results with Evaluation Parameters
Interpretation of Results:

1. Strength of Relationship:
• The Pearson correlation coefficient of r=0.98 signifies an exceptionally
strong positive linear relationship. This means that as X increases, Y
tends to increase in a highly predictable manner with minimal deviation.
• A correlation this high implies that nearly all variation in Y is accounted
for by the variation in X.
2. Direction of Relationship:
• The positive value of r=0.98 indicates a direct relationship, meaning X
and Y increase together.
3. Evaluation Parameters:
• Strength: The correlation value of 0.98 indicates a nearly perfect
positive relationship.
• Outliers: Pearson’s correlation is sensitive to outliers. While the
correlation here is very strong, it is critical to ensure no extreme data
points distort this value.
• Linearity Assumption: Since Pearson’s method measures linear
relationships, it’s important to verify that the relationship between X and
Y is indeed linear, which scatterplots or regression residuals can confirm.

4. Visual Representation:
1. Tight Clustering of Points:

• The scatterplot with r=0.98 would show that nearly all data points are
tightly clustered along a straight line with a positive slope.

• The small deviations from the line suggest minimal residual variance,
emphasizing the strength of the linear relationship.

2. Best-Fit Regression Line:

• The best-fit line (calculated using least squares regression) would pass
very close to most of the data points, confirming the high predictability of
Y based on X.

• The slope of this line would indicate the rate of change in Y for a unit
change in X, which is meaningful in practical terms (e.g., a 1-unit increase
in marketing spend increases revenue by a measurable amount).
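
The best-fit line described above can be sketched with a least-squares fit. The data are hypothetical, and np.polyfit with degree 1 is one convenient way to obtain the line, not necessarily the method used in the report. The sketch also checks the standard identity that the slope equals r multiplied by the ratio of the standard deviations of Y and X.

```python
import numpy as np

# Hypothetical paired observations, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Degree-1 least-squares fit; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, 1)

# Pearson's r and the slope implied by it: slope = r * SD(Y) / SD(X).
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * np.std(y) / np.std(x)

# Residuals are the vertical distances of the points from the fitted line.
residuals = y - (slope * x + intercept)

print(f"fitted line: y = {slope:.3f} * x + {intercept:.3f}")
print(f"slope from r: {slope_from_r:.3f}")
print(f"largest absolute residual: {np.max(np.abs(residuals)):.3f}")
```

Plotting the residuals against x is a common way to check the linearity assumption: a visible curve or funnel shape suggests that a straight line is not an adequate description.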
Conclusion
1. Significance of r = 0.98:
• A Pearson correlation coefficient of 0.98 represents an exceptionally
strong positive linear relationship. Such a high value indicates that the
changes in one variable are almost entirely mirrored by proportional
changes in the other.
• Such a relationship is well suited to predictive modeling within the
observed range of the data, since about 96% of the variance in Y
(r² = 0.9604) can be explained by X.

2. Implications:
• High Predictability: The high (r) value suggests that X is a very
effective predictor of Y, which is useful in fields like finance,
healthcare, and physics.
• Practical Utility: In applied scenarios, a correlation of 0.98 might
indicate near-perfect synchronization, such as the relationship
between:
• Marketing budget and revenue growth.
• Physical measurements like height and weight.
• Scientific measurements like temperature and reaction rates.

3. Cautions:
• Outliers: Even with r=0.98, a small number of extreme values could
artificially inflate the correlation. It is vital to preprocess the data and
remove or address such outliers.
• Causation: A high correlation does not imply causation. Both X and Y
could be influenced by external or latent factors.
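
The outlier caution can be made concrete with a minimal sketch. The data are hypothetical and chosen so that the bulk of the points show a weak negative association, while a single extreme point pushes r close to 1.

```python
import numpy as np

# Five hypothetical points with a weak negative association.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 1.0, 2.0, 3.0, 1.0])

# The same data with one extreme point appended (illustrative outlier).
x_out = np.append(x, 20.0)
y_out = np.append(y, 20.0)

r_without = np.corrcoef(x, y)[0, 1]
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without the extreme point: {r_without:.3f}")
print(f"r with the extreme point:    {r_with:.3f}")
```

Comparing r with and without suspect points, alongside a scatterplot, is a simple way to see whether a high correlation is driven by a few observations.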
4. Final Insight: Broader Implications of r=0.98:
4.1 Quantitative Strength of r=0.98:
• A correlation coefficient of r=0.98 underscores a near-perfect linear
association, indicating that almost every fluctuation in one variable X is
mirrored by a proportional change in the other variable Y.
• This strong relationship suggests that predictive models based on this data
can be accurate within the observed range, reducing uncertainty in
decision-making.

4.2 Practical Applications:


• Finance and Economics:
o In financial markets, such a high correlation might indicate that two
stocks or indices move almost in lockstep, making one a candidate hedge
for the other, though holding both offers little diversification benefit.
o In economics, r=0.98 might suggest a strong linkage between consumer
spending and disposable income, aiding policymaking.
• Healthcare:
o In health-tech, such a high correlation between a diagnostic metric
(e.g., blood pressure) and an outcome (e.g., risk of stroke) could support
early intervention strategies.
• Education:
o A study of test preparation hours and scores yielding r=0.98 would
suggest that targeted intervention programs could have substantial
impacts, although the correlation alone does not establish causation.
• Environmental Science:
o In climate studies, r=0.98 between carbon emissions and temperature
rise would strongly emphasize the importance of reducing emissions.

4.3. Reliability for Predictive Analysis:


• With r=0.98, the predictability of Y given X is exceptionally high, making
this relationship a sound basis for machine learning models or statistical
forecasting methods.
• Regression models built on such a relationship can be expected to have
small prediction errors within the range of the observed data, enabling
precise and actionable insights.

4.4. Confidence in Linearity:


• This high correlation is consistent with a linear relationship with
minimal noise, but it does not by itself confirm linearity: a smoothly
curved, monotonic relationship can also produce r close to 1.
• With r=0.98, a straight-line model is a reasonable starting point for
regression-based predictions. A scatterplot or residual plot should still
be checked before ruling out transformations or higher-order terms.
Thus, the Pearson correlation coefficient of r=0.98 highlights a nearly
perfect positive linear relationship between two variables. This relationship
is practically valuable and, given an adequate sample size, statistically
significant, providing a strong foundation for predictive analysis, strategic
decision-making, and research exploration. The high (r) value means that X is
an excellent linear predictor of Y, with 96.04% of the variation in Y
(r² = 0.9604) explained by X. However, careful consideration must be given to
potential outliers, causation, and the assumption of linearity. Ultimately,
r = 0.98 illustrates the usefulness of Karl Pearson’s correlation method for
capturing linear relationships and its role in a wide array of disciplines,
from business to science. This finding supports real-world applications by
offering clarity, predictability, and actionable insights.
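
The 96.04% figure quoted above follows from the coefficient of determination, which for a straight-line fit with a single predictor equals the square of r. As a short worked step:

```latex
\[
r^2 = (0.98)^2 = 0.9604
\quad\Longrightarrow\quad
\text{explained variation} = 96.04\%, \qquad
1 - r^2 = 0.0396 \;\; (\text{about } 3.96\% \text{ unexplained})
\]
```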
