Correlation and Regression Handout 1

The document discusses correlation and regression analysis, explaining how to measure the strength of association between two numerical variables using correlation coefficients. It details the interpretation of correlation values, the significance of linear relationships, and the process of regression analysis to predict dependent variables based on independent variables. The conclusion emphasizes a strong positive relationship between store area and sales, with statistical significance indicating that area significantly affects sales.

Correlation Analysis

◼ Used to measure and interpret the strength of association (linear relationship) between two numerical
variables

 Only concerned with strength of the relationship

 No causal effect is implied

Example: even if cigarette smoking and lung cancer are highly correlated, correlation alone is not
sufficient proof of causation. One variable may cause the other or vice-versa, a third factor may
influence both, or the association may have arisen by chance.

◼ Population correlation coefficient 𝜌 (rho) is used to measure the strength of the linear relationship
between two variables, X and Y, and is independent of their respective scales of measurement

◼ Sample correlation coefficient r is a point estimate of 𝜌 and is used to measure the strength of the linear
relationship of two variables in the sample observations

◼ r is the Pearson product moment coefficient of correlation between X and Y

Features of 𝜌 and 𝑟

◼ Unit free

◼ Range between -1 and 1, inclusive of the endpoints

◼ The closer to -1, the stronger the negative linear relationship

◼ The closer to +1, the stronger the positive linear relationship

◼ The closer to 0, the weaker the linear relationship
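Because 𝜌 and 𝑟 are unit free, rescaling or shifting either variable leaves the correlation unchanged. A minimal Python sketch with made-up data illustrating this property:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x) - sx * sx / n
    syy = sum(v * v for v in y) - sy * sy / n
    sxy = sum(a * b for a, b in zip(x, y)) - sx * sy / n
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with a strong positive linear relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r1 = pearson_r(x, y)
# Rescale x (e.g., feet to metres) and shift y: r is unchanged because it is unit free
r2 = pearson_r([0.3048 * v for v in x], [v + 100 for v in y])

print(round(r1, 4), round(r2, 4))
```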

Correlation Coefficient (r) | Strength of Relationship | Interpretation
r = 1 | Perfect Positive Relationship | A perfect positive linear relationship: as one variable increases, the other increases proportionally.
r = 0.9 to 1 | Very Strong Positive Relationship | A very strong positive linear relationship, with only small deviations from the ideal straight line.
r = 0.7 to 0.9 | Strong Positive Relationship | A strong positive relationship, but with some fluctuations or variance in the data.
r = 0.5 to 0.7 | Moderate Positive Relationship | A moderate positive relationship, but there is some scatter or noise in the data.
r = 0.3 to 0.5 | Weak Positive Relationship | A weak positive relationship, with considerable variability in the data.
r = 0 to 0.3 | Very Weak Positive or No Relationship | A very weak or nearly nonexistent positive relationship.
r = -0.3 to 0 | Very Weak Negative Relationship | A very weak negative relationship, with little or no predictable inverse correlation.
r = -0.5 to -0.3 | Weak Negative Relationship | A weak negative relationship, but with some scatter or noise in the data.
r = -0.7 to -0.5 | Moderate Negative Relationship | A moderate negative relationship, where one variable tends to decrease as the other increases, but not perfectly.
r = -0.9 to -0.7 | Strong Negative Relationship | A strong negative relationship, with only small deviations from the ideal straight line.
r = -1 | Perfect Negative Relationship | A perfect negative linear relationship: as one variable increases, the other decreases proportionally.

Key Points

◼ Positive Correlation (r > 0): As one variable increases, the other also increases.

- Hours Studied vs. Test Scores


- Temperature vs. Ice Cream Sales
- Number of Hours Exercised vs. Calories Burned

◼ Negative Correlation (r < 0): As one variable increases, the other decreases.

- Age of a Vehicle vs. Resale Value


- Time Spent Watching TV vs. Academic Performance

◼ No or Weak Correlation (r ≈ 0): There is little to no linear relationship between the two variables.

- Shoe Size vs. Test Scores


- Hair Color vs. Intelligence
- Favorite Movie Genre vs. Monthly Income

Example 1 (Correlation Analysis): You want to examine the correlation of the annual sales of produce stores with
their size in square footage. Sample data for seven stores were obtained.

Store | Area (in Sq Ft) | Annual Sales ($1000)
1 | 1,726 | 3,681
2 | 1,542 | 3,395
3 | 2,816 | 6,653
4 | 5,555 | 9,543
5 | 1,292 | 3,318
6 | 2,208 | 5,563
7 | 1,313 | 3,760
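As a check, the sample correlation for these seven stores can be computed directly from the definitional formula. A minimal Python sketch (the result matches the Multiple R value reported in the regression output):

```python
import math

# Store area (sq ft) and annual sales ($1000) for the seven sample stores
area  = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x) - sx * sx / n
    syy = sum(v * v for v in y) - sy * sy / n
    sxy = sum(a * b for a, b in zip(x, y)) - sx * sy / n
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(area, sales)
print(round(r, 5))  # approximately 0.97056
```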

Scatter Diagram

[Scatter plot: Annual Sales ($000), 0 to 12000, plotted against Square Feet, 0 to 6000. The seven stores show a clear positive linear trend.]

Multiple R – correlation coefficient (Pearson 𝑟) representing the strength and direction of the relationship between
the independent and the dependent variable

Interpretation: Since Multiple R = 0.97056, there is a very strong positive relationship between Area (in square feet) and Sales.

Question: Is there any evidence of a linear relationship between the annual sales of a store and its square footage
at .05 level of significance?

𝐻0 : 𝜌 = 0 (No association)

𝐻𝐴 : 𝜌 ≠ 0 (Association)

Since 𝑝 = 0.00028 < 𝛼 = 0.05 (indeed, 𝑝 < 0.01), we reject the null hypothesis and conclude that there is a statistically
significant linear association between a store's square footage and its annual sales.
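The p-value above would normally come from statistical software, but the test statistic behind it can be checked by hand: for testing 𝐻0: 𝜌 = 0, t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom, and in simple linear regression t² equals the ANOVA F-statistic. A minimal Python sketch using the values from this example:

```python
import math

r, n = 0.97056, 7  # sample correlation and sample size from the example

# t statistic for H0: rho = 0, with n - 2 = 5 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
print(round(t, 2))       # approximately 9.01

# In simple linear regression, t^2 equals the ANOVA F statistic
print(round(t * t, 1))   # approximately 81.2, matching F = 81.179 in the ANOVA table
```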

Remark: If the relationship/association is significant, you can proceed to Regression Analysis.

Regression Analysis

◼ Regression analysis is used primarily to establish a linear relationship between variables and to provide
predictions

 Predicts the value of a dependent (response) variable based on the value of at least one
independent (explanatory) variable

 Explains the relationship of the independent variables on the dependent variable

◼ Relationship between variables is described by a linear function

◼ This function relates how much change in the dependent variable is associated with a unit increase (or
decrease) in the independent variable.

◼ Sample regression line provides an estimate of the population regression line as well as a predicted
value of Y

Slope and Intercept

◼ 𝑏0 is the estimated average value of Y when the value of X is zero.

◼ 𝑏1 is the estimated change in the average value of Y as a result of a one-unit change in X.

◼ R-squared: A measure of how well the regression line fits the data. A higher value (close to 1) indicates a
good fit. It also tells us the percentage of variability in the dependent variable that is
explained by the independent variable.

Example 1 (Regression Analysis)

Since Area (in sq ft) and Sales (in $1000) are significantly associated, we can proceed to regression analysis to

i) fit a linear regression model to the data;


ii) use the regression line to predict values; and
iii) check whether or not the independent variable significantly affects the dependent variable.

Solutions:

i) The coefficients of the regression line are 𝑏0 = 1636.41 and 𝑏1 = 1.49. Thus, the regression line is

𝑦 = 1636.41 + 1.49𝑥
where 𝑥 is the area (in sq ft) and 𝑦 is the sales (in $1000).

ii) Suppose that we want to predict the sales (𝑦) when the area (𝑥) is 3000 sq ft. We substitute 𝑥 = 3000
into the regression line. Hence,
𝑦 = 1636.41 + 1.49(3000) = 6106.41(in $1000)
Thus, if the store area is 3000 sq ft, the predicted sales is $6,106,410.
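The coefficients and the prediction can be reproduced from the least-squares formulas 𝑏1 = 𝑆𝑥𝑦/𝑆𝑥𝑥 and 𝑏0 = 𝑦̄ − 𝑏1𝑥̄. A minimal Python sketch (note that the handout rounds 𝑏1 to 1.49, so its predicted value 6106.41 differs slightly from the approximately 6096.3 obtained with unrounded coefficients):

```python
# Least-squares fit of sales on area for the seven sample stores
area  = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(area)
x_bar = sum(area) / n
y_bar = sum(sales) / n

sxx = sum((x - x_bar) ** 2 for x in area)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(area, sales))

b1 = sxy / sxx            # slope: approximately 1.48663
b0 = y_bar - b1 * x_bar   # intercept: approximately 1636.41

pred = b0 + b1 * 3000     # predicted sales ($1000) for a 3000 sq ft store
print(round(b1, 5), round(b0, 2), round(pred, 1))
```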

R-square = 0.94198: This means that 94.198% of the variability in the dependent variable is explained by the
independent variable(s) in the regression model. The remaining 5.802% of the variability is due to other factors not
captured by the model, such as random error, other variables not included in the model, or inherent variability in
the data.


iii) To check whether or not the area (independent variable) significantly affects sales (dependent
variable), we set up the following hypotheses:
𝐻0 : 𝛽1 = 0 (no effect)
𝐻𝐴 : 𝛽1 ≠ 0 (there is an effect)

Since 𝑝 = 0.00028 < 0.01, we reject 𝐻0 and conclude that there is evidence that square footage affects annual sales.

ANOVA

In regression analysis, the ANOVA output is used to assess the overall significance of the regression model.
It tells you whether the independent variable(s) in the regression model collectively explain a significant portion of
the variation in the dependent variable.

Components:

◼ F (F-statistic): The F-statistic is used to test if the regression model is a good fit for the data. It is calculated
as the ratio of the regression mean square, MS(Regression), to the residual mean square, MS(Residual). A larger F-value
indicates that the model explains a significant portion of the variability in the dependent variable.

◼ Significance F (p-value for F-statistic): This is the p-value associated with the F-statistic. It tells you whether
the overall regression model is statistically significant. If Significance F is less than your alpha level (usually
0.05), you reject the null hypothesis and conclude that the regression model is statistically significant.

ANOVA
df SS MS F Significance F
Regression 1 30380456.12 30380456.12 81.17909015 0.000281201
Residual 5 1871199.595 374239.919
Total 6 32251655.71

◼ The F-statistic (81.18) is very large, suggesting that the regression model fits the data well.
◼ The p-value (0.000281) is much less than the significance level of 0.05, so we reject the null hypothesis
that the regression model does not explain a significant portion of the variability in the dependent
variable.
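The entries in the ANOVA table are internally consistent and can be verified by hand: MS = SS/df, F = MS(Regression)/MS(Residual), and R² = SS(Regression)/SS(Total). A minimal Python sketch:

```python
# Sums of squares and degrees of freedom from the ANOVA table above
ss_reg, df_reg = 30380456.12, 1
ss_res, df_res = 1871199.595, 5

ms_reg = ss_reg / df_reg       # mean square for regression
ms_res = ss_res / df_res       # mean square for residual (error): 374239.919

f_stat = ms_reg / ms_res       # approximately 81.179
ss_total = ss_reg + ss_res     # approximately 32251655.71
r_squared = ss_reg / ss_total  # approximately 0.94198

print(round(f_stat, 2), round(r_squared, 5))
```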

◼ This means the independent variable in the regression model is statistically significant and has a
meaningful relationship with the dependent variable.

CONCLUSION:
The regression analysis demonstrates a very strong positive linear relationship (𝑟 = 0.97056) between area (in sq
ft) and sales (in $1000). The model is statistically significant at 𝛼 = 0.01, with an 𝑅² of 0.94198, indicating that
94.198% of the variability in sales can be explained by the area.

The slope of the regression equation (𝑏1 = 1.48663) suggests that for every additional square foot of area, sales are
expected to increase by approximately 1.48663 thousand dollars (about $1,486.63). The intercept (𝑏0 = 1636.41) indicates
that when the area is 0 square feet, the baseline sales would be 1,636.41 thousand dollars (about $1,636,410), although
this is an extrapolation outside the range of the observed data.
