Correlation and Regression Handout 1
◼ Correlation analysis is used to measure and interpret the strength of association (linear relationship)
between two numerical variables
Example: even if cigarette smoking and lung cancer are highly correlated, the correlation alone is not
sufficient proof of causation. One variable may cause the other, a third factor may be involved, or
the association may have arisen by chance (a rare event)
◼ Population correlation coefficient 𝜌 (rho) is used to measure the strength of the linear relationship
between two variables, X and Y, and is independent of their respective scales of measurement
◼ Sample correlation coefficient 𝑟 is a point estimate of 𝜌 and is used to measure the strength of the linear
relationship between two variables in the sample observations
Features of 𝜌 and 𝑟
◼ Unit free
Correlation Coefficient (r)   Strength of Relationship            Interpretation
r = 0.9 to 1                  Very Strong Positive Relationship   A very strong positive linear relationship, with only small deviations from the ideal straight line.
r = 0.5 to 0.7                Moderate Positive Relationship      A moderate positive relationship, but there is some scatter or noise in the data.
Key Points
◼ Positive Correlation (r > 0): As one variable increases, the other also increases.
◼ Negative Correlation (r < 0): As one variable increases, the other decreases.
◼ No or Weak Correlation (r ≈ 0): There is little to no linear relationship between the two variables.
Example 1 Correlation Analysis: You want to examine the correlation between the annual sales of produce stores
and their size in square feet. Sample data for seven stores were obtained.
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
[Figure: Scatter Diagram of Annual Sales ($000) versus Square Feet]
Multiple R – correlation coefficient (Pearson 𝑟) representing the strength and direction of the relationship between
the independent and the dependent variable
Interpretation: There is a very strong positive relationship between Area (in square feet) and Sales.
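The value of 𝑟 behind this interpretation can be reproduced by hand. The following is a minimal sketch in plain Python, assuming the seven-store data from Example 1; the function name `pearson_r` is illustrative and not part of the handout.

```python
from math import sqrt

# Data from Example 1 (seven produce stores).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $1000

def pearson_r(x, y):
    """Sample correlation coefficient computed from sums of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

r = pearson_r(square_feet, annual_sales)
print(f"r = {r:.5f}")
```

The result should agree with the Multiple R reported for this example (about 0.97), which falls in the "very strong positive" band.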
Question: Is there any evidence of a linear relationship between the annual sales of a store and its square footage
at the 0.05 level of significance?
𝐻0 : 𝜌 = 0 (No association)
𝐻𝐴 : 𝜌 ≠ 0 (Association)
Since 𝑝 = 0.00028 < 0.05, we reject the null hypothesis and conclude that there is a statistically significant linear
association between the store's square footage and its annual sales.
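The handout reports a p-value taken from software output. As a rough cross-check, the same test can be sketched with the t statistic for a correlation coefficient, t = r√(n − 2) / √(1 − r²), compared against a critical value. The sketch below assumes r = 0.97056 and n = 7 from this example; the critical value 2.571 (two-tailed, 𝛼 = 0.05, 5 degrees of freedom) comes from a standard t table, not from the handout.

```python
from math import sqrt

r = 0.97056          # sample correlation from the example
n = 7                # number of stores
df = n - 2           # degrees of freedom for the test of rho = 0

# Test statistic for H0: rho = 0
t_stat = r * sqrt(df) / sqrt(1 - r ** 2)

# Assumption: two-tailed critical value at alpha = 0.05 with 5 df,
# taken from a standard t table.
t_critical = 2.571

reject_h0 = abs(t_stat) > t_critical
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

The statistic comes out near 9, well beyond the critical value, which is consistent with the very small p-value above.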
Remark: If the relationship/ association is significant, you can proceed to Regression Analysis.
Regression Analysis
◼ Regression analysis is used primarily to model the linear relationship between variables and to provide
predictions
It predicts the value of a dependent (response) variable based on the value of at least one
independent (explanatory) variable
◼ The regression function describes how much change in the dependent variable is associated with a unit
increase (or decrease) in the independent variable.
◼ Sample regression line provides an estimate of the population regression line as well as a predicted
value of Y
◼ R-squared: A measure of how well the regression line fits the data. A higher value (close to 1) indicates a
good fit. It also tells us the percentage of variability in the dependent variable that is explained by the
independent variable.
Since Area (in sq ft) and Sales (in $1000) are significantly associated, we can proceed to regression analysis to
Solutions:
i) The coefficients of the regression line are 𝑏0 = 1636.41 and 𝑏1 = 1.49. Thus, the regression line is
𝑦 = 1636.41 + 1.49𝑥
where 𝑥 is the area (in sq ft) and 𝑦 is the sales (in $1000).
ii) Suppose that we want to predict the sales (𝑦) when the area (𝑥) is 3000 sq ft. We use the regression line and
substitute 𝑥 = 3000. Hence,
𝑦 = 1636.41 + 1.49(3000) = 6106.41 (in $1000)
Thus, if the store area is 3000 sq ft, the predicted sales are $6,106,410.
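The coefficients and the prediction above can be reproduced with the usual least-squares formulas, b1 = Sxy / Sxx and b0 = ȳ − b1·x̄. This is a minimal sketch assuming the seven-store data from Example 1; the helper name `fit_line` is illustrative.

```python
# Data from Example 1 (seven produce stores).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $1000

def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = s_xy / s_xx               # slope
    b0 = mean_y - b1 * mean_x      # intercept
    return b0, b1

b0, b1 = fit_line(square_feet, annual_sales)
predicted = b0 + b1 * 3000  # predicted sales (in $1000) for a 3000 sq ft store
print(f"b0 = {b0:.2f}, b1 = {b1:.5f}, prediction = {predicted:.2f}")
```

Note that with the unrounded slope the prediction is roughly 6096 (in $1000), slightly lower than the 6106.41 obtained with the slope rounded to 1.49. Rounding the slope before multiplying by a large 𝑥 can shift the answer noticeably.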
R-square = 0.94198: This means that 94.198% of the variability in the dependent variable (sales) is explained by the
independent variable (area) in the regression model. The remaining 5.802% of the variability is due to other factors
not captured by the model, such as random error, other variables not included in the model, or inherent variability
in the data.
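The R-square value can be checked as 1 − SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares of 𝑦. The sketch below assumes the seven-store data from Example 1 and refits the line so that it runs on its own; the variable names are illustrative.

```python
# Data from Example 1 (seven produce stores).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $1000

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(annual_sales) / n

# Least-squares slope and intercept
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, annual_sales))
s_xx = sum((x - mean_x) ** 2 for x in square_feet)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Residual and total sums of squares
fitted = [b0 + b1 * x for x in square_feet]
sse = sum((y - f) ** 2 for y, f in zip(annual_sales, fitted))
sst = sum((y - mean_y) ** 2 for y in annual_sales)

r_squared = 1 - sse / sst
print(f"R-squared = {r_squared:.5f}")
```

In simple linear regression this value equals the square of the correlation coefficient, so it should match 0.97056² ≈ 0.942.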
iii) To check whether the area (independent variable) significantly affects sales (dependent
variable), we set up the following hypotheses:
𝐻0 : β1 = 0 (no effect)
𝐻𝐴 : 𝛽1 ≠ 0 (there is an effect)
Since 𝑝 = 0.00028 < 0.01, we reject 𝐻0 and conclude that there is evidence that square footage affects annual sales.
ANOVA
In regression analysis, the ANOVA output is used to assess the overall significance of the regression model.
It tells you whether the independent variable(s) in the regression model collectively explain a significant portion of
the variation in the dependent variable.
Components:
◼ F (F-statistic): The F-statistic is used to test if the regression model is a good fit for the data. It is calculated
as the ratio of MS (regression) to MS (residual) (mean square of residuals or errors). A larger F-value
indicates that the model explains a significant portion of the variability in the dependent variable.
◼ Significance F (p-value for F-statistic): This is the p-value associated with the F-statistic. It tells you whether
the overall regression model is statistically significant. If Significance F is less than your alpha level (usually
0.05), you reject the null hypothesis and conclude that the regression model is statistically significant.
ANOVA
             df   SS            MS            F             Significance F
Regression   1    30380456.12   30380456.12   81.17909015   0.000281201
Residual     5    1871199.595   374239.919
Total        6    32251655.71
◼ The F-statistic (81.18) is very large, suggesting that the regression model fits the data well.
◼ The p-value (0.000281) is much less than the significance level of 0.05, so we reject the null hypothesis
that the regression model does not explain a significant portion of the variability in the dependent
variable.
◼ This means the independent variable in the regression model is statistically significant and has a
meaningful relationship with the dependent variable.
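The entries of the ANOVA table can be rebuilt from the data. The following is a minimal sketch for simple (one-predictor) regression, assuming the seven-store data from Example 1; the function name `anova_simple_regression` and the dictionary keys are illustrative. The p-value (Significance F) is not computed here because it requires the F distribution, which is outside the Python standard library.

```python
# Data from Example 1 (seven produce stores).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $1000

def anova_simple_regression(x, y):
    """Return the ANOVA quantities for a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]

    ss_reg = sum((f - mean_y) ** 2 for f in fitted)          # explained variation
    ss_res = sum((yi - f) ** 2 for yi, f in zip(y, fitted))  # unexplained variation
    df_reg, df_res = 1, n - 2
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return {
        "df_reg": df_reg, "df_res": df_res,
        "ss_reg": ss_reg, "ss_res": ss_res, "ss_total": ss_reg + ss_res,
        "ms_reg": ms_reg, "ms_res": ms_res,
        "F": ms_reg / ms_res,
    }

table = anova_simple_regression(square_feet, annual_sales)
print(f"F = {table['F']:.2f}")
```

With a single predictor, F equals the square of the t statistic for the slope, so the F-test and the t-test on 𝛽1 lead to the same p-value.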
CONCLUSION:
The regression analysis demonstrates a very strong positive linear relationship (𝑟 = 0.97056) between area (in sq
ft) and sales (in $1000). The model is statistically significant at 𝛼 = 0.01, with an 𝑅² of 0.94198, indicating that
94.198% of the variability in sales can be explained by the area.
The slope of the regression equation (𝑏1 = 1.48663) suggests that for every additional square foot in area, sales are
expected to increase by approximately 1.48663 (in $1000), that is, about $1,486.63. The intercept (𝑏0 = 1636.41)
indicates that when the area is 0 square feet, the predicted baseline sales are 1,636.41 (in $1000), that is, about
$1,636,410. Since no store in the sample has an area near 0, this value is an extrapolation.