Statistics Regression Final Project
Basic Statistics
Introduction
I will explain what correlation and regression lines are and how each is
determined. To help visualize the practical applications of correlation plots
and regressions, I will show 5 data sets with trend lines created in Excel. However,
any program capable of drawing scatter plots with linear trend lines will suffice. After a
trend line is added and a regression equation determined for each of the 5 data sets, I hope
Definition of Correlation
are useful because they can indicate a predictive relationship. For instance, consider the amount
of time a student spends studying (the first variable) and their academic performance (the second
variable). Common sense would suggest that the more time a student spends studying, the better
their academic performance, and vice versa. Hence, the first variable and the second variable are
In statistics, these variables are commonly denoted x and y. x is called the
independent variable and y is called the dependent variable. In the example above, the amount
of time spent studying for an exam is treated as independent, as it is up to each
individual student. We can call that x. The students’ academic performance, on the other hand,
does depend on the amount of time spent studying, and therefore we can call that the y variable.
There are three types of correlation: positive correlation, negative correlation,
and zero (or no) correlation. A positive correlation is a relationship in which the y variable
tends to increase as the x variable increases. For instance, height and
weight are positively correlated since taller people tend to be heavier. A negative correlation
is a relationship between two variables in which an increase in one variable is associated with a
decrease in the other variable. For instance, the more time a student spends playing video games,
the lower their GPA tends to be. While one variable increases (time spent playing video games),
the other decreases (their GPA). It’s important to note that a negative correlation does not imply
a negative side effect. As a basic example, the more time a person spends exercising, the lower
their weight tends to be; this is a negative correlation but not inherently bad. A zero correlation
is one in which there is no linear relationship between two variables. For instance, the lumen
output of a flashlight and how
waterproof it is have no linear relationship whatsoever and therefore we can not determine a
The coefficient of a positive correlation is greater than 0 and at most +1, with values in
between indicating the strength of the correlation. The coefficient of a negative correlation is
less than 0 and at least -1. And a zero
● The closer the data points are to the lines, the “stronger” the correlation is. If there are
many outliers, it can be said that the correlation is “moderate” or even “weak”. If there is
Source: danshiebler.com
Scatter Plots
A scatter plot is a graph that is used to plot the data points for two variables. Each
scatter plot has a horizontal axis (x-axis) and a vertical axis (y-axis). One variable is plotted on
each axis. Scatter plots are made up of marks; each mark represents one observation (for example,
one study participant's measures) on the variables on the x-axis and y-axis. Many scatter plots
include a line of best fit, which is a straight line drawn through the data points that
best represents the trend of the data. Scatter plots provide a visual representation of the
relationship between the variables and make it easier to spot trends quickly.
In statistics, the correlation coefficient r measures the strength and direction of a linear
relationship between two variables on a scatter plot. The correlation coefficient tells us how
closely the data points of a scatter plot fall along a trend line (points closer to the trend line
indicate a stronger correlation, while points further away indicate a weaker correlation).
The value of r is always between +1 and –1. To interpret its value, see which of the
Source: sciencedirect.com
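To make the meaning of r more concrete, here is a minimal Python sketch of how the correlation coefficient can be computed by hand. It assumes r is the Pearson correlation coefficient, which is the usual measure for linear relationships. The function name pearson_r and the sample numbers are my own and were invented purely for illustration; they are not any of the 5 data sets in this paper.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for paired samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
test_scores = [52, 60, 61, 70, 78]

r = pearson_r(hours_studied, test_scores)
print(f"r = {r:.2f}")
```

A value of r near +1 or -1 means the points lie close to a straight line, and the sign gives the direction of the relationship.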
Regression Lines
A regression line is a straight line that summarizes the trend in a data set in a visual way. It’s
also known as a trend line or “line of best fit”. Regression lines are very useful for predicting
future outcomes and trends. The purpose of the line is to describe how a dependent
variable, y, changes with an independent variable, x. Regression lines are used in a variety of
ways. Some of the more common ways that they are used are when predicting pandemic
infection rates, predicting stock prices, predicting sports odds and gambling and other areas
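As a rough illustration of how a regression line can be determined, the sketch below fits a line y = slope * x + intercept and uses it to make a prediction. It assumes the line is fitted by ordinary least squares, which is a common method but may not be the only one a given program uses. The names fit_line and predict, and the sample numbers, are hypothetical and are not taken from the data sets in this paper.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Predicted y for a given x on the fitted line."""
    return slope * x + intercept

# Hypothetical data, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
test_scores = [52, 60, 61, 70, 78]

slope, intercept = fit_line(hours_studied, test_scores)
print(f"y = {slope:.2f}x + {intercept:.2f}")
print(f"predicted score for 7 hours: {predict(slope, intercept, 7):.1f}")
```

Note that the prediction for 7 hours lies outside the range of the sample x values, so it should be treated with more caution than a prediction inside that range.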
Data Set 1
This scatter plot shows a strong positive linear correlation, indicating that the more time a
student spends studying, the higher their test scores tend to be. As x increases, y increases
with a linear upward trend. The correlation coefficient is 0.86. The closer a correlation
coefficient is to 1, the stronger the positive correlation, so this is a strong positive
correlation. The data points are close to the trend line, which is also indicative of a strong
correlation.
If the number of hours of studying is 7, the predicted test score is 76. Using the slope
Data Set 2
This scatter plot represents a moderately weak negative linear correlation with a coefficient of
-0.46. As x increases, y tends to decrease with a linear downwards trend. However, compared to
the first data set, it is easy to see that the correlation is not as strong since the data points are
further from the trend line. According to this graph, for one reason or another, the more time a
Data Set 3
This plot represents a strong (nearly perfect) negative linear correlation with a coefficient of
-0.98. As the age of a person increases, the number of hours spent jogging per week decreases.
The data points are almost on the trend line itself, indicating a very strong correlation.
To predict the number of hours a 40-year-old person jogs per week, we can use the slope and
intercept of the regression equation and predict that he or she jogs approximately 4.6 hours per
week. If x = 40, then y = (-0.1396 x
Data Set 4
This scatter plot represents a moderate positive linear correlation, with a correlation
coefficient of r = 0.59. Some data points are near the trend line while others are further away,
so the linear correlation is neither strong nor weak. According to this graph,
spending more on advertising may be associated with a higher number of products sold.
Data Set 5
The plot for data set 5 shows no linear relationship between the variables, as the data points do
not follow a
clear trend line and are scattered randomly throughout. According to this graph, temperature
Correlation, as defined above, indicates a simple relationship between the values of two
variables. A scatter plot displays this data and is a useful tool for visually determining if there
Causation means that one event causes another event to occur. A causal claim applies only
when one variable has been shown to cause a change in a dependent variable. Causation is
typically established through testing and rigorous experiments, with results commonly reported at
a confidence level of at least 95%.
Causation and correlation can occur simultaneously between two variables. However,
correlation does not imply causation. As an example, there seems to be a correlation between the
number of 5G cell phone towers and confirmed COVID-19 cases on maps. However, there is no
evidence that the 5G cell phone towers actually cause or increase the risk of getting COVID-19.
The correlation might arise because the areas with 5G towers tend to be large metropolitan areas
with larger populations, which may account for the higher number of COVID-19 cases compared to
cities with lower populations and no 5G towers. The 5G towers can’t be said to “cause or
increase the risk” of contracting COVID-19, and thus there is no causal link even though a
positive correlation may be seen. We are always looking for patterns around us to explain what
we see and to find links between things. Events that seem to “connect” based on our own common
sense and judgement cannot be said to be causal unless tested, and until then they should be
treated only as correlations.
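The idea that a third factor can create a correlation without causation can be sketched with a small simulation. In the Python sketch below, everything is made up: a hypothetical "population" value drives both a simulated tower count and a simulated case count, and the two never influence each other. The numbers, names, and noise levels are assumptions chosen only for illustration and are not real 5G or COVID-19 data.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals(xs, ys):
    """What is left of ys after removing its least-squares linear fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Simulated cities (hypothetical): population is the hidden common cause.
rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.uniform(1, 100) for _ in range(200)]
towers = [2 * p + rng.gauss(0, 10) for p in population]  # depends only on population
cases = [5 * p + rng.gauss(0, 50) for p in population]   # depends only on population

# Raw correlation between towers and cases.
r_raw = pearson_r(towers, cases)

# Correlation after removing the part of each variable explained by population.
r_adjusted = pearson_r(residuals(population, towers), residuals(population, cases))

print(f"towers vs. cases: r = {r_raw:.2f}")
print(f"after accounting for population: r = {r_adjusted:.2f}")
```

In this simulation the raw correlation is strongly positive even though neither variable affects the other, and it largely disappears once population is accounted for.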
Conclusion
In conclusion, scatter plots, correlation and regression lines are very useful statistical
tools that help determine the relationship between two variables in a data set. They let us
forecast future outcomes, which makes their usage and interpretation very important in a variety
of settings. Scatter plots allow us to visualize data and quickly determine what type of
correlation, if any, exists between two variables. One drawback of a scatter plot is that it may
show correlation but not causation and still be presented as evidence of a false link between
two variables. For instance, in data set 3, it can be said that as we age we tend to jog less per
week. However, this is based on only 6 people, so it cannot be concluded that getting older
results in jogging fewer hours per week. The sample size is very small and the people may have
been cherry-picked to imply causation. More rigorous experiments and studies would need to be done
to determine if there is a causative factor. Knowing these important statistical measurement tools
can help us better understand the relationship between various factors in our world.