Data Analytics Lesson 11 Notes
Intro to Linear Regression
Contents
Lesson outcomes
Introduction
Correlation
References
DATA ANALYTICS
Lesson outcomes
By the end of this lesson, you should be able to:
Introduction
With inferential statistics, we try to infer from sample data how the population might behave. Inferential statistics aims to draw conclusions that extend beyond the immediate data alone. The general linear model makes up a significant part of the family of statistical models. We use linear regression to predict numerical or quantitative values, such as test scores.
This simple approach is widely used and forms the foundation for many more elaborate regression models.
We can write the simple linear regression model as

Y = β0 + β1X + ε

where Y is the continuous dependent variable that we are trying to predict, also known as the outcome variable,
β0 is the first unknown parameter, the intercept of the model,
β1 is the second unknown parameter, which estimates the slope of the model, and
ε is the random error term that we use to represent the part of the dependent variable Y the model will not be able to predict or explain.
Numerous regression techniques exist that we can use to make forecasts and predictions about a data set, depending on which scenario best fits a specific technique. All of these methods aim to investigate the effect the independent variables have on the dependent variable.
• Linear regression
• Logistic regression
• Polynomial regression
• Lasso regression
• Ridge regression
• Random forest regression
Many more techniques exist, but we will focus on understanding linear regression better in this lesson.
Linear regression
We will start our regression journey with linear regression. It might seem like the simplest of approaches, but it forms the basis of our understanding of more modern regression techniques; it is therefore important to gain a good understanding of the simple linear regression technique.
Linear regression aims to model the relationship between the independent (predictor) variables and the dependent variable, our outcome variable, by fitting a straight-line equation to the data. The model therefore assumes that the relationship between the predictor variable X and the response variable Y is linear. The least squares method is the most common method used to fit the line of best fit to a given set of data points.
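The least squares criterion mentioned above can be stated formally: the fitted intercept b0 and slope b1 are the values that minimise the residual sum of squares,

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2
```

that is, the sum of the squared vertical distances between each observed data point and the fitted line.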
Mathematically, we can write the fitted linear regression relationship as the predicted estimate of y, represented by the intercept and the slope terms:

ŷ = b0 + b1x

These estimates are used to predict the value of the outcome variable. The slope b1 measures the change in y for a one-unit change in x. The intercept b0 is the value of y when x is zero.
[Figure: scatter plot of time spent in the shop (x axis, 0 to 6 minutes) against chocolates bought (y axis)]
Assume we have collected data about how many chocolates a person buys based on the amount of time they spend at the chocolaterie. If we visualise this on a graph, the x axis would represent the amount of time spent in the shop and the y axis the number of chocolates a customer bought. Each dot represents one customer.
The natural next question to ask is: if a new customer visits the shop and spends 6 minutes in the chocolaterie, how many chocolates will they buy, based on fitting a linear model to the data?
[Figure: the same scatter plot with the fitted least-squares line extended to x = 6]
We use the method of least squares to draw a straight line through the data points that best fits the data, meaning that the line minimises the sum of squared residuals for the given set of data. By drawing this line, we can find the point that corresponds to 6 minutes on the x axis and read off the corresponding number of chocolates the customer is predicted to buy on the y axis.
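This fit-and-predict workflow can be sketched in R with the built-in lm() and predict() functions. The data below are hypothetical, invented purely for illustration:

```r
# Hypothetical data: minutes spent in the shop and chocolates bought
minutes    <- c(1, 2, 3, 4, 5)
chocolates <- c(2, 4, 5, 7, 9)

# Fit a simple linear regression by least squares
model <- lm(chocolates ~ minutes)

# Intercept (b0) and slope (b1) of the fitted line
coef(model)

# Predicted number of chocolates for a customer spending 6 minutes
predict(model, newdata = data.frame(minutes = 6))
```

Reading the prediction off the fitted line programmatically, via predict(), is equivalent to the graphical approach described above.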
Correlation
“One of the first things taught in introductory statistics textbooks is that correlation is not causation. It is also one of the first things forgotten.” (Thomas Sowell)
Correlation defined
Correlation means association. Correlation measures the association between the x and the y variable in a normally distributed population. The correlation coefficient lies between −1 and +1.
• If the correlation coefficient is greater than zero, the trend of the data is positive: as one variable increases, so does the other. The closer the correlation coefficient is to +1, the stronger the positive relationship between the variables.
• If the correlation coefficient is less than zero, the trend of the data points is negative: as one variable increases, the other decreases. The closer the correlation coefficient is to −1, the stronger the negative relationship between the variables.
• A correlation coefficient of zero indicates that there is no linear relationship between the variables.
Population correlation
We indicate the correlation of the population with the Greek letter rho:

ρ = Cov(X, Y) / (σX σY)

where Cov(X, Y) is the covariance of X and Y, and σX and σY are their standard deviations. This correlation coefficient is used when the data represent the entire population. As before, the coefficient only takes values between −1 and +1.
Sample correlation
The sample correlation coefficient, r, again indicates the linear relationship between variables and lies between −1 and +1, with strong positive linear relationships indicated by values close to +1 and strong negative linear relationships indicated by values close to −1. A random pattern will thus have a correlation close to zero. For the sample correlation coefficient to be a reliable estimate of the population correlation coefficient, a large enough random sample has to be collected.
Note: It is important to note the difference between correlation and causation. Correlation does not automatically indicate causation. If one variable has a strong linear relationship with another, we cannot say that the change in one variable is the cause of the change in the other. Correlation merely shows us whether a relationship between the variables exists.
Correlation in R
• To compute Pearson’s correlation coefficient, we can use the function cor in R, with x and y being numeric vectors.
• The function cor.test will also test for correlation between variables, but returns both the correlation coefficient and the significance level (also known as the p-value) of the correlation.
• We can run a function by typing its name followed by its arguments in parentheses, e.g. funcname(arg1, arg2).
• Furthermore, we can create a vector of numbers with the function c() and assign the vector to a variable, like x. The vector can be assigned to x using either the assignment arrow (<-) or an equals sign (=).
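The points above can be sketched in a short R session; the two vectors below are hypothetical, chosen only for illustration:

```r
# Create two hypothetical numeric vectors with c() and the assignment arrow
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 7, 9)

# Pearson's correlation coefficient: a value between -1 and +1
cor(x, y)

# Correlation test: returns the coefficient along with its p-value
cor.test(x, y)
```

Because y rises almost perfectly in step with x here, cor(x, y) comes out very close to +1, and cor.test() reports a small p-value for that correlation.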
More basics in R
• We can check the length of a vector with the length() function in R.
• The ls() command allows us to look up a list of all objects we have saved in the session.
• If we want to delete any of the objects in the list, we can use the rm() command to do so.
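A minimal session illustrating these housekeeping commands:

```r
x <- c(10, 20, 30)  # create a vector of three numbers
length(x)           # returns 3, the number of elements in x
ls()                # lists all objects saved in the session
rm(x)               # deletes the object x from the session
```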
References
• Fernandez, J., 2020, Introduction to regression analysis, Towards Data Science, https://towardsdatascience.com/introduction-to-regression-analysis-9151d8ac14b3
• James, G., Witten, D., Hastie, T. & Tibshirani, R., 2017, An Introduction to Statistical Learning with Applications in R, 8th edition, Springer, New York, http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
• Kassambara, A., Correlation Test Between Two Variables in R, STHDA: Statistical tools for high-throughput data analysis, http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r
• Nolan, D., 2020, Data Types, Department of Statistics, University of California, Berkeley, https://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/DataTypes4.pdf