Data Analytics Lesson 11 Notes


Diploma in Data Analytics

Intro to Linear Regression

Contents

Lesson outcomes

Introduction

Intro to linear regression

Correlation

Vectors and factors in R

References


Lesson outcomes
By the end of this lesson, you should be able to:

• Explain the basics of linear regression
• Describe correlation and interpret the correlation coefficient
• Work with vectors and factors in R

Introduction
With inferential statistics, we try to infer from sample data how the population might behave. In other words, inferential statistics aims to draw conclusions that extend beyond the immediate data alone. The general linear model accounts for a large part of this family of statistical models. We use linear regression to predict numerical or quantitative values, such as test scores.

This simple approach is widely used and forms the foundation for many more elaborate regression models.

Introduction to linear regression


Regression analysis
Regression analysis is the overarching term we use to describe methods that determine the relationship that best fits the data. In other words, regression analysis is a set of statistical methods used to examine the relationship between the outcome variable and the predictor variables of a model. We use regression as one of the tools to make predictions from a given set of data.

Regression analysis can be represented through the general formula

Y = α + βX + ε

where Y is the continuous dependent variable that we are trying to predict, also known as the outcome variable,

alpha (α) is the first unknown coefficient, known as the intercept,

beta (β) is the second unknown coefficient, which estimates the slope of the model, and

epsilon (ε) is the random error term that represents the part of the dependent variable Y that the model cannot predict or explain.


There are numerous regression techniques that we can use to make forecasts and predictions about a data set, depending on which scenario best suits a specific technique. All of these methods aim to investigate the effect the independent variables have on the dependent variable.

Some regression methods include:

• Linear regression
• Logistic regression
• Polynomial regression
• Lasso regression
• Ridge regression
• Random forest regression

Many more techniques exist, but we will focus on understanding linear regression better in this lesson.

Did you know


Legendre and Gauss published papers on the method of least squares in the early 1800s, on what is today known as linear regression. Their proposals pertained to problems in astronomy. Over the course of many years, as technology improved, many more techniques, apart from linear models, came into being that have helped us to improve forecasting.

Linear regression
We will start our regression journey with linear regression. This might seem like the simplest of approaches, but it forms the basis of our understanding of more modern regression techniques, so it is important to gain a good grasp of the simple linear regression technique.

Linear regression aims to model the relationship between the independent (predictor) variables and the dependent variable, our outcome variable, by fitting a straight-line equation to the data. The model therefore assumes that the relationship between the predictor variable X and the response variable Y is linear. The least squares method is the most common way to fit the straight line of best fit to the given set of data points.

Mathematically, we write the fitted linear regression relationship as the predicted (estimated) value of y, expressed in terms of the intercept and slope estimates. These estimates are used to predict the value of the outcome variable. The slope measures the change in y for a one-unit change in x. The intercept is the value of y when x is zero.
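As a made-up illustration of these estimates: if the fitted intercept is 0.5 and the fitted slope is 0.8, then the prediction at x = 6 is ŷ = 0.5 + 0.8 × 6 = 5.3.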


[Figure: scatter plot of time spent in the chocolaterie (x axis) against the number of chocolates bought (y axis); each dot is one customer.]

Assume we have collected data about how many chocolates a person buys based on the amount of time they spend at the chocolaterie. If we visualise this on a graph, the x axis represents the amount of time spent in the shop and the y axis the number of chocolates a customer bought. Each dot represents one customer.

The natural next question to ask is: if a new customer visits the shop and spends 6 minutes in the chocolaterie, how many chocolates will they buy, based on fitting a linear model to the data?

[Figure: the same scatter plot with the least squares line of best fit drawn through the data points.]

We use the method of least squares to draw a straight line through the data points that best fits the data, meaning that the line minimizes the sum of the squared residuals for the given set of data. By drawing this line, we can find the point that corresponds to 6 minutes on the x axis and read off the corresponding y value, which tells us how many chocolates the customer is expected to buy.
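A rough sketch of this prediction in R, using made-up data values purely for illustration:

  # Hypothetical data: minutes spent in the shop and chocolates bought
  time_spent <- c(1, 2, 2.5, 3, 4, 5)
  chocolates <- c(1, 2, 2, 3, 4, 5)
  shop_data  <- data.frame(time_spent, chocolates)

  # Fit a straight line with least squares
  choc_fit <- lm(chocolates ~ time_spent, data = shop_data)

  # Predict how many chocolates a customer who stays 6 minutes might buy
  predict(choc_fit, newdata = data.frame(time_spent = 6))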

DATA ANALYTICS
6

R linear regression commands

• Use the function lm() to fit a simple linear regression model in R.
  o The lm() function takes a formula of the form y ~ x, where y represents the target variable and x represents the predictor variable. R also needs to be told in which data set to look, so we pass the data frame to the command through the data argument, as shown in the sketch after this list.
• Printing the fitted model object, for example lm.fit, provides us with some basic information about the model, such as the estimated coefficients.
• summary(lm.fit) provides us with more detailed information, like the standard errors and the R-squared value, to name a few.
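A minimal sketch of these commands, assuming a hypothetical data frame my_data with a predictor column x and a target column y (these names are placeholders, not from a particular data set):

  # Hypothetical data frame with a predictor x and a target y
  my_data <- data.frame(x = c(1, 2, 3, 4, 5),
                        y = c(2.1, 3.9, 6.2, 8.1, 9.8))

  lm.fit <- lm(y ~ x, data = my_data)  # fit the simple linear regression model
  lm.fit                               # basic information: the estimated coefficients
  summary(lm.fit)                      # detailed output: standard errors, R-squared, p-values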

Correlation
“One of the first things taught in introductory statistics textbooks is that correlation is not causation. It is also one of the
first things forgotten.” T. Sowell

Correlation defined
Correlation means association. Correlation measures the association between the x and the y variable in a normal
population. The correlation coefficient lies between negative 1 and positive 1.

The correlation coefficient can have 3 possible directions or results:

• If the correlation coefficient is greater than zero, it indicates that the trend of the data is positive: as one variable increases, so does the other. The closer the correlation coefficient is to positive 1, the stronger the positive relationship between the variables.
• If the correlation coefficient is less than zero, it indicates that the trend of the data points is negative: as one variable increases, the other variable decreases. The closer the correlation coefficient is to negative 1, the stronger the negative relationship between the variables.
• A correlation coefficient of zero indicates that there is no linear relationship between the variables.

Population correlation
We indicate the correlation of the population with the Greek letter rho (ρ), given by the equation

ρ = Cov(X, Y) / (σ_X σ_Y)

where Cov(X, Y) is the population covariance between X and Y, and σ_X and σ_Y are the population standard deviations of X and Y. This correlation coefficient is used when the data represent the entire population. As before, this coefficient only takes values between negative 1 and positive 1.


Sample correlation coefficient


The sample correlation coefficient is defined by Pearson’s coefficient r.

Once again, it indicates the strength of the linear relationship between variables and lies between negative 1 and positive 1. Strong positive linear relationships are indicated by values close to positive 1, strong negative linear relationships by values close to negative 1, and a random pattern of points has a correlation close to zero. In order for the sample correlation coefficient to be a reliable estimate of the population correlation coefficient, a large enough random sample has to be collected.
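For reference, the standard formula for Pearson's r, computed from n paired observations (x_i, y_i) with sample means x̄ and ȳ, is

r = Σ (x_i − x̄)(y_i − ȳ) / √[ Σ (x_i − x̄)² · Σ (y_i − ȳ)² ]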

Note: It is important to note the difference between correlation and causation. Correlation does not automatically indicate causation. If one variable has a strong linear relationship with another, we cannot say that the change in one variable is the cause of the change in the other. Correlation merely shows us whether a relationship between the variables exists.

Correlation in R
• To compute Pearson’s correlation coefficient, we can use the function cor() in R, with x and y being numeric vectors.
• The function cor.test() will also test for correlation between variables, but it returns both the correlation coefficient and the significance level (also known as the p-value) of the correlation, as shown in the sketch below.
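A short sketch, using made-up numeric vectors purely for illustration:

  # Hypothetical paired measurements
  x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3)
  y <- c(2.0, 2.9, 3.5, 5.1, 5.4, 6.8)

  cor(x, y)       # Pearson's correlation coefficient (the default method)
  cor.test(x, y)  # the coefficient together with the p-value of the test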

Vectors and factors in R


How does R store data structures?
R uses functions to perform operations on its data structures.

• We run a function by typing its name followed by its arguments in brackets, for example funcname(input1, input2).
• Furthermore, we can create a vector of numbers with the function c() and assign it to a variable, such as x. The vector can be assigned to the variable either with the arrow operator or with an equals sign, as shown in the sketch after this list.
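A minimal sketch of these basics:

  x <- c(1, 5, 9)  # create a numeric vector with c() and assign it with the arrow
  y = c(2, 4, 6)   # assignment with the equals sign also works
  x + y            # operations on vectors work element by element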

More basics in R
• We can check the length of a vector with the length() function in R.
• The ls() command allows us to look up a list of all objects we have saved in the session.
• If we want to delete any of the objects in the list, we can use the rm() command, as illustrated below.
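Continuing the sketch above:

  length(x)        # number of elements in the vector x
  ls()             # list all objects saved in the current session
  rm(y)            # delete the object y
  rm(list = ls())  # remove every object in the workspace at once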


References
• Fernandez, J., 2020, Introduction to regression analysis, Towards Data Science, https://towardsdatascience.com/introduction-to-regression-analysis-9151d8ac14b3

• Gallo, A., 2015, A Refresher on Regression Analysis, Harvard Business Review, https://hbr.org/2015/11/a-refresher-on-regression-analysis

• James, G., Witten, D., Hastie, T. & Tibshirani, R., 2017, An Introduction to Statistical Learning with Applications in R, Springer, New York, http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

• Kassambara, A., Correlation Test Between Two Variables in R, STHDA: Statistical tools for high-throughput data analysis, http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r

• McLeod, S., 2020, Correlation definition, examples & interpretation, Simply Psychology, https://www.simplypsychology.org/correlation.html

• Nolan, D., 2020, Data Types, Department of Statistics, University of California, Berkeley, https://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/DataTypes4.pdf

• Porras, E.M., 2018, Linear Regression in R, DataCamp, https://www.datacamp.com/community/tutorials/linear-regression-R

• Prabhakaran, S., 2017, Linear Regression, r-statistics.co, http://r-statistics.co/Linear-Regression.html

• The Carpentries, 2020, Programming with R: Understanding Factors, https://swcarpentry.github.io/r-novice-inflammation/12-supp-factors/

• Trochim, Prof. W.M.K., 2020, Inferential Statistics, Conjointly, https://conjointly.com/kb/inferential-statistics/
