Data Analytics Lesson 11 Notes
Intro to Linear Regression
Contents
Lesson outcomes
Introduction
Correlation
References
DATA ANALYTICS
Lesson outcomes
By the end of this lesson, you should be able to:
Introduction
With inferential statistics, we try to infer from sample data how the population might behave. Inferential statistics aims to draw conclusions that extend beyond the immediate data alone. The general linear model makes up a significant part of the family of statistical models. We use linear regression to predict numerical or quantitative values, such as test scores.
This simple approach is widely used and forms the foundation for many more elaborate regression models.
We can write the simple linear regression model as

Y = β0 + β1X + ε

where Y is the continuous dependent variable that we are trying to predict, also known as the outcome variable,
β0 is the first unknown parameter, the intercept of the model,
β1 is the second unknown parameter, which estimates the slope of the model, and
ε is the random error term that we use to represent the part of the dependent variable Y the model will not be able to predict or explain.
Numerous regression techniques exist that we can use to make forecasts and predictions about a data set, depending on which scenario best fits a specific technique. All of these methods aim to investigate the effect the independent variables have on the dependent variable.
• Linear regression
• Logistic regression
• Polynomial regression
• Lasso regression
• Ridge regression
• Random forest regression
Many more techniques exist, but we will focus on understanding linear regression better in this lesson.
Linear regression
We will start our regression journey with linear regression. It might seem like the simplest of approaches, but it forms the basis of our understanding of more modern regression techniques; it is therefore important to gain a good understanding of the simple linear regression technique.
Linear regression aims to model the relationship between the independent (predictor) variables and the dependent variable, our outcome variable, by fitting a straight-line equation to the data. The model therefore assumes that the relationship between the predictor variable X and the response variable Y is linear. The least squares method is the most common method used to fit the line of best fit to a given set of data points.
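The least squares criterion mentioned above can be stated formally: the fitted intercept b0 and slope b1 are the values that minimise the residual sum of squares,

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2
```

that is, the sum of the squared vertical distances between each observed data point and the fitted line.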
Mathematically, we can write the fitted linear regression relationship as the predicted estimate of y, represented by the intercept and the slope terms:

ŷ = b0 + b1x

These estimates are used to predict the value of the outcome variable. The slope b1 measures the change in y for a one-unit change in x. The intercept b0 is the value of y when x is zero.
[Figure: scatter plot of time spent in the shop (x axis, 0 to 6 minutes) against chocolates bought (y axis)]
Assume we have collected data about how many chocolates a person buys based on the amount of time they spend at the chocolaterie. If we visualise this on a graph, the x axis would represent the amount of time spent in the shop and the y axis the number of chocolates a customer bought. Each dot represents one customer.
The natural next question to ask is: if a new customer visits the shop and spends 6 minutes in the chocolaterie, how many chocolates will they buy, based on fitting a linear model to the data?
[Figure: the same scatter plot with the fitted least-squares line extended to x = 6]
We use the method of least squares to draw a straight line through the data points that best fits the data, meaning that the line minimises the sum of squared residuals for the given set of data. By drawing this line, we can find the point that corresponds to 6 minutes on the x axis and read off the corresponding number of chocolates the customer is predicted to buy on the y axis.
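This fit-and-predict workflow can be sketched in R with the built-in lm() and predict() functions. The data below are hypothetical, invented purely for illustration:

```r
# Hypothetical data: minutes spent in the shop and chocolates bought
minutes    <- c(1, 2, 3, 4, 5)
chocolates <- c(2, 4, 5, 7, 9)

# Fit a simple linear regression by least squares
model <- lm(chocolates ~ minutes)

# Intercept (b0) and slope (b1) of the fitted line
coef(model)

# Predicted number of chocolates for a customer spending 6 minutes
predict(model, newdata = data.frame(minutes = 6))
```

Reading the prediction off the fitted line programmatically, via predict(), is equivalent to the graphical approach described above.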
Correlation
“One of the first things taught in introductory statistics textbooks is that correlation is not causation. It is also one of the first things forgotten.” (Thomas Sowell)
Correlation defined
Correlation means association. Correlation measures the association between the x and the y variable in a normally distributed population. The correlation coefficient lies between −1 and +1.
• If the correlation coefficient is greater than zero, the trend of the data is positive: as one variable increases, so does the other. The closer the correlation coefficient is to +1, the stronger the positive relationship between the variables.
• If the correlation coefficient is less than zero, the trend of the data points is negative: as one variable increases, the other decreases. The closer the correlation coefficient is to −1, the stronger the negative relationship between the variables.
• A correlation coefficient of zero indicates that there is no linear relationship between the variables.
Population correlation
We indicate the correlation of the population with the Greek letter rho:

ρ = Cov(X, Y) / (σX σY)

where Cov(X, Y) is the covariance of X and Y, and σX and σY are their standard deviations. This correlation coefficient is used when the data represent the entire population. As before, the coefficient only takes values between −1 and +1.
Sample correlation
The sample correlation coefficient, r, again indicates the linear relationship between variables and lies between −1 and +1, with strong positive linear relationships indicated by values close to +1 and strong negative linear relationships indicated by values close to −1. A random pattern will thus have a correlation close to zero. For the sample correlation coefficient to be a reliable estimate of the population correlation coefficient, a large enough random sample has to be collected.
Note: It is important to note the difference between correlation and causation. Correlation does not automatically indicate causation. If one variable has a strong linear relationship with another, we cannot say that the change in one variable is the cause of the change in the other. Correlation merely shows us whether a relationship between the variables exists.
Correlation in R
• To compute Pearson’s correlation coefficient, we can use the function cor in R, with x and y being numeric vectors.
• The function cor.test will also test for correlation between variables, but returns both the correlation coefficient and the significance level (also known as the p-value) of the correlation.
• We can run a function by typing its name followed by its arguments in parentheses, e.g. funcname(arg1, arg2).
• Furthermore, we can create a vector of numbers with the function c() and assign the vector to a variable, like x. The vector can be assigned to x using either the assignment arrow (<-) or an equals sign (=).
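The points above can be sketched in a short R session; the two vectors below are hypothetical, chosen only for illustration:

```r
# Create two hypothetical numeric vectors with c() and the assignment arrow
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 7, 9)

# Pearson's correlation coefficient: a value between -1 and +1
cor(x, y)

# Correlation test: returns the coefficient along with its p-value
cor.test(x, y)
```

Because y rises almost perfectly in step with x here, cor(x, y) comes out very close to +1, and cor.test() reports a small p-value for that correlation.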
More basics in R
• We can check the length of a vector with the length() function in R.
• The ls() command allows us to look up a list of all objects we have saved in the session.
• If we want to delete any of the objects in the list, we can use the rm() command to do so.
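A minimal session illustrating these housekeeping commands:

```r
x <- c(10, 20, 30)  # create a vector of three numbers
length(x)           # returns 3, the number of elements in x
ls()                # lists all objects saved in the session
rm(x)               # deletes the object x from the session
```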
References
• Fernandez, J., 2020, Introduction to regression analysis, Towards Data Science, https://towardsdatascience.com/introduction-to-regression-analysis-9151d8ac14b3
• James, G., Witten, D., Hastie, T. & Tibshirani, R., 2017, An Introduction to Statistical Learning with Applications in R, 8th edition, Springer, New York, http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
• Kassambara, A., Correlation Test Between Two Variables in R, STHDA: Statistical tools for high-throughput data analysis, http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r
• Nolan, D., 2020, Data Types, Department of Statistics, University of California, Berkeley, https://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/DataTypes4.pdf