Analytics
Machine Learning: Regression
Andy Oh
03 May 2021
0.1 Objectives
The course will begin with a gentle introduction to linear regression, starting with correlation
analysis before moving into multiple regression models. The key assumptions of linear
regression are beyond the coverage of this course. Heuristic data transformations to address
non-normal data distributions will also be covered in this course.
Linear regression models are used to predict a continuous value, for example, predicting the
price of a house given its size, location, and number of rooms as input variables. In machine
learning, linear regression is considered a supervised learning technique, where a set of
input variables and their outputs are known. The data from the input variables (X) are used
to train a model that predicts the output variable (Y). The accuracy of the predictions is then
compared against the known outcomes to determine the overall quality of the model.
1 Correlation Analysis
1.1 Scatter plot
The study of the relationship between two variables (interval or ratio data) often starts with
a visual representation using a scatter plot. Consider a simple dataset of advertising
spending on YouTube and sales revenue: can you observe the relationship between the
advertising cost and the sales revenue?
From the scatter plot, we can observe a positive relationship between advertising spending
and sales revenue: higher advertising spending on YouTube is associated with higher sales
revenue for the company.
There are times when the relationship between two variables is not so clear. In the
scatter plot below, we are not able to observe a pattern as clear as in the previous scatter plot.
In correlation analysis, we can generalise the relationship between two variables
as positive, negative, or no correlation.
The correlation coefficient describes the quantitative strength of the linear relationship between two variables (ratio or interval).
It is often referred to as Pearson’s correlation or Pearson’s r.
A correlation of -1.00 or +1.00 indicates a perfect correlation.
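Although this course computes correlations through Azure Machine Learning Studio, Pearson’s r is simple enough to calculate directly. The following sketch, using hypothetical spending and sales figures rather than the actual marketing.csv data, shows the calculation: the covariance of the two variables divided by the product of their standard deviations.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: product of the two deviation magnitudes
    den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) *
                    sum((yi - mean_y) ** 2 for yi in y))
    return num / den

# Perfectly linear hypothetical data gives r = 1.0
spend = [10, 20, 30, 40]
sales = [15, 25, 35, 45]
print(pearson_r(spend, sales))  # 1.0
```

Because the example data lie exactly on a straight line with a positive slope, r comes out as +1.00, the perfect positive correlation described above.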
1.2 Correlation using Azure
Use the Compute Linear Correlation module in Azure Machine Learning Studio to
compute a set of Pearson correlation coefficients for each possible pair of variables in the
input dataset.
The Pearson correlation coefficient, sometimes called Pearson’s R test, is a statistical value
that measures the linear relationship between two variables. By examining the coefficient
values, you can infer something about the strength of the relationship between the two
variables, and whether they are positively correlated or negatively correlated.
source: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/compute-linear-correlation
From the correlation output, which variable has the highest correlation to sales? Is the
relationship a positive or negative one?
The objective of the ordinary least squares (OLS) method is to produce a line with the least total
squared distance between the predicted values (Ŷ) and the observed values of the dependent variable (Y).
In other words, OLS minimises the sum of squared differences between the observed value of
the dependent variable (Y) and its predicted value (Ŷ).
Formula to calculate β (beta) with OLS:
β = [ Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) ] / [ Σ_{i=1..n} (x_i − x̄)² ]
source: https://www.hackerearth.com
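The β formula above can be applied directly. This sketch fits a simple regression line by OLS on small hypothetical numbers (not the course dataset), returning both the slope β and the intercept, which follows from the fact that an OLS line always passes through the point (x̄, ȳ).

```python
def ols_fit(x, y):
    """Fit y = alpha + beta * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # beta = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    beta = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
            / sum((xi - mean_x) ** 2 for xi in x))
    # The fitted line passes through (x_bar, y_bar)
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical advertising-spend vs sales figures, for illustration only
youtube = [10.0, 20.0, 30.0, 40.0]
sales = [7.0, 9.0, 11.0, 13.0]
alpha, beta = ols_fit(youtube, sales)
print(alpha, beta)  # 5.0 0.2
```

With these numbers, each extra unit of spending adds 0.2 units of predicted sales, so the prediction for a new spend of 50 would be 5.0 + 0.2 × 50 = 15.0.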
After calculating Ŷ for each observed Y, three different sums of squares can be derived
from Y_i, Ȳ, and Ŷ:
• Total Sum of Squares (TSS) = Σ_{i=1..n} (Y_i − Ȳ)²
• Explained Sum of Squares (ESS) = Σ_{i=1..n} (Ŷ_i − Ȳ)²
• Residual Sum of Squares (RSS) = Σ_{i=1..n} (Y_i − Ŷ_i)²
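These three quantities are connected: for an OLS fit with an intercept, TSS = ESS + RSS, and the coefficient of determination R² equals ESS / TSS = 1 − RSS / TSS. The sketch below, on hypothetical fitted values (not course data), verifies the decomposition numerically.

```python
def sums_of_squares(y, y_hat):
    """Return (TSS, ESS, RSS) for observed y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual
    return tss, ess, rss

# Fitted values from a hypothetical OLS line y_hat = 1.5 + 0.8 * x
y     = [2.0, 3.0, 5.0, 4.0]
y_hat = [1.5 + 0.8 * x for x in [1, 2, 3, 4]]
tss, ess, rss = sums_of_squares(y, y_hat)
print(round(tss, 4), round(ess, 4), round(rss, 4))  # 5.0 3.2 1.8
print(round(1 - rss / tss, 4))                      # R^2 = 0.64
```

Note that 3.2 + 1.8 = 5.0, confirming TSS = ESS + RSS for this OLS fit; an R² of 0.64 means the line explains 64% of the variability in y.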
Throughout this course, we will select Ordinary Least Squares and set the L2 regularization
weight to zero for linear regression modeling.
In other words, the equation allows us to predict the sales revenue given a value of YouTube
advertising.
2.5.2 Tasks
• Create a linear regression model using marketing.csv.
• Set up newspaper as an independent variable to predict sales in one experiment.
• Use facebook as an independent variable to predict sales in another experiment.
• Express the weights as an equation to predict sales for each experiment.
• Which variable seems to produce smaller errors and explain the variability
of sales better?
Which independent variable is the most important one to affect sales positively?
From the output of the Evaluate Model instance, we can observe that the regression tree
model produces a lower RMSE and a higher R² than the linear regression model.
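The RMSE reported by the evaluation is simply the square root of the mean squared difference between observed and predicted values, so a lower RMSE means predictions sit closer to the observations on average. A minimal sketch on made-up numbers (not the actual model outputs):

```python
import math

def rmse(y, y_hat):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / len(y))

# Hypothetical predictions that are each off by exactly one unit
y = [1.0, 2.0, 3.0]
y_hat = [2.0, 3.0, 4.0]
print(rmse(y, y_hat))  # 1.0
```

Because RMSE is in the same units as the dependent variable, an RMSE of 1.0 here means the model is off by about one unit of sales on a typical prediction.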
6 Categorical Data with Linear Regression
Linear regression can also be performed when the independent variables are qualitative or
categorical data.
Under the hood, the gender category (male and female) is encoded as 0 or 1. When
using male as an input to predict cholesterol, male takes on a value of 1 and female takes on a
value of 0. Conversely, when using female as an input to predict cholesterol, female takes
on a value of 1 and male takes on a value of 0. This method is commonly known as one-hot
encoding.
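One-hot encoding can be sketched in a few lines. The helper below (a generic illustration, not the encoder Azure uses internally) turns a list of category labels into one 0/1 indicator column per category, so each row has exactly one 1.

```python
def one_hot(values):
    """One-hot encode a list of category labels into 0/1 indicator columns."""
    categories = sorted(set(values))
    # Each row maps every category to 1 if it matches the label, else 0
    return [{c: int(v == c) for c in categories} for v in values]

genders = ["male", "female", "female", "male"]
print(one_hot(genders))
# [{'female': 0, 'male': 1}, {'female': 1, 'male': 0},
#  {'female': 1, 'male': 0}, {'female': 0, 'male': 1}]
```

Each indicator column then behaves like any other numeric input to the regression, which is why the trained model reports one weight per category level.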
The results of the Train Model instance will show more weights as more categorical variables are
added to the regression model.