Azure Data Studio 1694473395
Introduction
In this activity, we will use simple linear regression to explore the relationship between two
continuous variables. To accomplish this, we will perform a complete simple linear regression
analysis, which includes creating and fitting a model, checking model assumptions, analyzing
model performance, interpreting model coefficients, and communicating results to stakeholders.
For this activity, we are part of an analytics team that provides insights about marketing and
sales. We have been assigned to a project that focuses on the use of influencer marketing,
and we would like to explore the relationship between marketing promotional budgets and sales.
The dataset provided includes information about marketing campaigns across TV, radio, and
social media, as well as how much revenue in sales was generated from these campaigns.
Based on this information, leaders in our company will make decisions about where to focus
future marketing efforts, so it is critical to have a clear understanding of the relationship between
the different types of marketing and the revenue they generate.
This activity will develop our knowledge of linear regression and our skills in evaluating
regression results, which will help prepare us for modeling to provide business recommendations
in the future.
Step 1: Imports
Import packages
Import relevant Python libraries and packages. In this activity, you will need to use pandas,
pyplot from matplotlib, and seaborn.
localhost:8888/notebooks/Downloads/Simple Linear Regression (Marketing Mix Modelin) Part 2 By Ayobola Lawal.ipynb 1/15
11/5/23, 9:05 PM Simple Linear Regression (Marketing Mix Modelin) Part 2 By Ayobola Lawal - Jupyter Notebook
Each row corresponds to an independent marketing promotion where the business invests in
TV, Social_Media, and Radio promotions to increase Sales.
The business would like to determine which feature most strongly predicts Sales so they have
a better understanding of what promotions they should invest in in the future. To accomplish
this, we'll construct a simple linear regression model that predicts sales using a single
independent variable.
Question: What are some reasons for conducting an EDA before constructing a simple linear
regression model?
Out[4]: (4572, 4)
Before fitting the model, ensure the Sales for each promotion (i.e., row) is present. If the
Sales in a row is missing, that row isn't of much value to the simple linear regression model.
Display the percentage of missing values in the Sales column in the DataFrame data.
Question 2: What do you observe about the percentage of missing values in the Sales
column?
The preceding output shows that 0.13% of rows are missing the Sales value.
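The cell that produced this percentage is not shown, so here is one possible sketch of the check. It runs on a synthetic stand-in DataFrame rather than the real dataset: the 4572 rows match the shape reported earlier, but the choice of 6 missing rows is our own construction, picked only because it reproduces a figure of about 0.13%.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (values are made up).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(10, 100, size=4572),
    "Sales": rng.uniform(25, 350, size=4572),
})
# Blank out Sales in the first 6 rows (.loc slicing includes both ends).
data.loc[:5, "Sales"] = np.nan

# isna() flags missing entries; the mean of the flags is the missing fraction.
missing_sales = data["Sales"].isna().mean()
missing_pct = round(missing_sales * 100, 2)
print(f"Percentage of promotions missing Sales: {missing_pct}%")

# Drop only the rows where Sales is missing, then confirm the new shape.
data = data.dropna(subset=["Sales"]).reset_index(drop=True)
print(data.shape)
```

Passing subset=["Sales"] keeps rows that are missing only a predictor value, which matters if a different X variable is chosen later.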
Question: What did we observe about the distribution of Sales from the preceding histogram?
Create a pairplot to visualize the relationships between pairs of variables in the data. You will
use this to visually determine which variable has the strongest linear relationship with Sales.
This will help you select the X variable for the simple linear regression.
TV clearly has the strongest linear relationship with Sales. You could draw a straight line
through the scatterplot of TV and Sales that confidently estimates Sales using TV.
Radio and Sales appear to have a linear relationship, but there is larger variance than
between TV and Sales.
Replace the comment with the correct code. Use the variable you chose for X for building the
model.
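The model-building cell is missing from this copy. The sketch below shows one way to fit the model, assuming the statsmodels formula API is used (the "OLS Regression Results" output suggests statsmodels, but the exact calls are our assumption). The data is synthetic with a made-up slope of 3.5, so the printed numbers will not match the notebook's output.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data with a known slope of 3.5 (illustrative only).
rng = np.random.default_rng(2)
tv = rng.uniform(10, 100, size=500)
ols_data = pd.DataFrame({
    "TV": tv,
    "Sales": 3.5 * tv + rng.normal(0, 3, size=500),
})

# Formula: dependent variable to the left of ~, independent variable to the right.
ols_formula = "Sales ~ TV"

# Create the OLS model object, then fit it to the data.
OLS = ols(formula=ols_formula, data=ols_data)
model = OLS.fit()

# Display the regression results table.
print(model.summary())
```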
Out[10]:
OLS Regression Results
Df Model: 1
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To justify using simple linear regression, check that the four linear regression assumptions are
not violated. These assumptions are:
Linearity
Independent Observations
Normality
Homoscedasticity
The linearity assumption requires a linear relationship between the independent and dependent
variables. Check this assumption by creating a scatterplot comparing the independent variable
with the dependent variable.
Create a scatterplot comparing the X variable you selected with the dependent variable.
There is a clear linear relationship between TV and Sales, meeting the linearity assumption.
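The scatterplot cell for the linearity check is not shown. A minimal sketch follows, again on synthetic stand-in data (the values are invented), with TV as the chosen X variable.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in data (illustrative values only).
rng = np.random.default_rng(3)
tv = rng.uniform(10, 100, size=300)
data = pd.DataFrame({
    "TV": tv,
    "Sales": 3.5 * tv + rng.normal(0, 3, size=300),
})

# Scatterplot of the independent variable against the dependent variable.
ax = sns.scatterplot(x=data["TV"], y=data["Sales"])
ax.set_xlabel("TV")
ax.set_ylabel("Sales")
ax.set_title("TV v. Sales")
plt.show()
```

If the points curve or fan out rather than following a straight band, the linearity assumption deserves a closer look before fitting.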
The independent observation assumption states that each observation in the dataset is
independent. As each marketing promotion (i.e., row) is independent from one another, the
independence assumption is not violated.
The normality assumption states that the errors are normally distributed.
The histogram shows that the residuals are approximately normally distributed, which supports
that the normality assumption is met for this model. The residuals in the Q-Q plot form a
straight line, further supporting that the normality assumption is met.
The homoscedasticity (constant variance) assumption is that the residuals have a constant
variance for all values of X.
Check that this assumption is not violated by creating a scatterplot with the fitted values and
residuals. Add a line at 𝑦 = 0 to visualize the variance of residuals above and below 𝑦 = 0.
In [13]: # Create a scatterplot with the fitted values from the model and the residuals.
fig = sns.scatterplot(x = model.fittedvalues, y = model.resid)
# Set the x-axis label.
fig.set_xlabel("Fitted Values")
# Set the y-axis label.
fig.set_ylabel("Residuals")
# Set the title.
fig.set_title("Fitted Values v. Residuals")
# Add a line at y = 0 to visualize the variance of residuals above and below 0.
fig.axhline(0)
# Show the plot.
plt.show()
The variance of the residuals is consistent across all values of 𝑋. Thus, the assumption of
homoscedasticity is met.
Display the OLS regression results from the fitted model object, which includes information
about the dataset, model fit, and coefficients.
Out[14]:
OLS Regression Results
Df Model: 1
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Question: The R-squared on the preceding output measures the proportion of variation in the
dependent variable (Y) explained by the independent variable (X). What is your interpretation of
the model's R-squared?
Using TV as X results in a simple linear regression model with 𝑅² = 0.999. In other words,
TV explains 99.9% of the variation in Sales.
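To make the "proportion of variation explained" reading concrete, the sketch below computes R-squared by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. It uses NumPy on synthetic data, so the value it prints is not the notebook's 0.999.

```python
import numpy as np

# Synthetic stand-in data (illustrative values only).
rng = np.random.default_rng(5)
x = rng.uniform(10, 100, size=500)
y = 3.5 * x + rng.normal(0, 3, size=500)

# Least-squares line; for deg=1, polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# Variation left unexplained by the line.
ss_res = np.sum((y - y_hat) ** 2)
# Total variation of y around its mean.
ss_tot = np.sum((y - y.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.3f}")
```

For simple linear regression with an intercept, this value equals the squared Pearson correlation between X and Y.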
With the model fit evaluated, we can assess the coefficient estimates and the
uncertainty of these estimates.
Question: Based on the preceding model results, what do we observe about the coefficients?
When TV is used as the independent variable X, the coefficient for the Intercept is -0.1263
and the coefficient for TV is 3.5614.
Question: How would we write the relationship between X and Sales in the form of a linear
equation?
𝑌 = Intercept + Slope ∗ 𝑋
Sales (in millions) = Intercept + Slope ∗ TV (in millions)
Sales (in millions) = −0.1263 + 3.5614 ∗ TV (in millions)
Question: Based on this model, what is the interpretation of the relationship between X and
Sales?
According to the model, when TV is used as the independent variable X, an increase of one
million dollars for the TV promotional budget results in an estimated 3.5614 million dollars more
in sales.
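The equation and its interpretation can be checked with a few lines of arithmetic. The sketch below plugs the reported intercept and slope into the linear equation. The function name and the example budgets (10, 11, and 50 million) are hypothetical choices for illustration.

```python
# Coefficients reported in the model summary.
INTERCEPT = -0.1263
SLOPE = 3.5614


def predict_sales(tv_budget: float) -> float:
    """Return estimated sales (in millions) for a TV budget (in millions)."""
    return INTERCEPT + SLOPE * tv_budget


# Hypothetical budgets, in millions of dollars.
for budget in (10, 11, 50):
    print(f"TV = {budget} -> estimated Sales = {predict_sales(budget):.4f}")

# Raising the budget by one million changes the estimate by the slope.
increase = predict_sales(11) - predict_sales(10)
print(f"Estimated increase per additional million: {increase:.4f}")
```

Estimates are most trustworthy for budgets inside the range of TV values seen in the data.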
Beta coefficients allow us to estimate the magnitude and direction (positive or negative) of the
effect of each independent variable on the dependent variable. The coefficient estimates can be
converted to explainable insights, such as the connection between an increase in TV
promotional budgets and sales mentioned previously.
Model coefficients are estimated. This means there is an amount of uncertainty in the estimate.
A p-value and 95% confidence interval are provided with each coefficient to quantify the
uncertainty for that coefficient estimate.
Out[15]:
OLS Regression Results
Df Model: 1
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Question: Based on this model, what is our interpretation of the p-value and confidence interval
for the coefficient estimate of X?
When TV is used as the independent variable, it has a p-value of 0.000 (as reported to three
decimal places) and a 95% confidence interval of [3.558, 3.565]. This means that if the analysis
were repeated on many samples, about 95% of the intervals constructed this way would contain
the true parameter value of the slope. These results indicate little uncertainty in the
estimation of the slope of X. Therefore, the business can be confident in the impact TV has on
Sales.
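The p-value and confidence interval can also be read directly from the fitted model instead of the summary table. The sketch below assumes a statsmodels results object and refits on synthetic data so that it runs on its own. The column names "lower" and "upper" are labels we assign for readability.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data and model (illustrative values only).
rng = np.random.default_rng(6)
tv = rng.uniform(10, 100, size=500)
data = pd.DataFrame({"TV": tv, "Sales": 3.5 * tv + rng.normal(0, 3, size=500)})
model = ols(formula="Sales ~ TV", data=data).fit()

# p-value for each coefficient.
print(model.pvalues)

# 95% confidence interval for each coefficient (alpha = 0.05).
conf_int = model.conf_int(alpha=0.05)
conf_int.columns = ["lower", "upper"]
print(conf_int)

# Width of the interval for the slope; a narrow interval means a precise estimate.
slope_width = conf_int.loc["TV", "upper"] - conf_int.loc["TV", "lower"]
print(f"Width of the 95% interval for the TV slope: {slope_width:.4f}")
```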
Providing the business with the estimated sales given different TV promotional budgets
Using both TV and Radio as independent variables
Adding plots to help convey the results, such as using the seaborn regplot() to plot
the data with a best fit regression line
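As a sketch of the last idea, seaborn's regplot() draws the scatterplot and a fitted regression line in one call. The data below is synthetic, and the title and axis labels are our own choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in data (illustrative values only).
rng = np.random.default_rng(7)
tv = rng.uniform(10, 100, size=300)
data = pd.DataFrame({"TV": tv, "Sales": 3.5 * tv + rng.normal(0, 3, size=300)})

# Scatterplot with a best fit regression line and its confidence band.
ax = sns.regplot(x="TV", y="Sales", data=data)
ax.set_xlabel("TV promotional budget (in millions)")
ax.set_ylabel("Sales (in millions)")
ax.set_title("Sales v. TV with Best Fit Line")
plt.show()
```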
Of the three available promotion types (TV, radio, and social media), TV has the strongest
positive linear relationship with sales. According to the model, an increase of one million dollars
for the TV promotional budget will result in an estimated 3.5614 million dollars more in sales.
This is a very confident estimate, as the p-value for this coefficient estimate is small. Thus, the
business should prioritize increasing the TV promotional budget over the radio and social media
promotional budgets to increase sales.
Considerations
What are some key takeaways that we learnt from this lab?
Exploratory data analysis can be used to find a suitable X variable for a simple linear
regression model.
It's important to check that simple linear regression assumptions are met before interpreting
model results.
R-squared is a useful measure of how much of the variation in the dependent variable the model explains.
It's important to provide measures of uncertainty (e.g., p-values, confidence intervals) with
any coefficient estimates.
Sales is relatively equally distributed between $25 and $350 million across all promotions.
In the data visualization, TV has the strongest linear relationship with sales. Radio and
sales appear to have a moderate linear relationship, but there is larger variance than
between TV and sales. Social media and sales appear to have a weak linear relationship.
When TV is used as the independent variable, the R-squared for the simple linear
regression model estimating sales is 0.999.
When TV is used as the independent variable, the coefficient for the intercept is -0.1263
and the coefficient for TV is 3.5614.
When TV is used as the independent variable, the slope coefficient for TV has a
p-value of 0.000 and a 95% confidence interval of [3.558, 3.565].
Among TV, social media, and radio, TV had the strongest positive linear relationship with
sales.
The linear regression model estimates that 99.9% of the variation in sales is explained by
the TV promotional budget. In other words, nearly all of the variation in sales can be
explained by the TV promotional budget alone, making TV an excellent predictor of sales.
According to the model, when TV is used as the independent variable X, an increase of one
million dollars for the TV promotional budget would result in an estimated 3.5614 million
more dollars in sales.
The 95% confidence interval for the increase in sales for a one million dollar increase in
the TV promotional budget is (3.558 million, 3.565 million). Because this interval is
narrow, the estimate provided in the previous bullet is precise.
References
Dale, D., Droettboom, M., Firing, E., Hunter, J. (n.d.). Matplotlib.Pyplot.Axline: Matplotlib 3.5.0
Documentation (https://matplotlib.org/3.5.0/api/_as_gen/matplotlib.pyplot.axline.html).