Exercise#8 Instructions Linear Regression Model
Pre-requisites:
1- Install Anaconda
2- We will be using public datasets; this exercise uses one called 'Advertising.csv'. Download it
from the course shell.
Following is the code, make sure you update the path to the correct path where you placed the
files and update the data frame name correctly:
# -*- coding: utf-8 -*-
"""
@author: viji
"""
import pandas as pd
import os
path = "C:/A_COMP309/data/Datasets for Predictive Modelling/Datasets for Predictive Modelling with Python/Chapter 5"
filename = 'Advertising.csv'
fullpath = os.path.join(path,filename)
data_viji_adv = pd.read_csv(fullpath)
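# Wrap each call in print() so the output also appears when the file is run as a script
print(data_viji_adv.columns.values)
print(data_viji_adv.shape)
print(data_viji_adv.describe())
print(data_viji_adv.dtypes)
print(data_viji_adv.head(5))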
3- Let us check if there is a correlation between advertisement costs on TV and the resultant sales.
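Remember the formula for the correlation coefficient r between two variables x and y:
r = sum((x_i - mean(x)) * (y_i - mean(y))) / sqrt(sum((x_i - mean(x))^2) * sum((y_i - mean(y))^2))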
a. Use the numpy package to build a function to calculate the correlation between each
input variable (TV, Radio and Newspaper) and the output Sales
b. Run the code snippet below. You should get the following results:
0.782224424861606
0.5762225745710553
0.22829902637616525
Following is the code, make sure you update the path to the correct path where you placed the
files and the dataframe name.
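import numpy as np
def corrcoeff(df,var1,var2):
    # Note: the three helper columns below are added to df itself
    # Numerator terms: product of each variable's deviation from its mean
    df['corrn']=(df[var1]-np.mean(df[var1]))*(df[var2]-np.mean(df[var2]))
    # Denominator terms: squared deviations of each variable
    df['corrd1']=(df[var1]-np.mean(df[var1]))**2
    df['corrd2']=(df[var2]-np.mean(df[var2]))**2
    # Sum each helper column over all rows
    corrcoeffn=df.sum()['corrn']
    corrcoeffd1=df.sum()['corrd1']
    corrcoeffd2=df.sum()['corrd2']
    corrcoeffd=np.sqrt(corrcoeffd1*corrcoeffd2)
    corrcoeff=corrcoeffn/corrcoeffd
    return corrcoeff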
print(corrcoeff(data_viji_adv,'TV','Sales'))
print(corrcoeff(data_viji_adv,'Radio','Sales'))
print(corrcoeff(data_viji_adv,'Newspaper','Sales'))
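As a sanity check, the hand-written calculation can be compared with NumPy's built-in np.corrcoef. The following is a minimal sketch under stated assumptions: the small `demo` frame holds made-up numbers (not rows from Advertising.csv), and `pearson_manual` is an illustrative helper that mirrors the formula without adding columns to the data frame.

```python
import numpy as np
import pandas as pd

def pearson_manual(x, y):
    """Correlation coefficient from deviations about the mean (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up values for illustration only; use your own data frame in the exercise
demo = pd.DataFrame({
    "TV": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Sales": [5.0, 7.5, 9.0, 12.0, 13.5],
})

manual = pearson_manual(demo["TV"], demo["Sales"])
# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
builtin = np.corrcoef(demo["TV"], demo["Sales"])[0, 1]
print(manual, builtin)
```

If the two numbers agree to within floating-point rounding, the manual function is computing the same quantity as the library routine.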
4- Use the matplotlib module to visualize the relationships between each of the inputs and the
output (Sales), i.e. generate three scatter plots.
Following is the code, make sure you update the path to the correct path where you placed the files
and use the correct dataframe name:
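A minimal sketch of the three scatter plots, under stated assumptions: `plot_inputs_vs_output` is an illustrative helper name, and the `demo` frame contains made-up numbers standing in for the real data. In the exercise you would pass your own data frame (for example data_viji_adv) instead of `demo`.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_inputs_vs_output(df, inputs=("TV", "Radio", "Newspaper"), output="Sales"):
    """Draw one scatter plot per input column against the output column."""
    fig, axes = plt.subplots(1, len(inputs), figsize=(15, 4),
                             sharey=True, squeeze=False)
    axes = axes[0]  # subplots are laid out in a single row
    for ax, col in zip(axes, inputs):
        ax.scatter(df[col], df[output])
        ax.set_xlabel(col)
        ax.set_title(f"{output} vs {col}")
    axes[0].set_ylabel(output)
    fig.tight_layout()
    return fig, axes

# Made-up values for illustration only; replace demo with your own data frame
demo = pd.DataFrame({
    "TV": [20.0, 80.0, 150.0, 220.0, 290.0],
    "Radio": [5.0, 40.0, 12.0, 33.0, 25.0],
    "Newspaper": [60.0, 10.0, 45.0, 5.0, 30.0],
    "Sales": [6.0, 11.5, 12.0, 17.0, 19.5],
})
fig, axes = plot_inputs_vs_output(demo)
# plt.show()  # uncomment to display the figure when running interactively
```

Sharing the y-axis across the three panels makes it easier to compare how strongly each input relates to Sales.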
5- Use the ols method from the statsmodels.formula.api library to build a linear regression model
with TV costs as the predictor (input) and sales as the predicted variable (output), i.e. estimate
the parameters of the model. You should get the following results:
Intercept 7.032594
TV 0.047537
Following is the code, make sure you update the path to the correct path where you placed the
files and use the correct dataframe name:
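import statsmodels.formula.api as smf
model1=smf.ols(formula='Sales~TV',data=data_viji_adv).fit()
print(model1.params)
print(model1.pvalues)
print(model1.rsquared)
print(model1.summary())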
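To make the fitted line concrete, the parameters listed above give Sales = 7.032594 + 0.047537 * TV. The sketch below applies that formula by hand. The TV value of 100 is a hypothetical input chosen for illustration, and `predict_sales` is an illustrative helper, not a statsmodels function.

```python
# Parameters reported for the TV-only model in this exercise
INTERCEPT = 7.032594
TV_COEF = 0.047537

def predict_sales(tv_cost):
    """Apply the fitted line Sales = intercept + coefficient * TV."""
    return INTERCEPT + TV_COEF * tv_cost

# Hypothetical TV advertising cost, in the same units as the dataset's TV column
tv_example = 100
prediction = predict_sales(tv_example)
print(f"{prediction:.6f}")  # prints 11.786294
```

Calling model1.predict on a data frame with a TV column of 100 should give the same number up to rounding of the printed parameters.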
6- Rebuild the model with two predictors, TV and Radio, as input variables and print the
parameters, p-values, R-squared and summary. Then:
a. Create a new data frame with 2 new values for TV and Radio
b. Predict using the new values
c. Change the values and run the prediction again
d. Change the values again to two values already existing in the dataset and run the
prediction again
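7- Based on the output, our new formula has the form:
Sales = Intercept + (TV coefficient * TV) + (Radio coefficient * Radio)
where the intercept and coefficients are the values printed by model3.params.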
Following is the code, make sure you update the path to the correct path where you placed the
files and use the correct dataframe name:
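model3=smf.ols(formula='Sales~TV+Radio',data=data_viji_adv).fit()
print(model3.params)
print(model3.pvalues)
print(model3.rsquared)
print(model3.summary())
# New data frame for prediction; 50 and 40 are example values, replace them with your own
X_new2=pd.DataFrame({'TV':[50],'Radio':[40]})
sales_pred2=model3.predict(X_new2)
print(sales_pred2)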
Notice that in this exercise we used all the data for training. This is not the best approach; it is
better to split the data randomly into training and test sets.
8- In this step we will build the model using the scikit-learn package, which is the more commonly
used package for data science projects. This method is more elegant, as it has more built-in
methods for the routine processes associated with regression. Carry out the following:
a. Import the necessary modules
b. Split the dataset into 80% for training and 20% for testing
c. Print out the parameters
d. Test the model on the held-out test set
Following is the code, make sure you update the path to the correct path where you placed the files
and use the correct dataframe name:
#Better solution than the previous method- test and train split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
feature_cols = ['TV', 'Radio']
X = data_viji_adv[feature_cols]
Y = data_viji_adv['Sales']
trainX,testX,trainY,testY = train_test_split(X,Y, test_size = 0.2)
lm = LinearRegression()
lm.fit(trainX, trainY)
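print(lm.intercept_)
print(lm.coef_)
# Pair each feature name with its coefficient (zip is lazy in Python 3, so wrap it in list)
print(list(zip(feature_cols, lm.coef_)))
# Sample output (your values will vary with the random split):
# [('TV', 0.045706061219705982), ('Radio', 0.18667738715568111)]
# R-squared on the training data, then on the held-out test data
print(lm.score(trainX, trainY))
print(lm.score(testX, testY))
print(lm.predict(testX))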
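The split-and-evaluate workflow above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the `demo` frame is synthetic data generated with made-up coefficients (it is not Advertising.csv), and `random_state=42` is an arbitrary seed added so the split is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the coefficients 3.0, 0.05 and 0.2 are made up
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
})
demo["Sales"] = 3.0 + 0.05 * demo["TV"] + 0.2 * demo["Radio"] + rng.normal(0, 1.0, n)

feature_cols = ["TV", "Radio"]
trainX, testX, trainY, testY = train_test_split(
    demo[feature_cols], demo["Sales"], test_size=0.2, random_state=42
)

lm = LinearRegression().fit(trainX, trainY)

# R-squared on seen (train) and unseen (test) rows
train_r2 = lm.score(trainX, trainY)
test_r2 = lm.score(testX, testY)
# Root mean squared error on the test rows, in the units of Sales
rmse = np.sqrt(mean_squared_error(testY, lm.predict(testX)))

print(dict(zip(feature_cols, lm.coef_)))
print(train_r2, test_r2, rmse)
```

A test score far below the training score would suggest the model does not generalize well to unseen rows.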
9- Feature selection: to check which predictors work best as input variables to the model, run
the following scikit-learn code snippet, and don't forget to change the path name:
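One possible way to do this is recursive feature elimination (RFE) from scikit-learn, sketched below under stated assumptions. The choice of RFE with a LinearRegression estimator is an assumption, not necessarily the method intended by the exercise, and the `demo` frame is synthetic data in which Newspaper deliberately has no effect on Sales. In the exercise you would load Advertising.csv from your own path instead.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: Sales depends on TV and Radio only (made-up coefficients)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
demo["Sales"] = 3.0 + 0.05 * demo["TV"] + 0.2 * demo["Radio"] + rng.normal(0, 1.0, n)

feature_cols = ["TV", "Radio", "Newspaper"]
X = demo[feature_cols]
Y = demo["Sales"]

# Keep the 2 best features, dropping one feature per elimination round
selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, Y)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features
selected = [col for col, keep in zip(feature_cols, selector.support_) if keep]
print(selected)
print(dict(zip(feature_cols, selector.ranking_)))
```

With a linear estimator, RFE ranks features by the size of their coefficients, so features measured on very different scales may need standardizing before the ranking is meaningful.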