Exercise#8 Instructions Linear Regression Model

Uploaded by

laylaydeanne

Week b Interactive Exercise#7a

A: Linear Regression Model (estimated time 30 minutes)


In this exercise we will do the following:

 Build a linear regression model using:


o The ols method of the statsmodels.formula.api library
o The scikit-learn package

Pre-requisites:

1- Install Anaconda
2- We will be using a public dataset called 'Advertising.csv'. Download it
from the course shell

Steps for building a linear regression model:

1- Open your Spyder IDE


2- Load the 'Advertising.csv' file into a dataframe. Name the dataframe data_firstname_adv, where
firstname is your first name, and carry out the following activities:
a. Display the column names
b. Display the shape of the dataframe, i.e. the number of rows and the number of columns
c. Display the main statistics of the data
d. Display the types of columns
e. Display the first five records

Following is the code, make sure you update the path to the correct path where you placed the
files and update the data frame name correctly:
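# -*- coding: utf-8 -*-
"""
@author: viji
"""
import pandas as pd
import os
# Update this path to the folder where you placed the file
path = "C:/A_COMP309/data/Datasets for Predictive Modelling/Datasets for Predictive Modelling with Python/Chapter 5"
filename = 'Advertising.csv'
fullpath = os.path.join(path,filename)
data_viji_adv = pd.read_csv(fullpath)
# print() is needed to see the output when the file is run as a script
print(data_viji_adv.columns.values)   # a. column names
print(data_viji_adv.shape)            # b. (rows, columns)
print(data_viji_adv.describe())       # c. main statistics
print(data_viji_adv.dtypes)           # d. column types
print(data_viji_adv.head(5))          # e. first five records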
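To try the inspection calls without the course file, the sketch below reads a tiny in-memory CSV. The four rows and the name demo_adv are made up for illustration and are not taken from 'Advertising.csv'.

```python
import io
import pandas as pd

# Made-up rows for illustration only; not the real Advertising data
csv_text = """TV,Radio,Newspaper,Sales
100.0,20.0,30.0,12.5
200.0,10.0,5.0,15.0
50.0,40.0,60.0,9.0
150.0,30.0,10.0,14.0
"""
demo_adv = pd.read_csv(io.StringIO(csv_text))

print(list(demo_adv.columns))   # a. column names
print(demo_adv.shape)           # b. (number of rows, number of columns)
print(demo_adv.describe())      # c. count, mean, std, min, quartiles, max
print(demo_adv.dtypes)          # d. type of each column
print(demo_adv.head(5))         # e. up to the first five records
```

Once this works, replace the in-memory text with the path to your own copy of the file.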
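3- Let us check if there is a correlation between advertising costs on TV and the resulting sales.
Remember the formula for the correlation coefficient r between an input x and the output y:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$
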
a. Use the numpy package to build a function that calculates the correlation between each
input variable (TV, Radio and Newspaper) and the output Sales
b. Run the code snippet below; you should get the following results:
0.782224424861606
0.5762225745710553
0.22829902637616525
Following is the code, make sure you update the path to the correct path where you placed the
files and the dataframe name.
import numpy as np
def corrcoeff(df,var1,var2):
    # work on a copy so the helper columns are not added to the caller's dataframe
    df=df.copy()
    # numerator terms: products of the deviations from the means
    df['corrn']=(df[var1]-np.mean(df[var1]))*(df[var2]-np.mean(df[var2]))
    # denominator terms: squared deviations of each variable
    df['corrd1']=(df[var1]-np.mean(df[var1]))**2
    df['corrd2']=(df[var2]-np.mean(df[var2]))**2
    corrcoeffn=df.sum()['corrn']
    corrcoeffd1=df.sum()['corrd1']
    corrcoeffd2=df.sum()['corrd2']
    corrcoeffd=np.sqrt(corrcoeffd1*corrcoeffd2)
    corrcoeff=corrcoeffn/corrcoeffd
    return corrcoeff
print(corrcoeff(data_viji_adv,'TV','Sales'))
print(corrcoeff(data_viji_adv,'Radio','Sales'))
print(corrcoeff(data_viji_adv,'Newspaper','Sales'))
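A hand-built correlation function can be cross-checked against numpy's built-in np.corrcoef. The sketch below does this on synthetic data; the helper name pearson_r and the generated x and y values are assumptions for illustration, not part of the exercise files.

```python
import numpy as np

def pearson_r(x, y):
    """Correlation coefficient computed from deviations about the mean."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
print(r_manual, r_numpy)            # the two values should agree
```

On the exercise dataframe, a similar check should be possible with data_viji_adv['TV'].corr(data_viji_adv['Sales']).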
4- Use the matplotlib module to visualize the relationship between each of the inputs and the
output (Sales), i.e. generate three scatter plots.

Following is the code, make sure you update the path to the correct path where you placed the files
and use the correct dataframe name:

import matplotlib.pyplot as plt
# start a new figure for each plot so the three plots are not drawn on top of each other
plt.figure()
plt.plot(data_viji_adv['TV'],data_viji_adv['Sales'],'ro')
plt.title('TV vs Sales')
plt.figure()
plt.plot(data_viji_adv['Radio'],data_viji_adv['Sales'],'ro')
plt.title('Radio vs Sales')
plt.figure()
plt.plot(data_viji_adv['Newspaper'],data_viji_adv['Sales'],'ro')
plt.title('Newspaper vs Sales')
plt.show()
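As an alternative layout, the three scatter plots can be placed side by side in one figure. This is a minimal sketch on synthetic data; the function name scatter_inputs_vs_output and the dataframe demo are hypothetical and only stand in for your own dataframe.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def scatter_inputs_vs_output(df, inputs, output):
    """Draw one scatter plot per input column against the output column."""
    fig, axes = plt.subplots(1, len(inputs), figsize=(4 * len(inputs), 3.5), sharey=True)
    axes = np.atleast_1d(axes)   # keeps the loop valid when there is a single input
    for ax, col in zip(axes, inputs):
        ax.scatter(df[col], df[output], color='red', s=10)
        ax.set_title(f'{col} vs {output}')
        ax.set_xlabel(col)
    axes[0].set_ylabel(output)
    fig.tight_layout()
    return fig, axes

# Synthetic stand-in for the advertising data (illustration only)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, 100),
    'Radio': rng.uniform(0, 50, 100),
    'Newspaper': rng.uniform(0, 100, 100),
})
demo['Sales'] = 3.0 + 0.045 * demo['TV'] + 0.19 * demo['Radio'] + rng.normal(0, 1.5, 100)

fig, axes = scatter_inputs_vs_output(demo, ['TV', 'Radio', 'Newspaper'], 'Sales')
```

Sharing the y-axis makes it easier to compare how strongly Sales moves with each input. Call plt.show() afterwards if the figure does not appear automatically.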

4. Use the ols method of the statsmodels.formula.api library to build a linear regression model
with TV costs as the predictor (input) and Sales as the response (output), i.e. estimate the
parameters of the model. You should get the following results:
Intercept 7.032594
TV 0.047537
Following is the code, make sure you update the path to the correct path where you placed the
files and use the correct dataframe name:

import statsmodels.formula.api as smf
# fit Sales = Intercept + coefficient*TV by ordinary least squares
model1=smf.ols(formula='Sales~TV',data=data_viji_adv).fit()
print(model1.params)
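For a single predictor, the least-squares parameters also have a closed form: the slope is the sum of the products of deviations divided by the sum of squared deviations of x, and the intercept is mean(y) minus slope times mean(x). The sketch below shows this with numpy on synthetic data and cross-checks it against np.polyfit; the helper name simple_ols and the generated values are assumptions for illustration.

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form least-squares estimates for y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    slope = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Synthetic data for illustration only
rng = np.random.default_rng(2)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

intercept, slope = simple_ols(x, y)

# Cross-check with a degree-1 least-squares fit (returns slope first, then intercept)
slope_np, intercept_np = np.polyfit(x, y, 1)
print(intercept, slope)
print(intercept_np, slope_np)
```

Applied to the TV and Sales columns, the same function should reproduce the Intercept and TV values reported by model1.params.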
5- Generate the p-values, the R-squared and the model summary by running the following lines of code:

print(model1.pvalues)
print(model1.rsquared)
print(model1.summary())
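R-squared is the share of the variation in the output that the model explains: 1 minus the residual sum of squares divided by the total sum of squares. For a one-predictor model with an intercept it equals the square of the correlation coefficient, so the TV correlation of about 0.7822 from step 3 corresponds to an R-squared of about 0.612. The sketch below checks this identity on synthetic data; the helper name r_squared and the generated values are assumptions for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data for illustration only
rng = np.random.default_rng(3)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Fit a straight line and compute the fitted values
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

r2 = r_squared(y, y_hat)
r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)   # the two values should agree for a one-predictor model
```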
6- Re-build the model with two predictors, TV and Radio, as input variables and print the
parameters, p-values, R-squared and summary. Then:
a. Create a new data frame with 2 new values for TV and Radio
b. Predict using the new values
c. Change the values and run the prediction again
d. Change the values again to two values already existing in the dataset and run the
prediction again
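7- Based on the output, our new formula has the form Sales = Intercept + (TV coefficient × TV) + (Radio coefficient × Radio), with the values taken from model3.params.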

Following is the code, make sure you update the path to the correct path where you placed the
files and use the correct dataframe name:

import statsmodels.formula.api as smf
model3=smf.ols(formula='Sales~TV+Radio',data=data_viji_adv).fit()
print(model3.params)
print(model3.pvalues)
print(model3.rsquared)
print(model3.summary())
## Predict a new value
X_new2 = pd.DataFrame({'TV': [50],'Radio' : [40]})
# predict for a new observation
sales_pred2=model3.predict(X_new2)
print(sales_pred2)
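A prediction from the fitted model is just the formula evaluated with the estimated parameters. The sketch below fits a two-predictor model on synthetic data, predicts for two new rows at once, and reproduces the predictions by hand from the params; the dataframe demo, its generating coefficients and the new values are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the advertising data (illustration only)
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, 200),
    'Radio': rng.uniform(0, 50, 200),
})
demo['Sales'] = 3.0 + 0.045 * demo['TV'] + 0.19 * demo['Radio'] + rng.normal(0, 1.5, 200)

model = smf.ols(formula='Sales~TV+Radio', data=demo).fit()

# Several new observations can be predicted in one call
X_new = pd.DataFrame({'TV': [50, 120], 'Radio': [40, 15]})
pred = model.predict(X_new)

# The same numbers computed by hand from the estimated parameters
b = model.params
manual = b['Intercept'] + b['TV'] * X_new['TV'] + b['Radio'] * X_new['Radio']
print(pd.DataFrame({'predict': pred, 'manual': manual}))
```

To repeat sub-steps c and d, change the values in X_new and call predict again.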

Notice that in this exercise we used all the data for training. This is not the best approach; it is
better to split the data randomly into training and test sets.

8- In this step we will build the model using the scikit-learn package, which is the more commonly
used package for data science projects. It has more built-in methods for the regular processes
associated with regression. Carry out the following:
a. Import the necessary modules
b. Split the dataset into 80% for training and 20% for testing
c. Print out the parameters
d. Test the model using the train/test split

Following is the code, make sure you update the path to the correct path where you placed the files
and use the correct dataframe name:

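#Better solution than the previous method- test and train split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
feature_cols = ['TV', 'Radio']
X = data_viji_adv[feature_cols]
Y = data_viji_adv['Sales']
# the split is random, so your numbers will differ slightly from run to run
trainX,testX,trainY,testY = train_test_split(X,Y, test_size = 0.2)
lm = LinearRegression()
lm.fit(trainX, trainY)
print (lm.intercept_)
print (lm.coef_)
print(list(zip(feature_cols, lm.coef_)))
# sample output from one run:
# [('TV', 0.045706061219705982), ('Radio', 0.18667738715568111)]
print(lm.score(trainX, trainY))   # R-squared on the training data
print(lm.score(testX, testY))     # R-squared on the held-out test data
print(lm.predict(testX))          # predicted Sales for the test rows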
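Because the split is random, each run gives slightly different parameters and scores. The sketch below fixes the split with the random_state parameter of train_test_split and also reports the root mean squared error on the test rows. It runs on synthetic data; the dataframe demo and the seed value 42 are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data (illustration only)
rng = np.random.default_rng(5)
demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, 200),
    'Radio': rng.uniform(0, 50, 200),
})
demo['Sales'] = 3.0 + 0.045 * demo['TV'] + 0.19 * demo['Radio'] + rng.normal(0, 1.5, 200)

X = demo[['TV', 'Radio']]
Y = demo['Sales']

# random_state makes the 80/20 split repeatable between runs
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=42)

lm = LinearRegression().fit(trainX, trainY)
pred = lm.predict(testX)

test_r2 = lm.score(testX, testY)                   # R-squared on unseen rows
rmse = np.sqrt(mean_squared_error(testY, pred))    # typical error in Sales units
print(len(trainX), len(testX))
print(test_r2, rmse)
```

Comparing the training score with the test score is a simple way to see whether the model generalizes to rows it was not fitted on.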
9- Feature selection: to check which predictors are the best input variables for the model, use
scikit-learn. Run the following code snippet and don't forget to use the correct dataframe name:

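from sklearn.feature_selection import RFE
from sklearn.svm import SVR
feature_cols = ['TV', 'Radio','Newspaper']
X = data_viji_adv[feature_cols]
Y = data_viji_adv['Sales']
estimator = SVR(kernel="linear")
# keep the 2 best features; recent scikit-learn versions require the keyword form
selector = RFE(estimator,n_features_to_select=2,step=1)
selector = selector.fit(X, Y)
print(selector.support_)   # True for each selected feature
print(selector.ranking_)   # 1 for selected features, higher numbers were eliminated earlier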
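To see how support_ and ranking_ should be read, the sketch below runs recursive feature elimination on synthetic data where Newspaper has no effect on Sales by construction. Note that it swaps in LinearRegression as the estimator instead of the SVR used above, and the dataframe demo is made up for illustration. RFE drops the feature with the smallest coefficient magnitude at each step, so the scale of the inputs can influence the result.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: Sales depends on TV and Radio only (illustration only)
rng = np.random.default_rng(6)
demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, 200),
    'Radio': rng.uniform(0, 50, 200),
    'Newspaper': rng.uniform(0, 100, 200),
})
demo['Sales'] = 3.0 + 0.045 * demo['TV'] + 0.19 * demo['Radio'] + rng.normal(0, 1.0, 200)

feature_cols = ['TV', 'Radio', 'Newspaper']
X = demo[feature_cols]
Y = demo['Sales']

# Eliminate one feature per step until two remain
selector = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X, Y)

# Pair each feature with its selection flag and rank
for name, keep, rank in zip(feature_cols, selector.support_, selector.ranking_):
    print(name, keep, rank)
```

Features with rank 1 are the ones kept; a higher rank means the feature was eliminated earlier.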
