Continuous Assessment

Deadline: 14th September 2024

You will be given a data set called the Boston Housing Dataset. The Boston Housing Dataset is derived from
information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following
describes the dataset columns:

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS - proportion of non-retail business acres per town
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT - % lower status of the population
- MEDV - median value of owner-occupied homes in $1000's

The goal of this project is to predict the housing prices of a town or suburb based on the features of the locality.
In the process, we need to identify the most important features affecting the price of a house, employ
data-preprocessing techniques, and build a linear regression model that predicts prices for unseen data.

Initialization

Import the necessary libraries and get an overview of the data set.

# Import libraries for data manipulation
import pandas as pd
import numpy as np

# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.gofplots import ProbPlot

# Import libraries for building the linear regression model
from statsmodels.formula.api import ols
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Import library for preparing the data
from sklearn.model_selection import train_test_split

# Import library for data preprocessing
from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings("ignore")

Load the data

df = pd.read_csv("boston.csv")
df.head()

1. Describe the data set: how many rows and columns are there? What are the data types? What are the
   average, min, and max of each column? [5 marks]
   Hint: info, describe methods

df.info()
df.describe()
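
Optionally, a short sketch that prints the shape explicitly and transposes describe so that each column's statistics sit on one row:

print(f"{df.shape[0]} rows, {df.shape[1]} columns")
df.describe().T  # one row per column, statistics as columns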

2. Plot histograms to visualize the columns. How do you interpret the results? [5 marks]
   Hint: sns.histplot

# let's plot all the columns to look at their distributions
for i in df.columns:
    plt.figure(figsize=(7, 4))
    sns.histplot(data=df, x=i, kde=True)
    plt.show()

3. MEDV is our dependent variable; run a log transformation on this feature. Why do you think we need
   to perform this?
   Hint: examine the distributions of MEDV and log MEDV

df['MEDV_log'] = np.log(df['MEDV'])
sns.histplot(data=df, x='MEDV_log', kde=True)
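
The hint asks us to examine both distributions. A minimal sketch plotting MEDV and MEDV_log side by side (assuming MEDV_log has been created as above):

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(data=df, x='MEDV', kde=True, ax=axes[0])      # original target
sns.histplot(data=df, x='MEDV_log', kde=True, ax=axes[1])  # log-transformed target
plt.show()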

4. Check the correlations using a heatmap. How do you interpret the results? [5 marks]
   Hint: sns.heatmap

plt.figure(figsize=(12, 8))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap=cmap)
plt.show()
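
To read the heatmap more easily, one possible follow-up is to pull out each feature's correlation with the log target and sort it (a sketch, assuming MEDV_log exists):

print(df.corr()['MEDV_log'].drop(['MEDV', 'MEDV_log']).sort_values())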

5. Visualize the relationship between the AGE and DIS columns using a scatter plot. How do you
   interpret the results? [5 marks]
   Hint: sns.scatterplot

# scatter plot to visualize the relationship between AGE and DIS
plt.figure(figsize=(6, 6))
sns.scatterplot(x='AGE', y='DIS', data=df)
plt.show()

6. Do the same with RAD and TAX [5 marks]

# scatter plot to visualize the relationship between RAD and TAX
plt.figure(figsize=(6, 6))
sns.scatterplot(x='RAD', y='TAX', data=df)
plt.show()

7. Do the same with INDUS and TAX [5 marks]
8. Do the same with RM and MEDV [5 marks]
9. Do the same with LSTAT and MEDV [5 marks]
10. Do the same with DIS and NOX [5 marks] (a combined sketch follows below)
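
Questions 7-10 repeat the pattern of questions 5 and 6. One way to sketch them all at once (the column pairs below are taken from the questions):

pairs = [('INDUS', 'TAX'), ('RM', 'MEDV'), ('LSTAT', 'MEDV'), ('DIS', 'NOX')]
for x_col, y_col in pairs:
    plt.figure(figsize=(6, 6))
    sns.scatterplot(x=x_col, y=y_col, data=df)
    plt.show()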

11. Split the data into dependent and independent variables, then further split it into train and test sets
    in a 70:30 ratio. [5 marks]
    Hint: add_constant, train_test_split

# separate the dependent and independent variables
Y = df['MEDV_log']
X = df.drop(columns=['MEDV', 'MEDV_log'])

# add the intercept term
X = sm.add_constant(X)

# split the data in a 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

12. Check for multicollinearity [5 marks]
    Hint: variance_inflation_factor

from statsmodels.stats.outliers_influence import variance_inflation_factor

# function to check VIF
def checking_vif(train):
    vif = pd.DataFrame()
    vif["feature"] = train.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(train.values, i)
        for i in range(len(train.columns))
    ]
    return vif

print(checking_vif(X_train))
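
A common rule of thumb is that a VIF above 5 (or, more loosely, above 10) signals problematic multicollinearity; the VIF of the added constant can be ignored.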

13. Drop the TAX column and check if multicollinearity is resolved [5 marks]

# dropping TAX from the training data
X_train = X_train.drop(columns=['TAX'])

# checking the VIF again
print(checking_vif(X_train))
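
Note that if we later want predictions from a model trained on this X_train, the same column has to be dropped from X_test as well, e.g. X_test = X_test.drop(columns=['TAX']).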

14. Build a linear regression model that uses all features except TAX to predict log MEDV [5 marks]
    Hint: sm.OLS

# create the model
model1 = sm.OLS(y_train, X_train).fit()

# get the model summary
model1.summary()

15. Interpret the results [5 marks]
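
One possible starting point for the interpretation: pull out the coefficients whose p-values are significant at the 5% level (a sketch using the standard statsmodels attributes):

significant = model1.pvalues[model1.pvalues < 0.05]
print(model1.params[significant.index])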

16. Create the model after dropping the columns 'MEDV', 'MEDV_log', 'TAX', 'ZN', 'AGE', 'B', and
    'INDUS' from the df DataFrame [5 marks]

# creating the model after dropping columns 'MEDV', 'MEDV_log', 'TAX',
# 'ZN', 'AGE', 'B', 'INDUS' from the df DataFrame
Y = df['MEDV_log']
X = df.drop(columns=['MEDV', 'MEDV_log', 'TAX', 'ZN', 'AGE', 'B', 'INDUS'])
X = sm.add_constant(X)

# splitting the data in a 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

# create the model
model2 = sm.OLS(y_train, X_train).fit()

# get the model summary
model2.summary()

17. Is this model better? [5 marks]
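
One hedged way to answer: compare the adjusted R² values in the two summaries, and check performance on the held-out data. A minimal sketch for model2, whose X_test from question 16 already matches its training columns:

from sklearn.metrics import mean_squared_error

pred2 = model2.predict(X_test)
rmse2 = np.sqrt(mean_squared_error(y_test, pred2))
print("model2 test RMSE (log scale):", rmse2)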

18. Let's assume that you were just given the above data set. Write a short paragraph summarizing your
    statistical findings. This exercise mimics a typical quantitative research workflow, whereby you are
    given some data and need to extract insight from it. The steps above are typical of how one could
    tackle these types of problems. [15 marks]
