Continuous Assessment
You will be given a data set called the Boston Housing Dataset. The Boston Housing Dataset is derived from
information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following
describes the dataset columns:
The goal of this project is to predict housing prices in a town or suburb from the features of the locality
provided to us. In the process, we need to identify the features that most affect house prices.
We will apply data preprocessing techniques and build a linear regression model that predicts prices
for unseen data.
Initialization
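import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("boston.csv")
df.head()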
1. Describe the data set. How many rows and columns are there? What are the data types? What are the
mean, minimum and maximum of each column? [5 marks]
Hint: the info and describe methods
df.info()
df.describe()
2. Plot a histogram of each column. How do you interpret the results? [5 marks]
Hint: sns.histplot
3. MEDV is our dependent variable. Apply a log transformation to it. Why do you think we need
to do this?
Hint: Examine the distribution of MEDV and log MEDV
df['MEDV_log'] = np.log(df['MEDV'])
sns.histplot(data=df,x='MEDV_log',kde=True)
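Besides comparing the two histograms by eye, the effect of the transformation can be quantified with the sample skewness. The sketch below is an illustration on synthetic right-skewed data; compare_skewness is a hypothetical helper, and with the real data you would pass df['MEDV'].

```python
import numpy as np
import pandas as pd


def compare_skewness(values):
    """Return the sample skewness before and after a natural-log transform."""
    values = pd.Series(values, dtype=float)
    if (values <= 0).any():
        raise ValueError("log transform needs strictly positive values")
    return {"raw": values.skew(), "log": np.log(values).skew()}


# Synthetic right-skewed data standing in for MEDV (not the real column).
rng = np.random.default_rng(0)
fake_prices = rng.lognormal(mean=3.0, sigma=0.5, size=1000)
result = compare_skewness(fake_prices)
print(result)
```

A skewness close to zero after the transform suggests a more symmetric target, which usually suits the assumptions of linear regression better.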
4. Check the correlations using a heatmap. How do you interpret the results? [5 marks]
Hint: sns.heatmap
plt.figure(figsize = (12,8))
cmap = sns.diverging_palette(230,20,as_cmap=True)
sns.heatmap(df.corr(),annot=True,fmt='.2f',cmap=cmap)
plt.show()
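A heatmap with many columns can be hard to read, so it may help to list the strongly correlated pairs explicitly. The sketch below is one way to do this; strong_pairs, the 0.7 threshold and the synthetic columns a, b, c are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def strong_pairs(df, threshold=0.7):
    """List column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=np.abs, ascending=False)


# Synthetic stand-in data: b is almost a copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": a + rng.normal(scale=0.1, size=200), "c": rng.normal(size=200)})
pairs = strong_pairs(demo)
print(pairs)
```

Pairs of features with a high absolute correlation are candidates for multicollinearity, which matters later when the regression is fitted.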
5. Visualize the relationship between the AGE and DIS columns using a scatter plot. How do you
interpret the results? [5 marks]
Hint: sns.scatterplot
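A minimal sketch of such a plot is given below, with the Pearson correlation shown in the title to support the interpretation. The helper scatter_with_corr is hypothetical, and the demo frame is synthetic with a made-up relationship; with the real data you would pass df.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def scatter_with_corr(df, x, y):
    """Scatter plot of two columns, titled with their Pearson correlation."""
    r = df[x].corr(df[y])
    fig, ax = plt.subplots(figsize=(7, 5))
    sns.scatterplot(data=df, x=x, y=y, ax=ax)
    ax.set_title(f"{y} vs {x} (Pearson r = {r:.2f})")
    return ax, r


# Synthetic stand-in data; the relationship here is invented, not taken from the real data set.
rng = np.random.default_rng(0)
age = rng.uniform(0, 100, size=200)
demo = pd.DataFrame({"AGE": age, "DIS": 10 - 0.08 * age + rng.normal(scale=0.5, size=200)})
ax, r = scatter_with_corr(demo, "AGE", "DIS")
```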
11. Split the data into the dependent and independent variables, then split these into train and test sets
in a 70:30 ratio. [5 marks]
Hint: add_constant, train_test_split
print(checking_vif(X_train))
13. Drop the TAX column and check whether multicollinearity is resolved. [5 marks]
14. Build a linear regression model that uses all features except TAX to predict log
MEDV. [5 marks]
Hint: sm.OLS
16. Create the model after dropping the columns 'MEDV', 'MEDV_log', 'TAX', 'ZN', 'AGE', 'INDUS' from the df
DataFrame. [5 marks]
18. Let’s assume that you were just given the above data set. Write a short paragraph summarizing your
statistical findings. This type of exercise mimics the approach of a typical quantitative research
exercise whereby you are given some data and you need to extract some insight from it. The above
instructions are typical steps one could follow to tackle these types of problems. [15 marks]