Predictive Analytics Group Assignment
No. | Attribute          | Description                                       | Type of Variable | Values                                      | Independent/Dependent Variable
----|--------------------|---------------------------------------------------|------------------|---------------------------------------------|-------------------------------
1   | num_rooms          | No. of rooms in the house                         | Integer          |                                             | Independent Variable
2   | num_people         | No. of people staying in the house                | Integer          |                                             | Independent Variable
3   | housearea          | Area of the house (sqft)                          | Float            |                                             | Independent Variable
4   | is_ac              | Whether air conditioner is installed in the house | Integer          | 0 or 1 (binary), where 0 is No and 1 is Yes | Independent Variable
5   | is_tv              | Whether television is installed in the house      | Integer          | 0 or 1 (binary), where 0 is No and 1 is Yes | Independent Variable
6   | is_flat            | Whether the house is a flat                       | Integer          | 0 or 1 (binary), where 0 is No and 1 is Yes | Independent Variable
7   | ave_monthly_income | Average monthly income of people staying          | Float            |                                             | Independent Variable
8   | num_children       | Number of children in the house                   | Integer          |                                             | Independent Variable
9   | is_urban           | Whether the house is situated in urban area       | Integer          | 0 or 1 (binary), where 0 is No and 1 is Yes | Independent Variable
10  | amount_paid        | Amount paid towards electricity bill              | Float            |                                             | Dependent Variable
Explanation of R Script
We set the working directory and load the installed package ggplot2, which is used later.
We import the data and view the dataset. summary(Bill_Prediction) shows that there is one
NA in ave_monthly_income (average monthly income of people staying) and one in
amount_paid (amount paid towards the electricity bill).
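The setup and import steps can be sketched roughly as follows. The file name, the path and the data values are assumptions made for illustration (the report does not give them), so a small CSV is written to a temporary file to keep the example runnable on its own.

```r
# setwd("path/to/project")   # hypothetical path; adjust to your own folder
# library(ggplot2)           # loaded in the report for later use; not needed here

# Hypothetical stand-in for the assignment's data file, written to a temp file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "num_rooms,num_people,housearea,ave_monthly_income,amount_paid",
  "3,4,742.5,9675.9,560.4",
  "2,3,810.0,NA,633.2",
  "1,2,690.3,15020.1,NA"
), csv_path)

# read.csv() treats the string "NA" as a missing value by default.
Bill_Prediction <- read.csv(csv_path)

# summary() reports an "NA's" count for every column with missing values.
summary(Bill_Prediction)
str(Bill_Prediction)
```

In an interactive session, View(Bill_Prediction) opens the data in a spreadsheet-style viewer.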
We then trace missing values with is.na and look for negative values, which make no sense
for these variables. any(is.na(Bill_Prediction)) returns TRUE, which means the dataset
contains missing values, and sum(is.na(Bill_Prediction)) returns 2, which means there are
two missing values in the complete dataset.
Applying is.na to each variable shows where the two missing values are located, i.e. in
exactly which row and under which variable.
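A minimal sketch of these checks on a toy data frame is shown below. The values are made up, and colSums and which(..., arr.ind = TRUE) are one possible way to locate the missing cells, not necessarily the calls used in the original script.

```r
# Toy data frame (assumed values) with one NA in each of two columns.
df <- data.frame(
  num_rooms          = c(3, 2, 1, 4),
  ave_monthly_income = c(9675.9, NA, 15020.1, 12000.0),
  amount_paid        = c(560.4, 633.2, NA, 480.0)
)

any(is.na(df))       # TRUE: at least one value is missing
sum(is.na(df))       # 2: total number of missing values
colSums(is.na(df))   # number of missing values in each column

# Row and column index of every missing cell.
na_pos <- which(is.na(df), arr.ind = TRUE)
na_pos
```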
There were negative values in the number of people and the number of rooms. Since these
counts cannot be negative, we removed the rows that contained them.
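Dropping such rows can be done with a logical filter. The sketch below uses an invented data frame, and df_clean is a hypothetical name.

```r
# Toy data (assumed values): row 2 and row 3 each contain a negative count.
df <- data.frame(
  num_rooms   = c(3, -1, 2, 4),
  num_people  = c(4, 3, -2, 5),
  amount_paid = c(560.4, 633.2, 410.0, 480.0)
)

# Keep only the rows in which both counts are non-negative.
df_clean <- df[df$num_rooms >= 0 & df$num_people >= 0, ]
nrow(df_clean)   # 2
```

If the filtered columns themselves could contain NA, wrapping the condition in which() avoids rows of NA appearing in the result.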
We have found one missing (NA) value each under ave_monthly_income and amount_paid, so we
impute each NA with the mean of the corresponding column.
The column means, ignoring the NA values, are computed as follows:
Avg_avgmonthlyincome <- mean(Bill_Prediction3$ave_monthly_income, na.rm = TRUE)
Avg_amountpaid <- mean(Bill_Prediction3$amount_paid, na.rm = TRUE)
We then check again for missing values with is.na, which returns FALSE, confirming that no
missing values remain now that the NA values have been replaced by the means.
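The replacement step itself is not shown in the text, so the following is only a sketch of how the computed means might be assigned back. The data values are invented; the variable names follow the report.

```r
# Toy version of the data frame (assumed values) with one NA per column.
Bill_Prediction3 <- data.frame(
  ave_monthly_income = c(10000, NA, 14000),
  amount_paid        = c(500, 600, NA)
)

# Column means computed without the missing values.
Avg_avgmonthlyincome <- mean(Bill_Prediction3$ave_monthly_income, na.rm = TRUE)
Avg_amountpaid       <- mean(Bill_Prediction3$amount_paid, na.rm = TRUE)

# Overwrite only the missing cells with the corresponding mean.
income_na <- is.na(Bill_Prediction3$ave_monthly_income)
paid_na   <- is.na(Bill_Prediction3$amount_paid)
Bill_Prediction3$ave_monthly_income[income_na] <- Avg_avgmonthlyincome
Bill_Prediction3$amount_paid[paid_na]          <- Avg_amountpaid

any(is.na(Bill_Prediction3))   # FALSE
```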
Now that we have dealt with missing values and negative values in the dataset, the next step
is to check whether there are any outliers in the data. To look for outliers we use boxplots
for the numeric variables and frequency tables for the categorical variables.
We start by removing outliers from num_people. First we compute the quantiles of the
variable. From the 1st and 3rd quartiles we find the interquartile range (IQR). Next we
multiply the IQR by 1.5 to arrive at the margin allowed beyond the quartiles for
num_people.
Using this margin, we compute the lower limit (1st quartile minus 1.5 times the IQR) and the
upper limit (3rd quartile plus 1.5 times the IQR) of the subset, so that the outliers are omitted.
The subset No_outliers1 therefore contains only the values greater than the lower limit and
less than the upper limit. The box and whiskers lie inside this range, while the outliers fall outside it.
We use View(No_outliers1) to inspect the subset and draw a boxplot of the final data with
boxplot(No_outliers1$num_people), which shows no outliers.
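The IQR procedure described above can be sketched as follows. The num_people values are invented, and the limits use the common 1.5 times IQR rule, which is assumed to match what the original script did.

```r
# Toy data (assumed values); 20 is an obvious outlier.
df <- data.frame(num_people = c(2, 3, 3, 4, 4, 4, 5, 5, 6, 20))

# 1st and 3rd quartiles and the interquartile range.
q   <- unname(quantile(df$num_people, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Limits at 1.5 * IQR below the 1st and above the 3rd quartile.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Keep only the observations strictly inside the limits.
No_outliers1 <- subset(df, num_people > lower & num_people < upper)

# The boxplot of the filtered data should no longer flag any points.
boxplot(No_outliers1$num_people)
boxplot.stats(No_outliers1$num_people)$out   # numeric(0)
```

Filtering can expose new outliers relative to the narrower distribution, so it is worth re-checking the boxplot after each pass.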
The same procedure is followed for removing outliers from housearea, ave_monthly_income
and amount_paid.
The box plots of the variables before treatment are listed below. They show the outliers
that we removed with the procedure described above:
1. House area:
4. Number of people
5. Number of Rooms
6. Number of children
SPLITTING THE DATA
Now that we have removed the outliers and cleaned the data, we split the data into a training set
and a test set. The training set is 70% of the cleaned data and the test set is the remaining
30%.
We build the model on the training data and then evaluate it on the test data.
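One common way to make such a split is random sampling of row indices. The sketch below uses simulated data and an arbitrary seed; the report does not say how its split was drawn.

```r
set.seed(123)   # assumed seed, only to make the example reproducible

# Simulated stand-in for the cleaned data.
n  <- 100
df <- data.frame(
  housearea   = runif(n, 500, 1200),
  amount_paid = runif(n, 300, 900)
)

# Draw 70% of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```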
AIC-based model selection indicates that only num_rooms needs to be removed from the model.
We therefore build our model on the remaining independent variables and exclude num_rooms.
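A possible form of this selection step is backward elimination with step() from base R, which compares candidate models by AIC. The data below are simulated with only a few of the variables, so which terms are dropped here says nothing about the real dataset.

```r
set.seed(1)   # assumed seed for reproducibility

# Simulated training data; num_rooms has no effect on the response by design.
n <- 200
train <- data.frame(
  num_rooms  = sample(1:5, n, replace = TRUE),
  num_people = sample(1:8, n, replace = TRUE),
  housearea  = runif(n, 500, 1200),
  is_ac      = rbinom(n, 1, 0.4)
)
train$amount_paid <- 100 + 40 * train$num_people + 0.3 * train$housearea +
  80 * train$is_ac + rnorm(n, sd = 30)

# Full model, then backward elimination guided by AIC.
full_model <- lm(amount_paid ~ ., data = train)
step_model <- step(full_model, direction = "backward", trace = 0)

formula(step_model)                  # terms retained by the selection
c(AIC(full_model), AIC(step_model))  # the selected model has the lower or equal AIC
```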
We check the p-value of each independent variable and of the overall model. All of them are
less than 0.05, so the model is significant and each independent variable retained is significant
at the 5% level (95% confidence).
We check for multicollinearity: the variance inflation factor is around 1 for every variable, so multicollinearity is not a concern.
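The sketch below shows how the coefficient p-values can be read from the model summary and how a variance inflation factor can be computed by hand as 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others. The data are simulated, and the hand computation is an assumption: the original script may well have used a package function such as vif from the car package.

```r
set.seed(2)   # assumed seed

# Simulated training data with three independent predictors.
n <- 200
train <- data.frame(
  num_people = sample(1:8, n, replace = TRUE),
  housearea  = runif(n, 500, 1200),
  is_ac      = rbinom(n, 1, 0.4)
)
train$amount_paid <- 100 + 40 * train$num_people + 0.3 * train$housearea +
  80 * train$is_ac + rnorm(n, sd = 30)

model <- lm(amount_paid ~ num_people + housearea + is_ac, data = train)

# p-value of each coefficient (column "Pr(>|t|)" of the summary table).
pvals <- summary(model)$coefficients[, "Pr(>|t|)"]
pvals

# VIF of each predictor: regress it on the other predictors.
predictors <- c("num_people", "housearea", "is_ac")
vif_values <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = train))$r.squared
  1 / (1 - r2)
})
vif_values   # values close to 1 indicate little multicollinearity
```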
RESIDUAL ANALYSIS
We analyze the residuals against the LINE assumptions, i.e. Linearity, Independence of residuals,
Normality, and Equal variance. We use the plot function on the MLR model to produce the diagnostic plots.
The model does not look completely perfect: the red line in the residual plot deviates a bit, so
slight corrections are needed. We next check normality.
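Calling plot() on a fitted lm object produces the standard diagnostic plots. A minimal sketch on simulated data, where the model and data are placeholders for the report's MLR model:

```r
set.seed(3)   # assumed seed

# Simulated data and a simple linear model as a stand-in.
n <- 150
train <- data.frame(housearea = runif(n, 500, 1200))
train$amount_paid <- 200 + 0.4 * train$housearea + rnorm(n, sd = 25)
model <- lm(amount_paid ~ housearea, data = train)

# Four default diagnostic plots on one page: residuals vs fitted,
# normal Q-Q, scale-location, and residuals vs leverage.
old_par <- par(mfrow = c(2, 2))
plot(model)
par(old_par)
```

In the residuals vs fitted panel, the red smoothing line should stay close to the horizontal zero line if the linearity assumption holds.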
TESTS
The p-value is smaller than alpha, so we reject H0 and infer that the residuals are not normally distributed.
The same inference can be drawn from a second test.
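The report does not name the normality tests it used. As one example, the Shapiro-Wilk test from base R can be applied to the residuals. The data below are simulated with deliberately skewed errors so that the test rejects normality.

```r
set.seed(4)   # assumed seed

# Simulated data with skewed (exponential) errors.
n <- 200
x <- runif(n, 500, 1200)
y <- 200 + 0.4 * x + (rexp(n, rate = 1 / 30) - 30)
model <- lm(y ~ x)

# H0: the residuals come from a normal distribution.
sw <- shapiro.test(residuals(model))
sw$p.value   # a value below 0.05 leads to rejecting H0
```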
The p-value is greater than 0.05, so we fail to reject H0: the residuals are
independent and there is no evidence of autocorrelation.
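A usual check for autocorrelation in regression residuals is the Durbin-Watson test, which is assumed here because the report does not name the test. Base R has no built-in function for it, so the sketch computes only the statistic by hand; a p-value would need a package such as lmtest (dwtest).

```r
set.seed(5)   # assumed seed

# Simulated data with independent errors.
n <- 200
x <- runif(n, 500, 1200)
y <- 200 + 0.4 * x + rnorm(n, sd = 25)
model <- lm(y ~ x)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the sum of squared residuals.
e  <- residuals(model)
dw <- sum(diff(e)^2) / sum(e^2)
dw   # values near 2 suggest no first-order autocorrelation
```

The statistic ranges from 0 to 4; values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.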
The p-value is greater than 0.05, so we fail to reject H0 and conclude that the variance of the
residuals is constant. The Breusch-Pagan (BP) test suggests the same.
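To make the idea behind the Breusch-Pagan test concrete, the sketch below computes its studentized (Koenker) form by hand: the squared residuals are regressed on the predictors, and n times the R^2 of that auxiliary regression is compared with a chi-squared distribution. The data are simulated, and in practice a packaged implementation such as bptest from lmtest would normally be used.

```r
set.seed(6)   # assumed seed

# Simulated data with constant error variance.
n <- 200
train <- data.frame(
  num_people = sample(1:8, n, replace = TRUE),
  housearea  = runif(n, 500, 1200)
)
train$amount_paid <- 100 + 40 * train$num_people + 0.3 * train$housearea +
  rnorm(n, sd = 30)
model <- lm(amount_paid ~ num_people + housearea, data = train)

# Auxiliary regression of the squared residuals on the same predictors.
train$e2 <- residuals(model)^2
aux <- lm(e2 ~ num_people + housearea, data = train)

# Test statistic n * R^2, chi-squared with one df per predictor.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(statistic = bp_stat, df = bp_df, p_value = bp_p)
```

H0 is that the residual variance is constant, so a p-value above 0.05 gives no evidence of heteroscedasticity.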
Next we run the outlier test, which suggests that there are outliers among the residuals.
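The outlier test is not named in the report. One common choice is a Bonferroni-adjusted test on the studentized residuals (the approach taken by outlierTest in the car package), sketched here with base R only on simulated data that contain one planted outlier.

```r
set.seed(7)   # assumed seed

# Simulated data; observation 10 is shifted far from the regression line.
n <- 100
x <- runif(n, 0, 50)
y <- 50 + 2 * x + rnorm(n, sd = 5)
y[10] <- y[10] + 100
model <- lm(y ~ x)

# Externally studentized residuals and their two-sided t-test p-values.
rs    <- rstudent(model)
p_raw <- 2 * pt(abs(rs), df = df.residual(model) - 1, lower.tail = FALSE)

# Bonferroni adjustment for testing all n residuals at once.
p_bonf <- pmin(1, p_raw * length(rs))

worst <- unname(which.max(abs(rs)))
worst           # index of the most extreme residual
p_bonf[worst]   # adjusted p-value below 0.05 flags it as an outlier
```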
Now, we will predict the values for test data using our model.
We then check the accuracy of our model, which comes out at about 90%.
Finally, we plot the predicted values against the actual values in the test
data.
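A sketch of the prediction and evaluation steps follows. The report does not say how accuracy was measured, so the definition used here (1 minus the mean absolute percentage error) is an assumption, as are the simulated data and the base-graphics plot.

```r
set.seed(8)   # assumed seed

# Simulated data, split 70/30 into training and test sets.
n <- 200
df <- data.frame(
  num_people = sample(1:8, n, replace = TRUE),
  housearea  = runif(n, 500, 1200)
)
df$amount_paid <- 100 + 40 * df$num_people + 0.3 * df$housearea +
  rnorm(n, sd = 30)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Fit on the training data, predict on the test data.
model <- lm(amount_paid ~ num_people + housearea, data = train)
pred  <- predict(model, newdata = test)

# Assumed accuracy measure: 1 - MAPE.
mape     <- mean(abs(test$amount_paid - pred) / abs(test$amount_paid))
accuracy <- 1 - mape
accuracy

# Predicted vs actual values; points near the red line are well predicted.
plot(test$amount_paid, pred,
     xlab = "Actual amount_paid", ylab = "Predicted amount_paid")
abline(0, 1, col = "red")
```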