Predictive Analytics Group Assignment

The document summarizes the steps taken to build a predictive model for electricity bill amount using various variables. Key steps include: 1. Cleaning the data by removing outliers, negative values and imputing missing values. 2. Splitting the data into 70% train and 30% test sets. 3. Building the model on the train set and checking significance of variables and multicollinearity. 4. Analyzing residuals to check model assumptions and validating the model fits the data well. 5. Predicting values for the test set and checking the model accuracy is 90%.

Uploaded by

Namit Baser

PREDICTIVE ANALYTICS GROUP ASSIGNMENT

Submitted By: Group No. 191408

Submitted To: Prof Chetan Jhaveri

Details of Variables

No.  Attribute           Description                                           Type     Values                              Variable role
1    num_rooms           No. of rooms in the house                             Integer  -                                   Independent
2    num_people          No. of people staying in the house                    Integer  -                                   Independent
3    housearea           Area of the house (sq ft)                             Float    -                                   Independent
4    is_ac               Whether an air conditioner is installed in the house  Integer  0 or 1 (binary): 0 = No, 1 = Yes    Independent
5    is_tv               Whether a television is installed in the house        Integer  0 or 1 (binary): 0 = No, 1 = Yes    Independent
6    is_flat             Whether the house is a flat                           Integer  0 or 1 (binary): 0 = No, 1 = Yes    Independent
7    ave_monthly_income  Average monthly income of people staying              Float    -                                   Independent
8    num_children        Number of children in the house                       Integer  -                                   Independent
9    is_urban            Whether the house is situated in an urban area        Integer  0 or 1 (binary): 0 = No, 1 = Yes    Independent
10   amount_paid         Amount paid towards electricity bill                  Float    -                                   Dependent

Explanation of R Script

• Set the working directory and load the installed package (ggplot2), which is used later.

• Import the data and view the dataset. summary(Bill_Prediction) shows one NA in ave_monthly_income (average monthly income of the people staying) and one NA in amount_paid (amount paid towards the electricity bill).

• Trace missing values with is.na, and look for negative values, which make no sense for these variables. any(is.na(Bill_Prediction)) returns TRUE, so the dataset contains missing values; sum(is.na(Bill_Prediction)) returns 2, so there are two missing values in the whole dataset.

• Apply is.na to each variable to locate the two missing values, i.e. the exact row and variable in which each one occurs.

• num_people and num_rooms contained negative values. These counts cannot be negative, so we removed the rows containing them.
• One value is missing (NA) in each of ave_monthly_income and amount_paid, so we impute each NA with the mean of its column.
• The column means, ignoring NA, are computed as follows:

Avg_avgmonthlyincome <- mean(Bill_Prediction3$ave_monthly_income, na.rm = TRUE)

Avg_amountpaid <- mean(Bill_Prediction3$amount_paid, na.rm = TRUE)

We then check again for missing values with is.na, which now returns FALSE: no missing values remain, because each NA has been replaced by its column mean.
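
The cleaning steps described so far can be sketched as follows. This is only an illustration under stated assumptions: the toy data frame and its values are made up, and the use of subset() to produce Bill_Prediction3 is our guess at the intermediate step; only the object names and the is.na and mean calls come from the report.

```r
# Hypothetical toy data standing in for Bill_Prediction (values are made up).
Bill_Prediction <- data.frame(
  num_rooms          = c(3, 2, -1, 4, 3),
  num_people         = c(4, -2, 3, 5, 2),
  ave_monthly_income = c(1000, 2000, 1500, NA, 3000),
  amount_paid        = c(50, 80, 60, 90, NA)
)

any(is.na(Bill_Prediction))                    # is anything missing?
sum(is.na(Bill_Prediction))                    # how many values are missing?
which(is.na(Bill_Prediction), arr.ind = TRUE)  # row and column of each NA

# Assumed step: drop rows with a negative room or person count.
Bill_Prediction3 <- subset(Bill_Prediction, num_rooms >= 0 & num_people >= 0)

# Column means ignoring NA, then written into the NA positions.
Avg_avgmonthlyincome <- mean(Bill_Prediction3$ave_monthly_income, na.rm = TRUE)
Avg_amountpaid       <- mean(Bill_Prediction3$amount_paid, na.rm = TRUE)
Bill_Prediction3$ave_monthly_income[is.na(Bill_Prediction3$ave_monthly_income)] <- Avg_avgmonthlyincome
Bill_Prediction3$amount_paid[is.na(Bill_Prediction3$amount_paid)] <- Avg_amountpaid

any(is.na(Bill_Prediction3))                   # re-check after imputation
```
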
• With the missing and negative values handled, the next step is to check the data for outliers. We use boxplots for the numeric variables and tables for the categorical variables.

• From the boxplots, we find outliers in num_people, housearea, ave_monthly_income and amount_paid. These outliers are removed from the dataset next.

• We start by removing the outliers from num_people. First we compute the quartiles. From the 1st and 3rd quartiles we obtain the interquartile range (IQR). We then multiply the IQR by 1.5 to arrive at the range for num_people (range_Numpeople), i.e. how far beyond the quartiles a value may lie before it counts as an outlier.

• Using this range, we compute the lower and upper limits of the subset so that the outliers are excluded:

quartile_Numpeople[1] - range_Numpeople    # lower limit

quartile_Numpeople[2] + range_Numpeople    # upper limit

The subset No_outliers1 therefore keeps only the rows whose num_people is greater than the lower limit and less than the upper limit. The box and whiskers of the boxplot lie inside these limits, while the outliers fall outside them.
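
A minimal sketch of this filter for num_people is given below. The data values are invented, the names lower_Numpeople and upper_Numpeople are ours, and we assume the quartiles were requested with probs = c(0.25, 0.75) so that indices [1] and [2] are the 1st and 3rd quartiles.

```r
# Hypothetical data: the value 15 is an obvious outlier.
Bill_Prediction3 <- data.frame(num_people = c(2, 3, 3, 4, 4, 5, 5, 6, 15))

# 1st and 3rd quartiles, so [1] is Q1 and [2] is Q3.
quartile_Numpeople <- quantile(Bill_Prediction3$num_people, probs = c(0.25, 0.75))

# 1.5 times the interquartile range.
range_Numpeople <- 1.5 * IQR(Bill_Prediction3$num_people)

lower_Numpeople <- quartile_Numpeople[1] - range_Numpeople
upper_Numpeople <- quartile_Numpeople[2] + range_Numpeople

# Keep only the rows strictly inside the limits.
No_outliers1 <- subset(Bill_Prediction3,
                       num_people > lower_Numpeople & num_people < upper_Numpeople)
```

The same pattern can be repeated for housearea, ave_monthly_income and amount_paid by swapping the column name.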

• We use View(No_outliers1) to inspect the subset and boxplot(No_outliers1$num_people) to draw the boxplot of the filtered data, which shows no outliers.

• The same procedure is followed to remove the outliers from housearea, ave_monthly_income and amount_paid.

Box plots of all variables before treatment are listed below; they show the outliers that we subsequently removed:

1. House area
2. Average monthly income
3. Amount paid
4. Number of people
5. Number of rooms
6. Number of children

[Box plot figures not reproduced in this text.]

SPLITTING THE DATA

Now that the outliers are removed and the data is clean, we split it into train and test sets. The train set is 70% of the data and the test set is the remaining 30%.

We build the model on the train data and then test it on the test data.
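
One way to make such a 70/30 split in base R is sketched here. The data frame No_outliers, its values and the seed are illustrative assumptions; the report does not show which function was used for the split.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Hypothetical cleaned data with 100 rows.
No_outliers <- data.frame(housearea = runif(100, 500, 1500))
No_outliers$amount_paid <- 0.5 * No_outliers$housearea + rnorm(100, sd = 20)

n <- nrow(No_outliers)
train_idx <- sample(seq_len(n), size = round(0.7 * n))  # 70% of the row indices

train <- No_outliers[train_idx, ]   # rows used to fit the model
test  <- No_outliers[-train_idx, ]  # held-out rows used to evaluate it
```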

Model selection by AIC indicates that only num_rooms should be removed from the model.

We therefore build the model on the remaining independent variables, excluding num_rooms.
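
As an illustration of AIC-based selection, the sketch below uses backward elimination with base R's step(). We assume this is close to what the "aic model" did; the report does not show the call. The data are simulated so that num_rooms has no real effect, and whether step() actually drops it depends on the random draw.

```r
set.seed(1)
n <- 200
train <- data.frame(
  num_rooms  = sample(1:5, n, replace = TRUE),
  num_people = sample(1:8, n, replace = TRUE),
  housearea  = runif(n, 500, 1500),
  is_ac      = rbinom(n, 1, 0.5)
)
# Simulated response with no num_rooms effect (an assumption of this toy example).
train$amount_paid <- 100 + 20 * train$num_people + 0.3 * train$housearea +
  50 * train$is_ac + rnorm(n, sd = 10)

full_model <- lm(amount_paid ~ ., data = train)

# Backward elimination: drop a term whenever doing so lowers the AIC.
aic_model <- step(full_model, direction = "backward", trace = 0)

formula(aic_model)  # the terms that survived selection
```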

We check the p-value of the overall model and of each independent variable. All of them are less than 0.05, so the model as a whole is significant and each retained independent variable is significant at the 5% level (95% confidence).

We then check for multicollinearity. The measure is around 1 for every variable (a variance inflation factor near 1 indicates little collinearity), so we can proceed.
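
The multicollinearity check can be sketched by computing variance inflation factors by hand. We assume here that the measure "around 1" is the VIF; the predictors below are simulated independently, so their VIFs should also come out close to 1.

```r
set.seed(2)
n <- 200
train <- data.frame(
  num_people = sample(1:8, n, replace = TRUE),
  housearea  = runif(n, 500, 1500),
  is_ac      = rbinom(n, 1, 0.5)
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(train), function(v) {
  others <- setdiff(names(train), v)
  r2 <- summary(lm(reformulate(others, response = v), data = train))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 3)
```

Values well above 1 (a common rule of thumb is above 5 or 10) would point to a predictor that is largely explained by the others.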

RESIDUAL ANALYSIS

We analyze the residuals against the LINE assumptions, i.e. Linearity, Independence of residuals, Normality, and Equal variance. We use the plot function on the MLR model to produce the diagnostic plots. The model does not look perfect, because the red line in the plots deviates a bit, so slight corrections are needed. We next check normality.

TESTS

We apply tests to check normality of the residuals.

The p-value is smaller than alpha, so we reject H0 and infer that the residuals are not normally distributed. A second test leads to the same inference.
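
A normality test on the residuals might look like the sketch below. The report does not name the tests it used; the Shapiro-Wilk test from base R is our assumption, and the data are simulated with skewed errors so that the residuals are deliberately non-normal.

```r
set.seed(3)
x <- runif(150, 0, 10)
y <- 5 + 2 * x + rexp(150, rate = 0.5)  # skewed errors by construction
MLR <- lm(y ~ x)

# H0: the residuals come from a normal distribution.
sw <- shapiro.test(residuals(MLR))
sw$p.value

# Reject H0 when the p-value is below alpha = 0.05.
reject_normality <- sw$p.value < 0.05
```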

We check for autocorrelation.

The p-value is greater than 0.05, so we fail to reject H0: the residuals are independent and no autocorrelation exists.
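
The autocorrelation check can be sketched in base R as follows. The report does not name its test; we assume a Durbin-Watson style check, compute the statistic by hand, and use the Ljung-Box test from the stats package as a stand-in for the p-value. The data are simulated with independent errors.

```r
set.seed(4)
x <- runif(200, 0, 10)
y <- 1 + 3 * x + rnorm(200)  # independent errors by construction
MLR <- lm(y ~ x)
e <- residuals(MLR)

# Durbin-Watson statistic: values near 2 indicate no lag-1 autocorrelation.
dw <- sum(diff(e)^2) / sum(e^2)

# Ljung-Box test at lag 1; H0: no autocorrelation in the residuals.
lb <- Box.test(e, lag = 1, type = "Ljung-Box")

c(DW = dw, p_value = lb$p.value)
```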

We check the equal variance assumption.

We also apply tests to validate the equal variance assumption.

The p-value is greater than 0.05, so we fail to reject H0: the variance of the residuals is constant. The Breusch-Pagan (BP) test suggests the same.
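
The Breusch-Pagan test mentioned above can be computed by hand, as in this sketch of its studentized form. The data and variable names are simulated assumptions with constant error variance; the report's own call is not shown.

```r
set.seed(5)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - x2 + rnorm(n)  # constant error variance by construction
MLR <- lm(y ~ x1 + x2)

# Auxiliary regression of the squared residuals on the predictors.
aux <- lm(residuals(MLR)^2 ~ x1 + x2)

# LM statistic n * R^2, approximately chi-square under H0 (equal variance),
# with degrees of freedom equal to the number of predictors (2 here).
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```

A p-value above 0.05 means there is no evidence against equal variance.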

Next we run the outlier test. It suggests that outliers exist among the residuals.

However, the boxplot looks good.

We also check for influential points using Cook's distance.

We can see that there are some influential points.
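
Influential points can be flagged with cooks.distance() from base R, as sketched below. The data are simulated with one deliberately extreme observation, and the 4/n cutoff is a common rule of thumb that we assume here; the report does not state its threshold.

```r
set.seed(6)
x <- c(runif(49, 0, 10), 30)               # the last point has extreme leverage
y <- c(2 + 0.5 * x[1:49] + rnorm(49), 60)  # and lies far above the trend
MLR <- lm(y ~ x)

cd <- cooks.distance(MLR)

# Flag observations whose Cook's distance exceeds 4/n (rule of thumb, not a formal test).
influential <- which(cd > 4 / length(cd))
influential
```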

Now we predict the values for the test data using our model.

We check the accuracy of the model, which comes out at 90%.

Finally, we plot the predicted values against the actual values in the test data.
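
The prediction step can be sketched as below. The data are simulated, and accuracy is taken as 1 minus the mean absolute percentage error (MAPE); that metric is our assumption, since the report does not say how its 90% figure was computed.

```r
set.seed(7)
n <- 200
dat <- data.frame(
  housearea  = runif(n, 500, 1500),
  num_people = sample(1:8, n, replace = TRUE)
)
dat$amount_paid <- 100 + 0.3 * dat$housearea + 20 * dat$num_people + rnorm(n, sd = 25)

# 70/30 split, then fit on train and predict on test.
idx   <- sample(seq_len(n), size = round(0.7 * n))
train <- dat[idx, ]
test  <- dat[-idx, ]

MLR  <- lm(amount_paid ~ housearea + num_people, data = train)
pred <- predict(MLR, newdata = test)

# Accuracy defined here as 1 - MAPE (an assumed metric, not the report's).
mape     <- mean(abs(test$amount_paid - pred) / abs(test$amount_paid))
accuracy <- 1 - mape
accuracy
```

Calling plot(test$amount_paid, pred) followed by abline(0, 1) then shows predicted against actual values; points close to the line are well predicted.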
