AMTA – FINAL EXAMS

This document provides R code to analyze Toyota Corolla data and model Price using regression. It covers: 1) checking for missing data and visualizing relationships between Price and variables such as KM and Age; 2) splitting the data into training and test sets and building a linear regression model to predict Price from variables such as Age, KM, and Fuel_Type; 3) evaluating the model on the training and test sets with measures such as RMSE and R-squared; and 4) using stepwise regression to improve the model by removing insignificant variables.

1. Toyota car data

Code:

# Load the ToyotaCorolla.csv

df = read.csv(file.choose())

# Missing Values

# First, check whether any values are missing

library(randomForest)

library(missForest)

sapply(df, function(x) sum(is.na(x))) # count of missing values in each column

# From the output we see that there are no missing values in the data set,
# so no imputation is needed.
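
Had any values been missing, they could have been imputed with missForest, which is loaded above. A minimal, hypothetical sketch (not needed here, since the data set is complete; character columns would first need converting to factors, and high-cardinality columns such as a model-name column dropped):

# Hypothetical sketch only -- this data set has no missing values
df.imp <- missForest(df)
df <- df.imp$ximp   # data frame with imputed values
df.imp$OOBerror     # out-of-bag estimate of the imputation error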

attach(df)

# Load the libraries

library(caret)

library(e1071)

library(foreach)

library(ggplot2)

library(ISLR)

i) Scatterplot
Price vs Kilometres (KM)
Code:
# Price vs KM
plot(Price ~ KM)

Interpretation

• Old cars that have travelled fewer kilometres are priced high, whereas old cars that have travelled more kilometres are priced low.

Price vs Age

Code:

# Price vs Age

plot(Price ~ Age_08_04)
Interpretation

• As the age of the car increases, its price decreases, and vice versa.
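
To quantify what the two scatterplots show, the correlations can be computed; both should come out negative (a quick sketch):

# correlation of Price with KM and with Age
cor(df$Price, df$KM)
cor(df$Price, df$Age_08_04)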

ii) Multicollinearity
Code:
library(car)
# vif() is applied to a fitted model, not to the data frame itself
vif(lm(Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Automatic + CC + Doors + Quarterly_Tax + Weight, data = df))
# As a rule of thumb, the variance inflation factor should be below 3 for every predictor
# Under multicollinearity the overall F-test can be significant while the
# individual coefficients appear insignificant
# A high VIF means a predictor is strongly correlated with the other predictors
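
An additional quick check is the correlation matrix of the numeric predictors; a high pairwise correlation between two predictors hints at multicollinearity. A minimal sketch, assuming the standard ToyotaCorolla.csv column names:

# pairwise correlations among the numeric predictors, rounded for readability
num.vars <- c("Age_08_04", "KM", "HP", "CC", "Doors", "Quarterly_Tax", "Weight")
round(cor(df[, num.vars]), 2)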
iii) Split Data Set
Code:

library(caret)
set.seed(123) # assumed seed, for a reproducible split
split <- createDataPartition(Price, p = .30, list = F)
train = df[-split, ] # 70% of rows, used to fit the model
test = df[ split, ]  # 30% of rows, held out for validation

# sanity check: the split should roughly preserve the average Price
mean(train$Price)
mean(test$Price)
mean(Price)
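
The split sizes can be checked as well (a quick sketch):

# about 70% of rows should land in train and 30% in test
nrow(train)
nrow(test)
nrow(test) / nrow(df) # should be close to 0.30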

iv) Regression model

Code:
# 1.4 Predict the price based on Age, KM, Fuel_Type, HP, Met_Color,
# Automatic, CC, Doors, Quarterly_Tax, Weight

lmtrain = lm(Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Automatic + CC + Doors + Quarterly_Tax + Weight, data = train)

v) Significance
Code:
summary(lmtrain)
# Age_08_04, Fuel_Type, KM, HP, Quarterly_Tax and Weight are the most
# significant variables for predicting Price
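
The p-values can also be pulled directly out of the summary (a quick sketch):

# estimates, standard errors, t-values and p-values as a matrix
round(summary(lmtrain)$coefficients, 4)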
vi) Performance Validation
Code:
# 1.6 Performance on the validation set
# 95% prediction intervals for the test data, based on the model fit on train

dfpredtest = predict(lmtrain, test, interval = "prediction", level = .95)


dfpredtest = data.frame(dfpredtest)
dfpredtest$price = test$Price

# returns 1 if y falls inside the interval [a, b), else 0
middle = function(a, b, y) {
  val = ifelse(y >= a & y < b, 1, 0)
  return(val)
}

dfpredtest$validate = mapply(middle, dfpredtest$lwr, dfpredtest$upr, dfpredtest$price)

(sum(dfpredtest$validate) / dim(dfpredtest)[1]) * 100 # % of test prices that fall inside their 95% prediction interval

# saving predicted values


trainPrice = predict(lmtrain, train)
testPrice = predict(lmtrain, test)

# model performance measures

# R square: train
cor(train$Price, trainPrice)^2
# R square: test
cor(test$Price, testPrice)^2
# RMSE
RMSE(train$Price, trainPrice)
RMSE(test$Price, testPrice)
# The lower the RMSE, the better the model; a large gap between train and test RMSE suggests overfitting.
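
caret's RMSE() implements the usual root-mean-squared-error formula; a quick manual check for the test set:

# should match RMSE(test$Price, testPrice) above
sqrt(mean((test$Price - testPrice)^2))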
vii) Stepwise Regression

Code:

# 1.7 Stepwise Regression

lm.fit = lm(Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Automatic + CC + Doors + Quarterly_Tax + Weight, data = df)

step.fit = train(Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Automatic + CC + Doors + Quarterly_Tax + Weight,
                 data = df, method = "lmStepAIC", trace = FALSE)

summary(step.fit)

# a variable is significant when its p-value is less than 0.05

#checking for AIC

AIC(lm.fit)

AIC(step.fit$finalModel)

# Lower AIC is better; therefore the stepwise AIC model is preferred

# AIC estimates the relative quality of a model: it is an information criterion
# that trades off goodness of fit against model complexity
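
To see which variables the stepwise procedure kept, the coefficients of the final model can be listed (a quick sketch):

# variables dropped by stepwise AIC selection will not appear here
coef(step.fit$finalModel)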


2) Decision Tree
Code:
# rpart and rpart.plot are needed for fitting and plotting the trees;
# df2 is the data set used for the tree models, with carat as the response
library(rpart)
library(rpart.plot)

# Building the base tree
basetree = rpart(carat ~ ., data = df2)
rpart.plot(basetree)

# identifying the pruning parameter: plotcp shows cross-validated error against cp

plotcp(basetree)

# post-pruning: cut the grown tree back at the chosen complexity parameter

prunedtree = prune(basetree, cp = .015)

rpart.plot(prunedtree)

# pre-pruning: restrict tree growth while fitting

preprunedtree <- rpart(carat ~ ., data = df2,
                       control = rpart.control(maxdepth = 2, minsplit = 50))


rpart.plot(preprunedtree)
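
A quick way to compare the three trees is their in-sample RMSE (a minimal sketch using caret's RMSE, assuming df2 and the trees fitted above):

# the pruned trees trade a little in-sample accuracy for simplicity
RMSE(predict(basetree, df2), df2$carat)
RMSE(predict(prunedtree, df2), df2$carat)
RMSE(predict(preprunedtree, df2), df2$carat)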

TOPIC 2: MODELLING IN CARET

ctrl.cv = trainControl(method = "cv", number = 6) # 6-fold cross-validation

tree.fit = train(carat ~ .,
                 data = df2,
                 method = "rpart2",
                 trControl = ctrl.cv,
                 tuneGrid = expand.grid(maxdepth = c(1:10)),
                 control = rpart.control(minsplit = 10, minbucket = 5))

tree.fit

summary(tree.fit)

plot(tree.fit$finalModel)
text(tree.fit$finalModel) # add split labels to the base-graphics plot

rpart.plot(tree.fit$finalModel) # nicer plot of the same tree

tree.fit$finalModel
# RMSE is lowest at maxdepth = 6

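caret can also report which predictors drive the tuned tree (a quick sketch):

# variable importance for the final rpart model
varImp(tree.fit)
plot(varImp(tree.fit))
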
2.2 Cluster Analysis

Code :

library(ISLR)

library(cluster)

library(caret)

library(leaps)

# Mixed Variable Cluster Analysis Using Partitioning Around Medoids

str(df2)

# computing the Gower distance (it handles mixed numeric and categorical variables)

gowerdist <- daisy(df2, metric = "gower")

class(gowerdist) # a 'dissimilarity' (distance) object

gower.mat = as.matrix(gowerdist)

class(gower.mat)

# cluster analysis using partitioning around medoids:

# making k clusters from the Gower distance matrix

pamsol = pam(gower.mat, diss = TRUE, k = 3)

pamsol

# pulling out the cluster centers

pamsol$medoids

# pulling out the clustering vector

pamsol$clustering

df2$clusmem = pamsol$clustering # new column recording each observation's cluster

head(df2)

# pulling out silinfo


pamsol$silinfo$avg.width

# average silhouette width for k = 2..7 clusters; higher is better
SIL = NULL

for (i in 2:7)
  SIL[i] = pam(gower.mat, diss = TRUE, k = i)$silinfo$avg.width

plot(SIL, type = "l")

# So, 2 is the optimal number of clusters from the plot.

# refitting with 2 clusters (same process)


pamsol = pam(gower.mat, diss = TRUE, k = 2)

pamsol$medoids

df2$clusmem = pamsol$clustering # update the cluster membership column

head(df2)

# to see which variables matter most for separating the clusters,
# regress cluster membership on all the other variables

model = glm(clusmem ~ ., data = df2)

summary(model)

varImp(model)
The medoids obtained are observations "58" and "263".
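
The clusters can also be profiled directly, for example by size and by the average carat within each cluster (a minimal sketch, assuming the clusmem column created above):

# cluster sizes
table(df2$clusmem)
# mean of the response within each cluster
aggregate(carat ~ clusmem, data = df2, FUN = mean)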
