Support Vector Machines - Problem - Statement
Support Vector Machines - Problem - Statement
Instructions
Please share your answers filled inline in the word document. Submit Python code and R
code files wherever applicable.
1. Business Problem
1.1. Objective
1.2. Constraints (if any)
2. Work on each feature of the dataset to create a data dictionary as displayed in the below
image:
2.1 Make a table as shown above and provide information about the features such as its Data type
and its relevance to the model building, if not relevant provide reasons and provide description of the
feature.
Using R and Python codes perform:
3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
3.2 Outlier Imputation
5. Model Building
5.1 Build the model on the scaled data (try multiple options)
5.2 Perform Support Vector Machines.
5.3 Train and Test the data and compare accuracies by Confusion Matrix and use
different Hyper Parameters
5.4 Briefly explain the model output in the documentation
6. Share the benefits/impact of the solution - how or in what way the business (client) gets
benefit from the solution provided
Note:
The assignment should be submitted in the following format:
R code
Python code
Code Modularization should be maintained
Documentation of the model building (elaborating on steps mentioned above)
Problem Statement: -
A construction firm wants to develop a suburban locality with new infrastructure but they are faced
with a challenge of incurring losses if they cannot sell the properties. To overcome this, they consult
an analytics firm and would like to get insights on how densely the area is populated and different
level of income group people reside. You as a Data Scientist perform Support Vector Machines
Algorithm on the given dataset and bring out informative insights and also comment on if its viable
for investment in that area.
R-code:
#####Support Vector Machines
summary(salarydata_test)
summary(salarydata_train)
table(salarytrain_predictions, salarydata_train$Salary)
Output:
> # Load the Dataset
> salarydata_test <- read.csv(file.choose(), stringsAsFactors = TRUE)
> salarydata_train <- read.csv(file.choose(), stringsAsFactors = TRUE)
> summary(salarydata_test)
age workclass education educationno
Min. :17.00 Federal-gov : 463 HS-grad :4943 Min. : 1.00
1st Qu.:28.00 Local-gov : 1033 Some-college:3221 1st Qu.: 9.00
Median :37.00 Private :11021 Bachelors :2526 Median :10.00
Mean :38.77 Self-emp-inc : 572 Masters : 887 Mean :10.11
3rd Qu.:48.00 Self-emp-not-inc: 1297 Assoc-voc : 652 3rd Qu.:13.00
Max. :90.00 State-gov : 667 11th : 571 Max. :16.00
Without-pay : 7 (Other) :2260
maritalstatus occupation relationship
Divorced :2083 Exec-managerial:1992 Husband :6203
Married-AF-spouse : 11 Craft-repair :1990 Not-in-family :3976
Married-civ-spouse :6990 Prof-specialty :1970 Other-relative: 460
Married-spouse-absent: 182 Sales :1824 Own-child :2160
Restarting R session...
> install.packages("kernlab")
WARNING: Rtools is required to build R packages but is not currently installed. Please
download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/HP/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/kernlab_0.9-29.zip'
Content type 'application/zip' length 2849843 bytes (2.7 MB)
downloaded 2.7 MB
Inferences: For salary train data and test data rbfdot kernel gives more accuracy than
vanilladot kernel.
Problem Statement: -
In California, annual forest fires can cause huge loss of wild life, human life and
property damage can skyrocket in billions. Local officials would like to predict the size
burned area in forest fires annually so that they can be better prepared in future
calamities.
Build a Support Vector Machines algorithm on the dataset and share your insights on it
in the documentation.
Note: - Size_ category is the output variable.
summary(forestfires)
table(forestfires_predictions, forestfires_test$size_category)
agreement <- forestfires_predictions == forestfires_test$size_category
table(agreement)
prop.table(table(agreement))
Output:
Python-code:
import pandas as pd
import numpy as np
forestfires = pd.read_csv("D:/360DigiTMG/Assignment/BlackBox
technique/Datasets_SVM/forestfires.csv")
forestfires.describe()
# kernel = linear
help(SVC)
model_linear = SVC(kernel = "linear")
model_linear.fit(train_X, train_y)
np.mean(pred_test_linear == test_y)
# kernel = rbf
model_rbf = SVC(kernel = "rbf")
model_rbf.fit(train_X, train_y)
pred_test_rbf = model_rbf.predict(test_X)
np.mean(pred_test_rbf==test_y)
Output:
import pandas as pd
import numpy as np
forestfires = pd.read_csv("D:/360DigiTMG/Assignment/BlackBox
technique/Datasets_SVM/forestfires.csv")
forestfires.describe()
Out[47]:
FFMC DMC DC ... monthnov monthoct monthsep
count 517.000000 517.000000 517.000000 ... 517.000000 517.000000 517.000000
mean 90.644681 110.872340 547.940039 ... 0.001934 0.029014 0.332689
std 5.520111 64.046482 248.066192 ... 0.043980 0.168007 0.471632
min 18.700000 1.100000 7.900000 ... 0.000000 0.000000 0.000000
25% 90.200000 68.600000 437.700000 ... 0.000000 0.000000 0.000000
50% 91.600000 108.300000 664.200000 ... 0.000000 0.000000 0.000000
75% 92.900000 142.400000 713.900000 ... 0.000000 0.000000 1.000000
max 96.200000 291.300000 860.600000 ... 1.000000 1.000000 1.000000
[8 rows x 28 columns]
model_linear.fit(train_X, train_y)
Out[56]: SVC(kernel='linear')
pred_test_linear = model_linear.predict(test_X)
np.mean(pred_test_linear == test_y)
Out[58]: 0.9711538461538461
# kernel = rbf
model_rbf.fit(train_X, train_y)
Out[61]: SVC()
pred_test_rbf = model_rbf.predict(test_X)
np.mean(pred_test_rbf==test_y)
Out[63]: 0.7596153846153846