Tutorial 3 – SVM (Khan Gene Data)
# Step 1: Get the Khan data from library(ISLR)
> library(ISLR)
** The data consists of a number of tissue samples corresponding to four distinct types of small round
blue cell tumors. For each tissue sample, 2308 gene expression measurements are available.
> names(Khan)
[1] "xtrain" "xtest" "ytrain" "ytest"
> dim(Khan$xtrain) #63 subjects
[1] 63 2308
> dim(Khan$xtest)
[1] 20 2308
> length(Khan$ytrain)
[1] 63
> length(Khan$ytest )
[1] 20
> table(Khan$ytrain )
1 2 3 4
8 23 12 20
> table(Khan$ytest )
1 2 3 4
3 6 6 5
# Step 2: Set up the training and testing data
** For illustration purposes, we can also plot only 2 of the variables to see their distribution in the training data.
> dat_train = data.frame(Khan$xtrain, y=as.factor(Khan$ytrain))#training data
> dat_test = data.frame(Khan$xtest, y=as.factor(Khan$ytest)) #testing data
> plot(dat_train[, c(1, 2)], col=as.factor(Khan$ytrain), pch=16, main="train data")
# Step 3: Build SVM model
** We will use a support vector approach to predict cancer subtype from the gene expression measurements. Make sure that the target variable y is stored as a factor (using as.factor).
> library(e1071) #to use svm function
> svm.fit <- svm(y~., dat_train, kernel='linear', cost=10)
> summary(svm.fit)
Call:
svm(formula = y ~ ., data = dat_train, kernel = "linear", cost = 10)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 10
Number of Support Vectors: 58
( 20 20 11 7 )
Number of Classes: 4
Levels:
1 2 3 4
** We build the SVM with a linear kernel and cost equal to 10. The summary of the model ‘svm.fit’ shows that the number of support vectors is 58: 20 from the first class, 20 from the second, 11 from the third and 7 from the fourth. Only 5 of the 63 training observations are therefore not support vectors. Such a large proportion of support vectors is expected here, because the number of variables (2308) greatly exceeds the number of observations, so almost every training point lies on or within the margin.
## Plot the model
> plot(svm.fit, dat_train, pch=16)
** Remarks: for this data we cannot plot the model, since the dimension of the data is too large. You can try plotting the model by fitting it to only two variables of the training data, as sketched below.
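** A minimal sketch of such a two-variable fit, assuming we simply take the first two gene expression columns (the names dat2 and svm.fit2 are hypothetical):
> dat2 <- data.frame(X1=Khan$xtrain[,1], X2=Khan$xtrain[,2], y=as.factor(Khan$ytrain))
> svm.fit2 <- svm(y~., data=dat2, kernel='linear', cost=10)
> plot(svm.fit2, dat2)  # with only two predictors, the classification plot can be drawn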
# Step 4: Evaluate the trained model through an error matrix
> table(svm.fit$fitted, dat_train$y)
1 2 3 4
1 8 0 0 0
2 0 23 0 0
3 0 0 12 0
4 0 0 0 20
** The confusion matrix shows there are no training errors. In fact, this is not surprising, because the large
number of variables relative to the number of observations implies that it is easy to find hyperplanes that
fully separate the classes.
We are most interested not in the support vector classifier’s performance on the training observations,
but rather its performance on the test observations.
# Step 5: Predict the test data using the model and evaluate the prediction
> pred_test = predict(svm.fit, newdata=dat_test)
> table(pred_test, dat_test$y)
pred_test 1 2 3 4
1 3 0 0 0
2 0 6 2 0
3 0 0 4 0
4 0 0 0 5
** The result shows that two observations from class 3 are misclassified into class 2 when using the model with cost=10; all other test observations are classified correctly.
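** As a quick check, the overall test error rate can be computed directly from pred_test:
> mean(pred_test != dat_test$y)  # proportion misclassified: 2 of the 20 test observations, i.e. 0.1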
# Step 6: Find the best possible model through the tune() function
We can perform cross-validation using tune() to select the best choice of cost for an SVM with a linear kernel.
> set.seed(1)
> tune.out=tune(svm, y~., data=dat_train, kernel="linear", ranges=list(cost=c(0.1,1,10,100,1000)))
> summary(tune.out)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
cost
0.1
- best performance: 0.01666667
- Detailed performance results:
cost error dispersion
1 1e-01 0.01666667 0.05270463
2 1e+00 0.01666667 0.05270463
3 1e+01 0.01666667 0.05270463
4 1e+02 0.01666667 0.05270463
5 1e+03 0.01666667 0.05270463
Based on the tuning results, the best parameter value is cost=0.1. However, the cross-validation error is identical for all of the costs considered, so the choice of cost makes little difference here.
We can view the test set predictions for this model by applying the predict() function to the test data, as sketched below.
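** A minimal sketch, using the best model stored by tune() (pred_best is a hypothetical name):
> best.mod <- tune.out$best.model
> pred_best <- predict(best.mod, newdata=dat_test)
> table(pred_best, dat_test$y)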
Tutorial 4 – Decision Tree (Carseats Data)
# Step 1: Read the Carseats data from the ISLR library
Data -> Library -> Carseats -> Execute
Let the variable ‘Sales’ be the target variable. Since it is numeric, we need to transform it into a categorical variable.
# Step 2: Transform ‘Sales’
Transform -> Recode -> KMeans -> Number: 2 -> Sales -> Execute
** By default, Rattle splits Sales into two categories: low sales [0, 7.64] and high sales (7.64, 16.3]. The new variable is named BK2_Sales. Set BK2_Sales as the target variable.
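** For reference, an approximately equivalent recode in plain R (a sketch only: Rattle derives the cut points from a one-dimensional k-means clustering, so the hand-coded breaks and the labels "low"/"high" below are an approximation):
> library(ISLR)
> data(Carseats)
> Carseats$BK2_Sales <- cut(Carseats$Sales, breaks=c(0, 7.64, 16.3), include.lowest=TRUE, labels=c("low", "high"))
> table(Carseats$BK2_Sales)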
# Step 3: EDA
Figure 1. Summary statistics of Carseats Data
Figure 1 shows the summary statistics of the Carseats data. The data consist of 11 variables, 8 of which are continuous while the remaining variables are categorical. The mean competitor price is 125.6, while the mean price the company charges for car seats at each site is slightly lower, at 117.2. The mean income of the customers is 67.1 thousand dollars. The variable BK2_Sales shows 146 observations in the low sales category and 134 observations in the high sales category.
Figure 2 illustrates boxplots of the distribution of Price by BK2_Sales. The blue boxplot shows that sales are high when the price of the car seats is low, while the red boxplot shows that sales are low when the price is high. Therefore, customers tend to buy more car seats when the price is low.
Figure 2. Boxplots of Price by BK2_Sales
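** A minimal plain-R sketch of this plot, continuing from the recode above (the colours are assumptions, chosen only to mirror the figure description):
> boxplot(Price ~ BK2_Sales, data=Carseats, col=c("red", "blue"), main="Price by BK2_Sales")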
# Step 4: Decision Tree model
Model->Tree->Execute
Figure 3. Summary of Decision Tree model
Figure 3 gives the output of the decision tree analysis in Rattle. In the first part, the output summarizes the decision at each node. The first row gives the decision at the root node, where the majority of observations fall in the low sales category [0, 7.64]; the proportions of observations in the low and high sales categories are given in brackets.
The root node, labelled 1, summarizes that, overall, 280 stores sold car seats and the 134 observations in the high sales category are misclassified into the low sales category [0, 7.64] at this node, so the proportion of observations in the low sales category is 52.1%. In the second part, the output lists the variables used in the decision tree, which are Advertising, Age, CompPrice, Income, Price and ShelveLoc. The last part of the output shows that the choice of 8 splits is appropriate, as the cross-validation error (xerror) starts to increase as the depth of the tree increases.
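** An approximately equivalent model in plain R using rpart (a sketch only: Rattle's default 70/15/15 partition of the 400 observations gives a 280-row training set, and the names train_idx, cs_train and tree.fit are hypothetical):
> library(rpart)
> set.seed(42)
> train_idx <- sample(nrow(Carseats), 280)  # mimic the 280-observation training partition
> cs_train <- Carseats[train_idx, ]
> tree.fit <- rpart(BK2_Sales ~ . - Sales, data=cs_train, method="class")
> printcp(tree.fit)               # complexity table, including the cross-validation error (xerror)
> plot(tree.fit); text(tree.fit)  # basic plot of the fitted tree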
Figure 4. Plot of the Decision tree
The decision rules can be described more easily using the decision tree depicted in Figure 4. Six variables are used to construct the tree, namely Advertising, Age, CompPrice, Income, Price and ShelveLoc. The tree has nine terminal nodes: four with the decision low sales [0, 7.64] and the other five with the decision high sales (7.64, 16.3]. From node 4, 86% of car seats with medium or bad ShelveLoc quality and a price of more than 125 are classified into the low sales category. In contrast, from node 7, only 3% of car seats, with good ShelveLoc quality and a price of less than 125, are classified into the high sales (7.64, 16.3] category.
In addition, from node 11, 8% of car seats with high advertising (more than 9), good ShelveLoc quality and a price of more than 125 are in the high sales category. Besides this, from node 99, 11% of car seats are in the high sales category when the customers are young with high incomes, the car seat price is reasonable (between 93 and 125), the competitor price is lower, and the ShelveLoc type is medium or bad. Therefore, we can conclude that the analysis suggests high sales of child car seats are obtained when the price is less than 125, the ShelveLoc quality is medium or bad and the competitor price is greater than 126. Young, high-income customers also promote high sales of child car seats.
# Step 5: Decision Tree model validation
Figure 5. Classification table
From the classification table in Figure 5, 7 observations are misclassified into the high sales category and 6 are misclassified into the low sales category, giving an error rate of 21.7%. The correct classification rate could be improved by considering other factors that influence the sales of child car seats.
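** A sketch of how such a classification table can be produced in plain R, assuming a held-out validation set cs_valid constructed in the same way as cs_train above (the names are hypothetical):
> pred_valid <- predict(tree.fit, newdata=cs_valid, type="class")
> table(pred_valid, cs_valid$BK2_Sales)  # rows: predicted class, columns: actual class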
# Step 6: Predict the classification of a new child car seats data set based on the built tree model
Evaluate -> Type: Score -> Model: Tree -> Data: CSV File (get the new data from the file)
** The results will be stored in a CSV file.
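** The plain-R equivalent is simply to predict on the new data and write the scores out (the file names below are hypothetical):
> new_dat <- read.csv("new_carseats.csv")  # hypothetical new data file
> new_dat$pred_BK2_Sales <- predict(tree.fit, newdata=new_dat, type="class")
> write.csv(new_dat, "new_carseats_scored.csv", row.names=FALSE)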
Tutorial 5 – Random Forest (Carseats Data)
# Step 1: Import the data, transform the data and explore the data. Refer to the steps shown in Tutorial 4.
# Step 2: Random forest model
Model->Forest->Execute
** By default, Rattle sets the number of trees to 500 and the number of variables considered at each split to 3.
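** An approximately equivalent model in plain R, continuing from the cs_train sketch in Tutorial 4 (a sketch only; rf.fit is a hypothetical name):
> library(randomForest)
> set.seed(42)
> rf.fit <- randomForest(BK2_Sales ~ . - Sales, data=cs_train, ntree=500, mtry=3, importance=TRUE)
> print(rf.fit)        # OOB error estimate and confusion matrix
> importance(rf.fit)   # variable importance table
> varImpPlot(rf.fit)   # variable importance plot
> plot(rf.fit)         # error rates against the number of trees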
** Summary output:
Figure 1. Output 1 from the random forest modelling on Child Car seats data
Figures 1 and 2 depict the summary output from the random forest modelling of the Child Car seats data, with 280 observations used to build the model. From the output, 500 trees are built from 500 random samples selected with replacement from the original data, with 3 randomly chosen variables considered when splitting each node of a tree. The OOB estimate of the error rate is 21.79%, which is fairly large but still acceptable; the lower the OOB estimate of the error rate, the higher the accuracy of the random forest model. The last table in Figure 1 shows a confusion matrix, from which the total number of errors is 30 + 31 = 61, giving an error rate of 61/280 = 21.79%.
From Figure 2, the AUC value of 78.16% indicates that the model is reasonably good. The table below it gives the variable importance results, which show how important each variable is in the random forest model. The first column lists the variables in the dataset. The second column is a measure of importance for observations whose target variable is “No” (low sales category), and the third column is the corresponding measure for observations whose target variable is “Yes” (high sales category). The fourth column (MeanDecreaseAccuracy) measures importance by the scaled average decrease in prediction accuracy when the variable is permuted, and the fifth column (MeanDecreaseGini) measures importance by the decrease in node impurity, as measured by the Gini index, when splitting on the variable. The higher the value in a column, the more important the variable. From the results, Price is the most important variable in terms of its contribution to accuracy, while Population is not important for prediction in this random forest model. Price also has the highest contribution to the mean decrease in Gini.
Figure 2. Output 2 from the random forest modelling on Child Car seats data
Figure 3. Variables Importance
Figure 3 shows the variable importance plot, which consists of 4 panels (MeanDecreaseAccuracy, MeanDecreaseGini, No, Yes). The vertical axes list the variables used in the model. The coloured lines represent an inverse measure of the importance of each variable: longer lines indicate less importance while shorter lines indicate more importance. Consistent with the table in Figure 2, Price is the most important variable in this random forest model.
Figure 4. OOB plot
Figure 4 shows the OOB plot, where the X-axis represents the number of trees used in the random forest model and the Y-axis represents the error rate. The black, red and green lines show the error rates for the out-of-bag (OOB) observations, the observations with target variable “No” and the observations with target variable “Yes”, respectively. The error rates decrease as the number of trees increases. From the plot, after about 300 trees there is little change in the error rate, so a random forest with 300 or more trees would be sufficient for this dataset.
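** If desired, the forest can be refitted with the smaller number of trees suggested by the plot (a sketch, continuing from rf.fit above):
> rf.small <- randomForest(BK2_Sales ~ . - Sales, data=cs_train, ntree=300, mtry=3)
> print(rf.small)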
Tutorial 6 – Association Analysis (Retail Data)
# Step 1: Import the data using RStudio
> dat <- read.csv("retaildatabelgiumA.csv", sep=",", header=TRUE)
# Transform the data into a readable format.
# Step 2: Read data through rattle
# Step 3: Association analysis model
Associate->Baskets->Execute
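** An approximately equivalent analysis in plain R with the arules package (a sketch only: it assumes the CSV holds one row per transaction/item pair, which is the layout Rattle's Baskets option expects, and the column names "ID" and "Item" are hypothetical):
> library(arules)
> trans <- read.transactions("retaildatabelgiumA.csv", format="single", sep=",", cols=c("ID", "Item"), header=TRUE)
> itemFrequencyPlot(trans, topN=10)  # relative item frequency plot
> rules <- apriori(trans, parameter=list(supp=0.01, conf=0.01))
> inspect(rules)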
# Get the relative frequency plot
Figure 1. Item frequency
For the association analysis, we set the minimum support to 0.01 and the minimum confidence to 0.01 in Rattle. Figure 1 shows the relative frequency of each item. From Figure 1, there are 4 items that meet our criteria, namely items 38, 39, 41 and 48. Item 39 has the highest frequency; the second highest frequency is item 48, followed by item 41 and item 38. This may be because item 39 fulfils a customer need on its own. The other items in the transactions, which are excluded from Figure 1, are not significant and hence are ignored in the interpretation of the analysis.
Figure 2. Output of association analysis rules
Figure 2 shows the computer output from the association analysis. The summary of the association rules includes several statistical measures, including support, confidence and lift. The data contain 100 transactions. Four rules are generated using the apriori algorithm with the specified parameter values of minimum confidence 0.28 and minimum support 0.110. The support of the rules lies in the range (0.11, 0.12) and the confidence of the rules lies in the range (0.28, 0.58). From the mean support value, a rule occurs on average in about 11% of transactions, so no item set dominates the transaction data and customers tend to buy a variety of the items sold in the shop. However, the confidence values suggest there are items that customers tend to buy together, which is investigated next.
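** The individual rules and additional interest measures such as leverage can be inspected in plain R (a sketch, continuing from the rules object in the arules sketch above):
> inspect(sort(rules, by="lift"))   # rules ordered by lift
> quality(rules)$leverage <- interestMeasure(rules, "leverage", transactions=trans)
> inspect(rules)                    # now also shows leverage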
Figure 3. Output of association analysis rules
Figure 3 shows the set of rules. Based on the first rule, we can see that 12% of all transactions contain both item 48 and item 39. When a customer buys item 48, then 39% of the time they will buy item 39 as well, which is about what would be expected from random buying by other customers. The lift value of approximately 1 indicates that purchases of item 48 and item 39 occur together about as often as they would if the two purchases were independent. However, the leverage of item 48 and item 39 is -0.0009, indicating a decrease of 0.0009 in the proportion of transactions in which they occur together compared with what would be expected under independence. This means item 48 and item 39 are slightly negatively correlated, that is, customers tend not to buy item 48 and item 39 together. Nevertheless, since both items are frequently purchased on their own, the owner can still order larger consignments of these items.