Tutorial 3 – SVM (Khan Gene Data)
# Step 1: Get the Khan data from library(ISLR)
> library(ISLR)
** The data consists of a number of tissue samples corresponding to four distinct types of small round
blue cell tumors. For each tissue sample, 2308 gene expression measurements are available.
> names(Khan)
[1] "xtrain" "xtest" "ytrain" "ytest"
> dim(Khan$xtrain) #63 subjects
[1] 63 2308
> dim(Khan$xtest)
[1] 20 2308
> length(Khan$ytrain)
[1] 63
> length(Khan$ytest )
[1] 20
> table(Khan$ytrain )
1 2 3 4
8 23 12 20
> table(Khan$ytest )
1 2 3 4
3 6 6 5
# Step 2: Set up the training and testing data
** For illustration purposes, we can also plot only 2 of the variables to see their distribution in the training data.
> dat_train = data.frame(Khan$xtrain, y=as.factor(Khan$ytrain))#training data
> dat_test = data.frame(Khan$xtest, y=as.factor(Khan$ytest)) #testing data
> plot(dat_train[, c(1, 2)], col=as.factor(Khan$ytrain), pch=16, main="train data")
# Step 3: Build SVM model
** We will use a support vector approach to predict cancer subtype from the gene expression measurements. Make sure that the target variable y is stored as a factor (using as.factor).
> library(e1071) #to use svm function
> svm.fit <- svm(y~., dat_train, kernel='linear', cost=10)
> summary(svm.fit)
Call:
svm(formula = y ~ ., data = dat_train, kernel = "linear", cost = 10)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 10
Number of Support Vectors: 58
( 20 20 11 7 )
Number of Classes: 4
Levels:
1 2 3 4
** We build the SVM with a linear kernel and cost equal to 10. The summary of the model ‘svm.fit’ shows that the number of support vectors is 58: 20 from the first class, 20 from the second, 11 from the third and 7 from the fourth. Only 5 of the 63 training observations are therefore not support vectors. Such a large proportion of support vectors is expected here, because the number of variables (2308) greatly exceeds the number of observations, so almost every training point lies on or within the margin.
## Plot the model
> plot(svm.fit, dat_train, pch=16)
** Remarks: for this data we cannot plot the model, since the dimension of the data is too large. You can try plotting the model by fitting it to only two variables of the training data, as sketched below.
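** A minimal sketch of such a two-variable fit, assuming we simply take the first two gene expression columns (the names dat2 and svm.fit2 are hypothetical):
> dat2 <- data.frame(X1=Khan$xtrain[,1], X2=Khan$xtrain[,2], y=as.factor(Khan$ytrain))
> svm.fit2 <- svm(y~., data=dat2, kernel='linear', cost=10)
> plot(svm.fit2, dat2)  # with only two predictors, the classification plot can be drawn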
# Step 4: Evaluate the trained model through an error matrix
> table(svm.fit$fitted, dat_train$y)
1 2 3 4
1 8 0 0 0
2 0 23 0 0
3 0 0 12 0
4 0 0 0 20
** The confusion matrix shows there are no training errors. In fact, this is not surprising, because the large
number of variables relative to the number of observations implies that it is easy to find hyperplanes that
fully separate the classes.
We are most interested not in the support vector classifier’s performance on the training observations,
but rather its performance on the test observations.
# Step 5: Predict the test data using the model and evaluate the prediction
> pred_test = predict(svm.fit, newdata=dat_test)
> table(pred_test, dat_test$y)
pred_test 1 2 3 4
1 3 0 0 0
2 0 6 2 0
3 0 0 4 0
4 0 0 0 5
** The result shows that two observations from class 3 are misclassified into class 2 when using the model with cost=10; all other test observations are classified correctly.
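** As a quick check, the overall test error rate can be computed directly from pred_test:
> mean(pred_test != dat_test$y)  # proportion misclassified: 2 of the 20 test observations, i.e. 0.1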
# Step 6: Find the best possible model through the tune() function
We can perform cross-validation using tune() to select the best choice of cost for an SVM with a linear kernel.
> set.seed(1)
> tune.out=tune(svm, y~., data=dat_train, kernel="linear", ranges=list(cost=c(0.1,1,10,100,1000)))
> summary(tune.out)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
cost
0.1
- best performance: 0.01666667
- Detailed performance results:
cost error dispersion
1 1e-01 0.01666667 0.05270463
2 1e+00 0.01666667 0.05270463
3 1e+01 0.01666667 0.05270463
4 1e+02 0.01666667 0.05270463
5 1e+03 0.01666667 0.05270463
Based on the tuning results, the best parameter value is cost=0.1. However, the cross-validation error is identical for all of the costs considered, so the choice of cost makes little difference here.
We can view the test set predictions for this model by applying the predict() function to the test data, as sketched below.
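** A minimal sketch, using the best model stored by tune() (pred_best is a hypothetical name):
> best.mod <- tune.out$best.model
> pred_best <- predict(best.mod, newdata=dat_test)
> table(pred_best, dat_test$y)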
Tutorial 4 – Decision Tree (Carseats Data)
# Step 1: Read the Carseats data from the ISLR library
Data -> Library -> Carseats -> Execute
Let the variable ‘Sales’ be the target variable. Since it is numeric, we need to transform it into a categorical variable.
# Step 2: Transform ‘Sales’
Transform -> Recode -> KMeans -> Number: 2 -> Sales -> Execute
** By default, Rattle splits Sales into two categories: low sales [0, 7.64] and high sales (7.64, 16.3]. The new variable is named BK2_Sales. Set BK2_Sales as the target variable.
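** For reference, an approximately equivalent recode in plain R (a sketch only: Rattle derives the cut points from a one-dimensional k-means clustering, so the hand-coded breaks and the labels "low"/"high" below are an approximation):
> library(ISLR)
> data(Carseats)
> Carseats$BK2_Sales <- cut(Carseats$Sales, breaks=c(0, 7.64, 16.3), include.lowest=TRUE, labels=c("low", "high"))
> table(Carseats$BK2_Sales)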
# Step 3: EDA
Figure 1. Summary statistics of Carseats Data
Figure 1 shows the summary statistics of the Carseats data. The data consist of 11 variables, 8 of which are continuous while the remaining variables are categorical. The mean competitor price is 125.6, while the mean price the company charges for car seats at each site is slightly lower, at 117.2. The mean income of the customers is 67.1 thousand dollars. The variable BK2_Sales shows 146 observations in the low sales category and 134 observations in the high sales category.
Figure 2 illustrates boxplots of the distribution of Price by BK2_Sales. The blue boxplot shows that sales are high when the price of the car seats is low, while the red boxplot shows that sales are low when the price is high. Therefore, customers tend to buy more car seats when the price is low.
Figure 2. Boxplots of Price by BK2_Sales
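** A minimal plain-R sketch of this plot, continuing from the recode above (the colours are assumptions, chosen only to mirror the figure description):
> boxplot(Price ~ BK2_Sales, data=Carseats, col=c("red", "blue"), main="Price by BK2_Sales")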
# Step 4: Decision Tree model
Model->Tree->Execute
Figure 3. Summary of Decision Tree model
Figure 3 gives the output of the decision tree analysis in Rattle. In the first part, the output summarizes the decision at each node. The first row gives the decision at the root node, where the majority of observations fall in the low sales category [0, 7.64]; the proportions of observations in the low and high sales categories are given in brackets.
The root node, labelled 1, summarizes that, overall, 280 stores sold car seats and the 134 observations in the high sales category are misclassified into the low sales category [0, 7.64] at this node, so the proportion of observations in the low sales category is 52.1%. In the second part, the output lists the variables used in the decision tree, which are Advertising, Age, CompPrice, Income, Price and ShelveLoc. The last part of the output shows that the choice of 8 splits is appropriate, as the cross-validation error (xerror) starts to increase as the depth of the tree increases.
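** An approximately equivalent model in plain R using rpart (a sketch only: Rattle's default 70/15/15 partition of the 400 observations gives a 280-row training set, and the names train_idx, cs_train and tree.fit are hypothetical):
> library(rpart)
> set.seed(42)
> train_idx <- sample(nrow(Carseats), 280)  # mimic the 280-observation training partition
> cs_train <- Carseats[train_idx, ]
> tree.fit <- rpart(BK2_Sales ~ . - Sales, data=cs_train, method="class")
> printcp(tree.fit)               # complexity table, including the cross-validation error (xerror)
> plot(tree.fit); text(tree.fit)  # basic plot of the fitted tree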
Figure 4. Plot of the Decision tree
The decision rules can be described more easily using the decision tree depicted in Figure 4. Six variables are used to construct the tree, namely Advertising, Age, CompPrice, Income, Price and ShelveLoc. The tree has nine terminal nodes: four with the decision low sales [0, 7.64] and the other five with the decision high sales (7.64, 16.3]. From node 4, 86% of car seats with medium or bad ShelveLoc quality and a price of more than 125 are classified into the low sales category. In contrast, from node 7, only 3% of car seats, with good ShelveLoc quality and a price of less than 125, are classified into the high sales (7.64, 16.3] category.
In addition, from node 11, 8% of car seats with high advertising (more than 9), good ShelveLoc quality and a price of more than 125 are in the high sales category. Besides this, from node 99, 11% of car seats are in the high sales category when the customers are young with high incomes, the car seat price is reasonable (between 93 and 125), the competitor price is lower, and the ShelveLoc type is medium or bad. Therefore, we can conclude that the analysis suggests high sales of child car seats are obtained when the price is less than 125, the ShelveLoc quality is medium or bad and the competitor price is greater than 126. Young, high-income customers also promote high sales of child car seats.
# Step 5: Decision Tree model validation
Figure 5. Classification table
From the classification table in Figure 5, 7 observations are misclassified into the high sales category and 6 are misclassified into the low sales category, giving an error rate of 21.7%. The correct classification rate could be improved by considering other factors that influence the sales of child car seats.
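** A sketch of how such a classification table can be produced in plain R, assuming a held-out validation set cs_valid constructed in the same way as cs_train above (the names are hypothetical):
> pred_valid <- predict(tree.fit, newdata=cs_valid, type="class")
> table(pred_valid, cs_valid$BK2_Sales)  # rows: predicted class, columns: actual class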
# Step 6: Predict the classification of a new child car seats data set based on the built tree model
Evaluate -> Type: Score -> Model: Tree -> Data: CSV File (get the new data from the file)
** The results will be stored in a CSV file.
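** The plain-R equivalent is simply to predict on the new data and write the scores out (the file names below are hypothetical):
> new_dat <- read.csv("new_carseats.csv")  # hypothetical new data file
> new_dat$pred_BK2_Sales <- predict(tree.fit, newdata=new_dat, type="class")
> write.csv(new_dat, "new_carseats_scored.csv", row.names=FALSE)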
Tutorial 5 – Random Forest (Carseats Data)
# Step 1: Import the data, transform the data and explore the data. Refer to the steps shown in Tutorial 4.
# Step 2: Random forest model
Model->Forest->Execute
** By default, Rattle sets the number of trees to 500 and the number of variables considered at each split to 3.
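** An approximately equivalent model in plain R, continuing from the cs_train sketch in Tutorial 4 (a sketch only; rf.fit is a hypothetical name):
> library(randomForest)
> set.seed(42)
> rf.fit <- randomForest(BK2_Sales ~ . - Sales, data=cs_train, ntree=500, mtry=3, importance=TRUE)
> print(rf.fit)        # OOB error estimate and confusion matrix
> importance(rf.fit)   # variable importance table
> varImpPlot(rf.fit)   # variable importance plot
> plot(rf.fit)         # error rates against the number of trees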
** Summary output:
Figure 1. Output 1 from the random forest modelling on Child Car seats data
Figures 1 and 2 depict the summary output from the random forest modelling of the Child Car seats data, with 280 observations used to build the model. From the output, 500 trees are built from 500 random samples selected with replacement from the original data, with 3 randomly chosen variables considered when splitting each node of a tree. The OOB estimate of the error rate is 21.79%, which is fairly large but still acceptable; the lower the OOB estimate of the error rate, the higher the accuracy of the random forest model. The last table in Figure 1 shows a confusion matrix, from which the total number of errors is 30 + 31 = 61, giving an error rate of 61/280 = 21.79%.
From Figure 2, the AUC value of 78.16% indicates that the model is reasonably good. The table below it gives the variable importance results, which show how important each variable is in the random forest model. The first column lists the variables in the dataset. The second column is a measure of importance for observations whose target variable is “No” (low sales category), and the third column is the corresponding measure for observations whose target variable is “Yes” (high sales category). The fourth column (MeanDecreaseAccuracy) measures importance by the scaled average decrease in prediction accuracy when the variable is permuted, and the fifth column (MeanDecreaseGini) measures importance by the decrease in node impurity, as measured by the Gini index, when splitting on the variable. The higher the value in a column, the more important the variable. From the results, Price is the most important variable in terms of its contribution to accuracy, while Population is not important for prediction in this random forest model. Price also has the highest contribution to the mean decrease in Gini.
Figure 2. Output 2 from the random forest modelling on Child Car seats data
Figure 3. Variables Importance
Figure 3 shows the variable importance plot, which consists of 4 panels (MeanDecreaseAccuracy, MeanDecreaseGini, No, Yes). The vertical axes list the variables used in the model. The coloured lines represent an inverse measure of the importance of each variable: longer lines indicate less importance while shorter lines indicate more importance. Consistent with the table in Figure 2, Price is the most important variable in this random forest model.
Figure 4. OOB plot
Figure 4 shows the OOB plot, where the X-axis represents the number of trees used in the random forest model and the Y-axis represents the error rate. The black, red and green lines show the error rates for the out-of-bag (OOB) observations, the observations with target variable “No” and the observations with target variable “Yes”, respectively. The error rates decrease as the number of trees increases. From the plot, after about 300 trees there is little change in the error rate, so a random forest with 300 or more trees would be sufficient for this dataset.
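** If desired, the forest can be refitted with the smaller number of trees suggested by the plot (a sketch, continuing from rf.fit above):
> rf.small <- randomForest(BK2_Sales ~ . - Sales, data=cs_train, ntree=300, mtry=3)
> print(rf.small)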
Tutorial 6 – Association Analysis (Retail Data)
# Step 1: Import the data using RStudio
> dat <- read.csv("retaildatabelgiumA.csv", sep=",", header=TRUE)
# Transform the data into a readable format.
# Step 2: Read data through rattle
# Step 3: Association analysis model
Associate->Baskets->Execute
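** An approximately equivalent analysis in plain R with the arules package (a sketch only: it assumes the CSV holds one row per transaction/item pair, which is the layout Rattle's Baskets option expects, and the column names "ID" and "Item" are hypothetical):
> library(arules)
> trans <- read.transactions("retaildatabelgiumA.csv", format="single", sep=",", cols=c("ID", "Item"), header=TRUE)
> itemFrequencyPlot(trans, topN=10)  # relative item frequency plot
> rules <- apriori(trans, parameter=list(supp=0.01, conf=0.01))
> inspect(rules)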
# Get the relative frequency plot
Figure 1. Item frequency
For the association analysis, we set the minimum support to 0.01 and the minimum confidence to 0.01 in Rattle. Figure 1 shows the relative frequency of each item. From Figure 1, there are 4 items that meet our criteria, namely items 38, 39, 41 and 48. Item 39 has the highest frequency; the second highest frequency is item 48, followed by item 41 and item 38. This may be because item 39 fulfils a customer need on its own. The other items in the transactions, which are excluded from Figure 1, are not significant and hence are ignored in the interpretation of the analysis.
Figure 2. Output of association analysis rules
Figure 2 shows the computer output from the association analysis. The summary of the association rules includes several statistical measures, including support, confidence and lift. The data contain 100 transactions. Four rules are generated using the apriori algorithm with the specified parameter values of minimum confidence 0.28 and minimum support 0.110. The support of the rules lies in the range (0.11, 0.12) and the confidence of the rules lies in the range (0.28, 0.58). From the mean support value, a rule occurs on average in about 11% of transactions, so no item set dominates the transaction data and customers tend to buy a variety of the items sold in the shop. However, the confidence values suggest there are items that customers tend to buy together, which is investigated next.
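** The individual rules and additional interest measures such as leverage can be inspected in plain R (a sketch, continuing from the rules object in the arules sketch above):
> inspect(sort(rules, by="lift"))   # rules ordered by lift
> quality(rules)$leverage <- interestMeasure(rules, "leverage", transactions=trans)
> inspect(rules)                    # now also shows leverage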
Figure 3. Output of association analysis rules
Figure 3 shows the set of rules. Based on the first rule, we can see that 12% of all transactions contain both item 48 and item 39. When a customer buys item 48, then 39% of the time they will buy item 39 as well, which is about what would be expected from random buying by other customers. The lift value of approximately 1 indicates that purchases of item 48 and item 39 occur together about as often as they would if the two purchases were independent. However, the leverage of item 48 and item 39 is -0.0009, indicating a decrease of 0.0009 in the proportion of transactions in which they occur together compared with what would be expected under independence. This means item 48 and item 39 are slightly negatively correlated, that is, customers tend not to buy item 48 and item 39 together. Nevertheless, since both items are frequently purchased on their own, the owner can still order larger consignments of these items.