SVM, Neural Network and Random Forest in R


AIML LAB

Unit-3
EXPERIMENT-8
Import a CSV data file and implement training and testing of a Support Vector Machine.
Machine Learning
• Machine learning is a science of getting computers to act by feeding them data and letting
them learn a few tricks on their own, without being explicitly programmed to do so.
• The key to machine learning is the data. Machines learn just like us humans. We humans need
to collect information and data to learn, similarly, machines must also be fed data in order to
learn and make decisions.
• To understand machine learning, let’s consider an example. Let’s say you want a machine to
predict the value of a stock. In such situations, you feed the machine relevant data. After that,
you create a model which is used to predict the value of the stock.
Types Of Machine Learning
• 1. Supervised learning
• Supervised means to oversee or direct a certain activity and make sure it’s done correctly. In this
type of learning the machine learns under guidance.
• At school, our teachers guided us and taught us, similarly in supervised learning, you feed the
model a set of data called training data, which contains both input data and the corresponding
expected output. The training data acts as a teacher and teaches the model the correct output
for a particular input so that it can make accurate decisions when later presented with new data.

• 2. Unsupervised learning
• Unsupervised means to act without anyone’s supervision or direction.
• In unsupervised learning, the model is given a data set which is neither labeled nor classified. The
model explores the data and draws inferences from data sets to define hidden structures from
unlabeled data.
• An example of unsupervised learning is an adult like you and me: we don’t need a guide to help
us with our daily activities; we figure things out on our own, without any supervision.
• 3. Reinforcement learning
• Reinforcement means to establish or encourage a pattern of behavior. Let’s say you were
dropped off at an isolated island, what would you do?
• Initially, you’d panic and be unsure of what to do, where to get food from, how to live, and
so on. But after a while you adapt: you learn how to live on the island, cope with the
changing climate, and learn what to eat and what not to eat.
• You’re following what is known as the hit-and-trial (trial-and-error) concept, because you’re
new to these surroundings and the only way to learn is to experience and then learn from
your experience.
• This is what reinforcement learning is. It is a learning method wherein an agent (you, stuck
on an island) interacts with its environment (island) by producing actions and discovers
errors or rewards.
What Is SVM?
• SVM (Support Vector Machine) is a supervised machine learning algorithm which is mainly
used to classify data into different classes. Unlike most algorithms, SVM makes use of a
hyperplane which acts like a decision boundary between the various classes.
• SVM can be used to generate multiple separating hyperplanes such that the data is divided
into segments and each segment contains only one kind of data.

• Advantages
1. SVM is a supervised learning algorithm. This means that SVM trains on a set of labeled
data. SVM studies the labeled training data and then classifies any new input data
depending on what it learned in the training phase.
2. A main advantage of SVM is that it can be used for both classification and regression
problems. Though SVM is mainly known for classification, the SVR (Support Vector
Regressor) is used for regression problems.
3. SVM can be used for classifying non-linear data by using the kernel trick. The kernel trick
means transforming data into another dimension that has a clear dividing margin
between classes of data. After which you can easily draw a hyperplane between the
various classes of data.
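• To see the kernel trick in action, here is a minimal sketch (our example, not from the slides) using e1071: two classes arranged as concentric rings are not linearly separable in two dimensions, but a radial (RBF) kernel classifies them cleanly.
• library(e1071)
• set.seed(42)
• r = c(runif(100, 0, 1), runif(100, 2, 3))  # inner-ring and outer-ring radii
• theta = runif(200, 0, 2 * pi)
• toy = data.frame(x = r * cos(theta), y = r * sin(theta),
•                  class = factor(rep(c(0, 1), each = 100)))
• fit = svm(class ~ ., data = toy, kernel = 'radial')  # defaults to C-classification for a factor response
• table(predicted = predict(fit, toy), actual = toy$class)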
How Does SVM Work?
• In order to understand how SVM works let’s consider a scenario.
• For a second, pretend you own a farm and you have a problem: you need to set up a fence to
protect your rabbits from a pack of wolves. But where do you build your fence?
• One way to get around the problem is to build a classifier based on the position of the
rabbits and wolves in your pasture.
• So if I do that, and try to draw a decision boundary between the rabbits and the wolves, it
looks something like this. Now you can clearly build a fence along this line.

• In simple terms, this is exactly how SVM works. It draws a decision boundary, i.e. a hyperplane
between any two classes in order to separate them or classify them.
• The basic principle behind SVM is to draw a hyperplane that best separates the two classes. In our case
the two classes are the rabbits and the wolves.
What is a Support Vector in SVM?
• Support vectors are the training data points that lie closest to the hyperplane. They alone determine the position and orientation of the decision boundary: maximizing the margin means maximizing the distance between the hyperplane and these nearest points, which is why the algorithm is named after them.
Implementation (Basic)
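• The slide itself shows no code, so here is a minimal sketch of a basic SVM in R (an assumption: it uses the built-in iris dataset and the e1071 package, which the later slides also use).
• library(e1071)
• model = svm(Species ~ ., data = iris, kernel = 'linear')
• summary(model)
• pred = predict(model, iris)
• table(predicted = pred, actual = iris$Species)  # training-set confusion matrix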
Implementation (Train and Test)
• Link: Social Network Ads dataset (Social.csv) from geeksforgeeks.org
• # Importing the dataset (if read.csv fails, import the file via the IDE's import option)
• dataset = read.csv('social.csv')
• # Taking columns 3-5 (Age, EstimatedSalary, Purchased)
• dataset = dataset[3:5]
• # Encoding the target feature as a factor (creating a factor for Purchased)
• dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
• # Splitting the dataset into the Training set and Test set
• install.packages('caTools')
• library(caTools)
• set.seed(123)  # fix the random number generator for reproducibility
• split = sample.split(dataset$Purchased, SplitRatio = 0.75)  # 75% train / 25% test
• training_set = subset(dataset, split == TRUE)
• test_set = subset(dataset, split == FALSE)
• # Feature Scaling (scale everything except the target in column 3)
• training_set[-3] = scale(training_set[-3])
• test_set[-3] = scale(test_set[-3])

• # Fitting SVM to the Training set
• install.packages('e1071')
• library(e1071)
• classifier = svm(formula = Purchased ~ .,
•                  data = training_set,
•                  type = 'C-classification',
•                  kernel = 'linear')

• # Predicting the Test set results
• y_pred = predict(classifier, newdata = test_set[-3])
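• A quick evaluation step (a sketch, not in the original slides): compare the predictions with the actual labels in column 3 of the test set using a confusion matrix.
• cm = table(actual = test_set[, 3], predicted = y_pred)
• cm
• sum(diag(cm)) / sum(cm)  # overall accuracy on the test set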
• # Plotting the training data set results
• set = training_set
• X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
• X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
• grid_set = expand.grid(X1, X2)
• colnames(grid_set) = c('Age', 'EstimatedSalary')
• y_grid = predict(classifier, newdata = grid_set)
• plot(set[, -3], main = 'SVM (Training set)',
• xlab = 'Age', ylab = 'Estimated Salary',
• xlim = range(X1), ylim = range(X2))
• contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
• points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'coral1', 'aquamarine'))
• points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
• # Visualizing the Test set results
• set = test_set
• X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
• X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
• grid_set = expand.grid(X1, X2)
• colnames(grid_set) = c('Age', 'EstimatedSalary')
• y_grid = predict(classifier, newdata = grid_set)
• plot(set[, -3], main = 'SVM (Test set)',
• xlab = 'Age', ylab = 'Estimated Salary',
• xlim = range(X1), ylim = range(X2))
• contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
• points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'coral1', 'aquamarine'))
• points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
EXPERIMENT-9
Import a CSV data file and implement training and testing of a Neural Network.
Neural Networks
• A neural network is a computational system that creates predictions based on existing data. Let us train and
test a neural network using the neuralnet library in R.
• How To Construct A Neural Network?
• A neural network consists of:
• Input layers: Layers that take inputs based on existing data
• Hidden layers: Layers that use backpropagation to optimise the weights of the input variables in order to improve the
predictive power of the model
• Output layers: Output of predictions based on the data from the input and hidden layers
Solving classification problems with neuralnet

• In this particular example, our goal is to develop a neural network to determine whether a stock pays a dividend or not.
• As such, we are using the neural network to solve a classification problem: one where the data is classified into categories, e.g. a fruit can be classified as an apple, banana, orange, etc.
• In our dataset, we assign a value of 1 to a stock that pays a
dividend. We assign a value of 0 to a stock that does not pay
a dividend. The dataset for this example is available at
dividendinfo.csv.
• Our independent variables are as follows:
• fcfps: Free cash flow per share (in $)
• earnings_growth: Earnings growth in the past year (in %)
• de: Debt to Equity ratio
• mcap: Market Capitalization of the stock
• current_ratio: Current Ratio (or Current Assets/Current Liabilities)
• We first set our working directory and load the data into the R environment:
• setwd("your directory")
• mydata <- read.csv("dividendinfo.csv")
• attach(mydata)
• Let’s now take a look at the steps we will follow in constructing this model.
Data Normalization

• One of the most important procedures when forming a neural network is data normalization. This involves adjusting the data to a common scale so as to accurately compare predicted and actual values.
• Failure to normalize the data will typically result in the prediction value remaining the same across all observations, regardless of the input values.
• We can do this in two ways in R:
• Scale the data frame automatically using the scale function in R
• Transform the data using a max-min normalization technique
• We implement both techniques below but choose to use the
max-min normalization technique.
• Scaled Normalization
• scaleddata<-scale(mydata)
• Max-Min Normalization
• For this method, we invoke the following function to normalize our data:
• normalize <- function(x) {
•   return((x - min(x)) / (max(x) - min(x)))
• }

• Then, we use lapply to run the function across our existing data (we have
termed the dataset loaded into R as mydata):
• maxmindf <- as.data.frame(lapply(mydata, normalize))
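• As a quick sanity check (our addition, not part of the tutorial), every column of maxmindf should now lie in the [0, 1] range:
• summary(maxmindf)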
• We base our training data (trainset) on the first 80% of the observations and the test data
(testset) on the remaining 20% of observations.
• # Training and Test Data
• trainset <- maxmindf[1:160, ]
• testset <- maxmindf[161:200, ]


Training a Neural Network Model using neuralnet
• nn <- neuralnet(dividend ~ fcfps + earnings_growth + de + mcap +
current_ratio, data=trainset, hidden=c(2,1), linear.output=FALSE,
threshold=0.01)
• nn$result.matrix
• plot(nn)
• Using neuralnet to “regress” the dependent “dividend” variable against the other
independent variables
• Setting two hidden layers of 2 and 1 neurons via the hidden = c(2, 1) argument
• The linear.output argument is set to FALSE, since the impact of the independent
variables on the dependent variable (dividend) is assumed to be non-linear
• The threshold is set to 0.01, meaning that training stops once the partial derivatives of
the error function fall below 0.01, after which no further optimization is carried out by the model
• We now generate the error of the neural network model,
along with the weights between the inputs, hidden layers,
and outputs:
• nn$result.matrix
Testing The Accuracy Of The Model
• The “subset” function is used to eliminate the dependent variable from the
test data
• The “compute” function then creates the prediction variable
• A “results” variable then compares the predicted data with the actual data
• A confusion matrix is then created with the table function to compare the
number of true/false positives and negatives
• #Test the resulting output
• temp_test <- subset(testset, select = c("fcfps","earnings_growth", "de", "mcap",
"current_ratio"))
• head(temp_test)
• nn.results <- compute(nn, temp_test)
• results <- data.frame(actual = testset$dividend, prediction = nn.results$net.result)
• The predicted results are compared to the actual results:
• results
Confusion Matrix
• Then, we round our results using sapply and create a confusion
matrix to compare the number of true/false positives and negatives:
• roundedresults<-sapply(results,round,digits=0)
• roundedresultsdf=data.frame(roundedresults)
• attach(roundedresultsdf)
• table(actual,prediction)
• A confusion matrix is used to determine the number of true and false
positives generated by our predictions. The model generates 17 true
negatives (0s) and 20 true positives (1s), while there are 3 false negatives.
• Ultimately, we yield a 92.5% (37/40) accuracy rate in determining
whether a stock pays a dividend or not.
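• The same figure can be computed directly from the confusion matrix (a sketch, not in the original slides; actual and prediction are available after the attach() call above):
• cm <- table(actual, prediction)
• sum(diag(cm)) / sum(cm)  # (17 + 20) / 40 = 0.925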
EXPERIMENT-10
Import a CSV data file and implement training and testing of a Random Forest.
What Is Classification?
• Classification is the method of predicting the class of a given input data point.
Classification problems are common in machine learning and they fall under the
Supervised learning method.
• Let’s say you want to classify your emails into 2 groups: spam and non-spam
emails. For this kind of problem, where you have to assign an input data point
to one of several classes, you can make use of classification algorithms.
• Under classification we have 2 types:
• Binary Classification
• Multi-Class Classification
• Classification Algorithms used in Machine Learning:
1. Logistic Regression
2. K Nearest Neighbor (KNN)
3. Decision Tree
4. Support Vector Machine
5. Naive Bayes
6. Random Forest
What Is Random Forest?
• Random forest algorithm is a supervised classification and regression algorithm. As
the name suggests, this algorithm randomly creates a forest with several trees.
• Generally, the more trees in the forest, the more robust the model.
Similarly, in the random forest classifier, the higher the number of trees in the forest,
the greater the accuracy of the results.
Why Use Random Forest?

• Even though Decision trees are convenient and easily implemented,


they lack accuracy. Decision trees work very effectively with the
training data that was used to build them, but they’re not flexible
when it comes to classifying the new sample. Which means that the
accuracy during testing phase is very low.
• This happens due to a process called Over-fitting.
• Over-fitting occurs when a model studies the training data to such an extent
that it negatively influences the performance of the model on new data.
• This means that the disturbance in the training data is recorded
and learned as concepts by the model. But the problem here is that
these concepts do not apply to the testing data and negatively
impact the model’s ability to classify the new data, hence reducing
the accuracy on the testing data.
How Does Random Forest Work?
• To understand Random forest, consider the below sample data
set. In this data set we have four predictor variables, namely:
• Weight
• Blood flow
• Blocked Arteries
• Chest Pain
• These variables are used to predict whether or not a person has
heart disease. We’re going to use this data set to create a
Random Forest that predicts if a person has heart disease or not.
Implementation in R
• Step 1) Import the data
• Step 2) Train the model
• Step 3) Construct accuracy function
• Step 4) Visualize the model
• Step 5) Evaluate the model
• Step 6) Visualize Result
Step 1) Import the data
• library(dplyr)
• data_train <- read.csv("https://raw.githubusercontent.com/guru99-edu/R-Programming/master/train.csv")
• glimpse(data_train)
• data_test <- read.csv("https://raw.githubusercontent.com/guru99-edu/R-Programming/master/test.csv")
• glimpse(data_test)
Step 2) Train the model
• One way to evaluate the performance of a model is to train it on a number of different smaller
datasets and evaluate it over the other, smaller testing sets. This is called K-fold cross-validation.
• R has a function to randomly split the dataset into k subsets of almost equal size. For example, if
k = 9, the model is trained on eight folds and tested on the remaining fold, and the process
is repeated until every fold has served as the test set. This technique is widely used for model
selection, especially when the model has parameters to tune (an illustrative fold-splitting sketch follows this list).
• Now that we have a way to evaluate our model, we need to figure out how to choose the
parameters that generalize best to the data.
• Random forest chooses a random subset of features and builds many Decision Trees. The model
averages out all the predictions of the Decision Trees.
• Random forest has some parameters that can be changed to improve the generalization of the
prediction. You will use the function randomForest() to train the model.
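• Before tuning, here is an illustrative sketch of what the fold splitting looks like (our example; the tutorial itself relies on trainControl(method = "cv") below):
• library(caret)
• set.seed(1234)
• # split the row indices of data_train into 10 folds of roughly equal size
• folds <- createFolds(data_train$survived, k = 10)
• str(folds)  # a list of 10 index vectors, one held-out test set per fold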
Syntax for Random Forest:
• randomForest(formula, ntree = n, mtry = FALSE, maxnodes = NULL)
• Arguments:
• - formula: formula of the fitted model
• - ntree: number of trees in the forest
• - mtry: number of candidate variables drawn to feed the algorithm. By default,
it is the square root of the number of columns.
• - maxnodes: set the maximum number of terminal nodes in the forest
• - importance = TRUE: whether the importance of the independent variables in
the random forest should be assessed
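• For reference, a minimal direct call with these arguments might look as follows (a sketch with illustrative parameter values, assuming survived is a factor; the tutorial instead tunes the model through caret below):
• library(randomForest)
• fit <- randomForest(survived ~ ., data = data_train,
•                     ntree = 300, mtry = 3, maxnodes = 24,
•                     importance = TRUE)
• print(fit)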
• library(randomForest)
• library(caret)
• library(e1071)
• # Define the control
• trControl <- trainControl(method = "cv",
• number = 10,
• search = "grid")
• set.seed(1234)
• # Run the model
• rf_default <- train(survived~.,
• data = data_train,
• method = "rf",
• metric = "Accuracy",
• trControl = trControl)
• # Print the results
• print(rf_default)
• trainControl(method = "cv", number = 10, search = "grid"): evaluate the model with 10-fold cross-validation and a grid search
• train(...): train a random forest model. The best model is chosen with the accuracy measure.
• set.seed(1234)
• tuneGrid <- expand.grid(.mtry = c(1: 10))
• rf_mtry <- train(survived~.,
• data = data_train,
• method = "rf",
• metric = "Accuracy",
• tuneGrid = tuneGrid,
• trControl = trControl,
• importance = TRUE,
• nodesize = 14,
• ntree = 300)
• print(rf_mtry)
• tuneGrid <- expand.grid(.mtry = c(1:10)): construct a grid of mtry values from 1 to 10 (matching the code above)
• rf_mtry$bestTune$mtry  # the best value of mtry is stored here
• max(rf_mtry$results$Accuracy)
• best_mtry <- rf_mtry$bestTune$mtry
• best_mtry
Step 3) Search the best maxnodes
• Create a list to store the results
• Create a variable with the best value of the parameter mtry (compulsory)
• Create the loop
• Store the current value of maxnodes
• Summarize the results
• store_maxnode <- list()
• tuneGrid <- expand.grid(.mtry = best_mtry)
• for (maxnodes in c(5: 15)) {
• set.seed(1234)
• rf_maxnode <- train(survived~.,
• data = data_train,
• method = "rf",
• metric = "Accuracy",
• tuneGrid = tuneGrid,
• trControl = trControl,
• importance = TRUE,
• nodesize = 14,
• maxnodes = maxnodes,
• ntree = 300)
• current_iteration <- toString(maxnodes)
• store_maxnode[[current_iteration]] <- rf_maxnode
•}
• results_mtry <- resamples(store_maxnode)
• summary(results_mtry)
• Code explanation:
• store_maxnode <- list(): the results of the model will be stored in this list
• expand.grid(.mtry = best_mtry): use the best value of mtry
• for (maxnodes in c(5:15)) { ... }: compute the model with values of
maxnodes from 5 to 15
• maxnodes = maxnodes: for each iteration, maxnodes is set to the current
value, i.e. 5, 6, 7, ...
• current_iteration <- toString(maxnodes): store the value of maxnodes as a
string variable
• store_maxnode[[current_iteration]] <- rf_maxnode: save the result of the model in the
list
• resamples(store_maxnode): arrange the results of the model
• summary(results_mtry): print the summary of all the combinations
Step 4) Search the best ntrees
• store_maxtrees <- list()
• for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
• set.seed(5678)
• rf_maxtrees <- train(survived~.,
• data = data_train,
• method = "rf",
• metric = "Accuracy",
• tuneGrid = tuneGrid,
• trControl = trControl,
• importance = TRUE,
• nodesize = 14,
• maxnodes = 24,
• ntree = ntree)
• key <- toString(ntree)
• store_maxtrees[[key]] <- rf_maxtrees
•}
• results_tree <- resamples(store_maxtrees)
• summary(results_tree)
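• Step 5 calls predict() on a model named fit_rf that the slides never define. A plausible reconstruction (the hyperparameter values are assumptions based on the searches above) trains the final model with the tuned parameters:
• fit_rf <- train(survived ~ .,
•                 data = data_train,
•                 method = "rf",
•                 metric = "Accuracy",
•                 tuneGrid = tuneGrid,   # grid containing the best mtry found in Step 2
•                 trControl = trControl,
•                 importance = TRUE,
•                 nodesize = 14,
•                 ntree = 800,           # assumed winner of the ntree search in Step 4
•                 maxnodes = 24)         # assumed winner of the maxnodes search in Step 3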
Step 5) Evaluate the model
• prediction <-predict(fit_rf, data_test)
• confusionMatrix(prediction, data_test$survived)
Step 6) Visualize Result
• varImpPlot(fit_rf)
