Support Vector Machines and Artificial Neural Networks
Dr. S. Veena, Associate Professor/CSE, Ramapuram Campus
SVM and ANN
Support Vector Machines
Maximum margin classifier
• Infinitely many hyperplanes can be drawn to classify the same set of data.
• To select an ideal hyperplane, the maximum margin classifier considers the hyperplane with the maximum margin of separation width.
Hyperplanes:
• In n-dimensional space, a hyperplane is a flat affine subspace of dimension n-1.
• In 2-dimensional space, the hyperplane is a straight line which separates the 2-dimensional space into two halves.
• The hyperplane is defined by the following equation:
    β0 + β1x1 + β2x2 + ... + βnxn = 0
• Points which lie on the hyperplane have to satisfy the above equation.
• However, there are regions above and below the hyperplane as well. This means observations can fall in either of these regions, also called the regions of the classes.
We already know that the projection of one vector onto another vector is called the dot product. Hence, we take the dot product of the x and w vectors. If the dot product is greater than 'c', the point lies on the right side; if the dot product is less than 'c', the point is on the left side; and if the dot product is equal to 'c', the point lies on the decision boundary (see the sketch below).
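A minimal NumPy sketch of this decision rule; the vector w, the offset c, and the sample points are illustrative assumptions:

import numpy as np

w = np.array([2.0, 1.0])   # normal vector to the decision boundary (assumed)
c = 4.0                    # offset of the boundary along w (assumed)

def side_of_boundary(x, w, c):
    """Classify a point by comparing the dot product w.x with c."""
    score = np.dot(w, x)
    if score > c:
        return "right side"
    elif score < c:
        return "left side"
    return "on the decision boundary"

for x in [np.array([3.0, 1.0]), np.array([1.0, 1.0]), np.array([1.5, 1.0])]:
    print(x, "->", side_of_boundary(x, w, c))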
• Infinitely many separating hyperplanes can be drawn to separate the two classes (blue and red).
• However, the maximum margin classifier attempts to fit the widest slab between the two classes (maximizing the margin between the positive and negative hyperplanes); the observations touching both the positive and negative hyperplanes are called support vectors.
• Classifier performance depends purely on the support vectors; changes to observations that are not support vectors do not affect the performance of the maximum margin classifier at all, as only the extreme points are considered by the algorithm (as the sketch below illustrates).
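A short scikit-learn sketch on small made-up 2-D data that exposes the fitted support vectors (the data points are illustrative assumptions):

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (made-up data)
X = np.array([[1, 1], [2, 1], [1, 2],      # class 0
              [4, 4], [5, 4], [4, 5]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# Only these extreme points define the maximum margin boundary
print("Support vectors:\n", clf.support_vectors_)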
• A high value of C (the budget for margin violations) makes the model more tolerant, leaving space for violations (errors), as in the left diagram.
• A lower value of C gives no scope for accepting violations, which leads to a reduction in margin width.
• Note that in scikit-learn's SVC, used below, the C parameter is the penalty on violations and works in the opposite direction: a larger C tolerates fewer violations.
Support vector machines
• Support vector machines are used when the decision boundary is non-linear and the data would not be separable with support vector classifiers, whatever the cost value is.
• The following diagram shows the non-linearly separable cases for both 1 dimension and 2 dimensions.
• We need another way of handling the data, called the kernel trick, which uses a kernel function to work with non-linearly separable data.
• A polynomial kernel with degree 2 has been applied to transform the data from 1-dimensional to 2-dimensional.
• In the left diagram, the different classes (red and blue) are plotted on X1 only, whereas after applying degree 2 we have 2 dimensions, X1 and X1² (the original and a new dimension).
• The degree of the polynomial kernel is a tuning parameter; the practitioner needs to try various values to check where higher accuracies are possible with the model. A sketch of the transform follows.
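A minimal NumPy sketch of the degree-2 feature map on 1-D data (the sample values are illustrative assumptions):

import numpy as np

# 1-D data that is not linearly separable: class 1 sits between class 0 points
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Degree-2 map: x -> (x, x^2). In the (X1, X1^2) plane, a horizontal
# line such as X1^2 = 1 now separates the two classes.
X_mapped = np.column_stack([x, x ** 2])
print(X_mapped)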
• The observations have been classified successfully using a linear plane after projecting the data into higher dimensions.
Kernel functions
• Kernel functions are functions that, given the original feature vectors, return the same value as the dot product of their corresponding mapped feature vectors.
• The main reason for using kernel functions is to eliminate the computational requirement of deriving the higher-dimensional vector space from the given basic vector space, so that observations can be separated linearly in higher dimensions without computing the mapping explicitly.
• Need: the derived vector space grows exponentially with the increase in dimensions, and it becomes almost too difficult to continue computation, even with around 30 variables. The sketch below verifies the kernel identity numerically.
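A small NumPy check that a degree-2 polynomial kernel, K(x, z) = (1 + x·z)², returns the same value as the dot product of the explicitly mapped feature vectors (the points and the 2-D feature map are illustrative assumptions):

import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: K(x, z) = (1 + x.z)^2."""
    return (1.0 + np.dot(x, z)) ** 2

def phi(v):
    """Explicit feature map for 2-D input corresponding to the kernel above."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])   # illustrative points
z = np.array([3.0, 0.5])

print(poly_kernel(x, z))          # kernel value in the original space
print(np.dot(phi(x), phi(z)))     # identical value via explicit mapping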
• Example - how the number of dimensions grows:
• When we have two variables, x and y, a polynomial kernel of degree 2 needs to compute the x², y², and xy dimensions in addition.
• Whereas, if we have three variables x, y, and z, then we need to calculate the x², y², z², xy, yz, xz, and xyz vector spaces.
• Adding one more dimension creates so many combinations; hence, care needs to be taken to reduce the computational complexity.
• Thus kernels are used, defined more formally by the following equation, where φ is the implicit feature mapping:
    K(xi, xj) = φ(xi) · φ(xj)
• Polynomial kernel:
• Polynomial kernels are popularly used, especially with degree 2.
• In fact, the inventor of support vector machines, Vladimir N. Vapnik, developed a degree-2 kernel for classifying handwritten digits.
• Polynomial kernels are given by the following equation (in its common form, where d is the degree):
    K(xi, xj) = (1 + xi · xj)^d
For reference: SVM | Support Vector Machine Algorithm in Machine Learning (analyticsvidhya.com)
• Radial Basis Function (RBF) / Gaussian kernel:
• RBF kernels are a good first choice for problems requiring non-linear models.
• A decision boundary that is a hyperplane in the mapped feature space is similar to a decision boundary that is a hypersphere in the original space.
• The feature space produced by the Gaussian kernel can have an infinite number of dimensions, a feat that would be impossible otherwise. RBF kernels are represented by the following equation, where γ > 0:
    K(xi, xj) = exp(−γ ‖xi − xj‖²)
• A large value of gamma gives a pointed, narrow bump in the higher dimensions, whereas a smaller value gives a softer, broader bump.
• Consequently, a large gamma gives low-bias, high-variance solutions, and a small gamma gives high-bias, low-variance solutions, as the sketch below shows.
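A quick numeric sketch of how gamma controls the width of the kernel bump (the distances and gamma values are illustrative assumptions):

import numpy as np

def rbf(dist, gamma):
    """RBF kernel value as a function of the distance between two points."""
    return np.exp(-gamma * dist ** 2)

for gamma in (0.1, 10.0):          # small vs large gamma
    values = [round(rbf(d, gamma), 4) for d in (0.0, 0.5, 1.0, 2.0)]
    print(f"gamma={gamma}: {values}")
# Small gamma: values decay slowly with distance (broad bump).
# Large gamma: values drop to ~0 almost immediately (pointed bump).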
>>> import os
>>> # First change the following directory to where the input files exist
>>> os.chdir("D:\\Book writing\\Codes\\Chapter 6")
>>> import pandas as pd
>>> letterdata = pd.read_csv("letterdata.csv")
>>> print (letterdata.head())
The following code removes the target variable from the x variables and, at the same time, creates a new y variable for convenience:
>>> x_vars = letterdata.drop(['letter'], axis=1)
>>> y_var = letterdata["letter"]
The data is first split 70-30 into train and test (following the split used elsewhere in this chapter), and a linear SVC is fitted with a cost value of C=1.0:
# Linear Classifier
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.svm import SVC
>>> from sklearn.metrics import accuracy_score, classification_report
>>> x_train, x_test, y_train, y_test = train_test_split(x_vars, y_var, train_size=0.7, random_state=42)
>>> svm_fit = SVC(kernel='linear', C=1.0)
>>> svm_fit.fit(x_train, y_train)
>>> print ("\nSVM Linear Classifier - Train Confusion Matrix\n\n", pd.crosstab(y_train, svm_fit.predict(x_train), rownames=["Actual"], colnames=["Predicted"]))
Maximum margin classifier - linear kernel
From the above results, we can see that the test accuracy for the linear classifier is 85%.
Polynomial kernel
• A polynomial kernel with degree 2 has been used to check whether any improvement in accuracy is possible.
• The cost value has been kept the same as for the linear classifier in order to determine the impact of the non-linear kernel:
#Polynomial Kernel
>>> svm_poly_fit = SVC(kernel='poly', C=1.0, degree=2)
>>> svm_poly_fit.fit(x_train, y_train)
>>> print ("\nSVM Polynomial Kernel Classifier - Train Confusion Matrix\n\n", pd.crosstab(y_train, svm_poly_fit.predict(x_train), rownames=["Actual"], colnames=["Predicted"]))
>>> print ("\nSVM Polynomial Kernel Classifier - Train accuracy:", round(accuracy_score(y_train, svm_poly_fit.predict(x_train)), 3))
>>> print ("\nSVM Polynomial Kernel Classifier - Train Classification Report\n", classification_report(y_train, svm_poly_fit.predict(x_train)))
>>> print ("\n\nSVM Polynomial Kernel Classifier - Test Confusion Matrix\n\n", pd.crosstab(y_test, svm_poly_fit.predict(x_test), rownames=["Actual"], colnames=["Predicted"]))
>>> print ("\nSVM Polynomial Kernel Classifier - Test accuracy:", round(accuracy_score(y_test, svm_poly_fit.predict(x_test)), 3))
>>> print ("\nSVM Polynomial Kernel Classifier - Test Classification Report\n", classification_report(y_test, svm_poly_fit.predict(x_test)))
RBF kernel
The cost value is kept the same as for the other kernels, but the gamma value has been chosen as 0.1 to fit the model:
#RBF Kernel
>>> svm_rbf_fit = SVC(kernel='rbf', C=1.0, gamma=0.1)
>>> svm_rbf_fit.fit(x_train, y_train)
>>> print ("\nSVM RBF Kernel Classifier - Train Confusion Matrix\n\n", pd.crosstab(y_train, svm_rbf_fit.predict(x_train), rownames=["Actual"], colnames=["Predicted"]))
>>> print ("\nSVM RBF Kernel Classifier - Train accuracy:", round(accuracy_score(y_train, svm_rbf_fit.predict(x_train)), 3))
>>> print ("\nSVM RBF Kernel Classifier - Train Classification Report\n", classification_report(y_train, svm_rbf_fit.predict(x_train)))
Artificial Neural Networks - ANN
• Artificial neural networks (ANNs) model the relationship between a set of input signals and output signals using a model derived from a replica of the biological brain, which responds to stimuli from its sensory inputs.
• The human brain consists of about 90 billion neurons, with around 1 trillion connections between them.
• ANN methods try to model problems using interconnected artificial neurons (or nodes) to solve machine learning problems.
• In the human brain:
– Incoming signals are received by the cell's dendrites through a biochemical process that allows the impulses to be weighted according to their relative importance.
– As the cell body accumulates the incoming signals, a threshold is reached at which the cell fires, and the output signal is then transmitted via an electrochemical process down the axon.
– At the axon terminal, the electrical signal is again processed as a chemical signal to be passed to the neighboring neurons' dendrites.
• A similar working principle is loosely used in building an artificial neural network, in which each neuron has a set of inputs, each of which is given a specific weight.
• The neuron computes a function on these weighted inputs.
• A linear neuron takes a linear combination of the weighted inputs and applies an activation function (sigmoid, tanh, ReLU, and so on) to the aggregated sum.
• The network feeds the weighted sum of the inputs into the logistic function (in the case of the sigmoid activation).
• The logistic function returns a value between 0 and 1 based on a set threshold; for example, here we set the threshold as 0.7.
• Any accumulated signal greater than 0.7 gives a signal of 1, and any accumulated signal less than 0.7 returns the value 0.
• A typical artificial neuron with n input dendrites can be represented by the following formula:
    y(x) = f( Σ(i=1..n) wi · xi )
• The w weights allow each of the n inputs of x to contribute a greater or lesser amount to the sum of input signals.
• The accumulated value is passed to the activation function, f(x), and the resulting signal, y(x), is the output axon. A minimal sketch of such a neuron follows.
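A minimal NumPy sketch of the thresholded sigmoid neuron described above (the weights, the inputs, and the 0.7 threshold follow the description; the specific numbers are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, threshold=0.7):
    """Weighted sum -> sigmoid activation -> hard threshold at 0.7."""
    activation = sigmoid(np.dot(w, x))
    return 1 if activation > threshold else 0

w = np.array([0.8, 0.4, 0.3])   # illustrative weights
x = np.array([1.0, 0.5, 1.0])   # illustrative inputs

print(neuron(x, w))  # weighted sum = 1.3, sigmoid ~ 0.786 > 0.7 -> outputs 1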
The parameters to choose when building neural networks are the following:
• Activation function: Choosing an activation function plays a major role in aggregating signals into the output signal to be propagated to the other neurons of the network.
• Network architecture or topology: This represents the number of layers required and the number of neurons in each layer. More layers and neurons will create a highly non-linear decision boundary, whereas reducing the architecture makes the model less flexible and more robust.
• Training optimization algorithm: The selection of an optimization algorithm also plays a critical role in converging quickly and accurately to the optimal solution.
• Applications of neural networks:
– Images and videos: To identify an object in an image, or to classify whether it is a dog or a cat
– Text processing (NLP): Deep-learning-based chatbots and so on
– Speech: Speech recognition
– Structured data processing: Building highly powerful models to obtain a non-linear decision boundary
Activation functions
• Activation functions are the mechanisms by which an artificial neuron processes information and passes it throughout the network.
• The activation function takes a single number and performs a fixed mathematical mapping on it.
• The different types of activation functions are:
• Sigmoid function: Sigmoid has the mathematical form σ(x) = 1 / (1 + e^(−x)). It takes a real-valued number and squashes it into the range between 0 and 1. Sigmoid is a popular choice because it makes calculating derivatives easy and is easy to interpret.
• Tanh function: Tanh squashes a real-valued number into the range [-1, 1]. Its output is zero-centered. In practice, tanh non-linearity is always preferred to sigmoid non-linearity. Also, it can be shown that tanh is a scaled sigmoid: tanh(x) = 2σ(2x) − 1.
• Rectified Linear Unit (ReLU) function: ReLU has become very popular in the last few years. It computes the function f(x) = max(0, x): activation is simply thresholded at zero.
• Linear function: The linear activation function is used in linear regression problems, where it always provides a derivative of 1, because the function used is f(x) = x.
• ReLU is now popularly used in place of sigmoid or tanh due to its better convergence property. A sketch of these functions follows.
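Minimal NumPy implementations of the four activation functions described above:

import numpy as np

def sigmoid(x):
    """Squashes input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Zero-centered squashing into (-1, 1); tanh(x) = 2*sigmoid(2x) - 1."""
    return np.tanh(x)

def relu(x):
    """Thresholds activation at zero: f(x) = max(0, x)."""
    return np.maximum(0, x)

def linear(x):
    """Identity activation, f(x) = x, with derivative 1 everywhere."""
    return x

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, linear):
    print(f.__name__, np.round(f(x), 3))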
Forward propagation and backpropagation
• Forward propagation and backpropagation are illustrated with a two-hidden-layer deep neural network in the following example, in which both hidden layers have three neurons each, in addition to the input and output layers.
• The number of neurons in the input layer is based on the number of x (independent) variables, whereas the number of neurons in the output layer is decided by the number of classes the model needs to predict.
• For ease, we have shown only one neuron in each layer.
• Weights and biases are initialized with random numbers, so that in both the forward and backward passes they can be updated in order to minimize the error.
• During forward propagation, features are input to the network and fed through the layers to produce the output activation.
• In hidden layer 1, the activation obtained is the combination of bias weight 1 and the weighted combination of input values; if the overall value crosses the threshold, it triggers the next layer, otherwise the signal to the next layer is 0. Bias values are necessary to control the trigger points.
• In some cases the weighted combination of signals is low; in those cases, the bias compensates by the extra amount needed to push the aggregated value past the trigger point for the next level.
• Once all the neurons in hidden layer 1 are calculated (the Hidden1, Hidden2, and Hidden3 neurons), the neurons of the next layer are calculated in a similar way from the outputs of the first layer's hidden neurons, with the addition of a bias (bias weight 4).
• The following figure describes hidden neuron 4, in layer 2:
• In the last layer (also known as the output layer), outputs are calculated in the same way, by taking the weighted combination of the weights and the outputs obtained from hidden layer 2.
• Once we obtain the output from the model, it is compared with the actual value, and the errors are backpropagated across the network in order to correct the weights of the entire neural network:
• In the following diagram, we take the derivative of the output value and multiply the error component by it, where the error component is obtained by differencing the actual value and the model output:
• In a similar way, we backpropagate the error through the second hidden layer as well.
• In the following diagram, errors are computed for the Hidden 4 neuron in the second hidden layer:
• In the following diagram, errors are calculated for the Hidden 1 neuron in layer 1, based on the errors obtained from all the neurons in layer 2:
• Once all the neurons in hidden layer 1 are updated, the weights between the inputs and the hidden layer also need to be updated.
• In the following diagram, we update both the input weights and, at the same time, the neurons in hidden layer 1, as the layer-1 neurons utilize the weights from the inputs only:
• Finally, in the following figure, the layer 2 neurons are updated in the forward propagation pass. A compact sketch of one forward and one backward pass follows.
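A compact NumPy sketch of one forward and one backward pass through a small network with two hidden layers (three neurons each, as in the example) and sigmoid activations; the data, the learning rate, and the squared-error loss are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Architecture: 2 inputs -> 3 hidden -> 3 hidden -> 1 output
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input  -> hidden layer 1
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)   # hidden -> hidden layer 2
W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden -> output

x = np.array([0.5, -1.0])   # illustrative input
y = np.array([1.0])         # illustrative target
lr = 0.1                    # learning rate

# ---- Forward propagation ----
h1 = sigmoid(W1 @ x + b1)    # hidden layer 1 activations
h2 = sigmoid(W2 @ h1 + b2)   # hidden layer 2 activations
out = sigmoid(W3 @ h2 + b3)  # output activation

# ---- Backpropagation (squared-error loss) ----
d_out = (out - y) * out * (1 - out)        # output delta
d_h2 = (W3.T @ d_out) * h2 * (1 - h2)      # backpropagate through W3
d_h1 = (W2.T @ d_h2) * h1 * (1 - h1)       # backpropagate through W2

# ---- Weight updates (one gradient descent step) ----
W3 -= lr * np.outer(d_out, h2); b3 -= lr * d_out
W2 -= lr * np.outer(d_h2, h1);  b2 -= lr * d_h2
W1 -= lr * np.outer(d_h1, x);   b1 -= lr * d_h1

print("output before update:", out)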
Optimization of neural networks
• Various techniques have been used for optimizing the weights of
neural networks:
– Stochastic gradient descent (SGD)
– Momentum
– Nesterov accelerated gradient (NAG)
– Adaptive gradient (Adagrad)
– Adadelta
– RMSprop
– Adaptive moment estimation (Adam)
– Limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS)
Optimization of neural networks - Stochastic gradient descent (SGD)
• Gradient descent is a way to minimize an objective function J(θ), parameterized by the model's parameters θ ∈ R^d, by updating the parameters in the direction opposite to the gradient of the objective function with respect to the parameters:
    θ = θ − η · ∇θ J(θ)
• The learning rate η determines the size of the steps taken to reach the minimum.
– Batch gradient descent (all training observations used in each iteration)
– SGD (one observation per iteration)
– Mini-batch gradient descent (about 50 training observations for each iteration)
The sketch below contrasts the three variants.
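A minimal NumPy sketch contrasting the three variants on a simple least-squares objective; the synthetic data, the learning rate, and the epoch count are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(42)

# Synthetic linear-regression data: y = 3*x + noise
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)

def gradient(theta, Xb, yb):
    """Gradient of mean squared error for a linear model."""
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(yb)

def train(batch_size, lr=0.1, epochs=20):
    theta = np.zeros(1)
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            theta -= lr * gradient(theta, X[batch], y[batch])
    return theta

print("batch GD     :", train(batch_size=len(X)))  # all observations per update
print("SGD          :", train(batch_size=1))       # one observation per update
print("mini-batch GD:", train(batch_size=50))      # ~50 observations per update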
● The following image compares, in a 2-D projection, the convergence characteristics of full-batch gradient descent and stochastic gradient descent with batch size 1.
● Full-batch updates are smoother due to the consideration of all the observations in each update.
● SGD, by contrast, has wiggly convergence characteristics, because each update uses only one observation:
Introduction to deep learning
Deep learning is a class of machine learning algorithms which utilizes neural networks to build models that solve both supervised and unsupervised problems on structured and unstructured datasets such as images, videos, text (NLP), and voice.
• A deep neural network (deep architecture) consists of multiple hidden layers of units between the input and output layers.
• Each layer is fully connected with the subsequent layer.
• The output of each artificial neuron in a layer is an input to every artificial neuron in the next layer, towards the output:
As more hidden layers are added to the neural network, more complex decision boundaries are created to classify different categories.
An example of a complex decision boundary can be seen in the following graph:
Solving methodology
• Backpropagation is used to train deep networks by calculating the error of the network at the output units and propagating it back through the layers, updating the weights to reduce the error terms.
• Thumb rules in designing deep neural networks:
– All hidden layers should have the same number of neurons per layer
– Typically, two hidden layers are good enough to solve the majority of problems
– Using scaling/batch normalization (mean 0, variance 1) for all input variables after each layer improves convergence effectiveness
– Reducing the step size after each iteration improves convergence, in addition to the use of momentum and dropout
Deep learning software
• Deep learning software has evolved multi-fold in recent times.
• Different types of deep learning software are:
– Theano: Python-based deep learning library developed by the University of Montreal
– TensorFlow: Google's deep learning library, which runs on top of Python/C++
– Keras / Lasagne: Lightweight wrappers which sit on top of Theano/TensorFlow and enable faster model prototyping
– Torch: Lua-based deep learning library with wide support for machine learning algorithms
– Caffe: Deep learning library primarily used for processing images
TensorFlow has recently been picking up momentum in the deep learning community, as it is backed by Google and has good visualization capabilities using TensorBoard.
Deep learning software - Deep neural network classifier applied on handwritten digits using Keras
We use the same data that we previously trained the model on using scikit-learn, in order to perform an apples-to-apples comparison between scikit-learn and the deep learning software Keras.
Data loading steps
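The loading code itself is not shown; a plausible setup, assuming the standard scikit-learn and plotting utilities used by the rest of this example, is:

>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import load_digits
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.metrics import accuracy_score, classification_report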
Keras Library modules
>>> from keras.models import Sequential
>>> from keras.layers.core import Dense, Dropout, Activation
>>> from keras.optimizers import Adadelta,Adam,RMSprop
>>> from keras.utils import np_utils
The following code loads the digits data from the scikit-learn datasets, with a quick piece of code to check the shape of the data. As the data is already embedded in NumPy arrays, we do not need to change it into any other format, because deep learning models get trained on NumPy arrays:
>>> digits = load_digits()
>>> X = digits.data
>>> y = digits.target
>>> print (X.shape)
>>> print (y.shape)
>>> print ("\nPrinting first digit")
>>> plt.matshow(digits.images[0])
>>> plt.show()
The previous code prints the first digit in matrix form; the plotted digit looks like a 0:
We standardize the data with the following code, demeaning each series and then dividing by its standard deviation, to put all 64 dimensions on a similar scale:
>>> x_vars_stdscle = StandardScaler().fit_transform(X)
The following section of the code splits the data into train and test based on a 70-30 split:
>>> x_train, x_test, y_train, y_test = train_test_split(x_vars_stdscle, y, train_size=0.7, random_state=42)
We use nb_classes as 10, because the digits range from 0-9; batch_size as 128, which means we utilize 128 observations per batch to update the weights; and nb_epochs as 200, which means the model is trained for 200 epochs (see the variable definitions below).
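The corresponding variable definitions, reconstructed from the prose above:

>>> nb_classes = 10    # digits 0-9
>>> batch_size = 128   # observations per weight update
>>> nb_epochs = 200    # full passes over the training data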
The following code creates the n-dimensional vector for the multiclass values based on the nb_classes value. Here, we get dimension 10 for all the train observations, for training with the softmax classifier:
>>> Y_train = np_utils.to_categorical(y_train, nb_classes)
The core model-building code, which stacks like Lego blocks, is shown as follows. Here, we initiate the model as sequential rather than parallel and so on:
#Deep Layer Model building in Keras
>>> model = Sequential()
In the first layer, we use 100 neurons, with the input shape as 64 columns (as the number of columns in X is 64), followed by a relu activation function with a dropout value of 0.5:
>>> model.add(Dense(100,input_shape= (64,)))
>>> model.add(Activation('relu'))
>>> model.add(Dropout(0.5))
In the second layer, we are using 50 neurons (to compare the results obtained using the
scikit-learn methodology, we have used a similar architecture):
>>> model.add(Dense(50))
>>> model.add(Activation('relu'))
>>> model.add(Dropout(0.5))
In the output layer, the number of classes needs to be used with the softmax classifier:
>>> model.add(Dense(nb_classes))
>>> model.add(Activation('softmax'))
Here, we compile with categorical_crossentropy, as the output is multiclass; whereas, if we wanted to use binary classes, we would need to use binary_crossentropy instead:
>>> model.compile(loss='categorical_crossentropy', optimizer='adam')
The model is trained in the following step with the given batch size and number of epochs:
#Model training
>>> model.fit(x_train, Y_train, batch_size=batch_size, nb_epoch=nb_epochs, verbose=1)
#Model Prediction
>>> y_train_predclass = model.predict_classes(x_train, batch_size=batch_size)
>>> y_test_predclass = model.predict_classes(x_test, batch_size=batch_size)
>>> print ("\n\nDeep Neural Network - Train accuracy:", round(accuracy_score(y_train, y_train_predclass), 3))
>>> print ("\nDeep Neural Network - Train Classification Report")
>>> print (classification_report(y_train, y_train_predclass))
>>> print ("\nDeep Neural Network - Train Confusion Matrix\n")
>>> print (pd.crosstab(y_train, y_train_predclass, rownames=["Actual"], colnames=["Predicted"]))
Testing
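The test-side evaluation plausibly mirrors the training code above (a sketch, assuming the same variables):

>>> print ("\nDeep Neural Network - Test accuracy:", round(accuracy_score(y_test, y_test_predclass), 3))
>>> print ("\nDeep Neural Network - Test Classification Report")
>>> print (classification_report(y_test, y_test_predclass))
>>> print ("\nDeep Neural Network - Test Confusion Matrix\n")
>>> print (pd.crosstab(y_test, y_test_predclass, rownames=["Actual"], colnames=["Predicted"]))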