
Classify Webcam Images Using Deep Learning

The document describes using a convolutional neural network called AlexNet to classify images from a webcam in real time. AlexNet is a pretrained deep CNN that has been trained on over 1 million images and can classify images into 1000 categories. The CNN will take images from a webcam and identify objects in the surroundings. It discusses problems with image classification like lighting and occlusion. It also provides details on the architecture of CNNs, including convolutional and pooling layers, as well as the specific architecture of AlexNet, which has convolutional, max pooling, normalization and fully connected layers.

Uploaded by

gaurav

CLASSIFY WEBCAM IMAGES USING DEEP LEARNING
ABSTRACT

• Deep learning has emerged as a major development in machine learning and is being
applied to a wide range of signal and image applications. The main purpose of the
work presented in this paper is to apply a deep learning algorithm, namely the
convolutional neural network (CNN), to classifying webcam images in real time.
The pretrained deep convolutional neural network used here is AlexNet, which has
been trained on over a million images and can classify images into 1000 object
categories (such as keyboard, coffee mug, pencil, and many animals). AlexNet has
learned rich feature representations for a wide range of images. Images are
captured from the system webcam, and the pretrained AlexNet network identifies
objects in the surroundings.
PROBLEMS IN CLASSIFYING IMAGES

• Large amount of intra-class variability
• Different lighting conditions
• Misalignment
• Non-rigid deformation
• Occlusion
• Corruption
WHAT IS DEEP LEARNING?

• Deep learning (also known as deep structured learning or hierarchical
learning) is part of a broader family of machine learning methods based on
learning data representations, as opposed to task-specific algorithms. Learning
can be supervised, semi-supervised or unsupervised.
• Deep learning architectures such as deep neural networks, deep belief networks
and recurrent neural networks have been applied to fields including computer
vision, speech recognition, natural language processing, audio recognition, social
network filtering, machine translation, bioinformatics, drug design, medical
image analysis, material inspection and board game programs, where they have
produced results comparable to and in some cases superior to human experts.
WHY DEEP LEARNING?

• Learning features directly from the data of interest is a way of
remedying the limitations of hand-crafted features.
• Deep networks discover multiple levels of representation, with the hope that
higher-level features can represent more abstract semantics of the data. Such abstract
representations learned from a deep network are expected to provide greater
robustness to intra-class variability.
• One key ingredient in the success of deep learning in image classification is
the use of convolutional architectures. A convolutional deep neural network
(ConvNet) architecture consists of multiple trainable stages stacked on top of
each other, followed by a supervised classifier.
CNN

• A CNN is a class of feed-forward artificial neural networks, most commonly
applied to analyzing visual imagery. Convolutional neural networks are inspired by
biological processes: the connectivity pattern between neurons in a CNN resembles the
organization of the animal visual cortex. Individual cortical neurons respond to
stimuli only in a restricted region of the visual field known as the receptive field. The
receptive fields of different neurons partially overlap such that they cover the entire
visual field.
• CNNs use relatively little pre-processing compared to other image classification
algorithms, which means the network learns the filters that in traditional algorithms
were hand-engineered. This independence from prior knowledge and human effort
in feature design is a major advantage.
CNN-ALEXNET

• It was designed by Alex Krizhevsky and published with Ilya Sutskever and
Geoffrey Hinton. AlexNet competed in the ImageNet Large Scale Visual
Recognition Challenge in 2012.
• The network achieved a top-5 error of 15.3%, more than 10.8 percentage points
lower than that of the runner-up.
• AlexNet outputs a probability for each category for the image it captures
from the camera. The five categories with the highest probabilities are
shown, and a chart is prepared from them.
• AlexNet was trained more than 50000 times and gives more accurate
results than previously trained models.
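The top-five display described above can be sketched in a few lines of Python. This is a minimal illustration and not the code behind the slides: the function name `top_k` and the example labels and probabilities are made up for demonstration.

```python
def top_k(probabilities, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for a single webcam frame (illustrative values only).
example_labels = ["keyboard", "coffee mug", "pencil", "mouse", "monitor", "notebook"]
example_probs = [0.55, 0.20, 0.05, 0.10, 0.07, 0.03]

for label, prob in top_k(example_probs, example_labels):
    print(f"{label}: {prob:.2f}")
```

In a real application the probabilities would come from the network's softmax output, and the five pairs would feed the bar chart.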
ARCHITECTURE OF CNN

• A CNN consists of a number of convolutional and subsampling layers optionally followed
by fully connected layers. The input to a convolutional layer is an m x m x r image
where m is the height and width of the image and r is the number of channels, e.g.
an RGB image has r=3.
• The convolutional layer will have k filters (or kernels) of size n x n x q where n is
smaller than the dimension of the image and q can either be the same as the
number of channels r or smaller and may vary for each kernel. The size of the filters
gives rise to the locally connected structure; the filters are each convolved with the
image to produce k feature maps of size (m−n+1) x (m−n+1). Each map is then subsampled,
typically with mean or max pooling over p x p contiguous regions, where p ranges
from 2 for small images to usually not more than 5 for larger inputs.
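The size arithmetic above can be checked with a small sketch. The helper names and the example numbers (a 28 x 28 input, 5 x 5 filters, 2 x 2 pooling) are assumptions chosen for illustration, and stride 1 with no padding is assumed throughout.

```python
def conv_output_size(m, n):
    """Side length of a feature map for an m x m input and an n x n filter
    (stride 1, no padding), i.e. m - n + 1."""
    if n > m:
        raise ValueError("filter must not be larger than the input")
    return m - n + 1


def pool_output_size(size, p):
    """Side length after pooling over non-overlapping p x p regions."""
    return size // p


# Illustrative numbers only.
feature_map = conv_output_size(28, 5)      # 28 - 5 + 1 = 24
pooled = pool_output_size(feature_map, 2)  # 24 // 2 = 12
print(feature_map, pooled)
```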
A SIMPLE CONV-NET
OPERATIONS IN CONV-NET

• Convolution
• Non-Linearity (ReLU)
• Pooling or Sub Sampling
• Classification (Fully Connected Layer)
CONVOLUTION

• ConvNets derive their name from the "convolution" operator. The primary purpose of
convolution in a ConvNet is to extract features from the input image. Convolution
preserves the spatial relationship between pixels by learning image features using small
squares of input data. In CNN terminology, the small matrix (for example, 3×3) slid over
the image is called a 'filter', 'kernel' or 'feature detector', and the matrix formed by
sliding the filter over the image and computing the dot product at each position is
called the 'Convolved Feature', 'Activation Map' or 'Feature Map'. Filters thus act as
feature detectors on the original input image. In practice, a CNN learns the values of
these filters on its own during the training process (although we still need to specify
parameters such as the number of filters, filter size and network architecture
before training). The more filters we have, the more image features are extracted and
the better the network becomes at recognizing patterns in unseen images.
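The sliding dot product described above can be sketched in plain Python. This is a minimal illustration with stride 1 and no padding; the function name and the small binary image and filter are made-up examples, and a real network would learn the filter values during training.

```python
def convolve2d_valid(image, kernel):
    """Slide kernel over image (stride 1, no padding) and return the feature map.

    Like most CNN libraries, this computes a cross-correlation: the kernel
    is not flipped before the element-wise multiply and sum.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Illustrative 5 x 5 image and 3 x 3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

for row in convolve2d_valid(image, kernel):
    print(row)
```

The 5 x 5 input and 3 x 3 filter give a 3 x 3 feature map, matching the m−n+1 rule.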
NON-LINEARITY (RELU)

• An additional operation called ReLU is applied after every convolution
operation. ReLU stands for Rectified Linear Unit and is a non-linear
operation.
• ReLU is an element-wise operation (applied per pixel) that replaces all
negative pixel values in the feature map with zero. The purpose of ReLU is to
introduce non-linearity into the ConvNet, since most of the real-world data
we would want the ConvNet to learn is non-linear. Convolution itself is a
linear operation (element-wise multiplication and addition), so we
account for non-linearity by introducing a non-linear function such as ReLU.
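As a minimal sketch of the element-wise operation described above (the function name and the sample values are illustrative assumptions):

```python
def relu(feature_map):
    """Replace every negative value in a 2-D feature map with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


# Illustrative feature map containing negative responses.
feature_map = [[-3, 2], [5, -1]]
rectified = relu(feature_map)
print(rectified)
```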
POOLING STEP

• Spatial Pooling (also called subsampling or down-sampling) reduces the
dimensionality of each feature map but retains the most
important information. Spatial Pooling can be of different types: Max,
Average, Sum etc. In case of Max Pooling, we define a spatial
neighborhood (for example, a 2×2 window) and take the largest
element from the rectified feature map within that window. Instead of
taking the largest element we could also take the average (Average
Pooling) or sum of all elements in that window.
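The three pooling variants can be sketched with one helper. This is an illustration under simplifying assumptions: windows do not overlap, the feature-map side is a multiple of the window size, and the function name `pool2d` and the sample values are made up.

```python
def pool2d(feature_map, p=2, mode="max"):
    """Pool over non-overlapping p x p windows using max, average, or sum."""
    reducers = {
        "max": max,
        "sum": sum,
        "average": lambda window: sum(window) / len(window),
    }
    reduce_window = reducers[mode]
    pooled = []
    for i in range(0, len(feature_map) - p + 1, p):
        row = []
        for j in range(0, len(feature_map[0]) - p + 1, p):
            window = [feature_map[i + a][j + b] for a in range(p) for b in range(p)]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


# Illustrative 4 x 4 rectified feature map.
rectified = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(pool2d(rectified, 2, "max"))
print(pool2d(rectified, 2, "average"))
print(pool2d(rectified, 2, "sum"))
```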
FULLY CONNECTED LAYER

• The Fully Connected layer is a traditional Multi-Layer Perceptron that
uses a softmax activation function in the output layer (other classifiers
such as SVM can also be used, but we stick to softmax here). The
term "Fully Connected" implies that every neuron in the previous layer
is connected to every neuron in the next layer. The output from the
convolutional and pooling layers represents high-level features of the
input image. The purpose of the Fully Connected layer is to use these
features for classifying the input image into various classes based on
the training dataset.
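A fully connected layer followed by softmax can be sketched as below. The function names, the three-element feature vector, and the weights and biases are all illustrative assumptions; in a trained network the weights come from training and the feature vector from the convolutional and pooling layers.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: every output neuron is connected to every input."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up high-level features and parameters for a 3-class problem.
features = [1.0, 2.0, 0.5]
weights = [
    [0.2, 0.8, -0.5],
    [0.5, -0.9, 0.3],
    [-0.3, 0.1, 0.9],
]
biases = [0.0, 0.1, -0.1]

scores = dense(features, weights, biases)
probs = softmax(scores)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```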
ALEXNET ARCHITECTURE
DESCRIBING NETWORK

• The net contains eight layers with weights; the first five are convolutional and the remaining three are fully-connected.
The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000
class labels. The response-normalization layers follow the first and second convolutional layers. Max-pooling layers
follow both of the response-normalization layers as well as the last (fifth) convolutional layer. The ReLU non-linearity is
applied to the output of every convolutional and fully-connected layer.
• The input to the net is a 227 × 227 × 3 image. The filters for each convolutional layer are:
  • 96 kernels of size 11 × 11 × 3 with step size 4
  • 256 kernels of size 5 × 5 × 48* with step size 1
  • 384 kernels of size 3 × 3 × 256 with step size 1
  • 384 kernels of size 3 × 3 × 192* with step size 1
  • 256 kernels of size 3 × 3 × 192* with step size 1
* The kernel depth is half the number of kernels in the previous layer because the original network was split across two GPUs, and these layers take input only from the feature maps on the same GPU.
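The spatial size of each feature map can be traced from the kernel sizes and step sizes listed above. This is a sketch under stated assumptions: the padding values and the 3 × 3, stride-2 max-pooling windows are not given in these slides and are taken from the commonly described AlexNet configuration, and the helper name `output_size` is made up.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Side length after a convolution or pooling step:
    floor((size + 2 * padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding)
# Kernel sizes and strides of the conv layers follow the list above.
# Padding and all pooling parameters are ASSUMPTIONS, not from the slides.
layers = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]

size = 227  # input is 227 x 227 x 3
sizes = {}
for name, kernel, stride, padding in layers:
    size = output_size(size, kernel, stride, padding)
    sizes[name] = size
    print(f"{name}: {size} x {size}")
```

Under these assumptions the first convolution produces 55 × 55 maps, since (227 − 11) / 4 + 1 = 55.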
THANK YOU
