Wearable Object Detection System for Visually Challenged People & Blind Stick
Affiliated to
Maulana Abul Kalam Azad University of Technology
(Formerly WBUT), 2019
Name: Arkadev Kundu, 12616003188 / 161260120057
Sanjib Basak, 12616003209 / 161260120078
Moumita Majumder, 12616003204 / 161260120073
Soumyadip Debnath, 12616003216 / 161260120085
Heritage Institute of Technology
(An Autonomous Institute)
Department of
Electronics and Communication Engineering
Arkadev Kundu, 12616003188 / 161260120057
Sanjib Basak, 12616003209 / 161260120078
Moumita Majumder, 12616003204 / 161260120073
Soumyadip Debnath, 12616003216 / 161260120085
Bachelor of Technology
In
Electronics and Communication Engineering
Maulana Abul Kalam Azad University of Technology
(Formerly WBUT), 2019
Dept. of Electronics and Communication Engineering
Heritage Institute of Technology, Kolkata-700107.
Certificate of Recommendation
This is to certify that the thesis entitled “Wearable Object Detection System for Visually
Challenged People” & “Blind Stick”, submitted by Arkadev Kundu, Sanjib Basak, Moumita
Majumder and Soumyadip Debnath under the supervision of Prof. Chandrima Roy (Assistant
Professor, Dept. of ECE, HITK), has been prepared according to the regulations of the B.Tech.
degree in Electronics and Communication Engineering awarded by Maulana Abul Kalam Azad
University of Technology (Formerly WBUT), that they have fulfilled the requirements for
submission of the thesis report, and that the thesis report has not been submitted for any
degree/diploma or any other academic award anywhere before.
…………………………………………………….………
Prof. Chandrima Roy
(Assistant Professor, Dept of ECE, HITK)
Project Supervisor
……………………………… ……………………….
Prof. Prabir Banerjee External Examiner
(HOD, Dept of ECE, HITK)
Heritage Institute of Technology
(An Autonomous Institute)
Affiliated to
Maulana Abul Kalam Azad University of
Technology
(Formerly WBUT)
Certificate of Approval
The foregoing thesis report is hereby approved as a creditable study of an engineering
subject carried out and presented in a manner satisfactory to warrant its acceptance as a
prerequisite to the degree for which it has been submitted. It is understood that by this
approval the undersigned do not necessarily endorse or approve any statement made, opinion
expressed or conclusion drawn therein, but approve the project report only for the purpose
for which it is submitted.
Signature of the Examiners:
1…………………………………………….
2…………………………………………….
3…………………………………………….
Contents
Part A
1. Introduction
1.1. Problem Statement
1.2. Applications
1.3. Challenges
2. Related Work
2.1. Bounding Box
2.2. Classification + Regression
2.3. Two-stage Method
2.4. Unified Method
3. Approach
4. Experimental Results
4.1. Dataset
4.2. Implementation Details
4.2.1. Pre-processing
4.2.2. Network
4.3. Qualitative Analysis
4.4. Quantitative Analysis
5. Conclusion
6. References
Part B
1. Introduction
2. Methodology
2.1. System Architecture
2.2. Microcontroller
2.3. Ultrasonic Sensor
2.4. Buzzer
2.5. Remote Control Unit
2.6. Circuit Diagram
2.7. Program
CONCLUSION
REFERENCES
Abstract
For blind people the world is filled with darkness, and as engineers it is our duty to give
them a light of hope. To brighten their lives, efficient and accurate object detection is an
important topic. With the advent of deep learning techniques, the accuracy of object
detection has increased drastically. This project aims to incorporate state-of-the-art techniques
for object detection with the goal of achieving high accuracy with real-time performance. A major
challenge in many object detection systems is the dependency on other computer vision
techniques to support the deep learning based approach, which leads to slow and non-optimal
performance. In this project, we use a completely deep learning based approach to solve the
problem of object detection in an end-to-end fashion. The network is trained on a most
challenging publicly available dataset, on which an object detection challenge is conducted
annually. The resulting system is fast and accurate, thus aiding applications that require
object detection. Alongside the wearable object detection device, a blind stick is also important
for a visually challenged person.
Part A
1 Introduction
1.1 Problem Statement
Many problems in computer vision had saturated in accuracy a decade ago. However, with the
rise of deep learning techniques, the accuracy on these problems improved drastically. One
major problem was image classification, which is defined as predicting the class of an image.
A slightly more complicated problem is image localization, where the image contains a single
object and the system should predict the class and the location of the object in the image
(a bounding box around the object). The still more complicated problem of object detection
(the subject of this project) involves both classification and localization. In this case, the
input to the system is an image, and the output is a bounding box corresponding to each object
in the image, along with the class of the object in each box. An overview of all these problems
is depicted in Fig. 1.
1.2 Applications
A well known application of object detection is face detection, which is used in almost all
mobile cameras. A more generalized (multi-class) application is autonomous driving, where a
variety of objects need to be detected. Object detection also has an important role to play in
surveillance systems. We can also use such a system to help blind people, which is our main
agenda. It can further be used for tracking objects, and thus finds use in robotics and medical
applications. The problem therefore serves a multitude of applications.
Figure 2: (a) Surveillance, (b) Autonomous vehicles
1.3 Challenges
The major challenge in this problem is the variable dimension of the output, caused by the
variable number of objects that can be present in any given input image. Any general machine
learning task requires a fixed dimension of input and output for the model to be trained.
Another important obstacle to widespread adoption of object detection systems is the
requirement of real-time performance (>30 fps) while remaining accurate. The more complex the
model, the more time it requires for inference; the less complex the model, the lower the
accuracy. This trade-off between accuracy and speed must be chosen as per the application.
The problem also involves both classification and regression, which must be learned
simultaneously; this adds to the complexity of the problem.
2 Related Work
There has been a lot of work on object detection using traditional computer vision techniques
(sliding windows, deformable part models). However, these lack the accuracy of deep learning
based techniques. Among the deep learning based techniques, two broad classes of methods are
prevalent: two-stage detection (RCNN [1], Fast RCNN [2], Faster RCNN [3]) and unified
detection (YOLO [4], SSD [5]). The major concepts involved in these techniques are explained
below.
2.1 Bounding Box
The bounding box is a rectangle drawn on the image which tightly fits the object in the image.
A bounding box exists for every instance of every object in the image. For each box, four
numbers (center x, center y, width, height) are predicted. This can be trained using a distance
measure between the predicted and ground truth bounding boxes. The distance measure is the
Jaccard distance, which computes the intersection over union (IoU) between the predicted and
ground truth boxes, as shown in Fig. 3.
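The intersection-over-union computation just described can be sketched as a short Python function (a minimal illustrative sketch, not the project's actual code; boxes use the (center x, center y, width, height) format mentioned above):

```python
def iou(box_a, box_b):
    """Intersection over Union (Jaccard index) for two boxes
    given as (center_x, center_y, width, height)."""
    # convert to corner coordinates (x1, y1, x2, y2)
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # width and height of the intersection rectangle (zero if disjoint)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give an IoU of 1.0, disjoint boxes give 0.0, and partial overlaps fall in between.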
2.3 Two-stage Method
In this case, proposals are extracted using some other computer vision technique and then
resized to a fixed input size for the classification network, which acts as a feature extractor.
An SVM is then trained to classify between object and background (one SVM for each class). A
bounding box regressor is also trained to output corrections (offsets) for the proposal boxes.
The overall idea is shown in Fig. 5. These methods are very accurate but computationally
intensive (low fps).
Figure 5: Two-stage method: (a) Stage 1, (b) Stage 2
2.4 Unified Method
The major techniques that follow this strategy are SSD (which uses different activation maps
at multiple scales for predicting classes and bounding boxes) and YOLO (which uses a single
activation map for predicting classes and bounding boxes). Using multiple scales helps achieve
a higher mAP (mean average precision) by better detecting objects of different sizes in the
image. The technique used in this project is therefore SSD.
3. Approach
The network used in this project is based on the Single Shot Detector (SSD) [5]. The
architecture is shown in Fig. 7.
SSD normally starts from a VGG [6] model, which is converted to a fully convolutional network.
Some extra convolutional layers are then attached, which help handle bigger objects. The output
of the VGG network is a 38x38 feature map (conv4_3). The added layers produce 19x19, 10x10,
5x5, 3x3 and 1x1 feature maps. All these feature maps are used for predicting bounding boxes
at various scales (later layers are responsible for larger objects). The overall idea of SSD is
shown in Fig. 8. Some of the activations are passed to a sub-network that acts as a classifier
and a localizer.
Figure 8: Overall idea of SSD
Anchors (a collection of boxes overlaid on the image at different spatial locations, scales and
aspect ratios) act as reference points on the ground truth images, as shown in Fig. 9. A model
is trained to make two predictions for each anchor:
- A discrete class
- A continuous offset by which the anchor needs to be shifted to fit the ground-truth bounding
box
Figure 9: Anchors
During training, SSD matches ground truth annotations with anchors. Each element of the feature
map (cell) has a number of anchors associated with it. Any anchor with an IoU (Jaccard overlap)
greater than 0.5 is considered a match. Consider the case shown in Fig. 10, where the cat has
two anchors matched and the dog has one anchor matched; note that they are matched on
different feature maps.
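The matching rule above can be sketched as follows (an illustrative sketch, not the actual SSD implementation; boxes are assumed to be in corner format, and the function names are mine):

```python
def iou_corners(a, b):
    """IoU for two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_anchors(anchors, gt_boxes, threshold=0.5):
    """Assign each anchor the best-overlapping ground-truth box,
    provided the overlap exceeds the threshold; unmatched anchors
    are treated as background."""
    matches = {}
    for i, anchor in enumerate(anchors):
        best_j, best_overlap = -1, threshold
        for j, gt in enumerate(gt_boxes):
            overlap = iou_corners(anchor, gt)
            if overlap > best_overlap:
                best_j, best_overlap = j, overlap
        if best_j >= 0:
            matches[i] = best_j
    return matches
```

In the Fig. 10 example, the cat would appear twice in the match dictionary (two anchors) and the dog once.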
The loss function used is the multi-box classification and regression loss. The classification
loss is the softmax cross entropy and, for regression, the smooth L1 loss is used. During
prediction, non-maximum suppression is used to filter the multiple boxes per object that may
be matched, as shown in Fig. 11.
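Greedy non-maximum suppression, as used at prediction time, can be sketched like this (illustrative only; the corner box format and the suppression threshold are assumptions, not values from the project):

```python
def iou_corners(a, b):
    """IoU for two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.45):
    """Keep the highest-scoring box, drop every remaining box that
    overlaps it above iou_thresh, and repeat on what is left."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou_corners(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

Two heavily overlapping detections of the same object thus collapse to the single higher-scoring box.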
4 Experimental Results
4.1 Dataset
For the purpose of this project, a publicly available dataset hosted on GitHub is used. It
consists of 90 annotated images, downloaded from Google.
Software Packages
TensorFlow - TensorFlow is a free and open-source software library for dataflow and
differentiable programming across a range of tasks. It is a symbolic math library, and is also used
for machine learning applications such as neural networks. It is used for both research and
production at Google. TensorFlow was developed by the Google Brain team for internal Google
use. It was released under the Apache 2.0 open-source license on November 9, 2015.
eSpeakNG - eSpeakNG is a compact, open-source software speech synthesizer for Linux,
Windows, and other platforms. It uses a formant synthesis method, providing many languages in
a small size. Much of the programming for eSpeakNG's language support is done using rule files
with feedback from native speakers.
Hardware
Raspberry Pi 3B+ - The system (Raspberry Pi 3B+) on which the model is trained and evaluated
has the following specifications: CPU: ARM Cortex-A53 @ 1.4 GHz, RAM: 1 GB. The Raspberry Pi
is a series of small single-board computers developed in the United Kingdom by the Raspberry
Pi Foundation to promote the teaching of basic computer science in schools and in developing
countries. The original model became far more popular than anticipated, selling outside its
target market for uses such as robotics. It does not include peripherals (such as keyboards and
mice) or cases, although some accessories have been included in several official and unofficial
bundles.
Pi Camera - A complete Raspberry Pi camera module with a 5 MP camera sensor.
Connection - The circuit connection is very simple, as shown in the image below. The Raspberry
Pi is connected to a mobile phone or laptop through a portable hotspot; from that laptop or
phone, the system can be switched on or off.
4.2.1 Pre-processing
The annotated data is provided in XML format, which is read and stored into a pickle file along
with the images so that subsequent reads are faster. The images are also resized to a fixed
size.
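A sketch of this caching step, assuming Pascal-VOC-style XML annotations (the tag names and function names here are illustrative, not the project's actual code):

```python
import pickle
import xml.etree.ElementTree as ET

def parse_annotation(xml_text):
    """Parse one VOC-style annotation into (filename, objects)."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        objects.append({
            "class": obj.findtext("name"),
            "box": tuple(int(box.findtext(tag))
                         for tag in ("xmin", "ymin", "xmax", "ymax")),
        })
    return filename, objects

def cache_annotations(xml_texts, out_path):
    """Store all parsed annotations in one pickle for fast reloading."""
    data = dict(parse_annotation(t) for t in xml_texts)
    with open(out_path, "wb") as f:
        pickle.dump(data, f)
```

Reading one pickle at training time is much faster than re-parsing ninety XML files on every run.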
4.2.2 Network
The model consists of a base network derived from VGGNet, followed by the modified
convolutional layers for fine-tuning, and then the classifier and localizer networks. This
creates a deep network which is trained end-to-end on the dataset.
4.3 Qualitative Analysis
The results on the custom dataset are shown in the table, which is divided into model, ground
truth and prediction columns.
4.4 Quantitative Analysis
The evaluation metric used is mean average precision (mAP). For a given class, the
precision-recall curve is computed. Recall is defined as the proportion of all positive
examples ranked above a given rank. Precision is the proportion of all examples above that rank
which are from the positive class. The AP summarizes the shape of the precision-recall curve,
and is defined as the mean precision at a set of eleven equally spaced recall levels
[0, 0.1, ..., 1]. Thus, to obtain a high score, high precision is desired at all levels of
recall. This measure is better than area under the curve (AUC) because it gives importance to
sensitivity. Detections were assigned to ground truth objects and judged to be true/false
positives by measuring bounding box overlap. To be considered a correct detection, the area of
overlap between the predicted bounding box and the ground truth bounding box must exceed a
threshold. Detections assigned to ground truth objects satisfying the overlap criterion were
ranked in order of (decreasing) confidence. Multiple detections of the same object in an image
were considered false detections, i.e. 5 detections of a single object count as 1 true positive
and 4 false positives. If no prediction is made for an object, it is considered a false
negative.
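The eleven-point interpolated AP described above can be written directly (a minimal sketch; the inputs are the recall/precision pairs of the ranked detections):

```python
def eleven_point_ap(recalls, precisions):
    """11-point interpolated average precision: the mean, over
    recall levels 0.0, 0.1, ..., 1.0, of the maximum precision
    observed at recall >= that level."""
    ap = 0.0
    for level in [i / 10 for i in range(11)]:
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        ap += max(candidates) if candidates else 0.0
    return ap / 11
```

A detector with perfect precision at every recall level scores 1.0; one that reaches only 50% recall at perfect precision scores 6/11, since five of the eleven recall levels contribute zero.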
The average precision for each object category is reported in the table below.
Class      Average Precision
Dog        0.891
Clock      0.719
Bottle     0.786
Laptop     0.864
Keyboard   0.728
Part B
1. Introduction - Ever heard of Hugh Herr? He is a famous American rock climber who has
shattered the limitations of his disabilities; he is a strong believer that technology can help
disabled persons live a normal life. In one of his TED talks Herr said, “Humans are not
disabled. A person can never be broken. Our built environment, our technologies, are broken
and disabled. We the people need not accept our limitations, but can transcend disability
through technological innovation.” These were not just words; he lived his life by them, and
today he uses prosthetic legs and claims to live a normal life. So yes, technology can indeed
neutralize human disability; with this in mind, let us use the power of Arduino and simple
sensors to build a blind man's stick that can do more than an ordinary stick for visually
impaired persons.
2. Methodology-
2.1. System Architecture - The proposed system design of the smart stick, shown in Fig. 1,
is composed of the following units:
2.3. Ultrasonic Sensor HC-SR04 - Ultrasonic sound is sound above the frequency range of human
hearing, and can be used in a variety of applications such as sonic rulers, proximity
detectors, movement detectors and liquid level measurement.
Ultrasonic Ranging Module HC-SR04
Features: The ultrasonic ranging module HC-SR04 provides 2 cm - 400 cm non-contact
measurement, with ranging accuracy up to 3 mm. The module includes an ultrasonic transmitter,
a receiver and a control circuit. The basic principle of operation is:
1. Drive the trigger (IO) pin high for at least 10 µs.
2. The module automatically sends eight 40 kHz pulses and detects whether a pulse signal comes
back.
3. If a signal comes back, the echo pin goes high; the duration of this high level is the time
from sending the ultrasonic burst to receiving its return.
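The timing relation above converts directly into a distance: the echo pulse width times the speed of sound, halved because the pulse covers the round trip. A minimal illustrative sketch (the constant and function names are mine, not from the module's datasheet):

```python
SPEED_OF_SOUND_CM_PER_US = 0.0343  # roughly 343 m/s in air at room temperature

def echo_to_distance_cm(echo_high_us):
    """Convert the HC-SR04 echo pulse width (in microseconds) to the
    obstacle distance in cm; halved because the pulse spans the
    round trip to the obstacle and back."""
    return echo_high_us * SPEED_OF_SOUND_CM_PER_US / 2.0
```

For example, a 1000 µs echo pulse corresponds to an obstacle roughly 17 cm away.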
2.4. Buzzer
A buzzer is a transducer (converting electrical energy into mechanical energy) that typically
operates in the lower portion of the audible frequency range of 20 Hz to 20 kHz. This is
accomplished by converting an electric oscillating signal in the audible range into mechanical
energy, in the form of audible waves. The buzzer is used in this project to warn the blind
person of an obstacle by generating sound at a rate proportional to the distance from the
obstacle.
2.5. Remote Control Unit
If the stick is lost, the remote unit is used to find it. A DTMF module is used in the remote
unit.
DTMF was originally decoded by tuned filter banks. By the end of the 20th century, digital
signal processing became the predominant decoding technology. DTMF decoding algorithms
typically use the Goertzel algorithm. As DTMF signaling is often transmitted in-band with voice
or other audio signals present simultaneously, the DTMF signal definition includes strict
limits on timing (minimum duration and interdigit spacing), frequency deviations, harmonics,
and the amplitude relation of the two components with respect to each other (twist). The
decoder module uses the MT-8870 IC.
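As an illustration of what the MT-8870's internal filters accomplish, the Goertzel algorithm mentioned above can be sketched in a few lines (the function name, sample rate and block size are illustrative choices, not taken from the IC's datasheet):

```python
import math

def goertzel_power(samples, freq, sample_rate):
    """Goertzel algorithm: relative power of one target frequency in a
    signal, the standard way DTMF decoders test for each of the tones."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s1 = s2 = 0.0
    for x in samples:
        s = x + coeff * s1 - s2
        s2, s1 = s1, s
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

# Example: the DTMF digit '5' mixes a 770 Hz row tone and a 1336 Hz column tone
fs, n = 8000, 205  # 205 samples at 8 kHz is a common DTMF block size
digit5 = [math.sin(2 * math.pi * 770 * t / fs) +
          math.sin(2 * math.pi * 1336 * t / fs) for t in range(n)]
```

Running the detector at each of the eight DTMF frequencies, the two tones actually present in the signal yield far higher power than the absent ones, which identifies the digit.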
2.6. Circuit Diagram
Remote Control Unit Circuit
2.7. Program
#include <Ultrasonic.h>

int buzzer = 9;                  // buzzer on digital pin 9
Ultrasonic ultrasonic(12, 11);   // HC-SR04: trigger on pin 12, echo on pin 11

void setup() {
  Serial.begin(9600);
  pinMode(buzzer, OUTPUT);
}

void loop()
{
  int distance = ultrasonic.Ranging(CM);  // obstacle distance in cm
  if (distance < 50) {
    int dil = 2 * distance;      // beep interval shrinks as the obstacle nears
    digitalWrite(buzzer, HIGH);
    delay(dil);
    digitalWrite(buzzer, LOW);
    delay(dil);
  }
}
CONCLUSION
An accurate and efficient object detection system was developed which achieves metrics
comparable with existing state-of-the-art systems. This project uses recent techniques in the
fields of computer vision and deep learning. It can be used in real-time applications which
require object detection for pre-processing in their pipeline. An important extension would be
to train the system on video sequences for use in tracking applications; the addition of a
temporally consistent network would enable smooth detection, more optimal than per-frame
detection.
On the other hand, with the proposed architecture of the blind stick and object detection, if
constructed with the utmost accuracy, blind people will be able to move from one place to
another without help from others, which increases their autonomy. The developed smart stick
helps by alarming the person if any sign of danger or inconvenience is detected.
REFERENCES
[1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies
for accurate object detection and semantic segmentation. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2014.
[2] Ross Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time
object detection with region proposal networks. In Advances in Neural Information Processing
Systems (NIPS), 2015.
[4] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once:
Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2016.
[5] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu,
and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[6] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.