Seminar Report On Object Detection and Tracking
Bachelor of Engineering
in
Computer Science And Engineering
Submitted by
Pravin Kumar: (Roll No. 19UCSE4012)
CERTIFICATE
This is to certify that the work contained in this report entitled “Object Detection and
Tracking” is submitted by Mr. Pravin Kumar (Roll No. 19UCSE4012) to the
Department of Computer Science & Engineering, M.B.M. Engineering College,
Jodhpur, in partial fulfillment of the requirements for the degree of Bachelor of
Engineering in Computer Science & Engineering.
He has carried out his work under my supervision. This work has not been
submitted elsewhere for the award of any other degree or diploma.
The work, in our opinion, has reached the standard required for the degree of
Bachelor of Engineering in Computer Science & Engineering in accordance
with the regulations of the Institute.
Abhisek Gour
(Assistant Professor)
Dept. of Computer Science & Engg.
M.B.M. Engineering College, Jodhpur
N.C. Barwar
(Head)
Dept. of Computer Science & Engg.
M.B.M. Engineering College, Jodhpur
DECLARATION
I, Pravin Kumar, hereby declare that this seminar report titled “Object Detection and
Tracking” is a record of original work done by me under the supervision and guidance
of Prof. Abhisek Gour.
I further certify that this work has not formed the basis for the award of any
Degree/Diploma/Associateship/Fellowship or similar recognition to any candidate of
any university, and that no part of this report has been reproduced verbatim from any
other source without appropriate reference and permission.
SIGNATURE OF STUDENT
(Pravin Kumar)
7th Semester, CSE
Enroll. - < Enroll No>
Roll No. - 19UCSE4012
ACKNOWLEDGEMENT
ABSTRACT
Object detection and tracking is a critical area of research due to routine
changes in object motion and variations in scene size, occlusions,
appearance, and illumination. Feature selection, in particular, plays a vital
role in object tracking. The problem is central to many real-time
applications such as vehicle perception and video surveillance. To
overcome the difficulties that object movement and appearance pose for
detection and tracking, most algorithms focus on the tracking stage,
smoothing estimates over the video sequence. A few methods instead use
prior information about object shape, color, texture, and so on.
Contents
References…………………………………………………………………….. 40
List of Figures
Chapter 1
INTRODUCTION
Object detection is the process of detecting a target object in an image or a single frame of
the video. Object tracking refers to the ability to estimate or predict the position of a target
object in each consecutive frame in a video once the initial position of the target object is
defined.
Object tracking, using video sensing techniques, is one of the major areas of research due
to its increasing commercial applications such as surveillance systems, mobile robots,
medical therapy, security systems and driver assistance systems. Object tracking, by
definition, is to track an object (or multiple objects) over a sequence of images.
Tracking is usually performed in higher-level applications that require the location and
shape of the object in every frame. The most popular application in this area is vision-based
surveillance, to help understand the movement patterns of people with suspicious
actions. Traffic scene analysis is also a well-known application, where tracking
information helps keep vehicles in lane and prevent accidents. Thus, object
detection and tracking under dynamic conditions is still a challenge for real-time
performance which requires the computational complexity to be minimum. Various
methods for object detection have been proposed; such as feature-based, template-based
object detection and background subtraction. But selection of the best technique for a
specific application is relative and dependent upon the hardware resources and scope of
the application. Feature-based detection searches for corresponding features in
successive frames, including Harris corner, edges, SIFT, contours or colour pixels.
Background subtraction is a popular method which assumes a static background and
calculates the difference between the hypothesized background and the current image.
This approach is fast and works well for a fixed background, but it cannot deal with
dynamic environments, changing illumination, or the motion of small objects. The goal of
tracking is to establish a correspondence between the detected target objects of images
over frames. Tracking using a mean-shift kernel has also been introduced. This method
performs well when there is occlusion, which can be handled using templates. Camshift
(Continuously Adaptive Meanshift) can track a single object quickly and robustly using color
features, but it is ineffective under occlusion. There is also research on appearance-based
object detection. It uses whole 2-D images to perform tracking for navigation in faster
time. However, this kind of approach requires several templates and does not work
when the target object, color or perspective view changes. The main problems in
object detection and tracking are the temporal variations of objects due to perspective,
occlusion, interaction between objects, and the appearance or disappearance of objects. As
a result, the appearance of a target tends to change during long tracking. The background
in a long image sequence is also dynamic even if it is taken by a stationary camera.
Detection and tracking of multiple objects at the same time is an important issue for
real-time performance. Comprehensive search in multiple-object tracking is computationally
expensive and incapable of running as a real-time system. Another issue arises when using
a moving camera instead of a camera at a fixed location, which requires analysis of the
camera platform's coordinate system.
Video surveillance is an active research topic in computer vision that tries to detect,
recognize and track objects over a sequence of images and it also makes an attempt to
understand and describe object behavior, replacing the age-old traditional method
of monitoring cameras by human operators. Object detection and tracking are important
and challenging tasks in many computer vision applications such as surveillance,
vehicle navigation and autonomous robot navigation. Object detection involves locating
objects in the frame of a video sequence. Every tracking method requires an object
detection mechanism either in every frame or when the object first appears in the video.
Object tracking is the process of locating an object or multiple objects over time using a
camera. High-powered computers, the availability of high-quality and inexpensive
video cameras, and the increasing need for automated video analysis have generated a
great deal of interest in object tracking algorithms. There are three key steps in video
analysis: detection of interesting moving objects, tracking of such objects from
frame to frame, and analysis of object tracks to recognize their behavior. Therefore,
the use of object tracking is pertinent to tasks such as motion-based recognition.
Automatic detection, tracking, and counting of a variable number of objects are crucial
tasks for a wide range of home, business, and industrial applications such as security,
surveillance, management of access points, urban planning, traffic control, etc. However,
these applications have still not played an important part in consumer electronics. The
main reason is that they impose strong requirements for satisfactory working
conditions: specialized and expensive hardware, complex installation and setup
procedures, and supervision by qualified workers. Some works have focused on
developing automatic detection and tracking algorithms that minimize the need for
supervision. They typically use a moving-object likelihood function that evaluates each
hypothetical object configuration against the set of available detections without explicitly
computing their data association. Thus, a considerable saving in computational cost is
achieved. In addition, the likelihood function has been designed to account for noisy,
false and missing detections. The field of machine (computer) vision is concerned with
problems that involve interfacing computers with their surrounding environment. One
such problem, surveillance, has the objective of monitoring a given environment and
reporting information about the observed activity that is of significant interest. In this respect,
video surveillance usually utilizes electro-optical sensors (video cameras) to collect
information from the environment. In a typical surveillance system, these video cameras
are mounted in fixed positions or on pan-tilt devices and transmit video streams to a
certain location, called the monitoring room. Then, the received video streams are
monitored on displays and traced by human operators. However, human operators
might face many issues while monitoring these sensors. One problem is
that the operator must navigate through the cameras as a suspicious object
moves between their limited fields of view, without missing any other
object in the meantime. Thus, monitoring becomes more and more challenging, as the
number of sensors in such a surveillance network increases. Therefore, surveillance
systems must be automated to improve the performance and eliminate such operator
errors. Ideally, an automated surveillance system should require only the objectives of
an application, for which real-time interpretation and robustness are needed. Then, the
challenge is to provide robust and real-time performing surveillance systems at an
affordable price. With the decrease in costs of hardware for sensing and computing, and
the increase in the processor speeds, surveillance systems have become commercially
available, and they are now applied to a number of different applications, such as traffic
monitoring, airport and bank security, etc. However, machine vision algorithms
(especially for single camera) are still severely affected by many shortcomings, like
occlusions, shadows, weather conditions, etc. As these costs decrease almost on a daily
basis, multi-camera networks that utilize 3D information are becoming more available.
Although the use of multiple cameras leads to better handling of these problems
compared to a single camera, multi-camera surveillance is still not the
ultimate solution. There are some challenging problems within the surveillance
algorithms, such as background modeling, feature extraction, tracking, occlusion
handling and event recognition. Moreover, machine vision algorithms are still not
robust enough to handle fully automated systems and many research studies on such
improvements are still being done. This work focuses on developing a framework to
detect moving objects and generate reliable tracks from surveillance video. The problem
is that most of the existing algorithms work on grayscale video, and converting
RGB video frames to grayscale loses information. The
main problem arises when the background and the foreground have approximately the
same gray values; it is then difficult for the algorithm to decide which pixel is
foreground and which is background. Two different colors, such
as dark blue and dark violet, may map to gray values so close
to each other that it cannot be determined which value came from dark
blue and which from dark violet. If color images are used, however, the
background and foreground colors can be easily differentiated. Therefore, to avoid losing
color information, this modified background model works directly on the color
frames of the video.
Every tracking method requires an object detection mechanism either in every frame or
when the object first appears in the video. A common approach for object detection is to
use information in a single frame. However, some object detection methods make use of
the temporal information computed from a sequence of frames to reduce the number of
false detections.
1. Point detectors - Point detectors are used to find interesting points in images which
have an expressive texture in their respective localities. A desirable quality of an interest
point is its invariance to changes in illumination and camera viewpoint. In the literature,
commonly used interest point detectors include Moravec's detector, the Harris detector,
the KLT detector, and the SIFT detector.
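To make the idea concrete, the Harris detector's corner response can be sketched in a few lines of NumPy. This is a simplified illustration (plain box window, no Gaussian weighting, no non-maximum suppression), not a production detector:

```python
import numpy as np

def harris_response(img, k=0.05, win=3):
    """Simplified Harris corner response: R = det(M) - k*trace(M)^2,
    where M is the structure tensor summed over a small window."""
    img = img.astype(float)
    # Image gradients (np.gradient returns derivatives along rows, then columns)
    Iy, Ix = np.gradient(img)
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy
    pad = win // 2
    def box_sum(a):
        # Sum each pixel's win x win neighbourhood (zero-padded box filter)
        out = np.zeros_like(a)
        ap = np.pad(a, pad)
        for dy in range(win):
            for dx in range(win):
                out += ap[dy:dy + a.shape[0], dx:dx + a.shape[1]]
        return out
    Sxx, Syy, Sxy = box_sum(Ixx), box_sum(Iyy), box_sum(Ixy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace ** 2

# A white square on a black background: corners should score highest
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
corner_score = R[5, 5]    # a corner of the square: large positive response
edge_score = R[5, 10]     # middle of an edge: negative response
flat_score = R[10, 10]    # inside the square: response near zero
```

The response is large and positive only where the gradient varies in both directions, which is exactly the invariance property the text describes: corners remain distinctive under viewpoint change, while edges and flat regions do not.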
The aim of an object tracker is to generate the trajectory of an object over time by
locating its position in every frame of the video. Tracking has two definitions: literally,
it is locating a moving object or multiple objects over a period of time using a
camera; technically, it is the problem of estimating the trajectory
or path of an object in the image plane as it moves around a scene. The tasks of
detecting the object and establishing a correspondence between the object instances
across frames can either be performed separately or jointly. In the first case, possible
object regions in every frame are obtained by means of an object detection algorithm, and
then the tracker matches objects across frames. In the latter case, the object region
and correspondence are jointly estimated by iteratively updating object location and
region information obtained from previous frames. There are different methods of
tracking.
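As a toy illustration of the first (detect-then-associate) case, a tracker can greedily link each existing track to its nearest new detection by centroid distance. The function below is a hypothetical minimal sketch, not any specific published tracker:

```python
import math

def associate(prev_tracks, detections, max_dist=30.0):
    """Greedy nearest-neighbour association between existing tracks
    (id -> (x, y) centroid) and new detections [(x, y), ...].
    Returns an updated {id: (x, y)}; unmatched detections get new ids."""
    tracks = {}
    unmatched = list(range(len(detections)))
    for tid, (px, py) in prev_tracks.items():
        if not unmatched:
            break
        # Pick the closest remaining detection for this track
        j = min(unmatched, key=lambda i: math.dist((px, py), detections[i]))
        if math.dist((px, py), detections[j]) <= max_dist:
            tracks[tid] = detections[j]
            unmatched.remove(j)
    next_id = max(prev_tracks, default=-1) + 1
    for i in unmatched:  # leftover detections spawn new tracks
        tracks[next_id] = detections[i]
        next_id += 1
    return tracks

frame1 = {0: (10, 10), 1: (50, 50)}
frame2_detections = [(52, 51), (12, 9), (90, 90)]
tracks = associate(frame1, frame2_detections)
# Track 0 follows the object near (10, 10), track 1 the one near (50, 50),
# and the detection at (90, 90) starts a new track with id 2.
```

Real trackers replace the greedy matching with globally optimal assignment (e.g. the Hungarian algorithm) and predict each track's position before matching, but the association step has this basic shape.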
Chapter 2
History And Evolution
In this chapter we review the history of object detection from the “traditional object
detection period (before 2014)” to the “deep learning based detection period (after 2014)”.
Viola-Jones Detector:
Developed in 2001 by Paul Viola and Michael Jones, this object recognition framework
allows the detection of human faces in real time. It uses sliding windows to go through all
possible locations and scales in an image to see if any window contains a human
face. The sliding window essentially searches for ‘Haar-like’ features (named after
Alfréd Haar, who developed the concept of Haar wavelets).
Thus the Haar wavelet is used as the feature representation of an image. To speed up
detection, it uses the integral image, which makes the computational complexity of each
sliding window independent of its window size. Another trick used by the authors to
improve detection speed is the AdaBoost algorithm for feature selection, which selects
a small set of features that are most helpful for face detection from a huge pool of
candidate features. The algorithm also uses detection cascades, a multi-stage structure
that quickly rejects windows unlikely to contain a face and spends more computation on
promising regions.
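The integral-image trick mentioned above is easy to sketch: after one pass to build a summed-area table, the sum over any rectangle costs four lookups, whatever the window size. A minimal NumPy illustration (not the authors' implementation):

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = sum of img[:y+1, :x+1]."""
    return np.cumsum(np.cumsum(img, axis=0), axis=1)

def window_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from 4 lookups, regardless of window size."""
    total = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1 - 1]      # strip above the window
    if x0 > 0:
        total -= ii[y1 - 1, x0 - 1]      # strip left of the window
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]      # corner subtracted twice: add back
    return total

img = np.arange(25).reshape(5, 5)
ii = integral_image(img)
rect = window_sum(ii, 1, 1, 4, 4)        # equals img[1:4, 1:4].sum()
```

A Haar-like feature is then just the difference of two or more such rectangle sums, which is why evaluating thousands of features per window stays cheap.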
HOG Detector:
The Histogram of Oriented Gradients (HOG) detector, proposed by Navneet Dalal and
Bill Triggs in 2005, represents an image by a dense grid of gradient-orientation
histograms and became a standard hand-crafted feature for pedestrian detection.
Unfortunately, object detection reached a plateau after 2010 as the performance of
hand-crafted features became saturated. However, in 2012 the world saw the rebirth of
convolutional neural networks, and deep convolutional networks proved successful at
learning robust and high-level feature representations of an image. The deadlock in
object detection was broken in 2014 by the proposal of Regions with CNN features
(RCNN) for object detection. In this deep learning era, object detection is grouped into
two genres: “two-stage detection” and “one-stage detection”.
RCNN starts with the extraction of a set of object proposals (object candidate boxes) by
selective search. Each proposal is then rescaled to a fixed-size image and fed into a
pre-trained CNN model to extract features. Finally, linear SVM classifiers are used to
predict the presence of an object within each region and to recognize object categories.
You Only Look Once (YOLO):
YOLO, proposed by Joseph Redmon et al. in 2015, was the first one-stage detector of
the deep learning era: it applies a single neural network to the full image, dividing it
into regions and predicting bounding boxes and class probabilities simultaneously,
which makes it much faster than two-stage detectors.

Background Modeling:
Background modeling is widely used to recognize moving objects [10]. Background
modeling yields a reference model. This
reference model is used in background subtraction, in which each video frame is
compared against the reference model to determine possible variation. The variations
between the current video frame and the reference frame, in terms of pixels, signify the
existence of moving objects. Currently, mean filters and median filters are widely used to
realize background modeling. The background subtraction method uses the
difference between the current image and the background image to detect moving objects.
The algorithm is simple, but it is very sensitive to changes in the external environment
and has poor anti-interference ability. However, it can provide the most complete object
information when the background is known. As described in the literature, background
subtraction has two main approaches:
1. Recursive algorithm - Recursive techniques do not maintain a buffer for background
estimation. Instead, they recursively update a single background model based on each
input frame. As a result, input frames from distant past could have an effect on the
current background model. Compared with non-recursive techniques, recursive
techniques require less storage, but any error in the background model can linger for a
much longer period of time. This technique includes various methods such as the
approximate median, adaptive background, and mixture of Gaussians.
2. Non-Recursive Algorithm - A non-recursive technique uses a sliding-window
approach for background estimation. It stores a buffer of the previous L video frames,
and estimates the background image based on the temporal variation of each pixel
within the buffer. Non-recursive techniques are highly adaptive as they do not depend
on the history beyond those frames stored in the buffer. On the other hand, the storage
requirement can be significant if a large buffer is needed to cope with slow-moving
traffic.
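The recursive approach above can be sketched as a running-average background model: the background is updated in place from each frame, and pixels that differ from it by more than a threshold are marked foreground. A minimal NumPy illustration with assumed parameter values:

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Recursive (running-average) background update: no frame buffer needed.
    alpha controls how quickly the model adapts to scene changes."""
    return (1 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=0.25):
    """Background subtraction: pixels far from the model are foreground."""
    return np.abs(frame - bg) > thresh

# Learn a static empty scene, then an object enters in the next frame
bg = np.zeros((8, 8))
for _ in range(50):
    bg = update_background(bg, np.zeros((8, 8)))
frame = np.zeros((8, 8))
frame[2:4, 2:4] = 1.0                 # a 2x2 bright object appears
mask = foreground_mask(bg, frame)     # True exactly on the object pixels
```

The trade-off discussed above is visible here: the model stores only one array (low storage), but a wrong update lingers, decaying by a factor of (1 - alpha) per frame rather than being dropped when a buffer slides past it.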
Tracking can be defined as the problem of approximating the path of an object in the
image plane as it moves around a scene. The purpose of object tracking is to generate
the route of an object over time by finding its position in every single frame of the
video. Objects are tracked for object extraction, object recognition, and
decisions about activities. Object tracking can be classified into point
tracking, kernel-based tracking, and silhouette-based tracking. For illustration, point
trackers involve detection in every frame, while kernel-based
or contour-based tracking requires detection only when the object first appears in the
scene. Tracking methods can be divided into the following categories:
1. Kalman Filter
Kalman filtering is based on an optimal recursive data-processing algorithm. The Kalman
filter performs conditional probability density propagation. It is a set of
mathematical equations that provides an efficient computational (recursive) means to
estimate the state of a process: it supports estimation of past, present,
and even future states, and it can do so even when the precise nature of the
modelled system is unknown. The Kalman filter estimates a process by using a form of
feedback control: the filter estimates the process state at some time and then obtains
feedback in the form of noisy measurements. The equations for the Kalman filter fall into
two groups: time-update equations and measurement-update equations. The time-update
equations are responsible for projecting forward (in time) the current state and error
covariance estimates to obtain the a priori estimate for the next time step. The
measurement-update equations are responsible for the feedback. The Kalman filter
gives optimal solutions when the system is linear and the noise is Gaussian.
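The predict/update cycle described above can be sketched for a simple 1-D constant-velocity tracking model. This is an illustrative NumPy implementation with assumed noise parameters, not code from any cited work:

```python
import numpy as np

def kalman_track(measurements, dt=1.0, q=1e-4, r=0.25):
    """1-D constant-velocity Kalman filter.
    State x = [position, velocity]; measurements observe position only."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition model
    H = np.array([[1.0, 0.0]])              # measurement matrix
    Q = q * np.eye(2)                       # process noise covariance
    R = np.array([[r]])                     # measurement noise covariance
    x = np.array([[measurements[0]], [0.0]])
    P = np.eye(2)                           # initial error covariance
    estimates = []
    for z in measurements:
        # Time update: project state and covariance forward
        x = F @ x
        P = F @ P @ F.T + Q
        # Measurement update: blend the prediction with the observation
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # Kalman gain
        x = x + K @ (np.array([[z]]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        estimates.append(float(x[0, 0]))
    return estimates

# Object moving at constant velocity with noisy position readings:
rng = np.random.default_rng(0)
true_pos = np.arange(30, dtype=float)
noisy = true_pos + rng.normal(0, 0.5, size=30)
est = kalman_track(noisy)   # smoothed position estimates, one per frame
```

The two loop halves correspond directly to the time-update and measurement-update equation groups named in the text, with K weighting how much each noisy measurement corrects the prediction.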
2. Particle Filtering
Particle filtering generates all the models for one variable before moving to the next
variable. The algorithm has an advantage when variables are generated dynamically and
there can be unboundedly many variables. It also allows for the new operation of
resampling. One restriction of the Kalman filter is the assumption that the state variables
are normally distributed (Gaussian); the Kalman filter is therefore a poor approximation
for state variables that do not follow a Gaussian distribution. This restriction can be
overcome by using particle filtering. The algorithm usually uses contours, color features,
or texture mapping. The particle filter is a Bayesian sequential importance sampling
technique which recursively approximates the posterior distribution using a finite set of
weighted samples. It consists of essentially the same two phases as Kalman
filtering: prediction and update. It was developed in the computer vision community,
applied to tracking problems, and is also known as the Condensation algorithm.
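A minimal bootstrap particle filter shows the prediction, weighting, and resampling steps described above. The motion and measurement models here are illustrative assumptions (1-D random walk, Gaussian likelihood), not the Condensation algorithm itself:

```python
import numpy as np

def particle_filter(measurements, n=500, motion_std=0.5, meas_std=1.0, seed=1):
    """Bootstrap particle filter tracking a 1-D position."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(measurements[0], 2.0, n)   # initial belief
    estimates = []
    for z in measurements:
        # Prediction: diffuse particles with the motion model
        particles = particles + rng.normal(0, motion_std, n)
        # Update: weight each particle by its measurement likelihood
        w = np.exp(-0.5 * ((z - particles) / meas_std) ** 2)
        w /= w.sum()
        # Resampling: redraw the particle set in proportion to the weights
        particles = rng.choice(particles, size=n, p=w)
        estimates.append(float(np.mean(particles)))
    return estimates

true_pos = np.linspace(0, 10, 40)
rng = np.random.default_rng(0)
noisy = true_pos + rng.normal(0, 1.0, size=40)
est = particle_filter(noisy)   # one posterior-mean estimate per frame
```

Because the belief is carried by samples rather than a mean and covariance, nothing in this loop assumes a Gaussian posterior, which is exactly the advantage over the Kalman filter noted above.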
3. Shape Matching
These approaches search for the object model in the current frame. Shape-matching
performance is similar to template-based tracking in the kernel approach. Another
approach to shape matching is to find matching silhouettes detected in two successive
frames. Silhouette matching can be considered similar to point matching. Silhouette-based
detection is carried out by background subtraction. Object models take the
form of density functions, silhouette boundaries, or object edges. These methods are
capable of dealing with a single object, and occlusion handling can be performed with
Hough transform techniques.
Chapter 3
Similar Technologies
3.1 Lidar
* Uses laser beams: - LiDAR technology uses light pulses or laser beams to determine
the distance between the sensor and the object. The laser travels to the object and is
reflected back to the source and the time taken for the laser to be reflected back is then
used to calculate the distance.
* It has a higher measurement accuracy: - Unlike RADAR, LiDAR data has a higher
accuracy of measurement because of its speed and short wavelength. Also, LiDAR
targets specific objects which contributes to the accuracy of the data relayed.
* Data can be collected quickly: - Because of the speed and accuracy of the laser
pulses from LiDAR sensors, data can be collected quickly and with utmost accuracy.
This is why LiDAR sensors are used in high-capacity and data-intensive applications.
* It does not have geometric distortions: - LiDAR sensors are highly accurate and
are therefore not affected by geometric distortions. The data collected will be precise
and accurate and will map the exact location of the object in the image.
* It can be integrated with other data sources: - LiDAR data can easily be integrated
with other data sources such as GPS and used in mapping and calculation of distances.
This can also be applied in forest mapping and other remote sensing technologies.
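The time-of-flight calculation in the first bullet is simply distance = (speed of light × round-trip time) / 2, since the pulse travels to the target and back. A small sketch:

```python
def lidar_distance(round_trip_seconds, c=299_792_458.0):
    """Distance from a LiDAR time-of-flight reading.
    The pulse covers the sensor-target distance twice (out and back),
    so the round-trip time is halved."""
    return c * round_trip_seconds / 2.0

# A pulse returning after about 667 nanoseconds indicates a target
# roughly 100 metres away.
d = lidar_distance(667e-9)
```

The same formula, with the speed of sound in water in place of c, underlies the sonar depth measurement described in Section 3.3.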
3.2 Radar
* It can operate in cloudy weather conditions and during the night: - Unlike
LiDAR, RADAR technology is not affected by adverse weather conditions such as
clouds, rainfall, or fog.
* Cannot detect smaller objects: - It does not allow the detection of smaller objects
due to longer wavelengths. This means that data regarding very tiny objects on the
surface may be distorted or insufficient.
* No 3D replica of the object: - It cannot provide an exact 3D image of the object due
to the longer wavelength. This means that the image will be a representation of the
object but not an exact replica of the object’s characteristics.
* Determines distance from objects and their angular positions: - Apart from the
distance from an object, RADAR technology can also provide the angular positions of
objects from the surface, a characteristic that cannot be measured by LiDAR.
* Radar beam can incorporate many targets: - A RADAR beam can have several
targets at the same time and return data on several objects at the same time. However,
this may exclude smaller objects within the target field.
* Radar may not distinguish multiple targets that are close together: - RADAR
technology cannot distinguish multiple targets within a surface that are closely
entangled together. The data may therefore not be accurate.
* RADAR takes more time to lock on an object: - RADAR, unlike LiDAR pulses,
travels at a slower speed which means more time is needed to lock onto an object and
return data regarding the object.
3.3 Sonar
* Uses sound waves: - Sonar stands for Sound Navigation and Ranging. It transmits
sound waves that return in the form of echoes, which are used to analyze various
qualities or attributes of the target or object.
* Mostly used to find actual sea depth: - Because of its unique capabilities of
penetrating seawater, sonar is mainly used to calculate the depth of the sea because it is
fast and accurate.
* Is not affected by surface factors: - The sound waves are not affected by the
calmness or the roughness of the water surface. They can penetrate even tides and still
get the necessary data required.
* It has adverse effects on marine life: - Sound waves from sonar have adverse effects
on marine life such as whales that also depend on sound waves.
* Sonar generates a lot of noise: - The sound waves from the transmitters usually
generate a lot of noise, which also affects the marine life living in the deep sea.
* Passive sonar does not require a transmitter and a receiver: - Unlike active sonar
that transmits with the help of a transmitter and also relies on a receiver, passive sonar
does not transmit. It only listens.
* Scattering: - Active sonar may lead to scattering from small objects as well as the sea
bottom and surface which may cause interference.
Chapter 4
Applications, Pros & Cons
One of the best examples of why object detection is needed is autonomous driving.
In order for a car to decide what to do next, whether to accelerate, apply the brakes, or
turn, it needs to know where all the objects around the car are and what those objects are.
That requires object detection: we would essentially train the car to detect a known set
of objects such as cars, pedestrians, traffic lights, road signs, bicycles, motorcycles, etc.
3. TRACKING OBJECTS -
Object detection system is also used in tracking the objects, for example tracking a ball
during a football match, tracking movement of a cricket bat, tracking a person in a
video. Object tracking has a variety of uses, some of which are surveillance and security,
traffic monitoring, video communication, robot vision and animation.
Face detection and face recognition are widely used computer vision tasks. We have all
noticed how Facebook detects our face when we upload a photo; this is a simple
application of object detection that we see in our daily life. Face detection can be
regarded as a specific case of object-class detection. In object-class detection, the task is
to find the locations and sizes of all objects in an image that belong to a given class.
Examples include upper torsos, pedestrians, and cars. Face detection is a computer
technology being used in a variety of applications that identifies human faces in digital
images. Face recognition describes a biometric technology that goes way beyond
recognizing when a human face is present. It actually attempts to establish whose face it
is. Face-detection algorithms focus on the detection of frontal human faces. It is
analogous to image detection, in which the image of a person is matched bit by bit
against the images stored in a database. Any change to the facial features in the
database will invalidate the matching process.
There are lots of applications of face recognition. Face recognition is already being used
to unlock phones and specific applications. It is also used for biometric
surveillance: banks, retail stores, stadiums, airports and other facilities use facial
recognition to reduce crime and prevent violence.
Iris recognition is one of the most accurate identity verification systems. Identity
verification and identification is becoming increasingly popular, and advances in the
field have expanded the options to include biometrics such as the iris, the retina and
more. Among this large set of options, the iris has been shown to be the most accurate
biometric. Hence an object detection system is needed for iris detection.
6. OBJECT COUNTING -
An object detection system can also be used to count the number of objects in an
image or a real-time video.
People counting: Object detection can also be used for people counting, for example to
analyze store performance or crowd statistics during festivals. This tends to be more
difficult, as people move out of the frame quickly (and also because people are non-rigid
objects).
7. SMILE DETECTION -
Facial expression analysis plays a key role in analyzing emotions and human behaviors.
Smile detection is a special task in facial expression analysis with various potential
applications such as photo selection, user experience analysis and patient monitoring.
8. MEDICAL IMAGING
Normally CCTV runs all the time, so a large amount of memory is needed to
store the recorded video. Using an object detection system, we can automate CCTV in
such a way that recording starts only when some object is detected. This
reduces the repeated recording of the same image frames, which increases
memory efficiency and decreases the memory requirement.
11. ROBOTICS
(A comparison table of detection techniques, rating methods such as frame
differencing from low to high in accuracy and computational cost, appeared here;
its original layout could not be recovered.)
Chapter 5
Summary
This report gives an idea of object detection and tracking, surveying some
techniques for the detection and tracking of objects. Object detection is the process of
detecting a target object in an image or a single frame of a video. We first saw some
existing algorithms for detecting objects, such as the frame-difference method, optical
flow, and background subtraction.
Object tracking refers to the ability to estimate or predict the position of a target object
in each consecutive frame in a video once the initial position of the target object is
defined.
In object tracking we first looked at point tracking, in which moving
objects are represented by their feature points during tracking; here we covered
point tracking methods such as the Kalman filter, particle filtering, and multiple
hypothesis tracking.
The second technique of object tracking is kernel-based tracking, for which we covered
methods such as simple template matching, the mean-shift method, support vector
machines (SVM), and layering-based tracking.
The third technique of object tracking is the silhouette-based tracking approach. Some
objects have complex shapes, such as hands, fingers and shoulders, that cannot be well
described by simple geometric shapes. Silhouette-based methods afford an accurate
shape description for such objects. The aim of silhouette-based object tracking is to find
the object region in every frame by means of an object model generated from the
previous frames. These methods are capable of dealing with a variety of object shapes,
occlusion, and object splitting and merging, using two algorithms: contour tracking and
shape matching.
References
[1] Evolution of Object Detection. Analytics Vidhya.
https://medium.com/analytics-vidhya/evolution-of-object-detection-582259d2aa9b
[3] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object Tracking: A Survey.
ACM Computing Surveys (CSUR), 38(4):13, 2006.
[4] Daniel J. Finnegan. Object Detection and Tracking in Images and Point Clouds.
https://ps2fino.github.io/documents/Daniel_J._Finnegan-Thesis.pdf
[5] Rajkamal Kishor Gupta. Object Detection and Tracking in Video Image.
http://ethesis.nitrkl.ac.in/6256/1/E-1.pdf