Deep Learning
Dr. Sanjeev Sharma
Object Detection

• In image classification, we assume there is only one main target object in the image, and the model's sole focus is to identify its category.
• In many situations, however, we are interested in multiple targets in the image. We want not only to classify them but also to obtain their specific positions in the image. In computer vision, we refer to such tasks as object detection.
• The figure illustrates the difference between image classification and object detection tasks.
• Object detection is widely used in many fields. For example, in self-driving technology, we need to plan routes by identifying the locations of vehicles, pedestrians, roads, and obstacles in a captured video image.
• Robots often perform this type of task to detect targets of interest, and systems in the security field need to detect abnormal targets, such as intruders or bombs.

General object detection framework

• Typically, an object detection framework has four components:
• Region proposal: An algorithm or a DL model is used to generate regions of interest (RoIs) to be further processed by the system. These are regions that the network believes might contain an object; the output is a large number of bounding boxes, each of which has an objectness score. Boxes with high objectness scores are then passed along the network layers for further processing.
• Feature extraction and network predictions: Visual features are extracted for each of the bounding boxes and evaluated to determine whether, and which, objects are present in the proposals (for example, by an object classification component).
• Non-maximum suppression (NMS): At this point, the model has likely found multiple bounding boxes for the same object. NMS helps avoid repeated detection of the same instance by combining overlapping boxes into a single bounding box for each object.
• Evaluation metrics: As with the accuracy, precision, and recall metrics in image classification tasks (see chapter 4), object detection systems have their own metrics to evaluate detection performance. In this section, we explain the most popular ones: mean average precision (mAP), the precision-recall curve (PR curve), and intersection over union (IoU).

Region proposals

• RoIs are regions that the system believes have a high likelihood of containing an object; that likelihood is called the objectness score (see figure).
• Regions with high objectness scores are passed to the next steps; regions with low scores are abandoned.
• The important thing to note is that this step produces a large number (thousands) of bounding boxes to be further analyzed and classified by the network. During this step, the network classifies each region as foreground (object) or background (no object) based on its objectness score: if the score is above a certain threshold, the region is considered foreground and pushed forward in the network. This threshold is configurable based on your problem; a minimal sketch of this filtering step follows below.
• If the threshold is too low, your network will exhaustively generate all possible proposals, and you will have a better chance of detecting all objects in the image. On the flip side, this is very computationally expensive and will slow down detection.
• The trade-off with generating region proposals is therefore the number of regions versus computational complexity, and the right approach is to use problem-specific information to reduce the number of RoIs.
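To make the thresholding step concrete, here is a minimal sketch in Python. The (x, y, w, h) box format, the example scores, and the 0.5 default threshold are illustrative assumptions, not values from the slides:

def filter_proposals(boxes, objectness_scores, threshold=0.5):
    """Keep only proposals whose objectness score clears the threshold.
    boxes: list of (x, y, w, h) tuples; objectness_scores: floats in [0, 1]."""
    foreground = []
    for box, score in zip(boxes, objectness_scores):
        if score >= threshold:  # foreground (object) candidate
            foreground.append((box, score))
    return foreground  # low-scoring regions are treated as background

# Example: three candidate regions; only the two confident ones survive.
boxes = [(10, 20, 50, 80), (12, 22, 48, 78), (200, 100, 30, 30)]
scores = [0.92, 0.85, 0.10]
print(filter_proposals(boxes, scores))

Raising the threshold keeps fewer regions and speeds up detection at the risk of missing objects, which is exactly the trade-off described above.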
Network predictions

• This component is the pretrained CNN used for feature extraction: it extracts features from the input image that are representative of the task at hand and uses those features to determine the class of each region. In object detection frameworks, people typically use pretrained image classification models to extract visual features, as these tend to generalize fairly well. For example, a model trained on the MS COCO or ImageNet dataset is able to extract fairly generic features.
• In this step, the network analyzes all the regions identified as having a high likelihood of containing an object and makes two predictions for each region:
• Bounding-box prediction: The coordinates that locate the box surrounding the object, represented as the tuple (x, y, w, h), where x and y are the coordinates of the center point of the bounding box and w and h are its width and height.
• Class prediction: The classic softmax function that predicts the class probability for each object.
• Since thousands of regions are proposed, each object will always have multiple bounding boxes surrounding it with the correct classification. For example, take a look at the image of the dog in figure 7.3. The network clearly found the object (dog) and classified it successfully, but the detection fired a total of five times because the dog was present in five of the RoIs produced in the previous step: hence the five bounding boxes around the dog in the figure.
• Although the detector located the dog in the image and classified it correctly, this is not exactly what we need. For most problems, we want just one bounding box per object, ideally the one that fits the object best. What if we are building a system to count dogs in an image? Our current system would count five dogs. We don't want that. This is where the non-maximum suppression technique comes in handy.

Non-maximum suppression (NMS)

• As you can see in figure 7.4, one of the problems of an object detection algorithm is that it may find multiple detections of the same object: instead of creating one bounding box around the object, it draws several boxes for the same object.
• NMS is a technique that makes sure the detection algorithm detects each object only once. As the name implies, NMS looks at all the boxes surrounding an object, finds the box that has the maximum prediction probability, and suppresses (eliminates) the others.
• The general idea of NMS is to reduce the number of candidate boxes to a single bounding box for each object. For example, if the object in the frame is fairly large and more than 2,000 object proposals have been generated, it is quite likely that many of them overlap significantly with each other and with the object. A minimal greedy NMS sketch appears below, after the FPS discussion.

Object-detector evaluation metrics

FRAMES PER SECOND (FPS) TO MEASURE DETECTION SPEED
• The most common metric used to measure detection speed is the number of frames per second (FPS). For example, Faster R-CNN operates at only 7 FPS, whereas SSD operates at 59 FPS. In benchmarking experiments, the authors of a paper state their network results as "Network X achieves mAP of Y% at Z FPS," where X is the network name, Y is the mAP percentage, and Z is the FPS.
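Here is the greedy NMS sketch referenced above: keep the highest-scoring box, discard any remaining box that overlaps it beyond an IoU threshold, and repeat. The corner-format (x1, y1, x2, y2) boxes, the example scores, and the 0.5 threshold are illustrative assumptions:

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the best box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Five heavily overlapping detections of the same dog collapse to one box.
boxes = [(10, 10, 110, 110), (12, 8, 108, 112), (9, 11, 111, 109),
         (11, 9, 109, 111), (10, 12, 112, 108)]
scores = [0.95, 0.90, 0.88, 0.80, 0.75]
print(nms(boxes, scores))  # -> [0]: only the highest-scoring box survives

Note that real detectors usually apply NMS per class; this sketch treats all boxes as one class for simplicity.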
MEAN AVERAGE PRECISION (mAP) TO MEASURE NETWORK PRECISION
• The most common evaluation metric used in object detection tasks is mean average precision (mAP). It is a percentage from 0 to 100, and higher values are better, but it is not the same as the accuracy metric used in classification.
• To understand how mAP is calculated, you first need to understand intersection over union (IoU) and the precision-recall curve (PR curve). Let's explain IoU and the PR curve and then come back to mAP.

INTERSECTION OVER UNION (IoU)
• This measure evaluates the overlap between two bounding boxes: the ground truth bounding box (B_ground truth) and the predicted bounding box (B_predicted). By applying IoU, we can tell whether a detection is valid (True Positive) or not (False Positive).
• Figure 7.5 illustrates the IoU between a ground truth bounding box and a predicted bounding box.
• The IoU value ranges from 0 (no overlap at all) to 1 (the two bounding boxes overlap each other 100%). The higher the overlap between the two bounding boxes, the better (figure 7.6).
• To calculate the IoU of a prediction, we need the following:
• The ground truth bounding box (B_ground truth): the hand-labeled bounding box created during the labeling process
• The predicted bounding box (B_predicted) from our model
• We calculate IoU by dividing the area of overlap by the area of the union, as in the following equation:

IoU = area(B_predicted ∩ B_ground truth) / area(B_predicted ∪ B_ground truth)

• IoU is used to define a correct prediction: a True Positive is a prediction with an IoU greater than some threshold. This threshold is a tunable value depending on the challenge, but 0.5 is a standard value. For example, some challenges, like Microsoft COCO, use mAP@0.5 (IoU threshold of 0.5) or mAP@0.75 (IoU threshold of 0.75). If the IoU value is above the threshold, the prediction is considered a True Positive (TP); if it is below the threshold, it is considered a False Positive (FP).

PRECISION-RECALL CURVE (PR CURVE)
• With TP and FP defined, we can now calculate the precision and recall of our detector for a given class across the testing dataset. As explained in chapter 4, we calculate precision and recall as follows (recall that FN stands for False Negative):

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

• A short code sketch at the end of this section, after the R-CNN overview, shows how TP, FP, precision, and recall are computed from IoU-matched detections.

Region-based convolutional neural networks (R-CNNs)
• The R-CNN family of object detection techniques, usually referred to as R-CNNs (short for region-based convolutional neural networks), was developed by Ross Girshick et al. in 2014.
• The R-CNN family expanded to include Fast R-CNN and Faster R-CNN in 2015 and 2016, respectively.

R-CNN
• R-CNN is the least sophisticated region-based architecture in its family, but it is the basis for understanding how the whole family of object-recognition algorithms works.
• It was one of the first large, successful applications of convolutional neural networks to the problem of object detection and localization, and it paved the way for more advanced detection algorithms.
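To tie IoU, TP/FP, precision, and recall together, here is the sketch referenced above. The greedy one-to-one matching, the corner-format boxes, and the 0.5 IoU threshold are illustrative assumptions (real benchmarks such as COCO also rank predictions by confidence when building the PR curve):

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def precision_recall(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Greedily match each prediction to an unused ground-truth box;
    matches with IoU >= threshold are TPs, the rest are FPs / FNs."""
    matched, tp = set(), 0
    for pred in pred_boxes:
        best_iou, best_gt = 0.0, None
        for i, gt in enumerate(gt_boxes):
            overlap = iou(pred, gt)
            if i not in matched and overlap > best_iou:
                best_iou, best_gt = overlap, i
        if best_iou >= iou_threshold:
            tp += 1
            matched.add(best_gt)
    fp = len(pred_boxes) - tp  # unmatched predictions
    fn = len(gt_boxes) - tp    # missed ground-truth objects
    precision = tp / (tp + fp) if pred_boxes else 0.0
    recall = tp / (tp + fn) if gt_boxes else 0.0
    return precision, recall

# Two correct detections and one spurious one: precision 2/3, recall 1.0.
preds = [(10, 10, 50, 50), (60, 60, 90, 90), (200, 200, 220, 220)]
gts = [(12, 12, 48, 48), (58, 62, 92, 88)]
print(precision_recall(preds, gts))

Sweeping a confidence threshold over the ranked detections and recomputing these two values produces the PR curve, and the area under that curve, averaged over classes, is the mAP described above.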