Convolutional Neural Networks for Object Detection in Aerial Imagery for Disaster Response and Recovery
Y. Pi, N.D. Nath, A.H. Behzadan
Department of Construction Science, Texas A&M University, 3137 TAMU, College Station, TX 77843, USA
Zachry Department of Civil Engineering, Texas A&M University, 3136 TAMU, College Station, TX 77843, USA
https://doi.org/10.1016/j.aei.2019.101009
Keywords: Disaster management; Convolutional neural network (CNN); Aerial reconnaissance; Deep learning; Unmanned aerial vehicle (UAV)

Abstract: Accurate and timely access to data describing disaster impact and extent of damage is key to successful disaster management (a process that includes prevention, mitigation, preparedness, response, and recovery). Airborne data acquisition using helicopters and unmanned aerial vehicles (UAVs) helps obtain a bird's-eye view of disaster-affected areas. However, a major challenge to this approach is robustly processing a large amount of data to identify and map objects of interest on the ground in real time. The current process is resource-intensive (it must be carried out manually) and requires offline computing (through post-processing of aerial videos). This research introduces and evaluates a series of convolutional neural network (CNN) models for ground object detection from aerial views of a disaster's aftermath. These models are capable of recognizing critical ground assets including building roofs (both damaged and undamaged), vehicles, vegetation, debris, and flooded areas. The CNN models are trained on an in-house aerial video dataset (named Volan2018) that was created using web mining techniques. Volan2018 contains eight annotated aerial videos (65,580 frames) collected by drone or helicopter from eight different locations in various hurricanes that struck the United States in 2017–2018. Eight CNN models based on the you-only-look-once (YOLO) algorithm are trained by transfer learning, i.e., pre-trained on the COCO or VOC dataset and re-trained on the Volan2018 dataset, and achieve 80.69% mAP for high-altitude (helicopter) footage and 74.48% mAP for low-altitude (drone) footage, respectively. This paper also presents a thorough investigation of the effect of camera altitude, data balance, and pre-trained weights on model performance, and finds that models trained and tested on videos taken from similar altitudes outperform those trained and tested on videos taken from different altitudes. Moreover, the CNN model pre-trained on the VOC dataset and re-trained on balanced drone video yields the best results in significantly shorter training time.
1. Introduction

According to the United Nations Office for Disaster Risk Reduction (UNISDR), in the 10-year period ending in 2014, natural disasters affected 1.7 billion people, claimed 700,000 lives, and cost 1.4 trillion dollars in damages [1]. With the changing climate, the frequency and severity of natural disasters are also on the rise [2], requiring people and governments to be better prepared and equipped to cope with the effects of such catastrophes. Timely retrieval and integration of disaster information is critical for effective disaster management. In particular, real-time spatial information about disaster damage and risk is of paramount importance for designing appropriate mitigation strategies and response plans [3]. In addition, identifying and locating objects of interest is a central issue in urban search and rescue (USAR) missions [4]. Recent advancements in personal handheld devices, mobile connectivity, and cloud storage have created new opportunities for data collection and spatial mapping before, during, and after a disaster strikes. Among various methods of data collection, unmanned aerial vehicles (UAVs), or drones, have drawn much attention due to their growing ubiquity, easy operation, large coverage area, and storage capacity. Compared to traditional methods of aerial data collection such as helicopter flyovers, UAV-based data collection is more affordable, less resource-intensive, and can provide high-resolution imagery from difficult-to-reach places. More importantly, with proper implementation, it can be outsourced to volunteer groups and ordinary citizens, thus reducing the workload on first responders and search and rescue (SAR) units while enabling data collection at scale.

The U.S. Federal Aviation Administration (FAA) estimates that the number of recreational drones in the United States will reach 2.4 million by the year 2022 [5], or one drone for approximately every 140 people.
However, some major challenges still hinder the widespread use of drones in large-scale operations that involve people, communities, and urban assets. On the operational side, existing FAA regulations require the presence of a human operator and the maintenance of a clear line-of-sight (LOS) at all times during the mission. Moreover, the FAA does not allow drone flights at night or over people, although some recent progress has been made to ease such restrictions [6]. On the logistical side, the current practice of drone data collection and processing is heavily human-centered and based on post-processing of captured data. A UAV reconnaissance team consists of at minimum three people, each trained and certified to conduct a particular task: one person flies the drone, another checks the drone camera view, and the third records and reports the status to first responders and SAR units.

The aim of this paper is to use artificial intelligence (AI) to address some of the problems associated with the logistics of carrying out drone missions in information-rich environments such as post-disaster operations. In particular, we introduce a fully annotated visual dataset called Volan2018, and train and test a convolutional neural network (CNN) architecture to detect, classify, and localize valuable ground assets in aerial drone views of natural disasters, with a special focus on hurricanes. CNN models apply convolutional computations to input matrices (digital images and videos) to extract useful features (e.g., object edges, shapes, and patterns), and ultimately predict object classes and locations. The parameters of a CNN model are produced by training the model on a large annotated dataset, which involves selecting the weights of the neural network such that the error between ground truth and prediction is gradually minimized toward an optimum point. Using the trained model to detect, classify, and localize objects in drone views with high speed (real time) and accuracy will help improve disaster information retrieval and exchange. Findings of this work are also sought to be a key component of information exchange platforms and decision support system (DSS) applications that integrate and share different data modalities with stakeholders involved in search, rescue, and recovery processes, ranging from ordinary people to first responders, law enforcement, local jurisdictions, insurance companies, and non-governmental organizations (NGOs).

2. Literature review

The digital revolution is characterized by a fusion of technologies that is blurring the lines between the physical, digital, and biological worlds, and has pushed the frontiers of science and discovery. In the last decade alone, the seamless integration of data into everyday life has resulted in exponentially large, multimodal datasets (e.g., photos, videos, blogs, and tweets) in the public domain, such as online content sharing platforms and social networking sites. Within the context of natural disasters, when communities participate in data collection and information exchange, new opportunities can emerge for better understanding of urban vulnerabilities, capacities, and risks, as well as for creating data-driven methods for damage assessment and recovery planning.

Several researchers have explored crowdsourced data collection. Craglia et al. [7] used user-generated content (UGC) such as Facebook and Twitter posts to produce volunteered geographic information (VGI) maps. Kim and Hastak [8] utilized machine learning (ML) and online social networks to obtain disaster-related information. Yuan et al. [9] proposed using semantic analysis of tweets to extract disaster information, and verified this method in hurricane Matthew. Goodchild and Glennon [10] investigated crowdsourcing the collection of geographic disaster information to the public in order to facilitate the information flow. More recently, researchers have been looking into the challenge of detecting ground objects from aerial drone footage. Baker et al. [11] proposed a Monte-Carlo algorithm for using UAVs to conduct post-disaster search, which was proven faster than the conventional sweeping search. Radovic et al. [12] used transfer learning based on the you-only-look-once (YOLO) algorithm to detect airplanes on the ground from aerial views. Han et al. [13] conducted real-time object detection in drone imagery using region-based CNN (R-CNN) and a kernelized correlation filter (KCF). Narayanan et al. [14] demonstrated the possibility of using a high-performance cloud computer for real-time aerial detection from UAVs. Guirado et al. [15] employed Faster R-CNN to track and count whales in satellite images. Previous work, however, has not systematically investigated and documented the problem of CNN-based aerial inspection in natural disasters.

In computer vision, the task of identifying objects in an image broadly includes four steps: classification, detection, localization, and semantic segmentation. Image classification (i.e., predicting the class to which an object belongs) has been developing rapidly since Russakovsky et al. [16] introduced ImageNet and Krizhevsky et al. proposed AlexNet [17]. Simonyan and Zisserman [18] introduced VGGNet, which has 11–19 layers and achieves better classification accuracy than AlexNet by increasing the number of network layers. Szegedy et al. [19] proposed GoogLeNet, whose inception modules apply convolution and max pooling in parallel, thus outperforming VGGNet. ResNet [20] introduced residual connections into the classification architecture, reducing the ImageNet top-5 error to 3.57%, below the error rate typically reported for an average human. However, while these methods can determine the existence of an object class in an image, they fail to recognize the location(s) and number of detected objects (a.k.a. detection and localization). One intuitive method of localizing objects in an image is to slide windows of different sizes across the image and classify the content of each window, a process that is extremely slow. Faster object detection and localization techniques use a parallel network that proposes smaller candidate regions in the image to locate multiple objects, followed by classifiers that predict the class of those candidate regions. Past research in this domain includes Girshick et al. [21], who introduced R-CNN, which uses region proposals to replace the extremely slow sliding-window method. Girshick [22] later introduced Fast R-CNN using region of interest (RoI) pooling, which led to a considerable improvement in prediction speed. Ren et al. [23] proposed Faster R-CNN, which adopts a region proposal network (RPN) to reduce calculation time, and achieves 73.2% and 70.4% mean average precision (mAP) on PASCAL VOC [24] 2007 and 2012, respectively. Dai et al. [25] introduced R-FCN, which generates a 3 × 3 position-sensitive map of the input and produces results close to Faster R-CNN but 2.5–20 times faster. Several one-stage detectors have been introduced recently with faster speeds than the above methods. For example, Redmon et al. [26] introduced YOLO, which reaches 45 frames per second (FPS) with a 63.4% mAP on VOC 2007. Liu et al. [27] proposed the single-shot detector (SSD), achieving 76.8% mAP on VOC 2007. Lin et al. [28] introduced RetinaNet, which yields 37.8% COCO average precision (COCO-AP) on COCO [29] by using a focal loss to better learn hard examples.

It must be noted that there is often a trade-off between accuracy and speed, as demonstrated in Table 1. Real-time spatial information is important as it allows decision-makers to model scenarios and interact with the spatiotemporal dimensions of a disaster as it evolves [30]. Since the objective of this work is to enable time-sensitive disaster information retrieval in aerial views, the computational speed of the CNN model is considered a major constraint.

Table 1
COCO average precision (COCO-AP) versus inference time (ms) on the COCO test-dev dataset, adapted from [28].
CNN architecture          COCO-AP (%)   Time (ms)
YOLOv2 [31]               21.6          25
SSD321 [27]               28.0          61
R-FCN [25]                29.9          85
DSSD513 [27]              33.2          156
FRCN [25]                 36.2          172
RetinaNet-101-800 [28]    37.8          198
The desired model must be able to operate in real time (i.e., 30 FPS or higher) and be light enough to run on ordinary drones. After careful examination of several CNN architectures, the authors selected YOLO v2 [31] as the benchmark model. YOLO v2 is one of the fastest object detection algorithms and can achieve 78.6% mAP on the PASCAL VOC 2007 [24] and 48.1% COCO-AP on the COCO [29] test-dev datasets. Details of this network architecture will be presented in Section 5, following a discussion of the designed methodology (Section 3) and a description of the developed dataset (Section 4).

3. Methodology

Given the growing intensity and frequency of hurricanes in the continental U.S., and the nature of the destruction left behind (high visibility from the air), we use hurricane damage data as a motivating case, with the long-term goal of adapting the designed approach to other disaster domains such as typhoons, earthquakes, wildfires, and landslides. By convention, the term "hurricane" describes a tropical cyclone that forms in the north Atlantic, northeastern Pacific, the Caribbean Sea, or the Gulf of Mexico [32]. The 2017 hurricane season was the costliest in U.S. history, with seventeen named storms including hurricanes Harvey, Irma, and Maria [33]. Hurricane forecasters predicted fifteen named storms and eight hurricanes for 2019 [34]. When a hurricane strikes, the immediate damage can include fallen branches, uprooted trees, toppled power lines, roof failure, and wall collapse. The resulting storm surge may also lead to extensive flooding, submerged vehicles, flooded roads, trapped people and livestock, and damage to critical infrastructure such as the power grid, chemical plants, and levees. The widespread extent of hurricane damage and the vast number of affected people and neighborhoods in coastal communities have been well documented over the years through videos taken by ordinary people, news channels, government agencies, and volunteer groups, and posted on web-based content sharing (e.g., YouTube) and social media (e.g., Facebook, Twitter) sites. Such visual content can be web-mined and used for training and testing disaster-domain CNN models. Fig. 1 illustrates the overall methodology of this research, which is explained at length in the following Subsections.

3.1. Identification of ground objects of interest (GOIs)

A comprehensive literature review was first conducted to identify the most important categories of objects on the ground that are of interest to first responders and SAR units and are visible from the air. Pinpointing the location of such objects and their potential progression with time is the focal point of disaster management operations. For instance, in rescue planning, SAR units need to know the location and number of trapped people and submerged vehicles. In order for first responders and SAR units to navigate from the point of deployment to the point of response, they need precise information about flooded roads and areas to avoid. Once floodwaters have receded, identifying debris locations is of the essence to municipalities to expedite cleanup and rebuilding. Similarly, insurance companies and disaster aid agencies such as the U.S. Federal Emergency Management Agency (FEMA) may be interested in obtaining an overall damage assessment of a subdivision or neighborhood (by counting the number of damaged roofs or toppled trees) to start processing claims and allocating funds. Considering these and other actions that are routinely undertaken by different stakeholders to quantify disaster damage, a list of key ground objects of interest (GOIs) is compiled and presented in Table 2.

3.2. Dataset preparation

Web mining was implemented to extract video content from YouTube. To yield the best results, the following keywords were used: hurricane, damage, drone, UAV, disaster, aftermath, and aerial. Videos that contained desirable classes (i.e., GOIs), as listed in Table 2, were then used to create an in-house annotated hurricane imagery dataset, named Volan2018. This dataset is further split into two subsets based on the viewpoint altitude of the video capturing platform. In particular, videos taken by a drone (flying at a relatively low altitude of below 300 feet) were grouped into the Volan2018-D ("D" for drone) dataset, and those taken by a helicopter (flying at a relatively high altitude of above 1,000 feet) were grouped into the Volan2018-H ("H" for helicopter) dataset. This distinction later allowed us to test the effect of altitude on the accuracy of GOI detection. In creating the Volan2018 dataset (regardless of D or H designation), we also included videos from different geographical locations and hurricanes, as well as videos of different lengths and content richness (number of GOIs). This approach was necessary to verify whether a CNN model trained on a particular disaster or location can still produce satisfactory results when tested on a different disaster or location, thus supporting domain adaptation and generalizability. As explained in Section 4, each subset (D and H) was thoroughly annotated and post-processed (to ensure content balancing), and then split into training (60%), validation (20%), and testing (20%) sets.
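The paper does not name the tooling used for the keyword-based web mining step. The following minimal sketch shows one way such a search could be scripted, assuming the third-party yt-dlp package and its ytsearch pseudo-URL (both assumptions, not part of the original workflow); returned candidates would still be screened manually against the GOI list in Table 2.

```python
# Hypothetical keyword search for candidate hurricane videos (assumes yt-dlp).
from yt_dlp import YoutubeDL

KEYWORDS = ["hurricane", "damage", "drone", "UAV", "disaster", "aftermath", "aerial"]

def search_candidates(max_results: int = 50):
    query = " ".join(KEYWORDS)
    opts = {"quiet": True, "extract_flat": True}  # metadata only, no downloads
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(f"ytsearch{max_results}:{query}", download=False)
    # Each flat entry carries at least a title and a watch URL for manual screening.
    return [(e.get("title"), e.get("url")) for e in info.get("entries", [])]

if __name__ == "__main__":
    for title, url in search_candidates(10):
        print(title, url)
```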
3.3. Model training, deployment, and performance

The YOLO v2 model was pre-trained on two large publicly available datasets, COCO [29] and VOC [24]. We used transfer learning to retrain the YOLO v2 [31] network (pre-trained on COCO or VOC) on the Volan2018 dataset. Past work has shown that through transfer learning, a CNN model can provide better and more consistent results on a relatively small dataset (a.k.a. the target dataset), since it learns useful and relevant intermediate features from a large dataset (a.k.a. the source dataset) [35,36]. For training, as explained in Section 5, several combinations are tested based on video altitude (D or H), pre-training dataset (COCO or VOC), and whether the training data is balanced (denoted by "B") or unbalanced (denoted by "U"). These experiments are necessary to determine the key factors affecting the performance of the CNN models. In total, eight training set combinations are created, namely D-B-COCO, D-B-VOC, D-U-COCO, D-U-VOC, H-B-COCO, H-B-VOC, H-U-COCO, and H-U-VOC. A detailed description of these combinations is provided in Section 5. The resulting CNN models are then tested on the testing portions of the Volan2018 dataset to output GOI positions (bounding boxes) and class labels. In addition, to examine the performance of the CNN models in GOI detection in aerial footage obtained from new locations, two additional unseen videos (captured from geographical regions not represented in the Volan2018 dataset) are tested. For performance measurement, precision and recall analysis is carried out to quantify the discrepancy between ground truth (annotated classes and bounding boxes) and model predictions (detected classes and bounding boxes). In addition, mAP, defined as the mean value of the average precisions across all classes, is calculated for each model. Section 5 contains the results of model performance and analysis.

Fig. 1. Research methodology.
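The eight training configurations named above are simply the Cartesian product of the three factors; a minimal sketch that enumerates them:

```python
from itertools import product

altitudes = ["D", "H"]        # drone (low altitude) vs. helicopter (high altitude)
balance = ["B", "U"]          # balanced vs. unbalanced training subset
pretrained = ["COCO", "VOC"]  # source dataset for transfer learning

combinations = ["-".join(c) for c in product(altitudes, balance, pretrained)]
print(combinations)
# ['D-B-COCO', 'D-B-VOC', 'D-U-COCO', 'D-U-VOC',
#  'H-B-COCO', 'H-B-VOC', 'H-U-COCO', 'H-U-VOC']
```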
Table 2
Key GOIs and their potential applications to disaster management operations.
GOI                                    Potential application
Flooded area                           Rescue planning, resource deployment, wayfinding, storm surge mapping, aid delivery
Building roofs (damaged, undamaged)    Damage information map, people rescue, insurance claims, construction repair, debris removal
Car                                    People rescue, insurance claims
Debris                                 Cleanup, damage information map, construction repair
Vegetation                             Cleanup
4. Data description

4.1. Data collection

A content-rich, properly annotated visual dataset is key to accurate model training for disaster GOI detection. There are few aerial image recognition datasets, such as AID [37], a large classification dataset retrieved from Google Earth, which contains 10,000 images in 30 classes; PatternNet [38], with 38 classes each containing 800 images for satellite-level object sensing; and Minidrone, an annotated drone video dataset created for surveillance of a parking lot with two main classes, humans and cars [39]. However, none of these datasets is semantically rich enough for natural disaster information collection, necessitating the creation of a new dataset in this research, named Volan2018. As previously explained, videos included in Volan2018 were obtained from YouTube using a semi-supervised web mining technique, and collectively contain the GOIs listed in Table 2. Overall, Volan2018 contains eight aerial videos (with 1280 × 720 resolution, 30 FPS) from different hurricanes that occurred during the 2017–18 hurricane seasons (including hurricanes Harvey, Maria, Irma, and Michael). Together, this video dataset covers locations in the city of Houston (Texas), southern areas of Texas (i.e., Port Aransas, Holiday Beach, and Rockport), Puerto Rico (U.S. territory), and Big Pine Key, Mexico Beach, and St Joe Beach (all in Florida). Table 3 presents detailed information on the Volan2018 video dataset.

As part of pre-processing, frames containing watermarks in Volan001 and 002, and black margins in Volan008, were removed. As mentioned in Section 3.2, considering the viewpoint altitude, Volan001, 002, and 003 are grouped as the drone dataset (D), while Volan004, 005, and 006 are grouped as the helicopter dataset (H). Also, Volan007 (drone video) and Volan008 (helicopter video) from hurricane Michael, which are different from Volan001 through Volan006, serve as completely unseen test videos to validate the real-world applicability of this work beyond common testing and validation practices in computer vision. While in traditional computer vision the same (already available) dataset is split into training and testing portions, here the goal is to train a CNN model in advance using past disaster data, and then test it on new disaster data captured from a different time and location (i.e., a new domain). In our experiments, we first train and test our models on Volan001 through Volan006, which cover six different locations in three hurricanes during the 2017 season (hurricanes Harvey, Maria, and Irma). We then test the trained models (without retraining) on Volan007 and Volan008, captured from the 2018 season (hurricane Michael) at two completely different locations (St Joe Beach and Mexico Beach, Florida), to assess the scalability and generalizability of the trained models to new situations.

Table 3
Volan2018 dataset description (hurricane, year, capture vehicle, duration, location, and number of instances per GOI).
Drone (train and test):
  Volan001: Harvey, 2017, drone, 84 s, Southern TX; flooded area 1,015; undamaged roof 1,814; damaged roof 1,457; car 1,046; debris 2,678; vegetation 123
  Volan002: Harvey, 2017, drone, 72 s, Houston, TX; flooded area 1,572; undamaged roof 1,174; damaged roof 871; car 612; debris 1,653; vegetation 0
  Volan003: Harvey, 2017, drone, 1,333 s, Houston, TX; flooded area 38,480; undamaged roof 37,661; damaged roof 296; car 24,710; debris 0; vegetation 38,976
Helicopter (train and test):
  Volan004: Irma, 2017, helicopter, 88 s, Big Pine Key, FL; flooded area 1,916; undamaged roof 1,481; damaged roof 2,217; car 666; debris 2,928; vegetation 597
  Volan005: Maria, 2017, helicopter, 219 s, Puerto Rico; flooded area 3,774; undamaged roof 4,003; damaged roof 3,416; car 3,129; debris 3,986; vegetation 2,005
  Volan006: Harvey, 2017, helicopter, 317 s, Rockport, TX; flooded area 8,975; undamaged roof 7,544; damaged roof 7,112; car 1,389; debris 11,412; vegetation 0
Unseen (test):
  Volan007: Michael, 2018, drone, 59 s, St Joe Beach, FL; flooded area 0; undamaged roof 1,657; damaged roof 1,779; car 896; debris 1,712; vegetation 246
  Volan008: Michael, 2018, helicopter, 14 s, Mexico Beach, FL; flooded area 0; undamaged roof 410; damaged roof 339; car 0; debris 410; vegetation 0

4.2. Dataset annotation

We annotate each frame of Volan2018 with DarkLabel [40] by drawing bounding boxes that individually cover each GOI listed in Table 2. One object in one frame is defined as one instance, so that the pixel coordinates of the bounding boxes (with the upper-left point of the frame serving as the origin) are obtained for training. Annotating consecutive frames of a video using DarkLabel costs one annotator approximately 2 s per frame, although this varies with the content of the frame and the number and diversity of objects of interest that must be annotated. For example, annotating a 10-minute video (18,000 frames) requires about 10 h of work. Fig. 2 illustrates annotation samples for each class following the annotation strategy below:

• Flooded area: Flooded areas in Volan2018 are widespread and connected, so a single bounding box could cover most of a frame (including other class instances). To avoid this, we split the flooded area by drawing multiple small bounding boxes that cover only the flooded area without including other GOIs.
• Debris: We ignore small-sized debris in annotation, since small amounts of debris will most likely not turn into major obstacles during disaster operations. Large-sized debris that could impede transportation or cause damage is annotated by drawing bounding boxes around its edges. Similarly, visible destruction to infrastructure and splintered houses are annotated as debris.
• Cars: Bounding boxes are drawn around the edges of each car, individually.
Fig. 2. Volan2018 annotation samples showing instances of (a) 2 cars, 6 undamaged roofs, 3 damaged roofs, and 2 debris fields; and (b) 3 cars, 12 undamaged roofs,
8 vegetations, and 3 flooded areas.
• Vegetation: We draw large boxes covering a tree cluster if those boxes do not cover any other class instance; otherwise, one separate box is drawn for each tree. Lawn is not classified as vegetation for the purpose of this annotation.
• Damaged roof: We draw bounding boxes that cover the entire roof (including the undamaged part), but do not include elements below the roof (e.g., walls).
• Undamaged roof: We draw bounding boxes that reach the roof drip edge (containing skylights, chimneys, etc.), but do not include elements below the roof (e.g., walls).
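The paper stores one instance per object per frame as a bounding box in pixel coordinates with the frame's upper-left corner as the origin; DarkLabel's exact export format is not given, so the record layout below is only an illustrative assumption of how such annotations could be held in memory.

```python
from dataclasses import dataclass

CLASSES = ["flooded_area", "undamaged_roof", "damaged_roof", "car", "debris", "vegetation"]

@dataclass
class Instance:
    """One annotated GOI instance in one video frame (pixel coordinates, origin at top-left)."""
    frame: int   # frame index within the video
    label: str   # one of CLASSES
    x_min: int
    y_min: int
    x_max: int
    y_max: int

# Example: a car annotated in frame 120 of a 1280 x 720 frame (values are illustrative).
example = Instance(frame=120, label="car", x_min=642, y_min=310, x_max=701, y_max=348)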
The instance distribution of Volan001 through 006 is shown in Fig. 3. In this Figure, each colored line represents one class, the x-axis indicates the frame number, and the y-axis indicates the number of instances. This distribution shows the variation of visible objects on the ground in the camera's viewpoint as the vehicle (drone or helicopter) flies over different areas. For example, in Volan005, a helicopter flies over flooded areas in frames 1 through 360, followed by areas that are not flooded in frames 361 through 1170, and then over flooded areas again in frames 1171 through 2460. A similar visual analysis can be performed for all classes in all six videos.

In order to quantify the class/instance diversity in the Volan2018 dataset, we introduce two indices that are calculated for each of the videos contained in the dataset. These are the Instance Number (IN), which refers to the total number of instances of each class (as listed in Table 3), and the Instances per Frame (IPF), which represents the average number of instances of a particular class per frame, calculated by Eq. (1). In this Equation, the Frame Number (FN) refers to the number of frames that contain at least one instance of that class.

Instance Per Frame (IPF) = Instance Number (IN) / Frame Number (FN)    (1)

Table 4 lists the IPF values for each class in each of the eight videos in Volan2018. Comparing the IPF values for the same class across different videos (taken from different locations or at different times) reveals the significance of a certain type of damage or hazard across time (temporal scale) and locations (spatial scale), as viewed in aerial imagery taken by drone or helicopter. For example, Volan005 contains more flooded areas than other videos, while the location represented in Volan003 has suffered less damage (as indicated by the smaller number of damaged roofs) than other locations. The last column of Table 4 lists the average IPF value for the entire Volan2018 dataset. Evidently, the most frequent classes in Volan2018 are flooded area (5.73), undamaged roof (4.68), and vegetation (3.15), whereas debris, car, and damaged roof do not appear frequently. The average IPF can assist future data collection by directing more attention to classes that are under-represented (lower IPF values), thus helping balance the dataset.

We also calculate a third index, called the damaged roof percentage (DRP), for each video using Eq. (2), in which n_DR and n_UR are the total number of instances belonging to the damaged roof and undamaged roof classes in each video, respectively. The DRP value reflects the extent of roof damage in a particular location, which could be a good indicator of the overall building damage in a neighborhood or subdivision, and can help inform insurance-related decision-making (i.e., quick estimation of claims). As listed in Table 4, Volan004 has the highest DRP value, whereas Volan003 has nearly no damaged roofs. Overall, damage quantification using the IPF and DRP indices can be of significant value to disaster management, particularly for response and recovery operations [41].

Damaged Roof Percentage (DRP) (%) = n_DR / (n_DR + n_UR)    (2)

4.3. Data balancing

The CNN model was initially trained on the entire Volan2018 dataset. However, given the long duration and large number of instances in Volan003, this model was heavily biased toward Volan003, i.e., the model produced very accurate predictions on the test frames of Volan003 but performed poorly on test frames from other videos. To remedy the problem of unbalanced data in model training, there are three general strategies: pre-processing, cost-sensitive learning, and a combination of both [42]. Considering crowdsourcing as a possible future direction of this work to scale up data collection using consumer-grade drones and low computational power, we favor pre-processing methods for data balancing, which include over-sampling the instances of minority classes and under-sampling the instances of majority classes [43]. In this research, to avoid overfitting, we apply under-sampling to the Volan2018 dataset. Prior to under-sampling, the balance ratio (BR) of each video in Volan2018 is calculated using Eq. (3), wherein i_{n,k} is the number of instances of class n in frame k, and c and f are the total number of classes and the total number of frames in each video, respectively. In addition, in this Equation, N̄ = (Σ_{n=1}^{c} Σ_{k=1}^{f} i_{n,k}) / c represents the average number of instances per class. A higher BR value indicates a less balanced video. For instance, as shown in Table 5, Volan003 has a BR value of 0.79, making it the least balanced video among all eight videos in the Volan2018 dataset. For reference, we also measure the BR of two popular datasets, namely COCO [29] with a BR of 0.84 and VOC [24] with a BR of 0.54, making VOC the more balanced dataset in comparison.

Balance Ratio (BR) = [Σ_{n=1}^{c} |Σ_{k=1}^{f} i_{n,k} − N̄|] / (c N̄)    (3)

To perform under-sampling and balance the dataset with respect to the number of instances in each class, two values are defined and used: a diversity balancing threshold (DBT) (Eq. (4)) and a quantity balancing threshold (QBT) (Eq. (5)).
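A minimal sketch of the three dataset indices (Eqs. (1)–(3)), assuming the per-frame annotations are available as a mapping from frame index to the list of class labels present in that frame (a hypothetical layout, not the paper's actual file format):

```python
from collections import Counter

def dataset_indices(frames, classes):
    """frames: dict mapping frame index -> list of class labels present in that frame."""
    counts = Counter(lbl for labels in frames.values() for lbl in labels)             # IN per class
    frames_with = Counter(lbl for labels in frames.values() for lbl in set(labels))   # FN per class

    # Eq. (1): instances per frame, counting only frames that contain the class at least once.
    ipf = {cl: counts[cl] / frames_with[cl] if frames_with[cl] else 0.0 for cl in classes}

    # Eq. (2): damaged roof percentage.
    n_dr, n_ur = counts["damaged_roof"], counts["undamaged_roof"]
    drp = 100.0 * n_dr / (n_dr + n_ur) if (n_dr + n_ur) else 0.0

    # Eq. (3): balance ratio, i.e., mean absolute deviation of per-class totals normalized by c * N_bar.
    c = len(classes)
    n_bar = sum(counts[cl] for cl in classes) / c
    br = sum(abs(counts[cl] - n_bar) for cl in classes) / (c * n_bar) if n_bar else 0.0
    return ipf, drp, br
```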
Fig. 3. Instance numbers for each class in Volan2018 Drone dataset (D) containing Volan001, 002, and 003, and Helicopter dataset (H) containing Volan004, 005,
and 006.
Table 4
Instances per frame (IPF) and damaged roof percentage (DRP) for the Volan2018 dataset.
Video:     Volan001  Volan002  Volan003  Volan004  Volan005  Volan006  Volan007  Volan008
DRP (%):   27        35        0         62        32        50        57        44
Table 5
Data balancing results for each video within the Volan2018 dataset (initial BR and frame count, BR_min and corresponding frame count, and number of frames finally selected).
Drone (D):
  Volan001: initial BR 0.49 (2,520 frames); BR_min 0.36 (1,045 frames); 925 frames selected
  Volan002: initial BR 0.51 (2,160 frames); BR_min 0.47 (925 frames, minimum); 925 frames selected
  Volan003: initial BR 0.79 (39,990 frames); BR_min 0.24 (3,084 frames); 925 frames selected
Helicopter (H):
  Volan004: initial BR 0.60 (6,570 frames); BR_min 0.29 (519 frames, minimum); 519 frames selected
  Volan005: initial BR 0.64 (2,640 frames); BR_min 0.24 (3,136 frames); 519 frames selected
  Volan006: initial BR 0.36 (9,510 frames); BR_min 0.20 (2,813 frames); 519 frames selected

First, if a frame k (1 ≤ k ≤ f) contains fewer distinct classes (c_k) than the DBT value (i.e., c_k < DBT), the frame is excluded from further consideration (training and testing), in favor of under-sampling. In Volan2018, the default value of DBT is set to 1 to eliminate frames that contain no instances of any class. Next, we exclude any frame in which the number of instances of a given class exceeds QBT (i.e., i_{n,k} > QBT). This comparison is done for QBT ∈ {1, ..., max_{1≤n≤c} max_{1≤k≤f} i_{n,k}}, and the BR corresponding to each QBT is computed. The QBT that yields the minimum BR, a.k.a. BR_min (hence, the most balanced dataset), is ultimately selected, and the frames that pass the test are preserved. A summary of these calculations is shown in Table 5. Finally, for each of the two subsets (D and H), the same number of frames from each video (selected randomly with uniform probability), equal to the minimum number of remaining frames across the videos in that subset, is selected for training. As shown in Table 5, this process leads to 925 frames per video in the D subset and 519 frames per video in the H subset. In addition, from this Table it is observed that after dataset balancing, Volan003 has the highest BR drop (from 0.79 to 0.24) and the largest frame loss (from 39,990 down to 925).

At the conclusion of the balancing process, the original Volan2018 dataset is split into six parts, namely Drone Balanced (D-B), Drone Unbalanced (D-U), Helicopter Balanced (H-B), Helicopter Unbalanced (H-U), Volan007, and Volan008. These parts are used in different combinations to train, validate, and test the CNN models. It must be noted that Volan007 and Volan008 are considered unseen videos (not used for initial training and testing, and exclusively kept for scalability assessment) and, therefore, are not balanced.
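A sketch of the under-sampling search described above, under the same assumed frame-to-labels mapping as the earlier index sketch: frames with fewer than DBT distinct classes are dropped, and the QBT value that minimizes the balance ratio of the surviving frames is kept.

```python
from collections import Counter

def balance_ratio(frames, classes):
    """Eq. (3) computed over an arbitrary subset of frames."""
    counts = Counter(lbl for labels in frames.values() for lbl in labels)
    c = len(classes)
    n_bar = sum(counts[cl] for cl in classes) / c
    return sum(abs(counts[cl] - n_bar) for cl in classes) / (c * n_bar) if n_bar else 0.0

def undersample(frames, classes, dbt=1):
    # Step 1: keep only frames with at least DBT distinct classes (c_k >= DBT).
    kept = {k: v for k, v in frames.items() if len(set(v)) >= dbt}
    max_count = max((max(Counter(v).values()) for v in kept.values() if v), default=1)

    best_qbt, best_br, best_frames = None, float("inf"), kept
    # Step 2: sweep QBT from 1 to the largest per-frame class count and keep the
    # threshold that yields the minimum balance ratio (BR_min).
    for qbt in range(1, max_count + 1):
        subset = {k: v for k, v in kept.items() if max(Counter(v).values()) <= qbt}
        if not subset:
            continue
        br = balance_ratio(subset, classes)
        if br < best_br:
            best_qbt, best_br, best_frames = qbt, br, subset
    return best_qbt, best_br, best_frames
```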
5. Experiments and results

In this Section, we first introduce the structure of the CNN model (YOLO v2) and the transfer-learning scheme used for pre-training. Next, eight different models (based on different training and testing data combinations) and their performance are presented. Finally, a discussion of the key factors that influence model performance is provided.

5.1. Model architecture and transfer learning

YOLO v2 [31] is a CNN model that takes red, green, and blue (RGB) imagery as input, and outputs predictions in the form of target objects' classes and their coordinates in the image. Fig. 4 shows the architecture of YOLO v2, which has 23 layers including convolutional (CONV) and max-pooling (MAX) layers, each with kernel, normalization, and activation functions. A CONV layer has kernels with parameters in each kernel cell, and extracts features such as shape and color from the input. Each kernel applies convolutional computations with a stride to cover the entire image and outputs one channel. Taking the first CONV layer in Fig. 4 as an example, after applying 32 kernels with stride 1 on the input (416 × 416 × 3), the output is a 32-channel matrix. Each layer computes on its input and then passes the result on to the next layer until the last output layer. Altogether, the model divides the input image into an S × S grid of cells, each of which predicts B bounding boxes together with their confidence and class probabilities (the S² and B terms that appear in the loss of Eq. (6)).
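As a sanity check of the layer arithmetic just described (32 kernels with stride 1 applied to a 416 × 416 × 3 input yield a 32-channel output of the same spatial size), a minimal Keras sketch; the 3 × 3 kernel size, 'same' padding, and leaky-ReLU activation are assumptions about the implementation, not details stated in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(416, 416, 3))                  # RGB frame
conv1 = layers.Conv2D(32, kernel_size=3, strides=1, padding="same")(inputs)
conv1 = layers.BatchNormalization()(conv1)                    # per-layer normalization
conv1 = layers.LeakyReLU(0.1)(conv1)                          # activation
print(conv1.shape)                                            # (None, 416, 416, 32)

pooled = layers.MaxPooling2D(pool_size=2)(conv1)              # first MAX layer
print(pooled.shape)                                           # (None, 208, 208, 32)
```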
The loss ℒ used for training combines localization, confidence, and class-prediction errors over the S² grid cells and the B boxes predicted per cell (Eq. (6)), where BC denotes the binary cross-entropy defined in Eq. (7):

ℒ = λ_box Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{obj} [BC(x_i, x̂_i) + BC(y_i, ŷ_i)]
  + λ_box Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{obj} [(w_i − ŵ_i)² + (h_i − ĥ_i)²]
  + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{obj} BC(c_i, ĉ_i)
  + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{noobj} BC(c_i, ĉ_i)
  + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{obj} BC(p_i, p̂_i)    (6)

BC(â_i, a_i) = −a_i log(â_i) − (1 − a_i) log(1 − â_i)    (7)

Initially, YOLO v2 [31] is pre-trained on the ImageNet dataset [16]. Next, its object detection weights are re-trained on either VOC [24] or COCO [29], and finally, transfer learning is used to train the model on the Volan2018 dataset, as shown in Fig. 5. In particular, pre-trained model weights from ImageNet and COCO/VOC are first loaded into our model, and the weights are then further updated by re-training the model on Volan2018. For each subset (i.e., D-B, D-U, H-B, H-U), we split the data into training (60%), validation (20%), and testing (20%) portions. The first 22 layers are initially frozen, and only the last layer is trained for 25 epochs with a learning rate of 10⁻³. Next, the entire model is fine-tuned with an initial learning rate of 10⁻⁴, which is halved whenever the loss does not drop for 3 consecutive epochs. To avoid overfitting, training is terminated if the validation loss does not drop for 10 consecutive epochs.
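A minimal sketch of the two-stage schedule just described, assuming a Keras implementation of the network; the model-building code, the yolo_loss function (Eq. (6)), the data generators, and the choice of the Adam optimizer are placeholders and assumptions, not the authors' code.

```python
import tensorflow as tf

def fit_two_stage(model, yolo_loss, train_data, val_data):
    """Stage 1: train only the last layer; Stage 2: fine-tune the whole network."""
    # Stage 1: freeze all but the final (detection) layer, 25 epochs at lr = 1e-3.
    for layer in model.layers[:-1]:
        layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss=yolo_loss)
    model.fit(train_data, validation_data=val_data, epochs=25)

    # Stage 2: unfreeze everything, fine-tune at lr = 1e-4, halve the learning rate
    # when the loss plateaus for 3 epochs, and stop after 10 epochs without
    # validation improvement.
    for layer in model.layers:
        layer.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=yolo_loss)
    callbacks = [
        tf.keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.5, patience=3),
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                         restore_best_weights=True),
    ]
    # epochs here is only an upper bound; early stopping usually terminates sooner.
    model.fit(train_data, validation_data=val_data, epochs=200, callbacks=callbacks)
```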
5.2. Results

We created eight models (numbered 1 through 8) based on different combinations of pre-training, training, and validation datasets, as listed in Table 6. The training process follows the steps described earlier. All models are ultimately tested on the D-B, D-U, H-B, and H-U test subsets, as well as on the entire length of the Volan007 and Volan008 videos.

The performance metrics include intersection over union (IoU), precision, recall, mAP, and F1 score. IoU is defined by Eq. (8) and is equal to the overlapping area between a detection and the ground truth divided by their union area. Fig. 6 illustrates the concept of IoU. The best prediction occurs when IoU is 100%, i.e., the prediction box and the ground-truth box exactly overlap. However, while 100% IoU is nearly impossible to achieve (given limitations of existing CNN architectures), an IoU threshold of 50% to 90% is commonly used in many computer vision applications. In this work, given the scope of work and the complexity of disaster scenery, a detection is deemed successful (a.k.a. a true positive or TP) if IoU ≥ 50%.

Intersection over Union (IoU) = Intersection Area / Union Area    (8)

In order to calculate precision and recall, in addition to TP cases, predictions that result in false positives (FP) and false negatives (FN) must also be considered. Examples of all three possible cases are shown in Fig. 7. Of note, in object detection there is no finite number of true negative (TN) cases (i.e., there is no ground-truth object in a particular region of the image and the model does not detect anything in that region) and, therefore, TN is not considered in the performance assessment. Precision and recall are calculated using Eqs. (9) and (10), and the harmonic mean of precision and recall (a.k.a. the F1 score) is obtained from Eq. (11).

precision = TP / (TP + FP)    (9)

recall = TP / (TP + FN)    (10)

F1 = 2 × (precision × recall) / (precision + recall)    (11)
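A minimal sketch of Eq. (8) and of the IoU ≥ 0.5 rule used to label detections as TP or FP; the greedy matching shown here is an illustrative simplification of standard practice, not necessarily the authors' exact evaluation code.

```python
def iou(box_a, box_b):
    """Boxes given as (x_min, y_min, x_max, y_max) in pixels; Eq. (8)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match_detections(detections, ground_truth, iou_thresh=0.5):
    """Greedy matching: a detection is a TP if it overlaps an unmatched ground-truth
    box of the same class with IoU >= iou_thresh, otherwise an FP; unmatched
    ground-truth boxes are FNs. detections/ground_truth: lists of (label, box)."""
    unmatched = list(ground_truth)
    tp = fp = 0
    for label, box in detections:            # ideally sorted by confidence, highest first
        best = max(unmatched,
                   key=lambda g: iou(box, g[1]) if g[0] == label else -1.0,
                   default=None)
        if best and best[0] == label and iou(box, best[1]) >= iou_thresh:
            tp += 1
            unmatched.remove(best)
        else:
            fp += 1
    fn = len(unmatched)
    return tp, fp, fn
```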
Table 7
mAP (%) of CNN models tested on different test subsets (* denotes best performance).
Test Subset Model 1 Model 2 Model 3 Model 4 Model 5 Model 6 Model 7 Model 8
In the prediction stage, a threshold (i.e., a minimum class probability) is commonly used to eliminate low-confidence detection boxes and produce human-understandable output. For example, a model with a threshold value of 0.1 only outputs detections with class probability higher than 0.1. To investigate a suitable threshold for real-world applications, we test Model 2 (trained on D-B-VOC) on the Volan007 video, and record precision, recall, and F1 score for threshold values ranging from 0.0 to 0.9 in 0.1 increments. As later summarized in Table 7, Model 2 outperforms the other models when tested on completely unseen drone footage. Sample results for the debris and undamaged roof classes are shown in Fig. 8. According to this Figure, larger threshold values lead to lower recall and higher precision, indicating that the model produces fewer detections and, therefore, fewer false positives and more false negatives. Theoretically, the best threshold is at the point where precision, recall, and F1 score converge. However, in cases where GOIs are highly valuable or where failing to detect them may have severe consequences (e.g., trapped people on the roof of a flooded home), high recall is necessary even at low precision. Moreover, different information users may value different GOIs differently, leading to varying preferences between precision and recall, and different classes and test data have different corresponding precision, recall, and F1 curves. In conclusion, the trade-off between precision and recall, and essentially the choice of threshold value, primarily depends on the needs and expectations of the end user.

For testing the performance of the CNN models, cumulative precision and recall values for each class (a.k.a. GOI) are calculated and plotted as precision-recall curves. The area under the precision-recall curve for each class is then calculated as the average precision (AP) for that class. Finally, the average of all AP values is reported as the mAP of the model. Results for all eight CNN models are shown in Table 7. As indicated by these results, models trained on drone videos tend to perform better when tested on drone videos, while models trained on helicopter videos tend to perform better when tested on helicopter videos. This observation supports the influence of viewpoint altitude on model performance. The best mAP performances for the D set and the H set are 74.48% (Model 3, trained on D-U-COCO and tested on D-U) and 80.69% (Model 7, trained on H-U-COCO and tested on H-B), respectively. These promising results support that automatically extracting information from aerial views is feasible.

Although the training and testing images in D and H are different, they are still extracted from similar videos that closely resemble one another (e.g., same disaster or same location). In practice, the test video could be captured with an utterly different altitude, camera, flying speed, disaster event, location, time, and lighting condition. A robust model must perform adequately in detecting GOIs when applied to completely new (unseen and drastically different) video footage. Therefore, the trained models are also tested on Volan007 (drone video) and Volan008 (helicopter video), which were not previously seen by the models. It can be seen in Table 7 that models pre-trained on VOC and trained on balanced data tend to perform better than other models. For example, for drone footage, Model 2 (trained on D-B-VOC) performs best (24.50% mAP) on the Volan007 video, whereas for helicopter footage, Model 6 (trained on H-B-VOC) performs best (13.88% mAP) on the Volan008 video.

Fig. 9 displays the precision-recall curves for the best performing model (i.e., Model 2) tested on the different test subsets (listed in Table 6). This model is trained and validated on balanced drone videos (i.e., D-B-VOC) (Table 6).
Fig. 8. Precision, recall, and F1 score of best performing model (Model 2) tested on Volan007 subset for (a) debris, and (b) undamaged roof classes.
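A sketch of the threshold experiment (Eqs. (9)–(11)): detections are filtered by class probability, and precision, recall, and F1 are recorded at each cut-off. The match_fn argument is a placeholder for a TP/FP/FN matching routine such as the one sketched earlier.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0                                    # Eq. (9)
    recall = tp / (tp + fn) if (tp + fn) else 0.0                                       # Eq. (10)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0  # Eq. (11)
    return precision, recall, f1

def threshold_sweep(raw_detections, ground_truth, match_fn):
    """raw_detections: list of (label, box, confidence) for one test video;
    ground_truth: list of (label, box). Sweeps the class-probability threshold 0.0-0.9."""
    results = {}
    for t in [i / 10 for i in range(10)]:
        kept = [(lbl, box) for (lbl, box, conf) in raw_detections if conf > t]
        tp, fp, fn = match_fn(kept, ground_truth)
        results[t] = precision_recall_f1(tp, fp, fn)
    return results
```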
Fig. 9. Precision-recall curves for best performing model (Model 2) tested on drone balanced, drone unbalanced, helicopter balanced, helicopter unbalanced,
Volan007, and Volan008 subsets.
When tested on the balanced drone (D-B) and unbalanced drone (D-U) test subsets, the model's performance is 65.30% and 40.00% mAP, respectively. Looking at each class individually, the detection of debris has the highest AP, 81.77% when tested on D-B and 78.53% when tested on D-U. Similarly, the detection of damaged roof has an AP of 53.27% when tested on D-B and 51.89% when tested on D-U.
Moreover, as shown in this Figure, the overall mAP of Model 2 when tested on Volan007 (the unseen drone video) is 24.50%, with vegetation detection achieving the highest AP of 67.59%, followed by undamaged roof detection at 32.81% and debris detection at 17.58%. However, the model struggles to detect cars, resulting in the lowest AP of 1.82%. In contrast, the overall mAP of Model 2 when tested on Volan008 (the unseen helicopter video) is only 6.40%, supporting the argument that camera altitude and properties could affect the accuracy of GOI detection. These results also suggest that certain GOIs (i.e., vegetation, undamaged roof, and debris) are predictable by training the CNN model on videos of previous events and testing it on unseen footage from new events. To understand the effect of viewpoint altitude, data balancing, and pre-training dataset on model performance, a statistical analysis is presented in the following Subsection.

5.3. Analysis of model performance for viewpoint altitude, data balance, and pre-trained weights

The effect of three key factors, namely the viewpoint altitude (low for drone vs. high for helicopter), data balancing (balanced vs. unbalanced), and pre-training dataset (COCO vs. VOC), on model performance is investigated in this Section. Table 8 summarizes the results of the statistical analysis performed on all eight models with respect to viewpoint altitude. For each trained model, a comparison is made between the performance of that model when tested on the Volan007 (drone) and Volan008 (helicopter) videos. For each comparison listed in Table 8, the general hypothesis is that training and testing on videos captured from relatively the same altitude yields the best results. In other words, a model trained on the D (H) dataset tends to show a statistically better performance when tested on the D (H) dataset. Mathematically, this can be expressed by the null hypothesis H0 that there is no difference in mAP between the results when the model is tested on Volan007 (μ007) and on Volan008 (μ008), i.e., μ007 = μ008. The alternative hypothesis HA is that μ007 > μ008 (for models trained on drone video, i.e., Models 1 through 4) or μ008 > μ007 (for models trained on helicopter video, i.e., Models 5 through 8). For each comparison, the confidence level is set at 99%, the test data sample is produced by randomly selecting two-thirds of the frames of each video (Volan007 or Volan008), and mAP is measured and averaged over 50 iterations. Next, a two-sample one-tailed t-test is run to evaluate the significance of each pair. Considering the first row of Table 8, for instance, the results show that Model 1 (trained on D) performs with an average mAP of 18.20% when tested on Volan007 (captured by drone) and an average mAP of 11.74% when tested on Volan008 (captured by helicopter), and this difference in mAP is statistically significant at the 99% confidence level. The same trend is observed for all eight models: training and testing on video taken from relatively the same altitude yields better results, whereas cross training and testing (i.e., a model trained on the D dataset and tested on the H dataset, or vice versa) leads to very low accuracy. Therefore, it can be concluded that the viewpoint altitude is a critical factor that must be considered when building and deploying CNN models for GOI detection in aerial videos.

Table 8
Statistical analysis of the influence of viewpoint altitude on model performance (**: p < 0.01); for each model, mean mAP (%) and its standard deviation are reported on Volan007 (drone) and Volan008 (helicopter).
Model 1: Volan007 18.20 (0.520), Volan008 11.74 (0.317); t = 74.27**; μ007 > μ008
Model 2: Volan007 24.54 (6.372), Volan008 4.26 (0.367); t = 252.73**; μ007 > μ008
Model 3: Volan007 18.45 (0.485), Volan008 0.15 (0.017); t = 263.97**; μ007 > μ008
Model 4: Volan007 12.69 (0.544), Volan008 1.69 (0.130); t = 137.75**; μ007 > μ008
Model 5: Volan007 1.278 (0.064), Volan008 12.08 (0.241); t = −303.28**; μ008 > μ007
Model 6: Volan007 1.66 (0.07), Volan008 14.01 (0.591); t = −145.31**; μ008 > μ007
Model 7: Volan007 1.22 (0.042), Volan008 12.37 (0.236); t = −325.86**; μ008 > μ007
Model 8: Volan007 0.63 (0.022), Volan008 10.50 (0.307); t = −224.80**; μ008 > μ007
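A sketch of the significance test described above, assuming per-iteration mAP values obtained by randomly sampling two-thirds of each unseen video's frames 50 times; evaluate_map is a placeholder for the model evaluation pipeline, and scipy's two-sample t-test is used with the one-sided p-value taken from the test statistic's sign.

```python
import random
from scipy import stats

def bootstrap_map(frames, evaluate_map, iterations=50, fraction=2 / 3):
    """Repeatedly evaluate mAP on a random two-thirds subset of a video's frames."""
    scores = []
    for _ in range(iterations):
        subset = random.sample(frames, int(len(frames) * fraction))
        scores.append(evaluate_map(subset))
    return scores

def compare_altitudes(frames_007, frames_008, evaluate_map):
    """Test H0: mean mAP on Volan007 equals mean mAP on Volan008 (one-tailed)."""
    m007 = bootstrap_map(frames_007, evaluate_map)
    m008 = bootstrap_map(frames_008, evaluate_map)
    t_stat, p_two_sided = stats.ttest_ind(m007, m008)
    p_one_sided = p_two_sided / 2   # direction of the alternative given by the sign of t_stat
    return t_stat, p_one_sided
```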
We further compare the performance of all eight models from the perspective of data balancing. As shown in Table 9, data balancing (i.e., under-sampling, as explained in Section 4.3) excludes a significant number of relatively less useful video frames from the training dataset, resulting in a training time shorter by a factor of 14 to 17. In particular, when the model is pre-trained on COCO [29], reducing the size of the training dataset only slightly affects performance (a 1.5% increase for the D subset and a 0.23% decrease for the H subset). On the other hand, when the model is pre-trained on VOC [24], data balancing not only decreases the training time but also improves performance by 11.89% and 3.54% for the D and H subsets, respectively. In this analysis, training time is measured on an Intel Xeon E5-2680 v4 2.40 GHz 14-core CPU with 128 GB RAM and an NVIDIA K80 (12 GB) GPU [44]. Overall, it is observed that the models trained on the balanced training subsets with pre-trained weights from VOC [24] (i.e., Model 2 for the D subset and Model 6 for the H subset) outperform the other models. Sample GOI detections in video frames obtained from Models 2 and 6 are illustrated in Fig. 10.

Table 9
Influence of data balance and pre-trained weights on model performance (* denotes best performance); columns: model, train data, test data, pre-trained on, data balance, number of training frames, mAP (%), and training time (hr).

6. Summary and conclusions

The goal of this study was to facilitate natural disaster damage detection and quantification in aerial imagery using CNN models trained on footage from past disasters. To achieve this, a list of valuable GOIs was first generated by analyzing the literature pertinent to disaster response and recovery. Next, in order to train and test CNN models, an in-house dataset named Volan2018 was created (using semi-supervised web mining of YouTube videos) and annotated with tags corresponding to the previously identified GOIs. Volan2018 contains eight videos from hurricanes Harvey (Texas), Irma (Florida), Maria (Puerto Rico), and Michael (Florida). A total of eight CNN models were then trained, validated, and tested on different combinations of viewpoint altitude (low for drone vs. high for helicopter), data balancing (balanced vs. unbalanced), and pre-training dataset (COCO vs. VOC) to investigate the effect of these parameters on model performance.
The analysis showed that CNN models trained and tested on videos captured from relatively the same altitude yield the best results, while cross training and testing leads to very low accuracy. The paper also presented a novel method for balancing large video datasets using under-sampling. This approach not only reduced the training time but also improved the performance of several models. In particular, models pre-trained on VOC and re-trained on balanced data yielded the best results for both the D and H subsets. Training and testing on the Volan001 through 006 videos resulted in mAPs of 74.48% (Model 3, trained on D-U-COCO and tested on D-U) and 80.69% (Model 7, trained on H-U-COCO and tested on H-B), respectively, which shows the potential of the presented technique for detecting ground assets with high fidelity in aerial imagery. When testing on completely different (unseen) footage, captured by different cameras in different events and locations, an mAP as high as 24.50% (across all classes) was achieved, with the model detecting the vegetation and undamaged roof classes with ~68% and ~33% average precision, respectively.

In addition to hurricanes, we plan to investigate the capability of CNN models for GOI detection in other types of natural disasters such as tornados, typhoons, earthquakes, and wildfires, as well as to analyze the parameters that may influence model performance in such scenarios. While the CNN models developed in this research are based on the YOLO architecture (due to its speed and accuracy for real-time field assessment and decision-making), other CNN-based deep learning models that can achieve equal or better output with respect to speed, accuracy, and level of detail (e.g., pixel segmentation) will also be investigated in the future. Finally, the Volan2018 dataset is annotated with bounding boxes, which constrains the trained CNN models to rectangular predictions. This could be improved, particularly for the detection of objects with irregular shapes and sizes (e.g., flooded areas, vegetation). To this end, our next revision of the dataset will contain semantic segmentation, where each pixel of the image is labeled. We expect that these improvements, coupled with an increased size and diversity of training data, will help achieve better results. Findings of this work are ultimately sought to be a key component of information exchange platforms and decision support system (DSS) applications that integrate and share different data modalities with all stakeholders involved in disaster management, ranging from ordinary people to first responders, law enforcement, local jurisdictions, insurance companies, and non-governmental organizations (NGOs).

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Model training was performed on Texas A&M University's High Performance Research Computing (HPRC) clusters. The authors gratefully acknowledge the HPRC for providing the necessary computational resources. Any opinions, findings, conclusions, and recommendations expressed in this paper are those of the authors and do not necessarily represent those of the HPRC.

References

[1] United Nations Office for Disaster Risk Reduction, Economic and Human Impacts of Disasters, 2019, Available from: <https://www.unisdr.org/we/inform/disaster-statistics>.
[2] J.H. Seinfeld, S.N. Pandis, Atmospheric Chemistry and Physics: From Air Pollution to Climate Change, John Wiley & Sons, 2016.
[3] T.J. Cova, GIS in emergency management, Geogr. Inform. Syst. 2 (1999) 845–858.
[4] T. Tomic, K. Schmid, P. Lutz, A. Domel, M. Kassecker, E. Mair, I. Grixa, F. Ruess, M. Suppa, D. Burschka, Toward a fully autonomous UAV: research platform for indoor and outdoor urban search and rescue, IEEE Rob. Autom. Mag. 19 (3) (2012) 46–56.
[5] Federal Aviation Administration, Fact Sheet – Federal Aviation Administration (FAA) Forecast Fiscal Years (FY) 2017-2038, 2018, Available from: <https://www.faa.gov/news/fact_sheets/news_story.cfm?newsId=22594>.
[6] Federal Aviation Administration, Recreational Fliers & Modeler Community-Based Organizations, 2019, Available from: <https://www.faa.gov/uas/educational_users/>.
[7] M. Craglia, F. Ostermann, L. Spinsanti, Digital Earth from vision to practice: making sense of citizen-generated content, Int. J. Digital Earth 5 (5) (2012) 398–416.
[8] J. Kim, M. Hastak, Social network analysis: characteristics of online social networks after a disaster, Int. J. Inf. Manage. 38 (1) (2018) 86–96.
[9] Y. Faxi, L. Rui, M. Ouejdane, An information system for real-time critical infrastructure damage assessment based on crowdsourcing method: a case study in Fort McMurray, Proceeding of International Conference on Sustainable Infrastructure, New York City, NY, USA, American Society of Civil Engineers, 2017.
[10] M.F. Goodchild, J.A. Glennon, Crowdsourcing geographic information for disaster response: a research frontier, Int. J. Digital Earth 3 (3) (2010) 231–241.
[11] C.A.B. Baker, S. Ramchurn, W.T. Teacy, N.R. Jennings, Planning search and rescue missions for UAV teams, Proceedings of the Twenty-second European Conference on Artificial Intelligence, Netherlands, IOS Press, 2016.
[12] M. Radovic, O. Adarkwa, Q. Wang, Object recognition in aerial images using convolutional neural networks, J. Imag. 3 (2) (2017) 21.
[13] S. Han, W. Shen, Z. Liu, Deep Drone: Object Detection and Tracking for Smart Drones on Embedded System, Stanford University, 2012.
[14] P. Narayanan, C. Borel-Donohue, H. Lee, H. Kwon, R. Rao, A real-time object detection framework for aerial imagery using deep neural networks and synthetic training images, Proceeding of Signal Processing, Sensor/Information Fusion, and Target Recognition XXVII, International Society for Optics and Photonics, Orlando, FL, USA, 2018.
[15] E. Guirado, S. Tabik, M.L. Rivas, D. Alcaraz-Segura, F. Herrera, Automatic whale counting in satellite images with deep learning, bioRxiv, 2018.
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, ImageNet large scale visual recognition challenge, Int. J. Comput. Vision 115 (3) (2015) 211–252.
[17] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Proceeding of Advances in Neural Information Processing Systems, Lake Tahoe, Nevada, USA, NIPS, 2012.
[18] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, IEEE, 2015.
[20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, IEEE, 2016.
[21] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, IEEE, 2014.
[22] R. Girshick, Fast R-CNN, Proceedings of the IEEE International Conference on Computer Vision, Boston, MA, USA, IEEE, 2015.
[23] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Proceeding of Advances in Neural Information Processing Systems, Vancouver, Canada, NIPS, 2015.
[24] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The Pascal visual object classes (VOC) challenge, Int. J. Comput. Vision 88 (2) (2010) 303–338.
[25] J. Dai, Y. Li, K. He, J. Sun, R-FCN: Object detection via region-based fully convolutional networks, Proceeding of Advances in Neural Information Processing Systems, Barcelona, Spain, NIPS, 2016.
[26] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, IEEE, 2016.
[27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, A.C. Berg, SSD: Single shot multibox detector, European Conference on Computer Vision, Amsterdam, Netherlands, Springer, 2016.
[28] T.Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, IEEE, 2017.
[29] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: Common objects in context, European Conference on Computer Vision, Zurich, Switzerland, Springer, 2014.
[30] A. Zerger, D.I. Smith, Impediments to using GIS for real-time disaster decision support, Comput., Environ. Urban Syst. 27 (2) (2003) 123–141.
[31] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, IEEE, 2017.
[32] National Oceanic Atmospheric Administration, Tropical Cyclone Climatology, 2019, Available from: <https://www.nhc.noaa.gov/climo/>.
[33] J.M. Shultz, S. Galea, Preparing for the next Harvey, Irma, or Maria—addressing research gaps, N. Engl. J. Med. 377 (19) (2017) 1804–1806.
[34] J.K. Philip, M.B. Michael, J. Jhordanne, Extended Range Forecast of Atlantic Seasonal Hurricane Activity and Landfall Strike Probability for 2019, Department of Atmospheric Science, Colorado State University, 2019, Available from: <https://tropical.colostate.edu/media/sites/111/2019/04/2019-04.pdf>.
[35] M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, IEEE, 2014.
[36] H.C. Shin, H.R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, R.M. Summers, Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning, IEEE Trans. Med. Imaging 35 (5) (2016) 1285–1298.
[37] G.S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, X. Lu, AID: A benchmark data set for performance evaluation of aerial scene classification, IEEE Trans. Geosci. Remote Sens. 55 (7) (2017) 3965–3981.
[38] W. Zhou, S. Newsam, C. Li, Z. Shao, PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval, ISPRS J. Photogramm. Remote Sens. 145 (2018) 197–209.
[39] M. Bonetto, P. Korshunov, G. Ramponi, T. Ebrahimi, Privacy in mini-drone based video surveillance, Proceeding of 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia, IEEE, 2015.
[40] DarkLabel1.3, Image Labeling and Annotation Tool, 2017, Available from: <https://darkpgmr.tistory.com/16>.
[41] Federal Emergency Management Agency, Damage Assessment Operations Manual, 2016, Available from: <https://www.fema.gov/media-library-data/1459972926996-a31eb90a2741e86699ef34ce2069663a/PDAManualFinal6.pdf>.
[42] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, G. Bing, Learning from class-imbalanced data: Review of methods and applications, Expert Syst. Appl. 73 (2017) 220–239.
[43] M. Buda, A. Maki, M.A. Mazurowski, A systematic study of the class imbalance problem in convolutional neural networks, Neural Networks 106 (2018) 249–259.
[44] Texas A&M University High Performance Research Computing, Terra: A Lenovo x86 HPC Cluster, 2019, Available from: <https://hprc.tamu.edu/wiki/Terra:Intro>.