1. Introduction
In recent years, growing demands for traffic safety and congestion mitigation have drawn significant research attention to real-time traffic data collection. Traditional collection methods, including manual counts and fixed equipment such as photoelectric and ultrasonic detectors, require substantial labor and material resources and cannot meet current demands for accuracy, timeliness, and predictive value in traffic data analysis. More recent approaches, such as roadside electronic monitoring systems and GPS-, mobile-phone-, and GIS-based data collection, suffer from fixed viewing angles and positions and poor timeliness. Using UAVs for traffic video collection is therefore more convenient than these methods; it is an effective means of collecting traffic information and can guarantee the real-time nature of traffic data.
Mobile devices are portable and flexible, making it possible to collect traffic data anytime and anywhere. Although the computing power of mobile devices is far below that of server-side hardware, the performance of Apple's mobile silicon and its machine learning frameworks provides the hardware and software support needed for object recognition and trajectory tracking on mobile platforms.
At present, most research on real-time detection and tracking from UAV video remains theoretical, improving algorithms and assessing their performance on datasets, and lacks practical applications. In addition, several factors must be considered for different combinations of UAVs and mobile devices.
Building on existing theoretical research, this paper presents practical experiments that integrate UAVs, machine learning, and mobile development technology. It proposes a system based on the iOS platform that controls a UAV to capture real-time video streams and performs object detection and tracking of vehicles at intersections. The system analyzes and visualizes the detection results to obtain microscopic and macroscopic traffic parameters. The paper describes the system's design process and conducts experiments to validate its feasibility. Through this research, we expect to improve the timeliness and accuracy of traffic data collection in practical applications and provide support for solving urban traffic problems.
2. Literature review
The flexibility, ease of operation, and real-time capability of aerial video make it a useful tool for urban traffic data acquisition. Zhang et al. (1) used a fixed UAV camera angle to capture traffic videos and employed the Mask R-CNN model for vehicle detection. The pixel distance between the center points of a vehicle's bounding box in frames a fixed time interval apart was calculated and converted to actual distance; by analyzing vehicle speeds, the model was able to determine whether traffic congestion occurred and identify its causes. Byun et al. (2) proposed a method that uses EfficientDet for vehicle detection and SORT for tracking to analyze vehicle movement. By segmenting the road area in the drone image and calculating the ratio of lane length to pixels, the actual speed of each vehicle can be determined. Chalmers et al. (3) employed frame-sampling technology in conjunction with a DJI Mavic Pro 2 to achieve real-time object detection of animals, maximizing both accuracy and throughput.
Recent advancements in traffic trajectory recognition using drones and computer vision have shown
significant progress on desktop-side devices. Gu et al. (4) proposed a framework for analyzing collision risk
in highway interchanges by scrutinizing drone-captured video data based on vehicle micro-behavior. Chen et
al. (5) utilized drone data to observe mixed traffic flow involving motorized vehicles, non-motorized vehicles,
and pedestrians within safety space boundaries. Wu et al. (6) developed an automatic road conflict
identification system (ARCIS) using the masked region convolutional neural network (R-CNN) technique to
process traffic videos collected by UAVs. Ma et al. (7) focused on using UAV vehicle trajectory data to
establish a traffic conflict prediction model for highway traffic in diversion areas. Chen et al. (8) achieved
high-precision trajectory extraction and fast, accurate vehicle tracking from UAV video using kernel-based methods and coordinate transformation, combined with wavelet transforms. The Smart and Safe Transportation Laboratory (UCF SST) at the University of Central Florida employed deep learning techniques to extract parameters from
UAV videos, facilitating road safety diagnosis (9). Additionally, they created the CitySim dataset (10) to
support traffic safety analysis and digital twin scene modeling.
While most studies have evaluated performance on datasets after improving their models, theoretically meeting the requirements for real-time detection, these methods have not yet been fully tested in practical applications (11-15). From the standpoint of the model deployment platform, Hua
et al. (16) introduced a lightweight UAV real-time object tracking algorithm based on policy gradient and
attention mechanism, and successfully deployed the model on NVIDIA Jetson AGX Xavier devices. In
related research, several studies have utilized the Nvidia Jetson TX2 on-board computer for object detection
and tracking computations (17, 18). Regarding the utilization of mobile devices, Martinez-Alpiste et al. (19)
conducted a series of tests using OpenCV in conjunction with the YOLOv3 model deployed on an Android
phone. They evaluated the feasibility of achieving machine learning-based object detection on mobile
platforms, employing frames per second, accuracy, battery consumption, temperature, RAM usage, model
size, FLOPS, and model load time as evaluation indicators. In the context of iOS-based systems, Zhou et al.
(20) executed the object tracker on the iPad Air2, while Li et al. (21) deployed a YOLO model trained on the
COCO dataset on the iPhone to achieve object detection functionality. However, some challenges were
encountered, such as bounding box shifts.
The problem with existing methods is that image-based detection and tracking typically require substantial computational power and complicated algorithms, which makes them hard to implement on mobile hardware. In this paper, we first propose a method that can effectively tune the thresholds in the detection algorithms. We then take advantage of the flexibility of modern mobile devices and deep-learning-based real-time object detection to realize a mobile system that automatically detects vehicles in the video stream of a UAV.
The system's core design is centered on the iOS framework and aims to seamlessly integrate drone technology. Customized drone functionalities are developed to facilitate real-time control of drone flight and acquisition of video frame data. This integration enables target detection to run concurrently with video frame acquisition, resulting in real-time visualization of detected vehicle targets. The system's ultimate goal is to harmoniously merge UAV flight control, traffic video acquisition, target detection, trajectory tracking, and data analysis on a mobile terminal.
The system is predicated on the iOS Software Development Kit (SDK), which serves as the foundational
framework for formulating the core system functionalities and interface design. Building upon this foundation,
the DJI SDK is leveraged to actualize custom drone control functionalities, enabling the real-time acquisition
of the drone's video stream. In tandem, machine learning frameworks including Core ML and Vision are
deployed on the iOS platform to facilitate real-time processing of the video stream. This processing is chiefly
focused on the detection and tracking of vehicle targets. For an in-depth visualization of the system's structure
and components, please refer to Fig. 1.
For enhanced management of both the primary function interface and sub-function interfaces, this system
employs the Navigation Controller as its foundational controller. Upon system initialization, the user is
directed to the main interface. The main interface is structured in accordance with the Model-View-Controller
(MVC) design pattern, leveraging the UIKit framework for user interface (UI) design. Its primary
responsibilities encompass system initialization and drone setup.
(1) Interface design
The interface is partitioned into left and right sections. The left area serves as the functional zone, housing 6
UIButtons to facilitate functions such as data review, viewing mobile phone albums, accessing flight records,
checking binding and activation statuses, and logging into DJI accounts. Additionally, 4 UILabels are
employed to display binding status, activation status, connection mode, and the currently connected device.
The right section is dedicated to establishing the connection with drones through a single UIButton. The
interface layout, as illustrated in Fig. 3, reflects this division.
The real-time image transmission module assumes the critical role of receiving the live video stream generated by the drone. This stream is subsequently relayed to the detection and tracking module, which in turn processes the video stream data for vehicle detection. The real-time image transmission module thus serves as a pivotal intermediary, directly bridging the UAV's real-time video stream to the target detection module and providing a vital interface for interaction with it.
(1) Interface design
The design layout of the real-time video transmission interface is depicted in Fig. 4, delineating two distinct
areas within a two-dimensional plane perspective: the upper and lower segments.
The upper area serves as the status information section, designated for monitoring the drone's status
information. This upper area is further subdivided into three distinct sections: the UAV information area, PTZ
(Pan-Tilt-Zoom) information area, and other information area. These sub-areas are visually represented
through the utilization of UILabels and are structured using a horizontal Stack View. The UAV information
area displays the aircraft's real-time attitude, namely the UAV's pitch, roll, and yaw angles. The
PTZ information area provides details on the gimbal's pitch, roll, and yaw angles. Additionally, the other
information area presents essential data such as the current time, drone boot time, and the number of satellites.
The lower area is designated as the image transmission display zone, denoted as FPV View. This area is
constructed using the root view UIView of the FPVViewController. From a three-dimensional perspective, a
custom UIView layer is superimposed onto the underlying view. This additional layer is employed to create
an area dedicated to displaying detection results, aptly named Detect View. The purpose of this Detect View
is to visually represent the outcomes of target detection. The implementation specifics of the drone interface
are graphically illustrated in Fig. 5.
Fig. 9. Video data flow direction
Because the video stream data obtained through the DJIVideoFeedListener delegate cannot be transmitted directly to the object detection module, the real-time video transmission module employs the VideoFrameProcessor delegate to acquire video frame data. The VideoFrameProcessor's delegate method is invoked each time a video frame is retrieved. In the callback method “videoProcessFrame()”, the
VideoFrameProcessor retrieves the video frame data object, denoted as “frame”. This object, of type
VideoFrameYUV, is subject to a condition check to determine the initiation of the probe, contingent upon the
evaluation of whether “isStartDetect” is true. Subsequently, the VideoFrameYUV data frame undergoes a
transformation to a CVPixelBuffer type through the “createPixelBuffer()” method. This CVPixelBuffer is
then relayed to the target detection module and the target tracking module for data frame processing. The
target tracking module returns an array of detection result rectangles, denoted as “rects”. This information is
employed to calculate micro-level parameters and macro-level traffic parameters of the vehicle within the data
processing module.
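As a minimal illustration of this hand-off (a sketch, not the paper's exact implementation; the compiled model URL and class name are assumptions), the CVPixelBuffer produced by “createPixelBuffer()” can be passed to a Vision request backed by the Core ML detection model as follows.

```swift
import CoreML
import CoreVideo
import Vision

/// Runs the vehicle detector on one video frame. The model URL points to a
/// compiled Core ML model (an assumption; the paper's model file is not named).
final class VehicleDetector {
    private let visionModel: VNCoreMLModel

    init(compiledModelURL: URL) throws {
        let mlModel = try MLModel(contentsOf: compiledModelURL)
        visionModel = try VNCoreMLModel(for: mlModel)
    }

    /// `pixelBuffer` is the CVPixelBuffer produced from the VideoFrameYUV frame
    /// by createPixelBuffer() in the VideoFrameProcessor callback.
    func detect(in pixelBuffer: CVPixelBuffer) -> [VNRecognizedObjectObservation] {
        let request = VNCoreMLRequest(model: visionModel)
        request.imageCropAndScaleOption = .scaleFill   // keep the whole frame

        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        try? handler.perform([request])
        return (request.results as? [VNRecognizedObjectObservation]) ?? []
    }
}
```

The returned observations carry normalized bounding boxes and labels, which are then handed to the tracking and data processing modules described below.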
To visualize these detection results, the “showInDetectView()” method is utilized. It passes the processing
result rectangles to the “polyRects” property of the custom view, “detectView”. Subsequently, the “draw()”
method of the “detectView” is triggered, leading to the interface's recalibration on the main queue,
accomplished via the “setNeedsLayout()” method of the “detectView”.
Furthermore, the “viewWillDisappear” lifecycle method is overridden to facilitate memory management.
When the back button is clicked, returning to the main interface, several critical actions are undertaken. These
include the removal of the target view of the DJIVideoPreviewer decoder through the “unSetView()” method,
the elimination of the DJIVideoFeedListener through the “remove()” method, thereby halting the reception of
new video frame data. Finally, the “close()” method is invoked to shut down the decoder, releasing system
memory resources.
3.4 Vehicle target detection and tracking implementation
Tracking module based on SORT algorithm
The current implementation of the Vision tracker exhibits certain performance bottlenecks in object
tracking. To address this, we propose an enhancement by replacing the Vision framework tracker with a
method based on the SORT algorithm. The improved tracking module encompasses the following seven key
steps.
(1) The object detection module transmits the detection results, which are then received and processed for
each object's bounding box properties. To ensure compatibility with the SORT tracker, the y-axis of the
bounding box is flipped, and the coordinates are adjusted from the Vision framework to the UIKit framework.
Normalized coordinates are mapped to real image coordinates, and the format is converted from [x, y, w, h] to
[x1, y1, x2, y2].
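A compact Swift version of this coordinate conversion might look as follows (a sketch; the function name is illustrative, and Vision's VNImageRectForNormalizedRect is used for the normalized-to-pixel mapping).

```swift
import CoreGraphics
import Vision

/// Converts a Vision observation (normalized rect, lower-left origin) into the
/// pixel-coordinate [x1, y1, x2, y2] form expected by the SORT tracker.
func sortBox(from observation: VNRecognizedObjectObservation,
             imageWidth: Int, imageHeight: Int) -> [CGFloat] {
    // Map normalized coordinates to real image coordinates.
    var rect = VNImageRectForNormalizedRect(observation.boundingBox,
                                            imageWidth, imageHeight)
    // Flip the y-axis: Vision uses a bottom-left origin, UIKit a top-left origin.
    rect.origin.y = CGFloat(imageHeight) - rect.origin.y - rect.height
    // [x, y, w, h] -> [x1, y1, x2, y2]
    return [rect.minX, rect.minY, rect.maxX, rect.maxY]
}
```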
The target tracking module delivers instantaneous positional data. Situated between the target tracking
module and the result drawing module, the data processing module assumes the pivotal role of processing
every frame of data forwarded by the target tracking module. Within this data processing paradigm, the
module undertakes the computation of real-time micro-level parameters for individual vehicles while
simultaneously deducing macro-level traffic parameters with respect to lanes or road segments. Subsequently,
these computed parameters find their path to the result plotting module for visual representation. This
architectural arrangement materializes an efficient amalgamation of data processing and analysis, resulting in
a noteworthy enhancement of precision and practicality in the domain of traffic behavior analysis.
In the context of real-time data processing, the employment of fixed-frame sampling introduces variability
in the time intervals between samples. This variance adversely impacts the calculation of traffic micro
parameters, leading to deviations and, consequently, imprecise speed calculations. Conversely, if a single
frame is selected for sampling, the resulting interval proves excessively brief. Such brevity introduces
inaccuracies stemming from the detection accuracy error of the target's position and errors in the time interval
calculations. The outcome is pronounced fluctuations in speed calculation results.
To address these challenges, this study adopts a dynamic calculation approach. By computing the average
FPS, this method utilizes the FPS value as the foundation for determining the sampling interval in the
acquisition of traffic data. This strategy effectively mitigates the influence of errors on the results by adjusting
and extending the sampling interval.
(1) Calculation of sampling intervals
To initiate the probing process, each activation of the "Start Probing" button triggers the creation of a
temporary variable, "frames," designed to tally the cumulative number of frames per probe. This variable is
initialized with a value of 0 and is iteratively incremented by 1 during the processing of each individual frame.
This mechanism ensures an accurate count of the frames captured during each probing operation.
Subsequently, the processing duration of each frame is meticulously recorded by computing the timestamps
both before and after the processing of a single frame. The collective processing time 𝑡𝑖𝑚𝑒𝐴𝑙𝑙 is determinable
using Eq. (1).
timeAll = ∑_{i=1}^{n} (endTime_i − startTime_i)  (1)
where startTime_i and endTime_i are the timestamps before and after processing the i-th frame, and n is the number of frames processed.
Conclusively, the real-time processing frame rate 𝐹𝑃𝑆 is derived from the calculation presented in Eq. (2).
FPS = frames / timeAll  (2)
where frames is the number of frames processed during each detection.
As illustrated in Fig. 11, the data of FPS video frames serves as the basis for defining the sampling period
interval. By configuring the sampling interval coefficient, with the default value of 1 representing a one-
second interval, the sampling interval size becomes dynamically adaptable. The actual calculated sampling
interval S_FPS is ascertained using Eq. (3).
S_FPS = FPS × timeRatio  (3)
where timeRatio is the sampling interval coefficient.
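The following sketch illustrates how Eqs. (1)-(3) can be accumulated on-device; variable names follow the text, and the use of CACurrentMediaTime() for the timestamps is an assumption.

```swift
import QuartzCore   // CACurrentMediaTime()

/// Accumulates per-frame processing time and derives the dynamic sampling
/// interval S_FPS of Eqs. (1)-(3).
final class SamplingClock {
    private var frames = 0        // frames processed since "Start Probing"
    private var timeAll = 0.0     // accumulated processing time, Eq. (1)
    var timeRatio = 1.0           // sampling interval coefficient (1 = one second)

    /// Wraps the processing of a single video frame with timestamps.
    func record(processingOneFrame work: () -> Void) {
        let startTime = CACurrentMediaTime()
        work()
        let endTime = CACurrentMediaTime()
        timeAll += endTime - startTime    // Eq. (1)
        frames += 1
    }

    /// Real-time processing frame rate, Eq. (2).
    var fps: Double { timeAll > 0 ? Double(frames) / timeAll : 0 }

    /// Sampling interval in frames, Eq. (3): S_FPS = FPS × timeRatio.
    var samplingInterval: Int { max(1, Int((fps * timeRatio).rounded())) }
}
```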
Micro parameters are calculated in frames after receiving the trace results for each frame, and the data is
inserted into the database through the data management module.
(1) Calculation of speed
First, the vehicle target moves between two frames as shown in Fig. 13.
Fig. 14. Schematic of actual movement distance of vehicle target in the sampling interval
The route of the vehicle changes during movement, so an error is introduced if the Euclidean distance between the two endpoints is calculated directly. Instead, the moving distance is obtained by accumulating the frame-to-frame displacements over the sampling interval, as in Eq. (9).
realDistance(n) = pixelToRealDistance × ∑_{i=n−S_FPS+1}^{n} ‖c_i − c_{i−1}‖  (9)
where realDistance is the actual distance traveled by the vehicle, c_i is the center point of bbox(i), S_FPS is the sampling interval, and pixelToRealDistance is the pixel-to-real distance ratio.
Finally, the vehicle moving speed is calculated by Eq. (10).
speed(n) = realDistance(n) / timeRatio  (10)
where timeRatio is the sampling interval coefficient, speed is the actual speed of the vehicle.
(2) Calculation of acceleration
Take the speed parameters of the two endpoints of the sampling interval and calculate the acceleration by
Eq. (11).
acceleration(n) = (speed(n) − speed(n − S_FPS)) / timeRatio  (11)
where acceleration is the acceleration of a vehicle.
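A minimal sketch of Eqs. (9)-(11) is given below, assuming the tracker supplies the bounding-box center of each vehicle for every frame in the sampling interval; the type and function names are illustrative.

```swift
import CoreGraphics

/// Kinematics of one tracked vehicle over a sampling interval, Eqs. (9)-(11).
enum VehicleKinematics {
    /// Eq. (9): accumulate frame-to-frame path length (in pixels) over the
    /// sampling interval and convert it to meters with pixelToRealDistance.
    static func realDistance(centers: [CGPoint], pixelToRealDistance: Double) -> Double {
        guard centers.count > 1 else { return 0 }
        var pixelPath = 0.0
        for i in 1..<centers.count {
            let dx = Double(centers[i].x - centers[i - 1].x)
            let dy = Double(centers[i].y - centers[i - 1].y)
            pixelPath += (dx * dx + dy * dy).squareRoot()
        }
        return pixelPath * pixelToRealDistance
    }

    /// Eq. (10): speed over one sampling interval (timeRatio seconds).
    static func speed(realDistance: Double, timeRatio: Double) -> Double {
        realDistance / timeRatio
    }

    /// Eq. (11): acceleration from the speeds at the two interval endpoints.
    static func acceleration(speedNow: Double, speedPrevious: Double, timeRatio: Double) -> Double {
        (speedNow - speedPrevious) / timeRatio
    }
}
```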
(3) Judgment of the direction of movement
First, the current vehicle speed is judged, and if the vehicle speed is zero, the vehicle is stationary. Second,
when the vehicle speed is not 0, calculate the offset of the vehicle target in the x-axis and y-axis directions in
one sampling interval, and calculate the displacement angle by Eq. (12).
angleInDegrees = atan2(dy, dx) × 180 / π  (12)
where the atan2 function returns the angle between the line from the start point to the end point and the positive x-axis, so that angleInDegrees represents the angle relative to the positive x-direction and lies in the interval (−180°, 180°].
Finally, the moving direction of the vehicle can be obtained by the interval where the displacement angle is
located. As shown in Fig. 15, the lane change judgment can be made after obtaining the direction of
movement.
Fig. 15. Schematic of vehicle displacement angle and movement direction correspondence
(4) Lane change judgment
The Euclidean distance between sampling points is calculated by Eq. (13) and compared with the lane
change thresholds.
isLaneChange = { true, if distance > Threshold; false, if distance ≤ Threshold }  (13)
where isLaneChange is the lane change identifier and distance is the Euclidean distance between the sampling points.
When the lane-changing condition is met, the movement directions at the two endpoints of the sampling interval are obtained and the states before and after are compared to determine whether a lane change or a turn has occurred; finally, the lane-changing behavior in each direction is judged from the signs of dx and dy. The calculation for a vehicle driving north is shown in Fig. 16.
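The direction and lane-change tests of Eqs. (12) and (13) reduce to a few lines. The sketch below assumes dx and dy are the offsets over one sampling interval and that the lane-change threshold is calibrated per scene.

```swift
import Foundation

/// Eq. (12): displacement angle relative to the positive x-axis, in degrees.
func displacementAngle(dx: Double, dy: Double) -> Double {
    atan2(dy, dx) * 180.0 / .pi
}

/// Eq. (13): flag a lane change when the Euclidean distance between the two
/// sampling endpoints exceeds the (scene-dependent) lane-change threshold.
func isLaneChange(dx: Double, dy: Double, threshold: Double) -> Bool {
    (dx * dx + dy * dy).squareRoot() > threshold
}
```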
Real-time object detection and tracking are implemented through two distinct views, with the result plotting
module responsible for rendering the vehicle target frames and micro traffic data. Additionally, the data
visualization interface extracts macro traffic data from a database and offers visual representation. When
processing results are conveyed to the result drawing module, it triggers the setNeedsLayout() method of
DetectView, which, in turn, invokes the draw() method. This process leverages the Core Graphics framework
to craft the target results. The drawing process is segmented into four fundamental stages: drawing context
configuration, drawing area computation, target box rendering, and target information box rendering.
(1) Drawing Context Configuration: To configure the drawing context, the current graphics context "ctx" is
acquired using the “UIGraphicsGetCurrentContext()” method. It is noteworthy that the coordinate system
origin within the Core Graphics frame is situated in the lower left corner. Upon the return of the “draw()”
method, the system automatically aligns the coordinate system with the UIKit coordinate system. The
“saveGState()” method is employed to create a copy of the current graphics state, which is pushed to the top
of the context graphics state stack. Subsequently, the context is configured to act as a transparent artboard,
ensuring that the drawing layer does not obstruct the FPV View's video transmission screen.
(2) Computation of the Drawing Area: The video stream resolutions vary among different UAV models,
while screen sizes differ across various iOS devices. Consequently, the scaling ratio of the video frame
displayed on the screen exhibits variations, as exemplified in Fig. 18. Importantly, the drone's video stream
does not occupy the entire screen; hence, it becomes necessary to compute the dimensions of the drawing area
to enable accurate rendering of the target boxes.
The aspect ratio of the video stream is determined using the pre-acquired property
"captureDeviceResolution." Subsequently, the size of the graphic drawing area, relevant to the current
device's screen size, is computed and represented as "imageAreaRect." It's crucial to note that the iOS system
automatically scales the video frame to fit the screen. In this regard, two scenarios arise during video frame
scaling: "case 1," involving alterations in height based on width, and "case 2," leading to changes in width
based on height. The calculations are as follows:
The aspect ratio AspectRatio of the video stream is calculated as per Eq. (14).
AspectRatio = width / height  (14)
where width and height are the resolution of the video stream. In case 1, the video frame is scaled to the view width, giving a scaled size SizeOption1 = (V.width, ⌊V.width / AspectRatio⌋), where V denotes the size of the display view. The Dx and Dy offsets of the drawing area are computed by applying Eq. (16) and Eq. (17), and the dimensions of SizeOption1 are employed as the dimensions of the imageAreaRect within the drawing area.
Dx = { 0, if SizeOption1.height ≤ V.height; ⌊(V.width − SizeOption2.width) / 2⌋, if SizeOption1.height > V.height }  (16)
Dy = { ⌊(V.height − SizeOption1.height) / 2⌋, if SizeOption1.height ≤ V.height; 0, if SizeOption1.height > V.height }  (17)
Conversely, if the earlier condition is not satisfied, the situation is identified as case 2. In this case, the dimensions of the scaled area, denoted SizeOption2, are calculated through Eq. (18), the offsets Dx and Dy are again determined through Eq. (16) and Eq. (17), and the dimensions of SizeOption2 are employed as the dimensions of the drawing area.
SizeOption2 = (⌊V.height × AspectRatio⌋, V.height)  (18)
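A sketch of this aspect-fit computation is given below; it mirrors Eqs. (14)-(18), and AVFoundation's AVMakeRect(aspectRatio:insideRect:) provides an equivalent built-in calculation.

```swift
import CoreGraphics

/// Aspect-fit drawing area (imageAreaRect) for a video frame of size `stream`
/// inside a view of size `view`, mirroring Eqs. (14)-(18).
func imageAreaRect(stream: CGSize, view: CGSize) -> CGRect {
    let aspectRatio = stream.width / stream.height                      // Eq. (14)

    // Case 1: scale to the view width; the height changes accordingly.
    let option1 = CGSize(width: view.width,
                         height: (view.width / aspectRatio).rounded(.down))
    if option1.height <= view.height {
        let dy = ((view.height - option1.height) / 2).rounded(.down)    // Eq. (17)
        return CGRect(x: 0, y: dy, width: option1.width, height: option1.height)
    }

    // Case 2: scale to the view height; the width changes accordingly, Eq. (18).
    let option2 = CGSize(width: (view.height * aspectRatio).rounded(.down),
                         height: view.height)
    let dx = ((view.width - option2.width) / 2).rounded(.down)          // Eq. (16)
    return CGRect(x: dx, y: 0, width: option2.width, height: option2.height)
}
```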
(3) The process of target box rendering involves the traversal of the result array “polyRects” designated for
rendering. For each target object “polyRect” within the array, distinct drawing colors are assigned based on
the current state of the vehicle. The “cornerPoints” attribute of the polyRect object is obtained, providing the
coordinates of inflection points. An adjustment to the coordinate system origin of these inflection point
coordinates is executed, aligning it with the coordinate system origin of the imageAreaRect within the
drawing area. Subsequently, the inflection points are traversed and connected through the utilization of the
“addLine()” method, culminating in the closure of the target box.
(4) Proceeding to the drawing of the target information box, the size of the text box is derived from the
“boundingBox” property of the polyRect object. An offset is applied to the “textRect” to align it with the
imageAreaRect coordinate system through the use of the affine transformation “CGAffineTransform”. The
offset positions the text box 1 height unit above the rectangular box while retaining the same height
adjustment. The width of the text box is established as 4 times the size of the rectangular box. The text
information is formatted in rich text style, incorporating the micro parameters of the vehicle target. This
information is drawn within the text box area.
Lastly, the graphics state saved at the beginning of the drawing pass is restored using the “restoreGState()” method. This concludes the draw method, culminating in the rendering of the detection results for the current frame and the refresh of the detectView interface.
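A condensed sketch of the DetectView drawing pass is shown below; the PolyRect type is a stand-in for the result objects described above, and only the context-configuration and target-box stages are spelled out.

```swift
import UIKit

/// Stand-in for the tracking result type described in the text.
struct PolyRect {
    var cornerPoints: [CGPoint]   // box corners in image coordinates
    var color: UIColor            // stroke colour reflecting the vehicle state
}

final class DetectView: UIView {
    var polyRects: [PolyRect] = [] { didSet { setNeedsDisplay() } }
    var imageAreaRect: CGRect = .zero   // stage (2): drawing area from Eqs. (14)-(18)

    override func draw(_ rect: CGRect) {
        // Stage (1): drawing context configuration, a transparent artboard
        // layered over the FPV View.
        guard let ctx = UIGraphicsGetCurrentContext() else { return }
        ctx.saveGState()
        ctx.clear(rect)

        // Stage (3): target box rendering. Shift each corner into the drawing
        // area and connect the corners into a closed polygon.
        for polyRect in polyRects {
            guard let first = polyRect.cornerPoints.first else { continue }
            ctx.setStrokeColor(polyRect.color.cgColor)
            ctx.setLineWidth(2)
            ctx.move(to: CGPoint(x: first.x + imageAreaRect.origin.x,
                                 y: first.y + imageAreaRect.origin.y))
            for point in polyRect.cornerPoints.dropFirst() {
                ctx.addLine(to: CGPoint(x: point.x + imageAreaRect.origin.x,
                                        y: point.y + imageAreaRect.origin.y))
            }
            ctx.closePath()
            ctx.strokePath()
        }
        // Stage (4): the target information box would be drawn here with an
        // NSAttributedString offset into the imageAreaRect coordinate system.

        ctx.restoreGState()
    }
}
```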
This chapter elaborates on the training and deployment procedures of the object detection model,
encompassing comparative experiments designed to scrutinize the varying performance and efficacy of each
model. To assess the practicality of real-time target detection and analysis, a series of comparative
experiments were undertaken to validate the system's operational effectiveness. These experiments encompass the training of distinct models, feasibility assessments, real-time target detection performance validation, real-time target tracking performance validation, performance testing, and real-world flight experiments.
A total of three datasets were employed in the experimental phase, with the primary dataset originating
from a multitude of intersection traffic videos captured by DJI drones within the urban confines of Hohhot,
designated as the Hohhot dataset. To enhance the dataset's diversity, this study amalgamated the VisDrone
dataset, the UAVDT dataset, and the Drone Vehicle dataset.
(1) Hohhot dataset: This dataset encompasses vehicle videos recorded at various prominent intersections
within the urban region of Hohhot, as depicted in subfigure (a) of Fig. 19. The videos were acquired under
varying altitudes, lighting conditions, weather circumstances, and across multiple temporal segments,
including morning and evening rush hours. The drone captures footage from a 90° vertical downward
perspective.
(2) VisDrone dataset: As illustrated in subfigure (b), this dataset comprises videos from 14 distinct cities,
encompassing diverse vehicle densities, weather conditions, and environmental settings. The dataset
encompasses ten target categories, including small targets such as pedestrians and bicycles. The VisDrone dataset consists of 400 video clips comprising 265,228 frames, as well as 10,209 static images, all captured by drones.
(3) UAVDT dataset: Displayed in subfigure (c), this dataset exhibits a resolution of 1080×540, recorded at
30 FPS, encompassing video frame sequences under differing weather conditions, altitudes, angles, and levels
of occlusion. The dataset revolves around three primary target categories: CAR, BUS, and TRUCK.
Fig. 19. Sample images from (a) the Hohhot dataset, (b) the VisDrone dataset, and (c) the UAVDT dataset
Dataset fusion and partitioning: Following the consolidation of the three datasets, the resolution is
standardized to 640 × 640 pixels, and the dataset is partitioned into multiple sub-datasets. Additionally, the
label information is converted into TXT format, employing normalized coordinates, suitable for the YOLO
algorithm, and JSON format, utilizing pixel coordinates, which is intended for use with Create ML. Through
this comprehensive sequence of preprocessing steps, the dataset's quality and diversity are meticulously
curated, thereby establishing a robust foundation for subsequent model training and performance assessment.
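As an illustration of the label conversion (a sketch; the class-index mapping is an assumption), a pixel-coordinate box can be rewritten as a normalized YOLO label line as follows.

```swift
import CoreGraphics
import Foundation

/// Converts one pixel-coordinate bounding box into a YOLO label line:
/// "class x_center y_center width height", all normalized to [0, 1].
func yoloLabelLine(classID: Int, box: CGRect, imageSize: CGSize) -> String {
    let cx = Double(box.midX / imageSize.width)
    let cy = Double(box.midY / imageSize.height)
    let w = Double(box.width / imageSize.width)
    let h = Double(box.height / imageSize.height)
    return String(format: "%d %.6f %.6f %.6f %.6f", classID, cx, cy, w, h)
}

// Example for a 100 × 60 px box at (320, 240) in a 640 × 640 image:
// yoloLabelLine(classID: 0,
//               box: CGRect(x: 320, y: 240, width: 100, height: 60),
//               imageSize: CGSize(width: 640, height: 640))
```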
During the model training process, two distinct methods were employed: utilizing the Create ML tool for
training the native model and employing Coremltools to convert the YOLO model based on the PyTorch
framework.
(1) Training based on Create ML
In the Create ML training process, the validation set partitioning method was set to automatic (auto); Create ML then determined the proportion of the validation set based on the dataset size. Various batch-size settings were explored, including 16, 32, 64, 128, and auto, resulting in 56 training runs. By observing the model's performance under different parameter configurations, adjustments were made to optimize the model. The three best-performing models were selected from these runs, as presented in Table 3.
Table 3 Data for the three models trained with Create ML
Training number | Loss | Training set accuracy (%) | Validation set accuracy (%) | Number of iterations | Batch size
train_1 | 5.2 | 71 | 44 | 19000 | auto
train_8_2 | 2.377 | 93 | 92 | 10000 | 16
train_182 | 0.971 | 95 | 96 | 10000 | 16
In all training experiments, the longest successful training duration was 28.5 hours, while failed training
lasted up to 38 hours. The optimal training effect was achieved with a batch size of 16 and 10,000 iterations.
Increasing the dataset size and iteration count elevated the failure rate. Larger batch sizes led to proportionate
increases in training time. In unsuccessful training attempts, the optimal Loss value converged to 2.32.
(2) Training based on YOLO model
The dataset was divided into training, validation, and test sets in a 7:2:1 ratio. The training environment specifications are outlined in Table 5. Utilizing the PyTorch framework, three YOLO models, namely
YOLOv5, YOLOv7, and YOLOv8, were trained. The models' characteristics facilitated their conversion into
mobile models. The model conversion algorithm was modified, incorporating the NMS module function. This
modification allowed for direct model visualization on macOS and enabled the filtering of target detection
results through IOU and confidence level. Separating bounding box functionality from business logic
enhanced system efficiency and facilitated model usage on iOS platforms.
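The confidence and IoU filtering folded into the converted model is equivalent in spirit to a greedy non-maximum suppression step; a sketch with illustrative thresholds is given below.

```swift
import CoreGraphics

struct Detection {
    var rect: CGRect      // box in image coordinates
    var score: CGFloat    // confidence
}

/// Intersection-over-union of two boxes.
func iou(_ a: CGRect, _ b: CGRect) -> CGFloat {
    let inter = a.intersection(b)
    guard !inter.isNull, inter.width > 0, inter.height > 0 else { return 0 }
    let interArea = inter.width * inter.height
    let unionArea = a.width * a.height + b.width * b.height - interArea
    return interArea / unionArea
}

/// Greedy non-maximum suppression with confidence and IoU thresholds
/// (threshold values here are illustrative defaults, not the paper's).
func nonMaxSuppression(_ detections: [Detection],
                       scoreThreshold: CGFloat = 0.25,
                       iouThreshold: CGFloat = 0.45) -> [Detection] {
    let sorted = detections
        .filter { $0.score >= scoreThreshold }
        .sorted { $0.score > $1.score }
    var kept: [Detection] = []
    for candidate in sorted {
        if kept.allSatisfy({ iou($0.rect, candidate.rect) < iouThreshold }) {
            kept.append(candidate)
        }
    }
    return kept
}
```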
Training for YOLOv5 involved utilizing the YOLOv5m initial weights, with an input size of 640 × 640 and
a batch size of 4. Training proceeded through three distinct phases: 5, 50, and 500 epochs. Fig. 21 illustrates
the results after 5 epochs of training.
To test the performance of the trained models and verify whether they meet real-time requirements, this section presents experiments on single-frame object detection and on continuous-frame object detection. The test consists of two parts: a single-image detection effect test and a sequential video frame object detection performance test. The test environment is shown in Table 4.
Table 4 Object detection effect test environment
Configuration | Parameter
System | macOS 13.3
RAM | 16 GB
Chip | Apple M1
NPU performance | 16-core, 11 TOPS
(1) Single-image object detection performance
The trained model was loaded using the Create ML App, and its recognition performance for a single frame
was assessed, with the test results displayed in Fig. 23.
Using the system's ability to read local albums, the test video is processed and video frames are extracted sequentially. The tracking results are illustrated in Fig. 25. In subfigure (a), the tracking effectiveness is demonstrated as
id39 executes a left turn from north to east, and id29 proceeds straight from north to south at frames 112, 274,
and 439, leveraging the YOLOv7 model. Notably, at frame 274, id39 is discerned as executing a left turn.
Subfigure (b) displays the tracking outcomes as id27 travels straight from east to west while id7 advances
straight from west to east at frames 1235, 1458, and 1669, capitalizing on the YOLOv8 model. This approach
successfully identifies id27's left lane change behavior at frame 1458.
The real-flight experiments focused on multiple intersections in Hohhot, Inner Mongolia. The drone operated at altitudes ranging from 80 to 120 meters in clear weather with wind force levels of 2 to 4, and flights were conducted between 1 p.m. and 5 p.m. The flight platform was a DJI Mavic Mini drone, and the mobile device was an iPad Pro 2020.
Throughout the real-flight experiments, certain targets exhibited challenges in identification, resulting in
either incorrect identifications or missed identifications. By systematically analyzing the experimental video
footage captured during the actual flight experiment, these occurrences were categorized and quantified at 15-
second intervals. The analysis encompassed a total duration of 62 minutes and 39 seconds. The assessment
outcomes are comprehensively presented in Table 7.
Table 7 Performance evaluation metrics
TP | FP | FN | Precision (%) | Recall (%) | F1 (%)
2956 | 52 | 406 | 98.27 | 87.93 | 92.85
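Assuming the standard definitions, the metrics in Table 7 are computed as Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 × Precision × Recall / (Precision + Recall).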
Detection target plotting and data visualization are achieved through two distinct views. The result plotting
module is responsible for receiving data from the data processing module and rendering the vehicle target
frames along with the corresponding micro traffic data. On the other hand, the data visualization interface
extracts macro traffic data from the database and presents it in a visual format. As depicted in Fig. 26, the user can seamlessly switch between views (a) and (b) by clicking on the corresponding labeled area; during the real flight experiment, clicking this area triggers a swift transition between the two views.
This paper constructs a reliable and efficient real-time mobile UAV video-based vehicle data acquisition
and analysis system. The main findings are summarized as follows:
1. This research presents a comprehensive solution that synergizes UAV development and deep learning
technology on mobile terminals. Leveraging Swift and Objective-C languages, the iOS SDK serves as the
fundamental framework for the program, while DJI Mobile SDK and UX SDK are utilized to develop UAV
functionalities. Additionally, CoreML and the Vision framework are employed to deploy object detection
models, culminating in the design of a real-time object detection and tracking system on iOS devices.
2. Through a meticulous comparison between the Create ML-trained native model and the YOLOv8 model,
it is evident that the YOLO model surpasses the Create ML-trained native model in performance. To further
enhance the tracking algorithm, the original Vision framework tracker is replaced with the SORT algorithm-
based tracking module. By integrating the YOLOv8 model with the SORT algorithm, real flight tests are
conducted using an iPad equipped with the A12Z chip and a DJI Mavic Mini. The results show a precision of 98.27% and a recall of 87.93%. When the number of vehicles reaches 27, the real-time processing capability of the A12Z chip falls below the drone's video transmission rate; therefore, this paper optimizes performance through a frame-dropping strategy, selectively skipping non-critical video frames to enhance system efficiency without compromising accuracy and stability in the detection process.
3. Factors such as UAV camera parameters, flight parameters, and equipment hardware performance, are
considered to dynamically calculate the FPS parameters for real-time video frame processing capability. This
enables the acquisition of real-time micro parameters such as speed, acceleration, direction, and lane-changing
behavior, along with macro parameters like the number of vehicles, total vehicles, and vehicles in all
directions within the regional road section. The visualization of this traffic data is presented in real time,
offering valuable insights into the traffic scenario.
This paper has made significant strides in the domain of real-time target detection and tracking for mobile
UAVs. However, certain aspects demand further investigation and refinement. One crucial aspect pertains to
the communication delay between UAVs and mobile devices, as it can impact the efficacy of real-time target
detection and tracking. Addressing this challenge through communication-related optimizations will
contribute valuable advancements to the system's overall efficiency and effectiveness.
Moreover, specific application scenarios, such as inclement weather conditions like rain and snow, as well
as low-light environments, present unique challenges for detection and tracking performance. By devising
targeted optimizations tailored to these scenarios, we can bolster the system's capabilities and ensure robust
performance under diverse real-world conditions.
In conclusion, conducting further research to address communication delays and exploring specialized
optimizations for specific scenarios holds the potential to elevate real-time target detection and tracking for
mobile UAVs. These advancements will enhance the system's practicality and reliability, making it a more
valuable tool for a wide range of real-world applications.
References
1. Zhang, H., M. Liptrott, N. Bessis, and J. Cheng, 2019. Real-time traffic analysis using deep
learning techniques and UAV based video. In: Proceedings of the 2019 16th IEEE
International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-
5.
2. Byun, S., I.-K. Shin, J. Moon, J. Kang, and S.-I. Choi, 2021. Road Traffic Monitoring from UAV
Images Using Deep Learning Networks. Remote Sensing 13(20), pp. 4027.
3. Chalmers, C., P. Fergus, C.A.C. Montanez, S.N. Longmore, and S.A. Wich, 2021. Video analysis for the detection of animals using convolutional neural networks and consumer-grade drones. Journal of Unmanned Vehicle Systems 9(2), pp. 112-127.
4. Gu, X., M. Abdel-Aty, Q. Xiang, Q. Cai, and J. Yuan, 2019. Utilizing UAV video data for in-depth analysis of drivers’ crash risk at interchange merging areas. Accident Analysis & Prevention 123, pp. 159-169.
5. Chen, A.Y., Y.-L. Chiu, M.-H. Hsieh, P.-W. Lin, and O. Angah, 2020. Conflict analytics through the vehicle safety space in mixed traffic flows using UAV image sequences. Transportation Research Part C: Emerging Technologies 119, pp. 102744.
6. Wu, Y., M. Abdel-Aty, O. Zheng, Q. Cai, and S. Zhang, 2020. Automated safety diagnosis based on unmanned aerial vehicle video and deep learning algorithm. Transportation Research Record 2674(8), pp. 350-359.
7. Ma, Y., H. Meng, S. Chen, J. Zhao, S. Li, and Q. Xiang, 2020. Predicting traffic conflicts for expressway diverging areas using vehicle trajectory data. Journal of Transportation Engineering, Part A: Systems 146(3), pp. 04020003.
8. Chen, X., Z. Li, Y. Yang, L. Qi, and R. Ke, 2020. High-resolution vehicle trajectory extraction and denoising from aerial videos. IEEE Transactions on Intelligent Transportation Systems 22(5), pp. 3190-3202.
9. Lv, Y. and M. Abdel-Aty, 2023. The University of Central Florida’s Smart and Safe Transportation Lab [ITS Research Lab]. IEEE Intelligent Transportation Systems Magazine 15(1), pp. 468-475.
10. Zheng, O., M. Abdel-Aty, L. Yue, A. Abdelraouf, Z. Wang, and N. Mahmoud, 2022. CitySim: A
Drone-Based Vehicle Trajectory Dataset for Safety Oriented Research and Digital Twins.
arXiv preprint arXiv:2208.11036.
11. Zhu, X., S. Lyu, X. Wang, and Q. Zhao, 2021. TPH-YOLOv5: Improved YOLOv5 Based on
Transformer Prediction Head for Object Detection on Drone-captured Scenarios, pp. 2778-
2788.
12. Wu, H., J. Nie, Z. He, Z. Zhu, and M. Gao, 2022. One-Shot Multiple Object Tracking in UAV
Videos Using Task-Specific Fine-Grained Features. Remote Sensing 14(16), pp. 3853.
13. Walambe, R., A.R. Marathe, and K. Kotecha, 2021. Multiscale Object Detection from Drone
Imagery Using Ensemble Transfer Learning. Drones.
14. Jadhav, A., P. Mukherjee, V. Kaushik, and B. Lall, 2020. Aerial Multi-Object Tracking by
Detection Using Deep Association Networks, pp. 1-6 .
15. Ali, S. and A. Jalal, 2023. Vehicle Detection and Tracking from Aerial Imagery via YOLO and
Centroid Tracking, pp. 146.
16. Hua, X., X. Wang, T. Rui, F. Shao, and D. Wang, 2021. Light-weight UAV object tracking network based on strategy gradient and attention mechanism. Knowledge-Based Systems 224, pp. 107071.
17. Ma, X., K. Ji, B. Xiong, L. Zhang, S. Feng, and G. Kuang, 2021. Light-YOLOv4: An Edge-
Device Oriented Target Detection Method for Remote Sensing Images. IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing 14 (6), pp. 10808-
10820.
18. Lo, L.-Y., C.H. Yiu, Y. Tang, A.-S. Yang, B. Li, and C.-Y. Wen, 2021. Dynamic Object Tracking
on Autonomous UAV System for Surveillance Applications. Sensors 21(23), pp. 7888.
19. Martinez-Alpiste, I., P. Casaseca-de-la-Higuera, J. Alcaraz-Calero, C. Grecos, and Q. Wang,
2019. Benchmarking Machine-Learning-Based Object Detection on a UAV and Mobile
Platform, pp. 1-6.
20. Zhou, G., J. Yuan, I.L. Yen, and F. Bastani, 2016. Robust real-time UAV based power line detection and tracking, pp. 744-748.
21. Li, C., X. Sun, and J. Cai, 2019. Intelligent Mobile Drone System Based on Real-Time Object
Detection. Journal on Artificial Intelligence 1(1), pp. 1-8.