Chapter - 1
INTRODUCTION
Small object detection has significant research and application value. For instance, tiny objects
such as nuts, screws, washers, nails, and fuses can be present on airport runways, and accurate
detection of these objects could prevent major aviation accidents and economic losses. For
autonomous driving, it is essential to accurately detect small objects that could cause traffic
accidents in the high-resolution scene photos captured by cars. In satellite remote sensing images,
targets may occupy only a small area or even a few pixels. Small targets usually lack
sufficient appearance information relative to regular-sized targets, making it difficult to
distinguish them from the background or from similar targets. YOLO and SSD are popular object
detection models in industry and have proven highly efficient for real-time object detection.
However, their performance on small object detection datasets is still far from ideal.
Previous research shows that on the public object detection dataset MSCOCO there is a
significant gap in detection performance between small and large targets, with the mean average
precision for small targets typically being about half that of large targets. Small object detection
therefore clearly remains challenging. In addition, real-world scenes are usually intricate and
complex, often containing illumination changes, target occlusion, densely connected targets,
and similar factors whose effect on small target features further increases the difficulty of
small object detection.
Human eyes are more sensitive to moving objects than to stationary ones. Previous studies have
shown that our visual perception system relies on both temporal and spatial resolution to
recognize objects, and that human perception and attention are higher for moving objects than
for stationary ones. For detecting vehicle targets from cameras mounted on drones, we therefore
try to integrate motion trend information into the object detection process. Vehicle objects in
conventional drone images are usually small, and increasing their temporal resolution can
improve small object detection performance. To do this, we use an image-to-image translation
method to construct prior knowledge of the Euler motion information from our collected dataset.
In doing so, we need to answer two questions: how to generate the motion trend maps, and how
to incorporate this motion information into a conventional 2D object detection model.
Detecting small objects is a crucial challenge in various fields due to its significant research and
practical applications. For example, tiny objects like nuts, screws, washers, nails, and fuses on
airport runways pose serious safety risks, and accurately identifying them can help prevent
aviation accidents and economic losses. In the case of autonomous vehicles, detecting small
objects on the road is essential to avoid potential traffic hazards. Similarly, in satellite remote
sensing images, objects of interest often occupy only a few pixels, making their identification
difficult. The challenge with small object detection arises from their minimal visual details,
which makes distinguishing them from the background or similar-looking objects difficult.
Modern object detection models such as YOLO and SSD have proven effective for real-time
detection tasks. However, they still struggle with small object detection, as previous studies
indicate a noticeable performance gap between detecting small and large objects. For instance, in
the MSCOCO dataset, the mean average precision (mAP) for small objects is significantly lower
than for larger objects. This issue is further compounded by real-world conditions, where varying
lighting, object occlusion, and densely packed targets create additional obstacles for detection
algorithms.
One potential solution to improve small object detection involves leveraging motion-based
information. Studies show that human vision is naturally more attuned to moving objects than
static ones, relying on both spatial and temporal resolution to perceive them effectively. Inspired
by this, researchers aim to integrate motion trend data into object detection tasks. This approach
is particularly useful in drone-based vehicle detection, where vehicles often appear small in
captured images. Enhancing their temporal resolution by constructing motion trend maps can
improve detection accuracy. The challenge, however, lies in effectively generating motion trend
maps and incorporating this data into conventional 2D object detection models for drones.
The comparison of detection performance on the VisDrone dataset between YOLO-V5 with and
without the Collaborative Filtering Mechanism (CFM) highlights the advantages of integrating
CFM for small object detection. Small object detection presents unique challenges due to limited
appearance information, background noise, and occlusions, making it difficult for conventional
object detection models to achieve high accuracy.
This is particularly evident in applications such as autonomous driving, satellite remote sensing,
and drone-based surveillance, where small targets like vehicles, debris, or tiny structural
elements must be identified with precision.
Despite the efficiency of models like YOLO and SSD in real-time detection, their performance
on small objects remains suboptimal. Studies have shown that on datasets such as MSCOCO, the
mean average precision (mAP) for small objects is significantly lower compared to larger ones.
This gap is further exacerbated by complex real-world conditions, including variable lighting,
densely packed targets, and occlusions.
The introduction of CFM in YOLO-V5 aims to address these challenges by filtering out
irrelevant background information and enhancing the focus on motion-based and spatially
relevant features. By leveraging CFM, the detection pipeline effectively improves feature
extraction, leading to higher precision and recall for small objects.
Experimental results on the VisDrone dataset demonstrate that YOLO-V5 with CFM
significantly outperforms its baseline counterpart, particularly in detecting small, occluded, and
densely connected objects.
This makes CFM-enhanced YOLO-V5 a promising approach for applications that require
accurate small object detection, such as traffic monitoring, security surveillance, and remote
sensing.
The motion trend map, which contains motion trend information, can be obtained from
sequential frames of drone imagery. We designed an algorithm that calculates the difference
between adjacent frames and constructs a difference matrix, which we call the motion trend
map, to indicate moving objects. We then use an image translation model to learn the mapping
from a 2D image to its corresponding motion trend map. The reason we cannot directly use
drone clips to compute motion trend maps in real time is that when the drone moves, all pixels
in the frame move simultaneously. To avoid this, the motion prior knowledge is computed on
near-stationary aerial clips and kept separate from the detection branch; by accumulating
position changes we can filter out irrelevant information and attend only to the objects that
actually move on the ground. We give more details in Section 3.
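As a rough illustration of this step, the sketch below accumulates the absolute differences between adjacent frames of a near-stationary clip into a single difference matrix. It is a minimal version only: the frames are assumed to be grayscale NumPy arrays, and the function name and noise threshold are ours rather than part of the original implementation.

import numpy as np

def motion_trend_map(frames, noise_threshold=10.0):
    # frames: list of H x W grayscale images from a near-stationary aerial clip.
    # Returns an H x W map that is large where objects actually moved on the ground.
    acc = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
        diff[diff < noise_threshold] = 0.0   # suppress sensor noise and small jitter
        acc += diff                          # accumulate position changes over the clip
    return acc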
After obtaining the motion trend map, we need to consider how to integrate it into the object
detection model so that the model can take advantage of the motion trend prior. In this paper
we propose a Collaborative Filtering Mechanism (CFM) that enables the model to filter out the
parts of the feature map that do not contain motion trends, thereby improving detection
performance. CFM is inspired by the design of the pooling layer in deep learning. A traditional
pooling layer, such as a max pooling layer, mainly performs down sampling without interfering
with detection results, which implies that the feature maps produced by the convolutional
feature extractor contain redundant information. In small object detection, irrelevant
information in fact usually occupies most of the image area. Thinking in reverse, we found that
filtering this “redundant” information in a more targeted manner is a promising direction for
small object detection. The detailed design of CFM is explained in Section 3.
Fig. 1. motion trend map generation using GAN: On the left side is the input 2D image-motion
trend map pair, and on the right side is the schematic of the PIX2PIX model, where the red block
in the center is the Self-Attention module we added.
MODEL DESIGN
To form the motion trend map, we compute pixel-level differences between adjacent 2D frames
as motion features. Let m denote the motion trend map of a frame f in a frame sequence F,
where m and f share the same dimensions. For the frame fi and its next frame fi+1, the
motion trend map can be formed as follows:
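One formulation consistent with this description, treating the motion trend map as the element-wise absolute difference between adjacent frames, is

\[
m_i(x, y) = \left| f_{i+1}(x, y) - f_i(x, y) \right|
\]

where (x, y) indexes the pixel positions shared by both frames.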
In this way, we define the generation of the motion information feature map. Afterwards,
we transform the pixels in m into color space using Algorithm 1, which gives us RGB image
→ RGB motion trend map pairs (shown in Figure 1) that can be used for image-to-image
training.
Algorithm 1 (displacement-to-color mapping), fragment only:
Initialization: ...
ColorMatch: featfakeb ← E(fakeb)
else: pointcolor ← (0, 0, 0)
end if
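Only fragments of Algorithm 1 survive above, so the sketch below shows one plausible displacement-to-color mapping: zero-motion pixels are mapped to black (0, 0, 0), matching the pointcolor ← (0, 0, 0) branch of the fragment, while the channel encoding chosen for moving pixels is our own assumption rather than the exact ColorMatch rule.

import numpy as np

def displacement_to_color(m, eps=1e-6):
    # m: H x W non-negative motion trend map (accumulated pixel displacements).
    # Returns an H x W x 3 uint8 RGB image; still pixels become black (0, 0, 0).
    rgb = np.zeros((*m.shape, 3), dtype=np.uint8)
    moving = m > eps
    if moving.any():
        norm = (m[moving] / m[moving].max() * 255.0).astype(np.uint8)
        rgb[moving, 0] = norm          # larger displacement, stronger red
        rgb[moving, 2] = 255 - norm    # smaller displacement, stronger blue
    return rgb

The resulting RGB motion trend map is paired with its RGB frame to form the training pairs shown in Figure 1.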
After generating the motion trend maps, we can train an image translation model to learn the
mapping from the RGB image to its motion trend map. Rather than using the plain U-Net
structure of the Pix2Pix model unchanged, we added a self-attention mechanism to the
bottleneck layer in the middle of the network to help better aggregate the high-level semantic
information.
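To illustrate where such a module sits, the following is a minimal SAGAN-style self-attention block in PyTorch that could be placed at the U-Net bottleneck; the class name, the channel-reduction factor, and the placement shown in the comment are illustrative assumptions rather than the exact design used here.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    # Self-attention over the spatial positions of a bottleneck feature map.
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learned weight of the attention branch

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)    # (B, HW, C/8)
        k = self.key(x).flatten(2)                      # (B, C/8, HW)
        attn = torch.softmax(q @ k, dim=-1)             # (B, HW, HW) attention weights
        v = self.value(x).flatten(2)                    # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                     # residual keeps the U-Net behaviour intact

# For example, inserted between the down- and up-sampling halves of the generator:
# bottleneck = nn.Sequential(down_block, SelfAttention(512), up_block)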
This process is shown in Figure 1.
Fig. 2. Visualization of the effect of the Collaborative Filtering Mechanism.
Fig. 3.
The core idea of the Collaborative Filtering Mechanism is to use the motion trend map to
help the 2D image, in the feature extraction part of YOLO-V5, separate out the features
that are irrelevant to the detection target. To do this, we build a parallel pipeline identical to
YOLO-V5’s 2D image feature extraction and use the motion trend map as a mask at each
step to filter the feature maps computed from the original input image at each layer. For each
element in the feature map produced at each step from the 2D image, we look up the element
at the same position in the motion trend map: if it is non-zero, we keep the feature element;
if it is zero, we clear the corresponding element of the feature map at the same position.
This process is visualized in Figure 2 and Figure 3.
We demonstrate the effect of CFM in the experimental section.
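The filtering rule described above amounts to an element-wise mask. The sketch below is a minimal version that assumes the motion-trend branch has already produced a feature map of the same shape as the image branch at the given layer; the function name is ours, not taken from the released code.

import torch

def cfm_filter(image_feat, motion_feat):
    # image_feat:  (B, C, H, W) features from the YOLO-V5 image branch at this layer.
    # motion_feat: (B, C, H, W) features from the parallel motion-trend branch.
    keep = (motion_feat != 0).to(image_feat.dtype)   # 1 where the motion trend is non-zero
    return image_feat * keep                         # zero out features with no motion support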
Chapter - 2
LITERATURE SURVEY
In recent years, deep neural networks based on Convolutional Neural Networks (CNNs) have
achieved remarkable success in object detection. One of the pioneering approaches was proposed
by Girshick et al., known as RCNN, which transformed the object detection problem into a
classification task. This method marked a significant breakthrough in object detection. Building
on this, Fast RCNN combined the advantages of RCNN and SPPNet, introducing ROI pooling to
address the challenge of different input scales.
Further advancements led to the development of Faster RCNN, where Ren et al. introduced a
Region Proposal Network (RPN) to reduce the computational cost of generating candidate
frames. Later, Dai et al. proposed the Region-based Fully Convolutional Network (RFCN),
which replaced fully connected layers with position-sensitive score maps obtained through full
convolution. This innovation significantly improved detection speed.
Another key development was the Feature Pyramid Network (FPN) introduced by Lin et al.
Before FPN, most CNN-based detectors performed detection only at the top layer of the network.
While deep CNN features are beneficial for category identification, they do not always aid in
precise target localization. FPN addressed this by incorporating a laterally connected top-down
structure, making substantial progress in multi-scale detection tasks. Today, it is a fundamental
component of many state-of-the-art models.
Despite the impressive performance of YOLO-V5, the issue of small object detection remains a
challenge due to the loss of fine-grained features during multiple down sampling operations. To
address this limitation, researchers have explored various enhancements, such as incorporating
attention mechanisms, optimizing feature fusion strategies, and using higher-resolution input
images. Additionally, newer versions like YOLO-V7 and YOLO-V8 have introduced structural
improvements to further refine accuracy and efficiency in real-time detection scenarios.
Looking ahead, object detection models continue to evolve with the integration of transformer-
based architectures like Vision Transformers (ViTs) and Swin Transformers. These models
leverage self-attention mechanisms to capture long-range dependencies and enhance feature
representations. Moreover, hybrid approaches combining CNNs with transformers are being
developed to balance efficiency and accuracy.
The adoption of self-attention mechanisms in object detection has significantly improved the
ability of models to capture fine-grained spatial relationships. Traditional CNN-based
architectures rely on localized receptive fields, limiting their ability to understand long-range
dependencies in an image. The introduction of Transformer-based models, such as DEtection
TRansformers (DETR) and Swin Transformer, has allowed object detection systems to model
complex spatial correlations more effectively. These approaches leverage multi-head self-
attention to enhance feature extraction, making them particularly useful for detecting small or
occluded objects in cluttered environments, such as aerial drone images.
Another crucial innovation in modern object detection is multi-scale feature fusion, which
enhances the ability of networks to detect objects of varying sizes. Traditional detection
frameworks often struggle with objects that appear at significantly different scales within the
same image. Techniques like Feature Pyramid Networks (FPN), Path Aggregation Networks
(PANet), and BiFPN (Bidirectional Feature Pyramid Network) have been developed to address
this issue. These architectures ensure that both high-resolution and low-resolution features
contribute to the final predictions, improving overall accuracy. For real-time applications like
autonomous drones, traffic monitoring, and surveillance, multi-scale detection is essential for
identifying both large vehicles and smaller objects, such as pedestrians or debris.
The increasing demand for real-time object detection on edge devices has led to the optimization
of lightweight architectures; state-of-the-art models such as YOLO-V7 and YOLO-V8 reflect this trend.
Chapter - 3
The evolution of deep learning, particularly Convolutional Neural Networks (CNNs), has led to
significant advancements in object detection models, improving accuracy and efficiency.
Landmark models such as RCNN, Fast RCNN, Faster RCNN, and YOLO have addressed many
challenges in object detection, making real-time applications feasible. However, detecting small
objects, occluded objects, and objects in complex backgrounds remains a persistent challenge.
One major limitation in existing models is the loss of fine-grained features due to multiple down
sampling operations. While architectures like Feature Pyramid Networks (FPN) and Path
Aggregation Networks (PANet) have attempted to mitigate this issue, small object detection
remains suboptimal.
Additionally, most CNN-based models rely on deep feature extraction, which is beneficial for
classification but often lacks precision in localization, particularly for small and overlapping
objects.
Another challenge is maintaining a balance between detection speed and accuracy. While models
like YOLO-V5 have demonstrated real-time performance, their reliance on multiple down
sampling operations reduces the ability to retain critical spatial information.
Furthermore, new approaches such as Vision Transformers (ViTs) and hybrid CNN-transformer
architectures are emerging, showing promising improvements in object representation and
feature extraction. However, their feasibility in real-time applications needs further exploration.
Given these challenges, there is a need for an optimized object detection framework that
improves small object detection, enhances multi-scale feature extraction, and balances
computational efficiency with accuracy.
This study aims to analyze existing detection models, explore new architectures, and propose an
improved detection framework to overcome current limitations.
Small object detection remains one of the most difficult problems in computer vision. In many
real-world scenarios, objects like screws, nuts, or pedestrians can occupy only a small portion of
the image, making them difficult to identify.
Conventional models struggle to detect such small objects because they lose vital spatial
information during down-sampling operations. Additionally, the limited appearance of small
objects leads to challenges in distinguishing them from the background or other similar objects,
which often results in high false-negative rates. Even with advanced models such as YOLO and
SSD, the accuracy of small object detection still lags behind that of larger objects, highlighting a
significant gap in the current state of the art.
While many modern object detection models excel at detecting larger objects, the trade-off
between speed and accuracy becomes more pronounced when dealing with small objects. Real-
time applications such as autonomous driving or surveillance systems require fast, efficient
processing without compromising detection quality. Many current models, including
YOLO-V5, sacrifice accuracy for speed due to their reliance on multiple down-sampling layers.
This results in a reduction of the fine-grained details essential for detecting small objects.
The challenge lies in designing an architecture that not only maintains real-time performance but
also improves the accuracy of small object detection, especially in dynamic and cluttered
environments where objects may be occluded or partially visible.
Transformer-based approaches enhance the ability to detect small objects by preserving
contextual information that is often lost in CNN-based architectures.
Hybrid models that combine the strengths of both CNNs and transformers are also gaining
popularity, offering a balance between feature extraction speed and accuracy. However, despite
their potential, these models face challenges in real-time applications, where computational
efficiency and fast inference times are critical.
Further research is needed to optimize these models for use in real-world environments where
both accuracy and speed are essential.
OBJECTIVES
o Compare the performance of models such as RCNN, Fast RCNN, Faster RCNN,
YOLO, and SSD in different detection scenarios.
o Explore the impact of transfer learning and pre-trained models on the evolution of
object detection.
o Examine the role of one-stage and two-stage detectors in balancing speed and
accuracy for different applications.
o Evaluate trade-offs between detection speed and precision across different object
scales.
o Investigate hybrid approaches that combine CNNs and transformers for better
detection accuracy.
o Propose an improved model that integrates multi-scale feature extraction and real-
time efficiency.
o Utilize motion trend maps to filter out irrelevant background noise in aerial
imagery.
o Utilize hardware accelerators like TensorRT, OpenVINO, and Edge TPU for
faster inference (a minimal export sketch follows this list).
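As a minimal, hedged illustration of the last objective: a trained PyTorch detector can first be exported to ONNX, which both TensorRT and OpenVINO accept as input. The model below is a tiny stand-in, and a real YOLO-V5 export involves additional detection-head handling not shown here.

import torch
import torch.nn as nn

# Tiny stand-in network; in practice this would be the trained YOLO-V5s detector.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU(), nn.Conv2d(16, 85, 1))
model.eval()

dummy = torch.zeros(1, 3, 640, 640)   # single 640x640 RGB input, the usual YOLO-V5s size
torch.onnx.export(
    model, dummy, "detector.onnx",
    opset_version=12,
    input_names=["images"],
    output_names=["predictions"],
)
# detector.onnx can then be given to TensorRT (e.g. trtexec) or the OpenVINO model
# converter to build an accelerator-specific engine for on-device inference.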
Chapter - 4
METHODOLOGY
o CFM integrates a motion trend map into the feature extraction process of YOLO-V5
to eliminate irrelevant background noise and enhance object detection accuracy.
o The motion trend map functions as an adaptive filter, allowing only motion-relevant
features to be retained while discarding static or less informative elements.
o This mechanism is particularly useful for small object detection in aerial imagery,
where distinguishing between moving and stationary objects is a significant
challenge.
o This approach significantly enhances the ability to track and detect small objects,
making it more effective for drone-based vision applications.
3. Self-Attention Mechanism:
o This mechanism learns spatial dependencies, refining how the network processes
different regions of an image and improving detection precision.
o Generative Adversarial Networks (GANs) are used to generate motion priors, which
help in learning the temporal movement of objects from video sequences.
o The self-attention modules within GANs further refine motion estimation, ensuring
the model can track and predict movement patterns more effectively.
o This enables the system to achieve higher accuracy in tracking small and fast-moving
objects, which are often missed by traditional detection models.
o This color mapping process simplifies complex motion information, making it easier
for the model to extract temporal-based movement patterns.
o Additionally, the precision and recall metrics show notable gains, proving the
effectiveness of motion-based feature filtering in real-world drone applications.
o The proposed method is benchmarked against baseline YOLO-V5s models, both with
and without the CFM enhancement, for a comparative performance analysis.
o Results confirm that integrating CFM and self-attention mechanisms leads to a more
accurate and efficient object detection system for drone-based vision.
Chapter - 5
DISCUSSIONS
The proposed Collaborative Filtering Mechanism (CFM) effectively enhances object detection
performance in drone-based vision systems. By integrating motion trend maps into the YOLO-
V5 feature extraction process, the model filters out irrelevant background features, leading to
improved detection accuracy. The use of a parallel pipeline ensures that only motion-relevant
features are retained, which is particularly useful for small object detection in dynamic
environments.
One of the key insights from the study is the impact of self-attention mechanisms in PIX2PIX.
Experimental results indicate a notable improvement in detection performance when self-
attention is incorporated, as it helps in refining feature extraction and improving precision. The
results in Table I confirm that the PIX2PIX model with self-attention outperforms the version
without it, showing an increase of 4.22% in mAP 0.5 and 3.26% in mAP 0.5:0.95.
Furthermore, the use of GAN-based motion priors estimation introduces a novel approach to
motion tracking. By employing displacement-color mapping, motion features are effectively
visualized, making it easier for the network to interpret temporal-based movement patterns. This
process proves to be beneficial in detecting small moving objects, which are often challenging
for conventional detection models.
The VisDrone dataset plays a crucial role in evaluating the model's performance. As an
established benchmark for drone vision, the dataset provides diverse urban and rural scenarios.
The experimental results in Table II demonstrate that YOLO-V5s with the CFM module
significantly outperforms the original version, achieving a 10.49% improvement in mAP 0.5 and
substantial gains in precision and recall. These findings reinforce the effectiveness of integrating
motion-based filtering techniques into traditional object detection frameworks.
However, while the proposed approach delivers promising results, there are potential challenges.
The computational cost of integrating GANs with self-attention may impact real-time
performance on resource-constrained drone systems. Additionally, the approach relies heavily on
the accuracy of motion trend maps, meaning that errors in motion estimation could affect
detection performance.
Future research could focus on optimizing the computational efficiency of the CFM framework
to enable real-time deployment on edge devices, such as drones with limited processing power.
Techniques like lightweight self-attention modules or quantized GAN models could help reduce
computational overhead without significantly compromising detection accuracy. Additionally,
exploring adaptive motion trend maps that adjust dynamically based on environmental
conditions could enhance the robustness of small object detection in varying scenarios.
Beyond small object detection, the principles behind CFM could be extended to other aerial
vision tasks, such as multi-object tracking, behavior analysis, and anomaly detection. By
integrating temporal motion priors into broader deep learning architectures, drone-based systems
could become more intelligent, allowing for autonomous decision-making in surveillance,
search-and-rescue, and traffic monitoring applications. Combining CFM with additional sensor
modalities, such as thermal imaging or LiDAR, could further improve detection performance,
especially in low-light or obscured environments.
Future Directions
o Optimizing Computational Efficiency: Enhancing the efficiency of the CFM module to
make it more suitable for real-time drone applications.
All the experiments conducted in this study focus on the vehicle category of the VisDrone
dataset. The VisDrone dataset is a large-scale drone-based computer vision dataset, developed by
the AISKYEYE team at Tianjin University, China. It includes 10,209 still images and 288 video
clips, totaling 261,908 frames, captured in 14 different cities with varying environments, such as
urban and rural areas, and different densities, including sparse and crowded scenes. The dataset
provides over 2.6 million manually annotated bounding boxes for objects such as pedestrians,
vehicles, bicycles, and tricycles.
In recent years, drones have been widely used in traffic analysis and surveillance. As a
result, automated visual data analysis from drones has become increasingly important.
Due to the computing limitations of drones, this study uses YOLO-V5s, a lightweight
version of YOLO, to ensure efficiency while maintaining detection accuracy.
Inspired by the human visual system, this study explores the integration of motion
information into object detection to enhance accuracy, particularly for small and occluded
objects. By leveraging the Collaborative Filtering Mechanism (CFM) in conjunction with
YOLO-V5s, we aim to filter out irrelevant background noise while emphasizing motion-
based features.
The study evaluates the Collaborative Filtering Mechanism (CFM), integrated with
YOLO-V5s, to enhance detection performance. From the experimental results, it was
observed that incorporating the Self-Attention mechanism led to an improvement of
4.22% in mAP 0.5 and 3.26% in mAP 0.5:0.95, compared to the standard model.
Further evaluations showed that using the CFM module significantly improved the
detection performance of YOLO-V5s on the VisDrone dataset. The mAP 0.5 score
improved by 10.49%, and similar enhancements were seen in mAP 0.5:0.95, Precision,
and Recall. These improvements confirm that the CFM module effectively enhances
small object detection for drone-based applications.
The comparison of detection performance on the VisDrone dataset highlights the impact of
incorporating the Self-Attention mechanism in a Pix2Pix-based enhancement framework for
YOLO-V5 with the Collaborative Filtering Mechanism (CFM).
The experimental results indicate that the inclusion of the Self-Attention mechanism
significantly refines feature representation, leading to improved object detection accuracy.
This enhancement is particularly beneficial for small and occluded objects, where precise feature
extraction is crucial. By leveraging Self-Attention, the model effectively distinguishes relevant
objects from background noise, resulting in superior detection performance compared to its
counterpart without Self-Attention.
Traditional convolutional models struggle with such objects due to limited receptive fields,
whereas Self-Attention allows the model to focus on relevant regions more effectively. This
integration leads to superior detection performance, improving precision and recall compared to
the standard model without Self-Attention, making it a promising approach for real-world
applications like drone-based surveillance and traffic monitoring.
The comparison of detection performance on the VisDrone dataset between YOLO-V5 with and
without the Collaborative Filtering Mechanism (CFM) demonstrates the effectiveness of CFM in
improving object detection.
Experimental results show that integrating CFM enhances feature extraction by filtering out
irrelevant background noise and emphasizing motion-based features. This leads to increased
precision and recall, particularly for small and occluded objects.
The model with CFM achieves superior detection accuracy compared to the standard YOLO-V5,
validating its potential in drone-based surveillance and traffic monitoring applications.
The comparison of detection performance on the VisDrone dataset between YOLO-V5 with and
without the Collaborative Filtering Mechanism (CFM) shows that CFM significantly improves
object detection. By filtering out irrelevant background noise and emphasizing motion-based
features, CFM enhances feature extraction.
This leads to improved precision and recall, especially for small and occluded objects. The
model with CFM outperforms the standard YOLO-V5, demonstrating its effectiveness in drone-
based surveillance and traffic monitoring. CFM’s ability to refine feature representation validates
its potential for real-world applications.
CONCLUSION
The proposed Collaborative Filtering Mechanism (CFM) introduces a novel approach to
improving small object detection in drone vision by leveraging motion-based feature
enhancement. By integrating GAN-generated motion priors and self-attention modules, we
successfully refine the feature extraction process, leading to notable improvements in object
detection performance.
The experimental results on the VisDrone dataset demonstrate the effectiveness of CFM,
showing substantial gains in mean Average Precision (mAP), precision, and recall compared to
conventional YOLO-V5s models. The inclusion of self-attention mechanisms further enhances
feature representation, allowing the model to focus on salient motion patterns critical for object
detection in dynamic aerial environments.
Future research directions may include extending CFM for real-time object tracking on drones,
optimizing computational efficiency for edge deployment, and exploring multi-modal fusion
techniques for better scene understanding in aerial surveillance and traffic analysis applications.
Our work underscores the potential of motion-aware deep learning in advancing autonomous
drone perception and monitoring systems.
The experimental evaluation on the VisDrone dataset confirms that our CFM-enhanced YOLO-
V5s significantly outperforms the baseline models, achieving notable gains in mAP, precision,
and recall. Additionally, integrating GAN-based motion priors and self-attention mechanisms
further refines feature extraction, making the model more effective in handling occlusions and
detecting small, fast-moving objects. These improvements highlight the potential of combining
motion-aware learning techniques with existing object detection frameworks to enhance drone
vision applications.
In the future, our research could be extended to multi-object tracking and real-time anomaly
detection, which are crucial for applications like surveillance, traffic monitoring, and search-and-
rescue operations. Moreover, optimizing the model for low-power edge devices would enable its
deployment on drones for real-world tasks. Further studies could explore how temporal motion
features can improve other deep learning tasks, such as action recognition and behavior analysis,
making drone vision systems more intelligent and adaptive.
The proposed Collaborative Filtering Mechanism enhances small object detection in drone
vision by incorporating motion-aware learning techniques. Unlike traditional methods that rely
solely on static image features, CFM integrates motion trend maps generated using a GAN-based
displacement-color mapping process. This approach refines feature extraction by selectively
emphasizing moving objects while suppressing irrelevant background noise. Additionally, the
inclusion of self-attention mechanisms strengthens the model’s ability to differentiate between
objects, improving detection accuracy in complex aerial environments.
Future research directions for CFM could include extending the approach to real-time object
tracking, anomaly detection, and behavior analysis. These capabilities are essential for
applications such as smart traffic monitoring, security surveillance, and disaster response.
Moreover, optimizing the model for edge computing would allow real-world deployment on
drones with limited computational resources. Additionally, exploring multi-modal fusion
techniques—such as integrating LiDAR, thermal imaging, or radar data—could further enhance
scene understanding, making drone vision systems more adaptive and efficient in dynamic
environments.
The motion trend maps are produced through image-to-image translation (CycleGAN). These
motion maps help filter out irrelevant background information
and highlight moving objects, improving detection accuracy. Additionally, the self-attention
mechanism refines feature extraction by ensuring the model focuses on important motion
patterns, leading to better object differentiation and robustness in aerial imagery analysis.
Experimental evaluations on the VisDrone dataset demonstrate that the CFM-enhanced YOLO-
V5s model significantly outperforms traditional object detection approaches, achieving higher
mean Average Precision (mAP), precision, and recall. The ability to leverage motion priors
allows for better detection of small, fast-moving, or partially occluded objects, making it highly
effective for drone-based surveillance, traffic monitoring, and security applications. Future
research could explore real-time tracking, anomaly detection, and multi-modal fusion with
additional data sources (such as LiDAR or thermal imaging) to further enhance detection
accuracy and adaptability in dynamic environments.
REFERENCES
[1] Hayun Lee, Gyeonghwan Hong, and Dongkun Shin. Shareable camera framework for
multiple computer vision applications. In 2018 20th International Conference on Advanced
Communication Technology (ICACT), pages 669–674, 2018.
[2] Bin Cao, Meng Li, Xin Liu, Jianwei Zhao, Wenxi Cao, and Zhihan Lv. Many-objective
deployment optimization for a drone-assisted camera network. IEEE Transactions on Network
Science and Engineering, 8(4):2756–2764, 2021.
[3] Hamed Ghasemi, Amin Mirfakhar, Mehdi Tale Masouleh, and Ahmad Kalhor. Control a
drone using hand movement in ros based on single shot detector approach. In 2020 28th Iranian
Conference on Electrical Engineering (ICEE), pages 1–5, 2020.
[4] Jorgen Wallerman, Jonas Bohlin, Mats B. Nilsson, and Johan E. S. Franssen. Drone-based
forest variables mapping of ICOS tower surroundings. In IGARSS 2018 - 2018 IEEE International
Geoscience and Remote Sensing Symposium, pages 9003–9006, 2018.
[5] Trinadh V S N Venna, Sarosh Patel, and Tarek Sobh. Application of image-based visual
servoing on autonomous drones. In 2020 15th IEEE Conference on Industrial Electronics and
Applications (ICIEA), pages 579–585, 2020.
[6] Assem Alsawy, Alan Hicks, Dan Moss, and Susan Mckeever. An image processing based
classifier to support safe dropping for delivery-by-drone. In 2022 IEEE 5th International
Conference on Image Processing Applications and Systems (IPAS), volume Five, pages 1–5,
2022.
[8] Visarut Trairattanapa, Ankit A. Ravankar, and Takanori Emaru. Estimation of tree diameter
at breast height using stereo camera by drone surveying and mobile scanning methods. In 2020
59th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE),
pages 946–951, 2020.
[9] Omar Daniel Mora Granillo and Zizilia Zamudio Beltran. Real-time drone (UAV) trajectory
generation and tracking by optical flow. In 2018 International Conference on Mechatronics,
Electronics and Automotive Engineering (ICMEAE), pages 38–43, 2018.
[10] Kian Meng Yap, Kok Seng Eu, and Jun Ming Low. Investigating wireless network
interferences of autonomous drones with camera based positioning control system. In 2016
International Computer Symposium (ICS), pages 369–373, 2016.
[11] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell:
Lessons learned from the 2015 mscoco image captioning challenge. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 39(4):652–663, 2017.
[12] Sujeet Kumar, Prashant Johri, Avneesh Kumar, Sudeept Singh Yadav, and Harshit Kumar.
Multiple object detection using deep learning. In 2021 3rd International Conference on Advances
in Computing, Communication Control and Networking (ICAC3N), pages 380–384, 2021.
[13] Junying Zeng, Zuoyong Lin, Chuanbo Qi, Xiaoxiao Zhao, and Fan Wang. An improved
object detection method based on deep convolution neural network for smoke detection. In 2018
International Conference on Machine Learning and Cybernetics (ICMLC), volume 1, pages 184–
189, 2018.
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In
Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-
12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[15] Richard A Abrams and Shawn E Christ. Motion onset captures attention. Psychological
Science, 14(5):427–432, 2003.
[16] Steven Franconeri and Daniel Simons. Moving and looming stimuli capture attention.
Perception psychophysics, 65:999–1010, 11 2003.
[18] Atsushi Senju, Toshikazu Hasegawa, and Yoshikuni Tojo. Does perceived direct gaze boost
detection in adults and children with and without autism? the stare-in-the-crowd effect revisited.
Visual Cognition - VIS COGN, 12:1474–1496, 11 2005.
[19] Rene Laprise. The Euler equations of motion with hydrostatic pressure as an independent
variable. Monthly Weather Review, 120(1):197–207, 1992.
[20] Aleksander Holynski, Brian L Curless, Steven M Seitz, and Richard Szeliski. Animating
pictures with eulerian motion fields. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 5810–5819, 2021.
[21] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies
for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 580–587, 2014.
[22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster rcnn: Towards real-time
object detection with region proposal networks. Advances in neural information processing
systems, 28, 2015.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep
convolutional networks for visual recognition. IEEE transactions on pattern analysis and
machine intelligence, 37(9):1904– 1916, 2015.
[24] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully
convolutional networks. Advances in neural information processing systems, 29, 2016.
[25] Adedeji Olugboja, Zenghui Wang, and Yanxia Sun. Parallel convolutional neural networks
for object detection [j]. Journal of Advances in Information Technology Vol, 12(4), 2021.
[26] Xingxing Xie, Gong Cheng, Jiabao Wang, Xiwen Yao, and Junwei Han. Oriented r-cnn for
object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 3520–3529, 2021.
[27] Feng Shuang, Hanzhang Huang, Yong Li, Rui Qu, and Pei Li. Afe-rcnn: Adaptive feature
enhancement rcnn for 3d object detection. Remote Sensing, 14(5):1176, 2022.
[28] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge
Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 2117–2125, 2017.
[29] Guiyi Yang, Zhengyou Wang, and Shanna Zhuang. Pff-fpn: A parallel feature fusion
module based on fpn in pedestrian detection. In 2021 International Conference on Computer
Engineering and Artificial Intelligence (ICCEAI), pages 377–381, 2021.
[30] Di Liu and Fei Cheng. Srm-fpn: A small target detection method based on fpn optimized
feature. In 2021 18th International Computer Conference on Wavelet Active Media Technology
and Information Processing (ICCWAMTIP), pages 506–509, 2021.
[31] Xiaoqi Yang and Liangliang Duan. Mptc-fpn: A multilayer progressive fpn with
transformer-cnn based encoder for salient object detection. IEEE Access, 10:98816–98827,
2022.
[32] Jia Li, Ruiqi Li, Chensheng Wang, and Yangguang Li. Comparative research of fpn and
mtcn in face attribute recognition. In 2019 International Conference on Artificial Intelligence and
Advanced Manufacturing (AIAM), pages 539–543, 2019.
[33] Zhiqing Li, Erzhu Li, Tianyu Xu, Alim Samat, and Wei Liu. Feature alignment fpn for
oriented object detection in remote sensing images. IEEE Geoscience and Remote Sensing
Letters, 20:1–5, 2023.
[34] Yu-Ming Zhang, Jun-Wei Hsieh, Chun-Chieh Lee, and Kuo-Chin Fan. Sfpn: Synthetic fpn
for object detection. In 2022 IEEE International Conference on Image Processing (ICIP), pages
1316–1320, 2022.
[35] Huayu Li, Shuyu Miao, and Rui Feng. Dg-fpn: Learning dynamic feature fusion based on
graph convolution network for object detection. In 2020 IEEE International Conference on
Multimedia and Expo (ICME), pages 1–6, 2020.
[36] Junhao Hu, Lei Jin, and Shenghuo Gao. Fpn++: A simple baseline for pedestrian detection.
In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 1138–1143,
2019.
[38] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Task-driven super resolution:
Object detection in low-resolution images. In Neural Information Processing: 28th International
Conference, ICONIP 2021, Sanur, Bali, Indonesia, December 8–12, 2021, Proceedings, Part V
28, pages 387–395. Springer, 2021.
[39] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le.
Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 113–123, 2019.
[40] Marcus D Bloice, Peter M Roth, and Andreas Holzinger. Biomedical image augmentation
using augmentor. Bioinformatics, 35(21):4522– 4524, 2019.
[41] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical
automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
[42] Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin
Ling. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and
Machine Intelligence, pages 1–1, 2021.
[43] Tianqu Zhao and Hong Jiang. Landing system for ar.drone 2.0 using onboard camera and
ros. In 2016 IEEE Chinese Guidance, Navigation and Control Conference (CGNCC), pages
1098–1102, 2016.
[45] Boguslaw Cyganek and Kazimierz Wiatr. Design of a visual frontend for parallel signal
processing on underwater search drone. In 2018 IEEE Intl Conf on Parallel Distributed
Processing with Applications, Ubiquitous Computing Communications, Big Data Cloud
Computing, Social Computing Networking, Sustainable Computing Communications
(ISPA/IUCC/BDCloud/SocialCom/SustainCom), pages 1046–1047, 2018.
[46] Cheng-Fang Peng, Jun-Wei Hsieh, Shao-Wei Leu, and Chi-Hung Chuang. Drone-based
vacant parking space detection. In 2018 32nd International Conference on Advanced Information
Networking and Applications Workshops (WAINA), pages 618–622, 2018.
[47] Hakan Kayan, Raheleh Eslampanah, Faezeh Yeganli, and Murat Askar. Heat leakage
detection and surveillance using aerial thermography drone. In 2018 26th Signal Processing
and Communications Applications Conference (SIU), pages 1–4, 2018.
[48] Maitha Al Shamsi, Mohammed Al Shamsi, Rashed Al Dhaheri, Rashed Al Shamsi, Saif Al
Kaabi, and Younes Al Younes. Foggy drone: Application to a hexarotor uav. In 2018 Advances
in Science and Engineering Technology International Conferences (ASET), pages 1–5, 2018.
[49] Yanan Xu, Dexiang Yao, Xue Ren, and Yunhai Dai. Intelligent black ice detection and alert
system using thermal imaging camera and drone. In 2021 IEEE 23rd Int Conf on High
Performance Computing Communications; 7th Int Conf on Data Science Systems; 19th Int Conf
on Smart City; 7th Int Conf on Dependability in Sensor, Cloud Big Data Systems Application
(HPCC/DSS/SmartCity/DependSys), pages 2328–2331, 2021.
[50] Noriyasu Yamamoto and Noriki Uchida. Improvement of image processing for a
collaborative security flight control system with multiple drones. In 2018 32nd International
Conference on Advanced Information Networking and Applications Workshops (WAINA),
pages 199–202, 2018.
[51] Andres Erazo, Eduardo Tayupanta, and Seok-Bum Ko. Epipolar geometry on drones
cameras for swarm robotics applications. In 2020 IEEE International Symposium on Circuits and
Systems (ISCAS), pages 1–5, 2020.
[52] Lin Meng, Takuma Hirayama, and Shigeru Oyanagi. Underwater-drone with panoramic
camera for automatic fish recognition based on deep learning. IEEE Access, 6:17880–17886,
2018.
[53] Kazi Mahmud Hasan, Wida Susanty Suhaili, S. H. Shah Newaz, and Md. Shamim Ahsan.
Development of an aircraft type portable autonomous drone for agricultural applications. In 2020
International Conference on Computer Science and Its Application in Agriculture (ICOSICA),
pages 1–5, 2020.