YOLOv5 Architecture and Algorithm for Object Detection
Architectural Components
Neck (PANet and SPPF): YOLOv5's neck consists of a PANet (Path Aggregation Network)
structure combined with a Spatial Pyramid Pooling - Fast (SPPF) block. The SPPF block
sequentially applies three 5x5 max-pooling operations to simulate larger receptive fields
efficiently. PANet fuses features across multiple scales both top-down and bottom-up,
enhancing detection across object sizes. Feature maps from the backbone are upsampled
and concatenated in FPN style along the top-down path, then downsampled again along a
bottom-up path so that fine-grained localization detail from the high-resolution levels also
reaches the deeper detection layers.
Head (Detection Layer): The head is responsible for producing final predictions across three
different scales (P3, P4, P5). Each scale corresponds to detecting small, medium, and large
objects respectively. Each cell predicts bounding boxes using predefined anchor boxes.
YOLOv5 uses a total of 9 anchors (3 per scale), and each prediction includes 4 box
coordinates, 1 objectness score, and N class scores (N+5 outputs per anchor). The decoding
formulas let the predicted box center shift slightly beyond its own grid cell and bound the
predicted width and height to at most four times the anchor size, keeping boxes reasonably
sized while still covering objects near cell borders.
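For concreteness, a minimal sketch of this decoding for a single scale is shown below; the tensor names, shapes, and the helper function are illustrative rather than the repository's exact code.

```python
import torch

def decode_yolov5(raw, grid_xy, anchor_wh, stride):
    """Decode raw head outputs for one scale into pixel-space (cx, cy, w, h) boxes.

    raw:       (..., 4+) tensor holding tx, ty, tw, th (plus objectness/class logits)
    grid_xy:   (..., 2) integer cell coordinates of each prediction
    anchor_wh: (..., 2) anchor width/height in pixels for this scale
    stride:    scalar stride of the scale (8, 16, or 32)
    """
    p = raw.sigmoid()
    # Center offsets span (-0.5, 1.5), so a box center may drift into a neighboring cell.
    xy = (p[..., 0:2] * 2.0 - 0.5 + grid_xy) * stride
    # Width/height are bounded to (0, 4) times the anchor, keeping box sizes reasonable.
    wh = (p[..., 2:4] * 2.0) ** 2 * anchor_wh
    return torch.cat((xy, wh), dim=-1)
```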
Training Process
Loss Functions: YOLOv5 uses a composite loss function made of three main components: (1)
Classification loss (Binary Cross Entropy), (2) Objectness loss (Binary Cross Entropy), and (3)
Localization loss (CIoU loss - Complete Intersection over Union). Each scale’s loss is weighted
differently to prioritize small object detection. The total loss guides the optimization during
backpropagation.
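Schematically, the weighted combination can be sketched as below; the gain values and the per-scale objectness balance mirror YOLOv5's default hyperparameters but should be read as illustrative assumptions, not exact repository code.

```python
def total_loss(l_box, l_obj_per_scale, l_cls, batch_size,
               box_gain=0.05, obj_gain=1.0, cls_gain=0.5,
               balance=(4.0, 1.0, 0.4)):
    # Objectness loss is balanced per scale (P3, P4, P5); the small-object
    # scale (P3) receives the largest weight.
    l_obj = sum(b * l for b, l in zip(balance, l_obj_per_scale))
    # CIoU box loss + BCE objectness loss + BCE classification loss,
    # each scaled by its gain and by the batch size.
    return (box_gain * l_box + obj_gain * l_obj + cls_gain * l_cls) * batch_size
```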
Learning Process: During training, the model performs a forward pass by feeding input
images through the backbone, neck, and head to generate predictions. These predictions are
then compared with the ground truth labels using the defined loss functions. The error (loss)
is propagated backward using backpropagation, and the weights of the neural network are
updated using an optimizer such as SGD or Adam. YOLOv5 often benefits from transfer
learning, where pre-trained weights (e.g., trained on COCO) are fine-tuned on the target
dataset.
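The loop below is a generic sketch of one such training epoch; model, compute_loss, and the data loader are placeholders rather than YOLOv5 internals, and the SGD settings shown are only indicative.

```python
import torch

def train_one_epoch(model, loader, compute_loss, device="cuda"):
    # Illustrative optimizer settings; YOLOv5 also supports Adam variants.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.937, weight_decay=5e-4, nesterov=True)
    model.train()
    for images, targets in loader:
        preds = model(images.to(device))                 # forward pass: backbone -> neck -> head
        loss = compute_loss(preds, targets.to(device))   # BCE + CIoU composite loss
        optimizer.zero_grad()
        loss.backward()                                  # backpropagate the error
        optimizer.step()                                 # update the weights
```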
Anchor Optimization and Target Assignment: Before training starts, YOLOv5 automatically
adjusts anchor boxes using the AutoAnchor algorithm to match the object sizes in the
dataset. During training, each ground-truth box is assigned to one or more anchors at
different scales; positive samples are selected based on anchor-to-box size ratios and on the
proximity of the box center to neighboring grid cells, with the remaining predictions treated
as negatives.
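The size-ratio test can be sketched as follows; the tensor layout is assumed for illustration, and the threshold corresponds to YOLOv5's anchor_t hyperparameter (default 4.0).

```python
import torch

def anchor_matches(gt_wh, anchor_wh, anchor_t=4.0):
    """Boolean mask of which anchors each ground-truth box may be assigned to.

    gt_wh:     (N, 2) ground-truth widths/heights
    anchor_wh: (A, 2) anchor widths/heights for one scale
    """
    r = gt_wh[:, None, :] / anchor_wh[None, :, :]     # (N, A, 2) size ratios
    worst = torch.max(r, 1.0 / r).max(dim=2).values   # worst mismatch over w and h
    return worst < anchor_t                           # (N, A) positive anchor pairs
```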
Model Evaluation During Training: The training pipeline includes periodic evaluation on a
validation set using metrics like Precision, Recall, and mean Average Precision (mAP@0.5
and mAP@0.5:0.95). YOLOv5 selects the best-performing weights based on the highest
validation mAP.
Grid Flexibility in Box Assignment: Because the decoded box center can shift slightly beyond
its own cell, each ground-truth box can be matched not only to the cell containing its center
but also to the nearest neighboring cells, improving detection of objects whose centers lie
near cell borders.
Model Variants
YOLOv5 is released in multiple size variants, allowing trade-offs between speed and
accuracy:
YOLOv5n (nano): With a depth multiplier of 0.33 and a width multiplier of 0.25, this
is the smallest YOLOv5 model. It has approximately 1.9 million parameters, making it
extremely lightweight. It offers a speed advantage for mobile and embedded devices,
though its accuracy is lower compared to larger models.
YOLOv5s (small): This version has a depth multiplier of 0.33 and a width multiplier of
0.50, totaling around 7.2 million parameters. It is capable of detecting small objects
and is a popular choice for real-time applications due to its balanced speed and
accuracy.
YOLOv5m (medium): With a depth multiplier of 0.67 and width multiplier of 0.75,
this model contains approximately 21.2 million parameters. It runs slower than
YOLOv5s but scores roughly 8 COCO mAP points higher (e.g., YOLOv5s ~37.4 mAP
vs. YOLOv5m ~45.4 mAP).
YOLOv5l (large): Defined with a depth multiplier of 1.0 and width multiplier of 1.0,
YOLOv5l has around 46.5 million parameters. Due to its greater depth and width, it
may fall below real-time speeds even on mid-range GPUs, but provides higher
accuracy (COCO mAP ~49.0).
YOLOv5x (x-large): The largest variant, with a depth multiplier of 1.33 and width
multiplier of 1.25, reaching approximately 86.7 million parameters. It achieves the
highest accuracy (COCO mAP ~50.7), but also requires the most computational
resources. As such, it tends to be slower in real-time applications.
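The variants above can be pulled directly from PyTorch Hub; the snippet below is illustrative (it downloads pretrained weights on first use) and simply compares parameter counts.

```python
import torch

for name in ("yolov5n", "yolov5s", "yolov5m"):
    model = torch.hub.load("ultralytics/yolov5", name, pretrained=True)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```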
Inference Pipeline
1. Preprocessing: Input image is resized and letterboxed to preserve aspect ratio, then
normalized.
2. Forward Pass: The image passes through the backbone, neck, and head producing
predictions.
3. Decoding: Raw head outputs are converted into box coordinates, objectness scores, and
class probabilities.
4. Non-Maximum Suppression (NMS): Detections below the confidence threshold are
discarded, and overlapping boxes are suppressed by IoU.
5. Rescaling and Output: Final boxes are mapped back to the original image size.
YOLOv5 inference is optimized for real-time performance, capable of running at 30+ FPS
depending on model size and hardware.
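As a minimal end-to-end example, the PyTorch Hub interface wraps all five steps; the image path and thresholds below are placeholders.

```python
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.25          # confidence threshold applied before NMS
model.iou = 0.45           # IoU threshold used by NMS
results = model("image.jpg", size=640)   # letterbox, forward pass, decode, NMS, rescale
print(results.xyxy[0])     # boxes in original-image pixels: x1, y1, x2, y2, conf, class
```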
Optimization Techniques
Half Precision (FP16) Computation: YOLOv5 supports mixed-precision (AMP) training, in
which most activations and computations use 16-bit floating point while master weights are
kept in FP32. This significantly speeds up training and inference by leveraging the Tensor
Cores available in modern GPUs. Similarly, during inference, running the model in FP16
mode (e.g., using model.half()) can reduce memory usage and increase throughput. When
exporting the model to TorchScript or TensorRT using Ultralytics tools, setting half=True
converts the model weights into half precision. This typically results in only a minor drop in
accuracy while achieving substantial speed gains (1.5–2x), particularly on compatible GPUs.
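A minimal sketch of FP16 inference is shown below; it loads the raw model without the AutoShape wrapper, casts weights and input to half precision, and assumes a CUDA device (wrapper behavior can differ between YOLOv5 releases).

```python
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", autoshape=False)
model = model.cuda().eval().half()                        # cast weights to FP16
img = torch.zeros(1, 3, 640, 640, device="cuda").half()   # dummy letterboxed input
with torch.no_grad():
    pred = model(img)                                      # FP16 forward pass
```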
Model Compression via Quantization (INT8): A more aggressive optimization
approach is converting the model to 8-bit integer format through quantization.
YOLOv5 supports Post-Training Quantization (PTQ) when exporting to formats like
TensorRT or TFLite. With the int8=True option, both model weights and optionally
activations are quantized to INT8. This considerably reduces the model size and
accelerates inference on supported hardware (e.g., NVIDIA GPUs with TensorRT, Intel
VNNI, ARM NPUs). For instance, an INT8-quantized model can achieve 2–3x speedup
on CPUs and reduce memory usage by almost half. While this may slightly reduce
accuracy, with proper calibration, the loss is generally minimal (typically just a few
percentage points). Ultralytics documentation emphasizes that INT8 quantization
offers major performance gains with only minor accuracy trade-offs, making it
suitable for deployment on edge devices and latency-sensitive environments.
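A hedged sketch of such an export is shown below; it assumes you are working inside a clone of the ultralytics/yolov5 repository so that export.py is importable, and the parameter names follow that script's run() interface, which may differ between versions.

```python
import export  # ultralytics/yolov5's export.py, importable from the repo root

export.run(
    weights="yolov5s.pt",    # FP32 checkpoint to convert
    include=("tflite",),     # target format; TensorRT ("engine") also supports INT8
    int8=True,               # enable post-training INT8 quantization
    data="coco128.yaml",     # small representative dataset used for calibration
    imgsz=(640, 640),
)
```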
Model Pruning: YOLOv5 models can be pruned during or after training to reduce
complexity. Pruning involves removing less important weights (e.g., filters with small
magnitude values), effectively making the network sparser. Experiments with YOLOv5
show that pruning up to 30% of the parameters results in only a small drop in
accuracy. For example, pruning 30% of the YOLOv5x model reduced its mAP from
0.507 to 0.489 (a ~3.6% relative drop) while keeping inference speed nearly unchanged. In
YOLOv5, pruning is typically performed by analyzing the scale (gamma) parameters in
BatchNorm layers and removing filters below a certain threshold. The pruned model
is then fine-tuned to recover any lost accuracy. This technique is especially valuable
for deployment in environments with limited memory and compute resources, such
as embedded systems.
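As an illustration of magnitude-based pruning in plain PyTorch (a related but simpler technique than the BatchNorm-gamma channel pruning described above), the sketch below zeroes the smallest 30% of weights in every Conv2d layer; the pruned model would still need fine-tuning afterwards.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_conv_weights(model, amount=0.3):
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            prune.l1_unstructured(m, name="weight", amount=amount)  # zero smallest 30%
            prune.remove(m, "weight")  # bake the pruning mask into the weight tensor
```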
Layer Fusion and Other Fine-Tuning: YOLOv5 can be further optimized with minor
pre-inference enhancements. For example, calling model.fuse() merges every Conv2d
+ BatchNorm pair into a single Conv layer, eliminating one memory access and
compute step per pair. Additionally, compiling the model with PyTorch JIT, or
exporting to ONNX Runtime or TensorRT, can yield backend-specific speed
improvements. Input size optimization is another useful strategy: training the model
with smaller resolutions like 512 or 416 instead of 640 can significantly boost
inference speed with only a slight decrease in accuracy. Therefore, adjusting input
resolution and model variant based on application requirements helps achieve the
best balance between speed and performance.
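The sketch below illustrates these tweaks together (a guarded fuse() call plus a smaller input); depending on the YOLOv5 release, the hub wrapper may already fuse layers internally, so treat this as indicative rather than exact usage.

```python
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", autoshape=False).eval()
if hasattr(model, "fuse"):
    model.fuse()                      # merge each Conv2d + BatchNorm2d pair
img = torch.zeros(1, 3, 512, 512)     # 512x512 input instead of the default 640x640
with torch.no_grad():
    pred = model(img)
```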
These optimizations allow YOLOv5 to be deployed on edge devices, mobile platforms, and
GPUs with varying capabilities.
Architectural Differences Between YOLOv5 and YOLOv8
YOLOv8 replaces YOLOv5's C3 block with a C2f module, a CSP-style bottleneck with two
convolutions and additional skip connections, allowing for more efficient feature reuse.
YOLOv8 uses a decoupled, anchor-free head that predicts classification and box regression in
separate branches (dropping the separate objectness prediction and predefined anchor
boxes), improving detection accuracy.
It also includes native support for task-specific variants (e.g., classification,
segmentation, pose estimation) and adopts a more modern and simplified PyTorch
implementation.
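For context, the ultralytics package exposes these task variants through a single Python API; the snippet below is illustrative (pip install ultralytics; the image path is a placeholder).

```python
from ultralytics import YOLO

det = YOLO("yolov8n.pt")         # object detection
seg = YOLO("yolov8n-seg.pt")     # instance segmentation
pose = YOLO("yolov8n-pose.pt")   # pose estimation
results = det("image.jpg")       # list of Results objects (boxes, confidences, class ids)
```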
Performance Comparison
The attached figure plots COCO mAP50-95 (mean Average Precision) against latency on an
NVIDIA T4 GPU using TensorRT 10 in FP16 mode. This offers a clear view of the accuracy-
latency trade-off among different YOLO models.
Key observations:
The curve representing YOLOv8 is higher and steeper than YOLOv5, indicating a
better accuracy-to-latency efficiency. In other words, YOLOv8 provides higher
detection accuracy for a given inference time.
YOLOv5n and YOLOv8n both offer extremely low latency (~2ms), but YOLOv8n
exhibits a slight accuracy advantage (~38.5 vs. 37 mAP).
As we move from nano to x-large variants, the performance gap widens, showcasing
the scalability and architectural efficiency of YOLOv8.
The chart also includes other models like YOLOv6, PP-YOLOE, and EfficientDet, but
YOLOv8 achieves state-of-the-art performance on the COCO dataset while
maintaining real-time inference capability.
Summary
YOLOv8 significantly enhances the YOLO architecture through its C2f feature-fusion blocks,
its anchor-free decoupled head, and its native multi-task support.
While YOLOv5 remains highly popular due to its stability and extensive deployment tools,
YOLOv8 is technically superior for new projects prioritizing detection accuracy, model
generalization, and modularity.