Latest Advancements in Perception Algorithms For ADAS and AV Systems Using Infrared Images and Deep Learning
1. Introduction
combat zones and hazardous areas without remote operators. Several DARPA Grand Challenges were organized to develop the technology for fully autonomous ground vehicles through collaboration across diverse fields. The first challenge took place in 2004, when 15 self-driving vehicles competed to navigate about 228 km across the desert near Primm, Nevada. None of the teams succeeded, owing to the technological hurdles involved. The second event was held in 2005 in southern Nevada, where 5 teams competed to navigate 212 km. With improved technology, Stanford University's Stanley managed to complete the distance this time and won the prize money. In 2007, the third event, commonly known as the DARPA Urban Challenge, took place in an urban environment. Here, the teams had to showcase autonomous driving capability in live traffic scenarios and perform complete maneuvers, including braking and parking. The Boss vehicle from Carnegie Mellon University won the first prize and the Junior vehicle from Stanford University claimed the second prize [1]. Since the DARPA Grand Challenges, accurate perception and increasingly autonomous navigation of vehicles has become one of the most active fields in research and industry. The Society of Automotive Engineers (SAE) international standard defines six levels of driving automation on the path from ADAS to fully autonomous driving, as shown in Figure 1. Up to level 3, a driver must be present and ready to take control of the vehicle whenever needed. Levels 4 and 5 allow fully autonomous driving with and without a driver, respectively [3].
Worldwide, a large share of major accidents is attributed to human error, and many of these result in fatalities. Considering the safety of drivers and passengers, ADAS systems have been targeted by top manufacturers to support drivers in unpredicted circumstances. The most common ADAS functions available include lane departure warning, forward collision warning, high-beam assistance, traffic signal recognition, adaptive cruise control and so on. ADAS systems are semi-autonomous driving concepts that assist drivers during driving; the objective is to automate, adapt and enhance safety by reducing human error. A fully autonomous vehicle, by contrast, is capable of sensing the environment and navigating without human intervention under all environmental circumstances. Here, the vehicle perceives the environment, reasons about it to make decisions, and controls itself autonomously, similar to a human driver [4].
Figure 1.
Levels of automation according to SAE standards [2].
Figure 2.
State-of-the-art ADAS sensors.
Manufacturers and operators of vehicles equipped with ADAS and automated driving systems need to report crashes to the US agency as per a general standing order issued by the National Highway Traffic Safety Administration (NHTSA) in 2021 [2]. Since July 2021, vehicles equipped with AV systems (level 3 and above) still under development have reported an average of 14 crashes per month, with a monthly maximum of 22 and minimum of 8. Similarly, vehicles equipped with level 2 ADAS have reported an average of 44 crashes per month since July 2021, with a monthly maximum of 62 and minimum of 26. These collision reports show that the on-road experience of what is claimed to be mature ADAS and autonomous driving technology still needs improvement to match human driver perception. Numerous major and minor crashes have been reported by self-driving cars under road testing. This indicates that the existing sensor suites claimed to achieve level 4 autonomous driving are lacking in performance, especially during extreme weather conditions, dark or night scenarios, glare and so on, and it calls for more robust sensors to achieve fully autonomous driving irrespective of environmental conditions.
Recent advancements in artificial intelligence (AI) have had a significant impact on the fast development and deployment of level 3 and above ADAS and AV solutions. In particular, to generate precise information about the surrounding environment, large volumes of data from different sensing modalities and advanced computing resources play a key role in making AI an essential component of ADAS and AV perception systems [7]. Extensive research and development effort is currently being invested in analyzing the effective use of AI in various AV functionalities such as perception, planning and control, localization and mapping, and decision making.
In this article, the limitations and challenges of the existing sensor suite, the need for infrared cameras, infrared technology, applications of AI in the development of perception systems, and the need for a multi-sensor fusion strategy are presented in detail. Recent research on infrared sensors and deep learning approaches for ADAS and AV systems is also discussed.
In ADAS and AV systems, sensors are considered the equivalent of the human driver's eyes and ears for sensing and perceiving the environment. Different aspects of the environment are sensed and monitored using various types of sensors, and the information is shared with the driver or the electronic control unit. This section introduces commonly used automotive sensors and their functionality in achieving level 2 and above ADAS and AV solutions. Figure 3 shows a representative image of a vehicle equipped with perception sensors for one or more ADAS and AV solutions [9].
RGB or visible cameras are the most commonly used sensors in ADAS and AV systems due to their low cost and easy installation. Normally, more than one camera is used to capture the complete environment. Images captured by vision sensors are processed by an embedded system to detect, analyze, understand and track various objects in the environment. Captured images are rich in information such as color, contrast, texture and detail, which are unique features compared with other sensors.
Visible cameras are used either as a single-lens camera, called a monocular camera, or as a two-lens camera, called a stereo setup.
Figure 3.
Representative figure of AV/ADAS vehicle with various perception sensors. Figure from [8].
Monocular cameras are low-cost and require less processing power. They are commonly used for object detection and classification, lane and parking-line detection, traffic sign recognition and so on. Monocular cameras
lack distance or depth information compared with active sensors. A few techniques exist to estimate distance from monocular images, but they are not as accurate as ADAS and AV systems require for autonomous maneuvering. Stereo cameras, on the other hand, are useful for extracting depth or distance information, as the system consists of two lenses separated by a baseline, resembling the two eyes of a human. Such systems are highly beneficial for detecting and classifying objects in the environment along with depth or distance information, with better accuracy than monocular cameras. However, compared with other automotive sensors, depth estimation using stereo cameras is reliable only over short distances, up to about 30 m [10], whereas autonomous vehicles demand accurate distance estimation at long range, especially on highways (Figure 4) [3, 10].
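As a rough numerical illustration of the stereo range limitation mentioned above, the sketch below uses the standard pinhole stereo relation depth = focal length × baseline / disparity. The focal length, baseline and 0.25-pixel matching error are illustrative assumptions, not the parameters of any particular automotive camera.

def stereo_depth(disparity_px, focal_px, baseline_m):
    # Pinhole stereo model: depth = f * B / d
    return focal_px * baseline_m / disparity_px

focal_px, baseline_m = 1000.0, 0.30          # assumed focal length (pixels) and baseline (m)
for true_depth in (10.0, 30.0, 100.0):
    d = focal_px * baseline_m / true_depth   # ideal disparity in pixels
    d_noisy = d - 0.25                       # assumed 0.25 px stereo matching error
    err = abs(stereo_depth(d_noisy, focal_px, baseline_m) - true_depth)
    print(f"{true_depth:5.0f} m -> disparity {d:5.2f} px, depth error ~{err:4.1f} m")

Because disparity shrinks with distance, the same sub-pixel matching error costs roughly 0.1 m at 10 m but around 9 m at 100 m, which is why stereo depth is trusted only at short range.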
Figure 4.
Common adverse scenarios where current ADAS/AV sensor suite struggles to perform.
2.1.1 Lidar
LiDAR stands for light detection and ranging. It is an active sensor that works by emitting a laser beam, which is reflected by any object in its path. The time elapsed between emitting the laser beam and receiving its reflection is used to measure the distance to the object. These sensors are capable of generating high-resolution 3D point clouds and operate at longer ranges than vision sensors. A lidar can also generate a 360° 3D image of the surroundings of the ego vehicle with accurate depth information. In recent autonomous vehicles, the lidar sensor plays a major role in driving the vehicle autonomously by generating accurate and precise environment perception. ADAS functions such as autonomous braking, parking, collision avoidance and object detection can be achieved with higher accuracy using lidar sensors. Major drawbacks of lidars are their bulky size and high cost. Extreme weather conditions such as rain and fog can also impact the performance of lidar sensors. Thanks to the latest advancements in semiconductor technology, significantly smaller and less expensive lidars may become available in the future [3, 10].
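A minimal sketch of the time-of-flight principle described above: the range follows from the round-trip delay of the laser pulse as R = c·t/2. The 667 ns delay used here is purely illustrative.

def lidar_range_m(round_trip_s, c=299_792_458.0):
    # Time of flight: the pulse travels to the target and back, so halve the path
    return c * round_trip_s / 2.0

print(lidar_range_m(667e-9))  # ~100 m for a ~667 ns round trip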
2.1.2 Radar
Radar stands for radio detection and ranging. It is an active sensor that works on the principle of the Doppler effect. Radars emit microwave energy and measure the frequency difference between the emitted and reflected beams in order to estimate the speed and distance of the object from which the energy is reflected. Radar can detect objects at longer distances than lidar and vision sensors, and it performs equally well in all weather conditions, including extreme conditions such as rain and fog. Radars are classified as short-, medium- and long-range sensors. Short- and medium-range sensors are mostly used for blind spot detection and cross-traffic alert and are mounted at the corners of the vehicle. Long-range radars are mostly used for adaptive cruise control and are mounted near the front and rear bumpers [3, 10].
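To make the Doppler relation above concrete, the sketch below recovers radial speed from the shift between transmitted and received frequencies, v = Δf·c/(2·f_tx). The 77 GHz carrier is typical of automotive radar, but the 10 kHz shift is an assumed example value.

def doppler_speed_mps(freq_shift_hz, f_tx_hz=77e9, c=299_792_458.0):
    # Doppler shift: delta_f = 2 * v * f_tx / c  ->  v = delta_f * c / (2 * f_tx)
    return freq_shift_hz * c / (2.0 * f_tx_hz)

print(doppler_speed_mps(10_000))  # ~19.5 m/s (about 70 km/h) for a 10 kHz shift at 77 GHz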
Ultrasonic sensors are active sensors that use sound waves to measure the distance between the ego vehicle and nearby objects. Such sensors are most commonly used to detect objects close to the vehicle, such as kerbs, especially in parking spaces [3, 10].
Other sensors such as GPS receivers and IMUs are also used in most ADAS and AV use cases. These sensors measure the position of the ego vehicle and support its localization. Table 1 provides a consolidated summary of the advantages and disadvantages of various sensors for ADAS and AV systems.
Advantages: the sensing range can cover up to 200 m; better vision through dust, fog and snow. Disadvantages: classification issues in cold conditions; some targets remain challenging to detect and classify.
Table 1.
List of advantages and disadvantages of various sensors for ADAS/AV perception systems [11].
Figure 5.
Electromagnetic spectrum: representation of the IR range.
Adverse conditions such as sun glare, dense fog, heavy rain, high-beam glare, surface reflections and low light cause light scattering and reflections that reduce visibility for RGB cameras.
Existing ADAS solutions in the market predominantly use vision and ultrasonic sensors to achieve level 1 features such as warnings and alerts for traffic lights and obstacle detection while reversing the vehicle. To achieve level 2 and above ADAS features, vision, radar and ultrasonic sensors are used either as individual sensor modalities or in combination. For example, a vision sensor can generate accurate detection and classification of on-road and static objects,
whereas a radar sensor can generate the accurate position, velocity and distance of objects relative to the vehicle. Hence, these two pieces of information are fused to generate a combined representation of the detected objects with their position, velocity and class. Each object is thus represented with richer information, so that the guidance and navigation module can make proper decisions and plan the motion of the vehicle autonomously. Fusing information from multiple sensor modalities means combining complementary and redundant information in order to represent the environment more precisely. It also extends the feasibility of ADAS and AV systems to function in all weather and lighting conditions [6]. However, the sensor fusion described above is best suited to addressing challenges during daylight and is not fully adequate at night.
The sensor combinations in the latest ADAS and AV solutions are still unable to generate environment perception as precisely as a human driver perceives it. The latest standing general order crash report [2], available on the National Highway Traffic Safety Administration web page, clearly shows that the existing sensor suite is not sufficient to achieve level 4 and above autonomous driving. These systems fail especially during adverse weather conditions, low-light and dark scenarios, extreme sun glare and so on. After the accident of the autonomous Uber car in 2018 [13], the research community started considering the inclusion of an infrared sensor in the ADAS sensor fusion suite. This shows that there is a need for another sensor that can complement the existing information, especially during extreme weather and lighting conditions. To enable level 4 or 5 functionality with zero human intervention, the system must be made more robust to varying weather and lighting conditions [11, 14].
Most state-of-the-art approaches use camera and lidar sensors as the major sensing modalities for object detection, mostly because these sensors provide dense image pixel information or high-density point cloud data. However, lidar and radar sensors are costly and computationally expensive. Vision sensor-based perception algorithms depend on the brightness and contrast of the captured images. Vision sensors are cost-effective and use either image processing or deep learning-based techniques for object detection. Due to limited image features, CNN-based algorithms can detect objects only under good lighting conditions or above a minimum lux level. Moreover, vision sensor-based approaches often fail under daylight glare, night-time glare, fog, rain and strong or direct sunlight, and their performance degrades in poorly lit environments such as dark scenes, dams, tunnels and parking garages. Similarly, contamination of the camera lens and increased scene complexity, such as the detection of pedestrians in crowded environments, are always challenging for any vision-based perception algorithm [15].
Apart from the challenges mentioned above, vision sensors are also limited by object distance. Reasonable performance can be expected up to about 20 meters using visible images, so recognition of long-distance objects is limited. Visible cameras also struggle to sense the environment precisely in low light, fog and rain. These limitations can be partly overcome by adding sensors such as lidars and radars, which can detect objects at long distances even in fog and rain. However, recent accidents involving Uber and Tesla autonomous vehicles indicate that a sensor suite comprising vision sensors, lidars and radars is not sufficient, especially for the detection of cars and pedestrians during extreme weather and
lighting conditions. A robust perception algorithm is expected to work in all lighting and adverse weather conditions, as illustrated in Figure 4. Hence, the ADAS/autonomous driving sensor suite requires a night vision-compliant data-capturing sensor, such as an infrared or thermal camera. Infrared (IR) sensors look promising, especially in extreme conditions such as poor lighting, night, bright sun glare and inclement weather. IR sensors are capable of classifying vehicles, pedestrians, animals and other objects in common driving conditions. They perform equally well in daylight and dark scenarios, and they outperform the other sensors used for ADAS and AV applications in low-light and dark conditions [15].
Infrared radiation falls within the electromagnetic spectrum, with wavelengths longer than the visible spectrum and shorter than radio waves; the IR range spans 0.75 μm to 1000 μm. Any object with an absolute temperature above 0 K radiates infrared energy. In general, this radiation is a measure of internal energy arising from the acceleration of electrically charged particles, and hotter objects radiate more energy. It is invisible to the human eye but is sensed as warmth on the skin. The electromagnetic spectrum and the IR range are shown in Figure 5. An infrared sensor is an electronic device capable of emitting and detecting infrared radiation within the IR range. Its operation is governed by three basic laws of physics [16–18]:
1. Planck’s law of radiation—It states that any object whose temperature is not equal
to absolute zero Kelvin (O K) emits radiation
2. Stephan Boltzmann’s law—It states that the total energy emitted by a black body
at all wavelengths is related to the absolute temperature
Figure 6 shows the radiant exitance of a perfect black body according to Planck's law; the wavelength of the peak is inversely proportional to the temperature, as per Wien's displacement law.
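The laws above can be checked numerically. The short sketch below evaluates the Stefan-Boltzmann law for the total emitted power and Wien's law for the peak wavelength, using human skin near room temperature as an assumed example; real skin is not a perfect black body, so the numbers are only indicative.

SIGMA = 5.670374419e-8   # Stefan-Boltzmann constant, W / (m^2 K^4)
WIEN_B = 2.897771955e-3  # Wien's displacement constant, m K

def total_emittance_w_m2(temp_k, emissivity=1.0):
    # Stefan-Boltzmann law: M = eps * sigma * T^4
    return emissivity * SIGMA * temp_k ** 4

def peak_wavelength_um(temp_k):
    # Wien's displacement law: lambda_max = b / T
    return WIEN_B / temp_k * 1e6

print(total_emittance_w_m2(305.0, emissivity=0.99))  # ~486 W/m^2 for skin at ~305 K
print(peak_wavelength_um(305.0))                     # ~9.5 um, inside the LWIR band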
Based on their operating principle, IR sensors are broadly classified into thermal and photonic detectors. In thermal detectors, the infrared rays emitted by objects are absorbed and converted into heat, which is then transformed into a change in resistance or a thermo-electromotive force that produces the output signal. Quantum (photonic) detectors use the photoconductive and photovoltaic effects in semiconductors and PN junctions. For ADAS and AV applications, thermal detectors are the most widely adopted. Another classification, into cooled and uncooled detectors based on operating temperature, is often used at the initial selection stage. Based on the detector's construction, IR sensors can be further classified into single, linear or array detectors; the most commonly used arrangement is the focal plane array (FPA), which consists of multiple single detectors arranged in a matrix [16–18]. A further classification of IR sensors is based on the operating frequency band, as IR detectors operate in the band where maximum transmission with minimal atmospheric absorption is possible [16–18]. They are generally classified into near-infrared (NIR), short-wavelength infrared (SWIR), mid-wavelength infrared (MWIR) and long-wavelength infrared (LWIR) bands, as shown in Figure 7.
Figure 6.
Radiant exitance of a black body according to Planck's law [16].
Figure 7.
Infrared sensor types based on operational frequency bands.
Applications: driver monitoring; road marking detection; pedestrian and object detection.
Table 2.
Details of various types of infrared images used in automotive industry.
SWIR cameras can capture more context information, such as lane markings and traffic signs. Unfortunately, SWIR camera applications are uncommon due to the high cost of indium gallium arsenide (InGaAs) detectors [19]. LWIR cameras are commonly referred to as 'thermal infrared' as they operate solely on thermal emissions and do not require any external source of illumination [20].
Representative images from visible (RGB), near-infrared (NIR), short-wavelength infrared (SWIR), mid-wavelength infrared (MWIR) and long-wavelength infrared (LWIR) cameras are shown in Figure 8 [16–18].
The basic components of an infrared imaging system are shown in Figure 9. The system measures the IR radiation emitted by objects and converts it into electrical signals using an IR detector. The converted signal is transformed into a temperature map, taking ambient and atmospheric effects into account. The temperature map is displayed as an image, which can be color-coded by the imaging algorithm to produce thermograms (thermal or IR images). The IR detector acts as a transducer that converts radiation into electrical signals. Microbolometer detectors are the most widely used IR detectors because they can operate at room temperature. A microbolometer is essentially a resistor with a very small heat capacity and a high negative temperature coefficient of resistivity; the IR radiation received by the detector changes the resistance of the microbolometer and produces a corresponding electrical output. Considering the ambient and atmospheric effects between the object and the IR detector, an infrared measurement model is used to convert the detected heat map into a temperature map. The measurement model depends on the emissivity of the object, the atmospheric and ambient temperature, the relative humidity and the distance between the object and the detector.
Figure 8.
Representative images for visible camera (RGB), near-infrared (NIR), short-wavelength infrared (SWIR), mid-
wavelength infrared and long-wavelength infrared (LWIR) cameras [20].
Figure 9.
Basic components of IR imaging system.
Figure 10.
Representative RGB and IR images used in ADAS and AV applications from Kaist multispectral pedestrian
detection dataset [21].
The exact measurement model, as implemented for example in FLIR thermal cameras, varies between manufacturers and IR detector characteristics. The measured temperature map is visually represented as a grayscale image. For industrial applications, pseudo color-coded images, known as thermograms, can also be generated by the thermal imaging system to make differences in temperature distribution easier to see. Figure 10 shows representative RGB and corresponding IR thermal images used in ADAS and AV applications [17].
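A commonly used simplification of such a measurement model in thermography splits the radiation reaching the detector into the object's own emission, reflected ambient radiation and atmospheric emission, weighted by the emissivity and the atmospheric transmittance. The sketch below illustrates that idea with the Stefan-Boltzmann law standing in for the detector's band-specific response; real cameras use calibrated, band-limited radiometric curves, so the values are only indicative and this is not the model of any particular manufacturer.

SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W / (m^2 K^4)

def apparent_exitance(t_obj_k, t_refl_k, t_atm_k, emissivity, tau):
    # W_meas = eps*tau*W(T_obj) + (1 - eps)*tau*W(T_refl) + (1 - tau)*W(T_atm)
    w = lambda t: SIGMA * t ** 4
    return (emissivity * tau * w(t_obj_k)
            + (1 - emissivity) * tau * w(t_refl_k)
            + (1 - tau) * w(t_atm_k))

def object_temperature_k(w_meas, t_refl_k, t_atm_k, emissivity, tau):
    # Invert the model above to recover the object temperature
    w_obj = (w_meas
             - (1 - emissivity) * tau * SIGMA * t_refl_k ** 4
             - (1 - tau) * SIGMA * t_atm_k ** 4) / (emissivity * tau)
    return (w_obj / SIGMA) ** 0.25

w = apparent_exitance(305.0, 293.0, 293.0, emissivity=0.98, tau=0.95)
print(object_temperature_k(w, 293.0, 293.0, emissivity=0.98, tau=0.95))  # recovers ~305 K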
Recent Uber and Tesla accidents clearly show that the current SAE automation level 2 and 3 sensor suites do not provide accurate detection of cars and pedestrians. In particular, vulnerable road users (VRUs) such as pedestrians, animals and bicyclists are challenging to detect and classify accurately. Classification of these objects is difficult in poor lighting, low-light and dark scenarios, with direct sunlight in the driving direction, and in extreme weather conditions such as fog, rain and snow. The performance of vision sensors does not meet the requirements of autonomous driving in such conditions, and the performance of the other sensors is also affected to some degree, failing to provide complete environment perception for autonomous navigation. A combination of low-light vision sensors, lidar and radar can help to some extent in night scenarios up to about 50 meters; beyond that, it becomes challenging to drive the vehicle autonomously [15]. Infrared sensor technology overcomes the aforementioned challenges and reliably detects and classifies cars, pedestrians, animals and other objects common in driving scenarios. IR sensors also perform equally well in daylight, so they can provide redundant information to the existing sensor suite, thereby increasing confidence in detection and classification algorithms. IR sensors can therefore be used effectively to address the limitations of vision and other sensors. The real-time performance of an IR camera is not affected by low-light and dark scenarios, sun glare, or reflections from vehicle headlights and brake lights, and it can be considered a potential solution for extreme weather conditions such as snow, fog and rain. Uncooled thermal imaging systems available in the market are the most affordable low-cost IR sensors, thanks to advancements in microbolometer technology. These sensors generate a temperature map as an image, which can be analyzed visually and processed further to extract environmental information. In existing automotive sensor suites, these sensors can supplement or even replace existing technology: they sense the infrared emission of objects and operate independently of illumination conditions, thereby providing a promising and consistent route toward more precise environment perception systems [15].
The NTSB report describes the fatal pedestrian incident involving an Uber level 3 autonomous test car in Tempe, Arizona; the vehicle used lidar, radar and vision sensors. The report shows that the incident happened at night, when the scene was lit only by street lights. The system first classified the pedestrian as an unknown object, then as a car, then as a bicycle and finally as a person. This scenario was recreated and tested by FLIR using a wide field-of-view thermal camera and a basic classifier. The system was capable of detecting the person at a distance of approximately 85.4 m, which is twice the required stopping distance for a vehicle at 43 mph. With narrow field-of-view cameras, FLIR IR cameras have demonstrated pedestrian detection at distances four times greater than that required by the decision algorithms in autonomous vehicles [15].
Vehicles currently on the road with ADAS systems supporting SAE level 2 (partial automation) and level 3 (conditional automation) do not include an IR sensor in the sensor suite. The AWARE (All Weather All Roads Enhanced) vision project, executed in 2016, tested the potential of sensors operating in four different bands of the electromagnetic spectrum, namely visible RGB, near-infrared (NIR), short-wavelength infrared (SWIR) and long-wavelength infrared (LWIR), especially in challenging conditions such as fog, snow and rain. It was reported that the LWIR camera performed well in detecting pedestrians in extreme fog (visibility range = 15 ± 4 m) compared with NIR and SWIR, whereas the vision sensor showed the lowest detection performance. Similarly, the LWIR camera was capable of detecting pedestrians in extremely dark scenarios and in the presence of reflections from other vehicles'
headlights in foggy conditions, whereas the other sensors failed to detect pedestrians who were hidden by headlight glare and reflections [15, 22]. Also, as per Wien's displacement law, the peak radiation for human skin (emissivity 0.99) is around 9.8 μm at room temperature, which falls within the LWIR camera operating range. Therefore, being completely passive sensors, LWIR cameras can sense the IR radiation emitted by objects irrespective of extreme weather conditions and illumination [16]. The distance at which an IR sensor can detect and classify an object depends on the field of view (FOV) of the camera: narrow-FOV cameras can detect objects at far distances, whereas wide-FOV cameras cover a greater angle of view. IR sensors also require the target object to span about 20 × 8 pixels to reliably detect and classify it, and an IR sensor with a narrow-FOV lens is capable of detecting and classifying an object of 20 × 8 pixel size at a distance greater than 186 meters. Therefore, IR sensors with a narrow FOV can be used on highways to detect distant objects, whereas wide-FOV cameras can be used in urban or city driving scenarios [15].
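The pixel requirement quoted above can be turned into a rough range estimate: the number of pixels an object occupies is its angular size divided by the camera's per-pixel angle (the instantaneous field of view, IFOV). The 640-pixel sensor width, the 1.7 m pedestrian height and the 14° and 75° lenses below are assumptions chosen only to illustrate the narrow- versus wide-FOV trade-off, not the specification of any particular camera.

import math

def range_for_pixels(object_size_m, pixels_needed, fov_deg, sensor_px):
    # Small-angle approximation: range ~= size / (pixels_needed * IFOV)
    ifov = math.radians(fov_deg) / sensor_px          # radians per pixel
    return object_size_m / (pixels_needed * ifov)

# Range at which a 1.7 m tall pedestrian still spans 20 pixels vertically
for fov in (14.0, 75.0):                              # assumed narrow and wide lenses
    print(f"{fov:4.0f} deg FOV -> ~{range_for_pixels(1.7, 20, fov, 640):5.0f} m")

With these assumed values, the narrow lens keeps the 20-pixel criterion out to roughly 220 m, while the wide lens drops to roughly 40 m, mirroring the highway versus urban split described above.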
In the automotive domain, IR sensors can be used effectively in both in-cabin sensing and driving applications. For in-cabin sensing, IR sensors are mounted inside the vehicle to support driver drowsiness and fatigue detection, eye-gaze localization, face recognition, occupant gender classification, facial expression and emotion detection, and so on. In ADAS and AV systems, one or more IR cameras mounted on the vehicle can be used to generate a precise and accurate perception of the surrounding environment, most commonly for object detection, classification and semantic segmentation [23].
Figure 11.
Representative convolutional neural network and its components.
Deep learning-based perception algorithms are used to generate object-level information, such as object position, size, class, distance and orientation relative to the ego vehicle, as well as semantic information in the form of pixel-wise object classes. Convolutional neural networks (CNNs) are most commonly used for object detection and classification tasks. A generic representation of a convolutional neural network and its components is shown in Figure 11. Any deep learning model has an input layer, several hidden layers and a final fully connected layer called the output layer. The input layer takes an image as input, and the final output layer defines the detected objects along with their confidence scores. A combination of a convolution layer, a pooling layer and an activation layer represents one hidden layer; in deep networks, multiple such feature extraction layers are stacked to extract coarse-to-fine information. Finally, a softmax function assigns a confidence score to each candidate class based on feature similarity, and the class with the highest confidence score is reported as the object class together with the detected bounding box [24]. CNN, R-CNN, Fast R-CNN, Faster R-CNN, SSD, YOLO, YOLOv2, YOLOv3 and similar DL approaches are the most commonly used for the recognition of road objects. In level 3 and above ADAS and AV systems, multi-task networks are predominantly used, where a common network architecture is trained to perform multiple tasks. End-to-end AI systems can also be used to generate the complete perception stack for ADAS and AV systems, including perception, localization and mapping [7].
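A minimal PyTorch sketch of the layer pattern described above, with convolution + activation + pooling blocks feeding a fully connected output layer and a softmax over classes. The channel counts, input size and the three-class output are arbitrary placeholders, not the architecture of any network cited in this chapter.

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    # One "hidden layer" = convolution + activation + pooling; two are stacked here
    def __init__(self, num_classes=3, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # fully connected output layer

    def forward(self, x):
        x = self.features(x)                    # coarse-to-fine feature extraction
        x = torch.flatten(x, 1)
        return torch.softmax(self.classifier(x), dim=1)  # per-class confidence scores

scores = TinyClassifier()(torch.randn(1, 3, 64, 64))
print(scores)  # one confidence score per class, summing to 1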
Training, testing and validation are the most important steps for any deep learning-based perception algorithm. After proper training, a network can be deployed in real time to perceive the environment successfully. The captured dataset and its size play a critical role in how well the trained model performs when deployed in real time. Convolutional neural networks (CNNs) have significantly improved the performance of many ADAS and AV applications, but they require significant amounts of training data to obtain optimum performance and reliable validation outcomes. Fortunately, there are several large-scale, publicly available, annotated thermal datasets that can be used for training CNNs. However,
compared with visible imaging data, relatively few two-dimensional thermal datasets for automotive applications are openly available. Table 3 lists available datasets captured by various types of thermal sensors in different environmental conditions. These datasets are widely used to build pre-trained CNN models for applications such as pedestrian detection, vehicle detection and classification, and small-object detection in the thermal spectrum on GPU/edge-GPU devices for the automotive sensor suite [31]. There are, however, inherent challenges associated with training and validating CNN models on thermal data: a limited number of publicly available datasets, little variability in scene conditions such as weather, lighting and heat, and difficulties in reusing RGB-pretrained CNN models for thermal data.
FLIR [29]: 16 GB, 640 × 512 resolution, 14,000 frames; classes: person, vehicles, bicycles, bikes, poles, dogs.
Table 3.
Details of various open access infrared datasets available related to ADAS applications.
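One of the difficulties listed above is reusing RGB-pretrained CNNs for single-channel thermal images. A common workaround, sketched below with torchvision's ResNet-18 as an assumed backbone and four assumed object classes, is to replace the first convolution with a one-channel version initialized from the mean of the pretrained RGB kernels and then fine-tune on the thermal dataset.

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # RGB-pretrained backbone
old_conv = model.conv1
new_conv = nn.Conv2d(1, old_conv.out_channels, kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride, padding=old_conv.padding, bias=False)
with torch.no_grad():
    # Average the RGB kernels so the single thermal channel reuses the pretrained filters
    new_conv.weight.copy_(old_conv.weight.mean(dim=1, keepdim=True))
model.conv1 = new_conv
model.fc = nn.Linear(model.fc.in_features, 4)  # e.g. person, vehicle, bicycle, animal

print(model(torch.randn(2, 1, 224, 224)).shape)  # torch.Size([2, 4]), ready for fine-tuning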
In more recent approaches, the features learned by a deep neural network are used to classify the object, and such DL-based techniques have been found to outperform the traditional methods [35].
The two commonly used deep learning approaches for VRU detection are the two-stage detector (region proposal approach) and the single-stage detector (non-region proposal approach). A two-stage, or region proposal, approach first generates candidate object regions and then uses a CNN to classify and refine them in the second stage; examples include the region-based CNN (R-CNN) [36], the region-based fully convolutional network (R-FCN) [37] and Faster R-CNN [38]. A single-stage detector, in contrast, performs region proposal, feature extraction and classification in a single step; non-region-proposal approaches include the single shot detector (SSD) [39] and You Only Look Once (YOLO) [40]. The advantages and disadvantages of two-stage and single-stage detectors are presented in Table 4.
Many studies have investigated the most reliable way to use both color information from visible cameras and thermal information from thermal cameras [41–43]. These studies commonly highlight the illumination dependency of visible cameras, their limitations in adverse weather, and the benefits of including thermal data for better performance. Fusing visible and thermal cameras helps reduce object detection errors, particularly at nighttime. Fusion of this sensor information is possible at various levels, such as the pixel level, feature level or decision level. In pixel-level fusion, the thermal image intensities are fused with the intensity (I) component of the visible image and the fused image is reconstructed with the new I values. Pixel-level fusion is typically performed with methods such as wavelet-based transforms, the curvelet transform and Laplacian pyramid fusion. It is usually not combined with deep learning-based sensor fusion, as it takes place outside the neural network.
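As a minimal illustration of the intensity-component idea mentioned above (not one of the cited wavelet, curvelet or pyramid methods), the sketch below blends the luminance of an RGB frame with a registered thermal frame and rebuilds the color image around the new intensity. The simple averaging rule, the equal weighting and the synthetic inputs are assumptions for illustration only.

import numpy as np

def fuse_intensity(rgb, thermal, w_thermal=0.5):
    # rgb in [0, 1], shape HxWx3; thermal in [0, 1], shape HxW, registered to the RGB view
    intensity = rgb.mean(axis=2)                        # simple I component
    fused_i = (1 - w_thermal) * intensity + w_thermal * thermal
    scale = fused_i / np.clip(intensity, 1e-6, None)    # rescale channels to the new I
    return np.clip(rgb * scale[..., None], 0.0, 1.0)

rgb = np.random.rand(4, 4, 3)       # stand-in for a registered visible frame
thermal = np.random.rand(4, 4)      # stand-in for the corresponding thermal frame
print(fuse_intensity(rgb, thermal).shape)  # (4, 4, 3) fused image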
The typical architectures for deep learning-based sensor fusion are early fusion, late fusion and halfway fusion, as shown in Figure 12. Early fusion, also called feature-level fusion, combines the visible and thermal images into a 4-channel (red, green, blue, intensity; RGBI) input so that the deep learning network can learn the relationship between the image sources. Late fusion, also called decision-level fusion, extracts features from the visible and thermal images in separate subnetworks and fuses them just before the object classification layer. Halfway fusion is another approach, in which the visible and thermal information are fed separately into the same network and the fusion happens inside the network itself.
Single-stage (SSD, YOLO): higher speeds; information loss and a large number of false positives.
Table 4.
Deep learning-based object detection types.
Figure 12.
Sensor fusion techniques [35].
Various studies have demonstrated the benefits of multispectral detection techniques, which produced the best results when combining visible and thermal images. However, during night conditions thermal cameras alone performed better than the fused data. In low-light conditions the fused data performed worse, with an average miss rate of 3%, and an overall decrease of 5% was observed during daytime. The use of multiple sensors also increases system complexity because of differences in sensor position, alignment, synchronization and resolution between the cameras used [35].
By far, halfway fusion is considered the most effective of the three techniques, with a 3.5% lower miss rate. Using stand-alone visible or thermal information was also shown to perform worse than halfway fusion by 11% [44].
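A compact PyTorch sketch contrasting the early-fusion and halfway-fusion layouts described above: early fusion stacks the registered visible and thermal frames into a 4-channel input, while halfway fusion runs modality-specific convolutional streams and concatenates their feature maps inside the network. All layer sizes and the two-class head are illustrative assumptions.

import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),   # 3 RGB + 1 thermal channel
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))

    def forward(self, rgb, thermal):
        return self.net(torch.cat([rgb, thermal], dim=1))   # fusion at the input

class HalfwayFusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_stream = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.ir_stream = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.head = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),  # fusion inside the network
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))

    def forward(self, rgb, thermal):
        return self.head(torch.cat([self.rgb_stream(rgb), self.ir_stream(thermal)], dim=1))

rgb, ir = torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64)
print(EarlyFusionNet()(rgb, ir).shape, HalfwayFusionNet()(rgb, ir).shape)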
Wagner et al. [45] investigated optimal fusion techniques for pedestrian detection using Faster R-CNN and the KAIST dataset and found that multispectral information with single-stage and halfway fusion can achieve better performance. Similarly, various studies have examined the importance of detection stages and sensor fusion techniques under different lighting conditions and drawn similar conclusions [46]. However, these deep learning models must estimate the bounding box of each object and compute the probability of the class it belongs to via neural networks, which can make them unsuitable for real-time applications [40].
DenseFuse, a deep learning architecture proposed by Li et al. [47] to extract more useful features, uses a combination of CNNs, fusion layers and dense blocks to create a reconstructed fused image that outperforms existing fusion methods. The SeAFusion network combines image fusion with semantic segmentation information and uses gradient residual blocks to enhance the image fusion process [48]. An unsupervised fusion network called U2Fusion was proposed by Xu et al. [49] to estimate the fusion process while accounting for the importance of each source image. Similarly, several other image fusion networks have been proposed, such as the end-to-end fusion network (RFN-Net), the effective bilateral mechanism (BAM) and the bilateral ReLU residual network (BRRLNet) [50, 51]. Despite this progress in deep learning-based fusion architectures, none of them runs as a lightweight real-time application; they are all limited by the need for appropriate hyper-parameter selection and by significant memory utilization. The advantages and disadvantages of the image fusion models reported in the literature are summarized in Table 5.
In the YOLO framework [40], creating the bounding box and classifying the image are treated as a single regression problem to enhance inference speed, and the neural network is trained as a whole on this task. YOLO divides the input image into an m × n grid, then predicts N bounding boxes and estimates a confidence score for each bounding box (BB) using the CNN. Each BB consists of its central coordinates (x, y), its width and height (w, h) and a class probability value. The intersection over union (IoU) is calculated from the overlap between the detected BB and the ground truth (GT), and its value indicates how accurately the BB is predicted. The confidence of a BB is expressed as the product of the probability of the object, Pr(Object), and the IoU. If the central coordinates of the predicted BB and the GT lie within the overlap region, the detection is assumed successful and Pr(Object) is set to 1; otherwise it is set to 0. If there are i classes, each detection also carries a conditional class probability Pr(Class_i | Object). The BB with the highest class probability among all N candidate BBs is taken as the best-fit BB for the object concerned.
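The IoU and confidence definitions paraphrased above can be written out directly. The sketch below computes the IoU between a predicted and a ground-truth box and the resulting class-specific confidence Pr(Class_i | Object) · Pr(Object) · IoU; the example boxes and probabilities are made-up values.

def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2) corner coordinates
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter)

pred, gt = (10, 10, 60, 90), (15, 12, 65, 88)
pr_object, pr_class_given_object = 1.0, 0.8        # Pr(Object) = 1 for a successful detection
box_confidence = pr_object * iou(pred, gt)         # Pr(Object) * IoU
print(iou(pred, gt), pr_class_given_object * box_confidence)  # ~0.78 and the class-specific confidence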
SeAFusion [48]: combines fusion and semantic segmentation; cannot handle complex scenes.
Y-shaped net [50]: extracts local features and context information; may introduce artifacts or blur.
RFN-Net [51]: two-stage training strategy; needs a large amount of training data and time.
Table 5.
Summary of image fusion model-related literature.
Yoon and Cho [52] proposed a multimodal YOLO-based object detection method based on late fusion with non-maximum suppression, which efficiently extracts object features using color information from visible cameras and boundary information from thermal cameras. The architectural block diagram is shown in Figure 13. Non-maximum suppression is generally employed toward the second half of the detection pipeline to improve the object detection performance of models such as YOLO and SSD.
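A minimal sketch of the non-maximum suppression step mentioned above: detections are sorted by score, and any box that overlaps an already-kept box above an IoU threshold is discarded. It reuses the iou helper from the earlier sketch, and the 0.5 threshold is an assumed typical value.

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 60, 90), (12, 11, 62, 92), (200, 50, 260, 150)]
scores = [0.9, 0.75, 0.8]
print(non_max_suppression(boxes, scores))  # [0, 2]: the weaker duplicate of box 0 is suppressed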
They further proposed an improved deep multimodal object detection strategy by introducing a dehazing network to enhance the performance of the model under reduced visibility. The dehazing network comprises haze-level classification, light-scattering coefficient estimation from visible images and depth estimation from thermal images. Detailed performance metrics for the dense haze condition from Yoon and Cho [52] are presented in Table 6, based on YOLO trained on (a) visible, (b) IR/thermal, (c) visible and IR and (d) visible, IR and dehazing network inputs. Example vehicle detection results from Yoon and Cho [52], based on the YOLO model trained on visible alone, IR/thermal alone, fused visible and IR, and fused visible, IR and the dehazing model, are shown in Figure 14; missed detections are marked with red boxes and correct detections with blue boxes. The accuracy of the vehicle detection model improved from 81.11% to 84.02% with the fusion model, but the run time was badly impacted, making the dehazing model unfit for real-time applications.
Chen et al. [53] proposed a thermal-based R-CNN model for pedestrian detection using VGG-16 as the backbone network because of its good network stability, which enables the integration of new branch networks. To address the pedestrian occlusion problem, they proposed a part-model architecture with new aspect ratios and a block model to strengthen the network's generalization. The presence and appearance of a pedestrian are almost completely lost if the occlusion rate exceeds 80%; therefore, only pedestrians with less than 80% occlusion were considered during training. Figure 15 illustrates the possible types of pedestrian occlusion, with ground-truth labels and possible detections represented as green and red rectangles, respectively.
Figure 13.
Block diagram of the multimodal YOLO-based object detection method based on late fusion [52].
Table 6.
Performance of vehicle detection model during dense haze condition based on YOLO trained for (a) visible (b)
IR/thermal (c) visible and IR and (d) visible, IR and dehaze network. Results from [52].
Figure 14.
Examples of vehicle detection results based on YOLO model trained for (a) visible (b) IR/thermal (c) visible and
IR and (d) visible, IR and dehaze network. Missed detection is marked in red box and correct detection is marked
in blue box. Results from [52].
Figure 16 shows the architecture of the thermal R-CNN fusion model for improved pedestrian detection proposed in [53], which comprises a full-body and region-decomposition branch to extract full-body pedestrian features and a segmentation-head branch to extract individual pedestrians from crowded scenes. The loss function is defined from five components: bounding-box loss, classification loss, segmentation loss, pixel-level loss and fusion loss.
Table 7 presents the performance comparison from [53] of the thermal R-CNN fusion pedestrian detection model against state-of-the-art deep learning models, showing that the thermal R-CNN fusion model is effective and performs better. The model is sensitive to regional features, which can easily lead to misjudged images; however, the semantic segmentation feature enhances the information of the complete pedestrian bounding box, so the final output is accurate, resulting in higher precision. Figure 17 shows example images from [53] demonstrating the improved pedestrian detection of the thermal R-CNN fusion model compared with the ground truth and the benchmark modified R-CNN model. It is evident from the results that the modified R-CNN only partially detects occluded pedestrians and, in some cases, creates double partial bounding boxes for single pedestrians.
Figure 15.
Illustration of types of pedestrian occlusion. (a) the green BB rectangle represents full pedestrian and red BB
rectangle represents detectable pedestrians. (b) Top six types of pedestrian occlusion types [53].
Figure 16.
The architecture of the thermal R-CNN fusion model for improved pedestrian detection [53].
Table 7.
Performance comparison of various pedestrian detection models. Results from [53].
This issue is addressed and pedestrian detection is improved in the thermal R-CNN fusion results.
Figure 17.
Example images to demonstrate the improved pedestrian detection by (c) the thermal R-CNN fusion model
compared to (a) ground truth and (b) modified R-CNN. Results from [53].
In knowledge distillation, a large, well-trained teacher network is used to guide the training of a smaller student network. Here, information is transferred from the teacher network to the student network by minimizing a loss defined on the teacher's outputs as well as on the GT labels. Chen et al. [54] proposed using low-level features from the teacher network to supervise and train the deeper features of the student network, resulting in improved performance. To address the low-resolution issues of IR-visible fused images, Xiao et al. [55] introduced a heterogeneous knowledge distillation network with multi-layer attention embedding; this technique consists of a teacher network with high-resolution fusion and a student network with low-resolution and super-resolution fusion. A perceptual distillation method for training image fusion networks without GTs was proposed by Liu et al. [56], in which a teacher network is trained together with a multi-autoencoder student network under self-supervision. Several more such studies from recent years are summarized in Table 8.
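To make the teacher-student transfer described above concrete, the sketch below shows a generic distillation objective that mixes the usual ground-truth loss with a term pulling the student's softened outputs toward the teacher's. The temperature and weighting are conventional but assumed values, and the tiny linear models merely stand in for the fusion networks in the cited works.

import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.7):
    # Hard-label loss against the ground truth plus a soft-label KL term against the teacher
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                    F.softmax(teacher_logits / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2
    return (1 - alpha) * hard + alpha * soft

teacher, student = nn.Linear(16, 5), nn.Linear(16, 5)   # placeholder teacher and student
x, y = torch.randn(8, 16), torch.randint(0, 5, (8,))
loss = distillation_loss(student(x), teacher(x).detach(), y)  # detach: no gradient to the teacher
loss.backward()
print(loss.item())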
Cross-stage connection path [55]: uses low-level features to supervise deeper features; increases the complexity.
Perceptual distillation [57]: trains image fusion networks without ground truths; depends on teacher network quality.
Table 8.
Summary of knowledge-distillation network-related literature [57].
Adaptive feature mechanisms have also been proposed, for example for image retrieval using adaptive features. In brief, such a mechanism blends the effectiveness of multiple features and outperforms single-feature retrieval techniques. This can enhance image fusion precision, reduce noise interference and improve the real-time performance of the network. In addition, various adaptive mechanisms, such as adaptive selection of loss functions, activation functions and sampling functions, have been proposed to optimize deep learning networks. The advantages and disadvantages of the adaptive mechanism-related literature are summarized in Table 9.
Global group sparse coding [63]: automatic network depth estimation by learning inter-layer connections; may suffer from sparsity or redundancy issues.
Table 9.
Summary of adaptive mechanism network-related literature.
Figure 18.
Generative adversarial network based image fusion framework.
In GAN-based image fusion frameworks (Figure 18), the generator is trained to preserve the salient edges and features of the target images. Researchers have further attempted to improve the GAN-based framework by including a double-discriminator conditional generative adversarial network. Li et al. [66] showed improved capture of the region of interest by integrating a multi-scale attention mechanism branch into the GAN-based image fusion framework. These networks generate fused images of excellent quality, which are useful for entertainment and human perception; however, they are not suitable for demanding visual processing tasks such as those in automotive applications.
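A highly simplified sketch of the generator/discriminator objective behind the GAN-based fusion frameworks discussed above: the generator produces a fused image from the stacked visible and thermal inputs and is trained both to stay close to its sources and to fool a discriminator. The single L1 content term, the 0.01 weighting and the tiny convolutional models are placeholders, not the architectures or losses of the cited works.

import torch
import torch.nn as nn
import torch.nn.functional as F

gen = nn.Conv2d(4, 1, 3, padding=1)    # placeholder generator: fused image from RGB + thermal
disc = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))  # placeholder discriminator

rgb, ir = torch.randn(2, 3, 32, 32), torch.randn(2, 1, 32, 32)
fused = gen(torch.cat([rgb, ir], dim=1))

# Generator objective: content fidelity to the thermal intensities plus an adversarial term
adv = F.binary_cross_entropy_with_logits(disc(fused), torch.ones(2, 1))
content = F.l1_loss(fused, ir)
gen_loss = content + 0.01 * adv
gen_loss.backward()
print(gen_loss.item())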
7. Conclusions
Level 3 and above ADAS and AV systems demand accurate and precise perception of the surrounding environment in order to drive the vehicle autonomously. This can be achieved using multiple sensors of different modalities such as vision, lidar and radar. However, the ADAS and AV systems provided by various OEMs and Tier 1 companies show a lack of performance in extreme weather and lighting conditions, especially dark scenarios, sun glare, rain, fog and snow. Performance can be improved by adding another sensor to the existing suite that provides complementary and redundant information in such extreme environments. The properties and characteristics of infrared sensors look promising, as they detect the natural emission of IR radiation and represent it as an image that indicates a relative temperature map. Recent advancements in AI have paved the way for efficient algorithms that can detect and classify objects more accurately irrespective of weather and lighting conditions, and IR sensors can detect objects at longer distances than other sensors. The literature shows that fusing the information
from IR sensors with other sensor data yields more precise results, thereby strengthening the path towards autonomous driving. The research community is looking for more IR datasets, comparable to those available for RGB images, to enable quick and easy deployment of IR in ADAS and AV applications. Integration in the automotive domain is currently challenging, as IR cameras need a separate calibration setup and remain an expensive technology due to the sensor array. More intensive research into IR technology and deep learning models will be highly beneficial for making effective use of IR cameras in ADAS and AV systems.
References
[6] Odukha O. How sensor fusion for autonomous cars helps avoid deaths on the road. Intellias; Aug 2023. Available from: https://intellias.com/sensor-fusion-autonomous-cars-helps-avoid-deaths-road/

[7] Ma Y, Wang Z, Yang H, Yang L. Artificial intelligence applications in the development of autonomous vehicles: A survey. IEEE/CAA Journal of Automatica Sinica. 2020;7(2):315-329

[14] Image Engineering. Challenges for cameras in automotive applications. Feb 2022. Available from: https://www.image-engineering.de/library/blog/articles/1157-challenges-for-cameras-in-automotive-applications

[15] Why ADAS and autonomous vehicles need thermal infrared cameras. 2018. Available from: https://www.flir.com/ [Accessed: September 25, 2023]

[34] Wang X, Han TX, Yan S. An HOG-LBP human detector with partial occlusion handling. In: Proceedings of the IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September-2 October 2009. Japan: IEEE; 2009. pp. 32-39. Available from: https://ieeexplore.ieee.org/document/5459207

[35] Ahmed S, Huda MN, Rajbhandari S, Saha C, Elshaw M, Kanarachos S. Pedestrian and cyclist detection and intent estimation for autonomous vehicles: A survey. Applied Sciences. 2019;9:2335. DOI: 10.3390/app9112335

[41] Geronimo D, Lopez AM, Sappa AD, Graf T. Survey of pedestrian detection for advanced driver assistance systems. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2010;32:1239-1258

[42] Enzweiler M, Gavrila DM. Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2009;31:2179-2195

[43] Dollár P, Wojek C, Schiele B, Perona P. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2012;34:743-761