
X-ray baggage screening and artificial intelligence (AI)

A technical review of machine learning techniques for X-ray baggage screening

2022

Vukadinovic, D., Anderson, D.

EUR 31123 EN
This publication is a Science for Policy report by the Joint Research Centre (JRC), the European Commission’s science and knowledge
service. It aims to provide evidence-based scientific support to the European policymaking process. The scientific output expressed does
not imply a policy position of the European Commission. Neither the European Commission nor any person acting on behalf of the
Commission is responsible for the use that might be made of this publication. For information on the methodology and quality underlying
the data used in this publication for which the source is neither Eurostat nor other Commission services, users should contact the
referenced source. The designations employed and the presentation of material on the maps do not imply the expression of any opinion
whatsoever on the part of the European Union concerning the legal status of any country, territory, city or area or of its authorities, or
concerning the delimitation of its frontiers or boundaries.

Contact information
Name: Danijela Vukadinovic, David Anderson
Address: JRC Geel, Retieseweg 111, 2440, Geel, Belgium
Email: [email protected], [email protected]
Tel.: +32 14 57 12 11

EU Science Hub
https://ec.europa.eu/jrc

JRC129088

EUR 31123 EN

PDF ISBN 978-92-76-53494-5 ISSN 1831-9424 doi:10.2760/46363

Luxembourg: Publications Office of the European Union, 2022

© European Union, 2022

The reuse policy of the European Commission is implemented by the Commission Decision 2011/833/EU of 12 December 2011 on the
reuse of Commission documents (OJ L 330, 14.12.2011, p. 39). Unless otherwise noted, the reuse of this document is authorised under
the Creative Commons Attribution 4.0 International (CC BY 4.0) licence (https://creativecommons.org/licenses/by/4.0/). This means that
reuse is allowed provided appropriate credit is given and any changes are indicated. For any use or reproduction of photos or other
material that is not owned by the EU, permission must be sought directly from the copyright holders.

All content © European Union, 2022, except figures 1 to 51 and front cover image (©absent84 – Adobe Stock).

How to cite this report: Vukadinovic, D., Anderson, D., X-ray baggage screening and AI, EUR 31123 EN, Publications Office of the European
Union, Luxembourg, 2022, ISBN 978-92-76-53494-5, doi:10.2760/46363, JRC129088.

Contents

1 Introduction
1.1 Aviation security in context
1.2 Aviation security policies
1.3 Artificial intelligence in aviation
1.4 Scope of this report
2 X-ray baggage screening technology
2.1 Introduction
2.2 Principles of X-ray screening
2.3 Hold baggage
2.4 Cabin baggage
2.5 Chapter summary
3 Human-machine interaction
3.1 Introduction
3.2 Literature review
3.3 Chapter summary
4 Evaluating performance of automated methods
4.1 Introduction
4.2 Overview of performance metrics
4.3 Chapter summary
5 Datasets of X-ray images
5.1 Introduction
5.2 Overview of X-ray image libraries
5.3 Chapter summary
6 Image enhancement and threat detection
6.1 Introduction
6.2 Bag of visual words
6.3 Other classical machine learning methods
6.4 Chapter summary
7 Synthetic data augmentation
7.1 Introduction
7.2 Review of augmented data
7.3 Chapter summary
8 Deep learning
8.1 Introduction
8.2 Deep learning: the basics
8.3 Transfer learning
8.4 Deep learning with augmented data
8.5 Working with 3D CT data
8.6 Chapter summary
9 Materials classification
9.1 Introduction
9.2 2D computer vision
9.3 3D computer vision
9.4 Non-computer-vision methods
9.5 Chapter summary
10 Horizontal issues
10.1 Introduction
10.2 Testing AI system performance
10.3 Scarcity of data
10.4 Transparency, traceability and explainability
10.5 Resilience to attacks
10.6 The need for harmonised testing data
10.7 Chapter summary
11 Conclusions
References
List of abbreviations and definitions
List of figures
List of tables


Abstract
The aim of this report is to review the scientific literature and state of the art regarding the application of
machine learning techniques in X-ray security screening of baggage. We begin by reviewing X-ray baggage
screening technology, followed by a discussion on the importance of human-machine interaction. The
different approaches to measuring and describing the performance of AI algorithms are summarised, and an
overview of existing databases of X-ray images is given. Image enhancement and threat detection using classical machine learning techniques are then described, followed by synthetic data augmentation and deep learning. We also describe some applications of machine learning to materials classification (as opposed to
object detection). The report concludes by discussing some horizontal issues concerning the application of AI
in X-ray baggage screening, including testing of algorithm performance, the need for large, harmonised
databases of images, data scarcity, transparency, and explainability.


Executive summary
The main aim of this report is to review the scientific literature and state of the art regarding the application
of machine learning techniques in X-ray security screening of baggage.

Policy context
For sectoral legislation in aviation security, the European Commission has established common rules for civil
aviation security aimed at protecting persons and goods from unlawful interference with civil aircraft.
Regulation (EC) 300/2008 (European Commission, 2008) lays down common rules and basic standards on
aviation security and procedures to monitor the implementation of the common rules and standards. It
replaced the initial framework Regulation 2320/2002 in order to meet evolving risks and to allow new
technologies to be introduced. The framework legislation is accompanied by various supplementing and
implementing legislation, notably Implementing Regulation 2015/1998 (European Commission, 2015a) which
lays down detailed measures for the implementation of the common basic standards on aviation security.
Concerning cross-cutting legislation on artificial intelligence (AI), the EU has taken an active role by setting up
a High-Level Expert Group on Artificial Intelligence (AI HLEG) in June 2018, which published its “Ethics Guidelines for Trustworthy AI” in April 2019, proposing a set of non-binding recommendations regarding AI. This was followed by the White Paper on AI (European Commission, 2020) in February 2020, and by the proposal for the AI Act (European Commission, 2021c) in April 2021, the first attempt to regulate this fast-growing field. In the EU AI Act, aviation is one of several domains listed as high-risk.

Quick guide
Data scientists who might not be so knowledgeable about aviation security will find a review of X-ray
baggage screening technology in Chapter 2. Conversely, X-ray technology experts who might not be so
knowledgeable about machine learning will hopefully find that Chapters 4 to 9 provide an accessible overview
of the application of machine learning techniques and the associated issues and challenges.
We also dedicate Chapter 3 to the important topic of human-machine interaction. Numerous studies have
shown that humans adapt their behaviour during a detection task in response to the perceived strengths and
weaknesses of an accompanying detection algorithm. Hence, human and machine performances should not
be considered as separate components, but as a combined system.
Finally, in Chapter 10 we discuss some of the horizontal issues concerning the application of AI in X-ray
baggage screening, including testing of algorithm performance, the need for large, harmonised databases of
images, data scarcity, transparency, and explainability.

Main findings
We consider that the biggest obstacle to the development of AI in X-ray baggage screening is currently data scarcity. While for face recognition, or for almost any problem that uses conventional RGB images, a huge amount of data can be gathered on the internet, X-ray or CT scans of bags are not easy to create.
Additionally, X-ray scans of bags with illicit substances or objects are even more difficult to gather in large
numbers. The problem is compounded by the fact that labelling such specific data is very costly. Labelling in general is a time-consuming activity, especially for images containing several objects of interest, which is often the case in threat detection. A further difficulty is that transferring models between domains and between different X-ray scanners is challenging and usually does not produce good results, because of non-quantified differences between scanner models.
We conclude that a process to collect and reuse data relevant for developing algorithms is highly desirable. If
operational and laboratory data is stored in a standardised way, algorithms will be able to make sense of it.
Furthermore, enabling human screeners to label objects of interest with a single click could dramatically improve the content of current X-ray image databases. Scientific institutes that work with explosives or that produce explosive simulants and other prohibited items could be mandated to produce large image datasets. In addition to data labelling and data standardisation, quality control of data is required to obtain validated, clean data ready for use. This, in turn, requires standard tools for all of these tasks, which do not currently exist and should be developed.


1 Introduction

1.1 Aviation security in context


Air transport is a very important part of the global economy. Before the COVID-19 crisis, the aviation industry in 2019 supported 65.5 million jobs, and its share in the global economy (direct, indirect, induced and tourism-driven) was estimated at $2.7 trillion, equivalent to 3.6% of world gross domestic product (Global Aviation Industry High-Level Group, 2019). In the same year, 4.3 billion passengers were carried by airlines worldwide, an increase of 6.4% from 2017. In the EU, there were over one billion air passengers in 2019: an increase of 3.8% compared with 2018, as illustrated in Figure 1 (Statista, 2021). This increasing trend is likely to return once the COVID-19 crisis is over.

Figure 1. Number of passengers (millions) carried by air in the EU from 2008 to 2020.

Source: reproduced from Statista (2021)

The strategic value of aviation for the world economy, coupled with its very high sensitivity to all kinds of interference, creates a direct incentive for terrorist acts by various terrorist organizations. Looking at the history of terrorist attacks in the Global Terrorism Database (2021), 420 attacks against airplanes and airports occurred after 2001, most of them by means of bombing or explosion (see Figure 2). Repeated terror attacks have led to increased aviation security measures, and approximately $50 billion is spent annually worldwide in the quest to deter or disrupt terrorist attacks on aviation (Stewart & Mueller, 2018).

Figure 2. Terrorist attacks against airplanes and airports since 2001 by attack type.

Source: reproduced from Global Terrorism Database (2021)


1.2 Aviation security policies


Since 2002, the European Commission has established common rules in the field of civil aviation security aimed
at protecting persons and goods from unlawful interference with civil aircraft. Regulation (EC) N°300/2008
(European Commission, 2008) lays down common rules and basic standards on aviation security and
procedures to monitor the implementation of the common rules and standards. It replaced the initial
framework Regulation N° 2320/2002 in order to meet evolving risks and to allow new technologies to be
introduced. The framework legislation is accompanied by various supplementing and implementing legislation,
in particular Implementing Regulation 2015/1998 (European Commission, 2015a).
Common basic standards comprise:
— screening of passengers, cabin baggage and hold (checked) baggage
— airport security (access control, surveillance)
— aircraft security checks and searches
— screening of cargo and mail
— screening of airport supplies
— staff recruitment and training.
Passengers and their baggage are screened for three main purposes (Ecorys, 2009; Wells & Bradley, 2012):
— the illegal movement of goods or prohibited items according to local legislation
— fraud and revenue avoidance
— terrorist threats.
The illegal movement of goods, fraud and revenue avoidance are the subject of customs checks. In this report, we
will focus on security screening, which concentrates on terrorist threats. Figure 3 shows the typical workflow at a
U.S. airport security checkpoint with various screening technologies (U.S. Government Accountability Office,
2019), and European airports have similar procedures. A passenger may be examined by these technologies
and by human operators in primary and secondary screening. The primary screening of humans is performed
using walk-through metal detectors and full-body, millimetre-wave scanners. In this report, we focus on
threats located in the baggage.
Essentially, there are six major approaches available for the inspection of screened baggage (Butler & Poole, 2002; Singh & Singh, 2003; Wells & Bradley, 2012):
— manual hand search (sometimes referred to as ‘pat-down’);
— explosive detection dogs;
— explosives trace detection (ETD);
— automated X-ray or CT inspection, including explosives detection systems (EDS);
— visual inspection of X-ray and CT images performed by a human screener;
— liquid explosive detection systems (LEDS).
Manual hand search, explosive detection dogs and ETD have capacity issues, and are therefore used only as
second-level inspection methods once the initial trigger has been produced (Wells & Bradley, 2012).
Primary screening of baggage is X-ray based and consists of two types of screening:
— visual inspection of X-ray scans for improvised explosive devices (IEDs), bare explosives and prohibited items
— automated explosive detection systems (EDS) and automated object detection.
For the majority of cases, the secondary screening takes place only if an alarm is triggered during primary
screening.


Figure 3. Typical workflow at U.S. airport security checkpoints with various screening technologies.

Source: U.S. Government Accountability Office (2019)

Airports strive to automate technologies and detection processes to improve effectiveness while reducing
scanning and image processing times and human error, and while rightsizing the number of personnel needed
at the checkpoint. X-ray based techniques are the main technology used at checkpoints. There are 2D and 3D
X-ray scanners for screening of both hold baggage and cabin baggage, and there exist different techniques to
process the scans depending on types of threats. X-ray diffraction (XRD) is sometimes used, although current
deployment levels of XRD are very low. In order to perform security screening quickly and securely enough, new solutions are constantly emerging, trying to adapt to the actual contents of travel baggage and to the new security threats that emerge as technology develops.

1.3 Artificial intelligence in aviation


A computer algorithm does not suffer from human weaknesses (it does not get tired or bored, nor have a bad day) and can be standardised across all machines operating in all airports; it could therefore be a valuable assistant to human operators and a first step towards fully automated airport security checkpoints. Besides detecting and classifying objects, faces and human behaviour, AI is especially effective at identifying patterns in data that humans cannot detect. Consequently, various applications in aviation security are already implemented, or close to implementation.
In aviation security, the most discussed application of AI at the moment is unmanned aerial systems (i.e., UAS,
or drones). However, as stated in the European Union Aviation Safety Agency (EASA) AI roadmap (EASA,
2020a), AI and machine learning (ML) could also be used in nearly any application that implies mathematical
optimisation problems, removing the need for analysis of all possible combinations of associated parameter
values and logical conditions. Typical applications of AI could be flight control laws optimisation, sensor
calibration, fuel tank quantity evaluation, icing detection, etc. Some AI applications in air traffic management
are already implemented, such as improving strategic planning, enhancing trajectory prediction, and better understanding passenger behaviour (European Aviation Artificial Intelligence High Level Group, 2020; EASA, 2020a). AI applications in the aviation industry can be broadly split into three groups: i) airport operations optimization, ii) air traffic management and iii) security applications, as indicated in Table 1.


Table 1. Potential applications of AI across aviation.

Airport operations optimization

Runway operations optimization:
— Automated video analysis enhancing images and detecting objects through limited visibility and occlusions, removing the need for physical control towers (NVIDIA, 2018), e.g. implemented at London Heathrow (Wolfe, 2020)
— Runway overrun prevention systems and landing optimization (Airbus, 2016)
— Various sensors are used: RGB cameras, laser (LIDAR) and radar

Optimization of passenger transfer between gates:
— Modelling of the distance between gates and passengers’ locations within the airport for automated gate allocation and assignment
— AI-based decision support systems advising passengers on how to move optimally between the gates using real-time input, e.g. the London Heathrow airport implementation (Guo et al., 2020)

Air traffic management

Flight trajectory optimization:
— Traffic prediction improvements (Maastricht Upper Area Control Centre, 2018):
• prediction of sector skips
• prediction of take-off times
• 4D trajectory prediction
• prediction of flight routes
— Trajectory forecasting:
• AI-based distance recommendation for aircraft on final approach
• improvement of the trajectory forecast for the climb phase of flights

Security applications

Automatic biometric identification systems:
— Face recognition and periocular scanning
— Iris and retina recognition
— Gait and voice recognition
— Fingerprint and friction ridge recognition
— Hand vein and geometry recognition

Automated detection of suspicious behaviour (verbal and non-verbal cues detected using video as input):
— In the airport area: uncertainty about path direction and speed, increased speed, unnecessary stops and starts
— At the checkpoint: visual cues (increased blinking, increased self-grooming), vocal cues (increased hesitation, increased speech errors, shorter responses and higher voice pitch), and verbal cues (increased over-generality and increased irrelevance)

Automated dangerous items detection:
— Human screening
— Baggage screening

Source: JRC


A significant improvement (75%) in runway throughput was achieved at Heathrow airport by installing cameras and using computer vision to determine when airplanes have left the runway in foggy conditions, in which flight controllers relying on radar had lost 20% of landing capacity (EUROCONTROL, 2020). Regarding the throughput of people at airports, face recognition technologies are already successfully implemented at passport checkpoints.
An extensive Science for Policy Report was published by the European Commission’s Joint Research Centre
(JRC) in 2019, giving a detailed overview of existing face recognition technologies, challenges and
possibilities, concluding that automatic biometric face identification systems had reached a sufficient level of readiness and availability for integration into the Central Schengen Information System (Psyllos et al., 2019).
Another application that is potentially sensitive from the ethical and privacy points of view is the automated detection of suspicious behaviour in the airport area. The TSA is in the process of developing the Behaviour Detection
and Analysis Program to implement AI solutions developed by academia and private companies (Blum, 2020;
Office of Inspector General, 2016).
Regarding AI-powered automated threat detection, high-level automation systems, in which the human screener checks only machine-alarmed bags, are currently implemented. Smiths Detection has a scanner for hold baggage inspection that uses AI to detect lithium batteries (Smiths Detection, 2021b), and its iCMORE system is used to detect guns, gun parts, ammunition and knives hidden in cabin baggage (Smiths Detection, 2021a, 2021c).
AI is being increasingly incorporated into consumer products; this trend is accelerating, and AI will be increasingly used in safety-critical systems such as the aviation industry. In order to understand not only the technical challenges but also the ethical, social and legal ones, and to establish a safe and optimal framework that utilizes AI systems to their full capacity while minimizing potential harm, the positions of all stakeholders involved, and how AI influences them, need to be analysed.
The use of AI comes with severe socio-technical, legal and regulatory challenges that have to be addressed
before wider deployment of AI applications takes place (Emanuilov & Dheu, 2021). The EU has taken an active
role by setting up a High-Level Expert Group on Artificial Intelligence (AI HLEG) in June 2018, which published its “Ethics Guidelines for Trustworthy AI” in April 2019, proposing a set of non-binding recommendations regarding AI (Emanuilov & Dheu, 2021). This was followed by the White Paper on AI (European Commission, 2020) in February 2020, and by the proposal for the AI Act (European Commission, 2021c) in April 2021, the first attempt to regulate this fast-growing field.
According to a definition of AI published by the AI HLEG (2018), the term AI makes explicit reference to the notion of intelligence, which is a vague concept in both machines and humans. AI researchers therefore mostly use the notion of rationality, which refers to the ability to choose the best action to take in order to achieve a certain goal, given certain criteria to be optimised. Another definition provided by the same authors (AI HLEG, 2018, 2019) says that AI systems are software systems designed by humans that, given a complex goal, act by perceiving their environment through data acquisition, interpreting the data, reasoning on knowledge, and deciding, by processing the information derived from these data, the best actions to take to achieve the given goal.
The capacity of an AI system to make decisions independently of humans is what takes it one step beyond automation in removing control from humans. According to a generally accepted definition, automation is the use of control systems and information technologies to reduce the need for human intervention. The push that AI brings from automation to autonomy, where decision power is given to machines, is however accompanied by many concerns.
In the EU AI Act (European Commission, 2021c), aviation is one of several domains listed as high-risk.
According to EASA’s roadmap (EASA, 2020a) and AI HLEG (2019), trustworthiness is embedded as a key pillar
and a pre-requisite for developing and deploying AI technologies.


According to AI HLEG (2019), there are seven key requirements that a trustworthy AI system has to meet throughout its entire life cycle:
1. human agency and oversight
2. technical robustness and safety
3. privacy and data governance
4. transparency
5. diversity, non-discrimination and fairness
6. environmental and societal well-being
7. accountability.

1.4 Scope of this report


The main scope of this report is to review the scientific literature and state of the art regarding the
application of machine learning techniques in X-ray security screening of baggage.
Data scientists who might not be so knowledgeable about aviation security will find a review of X-ray
baggage screening technology in Chapter 2.
Conversely, X-ray technology experts who might not be so knowledgeable about machine learning will
hopefully find that Chapters 4 to 9 provide an accessible overview of the application of machine learning
techniques and the associated issues and challenges.
More specifically, the different approaches to measuring and describing the performance of AI algorithms are summarised in Chapter 4, and an overview of existing databases of X-ray images is given in Chapter 5. Image enhancement and threat detection using classical machine learning techniques are described in Chapter 6, whilst deep learning is addressed in Chapter 8. In Chapter 9 we describe some applications of machine learning to materials classification (as opposed to object detection).
We also dedicate Chapter 3 to the important topic of human-machine interaction. Numerous studies have
shown that humans adapt their behaviour during a detection task in response to the perceived strengths and
weaknesses of an accompanying detection algorithm. Hence, human and machine performances should not
be considered as separate components, but as a combined system.
Finally, in Chapter 10 we discuss some of the horizontal issues concerning the application of AI in X-ray
baggage screening, including testing of algorithm performance, the need for large, harmonised databases of
images, data scarcity, transparency, and explainability.


2 X-ray baggage screening technology

2.1 Introduction
In this chapter, we review the principles of X-ray imaging and detection in the context of aviation security. X-ray technologies have become an important tool in airport security, which is now the chief application of these techniques outside the medical sector. These X-ray based systems assist human screeners, who are required by law, e.g. European Commission (2015a), to inspect cabin baggage at airport security checkpoints. The human-machine systems are deployed in two ways:
— assisting humans in performing their task of visually detecting threats
— providing automated detection that serves as an alarm to human operators.
Within the EU, specific legislation exists (European Commission, 2015a) that establishes the categories of
objects prohibited in both hold and cabin baggage. The lists of prohibited items are long and growing and can
be found online, for European airports on the website of the European Commission (2021a) and for U.S. airports as provided by the Transportation Security Administration (TSA) (Transportation Security Administration, 2021). Prohibited items differ between hold and cabin baggage, and therefore often require different detection methods, but the most common prohibited items fall into several categories: sharp and blunt objects; firearms; and explosives, flammables and incendiary materials.

2.2 Principles of X-ray screening


Objects in visible-spectrum images are opaque and occlude each other. In contrast, X-ray images are transparent: X-rays penetrate objects, so all objects along an X-ray path attenuate the signal and affect the final intensity value. Pixel intensities in X-ray images therefore represent signal attenuation due to (possibly multiple) objects, and the contrast between objects is provided by the differential attenuation of the X-rays as they pass through them. For X-ray transmission, the attenuation of X-rays as they travel through an object is formulated by:

$I_x = I_0 e^{-\mu x}$    (Eqn. 1)

where $I_x$ is the intensity of the X-ray at a distance $x$ from the source, $I_0$ is the intensity of the incident X-ray beam, and $\mu$ is the linear attenuation coefficient of the object, measured in cm$^{-1}$. The higher the density of the material, the higher the value of $\mu$, and the higher the attenuation. Hence, high-density materials, like metals, attenuate the X-rays more; as a result, the measured intensity becomes lower and the image gets darker.
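As a minimal numerical illustration of Eqn. 1, the sketch below computes the transmitted intensity fraction through a stack of layers. The attenuation coefficients are purely illustrative placeholders, not calibrated values for real materials; the example also foreshadows the single-energy ambiguity discussed below, where a thin, dense object and a thick, light one attenuate identically.

```python
import math

def transmission(layers):
    """Fraction of incident X-ray intensity transmitted through a stack
    of layers, each given as (mu, x): linear attenuation coefficient in
    1/cm and thickness in cm. Exponents add across layers (Eqn. 1)."""
    return math.exp(-sum(mu * x for mu, x in layers))

# Illustrative, uncalibrated attenuation coefficients at a single energy.
MU_METAL = 4.0    # 1/cm: dense, high-Z material
MU_PLASTIC = 0.2  # 1/cm: light, low-Z organic material

thin_dense = [(MU_METAL, 0.5)]      # mu * x = 2.0
thick_light = [(MU_PLASTIC, 10.0)]  # mu * x = 2.0

# Both stacks transmit exp(-2) ~ 13.5% and therefore look identical
# in a single-energy image.
print(transmission(thin_dense), transmission(thick_light))
```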
The simplest X-ray method for screening baggage is radiographic X-ray imaging. It can be done either with a single radiographic shot and a large-area imager, or with continuous X-ray exposure and an X-ray line sensor. Generally, the line scanning method is used to screen cabin baggage. It requires only a limited area to be irradiated by X-rays and, since the carry-on items travel continuously on a conveyor belt, it allows continuous scanning and storage of the X-ray image of the whole bag. Because the irradiated area is a
narrow line, it is much easier to shield the X-rays than in the case of a single shot X-ray system, where the
whole object is illuminated. A typical line scanning setup is shown in Figure 4a. The main advantage of the
linear scan system is the high speed – up to 1,500 bags per hour. However, these systems do not discriminate
very well between different types of materials. The reason is that the image pixel values are related to the linear attenuation coefficient, which is not unique to any given material but is a function of the material composition, the photon energies interacting with the material, and the mass density of the material (see Eqn. 1). Consequently, the main drawback of single-source X-ray scanning is that, for a single-energy system, a thin, high-Z (atomic number) material will have the same attenuation as a thick, low-Z material (Singh & Singh, 2003). For example, black powder looks very similar to honey (Zentai, 2008), as shown in Figure 4b.
Objects that appear dark on the X-ray image would hide any object behind them. For example, objects made
of metal would appear very dark in the single high-energy source X-ray image, and any other (less dense)
material (e.g. explosive) behind it would remain invisible in the X-ray image. To overcome these problems, X-
ray systems that use two X-ray energy levels are used (i.e., dual-energy systems).


Figure 4. (a) Schematic of a luggage line scanner; (b) X-ray radiographic image showing the similarity between black powder (left) and honey (right).

Source: Zentai (2008)

Dual-energy X-ray imaging provides information about material composition and improved image contrast
(Rebuffel & Dinten, 2007; Chen et al., 2005). Several techniques for the collection of multi-energy images
exist, including (Martz & Glenn, 2019):
— consecutive data acquisition achieved by varying the input energy of the X-ray source
— dual sources that use two source-detector pairs covering different parts of the energy spectrum
— rapid source voltage switching
— dual-layer (sandwich) detectors, i.e. two stacked detectors with different spectral responses.
Using two energy sources, a low one and a high one, solves the problem of metal objects shielding lighter materials: high-density objects appear dark in both images, while lighter materials are darker only in the low-energy view, so the combination of the two values enables both objects to be discerned. Additionally, an ideal characterization would include high-spatial-resolution estimates of both mass density (ρ) and
atomic number (Z) of all constituent elements and compounds in the part under inspection. Dual-energy
systems are used to estimate the atomic numbers of the materials in baggage (Eilbert & Krug, 1993; Singh &
Singh, 2003). For a single energy system, a thin, high-Z material will have the same attenuation as a thick,
low-Z material. In a dual-energy system, however, the measurements obtained at different energies can
distinguish these two cases. The dual-energy method applied to a simple object yields an area density, which in turn gives a measure of density and material thickness by using a priori information relating atomic number and material density. The detection of illicit materials using dual-energy X-ray technology is therefore based
on chemical composition (atomic number) rather than just density variation, as in the case of single-energy X-
ray technology. The main limitation of the method is that the real density of objects is poorly known for real
baggage items and the system only generates an estimate of atomic number, i.e. effective atomic number
(Zeff) (Singh & Singh, 2003). The advantage of dual-energy measurements for aviation security is shown
conceptually in Figure 5, and the separation of different materials based on density and Zeff is illustrated in
Figure 6.
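A simplified sketch of such a Zeff estimate is shown below: it forms the ratio of low- and high-energy log-attenuations, which is largely thickness-independent under the simple model of Eqn. 1, and maps it to Zeff through a calibration curve. The calibration points here are hypothetical; real systems use vendor-specific calibrations on reference materials.

```python
import math

# Hypothetical calibration points mapping the log-attenuation ratio R
# to Zeff; a real scanner would be calibrated on reference materials.
CALIBRATION = [(1.1, 6.0), (1.4, 7.5), (1.9, 13.0), (3.0, 26.0)]  # (R, Zeff)

def zeff_estimate(i0, i_low, i_high):
    """Estimate the effective atomic number from one dual-energy pixel.

    Under Eqn. 1, ln(I0/I) = mu(E) * x, so the ratio of low- to
    high-energy log-attenuations cancels the thickness x and depends
    mainly on the material (roughly, on its atomic number).
    """
    r = math.log(i0 / i_low) / math.log(i0 / i_high)
    if r <= CALIBRATION[0][0]:
        return CALIBRATION[0][1]
    for (r1, z1), (r2, z2) in zip(CALIBRATION, CALIBRATION[1:]):
        if r <= r2:
            # Piecewise-linear interpolation between calibration points.
            return z1 + (z2 - z1) * (r - r1) / (r2 - r1)
    return CALIBRATION[-1][1]

# A pixel attenuating much more strongly at low energy maps to a higher Zeff.
print(zeff_estimate(i0=1.0, i_low=0.05, i_high=0.20))  # ~12.6
```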
Assistance to manual detection takes the form of enhancing the image using various image processing methods (Lu & Conners, 2006) or using pseudo-colours (Figure 7). Alarm-based assistance is provided by deploying automated detection methods using two different systems. The first is automated EDS implemented in dual-energy scanners, which classifies materials by their density and effective atomic number Zeff. The second is automated object detection based on computer vision and machine learning techniques. In both cases, the human screener sees highlighted areas in X-ray images that might contain prohibited items (Figure 7).


Figure 5. Notional plot showing threats (red) and non-threats (green). As the number of X-ray features increases, the detection rate increases and/or the false alarm rate decreases.

Source: Martz & Glenn (2019)

Figure 6. Zeff and density for commonly seen innocuous materials and for illicit materials.

Source: Eilbert & Krug (1993)

The separation of different materials can be shown directly on X-ray images using pseudo-colours associated with different atomic numbers (Z) (Abidi et al., 2006), hence showing different materials in different colours
(see Figure 7). The low-energy and high-energy images are fused with the help of a look-up table into a single
pseudo-colour image to facilitate the interpretation of the baggage contents. The look-up table is obtained
through calibration. In a typical pseudo-colour image, intensity is related to the thickness of the material while
colour (i.e. hue) encodes the material group; low-density organic materials are orange, high density non-
organic materials are blue, and medium density (or overlapping) mid-Z materials are green in colour (see
Figure 7).
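A minimal sketch of this fusion step is given below; the Zeff thresholds and RGB values are illustrative assumptions reflecting the orange/green/blue convention described above, not the calibrated look-up table of any particular scanner.

```python
def pseudo_colour(attenuation, zeff):
    """Fuse per-pixel attenuation and Zeff into an (R, G, B) pseudo-colour.

    Hue encodes the material group (orange = organic, green = mid-Z or
    overlapping, blue = inorganic), while brightness falls with
    attenuation so thicker or denser material appears darker.
    """
    if zeff < 10:        # low-Z organics (most explosives, food, paper)
        base = (255, 140, 0)   # orange
    elif zeff < 18:      # mid-Z or overlapping materials
        base = (0, 200, 0)     # green
    else:                # high-Z inorganics and metals
        base = (0, 80, 255)    # blue
    brightness = 1.0 - min(max(attenuation, 0.0), 1.0)
    return tuple(int(c * brightness) for c in base)

# A moderately attenuating organic pixel renders as a mid-dark orange.
print(pseudo_colour(attenuation=0.5, zeff=7.2))  # (127, 70, 0)
```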


Figure 7. Examples of different materials shown in different colours in pseudo-colour images from dual-energy X-ray scans: (a) EDS with an explosive, automatically detected based on density and effective atomic number, framed in red in two views; (b) automatic AI-based detection of a gun (left) and a knife (right).

Source: (a) Martz & Glenn (2019), (b) Gaus et al. (2019)

In newer-generation X-ray scanners, six colours are sometimes used to differentiate materials belonging to different Zeff ranges (see Table 2). Colour information helps screeners recognize objects more easily, and it can be used as an extra feature for automatic recognition systems to achieve higher recognition rates. This colour scheme was first introduced by the German manufacturer Heimann, then a division of Siemens, around 1988, and it is still used today (Marshall & Oxley, 2009).

Table 2. Material pseudo-colours and their classes, as widely used in X-ray security scanners.

Source: Benedykciuk et al. (2020)


The colour coding can be misleading, or simply less helpful, for an object composed of many materials (the absorption value is the average over all materials and their thicknesses along the radiation line) or when objects made of light materials are placed behind objects made of heavy materials. Still, most explosives fall into the organic range and appear orange in coloured dual-energy X-ray images, where they can be spotted by human screeners. This detection method is limited by superimposed objects. Additionally, false positives arise because explosives can have absorption coefficients similar to many common materials, e.g. fruitcake or cheese (Zentai, 2008; Singh & Singh, 2003).
With both single-energy and dual-energy 2D X-ray technology, target visibility, for both explosives and other
threats, still poses a problem for human screeners. Because of the transmission nature of X-ray imaging,
scans are the result of a 3-dimensional volume being collapsed into a 2-dimensional image. Prohibited items
in X-ray images are more difficult to recognize when depicted from unusual viewpoints, when superimposed
by other objects and when placed in visually complex bags (Bolfing et al., 2008; Hättenschwiler et al., 2019).
These are common scenarios with airport baggage, and in order to improve target visibility, multi-view X-ray
systems are currently the norm in aviation security (Figure 8).

Figure 8. (a) A single-view X-ray image of a bag containing a knife, and (b) a multi-view X-ray image of the same bag. The knife is hard to spot in the single-view scan due to high superposition and a difficult rotation, while it is easily visible in the second view of the multi-view scan, where the superposition is low.

Source: von Bastian et al. (2008)

Multi-view X-ray systems produce two or more images of scanned objects from different viewpoints (see
Figure 9). In airport security screening, this means that the screener's decision as to whether a passenger bag is threat-free is supported by multiple images of the same bag. Threat objects that are rotated in such a way that they become hardly recognizable, or that are superimposed by other objects in the bag, might be recognized more easily when an additional, orthogonal view is available (von Bastian et al., 2008).


Figure 9. Simplified sketch of a 4-view X-ray scanner. As the baggage goes through the tunnel (in z direction), 4 X-ray
generators scan one slice of the baggage (x − y plane) to generate two energy images for each of the 4 views.

Source: Baştan (2015)

Even better visibility is obtained by using computed tomography (CT) scanners, in which a three-dimensional image is reconstructed from multiple X-ray measurements taken at different angles. Such systems offer the possibility to rotate a bag image through 360 degrees to inspect an object from different angles and viewpoints, and to look through an object of interest by using a slice view, thereby reducing the need for baggage opening by security personnel (see Figure 10).

Figure 10. DETECT™ 1000 three-dimensional high spatial resolution image. A suspect threat object is highlighted in red.
Metallic and plastic objects are highlighted in blue and gray, respectively, using equipment from Integrated Defence and
Security Solutions Corporation.

Source: Martz & Glenn (2019)


Regarding material classification, single-energy CT (SECT) is only able to resolve ambiguities between broad classes such as solid materials (e.g. metals), plastics, fabrics, tissues and liquids; hence, dual-energy CT (DECT) and multi-energy CT are normally used. With DECT scanners, attenuation measurements for an object are acquired at two different tube voltages (usually 80 kVp and 140 kVp), resulting in two separate attenuation profiles. It has been shown that the attenuation coefficients of any material may be expressed as a linear combination of the coefficients of two basis materials, provided the two chosen materials are sufficiently different in their atomic numbers (Kalender et al., 1986; Mouton & Breckon, 2015); a minimal sketch of this decomposition is given below. Hence, the main advantage of DECT and multi-energy CT is the ability to distinguish between various materials. However, CT scanners have several downsides: they are generally slower than 2D screening due to rotating and slicing (Merks et al., 2018), and the images usually have substantial noise, metal-streaking artifacts and poorer voxel resolution, and are thus generally of poorer quality than medical CT imagery (Mouton & Breckon, 2015; Singh & Singh, 2003) (Figure 11). Hence, 3D scanners are typically used for explosive detection in hold baggage based on effective atomic number, and not yet widely for automated object detection.
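The sketch below illustrates that two-basis decomposition: given the attenuation coefficients measured at the two tube voltages, it solves a 2×2 linear system for the coefficients of two basis materials. The basis attenuation values are illustrative placeholders, not measured data.

```python
# Two-basis material decomposition for DECT (illustrative sketch).
# Model: mu(E) = a * mu_b1(E) + b * mu_b2(E) at both tube voltages,
# giving two linear equations in the two unknowns (a, b).

# Illustrative basis attenuation coefficients (1/cm) at the low and
# high tube voltages; real values come from calibration measurements.
MU_WATER = (0.25, 0.18)      # basis 1: water-like organic material
MU_ALUMINIUM = (1.10, 0.55)  # basis 2: aluminium-like inorganic material

def decompose(mu_low, mu_high, b1=MU_WATER, b2=MU_ALUMINIUM):
    """Solve the 2x2 system for the basis coefficients (a, b)."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    a = (mu_low * b2[1] - b2[0] * mu_high) / det
    b = (b1[0] * mu_high - mu_low * b1[1]) / det
    return a, b

# A voxel measuring mu = (0.50, 0.30) decomposes into ~0.91 parts water
# and ~0.25 parts aluminium; the (a, b) pair characterises the material
# independently of thickness, unlike a single attenuation value.
print(decompose(0.50, 0.30))
```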
On the other hand, dual-energy 3D CT scanners are becoming faster, more sophisticated and capable of higher image quality, and their downsides are outweighed by the increased throughput they enable at checkpoints. At the cabin baggage checkpoint, higher throughput is gained because passengers can keep their electronic devices and liquids in their bags, thanks to the superior material differentiation of DECT. Additionally, the easier visual inspection of 3D images helps reduce the need for manual inspection of both hold and cabin baggage, consequently reducing the time needed for bag inspection. Taking this into account, together with their superiority in visual detection of threats, they are rapidly becoming commonplace for both hold and cabin baggage screening in aviation security. While most automated threat detection methods currently target 2D imagery, research into 3D threat detection is intensifying, and there is a growing number of 3D object detection methods (see section 8.5).

Figure 11. Baggage-CT scans illustrating low image quality, low resolution, artefacts and clutter, obtained on a Reveal CT80-DR dual-energy baggage scanner.

Source: Mouton & Breckon (2015)

Another technique used for material classification and detection is X-ray diffraction (XRD) technology. This technology is based on X-ray diffraction patterns that are specific to different materials. When a microcrystalline substance is exposed to X-rays, they diffract from the specimen at precise intensities and angles, forming a specific diffraction pattern. The reason for this is the orderly, periodic arrangement of atoms in this type of material (see Figure 12).
Scanned materials can be identified by matching the diffraction pattern against a database of patterns characteristic of each material. XRD technology is used for the characterization of crystalline materials in different fields (Figure 12). In airport security, XRD is used for the detection of threat materials, as the majority of them are polycrystalline substances (Marticke, 2016). Many explosives give a distinctive XRD pattern (see Figure 13). Generally, in lab settings, XRD measurements are made with monochromatic 8 keV X-ray beams, and these XRD systems use angular dispersive X-ray diffraction (ADXRD). In this configuration, it is not possible to use XRD for baggage screening because 8 keV X-rays are absorbed by any material, even air, so they do not penetrate deep into a bag (Zentai, 2008). One solution is to use polychromatic X-rays combined with a spectroscopic detector, i.e. energy dispersive X-ray diffraction (EDXRD), which allows entire-object inspection in a reasonable time. Although XRD technology has a very high detection rate and a very low alarm rate, the scanning duration of a single bag is an issue; XRD is therefore usually used only to resolve alarms in a specific area of a bag that has previously been indicated as suspicious by EDS.
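As an illustration of the pattern-matching step, the sketch below scores a measured diffraction spectrum against a small reference library using normalised correlation; the reference spectra, grid and threshold are hypothetical stand-ins for a real pattern database.

```python
import math

# Hypothetical reference library: material name -> diffraction spectrum
# sampled on a common energy/angle grid. A real system would hold
# measured patterns for each material of interest.
LIBRARY = {
    "ammonium nitrate": [0.1, 0.9, 0.3, 0.05, 0.6],
    "TNT":              [0.7, 0.2, 0.1, 0.8, 0.15],
    "sugar":            [0.2, 0.3, 0.9, 0.1, 0.2],
}

def cosine_similarity(a, b):
    """Normalised correlation between two spectra on the same grid."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify(measured, threshold=0.95):
    """Return the best-matching material, or None below the threshold."""
    name, score = max(
        ((m, cosine_similarity(measured, ref)) for m, ref in LIBRARY.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else None

# A noisy measurement of ammonium nitrate still matches its reference.
print(identify([0.12, 0.85, 0.33, 0.07, 0.58]))
```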

Figure 12. Crystalline materials: the crystal system describes the shape of the unit cell (left), the lattice parameters describe the size of the unit cell (left), and the unit cell repeats in all dimensions to fill space and produce the macroscopic grains or crystals of the material (middle, right).

Source: Speakman (2015)

XRD technology, when combined with conventional EDS based on materials' atomic numbers, significantly
reduces the false alarm rate, avoiding manual bag checks. It targets a small bag area defined by an algorithm
analysing the CT volume. For a long time, only two commercial products existed on the market. One was made
by Heimann in 1999; it used pencil-beam geometry and, because of its limited object coverage, was used only
for false-alarm clearing. The second was introduced in 2000 by Yxlon as the XES3000, employing a cone-beam
scatter geometry for increased object coverage, allowing complete bag scans in about 60 s
(Kosciesza et al., 2013). The drawback of such a system is that it is very slow and expensive, since it needs a
special cooling system for the sensitive germanium detector, although EDXRD-based systems use a
semiconductor energy-resolving detector that is easier to operate at room temperature (Marticke, 2016).
Because of these drawbacks, XRD systems tend to be used in airports only as a third-tier system for hold
baggage, and only bags found to be suspicious by X-ray or CT scanners are scanned with it (Zentai, 2008).
However, there are advances in this field using novel focal construct technology, currently implemented only
for cabin baggage scanners (see section 2.4 on cabin baggage).

Figure 13. (a) Measured energy dispersive X-ray diffraction (EDXRD) spectra of pure explosives constituents: ammonium
nitrate, HMX (octogen); and of military and industrial explosives: TNT, ammongelite, Semtex and Seismoplast, and (b)
diffraction profiles of two explosives and typical baggage contents.

Source: (a) Strecker et al. (1994), (b) Luggar et al. (1997), as cited by Marticke (2016)


2.3 Hold baggage


Hold baggage comprises the bags that passengers check in with an airline before passing through security and
that are stored in the hold of the aircraft. Hold baggage may contain items that are not allowed in cabin
baggage due to concerns about their potential use in acts of unlawful interference. Passengers have no access
to hold baggage once in the airport security zone and during the flight. Hence, the dangerous items in hold
baggage are those that can explode and thereby sabotage the aircraft. It is against the law to carry the
following items in hold baggage: explosives, flammables, and incendiary materials and devices.
Hold baggage inspection changed after Pan Am flight 103 exploded over Lockerbie in Scotland on 21 December
1988 due to a bomb in a passenger bag transported in the hold of the aircraft (Strantz, 1990). Since then,
many terrorist attacks have targeted airplanes with explosive devices in hold baggage (Baum, 2016; Strantz,
1990; Wells & Bradley, 2012). This resulted in EDS based on 2D imaging for hold baggage screening.
Until the terrorist attacks of 11 September 2001 in the United States (9/11), only 5% of U.S. checked baggage
was actually screened. Since 9/11, 100% of hold baggage has been scanned for threat items (Wells &
Bradley, 2012). At the time, the throughput was only 150-200 bags per hour, with a false alarm rate of 30%,
causing severe delays in airports (Blalock, Kadiyali, & Simon, 2007). Novel technologies are constantly being
developed to increase checkpoint throughput and enable high detection rates of prohibited items.
Currently, as stated in European Commission Implementing Regulation 2015/1998 (European Commission,
2015a) hold baggage can be inspected by:
— X-ray equipment;
— EDS equipment;
— ETD equipment;
— a hand search;
— explosive detection dogs.

In Europe, the process flow of hold baggage screening is ultimately the decision of the airport and the Member
State authority. For logistical reasons, the first step is normally an X-ray based EDS system. Multi-level
screening is necessary for baggage screening operations in order to screen the majority of bags as fast as
possible. Multiple layers of screening – often comprising different technologies with complementary strengths
and weaknesses – are combined to create a single screening process. The detection performance of the overall
system depends on multiple factors, including the performance of individual layers, the complementarity of
different layers, and the decision rule(s) for determining how outputs from individual layers are combined. In
most multi-layered screening processes in airports around the world, only the bags that raise an alarm are
diverted from the main flow and re-screened in the next layer of the system. Optimising the system-level
performance of a multi-layered screening process requires (i) knowledge of the degree of correlation of
alarms between layers, and (ii) judicious selection of layers depending on the decision rule for combining
multiple screening results (Anderson, 2021).
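To make the effect of the decision rule concrete, the following minimal sketch computes system-level performance for a two-layer process in which a final alarm requires both layers to alarm. All numbers are hypothetical, and the calculation assumes statistical independence between layers, which is exactly why knowledge of inter-layer alarm correlation matters in practice.

```python
# Minimal sketch (hypothetical numbers) of a two-layer screening process in
# which only bags alarmed by layer 1 are re-screened by layer 2, and a final
# alarm requires both layers to alarm. Assumes the layers alarm independently.

def serial_system(dr1: float, far1: float, dr2: float, far2: float):
    """Detection rate and false alarm rate of the combined two-layer system."""
    system_dr = dr1 * dr2     # a threat must trigger both layers
    system_far = far1 * far2  # a benign bag must wrongly trigger both layers
    return system_dr, system_far

dr, far = serial_system(dr1=0.95, far1=0.20, dr2=0.90, far2=0.15)
print(f"system detection rate: {dr:.3f}, system false alarm rate: {far:.3f}")
# Detection drops to 0.855 while false alarms fall to 0.030: this decision
# rule trades a small loss in detection for a large reduction in alarms.
```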
EDS and screening protocols continue to adapt in response to risks posed by ever-evolving threats. According
to the European Commission Implementing Regulation No 2021/255 (European Commission, 2021b) the
following standards for EDS equipment exist in the EU (Oftring, 2015; Skorupski et al., 2018):
— equipment installed before 1 September 2014 must at least meet standard 2;
— equipment installed from 1 September 2014 to 31 August 2022 must at least meet standard 3;
— equipment installed from 1 September 2022 to 31 August 2026 must at least meet standard 3.1;
— equipment installed from 1 September 2026 must at least meet standard 3.2.

More stringent standards tend to involve higher detection rates, lower false alarm rates, better image quality,
lower detectable masses, and a wider range of materials to be detected. The first standard, Standard 1, expired
on 1 September 2012.


Currently, airports in Europe are working to implement Standard 3 for EDS in hold baggage, which is close to
U.S. TSA standards for baggage handling systems (Transportation Security Administration, 2015). The ability of
airports in the EU to complete the installation of Standard 3 EDS equipment was severely impacted by the
COVID-19 pandemic (European Commission, 2021b). Additionally, there is a large technological gap between
Standard 2 and Standard 3, changing from 2D X-ray scanners to 3D CT machines. This means that
manufacturers cannot simply upgrade existing scanners, but must produce new ones with the new technology
that can satisfy the throughput requirements of a modern busy airport. Finally, human screeners trained on
2D X-ray machines need to be retrained to use CT scanners. For example, in the case of 2D images, screeners
usually concentrate on trying to identify all the components of an explosive device at once: explosive mass,
cables, battery, and the shape of the explosive. In a 3D image, they need to rotate the image and first spot an
explosive mass, and then see whether they can find a cable or other elements of an explosive device around
that organic mass (Future airport, 2016).
In 2004, it was estimated that the false alarm rate for CT systems certified by the TSA was about 30%
(Harding, 2004); only a decade later it had been reduced to 15% (Hättenschwiler et al., 2019). DECT and
multi-energy CT scanners have better material classification capabilities and can detect smaller quantities
than 2D radiography and SECT. This is a valuable feature for separating harmless fluids (e.g. water, spirits and
liquors) from highly flammable (e.g. gasoline, hydrocarbon derivatives), explosive (e.g. hydrogen peroxide) or
aggressively acidic liquids and their compounds (Kehl et al., 2018). DECT is relatively limited in separating
various types of fluids compared to multi-energy CT (also referred to as spectral CT). Additionally, CT scanners
are catching up with 2D X-ray scanners in terms of throughput. The most recent hold baggage CT scanners
report speeds of 1,800 bags per hour with automated EDS using a dual CT scanner (Smiths Detection, 2020).
By contrast, traditional X-ray machines only cleared approximately 70% of all bags (due to false alarms),
lowering their effective capacity to around 1,000 bags per hour. As false alarm rates decrease, CT machines
clear a greater percentage of bags, resulting in an effective throughput of around 1,200 bags per hour, and
are hence more efficient than traditional 2D X-ray scanners.
In addition to explosive materials and devices, incendiary devices and flammable materials are also banned
from hold baggage. Among them, lithium batteries pose a significant risk since, in recent years, they have
become the primary power source for the majority of personal, portable electronic devices, with consumer
demand for these products growing annually. Given the short life cycle of these items, shipping time is critical,
and airfreight is clearly the fastest way to transport the devices. From January 2006 to September 2021, over
300 incidents of smoke, heat, fire or explosion involving lithium batteries in air cargo or hold baggage were
recorded (Federal Aviation Administration, 2021) (see also Figure 14). The risk that batteries (particularly
low-quality or even counterfeit batteries) ignite whilst airborne is a real threat to the aviation industry (Smiths
Detection, 2021b). The latest hold baggage scanners, in addition to EDS, feature AI algorithms that detect
lithium batteries using shape features (Smiths Detection, 2021b).

Figure 14. Incidents of smoke, heat, fire, or explosion involving lithium batteries in air cargo or hold baggage presented
per year.

Source: Smiths Detection (2021b)


2.4 Cabin baggage


Passengers store their carry-on bags in the cabin of airplanes. Because such cabin baggage can be accessed
during flight, objects that can be used for both hijacking and sabotaging the aircraft are prohibited.
Passengers are not permitted to carry the following articles into security restricted areas and on board an
aircraft: guns, firearms and other devices that discharge projectiles, stunning devices, sharp objects,
workmen’s tools, blunt instruments, explosives and IEDs, and incendiary substances and devices (European
Commission, 2015a).
As stated in European Commission Implementing Regulation 2015/1998 (European Commission, 2015a) cabin
baggage can be inspected by:
— X-ray equipment;
— EDS equipment;
— ETD equipment;
— a hand search;
— explosive detection dogs in combination with hand search.

In practice, cabin baggage screening usually consists of two steps (see Figure 15):
— primary bag search: screening the interior of baggage using an X-ray scanner (2D or 3D);
— secondary bag search: conducted using ETD equipment, bottled liquid scanners, chemical reaction test
strips, a hand search or, rarely, explosive detection dogs.

At the cabin-baggage checkpoint, passengers are usually asked to remove laptops and large electrical items
from their bags. The reason is that such objects, with their complex electronics, create clutter and can hamper
the visibility of other objects in a 2D X-ray image. This problem is solved by the 3D scanners that have recently
entered cabin-baggage screening, given recent advances in their speed, image quality, and detection
performance.
Among prohibited items, explosives are still considered the most dangerous articles in passenger baggage.
Commission Implementing Regulation (EU) 2015/187 (European Commission, 2015b) states that recent
evidence has shown that terrorists are trying to develop new concealments for IEDs designed to counter the
existing aviation security measures relating to cabin baggage screening. IEDs are composed of a power
source, a triggering device, a detonator, and an explosive charge, usually all connected by wires
(Huegli et al., 2020; Turner, 1994; Wells & Bradley, 2012). In cabin baggage screening, bare explosives also
pose a threat because these could be combined with other IED components after passing an airport security
checkpoint (Hättenschwiler et al., 2018; Huegli et al., 2020). Detecting bare explosives can be a challenge
even for well-trained screeners because they lack the other components of IEDs (power source, triggering
device, and detonator) and often look like a harmless organic mass (Zentai, 2008; Huegli et al., 2020; Jones,
2003). In recent years, X-ray based explosive detection systems for cabin baggage screening (EDS-CB) have
become available (Sterchi & Schwaninger, 2015). EDS-CB use dual-energy X-ray imaging to detect explosives
via estimation of the effective atomic number and material density (Singh & Singh, 2003).


Figure 15. Illustration of a cabin baggage screening security checkpoint with the four positions of an airport security
officer: bag loading, pat-down search of passengers, X-ray screening of passenger bags, secondary search of passenger
bags.

Source: Michel et al. (2014)

In Europe, standards for EDS-CB are laid out in Commission Implementing Regulation No 2015/1998
(European Commission, 2015a) and Commission Implementing Regulation No 2021/255 (European
Commission, 2021b):
— All EDS equipment designed to screen cabin baggage shall meet at least standard C1 where screening
requires the divestment of electronics and LAGs.
— All EDS equipment designed to screen cabin baggage containing portable computers and other large
electrical items shall meet at least standard C2.
— All EDS equipment designed to screen cabin baggage containing portable computers and other large
electrical items and LAGs shall meet at least standard C3.
— All EDS equipment designed to screen cabin baggage containing portable computers and other large
electrical items and LAGs shall meet at least standard C4 (expanded list of explosives compared to C3).

EDS equipment analyses the X-ray attenuation data for potential explosives before the X-ray image is
displayed to the X-ray screener. EDS-CB indicates potential explosive material by either marking an area on
the X-ray image of a passenger bag with a coloured rectangle or highlighting it in a special colour (Nabiev &
Palkina, 2017), as illustrated in Figure 16. Screeners then have to take follow-up actions to determine
whether the area indicated by the machine is harmless (i.e. a false alarm by the EDS-CB) or whether it
actually could be explosive material. EDS-CB systems with high detection rates (close to 90%) have false
alarm rates in the range of 15–20% (Hättenschwiler et al., 2018). However, certain multi-view EDS-CB with
updated detection algorithms now achieve detection rates above 80% with false alarm rates below 10% and,
therefore, an automation reliability d′ > 2.1 (see Eqn. 2 in Chapter 3 for the definition) (Huegli et al., 2020).


As will be described in more detail in Chapter 3, visually inspecting X-ray images for prohibited items is a
complex task for human screeners. Detection performance is influenced by various factors, including
elapsed time on task, target visibility, image display technology, and the screeners’ knowledge (Biggs &
Mitroff, 2015; Biggs et al., 2018; Buser et al., 2020; Huegli et al., 2020; Michel & Schwaninger, 2009;
Schwaninger, 2005). Hence, multi-view 2D scanners are currently the norm for cabin baggage inspection. As
stated in Huegli et al. (2020), and based on personal communication with airports and manufacturers in
February 2020, EDS-CB based on CT technology seem to be even more successful than multi-view
radiography, achieving even lower false alarm rates. CT scanners have been used for hold baggage inspection
for over two decades, and only recently have they been increasingly employed, including in European airports,
for cabin baggage screening (International Airport Review, 2021).

Figure 16. X-ray images from a multi-view EDS-CB machine. The same bag is shown from two viewpoints differing by
about 90 degrees. It contains an IED on which the EDS-CB has alarmed (shown by red rectangles).

Source: Huegli et al. (2020)

An example of the Smiths HI-SCAN 6040 CTiX scanner with EDS-CB functionality is shown in Figure 17.
Meeting EU standards C2 and C3, CT cabin baggage scanners allow much easier target detection and enable
laptops and other large electronic devices to remain in bags during screening. Additionally, the latest CT
scanners can detect explosives in the form of LAGs (European Civil Aviation Conference, 2021a, 2021b), and
preliminary tests have already been performed in several airports (e.g. Dubai, Schiphol, Heathrow)
(International Airport Review, 2021). In the U.S., as of September 2018, the Transportation Security
Administration (TSA) had deployed a limited number of CT units for testing, and planned to deploy at least
300 additional CT units in fiscal year 2020 for the primary screening of carry-on baggage (U.S. Government
Accountability Office, 2019).
Regarding XRD technology, longer-term integration is being explored in order to further reduce the false
alarm rate and minimize delays for passengers at the checkpoint. HALO X-ray Technologies Ltd has announced
a fast cabin baggage XRD system (HXT264) with a threat resolution speed of ~2 s in most cases (Halo X-Ray
Technologies, 2017). The HXT264 is based on focal construct technology developed by experts at Nottingham
Trent University and Cranfield University (Rogers & Evans, 2018). Focal construct technology expands the
interrogating beam to generate an annulus or ‘halo’, which increases diffraction signal strength by orders of
magnitude. The same company announced a CT/XRD scanner that uses the HXT264 as an additional module in
the CT scanner (Halo X-Ray Technologies, 2017). The development of these products has received U.S.
Homeland Security funding, and deployment at U.S. airports is planned (Cranfield University, 2019).
The most recent developments in AI-based object recognition have allowed X-ray airport scanners to be
equipped with automated prohibited item (PI) detection. The list of prohibited items is very long, but a limited
number of items, usually of prominent shape, can be detected using the latest X-ray scanners. In U.S. airports,
deployment of advanced technology (AT) X-ray scanners began in 2018 (U.S. Government Accountability
Office, 2019). Notably, the Smiths Detection iCMORE scanner is being deployed in the U.S. and Europe, and it
can detect guns, gun parts, ammunition and knives hidden in cabin baggage (Smiths Detection, 2021a, 2021c).
TSA is currently testing the combined EDS-CB and automated PI detection system in U.S. airports, with PI
detection testing performed on a limited number of items only (Soroosh, 2021).


Figure 17. CT for cabin baggage: an example of an EDS alarm from the Smiths HI-SCAN 6040 CTiX.

Source: ACI (2019)

2.5 Chapter summary


In this chapter, we reviewed the principles of X-ray screening of baggage. We explained how dual-energy X-
ray overcomes the problems of single-energy, namely that a thin, high-Z (atomic number) material would
have the same attenuation as a thick, low-Z material. We described the differences between single-view,
dual-view, multi-view, and reconstructed 3D images from computed tomography (CT) scanners.
The use of image processing and pseudo colours to assist human operators was introduced, and we identified
two different kinds of alarm-based assistance:
— using X-ray transmission data to infer density and effective atomic number of scanned items, and raising
an alarm if those properties match those of known threat materials;
— using computer vision and machine learning techniques to detect prohibited objects.
We briefly introduced another technique for material classification and detection, namely X-ray diffraction
(XRD) technology, which is based on X-ray diffraction patterns that are specific for different materials.
For hold baggage screening, we reviewed the EU legislative basis for screening hold baggage for prohibited
items, where the focus is primarily on fully assembled IEDs, but also on lithium batteries, which can
accidentally start a fire.
For cabin baggage screening, we reviewed the EU legislative basis for screening cabin baggage for prohibited
items. Bare explosives have to be detected without the additional cues of a fully-assembled IED (i.e.
detonator, switch, and power source) because, unlike hold baggage, cabin baggage remains accessible to a
perpetrator, who can manipulate its contents during the flight. We also mentioned operational aspects, such
as the need to remove laptops from cabin baggage prior to screening (a requirement that can be obviated
with the latest 3D scanners now being introduced at the passenger checkpoint).


3 Human-machine interaction

3.1 Introduction
Human-machine interaction in low-level automation systems is complex. The human screener
ultimately makes a decision that is influenced by different characteristics of the automated system’s outputs.
In this chapter, we review the challenges that might lead to sub-optimal human performance during the task
of detecting prohibited items in X-ray images. We also review the literature on how the performance of
screeners is influenced by the perceived strengths and weaknesses of automated systems. A clear conclusion
is that screeners give higher trust ratings to automatic systems when they receive a rationale for potential
automation failures than when they do not. This speaks in favour of explainable automation, i.e., automated
algorithms whose decision-making process can be understood and explained to humans.

3.2 Literature review


Humans are an important factor in security screening procedures. However, there are many challenges for
humans performing this task. Humans are not very good at repetitive tasks: the performance of an X-ray
screener can start to suffer after only 10 minutes and declines exponentially with increasing time on task
(Meuter & Lacherez, 2016) (see Figure 18). In 1988, unannounced Federal Aviation Administration (FAA)
testing of domestic screeners revealed a 22% failure rate among screeners in spotting weapons placed in
carry-on bags (U.S. Congress, Office of Technology Assessment, 1992). Poor performance was attributed to
lack of training, low wages, and attention fatigue (Marshall & Oxley, 2009). In a study of eighteen Brazilian
airports (Arcúrio et al., 2018), the researchers questioned 602 aviation security professionals to explore the
cognitive processes and perceptions related to their actions and decision-making while working in the security
screening process. Similarly to previously reported findings (Liang et al., 2010), the study found a high
frequency of human errors related to factors such as repetition, complacency and not following work
procedures. More recent data published by the Transportation Security Administration (2019, 2020) confirms
that thousands of loaded firearms are found in carry-on baggage every year. Hence, human screeners can be
effective when working conditions are managed appropriately.

Figure 18. Fitted quadratic trend for the generalized estimating equations model for accuracy (percent correct) of
detected threats in airport baggage as a function of time on task for high workload shifts. Performance declined
exponentially with increasing time on task.

Source: Meuter & Lacherez (2016)

Furthermore, the target prevalence of real explosives (IEDs and bare explosives) is extremely low, which
makes the task of the screeners more difficult. In X-ray screening at airports, this was first addressed by the
FAA in 1992 through its SPEARS program (Screener Proficiency Evaluation and Reporting System), in which
fictional threats were inserted using Threat Image Projection (TIP). TIP is a technology that projects pre-
recorded X-ray images of prohibited articles into X-ray images of real passenger bags being screened (Hofer
& Schwaninger, 2005; Meuter & Lacherez, 2016; Schwaninger et al., 2010; Skorupski & Uchroński, 2016). The
idea was to motivate operators by sporadically presenting bag images modified by superimposing stored
images of actual threats. Using this technology, the target prevalence of prohibited articles (guns, explosives,
knives, and other threat items) was artificially increased to about 2-4%. As stated in Marshall & Oxley (2009),
the SPEARS program gave good results, increasing screeners’ performance in terms of fewer missed threats
and false alarms. Therefore, from 2000 onward, the TSA has placed increased emphasis on TIP capability
being included in its X-ray equipment purchases (Marshall & Oxley, 2009).
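As an illustration of the projection step, the sketch below superimposes a stored threat patch onto a bag image. This is not a vendor algorithm: it assumes both inputs are normalised X-ray transmission images in [0, 1], so that, following the Beer-Lambert law, attenuation coefficients add along the beam and transmission values therefore multiply pixel-wise.

```python
import numpy as np

# Illustrative sketch of threat image projection (TIP), not a vendor algorithm.
# Assumption: both inputs are normalised X-ray *transmission* images in [0, 1]
# (1 = fully transparent), so superposition is a pixel-wise multiplication.

def project_threat(bag: np.ndarray, threat: np.ndarray,
                   top: int, left: int) -> np.ndarray:
    """Insert a pre-recorded threat patch into a bag image at (top, left)."""
    out = bag.copy()
    h, w = threat.shape
    out[top:top + h, left:left + w] *= threat  # transmissions multiply
    return out

bag = np.random.uniform(0.5, 1.0, size=(300, 400))   # dummy bag image
threat = np.random.uniform(0.2, 0.9, size=(60, 80))  # dummy threat patch
tip_image = project_threat(bag, threat, top=100, left=150)
```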
In the EU, the training of human screeners using TIP is defined in Commission Implementing Regulation
(EU) 2015/1998 (European Commission, 2015a). The 2012 report from the Commission to the European
Parliament and the Council (European Commission, 2013) concluded that, if adequately deployed, TIP has the
potential to positively impact screener performance. However, it also stated that, under the deployment
framework in place at the time, no difference in detecting prohibited articles could be established between TIP
and non-TIP airports.
Moreover, whereas the number of main categories of prohibited items is limited, each category includes many
different types of items. Airport security officers need to have mental representations of a large number of
prohibited items as well as of harmless everyday objects, which they should be able to activate at any time
(Chavaillaz et al., 2020; Jiang et al., 2004).
Target visibility poses another challenge for screeners. There are three reasons, described as image-based
factors (Schwaninger, 2005). First, the piece of baggage may be densely packed with many items, so that the
potential target is less easily spotted among many distractors (i.e., a high level of clutter). Second, a harmless
object may be superimposed on the target, making it more difficult to recognize. Third, the potential target
may be displayed from an unfamiliar point of view, e.g., a non-canonical view of a gun (Figure 19). The
material of which an item is made is another source of low visibility: certain materials are more visible than
others (Figure 19). In this case, as mentioned in Chapter 2, multi-view systems can improve screener
performance, because the additional view (e.g., a side view) provides information that facilitates recognition
when the primary view of an item is rotated or superimposed by other objects (Huegli et al., 2020). Given all
the difficulties humans face in baggage screening, and the literature suggesting that human detection
performance is about 80–90% (Michel et al., 2007), the inspection task is increasingly facilitated by machines.
With increased reliance on machines at airports, human-machine interaction scenarios play an
increasingly important role. Various taxonomies have been proposed to distinguish different levels of
automation (Huegli et al., 2020; Parasuraman et al., 2000; Vagia et al., 2016), with Endsley (1987)
distinguishing five levels of automation:
1. manual control with no assistance from the system
2. decision support by the operator with input in the form of recommendations provided by the system
3. consensual artificial intelligence (AI) by the system with the consent of the operator required to carry
out actions
4. monitored AI by the system to be automatically implemented unless vetoed by the operator
5. full automation with no operator interaction.
We will narrow this selection down to the three most common cases at airport checkpoints:
1. low-level automation (decision support by the human operator with input in the form of
recommendations provided by the system)
2. high-level automation (human operator checks only the bags that are alarmed by the automated
system)
3. full automation (with no operator interaction).
High-level automation systems in airports are currently deployed for hold baggage scanning, passenger
scanning, and face-recognition-based passport checking. During the flight, passengers cannot access items
stored in the hold of an aircraft, so guns or knives do not pose a threat there. Therefore, at hold baggage
screening checkpoints, high-level automation systems are deployed using CT scanners targeting fully
functioning IEDs with automated EDS (Bertz, 2002) (see section 2.2). Lately, automated detection of lithium
batteries has been incorporated in hold baggage CT scanners (Smiths Detection, 2021b) (see section 2.2).
These screening systems raise an alarm if prohibited items are detected in a bag, and only the alarmed bags
are further examined with a manual search and ETD. This approach is called alarm-only viewing.
Human-machine interaction during cabin baggage screening at checkpoints tends to involve low-level
automation systems. The automated detection system provides alerts, alarms, or warnings to support human
operators by cueing attention to areas of a display that might contain a target. These systems support
screeners by indicating areas in the X-ray image that might contain a target, usually by framing them with
red boxes (Figure 7). Thus, the task in cabin baggage screening consists mainly of visual search and
decision-making, while the task in hold baggage screening consists mainly of deciding whether an X-ray
image contains an IED or not (Koller et al., 2009; McCarley et al., 2004; Wolfe & Van Wert, 2010).

Figure 19. Image-based factors in X-ray screening: (a) bag complexity; (b) superposition; (c) high and low target visibility,
i.e., a metallic baseball bat in blue (left) with a 92% detection rate and a wooden baseball bat in orange (right) with a 15%
detection rate; and (d) viewpoint.


Source: (a) Schwaninger et al. (2010), (b) & (c) Biggs & Mitroff (2015), (d) Schwaninger et al. (2010)

Several studies have investigated the influence of the performance of low-level automation systems on
humans, and therefore on the overall performance of the human-machine detection system (Chavaillaz et al.,
2020; Huegli et al., 2020; Madhavan et al., 2004, 2006). In these studies, two main hypotheses were
investigated. One is the “cry wolf” hypothesis, where an automated system produces a high number of false
alarms (FA) and creates under-trust in automation, resulting in operators ignoring automation alerts. The
second hypothesis concerns the conspicuity of automation failures: screeners expect automation to perform
at near-perfect rates, which leads to a rapid decline in trust when automation makes errors.
It is not only high false alarm rates and imperfect automation that influence human operators’ decisions, but
also the degree to which targets are easy or hard to detect for automated threat detection algorithms. In a
series of studies on visual search, Madhavan et al. (2004, 2006) confirmed the expected results: when
automation had missed several easily detectable targets (easy-miss), participants trusted their own abilities
more than those of the automatic system. The opposite pattern was observed when misses were more
difficult to notice (difficult-miss). Similar results were obtained regarding trust and self-confidence when
automation generated only easily detectable false alarms (Madhavan et al., 2006).
Whether a target is considered easy or difficult to detect can be related to the term ‘cue plausibility’, i.e.,
how similar the cued object is to a prohibited item. This topic was researched by Chavaillaz et al.
(2020). In this study, three input parameters were varied:
— cue plausibility for FA (high/low)
— system reliability (high/low)
— whether a rationale about failure (RAF) was provided to screeners.
In the low-reliability condition, accuracy (the percentage of system recommendations that were correct) was
72%. In the high-reliability condition, accuracy was 91%.
The input variables were correlated with the following performance estimators:
— detection performance d′
— response bias (the tendency to respond positively or negatively)
— response time.
Detection performance corresponds to a participant’s ability to identify the presence (or absence) of a target,
and it is defined as follows:

d′ = z(HR) − z(FAR), Eqn. 2

where z is the inverse of the cumulative distribution function of the standard normal distribution. Hit rate, HR
(or true positive rate, TPR, or detection rate, DR), is defined as:

HR = true positives / (true positives + false negatives), Eqn. 3

and false alarm rate, FAR (or false positive rate), is defined as:

FAR = false positives / (false positives + true negatives), Eqn. 4
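As a minimal worked example of Eqns. 2-4, d′ can be computed directly from the inverse standard normal CDF available in SciPy; the values below are chosen to match the EDS-CB figures quoted in section 2.4.

```python
from scipy.stats import norm

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """d' = z(HR) - z(FAR), where z is the inverse standard normal CDF (Eqn. 2)."""
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# A hit rate of 80% with a false alarm rate of 10% gives d' of about 2.12,
# consistent with the d' > 2.1 automation reliability cited in section 2.4.
print(d_prime(0.80, 0.10))
```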

Screeners with implausible cues had lower detection performance and tended to respond that there was no
target. This agrees with Madhavan et al. (2006), where participants in the easy-miss condition indicated the
presence of a target more often than those in the difficult-miss condition. Screeners with plausible cues, i.e.,
with the cued object sharing features with a potential target or having a similar shape to a potential target
(e.g., a pen may be mistaken for a rotated knife), spent more time making a decision and made more false
alarms. Hence, in the evaluation of automated system performance, it is important to quantify the similarity
of automated FAs to true targets, since this has a strong influence not only on throughput, but also on the
human-machine system FAR.

28
X-ray baggage screening and AI, JRC Science for Policy Report, EUR 31123 EN, 2022.

Screeners working with highly reliable systems showed higher detection performance. A rationale about
failure (RAF) did not influence detection performance. However, RAF slowed down response times because
participants became more cautious during visual inspection, as they may have been more aware of the fact
that the automation could fail. Interestingly, when presenting automated systems to human operators,
attention should be paid to the way information about automation performance is framed, since it can affect
how the automation is used (Lacson et al., 2005). A positive framing (e.g., ‘the system makes about 80%
correct decisions, maximising hits and correct rejections’) reduced reliance compared to a negative framing
(‘the system fails 20% of the time, producing misses and false alarms’). A positive effect of human observers
knowing the rationale about failures is that it reduced workload under particularly difficult conditions, i.e., low
reliability combined with implausible cues (Chavaillaz et al., 2020). Dzindolet et al. (2003) showed that
participants gave higher trust ratings to the automatic system when they had received a rationale for
potential automation failures than when they had not. This speaks in favour of explainable automation, i.e.,
automated algorithms whose decision-making process can be understood and explained to humans (see
section 10.4 for more details on this topic).
In the study by Chavaillaz et al. (2020), the measures of human use of automation that were correlated with
the input variables were as follows (a minimal computation is sketched after this list):
— compliance measured screeners’ propensity to follow the automation’s recommendation when it indicated
the presence of a target; in other words, the number of positive responses given by the participant when the
automation provided a positive response, divided by the total number of positive responses provided by
the automation;
— reliance estimated to what extent participants acknowledged the automation’s recommendation when it
indicated the absence of a target (the number of negative responses given by the participant when the
automation provided a negative response, divided by the total number of negative responses provided by
the automation).
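The sketch below computes both measures from paired automation/participant responses; the response data are invented purely for illustration.

```python
# Hypothetical paired responses: True = "target present" was indicated.
automation  = [True, True, False, True, False, False, True, False]
participant = [True, False, False, True, False, True, True, False]

pairs = list(zip(automation, participant))
when_alarmed = [h for a, h in pairs if a]      # human responses after an alarm
when_cleared = [h for a, h in pairs if not a]  # human responses after no alarm

compliance = sum(when_alarmed) / len(when_alarmed)               # followed alarms
reliance = sum(not h for h in when_cleared) / len(when_cleared)  # followed clears
print(f"compliance: {compliance:.2f}, reliance: {reliance:.2f}")  # 0.75, 0.75
```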
Participants in the ‘plausible cue’ condition showed higher compliance rates than those in the implausible-cue
condition. This result makes sense: if a cued object looks like a prohibited item, the human is more likely to
accept an automation recommendation that turns out to be an FA, leading to higher compliance. Participants
displayed lower rates of both compliance and reliance under a low-reliability system.
The first study performed with professional airport security screeners using a low-level automation support
system for explosive detection was that of Huegli et al. (2020). The previously mentioned studies used
simulated targets, i.e., arrays of letters (Madhavan et al., 2006) or guns and knives (Chavaillaz et al., 2020),
because participants were not professionals and were not trained to visually inspect images for explosives. In
this study (Huegli et al., 2020), the input parameters that were varied were accuracy, automation reliability
(d′ in Eqn. 2), and positive predictive value (PPV): the probability of a target being present when the
automation alarms. These input variables were correlated with human–machine system performance and
operator compliance measures. Participants showed better human–machine system detection performance
under high reliability than under low reliability of the automated system. The benefits of automation depend
on unaided performance and automation reliability. When unaided performance was high, screeners’
confidence was high and automation provided only small benefits, and only with systems with high d′.
Interestingly, when the automation produced numerous false alarms (automation false alarm rate of 20%),
screeners rejected most of the false suggestions. This resulted in similar human–machine efficiency as with
systems with a low FA rate (FAR = 5%). An automation false alarm rate of 20% (with a hit rate of 86%)
resulted in operators ignoring about one-half of the true automation alarms on difficult targets, such as bare
explosives. It seems that operators complied more with automation when automation alarms were correct
than when they were false. These findings support current trends to minimize the FAR of automated systems,
not only because throughput increases, but also because screeners’ compliance with a high-PPV automated
system increases.
It is clear that human-machine interaction in low-level automation systems is complex. The human screener
ultimately makes a decision that is influenced by different characteristics of the automated system’s outputs.
Additionally, automated systems are becoming increasingly reliable and accurate due to novel AI methods
trained with large datasets on high-performance computing machines. For these reasons, high-level
automated systems for baggage screening, called alarm-only viewing, will probably be deployed more in the
future. In these systems, the human screener checks only the bags alarmed by the automated system, which
would significantly increase throughput and, hopefully, detection accuracy.
For both low-level and high-level automation systems, there is no standardised operational testing of
equipment, nor are there testing certificates based on standardised evaluation parameters. A computer
algorithm that never tires or gets bored, is standardised across all machines operating in all airports, and is
capable of making predictions quickly can provide significant assistance to human operators at existing
checkpoints, perhaps taking the first steps toward rethinking and fully automating security checkpoints.
The European Commission recently published the first legal framework on AI (European Commission, 2021d),
in which a rather broad definition of AI is used to cover all types of AI systems and a risk-based approach is
implemented. Aviation systems fall under high-risk AI systems and need to comply with certain requirements.
However, a concrete definition of some of the requirements still needs to be provided in the Annexes of the
legislative act (European Commission, 2021c) by experts in the relevant fields. Bearing in mind the importance
of security systems in airports, and the complex relationships between machines and human operators
presented in this chapter, it is of the highest importance to develop standards for evaluating automated
baggage screening systems based on AI analysis of images.

3.3 Chapter summary


In this chapter, we reviewed the human factors that might lead to sub-optimal human performance during the
task of detecting prohibited items in X-ray images. The difficulty of the task is compounded by impaired
target visibility (clutter and superposition) and low target prevalence. To address low target prevalence, threat
image projection (TIP) has gained importance among EU and US regulators in the last 20 years.
Furthermore, the performance of screeners in a human-machine system depends on a number of subtle
interactions, such as the screeners’ perception of the strengths and weaknesses of the automated detection.
One study showed that knowing the rationale about failures can help reduce screener workload under difficult
conditions, i.e. low reliability combined with implausible cues. Another study showed that participants gave
higher trust ratings to the automatic system when they had received a rationale for potential automation
failures than when they had not. This speaks in favour of explainable automation, i.e., automated algorithms
whose decision-making process can be understood and explained to humans.
It is clear that human-machine interaction in low-level automation systems is complex. The human screener
ultimately makes a decision that is influenced by different characteristics of the automated system’s outputs.
Additionally, automated systems are becoming increasingly reliable and accurate due to novel AI methods
trained with large datasets on high-performance computing machines. For these reasons, high-level
automated systems for baggage screening, called alarm-only viewing, will probably be deployed more in the
future. In these systems, the human screener checks only the bags alarmed by the automated system, which
would significantly increase throughput and, hopefully, detection accuracy.
The European Commission recently published the first legal framework on AI, in which a rather broad
definition of AI is used to cover all types of AI systems and a risk-based approach is implemented. Aviation
systems fall under high-risk AI systems and need to comply with certain requirements. However, at the time
of writing, there is no standardised operational testing of equipment, nor are there testing certificates based
on standardised evaluation parameters, for either low-level or high-level automation systems in X-ray
baggage screening.


4 Evaluating performance of automated methods

4.1 Introduction
In order to understand and compare the performance of different systems, it is important to define the
evaluation measures commonly used to assess algorithms for automated detection. Additionally, algorithms
can be evaluated on shared datasets, which improves the validity of the performance evaluation. Different
metrics describe different aspects of algorithm performance; hence, it is imperative to choose the appropriate
metrics to measure and optimise performance, depending on the application and the data available. In this
chapter, we explain the commonly used evaluation measures and the public datasets available in the field of
automated threat detection.

4.2 Overview of performance metrics


In the context of automated threat detection in airport security, “positive” examples are examples that contain
a threat, while “negative” examples contain no threat. An example can be a bag, the set of images of one bag,
one image of a bag, or a pixel or voxel (3D pixel) of an image. In the case of a pixel or voxel representing an
example, a positive example would be a pixel belonging to an image area that represents a threat. In
classification, true positives (TP) denote the number of correctly classified positive examples, true
negatives (TN) the number of correctly classified negative examples, false positives (FP) the number of
negative examples wrongly classified as positive, and false negatives (FN) the number of positive examples
wrongly classified as negative. These values are shown in a so-called confusion matrix (Figure 20), where the
values on the major diagonal represent the correct decisions made, and the numbers on the other diagonal
represent the errors, or the confusion, between the various classes.

Figure 20. Confusion matrix.

Source: Fawcett (2006)

One interesting observation, visible directly from the confusion matrix, is that any evaluation measure whose
definition contains values from both columns depends on the class distribution. Hence, such an evaluation
measure will be sensitive to class skew, i.e., to situations where samples from one class significantly
outnumber those of the other class.
One widely used evaluation measure is accuracy (ACC), or equivalently the error rate. Accuracy is defined
as the number of correctly predicted samples over the total number of predictions, which is mathematically
expressed as:

ACC = (TP + TN) / (TP + TN + FP + FN) = 1 − error rate, Eqn. 5

Accuracy is the simplest, most intuitive evaluation measure for automated systems or classifiers. We want
classifiers not to make mistakes, so we simply count the total number of hits (or, equivalently, the mistakes)
and divide it by the total number of test examples. However, accuracy may not be a useful measure in cases
where:
— there is a large class skew, e.g., is 98% accuracy good if 97% of the instances are negative? In aviation
security, negative examples are overwhelmingly dominant (see the sketch after this list);
— there are asymmetric misclassification costs, e.g. getting a positive instance wrong costs more than
getting a negative one wrong: in aviation security, a missed bomb in the luggage is more costly than a
false positive. More generally, if one classifier has a low true positive rate and a high false positive rate,
while another has a high true positive rate and a low false positive rate, the ‘accuracy’ of the two
classifiers might be the same and fail to reflect this difference.
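To make the class-skew point concrete, here is a minimal sketch: a degenerate classifier that never alarms reaches 97% accuracy on a test set in which 97% of the bags are benign, despite detecting nothing.

```python
# A classifier that never alarms still looks accurate on a skewed test set.
n_benign, n_threat = 970, 30  # 97% negatives, as in the example above
tp, fn = 0, n_threat          # every threat is missed
tn, fp = n_benign, 0          # every benign bag is cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)  # Eqn. 5
print(f"accuracy: {accuracy:.2%}")          # 97.00%, yet no threat is found
```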
Other measures, such as precision and recall, address some of the accuracy issues. Precision is defined as:

precision = TP / (TP + FP), Eqn. 6

and it assesses to what extent the examples classified as positive are indeed positive, i.e. how good the model
is at predicting the positive class. Recall is defined as:

recall = true positive rate = sensitivity = TP / (FN + TP), Eqn. 7

and it assesses to what extent all the examples that should be classified as positive were indeed classified as
such. Applied to automated threat detection, recall answers the question: of the threats present, how many
did you find? Precision answers the question: of the threats you found, how many are true threats? Precision
and recall are good measures for positive recognition. However, they do not describe the ability of the
classifier to deal with negative examples.
True positive rate (TPR) is defined as:

true positive rate = TP / (TP + FN), Eqn. 8

The false positive rate (FPR) is defined as the number of negative cases incorrectly identified as positive,
divided by the total number of negative cases:

false positive rate = FP / (FP + TN) = 1 − specificity, Eqn. 9

In other words, the FPR represents the proportion of negative cases incorrectly identified as positive, i.e. the
probability that a false alarm will be raised. It is important to mention that TPR is the same as the hit rate,
HR, defined in Eqn. 3 (also referred to as the detection rate, DR), and that FPR is the same as FAR (defined in
Eqn. 4). TPR and FPR terminology is generally used in machine learning, while DR and FAR are more common
in aviation security applications. In this report, we will use TPR and FPR.
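The following sketch, with arbitrary counts, collects Eqns. 5-9 in one place:

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Evaluation measures of Eqns. 5-9 from raw confusion-matrix counts."""
    return {
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),  # Eqn. 5
        "precision": tp / (tp + fp),                   # Eqn. 6
        "recall":    tp / (tp + fn),                   # Eqn. 7 (= TPR, Eqn. 8)
        "fpr":       fp / (fp + tn),                   # Eqn. 9 (= FAR)
    }

print(confusion_metrics(tp=80, tn=900, fp=20, fn=10))
```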


Precision and recall typically trade off against each other; likewise, a higher TPR generally comes at the cost
of a higher FPR. These relationships are represented by two curves: the precision-recall curve (PR curve) and
the receiver operating characteristic curve (ROC curve), illustrated in Figure 21.

Figure 21. ROC curves and precision-recall curves of two classifiers applied on dataset with different ratios of positives
and negatives: (a) ROC curves, positives: negatives = 1:1; (b) precision-recall curves, positives: negatives = 1:1; (c) ROC
curves, positives: negatives = 1:10 and (d) precision-recall curves, positives: negatives = 1:10.


Source: Fawcett (2006)

ROC graphs are two-dimensional graphs in which TPR is plotted on the Y-axis and FPR on the X-axis. An ROC
graph depicts the relative trade-off between benefits (true positives) and costs (false positives): it shows at
what cost (FPR) a satisfactory benefit (TPR) can be achieved. Similarly, in PR graphs, precision is plotted on
the Y-axis and recall on the X-axis.
The ROC and PR graphs look different depending on whether a discrete or a probabilistic classifier is used. A
discrete classifier outputs only a class label, e.g. 1 or 0, so its ROC and PR graphs consist of a single point. A
probabilistic classifier produces scores that can be strict probabilities, in which case they adhere to the
standard theorems of probability, or general, uncalibrated scores, in which case the only property that holds is
that a higher score indicates a higher probability. We shall call both probabilistic classifiers, despite the fact
that the output may not be a proper probability. Such a ranking or scoring classifier can be used with a
threshold to produce a discrete (binary) classifier: if the classifier output is above the threshold, the classifier
produces, for example, a 1, otherwise a 0. Each threshold value produces a different point in ROC/PR space, as
in Figure 21. In an ideal case with perfectly correct classification, the ROC curve would have TPR = 1 at
FPR = 0 and, for PR curves, precision = 1 at recall = 1. In the less-than-perfect example in Figure 21c,
instsx10.roc represents a better classifier than insts2x10.roc.
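A minimal sketch with synthetic scores and labels shows how sweeping the threshold over a scoring classifier's outputs traces one ROC point per threshold:

```python
import numpy as np

# Synthetic scores and ground-truth labels (1 = positive, i.e. threat).
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.4])
labels = np.array([1,   1,   0,   1,   1,    1,    0,    0,    1,    0])

# Each candidate threshold yields one (FPR, TPR) point in ROC space.
for threshold in np.unique(scores)[::-1]:
    predicted = scores >= threshold
    tpr = (predicted & (labels == 1)).sum() / (labels == 1).sum()
    fpr = (predicted & (labels == 0)).sum() / (labels == 0).sum()
    print(f"threshold {threshold:.2f}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```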


An important point about ROC graphs is that they measure the ability of a classifier to produce good relative
instance scores. A classifier does not need to produce accurate, calibrated probability estimates; it only needs
to produce relatively accurate scores that serve to discriminate positive from negative instances. For example,
suppose that on 10 examples a classifier produces scores in the range [0, 1], as shown in Figure 22, and
based on them an ROC plot is created. The true class of each example is given by the variable true = p or
true = n, while the hypothesized class (Hyp) is the class assigned to each example by the classifier. Assume a
threshold of 0.5 is applied to the scores, i.e. if score > 0.5 then Hyp = Y, otherwise Hyp = N, as in Figure 22. In
this case, instances 7 and 8 are misclassified, yielding 80% accuracy, which is seemingly in contradiction with
the perfect ROC curve of this classifier. The explanation lies in what each is measuring. The ROC curve shows
the ability of the classifier to rank the positive instances relative to the negative instances, and it is indeed
perfect in this ability. The accuracy metric imposes a threshold (score > 0.5) and measures the resulting
classifications against the true classes. The accuracy measure would be appropriate if the scores were proper
probabilities with balanced distributions, but they are not. Another way of saying this is that the scores are
not properly calibrated, as true probabilities would be. In ROC space, the imposition of a 0.5 threshold results
in the performance designated by the circled ‘‘accuracy point’’ in Figure 22. This operating point is
sub-optimal; the optimal one is [0, 1]. One way to eliminate this phenomenon is to calibrate the classifier
scores, for which several methods exist (Zadrozny & Elkan, 2001). Another approach is to use an ROC method
that chooses operating points based on their relative performance, and there are methods for doing this as
well (Fawcett, 2006; Provost & Fawcett, 2001). These latter methods are used more in the literature.

Figure 22. Scores and classifications of 10 instances, and the resulting ROC curve.

Source: Fawcett (2006)

Additionally, ROC curves are often used for visualising classifier performance since they have the property of
being insensitive to changes in class distribution. If the proportion of positive to negative instances changes in
a test set, the ROC curves will not change, whereas the PR curve differs markedly depending on this
proportion (see Figure 21).
ROC curves are used for visual inspection of classifier performance, but in order to quantify it, we want to
reduce ROC performance to a single scalar value. A common method is to calculate the area under the curve
(AUC), i.e. the area under the ROC curve (see Figure 23). Since the AUC is a portion of the area of the unit
square, its value is always between 0 and 1. Because random guessing produces the diagonal line between
[0, 0] and [1, 1], with an area of 0.5, no realistic classifier has an AUC < 0.5. The AUC has an important
statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a
randomly chosen positive instance higher than a randomly chosen negative instance. This is equivalent to the
Wilcoxon test of ranks (Hanley & McNeil, 1982), often used in the literature to compare two paired groups of
samples.
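This rank interpretation can be verified numerically: the sketch below estimates the AUC as the fraction of positive-negative score pairs ranked correctly (ties counted as half) and compares it with scikit-learn's roc_auc_score, assuming scikit-learn is available.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)                   # random ground truth
scores = labels * 0.5 + rng.normal(0.0, 0.5, size=200)  # informative scores

pos, neg = scores[labels == 1], scores[labels == 0]
# Probability that a random positive outscores a random negative = AUC.
rank_prob = ((pos[:, None] > neg[None, :]).mean()
             + 0.5 * (pos[:, None] == neg[None, :]).mean())
print(rank_prob, roc_auc_score(labels, scores))  # the two values coincide
```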
Reviewing both precision and recall is useful in cases where there is an imbalance in the observations
between the two classes, specifically when there are many negative examples (no threat objects in the bag)
and only a few examples of a positive event (a threat object in the bag). The reason is that the large number
of negative examples typically means we are less interested in the skill of the model at predicting the
negative class correctly, i.e. achieving a high number of true negatives. This is the situation in aviation
security; hence, many algorithms for automated threat detection report PR curves and related values to
describe their success. Key to the calculation of precision and recall is that the calculations do not make use
of the true negatives, i.e. we are only concerned with the correct prediction of the minority, positive class.
However, in aviation security, as in other application fields, many mistakes in predicting the negative class are
also undesirable, since a high FAR slows down the workflow that is crucial for airport operations.

Figure 23. ROC graphs of two classifiers, A and B, with the AUC marked for both: AUC_B > AUC_A.

Source: Fawcett (2006)

When comparing the two classifiers’ PR curves in Figure 21, the more successful one has a more convex
shape, with its tip closer to the point [1, 1]; e.g. classifier insts.precall shows better performance with respect
to precision and recall than classifier insts2.precall (see Figure 21b). As with the AUC for ROC curves, we want
to reduce PR curve performance to a single value. For that purpose, a measure has been derived that is very
popular in the automated threat detection field. The mean average precision (mAP) is the mean of the
average precision across classes. Average precision can be calculated as the area under the PR curve:

Average Precision = ∫_{0}^{1} p(r) dr, Eqn. 10

where p(r) is the precision at recall = r. The mean average precision is defined as:

mAP = ( Σ_{i=1}^{C} average precision(i) ) / C, Eqn. 11

where C is the number of classes.
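Under these definitions, per-class average precision can be computed with scikit-learn and averaged across classes, as sketched below with synthetic labels and scores; note that object detection benchmarks often use AP variants that additionally impose an IoU threshold (see the IoU metric below).

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
n_samples, n_classes = 200, 3
y_true = rng.integers(0, 2, size=(n_samples, n_classes))     # per-class labels
y_score = y_true * 0.4 + rng.random((n_samples, n_classes))  # per-class scores

ap_per_class = [average_precision_score(y_true[:, c], y_score[:, c])
                for c in range(n_classes)]
mAP = float(np.mean(ap_per_class))  # Eqn. 11
print(ap_per_class, mAP)
```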


Another composite score derived from precision and recall values is the F-measure, defined as the harmonic
mean of precision and recall:

F = 2 · precision · recall / (precision + recall), Eqn. 12

The harmonic mean is used instead of the arithmetic mean because precision and recall are ratios (Bekkar &
Djemaa, 2013). The value of F increases with both precision and recall; a high F-measure indicates that the
model performs well on the positive class.
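A one-line check illustrates why the harmonic mean is preferred: it penalises an imbalance between precision and recall, whereas the arithmetic mean does not (illustrative values):

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (Eqn. 12)."""
    return 2 * precision * recall / (precision + recall)

# Precision 0.95 with recall 0.10: the arithmetic mean is a respectable-looking
# 0.525, but the F-measure of about 0.18 exposes the poor recall.
print(f_measure(0.95, 0.10))
```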
The typical evaluation metric for object detection, or semantic segmentation, is intersection over union
(IoU). It measures the overlap between two regions or bounding boxes: the ground truth (A) and the
automated detection to be evaluated (B). It is defined as:

A⋂B
IoU = � � Eqn. 13
A⋃B
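
A minimal sketch of Eqn. 13 for axis-aligned bounding boxes given as (x1, y1, x2, y2) tuples; the coordinates
are illustrative.

def box_iou(a, b):
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25/175 ≈ 0.143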

Finally, another evaluation measure that is reported in the literature is the root-mean-square deviation (RMS
error), defined as:

RMS error = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2}        (Eqn. 14)

where \hat{y}_n is the classifier prediction, y_n is the ground truth (the real value) and N is the number of
samples.
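
A minimal sketch of Eqn. 14, with illustrative predictions and ground-truth values:

import numpy as np

y_hat = np.array([0.9, 0.2, 0.7, 0.4])    # classifier predictions
y = np.array([1.0, 0.0, 1.0, 0.0])        # ground truth
rms_error = np.sqrt(np.mean((y_hat - y) ** 2))
print(rms_error)                           # ≈ 0.27 for these values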

4.3 Chapter summary


In this chapter we reviewed some of the common performance metrics that are used to describe and optimise
threat detection algorithms:
— accuracy
— precision
— recall
— true positive rate (detection rate)
— false positive rate (false alarm rate)
— receiver operating characteristic (ROC) curve
— area under curve (AUC)
— mean average precision (mAP)
— F-measure (harmonic mean of precision and recall)
— intersection over union (IoU)
— root-mean-square deviation (RMS error)

We mentioned some factors that have to be taken into account when choosing a metric, such as potential
class skews, i.e., when samples from one class significantly outnumber those of the other class, or
asymmetric misclassification costs, i.e., when the consequences of false negatives are much more costly than
those of false positives.


5 Datasets of X-ray images

5.1 Introduction
The quality of datasets and their availability are of crucial importance for the development and evaluation
of threat detection algorithms. In this chapter we list, to the best of our knowledge, the public and non-public
datasets, most of them listed in Akçay and Breckon (2020), and comment on their availability and quality.

5.2 Overview of X-ray image libraries


The Durham baggage patch/full image dataset (DBP) comprises 15,449 X-ray samples with associated
false-colour materials mapping from dual-energy scans. The samples have the following class distribution:
494 camera, 1,596 ceramic knife, 3,208 knife, 3,192 firearms, 1,203 firearm parts, 2,390 laptop and 3,366
benign images. Several variants of this dataset were constructed for classification (Dbp2, Dbp3, Dbp6) (Akçay,
Kundegorski et al., 2016; Akçay et al., 2018; Kundegorski et al., 2016) and detection (Dbf2, Dbf3, Dbf6)
(Akçay & Breckon, 2017; Akçay et al., 2018).
The Durham baggage anomaly dataset (DBA) (Akçay et al., 2019a) comprises 230,275 dual-energy X-ray
security image patches extracted via a 64 x 64 overlapping sliding window approach. The dataset contains
three abnormal sub-classes: knife (63,496), gun (45,855) and gun component (13,452). The normal class
comprises 107,472 benign X-ray patches, split via an 80:20 train-test ratio. The DBA dataset is used in (Akçay
et al., 2019a, 2019b) for unsupervised anomaly detection.
The full firearm vs operational benign (FFOB) dataset (Centre for Applied Science and Technology, 2016)
contains samples from the United Kingdom government evaluation dataset, comprising both expertly
concealed firearm (threat) items and operational benign (non-threat) imagery from commercial X-ray security
screening operations (baggage/parcels). Denoted as FFOB, this dataset comprises 4,680 firearm full-weapon
images as the abnormal class and 67,672 operational benign images as the normal class. Access to this
dataset is restricted.
The Grima X-ray dataset (GDXray) (Mery et al., 2015) comprises 19,407 mono-energy X-ray image
samples from five subsets: castings (2,727), welds (88), baggage (8,150), natural images (8,290), and
settings (152). The baggage subset is mainly used for security applications and comprises images from
multiple views. The limitation of this dataset is its non-complex content, which makes it non-ideal for training
towards real deployment.
The SIXray dataset collected and released by Miao et al. (2019) comprises 1,059,231 X-ray images, 8,929
of which are manually annotated for six classes: gun, knife, wrench, pliers, scissors and hammer, plus
background. The dataset consists of objects with a wide variety in scale, viewpoint and, especially,
overlapping, and was first studied in Miao et al. (2019) for classification and localization problems. The
images were collected using a Nuctech dual-energy X-ray scanner, where the distribution of the general
baggage/parcel items corresponds to stream-of-commerce occurrence.
The Compass-XP dataset (Caldwell & Griffin, 2020) was collected using 501 objects from 369 object classes
that are a subset of the ImageNet classes. The dataset includes 1,901 image pairs, such that each pair has an
X-ray image scanned with a Gilardoni FEP ME 536 and its photographic version taken with a Sony DSC-W800
digital camera. In addition, each X-ray image has its low-energy, high-energy, material density, grey-scale
(combination of low and high energy) and pseudo-coloured RGB versions.
The ALERT dataset has three active CT datasets (Crawford et al., 2011, 2013; Crawford, 2014), and one with
RGB images and videos of passengers (Karanam et al., 2018). All datasets are available free of charge to
academic and government research communities. The datasets have been developed in several initiatives
supported by the Department of Homeland Security (DHS): ALERT's Computerized Tomography (CT)
Segmentation Initiative (Crawford et al., 2011), Reconstruction Initiative (Crawford et al., 2013), Automated
Threat Recognition (ATR) Initiative (Crawford, 2014), and ALERT Video Analytics Re-Identification work
(Karanam et al., 2018). Within the CT Segmentation Initiative, representative datasets of packed luggage and
reference objects were obtained: approximately 900 objects were placed in luggage and scanned to produce
62 luggage datasets spanning the spectrum of packing, density, arrangement, orientation, and size difficulty
(ALERT, 2011). ALERT developed automated target recognition (ATR) algorithms for CT-based explosive
detection systems (EDS) (Crawford, 2014). The research groups were provided with images from scans of
targets packed in bags and scanned on a medical CT scanner. The targets were chosen to create detection
scenarios similar to what vendors face when detecting explosives using their CT-based EDS.
Finally, cargo object detection is not a topic of this report, but the UCL TIP dataset (Rogers et al., 2016) is
worth mentioning. It contains images of cargo containers and comprises 120,000 benign images, each of
which is 16-bit grey-scale with sizes varying between 1,920 x 850 and 2,570 x 850. The train-test split of
the dataset is 110,000:10,000, where the training images are 256 x 256 patches randomly sub-sampled
from the 110,000 images and the test set comprises 5,000 benign and 5,000 threat images. The threat
images are synthetically generated via the TIP algorithm proposed in Rogers et al. (2016), where, depending
on the application, images of threats are projected into the benign samples.
Unfortunately, the availability of public databases that can be used for baggage inspection is very limited.
While some areas of computer vision (e.g., face recognition) have had hundreds of databases since the
1990s, in baggage inspection there is only one public 3D dataset (ALERT) and only two public 2D datasets
mentioned above: GDXray and SIXray. The first consists of mono-energy X-ray images captured in controlled
scenarios using only one X-ray system (in an X-ray lab), with images of simple, uncluttered bags, whereas the
second has dual-energy X-ray images captured in more challenging, real-world scenarios from many different
X-ray systems, but with far fewer prohibited items than benign samples (see Table 3). For these reasons,
results are typically reported on only one of the datasets and cannot be fairly compared. Additionally, those
datasets are highly biased towards certain classes, limiting the training of reliable supervised methods.
The rest of the datasets used in the experiments reported by industry and academia are private. In many
cases, the entities (industry, government, or academia) that fund research in X-ray testing do not allow
databases to be made public. This happens for security reasons, or to prevent competitors from having access
to data that could improve their processes. Unfortunately, it is not possible to use private datasets to make
comparisons and analyses of different computer vision algorithms (Mery et al., 2020). It would be greatly
beneficial for the scientific community in general, and essential for the improvement and development of
efficient automated methods, to build large, realistic datasets that can be shared amongst bona fide security
practitioners, with homogeneous subsets, using different scanners, collected either by manual scanning and
labelling, or by generating synthetic datasets (or both).

Table 3. Public datasets for baggage inspection.

dataset                       no. images   mono   dual   classes

GDXray (Mery et al., 2015)    8,150        yes    no     razor blade, shuriken, handgun, knife, spring, clip

SIXray (Miao et al., 2019)    1,059,231    yes    yes    gun, knife, wrench, plier, scissor, hammer

Source: Mery et al. (2020)

5.3 Chapter summary


Compared to the task of general object detection, or sector-specific applications like medical analysis, there is
a notable lack of available, labelled X-ray datasets for security applications. Equipment manufacturers, third-
party algorithm developers, regulators, and the testing community would all benefit from the availability of
shared, standardised libraries of X-ray security images. Such libraries will, however, require significant and
specialised resources to develop and maintain.


6 Image enhancement and threat detection

6.1 Introduction
As mentioned in Chapter 2, the software of X-ray based scanners assists human operators in two ways:
firstly, by enhancing the X-ray/CT image the human operators need to visually inspect, and secondly, by
providing automated detection of objects in the image. State-of-the-art algorithms typically focus on image
enhancement (Abidi et al., 2006; Movafeghi et al., 2020; Dmitruk et al., 2017; Lu & Conners, 2006; Chen et
al., 2005) and image understanding.
One of the most used image enhancement methods is image de-noising followed by pseudo-colouring. Image
de-noising is achieved by using histograms (Chen et al., 2005), edge detection and Gaussian blurring (Dmitruk
et al., 2017), or Gabor filters (Movafeghi et al., 2020). Initial attempts (Chen et al., 2005) fused low- and
high-energy X-ray images and applied background subtraction for noise reduction. For adaptive image
enhancement, a multi-layer perceptron has been used, where the model predicts the best enhancement
technique based on the input and enhanced output images (Singh & Singh, 2005). To improve low-density
images, the Radon transform has been used for threshold selection to declutter the region of interest (ROI) in
complex X-ray scans (Liang, 2004). Standard pseudo-colouring schemes are used for single-energy images,
and have been shown to improve human screeners' detection performance and alertness compared to grey-
scale images (Abidi et al., 2005). Threat detection performance was further improved via a new colour coding
scheme that calibrates the estimation of the effective atomic number (Zeff) and density information (Chan et
al., 2010). In the rest of this section, we will focus on state-of-the-art image understanding methods.
Among image understanding methods, we can observe two areas of implementation: the object recognition
and the object detection task. In object recognition, test images are already cropped so that they contain only
the object of interest (Baştan et al., 2011; Schmidt-Hackenberg et al., 2012; Turcsany et al., 2013). In object
detection, objects of interest need to be located in the baggage image and classified as one of the predefined
types of objects (Franzel et al., 2012; Mery et al., 2013). See Figure 24 for an example of object recognition,
and Figure 25 for an example of object detection. Object recognition and object detection are discussed in
detail below.

Figure 24 Examples of positive (left) and negative (right) data for object recognition.

Source: Turcsany et al. (2013)

Early work within the field focused more on image processing approaches. Simplistic pixel-based
segmentations with a fixed absolute threshold and region grouping are presented in (Paranjape et al., 1998;
Sluser & Paranjape, 1999). Subsequent work, on the other hand, focused more on pre-segmentation via
nearest neighbour, overlapping background removal and final classification (Ding et al., 2006; Singh & Singh,
2004). Lu and Conners (2006) used a smoothing filter (Nagao & Matsuyama, 1979) for de-noising and a
region growing approach to segment explosives. One of the earliest works on object detection in multi-view
X-ray baggage scans was done by Mery (2011), detecting razors, blades and similar objects by using the
Scale Invariant Feature Transform (SIFT) descriptor (Lowe, 2004), which is invariant to translation, scale and
rotation and robust to image noise. The method detects objects of interest by comparing a single SIFT
descriptor of a reference object to SIFT descriptors of pre-segmented proposal regions in the image.
Detections from different views are combined by tracking sparse SIFT features across images, achieving a
TPR of 94% and an FPR of 6%. In Mery (2011), however, objects are detected in simple bag compositions,
with little clutter to occlude objects of interest.

Figure 25 Example of object detection task: (a) a detected gun framed, and (b) bag without threats.

Source: Baştan et al. (2011) and Baştan (2015)

Instead of using shape information and methods developed for visible-spectrum images, some approaches
utilise the chemical (attenuation) properties of X-ray scans. Heitz and Chechik (2010) proposed a method for
separating objects in a set of X-ray images using the property of additivity in log space, whereby the log-
attenuation at a pixel is the sum of the log-attenuations of all objects that the corresponding X-ray passes
through. The method achieved promising results, a 23% RMS error (Eqn. 14) reduction, when compared to two
baseline methods usually used on natural images, namely the segmentation method of Felzenszwalb and
Huttenlocher (2004) and X-means clustering (Pelleg & Moore, 2000). However, the method was designed to
handle images with a small number of objects, and is therefore limited to low-clutter luggage.
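
The additivity property that this method relies on can be illustrated in a few lines of NumPy; the transmission
values below are illustrative, not from the cited study.

import numpy as np

transmission_obj1 = np.array([[0.80, 0.90], [0.70, 0.95]])  # fraction of X-rays passed
transmission_obj2 = np.array([[0.60, 0.85], [0.90, 0.80]])

# The measured image where both objects overlap in the beam path is
# multiplicative (Beer-Lambert law) ...
measured = transmission_obj1 * transmission_obj2

# ... so in log space the objects' contributions separate into a sum.
log_sum = np.log(transmission_obj1) + np.log(transmission_obj2)
assert np.allclose(np.log(measured), log_sum)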

6.2 Bag of visual words


Prior to the dominance of deep learning within the field, and despite the lack of texture in X-ray images, the
bag of visual words (BoVW) approach (Csurka et al., 2004; Sivic & Zisserman, 2003), which performs well with
texture-rich images, was prevalent. BoVW is a method where local features obtained from an image set are
clustered into a finite number of clusters. Centroids of clusters form a dictionary, which is used to encode
features of images in a vector-quantized representation. The clusters' centroids are called visual words and
the bag-of-words model represents an image by its histogram over these visual words. It usually consists of
five steps: feature detection, feature description, dictionary construction, bag of words (BoW) computation
(weighted histograms of visual words), and classification (Baştan et al., 2011; Chatfield et al., 2011; Csurka et
al., 2004).
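
The five steps can be sketched with OpenCV and scikit-learn (both assumed available); the file names, labels
and dictionary size below are illustrative, and the sketch is not the exact pipeline of any study cited here.

import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

sift = cv2.SIFT_create()                       # steps 1-2: detection + description

def descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

train_paths, train_labels = ["bag1.png", "bag2.png"], [1, 0]   # 1 = threat
kmeans = KMeans(n_clusters=100).fit(                     # step 3: dictionary of
    np.vstack([descriptors(p) for p in train_paths]))    # 100 visual words

def bovw_histogram(path):
    words = kmeans.predict(descriptors(path))            # step 4: quantize and
    hist = np.bincount(words, minlength=100).astype(float)  # build a histogram
    return hist / max(hist.sum(), 1.0)

X = np.array([bovw_histogram(p) for p in train_paths])
clf = SVC(kernel="rbf").fit(X, train_labels)             # step 5: classification
print(clf.predict([bovw_histogram("test_bag.png")]))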
In one of the initial attempts utilizing BoVW, Baştan et al. (2011) used this method to detect handguns in
four-view dual-energy X-ray images of bags on a relatively small dataset: 52 positive and 156 negative
images for training (52 bags), and 764 images (190 bags) for testing. They used various feature detectors:
Difference of Gaussians (DoG), Hessian-Laplace, the Harris corner detector, the features from accelerated
segment test (FAST) corner detection method (Rosten & Drummond, 2006), and STAR feature detectors based
on CenSurE features (Agrawal et al., 2008) (see Figure 26). Three different descriptors were experimented
with: SIFT (Lowe, 2004), Speeded Up Robust Features (SURF), which is computationally more efficient because
it filters integral images with a square filter (Bay et al., 2008), and Binary Robust Independent Elementary
Features (BRIEF). They further used k-means for dictionary generation and an SVM classifier (Hearst et al.,
1998; Vapnik, 1999) for classification. The results were evaluated using recall (Eqn. 7), precision (Eqn. 6), and
average precision (Eqn. 10). The DoG detector and SIFT descriptor were shown to perform best, but the
performance increases when two different detectors (Harris + DoG) are used and when using the union of
key-points from the low-energy, high-energy and colour images. They achieved a TPR of 70% (Eqn. 7) and an
average precision of 57% (Eqn. 10).


Figure 26 Salient point detectors on a colour X-ray image using the OpenCV 2.2 implementations of Harris, SIFT’s DoG,
SURF’s Hessian-Laplace and FAST respectively.

Source: Baştan et al. (2011)

Baştan et al. (2013) performed detection of three X-ray object classes (laptops, handguns and glass bottles),
having approximately five times more negative samples than positive ones. Compared to Baştan et al.
(2011), they used an additional descriptor and combined the features from each single view into an all-views
feature space. The added intensity-domain SPIN image descriptor (Lazebnik et al., 2005), a rotationally
invariant 2D histogram of intensity values in the circular neighbourhood of a point, improved the detection.
They also presented results with precision, recall and average precision, and concluded that single texture
features perform rather poorly, especially for less textured objects (e.g., bottles). As in classification, the
performance improves significantly with the addition of colour features (material information). Furthermore,
multi-view detection improves the performance considerably, especially when the single-view detections are
not good, as for handguns and bottles. The mAP (Eqn. 11) was 66% for gun detection, 87% for laptop
detection, and 64% for bottle detection.
Both Baştan et al. (2011, 2013) and Franzel et al. (2012) compared single-view and multi-view object
detection and concluded that combining multiple views achieves better detection results. Inspired by Baştan
et al. (2011), Turcsany et al. (2013) presented a unique BoVW approach for X-ray firearm classification via
class-specific feature extraction. A novel modification to the traditional codebook generation in BoVW was
presented, where clustering was done separately for the two classes, as previously done in (Perronnin et al.,
2006), which simplifies the task of the classifier. Additionally, with this approach one can directly influence
the ratio between the numbers of visual words representing the positive and negative classes. Usually, the
negative-class visual words significantly outnumber the positive-class ones; in their approach, Turcsany et al.
(2013) generated equal numbers of visual words representing positive and negative examples in order to
represent the complexity of both classes equally. The SURF feature detector and descriptor and an RBF SVM
kernel showed the best performance. The method was developed and tested using 2,500 images, with 850
firearm and 10,000 image patches cropped from them, and 3-fold cross-validation was performed using the
ROC curve (Metz, 1978), where the kernel size, σ, was varied. A low FPR of 4% (Eqn. 9) and a high TPR (Eqn.
7) of 99% were achieved, a significant improvement over the TPR of 70% in (Baştan et al., 2011). This
improvement was explained by the larger amount of training data, the clustering of features separately for
each class of data, and a six times higher number of visual words used in codebook creation, which is shown
to improve classification accuracy (Nowak et al., 2006; Philbin et al., 2007).
In Mery et al. (2016), BoVW was further employed for the detection of four objects: guns, shuriken, razor
blades and clips (see Figure 27). Similar to Turcsany et al. (2013), dictionaries are formed for each class,
consisting of SIFT (Lowe, 2004) feature descriptors of randomly cropped image patches. In the training phase,
the number of samples is reduced by using the k-means algorithm on already clustered samples (see Figure
28). In the testing phase, first the best test patches (the most discriminative patches based on the coefficient
distribution) are selected using the sparsity concentration index. Consequently, each selected test patch is
represented by an 'adaptive' sparse representation computed from the 'best' representative dictionary of
each class (the closest parent cluster to the given test patch). Finally, the test patches are classified according
to the 'sparse representation classification' methodology (Wright et al., 2009). The test image that the
selected patches belong to is classified using the majority vote of the classes assigned to the selected test
patches. In the experimental set-up, five classes were used: four for the four types of threat objects, and a
fifth class representing background images. The dataset consisted of 100 images per class, obtained from
the publicly available GDXray database (Mery et al., 2015) (see Figure 27), and a leave-one-object-out
strategy was used for evaluation: from each class, 51 randomly selected images were chosen, 50 for training
and one for testing. This was repeated 100 times, and the mean accuracy (Eqn. 5) was calculated. After
experimenting with Local Binary Patterns (LBP) (Ojala et al., 2002), SIFT, and both LBP and SIFT as
descriptors, the best performance was reported with the combination of descriptors: more than 95% accuracy
for each class, and 85% when occlusion covers less than 30% of the object. The occluded data was created
by replacing image pixels with squares of constant grey-scale values. It is important to mention that this
study was performed using already cropped object or background images, instead of locating and classifying
objects, which is a more difficult task.

Figure 27 Example of three images per class from GDXray database, described in Mery et al. (2015).

Source: Mery et al. (2016)

Figure 28 Representation of class i objects: a) Patches of all X-ray images from the training set of class i are extracted.
b) Each patch is represented with the feature descriptor values. c) Patches that appear too often (no valuable information)
or too seldom (noise) are filtered out. d) The set of points is clustered into Q parent clusters. e) Each parent cluster is
clustered into R child clusters, i.e. R representative samples from each parent cluster, and all the Q*R centroids are stored
in the dictionary. f) Visualization of the patches of the dictionary with the centroids of the R and Q clusters. In this
example, 2,400 points are presented, with Q = 6 (parent clusters) and R = 10 (child clusters).

Source: Mery et al. (2016)


In his work, Baştan (2015) thoroughly reviews several feature detectors (Harris-Laplace, Harris-affine,
Hessian-Laplace, Hessian-affine), on which he studies the applicability and efficiency of sparse local features
(SIFT + SPIN) for object detection in X-ray baggage imagery via a similar bag-of-features concept. This work
also investigates how the material information given in X-ray imagery via colour mapping, and multi-view
X-ray imaging, affect detection performance. He uses a multi-view branch-and-bound search (Lampert et al.,
2009) in a structured learning framework (structural SVMs) (Blaschko & Lampert, 2008) that results in a good
spatial distribution of classification confidence values. This work is most similar to that of Franzel et al.
(2012). The major difference is that, instead of running single-view detectors on each view and fusing the
detections in 3D, Baştan (2015) performs the detection directly in 3D using quantized local features from all
the view images, similar to the earlier work of Baştan et al. (2013). The dataset included three images per
object (a low-energy, a high-energy and a pseudo-colour image), with three threat objects to detect: laptops,
handguns and glass bottles. Feature points were detected on the pseudo-colour X-ray images, since colour
images contain more information and texture compared to low/high-energy images (they are also noisier).
The dataset used was large: 669 scans (2,676 images) containing handguns, 250 scans (1,000 images)
containing laptops, and 280 scans (720 images) containing glass bottles. To evaluate the object detection
performance, they used the standard recall, precision and average precision measures, based on PASCAL's
50% area overlap criterion (Everingham et al., 2015) for a correct detection. The best performance was
achieved using a multi-view and multi-feature approach, combining the Hessian-Laplace and Harris-Laplace
feature detectors with SIFT, the colour SPIN descriptor and super-pixel sampling. The average precision (Eqn.
10) was 66% for gun detection, 87% for laptop detection, and 64% for bottles. The worse performance on
guns and especially bottles is explained by their textureless appearance.
Kundegorski et al. (2016) exhaustively evaluated various feature point descriptors within a BoVW-based
image classification task. Their evaluation dataset consisted of close to 20,000 X-ray sample patches already
cropped from full images (dual-energy, false-coloured, from varying manufacturers). The combination of
FAST-SURF trained with an SVM classifier was the best performing feature detector and descriptor
combination for a firearm detection task, yielding an accuracy (Eqn. 5) of 0.94, a true positive rate (Eqn. 7) of
83%, and a false positive rate (Eqn. 9) of 3.3%.
A sparse KNN-based method using sparse reconstruction was presented by Svec (2016), who worked with
images extracted from the GDXray database (Mery et al., 2015), with objects belonging to four classes:
pistols, shuriken, razors, and background. SIFT key-points were extracted from the images, forward-search
feature selection was used for feature selection, and k-means clustering was used to create dictionaries for
each class. In the testing phase, a k-nearest-neighbours (KNN) classifier (Larose & Larose, 2005) was used for
each class, and the class of the object was predicted using soft voting. The method was evaluated on GDXray
data using precision, recall, and accuracy. The proposed method reached 97% accuracy for pistols, 99%
accuracy for shuriken, and 92% accuracy for razors on the validation set (the set used for tuning the
parameters). Additionally, the data were cropped from larger images so as to contain only the object of
interest, as in Mery et al. (2016), which makes the classification task easier than locating and recognizing
threat objects in complete images of passenger bags.
Another method using the visual vocabulary framework was proposed by Riffo and Mery (2016), using single-
view and single-energy images of bags with threat objects (pistols, shuriken, razors) imaged in different
poses, similarly to Mery et al. (2016) and Svec (2016), but utilizing full images of bags for testing purposes.
In this work they used the standard SIFT model for key-point detection and description, and proposed an
"adapted implicit shape model" (AISM) for clustering and object detection. AISM is based on the well-known
"implicit shape model" (ISM) method (Leibe et al., 2008), with the fundamental difference that ISM only uses
the occurrence with the highest similarity score as a valid detection, while AISM uses a threshold to determine
which occurrences will be used. Additionally, AISM does not require a priori knowledge of the number of target
objects to be detected, while ISM does. The method was trained using 100 razor blades, 100 shuriken, and
200 handguns (more handguns due to their higher inter-class variation). Testing was done on 200 X-ray
images with various numbers of objects inside (0-2). The evaluation included ROC analysis, based on which
the detection of razor blades and shuriken was very effective, resulting in an almost perfect ROC curve. In
both cases, a high AUC and TPR and a low FPR were obtained: AUC > 98%, TPR > 98%, and an FPR of 2% for
razor blades and 6% for shuriken. The results for the detection of handguns were somewhat lower: AUC of
90%, TPR of 85% and FPR of 20%. Given that asymmetrical objects have very disjointed occurrences with
respect to the real centre of the object, the best results are obtained for symmetrical objects like razor blades
and shuriken. Hence, the method would not be very effective with occluded objects, which, irrespective of
their true shape, would always appear asymmetrical. However, this approach was developed using single-
energy and single-view images, and does not benefit from multi-view detection and dual-energy colour
features.


6.3 Other classical machine learning methods


Despite the dominance of BoVW in the past, other computer vision and machine learning techniques have
also been studied for X-ray object classification. Franzel et al. (2012) proposed a method using a standard
sliding window with histogram of oriented gradients (HOG) features (Dalal & Triggs, 2005), an SVM classifier,
and a novel method for combining classification results from each of the four views. The authors used three
rounds of bootstrapping for the SVM, retraining with a focus on difficult negative examples (Walk et al.,
2010). The multi-view detection, which requires knowledge of the scanner geometry, is done by casting lines
in 3D space: if they intersect (or pass very close to each other), i.e. the classifications for a given bounding
box agree across views, the classification confidence in each view is increased. In this way, the number of
false detections is reduced. Average precision increased from 46% to 66% for gun detection, from 85% to
86% for laptop detection, and from 54% to 64% for bottle detection. The disadvantage of this approach is
that it loses the objects' bounding boxes and only outputs the objects' centres. Therefore, the performance
evaluation is based on the distance between the centres of detected and ground-truth objects, not the
commonly used area overlap between object bounding boxes.
Schmidt-Hackenberg et al. (2012) proposed a novel method for the detection of guns, using features inspired
by the human visual cortex, SLF-HMAX (Mutch & Lowe, 2008) and V1-like (Pinto et al., 2008), in combination
with a linear SVM classifier. SLF-HMAX features are essentially Gabor wavelet filters (Daubechies, 1990) with
added sparsity and biologically motivated complex visual cell features, while the V1-like feature model
represents simple cells, as explained in Pinto et al. (2008). The method was compared with the approach
dominant at the time, which utilizes SIFT and pyramid histogram of visual words (PHOW) features (Vedaldi &
Fulkerson, 2010) with k-means clustering within a BoVW framework and a linear SVM classifier. They used a
standard sliding window approach applied to non-cropped multi-energy colour single-view images. The
dataset consisted of 1,200 X-ray images of baggage with medium clutter (clothes, shoes, bottles, electronic
devices, etc.), with 50% of the bags containing a gun. The method was trained on 500 cropped images of
guns and 500 cropped images of background. The remaining 200 images, 100 with a gun and 100 without,
were used as a test set. The evaluation used precision-recall graphs with five-fold cross-validation, and the
visual cortex based features seemed to be superior to the traditional texture detectors (see Figure 29), which
is not surprising given that X-ray images, and especially the guns in them, are characterized by low texture.

Figure 29 (a) Precision-recall curve for all tested features, and (b) correctly detected gun in two bags with different
orientations.

Source: Schmidt-Hackenberg et al. (2012)

Shape-based handgun detection was investigated by Roomi (2012) using shape context descriptors (Belongie
et al., 2002) and Zernike moments (Khotanzad & Hong, 1990) as features, fed into a fuzzy k-NN classifier
(Keller et al., 1985), but with a limited, non-quantitative, purely visual evaluation over only 15 images of
non-cluttered bags.
A general multi-stage approach proposed in the works of Mery et al. (2013, 2017) provides ways to detect
objects in 3D from multi-view 2D images. These approaches require that the geometry of the scanner is
known; similar 3D processing was already mentioned for Mery (2011). The methods consist of feature
extraction via feature descriptors and a k-NN classifier (Larose & Larose, 2005), followed by matching of
key-points across consecutive images from different views and a multiple-view analysis, where the
key-points of two successive images are matched and their 3D points are formed with structure from motion.
After being clustered, the 3D points are re-projected back to 2D key-points, which are classified by the k-NN
classifier. The method was tested on 120 samples including clips, springs and blades, achieving the best
performance for the 15 blade examples (100% precision and 93% recall) and for the 75 spring examples
(96% precision and 85% recall).

6.4 Chapter summary


Computer vision can help human screeners in two ways: i) enhancing the image in order to make the objects
of interest more visible, and ii) detecting the position of objects of interest. Regarding image enhancement,
the most used method is image de-noising. Image de-noising can be achieved using histograms (Chen et al.,
2005), edge detection and Gaussian blurring (Dmitruk et al., 2017), Gabor filters (Movafeghi et al., 2020), or a
neural network that models a desirable output based on the input image (Singh & Singh, 2005).
Before deep learning methods gained popularity, the bag of visual words (BoVW) approach was dominant in
the field of threat detection in X-ray images. BoVW is a method where local features obtained from an image
set are clustered into a finite number of clusters. Centroids of clusters form a dictionary, which is used to
encode features of images in a vector-quantized representation. The clusters' centroids are called visual
words and the bag-of-words model represents an image by its histogram over these visual words. BoVW was
used for handgun detection (Baştan et al., 2011), firearm classification (Turcsany et al., 2013), and four-class
classification of guns, shuriken, razor blades and clips (Mery et al., 2016). The usual approach for BoVW was
to use different feature detectors. Baştan (2015) thoroughly reviewed several feature detectors (Harris-
Laplace, Harris-affine, Hessian-Laplace, Hessian-affine), combined with SIFT and SPIN feature descriptors,
represented as a bag of features and trained with an SVM classifier. The best performance for gun, bottle and
laptop detection was achieved using a multi-view and multi-feature approach, combining the Hessian-Laplace
and Harris-Laplace feature detectors with SIFT, the colour SPIN descriptor and super-pixel sampling.
Despite the dominance of BoVW in the past, other computer vision and machine learning techniques have
been studied for X-ray object classification. Franzel et al. (2012) proposed a method using a standard sliding
window with histogram of oriented gradients (HOG) features (Dalal & Triggs, 2005), an SVM classifier, and a
novel method for combining classification results from each of the four views. Schmidt-Hackenberg et al.
(2012) proposed a novel method for the detection of guns, using features inspired by the human visual
cortex, SLF-HMAX and V1-like, in combination with a linear SVM classifier. Shape-based handgun detection
was investigated by Roomi (2012) using shape context descriptors (Belongie et al., 2002) and Zernike
moments (Khotanzad & Hong, 1990) as features, fed into a fuzzy k-NN classifier (Keller et al., 1985), but with
a limited, non-quantitative, purely visual evaluation over only 15 images of non-cluttered bags.


7 Synthetic data augmentation

7.1 Introduction
Scarcity of data in general, and especially of annotated data, is a problem in many fields where data cannot
simply be taken from the Web and where trained professionals are required for data annotation. For example,
in the medical field it is costly to have a trained radiologist accurately annotate the images needed for the
development of automated image analysis algorithms. In the field of security threat detection, there are
several additional challenges to be addressed:
— the list of prohibited item categories is long and growing (European Commission, 2021a;
Transportation Security Administration, 2021)
— real-world data distributions tend to drift over time or evolve cyclically with the seasons (Liang, 2020)
— data storage may represent a security or privacy risk
— real-world data is highly imbalanced, with a very low prevalence of prohibited items.

In this chapter we review the principles of synthetically augmented data, and cite examples in the area of X-
ray threat detection.

7.2 Review of augmented data


In order to increase the amount of available data, different techniques for synthetic data generation exist and
have been used for decades in machine learning algorithms in different fields of application (Chellappa &
Kashyap, 1985; Bajura et al., 1992; Schott et al., 1995; Vetter, 1998). With the recent success and popularity
of deep learning algorithms, the need for data has increased even more, and the number of publications in
the data augmentation domain is increasing (see Figure 30). The use of synthetic data to train deep learning
algorithms for X-ray threat detection is, to the best of our knowledge, described and listed in Section 8.4.

Figure 30 Image data augmentation publications from 2015–2020.

Source: reproduced from Khalifa (2022)


Image augmentation algorithms can be broadly divided into classical and deep learning approaches, as
summarised in Table 4. Geometric and photometric approaches are classical data augmentation techniques:
they are easy to implement, manipulate the images directly, and the augmented data is usually generated
during training. Geometric approaches include image translation, rotation, flipping, scaling, reflection, and
shearing, while photometric approaches include colour space shifting, image filtering, noise adding, etc. (see
Figure 31). Data augmentation is traditionally performed using these methods, but some studies suggest that
little additional information can be gained in these ways (Frid-Adar et al., 2018).

Table 4. Overview of different data augmentation approaches, with examples of their use for X-ray threat detection.

                        Data augmentation approaches                          Examples in X-ray
                                                                              threat detection

Classical               Geometric transformations (Krell et al., 2018;       Xu et al. (2018)
approaches              Krizhevsky et al., 2012; Mikolajczyk & Grochowski,
                        2018; Vyas et al., 2018): translation, flipping,
                        rotation, scaling, reflection, shearing.

                        Photometric approaches (Krizhevsky et al., 2012;     Dhiraj and Jain (2019)
                        Mikolajczyk & Grochowski, 2018): colour space        Bhowmik et al. (2019)
                        shifting and kernel filtering (Galdran et al.,       Wang et al. (2020a)
                        2017), adding noise (Moreno-Barea et al., 2018),
                        random erasing (Zhong et al., 2020), image mixing
                        (Inoue, 2018; Zhang et al., 2018), TIP (Hofer &
                        Schwaninger, 2005; Neiderman & Fobes, 2005)

Deep learning           GAN-based (Goodfellow et al., 2014; Zhu et al.,      Zhao et al. (2018)
based approaches        2020b; Choi et al., 2018; Choi et al., 2020;         Zhu et al. (2020b)
                        Zhang et al., 2019; Brock et al., 2019)              Kim et al. (2020)

                        Feature space augmentation (DeVries & Taylor,
                        2017; Li et al., 2021)

                        Adversarial training (Goodfellow et al., 2015;
                        Li et al., 2018)

Figure 31 Same image after different types of geometric (top) and photometric (bottom) transformations.

Source: Mikolajczyk & Grochowski (2018)
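
As a minimal sketch of these classical techniques, the following code (using NumPy and Pillow, both assumed
available; the file names are illustrative) applies one geometric and one photometric transformation:

import numpy as np
from PIL import Image

img = Image.open("bag_scan.png")

# Geometric: random rotation followed by a horizontal flip.
rotated = img.rotate(np.random.uniform(-15, 15), expand=True)
flipped = rotated.transpose(Image.FLIP_LEFT_RIGHT)

# Photometric: additive Gaussian noise.
arr = np.asarray(flipped, dtype=np.float32)
noisy = np.clip(arr + np.random.normal(0.0, 5.0, arr.shape), 0, 255)
Image.fromarray(noisy.astype(np.uint8)).save("bag_scan_augmented.png")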


Within X-ray security imagery, an established technique, Threat Image Projection (TIP) (Hofer & Schwaninger,
2005; Neiderman & Fobes, 2005), is used (see Figure 32). It is a process for the monitoring of human
operators, which uses a smaller collection of X-ray imagery comprising isolated prohibited objects, which are
subsequently superimposed onto more readily available benign X-ray security imagery. This approach can
additionally facilitate the generation of synthetic, yet realistic, prohibited X-ray security imagery for the
purpose of training automated algorithms. However, a recent study (Bhowmik et al., 2019) indicated that
using TIP datasets for training negatively affects detection performance on real-world examples.

Figure 32 Threat image projection (TIP) pipeline for synthetically composited image generation.

Source: Bhowmik et al. (2019)
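
A minimal sketch of TIP-style compositing, under the simplifying assumption that transmission images
combine multiplicatively (cf. the log-space additivity discussed in Chapter 6); the array sizes, values and
paste position are illustrative, not the TIP implementation of the works cited above.

import numpy as np

def project_threat(benign, threat, y, x):
    # Both arrays hold transmission values in [0, 1] (1 = fully transparent);
    # superimposing a threat patch multiplies the local attenuation.
    out = benign.copy()
    h, w = threat.shape
    out[y:y + h, x:x + w] *= threat
    return out

benign = np.ones((300, 400))                # empty-tray placeholder image
threat = np.full((50, 80), 0.4)             # dense rectangular object
composited = project_threat(benign, threat, y=120, x=160)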

Lately, deep learning based approaches have become popular. Among them, the most used are generative
adversarial networks (GANs) (Goodfellow et al., 2014; Wang, 2020). For X-ray threat detection, GAN-
generated data is used in several publications, including Zhao et al. (2018), Zhu et al. (2020b), and Kim et al.
(2020). Figure 33a and Figure 33b show data generated by different types of GANs trained on GDXray (Mery
et al., 2015). It is clear that the generated data looks different from real data. However, in Kim et al. (2020) a
GAN with a gradient-based improved adversarial loss function does show promising results (see Figure 33c).
Due to the lack of training data, current models are capable of producing only objects but not full X-ray
images. Moreover, the appearance of the generated images is not realistic, and more training data is needed
for it to improve.

7.3 Chapter summary


Scarcity of data in general, and especially annotated data, is a problem in many fields where data cannot
simply be taken from the Web and where trained professionals are required for data annotation. In the field
of security threat detection, there are additional challenges such as a growing list of prohibited items,
temporal and geographical fluctuations in real-world data, security and privacy concerns, and class
imbalances in real-world data.
Different techniques for synthetic data generation exist and have been used for decades in machine learning
algorithms in different fields of application. Classical approaches include geometric transformations and
photometric approaches.
With the recent success and popularity of deep learning algorithms, the need for data has increased even
more. Amongst the deep learning based approaches, the most used are generative adversarial networks
(GANs). For X-ray threat detection, GAN based generated data was used in several publications in the last few
years. Due to the lack of training data, current models are capable of producing only objects but not full X-ray
images. Moreover, the appearance of the generated images is not realistic, and more training data is needed
for it to improve.


Figure 33 Data generated using different GANs trained on GDXray (Mery et al., 2015): (a) from Zhao et al.
(2018), (b) from Zhu et al. (2020b), and (c) from Kim et al. (2020).

Source: (a) Zhao et al. (2018), (b) Zhu et al. (2020b), (c) Kim et al. (2020)


8 Deep-learning

8.1 Introduction
In this chapter, we review the basic concepts of deep learning algorithms that set them apart from the
traditional machine learning techniques discussed in Chapter 6. Deep learning is a machine learning predictive
model that has received great attention lately because it has been shown capable of surpassing human
performance in classifying photographic images (Chollet, 2017; He et al., 2016; Krizhevsky et al., 2012;
Szegedy et al., 2015). Moreover, practical implementations of difficult computer vision problems such as
facial recognition are now routinely available in mass-market consumer products such as smartphones and
tablets. Perhaps the most prominent characteristic of deep learning is that, instead of utilizing hand-
engineered features as traditional machine learning does, it leverages neural networks to learn hierarchical,
abstract representations from the data itself; "deep" in "deep learning" refers to deep neural networks (i.e.,
networks with many layers).

8.2 Deep learning: the basics


Artificial neural networks (ANNs) are an old model originating from the 1940s (McCulloch & Pitts, 1943),
which became popular in the 1980s with the work of Hopfield (1982) and were used in parallel with other
conventional machine learning algorithms with moderate success. Since the input to an ANN was usually not
an engineered set of features extracted from the data samples, but the samples themselves, from which the
ANN extracted the best features, ANNs were particularly unsuitable for computer vision applications, where
inputs are 2-dimensional and 3-dimensional images and the number of ANN parameters becomes very large.
To overcome this obstacle, features were extracted beforehand and fed into an ANN, but other ML methods
were faster to train and gave better results.
ANNs became more successful in the 2000s with the increase in available training data and in the
computational power to handle big amounts of data within complex neural networks. Modern ANNs easily
reach one hundred layers, millions of neurons and complex structures of connections between them. They
include the following layers with varying characteristics: convolutional layers (feature extraction), fully
connected layers (intermediate representation), pooling layers (dimensionality reduction), and non-linear
operators (sigmoid, hyperbolic functions and rectified linear units). Why ANNs have only recently become so
powerful is illustrated in Figure 34, which shows their performance dominance over other learning algorithms
when large amounts of training data are used. Performance on vision tasks increases logarithmically with the
volume of training data, as shown by Sun et al. (2017).

Figure 34 Performance of deep learning (ANNs with many layers) against that of most other machine learning
algorithms. Deep ANNs still benefit from large amounts of data, whereas the performance increase of other machine
learning models plateaus.

Source: Ng (2018)

Several factors have contributed to the success of deep learning approaches:


— Computational power that has increased rapidly due to the widespread use of graphics processing units
(GPUs) that are particularly suited to efficiently perform linear algebra operations necessary for fitting
neural networks (Krizhevsky et al., 2012; Krüger & Westermann, 2003).


— Large amounts of data needed to train deep neural networks (Sun et al., 2017) that recently became
available (e.g. from the internet, smartphones, etc.).
— Development of novel parameter optimization algorithms able to cope with large amounts of data and
DNN parameters, including what is now considered seminal work in deep learning by Hinton and
Salakhutdinov (2006), where the depth of neural networks is gradually increased by alternating between
adding a new layer and optimizing the network parameters, consequently stabilizing the optimization.

A deep-learning architecture is a multi-layer stack of simple modules (neurons), all (or most) of which are
subject to learning, and many of which compute non-linear input-output mappings. Modules from one layer
can take as inputs all the outputs of the previous layer; this type of ANN is called a fully connected neural
network (FCNN). This type of architecture is computationally very expensive, especially when applied to
images. One type of DNN is easier to train and generalizes much better than FCNNs: it performs convolution
(Lyons, 2011) in its layers and is called a convolutional neural network (CNN). CNNs were first proposed in the
1990s by LeCun et al. (1990, 1998), and after Krizhevsky et al. (2012) successfully proposed their deep CNN
(AlexNet) for image classification on ILSVRC data (Russakovsky et al., 2015) in 2012 (see Figure 35 and
Figure 36), they became the state of the art for classification and detection in the field of computer vision.

Figure 35 Eight ILSVRC-2010 test images and the five labels considered most probable by the proposed CNN. The correct
label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it
happens to be in the top 5).

Source: Russakovsky et al. (2015)


Figure 36 Five ILSVRC-2010 test images in the first column. The adjacent columns show the six training images that
produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test
image.

Source: Krizhevsky et al. (2012)

CNNs are based on two main concepts: local receptive fields and shared weights. Local receptive fields are
small regions inside the image, which provide local information, with the region size defined as a variable
parameter. Similar to the notion of a sliding window, local receptive fields are spread across the image such
that each forms a neuron in the following hidden layer. Using shared weights and biases for neurons in the
hidden layers of a CNN is a unique notion that provides many advantages. First, since all neurons in a hidden
layer use the same weights and bias, each hidden layer has distinct feature characteristics (see Figure 37).
This allows the concept of transfer learning to be used successfully with DNNs; more on transfer learning can
be read further in this chapter. Having many convolutional layers gives a very broad feature matrix. Another
advantage of using shared weights is that the total number of parameters decreases rapidly, which gives not
only faster training times but also the opportunity to construct more complex (deeper) network structures.
Even though using shared weights significantly decreases the number of parameters, these still considerably
exceed those of more traditional machine learning approaches. CNN networks are designed manually, with
the resulting parametrization of the network trained using a stochastic gradient descent approach with
varying parameters such as batch size, weight decay, momentum and learning rate over a huge dataset
(typically 10^6 in size).
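
As a minimal PyTorch sketch (assumed available) of the building blocks named above, the network below
stacks convolutional layers (shared weights sliding over local receptive fields), non-linear operators, pooling
and a fully connected head; the layer sizes are illustrative.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # local receptive fields
            nn.ReLU(),                                   # non-linear operator
            nn.MaxPool2d(2),                             # dimensionality reduction
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)  # fully connected

    def forward(self, x):
        x = self.features(x)              # the same conv weights scan the image
        return self.classifier(x.flatten(1))

logits = SmallCNN()(torch.randn(1, 1, 64, 64))   # one 64 x 64 grey-scale scan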
After AlexNet (Krizhevsky et al., 2012), other CNN models quickly emerged, e.g., Zeiler & Fergus (2014),
Szegedy et al. (2015), and Simonyan & Zisserman (2015), but like AlexNet they were also trained on the
ImageNet RGB image database (Deng et al., 2009), which contains more than a million labelled training
images, 1,000 distinct class labels, 50,000 validation images and 100,000 test images. More recently, Google
Open Images provided a large amount of photographic training data (9.2 million images), with 8 annotated
objects per image on average (Kuznetsova et al., 2020). Even more spectacular is the JFT-300M internal
Google dataset, where images are labelled using an algorithm that uses a complex mixture of raw web
signals, connections between web pages, and user feedback. The JFT dataset has more than 300 million
images labelled with 18,291 categories. The annotations have been obtained automatically and are,
therefore, noisy and not exhaustive. These annotations have been cleaned using complex algorithms to
increase the precision of the labels; however, there is still approximately 20% error in precision (Sun et al.,
2017). A smaller, but more context-based, Microsoft COCO database (Lin et al., 2014) is also a common
benchmark for ANN algorithms; it contains photographic images of complex everyday scenes, with 328k
images and 2.5 million labelled instances.


Figure 37 Visualization of features in a fully trained model. For each feature map the corresponding image patches are
also shown. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii)
exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, cols 1).

Source: Zeiler & Fergus (2014)

In order for their potential to be fully used, large amounts of data require higher-capacity models, as shown
by Sun et al. (2017). The AlexNet CNN was 8 layers deep; soon Simonyan and Zisserman proposed the
VGG-19 model, which uses smaller convolutional filters and has a depth of 19 layers (Simonyan & Zisserman,
2015), and since then the representational power and depth of these models have continued to grow every
year. GoogLeNet (Szegedy et al., 2015) was a 22-layer network (with 12 times fewer parameters than
AlexNet), and the residual (ResNet) models proposed by He et al. (2016) reached 152 layers and more
complex connections between the layers. The core idea of residual models is to add residual connections
between layers, which helps in the optimization of very deep models.
Large amounts of training data and high computing power are not always available, as is the case with X-ray
airport security data. Collecting X-ray data, especially manually annotated data, is very expensive. As stated in
the PhD thesis of Liang (2020), where a large annotated dataset was collected in cooperation with Smiths and
Rapiscan, despite strategies to accelerate bag assembly and threat annotation, it still took about 400 worker
hours to collect each dataset. On the other hand, millions of scans are conducted at U.S. and European
airports every day. This type of data is commonly referred to as stream of commerce. The TSA regularly finds
thousands of loaded firearms in carry-on baggage every year (Transportation Security Administration, 2019,
2020). Under current operational procedures, most airport data around the world remains unlabelled and
uncollected.
More generally, a few solutions have been proposed for the scarcity of training data. One of them is the
concept of transfer learning, originating from the 1970s (Bozinovski, 2020), further expanded in the 1990s
(Pratt & Jennings, 1996; Pratt, 1992) and becoming popular with the rise of DNNs, with one of the first deep
learning computer vision applications in the study of Oquab et al. (2014). Transfer learning takes the features
of an already trained DNN that are extracted as useful for a specific task, and uses them as a basis for
solving another task. A useful characteristic of DNNs is that the lower layers provide general feature
extraction capabilities. The features extracted in the lowest layers are almost as a rule Gabor filters
(Daubechies, 1990; Fogel & Sagi, 1989) or colour blobs, and they occur regardless of the exact cost function
and natural image dataset; hence, we call these first-layer features general. Higher layers carry information
that is increasingly specific to the original classification task; see Figure 37 for a visualization of the features
extracted in different layers of a DNN. Hence, re-use of the generalized feature extraction of the lower layers,
and fine-tuning of the parameters of the higher layers for the new classification task, is possible.
Furthermore, a number of studies have taken advantage of this fact to obtain state-of-the-art results when
transferring from lower layers (Donahue et al., 2014; Sermanet et al., 2014; Zeiler & Fergus, 2014),
collectively suggesting that these layers of neural networks do indeed compute features that are general.
A typical transfer learning pipeline is shown in Figure 38, where the authors of Akçay et al. (2018)
demonstrate the applicability of transfer learning by taking a CNN model trained on general photographic
images and fine-tuning its weights for the different task of firearm classification using X-ray data.

Figure 38 Transfer learning pipeline: (A) shows classification pipeline for a source task, while (B) is a target task,
initialized by the parameters learned in the source task.

Source: Akçay et al. (2018)

Yosinski et al. (2014) performed an interesting analysis of when features stop being general and become specific, and of how successful feature transfer is when the original and target tasks are very different. They used ImageNet (Deng et al., 2009) data split into two sets, A and B, each containing 500 different classes. It turned out that not only the first-layer features (Gabor filters and colour blobs) are general: the features of the first three layers of the ANN can also be considered general. Furthermore, even if all but the last layer are frozen and the complete network is used for a different classification task, the drop in accuracy is only 15% (see Figure 39b). The random split into two subsets produced pairs of similar classes across the two sets (e.g. garbage truck vs. military truck). Therefore, a new split was made manually so that set A contains man-made objects and set B natural objects (animals and plants). The transferability of features was less successful with this new split, where the classes were less similar (Figure 39d). The work of Yosinski et al. (2014) also showed that fine-tuning the parameters helps classification accuracy. However, if the target dataset is small and the number of parameters is large, fine-tuning may result in overfitting, so the base features are often left frozen.


Figure 39 Effectiveness of feature transfer. (a) Neural networks (NNs) baseA and baseB are trained on set A and set B; the selffer network BnB is a NN where the first n layers are copied from baseB and frozen, and the remaining layers are randomly initialized and trained on dataset B. (b) Results for the randomly split sets A and B: each marker represents accuracy over the validation set; horizontal groups of same-type markers represent 8 random splits of the initial set into sets A and B. (c) Lines connecting the means across the random splits (the numbered descriptions above each line refer to their interpretation). (d) Performance degradation vs. number of transferred layers when transferring between dissimilar tasks (upper line: networks trained towards the "natural" target task; lower line: those trained towards the "man-made" target task).

Source: Yosinski et al. (2014)


8.3 Transfer learning


In aviation security applications, the scarcity of data - in particular of images of threats - necessitates the use of transfer learning. This method is likely to give good results, especially if the ANNs used are trained on images containing objects similar to threat items. The first research applying CNNs to X-ray security images was done by Akçay et al. in 2016, on handgun image classification (Akçay et al., 2016). They used AlexNet (Krizhevsky et al., 2012) as a base model, which they further optimized by fine-tuning its convolutional and fully-connected layers for the handgun detection problem in X-ray images. AlexNet consists of 5 convolutional layers and 3 fully-connected layers, with 60 million parameters and 650,000 neurons. They re-trained AlexNet over the X-ray baggage dataset using the backpropagation algorithm with stochastic gradient descent. Training and testing were performed using Caffe (Jia et al., 2014), a deep learning tool designed and developed by the Berkeley Vision and Learning Center. They manually cropped images of handguns from 6,997 X-ray images, resulting in 3,924 positive samples and 13,495 negative ones. They varied the number of fine-tuned and frozen layers in AlexNet; e.g. AlexNet4-8 means that layers 4-8 were fine-tuned while layers 1-3 were frozen. They compared their transfer learning method directly with the approach of Turcsany et al. (2013), which uses BoVW and an SVM classifier; the results are shown in Table 5. The transfer learning method showed superior results, even when only the last layer of AlexNet was fine-tuned (AlexNet8 in Table 5).
Furthermore, the authors compared transfer learning from the AlexNet and GoogLeNet (Szegedy et al., 2015) models on a six-class problem with the following classes (number of samples in brackets): firearm (2,847), firearm components (1,060), knives (2,468), ceramic knives (1,416), camera (432) and laptop (900). The comparison of Average Precision (Eqn.10) shows that GoogLeNet exhibits strong performance, even for classes similar to each other (see Table 5 and Table 6). However, it is not clear on how many test examples the 2-class and 6-class methods were evaluated, nor how many layers of AlexNet and GoogLeNet were fine-tuned for the 6-class problem. Nevertheless, this work showed that transfer learning is possible from a base model trained on photographic images to a new model specialized for X-ray (pseudo-colour RGB) data classification. The question remains how efficient transfer learning would be for the detection of threat items with non-standard shapes, or items that do not resemble the objects from the photographic images on which large ANNs are trained (e.g. powders).

Table 5. Performance of the transfer learning method from Akçay et al. (2016) for the two-class problem (gun vs. no gun) on the test set, and comparison with the BoVW method combined with an SVM or Random Forest (RF) classifier used in Turcsany et al. (2013).

Source: Akçay et al. (2016)

Table 6. Results for the multi-class problem (average precision %).

Source: Akçay et al. (2016)


The same authors published more elaborate studies on threat detection in 2017 (Akçay & Breckon, 2017) and 2018 (Akçay et al., 2018) with more data, using the Durham Baggage Patch/Full Image Dataset (see Chapter 6), consisting of single conventional X-ray imagery with associated false-colour materials mapping from dual energy. They compared traditional sliding-window-based CNN detection with region-based object detection techniques such as the Faster Region-based CNN (R-CNN) (Ren et al., 2017) and Region-based Fully Convolutional Networks (R-FCN) (Dai et al., 2016), and in Akçay et al. (2018) additionally the YOLOv2 model (Redmon & Farhadi, 2017). In the sliding window approach, a window of a certain size is moved across the image at different scales, and each extracted image patch is classified as containing a threat or not (see Figure 40a).
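A minimal sketch of such a multi-scale sliding window patch extractor is given below (pure NumPy; the window size, stride and scales are illustrative assumptions, and each yielded patch would be fed to a CNN classifier):

```python
import numpy as np

def sliding_window_patches(image, window=224, stride=112, scales=(1.0, 0.75, 0.5)):
    """Yield (x, y, scale, patch) tuples for a multi-scale sliding window.

    `image` is a 2D grayscale array; each patch would be classified as
    threat vs. benign by a separate CNN (not shown here).
    """
    for scale in scales:
        h = int(image.shape[0] * scale)
        w = int(image.shape[1] * scale)
        # Nearest-neighbour resize, to keep the sketch dependency-free.
        rows = (np.arange(h) / scale).astype(int)
        cols = (np.arange(w) / scale).astype(int)
        resized = image[rows][:, cols]
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                yield x, y, scale, resized[y:y + window, x:x + window]
```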
Instead of a sliding window, Ren et al. (2017) propose an approach containing two ANNs: one to propose regions of interest (the region proposal network, RPN) and a second to classify those regions (the Fast R-CNN detector); the two ANNs share convolutional layers, which makes the approach computationally efficient. An ROI pooling layer resizes the ROI proposals to a fixed width and height, and fully connected layers then create the feature vector used by the bounding box regression and softmax layers (see Figure 40b). The R-FCN method proposed by Dai et al. (2016) is also region based, and faster than Faster R-CNN (Ren et al., 2017). It also takes a whole image as input, but avoids running a costly per-ROI subnetwork hundreds of times (the two subsequent fully connected layers of Ren et al. (2017)). Instead, the fully connected layers are removed and a new construct called the “position-sensitive score map” is used. Since no fully connected subnetwork is used, the model shares weights across almost the entire network (see Figure 40c). This leads to much faster training and inference, while achieving results similar to Faster R-CNN.
YOLOv2 (Redmon & Farhadi, 2017) is also a fully convolutional network, but it performs detection in a single forward pass, whereas region-based approaches use a sub-network for region generation. It also employs anchors, like Faster R-CNN, but instead of fixing the anchor parameters, it uses k-means clustering over the ground-truth bounding boxes of the training data to learn the anchor parameters. YOLOv2 also performs batch normalization after each layer, improving overall performance. Finally, unlike classification networks that take smaller input images such as 224 x 224 pixels, YOLOv2 accepts higher-resolution inputs varying between 350 x 350 and 600 x 600 pixels. In addition, the model randomly resizes input images during training, which allows the network to work with objects at varying scales and hence handles the scaling issue.
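The anchor clustering step can be sketched as standard k-means over ground-truth box widths and heights, but with 1 - IoU as the distance, as in YOLOv2's dimension clusters (a sketch; the array shapes and the choice k = 5 are illustrative assumptions):

```python
import numpy as np

def iou_wh(box, anchors):
    """IoU between one (w, h) box and k (w, h) anchors, corner-aligned."""
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    union = box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100):
    """k-means over ground-truth (w, h) pairs with 1 - IoU as the distance.

    `boxes` is an (N, 2) array of ground-truth box widths and heights.
    """
    boxes = np.asarray(boxes, dtype=float)
    anchors = boxes[np.random.choice(len(boxes), k, replace=False)].copy()
    for _ in range(iters):
        # Assign every box to the anchor with the highest IoU.
        assign = np.array([np.argmax(iou_wh(b, anchors)) for b in boxes])
        # Move each anchor to the mean of its assigned boxes.
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors
```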


Figure 40 Schematics for the CNN-driven detection strategies evaluated by Akçay et al. (2018): (A) sliding window based CNN, (B) Faster RCNN, (C) R-FCN, and (D) YOLOv2.

Source: Akçay et al. (2018)

Akçay et al. (2018) compared the success of SVM-based classification of patches using engineered features with that of an SVM using features chosen by a CNN (AlexNet) (Krizhevsky et al., 2012). Performance is evaluated by comparing TPR (Eqn.7), FPR (Eqn.9), precision (Eqn.6), accuracy (Eqn.5) and F-score (Eqn.12), using approximately 3,000 positive-class images and around 15,000 negative-class images, both split into 60% for training, 20% for validation, and 20% for testing.
Table 7 shows the performance results for firearm classification, illustrating the dominance of CNN and transfer learning methods over SVM with engineered features. We see a general trend for true and false positive rates to decrease as the number of fine-tuned layers is reduced. Likewise, freezing lower layers reduces the accuracy of the models. Training an SVM classifier on CNN features with layer freezing yields better performance than the standard end-to-end CNN results. Here too, fine-tuning more layers has a positive impact on the overall performance: for instance, an SVM trained on a fully fine-tuned CNN has the highest performance on all of the metrics.
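The "CNN features + SVM" recipe can be sketched as follows (PyTorch and scikit-learn; the backbone choice and feature dimensionality are illustrative assumptions, not those of the cited study):

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

# Use a (possibly fine-tuned) network as a fixed feature extractor,
# then train an SVM classifier on the extracted feature vectors.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()   # drop the classification head, keep 512-d features
backbone.eval()

def extract_features(batch):
    """batch: float tensor of shape (N, 3, 224, 224), already normalized."""
    with torch.no_grad():
        return backbone(batch).numpy()

# Hypothetical usage, with X_train a tensor of patches and y_train 0/1 labels:
# svm = SVC(kernel="rbf").fit(extract_features(X_train), y_train)
# predictions = svm.predict(extract_features(X_test))
```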


Table 7. Results of CNN and BoVW features on the dataset of patches for firearm classification. AlexNeta-b denotes that the network is fine-tuned from layer a to layer b.

| Method | Network | TPR % | FPR % | Precision % | Accuracy % | F-score % |
|---|---|---|---|---|---|---|
| CNN with layer freezing (Krizhevsky et al., 2012) | AlexNet1-8 | 99.3 | 4.1 | 74.1 | 96.1 | 84.9 |
| | AlexNet2-8 | 98.5 | 2.4 | 83.2 | 98.3 | 90.2 |
| | AlexNet3-8 | 96.3 | 2.2 | 84.4 | 98 | 90 |
| | AlexNet4-8 | 95.6 | 3 | 79 | 97.3 | 86.5 |
| | AlexNet5-8 | 98.2 | 4.7 | 71.1 | 96.1 | 82.5 |
| | AlexNet6-8 | 96.3 | 5.1 | 69.3 | 95.4 | 80.6 |
| | AlexNet7-8 | 94.49 | 3.65 | 75.4 | 96.1 | 83.9 |
| | AlexNet8 | 95.22 | 4.21 | 73.3 | 96 | 82.8 |
| CNN + SVM with layer freezing (Krizhevsky et al., 2012) | AlexNet1-8 | 99.6 | 1.1 | 99.7 | 99.4 | 99.6 |
| | AlexNet2-8 | 99.3 | 1.5 | 99.6 | 99.1 | 99.4 |
| | AlexNet3-8 | 99.2 | 1.9 | 99.5 | 98.9 | 99.3 |
| | AlexNet4-8 | 98.9 | 1.9 | 99.5 | 98.8 | 99.2 |
| | AlexNet5-8 | 98.8 | 2.1 | 99.4 | 98.6 | 99.1 |
| | AlexNet6-8 | 98.7 | 3 | 99.1 | 98.3 | 98.3 |
| | AlexNet7-8 | 98.6 | 4.1 | 98.9 | 98 | 98 |
| | AlexNet8 | 98.4 | 5.4 | 98.5 | 97.6 | 97.6 |
| End-to-end CNN | VGGM (Chatfield et al., 2011) | 98.4 | 0.4 | 99.8 | 98.7 | 98 |
| | VGG16 (Simonyan & Zisserman, 2015) | 99.1 | 1.1 | 99.7 | 99 | 98.5 |
| | ResNet18 (He et al., 2016) | 99.4 | 1.4 | 99.6 | 99.2 | 98.8 |
| | ResNet50 (He et al., 2016) | 99.5 | 1 | 99.8 | 99.5 | 99.2 |
| | ResNet101 (He et al., 2016) | 99.7 | 1.1 | 99.7 | 99.5 | 99.3 |
| BoVW + SVM (Kundegorski et al., 2016) | SURF/SURF | 79.2 | 3.2 | 88 | 93 | 83 |
| | KAZE/KAZE | 77.3 | 3.9 | 85 | 92 | 81 |
| | FAST/SURF | 83 | 3.3 | 88 | 94 | 85 |
| | FAST/SIFT | 80.9 | 4.3 | 85 | 92 | 83 |
| | SIFT/SIFT | 68.3 | 4.2 | 83 | 90 | 75 |
Source: Akçay et al. (2018)

Additionally, the four ANN architectures were compared on object detection within a binary-class X-ray problem (firearms vs. background) and a multi-class object detection problem (firearm, firearm components, knives, ceramic knives, camera, and laptop). The dataset, from the Durham Baggage Patch/Full Image Dataset, contains 11,627 samples (5,867 training, 2,880 validation and 2,880 test samples). The architectures were constructed using different ANNs: sliding window CNN and Faster RCNN with AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2015) and ResNet (He et al., 2016); R-FCN only with ResNet (He et al., 2016); and YOLOv2 with the Darknet network. Transfer learning was used for training the models, but it is not stated how many layers were kept frozen. The evaluation measures were AP (Eqn.10) per class and mAP (Eqn.11) across all classes (see Table 8). SW-CNN, even with complex networks such as VGG16 and ResNet-101, performs worse than any of the other detection approaches, mainly because it does not employ a bounding box regression layer, a significant performance booster. The best performance of RCNN, with VGG16 (mAP: 85%), is worse than that of any FRCNN or R-FCN variant, because the RPN within FRCNN and R-FCN provides better object proposals than the selective search used in RCNN. For overall firearm detection, R-FCN with ResNet-101 yields the highest mAP of 96%, requiring only 0.1 s per image. For the 6-class problem, Faster RCNN with VGG16 shows superior performance (mAP: 88%).
A similar study by Liang et al. (2019) explores the Single Shot MultiBox Detector (SSD) (Liu et al., 2016) and F-RCNN (Ren et al., 2017), trained on a dataset containing 4 threat classes: firearms (e.g. pistols), sharps (e.g. knives), blunts (e.g. hammers), and liquids, aerosols and gels, or LAGs (e.g. liquid-filled bottles). Each class comprises approximately 3,400 images scanned with a Rapiscan 620DV dual-energy, dual-view scanner. Transfer learning is performed from weights pre-trained for the COCO Object Detection Challenge (Lin et al., 2014), which are fine-tuned to detect each of the target classes. Unlike Faster R-CNN (Ren et al., 2017), which performs classification and bounding box regression twice, single-stage detectors like SSD (Liu et al., 2016) combine both stages, eliminating the proposal stage to directly predict both classes and bounding boxes at once. This reduction tends to make the network much faster, though sometimes at the cost of accuracy. F-RCNN (Ren et al., 2017) with an Inception ResNet v2 (He et al., 2016) backbone yields the highest mAP (Eqn.11): 92% and 98% on single- and multi-view images, respectively.

Table 8. Detection results of SW-CNN, Fast R-CNN (RCNN) (Girshick, 2015), Faster RCNN (FRCNN) (Ren et al., 2017), R-FCN (Dai et al., 2016) and YOLOv2 (Redmon & Farhadi, 2017), for the firearm detection problem (left) and the multi-class problem (right), both with 300 region proposals (Akçay et al., 2018).

Source: Akçay et al. (2018)

Hassan et al. (2019) proposed an object detection algorithm with complex, non-deep-learning-based preprocessing, applied to heavily occluded and cluttered baggage. It starts with preprocessing that enhances image contrast, followed by ROI generation via cascaded multiscale structure tensors, which extract ROIs based on variations in the orientations of objects. Each extracted ROI is then passed to a CNN. The method quantitatively and computationally outperforms RetinaNet (Lin et al., 2017), YOLOv2 and F-RCNN on the GDXray and SIXray datasets with respect to accuracy (99.5%), sensitivity (99.6%), specificity (95.7%), FPR (4.3%) and precision (99%). The method starts from a pretrained ResNet50 architecture, which is then retrained on scans from the GDXray and SIXray datasets; 80% of the scans in each subset were used for training and the remaining 20% for testing (a ratio of 4 to 1).
It is important to mention a study that assesses the generalization of automated methods by performing experiments on data coming from different scanners. The work of Gaus et al. (2019) tests the generalisation capability of CNN models, namely Faster R-CNN (Ren et al., 2017), Mask R-CNN and RetinaNet (Lin et al., 2017), using a ResNet101 network configuration. They used two datasets: the Durham Dataset Full 3-class (firearms, knives and firearm parts), Dbf3 (Akçay et al., 2018), and SIXray10 (firearms and knives) (Miao et al., 2019). The data in the two datasets comes from two scanners, a Smiths Detection and a Nuctech dual-energy X-ray scanner, and the images are generated with differing energy, geometry, resolution and colour profiles (see examples in Figure 41). Additionally, Dbf3 is focused on passenger carry-on baggage within aviation security, while SIXray10 is based on security screening within a metro transit system context. The method was trained on Dbf3 and tested on SIXray10, and conversely trained on SIXray10 and tested on Dbf3. The best results were achieved using Faster R-CNN with ResNet101. For the Dbf3 => Dbf3 experimental set-up an mAP of 0.88 is achieved, with a drop to mAP = 0.85 for the Dbf3 => SIXray10 set-up. Additionally, SIXray10 => SIXray10 achieved an mAP of 0.86, while for SIXray10 => Dbf3 the mAP increased to 0.91. The results show good transferability of Faster R-CNN learning from one dataset to the other.


Figure 41 Examples of multiple prohibited item detection for the inter-domain X-ray security training and evaluation
configurations: (A) Dbf3 ⇒ SIXray10 and (B) SIXray10 ⇒ Dbf3 with varying CNN models.

Source: Gaus et al. (2019)

8.4 Deep learning with augmented data


Publicly available datasets are in general too small or not diverse enough for deep learning algorithms, especially when a method requires specific data (e.g. images with a single target, as in Xu et al. (2018)); even with transfer learning, additional data is often needed. If the amount of data is not sufficient, data that resembles the available data can be generated. This process is called data augmentation and is used in various fields of application (Calimeri et al., 2017; Krell et al., 2018; Tran et al., 2017).
As stated in Chapter 6, data augmentation approaches can be divided into classical methods and deep learning based methods. Regarding classical methods, several implementations are used in deep learning threat detection algorithms.
Motivated by the lack of annotated X-ray datasets, Xu et al. (2018) make use of attention mechanisms for the localization of threat objects. The first stage forward-passes an input and finds the corresponding class probability. The backpropagation stage then finds which neurons within the network determine the output class. The feedback model is much like backpropagation in the training process, but the backpropagated signal is the semantic information of the output layer, not the value of the loss function. Using the neurons of the first convolutional layer on top of the input image localizes the threat (see Figure 42). The final stage refines the activation map by normalizing the layers with the activations of the previous layer. The method was trained and tested using the GDXray dataset (Mery et al., 2015) (see Chapter 6), with the addition of data augmentation (Krell et al., 2018) and transfer learning from GoogLeNet pre-trained on the ImageNet2012 training set.
Data augmentation is performed since the GDXray dataset has a few drawbacks: (1) the sample set is small, there are few single-target images with which to train the classification network, and overfitting occurs easily; (2) the background is monotonous. Data augmentation is performed without the use of a GAN, in the following way: images are cut to 2/3 of the original security image size. The cropping position is random, and 10 crops containing the complete target are picked out. These 10 images are then rotated by 90°, 180° and 270° and flipped horizontally, so that one image can be expanded to 80 images. The dataset is expanded to 5,000 pictures containing 4 object categories (revolver, gun, shuriken, and knife); 90% of the images are used as a training set and the remaining 10% as a testing set. The accuracy achieved is 96% for revolvers, 98% for guns, 99% for shuriken, and 97% for knives. A comparison of bounding box position accuracy, calculated as the intersection over union (Eqn.13) with the GT bounding box, shows that the proposed method (56.6%) achieves superior results to the traditional deconvolution method (34.3%).
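A minimal sketch of this classical augmentation scheme is given below (pure NumPy; the target-containment check is omitted for brevity, and function names are illustrative):

```python
import numpy as np

def augment_crop(crop):
    """Expand one cropped image into 8 variants: 4 rotations x 2 flips.

    Combined with 10 random crops per scan, each image yields
    10 x 8 = 80 training samples, as described above.
    """
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degrees
        rotated = np.rot90(crop, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # horizontal flip
    return variants

def random_crops(image, n=10, fraction=2 / 3, rng=np.random.default_rng()):
    """Take n random crops of 2/3 the original size."""
    ch = int(image.shape[0] * fraction)
    cw = int(image.shape[1] * fraction)
    crops = []
    for _ in range(n):
        y = rng.integers(0, image.shape[0] - ch + 1)
        x = rng.integers(0, image.shape[1] - cw + 1)
        crops.append(image[y:y + ch, x:x + cw])
    return crops
```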

Figure 42 Attention-mechanism-based model. (a) and (b) represent feed-forward and feed-back propagation in a convolutional neural network. (a) Given an input image, the output neuron corresponding to the predicted category is activated after the feed-forward propagation and is represented by the red dot. (b) In the feed-back propagation, the red dots represent neurons positively related to the output neuron, activated layer by layer.

Source: Xu et al. (2018)

The work of Dhiraj and Jain (2019) utilises an approach similar to the concept of TIP (Hofer & Schwaninger, 2005; Neiderman & Fobes, 2005), where a prohibited item is superimposed onto X-ray security imagery. Two steps are applied: image modelling, which increases the object count per category, and a second step that uses image transformation methods to bring variability into the dataset generated by image modelling. The image modelling method is described in the work of Mery & Katsaggelos (2017); it superimposes a threat object on an X-ray image by adding the logarithmic images of the threat and of the X-ray image onto which the threat is superimposed. The initial dataset was extracted from the GDXray database (Mery et al., 2015), with 1,223 useful images of baggage, each having one or more threats (blades, guns, knives and shuriken). After data augmentation, the number of images increased to 3,669. This data was used to train object detection frameworks based on the YOLOv2 and Faster RCNN (with ResNet) architectures, where FRCNN proved slightly better, achieving 93% vs. 92% precision, 98% vs. 88% recall and 98% vs. 97% accuracy.
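Because X-ray transmission is multiplicative in intensity, superimposing one radiograph on another becomes an addition in the logarithmic domain. The following is a minimal sketch of log-domain threat insertion in that spirit; the exact formulation of Mery & Katsaggelos (2017) may differ in details such as normalization and masking, and the function and variable names here are illustrative:

```python
import numpy as np

def superimpose_threat(bag, threat, mask, eps=1e-6):
    """Insert a threat signature into a benign X-ray image.

    bag, threat: 2D arrays of transmission intensities scaled to (0, 1];
    mask: boolean array, True where the threat object is present.
    Attenuation multiplies intensities, so it adds in log space:
    log(composite) = log(bag) + log(threat) inside the threat mask.
    """
    log_bag = np.log(np.clip(bag, eps, 1.0))
    log_threat = np.log(np.clip(threat, eps, 1.0))
    log_out = log_bag.copy()
    log_out[mask] += log_threat[mask]
    return np.exp(log_out)
```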
Motivated by the work of Dhiraj and Jain (2019), Bhowmik et al. (2019) investigated the difference in detection performance achieved using real versus TIP-based synthetic X-ray training imagery for CNN architectures. The task evaluated was the detection of three exemplar prohibited items (firearms, firearm parts, and knives) in cluttered and complex X-ray security baggage imagery. In this work, the synthetic data was not generated using GANs but with TIP (Neiderman & Fobes, 2005), following the method described in Rogers et al. (2016). Two representative CNN object detection models were used for the evaluation: Faster R-CNN (Ren et al., 2017) and RetinaNet (Lin et al., 2017). RetinaNet is an object detector whose key idea is to address the extreme class imbalance between foreground and background classes. To improve performance, RetinaNet employs a novel loss function called the Focal Loss, which modifies the cross-entropy loss so as to down-weight the loss on easy negative samples, focusing the loss on the sparse set of hard samples. The dataset used in Bhowmik et al. (2019) was the Durham Dataset Full 3-class (firearms, knives and firearm parts), Dbf3, consisting of 7,603 images in total. A synthetically composited dataset was generated from 3,366 benign Smiths Detection X-ray images, into which 123 prohibited objects were composited. A mixed dataset was constructed from half real and half synthetically created images. All three datasets consisted of an equal number of images, and each dataset was split into training (60%), validation (20%) and test (20%) sets so that each split has a similar class distribution.
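As a concrete illustration, a minimal sketch of the binary, sigmoid-based Focal Loss follows; the default values alpha = 0.25 and gamma = 2 match common usage of Lin et al. (2017), but this is an illustrative implementation rather than RetinaNet's actual code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    gamma down-weights well-classified (easy) examples; alpha balances
    the two classes. logits and targets are float tensors of the same
    shape, with targets in {0, 1}.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```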
Results show that the best performance was achieved with Faster R-CNN and ResNet101 for all three classes. As expected, training and testing on the real data achieved the best results for all classes, with an overall mAP of 88%. However, even though performance is lower when training on synthetic data and testing on real data (mAP = 78%), this is a very good result given that no manual labelling of the training data is required, since the TIP insertion positions are known. The performance gap between CNN architectures trained on real and on synthetically composited X-ray imagery is attributable to the domain shift problem, whereby the distributions of training and test data differ. In the first experiment (Table 9, top block), the training and test data are from the same distribution, since they are created by randomly dividing data captured under the same experimental conditions. It is also noteworthy that the set of prohibited item instances used for generating the synthetic X-ray imagery is smaller than that in the real training images. As a result, CNN architectures trained on synthetic data have larger generalization errors than those trained on real data. In the second set of experiments, when testing on synthetic data, the results are comparable whether training on real or synthetic data (in both cases mAP = 91%). These experimental results show that diverse prohibited item signatures in the training data are essential for improving generalisation.

Table 9. Detection results of varying CNN architectures trained on real and synthetic data [Dbf3Real: three-class real data (top block), Dbf3SC: three-class synthetic data (middle block), Dbf3Real+SC: three-class real and synthetic data (bottom block)]. All models are evaluated on a set of real X-ray security imagery.

| Train => Evaluation | Model | Network | Firearm (AP) | Firearm Parts (AP) | Knives (AP) | mAP |
|---|---|---|---|---|---|---|
| Dbf3Real => Dbf3Real | Faster R-CNN (Ren et al., 2017) | ResNet50 | 0.87 | 0.84 | 0.76 | 0.82 |
| | Faster R-CNN (Ren et al., 2017) | ResNet101 | 0.91 | 0.88 | 0.85 | 0.88 |
| | RetinaNet (Lin et al., 2017) | ResNet50 | 0.88 | 0.86 | 0.73 | 0.82 |
| | RetinaNet (Lin et al., 2017) | ResNet101 | 0.89 | 0.86 | 0.73 | 0.83 |
| Dbf3SC => Dbf3Real | Faster R-CNN (Ren et al., 2017) | ResNet50 | 0.82 | 0.77 | 0.55 | 0.71 |
| | Faster R-CNN (Ren et al., 2017) | ResNet101 | 0.86 | 0.8 | 0.66 | 0.78 |
| | RetinaNet (Lin et al., 2017) | ResNet50 | 0.84 | 0.77 | 0.53 | 0.71 |
| | RetinaNet (Lin et al., 2017) | ResNet101 | 0.84 | 0.76 | 0.54 | 0.72 |
| Dbf3Real+SC => Dbf3Real | Faster R-CNN (Ren et al., 2017) | ResNet50 | 0.85 | 0.79 | 0.65 | 0.76 |
| | Faster R-CNN (Ren et al., 2017) | ResNet101 | 0.87 | 0.81 | 0.74 | 0.81 |
| | RetinaNet (Lin et al., 2017) | ResNet50 | 0.85 | 0.81 | 0.64 | 0.76 |
| | RetinaNet (Lin et al., 2017) | ResNet101 | 0.86 | 0.8 | 0.63 | 0.76 |
Source: Bhowmik et al. (2019)

In the last few years, generative adversarial networks (GANs) (Goodfellow et al., 2014; Wang, 2020) have been routinely used to generate new images from a set of similar images. Many derived GAN models have been proposed to improve the quality of the generated images (Wei et al., 2018), notably SAGAN (Zhang et al., 2019) and BigGAN (Brock et al., 2019). The new samples are not generated to be close, in whatever distance measure we use, to any existing training sample; rather, the generated and training samples should belong to similar probability distributions (Wang, 2020). However, for the task of generating X-ray prohibited item images, existing GAN-based approaches are hard to train, since the amount of available training images is insufficient. In addition, the items in baggage are placed randomly and packed tightly, so prohibited items appear at various visual angles. These factors make it difficult for a GAN to learn the common features of all threat items.
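For orientation, the adversarial recipe underlying all of these GAN variants can be summarized in a few lines. The following minimal sketch (PyTorch, with toy fully-connected networks and illustrative hyperparameters, not tied to any of the cited models) alternates the discriminator and generator updates of the original GAN objective (Goodfellow et al., 2014):

```python
import torch
import torch.nn as nn

z_dim, img_dim = 64, 28 * 28
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):                       # real: (batch, img_dim) in [-1, 1]
    batch = real.size(0)
    fake = G(torch.randn(batch, z_dim))

    # Discriminator: push D(real) -> 1 and D(fake) -> 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: fool D, i.e. push D(fake) -> 1 (non-saturating loss).
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
```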
Zhao et al. (2018) propose X-ray prohibited item image generation using GANs (Goodfellow et al., 2014). First, a novel pose-based classification method is presented to classify and label the training images. Network training is facilitated by adding pose labels to the collected images and extracting the object foreground with KNN-matting (Chen et al., 2013). Then, the CT-GAN (Wei et al., 2018) model is applied to generate many realistic foreground-only images. To increase diversity, they improve the CGAN (Isola et al., 2017) model. Finally, a simple CNN model trained on real images is employed, and tested on 100 generated images from each class, to verify whether the generated images belong to the same item class as the training images. Most of the generated images were correctly classified (see Table 10). Examples of generated data are shown in Figure 43.

Table 10. Matching results of the CNN model trained on real images: number of correctly classified generated images (out of 100 per class).

| Prohibited item | Handgun | Wrench | Pliers | Blade | Lighter | Kitchen knife | Screwdriver | Fruit knife | Hammer |
|---|---|---|---|---|---|---|---|---|---|
| Number correctly classified | 100 | 100 | 100 | 87 | 100 | 92 | 91 | 95 | 100 |


Source: Zhao et al. (2018)

Figure 43 Examples of generated image samples: (a) real X-ray images, (b) images generated by DCGAN, (c) images
generated by WGAN-GP, and (d) images generated by CT-GAN.

Source: Zhao et al. (2018)

The work of Zhu et al. (2020b) also used GAN variants to generate new images of threat objects, which were used to create entirely new X-ray security images. They adapted a Self-Attention GAN (SAGAN) (Zhang et al., 2019) model to generate new images of threat objects, and a Cycle-GAN (Zhu et al., 2020a) model to translate camera images of threat objects to their X-ray image counterparts. Then, they augmented the X-ray security image training set by combining the generated and translated images of X-ray threats with normal X-ray images from baggage scans. See Figure 44 for examples of generated images.

Figure 44 Images generated by the different GAN models.

Source: Zhu et al. (2020b)


The method was trained on a private set consisting of 4,500 real images and 4,200 synthesized images, both spread across 7 classes. From these images, a test set of 680 images was extracted for both the real and the mixed dataset. Their method improved the detection performance (measured with mAP) of a Single-Shot Detector (SSD) (Liu et al., 2016) model on a seven-class problem (power bank, lighter, fork, knife, gun, scissors, pliers) by 5.6% compared to the model trained on the original training set, leading to the conclusion that the addition of synthesized images does improve performance. However, since the method was trained and tested on a mixed dataset of real and synthesized images, the improvement might simply be a result of the larger amount of training data. It should also be noted that the datasets used were not only collected in an ideal laboratory setting but are also relatively small and balanced, even after augmentation. Hence, this approach may still not scale effectively to real-world X-ray security systems.
Another DNN method uses GAN-based data augmentation on the GDXray database (Mery et al., 2015) for object detection with Faster R-CNN (Ren et al., 2017) (see Figure 45). The method was trained on generated images and tested on the initial GDXray dataset; the results are presented in Table 11. Perhaps the main contribution of this article is the good quality of the GAN-generated images obtained using only 1,038 training images (see Figure 46).

Figure 45 Faster R-CNN features architecture.

Source: Kim et al. (2020)

Table 11. Results of the GAN-based data augmentation method on the GDXray database by Kim et al. (2020).

Results (AP = average precision):

| mAP | Handgun (AP) | Shuriken (AP) | Razor (AP) |
|---|---|---|---|
| 91 % | 91 % | 92 % | 91 % |

Source: Kim et al. (2020)

In conclusion, regarding GAN-generated images, the limitations are that current models are capable of producing only isolated objects, not full X-ray images, and that the quality of the generated images is currently quite poor (see Figure 43 and Figure 44).


Figure 46 GAN generated images after (a) 3,000 iterations, and (b) 18,000 iterations.

Source: Kim et al. (2020)

8.5 Working with 3D CT data


In recent years, progress has also been made in the area of computed tomography, although it is much more limited.
As described in Chapter 2, until recently CT scanners were mainly used for explosive detection in hold baggage, based on the effective atomic number of the materials present, and not for automated object detection (although in recent years this has been rapidly changing). Hence, the quality of CT images in airport security is lower than in medicine (Singh & Singh, 2003). That is why many image processing and automated object detection methods from the medical field cannot easily be applied to aviation security data (Megherbi et al., 2013). Another reason could be that the appearance variation of human organs in CT images is smaller than the appearance variation of prohibited items. Not only is the Hounsfield Unit (HU) distribution within non-degenerated organs and tissues usually more homogeneous and known, but their location, size and shape, as they appear in CT images, are usually standard. This is the case not only because the relative position of human organs is known, but also because the human body position within the CT scanner coordinate system is fixed. In order to improve the quality of CT images, metal artefact reduction and de-noising (Mouton et al., 2013) techniques have been suggested.
The majority of pioneering work in this domain originated in 2013 from the U.S. Department of Homeland Security's Awareness and Localization of Explosives-Related Threats (ALERT) initiative (Crawford et al., 2011). Among the proposed approaches, methods based on image processing without machine learning, with extensive post-processing requiring careful parameter tuning, are dominant. Wiley et al. (2012) and Crawford et al. (2011) presented 3D region-growing methods. The multiscale approach of Crawford et al. (2011) decomposes images into trees of connected sets to segment objects using classifiers, while Grady et al. (2012) use intensity-based segmentation with an isoperimetric algorithm to separate connected objects (Grady & Schwartz, 2006). The methods presented within this initiative were not quantitatively evaluated. All the above methods were developed using fully labelled datasets gathered under the same initiative (see Chapter 5). It is worth mentioning that the data was entirely free of both threats and prohibited materials, and had few of the electronics and objects that are commonly found in passengers' luggage nowadays.


Later, many methods based on 3D features for 3D object recognition were developed: see, for example, the rotation invariant feature transform (RIFT) (Lazebnik et al., 2005) and the SIFT descriptor work of Flitton et al. (2013), 3D Visual Cortex Modelling, 3D Zernike descriptors and the histogram of shape index (Megherbi et al., 2012). There are also contributions using known recognition techniques, such as bag of words (Flitton et al., 2015) and random forests (Mouton et al., 2014; Mouton & Breckon, 2015). This work largely concludes that the choice of the feature descriptor, the feature sampling/detection strategy and the final classification framework have a significant impact on performance, with simplistic feature descriptors (e.g. the density gradient histogram) notably outperforming 3D derivatives of more complex ones, e.g. SIFT / RIFT (Flitton et al., 2013). The evaluation for handgun detection was done using ROC curves, shown in Figure 47. The superiority of simpler feature descriptors may be due to the poor quality of the imagery available at the time: imaging artefacts and noise hinder a reliable invariant orientation for the SIFT descriptor.

Figure 47 ROC curves for handgun detection using five different feature descriptors: density descriptor (D), density histogram (DH), density gradient magnitude histogram (DGH), RIFT and SIFT.

Source: Flitton et al. (2013)

Regarding the application of deep learning to CT data in aviation security, the literature is at the moment limited. Wang et al. (2020a) developed a multi-class detection method using one unified framework. They investigate different CNN architectures, i.e. ResNet (He et al., 2016) with variable depths, under the RetinaNet object detection framework (Lin et al., 2017). They also evaluated the effectiveness of data augmentation techniques including 3D volume flipping and rotation. The dataset was created using a CT80-DR dual-energy baggage CT scanner manufactured by Reveal Imaging Inc., and is a combination of 478 real CT volumes and 287 synthetically composited ones generated with the TIP algorithm (Neiderman & Fobes, 2005). The dataset is randomly divided into two subsets, 70% for training and 30% for testing. Experimental results demonstrate that the combination of the 3D RetinaNet and a series of favourable strategies can achieve a mean Average Precision (mAP) of 65.3% over five object classes (bottles, handguns, binoculars, Glock frames, iPods). The overall performance is affected by the poor performance on Glock frames and iPods, due to the lack of data and their resemblance to baggage clutter. Some examples of false positives and false negatives are shown in Figure 48. 3D detection methods are the future of aviation security because of the increasing use of 3D scanners, lately also for cabin baggage, and the high amount of information that 3D images contain. However, the computational cost is still high, and the available datasets are even more limited than the 2D X-ray datasets because of the difficulty of annotating and storing them. As computer vision methods for material classification in 3D data commonly follow the usual structure, where objects of interest are first detected and the material is then classified based on intensity features, more 3D object detection methods will be described in section 9.3.


Figure 48 Exemplar false positive (left) and false negative (right) detection results. The false detections are emphasized
with red arrows.

Source: Wang et al. (2020a)

8.6 Chapter summary


In this chapter, we reviewed the basic concepts and structural elements of deep learning algorithms, and identified the reasons why deep learning has made such progress during the last 10 years, namely i) computational power, ii) the availability of large datasets and iii) novel optimisation algorithms. In threat detection for aviation security, however, data scarcity is an issue. One possible solution is transfer learning, and in section 8.3 we reviewed some applications of transfer learning for X-ray screening from the literature. Another approach to tackling data scarcity is data augmentation, which was introduced and discussed in Chapter 7. In section 8.4, we reviewed in more detail some examples of data augmentation for X-ray screening, including the deep learning technique known as generative adversarial networks (GANs). Finally, in section 8.5 we reviewed recent work concerning the application of deep learning to 3D images from computed tomography.


9 Materials classification

9.1 Introduction
As described in Chapter 6, most studies on automated threat detection from X-ray and CT images focus on prohibited items with a characteristic shape, while methods dealing with prohibited materials are very few. In this chapter, we give an overview of published computer vision methods applied to X-ray images of materials with characteristics similar to those of prohibited materials in airport baggage, even if some scenarios are not necessarily representative of the task of screening real-world, cluttered baggage.

9.2 2D computer vision


Various illicit materials, among them powder-like substances, do not have prominent shape or appearance characteristics (as guns, scissors, etc. do); their appearance and shape can vary greatly. That is another reason why they are challenging to detect consistently, whether by human screeners or automated methods. Colour (orange for organic and green for inorganic substances) and texture could be prominent features for prohibited materials detection, but challenges remain if the material of interest is superimposed on another material.
Uroukov & Speller (2015) published a preliminary study on materials detection from dual-energy grayscale X-ray images using textural signatures of different materials. Gabor filters (Fogel & Sagi, 1989) were used as texture descriptors, to distinguish between different organic materials that would all appear orange on pseudo-colour images and would look like most explosives. The results were not quantitative, but the image was enhanced to highlight areas with a high response to the Gabor filters (see Figure 49).
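A minimal sketch of a Gabor filter bank used to produce such a texture response map is shown below (OpenCV; the kernel sizes, wavelengths and number of orientations are illustrative assumptions, not the parameters of the cited study):

```python
import cv2
import numpy as np

def gabor_texture_response(gray, scales=(7, 11, 15), orientations=6):
    """Maximum response of a small Gabor filter bank, as a texture map.

    `gray` is an 8-bit grayscale image; the returned map is high where
    the local texture matches one of the bank's frequencies/orientations.
    """
    gray = gray.astype(np.float32) / 255.0
    response = np.zeros_like(gray)
    for ksize in scales:
        for i in range(orientations):
            theta = i * np.pi / orientations
            kernel = cv2.getGaborKernel(
                (ksize, ksize), sigma=ksize / 4.0, theta=theta,
                lambd=ksize / 2.0, gamma=0.5, psi=0)
            filtered = cv2.filter2D(gray, cv2.CV_32F, kernel)
            response = np.maximum(response, np.abs(filtered))
    return response
```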
Benedykciuk et al. (2020) published work on material classification of patches, determining the two best machine learning classifiers (random forests and SVM) to predict the material at every pixel of patches in dual-energy X-ray images. Test images consisted of three channels: low energy, high energy and zeros. The features used were simply patches of pixel intensities. Raw dual-energy data was used, and a sliding window approach was implemented for patch extraction. To achieve this, they built a large database consisting of one million samples, covering four material classes: light organic materials, heavy organic materials, mixed materials, and heavy metals.

Figure 49 X-ray image of organic content packed in a rucksack, with the computed results for the detection of all organics as well as the specific detection of tobacco obtained at 80 kV. Image areas where tobacco was hidden are indicated with the white broken line. The arrow indicates one of the hidden loose tobacco packages and the arrowheads indicate hidden boxed cigarettes. (A) & (C): greyscale X-ray images of combined organic materials including food, liquids and gels; (B): detection of the entire organic content; (D): demonstration of tobacco detection (loose and cigarettes).

Source: Uroukov & Speller (2015)


Light organic substances were represented by materials with Zeff < 8, i.e. with absorption values similar to those of explosives such as C-4, TNT, and Semtex. The light organic materials were water, ethanol, and the slightly heavier vegetable oil as liquids, while from the set of solid substances they used sugar, a CD, and plexiglass. From the heavy organic materials (8 < Zeff < 10), paper and plasticine were selected, since they appear similar to powdery drugs like cocaine or heroin. Light inorganic materials (10 < Zeff < 17) were represented by aluminium and salt; prohibited substances with similar X-ray absorbance are gunpowder or heavy fuel. Substances with Zeff > 17 were represented by steel and brass, standing in for prohibited items such as weapons, firearms and cartridges, and high-value smuggling materials such as silver and gold. For all materials, samples of different thicknesses were prepared. Several sizes of patches were extracted from materials and background, ranging from 3x3 to 15x15 pixels. This multi-scale approach was used in order to better capture the different statistical characteristics of patterns. The probability of belonging to one of the five classes was calculated, for each pixel, as the average class probability over all patch sizes. 1,076,784 training patches and 112,660 testing patches were used for algorithm development and evaluation. The success of the algorithm is evaluated using accuracy (Eqn. 5), with SVM yielding the best performance at 95% average accuracy across all classes on the test set.
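The multi-scale patch voting can be sketched as follows (pure NumPy; `classifiers` is a hypothetical mapping from patch size to a trained scikit-learn-style model exposing predict_proba, and a single-channel image is assumed for brevity):

```python
import numpy as np

def pixel_class_probabilities(image, classifiers,
                              sizes=(3, 5, 7, 9, 11, 13, 15), n_classes=5):
    """Average per-pixel class probabilities over several patch sizes.

    For each pixel, a patch of every size (centred on the pixel, with
    reflective padding at borders) is classified, and the per-class
    probabilities are averaged across all patch sizes.
    """
    h, w = image.shape
    probs = np.zeros((h, w, n_classes))
    for size in sizes:
        half = size // 2
        padded = np.pad(image, ((half, half), (half, half)), mode="reflect")
        for y in range(h):
            for x in range(w):
                patch = padded[y:y + size, x:x + size].ravel()[None, :]
                probs[y, x] += classifiers[size].predict_proba(patch)[0]
    return probs / len(sizes)
```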
A year later, the same authors published work on the same topic, using the same database, but with a CNN approach instead of conventional machine learning (Benedykciuk, Denkowski, & Dmitruk, 2021). The CNN contained five subnetworks corresponding to the patch sizes, and the weights were initialized using Xavier initialization (Glorot & Bengio, 2010). Training used over one million sample patches, and testing was done on 100,000 patches. The average accuracy was 95%, no higher than in Benedykciuk et al. (2020). For the classification of light inorganic materials, i.e. inorganic powders, precision, recall and F1-score were all equal to 94%. Both algorithms (Benedykciuk et al., 2020, 2021) were trained and tested on patches of different materials (see Figure 50) and not on the realistic cluttered bags that airline travellers usually carry. It is quite possible that the performance of the algorithms would deteriorate in a real-world situation.
Movafeghi et al. (2020) distinguish between powders and liquids in dual-energy X-ray images. Low and high energy X-ray images are generated using potentials between 20 and 40 kV and between 140 and 160 kV respectively, and computed radiography (CR) imaging plates were used for image acquisition. They use a Gabor wavelet filter bank (Movafeghi et al., 2020; Fogel & Sagi, 1989) with six scales and six orientations for noise reduction, and perform statistical analysis (mean and standard deviation) on the dual-energy Gabor-wavelet-reconstructed image to determine whether there is a statistically significant difference between pixel values belonging to different powders and different liquids. The conclusion was that there is a small but statistically significant difference between the pixel values of water and alcohol. Similarly, there was a small but statistically significant difference among the pixel values of jelly powder, caramel cream and chicken spice.

Figure 50 Pipeline for full-scan material classification in the patch-classification approach (Benedykciuk et al., 2020).

Source: Benedykciuk et al. (2020)

9.3 3D computer vision


Material-based discrimination in 3D does not have to deal with occlusion, clutter, and density confusion (Wei et al., 2018); hence, in 3D, segmentation is possible using the correlations between the effective atomic numbers and densities of materials, which is the basis of automated explosives detection in security screening applications. The problem of material classification in 3D is reduced to object segmentation, after which material classification is performed based on HU values. However, different materials do have overlapping HU value ranges (e.g. saline and rubber). Mouton and Breckon (2015) published a study on materials-based segmentation of unknown objects from dual-energy CT images of cluttered baggage. An initial materials-based coarse segmentation is generated using the Dual-Energy Index (DEI) (Johnson et al., 2011), and further refinement is done using a random-forest-based model that guides the object segmentation. The DEI is a post-reconstruction DECT measure; hence, it does not require raw data or calibration, and offers a crude estimate of the chemical characteristics of a scan. It is calculated from the pixel values of the low- and high-energy scans, and represents only an indication of the effective atomic number, Zeff.
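As a sketch, a commonly used form of the DEI is computed per voxel from the low- and high-energy CT numbers; the exact constant and normalization are an assumption here and should be checked against Johnson et al. (2011):

```python
import numpy as np

def dual_energy_index(hu_low, hu_high):
    """Per-voxel Dual-Energy Index from low/high energy CT values in HU.

    Assumes the common definition DEI = (x_L - x_H) / (x_L + x_H + 2000),
    where the +2000 shifts HU onto an attenuation-like scale; verify the
    exact form against Johnson et al. (2011) before relying on it.
    """
    hu_low = np.asarray(hu_low, dtype=float)
    hu_high = np.asarray(hu_high, dtype=float)
    return (hu_low - hu_high) / (hu_low + hu_high + 2000.0)
```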
In 2014, an initiative for automatic target recognition (ATR) was launched by a multi-university centre of excellence, ALERT (Awareness and Localization of Explosives-Related Threats), under the auspices of the U.S. Department of Homeland Security, and methods were proposed for automatic threat material recognition based on baggage CT images (Crawford, 2014). Several methods were presented within this initiative. Ye et al. (Crawford, 2014) proposed a 3D CT image segmentation approach based on intensity-based segmentation and shape cluster separation, with an SVM classifier trained on 485 ground truth objects. After object segmentation, the segmented candidate threat material signatures were classified as threats or non-threats using SVM classifiers with intensity features (minimum, maximum, mean, histogram characteristics). These classifiers are trained on different clusters of training object signatures, grouped by shape, so that the threats and non-threats falling into different clusters can be recognized by different classifiers. On the dataset gathered within the ALERT initiative (see Chapter 5), they achieved a TPR of 95% and an FPR of less than 10%.
Within the same initiative, Zhang et al. (Crawford, 2014) presented a pixel classification method implemented with an expectation-maximization algorithm and enhanced with a Markov random field to impose spatial smoothness constraints. SVM classifiers were used for material classification, one per material: rubber sheet, bulk rubber, saline, clay and non-targets. The performance is slightly worse than that of Ye et al., with a TPR of 89% and an FPR of 10%.
Wang et al. (2020b) address the problem of the adaptability of automated detection algorithms. Once such algorithms are deployed to scanners, they work for the detection of the threat materials used in the training phase. In reality, the definition of threat materials evolves as the aviation security landscape and the threat actors change, resulting in new materials and objects being added to the list of prohibited items. The deployed algorithms then need to be updated, or completely replaced, to work on these new items. This is not a trivial task; hence, it would be useful if threat detection algorithms were adaptable to new threats. To address this issue, adaptive automatic threat recognition was proposed by Manerikar et al. (2020) under the ALERT centre of excellence. They proposed a multi-scale 3D single-energy image segmentation algorithm to segment objects of different scales, shapes and intensities. For this purpose, multi-scale connected component analysis and a histogram-based intensity split were proposed (Figure 51). Once the separate objects are segmented, an SVM with a linear kernel is used to classify the segmented objects into different material classes. The features used for classification were normalized intensity histograms. Each object is assigned a probability of belonging to a certain class, which can be adjusted in case a bias towards certain materials is needed. The method recognized four classes: saline, rubber, clay and others.

Figure 51 An illustration of the intensity-based split, with mean Hounsfield Unit (HU) values on the x-axis.

Source: Wang et al. (2020b)

Martin et al. (2015) proposed a novel method in which not all objects in the dual-energy 3D image are segmented and then classified by material; instead, the initial segmentation is done only for a small number of objects made of materials belonging to a given list. They attempted to create a method resistant to artefacts and noise, a common problem in CT data. Instead of explicitly modelling the materials via the effective atomic number and density parameters, they use a machine learning method to estimate the appearance models of the materials of interest directly in the space of the reconstructed images. Voxel classification was done using a KNN algorithm, spatial smoothing was applied within possible object boundaries detected via high gradients, and pixel weighting based on the distance to metal objects was used to mitigate the effects of metal artefacts. Finally, energy minimization using graph cuts, applied to a function containing all three components, was used for object detection and material classification (see Figure 52).

Figure 52 (a) Scatter plot of the high-energy attenuation voxel values versus the low-energy attenuation voxel values for different materials (values in HU); left: values from all labels; right: zoom on the purple box in the lower-left corner of the left plot. (b) Axial and (c) coronal views of test bag results; left: high-energy attenuation in HU displayed in the range [−1000, 500]; middle: KNN results; right: final version of the method of Martin et al. (2015) with spatial smoothing and data weighting.


Source: Martin et al. (2015)

The method was evaluated using the ALERT TO3 database (Crawford et al., 2013). The database was created using a C300 electron-beam medical scanner, and since no real threats were scanned, a set of objects of interest was defined: water, saline (doped water), and rubber. Water and saline in plastic and glass bottles, as well as rubber sheets, were inserted into cluttered bags. Two bags were used in the experiments, one with all 390 slices used for training, and the other used for testing. The results are calculated on 5 slices, since ground truth existed only for them, resulting in a precision above 80% for water and saline, and 50% for rubber, while recall was above 90% for all three materials. The authors argued that the precision values were relatively low because of inadequate ground truth (too small and missing some of the objects).
Wang & Breckon (2020) used deep learning for material classification formulated as a semantic segmentation problem, using two methods: a 2D U-Net architecture extended to 3D (Çiçek et al., 2016; Ronneberger et al., 2015), and point cloud processing methods (Qi et al., 2017a, 2017b). The segmentation results of the 3D U-Net and PointNet++ are voxel-wise and point-wise class labellings respectively, which are further post-processed using morphology operations and connected components analysis. In their experiments they use the Northeastern University Automated Threat Recognition (NEU ATR) dataset (Crawford, 2014), collected and annotated by NEU ALERT. The dataset consists of 188 CT volumes containing 446 object signatures of three target materials (saline, rubber and clay) and other non-target materials as the cluttered background of typically packed baggage. The ground truth voxels are labelled by NEU ALERT for all objects of the three target materials. The dataset is split evenly into two subsets, an odd set and an even set containing the 94 odd- and even-indexed volumes respectively (i.e. a 50/50 training/testing data split). The best results are obtained using the 3D U-Net model with the lowest down-sampling: an average IoU of 75%, overall precision of 85% and recall of 87%, with the lowest material-wise precision and recall achieved for saline, 78% and 82% respectively. The highest probability of detection (PD), 90%, and the lowest probability of false alarm, 4%, were achieved using the 3D U-Net with down-sampling by a factor of 4 (see Figure 53).

Figure 53 Qualitative evaluation of material segmentation and classification using varying methods for examples A-E; the materials present are saline (orange), rubber (green), and clay (blue).

Source: Wang & Breckon (2020)

In 2021, the same authors addressed the problem of the high computational cost of 3D methods introduced by the use of 3D CNNs (Wang & Breckon, 2021). They explored the possibility of addressing illicit material detection in 3D using deep-learning-based 2D semantic segmentation. They used axial, sagittal and coronal slices to represent the whole 3D volume. These three types of 2D slices, plus all the 2D slices combined, are used to train 4 classifiers using fully convolutional networks (FCN) (Long et al., 2015) with ResNet101 (He et al., 2016).


Labelling 3D data is very laborious; therefore, the authors used pseudo-labelling in the following way. Only a few slices per dataset were annotated for training, while the labels in neighbouring slices were deduced from the labelled ones using the FCN. Again, the ALERT NEU ATR dataset (Crawford, 2014) was used for evaluation, and the 2D-based semi-supervised method was shown to be superior to the best 3D-based segmentation methods from Wang & Breckon (2020). It achieved state-of-the-art performance even when the number of annotated slices for training was significantly reduced to 1/128 of the full annotation (equivalent to ~1-2 slices per CT), with an overall mean IoU of 77%, overall precision of 91%, recall of 88%, overall TPR of 90% and FPR of 3%. Qualitative results are shown in Figure 54.

Figure 54 Qualitative evaluation of detection results of different approaches (from left to right: CT volumes, ground truth labels, the 2D FCN method using semi-supervised learning (Wang & Breckon, 2021), and the 3D Residual U-Net (Lee et al., 2017)). The materials shown are saline (orange), rubber (green), and clay (blue).

Source: Wang & Breckon (2021)

It should be added that many published algorithms apply machine learning methods to material classification in conventional images. Most of them use texture features extracted with various operators (Hu et al., 2011; Liu et al., 2010; Qi et al., 2014; Schwartz & Nishino, 2013), some use reflectance features (Cula & Dana, 2004; Gu & Liu, 2012; Lombardi & Nishino, 2012; Zhang et al., 2015), and there are several publications using deep learning methods (Bian et al., 2018; Bunrit et al., 2019; Cimpoi et al., 2015; Xu & Muselet, 2020). Regarding 3D CT medical images, one recent example is successful lung nodule type classification using multi-scale CNNs (Liu et al., 2018).

9.4 Non-computer-vision methods


Prohibited materials can be detected using various methods, with varying degrees of success. For example, infrared hyperspectral imaging can be used to detect traces of explosive handling on passengers’ fingers, as described in Fernández de la Ossa et al. (2014). It took five minutes to successfully analyse the fingerprints of volunteers who had previously handled energetic materials (ammonium nitrate, black powder, single- and double-base smokeless gunpowders and dynamite). Although it requires a long processing time, this is a promising forensic tool for the detection of explosive residues and other related samples.
One method for materials classification is X-ray diffraction (XRD), as described in Chapter 2. It can achieve a high detection rate and a low false alarm rate for crystalline substances, including inorganic powders. These systems are slow and expensive and are therefore still used as a third tier for clearing false alarms in hold baggage screening, and not yet commonly implemented for cabin baggage screening. However, as described in Chapter 2, they are becoming faster and more sophisticated, and more promising for both hold and cabin baggage screening.
As many threat materials are polycrystalline (Marticke, 2016), Dicken et al. (2010) developed a method for explosive detection using powder diffraction. Their system, presented in Figure 55, consists of a fine-focus X-ray tube with a molybdenum target and a 40 kV accelerating voltage. A scintillator detector is mounted on an arm that rotates about a central point at constant radius. A global rotation of the inspection volume is proposed in this system in order to separate different materials (Marticke, 2016).

Figure 55 X-ray diffractometer for explosive detection.

Source: Dicken et al. (2010)

As mentioned in Chapter 2, angular dispersive XRD (ADXRD) detection methods are time consuming and require cooling of the detector, while energy dispersive XRD (EDXRD) is less bulky and faster, works with a room-temperature spectroscopic detector, and allows the acquisition of a whole diffraction spectrum in one step. Therefore, Dicken et al. (2010) proposed another powder classification method, which uses EDXRD. A complete detection system for airline security was developed by Madden et al. (2008). It was tested for concealed explosives detection at the Transportation Security Laboratory and on airline passenger baggage at Orlando International Airport. When combined with CT EDS, it reduces the false alarm rate, avoiding manual bag checks (Madden et al., 2008). It targets a small area in the object, defined by an algorithm analysing the CT volume. A collimation system limits the inspected volume to a diamond-shaped region, about 50 mm in height by 25 mm in length by 1.5 mm in width (see Figure 56). The collimation system can also switch between two scattering angles, chosen depending on the attenuation of objects within the inspected bags (Marticke, 2016). A support vector machine (SVM) (Hearst et al., 1998; Vapnik, 1999) using wavelet features (Chui, 1993) is used to separate the samples into explosives and non-explosives.
Deep learning methods are also used for the classification of XRD patterns. In their work, Lee et al. (2020) use CNN models to classify 170 inorganic compounds. Although the CNN is trained using simulated XRD data, a test with real experimental XRD data returned an accuracy of nearly 100% for phase identification and 86% for three-step phase-fraction quantification.
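
As a purely illustrative sketch of this kind of model (not the architecture of Lee et al. (2020); the layer sizes, spectrum length and class count are assumptions), a small 1D CNN for classifying diffraction spectra could look as follows:

```python
import torch
import torch.nn as nn

class XRDClassifier(nn.Module):
    """Toy 1D CNN mapping a diffraction spectrum to one of n_classes
    compounds; all dimensions are illustrative."""
    def __init__(self, n_classes=170):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(64),      # fixed-length summary of the spectrum
        )
        self.classifier = nn.Linear(32 * 64, n_classes)

    def forward(self, x):                  # x: (batch, 1, spectrum_length)
        return self.classifier(self.features(x).flatten(1))

model = XRDClassifier()
logits = model(torch.randn(8, 1, 4096))   # e.g. a batch of 8 simulated spectra
```

Such a model would typically be trained on simulated spectra and then tested on experimental ones, as in the cited work.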

Figure 56 XRD system for baggage inspection conceived by L-3 Communications Security and Detection Systems and an
example of an acquired diffraction spectrum.

Source: Marticke (2016)

On the industrial side, Morpho Detection, now part of Smiths Detection, developed an X-ray diffraction system for luggage inspection (Marticke, 2016). The XRD 3500 is an EDXRD device for airport security which works with a germanium detector and is used as a second-level control for luggage. There has also been work on a device with a CdZnTe detector dedicated to cabin luggage inspection at the first control level (Kosciesza et al., 2013; Marticke, 2016).
There exists some limited work on material classification based on X-ray imagery that does not involve object detection and classification in cluttered baggage. Such material discrimination methods for dual-energy and multi-energy X-ray images do not cope well with realistic situations where layers of different materials exist at a given point. Nevertheless, we will mention some of the published work in this area. For example, Chen et al. (2014) propose a classification-curve (R-curve) based method using a high-energy dual-energy 6/3 MeV X-ray digital radiography imaging system for cargo inspection. Specifically, they discriminate between Pb, Fe, Al and C by using the ratio of the attenuations at the two X-ray energies after penetrating the materials. The results are shown only visually, without quantification. Chang et al. (2019) similarly classify three fine-grained organic materials based on the energy spectrum obtained from X-ray transmission images. In their work, the energy was varied from 8 keV to 90 keV, yielding 42-dimensional feature vectors. Three classifiers were used: gradient boosted decision trees (GBDT) (Natekin & Knoll, 2013), KNN (“K-Nearest neighbors algorithm,” 2005) and SVM (Hearst et al., 1998; Vapnik, 1999), evaluated using classification accuracy on three types of materials: C1 with properties similar to Teflon (heroin, cocaine, etc.), C2 similar to polymethyl methacrylate (PMMA) (water, sugar, etc.), and C3 containing light materials with a low effective atomic number (Zeff) and density (gasoline, alcohol, etc.). Using feature selection, it was shown that only four energy levels (8 keV, 10 keV, 12 keV and 14 keV) are sufficient to classify those materials, with the best results achieved by the GBDT classifier at an average accuracy of 85%.
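A minimal sketch of this classification pipeline follows, using scikit-learn and synthetic placeholder data standing in for the 42-dimensional energy-spectrum features of Chang et al. (2019); the data, class labels and selection score are illustrative assumptions, not details from the cited work.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data: one column per measured energy level, one row per sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 42))
y = rng.integers(0, 3, size=300)            # three material classes C1-C3

# Select the k most discriminative energy levels, then classify with GBDT.
pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=4),
    GradientBoostingClassifier(),
)
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```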
Osipov et al. (2019) use dual- and multi-energy methods to distinguish between materials with Zeff = 6 (light organic), Zeff = 9 (mineral materials), Zeff = 13 (light metals), Zeff = 20 (calcium), Zeff = 26 (metals), and heavy metals with Zeff > 50. They built a simulation model of cargo inspection facilities with a material identification function for test objects and their fragments, where material identification is performed by converting dual-energy and multi-energy images into the mass thickness and effective atomic number of the test objects.
In 2011, Eger et al. (2011) published a learning-based approach to explosive detection using multi-energy X-ray CT. The hypothesis is that the discriminative power of multi-energy CT derives from its sensitivity to the attenuation-versus-energy curves of materials (see Figure 57). Hence, they extract features from these curves and build an SVM classifier to distinguish between explosive and non-explosive materials. They used 84 examples of explosives and 40 examples of non-explosives, i.e. a total of 124 materials. The explosives and their chemical formulas were taken from the 1985 LLNL explosives handbook (Dobratz & Crawford, 1985). Each linear attenuation coefficient (LAC) curve was sampled at 141 energy levels, and these attenuation coefficient values served as features. Training was done on 80% of the data and testing on 20%, resulting in a TPR of 85%. One of the interesting findings of the study is that superior classification performance is obtained when using features different from the standard photoelectric and Compton coefficients.
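The following is a minimal sketch of such a classifier, again with synthetic placeholder data standing in for the sampled LAC curves; the kernel choice and preprocessing are assumptions, not details from Eger et al. (2011).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: each row is one material's LAC curve sampled at 141
# energy levels; labels mark explosive (1) vs non-explosive (0).
rng = np.random.default_rng(0)
X = rng.normal(size=(124, 141))
y = np.concatenate([np.ones(84), np.zeros(40)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```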

Figure 57 Examples of linear attenuation coefficient (LAC) curves.

Source: Eger et al. (2011)

9.5 Chapter summary


Most studies on automated threat detection from X-ray and CT images focus on prohibited items with a characteristic shape, while methods dealing with prohibited materials are comparatively few. In this chapter, we review computer-vision methods applied to X-ray images of materials with characteristics similar to those of prohibited materials.
For 2D computer vision, several studies have been published using dual-energy X-ray images, in which
classifiers have been developed to distinguish between different classes of materials using patches of pixel
intensities, texture descriptors and traditional machine learning classifiers (random forests and SVM). Using
the same dataset, a CNN produced an average accuracy that was not higher than the traditional machine
learning classifiers. Another paper describes the classification of different powders and liquids from dual-
energy X-ray images using Gabor wavelets and traditional machine learning classifiers (random forest and
support vector machine).
For 3D computer vision, material-based discrimination does not have to deal with occlusion, clutter and density confusion; in 3D, segmentation is possible using the correlation between the effective atomic numbers and densities of materials, which is the basis of automated explosives detection in security-screening applications. In the last five years, several papers have been published on the use of deep learning for material classification based on the Northeastern University Automated Threat Recognition (NEU ATR) dataset. Regarding 3D CT medical images, one recent example is the successful classification of lung nodule types using multi-scale CNNs.
Prohibited materials can also be detected using non-computer-vision methods, with varying degrees of success. For example, infrared hyperspectral imaging can be used to detect traces of explosive handling on passengers’ fingers. A widely used method for materials classification is X-ray diffraction (XRD), as described in Chapter 2, and deep learning methods are also applied to the classification of XRD patterns. There exists some limited work on material classification based on X-ray imagery that does not involve object detection and classification in cluttered baggage; such material discrimination methods for dual-energy and multi-energy X-ray images do not cope well with realistic situations where layers of different materials exist at a given point.

10 Horizontal issues

10.1 Introduction
In this chapter, we discuss some of the horizontal issues and challenges arising from the increased use of AI
in aviation security.

10.2 Testing AI system performance


Testing AI system performance is a very important aspect of AI system safety and accountability. How well the system performs is crucial in situations where AI systems directly affect human lives (e.g. healthcare, transportation, military, and aviation). As shown in Chapter 4, there are many evaluation metrics for assessing the performance of an automated system, each addressing a specific aspect. It is of utmost importance to create a set of metrics that are clearly explained and that measure the most critical aspects of the AI application.
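A minimal sketch of such a metric set follows, using scikit-learn; the choice of metrics and the decision threshold are illustrative, not prescriptive.

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

def evaluate(y_true, y_score, threshold=0.5):
    """Compute a small, explainable metric set for a threat detector.

    y_true: ground-truth labels (1 = threat present); y_score: detector
    confidence scores; threshold is the assumed operating point.
    """
    y_pred = [int(s >= threshold) for s in y_score]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "detection rate (recall / TPR)": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "false alarm rate (FPR)": fp / (fp + tn),
        "AUC (threshold-independent)": roc_auc_score(y_true, y_score),
    }

print(evaluate([1, 0, 1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]))
```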
Due to the non-deterministic and often black-box nature of AI systems, traditional testing is not enough (AI HLEG, 2019). Testing should start already at the training phase: inspecting the training data, ensuring that the data is representative of the problem in question, and determining the types of examples that can occur in a real-life, operational setting and on which the system could have problems. Additionally, the environment in which the algorithm is tested can have a large impact on performance. This is relevant, for example, for body scanners and suspicious-behaviour detection at airports, where light conditions can have a large influence on performance. Since transfer learning is widely used in algorithm development (see section 8.3), it is also necessary to inspect the data on which the pre-trained model was trained, and the model itself. The behaviour of a system must be tested as a whole, and not only as components operating under controlled (laboratory) conditions.
A proper evaluation process should deliver not only a reliable indication of accurate and inaccurate predictions, but also how likely these errors are and in what kinds of situations inaccurate predictions are more probable. An assurance model for machine learning (ML) applications is proposed in EASA (2020b). This report identifies elements for learning assurance in a W-shaped development cycle, as an extension of the traditional development assurance framework for software development (see Figure 58).

Figure 58 Assurance model for machine learning (ML) applications.

Source: EASA (2020b)

10.3 Scarcity of data


It is clear that data is not only crucial for a thorough evaluation of an AI system’s performance, but is also a driving force for its success. The larger and more representative the available data, the better the achievable prediction. This is especially true for deep learning systems, which comprise the majority of deployed algorithms (see Figure 34). Worse still, deep learning methods simply cannot converge if not enough data is used for training. Depending on the difficulty of the classification/detection problem, tens of thousands of examples per class are usually a necessary minimum. This is particularly problematic in the field of automated threat recognition. While for face recognition, or for almost any problem that uses conventional RGB images, a huge amount of data can be gathered on the internet, X-ray or CT scans of bags are not easy to create. Moreover, X-ray images of bags with illicit substances or objects are even more difficult to gather in such numbers. As noted in Chapter 5, there are only two publicly available datasets, and they contain only a few object classes. Accessible datasets containing explosives or other illicit materials do not exist.
The problem is compounded by the fact that labelling such specific data is very costly. Labelling in general is a time-consuming activity, especially for images containing several objects of interest, which is often the case in threat detection. Compared to labelling RGB images of everyday objects, it is much more costly to label, for example, medical data, where highly educated radiologists need to work many hours on it, or airport X-ray data, which requires experienced screeners. For instance, Liang (2020) reported that labelling images even without localisation (i.e. without drawing accurate bounding boxes around objects of interest) required 30,000 person-hours for 328,000 images. Additionally, preparation for imaging, which required assembling bags, scanning and hand labelling, took well over 250 person-hours for 4,000 scans over the span of several months. The task becomes much more time consuming when accurate delineation of the object is required, given that objects come in a large variety of shapes, sizes, poses and settings, and are usually in a cluttered bag. Additionally, many objects or materials are very difficult to see in the image, and it can take many minutes of deliberation just to label one image (as confirmed by the authors’ personal experience).
Another difficulty with data is that transferring models between domains and between different X-ray scanners is challenging and usually does not lead to good results, because of non-quantified differences between different models of scanners (Akçay & Breckon, 2020; Gaus et al., 2019). Hence, it is usually not possible to take data or models derived from one X-ray scanner and apply them to new data obtained from a different X-ray scanner. What is needed for data scanned by different X-ray scanners is some sort of intensity and projection normalisation to a generally accepted standard against which everyone would develop their algorithms. There are some examples of AI methods developed to transfer models trained on one data source to another (Zhu et al., 2020a).
Finally, real-world environments tend to change over time, while many AI models remain static because they were trained once, on certain data, in the past. For airport passengers’ baggage, this change happens not only over longer periods (e.g. far more electronics and lithium batteries in bags today compared to ten years ago), but also seasonally (e.g. the content of passengers’ bags differs between summer and winter). Additionally, security rules keep changing, and items are added to (or, more rarely, removed from) the list of prohibited items. Finally, the items themselves sometimes change their appearance over time (e.g. consumer laptops become lighter, with more aluminium and SSDs instead of HDDs).
In order to keep up with these changes and remain fit for purpose, systems ideally need to be adaptive, which is possible in two ways: (i) creating self-learning AI systems that improve and learn from new data (reinforcement learning), and/or (ii) retraining using updated datasets. In self-learning systems, we let an AI system make decisions while a human user labels every wrong decision, and in that way the system learns throughout its operational lifetime (Parisi et al., 2019). This is an interesting concept that is already used in some industries. For example, in recommendation systems used in online e-commerce, reinforcement learning is used to suggest to users what they might like to buy. However, as stated in the EASA AI Roadmap (EASA, 2020a), real-time learning as part of high-risk AI systems such as those in aviation will introduce a great deal of complexity into the already difficult task of creating trustworthy AI systems, and would probably require some time to be incorporated into regulations and guidance. On the other hand, creating large, representative, standardised and up-to-date datasets with standardised evaluation metrics, for both training and evaluation, and making them available to research groups and companies, would facilitate competition in algorithm development and bring more reliable products to the market.
Some solutions to the ‘lack of data’ problem involve algorithmic techniques (data augmentation) such as TIP, GANs, or simple rotation, translation and scaling of existing objects (a minimal sketch of these simple geometric transforms is given at the end of this section). Nevertheless, as stated in section 8.4, TIP and object deformations do not give very good results; GANs, in turn, need a lot of data to train on, and the fact that airport passengers’ bags are tightly packed with randomly placed objects is unfavourable for GANs. To date, the quality of the generated images is far from realistic. More advanced algorithms for image translation or domain adaptation (Isola et al., 2017; Zhu et al., 2020b) could be considered, again requiring a solid training database (see section 8.4 for more details). Transfer learning is another approach used when datasets are scarce, but it commonly leads to bad results when the transfer is applied between different domains, when the difference in appearance is large, or when the target dataset is small (see section 8.3 for more details). All the evidence points to the need to build large, realistic datasets that can be shared amongst bona fide security practitioners.
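
As a minimal sketch of the simplest geometric augmentations mentioned above (rotation, translation and scaling), assuming a SciPy environment; the parameter ranges are arbitrary and would in practice be tuned to the scanner geometry:

```python
import numpy as np
from scipy import ndimage

def augment(image, rng):
    """Produce one randomly rotated, shifted and rescaled copy of an image.

    The ranges below are illustrative assumptions, not recommended values.
    """
    rotated = ndimage.rotate(image, angle=rng.uniform(-15, 15), reshape=False)
    shifted = ndimage.shift(rotated, shift=rng.uniform(-10, 10, size=2))
    return ndimage.zoom(shifted, zoom=rng.uniform(0.9, 1.1))

rng = np.random.default_rng(0)
augmented = [augment(np.zeros((128, 128)), rng) for _ in range(4)]
```

Note that such transformations only re-express objects already in the data; they cannot create genuinely new threat appearances, which is why they mitigate rather than solve the scarcity problem.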

10.4 Transparency, traceability and explainability


As explained in the Ethics Guidelines for Trustworthy AI (AI HLEG, 2019), transparency encompasses the
transparency of the components of an AI system: the data, the system and the business models. It is
important to know on what kind of data an AI system is trained, and how the data was labelled, so that one
can trace back to the training data why an AI system behaves in a certain way. This can help predict cases in
which the system might be less successful and, subsequently, to take steps to minimise the probability of
such mistakes occurring.
When we talk about transparency of the system itself, we mean the explainability of the AI system: the ability to explain its technical processes to human beings. This is an issue with DNN algorithms, which make up an ever greater share of commercial implementations. There seems to be a trade-off between the explainability of algorithms and their success: easy-to-understand algorithms (e.g. rule-based classifiers) are less successful than “black-box” DNNs whose billions of parameters together constitute a successful decision-making tool, but with little understanding of what all those parameters mean and represent. Sometimes small changes in data values can result in dramatic changes in interpretation. Even if an AI model is deterministic from a mathematical perspective (e.g. fixed weights in a neural network), for any new input the output will depend on the correlation between that input and the dataset that was used for training, and this can lead to unpredictable outputs that may be difficult to explain (EASA, 2020a).
For example, adding noise that is invisible to human eyes to an image of a panda can cause an AI system to claim with high confidence to “see” a gibbon in the image (Goodfellow et al., 2015). Furthermore, Su et al. (2019) show that 71% of images can be misclassified by changing just one pixel. This strategy is often used in adversarial attacks. A famous example is that of Microsoft Azure’s computer vision API detecting sheep in an image with no sheep, simply because the image contained a typical sheep-like landscape. A similar case is that of a DNN algorithm detecting wolves instead of dogs in images with snow. Analysing the sheep and wolf examples, we can assume that this kind of error would probably not happen if the training dataset contained enough images of sheep and wolves in varied environments. This is yet another example of the importance of the training dataset, and of transparency about dataset content and the labelling process. On the other hand, humans have a tendency to assume computers “think” the same way we do, and do not expect such mistakes from a generally successful machine that can often “see” objects that humans cannot. That is why unexpected mistakes can be dangerous: not only because of possibly grave consequences (in the health or safety domain), but also because they hugely undermine the trust humans place in AI systems, as explained in Chapter 3.
A relatively new field of research, explainable AI (XAI), tries to address this issue, seeking to better understand a system’s underlying mechanisms and find solutions (Adadi & Berrada, 2018). XAI tries to create new methods that maintain high performance while being more explainable to humans. Two of the most prominent actors in the XAI field are a group of academics called FAT 1, and the Defense Advanced Research Projects Agency 2 (Gunning, 2019; Turek, 2016). There are an increasing number of workshops and conferences dedicated to this topic (Biundo et al., 2020; Farina & Reed, 2017; Goebel, 2018; Graaf et al., 2018; Guyon et al., 2017; Kim et al., 2018; Komatsu & Said, 2018).

1 https://www.fatml.org
2 https://www.darpa.mil

As presented in the XAI survey by Adadi and Berrada (2018), methods that enable AI system explainability can be broadly divided into three categories:
— complexity-related methods
— scope-related methods
— model-related methods.
Complexity-related methods start from the assumption that the more complex the model, the more difficult it is to interpret. Hence, simple models that are fully explainable belong to this group. Additionally, reverse engineering is applied to highly complex “black-box” models to try to explain how they work, without knowing the inner mechanisms of the original models (Krening et al., 2017; Mahendran & Vedaldi, 2015; Mikolov et al., 2013).
Regarding scope-related methods, there exist global interpretability methods focusing on the global outcome of the model (Caruana et al., 2015; Letham et al., 2015; Yang et al., 2018), and local interpretability methods that explain the reasons for a single prediction. Local methods can work, for example, by finding regions of an image that were particularly influential for the final classification (Simonyan et al., 2014; Zeiler & Fergus, 2014; Zhou et al., 2015), or by using local gradients that show how a data point has to be moved to change its predicted label (Baehrens et al., 2010).
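The occlusion approach to local interpretability can be sketched in a few lines; the patch size, stride and grey-fill value below are illustrative assumptions, and `predict` stands in for any trained model.

```python
import numpy as np

def occlusion_map(image, predict, patch=16, stride=8):
    """Slide a grey patch over the image and record how much the model's
    confidence for its predicted class drops at each position.

    predict: a stand-in function mapping an image to a confidence score.
    """
    baseline = predict(image)
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heatmap = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            y, x = i * stride, j * stride
            occluded[y:y + patch, x:x + patch] = image.mean()  # grey out
            # A large confidence drop marks an influential region.
            heatmap[i, j] = baseline - predict(occluded)
    return heatmap
```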
The third group, model-related methods, is the most general group of algorithms, of which we mention only a few approaches: simpler models trained on the outputs of the original, more complex model (Bastani et al., 2017; Ribeiro et al., 2016; Thiagarajan et al., 2016), and decomposition approaches that extract rules at the level of individual units within the trained ANN (Bach et al., 2016; Montavon et al., 2017; Robnik-Sikonja & Kononenko, 2008).

10.5 Resilience to attacks


AI systems can be vulnerable to specific types of attacks. We have already mentioned adversarial attacks in the section on explainability (section 10.4). Adversarial attacks harm an AI system by supplying input data that causes it to malfunction. Deep knowledge of how the system works is needed to create such poisoned data, which usually does not look suspicious to human eyes. Some examples of this type of data are described in Goodfellow et al. (2015). A single perturbation, often called a universal adversarial perturbation (UAP), can even be generated that fools a DNN on most images (Zhang et al., 2021). As another example, face recognition systems at border crossings are subjected to so-called morphing attacks, where photos of two people are morphed, using image processing, into one synthetic image that matches not one but two persons. It was shown that the issuing protocol of the ePass presents a security issue with respect to morphing algorithms (Ferrara et al., 2014). The typical defence against adversarial attacks is to train the ANN with adversarial examples. It is therefore important to be aware of the possible vulnerabilities of the system. So far, no adversarial attacks have been reported against automated threat recognition algorithms.
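
The perturbation technique of Goodfellow et al. (2015), the fast gradient sign method (FGSM), can be sketched as follows in PyTorch; the epsilon value is an arbitrary illustration, and in adversarial training the resulting examples would be fed back into the training set.

```python
import torch

def fgsm(model, x, y, epsilon=0.01):
    """Fast gradient sign method: nudge each input value a small step in
    the direction that most increases the classification loss.

    epsilon controls the perturbation strength (illustrative value).
    """
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # A per-pixel step that is imperceptible to humans can be enough
    # to flip the predicted class.
    return (x + epsilon * x.grad.sign()).detach()
```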

10.6 The need for harmonised testing data


Interoperability (the ability of two systems to communicate effectively) and data interoperability (the ability to integrate two or more datasets) are key for machine learning development. For smartphone apps and for applications in online shopping, banking and other industries with products for everyday usage, we now expect that the platforms we use to exchange information can communicate seamlessly whenever we need them to. When systems are not interoperable, AI algorithms are less successful, since they have less data to operate on.
For example, patients are increasingly tracking and generating large volumes of personal health data through wearable sensors and mobile health apps. If a physician could integrate this data with the data from the health record, a better analysis of the patient’s health could be performed, while such data from many patients could support research and development. To achieve interoperability, data harmonisation is an essential step. Data harmonisation means reconciling various types, levels and sources of data into formats that are compatible and comparable, and thus useful for better decision making.

An approach to address the interoperability challenges is provided by the European Commission European
Interoperability Framework (European Commission, 2017). The framework distinguishes four interoperability
levels (technical, semantic, organisational and legal interoperability) under an overarching integrated
governance approach (see Figure 59). To enable interoperability between data for AI algorithms, each of these
interoperability levels needs to be addressed.

Figure 59 European interoperability framework model.

Source: European Commission (2017)

As stated in the EASA roadmap, the European Commission proposes data and algorithm sharing in the form of common “European libraries” that would be accessible to all. Both data and AI algorithms are key instruments for ensuring the independence of EU industry from the “AI mega players” (EASA, 2020a). The European Commission will start preparatory discussions to develop and implement a model for data sharing and making best use of common data spaces, with a focus on transport, healthcare and manufacturing (EASA, 2020a). There are already some initiatives for sharable data/AI infrastructure, so-called data lakes, namely AirSense, DataBeacon, OpenPrisme, Skywise, Topsky, Data4Safety and GAIA-X (European Aviation Artificial Intelligence High Level Group, 2020). What is needed instead is a unified approach. Moreover, these initiatives centre on air traffic control and weather data, not on the threats hidden in passengers’ baggage or the detection of suspicious behaviour in airports. For such initiatives to be successful, close cooperation of all concerned stakeholders, Member States, industry, societal actors and citizens is required (EASA, 2020a). Ideally, a single European data hub, with data gathered from airports, a data preparation and standardisation environment, and shared AI detection algorithms and evaluation metrics in place, could bring benefits of scale.
According to data published by the TSA (Transportation Security Administration, 2019, 2020), thousands of loaded firearms are found in carry-on baggage every year. A positive example, meaning an explosive in a bag or a terrorist behaving suspiciously while preparing an attack, is a rare occurrence at European airports. Hence, although a huge number of images of passengers’ bags is produced every day, there are (thankfully) almost no examples among them containing threats. Even when suspicious-looking or prohibited items are found in passengers’ luggage, this data unfortunately remains unlabelled and uncollected under current operational procedures (Liang, 2020).
A process that enables the reuse of data to develop algorithms is therefore highly desirable. This does not require all data from all airports to be included in a single system; if all airports store their data in a standardised way and send it to a shared database (either centralised or distributed), algorithms will be able to make sense of it. An add-on to every airport scanner’s software, enabling human screeners to label objects of interest with a single click, could dramatically improve the content of current X-ray image databases (a sketch of such a label record follows this paragraph). Additionally, creating datasets that include explosives is practically impossible for industry, and their occurrence at airports is rare. Scientific institutes that work with explosives, or that produce explosive simulants and prohibited items, could be mandated to produce large image datasets. In addition to data labelling and data standardisation, quality control of data is also required to obtain validated and clean data ready for use. However, there are currently no standard tools to fulfil all these tasks.
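As a purely hypothetical illustration of what a standardised, one-click label record could contain (the field names below are assumptions, not an existing standard), consider:

```python
from dataclasses import dataclass

@dataclass
class LabelRecord:
    """Hypothetical record a screener add-on could emit per labelled object."""
    scanner_model: str   # needed because images do not transfer across scanners
    image_id: str        # reference to the stored X-ray image
    item_category: str   # e.g. an entry from the prohibited items list
    bounding_box: tuple  # (x, y, width, height) in pixels
    screener_id: str     # supports labelling quality control

record = LabelRecord("example-scanner", "img-0001", "firearm",
                     (120, 64, 80, 40), "screener-17")
```

Agreeing on such a schema across airports and vendors is precisely the kind of semantic interoperability the framework above calls for.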

Having control over data brings us closer to transparent AI systems, one of the key requirements for trustworthy AI. Knowing which data a system was trained on and how exactly that data was labelled helps us understand which kinds of errors are more likely than others. Standardised evaluation metrics, with results that are as descriptive as possible and computed on the latest up-to-date data from airports, bring us even closer to AI system transparency. Finally, given the importance of AI system explainability, a continued effort (e.g. via workshops) is needed to raise awareness in the aviation security community of the importance of explainable AI algorithms. This could also lay the groundwork for explainable-AI software libraries that industry and academia could use to analyse and “explain” the methods behind their algorithms.

10.7 Chapter summary


Due to the non-deterministic and often black-box nature of AI systems, traditional testing is not enough. Testing should start already at the training phase: inspecting the training data, ensuring that the data is representative of the problem in question, and determining the types of examples that can occur in a real-life, operational setting and on which a system could have problems. It is necessary to test the behaviour of a system as a whole, and not only its components operating under controlled (laboratory) conditions. An assurance model for machine learning (ML) applications is proposed in EASA (2020b).
The larger and more representative the available data, the better the achievable prediction, especially for deep learning systems. Data scarcity is problematic in the field of automated threat recognition; X-ray or CT scans of bags are not easy to create, and X-ray images of bags with illicit substances or objects are even more difficult to gather in large numbers. The problem is compounded by the fact that labelling such specific data is very costly. Another difficulty is that transferring models between domains and between different X-ray scanners is challenging and usually does not lead to good results, because of non-quantified differences between different models of scanners. Hence, it is usually not possible to take data or models derived from one X-ray scanner and apply them to new data obtained from a different X-ray scanner. What is needed is some sort of intensity and projection normalisation to a generally accepted standard against which everyone would develop their algorithms.
Some solutions to the ‘lack of data’ problem include data augmentation (e.g. GANs, or simple rotation, translation and scaling of existing images). Nevertheless, object deformations do not give very good results, and GANs themselves need a lot of data to train on. Transfer learning is another approach when datasets are scarce, but it commonly leads to bad results when the transfer is applied between different domains, when the difference in appearance is large, or when the target dataset is small. All the evidence points to the need to build large, realistic datasets that can be shared amongst bona fide security practitioners. In addition to data labelling and data standardisation, quality control of data is also required to obtain validated and clean data ready for use. However, there are currently no standard tools to fulfil all these tasks.
Transparency encompasses the transparency of the components of an AI system: the data, the system and
the business models. It is important to know on what kind of data an AI system is trained, and how the data
was labelled, so that one can trace back to the training data why an AI system behaves in a certain way. This
can help predict cases in which the system might be less successful and, subsequently, to take steps to
minimise the probability of such mistakes occurring. Explainable AI (XAI) tries to create new methods that
maintain high performance and are at the same time more explainable to humans.
Interoperability (the ability of two systems to communicate effectively) and data interoperability (the ability to integrate two or more datasets) are key for machine learning development. An approach to addressing the interoperability challenges is provided by the European Commission’s European Interoperability Framework (European Commission, 2017). The framework distinguishes four interoperability levels (technical, semantic, organisational and legal interoperability) under an overarching integrated governance approach.
AI systems can be vulnerable to specific types of attacks. Adversarial attacks harm an AI system by supplying input data that causes it to malfunction. The typical defence against adversarial attacks is to train the ANN with adversarial examples. It is therefore important to be aware of the possible vulnerabilities of the system. So far, however, no adversarial attacks have been reported against automated threat recognition algorithms.

11 Conclusions
Prior to the Covid-19 pandemic, there were over one billion air passengers in the EU in 2019. The strategic value of aviation for the world economy, coupled with its very high sensitivity to all kinds of interference, creates a direct incentive for terrorist acts by various terrorist organisations. Looking at the history of terrorist attacks in the Global Terrorism Database (2021), 420 attacks against airplanes and airports occurred after 2001, most of them by means of bombing or explosion.
Screening of passengers' baggage is typically performed using X-ray scanners, in order to meet the
requirements for threat detection and high throughput. During the last 20 years, X-ray technology used for
baggage screening has exploited various image processing and data processing techniques to enhance
images and provide some automated detection of threats.
The rapid progress in machine learning points to the likely widespread deployment of artificial intelligence in
baggage screening in the very near future, in concert with the broader deployment of these techniques across
society. This brings opportunities for better detection performance and enhanced automation of processes,
but it also creates new challenges such as availability of data (for training and for testing), new potential
vulnerabilities, and issues around transparency and explainability of AI algorithms.
Another critical aspect is the human-machine interaction. In Chapter 3, we summarised the literature on this
topic. Numerous studies have shown that humans adapt their behaviour during a detection task in response to
the perceived strengths and weaknesses of an accompanying detection algorithm. Hence, human and machine
performances should not be considered as separate components, but as a combined system.
Next, we reviewed both classical machine learning and deep learning approaches to image enhancement, object detection and material classification. Classical machine learning approaches include de-noising, pseudo-colouring, edge detection, Gaussian blurring and the use of Gabor filters. Specific techniques include, amongst others, the Scale Invariant Feature Transform (SIFT), bag of visual words (BoVW), Difference of Gaussians (DoG), the Harris corner detector and features from accelerated segment test (FAST). In Chapter 6, we reviewed over 60 publications covering these techniques and their application to X-ray baggage screening; around 75% of them were published between 2000 and 2015.
In Chapter 8, we reviewed the basics of deep learning and discussed two aspects of particular relevance for aviation security: transfer learning and data augmentation. Deep learning is a machine learning approach that has received great attention lately because it has been shown capable of surpassing human performance in classifying photographic images. Perhaps its most prominent characteristic is that, instead of utilising hand-engineered features as traditional machine learning does, it leverages neural networks to learn hierarchical, abstract representations from the data itself. In aviation security applications, the scarcity of data, in particular of images of threats, necessitates the use of transfer learning. This method is likely to give good results, especially if the neural networks used are trained on images containing objects similar to threat items. Available datasets for aviation security applications are too small or not diverse enough for training deep learning algorithms, so even with transfer learning, additional data is often needed. If the amount of data is not sufficient, data that resembles the available data can be generated; this process is called data augmentation. We ended the chapter by discussing the challenging task of working with 3D CT data. In total, we cited over 80 publications in this chapter related to the use of deep learning in X-ray baggage screening.
In Chapter 10, we discussed some of the horizontal issues concerning the application of AI in X-ray baggage screening, including the testing of algorithm performance, the need for large, harmonised databases of images, data scarcity, transparency, and explainability. Due to the non-deterministic and often black-box nature of AI systems, traditional testing is not enough. Testing should start already at the training phase: inspecting the training data, ensuring that the data is representative of the problem in question, and determining the types of examples that can occur in a real-life, operational setting and on which a system could have problems. A proper evaluation process should deliver not only a reliable indication of accurate and inaccurate predictions, but also how likely these errors are and in what kinds of situations inaccurate predictions are more probable.
The biggest obstacle to the development of AI in X-ray baggage screening is currently data scarcity. While for
face recognition or for almost any problem that uses conventional RGB images, a huge amount of data can
be gathered on the internet, X-ray or CT scans of bags are not easy to create. Additionally, X-ray scans of
bags with illicit substances or objects are even more difficult to gather in large numbers. The problem is
compounded by the fact that labelling such specific data is very costly. Labelling in general is a time-
consuming activity, especially for images where several objects of interest exist, which is often the case in threat detection. Another difficulty with data is that transferring models between domains and different X-ray
scanners is challenging and usually does not lead to good results because of non-quantified differences
between different models of scanners.
We conclude that a process to collect and reuse data relevant for developing algorithms is highly desirable. If operational and laboratory data is stored in a standardised way, algorithms will be able to make sense of it. Furthermore, enabling human screeners to easily label objects of interest could dramatically improve the content of X-ray image databases. Scientific institutes that work with explosives (or explosive simulants) and other prohibited items could be mandated to produce large image datasets. In addition to data labelling and data standardisation, quality control of data is also required to obtain validated and clean data ready for use. This, in turn, requires standard tools to fulfil all these tasks, which do not currently exist.

References

Abidi, B. R., Zheng, Y., Gribok, A., & Abidi, M. (2005). Screener Evaluation of Pseudo-Colored Single Energy X-ray
Luggage Images. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05) - Workshops, 3, 35–35. https://doi.org/10.1109/CVPR.2005.521
Abidi, B. R., Zheng, Y., Gribok, A. V., & Abidi, M. A. (2006). Improving Weapon Detection in Single Energy X-Ray
Images Through Pseudocoloring. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications
and Reviews), 36(6), 784–796. https://doi.org/10.1109/TSMCC.2005.855523
ACI. (2019). Implementation Guide: Advanced Cabin Baggage Screening CT. Retrieved from
https://store.aci.aero/wp-content/uploads/2019/11/Smart-Security-ACBS-CT-Implementation-Guide-
v0.1.pdf
Adadi, A., & Berrada, M. (2018). Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence
(XAI). IEEE Access, 6, 52138–52160. https://doi.org/10.1109/ACCESS.2018.2870052
Agrawal, M., Konolige, K., & Blas, M. R. (2008). CenSurE: Center Surround Extremas for Realtime Feature
Detection and Matching. In D. Forsyth, P. Torr, & A. Zisserman (Eds.), Computer Vision – ECCV 2008, Vol.
5305, pp. 102–115. https://doi.org/10.1007/978-3-540-88693-8_8
AI HLEG. (2018). A definition of AI: Main capabilities and scientific disciplines. Retrieved from
https://ec.europa.eu/futurium/en/system/files/ged/ai_hleg_definition_of_ai_18_december_1.pdf
AI HLEG. (2019). Ethics Guidelines for Trustworthy AI. Retrieved from https://www.euractiv.com/wp-
content/uploads/sites/2/2018/12/AIHLEGDraftAIEthicsGuidelinespdf.pdf
Airbus. (2016, July 20). Lufthansa, Honeywell and Airbus sign MoU to deploy Airbus’ ROPS and Honeywell’s
SmartLanding systems on Lufthansa Group’s fleet. https://www.airbus.com/en/newsroom/press-
releases/2016-07-lufthansa-honeywell-and-airbus-sign-mou-to-deploy-airbus-rops-and
Akçay, S., Kundegorski, M. E., Devereux, M., & Breckon, T. P. (2016). Transfer learning using convolutional
neural networks for object classification within X-ray baggage security imagery. 2016 IEEE International
Conference on Image Processing (ICIP), 1057–1061. https://doi.org/10.1109/ICIP.2016.7532519
Akçay, S., & Breckon, T. P. (2017). An evaluation of region based object detection strategies within X-ray
baggage security imagery. 2017 IEEE International Conference on Image Processing (ICIP), 1337–1341.
https://doi.org/10.1109/ICIP.2017.8296499
Akçay, S., Kundegorski, M. E., Willcocks, C. G., & Breckon, T. P. (2018). Using Deep Convolutional Neural Network
Architectures for Object Classification and Detection Within X-Ray Baggage Security Imagery. IEEE
Transactions on Information Forensics and Security, 13(9), 2203–2215.
https://doi.org/10.1109/TIFS.2018.2812196
Akçay, S., Atapour-Abarghouei, A., & Breckon, T. P. (2019a). GANomaly: Semi-supervised Anomaly Detection
via Adversarial Training. In C. V. Jawahar, H. Li, G. Mori, & K. Schindler (Eds.), Computer Vision – ACCV 2018,
Vol. 11363, pp. 622–637. https://doi.org/10.1007/978-3-030-20893-6_39
Akçay, S., Atapour-Abarghouei, A., & Breckon, T. P. (2019b). Skip-GANomaly: Skip Connected and Adversarially
Trained Encoder-Decoder Anomaly Detection. In 2019 International Joint Conference on Neural Networks
(IJCNN), 1–8. https://doi.org/10.1109/IJCNN.2019.8851808
Akçay, S., & Breckon, T. (2020). Towards Automatic Threat Detection: A Survey of Advances of Deep Learning
within X-ray Security Imaging. Pattern Recognition, 122. https://doi.org/10.1016/j.patcog.2021.108245
ALERT. (2011). ALERT segmentation dataset. Retrieved from https://alert.northeastern.edu/transitioning-
technology/alert-datasets/
Anderson, D. (2021). Optimising multi-layered security screening. J Transp Secur 14, 249–273.
https://doi.org/10.1007/s12198-021-00237-3
Arcúrio, M. S. F., Nakamura, E. S., & Armborst, T. (2018). Human Factors and Errors in Security Aviation: An
Ergonomic Perspective. Journal of Advanced Transportation, 1–9. https://doi.org/10.1155/2018/5173253

Bach, S., Binder, A., Müller, K.-R., & Samek, W. (2016). Controlling Explanatory Heatmap Resolution and
Semantics via Decomposition Depth. In 2016 IEEE International Conference on Image Processing (ICIP),
2271–2275. https://doi.org/10.1109/ICIP.2016.7532763
Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., & Hansen, K. (2010). How to Explain Individual
Classification Decisions. The Journal of Machine Learning Research, 11, 1803–1831. Retrieved from
https://www.jmlr.org/papers/volume11/baehrens10a/baehrens10a
Bajura, M., Fuchs, H., & Ohbuchi, R. (1992). Merging virtual objects with the real world: Seeing ultrasound
imagery within the patient. Computer Graphics, 26(2), 203–210.
https://dl.acm.org/doi/abs/10.1145/142920.134061
Baştan, M., Yousefi, M. R., & Breuel, T. M. (2011). Visual Words on Baggage X-Ray Images. In P. Real, D. Diaz-
Pernil, H. Molina-Abril, A. Berciano, & W. Kropatsch (Eds.), Computer Analysis of Images and Patterns, Vol.
6854, pp. 360–368. https://doi.org/10.1007/978-3-642-23672-3_44
Baştan, M., Byeon, W., & Breuel, T. (2013). Object Recognition in Multi-View Dual Energy X-ray Images.
Proceedings of the British Machine Vision Conference 2013, 130.1-130.11.
https://doi.org/10.5244/C.27.130
Baştan, M. (2015). Multi-view object detection in dual-energy X-ray images. Machine Vision and Applications,
26(7–8), 1045–1060. https://doi.org/10.1007/s00138-015-0706-x
Bastani, O., Kim, C., & Bastani, H. (2017). Interpretability via Model Extraction. 2017 Workshop on Fairness,
Accountability, and Transparency in Machine Learning (FAT/ML). Presented at the 2017 Workshop on
Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2017). Retrieved from
http://arxiv.org/abs/1706.09773
Baum, P. (2016). Violence in the skies: A history of aircraft hijacking and bombing. Chichester, England:
Summersdale Publishers.
Bay, H., Ess, A., Tuytelaars, T., & Van Gool, L. (2008). Speeded-Up Robust Features (SURF). Computer Vision
and Image Understanding, 110(3), 346–359. https://doi.org/10.1016/j.cviu.2007.09.014
Bekkar, M., & Djemaa, D. H. K. (2013). Evaluation Measures for Models Assessment over Imbalanced Data
Sets. Journal of Information Engineering and Applications, 3(10), 27–38. Retrieved from
https://eva.fing.edu.uy/pluginfile.php/69453/mod_resource/content/1/7633-10048-1-PB.pdf
Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape context. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509–522.
https://doi.org/10.1109/34.993558
Benedykciuk, E., Denkowski, M., & Dmitruk, K. (2020). Learning-based Material Classification in X-ray Security
Images. Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and
Computer Graphics Theory and Applications, 284–291. https://doi.org/10.5220/0008951702840291
Benedykciuk, E., Denkowski, M., & Dmitruk, K. (2021). Material classification in X-ray images based on multi-
scale CNN. Signal, Image and Video Processing. https://doi.org/10.1007/s11760-021-01859-9
Bertz, E. A. (2002). 9/11: One Year Later: Slow Takeoff. IEEE Spectrum, 39(9), 37–38.
https://doi.org/10.1109/MSPEC.2002.1030985
Bhowmik, N., Wang, Q., Gaus, Y. F. A., & Szarek, M. (2019). The Good, the Bad and the Ugly: Evaluating
Convolutional Neural Networks for Prohibited Item Detection Using Real and Synthetically Composited X-
ray Imagery. British Machine Vision Conference, Workshop on Object Detection and Recognition for Security
Screening, 13. https://doi.org/10.48550/arXiv.1909.11508
Bian, P., Li, W., Jin, Y., & Zhi, R. (2018). Ensemble feature learning for material recognition with convolutional
neural networks. EURASIP Journal on Image and Video Processing, 2018(1), 64.
https://doi.org/10.1186/s13640-018-0300-z
Biggs, A. T., & Mitroff, S. R. (2015). Improving the Efficacy of Security Screening Tasks: A Review of Visual
Search Challenges and Ways to Mitigate Their Adverse Effects: Improving the efficacy of security
screening tasks. Applied Cognitive Psychology, 29(1), 142–148. https://doi.org/10.1002/acp.3083

Biggs, A. T., Kramer, M. R., & Mitroff, S. R. (2018). Using Cognitive Psychology Research to Inform Professional
Visual Search Operations. Journal of Applied Research in Memory and Cognition, 7(2), 189–198.
https://doi.org/10.1016/j.jarmac.2018.04.001
Biundo, S., Langley, P., Magazzeni, D., & Smith, D. (2020). International Workshop on Explainable AI Planning
(XAIP), Proc. ICAPS Workshop. Retrieved April 2022 from https://icaps20subpages.icaps-
conference.org/workshops/xaip/. Update June 2022: website apparently moved to
http://xaip.mybluemix.net/2020.
Blalock, G., Kadiyali, V., & Simon, D. H. (2007). The Impact of Post-9/11 Airport Security Measures on the
Demand for Air Travel. The Journal of Law and Economics, 50(4), 731–755.
https://doi.org/10.1086/519816
Blaschko, M. B., & Lampert, C. H. (2008). Learning to Localize Objects with Structured Output Regression. In D.
Forsyth, P. Torr, & A. Zisserman (Eds.), Computer Vision – ECCV 2008, Vol. 5302, pp. 2–15.
https://doi.org/10.1007/978-3-540-88682-2_2
Blum, S. (2020, April 18). Behaviour detection technology: Screening on the go. Transport Security
International Magazine. https://www.tsi-mag.com/behaviour-detection-technology-screening-on-the-go/
Bolfing, A., Halbherr, T., & Schwaninger, A. (2008). How Image Based Factors and Human Factors Contribute to
Threat Detection Performance in X-Ray Aviation Security Screening. In A. Holzinger (Ed.), HCI and Usability
for Education and Work, Vol. 5298, pp. 419–438. https://doi.org/10.1007/978-3-540-89350-9_30
Bozinovski, S. (2020). Reminder of the First Paper on Transfer Learning in Neural Networks, 1976. Informatica
(Slovenia), 44(3). https://doi.org/10.31449/inf.v44i3.2828
Brock, A., Donahue, J., & Simonyan, K. (2019). Large Scale GAN Training for High Fidelity Natural Image
Synthesis. Proceedings of the 7th International Conference on Learning Representations (ICLR). Presented
at the 2019 ICLR, New Orleans, LA, USA. Retrieved from http://arxiv.org/abs/1809.11096
Bunrit, S., Kerdprasop, N., & Kerdprasop, K. (2019). Evaluating on the Transfer Learning of CNN Architectures
to a Construction Material Image Classification Task. International Journal of Machine Learning and
Computing, 9(2), 201–207. https://doi.org/10.18178/ijmlc.2019.9.2.787
Buser, D., Sterchi, Y., Schwaninger, A. (2020). Why stop after 20 minutes? Breaks and target prevalence in a
60-minute X-ray baggage screening task, International Journal of Industrial Ergonomics, 76, 102897,
https://doi.org/10.1016/j.ergon.2019.102897
Butler, V., & Poole, R. W. (2002). Rethinking Checked Baggage Screening—Policy study 297. Los Angeles:
Reason Public Policy Institute. Retrieved from https://reason.org/wp-
content/uploads/files/f9b5018689d607923c7ce0c624e7dd58.pdf
Caldwell, M., & Griffin, L. D. (2020). Limits on transfer learning from photographic image data to X-ray threat
detection. Journal of X-Ray Science and Technology, 27(6), 1007–1020. https://doi.org/10.3233/XST-
190545
Calimeri, F., Marzullo, A., Stamile, C., & Terracina, G. (2017). Biomedical Data Augmentation Using Generative
Adversarial Neural Networks. In A. Lintas, S. Rovetta, P. F. M. J. Verschure, & A. E. P. Villa (Eds.), Artificial
Neural Networks and Machine Learning – ICANN 2017, Vol. 10614, pp. 626–634.
https://doi.org/10.1007/978-3-319-68612-7_71
Caruana, R., Lou, Y., & Gehrke, J. (2015). Intelligible Models for HealthCare: Predicting Pneumonia Risk and
Hospital 30-day Readmission. In Proceedings of the 21th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 1721–1730. https://doi.org/10.1145/2783258.2788613
Centre for Applied Science and Technology. (2016). OSCT Borders X-ray Image Library (Technical Report No.
146/16). UK Home Office.
Chan, J., Evans, P., & Wang, X. (2010). Enhanced color coding scheme for kinetic depth effect X-ray (KDEX)
imaging. 44th Annual 2010 IEEE International Carnahan Conference on Security Technology, 155–160.
https://doi.org/10.1109/CCST.2010.5678714
Chang, Q., Li, W., & Chen, J. (2019). Application of Machine Learning Methods for Material Classification with
Multi-energy X-Ray Transmission Images. In X. Sun, Z. Pan, & E. Bertino (Eds.), Artificial Intelligence and
Security, Vol. 11632, pp. 194–204. https://doi.org/10.1007/978-3-030-24274-9_17
Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: An evaluation of
recent feature encoding methods. 2011 Proceedings of the British Machine Vision Conference (BMVC),
76.1-76.12. https://doi.org/10.5244/C.25.76
Chavaillaz, A., Schwaninger, A., Michel, S., & Sauer, J. (2020). Some cues are more equal than others: Cue
plausibility for false alarms in baggage screening. Applied Ergonomics, 82, 102916.
https://doi.org/10.1016/j.apergo.2019.102916
Chellappa, R., & Kashyap, R. (1985). Texture synthesis using 2-D noncausal autoregressive models. IEEE
Transactions on Acoustics, Speech, and Signal Processing, 33(1), 194–203.
https://doi.org/10.1109/TASSP.1985.1164507
Chen, Z., Zheng, Y., Abidi, B. R., Page, D. L., & Abidi, M. A. (2005). A Combinational Approach to the Fusion, De-noising and Enhancement of Dual-Energy X-Ray Luggage Images. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Workshops, 2–2. https://doi.org/10.1109/CVPR.2005.386
Chen, Q., Li, D., & Tang, C.-K. (2013). KNN Matting. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35, 2175–2188. https://doi.org/10.1109/TPAMI.2013.18
Chen, Z., Zhao, T., & Li, L. (2014). A Curve-based Material Recognition Method in MeV Dual-energy X-ray
Imaging System. Nuclear Science and Techniques 27, 25, 1–8. https://doi.org/10.1007/s41365-016-0019-
4
Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., & Choo, J. (2018). StarGAN: Unified Generative Adversarial
Networks for Multi-Domain Image-to-Image Translation. IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 8789–8797. https://doi.ieeecomputersociety.org/10.1109/CVPR.2018.00916
Choi, Y., Uh, Y., Yoo, J., & Ha, J.-W. (2020). StarGAN v2: Diverse Image Synthesis for Multiple Domains.
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8185–8194.
https://doi.org/10.48550/arXiv.1912.01865
Chollet, F. (2017). Xception: Deep Learning with Depthwise Separable Convolutions. 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 1800–1807. https://doi.org/10.1109/CVPR.2017.195
Chui, C. K. (1993). An Introduction to Wavelets. Retrieved from
https://www.sciencedirect.com/bookseries/wavelet-analysis-and-its-applications/vol/1/suppl/C
Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T., & Ronneberger, O. (2016). 3D U-Net: Learning Dense
Volumetric Segmentation from Sparse Annotation. Medical Image Computing and Computer-Assisted
Intervention – MICCAI 2016, 9901, 424–432. https://doi.org/10.1007/978-3-319-46723-8_49
Cimpoi, M., Maji, S., & Vedaldi, A. (2015). Deep filter banks for texture recognition and segmentation. 2015
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3828–3836.
https://doi.org/10.1109/CVPR.2015.7299007
Cranfield University. (2019, May 1). Revolutionary X-ray scanner receives U.S. Homeland Security funding.
Cranfield University. Retrieved from https://www.cranfield.ac.uk/press/news-2019/revolutionary-x-ray-
scanner-receives-us-homeland-security-funding
Crawford, C., Martz, H., & Pien, H. (2011). Segmentation of objects from volumetric CT data—Final report (No.
Task Order Number HSHQDC-10-J-00396). Awareness and Localization of Security-Related Threats
(ALERT), DHS Center of Excellence at Northeastern University. Retrieved from
https://alert.northeastern.edu/transitioning-technology/alert-datasets/
Crawford, C., Clem, C., & Martz, H. (2013). Research and Development of Reconstruction Advances in CT‐Based
Object Detection Systems ‐ Final Report (No. HSHQDC‐12‐J‐00056). Awareness and Localization of
Security-Related Threats (ALERT), DHS Center of Excellence at Northeastern University.
Crawford, C. (2014). Advances in Automatic Target Recognition (ATR) for CT-Based Object Detection
Systems—Final report (No. Task Order Number HSHQDC-12-J-00429). Awareness and Localization of
Security-Related Threats (ALERT), DHS Center of Excellence at Northeastern University. Retrieved from
http://alert.northeastern.edu/assets/TaskOrder4_FinalReport_Full_DIGITAL.pdf
Csurka, G., Dance, C. R., Fan, L., Willamowski, J., & Bray, C. (2004). Visual Categorization with Bags of
Keypoints. In Workshop on Statistical Learning in Computer Vision (ECCV), 1, 1–22. Retrieved from
https://people.eecs.berkeley.edu/~efros/courses/AP06/Papers/csurka-eccv-04.pdf
Cula, O. G., & Dana, K. J. (2004). 3D Texture Recognition Using Bidirectional Feature Histograms. International
Journal of Computer Vision, 59(1), 33–60. https://doi.org/10.1023/B:VISI.0000020670.05764.55
Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object Detection via Region-based Fully Convolutional Networks.
Advances in Neural Information Processing Systems (NIPS), 379–387. Retrieved from
https://dl.acm.org/doi/pdf/10.5555/3157096.3157139
Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. 2005 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 1, 886–893.
https://doi.org/10.1109/CVPR.2005.177
Daubechies, I. (1990). The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5), 961–1005. https://doi.org/10.1109/18.57199
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. https://doi.org/10.1109/CVPR.2009.5206848
DeVries, T., & Taylor, G. W. (2017). Dataset Augmentation in Feature Space. Retrieved from http://arxiv.org/abs/1702.05538
Dhiraj, & Jain, D. K. (2019). An evaluation of deep learning based object detection strategies for threat object
detection in baggage security imagery. Pattern Recognition Letters, 120, 112–119.
https://doi.org/10.1016/j.patrec.2019.01.014
Dicken, A., Rogers, K., Evans, P., Rogers, J., & Chan, J. W. (2010). The separation of X-ray diffraction patterns
for threat detection. Applied Radiation and Isotopes, 68(3), 439–443.
https://doi.org/10.1016/j.apradiso.2009.11.072
Ding, J., Li, Y., Xu, X., & Wang, L. (2006). X-ray Image Segmentation by Attribute Relational Graph Matching.
8th International Conference on Signal Processing, 4128990. https://doi.org/10.1109/ICOSP.2006.345698
Dmitruk, K., Denkowski, M., Mazur, M., & Mikołajczak, P. (2017). Sharpening filter for false color imaging of
dual-energy X-ray scans. Signal, Image and Video Processing, 11(4), 613–620.
https://doi.org/10.1007/s11760-016-1001-7
Dobratz, B., & Crawford, P. (1985). LLNL Explosives Handbook: Properties of Chemical Explosives and Explosive
Simulants. Lawrence Livermore National Lab CA.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2014). DeCAF: A Deep
Convolutional Activation Feature for Generic Visual Recognition. Proceedings of the 31st International
Conference on Machine Learning, 32, 647–655.
Dzindolet, M. T., Peterson, S. A., Pomranky, R. A., Pierce, L. G., & Beck, H. P. (2003). The role of trust in
automation reliance. International Journal of Human-Computer Studies, 58(6), 697–718.
https://doi.org/10.1016/S1071-5819(03)00038-7
EASA. (2020a). Artificial Intelligence Roadmap: A human-centric approach to AI in aviation. Retrieved from
https://www.easa.europa.eu/sites/default/files/dfu/EASA-AI-Roadmap-v1.0.pdf
EASA. (2020b). Concepts of Design Assurance for Neural Networks (CoDANN). Retrieved from
https://www.easa.europa.eu/document-library/general-publications/concepts-design-assurance-neural-
networks-codann
Ecorys. (2009). Study on the Competitiveness of the EU security industry (Final report within the Framework Contract for Sectoral Competitiveness Studies – ENTR/06/054). Retrieved from http://www.decision.eu/wp-content/uploads/2016/11/Study-on-the-Competitiveness-of-the-EU-security-industry.pdf
Eger, L., Do, S., Ishwar, P., Karl, W. C., & Pien, H. (2011). A learning-based approach to explosives detection
using Multi-Energy X-Ray Computed Tomography. 2011 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2004–2007. https://doi.org/10.1109/ICASSP.2011.5946904
Eilbert, R. F., & Krug, K. D. (1993). Aspects of image recognition in Vivid Technologies’ dual-energy x-ray
system for explosives detection. Applications of Signal and Image Processing in Explosives Detection
Systems, 1824, 127–143. https://doi.org/10.1117/12.142891
Emanuilov, I., & Dheu, O. (2021). Flying High for AI? Perspectives on EASA’s Roadmap for AI in Aviation. Air & Space Law, 46(1), 1–28. Retrieved from https://lirias.kuleuven.be/retrieve/626180
Endsley, M. R. (1987). The Application of Human Factors to the Development of Expert Systems for Advanced
Cockpits. Proceedings of the Human Factors Society Annual Meeting, 31(12), 1388–1392.
https://doi.org/10.1177/154193128703101219
EUROCONTROL. (2020, January 10). Real safety and capacity gains at Heathrow from artificial intelligence
initiatives. https://www.eurocontrol.int/article/real-safety-and-capacity-gains-heathrow-artificial-
intelligence-initiatives
European Aviation Artificial Intelligence High Level Group. (2020). The FLY AI Report: Demystifying and
Accelerating AI in Aviation/ATM. Retrieved from https://www.eurocontrol.int/sites/default/files/2020-
03/eurocontrol-fly-ai-report-032020.pdf
European Civil Aviation Conference. (2021a, February 19). Liquid Explosive Detection Systems (LEDS).
Retrieved from https://www.ecac-ceac.org/images/activities/security/ECAC-
CEP_Liquid_Explosive_Detection_Systems_20210219.pdf
European Civil Aviation Conference. (2021b, March 19). Explosive Detection Systems for Cabin Baggage (EDS-CB). Retrieved from https://www.ecac-ceac.org/images/activities/security/ECAC-CEP_Explosive_Detection_Systems_for_Cabin_Baggage_20210319.pdf
European Commission. (2008). Regulation (EC) No 300/2008 of the European Parliament and of the Council of
11 March 2008 on common rules in the field of civil aviation security and repealing Regulation (EC) No
2320/2002 (Text with EEA relevance). Retrieved from http://data.europa.eu/eli/reg/2008/300/oj
European Commission. (2013). Report from the Commission to the European Parliament and the Council: 2012 Annual Report on the Implementation of Regulation (EC) No 300/2008 on Common Rules in the Field of Civil Aviation Security. COM(2013) 523 final. Retrieved from https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex:52013DC0523
European Commission. (2015a). Commission implementing regulation (EU) 2015/1998 of 5 November 2015
laying down detailed measures for the implementation of the common basic standards on aviation
security. Official Journal of the European Union. Retrieved from
http://data.europa.eu/eli/reg_impl/2015/1998/oj
European Commission. (2015b). Commission Implementing Regulation (EU) 2015/187 of 6 February 2015 amending Regulation (EU) No 185/2010 as regards the screening of cabin baggage. Retrieved from http://data.europa.eu/eli/reg_impl/2015/187/oj
European Commission. (2017). Annex 2 to the Communication from the Commission to the European
Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions:
European Interoperability Framework – Implementation Strategy. COM(2017) 134 final. Retrieved from
https://eur-lex.europa.eu/resource.html?uri=cellar:2c2f2554-0faf-11e7-8a35-
01aa75ed71a1.0017.02/DOC_3&format=PDF
European Commission. (2020). White Paper on AI. Retrieved from
https://ec.europa.eu/info/sites/default/files/commission-white-paper-artificial-intelligence-feb2020_en.pdf
European Commission. (2021a). Luggage restrictions. Retrieved from
https://europa.eu/youreurope/citizens/travel/carry/luggage-restrictions/index_en.htm
European Commission. (2021b). Commission Implementing Regulation (EU) 2021/255 of 18 February 2021
amending Implementing Regulation (EU) 2015/1998 laying down detailed measures for the
implementation of the common basic standards on aviation security (Text with EEA relevance). Retrieved
from http://data.europa.eu/eli/reg_impl/2021/255/oj
European Commission. (2021c, April 21). Annexes (1-9) to the Proposal for a Regulation of the European
Parliament and of the Council: Laying Down Harmonised Rules on Artificial Intelligence (Artificial
Intelligence Act) and Amending Certain Union Legislative Acts. Retrieved from https://eur-
lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52018PC0375&from=EN
European Commission. (2021d, April 21). Proposal for a Regulation laying down harmonised rules on Artificial
Intelligence (Artificial Intelligence Act) and amending certain Union legislative acts. Retrieved from
https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=COM:2021:206:FIN
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The Pascal
Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111(1), 98–
136. https://doi.org/10.1007/s11263-014-0733-5
Farina, M. P., & Reed, C. (2017). Explainable Computational Intelligence Workshop. Proceedings of XCI:
Explainable Computational Intelligence Workshop, Santiago de Compostela, Spain. Retrieved from
http://xci2017.arg.tech
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
https://doi.org/10.1016/j.patrec.2005.10.010
Federal Aviation Administration. (2021). Events with smoke, fire, extreme heat or explosion involving lithium
batteries. Retrieved from https://www.faa.gov/hazmat/resources/lithium-battery-incident-chart
Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient Graph-Based Image Segmentation. International
Journal of Computer Vision, 59(2), 167–181. https://doi.org/10.1023/B:VISI.0000022288.19776.77
Fernández de la Ossa, M. Á., Amigo, J. M., & García-Ruiz, C. (2014). Detection of residues from explosive
manipulation by near infrared hyperspectral imaging: A promising forensic tool. Forensic Science
International, 242, 228–235. https://doi.org/10.1016/j.forsciint.2014.06.023
Ferrara, M., Franco, A., & Maltoni, D. (2014). The magic passport. IEEE International Joint Conference on
Biometrics, 1–7. https://doi.org/10.1109/BTAS.2014.6996240
Flitton, G., Breckon, T. P., & Megherbi, N. (2013). A comparison of 3D interest point descriptors with application
to airport baggage object detection in complex CT imagery. Pattern Recognition, 46(9), 2420–2436.
https://doi.org/10.1016/j.patcog.2013.02.008
Flitton, G., Mouton, A., & Breckon, T. P. (2015). Object classification in 3D baggage security computed
tomography imagery using visual codebooks. Pattern Recognition, 48(8), 2489–2499.
https://doi.org/10.1016/j.patcog.2015.02.006
Fogel, I., & Sagi, D. (1989). Gabor filters as texture discriminator. Biological Cybernetics, 61(2).
https://doi.org/10.1007/BF00204594
Franzel, T., Schmidt, U., & Roth, S. (2012). Object Detection in Multi-view X-Ray Images. In A. Pinz, T. Pock, H.
Bischof, & F. Leberl (Eds.), Pattern Recognition 7476, 144–154. https://doi.org/10.1007/978-3-642-32717-
9_15
Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., & Greenspan, H. (2018). GAN-based Synthetic
Medical Image Augmentation for increased CNN Performance in Liver Lesion Classification.
Neurocomputing, 321, 321–331. https://doi.org/10.1016/j.neucom.2018.09.013
Future airport. (2016, December 16). The magic number—Upgrading baggage-screening capabilities. Retrieved
from https://www.futureairport.com/features/featurethe-magic-number---upgrading-baggage-screening-
capabilities-5724299/
Galdran, A., Alvarez-Gila, A., Meyer, M. I., Saratxaga, C. L., Araújo, T., Garrote, E., Aresta, G., Costa, P., Mendonça, A. M., & Campilho, A. (2017). Data-Driven Color Augmentation Techniques for Deep Skin Image Analysis. Retrieved from http://arxiv.org/abs/1703.03702
Gaus, Y. F. A., Bhowmik, N., Akçay, S., & Breckon, T. (2019). Evaluating the Transferability and Adversarial Discrimination of Convolutional Neural Networks for Threat Object Detection and Classification within X-Ray Security Imagery. 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), 420–425. https://doi.org/10.1109/ICMLA.2019.00079
Girshick, R. (2015). Fast R-CNN. arXiv. https://doi.org/10.48550/arXiv.1504.08083
Global Aviation Industry High-Level Group. (2019). Aviation Benefits Report 2019. Retrieved from
https://www.icao.int/sustainability/Documents/AVIATION-BENEFITS-2019-web.pdf
Global Terrorism Database. (2021). https://www.start.umd.edu/gtd (Accessed 07 June 2021).
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of Machine Learning Research (PMLR), 9, 249–256. Retrieved from http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
Goebel, R. (2018). MAKE-Explainable AI. CD-MAKE Workshop on Explainable Artificial Intelligence. Retrieved
from https://2018.cd-make.net/special-sessions/make-explainable-ai/index.html
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Networks. Advances in Neural Information Processing Systems, 27. https://doi.org/10.1145/3422622
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA. Retrieved from http://arxiv.org/abs/1412.6572
Graaf, M., Malle, B., Dragan, A., & Ziemke, T. (2018). Explainable Robotic Systems. Proceedings HRI Workshop
on Explainable Robotic Systems. Chicago, IL, USA. Retrieved from
https://explainableroboticsystems.wordpress.com/
Grady, L., & Schwartz, E. L. (2006). Isoperimetric graph partitioning for image segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 28(3), 469–475. https://doi.org/10.1109/TPAMI.2006.57
Grady, L., Singh, V., Kohlberger, T., Alvino, C., & Bahlmann, C. (2012). Automatic Segmentation of Unknown
Objects, with Application to Baggage Security. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, & C. Schmid
(Eds.), Computer Vision – ECCV 2012, 7573, 430–444. https://doi.org/10.1007/978-3-642-33709-3_31
Gu, J. & Liu, C. (2012). Discriminative illumination: Per-pixel classification of raw materials based on optimal
projections of spectral BRDF. 2012 IEEE Conference on Computer Vision and Pattern Recognition, 797–804.
https://doi.org/10.1109/CVPR.2012.6247751
Gunning, D. (2019). Explainable Artificial Intelligence (XAI) Program. AI Magazine, 40(2). https://doi.org/10.1609/aimag.v40i2.2850
Guo, X., Grushka-Cockayne, Y., & De Reyck, B. (2020). London Heathrow Airport Uses Real-Time Analytics for
Improving Operations. SSRN. Retrieved from http://dx.doi.org/10.2139/ssrn.3619914
Guyon, I., Escalante, H., Escalera, S., Viegas, E., Güçlütürk, Y., Güçlü, U., van Gerven, M., & van Lier, R. (2017). Explainability of Learning Machines (IJCNN workshop). Retrieved from https://gesture.chalearn.org/ijcnn17_explainability_of_learning_machines
Halo X-Ray Technologies. (2017). HALO X-Ray Technologies: Products. Retrieved from
https://www.haloxray.com/products
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic
(ROC) curve. Radiology, 143(1), 29–36. https://doi.org/10.1148/radiology.143.1.7063747
Harding, G. (2004). X-ray scatter tomography for explosives detection. Radiation Physics and Chemistry, 71(3–
4), 869–881. https://doi.org/10.1016/j.radphyschem.2004.04.111
Hassan, T., Khan, S. H., Akçay, S., Bennamoun, M., & Werghi, N. (2019). Deep CMST Framework for the
Autonomous Recognition of Heavily Occluded and Cluttered Baggage Items from Multivendor Security
Radiographs. Computer Science, 18.
Hättenschwiler, N., Sterchi, Y., Mendes, M., & Schwaninger, A. (2018). Automation in airport security X-ray
screening of cabin baggage: Examining benefits and possible implementations of automated explosives
detection. Applied Ergonomics, 72, 58–68. https://doi.org/10.1016/j.apergo.2018.05.003
Hättenschwiler, N., Mendes, M., & Schwaninger, A. (2019). Detecting Bombs in X-Ray Images of Hold Baggage:
2D Versus 3D Imaging. Human Factors: The Journal of the Human Factors and Ergonomics Society, 61(2),
305–321. https://doi.org/10.1177/0018720818799215
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
https://doi.org/10.1109/CVPR.2016.90
Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE
Intelligent Systems and Their Applications, 13(4), 18–28. https://doi.org/10.1109/5254.708428
Heitz, G., & Chechik, G. (2010). Object separation in x-ray image sets. 2010 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, 2093–2100. https://doi.org/10.1109/CVPR.2010.5539887
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks.
Science, 313(5786), 504–507. https://doi.org/10.1126/science.1127647
Hofer, F., & Schwaninger, A. (2005). Using threat image projection data for assessing individual screener
performance. WIT Transactions on the Built Environment, 82, 417–426.
https://doi.org/10.2495/SAFE050411
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities.
Proceedings of the National Academy of Sciences USA, 79(8), 2554–2558.
Hu, D., Bo, L., & Ren, X. (2011). Toward Robust Material Recognition for Everyday Objects. Proceedings of the
British Machine Vision Conference 2011, 48, 1–11. https://doi.org/10.5244/C.25.48
Huegli, D., Merks, S., & Schwaninger, A. (2020). Automation reliability, human–machine system performance,
and operator compliance: A study with airport security screeners supported by automated explosives
detection systems for cabin baggage screening. Applied Ergonomics, 86, 103094.
https://doi.org/10.1016/j.apergo.2020.103094
Inoue, H. (2018). Data Augmentation by Pairing Samples for Images Classification. Retrieved from
https://doi.org/10.48550/arXiv.1801.02929
International Airport Review. (2021). CT scan machines installed at Schiphol airport’s security checkpoints.
Retrieved from https://www.internationalairportreview.com/news/158492/ct-scan-schiphol-airport-security/
Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-Image Translation with Conditional Adversarial
Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5967–
5976. https://doi.org/10.1109/CVPR.2017.632
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: Convolutional
Architecture for Fast Feature Embedding. Proceedings of the 22nd ACM International Conference on
Multimedia, 675–678. https://doi.org/10.1145/2647868.2654889
Jiang, X., Gramopadhye, A. K., & Melloy, B. J. (2004). Theoretical issues in the design of visual inspection systems. Theoretical Issues in Ergonomics Science, 5(3), 232–247. https://doi.org/10.1080/1463922021000050005
Johnson, T., Fink, C., Schönberg, S. O., & Reiser, M. F. (2011). Dual Energy CT in Clinical Practice. Springer
Science & Business Media.
Jones, T. L. (2003). Court Security: A Guide for Post 9-11 Environments. Charles Thomas Publisher.
Kalender, W. A., Perman, W. H., Vetter, J. R., & Klotz, E. (1986). Evaluation of a prototype dual-energy computed tomographic apparatus. I. Phantom studies. Medical Physics, 13, 334–339. https://doi.org/10.1118/1.595958
Karanam, S., Gou, M., Wu, Z., Rates-Borras, A., Camps, O., & Radke, R. J. (2018). A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets. arXiv:1605.09653 [cs.CV]. Retrieved from http://arxiv.org/abs/1605.09653
Kehl, C., Mustafa, W., Kehres, J., Dahl, A., & Olsen, U. (2018, December). Distinguishing malicious fluids in luggage via multi-spectral CT reconstructions. Presented at 3D-NordOst, GFaI - Gesellschaft zur Förderung angewandter Informatik e.V., Berlin, Germany. Retrieved from https://www.researchgate.net/publication/330580147_Distinguishing_malicious_fluids_in_luggage_via_multi-spectral_CT_reconstructions
Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy K-nearest neighbor algorithm. IEEE Transactions on
Systems, Man, and Cybernetics, SMC-15(4), 580–585. https://doi.org/10.1109/TSMC.1985.6313426
Khalifa, N. E. (2022). A comprehensive survey of recent trends in deep learning for digital images
augmentation. Artificial Intelligence Review, 55, 2351–2377. https://doi.org/10.1007/s10462-021-10066-
4
Khotanzad, A., & Hong, Y. H. (1990). Invariant image recognition by Zernike moments. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 12(5), 489–497. https://doi.org/10.1109/34.55109
Kim, B., Varshney, K. R., & Weller, A. (2018). ICML Workshop on Human Interpretability in Machine Learning (WHI). In International Conference on Machine Learning. Retrieved from https://sites.google.com/view/whi2020/home
Kim, J., Kim, J., & Ri, J. (2020). Generative adversarial networks and faster-region convolutional neural
networks based object detection in X-ray baggage security imagery. OSA Continuum, 3(12).
https://opg.optica.org/osac/fulltext.cfm?uri=osac-3-12-3604&id=444813
Koller, S. M., Drury, C. G., & Schwaninger, A. (2009). Change of search time and non-search time in X-ray
baggage screening due to training. Ergonomics, 52(6), 644–656.
https://doi.org/10.1080/00140130802526935
Komatsu, T., & Said, A. (2018). Explainable Smart Systems (EXSS). In Proceedings of ACM Intelligent User Interfaces (IUI) Workshop. Retrieved from https://explainablesystems.comp.nus.edu.sg/
Kosciesza, D., Schlomka, J.-P., Meyer, J., & Montemont, G. (2013, October). X-ray diffraction imaging system for the detection of illicit substances using pixelated CZT-detectors. 2013 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), 1–5. https://doi.org/10.1109/NSSMIC.2013.6829846
Krell, M. M., Seeland, A., & Kim, S. K. (2018). Data Augmentation for Brain-Computer Interfaces: Analysis on Event-Related Potentials Data. arXiv:1801.02730 [cs, q-bio]. Retrieved from http://arxiv.org/abs/1801.02730
Krening, S., Harrison, B., Feigh, K. M., Isbell, C. L., Riedl, M., & Thomaz, A. (2017). Learning From Explanations
Using Sentiment and Advice in RL. IEEE Transactions on Cognitive and Developmental Systems, 9(1), 44–
55. https://doi.org/10.1109/TCDS.2016.2628365
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural
networks. Proc. Advances in Neural Information Processing Systems, 25(6), 1090–1098.
https://proceedings.neurips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-
networks.pdf
Krüger, J., & Westermann, R. (2003). Linear algebra operators for GPU implementation of numerical
algorithms. ACM Transactions on Graphics, 22(3), 908–916. https://doi.org/10.1145/882262.882363
Kundegorski, M. E., Akçay, S., Devereux, M., Mouton, A., & Breckon, T. P. (2016). On using Feature Descriptors as
Visual Words for Object Detection within X-ray Baggage Security Screening. 7th International Conference
on Imaging for Crime Detection and Prevention (ICDP 2016), 1–6. https://doi.org/10.1049/ic.2016.0080
Kuznetsova, A., Rom, H., Alldrin, N., et al. (2020). The Open Images Dataset V4. International Journal of Computer Vision, 128, 1956–1981. https://doi.org/10.1007/s11263-020-01316-z
Lacson, F., Wiegmann, D., & Madhavan, P. (2005). Effects of Attribute and Goal Framing on Automation
Reliance and Compliance. Proceedings of the Human Factors and Ergonomics Society 49th Annual Meeting,
5.
Lampert, C. H., Blaschko, M. B., & Hofmann, T. (2009). Efficient Subwindow Search: A Branch and Bound
Framework for Object Localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12),
2129–2142. https://doi.org/10.1109/TPAMI.2009.144
Larose, D. T., & Larose, C. D. (2005). K-nearest neighbor algorithm. In Discovering Knowledge in Data: An Introduction to Data Mining (2nd ed., pp. 149–164). John Wiley & Sons.
Lazebnik, S., Schmid, C., & Ponce, J. (2005). A sparse texture representation using local affine regions. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1265–1278.
https://doi.org/10.1109/TPAMI.2005.151
LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., & Jackel, L. D. (1990).
Handwritten Digit Recognition with a Back-Propagation Network. Advances in Neural Information
Processing Systems 2, 396–404. https://dl.acm.org/doi/10.5555/109230.109279
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11), 2278–2324. https://doi.org/10.1109/5.726791
Lee, K., Zung, J., Li, P., Jain, V., & Seung, H. S. (2017). Superhuman Accuracy on the SNEMI3D Connectomics Challenge. In 2017 Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA. Retrieved from http://arxiv.org/abs/1706.00120
Lee, J.-W., Park, W. B., Lee, J. H., Singh, S. P., & Sohn, K.-S. (2020). A deep-learning technique for phase
identification in multiphase inorganic compounds using synthetic XRD powder patterns. Nature
Communications, 11(1), 86. https://doi.org/10.1038/s41467-019-13749-3
Leibe, B., Leonardis, A., & Schiele, B. (2008). Robust Object Detection with Interleaved Categorization and
Segmentation. International Journal of Computer Vision, 77(1–3), 259–289.
https://doi.org/10.1007/s11263-007-0095-3
Letham, B., Rudin, C., McCormick, T. H., & Madigan, D. (2015). Interpretable classifiers using rules and Bayesian
analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3).
https://doi.org/10.1214/15-AOAS848
Li, S., Chen, Y., Peng, Y., & Bai, L. (2018). Learning More Robust Features with Adversarial Training.
http://arxiv.org/abs/1804.07757
Li, B., Wu, F., Lim, S.-N., Belongie, S., & Weinberger, K. Q. (2021). On Feature Normalization and Data
Augmentation. http://arxiv.org/abs/2002.11102
Liang, J. (2004). Improving the detection of low-density weapons in x-ray luggage scans using image
enhancement and novel scene-decluttering techniques. Journal of Electronic Imaging, 13(3), 523.
https://doi.org/10.1117/1.1760571
Liang, G.-F., Lin, J.-T., Hwang, S.-L., Wang, E. M., & Patterson, P. (2010). Preventing human errors in aviation
maintenance using an on-line maintenance assistance platform. International Journal of Industrial
Ergonomics, 40(3), 356–367. https://doi.org/10.1016/j.ergon.2010.01.001
Liang, K. J., Sigman, J. B., Spell, G. P., Strellis, D., Chang, W., Liu, F., Mehta, T., & Carin, L. (2019). Toward Automatic Threat Recognition for Airport X-ray Baggage Screening with Deep Convolutional Object Detection. Denver X-Ray Conference, Denver, Colorado, USA. Retrieved from http://arxiv.org/abs/1912.06329
Liang, K. J. (2020). Deep Automatic Threat Recognition: Considerations for Airport X-Ray Baggage Screening (Ph.D. thesis, Duke University). Retrieved from https://hdl.handle.net/10161/20887
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In 2014 European Conference on Computer Vision (ECCV), 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. 2017 IEEE International Conference on Computer Vision (ICCV), 2999–3007. https://doi.org/10.1109/ICCV.2017.324
Liu, C., Sharan, L., Adelson, E. H., & Rosenholtz, R. (2010). Exploring features in a Bayesian framework for
material recognition. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
239–246. https://doi.org/10.1109/CVPR.2010.5540207
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single Shot MultiBox
Detector. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer Vision – ECCV 2016, 9905, 21–37.
https://doi.org/10.1007/978-3-319-46448-0_2
Liu, X., Hou, F., Qin, H., & Hao, A. (2018). Multi-view multi-scale CNNs for lung nodule type classification from
CT images. Pattern Recognition, 77, 262–275. https://doi.org/10.1016/j.patcog.2017.12.022
Lombardi, S., & Nishino, K. (2012). Single image multimaterial estimation. 2012 IEEE Conference on Computer
Vision and Pattern Recognition, 238–245. https://doi.org/10.1109/CVPR.2012.6247681
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3431–3440.
Retrieved from http://arxiv.org/abs/1411.4038
Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of
Computer Vision, 60(2), 91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
Lu, Q., & Conners, R. W. (2006). Using Image Processing Methods to Improve the Explosive Detection Accuracy.
IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 36(6), 750–760.
https://doi.org/10.1109/TSMCC.2005.855532
Luggar, R. D., Horrocks, J. A., Speller, R. D., & Lacey, R. J. (1997). Low angle X-ray scatter for explosives
detection: A geometry optimization. Applied Radiation and Isotopes, 48(2), 215–224.
https://doi.org/10.1016/S0969-8043(96)00212-6
Lyons, R. (2011). Understanding Digital Signal Processing. Prentice Hall.
Maastricht Upper Area Control Centre. (2018, February). Predicting flight routes with a Deep Neural Network in
the operational Air Traffic Flow and Capacity Management system. Retrieved from
https://www.eurocontrol.int/archive_download/all/node/11314
Madden, R. W., Mahdavieh, J., Smith, R. C., & Subramanian, R. (2008, August 28). An explosives detection
system for airline security using coherent x-ray scattering technology (A. Burger, L. A. Franks, & R. B.
James, Eds.). https://doi.org/10.1117/12.796174
Madhavan, P., Wiegmann, D. A., & Lacson, F. C. (2004). Occasional Automation Failures on Easy Tasks
Undermines Trust in Automation. In Proceedings of the 112th Annual Meeting of the American
Psychological Association, 1–6. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.572.1125&rep=rep1&type=pdf
Madhavan, P., Wiegmann, D. A., & Lacson, F. C. (2006). Automation Failures on Tasks Easily Performed by
Operators Undermine Trust in Automated Aids. Human Factors: The Journal of the Human Factors and
Ergonomics Society, 48(2), 241–256. https://doi.org/10.1518/001872006777724408
Mahendran, A., & Vedaldi, A. (2015). Understanding Deep Image Representations by Inverting Them. 2015
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5188–5196.
https://doi.org/10.1109/CVPR.2015.7299155
Manerikar, A., Prakash, T., & Kak, A. C. (2020). Adaptive Target Recognition: A Case Study Involving Airport Baggage Screening. In A. Ashok, J. A. Greenberg, & M. E. Gehm (Eds.), SPIE Proceedings Vol. 11404: Anomaly Detection and Imaging with X-Rays (ADIX) V (p. 114040B). https://doi.org/10.1117/12.2557638
Marshall, M., & Oxley, J. C. (Eds.). (2009). Aspects of explosives detection (1st edition). Amsterdam; Boston:
Elsevier.
Marticke, F. (2016). Optimization of an X-ray diffraction imaging system for medical and security applications (Ph.D. thesis, Communauté Université Grenoble Alpes). Retrieved from https://www.theses.fr/2016GREAT055.pdf
Martin, L., Tuysuzoglu, A., Karl, W. C., & Ishwar, P. (2015). Learning-Based Object Identification and
Segmentation Using Dual-Energy CT Images for Security. IEEE Transactions on Image Processing, 24(11),
4069–4081. https://doi.org/10.1109/TIP.2015.2456507
Martz, H. E., & Glenn, S. M. (2019). Dual-Energy X-ray Radiography and Computed Tomography. In
Nondestructive Testing Handbook, No. LLNL-BOOK-753617 (Vol. 4, p. 20). Livermore, CA, USA: Lawrence
Livermore National Laboratory. Retrieved from https://www.osti.gov/servlets/purl/1608919
McCarley, J. S., Kramer, A. F., Wickens, C. D., Vidoni, E. D., & Boot, W. R. (2004). Visual Skills in Airport-Security
Screening. Psychological Science, 15(5), 302–306. https://doi.org/10.1111/j.0956-7976.2004.00673.x
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
Megherbi, N., Han, J., Breckon, T. P., & Flitton, G. T. (2012). A comparison of classification approaches for threat
detection in CT based baggage screening. 2012 19th IEEE International Conference on Image Processing,
3109–3112. https://doi.org/10.1109/ICIP.2012.6467558
Megherbi, N., Breckon, T. P., & Flitton, G. T. (2013, October 16). Investigating existing medical CT segmentation
techniques within automated baggage and package inspection (R. Zamboni, F. Kajzar, A. A. Szep, D.
Burgess, & G. Owen, Eds.). https://doi.org/10.1117/12.2028509
Merks, S., Hättenschwiler, N., Zeballos, M., & Schwaninger, A. (2018). X-ray Screening of Hold Baggage: Are the Same Visual-Cognitive Abilities Needed for 2D and 3D Imaging? 2018 International Carnahan Conference on Security Technology (ICCST), 1–5. https://doi.org/10.1109/CCST.2018.8585715
Mery, D. (2011). Automated detection in complex objects using a tracking algorithm in multiple X-ray views. CVPR 2011 Workshops, 41–48. https://doi.org/10.1109/CVPRW.2011.5981715
Mery, D., Riffo, V., Zuccar, I., & Pieringer, C. (2013). Automated X-Ray Object Recognition Using an Efficient
Search Algorithm in Multiple Views. 2013 IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 368–374. https://doi.org/10.1109/CVPRW.2013.62
Mery, D., Riffo, V., Zscherpel, U., Mondragón, G., Lillo, I., Zuccar, I., Lobel, H., Carrasco, M. (2015). GDXray: The
Database of X-ray Images for Nondestructive Testing. Journal of Nondestructive Evaluation, 34(4), 42.
https://doi.org/10.1007/s10921-015-0315-7
Mery, D., Svec, E., & Arias, M. (2016). Object Recognition in X-ray Testing Using Adaptive Sparse
Representations. Journal of Nondestructive Evaluation, 35(3), 45. https://doi.org/10.1007/s10921-016-
0362-8
Mery, D., Riffo, V., Zuccar, I., & Pieringer, C. (2017). Object recognition in X-ray testing using an efficient search algorithm in multiple views. Insight - Non-Destructive Testing and Condition Monitoring, 59(2), 85–92. https://doi.org/10.1784/insi.2017.59.2.85
Mery, D., & Katsaggelos, A. K. (2017). A Logarithmic X-Ray Imaging Model for Baggage Inspection: Simulation and Object Detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 251–259. https://doi.org/10.1109/CVPRW.2017.37
Mery, D., Saavedra, D., & Prasad, M. (2020). X-Ray Baggage Inspection With Computer Vision: A Survey. IEEE
Access, 8, 145620–145633. https://doi.org/10.1109/ACCESS.2020.3015014
Metz, C. E. (1978). Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8(4), 283–298.
https://doi.org/10.1016/S0001-2998(78)80014-2
Meuter, R. F. I., & Lacherez, P. F. (2016). When and Why Threats Go Undetected: Impacts of Event Rate and
Shift Length on Threat Detection Accuracy During Airport Baggage Screening. Human Factors: The Journal
of the Human Factors and Ergonomics Society, 58(2), 218–228.
https://doi.org/10.1177/0018720815616306
Miao, C., Xie, L., Wan, F., Su, C., Liu, H., Jiao, J., & Ye, Q. (2019). SIXray: A Large-Scale Security Inspection X-Ray
Benchmark for Prohibited Item Discovery in Overlapping Images. 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2114–2123. https://doi.org/10.1109/CVPR.2019.00222
Michel, S., Koller, S. M., de Ruiter, J. C., Moerland, R., Hogervorst, M., & Schwaninger, A. (2007). Computer-Based
Training Increases Efficiency in X-Ray Image Interpretation by Aviation Security Screeners. 2007 41st
Annual IEEE International Carnahan Conference on Security Technology, 201–206.
https://doi.org/10.1109/CCST.2007.4373490
Michel, S., & Schwaninger, A. (2009). Human-machine interaction in x-ray screening. 43rd Annual 2009
International Carnahan Conference on Security Technology, 13–19.
https://doi.org/10.1109/CCST.2009.5335572
Michel, S., Hättenschwiler, N., Kuhn, M., Strebel, N., & Schwaninger, A. (2014). A multi-method approach toward identifying situational factors and their relevance for X-ray screening. Proceedings of the 48th IEEE International Carnahan Conference on Security Technology (ICCST), 218–213. https://doi.org/10.1109/CCST.2014.6987001
Mikolajczyk, A., & Grochowski, M. (2018). Data augmentation for improving deep learning in image
classification problem. 2018 International Interdisciplinary PhD Workshop (IIPhDW), 117–122.
https://doi.org/10.1109/IIPHDW.2018.8388338
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and
Phrases and their Compositionality. In Advances in Neural Information Processing Systems (NIPS), 3111–
3119. Retrieved from http://arxiv.org/abs/1310.4546
Montavon, G., Bach, S., Binder, A., Samek, W., & Müller, K.-R. (2017). Explaining Nonlinear Classification Decisions with Deep Taylor Decomposition. Pattern Recognition, 65, 211–222. https://doi.org/10.1016/j.patcog.2016.11.008
Moreno-Barea, F. J., Strazzera, F., Jerez, J. M., Urda, D., & Franco, L. (2018). Forward Noise Adjustment Scheme
for Data Augmentation. 2018 IEEE Symposium Series on Computational Intelligence (SSCI), 728–734.
https://doi.org/10.1109/SSCI.2018.8628917
Mouton, A., Flitton, G. T., Bizot, S., Megherbi, N., & Breckon, T. P. (2013). An evaluation of image denoising
techniques applied to CT baggage screening imagery. 2013 IEEE International Conference on Industrial
Technology (ICIT), 1063–1068. https://doi.org/10.1109/ICIT.2013.6505819
Mouton, A., Breckon, T. P., Flitton, G. T., & Megherbi, N. (2014). 3D object classification in baggage computed tomography imagery using randomised clustering forests. 2014 IEEE International Conference on Image Processing (ICIP), 5202–5206. https://doi.org/10.1109/ICIP.2014.7026053
Mouton, A., & Breckon, T. P. (2015). Materials-based 3D segmentation of unknown objects from dual-energy
computed tomography imagery in baggage security screening. Pattern Recognition, 48(6), 1961–1978.
https://doi.org/10.1016/j.patcog.2015.01.010
Movafeghi, A., Rokrok, B., & Yahaghi, E. (2020). Dual-energy X-ray Imaging in Combination with Automated
Threshold Gabor Filtering for Baggage Screening Application. Russian Journal of Nondestructive Testing,
56(9), 765–773. https://doi.org/10.1134/S1061830920090065
Mutch, J., & Lowe, D. G. (2008). Object Class Recognition and Localization Using Sparse Features with Limited
Receptive Fields. International Journal of Computer Vision, 80(1), 45–57. https://doi.org/10.1007/s11263-
007-0118-0
Nabiev, S. S., & Palkina, L. A. (2017). Modern technologies for detection and identification of explosive agents
and devices. Russian Journal of Physical Chemistry B, 11(5), 729–776.
https://doi.org/10.1134/S1990793117050190
Nagao, M., & Matsuyama, T. (1979). Edge preserving smoothing. Computer Graphics and Image Processing, 9(4), 394–407. https://doi.org/10.1016/0146-664X(79)90102-3
Natekin, A., & Knoll, A. (2013). Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, 7.
https://doi.org/10.3389/fnbot.2013.00021
Neiderman, E. C., & Fobes, J. L. (2005). Threat image projection system (U.S. Patent No. 6,899,540 B1). U.S. Patent and Trademark Office.
Ng, A. (2018). Machine Learning Yearning: Technical Strategy for AI Engineers, In the Era of Deep Learning.
https://www.deeplearning.ai
Nowak, E., Jurie, F., & Triggs, B. (2006). Sampling Strategies for Bag-of-Features Image Classification. In A.
Leonardis, H. Bischof, & A. Pinz (Eds.), Computer Vision – ECCV 2006, 3954, pp. 490–503.
https://doi.org/10.1007/11744085_38
NVIDIA. (2018). What Tower? Controlling Air Traffic with AI. Retrieved from
https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/customer-stories/searidge-
technology-success-story-us-75066-r5-hr.pdf
Office of Inspector General. (2016). Verification Review of Transportation Security Administration’s Screening
of Passengers by Observation Techniques/Behavior Detection and Analysis Program (OIG-16-111-VR).
United States Department of Homeland Security, Office of Inspector General.
https://www.hsdl.org/?view&did=794320
Oftring, C. (2015). White paper: Assessing the Impact of ECAC3 on Baggage Handling Systems –
Considerations for Upgrading Existing ECAC2 Systems. BEUMER Group. Retrieved from https://www.airport-
technology.com/downloads/whitepapers/baggage/assessing-ecac3/
Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture
classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(7), 971–987. https://doi.org/10.1109/TPAMI.2002.1017623
Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2014). Learning and Transferring Mid-level Image Representations
Using Convolutional Neural Networks. 2014 IEEE Conference on Computer Vision and Pattern Recognition,
1717–1724. https://doi.org/10.1109/CVPR.2014.222
Osipov, S. P., Usachev, E. Yu., Chakhlov, S. V., Shchetinkin, S. A., Song, S., Zhang, G., Batranin, A.V., Osipov, O. S.
(2019). Limit Capabilities of Identifying Materials by High Dual- and Multi-Energy Methods. Russian
Journal of Nondestructive Testing, 55(9), 687–699. https://doi.org/10.1134/S1061830919090055
Paranjape, R., Sluser, M., & Runtz, E. (1998). Segmentation of handguns in dual energy X-ray imagery of
passenger carry-on baggage. Conference Proceedings. IEEE Canadian Conference on Electrical and
Computer Engineering (Cat. No.98TH8341), 1, 377–380. https://doi.org/10.1109/CCECE.1998.682763
Parasuraman, R., Sheridan, T. B., & Wickens, C. D. (2000). A model for types and levels of human interaction
with automation. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans,
30(3), 286–297. https://doi.org/10.1109/3468.844354
Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., & Wermter, S. (2019). Continual Lifelong Learning with Neural
Networks: A Review. Neural Networks, 113, 54–71. https://doi.org/10.1016/j.neunet.2019.01.012
Pelleg, D., & Moore, A. (2000). X-means: Extending k-means with efficient estimation of the number of
clusters. In International Conference on Machine Learning, 727–734. Retrieved from
https://www.researchgate.net/publication/2532744_X-means_Extending_K-
means_with_Efficient_Estimation_of_the_Number_of_Clusters
Perronnin, F., Dance, C., Csurka, G., & Bressan, M. (2006). Adapted Vocabularies for Generic Visual
Categorization. In A. Leonardis, H. Bischof, & A. Pinz (Eds.), Computer Vision – ECCV 2006, Vol. 3954, 464–
475. https://doi.org/10.1007/11744085_36
Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and
fast spatial matching. 2007 IEEE Conference on Computer Vision and Pattern Recognition, 1–8.
https://doi.org/10.1109/CVPR.2007.383172
Pinto, N., Cox, D. D., & DiCarlo, J. J. (2008). Why is Real-World Visual Object Recognition Hard? PLoS
Computational Biology, 4(1), e27. https://doi.org/10.1371/journal.pcbi.0040027
Pratt, L. Y. (1992). Discriminability-Based Transfer between Neural Networks. Proceedings of the 5th
International Conference on Neural Information Processing Systems, NIPS, 5, 204–211.
https://dl.acm.org/doi/10.5555/2987061.2987087
Pratt, L., & Jennings, B. (1996). A Survey of Transfer between Connectionist Networks. Connection Science,
8(2), 163–184. https://doi.org/10.1080/095400996116866
Provost, F., & Fawcett, T. (2001). Robust classification for imprecise environments. Machine Learning, 42(3),
203–231. https://doi.org/10.1023/A:1007601015854
Psyllos, A., Ferrara, P., Beslay, L., Galbally, J., & Haraksim, R. (2019). Study on face identification technology for its implementation in the Schengen Information System. European Commission, Joint Research Centre. Retrieved from https://data.europa.eu/doi/10.2760/661464
Qi, X., Xiao, R., Li, C.-G., Qiao, Y., Guo, J., & Tang, X. (2014). Pairwise Rotation Invariant Co-Occurrence Local
Binary Pattern. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11), 2199–2213.
https://doi.org/10.1109/TPAMI.2014.2316826
Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017a). PointNet: Deep Learning on Point Sets for 3D Classification and
Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition,
652–660. Retrieved from http://arxiv.org/abs/1612.00593
Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017b). PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Proceedings of the 2017 Conference on Neural Information Processing Systems (NIPS) (Vol. 30). Retrieved from https://proceedings.neurips.cc/paper/2017/file/d8bf84be3800d12f74d8b05e9b89836f-Paper.pdf
Rebuffel, V., & Dinten, J.-M. (2007). Dual-energy X-ray imaging: Benefits and limits. Insight - Non-Destructive
Testing and Condition Monitoring, 49(10), 589–594. https://doi.org/10.1784/insi.2007.49.10.589
Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 7263–7271. Retrieved from
https://ieeexplore.ieee.org/document/8100173/
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.
https://doi.org/10.1109/TPAMI.2016.2577031
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any
Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, 1135–1144. Retrieved from http://arxiv.org/abs/1602.04938
Riffo, V., & Mery, D. (2016). Automated Detection of Threat Objects Using Adapted Implicit Shape Model. IEEE
Transactions on Systems, Man, and Cybernetics: Systems, 46(4), 472–482.
https://doi.org/10.1109/TSMC.2015.2439233
Robnik-Sikonja, M., & Kononenko, I. (2008). Explaining Classifications for Individual Instances. IEEE
Transactions on Knowledge and Data Engineering, 20(5), 589–600.
https://doi.org/10.1109/TKDE.2007.190734
Rogers, K., & Evans, P. (2018). X-ray diffraction and focal construct technology. In X-ray Diffraction Imaging:
Technology and Applications (Devices, Circuits, and Systems series, pp. 165–188). Boca Raton, FL, USA: CRC
Press.
Rogers, T. W., Jaccard, N., Protonotarios, E. D., Ollier, J., Morton, E. J., & Griffin, L. D. (2016). Threat Image
Projection (TIP) into X-ray images of cargo containers for training humans and machines. In 2016 IEEE
International Carnahan Conference on Security Technology (ICCST), 1–7.
https://doi.org/10.1109/CCST.2016.7815717
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image
Segmentation. In N. Navab, J. Hornegger, W. M. Wells, & A. F. Frangi (Eds.), Medical Image Computing and
Computer-Assisted Intervention – MICCAI 2015 (pp. 234–241). Retrieved from
http://arxiv.org/abs/1505.04597
Roomi, M. (2012). Detection of Concealed Weapons in X-Ray Images Using Fuzzy K-NN. International Journal
of Computer Science, Engineering and Information Technology, 2(2), 187–196.
https://doi.org/10.5121/ijcseit.2012.2216
Rosten, E., & Drummond, T. (2006). Machine Learning for High-Speed Corner Detection. In A. Leonardis, H.
Bischof, & A. Pinz (Eds.), Computer Vision – ECCV 2006, 3951, 430–443.
https://doi.org/10.1007/11744023_34
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,
M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International
Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y
Schmidt-Hackenberg, L., Yousefi, M. R., & Breuel, T. M. (2012). Visual cortex inspired features for object
detection in X-ray images. In Proceedings of the 21st International Conference on Pattern Recognition
(ICPR2012), 2573–2576. Tsukuba, Japan: IEEE.
Schott, J. R., Salvaggio, C., Brown, S. D., & Rose, R. A. (1995). Incorporation of texture in multispectral synthetic
image generation tools. In W. R. Watkins & D. Clement (Eds.), Proceedings of SPIE (pp. 189–196).
https://doi.org/10.1117/12.210590
Schwaninger, A. (2005). Increasing Efficiency in Airport Security Screening. In AVSEC World 2004: Managing
Stress, Trauma and Change in the Airline Industry, WIT Transactions on The Built Environment, 82, 405–
416. https://doi.org/10.2495/SAFE050401
Schwaninger, A., Hardmeier, D., Riegelnig, J., & Martin, M. (2010). Use It and Still Lose It?: The Influence of Age
and Job Experience on Detection Performance in X-Ray Screening. The Journal of Gerontopsychology and
Geriatric Psychiatry (GeroPsych), 23(3), 169–175. https://doi.org/10.1024/1662-9647/a000020
Schwartz, G., & Nishino, K. (2013). Visual Material Traits: Recognizing Per-Pixel Material Context. In 2013 IEEE
International Conference on Computer Vision Workshops (ICCVW), 883–890.
https://doi.org/10.1109/ICCVW.2013.121
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2014). OverFeat: Integrated Recognition,
Localization and Detection using Convolutional Networks. In International Conference on Learning
Representations (ICLR), 2014. Retrieved from http://arxiv.org/abs/1312.6229
Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep Inside Convolutional Networks: Visualising Image
Classification Models and Saliency Maps. ArXiv:1312.6034 [Cs]. Retrieved from
http://arxiv.org/abs/1312.6034
Simonyan, K., & Zisserman, A. (2015, May). Very Deep Convolutional Networks for Large-Scale Image
Recognition. Presented at the International Conference on Learning Representations (ICLR), 2015, San
Diego, CA, USA. Retrieved from http://arxiv.org/abs/1409.1556
Singh, S., & Singh, M. (2003). Explosives detection systems (EDS) for aviation security. Signal Processing,
83(1), 31–55. https://doi.org/10.1016/S0165-1684(02)00391-2
Singh, M., & Singh, S. (2004). Image segmentation optimisation for x-ray images of airline luggage.
Proceedings of the 2004 IEEE International Conference on Computational Intelligence for Homeland
Security and Personal Safety, 2004. CIHSPS 2004, 10–17. https://doi.org/10.1109/CIHSPS.2004.1360198
Singh, M., & Singh, S. (2005). Optimizing image enhancement for screening luggage at airports. Proceedings of
the 2005 IEEE International Conference on Computational Intelligence for Homeland Security and Personal
Safety, CIHSPS 2005, 131–136. https://doi.org/10.1109/CIHSPS.2005.1500627
Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In
Proceedings of the Ninth IEEE International Conference on Computer Vision, 2, 1470–1477.
https://doi.org/10.1109/ICCV.2003.1238663
Skorupski, J., & Uchroński, P. (2016). A Human Being as a Part of the Security Control System at the Airport.
Procedia Engineering, 134, 291–300. https://doi.org/10.1016/j.proeng.2016.01.010
Skorupski, J., Uchroński, P., & Łach, A. (2018). A method of hold baggage security screening system
throughput analysis with an application for a medium-sized airport. Transportation Research Part C:
Emerging Technologies, 88, 52–73. https://doi.org/10.1016/j.trc.2018.01.009
Sluser, M., & Paranjape, R. (1999). Model-based probabilistic relaxation segmentation applied to threat
detection in airport X-ray imagery. Engineering Solutions for the Next Millennium. 1999 IEEE Canadian
Conference on Electrical and Computer Engineering (Cat. No.99TH8411), 2, 720–726.
https://doi.org/10.1109/CCECE.1999.808023
Smiths Detection. (2020). HI-Scan 10080 XCT. Retrieved from https://smithsdetection-
scio.com/AssetDownload.aspx?client=1&task=t5%2bIUGVlOVlP%2bs1dTJyXmw%3d%3d
Smiths Detection. (2021a). ICMORE Weapons. Retrieved from https://smithsdetection-
scio.com/AssetDownload.aspx?client=1&task=11hK4J%2Bwuy%2F42%2FK50%2FRWpA%3D%3D
Smiths Detection. (2021b). ICMORE_Lithium Batteries and Dangerous Goods. Retrieved from
https://smithsdetection-scio.com/AssetDownload.aspx?client=1&task=DQlpZCkBbbGZdkLWJax5rw%3D%3D
Smiths Detection. (2021c). Smiths Detection deploys integrated X-ray system with auto-detecting explosives
and weapons software at HarbourFront Station during Emergency Preparedness Exercise. Retrieved from
https://www.smithsdetection.com/press-releases/smiths-detection-deploys-integrated-x-ray-system-with-
auto-detecting-explosives-and-weapons-software-at-harbourfront-station-during-emergency-
preparedness-exercise/
Soroosh, A. (2021, June). TSA: Machine Learning: Benefits Detection and Introduces New Challenges.
Presented at the meeting on the role of artificial intelligence in aviation security, ECAC and DG MOVE, Brussels.
Speakman, S. A. (2015). Basics of X-Ray Powder Diffraction: Training to Become an Independent User of the
X-Ray SEF at the Center for Materials Science and Engineering at MIT. Retrieved from
http://prism.mit.edu/xray/documents/1%20Basics%20of%20X-Ray%20Powder%20Diffraction.pdf
Statista. (2021). Number of passengers carried by air in the European Union from 2008 to 2020. Retrieved
from https://www.statista.com/statistics/1118397/air-passenger-transport-european-union/
Sterchi, Y., & Schwaninger, A. (2015). A first simulation on optimizing EDS for cabin baggage screening
regarding throughput. In 2015 International Carnahan Conference on Security Technology (ICCST), 55–60.
https://doi.org/10.1109/CCST.2015.7389657
Stewart, M. G., & Mueller, J. E. (2018). Are We Safe Enough? Measuring and Assessing Aviation Security.
Elsevier.
Strantz, N. J. (1990). Aviation Security and Pan Am Flight 103: What Have We Learned? Journal of Air Law and
Commerce, 56, 413. Retrieved from https://scholar.smu.edu/jalc/vol56/iss2/3
Strecker, H., Harding, G. L., Bomsdorf, H., Kanzenbach, J., Linde, R., & Martens, G. (1994). Detection of
explosives in airport baggage using coherent x-ray scatter. In G. L. Harding, R. C. Lanza, L. J. Myers, & P. A.
Young (Eds.), Substance Detection Systems, 2092, 399–410. https://doi.org/10.1117/12.171259
Su, J., Vargas, D. V., & Sakurai, K. (2019). One pixel attack for fooling deep neural networks. IEEE Transactions
on Evolutionary Computation, 23(5), 828–841. https://doi.org/10.1109/TEVC.2019.2890858
Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting Unreasonable Effectiveness of Data in Deep
Learning Era. 2017 IEEE International Conference on Computer Vision (ICCV), 843–852.
https://arxiv.org/abs/1707.02968
Svec, E. (2016). Sparse KNN - a method for object recognition over X-ray images using KNN based in sparse
reconstruction (M.Sc. thesis, Pontificia Universidad Católica de Chile, School of Engineering). Retrieved from
https://repositorio.uc.cl/handle/11534/21187
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper
with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9.
https://doi.org/10.1109/CVPR.2015.7298594
Thiagarajan, J. J., Kailkhura, B., Sattigeri, P., & Ramamurthy, K. N. (2016). TreeView: Peeking into Deep Neural
Networks Via Feature-Space Partitioning. In Proceedings of the Interpretability Workshop, Barcelona, Spain.
Retrieved from https://arxiv.org/abs/1611.07429
Tran, T., Pham, T., Carneiro, G., Palmer, L., & Reid, I. (2017). A Bayesian Data Augmentation Approach for
Learning Deep Models. Neural Information Processing Systems, NIPS, 31, 2794–2803.
https://doi.org/10.48550/arXiv.1710.10564
Transportation Security Administration. (2015). Planning Guidelines and Design Standards for Checked
Baggage Inspection Systems. Retrieved from https://crp.trb.org/acrp0715/wp-content/themes/acrp-
child/documents/111/original/Planning_Guidelines_and_Design_Standards_for_Checked_Baggage_Inspecti
on_Systems.pdf
Transportation Security Administration. (2019, February 7). TSA Year in Review: A Record Setting 2018.
Retrieved from https://www.tsa.gov/blog/2019/02/07/tsa-year-review-record-setting-2018
Transportation Security Administration. (2020, January 15). TSA Year in Review: 2019. Retrieved from
https://www.tsa.gov/blog/2020/01/15/tsa-year-review-2019
Transportation Security Administration. (2021). What can I bring? Retrieved from
https://www.tsa.gov/travel/security-screening/whatcanibring/all
Turcsany, D., Mouton, A., & Breckon, T. P. (2013). Improving feature-based object recognition for X-ray
baggage security screening using primed visualwords. 2013 IEEE International Conference on Industrial
Technology (ICIT), 1140–1145. https://doi.org/10.1109/ICIT.2013.6505833
Turek, M. (2016, August). Explainable Artificial Intelligence (XAI). Defense Advanced Research Projects Agency.
Turner, S. (1994). Terrorist Explosive Sourcebook: Countering Terrorist Use of Improvised Explosive Devices.
Paladin Press.
U.S. Government Accountability Office. (2019). Aviation Security: TSA Should Ensure Screening Technologies
Continue to Meet Detection Requirements after Deployment. GAO-20-56. Retrieved from
https://www.gao.gov/products/gao-20-56
Uroukov, I., & Speller, R. (2015). A Preliminary Approach to Intelligent X-ray Imaging for Baggage Inspection at
Airports. Signal Processing Research, 4(0), 1. Retrieved from
https://www.academia.edu/27930756/A_Preliminary_Approach_to_Intelligent_X_ray_Imaging_for_Baggag
e_Inspection_at_Airports_Application_to_the_detection_of_threat_materials_and_objects
Vagia, M., Transeth, A. A., & Fjerdingen, S. A. (2016). A literature review on the levels of automation during the
years. What are the different taxonomies that have been proposed? Applied Ergonomics, 53, 190–202.
https://doi.org/10.1016/j.apergo.2015.09.013
Vapnik, V. N. (1999). The Nature of Statistical Learning Theory. https://doi.org/10.1007/978-1-4757-2440-0
Vedaldi, A., & Fulkerson, B. (2010). Vlfeat: An open and portable library of computer vision algorithms.
Proceedings of the International Conference on Multimedia - MM ’10, 1469–1472.
https://doi.org/10.1145/1873951.1874249
Vetter, T. (1998). Synthesis of Novel Views from a Single Face Image. International Journal of Computer
Vision, 28(2), 103–116. https://link.springer.com/article/10.1023/A:1008058932445
von Bastian, C. C., Schwaninger, A., & Michel, S. (2008). Do multi-view X-ray systems improve X-ray image
interpretation in airport security screening? Zeitschrift für Arbeitswissenschaft, 3, 166–173.
https://doi.org/10.3239/9783640684991
Vyas, A., Yu, S., & Paik, J. (2018). Fundamentals of digital image processing. In Signals and Communication
Technology. https://doi.org/10.1007/978-981-10-7272-7_1
Walk, S., Majer, N., Schindler, K., & Schiele, B. (2010). New features and insights for pedestrian detection. 2010
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1030–1037.
https://doi.org/10.1109/CVPR.2010.5540102
Wang, Q., & Breckon, T. P. (2020). Contraband Materials Detection Within Volumetric 3D Computed
Tomography Baggage Security Screening Imagery. ArXiv:2012.11753 [Cs, Eess]. Retrieved from
http://arxiv.org/abs/2012.11753
Wang, Q., & Breckon, T. P. (2021). On the Evaluation of Semi-Supervised 2D Segmentation for Volumetric 3D
Computed Tomography Baggage Security Screening. In 2021 International Joint Conference on Neural
Networks (IJCNN), 1–8. IEEE.
Wang, Q., Bhowmik, N., & Breckon, T. P. (2020a). Multi-Class 3D Object Detection Within Volumetric 3D
Computed Tomography Baggage Security Screening Imagery. In 2020 19th IEEE International Conference
on Machine Learning and Applications (ICMLA), 13–18. https://doi.org/10.1109/ICMLA51294.2020.00012
Wang, Q., Ismail, K. N., & Breckon, T. P. (2020b). An approach for adaptive automatic threat recognition within
3D computed tomography images for baggage security screening. Journal of X-Ray Science and
Technology, 28(1), 35–58. https://doi.org/10.3233/XST-190531
Wang, Y. (2020). A Mathematical Introduction to Generative Adversarial Nets (GAN). Retrieved from
http://arxiv.org/abs/2009.00169
Wei, X., Gong, B., Liu, Z., Lu, W., & Wang, L. (2018). Improving the Improved Training of Wasserstein GANs: A
Consistency Term and Its Dual Effect. ArXiv:1803.01541 [Cs, Stat]. Retrieved from
http://arxiv.org/abs/1803.01541
Wells, K., & Bradley, D. A. (2012). A review of X-ray explosives detection techniques for checked baggage.
Applied Radiation and Isotopes, 70(8), 1729–1746. https://doi.org/10.1016/j.apradiso.2012.01.011
Wiley, D. F., Ghosh, D., & Woodhouse, C. (2012). Automatic Segmentation of CT Scans of Checked Baggage. In
Proceedings of the 2nd International Meeting on Image Formation in X-Ray CT, 310–313.
Wolfe, J. M., & Van Wert, M. J. (2010). Varying Target Prevalence Reveals Two Dissociable Decision Criteria in
Visual Search. Current Biology, 20(2), 121–124. https://doi.org/10.1016/j.cub.2009.11.066
Wolfe, F. (2020, June 29). How Searidge Uses Artificial Intelligence to Revolutionize Airports, Air Traffic
Management. Aviation Today. https://www.aviationtoday.com/2020/06/29/how-searidge-uses-artificial-
intelligence-to-revolutionize-airports-air-traffic-management
Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., & Ma, Y. (2009). Robust Face Recognition via Sparse
Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 210–227.
https://doi.org/10.1109/TPAMI.2008.79
Xu, M., Zhang, H., & Yang, J. (2018). Prohibited Item Detection in Airport X-Ray Security Images via Attention
Mechanism Based CNN. In J.-H. Lai, C.-L. Liu, X. Chen, J. Zhou, T. Tan, N. Zheng, & H. Zha (Eds.), Pattern
Recognition and Computer Vision—First Chinese Conference, PRCV 2018, Guangzhou, China, November 23-
26, 2018, Proceedings, Part II (pp. 429–439). https://doi.org/10.1007/978-3-030-03335-4_37
Xu, S., & Muselet, D. (2020). Deep Learning for Material recognition: Most recent advances and open
challenges. Presented at the International Conference on Big Data, Machine Learning and Applications,
Silchar, India.
Yang, C., Rangarajan, A., & Ranka, S. (2018). Global Model Interpretation via Recursive Partitioning. IEEE 20th
International Conference on High Performance Computing and Communications; IEEE 16th International
Conference on Smart City; IEEE 4th International Conference on Data Science and Systems
(HPCC/SmartCity/DSS), 1563–1570. https://doi.org/10.1109/HPCC/SmartCity/DSS.2018.00256
Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks?
In Advances in Neural Information Processing Systems (NIPS), 3320–3328. Retrieved from
https://proceedings.neurips.cc/paper/2014/file/375c71349b295fbe2dcdca9206f20a06-Paper.pdf
Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive
Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML),
609–611. https://dl.acm.org/doi/10.5555/645530.655658
Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. European Conference
on Computer Vision (ECCV), 8689, 818–833. https://doi.org/10.1007/978-3-319-10590-1_53
Zentai, G. (2008). X-ray imaging for homeland security. In 2008 IEEE International Workshop on Imaging
Systems and Techniques, 1–6. https://doi.org/10.1109/IST.2008.4659929
Zhang, C., Benz, P., Karjauv, A., & Kweon, I. S. (2021). Universal Adversarial Perturbations Through the Lens of
Deep Steganography: Towards A Fourier Perspective. Thirty-Fifth AAAI Conference on Artificial Intelligence,
AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The
Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, 3296–
3304. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/16441
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). mixup: Beyond Empirical Risk Minimization.
Retrieved from http://arxiv.org/abs/1710.09412
Zhang, H., Dana, K., & Nishino, K. (2015). Reflectance Hashing for Material Recognition. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 3071–3080. https://doi.org/10.1109/CVPR.2015.7298926
Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2019). Self-Attention Generative Adversarial Networks.
Proceedings of the 36th International Conference on Machine Learning, PMLR, 97, 7354–7363. Retrieved
from http://arxiv.org/abs/1805.08318
Zhao, Z., Zhang, H., & Yang, J. (2018). A GAN-Based Image Generation Method for X-Ray Security Prohibited
Items. In J.-H. Lai, C.-L. Liu, X. Chen, J. Zhou, T. Tan, N. Zheng, & H. Zha (Eds.), Pattern Recognition and
Computer Vision, 11256, 420–430. https://doi.org/10.1007/978-3-030-03398-9_36
Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2020). Random Erasing Data Augmentation. Proceedings of the
AAAI Conference on Artificial Intelligence, 34(07), 13001–13008. https://doi.org/10.1609/aaai.v34i07.7000
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Learning Deep Features for Discriminative
Localization. ArXiv:1512.04150 [Cs]. Retrieved from http://arxiv.org/abs/1512.04150
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2020a). Unpaired Image-to-Image Translation using Cycle-
Consistent Adversarial Networks. ArXiv:1703.10593 [Cs]. Retrieved from http://arxiv.org/abs/1703.10593
Zhu, Y., Zhang, Y., Zhang, H., Yang, J., & Zhao, Z. (2020b). Data Augmentation of X-Ray Images in Baggage
Inspection Based on Generative Adversarial Networks. IEEE Access, 8, 86536–86544.
https://doi.org/10.1109/ACCESS.2020.2992861

List of abbreviations and definitions

ACC accuracy
ACI Airports Council International
ADXRD angle-dispersive X-ray diffraction
AI artificial intelligence
AI HLEG High-Level Expert Group on Artificial Intelligence
AISM adapted implicit shape model
ALERT awareness and localization of explosives-related threats
ANN artificial neural networks
AP average precision
API application programming interface
AT advanced technology
ATR automatic threat recognition
AUC area under the curve
BoVW bag of visual words
BRIEF binary robust independent elementary features
CGAN conditional generative adversarial network
CNN convolutional neural network
COCO common objects in context
DBA Durham baggage anomaly
DBP Durham baggage patch
DCGAN deep convolutional generative adversarial network
DECT dual-energy computed tomography
DEI dual energy index
DGH density gradient magnitude histogram
DHS Department of Homeland Security
DNN deep neural network
DR detection rate
EASA European Union Aviation Safety Agency
EC European Commission
EDS explosives detection system
EDXRD energy-dispersive X-ray diffraction
ETD explosive trace detection
EU European Union
FAA Federal Aviation Administration
FAR false alarm rate
FAST features from accelerated segment test
FCN fully convolutional network
FCNN fully connected neural network
FFOB full firearm vs operational benign
FPR false positive rate
FRCNN faster region-based convolutional neural network
GAN generative adversarial network
GBDT gradient boosted decision tree
GDXray Grima X-ray dataset
GPU graphics processing unit
HOG histogram of oriented gradients
IED improvised explosive device
ILSVRC ImageNet Large Scale Visual Recognition Challenge
ISM implicit shape model
JRC Joint Research Centre
KNN k-nearest neighbour
LAC linear attenuation coefficient
LAGs liquids, aerosols and gels
LBP local binary patterns
LIDAR light detection and ranging
LLNL Lawrence Livermore National Laboratory
PHOW pyramid histogram of visual words
PMMA polymethyl methacrylate
PPV positive predictive value
RAF rationale about failure
RBF radial basis function
R-CNN region-based convolutional neural network
RGB red, green, blue
RIFT rotation invariant feature transform
RMS root mean square
ROC receiver operating characteristic
ROI region of interest
RPN region proposal network
SAGAN self-attention generative adversarial network
SECT single-energy computed tomography
SIFT scale-invariant feature transform
SPEARS screener proficiency evaluation and reporting system
SSD single-shot detector
SURF speeded up robust features
SVM support vector machine
TIP threat image projection
TNT trinitrotoluene
TPR true positive rate
TSA Transportation Security Administration
UAP universal adversarial perturbation
UAS unmanned aerial system
UCL University College London
U.S. United States
VGG visual geometry group
XAI explainable artificial intelligence
XRD X-ray diffraction
YOLO you only look once

List of figures
Figure 1. Number of passengers (millions) carried by air in the EU from 2008 to 2020. ....................... 5
Figure 2. Terrorist attacks against airplanes and airports since 2001 by attack type. ........................... 5
Figure 3. Typical workflow in U.S. airports security checkpoints with various screening technologies. ......... 7
Figure 4. (a) Schematic of luggage line scanner, (b) X-ray radiographic image, showing similarity between
black powder (left) and honey (right). ................................................................................12
Figure 5. Notional plot showing threats (red) and non-threats (green). As the number of X-ray features
increases, the detection rate increases and/or the false alarm rate decreases....................................13
Figure 6. Zeff and density for commonly seen innocuous materials and for illicit materials. ..................13
Figure 7. Examples of different materials shown in different colours on a pseudo-colour image from dual-
energy X-ray scans: (a) EDS with an explosive automatically detected based on density and effective atomic
number framed in red in two views, and (b) automatic detection using AI technology of gun (left) and knife
(right). ..................................................................................................................14
Figure 8. (a) A single-view X-ray image of a bag containing a knife, and (b) multi-view X-ray image of the
same bag. The knife is hard to spot on the single-view scan due to high superposition and difficult rotation,
while it is easily visible in the second view of the multi-view X-ray scan where the superposition is low. .....15
Figure 9. Simplified sketch of a 4-view X-ray scanner. As the baggage goes through the tunnel (in z
direction), 4 X-ray generators scan one slice of the baggage (x − y plane) to generate two energy images for
each of the 4 views. ...................................................................................................16
Figure 10. DETECT™ 1000 three-dimensional high spatial resolution image. A suspect threat object is
highlighted in red. Metallic and plastic objects are highlighted in blue and gray, respectively, using equipment
from Integrated Defence and Security Solutions Corporation. .....................................................16
Figure 11. Baggage-CT scans illustrating lower 2D image quality, low resolution, artefacts and clutter,
obtained on Reveal CT80-DR dual-energy baggage scanner. ......................................................17
Figure 12. Crystalline materials: the crystal system describes the shape of the unit cell (left), the lattice
parameters describe the size of the unit cell (left), and the unit cell repeats in all dimensions to fill space and
produce the macroscopic grains or crystals (middle) of the material (right). ......................................18
Figure 13. (a) Measured energy dispersive X-ray diffraction (EDXRD) spectra of pure explosives constituents:
ammonium nitrate, HMX (octogen); and of military and industrial explosives: TNT, ammongelite, Semtex and
Seismoplast, and (b) diffraction profiles of two explosives and typical baggage contents. .....................18
Figure 14. Incidents of smoke, heat, fire, or explosion involving lithium batteries in air cargo or hold baggage
presented per year. ....................................................................................................20
Figure 15. Illustration of a cabin baggage screening security checkpoint with the four positions of an airport
security officer: bag loading, pat-down search of passengers, X-ray screening of passenger bags, secondary
search of passenger bags. ............................................................................................22
Figure 16. X-ray images from a multi-view EDS-CB machine. The same bag is shown from two viewpoints
differing by about 90 degrees. It contains an IED on which the EDS-CB has alarmed (shown by red rectangles).
..........................................................................................................................23
Figure 17. CT for cabin baggage: an example of an EDS alarm from the Smiths HI-SCAN 6040 CTiX. .......24
Figure 18. Fitted quadratic trend for the generalized estimating equations model for accuracy (percent
correct) of detected threats in airport baggage as a function of time on task for high workload shifts.
Performance declined exponentially with increasing time on task. ................................................25
Figure 19. Image-based factors in X-ray screening: (a) bag complexity; (b) superposition; (c) high and low
target visibility, i.e., metallic baseball bat in blue (left) with 92% detection rate, wooden baseball bat in orange
(right) with 15% detection rate; and (d) viewpoint. .................................................................27
Figure 20. Confusion matrix. ........................................................................................31
Figure 21. ROC curves and precision-recall curves of two classifiers applied on a dataset with different ratios
of positives and negatives: (a) ROC curves, positives: negatives = 1:1; (b) precision-recall curves, positives:
negatives = 1:1; (c) ROC curves, positives: negatives = 1:10 and (d) precision-recall curves, positives: negatives
= 1:10. ..................................................................................................................33
Figure 22. Scores and classifications of 10 instances, and the resulting ROC curve. ...........................34
Figure 23. ROC graphs of two classifiers A and B, with the AUC marked for both: AUC of B > AUC of A. .........35
Figure 24. Examples of positive (left) and negative (right) data for object recognition..........................39
Figure 25. Example of an object detection task: (a) a detected gun, framed, and (b) a bag without threats. ...40
Figure 26. Salient point detectors on a colour X-ray image using the OpenCV 2.2 implementations of Harris,
SIFT’s DoG, SURF’s Hessian-Laplace and FAST, respectively. .......................................................41
Figure 27. Example of three images per class from the GDXray database, described in Mery et al. (2015). ...42
Figure 28. Representation of class i objects: a) Patches of all X-ray images from the training set of class i are
extracted. b) Each patch is represented with the feature descriptor values. c) Patches that appear too often (no
valuable information) or too seldom (noise) are filtered out. d) The set of points is clustered into Q parent
clusters. e) Each parent cluster is clustered into R child clusters, e.g. R representative samples from parent
cluster Q, and all the Q*R centroids are stored in the dictionary. f) Visualization of the patches of the dictionary
with centroids of R and Q clusters. In this example 2,400 points are presented with Q = 6 (parent clusters) and
R = 10 (child clusters). ................................................................................................42
Figure 29. (a) Precision-recall curve for all tested features, and (b) correctly detected gun in two bags with
different orientations. .................................................................................................44
Figure 30. Image data augmentation publications from 2015–2020. ............................................46
Figure 31. The same image after different types of geometric (top) and photometric (bottom) transformations. 47
Figure 32. Threat image projection (TIP) pipeline for synthetically composited image generation. ............48
Figure 33. Data generated using different GANs trained on GDXray (Mery et al., 2015): (a) from Zhao
et al. (2018), (b) from Zhu et al. (2020b), and (c) from Kim et al. (2020). .......................................49
Figure 34. Performance of deep learning (ANNs with many layers) against that of most other machine
learning algorithms. Deep ANNs still benefit from large amounts of data, whereas the performance increase of
other machine learning models plateaus. ...........................................................................50
Figure 35. Eight ILSVRC-2010 test images and the five labels considered most probable by the proposed
CNN. The correct label is written under each image, and the probability assigned to the correct label is also
shown with a red bar (if it happens to be in the top 5). ...........................................................51
Figure 36. Five ILSVRC-2010 test images in the first column. The adjacent columns show the six training
images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the
feature vector for the test image. ....................................................................................52
Figure 37. Visualization of features in a fully trained model. For each feature map the corresponding image
patches are also shown. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher
layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1,
col 1). ..................................................................................................................53
Figure 38. Transfer learning pipeline: (A) shows the classification pipeline for a source task, while (B) is a target
task, initialized with the parameters learned in the source task. ..................................................54
Figure 39. Effectiveness of feature transfer. (a) Neural networks (NNs) baseA and baseB are trained on set A
and set B; selffer BnB is a NN where the first n layers are copied from baseB and frozen, and the remaining
layers are randomly initialized and trained on dataset B. (b) Results from random splits into sets A and B: each
marker represents accuracy over the validation set; horizontal groups of same-type markers represent 8
random splits of the initial set into A and B sets. (c) Lines connecting the means of each random split
(numbered descriptions above each line refer to interpretation). (d) Performance degradation vs. number of
layers in transfer learning; degradation when transferring between dissimilar tasks (upper line: networks
trained toward the “natural” target task; lower line: those trained toward the “man-made” target task).......55
Figure 40. Schematics for the CNN-driven detection strategies evaluated by Akçay et al. (2018): (A) sliding
window based CNN, (B) Faster RCNN, (C) R-FCN, and (D) YOLOv2................................................58
Figure 41. Examples of multiple prohibited item detection for the inter-domain X-ray security training and
evaluation configurations: (A) Dbf3 ⇒ SIXray10 and (B) SIXray10 ⇒ Dbf3 with varying CNN models. ........61
Figure 42. Attention mechanism based model. (a) and (b) represent feed-forward and feed-back propagation
for a convolutional neural network. (a) Given an input image, the output neuron corresponding to the predicted
category is activated after the feed-forward propagation and represented by the red dot. (b) In the feed-back
propagation, the red dots represent neurons positively related to the output neuron and are activated layer by
layer. ....................................................................................................................62
Figure 43. Examples of generated image samples: (a) real X-ray images, (b) images generated by DCGAN, (c)
images generated by WGAN-GP, and (d) images generated by CT-GAN...........................................64
Figure 44. Images generated by the different GAN models........................................................64
Figure 45. Faster R-CNN feature architecture. ....................................................................65
Figure 46. GAN-generated images after (a) 3,000 iterations and (b) 18,000 iterations. .......................66
Figure 47. ROC curves for handgun detection using five different feature descriptors: density descriptor (D),
density histogram (DH), density gradient magnitude histogram (DGH), RIFT and SIFT. ..........................67
Figure 48. Exemplar false positive (left) and false negative (right) detection results. The false detections are
emphasized with red arrows. .........................................................................................68
Figure 49. X-ray image of organic content packed in a rucksack with the computed results for the detection of
all organics as well as the specific detection of tobacco obtained at 80 kV. Image areas where tobacco was
hidden are indicated with the white broken line. The ‘arrow’ indicates one of the hidden loose tobacco
packages and the ‘arrowheads’ indicate hidden boxed cigarettes. (A) & (C): greyscale X-ray images of
combined organic materials including food, liquids and gels; (B): detection of the entire organic content; (D):
demonstration of tobacco detection (loose and cigarettes). .......................................................69
Figure 50. Pipeline for full-scan material classification in the patch-classification approach (Benedykciuk et al.,
2020). ..................................................................................................................70
Figure 51. An illustration of an intensity-based split with mean Hounsfield Unit (HU) values on the x-axis. ...71
Figure 52. (a) Scatter plot of the high-energy attenuation voxel values versus the low-energy attenuation
voxel values for different materials (values are in HU). Left: values from all labels; right: zoom on the purple
box in the lower-left corner of the left plot. (b) Axial and (c) coronal views of test bag results. Left: high-energy
attenuation in HU displayed in the range [−1000, 500]; middle: KNN results; right: final version of the method
of Martin et al. (2015) with spatial smoothing and data weighting. ..............................................72
Figure 53. Qualitative evaluation of material segmentation and classification using varying methods for
examples A–E, with the materials present being saline (orange), rubber (green), and clay (blue). ..............73
Figure 54. Qualitative evaluation of detection results of different approaches (from left to right: CT volumes,
ground truth labels, the 2D FCN method using semi-supervised learning (Wang & Breckon, 2021), and the 3D
Residual U-Net (Lee et al., 2017)). Materials shown are saline (orange), rubber (green), and clay (blue). ......74
Figure 55. X-ray diffractometer for explosive detection. ..........................................................75
Figure 56. XRD system for baggage inspection conceived by L-3 Communications Security and Detection
Systems, and an example of an acquired diffraction spectrum. ...................................................76
Figure 57. Examples of linear attenuation coefficient (LAC) curves. ..............................................77
Figure 58. Assurance model for machine learning (ML) applications..............................................78
Figure 59. European interoperability framework model. ...........................................................82

List of tables
Table 1. Potential applications of AI across aviation. ............................................................... 8
Table 2. Material pseudo-colours and their classes, as widely used in X-ray security scanners. .................14
Table 3. Public datasets for baggage inspection. ..................................................................38
Table 4. Overview of different data augmentation approaches, with examples of their use for X-ray threat
detection. ...............................................................................................................47
Table 5. Performance of the transfer learning method from Akçay et al. (2016) for the two-class problem
(gun vs. no gun) using the test set, and comparison with the BoVW combined with SVM or Random Forest (RF)
method used in Turcsany et al. (2013). .............................................................................56
Table 6. Results for the multi-class problem (average precision %). .............................................56
Table 7. Results of CNN and BoVW features on a dataset of patches for firearm classification. AlexNetab
denotes that the network is fine-tuned from layer a to layer b. ...................................................59
Table 8. Detection results of SW-CNN, Fast-RCNN (RCNN) (Girshick, 2015), Faster RCNN (FRCNN) (Ren et al.,
2017), R-FCN (Dai et al., 2016) and YOLOv2 (Redmon & Farhadi, 2017) for the firearm detection problem
(left) and the multi-class problem (right), using 300 region proposals (Akçay et al., 2018)......................60
Table 9. Detection results of varying CNN architectures trained on real and synthetic data [Dbf3Real: three-
class real data (top row), Dbf3SC: three-class synthetic data (middle row) and Dbf3Real+SC: three-class real
and synthetic data (bottom row)]. All models are evaluated on a set of real X-ray security imagery. ..........63
Table 10. Matching results of the CNN model trained on real images: number of correctly classified generated
images. .................................................................................................................64

GETTING IN TOUCH WITH THE EU
In person
All over the European Union there are hundreds of Europe Direct information centres. You can find the address of the centre
nearest you at: https://europa.eu/european-union/contact_en
On the phone or by email
Europe Direct is a service that answers your questions about the European Union. You can contact this service:
- by freephone: 00 800 6 7 8 9 10 11 (certain operators may charge for these calls),
- at the following standard number: +32 22999696, or
- by electronic mail via: https://europa.eu/european-union/contact_en
FINDING INFORMATION ABOUT THE EU

Online
Information about the European Union in all the official languages of the EU is available on the Europa website at:
https://europa.eu/european-union/index_en
EU publications
You can download or order free and priced EU publications from EU Bookshop at: https://publications.europa.eu/en/publications.
Multiple copies of free publications may be obtained by contacting Europe Direct or your local information centre (see
https://europa.eu/european-union/contact_en).
KJ-NA-31123-EN-N

doi: 10.2760/46363

ISBN 978-92-76-53494-5
