Assessment of X Ray Image Interpretation
Abstract—The competency of an aviation security screener to increasing motivation. In short, certification and competency
detect prohibited items in X-ray images quickly and reliably is assessment can be important instruments for improving
important for any airport security system. This paper details the aviation security.
requirements and principles for reliable, valid, and standardized
competency assessment as well as different methods that can be The implementation of competency assessment procedures
applied for this purpose. First, the concepts of reliability, validity presents several challenges. First, what should be assessed has
and standardization are explained. Then, threat image projection to be identified. Then, there should be consideration of how
(TIP) as a means to assess X-ray image interpretation procedures for the certification of different competencies can
competency is discussed. This is followed by a discussion of be implemented. Another important challenge is international
computer-based tests, which provide another often more reliable standardization, since several countries, organizations, and
and standardized option for measuring X-ray image even companies are independently developing their own
interpretation competency. Finally, the application of computer- certification or quality control systems.
based tests in an EU-funded project (VIA Project,
www.viaproject.eu) is presented. ECAC Doc. 30¹ of the European Civil Aviation Conference
specifies three elements for initial certification of aviation
Keywords: aviation security; competency security screeners:
assessment; X-ray screening; certification
• an X-ray image interpretation exam
I. COMPETENCY ASSESSMENT IN AVIATION • a theoretical exam
SECURITY SCREENING
• a practical exam
In response to the increased risk of terrorist attacks, large
investments have been made in recent years in aviation security The periodical certification should contain a theoretical
technology. But soon enough it became clear that the best exam and an X-ray image interpretation exam. Practical exams
equipment is of limited value if the people who operate it are can be conducted if considered necessary.
not selected and trained appropriately to perform their tasks This paper covers the first element, that is, how to examine
effectively and accurately and the relevance of human factors competency in X-ray image interpretation. First, important
has increasingly been recognized. This entailed the research concepts such as reliability, validity and standardization are
and investigation on best practices regarding the training of presented as well as different methods for competency
security personnel and the assessment of needed abilities and assessment and certification of screeners. Second, on-the-job
competencies. The main focus lies on the personnel who assessment of screener competency using TIP is discussed.
conduct security screening at airports (aviation security Third, an example of a reliable, valid, and standardized
screeners). computer-based test is presented which is now used at more
Competency assessment maintains the workforce than 100 airports worldwide, the X-Ray Competency
certification process. The main aim of certification procedures Assessment Test (X-Ray CAT). Fourth, the application of this
is to ensure that adequate standards in aviation security are test in an EU-funded project (the VIA Project) at several
consistently and reliably achieved. Certification of aviation European airports is presented.
security screeners can be considered as providing quality
control over the screening process. Using certification tests, A. Requirements for Assessing Competency
important information on strengths and weaknesses in aviation One of the most important tasks of an aviation security
security procedures in general as well as on each individual screener is the interpretation of X-ray images of passenger bags
screener can be obtained. As a consequence, certification can and the identification of prohibited items within these bags. Hit
also be a valuable basis for qualifying personnel, measuring rates, false alarm rates, and the time used to visually inspect an
training effectiveness, improving training procedures, and
1
This research was financially supported by the European Commission ECAC Doc. 30, Annex IV-12A, “Certification Criteria for
Leonardo da Vinci Programme (VIA Project, DE/06/C/F/TH-80403). This Screeners,” and ECAC Doc. 30, chapter 12, 12.2.3,
paper is a summary of Work Package 6: Competency assessment tests. For
more information, see www.viaproject.eu. “Certification of Security Staff,” 1.1.10.3.
Figure 1. Illustration of different correlation coefficients. Left: r = 0.05, middle: r = 0.50, right: r = 0.95
different results, the test is not reliable. If, for example, an IQ Among others, concurrent validity refers to whether a test can
test yields a score of 90 for an individual today and 125 a week distinguish between groups that it should be able to distinguish
later, it is not reliable. The concept of reliability is illustrated in between (e.g., between trained and untrained screeners). In
Fig. 1. Each point represents an individual person. The x-axis order to establish convergent validity, it has to be shown that
measures that should be related are indeed related. If, for
represents the test results in the first measurement and the y-
example, threat image projection (TIP, i.e., the insertion of
axis represents the scores in the same test of the second fictional threat items into X-ray images of passenger bags)
measurement. Fig. 1a–c represent tests of different reliability. measures the same competencies as a computer-based offline
The test in Fig. 1a is not reliable. The score a participant test, one would expect a high correlation between TIP
achieved in the first measurement does not correspond at all performance data and the computer-based test scores.
with the test score in the second measurement.
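The test-retest idea described above can be made concrete with a short sketch. It computes the Pearson correlation coefficient r between two administrations of the same test, which is the quantity illustrated in Fig. 1. The function name and the scores are hypothetical and serve only as an illustration; they are not data from this paper.

```python
import math


def pearson_r(first, second):
    """Pearson correlation between two equally long lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    sxx = sum((x - mean_x) ** 2 for x in first)
    syy = sum((y - mean_y) ** 2 for y in second)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores of five people in the first and second measurement.
first_scores = [90, 100, 110, 120, 130]
second_scores = [92, 99, 111, 118, 131]

print(round(pearson_r(first_scores, second_scores), 3))
```

A value near 1 means that people keep their relative position from one measurement to the next; a value near 0 corresponds to the unreliable test of Fig. 1a.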
3) Standardization and development of population norms: 3) Computer-based X-ray image interpretation tests:
The third important aspect of judging the quality of a test is Computer-based X-ray image interpretation tests constitute a
standardization. This involves administering the test to a valuable tool for standardized measurements of X-ray image
representative group of people in order to establish norms (a interpretation competency. These tests should consist of X-ray
normative group). When an individual takes the test, it can then images of passenger bags containing different prohibited
be determined how far above or below the average her or his objects. The categories of items should reflect the prohibited
score is, relative to the normative group. It is important to items list and requirements of the appropriate authority, and it
know how the normative group was selected, though. For should be ensured that the test content remains up to date. The
instance, for the standardization of a test used to evaluate the test should also contain clean bag images, that is, X-ray
images of bags that do not contain a prohibited object. For
detection performance of screeners, a meaningful normative
each image, the screeners should indicate whether or not a
group of a large and representative sample of screeners (at least
prohibited object is present. Additionally, the screeners can be
200 males and 200 females) should be tested.
requested to identify the prohibited item(s). The image display
In summary, competency assessment of X-ray image
interpretation needs to be based on tests that are reliable, valid, duration should be comparable to operational conditions.
and standardized. However, it is also important to consider test Test conditions should be standardized and comparable for
difficulty, particularly if results from different tests are all participants. For example, the brightness and contrast on the
compared with each other. Although two tests can have similar monitor should be calibrated and similar for all participants.
properties in terms of reliability, an easy test may not This applies equally to other monitor settings that could
adequately assess the level of competency needed for the X-ray influence detection performance (e.g., the refresh rate). In order
screening job. to achieve a valid measure of detection performance, not only
hit rates but also false alarm rates should be taken into account.
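One common way to take hit rates and false alarm rates into account together is the sensitivity index d' from signal detection theory [9], defined as z(hit rate) minus z(false alarm rate), where z is the inverse of the standard normal distribution function. The sketch below is an illustration under that assumption, not necessarily the exact computation used for the tests discussed in this paper; the function name is hypothetical.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index d' = z(hit rate) - z(false alarm rate)."""
    for rate in (hit_rate, false_alarm_rate):
        # z() is undefined at exactly 0 or 1, so such rates must be
        # corrected by the caller before computing d'.
        if not 0.0 < rate < 1.0:
            raise ValueError("rates must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf  # inverse standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


# Two hypothetical screeners with the same hit rate but different
# false alarm rates: the second one simply raises more alarms.
careful = d_prime(0.80, 0.20)
liberal = d_prime(0.80, 0.50)
print(careful > liberal)
```

The example shows why the hit rate alone is not a valid measure: both screeners detect 80% of the threats, but the one who achieves this with fewer false alarms has the higher d'.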
B. Assessment of X-ray Image Interpretation Competency The probability of detecting a prohibited item depends on
Currently, there are several methods used to assess X-ray the knowledge of a screener as well as on the general difficulty
image interpretation competency: covert testing (infiltration of the item. Image-based factors such as the orientation in
testing), threat image projection (TIP), and computer-based which a threat item is depicted in the bag (view difficulty), the
image tests. degree by which other objects are superimposed on an object
(superposition), and the number and type of other objects
1) Covert testing: Covert testing, as the exclusive basis within the bag (bag complexity) influence detection
for individual assessment of X-ray image interpretation performance substantially [6], [7]. Tests should take these
competency, is only acceptable if the requirements of effects into account.
reliability, validity, and standardization are fulfilled. For
covert testing to achieve these requirements, a significant C. Certification of X-Ray Image Interpretation Competency
number of tests of the same screener are necessary in order to As indicated above, individuals carrying out screening
assess competency reliably. More research is needed to operations should be certified initially and periodically
address this issue and it should be noted that this paper does thereafter. Certification can not only be considered as
not apply to principles and requirements for covert testing providing quality control over the screening process, but also
used to verify compliance with regulatory requirements. as a valuable basis for awarding personnel a qualification,
2) Threat image projection (TIP): Screener competency measuring training effectiveness, improving training
can also be assessed using TIP data if certain requirements are procedures, and increasing motivation. Certification data
met. In cabin baggage screening, TIP is the projection of provides important information on strengths and weaknesses in
fictional threat items into X-ray images of passenger bags aviation security procedures in general as well as on individual
during the routine baggage screening operation. This way, the screeners. Furthermore, certification can help in achieving
detection performance of a screener can be measured under international standardization in aviation security. The present
operational conditions. Using raw TIP data alone does not section gives a brief overview of how a certification system can
provide a reliable measure of individual screener detection be implemented.
performance. Data needs to be aggregated over time in order As mentioned above, certification of screeners should
to have a large enough sample upon which to perform contain a theoretical exam and an X-ray image interpretation
meaningful analysis. In order to achieve reliable, valid, and exam. For periodic certification, practical exams can be
standardized measurements, several other aspects need to be conducted if considered necessary, unlike the initial
taken into account as well when analyzing TIP data. One certification, where practical exams are required. The exams
requirement is to use an appropriate TIP library. This should should meet the requirements of high reliability and validity
contain a large number of threat items, which represent the and standardization (see above).
prohibited items that need to be detected and which feature a The X-ray image interpretation exam should be adapted to
reasonable difficulty level. See the section on reliable the domain in which a screener is employed, that is, cabin
measurement of performance using TIP for more information baggage screening, hold baggage screening, or cargo screening.
on how to use TIP data for measuring X-ray detection Since not every threat object always constitutes a threat during
performance of screeners. the flight, depending on where aboard the aircraft it is
transported, screeners should be certified according to their
domain. The certification of cabin baggage screeners should be individual performance assessment. Also viewpoint difficulty,
based on cabin baggage X-ray images that contain all kinds of superposition, and bag complexity may need to be considered.
objects that are prohibited from being carried on in cabin Finally, as mentioned above, data needs to be aggregated over
baggage (guns, knives, improvised explosive devices, and other time in order to have a large enough sample upon which to
prohibited items). Objects that are prohibited from being perform meaningful analyses. TIP data should only be used for
transported in the cabin of an aircraft do not necessarily pose a certification of screeners if the reliability of the data has been
threat when transported in the hold or in cargo. Furthermore, proven, for example by showing that the correlation between
different types of bags are transported in the cabin, the hold, TIP scores based on odd days and even days aggregated over
and cargo. Usually, small suitcases or bags serve as hand several months is higher than .75.
baggage, whereas big suitcases and traveling bags are
transported in the hold of the aircraft. The certification of hold III. X-RAY COMPETENCY ASSESSMENT TEST (X-
baggage screeners should be conducted using X-ray images of RAY CAT)
hold baggage. Similarly, cargo screeners should be tested using
X-ray images of cargo items. This section introduces the X-Ray Competency Assessment
Test (X-Ray CAT) as an example of a computer-based test that
Screeners should be kept up to date regarding new and can be used for assessing X-ray image interpretation
emerging threats. In order to verify whether this is consistently competency (Fig. 2 displays an example of the stimuli). The
achieved, it is recommended that a recurrent certification CAT has been developed on the basis of scientific findings
should be conducted on a periodical basis, typically every 1-2 regarding threat detection in X-ray images of passenger bags
years. The minimum threshold that should be achieved in the [6], [7]. How well screeners can detect prohibited objects in
tests in order to pass certification should be defined by the passenger bags is influenced in two different ways. First, it
national appropriate authority and should be based on a large depends on the screener’s knowledge of what objects are
and representative sample of screeners. prohibited and what they look like in X-ray images. This
knowledge is an attribute of the individual screener and can be
II. MEASUREMENT OF PERFORMANCE ON THE enhanced by specific training. Second, the probability of
JOB USING THREAT IMAGE PROJECTION (TIP) detecting a prohibited item in an X-ray image of a passenger
bag also depends on image-based factors. These are the
Threat image projection (TIP) is a function of state-of-the- orientation of the prohibited item within the bag (view
art X-ray machines that allows the exposure of aviation difficulty), the degree by which other objects are superimposed
security screeners to artificial but realistic X-ray images during over an object in the bag (superposition), and the number and
the process of the routine X-ray screening operation at the
type of other objects within the bag (bag complexity).
security checkpoint. For cabin baggage screening (CBS), Systematic variation or control of the image-based factors is a
fictional threat items (FTIs) are digitally projected in random fundamental property of the test and has to be incorporated in
positions into X-ray images of real passenger bags. In hold the test development. In the X-Ray CAT, the effects of
baggage screening (HBS), combined threat images (CTIs) are viewpoint are controlled by using two standardized rotation
displayed on the monitor. In this case, not only the threat item angles in an ‘easy’ and a ‘difficult’ view for each forbidden
is projected but an image of a whole bag that may or may not object. Superposition is controlled in the sense that it is held
contain a threat item. This is possible if the screeners visually constant over the two views and as far as possible over all
inspecting the hold baggage are physically separated from the objects. With regard to bag complexity, the bags are chosen in
passengers and their baggage. Projecting whole bags in HBS such a way that they are visually comparable in terms of the
provides not only the opportunity to project threat images (i.e., form and number of objects with which they are packed.
bags containing a threat item) but also non-threat images (i.e.,
bags not containing any threat item). This also allows the
recording of false alarms (namely, if a non-threat image
was judged as containing a threat) and correct rejections
(namely, if a non-threat image was judged as being harmless).
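The four possible outcomes of a projected image (hit, miss, false alarm, correct rejection) can be tallied as in the following sketch. The record format, a pair of booleans per projected image, and all names are assumptions made for illustration; real TIP systems export data in their own formats.

```python
from collections import Counter


def classify(threat_present, alarm_raised):
    """Map one screener decision to its signal detection outcome."""
    if threat_present:
        return "hit" if alarm_raised else "miss"
    return "false_alarm" if alarm_raised else "correct_rejection"


def rates(trials):
    """Return (hit rate, false alarm rate) for (threat, alarm) pairs.

    A rate is None when there are no images of the relevant kind,
    for example no non-threat images at all.
    """
    counts = Counter(classify(threat, alarm) for threat, alarm in trials)
    n_threat = counts["hit"] + counts["miss"]
    n_clean = counts["false_alarm"] + counts["correct_rejection"]
    hit_rate = counts["hit"] / n_threat if n_threat else None
    fa_rate = counts["false_alarm"] / n_clean if n_clean else None
    return hit_rate, fa_rate


# Hypothetical decisions: (threat image?, judged as containing a threat?)
example_trials = [
    (True, True), (True, False),
    (False, True), (False, False), (False, False),
]
print(rates(example_trials))
```

When only threat images are projected, the false alarm rate cannot be estimated from the projected images, which is why the sketch returns None in that case.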
TIP data are an interesting source for various purposes like
quality control, risk analysis, and assessment of individual
screener performance. Unlike the situation in a test setting,
individual screener performance can be assessed on the job
when using TIP data. However, if used for the measurement of
individual screener X-ray detection performance, international Figure 2. Example images from the X-Ray CAT. Left: harmless
bag (non-threat image), right: same bag with a prohibited item
standards of testing have to be met, that is, the method needs to at the top right corner (threat image). The prohibited item (gun)
be reliable, valid, and standardized (see above). In a study of is also shown separately at the bottom right.
CBS and HBS TIP, it was found that there were very low
reliability values for CBS TIP data when a small TIP image
library of a few hundred FTIs was used [8]. Good reliabilities The X-Ray CAT contains two sets of objects in which
were found for HBS TIP data when a large TIP image library object pairs are similar in shape (see Fig. 3). This construction
was available. It is suggested that a large image library (at least not only allows the measurement of any effect of training, that
1000 FTIs) containing a representative sample of items of is, if detection performance can be increased by training, but
varying difficulty should be used when TIP is used for
also possible transfer effects. The threat objects of one set can items - # false alarms). The analyses were made separately for
then be included in the training. By measuring detection threat images and for non-threat images. Table 1 shows the
performance after training using both sets of the test, it can be reliability coefficients.
ascertained whether training also helped in improving the
As stated above, an acceptable test should reach split half
detection of the objects that did not appear during training.
correlations of at least .75 and Cronbach alpha values of at least
Should this be the case, it indicates a transfer of the knowledge
.85. Bearing this in mind, the reliability values listed in Table 1
gained about the visual appearance of objects used in training
show that the X-Ray CAT is very reliable and therefore a
to similar-looking objects.
useful tool for measuring the detection performance of aviation
security screeners.
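As a rough illustration of the two reliability coefficients named above, the sketch below computes a split-half correlation (odd versus even items) and Cronbach's alpha from a matrix of item scores and checks them against the thresholds of .75 and .85. The data, the odd/even split, and all function names are assumptions for illustration; the sketch does not reproduce the analysis behind Table 1, and no Spearman-Brown correction is applied.

```python
import math
from statistics import variance


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def split_half_r(scores):
    """Correlate each person's sum over odd items with the sum over even items."""
    odd = [sum(row[0::2]) for row in scores]
    even = [sum(row[1::2]) for row in scores]
    return pearson_r(odd, even)


def cronbach_alpha(scores):
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def meets_thresholds(split_half, alpha):
    """Thresholds stated in the text: split-half >= .75, alpha >= .85."""
    return split_half >= 0.75 and alpha >= 0.85


# Hypothetical data: one row per screener, one column per image
# (1 = correct response, 0 = incorrect response).
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
]
print(meets_thresholds(split_half_r(scores), cronbach_alpha(scores)))
```

The same odd/even logic can in principle be applied to TIP data by splitting on odd and even days instead of odd and even items.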
TABLE I. RELIABILITY ANALYSES: RELIABILITY COEFFICIENTS FOR HITS (HIT) AND CORRECT REJECTIONS (CR)
discriminate, for example, between trained and untrained discussions and results obtained by means of the EU-funded
screeners [11]. Fig. 5 shows the results of the study. It can be VIA Project.
seen that the detection performance increases for the trained
screeners but not for the untrained screeners. This means that A. The VIA Project
the test is able to discriminate between screeners who received The VIA Project evolved from the tender call in 2005 of the
training with the computer-based training system X-Ray Tutor European Commission’s Leonardo da Vinci program on
and those who did not receive training with X-Ray Tutor [12], vocational education and training. The project’s full title is
[13]. Therefore, the concurrent validity of the X-Ray CAT can “Development of a Reference Levels Framework for
be confirmed. A’VIA’tion Security Screeners.” The aim of the project is to
develop appropriate competence and qualification assessment
tools and to propose a reference levels framework (RLF) for
aviation security screeners at national and cross-sectoral levels.
[Figure: detection performance (d') at the first, second, and third measurement]
view difficulty and threat category (guns, IEDs, knives and depicted in easy views than for threat objects depicted in
other items), and the between-participants airport factor is difficult views (canonical views rotated by 85 degrees).
reported as well. As part of the ANOVA, only the significant
interaction effects are reported and considered to be noteworthy
in the context.
1) Detection performance comparison between airports:
Fig. 6 shows the comparison of the detection performance
achieved at eight European airports that participated in the VIA
project. First, the detection performance was calculated for
each screener individually. The data were averaged across
screeners for each airport. Thin bars represent the standard
deviation (a measure of variability) across screeners. Due to its
security sensitivity and for data protection reasons, the
individual airports’ names are not indicated and no numerical
data are given here.
effects [category * airport and view difficulty * airport]). very high reliability scores and its design allows us to measure
the X-ray image interpretation competency of aviation security
3) Analysis of Variance (ANOVA): The following screeners with regard to different aspects of their ability and
statistics provide quantitative values for what has been reported knowledge. The X-Ray CAT is widely used at many different
graphically. This allows us to compare the effects of the airports throughout the world, for competency assessment and
different factors. We applied a three-way ANOVA to the two certification purposes as well as in studies assessing the
within-subjects factors, category and view difficulty, and one fundamentals of the demands required for the job of the
between-subjects airport factor on the detection performance aviation security screener. It was also shown how a reliable,
measure A’. valid, and standardized test can be used to compare X-ray
The analysis revealed highly significant main effects on image interpretation competency across different airports and
threat category (guns, IEDs, knives, and other items) with an countries (on behalf of the EU-funded VIA Project).
effect size of η² = .131, F(3, 5602.584) = 339.834, MSE =
2.057, p < .001, on view difficulty (easy view vs. REFERENCES
difficult/rotated view) with an effect size of η² = .47, F(1,
[1] J. A. Fishman and T. Galguera, Introduction to Test Construction in the
2257) = 2009.772, MSE = 9.031, p < .001, and also on the Social and Behavioural Sciences. A Practical Guide. Oxford: Rowman
between-subjects airport factor with η² = .080, F(1, 2257) = & Littlefield, 2003.
28.128, MSE = 1.415, p < .001. The following two-way [2] P. Kline, Handbook of Psychological Testing. London: Routledge, 2000.
interactions were also highly significant: threat category * view [3] K. R. Murphy and C. O. Davidshofer, Psychological Testing. Upper
difficulty: η² = .094, F(3, 6542.213) = 233.969, MSE = .931, p Saddle River, NJ: Prentice Hall, 2001.
< .001, threat category * airport: η² = .068, F(3, 5602.584) = [4] N. A. MacMillan and C. D. Creelman, Detection Theory: A User’s
23.411, MSE = .142, p < .001, and view difficulty * airport: η² = Guide. New York: Cambridge University Press, 1991.
.159, F(1, 2257) = 60.953, MSE = .274, p < .001. These results [5] F. Hofer and A. Schwaninger, “Reliable and valid measures of threat
indicate different detection performance for different threat detection performance in x-ray screening,” IEEE ICCST Proceedings,
vol. 38, pp. 303–8, 2004.
categories and higher detection performance for prohibited
items in easy view than for rotated threat items (the effect of [6] A. Schwaninger, D. Hardmeier, and F. Hofer, “Measuring visual
abilities and visual knowledge of aviation security screeners,” IEEE
viewpoint). This is consistent with results reported in the view- ICCST Proceedings, vol. 38, pp. 258–64, 2004.
based object recognition literature (for reviews see, for [7] A. Schwaninger, “Evaluation and Selection of Airport Security
example, [16], [17]). The effect sizes were very large according Screeners,” AIRPORT, vol. 2, pp. 14–15, 2003.
to Cohen’s conventions [18]. [8] F. Hofer and A. Schwaninger, “Using Threat Image Projection Data for
Assessing Individual Screener Performance,” WIT Transactions on the
C. Discussion Built Environment, vol. 82, pp. 417–26, 2005.
[9] D. M. Green and J. A. Swets, Signal Detection Theory and
Although the reported real world application consists of Psychophysics. New York: Wiley, 1966.
baseline measurement data only, some important features of the [10] D. Hardmeier, F. Hofer, and A. Schwaninger, “Increased Detection
X-Ray CAT could be illustrated well. X-Ray CAT allows us to Performance in Airport Security Screening Using the X-Ray ORT as
measure and to evaluate the effects of view difficulty and threat Pre-employment Assessment Tool,” Proceedings of the 2nd International
objects practically independently of each other. Furthermore, Conference on Research in Air Transportation, ICRAT 2006, Belgrade,
the X-Ray CAT can be used as a very reliable tool to compare Serbia and Montenegro, June 24–28, pp. 393–97, 2006.
the X-ray image interpretation competency of security staff at [11] S. M. Koller, D. Hardmeier, S. Michel, and A. Schwaninger,
“Investigating Training, Transfer and Viewpoint Effects Resulting from
different airports and other types of infrastructure using X-ray Recurrent CBT of X-Ray Image Interpretation,” Journal of
technology for security control procedures. Transportation Security, vol. 1, no. 2, pp. 81-106, 2008.
[12] A. Schwaninger, “Computer-Based Training: A Powerful Tool for the
V. SUMMARY AND CONCLUSIONS Enhancement of Human Factors,” Aviation Security International, vol.
10, pp. 31–36, 2004.
The competency of a screener to detect prohibited items in [13] A. Schwaninger, “Increasing Efficiency in Airport Security Screening,”
X-ray images quickly and reliably is important for any airport WIT Transactions on the Built Environment, vol. 82, pp. 405–16, 2005.
security system. Computer-based tests, TIP, and to a limited [14] W. H. Angoff, “Norms, Scales, and Equivalent Scores,” in Educational
extent covert tests can be used to assess individual competency Measurement (2nd ed.), R. L. Thorndike, Ed. Washington: American
in X-ray image interpretation. However, to achieve reliable, Council on Education, 1971, pp. 508–600.
valid, and standardized measurements, it is essential that the [15] D. Green and J. Swets, “Signal Detection Theory and Psychophysics,” in
requirements and principles detailed in this paper are followed Detection Theory: A User’s Guide, N. MacMillan, Ed. London:
Erlbaum, 1966.
by those who produce, procure, or evaluate the competency
[16] M. J. Tarr and H. H. Bülthoff, “Is Human Object Recognition Better
assessment of the X-ray image interpretation tests of individual Described by Geon Structural Descriptions or by Multiple Views?
screeners. Comment on Biederman and Gerhardstein (1993),” Journal of
Experimental Psychology: Human Perception and Performance, vol. 21,
This paper introduced the competency assessment in airport pp. 1494–1505, 1995.
security screening. In order to achieve a meaningful result the [17] M. J. Tarr and H. H. Bülthoff, “Image-Based Object Recognition in
assessment has to meet the criteria of reliability and validity. Man, Monkey and Machine,” in Object Recognition in Man, Monkey
Furthermore, the assessment has to be standardized to allow the and Machine, M. J. Tarr and H. H. Bülthoff, Eds. Cambridge, MA: MIT
evaluation of screeners’ performance in relation to the Press, 1998, pp. 1–20.
population norm. A focus was laid on the computer-based X- [18] J. Cohen, Statistical Power Analysis for the Behavioral Sciences. New
Ray Competency Assessment Test (X-Ray CAT). It features York: Erlbaum, Hillsdale, 1988.