
Assessment of X-Ray Image Interpretation Competency of Aviation Security Screeners


Saskia M. Steiner-Koller, Anton Bolfing, & Adrian Schwaninger
University of Applied Sciences Northwestern Switzerland
School of Applied Psychology, Institute Human in Complex Systems
Olten, Switzerland

Abstract—The competency of an aviation security screener to detect prohibited items in X-ray images quickly and reliably is important for any airport security system. This paper details the requirements and principles for reliable, valid, and standardized competency assessment, as well as different methods that can be applied for this purpose. First, the concepts of reliability, validity and standardization are explained. Then, threat image projection (TIP) as a means to assess X-ray image interpretation competency is discussed. This is followed by a discussion of computer-based tests, which provide another, often more reliable and standardized, option for measuring X-ray image interpretation competency. Finally, the application of computer-based tests in an EU-funded project (VIA Project, www.viaproject.eu) is presented.

Keywords—aviation security; competency assessment; X-ray screening; certification

I. COMPETENCY ASSESSMENT IN AVIATION SECURITY SCREENING

In response to the increased risk of terrorist attacks, large investments have been made in recent years in aviation security technology. But it soon became clear that the best equipment is of limited value if the people who operate it are not selected and trained appropriately to perform their tasks effectively and accurately, and the relevance of human factors has increasingly been recognized. This prompted research into best practices regarding the training of security personnel and the assessment of the required abilities and competencies. The main focus lies on the personnel who conduct security screening at airports (aviation security screeners).

Competency assessment underpins the workforce certification process. The main aim of certification procedures is to ensure that adequate standards in aviation security are consistently and reliably achieved. Certification of aviation security screeners can be considered as providing quality control over the screening process. Using certification tests, important information on strengths and weaknesses in aviation security procedures in general, as well as on each individual screener, can be obtained. As a consequence, certification can also be a valuable basis for qualifying personnel, measuring training effectiveness, improving training procedures, and increasing motivation. In short, certification and competency assessment can be important instruments for improving aviation security.

The implementation of competency assessment procedures presents several challenges. First, what should be assessed has to be identified. Then, there should be consideration of how procedures for the certification of different competencies can be implemented. Another important challenge is international standardization, since several countries, organizations, and even companies are independently developing their own certification or quality control systems.

ECAC Doc. 30 of the European Civil Aviation Conference specifies three elements for the initial certification of aviation security screeners:¹

• an X-ray image interpretation exam
• a theoretical exam
• a practical exam

¹ ECAC Doc. 30, Annex IV-12A, "Certification Criteria for Screeners," and ECAC Doc. 30, chapter 12, 12.2.3, "Certification of Security Staff," 1.1.10.3.

(This research was financially supported by the European Commission Leonardo da Vinci Programme (VIA Project, DE/06/C/F/TH-80403). This paper is a summary of Work Package 6: Competency assessment tests. For more information, see www.viaproject.eu.)

The periodical certification should contain a theoretical exam and an X-ray image interpretation exam. Practical exams can be conducted if considered necessary.

This paper covers the first element, that is, how to examine competency in X-ray image interpretation. First, important concepts such as reliability, validity and standardization are presented, as well as different methods for competency assessment and certification of screeners. Second, on-the-job assessment of screener competency using TIP is discussed. Third, an example of a reliable, valid, and standardized computer-based test is presented which is now used at more than 100 airports worldwide, the X-Ray Competency Assessment Test (X-Ray CAT). Fourth, the application of this test in an EU-funded project (the VIA Project) at several European airports is presented.

A. Requirements for Assessing Competency

One of the most important tasks of an aviation security screener is the interpretation of X-ray images of passenger bags and the identification of prohibited items within these bags. Hit rates, false alarm rates, and the time used to visually inspect an X-ray image of a passenger bag are important measures that can be used to assess the effectiveness of screeners at this task.



A hit refers to detecting a prohibited item in an X-ray image of a passenger bag. The hit rate refers to the percentage of all X-ray images of bags containing a prohibited item that are correctly judged as being NOT OK. If a prohibited item is reported in an X-ray image of a bag that does not contain such an item, this counts as a false alarm. The false alarm rate refers to the percentage of all harmless bags (i.e., bags not containing any prohibited items) that are judged by a screener as containing a prohibited item. The time taken to process each bag is also important, as it helps in determining throughput rates and can indicate response confidence.
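To make the two rates concrete, here is a small Python sketch (the counts are invented and purely illustrative, not data from the paper):

```python
# Hypothetical judgment records: (bag_contains_threat, judged_not_ok)
judgments = [(True, True), (True, False), (True, True), (True, True),
             (False, False), (False, True), (False, False), (False, False)]

threat = [judged for contains, judged in judgments if contains]
harmless = [judged for contains, judged in judgments if not contains]

hit_rate = sum(threat) / len(threat)              # threat bags correctly judged NOT OK
false_alarm_rate = sum(harmless) / len(harmless)  # harmless bags judged NOT OK

print(f"hit rate = {hit_rate:.0%}, false alarm rate = {false_alarm_rate:.0%}")
# hit rate = 75%, false alarm rate = 25%
```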
The results of an X-ray image interpretation test provide very important information for civil aviation authorities, aviation security institutions, and companies. Moreover, failing a test can have serious consequences, depending on the regulations of the appropriate authority. Therefore, it is essential that a test be fair, reliable, valid, and standardized. In the last 50 years, scientific criteria have been developed that are widely used in psychological testing and psychometrics. These criteria are essential for the development of tests for measuring human performance. A summary of the three most important concepts, namely reliability, validity, and standardization, is now presented.

1) Reliability: Reliability refers to the "consistency" or "repeatability" of measurements. It is the extent to which the measurements of a test remain consistent over repeated tests of the same participant under identical conditions. If a test yields consistent results for the same measure, it is reliable. If repeated measurements produce different results, the test is not reliable. If, for example, an IQ test yields a score of 90 for an individual today and 125 a week later, it is not reliable. The concept of reliability is illustrated in Fig. 1. Each point represents an individual person. The x-axis represents the test results in the first measurement and the y-axis represents the scores in the same test in the second measurement. Figs. 1a–c represent tests of different reliability. The test in Fig. 1a is not reliable. The score a participant achieved in the first measurement does not correspond at all with the test score in the second measurement.

Figure 1. Illustration of different correlation coefficients. Left: r = 0.05, middle: r = 0.50, right: r = 0.95.

The reliability coefficient can be calculated by the correlation between the two measurements. In Fig. 1a, the correlation is near zero, that is, r = 0.05 (the theoretical maximum is 1). The test in Fig. 1b is somewhat more reliable. The correlation between the two measurements is 0.50. Fig. 1c shows a highly reliable test with a correlation of 0.95.

Reliability can also be a measure of a test's internal consistency. Using this method, the reliability of the test is judged by estimating how well the items that reflect the same construct or ability yield similar results. The most common index for estimating internal reliability is Cronbach's alpha. Cronbach's alpha is often interpreted as the mean of all possible split-half estimates (for details see standard textbooks on psychometrics, for example [1], [2], [3]).

Acceptable tests usually have reliability coefficients between 0.7 and 1.0. Correlations exceeding 0.9 are not often achieved. For individual performance to be measured reliably, correlation coefficients of at least 0.75 and a Cronbach's alpha of at least 0.85 are recommended. These numbers represent minimum values; in the scientific literature, the suggested values are often higher.
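Both indices can be computed directly from item-level test data. The following Python sketch is our own illustration (not part of the paper), assuming a screeners × items matrix of 0/1 scores; the split-half variant shown uses an odd–even item split with the Spearman–Brown correction:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a screeners x items matrix of 0/1 scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                           # number of test items
    item_var = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return k / (k - 1) * (1.0 - item_var / total_var)

def split_half(scores: np.ndarray) -> float:
    """Odd-even split-half reliability with the Spearman-Brown correction."""
    odd = scores[:, 0::2].sum(axis=1)
    even = scores[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

# Simulated data: 200 screeners, 64 items; higher-ability screeners solve more items.
rng = np.random.default_rng(7)
ability = rng.normal(size=(200, 1))
scores = (rng.normal(size=(200, 64)) < ability).astype(int)

print(f"alpha = {cronbach_alpha(scores):.2f}, split-half = {split_half(scores):.2f}")
```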

2) Validity: Validity indicates whether a test is able to measure what it is intended to measure. For example, the hit rate alone is not a valid measure of detection performance in terms of discriminability (or sensitivity), because a high hit rate can also be achieved by judging most bags as containing prohibited items. In order to measure detection performance in terms of discriminability (or sensitivity), the false alarm rate must be considered, too [4], [5].

As with reliability, there are also different types of validity. Among others, concurrent validity refers to whether a test can distinguish between groups that it should be able to distinguish between (e.g., between trained and untrained screeners). In order to establish convergent validity, it has to be shown that measures that should be related are indeed related. If, for example, threat image projection (TIP, i.e., the insertion of fictional threat items into X-ray images of passenger bags) measures the same competencies as a computer-based offline test, one would expect a high correlation between TIP performance data and the computer-based test scores.

3) Standardization and development of population norms: The third important aspect of judging the quality of a test is standardization. This involves administering the test to a representative group of people in order to establish norms (a normative group). When an individual takes the test, it can then be determined how far above or below the average his or her score is, relative to the normative group. It is important to know how the normative group was selected, though. For instance, for the standardization of a test used to evaluate the detection performance of screeners, a meaningful normative group of a large and representative sample of screeners (at least 200 males and 200 females) should be tested.

In summary, competency assessment of X-ray image interpretation needs to be based on tests that are reliable, valid, and standardized. However, it is also important to consider test difficulty, particularly if results from different tests are compared with each other. Although two tests can have similar properties in terms of reliability, an easy test may not adequately assess the level of competency needed for the X-ray screening job.
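As a small illustration of norm-referenced scoring (invented numbers, assuming approximately normally distributed scores), an individual result can be located relative to the normative group like this:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical detection scores of a normative group of screeners
norm_group = [0.72, 0.80, 0.85, 0.78, 0.90, 0.76, 0.83, 0.81, 0.79, 0.88]
individual = 0.87

mu, sigma = mean(norm_group), stdev(norm_group)
z = (individual - mu) / sigma           # distance from the group mean in SD units
percentile = NormalDist().cdf(z) * 100  # share of the norm group expected to score lower

print(f"z = {z:.2f}, percentile = {percentile:.0f}")
```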
B. Assessment of X-ray Image Interpretation Competency

Currently, there are several methods used to assess X-ray image interpretation competency: covert testing (infiltration testing), threat image projection (TIP), and computer-based image tests.

1) Covert testing: Covert testing, as the exclusive basis for individual assessment of X-ray image interpretation competency, is only acceptable if the requirements of reliability, validity, and standardization are fulfilled. For covert testing to achieve these requirements, a significant number of tests of the same screener are necessary in order to assess competency reliably. More research is needed to address this issue, and it should be noted that this paper does not apply to principles and requirements for covert testing used to verify compliance with regulatory requirements.

2) Threat image projection (TIP): Screener competency can also be assessed using TIP data if certain requirements are met. In cabin baggage screening, TIP is the projection of fictional threat items into X-ray images of passenger bags during the routine baggage screening operation. This way, the detection performance of a screener can be measured under operational conditions. Using raw TIP data alone does not provide a reliable measure of individual screener detection performance. Data needs to be aggregated over time in order to have a large enough sample upon which to perform meaningful analysis. In order to achieve reliable, valid, and standardized measurements, several other aspects need to be taken into account as well when analyzing TIP data. One requirement is to use an appropriate TIP library. This should contain a large number of threat items, which represent the prohibited items that need to be detected and which feature a reasonable difficulty level. See Section II on measuring performance on the job using TIP for more information on how to use TIP data for measuring the X-ray detection performance of screeners.

3) Computer-based X-ray image interpretation tests: Computer-based X-ray image interpretation tests constitute a valuable tool for standardized measurements of X-ray image interpretation competency. These tests should consist of X-ray images of passenger bags containing different prohibited objects. The categories of items should reflect the prohibited items list and requirements of the appropriate authority, and it should be ensured that the test content remains up to date. The test should also contain clean bag images, that is, X-ray images of bags that do not contain a prohibited object. For each image, the screeners should indicate whether or not a prohibited object is present. Additionally, the screeners can be requested to identify the prohibited item(s). The image display duration should be comparable to operational conditions.

Test conditions should be standardized and comparable for all participants. For example, the brightness and contrast of the monitor should be calibrated and similar for all participants. This applies equally to other monitor settings that could influence detection performance (e.g., the refresh rate). In order to achieve a valid measure of detection performance, not only hit rates but also false alarm rates should be taken into account.

The probability of detecting a prohibited item depends on the knowledge of a screener as well as on the general difficulty of the item. Image-based factors such as the orientation in which a threat item is depicted in the bag (view difficulty), the degree by which other objects are superimposed on an object (superposition), and the number and type of other objects within the bag (bag complexity) influence detection performance substantially [6], [7]. Tests should take these effects into account.

C. Certification of X-Ray Image Interpretation Competency

As indicated above, individuals carrying out screening operations should be certified initially and periodically thereafter. Certification can not only be considered as providing quality control over the screening process, but also as a valuable basis for awarding personnel a qualification, measuring training effectiveness, improving training procedures, and increasing motivation. Certification data provide important information on strengths and weaknesses in aviation security procedures in general as well as on individual screeners. Furthermore, certification can help in achieving international standardization in aviation security. The present section gives a brief overview of how a certification system can be implemented.

As mentioned above, certification of screeners should contain a theoretical exam and an X-ray image interpretation exam. For periodic certification, practical exams can be conducted if considered necessary, unlike the initial certification, where practical exams are required. The exams should meet the requirements of high reliability, validity, and standardization (see above).

The X-ray image interpretation exam should be adapted to the domain in which a screener is employed, that is, cabin baggage screening, hold baggage screening, or cargo screening. Since not every threat object always constitutes a threat during the flight, depending on where aboard the aircraft it is transported, screeners should be certified according to their domain.

The certification of cabin baggage screeners should be based on cabin baggage X-ray images that contain all kinds of objects that are prohibited from being carried in cabin baggage (guns, knives, improvised explosive devices, and other prohibited items). Objects that are prohibited from being transported in the cabin of an aircraft do not necessarily pose a threat when transported in the hold or in cargo. Furthermore, different types of bags are transported in the cabin, the hold, and cargo. Usually, small suitcases or bags serve as hand baggage, whereas big suitcases and traveling bags are transported in the hold of the aircraft. The certification of hold baggage screeners should be conducted using X-ray images of hold baggage. Similarly, cargo screeners should be tested using X-ray images of cargo items.

Screeners should be kept up to date regarding new and emerging threats. In order to verify whether this is consistently achieved, it is recommended that a recurrent certification be conducted on a periodical basis, typically every 1–2 years. The minimum threshold that should be achieved in the tests in order to pass certification should be defined by the national appropriate authority and should be based on a large and representative sample of screeners.

II. MEASUREMENT OF PERFORMANCE ON THE JOB USING THREAT IMAGE PROJECTION (TIP)

Threat image projection (TIP) is a function of state-of-the-art X-ray machines that allows the exposure of aviation security screeners to artificial but realistic X-ray images during the routine X-ray screening operation at the security checkpoint. For cabin baggage screening (CBS), fictional threat items (FTIs) are digitally projected in random positions into X-ray images of real passenger bags. In hold baggage screening (HBS), combined threat images (CTIs) are displayed on the monitor. In this case, not only the threat item is projected but an image of a whole bag that may or may not contain a threat item. This is possible if the screeners visually inspecting the hold baggage are physically separated from the passengers and their baggage. Projecting whole bags in HBS provides not only the opportunity to project threat images (i.e., bags containing a threat item) but also non-threat images (i.e., bags not containing any threat item). This also allows the recording of false alarms (namely, if a non-threat image was judged as containing a threat) and correct rejections (namely, if a non-threat image was judged as being harmless).

TIP data are an interesting source for various purposes like quality control, risk analysis, and assessment of individual screener performance. Unlike the situation in a test setting, individual screener performance can be assessed on the job when using TIP data. However, if used for the measurement of individual screener X-ray detection performance, international standards of testing have to be met, that is, the method needs to be reliable, valid, and standardized (see above). In a study of CBS and HBS TIP, very low reliability values were found for CBS TIP data when a small TIP image library of a few hundred FTIs was used [8]. Good reliabilities were found for HBS TIP data when a large TIP image library was available. It is suggested that a large image library (at least 1000 FTIs) containing a representative sample of items of varying difficulty should be used when TIP is used for individual performance assessment. Also, viewpoint difficulty, superposition, and bag complexity may need to be considered. Finally, as mentioned above, data needs to be aggregated over time in order to have a large enough sample upon which to perform meaningful analyses. TIP data should only be used for certification of screeners if the reliability of the data has been proven, for example by showing that the correlation between TIP scores based on odd days and even days, aggregated over several months, is higher than .75.
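A sketch of such an odd/even-day reliability check in Python (the data layout and numbers are our own assumptions for illustration, not the paper's):

```python
import numpy as np
from datetime import date

# Hypothetical TIP log: screener id -> list of (date, detected) events,
# one entry per projected fictional threat item, aggregated over months.
tip_log = {
    "screener_01": [(date(2008, 3, d), d % 3 != 0) for d in range(1, 29)],
    "screener_02": [(date(2008, 3, d), d % 2 == 0) for d in range(1, 29)],
    "screener_03": [(date(2008, 3, d), d % 4 != 0) for d in range(1, 29)],
}

def hit_rate(events):
    return sum(detected for _, detected in events) / len(events)

odd = [hit_rate([e for e in ev if e[0].day % 2 == 1]) for ev in tip_log.values()]
even = [hit_rate([e for e in ev if e[0].day % 2 == 0]) for ev in tip_log.values()]

# TIP scores would be considered reliable enough for certification
# if this correlation exceeds .75 (see text).
r = np.corrcoef(odd, even)[0, 1]
print(f"odd/even-day correlation: r = {r:.2f}")
```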
III. X-RAY COMPETENCY ASSESSMENT TEST (X-RAY CAT)

This section introduces the X-Ray Competency Assessment Test (X-Ray CAT) as an example of a computer-based test that can be used for assessing X-ray image interpretation competency (Fig. 2 displays an example of the stimuli). The CAT has been developed on the basis of scientific findings regarding threat detection in X-ray images of passenger bags [6], [7]. How well screeners can detect prohibited objects in passenger bags is influenced in two different ways. First, it depends on the screener's knowledge of what objects are prohibited and what they look like in X-ray images. This knowledge is an attribute of the individual screener and can be enhanced by specific training. Second, the probability of detecting a prohibited item in an X-ray image of a passenger bag also depends on image-based factors. These are the orientation of the prohibited item within the bag (view difficulty), the degree by which other objects are superimposed over an object in the bag (superposition), and the number and type of other objects within the bag (bag complexity). Systematic variation or control of the image-based factors is a fundamental property of the test and has to be incorporated in the test development. In the X-Ray CAT, the effects of viewpoint are controlled by using two standardized rotation angles, an 'easy' and a 'difficult' view, for each forbidden object. Superposition is controlled in the sense that it is held constant over the two views and as far as possible over all objects. With regard to bag complexity, the bags are chosen in such a way that they are visually comparable in terms of the form and number of objects with which they are packed.

Figure 2. Example images from the X-Ray CAT. Left: harmless bag (non-threat image), right: same bag with a prohibited item at the top right corner (threat image). The prohibited item (gun) is also shown separately at the bottom right.

The X-Ray CAT contains two sets of objects in which object pairs are similar in shape (see Fig. 3). This construction allows the measurement not only of any effect of training, that is, whether detection performance can be increased by training, but also of possible transfer effects.

The threat objects of one set can then be included in the training. By measuring detection performance after training using both sets of the test, it can be ascertained whether training also helped in improving the detection of the objects that did not appear during training. Should this be the case, it indicates a transfer of the knowledge gained about the visual appearance of objects used in training to similar-looking objects.

Figure 3. Example of two X-ray images of similar-looking threat objects used in the test. Left: a gun of set A. Right: corresponding gun of set B. Both objects are also depicted in 85-degree horizontal rotation (top) and 85-degree vertical rotation (bottom).

The task is to visually inspect the test images and to judge whether they are OK (contain no prohibited item) or NOT OK (contain a prohibited item).

A. Assessing Detection Performance in a Computer-Based Test

The detection performance of screeners in a computer-based test can be assessed by their judgments of X-ray images. As explained above, not only is the hit rate (i.e., the proportion of correctly detected prohibited items in the threat images) an important value but so is the false alarm rate (i.e., the proportion of non-threat images that were judged as being NOT OK, that is, as containing a prohibited item). This incorporates the definition of detection performance as the ability not only to detect prohibited items but also to discriminate between prohibited items and harmless objects (that is, to recognize harmless objects as harmless). Therefore, in order to evaluate the detection performance of a screener, his or her hit rate in the test has to be considered as well as his or her false alarm rate [4], [5], [8], [9]. There are different measures of detection performance that set the hit rate against the false alarm rate, for example d' or A'.
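As an illustration (a minimal sketch, not taken from the paper), the two measures named here can be computed from a hit rate and a false alarm rate as follows; d' is the difference of the z-transformed rates [9], and A' uses the common non-parametric formula discussed in the signal detection literature [4]:

```python
from statistics import NormalDist

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """d' = z(hit rate) - z(false alarm rate); rates must lie strictly in (0, 1)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

def a_prime(hit_rate: float, fa_rate: float) -> float:
    """Non-parametric measure A' (common formula for hit rate >= false alarm rate)."""
    h, f = hit_rate, fa_rate
    return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))

# Example: 90% hits with 20% false alarms
print(f"d' = {d_prime(0.9, 0.2):.2f}, A' = {a_prime(0.9, 0.2):.2f}")
# d' = 2.12, A' = 0.91
```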

B. Reliability of the X-Ray CAT

As elaborated earlier, the reliability of a test stands for its repeatability or consistency. The reliability of the X-Ray CAT was measured by computing Cronbach's alpha and Guttman's split-half coefficients. The calculations are based on the results of a study at several airports throughout Europe (see below for the details and further results of the study), including the data of 2265 screeners who completed the X-Ray CAT on behalf of the EU-funded VIA project in 2007. The reliability measures were calculated based on correct answers, that is, hits for threat images and correct rejections (CR) for non-threat images (# correct rejections = # non-threat items − # false alarms). The analyses were made separately for threat images and for non-threat images. Table I shows the reliability coefficients.

As stated above, an acceptable test should reach split-half correlations of at least .75 and Cronbach's alpha values of at least .85. Bearing this in mind, the reliability values listed in Table I show that the X-Ray CAT is very reliable and therefore a useful tool for measuring the detection performance of aviation security screeners.

TABLE I. RELIABILITY ANALYSES

Reliability coefficient      Hit    CR
X-Ray CAT   alpha            .98    .99
            Split-half       .97    .99

C. Validity of the X-Ray CAT

Regarding the convergent validity of the CAT, it can be compared to another test that measures the same competency. An example of such a test that is also widely used at different airports is the Prohibited Items Test (PIT) [10]. To assess convergent validity, the correlation between the scores on the X-Ray CAT and the scores on the PIT of a sample that conducted both tests is calculated. This procedure was applied to a sample of 473 airport security screeners. The result can be seen in Fig. 4 (r = .791).

Figure 4. Convergent validity shown as the correlation between the scores of the X-Ray CAT and the PIT. The dots represent individual screeners.

Since correlation coefficients range from 0 (no correlation) to 1 (perfect correlation) (see also above), the convergent validity can be classified as quite high. This means that the X-Ray CAT and the PIT measure the same X-ray image interpretation competency. Other studies have also confirmed the concurrent validity, that is, the ability of a test to discriminate, for example, between trained and untrained screeners [11].
Fig. 5 shows the results of the study. It can be seen that the detection performance increases for the trained screeners but not for the untrained screeners. This means that the test is able to discriminate between screeners who received training with the computer-based training system X-Ray Tutor and those who did not receive training with X-Ray Tutor [12], [13]. Therefore, the concurrent validity of the X-Ray CAT can be confirmed.

Figure 5. Detection performance d' for trained (XRT Training Group, n = 97) compared to untrained (Control Group, n = 112) screeners across three measurements. The concurrent validity appears in the difference of the detection performance between the two groups after one group has trained. Thin bars are standard deviations. Note: d' values are not provided for security reasons.

D. Standardization

The X-Ray CAT was standardized in regard to its development. The revisions of the test were based on data from representative samples (N = 94) of airport security screeners. In the study described in the section on real-world application, involving a large and representative sample of airport security screeners (N = 2265), a mean detection performance A' of 0.8 (SD = 0.08) was achieved. There are different approaches to the definition of pass marks. The normative approach defines a pass mark as the threshold at which a certain proportion of screeners fails the test (e.g., not more than 10 percent), based on a test measurement. That is, a screener is rated in relation to all other screeners. The criterion-referenced approach sets the pass mark according to a defined criterion. For instance, the results could be compared to the test results obtained in other countries when the test was conducted the first time, or a group of experts could rate the difficulty of the test items (in this case the difficulty of the images) and the minimum standard of performance (e.g., using the Angoff method [14]). These approaches can of course be combined. Furthermore, the standard might be adjusted by taking into account the reliability of the test, the confidence intervals, and the standard error of measurement.
IV. REAL WORLD APPLICATION OF THE X-RAY COMPETENCY ASSESSMENT TEST (X-RAY CAT)

X-Ray CAT was used in several studies and at a series of international airports in order to measure the X-ray image interpretation competencies of screening officers. In this section, the application of X-Ray CAT is presented along with discussions and results obtained by means of the EU-funded VIA Project.

A. The VIA Project

The VIA Project evolved from the 2005 tender call of the European Commission's Leonardo da Vinci program on vocational education and training. The project's full title is "Development of a Reference Levels Framework for A'VIA'tion Security Screeners." The aim of the project is to develop appropriate competence and qualification assessment tools and to propose a reference levels framework (RLF) for aviation security screeners at national and cross-sectoral levels. For more information regarding this project, see www.viaproject.eu.

For the studies presented here, 8 European airports were involved. Most of these airports were going through the same procedure of recurrent tests and training phases. This made it possible to scientifically investigate the effect of recurrent weekly computer-based training and knowledge transfer, and subsequently to develop a reference levels framework based on these outcomes. The tools used for testing in the VIA project were the computer-based training (CBT) program X-Ray Tutor [13] and the X-Ray CAT. Below, the results of the computer-based test measurements included as part of the VIA project procedure are reported in detail.

B. VIA Computer-Based Test Measurement Results

The X-Ray Competency Assessment Test (CAT) contains 256 X-ray images of passenger bags, half of which contain a prohibited item. This leads to four possible outcomes for a trial: a "hit" (a correctly identified threat object), a "miss" (a missed threat object), a "correct rejection" (a harmless bag correctly judged as being OK), and a "false alarm" (an incorrectly reported threat object).

In terms of sensitivity, the hit rate alone is not a valid measure to assess X-ray image interpretation competency. It is easy to imagine that a hit rate of 100 percent can be achieved by simply judging every X-ray image as containing a prohibited item. In this case, the entire set of non-threat images is completely neglected (the false alarm rate would also be 100 percent). In contrast, Green and Swets developed in 1966 the signal detection performance measure d', taking into account hit rates as well as false alarm rates [15]. Often, d' is referred to as sensitivity, emphasizing the fact that it measures the ability to distinguish between noise (in our case an X-ray image of a bag without a threat) and signal plus noise (in our case an X-ray image containing a prohibited item). For the current analyses, another detection performance measure, A', was the measure of choice because its non-parametric character allows its use independently of the underlying measurement distributions that d' assumes (see [4] for a detailed discussion).

The reported results provide graphical displays of the relative detection performance measure A' at the eight European airports that participated in the present study, as well as another graph showing the effect of the two viewpoints on the different threat categories as explained earlier. In order to provide statistical corroboration of these results, an analysis of variance (ANOVA) on the two within-participants factors, view difficulty and threat category (guns, IEDs, knives, and other items), and the between-participants airport factor is reported as well.
As part of the ANOVA, only the significant interaction effects are reported and considered noteworthy in this context.

1) Detection performance comparison between airports: Fig. 6 shows the comparison of the detection performance achieved at eight European airports that participated in the VIA project. First, the detection performance was calculated for each screener individually. The data were then averaged across screeners for each airport. Thin bars represent the standard deviation (a measure of variability) across screeners. Due to its security sensitivity and for data protection reasons, the individual airports' names are not indicated and no numerical data are given here.

Figure 6. Comparison between eight European airports participating in the VIA project. Thin bars represent standard deviations between screeners.

Although no numerical data are displayed in the graph for security reasons, substantial differences between the airports in terms of mean detection performance and standard deviation can be discerned. As described above, all VIA airports go through a similar procedure of alternating test phases and training phases. Nevertheless, there are considerable differences between them. There were large differences in the initial positions when the project was started, and the baseline assessment test, which is reported here, was conducted at different times at different airports. The differences can be put down to differences in the amount of training that was accomplished prior to this baseline test as well as to differences in personnel selection assessment. Some of the reported airports were already coached prior to the VIA project, though with diverse intensity and duration.

2) Detection performance comparison between threat categories regarding view difficulty: Fig. 7 shows again the detection performance measure A', but with a different focus. The data are averaged across the airports shown in Fig. 6, but analyzed by view difficulty within threat categories. There is a striking effect on detection performance deriving from view difficulty. Performance is significantly higher for threat objects depicted in easy views than for threat objects depicted in difficult views (canonical views rotated by 85 degrees).

Figure 7. Detection performance A' broken up by category and view (unrotated (easy view) vs. 85° rotated objects (difficult view)). The thin bars represent standard deviations between the eight VIA airports. Pairwise comparisons showed significant viewpoint effects for all four threat categories.

Although this effect can be found in every one of the four threat categories, the categories differ significantly both in mean detection performance and in the size of the view difficulty effect. Knives and IEDs, for example, differ very much in view difficulty effect size but not so much in average detection performance. As can be seen in Fig. 8, the reason is quite simple: IEDs consist of several parts, and not all parts are depicted in easy or in difficult view at the same time. Some parts are always depicted in easy view when others are difficult, and vice versa. Knives have very characteristic shapes. They look consistently longish when seen perpendicular to their cutting edge but very small and thin when seen parallel to their cutting edge. This interaction effect between threat item category and view difficulty can easily be observed in Fig. 7, where the difference between easy and difficult views is much larger for knives than for IEDs. Furthermore, based on earlier studies of training effects, it is important to mention here that the pattern shown in Fig. 7 is also highly dependent on training [11] (interaction effects [category * airport and view difficulty * airport]).

Figure 8. Illustration of how effects of view difficulty differ between the four ECAC threat categories. Figs. 8a and 8b show an IED and a knife, each in a frontal view and a view rotated almost 90 degrees around the vertical axis. Figs. 8c and 8d show a gun and a taser, each in a frontal view and a view rotated almost 90 degrees around the horizontal axis.
3) Analysis of variance (ANOVA): The following statistics provide quantitative values for what has been reported graphically. This allows us to compare the effects of the different factors. We applied a three-way ANOVA with the two within-subjects factors, category and view difficulty, and the between-subjects airport factor on the detection performance measure A'.

The analysis revealed highly significant main effects of threat category (guns, IEDs, knives, and other items) with an effect size of η² = .131, F(3, 5602.584) = 339.834, MSE = 2.057, p < .001, of view difficulty (easy view vs. difficult/rotated view) with an effect size of η² = .47, F(1, 2257) = 2009.772, MSE = 9.031, p < .001, and also of the between-subjects airport factor with η² = .080, F(1, 2257) = 28.128, MSE = 1.415, p < .001. The following two-way interactions were also highly significant: threat category * view difficulty: η² = .094, F(3, 6542.213) = 233.969, MSE = .931, p < .001; threat category * airport: η² = .068, F(3, 5602.584) = 23.411, MSE = .142, p < .001; and view difficulty * airport: η² = .159, F(1, 2257) = 60.953, MSE = .274, p < .001. These results indicate different detection performance for different threat categories and higher detection performance for prohibited items in easy view than for rotated threat items (the effect of viewpoint). This is consistent with results reported in the view-based object recognition literature (for reviews see, for example, [16], [17]). The effect sizes were very large according to Cohen's conventions [18].
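For readers who want to run a comparable analysis, the sketch below (our own illustration on simulated data, not the paper's analysis) uses the pingouin library's mixed ANOVA, which handles one within-participants and one between-participants factor; the paper's full design additionally crosses threat category as a second within-participants factor:

```python
import numpy as np
import pandas as pd
import pingouin as pg  # pip install pingouin

rng = np.random.default_rng(0)
rows = []
for s in range(60):                       # 60 simulated screeners
    airport = f"airport_{s % 3}"          # three simulated airports
    base = rng.normal(0.80, 0.05)         # screener-specific ability
    for view, penalty in [("easy", 0.0), ("difficult", 0.08)]:
        rows.append({"screener": s, "airport": airport, "view": view,
                     "a_prime": base - penalty + rng.normal(0, 0.02)})
df = pd.DataFrame(rows)

# Mixed ANOVA: within factor "view", between factor "airport"
aov = pg.mixed_anova(dv="a_prime", within="view", between="airport",
                     subject="screener", data=df)
print(aov.round(3))
```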
C. Discussion

Although the reported real-world application consists of baseline measurement data only, some important features of the X-Ray CAT could be illustrated well. The X-Ray CAT allows us to measure and evaluate the effects of view difficulty and threat objects practically independently of each other. Furthermore, the X-Ray CAT can be used as a very reliable tool to compare the X-ray image interpretation competency of security staff at different airports and at other types of infrastructure using X-ray technology for security control procedures.

V. SUMMARY AND CONCLUSIONS

The competency of a screener to detect prohibited items in X-ray images quickly and reliably is important for any airport security system. Computer-based tests, TIP, and, to a limited extent, covert tests can be used to assess individual competency in X-ray image interpretation. However, to achieve reliable, valid, and standardized measurements, it is essential that the requirements and principles detailed in this paper are followed by those who produce, procure, or evaluate tests of the X-ray image interpretation competency of individual screeners.

This paper introduced competency assessment in airport security screening. In order to achieve a meaningful result, the assessment has to meet the criteria of reliability and validity. Furthermore, the assessment has to be standardized to allow the evaluation of screeners' performance in relation to the population norm. A focus was laid on the computer-based X-Ray Competency Assessment Test (X-Ray CAT). It features very high reliability scores, and its design allows the measurement of the X-ray image interpretation competency of aviation security screeners with regard to different aspects of their ability and knowledge. The X-Ray CAT is widely used at many different airports throughout the world, for competency assessment and certification purposes as well as in studies assessing the fundamental demands of the job of the aviation security screener. It was also shown how a reliable, valid, and standardized test can be used to compare X-ray image interpretation competency across different airports and countries (on behalf of the EU-funded VIA Project).

REFERENCES

[1] J. A. Fishman and T. Galguera, Introduction to Test Construction in the Social and Behavioural Sciences: A Practical Guide. Oxford: Rowman & Littlefield, 2003.
[2] P. Kline, Handbook of Psychological Testing. London: Routledge, 2000.
[3] K. R. Murphy and C. O. Davidshofer, Psychological Testing. Upper Saddle River, NJ: Prentice Hall, 2001.
[4] N. A. MacMillan and C. D. Creelman, Detection Theory: A User's Guide. New York: Cambridge University Press, 1991.
[5] F. Hofer and A. Schwaninger, "Reliable and valid measures of threat detection performance in X-ray screening," IEEE ICCST Proceedings, vol. 38, pp. 303–308, 2004.
[6] A. Schwaninger, D. Hardmeier, and F. Hofer, "Measuring visual abilities and visual knowledge of aviation security screeners," IEEE ICCST Proceedings, vol. 38, pp. 258–264, 2004.
[7] A. Schwaninger, "Evaluation and selection of airport security screeners," AIRPORT, vol. 2, pp. 14–15, 2003.
[8] F. Hofer and A. Schwaninger, "Using threat image projection data for assessing individual screener performance," WIT Transactions on the Built Environment, vol. 82, pp. 417–426, 2005.
[9] D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics. New York: Wiley, 1966.
[10] D. Hardmeier, F. Hofer, and A. Schwaninger, "Increased detection performance in airport security screening using the X-Ray ORT as pre-employment assessment tool," Proceedings of the 2nd International Conference on Research in Air Transportation, ICRAT 2006, Belgrade, Serbia and Montenegro, June 24–28, pp. 393–397, 2006.
[11] S. M. Koller, D. Hardmeier, S. Michel, and A. Schwaninger, "Investigating training, transfer and viewpoint effects resulting from recurrent CBT of X-ray image interpretation," Journal of Transportation Security, vol. 1, no. 2, pp. 81–106, 2008.
[12] A. Schwaninger, "Computer-based training: a powerful tool for the enhancement of human factors," Aviation Security International, vol. 10, pp. 31–36, 2004.
[13] A. Schwaninger, "Increasing efficiency in airport security screening," WIT Transactions on the Built Environment, vol. 82, pp. 405–416, 2005.
[14] W. H. Angoff, "Norms, scales, and equivalent scores," in Educational Measurement, 2nd ed., R. L. Thorndike, Ed. Washington: American Council on Education, 1971, pp. 508–600.
[15] D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics. New York: Wiley, 1966.
[16] M. J. Tarr and H. H. Bülthoff, "Is human object recognition better described by geon structural descriptions or by multiple views? Comment on Biederman and Gerhardstein (1993)," Journal of Experimental Psychology: Human Perception and Performance, vol. 21, pp. 1494–1505, 1995.
[17] M. J. Tarr and H. H. Bülthoff, "Image-based object recognition in man, monkey and machine," in Object Recognition in Man, Monkey and Machine, M. J. Tarr and H. H. Bülthoff, Eds. Cambridge, MA: MIT Press, 1998, pp. 1–20.
[18] J. Cohen, Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Erlbaum, 1988.
