Research Methods in Human Resource Management
A Volume in:
Research in Human Resource Management
Series Editors
Dianna L. Stone, Universities of New Mexico, Albany, and Virginia Tech
James H. Dulebohn, Michigan State University
Human Resource Strategies for the High Growth Entrepreneurial Firm (2006)
Robert L. Heneman & Judith Tansky
COMING SOON
Forgotten Minorities
Dianna L. Stone, Kimberly M. Lukaszewski, & James H. Dulebohn
Research Methods in Human
Resource Management: Toward
Valid Research-Based Inferences
Edited by
Eugene F. Stone-Romero
Patrick J. Rosopa
The CIP data for this book can be found on the Library of Congress website (loc.gov).
Paperback: 978-1-64802-088-9
Hardcover: 978-1-64802-089-6
E-Book: 978-1-64802-090-2
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system,
or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilm-
ing, recording or otherwise, without written permission from the publisher.
CHAPTER 1
PERSPECTIVES ON THE
VALIDITY OF INFERENCES
FROM RESEARCH IN HUMAN
RESOURCE MANAGEMENT
Eugene F. Stone-Romero and Patrick J. Rosopa
In order for inferences stemming from empirical studies to have a high level of validity, it
is critical that the studies have construct validity,
internal validity, external validity, and statistical conclusion validity (Campbell
& Stanley, 1963; Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2002).
Construct validity has to do with the degree to which the measures and manipulations
used in an empirical study are faithful representations of underlying constructs.
Internal validity reflects the degree to which the design of a study allows for valid
inferences about causal connections between the variables considered in the study.
External validity represents the extent to which the findings of a study general-
ize to different sampling particulars of units, treatments, research settings, and
outcomes. Finally, statistical conclusion validity is the degree to which inferences
stemming from the use of statistical methods are correct.
Valid research results are vital for both science and practice in HRM and allied
fields. With respect to science, the confirmation of a theory hinges on the validity
of empirical studies that are used to support it. For example, research aimed at
testing a theory that X causes Y is of little or no value unless it is based on studies
that use randomized experimental designs. In addition, the results of valid re-
search are essential for the development and implementation of HRM policies and
practices. For example, attempts to reduce employee turnover will not meet with
success unless an organization measures this criterion in a construct valid manner.
ceptual definitions and measurement issues, the authors provide critiques of each
construct as well as directions for future research. The authors conclude with a
discussion of the conceptual, research design, and data collection challenges that
researchers in organizational politics face.
Allen I. Huffcutt (University of Wisconsin Green Bay) discusses the problem
of range restriction in HRM, especially in employment interviews. He demon-
strates how serious this problem can be by simulating data that is unrestricted and
free of measurement error. Then, he shows how validities change after systemati-
cally introducing measurement error, direct range restriction, and indirect range
restriction. In addition, he provides a step-by-step demonstration of the calcula-
tions to obtain corrected correlation coefficients.
Lois E. Tetrick (George Mason University), Robert R. Sinclair (Clemson
University), Gargi Sawhney (Auburn University), and Tiancheng (Allen) Chen
(George Mason University) discuss methodological issues in the safety climate
literature based on a review of 261 articles. Their review reveals a lack of consen-
sus and an inadequate explication of the safety climate construct and its dimen-
sionality. In addition, the authors discuss some common research design issues
including the low percentage of studies that involve interventions. The authors
highlight the (a) importance of incorporating time in research studies involving
multiple measurements and (b) increased use of various levels in safety climate
research.
REFERENCES
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for
research. Chicago, IL: Rand McNally.
Cook, T. D., & Campbell, D. T. (1976). The design and conduct of quasi-experiments and
true experiments in field settings. In M. D. Dunnette (Ed.), Handbook of industrial
and organizational psychology (pp. 223–326). Chicago, IL: Rand McNally.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues
for field settings. Boston, MA: Houghton Mifflin.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experi-
mental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
CHAPTER 2
ADVANCES IN RESEARCH
METHODS
What Have We Neglected?
Neal Schmitt
My purpose in this paper is twofold. First, I trace and describe the phenomenal de-
velopment of quantitative analysis methods over the past couple of decades. Then
I make the case that our progress in measurement, research design, and estimating
the practical significance of our research has not kept pace with the development
of analytic techniques, and that more attention should be directed to these critical
aspects of our research endeavors.
At that time, the notion that all validities were specific to a situation was the
accepted wisdom in the personnel selection area. Frank Schmidt and Jack Hunter
introduced meta-analysis and validity generalization in the mid to late 1970s
(Schmidt & Hunter, 1977). Hypothesis testing was standard practice too, and little
attention was paid to the practical significance of statistically significant results.
So, a person at that time was considered well trained if he or she was conversant
with correlation and regression, analysis of variance, exploratory factor analysis,
and perhaps nonparametric indices. This has changed radically in the intervening
years.
DEVELOPMENT OF MODERN
QUANTITATIVE ANALYSIS METHODS
The 1980s were distinguished by the rapid adoption of structural equation model-
ing (SEM) using LISREL (later AMOS, MPLUS and other software tools) and
the use of meta-analysis to summarize bodies of research on a wide variety of
relationships between HR and OB constructs. Among SEM enthusiasts, there was
even a misperception that confirmation of a proposed model of a set of relation-
ships indicated that the variables were causally related, rather than that the
data were simply consistent with a hypothesized set of relationships. Even after
this error of interpretation was recognized there was an enthusiastic adoption of
SEM by researchers. Both meta-analysis and SEM brought a focus on the under-
lying latent constructs being measured and related as opposed to the measured
variables themselves.
Developments in both SEM and meta-analyses became increasingly sophis-
ticated. Meta-analysts were concerned about file-drawer problems, random ver-
sus fixed effects analyses, estimates of variance accounted for by various errors,
moderator analyses, and the use of regression analyses of meta-analytically de-
rived estimates of relationships. Specific applications of SEM such as multi-group
analyses and tests for measurement invariance (Vandenberg & Lance, 2000)
were soon widely applied as were SEM analyses of longitudinal data (e.g., latent
growth modeling, Willett & Sayer, 1994).
Certainly among the most frequently used analytic innovations have been those
associated with levels research (Klein & Kozlowski, 2000). Multilevel modeling
is used in a very large proportion of the articles now published in our journals. In
one recent issue of the Journal of Applied Psychology (February 2017), Sonnentag,
Pundt, and Venz used multilevel SEM to assess survey data on snacking behav-
ior; Walker, van Jaarsveld, and Skarlicki used multilevel SEM to study the impact
of customer aggression on employee incivility; and Zhou, Wang, Song, and Wu
examined perceptions of innovation and creativity using hierarchical linear mod-
eling. Hierarchical linear modeling (HLM) has been used to study change, goal
congruence, climate and many other phenomena. It is almost as though I suddenly
discovered the nested nature of most of the data I collect.
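For readers who have not used these techniques, a minimal sketch of a two-level model (employees nested in teams) fit in R with the lme4 package appears below. The data, variable names, and effect sizes are hypothetical illustrations and are not drawn from any of the studies cited above.

```r
# Minimal two-level (employees nested in teams) sketch using lme4.
# All data, variable names, and effects are hypothetical.
library(lme4)

set.seed(1)
d <- data.frame(
  team         = factor(rep(1:30, each = 10)),   # 30 teams, 10 employees each
  workload     = rnorm(300),                     # level-1 (employee) predictor
  team_climate = rep(rnorm(30), each = 10)       # level-2 (team) predictor
)
d$satisfaction <- 0.3 * d$workload + 0.5 * d$team_climate +
  rep(rnorm(30, sd = 0.4), each = 10) + rnorm(300)   # team intercepts + residual

# Random-intercept model with predictors at both levels
fit <- lmer(satisfaction ~ workload + team_climate + (1 | team), data = d)
summary(fit)

# Intraclass correlation from a null model: the share of variance between teams
null <- lmer(satisfaction ~ 1 + (1 | team), data = d)
vc   <- as.data.frame(VarCorr(null))
vc$vcov[1] / sum(vc$vcov)
```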
Big Data produces opportunities and challenges associated with the analysis
and interpretation of huge multidisciplinary data sets. Angrave, Charlwood, Kirk-
patrick, Lawrence, and Stuart (2016), Cascio and Boudreau (2011), and others have
detailed a number of challenges in the use and interpretation of the wide variety
of big data available, as well as the potential for analyses of these data to result in
improved HR practices. The quality and accuracy of many Big Data files are often
unknown; for example, it is rare that one would be able to assess the construct
validity of Big Data indices as organizational researchers usually do.
Big Data also introduces a whole new vocabulary (see Harlow & Oswald,
2016). Words like lasso (screening out noncontributing predictor variables), latent
Dirichlet allocation (modeling words in a text that are attributable to a smaller
set of topics), k-fold cross-validation (developing models on multiple subsets of
“training” data that are then cross-validated on held-out “test” data), crud factor (a gen-
eral factor or nuisance factor), and many more have entered our vocabulary. In many ways, analyses of Big
Data seem like the “dust-bowl empiricism” that was decried by organizational
psychologists a half century ago. Note though that most Big Data analysts do
attend to theory and much more effort has been devoted to consideration of cross-
validation of findings than was true in early validation efforts.
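To make two of these terms concrete, the sketch below fits a lasso regression and chooses its penalty by k-fold cross-validation with the glmnet package in R. It relies on simulated data rather than any of the Big Data sources discussed above, and all names and values are illustrative only.

```r
# Illustrative sketch of lasso with k-fold cross-validation (glmnet).
# The data are simulated; only 5 of 50 predictors actually contribute.
library(glmnet)

set.seed(42)
n <- 500; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0.5, 5), rep(0, p - 5))      # 5 contributing predictors, 45 noise
y <- X %*% beta + rnorm(n)

# 10-fold cross-validation: models are developed on the training folds and
# evaluated on the held-out fold to choose the penalty (lambda).
cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)   # alpha = 1 is the lasso penalty

cv_fit$lambda.min                  # penalty minimizing cross-validated error
coef(cv_fit, s = "lambda.1se")     # noncontributing predictors shrink to zero
```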
Many other analysis techniques have challenged us such as relative importance
analysis, use of interactive and power terms in regression to analyze difference
scores, power analysis, spatial analyses, social network analyses, dyadic data
analyses and more. The last two to three decades have been an exciting time for
those working in quantitative analyses. The variety of new techniques available
to help understand data and the increased availability of data publicly available
in social media outlets and elsewhere as well as the availability of free software
packages such as R is nothing short of a revolution. We are continually faced with
the challenge of educating ourselves and others on the appropriate and productive
use of these procedures.
While we have much to celebrate and occupy ourselves, I would like to voice
concerns about some issues that seem to have gone unnoticed by many research-
ers.
First, we have paid too little attention to the quality of the measures we use. Second, we have paid too little attention to research design. This is especially evident
when we collect (or try to collect) longitudinal data. Third, in the interest of dis-
covering statistically significant findings or the results of the latest novel analytic
technique, we have lost sight of the practical significance of our results— in terms
of reporting effect sizes that are meaningful to practitioners, explaining the nature
of our results (witness the lack of impact of selection utility analyses so popular
a couple of decades ago) and in terms of addressing issues that concern OB/HR
practitioners or our organizational clients. In the remainder of this chapter, I will
describe the “state of the art” in these three areas and why I think they should
receive more attention by research methodologists than is currently the case.
MEASUREMENT CONCERNS
Aside from IRT developments, there has been very little direction as to how to
evaluate the items or scales we use. Even IRT is not very applicable with short
scales. CFA has been used for the same purpose, but we have little guidance as
to what constitutes good fit to a particular measurement model. Nye and Drasgow (2011)
have tried to provide such guidance, and Meade, Johnson, and Braddy (2008) recom-
mend the use of the change in the comparative fit index (with a cutoff value of .002) as a
means of comparing the fit of alternative models of measurement invariance. Too often we wave
a set of alpha values at the scales we use, sometimes apologizing for those whose
alphas are below .70, as evidence that our measures are acceptable. Sometimes
journals even publish one item per scale so the reader can get some sense of the
nature of the construct measured, but even the publication of one item has been
found objectionable on proprietary grounds. Clark (2006) decries the sentence
often used to justify the use of a measure: “According to the literature, measure
X’s reliability is good and it has been shown to have validity.” (p. 448). This
statement is often made without a reference, but even with a reference, it often
appears doubtful that the author read the paper or papers they cite. Of course,
there is often no mention as to how or against what the measure was validated. Or,
as investigators we write a set of items for a study and label them as some exist-
ing construct with no supporting investigation of its psychometric characteristics.
Subsequent researchers or meta-analysts take the label for granted. This situation
is even worse now that we have become enamored of Big Data because we have
little or no control over the nature of the data collected and many times the data
comes from disciplines that have little appreciation for the quality of their mea-
sures (Angrave et al, 2016).
Let me give some examples. Forty or fifty years ago, there were several pub-
lications which provided guidelines on item writing though most of those ad-
dressed multiple choice ability items. Some guidelines addressed the use of
double-barreled items, use of double negatives, or jargon (Edwards, 1957) in
Likert-type items. I even remember one paper that experimentally manipulated
some of these guidelines in constructing a final exam in an introductory psychol-
ogy course (Dudycha & Carpenter, 1973). We now take item writing (whether
multiple choice or Likert items) for granted with the possible exception of large
test publishers whose main concern is the perceived fairness of test items to dif-
ferent groups of examinees. Even attempts to improve perceived fairness to dif-
ferent underrepresented groups have rarely been examined (for an exception see
Golubovich, Grand, Ryan and Schmitt, 2014).
We do have some other developments in measurement – both methods of mea-
surement and the means of analyzing the measurement properties of our indices.
Cognitive diagnosis models, multidimensional IRT models and simulations/gam-
ing are some examples. However, these techniques have not caught on to any
great degree—perhaps because they are too challenging for many of us or because
psychometricians or quantitative data analysts do not speak the language of most
psychologists and there may be some level of arrogance among psychometricians
about the relative incompetence of the rest of us. In any event, few of us read
Psychometrika anymore and I suspect the same may be true of Psychological
Methods and educational journals like the Journal of Educational Measurement.
Organizational Research Methods is still accessible to most organizational re-
searchers and that may account for its relatively high impact factor. Whatever the
reason there seems to be a segregation of quantitative analysts and measurement
types from other researchers, particularly those who develop or use psychological
measures.
In addition to a lack of attention in writing items, there is an overdependence on
alpha as an index of reliability or unidimensionality. Cortina (1993) and Schmitt
(1996) have demonstrated that alpha can be a poor index of reliability even when
we have lots of items. Schmitt (Cortina [1993] provided a similar analysis) dem-
onstrated that a six-item test with the item intercorrelations in Table 2.1 yielded an
alpha of .86. Most of us would be happy with this alpha and proceed to do further
analyses using this measure. If we bother to look further (examine item intercor-
relations), it would be obvious that the six items address two dimensions. Further,
examination of item content would almost certainly provide at least a tentative
explanation of these two sets of items, but that almost never occurs. This example
was constructed to make a point, but with more items and a more ambiguous set of
correlations, this problem would likely go unrecognized. A more modern and fre-
quently used approach to assess the dimensionality of our measures is to employ
confirmatory factor analyses. Assessment of a unidimensional model of these in-
tercorrelations would have yielded the following fit indices (χ² = 401.62,
df = 9, RMSEA = .47, NNFI = .29, CFI = .57). Most researchers would conclude that
the alpha of .86 was not a good index of the unidimensionality of this measure and
that a composite index of this set of six items is meaningless.
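To see how easily this situation arises, here is a minimal sketch in R. Because Table 2.1 is not reproduced in this excerpt, the sketch uses a hypothetical six-item correlation matrix with two three-item clusters, so the resulting alpha and fit values only approximate the ones reported above.

```r
# Sketch of the point that a high alpha does not imply unidimensionality.
# Table 2.1 is not reproduced here, so a hypothetical 6-item correlation
# matrix with two 3-item clusters is constructed instead.
library(psych)
library(lavaan)

R <- matrix(.30, 6, 6)                  # modest correlations between clusters
R[1:3, 1:3] <- .70                      # strong correlations within cluster 1
R[4:6, 4:6] <- .70                      # strong correlations within cluster 2
diag(R) <- 1
rownames(R) <- colnames(R) <- paste0("x", 1:6)

psych::alpha(R)$total$raw_alpha         # about .84: "acceptable" by the usual rule

# A one-factor CFA on the same matrix fits poorly, revealing the two clusters.
fit1 <- cfa("f =~ x1 + x2 + x3 + x4 + x5 + x6",
            sample.cov = R, sample.nobs = 200, std.lv = TRUE)
fitMeasures(fit1, c("chisq", "df", "rmsea", "cfi", "tli"))
```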
However, we can also fool ourselves about the dimensionality of a set of items
when using CFA—probably not as easily. We are dependent on a set of “rules
of thumb” as to whether a model fits our data and indices of practical fit (Nye &
Drasgow, 2011) are not helpful in this instance. Consider the item intercorrela-
tions in Table 2.2 for which a four-factor model produces a perfect fit to the data.
A one-factor model does pretty well too (χ² = 73.66, df = 54, RMSEA = .04,
NNFI = .98, CFI = .98). Most of us as authors and most of us as reviewers would
be happy with this demonstration of unidimensionality. I agree that the difference
between the within-factor and between-factor item correlations for these four
factors is small, but alpha for each of the four factors is .82 and the correlation
between any two of the four sets of items is .55. Are these distinct and practi-
is .93. Clearly, both alpha and CFA tell us that one factor explains these data,
but four distinct factors are responsible for the item intercorrelations. The point
I am making is that the more sophisticated analysis of dimensionality does not
do justice to the question any more so than does alpha. A third way of looking
at these data is to examine the item content, item-total correlations, and the item
intercorrelations or perform an exploratory factor analysis—something that few
“sophisticated” data analysts ever do!
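A sketch of this kind of comparison appears below. Because Table 2.2 is not reproduced in this excerpt, it again uses a hypothetical correlation matrix (within-cluster correlations of .60 and between-cluster correlations of .45, chosen to give alphas near .82); the resulting fit values will therefore not match the chapter's exactly.

```r
# Sketch of comparing a one-factor and a correlated four-factor model in lavaan.
# The matrix is hypothetical: 12 items in four 3-item clusters,
# within-cluster r = .60, between-cluster r = .45.
library(lavaan)

R <- matrix(.45, 12, 12)
for (k in 0:3) R[k * 3 + 1:3, k * 3 + 1:3] <- .60   # four 3-item clusters
diag(R) <- 1
rownames(R) <- colnames(R) <- paste0("x", 1:12)

m1 <- "g  =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12"
m4 <- "f1 =~ x1 + x2 + x3
       f2 =~ x4 + x5 + x6
       f3 =~ x7 + x8 + x9
       f4 =~ x10 + x11 + x12"

fit1 <- cfa(m1, sample.cov = R, sample.nobs = 200, std.lv = TRUE)
fit4 <- cfa(m4, sample.cov = R, sample.nobs = 200, std.lv = TRUE)

fitMeasures(fit1, c("chisq", "df", "rmsea", "cfi", "tli"))
fitMeasures(fit4, c("chisq", "df", "rmsea", "cfi", "tli"))
anova(fit1, fit4)    # chi-square difference test between the two models
```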
INFORMATION ON RELIABILITY AS
PRESENTED IN CURRENT RESEARCH ARTICLES
Overall, then, I do not believe we have paid much attention to the quality of our
measures. We seem to think item writing is easy and if we ask respondents to
use a five-point Likert type scale we will have a quality measure. Or more often,
researchers adapt a measure from previous work, occasionally taking a few items
from some longer measure. Then we use alpha and CFA to justify our behavior. To
ascertain that our statements about these practices are relatively standard, I exam-
ined the last three issues of the 2017 volumes of Journal of Applied Psychology,
Personnel Psychology and the Academy of Management Journal and tabulated the
results in Table 2.3. In this table, I have described the construct the authors pur-
ported to measure, the evidence for reliability and, in some cases, the discriminant
validity of the measure (convergent validity was rarely mentioned or assessed),
employment of rules of thumb to justify reliability and the justification for use of
the measure.
The table does document several positive features of the research reviewed.
First, most indices of reliability are quite high and clearly exceed the usual mini-
mum for alpha (i.e., .70) cited in the literature. Second, authors do routinely pro-
vide justification for their use of measures. That justification, though, is almost
always limited to the fact that someone else used the same scale or the current
measure was modified from a measure used in an earlier study. Very rarely did
authors present any evidence of the relationship between the original version of
the measure and the modified measure. The frequent modification of scales is
documented in Cortina et al. (under review).
Beyond these positive features of the research it is clear that organizational
researchers have measured a wide variety of different constructs, most of which
are not the typical individual difference measure that was the target of research
in the selection arena. Human resource researchers, broadly defined, have clearly
expanded the nature of the issues and constructs with which they are interested.
This proliferation of measures may, however, make it more difficult to assess the
commonality of research findings across studies and time, though calling attention to this
issue was not the purpose of this paper.
Almost all studies summarized in Table 2.3 use self-report instruments to as-
sess the constructs of interest and in many of these cases this is the only alterna-
tive. However, researchers frequently use supervisory responses or objective or
archival data as the source of information about constructs of interest. There are
fewer references to articles published in AMJ as many of the articles published
in that journal employ archival data for which coding accuracy or agreement are
applicable and for which data are readily available for verification purposes.
Third, there is an almost universal reliance on alpha as an index of measure-
ment reliability or adequacy. In some cases, this is complemented by a CFA of
the use of an index he labeled the greatest lower bound (GLB) estimate as the
preferred estimate of reliability. However, Zinbarg et al. (2005) showed that the
GLB was almost always lower than the hierarchical form of omega. Omega that
includes item loadings on a general factor as well as item loadings on group fac-
tors as true variance appears to be the best lower-bound estimate of reliability and
the most appropriate index to use in correcting observed correlations between
two variables for attenuation due to unreliability. Dunn et al. document the al-
most universal use of alpha as a measure of internal consistency in spite of the
critical psychometric literature including a paper by Cronbach himself (Cronbach
& Shavelson, 2004). They also support the routine use of omega along with the
confidence interval for its estimation and provide direction and an example of its
calculation using the open source statistical package, R. McNeish (2018) provides
a review of the use of alpha like that provided here in Table 2.3 for three different
psychological journals. The results of that review are very similar in that almost
all authors used alpha as a report of reliability. McNeish went on to compare the
magnitude of alpha and five other reliability indices for measures included in two
publicly available data sets. He found alpha was consistently lower by about .02
to .10 depending most often on the variability of item loadings on a general factor.
Aside from underrepresenting the reliability of a measure, these differences may
be practically meaningful in applied instances when relationships are corrected
for attenuation due to unreliability as they routinely are in studies of the criterion-
related validity of personnel selection measures (Schmidt & Hunter, 1998).
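For those who want to act on this recommendation, a minimal sketch in R using the psych package is shown below (the MBESS package can additionally supply bootstrap confidence intervals, as Dunn et al. suggest). The simulated item responses and all settings are illustrative only, not a prescription.

```r
# Sketch of estimating omega alongside alpha with the psych package.
# Item responses are simulated from a hierarchical (general + group factor)
# structure; in practice, supply a data frame of item responses.
library(psych)

set.seed(7)
R     <- sim.hierarchical()                            # default 9-variable hierarchical correlation matrix
items <- data.frame(MASS::mvrnorm(400, rep(0, 9), R))  # simulated respondents x items

psych::alpha(items)$total$raw_alpha              # coefficient alpha for the composite
psych::omega(items, nfactors = 3, plot = FALSE)  # prints omega hierarchical and omega total
```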
In the March 2018 issue of the Association for Psychological Science’s Observer,
Fried and Flake make four observations about measurement that are consistent
with the data in Table 2.3 and this discussion. First, they encourage researchers
to clearly communicate the construct targeted, how it is measured, and its source.
Second, there should be a rationale for scale use and modifications. Third, if the
only evidence you have of measure “validity” is its alpha, consider conducting a
validity study to ascertain the scales’ correlates. Finally, stop using alpha as the
only evidence of a scale’s adequacy. I would add that we should replace alpha
with omega for composite measures.
and equally problematic, the possibility that job tenure might play a role in this
process was not considered.
These are all excellent studies, but in each case, the time periods studied are
not discussed (the exception was the Kaltiainen et al. study in which they cited the
lack of more frequent measurement as a limitation). Time must be considered if
we are to discover and adequately estimate the underlying processes we seek to
explain. To underscore this issue, I examined the articles published in the last year
in two major journals (Journal of Applied Psychology and Personnel Psychology)
and the last three issues in the 2017 volume of the Academy of Management Journal.
The shorter time frame for Academy of Management Journal was used because
more papers were published in AMJ and more involved longitudinal designs in
which time of data collection was a potential concern.
Table 2.4 contains a brief description of the 46 studies reviewed in these three
journals including the major hypotheses evaluated, the time interval between data
collections, support for the hypothesized effects, and any discussion of time. In
about half of these studies (N = 22), there was no discussion of the role that time
might have played in the study results or whether the timing of multiple data col-
lections was appropriate. In some of these studies, the variables studied might not
have been sensitive to the precise time of data collection or the time interval rep-
resented a reasonable space within which to expect an effect to occur (e.g., effect
of socialization during probationary period). However, in most of these 22 cases,
it would not be hard to assert that the time of measurement was a critical factor in
finding or estimating the effect of some process (e.g., leader personality affecting
leader performance) yet it was not mentioned in the description of the research.
In those studies in which time was mentioned, it was almost always mentioned
as a limitation of the study sometimes with the suggestion that future research
consider the importance of data collection timing. In one study in Personnel Psy-
chology, there was an extensive discussion of the socialization process investi-
gated and why the timing of data collections was appropriate.
A very large proportion of the papers published in Academy of Management
Journal (AMJ) were longitudinal and many involved the use of archival data that
occasionally spanned one or more decades. In some of the archival studies, data
were collected for many time periods, thereby helping to ensure that any time-related effects
would be observed. Like the other two journals, however, 7 of the 16 AMJ papers
did not discuss the importance of time when it seemed to me that it should have
been a relevant issue. The relatively greater concern with time in papers pub-
lished in Academy of Management Journal may be a function of what seems to
be a greater emphasis on theory in that journal. This theoretical emphasis should
produce a concern that measurement time periods coincide with the theoretical
processes being studied.
In none of the papers mentioned in Table 2.4 was the timing of the first data
collection described. When studying a work-related issue, it seems that the first
data collection should occur at employment or immediately before or after an
TABLE 2.4. Longitudinal Research Designs in Articles Published Recently in Major Journals

Journal | Hypothesized Effects | Time Interval | Hypo. Support & Discussion or Rationale for Time Interval
JAP | Morning rudeness > task perf. & goal progress & interaction avoidance & psych. withdrawal | Nine hours | Rudeness affected all four outcomes. Hypo. restricted to morning rudeness, but possibility of buildup or crossover effects is recognized
JAP | Role conflict > emotional exhaustion moderated by helping (socialization) | Six months | Time interval = probation period. Role conflict > exhaustion moderated by type of help provided to newcomers
JAP | Assessment center feedback > self-efficacy > feedback seeking > career outcomes | First stage was 2.4 years after feedback; second stage was 15 years later | Hypotheses were supported. No discussion of the timing of data collection. Times are averages across participants
JAP | Team charter and team conscientiousness lead to task cohesion and team performance | 10 weeks | Hypothesis supported; no discussion of time
JAP | Political behavior > task performance mediated by emotional exhaustion and psychological empowerment, including moderator effects of political behavior on exhaustion | Two months separating each of three surveys | All four hypotheses were supported. No discussion of time interval
JAP | Study 1: Intrinsic motivation > organizational identification; Study 2: Need fulfillment > intrinsic motivation > organizational identification | Study 1: six months; Study 2: three stages with 4 weeks intervening | Study 1: supported; no discussion of time interval. Study 2: support was found for the first link in the hypothesized sequence and partial support for the mediation hypothesis. No discussion of time interval or extent of previous experience
JAP | Job control & task-related stressor and social stressors > health and well-being | Five times over 10 years, but mid-point varied and last data collection was six years after the fourth period | In a general sense, hypotheses were confirmed. Data collection times were discussed and early periods were defended on the notion that this was when most job stress would occur
JAP | Unethical behavior, supervisor bottom-line orientation, and shame > exemplification behavior | Six months and two weeks | Unethical behavior > shame; shame > exemplification; supervisor BLM moderated the latter relationship. Time issue was discussed
JAP | Work demands > unhealthy eating buffered by sleep and mediated by self-regulation | Morning, noon, and evening of fifteen days; in a second study, four daily surveys were administered for four weeks | Job demands > unhealthy eating in the evening and the interaction of job demands and sleeping was significant. Negative customer interaction > negative mood > unhealthy eating. Various points in a day were sampled; no discussion of multi-day effects
JAP | Team voice > team innovation & team monitoring > productivity and safety | 6–8 weeks after teams started and three months later | Promotive perf. > productivity and prohibitive perf. > safety. Promotive perf. > innovation > perf. gains. Prohibitive perf. > monitoring > safety gains. Timing of meas. was recognized as limitation
JAP | Study 1: Intercultural dating > creativity | 10 months | Hypothesis supported. No mention of timing
JAP | Distance and velocity disturbances > enthusiasm and frustration > goal commitment, effort, and perf. | 45-minute experiment | Disturbances both affected frustration and enthusiasm, but velocity had longer term effects; authors mentioned the limiting effect of time on the result
JAP | Intraindividual increases in org. valence > org. identification > job sat. & intent to stay, and personal valence constr. > org. identification > job satisfaction and intent to stay | One year pre- and post-merger | Mixed support for hypotheses. Authors emphasized the need to collect data at multiple time points, but did not discuss the time interval between data collections
JAP | Leader extraversion, agreeableness, & conscientiousness > team potency belief and identification w. ldr. > performance moderated by power distance | Three months | Partial support for hypotheses. No discussion of the time interval separating data collection
JAP | Work engagement > work-family interpersonal capitalization > family satisfaction and work-family balance | Work engagement collected at work, but mediator and outcomes collected at the same time | Mediated effects were supported. Authors did discuss the problem of simultaneous collection of mediator and outcome data
JAP | Participation in job crafting | 8 weeks | Major mediation hypothesis unsupported. No discussion of timing
JAP | Newcomers' task and social info. seeking > mgrs.' perceptions of newcomer commitment to task mastery and social adjustment > mgrs.' provision of help > outcomes | 1 week between each of four data collections | Most hypotheses were supported. Discusses lack of true longitudinal design
JAP | Process justice & cognitive trust are reciprocally related through three stages of a merger | Data collected over two years and tied to specific company changes | Hypotheses confirmed. Data collections tied to specific changes hypothesized to result from merger. Discussed need to estimate relationships in a shorter time frame
JAP | Trust in direct ldrs. > direct ldr. procedural justice > trust in top ldrs. & performance; relationships moderated by vertical collectivism | Three months | Trickle model supported: direct ldr. trust leads to top ldr. trust mediated by direct ldr. procedural justice. No discussion of length of time interval between data collections
JAP | Recruitment source timing and diagnosticity > human capital | Time between receipt of information on jobs and recruitment varied | Time was the major variable studied and it was related to human capital. Attribution is that students developed skills relevant to specific jobs
JAP | High performance leads to supportive or undermining behavior by peers mediated by peers' perceived benefit or threat | Eight weeks | Hypotheses were supported, but there was no mention of the time interval
PPsych | Interaction of job demands and control > death | Seven years | Hypothesis was supported and there was a lengthy discussion of the implications of end-of-career data collection
PPsych | Ambient discrimination > mentoring > organizational commitment, strain, insomnia, absenteeism; mentoring activities moderated the discrimination-outcome relationship | 4 weeks | Not seen as a longitudinal study; time difference was used to control for common method variance
PPsych | Culture beliefs > intercultural sensitivity rejection > cross-cultural adjustment | Time 2 data collected six months after program entry and a third wave 3 months later | Data were collected before, during, and after a program, so the timing of data collection spanned the totality of the participants' experience. Hypotheses were supported
PPsych | Job challenge and developmental experiences > leader self-efficacy and mngrs.' network > promotability and leader effectiveness | Two months | Mixed support and recognition of the lack of truly longitudinal design
PPsych | LMX > higher salaries & responsibility in subsequent jobs as well as alumni goodwill | 18 months | Hypotheses were supported. No mention of time interval but it seems appropriate
PPsych | Emotional labor (surface and deep acting) > ego depletion > coworker harming | Two months | Mention that the two-month interval may have been too long, thereby reducing magnitude of expected relationships
PPsych | Vertical access, horizontal tie strength, and core self-evaluation > newcomer learning and organizational identification | Time 1 (2 months before org. entry), Time 2 (6 months later), and Time 3 (two months after Survey 2) | Extensive discussion of socialization and timing of surveys. Vertical access and core self-evaluations were related to outcomes; horizontal tie strength was not. Three-way interaction related to 3 of 4 outcomes
PPsych | Customer mistreatment > negative mood > employees' helping behavior | Daily, before and after the closing of restaurants where participants worked | Hypothesized indirect effect supported. Daily data collection consistent with hypotheses
PPsych | Group cohesiveness will moderate OCBI and OCBO and self-efficacy change and mediation against job performance | Wave 1 followed by Wave 2 three months later and a third wave after another 3 months | All hypotheses were confirmed. No discussion of the timing of data collection
AMJ | Employee identification > use of voice regarding work > managers' valuation of voice | Two months | Support was found for the hypothesized mediation, but limitation of data collection timing was discussed
AMJ | Company policies and employee passion for volunteering > corporate volunteering climate > volunteering intentions and behavior | 4 weeks | Support for hypotheses but no discussion of timing of measurement
AMJ | Pay for performance > individual performance | Monthly performance for four years | Supported; no discussion of time period, but likely not needed
AMJ | Identity conflict and identity enhancement > intrinsic motivation and perspective taking > performance | 4 months | Intrinsic motivation mediator supported; perspective taking unsupported. Timing of data collection mentioned as a study limitation
AMJ | CEO power > board-chair separation and lead independent director appts. | Ten years | Hypotheses supported; no discussion of timing of data collection
AMJ | Team-based empowerment > team effectiveness moderated by team leader status | 7 months before intervention and 37 months after | Hypotheses supported; time was sufficient for intervention to affect outcomes
AMJ | Follower's dependence on leader > abusive supervision (Time 2) > abusive supervision and reconciliation (Time 3) moderated by follower's behavior | Three waves of data collection separated by 4 weeks | Timing of data collection matched followers' performance reviews. Hypotheses supported in two studies
AMJ | Social networks > information and communication technology use > entrepreneurial activity and profit | Yearly over 7 years | Specifically hypothesized that effects would increase with time. When ICT use and family and community centrality were high, entrepreneurial activity increased with time
AMJ | Top executive humility > middle manager job satisfaction > middle manager turnover moderated by top mngmt. faultlines | 1 year | Hypotheses were supported. No mention of time interval
AMJ | Donor contributions > peer recognition of Russian theatres moderated by depth of involvement of external stakeholders | 7 years | Hypotheses supported. No discussion of the time period over which data were collected
AMJ | Economic downturns > zero-sum construal of success > workplace helping | 17 years | First step of causal sequence was confirmed by longitudinal data; second step by experiment
AMJ | Supervisor liberalism > performance-based pay gap between gender groups | 25 years | Hypothesis supported even after control variables are considered. No discussion of time period
AMJ | Daily surface acting at work > emotional exhaustion > next-day work engagement moderated by giving and receiving help | Daily surveys for five days | Hypotheses supported, with giving help being a significant moderator. No mention of time
AMJ | Subordinate deviance > supervisor self-regulation / social exchange > abusive supervision | Two weeks in Study 1; two to four weeks in Study 2 | Indirect effect for self-regulation was supported, but not the indirect effect for social exchange. Emphasized their use of a cross-lagged research design, but did not discuss timing of data collection
AMJ | Risk aversion > guanxi activities | Cross-sectional survey | No discussion of timing of data collection, but hypothesis supported
AMJ | Team commitment and organizational commitment > dominating, integrating, obliging, avoiding conflict strategies | Experiment and survey with no time interval | Mixed support in the survey replication of an experiment. No mention of time
important intervention that is the study focus. This was the case in some of the
papers, but very often the timing of initial or subsequent data collection appeared
to be a matter of convenience (e.g. every two months or every four weeks). On a
positive note, it seems that a very large proportion of the papers, particularly in
AMJ, were longitudinal. This was clearly not the case a couple of decades ago.
It should also be noted that the data provided in Table 2.4 are partly a result
of one reader’s interpretation of the studies. In some of these studies, the authors
may argue that time was considered, and/or it was irrelevant.
It is also the case that most studies employing longitudinal designs are in-
stances of quasi-experimentation, hence the causal inferences derived from these
studies are often problematic (Shadish, Cook, & Campbell, 2002). These stud-
ies are almost always designed to test mediation hypotheses using hierarchical
regression or SEM. These models often provide a poor basis for
making causal inferences even though authors frequently imply directly or indi-
rectly that they provide support for causal hypotheses. These inference problems
and potential solutions have been described in a series of papers by Stone-Romero
and Rosopa (2004, 2008, 2011). They make the case that causal inferences are not
justified when data are not generated as a function of an experimental design that
tests the effects of both the independent and mediator variables. Like earlier au-
thors (e.g., James, Mulaik, & Brett, 2006), they point out that SEM findings (and
analyses using hierarchical linear regression) may support a model but that other
models that include a different causal direction or unmeasured common causes
may also be consistent with the data.
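For readers unfamiliar with these analyses, the sketch below shows the kind of mediation model being critiqued, specified generically in lavaan; x, m, and y are placeholders for an independent variable, mediator, and outcome measured at three successive time points, and the simulated data and coefficients are purely illustrative. As the text emphasizes, good fit of such a model does not by itself license a causal interpretation.

```r
# Generic sketch of a mediation model tested with SEM (lavaan).
# x, m, and y stand in for measures taken at three time points; all data
# are simulated. Fit alone does not establish the hypothesized causal order.
library(lavaan)

set.seed(3)
n <- 300
x <- rnorm(n)                          # time 1 measure
m <- 0.4 * x + rnorm(n)                # time 2 measure
y <- 0.3 * m + 0.1 * x + rnorm(n)      # time 3 measure
d <- data.frame(x, m, y)

model <- "
  m ~ a * x
  y ~ b * m + cp * x
  indirect := a * b          # indirect (mediated) effect
  total    := cp + a * b
"
fit <- sem(model, data = d, se = "bootstrap", bootstrap = 500)
est <- parameterEstimates(fit, boot.ci.type = "perc")
est[est$label != "", ]       # a, b, cp, and the defined indirect and total effects
```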
A longitudinal design that includes theoretically and/or empirically supported
differences in the timing of data collection would seem to obviate at least the
problem of misspecified causal direction. Given the importance of time-ordering
of the independent, mediator and outcome variables, as argued above, it is sur-
prising that Wood, Goodman, and Cook (2008) found only 11% of the studies in
their review of mediation research incorporated time ordering. Their results are
consistent with the data in Table 2.4. The decade since the Wood et al. review
has produced very little change in longitudinal research; even when data are col-
lected at multiple points in time, there is little or no justification of the time points
selected. Those conducting longitudinal research are missing an opportunity to
provide stronger justification of causal inference when they fail to design their
research with careful consideration of time (Mitchell & James, 2001).
As did Cohen, these authors point to the context of the research as an important
factor in presenting and interpreting effect sizes. An effect size of .1 is awfully
important if the outcome predicted is one’s life. It might not be that impressive
if it is one’s level of organizational commitment (my apology to those who study
organizational commitment). They also point to the strength (or lack thereof) of
the research design that produces an effect. If the effect is easily produced, then it
should be less likely to be dismissed as unimportant. If one needs to use a
sledgehammer manipulation to get an effect, it is probably not all that practically important.
Perhaps combining both of these ideas, Cortina and Landis describe the finding that
taking aspirin accounts for one tenth of one percent of the variance in heart attack
occurrence, but such a small intervention with an important outcome makes it
a significant effect (in my opinion, and it seems in that of the medical profession as well).
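One way to make such a tiny r-squared concrete, not mentioned by these authors but consistent with their point, is Rosenthal and Rubin's binomial effect size display (BESD), which converts a correlation into a difference in success rates. A quick sketch using the aspirin figure cited above:

```r
# Rosenthal and Rubin's binomial effect size display (BESD), applied to the
# aspirin example: an effect accounting for one tenth of one percent of the
# variance still corresponds to a visible difference in outcome rates.
r2 <- 0.001            # variance accounted for ("1/10 of one percent")
r  <- sqrt(r2)         # correlation of about .032

besd <- c(treatment = 0.50 + r / 2,   # implied "success" (no heart attack) rate
          control   = 0.50 - r / 2)
round(besd, 3)         # roughly .516 vs. .484, about 3 fewer events per 100 people
```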
JAP does require that a section of the discussion be devoted to the theoretical and
practical significance of the study results, and most articles in other journals include
such a section as well. However, this often appears to be a pro forma satisfaction of a publica-
tion requirement. Moreover, as mentioned above, many of our sophisticated data
analyses do not translate into an effect size. Even when they do, unless these d
statistics or similar effect sizes are in a metric related to organizationally or soci-
etally important outcomes, they are not likely to have much influence on decision
makers. It is also interesting that the literature on utility (Cascio, 2000) which
was oriented to estimating the effect of various behavioral interventions in dollar
terms has pretty much faded away.
We also suspect that it would be hard for even a doctoral level staff person in
an organization to translate the results of a structural equation analysis or a mul-
tilevel analysis or even stepwise regressions into organizationally relevant met-
rics. A good example, and probably an exception, is a paper by Campion, Ployhart,
and Campion (2017), which used stepwise regression analyses to examine the impact
of various recruitment practices and sources of occupational information.
The usual regression-based statistics were used to evaluate hypotheses and then
translated into the percent passing an assessment of critical skills and abilities under
different recruitment scenarios. This information communicated quite directly
to information users. This would be very important data, for example, for the
military in evaluating the impact of lowering entrance qualifications of military
recruits on subsequent failure rates in training or dropout rates. Incidentally, Cam-
pion et al. also reported the number of applicants who heard about jobs from vari-
ous sources and the quality (in terms of assessed ability) of the applicants.
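The sort of translation described above can be approximated with a simple expectancy calculation. The sketch below is hypothetical rather than a reproduction of Campion et al.'s analysis: it assumes normally distributed assessment scores, a pass mark set at the baseline median, and a recruitment practice whose effect is expressed as a standardized mean shift d.

```r
# Hypothetical sketch of translating an effect size into "percent passing."
# Assume assessment scores are standard normal in a baseline applicant pool,
# a passing score at the baseline median, and a recruitment practice that
# shifts the pool mean by d standard deviations.
cut_score <- qnorm(0.50)                 # pass mark at the baseline median
d_values  <- c(0, 0.2, 0.5, 0.8)         # hypothetical effects of the practice

pct_passing <- 100 * (1 - pnorm(cut_score, mean = d_values, sd = 1))
data.frame(d = d_values, pct_passing = round(pct_passing, 1))
# d = 0.5, for example, implies roughly 69% passing versus 50% at baseline.
```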
As for the previous research issues raised in this paper, I reviewed papers pub-
lished in the same three journals (Journal of Applied Psychology, Personnel Psy-
chology, and Academy of Management Journal) to ascertain the degree to which
authors addressed the practical implications of their research in some quantifiable
manner, or at least in a manner that would allow readers to understand what might or
should be changed in organizational practice to benefit from the study findings.
Since all papers have the potential for practical application, I reviewed only the
last 12 papers published in 2017 in each of these three journals. In most articles pub-
lished in these three journals, there was a section titled “practical implications.”
I reviewed these sections as well as the authors’ reports regarding their data in
producing Table 2.5.
The table includes a column in which the primary interest of the author(s) is
listed. I then consider whether there was any presentation of a quantitative esti-
mate of the impact of the variables studied on some outcome (organizational or
individual). Most papers presented their results in terms of correlations or mul-
tiple regressions, but many also presented the results of structural equation model-
ing or hierarchical linear modeling. There were only a few papers in which any
quantitative index other than the results of these statistical analyses of the
impact of a set of “independent” variables was presented. These indices were d
(standardized mean difference) or odds ratios. These indices may also be deficient
in that the metric to which they refer may or may not be organizationally relevant.
For example, I might observe a .5 standard deviation increase in turnover intent,
but unless I know how turnover intent is related to actual turnover in a given
circumstance and how that turnover is related to production, profit, or expense of
recruiting and training new personnel, it is not easy to make the results relevant
to an organizational decision maker. Of course, it is also the case that correlations
can be translated to d, that means and standard deviations can be used to com-
pute d, and that, with appropriate available metrics, d can be translated into some organizationally relevant
metric. However, this was never done in the 36 studies reviewed.
Nearly all authors did make some general statements as to what they believed
their study implied for the organization or phenomena they studied. Abbreviated
forms of these statements are included in the last column of Table 2.5. As men-
tioned above, Journal of Applied Psychology includes a “practical implications”
section in all articles. As is obvious in these statements, authors have given some
thought to the practical implications of their work and their statements relate to a
wide variety of organizationally and individually relevant outcomes. What is not
apparent in Table 2.5 is that these sections in virtually all papers rarely exceeded one
to three paragraphs of material and usually did not discuss how their statements
would need to be modified for use or implementation in a local context.
The utility analyses developed by Schmidt, Hunter, McKenzie, and Muldrow
(1979) and popularized by Cascio (2000) were directed to an expression of study
results in dollar terms. This approach to utility received a great deal of attention
a couple of decades ago, but interest in this approach has waned. Several issues
may have been critical. First, expressing some variables in dollar terms may have
seemed artificial (e.g., volunteering, team-based empowerment, OCBs, rudeness).
Second, calculations underlying utility estimates devolved into some fairly ar-
cane economic formulations (e.g., Boudreau, 1991) which in turn required as-
sumptions that may have made organizational decision makers uncomfortable.
Third, the utility estimates were based on judgments that some decision makers
may have suspected were inaccurate (Macan & Highhouse, 1994) even though
TABLE 2.5. Reports of Practical Impact of Research and Effect Sizes

Journal | Nature of Phenomenon Studied | Effect Size Estimates | Practical Implications Suggested
JAP | Workplace gossip | No | Discussed gossip relationships with workplace deviance and promoting norms for acceptable behavior
JAP | Job insecurity | No | Risk that job performance and OCB will suffer and intent to leave will increase
JAP | Flexible working arrangements | No | Improve employees' wellbeing and effectiveness. Flextime should be accompanied by some time structuring and goal setting
JAP | Environmental and climate change | Odds ratios | Self-concordance of goals and climate change were related to petition-signing behavior and intentions to engage in sustainable climate change behavior
JAP | Insomnia | No | Treatment for insomnia had positive effects on OCB and interpersonal deviance
JAP | Stereotype threat, training, and performance potential | Yes (d) | Stereotype effect on learning, which has implications for human potential over time
JAP | Snacking at work | No | Individual, organizational, and situational factors affect what employees eat. Organizations should promote a healthy organizational eating climate
JAP | Customer behavior and service incivility | No | Verbal aggression directed at an employee and interruptions lead to employee incivility
JAP | Perceptions of novelty and creativity | No | Organizations should encourage creativity and innovation and use employees with promotion focus to identify good ideas
JAP | Authoritarian leadership | No | Negative effects of authoritarian leadership on performance, OCB, and intent to stay moderated by power distance and role breadth self-efficacy
JAP | Gender transition and job attitudes and experiences | No | Gender transition related to job satisfaction, person-organization fit, and lower perceived discrimination. Organizations should promote awareness and inclusivity
JAP | Gender and crying | No | Crying was associated with lower performance and leader evaluations for males. Men should be cautious in emotional expression
PPsych | Work demands, job control, and mortality | Odds ratios | Job demands and job control interacted to produce higher mortality. Organizations should
PPsych | Work family balance | No | Practices that promote balance satisfaction and effectiveness may enhance job attitudes and performance
PPsych | Mentoring as a buffer against discrimination | No | High-quality formal and informal mentoring relationships that offer social support reduce the negative impact of racism and lead to a number of positive job outcomes
PPsych | Cultural intelligence | No | Provision of experiences that foster social adjustment increases benefits derived from international experiences
PPsych | Role-based identity at work | No | Provides role-based identity scales and suggests that employees who assume too many roles may experience burnout
PPsych | Leader development | No | Combinations of developmental exercises: formal training, challenging responsibilities, and developmental supervision best in developing leaders
PPsych | LMX leadership effects | | LMX quality relationships are related to career progress in new organizations and alumni goodwill. Orgs. should promote internal job opportunities
PPsych | Status incongruence and the impact of transformational leadership | No | Organizations should consider training employees on the biases faced by women in leadership roles
PPsych | Emotional labor in customer contacts | No | Organizations should promote deep acting rather than surface acting in service employees to prevent harming behavior to clients and coworkers
PPsych | Newcomer adjustment | No | Organizations should tailor their approach to newcomer socialization to individual needs
PPsych | Training transfer | No | Expectations regarding transfer of training should take account of different learning trajectories and opportunities to perform
PPsych | Family role identification and leadership | No | Organizations and individuals should promote family involvement as these activities enhance transformational leadership behavior
AMJ | Curiosity and creativity | No | Study offers suggestions as to how to provide feedback and that curiosity be considered when selecting people into "creative" jobs. Creative workers must have time to consider revisions
AMJ | Ambiguity in corporate communication in response to competition | Likelihood of competitive actions | Use vague language in annual reports to reduce competitive entry in your market
AMJ | Value of voice | No | Exercise of voice should be on issues that are feasible in terms of available resources. Speaking up on issues that are impossible to address will have negative impact on the manager and employee
AMJ | Pay for performance | No | Employees indebted to a pay-for-performance plan will react positively to debt forgiveness, but only in the short term
AMJ | Identity conflict and sales performance | d of selling intention | Managers can influence performance by reducing role conflict and increasing identity enhancement
AMJ | Board director appt. and firm performance | No | CEO and boards can be balanced in terms of power and this likely leads to positive firm-level outcomes
AMJ | Team-based empowerment | Percentage of same-day appt. requests | High-status leaders struggle with team-based empowerment and specific leader behaviors facilitate or hinder delegation
AMJ | Abusive supervision | No | Provides strategies for abused followers to reconcile with an abusive supervisor. Organizations should encourage leaders and followers to foster mutual dependence
AMJ | Entrepreneurs' motivation shapes the characteristics and strategies of firms | No | Describes the process of organizing new firms and whether founders remain till the firm becomes operational or leave
AMJ | Innovation and domain experts | No | Experts are useful in generating potential problem solutions, but may interfere in selecting the best solution
AMJ | Volunteering climate | No | Fostering collective pride about volunteering leads to affective commitment and to volunteering intentions
AMJ | Women entrepreneurs in India | Odds ratios and profit in rupees | Community and social networks lead to entrepreneurial activity and profit moderated by information and technology use
the consistency across judges was usually quite acceptable (Hunter, Schmidt, &
Coggin, 1988). Finally, some estimates were so large (e.g., Schmidt, Hunter, &
Pearlman, 1982) and the vagaries of organizational life so unpredictable (Teno-
pyr, 1987) that utility estimates were rarely realized.
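For readers unfamiliar with the calculations being referenced, the Brogden-Cronbach-Gleser utility model that underlies Schmidt et al. (1979) and Cascio (2000) can be sketched in a few lines of R. All of the input values below are invented for illustration; in practice, each would be estimated for the local situation.

```r
# Sketch of the Brogden-Cronbach-Gleser utility calculation popularized by
# Schmidt et al. (1979) and Cascio (2000). All input values are hypothetical.
n_hired   <- 50        # number of people selected per year
tenure    <- 2         # expected tenure of selectees, in years
validity  <- 0.30      # criterion-related validity of the new predictor
sd_y      <- 15000     # standard deviation of job performance in dollars
z_x       <- 0.80      # mean standard score of selectees on the predictor
cost      <- 400       # cost of testing per applicant
n_applied <- 500       # number of applicants tested

delta_U <- n_hired * tenure * validity * sd_y * z_x - n_applied * cost
delta_U   # estimated dollar gain over random selection; here, $160,000
```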
It appears that HR personnel are facing a similar set of “so what” questions as
they attempt to make sense of the Big Data analyses that are now possible and
increasingly common. Angrave et al. (2016) report that HR practitioners who are confronted with these data are enthusiastic about them but feel no better informed about how to put them into practice than they were before. This seems to be the same situation that those working on utility analyses confronted in the 1980s and 1990s. Although many organizations have begun to engage with HR data and analytics, most seem not to have moved beyond operational reporting. Angrave et al. assert that four conditions must be met if HR is to make use of Big Data analytics. First, there must be a theory of how people contribute to the success of the organization: do they create, capture, and/or leverage something of value to the organization and, if so, what is it? Second, the analyst needs to understand the data and the context in which they are collected in order to gain insight into how best to use the metrics that are reported. Third, these metrics must help identify the groups of talented people who are most instrumental in furthering organizational performance. Finally, simple reports of relationships are not sufficient; attention must be given to the use of experiments and quasi-experiments that show that a policy or intervention improves performance.
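As a concrete illustration of that final point, the sketch below is a minimal, entirely hypothetical example (the variable names and numbers are ours, not Angrave et al.'s) of how an HR intervention can be evaluated with a randomized comparison rather than a descriptive report: employees are randomly assigned to receive a new policy or not, and the difference in a measured outcome is tested directly.

```python
# Minimal sketch (hypothetical data): a randomized evaluation of an HR intervention.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200  # employees per condition (hypothetical)

# Simulated outcome scores (e.g., an engagement index) for the two conditions;
# the treated group is assumed to be 3 points higher on average.
control = rng.normal(loc=50.0, scale=10.0, size=n)
treated = rng.normal(loc=53.0, scale=10.0, size=n)

t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)  # Welch's t-test
print(f"Mean difference = {treated.mean() - control.mean():.2f}")
print(f"Welch t = {t_stat:.2f}, p = {p_value:.4f}")
```

A quasi-experimental version of the same evaluation would replace random assignment with a nonequivalent comparison group and would accordingly support weaker conclusions.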
REFERENCES
Angrave, D., Charlwood, A., Kirkpatrick, I., Lawrence, M., & Stuart, M. (2016). HR and
analytics: Why HR is set to fail the big data challenge. Human Resource Manage-
ment Journal, 26, 1–12.
Barnes, C. M., Miller, J. A., & Bostock, S. (2017). Helping employees sleep well: Effects
of cognitive behavior therapy for insomnia on work outcomes. Journal of Applied
Psychology, 102, 104–113.
Boudreau, J. W. (1991). Utility analysis for decisions in human resource management. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology: Vol. 2 (pp. 621–746). Palo Alto, CA: Consulting Psychologists Press.
Campion, M. C., Ployhart, R. E., & Campion, M. A. (2017). Using recruitment source
timing and diagnosticity to enhance applicants’ occupation-specific human capital.
Journal of Applied Psychology, 102, 764–781
Cascio, W., & Boudreau, J. (2011). Investing in people: The financial impact of human resource initiatives (2nd ed.). Upper Saddle River, NJ: Pearson.
Cascio, W. F. (2000). Costing human resources: The financial impact of behavior in orga-
nizations. Cincinnati, OH: Southwestern.
Chen, G., Ployhart, R. E., Cooper-Thomas, H. D., Anderson, N., & Bliese, P. D. (2011). The power of momentum: A new model of dynamic relationships between job satisfaction change and turnover intentions. Academy of Management Journal, 54, 159–181.
Hunter, J. E., Schmidt, F. L., & Coggin, T. D. (1988). Problems and pitfalls in using capital
budgeting and financial accounting techniques in assessing the utility of personnel
programs. Journal of Applied Psychology, 73, 522–528.
Ilgen, D. R., & Hulin, C. L. (Eds.). (2000). Computational modeling of behavioral pro-
cesses in organizational research. Washington, DC: American Psychological As-
sociation Press.
James, L. R., Mulaik, S. A., & Brett, J. M. (2006). A tale of two methods. Organizational
Research Methods, 9, 233–244.
Kaltiainen, J., Lipponen, J., & Holtz, B. C. (2017). Dynamic interplay between merger
process justice and cognitive trust in top management: A longitudinal study. Journal
of Applied Psychology, 102, 636–647.
Klein, K. J., & Kozlowski, S. W. J. (Eds.) (2000). Multilevel theory, research and methods
in organizations. San Francisco, CA: Jossey-Bass.
Macan, T. H., & Highhouse, S. (1994). Communicating the utility of human resource ac-
tivities: A survey of I/O and HR professionals. Journal of Business and Psychology,
8, 425–436.
Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternate
fit indices in tests of measurement invariance. Journal of Applied Psychology, 93,
568–592.
McNeish, D. (2018). Thanks coefficient alpha: We’ll take it from here. Psychological Methods, 23, 412–433.
Mitchell, T. R., & James, L. R. (2001). Building better theory: Time and the specification of
when things happen. Academy of Management Review, 26, 530–547.
Murphy, K. R., & Russell, C. J. (2017). Mend it or end it: Redirecting the search for in-
teractions in the organizational sciences. Organizational Research Methods, 20,
549–573.
Nye, C. D., & Drasgow, F. (2011). Effect size indices for analyses of measurement equiva-
lence: Understanding the practical importance of differences between groups. Jour-
nal of Applied Psychology, 96, 966–980.
Pitariu, A. H., & Ployhart, R. E. (2010). Explaining change: Theorizing and testing dy-
namic mediated longitudinal relationships. Journal of Management, 36, 405–429.
Ployhart, R. E., & Kim, Y. (2013). Dynamic growth modeling. In J. M. Cortina and R. S.
Landis (Eds.), Modern research methods (pp. 63–98). New York, NY: Routledge.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behav-
ioral Research, 47, 667–696.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of
validity generalization. Journal of Applied Psychology, 62, 529–540.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in
personnel psychology: Practical and theoretical implications of 85 years of research
findings. Psychological Bulletin, 124, 262–274.
Schmidt, F. L., Hunter, J. E., McKenzie, R., & Muldrow, T. (1979). Impact of valid se-
lection procedures on workforce productivity. Journal of Applied Psychology, 64,
609–626.
Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1982). Assessing the economic impact of
personnel programs on workforce productivity. Personnel Psychology, 35, 333–347.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8,
350–353.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experi-
mental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s
alpha. Psychometrika, 74, 107–120.
Sonnentag, S., Pundt, A., & Venz, L. (2017). Distal and proximal predictors of snacking at
work: A daily-survey study. Journal of Applied Psychology, 102, 151–162.
Stone-Romero, E. F., & Rosopa, P. J. (2004). Inference problems with hierarchical multiple
regression-based tests of mediating effects. Research in Personnel and Human Re-
sources Management, 23, 249–290.
Stone-Romero, E. F., & Rosopa, P. J. (2008). The relative validity of inferences about
mediation as a function of research design characteristics. Organizational Research
Methods, 11, 326–352.
Stone-Romero, E. F., & Rosopa, P. (2011). Experimental tests of mediation models: Pros-
pects, problems, and some solutions. Organizational Research Methods, 14, 631–
646.
Tenopyr, M. L. (1987). Policies and strategies underlying a personnel research program.
Paper presented at the Second Annual Conference of the Society for Industrial and
Organizational Psychology, Atlanta, Georgia.
Tiffin, J., & McCormick, E. J. (1965). Industrial psychology. Englewood Cliffs, NJ: Pren-
tice-Hall.
Vancouver, J. B., & Purl, J. D. (2017). A computational model of self-efficacy’s various
effects on performance: Moving the debate forward. Journal of Applied Psychology,
102, 599–616.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement in-
variance literature: Suggestions, practices, and recommendations for organizational
research. Organizational Research Methods, 3, 4–70.
Walker, D. D., van Jaarsveld, D. D., & Skarlicki, D. P. (2017). Sticks and stones can break
my bones but words can also hurt me: The relationship between customer verbal
aggression and employee incivility. Journal of Applied Psychology, 102, 163–179.
Willett, J. B., & Sayer, A. G. (1994). Using covariance structure analysis to detect corre-
lates and predictors of change. Psychological Bulletin, 116, 363–381.
Wood, R. E., Goodman, J. S., & Cook, N. D. (2008). Mediation testing in management
research. Organizational Research Methods, 11, 270–295.
Zhou, J., Wang, X. M., Song, L. J., & Wu, J. (2017). Is it new? Personal and contextual
influences on perceptions of novelty and creativity. Journal of Applied Psychology,
102, 180–202.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach’s α, Revelle’s β, and McDonald’s ωH: Their relations with each other and two alternate conceptualizations of reliability. Psychometrika, 70, 1–11.
CHAPTER 3
RESEARCH DESIGN AND CAUSAL INFERENCES
Eugene F. Stone-Romero
ences in research aimed at testing model-based predictions, but also the effective-
ness of HRM policies and practices.
In the interest of explicating the way in which experimental design affects the
validity of causal inferences in research, this article considers the following is-
sues: (a) invalid causal inferences in HRM research, (b) the importance of valid causal inferences in basic and applied research, (c) facets of validity in research, (d) formal reasoning procedures as applied to the results of empirical research, (e) the importance of experimental design for valid causal inferences, (f) the settings in which research is conducted, (g) experimental design options (randomized-experimental, quasi-experimental, and non-experimental) for research, (h) other research design issues, (i) overcoming objections that have been raised about randomized experiments in HRM and related disciplines, and (j) conclusions and recommendations for basic and applied research and editorial policies.
ing causal connections between or among variables (Campbell & Stanley, 1963;
Cook & Campbell, 1976, 1979; Shadish et al., 2002). Unless the results of an
empirical study show that an independent variable (i.e., an assumed cause) is
causally related to a dependent variable (i.e., an assumed effect) it is of little con-
sequence that the study has high levels of external, construct, or statistical conclu-
sion validity.
longitudinal research (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979;
Shadish et al., 2002; Stone-Romero, 2009, 2010). Research that demonstrates that
X and Y are related satisfies only one such condition. Thus, it does not serve as
a sufficient basis for inferring that X causes Y. As the well-known adage states,
“correlation does not imply causation.”
Assume that a study provides evidence of a correlation between two measured
variables, O1 and O2. As Figure 3.1 indicates, this finding might result from (a) O1
being a cause of O2 (Figure 3.1a); (b) O2 being a cause of O1, (Figure 3.1b); or (c)
the relation between O1 and O2 being a non-causal function of a third unmeasured
variable, O3 (Figure 3.1c). So, evidence that O1 and O2 are correlated is insuffi-
cient to infer that there is a causal connection between these variables.
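A small simulation can make the point vivid. In the sketch below (our own hypothetical values, not data from any study discussed here), O1 and O2 are each generated solely from an unmeasured third variable, O3; the two measured variables correlate substantially even though neither has any causal effect on the other.

```python
# Minimal sketch (hypothetical data): a correlation produced by a common cause.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

o3 = rng.normal(size=n)             # unmeasured third variable (common cause)
o1 = 0.7 * o3 + rng.normal(size=n)  # O1 depends only on O3
o2 = 0.7 * o3 + rng.normal(size=n)  # O2 depends only on O3

r = np.corrcoef(o1, o2)[0, 1]
print(f"Observed correlation between O1 and O2: {r:.2f}")  # clearly nonzero, yet no causal link
```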
Nevertheless, as was noted above, it is quite common for researchers in HRM
and allied disciplines to base causal inferences on evidence of relations between
variables (e.g., an observed correlation between two variables) as opposed to re-
search that uses a sound experimental design. One vivid example of this is the
body of research on the relation between satisfaction and organizational commit-
ment (commitment hereinafter). On the basis of observed correlations between
measures of these two variables and different types of statistical analyses: (a)
some researchers (e.g., Williams & Hazer, 1986) have concluded that satisfac-
tion causes commitment, (b) other researchers (e.g., Bateman & Strasser, 1984;
Koslowsky, 1991; Wiener & Vardi, 1980) have inferred that commitment causes
satisfaction, (c) still others (e.g., Lance, 1991) have reasoned that satisfaction and
commitment are reciprocally related to one another, and (d) yet others have ar-
gued that the relation between satisfaction and commitment is unclear or spurious
(Farkas & Tetrick, 1989).
Another instance of invalid causal inferences relates to the correlation between
job attitudes (attitudes hereinafter) and performance. As noted above, on the basis
of a meta-analysis of the results of 16 non-experimental studies, Riketta (2008)
inappropriately concluded that attitudes are more likely to influence performance
than vice versa. The fact that the study was based on meta-analysis does nothing
whatsoever to bolster causal inferences.
RESEARCH SETTINGS
Empirical research can be conducted in what have typically been referred to as “labora-
tory” and “field” settings (e.g., Bouchard, 1976; Cook & Campbell, 1976, 1979;
Evan, 1971; Fromkin & Streufert, 1976; Locke, 1986). However, as John Camp-
bell (1986) and others (e.g., Stone-Romero, 2009, 2010) have argued, the labo-
ratory versus field distinction is not very meaningful. One important reason for
this is that research “laboratories” can be set up in what are commonly referred
to as “field” settings. For example, an organization can be created for the specific
purpose of conducting a randomized-experimental study (Shadish et al., 2002, p.
274). Clearly, such a setting blurs the distinction between so called laboratory and
field research.
To better characterize the settings in which research takes place Stone-Romero
(2009, 2010) recommended that they be categorized in terms of their purpose.
More specifically, (a) special purpose (SP) settings are those that were created for
the specific purpose of conducting empirical research and (b) non-special purpose
(NSP) settings are those that were created for a non-research purpose (e.g., manu-
facturing, consulting, retailing). In the interest of clarity about research settings
the SP versus NSP distinction is used in the remainder of this article.
Randomized-Experimental Designs
The simplest method for conducting research that allows for valid causal infer-
ences about the relation between two variables (e.g., X and Y) is a randomized-
experimental study in which (a) X is manipulated at two or more levels, (b) sam-
pling units (e.g., individuals, groups, organizations) are assigned to experimental
conditions on a random basis, and (c) the dependent variable is measured. If there
are a sufficiently large number of sampling units, randomization serves to equate
the average level of observed variables (Oi) in the experimental conditions on all
measured or unmeasured variables prior to the manipulation of the independent
variable or variables. As such, randomization rules out such threats to internal validity as selection (i.e., pre-existing differences between conditions). One randomized-experimental design that incorporates these features, the Solomon four-group design, can be diagrammed as follows (R = random assignment, O = an observation, X = the treatment, ~X = no treatment):
R O1A X O2A
R O1B ~X O2B
R X O2C
R ~X O2D
The results of a study using this design provide a convincing basis for concluding
that the independent variable caused the dependent variable. That is, they allow
for ruling out all threats to internal validity. Note, however, that the same results could not be used to support the conclusion that X is the only cause of changes in the dependent variable. Other randomized-experimental research may show that the dependent variable is also causally affected by a host of other manipulated independent variables.
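The logic of the design can be illustrated with a minimal simulation (hypothetical numbers, not data from any study discussed here). Random assignment leaves the two conditions equivalent, on average, on an unmeasured individual difference, so the difference between condition means recovers the effect of the manipulation.

```python
# Minimal sketch (hypothetical data): a two-condition randomized experiment.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

ability = rng.normal(size=n)                      # unmeasured individual difference
x = rng.integers(0, 2, size=n)                    # random assignment: 0 = ~X, 1 = X
y = 2.0 * x + 1.5 * ability + rng.normal(size=n)  # assumed true treatment effect = 2.0

effect = y[x == 1].mean() - y[x == 0].mean()
balance = ability[x == 1].mean() - ability[x == 0].mean()
print(f"Estimated treatment effect: {effect:.2f}")            # close to 2.0
print(f"Ability imbalance across conditions: {balance:.2f}")  # close to 0
```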
Multiple Independent Variable Designs. Randomized-experimental designs
also can be used in studies that examine causal relations between multiple inde-
pendent variables and one or more dependent variables. A study of this type can
consider both the main and interactive effects of two or more independent vari-
ables (e.g., X1, X2, and X1×X2). Thus, for example, a 2 × 2 factorial study could test
for the main and interactive effects of room temperature and relative humidity on
workers’ self-reports of their comfort level.
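The sketch below illustrates such a factorial experiment with hypothetical effect sizes (an illustration of the logic, not a reanalysis of any study): comfort declines under high temperature, declines under high humidity, and declines even more when both are high, which appears as a nonzero interaction contrast.

```python
# Minimal sketch (hypothetical data): a 2 x 2 factorial experiment.
import numpy as np

rng = np.random.default_rng(2)
n_per_cell = 250

cells = []
for a in (0, 1):      # A: 0 = moderate temperature, 1 = high temperature
    for b in (0, 1):  # B: 0 = moderate humidity, 1 = high humidity
        # Assumed population model: main effects of A and B plus an A x B interaction.
        comfort = 70 - 5 * a - 3 * b - 4 * a * b + rng.normal(0, 8, n_per_cell)
        cells.append((a, b, comfort.mean()))

m = {(a, b): mean for a, b, mean in cells}
for (a, b), mean in m.items():
    print(f"A={a}, B={b}: mean comfort = {mean:.1f}")

# Interaction contrast: (A1B1 - A1B0) - (A0B1 - A0B0); nonzero implies an interaction.
print(f"Interaction contrast: {(m[1, 1] - m[1, 0]) - (m[0, 1] - m[0, 0]):.1f}")
```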
Quasi-Experimental Designs
Quasi-experimental designs have three attributes: First, one or more indepen-
dent variables (e.g., X1, X2, and X3) are manipulated. Second, assumed dependent
and control variables are measured (O1, O2, . . . Ok) before and after the manipula-
tions. Third, the studied units are not randomly assigned to experimental condi-
tions. The latter attribute results in a very important deficiency, i.e., an inability to
argue that the studied units were equivalent to one another before the manipula-
tion of the independent variable(s). Stated differently, at the outset of the study
the units may have differed from one another on a host of measured and/or un-
measured confounding variables (Campbell & Stanley, 1963; Cook & Campbell,
1976, 1979; Shadish et al., 2002). Thus, any observed differences in measures
of the assumed dependent variable(s) may have been a function of one or more
confounding variables.
There are five basic types of quasi-experimental designs. Brief descriptions of
them are provided below.
Single Group Designs With Or Without Control Group. In this type of de-
sign an independent variable is manipulated and the assumed dependent variable
is measured at various points in time before and/or after the manipulation. A very
simple case of this type of design is the one-group pretest-posttest design:
O1A X O2A
A variant of this design adds a nonequivalent control group (i.e., units are assigned to the treatment and control conditions on a non-random basis, ~R) and measures the assumed dependent variable only after the treatment period:
~R X O2A
~R ~X O2B
Although this design is slightly better than the just-described single group de-
sign, it is still highly deficient with respect to the criterion of internal validity.
The principal reason for this is that posttest differences in the assumed dependent
variable may have resulted from such confounds as pre-treatment differences on
the same variable or a host of other confounds (e.g., local history).
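A brief simulation (hypothetical values) shows how such confounds operate: even when the treatment has no effect whatsoever, non-random assignment that is related to a pre-existing variable produces a sizable posttest difference between the groups.

```python
# Minimal sketch (hypothetical data): a posttest-only design with non-random assignment.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

motivation = rng.normal(size=n)                   # pre-existing individual difference
# Units end up in the treatment condition partly as a function of motivation (~R).
treated = (motivation + rng.normal(size=n)) > 0
posttest = 1.2 * motivation + rng.normal(size=n)  # the treatment has no effect at all

diff = posttest[treated].mean() - posttest[~treated].mean()
print(f"Apparent 'treatment effect': {diff:.2f}")  # clearly nonzero despite a true effect of zero
```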
Multiple Group Designs with Control Groups and Pretest Measures. In
this type of design (a) units are assigned to one or more treatment and control
conditions on a non-random basis, (b) the independent variable is manipulated in
one or more such conditions, and (c) the assumed dependent variable is measured
before and after the treatment period. One example of this type of design is:
~R O1A X O2A
~R O1B ~X O2B
O1A O2A O3A ··· O25A X O26A O27A O28A ··· O50A
~R O1A X O2A
~R O1B ~X O2B
Unfortunately, this design does not allow for confident inferences about the
effect of the treatment on the assumed dependent variable. There are numerous
reasons for this including differential levels of attrition from members of the two
groups (Cook & Campbell, 1976, 1979; Shadish et al., 2002).
Summary. As noted above, quasi-experimental designs may allow a researcher
to rule out some threats to internal validity. However, as is detailed by Campbell
and Stanley (1963, Table 2) other threats can’t be ruled out by these designs. As a
result, internal validity is often questionable in research using quasi-experimental
designs. Stated differently, these designs are inferior to randomized-experimental
designs in terms of supporting causal inferences.
Non-Experimental Designs
In a non-experimental study the researcher measures (i.e., observes) assumed
independent, mediator, moderator, and dependent variables (O1, O2, O3, . . . Ok).
One example of this type of study is research by Hackman and Oldham (1976).
Its purpose was to test the job characteristics theory of job motivation. In it the
researchers measured the assumed (a) independent variables of task variety, au-
tonomy, and feedback, (b) mediator variables of experienced meaningfulness of
work, and knowledge of results of work activities, (c) moderator variable of high-
er order need strength, and (d) dependent variables of work motivation, perfor-
mance, and satisfaction. They then used statistical analyses (e.g., zero-order cor-
relation, multiple regression) to test for relations between the observed variables.
Results of the study showed strong support for hypothesized relations between
the measured variables. Nevertheless, because the study was non-experimental,
any causal inferences stemming from it would rest on a very shaky foundation.
Note, moreover, that it is of no consequence whatsoever that the analyses were
predicated on a theory! Thus, for example, the study’s results were incapable of
serving as a valid basis for (a) inferring that job characteristics were the causes
of satisfaction or (b) ruling out the operation of a number of potential (observed
and unobserved) confounding variables. More generally and contrary to the argu-
ments of many researchers, causal inferences are not strengthened by invoking
a theory prior to the time a study is conducted. As noted above, for example,
(a) some theorists argue that satisfaction causes performance, (b) others contend
that performance causes satisfaction, and (c) still others assert that the relation is
spurious. Non-experimental research is incapable of determining which, if any, of
these assumed causal models is correct.
Mediation Models
Research using randomized-experimental designs also may be used in tests of
models involving mediation (e.g., Pirlott & MacKinnon, 2016; Rosopa & Stone-
Romero, 2008; Stone-Romero & Rosopa, 2008, 2010, 2011). For example, a re-
searcher may posit that O1 → O2 → O3 . Here, as is illustrated in Figure 3.1g, the
effect of O1 on O3 is transmitted through the mediator, O2. The simplest way of
testing such a mediation model is to conduct two randomized experiments, one
that tests for the effects of O1 on O2 and the other that tests for the effects of O2 on
O3 (Rosopa & Stone-Romero, 2008; Stone-Romero & Rosopa, 2008, 2010, 2011).
If the results show support for both such predictions, the mediating effect of O2
can be deduced through the use of symbolic logic (Kalish & Montague, 1964,
Theorem 26).
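The sketch below illustrates this two-experiment approach with hypothetical effect sizes (an illustration of the logic, not a reproduction of any published study): the first experiment manipulates O1 and measures O2, and the second manipulates O2 and measures O3; support for both causal links, combined with the deduction noted above, supports the mediation model.

```python
# Minimal sketch (hypothetical data): two randomized experiments used to test O1 -> O2 -> O3.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 300

# Experiment 1: random assignment to levels of O1; O2 is the dependent variable.
x1 = rng.integers(0, 2, size=n)
o2 = 0.8 * x1 + rng.normal(size=n)
t1, p1 = stats.ttest_ind(o2[x1 == 1], o2[x1 == 0])

# Experiment 2: random assignment to levels of O2 (manipulated directly); O3 is measured.
x2 = rng.integers(0, 2, size=n)
o3 = 0.6 * x2 + rng.normal(size=n)
t2, p2 = stats.ttest_ind(o3[x2 == 1], o3[x2 == 0])

print(f"Experiment 1 (O1 -> O2): t = {t1:.2f}, p = {p1:.4f}")
print(f"Experiment 2 (O2 -> O3): t = {t2:.2f}, p = {p2:.4f}")
```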
Research by Eden, Stone-Romero, and Rothstein (2015) is an example of a
meta-analytic based mediation study. It used the results of meta-analyses of two
relations: The first involved the causal relation between leader expectations (LE)
and subordinate self-efficacy (SE). For it, the average correlation was .58. The
second considered the causal relation between subordinate self-efficacy (SE) and subordinate performance (SP); for it, the average correlation was .35. When combined, these correlations, along with formal reasoning deductions, provided support for the assumed mediation model. The reasoning is ((LE → SE) ∧ (SE → SP)) → (LE → SP) (see Theorem 26 of Kalish & Montague, 1964).
Whereas the results of meta-analyses of experimental studies may be used to
support causal inferences for either simple (e.g., O1 → O2) or complex relations
(e.g., O1 → O2 → O3), they do not justify such inferences in cases where the meta-
analyses are based upon relations derived from non-experimental studies (e.g.,
Judge et al., 2001; Riketta, 2008). Stated somewhat differently, meta-analytic
methods cannot serve as a basis for valid causal inferences when they involve the
accumulation of the findings of two or more non-experimental studies.
the assumed causal model (a) the effect of Z1 on Z3 was mediated by Z2, and (b)
there also was a direct effect of Z1 on Z3. The variables in the simulation were r12,
r13, r23, and N, where (a) r12 = correlation between the assumed independent vari-
able, Z1, and the assumed mediator variable, Z2; (b) r13 = correlation between the assumed independent variable, Z1, and the assumed dependent variable, Z3; (c) r23 = correlation between the assumed mediator variable, Z2, and the assumed dependent variable, Z3; and (d) N = sample size. The manipulations of r12, r13, and r23 (values of .1 to .9 for
each) and N (values of 68, 136, 204, 272, 340, 408, 1,000, 1,500, 2,000, 2,500, and
3,000) resulted in 651 data sets. They were analyzed using the HMR technique.
Results of 8,463 HMR analyses showed that: (a) if the model tested was not the
true model, there would be a large number of cases in which there would be sup-
port for partial or complete mediation and the researcher would make highly erro-
neous inferences about mediation; (b) if the model tested was the true model, there
would only be slight support for inferences about complete mediation and modest
support for inferences about partial mediation. Overall, the HMR procedure did
very poorly in terms of providing evidence of mediation (see Stone-Romero &
Rosopa, 2004 for detailed information on the findings). Thus, the HMR technique
is unlikely to provide consistent evidence to support inferences about mediation.
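To see how the HMR steps can mislead, consider the simplified sketch below (our own hypothetical data, not the Stone-Romero and Rosopa simulation): the three measured variables share only an unmeasured common cause, and no variable affects any other, yet the usual regression steps produce a pattern that resembles partial mediation.

```python
# Minimal sketch (hypothetical data): HMR-style mediation steps applied to
# non-experimental data generated by a common-cause (non-mediation) model.
import numpy as np

rng = np.random.default_rng(5)
n = 5_000

u = rng.normal(size=n)             # unmeasured common cause
z1 = 0.7 * u + rng.normal(size=n)
z2 = 0.7 * u + rng.normal(size=n)
z3 = 0.7 * u + rng.normal(size=n)

def ols_slopes(y, predictors):
    """Ordinary least squares; returns coefficients for the predictors (intercept dropped)."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

c = ols_slopes(z3, [z1])[0]            # Step 1: total "effect" of Z1 on Z3
a = ols_slopes(z2, [z1])[0]            # Step 2: "effect" of Z1 on Z2
b, c_prime = ols_slopes(z3, [z2, z1])  # Step 3: Z2 and Z1 predicting Z3

print(f"c = {c:.2f}, a = {a:.2f}, b = {b:.2f}, c' = {c_prime:.2f}")
# c' is noticeably smaller than c, mimicking "partial mediation" in the absence of any causation.
```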
As noted by Stone-Romero and Rosopa (2004) there are many problems with
the HMR strategy for making inferences about causal connections between vari-
ables. First, it relies on the interpretation of the magnitudes of regression coef-
ficients as a basis for determining effect size estimates. However, as Darlington
(1968) demonstrated more than 50 years ago, when multiple correlated predictors
are used in a regression analysis it is impossible to determine the proportion of
variance that is explained uniquely by each of them. Second, when applied to data
from non-experimental research there is almost always ambiguity about causal
direction. Third, there is the issue of model specification. Although a researcher
may test an assumed causal model, he or she cannot be certain that it is the correct
model. Moreover, the invocation of a theory may be of little or no consequence
because there may be many theories about the causal connection between vari-
ables (e.g., the relation between satisfaction and performance). Fourth, the results
of non-experimental research do not provide a basis for making causal inferences.
In contrast, the findings of randomized-experimental studies do. Fifth, and finally,
in non-experimental research there is always the problem of unmeasured con-
founds. These are seldom considered in HMR analyses. Even if they are, if the
measures of confounds lack construct validity their effects cannot be controlled
fully by an HMR analysis.
Stone-Romero and Rosopa (2004) are not alone in questioning the ability of
HMR to provide credible evidence about causal connections between (or among)
measured variables. A number of other researchers have comparable views. For
example, Mathieu and Taylor (2006) wrote that “Research design factors are para-
mount for reasonable mediational inferences to be drawn. If the causal order of
variables is compromised, then it matters little how well the measures perform or
the covariances are partitioned. Because no [data] analytic technique can discern
the true causal order of variables, establishing the internal validity of a study is
critical. . . [and] randomized field experiments afford the greatest control over such
concerns” (p. 1050). They went on to state that randomized experiments “remain
the ‘gold standard’ [in empirical research] and should be pursued whenever pos-
sible” (p. 1050). These and similar views stand in sharp contrast to the generally
invalid arguments of several authors (e.g., Baron & Kenny, 1986; Blalock, 1964,
1971; James, 2008; James, Mulaik, & Brett, 2006; Kenny, 1979, 2008; Preacher
& Hayes, 2004, 2008). Unfortunately, unwarranted inferences about causality on
the basis of so called “causal modeling” methods are all too common in publica-
tions in HRM and allied fields. For example, on the basis of a meta-analysis of
the satisfaction-performance relation, Judge, Thoresen, Bono, and Patton (2001)
argued that causal modeling methods can shed light on causal relations between
these variables, especially in cases where mediation is hypothesized. They wrote
that “Though some research has indirectly supported mediating influences [on the
satisfaction-performance relation], direct tests are lacking. Such causal studies
are particularly appropriate in light of advances in causal modeling techniques in
the past 20 years” (p. 390). Contrary to the views of Judge et al., causal modeling
techniques cannot provide a valid basis for causal inferences.
Another example of invalid causal inferences comes from Riketta’s (2008)
meta-analytic study of the relations between job attitudes (attitudes hereinafter)
and performance. As noted above, he cumulated the findings of 16 nonexperimen-
tal studies to compute average correlations between attitudes and performance.
They were used in what he described as a meta-analytic regression analysis. On
the basis of it he wrote that “ because the present analysis is based on correlational
rather than experimental data, it allows for only tentative causal conclusions and
cannot rule out some alternative causal explanations (e.g., that third variables in-
flated the cross-lagged paths; see, e.g., Cherrington, Reitz, & Scott, 1971; Brown
& Peterson, 1993). Although the present analysis accomplished a more rigorous
test for causality than did previous meta-analyses in this domain, it still suffers
from the usual weakness of correlational designs. Experiments are required to
provide compelling evidence of causal relations” (p. 478). Whereas Riketta was
correct in concluding that experiments are needed to test causal relations, he was
incorrect in asserting that his study provided a more rigorous test of causality than
previous meta-analytic studies.
Brown and Peterson (1993) conducted an SEM-based test of an assumed
causal model of the antecedents and consequences of salesperson job satisfaction.
On the basis of its results they concluded that “Another important finding of the
causal analysis is evidence that job satisfaction primarily exerts a direct causal ef-
fect on organizational commitment rather than vice versa” (p. 73). Unfortunately,
this and other causal inferences were unwarranted because the study’s data came
from non-experimental studies.
It is interesting to consider the views of James et al. (2006) with respect to test-
ing assumed mediation models. They argue that “if theoretical mediation models
are thought of as causal models, then strategies designed specifically to test the fit
of causal models to data, namely, confirmatory techniques such as structural equa-
tion modeling (SEM), should be employed to test mediation models” (p. 234).
Moreover, they contend that in addition to testing a mediation model of primary
interest they strongly recommend testing alternative causal models. As they note,
“The objective is to contrast alternative models and identify those that appear to
offer useful explanations versus those that do not” (p. 243). However, they go on
to write that the results of SEM analyses “for both complete and partial mediation
models do not imply that a given model is true even though the pattern of parame-
ter estimates is consistent with the predictions of the model. There are always oth-
er equivalent models implying different causal directions or unmeasured common
causes that would also be consistent with the data” (p. 238). Unfortunately, for the
reasons noted above, testing primary or alternative models with SEM or any other
so called “causal modeling” methods does not allow researchers to make valid
causal inferences because when applied to data from non-experimental studies
these methods cannot serve as a valid basis for inferences about cause.
Some researchers seem to believe that the invocation of a theory combined with
the findings of a “causal modeling” analysis (e.g., SEM) is the deus ex machina of
nonexperimental research. Nothing could be further from the truth. One reason for
this is that the same set of observed correlations between or among a set of mea-
sured variables can be used to support a number of assumed causal models (e.g.,
Figures 3.1a to 3.1g). In the absence of research using randomized experimental
designs it is impossible to determine which, if any, of the models is correct.
Clearly, so called “causal modeling” methods (e.g., path analysis, hierarchical
regression, cross-lagged panel correlation, and SEM) are incapable of providing
valid evidence on causal connections between and among measured variables (Cliff,
1983; Freedman, 1987; Games, 1990; Rogosa, 1987; Rosopa & Stone-Romero,
2008; Spencer, Zanna, & Fong, 2005; Stone-Romero, 2002, 2008, 2009, 2010;
Stone-Romero & Gallaher, 2006; Stone-Romero & Rosopa, 2004, 2008, 2010,
2011). Therefore, researchers interested in making causal inferences should con-
duct studies using either randomized-experimental or quasi-experimental designs.
In recent years, a number of researchers have championed the use of data ana-
lytic strategies for supposedly improving causal inferences in research using non-
experimental designs. Two examples of this are propensity score modeling (e.g.,
Rosenthal & Rosnow, 2008) and regression-based techniques for approximating
counterfactuals (e.g., Morgan & Winship, 2014). On their face, these techniques
may appear elegant and sophisticated. However, the results of these regression-
based strategies do not provide a valid basis for causal inferences because the data
used by them come from non-experimental research. Another very serious limi-
tation of the propensity score strategy and similar strategies is that they provide
statistical controls for only a limited set of control variables. This leaves a host of potential confounds uncontrolled.
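The sketch below (hypothetical values; a simplified regression adjustment standing in for these more elaborate techniques) illustrates the limitation: adjusting for a measured covariate does nothing about an unmeasured confound, so a clearly nonzero "treatment effect" is estimated even though the true effect is zero.

```python
# Minimal sketch (hypothetical data): covariate adjustment with an unmeasured confound.
import numpy as np

rng = np.random.default_rng(6)
n = 20_000

c_obs = rng.normal(size=n)                       # measured control variable
u = rng.normal(size=n)                           # unmeasured confound
treat = (0.8 * c_obs + 0.8 * u + rng.normal(size=n)) > 0
y = 1.0 * c_obs + 1.0 * u + rng.normal(size=n)   # outcome; the true treatment effect is zero

X = np.column_stack([np.ones(n), treat.astype(float), c_obs])  # adjust for c_obs only
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"'Treatment effect' after adjusting for the measured covariate: {beta[1]:.2f}")
# Substantially different from zero because u remains uncontrolled.
```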
OBJECTIONS TO RANDOMIZED-EXPERIMENTS
Some researchers (e.g., James, 2008; Kenny, 2008) have argued that research
based on randomized-experimental designs is not feasible for various reasons,
including (a) some independent variables can’t be manipulated, (b) the manipula-
tion of others may not be ethical, and (c) organizations will not permit randomized
experiments.
Non-Manipulable Variables
Clearly, some variables are incapable of being manipulated by researchers,
including (a) the actual ages, sexes, genetic makeup, physical attributes, and cog-
nitive abilities of research participants, (b) the laws of cities, counties, states, and
countries, and (c) the environments in which research units operate. Neverthe-
less, through creative research design it may be possible to manipulate a number
of such variables. For example, in a randomized-experimental study of helping
behavior by Danzis and Stone-Romero (2009) the attractiveness of a confederate
(who requested help from research subjects) was manipulated through a number
of strategies (e.g., the clothing, jewelry, makeup, and hairstyle of confederates).
Results of the study showed that attractiveness had an impact on helping behavior.
Attractiveness also can be manipulated in a number of other ways. For ex-
ample, in a number of simulated hiring studies using randomized-experimental
designs the physical attractiveness of job applicants was manipulated via photos
of the applicants (see Stone, Stone, & Dipboye, 1992, for details). In addition,
a randomized-experimental study by Krueger, Stone, and Stone-Romero (2014)
examined the effects of several factors, including applicant weight on hiring deci-
sions. In it, the weight of applicants was manipulated through the editing of pho-
tos of them using Photoshop software. Overall, the above demonstrates quite clearly that randomized experiments are possible even when they involve independent variables that some researchers believe to be difficult or impossible to manipulate.
In an article that critiqued the use of research using randomized-experimental
designs, James (2008) wrote that “If we limited causal inference to randomized
experiments where participants have to be randomly sampled [sic] into values
of a causal variable, then we would no longer be able to draw causal inferences
about smoking and lung cancer (to mention one of several maladies)” (p. 361).
Clearly, this argument is of little or no consequence because many variables can
be manipulated. For example, a large number of randomized-experimental studies
have shown the causal connection between smoking and lung cancer using hu-
man or non-human research subjects (El-Bayoumy, Iatropolous, Amin, Hoffman, & Wynder, 1999; Salaspuro & Salaspuro, 2004). And, at the cellular level, thou-
ments can be granted on the basis of breaking ties with regard to a selection variable,
(f) units are indifferent to the type of treatment they receive, (g) units are sepa-
rated from one another, and (h) the researcher can create an organization within
which the research will be conducted.
Finally, even if randomized experiments are not possible in NSP settings (e.g.,
work organizations) they may be possible in SP settings (Stone-Romero, 2010),
including organizations created for the specific purpose of experimental research
(Evan, 1971; Shadish et al., 2002). Thus, contrary to the arguments of several
analysts (e.g., James, 2008; Kenny, 2008), researchers should consider random-
ized-experimental research when their interest is testing assumed causal models.
Of course, if the sole purpose of a study is to determine if observed variables are
related to one another, non-experimental studies are appropriate.
CONCLUSIONS
In view of the above, several conclusions are offered: First, causal inferences
require sound experimental designs. Of the available options, randomized-exper-
imental designs provide the strongest foundation for such inferences, quasi-ex-
perimental designs afford a weaker basis, and non-experimental designs offer the
weakest. Thus, whenever possible researchers interested in making causal infer-
ences should use randomized-experimental designs.
Second, data analytic strategies are never an appropriate substitute for sound
experimental design. It is inappropriate to advance causal inferences on the basis
of such “causal modeling” strategies as HMR, path analysis, cross-lagged panel
correlation, and SEM. Thus, researchers should refrain from doing so. There is
nothing wrong with arguing that a study’s results are consistent with an assumed
causal model, but consistency is not a valid basis for implying the correctness of
that model. The reason is that the results may be consistent with many other mod-
els and there is seldom a valid basis for choosing one model over others.
Third, researchers should acknowledge the fact that causal inferences are inappro-
priate when a study’s data come from research using non-experimental or quasi-
experimental designs. Thus, they should not advance such inferences (see also,
Wood, Goodman, Beckmann, & Cook, 2008). Rather, they should be circumspect
in discussing the implications of the findings of their research.
Fourth, randomized experiments are possible in both SP and NSP settings, and
they are the “gold standard” for conducting research aimed at testing assumed
causal models. Thus, they should be the first choice for research aimed at testing
such models. Moreover, there are numerous strategies for conducting such experi-
ments in NSP settings (Evan, 1971; Shadish et al., 2002). The many studies by
Eden and his colleagues are evidence of this.
Fifth, researchers should not assume that statistical controls for confounds
(e.g., in regression models) are effective in ruling out confounds. There are two
reasons for this. One is that the measures of known confounds may lack con-
struct validity. The other is that the researcher may not be aware of all confounds
that may influence observed relations between assumed causes and effects. Thus,
it typically proves impossible to control for confounds in non-experimental re-
search.
Sixth, the editors of journals in HRM and related disciplines should ensure
that authors of research-based articles refrain from advancing causal inferences
when their studies are based on experimental designs that do not justify them. As
noted above, there is nothing wrong with arguing that a study is based upon an as-
sumed causal model. For example, an author may argue legitimately that a study’s
purpose is to test a model that posits a causal connection between achievement
motivation and job performance. In the study, both variables are measured. If a
relation is found between these variables it would be inappropriate to conclude
that the results of the study provided a valid basis for inferring that achievement
motivation was the cause of performance. As noted above, research using non-
experimental or quasi-experimental designs cannot provide evidence of the cor-
rectness of an assumed causal model.
Seventh, sound research methods are vital to both (a) the development and test-
ing of theoretical models and (b) the formulation of recommendations for practice.
Thus, progress in both such pursuits is most likely to be made through research
that uses randomized-experimental designs (Stone-Romero, 2008, 2009, 2010).
With respect to theory testing, randomized-experiments are the best research
strategy for providing convincing evidence on causal connections between vari-
ables. With few exceptions, the research literature on various topics (e.g., the satisfaction-performance relation) shows quite clearly that non-experimental research has done virtually nothing to provide credible answers regarding the validity of ex-
tant theories. On the other hand, well-conceived experimental studies (e.g., Cher-
rington et al, 1971) provide clear evidence on causal linkages between variables.
With regard to recommendations for practice it is important to recognize that
a large percentage of studies in HRM and allied disciplines have used non-exper-
imental designs. Because of this it seems likely that many HRM-related policies
and practices are based upon research that lacks internal validity. Thus, research
using randomized-experimental designs has the potential to greatly improve
HRM-related policies and practices.
Eighth, the language associated with some statistical methods may serve as
a basis for invalid inferences about causal connections between variables. One
example of this is analysis of variance. In a study involving two manipulated variables (e.g., A and B), an ANOVA would allow for valid inferences about the main and interactive effects of these variables on a measured dependent variable. However, if an ANOVA were used to analyze data from a study in which the variables were merely measured rather than manipulated (e.g., age, ethnicity, sex), it would be inappropriate to argue that these so called “independent variables” affected the assumed
dependent variable. It deserves adding that the same arguments can be made about
the language associated with other statistical methods (e.g., multiple regression,
and SEM).
Ninth and finally, although this paper’s focus was on HRM research, the just
noted conclusions have far broader implications. More specifically, they apply
to virtually all disciplines in which the results of empirical research are used to
advance causal inferences about the correctness of assumed causal models.
REFERENCES
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.
Bateman, T. S., & Strasser, S. (1984). A longitudinal analysis of the antecedents of organizational commitment. Academy of Management Journal, 27, 95–112.
Blalock, H. M. (1964). Causal inferences in nonexperimental research. New York, NY: W. W. Norton.
Blalock, H. M. (1971). Causal models in the social sciences. Chicago, IL: Aldine.
Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: Wiley.
Bouchard, T. (1976). Field research methods: Interviewing, questionnaires, participant ob-
servation, systematic observation, and unobtrusive measures. In M. D. Dunnette
(Ed.), Handbook of industrial and organizational psychology (pp. 363–413). Chi-
cago, IL: Rand McNally.
Brown, S. P., & Peterson, R. A. (1993). Antecedents and consequences of salesperson job
satisfaction: Meta-analysis and assessment of causal effects. Journal of Marketing
Research, 30, 63–77.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for
research. Chicago, IL: Rand McNally.
Campbell, J. P. (1986). Labs, fields, and straw issues. In E. A. Locke (Ed.), Generalizing
from laboratory to field settings: Research findings from industrial-organizational
psychology, organizational behavior, and human resource management (pp. 269–
279). Lexington, MA: Lexington Books.
Cherrington, D. J., Reitz, H. J., & Scott, W. E. (1971). Effects of contingent and noncontin-
gent reward on the relationship between satisfaction and task performance. Journal
of Applied Psychology, 55, 531–536.
Cook, T. D., & Campbell, D. T. (1976). The design and conduct of quasi-experiments and
true experiments in field settings. In M. D. Dunnette (Ed.), Handbook of industrial
and organizational psychology (pp. 223–326). Chicago, IL: Rand McNally.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues
for field settings. Boston, MA: Houghton Mifflin.
Cliff, N. (1983). Some cautions concerning the application of causal modeling methods.
Multivariate Behavioral Research, 18, 115−126.
Danzis, D., & Stone-Romero, E. F. (2009). Effects of helper sex, recipient attractiveness,
and recipient femininity on helping behavior in organizations. Journal of Manage-
rial Psychology, 24, 722–737.
Darlington, R. B. (1968). Multiple regression in psychological research and practice. Psy-
chological Bulletin, 69, 161–182.
Davidson, O. B., & Eden, D. (2000). Remedial self-fulfilling prophecy: Two field experi-
ments to prevent Golem effects among disadvantaged women. Journal of Applied
Psychology, 85, 386–398.
Dvir, T., Eden, D., Avolio, B. J., & Shamir, B. (2002). Impact of leadership development on
follower development and performance: A field experiment. Academy of Manage-
ment Journal, 45, 735–744.
Eden, D. (1985). Team development: A true field experiment at three levels of rigor. Jour-
nal of Applied Psychology, 70, 94–100.
Eden, D. (2003). Self-fulfilling prophecies in organizations. In J. Greenberg (Ed.), Organi-
zational behavior (2nd ed., pp. 91–122). Mahwah, NJ: Erlbaum.
Eden, D. (2017). Field experimentation in organizations. Annual Review of Organizational
Psychology and Organizational Behavior, 4, 91–122.
Eden, D., & Aviram, A. (1993). Self-efficacy training to speed reemployment: Helping
people to help themselves. Journal of Applied Psychology, 78, 352–360.
Eden, D., Stone-Romero, E. F., & Rothstein, H. R. (2015) Synthesizing results of mul-
tiple randomized experiments to establish causality in mediation testing. Human
Resource Management Review, 25, 342–351.
Eden, D., & Zuk, Y. (1995). Seasickness as a self-fulfilling prophecy: Raising self-efficacy
to boost performance at sea. Journal of Applied Psychology, 80, 628–635.
El-Bayoumy, K., Iatropolous, M., Amin, S., Hoffman, D., & Wynder, E. L. (1999). Increased expression of cyclooxygenase-2 in rat lung tumors induced by tobacco-specific nitrosamine-4-(3-pyridyl)-1-butanone: The impact of a high fat diet. Cancer Research, 59, 1400–1403.
Evan, W. M. (1971). Organizational experiments: Laboratory and field research. New York, NY: Harper & Row.
Fabre, K. M., Livingston, C., & Tagle, D. A. (2014). Organs-on-chips (microphysiological systems): Tools to expedite efficacy and toxicity testing in human tissue. Experimental Biology and Medicine, 239, 1073–1077.
Farkas, A. J., & Tetrick, L. E. (1989). A three-wave longitudinal analysis of the causal
ordering of satisfaction and commitment on turnover decisions. Journal of Applied
Psychology, 74, 855–868.
Freedman, D. A. (1987). As others see us: A case study in path analysis. Journal of Educa-
tional Statistics, 12, 101−128.
Fromkin, H. L., & Streufert, S. (1976). Laboratory experimentation. In M. D. Dunnette
(Ed.). Handbook of industrial and organizational psychology (pp. 415–465). Chi-
cago, IL: Rand McNally.
Games, P. A. (1990). Correlation and causation: A logical snafu. Journal of Experimental
Education, 58, 239–246.
Hackman, J. R., & Oldham, G. R. (1976). Motivation through the design of work: Test of a
theory. Organizational Behavior and Human Performance, 16, 250–279.
Hosoda, M., Stone-Romero, E. F., & Coats, G. (2003). The effects of physical attractive-
ness on job-related outcomes: A meta-analysis of experimental studies. Personnel
Psychology, 56, 431–462.
James, L. R. (2008). On the path to mediation. Organizational Research Methods, 11,
359–363.
James, L. R., Mulaik, S. A., & Brett, J. M. (2006). A tale of two methods. Organizational
Research Methods, 9, 233–244.
Judge, T. A., Locke, E. A., Durham, C. C., & Kluger, A. N. (1998). Dispositional effects on
job and life satisfaction: The role of core evaluations. Journal of Applied Psychol-
ogy, 83, 17–34.
Judge, T. A., Thoresen, C. J., Bono, J. E., & Patton, G. K. (2001). The job satisfaction-job
performance relationship: A qualitative and quantitative review. Psychological Bul-
letin, 127, 376–407.
Kalish, D., & Montague, R. (1964). Logic: Techniques of formal reasoning. New York,
NY: Harcourt, Brace, & World.
Kenny, D. A. (1979). Correlation and causality. New York, NY: Wiley.
Kenny, D. A. (2008). Reflections on mediation. Organizational Research Methods, 11,
353–358.
Koslowsky, M. (1991). A longitudinal analysis of job satisfaction, commitment, and inten-
tion to leave. Applied Psychology: An International Review, 40, 405–415.
Krueger, D. C., Stone, D. L., & Stone-Romero, E. F. (2014). Applicant, rater, and
job factors related to weight-based bias. Journal of Managerial Psychology, 29,
164–186.
Lance, C. E. (1991). Evaluation of a structural model relating job satisfaction, organiza-
tional commitment, and precursors to voluntary turnover. Multivariate Behavioral
Research, 26, 137–162.
Locke, E. A. (1986). Generalizing from laboratory to field settings: Research findings from
industrial-organizational psychology, organizational behavior, and human resource
management. Lexington, MA: Lexington Books.
Mathieu, J. E., & Taylor, S. R. (2006). Clarifying conditions and decision points for me-
diational type inferences in organizational behavior. Journal of Organizational Be-
havior, 27, 1031–1056.
Morgan, S. L., & Winship, C. (2014). Counterfactuals and causal inference. New York,
NY: Cambridge University Press.
Noe, R. A. (2017). Employee training and development (7th ed.). Burr Ridge, IL: McGraw-Hill/Irwin.
Physicians Committee for Responsible Medicine. (2018). Retrieved 14 December 2018
from: https://www.pcrm.org/ethical-science/animal-testing-and-alternatives/human-
relevant-alternatives-to-animal-tests
Pirlott, A. G., & MacKinnon, D. P. (2016). Design approaches to experimental mediation.
Journal of Experimental Social Psychology, 66, 29–38.
Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect
effects in simple mediation models. Behavior Research Methods, Instruments, &
Computers, 36, 717–731.
Preacher, K. J., & Hayes, A. F. (2008). Contemporary approaches to assessing mediation
in communication research. In A. F. Hayes, M. D. Slater, & L. B. Snyder (Eds.), The
SAGE sourcebook of advanced data analysis methods for communication research
(pp. 13–54). Thousand Oaks, CA: Sage.
Riketta, M. (2008). The causal relation between job attitudes and performance: A meta-
analysis of panel studies. Journal of Applied Psychology, 93, 472–481.
Rogosa, D. (1987). Causal models do not support scientific conclusions: A comment in
support of Freedman. Journal of Educational Statistics, 12, 185–195.
Rosenthal, R., & Rosnow, R. L. (2008). Essentials of behavioral research: Methods and
data analysis (3rd ed.). New York, NY: McGraw-Hill.
Rosopa, P. J., & Stone-Romero, E. F. (2008). Problems with detecting assumed mediation
using the hierarchical multiple regression strategy. Human Resource Management
Review, 18, 294–310.
Salaspuro, V., & Salaspuro, M. (2004). Synergistic effect of alcohol drinking and smoking on in vivo acetaldehyde concentration in saliva. International Journal of Cancer, 111, 480–483.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experi-
mental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Spencer, S. J., Zanna, M. P., & Fong, G. T. (2005). Establishing a causal chain: Why experi-
ments are often more effective than mediation analyses in examining psychological
processes. Journal of Personality and Social Psychology, 89, 845–851.
Stone, D. L., & Stone, E. F. (1985). The effects of feedback consistency and feedback
favorability on self-perceived task competence and perceived feedback accuracy.
Organizational Behavior and Human Decision Processes, 36, 167–185.
Stone, E. F., Stone, D. L., & Dipboye, R. L. (1992). Stigmas in organizations: Race, handi-
caps, and physical attractiveness. In K. Kelley (Ed.), Issues, theory, and research
in industrial/organizational psychology (pp. 385–457). Amsterdam, Netherlands:
Elsevier Science Publishers
Stone-Romero, E. F. (2002). The relative validity and usefulness of various empirical re-
search designs. In S. G. Rogelberg (Ed.), Handbook of research methods in indus-
trial and organizational psychology (pp. 77–98). Malden, MA: Blackwell.
Stone-Romero, E. F. (2008). Strategies for improving the validity and utility of research in
human resource management and allied disciplines. Human Resource Management
Review, 18, 205–209.
Stone-Romero, E. F. (2009). Implications of research design options for the validity of in-
ferences derived from organizational research. In D. Buchanan & A. Bryman (Eds.),
Handbook of organizational research methods (pp. 302–327). London, UK: Sage.
Stone-Romero, E. F. (2010). Research strategies in industrial and organizational psychol-
ogy: Nonexperimental, quasi-experimental, and randomized experimental research
in special purpose and nonspecial purpose settings. In S. Zedeck (Ed.), Handbook of
industrial and organizational psychology (pp. 35–70). Washington, DC: American
Psychological Association Press.
Stone-Romero, E. F., & Gallaher, L. (2006, May). Inappropriate use of causal language in
reports of non-experimental research. Paper presented at the meeting of the Society
for Industrial and Organizational Psychology. Dallas, TX.
Stone-Romero, E. F., & Rosopa, P. J. (2004). Inference problems with hierarchical multiple
regression-based tests of mediating effects. Research in Personnel and Human Re-
sources Management, 23, 249–290.
Stone-Romero, E. F., & Rosopa, P. J. (2008). The relative validity of inferences about
mediation as a function of research design characteristics. Organizational Research
Methods, 11, 326–352.
Stone-Romero, E. F., & Rosopa, P. (2010). Research design options for testing mediation
models and their implications for facets of validity. Journal of Managerial Psychol-
ogy, 25, 697–712.
Stone-Romero, E. F., & Rosopa, P. (2011). Experimental tests of mediation models: Pros-
pects, problems, and some solutions. Organizational Research Methods, 14, 631–
646.
Wanous, J. P. (1974). A causal-correlational analysis of the job satisfaction and perfor-
mance relationship. Journal of Applied Psychology, 59, 139–144.
Wiener, Y., & Vardi, Y. (1980). Relationships between job, organization, and career com-
mitments and work outcomes: An integrative approach. Organizational Behavior
and Human Performance, 26, 81–96.
Williams, L. J., & Hazer, J. T. (1986). Antecedents and consequences of satisfaction and
commitment in turnover models: A reanalysis using latent variable structural equa-
tion methods. Journal of Applied Psychology, 71, 219–231.
Wood, R. E., Goodman, J. S., Beckmann, N., & Cook, A. (2008). Mediation testing in
management research: A review and proposals. Organizational Research Methods,
11, 270–295.
CHAPTER 4
HETEROSCEDASTICITY IN
ORGANIZATIONAL RESEARCH
Amber N. Schroeder, Patrick J. Rosopa,
Julia H. Whitaker, Ian N. Fairbanks, and Phoebe Xoxakos
Variance plays an important role in theory and research in human resource man-
agement and related fields. Variance refers to the dispersion of scores or residuals
around a mean or, more generally, a predicted value (Salkind, 2007, 2010). In the
general linear model, the mean square error provides an estimate of the population error variance, that is, of the dispersion of the population errors (Fox, 2016). In general, a smaller mean square error indicates less variability in the errors. In general linear
models, it is assumed that the population error variance is constant across cases
(i.e., observations in a sample). This assumption is known as homoscedasticity, or
homogeneity of variance (Fox, 2016; King, Rosopa, & Minium, 2018; Rencher,
2000). When the homoscedasticity assumption is violated, it is referred to as het-
eroscedasticity, or heterogeneity of variance (Fox, 2016; Rosopa, Schaffer, &
Schroeder, 2013). When heteroscedasticity is present in the general linear model,
this results in incorrect standard errors, which can lead to biased Type I error rates
and reduced statistical power (Box, 1954; DeShon & Alexander, 1996; White,
1980; Wilcox, 1997). This can threaten the statistical conclusion validity of a
study (Shadish, Cook, & Campbell, 2002). Notably, heteroscedasticity has been
found in a variety of organizational and psychological research contexts (Agui-
nis & Pierce, 1998; Antonakis & Dietz, 2011; Ostroff & Fulmer, 2014), thereby
prompting research regarding best practices for detecting changes in residual vari-
ance and mitigating its negative effects (Rosopa et al., 2013).
In the present paper, we discuss how change in residual variance (i.e., heterosce-
dasticity) can be more than a violated statistical assumption. In some instances,
heteroscedasticity can be of substantive theoretical importance. For instance, Bryk
and Raudenbush (1988) proposed that heteroscedasticity may be an indicator of
unmeasured individual difference moderators in studies where treatment effects are
measured. Thus, the focus of this paper is twofold: First, we highlight five areas
germane to human resource management and related fields in which changes in
variance provide a theoretical and/or empirical contribution to research and prac-
tice. Namely, we describe how the examination of heteroscedasticity can contribute
to the understanding of organizational phenomena across five research topics: (a)
stress interventions, (b) aging and individual differences, (c) skill acquisition and
training, (d) groups and teams, and (e) organizational climate.
Second, we describe several data analytic approaches that can be used to detect
heteroscedasticity. These approaches, however, are discussed in the context of
various statistical analyses that are commonly used in human resource manage-
ment and related fields. We consider (a) testing for the equality of two indepen-
dent means, (b) analysis of variance, (c) analysis of covariance, and (d) multiple
linear regression.
SUBSTANTIVE HETEROSCEDASTICITY
IN ORGANIZATIONAL RESEARCH
Even though error variance equality is an assumption of the general linear model,
in some instances, heteroscedasticity may be more than a violated assumption;
rather, it could be theoretically important. In the following sections, we provide
examples of substantively meaningful heteroscedasticity in organizational re-
search.
Stress Intervention
Stress management is a topic of interest in several psychological specialties,
including organizational and occupational health psychology. For organizations,
stress can result in decreased job performance (Gilboa, Shirom, Fried, & Cooper,
2008), increased absenteeism (Darr & Johns, 2008), turnover (Podsakoff, LePine,
& LePine, 2007), and adverse physical and mental health outcomes (Schaufeli
& Enzmann, 1998; Zhang, Zhang, Ng, & Lam, 2019). Thus, stress management
interventions are often implemented by organizations with the objective of re-
ducing stressors in the workplace (Jackson, 1983), teaching employees to better
manage stressors, or reducing the negative outcomes associated with stressors
(Ivancevich, Matteson, Freedman, & Phillips, 1990). Although several different
stress interventions exist (e.g., cognitive-behavioral approaches, relaxation ap-
proaches, multimodal approaches; Richardson & Rothstein, 2008; van der Klink,
Blonk, Schene, & van Dijk, 2001), stress interventions have one common goal: to
reduce stress and its negative consequences.
Stress intervention research often examines the reduction in strain or nega-
tive health outcomes of those in a treatment group compared to those in a con-
trol group (Richardson & Rothstein, 2008; van der Klink et al., 2001). However,
successful stress interventions may also result in less variability in stress-related
outcomes for those in the treatment group compared to those in the control group,
as has been demonstrated (although not explicitly predicted) in several studies
(e.g., Bond & Bunce, 2001; Galantino, Baime, Maguire, Szapary, & Farrar, 2005;
Jackson, 1983; Yung, Fung, Chan, & Lau, 2004). Thus, individual-level stress in-
terventions (DeFrank & Cooper, 1987; Giga, Noblet, Faragher, & Cooper, 2003)
may result in a reduction in the variability of reported strain (e.g., by reducing
individual differences in perceiving stressors, coping with stress, or recovering
from strain; LaMontagne, Keegel, Louie, Ostry, & Landsbergis, 2007), thereby
contributing to heterogeneity of variance when comparing those who underwent
the intervention to those who did not. This is consistent with the finding that
treatments can interact with individual difference variables to contribute to dif-
ferences in variability in outcomes (see e.g., Bryk & Raudenbush, 1988). Thus,
heteroscedasticity could be the natural byproduct of an effective stress interven-
tion, which provides an illustration of a circumstance in which heteroscedasticity
may be expected when testing for the equality of two independent means. Figure 4.1 provides an example of two independent groups where the means differ between the groups.
FIGURE 4.1. Plot of means for two independent groups (n = 100 in each group) with 95% confidence intervals, suggesting that the variability in the Intervention group is much smaller than the variability in the Control group.
FIGURE 4.2. Simple linear regression predicting memory with age, suggesting that residual variance increases as age increases.
Perceptions of felt age can diverge from chronological age, particularly among older adults, such that older adults often report feeling younger than their chronological age (Barak, 2009). Thus, it is possible that adults with the same chronological age may have varying perceptions of felt age, which could impact their strategies for seeking and maintaining social relationships. For chronologically older adults, those with a lower felt age may react similarly to younger adults (i.e., by engaging in social relationships for instrumental purposes), whereas those with a higher felt age may respond more in line with SST (i.e., by focusing on emotional connectivity in interpersonal relationships). As such,
there would be greater heteroscedasticity in motives for social interactions for
older adults compared to younger adults due to greater variability in perceptions
of time remaining in life (Carstensen, Isaacowitz, & Charles, 1999). Therefore,
an examination of variance dispersion as a function of both chronological and felt
age may provide an important theoretical contribution.
Turning to groups and teams, Horwitz and Horwitz (2007) conducted a meta-analysis to examine how various types of team diversity impact team outcomes.
Their findings indicated that task-related diversity (i.e., variability in attributes
relevant to task completion, such as expertise) was positively related to team per-
formance quality and quantity, whereas demographic diversity (i.e., dispersion
in observable individual category memberships, such as in age and race/ethnic-
ity subgroups) was unrelated to team performance. Notably, however, later work
suggested that demographic diversity may in some cases be negatively related
to group performance when subjective (but not objective) performance metrics
are employed (van Dijk, van Engen, & van Knippenberg, 2012). Further, tempo-
ral examinations of team diversity suggested that demographic diversity within
teams may become advantageous over time due to team members’ shifting focus
from surface-level attributes (i.e., demographics) to more task-relevant individual
characteristics (Harrison, Price, Gavin, & Florey, 2002).
Additionally, in an examination of the impact of group cohesion on decision-
making quality as a function of groupthink (i.e., “a mode of thinking that people
engage in when they are deeply involved in a cohesive ingroup, when the mem-
bers’ striving for unanimity override their motivation to realistically appraise alter-
native courses of action”; Janis, 1972, p. 9), Mullen, Anthony, Salas, and Driskell
(1994) demonstrated that team decision-making quality was positively related to
group homogeneity in task commitment and inversely related to interpersonal
attraction-related cohesion. Taken together, research on organizational groups and
teams has benefited from an examination of the impact of heteroscedasticity in
team composition. Thus, we encourage future work to continue to explore how
heterogeneity of variance contributes to our understanding of phenomena related
to organizational groups and teams, including the consideration of new perspec-
tives, such as the real-time impact of diversity changes on team functioning (see
e.g., dynamic team diversity theory; Li, Meyer, Shemla, & Wegge, 2018).
Organizational Climate
Heteroscedasticity is also a factor of interest in organizational climate re-
search. Organizational climate has been defined as experience-based perceptions
of organizational environments based on attributes such as policies, procedures,
and observed behaviors (Ostroff, Kinicki, & Muhammad, 2013; Schneider, 2000).
Although early climate research approached organizational climate broadly (i.e.,
a molar approach), later work examined climate through a more focused lens
(see Schneider, Ehrhart, & Macey, 2013), emphasizing that different climate types
can exist within an organization (e.g., customer service, safety, and innovation
climates). Organizational climate has been a topic of considerable interest to or-
ganizational researchers, as various climate types have been linked to a number
of work outcomes. For example, innovative organizational climate has been posi-
tively linked to creative performance (Hsu & Fan, 2010), perceived innovation
(Lin & Liu, 2012), and organizational performance (Shanker, Bhanugopan, van
der Heijden, & Farrell, 2017). Likewise, organizations with a more positive cus-
tomer service climate tend to have higher customer satisfaction and greater profits
(Schneider, Macey, Lee, & Young, 2009), and meta-analytic data demonstrated a
positive relation between safety climate and safety compliance (Christian, Brad-
ley, Wallace, & Burke, 2009).
Within organizational climate research, there has been a focus on understand-
ing how variability in perceptions of climate both across individuals and units
within organizations can influence associated organizational outcomes (Zohar,
2010). One way in which consensus in climate perceptions within an organization
has been examined is by assessing climate strength, which Schneider, Salvaggio,
and Subirats (2002) summarize quite succinctly as “within-group variability in
climate perceptions [such that] the less within-group variability, the stronger the
climate” (p. 220). Climate strength is an example of a dispersion model (see Chan,
1998), in which the model measures the extent to which perceptions of a con-
struct vary, and within-group variability is treated as a focal construct (Dawson,
González-Romá, Davis, & West, 2008). Climate strength has been described as
a moderator of relations between organizational climate and organizational out-
comes, such that the effect of a particular climate (e.g., safety climate) is stronger
when climate strength is high (Schneider et al., 2002, 2009; Shin, 2012). Yet other
work suggested that climate strength may be curvilinearly related to organiza-
tional performance in some contexts, such that performance peaks at moderate
levels of climate strength (Dawson et al., 2008).
In sum, organizational climate research has benefited from the consideration
of heteroscedasticity as a meaningful attribute. Thus, we encourage researchers to move beyond viewing systematic differences in variance in organizational data simply as a violated statistical assumption to be corrected and, instead, to consider whether heteroscedasticity may contribute meaningfully to underlying theory and empirical models.
Summary
The above sections reviewed various substantive research areas where the
change in variance may be of theoretical or practical importance. For example,
although a stress intervention may result in lower strain for those in a treatment
group compared to those in a control group, a smaller variance for those in the
treatment group compared to those in the control group could also be meaningful
(see Figure 4.1). Because researchers may not typically test for changes in vari-
ance, we review extant data analytic procedures in the following section.
DATA ANALYTIC PROCEDURES FOR DETECTING HETEROSCEDASTICITY
In this section, we describe data analytic procedures that can be used in (a) tests of the equality of two independent means, (b)
analysis of variance, (c) analysis of covariance, and (d) multiple linear regression.
It deserves noting that the analyses in the sections below are all special cases of the general linear model. That is, for each of n observations, a quantitative dependent variable (y) can be modeled using a set of p predictor variables (x1, x2, …, xp) plus some unknown population error term. In matrix form, tests on two independent means, analysis of variance, analysis of covariance, linear regression, moderated multiple regression, and polynomial regression are all subsumed by the general linear model:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (1)$$
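To make the connection to Equation 1 concrete, the sketch below (ours, not part of the chapter; all names and simulated values are illustrative, and NumPy is assumed available) builds the n × 3 model matrix X for the two-group, one-covariate setup discussed under analysis of covariance later in this section, and recovers β by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_group = 50
n = 2 * n_per_group

# Dummy-coded group membership (0 = control, 1 = treatment) and a continuous covariate.
group = np.repeat([0, 1], n_per_group)
covariate = rng.normal(loc=35, scale=8, size=n)        # e.g., employee age
covariate_c = covariate - covariate.mean()             # centered at the grand mean

# Model matrix for Equation 1: intercept, group dummy, centered covariate (n x 3).
X = np.column_stack([np.ones(n), group, covariate_c])

# Simulate a dependent variable that follows y = X*beta + error and recover beta via OLS.
beta_true = np.array([50.0, 5.0, 0.3])
y = X @ beta_true + rng.normal(scale=4.0, size=n)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # should be close to beta_true
```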
Tests on Two Independent Means
For tests on the equality of two independent means, a researcher can begin by assessing whether the two population variances differ. Bartlett's (1937) test can be used for this purpose; its test statistic is approximately distributed as χ2 with degrees of freedom equal to the number of groups minus 1. However, Box (1954) noted that this test can be sensitive to departures from normality.
In instances where the normality assumption is violated, Brown and Forsythe’s
(1974) procedure is recommended. This approach is a modified version of
Levene’s (1960) test. Specifically, a two-sample t-test can be conducted on the
absolute value of the residuals. However, instead of calculating the absolute value
of the residuals from the mean, the absolute value of the residuals is calculated
using the median for each group. For a review, Bartlett (1937) and Brown and
Forsythe’s (1974) procedures are discussed in Rosopa et al. (2013) and Rosopa,
Schroeder, and Doll (2016).
Thus, although a researcher may be interested primarily in testing whether the mean for one group differs significantly from the mean of another group, if the researcher also suspects that the variances differ as a function of group membership (see e.g., Figure 4.1), either Bartlett's (1937) test or Brown and Forsythe's (1974) test can be used. If the test is statistically significant at some fixed Type I error rate (α), the researcher can conclude that the population variances differ from one another.
It deserves noting that if a researcher finds evidence that the variances are
not the same between the two groups (i.e., heteroscedasticity exists), the conven-
tional Student’s t statistic should not be used to test for mean differences. Instead,
Welch’s t statistic should be used; this procedure allows for variances to be esti-
mated separately for each group, and, with Satterthwaite’s corrected degrees of
freedom, provides a more robust test for mean differences between two indepen-
dent groups regardless of whether the homoscedasticity assumption is violated
(Zimmerman, 2004).
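As a minimal sketch of this workflow (ours, not from the chapter; the data are simulated and SciPy is assumed available), the two group variances can be compared with Bartlett's (1937) test or the Brown and Forsythe (1974) procedure, and the means can then be compared with Welch's t statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated strain scores: the intervention group has a lower mean AND less spread,
# mimicking the pattern suggested by Figure 4.1.
control = rng.normal(loc=3.5, scale=1.2, size=100)
intervention = rng.normal(loc=2.8, scale=0.5, size=100)

# Bartlett's (1937) test (sensitive to non-normality).
bart_stat, bart_p = stats.bartlett(control, intervention)

# Brown & Forsythe's (1974) procedure: Levene's test using deviations from group medians.
bf_stat, bf_p = stats.levene(control, intervention, center="median")

# If the variances appear unequal, use Welch's t (equal_var=False) rather than Student's t.
welch_t, welch_p = stats.ttest_ind(control, intervention, equal_var=False)

print(f"Bartlett: chi2 = {bart_stat:.2f}, p = {bart_p:.4f}")
print(f"Brown-Forsythe: F = {bf_stat:.2f}, p = {bf_p:.4f}")
print(f"Welch's t: t = {welch_t:.2f}, p = {welch_p:.4f}")
```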
Analysis of Variance
In a one-way analysis of variance, the population means on the dependent vari-
able are believed to be different (in some way) across two or more independent
groups. Assuming that the population error term in Equation 1 is normally dis-
tributed, the test statistic is distributed as an F random variable (Rencher, 2000).
However, in addition to tests on two or more means, a researcher may be interest-
ed in testing whether variance changes systematically across two or more groups.
For example, with three groups, the variance may be large for the control group,
but small for treatment A and treatment B. With three independent groups, be-
cause there are two dummy-variables for group membership, p = 2 and X is n × 3.
In the case of a one-way analysis of variance, Bartlett's (1937) test and Brown and Forsythe's (1974) test are also suggested. However, Brown and Forsythe's (1974) test becomes, more generally, an analysis of variance on the absolute value of the residuals around the respective group medians. Thus, if the χ2 test or the F test, respectively, is statistically significant at α, this suggests that the variances differ among the groups. Note that with three independent groups there are three pairwise comparisons that can be conducted. However, there are only two linearly independent comparisons; if additional pairwise tests are conducted to isolate which groups have different variances, a Bonferroni correction is recommended to control the familywise Type I error rate.
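The equivalence noted above is easy to see in a short sketch (ours; simulated data, SciPy assumed): running a one-way ANOVA on the absolute deviations from the group medians gives the same F statistic as SciPy's median-centered Levene test, which corresponds to the Brown–Forsythe procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.normal(0, 3.0, size=60)      # large variance
treatment_a = rng.normal(0, 1.0, size=60)  # smaller variance
treatment_b = rng.normal(0, 1.0, size=60)
groups = [control, treatment_a, treatment_b]

# "By hand": one-way ANOVA on |score - group median|.
abs_dev = [np.abs(g - np.median(g)) for g in groups]
f_by_hand, p_by_hand = stats.f_oneway(*abs_dev)

# Built-in equivalent: Levene's test with deviations taken from the medians.
f_bf, p_bf = stats.levene(*groups, center="median")

# The two F statistics should coincide.
print(f"ANOVA on |x - median|: F = {f_by_hand:.3f}, p = {p_by_hand:.4f}")
print(f"levene(center='median'): F = {f_bf:.3f}, p = {p_bf:.4f}")
```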
Analysis of Covariance
In analysis of covariance, a researcher is typically interested in examining
whether population differences on a dependent variable exist across multiple
groups. However, a researcher may have one or more continuous predictors that
they want to control statistically. Often, these continuous predictors (i.e., covari-
ates) are demographic variables (e.g., employee’s age), individual differences
(e.g., spatial ability), or a pretest variable. Assuming the simplest analysis of cova-
riance where a researcher has two independent groups and one covariate, because
there is one dummy-variable representing group membership and one covariate
(typically, centered), p = 2 and the model matrix (X) is n × 3. Here, the continu-
ous predictor is centered because in analysis of covariance researchers often are
interested in the adjusted means on the dependent variable where the adjustment
is at the grand mean of the continuous predictor (i.e., covariate) (Fox, 2016).
In analysis of covariance, residual variance can change as a function of not
only the categorical predictor, but also the continuous predictor (i.e., covariate).
For instances where a researcher suspects that the residual variance is changing
as a function of a categorical predictor, the procedures discussed above can be
used. Specifically, the OLS-based residuals from the analysis of covariance can
be saved. Then, either Bartlett's (1937) test or Brown and Forsythe's (1974) test
can be used to determine whether the residual variance changes as a function of
the categorical predictor. As noted above, with three or more groups, if additional
tests are conducted to isolate which of the groups had significantly different vari-
ances, a Bonferroni correction procedure is recommended.
In analysis of covariance, the residual variance could change as a function of
the continuous predictor. Here, a general approach is suggested, known as a score
test (Breusch & Pagan, 1979; Cook & Weisberg, 1983). This is discussed in the
next section on multiple linear regression.
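Although the chapter defers the details to the regression context, a brief sketch (ours; simulated data, with statsmodels assumed available) shows how the Breusch–Pagan/Cook–Weisberg score test can be applied to the residuals of an ANCOVA-style model to ask whether residual variance changes with the covariate:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
n = 200
group = rng.integers(0, 2, size=n)        # dummy-coded group membership
age = rng.uniform(20, 65, size=n)
age_c = age - age.mean()                  # centered covariate

# Error variance grows with the covariate (heteroscedasticity tied to age).
errors = rng.normal(scale=0.5 + 0.05 * age)
y = 10 + 2 * group + 0.2 * age_c + errors

X = sm.add_constant(np.column_stack([group, age_c]))
fit = sm.OLS(y, X).fit()

# Score test: relate the squared residuals to the covariate suspected to drive the variance.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, sm.add_constant(age_c))
print(f"Breusch-Pagan LM = {lm_stat:.2f}, p = {lm_pvalue:.4f}")
```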
Summary
In this section, we reviewed statistical procedures commonly used in human
resource management, organizational psychology, and related disciplines. In ad-
dition, we discussed some data-analytic procedures that can be used to detect
changes in residual variance. It deserves noting that if a researcher finds evidence
to support their theory that variance changes as expected, this suggests that the
homoscedasticity assumption in general linear models is violated. Thus, although a researcher may have found evidence that residual variance changes as a continuous predictor increases (see e.g., Figure 4.2), OLS estimation in linear models is no longer optimal; the regression coefficient estimates remain unbiased but are inefficient, and the conventional standard errors are incorrect (Rencher, 2000). Consequently, in the presence of heteroscedasticity, statistical inferences (e.g., hypothesis tests, confidence intervals) based on those standard errors can be misleading.
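One widely used response consistent with the sources cited in this chapter (White, 1980) is to retain the OLS point estimates but base inference on heteroscedasticity-consistent standard errors. A minimal statsmodels sketch (ours; simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 300
x = rng.uniform(0, 10, size=n)
y = 1 + 0.5 * x + rng.normal(scale=0.3 + 0.3 * x)   # residual spread grows with x

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                 # conventional (homoscedastic) standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")  # White-type robust standard errors

print(ols_fit.bse)     # conventional standard errors
print(robust_fit.bse)  # heteroscedasticity-consistent standard errors
```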
CONCLUSION
A major objective of this paper is to describe how heteroscedasticity can be more
than just a statistical violation. Rather, differences in residual variance could be a
necessary and implicit aspect of a theory or empirical study. We included exam-
ples from five organizational research domains in which heteroscedasticity may
provide a substantive contribution, thus highlighting that although changes in re-
sidual variance are often viewed to be statistically problematic, heteroscedasticity
can also contribute meaningfully to our understanding of various organizational
phenomena. Nevertheless, there are likely other topical areas germane to orga-
nizational contexts in which heteroscedasticity may occur (see e.g., Aguinis &
Pierce, 1998; Bell & Fusco, 1989; Dalal et al., 2015; Grissom, 2000). Thus, we
hope that this paper stimulates research that considers the impact of heteroscedas-
ticity, as heterogeneity of variance can serve as an important explanatory mecha-
nism that can provide insight into a variety of organizational phenomena. We
encourage researchers to consider whether there is a theoretical basis for a priori
expectations of heteroscedasticity in their data, as well as to consider whether un-
anticipated heterogeneity of variance may have substantive meaning. Stated dif-
ferently, although homogeneity of variance is a statistical assumption of the gen-
eral linear model, we suggest that researchers carefully consider whether changes
in residual variance can be attributed to other constructs in a nomological network
(Cronbach & Meehl, 1955). Overall, this can enrich both theory and practice in
human resource management and allied fields.
REFERENCES
Ackerman, P. L. (1987). Individual differences in skill learning: An integration of psycho-
metric and information processing perspectives. Psychological Bulletin, 102, 3–27.
doi:10.1037//0033-2909.102.1.3
Ackerman, P. L. (2007). New developments in understanding skilled performance.
Current Directions in Psychological Science, 16, 235–239. doi:10.1111/j.1467-
8721.2007.00511.x
Ackerman, P. L., & Cianciolo, A. T. (2000). Cognitive, perceptual-speed, and psychomotor
determinants of individual differences during skill acquisition. Journal of Experi-
mental Psychology: Applied, 6, 259–290. doi:10.1037//1076-898X.6.4.259
Aguinis, H., & Pierce, C. A. (1998). Heterogeneity of error variance and the assessment
of moderating effects of categorical variables: A conceptual review. Organizational
Research Methods, 1, 296–314. doi:10.1177/109442819813002
Antonakis, J., & Dietz, J. (2011). Looking for validity or testing it? The perils of stepwise
regression, extreme-scores analysis, heteroscedasticity, and measurement error. Per-
sonality and Individual Differences, 50, 409–415. doi:10.1016/j.paid.2010.09.014
Backman, L., Small, B. J., & Wahlin, A. (2001). Aging and memory: Cognitive and bio-
logical perspectives. In Birren, J. E., & Schaie, W. K. (Eds.), Handbook of the psy-
chology of aging (pp. 349–366). San Diego, CA: Academic Press.
Baltes, P. B., & Baltes, M. M. (1990). Psychological perspectives on successful aging: The
model of selective optimization with compensation. In P. B. Baltes, & M. M. Baltes
(Eds.), Successful aging: Perspectives from the behavioral sciences (pp. 1–34). New
York, NY: Cambridge University Press.
Barak, B. (2009). Age identity: A cross-cultural global approach. International Journal of
Behavioral Development, 33, 2–11. doi:10.1177/0165025408099485
Barrick, M. R., Stewart, G. L., Neubert, M. J., & Mount, M. K. (1998). Relating member
ability and personality to work-team processes and team effectiveness. Journal of
Applied Psychology, 83, 377–391. doi:10.1037/0021-9010.83.3.377
Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the
Royal Society, A160, 268–282. doi:10.1098/rspa.1937.0109
Bell, S. (2007). Deep-level composition variables as predictors of team performance: A
meta-analysis. Journal of Applied Psychology, 92, 595–615. doi:10.1037/0021-
9010.92.3.595
Bell, P. A., & Fusco, M. E. (1989). Heat and violence in the Dallas field data: Linear-
ity, curvilinearity, and heteroscedasticity. Journal of Applied Social Psychology, 19,
1479–1482. doi:10.1111/j.1559-1816.1989.tb01459.x
Bond, F. W., & Bunce, D. (2001). Job control mediates change in a work reorganization
intervention for stress reduction. Journal of Occupational Health Psychology, 6,
290–302. doi:10.1037//1076-8998.6.4.290
Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of
variance problems, I. Effect of inequality of variance in the one-way classification.
Annals of Mathematical Statistics, 25, 290–302. doi:10.1214/aoms/1177728786
Boyle, P. A., Yu, L., Wilson, R. S., Gamble, K., Buchman, A. S., & Bennett, D. A. (2012).
Poor decision making is a consequence of cognitive decline among older persons
without Alzheimer’s disease or mild cognitive impairment. PLOS One, 7, 1–5.
doi:10.1371/journal.pone.0043647
Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroscedasticity and random
coefficient variation. Econometrica, 47, 1287–1294. doi:10.2307/1911963
Brown, M. B., & Forsythe, A. B. (1974). Robust test for the equality of variances. Journal
of the American Statistical Association, 69, 364–367. doi:10.2307/2285659
Bryk, A. S., & Raudenbush, S. W. (1988). Heterogeneity of variance in experimental stud-
ies: A challenge to conventional interpretations. Psychological Bulletin, 104, 396–
404. doi:10.1037//0033-2909.104.3.396
Byron, K., Peterson, S. J., Zhang, Z., & LePine, J. A. (2018). Realizing challenges and
guarding against threats: Interactive effects of regulatory focus and stress on perfor-
mance. Journal of Management, 44, 3011–3037. doi:10.1177/0149206316658349
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of perfor-
mance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations
(pp. 35–70). San Francisco, CA: Jossey-Bass.
DeShon, R. P., & Alexander, R. A. (1996). Alternative procedures for testing regression
slope homogeneity when group error variances are unequal. Psychological Meth-
ods, 1, 261–277. doi:10.1037/1082-989X.1.3.261
Fox, J. (2016). Applied regression analysis and generalized linear models (3rd ed.). Thou-
sand Oaks, CA: Sage.
Froehlich, D. E., Beausaert, S., & Segers, M. (2016). Aging and the motivation to stay
employable. Journal of Managerial Psychology, 31, 756–770. doi:10.1108/JMP-
08-2014-0224
Galantino, M. L., Baime, M., Maguire, M., Szapary, P. O., & Farrar, J. T. (2005). Associa-
tion of psychological and physiological measures of stress in health-care profes-
sionals during an 8-week mindfulness meditation program: Mindfulness in practice.
Stress and Health, 21, 255–261. doi:10.1002/smi.1062
Giga, S. I., Noblet, A. J., Faragher, B., & Cooper, C. L. (2003). The UK perspective: A
review of research on organisational stress management interventions. Australian
Psychologist, 38, 158–164. doi:10.1080/00050060310001707167
Gilboa, S., Shirom, A., Fried, Y., & Cooper, C. (2008). A meta-analysis of work-demand
stressors and job performance: Examining main and moderating effects. Personnel
Psychology, 61, 227–271. doi:10.1111/j.1744-6570.2008.00113.x
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting
and Clinical Psychology, 68, 155–165. doi: 10.1037/0022-006X.68.1.155
Harrison, D. A., Price, K. H., Gavin, J. H., & Florey, A. T. (2002). Time, teams, and task
performance: Changing effects of surface- and deep-level diversity on group func-
tioning. Academy of Management Journal, 45, 1029–1045. doi:10.2307/3069328
Hartley, H. O. (1950). The maximum F-ratio as a short-cut test for heterogeneity of vari-
ance. Biometrika, 37(3/4), 308–312.
Horwitz, S. K. & Horwitz, I. B. (2007). The effects of team diversity on team outcomes: A
meta-analytic review of team demography. Journal of Management, 33, 987–1015.
doi:10.2307/3069328
Hsu, M. L. A., & Fan, H. (2010). Organizational innovation climate and creative outcomes:
Exploring the moderating effect of time pressure. Creativity Research Journal, 22,
378–386. doi:10.1080/10400419.2010.523400
Ivancevich, J. M., Matteson, M. T., Freedman, S. M., & Phillips, J. S. (1990). Work-
site stress management interventions. American Psychologist, 45, 252–261.
doi:10.1037//0003-066X.45.2.252
Jackson, S. E. (1983). Participation in decision making as a strategy for reducing job-relat-
ed strain. Journal of Applied Psychology, 68, 3–19. doi:10.1037//0021-9010.68.1.3
Janis, I. L. (1972). Victims of groupthink. Boston, MA: Houghton-Mifflin.
Kanfer, R., & Ackerman, P. L. (1989). Motivation and cognitive abilities: An integrative/
aptitude-treatment interaction approach to skill acquisition. Journal of Applied Psy-
chology, 74, 657–690. doi:10.1037//0021-9010.74.4.657
King, B. M., Rosopa, P. J., & Minium, E. W. (2018). Statistical reasoning in the behavioral
sciences (7th ed.). Hoboken, NJ: Wiley.
Kotter-Grühn, D., Kornadt, A. E., & Stephan, Y. (2016). Looking beyond chronological
age: Current knowledge and future directions in the study of subjective age. Geron-
tology, 62, 86–93. doi:10.1159/000438671
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical
models (5th ed.). New York, NY: McGraw-Hill.
LaMontagne, A. D., Keegel, T., Louie, A. M., Ostry, A., & Landsbergis, P. A. (2007). A
systematic review of the job-stress intervention evaluation literature. International
Journal of Occupational and Environmental Health, 13, 268–280. doi:10.1179/
oeh.2007.13.3.268
Lazarus, R. S., & Folkman, S. (1984). Stress, appraisal, and coping. New York, NY:
Springer.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin, S. G. Ghurye, W.
Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to probability and
statistics (pp. 278–292). Stanford, CA: Stanford University Press.
Li, J., Meyer, B., Shemla, M., & Wegge, J. (2018). From being diverse to becoming di-
verse: A dynamic team diversity theory. Journal of Organizational Behavior, 39,
956–970. doi:10.1002/job.2272
Lin, Y. Y., & Liu, F. (2012). A cross‐level analysis of organizational creativity climate and
perceived innovation: The mediating effect of work motivation. European Journal
of Innovation Management, 15, 55–76. doi:10.1108/14601061211192834
Moen, P., Kojola, E., & Schaefers, K. (2017). Organizational change around an older work-
force. The Gerontologist, 57, 847–856. doi:10.1093/geront/gnw048
Morse, C. K. (1993). Does variability increase with age? An archival study of cognitive
measures. Psychology and Aging, 8, 156–164. doi:10.1037/0882-7974.8.2.156
Mullen, B., Anthony, T., Salas, E., & Driskell, J. E. (1994). Group cohesiveness and qual-
ity of decision making: An integration of tests of the groupthink hypothesis. Small
Group Research, 25, 189–204. doi:10.1177/1046496494252003
Neuman, G. A., Wagner, S. H., & Christiansen, N. D. (1999). The relationship between
work-team personality composition and the job performance of teams. Group &
Organization Management, 24, 28–45. doi:10.1177/1059601199241003
Ng, T. W., & Feldman, D. C. (2008). The relationship of age to ten dimensions of job
performance. Journal of Applied Psychology, 93, 392–423. doi:10.1037/0021-
9010.93.2.392
Ng, M., & Wilcox, R. R. (2009). Level robust methods based on the least squares regres-
sion estimator. Journal of Modern Applied Statistical Methods, 8, 384–395.
Ng, M., & Wilcox, R. R. (2011). A comparison of two-stage procedures for testing least-
squares coefficients under heteroscedasticity. British Journal of Mathematical and
Statistical Psychology, 64, 244–258. doi:10.1348/000711010X508683
O’Brien, R. G. (1979). A general ANOVA method for robust tests of additive mod-
els for variances. Journal of the American Statistical Association, 74, 877–880.
doi:10.2307/2286416
O’Brien, R. G. (1981). A simple test for variance effects in experimental designs. Psycho-
logical Bulletin, 89, 570–574. doi:10.1037//0033-2909.89.3.570
Ostroff, C., & Fulmer, C. A. (2014). Variance as a construct: Understanding variability
beyond the mean. In J. K. Ford, J. R. Hollenbeck, & A. M. Ryan (Eds.), The nature
of work: Advances in psychological theory, methods, and practice (pp. 185–210).
Washington, DC: APA. doi:10.1037/14259-010
Ostroff, C., Kinicki, A. J., & Muhammad, R. S. (2013). Organizational culture and climate.
In N. W. Schmitt, S. Highhouse, & I. B. Weiner (Eds.), Handbook of psychology:
Industrial and organizational psychology (pp. 643–676). Hoboken, NJ: Wiley.
Panatik, S. A., O’Driscoll, M. P., & Anderson, M. H. (2011). Job demands and work-re-
lated psychological responses among Malaysian technical workers: The moderating
Shanker, R., Bhanugopan, R., van der Heijden, Beatrice I. J. M., & Farrell, M. (2017).
Organizational climate for innovation and organizational performance: The mediat-
ing effect of innovative work behavior. Journal of Vocational Behavior, 100, 67–77.
doi:10.1016/j.jvb.2017.02.004
Shin, Y. (2012). CEO ethical leadership, ethical climate, climate strength, and collective
organizational citizenship behavior. Journal of Business Ethics, 108(3), 299–312.
doi:10.1007/s10551-011-1091-7
Spirduso, W. W., Francis, K. L., & MacRae, P. G. (2005). Physical dimensions of aging (2nd
ed.). Champaign, IL: Human Kinetics.
Taylor, M. A., & Bisson, J. B. (2019). Changes in cognitive functioning: Practical and
theoretical considerations for training the aging workforce. Human Resource Man-
agement Review. Advance online publication. doi:10.1016/j.hrmr.2019.02.001
van der Klink, J. J. L., Blonk, R. W. B., Schene, A. H., & van Dijk, F. J. H. (2001). The
benefits of interventions for work-related stress. American Journal of Public Health,
91, 270–276. doi:10.2105/AJPH.91.2.270
van Dijk, H., van Engen, M. L., & van Knippenberg, D. (2012). Defying conventional wis-
dom: A meta-analytical examination of the differences between demographic and
job-related diversity relationships with performance. Organizational Behavior and
Human Decision Processes, 119, 38–53. doi:10.1016/j.obhdp.2012.06.003
Webster, J. R., Beehr, T. A., & Love, K. (2011). Extending the challenge-hindrance model
of occupational stress: The role of appraisal. Journal of Vocational Behavior, 79,
505–516. doi:10.1016/j.jvb.2011.02.001
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct
test for heteroskedasticity. Econometrica, 48, 817– 838. doi:10.2307/1912934
Wilcox, R. R. (1997). Comparing the slopes of two independent regression lines when
there is complete heteroscedasticity. British Journal of Mathematical and Statistical
Psychology, 50, 309–317. doi:10.1111/j.2044- 8317.1997.tb01147.x
Yung, P. M. B., Fung, M. Y., Chan, T. M. F., & Lau, B. W. K. (2004). Relaxation training
methods for nurse managers in Hong Kong: A controlled study. International Journal
of Mental Health Nursing, 13, 255–261. doi:10.1111/j.1445-8330.2004.00342.x
Zhang, Y., Zhang, Y., Ng, T. W. H., & Lam, S. S. K. (2019). Promotion- and prevention-
focused coping: A meta-analytic examination of regulatory strategies in the work
stress process. Journal of Applied Psychology, 104(10), 1296–1323. doi:10.1037/
apl0000404
Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. Brit-
ish Journal of Mathematical and Statistical Psychology, 57(1), 173–181.
doi:10.1348/000711004849222
Zohar, D. (2010). Thirty years of safety climate research: Reflections and future directions.
Accident Analysis and Prevention, 42, 1517–1522. doi:10.1016/j.aap.2009.12.019
CHAPTER 5
KAPPA AND ALPHA AND PI, OH MY
Julie I. Hancock, James M. Vardaman, and David G. Allen
For methods of study aggregation such as content analysis and meta-analysis to produce meaningful conclusions, the data must be reliable and demonstrate construct validity.
These methods of study aggregation typically require the employment of mul-
tiple coders to systematically gather and categorize data into an appropriate cod-
ing scheme. The agreement among coders is a significant issue in these studies, as
disagreement could constitute a threat to the validity of the results of aggregation
studies. Consequently, inter-rater reliability (IRR) is calculated to determine the
degree to which coders consistently agree upon the categorization of variables of
interest (Bliese, 2000; LeBreton, Burgess, Kaiser, Atchley, & James, 2003). The
most basic approach is the simple calculation of the percentage of agreements that
coders have established, whereby the number of total actual agreements is divid-
ed by the total possible number of agreements. The simplicity of calculating per-
centage agreements makes it a commonly used index of IRR in the management
literature. Although this method provides an easily calculable general indication
of the degree to which coders agree, it can be misleading because it fails to take into consideration the impact that chance may have on the reliability of agreement. As a result, deviations from 100% agreement are difficult to interpret and the index may be inflated, jeopardizing the construct validity of the measure; percentage agreement is therefore useful and meaningful only under very specific conditions.
Despite these potential shortcomings, IRR has been traditionally reported as
the simple percentage of agreement among coders in the management literature
(e.g., Barrick & Mount, 1991; Eby et al., 2005; Hancock et al., 2013; Hoch, Bom-
mer, Dulebohn, & Wu, 2018; Judge & Ilies, 2002; Mackey, Frieder, Brees, & Mar-
tinko, 2017). Reliability statistics such as Scott's pi (π; Scott, 1955), Cohen's kappa (κ; Cohen, 1960) (e.g., Heugens & Lander, 2009; Koenig, Eagly, Mitchell, & Ristikari, 2011), and Krippendorff's alpha (α; Krippendorff, 1980) (e.g., Tuggle, Schnatterly, & Johnson,
2010; Tuggle, Sirmon, Reutzel, & Bierman, 2010) have been increasingly identi-
fied as superior indices of IRR in comparison to simple percentage agreement and
are beginning to appear in aggregate studies. However, these indices are also not
without limitations.
Each of the more sophisticated indicators has been derived to combat the shortcomings of its predecessors. Even so, none of π, κ, or α will be appropriate in all circumstances. In particular, each has limitations in a variety of scenarios, including those where: (a) there are multiple coders but different combinations of coders for different cases, (b) there exist any number of categories, scale values, or measures, (c) there are missing data, (d) known prevalence (dichotomous coding) exists, (e) the data are skewed, and (f) samples of any size must be accommodated. A search across several disciplines for another option for calculating inter-rater agreement drew attention to the AC1 statistic for IRR established by Gwet (2001). AC1 "is a more robust chance-corrected statistic that consistently yields reliable results" (Gwet, 2002b, p. 5) as compared to κ, providing scholars with a more accurate measurement in each of those situations.
LITERATURE REVIEW
The degree to which data analysis and synthesis can lead to prescriptions for re-
searchers and practitioners is dependent upon the level of accuracy and reliability
with which coders of the data agree. IRR indices seek to provide some degree of
trust and assurance of data that are coded and categorized by human observers,
thus increasing the degree of confidence researchers have in data driven by hu-
man judgments (Hayes & Krippendorff, 2007) by improving construct validity. In
their review of several IRR indices, Hayes and Krippendorff (2007) identify five
properties that exemplify the nature of a good reliability index. First, agreement among two or more coders/observers working independently to ascribe categorizations to observations ought to be assessed without being influenced by the number of independent coders or by variation in which coders are involved. Thus, the individual coders participating in the codification of data should not influence coding agreement.
Second, the number of categories to be coded should not bias the reliabilities.
Thus, reliability indices should not be influenced in one direction or the other by
the number of categories prescribed by the developer of the coding schemata.
Third, the reliability metric should be represented on a "numerical scale between at least two points with sensible reliability interpretations" (Hayes & Krippendorff, 2007, p. 79). Thus, scales on which 0 indicates a complete absence of agreement are ambiguous as assessments of reliability, because perfect disagreement suggests a violation of the assumption of independence of coders. Fourth, Hayes and Krippendorff (2007) suggest that a good reliability index should "be appropriate to the level of measurement of the data" (p. 79). Thus, it must be suitable for comparisons across various types of data, not limited to one particular type. Finally, the "sampling behavior should be known or at least computable" (p. 79).
Each of the most prevalent IRR indices has pros and cons when compared us-
ing Hayes and Krippendorff’s (2007) criteria. For example, although percentage
agreement is easy to calculate, it skews agreement in an overly positive direc-
tion. Although Krippendorff's α is more complex to compute, it accommodates more complex coding
schemes. The following sections review the utility and shortcomings of each ap-
proach, providing a better understanding of the circumstances under which a par-
ticular IRR index may be most appropriately utilized.
Percentage Agreement. A common IRR index in the management literature
is simple percentage agreement. Percentage agreement assesses IRR by simply
dividing the number of agreements two coders have by the number of potential
matches that exist.
Percent Agreement

$$\%\ \text{Agreement} = \frac{\sum_c O_{cc}}{n} \times 100$$

where $O_{cc}$ represents each agreement coincidence and $n$ represents the total number of coding decisions.
Percentages are typically calculated for each variable in a coding scheme and then averaged, such that both the overall agreement and the agreement for each specific variable are known. In addition to being a straightforward calculation, percent agreement can provide researchers with insights into problematic variables within the data (McHugh, 2012). For example, if percentage agreement for a particular variable is only 40%, this suggests that the variable should be revisited to determine the underlying reason for low agreement. However, although this measure is easily calculable, it fails to satisfy a majority of the five reliability criteria set forth by Hayes and Krippendorff (2007) and can be somewhat misleading.
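For instance, a minimal sketch (ours; the variable names and toy codes are hypothetical, and NumPy is assumed available) of computing percent agreement per coded variable and then averaging across variables:

```python
import numpy as np

# Toy codes from two coders for three dichotomous variables across 10 articles.
coder1 = {
    "retrospective": np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0]),
    "longitudinal":  np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0]),
    "lee_mitchell":  np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1]),
}
coder2 = {
    "retrospective": np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0]),
    "longitudinal":  np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0]),
    "lee_mitchell":  np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1]),
}

# Percent agreement per variable: agreements divided by coding decisions, times 100.
per_variable = {v: 100 * np.mean(coder1[v] == coder2[v]) for v in coder1}
overall = np.mean(list(per_variable.values()))

print(per_variable)              # flags any variable with low agreement for review
print(f"Overall: {overall:.1f}%")
```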
The simplicity of calculating percentage agreement makes it a commonly used index of IRR in the management literature. However, it is meaningful only in specific situations, that is, when there are two well-trained coders, nominal data, relatively few categories, and a low chance that guessing will take place (Scott, 1955). Thus, it is not a sufficient or reliable measure by itself. Percentage agreement does not consider the role that chance might play in ratings, incorrectly assuming that all raters make deliberate, rational decisions in assigning their ratings. Perhaps more alarmingly, because chance is not accounted for in this metric, agreement can seem acceptable even if both coders guessed at their categorizations.
For example, if two coders employ differing strategies for categorizing items, such that one coder categorizes every item as "A" and the other coder often, but not always, categorizes an item as "A," simple percentage agreement would suggest that they are in agreement when they are, in fact, utilizing different strategies for their categorizations or, more disturbingly, simply guessing. Additionally, this calculation is predisposed toward coding schemes with fewer categories, whereby a higher percentage agreement will be achieved by chance when there are fewer categories to code.
Further, percentage agreement is interpreted on a scale from 0% to 100%, with 100% indicating complete agreement and 0% indicating complete disagreement, the latter being unlikely unless coders are violating the condition of independence. Consequently, deviations from 100% agreement (complete agreement in all categories) become less meaningful, as the scale is not meaningfully interpretable. The failure of simple percentage agreement calculations to adequately assess reliability substantially limits the construct validity of the assessments scholars use to synthesize data and draw conclusions, and percentage agreement has been deemed unacceptable for determining IRR for decades (e.g., Krippendorff, 1980; Scott, 1955). Thus, it is advisable that management scholars explore other, more reliable indices for assessing IRR; several other metrics attempt to do so.
Scott’s Pi. In an attempt to overcome the limitations of percent agreement, p
(1955) was developed as a means by which IRR might be calculated above and
beyond simple percentages. Although percentage agreement is based on the num-
ber of matches that coders obtain out of a particular number of potential matches,
takes into consideration the role played by chance agreement. The probability of
chance is based on the cumulative classification of probabilities, not the prob-
abilities of individual rater classification (Gwet, 2002a) and provides a chance-
corrected agreement index for assessing IRR. This metric considers the degree to
which coders agree when they do not engage in guessing. Further, Scott (1955)
proposed that the previous categorizations of items by coders be examined by
calculating the observed number of items each coder has placed into a particular
category. For example, the total number of items placed into the same category
by two coders would be compared to the total number of items to categorize.
The assumption is that if each of the coders were simply categorizing items by
chance, each coder would have the same distribution (Artstein & Poesio, 2008;
Scott, 1955).
Scott’s p
Po − Pe
π=
1 − Pe
where
Occ
Po = ∑
c n
and
92 • JULIE I. HANCOCK, JAMES M. VARDAMAN, & DAVID G. ALLEN
Pe = ∑ pi2
c
where pi represents the proportion “of the sample coded as belonging to the
2
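A minimal sketch of π for two coders assigning nominal codes to the same items (ours; the function name and data are illustrative) follows directly from the formula above:

```python
import numpy as np

def scotts_pi(codes_a, codes_b):
    """Scott's pi for two coders assigning nominal categories to the same items."""
    codes_a, codes_b = np.asarray(codes_a), np.asarray(codes_b)
    p_o = np.mean(codes_a == codes_b)                 # observed agreement
    # Expected agreement: squared proportions from the POOLED distribution of both coders.
    pooled = np.concatenate([codes_a, codes_b])
    _, counts = np.unique(pooled, return_counts=True)
    p_e = np.sum((counts / counts.sum()) ** 2)
    return (p_o - p_e) / (1 - p_e)

coder1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
coder2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(f"Scott's pi = {scotts_pi(coder1, coder2):.3f}")
```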
Cohen’s k
Po − Pe
κ=
1 − Pe
where
Occ
Po = ∑
c n
and
Kappa and Alpha and Pi, Oh My • 93
1
Pe =
n2
∑ pm i
where n represents the number of cases and S pmi represents the sum of the
marginal products (Neuendorf, 2002).
Like π, κ ranges from 0.0 to 1.0; however, because zero is defined as it would be for a correlation, "Kappa, by accepting the two observers' proclivity to use available categories idiosyncratically as baseline, fails to keep κ tied to the data whose reliability is in question. This has the effect of punishing observers for agreeing on the frequency distribution of categories used to describe the given phenomena (Brennan & Prediger, 1981, Zwick, 1988) and allowing systematic disagreements, which are evidence of unreliability, to inflate the value of κ (Krippendorff, 2004a,b)" (Hayes & Krippendorff, 2007, p. 81). Thus, like the measures discussed above, κ also fails to satisfy the five requirements outlined by Hayes and Krippendorff (2007).
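A parallel sketch for κ (ours; illustrative data and function name), with expected agreement built from each coder's own marginal proportions:

```python
import numpy as np

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders; chance agreement uses each coder's own marginals."""
    codes_a, codes_b = np.asarray(codes_a), np.asarray(codes_b)
    categories = np.unique(np.concatenate([codes_a, codes_b]))
    p_o = np.mean(codes_a == codes_b)                              # observed agreement
    p_a = np.array([np.mean(codes_a == c) for c in categories])    # coder A marginals
    p_b = np.array([np.mean(codes_b == c) for c in categories])    # coder B marginals
    p_e = np.sum(p_a * p_b)           # equivalent to the sum of marginal products over n^2
    return (p_o - p_e) / (1 - p_e)

coder1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
coder2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
# scikit-learn's cohen_kappa_score(coder1, coder2) should give the same value.
print(f"Cohen's kappa = {cohens_kappa(coder1, coder2):.3f}")
```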
Krippendorff’s Alpha. Krippendorff’s (1970) was developed in an attempt
to fulfill the remaining voids in reliability calculations left by percentage agree-
ments, p, and κ. This IRR index overcomes the data limitations of the previous
three by allowing for more than two observers and for the computation of agree-
ments among ordinal, interval, and ratio data, as well as nominal data (Hayes &
Krippendorff, 2007). Although earlier measures correct for percent agreement,
instead calculates disagreements. Consequently, it is gaining popularity as a stan-
dard IRR index that addresses the limitations of earlier IRR indices, providing
researchers with a metric that is able to overcome a variety of concerns.
Krippendorff’s a
Do
α = 1−
De
where
1
Do = ∑
n c
∑O
k
ckmetric ∂
2ck
and
1
De = ∑
n(n − 1) c
∑ n xn
k
c k 2
metric∂ ck
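For the nominal, two-coder, no-missing-data case, α can be computed from the coincidence matrix implied by the formulas above. The sketch below is ours and assumes that simplified setting; dedicated packages handle the general case.

```python
import numpy as np

def krippendorff_alpha_nominal(codes_a, codes_b):
    """Krippendorff's alpha for nominal data, two coders, no missing values."""
    values = np.concatenate([np.asarray(codes_a), np.asarray(codes_b)])
    categories, inverse = np.unique(values, return_inverse=True)
    q = len(categories)
    a = inverse[:len(codes_a)]
    b = inverse[len(codes_a):]

    # Coincidence matrix: each coded unit contributes the pairs (a, b) and (b, a).
    o = np.zeros((q, q))
    for i, j in zip(a, b):
        o[i, j] += 1
        o[j, i] += 1
    n = o.sum()              # total pairable values (2 x number of units)
    n_c = o.sum(axis=1)      # marginal totals for each category

    # Nominal difference function: delta^2 = 1 when categories differ, 0 otherwise.
    differ = 1 - np.eye(q)
    d_o = (o * differ).sum() / n
    d_e = (np.outer(n_c, n_c) * differ).sum() / (n * (n - 1))
    return 1 - d_o / d_e

coder1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
coder2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(f"Krippendorff's alpha = {krippendorff_alpha_nominal(coder1, coder2):.3f}")
```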
Alpha is useful for multiple coders and is appropriate for various data types (Hayes & Krippendorff, 2007). However, it is not an effective measure in certain contexts. For example, it is not appropriate for a paired double coding scheme (De Swert, 2012), nor is it an appropriate measure for particular datasets. Because α is based on the chance of agreement, it is difficult to utilize this measure of reliability with skewed data. Due to the binary nature of intensive content-analytic or meta-analytic data (where the only choices are "0," not present, and "1," present), many variables may be categorized as 0s and 1s, with several variables resulting in a low representation of 1s. Thus, the degree of skewness can be problematic in calculating α, as well as κ, because "The κ statistic is effected by skewed distributions of categories (the prevalence problem) and by the degree to which the coders disagree (the bias problem)" (Eugenio & Glass, 2004).
Feinstein and Cicchetti (1990, p. 543) further articulate this problem:
In a fourfold table showing binary agreement of two observers, the observed pro-
portion of agreement, P0 can be paradoxically altered by the chance-corrected ratio
that creates κ as an index of concordance. In one paradox, a high value of P0 can be
drastically lowered by a substantial imbalance in the table’s marginal totals either
vertically or horizontally. In the second paradox, (sic) κ will be higher with an asym-
metrical rather than symmetrical imbalance in marginal totals, and with imperfect
rather than perfect symmetry in the imbalance. An adjustment that substitutes Kmax
for κ does not repair either problem, and seems to make the second one worse.
Furthermore, Gwet (2008) investigated the influence of the conditional probabilities of the coders on the prevalence of a specific trait using π and κ as metrics for inter-rater reliability.
Gwet's AC1

$$\hat{\gamma}_1 = \frac{p_a - p_e}{1 - p_e}$$

where

$$p_a = \frac{1}{1 - P_m} \sum_{k=1}^{q} p_{kk}, \qquad p_e = \frac{1}{q - 1} \sum_{k=1}^{q} \pi_k (1 - \pi_k), \qquad \pi_k = \frac{p_{k+} + p_{+k}}{2}$$

$p_{k+}$ = relative number of subjects assigned to category k by rater A
$p_{+k}$ = relative number of subjects assigned to category k by rater B
$p_{kk}$ = relative number of subjects classified into category k by both raters
$\pi_k$ = the probability that a randomly selected rater classifies a randomly selected subject into category k
q = the number of categories in the nominal rating scale
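Following the formula above, a sketch (ours; illustrative data and function name) for two raters and complete nominal data, so that the missing-data adjustment P_m is zero:

```python
import numpy as np

def gwet_ac1(codes_a, codes_b):
    """Gwet's AC1 for two raters, nominal categories, no missing ratings (P_m = 0)."""
    codes_a, codes_b = np.asarray(codes_a), np.asarray(codes_b)
    categories = np.unique(np.concatenate([codes_a, codes_b]))
    q = len(categories)

    p_a = np.mean(codes_a == codes_b)                                  # observed agreement
    p_k_plus = np.array([np.mean(codes_a == c) for c in categories])   # rater A marginals
    p_plus_k = np.array([np.mean(codes_b == c) for c in categories])   # rater B marginals
    pi_k = (p_k_plus + p_plus_k) / 2
    p_e = np.sum(pi_k * (1 - pi_k)) / (q - 1)                          # chance agreement
    return (p_a - p_e) / (1 - p_e)

coder1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
coder2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(f"Gwet's AC1 = {gwet_ac1(coder1, coder2):.3f}")
```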
AC1 may be used with any number of coders and any number of categories, scale values, or measures. It can accommodate missing data and any sample size, and it accounts for trait prevalence. Although AC1 may only be used to calculate IRR with nominal data, a similar statistic, AC2, may be used to calculate IRR with ordinal, interval, or ratio scale data. In our own review, we utilized a coding team of more than two coders (with two coders for each article), multiple categories, and nominal data that demonstrated trait prevalence (i.e., a large amount of data coded "1" by both coders), a condition deemed problematic for the other forms of IRR calculation. Due to the lack of ordinal, interval, and ratio data in our coding schemata, AC2 is beyond the scope of this paper; it is suggested as a more comprehensive measure for datasets comprised of data that are not nominal in nature.
Calculations suggest that both π and κ produce realistic estimates of IRR when the prevalence of a trait is approximately .50. The farther the trait prevalence is above or below .50, the less reliable and accurate the indices become.
Formula: $\%\ \text{Agreement} = \frac{\sum_c O_{cc}}{n} \times 100$; $\pi = \frac{P_o - P_e}{1 - P_e}$; $\kappa = \frac{P_o - P_e}{1 - P_e}$; $\alpha = 1 - \frac{D_o}{D_e}$; Gwet's AC1 and AC2: $\hat{\gamma}_1 = \frac{p_a - p_e}{1 - p_e}$.
*If more than 2 coders exist, an extension called Fleiss' kappa can be used to assess IRR.
TABLE 5.2. Guidelines for Best Selecting an IRR Index
Data Characteristics | Percent Agreement | Scott's π | Cohen's κ | Krippendorff's α | Gwet's AC1 | Gwet's AC2
Accommodates Multiple Coders/Observers | No | No | No | Yes | Yes | Yes
Bias Due to Number of Categories, Scale Values, or Measures | Yes | No | Yes | No | No | No
Level of Measurement (nominal, ordinal, interval, ratio, etc.) | Nominal | Nominal | Nominal | Nominal, Ordinal, Interval, and Ratio | Nominal | Ordinal, interval, and ratio ratings
Accommodates Missing Data | No | No | No | Yes | Yes | Yes
Accommodates Known Prevalence (0 or 1) | No | No | No | No | Yes | Yes
Sample Size Restrictions | No | Yes | Yes | No | No | No
Accommodates Skewed Data | No | No | No | No | Yes | Yes
Content-analytic and meta-analytic coding schemes often record only the presence (coded 1) or absence (coded 0) of a particular trait, phenomenon, etc. The data in this example are representative of this problem, which cannot be sufficiently accommodated by α.
The problems with α in this situation are laid bare in our study. Take, for example, our coding of the retrospective study variable in Table 5.3. Despite there being 96% agreement between coders, α is calculated at .28. This meager value is the result of the dichotomous nature of the variable and the use of a rotated coding scheme, whereby (a) Coders 1 and 2 code a set of articles, (b) Coders 2 and 3 code a set of articles, and (c) Coders 1 and 3 code a set of articles. The α index cannot account for this coding scheme and underestimates the degree of IRR. By contrast, Table 5.3 demonstrates that AC1 more accurately measures IRR when a rotating coder design is employed. Specifically, Table 5.3 provides calculations of the different IRR indices for 25 of the 130 variables that were coded within each of the 440 studies in our sample. Given the laborious nature of content analysis and meta-analysis, these types of designs are increasingly common, highlighting the utility of AC1 as an IRR measure.
We calculated each of the five IRR indices for our data in order to compare across several theoretical and methodological variables that were coded as either present or not present in a particular article. Table 5.3 shows a comparison of all five IRR indices for these coded variables. Across these comparisons, it is clear that there is a substantial range of IRR coefficients. A similar pattern can be seen across all three sets of comparisons: π, κ, and α are all relatively close in value (to the thousandth place), whereas percent agreement and AC1 tend to be substantially higher, with percent agreement consistently remaining the highest coefficient, followed by AC1. Although there is a lack of consensus regarding acceptable levels of IRR for each of these variables (e.g., Krippendorff, 1980; Perreault & Leigh, 1989; Popping, 1988), the general body of literature suggests that IRR coefficient values greater than .90 are acceptable in virtually all situations and values of .80 or greater are acceptable in most situations. Values below .80 are subject to disagreement among scholars regarding acceptability (Neuendorf, 2002); however, some scholars suggest that values between .60 and .80 are moderately strong and sometimes acceptable (e.g., Landis & Koch, 1977), and other scholars suggest .70 as the cutoff for reliability (e.g., Cronbach, 1980; Frey, Botan, & Kreps, 2000). However, due to the relatively conservative nature of π and κ, lower thresholds are at times deemed acceptable.
Using these acceptance guidelines, it is clear that the interpretation of acceptability varies based upon which IRR index is used. For the coding of studies grounded in the theories of Porter and Steers (1973), Lee and Mitchell (1991), and Rusbult and Farrell (1983), the IRR is acceptable regardless of which metric is used, though for π, κ, and α acceptance is borderline for Porter and Steers, whereas percentage agreement and AC1 are clearly acceptable. However, for the remaining theoretical variables that were coded, the π, κ, and α values are not deemed acceptable, though AC1 and percentage agreement offer evidence of acceptable IRR among
TABLE 5.3 (continued). Comparison of the Five IRR Indices for Selected Coded Variables
Variable | Level of Measurement | Category | % Agreement | Scott's π | Cohen's κ | Krippendorff's α | Gwet's AC1
Ex post archival | Nominal | Study Design | 0.8154 | 0.0672 | 0.0680 | 0.0685 | 0.8060
Longitudinal | Nominal | Study Design | 0.8654 | 0.5594 | 0.5596 | 0.5600 | 0.8540
Repeated measures | Nominal | Study Design | 0.9451 | 0.5395 | 0.5396 | 0.5402 | 0.9430
Retrospective | Nominal | Study Design | 0.9615 | 0.2849 | 0.2849 | 0.2859 | 0.9610
Static cohort | Nominal | Study Design | 0.7720 | 0.5513 | 0.5528 | 0.5519 | 0.7250
Rusbult & Farrell | Nominal | Theories | 0.9890 | 0.8125 | 0.8125 | 0.8128 | 0.9890
Hulin et al | Nominal | Theories | 0.9643 | 0.5874 | 0.5888 | 0.5879 | 0.9610
Lee & Mitchell | Nominal | Theories | 0.9890 | 0.8889 | 0.8889 | 0.8891 | 0.9880
March & Simon | Nominal | Theories | 0.9091 | 0.6727 | 0.6728 | 0.6734 | 0.8740
Mobley | Nominal | Theories | 0.9093 | 0.6772 | 0.6774 | 0.6776 | 0.8740
Mobley et al | Nominal | Theories | 0.8874 | 0.6701 | 0.6701 | 0.6705 | 0.8390
Muchinsky & Morrow | Nominal | Theories | 0.9835 | 0.6915 | 0.6919 | 0.6919 | 0.9830
Price | Nominal | Theories | 0.9011 | 0.4831 | 0.4834 | 0.4838 | 0.8780
Steers & Mowday | Nominal | Theories | 0.9148 | 0.6032 | 0.6043 | 0.6037 | 0.8920
Maertz | Nominal | Theories | 0.9973 | 0.6653 | 0.6654 | 0.6657 | 0.9980
Porter & Steers | Nominal | Theories | 0.9286 | 0.7083 | 0.7084 | 0.7087 | 0.9050
coders. This can likely be attributed to the binary coding scheme that was used
and offers evidence for the importance of choosing the right metric. In all but
one of these instances, the percentage agreement is above 90% and is arguably
inflated based on the lack of consideration of chance. This inflation is further dem-
onstrated upon examination of the variables which coded for measures. Although
percentage agreement remains inflated, the remaining four IRR indices fail to
demonstrate IRR across the board (single item measures) or show low reliability
as calculated by p, κ, and a, and a barely acceptable AC1 (i.e., existing measures
adapted, existing measures without adaptation). Thus, the IRR index used has a
substantial influence on the degree to which IRR is considered acceptable or not.
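To make concrete how the five indices differ in their treatment of chance agreement, the sketch below (our illustration, not code from the study described above; the function and example data are hypothetical) computes percent agreement, Scott's π, Cohen's κ, Krippendorff's α (nominal data, no missing values), and Gwet's AC1 for two coders. With a rarely occurring binary code, percent agreement and AC1 stay high while π, κ, and α collapse, mirroring the pattern in Table 5.3.

```python
# A minimal sketch, assuming two coders and a nominal coding scheme; the
# function name and example data are ours, not taken from the study above.
from collections import Counter

def irr_indices(coder1, coder2):
    """Percent agreement, Scott's pi, Cohen's kappa, Krippendorff's alpha
    (nominal data, no missing values), and Gwet's AC1 for two coders."""
    n = len(coder1)
    cats = sorted(set(coder1) | set(coder2))
    q = len(cats)

    # Observed (percent) agreement
    p_o = sum(a == b for a, b in zip(coder1, coder2)) / n

    # Marginal proportions for each coder, plus the pooled marginals
    c1, c2 = Counter(coder1), Counter(coder2)
    p1 = {c: c1[c] / n for c in cats}
    p2 = {c: c2[c] / n for c in cats}
    pbar = {c: (p1[c] + p2[c]) / 2 for c in cats}

    # Cohen's kappa: chance agreement from each coder's own marginals
    pe_k = sum(p1[c] * p2[c] for c in cats)
    kappa = (p_o - pe_k) / (1 - pe_k)

    # Scott's pi: chance agreement from the pooled marginals
    pe_pi = sum(pbar[c] ** 2 for c in cats)
    pi = (p_o - pe_pi) / (1 - pe_pi)

    # Krippendorff's alpha: like pi, with a small-sample correction over 2n values
    counts = {c: c1[c] + c2[c] for c in cats}
    a_e = sum(v * (v - 1) for v in counts.values()) / (2 * n * (2 * n - 1))
    alpha = (p_o - a_e) / (1 - a_e)

    # Gwet's AC1: chance term shrinks as category prevalence becomes extreme
    pe_ac1 = sum(pbar[c] * (1 - pbar[c]) for c in cats) / (q - 1)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)

    return {"percent": p_o, "pi": pi, "kappa": kappa, "alpha": alpha, "AC1": ac1}

# A rarely used binary code: high percent agreement and AC1, low pi/kappa/alpha.
coder_a = [0] * 45 + [1] * 2 + [0] * 3
coder_b = [0] * 45 + [0] * 2 + [1] * 3
print(irr_indices(coder_a, coder_b))
```

The contrast arises because π, κ, and α estimate chance agreement from the observed marginals, which approach 1.0 when one category dominates, whereas the AC1 chance term shrinks as prevalence becomes extreme.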
REFERENCES
Aguinis, H., Dalton, D. R., Bosco, F. A., Pierce, C. A., & Dalton, C. M. (2011). Meta-
analytic choices and judgment calls: Implications for theory building and testing,
obtained effect sizes, and scholarly impact. Journal of Management, 37, 5–38.
Allen, D. G., Hancock, J. I., Vardaman, J. M., & McKee, D. L. N. (2014). Analytical mind-
sets in turnover research. Journal of Organizational Behavior, 35, S61–S86.
Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics.
Computational Linguistics, 34, 555–596.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job per-
formance: A meta-analysis. Personnel Psychology, 44, 1–26.
Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implica-
tions for data aggregation and analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.),
Multilevel theory, research, and methods in organizations: Foundations, extensions,
and new directions (pp. 349–381). San Francisco, CA: Jossey-Bass.
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2011). Introduction to
meta-analysis. West Sussex, UK: John Wiley & Sons.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alter-
natives. Educational and Psychological Measurement, 41, 687–699.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cronbach, L. J. (1980). Validity on parole: How can we go straight. In W. B. Schrader
(Ed.), New directions for testing and measurement: Measuring achievement over a
decade. (pp. 99–108). San Francisco, CA: Jossey-Bass.
Desa, G. (2012). Resource mobilization in international social entrepreneurship: Bricolage
as a mechanism of institutional transformation. Entrepreneurship Theory and Prac-
tice, 36, 727–751.
De Swert, K. (2012). Calculating inter-coder reliability in media content analysis using
Krippendorff’s Alpha. Center for Politics and Communication, 1–15. Retrieved
from: https://www.polcomm.org/wp-content/uploads/ICR01022012.pdf
Eby, L. T., Casper, W. J., Lockwood, A., Bordeaux, C., & Brinley, A. (2005). Work and
family research in IO/OB: Content analysis and review of the literature (1980–
2002). Journal of Vocational Behavior, 66, 124–197.
Eugenio, B. D., & Glass, M. (2004). The kappa statistic: A second look. Computational
Linguistics, 30, 95–101.
Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems
of two paradoxes. Journal of Clinical Epidemiology, 43, 543–549.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological
Bulletin, 76, 378.
Frey, L., Botan, C. H., & Kreps, G. (2000). Investigating communication. New York, NY:
Allyn & Bacon.
Gwet, K. (2001). Handbook of inter-rater reliability: How to estimate the level of agreement between two or multiple raters. Gaithersburg, MD: STATAXIS Publishing Company.
Gwet, K. (2002a). Inter-rater reliability: Dependency on trait prevalence and marginal ho-
mogeneity. Statistical Methods for Inter-rater Reliability Assessment Series, 2, 1–9.
Gwet, K. (2002b). Kappa statistic is not satisfactory for assessing extent of agreement
between Raters. Statistical Methods for Inter-rater Reliability Assessment Series,
1, 1–5.
Gwet, K. (2008). Computing inter-rater reliability and its variance in the presence of high
agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48.
Hancock, J. I., Allen, D. G., Bosco, F. A., McDaniel, K. R., & Pierce, C. A. (2013). Meta-
analytic review of employee turnover as a predictor of firm performance. Journal of
Management, 39, 573–603.
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability mea-
sure for coding data. Communication Methods and Measures, 1, 77–89.
Heugens, P. P., & Lander, M. W. (2009). Structure! Agency! (And other quarrels): A meta-
analysis of institutional theories of organization. Academy of Management Journal,
52, 61–85.
Hoch, J. E., Bommer, W. H., Dulebohn, J. H., & Wu, D. (2018). Do ethical, authentic, and
servant leadership explain variance above and beyond transformational leadership?
A meta-analysis. Journal of Management, 44, 501–529.
Judge, T. A., & Ilies, R. (2002). Relationship of personality to performance motivation: A
Meta-analytic review. Journal of Applied Psychology, 87, 797–807.
Koenig, A. M., Eagly, A. H., Mitchell, A. A., & Ristikari, T. (2011). Are leader stereotypes
masculine? A meta-analysis of three research paradigms. Psychological Bulletin,
137, 616–642.
Kostova, T., & Roth, K. (2002). Adoption of an organizational practice by subsidiaries of
multinational corporations: Institutional and relational effects. Academy of Manage-
ment Journal, 45, 215–233.
Krippendorff, K. (1970). Estimating the reliability, systematic error and random error of
interval data. Educational and Psychological Measurement, 30, 61–70.
Krippendorff, K. (1980). Reliability. In K. Krippendorff, Content analysis: An introduction to its methodology (pp. 129–154). Beverly Hills, CA: Sage Publications.
Krippendorff, K. (2004a). Content analysis: An introduction to its methodology (2nd ed.).
Thousand Oaks, CA: Sage.
Krippendorff, K. (2004b). Reliability in content analysis: Some common misconceptions
and recommendations. Human Communication Research, 30, 411–433.
Krippendorff, K. (2011). Computing Krippendorff’s alpha-reliability. Retrieved from http://repository.upenn.edu/asc_papers/43
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categori-
cal data. Biometrics, 33, 159–174.
Lebreton, J. M., Burgess, J. R., Kaiser, R. B., Atchley, E. K., & James, L. R. (2003). The
restriction of variance hypothesis and interrater reliability and agreement: Are rat-
ings from multiple sources really dissimilar? Organizational Research Methods, 6,
80–128.
Lee, T. W., & Mitchell, T. R. (1991). The unfolding effects of organizational commitment
and anticipated job satisfaction on voluntary employee turnover. Motivation and
Emotion, 15, 99–121.
Mackey, J. D., Frieder, R. E., Brees, J. R., & Martinko, M. J. (2017). Abusive supervision:
A meta-analysis and empirical review. Journal of Management, 43, 1940–1965.
McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica: Bio-
chemia Medica, 22, 276–282.
Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.
Perreault Jr, W. D., & Leigh, L. E. (1989). Reliability of nominal data based on qualitative
judgments. Journal of Marketing Research, 26, 135–148.
Pindek, S., Kessler, S. R., & Spector, P. E. (2017). A quantitative and qualitative review
of what meta-analyses have contributed to our understanding of human resource
management. Human Resource Management Review, 27, 26–38.
Popping, R. (1988). On agreement indices for nominal data. In Sociometric Research (pp.
90–105). London, UK: Palgrave Macmillan.
Porter, L. W., & Steers, R. M. (1973). Organizational, work, and personal factors in em-
ployee turnover and absenteeism. Psychological Bulletin, 80, 151.
Rusbult, C. E., & Farrell, D. (1983). A longitudinal test of the investment model: The
impact on job satisfaction, job commitment, and turnover of variations in rewards,
costs, alternatives, and investments. Journal of Applied Psychology, 68, 429.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding.
Public Opinion Quarterly, 19, 321–325.
Tuggle, C. S., Schnatterly, K., & Johnson, R. A. (2010). Attention patterns in the board-
room: How board composition and processes affect discussion of entrepreneurial
issues. Academy of Management Journal, 53, 550–571.
Tuggle, C. S., Sirmon, D. G., Reutzel, C. R., & Bierman, L. (2010). Commanding board of
director attention: investigating how organizational performance and CEO duality
affect board members’ attention to monitoring. Strategic Management Journal, 31,
946–968.
Zwick, R. (1988). Another look at interrater agreement. Psychological Bulletin, 103, 374–
378.
CHAPTER 6
EVALUATING JOB
PERFORMANCE MEASURES
Criteria for Criteria

Angelo S. DeNisi and Kevin R. Murphy
Research aimed at improving performance appraisals dates back almost 100 years,
and there have been a number of reviews of this literature published over the years
(e.g., Austin & Villanova, 1992; Bretz, Milkovich, & Read, 1992; DeNisi & Mur-
phy, 2017; DeNisi & Sonesh, 2011; Landy & Farr, 1980; Smith, 1976). Each of
these papers has chronicled the research conducted to help us better understand
the processes involved in performance appraisals, and how this understanding
could help to improve the overall process. Although these reviews were done at
different points in time, the goal in each case was to draw conclusions concern-
ing how to make appraisal systems more effective. However, while each of these
reviews included studies comparing and contrasting different appraisal systems,
they all simply accepted whatever criterion was used for those comparisons, and,
based on those comparisons, made recommendations on how to conduct better
appraisals. This is problematic because many of the studies reviewed used different criterion measures for their comparisons.
But this issue is even more serious when we realize that the reason why these
studies and reviews have used different criterion measures is that there is no con-
sensus on what is the “best” criterion measure to use when comparing appraisal
systems. Stated simply, if we want to make a statement that “system A” is
better than “system B,” we need some criterion or criteria against which to com-
pare the systems. Although this would seem to be a basic issue from a research
methods point of view, the truth is that there have been many criterion measures
used over time, but almost all of them are subject to serious criticism. Therefore,
despite 100 years of research on performance appraisal, there is actually very little
we can be certain about in terms of identifying the “best” approaches.
The present paper differs from those earlier review articles because it focuses specifically on the problem of criterion identification. Therefore, our review is not organized according to which rating formats or systems were compared; rather, it is organized around the criterion measures that were used to make those comparisons. Our goal, then, is not to determine which system is best, but to identify problems with the criterion measures used in the past and to propose a somewhat different approach to identifying a more useful and credible criterion measure. Therefore, we begin with a discussion of the various criterion measures that have typically been used in comparing and evaluating appraisal systems. In each case, we note the problems that have been identified with their use and why they may not really be useful as criterion measures. We then move on to lay out a comprehensive framework for evaluating the construct validity of job performance measures that we believe can serve as the basis for more useful measures to be used in this research.
HISTORICAL REVIEW
Agreement Measures
The reliance upon agreement measures as criteria for evaluating appraisal sys-
tems has a long history. Some type of inter-rater agreement measure has been
used to evaluate appraisal systems from as early as the 1930s (e.g., Remmers,
1934) continuing through the 50s (e.g., Bendig, 1953), the 60s (e.g., Smith &
Kendall, 1963), and the 70s (e.g., Blanz & Ghiselli, 1972). The underlying as-
sumption was that agreement indicated reliable ratings, and, since reliable ratings
are a prerequisite for valid ratings, agreement could be used as a proxy for validity
and accuracy. But, in fact, the situation was much more complex. Viswesvaran,
Ones and Schmidt (1996) reviewed several methods of estimating the reliability
(or the freedom from random measurement error) of job performance ratings and
argued that inter-rater correlations provided the best estimate of the reliability
of performance ratings (See also Ones, Viswesvaran & Schmidt, 2008; Schmidt,
Viswesvaran & Ones, 2000). The correlations between ratings given to the same
employees by two separate raters are typically low, however, and others (e.g.,
LeBreton, Scherer, & James, 2014; Murphy & DeShon, 2000) have argued
that treating inter-rater correlations as measures of reliability makes sense only
if you believe that agreements between raters are due solely to true scores and
disagreements are due solely to random measurement error, a proposition that
strikes us as unlikely.
A number of studies have examined the roles of systematic and random error
in performance ratings, as well as methods of estimating systematic and random
error (Fleenor, Fleenor, & Grossnickle, 1996; Greguras & Robie, 1998; Hoff-
man, Lance, Bynum, & Gentry, 2010; Hoffman & Woehr, 2009; Kasten & Nevo,
2008; Lance, 1994; Lance, Baranik, Lau, & Scharlau, 2009; Lance, Teachout, &
Donnelly, 1992; Mount, Judge, Scullen, Sytsma, & Hezlett, 1998; Murphy, 2008;
O’Neill, McLarnon, & Carswell, 2015; Putka, Le, McCloy, & Diaz, 2008; Saal,
Downey, & Lahey, 1980; Scullen, Mount, & Goff, 2000; Woehr, Sheehan, & Ben-
nett, 2005). In general, these studies suggest that there is considerably less random
measurement error in performance ratings than studies of inter-rater correlation
would suggest. For example, Scullen, Mount, and Goff (2000) and Greguras and
Robie (1998) examined sources of variability in ratings obtained from multiple
raters and found that the largest source of variance in ratings is due to raters, some
of which is likely due to biases or general rater tendencies (e.g., leniency).
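The logic of this critique can be illustrated with a toy simulation (ours, with made-up variance components, not data from the studies cited): when much of the rater-specific variance is systematic rather than random, the correlation between two raters understates how dependably each rater, taken alone, would reproduce his or her own ratings.

```python
# Toy simulation with invented variance components; the numbers are ours and
# are meant only to illustrate the logic, not to reproduce any cited study.
import numpy as np

rng = np.random.default_rng(0)
n = 500
true = rng.normal(0.0, 1.0, n)                # ratee (true) performance
leniency = rng.normal(0.0, 0.6, 2)            # stable rater leniency (shifts means only)
idiosyncrasy = rng.normal(0.0, 0.7, (n, 2))   # stable rater-by-ratee effects

# Two rating occasions: the systematic parts repeat, only the random error changes.
time1 = true[:, None] + leniency + idiosyncrasy + rng.normal(0.0, 0.3, (n, 2))
time2 = true[:, None] + leniency + idiosyncrasy + rng.normal(0.0, 0.3, (n, 2))

inter_rater = np.corrcoef(time1[:, 0], time1[:, 1])[0, 1]
retest = np.corrcoef(time1[:, 0], time2[:, 0])[0, 1]
print(f"inter-rater r = {inter_rater:.2f}, within-rater retest r = {retest:.2f}")
```

With these illustrative values the inter-rater correlation is roughly .63, while each rater's own retest correlation is above .90, which is the sense in which inter-rater correlations overstate the amount of random measurement error in ratings.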
There have been a number of advances in research on inter-rater agreement, some involving multi-level analyses (e.g., Conway, 1998) or the application of generalizability theory (e.g., Greguras, Robie, Schleicher, & Goff, 2003). Others have examined sources of variability in peer ratings (e.g., Dierdorff & Surface, 2007) and multi-rater systems such as 360-degree appraisals (e.g., Hoffman, Lance, Bynum, & Gentry, 2010; Woehr, Sheehan, & Bennett, 2005). In all these cases, results indicated that substantial portions of the variability in ratings were due to systematic rather than random sources, undercutting the claim (e.g., Schmidt et al., 2000) that performance ratings exhibit a substantial amount of random measurement error.
Studies of inter-rater agreement moved from the question of whether raters agree to considering why and under what circumstances they agree or disagree.
For example, there is a robust literature dealing with differences in ratings col-
lected from different sources (e.g., supervisors, peers). In general, self-ratings
were found to be typically higher than ratings from others (Valle & Bozeman,
2002), and agreement between subordinates, peers and supervisors was typically
modest, with uncorrected correlations in the .20s and .30s (Conway & Huffcutt,
1997; Valle & Bozeman, 2002). However, given the potentially low levels of reli-
ability for each source, it is likely that the level of agreement among sources is
actually somewhat higher. Harris and Schaubroeck (1988) reported corrected cor-
relations between sources in the mid .30s to low .60s. Viswesvaran, Schmidt, and
Ones (2002) applied a more aggressive set of corrections and suggested that in
ratings of overall performance and some specific performance dimensions, peers
and supervisors show quite high levels of agreement. This conclusion, however,
depends heavily on the assumption that almost half of the variance in performance
ratings represents random measurement error, a conclusion that has been shown to
be incorrect in studies of the generalizability of performance ratings.
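The role of these corrections is easy to see from the standard correction for attenuation. The sketch below is our illustration; the reliability values are hypothetical and are not the estimates used in the studies cited above.

```python
# Correction for attenuation: r_corrected = r_observed / sqrt(rel_x * rel_y).
# Illustrative values only; not the estimates used by the studies cited above.
def disattenuate(r_xy, rel_x, rel_y):
    """Estimate the correlation corrected for unreliability in both measures."""
    return r_xy / (rel_x * rel_y) ** 0.5

# An observed peer-supervisor correlation of .30 under two reliability assumptions:
print(round(disattenuate(0.30, 0.80, 0.80), 2))  # modest unreliability -> 0.38
print(round(disattenuate(0.30, 0.52, 0.52), 2))  # "nearly half the variance is error" -> 0.58
```

The more measurement error one assumes, the larger the corrected correlation becomes, which is why conclusions about cross-source agreement hinge on the reliability estimates being defensible.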
Second, it is possible that raters agree about some things and disagree about
others. For example, it is commonly assumed that raters are more likely to agree
on specific, observable aspects of behavior than on more abstract dimensions
(Borman, 1979). Roch, Paquin and Littlejohn (2009) conducted two studies to
test this proposition, and their results suggested that the opposite is true. Inter-
rater agreement was actually higher for dimensions that are less observable or
that are judged to be more difficult to rate. Roch et al. (2009) speculated that this
seemingly paradoxical finding may reflect the fact that when there is less concrete
behavioral information available, raters fall back on their general impressions of
ratees when rating specific performance dimensions.
Other studies (e.g., Sanchez & De La Torre, 1996) have reported that accuracy
in observing behavior was positively correlated with accuracy in evaluating per-
formance. That is, raters who had accurate recall of what they had observed also appeared to be more accurate in evaluating ratees. Unfortunately, however, accuracy in behavioral observation did not appear to be related in any simple way to the degree to which the behavior in question is observable or easy to rate.
Rater Error Measures

A second class of criteria consists of so-called rater error measures. These measures are difficult to discuss in their entirety, so it is useful to deal separately with two major categories of “rater errors”: (a) distributional errors across ratees, and (b) correlational errors within ratees.
Distributional Errors. Measures of distributional errors rely on the assump-
tion that if distributions of performance ratings deviate from some ideal, this
indicates that raters are making particular types of errors in their evaluations.
Although, in theory, any distribution might be viewed as ideal, in practice, the
ideal distribution for the purpose of determining whether or not rater errors have
occurred has been the normal distribution. Thus, given a group of ratees, any
deviation in their ratings from a normal distribution was seen as evidence of a
rating error. This deviation could take the form of “too many” ratees being rated
as excellent (“leniency error”), “too many” ratees being rated as poor (“severity
error”), or “too many” ratees being rated as average (“central tendency error”).
Obviously, the logic of this approach depends upon the assumption that the ideal (normal) distribution is correct, so that any other distribution that is obtained is due to some type of error, but this underlying assumption has been questioned on several grounds.
First, the true distribution of the performance of the group of employees who
report to a single supervisor is almost always unknown. If it were known, we
would not need subjective evaluations and could simply rely upon the “true” rat-
ings. Therefore, it is impossible to assess whether or not there are “too many”
ratees who are evaluated at any point on the scale. Second, if there were an ideal
distribution of ratings, there is no justification for the assumption that it is nor-
mal and centered around the scale midpoint (Bernardin & Beatty, 1984). Rather,
organizations exert considerable effort to assure that the distribution of perfor-
mance is not normal. Saal, Downey, and Lahey (1980) point out that a variety of
activities, ranging from personnel selection to training are designed to produce
a skewed distribution of performance, so that most (if not all) employees should
be—at least—above the midpoint on many evaluation scales. Finally, the use of
distributional data as an indicator of errors assumes there are no true differences
in performance across work groups (cf., Murphy & Balzer, 1989). In fact, a rater
who gives subordinates higher than “normal” ratings may not be more lenient but
may simply have a better group of subordinates who are actually doing a better
job and so deserve higher ratings.
Furthermore, recent research has challenged the notion that job performance is
normally distributed in almost any situation (Aguinis & O’Boyle, 2014; Aguinis,
O’Boyle, Gonzalez-Mulé, & Joo, 2016; Joo, Aguinis, & Bradley, 2017). These
authors argue that in many settings, a small number of high performers (often
referred to as “stars”) contribute disproportionally to the productivity of a group,
creating a distribution that is far from normal. Beck, Beatty, and Sackett (2014)
suggest that the distribution of performance might depend substantially on the
type of performance that is measured, and it is reasonable in many cases to as-
sume nearly normal distributions. The argument over the appropriate distribu-
tional assumptions is a complex one, but the very fact that this argument exists is
a strong indication that we cannot say with confidence that ratings are too high,
or that there is too little variance in or too much intercorrelation among ratings
of different performance dimensions absent reliable knowledge of how ratings
should be distributed. In the eyes of some critics (e.g., Murphy & Balzer, 1989),
the lack of reliable knowledge about the true distribution of performance in the
particular workgroup evaluated by any particular rater makes distributional error
measures highly suspect.
Correlational Errors. Measures of correlational error are built around a similar assumption: that there is some ideal level of correlation among the ratings each supervisor assigns. Specifically, it is often assumed that different aspects or dimensions of performance should be independent, or at least should show low levels of intercorrelation. Therefore, when raters give ratings of performance that turn out to be correlated, this is thought to indicate a rating error. This inflation of the intercorrelations among dimensions is referred to as halo error. Cooper
(1981b) suggests that halo is likely to be present in virtually every type of rating
instrument.
There is an extensive body of research examining halo errors in rating, and
a number of different measures, definitions, and models of halo error have been
proposed (Balzer & Sulsky, 1992; Cooper, 1981a,b; Lance, LaPointe, & Stewart,
1994; Murphy & Anhalt, 1992; Murphy, Jako, & Anhalt, 1993; Nathan & Tippins,
1989; Solomonson & Lance, 1997). Although there was disagreement on a num-
ber of points across these proposals, there was substantial agreement on several
important points. First, the observed correlation between ratings of separate per-
formance dimensions reflects both actual consistencies in performance (referred
to as “true halo,” or the actual degree of correlation between two conceptually dis-
tinct performance dimensions) and errors in processing information about ratees
or in translating that information into performance ratings (referred to as “illusory
halo”). Clearly, the degree of true halo does not indicate any type of rating error
but instead reflects the true covariance across different parts of a job; it is only the
illusory halo that reflects a potential rater error (Bingham actually made the same
point in 1939). Second, this illusory halo is driven in large part by raters’ tendency
to rely on general impressions and global evaluations when rating specific aspects
of performance (e.g., Balzer & Sulsky, 1992; Jennings, Palmer, & Thomas, 2004;
Lance, LaPointe, & Stewart, 1994; Murphy & Anhalt, 1992). Third, all agree that
it is very difficult, if not impossible, to separate true halo from illusory halo. Even
in cases where the expected correlation between two rating dimensions is known
for the population in general (for example, in the population as a whole several of
the Big Five personality dimensions are believed to be essentially uncorrelated),
that does not mean that the performance of a small group of ratees on these dimen-
sions will show the same pattern of true independence.
There is an emerging consensus that measures that are based on the distribu-
tions and the intercorrelations among the ratings given by an individual rater have
proved essentially useless for evaluating performance ratings (DeNisi & Murphy,
2017; Murphy & Balzer, 1989). First, we cannot say with any confidence that a
particular supervisor’s ratings are too high or too highly intercorrelated unless we
know a good deal about the true level of performance, and if we knew this, we
would not need supervisory performance ratings. Second, the label “rater error” is
misleading. It is far from clear that supervisors who give their subordinates high
ratings are making a mistake. There might be several good reasons to give sub-
ordinates high ratings (e.g., to give them opportunities to obtain valued rewards,
to maintain good relationships with subordinates), and raters who know that high
ratings are not truly deserved might nevertheless conclude that it is better to give
high ratings than to give low ones (Murphy & Cleveland, 1995; Murphy, Cleve-
land, & Hanscom, 2018). Finally, as we shall see, there is no evidence to support
the assumption that rating errors have much to do with rating accuracy, an as-
sumption that has long served as the basis for the use of rating errors measures as
criteria for evaluating appraisal systems.
Accuracy Measures

Borman and his associates launched a sustained wave of research on rating ac-
curacy, using videotapes of ratees performing various tasks, which could then be
used as stimulus material for rating studies. Borman’s (1977, 1978, 1979) research
was based on the assumption that well-trained raters, observing these tapes under
optimal conditions, could provide a set of ratings which could then be pooled and
averaged (to remove potential individual biases and processing errors) to generate
“true scores” which would be used as the standard against which all other ratings
could be compared. That is, these pooled ratings, collected under optimal condi-
tions, could be considered to be an accurate assessment of performance, which
could then be used as criterion measures in subsequent research.
Rating accuracy measures similar to those developed by Borman were widely used in appraisal studies focusing on rater cognitive processes (Becker & Cardy, 1986; Cardy & Dobbins, 1986; DeNisi, Robbins, & Cafferty, 1989; McIntyre, Smith, & Hassett, 1984; Murphy & Balzer, 1986; Murphy, Balzer, Kellam, & Armstrong, 1984; Murphy, Garcia, Kerkar, Martin, & Balzer, 1982; Pulakos, 1986; Williams et al., 1986), but were also used in studies comparing different methods of rater training (e.g., Pulakos, 1986), and even studies comparing different types of rating scales (e.g., DeNisi, Robbins, & Summers, 1997). A review of research on rating accuracy measures can be found in Sulsky and Balzer (1988).
Different Types of Accuracy. Attempts to increase the accuracy of perfor-
mance ratings are complicated by the fact that there are many different types of
accuracy. At a basic level, Murphy (1991) argued for making a distinction between
behavioral accuracy and classification accuracy. Behavioral accuracy referred to
the ability to discriminate between good and poor incidents of performance, while
classification accuracy referred to the ability to discriminate between the best performer, the second-best performer, and so on. Murphy (1991) also argued that the purpose for which the ratings were to be used should dictate which type of accuracy was more important, but it seems clear that these measures answer different questions about rating accuracy and that both are likely to be important.
At a more complex level, Cronbach (1955) noted that there were several ways we could define the agreement between a set of ratings provided by a rater
and a set of true scores. Specifically, he defined four separate components of ac-
curacy: (1) Elevation—the accuracy of the average rating, over all ratees and
dimensions, (2) Differential Elevation—the accuracy in discriminating among
ratees, (3) Stereotype Accuracy—accuracy in discriminating among performance
dimensions across all ratees, and (4) Differential Accuracy—accuracy in detect-
ing ratee differences in patterns of performance, such as diagnosing individual
strengths and weaknesses. Research suggests that the different accuracy measures
are not highly correlated (Sulsky & Balzer, 1988), so that the conclusions one
draws about the accuracy of a set of ratings may depend more upon the choice
of accuracy measures than on a rater’s ability to evaluate his or her subordinates
(Becker & Cardy, 1986).
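For readers who find the verbal definitions abstract, the sketch below shows one common squared-difference operationalization of Cronbach's four components; it is our illustration, assuming a ratees-by-dimensions matrix of ratings and a matching matrix of true scores, and lower values indicate greater accuracy.

```python
# A sketch of one squared-difference operationalization; the array names and
# decomposition layout are ours, not Cronbach's original notation.
import numpy as np

def accuracy_components(ratings, true_scores):
    """ratings, true_scores: (n_ratees, n_dimensions) arrays; lower = more accurate."""
    d = np.asarray(ratings, float) - np.asarray(true_scores, float)

    elevation = d.mean() ** 2                                   # overall level
    diff_elevation = ((d.mean(axis=1) - d.mean()) ** 2).mean()  # ratee main effects
    stereotype = ((d.mean(axis=0) - d.mean()) ** 2).mean()      # dimension main effects
    resid = (d - d.mean(axis=1, keepdims=True)
               - d.mean(axis=0, keepdims=True) + d.mean())
    diff_accuracy = (resid ** 2).mean()                         # ratee x dimension patterns

    return {"elevation": elevation, "differential_elevation": diff_elevation,
            "stereotype_accuracy": stereotype, "differential_accuracy": diff_accuracy}
```

Because the four components partition different parts of the rating/true-score discrepancies, it is not surprising that they can lead to different conclusions about the same set of ratings.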
Several scholars had questioned the assumption that rater error measures were useful proxies for assessments of accuracy (e.g., Becker & Cardy, 1986; Cooper, 1981b; Murphy & Balzer, 1986). Murphy and Balzer (1989), using data from over 800 raters, provided the first direct empirical examination of this as-
sumption. They reported that the relationship between any of the common rating
errors and rating accuracy was either zero, or it was in the wrong direction (i.e.,
more rater errors were associated with higher accuracy). In particular, they re-
ported that the strongest error-accuracy relationship was between halo error and
accuracy, but that higher levels of halo were associated with higher levels of ac-
curacy—not lower levels, as should have been the case.
Accuracy measures have proved problematic as criteria for evaluating ratings.
First, different accuracy measures often lead to quite different conclusions about
rating systems; several authors have suggested that the purpose for the apprais-
als should probably dictate which type of accuracy measure should be used to
evaluate ratings (e.g., Murphy, 1991; Murphy & Cleveland, 1995). Furthermore,
direct measures of accuracy are only possible in highly controlled settings, such
as laboratory studies, making these measures less useful for field research. Fi-
nally, Ilgen, Barnes-Farrell and McKellin (1993) raised wide-ranging questions
about whether or not accuracy was the right goal in performance appraisal—and
therefore whether it was the best criterion measure for appraisal research. This
point was also raised elsewhere by DeNisi and Gonzalez (2004) and Ilgen (1993).

Ratee Reactions

More recently, researchers have argued that ratee reactions, including their perceptions about the fairness of the ratings they received as well as of the rating process itself, are important criteria for evaluating the effectiveness of any appraisal system (cf.,
Folger, Konovsky, & Cropanzano, 1992; Greenberg, 1986, 1987). This focus was
consistent with the recommendations of Ilgen et al. (1993) and DeNisi and Gon-
zalez (2004), and assumes that employees are most likely to accept performance
feedback, to be motivated by performance-contingent rewards, and to view their
organization favorably if they view the performance appraisal system as fair and
honest.
In our view, perceptions of fairness should be thought of as a mediating vari-
able rather than as a criterion. The rationale for treating reactions as a mediating
variable is that performance ratings are often used in organizations as means of
improving performance, and reactions to ratings probably have a substantial im-
pact on the effectiveness of rating systems. It is likely that performance feedback
will lead to meaningful and useful behavior changes only if the ratee perceives the
feedback (i.e., the ratings) received as fair, and accepts this feedback. Ratee per-
formance may not actually improve, perhaps because of a lack of ability or some
situational constraint, but increasing an incumbent’s desire to improve and a will-
ingness to try harder is assumed to be a key goal of performance appraisal and per-
formance management systems. Unfortunately, feedback, even when accepted, is
not always as effective as we had hoped it might be (cf., Kluger & DeNisi, 1996).
Conclusions
Our review of past attempts at identifying criterion measures for evaluating
performance appraisals suggests that one of the reasons for the recurring failure
in the century-long search for “criteria for criteria” is the tendency to limit this
search to a single class of measures, such as inter-rater agreement measures, rater
error scores, indices of rating accuracy and the like. Although some type of ratee
reaction measure may be more reasonable, this criterion is also narrow and deals
with only one aspect of appraisals.
Early in the history of research on criteria for criteria, Thorndike (1949) reminded us of the importance of keeping the “ultimate criterion” in mind.
He defined this ultimate criterion as the “complete and final goal” of the assess-
ment or intervention being evaluated (p. 121). In the field of performance ap-
praisal, this “ultimate criterion” is an abstraction, in part because performance ap-
praisal has many goals and purposes in most organizations (Murphy et al., 2018).
Nevertheless, this abstraction is a useful one, in part because it reminds us that no
single measure or class of measure is likely to constitute an adequate criterion for
evaluating performance appraisal systems. Each individual criterion measure is
likely to have a certain degree of criterion overlap with the ultimate criterion (i.e.,
each taps some part of the ultimate criterion), but each is also likely to suffer from a
degree of criterion contamination (i.e., each measure is affected by things outside
of the ultimate criterion). The search for a single operational criterion for criteria
strikes us as pointless.
CONSTRUCT VALIDATION AS A
FRAMEWORK FOR ESTABLISHING
CRITERIA FOR CRITERIA
The framework we propose centers on the extent to which job performance measures reflect the desired constructs and fulfill their desired purposes. That is, in order to evaluate performance ratings and
performance appraisal systems, we have to first know what they are intended to
measure and to accomplish, then collect the widest array of relevant evidence,
then put that information together to draw conclusions about how well our perfor-
mance measures reflect the constructs they are designed to reflect and achieve the
goals they are designed to accomplish.
Construct Explication
Construct explication is the process of defining the meaning and the correlates
of the construct one wishes to measure (Cook & Campbell, 1979; Shadish, Cook,
& Campbell, 2001). Applying this notion to performance appraisal systems in-
volves answering three questions, two of which focus on performance itself and
the last of which focusses on the purpose of performance appraisal systems in or-
ganizations: (1) what is performance? (2) what are its components? and (3) what
are we trying to accomplish with a PA system? We can begin by drawing upon
existing, well-researched models of the domain of job performance (Campbell,
1990; Campbell, McCoy, Oppler, & Sager, 1993) to answer the first two ques-
tions, although we also propose a general definition of job performance as the to-
tal value of the contribution of a person to the value of the organization, over a de-
fined period of time. This broad definition, however, requires further explication.
Campbell (1990) suggested that there were eight basic dimensions of job per-
formance that applied to most jobs, so that job performance could be defined as
how well an employee performed each. These were: job-specific task proficiency
(tasks that make up the core technical requirements of a job); non-job-specific
task proficiency (tasks not specific to the job but required by all jobs in the organi-
zation); written and oral communications; demonstrating effort (how committed a
person is to job tasks and how persistently and intensely they work at those tasks);
maintaining personal discipline (avoiding negative behavior at work); facilitating team and peer performance (support, help, and development); supervision (in-
fluencing subordinates); and management and administration (non-supervisory
functions of management including goal setting).
Subsequent discussions (e.g., Motowidlo & Kell, 2013), expanded the crite-
rion space to include contextual performance (behavior that contributes to or-
ganizational effectiveness through its effects on the psychological, social, and
organizational context of work, but is not necessarily part of any person’s formal
job description), counterproductive performance (behaviors that are carried out to
hurt and hinder effectiveness and have negative expected organizational value),
and adaptive performance (which includes the ability to transfer training/learning
from one task to another, coping and emotional adjustment, and showing cultural
adaptability).
Assessing the degree to which an appraisal instrument captures the critical as-
pects of job performance is largely an issue of content validity. Although content
validity has traditionally been used in connection with validating tests, it clearly
applies to evaluating appraisal instruments as well. In the case of appraisal instru-
ments this would mean the extent to which the content of the appraisal instrument
overlaps with defined performance on the job in question. Thus, the issue would
be assessing whether or not the appraisal instrument captures all the aspects of job performance discussed above. This type of assessment is likely to rely on expert judgment, but there are many tools that can be applied to bring rigor to these judgments. Lawshe (1975) first proposed a quantitative approach for assessing the degree of agreement among those experts, resulting in the Content Validity Index (CVI). Subsequent research (e.g., Polit, Beck, & Owen, 2007) supports the usefulness of this index as a means of assessing content validity, and it could be used with regard to appraisal instruments as well.
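As an illustration of how such expert judgments can be quantified, the sketch below applies Lawshe's content validity ratio (CVR) to the dimensions of an appraisal instrument and summarizes it as a mean CVR; treating the CVI as the mean CVR is one common variant, and the expert counts in the example are invented.

```python
# Lawshe's (1975) content validity ratio (CVR), summarized here as a mean CVR;
# treating the CVI as the mean CVR is one common variant, and the expert
# counts in the example are invented for illustration.
def content_validity_ratio(n_essential, n_experts):
    """CVR = (n_e - N/2) / (N/2); ranges from -1 to +1."""
    return (n_essential - n_experts / 2) / (n_experts / 2)

def content_validity_index(essential_counts, n_experts):
    """Mean CVR across the dimensions of an appraisal instrument."""
    cvrs = [content_validity_ratio(k, n_experts) for k in essential_counts]
    return sum(cvrs) / len(cvrs)

# Ten experts judge whether each of four performance dimensions is essential:
print(content_validity_index([10, 9, 7, 5], n_experts=10))  # -> 0.55
```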
Addressing the third question requires knowledge of the context in which work
is performed and the goals of the organization in creating and implementing the
appraisal system (Murphy, Cleveland, & Hanscom, 2018). This involves consid-
eration of the reasons why organizations conduct appraisals and the ways in which
they use appraisal information. The model suggested by Cleveland, Murphy, and
Williams (1989) is particularly useful in this regard. Those authors distinguish
among: between-person distinctions (e.g., who gets a raise or is promoted); with-
in-person distinctions (e.g., identification of training needs); systems maintenance
(e.g., evaluating HR systems); and documentation (e.g., justification for personnel
actions). Of course, in most organizations, appraisal information will be used for
several (if not all) of these purposes, but it is important to assess the effectiveness
of appraisal systems for each purpose for which information is used.
Evidence
There are many types of evidence that are relevant for evaluating performance
measures, and we discuss a number of these, but it is surely the case that there are
other types of evidence that could be collected as well. But perhaps the most basic
type of evidence could be derived by simply examining the actual content of the
rating scales used to assess performance. This content should be based upon care-
ful job analysis that provides clear and unambiguous definitions of performance
dimensions that are related to the job in question. The basic dimensions suggested
by Campbell (1990), and discussed above would provide a good starting point,
although adding aspects of performance such as contextual performance, coun-
terproductive performance and adaptive performance would help ensure a more
complete view of a person’s contribution to the organization. These dimensions
might be expressed in terms of behaviors, goals, or outcomes, but arguing that
personality traits or attitudes are related to these performance dimensions requires
an extra step and an extra set of assumptions.
Evidence could also be collected by assessing the convergent validity of vari-
ous measures of performance. The assessment of convergent validity is common-
ly a part of any base of evidence for construct validity and is concerned with the
extent to which different measures, claiming to assess the same construct, are
related to each other. In the case of performance appraisals, these “other” mea-
sures might include objective measures of performance, in situations where such
measures are possible. In fact, there is evidence to suggest that performance rat-
ings and objective performance measures are related (corrected correlations in the
.30s and .40s), but not substitutable (e.g., Bommer, Johnson, Rich, Podsakoff, &
MacKenzie, 1995; Conway, Lombardo, & Sanders, 2001; Heneman, 1986; Mabe
& West, 1982).
We could also approach convergent validation by comparing ratings of the
same person, using the same scale, but provided by different raters. This could be
viewed as the interrater agreement criterion discussed earlier, but those measures
typically involved multiple raters at the same level. The notion of 360 degree rat-
ings (or multi-source ratings) assumes that raters who have different relationships
with a ratee might evaluate that ratee differently (otherwise there would be no
reason to ask for ratings from different sources) and the level of agreement across
sources is seen as an important component of the effectiveness of these systems
(e.g., Atwater & Yammarino, 1992). In general, data suggest that ratings from dif-
ferent sources are related, but not highly correlated so that the rating source has
an important effect on ratings (e.g., Harris & Schaubroeck, 1988; Mount, Judge,
Scullen, Sytsma, & Hezlett, 1998). Woehr, Sheehan, and Bennett (2005) also reported a strong effect for rating source, although they did find that the effects of
performance dimensions were the same across sources.
In both cases, there is surely some question about whether these different mea-
sures actually purport to measure the same things. Objective performance mea-
sures typically assess output only. It is possible that an employee’s performance is
more than just the number of units sold or produced. Nevertheless, evidence that
objective and subjective assessments of performance and effectiveness converge
can represent an important aspect of the validation of an appraisal system.
It is also possible to assess construct validity by examining evidence of cri-
terion-related validity. Performance ratings are among the most commonly used
criteria for validating selection tests. There is a large body of data demonstrating
that tests designed to measure job-relevant abilities and skills are consistently
correlated with ratings of job performance (cf., Schmidt & Hunter, 1998; Woehr
& Roch, 2016). We typically think of these data as evidence for the validity of the
selection tests rather than for the performance ratings, but they can be used for
both. That is, if there is a substantial body of evidence demonstrating that predic-
tors of performance that should be related to job performance measures actually
are related to performance ratings (and there is such a body of evidence) then
performance ratings are likely to be capturing at least some part of the construct
of job performance.
Another way of gathering evidence about the construct validity of perfor-
mance ratings is to determine whether ratings have consistent meanings across
contexts or cultures. Performance appraisals are used in numerous countries and cultures, and a recurring concern is that ratings may be biased against members of particular demographic groups, as some have suggested. In fact, several review authors have concluded that bias
is not a significant issue in most appraisals (e.g., Arvey & Murphy, 1998; Bass &
Turner, 1973; Baxter, 2012; Bowen, Swim, & Jacobs, 2000; DeNisi & Murphy,
2017; Kraiger & Ford, 1985; Landy, Shankster, & Kohler, 1994; Pulakos, White,
Oppler, & Borman, 1989; Waldman & Avolio, 1991). Studies using laboratory
methods (e.g., Hamner, Kim, Baird, & Bigoness, 1974; Rosen & Jerdee, 1976;
Schmitt & Lappin, 1980) are more likely to report demographic differences in
ratings, especially when those studies involve vignettes rather than observations
of actual performance, but these biases do not appear to be substantial in ratings
collected in the field (see the meta-analytic results reported by Murphy,
Herr, Lockhart, & Maguire, 1986). This is not to say that there are not situa-
tions where bias is very real and very serious (e.g., Heilman & Chen, 2005), but
the general hypothesis that performance ratings are substantially biased against
women, members of minority groups, older workers or disabled workers does not
seem credible (DeNisi & Murphy, 2017; Murphy et al., 2018). On the whole, the
lack of substantial bias typically encountered in performance appraisals can be
considered as evidence in favor of the construct validity of performance ratings.
Finally, evidence regarding employee reactions to appraisals and perceptions
that the ratings are fair would be worth collecting. As noted earlier, the research
focusing on ratee reactions and perceptions of fairness has a reasonably long his-
tory (e.g., Landy, Barnes, & Murphy, 1978; Landy, Barnes-Farrell, & Cleveland,
1980), and continues to be studied as an important part of the entire performance
management process (cf., Folger, Konovsky, & Cropanzano, 1992; Greenberg,
1986, 1987; Greenberg & Folger, 1983; Taylor, Tracy, Renard, Harrison, & Car-
roll, 1995). But, since ratee reactions are seen as mediating variables relating to
ratee motivation to improve performance and, ultimately, to actual performance
improvement, data on reactions should be collected in conjunction with data on
actual performance improvement.
Synthesis
Synthesizing evidence from all (or even many) of these sources is a non-trivial
task. As with all construct validation efforts, the process will take time
and effort, and will not be a one-step evaluation process. Also, the construct vali-
dation process will involve continuing efforts to collect evidence so that we may
become more and more certain about any conclusions reached. In any case, the
process requires the accumulation of evidence and the judgment as to how strong
a case has been made for construct validity. Since the final assessment will neces-
sarily be a matter of judgment, it is clear that there are a number of issues that will
need to be addressed.
One such issue is the determination of how much evidence is enough. Obvi-
ously, more evidence is always preferable, but collecting more evidence may not always be practical. Therefore, the question remains as to how many “pieces” of evidence are needed to make a convincing case. The amount of evidence needed may also be a function of whether or not all the available evidence comes to the same conclusion. That is, it may be the case that relatively few pieces of evidence are sufficient if they all indicate that the appraisal instrument has sufficient content validity. But what if there is no consensus with regard to the
evidence?
Another important issue in developing a protocol for evaluating the construct validity of performance measures, therefore, is determining how to reconcile different streams of evidence that suggest different conclusions. First, there must be
a decision as to whether a case could be made for construct validity in the pres-
ence of any contradictory evidence. Then, assuming some contradictory evidence,
a decision must be made concerning how to weigh different types of evidence.
Earlier, in our discussion of traditional measures for evaluating appraisal instru-
ments, we noted that rating errors were not a good proxy for rating accuracy, and
probably not a good measure for evaluation at all. It would seem reasonable then,
that evidence relating to rating errors could be discounted in any analysis. But
what about assessing other types of evidence such as measurement equivalence
or source agreement, or the absence of bias? How much weight to give each of
these will ultimately be a judgment call, and the ability of anyone to make a case
for construct validity will depend largely upon one’s ability to make the case for
some differential weighting.
But there may be one type of evidence that can be given some precedence in
this process. We argue that, while organizations conduct appraisals for a number of reasons, ultimately they conduct appraisals in the hope of helping employees to improve their performance. Therefore, some deference should be shown to evidence that supports this improvement. That is, if there is evidence that implementing an appraisal system has resulted in a true improvement in individual performance, this should be given a fair amount of weight in supporting the construct validity of the system. Furthermore, evidence that the appraisal system has also resulted in true improvement in performance at the level of the firm should be given even more weight. We note, however, that evidence clearly linking improvements in individual-level performance with improvements in firm-level performance is extremely rare (cf., DeNisi & Murphy, 2017).
So, where do we go from here? We believe that one of the major reasons for the
recurring failure in the century-long search for “criteria for criteria” is the tenden-
cy to limit this search to a single class of measures, such as inter-rater agreement
measures, rater error scores, indices of rating accuracy and the like. Some of these
measures have serious enough problems that they probably should not be used at
all, but, even if we accept that some of these measures provide us some insight
as to the usefulness of appraisal systems, they can only tell us part of the story.
Instead, we have proposed reframing the criteria we use to evaluate measures of
job performance in terms of the way we evaluate other measures of important
constructs—i.e., through the lens of construct validation.
But the approach we have proposed suggests that the evaluation process will be complex. It requires collecting different types of data, where each data source can tell us something about the effectiveness of appraisal systems, but only when we combine these different sources will we begin to get a true picture of effectiveness. We have discussed a number of such data sources, which we have termed sources of evidence of construct validity, and research needs to continue to identify and refine these sources of evidence. Research needs to more fully
examine issues of convergence across rating sources. There is evidence to suggest
that ratings of the same person, from different rating sources, are correlated, but
are not substitutable. Is this because of measurement error, bias, or is it because
raters who have different relationships with a ratee observe different behaviors?
Perhaps peers, supervisors, subordinates, etc. see similar things but apply differ-
ent standards in evaluating what they see. Determining the source of the disagreement may help us to establish the upper bound of agreement that could be expected, so that we can more accurately assess convergence across sources.
More information about equivalence of ratings across cultures and contexts is
also needed. This type of research may require special efforts to overcome the ef-
fects of language differences, as well as differences in definitions across cultures.
For example, Farh, Earley, and Lin (1997) examined how American and Chinese
workers viewed the idea of organizational citizenship behavior (OCB). They found that it was necessary to go beyond the mere translation of OCB scales developed in the West. Instead, they generated a Chinese definition of OCB and found that measures of this Chinese version of OCB displayed the same relations with various justice measures as the U.S.-based measures did. But they also found that the
translated measures did not display the same relations. They concluded that citi-
zenship behavior was as important for the Chinese sample as it was for the U.S.
sample, but that the two groups defined citizenship in slightly different ways, and
it was important to respect these differences when comparing results. Therefore, it
may not be enough to simply translate appraisal instruments in order to assess equivalence across cultures. On the other hand, at some point the conceptualizations may be so different as to suggest that there is really no equivalence.
These issues require a great deal of further research.
We noted that, although there is evidence of different types of bias in perfor-
mance ratings, these biases actually explained only small amounts of variance in
actual ratings. It is important to obtain clear estimates of how important bias may
be for ratings in different settings. This too may allow us to set upper bounds to help interpret data on bias, and it will also help to identify cases where bias is more serious and to determine what to do in such situations.
REFERENCES
Aguinis, H., O’Boyle, E., Gonzalez-Mulé, E., & Joo, H. (2016). Cumulative advantage:
Conductors and insulators of heavy-tailed productivity distributions and productiv-
ity stars. Personnel Psychology, 69, 3–66.
Arvey, R., & Murphy, K. (1998). Personnel evaluation in work settings. Annual Review
of Psychology, 49, 141–168.
Atwater, L. E., & Yammarino, F. J. (1992). Does self–other agreement on leadership per-
ceptions moderate the validity of leadership and performance predictions? Person-
nel Psychology, 45, 141–164.
Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Ap-
plied Psychology, 77, 836–874.
Balzer, W. K., & Sulsky, L. M. (1992). Halo and performance appraisal research: A criti-
cal examination. Journal of Applied Psychology, 77, 975–985.
Bass, A. R., & Turner, J. N. (1973). Ethnic group differences in relationships among crite-
ria of job performance. Journal of Applied Psychology, 57, 101–109.
Baxter, G. W. (2012). Reconsidering the black-white disparity in federal performance rat-
ings. Public Personnel Management, 41, 199–218.
Beck, J. W., Beatty, A. S., & Sackett, P. R. (2014). On the distribution of job performance:
The role of measurement characteristics in observed departures from normality.
Personnel Psychology, 67, 531–566.
Becker, B. E., & Cardy, R. L. (1986). Influence of halo error on appraisal effectiveness:
A conceptual and empirical reconsideration. Journal of Applied Psychology, 71,
662–671.
Bendig, A. W. (1953). The reliability of self-ratings as a function of the amount of verbal
anchoring and the number of categories on the scale. Journal of Applied Psychol-
ogy, 37, 38–41.
Bento, R. F., White, L. F. & Zacur, S. R. (2012). The stigma of obesity and discrimination
in performance appraisal: A theoretical model. International Journal of Human
Resource Management, 23, 3196–3224.
Bernardin, H. J., & Beatty, R. W. (1984). Performance appraisal: Assessing human be-
havior at work. Boston, MA: Kent.
Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Man-
agement Review, 6, 205–212.
Bingham, W. V. (1939). Halo, invalid and valid. Journal of Applied Psychology, 23, 221–
228.
Blanz, F., & Ghiselli, E. E. (1972). The mixed standard scale: A new rating system. Per-
sonnel Psychology, 25, 185–200.
Bommer, W. H., Johnson, J. L., Rich, G. A., Podsakoff, P. M., & MacKenzie, S. B. (1995).
On the interchangeability of objective and subjective measures of employee perfor-
mance: A meta-analysis. Personnel Psychology, 48, 587–605.
Borman, W. C. (1977). Consistency of rating accuracy and rating errors in the judgment
of human performance. Organizational Behavior and Human Performance, 20,
238–252.
Borman, W. C. (1978). Exploring the upper limits of reliability and validity in job perfor-
mance ratings. Journal of Applied Psychology, 63, 135–144.
Borman, W. C. (1979). Format and training effects on rating accuracy and rater errors.
Journal of Applied Psychology, 64, 410–421.
Borman, W. C. (1991). Job behavior, performance, and effectiveness. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (pp. 271–326). Palo Alto, CA: Consulting Psychologists Press.
Bowen, C., Swim, J. K., & Jacobs, R. (2000). Evaluating gender biases on actual job per-
formance of real people: A meta-analysis. Journal of Applied Social Psychology,
30, 2194–2215.
Bretz, R. D., Milkovich, G. T., & Read, W. (1992). The current state of performance ap-
praisal research and practice: Concerns, directions, and implications. Journal of
Management, 18, 321–352.
Campbell J. P. (1990). Modeling the performance prediction problem in industrial and
organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook
of industrial and organizational psychology (Vol. 1, pp. 687–732). Palo Alto, CA:
Consulting Psychologists Press.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of perfor-
mance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations
(pp. 35–70). San Francisco, CA: Jossey-Bass.
Cardy, R. L., & Dobbins, G. H. (1986). Affect and appraisal accuracy: Liking as an in-
tegral dimension in evaluating performance. Journal of Applied Psychology, 71,
672–678.
Cleveland, J. N., Murphy, K. R., & Williams, R. E. (1989). Multiple uses of performance
appraisal: Prevalence and correlates. Journal of Applied Psychology, 74, 130–135.
Colella, A., DeNisi, A. S., & Varma, A. (1998). The impact of ratee’s disability on perfor-
mance judgments and choice as partner: the role of disability-job fit stereotypes and
interdependence of rewards. Journal of Applied Psychology, 83, 102–111.
Conway, J. M. (1998). Understanding method variance in multitrait-multirater perfor-
mance appraisal matrices: Examples using general impressions and interpersonal
affect as measured method factors. Human Performance, 11, 29–55.
Conway, J. M., & Huffcutt, A. I. (1997). Psychometric properties of multisource perfor-
mance ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings.
Human Performance, 10, 331–360.
Conway, J. M., Lombardo, K., & Sanders, K. C. (2001). A meta-analysis of incremental
validity and nomological networks for subordinate and peer rating. Human Perfor-
mance, 14, 267–303.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues
for field settings. Boston, MA: Houghton Mifflin Company.
Cooper, W. H. (1981a). Conceptual similarity as a source of illusory halo in job perfor-
mance ratings. Journal of Applied Psychology, 66, 302–307.
Cooper, W. H. (1981b). Ubiquitous halo. Psychological Bulletin, 90, 218–244.
Cronbach, L. J. (1955). Processes affecting scores on “understanding of others” and “as-
sumed similarity.” Psychological Bulletin, 52, 177–193.
Cronbach, L. J. (1990). Essentials of psychological testing. New York, NY: Harper and
Row.
Czajka, J. M., & DeNisi, A. S. (1988). The influence of ratee disability on performance
ratings: The effects of unambiguous performance standards. Academy of Manage-
ment Journal, 31, 394–404.
128 • ANGELO S. DENISI & KEVIN R. MURPHY
DeNisi, A. S., & Gonzalez, J. A. (2004). Design performance appraisal to improve per-
formance appraisal. In E. A. Locke (Ed.) The Blackwell handbook of principles of
organizational behavior (Updated version, pp. 60–72). London, UK: Blackwell
Publishers.
DeNisi, A. S., & Murphy, K. R. (2017). Performance appraisal and performance manage-
ment: 100 Years of progress? Journal of Applied Psychology, 102, 421–433.
DeNisi, A. S., & Peters, L. H. (1996). Organization of information in memory and the
performance appraisal process: evidence from the field. Journal of Applied Psy-
chology, 81, 717.
DeNisi, A. S., Robbins, T., & Cafferty, T. P. (1989). Organization of information used for
performance appraisals: Role of diary-keeping. Journal of Applied Psychology, 74,
124–129.
DeNisi, A. S., Robbins, T. L., & Summers, T. P. (1997). Organization, processing, and the
use of performance information: A cognitive role for appraisal instruments. Journal
of Applied Social Psychology, 27, 1884–1905.
DeNisi, A. S., & Sonesh, S. (2011). The appraisal and management of performance at
work. In S. Zedeck (Ed.), Handbook of industrial and organizational psychology
(pp. 255–280). Washington, DC: APA Press.
Dierdorff, E. C., & Surface, E. A. (2007). Placing peer ratings in context: systematic influ-
ences beyond ratee performance. Personnel Psychology, 60, 93–126.
Farh, J., Earley, P.C., & Lin, S. 1997). Impetus for action: A cultural analysis of justice
and organizational citizenship behavior in Chinese society. Administrative Science
Quarterly, 42, 421–444.
Fleenor, J.W., Fleenor, J.B., & Grossnickle, W.F. (1996). Interrater reliability and agree-
ment of performance ratings: A methodological comparison. Journal of Business
and Psychology, 10, 367–38.
Folger, R., Konovsky, M. A., & Cropanzano, R. (1992). A due process metaphor for per-
formance appraisal. Research in Organizational Behavior, 14, 129–129.
Greenberg J. (1986) Determinants of perceived fairness of performance evaluations.
Journal of Applied Psychology, 71, 340–342.
Greenberg, J. (1987). A taxonomy of organizational justice theories. Academy of Man-
agement Review, 12, 9–22.
Greguras, G. J. (2005). Managerial experience and the measurement equivalence of per-
formance ratings. Journal of Business and Psychology, 19, 383–397.
Greguras, G. J., & Robie, C. (1998). A new look at within-source interrater reliability of
360-degree feedback ratings. Journal of Applied Psychology, 83, 960–968.
Greguras, G. J., Robie, C., Schleicher, D. J., & Goff, M. (2003). A field study of the effects
of rating purpose on the quality of multisource ratings. Personnel Psychology, 56,
1–21.
Hamner, W. C., Kim, J. S., Baird, L., & Bigoness, W. J. (1979). Race and sex as determi-
nants of ratings by potential employers in a simulated work-sampling task. Journal
of Applied Psychology, 59, 705–711.
Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisory, self-peer,
and peer-subordinate ratings. Personnel Psychology, 41, 43–62.
Evaluating Job Performance Measures • 129
Heilman, M. E., & Chen, J. J. (2005). Same behavior, different consequences: reactions to
men’s and women’s altruistic citizenship behavior. Journal of Applied Psychology,
90, 431–441.
Heneman, R. L. (1986). The relationship between supervisory ratings and results-oriented
measures of performance: A meta-analysis. Personnel Psychology, 39, 811–826.
Hoffman, B. J., Lance, C. E., Bynum, B., & Gentry, W. A. (2010). Rater source effects are
alive and well after all. Personnel Psychology, 63, 119–151.
Hoffman, B. J., & Woehr, D. J. (2009). Disentangling the meaning of multisource perfor-
mance rating source and dimension factors. Personnel Psychology, 62, 735–765.
Ilgen, D. R. (1993). Performance appraisal accuracy: An elusive and sometimes mis-
guided goal. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and
assessment: Industrial and organizational perspectives (pp. 235–252). Hillsdale,
NJ: Erlbaum.
Ilgen, D. R., Barnes-Farrell, J. L., & McKellin, D. B. (1993). Performance appraisal pro-
cess research in the 1980s: What has it contributed to appraisals in use? Organiza-
tional Behavior and Human Decision Processes, 54, 321–68.
Jennings, T., Palmer, J. K., & Thomas, A. (2004). Effects of performance context on pro-
cessing speed and performance ratings. Journal of Business and Psychology, 18,
453–463.
Joo, H., Aguinis, H., & Bradley, K. J. (2017). Not all non-normal distributions are cre-
ated equal: Improved theoretical and measurement precision. Journal of Applied
Psychology, 102, 1022–1053.
Kasten, R., & Nevo, B. (2008). Exploring the relationship between interrater correlations
and validity of peer ratings. Human Performance, 21, 180–197.
Kingsbury, F. A (1922). Analyzing ratings and training raters. Journal of Personnel Re-
search, 1, 377–382.
Kingsbury, F. A. (1933). Psychological tests for executives. Personnel, 9, 121–133.
Kluger, A. N., & DeNisi, A. S. (1996). The effects of feedback interventions on perfor-
mance: Historical review, meta-analysis, and a preliminary feedback intervention
theory. Psychological Bulletin, 119, 254–284.
Kraiger, K., & Ford, J. K. (1985). A meta-analysis of ratee race effects in performance
ratings. Journal of Applied Psychology, 70, 56–65.
Lance, C. E. (1994). Test of a latent structure of performance ratings derived from Wher-
ry’s (1952) theory of rating. Journal of Management, 20, 757–771.
Lance, C. E., Baranik, L. E., Lau, A. R., & Scharlau, E. A. (2009). If it ain’t trait it must
be method: (mis)application of the multitrait-multimethod design in organizational
research. In C. E. Lance & R. L. Vandenberg (Eds.), Statistical and methodological
myths and urban legends (pp. 227–360). New York, NY: Routledge.
Lance, C. E., LaPointe, J. A., & Stewart, A. M. (1994). A test of the context dependen-
cy of three causal models of halo rater error. Journal of Applied Psychology, 79,
332–340.
Lance, C. E., Teachout, M. S., & Donnelly, T. M. (1992). Specification of the criterion
construct space: An application of hierarchical confirmatory factor analysis. Jour-
nal of Applied Psychology, 77, 437–452.
130 • ANGELO S. DENISI & KEVIN R. MURPHY
Landy, F. J. (2010). Performance ratings: Then and now. In J.L. Outtz (Ed.). Adverse
impact: Implications for organizational staffing and high-stakes selection (pp.
227–248). New York, NY: Routledge.
Landy, F. J., Barnes, J., & Murphy, K. R. (1978). Correlates of perceived fairness and
accuracy of performance appraisals. Journal of Applied Psychology, 63, 751–754.
Landy, F. J., Barnes-Farrell, J., & Cleveland, J. (1980). Perceived fairness and accuracy of
performance appraisals: A follow-up. Journal of Applied Psychology, 65, 355–356.
Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72–107.
Landy, F. J., Shankster, L. J., & Kohler, S. S. (1994). Personnel selection and placement.
Annual Review of Psychology, 45, 261–296.
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology,
28, 563–575.
LeBreton, J. M., Scherer, K. T., & James, L. R. (2014). Corrections for criterion reli-
ability in validity generalization: A false prophet in a land of suspended judgment.
Industrial and Organizational Psychology: Perspectives on Science and Practice,
7, 478–500.
Mabe, P. A., & West, S. G. (1982). Validity of self-evaluation of ability: A review and
meta-analysis. Journal of Applied Psychology, 67, 280–290.
McIntyre, R. M., Smith, D., & Hassett, C. E. (1984). Accuracy of performance ratings
as affected by rater training and perceived purpose of rating. Journal of Applied
Psychology, 69, 147–156.
Milkovich, G. T., & Wigdor, A. K. (1991). Pay for performance. Washington, DC: Na-
tional Academy Press.
Motowidlo, S. J., & Kell, H. J. (2013). Job Performance. In N. W. Schmitt & S. Highhouse
(Eds.), Comprehensive handbook of psychology, Volume 12: Industrial and organi-
zational psychology (2nd ed., pp. 82–103). New York, NY: Wiley.
Mount, M. K., Judge, T. A., Scullen, S. E., Sytsma, M. R., & Hezlett, S. A. (1998). Trait,
rater, and level effects in 360-degree performance ratings. Personnel Psychology,
51, 557–576.
Murphy, K. R, (1991). Criterion issues in performance appraisal research. Behavioral ac-
curacy vs. classification accuracy. Organizational Behavior and Human Decision
Processes, 50, 45–50.
Murphy, K. R. (2008). Explaining the weak relationship between job performance and rat-
ings of job performance. Industrial and Organizational Psychology: Perspectives
on Science and Practice, 1, 148–160.
Murphy, K. R., & Anhalt, R. L. (1992). Is halo error a property of the rater, ratees, or the
specific behaviors observed? Journal of Applied Psychology, 77, 494–500.
Murphy, K. R., & Balzer, W. K. (1986). Systematic distortions in memory-based behavior
ratings and performance evaluations: Consequences for rating accuracy. Journal of
Applied Psychology, 71, 39–44.
Murphy, K. R., & Balzer, W. K. (1989). Rater errors and rating accuracy. Journal of Ap-
plied Psychology, 74, 619–624.
Murphy, K. R., Balzer, W. K., Kellam, K. L., & Armstrong, J. (1984). Effect of purpose of
rating on accuracy in observing teacher behavior and evaluating teaching perfor-
mance. Journal of Educational Psychology, 76, 45–54.
Evaluating Job Performance Measures • 131
Saal, F. E., Downey, R. C., & Lahey, M. A. (1980). Rating the ratings: Assessing the qual-
ity of rating data. Psychological Bulletin, 88, 413–428.
Sanchez, J. I., & De La Torre, P. (1996). A second look at the relationship between rating
and behavioral accuracy in performance appraisal. Journal of Applied Psychology,
81, 3–10.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in per-
sonnel psychology: Practical and theoretical implications of 85 years of research
findings. Psychological Bulletin, 124, 262–274.
Schmitt, N., & Lappin, M. (1980). Race and sex as determinants of the mean and variance
of performance ratings. Journal of Applied Psychology, 65, 428–435.
Schmidt, F. L.,Viswesvaran, C., & Ones, D. S. (2000). Reliability is not validity and valid-
ity is not reliability. Personnel Psychology, 53, 901–912.
Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job
performance ratings. Journal of Applied Psychology, 85, 956–970.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2001). Experimental and quasi-experi-
mental designs for generalized causal inference. Boston, MA: Houghton-Mifflin.
Smith, P. C. (1976). Behaviors, results, and organizational effectiveness. In M. Dunnette
(Ed.), Handbook of industrial and organizational psychology. Chicago, IL: Rand-
McNally.
Smith, P. C., & Kendall, L. M. (1963). Retranlsation of expectations: An approach to the
construction of unambiguous anchors for rating scales. Journal of Applied Psychol-
ogy, 47, 149–155.
Solomonson, A. L., & Lance, C. E. (1997). Examination of the relationship between true
halo and halo error in performance ratings. Journal of Applied Psychology, 82,
665–674.
Stone-Romero, E. F., Alvarez, K., & Thompson, L. F. (2009). The construct validity of
conceptual and operational definitions of contextual performance and related con-
structs. Human Resource Management Review, 19, 104–116.
Sulsky, L. M., & Balzer, W. K. (1988). Meaning and measurement of performance rat-
ing accuracy: Some methodological and theoretical concerns. Journal of Applied
Psychology, 73, 497–506.
Taylor, M. S., Tracy, K. B., Renard, M. K., Harrison, J. K., & Carroll, S. J. (1995). Due
process in performance appraisal: A quasi-experiment in procedural justice. Ad-
ministrative Science Quarterly, 495–523.
Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied
Psychology, 4, 25–29.
Thorndike, R. L. (1949). Personnel selection. New York, NY: Wiley.
Valle, M., & Bozeman, D. (2002). Interrater agreement on employees’ job performance:
Review and directions. Psychological Reports, 90, 975–985.
Varma, A., DeNisi, A. S., & Peters, L. H. (1996). Interpersonal affect in performance ap-
praisal: A field study. Personnel Psychology, 49, 341–360.
Viswesvaran, C., Schmidt, F. L., & Ones, D. S. (2002). The moderating influence of job
performance dimensions on convergence of supervisory and peer ratings of job
performance: Unconfounding construct-level convergence and rating difficulty.
Journal of Applied Psychology, 87, 345–354.
Evaluating Job Performance Measures • 133
Waldman, D. A., & Avolio, B. J. (1991). Race effects in performance evaluations: Con-
trolling for ability, education, and experience. Journal of Applied Psychology, 76,
897–901.
Williams, K. J., DeNisi. A. S., Meglino, B. M., & Cafferty, T. P. (1986). Initial decisions
and subsequent performance ratings. Journal of Applied Psychology, 71, 189–195.
Woehr, D. J., & Roch, S. G. (2016).Of babies and bathwater: Don’t throw the measure out
with the application. Industrial and Organizational Psychology: Perspectives on
Science and Practice, 9, 357—361.
Woehr, D. J., Sheehan, M. K., & Bennett, W. (2005). Assessing measurement equivalence
across rating sources: A multitrait-multirater approach. Journal of Applied Psychol-
ogy, 90, 592–600.
CHAPTER 7
RESEARCH METHODS IN
ORGANIZATIONAL POLITICS
Issues, Challenges, and Opportunities
2018; Ferris & Hochwarter, 2011; Kacmar & Baron, 1999; Lux, Ferris, Brouer,
Laird, & Summers, 2008), investigating a myriad of substantive relations. These
reviews identified trends, critically examined foundational underpinnings, and
noted inconsistencies and possible causes (Chang, Rosen, & Levy, 2009). Em-
bedded in many of these summaries are critiques of research design as well as
recommendations for addressing existing methodological deficiencies (Ferris, El-
len, McAllister, & Maher, 2019). However, to our knowledge, there has been
no systematic examination of research method issues in organizational politics
scholarship to date. Therefore, we offer a detailed critique of issues, challenges,
and future directions of organizational politics research methods.
and ‘reputation’ are burgeoning areas of study that fit well within the organizational politics nomological network (Blom-Hansen & Finke, in press; Ferris et al., 2019).
In concept, political will has been around for some time (Mintzberg, 1983;
Treadway, Hochwarter, Kacmar, & Ferris, 2005). Historically, the term repre-
sented worker behaviors undertaken to sabotage the leader’s directives (Brecht,
1937). More recently, conceptual advancements have increased interest (Blickle, Schütte, & Wihler, 2018; Maher, Gallagher, Rossi, Ferris, & Perrewé, 2018), and publication of the Political Will Scale (PWS) has helped stimulate empirical research (Kapoutsis, Papalexandris, Treadway, & Bentley, 2017).
Organizational reputation is not a new concept (Bromley, 1993; O’Shea,
1920). As an example, McArthur (1917) argued: “Reputation is something that
you can’t value in dollars and cents, but is mighty precious just the same…” (p.
63). However, for such a foundational construct, relatively little theory and re-
search have been conducted on reputation in the organizational sciences (Ferris,
Blass, Douglas, Kolodinsky, & Treadway, 2003; Ferris, Harris, Russell, Ellen,
Martinez, & Blass, 2014). As far back as Tsui (1984), and extending to the present
day (Ferris et al., 2019), reputation in organizations has been construed as less
of an objectively scientific construct and more of a sociopolitical one (Ravasi,
Rindova, Etter, & Cornelissen, 2018). Hence, reputation’s inclusion as a facet of
organizational politics is entirely appropriate (Munyon, Summers, Thompson, &
Ferris, 2015; Zinko, Gentry, & Laird, 2016) given its influence (direct and indi-
rect) on both tactics (Ferris et al., 2017) and presentation acuity (Smith, Plowman,
Duchon, & Quinn, 2009).
al., 2009; Guo, Kang, Shao, & Halvorsen, 2019; Miller, Rutherford, & Kolodin-
sky, 2008).
Although similarities between POPs and the broader organizational politics
construct exist, researchers note that perceptions are always subjective evalua-
tions, whereas organizational politics are captured objectively (Ferris, Harrell-
Cook, & Dulebohn, 2000; Ferris et al., 2019). Because perceptions ostensibly
manufacture reality (Landry, 1969), what is seen is impactful (Lewin, 1936; Por-
ter, 1976) and capable of explaining affective, cognitive, and behavioral outcomes
at work (Ferris & Kacmar, 1992). Accordingly, we define POPs as an individual’s
idiosyncratic estimation and evaluation of others’ self-serving, or egocentric, be-
havior at work (Ferris et al., 1989; Ferris et al., 2000; Ferris & Kacmar, 1992).
Ferris et al. (1989) developed one of the first theoretical models of POPs, which
specified the antecedents, outcomes, and moderators within the nomological net-
work of POPs. A subsequent review expanded this set of related constructs (Ferris, Adams, Kolodinsky, Hochwarter, & Ammeter, 2002). Although no single study has tested every proposed link, general support has been found for these two guiding models, which together established the POPs nomological network.
Despite the strong theoretical rationale for previous antecedent models, theori-
zation concerning the link between POPs and organizational outcomes was large-
ly absent before Chang et al.’s (2009) meta-analytic examination. Their study
was one of the first to identify psychological mechanisms linking POPs to more
distal work outcomes (i.e., turnover intentions and performance). Chang et al.
(2009) found that psychological strain mediated the relation between perceptions
of organizational politics and performance, such that as POPs increased, so did
psychological strain, in turn reducing performance. Morale mediated the relations of POPs with performance and turnover intentions, albeit in a different fashion. Finally, one of
the most significant findings was the wide credibility intervals surrounding the
estimated effects of POPs on outcomes. This catalyzed the search for moderating
effects, which has dominated the POPs literature over the past decade.
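To make the mediation logic concrete, the following sketch (simulated data and our own variable names, not Chang et al.'s procedures) estimates the indirect effect of POPs on performance through psychological strain using two regressions and a percentile bootstrap.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 300

    # Simulated illustration of the mediation chain POPs -> strain -> performance.
    pops = rng.normal(size=n)
    strain = 0.5 * pops + rng.normal(size=n)
    perf = -0.4 * strain + rng.normal(size=n)

    def indirect_effect(pops, strain, perf):
        # a path: POPs -> strain (simple regression slope)
        a = np.polyfit(pops, strain, 1)[0]
        # b path: strain -> performance, controlling for POPs
        X = np.column_stack([np.ones_like(pops), strain, pops])
        b = np.linalg.lstsq(X, perf, rcond=None)[0][1]
        return a * b

    # Percentile bootstrap confidence interval for the indirect (a * b) effect.
    boot = []
    for _ in range(2000):
        idx = rng.integers(0, n, size=n)
        boot.append(indirect_effect(pops[idx], strain[idx], perf[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    est = indirect_effect(pops, strain, perf)
    print(f"indirect effect = {est:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")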
Measurement. In 1980, two independent sets of scholars made first efforts to
assess political perceptions at work. Gandz and Murray (1980) asked employees
to report on the amount of political communication existing in their organization,
as well as its influence in shaping work environments. Respondents also reported
the organizational levels where political activities were most prevalent and of-
fered opinions on the effectiveness of these behaviors. Furthermore, respondents
provided a specific situation indicative of “a good example of workplace politics
in action” (Gandz & Murray, 1980, p. 240).
Madison et al. (1980) captured POPs through detailed interviews with chief
executive officers, high staff managers, and supervisors. Specifically, participants
answered questions, via face-to-face interviews, and reported on the frequency of
politics across different functional areas. They also described, in an open-ended
tions. For example, some scholars already have defined the construct as the ac-
tive management of shared meaning (Ferris & Judge, 1991; Pfeffer, 1981), as
well as the effort to restore justice, attain resources and benefits for others, and/
or as a source of positive influence and change (Ellen, 2014; Hochwarter, 2012).
These views represent an initial benchmark for the construct’s future refinement
and measurement.
Furthermore, despite literature focusing on self-serving and proactive tactics,
reactive and defensive political strategies are also viable (Ashforth & Lee, 1990;
Valle & Perrewé, 2000). Landells and Albrecht (2017) interpreted and categorized
POPs into four levels. Those who perceived organizational politics as reactive regarded the behaviors as destructive and manipulative, whereas reluctant politics
represented a “necessary evil” (Landells & Albrecht, 2017, p. 41). Furthermore,
strategic behaviors accomplished goals, and integrated tactics benefited actors
when central to successful company functioning, activity, and decision-making.
These findings support claims for an expansion that captures a fuller content do-
main. We encourage the use of grounded theory investigations as theoretical start-
ing points for improving conceptualizations and psychometric treatments.
Also concerning is the lack of theorizing regarding how POPs affect indi-
vidual-, group-, and organizational-level outcomes. Although several conceptual
models have begun to specify the direct effects of POPs (e.g., Aryee, Chen, &
Budhwar, 2004; Ferris et al., 2002; Valle & Perrewé, 2000), few studies have
offered theoretical support for possible processes that indirectly link POPs to
employee and organizational outcomes. Exceptions include studies investigating
morale (Rosen, Levy, & Hall, 2006) and need satisfaction (Rosen, Ferris, Brown,
Chen, & Yan, 2014) as mediating mechanisms. Building on these studies, more
substantial theorization needs to explain how and why POPs are associated with
attitudes and behaviors at work (Chang et al., 2009) across organizational levels
(Adams, Ammeter, Treadway, Ferris, Hochwarter, & Kolodinsky, 2002; Dipboye
& Foster, 2002; Franke & Foerstl, 2018).
Whereas historically the organizational politics literature has focused predomi-
nantly on between-person variance in politics perceptions as a stable environmen-
tal factor (Rosen et al., 2016), it is highly possible that politics perceptions vary
throughout the day, week, or more broadly across time. As research on experi-
ence sampling methods continues (Matta, Scott, Colquitt, Koopman, & Passantino,
2017), it would be beneficial for researchers also to consider within-person varia-
tion in politics perceptions, and the antecedents that may result in such variance.
Assuming within-person variance exists, researchers would be drawing a broader
picture as to how politics perceptions are developed and modified across time.
Furthermore, given the importance of uncertainty in the larger politics litera-
ture, researchers also may want to consider whether within-person variability in
politics perceptions is more harmful than consistently perceiving politics. Perhaps politics manifests in ways similar to justice perceptions. Specifically, variable cues (sometimes good, sometimes bad) likely cause more disdain than consistent cues (always bad) (Matta et al., 2017).
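As a rough sketch of how such within-person variation could be quantified, the code below (simulated daily reports; the design and parameter values are our assumptions) partitions variance in repeated politics-perception reports into between- and within-person components via ICC(1).

    import numpy as np

    rng = np.random.default_rng(2)
    n_persons, n_days = 100, 10

    # Simulated daily POPs reports: a stable person-level component plus
    # day-to-day (within-person) fluctuation.
    person_mean = rng.normal(3.0, 0.5, size=n_persons)
    daily = person_mean[:, None] + rng.normal(0, 0.7, size=(n_persons, n_days))

    # One-way random-effects ANOVA decomposition, treating persons as groups.
    grand = daily.mean()
    msb = n_days * np.sum((daily.mean(axis=1) - grand) ** 2) / (n_persons - 1)
    msw = np.sum((daily - daily.mean(axis=1, keepdims=True)) ** 2) / (n_persons * (n_days - 1))
    icc1 = (msb - msw) / (msb + (n_days - 1) * msw)

    # ICC(1) is the share of variance between persons; 1 - ICC(1) is the
    # within-person share that experience sampling designs can exploit.
    print(f"ICC(1) = {icc1:.2f}; within-person share = {1 - icc1:.2f}")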
Political Behavior
Definition and Conceptualization. As stated by Mintzberg (1983, 1985), or-
ganizations are political arenas in which motivated and capable individuals enact
self-serving behavior. Although employees often perceive ‘office politics’ as be-
ing decidedly negative, political behavior can produce organizational and inter-
personal benefits when appropriately implemented (Treadway et al., 2005). Given
widespread disagreement on the implications of politics, and more specifically
political behavior, conceptualizations have varied over time and across studies
(Kidron & Vinarski-Peretz, 2018; Lampaki & Papadakis, 2018).
Generally, researchers agree that political behavior is normal, and sometimes,
an essential element of functioning (Zanzi & O’Neill, 2001). However, no agreed-
upon definition that captures the complexity of political action exists (Ferris et al.,
2019). Whereas most definitions posit political behavior as non-sanctioned activ-
ity within organizational settings (Farrell & Petersen, 1982; Gandz & Murray,
1980; Mintzberg, 1983; Schein, 1977), others focus on political behavior as an
interdependent social enactor-receiver relationship (Lepisto & Pratt, 2012; Sharf-
man, Wolf, Chase, & Tansik, 1988). Furthermore, some researchers classify influ-
ence tactics (Kipnis & Schmidt, 1988; Kipnis, Schmidt, & Wilkinson, 1980; Yukl
& Falbe, 1990), impression management (Liden & Mitchell, 1988; Tedeschi &
Melburg, 1984), and even voice (Burris, 2012; Ferris et al., 2019) as relevant for
effective operationalization of the politics construct.
Since its original operationalization, several conceptual models have emerged
to explain potential antecedents of political behavior. The first, developed by Por-
ter, Allen, and Angle (1981), argued that political behavior is, at least partially, a
function of Machiavellianism, locus of control, need for power, risk-seeking pro-
pensity, and a lack of personal power. Just over a decade later, Ferris, Fedor, and
King (1994) stated that political behavior is the result of Machiavellianism and lo-
cus of control, as in Porter et al.’s (1981) model, as well as self-monitoring,
a propensity unique to the Ferris et al. (1994) model.
Overall, empirical research investigating the antecedents of political behav-
ior has been inconclusive (Grams & Rogers, 1990; Vecchio & Sussman, 1991),
leading to calls for an expansion of the individual difference domain previously
specified (Ferris, Hochwarter, Douglas, Blass, Kolodinsky, & Treadway, 2002b).
In response, Treadway et al. (2005) conceptualized political behavior to include
motivational and achievement need components, and Ferris et al. (2019) concep-
tualized general political behavior as one of the multiple other political actions
that organizational members enact. We now briefly describe other forms of politi-
cal action conceptualized as being part of political behavior in organizations.
Influence tactics are specific strategies employed to obtain desired goals. De-
spite general disagreement regarding what types of influence tactics exist (Kipnis
et al., 1980; Kipnis & Schmidt, 1988; Yukl & Tracey, 1992), an extensive body
of literature has examined not only what tactics are most effective, but also the
boundary conditions affecting tactic success (e.g., frequency of influence, direc-
tion of influence, power distance between enactor and receiver, reason for in-
fluence attempt). As part of this trend, several meta-analytic studies have begun
to tease apart these direct and moderating implications (Barbuto & Moss, 2006;
Higgins, Judge, & Ferris, 2003; Lee, Han, Cheong, Kim, & Yun, 2017; Smith et
al., 2013).
Impression management reflects any political act designed to manage how one
is perceived (Tedeschi & Melburg, 1984; Tedeschi, Melburg, Bacharach, & Lawl-
er, 1984). Attempts at impression management fall into five primary categories,
including ingratiation, self-promotion, exemplification, supplication, and intimi-
dation (Jones & Pittman, 1982). Past work has categorized impression manage-
ment into two dimensions (i.e., tactical-strategic and assertive-defensive; Tedeschi
& Melburg, 1984). The tactical-strategic dimension considers whether short-term
or long-term purposes guide impression management. Moreover, the assertive-
defensive dimension determines if behavior escalates proactively or reactively to
situational contingencies. Although the common intention of impression manage-
ment is a favorable assessment, recent work reports that poorly executed tactics
can be detrimental for one’s social image (Bolino, Long, & Turnley, 2016).
Voice, a type of organizational citizenship behavior (OCB), is the expression
of effective solutions in response to perceived problems to improve a given situ-
ation (Li, Wu, Liu, Kwan, & Liu, 2014; Van Dyne & LePine, 1998). Voice is es-
sential for the management of shared meaning in organizational contexts (Ferris
et al., 2019), and represents a mechanism to advertise and promote personal opin-
ions and concerns (Burris, 2012). However, unlike many other forms of OCBs,
voice can be maladaptive for individuals enacting the behavior, as well as for
their coworkers and the organization as a whole (Turnley & Feldman, 1999). As
such, employee voice exemplifies a form of informal political behavior (Ferris et
al., 2019).
Measurement. Despite existing theoretical avenues within the political behav-
ior literature, there is still considerable disagreement surrounding construct defi-
nition and use in scholarly practice (Ferris et al., 2019), which impedes construct
validity. Given a general lack of operational and conceptual consensus, measures
of political behavior also have been limited and quite inconsistent. Whereas some
scholars have developed scales assessing general political behavior (Valle & Per-
rewé, 2000; Zanzi, Arthur, & Shamir, 1991), others have used impression man-
agement (Bolino & Turnley, 1999), influence tactics (Kipnis & Schmidt, 1988),
and voice (Van Dyne & LePine, 1998) as proxies for political behavior in organi-
zational settings.
The most commonly utilized measure of individual political behavior was de-
veloped by Treadway et al. (2005; α = .83). Six items captured general politicking
behavior toward goal attainment, interpersonal influence, accomplishment shar-
ing, and ‘behind the scenes’ political activity. Despite its widespread use since
the scale’s emergence, Treadway et al.’s (2005) measure has yet to undergo the
empirical rigor that traditional scale developments endure (Ferris et al., 2019).
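For readers less familiar with such internal-consistency estimates, a minimal sketch of computing coefficient alpha appears below; the simulated responses and helper function are ours and do not reproduce Treadway et al.'s (2005) items or results.

    import numpy as np

    def cronbach_alpha(items):
        # items: respondents x items matrix of scores
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Simulated six-item political behavior responses (1-5 scale) driven by a
    # single latent tendency, mimicking an internal-consistency check.
    rng = np.random.default_rng(3)
    latent = rng.normal(size=(400, 1))
    responses = np.clip(np.rint(3 + latent + rng.normal(0, 0.8, size=(400, 6))), 1, 5)
    print(f"coefficient alpha = {cronbach_alpha(responses):.2f}")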
Critique and Future Research Directions. Before empirical work on the
construct can continue, researchers need to develop a concise and agreed-upon operationalization of political behavior that includes traditional definitional components
while taking into consideration the importance of intentionality (Hochwarter,
2012), goal-directed activity and behavioral targets (Lepisto & Pratt, 2012), and
interpersonal dependencies (French & Raven, 1959). Furthermore, researchers
need to decide whether to expand political behavior to include concepts like influ-
ence tactics, impression management, and voice, or if each construct is unique
enough to hold an independent position within political behavior’s nomological network. Once the construct is better defined, and its related constructs identified, researchers will want to use this conceptualization to inform subsequent scale development efforts. We encourage researchers to cast a wide net when defining political behavior and its potential underlying dimensions.
Political behaviors reflect inherently non-sanctioned and self-serving actions
(Mitchell, Baer, Ambrose, Folger, & Palmer, 2018), triggering ostensibly adverse
outcomes. However, not all non-sanctioned behavior is aversive, nor is all self-serving behavior dysfunctional (Ferris & Judge, 1991; Zanzi & O’Neill, 2001). For
example, egotistic behavior may not be intrinsic to the actor. Instead, contexts in-
fused with threat often trigger self-serving motivations as a protective mechanism
(Lafrenière, Sedikides, & Lei, 2016; Von Hippel, Lakin, & Shakarchi, 2005). For
this reason, future research should expand conceptualizations and measurement to
include constructs predisposed to neutral and positive implications as well (Ellen,
2014; Fedor, Maslyn, Farmer, & Bettenhausen, 2008; Ferris & Treadway, 2012;
Hochwarter, 2012; Maslyn, Farmer, & Bettenhausen, 2017).
Furthermore, political behavior is a broad term encapsulating activity enacted
by different sources, including the self, others, groups, and organizations (Hill,
Thomas, & Meriac, 2016). Given its possible manifestations across organiza-
tional levels, future research must redefine the construct within the appropriate
and intended theoretical level. As part of this process, researchers also must con-
sider whether political behavior is a level-generic (or level-specific) phenomenon,
manifesting similarly (or differentially) across multiple hierarchies.
The objectionable and surreptitious nature of political behavior (Wickenberg
& Kylén, 2007) provokes the use of self-report measures prone to socially desir-
able responding. Alternative approaches, however, are likely unable to capture the extensiveness and frequency of political activity for the very same reasons. This conundrum is shared across disciplines (Reitz, Motti-Stefanidi, & Asendorpf, 2016; Zare & Flinchbaugh, 2019), as other-report indices are vulnerable to halo bias (Dalal, 2005). As researchers develop improved measures of political behavior, convergence (or divergence) across rating sources must be assessed to establish validity (Kruse, Chancellor, & Lyubomirsky, 2017).
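One simple way to begin examining such convergence is sketched below: correlate self-reports of political behavior with aggregated coworker reports of the same targets. The data are simulated, and the design choices (three coworkers per target, averaged other-reports) are our assumptions rather than an established protocol.

    import numpy as np

    rng = np.random.default_rng(4)
    n, n_coworkers = 200, 3

    # Simulated political behavior: a true score seen imperfectly by the self
    # (somewhat inflated) and by several coworkers (noisy, halo-prone).
    true_pb = rng.normal(size=n)
    self_report = 0.6 * true_pb + 0.4 + rng.normal(0, 0.8, size=n)
    coworker = 0.5 * true_pb[:, None] + rng.normal(0, 0.9, size=(n, n_coworkers))
    other_report = coworker.mean(axis=1)   # aggregate other-reports per target

    # Convergent validity evidence: the cross-source correlation.
    r = np.corrcoef(self_report, other_report)[0, 1]
    print(f"self-other convergence r = {r:.2f}")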
Political Skill
Definition and Conceptualization. Approximately 40 years ago, two inde-
pendent scholars concurrently introduced the political skill construct to the lit-
erature. Pfeffer (1981) defined political skill as a social effectiveness competency
allowing for the active execution of political behavior and attainment of power.
Mintzberg (1983, 1985) positioned the construct as the exercise of political in-
fluence (or interpersonal style) using manipulation, persuasion, and negotiation
through formal power. Despite the extensiveness of political activity in organiza-
tional settings, these initial works made little progress beyond the definition and
conceptualization stages.
However, over the past few decades, researchers have acknowledged the
prominence and importance of political acuity, social savviness, and social in-
telligence (Ferris, Perrewe, & Douglas, 2002; McAllister et al., 2018). Ahearn,
Ferris, Hochwarter, Douglas, and Ammeter (2004) provided an early effort at de-
lineating the political skill construct (then termed “social skill”), defined as “the
ability to effectively understand others at work, and to use such knowledge to
influence others to act in ways that enhance one’s personal and/or organizational
objectives” (Ahearn et al., 2004, p. 311). Additionally, political skill was argued
to encompass four critical underlying dimensional competencies, including (1)
social astuteness, (2) interpersonal influence, (3) networking ability, and (4) ap-
parent sincerity.
Social astuteness, or the ability to be self-aware and to interpret the behavior of
others accurately, is necessary for effective influence (Pfeffer, 1992). Individuals
possessing political skill are keen observers of social situations. Not only are they
able to accurately interpret the behavior of others, but also they can adapt socially
in response to what they perceive (Ferris, Treadway, Perrewé, Brouer, Douglas, &
Lux, 2007). This “sensitivity to others” (Pfeffer, 1992, p. 173) provides politically
skilled individuals the ability to understand the motivations of both themselves
and others better, making them useful in many political arenas.
Interpersonal influence concerns “flexibility,” or the successful adaptation of
behavior to different personal and situational contingencies to achieve desired
goals (Pfeffer, 1992). Individuals high in political skill exert powerful influence
through subtle and convincing interpersonal persuasion (Ferris et al., 2005, 2007).
Whereas Mintzberg (1983, 1985) defined political skill in terms of influence and
explicit formal power, Ahearn et al.’s (2004) definition does not include direct ref-
erences to formal authority (Perrewé, Zellars, Ferris, Rossi, Kacmar, & Ralston,
2004). Instead, this view focuses on influence originating from the selection of
appropriate communication styles relative to the context at hand, as well as suc-
cessful adaptation and calibration when tactics are ineffective.
Politically skilled individuals also are adept at developing and utilizing social
networks (Ferris et al., 2005, 2007). Not only are these networks extensive, but they also tend to include more valuable and influential
members. Such networking capabilities allow individuals high in political skill to
formulate robust and beneficial alliances and coalitions that offer further opportu-
nities to maintain, as well as develop, an increasingly more extensive social net-
work. Further, because these networks are strategically developed over time, the
politically skilled are better able to position themselves so as to take advantage of
available network-generated resources, opportunities, and social capital (Ahearn
et al., 2004; Pfeffer, 2010; Tocher, Oswald, Shook, & Adams, 2012).
The last characteristic politically skilled individuals possess is apparent sin-
cerity. That is, they are, or at least appear to be, genuine in their intentions when engaging in political behaviors (Douglas & Ammeter, 2004). Sincerity is essential given that influence attempts succeed only when the intention appears devoid of ulterior or manipulative motives (Jones, 1990). Thus, perceived intentions may matter more than actual intentions for inspiring behavioral change and confidence in others.
Subsequently, Ferris et al. (2007) provided a systematic conceptualization of
political skill grounded in social-political influence theory. As part of this concep-
tualization, they characterized political skill as “a comprehensive pattern of social
competencies, with cognitive, affective, and behavioral manifestations” (Ferris
et al., 2007, p. 291). Specifically, they argued that political skill operated on self,
others, and group/organizational processes. Their model identified five anteced-
ents of political skill, including perceptiveness, control, affability, active influ-
ence, and developmental experiences. Munyon et al. (2015) extended this model
to encapsulate the effect of political skill on self-evaluations and situational ap-
praisals (i.e., intrapsychic processes), situational responses (i.e., behavioral pro-
cesses), as well as evaluations by others and group/organizational processes (i.e.,
interpersonal processes). Recently, Frieder, Ferris, Perrewé, Wihler, and Brooks
(in press) extended this meta-theoretical framework of social-political influence
to leadership.
Overall, research on political skill has generated considerable interest since
its original refinement by Ferris et al. (2005). Within the last decade, multiple
reviews and meta-analyses (Bing, Davison, Minor, Novicevic, & Frink, 2011;
Ferris, Treadway, Brouer, & Munyon, 2012; Munyon et al., 2015) have reported
on the effectiveness of political skill in work settings, both as a significant predic-
tor as well as a boundary condition. Some notable outcomes include the effect of
political skill on stress management (Hochwarter, Ferris, Zinko, Arnell, & James,
2007; Hochwarter, Summers, Thompson, Perrewé, & Ferris, 2010; Perrewé et
al., 2004), career success and performance (Blickle et al., 2011; Gentry, Gilm-
ore, Shuffler, & Leslie, 2012; Munyon et al., 2015), and leadership effectiveness
(Brouer, Douglas, Treadway, & Ferris, 2013; Whitman, Halbesleben, & Shanine,
2013).
Measurement. Ferris et al. (1999) provided a first effort at measuring the po-
litical skill construct by developing the six-item Political Skill Inventory (PSI).
Despite acceptable psychometric properties and scale reliability across five stud-
ies, the PSI was not without flaws. Although the scale reflected social astuteness and interpersonal influence, these dimensions did not emerge as separate and distinguishable factors. The resulting unidimensionality and construct domain concerns triggered the
development of an 18-item version, which retained the original scale name as well
as three original scale items (Ferris et al., 2005).
To develop the 18-item PSI, Ferris et al. (2005) generated an initial pool of
40 items to capture the full content domain of the political skill construct. After
omitting scale items prone to socially desirable responding and those with problematically high cross-loading values, a final set of 18 items was retained. As hypothesized, a four-factor solution emerged containing the social astuteness, interpersonal influence, networking ability, and apparent sincerity dimensions.
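The general logic of such an item screen can be sketched as follows; the code uses simulated four-factor data, an arbitrary cutoff, and scikit-learn's factor-analysis routine, and does not reproduce the actual PSI development.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(5)
    n, n_factors, items_per_factor = 500, 4, 5

    # Simulated responses to 20 items, 5 items per latent dimension.
    factors = rng.normal(size=(n, n_factors))
    loadings = np.zeros((n_factors * items_per_factor, n_factors))
    for f in range(n_factors):
        loadings[f * items_per_factor:(f + 1) * items_per_factor, f] = 0.7
    items = factors @ loadings.T + rng.normal(0, 0.6, size=(n, n_factors * items_per_factor))

    # Four-factor solution with varimax rotation (rotation is available in recent
    # scikit-learn releases).
    fa = FactorAnalysis(n_components=n_factors, rotation="varimax").fit(items)
    L = np.abs(fa.components_.T)            # items x factors loading matrix

    # Flag items whose second-highest loading is close to the highest, mirroring
    # the kind of cross-loading screen used in scale refinement.
    sorted_L = np.sort(L, axis=1)
    cross_loaded = sorted_L[:, -2] > 0.30
    print("items flagged for cross-loading:", np.where(cross_loaded)[0])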
Critique and Future Research Directions. Coming up on its 15th anniversary,
the PSI has been widely accepted as a sound psychometric measure by those
well entrenched within the organizational politics field. Few conceptual squabbles
exist among scholars, and the theoretical clarity paired with strong empirically
established links to relevant constructs is evidence for strong construct validity.
However, this measure has a few notable deficiencies. The PSI inherently suffers
from the drawbacks associated with self-reports. Certainly, self-reports are easy
to obtain, and are considered the best way to measure psychological states, per-
ceptions, and motives (McFarland et al., 2012). However, as a tool for assessing
behavioral effectiveness, self-reports have some issues. Hubris, perceptual bias, and socially desirable responding can lead individuals to overinflated estimates of their social abilities. Some individuals may believe, or are told erroneously, that they are likable and keen social agents when, in reality, they are social pariahs who annoy and infuriate their colleagues.
Also problematic is the assumption that a single rating source is appropriate for every dimension. For example, social astuteness and networking ability are largely perceptual and best captured through self-reports; observers cannot provide an accurate account of what focal individuals perceive during social interaction. However, observers may be best suited to assess interpersonal influence and apparent sincerity. Influence represents a change in an attitude, judgment, or decision, and such cues are more amenable to assessment by an observer or trained rater. Apparent sincerity is in the eye of the beholder, regardless of whether focal individuals believe they intended to act, or thought they acted, sincerely (Silvester & Wyatt, 2018).
With these shortcomings in mind, scholars could advance the literature by developing a behavioral measure that assesses political skill without solely
relying on self-reports. Developing such a measure would contribute to further
legitimizing the construct of political skill to those scholars and practitioners who
are not intimately familiar with the organizational politics literature, and doubt its
merits. Furthermore, this measure need not replace the PSI entirely, but a stream
of investigations that employed both a behavioral and self-report measure could
illuminate the utility or futility of how we currently measure political skill. Admittedly, this type of measurement requires added effort, likely complicating data collection processes. However, we are confident that value rests in doing so, if only to
confirm the utility of self-reports.
Another opportunity within the political skill literature is to evaluate the con-
struct’s developmental qualities. According to Ferris et al. (2005, 2007), political
skill is a social competency that can be cultivated over time through social feed-
back, role modeling, and mentorship. Despite strong theoretical support, ground-
ed in social learning theory (Bandura, 1986), little evidence for the development
of political skill through observation and modeling exists. Further, if both genetic
properties and situational factors affect political skill, then researchers need to
consider which individuals are more or less receptive to organizational training,
behavioral interventions, incentives, and role modeling techniques. Until empiri-
cal evidence is present, scholars should be cautious of discussing political skill as
a learnable or trainable competency.
Political Will
Definition and Conceptualization. Political will is a construct commonly
used in the popular press and governmental politics to describe a collective’s will-
ingness or unwillingness to expend resources towards a particular cause (Post,
Raile, & Raile, 2010). The creation of new laws and political courses of action
upsets the status quo, and in a world of diverse and often competing interests,
politicians must be willing to expend resources to fight for their desired agenda.
Similarly, Mintzberg (1983) argued that individual agents within organizations
needed political skill and political will in order to execute their desired managerial
actions successfully.
Over three decades ago, political will and political skill were introduced con-
ceptually into the organization sciences. Despite the sustained interest in politi-
cal skill, however, political will has attracted further inquiry only recently. This
neglect is unfortunate given that both constructs were integral to Mintzberg’s theoretical framework, and the omission of essential variables from a model biases the parameter estimates reported in previous studies.
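The omitted-variable point can be illustrated with a small simulation (ours; the coefficients are arbitrary): when an outcome depends on both political skill and political will and the two are correlated, dropping political will distorts the estimated effect of political skill.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 5000

    # Correlated predictors: political skill and political will (r about .5).
    skill = rng.normal(size=n)
    will = 0.5 * skill + rng.normal(0, np.sqrt(1 - 0.25), size=n)
    outcome = 0.3 * skill + 0.4 * will + rng.normal(size=n)

    def ols_slopes(X, y):
        X = np.column_stack([np.ones(len(y)), X])
        return np.linalg.lstsq(X, y, rcond=None)[0][1:]

    full = ols_slopes(np.column_stack([skill, will]), outcome)
    reduced = ols_slopes(skill.reshape(-1, 1), outcome)

    # With political will omitted, the skill coefficient absorbs part of will's
    # effect (roughly 0.3 + 0.5 * 0.4 = 0.5), illustrating the biased estimate.
    print(f"skill effect, both predictors: {full[0]:.2f}; will omitted: {reduced[0]:.2f}")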
Treadway (2012) provided a theoretical application of political will and sug-
gested instrumental (relational, concern for self, concern for others) and risk toler-
ance as underlying dimensions. Treadway defined political will as “the motivation
to engage in strategic, goal-directed behavior that advances the personal agenda
and objectives of the actor that inherently involves the risk of relational or repu-
tational capital” (p. 533). In keeping with Mintzberg’s conceptualization, Treadway
focused on describing political will at the individual level of analysis. Nonethe-
less, he did acknowledge that political will embodies a group mentality towards
a particular agenda.
Measurement. Scholars made a few early attempts to measure political will
before the development of a validated psychometric measure. Treadway et al.
(2005) first attempted to measure political will using need for achievement and
intrinsic motivation as proxies. These constructs successfully predicted the activ-
ity level of political behavior. Similarly, Liu, Liu, and Wu (2010) used need for
achievement, and analogously, need for power to predict political behavior. In the
same vein, Shaughnessy, Treadway, Breland, and Perrewé (2017) used the need
for power as a proxy for political will, which predicted informal leadership. Last-
ly, Doldor, Anderson, and Vinnicombe (2013) used semi-structured interviews to
explore what political will meant to male and female managers. Rather than focus
on the trait-like qualities previously employed as proxies, they found that political
will was more of an attitude about engaging in organizational politics. They also found that functional, ethical, and emotional appraisals shaped political attitudes.
Recently, Kapoutsis, Papalexandris, Treadway, and Bentley (2017) developed
an eight-item measure called the Political Will Scale (PWS). Based on Treadway’s
(2012) conception of political will, they expected the scale to break out into the
five dimensions of instrumental, relational, concern for self, concern for others,
and risk tolerance. However, confirmatory principal axis factor analysis revealed
two factors for this scale, which they labeled benevolent and self-serving. To date,
only a handful of published studies have used this new measure. As an example,
Maher et al. (2018) found that political will and political skill predicted configu-
rations of impression management tactics. Moreover, moderate levels of political
will were associated with the most effective configuration. Blickle et al. (2018)
applied additional psychometric testing to the scale. In applying a triadic multi-
source design, they found support for the construct and criterion-related validity
of the self-serving dimension of political will. However, they did not find justifi-
cation for the benevolent dimension. Instead, they interpreted this dimension to be
synonymous with altruistic political will.
Critique and Future Research Directions. Because the study of political will
is in its nascent stage, lending a critical eye helps introduce ideas for remedying
potential deficiencies. Establishing, expanding, and empirically testing the politi-
cal will nomological network will help establish construct validity and advance
research in this area. The sections that follow evaluate the state of the construct,
with a focus on vetting current conceptualizations and measurement instruments.
To date, within the organization sciences, political will resides as an individ-
ual-level variable. Indeed, we take no issue with this stance. Mintzberg specifi-
cally discussed political will and political skill as individual attributes necessary
to navigate workplace settings. However, scholars in political science have char-
acterized political will as a group-level phenomenon (Post et al., 2010). Accordingly, scholars within the organization sciences should also conceptualize and ex-
plore political will at collective levels of analysis. Indeed, political will possesses
attitude-based qualities (Doldor et al., 2013), and thus, can proliferate to others
within similar social networks (Salancik & Pfeffer, 1978).
Furthermore, scholars must examine how formal and informal leadership cre-
ate unique political will profiles, and assess how these configurations might affect
group outcomes. For example, teams with a high and consistent aggregate amount
of political will may have a singular focus that leads to higher performance results. It
may also be true that having one team member or leader who takes care of the
‘dirty work’ enables other team members to complete work tasks without engag-
ing in office politics.
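If political will were aggregated to the team level, researchers would first need to demonstrate adequate within-team agreement. The sketch below (simulated responses and our own helper function) computes the familiar rwg(j) agreement index for a single team using a uniform null distribution.

    import numpy as np

    def rwg_j(group_items, n_options=5):
        # group_items: members x items matrix of Likert responses for one team.
        # Uses the uniform ("rectangular") null-distribution variance.
        group_items = np.asarray(group_items, dtype=float)
        J = group_items.shape[1]
        sigma_e2 = (n_options ** 2 - 1) / 12.0
        mean_obs_var = group_items.var(axis=0, ddof=1).mean()
        ratio = 1 - mean_obs_var / sigma_e2
        return (J * ratio) / (J * ratio + mean_obs_var / sigma_e2)

    # Simulated team of 6 members answering eight PWS-style items on a 1-5 scale.
    rng = np.random.default_rng(7)
    team_level = rng.normal(3.5, 0.3)
    team = np.clip(np.rint(team_level + rng.normal(0, 0.6, size=(6, 8))), 1, 5)
    print(f"rwg(j) = {rwg_j(team):.2f} (values near .70 or above are a common aggregation heuristic)")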
Currently, scholars conceptualize political will as an individual characteristic.
To date, instruments make no effort to test whether this characteristic differs across
organizational situations and contexts. However, political scientists maintain that
political will is issue specific (Donovan, Bateman, & Heggestad, 2013; Morgan,
1989). In keeping with this notion, we suggest that a novel and illuminating line
of study would be to apply an event-oriented approach (Morgeson, Mitchell, &
Liu, 2015) to studying political will. Under this design, scholars could examine
how political will focuses resources and effort toward a particular cause, and
track how these manifestations affect goals and change outcomes. Unlike team-
level aggregation, this approach would require the development of a new measure
rather than merely changing the referent in the existing measure of political will.
As with many constructs in the organizational politics literature, there is little
consensus on the underlying theoretical foundations of political will. Conceptual-
izations and definitions are essential for any sound psychometric instrument, and
this incongruence is a current affliction within the study of political will. Indeed,
we applaud the advancement in theory and measurement by Treadway (2012) and
Kapoutsis et al. (2017), as they represent the seminal works in the field. Previous
proxy measures (i.e., need for achievement, need for power, intrinsic motivation)
were rooted in constructs that are stable individual traits, and recent thinking more
appropriately suggests that political will is a state-like attribute closely akin to
an attitude. However, there are potential issues to confront concerning the more
contemporary works mentioned above.
For example, the multidimensional conceptualization of political will (Tread-
way, 2012) was not supported empirically (Kapoutsis et al., 2017). Notably, no
items reflected risk tolerance, suggesting too narrow an operationalization. Simi-
larly, Rose and Greeley (2006) suggest that political will represents a sustained
commitment to a cause, as adversity and pushback are integral aspects of the pro-
cess. This aspect of political will is also absent from the recent measure. Scholars
should analyze the PWS dimensions in conjunction with scales of perseverance
(e.g., grit, Duckworth & Quinn, 2009) and risk tolerance to see if they load onto
a common factor.
As mentioned above, ‘politics’ is a loaded word that means different things to different people. Many see it as a toxic parasite requiring immediate extinction (Cantoni, 1993; Zhang & Lu, 2009). Conversely, others recognize its
importance, necessity, and inevitability (Eldor, 2016). An in-depth debate regard-
ing the positive and negative aspects of organizational politics is beyond the scope
of this chapter. However, it is clear that definitional unanimity has evaded both scholars and study participants. Anecdotal evidence suggests that respondents
consider workplace ‘political behavior’ to embody advocacy for a particular gov-
ernmental candidate. There are two potential remedies to this issue.
First, we suggest that scholars define organizational politics within the survey
instrument in use. This approach will focus participants’ attention on organiza-
tional politics, not governmental politics. Second, scholars should avoid using
any variant of the word ‘politics’ in the measures that they create, and instead
use more specific language to illustrate the intended political situation or charac-
teristic. When writing scale items, scholars advocate for clear language, so that
interpretations are uniform across time, culture, and individual attributes (Clark
& Watson, 1995). We find this practice particularly important given the ubiquity
and lack of agreement about the word ‘politics.’ Unfortunately, the PWS suffers
from this issue, as all eight items employ a variant of the word politics. Thus, we
suggest that scholars define politics in ways that clearly are understood by target
samples.
Lastly, we must note that the initial validation of the PWS has produced mixed
results. Blickle et al. (2018) found evidence that the self-serving dimension of
political will demonstrated construct and criterion-related validity. However, the benevolent dimension of political will did not correlate with altruism, as the authors argue. We question whether altruism truly fits within the political framework, as acting politically on others’ behalf does not have to be genuinely self-sacrificing (which highlights the need for greater conceptual agreement). These results, com-
bined with the other issues raised in this review, warrant further construct valida-
tion research on the PWS.
Reputation
Definition and Conceptualization. Reputation is commonly discussed among the public across social and business contexts. Generally, a positive reputation is considered a complimentary attribute. However, academic investigations of reputation are inconsistent, and our understanding of what exactly reputation is and how it functions is limited. Research exists across social science disciplines (e.g., economics, management, psychology; Ferris et al., 2003). Like many other constructs in the organizational politics literature, disagreement regarding the definition of reputation has thwarted research. This discrepancy is due, primarily, to the different labels used across fields, and in some cases even across separate pockets of research within each field (e.g., individual, team, organizational, and industry-level within the management literature). These different markers and branches of inquiry have fragmented the literature (Ferris et al., 2014).
To synthesize the existing research and create greater understanding among
scholars, Ferris et al. (2014) provided a cross-level review of reputation. They
found that it has three interacting features: (1) elements that inform reputation,
(2) stakeholder perceptions, and (3) functional utility. That is, the characteristics
of a focal entity interact with stakeholder perceptions to form the entity’s reputa-
tion. Thus, reputation then has a particular value, which, if positive, can result in
positive outcomes. Considering this, Ferris et al. (2014) proposed the following
definition of reputation: “a perceptual identity formed from the collective perceptions of others” (p. 62).
Constructs related to reputation likely have different relations with each dimension. Future research should
not only investigate reputation as a whole, but also seek to understand how, when,
and why each dimension of reputation is more or less influential on related con-
structs.
To date, research at the individual level has primarily focused on the informing elements and functional utility features of reputation. The remaining feature (i.e., stakeholder perceptions) has received very little attention. Indeed, it seems as though researchers generally avoid this important element altogether. This inat-
tention is concerning, as reputation is a “perceptual identity formed from the col-
lective perceptions of others” (Ferris et al., 2014, p. 62) residing “in the minds of
external observers” (Rindova et al., 2010, p. 614). Despite this acceptance, there
has been no empirical investigation of the functional role of stakeholder char-
acteristics in reputation formation. The variance in how others may perceive a
focal individual is a central theme in reputation development. Indeed, individuals
interpret the same information differently (Branzei, Ursacki-Bryant, Vertinsky, &
Zhang, 2004), and attribute behaviors to different causes (Heider, 1958; Kelley,
1973). Still, although this variance in perception is well established, its effect on the consequences of reputation (e.g., autonomy, financial reward) has received little attention.
Related to how stakeholders may interpret informing elements differently, and
tying back into the measurement of personal reputation, is the obvious concern
with the method in which reputation is measured. Although Hochwarter et al.’s
(2007) measure has received statistical support (and convergence across self- and
other-report indices), assessments came from focal individuals exclusively (Ferris
et al., 2014). Although obtaining multiple assessments of a single focal individual
is generally more complicated, a reputation assessment from a single individual
offers minimal insight.
In the remainder of this chapter, we discuss these issues and provide, to the best of our ability, some potential remedies to common challenges within the field of organizational politics.
Conceptual Challenges
The decades of research on organizational politics notwithstanding, the field
still suffers from a fundamental issue of conceptual incongruence, which poses a
threat to construct validity (Ferris et al., 2019; Lepisto & Pratt, 2012; McFarland
et al., 2012). This shortcoming is not tremendously surprising, as accurately de-
fining and capturing motives and behaviors that are inherently concealed, infor-
mal, murky, or downright dishonest is no easy task. Adding to this complexity is
the perspective that the word politics itself is a well-known, yet misunderstood,
term within the popular lexicon, and these preconceived notions from both prac-
titioners and scholars alike can contaminate conceptualizations and measurement.
Ideally, a researcher would approach his or her studies with a tabula rasa, or clean
slate (Craig & Douglas, 2011; Fendt & Sachs, 2008). However, as objective as
individuals aim to be, researchers’ personal experiences may influence how con-
structs are conceptualized and evaluated.
Perhaps it is true that the commercialization of greed that occurred in the
1980s, when current conceptualizations of POPs and other political constructs
were established, influenced how scholars defined and measured constructs with-
in organizational politics. It may also be the case that the more modern positive
psychology movement (e.g., Luthans & Avolio, 2009) has led scholars to search
for positive aspects of organizational politics (Byrne, Manning, Weston, & Ho-
chwarter, 2017; Elbanna, Kapoutisis, & Mellahi, 2017). We offer no formal defi-
nition here, but we do suggest that future attempts to unify organizational politics
under a common conceptual understanding acknowledge that much of what goes
on in organizations is informal and social, and that this reality allows for many
different outcomes, both good and bad.
A unifying definition of organizational politics should also consider the full
breadth of different behaviors and motivations. We have talked about the politi-
cian as an omnibus term rather than defining and refining what precisely that
means. Given the complex nature of political constructs, we advocate the de-
velopment of multidimensional constructs with both first- and second-order lev-
els. This strategy allows practitioners to look for general main effects, as well as
more nuanced relationships (e.g., Brouer, Badaway, Gallagher, & Haber, 2015).
Specifically, it might be helpful to establish profiles of, and related to, political behavior. From a research design standpoint, methods such as latent profile analy-
sis (Gabriel, Campbell, Djurdjevic, Johnson, & Rosen, 2018; Gabriel, Daniels,
Diefendorff, & Greguras, 2015), cluster analysis (Maher et al., 2018), and qualita-
tive comparative analysis (QCA; Misangyi, Greckhamer, Furnari, Fiss, Crilly, &
Aguilera, 2017; Rihoux & Ragin, 2008) represent data analysis techniques that
are currently underutilized in the organizational politics literature. We also assert that the complexity of political constructs calls for more nuanced explorations, and scholars should consider theorizing and measuring nonlinear and moderated
nonlinear investigations of politics (Ferris, Bowen, Treadway, Hochwarter, Hall,
& Perrewé, 2006; Grant & Schwartz, 2011; Hochwarter, Ferris, Laird, Treadway,
& Gallagher, 2010; Maslyn et al., 2017; Pierce & Aguinis, 2013; Rosen & Ho-
chwarter, 2014).
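For readers less familiar with these person-centered techniques, the following is a minimal sketch of a latent-profile-style analysis, approximated here with a Gaussian mixture model in scikit-learn; the three politics-related measures, the simulated data, and the candidate profile counts are illustrative assumptions rather than recommendations drawn from the sources cited above.

```python
# Minimal sketch of a latent-profile-style analysis, approximated with a
# Gaussian mixture model on simulated (hypothetical) data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Hypothetical standardized scores for 300 respondents on three
# politics-related measures (e.g., POPs, political skill, political will).
X = rng.normal(size=(300, 3))

# Fit candidate solutions with 1-5 profiles and compare them via BIC,
# one common way of choosing the number of latent profiles.
models = {k: GaussianMixture(n_components=k, n_init=10, random_state=0).fit(X)
          for k in range(1, 6)}
bics = {k: m.bic(X) for k, m in models.items()}
best_k = min(bics, key=bics.get)

# Assign each respondent to a most likely profile for follow-up analyses
# (e.g., comparing profiles on outcomes such as strain or performance).
profiles = models[best_k].predict(X)
print(bics, best_k, np.bincount(profiles))
```

With real data, of course, the retained solution would be judged against theory and profile interpretability, not fit indices alone.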
Another important consideration for the future of organizational politics research is to examine context (see Johns, 2006, 2018). The majority of POPs research has come
from scholars and samples from the United States (for a few notable exceptions,
please see Abbas & Raja, 2014; Basar & Basim, 2016; Eldor, 2016; Kapoutsis et
al., 2017). We feel that it is imperative to incorporate different viewpoints from
all corners of the world as we move towards a shared conceptual understanding
of organizational politics. Failure to do so creates issues of construct adequacy
(Arafat, Chowdhury, Qusar, & Hafez, 2016; Hult et al., 2008). Indeed, we expect
broad and salient similarities across cultures, but politics may look, act, and feel
different across different contexts.
In addition, the role of context likely also affects politics at a more localized
level. That is, among others, the type of organization (e.g., for-profit vs. not-for-
profit), industry (e.g., finance vs. social services), and hierarchical level (e.g., top
management teams vs. line managers) likely affect the prevalence, type, and pro-
cess of organizational politics. Contextualizing our research will provide an abundance of avenues from which we can continue to evaluate how, what, when, and why political action unfolds, and how effective it is, under different circumstances. This approach will illuminate theory and build a greater conceptual understanding of politics.
Lastly, although there are ample avenues for investigations that employ the
contemporary political constructs discussed in this chapter, organizational politics
scholars should not rest on their laurels concerning the development of new theo-
ries and constructs. We encourage the inclusion and development of new theories
that could help explain political phenomena. For example, organizational politics
literature is rooted in the idea that individuals are not merely passive agents, but
instead enact and respond to their environment. The fields of leadership and or-
ganizational politics are inextricably linked, and much as the field of leadership
has placed emphasis on leaders over followers (Epitropaki, Kark, Mainemelis,
& Lord, 2017), organizational scholars have focused on the actions of the in-
fluencers rather than the targets of those influences. This perspective ignores a century-old stream of research that spans the social sciences and argues that there is individual variation in the extent to which people are affected by their environment (Allport, 1920; Belsky & Pluess, 2009). Incorporating individuals’ susceptibility to social influence into theories and models of organizational politics would restore balance to this biased perspective and help alleviate concerns of omitted variable bias.
Student-recruited samples also present fewer barriers to access. Despite the potential pitfalls of this
data collection method, these samples can increase the generalizability of a study,
perhaps more so than a sample drawn from a single organization. We encourage
the appropriate use of these samples (see Wheeler et al., 2014, for guidelines), especially
in conjunction with other data collection methods as part of a multi-study pack-
age, as student-recruited sampling methods have the potential to attenuate the
weaknesses of other study designs (e.g., interviews, experiments, single-site field
studies). In a similar vein, technology has enabled us to gather data from different
online sources such as Amazon Mechanical Turk and Qualtrics. Although these
data sources can potentially suffer from some of the ills plaguing poorly designed
and executed student-recruited samples, understanding their virtues can help scholars demonstrate the strengths of their empirical studies (Cheung, Burns, Sinclair, &
Sliter, 2017; Couper, 2013; Das, Ester, & Kaczmirek, 2020; Finkel, Eastwick, &
Reis, 2015; Jann, Krumpal, & Wolter, 2019; Porter, Outlaw, Gale, & Cho, 2019).
No matter where the data are collected, organizational scholars will still run into the inherent problem that organizational politics constructs are measured in imperfect ways because many of the field’s core constructs are largely invisible. Thus, we will close with a final appeal to use multiple sources of information to illuminate political phenomena. There is an old Hindu parable about a collection of blind men who each feel a different part of an elephant, and then collectively share their knowledge to arrive at a shared conceptualization of the elephant. Given the hidden and often invisible nature of politics constructs, we too must rely on multiple accounts to achieve a collective understanding.
For example, few studies have attempted to use objective measures of performance when assessing the proposed relations with political skill (see Ahearn et al., 2004, for an exception). Subjective measures of performance can be problematic, as those high in political skill can influence others, and likely the subjective performance assessments they receive. Thus, collecting both objective and subjective performance data and employing congruence analysis can not only help us understand the quality of our data, but also extract theoretical richness. The same is true for constructs such as self- and other-reported political skill, leader political behavior, and perceptions of organizational politics. Polynomial regression and other forms of congruence analysis can help determine whether and why subjects are or are not seeing things the same way (Cheung, 2009; Edwards, 1994; Edwards & Parry, 1993). Differences in these scores may well predict different outcomes, which can add to our theoretical understanding of political phenomena.
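To make the polynomial regression approach concrete, the sketch below fits a second-order polynomial relating a hypothetical outcome to self- and other-reported political skill, in the general spirit of Edwards and Parry (1993); the variable names, simulated data, and coefficient values are assumptions for illustration, not results from any study cited here.

```python
# Minimal sketch of a polynomial (congruence) regression with hypothetical
# self-reports (S), other-reports (O), and an outcome (Y); simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 400
S = rng.normal(size=n)                       # centered self-reported political skill
O = 0.5 * S + rng.normal(scale=0.8, size=n)  # centered other-reported political skill
Y = 0.3 * S + 0.2 * O - 0.15 * (S - O) ** 2 + rng.normal(size=n)  # outcome

# Second-order polynomial: Y = b0 + b1*S + b2*O + b3*S^2 + b4*S*O + b5*O^2
X = sm.add_constant(np.column_stack([S, O, S**2, S * O, O**2]))
b = sm.OLS(Y, X).fit().params  # [b0, b1, b2, b3, b4, b5]

# Response-surface quantities along the congruence (S = O) and
# incongruence (S = -O) lines, derived from the estimated coefficients.
print("slope along S = O:     ", b[1] + b[2])
print("curvature along S = O: ", b[3] + b[4] + b[5])
print("curvature along S = -O:", b[3] - b[4] + b[5])
```

Testing these surface values, rather than simple difference scores, is what allows researchers to ask whether, and why, different sources do or do not see things the same way.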
CONCLUSION
The organizational politics literature has been going strong for decades, yet still
suffers from some of the fundamental problems that we see with fledgling streams
of research. At the core of almost every political construct is the issue of concep-
tual clarity and congruence. Without a sound theoretical basis, measures exist on
unstable grounds, and fault lines are sure to divide and divert what could be a
sound collective research stream. In this chapter, we have reviewed and critically
examined the theoretical bases and associated measures of the five significant
constructs in the field as well as the conventional research designs that predomi-
nate our literature. This exercise has led us to point out some of the virtues and
drawbacks of current established measures and methods, and to take some hard
looks in the mirror at our work. We hope that our suggestions help inspire and
guide future research so that the collective strength of this invaluable field con-
tinues to grow.
REFERENCES
Abbas, M., & Raja, U. (2014). Impact of perceived organizational politics on supervisory-
rated innovative performance and job stress: Evidence from Pakistan. Journal of
Advanced Management Science, 2, 158–162.
Adams, G., Ammeter, A., Treadway, D., Ferris, G., Hochwarter, W., & Kolodinsky, R.
(2002). Perceptions of organizational politics: Additional thoughts, reactions, and
multi-level issues. In F. Yammarino & F. Dansereau (Eds.), Research in multi-level
issues, Volume 1: The many faces of multi-level issues (pp. 287–294). Oxford, UK:
Elsevier Science.
Ahearn, K., Ferris, G., Hochwarter, W., Douglas, C., & Ammeter, A. (2004). Leader politi-
cal skill and team performance. Journal of Management, 30, 309–327.
Ahmad, J., Akhtar, H., ur Rahman, H., Imran, R., & ul Ain, N. (2017). Effect of diversified
model of organizational politics on diversified emotional intelligence. Journal of
Basic and Applied Sciences, 13, 375–385.
Allport, F. (1920). The influence of the group upon association and thought. Journal of
Experimental Psychology, 3, 159–182.
Arafat, S., Chowdhury, H., Qusar, M., & Hafez, M. (2016). Cross-cultural adaptation and
psychometric validation of research instruments: A methodological review. Journal
of Behavioral Health, 5, 129–136.
Aryee, S., Chen, Z., & Budhwar, P. (2004). Exchange fairness and employee performance:
An examination of the relationship between organizational politics and procedural
justice. Organizational Behavior and Human Decision Processes, 94, 1–14.
Ashforth, B., & Lee, R. (1990). Defensive behavior in organizations: A preliminary model.
Human Relations, 43, 621–648.
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory.
Englewood Cliffs, NJ: Prentice Hall.
Barbuto, J., & Moss, J. (2006). Dispositional effects in intra-organizational influence tac-
tics: A meta-analytic review. Journal of Leadership & Organizational Studies, 12,
30–48.
Bartol, K., & Martin, D. (1990). When politics pays: Factors influencing managerial com-
pensation decisions. Personnel Psychology, 43, 599–614.
Basar, U., & Basim, N. (2016). A cross‐sectional survey on consequences of nurses’ burn-
out: Moderating role of organizational politics. Journal of Advanced Nursing, 72,
1838–1850.
Belsky, J., & Pluess, M. (2009). Beyond diathesis stress: Differential susceptibility to envi-
ronmental influences. Psychological Bulletin, 135, 885–908.
Bing, M., Davison, H., Minor, I., Novicevic, M., & Frink, D. (2011). The prediction of task
and contextual performance by political skill: A meta-analysis and moderator test.
Journal of Vocational Behavior, 79, 563–577.
Blickle, G., Ferris, G., Munyon, T., Momm, T., Zettler, I., Schneider, P., & Buckley, M.
(2011). A multi‐source, multi‐study investigation of job performance prediction by
political skill. Applied Psychology, 60, 449–474.
Blickle, G., Schütte, N., & Wihler, A. (2018). Political will, work values, and objective
career success: A novel approach – The Trait-Reputation-Identity Model. Journal of
Vocational Behavior, 107, 42–56.
Blom-Hansen, J., & Finke, D. (2020). Reputation and organizational politics: Inside the
EU Commission. The Journal of Politics, 82(1), 135–148.
Bolino, M. (1999). Citizenship and impression management: Good soldiers or good actors?
Academy of Management Review, 24, 82–98.
Bolino, M., Long, D., & Turnley, W. (2016). Impression management in organizations:
Critical questions, answers, and areas for future research. Annual Review of Organi-
zational Psychology and Organizational Behavior, 3, 377–406.
Bolino, M., & Turnley, W. (1999). Measuring impression management in organizations:
A scale development based on the Jones and Pittman taxonomy. Organizational
Research Methods, 2, 187–206.
Branzei, O., Ursacki-Bryant, T., Vertinsky, I., & Zhang, W. (2004). The formation of green
strategies in Chinese firms: Matching corporate environmental responses and indi-
vidual principles. Strategic Management Journal, 25, 1075–1095.
Brecht, A. (1937). Bureaucratic sabotage. The Annals of the American Academy of Politi-
cal and Social Science, 189, 48–57.
Bromley, D. (1993). Reputation, image, and impression management. New York, NY: Wi-
ley.
Bromley, D. (2000). Psychological aspects of corporate identity, image and reputation.
Corporate Reputation Review, 3, 240–253.
Brouer, R., Badaway, R., Gallagher, V., & Haber, J. (2015). Political skill dimensional-
ity and impression management choice and effective use. Journal of Business and
Psychology, 30, 217–233.
Brouer, R., Douglas, C., Treadway, D., & Ferris, G. (2013). Leader political skill, relation-
ship quality, and leadership effectiveness a two-study model test and constructive
replication. Journal of Leadership & Organizational Studies, 20, 185–198.
Burris, E. (2012). The risks and rewards of speaking up: Managerial responses to employee
voice. Academy of Management Journal, 55, 851–875.
Byrne, D. (1917). Executive session. Nash’s Pall Mall Magazine, 59, 49–56.
Byrne, Z., Manning, S., Weston, J., & Hochwarter, W. (2017). All roads lead to well-being:
Unexpected relationships between organizational POPs, employee engagement, and
worker well-being. In C. Rosen & P. Perrewé (Eds.), Power, politics, and political
skill in job stress (pp. 1–32). Bingley, UK: Emerald.
Cantoni, C. (1993). Eliminating bureaucracy-roots and all. Management Review, 82, 30–
33.
Chang, C., Rosen, C., & Levy, P. (2009). The relationship between perceptions of organiza-
tional politics and employee attitudes, strain, and behavior: A meta-analytic exami-
nation. Academy of Management Journal, 52, 779–801.
Edwards, J., & Parry, M. (1993). On the use of polynomial regression equations as an
alternative to difference scores in organizational research. Academy of Management
Journal, 36, 1577–1613.
Elbanna, S., Kapoutsis, I., & Mellahi, K. (2017). Creativity and propitiousness in strate-
gic decision making: The role of positive politics and macro-economic uncertainty.
Management Decision, 55, 2218–2236.
Eldor, L. (2016). Looking on the bright side: The positive role of organizational politics in
the relationship between employee engagement and performance at work. Applied
Psychology, 66, 233–259.
Ellen III, B. (2014). Considering the positive possibilities of leader political behavior.
Journal of Organizational Behavior, 35, 892–896.
Epitropaki, O., Kark, R., Mainemelis, C., & Lord, R. G. (2017). Leadership and follower-
ship identity processes: A multilevel review. The Leadership Quarterly, 28, 104–
129.
Farrell, D., & Petersen, J. (1982). Patterns of political behavior in organizations. Academy
of Management Review, 7, 403–412.
Fedor, D., Maslyn, J., Farmer, S., & Bettenhausen, K. (2008). The contribution of positive
politics to the prediction of employee reactions. Journal of Applied Social Psychol-
ogy, 38, 76–96.
Fendt, J., & Sachs, W. (2008). Grounded theory method in management research: Users’
perspectives. Organizational Research Methods, 11, 430–455.
Ferris, G., Adams, G., Kolodinsky, R., Hochwarter, W., & Ammeter, A. (2002). Percep-
tions of organizational politics: Theory and research directions. In F. Yammarino
& F. Dansereau (Eds.), Research in multi-level issues, Volume 1: The many faces of
multi-level issues (pp. 179–254). Oxford, UK: Elsevier.
Ferris, G., Berkson, H., Kaplan, D., Gilmore, D., Buckley, M., Hochwarter, W., et al.
(1999). Development and initial validation of the political skill inventory. Paper
presented at the 59th annual national meeting of the Academy of Management, Chi-
cago.
Ferris, G., Blass, R., Douglas, C., Kolodinsky, R., & Treadway, D. (2003). Personal repu-
tation in organizations. In J. Greenberg (Ed.), Organizational behavior: The state of
the science (pp. 211–246). Mahwah, NJ: Lawrence Erlbaum.
Ferris, G. R., Bowen, M. G., Treadway, D. C., Hochwarter, W. A., Hall, A. T., & Perrewé, P.
L. (2006). The assumed linearity of organizational phenomena: Implications for oc-
cupational stress and well-being. In P. L. Perrewé & D. C. Ganster (Eds.), Research
in occupational stress and well-being (Vol. 5, pp. 205–232). Oxford, UK: Elsevier
Science Ltd.
Ferris, G., Ellen, B., McAllister, C., & Maher, L. (2019). Reorganizing organizational poli-
tics research: A review of the literature and identification of future research direc-
tions. Annual Review of Organizational Psychology and Organizational Behavior,
6, 299–323.
Ferris, G., Fedor, D., & King, T. (1994). A political conceptualization of managerial behav-
ior. Human Resource Management Review, 4, 1–34.
Ferris, G., Harrell-Cook, G., & Dulebohn, J. (2000). Organizational politics: The nature
of the relationship between politics perceptions and political behavior. In S. Bacha-
rach & E. Lawler (Eds.), Research in the sociology of organizations (pp. 89–130).
Stamford, CT: JAI Press.
Ferris, G., Harris, J., Russell, Z., Ellen, B., Martinez, A., & Blass, F. (2014). The role
of reputation in the organizational sciences: A multi-level review, construct assess-
ment, and research directions. In M. Buckley, A. Wheeler, & J. Halbesleben (Eds.),
Research in personnel and human resources management (pp. 241–303). Bingley,
UK: Emerald.
Ferris, G., Harris, J., Russell, Z., & Maher, L. (2018). Politics in organizations. In N. Anderson, D. Ones, & H. Sinangil (Eds.), The handbook of industrial, work, and organizational psychology (pp. 514–531). Thousand Oaks, CA: Sage.
Ferris, G., & Hochwarter, W. (2011). Organizational politics. In S. Zedeck (Ed.), APA
handbook of industrial and organizational psychology (pp. 435–459). Washington,
DC: APA.
Ferris, G., Hochwarter, W., Douglas, C., Blass, F., Kolodinsky, R., & Treadway, D. (2002b).
Social influence processes in organizations and human resource systems. In G. Fer-
ris, & J. Martocchio (Eds.), Research in personnel and human resources manage-
ment (pp. 65–127). Oxford, U.K.: JAI Press/Elsevier Science.
Ferris, G., & Judge, T. (1991). Personnel/human resources management: A political influ-
ence perspective. Journal of Management, 17, 447–488.
Ferris, G., & Kacmar, K. (1989). Perceptions of organizational politics. Paper presented at
the 49th Annual Academy of Management Meeting, Washington, DC.
Ferris, G., & Kacmar, K. (1992). Perceptions of organizational politics. Journal of Man-
agement, 18, 93–116.
Ferris, G., & King, T. (1991). Politics in human resources decisions: A walk on the dark
side. Organizational Dynamics, 20, 59–71.
Ferris, G., Perrewé, P., Daniels, S., Lawong, D., & Holmes, J. (2017). Social influence
and politics in organizational research: What we know and what we need to know.
Journal of Leadership & Organizational Studies, 24, 5–19.
Ferris, G., Perrewe, P., & Douglas, C. (2002). Social effectiveness in organizations: Con-
struct validity and research directions. Journal of Leadership and Organizational
Studies, 9, 49–63.
Ferris, G., Russ, G., & Fandt, P. (1989). Politics in organizations. In R. Giacalone & P.
Rosenfeld (Eds.), Impression management in the organization (pp. 143–170). Hill-
sdale, NJ: Erlbaum.
Ferris, G., & Treadway, D. (2012). Politics in organizations: History, construct specifica-
tion, and research directions. In G. Ferris & D. Treadway (Eds.), Politics in organi-
zations: Theory and research considerations (pp. 3–26). New York, NY: Routledge/
Taylor and Francis.
Ferris, G., Treadway, D., Brouer, R., & Munyon, T. (2012). Political skill in the organiza-
tional sciences. In G. Ferris & D. Treadway (Eds.), Politics in organizations: Theory
and research considerations (pp. 487–528). New York, NY: Routledge/Taylor &
Francis.
Ferris, G., Treadway, D., Kolodinsky, R., Hochwarter, W., Kacmar, C., Douglas, C., &
Frink, D. D. (2005). Development and validation of the political skill inventory.
Journal of Management, 31, 126–152.
Ferris, G., Treadway, D., Perrewé, P., Brouer, R., Douglas, C., & Lux, S. (2007). Political
skill in organizations. Journal of Management, 33, 290–320.
Finkel, E., Eastwick, P., & Reis, H. (2015). Best research practices in psychology: Illus-
trating epistemological and pragmatic considerations with the case of relationship
science. Journal of Personality and Social Psychology, 108, 275–297.
Franke, H., & Foerstl, K. (2018). Fostering integrated research on organizational politics
and conflict in teams: A cross-phenomenal review. European Management Journal,
36, 593–607.
French, J., & Raven, B. (1959). The bases of social power. In D. Cartwright & A. Zander
(Eds.), Group dynamics (pp. 150–167). New York, NY: Harper & Row.
Frieder, R. E., Ferris, G. R., Perrewé, P. L., Wihler, A., & Brooks, C. D. (2019). Extending
the metatheoretical framework of social/political influence to leadership: Political
skill effects on situational appraisals, responses, and evaluations by others. Person-
nel Psychology, 72(4), 543–569.
Gabriel, A., Campbell, J., Djurdjevic, E., Johnson, R., & Rosen, C. (2018). Fuzzy profiles:
comparing and contrasting latent profile analysis and fuzzy set qualitative compara-
tive analysis for person-centered research. Organizational Research Methods, 21,
877–904.
Gabriel, A., Daniels, M., Diefendorff, J., & Greguras, G. (2015). Emotional labor actors: A
latent profile analysis of emotional labor strategies. Journal of Applied Psychology,
100, 863–879.
Gabriel, A., Koopman, J., Rosen, C., & Johnson, R. (2018). Helping others or helping one-
self? An episodic examination of the behavioral consequences of helping at work.
Personnel Psychology, 71, 85–107.
Gandz, J., & Murray, V. (1980). The experience of workplace politics. Academy of Man-
agement Journal, 23, 237–251.
Gentry, W., Gilmore, D., Shuffler, M., & Leslie, J. (2012). Political skill as an indicator of
promotability among multiple rater sources. Journal of Organizational Behavior,
33, 89–104.
George, G., Dahlander, L., Graffin, S., & Sim, S. (2016). Reputation and status: Expanding
the role of social evaluations in management research. Academy of Management
Journal, 59, 1–13.
Grams, W., & Rogers, R. (1990). Power and personality: Effects of Machiavellianism,
need for approval, and motivation on use of influence tactics. Journal of General
Psychology, 117, 71–82.
Grant, A., & Schwartz, B. (2011). Too much of a good thing: The challenge and opportu-
nity of the inverted U. Perspectives on Psychological Science, 6, 61–76.
Guo, Y., Kang, H., Shao, B., & Halvorsen, B. (2019). Organizational politics as a blind-
fold: Employee work engagement is negatively related to supervisor-rated work out-
comes when organizational politics is high. Personnel Review, 48, 784–798.
Heider, F. (1958). The psychology of interpersonal relations. New York, NY: Wiley.
Higgins, C., Judge, T., & Ferris, G. (2003). Influence tactics and work outcomes: A meta‐
analysis. Journal of Organizational Behavior, 24, 89–106.
Hill, S., Thomas, A., & Meriac, J. (2016). Political behaviors, politics perceptions and
work outcomes: Moving to an experimental study. In E. Vigoda-Gabot & A. Drory
(Eds.), Handbook of organizational politics: Looking back and to the future (pp.
369–400). Northampton, MA: Edward Elgar Publishing.
Hinkin, T. (1998). A brief tutorial on the development of measures for use in survey ques-
tionnaires. Organizational Research Methods, 1, 104–121.
Kacmar, K., Bozeman, D., Carlson, D., & Anthony, W. (1999). An examination of the perceptions of organizational politics model: Replication and extension. Human Re-
lations, 52, 383–416.
Kacmar, K., & Carlson, D. (1997). Further validation of the perceptions of politics scale
(POPs): A multiple sample investigation. Journal of Management, 23, 627–658.
Kacmar, K., & Ferris, G. (1991). Perceptions of organizational politics scale (POPs): De-
velopment and construct validation. Educational and Psychological Measurement,
51, 193–205.
Kacmar, K., Wayne, S., & Wright, P. (1996). Subordinate reactions to the use of impression
management tactics and feedback by the supervisor. Journal of Managerial Issues,
8, 35–53.
Kapoutsis, I., Papalexandris, A., Treadway, D., & Bentley, J. (2017). Measuring political
will in organizations: Theoretical construct development and empirical validation.
Journal of Management, 43, 2252–2280.
Kelley, H. (1973). The process of causal attributions. American Psychologist, 28, 107–128.
Kidron, A., & Vinarski-Peretz, H. (2018). The political iceberg: The hidden side of leaders’
political behaviour. Leadership & Organization Development Journal, 39, 1010–
1023.
Kiewitz, C., Restubog, S., Zagenczyk, T., & Hochwarter, W. (2009). The interactive effects
of psychological contract breach and organizational politics on perceived organi-
zational support: Evidence from two longitudinal studies. Journal of Management
Studies, 46, 806–834.
Kipnis, D., & Schmidt, S. (1988). Upward-influence styles: Relationship with performance
evaluations, salary, and stress. Administrative Science Quarterly, 33, 528–542.
Kipnis, D., Schmidt, S., & Wilkinson, I. (1980). Intraorganizational influence tactics: Ex-
plorations in getting one’s way. Journal of Applied Psychology, 65, 440–452.
Kruse, E., Chancellor, J., & Lyubomirsky, S. (2017). State humility: Measurement, concep-
tual validation, and intrapersonal processes. Self and Identity, 16, 399–438.
Lafrenière, M., Sedikides, C., & Lei, X. (2016). Regulatory fit in self-enhancement and
self-protection: implications for life satisfaction in the west and the east. Journal of
Happiness Studies, 17, 1111–1123.
Laird, M., Zboja, J., & Ferris, G. (2012). Partial mediation of the political skill-reputation
relationship. Career Development International, 17, 557–582.
Lampaki, A., & Papadakis, V. (2018). The impact of organisational politics and trust in
the top management team on strategic decision implementation success: A middle
manager’s perspective. European Management Journal, 36, 627–637.
Landells, E., & Albrecht, S. (2013). Organizational political climate: Shared perceptions
about the building and use of power bases. Human Resource Management Review,
23, 357–365.
Landells, E., & Albrecht, S. (2017). The positives and negatives of organizational politics:
A qualitative study. Journal of Business and Psychology, 32, 41–58.
Landry, H. (1969). Creativity and personality integration. Canadian Journal of Counsel-
ling and Psychotherapy, 3, 5–11.
Larson, R., & Csikszentmihalyi, M. (1983). The experience sampling method. New Direc-
tions for Methodology of Social & Behavioral Science, 15, 41–56.
Lasswell, H. (1936). Politics: Who gets what, when, how? New York, NY: Whittlesey.
Lee, S., Han, S., Cheong, M., Kim, S. L., & Yun, S. (2017). How do I get my way? A
meta-analytic review of research on influence tactics. The Leadership Quarterly,
28, 210–228.
LePine, J., Podsakoff, N., & LePine, M. (2005). A meta-analytic test of the challenge stress-
or–hindrance stressor framework: An explanation for inconsistent relationships
among stressors and performance. Academy of Management Journal, 48, 764–775.
Lepisto, D., & Pratt, M. (2012). Politics in perspectives: On the theoretical challenges and
opportunities in studying organizational politics. In G. Ferris & D. Treadway (Eds.),
Politics in organizations: Theory and research considerations (pp. 67–98). New
York, NY: Routledge/Taylor and Francis.
Lewin, K. (1936). Principles of topological psychology. New York, NY: McGraw-Hill.
Li, C., Liang, J., & Farh, J. L. (2020). Speaking up when water is murky: An uncertainty-
based model linking perceived organizational politics to employee voice. Journal of
Management, 46(3), 443–469.
Li, J., Wu, L., Liu, D., Kwan, H., & Liu, J. (2014). Insiders maintain voice: A psychologi-
cal safety model of organizational politics. Asia Pacific Journal of Management, 31,
853–874.
Liden, R., & Mitchell, T. (1988). Ingratiatory behaviors in organizational settings. Acad-
emy of Management Review, 13, 572–587.
Lim, S., Ilies, R., Koopman, J., Christoforou, P., & Arvey, R. (2018). Emotional mecha-
nisms linking incivility at work to aggression and withdrawal at home: An experi-
ence-sampling study. Journal of Management, 44, 2888–2908.
Lincoln, Y., & Guba, E. (1985). Naturalistic observation. Thousand Oaks, CA: Sage Pub-
lications.
Liu, Y., Ferris, G., Zinko, R., Perrewé, P., Weitz, B., & Xu, J. (2007). Dispositional ante-
cedents and outcomes of political skill in organizations: A four-study investigation with convergence. Journal of Vocational Behavior, 71, 146–165.
Liu, Y., Liu, J., & Wu, L. (2010). Are you willing and able? Roles of motivation, power,
and politics in career growth. Journal of Management, 36, 1432–1460.
Luthans, F., & Avolio, B. (2009). Inquiry unplugged: building on Hackman’s potential
perils of POB. Journal of Organizational Behavior: The International Journal of In-
dustrial, Occupational and Organizational Psychology and Behavior, 30, 323–328.
Lux, S., Ferris, G., Brouer, R., Laird, M., & Summers, J. (2008). A multi-level concep-
tualization of organizational politics. In C. Cooper & J. Barling (Eds.), The SAGE
handbook of organizational behavior (pp. 353–371). Thousand Oaks, CA: Sage.
Machiavelli, N. (1952). The prince. New York, NY: New American Library (The transla-
tion of Machiavelli’s The Prince by Luigi Ricci was first published in 1903).
Madison, D., Allen, R., Porter, L., Renwick, P., & Mayes, B. (1980). Organizational poli-
tics: An exploration of managers’ perceptions. Human Relations, 33, 79–100.
Maher, L., Gallagher, V., Rossi, A., Ferris, G., & Perrewé, P. (2018). Political skill and will
as predictors of impression management frequency and style: A three-study investi-
gation. Journal of Vocational Behavior, 107, 276–294.
Maslyn, J., Farmer, S., & Bettenhausen, K. (2017). When organizational politics matters:
The effects of the perceived frequency and distance of experienced politics. Human
Relations, 70, 1486–1513.
Maslyn, J., & Fedor, D. (1998). Perceptions of politics: Does measuring different foci mat-
ter? Journal of Applied Psychology, 83, 645–653.
Matta, F., Scott, B., Colquitt, J., Koopman, J., & Passantino, L. (2017). Is consistently
unfair better than sporadically fair? An investigation of justice variability and stress.
Academy of Management Journal, 60, 743–770.
Mayes, B., & Allen, R. (1977). Toward a definition of organizational politics. Academy of
Management Review, 2, 672–678.
McArthur, J. (1917). What a company officer should know. New York, NY: Harvey Press.
Miller, B., Rutherford, M., & Kolodinsky, R. (2008). Perceptions of organizational politics:
A meta-analysis of outcomes. Journal of Business and Psychology, 22, 209–222.
Mintzberg, H. (1983). Power in and around organizations. Englewood Cliffs, NJ: Prentice-
Hall.
Mintzberg, H. (1985). The organization as political arena. Journal of Management Studies,
22, 133–154.
Misangyi, V., Greckhamer, T., Furnari, S., Fiss, P., Crilly, D., & Aguilera, R. (2017). Em-
bracing causal complexity: The emergence of a neo-configurational perspective.
Journal of Management, 43, 255–282.
Mitchell, M., Baer, M., Ambrose, M., Folger, R., & Palmer, N. (2018). Cheating under
pressure: A self-protection model of workplace cheating behavior. Journal of Ap-
plied Psychology, 103, 54–73.
Molina-Azorin, J., Bergh, D., Corley, K., & Ketchen, D. (2017). Mixed methods in the or-
ganizational sciences: Taking stock and moving forward. Organizational Research
Methods, 20, 179–192.
Morgan, L. (1989). “Political will” and community participation in Costa Rican primary
health care. Medical Anthropology Quarterly, 3, 232–245.
Morgeson, F., Mitchell, T., & Liu, D. (2015). Event system theory: An event-oriented ap-
proach to the organizational sciences. Academy of Management Review, 40, 515–
537.
Munyon, T., Summers, J., Thompson, K., & Ferris, G. (2015). Political skill and work
outcomes: A theoretical extension, meta‐analytic investigation, and agenda for the
future. Personnel Psychology, 68, 143–184.
Nye, L., & Witt, L. (1993). Dimensionality and construct validity of the perceptions of
organizational politics scale (POPS). Educational and Psychological Measurement,
53, 821–829.
O’Shea, P. (1920). Employees’ magazines for factories, offices, and business organiza-
tions. New York, NY: Wilson.
Perrewé, P., Zellars, K., Ferris, G., Rossi, A., Kacmar, C., & Ralston, D. (2004). Neutral-
izing job stressors: Political skill as an antidote to the dysfunctional consequences
of role conflict. Academy of Management Journal, 47, 141–152.
Pfeffer, J. (1981). Power in organizations. Marshfield, MA: Pitman.
Pfeffer, J. (1992). Managing with power: Politics and influence in organizations. Boston,
MA: Harvard Business Press.
Pfeffer, J. (2010). Power: Why some people have it and others don’t. New York, NY: Harp-
erCollins Publishers.
Pierce, J., & Aguinis, H. (2013). The too-much-of-a-good-thing effect in management.
Journal of Management, 39, 313–338.
Porter, L. (1976). Organizations as political animals. Presidential address, Division of
Industrial-Organizational Psychology, 84th Annual Meeting of the American Psy-
chological Association, Washington, DC.
Porter, L., Allen, R., & Angle, H. (1981). The politics of upward influence in organiza-
tions. In L. Cummings, & B. Staw (Eds.), Research in organizational behavior (pp.
109–149). Greenwich, CT: JAI Press.
Porter, C., Outlaw, R., Gale, J., & Cho, T. (2019). The use of online panel data in man-
agement research: A review and recommendations. Journal of Management, 45,
319–344.
Post, L., Raile, A., & Raile, E. (2010). Defining political will. Politics & Policy, 38, 653–
676.
Ravasi, D., Rindova, V., Etter, M., & Cornelissen, J. (2018). The formation of organiza-
tional reputation. Academy of Management Annals, 12, 574–599.
Reitz, A., Motti-Stefanidi, F., & Asendorpf, J. (2016). Me, us, and them: Testing sociom-
eter theory in a socially diverse real-life context. Journal of Personality and Social
Psychology, 110, 908–920.
Rihoux, B., & Ragin, C. (2008). Configurational comparative methods: Qualitative com-
parative analysis (QCA) and related techniques (Vol. 51). Thousand Oaks, CA:
Sage Publications.
Rindova, V., Williamson, I., & Petkova, A. (2010). Reputation as an intangible asset: Re-
flections on theory and methods in two empirical studies of business school reputa-
tions. Journal of Management, 36, 610–619.
Rose, P., & Greeley, M. (2006). Education in fragile states: Capturing lessons and identify-
ing good practice. Brighton, UK: DAC Fragile States Group.
Rosen, C., Ferris, D., Brown, D., Chen, Y., & Yan, M. (2014). Perceptions of organizational
politics: A need satisfaction paradigm. Organization Science, 25, 1026–1055.
Rosen, C., & Hochwarter, W. (2014). Looking back and falling further behind: The mod-
erating role of rumination on the relationship between organizational politics and
employee attitudes, well-being, and performance. Organizational Behavior and Hu-
man Decision Processes, 124, 177–189.
Rosen, C., Kacmar, K., Harris, K., Gavin, M., & Hochwarter, W. (2017). Workplace poli-
tics and performance appraisal: A two-study, multilevel field investigation. Journal
of Leadership & Organizational Studies, 24, 20–38.
Rosen, C., Koopman, J., Gabriel, A., & Johnson, R. (2016). Who strikes back? A daily
investigation of when and why incivility begets incivility. Journal of Applied Psy-
chology, 101, 1620–1634.
Rosen, C., Levy, P., & Hall, R. (2006). Placing perceptions of politics in the context of the
feedback environment, employee attitudes, and job performance. Journal of Applied
Psychology, 91, 211–220.
Runkel, P., & McGrath, J. (1972). Research on human behavior: A systematic guide to method. New York, NY: Holt, Rinehart and Winston.
Salancik, G., & Pfeffer, J. (1978). A social information processing approach to job attitudes
and task design. Administrative Science Quarterly, 23, 224–253.
Saleem, H. (2015). The impact of leadership styles on job satisfaction and mediating role
of perceived organizational politics. Procedia-Social and Behavioral Sciences, 172,
563–569.
Schein, V. (1977). Individual power and political behaviors in organizations: An inade-
quately explored reality. Academy of Management Review, 2, 64–72.
Schriesheim, C., Powers, K., Scandura, T., Gardiner, C., & Lankau, M. (1993). Improv-
ing construct measurement in management research: Comments and a quantitative
Van Knippenberg, B., & Steensma, H. (2003). Future interaction expectation and the use of
soft and hard influence tactics. Applied Psychology, 52, 55–67.
Van Maanen, J., Sorensen, J., & Mitchell, T. (2007). The interplay between theory and
method. Academy of Management Review, 32, 1145–1154.
Vecchio, R., & Sussman, M. (1991). Choice of influence tactics: individual and organiza-
tional determinants. Journal of Organizational Behavior, 12, 73–80.
Vigoda, E. (2002). Stress-related aftermaths to workplace politics: The relationships
among politics, job distress, and aggressive behavior in organizations. Journal of
Organizational Behavior, 23, 571–588.
Von Hippel, W., Lakin, J., & Shakarchi, R. (2005). Individual differences in motivated
social cognition: The case of self-serving information processing. Personality and
Social Psychology Bulletin, 31, 1347–1357.
Wade, J., Porac, J., Pollock, T., & Graffin, S. (2006). The burden of celebrity: The impact of
CEO certification contests on CEO pay and performance. Academy of Management
Journal, 49, 643–660.
Wheeler, A., Shanine, K., Leon, M., & Whitman, M. (2014). Student‐recruited samples
in organizational research: A review, analysis, and guidelines for future research.
Journal of Occupational and Organizational Psychology, 87, 1–26.
Whitman, M., Halbesleben, J., & Shanine, K. (2013). Psychological entitlement and abu-
sive supervision: Political skill as a self-regulatory mechanism. Health Care Man-
agement Review, 38, 248–257.
Wickenberg, J., & Kylén, S. (2007). How frequent is organizational political behaviour? A
study of managers’ opinions at 491 workplaces. In S. Reddy (Ed.), Organizational
politics—New insights (pp. 82–94). Hyderabad, India: ICFAI University Press.
Yukl, G., & Falbe, C. (1990). Influence tactics and objectives in upward, downward, and
lateral influence attempts. Journal of Applied Psychology, 75, 132–140.
Yukl, G., & Tracey, J. (1992). Consequences of influence tactics used with subordinates,
peers, and the boss. Journal of Applied Psychology, 77, 525–535.
Zanzi, A., Arthur, M., & Shamir, B. (1991). The relationships between career concerns and
political tactics in organizations. Journal of Organizational Behavior, 12, 219–233.
Zare, M., & Flinchbaugh, C. (2019). Voice, creativity, and big five personality traits: A
meta-analysis. Human Performance, 32, 30–51.
Zhang, Y., & Lu, C. (2009). Challenge stressor-hindrance stressor and employees’ work-re-
lated attitudes, and behaviors: The moderating effects of general self-efficacy. Acta
Psychologica Sinica, 6, 501–509.
Zinko, R., Gentry, W., & Laird, M. (2016). A development of the dimensions of personal
reputation in organizations. International Journal of Organizational Analysis, 24,
634–649.
CHAPTER 8
RANGE RESTRICTION IN
EMPLOYMENT INTERVIEWS
An Influence Too Big to Ignore
Allen I. Huffcutt
personality. Procedures that are the most time intensive and/or expensive tend
to be implemented at (or towards) the end, usually after a number of candidates
have been eliminated. Interviews typically fall in this latter category. The later the
interview is in the selection process, the greater the possibility for (and extent of)
range restriction.
The degree to which range restriction can diminish the magnitude of validity
coefficients is illustrated in several of the larger and more prominent interview
meta-analyses. For instance, McDaniel, Whetzel, Schmidt, and Maurer (1994)
reported that the mean population validity of the Situational Interview or SI
(Latham, Saari, Pursell, & Campion, 1980) rose from .35 to .50 after further cor-
rection for range restriction. Expressed as a percent-of-variance, SIs accounted
for 12% of performance variance without the correction and 25% after. (As ex-
plained later, McDaniel et al.’s correction is most likely conservative because it
was based on direct rather than indirect restriction.)
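Expressed explicitly, the percent-of-variance figures are simply the squared validity coefficients:

$$.35^2 = .1225 \approx 12\%, \qquad .50^2 = .25 = 25\%.$$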
Yet, most primary interview researchers—those actually conducting studies
rather than meta-analytically summarizing them—fail to take range restriction
into account. The lack of attention is evident with a quick search on PsycINFO
(search date: 4-11-2018). Entering “job applicant interviews” as the subject (SU)
term resulted in 1,472 entries. Adding “range restriction” as a second search term
anywhere in the document reduced that number to only seven. Using the alter-
nate term “restriction in range” resulted in the same number. Additional evidence
comes again from the McDaniel et al. (1994) interview meta-analysis, where only
14 of the 160 total studies in their dataset reported range restriction information
(see pp. 605–606).
Such widespread lack of consideration is surprising in one respect because
the mathematics behind range restriction, and the equations to correct for it, have
been around for a long time. Building on the earlier work of Pearson (1903),
for example, Thorndike (1949) presented the relatively straightforward procedure
needed to correct for direct (i.e., Case II) restriction.1 The first meta-analysis book
in Industrial-Organizational (I-O) psychology, Hunter et al. (1982), also outlined
the correction procedure for direct range restriction and illustrated how to utilize
it in selection research.
Unfortunately, the issue of range restriction got more complex for employment
interview researchers in the mid-2000s. Throughout essentially the entire history
of selection research, restriction was largely presumed to be direct. Schmidt and
colleagues (e.g., Hunter & Schmidt, 2004; Hunter, Schmidt, & Le, 2006) made
the assertion that most restriction is actually indirect rather than direct. Thorndike
(1949) provided the equations needed to correct for indirect (Case III) restriction
as well, but they were generally not viable for selection contexts because too
much of the needed information was unknown. Hunter et al. were able to sim-
plify the mathematics to make the indirect correction more feasible, although it is
still more complicated than the direct correction. To distinguish their methodology from that of
Thorndike, they named their procedure Case IV.
Interview researchers as a whole appear to have paid even less attention to the indirect form of restriction. Another search on PsycINFO
(same date) combining “job applicant interviews” as the subject term with “indi-
rect restriction” as a general term anywhere in the document yielded only three
entries.2 The first was an interview meta-analysis that incorporated indirect re-
striction as its primary purpose (Huffcutt, Culbertson, & Weyhrauch, 2014a). The
second was a reanalysis of the McDaniel et al. (1994) interview dataset using in-
direct methodology (Oh, Postlethwaite, & Schmidt, 2013; see also Le & Schmidt,
2006). The third was a general commentary on indirect restriction (Schmitt, 2007).
Failure to account for range restriction in a primary interview study (and in a
meta-analysis for that matter) can result in inaccurate or even mistaken conclu-
sions. Consider a company that wants to switch from traditional, unstructured
interviews to a new structured format such as a SI or a Behavior Description
Interview or BDI (Janz, 1982), but isn’t completely sure doing so is worth the
time and administrative trouble. If range restriction is present, which is likely, the
resulting validity coefficient will be artificially low. It might even be low enough
that the company decides not to make the switch.
One possible reason for the lack of consistent attention to range restriction
among interview researchers is that they don’t have a good, intuitive feel for its
effects. Graduate school treatments of range restriction, along with prominent psychometric textbooks (e.g., Nunnally, 1978), tend to focus only on the corrective formulas. Visual presentations, such as scatterplots, are often missing. Another
potential reason is that some of the needed information (e.g., the unrestricted
standard deviation of interview ratings in the applicant population) may not be
readily available given the customized nature of interviews (e.g., as opposed to
standardized ability measures). A final reason, one based on convenience and/or
expense, is that interview researchers may not feel they have the time to refamil-
iarize themselves with the correction process, which is not always presented in a user-friendly manner, or to purchase meta-analytic software that they would only
use periodically or perhaps even once (e.g., Schmidt & Le, 2014).
The overarching purpose of this manuscript is to provide a convenient, all-in-
one reference for interview researchers to help them deal with range restriction.
Subsumed under this purpose is an overview of the basic concepts and mechan-
ics of range restriction (including the all-important difference between its direct
and indirect forms), visual presentation of restriction effects via scatterplots to
enhance intuitive understanding, and a summary of equations and procedures for
their use in restriction correction. Further, realistic simulations are utilized to de-
rive some of the most difficult parameters for interviews, and then these param-
eters are built into the correction equations in order to simplify them.
The simulations that follow focus on high-structure interviews, as they consistently show the highest validity and are
considerably more standardized across situations than unstructured ones. For in-
stance, although the content of questions varies, all SIs are comprised of hypothet-
ical scenarios while all BDIs focus exclusively on description of past experiences.
In contrast, the content, nature, and even format of unstructured interviews can
vary immensely by interviewer and even by interview. Indeed, it is not surprising
that unstructured interviews have been likened to a “disorganized conversa-
tion.”
A key parameter in this distribution is the population correlation between high-
ly structured interview ratings and job performance. At the present time, the best
available estimate appears to be the fully corrected (via indirect methodology)
population value (i.e., rho) of .69 from Huffcutt et al. (2014a). They provided
population estimates for four levels of structure (none to highly structured), and
this value is for the highest level (see p. 303). This level includes virtually all SIs,
and a majority of BDIs. (BDIs can be conducted using more of an intermediate
level of structure, such as allowing interviewers to choose questions from a bank
and to probe extensively; such studies usually reside at Level 3.) Their value of
.69 is corrected for both unreliability in performance assessment and range re-
striction, but not for unreliability in the interview ratings themselves. As explained in more detail below, such a correction results in “operational validity” rather than a construct-to-construct association, that is, the degree to which a predictor (in its actual, imperfect state) is
associated with true performance.
To enhance realism (and out of the necessity of choosing a scaling), interview
parameters from Weekley and Gier (1987) were utilized. They developed a SI to
select entry-level associates in a national retail outlet. The sample question they
provide (see p. 485) about an angry customer whose watch is late coming back
from being repaired is cited regularly as an example of the SI format. Their final
interview contained 16 questions, all rated using the typical five-point scale that
has behavioral benchmarks at one, three, and five, resulting in a possible range of
16 to 80 with a midpoint of 48. Using Excel, a normal distribution was generated
with a mean of 48.0 (the midpoint) and a standard deviation of 7.4 (the actual sd
in their validity sample; see p. 486). In regard to sample size, 100 was chosen for convenience. These parameters should be reasonably representative of high-
structure interviews in general.3
On the performance side, the goal was to create a second distribution that cor-
related .69 with the original distribution of interview ratings. Using Excel, the
interview distribution was copied, sufficient measurement error was added to re-
duce the correlation with the original distribution to .69, and then the result was
rescaled to have a mean of 50.0 and a standard deviation of 10.0 (i.e., T scaling).
Given the extremely wide variation in performance rating formats across studies,
this particular scaling was chosen for convenience, on the assumption that it is reasonably representative.
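The same pair of distributions can be generated outside of Excel. The sketch below uses the parameters reported above (interview mean of 48.0 and sd of 7.4, performance rescaled to a mean of 50 and sd of 10, a target correlation of .69, and n = 100); the correlated-noise construction and random seed are incidental implementation choices rather than part of the original procedure.

```python
import numpy as np

rng = np.random.default_rng(123)
n, rho = 100, .69

# Interview ratings: normal distribution with the mean and sd from the text.
interview = rng.normal(loc=48.0, scale=7.4, size=n)

# Performance: add noise so the correlation with interview ratings is
# approximately rho, then rescale to T scores (mean 50, sd 10).
z = (interview - interview.mean()) / interview.std()
perf_z = rho * z + np.sqrt(1 - rho**2) * rng.normal(size=n)
performance = 50.0 + 10.0 * (perf_z - perf_z.mean()) / perf_z.std()

print(np.corrcoef(interview, performance)[0, 1])  # close to .69
```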
The ratings interviewers make, while error-prone, are used to make actual selection decisions. Hence, the coefficient corrected for performance unreliability alone is often referred to as “operational validity” (Schmidt, Hunter, Pearlman, & Rothstein-Hirsh, 1985, p. 763). Interviewer ratings can be corrected as well, and doing so provides valuable (albeit
theoretical) information on construct associations. To illustrate, Huffcutt, Roth,
and McDaniel (1996) corrected for measurement error in both interview ratings
and cognitive ability test scores (see p. 465) in order to assess the degree of con-
struct saturation of the latter in the former.
Statistically, correcting for measurement error in performance ratings is ac-
complished as shown in Formula 1, where ro is the observed (actual) validity coef-
ficient and ryy is the performance IRR (i.e., .52). Note that the correction involves
the square root of the reliability. The correction returns the validity coefficient to
its full population value. Readers are referred to Schmidt and Hunter (2015) for
more information on this correction (see p. 112).
$$r_c = \frac{r_o}{\sqrt{r_{yy}}} = \frac{r_o}{\sqrt{.52}} = \frac{r_o}{.72} = \frac{.50}{.72} = .69 \qquad (1)$$
If an interview researcher has a study that fits this scenario (at least to a reason-
able degree), the correction is simple. Just divide the actual validity coefficient by
.72. Situations where a new structured interview is being pilot tested with appli-
cants (not incumbents) and is not used to make actual selection decisions would
be particularly relevant, especially if a high majority of applicants are hired and
retained once on the job.
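For instance, a small helper like the one below (an illustrative sketch, not part of the chapter) wraps that division; the function name and the default of .52 for the performance IRR are assumptions made for the example.

```python
def correct_criterion_unreliability(r_obs: float, r_yy: float = .52) -> float:
    """Formula 1: divide the observed validity coefficient by the square root of the criterion IRR."""
    return r_obs / (r_yy ** 0.5)

print(round(correct_criterion_unreliability(.50), 2))  # 0.69, matching the worked value above
```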
a general scenario of 10% attrition, 5% from the top and 5% from the bottom. Based
on a simulation, they derived a range restriction ratio (u) of .80 (see p. 550), which is
the standard deviation of the restricted ratings (R) divided by the standard deviation
of the unrestricted ratings (P) in the population (i.e., sdR/sdP).
In regard to correction for attrition, a key question is whether it represents direct
or indirect restriction. Given that all (or most) applicants are hired in this scenario
and that attrition is an end-stage phenomenon, it seems reasonable to view it as
direct. The formulas for direct restriction (Hunter & Schmidt, 1990, p. 48; Hunter
& Schmidt, 2004, p. 215; see also Callender & Osburn, 1980, p. 549) are generally
intended for the predictor (here interviews). However, as noted by Schmidt and
Hunter (2015, p. 48), the effects of the predictor and the criterion on the validity co-
efficient are symmetrical; hence, the same formulas can be used to correct for attri-
tion restriction by itself. If there happens to be both restriction on the predictor and
attrition, then things get considerably more complicated (see Schmidt & Hunter, 2015, p.
48, for a discussion). This situation is addressed in Scenario 4.
In regard to procedure, the direct correction equation from Callender and Os-
burn (1980, p. 549) seems particularly popular and widely used (see Hunter &
Schmidt, 1990, p. 48; Hunter & Schmidt, 2004, p. 37). It is presented as Formula
2. The key component in this equation is u, the range restriction ratio noted above.
$$r_c = \frac{r_o}{\sqrt{(1-u^2)\,r_o^2 + u^2}} \qquad (2)$$

Inserting the attrition-based value of u = .80 yields Formula 3:

$$r_c = \frac{r_o}{\sqrt{(1-.80^2)\,r_o^2 + .80^2}} = \frac{r_o}{\sqrt{.36\,r_o^2 + .64}} \qquad (3)$$
The above correction restores the validity coefficient to what it would have
been had there not been any attrition. To estimate operational validity, however,
an additional correction needs to be made for measurement error in the perfor-
mance ratings, which, fortunately, can be combined with the correction for direct
restriction. That equation is presented as Formula 4. As before, .52 is used for the
IRR of performance ratings.
$$r_c = \frac{r_o}{\sqrt{.52}\,\sqrt{(1-.64)\,r_o^2 + .64}} = \frac{r_o}{.72\,\sqrt{.36\,r_o^2 + .64}} \qquad (4)$$
Using this equation, an interview researcher with a study that fits reasonably
well with this scenario simply has to enter his/her observed correlation in the
last part of the above equation and do the computations to find an estimate of the
corrected (population) correlation. Situations where a new structured interview is
being pilot tested with applicants and is not used to make actual selection deci-
sions, a high majority of applicants are hired and/or hiring is done without strong
reference to merit, and there is moderate (but not extreme) attrition by the time
performance ratings are collected (from both the top and bottom) would be par-
ticularly relevant.
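A compact Python sketch of Formula 4 appears below; the function name is invented for the example, and the observed coefficient of .40 in the usage line is purely hypothetical.

```python
import math

def correct_attrition_and_unreliability(r_obs: float, u: float = .80, r_yy: float = .52) -> float:
    """Formula 4: simultaneous correction for direct restriction due to attrition (u)
    and measurement error in performance ratings (r_yy)."""
    return r_obs / (math.sqrt(r_yy) * math.sqrt((1 - u ** 2) * r_obs ** 2 + u ** 2))

print(round(correct_attrition_and_unreliability(.40), 2))  # hypothetical observed r of .40 -> ~0.66
```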
FIGURE 8.3. Scatterplots illustrating the association between interview and job
performance ratings with 90%, 50%, and 10% hiring respectively (and measure-
ment error in performance ratings). The correlations are .44, .39, and .29 respec-
tively.
sufficient measurement error was added to reduce the correlation with interview
ratings to the estimated value with performance error induced. Finally, a scat-
terplot was created. The scatterplots for all three levels of hiring are shown in
Figure 8.3.
The traditional way to correct for direct range restriction and performance
measurement error is to do the corrections simultaneously by combining Formulas 1
and 2 as shown in Formula 5 below (Callender & Osburn, 1980, p. 549; Hunter &
Schmidt, 1990, p. 48; Hunter & Schmidt, 2004, p. 215). This is essentially what
was done in Formula 4 in the correction for attrition. Like there, the key parameter
is the range restriction ratio u, which here is the ratio of the restricted standard
deviation of interview ratings to the unrestricted one.
$$r_c = \frac{r_o}{\sqrt{r_{yy}}\,\sqrt{(1-u^2)\,r_o^2 + u^2}} \qquad (5)$$
Hunter et al. (2006) presented a two-step alternative based on “the little known
fact that when range restriction is direct, accurate corrections for range restriction
require not only use of the appropriate correction formula…but also the correct
sequencing of corrections for measurement error and range restriction” (p. 596).
In their method, the observed validity coefficient is corrected first for measure-
ment error in performance ratings (since that occurs last) using the restricted IRR
value (i.e., .52; denoted r_YY_R). That is accomplished using Formula 1. Then, the corrected coefficient is inserted into an accompanying restriction formula (Step 2 in their Table 1; see p. 599). To simplify the process, the formulas for these two steps are integrated into one, which is shown as Formula 6. Note that U_X is the inverse of the range restriction ratio (i.e., 1/u_X).
$$r_c = \frac{U_X\left(r_o/\sqrt{r_{YY\_R}}\right)}{\sqrt{1 + (U_X^2 - 1)\left(r_o/\sqrt{r_{YY\_R}}\right)^2}} \qquad (6)$$
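A short Python sketch of the integrated two-step correction is given below; the function name is invented for the example, and the three observed coefficients and u values anticipate the results reported in the next paragraphs.

```python
import math

def two_step_correction(r_obs: float, u_x: float, r_yy_restricted: float = .52) -> float:
    """Hunter et al. (2006) two-step logic: correct for criterion unreliability first,
    then apply the direct range restriction correction with U_X = 1 / u_x (Formula 6)."""
    r1 = r_obs / math.sqrt(r_yy_restricted)          # Step 1: Formula 1 with the restricted IRR
    U = 1.0 / u_x                                    # Step 2 uses the inverse restriction ratio
    return (U * r1) / math.sqrt(1 + (U ** 2 - 1) * r1 ** 2)

# The three hiring levels discussed below (90%, 50%, and 10%)
for r_obs, u_x in [(.44, .81), (.39, .67), (.29, .47)]:
    print(round(two_step_correction(r_obs, u_x), 2))  # ~0.69, 0.69, 0.68 (differences reflect rounding of u)
```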
Now to the results. For 90% hiring (top panel in Figure 8.3), the standard de-
viation with the bottom 10% of interview ratings removed is 6.0, resulting in a
u value of .81 (i.e., 6.0/7.4) and a U value of 1.24 (i.e., 1/.81). The performance
IRR value, as always, is .52. The validity coefficient drops to .44. Inserting these
values into the above formula, as shown in Formula 7, returns the fully corrected
value of .69 (which is important to confirm given that the validity coefficient was
computed from the actual data after removal of the bottom 10%).
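The displays for Formulas 7 and 8 are not reproduced here, but given the parameters just listed they plausibly take approximately the following form (a reconstruction with rounded constants, the second being the isolated, researcher-usable version):

$$r_c = \frac{1.24\,(.44/.72)}{\sqrt{1 + (1.24^2 - 1)(.44/.72)^2}} \approx .69 \qquad (7)$$

$$r_c = \frac{1.72\,r_o}{\sqrt{1 + 1.04\,r_o^2}} \qquad (8)$$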
For 50% hiring (middle panel in Figure 8.3), the standard deviation with the
bottom half of interview ratings removed is 5.0, resulting in a u value of .67 (i.e.,
5.0 / 7.4) and a U value of 1.49 (i.e., 1 / .67). The validity coefficient drops to .39.
Inserting these values into Formula 6, as shown in Formula 9 below, returns the
fully corrected value of .69.
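The corresponding displays for the 50% case presumably read approximately as follows (again a reconstruction with rounded constants; the second is the isolated form referenced later as Formula 10):

$$r_c = \frac{1.49\,(.39/.72)}{\sqrt{1 + (1.49^2 - 1)(.39/.72)^2}} \approx .69 \qquad (9)$$

$$r_c = \frac{2.07\,r_o}{\sqrt{1 + 2.35\,r_o^2}} \qquad (10)$$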
Finally, for 10% hiring (bottom panel in Figure 8.3), the standard deviation
with the bottom 90% of interview ratings removed is 3.5, resulting in a u value
of .47 (i.e., 3.5 / 7.4) and a U value of 2.13 (i.e., 1 / .47). The validity coefficient
drops to .29. Inserting these values into Formula 6, as shown in Formula 11, re-
turns the fully corrected value of .69.
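For the 10% case, the plug-in referenced as Formula 11 presumably reads approximately:

$$r_c = \frac{2.13\,(.29/.72)}{\sqrt{1 + (2.13^2 - 1)(.29/.72)^2}} \approx .69 \qquad (11)$$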
Isolating the relevant portion once again, the result is shown in Formula 12.
Interview researchers can use this equation when a highly structured interview
is used in a top-down fashion in selection, there is minimal preselection prior to
the interview, only a small percentage of applicants are hired, and there is no (or
minimal) attrition by the time that performance ratings are collected.
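Under the same reconstruction, the isolated form referenced as Formula 12 would be approximately:

$$r_c = \frac{2.96\,r_o}{\sqrt{1 + 6.82\,r_o^2}} \qquad (12)$$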
variate. Because of the relatively small concentration of data points at the low end
of the distribution, it does not take much elimination of points from that region to
reduce the overall range noticeably.
Conversely, comparing the middle panel in Figure 8.3 with the top panel, the
change in the scatterplot from 90% to 50% hiring may not be as pronounced as some might expect. Specifically, the range drops from 29 to only 20 even though four times as many points were eliminated (compared to 90% hiring). This time, the elimination occurred in the very dense scoring region leading up to the middle of the distribution. Because of that density, the drop in range is much more modest, in fact slightly less than the change from no restriction to 90% hiring. The drop in
range from 50% to 10% (second and third panels in Figure 8.3) is essentially the
same because it is the same region, just on the back side of the center.
There is a potentially important implication of this phenomenon, one that
should be explored further in future research. Given the low density in the high
end of the distribution (just like in the low end), one would expect the range (and
validity coefficient) to drop somewhat noticeably as the hiring ratio drops in rela-
tively small increments below 10%. This issue is particularly important for jobs
where a large number of individuals often apply (e.g., academic positions) and/or
when unemployment is high. In both cases, a very limited number of individuals
(sometimes only one) are hired.
The second phenomenon pertains to Hunter et al.’s (2006) two-step alternative
procedure, which continues to be “little known” (p. 596) in the general meta-
analytic community. Does it really lead to improved estimates over the traditional
Callender and Osburn (1980)-type simultaneous correction? As a supplemental
analysis, the computations were rerun for all three hiring percentages using the
simultaneous approach. The corrected validity coefficient was in fact overesti-
mated at all three hiring levels. Moreover, the degree of overestimation increased
progressively as the hiring percentage decreased. The overestimation was .03 at
90% hiring (i.e., .72 vs. .69), .05 at 50% hiring (i.e., .74 vs. .69), and .07 at 10%
hiring (i.e., .76 vs. .69). Clearly, the two-step procedure seems more accurate, par-
ticularly with lower hiring percentages.
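As a check on the 50% case, inserting the rounded values reported above into the simultaneous correction (Formula 5) reproduces the overestimate:

$$r_c = \frac{.39}{\sqrt{.52}\,\sqrt{(1-.67^2)(.39)^2 + .67^2}} = \frac{.39}{.72 \times .73} \approx .74$$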
The effects of double restriction are illustrated using the Scenario 3 data with
50% hiring. Unlike the other scenarios, a correction formula is not offered, as
again, one does not currently exist. That said, it is important for both practitioners
and researchers to understand fully the debilitating effects of double restriction, especially since it is likely to be extremely common in practice.
Recalling the 50% case (the middle panel in Figure 8.3), the standard deviation
with the bottom half of interview ratings removed is 5.0, resulting in a u value of
5.0 / 7.4 or .67, and the validity coefficient drops to .39. Those data were sorted by
performance rating from highest to lowest, and then the top and bottom 5% (of the original applicant pool of 100) were removed. Given that the starting sample size is 50, that corresponded to removal
of the top five and bottom five sets of ratings and a final sample size of 40.
The resulting distribution is shown in Figure 8.4. Removal of the top and bot-
tom 5% causes the validity coefficient to drop from .39 to .06. The standard devia-
tion of interview ratings dropped only modestly, from 5.0 to 4.5. As expected, the
standard deviation of performance ratings dropped more noticeably, from 11.1 to
7.1, although by itself, such reduction does not appear sufficient to account for the
somewhat drastic drop in validity (at least not fully).
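A rough Python sketch of the trimming step just described (sorting by performance and dropping cases at both ends) is shown below; the function name is illustrative, and it assumes interview and performance arrays like those from the earlier simulation, already restricted to the top half of interview scores.

```python
import numpy as np

def doubly_restrict(interview: np.ndarray, performance: np.ndarray, n_each_end: int = 5):
    """Mimic end-stage attrition: drop the highest and lowest performers from an
    already range-restricted sample, as in the 50%-hiring illustration."""
    order = np.argsort(performance)                      # ascending by performance rating
    keep = order[n_each_end:len(performance) - n_each_end]
    return interview[keep], performance[keep]
```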
So why did the validity coefficient drop from a somewhat respectable .39 to
something not that far from zero? Schmidt and Hunter (2015) provide valuable in-
sight, namely that the regression line changes in complex ways when there is double restriction, including no longer being linear and homoscedastic.4

FIGURE 8.4. Scatterplot between interview and job performance ratings with 50% hiring and 10% attrition (5% from the bottom and top respectively). The correlation is .06.

Inspection of Figure 8.4 suggests that the regression line, which retained essentially the same pronounced slope throughout the previous scenarios, is now almost flat. Imagine
an upward sloping rectangle, and then slicing off the bottom left and top right
corners. Those two corners, a noticeable portion of which were removed because
of 10% attrition, were largely responsible for the distinct upward slope. And, the
peculiar shape of this distribution appears to violate virtually every known re-
gression assumption about bi-variate relationships (see Cohen & Cohen, 1983;
Osborne, 2016), including being heteroscedastic. Given all these considerations, it is
not surprising that no statistical formulas exist for correction of double restriction.
The implications of this illustration are of paramount importance for organi-
zations. Head-to-head, 50% hiring with 10% attrition came out far worse than
10% hiring with no attrition (i.e., .06 vs. .29). It would appear that attrition, even
at relatively low levels (e.g., 10%), has a powerful influence on validity when
direct restriction is already present (and presumably indirect as well). And, the
assumption of 50% hiring with 10% attrition is probably conservative. There most likely are many employment situations where hiring is less than 50%, which should, in theory, make things even worse since the starting scatterplot and va-
lidity coefficient (before attrition effects) are already diminished and/or where
attrition is greater than 10%. Clearly, more research attention needs to be given to
developing ways to deal with double restriction.
al. (2006), this assumption is likely to be met to a close enough degree in selection
studies. If it is clear that this assumption does not hold, an alternative method has
been developed (see Le, Oh, Schmidt, & Wooldridge, 2016). Denoted as “Case
V” indirect correction, this method does not have the above assumption, but does
require the range restriction ratio for the second variable as well. If that variable
is job performance ratings, which is usually the case with selection, the range
restriction ratio for it is extremely difficult to obtain empirically (Le et al., p. 981).
Correction for Case IV indirect restriction is a five-step process, clearly mak-
ing it more involved than direct correction. Step 1 is to find or estimate the unre-
stricted reliability of the predictor in the applicant population (rXX_A). This, of
course, is not known for interviews. Accordingly, the equation for estimating it
is shown in Formula 13 (Schmidt & Hunter, 2015, p. 127), which involves the
restricted reliability value (rXX_R) and the range restriction ratio (uX).
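The display itself is not reproduced here; the Schmidt and Hunter (2015, p. 127) equation that this step cites has the following general form (offered as a sketch of the referenced formula):

$$r_{XX\_A} = 1 - u_X^2\,(1 - r_{XX\_R}) \qquad (13)$$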
Taking all three sources of measurement error into account (i.e., random re-
sponse, transient, and conspect), Huffcutt, Culbertson, and Weyhrauch (2013)
found a mean interrater reliability of .61 for highly structured interviews (see
Table 3, p. 271).5 In regard to the range restriction ratio, Hunter and Schmidt
(2004) recommend using a general value of .65 for all tests and all job families
when the actual value is unknown (see p. 184). Inserting these two values, the
equation becomes as shown in Formula 14. The pronounced difference between the restricted and unrestricted IRR values highlights yet another important psychometric principle, which is that reliability coefficients are influenced by range
restriction as well.
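Substituting these two values into the cited equation gives, presumably, something close to:

$$r_{XX\_A} = 1 - .65^2\,(1 - .61) = 1 - (.4225)(.39) \approx .84 \qquad (14)$$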
Step 2 is to convert the actual range restriction ratio (uX) into its equivalent for
true scores, unaffected by measurement error (i.e., uT). That equation is shown as
Formula 15 (Schmidt & Hunter, 2015, p. 127), which involves the unrestricted
applicant IRR value for the interview and the actual range restriction ratio. As
indicated, the range restriction ratio for true scores is smaller than the actual one,
which helps explain why indirect restriction tends to have a more detrimental ef-
fect than direct.
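Based on the Hunter et al. (2006) definition and the values used here (u_X = .65, r_XX_A ≈ .84), the referenced equation plausibly reads:

$$u_T = \sqrt{\frac{u_X^2 - (1 - r_{XX\_A})}{r_{XX\_A}}} = \sqrt{\frac{.65^2 - (1 - .84)}{.84}} \approx .56 \qquad (15)$$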
Step 3 is to correct the observed validity coefficient for measurement error in both the interview ratings and the performance ratings, using the restricted reliability values for each, as shown in Formula 16.

$$r_c = \frac{r_o}{\sqrt{r_{XX\_R}}\,\sqrt{r_{YY\_R}}} = \frac{r_o}{\sqrt{.61}\,\sqrt{.52}} = \frac{r_o}{.56} \qquad (16)$$
Step 4 is to make the actual correction for indirect restriction, the equation for
which is shown as Formula 17 (Schmidt & Hunter, 2015, p. 129). Note that this
formula uses UT, which is the inverse of uT (i.e., 1/.56=1.79). Also note that the
subscript “T” denotes true scores for the interview and “P” denotes true scores for
performance.
$$\rho_{TP} = \frac{U_T\,r_c}{\sqrt{(U_T^2 - 1)\,r_c^2 + 1}} = \frac{1.79\,r_c}{\sqrt{(1.79^2 - 1)\,r_c^2 + 1}} = \frac{1.79\,r_c}{\sqrt{2.20\,r_c^2 + 1}} \qquad (17)$$
Because a correction was made for interview reliability, the value of rho that
comes out of the above formula is actually the construct-level association between
interviews and performance. Thus, the final step, Step 5, is to translate it back to
operational validity by restoring measurement error in the interviews (Schmidt
& Hunter, 2015, p. 155). It is important to note that the IRR value used for inter-
views in this final step should be its unrestricted version and not the restricted one.
Using the value of .84 noted earlier, the computation becomes:
20. This value compares very favorably with Hunter et al.’s (2006) updated (via
indirect correction) value of .66 for the validity of General Mental Ability (GMA)
for medium complexity jobs (see p. 606).
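Pulling the five steps together, a compact Python sketch of the full indirect (Case IV) chain might look like the following; the function name is invented, the default parameters are the values used in this section, and the observed coefficient of .35 in the usage line is purely hypothetical (an illustration, not the chapter's Formula 18 or 19).

```python
import math

def case_iv_operational_validity(r_obs: float,
                                 r_xx_restricted: float = .61,
                                 r_yy_restricted: float = .52,
                                 u_x: float = .65) -> float:
    """Five-step correction for indirect (Case IV) range restriction, returning operational validity."""
    # Step 1: unrestricted applicant reliability of the interview (Formulas 13-14)
    r_xx_a = 1 - u_x ** 2 * (1 - r_xx_restricted)                     # ~.84
    # Step 2: range restriction ratio for true scores (Formula 15)
    u_t = math.sqrt((u_x ** 2 - (1 - r_xx_a)) / r_xx_a)               # ~.56
    # Step 3: correct for measurement error in both measures using restricted reliabilities (Formula 16)
    r_c = r_obs / (math.sqrt(r_xx_restricted) * math.sqrt(r_yy_restricted))
    # Step 4: correct the true-score correlation for indirect restriction (Formula 17)
    U_t = 1 / u_t
    rho_tp = (U_t * r_c) / math.sqrt((U_t ** 2 - 1) * r_c ** 2 + 1)
    # Step 5: reattenuate by the unrestricted interview reliability to return to operational validity
    return rho_tp * math.sqrt(r_xx_a)

print(round(case_iv_operational_validity(.35), 2))  # hypothetical observed r of .35 -> ~0.75
```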
DISCUSSION
The primary purpose of this manuscript is to provide a convenient, all-in-one ref-
erence for interview researchers to help them deal with range restriction. Was that
purpose accomplished? The answer is a qualified “yes.” Selection researchers in
a wide range of contexts should find the simplified formulas useful, especially so
given that the most difficult parameters are already estimated and incorporated. If
all (or at least a high majority) of applicants are hired and retained, then Formula
1 provides an easy correction for performance measurement error. If the interview
under consideration is used to make selection decisions in a top-down fashion,
then researchers simply pick the hiring proportion that is closest to their own ratio
(i.e., 90%, 50%, or 10%) and insert their observed validity coefficient into the
corresponding formula (i.e., Formula 8, 10, or 12). If the interview is not used to
make selection decisions, then the observed correlation can be inserted into For-
mula 19 for indirect correction.
Where the qualification manifests itself is when there is attrition. Due to the
symmetrical effects of the predictor and the criterion on the validity coefficient, a
modest level of attrition by itself (involving both the top and bottom segments
of the performance distribution) can be corrected for using Formula 4. Unfortu-
nately, when attrition is combined with any form of restriction, the impact on the
validity coefficient is both devastating and uncorrectable. Developing methods to
deal with attrition combined with restriction appears to be one of the most over-
looked psychometric challenges in the entire selection realm.
One possible way to deal with attrition and restriction is a backwards graphi-
cal approach. A similar method is found when correcting a predictor variable for
NOTES
1. Thorndike’s Case I correction is applicable to the relation between two
variables (X1 and X2) when the actual restriction is on X1 but restriction
information is available only for X2. He noted that this situation is un-
likely to be encountered very often in practice.
2. Several other selection reanalyses have been done by Schmidt and col-
leagues, which, for whatever reason, did not appear on this search. See
Oh et al. (2013, p. 301) for a summary.
3. The mean of their overall SI scores was actually above the midpoint of
the scale. The midpoint was chosen, however, in an attempt to keep the
distribution symmetrical. However, even with the mean at the midpoint,
there was a small skew (which is not surprising given that a sample size
of 100 is not overly large). There were also minor anomalies in subse-
quent distributions, such as with homoscedasticity. Liberty was taken
in adjusting some of the data points to correct these anomalies (here to
make the distribution highly symmetrical).
4. Homoscedasticity is the assumption that the variability of criterion
scores (e.g., range) is reasonably consistent across the entire spectrum of
predictor values. When violated, the distribution is said to be heterosce-
dastic, power is reduced, and Type I error rates are inflated (see Rosopa,
Schaffer, & Schroeder, 2013, for a comprehensive review).
5. While random response error and transient error reflect variations in in-
terviewee responses to essentially the same questions within the same
REFERENCES
Alexander, R. A., Carson, K. P., Alliger, G. M., & Carr, L. (1987). Correcting doubly trun-
cated correlations: An improved approximation for correcting the bivariate normal
correlation when truncation has occurred on both variables. Educational and Psy-
chological Measurement, 47, 309–315.
Arvey, R. R., Miller, H. E., Gould, R., & Burch, R. (1987). Interview validity for select-
ing sales clerks. Personnel Psychology, 40, 1–12. doi:10.1111/j.1744-6570.1987.
tb02373.x
Benz, M. P. (1974). Validation of the examination for Staff Nurse II. Urbana, IL: University
Civil Service Testing Program of Illinois, Testing Research Program.
Callender, J. C., & Osburn, H. G. (1980). Development and test of a new model for validity
generalization. Journal of Applied Psychology, 65, 543–558.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the
behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Conway, J. M., Jako, R. A., & Goodman, D. F. (1995). A meta-analysis of interrater and
internal consistency reliability of selection interviews. Journal of Applied Psychol-
ogy, 80, 565–579.
Huffcutt, A. I., Culbertson, S. S., & Weyhrauch, W. S. (2013). Employment interview reli-
ability: New meta-analytic estimates by structure and format. International Journal
of Selection and Assessment, 21, 264–276.
Huffcutt, A. I., Culbertson, S. S., & Weyhrauch, W. S. (2014a). Moving forward indirectly: Reanalyzing the validity of employment interviews with indirect range restriction methodology. International Journal of Selection and Assessment, 22, 297–309. doi:10.1111/ijsa.12078
Huffcutt, A. I., Culbertson, S. S., & Weyhrauch, W. S. (2014b). Multistage artifact correc-
tion: An illustration with structured employment interviews. Industrial and Organi-
zational Psychology: Perspectives on Science and Practice, 7, 552–557.
Huffcutt, A., Roth, P., & McDaniel, M. (1996). A meta-analytic investigation of cogni-
tive ability in employment interview evaluations: Moderating characteristics and
implications for incremental validity. Journal of Applied Psychology, 81, 459–473.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and
bias in research findings. Newbury Park, CA: Sage.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and
bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research
findings across studies. Beverly Hills, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Le, H. (2006). Implications of direct and indirect range restriction for meta-analysis methods and findings. Journal of Applied Psychology, 91, 594–612. doi:10.1037/0021-9010.91.3.594
Safety has been the focus of much research over the past four decades, given the
social and economic costs of unsafe work. For instance, the International Labor
Organization (2009) estimated that approximately 2.3 million workers die each
year due to occupational injuries and illnesses, and additionally, millions incur
non-fatal injuries and illnesses. More recently, the Liberty Mutual Research Insti-
tute for Safety (2016) estimated that US companies spend $62 billion in worker
compensation claims alone.
In the Human Resource Management and related literatures (e.g., Industrial
Psychology) safety climate has been perhaps the most heavily studied aspect of
workplace safety (Casey, Griffin, Flatau Harrison, & Neal, 2017; Hofmann, Burke,
& Zohar, 2017). Several meta-analyses have established that safety climate is an
important contextual antecedent of safety behavior and corresponding outcomes
(e.g., Christian, Bradley, Wallace, & Burke, 2009; Clarke, 2010, 2013; Nahrgang,
Morgeson, & Hofmann, 2011). However, the research included in these meta-analy-
ses varies considerably in several methodological and conceptual qualities that may
affect the inferences drawn from safety climate studies.
CONCEPTUALIZATION AND
MEASUREMENT OF SAFETY CLIMATE
Solid conceptual and operational definitions form the foundation of any research
enterprise. As Shadish, Cook, and Campbell (2002, p. 21) discussed: “the first
problem of causal generalization is always the same: How can we generalize from
a sample of instances and the data patterns associated with them to the particular
target constructs they represent?” Similarly, the AERA, APA, and NCME (1985)
standards for educational and psychological testing have long emphasized the
central and critical nature of construct validity in psychological measurement, a
view that has evolved into the perspective that all inferences about validity ulti-
mately are inferences about constructs. Although the unitary view of validity as
construct validity is not without critics (e.g., Kane, 2012; Lissitz & Samuelsen,
2007), the importance of understanding constructs is generally acknowledged as
central to the research enterprise. In the specific case of safety climate, solid con-
ceptual understanding of the definition of safety climate is both a foundational
issue in the literature and an often-overlooked stage of the research process.
Zohar (1980) is widely credited as the first researcher to describe safety cli-
mate as one of these strategic climates; he noted that “when the strategic focus
involves performance of high-risk operations, the resultant shared perceptions
define safety climate” (Zohar, 2010, p. 2009). Interestingly, Zohar’s (2010) re-
view appeared nearly a decade ago. At that time, he characterized the literature
as mostly focusing on climate measurement issues such as its factor structure and
predictive validity with a corresponding need for greater attention to theoretical
issues. Since then, multiple meta-analytic and narrative reviews have accumu-
lated evidence supporting the predictive validity of climate perceptions, demon-
strating the efficacy of climate-related interventions, and clarifying the theoretical
pathways linking safety climate to safety-related outcomes (Beus, Payne, Berg-
man, & Arthur, 2010; Christian et al., 2009; Clarke, 2010; Clarke, 2013; Hofmann
et al., 2017; Lee, Huang, Cheung, Chen, & Shaw, 2018; Leitão & Greiner, 2016;
Nahrgang et al., 2011). Despite this progress, definitional ambiguities remain a
problem in the literature with fundamental measurement issues about the nature
of safety climate remaining unresolved.
One fundamental definitional issue concerns the extent to which Zohar’s defi-
nition of safety climate is accepted in the literature. To address this question, we
coded studies according to how they defined safety climate based on the citation
used. A total of 86 studies (36.3%) cited Zohar, 25 studies (10.6%) cited Neal and
Griffin in some combination, and 11 studies offered a definition without a cita-
tion (4.6%). It is important to note that whereas Griffin and Neal’s earlier work
emphasized the individual level (e.g., Griffin & Neal, 2000; Neal, Griffin, & Hart,
2000), their later work emphasized both the individual and group level in a similar
fashion to Zohar (e.g., Casey, et al., 2017; Neal & Griffin, 2006). Interestingly,
110 studies (46.4%) offered some other citation and 42 studies (17.7%) did not
clearly define safety climate.
Table 9.1 presents illustrative examples of the range of these definitions. As
should be evident from the table, there are a wide range of approaches that vary in
how precisely they define safety climate. Some key definitional issues include (1)
whether safety climate is conceptualized as a group, individual, or multilevel con-
struct and thus, involves shared perceptions; (2) what is the temporal stability of
climate perceptions; and (3) whether safety climate narrowly refers to perceptions
about the relative priority of safety or whether safety climate also encompasses
perceptions about a variety of management practices that may inform perceptions
about the relative priority of safety.
Not shown in the table are examples from the literature of the many studies
that do not offer an explicit definition, appearing to take for granted that there is a
shared understanding of the meaning of safety climate, beyond something about
the idea that safety is important (for example, Arcury, Grzywacz, Chen, Mora,
& Quandt, 2014; Cox et al., 2017). Given that conceptual definitions should in-
form researchers’ methodological choices of what to measure, we see the lack of
definitional precision in the safety climate literature as troubling. Future research
should attend much more closely to definitional issues and strive toward consen-
sus on the fundamental meaning of safety climate, particularly toward greater use
of the original Zohar definition.
Dollard and Bakker (2010, p. 580) described psychosocial safety climate (PSC) as the extent to which the or-
ganization has “policies, practices, and procedures aimed to protect the health and
psychological safety of workers.” They elaborated (and empirically demonstrat-
ed) that PSC perceptions could be shared within an organizational unit (schools in
their study) and, similar to definitions of safety climate, they characterized PSC as
focused on perceptions about management policies, practices, and procedures that
reflected the relative priority of employees’ psychosocial health. Thus, PSC ex-
pands the health focus of safety climate to include psychosocial stressors and out-
comes in addition to the physical safety/injury prevention focus of safety climate.
Numerous studies show that PSC is related to psychosocial outcomes (e.g.,
Lawrie, Tuckey, & Dollard, 2018; Mansour & Tremblay, 2018, 2019). Some re-
search has linked PSC to safety related outcomes such as injuries and muscu-
loskeletal disorders (Hall, Dollard, & Coward, 2010; Idris, Dollard, Coward, &
Dormann, 2012). An understudied issue in this literature concerns the empirical
distinctiveness of safety climate and PSC. A few studies have examined measures
of both safety and psychosocial safety in the same study (e.g., Hall et al., 2010;
Bronkhorst, 2015; Bronkhorst & Vermeeren, 2016) but often using safety climate
measures that are PSC measures adapted to safety. These studies have found that
although PSC and safety climate measures are often highly correlated (i.e., r >
.69 in Bronkhorst, 2015; Bronkhorst & Vermeeren, 2016; and Study 1 of Idris et
al., 2012), they are structurally distinct with different patterns of correlates (Idris,
et al., 2012). However, more research is clearly required to determine the extent
to which PSC and safety climate measures are distinct and the possible boundary
conditions that might affect the degree to which they are related to each other and/
or to various safety and health-related outcomes.
Edmondson (1999, p. 354) defined psychological safety as “a shared belief
held by a work team that the team is safe for interpersonal risk taking.” Guer-
rero, Lapalme, and Séguin (2015) used the term “participative safety” to describe
essentially the same idea, but the term psychological safety is much more com-
monly used and Edmondson’s approach appears to represent consensus in the
now fairly extensive literature on the antecedents and outcomes of psychological
safety (Newman, Donohue, & Eva, 2017). Some of the qualities of a psychologi-
cally safe work environment include mutual respect among coworkers, the ability
to engage in constructive conflict, and comfort in expressing opinions and taking
interpersonal risks (Newman, et al., 2017). Thus, whereas safety climate focuses
on perceptions about the organization’s relative priority for employees’ physical
safety and PSC focuses on relative priorities for psychosocial health, psychologi-
cal safety refers to employees’ general comfort in the interpersonal aspects of the
workplace.
Another definitional issue in safety climate research is the distinction between
psychological safety and individual-level perceptions of safety climate. Some re-
searchers use the term psychological safety climate to refer to individual level
perceptions about safety climate issues (cf. Clark, Zickar, & Jex, 2014; Nixon, et
al., 2015). These authors appear to have had good intentions in trying to clearly label
individual level safety climate perceptions with a term that highlights the indi-
vidual nature of the construct (i.e., drawing on psychological climate literature
such as James & James, 1989; James, et al., 2008). However, other researchers
have studied psychological safety as a component of safety culture (Vogus, Cull,
Hengelbrok, Modell, & Epstein, 2016) or as an antecedent of safety outcomes
(e.g., Chen, McCabe, & Hyatt, 2018; Halbesleben et al., 2013). In our view, it is
theoretically appropriate to treat psychological safety as an antecedent of psycho-
logical climate, but researchers need to be wary of exactly how studies are using
these various terms.
The terminological confusion between terms such as safety climate, psycho-
logical safety, psychological safety climate, and PSC represents a potential barrier
to accumulating knowledge about and drawing clear distinctions between these
constructs. At the very least, researchers are urged to use caution when citing
studies to ensure that they do in fact capture the construct of interest. However,
further empirical research is needed to distinguish these terms.
Key terms in this definition emphasize that it is a shared, agreed upon cognition
regarding the relative importance or priority of acting safely versus meeting other
competing demands such as productivity or cost cutting. These safety climate percep-
tions emerge through ongoing social interaction in which employees share personal
experiences informing the extent to which management cares and invests in their
protection (as opposed to cost cutting or productivity).
In our anecdotal experience (the first two authors have been editors and associ-
ate editors of multiple journals), the issue of whether climate measures need to be
shared raises considerable consternation among researchers, particularly during
the peer review process. We have seen some reviewers assert that if the study does
not include shared perceptions, it is not a study of climate; whereas other authors
acknowledge that safety climate is a shared construct, but continue to study it at
the individual level; and still others do not discuss its multilevel nature. Irrespec-
tive of how researchers conceptually define safety climate, very few studies as-
sess it at the group level. In fact, out of the 230 empirical, quantitative studies we
reviewed, only 67 studies (29.1%) aggregated individual level data to test climate
effects at unit or higher levels. Of these 67, 42 studies (62.7%) reported statistical
evidence for the appropriateness of aggregation including ICC(1) only (N = 12,
17.9%), rwg only (N = 1, 1.5%), ICC(1) and ICC(2) (N = 4, 6.0%), ICC(1) and rwg
(N = 6, 9.0%), and all three measures (N = 19, 28.4%). These data suggest that
more multilevel studies are needed with improved reporting of statistical justifica-
tion for aggregation.
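For readers less familiar with these aggregation statistics, a minimal sketch of the most commonly reported one, ICC(1), computed from a one-way ANOVA decomposition is shown below; it is a generic illustration (unequal group sizes are handled crudely by averaging), not a procedure taken from any of the reviewed studies.

```python
import numpy as np

def icc1(scores: np.ndarray, groups: np.ndarray) -> float:
    """One-way random-effects ICC(1), a common statistic used to justify aggregating
    individual climate ratings to the group level."""
    labels = np.unique(groups)
    k = np.mean([np.sum(groups == g) for g in labels])            # average group size
    grand = scores.mean()
    ss_between = sum(np.sum(groups == g) * (scores[groups == g].mean() - grand) ** 2
                     for g in labels)
    ss_within = sum(np.sum((scores[groups == g] - scores[groups == g].mean()) ** 2)
                    for g in labels)
    ms_between = ss_between / (len(labels) - 1)
    ms_within = ss_within / (len(scores) - len(labels))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```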
As we reviewed this literature, we were especially struck by the number of
articles that offered a group level conceptual definition of safety climate but stud-
ied safety climate at the individual level, often without explicit rationale for the
discrepancy (for example, Hoffmann et al., 2013; McGuire et al., 2017; Schwatka
& Rosencrance, 2016). Other studies have defined safety climate at the group
level but offered a justification for studying it as an individual construct. For ex-
ample, multiple studies by Huang and colleagues (examples include Huang et al.,
2013, 2018; Huang, Lee, McFadden, Rineer, & Robertson, 2017) have argued
that shared definitions of climate are less meaningful for employees who work by
themselves, such as long-haul truck drivers.
On one hand, the lack of attention to safety climate as a shared perception rep-
resents a potentially serious problem in the literature, as there appears to be a wide
disparity between how Zohar (1980) initially conceptualized safety climate and
how many researchers appear to be operationalizing it in practice. One might go
so far as to argue that given that comparatively little research has been performed
on safety climate as a group level construct, relatively little is known about it. On
the other hand, both the general climate literature (e.g., Ostroff et al., 2013) and
the safety climate literature (e.g., Clarke, 2013) explicitly acknowledge the con-
ceptual relevance of individual safety climate perceptions to the study of climate.
A common practice is to distinguish between organizational climate (a group level
construct) and psychological climate (an individual level construct; cf. James &
James, 1989; James et al., 2008). When designing a study, researchers should con-
sider Ostroff et al.’s (2013) discussion of this distinction as their model proposed
that psychological climate is more directly relevant to individual level outcomes
while organizational climate is more directly related to group level outcomes.
The individual-organizational level distinctions highlight the need to avoid at-
omistic and ecological fallacies (cf. Hannan, 1971) in safety climate research.
Atomistic fallacies occur when results obtained at the individual level are errone-
ously generalized to the group level. On the other hand, ecological fallacies occur
when group level results are used to draw conclusions at the individual level.
Also, it is important to acknowledge that researchers often focus on the individual
level because of practical constraints such as the lack of a work group/unit identi-
fier that can be used as the basis of aggregation, the lack of a sufficient number
of subunits to study climate, or a lack of proficiency in the multilevel methods
needed to study climate across organizational levels. Given these issues, it may
be appropriate for safety researchers interested in individual behavioral and atti-
tudinal phenomena to focus on psychological climate perceptions as they relate to
safety, although they should test for and/or attempt to rule out group level effects
when possible.
When researchers focus on individual level safety climate measurement, it is
important to ensure that their theoretical rationale fits with the individual level
formulation of climate. One of the potential areas of confusion in this literature
concerns the use of the term level. Although climate researchers distinguish be-
tween individual and organizational safety climate measures based on the level of
analysis/measurement, safety climate researchers also use the term level to refer
to particular climate stakeholders. For example, drawing from Zohar (2000, 2008,
2010), Huang et al. (2013) described group and organizational level climate as
two distinct perceptions employees form about safety. In Huang et al.’s approach,
the group level refers to one’s immediate work unit, with measures typically fo-
cused on employees’ perceptions of safety as a relative priority of one’s immedi-
ate supervisor. The organizational level refers to employees’ perceptions of the
global organization’s (or top management’s) relative priority for safety. But, both
group and organizational-level safety climate in Huang et al.’s model are usually
measured with individual level perceptual measures.
the ability of safety climate to predict outcomes drops much more rapidly (Berg-
man, Payne, Taylor, & Beus, 2014). Of course, the stability of both safety climate
scores and their ability to predict outcomes likely depends on the stability of the
work environment, but relatively little research has directly addressed this issue.
In our view, whether safety climate is a relatively stable phenomenon or a varying
snapshot of culture remains unresolved.
Guldenmund (2000, p. 220) noted that “before defining safety culture and cli-
mate, the distinction between culture and climate has to be resolved.” In the ensu-
ing nearly two decades, although progress has been made in understanding the
general conceptual distinctions between organizational climate and culture (cf.
Ostroff et al., 2013), safety researchers are often careless in distinguishing cul-
ture and climate (Zohar, 2014). Some researchers assume that the climate-culture
distinction rests on the idea that climate is easier to change than culture; others
distinguish them in terms of the relative temporal stability of the two constructs.
Still others treat climate as the measurable aspect of culture, even though other
aspects of culture are likely measurable, albeit through different strategies than
those used in climate assessments. These ambiguities highlight the critical need
for further clarity in the conceptualization of climate. In fact, Hofmann, Burke,
and Zohar (2017, p. 381) concluded:
In the context of safety research, there potentially is even greater conceptual am-
biguity given the lack of a clear and agreed upon definition of safety culture, and
where the definitions that have been put forth do not make reference to broader,
more general aspects of organizational culture. In addition, many measures of safety
culture use items and scales which resemble safety climate measures. This has led
many authors to use the two constructs interchangeably. We believe this situation is
unfortunate and suggest that any study of safety culture should be integrated with
and connected to the broader, more general organizational culture as well as the
models and research within this domain.
measure reflects the unique safety concerns faced by lone workers such as truck
drivers. They developed and validated a measure consisting of three organization-
al-level factors: (a) proactive practices, (b) driver safety priority, and (c) supervisory care promotion, and three group/unit-level measures: (a) safety promotion, (b) delivery limits, and (c) cell phone (use) disapproval.
Another example comes from literature on school climate. Zohar and Lee
(2016) provided an example of a traditional safety climate study conducted in a
school setting with school bus drivers. In addition to items measuring perceived
management commitment to safety, they developed context specific items such
as “management becomes angry with drivers who have violated any safety rule” and “department immediately informs school principal of driver complaint against disruptive child.”
Occupational health research does not pay as much attention to the school climate literature as it does to other contexts such as manufacturing; nevertheless, we conducted a separate review of the school climate literature, which located over
1,000 citations to school climate, including over 500 in 2013 alone. Although a
full review of this literature is well-beyond the scope of this article, it should be
noted that safety issues are frequently mentioned in the school climate literature
(Wang & Degol, 2016). However, rather than reflecting physical injuries from
sources such as transportation incidents, slips, and strains, the predominant safety
concern is the extent to which teachers and students are protected from physical
and verbal violence. Moreover, much of this literature is concerned with student
health and academic performance outcomes rather than teachers’ occupational
well-being. Thus, traditional safety climate measures may be insufficient to cap-
ture the unique challenges of this context.
Healthcare is another setting where context-specific measures are frequently
used. Healthcare, however, encompasses a wide variety of practice areas and occu-
pations, each with specific sets of safety challenges. Accordingly, researchers have
measured a wide array of different aspects of safety climate such as error-related
communication (Ausserhofer et al., 2013), hospital falls prevention (Bennett et al.,
2014), communication openness and handoffs and transitions (Cox et al., 2017),
forensic ward climate such as therapeutic hold and patients’ cohesion and mutual
support (de Vries, Brazil, Tonkin, & Bulten, 2016), and hospital safety climate items
relating to issues such as availability of personal protective equipment and cleanli-
ness (Kim et al., 2018). The variety of issues captured by these measures raises
questions about whether healthcare should be treated as a single industry context by
researchers seeking to understand the effects of context on safety climate.
Jiang et al. (2019) highlighted some of the reasons why general/universal or
context-specific measures might be preferred. For example, industry-specific
measures may have greater value in diagnosing safety concerns that are unique
to a specific industry and therefore potentially more useful in guiding safety in-
terventions (see also Zohar, 2014). General measures may have more predictive
value if safety climate primarily reflects a general management commitment to
safety; if this is the case, safety interventions should focus on those broadly ap-
plicable concerns. General measures can also contribute to benchmarking norms
that may be used across a wide variety of industries.
To test the possible distinctions between universal and industry-specific mea-
sures, Jiang et al. (2019) tested the relative predictive power of each type of mea-
sure in a meta-analytic review of 120 samples (N = 81,213). They found that
each type of measure performed better in different situations. Specifically, the
industry-specific measures were more strongly related to safety behavior and
risk perceptions whereas the universal measures predicted other adverse events
such as errors and near misses. There were no differences between universal and
industry-specific measures in their ability to predict accidents and injuries. It is
important to note that Jiang et al. (2019) did not test whether the industries of the
industry-specific measures differed from those of the universal measures.
Jiang et al. (2019) reported the most commonly studied industries in their review to be construction (K = 21), health care, hospitality, manufacturing (K = 18), transportation (K = 18), hospitality/restaurant/accommodations (K = 12), and construction (K = 11), with 19 studies described as “mixed context.” Our re-
view (which encompasses a different set of years than Jiang et al.) indicates that
the industries that appeared to be most likely to use industry-specific measures
were transportation, off-shore and gas production, education, and hospital/health
care. Thus, the comparison of industry-specific versus general measure may be
somewhat confounded if some industries are more/less likely to be represented
in the industry-specific group. Researchers could address this by comparing both
measures within the same industry.
Keiser and Payne (2018) did just this, using both types of measures in the same setting (university research labs) and including context-specific measures for animal biological, biological, chemical, human subjects/computer, and mechanical/electrical labs. They concluded that while the context-specific measures appeared to be more useful in less safety-salient contexts, there were relatively few differences between the measures. However, they also noted that
there appeared to be measurement equivalence problems with the general measure
across the different settings they investigated. Of course, Keiser and Payne’s find-
ings may be unique to their organizational setting given that university research
labs likely differ in many ways from other types of safety-salient contexts. Thus,
the evidence about whether researchers should use context-specific versus universal/general measures is mixed; so far it suggests at least some differences between the two types of measures in the settings in which they are most useful. This is clearly an issue that requires further research.
Twenty years after Zohar’s original publication, Flin, Mearns, O’Connor, and Bryden (2000)
identified 100 dimensions of safety climate used in prior literature. They narrowed
these dimensions down to six themes: (1) management/supervision, (2) safety sys-
tem, (3) risk, (4) work pressure, (5) competence of the workforce, and (6) pro-
cedures/rules. Yet, measures continued to proliferate; in fact, 10 years after the
Flin et al. (2000) publication Beus et al.’s (2010) meta-analytic review identified
61 different climate measures with varying numbers of dimensions. Our review
suggests that little progress has been made and there continues to be a wide array
of approaches to measuring safety climate. As noted above, one important dis-
tinction is between universal/generic and context-specific measures, with many
alternatives within each of these categories. A related issue concerns the dimen-
sions of those measures. For the purpose of this review, we did not compile a list
of the dimensions used in various measures of safety climate. Rather, we focused
on the methods used to ascertain the number of dimensions in individual studies.
Factor analysis is a widely recognized approach to assessing dimensionality
of a measure and therefore is an important step in measure development and con-
struct validation. Factor analyses are especially important in a literature such as
safety climate where there is a lack of clarity about the dimensionality of the
construct. Therefore, we coded studies in terms of whether they used any factor
analytic technique and, if so, what technique they used. Across the 230 quantitative
empirical studies, the most common factor analytic technique used was confirma-
tory factor analysis (CFA, K = 64; 27.8%). Approximately 22% of the studies
used exploratory factor analysis (EFA) with half of them only using EFA (K =
25, 10.9%) and half using a combination of EFA and CFA (K = 24, 10.4%). That
CFA was used separately or in some combination with EFA in 38.2% of the stud-
ies (K = 88) is encouraging given that CFA requires researchers to specify an a
priori measurement model. However, it is arguably more distressing that nearly
half of the studies in our review (K = 112, 48.7%) did not report any form of fac-
tor analysis, 2 studies reported the use of an unspecified form of factor analysis
(0.9%), and 3 studies (1.3%) reported using CFA but only on other measures than
safety climate. Given the lack of clarity in the literature about the dimensionality
of safety climate, the fact that just over 50% of the studies in our review either
did not report factor analyses or provided unclear information about the factor
analytic techniques used represents an important barrier to accumulating evidence
about the dimensionality of safety climate measures.
A related issue concerns how safety researchers interpret factor analytic re-
sults. Some researchers use unidimensional measures typically focusing on the
core idea of perceived management commitment to safety (for example, Arcury
et al., 2014; He et al., 2016). This approach is consistent with the argument that
management commitment is the central concept in safety climate literature as well
as with meta-analytic evidence showing that management commitment is among
the best predictors of safety-related outcomes (Beus et al., 2010). However, good
METHODS/DESIGNS
As indicated above, the conceptualization and measurement of safety climate has
several pitfalls that generate challenges for the design of studies seeking to ex-
amine the effects of safety climate, the antecedents of safety climate, as well as
the mediating and moderating effects of safety climate. In this section we review
some of the methodological challenges and issues.
Interventions
For the period 2013–2018, only 6% of all of the articles we coded were inter-
vention studies. Of these, four studies treated safety climate as an independent
variable, eight studies treated safety climate as a dependent variable, and two
studies treated safety climate as a mediator. Two of these intervention studies used
random assignment, two used quasi-experimental designs, and two used random assignment of clusters. Therefore, experimental or quasi-experimental designs
were rare. Admittedly these designs are difficult to implement in an applied field
setting but their absence does limit our ability to make causal inferences about
safety climate-related processes.
Lee, Huang, Cheung, Chen, and Shaw (2018) reviewed 19 intervention studies
that met their inclusion criteria; they reported that 10 of the 19 studies were quasi-
experimental pre-post-intervention designs and eight were based on mixed de-
signs with between- and within-subjects components. Ten of the 19 studies were
published in years preceding the period of our review which raises the question as
to whether research designs are becoming stronger. That said, the results of both
of these reviews support the ability of interventions to improve safety climate in
applied settings across several industries. But, they also highlight how rare such
studies are and the corresponding need for more studies utilizing these designs.
Many longitudinal studies do not make a case for the specific lag in measure-
ment they included in their designs. In addition, if there are not at least three
measurement occasions, then it is not possible to detect nonlinear trends. Un-
fortunately, the theories commonly used in safety climate research are silent on
the most appropriate time lag to choose for a given research question. It may be
the case that there is no perfect time lag as changes in safety climate may be best
explained by unique events, such as severe accidents or changes in organization-
al policy. Nevertheless, we echo calls by other scholars (e.g., Ployhart & Ward,
2011) to incorporate time into our research designs. This is especially important
for understanding the time that it takes for a cause (e.g., an accident) to exert
an effect (e.g., changes in safety climate). Other scholars (e.g., Spector, 2019)
have suggested that we modify our measures to explicitly incorporate time. Many
of our measures are so general that it is impossible to assess the sequencing of
events. By including time related content such as “in the last month” or “today,”
the temporal ambiguity is reduced if not eliminated.
Level of Analysis
As discussed above, many conceptualizations of safety climate suggest a group
or organizational level of analysis. However, 70.4% of the studies we coded mea-
sured and/or analyzed safety climate at only the individual level of analysis. Only
67 studies (29.1%) took a group, organizational, or multi-level approach. As we
point out in the previous section as well as in the section on future directions
below, moving beyond the individual level of analysis is necessary to advance
understanding of safety climate. More research at the group and organizational levels is needed to link safety climate to organizational level outcomes as well as to understand the relations of the group and organizational levels with indi-
vidual level behaviors and outcomes.
mediator depending on the research question; however, our review indicates that
the research literature needs to investigate these roles to broaden our understand-
ing of the development and effects of safety climate.
CONCLUSION
In the present paper, we reviewed trends within the last five years in the safety
climate literature. Our review focused on safety climate, a mature area of research
that extends over four decades and encompasses hundreds of studies. Despite the
size of the literature, it still lacks consistent conceptualization and operationaliza-
tion of constructs. Research needs to consider these potentially important aspects of safety climate as either components of its definition or as important antecedents or outcomes of safety climate. Additionally, research should explore alternative analytic perspectives for examining the dimensions and progression of safety climate over time, including the stability of safety climate, nonlinear patterns of safety climate, and the relation of safety climate to potential antecedents and outcomes.
NOTE
1. Because it was not possible to cite all of the empirical studies, a list of
the 237 empirical studies included in this review can be obtained from
the first author. Please contact her at [email protected]
REFERENCES
Aiken, J. R., Hanges, P. J., & Chen, T. (2018). The means are the end: Complexity science
in organizational research. In S. E. Humphrey & J. M. LeBreton (Eds.), The handbook
of multilevel theory, measurement, and analysis. Washington, DC: American Psy-
chological Association.
American Educational Research Association, American Psychological Association, & Na-
tional Council on Measurement in Education. (1985). Standards for educational
and psychological testing. Washington, DC: American Psychological Association.
Arcury, T. A., Grzywacz, J. G., Chen, H., Mora, D. C., & Quandt, S. A. (2014). Work
organization and health among immigrant women: Latina manual workers in North
Carolina. American Journal of Public Health, 104(12), 2445–2452.
Arens, O. B., Fierz, K., & Zúñiga, F. (2017). Elder abuse in nursing homes: Do spe-
cial care units make a difference? A secondary data analysis of the Swiss Nurs-
ing Homes Human Resources Project. Gerontology, 63(2), 169–179. https://doi.
org/10.1159/000450787
Ausserhofer, D., Schubert, M., Desmedt, M., Blegen, M. A., De Geest, S., & Schwendimann, R. (2013). The association of patient safety climate and nurse-related organizational factors with selected patient outcomes: A cross-sectional survey. International Journal of Nursing Studies.
Christian, M. S., Bradley, J. C., Wallace, J. C., & Burke, M. J. (2009). Workplace safety:
A meta-analysis of the roles of person and situational factors. Journal of Applied
Psychology, 94, 1103–1127.
Clarke, S. (2010). An integrative model of safety climate: Linking psychological climate
and work attitudes to individual safety outcomes using meta-analysis. Journal of
Occupational and Organizational Psychology, 83, 553–579.
Clarke, S. (2013). Safety leadership: A meta-analytic review of transformational and trans-
actional leadership styles as antecedents of safety behaviors. Journal of Occupa-
tional and Organizational Psychology, 86, 22–49.
Clark, O. L., Zickar, M. J., & Jex, S. M. (2014). Role definition as a moderator of the
relationship between safety climate and organizational citizenship behavior among
hospital nurses. Journal of Business and Psychology, 29, 101–110. https://doi.
org/10.1007/s10869-013-9302-0
Cox, S., & Cox, T. (1991). The structure of employee attitudes to safety: A European ex-
ample. Work & Stress, 5(2), 93–106. https://doi.org/10.1080/02678379108257007
Cox, E. D., Jacobsohn, G. C., Rajamanickam, V. P., Carayon, P., Kelly, M. M., Wetterneck,
T. B., ... & Brown, R. L. (2017). A family-centered rounds checklist, family engage-
ment, and patient safety: A randomized trial. Pediatrics, 139(5), 1–10. https://doi.org/10.1542/peds.2016-1688
Dedobbeleer, N., & Béland, F. (1991). A safety climate measure for construction sites. Jour-
nal of Safety Research, 22, 97–103. https://doi.org/10.1016/0022-4375(91)90017-P
de Vries, M. G., Brazil, I. A., Tonkin, M., & Bulten, B. H. (2016). Ward climate within a
high secure forensic psychiatric hospital: Perceptions of patients and nursing staff
and the role of patient characteristics. Archives of Psychiatric Nursing, 30(3), 342–
349. https://doi.org/10.1016/j.apnu.2015.12.007
Dollard, M. F., & Bakker, A. B. (2010). Psychosocial safety climate as a precursor to con-
ducive work environments, psychological health problems, and employee engage-
ment. Journal of Occupational and Organizational Psychology, 83(3), 579–599.
Drach-Zahavy, A., & Somech, A. (2015). Goal orientation and safety climate: Enhancing
versus compensatory mechanisms for safety compliance? Group & Organization
Management, 40, 560–588. https://doi.org/10.1177/1059601114560372
Edmondson, A. (1999). Psychological safety and learning behavior in work teams. Administrative Science Quarterly, 44(2), 350–383.
Flin, R., Mearns, K., O’Connor, P., & Bryden, R. (2000). Measuring safety climate: Iden-
tifying the common features. Safety Science, 34, 177–192.
Gazica, M. W., & Spector, P. E. (2016). A test of safety, violence prevention, and civility climate domain-specific relationships with relevant workplace hazards. International Journal of Occupational and Environmental Health, 22, 45–51.
Golubovich, J., Chang, C. H., & Eatough, E. M. (2014). Safety climate, hardiness, and
musculoskeletal complaints: A mediated moderation model. Applied Ergonomics,
45(3), 757–766. https://doi.org/10.1016/j.apergo.2013.10.008
Graeve, C., McGovern, P. M., Arnold, S., & Polovich, M. (2017). Testing an intervention to
decrease healthcare workers’ exposure to antineoplastic agents. Oncology Nursing
Forum, 44(1), E10–E19. https://doi.org/10.1188/17.ONF.E10-E19
Griffin, M. A., & Neal, A. (2000). Perceptions of safety at work: A framework for linking
safety climate to safety performance, knowledge, and motivation. Journal of Oc-
cupational Health Psychology, 5, 347–358.
Guerrero, S., Lapalme, M. È., & Séguin, M. (2015). Board chair authentic leader-
ship and nonexecutives’ motivation and commitment. Journal of Leader-
ship & Organizational Studies, 22(1), 88–101. https://doi.org/10.1177/1548051814531825
Guldenmund, F. W. (2000). The nature of safety culture: A review of theory and research. Safety Science, 34, 215–257.
Halbesleben, J. R. B., Leroy, H., Dierynck, B., Simons, T., Savage, G. T., McCaughey, D., & Leon, M. R. (2013). Living up to safety values in health care: The effect of leader behavioral integrity on occupational safety. Journal of Occupational Health Psychology, 18, 395–405.
Hall, G. B., Dollard, M. F., & Coward, J. (2010). Psychosocial safety climate: Development of the PSC-12. International Journal of Stress Management, 17, 353–383.
Hannan, M. T. (1971). Aggregation and disaggregation in sociology. Lexington, MA: Lex-
ington Books.
Hartmann, C. W., Meterko, M., Zhao, S., Palmer, J. A., & Berlowitz, D. (2013). Validation
of a novel safety climate instrument in VHA nursing homes. Medical Care Research
and Review, 70(4), 400–417. https://doi.org/10.1177/1077558712474349
He, Q., Dong, S., Rose, T., Li, H., Yin, Q., & Cao, D. (2016). Systematic impact of institu-
tional pressures on safety climate in the construction industry. Accident Analysis and
Prevention, 93, 230–239. https://doi.org/10.1016/j.aap.2015.11.034
Hinde, T., Gale, T., Anderson, I., Roberts, M., & Sice, P. (2016). A study to assess the influ-
ence of interprofessional point of care simulation training on safety culture in the
operating theatre environment of a university teaching hospital. Journal of Interpro-
fessional Care, 30(2), 251–253. https://doi.org/10.3109/13561820.2015.1084277
Hoffmann, B., Miessner, C., Albay, Z., Scbrbber, J., Weppler, K., Gerlach, F. M., & Guth-
lin, C. (2013). Impact of individual and team features of patient safety climate: A
survey in family practices. Annals of Family Medicine, 11, 355–362. https://doi.org/10.1370/afm.1500
Hofmann, D. A., Burke, M. J., & Zohar, D. (2017). 100 Years of occupational safety re-
search: From basic protections and work analysis to a multilevel view of workplace
safety and risk. Journal of Applied Psychology, 102, 375–388.
Hong, S., & Li, Q. (2017). The reasons for Chinese nursing staff to report adverse events:
A questionnaire survey. Journal of Nursing Management, 25(3), 231–239. https://
doi.org/10.1111/jonm.12461
Huang, Y., Lee, J., McFadden, A. C., Rineer, J., & Robertson, M. M. (2017). Individual
employee’s perceptions of “Group-level Safety Climate” (supervisor referenced)
versus “Organization-level Safety Climate” (top management referenced): Associa-
tions with safety outcomes for lone workers. Accident Analysis and Prevention, 98,
37–45. https://doi.org/10.1016/j.aap.2016.09.016
Huang, Y., Sinclair, R. R., Lee, J., McFadden, A. C., Cheung, J. H., & Murphy, L. A. (2018).
Does talking the talk matter? Effects of supervisor safety communication and safety
climate on long-haul truckers’ safety performance. Accident Analysis & Prevention,
117, 357–367. https://doi.org/10.1016/j.aap.2017.09.006
Huang, Y., Zohar, D., Robertson, M. M., Garabet, A., Lee, J., & Murphy, L. A. (2013).
Development and validation of safety climate scales for lone workers using truck
drivers as exemplar. Transportation Research Part F: Traffic Psychology and Be-
haviour, 17, 5–19. https://doi.org/10.1016/j.trf.2012.08.011
Idris, M. A., Dollard, M. F., Coward, J., & Dormann, C. (2012). Psychosocial safety cli-
mate: Conceptual distinctiveness and effect on job demands and worker health.
Safety Science, 50, 19–28.
International Labor Organization (2009). World day for safety and health at work 2009:
Facts on safety and health at work? International Labour Office. Geneva: ILO.
Retrieved from: http://www.ilo.org/wcmsp5/groups/public/@dgreports/@dcomm/
documents/publication/wcms_105146.pdf
James, L. A., & James, L. R. (1989). Integrating work environment perceptions: Explora-
tions into the measurement of meaning. Journal of Applied Psychology, 74, 739–
751.
James, L. R., & Jones, A. P. (1974). Organizational climate: A review of theory and re-
search. Psychological Bulletin, 81, 1096–1112.
James, L. R., Choi, C. C., Ko, C.-H. E., McNeil, P. K., Minton, M. K., Wright, M. A., &
Kim, K. I. (2008). Organizational and psychological climate: A review of theory
and research. European Journal of Work and Organizational Psychology, 17, 5–32.
Jiang, L., Lavaysse, L. M., & Probst, T. M. (2019). Safety climate and safety outcomes: A meta-analytic comparison of universal vs. industry-specific safety climate predictive validity. Work & Stress, 33, 41–57.
Kagan, I., & Barnoy, S. (2013). Organizational safety culture and medical error reporting
by Israeli nurses. Journal of Nursing Scholarship, 45(3), 273–280. https://doi.org/10.1111/jnu.12026
Kane, M. (2012). All validity is construct validity. Or is it? Measurement, 10, 66–70.
Keiser, N. L., & Payne, S. C. (2018). Safety climate measurement: An empirical test of
context-specific versus general assessments. Journal of Business and Psychology,
33, 479–494.
Kim, O., Kim, M. S., Jang, H. J., Lee, H., Kang, Y., Pang, Y., & Jung, H. (2018). Radia-
tion safety education and compliance with safety procedures—The Korea Nurses’
Health Study. Journal of Clinical Nursing, 27(13/14), 2650–2660. https://doi.org/10.1111/jocn.14338
Kines, P., Lappalainen, J., Mikkelsen, K. L., Olsen, E., Pousette, A., Tharaldsen, J., ...
& Törner, M. (2011). Nordic Safety Climate Questionnaire (NOSACQ-50): A new
tool for diagnosing occupational safety climate. International Journal of Industrial
Ergonomics, 41, 634–646.
Kozlowski, S. W., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations: Contextual, temporal, and emergent processes. In K. J. Klein & S. W. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations (pp. 3–90). San Francisco, CA: Jossey-Bass.
Lawrie, E. J., Tuckey, M. R., & Dollard, M. F. (2018). Job design for mindful work: The
boosting effect of psychosocial safety climate. Journal of Occupational Health Psy-
chology, 23(4), 483–495. https://doi.org/10.1037/ocp0000102
Lau, D. C., & Murnighan, J. K. (1998). Demographic diversity and faultlines: The com-
positional dynamics of organizational groups. Academy of Management Review, 23,
325–340. https://doi.org/10.2307/259377
Lee, J., Huang, Y.-H., Cheung, J. H., Chen, Z., & Shaw, W. S. (2018). A systematic review of the safety climate intervention literature: Past trends and future directions. Journal of Occupational Health Psychology, 24, 66–91.
Lee, J., Sinclair, R. R., Huang, E., & Cheung, J. (2019). Outcomes of safety climate in
trucking: A longitudinal framework. Journal of Business and Psychology, 34, 865–
878.
Leitão, S., & Greiner, B. A. (2016). Organisational safety climate and occupational ac-
cidents and injuries: An epidemiology based systematic review. Work & Stress, 30,
71–90.
Liberty Mutual Research Institute for Safety. (2016). 2016 Liberty Mutual work-
place safety index. Hopkinton, MA. Retrieved from: http://cdn2.hubspot.net/
hubfs/330425/2016_Liberty_Mutual_Workplace_Safety_Index.pdf
Lissitz, R. W., & Samuelsen, K. (2007). A suggested change in terminology and emphasis
regarding validity and education. Educational Researcher, 36, 437–448.
Mansour, S., & Tremblay, D. G. (2018). Psychosocial safety climate as resource pathways
to alleviate work-family conflict. A study in the health sector in Quebec. Personnel
Review, 47(2), 474–493. https://doi.org/10.1108/PR-10-2016-0281
Mansour, S., & Tremblay, D. G. (2019). How can we decrease burnout and safety work-
around behaviors in health care organizations? The role of psychosocial safety cli-
mate. Personnel Review, 48(2), 528–550.
Martowirono, K., Wagner, C., & Bijnen, A. B. (2014). Surgical residents’ perceptions of
patient safety climate in Dutch teaching hospitals. Journal of Evaluation in Clinical
Practice, 20(2), 121–128. https://doi.org/10.1111/jep.12096
McCaughey, D., DelliFraine, J. L., McGhan, G., & Bruning, N. S. (2013). The negative
effects of workplace injury and illness on workplace safety climate perceptions and
health care worker outcomes. Safety Science, 51, 138–147. https://doi.org/10.1016/j.
ssci.2012.06.004
Mearns, K., Hope, L., Ford, M. T., & Tetrick, L. E. (2010). Investment in workforce health:
Exploring the implications for workforce safety climate and commitment. Accident
Analysis and Prevention, 42, 1445–1454.
Mearns, K., Whitaker, S. M., & Flin, R. (2003). Safety climate, safety management prac-
tice and safety performance in offshore environments. Safety Science, 41(8), 641–
680. https://doi.org/10.1016/S0925-7535(02)00011-5
Milijić, N., Mihajlović, I., Nikolić, D., & Živković, Ž. (2014). Multicriteria analysis of safe-
ty climate measurements at workplaces in production industries in Serbia. Interna-
tional Journal of Industrial Ergonomics, 44(4), 510–519. https://doi.org/10.1016/j.
ergon.2014.03.004
Nahrgang, J. D., Morgeson, F. P., & Hofmann, D. A. (2011). Safety at work: A meta-analytic investigation of the link between job demands, job resources, burnout, engagement, and safety outcomes. Journal of Applied Psychology, 96, 71–94.
Neal, A., & Griffin, M. A. (2006). A study of the lagged relationships among safety climate,
safety motivation, safety behavior, and accidents at the individual and group levels.
Journal of Applied Psychology, 91(4), 946–953.
Neal, A., Griffin, M. A., & Hart, P. M. (2000). The impact of organizational climate on
safety climate and individual behavior. Safety Science, 34, 99–109.
Newman, A., Donohue, R., & Eva, N. (2017). Psychological safety: A systematic review of
the literature. Human Resource Management Review, 27, 521–535.
Nixon, A. E., Lanz, J. J., Manapragada, A., Bruk-Lee, V., Schantz, A., & Rodriguez, J. F.
(2015). Nurse safety: How is safety climate related to affect and attitude? Work & Stress.
BIOGRAPHIES
Dr. David G. Allen is Associate Dean for Graduate Programs and Professor of
Management, Entrepreneurship, and Leadership at the Neeley School of Business
at Texas Christian University; Distinguished Research Environment Professor at
Warwick Business School; and Editor-in-Chief of the Journal of Management.
Professor Allen earned his Ph.D. from the Beebe Institute of Personnel and Em-
ployment Relations at Georgia State University. His teaching, research, and con-
sulting cover a wide range of topics related to people and work, with a particular
focus on the flow of human capital into and out of organizations. His award-
winning research has been regularly published in the field’s top journals, such as
Academy of Management Journal, Human Relations, Human Resource Manage-
ment, Journal of Applied Psychology, Journal of Management, Journal of Or-
ganizational Behavior, Organization Science, Organizational Research Methods,
and Personnel Psychology, and he is the author of the book Managing Employee
Turnover: Dispelling Myths and Fostering Evidence-Based Retention Strategies.
Professor Allen is a Fellow of the American Psychological Association, the Soci-
ety for Industrial and Organizational Psychology, and the Southern Management
Association.
Dr. Angelo DeNisi is the Albert Harry Cohen Chair in Business Administration
at Tulane University, where he also served a six-year term as Dean of the A.B.
Freeman School of Business. After receiving his Ph.D. in Industrial/Organiza-
tional Psychology from Purdue University in 1977, he served as a faculty mem-
ber at Kent State, the University of South Carolina, Rutgers, and Texas A&M
University before moving to Tulane. His research interests include performance
appraisal and performance management, as well as expatriate management, and
his research has been funded by the National Science Foundation, the U.S. Army
Research Institute, several state agencies and several industry groups in the U.S.
He has also served as President of the Society for Industrial and Organizational
Psychology (SIOP), as well as President of the Academy of Management (AOM);
he has chaired both the Organizational Behavior and the Human Resources Di-
visions of the AOM, and he is a Fellow of the Academy of Management, SIOP,
and the American Psychological Association. He has published more than a doz-
en book chapters, and more than 80 articles in refereed journals, most of them
in top academic journals such as the Academy of Management Journal (AMJ),
the Academy of Management Review (AMR), the Journal of Applied Psychol-
ogy (JAP), the Journal of Personality and Social Psychology and Psychological
Bulletin. His research has been recognized with awards from several divisions
of the AOM, including winning the 2016 Herbert Heneman Lifetime Contribu-
tion Award from the Human Resources Division, and SIOP named him the co-
winner of the 2005 Distinguished Lifetime Scientific Contribution Award. He also
serves, or has served, on a number of editorial boards, including AMJ, AMR, JAP,
Journal of Management, Entrepreneurship Theory and Practice, and Journal of
Organizational Behavior. He was Editor of AMJ from 1994 to 1996, and was re-
cently named Co-Editor of the SIOP Organizational Frontiers Series, with Kevin
Murphy.
Dr. Julie I. Hancock is Assistant Professor at the G. Brint Ryan College of Busi-
ness, University of North Texas. She holds a Ph.D. in Business Administration
from the University of Memphis. Her primary research interests include the flow
of people in organizations, collective turnover, perceived organizational sup-
port, and pro-social rule breaking. Her work on these topics has been published
in Journal of Management, Journal of Organizational Behavior, Human Rela-
tions, and Human Resource Management Review. Dr. Hancock currently serves
on the Academy of Management HR Division Executive Committee as a Repre-
sentative-at-Large.
Dr. Kevin Murphy holds the Kemmy Chair of Work and Employment Studies
at the University of Limerick. Professor Murphy earned his PhD in Psychology
from The Pennsylvania State University in 1979, and has served on the facul-
ties of Rice University, New York University, Pennsylvania State University and
Colorado State University. He is a Fellow of the American Psychological Asso-
ciation, the Society for Industrial and Organizational Psychology (SIOP) and the
American Psychological Society, and the recipient of SIOP’s 2004 Distinguished
Scientific Contribution Award. He is the author of over one hundred and ninety
articles and book chapters, and author or editor of eleven books, in areas ranging
from psychometrics and statistical analysis to individual differences, performance
assessment and honesty in the workplace. He served as co-Editor of the Taylor &
Francis (previously Erlbaum) Applied Psychology Series and has been appointed
co-editor, with Angelo DeNisi, of the SIOP Organizational Frontiers Series.
He has served as President of SIOP and Editor of Journal of Applied Psychol-
ogy and of Industrial and Organizational Psychology: Perspectives on Science
and Practice, and is a member of numerous editorial boards. Throughout his ca-
reer, Dr. Murphy has worked to advance both research and the application of that
research to solve practical problems in organizations. For example, he served as
both a member and the Chair of the U.S. Department of Defense Advisory Com-
mittee on Military Personnel Testing, and has also served on five U.S. National
Academy of Sciences committees, all of which dealt with problems in the work-
place. He has carried out a number of research projects with military and national
security organizations, dealing with problems ranging from training to applying
research on motivation to problems of nuclear deterrence, and has worked with
numerous private and public-sector organizations to build and evaluate their hu-
man resource management systems.
for Psychological Science, and Society for Industrial and Organizational Psychol-
ogy.
Dr. Lois Tetrick is University Professor in the Industrial and Organizational Psy-
chology Program, George Mason University. She is a former president of the
Society for Industrial and Organizational Psychology and a founding member of
the Society for Occupational Health Psychology. Dr. Tetrick is a fellow of the Eu-
ropean Academy of Occupational Health Psychology, the American Psychologi-
cal Association, the Society for Industrial and Organizational Psychology and the
Association for Psychological Science. Dr. Tetrick is a past editor of the Journal
of Occupational Health Psychology and the Journal of Managerial Psychology,
and served as Associate Editor of the Journal of Applied Psychology. Dr. Tetrick
has edited several books including The employment relationship: Examining psy-
chological and contextual perspectives with Jackie Coyle-Shapiro, Lynn Shore,
and Susan Taylor; The Employee-Organization Relationship: Applications for the
21st Century with Lynn Shore and Jackie Coyle-Shapiro; the Handbook of Occu-
pational Health Psychology (1st and 2nd editions) with James C. Quick; Health and
Safety in Organizations with David Hofmann; Research Methods in Occupational
Health Psychology: Measurement, Design and Data Analysis with Bob Sinclair
and Mo Wang; and two volumes on cybersecurity incident response teams: Psychosocial Dynamics of Cybersecurity and Improving Social Maturity of Cybersecurity Incident Response Teams with S. J. Zaccaro, R. S. Dalal, and colleagues. In addition,
she has published numerous chapters and journal articles on topics related to her
research interests in occupational health and safety, occupational stress, the work-
family interface, psychological contracts, social exchange theory and reciprocity,
organizational commitment, and organizational change and development.