
RESEARCH METHODS IN HUMAN RESOURCE MANAGEMENT
Toward Valid Research-Based Inferences

Edited by Eugene F. Stone-Romero and Patrick J. Rosopa

A Volume in: Research in Human Resource Management

In this volume of Research in Human Resource Management we consider the overall validity of inferences stemming from empirical research in human resource management (HRM), industrial and organizational psychology, organizational behavior, and allied disciplines. The chapters in this volume address the overall validity of inferences as a function of four facets of validity, i.e., internal, external, construct, and statistical conclusion. The contributions address validity issues for specific foci of study (e.g., interviews, safety, and organizational politics) as well as those that span multiple foci (e.g., neglected research methods, causal inferences in research, and heteroscedasticity in measured variables). The general objective of the chapters is to provide basic and applied researchers with "tools" that will help them to design and conduct empirical studies that have high levels of validity, improving both the science and practice of HRM.

[Cover figure: "Validity of research results" shown as a function of construct validity, internal validity, external validity, and statistical conclusion validity.]

IAP—Information Age Publishing
P.O. Box 79049, Charlotte, NC 28271-7047
www.infoagepub.com
Research Methods in Human
Resource Management: Toward
Valid Research-Based Inferences

A Volume in:
Research in Human Resource Management

Series Editors

Dianna L. Stone
James H. Dulebohn
Research in Human Resource Management
Series Editors

Dianna L. Stone
Universities of New Mexico, Albany, and Virginia Tech

James H. Dulebohn
Michigan State University

Diversity and Inclusion in Organizations (2020)


Dianna L. Stone, James H. Dulebohn, & Kimberly M. Lukaszewski

The Only Constant in HRM Today is Change (2019)


Dianna L. Stone & James H. Dulebohn

The Brave New World of eHRM 2.0 (2018)


James H. Dulebohn & Dianna L. Stone

Human Resource Management Theory and Research on


New Employment Relationships (2016)
Dianna L. Stone & James H. Dulebohn

Human Resource Strategies for the High Growth Entrepreneurial Firm (2006)
Robert L. Heneman & Judith Tansky

IT Workers Human Capital Issues in a Knowledge Based Environment (2006)


Tom Ferratt & Fred Niederman

Human Resource Management in Virtual Organizations (2002)


Robert L. Heneman & David B. Greenberger

Innovative Theory and Empirical Research on Employee Turnover (2002)


Rodger Griffeth & Peter Hom

COMING SOON

Managing Team Centricity in Modern Organizations


James H. Dulebohn, Brian Murray, & Dianna L. Stone

Forgotten Minorities
Dianna L. Stone, Kimberly M. Lukaszewski, & James H. Dulebohn
Research Methods in Human
Resource Management: Toward
Valid Research-Based Inferences

Edited by
Eugene F. Stone-Romero
Patrick J. Rosopa

INFORMATION AGE PUBLISHING, INC.


Charlotte, NC • www.infoagepub.com
Library of Congress Cataloging-In-Publication Data

The CIP data for this book can be found on the Library of Congress website (loc.gov).

Paperback: 978-1-64802-088-9
Hardcover: 978-1-64802-089-6
E-Book: 978-1-64802-090-2

Copyright © 2020 Information Age Publishing Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system,
or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilm-
ing, recording or otherwise, without written permission from the publisher.

Printed in the United States of America


CONTENTS

1. Perspectives on the Validity of Inferences from Research in Human Resource Management .......... 1
Eugene F. Stone-Romero and Patrick J. Rosopa

2. Advances in Research Methods: What Have We Neglected? .......... 5
Neal Schmitt

3. Research Design and Causal Inferences in Human Resource Management Research .......... 39
Eugene F. Stone-Romero

4. Heteroscedasticity in Organizational Research .......... 67
Amber N. Schroeder, Patrick J. Rosopa, Julia H. Whitaker, Ian N. Fairbanks, and Phoebe Xoxakos

5. Kappa and Alpha and Pi, Oh My: Beyond Traditional Inter-rater Reliability Using Gwet's AC1 Statistic .......... 87
Julie I. Hancock, James M. Vardaman, and David G. Allen

6. Evaluating Job Performance Measures: Criteria for Criteria .......... 107
Angelo S. DeNisi and Kevin R. Murphy

7. Research Methods in Organizational Politics: Issues, Challenges, and Opportunities .......... 135
Liam P. Maher, Zachary A. Russell, Samantha L. Jordan, Gerald R. Ferris, and Wayne A. Hochwarter

8. Range Restriction in Employment Interviews: An Influence Too Big to Ignore .......... 173
Allen I. Huffcutt

9. We've Got (Safety) Issues: Current Methods and Potential Future Directions in Safety Climate Research .......... 197
Lois E. Tetrick, Robert R. Sinclair, Gargi Sawhney, and Tiancheng (Allen) Chen

Biographies .......... 227
CHAPTER 1

PERSPECTIVES ON THE
VALIDITY OF INFERENCES
FROM RESEARCH IN HUMAN
RESOURCE MANAGEMENT
Eugene F. Stone-Romero and Patrick J. Rosopa

Empirical research in Human Resource Management (HRM) and the related fields of industrial and organizational psychology, and organizational behavior
has focused on such issues as recruiting, testing, selection, training, motivation,
compensation, and employee well-being. A review of the literature on these and
other topics suggests that less than optimal methods have often been used in HRM
studies. Among the methods-related problems are using (a) measures or manipu-
lations that have little or no construct validity, (b) samples of units (e.g., partici-
pants, organizations) that bear little or no correspondence to target populations,
(c) research designs that have little or no potential for supporting valid causal
inferences, (d) samples that are too small to provide for adequate statistical power,
and (e) data analytic strategies that are inappropriate for the issues addressed by a
study. As a result, our understanding of various HRM phenomena has suffered and
improved methods may serve to enhance both the science and practice of HRM
and allied disciplines.

In order for the results of empirical studies to have a high level of validity, it is critical that the studies have construct validity, internal validity, external validity, and statistical conclusion validity (Campbell & Stanley, 1963; Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2002). Construct validity has to do with the degree to which the measures and manipulations used in an empirical study are faithful representations of underlying constructs.
Internal validity reflects the degree to which the design of a study allows for valid
inferences about causal connections between the variables considered by a study.
External validity represents the extent to which the findings of a study general-
ize to different sampling particulars of units, treatments, research settings, and
outcomes. Finally, statistical conclusion validity is the degree to which inferences
stemming from the use of statistical methods are correct.
Valid research results are vital for both science and practice in HRM and allied
fields. With respect to science, the confirmation of a theory hinges on the validity
of empirical studies that are used to support it. For example, research aimed at
testing a theory that X causes Y is of little or no value unless it is based on studies
that use randomized experimental designs. In addition, the results of valid re-
search are essential for the development and implementation of HRM policies and
practices. For example, attempts to reduce employee turnover will not meet with
success unless an organization measures this criterion in a construct valid manner.

PURPOSE OF THE SPECIAL ISSUE


In view of the above, the purpose of this Special Issue (SI) of Research in Human
Resource Management is to provide researchers with resources that will enable
them to improve the internal validity, external validity, construct validity, and sta-
tistical conclusion validity (Campbell & Stanley, 1963; Cook & Campbell, 1976,
1979; Shadish, Cook & Campbell, 2002) of research in HRM. Sound research
in these fields should serve to improve both the science and practice of HRM. In
the interest of promoting such research the authors of chapters in this SI specify
research methods-related problems in HRM and offer recommendations for deal-
ing with them.
The chapters in this volume are arranged in terms of the breadth of issues dealt
with by them. More specifically, the chapters that have the broadest scope are
presented first, followed by those that have a narrower focus. Brief summaries of
the chapters, in order of their appearance, are as follows:
Neal Schmitt (Michigan State University) provides a comprehensive contribu-
tion that considers such issues as the development and use of quantitative meth-
ods, estimates of reliability, IRT methods, Big Data, structural equation modeling,
meta-analysis, hierarchical linear modeling, computational modeling, regression
analysis, confirmatory factor analysis, analysis of data from research using lon-
gitudinal designs, the timing of data collection, and effect size estimation. The
topics covered by Schmitt deal with a number of facets of validity (e.g., internal, construct, statistical conclusion, and external).

Eugene F. Stone-Romero (University of New Mexico) explains the important connection between experimental design options (randomized-experimental,
quasi-experimental, and non-experimental) and the validity of inferences about
causal connections between variables. In the process, he shows why (a) random-
ized-experimental designs provide the firmest basis for causal inferences and (b)
a number of so called “causal modeling” techniques (e.g., causal-correlation, hier-
archical regression, path analysis, and structural equation modeling) have virtual-
ly no ability to justify such inferences. In addition, he considers the importance of
randomized-experimental designs for research aimed at (a) the testing of theories
and (b) the development of HRM-related policies and practices. Stone-Romero’s
contribution focuses on the internal validity of research.
Amber N. Schroeder (University of Texas—Arlington), Patrick J. Rosopa
(Clemson University), Julia H. Whitaker (University of Texas—Arlington), Ian
N. Fairbanks (Clemson University), and Phoebe Xoxakos (Clemson University)
describe how heteroscedasticity may manifest in organizational research. In par-
ticular, they discuss how heteroscedasticity may be substantively meaningful.
They provide examples from research on stress interventions, aging and individu-
al differences, skill acquisition and training, groups and teams, and organizational
climate. In addition, they describe procedures that can be used to detect various
forms of heteroscedasticity in commonly used statistical analyses in HRM.
Julie I. Hancock (University of North Texas), James M. Vardaman (Missis-
sippi State University), and David G. Allen (Texas Christian University) note the
importance of inter-rater reliability in HRM studies that involve two or more in-
dependent coders (e.g., a meta-analysis). These authors review various measures
of inter-rater reliability including percentage agreement, Cohen’s kappa, Scott’s
pi, Krippendorf’s alpha, and Gwet’s AC1. Based on their comparative analysis of
440 articles that were coded for various characteristics, they provide evidence to
suggest that Gwet’s AC1 may be a useful index of inter-rater reliability beyond tra-
ditional indices (e.g., percentage agreement, Cohen’s kappa). They also provide
practical guidelines for HRM researchers when selecting an index for inter-rater
reliability.
Angelo S. DeNisi (Tulane University) and Kevin R. Murphy (University of
Limerick) discuss the difficulties associated with comparing appraisal systems
when job performance criteria vary across studies. After reviewing common ap-
proaches for evaluating criteria, the authors describe a construct validation frame-
work that can be used to establish criteria for criteria. The framework involves
construct explication, multiple evidence sources, and synthesis of evidence.
Liam P. Maher (Boise State University), Zachary A. Russell (Xavier Univer-
sity), Samantha L. Jordan (Florida State University), Gerald R. Ferris (Florida
State University), and Wayne A. Hochwarter (Florida State University) discuss
methodological issues in organizational politics. They discuss five constructs in
the organizational politics literature—perceptions of organizational politics, po-
litical behavior, political skill, political will, and reputation. In addition to con-
ceptual definitions and measurement issues, the authors provide critiques of each
construct as well as directions for future research. The authors conclude with a
discussion of the conceptual, research design, and data collection challenges that
researchers in organizational politics face.
Allen I. Huffcutt (University of Wisconsin Green Bay) discusses the problem
of range restriction in HRM, especially in employment interviews. He demon-
strates how serious this problem can be by simulating data that is unrestricted and
free of measurement error. Then, he shows how validities change after systemati-
cally introducing measurement error, direct range restriction, and indirect range
restriction. In addition, he provides a step-by-step demonstration of the calcula-
tions to obtain corrected correlation coefficients.
Lois E. Tetrick (George Mason University), Robert R. Sinclair (Clemson
University), Gargi Sawhney (Auburn University), and Tiancheng (Allen) Chen
(George Mason University) discuss methodological issues in the safety climate
literature based on a review of 261 articles. Their review reveals a lack of consen-
sus and an inadequate explication of the safety climate construct and its dimen-
sionality. In addition, the authors discuss some common research design issues
including the low percentage of studies that involve interventions. The authors
highlight the (a) importance of incorporating time in research studies involving
multiple measurements and (b) increased use of various levels in safety climate
research.

REFERENCES
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for
research. Chicago, IL: Rand McNally.
Cook, T. D., & Campbell, D. T. (1976). The design and conduct of quasi-experiments and
true experiments in field settings. In M. D. Dunnette (Ed.), Handbook of industrial
and organizational psychology (pp. 223–326). Chicago, IL: Rand McNally.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues
for field settings. Boston, MA: Houghton Mifflin.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experi-
mental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
CHAPTER 2

ADVANCES IN RESEARCH
METHODS
What Have We Neglected?

Neal Schmitt

My purpose in this paper is twofold. First, I trace and describe the phenomenal development of quantitative analysis methods over the past couple of decades. Then
I make the case that our progress in measurement, research design, and estimating
the practical significance of our research has not kept pace with the development
of analytic techniques and that more attention should be directed to these critical
aspects of our research endeavors.

RESEARCH METHODS A HALF CENTURY AGO


In the 1960s, quantitative methods courses included correlation and regression
analyses, analysis of variance and an advanced course on factor analysis (this
was exploratory factor analysis; confirmatory factor analyses did not arrive on
the scene till the early 1980s). At this time, too, item response theory (IRT) had been described theoretically, but software packages that would allow for evaluating items, particularly with polytomous models, were not really available till the 1970s and '80s.

At this time the notion that all validities were specific to a situation was the
accepted wisdom in the personnel selection area. Frank Schmidt and Jack Hunt-
er introduced meta-analyses and validity generalization in the mid to late 1970s
(Schmidt & Hunter, 1977). Hypothesis testing was standard practice too and little
attention was paid to the practical significance of statistically significant results.
So, a person at that time was considered well trained if he/she were conversant
with correlation and regression, analyses of variance, exploratory factor analysis
and perhaps nonparametric indices. This has changed radically in the intervening
years.

DEVELOPMENT OF MODERN
QUANTITATIVE ANALYSIS METHODS
The 1980s were distinguished by the rapid adoption of structural equation model-
ing (SEM) using LISREL (later AMOS, MPLUS and other software tools) and
the use of meta-analysis to summarize bodies of research on a wide variety of
relationships between HR and OB constructs. Among SEM enthusiasts, there was
even a misperception that confirmation of a proposed model of a set of relationships indicated that these variables were causally related, rather than merely that the data were consistent with a hypothesized set of relationships. Even after this error of interpretation was recognized, there was an enthusiastic adoption of
SEM by researchers. Both meta-analysis and SEM brought a focus on the under-
lying latent constructs being measured and related as opposed to the measured
variables themselves.
Developments in both SEM and meta-analyses became increasingly sophis-
ticated. Meta-analysts were concerned about file-drawer problems, random ver-
sus fixed effects analyses, estimates of variance accounted for by various errors,
moderator analyses, and the use of regression analyses of meta-analytically de-
rived estimates of relationships. Specific applications of SEM such as multi-group
analyses and tests for measurement invariance (Vandenberg & Lance, 2000)
were soon widely applied as were SEM analyses of longitudinal data (e.g., latent
growth modeling, Willett & Sayer, 1994).
Certainly among the most frequently used analytic innovations have been those
associated with levels research (Klein & Kozlowski, 2000). Multilevel modeling
is used in a very large proportion of the articles now published in our journals. In
one recent issue of the Journal of Applied Psychology (February 2017), Sonnentag, Pundt, and Venz used multilevel SEM to assess survey data on snacking behavior; Walker, van Jaarsveld, and Skarlicki used multilevel SEM to study the impact of customer aggression on employee incivility; and Zhou, Wang, Song, and Wu examined perceptions of innovation and creativity using hierarchical linear modeling. Hierarchical linear modeling (HLM) has been used to study change, goal congruence, climate, and many other phenomena. It is almost as though we suddenly discovered the nested nature of most of the data we collect.
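To make the nested-data point concrete, the following is a minimal sketch (not from any of the studies cited above) of a two-level random-intercept model fit with Python's statsmodels; the data and column names are hypothetical.

```python
# Hypothetical two-level data: employees (rows) nested within teams. A random
# intercept for team acknowledges that responses within a team are not independent.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_teams, n_per_team = 30, 10
team = np.repeat(np.arange(n_teams), n_per_team)
team_effect = rng.normal(scale=0.5, size=n_teams)[team]    # level-2 (team) variance
x = rng.normal(size=n_teams * n_per_team)                   # level-1 predictor
y = 0.3 * x + team_effect + rng.normal(size=n_teams * n_per_team)
df = pd.DataFrame({"y": y, "x": x, "team": team})

# Random-intercept model: y regressed on x, with intercepts varying by team.
result = smf.mixedlm("y ~ x", df, groups=df["team"]).fit()
print(result.summary())
```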

We have also seen the development of new methods of analyzing longitudinal data (including the use of HLM for this purpose). As mentioned above, la-
tent growth modeling via SEM has been used in many studies. Early methods of
analyzing change usually involved predictors that did not change over time and
were not directed to analyzing relationships in change across variables over time.
With time-varying predictors and dynamic relationships, more complex methods
of analysis are required. Good treatments of the differences in growth models
are provided in book chapters by Ployhart and Kim (2013) and DeShon (2013).
Examples of the analysis of dynamic change models are becoming more frequent
(e.g., Chen, Ployhart, Cooper-Thomas, Anderson, & Bliese, 2011; Pitariu & Ployhart, 2010).
Computational modeling was described by Ilgen and Hulin (2000) nearly two
decades ago, but is now becoming rather commonly used as an alternative re-
search method. See Grand (2017) for a computational model of the influence of
stereotype threat on training/learning practices and the performance potential of
employees and their organizations. Vancouver and Purl (2017) provide a compu-
tational model used to understand better the negative, positive, and null effects of
self-efficacy on performance.
Missing data plague many of our applied studies, particularly when data are collected on multiple occasions. Traditionally, I think most of us have used listwise deletion of cases with missing data or replaced missing values with the mean of the variable. Listwise deletion has a disastrous effect on the available sample size (e.g., with N = 500 and k = 10 variables, 10% randomly missing data on each variable leaves only about 175 complete cases). Mean replacement as a solution to the missing data problem has an obvious impact on the variability of a variable and produces biased parameter estimates. Full information maximum likelihood estimation and EM-based imputation are much more powerful and can handle huge proportions of missing data in ways that do not produce biased estimates of parameters. Mention of these methods of treating missing data is only beginning to appear in our literature.
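The arithmetic behind the listwise-deletion example is easy to verify by simulation; the sketch below (with made-up data) also shows the variance shrinkage produced by mean replacement.

```python
# Simulation of the example above: N = 500 cases, k = 10 variables, and 10% of
# each variable missing completely at random.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
N, k = 500, 10
data = pd.DataFrame(rng.normal(size=(N, k)), columns=[f"x{i}" for i in range(k)])
masked = data.mask(rng.random(size=(N, k)) < 0.10)   # punch out ~10% of each column

# Listwise deletion keeps only rows with no missing values: about 500 * .9**10 ≈ 174.
print(len(masked.dropna()))

# Mean replacement preserves N but shrinks variability (and attenuates covariances).
mean_imputed = masked.fillna(masked.mean())
print(masked["x0"].var(), mean_imputed["x0"].var())
```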
Adaptations to the typical regression analyses have also been introduced and
are increasingly common. Moderated and mediated regressions have been part of
our repertoire for some time, but they continue to present challenges. Analyses
of models that involve mediation are susceptible to inference problems (Stone-
Romero & Rosopa, 2011) such as those mentioned above in connection with
SEM. I will summarize these problems in connection with research design prob-
lems later in this paper. We also employ polynomial regression, splines, and the
analysis of censored data sets.
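As a brief illustration of the moderated-regression case mentioned above, the sketch below fits an ordinary least squares model with a product (interaction) term using statsmodels; the variables and effect sizes are invented for the example.

```python
# Moderated regression: the x:z product term carries the hypothesized moderation.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"x": rng.normal(size=n), "z": rng.normal(size=n)})
df["y"] = 0.4 * df["x"] + 0.3 * df["z"] + 0.25 * df["x"] * df["z"] + rng.normal(size=n)

# "y ~ x * z" expands to the main effects of x and z plus their product (x:z).
fit = smf.ols("y ~ x * z", data=df).fit()
print(fit.params[["x", "z", "x:z"]])   # the x:z coefficient estimates the moderation
```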
Qualitative analyses have typically been disparaged by quantitative research-
ers, but the early stages of a job analysis, probably the oldest aspect of a selection
study, are certainly qualitative in nature. Modern forms of qualitative research, such as various versions of text analysis, can be very quantitative, and even when such techniques are not used, qualitative analyses have become an increasingly valuable tool for organizational researchers.

Big Data produces opportunities and challenges associated with the analysis
and interpretation of huge multidisciplinary data sets. Angrave, Charlwood, Kirkpatrick, Lawrence, and Stuart (2016), Cascio and Boudreau (2011), and others have detailed a number of challenges in the use and interpretation of the wide variety of big data available and the potential that analyses of these data will result in improved HR practices. The quality and accuracy of many Big Data files are often
unknown; for example, it is rare that one would be able to assess the construct
validity of Big Data indices as organizational researchers usually do.
Big Data also introduces a whole new vocabulary (see Harlow & Oswald, 2016): terms like lasso (screening out noncontributing predictor variables), latent Dirichlet allocation (modeling words in a text that are attributable to a smaller set of topics), k-fold cross-validation (developing models on subsets of "training" data that are then validated on held-out "test" data), crud factor (a general factor or nuisance factor), and many more. In many ways, analyses of Big
Data seem like the “dust-bowl empiricism” that was decried by organizational
psychologists a half century ago. Note though that most Big Data analysts do
attend to theory and much more effort has been devoted to consideration of cross-
validation of findings than was true in early validation efforts.
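For readers unfamiliar with this vocabulary, the sketch below shows what lasso screening combined with k-fold cross-validation looks like in practice, using scikit-learn on simulated data; it is an illustration only, not an analysis from the sources cited above.

```python
# Lasso with 5-fold cross-validation on simulated data in which only 5 of 50
# predictors actually contribute; the penalty is chosen by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=500, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# Each fold serves once as held-out "test" data while the model is fit on the rest.
model = LassoCV(cv=5, random_state=0).fit(X, y)

print(model.alpha_)              # penalty value selected by cross-validation
print(np.sum(model.coef_ != 0))  # number of predictors the lasso screens in
```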
Many other analysis techniques have challenged us, such as relative importance analysis, the use of interaction and power terms in regression to analyze difference scores, power analysis, spatial analyses, social network analyses, dyadic data analyses, and more. The last two to three decades have been an exciting time for those working in quantitative analyses. The variety of new techniques available to help understand data, the increased availability of publicly available data in social media outlets and elsewhere, and the availability of free software packages such as R are nothing short of a revolution. We are continually faced with
the challenge of educating ourselves and others on the appropriate and productive
use of these procedures.
While we have much to celebrate and occupy ourselves, I would like to voice
concerns about some issues that seem to have gone unnoticed by many research-
ers.

PROBLEMS THAT REQUIRE ATTENTION


Three concerns warrant greater attention than seems to be the case in our cur-
rent literature. First, we have not been as concerned about the data themselves as
the techniques we use to summarize and analyze them. Certainly many of us are
familiar with the phrase “garbage in, garbage out.” Very little attention has been
given to development of new measures or even the reliability or construct validity
of those measures we do employ. A similar concern was voiced by Cortina, Agui-
nis, and DeShon (2017) after reviewing 100 years of research methods papers
published in the Journal of Applied Psychology. They said: "…we hope that the fascination with abstruse data analysis techniques gets replaced by fascination with appropriate research design, including top-notch measurement" (p. 283). Second,
we have paid too little attention to research design. This is especially evident
when we collect (or try to collect) longitudinal data. Third, in the interest of dis-
covering statistically significant findings or the results of the latest novel analytic
technique, we have lost sight of the practical significance of our results—in terms
of reporting effect sizes that are meaningful to practitioners, explaining the nature
of our results (witness the lack of impact of selection utility analyses so popular
a couple of decades ago) and in terms of addressing issues that concern OB/HR
practitioners or our organizational clients. In the remainder of this chapter, I will
describe the “state of the art” in these three areas and why I think they should
receive more attention by research methodologists than is currently the case.

MEASUREMENT CONCERNS
Aside from IRT developments there has been very little direction as to how to
evaluate the items or scales we use. Even IRT is not very applicable with short
scales. CFA has been used for the same purpose, but we have little guidance as
to what constitutes good fit to a particular measurement model. Nye and Drasgow (2011) have tried to provide such guidance, and Meade, Johnson, and Braddy (2008) recommend the use of the comparative fit index (with a cutoff value of .002 for the change in CFI) as a means of comparing the fit of alternative models or testing measurement invariance. Too often we wave a set of alpha values at the scales we use, sometimes apologizing for those whose alphas are below .70, as evidence that our measures are acceptable. Sometimes
journals even publish one item per scale so the reader can get some sense of the
nature of the construct measured, but even the publication of one item has been
found objectionable on proprietary grounds. Clark (2006) decries the sentence
often used to justify the use of a measure: “According to the literature, measure
X’s reliability is good and it has been shown to have validity.” (p. 448). This
statement is often made without a reference, but even with a reference, it often
appears doubtful that the author read the paper or papers they cite. Of course,
there is often no mention as to how or against what the measure was validated. Or,
as investigators we write a set of items for a study and label them as some exist-
ing construct with no supporting investigation of its psychometric characteristics.
Subsequent researchers or meta-analysts take the label for granted. This situation
is even worse now that we have become enamored of Big Data because we have little or no control over the nature of the data collected, and many times the data come from disciplines that have little appreciation for the quality of their measures (Angrave et al., 2016).
Let me give some examples. Forty or fifty years ago, there were several pub-
lications which provided guidelines on item writing though most of those ad-
dressed multiple choice ability items. Some guidelines addressed the use of
double-barreled items, use of double negatives, or jargon (Edwards, 1957) in
Likert-type items. I even remember one paper that experimentally manipulated
some of these guidelines in constructing a final exam in an introductory psychol-
ogy course (Dudycha & Carpenter, 1973). We now take item writing (whether
multiple choice or Likert items) for granted with the possible exception of large
test publishers whose main concern is the perceived fairness of test items to dif-
ferent groups of examinees. Even attempts to improve perceived fairness to dif-
ferent underrepresented groups have rarely been examined (for an exception see
Golubovich, Grand, Ryan and Schmitt, 2014).
We do have some other developments in measurement – both methods of mea-
surement and the means of analyzing the measurement properties of our indices.
Cognitive diagnosis models, multidimensional IRT models and simulations/gam-
ing are some examples. However, these techniques have not caught on to any
great degree—perhaps because they are too challenging for many of us or because
psychometricians or quantitative data analysts do not speak the language of most
psychologists and there may be some level of arrogance among psychometricians
about the relative incompetence of the rest of us. In any event, few of us read
Psychometrika anymore and I suspect the same may be true of Psychological
Methods and educational journals like the Journal of Educational Measurement.
Organizational Research Methods is still accessible to most organizational re-
searchers, and that may account for its relatively high impact factor. Whatever the reason, there seems to be a segregation of quantitative analysts and measurement
types from other researchers, particularly those who develop or use psychological
measures.
In addition to a lack of attention in writing items, there is an overdependence on
alpha as an index of reliability or unidimensionality. Cortina (1993) and Schmitt (1996) have demonstrated that alpha can be a poor index of reliability even when
we have lots of items. Schmitt (Cortina [1993] provided a similar analysis) dem-
onstrated that a six-item test with the item intercorrelations in Table 2.1 yielded an
alpha of .86. Most of us would be happy with this alpha and proceed to do further
analyses using this measure. If we bother to look further (examine item intercor-
relations), it would be obvious that the six items address two dimensions. Further,
examination of item content would almost certainly provide at least a tentative
explanation of these two sets of items, but that almost never occurs. This example
was constructed to make a point, but with more items and a more ambiguous set of
correlations, this problem would likely go unrecognized. A more modern and fre-
quently used approach to assess the dimensionality of our measures is to employ
confirmatory factor analyses. Assessment of a unidimensional model of these in-
tercorrelations would have yielded the following fit indices (Chi square=401.62,
df=9, RMSEA=.47, NNFI=.29, CFI = .57). Most researchers would conclude that
the alpha of .86 was not a good index of the unidimensionality of this measure and
that a composite index of this set of six items is meaningless.
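The standardized alpha for the pattern just described (two clusters of three items correlating .8 within clusters and .3 between clusters; see Table 2.1 below) can be checked with a few lines of code; the computation is a sketch that assumes standardized items.

```python
# Check of the alpha ≈ .86 claim for the Table 2.1 pattern.
import numpy as np

R = np.full((6, 6), 0.3)      # between-cluster correlations
R[:3, :3] = 0.8               # cluster 1 (items 1-3)
R[3:, 3:] = 0.8               # cluster 2 (items 4-6)
np.fill_diagonal(R, 1.0)

k = R.shape[0]
# Standardized coefficient alpha computed from a correlation matrix:
#   alpha = (k / (k - 1)) * (1 - k / sum(R))
alpha = (k / (k - 1)) * (1 - k / R.sum())
print(round(alpha, 2))        # 0.86, despite the obvious two-cluster structure
```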
However, we can also fool ourselves about the dimensionality of a set of items
when using CFA—probably not as easily. We are dependent on a set of “rules
of thumb” as to whether a model fits our data and indices of practical fit (Nye &
Drasgow, 2011) are not helpful in this instance. Consider the item intercorrela-
tions in Table 2.2 for which a four-factor model produces a perfect fit to the data.

TABLE 2.1. Hypothetical Intercorrelations of a Six-Item Composite


Variable 1 2 3 4 5 6
1 1.0
2 .8 1.0
3 .8 .8 1.0
4 .3 .3 .3 1.0
5 .3 .3 .3 .8 1.0
6 .3 .3 .3 .8 .8 1.0

A one factor model does pretty well too (Chi-square=73.66, df=54, RMSEA=.04,
NNFI=.98, CFI=.98). Most of us as authors and most of us as reviewers would
be happy with this demonstration of unidimensionality. I agree that the difference
in the correlations of within and between items belonging to each of these four
factors is small, but alpha for each of these four factors is .82 and the correlation between any two of the four sets of items is .55. Are these distinct and practically meaningful "factors"? Incidentally, the alpha for the 12-item composite here
is .93. Clearly, both alpha and CFA tell us that one factor explains these data,
but four distinct factors are responsible for the item intercorrelations. The point
I am making is that the more sophisticated analysis of dimensionality does not
do justice to the question any more so than does alpha. A third way of looking
at these data is to examine the item content, item-total correlations, and the item
intercorrelations or perform an exploratory factor analysis—something that few
“sophisticated” data analysts ever do!

TABLE 2.2. Hypothetical Intercorrelations of Data Representing Four Factors


Variable 1 2 3 4 5 6 7 8 9 10 11 12
1 1
2 .6 1
3 .6 .6 1
4 .5 .5 .5 1
5 .5 .5 .5 .6 1
6 .5 .5 .5 .6 .6 1
7 .5 .5 .5 .5 .5 .5 1
8 .5 .5 .5 .5 .5 .5 .6 1
9 .5 .5 .5 .5 .5 .5 .6 .6 1
10 .5 .5 .5 .5 .5 .5 .5 .5 .5 1
11 .5 .5 .5 .5 .5 .5 .5 .5 .5 .6 1
12 .5 .5 .5 .5 .5 .5 .5 .5 .5 .6 .6 1

Yet another reason why quality measurement is important is highlighted in a paper by Murphy and Russell (2017). They point to a long history of frustration
among organizational scientists in formulating, documenting, and testing modera-
tor hypotheses and wonder if it is time to discontinue our search for moderator
variables. One reason they have been elusive is that the typical moderator variable
has very low reliability. If a moderator is formed by the product of two measures
each with reliability of .70 (the usual level considered marginally acceptable),
their product has a reliability of .49. This low reliability, of course, has an impact
on the power with which any test of moderation is conducted and the magnitude
of the effect associated with the moderation.
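The .49 figure follows from a standard result: when the two component measures are mean-centered and uncorrelated with each other (with errors independent of the true scores and of each other), the reliability of their product is approximately the product of their reliabilities, as sketched below.

$$\rho_{xz,\,xz} \;\approx\; \rho_{xx}\,\rho_{zz} \;=\; .70 \times .70 \;=\; .49$$

When the components are correlated, the product term's reliability is somewhat higher, but for typical inter-measure correlations it remains well below the conventional .70 standard.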
A third reason why we may be likely to ignore the measurement quality of our
data is the advent of Big Data. Big Data analyses can be valuable in answering
many of the questions we have about human behavior in ways we could only have wished for a decade or so ago. However, it also means that we often accept data from many different sources and from people whose appreciation for measurement issues simply does not match that of members of our discipline, and there is usually no way we
can check the psychometric characteristics of these measures. I do not know to
what degree this may bias the results of Big Data analyses, but I do think it de-
serves attention.

INFORMATION ON RELIABILITY AS
PRESENTED IN CURRENT RESEARCH ARTICLES
Overall, then, I do not believe we have paid much attention to the quality of our
measures. We seem to think item writing is easy and that if we ask respondents to use a five-point Likert-type scale we will have a quality measure. Or, more often,
researchers adapt a measure from previous work, occasionally taking a few items
from some longer measure. Then we use alpha and CFA to justify our behavior. To
ascertain that our statements about these practices are relatively standard, I exam-
ined the last three issues of the 2017 volumes of Journal of Applied Psychology,
Personnel Psychology and the Academy of Management Journal and tabulated the
results in Table 2.3. In this table, I have described the construct the authors pur-
ported to measure, the evidence for reliability and, in some cases, the discriminant
validity of the measure (convergent validity was rarely mentioned or assessed),
employment of rules of thumb to justify reliability and the justification for use of
the measure.
The table does document several positive features of the research reviewed.
First, most indices of reliability are quite high and clearly exceed the usual mini-
mum for alpha (i.e., .70) cited in the literature. Second, authors do routinely provide justification for their use of measures. That justification, though, is almost always limited to the fact that someone else used the same scale or the current measure was modified from a measure used in an earlier study. Very rarely did authors present any evidence of the relationship between the original version of the measure and the modified measure. The frequent modification of scales is documented in Cortina et al. (under review).

TABLE 2.3. Measure Adequacy as Described in Recent Issues of Three Journals


Columns: Journal | Construct(s) | Type of Measure | Justification (reliability evidence) | Rule of Thumb Mentioned | Previous Use or Justification Mentioned
JAP Rudeness Self-report Alpha = .92 No Yes
Goal Progress Self-report Alpha = .86 No Yes
Task performance Self-report Alpha = .94 No Yes
Psy withdrawal Self-report Alpha = .77 No Yes
Interpersonal avoidance Self-report Alpha = .94 No Yes
Morning affect Self-report Alpha = .92 No Yes
Core self-evaluation Self-report &.93 No Yes
Alpha = .85 Differences
CFA test of in fit
distinctiveness indices
JAP Role conflict Self-report Alpha = .78 No Yes
Empowering help orientation. Self-report & .81 Yes Yes
Emotional exhaustion Self-report Alpha = .65 No Yes
Alpha = .89
& .02
JAP Interviewer evaluations of Interviewers Alpha = .88 No No
candidates Self-report Test-retest=.59 No Yes
Self-verification
JAP Entity theory Self-report Alpha = .92 No Yes
Social Support Self-report Alpha = .71 No Yes
Self-efficacy Self-report Alpha = .67 No Yes
Feedback Seeking Self-report Alpha = .78 No Yes
JAP Anger Self-report Alpha = .96 No Yes
Empathy Self-report None No Yes
Perceptions of Treatment Self-report None No Yes
Intentions
JAP Behavioral integrity Self-report Alpha = .78 No Yes
Ethnic Dissimilarity Self-report &.94 Yes Yes
Ethnic representation Self-report Alpha = .69 No No
Alpha = .96
JAP Perceived Effort Supervisor Alpha = .96 No Yes
Perceived Liking Supervisor & .94 No Yes
Procuticle Adherence Self-report Alpha = .92 No Yes
Org. Justice Adherence Self-report & .91 No Yes
Conscientiousness Self-report Alpha = .82 No Yes
Agreeableness Self-report Alphas=.83–.93 No Yes
Alpha = .77
Alpha = .77
Similar
measures used
in three studies

JAP Ability Evaluations Alpha = .90 No Yes
Benevolence of target &.90 No Yes
Integrity Alpha = .95 No Yes
&.92
Alpha = .95
&.93
CFA indicated
three factors
JAP Moral disengagement Self-report Alpha = .76 No Yes
Intent to ostracize Self-report & .82 No Yes
MD language Self-report Alpha = .88 No Yes
Moral identity Self-report &.94 No Yes
Other concern Self-report Alpha = .92 No Yes
&.87
Alpha = .85
&.85
Alpha = .83
&.90
JAP Team goal setting Self-report Alpha = .90 No Yes
Team agreeableness Self-report Alpha = .73 No Yes
Team emotional stability Self-report Alpha = .67 No Yes
Team extraversion Self-report Alpha = .86 No Yes
Team conscientiousness Self-report Alpha = .81 No Yes
Task cohesion Self-report Alpha = .86 No Yes
JAP Org. politics Self-report Alpha = .74 No Yes
Political behavior Self-report Alpha = .83 No Yes
Spy. Empowerment Self-report Alpha = .88 No Yes
Emotional Exhaustion Self-report Alpha = .92 No Yes
Task performance Supervisor Alpha = .96 No Yes
Psych Ineffective interpersonal Other rating Alpha = .94 No Yes
Behavior Supervisors Alpha = .91 No Yes
Performance Supervisor Alpha = .92, No Yes
Effective Interpersonal Behavior Supervisor factor analysis No Yes
Derailment potential Supervisor and CFA No No
Promotability Supervisor Confirmed 3 No No
Performance factors.
PPsych Individual OCB Group ldr. Alpha = .89 No Yes
Organizational OCB Group ldr. Alpha = .74 No Yes
Group Cohesiveness Group mbrs. Alpha-.83 No Yes
Job self-efficacy Group mbrs Alpha = .73
&.74
PPsych Commuting strain Self-report Alpha = .8 to No Yes
Task significance Self-report .93 No Yes
Family interference Self-report Alpha = .78 No Yes
Commuting means effic. Self-report to .93 No Yes
Self-regulation at work Self-report Alpha = .7 to No Yes
.93
Alpha = .86
Alpha = .94
PPsych Five dimensions of role Identity Self-report Alphas=.70 No Yes ;CFA for
Group cohesiveness Self-report to .81 No discriminant
Alpha = .61 validity, test of
significance and
fit indices
Yes
PPsych Part. In development Self-report Alpha = .74 No Yes
Development challenges Self-report Alpha = .83 No Yes-constr
Develop. Supervision Mgrs. Alpha = .96 No validity
Leader Self Efficacy Self-report Alpha = .94 No Yes.
Mentor network Self-report Alpha = .93 No Yes
Leader efficacy Supervisor Alpha = .85 No No
Promotability Supervisor Alpha = .87 No Yes
Yes
PPsych Leader member exchange Self-report Alpha = .89 No Yes-compared
Alumni goodwill Self-report Alumni = .73 No with full length
lmx
New items
PPsych Affective org.comm. Self-report Alpha = .93 no \yes
Superv.trans.ldrship. Self-report Alpha = .98
PPsych Surface acting Self-report Alpha = .84 No Yes
Deep acting Self-report No Yes
Ego depletion Self-report No Yes
Self-efficacy-emotion regulation Self-report No Yes
Intentional harming of coworker Supervisor No Yes
PPsych Core self-evaluation Self-report Alpha = .81 No Yes
Task mastery Self-report Alpha = .83 No Yes
Political Knowledge Self-report Alpha = .73 No Yes
Social integration Self-report Alpha = .84 No Yes
Org. Identification Self-report Alpha = .86 No Yes
CFA for discr.
Val.

PPsych Mastery orientation Self-report Alpha = .93 No Yes
Mastery ornt. Var. Self-report Alpha = .94 No Yes
Post trng. Self-efficacy Self-report Alpha = .91 No Yes
Motivation to transfer Self-report .89 No Yes
Declarative knowledge Test/quiz None NA NA
Transfer Self-report Alpha = .8 to .9 No Yes
Opportunity to perform Self-report Alpha = .69 No Yes
to .86
PPsych Transfomation ldrsh Supervisor Alpha = .96 No Yes
Family role identific. Self-report Alpha = .87 No Yes
PPsych Unethical behavior Supervisor Alpha = No Yes
Ostracism Self-report .91,.95,.96 No Yes
Performance Supervisor Alpha = No Yes
Performance Self-report .96,.91,.97 No yes
Relationship conflict Supervisor Alpha = .94 No Yes
Alpha = .70
Alpha = .92,96
AMJ Feedback seeking Raters Alpha = .75 No No
Curiosity Self-report Alpha = .81 No Yes
Change to artistic drafts Raters Kappa = .88 No No
AMJ Report vagueness Raters Krippen Yes No
alpha=.82
AMJ Brand identity conflict Self-report Alpha = .72 No Yes
Brand identity enhance Self-report Alpha = .70 No Yes
Intrinsic motivation Self-report Alpha = .92 No Yes
Perspective taking Self-report Alpha = .89 No Yes
AMJ Company resources Self-report Alpha = .73 No Yes
Employee belief in cause Self-report Alpha = .82 No Conv. & Disc.
Corporate volunteer climate Self-report Alpha = .97 No Val.
Corp. Vol intentions Self-report Alpha = .96 No Conv. & Disc.
Personal Vol. Intent Self-report Alpha = .97 No Val.
Yes
Yes
YES
AMJ Target influence behavior Evaluator Alpha = .93 No Yes
Self-reliance Evaluator Rwg=.77 No Yes
Leadership evaluations Self-report Alpha = .68 No Yes
Communality Evaluators Alpha = .88 No No
Competence Evaluator & .82 No Yes
Evaluator No data
Alpha = .88
AMJ Ethical relativism Self-report Alpha = .84 No Yes
Ethical idealism Self-report Alpha = .81 No Yes
Ethical leadership Peer or Alpha = .92 No Yes
subordinates
AMJ Econ. Downturn perc. Self-report Alpha = .90 No Yes
Negative mood Self-report Alpha = .77 No Yes
Positive mood Self-report Alpha = .72 No Yes
Construal of success Self-report Alpha = .79 No Yes
AMJ Surface acting Self-report Alpha = .85 No Yes
Deep acting Self-report Alpha = .84 No Yes
Work engagement Self-report Alpha = .92 No Yes
Emotional exhaustion Self-report Alpha = .90 No Yes
Giving help Self-report Alpha = .90 No Yes
Receiving help Cortina, et
Positive affect al. (under
Negative affect review)
Self-report Alpha = .88 No Yes
Self-report Alpha = .91 No Yes
Self-report Alpha = .80 No Yes

Beyond these positive features of the research, it is clear that organizational researchers have measured a wide variety of different constructs, most of which are not the typical individual difference measures that were the target of research in the selection arena. Human resource researchers, broadly defined, have clearly expanded the range of issues and constructs in which they are interested. This proliferation of measures may, however, make it more difficult to assess the commonality of research findings across studies and time; calling attention to this issue was not the purpose of our paper.
Almost all studies summarized in Table 2.3 use self-report instruments to as-
sess the constructs of interest and in many of these cases this is the only alterna-
tive. However, researchers frequently use supervisory responses or objective or
archival data as the source of information about constructs of interest. There are
fewer references to articles published in AMJ as many of the articles published
in that journal employ archival data for which coding accuracy or agreement are
applicable and for which data are readily available for verification purposes.
Third, there is an almost universal reliance on alpha as an index of measure-
ment reliability or adequacy. In some cases, this is complemented by a CFA of
items assigned to multiple constructs to ascertain their discriminant validity. In only a few cases was a CFA employed to determine the unidimensionality of a
measure. Most alpha values were quite high (in the .80s and .90s) and very few
were below the .70 level that has been routinely suggested as the minimally acceptable level. In no case was this level cited as justification for the use of a measure.
While not presented in the table, it was the case that almost all authors presented
one or two items for each of their measures, but never the entire measure. Given
the availability of publication of information in supplementary sources in most
journals, it seems that publication of the entire measure should become standard
practice.
Fourth, coded in the last column of the table was the justification cited by the
author for use of a measure. In almost all cases, the justification was the use of a
measure by some other author to index the same or similar constructs. However,
these citations rarely included the data from the original study that supported the
measure and in most cases as mentioned above, there was a modification (usually
decreasing the number of items) of the original measure. In a few cases, data were
reported that included a correlation between the original measure and the modi-
fied measure. In the case of these modifications, it seems particularly important
that the careful reader have access to both the original and modified instruments, underscoring the value of using supplementary publication outlets, if not the main article, for this purpose. In a few cases, the justification included a CFA of the
measured variables with a description of that analysis related to questions about
discriminant or convergent validity. While one or two representative items were
often presented in these papers, there was no presentation of item intercorrelations
or item-total correlations and content that might have further informed the reader
about the nature of the construct measured and the degree to which individual
items may or may not have represented that construct.
Because of the problems with alpha demonstrated in Table 2.1, it may be help-
ful to consider the inclusion of other indices of unidimensionality though there has
been minimal agreement as to what such an index might be (Hattie, 1985). When
a bifactor model (general factor plus uncorrelated specific factors) fits the data, it
might be useful to present the omega indices described by Reise (2012). These
indices represent the degree to which a set of items are reflective of a general fac-
tor as well as the variance associated with individual specific factors and variance
due to a specific item. Such analyses along with the examination of item content
might illuminate the nature of measures such as situational judgment measures
which typically display very low alphas, but little evidence that more than a single
general factor explains the data. Omega values as “reliability” measures would
have been more appropriate given the multidimensionality of the data presented
in Tables 2.1 and 2.2.
There is literature decrying the sole reliance on alpha (Sijtsma, 2009) and support for the use of omega (Dunn, Baguley, & Brunsden, 2014) and other indices of internal consistency (Zinbarg, Revelle, Yovel, & Li, 2005). Sijtsma argued for the use of an index he labeled the greatest lower bound (GLB) estimate as the preferred estimate of reliability. However, Zinbarg et al. (2005) showed that the GLB was almost always lower than the hierarchical form of omega. Omega that includes item loadings on a general factor as well as item loadings on group factors as true variance appears to be the best lower-bound estimate of reliability and the most appropriate index to use in correcting observed correlations between
two variables for attenuation due to unreliability. Dunn et al. document the al-
most universal use of alpha as a measure of internal consistency in spite of the
critical psychometric literature including a paper by Cronbach himself (Cronbach
& Shavelson, 2004). They also support the routine use of omega along with the
confidence interval for its estimation and provide direction and an example of its
calculation using the open-source statistical package R. McNeish (2018) provides a review of the use of alpha, like the one provided here in Table 2.3, for three different psychological journals. The results of that review are very similar in that almost
all authors used alpha as a report of reliability. McNeish went on to compare the
magnitude of alpha and five other reliability indices for measures included in two
publicly available data sets. He found alpha was consistently lower by about .02
to .10 depending most often on the variability of item loadings on a general factor.
Aside from underrepresenting the reliability of a measure, these differences may
be practically meaningful in applied instances when relationships are corrected
for attenuation due to unreliability as they routinely are in studies of the criterion-
related validity of personnel selection measures (Schmidt & Hunter, 1998).
In the March 2018 issue of the American Psychological Society’s Observer,
Fried and Flake make four observations about measurement that are consistent
with the data in Table 2.3 and this discussion. First, they encourage researchers
to clearly communicate the construct targeted, how it is measured, and its source.
Second, there should be a rationale for scale use and modifications. Third, if the
only evidence you have of measure “validity” is its alpha, consider conducting a
validity study to ascertain the scales’ correlates. Finally, stop using alpha as the
only evidence of a scale’s adequacy. I would add that we should replace alpha
with omega for composite measures.
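To close the measurement discussion, here is a minimal sketch of what computing omega (total) looks like for a one-factor (congeneric) model; the standardized loadings are hypothetical, and the formula assumes uncorrelated errors.

```python
# Coefficient omega (total) for a one-factor model with standardized items:
#   omega = (sum of loadings)^2 / [(sum of loadings)^2 + sum of error variances]
import numpy as np

loadings = np.array([0.75, 0.70, 0.65, 0.60, 0.55])   # hypothetical standardized loadings
uniquenesses = 1 - loadings**2                         # error variances for standardized items

common = loadings.sum() ** 2
omega = common / (common + uniquenesses.sum())
print(round(omega, 2))
```

For bifactor models, hierarchical omega additionally separates general-factor variance from group-factor variance, as described by Reise (2012).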

INATTENTION TO RESEARCH DESIGN


I mentioned above that we have made significant advances in our analyses of lon-
gitudinal data. However, we have paid little attention to the research designs that
produce our longitudinal data. When we study socialization, training, leadership,
or the impact of job satisfaction on life satisfaction or the reverse and a host of
other time-related variables, it is important that our data collection efforts reflect
the time periods in which the process we are studying is likely to occur. For ex-
ample, if we are looking at the impact of training on some outcome variable, it
makes little sense to evaluate such training before the effects of training are likely
to have had their full impact. Likewise it makes little sense to assess the impact
of various socialization efforts many months or years after the employment of a
group of employees. Similarly, investigating the impact of life satisfaction on job satisfaction among a group of long-tenured employees doesn't make much sense.
Perhaps this is well known or common sense. However, examples of a lack of
consideration of the timing of data collection are not difficult to find in recently
published articles. The following examples come from an unsystematic search of
the last several issues of top-tier journals in our discipline. Kaltiainen, Lipponen, and Holtz (2017) studied the longitudinal relationships between perceptions of
process justice during a merger and subsequent cognitive trust. They did do an
excellent job of describing when data were collected and what was transpiring in
the organization at the time. However, they provided little concrete justification
for the one-year separation between data collections. Did trust change more quickly or more slowly than these one-year intervals could capture? The authors recognize this limitation in their
discussion in that they state that they would like to assess these changes in shorter
time periods. There is little theoretical or empirical justification (maybe nothing)
that indicates when such changes might occur. Most data on these relationships
are cross sectional, but the authors do cite a meta-analysis of this relationship.
Those meta-analytic data might be analyzed to determine if the time interval em-
ployed in studies moderates the reported effect size. If there is moderation, then an
appropriate time interval might be determined for use in subsequent longitudinal
studies, but to my knowledge, this is rarely if ever done when deciding on the
timing of longitudinal data collections. Barnes, Miller, and Bostock (2017) report
an interesting study on the effect of web-based cognitive behavior therapy on in-
somnia and a variety of workplace outcomes (organizational citizenship behavior,
interpersonal deviance, job satisfaction, negative affect). The researchers hypoth-
esized that the therapy would have effects on workplace outcomes mediated by
insomnia and evaluated these hypotheses with pre-post surveys separated by ten
weeks. There was no mention of the appropriateness of this ten-week interval.
There was support for some of their hypotheses, suggesting a rather short time
frame within which the hypothesized mediation occurred.
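The moderator analysis suggested above could be carried out with a simple weighted meta-regression of study effect sizes on the measurement lag. The sketch below is a minimal fixed-effect version with entirely hypothetical inputs; a random-effects formulation, which adds an estimate of between-study variance, would ordinarily be preferred.

```python
import numpy as np

def meta_regression(effect_sizes, sampling_variances, time_lags_months):
    """Inverse-variance weighted regression of study effect sizes on the time
    lag between measurements. A nonzero slope suggests the lag moderates the
    reported effect (fixed-effect weights for brevity)."""
    y = np.asarray(effect_sizes, dtype=float)
    w = 1.0 / np.asarray(sampling_variances, dtype=float)
    X = np.column_stack([np.ones_like(y), np.asarray(time_lags_months, dtype=float)])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))  # WLS normal equations
    return {"intercept": beta[0], "slope_per_month": beta[1]}

# Hypothetical inputs: correlations from primary studies, their sampling
# variances, and the number of months separating the two measurements.
print(meta_regression([.32, .28, .18, .10], [.004, .006, .005, .008], [1, 3, 12, 24]))
```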
Grand (2017) provides a computational model of the effects of stereotype
threat during training and turnover on employee performance potential over time.
This is a very interesting and thorough analysis of what happens over time in the
presence of stereotype threat based on realistic parameters. The analyses show
the usual asymptote of employee learning with the negative impact of stereo-
type threat remaining over time as a function of turnover in trained employees.
However, there was no mention of the time interval over which these processes
unfold, though I assume it could be shorter or longer depending on the time it takes
employees to reach their full potential.
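To illustrate why the unstated time horizon matters, the following is a deliberately stylized simulation, not Grand's (2017) model: employees learn toward an asymptote, stereotype threat slows learning, and turnover replaces trained employees with untrained newcomers. All parameter values are hypothetical; the point is simply that the size of the group difference one observes depends on how many periods the process is allowed to run.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_unit(periods, threat_penalty=0.0, turnover_rate=0.05,
                  n_employees=100, learning_rate=0.15):
    """Stylized simulation: knowledge approaches an asymptote of 1.0,
    stereotype threat lowers the effective learning rate, and turnover
    replaces trained employees with untrained ones. Returns the unit's
    mean knowledge at each period."""
    knowledge = np.zeros(n_employees)
    means = []
    for _ in range(periods):
        knowledge += (learning_rate - threat_penalty) * (1.0 - knowledge)
        leavers = rng.random(n_employees) < turnover_rate
        knowledge[leavers] = 0.0          # leavers replaced by untrained newcomers
        means.append(knowledge.mean())
    return np.array(means)

no_threat = simulate_unit(52)
threat = simulate_unit(52, threat_penalty=0.05)
print(no_threat[-1] - threat[-1])  # group gap after 52 (hypothetical) weekly periods
```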
Perhaps most illustrative of a lack of consideration of the timing of data col-
lection is a study by Deng, Walter, Lam, and Zhao (2017) that just appeared in
Personnel Psychology. These authors studied the effect of emotional labor on ego
depletion and the treatment of customers. Data were collected in two surveys two
months apart. There was no mention of the appropriateness of this time interval
and, equally problematic, the possibility that job tenure might play a role in this
process was not considered.
These are all excellent studies, but in each case, the time periods studied are
not discussed (the exception was the Kaltiainen et al. study, in which the authors cited the
lack of more frequent measurement as a limitation). Time must be considered if
we are to discover and adequately estimate the underlying processes we seek to
explain. To underscore this issue, I examined the articles published in the last year
in two major journals (Journal of Applied Psychology and Personnel Psychology)
and the last three issues of the 2017 volume of the Academy of Management Journal.
The shorter time frame for Academy of Management Journal was used because
more papers were published in AMJ and more involved longitudinal designs in
which time of data collection was a potential concern.
Table 2.4 contains a brief description of the 46 studies reviewed in these three
journals including the major hypotheses evaluated, the time interval between data
collections, support for the hypothesized effects, and any discussion of time. In
about half of these studies (N = 22), there was no discussion of the role that time
might have played in the study results or whether the timing of multiple data col-
lections was appropriate. In some of these studies, the variables studied might not
have been sensitive to the precise time of data collection or the time interval rep-
resented a reasonable space within which to expect an effect to occur (e.g., effect
of socialization during probationary period). However, in most of these 22 cases,
it would not be hard to assert that the time of measurement was a critical factor in
finding or estimating the effect of some process (e.g., leader personality affecting
leader performance), yet it was not mentioned in the description of the research.
In those studies in which time was mentioned, it was almost always mentioned
as a limitation of the study, sometimes with the suggestion that future research
consider the importance of data collection timing. In one study in Personnel Psy-
chology, there was an extensive discussion of the socialization process investi-
gated and why the timing of data collections was appropriate.
A very large proportion of the papers published in Academy of Management
Journal (AMJ) were longitudinal and many involved the use of archival data that
occasionally spanned one or more decades. In some of the archival studies, data
were collected for many time periods, virtually assuring that any time-related effects
would be observed. Like the other two journals, however, 7 of the 16 AMJ papers
did not discuss the importance of time when it seemed to me that it should have
been a relevant issue. The relatively greater concern with time in papers pub-
lished in Academy of Management Journal may be a function of what seems to
be a greater emphasis on theory in that journal. This theoretical emphasis should
produce a concern that measurement time periods coincide with the theoretical
processes being studied.
In none of the papers mentioned in Table 2.4 was the timing of the first data
collection described. When studying a work-related issue, it seems that the first
data collection should occur at employment or immediately before or after an
TABLE 2.4. Longitudinal Research Designs in Articles Published Recently in Major Journals

Journal Hypothesized Effects Time Interval Hypo. Support & Discussion or rationale for Time Interval
JAP Morning rudeness>task perf. & goal Nine hours Rudeness affected all four outcomes. Hypo. Restricted to morning
progress & interaction avoidance & Psych. rudeness, but possibility of buildup or crossover effects are
Withdrawal recognized
JAP Role conflict>emotional exhaustion Six months Time interval = probation period. Role conflict>exhaustion
moderated by helping (socialization) moderated by type of help provided to newcomers
JAP Assessment center feedback>self- First stage was 2.4 years after Hypotheses were supported. No discussion of the timing of data
efficacy>feedback seeking>career outcomes feedback; second stage was 15 collection. Times are averages across participants
years later
JAP Team charter and team conscientiousness 10 weeks Hypothesis supported; no discussion of time.
lead to task cohesion and team performance
JAP Political behavior>task performance Two months separating each of All four hypotheses were supported. No discussion of time interval.
mediated by emotional exhaustion and three surveys
psychological empowerment including
moderator effects of political behavior on
exhaustion
JAP Study 1:Intrinsic motivation>organizational Six months Supported; no discussion of time interval
identification Support was found for the first link in the hypothesized sequence and
Study 2: Need fulfillment>intrinsic Three stage with 4 weeks partial support for the mediation hypothesis. No discussion of time
motivation>organizational identification intervening interval or extent of previous experience
JAP Job control & task-related stressor and Five times over 10 years, but In a general sense, hypotheses were confirmed. Data collection times
social stressors>health and well-being mid-point varied and last data were discussed and early periods were defended on the notion that
collection was six years after the was when most job stress would occur.
fourth period
JAP Unethical behavior, supervisor bottom line Six months and two weeks Unethical behavior. Shame; shame>exemplification; supervisor BLM
orientation, and shame>exemplification moderated the latter relationship. Time issue was discussed
behavior
JAP Work demands>unhealthy eating buffered Morning noon and evening Job demand>unhealthy eating in the evening and the interaction
by sleep and mediated by self-regulations of fifteen days. In a second of job demands and sleeping was significant. Negative customer
study, four daily surveys were interaction>negative mood>unhealthy eating.
administered for four weeks. Various points in a day were sampled; no discussion of multi-day
effects.
JAP Team voice>team innovation & team 6–8 weeks after teams started and Promotive perf.>productivity and prohibitive perf.>safety. Promotive
monitoring>productivity and safety three months later perf.>innovation>perf. gains
Prohibitive perf.>monitoring>safety gains. Timing of meas. was
recognized as limitation
JAP Study 1: Intercultural dating>creativity 10 months Hypothesis supported. No mention of timing.
JAP Distance and velocity 45 minute experiment Disturbances both affected frustration and enthusiasm, but velocity
disturbances>enthusiasm and had longer term effects—authors mentioned the limiting effect of
frustration>goal commitment, effort and time on the result
perf.
JAP Intraindividual increases in org. One year pre- and post-merger Mixed support for hypotheses. Authors emphasized the need to
valence>org. identification>job sat & intent collect data at multiple time points, but did not discuss the time
to stay and personal valence constr.>org interval between data collections
identification>job satisfaction and intent
to stay
JAP Leader extraversion, agreeableness, & Three months Partial support for hypotheses. No discussion of the time interval
conscientiousness>team potency belief separating data collection
and identification w. ldr>Performance
moderated by power distance
JAP Work engagement>work-family Work Engagement collected at Mediated effects were supported. Authors did discuss the problem of
Interpersonal capitalization>family work but mediator and outcomes simultaneous collection of mediator and outcome data.
satisfaction and work-family balance collected at the same time
JAP Participation in job crafting intervention>crafting toward interests and strength>person-job fit 8 weeks Major mediation hypothesis unsupported. No discussion of timing.
JAP Newcomers’ task and social info. 1 week between each of four data Most hypotheses were supported. Discusses lack of true longitudinal
Seeking>Mgrs. Perceptions of newcomer collections design.
commitment to task master and social
adjustment >mgrs. Provision of
help>outcomes
JAP Process justice & cognitive trust are Data collected over two years Hypotheses confirmed. Data collections tied to specific changes
reciprocally related through three stages of and tied to specific company hypothesized to result from merger. Discussed need to estimate
a merger changes relationships in a shorter time frame.
JAP Trust in direct ldrs.>direct ldr procedural Three months Trickle model supported—direct ldr. trust leads to top ldr. trust
justice>trust in top ldrs. & performance. mediated by direct ldr procedural justice. No discussion of length of
Relationships moderated by vertical time interval between data collections
collectivism.
JAP Recruitment source timing and Time between receipt of Time was the major variable studied and it was related to human
diagnosticity>human capital information on jobs and capital. Attribution is that students developed skills relevant to
recruitment varied specific jobs.
JAP High performance leads to supportive or Eight weeks Hypotheses were supported, but there was no mention of the time
undermining behavior by peers mediated by interval
peers’ perceived benefit or threat
PPsych Interaction of Job demands and control > Seven years Hypothesis was supported and there was a lengthy discussion of the
death implications of end-of-career data collection
PPsych Ambient discrimination > mentoring 4 weeks Not seen as a longitudinal study; time difference was used to control
> organizational commitment, strain, for common method variance
insomnia, absenteeism. Mentoring activities
moderated the discrimination—outcome
relationship
PPsych Culture beliefs > intercultural sensitivity Time 2 data collected six months Data were collected before, during and after a program so the timing
rejection > cross-cultural adjustment after program entry and a third of data collection spanned the totality of the participants’ experience.
wave 3 months later Hypotheses were supported.
PPsych Job challenge and developmental Two months Mixed support and recognition of the lack of truly longitudinal
experiences > leader self-efficacy and design
mngrs. network > promotability and leader
effectiveness
PPsych LMX > higher salaries & responsibility in 18 months Hypotheses were supported. No mention of time interval but it seems
subsequent jobs as well as alumni goodwill. appropriate.
PPsych Emotional labor (surface and deep acting) > Two months Mention that the two month interval may have been too long thereby
ego depletion > coworker harming reducing magnitude of expected relationships.
PPsych Vertical access, horizontal tie strength and Time 1 (2 months before org. Extensive discussion of socialization and timing of surveys.
core self-evaluation > newcomer learning entry, Time 2 (6 months later) Vertical access and core self-evaluations were related to outcomes;
and organizational identification and Time 3 (two months after horizontal tie strength was not. Three-way interaction related to 3 of
Survey 2 4 outcomes.
PPsych Customer mistreatment > Negative mood > Daily before and after the closing Hypothesized indirect effect supported. Daily data collection
employees’ helping behavior of restaurants where participants consistent with hypotheses.
worked.
PPsych Group cohesiveness will moderate OCBI Wave 1 followed by Wave 2 three All hypotheses were confirmed. No discussion of the timing of data
and OCBO and self-efficacy change and months later and a third wave collection.
mediation against job performance after another 3 months
AMJ Employee identification > Use of voice Two months Support was found for the hypothesized mediation, but limitation of
regarding work > managers’ valuation of data collection timing was discussed
voice
AMJ Company policies and employee passion 4 weeks Support for hypotheses but no discussion of timing of measurement
for volunteering > corporate volunteering
climate > Volunteering intentions and behavior
AMJ Pay for performance > individual Monthly performance for four Supported – no discussion of time period, but likely not needed
performance years
AMJ Identity conflict and identity enhancement > 4 months Intrinsic motivation mediator supported; perspective taking
intrinsic motivation and perspective taking unsupported. Timing of data collection mentioned as a study
> performance limitation
AMJ CEO Power > board-chair separation and Ten years Hypotheses supported; no discussion of timing of data collection.
lead independent director appts.
AMJ Team based empowerment > team 7 months before intervention and Hypotheses supported; time was sufficient for intervention to effect
effectiveness moderated by team leader 37 months after outcomes
status
AMJ Follower’s dependence on leader > Abusive Three waves of data collection Timing of data collection matched followers’ performance reviews.
supervision time 2 > abusive supervision separated by 4 weeks Hypotheses supported in two studies
and reconciliation time 3 moderated by
follower’s behavior
AMJ Social networks > information and Yearly over 7 years Specifically hypothesized that effects would increase with time.
communication technology use > When ICT use and family and community centrality were high
entrepreneurial activity and profit entrepreneurial activity increased with time.
AMJ Top executive humility > Middle manager 1 year Hypotheses were supported. No mention of time interval
job satisfaction > middle manager turnover
moderated by top mngmt. faultlines
AMJ Donors contributions > peer recognition 7 years Hypotheses supported. No discussion of the time period over which
of Russian theatres moderated by depth of data were collected
involvement of external stakeholders
AMJ Economic downturns > Zero-sum construal 17 years First step of causal sequence was confirmed by longitudinal data;
of success > workplace helping second step by experiment
AMJ Supervisor liberalism > performance-based 25 years Hypothesis supported even after control variable are considered. No
pay gap between gender groups discussion of time period.
AMJ Daily surface acting at work > emotional Daily surveys for five days Hypotheses supported with giving help being a significant
exhaustion > next day work engagement moderator. No mention of time
moderated by giving and receiving help
AMJ Subordinate deviance > supervisor self Two weeks in Study 1; two to Indirect effect for self-regulation was supported, but not the indirect
-regulation / social exchange > abusive four weeks in Study 2 effect for social exchange. Emphasized their use of a cross-lagged
supervision research design, but did not discuss timing of data collection.
AMJ Risk aversion > guanxi activities Cross-sectional survey No discussion of timing of data collection, but hypothesis supported
AMJ Team commitment and organizational Experiment and survey with no Mixed support in the survey replication of an experiment. No
commitment > Dominating, Integrating, time interval mention of time.
Obliging, Avoiding conflict strategies
important intervention that is the study focus. This was the case in some of the
papers, but very often the timing of initial or subsequent data collection appeared
to be a matter of convenience (e.g., every two months or every four weeks). On a
positive note, it seems that a very large proportion of the papers, particularly in
AMJ, were longitudinal. This was clearly not the case a couple of decades ago.
It should also be noted that the data provided in Table 2.4 are partly a result
of one reader’s interpretation of the studies. In some of these studies, the authors
may argue that time was considered, and/or it was irrelevant.
It is also the case that most studies employing longitudinal designs are in-
stances of quasi-experimentation, hence the causal inferences derived from these
studies are often problematic (Shadish, Cook, & Campbell, 2002). These studies
are almost always designed to test mediation hypotheses using hierarchical
regression or SEM. Such analyses often provide a poor basis for making causal
inferences, even though authors frequently imply, directly or indirectly, that they
provide support for causal hypotheses. These inference problems and potential
solutions have been described in a series of papers by Stone-Romero and Rosopa
(2004, 2008, 2011). They make the case that causal inferences are not justified
when the data are not generated by an experimental design that manipulates both
the independent and mediator variables. Like earlier authors (e.g., James, Mulaik,
& Brett, 2006), they point out that SEM findings (and analyses using hierarchical
linear regression) may support a model, but that other models that include a
different causal direction or unmeasured common causes may also be consistent
with the data.
A longitudinal design in which the timing of data collection is theoretically and/or
empirically justified would seem to obviate at least the problem of misspecified
causal direction. Given the importance of the time ordering of the independent,
mediator, and outcome variables, as argued above, it is surprising that Wood,
Goodman, and Cook (2008) found that only 11% of the studies in their review of
mediation research incorporated time ordering. Their results are consistent with
the data in Table 2.4. The decade since the Wood et al. review has produced very
little change in longitudinal research; even when data are collected at multiple
points in time, there is little or no justification of the time points selected. Those
conducting longitudinal research are missing an opportunity to provide stronger
justification for causal inferences when they fail to design their research with
careful consideration of time (Mitchell & James, 2001).
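When the timing of measurements can be justified, an indirect effect can at least be estimated from time-ordered data, as in the sketch below (a simple product-of-coefficients estimate with a percentile bootstrap; variable names are hypothetical). As Stone-Romero and Rosopa (2004, 2008, 2011) argue, however, time ordering alone does not rule out unmeasured common causes, so such estimates still fall short of the causal evidence an experimental design would provide.

```python
import numpy as np

def indirect_effect(x_t1, m_t2, y_t3, n_boot=2000, seed=0):
    """Product-of-coefficients estimate of an indirect effect for variables
    measured at three time points (X at T1, M at T2, Y at T3), with a
    percentile bootstrap confidence interval. Time ordering constrains the
    direction of the estimated paths but does not rule out unmeasured
    common causes."""
    rng = np.random.default_rng(seed)
    x, m, y = (np.asarray(v, dtype=float) for v in (x_t1, m_t2, y_t3))
    n = x.size

    def a_times_b(idx):
        a = np.polyfit(x[idx], m[idx], 1)[0]                   # path X(T1) -> M(T2)
        design = np.column_stack([np.ones(idx.size), m[idx], x[idx]])
        b = np.linalg.lstsq(design, y[idx], rcond=None)[0][1]  # path M(T2) -> Y(T3), controlling X
        return a * b

    boots = np.array([a_times_b(rng.integers(0, n, n)) for _ in range(n_boot)])
    return a_times_b(np.arange(n)), np.percentile(boots, [2.5, 97.5])
```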

ESTIMATES OF EFFECT SIZE


Our sophisticated data analyses often do not provide an index of what differ-
ence the results of a study might make in important everyday outcomes or deci-
sion making. We have gotten good at reporting d statistics and we use the Cohen
(1977) guidelines for small, medium, and large effect sizes. The adequacy of Cohen's d,
and of similar statistical indices, as a means of communicating effect size was
identified as an urban legend in a recent book (Cortina & Landis, 2009).
As did Cohen, these authors point to the context of the research as an important
factor in presenting and interpreting effect sizes. An effect size of .1 is awfully
important if the outcome predicted is one’s life. It might not be that impressive
if it is one’s level of organizational commitment (my apology to those who study
organizational commitment). They also point to the strength (or lack thereof) of
the research design that produces an effect. If an effect can be produced by an easy,
minimal intervention, it should be less readily dismissed as unimportant. If one needs a
sledgehammer manipulation to produce an effect, the effect is probably not all that practically
important. Perhaps combining both of these ideas, Cortina and Landis describe the finding that
taking aspirin accounts for 1/10 of one percent of the variance in heart attack
occurrence, but such a small intervention with such an important outcome makes it
a significant effect (in my opinion, and it seems in the opinion of the medical profession as well).
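For readers who want the arithmetic behind such statements, the sketch below computes d from group statistics, converts a correlation to d, and reproduces the aspirin figure cited above (one tenth of one percent of variance corresponds to a correlation of roughly .03). The functions are generic illustrations, not tied to any of the studies reviewed here.

```python
import math

def cohens_d(mean_1, mean_2, sd_1, sd_2, n_1, n_2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n_1 - 1) * sd_1**2 + (n_2 - 1) * sd_2**2) / (n_1 + n_2 - 2))
    return (mean_1 - mean_2) / pooled_sd

def r_to_d(r):
    """Convert a correlation to d (assumes two groups of equal size)."""
    return 2 * r / math.sqrt(1 - r**2)

# The aspirin example: one tenth of one percent of variance explained.
r = math.sqrt(0.001)
print(round(r, 3), round(r_to_d(r), 3))   # correlation of about .032, d of about .06
```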
JAP does require that a section of the discussion be devoted to the theoretical and
practical significance of the study results, and most articles in other journals include
such a section as well. However, this often appears to be a pro forma satisfaction of a publica-
tion requirement. Moreover, as mentioned above, many of our sophisticated data
analyses do not translate into an effect size. Even when they do, unless these d
statistics or similar effect sizes are in a metric related to organizationally or soci-
etally important outcomes, they are not likely to have much influence on decision
makers. It is also interesting that the literature on utility analysis (Cascio, 2000), which
was oriented to estimating the effect of various behavioral interventions in dollar
terms, has pretty much faded away.
I also suspect that it would be hard for even a doctoral-level staff person in
an organization to translate the results of a structural equation analysis, a multilevel
analysis, or even stepwise regressions into organizationally relevant metrics.
A good example, and probably an exception, is a paper by Campion, Ployhart, and
Campion (2017), which combined stepwise regression analyses of the impact of various
recruitment practices and sources of occupational information.
The usual regression-based statistics were used to evaluate hypotheses and then
translated into percent passing an assessment of critical skills and abilities under
different recruitment scenarios. This information communicated quite directly
to information users. This would be very important data, for example, for the
military in evaluating the impact of lowering entrance qualifications of military
recruits on subsequent failure rates in training or dropout rates. Incidentally, Cam-
pion et al. also reported the number of applicants who heard about jobs from vari-
ous sources and the quality (in terms of assessed ability) of the applicants.
As for the previous research issues raised in this paper, I reviewed papers pub-
lished in the same three journals (Journal of Applied Psychology, Personnel Psy-
chology, and Academy of Management Journal) to ascertain the degree to which
authors addressed the practical implications of their research in some quantifiable
manner, or in a manner that would allow readers to understand what might or
should be changed in organizational practice to benefit from the study findings.
Since all papers have the potential for practical application, I reviewed only the
last 12 papers published in 2017 in these three journals. In most articles pub-
lished in these three journals, there was a section titled “practical implications.”
I reviewed these sections as well as the authors’ reports regarding their data in
producing Table 2.5.
The table includes a column in which the primary interest of the author(s) is
listed. I then considered whether there was any presentation of a quantitative esti-
mate of the impact of the variables studied on some outcome (organizational or
individual). Most papers presented their results in terms of correlations or mul-
tiple regressions, but many also presented the results of structural equation model-
ing or hierarchical linear modeling. There were only a few papers in which a
quantitative index other than the results of these statistical analyses of the
impact of a set of "independent" variables was presented. These indices were d
(standardized mean difference) or odds ratios. These indices may also be deficient
in that the metric to which they refer may or may not be organizationally relevant.
For example, I might observe a .5 standard deviation increase in turnover intent,
but unless I know how turnover intent is related to actual turnover in a given
circumstance and how that turnover is related to production, profit, or expense of
recruiting and training new personnel, it is not easy to make the results relevant
to an organizational decision maker. Of course, it is also the case that correlations
can be translated to d, that means and standard deviations can be used to compute
d, and that, given appropriate available metrics, these indices can be expressed in
some organizationally relevant metric. However, this was never done in the 36 studies reviewed.
Nearly all authors did make some general statements as to what they believed
their study implied for the organization or phenomena they studied. Abbreviated
forms of these statements are included in the last column of Table 2.5. As men-
tioned above, Journal of Applied Psychology includes a “practical implications”
section in all articles. As is obvious in these statements, authors have given some
thought to the practical implications of their work and their statements relate to a
wide variety of organizationally and individually relevant outcomes. What is not
apparent in Table 2.5 is that these sections in virtually all papers rarely exceeded one
to three paragraphs of material and usually did not discuss how the statements
would need to be modified for use or implementation in a local context.
The utility analyses developed by Schmidt, Hunter, McKenzie, and Muldrow
(1979) and popularized by Cascio (2000) were directed to an expression of study
results in dollar terms. This approach to utility received a great deal of attention
a couple of decades ago, but interest in this approach has waned. Several issues
may have been critical. First, expressing some variables in dollar terms may have
seemed artificial (e.g., volunteering, team-based empowerment, OCBs, rudeness).
Second, calculations underlying utility estimates devolved into some fairly ar-
cane economic formulations (e.g., Boudreau, 1991), which in turn required as-
sumptions that may have made organizational decision makers uncomfortable.
Third, the utility estimates were based on judgments that some decision makers
may have suspected were inaccurate (Macan & Highhouse, 1994) even though
TABLE 2.5. Reports of Practical Impact of Research and Effect Sizes
Journal Nature of Phenomenon Studied Effect Size Estimates Practical Implications Suggested
JAP Workplace gossip No Discussed gossip relationships with workplace deviance and promoting norms for
acceptable behavior
JAP Job insecurity No Risk that job performance and OCB will suffer and intent to leave will increase
JAP Flexible working arrangements No Improve employees’ wellbeing and effectiveness. Flextime should be accompanied by
some time structuring and goal setting
JAP Environmental and climate change Odds ratios Self-concordance of goals and climate change were related to petition signing behavior
and intentions to engage in sustainable climate change behavior
JAP Insomnia No Treatment for insomnia had positive effects on OCB and interpersonal deviance
JAP Stereotype threat, training, and performance Yes—d Stereotype effect learning which has implications for human potential over time
potential
JAP Snacking at work No Individual, organizational, and situational factors affect what employees eat.
Organizations should promote healthy organizational eating climate.
JAP Customer behavior and service incivility No Verbal aggression directed to an employee and interruptions lead to employee incivility
JAP Perceptions of novelty and creativity No Organizations should encourage creativity and innovation and use employees with
promotion focus to identify good ideas
JAP Authoritarian leadership No Negative effects of authoritarian leadership on performance, OCB and intent to stay
moderated by power distance and role breadth self-efficacy
JAP Gender transition and job attitudes and No Gender transition related to job satisfaction, person-organization fit and lower perceived
experiences discrimination. Organizations should promote awareness and inclusivity
JAP Gender and crying No Crying was associated with lower performance and leader evaluations for males. Men
should be cautious in emotional expression.
PPsych Work demands, job control, and mortality Odds ratios Job demands and job control interacted to produce higher mortality. Organizations should
seek to increase employee control over job tasks.
PPsych Work family balance No Practices that promote balance satisfaction and effectiveness may enhance job attitudes
and performance.
PPsych Mentoring as a buffer against discrimination No High quality formal and informal mentoring relationships that offer social support reduce
the negative impact of racism and lead to a number of positive job outcomes.
PPsych Cultural intelligence No Provision of experiences that foster social adjustment increase benefits derived from
international experiences
PPsych Role-based identity at work No Provides role-based identity scales and suggests that employees who assume too many
roles may experience burnout.
PPsych Leader development No Combinations of developmental exercises: formal training, challenging responsibilities,
and developmental supervision best in developing leaders.
PPsych LMX leadership effects LMX quality relationships are related to career progress in new organizations and alumni
good will. Orgs. should promote internal job opportunities
PPsych Status incongruence and the impact of No Organizations should consider training employees on the biases faced by women in
transformational leadership leadership roles.
PPsych Emotional labor in customer contacts No Organizations should promote deep acting rather than surface acting in service employees
to prevent harming behavior to clients and coworkers.
PPsych Newcomer adjustment No Organizations should tailor their approach to newcomer socialization to individual needs.
PPsych Training transfer No Expectations regarding transfer of training should take account of different learning
trajectories and opportunities to perform.
PPsych Family role identification and leadership No Organizations and individuals should promote family involvement as these activities
enhance transformational leadership behavior
AMJ Curiosity and creativity No Study offers suggestions as to how to provide feedback and that curiosity be considered
when selected people into “creative” jobs. Creative workers must have time to consider
revisions.
AMJ Ambiguity in corporate communication in Likelihood of Use vague language in annual reports to reduce competitive entry in your market
response to competition competitive
actions
AMJ Value of voice No Exercise of voice should be on issues that are feasible in terms of available resources.
Speaking up on issues that are impossible to address will have negative impact on the
manager and employee
AMJ Pay for performance No Employees indebted to a pay for performance plan will react positively to debt
forgiveness but only in the short term.
AMJ Identity conflict and sales performance d of selling Managers can influence performance by reducing role conflict and increasing identity
intention enhancement.
AMJ Board director appt and firm performance No CEO and boards can be balanced in terms of power and this likely leads to positive firm
level outcomes.
AMJ Team-based empowerment Percentage of High status leaders struggle with team-based empowerment and specific leader behaviors
same day appt. facilitate or hinder delegation
requests
AMJ Abusive supervision No Provides strategies for abused followers to reconcile with an abusive supervisor.
Organizations should encourage leaders and followers to foster mutual dependence.
AMJ Entrepreneurs’ motivation shapes the No Describes the process of organizing new firms and whether founders remain till the firm
characteristics and strategies of firms becomes operational or leave
AMJ Innovation and domain experts No Experts are useful in generating potential problem solutions, but may interfere in
selecting the best solution
AMJ Volunteering climate No Fostering collective pride about volunteering leads to affective commitment and to
volunteering intentions.
AMJ Women entrepreneurs in India Odds ratios Community and social networks lead to entrepreneurial activity and profit moderated by
and profit in information and technology use
rupees
the consistency across judges was usually quite acceptable (Hunter, Schmidt, &
Coggin, 1988). Finally, some estimates were so large (e.g., Schmidt, Hunter, &
Pearlman, 1982) and the vagaries of organizational life so unpredictable (Teno-
pyr, 1987) that utility estimates were rarely realized.
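The core of that approach was the Brogden-Cronbach-Gleser utility model that Schmidt et al. (1979) operationalized. A minimal sketch appears below; it omits the later financial adjustments (e.g., those discussed by Boudreau, 1991), assumes top-down selection from a normally distributed applicant pool, and uses entirely hypothetical input values.

```python
from scipy.stats import norm

def selection_utility(n_hired, tenure_years, validity, sd_y_dollars,
                      selection_ratio, cost_per_applicant):
    """Basic Brogden-Cronbach-Gleser utility estimate: the dollar gain expected
    from using a valid predictor, minus testing costs."""
    z_cut = norm.ppf(1 - selection_ratio)
    mean_z_hired = norm.pdf(z_cut) / selection_ratio   # mean standardized score of those hired
    n_applicants = n_hired / selection_ratio
    gain = n_hired * tenure_years * validity * sd_y_dollars * mean_z_hired
    return gain - n_applicants * cost_per_applicant

# Hypothetical inputs: 50 hires, 3-year average tenure, validity of .30,
# SDy of $15,000, selection ratio of .25, and $50 per applicant to test.
print(round(selection_utility(50, 3, .30, 15_000, .25, 50)))
```

Estimates of this size, produced from a handful of judgment-based inputs, are exactly what made decision makers skeptical, as the preceding paragraph notes.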
It appears that HR personnel are facing a similar set of “so what” questions as
they attempt to make sense of the Big Data analyses that are now possible and
increasingly common. Angrave et al. (2016) report that HR practitioners who are
faced with these data are enthusiastic but feel no better informed about how to put
them into practice than they were before. This seems to be the situation that those
working on utility analyses confronted in the 1980s and 1990s. Although many
organizations have begun to engage with HR data and
analytics, most seem not to have moved beyond operational reporting. Angrave
et al. assert that four items are important if HR is to make use of Big Data analyt-
ics. First, there must be a theory of how people contribute to the success of the
organization. Do they create, capture, and/or leverage something of value to the
organization and what is it? Second, the analyst needs to understand the data and
the context in which it is collected to be able to gain insight into how best to use
the metrics that are reported. Third, these metrics must help identify the groups
of talented people who are most instrumental in furthering organizational performance.
Finally, simple reports of relationships are not sufficient; attention must be
given to the use of experiments and quasi-experiments that show that a
policy or intervention improves performance.

FIGURE 2.1. Example of an Expectancy Chart Reflecting the Relationship between College GPA and Situational Judgment Scores

Perhaps one takeaway or recommendation from this discussion is that authors
use sophisticated statistics to answer theoretical questions and then use descriptive
statistics, including percentages or mean differences in meaningful, organizationally
relevant metrics, to communicate with the consumers of our research. Alternatively,
authors could engage organizational decision makers in translating these simple
statistics into a judgment about practical utility. In this context, perhaps we should
"reinvent" the expectancy tables suggested for this use in early industrial psychology
textbooks (e.g., Tiffin & McCormick, 1965). See Figure 2.1 for an example.
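A minimal sketch of how such an expectancy table might be built appears below. The predictor bands, the success cutoff, and the simulated situational judgment and GPA data are all hypothetical; the point is only that the output is expressed in percentages a decision maker can read directly.

```python
import numpy as np

def expectancy_table(predictor, criterion, success_cutoff, n_bands=5):
    """Percent of people at or above a criterion cutoff within each predictor
    score band (quintiles by default) -- the classic expectancy chart in
    tabular form."""
    predictor = np.asarray(predictor, dtype=float)
    success = np.asarray(criterion, dtype=float) >= success_cutoff
    edges = np.quantile(predictor, np.linspace(0, 1, n_bands + 1))
    band = np.digitize(predictor, edges[1:-1], right=True)
    rows = []
    for b in range(n_bands):
        pct = 100 * success[band == b].mean()
        rows.append((round(float(edges[b]), 1), round(float(edges[b + 1]), 1), round(pct, 1)))
    return rows  # (band low, band high, % "successful")

# Hypothetical, simulated data loosely mirroring Figure 2.1.
rng = np.random.default_rng(7)
sjt = rng.normal(50, 10, 500)                              # situational judgment scores
gpa = 3.0 + 0.02 * (sjt - 50) + rng.normal(0, 0.35, 500)   # college GPA, modestly related to SJT
for low, high, pct in expectancy_table(sjt, gpa, success_cutoff=3.0):
    print(f"SJT scores {low}-{high}: {pct}% earned a GPA of 3.0 or higher")
```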

SUMMARY AND CONCLUSIONS


In conclusion, there is much about which to congratulate ourselves: contributions
to our science have produced an explosion of quantitative analysis techniques
that help us understand the data we collect and help educate our colleagues on
their use and appropriate interpretation. While continuing our work in these areas,
I also think we should pay close attention to measurement issues, to research
design concerns (particularly in the context of longitudinal efforts), and to our
ability to communicate our results in convincing ways to those who consume our
research. These points have all been made by others, but they remain issues with
which we must grapple. It may be time that editors and reviewers require that
researchers present more information on the content of measures and the validation
of those measures; that authors who investigate some process in longitudinal
research explain why data were or were not collected at certain time points; and
that authors provide indices of what impact the results of their research might have
on organizational outcomes.

REFERENCES
Angrave, D., Charlwood, A., Kirkpatrick, I., Lawrence, M., & Stuart, M. (2016). HR and
analytics: Why HR is set to fail the big data challenge. Human Resource Manage-
ment Journal, 26, 1–12.
Barnes, C. M., Miller, J. A., & Bostock, S. (2017). Helping employees sleep well: Effects
of cognitive behavior therapy for insomnia on work outcomes. Journal of Applied
Psychology, 102, 104–113.
Boudreau, J. W. (1991). Utility analysis for decisions in human resource management. In
M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational
psychology: Vol 2. (pp. 621–746). Palo Alto, CA: Consulting Psychologists Press.
Campion, M. C., Ployhart, R. E., & Campion, M. A. (2017). Using recruitment source
timing and diagnosticity to enhance applicants’ occupation-specific human capital.
Journal of Applied Psychology, 102, 764–781.
Cascio, W., & Boudreau, J. (2011). Investing in people: The financial impact of human
resource initiatives (2nd ed.). Upper Saddle River, NJ: Pearson.
Cascio, W. F. (2000). Costing human resources: The financial impact of behavior in orga-
nizations. Cincinnati, OH: Southwestern.
Chen, G., Ployhart, R. E., Cooper-Thomas, H. D., Anderson, N., & Bliese, P. D. (2011).
The power of momentum: A new model of dynamic relationships between job satisfaction
changes and turnover intentions. Academy of Management Journal, 54, 159–181.
Clark, L. A. (2006). When a psychometric advance falls in the forest. Psychometrika, 71,
447–450.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York, NY:
Academic Press.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applica-
tions. Journal of Applied Psychology, 78, 98–104.
Cortina, J., Sheng, A., List, S. K., Keeler, K. R., Katell, L. A., Schmitt, N., Tonidandel, S.,
Summerville, K., Heggestad, E., & Banks, G. (under review). Why is coefficient
alpha? A look at the past, present, and (possible) future of reliability assessment.
Journal of Applied Psychology.
Cortina, J. M., Aguinis, H., & DeShon, R. P. (2017). Twilight of dawn or of evening? A
century of research methods in the Journal of Applied Psychology. Journal of Ap-
plied Psychology, 102, 274–290.
Cortina, J. M., & Landis, R. S. (2009). When small effect sizes tell a big story, and when
large effect sizes don’t. In C. E. Lance & R. J. Landis (Eds.), Statistical and method-
ological myths and urban legends: Doctrine, verity, and fables in the organizational
and social sciences. New York, NY: Routledge.
Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and
successor procedures. Educational and Psychological Measurement, 64, 391–418.
Deng, H., Walter, F., Lam, C. K., & Zhao, H. H. (2017). Spillover effects of emotional la-
bor in customer service encounters toward coworker harming: A resource depletion
perspective. Personnel Psychology, 70, 469–502.
DeShon, R. P. (2013). Inferential meta-themes in organizational science research: Causal
research, system dynamics, and computational models. In N. Schmitt & S. High-
house (Eds.), Handbook of psychology Vol. 12: Industrial and organizational psy-
chology (pp. 14–42.). New York, NY: Wiley.
Dudycha, A. L., & Carpenter, J. B. (1973). Effect of item format on item discrimination and
difficulty. Journal of Applied Psychology, 58, 116–121.
Dunn, T. J., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solu-
tion to the pervasive problem of internal consistency estimation. British Journal of
Psychology, 105, 399–412.
Edwards, A. L. (1957). Techniques of attitude scale construction. New York, NY: Appleton-
Century-Crofts.
Fried, E. I., & Flake, J. K. (2018). Measurement matters. Observer, 31, 29–31.
Golubovich, J., Grand, J. A., Ryan, A. M., & Schmitt, N. (2014). An examination of com-
mon sensitivity review practices in test development. International Journal of Se-
lection and Assessment. 22, 1–11.
Grand, J. A. (2017). Brain drain? An examination of stereotype threat effects during train-
ing on knowledge acquisition and organizational effectiveness. Journal of Applied
Psychology, 102, 115–150.
Harlow, L. L., & Oswald, F. L. (2016). Big data in psychology: Introduction to the special
issue. Psychological Methods, 21, 447–457.
Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items.
Applied Psychological Measurement, 9, 139–164.
Hunter, J. E., Schmidt, F. L., & Coggin, T. D. (1988). Problems and pitfalls in using capital
budgeting and financial accounting techniques in assessing the utility of personnel
programs. Journal of Applied Psychology, 73, 522–528.
Ilgen, D. R., & Hulin, C. L. (Eds.). (2000). Computational modeling of behavioral pro-
cesses in organizational research. Washington, DC: American Psychological As-
sociation Press.
James, L. R., Mulaik, S. A., & Brett, J. M. (2006). A tale of two methods. Organizational
Research Methods, 9, 233–244.
Kaltiainen, J., Lipponen, J., & Holtz, B. C. (2017). Dynamic interplay between merger
process justice and cognitive trust in top management: A longitudinal study. Journal
of Applied Psychology, 102, 636–647.
Klein, K. J., & Kozlowski, S. W. J. (Eds.) (2000). Multilevel theory, research and methods
in organizations. San Francisco, CA: Jossey-Bass.
Macan, T. H., & Highhouse, S. (1994). Communicating the utility of human resource ac-
tivities: A survey of I/O and HR professionals. Journal of Business and Psychology,
8, 425–436.
Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternate
fit indices in tests of measurement invariance. Journal of Applied Psychology, 93,
568–592.
McNeish, D. (2018). Thanks coefficient alpha, we'll take it from here. Psychological Methods,
23, 412–433.
Mitchell, T. R., & James, L. R. (2001). Building better theory: Time and the specification of
when things happen. Academy of Management Review, 26, 530–547.
Murphy, K. R., & Russell, C. J. (2017). Mend it or end it: Redirecting the search for in-
teractions in the organizational sciences. Organizational Research Methods, 20,
549–573.
Nye, C. D., & Drasgow, F. (2011). Effect size indices for analyses of measurement equiva-
lence: Understanding the practical importance of differences between groups. Jour-
nal of Applied Psychology, 96, 966–980.
Pitariu, A. H., & Ployhart, R. E. (2010). Explaining change: Theorizing and testing dy-
namic mediated longitudinal relationships. Journal of Management, 36, 405–429.
Ployhart, R. E., & Kim, Y. (2013). Dynamic growth modeling. In J. M. Cortina and R. S.
Landis (Eds.), Modern research methods (pp. 63–98). New York, NY: Routledge.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behav-
ioral Research, 47, 667–696.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of
validity generalization. Journal of Applied Psychology, 62, 529–540.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in
personnel psychology: Practical and theoretical implications of 85 years of research
findings. Psychological Bulletin, 124, 262–274.
Schmidt, F. L., Hunter, J. E., McKenzie, R., & Muldrow, T. (1979). Impact of valid se-
lection procedures on workforce productivity. Journal of Applied Psychology, 64,
609–626.
Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1982). Assessing the economic impact of
personnel programs on workforce productivity. Personnel Psychology, 35, 333–347.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8,
350–353.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experi-
mental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s
alpha. Psychometrika, 74, 107–120.
Sonnentag, S., Pundt, A., & Venz, L. (2017). Distal and proximal predictors of snacking at
work: A daily-survey study. Journal of Applied Psychology, 102, 151–162.
Stone-Romero, E. F., & Rosopa, P. J. (2004). Inference problems with hierarchical multiple
regression-based tests of mediating effects. Research in Personnel and Human Re-
sources Management, 23, 249–290.
Stone-Romero, E. F., & Rosopa, P. J. (2008). The relative validity of inferences about
mediation as a function of research design characteristics. Organizational Research
Methods, 11, 326–352.
Stone-Romero, E. F., & Rosopa, P. J. (2011). Experimental tests of mediation models: Pros-
pects, problems, and some solutions. Organizational Research Methods, 14, 631–
646.
Tenopyr, M. L. (1987). Policies and strategies underlying a personnel research program.
Paper presented at the Second Annual Conference of the Society for Industrial and
Organizational Psychology, Atlanta, Georgia.
Tiffin, J., & McCormick, E. J. (1965). Industrial psychology. Englewood Cliffs, NJ: Pren-
tice-Hall.
Vancouver, J. B., & Purl, J. D. (2017). A computational model of self-efficacy’s various
effects on performance: Moving the debate forward. Journal of Applied Psychology,
102, 599–616.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement in-
variance literature: Suggestions, practices, and recommendations for organizational
research. Organizational Research Methods, 3, 4–70.
Walker, D. D., van Jaarsveld, D. D., & Skarlicki, D. P. (2017). Sticks and stones can break
my bones but words can also hurt me: The relationship between customer verbal
aggression and employee incivility. Journal of Applied Psychology, 102, 163–179.
Willett, J. B., & Sayer, A. G. (1994). Using covariance structure analysis to detect corre-
lates and predictors of change. Psychological Bulletin, 116, 363–381.
Wood, R. E., Goodman, J. S., & Cook, N. D. (2008). Mediation testing in management
research. Organizational Research Methods, 11, 270–295.
Zhou, J., Wang, X. M., Song, L. J., & Wu, J. (2017). Is it new? Personal and contextual
influences on perceptions of novelty and creativity. Journal of Applied Psychology,
102, 180–202.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's
ωH: Their relations with each other and two alternate conceptualizations of
reliability. Psychometrika, 70, 1–11.
CHAPTER 3

RESEARCH DESIGN AND CAUSAL INFERENCES IN HUMAN RESOURCE MANAGEMENT RESEARCH

Eugene F. Stone-Romero

The validity of inferences derived from empirical research in Human Resource
Management (HRM), Industrial and Organizational Psychology (I&O), Organi-
zational Behavior (OB) and virtually all other disciplines is a function of such fac-
ets of research design as experimental design, measurement methods, sampling
strategies, and statistical analyses (Campbell & Stanley, 1963; Cook & Campbell,
1976, 1979; Shadish, Cook, & Campbell, 2002; Stone-Romero, 2009, 2010). Re-
search design is “an overall plan for conducting a study” (Stone-Romero, 2010)
that takes into account the factors of internal validity, external validity, construct
validity, and statistical conclusion validity (Shadish et al., 2002).
Unfortunately, the HRM literature is replete with empirical studies that have
highly questionable levels of internal validity, i.e., the degree to which the re-
sults of a research allow for valid inferences about causal connections between
variables. As is detailed below, the validity of such inferences is a function of the
experimental designs used in research. In view of this, the major concern of this
article is experimental design. It determines not only the validity of causal infer-
ences in research aimed at testing model-based predictions, but also the effective-
ness of HRM policies and practices.
In the interest of explicating the way in which experimental design affects the
validity of causal inferences in research, this article considers the following is-
sues: (a) invalid causal inferences in HRM research, (b) the importance of valid
causal inferences in basic and applied research and facets of validity in research, (c)
formal reasoning procedures as applied to the results of empirical research, (d) the
importance of experimental design for valid causal inferences, (e) the settings in
which research is conducted, (f) experimental design options (randomized experi-
mental, quasi-experimental and non-experimental) for research, (g) other research
design issues, (h) overcoming objections that have been raised about randomized
experiments in HRM and related disciplines, and (i) conclusions and recommen-
dations for basic and applied research and editorial policies.

INVALID CAUSAL INFERENCES IN THE HRM LITERATURE


An inspection of the literature in HRM and allied disciplines shows a pervasive
pattern of unwarranted inferences about causal connections between and among
variables. Typically, such inferences are found in reports of the findings of non-
experimental studies. In such studies assumed independent and dependent vari-
ables are measured, correlations between and/or among variables are determined,
and causal inferences are generated on the basis of the observed correlations.
Illustrative of such unwarranted inferences are the very large number of articles in
the HRM literature that (a) have such titles as "The Impact of X on Y," "The Effects
of X on Y," and "The Influence of X on Y," and (b) contain unwarranted inferences
about causal relations between variables. Among the many hundreds of examples of
this are the following. First, on the basis of a non-experimental study of the relation
between job satisfaction (satisfaction hereinafter) and job performance (performance
hereinafter) and a so-called "causal correlational analysis," Wanous (1974) concluded
that the results of his study indicated that performance causes intrinsic satisfaction
and extrinsic satisfaction causes performance. As is explained below, the results of a
causal correlational analysis do not allow for valid inferences about causality. More
generally, the findings of any non-experimental study provide a very poor basis for
justifying causal inferences.
Second, using the findings of a meta-analysis of 16 non-experimental stud-
ies of the relation between job attitudes and performance and a meta-analytically
based regression analysis, Riketta (2008) argued that job attitudes are more likely
to influence performance than the reverse. Both attitudes and performance were
measured. As is detailed below, regression analyses do not afford a valid basis for
causal inferences unless the analyses are based upon research that uses random-
ized experimental designs. Regrettably, Riketta’s study did not use such a design.
Instead, the design was non-experimental, making suspect any causal inferences.
Third, relying on a structural equation modeling (SEM)-based analysis of the
results of three non-experimental studies that examined relations between (a)
core self-evaluations and (b) job and life satisfaction, Judge, Locke, Durham, and
Kluger (1998) concluded that “The most important finding of this study is that
core evaluations of the self have consistent effects on job satisfaction, indepen-
dent of the attributes of the job itself. That is, the way in which people see them-
selves affects how they experience their jobs and even their lives” (p. 30). In view
of the fact that the researchers applied SEM to the findings of non-experimental
studies, causal inferences are not justified.
As is explained in detail below in the section titled “Research Design Op-
tions,” unless studies use randomized experimental designs, causal inferences
are seldom, if ever, justified. Thus, such inferences were unwarranted in the just-
described studies and thousands of other non-experimental studies in the HRM
literature.
The findings of a study by Stone-Romero and Gallaher (2006) illustrate the
severity of causal inference problems in empirical studies in HRM and allied dis-
ciplines. They performed a content analysis of 161 articles that were randomly
sampled from articles published in the 1988, 1993, 1998, and 2003 volumes of
journals that publish HRM-related articles (i.e., Personnel Psychology, Organi-
zational Behavior and Human Decision Processes, the Academy of Management
Journal, and the Journal of Applied Psychology). The studies reported in these
articles used various types of experimental designs (i.e., non-experimental, qua-
si-experimental, and randomized-experimental). The articles were searched for
instances of the inappropriate use of causal language in their title, abstract, and
results and/or discussion sections. The search revealed one or more instances of
unwarranted causal inferences in 79% of the 73 articles that were based on non-
experimental designs, and 78% of the 18 articles that used quasi-experimental
designs. Overall, the analysis of the 161 articles showed that causal inferences
were unwarranted in a very large percentage of the research-based articles.

IMPORTANCE OF VALID INFERENCES IN BASIC AND APPLIED RESEARCH
Valid inferences about causal relations between variables (i.e., internal validity)
are a function of experimental design. The major design options are randomized-
experimental, quasi-experimental, and non-experimental. Convincing inferences
about the degree to which a study’s results generalize to and across populations
(i.e., external validity) are contingent on the way sampling units are selected (e.g.,
random, non-random). Persuasive inferences about the nature of the constructs
dealt with by a study (i.e., construct validity) are conditional on the way that the
study’s variables are manipulated or measured. Finally, valid inferences about
relations between and among variables (i.e., statistical conclusion validity) are
dependent on the appropriateness of the statistical analyses used in a study. Of
the four facets of validity, internal validity is the “sine qua non” in research prob-
ing causal connections between or among variables (Campbell & Stanley, 1963;
Cook & Campbell, 1976, 1979; Shadish et al., 2002). Unless the results of an
empirical study show that an independent variable (i.e., an assumed cause) is
causally related to a dependent variable (i.e., an assumed effect), it is of little con-
sequence that the study has high levels of external, construct, or statistical conclu-
sion validity.

Internal Validity in Basic Research


Internal validity is a crucial issue in both basic and applied research. In basic
(theory testing) research it is vital to know if the causal links posited in conceptual
or theoretical models are supported by research results. For example, if a theory
posits that satisfaction produces changes in performance, research used to test the
theory should show that experimentally induced changes in satisfaction lead to
predicted changes in performance. None of the many hundreds of non-experimental
studies or meta-analyses of this relation have produced any credible evidence of
this causal connection (e.g., Judge, Thoresen, Bono, & Patton, 2001). On the other
hand, there is abundant experimental evidence that demonstrates a causal connec-
tion between training and performance (e.g., Noe, 2017).

Internal Validity in Applied Research


In applied research it is vital to show that an organizational intervention (e.g.,
job enrichment) leads to hypothesized changes in one or more dependent variables
(e.g., satisfaction, employee retention, performance). For example, unless
research can adduce support for a causal relation between the use of a selection
test and employee performance, it would make little or no sense for organizations
to use the test.

FORMAL REASONING AND MODEL TESTING IN RESEARCH


It is instructive to consider model testing research in the context of formal rea-
soning techniques (Kalish & Montague, 1964). They allow one to use symbolic
sentences along with formal reasoning principles to assess the validity of re-
search-based conclusions. Let’s consider this in the context of a simple research-
related example. In it (a) MC stands for “a model being tested is correct,” (b)
RC represents “research results are consistent with the model,” and (c) the symbol ~
denotes negation, so that ~MC means the model is not correct and ~RC means that
the results are not consistent with the model. The general strategy employed in test-
ing models (or theories) is to (a) formulate a model, (b) conduct an empirical
study designed to test it, and (c) use the study’s results to argue that the model is
correct or incorrect. In terms of symbolic sentences, the reasoning that is almost
universally used by researchers is as follows: The researcher (a) assumes that if
the model is correct then the results of research will be consistent with the model,
that is MC → RC, (b) shows RC through empirical research, and (c) uses the RC
finding to infer MC. Unfortunately, however, this reasoning strategy leads to an
invalid inference about MC because it is predicated on the logical fallacy of af-
firming the consequent. The inference is incorrect because it also may be true that
~MC → RC. For example, a researcher (a) assumes a model in which satisfaction
causes performance, (b) conducts an empirical study that shows a .30 correlation
between these variables, and (c) concludes that the model is correct. This is an
invalid inference because the same correlation would have resulted from a model
that posited performance to be the cause of satisfaction. In addition, it could have
resulted from the operation of one or more confounding variables (e.g., worker
pay that is positively contingent on performance). For example, a very creatively
designed randomized experimental study by Cherrington, Reitz, and Scott (1971)
examined the relation between performance and satisfaction. Subjects were random-
ly assigned to one of two conditions. In the first part of the study they performed a
task for one hour. Then, subjects in one condition received rewards that were posi-
tively contingent on performance, whereas in the other rewards were negatively
contingent on performance. Subsequently, they performed the task for another
hour. The researchers found that (a) across reward contingency conditions, there
was no relation between satisfaction and second-hour performance, (b) satisfac-
tion was positively related to second-hour performance for subjects who received
rewards that were positively contingent on performance, and (c) satisfaction was
negatively related to second-hour performance for those who received negatively contingent
rewards. These results are both interesting and important. They show that the sat-
isfaction-performance relation is a function of reward contingency. More specifi-
cally, reward contingency was responsible for the correlation between satisfaction
and productivity. As Cherrington et al. concluded: “Our theory implies no cause-
effect relationship between performance and satisfaction; instead, it stresses the
performance-reinforcing as well as the satisfaction-increasing potential of contin-
gent reinforcers” (p. 535).
It merits stressing that research results that are consistent with an assumed model
(RC) have no necessary implications for its correctness. However, via the inference
rule of Modus Tollens (Kalish & Montague, 1964), a study that showed ~RC would
allow for the logical inference of ~MC; that is, if the study failed to provide support
for the model then the researcher could logically conclude that the model was incor-
rect. Of course, the latter inference would be contingent upon the study having high
levels of both construct validity and statistical conclusion validity.
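To make the contrast explicit, the two argument forms can be written out using the MC/RC notation introduced above:

Affirming the consequent (invalid): MC → RC; RC; therefore MC
Modus Tollens (valid): MC → RC; ~RC; therefore ~MC

Only the second form licenses a conclusion, because RC can obtain even when the model is incorrect (i.e., ~MC → RC may also be true).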

CAUSAL INFERENCES AND EXPERIMENTAL DESIGN


In an empirical study of the relation between X (an assumed cause) and Y (an as-
sumed effect), there are three conditions that are vital to valid causal inferences:
(a) X precedes Y in time, (b) X and Y are related to one another, and (c) there are
no rival explanations of the relation between X and Y (for example, Z is a cause of
both X and Y and there is no causal relation between X and Y). These requirements
are best satisfied in randomized experimental research and are not satisfied as
adequately in either non-experimental or quasi-experimental research, including
longitudinal research (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979;
Shadish et al., 2002; Stone-Romero, 2009, 2010). Research that demonstrates that
X and Y are related satisfies only one such condition. Thus, it does not serve as
a sufficient basis for inferring that X causes Y. As the well-known adage states,
“correlation does not imply causation.”
Assume that a study provides evidence of a correlation between two measured
variables, O1 and O2. As Figure 3.1 indicates, this finding might result from (a) O1
being a cause of O2 (Figure 3.1a); (b) O2 being a cause of O1 (Figure 3.1b); or (c)
the relation between O1 and O2 being a non-causal function of a third unmeasured
variable, O3 (Figure 3.1c). So, evidence that O1 and O2 are correlated is insuffi-
cient to infer that there is a causal connection between these variables.

FIGURE 3.1. Possible causal relations among several observed variables.
Nevertheless, as was noted above, it is quite common for researchers in HRM
and allied disciplines to base causal inferences on evidence of relations between
variables (e.g., an observed correlation between two variables) as opposed to re-
search that uses a sound experimental design. One vivid example of this is the
body of research on the relation between satisfaction and organizational commit-
ment (commitment hereinafter). On the basis of observed correlations between
measures of these two variables and different types of statistical analyses: (a)
some researchers (e.g., Williams & Hazer, 1986) have concluded that satisfac-
tion causes commitment, (b) other researchers (e.g., Bateman & Strasser, 1984;
Koslowsky, 1991; Weiner & Vardi, 1980) have inferred that commitment causes
satisfaction, (c) still others (e.g., Lance, 1991) have reasoned that satisfaction and
commitment are reciprocally related to one another, and (d) yet others have ar-
gued that the relation between satisfaction and commitment is unclear or spurious
(Farkas & Tetrick, 1989).
Another instance of invalid causal inferences relates to the correlation between
job attitudes (attitudes hereinafter) and performance. As noted above, on the basis
of a meta-analysis of the results of 16 non-experimental studies Riketta (2008)
inappropriately concluded that attitudes are more likely to influence performance
than vice versa. The fact that the study was based on meta-analysis does nothing
whatsoever to bolster causal inferences.

RESEARCH SETTINGS
Empirical research can be conducted in what have typically been referred to as “labora-
tory” and “field” settings (e.g., Bouchard, 1976; Cook & Campbell, 1976, 1979;
Evan, 1971; Fromkin & Streufert, 1976; Locke, 1986). However, as John Camp-
bell (1986) and others (e.g., Stone-Romero, 2009, 2010) have argued, the labo-
ratory versus field distinction is not very meaningful. One important reason for
this is that research “laboratories” can be set up in what are commonly referred
to as “field” settings. For example, an organization can be created for the specific
purpose of conducting a randomized-experimental study (Shadish et al., 2002, p.
274). Clearly, such a setting blurs the distinction between so called laboratory and
field research.
To better characterize the settings in which research takes place Stone-Romero
(2009, 2010) recommended that they be categorized in terms of their purpose.
More specifically, (a) special purpose (SP) settings are those that were created for
the specific purpose of conducting empirical research and (b) non-special purpose
(NSP) settings are those that were created for a non-research purpose (e.g., manu-
facturing, consulting, retailing). In the interest of clarity about research settings
the SP versus NSP distinction is used in the remainder of this article.
RESEARCH DESIGN OPTIONS IN EMPIRICAL STUDIES


In designing an empirical study, a researcher is faced with a number of options,
including (a) the type of experimental design, (b) the number and types of partici-
pants, (c) the measures or manipulations of variables, (d) its setting, and (e) the
planned statistical analyses. With respect to experimental design, there are three
general options, i.e., non-experimental, quasi-experimental, and randomized-ex-
perimental (Campbell & Stanley, 1963; Cook & Campbell, 1976, 1979; Shadish
et al., 2002; Stone-Romero, 2010). Table 3.1 provides a summary of the properties
of the design options. Note that experimental design is the major determinant of
the degree to which a study’s results allow for valid causal inferences.
Before describing various designs, a word is in order about the notation used
in the examples that are described below. In these examples the symbol =>
is used to denote implies or signifies, and the research design symbols used
are as follows: (a) X => either an assumed independent variable or the manipu-
lation (treatment) of a variable, (b) ~X => the absence of a treatment, (c) Y =>
an assumed or actual dependent variable, (d) R => random assignment to treat-
ment conditions, (e) ~R => non-random assignment to such conditions, and (f) Oi
=> the operational definition of an assumed independent, mediator, or dependent
variable. Note, moreover, that no distinction is made here between the operational
definition of a construct and the construct itself.

Randomized-Experimental Designs
The simplest method for conducting research that allows for valid causal infer-
ences about the relation between two variables (e.g., X and Y) is a randomized-
experimental study in which (a) X is manipulated at two or more levels, (b) sam-
pling units (e.g., individuals, groups, organizations) are assigned to experimental
conditions on a random basis, and (c) the dependent variable is measured. If there
are a sufficiently large number of sampling units, randomization serves to equate
the experimental conditions, on average, on all
measured and unmeasured variables prior to the manipulation of the independent
variable or variables. As such, randomization rules out such threats to internal
validity as selection, history, maturation, regression, attrition, instrumentation, and
the interactive effects of these threats (Campbell & Stanley, 1963; Cook & Camp-
bell, 1976, 1979; Shadish et al., 2002). In their Table 1, Campbell and Stanley de-
tail the threats to internal validity that are controlled by randomized-experimental
designs. Designs such as the Solomon Four-Group allow the researcher to rule out
virtually all threats to internal validity. As a result, such designs should be used if
causal inferences are important in empirical research.

TABLE 3.1. Attributes of Three General Types of Experimental Designs

                                              Design Type
Attribute               Non-Experimental     Quasi-Experimental   Randomized-Experimental
Independent variable    Measured, assumed    Manipulated          Manipulated
Dependent variable      Measured, assumed    Measured, assumed    Measured
Control of confounds    Typically very low   Low to moderate      Very high
Validity of causal
  inferences            Typically very low   Moderately high      Very high
A study using a randomized-experimental design can test for not only the main
effects of independent variables (e.g., X1, X2, and X3), but also their interactive ef-
fects (X1 × X2, X1 × X3, X2 × X3, and X1 × X2 × X3). Moreover, it may consider their
effects on multiple dependent variables (Y1, Y2, . . . Yj). For example, a study may
assess the effects of a job design manipulation (e.g., autonomy) on such depen-
dent variables as satisfaction, absenteeism, motivation, and turnover.
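To make the logic of random assignment concrete, consider the following minimal simulation sketch (written in Python; the variable names, effect sizes, and the simulated confound are hypothetical and are not drawn from any study discussed here). Because units are assigned to conditions at random, the conditions are equated, in expectation, even on a variable the researcher never measures, so the simple difference in posttest means recovers the treatment effect.

import numpy as np

rng = np.random.default_rng(42)
n = 10_000                                 # number of sampling units

# An unmeasured confound (e.g., pre-existing ability) that the researcher never observes.
ability = rng.normal(size=n)

# Random assignment to treatment (X) versus no-treatment (~X) conditions.
treatment = rng.integers(0, 2, size=n)     # 1 = treated, 0 = control

# Dependent variable: a true treatment effect of 0.5 plus the confound and random noise.
y = 0.5 * treatment + 0.7 * ability + rng.normal(size=n)

# Randomization balances the confound across conditions (the two means are nearly equal) ...
print(ability[treatment == 1].mean(), ability[treatment == 0].mean())

# ... so the simple difference in means approximates the true causal effect of 0.5.
print(y[treatment == 1].mean() - y[treatment == 0].mean())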
There are two general categories of randomized-experimental designs. They
are single independent variable designs and multiple independent variable de-
signs.
Single Independent Variable Designs. One of the simplest and most use-
ful of the randomized-experimental designs (diagramed below) is the Solomon
Four-Group design. In it (a) research units are randomly assigned to one of four
conditions, (b) units in condition A and C receive the treatment while those in
B and D serve as no-treatment controls, (c) a single independent variable (X) is
manipulated, and (d) its effects on the dependent variable (O) are determined via
statistical methods. Note that the dependent variable is measured before (O1A
and O1B) the treatment period and after it (O2A, O2B, O2C, and O2D). Diagrammatically:

R O1A X O2A
R O1B ~X O2B
R X O2C
R ~X O2D

The results of a study using this design provide a convincing basis for concluding
that the independent variable caused the dependent variable. That is, they allow
for ruling out all threats to internal validity. Note, however, that the same results could
not be used to support the conclusion that X is the only cause of changes in the
dependent variable. Other randomized-experimental research may show that the
dependent variable is also causally influenced by a host of other manipulated independent variables.
Multiple Independent Variable Designs. Randomized-experimental designs
also can be used in studies that examine causal relations between multiple inde-
pendent variables and one or more dependent variables. A study of this type can
consider both the main and interactive effects of two or more independent vari-
ables (e.g., X1, X2, and X1×X2). Thus, for example, a 2 × 2 factorial study could test
for the main and interactive effects of room temperature and relative humidity on
workers’ self-reports of their comfort level.
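As a purely illustrative sketch of how such a 2 × 2 factorial study might be analyzed, the Python code below simulates comfort ratings and tests the main and interactive effects with a two-way ANOVA (the cell sizes, effect values, and reliance on the pandas and statsmodels libraries are assumptions made for the example, not features of any actual study).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n_per_cell = 50

# Fully crossed 2 x 2 design: temperature (cool/warm) by humidity (low/high).
design = pd.DataFrame(
    [(t, h) for t in ("cool", "warm") for h in ("low", "high")
     for _ in range(n_per_cell)],
    columns=["temperature", "humidity"],
)

# Simulated comfort ratings with arbitrary main effects and an interaction.
design["comfort"] = (
    5.0
    - 1.0 * (design["temperature"] == "warm")
    - 0.5 * (design["humidity"] == "high")
    - 0.8 * ((design["temperature"] == "warm") & (design["humidity"] == "high"))
    + rng.normal(scale=1.0, size=len(design))
)

# Two-way ANOVA testing the main and interactive effects on the measured dependent variable.
model = smf.ols("comfort ~ C(temperature) * C(humidity)", data=design).fit()
print(anova_lm(model, typ=2))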

Quasi-Experimental Designs
Quasi-experimental designs have three attributes: First, one or more indepen-
dent variables (e.g., X1, X2, and X3) are manipulated. Second, assumed dependent
and control variables are measured (O1, O2, . . . Ok) before and after the manipula-
tions. Third, the studied units are not randomly assigned to experimental condi-
tions. The latter attribute results in a very important deficiency, i.e., an inability to
argue that the studied units were equivalent to one another before the manipula-
tion of the independent variable(s). Stated differently, at the outset of the study
the units may have differed from one another on a host of measured and/or un-
measured confounding variables (Campbell & Stanley, 1963; Cook & Campbell,
1976, 1979; Shadish et al., 2002). Thus, any observed differences in measures
of the assumed dependent variable(s) may have been a function of one or more
confounding variables.
There are five basic types of quasi-experimental designs. Brief descriptions of
them are provided below.
Single Group Designs With Or Without Control Group. In this type of de-
sign an independent variable is manipulated and the assumed dependent variable
is measured at various points in time before and/or after the manipulation. A very
simple case of this type of design is the one-group pretest-posttest design:

O1A X O2A

An example of this design is a study in which performance is measured before
(O1A) and after (O2A) a job-related training program (X).
A major weakness of this and similar designs is that pretest versus posttest
changes in the measured variable may have been a function of a host of confounds,
including history, maturation, and regression (Campbell & Stanley, 1963; Cook &
Campbell, 1976, 1979; Shadish et al., 2002). Thus, the design does not allow for
valid inferences about the causal connection between training and performance.
Multiple Group Designs Without Pretest Measures. In this type of design
units are not randomly assigned to conditions, two or more groups are assigned
to treatment and no treatment conditions, and the assumed dependent variable is
measured after the manipulation of the independent variable:

~R X O2A
~R ~X O2B

Although this design is slightly better than the just-described single group de-
sign, it is still highly deficient with respect to the criterion of internal validity.
The principal reason for this is that posttest differences in the assumed dependent
variable may have resulted from such confounds as pre-treatment differences on
the same variable or a host of other confounds (e.g., local history).
Multiple Group Designs with Control Groups and Pretest Measures. In
this type of design (a) units are assigned to one or more treatment and control
conditions on a non-random basis, (b) the independent variable is manipulated in
one or more such conditions, and (c) the assumed dependent variable is measured
before and after the treatment period. One example of this type of design is:

~R O1A X O2A
~R O1B ~X O2B

For instance, in a study of the effects of participation in decision making on com-
mitment: (a) the treatment is implemented in one division of a company (i.e.,
Group A) while the other division (Group B) serves as a no treatment control
condition, and (b) commitment is measured before and after the treatment period
in both groups. The hoped-for outcome is that O1A = O1B and O2A > O2B. Unfortu-
nately, this pattern of results would not justify the inference that the treatment pro-
duced the difference in commitment. Although the design is an improvement over
one in which there are no pretest measures, it is still quite deficient in terms of the
internal validity criterion. Even if the pretest measures revealed that the groups
did not differ from one another at the pretest period, a large number of confounds
may have been responsible for any observed posttest differences. For example,
the group that was treated may also have experienced a pay increase during the treatment period.
Time Series Designs. The time series design involves the measurement of the
assumed dependent variable on a number of occasions before and after the group
experiences the treatment. A simple illustration of this type of design is:

O1A O2A O3A ··· O25A X O26A O27A O28A ··· O50A

For example, performance may be measured at multiple points in time before
and after the implementation of a job training intervention. Although this design
allows for the ruling out of some confounds (e.g., maturation), it does not permit
the ruling out of others (e.g., history). As a result, the design is relatively weak in
terms of the internal validity criterion.
Regression Discontinuity Designs. This design entails (a) the measurement
of the variable of interest at a pretest period, (b) the use of pretest scores to assign
units to treatment versus control conditions, (c) the separate regression of posttest
scores (O2) on pretest scores (O1) for units in the conditions, and (d) the compari-
son of slope and/or intercept differences for the two groups. An example of this
type of design is:
~R O1A X O2A
~R O1B ~X O2B

Unfortunately, this design does not allow for confident inferences about the
effect of the treatment on the assumed dependent variable. There are numerous
reasons for this including differential levels of attrition from members of the two
groups (Cook & Campbell, 1976, 1979; Shadish et al., 2002).
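To make steps (c) and (d) concrete, the following Python sketch simulates a regression discontinuity data set and fits the two within-condition regressions (the cutoff, sample size, and treatment effect are hypothetical, and the sketch ignores the attrition and related problems just noted).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500

pretest = rng.normal(size=n)
cutoff = 0.0
treated = (pretest < cutoff).astype(int)     # assignment based solely on pretest scores

# Posttest scores: related to the pretest, plus an effect for the treated group.
posttest = 1.0 * pretest + 0.6 * treated + rng.normal(size=n)

# Separate regressions of posttest (O2) on pretest (O1) within each condition,
# followed by a comparison of the intercepts (and, if desired, the slopes).
for group, label in ((1, "treatment"), (0, "control")):
    y = posttest[treated == group]
    x = sm.add_constant(pretest[treated == group])
    fit = sm.OLS(y, x).fit()
    print(label, "intercept:", round(fit.params[0], 3), "slope:", round(fit.params[1], 3))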
Summary. As noted above, quasi-experimental designs may allow a researcher
to rule out some threats to internal validity. However, as is detailed by Campbell
and Stanley (1963, Table 2) other threats can’t be ruled out by these designs. As a
result, internal validity is often questionable in research using quasi-experimental
designs. Stated differently, these designs are inferior to randomized-experimental
designs in terms of supporting causal inferences.

Non-Experimental Designs
In a non-experimental study the researcher measures (i.e., observes) assumed
independent, mediator, moderator, and dependent variables (O1, O2, O3, . . . Ok).
One example of this type of study is research by Hackman and Oldham (1976).
Its purpose was to test the job characteristics theory of job motivation. In it the
researchers measured the assumed (a) independent variables of task variety, au-
tonomy, and feedback, (b) mediator variables of experienced meaningfulness of
work, and knowledge of results of work activities, (c) moderator variable of high-
er order need strength, and (d) dependent variables of work motivation, perfor-
mance, and satisfaction. They then used statistical analyses (e.g., zero-order cor-
relation, multiple regression) to test for relations between the observed variables.
Results of the study showed strong support for hypothesized relations between
the measured variables. Nevertheless, because the study was non-experimental,
any causal inferences stemming from it would rest on a very shaky foundation.
Note, moreover, that it is of no consequence whatsoever that the analyses were
predicated on a theory! Thus, for example, the study’s results were incapable of
serving as a valid basis for (a) inferring that job characteristics were the causes
of satisfaction or (b) ruling out the operation of a number of potential (observed
and unobserved) confounding variables. More generally and contrary to the argu-
ments of many researchers, causal inferences are not strengthened by invoking
a theory prior to the time a study is conducted. As noted above, for example,
(a) some theorists argue that satisfaction causes performance, (b) others contend
that performance causes satisfaction, and (c) still others assert that the relation is
spurious. Non-experimental research is incapable of determining which, if any, of
these assumed causal models is correct.
OTHER DESIGN ISSUES


This section deals with two strategies that are frequently used for making causal
inferences: (a) meta-analysis based summaries of randomized-experimental stud-
ies and (b) tests of mediation models.

Cumulating the Results of Several Studies


There are often instances in which multiple studies using randomized-experi-
mental designs have examined causal relations between and among variables of
interest. In such cases their findings can be cumulated using meta-analytic meth-
ods. This may be done in cases where research has considered either (a) simple
causal models (e.g., Figures 3.1a or 3.1b) or (b) models involving mediation (e.g.,
Figures 3.1d, 3.1e, or 3.1g).

Tests of Simple Causal Models With the Results of a Meta-Analysis


The results of multiple randomized-experimental studies may be cumulated
using meta-analytic methods. For example, Hosoda, Stone-Romero, and Coats
(2003) meta-analyzed the results of 37 randomized-experimental studies of rela-
tions between (a) physical attractiveness and (b) various job-related outcomes
(e.g., hiring, predicted job success, promotion, and job suitability). Overall, re-
sults showed a .37 mean weighted effect size (d) for 62 studied relations. These
results allow for confident causal inferences about the impact of attractiveness
on the dependent variables.
Cumulating effect sizes (e.g., ds or correlations) via meta-analysis has one very important
benefit. More specifically, it provides evidence about the degree to which causal relations between
or among variables generalize across such dimensions as (a) types of sampling
units, (b) research contexts, or (c) time periods.
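In its simplest form, the cumulation step reduces to computing a weighted average effect size across studies, as in the following minimal Python sketch (the d values and sample sizes are invented for illustration and are not the Hosoda et al. data).

import numpy as np

# Hypothetical per-study standardized mean differences (d) and total sample sizes.
d = np.array([0.25, 0.40, 0.55, 0.30, 0.45])
n = np.array([80, 120, 60, 200, 150])

# Sample-size-weighted mean effect size, one common meta-analytic estimator.
d_bar = np.sum(n * d) / np.sum(n)
print(round(d_bar, 3))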

Mediation Models
Research using randomized-experimental designs also may be used in tests of
models involving mediation (e.g., Pirlott & MacKinnon, 2016; Rosopa & Stone-
Romero, 2008; Stone-Romero & Rosopa, 2008, 2010, 2011). For example, a re-
searcher may posit that O1 → O2 → O3. Here, as is illustrated in Figure 3.1g, the
effect of O1 on O3 is transmitted through the mediator, O2. The simplest way of
testing such a mediation model is to conduct two randomized experiments, one
that tests for the effects of O1 on O2 and the other that tests for the effects of O2 on
O3 (Rosopa & Stone-Romero, 2008; Stone-Romero & Rosopa, 2008, 2010, 2011).
If the results show support for both such predictions, the mediating effect of O2
can be deduced through the use of symbolic logic (Kalish & Montague, 1964,
Theorem 26).
Research by Eden, Stone-Romero, and Rothstein (2015) is an example of a
meta-analytic based mediation study. It used the results of meta-analyses of two
relations: The first involved the causal relation between leader expectations (LE)
and subordinate self-efficacy (SE). For it, the average correlation was .58. The
second considered the causal relation between subordinate self-efficacy (SE) and
subordinate performance (SP); for it, the average correlation was .35. When combined,
these correlations along with formal reasoning deductions provided support for
the assumed mediation model. The reasoning is ((LE → SE) ˄ (SE → SP)) → (LE
→ SP) (see Theorem 26 of Kalish & Montague, 1964).
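Written out, the deduction is an instance of the hypothetical syllogism:

LE → SE (supported by the first meta-analysis)
SE → SP (supported by the second meta-analysis)
Therefore, LE → SP (the effect of leader expectations on subordinate performance is transmitted through self-efficacy)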
Whereas the results of meta-analyses of experimental studies may be used to
support causal inferences for either simple (e.g., O1 → O2) or complex relations
(e.g., O1 → O2 → O3), they do not justify such inferences in cases where the meta-
analyses are based upon relations derived from non-experimental studies (e.g.,
Judge et al., 2001; Riketta, 2008). Stated somewhat differently, meta-analytic
methods cannot serve as a basis for valid causal inferences when they involve the
accumulation of the findings of two or more non-experimental studies.

INVALID CAUSAL INFERENCES BASED ON STATISTICAL STRATEGIES
The literature in HRM and related fields is replete with studies in which invalid
inferences about causal relations are based on the results of statistical analyses
as opposed to the use of randomized-experimental designs. These analyses are
of several types, including: (a) finding a zero-order correlation between two or
more measured variables (e.g., O1, O2, and O3), (b) using hierarchical regression
to show that a study’s results are consistent with an assumed causal model, and
(c) analyzing data with so called “causal modeling” methods (e.g., cross-lagged
correlation, causal-correlation analysis, path analysis, and SEM). Regrettably, the
statistical methods used in a non-experimental or quasi-experimental study don’t
provide a valid basis for causal inferences (Bollen, 1989; Rogosa, 1987;
Rosopa & Stone-Romero, 2008; Shadish et al., 2002; Stone-Romero & Rosopa,
2004, 2008). Stated somewhat differently, statistical methods are not an appro-
priate substitute for sound experimental design!
A number of researchers (e.g., Baron & Kenny, 1986) have advocated the use
of hierarchical multiple regression (HMR) as a basis for inferring mediating ef-
fects (e.g., O1 → O2 → O3) using data from non-experimental studies. They con-
tend that such analyses provide a basis for causal inferences about the direct (O1
→ O3) and indirect (O1 → O2 → O3) effects of the assumed independent variable
(O1) on the assumed dependent variable (O3). Figures 3.1d, 3.1e and 3.1g show
three of several possible models involving mediation for the measured variables
of O1, O2, and O3. As is explained below, the results of a “causal analysis” with
HMR or any other statistical technique cannot provide valid evidence on the cor-
rectness of any of these models.
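For readers unfamiliar with the procedure being criticized, the three regression steps advocated by Baron and Kenny (1986) can be sketched as follows (illustrative Python with simulated non-experimental data; the variable names and coefficients are arbitrary and, as argued throughout this section, nothing in these analyses establishes causal direction).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300

# Simulated measured variables: assumed cause (o1), assumed mediator (o2), assumed effect (o3).
o1 = rng.normal(size=n)
o2 = 0.5 * o1 + rng.normal(size=n)
o3 = 0.4 * o2 + 0.2 * o1 + rng.normal(size=n)

# Step 1: regress the assumed dependent variable on the assumed independent variable.
step1 = sm.OLS(o3, sm.add_constant(o1)).fit()

# Step 2: regress the assumed mediator on the assumed independent variable.
step2 = sm.OLS(o2, sm.add_constant(o1)).fit()

# Step 3: regress the assumed dependent variable on both the independent variable and the
# mediator; mediation is claimed when o2 is significant here and the o1 coefficient shrinks.
step3 = sm.OLS(o3, sm.add_constant(np.column_stack([o1, o2]))).fit()

print(step1.params, step2.params, step3.params)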
Stone-Romero and Rosopa (2004) conducted an experimental statistical simu-
lation study to determine the effects of four concurrent manipulations on the abil-
ity of the Baron and Kenny HMR technique to provide evidence of mediation. In
the assumed causal model (a) the effect of Z1 on Z3 was mediated by Z2, and (b)
there also was a direct effect of Z1 on Z3. The variables in the simulation were r12,
r13, r23, and N, where (a) r12 = correlation between the assumed independent vari-
able, Z1 and the assumed mediator variable, Z2; (b) r13 = correlation between the
assumed independent variable, Z1 and the assumed dependent variable, Z3; (c) r23
= correlation between the assumed mediator variable, Z2 and the dependent vari-
able, Z3; and (d) N = sample size. The manipulations of r12, r13, r23 (values of .1 to .9 for
each) and N (values of 68, 136, 204, 272, 340, 408, 1,000, 1,500, 2,000, 2,500, and
3,000) resulted in 651 data sets. They were analyzed using the HMR technique.
Results of 8,463 HMR analyses showed that: (a) if the model tested was not the
true model, there would be a large number of cases in which there would be sup-
port for partial or complete mediation and the researcher would make highly erro-
neous inferences about mediation; (b) if the model tested was the true model, there
would only be slight support for inferences about complete mediation and modest
support for inferences about partial mediation. Overall, the HMR procedure did
very poorly in terms of providing evidence of mediation (see Stone-Romero &
Rosopa, 2004 for detailed information on the findings). Thus, the HMR technique
is unlikely to provide consistent evidence to support inferences about mediation.
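The data-generation side of a simulation of this general kind can be sketched as follows (a simplified illustration of the approach, not the authors’ actual code): for each combination of r12, r13, r23, and N, trivariate standard normal data are drawn with those population correlations, and the HMR steps are then applied to the generated data set.

import numpy as np

def simulate_dataset(r12, r13, r23, n, seed=None):
    # Draw n cases of (Z1, Z2, Z3) with the specified population correlations.
    rng = np.random.default_rng(seed)
    corr = np.array([[1.0, r12, r13],
                     [r12, 1.0, r23],
                     [r13, r23, 1.0]])
    return rng.multivariate_normal(mean=np.zeros(3), cov=corr, size=n)

data = simulate_dataset(r12=0.3, r13=0.3, r23=0.5, n=204, seed=3)
print(np.corrcoef(data, rowvar=False).round(2))   # sample correlations approximate the targets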
As noted by Stone-Romero and Rosopa (2004) there are many problems with
the HMR strategy for making inferences about causal connections between vari-
ables. First, it relies on the interpretation of the magnitudes of regression coef-
ficients as a basis for determining effect size estimates. However, as Darlington
(1968) demonstrated more than 50 years ago, when multiple correlated predictors
are used in a regression analysis it is impossible to determine the proportion of
variance that is explained uniquely by each of them. Second, when applied to data
from non-experimental research there is almost always ambiguity about causal
direction. Third, there is the issue of model specification. Although a researcher
may test an assumed causal model, he or she cannot be certain that is the correct
model. Moreover, the invocation of a theory may be of little or no consequence
because there may be many theories about the causal connection between vari-
ables (e.g., the relation between satisfaction and performance). Fourth, the results
of non-experimental research do not provide a basis for making causal inferences.
In contrast, the findings of randomized-experimental studies do. Fifth, and finally,
in non-experimental research there is always the problem of unmeasured con-
founds. These are seldom considered in HMR analyses. Even if they are, if the
measures of confounds lack construct validity their effects cannot be controlled
fully by an HMR analysis.
Stone-Romero and Rosopa (2004) are not alone in questioning the ability of
HMR to provide credible evidence about causal connections between (or among)
measured variables. A number of other researchers have comparable views. For
example, Mathieu and Taylor (2006) wrote that “Research design factors are para-
mount for reasonable mediational inferences to be drawn. If the causal order of
variables is compromised, then it matters little how well the measures perform or
the covariances are partitioned. Because no [data] analytic technique can discern
the true causal order of variables, establishing the internal validity of a study is
critical. . . [and] randomized field experiments afford the greatest control over such
concerns” (p. 1050). They went on to state that randomized experiments “remain
the ‘gold standard’ [in empirical research] and should be pursued whenever pos-
sible” (p. 1050). These and similar views stand in sharp contrast to the generally
invalid arguments of several authors (e.g., Baron & Kenny, 1986; Blalock, 1964,
1971; James, 2008; James, Mulaik, & Brett, 2006; Kenny, 1979, 2008; Preacher
& Hayes, 2004, 2008). Unfortunately, unwarranted inferences about causality on
the basis of so called “causal modeling” methods are all too common in publica-
tions in HRM and allied fields. For example, on the basis of a meta-analysis of
the satisfaction-performance relation, Judge, Thoresen, Bono, and Patton (2001)
argued that causal modeling methods can shed light on causal relations between
these variables, especially in cases where mediation is hypothesized. They wrote
that “Though some research has indirectly supported mediating influences [on the
satisfaction-performance relation], direct tests are lacking. Such causal studies
are particularly appropriate in light of advances in causal modeling techniques in
the past 20 years” (p. 390). Contrary to the views of Judge et al., causal modeling
techniques cannot provide a valid basis for causal inferences.
Another example of invalid causal inferences comes from Riketta’s (2008)
meta-analytic study of the relations between job attitudes (attitudes hereinafter)
and performance. As noted above, he cumulated the findings of 16 nonexperimen-
tal studies to compute average correlations between attitudes and performance.
They were used in what he described as a meta-analytic regression analysis. On
the basis of it he wrote that “because the present analysis is based on correlational
rather than experimental data, it allows for only tentative causal conclusions and
cannot rule out some alternative causal explanations (e.g., that third variables in-
flated the cross-lagged paths; see, e.g., Cherrington, Reitz, & Scott, 1971; Brown
& Peterson, 1993). Although the present analysis accomplished a more rigorous
test for causality than did previous meta-analyses in this domain, it still suffers
from the usual weakness of correlational designs. Experiments are required to
provide compelling evidence of causal relations” (p. 478). Whereas Riketta was
correct in concluding that experiments are needed to test causal relations, he was
incorrect in asserting that his study provided a more rigorous test of causality than
previous meta-analytic studies.
Brown and Peterson (1993) conducted an SEM-based test of an assumed
causal model of the antecedents and consequences of salesperson job satisfaction.
On the basis of its results they concluded that “Another important finding of the
causal analysis is evidence that job satisfaction primarily exerts a direct causal ef-
fect on organizational commitment rather than vice versa” (p. 73). Unfortunately,
this and other causal inferences were unwarranted because the study’s data came
from non-experimental studies.
It is interesting to consider the views of James et al. (2006) with respect to test-
ing assumed mediation models. They argue that “if theoretical mediation models
are thought of as causal models, then strategies designed specifically to test the fit
of causal models to data, namely, confirmatory techniques such as structural equa-
tion modeling (SEM), should be employed to test mediation models” (p. 234).
Moreover, they contend that in addition to testing a mediation model of primary
interest they strongly recommend testing alternative causal models. As they note,
“The objective is to contrast alternative models and identify those that appear to
offer useful explanations versus those that do not” (p. 243). However, they go on
to write that the results of SEM analyses “for both complete and partial mediation
models do not imply that a given model is true even though the pattern of parame-
ter estimates is consistent with the predictions of the model. There are always oth-
er equivalent models implying different causal directions or unmeasured common
causes that would also be consistent with the data” (p. 238). Unfortunately, for the
reasons noted above, testing primary or alternative models with SEM or any other
so called “causal modeling” methods does not allow researchers to make valid
causal inferences because when applied to data from non-experimental studies
these methods cannot serve as a valid basis for inferences about cause.
Some researchers seem to believe that the invocation of a theory combined with
the findings of a “causal modeling” analysis (e.g., SEM) is the deus ex machina of
nonexperimental research. Nothing could be further from the truth. One reason for
this is that the same set of observed correlations between or among a set of mea-
sured variables can be used to support a number of assumed causal models (e.g.,
Figures 3.1a to 3.1g). In the absence of research using randomized experimental
designs it is impossible to determine which, if any, of the models is correct.
Clearly, so called “causal modeling” methods (e.g., path analysis, hierarchical
regression, cross-lagged panel correlation, and SEM) are incapable of providing
valid evidence on causal connections between and among measured variables (Cliff,
1983; Freedman, 1987; Games, 1990; Rogosa, 1987; Rosopa & Stone-Romero,
2008; Spencer, Zanna, & Fong, 2005; Stone-Romero, 2002, 2008, 2009, 2010;
Stone-Romero & Gallaher, 2006; Stone-Romero & Rosopa, 2004, 2008, 2010,
2011). Therefore, researchers interested in making causal inferences should con-
duct studies using either randomized-experimental or quasi-experimental designs.
In recent years, a number of researchers have championed the use of data ana-
lytic strategies for supposedly improving causal inferences in research using non-
experimental designs. Two examples of this are propensity score modeling (e.g.,
Rosenthal & Rosnow, 2008) and regression-based techniques for approximating
counterfactuals (e.g., Morgan & Winship, 2014). On their face, these techniques
may appear elegant and sophisticated. However, the results of these regression-
based strategies do not provide a valid basis for causal inferences because the data
used by them come from non-experimental research. Another very serious limi-
tation of the propensity score strategy and similar strategies is that they provide
statistical controls for only a limited set of control variables. This leaves a host of
unmeasured variables uncontrolled. In short, statistical control strategies such as
propensity score analyses are a very poor substitute for research using random-
ized experimental designs.
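For reference, an inverse-propensity-weighting analysis of the kind at issue typically proceeds roughly as follows (an illustrative Python sketch with simulated data and a single observed covariate; the names and values are hypothetical). The sketch also illustrates the limitation noted above: because the unobserved confound cannot enter the propensity model, the weighted estimate remains biased.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1_000

x_obs = rng.normal(size=n)       # a measured control variable
x_unobs = rng.normal(size=n)     # an unmeasured confound (never available to the analyst)

# Non-random "treatment" uptake that depends on both the observed and unobserved variables.
p = 1 / (1 + np.exp(-(0.8 * x_obs + 0.8 * x_unobs)))
treatment = rng.binomial(1, p)

# Outcome: a true treatment effect of 1.0 plus both confounds and random noise.
y = 1.0 * treatment + 1.0 * x_obs + 1.0 * x_unobs + rng.normal(size=n)

# Propensity scores can be estimated only from the observed covariate.
ps_model = sm.Logit(treatment, sm.add_constant(x_obs)).fit(disp=0)
ps = ps_model.predict(sm.add_constant(x_obs))
weights = treatment / ps + (1 - treatment) / (1 - ps)    # inverse-propensity weights

# The weighted difference in means remains biased (here, above the true value of 1.0)
# because the unobserved confound is not, and cannot be, adjusted for.
effect = (np.average(y[treatment == 1], weights=weights[treatment == 1])
          - np.average(y[treatment == 0], weights=weights[treatment == 0]))
print(round(effect, 2))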

OBJECTIONS TO RANDOMIZED-EXPERIMENTS
Some researchers (e.g., James, 2008; Kenny, 2008) have argued that research
based on randomized-experimental designs is not feasible for various reasons,
including (a) some independent variables can’t be manipulated, (b) the manipula-
tion of others may not be ethical, and (c) organizations will not permit randomized
experiments.

Non-Manipulable Variables
Clearly, some variables are incapable of being manipulated by researchers,
including (a) the actual ages, sexes, genetic makeup, physical attributes, and cog-
nitive abilities of research participants, (b) the laws of cities, counties, states, and
countries, and (c) the environments in which research units operate. Neverthe-
less, through creative research design it may be possible to manipulate a number
of such variables (or perceptions of them). For example, in a randomized-experimental study of helping
behavior by Danzis and Stone-Romero (2009) the attractiveness of a confederate
(who requested help from research subjects) was manipulated through a number
of strategies (e.g., the clothing, jewelry, makeup, and hairstyle of confederates).
Results of the study showed that attractiveness had an impact on helping behavior.
Attractiveness also can be manipulated in a number of other ways. For ex-
ample, in a number of simulated hiring studies using randomized-experimental
designs the physical attractiveness of job applicants was manipulated via photos
of the applicants (see Stone, Stone, & Dipboye, 1992, for details). In addition,
a randomized-experimental study by Kreuger, Stone, and Stone-Romero (2014)
examined the effects of several factors, including applicant weight on hiring deci-
sions. In it, the weight of applicants was manipulated through the editing of pho-
tos of them using Photoshop software. Overall, what the above demonstrates quite
clearly is that randomized experiments are possible even for independent vari-
ables that some researchers believe to be difficult or impossible to manipulate.
In an article that critiqued the use of research using randomized-experimental
designs, James (2008) wrote that “If we limited causal inference to randomized
experiments where participants have to be randomly sampled [sic] into values
of a causal variable, then we would no longer be able to draw causal inferences
about smoking and lung cancer (to mention one of several maladies)” (p. 361).
Clearly, this argument is of little or no consequence because many variables can
be manipulated. For example, a large number of randomized-experimental studies
have shown the causal connection between smoking and lung cancer using hu-
man or non-human research subjects (El-Bayoumy, Iatropolous, Amin, Hoffman,
& Wynder, 1999; Salaspuro & Salaspuro, 2004). And, at the cellular level, thou-
sands of randomized-experimental studies have linked a wide variety of chemical
compounds to cancer and other diseases. Moreover, a quasi-experimental study
by Salaspuro and Salaspuro (2004) used smoking and non-smoking subjects. The
smokers smoked one cigarette every 20 minutes while the non-smokers served as
controls. The researchers compared their salivary acetaldehyde (a known carcino-
gen) levels every 20 minutes for a 160 minute period. Results showed that smok-
ers had seven times higher acetaldehyde levels than non-smokers. Because only
smokers smoked the cigarettes, there was virtually no ethical issue with the
study. Moreover, even though the study used a quasi-experimental design, causal
inferences were possible because virtually all threats to internal validity
were controlled through its design.
It merits adding that I have strong ethical objections to research that uses non-
human subjects in research aimed at inducing cancer or other diseases. Fortu-
nately, there are various ethical alternatives to such studies. For example, research
can use “tissue-on-a-chip models or microphysiological systems [which] are
a fusion of engineering and advanced biology. Silicon chips are lined with human
cells that mimic the structure and function of human organs and organ systems.
They are used for disease modeling, personalized medicine, and drug testing”
(e.g., Fabre, Livingston, & Tagle, 2014; Physicians Committee for Responsible
Medicine, 2018). The availability of these methods casts further doubt on the
validity of James’ (2008) arguments because it clearly demonstrates how random-
ized-experiments can be used in ethical research in medicine. These methods also
may have applicability in HRM research. For example, they may be used in stud-
ies of the effects of environmental toxins or substances (e.g., chemicals, asbestos,
and particulates) in work settings.
Second, the fact that randomized experiments are not always possible should
not lead researchers to operate on the assumption that they should not be used in
cases where they are possible (see Evan, 1971, for many examples). That would
be tantamount to arguing that since antibiotics cannot cure cancer, arthritis, and a
host of other diseases, they should not be used in the treatment of diseases such as
bronchitis, syphilis, sinus infections, strep throat, urinary tract infections, pneu-
monia, eye infections, and ear infections.

The View That Some Manipulations Are Unethical


There are many instances in which it would be unethical to subject research
participants to treatments that produce psychological or physical harm. In terms
of the former, for example, a researcher may be interested in assessing the effects
of feedback valence (positive versus negative) on task-based esteem or mood
state. However, it would be unethical to provide research participants with false
negative feedback about themselves or their work. In contrast, it would not be
unethical to ask them to role play a hypothetical worker and ask how they would
react to different types of feedback. For example, Stone and Stone (1985) used a 2
× 2 randomized-experimental design to study the effects of feedback favorability
and feedback consistency on self-perceived task competence and perceived feed-
back accuracy. Participants in the study were asked to role play a hypothetical
worker’s reactions to the feedback. Results showed main and interactive effects of
the manipulated variables. The fact that role playing methods were used averted
the ethical problems that would have arisen if the participants had been provided
with false negative feedback. The upshot of the foregoing is that randomized ex-
periments are indeed possible for variables that would be difficult or unethical to
manipulate. Of course, role playing and other simulation studies are not devoid of
problems. For example, construct validity may be threatened by research that uses
the role playing strategy. More specifically, it seems quite likely that the strength
of a variable manipulated through a role play (e.g., performance feedback) in an
SP setting would be lower than that of the same type of variable in a NSP context
(e.g., performance feedback provided by a supervisor in a work organization).
Nevertheless, if relations between simulated feedback and measured outcomes
are found in an SP setting they are likely to be underestimates of the relations
that would be found in a NSP context. Moreover, if the purpose of a study is to
determine causal connections between variables the results of a randomized ex-
periment in an SP setting would certainly be more convincing than the results of
a nonexperimental study in an NSP setting.

The View That Experimental Research Is Not Possible in Organizational Settings
Another objection that has been raised by some (e.g., James, 2008; Kenny,
2008) is that organizations will not allow researchers to conduct randomized ex-
periments. This argument is of questionable validity: Although randomized ex-
periments may not be allowed by some organizations, they are indeed possible.
Eden and his colleagues (Davidson & Eden, 2000; Dvir, Eden, Avolio, & Shamir,
2002; Eden, 1985, 2003, 2017; Eden & Aviram, 1993; Eden & Zuk, 1995), for
example, have conducted a large number of randomized experiments in organiza-
tions. In one such study, Eden and Zuk (1995) used a randomized-experimental
design to assess the effects of self-efficacy training on seasickness of naval cadets
in the Israeli Defense Forces. Results of the study showed support for the study’s
hypotheses. Other researchers would do well to benefit from the high degree of
creativity shown by Eden and his colleagues in conducting randomized-experi-
ments in NSP settings.
Further evidence of the feasibility of experimentation in organizational settings
is afforded by the chapters in a book titled Organizational Experiments: Labora-
tory and field research (Evan, 1971). Moreover, Shadish et al. (2002) describe the
many situations that are conducive to the conduct of randomized experiments in
NSP settings (e.g., work organizations). Among these are circumstances when (a)
the demand for a treatment exceeds its supply, (b) treatments cannot be delivered
to all units simultaneously, (c) units can be isolated from one another, (d) there
is little or no communication between or among units, (e) assignment to treat-
ments can be granted on the basis of breaking ties with regard to a selection variable,
(f) units are indifferent to the type of treatment they receive, (g) units are sepa-
rated from one another, and (h) the researcher can create an organization within
which the research will be conducted.
Finally, even if randomized experiments are not possible in NSP settings (e.g.,
work organizations) they may be possible in SP settings (Stone-Romero, 2010),
including organizations created for the specific purpose of experimental research
(Evan, 1971; Shadish et al., 2002). Thus, contrary to the arguments of several
analysts (e.g., James, 2008; Kenny, 2008), researchers should consider random-
ized-experimental research when their interest is testing assumed causal models.
Of course, if the sole purpose of a study is to determine if observed variables are
related to one another, non-experimental studies are appropriate.

CONCLUSIONS
In view of the above, several conclusions are offered: First, causal inferences
require sound experimental designs. Of the available options, randomized-exper-
imental designs provide the strongest foundation for such inferences, quasi-ex-
perimental designs afford a weaker basis, and non-experimental designs offer the
weakest. Thus, whenever possible researchers interested in making causal infer-
ences should use randomized-experimental designs.
Second, data analytic strategies are never an appropriate substitute for sound
experimental design. It is inappropriate to advance causal inferences on the basis
of such “causal modeling” strategies as HMR, path analysis, cross-lagged panel
correlation, and SEM. Thus, researchers should refrain from doing so. There is
nothing wrong with arguing that a study’s results are consistent with an assumed
causal model, but consistency is not a valid basis for implying the correctness of
that model. The reason is that the results may be consistent with many other mod-
els and there is seldom a valid basis for choosing one model over others.
Third, researchers should acknowledge the fact that causal inferences are inappro-
priate when a study’s data come from research using non-experimental or quasi-
experimental designs. Thus, they should not advance such inferences (see also,
Wood, Goodman, Beckmann, & Cook, 2008). Rather, they should be circumspect
in discussing the implications of the findings of their research.
Fourth, randomized experiments are possible in both SP and NSP settings, and
they are the “gold standard” for conducting research aimed at testing assumed
causal models. Thus, they should be the first choice for research aimed at testing
such models. Moreover, there are numerous strategies for conducting such experi-
ments in NSP settings (Evan, 1971; Shadish et al., 2002). The many studies by
Eden and his colleagues are evidence of this.
Fifth, researchers should not assume that statistical controls for confounds
(e.g., in regression models) are effective in ruling out confounds. There are two
reasons for this. One is that the measures of known confounds may lack con-
struct validity. The other is that the researcher may not be aware of all confounds
that may influence observed relations between assumed causes and effects. Thus,
it typically proves impossible to control for confounds in non-experimental re-
search.
Sixth, the editors of journals in HRM and related disciplines should ensure
that authors of research-based articles refrain from advancing causal inferences
when their studies are based on experimental designs that do not justify them. As
noted above, there is nothing wrong with arguing that a study is based upon an as-
sumed causal model. For example, an author may argue legitimately that a study’s
purpose is to test a model that posits a causal connection between achievement
motivation and job performance. In the study, both variables are measured. If a
relation is found between these variables it would be inappropriate to conclude
that the results of the study provided a valid basis for inferring that achievement
motivation was the cause of performance. As noted above, research using non-
experimental or quasi-experimental designs cannot provide evidence of the cor-
rectness of an assumed causal model.
Seventh, sound research methods are vital to both (a) the development and test-
ing of theoretical models and (b) the formulation of recommendations for practice.
Thus, progress in both such pursuits is most likely to be made through research
that uses randomized-experimental designs (Stone-Romero, 2008, 2009, 2010).
With respect to theory testing, randomized-experiments are the best research
strategy for providing convincing evidence on causal connections between vari-
ables. With little exception, the research literature on various topics (e.g., the
satisfaction-performance relation) shows quite clearly that non-experimental re-
search has done virtually nothing to provide credible evidence on the validity of ex-
tant theories. On the other hand, well-conceived experimental studies (e.g., Cher-
rington et al., 1971) provide clear evidence on causal linkages between variables.
With regard to recommendations for practice it is important to recognize that
a large percentage of studies in HRM and allied disciplines have used non-exper-
imental designs. Because of this it seems likely that many HRM-related policies
and practices are based upon research that lacks internal validity. Thus, research
using randomized-experimental designs has the potential to greatly improve
HRM-related policies and practices.
Eighth, the language associated with some statistical methods may serve as
a basis for invalid inferences about causal connections between variables. One
example of this is analysis of variance. In a study involving two manipulated
variables (e.g., A and B) an ANOVA analysis would allow for valid inferences
about the main and interactive effects of these variables on a measured dependent
variable. However, if an ANOVA was used to analyze data from a study in which
the variables were measured rather than manipulated (e.g., age, ethnicity, sex) it would be inappro-
priate to argue that these so called “independent variables” affected the assumed
dependent variable. It deserves adding that the same arguments can be made about
the language associated with other statistical methods (e.g., multiple regression,
and SEM).
Ninth and finally, although this paper’s focus was on HRM research, the just
noted conclusions have far broader implications. More specifically, they apply
to virtually all disciplines in which the results of empirical research are used to
advance causal inferences about the correctness of assumed causal models.

REFERENCES
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in so-
cial psychological research: Conceptual, strategic, and statistical considerations.
Journal of Personality and Social Psychology, 51, 1173–1182.
Bateman, T. S., & Strasser, S. (1984). A longitudinal analysis of the antecedents of organi-
zational commitment. Academy of Management Journal, 27, 95–112.
Blalock, H. M. (1964). Causal inferences in nonexperimental research. New York, NY:
W. W. Norton.
Blalock, H. M. (1971). Causal models in the social sciences. Chicago, IL: Aldine.
Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: Wiley.
Bouchard, T. (1976). Field research methods: Interviewing, questionnaires, participant ob-
servation, systematic observation, and unobtrusive measures. In M. D. Dunnette
(Ed.), Handbook of industrial and organizational psychology (pp. 363–413). Chi-
cago, IL: Rand McNally.
Brown, S. P., & Peterson, R. A. (1993). Antecedents and consequences of salesperson job
satisfaction: Meta-analysis and assessment of causal effects. Journal of Marketing
Research, 30, 63–77.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for
research. Chicago, IL: Rand McNally.
Campbell, J. P. (1986). Labs, fields, and straw issues. In E. A. Locke (Ed.), Generalizing
from laboratory to field settings: Research findings from industrial-organizational
psychology, organizational behavior, and human resource management (pp. 269–
279). Lexington, MA: Lexington Books.
Cherrington, D. J., Reitz, H. J., & Scott, W. E. (1971). Effects of contingent and noncontin-
gent reward on the relationship between satisfaction and task performance. Journal
of Applied Psychology, 55, 531–536.
Cook, T. D., & Campbell, D. T. (1976). The design and conduct of quasi-experiments and
true experiments in field settings. In M. D. Dunnette (Ed.), Handbook of industrial
and organizational psychology (pp. 223–326). Chicago, IL: Rand McNally.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues
for field settings. Boston, MA: Houghton Mifflin.
Cliff, N. (1983). Some cautions concerning the application of causal modeling methods.
Multivariate Behavioral Research, 18, 115−126.
Danzis, D., & Stone-Romero, E. F. (2009). Effects of helper sex, recipient attractiveness,
and recipient femininity on helping behavior in organizations. Journal of Manage-
rial Psychology, 24, 722–737.
Darlington, R. B. (1968). Multiple regression in psychological research and practice. Psy-
chological Bulletin, 69, 161–182.
Davidson, O. B., & Eden, D. (2000). Remedial self-fulfilling prophecy: Two field experi-
ments to prevent Golem effects among disadvantaged women. Journal of Applied
Psychology, 85, 386–398.
Dvir, T., Eden, D., Avolio, B. J., & Shamir, B. (2002). Impact of leadership development on
follower development and performance: A field experiment. Academy of Manage-
ment Journal, 45, 735–744.
Eden, D. (1985). Team development: A true field experiment at three levels of rigor. Jour-
nal of Applied Psychology, 70, 94–100.
Eden, D. (2003). Self-fulfilling prophecies in organizations. In J. Greenberg (Ed.), Organi-
zational behavior (2nd ed., pp. 91–122). Mahwah, NJ: Erlbaum.
Eden, D. (2017). Field experimentation in organizations. Annual Review of Organizational
Psychology and Organizational Behavior, 4, 91–122.
Eden, D., & Aviram, A. (1993). Self-efficacy training to speed reemployment: Helping
people to help themselves. Journal of Applied Psychology, 78, 352–360.
Eden, D., Stone-Romero, E. F., & Rothstein, H. R. (2015). Synthesizing results of mul-
tiple randomized experiments to establish causality in mediation testing. Human
Resource Management Review, 25, 342–351.
Eden, D., & Zuk, Y. (1995). Seasickness as a self-fulfilling prophecy: Raising self-efficacy
to boost performance at sea. Journal of Applied Psychology, 80, 628–635.
El-Bayoumy, K., Iatropoulos, M., Amin, S., Hoffman, D., & Wynder, E. L. (1999). Increased expression of cyclooxygenase-2 in rat lung tumors induced by tobacco-specific nitrosamine-4-(3-pyridyl)-1-butanone: The impact of a high fat diet. Cancer Research, 59, 1400–1403.
Evan, W. M. (1971). Organizational experiments: Laboratory and field research. New York, NY: Harper & Row.
Fabre, K. M., Livingston, C., & Tagle, D. A. (2014). Organs-on-chips (microphysiological systems): Tools to expedite efficacy and toxicity testing in human tissue. Experimental Biology and Medicine, 239, 1073–1077.
Farkas, A. J., & Tetrick, L. E. (1989). A three-wave longitudinal analysis of the causal
ordering of satisfaction and commitment on turnover decisions. Journal of Applied
Psychology, 74, 855–868.
Freedman, D. A. (1987). As others see us: A case study in path analysis. Journal of Educa-
tional Statistics, 12, 101−128.
Fromkin, H. L., & Streufert, S. (1976). Laboratory experimentation. In M. D. Dunnette
(Ed.). Handbook of industrial and organizational psychology (pp. 415–465). Chi-
cago, IL: Rand McNally.
Games, P. A. (1990). Correlation and causation: A logical snafu. Journal of Experimental
Education, 58, 239–246.
Hackman, J. R., & Oldham, G. R. (1976). Motivation through the design of work: Test of a
theory. Organizational Behavior and Human Performance, 16, 250–279.
Hosoda, M., Stone-Romero, E. F., & Coats, G. (2003). The effects of physical attractive-
ness on job-related outcomes: A meta-analysis of experimental studies. Personnel
Psychology, 56, 431–462.
James, L. R. (2008). On the path to mediation. Organizational Research Methods, 11,
359–363.
James, L. R., Mulaik, S. A., & Brett, J. M. (2006). A tale of two methods. Organizational
Research Methods, 9, 233–244.
Judge, T. A., Locke, E. A., Durham, C. C., & Kluger, A. N. (1998). Dispositional effects on
job and life satisfaction: The role of core evaluations. Journal of Applied Psychol-
ogy, 83, 17–34.
Judge, T. A., Thoresen, C. J., Bono, J. E., & Patton, G. K. (2001). The job satisfaction-job
performance relationship: A qualitative and quantitative review. Psychological Bul-
letin, 127, 376–407.
Kalish, D., & Montague, R. (1964). Logic: Techniques of formal reasoning. New York,
NY: Harcourt, Brace, & World.
Kenny, D. A. (1979). Correlation and causality. New York, NY: Wiley.
Kenny, D. A. (2008). Reflections on mediation. Organizational Research Methods, 11,
353–358.
Koslowsky, M. (1991). A longitudinal analysis of job satisfaction, commitment, and inten-
tion to leave. Applied Psychology: An International Review, 40, 405–415.
Krueger, D. C., Stone, D. L., & Stone-Romero, E. F. (2014). Applicant, rater, and
job factors related to weight-based bias. Journal of Managerial Psychology, 29,
164–186.
Lance, C. E. (1991). Evaluation of a structural model relating job satisfaction, organiza-
tional commitment, and precursors to voluntary turnover. Multivariate Behavioral
Research, 26, 137–162.
Locke, E. A. (1986). Generalizing from laboratory to field settings: Research findings from
industrial-organizational psychology, organizational behavior, and human resource
management. Lexington, MA: Lexington Books.
Mathieu, J. E., & Taylor, S. R. (2006). Clarifying conditions and decision points for me-
diational type inferences in organizational behavior. Journal of Organizational Be-
havior, 27, 1031–1056.
Morgan, S. L., & Winship, C. (2014). Counterfactuals and causal inference. New York,
NY: Cambridge University Press.
Noe, R. A. (2017). Employee training and development (7th ed.). Burr Ridge, IL: McGraw-Hill/Irwin.
Physicians Committee for Responsible Medicine. (2018). Retrieved 14 December 2018
from: https://www.pcrm.org/ethical-science/animal-testing-and-alternatives/human-
relevant-alternatives-to-animal-tests
Pirlott, A. G., & MacKinnon, D. P. (2016). Design approaches to experimental mediation.
Journal of Experimental Social Psychology, 66, 29–38.
Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect
effects in simple mediation models. Behavior Research Methods, Instruments, &
Computers, 36, 717–731.
Preacher, K. J., & Hayes, A. F. (2008). Contemporary approaches to assessing mediation
in communication research. In A. F. Hayes, M. D. Slater, & L. B. Snyder (Eds.), The
SAGE sourcebook of advanced data analysis methods for communication research
(pp. 13–54). Thousand Oaks, CA: Sage.
Riketta, M. (2008). The causal relation between job attitudes and performance: A meta-
analysis of panel studies. Journal of Applied Psychology, 93, 472–481.
Rogosa, D. (1987). Causal models do not support scientific conclusions: A comment in
support of Freedman. Journal of Educational Statistics, 12, 185–195.
Rosenthal, R., & Rosnow, R. L. (2008). Essentials of behavioral research: Methods and
data analysis (3rd ed.). New York, NY: McGraw-Hill.
Rosopa, P. J., & Stone-Romero, E. F. (2008). Problems with detecting assumed mediation
using the hierarchical multiple regression strategy. Human Resource Management
Review, 18, 294–310.
Salaspuro, V., & Salaspuro, M. (2004). Synergistic effect of alcohol drinking and smoking on in
vivo acetaldehyde concentration in saliva. International Journal of Cancer, 111,
480–483.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experi-
mental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Spencer, S. J., Zanna, M. P., & Fong, G. T. (2005). Establishing a causal chain: Why experi-
ments are often more effective than mediation analyses in examining psychological
processes. Journal of Personality and Social Psychology, 89, 845–851.
Stone, D. L., & Stone, E. F. (1985). The effects of feedback consistency and feedback
favorability on self-perceived task competence and perceived feedback accuracy.
Organizational Behavior and Human Decision Processes, 36, 167–185.
Stone, E. F., Stone, D. L., & Dipboye, R. L. (1992). Stigmas in organizations: Race, handi-
caps, and physical attractiveness. In K. Kelley (Ed.), Issues, theory, and research
in industrial/organizational psychology (pp. 385–457). Amsterdam, Netherlands:
Elsevier Science Publishers.
Stone-Romero, E. F. (2002). The relative validity and usefulness of various empirical re-
search designs. In S. G. Rogelberg (Ed.), Handbook of research methods in indus-
trial and organizational psychology (pp. 77–98). Malden, MA: Blackwell.
Stone-Romero, E. F. (2008). Strategies for improving the validity and utility of research in
human resource management and allied disciplines. Human Resource Management
Review, 18, 205–209.
Stone-Romero, E. F. (2009). Implications of research design options for the validity of in-
ferences derived from organizational research. In D. Buchanan & A. Bryman (Eds.),
Handbook of organizational research methods (pp. 302–327). London, UK: Sage.
Stone-Romero, E. F. (2010). Research strategies in industrial and organizational psychol-
ogy: Nonexperimental, quasi-experimental, and randomized experimental research
in special purpose and nonspecial purpose settings. In S. Zedeck (Ed.), Handbook of
industrial and organizational psychology (pp. 35–70). Washington, DC: American
Psychological Association Press.
Stone-Romero, E. F., & Gallaher, L. (2006, May). Inappropriate use of causal language in
reports of non-experimental research. Paper presented at the meeting of the Society
for Industrial and Organizational Psychology. Dallas, TX.
Stone-Romero, E. F., & Rosopa, P. J. (2004). Inference problems with hierarchical multiple
regression-based tests of mediating effects. Research in Personnel and Human Re-
sources Management, 23, 249–290.
Stone-Romero, E. F., & Rosopa, P. J. (2008). The relative validity of inferences about
mediation as a function of research design characteristics. Organizational Research
Methods, 11, 326–352.
Stone-Romero, E. F., & Rosopa, P. (2010). Research design options for testing mediation
models and their implications for facets of validity. Journal of Managerial Psychol-
ogy, 25, 697–712.
Stone-Romero, E. F., & Rosopa, P. (2011). Experimental tests of mediation models: Pros-
pects, problems, and some solutions. Organizational Research Methods, 14, 631–
646.
Wanous, J. P. (1974). A causal-correlational analysis of the job satisfaction and perfor-
mance relationship. Journal of Applied Psychology, 59, 139–144.
Wiener, Y., & Vardi, Y. (1980). Relationships between job, organization, and career com-
mitments and work outcomes: An integrative approach. Organizational Behavior
and Human Performance, 26, 81–96.
Williams, L. J., & Hazer, J. T. (1986). Antecedents and consequences of satisfaction and
commitment in turnover models: A reanalysis using latent variable structural equa-
tion methods. Journal of Applied Psychology, 71, 219–231.
Wood, R. E., Goodman, J. S., Beckmann, N., & Cook, A. (2008). Mediation testing in
management research: A review and proposals. Organizational Research Methods,
11, 270–295.
CHAPTER 4

HETEROSCEDASTICITY IN
ORGANIZATIONAL RESEARCH
Amber N. Schroeder, Patrick J. Rosopa,
Julia H. Whitaker, Ian N. Fairbanks, and Phoebe Xoxakos

Variance plays an important role in theory and research in human resource man-
agement and related fields. Variance refers to the dispersion of scores or residuals
around a mean or, more generally, a predicted value (Salkind, 2007, 2010). In the
general linear model, the mean square error provides an estimate of the population
error variance (Fox, 2016). As the mean square error decreases, in general, so does
the estimated variability of the population errors. In general linear
models, it is assumed that the population error variance is constant across cases
(i.e., observations in a sample). This assumption is known as homoscedasticity, or
homogeneity of variance (Fox, 2016; King, Rosopa, & Minium, 2018; Rencher,
2000). When the homoscedasticity assumption is violated, it is referred to as het-
eroscedasticity, or heterogeneity of variance (Fox, 2016; Rosopa, Schaffer, &
Schroeder, 2013). When heteroscedasticity is present in the general linear model,
this results in incorrect standard errors, which can lead to biased Type I error rates
and reduced statistical power (Box, 1954; DeShon & Alexander, 1996; White,
1980; Wilcox, 1997). This can threaten the statistical conclusion validity of a
study (Shadish, Cook, & Campbell, 2002). Notably, heteroscedasticity has been
found in a variety of organizational and psychological research contexts (Agui-
nis & Pierce, 1998; Antonakis & Dietz, 2011; Ostroff & Fulmer, 2014), thereby
prompting research regarding best practices for detecting changes in residual vari-
ance and mitigating its negative effects (Rosopa et al., 2013).
In the present paper, we discuss how change in residual variance (i.e., heterosce-
dasticity) can be more than a violated statistical assumption. In some instances,
heteroscedasticity can be of substantive theoretical importance. For instance, Bryk
and Raudenbush (1988) proposed that heteroscedasticity may be an indicator of
unmeasured individual difference moderators in studies where treatment effects are
measured. Thus, the focus of this paper is twofold: First, we highlight five areas
germane to human resource management and related fields in which changes in
variance provide a theoretical and/or empirical contribution to research and prac-
tice. Namely, we describe how the examination of heteroscedasticity can contribute
to the understanding of organizational phenomena across five research topics: (a)
stress interventions, (b) aging and individual differences, (c) skill acquisition and
training, (d) groups and teams, and (e) organizational climate.
Second, we describe several data analytic approaches that can be used to detect
heteroscedasticity. These approaches, however, are discussed in the context of
various statistical analyses that are commonly used in human resource manage-
ment and related fields. We consider (a) testing for the equality of two indepen-
dent means, (b) analysis of variance, (c) analysis of covariance, and (d) multiple
linear regression.

SUBSTANTIVE HETEROSCEDASTICITY
IN ORGANIZATIONAL RESEARCH
Even though error variance equality is an assumption of the general linear model,
in some instances, heteroscedasticity may be more than a violated assumption;
rather, it could be theoretically important. In the following sections, we provide
examples of substantively meaningful heteroscedasticity in organizational re-
search.

Stress Intervention
Stress management is a topic of interest in several psychological specialties,
including organizational and occupational health psychology. For organizations,
stress can result in decreased job performance (Gilboa, Shirom, Fried, & Cooper,
2008), increased absenteeism (Darr & Johns, 2008), turnover (Podsakoff, LePine,
& LePine, 2007), and adverse physical and mental health outcomes (Schaufeli
& Enzmann, 1998; Zhang, Zhang, Ng, & Lam, 2019). Thus, stress management
interventions are often implemented by organizations with the objective of re-
ducing stressors in the workplace (Jackson, 1983), teaching employees to better
manage stressors, or reducing the negative outcomes associated with stressors
(Ivancevich, Matteson, Freedman, & Phillips, 1990). Although several different
stress interventions exist (e.g., cognitive-behavioral approaches, relaxation ap-
proaches, multimodal approaches; Richardson & Rothstein, 2008; van der Klink,
Blonk, Schene, & van Dijk, 2001), stress interventions have one common goal: to
reduce stress and its negative consequences.
Stress intervention research often examines the reduction in strain or nega-
tive health outcomes of those in a treatment group compared to those in a con-
trol group (Richardson & Rothstein, 2008; van der Klink et al., 2001). However,
successful stress interventions may also result in less variability in stress-related
outcomes for those in the treatment group compared to those in the control group,
as has been demonstrated (although not explicitly predicted) in several studies
(e.g., Bond & Bunce, 2001; Galantino, Baime, Maguire, Szapary, & Farrar, 2005;
Jackson, 1983; Yung, Fung, Chan, & Lau, 2004). Thus, individual-level stress in-
terventions (DeFrank & Cooper, 1987; Giga, Noblet, Faragher, & Cooper, 2003)
may result in a reduction in the variability of reported strain (e.g., by reducing
individual differences in perceiving stressors, coping with stress, or recovering
from strain; LaMontagne, Keegel, Louie, Ostry, & Landsbergis, 2007), thereby
contributing to heterogeneity of variance when comparing those who underwent
the intervention to those who did not. This is consistent with the finding that
treatments can interact with individual difference variables to contribute to dif-
ferences in variability in outcomes (see e.g., Bryk & Raudenbush, 1988). Thus,
heteroscedasticity could be the natural byproduct of an effective stress interven-
tion, which provides an illustration of a circumstance in which heteroscedasticity
may be expected when testing for the equality of two independent means. Figure
4.1 provides an example of two independent groups where the means differ between a control group and an experimental group that underwent an intervention designed to reduce strain. Notably, the variance is smaller for those who received the intervention compared to those in the control group.

FIGURE 4.1. Plot of means for two independent groups (n = 100 in each group) with 95% confidence intervals, suggesting that the variability in the Intervention group is much smaller than the variability in the Control group.
As highlighted by transactional stress theory (Lazarus & Folkman, 1984), in-
dividual perceptions play an important role in the stress response process. For ex-
ample, Webster, Beehr, and Love (2011) found that the same work demand (e.g.,
workload, hours of work, job requirements) can be perceived as challenging to
one employee and encumbering to another. Therefore, individual differences have
the potential to produce heteroscedasticity in stress intervention effectiveness.
More specifically, individual factors such as one’s self-regulatory orientation (i.e.,
promotion- or prevention-focused; Byron, Peterson, Zhang, & LePine, 2018),
self-efficacy (Panatik, O’Driscoll, & Anderson, 2011), and perceptions of oth-
ers’ expectations for perfectionism (Childs & Stoeber, 2012) have been linked to
stress appraisal and subsequent stress outcomes. As such, stress interventions not
specifically addressing relevant individual differences (e.g., variability in apprais-
als of efficacy-related stress) may be differentially effective across individuals,
thereby resulting in heteroscedasticity in stress outcomes. In other words, within a
treatment condition, post-intervention stress outcome variability may be impacted
by individual difference variables (i.e., the intervention may be more effective
for specific individuals, thereby resulting in a greater reduction in stress outcome
variability for this subset of employees). As such, differences in variability may be
an important index for measuring the effectiveness of a stress intervention.

Aging and Individual Differences


A second area in which heteroscedasticity may make a substantive contribu-
tion is in aging research. Research on aging often utilizes simple linear regression
to examine relations between age and various outcomes to determine if abilities
such as memory or reaction time decline as individuals age (i.e., a negative regres-
sion slope when predicting memory or positive regression slope when predicting
reaction time). For example, as age increases, visual acuity (Spirduso, Francis, &
MacRae, 2005), fluid intelligence (Morse, 1993), decision making ability (Boyle
et al., 2012), and episodic memory (Backman, Small, & Wahlin, 2001) tend to
decline. Heteroscedasticity appears to exist in this area (Baltes & Baltes, 1990),
as research shows that for many tasks, older adults tend to have larger variations
in performance than do younger adults (Spirduso, Francis, & MacRae, 2005).
However, these declines may be more pronounced for certain individuals due to
a variety of individual differences. Namely, Christensen et al. (1999) found that
physical strength, depression, illness, gender, and education level explained varia-
tion in cognitive functioning among older adults.
A number of other explanations for age-related changes in variability have
been proposed, including increased opportunities for gene expression (Plomin &
Thompson, 1986), environmental changes and life experiences, discrepancies in
the rate of change in various biological systems (Spirduso et al., 2005), or the
greater prevalence of health-related problems in older adults (Backman et al.,
2001; Baltes & Baltes, 1990). Each of these factors can lead to increased individu-
al differences in older adults, which has important implications for organizational
(Moen, Kojola, & Schaefers, 2017; Ng & Feldman, 2008; Taylor & Bisson, 2019)
and aging (Colcombe, Kramer, Erikson, & Scalf, 2005; Froehlich, Beausaert, &
Segers, 2016; Kotter-Grühn, Kornadt, & Stephan, 2016) research. Namely, when
researchers examine how outcomes change as a function of age using a simple lin-
ear regression model, it may be appropriate to test for heteroscedasticity and mod-
el it accordingly (Rosopa et al., 2013). Figure 4.2 depicts a scatterplot of memory
and age. The figure also includes the fitted line from the ordinary least squares
regression of memory on age. In addition to a statistically significant negative
slope, it should be evident that the residual variance is changing as a function of
age. Specifically, residual variance increases as age increases. For example, when
age is equal to 30, the residual variance is smaller compared to when age is equal
to 50 and when age is equal to 70.

FIGURE 4.2. Simple linear regression predicting memory with age, suggesting that residual variance increases as age increases.
Socioemotional selectivity theory (SST; Carstensen, 1995) suggests an
age-related dependency in social relationship motivation as a result of coping ef-
forts related to declines in physical and cognitive abilities. Specifically, SST pro-
poses that because older adults view their remaining time in life to be shorter than
younger adults, this motivates older adults to gain emotional satisfaction from
social relationships, as opposed to focusing on acquiring resources in their social
interactions (which is more common in younger adults). However, previous work
has demonstrated a discrepancy between chronological age and felt age, particularly among older adults, such that older adults often report feeling younger than
their chronological age (Barak, 2009). Thus, it is possible that adults with the
same chronological age may have varying perceptions of felt age, which could
impact their strategies for seeking and maintaining social relationships. As such,
for chronologically older adults, those with a lower felt age may react similar to
younger adults (i.e., by engaging in social relationships for instrumental purpos-
es), whereas those with a higher felt age may respond more in line with SST (i.e.,
by focusing on emotional connectivity in interpersonal relationships). As such,
there would be greater heteroscedasticity in motives for social interactions for
older adults compared to younger adults due to greater variability in perceptions
of time remaining in life (Carstensen, Isaacowitz, & Charles, 1999). Therefore,
an examination of variance dispersion as a function of both chronological and felt
age may provide an important theoretical contribution.
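To make the pattern in Figure 4.2 concrete, the following brief Python sketch (using the numpy library) simulates data in which the residual variance of a memory score grows with age and fits an ordinary least squares regression. All variable names, parameter values, and data are hypothetical and purely illustrative; they are not drawn from the studies cited above.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical illustration of the pattern in Figure 4.2: the residual variance
# of a memory score increases with age (all values are simulated).
n = 300
age = rng.uniform(20, 80, n)
error = rng.normal(0, 0.05 * age, n)   # error SD grows with age (heteroscedasticity)
memory = 100 - 0.4 * age + error

# Ordinary least squares fit of memory on age.
slope, intercept = np.polyfit(age, memory, 1)
residuals = memory - (intercept + slope * age)

# Compare the residual spread for younger versus older participants.
young, old = residuals[age < 40], residuals[age > 60]
print(f"slope = {slope:.2f}")
print(f"residual SD, age < 40: {young.std(ddof=1):.2f}")
print(f"residual SD, age > 60: {old.std(ddof=1):.2f}")

In this simulated example, the residual standard deviation for older participants is noticeably larger than for younger participants, mirroring the fan-shaped scatter described for Figure 4.2.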

Skill Acquisition and Training


Research on skill acquisition and training is yet another area in which theory-
consistent heteroscedasticity may be found. In general, training leads to an increase
in mean performance across individuals such that both low and high performing
individuals tend to demonstrate increased performance as a result of training (Ack-
erman, 2007). However, training may also impact variability in performance. A
seminal theory in the training literature, Ackerman’s (1987) theory of automatic and
controlled processing, provides a framework by which individuals process infor-
mation and develop skills. Through practice, tasks that require recurring skills and
procedures may become automatic (i.e., they can be completed quickly and effort-
lessly with little to no thought). These tasks should have a performance asymptote
(Campbell, McCloy, Oppler, & Sager, 1993) such that additional training beyond
the performance plateau does not increase performance. Examples include driving,
typing, and reading. Conversely, Ackerman (1987) defined controlled information
processing as a much slower and more effortful form of information processing.
Tasks that are more inconsistent (i.e., they require the use of multiple skills and/or
problem-solving abilities) may require controlled information processing and con-
scious thought to complete. Even with training, these tasks require significant atten-
tion and effort (i.e., continued controlled information processing).
Ackerman and colleagues (Ackerman, 1987; Ackerman & Cianciolo, 2000;
Kanfer & Ackerman, 1989) conducted a series of studies demonstrating that for
tasks requiring automatic processing, increased training leads to a decrease in per-
formance variability among individuals, whereas for controlled tasks, increased
training leads to an increase in individual performance variability. Thus, chang-
es in the variability of performance are a function of the characteristics of the
skill being trained. Specifically, because task automaticity decreases the impact
of individual differences (e.g., intelligence, attention span, or working memory)
on performance, the variability in performance may decrease among individuals
completing tasks inducing automatic processing. On the other hand, for controlled
processing tasks, variability in performance across individuals may remain con-
stant regardless of time spent in training (Ackerman, 1987), or increase in situ-
ations in which a lack of problem-solving skills causes some individuals to fall
further behind in performance, in comparison to other trainees (Ackerman, 2007).
Consistent with this supposition, across two air traffic controller tasks, Ackerman
and Cianciolo (2000) demonstrated decreased performance variability for a task
requiring automatic processing and increased performance variability on a con-
trolled processing task.
In sum, Ackerman (1987) theorized that task characteristics determine the role
of individual differences in task performance. If a task is consistent and becomes
automatic through training, individual differences will be less predictive of per-
formance. For this reason, variability in performance decreases with increased
training. If a task is inconsistent and requires controlled processing even after ex-
tensive training, individual differences will remain predictive of performance. For
this reason, variability may remain fairly constant or even increase with additional
training. It is important to note that we focus on changes in the variability of an
outcome (i.e., performance) across groups while another predictor changes (i.e.,
amount of training). Although there may be positive slopes (i.e., a mean increase
in performance with more training), the residuals around the predicted regression
surface may not remain constant, but instead decrease (for automatic processing
tasks) or increase (for controlled processing tasks) as a function of the amount of
training. Thus, in this dual processing theory, heteroscedasticity may be implicit.

Groups and Teams


Another substantive domain where variability may have important implica-
tions is research on groups and teams. A variety of measurement techniques have
been used to describe team composition variables, including examining means,
minimum or maximum values, and team member variability (e.g., standard devia-
tions) for variables of interest (Barrick, Stewart, Neubert, & Mount, 1998; Bell,
2007). Related to heteroscedasticity, studies examining team member diversity
(i.e., score variability across individual team members) have positively linked
variability in team member extraversion and emotional stability to team job per-
formance (Neuman, Wagner, & Christiansen, 1999) and dispersion in team mem-
ber work values with individual team member performance (Chou, Wang, Wang,
Huang, & Cheng, 2008). In addition, Barrick et al. (1998) found that variability
in team member conscientiousness was inversely related to team performance,
and De Jong, Dirks, and Gillespie (2016) demonstrated that team differentiation
in terms of specialized skills and decision-making power moderated the positive
relation between intrateam trust and team performance, such that stronger rela-
tions emerged for teams with greater team member differentiation. Thus, in each
of these cases, a consideration of heteroscedasticity related to various team com-
position factors explained additional variance in relations of interest.
Taking this a step further, Horwitz and Horwitz (2007) conducted a meta-
analysis to examine how various types of team diversity impact team outcomes.
Their findings indicated that task-related diversity (i.e., variability in attributes
relevant to task completion, such as expertise) was positively related to team per-
formance quality and quantity, whereas demographic diversity (i.e., dispersion
in observable individual category memberships, such as in age and race/ethnic-
ity subgroups) was unrelated to team performance. Notably, however, later work
suggested that demographic diversity may in some cases be negatively related
to group performance when subjective (but not objective) performance metrics
are employed (van Dijk, van Engen, & van Knippenberg, 2012). Further, tempo-
ral examinations of team diversity suggested that demographic diversity within
teams may become advantageous over time due to team members’ shifting focus
from surface-level attributes (i.e., demographics) to more task-relevant individual
characteristics (Harrison, Price, Gavin, & Florey, 2002).
Additionally, in an examination of the impact of group cohesion on decision-
making quality as a function of groupthink (i.e., “a mode of thinking that people
engage in when they are deeply involved in a cohesive ingroup, when the mem-
bers’ striving for unanimity override their motivation to realistically appraise alter-
native courses of action”; Janis, 1972, p. 9), Mullen, Anthony, Salas, and Driskell
(1994) demonstrated that team decision-making quality was positively related to
group homogeneity in task commitment and inversely related to interpersonal
attraction-related cohesion. Taken together, research on organizational groups and
teams has benefited from an examination of the impact of heteroscedasticity in
team composition. Thus, we encourage future work to continue to explore how
heterogeneity of variance contributes to our understanding of phenomena related
to organizational groups and teams, including the consideration of new perspec-
tives, such as the real-time impact of diversity changes on team functioning (see
e.g., dynamic team diversity theory; Li, Meyer, Shemla, & Wegge, 2018).

Organizational Climate
Heteroscedasticity is also a factor of interest in organizational climate re-
search. Organizational climate has been defined as experience-based perceptions
of organizational environments based on attributes such as policies, procedures,
and observed behaviors (Ostroff, Kinicki, & Muhammad, 2013; Schneider, 2000).
Although early climate research approached organizational climate broadly (i.e.,
a molar approach), later work examined climate through a more focused lens
(see Schneider, Ehrhart, & Macey, 2013), emphasizing that different climate types
can exist within an organization (e.g., customer service, safety, and innovation
climates). Organizational climate has been a topic of considerable interest to or-
ganizational researchers, as various climate types have been linked to a number
of work outcomes. For example, innovative organizational climate has been posi-
tively linked to creative performance (Hsu & Fan, 2010), perceived innovation
(Lin & Liu, 2012), and organizational performance (Shanker, Bhanugopan, van
der Heijden, & Farrell, 2017). Likewise, organizations with a more positive cus-
tomer service climate tend to have higher customer satisfaction and greater profits
(Schneider, Macey, Lee, & Young, 2009), and meta-analytic data demonstrated a
positive relation between safety climate and safety compliance (Christian, Brad-
ley, Wallace, & Burke, 2009).
Within organizational climate research, there has been a focus on understand-
ing how variability in perceptions of climate both across individuals and units
within organizations can influence associated organizational outcomes (Zohar,
2010). One way in which consensus in climate perceptions within an organization
has been examined is by assessing climate strength, which Schneider, Salvaggio,
and Subirats (2002) summarize quite succinctly as “within-group variability in
climate perceptions [such that] the less within-group variability, the stronger the
climate” (p. 220). Climate strength is an example of a dispersion model (see Chan,
1998), in which the model measures the extent to which perceptions of a con-
struct vary, and within-group variability is treated as a focal construct (Dawson,
González-Romá, Davis, & West, 2008). Climate strength has been described as
a moderator of relations between organizational climate and organizational out-
comes, such that the effect of a particular climate (e.g., safety climate) is stronger
when climate strength is high (Schneider et al., 2002, 2009; Shin, 2012). Yet other
work suggested that climate strength may be curvilinearly related to organiza-
tional performance in some contexts, such that performance peaks at moderate
levels of climate strength (Dawson et al., 2008).
In sum, organizational climate research has benefited from the consideration
of heteroscedasticity as a meaningful attribute. Thus, we encourage researchers to
move beyond the presumption that systematic differences in variance in organi-
zational data simply be viewed as a violated statistical assumption that should be
corrected, but rather, consider whether heteroscedasticity may provide a meaning-
ful contribution to underlying theory and empirical models.

Summary
The above sections reviewed various substantive research areas where the
change in variance may be of theoretical or practical importance. For example,
although a stress intervention may result in lower strain for those in a treatment
group compared to those in a control group, a smaller variance for those in the
treatment group compared to those in the control group could also be meaningful
(see Figure 4.1). Because researchers may not typically test for changes in vari-
ance, we review extant data analytic procedures in the following section.

DATA ANALYTIC PROCEDURES


A variety of statistical approaches exist for conducting tests on variances or chang-
es in residual variance. Because these tests would likely be used in tandem with
commonly used statistical procedures (e.g., test on two independent means, simple
linear regression), we organize this section based on such approaches. We discuss
procedures that can be used in (a) tests of the equality of two independent means, (b)
analysis of variance, (c) analysis of covariance, and (d) multiple linear regression.
It deserves noting that the sections below are all special cases of the general
linear model. That is, for each of n observations, a quantitative dependent variable
(y) can be modeled using a set of p predictor variables (x1, x2,…, xp ) plus some
unknown population error term. That is, in matrix form, tests on two independent
means, analysis of variance, analysis of covariance, linear regression, moderated
multiple regression, and polynomial regression are subsumed by the general lin-
ear model:
y = Xb + e (1)

where y is an n × 1 vector of observations for the dependent variable, X is an n ×
(p + 1) model matrix including a leading column of 1s, b is a (p + 1) × 1 vector of
population regression coefficients, and e is an n × 1 vector of population errors.
b is typically estimated using ordinary least squares (OLS) estimation, where the OLS-based estimates are b = (X′X)⁻¹X′y, and OLS-based estimates of the population errors are known as residuals (Rencher, 2000). Here, the population errors are assumed to have a mean of 0 and a constant variance (σ²). When the constant variance assumption is violated, this is known as heteroscedasticity (Fox, 2016; Rosopa et al., 2013). It deserves noting that the normality assumption is not required for Equation 1 to be valid.
However, when the population errors follow a normal distribution, this allows for
statistical inferences on the model and its coefficients including hypothesis tests
and confidence intervals (Rencher, 2000). For each of the four sections below, the
model matrix (X) changes and we describe various procedures that can be used
to test whether the variance changes as a function of one of the columns in X or
some other variable (e.g., fitted values).
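As a concrete illustration of Equation 1 and OLS estimation, the following Python sketch (using the numpy library) simulates a small data set with one dummy-coded predictor and one continuous predictor and then computes b = (X′X)⁻¹X′y along with the residuals. The predictor names and population values are hypothetical and are chosen only for illustration.

import numpy as np

rng = np.random.default_rng(1)

# Simulated illustration of Equation 1: one dummy-coded group variable and one
# continuous predictor, so p = 2 and X is n x (p + 1). All values are hypothetical.
n = 200
group = rng.integers(0, 2, n)                 # dummy variable (0 or 1)
x = rng.normal(50, 10, n)                     # continuous predictor
X = np.column_stack([np.ones(n), group, x])   # leading column of 1s

b_pop = np.array([2.0, 1.5, 0.3])             # hypothetical population coefficients
y = X @ b_pop + rng.normal(0, 2, n)           # constant-variance (homoscedastic) errors

# OLS estimates b = (X'X)^(-1) X'y and the residuals e = y - Xb.
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
mse = resid @ resid / (n - X.shape[1])        # estimate of the error variance
print("OLS estimates:", np.round(b, 3))
print("Mean square error:", np.round(mse, 3))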

Testing the Equality of Two Independent Means


In research involving two independent groups, the independent samples t sta-
tistic is often used to test whether two population means differ from one another.
For example, a researcher might assess whether the mean on a dependent variable
differs between the treatment and control group. Independent of this test on two
means, a researcher may want to assess whether the variances between two inde-
pendent groups differ. If the population variances differ, this is commonly known
as heterogeneity of variance or heteroscedasticity (Rosopa et al., 2013). In this
situation, p = 1 for a dummy-variable representing group membership in one of
two groups, and X is n × 2.
One approach for testing whether two population variances differ from one
another is due to Hartley (1950). However, because Box (1954) demonstrated that
this test does not adequately control Type I error, it is not recommended. Instead,
two other tests are recommended here.
Bartlett (1937) proposed a test that transforms independent variances. The test
statistic is approximately distributed as χ² with degrees of freedom equal to the
number of groups minus 1. However, Box (1954) noted that this test can be sensi-
tive to departures from normality.
In instances where the normality assumption is violated, Brown and Forsythe’s
(1974) procedure is recommended. This approach is a modified version of
Levene’s (1960) test. Specifically, a two-sample t-test can be conducted on the
absolute value of the residuals. However, instead of calculating the absolute value
of the residuals from the mean, the absolute value of the residuals is calculated
using the median for each group. For a review of Bartlett's (1937) and Brown and Forsythe's (1974) procedures, see Rosopa et al. (2013) and Rosopa, Schroeder, and Doll (2016).
Thus, although a researcher may be interested in testing whether the mean for
one group differs significantly from the mean of another group, if the researcher
also suspects that the variances differ as a function of group membership (see e.g.,
Figure 4.1), two statistical approaches are recommended. Bartlett’s (1937) test or
Brown and Forsythe’s (1974) test can be used. If the test is statistically significant
at some fixed Type I error rate (α), the researcher can conclude that the population
variances differ from one another.
It deserves noting that if a researcher finds evidence that the variances are
not the same between the two groups (i.e., heteroscedasticity exists), the conven-
tional Student’s t statistic should not be used to test for mean differences. Instead,
Welch’s t statistic should be used; this procedure allows for variances to be esti-
mated separately for each group, and, with Satterthwaite’s corrected degrees of
freedom, provides a more robust test for mean differences between two indepen-
dent groups regardless of whether the homoscedasticity assumption is violated
(Zimmerman, 2004).
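A minimal Python sketch of this two-group workflow, using the scipy library and simulated data patterned loosely after Figure 4.1, is shown below. In scipy, Brown and Forsythe's (1974) test is obtained by requesting Levene's test with median centering; the group labels, means, and variances are hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical two-group data mirroring Figure 4.1: the intervention group has a
# lower mean strain and a smaller variance than the control group.
control = rng.normal(loc=5.0, scale=1.5, size=100)
intervention = rng.normal(loc=4.0, scale=0.6, size=100)

# Bartlett's (1937) test (sensitive to departures from normality).
bart_stat, bart_p = stats.bartlett(control, intervention)

# Brown and Forsythe's (1974) test: Levene's test using deviations from the medians.
bf_stat, bf_p = stats.levene(control, intervention, center='median')

# Given unequal variances, compare the means with Welch's t rather than Student's t.
welch_t, welch_p = stats.ttest_ind(control, intervention, equal_var=False)

print(f"Bartlett:       chi-square = {bart_stat:.2f}, p = {bart_p:.4f}")
print(f"Brown-Forsythe: F = {bf_stat:.2f}, p = {bf_p:.4f}")
print(f"Welch t:        t = {welch_t:.2f}, p = {welch_p:.4f}")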

Analysis of Variance
In a one-way analysis of variance, the population means on the dependent vari-
able are believed to be different (in some way) across two or more independent
groups. Assuming that the population error term in Equation 1 is normally dis-
tributed, the test statistic is distributed as an F random variable (Rencher, 2000).
However, in addition to tests on two or more means, a researcher may be interest-
ed in testing whether variance changes systematically across two or more groups.
For example, with three groups, the variance may be large for the control group,
but small for treatment A and treatment B. With three independent groups, be-
cause there are two dummy-variables for group membership, p = 2 and X is n × 3.
In the case of a one-way analysis of variance, Bartlett’s (1937) test and Brown
and Forsythe’s (1974) test are also suggested. However, Brown and Forsythe’s
(1974) test becomes, more generally, an analysis of variance on the absolute value
of the residuals around the respective medians. Thus, if the χ² test or the F test, respectively, is statistically significant at α, this suggests that the variances are dif-
ferent among the groups. Note that with three independent groups there are three
pairwise comparisons that can be conducted. However, there are only two linearly
independent comparisons (i.e., two degrees of freedom). If a researcher were to
conduct additional tests to isolate which of the three groups had significantly dif-
ferent variances, a Bonferroni correction procedure is recommended.
For a factorial analysis of variance, a researcher may be interested in main effects
for each categorical predictor (i.e., marginal mean differences) as well as possible
interactions between categorical predictors. However, independent of the mean dif-
ferences, a researcher may also be interested in testing the main and interactive
effects of the residual variances. O’Brien (1979, 1981) developed an approach that
could be used to test the main and interactive effects of the variances in the cells of
both one-way and factorial designs. See also an extension by Rosopa et al. (2016).
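The following Python sketch (scipy, simulated data) illustrates the one-way case with three groups: Brown and Forsythe's (1974) test is first computed explicitly as an analysis of variance on the absolute deviations from the group medians, and then via scipy's built-in median-centered Levene test, which yields the same statistic. The group names and values are hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical one-way design: variance is large in the control group but small in
# treatments A and B.
control = rng.normal(10, 3.0, 60)
treat_a = rng.normal(12, 1.0, 60)
treat_b = rng.normal(13, 1.0, 60)
groups = [control, treat_a, treat_b]

# Brown and Forsythe's (1974) test, written out as a one-way ANOVA on the absolute
# deviations of the scores from their respective group medians.
abs_dev = [np.abs(g - np.median(g)) for g in groups]
f_stat, p_val = stats.f_oneway(*abs_dev)
print(f"ANOVA on |deviations from medians|: F = {f_stat:.2f}, p = {p_val:.4f}")

# Equivalent shortcut via scipy's median-centered Levene test.
w_stat, w_p = stats.levene(*groups, center='median')
print(f"scipy levene(center='median'):      W = {w_stat:.2f}, p = {w_p:.4f}")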

Analysis of Covariance
In analysis of covariance, a researcher is typically interested in examining
whether population differences on a dependent variable exist across multiple
groups. However, a researcher may have one or more continuous predictors that
they want to control statistically. Often, these continuous predictors (i.e., covari-
ates) are demographic variables (e.g., employee’s age), individual differences
(e.g., spatial ability), or a pretest variable. Assuming the simplest analysis of cova-
riance where a researcher has two independent groups and one covariate, because
there is one dummy-variable representing group membership and one covariate
(typically, centered), p = 2 and the model matrix (X) is n × 3. Here, the continu-
ous predictor is centered because in analysis of covariance researchers often are
interested in the adjusted means on the dependent variable where the adjustment
is at the grand mean of the continuous predictor (i.e., covariate) (Fox, 2016).
In analysis of covariance, residual variance can change as a function of not
only the categorical predictor, but also the continuous predictor (i.e., covariate).
For instances where a researcher suspects that the residual variance is changing
as a function of a categorical predictor, the procedures discussed above can be
used. Specifically, the OLS-based residuals from the analysis of covariance can
be saved. Then, either Bartlett’s (1937) test or Brown and Forsythe’s (1974) test
can be used to determine whether the residual variance changes as a function of
the categorical predictor. As noted above, with three or more groups, if additional
tests are conducted to isolate which of the groups had significantly different vari-
ances, a Bonferroni correction procedure is recommended.
In analysis of covariance, the residual variance could change as a function of
the continuous predictor. Here, a general approach is suggested, known as a score
test (Breusch & Pagan, 1979; Cook & Weisberg, 1983). This is discussed in the
next section on multiple linear regression.
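For the categorical-predictor case, one workable approach, sketched below in Python with the statsmodels and scipy libraries on simulated data, is to fit the analysis of covariance as a linear model, save the OLS residuals, and apply Brown and Forsythe's (1974) test to those residuals across groups. All variable names and parameter values are hypothetical.

import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(23)

# Hypothetical analysis of covariance: two groups and one centered covariate.
n = 100
group = np.repeat([0, 1], n)                   # dummy-coded group membership
covariate = rng.normal(40, 8, 2 * n)
cov_c = covariate - covariate.mean()           # centered covariate
error_sd = np.where(group == 0, 3.0, 1.0)      # residual SD differs by group
y = 5 + 2 * group + 0.4 * cov_c + rng.normal(0, error_sd)

# Fit the ANCOVA as a linear model and save the OLS residuals.
X = sm.add_constant(np.column_stack([group, cov_c]))
resid = sm.OLS(y, X).fit().resid

# Brown and Forsythe's (1974) test on the residuals across the two groups.
bf_stat, bf_p = stats.levene(resid[group == 0], resid[group == 1], center='median')
print(f"Brown-Forsythe on ANCOVA residuals: W = {bf_stat:.2f}, p = {bf_p:.4f}")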

Multiple Linear Regression


In multiple linear regression, a researcher typically has many predictors to pre-
dict a continuous dependent variable. A researcher may have categorical predic-
tors only (e.g., analysis of variance), continuous and categorical predictors (e.g.,
analysis of covariance), or categorical and continuous predictors along with func-
tions of the predictors (e.g., quadratic terms, product terms). Although such a
model can be increasingly complex, the overall model is still that shown in Equa-
tion 1. The model matrix (X) is n × (p + 1) where p denotes the number of predic-
tors/regressors in the overall model.
If the homoscedasticity assumption is violated, as noted above, it could be
due to a categorical predictor. In such instances, similar to the analyses discussed
above, Bartlett’s (1937) test or Brown and Forsythe’s (1974) test can be used.
When conducting multiple comparisons on the residual variances, Bonferroni
corrections are recommended.
For instances where the residual variance changes as a function of a continu-
ous predictor (e.g., a covariate in analysis of covariance), a general statistical
approach is available known as the score test. This test was independently de-
veloped in the econometrics (Breusch & Pagan, 1979) and statistics (Cook &
Weisberg, 1983) literatures. It can detect various forms of heteroscedasticity (i.e.,
change in residual variance). The test requires fitting two regression models. In
the first, the sum of squares error (SSE) from the full regression model of interest
is obtained. Then, in the second, the squared OLS residuals from the first analysis
are regressed on the variables purported to be the cause of the heteroscedasticity
(e.g., a continuous predictor), and the sum of squares regression (SSR) is obtained.
The test statistic, (SSR/2) ÷ (SSE/n)², is asymptotically distributed as χ² with de-
grees of freedom equal to the number of variables used to predict the squared OLS
residuals.
The score test is considered a general test for heteroscedasticity because it can
detect whether the residual variance changes as a function of categorical predic-
tors, continuous predictors, or predicted values (Kutner, Nachtsheim, Neter, & Li,
2005). Thus, it is a very flexible statistical approach and can be used not only for
multiple regression models, but also for two independent groups, analysis of vari-
ance, analysis of covariance, and models with interaction terms and higher-order
terms (e.g., quadratic or cubic terms) (Rosopa et al., 2013).
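A brief Python sketch of the score test is given below, using the het_breuschpagan function from the statsmodels library on simulated data in which the residual variance increases with one continuous predictor. The auxiliary regressors passed to the test (here, a constant plus that predictor) are the variables suspected of driving the heteroscedasticity; all names and values are hypothetical.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)

# Simulated regression in which residual variance grows with the first predictor.
n = 300
x1 = rng.uniform(20, 80, n)
x2 = rng.normal(0, 1, n)
y = 4 + 0.5 * x1 + 1.2 * x2 + rng.normal(0, 0.08 * x1)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# Score test (Breusch & Pagan, 1979; Cook & Weisberg, 1983): the squared OLS
# residuals are regressed on the variables suspected to drive the heteroscedasticity
# (here, a constant plus x1).
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(fit.resid, sm.add_constant(x1))
print(f"Score test: chi-square = {lm_stat:.2f}, p = {lm_p:.4f}")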

Summary
In this section, we reviewed statistical procedures commonly used in human
resource management, organizational psychology, and related disciplines. In ad-
dition, we discussed some data-analytic procedures that can be used to detect
changes in residual variance. It deserves noting that if a researcher finds evidence
to support their theory that variance changes as expected, this suggests that the
homoscedasticity assumption in general linear models is violated. Thus, although
a researcher may have found evidence that residual variance changes as a continuous predictor increases (see e.g., Figure 4.2), the use of OLS estimation in linear models is no longer optimal; the OLS estimates of the regression coefficients are no longer efficient (i.e., they no longer have minimum variance; Rencher, 2000). Thus, although parameter estimates remain unbiased
in the presence of heteroscedasticity, statistical inferences (e.g., hypothesis tests,
confidence intervals) involving means, regression slopes, and linear combinations
of regression slopes will be incorrect (Fox, 2016; Kutner et al., 2005). However,
more general solutions are available including weighted least squares regression
(Kutner et al., 2005) and heteroscedasticity-consistent covariance matrices (Fox,
2016; Ng & Wilcox, 2009, 2011). For brief reviews, see also Rosopa et al. (2013)
and Rosopa, Brawley, Atkinson, and Robertson (2018).
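As a rough illustration of these remedies, the following Python sketch (statsmodels, simulated data) fits the same heteroscedastic model twice: once with OLS and a heteroscedasticity-consistent (HC3) covariance matrix, and once with weighted least squares using weights proportional to the inverse of the error variance, which is known here only because the data are simulated. All names and values are hypothetical.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Simulated data with residual variance that grows with the predictor.
n = 300
x = rng.uniform(20, 80, n)
y = 4 + 0.5 * x + rng.normal(0, 0.08 * x)
X = sm.add_constant(x)

# (1) OLS with a heteroscedasticity-consistent covariance matrix (HC3 robust SEs).
ols_fit = sm.OLS(y, X).fit(cov_type='HC3')
print("HC3 standard errors:", np.round(ols_fit.bse, 4))

# (2) Weighted least squares, weighting each case by the inverse of its (here
# assumed known; in practice, modeled) error variance.
weights = 1.0 / (0.08 * x) ** 2
wls_fit = sm.WLS(y, X, weights=weights).fit()
print("WLS standard errors:", np.round(wls_fit.bse, 4))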

CONCLUSION
A major objective of this paper is to describe how heteroscedasticity can be more
than just a statistical violation. Rather, differences in residual variance could be a
necessary and implicit aspect of a theory or empirical study. We included exam-
ples from five organizational research domains in which heteroscedasticity may
provide a substantive contribution, thus highlighting that although changes in re-
sidual variance are often viewed to be statistically problematic, heteroscedasticity
can also contribute meaningfully to our understanding of various organizational
phenomena. Nevertheless, there are likely other topical areas germane to orga-
nizational contexts in which heteroscedasticity may occur (see e.g., Aguinis &
Pierce, 1998; Bell & Fusco, 1989; Dalal et al., 2015; Grissom, 2000). Thus, we
hope that this paper stimulates research that considers the impact of heteroscedas-
ticity, as heterogeneity of variance can serve as an important explanatory mecha-
nism that can provide insight into a variety of organizational phenomena. We
encourage researchers to consider whether there is a theoretical basis for a priori
expectations of heteroscedasticity in their data, as well as to consider whether un-
anticipated heterogeneity of variance may have substantive meaning. Stated dif-
ferently, although homogeneity of variance is a statistical assumption of the gen-
eral linear model, we suggest that researchers carefully consider whether changes
in residual variance can be attributed to other constructs in a nomological network
(Cronbach & Meehl, 1955). Overall, this can enrich both theory and practice in
human resource management and allied fields.

REFERENCES
Ackerman, P. L. (1987). Individual differences in skill learning: An integration of psycho-
metric and information processing perspectives. Psychological Bulletin, 102, 3–27.
doi:10.1037//0033-2909.102.1.3
Ackerman, P. L. (2007). New developments in understanding skilled performance.
Current Directions in Psychological Science, 16, 235–239. doi:10.1111/j.1467-
8721.2007.00511.x
Ackerman, P. L., & Cianciolo, A. T. (2000). Cognitive, perceptual-speed, and psychomotor
determinants of individual differences during skill acquisition. Journal of Experi-
mental Psychology: Applied, 6, 259–290. doi:10.1037//1076-898X.6.4.259
Aguinis, H., & Pierce, C. A. (1998). Heterogeneity of error variance and the assessment
of moderating effects of categorical variables: A conceptual review. Organizational
Research Methods, 1, 296–314. doi:10.1177/109442819813002
Antonakis, J., & Dietz, J. (2011). Looking for validity or testing it? The perils of stepwise
regression, extreme-scores analysis, heteroscedasticity, and measurement error. Per-
sonality and Individual Differences, 50, 409–415. doi:10.1016/j.paid.2010.09.014
Backman, L., Small, B. J., & Wahlin, A. (2001). Aging and memory: Cognitive and bio-
logical perspectives. In Birren, J. E., & Schaie, W. K. (Eds.), Handbook of the psy-
chology of aging (pp. 349–366). San Diego, CA: Academic Press.
Baltes, P. B., & Baltes, M. M. (1990). Psychological perspectives on successful aging: The
model of selective optimization with compensation. In P. B. Baltes, & M. M. Baltes
(Eds.), Successful aging: Perspectives from the behavioral sciences (pp. 1–34). New
York, NY: Cambridge University Press.
Barak, B. (2009). Age identity: A cross-cultural global approach. International Journal of
Behavioral Development, 33, 2–11. doi:10.1177/0165025408099485
Barrick, M. R., Stewart, G. L., Neubert, M. J., & Mount, M. K. (1998). Relating member ability and personality to work-team processes and team effectiveness. Journal of Applied Psychology, 83, 377–391. doi:10.1037/0021-9010.83.3.377
Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the
Royal Society, A160, 268–282. doi:10.1098/rspa.1937.0109
Bell, P. A., & Fusco, M. E. (1989). Heat and violence in the Dallas field data: Linearity, curvilinearity, and heteroscedasticity. Journal of Applied Social Psychology, 19, 1479–1482. doi:10.1111/j.1559-1816.1989.tb01459.x
Bell, S. T. (2007). Deep-level composition variables as predictors of team performance: A meta-analysis. Journal of Applied Psychology, 92, 595–615. doi:10.1037/0021-9010.92.3.595
Bond, F. W., & Bunce, D. (2001). Job control mediates change in a work reorganization
intervention for stress reduction. Journal of Occupational Health Psychology, 6,
290–302. doi:10.1037//1076-8998.6.4.290
Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of
variance problems, I. Effect of inequality of variance in the one-way classification.
Annals of Mathematical Statistics, 25, 290–302. doi:10.1214/aoms/1177728786
Boyle, P. A., Yu, L., Wilson, R. S., Gamble, K., Buchman, A. S., & Bennett, D. A. (2012).
Poor decision making is a consequence of cognitive decline among older persons
without Alzheimer’s disease or mild cognitive impairment. PLOS One, 7, 1–5.
doi:10.1371/journal.pone.0043647
Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroscedasticity and random
coefficient variation. Econometrica, 47, 1287–1294. doi:10.2307/1911963
Brown, M. B., & Forsythe, A. B. (1974). Robust test for the equality of variances. Journal
of the American Statistical Association, 69, 364–367. doi:10.2307/2285659
Bryk, A. S., & Raudenbush, S. W. (1988). Heterogeneity of variance in experimental stud-
ies: A challenge to conventional interpretations. Psychological Bulletin, 104, 396–
404. doi:10.1037//0033-2909.104.3.396
Byron, K., Peterson, S. J., Zhang, Z., & LePine, J. A. (2018). Realizing challenges and
guarding against threats: Interactive effects of regulatory focus and stress on perfor-
mance. Journal of Management, 44, 3011–3037. doi:10.1177/0149206316658349
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of perfor-
mance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations
(pp. 35–70). San Francisco, CA: Jossey-Bass.
Carstensen, L. L. (1995). Evidence for a life-span theory of socioemotional selectivity.
Current Directions in Psychological Science, 4, 151–156. doi:10.1111/1467-8721.
ep11512261
Carstensen, L. L., Isaacowitz, D. M., & Charles, S. T. (1999). Taking time seriously:
A theory of socioemotional selectivity. American Psychologist, 54, 165–181.
doi:10.1037/0003-066X.54.3.165
Chan, D. (1998). Functional relationships among constructs in the same content domain at
different levels of analysis: A typology of composition models. Journal of Applied
Psychology, 83, 234–246. doi:10.1037/0021-9010.83.2.234
Childs, J. H., & Stoeber, J. (2012). Do you want me to be perfect? Two longitudinal studies
on socially prescribed perfectionism, stress and burnout in the workplace. Work &
Stress, 26, 347–364. doi:10.1080/02678373.2012.737547
Chou, L., Wang, A., Wang, T., Huang, M., & Cheng, B. (2008). Shared work values and
team member effectiveness: The mediation of trustfulness and trustworthiness. Hu-
man Relations, 61, 1713–1742. doi:10.1177/0018726708098083
Christensen, H., Mackinnon, A. J., Korten, A. E., Jorm, A. F., Henderson, A. S., Jacomb,
P., & Rodgers, B. (1999). An analysis of diversity in the cognitive performance of
elderly community dwellers: Individual differences in change scores as a function of
age. Psychology and Aging, 14, 365–379.
Christian, M. S., Bradley, J. C., Wallace, J. C., & Burke, M. J. (2009). Workplace safety:
A meta-analysis of the roles of person and situation factors. Journal of Applied Psy-
chology, 94, 1103–1127. doi:10.1037/a0016172
Colcombe, S. J., Kramer, A. F., Erickson, K. I., & Scalf, P. (2005). The implications of
cortical recruitment and brain morphology for individual differences in inhibitory
function in aging humans. Psychology and Aging, 20, 363–375. doi:10.1037/0882-
7974.20.3.363
Cook, R. D., & Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression.
Biometrika, 70, 1–10. doi:10.2307/2335938
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psycho-
logical Bulletin, 52, 281–302.
Dalal, R. S., Meyer, R. D., Bradshaw, R. P., Green, J. P., Kelly, E. D., & Zhu, M.
(2015). Personality strength and situational influences on behavior: A con-
ceptual review and research agenda. Journal of Management, 41, 261–287.
doi:10.1177/0149206314557524
Darr, W., & Johns, G. (2008). Work strain, health and absenteeism: A meta-analysis. Jour-
nal of Occupational Health Psychology, 13, 293–318. doi:10.1037/a0012639
Dawson, J. F., González-Romá, V., Davis, A., & West, M. A. (2008). Organizational cli-
mate and climate strength in UK hospitals. European Journal of Work and Organi-
zational Psychology, 17, 89–111. doi:10.1080/13594320601046664
DeFrank, R. S., & Cooper, C. L. (1987). Worksite stress management interventions: Their
effectiveness and conceptualisation. Journal of Managerial Psychology, 2, 4–10.
doi:10.1108/eb043385
De Jong, B. A., Dirks, K. T., & Gillespie, N. (2016). Trust and team performance: A meta-
analysis of main-effects, moderators, and covariates. Journal of Applied Psychology,
101, 1124–1150. doi:10.1037/apl0000110
DeShon, R. P., & Alexander, R. A. (1996). Alternative procedures for testing regression
slope homogeneity when group error variances are unequal. Psychological Meth-
ods, 1, 261–277. doi:10.1037/1082-989X.1.3.261
Fox, J. (2016). Applied regression analysis and generalized linear models (3rd ed.). Thou-
sand Oaks, CA: Sage.
Froehlich, D. E., Beausaert, S., & Segers, M. (2016). Aging and the motivation to stay
employable. Journal of Managerial Psychology, 31, 756–770. doi:10.1108/JMP-
08-2014-0224
Galantino, M. L., Baime, M., Maguire, M., Szapary, P. O., & Farrar, J. T. (2005). Associa-
tion of psychological and physiological measures of stress in health-care profes-
sionals during an 8-week mindfulness meditation program: Mindfulness in practice.
Stress and Health, 21, 255–261. doi:10.1002/smi.1062
Giga, S. I., Noblet, A. J., Faragher, B., & Cooper, C. L. (2003). The UK perspective: A
review of research on organisational stress management interventions. Australian
Psychologist, 38, 158–164. doi:10.1080/00050060310001707167
Gilboa, S., Shirom, A., Fried, Y., & Cooper, C. (2008). A meta-analysis of work-demand
stressors and job performance: Examining main and moderating effects. Personnel
Psychology, 61, 227–271. doi:10.1111/j.1744-6570.2008.00113.x
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting
and Clinical Psychology, 68, 155–165. doi: 10.1037/0022-006X.68.1.155
Harrison, D. A., Price, K. H., Gavin, J. H., & Florey, A. T. (2002). Time, teams, and task
performance: Changing effects of surface- and deep-level diversity on group func-
tioning. Academy of Management Journal, 45, 1029–1045. doi:10.2307/3069328
Hartley, H. O. (1950). The maximum F-ratio as a short-cut test for heterogeneity of vari-
ance. Biometrika, 37(3/4), 308–312.
Horwitz, S. K. & Horwitz, I. B. (2007). The effects of team diversity on team outcomes: A
meta-analytic review of team demography. Journal of Management, 33, 987–1015.
doi:10.2307/3069328
Hsu, M. L. A., & Fan, H. (2010). Organizational innovation climate and creative outcomes:
Exploring the moderating effect of time pressure. Creativity Research Journal, 22,
378–386. doi:10.1080/10400419.2010.523400
Ivancevich, J. M., Matteson, M. T., Freedman, S. M., & Phillips, J. S. (1990). Work-
site stress management interventions. American Psychologist, 45, 252–261.
doi:10.1037//0003-066X.45.2.252
Jackson, S. E. (1983). Participation in decision making as a strategy for reducing job-relat-
ed strain. Journal of Applied Psychology, 68, 3–19. doi:10.1037//0021-9010.68.1.3
Janis, I. L. (1972). Victims of groupthink. Boston, MA: Houghton-Mifflin.
Kanfer, R., & Ackerman, P. L. (1989). Motivation and cognitive abilities: An integrative/
aptitude-treatment interaction approach to skill acquisition. Journal of Applied Psy-
chology, 74, 657–690. doi:10.1037//0021-9010.74.4.657
King, B. M., Rosopa, P. J., & Minium, E. W. (2018). Statistical reasoning in the behavioral
sciences (7th ed.). Hoboken, NJ: Wiley.
Kotter-Grühn, D., Kornadt, A. E., & Stephan, Y. (2016). Looking beyond chronological
age: Current knowledge and future directions in the study of subjective age. Geron-
tology, 62, 86–93. doi:10.1159/000438671
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical
models (5th ed.). New York, NY: McGraw-Hill.
LaMontagne, A. D., Keegel, T., Louie, A. M., Ostry, A., & Landsbergis, P. A. (2007). A
systematic review of the job-stress intervention evaluation literature. International
Journal of Occupational and Environmental Health, 13, 268–280. doi:10.1179/
oeh.2007.13.3.268
Lazarus, R. S., & Folkman, S. (1984). Stress, appraisal, and coping. New York, NY:
Springer.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin, S. G. Ghurye, W.
Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to probability and
statistics (pp. 278–292). Stanford, CA: Stanford University Press.
Li, J., Meyer, B., Shemla, M., & Wegge, J. (2018). From being diverse to becoming di-
verse: A dynamic team diversity theory. Journal of Organizational Behavior, 39,
956–970. doi:10.1002/job.2272
Lin, Y. Y., & Liu, F. (2012). A cross‐level analysis of organizational creativity climate and
perceived innovation: The mediating effect of work motivation. European Journal
of Innovation Management, 15, 55–76. doi:10.1108/14601061211192834
Moen, P., Kojola, E., & Schaefers, K. (2017). Organizational change around an older work-
force. The Gerontologist, 57, 847–856. doi:10.1093/geront/gnw048
Morse, C. K. (1993). Does variability increase with age? An archival study of cognitive
measures. Psychology and Aging, 8, 156–164. doi:10.1037/0882-7974.8.2.156
Mullen, B., Anthony, T., Salas, E., & Driskell, J. E. (1994). Group cohesiveness and qual-
ity of decision making: An integration of tests of the groupthink hypothesis. Small
Group Research, 25, 189–204. doi:10.1177/1046496494252003
Neuman, G. A., Wagner, S. H., & Christiansen, N. D. (1999). The relationship between
work-team personality composition and the job performance of teams. Group &
Organization Management, 24, 28–45. doi:10.1177/1059601199241003
Ng, T. W., & Feldman, D. C. (2008). The relationship of age to ten dimensions of job
performance. Journal of Applied Psychology, 93, 392–423. doi:10.1037/0021-
9010.93.2.392
Ng, M., & Wilcox, R. R. (2009). Level robust methods based on the least squares regres-
sion estimator. Journal of Modern Applied Statistical Methods, 8, 384–395.
Ng, M., & Wilcox, R. R. (2011). A comparison of two-stage procedures for testing least-
squares coefficients under heteroscedasticity. British Journal of Mathematical and
Statistical Psychology, 64, 244–258. doi:10.1348/000711010X508683
O’Brien, R. G. (1979). A general ANOVA method for robust tests of additive mod-
els for variances. Journal of the American Statistical Association, 74, 877–880.
doi:10.2307/2286416
O’Brien, R. G. (1981). A simple test for variance effects in experimental designs. Psycho-
logical Bulletin, 89, 570–574. doi:10.1037//0033-2909.89.3.570
Ostroff, C., & Fulmer, C. A. (2014). Variance as a construct: Understanding variability
beyond the mean. In J. K. Ford, J. R. Hollenbeck, & A. M. Ryan (Eds.), The nature
of work: Advances in psychological theory, methods, and practice (pp. 185–210).
Washington, DC: APA. doi:10.1037/14259-010
Ostroff, C., Kinicki, A. J., & Muhammad, R. S. (2013). Organizational culture and climate.
In N. W. Schmitt, S. Highhouse, & I. B. Weiner (Eds.), Handbook of psychology:
Industrial and organizational psychology (pp. 643–676). Hoboken, NJ: Wiley.
Panatik, S. A., O’Driscoll, M. P., & Anderson, M. H. (2011). Job demands and work-related psychological responses among Malaysian technical workers: The moderating effects of self-efficacy. Work & Stress, 25, 355–370. doi:10.1080/02678373.2011.634282
Plomin, R. & Thompson, L. (1988). Life-span developmental behavioral genetics. In
Baltes, P. B., Featherman, D. L., & Lerner, R. M. (Eds.), Life-span development and
behavior (pp. 1–31). Hillsdale, NJ: Lawrence Erlbaum.
Podsakoff, N. P., LePine, J. A., & LePine, M. A. (2007). Differential challenge stressor-
hindrance stressor relationships with job attitudes, turnover intentions, turnover, and
withdrawal behavior: A meta-analysis. Journal of Applied Psychology, 92, 438–454.
doi:10.1037/0021-9010.92.2.438
Rencher, A. C. (2000). Linear models in statistics. New York, NY: Wiley.
Richardson, K. M., & Rothstein, H. R. (2008). Effects of occupational stress management
intervention programs: A meta-analysis. Journal of Occupational Health Psychol-
ogy, 13, 69–93. doi:10.1037/1076-8998.13.1.69
Rosopa, P. J., Brawley, A. M., Atkinson, T. P., & Robertson, S. A. (2018). On the con-
ditional and unconditional Type I error rates and power of tests in linear models
with heteroscedastic errors. Journal of Modern Applied Statistical Methods, 17(2),
eP2647. doi:10.22237/jmasm/1551966828
Rosopa, P. J., Schaffer, M. M., & Schroeder, A. N. (2013). Managing heteroscedasticity in
general linear models. Psychological Methods, 18, 335–351. doi:10.1037/a0032553
Rosopa, P. J., Schroeder, A. N., & Doll, J. L. (2016). Detecting between-groups het-
eroscedasticity in moderated multiple regression with a continuous predic-
tor and a categorical moderator: A Monte Carlo study. SAGE Open, 6(1), 1–14.
doi:10.1177/2158244015621115
Salkind, N. J. (2007). Encyclopedia of measurement and statistics. Thousand Oaks, CA:
Sage. doi:10.4135/9781412952644
Salkind, N. J. (2010). Encyclopedia of research design. Thousand Oaks, CA: Sage.
doi:10.4135/9781412961288
Schaufeli, W. B., & Enzmann, D. (1998). The burnout companion to study and practice.
Philadelphia, PA: Taylor & Francis.
Schneider, B. (2000). The psychological life of organizations. In N. M. Ashkanasy, C. P. M.
Wilderom, & M. F. Peterson (Eds.), Handbook of organizational culture & climate
(pp. xvii–xxi). Thousand Oaks, CA: Sage.
Schneider, B., Ehrhart, M. G., & Macey, W. H. (2013). Organizational climate and
culture. Annual Review of Psychology, 64, 361–388. doi:10.1146/annurev-
psych-113011-143809
Schneider, B., Macey, W. H., Lee, W. C., & Young, S. A. (2009). Organizational ser-
vice climate drivers of the American Customer Satisfaction Index (ACSI) and
financial and market performance. Journal of Service Research, 12, 3–14.
doi:10.1177/1094670509336743
Schneider, B., Salvaggio, A. N., & Subirats, M. (2002). Climate strength: A new direction
for climate research. Journal of Applied Psychology, 87, 220–229. doi:10.1037/0021-
9010.87.2.220
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experi-
mental designs for generalized causal inference. Boston, MA: Houghton-Mifflin.
Shanker, R., Bhanugopan, R., van der Heijden, Beatrice I. J. M., & Farrell, M. (2017).
Organizational climate for innovation and organizational performance: The mediat-
ing effect of innovative work behavior. Journal of Vocational Behavior, 100, 67–77.
doi:10.1016/j.jvb.2017.02.004
Shin, Y. (2012). CEO ethical leadership, ethical climate, climate strength, and collective
organizational citizenship behavior. Journal of Business Ethics, 108(3), 299–312.
doi:10.1007/s10551-011-1091-7
Spirduso, W. W., Francis, K. L., & MacRae, P. G. (2005). Physical dimensions of aging (2nd
ed.). Champaign, IL: Human Kinetics.
Taylor, M. A., & Bisson, J. B. (2019). Changes in cognitive functioning: Practical and
theoretical considerations for training the aging workforce. Human Resource Man-
agement Review. Advance online publication. doi:10.1016/j.hrmr.2019.02.001
van der Klink, J. J. L., Blonk, R. W. B., Schene, A. H., & van Dijk, F. J. H. (2001). The
benefits of interventions for work-related stress. American Journal of Public Health,
91, 270–276. doi:10.2105/AJPH.91.2.270
van Dijk, H., van Engen, M. L., & van Knippenberg, D. (2012). Defying conventional wis-
dom: A meta-analytical examination of the differences between demographic and
job-related diversity relationships with performance. Organizational Behavior and
Human Decision Processes, 119, 38–53. doi:10.1016/j.obhdp.2012.06.003
Webster, J. R., Beehr, T. A., & Love, K. (2011). Extending the challenge-hindrance model
of occupational stress: The role of appraisal. Journal of Vocational Behavior, 79,
505–516. doi:10.1016/j.jvb.2011.02.001
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct
test for heteroskedasticity. Econometrica, 48, 817–838. doi:10.2307/1912934
Wilcox, R. R. (1997). Comparing the slopes of two independent regression lines when
there is complete heteroscedasticity. British Journal of Mathematical and Statistical
Psychology, 50, 309–317. doi:10.1111/j.2044-8317.1997.tb01147.x
Yung, P. M. B., Fung, M. Y., Chan, T. M. F., & Lau, B. W. K. (2004). Relaxation training
methods for nurse managers in Hong Kong: A controlled study. International Journal
of Mental Health Nursing, 13, 255–261. doi:10.1111/j.1445-8330.2004.00342.x
Zhang, Y., Zhang, Y., Ng, T. W. H., & Lam, S. S. K. (2019). Promotion- and prevention-
focused coping: A meta-analytic examination of regulatory strategies in the work
stress process. Journal of Applied Psychology, 104(10), 1296–1323. doi:10.1037/
apl0000404
Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. Brit-
ish Journal of Mathematical and Statistical Psychology, 57(1), 173–181.
doi:10.1348/000711004849222
Zohar, D. (2010). Thirty years of safety climate research: Reflections and future directions.
Accident Analysis and Prevention, 42, 1517–1522. doi:10.1016/j.aap.2009.12.019
CHAPTER 5

KAPPA AND ALPHA AND PI, OH MY
Beyond Traditional Inter-rater Reliability Using Gwet’s AC1 Statistic

Julie I. Hancock, James M. Vardaman, and David G. Allen

The aggregation of research results is an important and useful mechanism for better understanding a variety of phenomena, including those relevant to human
resource management (HRM). For decades, aggregate studies in the form of me-
ta-analyses have provided an overarching examination and synthesis of results.
Such studies provide conclusions based on the amalgamation of previous works,
enabling scholars to determine the consistency of findings and the magnitude of
an effect, as well as affording more precise recommendations than may be offered
based on the results of a solitary study (Aguinis, Dalton, Bosco, Pierce, & Dalton,
2011; Borenstein, Hedges, Higgins, & Rothstein, 2011). Similarly, content analy-
ses are another mechanism by which to identify trends through examining the
text, themes, or concepts present across existing studies. These aggregate studies
can have real and significant implications for better understanding HRM issues
(e.g., Allen, Hancock, Vardaman, & McKee, 2014; Barrick & Mount, 1991; Eby,
Casper, Lockwood, Bordeaux, & Brinley, 2005; Hancock, Allen, Bosco, McDan-
iel, & Pierce, 2013; Pindek, Kessler, & Spector, 2017). However, in order to de-
duce meaningful conclusions, the data must be reliable and demonstrate construct
validity.
These methods of study aggregation typically require the employment of mul-
tiple coders to systematically gather and categorize data into an appropriate cod-
ing scheme. The agreement among coders is a significant issue in these studies, as
disagreement could constitute a threat to the validity of the results of aggregation
studies. Consequently, inter-rater reliability (IRR) is calculated to determine the
degree to which coders consistently agree upon the categorization of variables of
interest (Bliese, 2000; LeBreton, Burgess, Kaiser, Atchley, & James, 2003). The
most basic approach is the simple calculation of the percentage of agreements that
coders have established, whereby the number of total actual agreements is divid-
ed by the total possible number of agreements. The simplicity of calculating per-
centage agreements makes it a commonly used index of IRR in the management
literature. Although this method provides an easily calculable general indication
of the degree to which coders agree, it can be misleading, failing to take into
consideration the impact that chance may have on the reliability of agreement.
Because chance agreement is ignored, deviations from 100% agreement become less meaningful and the resulting IRR may be inflated, jeopardizing the construct validity of the measure; percentage agreement is thus useful and meaningful only in very specific conditions.
Despite these potential shortcomings, IRR has been traditionally reported as
the simple percentage of agreement among coders in the management literature
(e.g., Barrick & Mount, 1991; Eby et al., 2005; Hancock et al., 2013; Hoch, Bom-
mer, Dulebohn, & Wu, 2018; Judge & Ilies, 2002; Mackey, Frieder, Brees, & Mar-
tinko, 2017). Reliability statistics such as Scott’s pi (π) (1955), Cohen’s Kappa
(κ) (1960) (e.g., Heugens & Lander, 2009; Koenig, Eagly, Mitchell, & Ristikari,
2011) and Krippendorff’s alpha (α) (1980) (e.g., Tuggle, Schnatterly, & Johnson,
2010; Tuggle, Sirmon, Reutzel, & Bierman, 2010) have been increasingly identi-
fied as superior indices of IRR in comparison to simple percentage agreement and
are beginning to appear in aggregate studies. However, these indices are also not
without limitations.
Each of these more sophisticated indices was derived to combat the shortcomings of its predecessors. Even so, none of π, κ, or α is appropriate in all circumstances. In particular, each has limitations in one or more of the following scenarios: (a) there are multiple coders but different combinations of coders for different cases, (b) the coding scheme involves an arbitrary number of categories, scale values, or measures, (c) there are missing data, (d) known prevalence (dichotomous coding) exists, (e) the data are skewed, and (f) the sample is of arbitrary size. A search across several disciplines for another option for calculating inter-rater agreement drew attention to the AC1 statistic for IRR established by Gwet (2001), "a more robust chance-corrected statistic that consistently yields reliable results" (Gwet, 2002b, p. 5) as compared to κ, providing scholars with a more accurate measurement in each of these situations.
Thus, in this paper, we contribute to the management literature in several ways. First, we demonstrate the utility of the AC1 by offering a comparison of key char-
acteristics of five IRR indices, such as the number of coders/observers, level of
measurement, and sample size, as well as the number of categories, scale values,
or measures each IRR index can accommodate. Further, we compare the degree
to which each IRR index accommodates missing data, known prevalence (data
coded 0 or 1), and skewed data, highlighting the contextual characteristics in
which each index may be used appropriately. Next, we examine over 440 studies
to provide a side-by-side data-driven comparison of each of the five IRR indices
discussed, showing the variation that exists based on synthesis characteristics and
how the inferences and conclusions made by researchers as a result of IRR index
selection may vary. Finally, we provide recommendations for the best indices
to use for calculating IRR and address additional areas of practicality for AC1,
specifically, the value it holds for HRM practices. In so doing, we highlight the
value of AC1 over other IRR indices in two specific situations: when examining
dichotomous variables and when more than two coders are engaged in coding.

LITERATURE REVIEW
The degree to which data analysis and synthesis can lead to prescriptions for re-
searchers and practitioners is dependent upon the level of accuracy and reliability
with which coders of the data agree. IRR indices seek to provide some degree of
trust and assurance of data that are coded and categorized by human observers,
thus increasing the degree of confidence researchers have in data driven by hu-
man judgments (Hayes & Krippendorff, 2007) by improving construct validity. In
their review of several IRR indices, Hayes and Krippendorff (2007) identify five
properties that exemplify the nature of a good reliability index. First, agreement
amongst two or more coders/observers working independently to ascribe catego-
rizations to observations ought to be assessed without influence of the number
of independent coders present or by variation in the coders involved. Thus, the
individual coders participating in the codification of data should not influence
coding agreement.
Second, the number of categories to be coded should not bias the reliabilities.
Thus, reliability indices should not be influenced in one direction or the other by
the number of categories prescribed by the developer of the coding schemata.
Third, the reliability metric should be represented on a "numerical scale between at least two points with sensible reliability interpretations" (Hayes & Krippendorff, 2007, p. 79). Thus, scales on which a 0 indicates the complete absence of agreement imply a violation of the assumption of coder independence and are thus ambiguous in their assessment of reliability. Fourth, Hayes and Krippendorff (2007) suggest that a good reliability index should "be appropriate to the level of measurement of the data" (p. 79). Thus, it must be suitable for comparisons across
various types of data, not limited to one particular type of data. Finally, the “sam-
pling behavior should be known or at least computable” (p. 79).
Extant IRR Measures

Each of the most prevalent IRR indices has pros and cons when compared us-
ing Hayes and Krippendorff’s (2007) criteria. For example, although percentage
agreement is easy to calculate, it skews agreement in an overly positive direc-
tion. Although α is complex to compute, it accommodates more complex coding
schemes. The following sections review the utility and shortcomings of each ap-
proach, providing a better understanding of the circumstances under which a par-
ticular IRR index may be most appropriately utilized.
Percentage Agreement. A common IRR index in the management literature
is simple percentage agreement. Percentage agreement assesses IRR by simply
dividing the number of agreements two coders have by the number of potential
matches that exist.

Percent Agreement

\text{\% Agreement} = \frac{\sum_{c} O_{cc}}{n} \times 100

where O_cc represents each agreement coincidence and n represents the total number of coding decisions.
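To make the arithmetic concrete, the following minimal Python sketch (ours, not the chapter authors’) computes percent agreement for two coders; the function name, variable names, and ratings are hypothetical.

def percent_agreement(ratings_a, ratings_b):
    """Percentage of items that two coders place in the same category."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both coders must rate the same items.")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100 * matches / len(ratings_a)

# Hypothetical example: two coders classify eight articles as 1 (trait present) or 0 (absent)
coder1 = [1, 1, 0, 1, 1, 1, 0, 1]
coder2 = [1, 1, 1, 1, 1, 1, 0, 1]
print(percent_agreement(coder1, coder2))  # 87.5

The same two hypothetical rating vectors are reused in the sketches that follow so that the indices can be compared on identical data.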
Percentages are typically calculated for each variable in a coding scheme then
averaged such that the overall agreement is known, as is the agreement for each
specific variable. In addition to being a straightforward calculation, percent agree-
ment can provide researchers with insights into problematic variables within the
data (McHugh, 2012). For example, if percentage agreement for a particular variable is only 40%, this suggests that the variable should be revisited
to determine the underlying reason for low agreement. However, although this
measure is easily calculable, it fails to fully satisfy a majority of the five reli-
ability criteria set forth by Hayes and Krippendorff (2007) and can be somewhat
misleading.
The simplicity of calculating percentage agreements makes it a commonly
used index of IRR in the management literature. However, the degree to which it
is meaningful is situationally specific, i.e., when there are two well-trained coders, nominal data, a smaller rather than a greater number of categories, and a low chance that guessing will take place (Scott, 1955). Thus, it is not a sufficient and reliable measure by itself. Percentage agreement does not consider the role that chance might play in ratings, incorrectly assuming that all raters make deliberate, rational decisions in assigning their ratings. Perhaps more alarmingly, because chance is not accounted for in this metric, agreement can seem acceptable even if both coders guessed at their categorizations.
For example, if two coders employ two differing strategies for categorizing items,
one coder categorizes every item as “A” and the other coder often, but not always,
categorizes an item as “A,” simple percentage agreement would suggest that they
are in agreement when they are, in fact, utilizing different strategies for their
categorizations or, more disturbingly, simply guessing. Additionally, this calcu-
lation is predisposed towards coding schemes with fewer categories whereby a
higher percentage agreement will be achieved by chance when there are fewer
categories to code.
Further, percentage agreement is interpreted on a scale from 0% to 100%, with 100% indicating complete agreement and 0% indicating complete disagreement, which is not likely
unless coders are violating the condition of independence. Consequently, devia-
tions from 100% agreement (complete agreement in all categories) become less
meaningful as the scale is not meaningfully interpretable. Failure of simple per-
centage agreement calculations to adequately assess reliability substantially limits
the construct validity of the assessments scholars are using to synthesize data
and draw conclusions and has been deemed unacceptable in determining IRR for
decades (e.g., Krippendorff, 1980; Scott, 1955). Thus, it is advisable that manage-
ment scholars explore other, more reliable indices for assessing IRR; several other
metrics attempt to do so.
Scott’s Pi. In an attempt to overcome the limitations of percent agreement, π (Scott, 1955) was developed as a means by which IRR might be calculated above and beyond simple percentages. Whereas percentage agreement is based on the number of matches that coders obtain out of a particular number of potential matches, π takes into consideration the role played by chance agreement. The probability of
chance agreement is based on the cumulative classification probabilities rather than the probabilities of individual rater classifications (Gwet, 2002a), and π thus provides a chance-corrected agreement index for assessing IRR. This metric considers the degree to
which coders agree when they do not engage in guessing. Further, Scott (1955)
proposed that the previous categorizations of items by coders be examined by
calculating the observed number of items each coder has placed into a particular
category. For example, the total number of items placed into the same category
by two coders would be compared to the total number of items to categorize.
The assumption is that if each of the coders were simply categorizing items by
chance, each coder would have the same distribution (Artstein & Poesio, 2008;
Scott, 1955).

Scott’s π

\pi = \frac{P_o - P_e}{1 - P_e}

where

P_o = \sum_{c} \frac{O_{cc}}{n} \qquad \text{and} \qquad P_e = \sum_{i} p_i^2

where p_i represents the proportion "of the sample coded as belonging to the ith category" (Scott, 1955), O_cc = each agreement coincidence (diagonal cells in the coincidence matrix), n = total number of coding decisions, and c = each coincidence marginal.
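A corresponding Python sketch of Scott’s π (again ours, under the same hypothetical two-coder, nominal-data setup) illustrates the chance correction based on the pooled category distribution.

from collections import Counter

def scotts_pi(ratings_a, ratings_b):
    """Scott's pi for two coders and nominal categories."""
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement uses the category distribution pooled across both coders.
    pooled = Counter(ratings_a) + Counter(ratings_b)
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (p_o - p_e) / (1 - p_e)

coder1 = [1, 1, 0, 1, 1, 1, 0, 1]
coder2 = [1, 1, 1, 1, 1, 1, 0, 1]
print(round(scotts_pi(coder1, coder2), 2))  # 0.59

Although raw agreement for these hypothetical ratings is 87.5%, the mostly "present" distribution inflates expected chance agreement, so π drops to roughly .59.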
The interpretation of π, though not universally agreed upon, is that values range from 0.0 to 1.0, where (a) zero suggests that coder agreement is no better than if the coding process were random, (b) one indicates that there is perfect agreement among coders, and (c) a negative outcome indicates that coder agreement was worse than would have been expected simply by chance. Thus, π satisfies Hayes and Krippendorff’s (2007) second and third requirements for reliability. Like percentage agreement, however, π is traditionally limited to nominal data and a maximum of two coders. Though it does overcome some of the issues faced by percent agreement, it does not satisfy all five requirements set forth by Hayes and Krippendorff (2007), making it useful only in limited conditions and, like percentage agreement, limiting construct validity.
Cohen’s Kappa. Cohen’s κ (1960) was developed to improve upon the shortcomings of percentage agreement and π. Though a generalized version of measuring pairwise agreement was proposed by Fleiss (1971), allowing for the use of multiple coders, κ itself, like both percentage agreement and π, is limited to two coders.
Further, it is limited to nominal data and it does not allow for coders to be substi-
tutable, thus it is unreliable in situations when the exchanging of coders is neces-
sary. κ may be used when guessing is likely to be prevalent among coders or if the
coders lack the training necessary to provide adequate comparisons. Although the
basic formula for calculation remains the same as that of π, it differs in its assumption of
chance agreement. The assumption here is that a coder’s prior distributions will
influence their assignment of items into a particular category, such that the prob-
ability of each coder assigning that item into the same category must be calculated
and summed. Consequently, each coder has their own distribution.

Cohen’s κ

\kappa = \frac{P_o - P_e}{1 - P_e}

where

P_o = \sum_{c} \frac{O_{cc}}{n} \qquad \text{and} \qquad P_e = \frac{1}{n^2} \sum_{i} pm_i

where n represents the number of cases and \sum pm_i represents the sum of the marginal products (Neuendorf, 2002).
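The only change relative to the π sketch above is the chance term, which uses each coder’s own marginal distribution; the code below is again our hypothetical illustration, not the authors’.

from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two coders and nominal categories."""
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement multiplies the two coders' separate marginal proportions.
    dist_a, dist_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    p_e = sum((dist_a[k] / n) * (dist_b[k] / n) for k in categories)
    return (p_o - p_e) / (1 - p_e)

coder1 = [1, 1, 0, 1, 1, 1, 0, 1]
coder2 = [1, 1, 1, 1, 1, 1, 0, 1]
print(round(cohens_kappa(coder1, coder2), 2))  # 0.6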
Like π, κ ranges from 0.0 to 1.0; however, because zero is defined as it would
be for a correlation, “Kappa, by accepting the two observers’ proclivity to use
available categories idiosyncratically as baseline, fails to keep κ tied to the data
whose reliability is in question. This has the effect of punishing observers for
agreeing on the frequency distribution of categories used to describe the given
phenomena (Brennan & Prediger, 1981, Zwick, 1988) and allowing systematic
disagreements, which are evidence of unreliability, to inflate the value of κ (Krip-
pendorff, 2004a,b).” (Hayes & Krippendorff, 2007, p. 81). Thus, like the measures
discussed above, κ also fails to satisfy the five requirements outlined by Hayes
and Krippendorff (2007).
Krippendorff’s Alpha. Krippendorff’s α (1970) was developed in an attempt to fill the remaining voids in reliability calculation left by percentage agreement, π, and κ. This IRR index overcomes the data limitations of the previous three by allowing for more than two observers and for the computation of agreement among ordinal, interval, and ratio data, as well as nominal data (Hayes & Krippendorff, 2007). Whereas the earlier measures are corrections of percent agreement, α instead calculates disagreements. Consequently, it is gaining popularity as a standard IRR index that addresses the limitations of earlier IRR indices, providing researchers with a metric that is able to overcome a variety of concerns.

Krippendorff’s α

\alpha = 1 - \frac{D_o}{D_e}

where

D_o = \frac{1}{n} \sum_{c} \sum_{k} o_{ck}\, {}_{\text{metric}}\delta_{ck}^2 \qquad \text{and} \qquad D_e = \frac{1}{n(n-1)} \sum_{c} \sum_{k} n_c\, n_k\, {}_{\text{metric}}\delta_{ck}^2

where D_o is the observed disagreement among values assigned to units of analysis and D_e is the disagreement one would expect when the coding of units is attributable to chance rather than to the properties of these units. o_ck, n_c, n_k, and n refer to the frequencies of values in coincidence matrices, and metric δ²_ck is the difference function for the chosen level of measurement (see Krippendorff, 2011, p. 1, for further description).
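As an illustration only (our sketch, not code from the chapter), the following Python function computes α for nominal data from the coincidence matrix described above; the function, unit, and variable names are hypothetical.

from collections import defaultdict
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of sequences; each sequence holds the values assigned to
    one unit by however many coders rated it (missing ratings simply omitted).
    """
    o = defaultdict(float)  # coincidence matrix o[(c, k)]
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a unit rated by fewer than two coders yields no pairs
        for c, k in permutations(values, 2):
            o[(c, k)] += 1 / (m - 1)
    n_c = defaultdict(float)
    for (c, _k), count in o.items():
        n_c[c] += count
    n = sum(n_c.values())
    # Nominal difference function: delta^2 = 1 when categories differ, 0 otherwise.
    d_o = sum(count for (c, k), count in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1 - d_o / d_e

# The same eight hypothetical articles, expressed as one (coder1, coder2) pair per unit
units = [(1, 1), (1, 1), (0, 1), (1, 1), (1, 1), (1, 1), (0, 0), (1, 1)]
print(round(krippendorff_alpha_nominal(units), 2))  # 0.62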
Alpha is useful for multiple coders and is appropriate for various data types (Hayes & Krippendorff, 2007). However, it is not an efficient measure in certain contexts. For example, it is not appropriate for a paired double coding scheme (De Swert, 2012), nor is it an appropriate measure for particular datasets. Because α is based on the chance of agreement, it is difficult to utilize this measure of reliability with skewed data. Due to the binary nature of intensive content or meta-analytic data (where the choices are "0" not present and "1" present), many variables may be categorized as 0s and 1s, with several variables resulting in a low representation of 1s. Thus, the degree of skewness can be problematic in calculating α, as well as κ, because "The κ statistic is affected by skewed distributions of categories (the prevalence problem) and by the degree to which the coders disagree (the bias problem)" (Eugenio & Glass, 2004).
Feinstein and Cicchetti (1990, p. 543) further articulate this problem:

In a fourfold table showing binary agreement of two observers, the observed pro-
portion of agreement, P0 can be paradoxically altered by the chance-corrected ratio
that creates κ as an index of concordance. In one paradox, a high value of P0 can be
drastically lowered by a substantial imbalance in the table’s marginal totals either
vertically or horizontally. In the second paradox, (sic) κ will be higher with an asym-
metrical rather than symmetrical imbalance in marginal totals, and with imperfect
rather than perfect symmetry in the imbalance. An adjustment that substitutes Kmax
for κ does not repair either problem, and seems to make the second one worse.

Despite these difficulties in assessing accurate IRR in varying contexts, α and κ continue to be regarded as the most legitimate IRR indices in premier management research
(e.g., Desa, 2012; Heugens & Lander, 2009; Kostova & Roth, 2002; Tuggle et
al., 2010). The calculations for these values elicit a paradoxical outcome, depen-
dent upon the degree to which they exhibit trait prevalence, or the presence of a
particular trait within a population, and the conditional probabilities of the coder
properly classifying that trait as either “present” or “not present” (typically 1 or
0). This issue, the prevalence problem, as well as the bias problem (Eugenio & Glass, 2004), creates difficulties in the accuracy of reliability statistics when few
categories exist or when there is a substantial difference in the marginal distri-
bution amongst coders. Consequently, meta-analyses and content analyses that
utilize a binary approach in their collection and synthesis of data or have any
substantially over-represented category, such that the data are skewed, suffer from
low, and thus "unreliable," levels of agreement due to these calculations’ emphasis on the outcome being a product of chance (Gwet, 2008).

FILLING THE VOID: THE AC1 STATISTIC


The search for another option for calculating inter-rater agreement across sev-
eral disciplines elicited attention to the AC1 statistic for IRR established by Gwet
(2001, 2002a). The AC1 inter-rater reliability statistic "is a more robust chance-cor-
rected statistic that consistently yields reliable results” as compared to κ (Gwet,
2002b, p. 5). Furthermore, Gwet (2008) investigated how the prevalence of a specific trait and the coders’ conditional classification probabilities influence π and κ as metrics of inter-rater reliability.

AC1

\hat{\gamma}_1 = \frac{p_a - p_e}{1 - p_e}

where

p_a = \frac{1}{1 - p_m} \sum_{k=1}^{q} p_{kk}, \qquad p_e = \frac{1}{q - 1} \sum_{k=1}^{q} \pi_k (1 - \pi_k), \qquad \pi_k = \frac{p_{k+} + p_{+k}}{2}

with
p_m = the relative number of subjects rated by a single rater (i.e., one rating is missing)
p_k+ = the relative number of subjects assigned to category k by rater A
p_+k = the relative number of subjects assigned to category k by rater B
p_kk = the relative number of subjects classified into category k by both raters
π_k = the probability that a randomly selected rater classifies a randomly selected subject into category k
q = the number of categories in the nominal rating scale
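A Python sketch of AC1 for the simplest case, two coders, nominal categories, and no missing ratings (so Gwet’s p_m term is zero), follows; as with the earlier sketches, the names and data are ours and purely illustrative.

from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 for two coders and nominal categories (no missing ratings, so p_m = 0)."""
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    dist_a, dist_b = Counter(ratings_a), Counter(ratings_b)
    # pi_k: probability that a randomly chosen rater places a random subject in category k
    pi = {k: (dist_a[k] / n + dist_b[k] / n) / 2 for k in categories}
    p_e = sum(pi[k] * (1 - pi[k]) for k in categories) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

coder1 = [1, 1, 0, 1, 1, 1, 0, 1]
coder2 = [1, 1, 1, 1, 1, 1, 0, 1]
print(round(gwet_ac1(coder1, coder2), 2))  # 0.82

On these same skewed hypothetical ratings, AC1 (about .82) stays much closer to the raw 87.5% agreement than π (about .59), κ (.60), or α (about .62), which is the pattern the prevalence problem predicts.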
AC1 may be used with any number of coders, any number of categories, scale
values, or measures. It can accommodate missing data, any sample size, and ac-
count for trait prevalence. Although AC1 may only be used to calculate IRR with
nominal data, a similar statistic, AC2, may be used to calculate IRR with ordinal,
interval, or ratio scale data. In our own review, we utilized a coding team of more than two coders, but with two coders for each article, multiple categories, and nominal data that demonstrated trait prevalence, that is, a large amount of data coded "1" by both coders, a condition deemed problematic with the other forms of IRR calculation. Due to the lack of ordinal, interval, and ratio data in our coding schemata, AC2 is beyond the scope of this paper and is suggested as a more comprehensive measure for datasets composed of data that are not nominal in nature.
Calculations suggest that both π and κ produce realistic estimates of IRR when
the prevalence of a trait is approximately .50. The farther the trait prevalence
above or below a value of .50, the less reliable and accurate the indices. The
TABLE 5.1. Guidelines for Best Selecting an IRR Index

Data Characteristics | Percent Agreement | Scott’s π | Cohen’s κ | Krippendorff’s α | Gwet’s AC1 | Gwet’s AC2
# Coders/Observers | 2 | 2* | 2* | Any | Any | Any
Number of Categories, Scale Values, or Measures | Limited | Any | Limited | Any | Any | Any
Level of Measurement (nominal, ordinal, interval, ratio, etc.) | Nominal | Nominal | Nominal | Nominal, ordinal, interval, and ratio | Nominal | Ordinal, interval, and ratio ratings
Missing Data | No | No | No | Yes | Yes | Yes
Known Prevalence (0 or 1) | No | No | No | No | Yes | Yes
Sample Size | Any | Not small | Not small | Any | Any | Any
Skewed Data | No | No | No | No | Yes | Yes
Formula | (Σ_c O_cc / n) × 100 | (P_o − P_e)/(1 − P_e) | (P_o − P_e)/(1 − P_e) | 1 − D_o/D_e | (p_a − p_e)/(1 − p_e) | (p_a − p_e)/(1 − p_e)

*If more than 2 coders exist, an extension called Fleiss’ kappa can be used to assess IRR.
TABLE 5.2. Guidelines for Best Selecting an IRR Index

Data Characteristics | Percent Agreement | Scott’s π | Cohen’s κ | Krippendorff’s α | Gwet’s AC1 | Gwet’s AC2
Accommodates Multiple Coders/Observers | No | No | No | Yes | Yes | Yes
Bias Due to Number of Categories, Scale Values, or Measures | Yes | No | Yes | No | No | No
Level of Measurement (nominal, ordinal, interval, ratio, etc.) | Nominal | Nominal | Nominal | Nominal, ordinal, interval, and ratio | Nominal | Ordinal, interval, and ratio ratings
Accommodates Missing Data | No | No | No | Yes | Yes | Yes
Accommodates Known Prevalence (0 or 1) | No | No | No | No | Yes | Yes
Sample Size Restrictions | No | Yes | Yes | No | No | No
Accommodates Skewed Data | No | No | No | No | Yes | Yes
closer the trait prevalence to 0 or 1, the more difficult it is to have confidence in the results of these two IRR indices. Thus, Gwet (2008) suggests that the biases
established by such calculations negatively influence the overall statistics, lead-
ing to the possibility of underestimating the actual inter-rater reliability by up
to 100% (Gwet, 2008, p. 40). Consequently, the inclusion of chance-agreement
probabilities as a core component in calculating these metrics is inappropriate
when utilizing data that rely upon a coding scheme that has few categories, such
as those using a binary categorical classification system. The AC1 statistic ad-
equately addresses the unique nature of data beyond the scope of α, κ, and π.
Thus, in an effort to compare these IRR indices, we calculated each using a dataset
of skewed data.
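The effect is easy to reproduce with the hypothetical sketch functions defined above (ours, not the authors’ data or code): when nearly every unit is coded "present," a single disagreement is enough to drive the chance-corrected indices toward zero, while AC1 remains close to the observed agreement.

# Hypothetical, extremely prevalent trait: 20 articles, one disagreement
coder1 = [1] * 19 + [0]
coder2 = [1] * 20
print(percent_agreement(coder1, coder2))                                 # 95.0
print(round(scotts_pi(coder1, coder2), 2))                               # -0.03
print(round(cohens_kappa(coder1, coder2), 2))                            # 0.0
print(round(krippendorff_alpha_nominal(list(zip(coder1, coder2))), 2))   # 0.0
print(round(gwet_ac1(coder1, coder2), 2))                                # 0.95

In this toy case percent agreement is 95%, yet π is slightly negative, κ and α are exactly zero, and AC1 is roughly .95, mirroring the paradox described by Feinstein and Cicchetti (1990) and Gwet (2008).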

INTER-RATER RELIABILITY MEASUREMENT: A COMPARATIVE EXAMPLE
To explore the differences of IRR indices (and to demonstrate the utility of the
AC1 statistic) within an actual coding context, we conducted a review of over 440
employee turnover articles in eleven major journals in the fields of management
and psychology from 1958–2010 (Academy of Management Journal, Adminis-
trative Science Quarterly, Human Relations, Human Resource Management,
Journal of Applied Psychology, Journal of Management, Journal of Management
Studies, Journal of Organizational Behavior, Journal of Vocational Behavior, Or-
ganizational Behavior and Human Decision Processes, and Personnel Psychol-
ogy). A more thorough description of the data can be found in Allen et al. (2014),
the original study for which these data were coded. For each article, we coded
130 different variables, a majority of which were confounded by trait prevalence
and, consequently, demanded a look beyond the traditional IRR indices. The ap-
propriateness of the IRR index to be used should be assessed based on the degree
to which the metric can accommodate the data.
In this example, the data were nominal; thus percent agreement, π, κ, α, or AC1 could be appropriate, based solely on the type of data. However, a paired
double coding scheme was utilized, whereby three coders alternated coding such
that two coders independently coded each article. Discrepancies were resolved amongst the coders, with a fourth coder resolving any discrepancies they could not settle themselves. Consequently, κ is excluded as an appropriate metric, given that it fails to support the substitution of coders. Percent agreement and π are also excluded, given that they do not provide a means by which to adequately assess agreement among more than two independent coders. Beyond those requirements, we are left to consider α and AC1, both of which accommodate any number of independent coders, neither of which is restricted by the number of categories, scale values, or measures present, and both of which can be interpreted on a numerical scale between two points, thus making either a suitable choice.
However, a common issue within coding schemes is that of known prevalence (or
trait prevalence), whereby coders are identifying the presence (coded 1) versus
absence (coded 0) of a particular trait, phenomenon, etc. The data in this example
are representative of this problem, which cannot be sufficiently accommodated
by α.
The problems with α in this situation are laid bare in our study. Take, for example, our coding of the retrospective study variable in Table 5.3. Despite there being 96% agreement between coders, α is calculated at .28. This meager value is the result of the dichotomous nature of the variable and the use of a rotated coding scheme, whereby (a) Coders 1 and 2 code a set of articles, (b) Coders 2 and 3 code a set of articles, and (c) Coders 1 and 3 code a set of articles. The α index cannot account for this coding scheme and underestimates the degree of IRR. By contrast, Table 5.3 demonstrates that AC1 measures IRR more accurately than α when a rotating coder design is employed. Specifically, Table 5.3 provides calculations of the different IRR indices for 25 of the 130 variables that were coded within each of the 440 studies in our sample. Given the laborious nature of content and meta-analysis, these types of designs are increasingly common, highlighting the utility of AC1 as an IRR measure.
We calculated each of the five IRR indices for our data in order to compare across several theoretical and methodological variables that were assigned as either present or not present in a particular article. Table 5.3 shows a comparison of all five IRR indices for these coded variables. Across these comparisons, it is clear that there is a substantial range of IRR coefficients. A similar pattern emerges throughout: π, κ, and α are all relatively close in value (to the thousandth place), whereas percent agreement and AC1 tend to be substantially higher, with percent agreement consistently remaining the highest coefficient, followed by AC1. Although there is no universal agreement regarding acceptable levels of IRR for each of these variables (e.g., Krippendorff, 1980; Perreault & Leigh, 1989; Popping, 1988), the general body of literature suggests that IRR coefficient values greater than .90 are acceptable in virtually all situations and values of .80 or greater are acceptable in most situations. Values below .80 are subject to disagreement among scholars regarding their acceptability (Neuendorf, 2002); however, some scholars suggest that values between .60 and .80 are moderately strong and sometimes acceptable (e.g., Landis & Koch, 1977). Other scholars suggest .70 as the cutoff for reliability (e.g., Cronbach, 1980; Frey, Botan, & Kreps, 2000). However, due to the relatively conservative nature of π and κ, lower thresholds are at times deemed acceptable.
Using these acceptance guidelines, it is clear that the interpretation of acceptability varies based upon which IRR index is being used. For the coding of studies grounded in the theories of Porter and Steers (1973), Lee and Mitchell (1991), and Rusbult and Farrell (1983), the IRR is acceptable regardless of which metric is used, though for π, κ, and α acceptance is borderline for Porter and Steers, whereas percentage agreement and AC1 are clearly acceptable. However, for the remaining theoretical variables that were coded, the π, κ, and α values are not deemed acceptable, though AC1 and percentage agreement offer evidence of acceptable IRR among
TABLE 5.3. Coded Variable IRR Comparison

Coded Variable | Type of Data | Category | % Agreement | Scott’s π | Cohen’s κ | Krippendorff’s α | Gwet’s AC1
Existing measures adapted | Nominal | Measures | 0.7857 | 0.5399 | 0.5402 | 0.5405 | 0.7210
Existing measures w/out adapting | Nominal | Measures | 0.7885 | 0.5795 | 0.5795 | 0.5801 | 0.7460
Idiosyncratic | Nominal | Measures | 0.8297 | 0.5890 | 0.5892 | 0.5596 | 0.7850
Multi item measures | Nominal | Measures | 0.9148 | 0.7653 | 0.7656 | 0.7656 | 0.8960
Single item measures | Nominal | Measures | 0.7225 | 0.4493 | 0.4497 | 0.4500 | 0.6290
Field | Nominal | Setting | 0.9918 | 0.6629 | 0.6633 | 0.6634 | 0.9920
Lab | Nominal | Setting | 0.9918 | 0.3969 | 0.3970 | 0.3977 | 0.9920
Simulation | Nominal | Setting | 0.9945 | 0.4979 | 0.4979 | 0.4986 | 0.9940
Cross-sectional | Nominal | Study Design | 0.8434 | 0.4547 | 0.4599 | 0.4554 | 0.8270
Ex post archival | Nominal | Study Design | 0.8154 | 0.0672 | 0.0680 | 0.0685 | 0.8060
Longitudinal | Nominal | Study Design | 0.8654 | 0.5594 | 0.5596 | 0.5600 | 0.8540
Repeated measures | Nominal | Study Design | 0.9451 | 0.5395 | 0.5396 | 0.5402 | 0.9430
Retrospective | Nominal | Study Design | 0.9615 | 0.2849 | 0.2849 | 0.2859 | 0.9610
Static cohort | Nominal | Study Design | 0.7720 | 0.5513 | 0.5528 | 0.5519 | 0.7250
Rusbult & Farrell | Nominal | Theories | 0.9890 | 0.8125 | 0.8125 | 0.8128 | 0.9890
Hulin et al. | Nominal | Theories | 0.9643 | 0.5874 | 0.5888 | 0.5879 | 0.9610
Lee & Mitchell | Nominal | Theories | 0.9890 | 0.8889 | 0.8889 | 0.8891 | 0.9880
March & Simon | Nominal | Theories | 0.9091 | 0.6727 | 0.6728 | 0.6734 | 0.8740
Mobley | Nominal | Theories | 0.9093 | 0.6772 | 0.6774 | 0.6776 | 0.8740
Mobley et al. | Nominal | Theories | 0.8874 | 0.6701 | 0.6701 | 0.6705 | 0.8390
Muchinsky & Morrow | Nominal | Theories | 0.9835 | 0.6915 | 0.6919 | 0.6919 | 0.9830
Price | Nominal | Theories | 0.9011 | 0.4831 | 0.4834 | 0.4838 | 0.8780
Steers & Mowday | Nominal | Theories | 0.9148 | 0.6032 | 0.6043 | 0.6037 | 0.8920
Maertz | Nominal | Theories | 0.9973 | 0.6653 | 0.6654 | 0.6657 | 0.9980
Porter & Steers | Nominal | Theories | 0.9286 | 0.7083 | 0.7084 | 0.7087 | 0.9050
coders. This can likely be attributed to the binary coding scheme that was used and offers evidence for the importance of choosing the right metric. In all but one of these instances, the percentage agreement is above 90% and is arguably inflated given the lack of consideration of chance. This inflation is further demonstrated upon examination of the variables that coded for measures. Although percentage agreement remains inflated, the remaining four IRR indices either fail to demonstrate IRR across the board (single item measures) or show low reliability as calculated by π, κ, and α, and a barely acceptable AC1 (i.e., existing measures adapted, existing measures without adaptation). Thus, the IRR index used has a substantial influence on the degree to which IRR is considered acceptable or not.

RESEARCH AND PRACTICE


Having examined five IRR indices, it is clear that no one specific IRR index is best suited for every situation and that the selection of an appropriate metric depends upon the data and coding processes themselves. In conjunction with endorsements in other disciplines, we recommend the use of α in a majority of content and meta-analytic contexts. However, although it builds and improves upon the metrics commonly used to date, the centrality of chance probability in the calculation of α makes it a poor choice in contexts where a prevalence or bias problem may exist. Consequently, we recommend the AC1 statistic as an alternative to α for management scholars engaging in analytical synthesizing research involving large numbers of categories coded as a function of 1 "present" or 0 "not present." Perhaps just as importantly, we also strongly recommend the AC1 index in situations where more than two coders work together in rotating fashion. Alpha is not suited for this type of scheme and underestimates IRR in this situation.
Further, although AC1 is correlated with percent agreement, it offers construct
validity assurances that percent agreement does not. Although percent agreement
may be a methodologically sound index of IRR when used in very simple coding scenarios (e.g., two coders, nominal data, a limited number of categories, no missing or skewed data), Tables 5.1 and 5.2 demonstrate that it has several limitations in more complex situations that the AC1 index does not. Specifically, AC1 is most valuable under circumstances such as when more than two coders are present, when there are multiple categories to code, when data are not nominal (in which case the related AC2 statistic applies), when missing data must be accounted for, and when data are skewed. These situations demonstrate the value of the AC1 index over and above simple percent agreement.
Given that the coding of dichotomous variables and the use of multiple coders
working in rotating fashion are becoming increasingly common in meta-analytic
and content studies, the AC1 index should become more prevalent. The AC1 index
is an IRR measure that is appropriate for more complex situations, as it overcomes
the role of chance and improves construct validity above and beyond the capabil-
ity of simple percent agreement in many situations.
Implications for Practice

IRR is not only imperative for establishing construct validity in management research; it also has uses within HRM practice. Interviews and performance evaluations, which tend to have multiple "coders," rely on agreement in order to make
accurate assessments leading to job offer, promotion, and termination decisions.
Implementing appropriate metrics for assessing IRR can aid organizations in en-
suring that these decisions are legally permissible, demonstrating validity and
reliability. For instance, ad hoc hiring committees are often made up of multiple
members who rotate in and out of interviews. When assessing agreement, using α would inaccurately report lower agreement about candidates and open the organization up to legal questions about the validity of its hiring process. Employing
AC1 would address this issue and put the firm’s hiring process on more sound
procedural footing.
Further, many hiring decisions are also dichotomous (e.g., "acceptable candidate versus unacceptable candidate" or "qualified candidate versus unqualified candidate"). The use of the AC1 statistic in calculating agreement also has value here, as α and other forms of IRR underestimate agreement when the variable of interest is dichotomous. In this sense, the choice of IRR metric has real financial and legal consequences for HRM practitioners. Understanding the circumstances under which α is not appropriate and AC1 should be used instead could have significant practical implications for HR managers. Yes-versus-no decisions and committee decisions are common in organizational life, making an understanding of the AC1 statistic increasingly important in organizations.
This paper provides a review of the most commonly used IRR indices in the management literature, building upon the flaws that have previously been identified regarding the usefulness of percentage agreement, π, κ, and α as indicators of IRR. Although each of these reliability metrics provides important information about the agreement present amongst coders and is certainly prevalent in the management literature, each is appropriate only in certain contexts. Further, we highlight the utility of the AC1 statistic, which provides reliability information in a different context, above and beyond that provided by α. Specifically, although α is the best choice in a wide variety of circumstances, AC1 demonstrates utility over α when there is a rotation of multiple coders and when the data are dichotomous, common phenomena in both the HRM literature and in practice. Consequently, we are not suggesting that there is a best or worst metric to use, but instead that the choice of IRR index should be a function of the coding scheme used, the number of coders observing and classifying the data, and the organization of the data. The validity of the conclusions we draw as scholars depends upon the degree to which we can rely upon our measures; without trustworthy measures, we cannot provide meaningful suggestions for research or practice.
REFERENCES
Aguinis, H., Dalton, D. R., Bosco, F. A., Pierce, C. A., & Dalton, C. M. (2011). Meta-
analytic choices and judgment calls: Implications for theory building and testing,
obtained effect sizes, and scholarly impact. Journal of Management, 37, 5–38.
Allen, D. G., Hancock, J. I., Vardaman, J. M., & McKee, D. L. N. (2014). Analytical mind-
sets in turnover research. Journal of Organizational Behavior, 35, S61–S86.
Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics.
Computational Linguistics, 34, 555–596.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job per-
formance: A meta-analysis. Personnel Psychology, 44, 1–26.
Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implica-
tions for data aggregation and analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.),
Multilevel theory, research, and methods in organizations: Foundations, extensions,
and new directions (pp. 349–381). San Francisco, CA: Jossey-Bass.
Borenstein, M., Hedges, L. V., Higgins, P. T., & Rothstein, H. R. (2011). Introduction to
meta-analysis. West Sussex, UK: John Wiley & Sons.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alter-
natives. Educational and Psychological Measurement, 41, 687–699.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cronbach, L. J. (1980). Validity on parole: How can we go straight. In W. B. Schrader
(Ed.), New directions for testing and measurement: Measuring achievement over a
decade. (pp. 99–108). San Francisco, CA: Jossey-Bass.
Desa, G. (2012). Resource mobilization in international social entrepreneurship: Bricolage
as a mechanism of institutional transformation. Entrepreneurship Theory and Prac-
tice, 36, 727–751.
De Swert, K. (2012). Calculating inter-coder reliability in media content analysis using
Krippendorff’s Alpha. Center for Politics and Communication, 1–15. Retrieved
from: https://www.polcomm.org/wp-content/uploads/ICR01022012.pdf
Eby, L. T., Casper, W. J., Lockwood, A., Bordeaux, C., & Brinley, A. (2005). Work and
family research in IO/OB: Content analysis and review of the literature (1980–
2002). Journal of Vocational Behavior, 66, 124–197.
Eugenio, B. D., & Glass, M. (2004). The kappa statistic: A second look. Computational
Linguistics, 30, 95–101.
Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems
of two paradoxes. Journal of Clinical Epidemiology, 43, 543–549.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological
Bulletin, 76, 378.
Frey, L., Botan, C. H., & Kreps, G. (2000). Investigating communication. New York, NY:
Allyn & Bacon.
Gwet, K. (2001). Handbook of inter-rater reliability: How to estimate the level of agree-
ment between two or multiple raters. Gaithersburg, MD: STATAXIS Publishing
Company
Gwet, K. (2002a). Inter-rater reliability: Dependency on trait prevalence and marginal ho-
mogeneity. Statistical Methods for Inter-rater Reliability Assessment Series, 2, 1–9.
Kappa and Alpha and Pi, Oh My • 105

Gwet, K. (2002b). Kappa statistic is not satisfactory for assessing extent of agreement
between Raters. Statistical Methods for Inter-rater Reliability Assessment Series,
1, 1–5.
Gwet, K. (2008). Computing inter-rater reliability and its variance in the presence of high
agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48.
Hancock, J. I., Allen, D. A., Bosco, F. A., McDaniel, K. R., & Pierce, C. A. (2013). Meta-
analytic review of employee turnover as a predictor of firm performance. Journal of
Management, 39, 573–603.
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability mea-
sure for coding data. Communication Methods and Measures, 1, 77–89.
Heugens, P. P., & Lander, M. W. (2009). Structure! Agency! (And other quarrels): A meta-
analysis of institutional theories of organization. Academy of Management Journal,
52, 61–85.
Hoch, J. E., Bommer, W. H., Dulebohn, J. H., & Wu, D. (2018). Do ethical, authentic, and
servant leadership explain variance above and beyond transformational leadership?
A meta-analysis. Journal of Management, 44, 501–529.
Judge, T. A., & Ilies, R. (2002). Relationship of personality to performance motivation: A
Meta-analytic review. Journal of Applied Psychology, 87, 797–807.
Koenig, A. M., Eagly, A. H., Mitchell, A. A., & Ristikari, T. (2011). Are leader stereotypes
masculine? A met-analysis of three research paradigms. Psychological Bulletin,
137, 616–642.
Kostova, T., & Roth, K. (2002). Adoption of an organizational practice by subsidiaries of
multinational corporations: Institutional and relational effects. Academy of Manage-
ment Journal, 45, 215–233.
Krippendorff, K. (1970). Estimating the reliability, systematic error and random error of
interval data. Educational and Psychological Measurement, 30, 61–70.
Krippendorff, K. (1980). Reliability. In K. Krippendorff, Content analysis; An introduction
to its methodology (pp. 129–154). Beverly Hills, CA: Sage Publications.
Krippendorff, K. (2004a). Content analysis: An introduction to its methodology (2nd ed.).
Thousand Oaks, CA: Sage.
Krippendorff, K. (2004b). Reliability in content analysis: Some common misconceptions
and recommendations. Human Communication Research, 30, 411–433.
Krippendorf, K. (2011). Computing Krippendor’s Alpha-Reliability. Retrieved from: hpttp.
repository. upenn. edu/asc_papers/43
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categori-
cal data. Biometrics, 33,159–174.
Lebreton, J. M., Burgess, J. R., Kaiser, R. B., Atchley, E. K., & James, L. R. (2003). The
restriction of variance hypothesis and interrater reliability and agreement: Are rat-
ings from multiple sources really dissimilar? Organizational Research Methods, 6,
80–128.
Lee, T. W., & Mitchell, T. R. (1991). The unfolding effects of organizational commitment
and anticipated job satisfaction on voluntary employee turnover. Motivation and
Emotion, 15, 99–121.
Mackey, J. D., Frieder, R. E., Brees, J. R., & Martinko, M. J. (2017). Abusive supervision:
A meta-analysis and empirical review. Journal of Management, 43, 1940–1965.
McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica: Bio-
chemia Medica, 22, 276–282.
106 • JULIE I. HANCOCK, JAMES M. VARDAMAN, & DAVID G. ALLEN

Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.
Perreault Jr, W. D., & Leigh, L. E. (1989). Reliability of nominal data based on qualitative
judgments. Journal of Marketing Research, 26, 135–148.
Pindek, S., Kessler, S. R., & Spector, P. E. (2017). A quantitative and qualitative review
of what meta-analyses have contributed to our understanding of human resource
management. Human Resource Management Review, 27, 26–38.
Popping, R. (1988). On agreement indices for nominal data. In Sociometric Research (pp.
90–105). London, UK: Palgrave Macmillan.
Porter, L. W., & Steers, R. M. (1973). Organizational, work, and personal factors in em-
ployee turnover and absenteeism. Psychological Bulletin, 80, 151.
Rusbult, C. E., & Farrell, D. (1983). A longitudinal test of the investment model: The
impact on job satisfaction, job commitment, and turnover of variations in rewards,
costs, alternatives, and investments. Journal of Applied Psychology, 68, 429.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding.
Public Opinion Quarterly, 19, 321–325.
Tuggle, C. S., Schnatterly, K., & Johnson, R. A. (2010). Attention patterns in the board-
room: How board composition and processes affect discussion of entrepreneurial
issues. Academy of Management Journal, 53, 550–571.
Tuggle, C. S., Sirmon, D. G., Reutzel, C. R., & Bierman, L. (2010). Commanding board of
director attention: investigating how organizational performance and CEO duality
affect board members’ attention to monitoring. Strategic Management Journal, 31,
946–968.
Zwick, R. (1988). Another look at interrater agreement. Psychological Bulletin, 103, 374–
378.
CHAPTER 6

EVALUATING JOB
PERFORMANCE MEASURES
Criteria for Criteria

Angelo S. DeNisi and Kevin R. Murphy

Research aimed at improving performance appraisals dates back almost 100 years,
and there have been a number of reviews of this literature published over the years
(e.g., Austin & Villanova, 1992; Bretz, Milkovich, & Read, 1992; DeNisi & Mur-
phy, 2017; DeNisi & Sonesh, 2011; Landy & Farr, 1980; Smith, 1976). Each of
these papers has chronicled the research conducted to help us better understand
the processes involved in performance appraisals, and how this understanding
could help to improve the overall process. Although these reviews were done at
different points in time, the goal in each case was to draw conclusions concern-
ing how to make appraisal systems more effective. However, while each of these
reviews included studies comparing and contrasting different appraisal systems,
they all simply accepted whatever criterion was used for those comparisons, and,
based on those comparisons, made recommendations on how to conduct better
appraisals. This is an issue since many of the studies reviewed used different cri-
terion measures for their comparisons.
But this issue is even more serious when we realize that the reason why these
studies and reviews have used different criterion measures is that there is no con-
sensus on what is the “best” criterion measure to use when comparing appraisal
systems. Stated simply, if we want to claim that “system A” is better than “system
B,” we need some criterion or criteria against which to com-
pare the systems. Although this would seem to be a basic issue from a research
methods point of view, the truth is that there have been many criterion measures
used over time, but almost all of them are subject to serious criticism. Therefore,
despite 100 years of research on performance appraisal, there is actually very little
we can be certain about in terms of identifying the “best” approaches.
The present paper differs from those earlier review articles, because the present
paper focuses specifically on the problem of criterion identification. Therefore,
our review is not organized according to which rating formats or systems were
compared, but rather, we organized our review around which criterion measures
were used to make those comparisons. Our goal, then, is not to determine which
system is best, but to identify problems with the criterion measures used in the
past, and to propose a somewhat different approach to try to identify a more useful
and credible criterion measure. Therefore, we begin with a discussion of the vari-
ous criterion measures that have typically been used in comparing and evaluating
appraisal studies. In each case, we note the problems that have been identified
with their use, and why they may not really be useful as criterion measures. We
then move on to lay out a comprehensive framework for evaluating the construct
validity of job performance measures that we believe can serve as the basis for
more useful measures to be used in this research.

HISTORICAL REVIEW

Most measures of job performance, namely performance ratings (and sometimes
rankings of employees in terms of performance), rely on the judgments of supervisors or
other evaluators. The question of how to assess the validity of these judgments
has been a recurring concern in performance appraisal research and has often been
discussed under the heading of “the criterion problem” (cf., Austin & Villanova,
1992). Throughout the first 50 years of serious research in this area, performance
measures of this sort were almost always evaluated relative to two types of
criteria: (1) measures of agreement, or (2) measures based on the distributions of
ratings—i.e., rater error measures. There are issues associated with each type of
measure, and we shall discuss these in turn.

Agreement Measures
The reliance upon agreement measures as criteria for evaluating appraisal sys-
tems has a long history. Some type of inter-rater agreement measure has been
used to evaluate appraisal systems from as early as the 1930s (e.g., Remmers,
1934) continuing through the 50s (e.g., Bendig, 1953), the 60s (e.g., Smith &
Kendall, 1963), and the 70s (e.g., Blanz & Ghiselli, 1972). The underlying as-
sumption was that agreement indicated reliable ratings, and, since reliable ratings
are a prerequisite for valid ratings, agreement could be used as a proxy for validity
and accuracy. But, in fact, the situation was much more complex. Viswesvaran,
Ones and Schmidt (1996) reviewed several methods of estimating the reliability
(or the freedom from random measurement error) of job performance ratings and
argued that inter-rater correlations provided the best estimate of the reliability
of performance ratings (See also Ones, Viswesvaran & Schmidt, 2008; Schmidt,
Viswesvaran & Ones, 2000). The correlations between ratings given to the same
employees by two separate raters are typically low, however, and others (e.g.,
LeBreton, Scherer, & James, 2014; Murphy & DeShon, 2000) have argued
that treating inter-rater correlations as measures of reliability makes sense only
if you believe that agreements between raters are due solely to true scores and
disagreements are due solely to random measurement error, a proposition that
strikes us as unlikely.
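The logic at issue can be stated compactly. Under the classical assumption that each rating is a true score plus purely random error, the rating of ratee i by rater r is

\[
x_{ri} = t_i + e_{ri},
\qquad
\rho_{x_1 x_2} = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2},
\]

so the expected inter-rater correlation equals the reliability coefficient only if all disagreement (the e_{ri}) is random. If raters disagree for systematic reasons (different opportunities to observe, different standards), the inter-rater correlation understates reliability; the notation here is generic and is offered only to make the assumption explicit.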
A number of studies have examined the roles of systematic and random error
in performance ratings, as well as methods of estimating systematic and random
error (Fleenor, Fleenor, & Grossnickle, 1996; Greguras & Robie, 1998; Hoff-
man, Lance, Bynum, & Gentry, 2010; Hoffman & Woehr, 2009; Kasten & Nevo,
2008; Lance, 1994; Lance, Baranik, Lau, & Scharlau, 2009; Lance, Teachout, &
Donnelly, 1992; Mount, Judge, Scullen, Sytsma, & Hezlett, 1998; Murphy, 2008;
O’Neill, McLarnon, & Carswell, 2015; Putka, Le, McCloy, & Diaz, 2008; Saal,
Downey, & Lahey, 1980; Scullen, Mount, & Goff, 2000; Woehr, Sheehan, & Ben-
nett, 2005). In general, these studies suggest that there is considerably less random
measurement error in performance ratings than studies of inter-rater correlation
would suggest. For example, Scullen, Mount, and Goff (2000) and Greguras and
Robie (1998) examined sources of variability in ratings obtained from multiple
raters and found that the largest source of variance in ratings is due to raters, some
of which is likely due to biases or general rater tendencies (e.g., leniency).
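A generic variance-components decomposition of the kind estimated in these studies helps make the point (the facets shown are simplified; actual designs often include rating source and dimension facets as well):

\[
\sigma^2_{x} \;=\; \sigma^2_{\text{ratee}} \;+\; \sigma^2_{\text{rater}} \;+\; \sigma^2_{\text{ratee}\times\text{rater}} \;+\; \sigma^2_{\text{residual}}.
\]

Only the residual term behaves like random measurement error; a large rater main effect (e.g., general leniency) is systematic, which is why treating all rater disagreement as unreliability overstates the amount of random error in ratings.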
There have been a number of advances in research on inter-rater agreement,
some involving multi-level analyses (e.g., Conway, 1998), or the application of
generalizability theory (e.g., Greguras, Robie, Schleicher, & Goff, 2003). Others
have examined sources of variability in peer ratings (e.g., Dierdorff & Surface,
2007), and multi-rater systems such as 360 degree appraisals (e.g., Hoffman,
Lance, Bynum, & Gentry, 2010; Woehr, Sheehan, & Bennett, 2005). In all these
cases, results indicated that substantial portions of the variability in ratings were
due to systematic rather than random sources, undercutting the claim (e.g., Schmidt
et al., 2000) that performance ratings exhibit a substantial amount of random
measurement error.
Studies of inter-rater agreement moved from the question of whether raters
agree, to considering why and under what circumstances they agree or disagree.
For example, there is a robust literature dealing with differences in ratings col-
lected from different sources (e.g., supervisors, peers). In general, self-ratings
were found to be typically higher than ratings from others (Valle & Bozeman,
2002), and agreement between subordinates, peers and supervisors was typically
modest, with uncorrected correlations in the .20s and .30s (Conway & Huffcutt,
1997; Valle & Bozeman, 2002). However, given the potentially low levels of reli-
ability for each source, it is likely that the level of agreement among sources is
actually somewhat higher. Harris and Schaubroeck (1988) reported corrected cor-
relations between sources in the mid .30s to low .60s. Viswesvaran, Schmidt, and
Ones (2002) applied a more aggressive set of corrections and suggested that in
ratings of overall performance and some specific performance dimensions, peers
and supervisors show quite high levels of agreement. This conclusion, however,
depends heavily on the assumption that almost half of the variance in performance
ratings represents random measurement error, a conclusion that has been shown to
be incorrect in studies of the generalizability of performance ratings.
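The corrections referred to here are disattenuation corrections; in their simplest form, for two rating sources x and y with estimated reliabilities r_xx and r_yy (our notation), the observed correlation is divided by the square root of the product of the reliabilities:

\[
\hat{\rho}_{xy} \;=\; \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}.
\]

The size of the resulting “corrected” agreement therefore depends directly on how much rating variance one is willing to treat as random error, which is precisely the assumption in dispute.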
Second, it is possible that raters agree about some things and disagree about
others. For example, it is commonly assumed that raters are more likely to agree
on specific, observable aspects of behavior than on more abstract dimensions
(Borman, 1979). Roch, Paquin and Littlejohn (2009) conducted two studies to
test this proposition, and their results suggested that the opposite is true. Inter-
rater agreement was actually higher for dimensions that are less observable or
that are judged to be more difficult to rate. Roch et al. (2009) speculated that this
seemingly paradoxical finding may reflect the fact that when there is less concrete
behavioral information available, raters fall back on their general impressions of
ratees when rating specific performance dimensions.
Other studies (e.g., Sanchez & De La Torre, 1996) have reported that accuracy
in observing behavior was positively correlated with accuracy in evaluating per-
formance. That is, raters who had an accurate recall of what they have observed,
also appeared to be more accurate in evaluating ratees. Unfortunately, however,
accuracy in behavioral observation did not appear to be related in any simple way
to the degree to which the behavior in question is observable or easy to rate.

Rater Error Measures


The most commonly used criterion measures in appraisal research, referred to
as rater error measures, are related to the distributions of ratings. It has often been
assumed that the absence of these errors is evidence for the validity and accuracy
of performance ratings, although as we note later in this paper, this assumption
does not seem fully tenable.
The reliance upon rater error measures as criteria dates back to the earliest
research on performance appraisals (Bingham, 1939; Kingsbury, 1922, 1933;
Thorndike, 1920), and the three most common rater errors: leniency, range re-
striction, and halo error, have continued to influence the ways in which ratings
data are analyzed (e.g., Saal, Downey & Lahey, 1980; Sulsky & Balzer, 1988).
Research dealing with these criterion measures has, in fact, accounted for a great
deal of the research on performance appraisals through much of the 20th century.
The sheer volume of this research makes it difficult to discuss this literature in its
entirety, so it is useful to deal separately with two major categories of “rater errors”:
(a) distributional errors across ratees, and (b) correlational errors within ratees.
Distributional Errors. Measures of distributional errors rely on the assump-
tion that if distributions of performance ratings deviate from some ideal, this
indicates that raters are making particular types of errors in their evaluations.
Although, in theory, any distribution might be viewed as ideal, in practice, the
ideal distribution for the purpose of determining whether or not rater errors have
occurred has been the normal distribution. Thus, given a group of ratees, any
deviation in their ratings from a normal distribution was seen as evidence of a
rating error. This deviation could take the form of “too many” ratees being rated
as excellent (“leniency error”), “too many” ratees being rated as poor (“severity
error”), or “too many” ratees being rated as average (“central tendency error”).
Obviously, the logic of this approach depends upon the assumption that the ideal
(normal) distribution was correct, so that any other distribution that was obtained
was due to some type of error, but, this underlying assumption has been ques-
tioned on several grounds.
First, the true distribution of the performance of the group of employees who
report to a single supervisor is almost always unknown. If it were known, we
would not need subjective evaluations and could simply rely upon the “true” rat-
ings. Therefore, it is impossible to assess whether or not there are “too many”
ratees who are evaluated at any point on the scale. Second, even if there were an ideal
distribution of ratings, there would be no justification for the assumption that it is nor-
mal and centered around the scale midpoint (Bernardin & Beatty, 1984). Rather,
organizations exert considerable effort to assure that the distribution of perfor-
mance is not normal. Saal, Downey, and Lahey (1980) point out that a variety of
activities, ranging from personnel selection to training are designed to produce
a skewed distribution of performance, so that most (if not all) employees should
be—at least—above the midpoint on many evaluation scales. Finally, the use of
distributional data as an indicator of errors assumes there are no true differences
in performance across work groups (cf., Murphy & Balzer, 1989). In fact, a rater
who gives subordinates higher than “normal” ratings may not be more lenient but
may simply have a better group of subordinates who are actually doing a better
job and so deserve higher ratings.
Furthermore, recent research has challenged the notion that job performance is
normally distributed in almost any situation (Aguinis & O’Boyle, 2014; Aguinis,
O’Boyle, Gonzalez-Mulé, & Joo, 2016; Joo, Aguinis, & Bradley, 2017). These
authors argue that in many settings, a small number of high performers (often
referred to as “stars”) contribute disproportionally to the productivity of a group,
creating a distribution that is far from normal. Beck, Beatty, and Sackett (2014)
suggest that the distribution of performance might depend substantially on the
type of performance that is measured, and it is reasonable in many cases to as-
sume nearly normal distributions. The argument over the appropriate distribu-
tional assumptions is a complex one, but the very fact that this argument exists is
a strong indication that we cannot say with confidence that ratings are too high,
or that there is too little variance in or too much intercorrelation among ratings
of different performance dimensions absent reliable knowledge of how ratings
should be distributed. In the eyes of some critics (e.g., Murphy & Balzer, 1989),
the lack of reliable knowledge about the true distribution of performance in the
particular workgroup evaluated by any particular rater makes distributional error
measures highly suspect.
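For completeness, the conventional distributional indices are easy to compute; the sketch below shows common operationalizations (the 7-point scale, midpoint, and ratings are hypothetical), and the caveats above apply in full: none of these statistics can be interpreted as an error without knowing how performance is actually distributed in the work group.

    import statistics

    def distributional_indices(ratings, scale_midpoint=4.0):
        # Conventional operationalizations of distributional "rater errors":
        # a mean above the midpoint is read as leniency (below, severity),
        # and a small standard deviation is read as central tendency or
        # range restriction. Neither reading is justified without knowledge
        # of the true performance distribution.
        return {
            "mean_minus_midpoint": statistics.mean(ratings) - scale_midpoint,
            "standard_deviation": statistics.stdev(ratings),
        }

    # Hypothetical ratings one supervisor gave eight subordinates on a 1-7 scale.
    print(distributional_indices([6, 6, 5, 7, 6, 5, 6, 7]))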
Correlational Errors. Measures of correlational error are built around a similar
assumption: that there is some ideal level of intercorrelation among the ratings each
supervisor assigns. Specifically, it is often assumed that different aspects
or dimensions of performance should be independent, or at least should show
low levels of intercorrelation. Therefore, when raters give ratings of performance
that turn out to be correlated, this is thought to indicate a rating error. This infla-
tion of the intercorrelations among dimensions is referred to as halo error. Cooper
(1981b) suggests that halo is likely to be present in virtually every type of rating
instrument.
There is an extensive body of research examining halo errors in rating, and
a number of different measures, definitions, and models of halo error have been
proposed (Balzer & Sulsky, 1992; Cooper, 1981a,b; Lance, LaPointe, & Stewart,
1994; Murphy & Anhalt, 1992; Murphy, Jako, & Anhalt, 1993; Nathan & Tippins,
1989; Solomonson & Lance, 1997). Although there was disagreement on a num-
ber of points across these proposals, there was substantial agreement on several
important points. First, the observed correlation between ratings of separate per-
formance dimensions reflects both actual consistencies in performance (referred
to as “true halo,” or the actual degree of correlation between two conceptually dis-
tinct performance dimensions) and errors in processing information about ratees
or in translating that information into performance ratings (referred to as “illusory
halo”). Clearly, the degree of true halo does not indicate any type of rating error
but instead reflects the true covariance across different parts of a job; it is only the
illusory halo that reflects a potential rater error (Bingham actually made the same
point in 1939). Second, this illusory halo is driven in large part by raters’ tendency
to rely on general impressions and global evaluations when rating specific aspects
of performance (e.g., Balzer & Sulsky, 1992; Jennings, Palmer, & Thomas, 2004;
Lance, LaPointe, & Stewart, 1994; Murphy & Anhalt, 1992). Third, all agree that
it is very difficult, if not impossible, to separate true halo from illusory halo. Even
in cases where the expected correlation between two rating dimensions is known
for the population in general (for example, in the population as a whole several of
the Big Five personality dimensions are believed to be essentially uncorrelated),
that does not mean that the performance of a small group of ratees on these dimen-
sions will show the same pattern of true independence.
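The conventional halo index is the average intercorrelation among dimension ratings given by a single rater; a minimal sketch appears below (the ratings and dimension structure are hypothetical). Consistent with the points above, the resulting number mixes true and illusory halo and therefore cannot, by itself, be read as an error.

    import numpy as np

    def mean_dimension_intercorrelation(ratings):
        # ratings: a ratee x dimension matrix of scores from one rater.
        # The conventional "halo" index is the mean off-diagonal correlation
        # among the dimension columns; it confounds true and illusory halo.
        corr = np.corrcoef(ratings, rowvar=False)
        off_diagonal = corr[np.triu_indices_from(corr, k=1)]
        return float(off_diagonal.mean())

    # Hypothetical 1-7 ratings of six ratees on three performance dimensions.
    one_raters_matrix = np.array([
        [6, 5, 6],
        [4, 4, 5],
        [7, 6, 6],
        [3, 4, 3],
        [5, 5, 6],
        [6, 6, 7],
    ])
    print(mean_dimension_intercorrelation(one_raters_matrix))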
There is an emerging consensus that measures that are based on the distribu-
tions and the intercorrelations among the ratings given by an individual rater have
proved essentially useless for evaluating performance ratings (DeNisi & Murphy,
2017; Murphy & Balzer, 1989). First, we cannot say with any confidence that a
particular supervisor’s ratings are too high or too highly intercorrelated unless we
know a good deal about the true level of performance, and if we knew this, we
would not need supervisory performance ratings. Second, the label “rater error” is
misleading. It is far from clear that supervisors who give their subordinates high
ratings are making a mistake. There might be several good reasons to give sub-
ordinates high ratings (e.g., to give them opportunities to obtain valued rewards,
to maintain good relationships with subordinates), and raters who know that high
ratings are not truly deserved might nevertheless conclude that it is better to give
high ratings than to give low ones (Murphy & Cleveland, 1995; Murphy, Cleve-
land, & Hanscom, 2018). Finally, as we shall see, there is no evidence to support
the assumption that rating errors have much to do with rating accuracy, an as-
sumption that has long served as the basis for the use of rating errors measures as
criteria for evaluating appraisal systems.

Rating Accuracy Measures


Although not always formally acknowledged, researchers used agreement
measures and rating error measures to evaluate appraisals because these were
seen as proxies for rating accuracy. It was long assumed that we could not assess
rating accuracy directly, and so we needed to use measures that could serve as
reasonable proxies for accuracy. But, starting in the late 1970s, laboratory studies
of performance ratings made it possible to assess rating accuracy directly. This
stream of research focused largely on rater cognitive processes, following sugges-
tions from Landy and Farr (1980), although the earliest research using accuracy
measures actually predated the publication of this article. In that review, Landy and Farr
concluded that research on rating scale formats had not been very useful and sug-
gested, instead, that research focus more on raters themselves and how they made
decisions about which ratings to give.
In order to study the cognitive processes involved in evaluating performance,
researchers were forced to rely more heavily on laboratory research, where it
might be possible to collect data that isolated particular processes. This move
to the lab also made it possible to develop and use direct measures of rating ac-
curacy as criteria: “true scores” could be developed, and actual ratings could then
be compared against them. Much of this
research began with Borman (1978) and continued through the work of Murphy
and colleagues (e.g., Murphy & Balzer, 1986; Murphy, Martin, & Garcia, 1982)
and DeNisi and colleagues (e.g., DeNisi, Robbins, & Cafferty, 1989; Williams,
DeNisi, Meglino, & Cafferty, 1986). The ability to compute the accuracy of a set
of ratings allowed Murphy and colleagues (Murphy & Balzer, 1989) to assess the
relation between rating accuracy and rating error measures, but it also allowed for
different criteria for comparing rating scales as well as rater training techniques,
and it also led to more complex ways of assessing rating accuracy.
Borman and his associates launched a sustained wave of research on rating ac-
curacy, using videotapes of ratees performing various tasks, which could then be
used as stimulus material for rating studies. Borman’s (1977, 1978, 1979) research
was based on the assumption that well-trained raters, observing these tapes under
optimal conditions, could provide a set of ratings which could then be pooled and
averaged (to remove potential individual biases and processing errors) to generate
“true scores” which would be used as the standard against which all other ratings
could be compared. That is, these pooled ratings, collected under optimal condi-
tions, could be considered to be an accurate assessment of performance, which
could then be used as criterion measures in subsequent research.
Rating accuracy measures, similar to those developed by Borman were widely
used in appraisal studies focusing on rater cognitive processes (Becker & Cardy,
1986; Cardy & Dobbins, 1986; DeNisi, Robbins, & Cafferty, 1989; McIntyre,
Smith, & Hassett, 1984; Murphy & Balzer, 1986; Murphy, Balzer, Kellam, &
Armstrong, 1984; Murphy, Garcia, Kerkar, Martin, & Balzer, 1982; Pulakos,
1986; Williams et al., 1986), but were also used in studies comparing different
methods of rater training (e.g., Pulakos, 1986), and even in studies comparing dif-
ferent types of rating scales (e.g., DeNisi, Robbins, & Summers, 1997). A review of
research on rating accuracy measures can be found in Sulsky & Balzer (1988).
Different Types of Accuracy. Attempts to increase the accuracy of perfor-
mance ratings are complicated by the fact that there are many different types of
accuracy. At a basic level, Murphy (1991) argued for making a distinction between
behavioral accuracy and classification accuracy. Behavioral accuracy referred to
the ability to discriminate between good and poor incidents of performance, while
classification accuracy referred to the ability to discriminate between the best per-
former, the second-best performer, and so on. Murphy (1991) also argued that
the purpose for which the ratings were to be used should dictate which type of
accuracy was more important, but it seems clear that these measures answer different
questions about rating accuracy and that both are likely to be important.
At a more complex level, Cronbach (1955) had noted that there were several
ways we could define the agreement between a set of ratings provided by a rater
and a set of true scores. Specifically, he defined four separate components of ac-
curacy: (1) Elevation—the accuracy of the average rating, over all ratees and
dimensions, (2) Differential Elevation—the accuracy in discriminating among
ratees, (3) Stereotype Accuracy—accuracy in discriminating among performance
dimensions across all ratees, and (4) Differential Accuracy—accuracy in detect-
ing ratee differences in patterns of performance, such as diagnosing individual
strengths and weaknesses. Research suggests that the different accuracy measures
are not highly correlated (Sulsky & Balzer, 1988), so that the conclusions one
draws about the accuracy of a set of ratings may depend more upon the choice
of accuracy measures than on a rater’s ability to evaluate his or her subordinates
(Becker & Cardy, 1986).
Several scholars had questioned the assumption that rater error measures were
useful proxies for assessments of accuracy (e.g., Becker & Cardy, 1986; Coo-
per, 1981b; Murphy & Balzer, 1986). It was Murphy and Balzer (1989) who, using data
from over 800 raters, provided the first direct empirical examination of this as-
sumption. They reported that the relationship between any of the common rating
errors and rating accuracy was either zero, or it was in the wrong direction (i.e.,
more rater errors were associated with higher accuracy). In particular, they re-
ported that the strongest error-accuracy relationship was between halo error and
accuracy, but that higher levels of halo were associated with higher levels of ac-
curacy—not lower levels, as should have been the case.
Accuracy measures have proved problematic as criteria for evaluating ratings.
First, different accuracy measures often lead to quite different conclusions about
rating systems; several authors have suggested that the purpose for the apprais-
als should probably dictate which type of accuracy measure should be used to
evaluate ratings (e.g., Murphy, 1991; Murphy & Cleveland, 1995). Furthermore,
direct measures of accuracy are only possible in highly controlled settings, such
as laboratory studies, making these measures less useful for field research. Fi-
nally, Ilgen, Barnes-Farrell and McKellin (1993) raised wide-ranging questions
about whether or not accuracy was the right goal in performance appraisal—and
therefore whether it was the best criterion measure for appraisal research. This
point was also raised elsewhere by DeNisi and Gonzalez (2004), and Ilgen (1993).

Alternative Criteria for Evaluating Ratings


Neither agreement indices, rater error measures nor rating accuracy measures
have proved satisfactory as criteria for evaluating ratings. A number of authors
have proposed alternative criterion measures to be used in research. For example,
DeNisi and Peters (1996) conducted one of the few field studies examining cogni-
tive processes usually studied in the lab. These authors examined rater reactions
to the appraisals they gave, as well as rating elevation and rating discrimination (be-
tween and within ratees) as criterion measures to evaluate the effectiveness of two
different interventions intended to improve rater recall of performance informa-
tion. Varma, DeNisi, and Peters (1996) relied upon the information recorded in rater
diaries to generate proxies for rating accuracy (i.e., the extent to which ratings
reflected what raters recorded in their diaries). But neither alternative seemed
likely to replace other criterion measures, although they did point to some other
possibilities.
Ratee (not rater) reactions to the appraisal process have a substantial history
of use as criterion measures, dating back at least to the late 1970s (Landy, Barnes,
& Murphy, 1978; Landy, Barnes-Farrell, & Cleveland, 1980). Specifically, this
research focused on ratees’ perceptions of the fairness of the appraisal process as
a criterion measure, and others have adopted this approach as well (e.g., Taylor,
Tracy, Renard, Harrison, & Carroll, 1995). Consistent with the larger body of lit-
erature on organizational justice, scholars have suggested that ratees’ perceptions
about the fairness of the ratings they received, as well as of the rating process itself,
are important criteria for evaluating the effectiveness of any appraisal system (cf.,
Folger, Konovsky, & Cropanzano, 1992; Greenberg, 1986, 1987). This focus was
consistent with the recommendations of Ilgen et al. (1993) and DeNisi and Gon-
zalez (2004), and assumes that employees are most likely to accept performance
feedback, to be motivated by performance-contingent rewards, and to view their
organization favorably if they view the performance appraisal system as fair and
honest.
In our view, perceptions of fairness should be thought of as a mediating vari-
able rather than as a criterion. The rationale for treating reactions as a mediating
variable is that performance ratings are often used in organizations as means of
improving performance, and reactions to ratings probably have a substantial im-
pact on the effectiveness of rating systems. It is likely that performance feedback
will lead to meaningful and useful behavior changes only if the ratee perceives the
feedback (i.e., the ratings) received as fair, and accepts this feedback. Ratee per-
formance may not actually improve, perhaps because of a lack of ability or some
situational constraint, but increasing an incumbent’s desire to improve and a will-
ingness to try harder is assumed to be a key goal of performance appraisal and per-
formance management systems. Unfortunately, feedback, even when accepted, is
not always as effective as we had hoped it might be (cf., Kluger & DeNisi, 1996).
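The mediating role we describe has a direct analytic counterpart. The sketch below uses simulated data (all variable names and effect sizes are illustrative assumptions, not findings) to show the basic regression logic of an indirect path from an appraisal feature, through perceived fairness, to performance improvement.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500

    # Simulated data: a feedback-quality feature, perceived fairness as the
    # hypothesized mediator, and subsequent performance improvement.
    feedback_quality = rng.normal(size=n)
    fairness = 0.5 * feedback_quality + rng.normal(size=n)
    improvement = 0.4 * fairness + 0.1 * feedback_quality + rng.normal(size=n)

    def ols_slopes(y, *predictors):
        # Ordinary least squares with an intercept; returns predictor slopes.
        X = np.column_stack([np.ones_like(y), *predictors])
        coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coefs[1:]

    a = ols_slopes(fairness, feedback_quality)[0]               # feature -> mediator
    b = ols_slopes(improvement, fairness, feedback_quality)[0]  # mediator -> outcome
    print("indirect effect (a * b):", a * b)                    # roughly 0.5 * 0.4 = 0.2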

Conclusions
Our review of past attempts at identifying criterion measures for evaluating
performance appraisals suggests that one of the reasons for the recurring failure
in the century-long search for “criteria for criteria” is the tendency to limit this
search to a single class of measures, such as inter-rater agreement measures, rater
error scores, indices of rating accuracy and the like. Although some type of ratee
reaction measure may be more reasonable, this criterion is also narrow and deals
with only one aspect of appraisals.
Early in the history of research on criteria for criteria, Thorndike (1949) re-
minded us of the importance of keeping the “ultimate criterion” in mind.
He defined this ultimate criterion as the “complete and final goal” of the assess-
ment or intervention being evaluated (p. 121). In the field of performance ap-
praisal, this “ultimate criterion” is an abstraction, in part because performance ap-
praisal has many goals and purposes in most organizations (Murphy et al., 2018).
Nevertheless, this abstraction is a useful one, in part because it reminds us that no
single measure or class of measure is likely to constitute an adequate criterion for
evaluating performance appraisal systems. Each individual criterion measure is
likely to have a certain degree of criterion overlap with the ultimate criterion (i.e.,
each taps some part of the ultimate criterion), but each is also likely to suffer from a
degree of criterion contamination (i.e., each measure is affected by things outside
of the ultimate criterion). The search for a single operational criterion for criteria
strikes us as pointless.
We propose to evaluate measures of job performance in the same way we eval-
uate other measures of important constructs—i.e., through the lens of construct
validation. In particular, we propose a framework for evaluating performance rat-
ings that draws upon methods widely used to assess the construct validity of tests
and assessments (American Educational Research Association, 2014).

CONSTRUCT VALIDATION AS A
FRAMEWORK FOR ESTABLISHING
CRITERIA FOR CRITERIA

Construct validation is most commonly associated with testing (American Educa-
tional Research Association, 2014). Specifically, there are many tests designed to
measure constructs, such as Intelligence or Agreeableness. A construct is a label
we use to describe a set of related behaviors or phenomena, and constructs do not exist
in the literal sense. We cannot see Intelligence although we can infer it, and con-
structs such as Agreeableness or Intelligence are extremely useful in helping us to
understand behavior. The process of construct validation, then, involves assessing
whether the measure (of Intelligence, for example) actually measures the con-
struct (i.e., actually measures Intelligence), and in some sense, most assessments
of validity (regardless of the specific approach used) can really be thought of as
assessments of construct validity (Cronbach, 1990; Murphy & Davidshofer, 2005).
Evaluating the construct validity of any measure does not involve the simple
and straightforward application of one method. Instead, it is a process of collect-
ing information in support of construct validity and establishing what is referred
to as a “nomological network.” Establishing this network really involves testing
a series of hypotheses concerning the construct. These hypotheses take the form
of “if this instrument measures intelligence, then scores on this instrument should
be positively (or negatively) related to other outcomes.” As evidence in support
of these hypotheses is collected, the case for construct validity builds. In the final
analysis, we can never “prove” the construct validity of any measure, but we can
amass enough data in support of construct validity that it becomes generally ac-
cepted as measuring what it is intended to measure.
This same approach can be used to assess the construct validity of perfor-
mance appraisals (similar approaches have been proposed by Borman (1991) and
Milkovich & Wigdor (1991) but never implemented to our knowledge; see also
Stone-Romero, Alvarez, & Thompson (2009)). Basically, this approach involves
collecting data to test hypotheses supporting the claim that performance ratings
measure actual job performance. What kinds of data should we collect? There are
several classes of data that would be useful.
The framework we propose for evaluating job performance measures has three
components: (1) construct explication, (2) multiple evidence sources, and (3) the
accumulation and synthesis of relevant evidence to draw conclusions about the
extent to which job performance measures reflect the desired constructs and ful-
fil their desired purposes. That is, in order to evaluate performance ratings and
performance appraisal systems, we have to first know what they are intended to
measure and to accomplish, then collect the widest array of relevant evidence,
then put that information together to draw conclusions about how well our perfor-
mance measures reflect the constructs they are designed to reflect and achieve the
goals they are designed to accomplish.

Construct Explication
Construct explication is the process of defining the meaning and the correlates
of the construct one wishes to measure (Cook & Campbell, 1979; Shadish, Cook,
& Campbell, 2001). Applying this notion to performance appraisal systems in-
volves answering three questions, two of which focus on performance itself and
the last of which focusses on the purpose of performance appraisal systems in or-
ganizations: (1) what is performance? (2) what are its components? and (3) what
are we trying to accomplish with a PA system? We can begin by drawing upon
existing, well-researched models of the domain of job performance (Campbell,
1990; Campbell, McCloy, Oppler, & Sager, 1993) to answer the first two ques-
tions, although we also propose a general definition of job performance as the total
value of a person’s contributions to the organization over a defined period of time.
This broad definition, however, requires further explication.
Campbell (1990) suggested that there were eight basic dimensions of job per-
formance that applied to most jobs, so that job performance could be defined as
how well an employee performed each. These were: job-specific task proficiency
(tasks that make up the core technical requirements of a job); non-job-specific
task proficiency (tasks not specific to the job but required by all jobs in the organi-
zation); written and oral communications; demonstrating effort (how committed a
person is to job tasks and how persistently and intensely they work at those tasks);
maintaining personal discipline (avoiding negative behavior at work); facilitating
team and peer performance (support, help, and development); supervision (in-
fluencing subordinates); and management and administration (non-supervisory
functions of management including goal setting).
Subsequent discussions (e.g., Motowidlo & Kell, 2013), expanded the crite-
rion space to include contextual performance (behavior that contributes to or-
ganizational effectiveness through its effects on the psychological, social, and
organizational context of work, but is not necessarily part of any person’s formal
job description), counterproductive performance (behaviors that are carried out to
hurt and hinder effectiveness and have negative expected organizational value),
and adaptive performance (which includes the ability to transfer training/learning
from one task to another, coping and emotional adjustment, and showing cultural
adaptability).
Assessing the degree to which an appraisal instrument captures the critical as-
pects of job performance is largely an issue of content validity. Although content
validity has traditionally been used in connection with validating tests, it clearly
applies to evaluating appraisal instruments as well. In the case of appraisal instru-
ments this would mean the extent to which the content of the appraisal instrument
overlaps with defined performance on the job in question. Thus, the issue would
be assessing whether or not the appraisal instrument captures all the aspects of
job performance discussed above. This type of assessment is likely to rely on ex-
pert judgment, but there are many tools that can be applied to bring rigor to these
judgments. Lawshe (1975) first proposed a quantitative approach for assessing
the degree of agreement among those experts, resulting in the Content Validity
Index (CVI). Subsequent research (e.g., Polit, Beck, & Owen, 2007) supports the
usefulness of this index as a means of assessing content validity and could be used
with regard to appraisal instruments as well.
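Lawshe's index is simple to state. If N experts judge each performance dimension and n_e of them rate the dimension as essential, the content validity ratio for that dimension is

\[
\mathrm{CVR} = \frac{n_e - N/2}{N/2},
\]

and the CVI for the instrument is commonly computed as the average CVR (or, in later variants such as Polit et al.'s, the proportion of experts rating each element as relevant) across the retained dimensions. Treating the appraisal form's performance dimensions as the “items” and job-knowledgeable judges as the “experts” is an extrapolation of the method to appraisal instruments rather than Lawshe's original application.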
Addressing the third question requires knowledge of the context in which work
is performed and the goals of the organization in creating and implementing the
appraisal system (Murphy, Cleveland, & Hanscom, 2018). This involves consid-
eration of the reasons why organizations conduct appraisals and the ways in which
they use appraisals information. The model suggested by Cleveland, Murphy and
Williams (1989) is particularly useful in this regard. Those authors distinguish
among: between-person distinctions (e.g., who gets a raise or is promoted); with-
in-person distinctions (e.g., identification of training needs); systems maintenance
(e.g., evaluating HR systems); and documentation (e.g., justification for personnel
actions). Of course, in most organizations, appraisal information will be used for
several (if not all) of these purposes, but it is important to assess the effectiveness
of appraisal systems for each purpose for which information is used.

Evidence
There are many types of evidence that are relevant for evaluating performance
measures, and we discuss a number of these, but it is surely the case that there are
other types of evidence that could be collected as well. But, perhaps the most basic
type of evidence could be derived by simply examining the actual content of the
rating scales used to assess performance. This content should be based upon care-
ful job analysis that provides clear and unambiguous definitions of performance
dimensions that are related to the job in question. The basic dimensions suggested
by Campbell (1990), and discussed above would provide a good starting point,
although adding aspects of performance such as contextual performance, coun-
terproductive performance and adaptive performance would help ensure a more
complete view of a person’s contribution to the organization. These dimensions
might be expressed in terms of behaviors, goals, or outcomes, but arguing that
personality traits or attitudes are related to these performance dimensions requires
an extra step and an extra set of assumptions.
Evidence could also be collected by assessing the convergent validity of vari-
ous measures of performance. The assessment of convergent validity is common-
ly a part of any base of evidence for construct validity and is concerned with the
extent to which different measures, claiming to assess the same construct, are
related to each other. In the case of performance appraisals, these “other” mea-
sures might include objective measures of performance, in situations where such
measures are possible. In fact, there is evidence to suggest that performance rat-
ings and objective performance measures are related (corrected correlations in the
.30s and .40s), but not substitutable (e.g., Bommer, Johnson, Rich, Podsakoff, &
MacKenzie, 1995; Conway, Lombardo, & Sanders, 2001; Heneman, 1986; Mabe
& West, 1982).
We could also approach convergent validation by comparing ratings of the
same person, using the same scale, but provided by different raters. This could be
viewed as the interrater agreement criterion discussed earlier, but those measures
typically involved multiple raters at the same level. The notion of 360 degree rat-
ings (or multi-source ratings) assumes that raters who have different relationships
with a ratee might evaluate that ratee differently (otherwise there would be no
reason to ask for ratings from different sources) and the level of agreement across
sources is seen as an important component of the effectiveness of these systems
(e.g., Atwater & Yammarino, 1992). In general, data suggest that ratings from dif-
ferent sources are related, but not highly correlated, so that the rating source has
an important effect on ratings (e.g., Harris & Schaubroeck, 1988; Mount, Judge,
Scullen, Sytsma, & Hezlett, 1998). Woehr, Sheehan, and Bennett (2005) also
reported a strong effect for rating source, although they did find that the effects of
performance dimensions were the same across sources.
In both cases, there is surely some question about whether these different mea-
sures actually purport to measure the same things. Objective performance mea-
sures typically assess output only. It is possible that an employee’s performance is
more than just the number of units sold or produced. Nevertheless, evidence that
objective and subjective assessments of performance and effectiveness converge
can represent an important aspect of the validation of an appraisal system.
It is also possible to assess construct validity by examining evidence of cri-
terion-related validity. Performance ratings are among the most commonly used
criteria for validating selection tests. There is a large body of data demonstrating
that tests designed to measure job-relevant abilities and skills are consistently
correlated with ratings of job performance (cf., Schmidt & Hunter, 1998; Woehr
& Roch, 2016). We typically think of these data as evidence for the validity of the
selection tests rather than for the performance ratings, but they can be used for
both. That is, if there is a substantial body of evidence demonstrating that predic-
tors of performance that should be related to job performance measures actually
are related to performance ratings (and there is such a body of evidence) then
performance ratings are likely to be capturing at least some part of the construct
of job performance.
Another way of gathering evidence about the construct validity of perfor-
mance ratings is to determine whether ratings have consistent meanings across
contexts or cultures. Performance appraisals are used in numerous countries and
cultures; multinational corporations might use similar appraisal systems in many
nations. The question of whether performance appraisals provide measures that
can reasonably be compared across borders is therefore an important one. Ploy-
hart, Wiechmann, Schmitt, Sacco, and Rogg (2003) examined ratings of technical
proficiency, customer service and teamwork given to fast food workers in Canada,
South Korea and Spain and concluded that ratings show evidence of invariance.
In particular, raters appeared to interpret the three dimensions in similar ways and
to apply comparable performance standards when evaluating their subordinates.
However, there was also evidence of some subtle differences in perceptions that
could make direct comparisons across countries complex. In particular, raters in
Canada perceived smaller relations between Customer Service and Teamwork
than did raters in South Korea and Spain. On the whole, however, Ployhart et al.
(2003) concluded that ratings from these countries reflected similar ideas about
the dimensions and about the performance levels expected and could therefore be
used to make cross-cultural comparisons.
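Invariance evidence of this kind is usually framed in terms of a multi-group measurement model; in schematic form (our notation, not Ployhart et al.'s), the observed rating on dimension j in country g is modeled as

\[
x_{jg} = \tau_{jg} + \lambda_{jg}\,\eta_{g} + \varepsilon_{jg},
\]

and cross-country comparability is supported to the extent that the loadings (\(\lambda_{jg} = \lambda_j\)) and, for comparisons of rating levels, the intercepts (\(\tau_{jg} = \tau_j\)) can be constrained to be equal across countries without a meaningful loss of model fit.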
Similarly, there is evidence of measurement equivalence when performance
ratings of more experienced and less experienced raters are compared (Greguras,
2005). Even though experience as a supervisor is likely to influence the strate-
gies different supervisors apply to maximize the success of their subordinates, it
appears that supervisors using a well-developed performance appraisal system
are likely to agree regarding the meaning of performance dimensions and perfor-
mance levels.
Also, we might use evidence regarding bias in performance ratings to evaluate
the construct validity of these ratings. The rationale here is that if performance
ratings can be shown to be strongly influenced by factors other than job perfor-
mance, this would tend to argue against the proposition that performance ratings
provide valid measures of job performance (Colella, DeNisi, & Varma, 1998).
There is a substantial literature dealing with the question of whether or not per-
formance ratings are biased by factors that are presumably unrelated to actual job
performance, such as the demographic characteristics of ratees or the characteris-
tics of work groups.
A full review of that literature is beyond the scope of the present paper, but
it is worth noting that there is evidence of some bias based on employee gender
(i.e., bias against women; see review by Roberson, Galvin, & Charles, 2007);
employee age (i.e., older workers are rated somewhat higher); race and ethnicity
(i.e., minority group members tend to receive somewhat lower ratings); and dis-
ability status (i.e., Colella, DeNisi, & Varma, 1998; Czajka & DeNisi, 1988), as
well as bias based on attributes viewed negatively by most of the population (e.g.,
obesity, low levels of physical attractiveness; see for example, Bento, White, &
Zacur, 2012).
Although there is evidence that performance ratings show some biases, it is im-
portant to note that age, gender, race and disability tend to have very small effects
on performance ratings (Landy, 2010), and these factors may not be as important
as some have suggested. In fact, several review authors have concluded that bias
is not a significant issue in most appraisals (e.g., Arvey & Murphy, 1998; Bass &
Turner, 1973; Baxter, 2012; Bowen, Swim, & Jacobs, 2000; DeNisi & Murphy,
2017; Kraiger & Ford, 1985; Landy, Shankster, & Kohler, 1994; Pulakos, White,
Oppler, & Borman, 1989; Waldman & Avolio, 1991). Studies using laboratory
methods (e.g., Hamner, Kim, Baird, & Bigoness, 1974; Rosen & Jerdee, 1976;
Schmitt & Lappin, 1980), are more likely to report demographic differences in
ratings, especially when those studies involve vignettes rather than observations
of actual performance, but these biases do not appear to be substantial in ratings
collected in the field (see meta-analytic results reported by Murphy,
Herr, Lockhart, & Maguire, 1986). This is not to say that there are not situa-
tions where bias is very real and very serious (e.g., Heilman & Chen, 2005), but
the general hypothesis that performance ratings are substantially biased against
women, members of minority groups, older workers or disabled workers does not
seem credible (DeNisi & Murphy, 2017; Murphy et al., 2018). On the whole, the
lack of substantial bias typically encountered in performance appraisals can be
considered as evidence in favor of the construct validity of performance ratings.
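A common first screening step in this literature is simply to quantify group differences in ratings as a standardized mean difference; the sketch below (with hypothetical ratings) computes Cohen's d, and, consistent with the argument above, a nonzero d indicates a group difference in ratings, not bias, unless true performance is known to be equal across groups.

    import statistics

    def cohens_d(ratings_group_1, ratings_group_2):
        # Standardized mean difference in ratings between two groups of ratees,
        # using the pooled standard deviation. A group difference is not, by
        # itself, evidence of rater bias.
        m1, m2 = statistics.mean(ratings_group_1), statistics.mean(ratings_group_2)
        s1, s2 = statistics.stdev(ratings_group_1), statistics.stdev(ratings_group_2)
        n1, n2 = len(ratings_group_1), len(ratings_group_2)
        pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
        return (m1 - m2) / pooled_sd

    # Hypothetical overall ratings for ratees from two demographic groups.
    print(cohens_d([5, 6, 5, 7, 6, 6, 5], [5, 5, 6, 6, 5, 6, 5]))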
Finally, evidence regarding employee reactions to appraisals and perceptions
that the ratings are fair would be worth collecting. As noted earlier, the research
focusing on ratee reactions and perceptions of fairness has a reasonably long his-
tory (e.g., Landy, Barnes, & Murphy, 1978; Landy, Barnes-Farrell, & Cleveland,
1980), and continues to be studied as an important part of the entire performance
management process (cf., Folger, Konovsky, & Cropanzano, 1992; Greenberg,
1986, 1987; Greenberg & Folger, 1983; Taylor, Tracy, Renard, Harrison, & Car-
roll, 1995). But, since ratee reactions are seen as mediating variables relating to
ratee motivation to improve performance and, ultimately, to actual performance
improvement, data on reactions should be collected in conjunction with data on
actual performance improvement.

Synthesis
Synthesizing evidence from all (or even many) of these sources is a non-trivial
task. Therefore, as with all construct validation efforts, the process will take time
and effort, and will not be a one-step evaluation process. Also, the construct vali-
dation process will involve continuing efforts to collect evidence so that we may
become more and more certain about any conclusions reached. In any case, the
process requires the accumulation of evidence and the judgment as to how strong
a case has been made for construct validity. Since the final assessment will neces-
sarily be a matter of judgment, it is clear that there are a number of issues that will
need to be addressed.
One such issue is the determination of how much evidence is enough. Obvi-
ously, more evidence is always preferable, but collecting more evidence may not
always be practical. Therefore, the question will remain as to how many “pieces”
of evidence will be needed to make a convincing case. The amount of evidence
required may also be a function of whether or not all the available evidence
comes to the same conclusion. That is, it may be the case that relatively few bits
of evidence are sufficient if they all indicate that the appraisal instrument has
sufficient content validity. But what if there is no consensus with regard to the
evidence?
Therefore, another important issue in developing a protocol for evaluating the
construct validity of performance measures is determining how to reconcile dif-
ferent streams of evidence that suggest different conclusions. First, there must be
a decision as to whether a case could be made for construct validity in the pres-
ence of any contradictory evidence. Then, assuming some contradictory evidence,
a decision must be made concerning how to weigh different types of evidence.
Earlier, in our discussion of traditional measures for evaluating appraisal instru-
ments, we noted that rating errors were not a good proxy for rating accuracy, and
probably not a good measure for evaluation at all. It would seem reasonable then,
that evidence relating to rating errors could be discounted in any analysis. But
what about assessing other types of evidence such as measurement equivalence
or source agreement, or the absence of bias? How much weight to give each of
these will ultimately be a judgment call, and the ability of anyone to make a case
for construct validity will depend largely upon one’s ability to make the case for
some differential weighting.
But there may be one type of evidence that can be given some precedence in
this process. We argue that, while organizations conduct appraisals for a number
of reasons, ultimately they do so in the hope of helping employees
to improve their performance. Therefore, some deference should be shown to
evidence that supports this improvement. That is, if there is evidence that imple-
menting an appraisal system has resulted in a true improvement in individual
performance, this should be given a fair amount of weight in supporting the con-
struct validity of the system. Furthermore, evidence that the appraisal system has
also resulted in true improvement in performance at the level of the firm should
be given even more weight. We note, however, that evidence clearly linking im-
provements in individual-level performance with improvements in firm-level performance is
extremely rare (cf., DeNisi & Murphy, 2017).

DIRECTIONS FOR FUTURE RESEARCH AND CONCLUSIONS

So, where do we go from here? We believe that one of the major reasons for the
recurring failure in the century-long search for “criteria for criteria” is the tenden-
cy to limit this search to a single class of measures, such as inter-rater agreement
measures, rater error scores, indices of rating accuracy and the like. Some of these
measures have serious enough problems that they probably should not be used at
all, but, even if we accept that some of these measures provide us some insight
as to the usefulness of appraisal systems, they can only tell us part of the story.
Instead, we have proposed reframing the criteria we use to evaluate measures of
job performance in terms of the way we evaluate other measures of important
constructs—i.e., through the lens of construct validation.
But, the approach we have proposed suggests that the evaluation process will be
complex. It requires collecting different types of data, where each data source can
tell us something about the effectiveness of appraisal systems, but where only
when we combine these different sources will we begin to get a true picture of
effectiveness. We have discussed a number of such data sources, which we have
termed sources of evidence of construct validity, and research needs to contin-
ue to identify and refine these sources of evidence. Research needs to more fully
examine issues of convergence across rating sources. There is evidence to suggest
that ratings of the same person, from different rating sources, are correlated, but
are not substitutable. Is this because of measurement error, bias, or is it because
raters who have different relationships with a ratee observe different behaviors?
Perhaps peers, supervisors, subordinates, etc. see similar things but apply differ-
ent standards in evaluating what they see. Determining the source of the disagree-
ment may help us to establish the upper boundaries of agreement that could be expected so that
we can more accurately assess convergence across sources.
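As one possible way to quantify cross-source convergence, the sketch below computes intraclass correlations with the pingouin package. The long-format data file and column names are hypothetical and stand in for whatever multisource ratings a researcher has collected.

```python
# Illustrative check of cross-source convergence in ratings of the same ratees.
# Column names and the data file are hypothetical; 'source' might take values
# such as 'supervisor', 'peer', or 'subordinate'.
import pandas as pd
import pingouin as pg

ratings = pd.read_csv("multisource_ratings.csv")  # long format: ratee, source, rating

# Intraclass correlations treat each source as a "rater" of the same ratee;
# low ICCs flag disagreement whose origin (error, bias, or genuinely different
# observed behavior) then requires substantive follow-up.
icc = pg.intraclass_corr(
    data=ratings, targets="ratee", raters="source", ratings="rating"
)
print(icc[["Type", "ICC", "CI95%"]])
```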
More information about equivalence of ratings across cultures and contexts is
also needed. This type of research may require special efforts to overcome the ef-
fects of language differences, as well as differences in definitions across cultures.
For example, Farh, Earley, and Lin (1997) examined how American and Chinese
workers viewed the idea of organizational citizenship behavior (OCB). They found
that it was necessary to go beyond the mere translation of OCB scales developed
in the West. Instead, they generated a Chinese definition of OCB and found that
measures of this Chinese version of OCB displayed the same relations with vari-
ous justice measures as the U.S. based measures did. But they also found that the
translated measures did not display the same relations. They concluded that citi-
zenship behavior was as important for the Chinese sample as it was for the U.S.
sample, but that the two groups defined citizenship in slightly different ways, and
it was important to respect these differences when comparing results. Therefore, it
may not be enough to simply translate appraisal instruments in order to compare
equivalence across cultures. But, on the other hand, at some point, the conceptual-
izations may be so different as to suggest that there is really not any equivalence.
These issues require a great deal of further research.
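As a rough, illustrative first check of equivalence (not a substitute for formal multi-group invariance testing), the sketch below compares one-factor loadings of the same translated items in two samples using the factor_analyzer package. The file, item names, and grouping column are hypothetical.

```python
# Rough first look at cross-cultural equivalence: compare one-factor loadings
# of the same (translated) rating items in two samples. This is only a
# descriptive check, not a formal multi-group invariance test.
# File and column names are hypothetical.
import pandas as pd
from factor_analyzer import FactorAnalyzer

items = [f"item_{i}" for i in range(1, 8)]  # the translated appraisal items
df = pd.read_csv("cross_cultural_ratings.csv")  # includes a 'country' column

for country, sample in df.groupby("country"):
    fa = FactorAnalyzer(n_factors=1, rotation=None)
    fa.fit(sample[items])
    loadings = pd.Series(fa.loadings_.ravel(), index=items)
    print(f"\n{country} loadings:\n{loadings.round(2)}")
# Markedly different loading patterns would suggest the construct is not being
# measured equivalently across the two samples.
```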
We noted that, although there is evidence of different types of bias in perfor-
mance ratings, these biases actually explained only small amounts of variance in
actual ratings. It is important to obtain clear estimates of how important bias may
be for ratings in different settings. This too may allow us to set upper boundaries
to help interpret data on bias, but it will also help to identify cases where bias is
more serious and to determine what to do in such situations.
Finally, although we believe that further research is needed on several of these
topics as helping to establish the construct validity of appraisals, we also see it as es-
pecially important to generate evidence that appraisal systems matter. That is, we
view it as critical that any attempt to evaluate appraisal systems includes data showing that
feedback from appraisals changes behavior. We view reliance upon ratee reactions
and perceptions of fairness as an important step in this process, but, ultimately,
ratee reactions and perceptions would only serve to mediate the relationships be-
tween appraisal results and performance improvement. Whatever other evidence exists,
it would be difficult to establish the construct validity of appraisals that did not help
employees to improve. Furthermore, beyond improving the performance of in-
dividual employees, it is also important to show how appraisal systems can help
firms improve performance of the firm itself. This would require demonstrating
how changes in individual-level performance actually translate into changes in
firm-level performance, and, as noted by DeNisi & Murphy (2017), data sup-
porting such a relation is extremely rare and will require a great deal of effort to
collect.
The search for “criteria for criteria” has been a long and disappointing one, in
part because none of the particular measures (e.g., agreement, rater errors) that
have been proposed have been fully adequate for the task. Other reviews of the
appraisal literature focused upon ways to improve appraisals, but they did not
consider potential problems with the criteria used to evaluate appraisal systems.
The present review focused explicitly upon questions of the criterion used, and
a critical review indicated that there were problems with most of the criterion
measures used in the past. After completing this review, we believe it is time to
abandon the search for one single criterion measure that could allow us to evalu-
ate performance appraisal systems and to adopt the same approach we have ad-
opted to validating most other measures. The evaluation of performance appraisal
systems will involve the same sort of complex, ongoing system of collecting,
weighing and evaluating evidence that we routinely apply when asking whether,
for example, a new measure of Agreeableness actually taps this construct. The
good news is that we have all of the tools and training we need, as well as a well-
established framework for validation. We look forward to applications of the con-
struct validation framework to the important question of evaluating performance
appraisal systems.

REFERENCES

American Educational Research Association, American Psychological Association, Na-
tional Council on Measurement in Education, Joint Committee on Standards for
Educational and Psychological Testing (U.S.). (2014). Standards for educational
and psychological testing. Washington, DC: AERA.
Aguinis, H., & O’Boyle, E. (2014). Star performers in twenty-first-century organizations.
Personnel Psychology, 67, 313–350.
Aguinis, H., O’Boyle, E., Gonzalez-Mulé, E., & Joo, H. (2016). Cumulative advantage:
Conductors and insulators of heavy-tailed productivity distributions and productiv-
ity stars. Personnel Psychology, 69, 3–66.
Arvey, R., & Murphy, K. (1998). Personnel evaluation in work settings. Annual Review
of Psychology, 49, 141–168.
Atwater, L. E., & Yammarino, F. Y. (1992). Does self–other agreement on leadership per-
ceptions moderate the validity of leadership and performance predictions? Person-
nel Psychology, 45, 141–164.
Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Ap-
plied Psychology, 77, 836–874.
Balzer, W. K., & Sulsky, L. M. (1992). Halo and performance appraisal research: A criti-
cal examination. Journal of Applied Psychology, 77, 975–985.
Bass, A. R., & Turner, J. N. (1973). Ethnic group differences in relationships among crite-
ria of job performance. Journal of Applied Psychology, 57, 101–109.
Baxter, G. W. (2012). Reconsidering the black-white disparity in federal performance rat-
ings. Public Personnel Management, 41, 199–218.
Beck, J. W., Beatty, A. S., & Sackett, P. R. (2014). On the distribution of job performance:
The role of measurement characteristics in observed departures from normality.
Personnel Psychology, 67, 531–566.
Becker, B. E., & Cardy, R. L. (1986). Influence of halo error on appraisal effectiveness:
A conceptual and empirical reconsideration. Journal of Applied Psychology, 71,
662–671.
Bendig, A. W. (1953). The reliability of self-ratings as a function of the amount of verbal
anchoring and the number of categories on the scale. Journal of Applied Psychol-
ogy, 37, 38–41.
Bento, R. F., White, L. F. & Zacur, S. R. (2012). The stigma of obesity and discrimination
in performance appraisal: A theoretical model. International Journal of Human
Resource Management, 23, 3196–3224.
Bernardin, H. J., & Beatty, R. W. (1984). Performance appraisal: Assessing human be-
havior at work. Boston, MA: Kent.
Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Man-
agement Review, 6, 205–212.
Bingham, W. V. (1939). Halo, invalid and valid. Journal of Applied Psychology, 23, 221–
228.
Blanz, F., & Ghiselli, E. E. (1972). The mixed standard scale: A new rating system. Per-
sonnel Psychology, 25, 185–200.
Bommer, W. H., Johnson, J. L., Rich, G. A., Podsakoff, P. M., & MacKenzie, S. B. (1995).
On the interchangeability of objective and subjective measures of employee perfor-
mance: A meta-analysis. Personnel Psychology, 48, 587–605.
Borman, W. C. (1977). Consistency of rating accuracy and rating errors in the judgment
of human performance. Organizational Behavior and Human Performance, 20,
238–252.
Borman, W. C. (1978). Exploring the upper limits of reliability and validity in job perfor-
mance ratings. Journal of Applied Psychology, 63, 135–144.
Borman, W. C. (1979). Format and training effects on rating accuracy and rater errors.
Journal of Applied Psychology, 64, 410–421.
Borman, W. C. (1991). Job behavior, performance, and effectiveness. In M. D. Dunnette &
L. M. Hough (Eds.), Handbook of industrial and organizational psychology (pp.
271–326). Palo Alto, CA: Consulting Psychologists Press.
Bowen, C., Swim, J. K., & Jacobs, R. (2000). Evaluating gender biases on actual job per-
formance of real people: A meta-analysis. Journal of Applied Social Psychology,
30, 2194–2215.
Bretz, R. D., Milkovich, G. T., & Read, W. (1992). The current state of performance ap-
praisal research and practice: Concerns, directions, and implications. Journal of
Management, 18, 321–352.
Campbell J. P. (1990). Modeling the performance prediction problem in industrial and
organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook
of industrial and organizational psychology (Vol. 1, pp. 687–732). Palo Alto, CA:
Consulting Psychologists Press.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of perfor-
mance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations
(pp. 35–70). San Francisco, CA: Jossey-Bass.
Cardy, R. L., & Dobbins, G. H. (1986). Affect and appraisal accuracy: Liking as an in-
tegral dimension in evaluating performance. Journal of Applied Psychology, 71,
672–678.
Cleveland, J. N., Murphy, K. R., & Williams, R. E. (1989). Multiple uses of performance
appraisal: Prevalence and correlates. Journal of Applied Psychology, 74, 130–135.
Colella, A., DeNisi, A. S., & Varma, A. (1998). The impact of ratee’s disability on perfor-
mance judgments and choice as partner: the role of disability-job fit stereotypes and
interdependence of rewards. Journal of Applied Psychology, 83, 102–111.
Conway, J. M. (1998). Understanding method variance in multitrait-multirater perfor-
mance appraisal matrices: Examples using general impressions and interpersonal
affect as measured method factors. Human Performance, 11, 29–55.
Conway, J. M., & Huffcutt, A. I. (1997). Psychometric properties of multisource perfor-
mance ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings.
Human Performance, 10, 331–360.
Conway, J. M., Lombardo, K., & Sanders, K. C. (2001). A meta-analysis of incremental
validity and nomological networks for subordinate and peer rating. Human Perfor-
mance, 14, 267–303.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues
for field settings. Boston, MA: Houghton Mifflin Company.
Cooper, W. H. (1981a). Conceptual similarity as a source of illusory halo in job perfor-
mance ratings. Journal of Applied Psychology, 66, 302–307.
Cooper, W. H. (1981b). Ubiquitous halo. Psychological Bulletin, 90, 218–244.
Cronbach, L. J. (1955). Processes affecting scores on “understanding of others” and “as-
sumed similarity.” Psychological Bulletin, 52, 177–193.
Cronbach, L. J. (1990). Essentials of psychological testing. New York, NY: Harper and
Row.
Czajka, J. M., & DeNisi, A. S. (1988). The influence of ratee disability on performance
ratings: The effects of unambiguous performance standards. Academy of Manage-
ment Journal, 31, 394–404.
DeNisi, A. S., & Gonzalez, J. A. (2004). Design performance appraisal to improve per-
formance appraisal. In E. A. Locke (Ed.) The Blackwell handbook of principles of
organizational behavior (Updated version, pp. 60–72). London, UK: Blackwell
Publishers.
DeNisi, A. S., & Murphy, K. R. (2017). Performance appraisal and performance manage-
ment: 100 Years of progress? Journal of Applied Psychology, 102, 421–433.
DeNisi, A. S., & Peters, L. H. (1996). Organization of information in memory and the
performance appraisal process: evidence from the field. Journal of Applied Psy-
chology, 81, 717.
DeNisi, A. S., Robbins, T., & Cafferty, T. P. (1989). Organization of information used for
performance appraisals: Role of diary-keeping. Journal of Applied Psychology, 74,
124–129.
DeNisi, A. S., Robbins, T. L., & Summers, T. P. (1997). Organization, processing, and the
use of performance information: A cognitive role for appraisal instruments. Journal
of Applied Social Psychology, 27, 1884–1905.
DeNisi, A. S., & Sonesh, S. (2011). The appraisal and management of performance at
work. In S. Zedeck (Ed.), Handbook of industrial and organizational psychology
(pp. 255–280). Washington, DC: APA Press.
Dierdorff, E. C., & Surface, E. A. (2007). Placing peer ratings in context: systematic influ-
ences beyond ratee performance. Personnel Psychology, 60, 93–126.
Farh, J., Earley, P. C., & Lin, S. (1997). Impetus for action: A cultural analysis of justice
and organizational citizenship behavior in Chinese society. Administrative Science
Quarterly, 42, 421–444.
Fleenor, J.W., Fleenor, J.B., & Grossnickle, W.F. (1996). Interrater reliability and agree-
ment of performance ratings: A methodological comparison. Journal of Business
and Psychology, 10, 367–38.
Folger, R., Konovsky, M. A., & Cropanzano, R. (1992). A due process metaphor for per-
formance appraisal. Research in Organizational Behavior, 14, 129–129.
Greenberg J. (1986) Determinants of perceived fairness of performance evaluations.
Journal of Applied Psychology, 71, 340–342.
Greenberg, J. (1987). A taxonomy of organizational justice theories. Academy of Man-
agement Review, 12, 9–22.
Greguras, G. J. (2005). Managerial experience and the measurement equivalence of per-
formance ratings. Journal of Business and Psychology, 19, 383–397.
Greguras, G. J., & Robie, C. (1998). A new look at within-source interrater reliability of
360-degree feedback ratings. Journal of Applied Psychology, 83, 960–968.
Greguras, G. J., Robie, C., Schleicher, D. J., & Goff, M. (2003). A field study of the effects
of rating purpose on the quality of multisource ratings. Personnel Psychology, 56,
1–21.
Hamner, W. C., Kim, J. S., Baird, L., & Bigoness, W. J. (1979). Race and sex as determi-
nants of ratings by potential employers in a simulated work-sampling task. Journal
of Applied Psychology, 59, 705–711.
Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisory, self-peer,
and peer-subordinate ratings. Personnel Psychology, 41, 43–62.
Heilman, M. E., & Chen, J. J. (2005). Same behavior, different consequences: reactions to
men’s and women’s altruistic citizenship behavior. Journal of Applied Psychology,
90, 431–441.
Heneman, R. L. (1986). The relationship between supervisory ratings and results-oriented
measures of performance: A meta-analysis. Personnel Psychology, 39, 811–826.
Hoffman, B. J., Lance, C. E., Bynum, B., & Gentry, W. A. (2010). Rater source effects are
alive and well after all. Personnel Psychology, 63, 119–151.
Hoffman, B. J., & Woehr, D. J. (2009). Disentangling the meaning of multisource perfor-
mance rating source and dimension factors. Personnel Psychology, 62, 735–765.
Ilgen, D. R. (1993). Performance appraisal accuracy: An elusive and sometimes mis-
guided goal. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and
assessment: Industrial and organizational perspectives (pp. 235–252). Hillsdale,
NJ: Erlbaum.
Ilgen, D. R., Barnes-Farrell, J. L., & McKellin, D. B. (1993). Performance appraisal pro-
cess research in the 1980s: What has it contributed to appraisals in use? Organiza-
tional Behavior and Human Decision Processes, 54, 321–68.
Jennings, T., Palmer, J. K., & Thomas, A. (2004). Effects of performance context on pro-
cessing speed and performance ratings. Journal of Business and Psychology, 18,
453–463.
Joo, H., Aguinis, H., & Bradley, K. J. (2017). Not all non-normal distributions are cre-
ated equal: Improved theoretical and measurement precision. Journal of Applied
Psychology, 102, 1022–1053.
Kasten, R., & Nevo, B. (2008). Exploring the relationship between interrater correlations
and validity of peer ratings. Human Performance, 21, 180–197.
Kingsbury, F. A. (1922). Analyzing ratings and training raters. Journal of Personnel Re-
search, 1, 377–382.
Kingsbury, F. A. (1933). Psychological tests for executives. Personnel, 9, 121–133.
Kluger, A. N., & DeNisi, A. S. (1996). The effects of feedback interventions on perfor-
mance: Historical review, meta-analysis, and a preliminary feedback intervention
theory. Psychological Bulletin, 119, 254–284.
Kraiger, K., & Ford, J. K. (1985). A meta-analysis of ratee race effects in performance
ratings. Journal of Applied Psychology, 70, 56–65.
Lance, C. E. (1994). Test of a latent structure of performance ratings derived from Wher-
ry’s (1952) theory of rating. Journal of Management, 20, 757–771.
Lance, C. E., Baranik, L. E., Lau, A. R., & Scharlau, E. A. (2009). If it ain’t trait it must
be method: (mis)application of the multitrait-multimethod design in organizational
research. In C. E. Lance & R. L. Vandenberg (Eds.), Statistical and methodological
myths and urban legends (pp. 227–360). New York, NY: Routledge.
Lance, C. E., LaPointe, J. A., & Stewart, A. M. (1994). A test of the context dependen-
cy of three causal models of halo rater error. Journal of Applied Psychology, 79,
332–340.
Lance, C. E., Teachout, M. S., & Donnelly, T. M. (1992). Specification of the criterion
construct space: An application of hierarchical confirmatory factor analysis. Jour-
nal of Applied Psychology, 77, 437–452.
Landy, F. J. (2010). Performance ratings: Then and now. In J.L. Outtz (Ed.). Adverse
impact: Implications for organizational staffing and high-stakes selection (pp.
227–248). New York, NY: Routledge.
Landy, F. J., Barnes, J., & Murphy, K. R. (1978). Correlates of perceived fairness and
accuracy of performance appraisals. Journal of Applied Psychology, 63, 751–754.
Landy, F. J., Barnes-Farrell, J., & Cleveland, J. (1980). Perceived fairness and accuracy of
performance appraisals: A follow-up. Journal of Applied Psychology, 65, 355–356.
Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72–107.
Landy, F. J., Shankster, L. J., & Kohler, S. S. (1994). Personnel selection and placement.
Annual Review of Psychology, 45, 261–296.
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology,
28, 563–575.
LeBreton, J. M., Scherer, K. T., & James, L. R. (2014). Corrections for criterion reli-
ability in validity generalization: A false prophet in a land of suspended judgment.
Industrial and Organizational Psychology: Perspectives on Science and Practice,
7, 478–500.
Mabe, P. A., & West, S. G. (1982). Validity of self-evaluation of ability: A review and
meta-analysis. Journal of Applied Psychology, 67, 280–290.
McIntyre, R. M., Smith, D., & Hassett, C. E. (1984). Accuracy of performance ratings
as affected by rater training and perceived purpose of rating. Journal of Applied
Psychology, 69, 147–156.
Milkovich, G. T., & Wigdor, A. K. (1991). Pay for performance. Washington, DC: Na-
tional Academy Press.
Motowidlo, S. J., & Kell, H. J. (2013). Job Performance. In N. W. Schmitt & S. Highhouse
(Eds.), Comprehensive handbook of psychology, Volume 12: Industrial and organi-
zational psychology (2nd ed., pp. 82–103). New York, NY: Wiley.
Mount, M. K., Judge, T. A., Scullen, S. E., Sytsma, M. R., & Hezlett, S. A. (1998). Trait,
rater, and level effects in 360-degree performance ratings. Personnel Psychology,
51, 557–576.
Murphy, K. R. (1991). Criterion issues in performance appraisal research: Behavioral ac-
curacy vs. classification accuracy. Organizational Behavior and Human Decision
Processes, 50, 45–50.
Murphy, K. R. (2008). Explaining the weak relationship between job performance and rat-
ings of job performance. Industrial and Organizational Psychology: Perspectives
on Science and Practice, 1, 148–160.
Murphy, K. R., & Anhalt, R. L. (1992). Is halo error a property of the rater, ratees, or the
specific behaviors observed? Journal of Applied Psychology, 77, 494–500.
Murphy, K. R., & Balzer, W. K. (1986). Systematic distortions in memory-based behavior
ratings and performance evaluations: Consequences for rating accuracy. Journal of
Applied Psychology, 71, 39–44.
Murphy, K. R., & Balzer, W. K. (1989). Rater errors and rating accuracy. Journal of Ap-
plied Psychology, 74, 619–624.
Murphy, K. R., Balzer, W. K., Kellam, K. L., & Armstrong, J. (1984). Effect of purpose of
rating on accuracy in observing teacher behavior and evaluating teaching perfor-
mance. Journal of Educational Psychology, 76, 45–54.
Evaluating Job Performance Measures • 131

Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social,
organizational and goal-oriented perspectives. Newbury Park, CA: Sage.
Murphy, K. R., Cleveland, J., & Hanscom, M. (2018). Performance appraisal and man-
agement. Thousand Oaks, CA: Sage.
Murphy, K., & Davidshofer, C. (2005). Psychological testing: Principles and applica-
tions (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Murphy, K. R., & DeShon, R. (2000). Interrater correlations do not estimate the reliability
of job performance ratings. Personnel Psychology, 53, 873–900.
Murphy, K. R., Garcia, M., Kerkar, S., Martin, C., & Balzer, W. K. (1982). Relationship
between observational accuracy and accuracy in evaluating performance. Journal
of Applied Psychology, 67, 320.
Murphy, K. R., Herr, B. M., Lockhart, M .C., & Maguire, E. (1986). Evaluating the per-
formance of paper people. Journal of Applied Psychology, 71, 654–661.
Murphy, K. R., Jako, R. A., & Anhalt, R. L. (1993). Nature and consequences of halo er-
ror: A critical analysis. Journal of Applied Psychology, 78, 218–225.
Murphy, K. R., Martin, C., & Garcia, M. (1982). Do behavioral observation scales mea-
sure observation? Journal of Applied Psychology, 67, 562–567.
Nathan, B. R., & Tippins, N. (1989). The consequences of halo “error” in performance
ratings: A field study of the moderating effect of halo on test validation results.
Journal of Applied Psychology, 74, 290–296.
O’Neill, T. A., McLarnon, M. J. W., & Carswell, J. J. (2015). Variance components of job
performance ratings. Human Performance, 32, 801–824.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (2008). No new terrain: Reliability and
construct validity of job performance ratings. Industrial and Organizational Psy-
chology: Perspectives on Science and Practice, 1, 174–179.
Ployhart, R. E., Wiechmann, D., Schmitt, N., Sacco, J. M., & Rogg, K. (2003). The cross-
cultural equivalence of job performance ratings. Human Performance, 16, 49–79.
Polit, D. F., Beck, C. T., & Owen, S. V. (2007). Is the CVI an acceptable indicator of con-
tent validity. Research in Nursing and Health, 30, 451–467.
Pulakos, E. D. (1986). The development of training programs to increase accuracy in
different rating tasks. Organizational Behavior and Human Decision Processes,
38, 76–91.
Pulakos, E. D., White, L. A., Oppler, S. H., & Borman, W. C. (1989). Examination of
race and sex effects on performance ratings. Journal of Applied Psychology, 74,
770–780.
Putka, D. J., Le, H., McCloy, R. A., & Diaz, T. (2008). Ill-structured measurement designs
in organizational research: Implications for estimating interrater reliability. Journal
of Applied Psychology, 93, 959.
Remmers, H. H. (1931). Reliability and halo effect of high school and college students’
judgments of their teachers. Journal of Applied Psychology, 18, 619–630.
Roberson, L., Galvin, B. M., & Charles, A. C. (2007). When group identities matter: Bias
in performance appraisal. The Academy of Management Annals, 1, 617–650.
Roch, S. G., Paquin, A. R., & Littlejohn, T. W. (2009). Do raters agree more on observable
items? Human Performance, 22, 391–409.
Rosen, B., & Jerdee, T. H. (1976). The nature of job-related age stereotypes. Journal of
Applied Psychology, 61, 180–183.
Saal, F. E., Downey, R. C., & Lahey, M. A. (1980). Rating the ratings: Assessing the qual-
ity of rating data. Psychological Bulletin, 88, 413–428.
Sanchez, J. I., & De La Torre, P. (1996). A second look at the relationship between rating
and behavioral accuracy in performance appraisal. Journal of Applied Psychology,
81, 3–10.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in per-
sonnel psychology: Practical and theoretical implications of 85 years of research
findings. Psychological Bulletin, 124, 262–274.
Schmitt, N., & Lappin, M. (1980). Race and sex as determinants of the mean and variance
of performance ratings. Journal of Applied Psychology, 65, 428–435.
Schmidt, F. L., Viswesvaran, C., & Ones, D. S. (2000). Reliability is not validity and valid-
ity is not reliability. Personnel Psychology, 53, 901–912.
Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job
performance ratings. Journal of Applied Psychology, 85, 956–970.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2001). Experimental and quasi-experi-
mental designs for generalized causal inference. Boston, MA: Houghton-Mifflin.
Smith, P. C. (1976). Behaviors, results, and organizational effectiveness. In M. Dunnette
(Ed.), Handbook of industrial and organizational psychology. Chicago, IL: Rand-
McNally.
Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the
construction of unambiguous anchors for rating scales. Journal of Applied Psychol-
ogy, 47, 149–155.
Solomonson, A. L., & Lance, C. E. (1997). Examination of the relationship between true
halo and halo error in performance ratings. Journal of Applied Psychology, 82,
665–674.
Stone-Romero, E. F., Alvarez, K., & Thompson, L. F. (2009). The construct validity of
conceptual and operational definitions of contextual performance and related con-
structs. Human Resource Management Review, 19, 104–116.
Sulsky, L. M., & Balzer, W. K. (1988). Meaning and measurement of performance rat-
ing accuracy: Some methodological and theoretical concerns. Journal of Applied
Psychology, 73, 497–506.
Taylor, M. S., Tracy, K. B., Renard, M. K., Harrison, J. K., & Carroll, S. J. (1995). Due
process in performance appraisal: A quasi-experiment in procedural justice. Ad-
ministrative Science Quarterly, 495–523.
Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied
Psychology, 4, 25–29.
Thorndike, R. L. (1949). Personnel selection. New York, NY: Wiley.
Valle, M., & Bozeman, D. (2002). Interrater agreement on employees’ job performance:
Review and directions. Psychological Reports, 90, 975–985.
Varma, A., DeNisi, A. S., & Peters, L. H. (1996). Interpersonal affect in performance ap-
praisal: A field study. Personnel Psychology, 49, 341–360.
Viswesvaran, C., Schmidt, F. L., & Ones, D. S. (2002). The moderating influence of job
performance dimensions on convergence of supervisory and peer ratings of job
performance: Unconfounding construct-level convergence and rating difficulty.
Journal of Applied Psychology, 87, 345–354.
Waldman, D. A., & Avolio, B. J. (1991). Race effects in performance evaluations: Con-
trolling for ability, education, and experience. Journal of Applied Psychology, 76,
897–901.
Williams, K. J., DeNisi. A. S., Meglino, B. M., & Cafferty, T. P. (1986). Initial decisions
and subsequent performance ratings. Journal of Applied Psychology, 71, 189–195.
Woehr, D. J., & Roch, S. G. (2016). Of babies and bathwater: Don’t throw the measure out
with the application. Industrial and Organizational Psychology: Perspectives on
Science and Practice, 9, 357–361.
Woehr, D. J., Sheehan, M. K., & Bennett, W. (2005). Assessing measurement equivalence
across rating sources: A multitrait-multirater approach. Journal of Applied Psychol-
ogy, 90, 592–600.
CHAPTER 7

RESEARCH METHODS IN
ORGANIZATIONAL POLITICS
Issues, Challenges, and Opportunities

Liam P. Maher, Zachary A. Russell, Samantha L. Jordan,
Gerald R. Ferris, and Wayne A. Hochwarter

Scientific inquiry has identified, casually discussed, informally examined, and
vigorously investigated organizational politics phenomena for over a century
(e.g., Byrne, 1917; Ferris & Treadway, 2012; Lasswell, 1936). In reality, politics
work goes back even further if we consider the publication of Niccolo Machia-
velli’s The Prince, initially written in the early 1500s and first published in 1532 (Ma-
chiavelli, 1952). Delineations of organizational politics are abundant in the exist-
ing literature (many would claim ‘too many’ with little overlapping agreement;
Lepisto & Pratt, 2012). The expanding research base notwithstanding, the field
has yet to offer an agreed upon theory-driven definition. Foundationally (and his-
torically), organizational politics has been cast in a mostly pejorative or negative
light, referring to the self-interested behavior of individuals, groups, or organiza-
tions (e.g., Ferris & Treadway, 2012). Ostensibly, this view has driven previous
empirical research, with only a small number of exceptions (see Franke & Foerstl,
2018; Landells & Albrecht, 2013).
Given its maturation, especially in the past three decades, significant reviews
of the organizational politics literature exist (see Ferris, Harris, Russell, & Maher,
2018; Ferris & Hochwarter, 2011; Kacmar & Baron, 1999; Lux, Ferris, Brouer,
Laird, & Summers, 2008), investigating a myriad of substantive relations. These
reviews identified trends, critically examined foundational underpinnings, and
noted inconsistencies and possible causes (Chang, Rosen, & Levy, 2009). Em-
bedded in many of these summaries are critiques of research design as well as
recommendations for addressing existing methodological deficiencies (Ferris, El-
len, McAllister, & Maher, 2019). However, to our knowledge, there has been
no systematic examination of research method issues in organizational politics
scholarship to date. Therefore, we offer a detailed critique of issues, challenges,
and future directions of organizational politics research methods.

Scope of Organizational Politics Review


Given the exponential growth of published politics research, we thought it pru-
dent to identify a potential starting point when designing our discussion. Specifi-
cally, the original perceptions of politics (POPs) model (Ferris, Russ, & Fandt,
1989) has spawned considerable research and remains influential (e.g., Ahmad,
Akhtar, ur Rahman, Imran, & ul Ain, 2017).
discussions surfaced before the initial Ferris et al. model (e.g., Madison, Allen,
Porter, Renwick, & Mayes, 1980; Mayes & Allen, 1977). For example, Mintz-
berg’s (1985) characterization of organizations as “political arenas” and Pfeffer’s
(1981) support of politics as a constructive and legitimate element of organiza-
tional realities preceded Ferris et al. (1989) and remain relevant in contemporary
research (Cullen, Gerbasi, & Chrobot-Mason, 2018). For clarity, though, we use
Ferris et al.’s (1989) discussion as our point of departure.
In this review, we identify notable methodological approaches guiding past
discussions of organizational politics. Further, we offer suggestions for augment-
ing methodological approaches considered critical when scholars develop the
next generation of politics research. In terms of scope, we argue that organiza-
tional politics is far from a unitary or singular construct, but instead reflects a
multi-faceted area of inquiry. Indeed, organizational politics is quite complex, and
some of the shortcomings of existing research likely derive from a failure to fully
recognize this multi-faceted reality. Delineation of what each construct is and is
not is essential when seeking to establish construct validity. Further, a clear defi-
nition and understanding of a construct’s placement within its nomological net-
work is key to successful measure development (Hinkin, 1998). Psychometrically
sound measures must then be developed, tested, and established as possessing
validity in order for research to progress our knowledge (MacKenzie, Podsakoff,
& Podsakoff, 2011) of the organizational politics nomological network.
Although only subtle differences may be visible to some, the research domain
classically consists of perceptions of organizational politics, political behavior,
and political skill (Ferris & Hochwarter, 2011; Ferris, Perrewé, Daniels, Lawong,
& Holmes, 2017; Ferris & Treadway, 2012). However, we also include in our
present review and analysis two conceptually related constructs. ‘Political will’
and ‘reputation’ are burgeoning areas of study that fit well within the organiza-
tional politics nomological network (Blom-Hansen & Finke, in press; Ferris et
al., 2019).
In concept, political will has been around for some time (Mintzberg, 1983;
Treadway, Hochwarter, Kacmar, & Ferris, 2005). Historically, the term repre-
sented worker behaviors undertaken to sabotage the leader’s directives (Brecht,
1937). In contemporary terms, conceptual advancements increased interest
(Blickle, Schütte, & Wihler, 2018; Maher, Gallagher, Rossi, Ferris, & Perrewé,
2018), and publication of the Political Will Scale (PWS) helped develop empirical
research in recent years (Kapoutsis, Papalexandris, Treadway, & Bentley, 2017).
Organizational reputation is not a new concept (Bromley, 1993; O’Shea,
1920). As an example, McArthur (1917) argued: “Reputation is something that
you can’t value in dollars and cents, but is mighty precious just the same…” (p.
63). However, for such a foundational construct, relatively little theory and re-
search have been conducted on reputation in the organizational sciences (Ferris,
Blass, Douglas, Kolodinsky, & Treadway, 2003; Ferris, Harris, Russell, Ellen,
Martinez, & Blass, 2014). As far back as Tsui (1984), and extending to the present
day (Ferris et al., 2019), reputation in organizations has been construed as less
of an objectively scientific construct and more of a sociopolitical one (Ravasi,
Rindova, Etter, & Cornelissen, 2018). Hence, reputation’s inclusion as a facet of
organizational politics is entirely appropriate (Munyon, Summers, Thompson, &
Ferris, 2015; Zinko, Gentry, & Laird, 2016) given its influence (direct and indi-
rect) on both tactics (Ferris et al., 2017) and presentation acuity (Smith, Plowman,
Duchon, & Quinn, 2009).

PRIMARY CONSTRUCTS WITHIN ORGANIZATIONAL POLITICS


In the following sections, we review and analyze the five primary constructs and
their associated measurement instruments within politics research. As we describe
below, the study of politics presents substantial, but imperfect, attempts
to describe and measure important phenomena. Beyond summarizing trends, we
highlight the areas for potential improvement by identifying limitations—many
of which come from the authors of this chapter.

Perceptions of Organizational Politics


Definition and Conceptualization. The operationalization of the term ‘per-
ceptions of organizational politics’ (POPs) is traced to the broader organizational
politics literature (Stolz, 1955). Traditionally defined as non-sanctioned and il-
legitimate activities characterized by self-interest (Ferris & King, 1991; Ferris
& Treadway, 2012; Mintzberg, 1983, 1985), political behavior in organizational
contexts is cast in a pejorative light. As such, conceptualizations focus primarily on
the dysfunctional and self-serving aspects of others’ political behavior (Chang et
al., 2009; Guo, Kang, Shao, & Halvorsen, 2019; Miller, Rutherford, & Kolodin-
sky, 2008).
Although similarities between POPs and the broader organizational politics
construct exist, researchers note that perceptions are always subjective evalua-
tions, whereas organizational politics are captured objectively (Ferris, Harrell-
Cook, & Dulebohn, 2000; Ferris et al., 2019). Because perceptions ostensibly
manufacture reality (Landry, 1969), what is seen is impactful (Lewin, 1936; Por-
ter, 1976) and capable of explaining affective, cognitive, and behavioral outcomes
at work (Ferris & Kacmar, 1992). Accordingly, we define POPs as an individual’s
idiosyncratic estimation and evaluation of others’ self-serving, or egocentric, be-
havior at work (Ferris et al., 1989; Ferris et al., 2000; Ferris & Kacmar, 1992).
Ferris et al. (1989) developed one of the first theoretical models of POPs, which
specified the antecedents, outcomes, and moderators within the nomological net-
work of POPs. A subsequent review expanded these related constructs (Ferris,
Adams, Kolodinsky, Hochwarter, & Ammeter, 2002). Although no one study has
tested each proposed link, general support has been found for these two guiding
models, which were the primary studies that established the POPs nomological
network.
Despite the strong theoretical rationale for previous antecedent models, theori-
zation concerning the link between POPs and organizational outcomes was large-
ly absent before Chang et al.’s (2009) meta-analytic examination. Their study
was one of the first to identify psychological mechanisms linking POPs to more
distal work outcomes (i.e., turnover intentions and performance). Chang et al.
(2009) found that psychological strain mediated the relation between perceptions
of organizational politics and performance, such that as POPs increased, so did
psychological strain, in turn reducing performance. Morale mediated the relations
of POPs with both performance and turnover, albeit in a different fashion. Finally, one of
the most significant findings was the wide credibility intervals surrounding the
estimated effects of POPs on outcomes. This catalyzed the search for moderating
effects, which has dominated the POPs literature over the past decade.
Measurement. In 1980, two independent sets of scholars made first efforts to
assess political perceptions at work. Gandz and Murray (1980) asked employees
to report on the amount of political communication existing in their organization,
as well as its influence in shaping work environments. Respondents also reported
the organizational levels where political activities were most prevalent and of-
fered opinions on the effectiveness of these behaviors. Furthermore, respondents
provided a specific situation indicative of “a good example of workplace politics
in action” (Gandz & Murray, 1980, p. 240).
Madison et al. (1980) captured POPs through detailed interviews with chief
executive officers, high staff managers, and supervisors. Specifically, participants
answered questions, via face-to-face interviews, and reported on the frequency of
politics across different functional areas. They also described, in an open-ended
fashion, their general perceptions of politics as either helpful or harmful to both
the individual and the organization.
Almost a decade later, Ferris and Kacmar (1989) proposed a five-item unidi-
mensional measure of general POPs. Shortly, after that, Kacmar and Ferris (1991)
developed the Perceptions of Organizational Politics Scale (POPS). The 12-item
POPS contained three factors: (1) general political behavior, (2) going along to
get ahead, and (3) pay and promotion. Subsequent attempts to validate the POPS’
three-factor structure evidenced psychometric shortcomings (Kacmar & Carlson,
1997; Nye & Witt, 1993). In response, Kacmar and Carlson (1997) evaluated the
contributions of each POPS item, removed items not functioning as intended,
and developed new items, resulting in the 15-item extended POPS. Vigoda (2002)
shortened this scale to a 6-item measure, which remains commonly used in re-
search (e.g., Sun & Chen, 2017).
Since the development of the POPS (Ferris & Kacmar, 1989), and its sub-
sequent extension (Kacmar & Carlson, 1997), scholars sought other ways to
measure politics perceptions. For example, Hochwarter, Kacmar, Treadway, and
Watson (2003) asked respondents to report experienced politics on a scale with
endpoints ranging from 0 (no politics exist) to 100 (great levels of politics exist),
with 50 serving as the midpoint (moderate levels of politics exist). Three organizational
levels were examined: (a) at the highest levels in your organization; (b) at the
level of your immediate supervisor, and (c) at the level of you and your cowork-
ers. This approach allowed respondents to indicate an absolute level of viewed
politics without a priori directionality. Documented differences across levels sur-
faced and uniquely predicted outcomes.
As an extension, Hochwarter, Kacmar, Perrewé, and Johnson (2003) devel-
oped a six-item assessment of POPs. Respondents were asked to respond to each
item while considering three different organizational levels (e.g., current, one
level up, highest level). Since its emergence, Hochwarter et al.’s scale remains
in use given its short length and overall acceptable reliability across organiza-
tional levels (Dahling, Gabriel, & MacGowan, 2017; Rosen, Koopman, Gabriel,
& Johnson, 2016).
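To illustrate how reliability might be reported for such level-specific measures, the sketch below estimates Cronbach's alpha separately for each referent level with the pingouin package. The item and column names are hypothetical placeholders, not the published scale items.

```python
# Illustrative reliability check for a six-item POPs measure answered with
# respect to three organizational levels. Column names are hypothetical
# placeholders (e.g., 'own_1'..'own_6' for the respondent's own level).
import pandas as pd
import pingouin as pg

df = pd.read_csv("pops_survey.csv")

levels = {
    "own level": [f"own_{i}" for i in range(1, 7)],
    "one level up": [f"up_{i}" for i in range(1, 7)],
    "highest level": [f"top_{i}" for i in range(1, 7)],
}

for label, cols in levels.items():
    alpha, ci = pg.cronbach_alpha(data=df[cols])
    print(f"{label}: alpha = {alpha:.2f}, 95% CI = {ci}")
```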
Critique and Future Research Directions. Arguably, the most fundamen-
tal issue with current scholarship on POPs concerns its pervasively negative op-
erationalization, conceptualization, and measurement (Ellen, 2014; Hochwarter,
2012). As alluded to above, the construct’s negative orientation reflects its defi-
nitional overlap with the term ‘organizational politics’ from the broader politics
literature (Ferris et al., 2019). Despite its positioning as a “dark side” phenom-
enon (Ferris & King, 1991), as well as a “hindrance” stressor (Chang et al., 2009;
LePine, Podsakoff, & LePine, 2005), POPs also can serve neutral and positive
functions in organizational contexts. This confusion threatens the construct valid-
ity of POPs.
Thus, rather than assuming POPs is either positive or negative, future scholar-
ship should expand its thinking to include neutral and positive operationaliza-
tions. For example, some scholars already have defined the construct as the ac-
tive management of shared meaning (Ferris & Judge, 1991; Pfeffer, 1981), as
well as the effort to restore justice, attain resources and benefits for others, and/
or as a source of positive influence and change (Ellen, 2014; Hochwarter, 2012).
These views represent an initial benchmark for the constructs future refinement
and measurement.
Furthermore, despite literature focusing on self-serving and proactive tactics,
reactive and defensive political strategies are also viable (Ashforth & Lee, 1990;
Valle & Perrewé, 2000). Landells and Albrecht (2017) interpreted and categorized
POPs into four levels. Those who perceived organizational politics as reactive
regarded the behaviors as destructive and manipulative, whereas reluctant politics
represented a “necessary evil” (Landells & Albrecht, 2017, p. 41). Furthermore,
strategic behaviors accomplished goals, and integrated tactics benefited actors
when central to successful company functioning, activity, and decision-making.
These findings support claims for an expansion that captures a fuller content do-
main. We encourage the use of grounded theory investigations as theoretical start-
ing points for improving conceptualizations and psychometric treatments.
Also concerning is the lack of theorizing regarding how POPs affect indi-
vidual-, group-, and organizational-level outcomes. Although several conceptual
models have begun to specify the direct effects of POPs (e.g., Aryee, Chen, &
Budhwar, 2004; Ferris et al., 2002; Valle & Perrewé, 2000), few studies have
offered theoretical support for possible processes that indirectly link POPs to
employee and organizational outcomes. Exceptions include studies investigating
morale (Rosen, Levy, & Hall, 2006) and need satisfaction (Rosen, Ferris, Brown,
Chen, & Yan, 2014) as mediating mechanisms. Building on these studies, more
substantial theorization needs to explain how and why POPs are associated with
attitudes and behaviors at work (Chang et al., 2009) across organizational levels
(Adams, Ammeter, Treadway, Ferris, Hochwarter, & Kolodinsky, 2002; Dipboye
& Foster, 2002; Franke & Foerstl, 2018).
Whereas historically the organizational politics literature has focused predomi-
nantly on between-person variance in politics perceptions as a stable environmen-
tal factor (Rosen et al., 2016), it is highly possible that politics perceptions vary
throughout the day, week, or more broadly across time. As research on experi-
ence sampling methods continues (Matta, Scott, Colquitt, Koopman, & Passantino,
2017), it would be beneficial for researchers also to consider within-person varia-
tion in politics perceptions, and the antecedents that may result in such variance.
Assuming within-person variance exists, researchers would be drawing a broader
picture as to how politics perceptions are developed and modified across time.
Furthermore, given the importance of uncertainty in the larger politics litera-
ture, researchers also may want to consider whether within-person variability in
politics perceptions is more harmful than consistently perceiving politics. Per-
haps politics manifests in ways similar to justice perceptions. Specifically, vari-
ability in cues likely causes more disdain when inconsistent (sometimes good,
sometimes bad) than when consistent (always bad) (Matta et al., 2017).
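One hedged way to examine this possibility is to partition daily POPs reports into between- and within-person variance with an intercept-only multilevel model, as sketched below using statsmodels. The data file and variable names are hypothetical placeholders for an experience-sampling dataset.

```python
# Illustrative within-person decomposition for daily politics perceptions
# collected with an experience-sampling design. Variable names and the data
# file are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

esm = pd.read_csv("daily_pops.csv")  # one row per person-day: person_id, pops

# Intercept-only multilevel model: days nested within persons.
model = smf.mixedlm("pops ~ 1", data=esm, groups=esm["person_id"]).fit()

between_var = model.cov_re.iloc[0, 0]   # person-level variance
within_var = model.scale                # day-level (residual) variance
icc = between_var / (between_var + within_var)
print(f"ICC(1) = {icc:.2f}; "
      f"{1 - icc:.0%} of the variance in POPs is within-person")
```

A sizable within-person share would support treating POPs as a fluctuating state rather than only a stable environmental perception.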

Political Behavior
Definition and Conceptualization. As stated by Mintzberg (1983, 1985), or-
ganizations are political arenas in which motivated and capable individuals enact
self-serving behavior. Although employees often perceive ‘office politics’ as be-
ing decisively negative, political behavior can produce organizational and inter-
personal benefits when appropriately implemented (Treadway et al., 2005). Given
widespread disagreement on the implications of politics, and more specifically
political behavior, conceptualizations have varied over time and across studies
(Kidron & Vinarski-Peretz, 2018; Lampaki & Papadakis, 2018).
Generally, researchers agree that political behavior is normal, and sometimes,
an essential element of functioning (Zanzi & O’Neill, 2001). However, no agreed-
upon definition that captures the complexity of political action exists (Ferris et al.,
2019). Whereas most definitions posit political behavior as non-sanctioned activ-
ity within organizational settings (Farrell & Petersen, 1982; Gandz & Murray,
1980; Mintzberg, 1983; Schein, 1977), others focus on political behavior as an
interdependent social enactor-receiver relationship (Lepisto & Pratt, 2012; Sharf-
man, Wolf, Chase, & Tansik, 1988). Furthermore, some researchers classify influ-
ence tactics (Kipnis & Schmidt, 1988; Kipnis, Schmidt, & Wilkinson, 1980; Yukl
& Falbe, 1990), impression management (Liden & Mitchell, 1988; Tedeschi &
Melburg, 1984), and even voice (Burris, 2012; Ferris et al., 2019) as relevant for
effective operationalization of the politics construct.
Since its original operationalization, several conceptual models have emerged
to explain potential antecedents of political behavior. The first, developed by Por-
ter, Allen, and Angle (1981), argued that political behavior is, at least partially, a
function of Machiavellianism, locus of control, need for power, risk-seeking pro-
pensity, and a lack of personal power. Just over a decade later, Ferris, Fedor, and
King (1994) stated that political behavior is the result of Machiavellianism and lo-
cus of control, as in Porter et al.’s (1981) model, as well as self-monitoring,
a propensity unique to the Ferris et al. (1994) model.
Overall, empirical research investigating the antecedents of political behav-
ior has been inconclusive (Grams & Rogers, 1990; Vecchio & Sussman, 1991),
leading to calls for an expansion of the individual difference domain previously
specified (Ferris, Hochwarter, Douglas, Blass, Kolodinsky, & Treadway, 2002b).
In response, Treadway et al. (2005) conceptualized political behavior to include
motivational and achievement need components, and Ferris et al. (2019) concep-
tualized general political behavior as one of the multiple other political actions
that organizational members enact. We now briefly describe other forms of politi-
cal action conceptualized as being part of political behavior in organizations.
Influence tactics are specific strategies employed to obtain desired goals. De-
spite general disagreement regarding what types of influence tactics exist (Kipnis
et al., 1980; Kipnis & Schmidt, 1988; Yukl & Tracey, 1992), an extensive body
of literature has examined not only what tactics are most effective, but also the
boundary conditions affecting tactic success (e.g., frequency of influence, direc-
tion of influence, power distance between enactor and receiver, reason for in-
fluence attempt). As part of this trend, several meta-analytic studies have begun
to tease apart these direct and moderating implications (Barbuto & Moss, 2006;
Higgins, Judge, & Ferris, 2003; Lee, Han, Cheong, Kim, & Yun, 2017; Smith et
al., 2013).
Impression management reflects any political act designed to manage how one
is perceived (Tedeschi & Melburg, 1984; Tedeschi, Melburg, Bacharach, & Lawl-
er, 1984). Attempts at impression management fall into five primary categories,
including ingratiation, self-promotion, exemplification, supplication, and intimi-
dation (Jones & Pittman, 1982). Past work has categorized impression manage-
ment into two dimensions (i.e., tactical-strategic and assertive-defensive; Tedeschi
& Melburg, 1984). The tactical-strategic dimension considers whether short-term
or long-term purposes guide impression management. Moreover, the assertive-
defensive dimension determines if behavior escalates proactively or reactively to
situational contingencies. Although the common intention of impression manage-
ment is a favorable assessment, recent work reports that poorly executed tactics
can be detrimental for one’s social image (Bolino, Long, & Turnley, 2016).
Voice, a type of organizational citizenship behavior (OCB), is the expression
of effective solutions in response to perceived problems to improve a given situ-
ation (Li, Wu, Liu, Kwan, & Liu, 2014; Van Dyne & LePine, 1998). Voice is es-
sential for the management of shared meaning in organizational contexts (Ferris
et al., 2019), and represents a mechanism to advertise and promote personal opin-
ions and concerns (Burris, 2012). However, unlike many other forms of OCBs,
voice can be maladaptive for individuals enacting the behavior, as well as for
their coworkers and the organization as a whole (Turnley & Feldman, 1999). As
such, employee voice exemplifies a form of informal political behavior (Ferris et
al., 2019).
Measurement. Despite existing theoretical avenues within the political behav-
ior literature, there is still considerable disagreement surrounding construct defi-
nition and use in scholarly practice (Ferris et al., 2019), which impedes construct
validity. Given a general lack of operational and conceptual consensus, measures
of political behavior also have been limited and quite inconsistent. Whereas some
scholars have developed scales assessing general political behavior (Valle & Per-
rewé, 2000; Zanzi, Arthur, & Shamir, 1991), others have used impression man-
agement (Bolino & Turnley, 1999), influence tactics (Kipnis & Schmidt, 1988),
and voice (Van Dyne & LePine, 1998) as proxies for political behavior in organi-
zational settings.
The most commonly utilized measure of individual political behavior was de-
veloped by Treadway et al. (2005; α = .83). Six items captured general politicking
behavior toward goal attainment, interpersonal influence, accomplishment shar-
ing, and ‘behind the scenes’ political activity. Despite its widespread use since
the scale’s emergence, Treadway et al.’s (2005) measure has yet to undergo the
empirical rigor that traditional scale developments endure (Ferris et al., 2019).
Critique and Future Research Directions. Before empirical work on the
construct can continue, researchers need to develop a concise and agreed upon
operationalization of political behavior that includes traditional definitional components
while taking into consideration the importance of intentionality (Hochwarter,
2012), goal-directed activity and behavioral targets (Lepisto & Pratt, 2012), and
interpersonal dependencies (French & Raven, 1959). Furthermore, researchers
need to decide whether to expand political behavior to include concepts like influ-
ence tactics, impression management, and voice, or if each construct is unique
enough to hold an independent position within political behavior’s nomological
network. Once the construct is better defined, and its related constructs identi-
fied, researchers will want to use this conceptualization to help inform subsequent
scale development attempts. We encourage researchers to cast a wide net when
defining political behavior and its potential underlying dimensions.
Political behaviors reflect inherently non-sanctioned and self-serving actions
(Mitchell, Baer, Ambrose, Folger, & Palmer, 2018), triggering ostensibly adverse
outcomes. However, not all non-sanctioned behavior is aversive, nor is all self-serving behavior dysfunctional (Ferris & Judge, 1991; Zanzi & O'Neill, 2001). For
example, egotistic behavior may not be intrinsic to the actor. Instead, contexts in-
fused with threat often trigger self-serving motivations as a protective mechanism
(Lafrenière, Sedikides, & Lei, 2016; Von Hippel, Lakin, & Shakarchi, 2005). For
this reason, future research should expand conceptualizations and measurement to
include constructs predisposed to neutral and positive implications as well (Ellen,
2014; Fedor, Maslyn, Farmer, & Bettenhausen, 2008; Ferris & Treadway, 2012;
Hochwarter, 2012; Maslyn, Farmer, & Bettenhausen, 2017).
Furthermore, political behavior is a broad term encapsulating activity enacted
by different sources, including the self, others, groups, and organizations (Hill,
Thomas, & Meriac, 2016). Given its possible manifestations across organiza-
tional levels, future research must redefine the construct within the appropriate
and intended theoretical level. As part of this process, researchers also must con-
sider whether political behavior is a level-generic (or level-specific) phenomenon,
manifesting similarly (or differentially) across multiple hierarchies.
The objectionable and surreptitious nature of political behavior (Wickenberg
& Kylén, 2007) provokes the use of self-report measures prone to socially desir-
able responding. Alternative approaches, however, are likely unable to capture the
extensiveness and frequency of political activity for the very same reasons. This
conundrum is shared across disciplines (Reitz, Motti-Stefanidi, & Asendorpf,
2016; Zare & Flinchbaugh, 2019), as other-report indices are vulnerable to halo
bias (Dalal, 2005). As researchers develop improved measures of political behav-
ior, convergence or divergence must be determined to establish validity (Kruse,
Chancellor, & Lyubomirsky, 2017).
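One simple starting point for such convergence checks, assuming matched self- and other-reports can be obtained, is to correlate the two sources and, if desired, correct the correlation for unreliability. The sketch below is illustrative only; the data are simulated and the reliability values are hypothetical.

import numpy as np
import pandas as pd

# Simulated placeholder data: one row per focal employee, two rating sources
rng = np.random.default_rng(1)
true_behavior = rng.normal(size=300)
ratings = pd.DataFrame({
    "self_report": true_behavior + rng.normal(scale=0.8, size=300),
    "other_report": true_behavior + rng.normal(scale=0.8, size=300),
})

observed_r = ratings["self_report"].corr(ratings["other_report"])
alpha_self, alpha_other = 0.83, 0.88          # hypothetical scale reliabilities
disattenuated_r = observed_r / np.sqrt(alpha_self * alpha_other)
print(round(observed_r, 2), round(disattenuated_r, 2))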
Political Skill
Definition and Conceptualization. Approximately 40 years ago, two scholars independently introduced the political skill construct to the literature. Pfeffer (1981) defined political skill as a social effectiveness competency
allowing for the active execution of political behavior and attainment of power.
Mintzberg (1983, 1985) positioned the construct as the exercise of political in-
fluence (or interpersonal style) using manipulation, persuasion, and negotiation
through formal power. Despite the extensiveness of political activity in organiza-
tional settings, these initial works made little progress beyond the definition and
conceptualization stages.
However, over the past few decades, researchers have acknowledged the
prominence and importance of political acuity, social savviness, and social in-
telligence (Ferris, Perrewe, & Douglas, 2002; McAllister et al., 2018). Ahearn,
Ferris, Hochwarter, Douglas, and Ammeter (2004) provided an early effort at de-
lineating the political skill construct (then termed “social skill”), defined as “the
ability to effectively understand others at work, and to use such knowledge to
influence others to act in ways that enhance one’s personal and/or organizational
objectives” (Ahearn et al., 2004, p. 311). Additionally, political skill was argued
to encompass four critical underlying dimensional competencies, including (1)
social astuteness, (2) interpersonal influence, (3) networking ability, and (4) ap-
parent sincerity.
Social astuteness, or the ability to be self-aware and to interpret the behavior of
others accurately, is necessary for effective influence (Pfeffer, 1992). Individuals
possessing political skill are keen observers of social situations. Not only are they
able to accurately interpret the behavior of others, but also they can adapt socially
in response to what they perceive (Ferris, Treadway, Perrewé, Brouer, Douglas, &
Lux, 2007). This “sensitivity to others” (Pfeffer, 1992, p. 173) provides politically
skilled individuals the ability to understand the motivations of both themselves
and others better, making them useful in many political arenas.
Interpersonal influence concerns “flexibility,” or the successful adaptation of
behavior to different personal and situational contingencies to achieve desired
goals (Pfeffer, 1992). Individuals high in political skill exert powerful influence
through subtle and convincing interpersonal persuasion (Ferris et al., 2005, 2007).
Whereas Mintzberg (1983, 1985) defined political skill in terms of influence and explicit formal power, Ahearn et al.'s (2004) definition does not include direct references to formal authority (Perrewé, Zellars, Ferris, Rossi, Kacmar, & Ralston,
2004). Instead, this view focuses on influence originating from the selection of
appropriate communication styles relative to the context at hand, as well as suc-
cessful adaptation and calibration when tactics are ineffective.
Politically skilled individuals also are adept at developing and utilizing social
networks (Ferris et al., 2005, 2007). Not only are these networks extensive, but they also tend to include more valuable and influential members. Such networking capabilities allow individuals high in political skill to
formulate robust and beneficial alliances and coalitions that offer further opportu-
nities to maintain, as well as develop, an increasingly more extensive social net-
work. Further, because these networks are strategically developed over time, the
politically skilled are better able to position themselves so as to take advantage of
available network-generated resources, opportunities, and social capital (Ahearn
et al., 2004; Pfeffer, 2010; Tocher, Oswald, Shook, & Adams, 2012).
The last characteristic politically skilled individuals possess is apparent sin-
cerity. That is, they are, or at least appear to be, genuine in their intentions when engaging in political behaviors (Douglas & Ammeter, 2004). Sincerity is essential given that influence attempts are only successful when the intention is perceived to be devoid of ulterior or manipulative motives (Jones, 1990). Thus, perceived intentions may matter more than actual intentions for inspiring behavioral modification and confidence in others.
Subsequently, Ferris et al. (2007) provided a systematic conceptualization of
political skill grounded in social-political influence theory. As part of this concep-
tualization, they characterized political skill as “a comprehensive pattern of social
competencies, with cognitive, affective, and behavioral manifestations” (Ferris
et al., 2007, p. 291). Specifically, they argued that political skill operated on self,
others, and group/organizational processes. Their model identified five anteced-
ents of political skill, including perceptiveness, control, affability, active influ-
ence, and developmental experiences. Munyon et al. (2015) extended this model
to encapsulate the effect of political skill on self-evaluations and situational ap-
praisals (i.e., intrapsychic processes), situational responses (i.e., behavioral pro-
cesses), as well as evaluations by others and group/organizational processes (i.e.,
interpersonal processes). Recently, Frieder, Ferris, Perrewé, Wihler, and Brooks
(in press) extended this meta-theoretical framework of social-political influence to leadership.
Overall, research on political skill has generated considerable interest since
its original refinement by Ferris et al. (2005). Within the last decade, multiple
reviews and meta-analyses (Bing, Davison, Minor, Novicevic, & Frink, 2011;
Ferris, Treadway, Brouer, & Munyon, 2012; Munyon et al., 2015) have reported
on the effectiveness of political skill in work settings, both as a significant predic-
tor as well as a boundary condition. Some notable outcomes include the effect of
political skill on stress management (Hochwarter, Ferris, Zinko, Arnell, & James,
2007; Hochwarter, Summers, Thompson, Perrewé, & Ferris, 2010; Perrewé et
al., 2004), career success and performance (Blickle et al., 2011; Gentry, Gilm-
ore, Shuffler, & Leslie, 2012; Munyon et al., 2015), and leadership effectiveness
(Brouer, Douglas, Treadway, & Ferris, 2013; Whitman, Halbesleben, & Shanine,
2013).
Measurement. Ferris et al. (1999) provided a first effort at measuring the po-
litical skill construct by developing the six-item Political Skill Inventory (PSI).
Despite acceptable psychometric properties and scale reliability across five stud-
ies, the PSI was not without flaws. Although the scale reflected social astuteness
and interpersonal influence, they did not emerge as separate and distinguishable
factors. Resulting unidimensionality and construct domain concerns triggered the
development of an 18-item version, which retained the original scale name as well
as three original scale items (Ferris et al., 2005).
To develop the 18-item PSI, Ferris et al. (2005) generated an initial pool of
40 items to capture the full content domain of the political skill construct. After
omitting scale items prone to socially desirable responding and those with prob-
lematically high cross-loading values, a final set of 18 items was retained, and
as hypothesized, a four-factor solution emerged containing the social astuteness,
interpersonal influence, networking ability, and apparent sincerity dimensions.
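The item-screening logic described above (retaining items with clean loadings and dropping those with problematic cross-loadings) can be approximated with an exploratory factor analysis. The sketch below is not a reproduction of Ferris et al.'s (2005) analyses; it assumes the third-party factor_analyzer package, and the simulated items and the .20 cutoff are purely illustrative.

import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package, assumed installed

# Simulated placeholder items: four latent dimensions, five candidate items each
rng = np.random.default_rng(2)
latent = rng.normal(size=(400, 4))
pattern = np.kron(np.eye(4), np.ones((1, 5)))
items = pd.DataFrame(latent @ pattern + rng.normal(scale=0.7, size=(400, 20)),
                     columns=[f"item{i:02d}" for i in range(1, 21)])

fa = FactorAnalyzer(n_factors=4, rotation="oblimin")
fa.fit(items)
loadings = pd.DataFrame(fa.loadings_, index=items.columns)

# Flag items whose two largest absolute loadings differ by less than .20
abs_sorted = np.sort(np.abs(loadings.to_numpy()), axis=1)[:, ::-1]
print(list(loadings.index[(abs_sorted[:, 0] - abs_sorted[:, 1]) < 0.20]))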
Critique and Future Research Directions. Coming up on its 15th anniversary,
the PSI has been widely accepted as a sound psychometric measure by those
well entrenched within the organizational politics field. Few conceptual squabbles
exist among scholars, and the theoretical clarity paired with strong empirically
established links to relevant constructs is evidence for strong construct validity.
However, this measure has a few notable deficiencies. The PSI inherently suffers
from the drawbacks associated with self-reports. Certainly, self-reports are easy
to obtain, and are considered the best way to measure psychological states, per-
ceptions, and motives (McFarland et al., 2012). However, as a tool for assessing
behavioral effectiveness, self-reports have some issues. Hubris, perceptual bias,
and socially desirable responding can lead individuals to overinflated estimates of their social abilities. Some individuals may believe, or are told erroneously, that they are likable and keen social agents, but in reality, they are social pariahs who annoy and infuriate their colleagues.
Also problematic are invariant source requirements needed for response ac-
curacy. For example, social astuteness and networking ability are largely percep-
tual measures and best obtained through self-reports. Logically, observers cannot
provide an accurate account of what focal individuals perceive during social interaction.
However, observers may be best suited to assess interpersonal influence and apparent sincerity. Influence represents a change in an attitude, judgment, or decision; that is, cues that are more amenable to assessment by an observer or trained rater.
Apparent sincerity is in the eye of the beholder, regardless of whether focal individuals believe they intended to be, or thought they acted, sincerely (Silvester & Wyatt, 2018).
With these shortcomings in mind, scholars would help advance scholarship
by developing a behavioral measure that can assess political skill without solely
relying on self-reports. Developing such a measure would contribute to further
legitimizing the construct of political skill to those scholars and practitioners who
are not intimately familiar with the organizational politics literature, and doubt its
merits. Furthermore, this measure need not replace the PSI entirely, but a stream
of investigations that employed both a behavioral and self-report measure could
illuminate the utility or futility of how we currently measure political skill. Admit-
tedly, this type of measurement requires added effort likely complicating data col-
lection processes. However, we are confident that value rests in doing so if only to
confirm the utility of self-reports.
Another opportunity within the political skill literature is to evaluate the con-
struct’s developmental qualities. According to Ferris et al. (2005, 2007), political
skill is a social competency that can be cultivated over time through social feed-
back, role modeling, and mentorship. Despite strong theoretical support, ground-
ed in social learning theory (Bandura, 1986), little evidence for the development
of political skill through observation and modeling exists. Further, if both genetic
properties and situational factors affect political skill, then researchers need to
consider which individuals are more or less receptive to organizational training,
behavioral interventions, incentives, and role modeling techniques. Until empiri-
cal evidence is present, scholars should be cautious of discussing political skill as
a learnable or trainable competency.

Political Will
Definition and Conceptualization. Political will is a construct commonly
used in the popular press and governmental politics to describe a collective’s will-
ingness or unwillingness to expend resources towards a particular cause (Post,
Raile, & Raile, 2010). The creation of new laws and political courses of action
upsets the status quo, and in a world of diverse and often competing interests,
politicians must be willing to expend resources to fight for their desired agenda.
Similarly, Mintzberg (1983) argued that individual agents within organizations
needed political skill and political will in order to execute their desired managerial
actions successfully.
Over three decades ago, political will and political skill were introduced con-
ceptually into the organization sciences. Despite the sustained interest in politi-
cal skill, however, political will has attracted further inquiry only recently. This
neglect is unfortunate given that both constructs were integral to Mintzberg’s
theoretical framework, and the omission of essential variables within measurement models biases parameter estimates in previous studies.
Treadway (2012) provided a theoretical application of political will and sug-
gested instrumental (relational, concern for self, concern for others) and risk toler-
ance as underlying dimensions. Treadway defined political will as “the motivation
to engage in strategic, goal-directed behavior that advances the personal agenda
and objectives of the actor that inherently involves the risk of relational or repu-
tational capital" (p. 533). Keeping with Mintzberg's conceptualization, Treadway
focused on describing political will at the individual level of analysis. Nonethe-
less, he did acknowledge that political will embodies a group mentality towards
a particular agenda.
Measurement. Scholars made a few early attempts to measure political will
before the development of a validated psychometric measure. Treadway et al.
(2005) first attempted to measure political will using need for achievement and
intrinsic motivation as proxies. These constructs successfully predicted the activ-
ity level of political behavior. Similarly, Liu, Liu, and Wu (2010) used need for
achievement and, analogously, need for power to predict political behavior. In the
same vein, Shaughnessy, Treadway, Breland, and Perrewé (2017) used the need
for power as a proxy for political will, which predicted informal leadership. Last-
ly, Doldor, Anderson, and Vinnicombe (2013) used semi-structured interviews to
explore what political will meant to male and female managers. Rather than focus
on the trait-like qualities previously employed as proxies, they found that political
will was more of an attitude about engaging in organizational politics. They also
found that functional, ethical, and emotional appraisals shaped political attitudes.
Recently, Kapoutsis, Papalexandris, Treadway, and Bentley (2017) developed
an eight-item measure called the Political Will Scale (PWS). Based on Treadway’s
(2012) conception of political will, they expected the scale to break out into the
five dimensions of instrumental, relational, concern for self, concern for others,
and risk tolerance. However, principal axis factor analysis revealed
two factors for this scale, which they labeled benevolent and self-serving. To date,
only a handful of published studies have used this new measure. As an example,
Maher et al. (2018) found that political will and political skill predicted configu-
rations of impression management tactics. Moreover, moderate levels of political
will were associated with the most effective configuration. Blickle et al. (2018)
subjected the scale to additional psychometric testing. Using a triadic multi-source design, they found support for the construct and criterion-related validity
of the self-serving dimension of political will. However, they did not find justifi-
cation for the benevolent dimension. Instead, they interpreted this dimension to be
synonymous with altruistic political will.
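When the number of factors that emerges departs so sharply from the number theorized, factor-retention procedures such as parallel analysis can help adjudicate. The sketch below is a generic Python illustration on simulated placeholder data, not an analysis of the PWS.

import numpy as np

def parallel_analysis(data: np.ndarray, n_iter: int = 200, seed: int = 0) -> int:
    # Count factors whose observed eigenvalues exceed the mean eigenvalues of random data
    rng = np.random.default_rng(seed)
    n, k = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    random_eigs = np.empty((n_iter, k))
    for i in range(n_iter):
        noise = rng.standard_normal((n, k))
        random_eigs[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    return int(np.sum(observed > random_eigs.mean(axis=0)))

# Placeholder responses: two correlated item clusters of four items each
rng = np.random.default_rng(3)
latent = rng.normal(size=(400, 2))
items = np.repeat(latent, 4, axis=1) + rng.normal(scale=0.8, size=(400, 8))
print(parallel_analysis(items))   # should suggest retaining two factors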
Critique and Future Research Directions. Because the study of political will
is in its nascent stage, lending a critical eye helps introduce ideas for remedying
potential deficiencies. Establishing, expanding, and empirically testing the politi-
cal will nomological network will help establish construct validity and advance
research in this area. The sections that follow evaluate the state of the construct,
with a focus on vetting current conceptualizations and measurement instruments.
To date, within the organization sciences, political will resides as an individ-
ual-level variable. Indeed, we take no issue with this stance. Mintzberg specifi-
cally discussed political will and political skill as individual attributes necessary
to navigate workplace settings. However, scholars in political science have char-
acterized political will as a group-level phenomenon (Post et al., 2010). Accordingly, scholars within the organization sciences should also conceptualize and ex-
plore political will at collective levels of analysis. Indeed, political will possesses
attitude-based qualities (Doldor et al., 2013), and thus, can proliferate to others
within similar social networks (Salancik & Pfeffer, 1978).
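If political will is measured from individual members but theorized collectively, researchers would also need to justify aggregation, for example with an ICC(1) estimate or within-group agreement indices. A minimal Python sketch using a one-way ANOVA decomposition and simulated placeholder data follows; it is illustrative rather than prescriptive.

import numpy as np
import pandas as pd

def icc1(df: pd.DataFrame, group: str, value: str) -> float:
    # ICC(1) from a one-way random-effects ANOVA decomposition
    groups = df.groupby(group)[value]
    n_groups = groups.ngroups
    mean_size = groups.size().mean()
    grand_mean = df[value].mean()
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    ss_within = ((df[value] - groups.transform("mean")) ** 2).sum()
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (len(df) - n_groups)
    return (ms_between - ms_within) / (ms_between + (mean_size - 1) * ms_within)

# Placeholder data: 40 teams of 5 members reporting individual political will
rng = np.random.default_rng(4)
team_effect = np.repeat(rng.normal(scale=0.5, size=40), 5)
data = pd.DataFrame({"team": np.repeat(np.arange(40), 5),
                     "political_will": team_effect + rng.normal(size=200)})
print(round(icc1(data, "team", "political_will"), 2))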
Furthermore, scholars must examine how formal and informal leadership cre-
ate unique political will profiles, and assess how these configurations might affect
group outcomes. For example, teams with a high and consistent aggregate amount
of political will may have a singular focus that leads to higher performance. It
may also be true that having one team member or leader who takes care of the
‘dirty work’ enables other team members to complete work tasks without engag-
ing in office politics.
Currently, scholars conceptualize political will as an individual characteristic.
To date, existing instruments do not test whether this characteristic differs across
organizational situations and contexts. However, political scientists maintain that
political will is issue specific (Donovan, Bateman, & Heggestad, 2013; Morgan,
1989). In keeping with this notion, we suggest that a novel and illuminating line
of study would be to apply an event-oriented approach (Morgeson, Mitchell, &
Liu, 2015) to studying political will. Under this design, scholars could examine
how political will focuses resources and effort toward a particular cause, and
track how these manifestations affect goals and change outcomes. Unlike team-
level aggregation, this approach would require the development of a new measure
rather than merely changing the referent in the existing measure of political will.
As with many constructs in the organizational politics literature, there is little
consensus on the underlying theoretical foundations of political will. Conceptual-
izations and definitions are essential for any sound psychometric instrument, and
this incongruence is a current affliction within the study of political will. Indeed,
we applaud the advancement in theory and measurement by Treadway (2012) and
Kapoutsis et al. (2017), as they represent the seminal works in the field. Previous
proxy measures (i.e., need for achievement, need for power, intrinsic motivation)
were rooted in constructs that are stable individual traits, and recent thinking more
appropriately suggests that political will is a state-like attribute closely akin to
an attitude. However, there are potential issues to confront concerning the more
contemporary works mentioned above.
For example, the multidimensional conceptualization of political will (Tread-
way, 2012) was not supported empirically (Kapoutsis et al., 2017). Notably, no
items reflected risk tolerance, suggesting too narrow an operationalization. Simi-
larly, Rose and Greeley (2006) suggest that political will represents a sustained
commitment to a cause, as adversity and pushback are integral aspects of the pro-
cess. This aspect of political will is also absent from the recent measure. Scholars
should analyze the PWS dimensions in conjunction with scales of perseverance
(e.g., grit; Duckworth & Quinn, 2009) and risk tolerance to see if they load onto
a common factor.
As mentioned above, the word 'politics' is a loaded term that means different things to different people. Many see it as a toxic parasite requiring immediate
extinction (Cantoni, 1993; Zhang & Lu, 2009). Conversely, others recognize its
importance, necessity, and inevitability (Eldor, 2016). An in-depth debate regard-
ing the positive and negative aspects of organizational politics is beyond the scope
of this chapter. However, it is clear that definitional unanimity has evaded both scholars and study participants. Anecdotal evidence suggests that respondents
consider workplace ‘political behavior’ to embody advocacy for a particular gov-
ernmental candidate. There are two potential remedies to this issue.
First, we suggest that scholars define organizational politics within the survey
instrument in use. This approach will focus participants’ attention to organiza-
tional politics, not governmental politics. Second, scholars should avoid using
any variant of the word ‘politics’ in the measures that they create, and instead
use more specific language to illustrate the intended political situation or charac-
teristic. When writing scale items, scholars advocate for clear language, so that
interpretations are uniform across time, culture, and individual attributes (Clark
& Watson, 1995). We find this practice particularly important given the ubiquity
and lack of agreement about the word ‘politics.’ Unfortunately, the PWS suffers
from this issue, as all eight items employ a variant of the word politics. Thus, we
suggest that scholars define politics in ways that clearly are understood by target
samples.
Lastly, we must note that the initial validation of the PWS has produced mixed
results. Blickle et al. (2018) found evidence that the self-serving dimension of
political will did demonstrate descriptive and convergent validity. However, as the authors report, the benevolent dimension of political will did not correlate with altruism. We question whether altruism truly fits within the political framework, as
acting politically on others' behalf does not have to be genuinely self-sacrificing
(which highlights the need for higher conceptual agreement). These results, com-
bined with the other issues raised in this review, warrant further construct valida-
tion research on the PWS.

Reputation
Definition and Conceptualization. Reputation is commonly discussed among
the public across social and business contexts. Generally, a positive reputation is considered a desirable attribute. However, academic investigations of
reputation are inconsistent, and our understanding of what exactly reputation is
and how it functions is limited. Extant research spans social science disciplines (e.g., economics, management, psychology; Ferris et al., 2003). Like many
other constructs in the organizational politics literature, disagreement regarding
the definition of reputation has thwarted research. This discrepancy is due, pri-
marily, to the different labels across fields, and even separate pockets of research
within each field (e.g., individual, team, organizational, and industry-level within
the management literature) in some cases. These different markers and branches
of research have fragmented the literature (Ferris et al., 2014).
To synthesize the existing research and create greater understanding among
scholars, Ferris et al. (2014) provided a cross-level review of reputation. They
found that it has three interacting features: (1) elements that inform reputation,
(2) stakeholder perceptions, and (3) functional utility. That is, the characteristics
of a focal entity interact with stakeholder perceptions to form the entity’s reputa-
tion. Thus, reputation then has a particular value, which, if positive, can result in
positive outcomes. Considering this, Ferris et al. (2014) proposed the following
definition of reputation: “a perceptual identity formed from the collective percep-
tions of others, which is reflective of the complex combination of salient entity
characteristics and accomplishments, demonstrated behavior, and intended im-
ages presented over some period of time as observed directly and/or reported
from secondary sources, which reduces ambiguity about expected future behav-
ior” (Ferris et al., 2014, p. 272).
An essential element noted in this definition of reputation is its perceptual na-
ture. That is, the focal individual does not own reputation. Instead, stakeholders
form perceptions of the focal individual based on prior behavior indicative of
performance and character. Another defining element of this definition is the idea
of saliency. This extension argues that one cannot merely be of high character and
a high performer. Instead, stakeholders need to be aware of the focal individual
and her or his behaviors and accomplishments.
The definition proposed by Ferris et al. (2014) corresponds strongly with the
most frequently referenced individual level conceptualization (i.e., Hochwarter
et al., 2007). This view argues that two informing elements—character/integrity
and performance/ results—build reputation. Further, it proposes that perceptions
develop from a history of observed individual behavior. Recently, Zinko et al.
(2016) proposed that reputation has three dimensions (i.e., task, integrity, and
social). This conceptualization is similar to Hochwarter et al. in that it emphasizes
performance and character, but also addresses saliency (Ferris et al., 2014). Zinko
et al. (2016) proposed that actor popularity affects reputation distinctly from one’s
expertise in an area.
Little research has investigated the second feature of reputation (i.e., stakehold-
er perceptions). Instead, scholars have studied the functional utility of reputation.
Briefly, positive outcomes evolve from one’s favorable reputation as perceived by
others, including more resources, power, and behavioral discretion (e.g., Bartol
& Martin, 1990; Wade, Porac, Pollock, & Graffin, 2006). In sum, although there
has been little investigation into individual reputation, scholars report characteris-
tics (i.e., performance/results, character/integrity, saliency/prominence) that build
functional utility, and lead to positive outcomes for the focal individual.
Measurement. Hochwarter et al. (2007) developed the most widely cited mea-
sure of personal reputation. As mentioned above, this measure argues that reputa-
tion represents an observer’s opinion of a focal individual based on a history of
behavior related to character/integrity and performance/results. Scholars have
used the measure with both other-report (e.g., Laird, Zboja, & Ferris, 2012) and
self-report (e.g., Hochwarter et al., 2007) responses, and have found empirical
support for the performance and character dimensions of reputation assessed by
the measure (e.g., Liu, Ferris, Zinko, Perrewé, Weitz, & Xu, 2007). Studies have
found a strong relation between self-reports and other-reports of this measure,
suggesting that the measure is reliable and capable of use in either method (Ho-
chwarter et al., 2007; Laird et al., 2012).
Some studies of personal reputation have focused on just a single aspect, gen-
erally assessing only the performance dimension of reputation (e.g., Liu et al.,
2007). Relatedly, a single dimension of reputation is standard at the organiza-
tion level (e.g., Bromley, 2000; Deephouse & Carter, 2005). Such approaches have prompted discussion regarding the validity of measures that capture only part
of the construct (e.g., Rindova, Williamson, & Petkova, 2010; Zinko et al., 2016).
Critique and Future Research Directions. More reputation research is need-
ed in order to establish, expand, and empirically test the reputation nomological
network. As mentioned above, the vast majority of reputation research is at the
organization level, and although a few scholars have investigated individual-level
reputation for decades, only recently has it gained considerable attention (George,
Dahlander, Graffin, & Sim, 2016). Although reputation affects the lives of every
employee, this limited research means that we have little understanding of its role and function in organizational politics. Because there is more research at the
organization level, and because reputation appears to function similarly across
levels of analysis (Ferris et al., 2014), it makes sense to examine the construct as a
whole, cross-reference, and integrate the literature to gain greater understanding.
Using the literature across levels and fields of analysis as foundations, many areas
require attention.
Hochwarter et al. (2007) developed a measure of personal reputation reflecting
common use and adequate predictive validity (e.g., Laird et al., 2012). However,
this measure does not account for all of the dimensions frequently discussed in
the reputation literature. Although it does capture the performance and character
dimensions, it does not capture the saliency/prominence dimension of reputation
acknowledged by Ferris et al. (2014). Borrowing from an organization-level anal-
ysis, several scholars have argued for the inclusion of a prominence dimension.
For example, Rindova et al. (2010) wrote, "prominence reflects the organization's
relative centrality in collective attention and memory” (p. 615), in that it assesses
the size or distinctness of reputation. The prominence dimension is essential and
addresses how well known an individual is relative to peers. Hinkin (1998) noted
that to study a construct effectively, it is essential to use a measure that adequately
represents the construct. A first step to support investigations is the development
and validation of a new measure that captures all three frequently mentioned di-
mensions of reputation (i.e., performance, character, prominence) (Hinkin, 1998;
Schriesheim, Powers, Scandura, Gardiner, & Lankau, 1993).
Once developed, a measure that captures appropriate dimensions can be used
to expand existing research. Accordingly, future research should conduct second-
order factor analyses to explore if a higher-order factor underlies the three dimen-
sions. Such evidence would promote theory development by identifying contexts
most predictive of reputational effects. Johns (2001, 2006, 2018) has called for
desperately needed research into contextual effects in management research. That
is, more research is needed regarding when and under which circumstances do
“known” relations exist (or not exist), and why.
Each dimension likely acts somewhat differently within the reputation-related
nomological network. That is, different antecedents and consequences of reputa-
tion likely have different relations with each dimension. Future research should
not only investigate reputation as a whole, but also seek to understand how, when,
and why each dimension of reputation is more or less influential on related con-
structs.
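Examining whether a higher-order reputation factor underlies the three dimensions, as suggested above, would typically involve a second-order confirmatory factor model. The sketch below assumes the third-party semopy package; the indicator names and simulated data are placeholders rather than an actual reputation data set.

import numpy as np
import pandas as pd
import semopy  # third-party SEM package, assumed installed

model_desc = """
performance =~ rep1 + rep2 + rep3
character   =~ rep4 + rep5 + rep6
prominence  =~ rep7 + rep8 + rep9
reputation  =~ performance + character + prominence
"""

# Simulated placeholder indicators: three correlated first-order dimensions
rng = np.random.default_rng(5)
general = rng.normal(size=500)
columns, idx = {}, 1
for _ in range(3):
    dimension = 0.7 * general + rng.normal(scale=0.7, size=500)
    for _ in range(3):
        columns[f"rep{idx}"] = dimension + rng.normal(scale=0.8, size=500)
        idx += 1
df = pd.DataFrame(columns)

model = semopy.Model(model_desc)
model.fit(df)
print(model.inspect())   # first- and second-order loadings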
To date, research at the individual-level has primarily focused on the inform-
ing elements and functional utility features of reputation. The third element (i.e.,
stakeholder perceptions) has received very little attention. Indeed, it seems as
though researchers generally avoid this important element altogether. This inat-
tention is concerning, as reputation is a “perceptual identity formed from the col-
lective perceptions of others” (Ferris et al., 2014, p. 62) residing “in the minds of
external observers” (Rindova et al., 2010, p. 614). Despite this acceptance, there
has been no empirical investigation of the functional role of stakeholder char-
acteristics in reputation formation. The variance in how others may perceive a
focal individual is a central theme in reputation development. Indeed, individuals
interpret the same information differently (Branzei, Ursacki-Bryant, Vertinsky, &
Zhang, 2004), and attribute behaviors to different causes (Heider, 1958; Kelley,
1973). Still, although this variance in perception is well established, its effect
consequences of reputation (e.g., autonomy, financial reward) has received little
attention.
Related to how stakeholders may interpret informing elements differently, and
tying back into the measurement of personal reputation, is the obvious concern
with the method in which reputation is measured. Although Hochwarter et al.’s
(2007) measure has received statistical support (and convergence across self- and
other-report indices), assessments came from focal individuals exclusively (Ferris
et al., 2014). Although obtaining multiple assessments of a single focal individual
is generally more complicated, a reputation assessment from a single individual
offers minimal insight.

BROAD SCALE CRITIQUE AND DIRECTIONS FOR FUTURE RESEARCH
In our above review and analysis of the organizational politics literature, we iden-
tified methodological issues requiring attention in each designated topic area.
Our general findings suggest that there are some conceptual challenges that must
be overcome in order for the field to move forward. In the remaining sections of
this review, we identify potential roadblocks, as well as opportunities presumed
to augment research in a manner that is projective rather than reflective. To date,
scholars have identified, evaluated, and incorporated recommendations into contemporary research designs (Ferris et al., 2019). However, as older concerns exit the list, others are added that embody ever-changing research realities and expecta-
tions. Our discussion focuses on these externalities, which we feel have the most
significant potential to affect studies contributing to the next generation of politics
research. Specifically, we dissect conceptual, research design, and data collection
issues, and provide, to the best of our ability, some potential remedies to common
challenges within the field of organizational politics.

Conceptual Challenges
The decades of research on organizational politics notwithstanding, the field
still suffers from a fundamental issue of conceptual incongruence, which poses a
threat to construct validity (Ferris et al., 2019; Lepisto & Pratt, 2012; McFarland
et al., 2012). This shortcoming is not tremendously surprising, as accurately de-
fining and capturing motives and behaviors that are inherently concealed, infor-
mal, murky, or downright dishonest is no easy task. Adding to this complexity is
the perspective that the word politics itself is a well-known, yet misunderstood,
term within the popular lexicon, and these preconceived notions from both prac-
titioners and scholars alike can contaminate conceptualizations and measurement.
Ideally, a researcher would approach his or her studies with a tabula rasa, or clean
slate (Craig & Douglas, 2011; Fendt & Sachs, 2008). However, as objective as
individuals aim to be, researchers’ personal experiences may influence how con-
structs are conceptualized and evaluated.
Perhaps it is true that the commercialization of greed that occurred in the
1980s, when current conceptualizations of POPs and other political constructs
were established, influenced how scholars defined and measured constructs with-
in organizational politics. It may also be the case that the more modern positive
psychology movement (e.g., Luthans & Avolio, 2009) has led scholars to search
for positive aspects of organizational politics (Byrne, Manning, Weston, & Ho-
chwarter, 2017; Elbanna, Kapoutsis, & Mellahi, 2017). We offer no formal defi-
nition here, but we do suggest that future attempts to unify organizational politics
under a common conceptual understanding acknowledge that much of what goes
on in organizations is informal and social, and that this reality allows for many
different outcomes, both good and bad.
A unifying definition of organizational politics should also consider the full
breadth of different behaviors and motivations. We have treated 'the politician' as an omnibus term rather than defining and refining precisely what it means. Given the complex nature of political constructs, we advocate the de-
velopment of multidimensional constructs with both first- and second-order lev-
els. This strategy allows practitioners to look for general main effects, as well as
more nuanced relationships (e.g., Brouer, Badaway, Gallagher, & Haber, 2015).
Specifically, it might be helpful to establish profiles of, and related to, political
behavior. From a research design standpoint, methods such as latent profile analy-
sis (Gabriel, Campbell, Djurdjevic, Johnson, & Rosen, 2018; Gabriel, Daniels,
Diefendorff, & Greguras, 2015), cluster analysis (Maher et al., 2018), and qualita-
tive comparative analysis (QCA; Misangyi, Greckhamer, Furnari, Fiss, Crilly, &
Aguilera, 2017; Rihoux & Ragin, 2008) represent data analysis techniques that
are currently underutilized in the organizational politics literature. We also assert
that the complexity of political constructs calls for more nuanced explorations,
and scholars should consider theorizing about and testing nonlinear and moderated nonlinear relations in politics research (Ferris, Bowen, Treadway, Hochwarter, Hall,
& Perrewé, 2006; Grant & Schwartz, 2011; Hochwarter, Ferris, Laird, Treadway,
& Gallagher, 2010; Maslyn et al., 2017; Pierce & Aguinis, 2013; Rosen & Ho-
chwarter, 2014).
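As one illustration of the profile-oriented approach mentioned above, latent profile analysis is typically run in dedicated latent-variable software, but a Gaussian mixture model offers a close analogue in Python. The sketch below uses simulated placeholder indicators, and the choice of profile indicators is hypothetical.

import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated placeholder indicators (e.g., political will, political skill, voice)
rng = np.random.default_rng(6)
profile_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.5, size=(150, 3))
profile_b = rng.normal(loc=[2.0, 1.0, -1.0], scale=0.5, size=(150, 3))
X = np.vstack([profile_a, profile_b])

# Select the number of profiles by BIC, then assign cases to profiles
bic = {k: GaussianMixture(n_components=k, n_init=10, random_state=0).fit(X).bic(X)
       for k in range(1, 6)}
best_k = min(bic, key=bic.get)
model = GaussianMixture(n_components=best_k, n_init=10, random_state=0).fit(X)
print(best_k, np.bincount(model.predict(X)))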
Another important consideration for the future of organizational politics research is to
examine context (see Johns, 2006, 2018). The majority of POPs research has come
from scholars and samples from the United States (for a few notable exceptions,
please see Abbas & Raja, 2014; Basar, & Basim, 2016; Eldor, 2016; Kapoutsis et
al., 2017). We feel that it is imperative to incorporate different viewpoints from
all corners of the world as we move towards a shared conceptual understanding
of organizational politics. Failure to do so creates issues of construct adequacy
(Arafat, Chowdhury, Qusar, & Hafez, 2016; Hult et al., 2008). Indeed, we expect
broad and salient similarities across cultures, but politics may look, act, and feel
different across different contexts.
In addition, the role of context likely also affects politics at a more localized
level. That is, among others, the type of organization (e.g., for-profit vs. not-for-
profit), industry (e.g., finance vs. social services), and hierarchical level (e.g., top
management teams vs. line managers) likely affect the prevalence, type, and pro-
cess of organizational politics. Contextualizing our research will provide an abun-
dance of avenues from which we can continue to evaluate how, what, when, why, and how effectively political action unfolds under different circumstances.
This approach will help illuminate theory and build a greater conceptual
understanding of politics.
Lastly, although there are ample avenues for investigations that employ the
contemporary political constructs discussed in this chapter, organizational politics
scholars should not rest on their laurels concerning the development of new theo-
ries and constructs. We encourage the inclusion and development of new theories
that could help explain political phenomena. For example, organizational politics
literature is rooted in the idea that individuals are not merely passive agents, but
instead enact and respond to their environment. The fields of leadership and or-
ganizational politics are inextricably linked, and much as the field of leadership
has placed emphasis on leaders over followers (Epitropaki, Kark, Mainemelis,
& Lord, 2017), organizational scholars have focused on the actions of the in-
fluencers rather than the targets of those influences. This perspective ignores a
century-old stream of research that spans the social sciences and argues that there
is individual variation in the extent to which individuals are affected by their en-
vironment (Allport, 1920; Belsky & Pluess, 2009). Incorporating individuals’ sus-
ceptibility to social influence into theories and models of organizational politics
would restore balance to the contemporary biased perspective, and help alleviate
concerns of omitted variable bias.
Research Design Challenges
We now discuss the particular aspects of research design that make it difficult
to establish precise connections between theory and measurement (Van Maanen,
Sorensen, & Mitchell, 2007). Issues of conceptual agreement aside, organiza-
tional politics inherently suffers from the fact that many of its core constructs
are inconspicuous and, in some cases, intentionally hidden from organizational
members. Suspicions of backroom deals, fraternizing and favoritism, informal
reciprocal agreements, and leader political support are environmental stimuli that
run the gamut from objective and observable to speculation and hearsay. From
a design standpoint, this makes getting a consensus on observations difficult, as
individuals vary in the extent to which they observed stimuli directly, or stimuli
were described to them by a primary or secondary source. Further complicating
matters is that ulterior motives, impression management, and deception can also
cloud perceptions. Multiple individuals could observe the same action, but per-
ceptions of motivation and intent will shape how individuals attribute those actions
(Bolino, 1999).
Perhaps because of these difficulties, scholars have applied a somewhat similar
set of research designs to study organizational politics. Most designs focus on
individual attributes and perceptions, which is essential research but ignores the
fact that organizational politics is inherently an interpersonal and multilevel field.
In order to combat these systemic issues, we provide recommendations on how
to design studies so that instruments are better able to measure the theoretical
underpinnings of organizational politics.
First, in order to ensure that study participants are evaluating the same phe-
nomenon, we echo the call for assessing specific foci and stimuli (Maslyn & Fe-
dor, 1998). Specifically, researchers should highlight specific aspects of the envi-
ronment so that subjects are responding to the same stimuli. Although this is not a
new philosophy, its use and importance in organizational politics research seem to
have not fully caught on in organizational politics research. We find it particularly
important, as the ubiquity and ambiguity of politics make measuring one particu-
lar aspect of the political environment rather tricky. For example, the statement “it
is pretty political around here” could be interpreted as referring to more proximal
group dynamics, leader dynamics, or more distal organizational dynamics, all of
which will have different antecedents, outcomes, and boundary conditions.
Furthermore, we advocate the study of specific events rather than general ap-
praisals of climate. Event systems theory argues that novel, disruptive, and critical
events affect organizations across time (Morgeson et al., 2015). Applying a design
of this nature would represent a break from the conventional research design, and
would provide much needed illumination of the ways in which organizational pol-
itics creates temporal (Hochwarter, Ferris, Gavin, Perrewé, Hall, & Frink, 2007;
Kiewitz, Restubog, Zagenczyk, & Hochwarter, 2009; Saleem, 2015) and multi-
level (Dipboye & Foster, 2002; Rosen, Kacmar, Harris, Gavin, & Hochwarter,
2017) effects upon organizations.
Specifically, concerning multilevel research, it is vital that we recognize the ef-
fect that leaders can have on their followers (Ahearn et al., 2004; Douglas
& Ammeter, 2004; Frieder et al., in press; Treadway et al., 2004). This line of re-
search is promising, and this type of design can help bridge the gap that exists be-
tween macro- and micro-level research within organizational politics (Lepisto &
Pratt, 2012). At the same time, scholars can employ multilevel modeling to exam-
ine within-person effects over time. Doing so would be of great value to those interested in knowing how constructs like political will, political skill,
and reputation grow over time. Furthermore, experience-sampling approaches,
also known as diary studies (Gabriel, Koopman, Rosen, & Johnson, 2018; Larson
& Csikszentmihalyi, 1983; Lim, Ilies, Koopman, Christoforou, & Arvey, 2018)
could illuminate how political constructs affect intrapsychic processes and indi-
vidual attributes throughout the day.
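As a concrete illustration of the within-person designs discussed above, daily diary data can be modeled with a random-coefficient (multilevel) regression. The sketch below uses statsmodels with simulated placeholder data; the variable names are hypothetical.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder diary data: 50 employees, 10 daily reports each
rng = np.random.default_rng(7)
n_people, n_days = 50, 10
employee = np.repeat(np.arange(n_people), n_days)
person_intercept = np.repeat(rng.normal(scale=0.6, size=n_people), n_days)
daily_politics = rng.normal(size=n_people * n_days)
strain = 0.4 * daily_politics + person_intercept + rng.normal(scale=0.8, size=n_people * n_days)
diary = pd.DataFrame({"employee": employee,
                      "daily_politics": daily_politics,
                      "strain": strain})

# Random-intercept model of the within-person politics-strain relation
fit = smf.mixedlm("strain ~ daily_politics", data=diary, groups=diary["employee"]).fit()
print(fit.summary())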
As a whole, organizational politics suffers from an imbalance of the three-
horned dilemma (Runkel & McGrath, 1972). Researchers aim to collect and ana-
lyze data that promote realism, generalizability, and precision. However, these three horns cannot all be maximized simultaneously. Specifically, increasing precision dilutes
generalizability and realism, increasing generalizability dilutes precision and real-
ism, and increasing realism dilutes generalizability and precision. To date, orga-
nizational politics research has focused on generalizability and has room for im-
provement in both realism and precision. With rare exceptions (e.g., Doldor et al., 2013; Landells & Albrecht, 2017), qualitative research in organizational
politics is meager.
This omission is especially disappointing given the nuance, complexity, in-
nuendo, and richness associated with political theory. As an example, given that
POPs are a perceptual construct, and that qualitative work is based on an episte-
mology that espouses various constructions of reality, it seems as if the marriage
of politics research and qualitative designs would be kismet. Perhaps it is not
surprising to see a field with such conceptual disagreement have such a void of
quality grounded theory work at its foundation. To improve theoretical richness
and provide a foundation for more targeted quantitative inquiry, we call for quali-
tative designs (Lincoln & Guba, 1985) such as ethnography, interviews, and historical
analyses that provide a richer understanding than what is possible through tradi-
tional survey research methods.
Whereas qualitative research would bring more realism to the field, employing
experimental designs would enable organizational politics scholars to evaluate
political theories with more precision. Experimental designs involve the manipu-
lation of an independent variable to test its effect on dependent variables. These
designs have many advantages, as they provide reliable inferences for causality,
can be employed to evaluate subjects that are not legally or ethically viable in
field studies, and are easily administered (McFarland et al., 2012). Only a small
number of studies within the organizational politics literature have employed ex-
perimental designs (e.g., Kacmar, Wayne, & Wright, 1996; van Knippenberg &
Steensma, 2003). Indeed, we understand how the complexities of organizational
politics make it challenging to design scenarios and manipulations that fully cap-
ture the intertwined network of factors that affect political action. However, many
opportunities exist to apply experimental designs to probe subjects like influence
susceptibility and effectiveness, decision making, and differential effects of po-
litical skill dimensionality.
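To illustrate the precision such designs can offer, a between-subjects vignette experiment can be analyzed with a one-way ANOVA or the equivalent regression. The sketch below uses simulated placeholder data; the manipulation (apparent sincerity of an influence attempt) and condition labels are hypothetical.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated placeholder vignette experiment with three randomly assigned conditions
rng = np.random.default_rng(8)
condition = np.repeat(["control", "low_sincerity", "high_sincerity"], 80)
true_effect = {"control": 0.0, "low_sincerity": -0.3, "high_sincerity": 0.5}
influence_rating = np.array([true_effect[c] for c in condition]) + rng.normal(size=condition.size)
experiment = pd.DataFrame({"condition": condition, "influence_rating": influence_rating})

model = smf.ols("influence_rating ~ C(condition)", data=experiment).fit()
print(sm.stats.anova_lm(model, typ=2))   # omnibus test of the manipulation
print(model.params)                      # dummy-coded contrasts against the baseline condition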
Our review indicates that the organizational politics literature has
some fundamental issues for scholars to address. The covert nature of politics
necessitates the measurement of multiple perspectives. We would like to see the
field move towards a balance among designs that promote generalizability, real-
ism, and precision, as each type of research design has its virtues and drawbacks.
In order to achieve this balance, we call for individual studies and multi-study
packages that employ qualitative or experimental designs that can augment the
quantitative field designs that are more commonly used. Indeed, organizational
politics produce complex phenomena, and no one design can adequately address
this intricacy.
Mixed methods research continues to grow in its popularity and sophistication
(Clark, 2019; de Leeuw & Toepoel, 2018; Jehn & Jonsen, 2010; Molina-Azorin,
Bergh, Corley, & Ketchen, 2017), and these types of studies can be skillfully
designed to compensate for the inherent shortcomings in different individual re-
search designs. Lastly, although not a unique issue within our field, the complex-
ity of organizational politics research requires replications to ensure that theory is
accurately describing the political realities that individuals face at work. Replica-
tions, extensions, and the use of multi-study research packages help us demon-
strate these patterns of results so that we can be more confident in the validity of
our findings, or adjust our theory by exploring new contexts (Hochwarter, Ferris,
& Hanes, 2011; Kacmar, Bozeman, Carlson, & Anthony, 1999; Li, Liang, & Farh,
2020).

Data Collection Challenges
Although scholars from all disciplines face practical challenges when collecting data, some challenges are inherent to collecting data on organizational politics. Managers and HR practitioners can be reluctant
to grant access to their employees when they fear that ‘political’ questions may
poison the well, and prime their employees to think about injustices coming from
leadership. In this case, the plurality of meaning assigned to the word politics
must be navigated with data collection gatekeepers as well, which requires edu-
cating gatekeepers and explaining the purpose of your study in other terms. Using alternative
phrases such as ‘rules of the game,’ ‘informal channels,’ and ‘social dynamics’
may accurately describe the nature of your study, and help avoid using the politically charged word 'politics.'
One partial remedy to this problem has been the use of student-recruited sam-
ples (Hochwarter, 2014; Wheeler, Shanine, Leon, & Whitman, 2013), as there are
fewer barriers to access with these samples. Despite the potential pitfalls of this
data collection method, these samples can increase the generalizability of a study,
perhaps more so than a sample drawn from a single organization. We encourage
the appropriate use of these samples (see Wheeler et al., 2013, for guidelines), especially
in conjunction with other data collection methods as part of a multi-study pack-
age, as student recruited sampling methods have the potential to attenuate the
weaknesses of other study designs (e.g., interviews, experiments, single-site field
studies). In a similar vein, technology has enabled us to gather data from different
online sources such as Amazon Mechanical Turk and Qualtrics. Although these
data sources can potentially suffer from some of the ills plaguing poorly designed
and executed student-recruited samples, understanding their virtues can help scholars add strengths to their empirical studies (Cheung, Burns, Sinclair, &
Sliter, 2017; Couper, 2013; Das, Ester, & Kaczmirek, 2020; Finkel, Eastwick, &
Reis, 2015; Jann, Krumpal, & Wolter, 2019; Porter, Outlaw, Gale, & Cho, 2019).
No matter where the data are collected, organizational scholars will still run
into the inherent problem that organizational politics constructs are measured in imperfect ways because many of these core constructs are invisible. Thus, we
will close with a final appeal to use multiple sources of information to illuminate
political phenomena. There is an old Hindu parable about a collection of blind
men who individually feel parts of an elephant, and then collectively share their
knowledge to get a shared conceptualization of the elephant. Given the hidden
and often invisible nature of politics constructs, we too must rely on multiple ac-
counts to achieve a collective understanding.
For example, few studies have attempted to use objective measures of per-
formance when assessing the proposed relations with political skill (see Ahearn
et al., 2004 for an exception). Subjective measures of performance can be prob-
lematic, as those high in political skill can influence others, and likely the sub-
jective performance assessments. Thus, collecting both objective and subjective
performance data and employing congruence analysis can not only help us understand
the quality of our data, but also extract theoretical richness. The same is true for
constructs such as self- and other-reported political skill, leader political behavior,
and perceptions of organizational politics. Polynomial regression and other forms
of congruence analysis can help determine if and why subjects are or are not see-
ing things the same way (Cheung, 2009; Edwards, 1994; Edwards & Parry, 1993).
Differences in these scores may well predict different outcomes, which can add to
our theoretical understanding of political phenomena.
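To make the congruence approach concrete, the polynomial regression underlying response surface analysis regresses the outcome on both ratings, their squares, and their product (typically after centering). The sketch below uses simulated placeholder data and hypothetical variable names.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder data: centered self- and other-rated political skill, plus performance
rng = np.random.default_rng(9)
self_ps = rng.normal(size=400)
other_ps = 0.5 * self_ps + rng.normal(scale=0.8, size=400)
performance = 0.3 * other_ps - 0.2 * (self_ps - other_ps) ** 2 + rng.normal(size=400)
df = pd.DataFrame({"self_ps": self_ps, "other_ps": other_ps, "performance": performance})

fit = smf.ols("performance ~ self_ps + other_ps + I(self_ps**2) "
              "+ I(self_ps*other_ps) + I(other_ps**2)", data=df).fit()
print(fit.params)  # slopes along the congruence and incongruence lines can be
                   # formed from these five coefficients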

CONCLUSION
The organizational politics literature has been going strong for decades, yet still
suffers from some of the fundamental problems that we see with fledgling streams
of research. At the core of almost every political construct is the issue of concep-
tual clarity and congruence. Without a sound theoretical basis, measures exist on
unstable grounds, and fault lines are sure to divide and divert what could be a
sound collective research stream. In this chapter, we have reviewed and critically
examined the theoretical bases and associated measures of the five significant
constructs in the field as well as the conventional research designs that predomi-
nate in our literature. This exercise has led us to point out some of the virtues and
drawbacks of current established measures and methods, and to take some hard
looks in the mirror at our work. We hope that our suggestions help inspire and
guide future research so that the collective strength of this invaluable field con-
tinues to grow.

REFERENCES
Abbas, M., & Raja, U. (2014). Impact of perceived organizational politics on supervisory-
rated innovative performance and job stress: Evidence from Pakistan. Journal of
Advanced Management Science, 2, 158–162.
Adams, G., Ammeter, A., Treadway, D., Ferris, G., Hochwarter, W., & Kolodinsky, R.
(2002). Perceptions of organizational politics: Additional thoughts, reactions, and
multi-level issues. In F. Yammarino & F. Dansereau (Eds.), Research in multi-level
issues, Volume 1: The many faces of multi-level issues (pp. 287–294). Oxford, UK:
Elsevier Science.
Ahearn, K., Ferris, G., Hochwarter, W., Douglas, C., & Ammeter, A. (2004). Leader politi-
cal skill and team performance. Journal of Management, 30, 309–327.
Ahmad, J., Akhtar, H., ur Rahman, H., Imran, R., & ul Ain, N. (2017). Effect of diversified
model of organizational politics on diversified emotional intelligence. Journal of
Basic and Applied Sciences, 13, 375–385.
Allport, F. (1920). The influence of the group upon association and thought. Journal of
Experimental Psychology, 3, 159–182.
Arafat, S., Chowdhury, H., Qusar, M., & Hafez, M. (2016). Cross-cultural adaptation and
psychometric validation of research instruments: A methodological review. Journal
of Behavioral Health, 5, 129–136.
Aryee, S., Chen, Z., & Budhwar, P. (2004). Exchange fairness and employee performance:
An examination of the relationship between organizational politics and procedural
justice. Organizational Behavior and Human Decision Processes, 94, 1–14.
Ashforth, B., & Lee, R. (1990). Defensive behavior in organizations: A preliminary model.
Human Relations, 43, 621–648.
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory.
Englewood Cliffs, NJ: Prentice Hall.
Barbuto, J., & Moss, J. (2006). Dispositional effects in intra-organizational influence tac-
tics: A meta-analytic review. Journal of Leadership & Organizational Studies, 12,
30–48.
Bartol, K., & Martin, D. (1990). When politics pays: Factors influencing managerial com-
pensation decisions. Personnel Psychology, 43, 599–614.
Basar, U., & Basim, N. (2016). A cross‐sectional survey on consequences of nurses’ burn-
out: Moderating role of organizational politics. Journal of Advanced Nursing, 72,
1838–1850.
Belsky, J., & Pluess, M. (2009). Beyond diathesis stress: Differential susceptibility to envi-
ronmental influences. Psychological Bulletin, 135, 885–908.
Bing, M., Davison, H., Minor, I., Novicevic, M., & Frink, D. (2011). The prediction of task
and contextual performance by political skill: A meta-analysis and moderator test.
Journal of Vocational Behavior, 79, 563–577.
Blickle, G., Ferris, G., Munyon, T., Momm, T., Zettler, I., Schneider, P., & Buckley, M.
(2011). A multi‐source, multi‐study investigation of job performance prediction by
political skill. Applied Psychology, 60, 449–474.
Blickle, G., Schütte, N., & Wihler, A. (2018). Political will, work values, and objective
career success: A novel approach – The Trait-Reputation-Identity Model. Journal of
Vocational Behavior, 107, 42–56.
Blom-Hansen, J., & Finke, D. (2020). Reputation and organizational politics: Inside the
EU Commission. The Journal of Politics, 82(1), 135–148.
Bolino, M. (1999). Citizenship and impression management: Good soldiers or good actors?
Academy of Management Review, 24, 82–98.
Bolino, M., Long, D., & Turnley, W. (2016). Impression management in organizations:
Critical questions, answers, and areas for future research. Annual Review of Organi-
zational Psychology and Organizational Behavior, 3, 377–406.
Bolino, M., & Turnley, W. (1999). Measuring impression management in organizations:
A scale development based on the Jones and Pittman taxonomy. Organizational
Research Methods, 2, 187–206.
Branzei, O., Ursacki-Bryant, T., Vertinsky, I., & Zhang, W. (2004). The formation of green
strategies in Chinese firms: Matching corporate environmental responses and indi-
vidual principles. Strategic Management Journal, 25, 1075–1095.
Brecht, A. (1937). Bureaucratic sabotage. The Annals of the American Academy of Politi-
cal and Social Science, 189, 48–57.
Bromley, D. (1993). Reputation, image, and impression management. New York, NY: Wi-
ley.
Bromley, D. (2000). Psychological aspects of corporate identity, image and reputation.
Corporate Reputation Review, 3, 240–253.
Brouer, R., Badaway, R., Gallagher, V., & Haber, J. (2015). Political skill dimensional-
ity and impression management choice and effective use. Journal of Business and
Psychology, 30, 217–233.
Brouer, R., Douglas, C., Treadway, D., & Ferris, G. (2013). Leader political skill, relationship quality, and leadership effectiveness: A two-study model test and constructive replication. Journal of Leadership & Organizational Studies, 20, 185–198.
Burris, E. (2012). The risks and rewards of speaking up: Managerial responses to employee
voice. Academy of Management Journal, 55, 851–875.
Byrne, D. (1917). Executive session. Nash’s Pall Mall Magazine, 59, 49–56.
Byrne, Z., Manning, S., Weston, J., & Hochwarter, W. (2017). All roads lead to well-being:
Unexpected relationships between organizational POPs, employee engagement, and
worker well-being. In C. Rosen & P. Perrewé (Eds.), Power, politics, and political
skill in job stress (pp. 1–32). Bingley, UK: Emerald.
Cantoni, C. (1993). Eliminating bureaucracy-roots and all. Management Review, 82, 30–
33.
Chang, C., Rosen, C., & Levy, P. (2009). The relationship between perceptions of organiza-
tional politics and employee attitudes, strain, and behavior: A meta-analytic exami-
nation. Academy of Management Journal, 52, 779–801.
Cheung, G. (2009). A multiple-perspective approach to data analysis in congruence research. Organizational Research Methods, 12(1), 63–68.
Cheung, J., Burns, D., Sinclair, R., & Sliter, M. (2017). Amazon Mechanical Turk in or-
ganizational psychology: An evaluation and practical recommendations. Journal of
Business and Psychology, 32, 347–361.
Clark, V. (2019). Meaningful integration within mixed methods studies: Identifying why,
what, when and how. Contemporary Educational Psychology, 57, 106–111.
Clark, L., & Watson, D. (1995). Constructing validity: Basic issues in objective scale de-
velopment. Psychological Assessment, 7, 309–319.
Couper, M. (2013). Is the sky falling? New technology, changing media, and the future of
surveys. Survey Research Methods, 7, 145–156.
Craig, C., & Douglas, S. (2011). Assessing cross-cultural marketing theory and research: A
commentary essay. Journal of Business Research, 64, 625–627.
Cullen, K., Gerbasi, A., & Chrobot-Mason, D. (2018). Thriving in central network posi-
tions: The role of political skill. Journal of Management, 44, 682–706.
Dahling, J., Gabriel, A., & MacGowan, R. (2017). Understanding typologies of feedback
environment perceptions: A latent profile investigation. Journal of Vocational Be-
havior, 101, 133–148.
Dalal, R. (2005). A meta-analysis of the relationship between organizational citizenship
behavior and counterproductive work behavior. Journal of Applied Psychology, 90,
1241–1255.
Das, M., Ester, P., & Kaczmirek, L. (Eds.). (2020). Social and behavioral research and
the internet: Advances in applied methods and research strategies. New York, NY:
Routledge.
de Leeuw, E., & Toepoel, V. (2018). Mixed-mode and mixed-device surveys. In D. Van-
nette & J. Krosnick (Eds.), The Palgrave handbook of survey research (pp. 51–61).
Cham, Switzerland: Palgrave MacMillan.
Deephouse, D., & Carter, S. (2005). An examination of differences between organization-
al legitimacy and organizational reputation. Journal of Management Studies, 42,
329–360.
Dipboye, R., & Foster, J. (2002). Multi-level theorizing about perceptions of organization-
al politics. In F. Yammarino & F. Dansereau (Eds.), The many faces of multi-level
issues (pp. 255–270). Oxford, UK: Elsevier Science.
Doldor, E., Anderson, D., & Vinnicombe, S. (2013). Refining the concept of political will:
A gender perspective. British Journal of Management, 24, 414–427.
Donovan, J., Bateman, T., & Heggestad, E. (2013). Individual differences in work motiva-
tion: Current directions and future needs. In N. Christiansen & R. Tett (Eds.), Hand-
book of personality at work (pp. 121–128). New York: NY: Routledge.
Douglas, C., & Ammeter, A. (2004). An examination of leader political skill and its effect
on ratings of leader effectiveness. The Leadership Quarterly, 15, 537–550.
Duckworth, A., & Quinn, P. (2009). Development and validation of the Short Grit Scale
(GRIT–S). Journal of Personality Assessment, 91, 166–174.
Edwards, J. (1994). The study of congruence in organizational behavior research: Critique
and a proposed alternative. Organizational Behavior and Human Decision Process,
58, 51–100.
Edwards, J., & Parry, M. (1993). On the use of polynomial regression equations as an
alternative to difference scores in organizational research. Academy of Management
Journal, 36, 1577–1613.
Elbanna, S., Kapoutsis, I., & Mellahi, K. (2017). Creativity and propitiousness in strate-
gic decision making: The role of positive politics and macro-economic uncertainty.
Management Decision, 55, 2218–2236.
Eldor, L. (2016). Looking on the bright side: The positive role of organizational politics in
the relationship between employee engagement and performance at work. Applied
Psychology, 66, 233–259.
Ellen III, B. (2014). Considering the positive possibilities of leader political behavior.
Journal of Organizational Behavior, 35, 892–896.
Epitropaki, O., Kark, R., Mainemelis, C., & Lord, R. G. (2017). Leadership and follower-
ship identity processes: A multilevel review. The Leadership Quarterly, 28, 104–
129.
Farrell, D., & Petersen, J. (1982). Patterns of political behavior in organizations. Academy
of Management Review, 7, 403–412.
Fedor, D., Maslyn, J., Farmer, S., & Bettenhausen, K. (2008). The contribution of positive
politics to the prediction of employee reactions. Journal of Applied Social Psychol-
ogy, 38, 76–96.
Fendt, J., & Sachs, W. (2008). Grounded theory method in management research: Users’
perspectives. Organizational Research Methods, 11, 430–455.
Ferris, G., Adams, G., Kolodinsky, R., Hochwarter, W., & Ammeter, A. (2002). Percep-
tions of organizational politics: Theory and research directions. In F. Yammarino
& F. Dansereau (Eds.), Research in multi-level issues, Volume 1: The many faces of
multi-level issues (pp. 179–254). Oxford, UK: Elsevier.
Ferris, G., Berkson, H., Kaplan, D., Gilmore, D., Buckley, M., Hochwarter, W., et al.
(1999). Development and initial validation of the political skill inventory. Paper
presented at the 59th annual national meeting of the Academy of Management, Chi-
cago.
Ferris, G., Blass, R., Douglas, C., Kolodinsky, R., & Treadway, D. (2003). Personal reputation in organizations. In J. Greenberg (Ed.), Organizational behavior: The state of the science (pp. 211–246). Mahwah, NJ: Lawrence Erlbaum.
Ferris, G. R., Bowen, M. G., Treadway, D. C., Hochwarter, W. A., Hall, A. T., & Perrewé, P.
L. (2006). The assumed linearity of organizational phenomena: Implications for oc-
cupational stress and well-being. In P. L. Perrewé & D. C. Ganster (Eds.), Research
in occupational stress and well-being (Vol. 5, pp. 205–232). Oxford, UK: Elsevier
Science Ltd.
Ferris, G., Ellen, B., McAllister, C., & Maher, L. (2019). Reorganizing organizational poli-
tics research: A review of the literature and identification of future research direc-
tions. Annual Review of Organizational Psychology and Organizational Behavior,
6, 299–323.
Ferris, G., Fedor, D., & King, T. (1994). A political conceptualization of managerial behav-
ior. Human Resource Management Review, 4, 1–34.
Ferris, G., Harrell-Cook, G., & Dulebohn, J. (2000). Organizational politics: The nature
of the relationship between politics perceptions and political behavior. In S. Bacha-
rach & E. Lawler (Eds.), Research in the sociology of organizations (pp. 89–130).
Stamford, CT: JAI Press.
Ferris, G., Harris, J., Russell, Z., Ellen, B., Martinez, A., & Blass, F. (2014). The role
of reputation in the organizational sciences: A multi-level review, construct assess-
ment, and research directions. In M. Buckley, A. Wheeler, & J. Halbesleben (Eds.),
Research in personnel and human resources management (pp. 241–303). Bingley,
UK: Emerald.
Ferris, G., Harris, J., Russell, Z., & Maher, L. (2018). Politics in organizations. In N. Anderson, D. Ones, & H. Sinangil (Eds.), The handbook of industrial, work, and organizational psychology (pp. 514–531). Thousand Oaks, CA: Sage.
Ferris, G., & Hochwarter, W. (2011). Organizational politics. In S. Zedeck (Ed.), APA
handbook of industrial and organizational psychology (pp. 435–459). Washington,
DC: APA.
Ferris, G., Hochwarter, W., Douglas, C., Blass, F., Kolodinsky, R., & Treadway, D. (2002b).
Social influence processes in organizations and human resource systems. In G. Fer-
ris, & J. Martocchio (Eds.), Research in personnel and human resources manage-
ment (pp. 65–127). Oxford, U.K.: JAI Press/Elsevier Science.
Ferris, G., & Judge, T. (1991). Personnel/human resources management: A political influ-
ence perspective. Journal of Management, 17, 447–488.
Ferris, G., & Kacmar, K. (1989). Perceptions of organizational politics. Paper presented at
the 49th Annual Academy of Management Meeting, Washington, DC.
Ferris, G., & Kacmar, K. (1992). Perceptions of organizational politics. Journal of Man-
agement, 18, 93–116.
Ferris, G., & King, T. (1991). Politics in human resources decisions: A walk on the dark
side. Organizational Dynamics, 20, 59–71.
Ferris, G., Perrewé, P., Daniels, S., Lawong, D., & Holmes, J. (2017). Social influence
and politics in organizational research: What we know and what we need to know.
Journal of Leadership & Organizational Studies, 24, 5–19.
Ferris, G., Perrewe, P., & Douglas, C. (2002). Social effectiveness in organizations: Con-
struct validity and research directions. Journal of Leadership and Organizational
Studies, 9, 49–63.
Ferris, G., Russ, G., & Fandt, P. (1989). Politics in organizations. In R. Giacalone & P.
Rosenfeld (Eds.), Impression management in the organization (pp. 143–170). Hill-
sdale, NJ: Erlbaum.
Ferris, G., & Treadway, D. (2012). Politics in organizations: History, construct specifica-
tion, and research directions. In G. Ferris & D. Treadway (Eds.), Politics in organi-
zations: Theory and research considerations (pp. 3–26). New York, NY: Routledge/
Taylor and Francis.
Ferris, G., Treadway, D., Brouer, R., & Munyon, T. (2012). Political skill in the organiza-
tional sciences. In G. Ferris & D. Treadway (Eds.), Politics in organizations: Theory
and research considerations (pp. 487–528). New York, NY: Routledge/Taylor &
Francis.
Ferris, G., Treadway, D., Kolodinsky, R., Hochwarter, W., Kacmar, C., Douglas, C., &
Frink, D. D. (2005). Development and validation of the political skill inventory.
Journal of Management, 31, 126–152.
Ferris, G., Treadway, D., Perrewé, P., Brouer, R., Douglas, C., & Lux, S. (2007). Political
skill in organizations. Journal of Management, 33, 290–320.
Finkel, E., Eastwick, P., & Reis, H. (2015). Best research practices in psychology: Illus-
trating epistemological and pragmatic considerations with the case of relationship
science. Journal of Personality and Social Psychology, 108, 275–297.
Franke, H., & Foerstl, K. (2018). Fostering integrated research on organizational politics
and conflict in teams: A cross-phenomenal review. European Management Journal,
36, 593–607.
French, J., & Raven, B. (1959). The bases of social power. In D. Cartwright & A. Zander
(Eds.), Group dynamics (pp. 150–167). New York, NY: Harper & Row.
Frieder, R. E., Ferris, G. R., Perrewé, P. L., Wihler, A., & Brooks, C. D. (2019). Extending
the metatheoretical framework of social/political influence to leadership: Political
skill effects on situational appraisals, responses, and evaluations by others. Person-
nel Psychology, 72(4), 543–569.
Gabriel, A., Campbell, J., Djurdjevic, E., Johnson, R., & Rosen, C. (2018). Fuzzy profiles:
Comparing and contrasting latent profile analysis and fuzzy set qualitative compara-
tive analysis for person-centered research. Organizational Research Methods, 21,
877–904.
Gabriel, A., Daniels, M., Diefendorff, J., & Greguras, G. (2015). Emotional labor actors: A
latent profile analysis of emotional labor strategies. Journal of Applied Psychology,
100, 863–879.
Gabriel, A., Koopman, J., Rosen, C., & Johnson, R. (2018). Helping others or helping one-
self? An episodic examination of the behavioral consequences of helping at work.
Personnel Psychology, 71, 85–107.
Gandz, J., & Murray, V. (1980). The experience of workplace politics. Academy of Man-
agement Journal, 23, 237–251.
Gentry, W., Gilmore, D., Shuffler, M., & Leslie, J. (2012). Political skill as an indicator of
promotability among multiple rater sources. Journal of Organizational Behavior,
33, 89–104.
George, G., Dahlander, L., Graffin, S., & Sim, S. (2016). Reputation and status: Expanding
the role of social evaluations in management research. Academy of Management
Journal, 59, 1–13.
Grams, W., & Rogers, R. (1990). Power and personality: Effects of Machiavellianism,
need for approval, and motivation on use of influence tactics. Journal of General
Psychology, 117, 71–82.
Grant, A., & Schwartz, B. (2011). Too much of a good thing: The challenge and opportu-
nity of the inverted U. Perspectives on Psychological Science, 6, 61–76.
Guo, Y., Kang, H., Shao, B., & Halvorsen, B. (2019). Organizational politics as a blind-
fold: Employee work engagement is negatively related to supervisor-rated work out-
comes when organizational politics is high. Personnel Review, 48, 784–798.
Heider, F. (1958). The psychology of interpersonal relations. New York, NY: Wiley.
Higgins, C., Judge, T., & Ferris, G. (2003). Influence tactics and work outcomes: A meta‐
analysis. Journal of Organizational Behavior, 24, 89–106.
Hill, S., Thomas, A., & Meriac, J. (2016). Political behaviors, politics perceptions and
work outcomes: Moving to an experimental study. In E. Vigoda-Gabot & A. Drory
(Eds.), Handbook of organizational politics: Looking back and to the future (pp.
369–400). Northampton, MA: Edward Elgar Publishing.
Hinkin, T. (1998). A brief tutorial on the development of measures for use in survey ques-
tionnaires. Organizational Research Methods, 1, 104–121.
Hochwarter, W. (2012). The positive side of organizational politics. In G. Ferris & D. Treadway (Eds.), Politics in organizations: Theory and research considerations (pp. 20–45). New York, NY: Routledge/Taylor and Francis.
Hochwarter, W. (2014). On the merits of student‐recruited sampling: Opinions a decade in
the making. Journal of Occupational and Organizational Psychology, 87, 27–33.
Hochwarter, W., Ferris, G., Gavin, M., Perrewé, P., Hall, A., & Frink, D. (2007). Political
skill as neutralizer of felt accountability—Job tension effects on job performance
ratings: A longitudinal investigation. Organizational Behavior and Human Decision
Processes, 102, 226–239.
Hochwarter, W., Ferris, G., & Hanes, T. (2011). Multi-study packages in organizational sci-
ence research. In D. Ketchen & D. Bergh (Eds.), Building methodological bridges:
Research methodology in strategy and management (pp. 163–199). Bingley, UK:
Emerald.
Hochwarter, W., Ferris, G., Laird, M., Treadway, D., & Gallagher, V. (2010). Nonlinear
politics perceptions–work outcome relationships: A three-study, five-sample inves-
tigation. Journal of Management, 36, 740–763.
Hochwarter, W., Ferris, G., Zinko, R., Arnell, B., & James, M. (2007). Reputation as a
moderator of political behavior-work outcomes relationships: A two-study investi-
gation with convergent results. Journal of Applied Psychology, 92, 567–576.
Hochwarter, W., Kacmar, C., Perrewé, P., & Johnson, D. (2003). Perceived organizational
support as a mediator of the relationship between politics perceptions and work
outcomes. Journal of Vocational Behavior, 63, 438–456.
Hochwarter, W., Kacmar, K., Treadway, D., & Watson, T. (2003). It’s all relative: The
distinction and prediction of political perceptions across levels. Journal of Applied
Social Psychology, 33, 1995–2016.
Hochwarter, W., Summers, J., Thompson, K., Perrewé, P., & Ferris, G. (2010). Strain reac-
tions to perceived entitlement behavior by others as a contextual stressor: Moderat-
ing role of political skill in three samples. Journal of Occupational Health Psychol-
ogy, 15, 388–398.
Hult, G., Ketchen, D., Griffith, D., Chabowski, B., Hamman, M., Dykes, B., Pollitte, W., &
Cavusgil, S. (2008). An assessment of the measurement of performance in interna-
tional business research. Journal of International Business Studies, 39, 1064–1080.
Jann, B., Krumpal, I., & Wolter, F. (2019). Social desirability bias in surveys – Collecting
and analyzing sensitive data. Methods, Data, Analyses, 13, 3–6.
Jehn, K., & Jonsen, K. (2010). A multimethod approach to the study of sensitive organiza-
tional issues. Journal of Mixed Methods Research, 4, 313–341.
Johns, G. (2001). In praise of context. Journal of Organizational Behavior, 22, 31–42.
Johns, G. (2006). The essential impact of context on organizational behavior. Academy of
Management Review, 31, 386–408.
Johns, G. (2018). Advances in the treatment of context in organizational research. Annual
Review of Organizational Psychology and Organizational Behavior, 5, 21–46.
Jones, E. (1990). Interpersonal perception. New York, NY: W.H. Freeman.
Jones, E., & Pittman, T. (1982). Toward a general theory of strategic self-presentation.
Psychological Perspectives on the Self, 1, 231–262.
Kacmar, K., & Baron, R. (1999). Organizational politics: The state of the field, links to
related processes, and an agenda for future research. In G. Ferris (Ed.), Research in
personnel and human resources management (pp. 1–39). Stamford, CT: JAI Press.
Kacmar, K., Bozeman, D., Carlson, D., & Anthony, W. (1999). An examination of the perceptions of organizational politics model: Replication and extension. Human Relations, 52, 383–416.
Kacmar, K., & Carlson, D. (1997). Further validation of the perceptions of politics scale
(POPs): A multiple sample investigation. Journal of Management, 23, 627–658.
Kacmar, K., & Ferris, G. (1991). Perceptions of organizational politics scale (POPs): De-
velopment and construct validation. Educational and Psychological Measurement,
51, 193–205.
Kacmar, K., Wayne, S., & Wright, P. (1996). Subordinate reactions to the use of impression
management tactics and feedback by the supervisor. Journal of Managerial Issues,
8, 35–53.
Kapoutsis, I., Papalexandris, A., Treadway, D., & Bentley, J. (2017). Measuring political
will in organizations: Theoretical construct development and empirical validation.
Journal of Management, 43, 2252–2280.
Kelley, H. (1973). The process of causal attributions. American Psychologist, 28, 107–128.
Kidron, A., & Vinarski-Peretz, H. (2018). The political iceberg: The hidden side of leaders’
political behaviour. Leadership & Organization Development Journal, 39, 1010–
1023.
Kiewitz, C., Restubog, S., Zagenczyk, T., & Hochwarter, W. (2009). The interactive effects
of psychological contract breach and organizational politics on perceived organi-
zational support: Evidence from two longitudinal studies. Journal of Management
Studies, 46, 806–834.
Kipnis, D., & Schmidt, S. (1988). Upward-influence styles: Relationship with performance
evaluations, salary, and stress. Administrative Science Quarterly, 33, 528–542.
Kipnis, D., Schmidt, S., & Wilkinson, I. (1980). Intraorganizational influence tactics: Ex-
plorations in getting one’s way. Journal of Applied Psychology, 65, 440–452.
Kruse, E., Chancellor, J., & Lyubomirsky, S. (2017). State humility: Measurement, concep-
tual validation, and intrapersonal processes. Self and Identity, 16, 399–438.
Lafrenière, M., Sedikides, C., & Lei, X. (2016). Regulatory fit in self-enhancement and
self-protection: implications for life satisfaction in the west and the east. Journal of
Happiness Studies, 17, 1111–1123.
Laird, M., Zboja, J., & Ferris, G. (2012). Partial mediation of the political skill-reputation relationship. Career Development International, 17, 557–582.
Lampaki, A., & Papadakis, V. (2018). The impact of organisational politics and trust in
the top management team on strategic decision implementation success: A middle
manager’s perspective. European Management Journal, 36, 627–637.
Landells, E., & Albrecht, S. (2013). Organizational political climate: Shared perceptions
about the building and use of power bases. Human Resource Management Review,
23, 357–365.
Landells, E., & Albrecht, S. (2017). The positives and negatives of organizational politics:
A qualitative study. Journal of Business and Psychology, 32, 41–58.
Landry, H. (1969). Creativity and personality integration. Canadian Journal of Counsel-
ling and Psychotherapy, 3, 5–11.
Larson, R., & Csikszentmihalyi, M. (1983). The experience sampling method. New Direc-
tions for Methodology of Social & Behavioral Science, 15, 41–56.
Lasswell, H. (1936). Politics: Who gets what, when, how? New York, NY: Whittlesey.
Lee, S., Han, S., Cheong, M., Kim, S. L., & Yun, S. (2017). How do I get my way? A
meta-analytic review of research on influence tactics. The Leadership Quarterly,
28, 210–228.
LePine, J., Podsakoff, N., & LePine, M. (2005). A meta-analytic test of the challenge stress-
or–hindrance stressor framework: An explanation for inconsistent relationships
among stressors and performance. Academy of Management Journal, 48, 764–775.
Lepisto, D., & Pratt, M. (2012). Politics in perspectives: On the theoretical challenges and
opportunities in studying organizational politics. In G. Ferris & D. Treadway (Eds.),
Politics in organizations: Theory and research considerations (pp. 67–98). New
York, NY: Routledge/Taylor and Francis.
Lewin, K. (1936). Principles of topological psychology. New York, NY: McGraw-Hill.
Li, C., Liang, J., & Farh, J. L. (2020). Speaking up when water is murky: An uncertainty-
based model linking perceived organizational politics to employee voice. Journal of
Management, 46(3), 443–469.
Li, J., Wu, L., Liu, D., Kwan, H., & Liu, J. (2014). Insiders maintain voice: A psychologi-
cal safety model of organizational politics. Asia Pacific Journal of Management, 31,
853–874.
Liden, R., & Mitchell, T. (1988). Ingratiatory behaviors in organizational settings. Acad-
emy of Management Review, 13, 572–587.
Lim, S., Ilies, R., Koopman, J., Christoforou, P., & Arvey, R. (2018). Emotional mecha-
nisms linking incivility at work to aggression and withdrawal at home: An experi-
ence-sampling study. Journal of Management, 44, 2888–2908.
Lincoln, Y., & Guba, E. (1985). Naturalistic inquiry. Thousand Oaks, CA: Sage Publications.
Liu, Y., Ferris, G., Zinko, R., Perrewé, P., Weitz, B., & Xu, J. (2007). Dispositional antecedents and outcomes of political skill in organizations: A four-study investigation with convergence. Journal of Vocational Behavior, 71, 146–165.
Liu, Y., Liu, J., & Wu, L. (2010). Are you willing and able? Roles of motivation, power,
and politics in career growth. Journal of Management, 36, 1432–1460.
Luthans, F., & Avolio, B. (2009). Inquiry unplugged: Building on Hackman’s potential
perils of POB. Journal of Organizational Behavior: The International Journal of In-
dustrial, Occupational and Organizational Psychology and Behavior, 30, 323–328.
Lux, S., Ferris, G., Brouer, R., Laird, M., & Summers, J. (2008). A multi-level concep-
tualization of organizational politics. In C. Cooper & J. Barling (Eds.), The SAGE
handbook of organizational behavior (pp. 353–371). Thousand Oaks, CA: Sage.
Machiavelli, N. (1952). The prince. New York, NY: New American Library (The transla-
tion of Machiavelli’s The Prince by Luigi Ricci was first published in 1903).
Madison, D., Allen, R., Porter, L., Renwick, P., & Mayes, B. (1980). Organizational poli-
tics: An exploration of managers’ perceptions. Human Relations, 33, 79–100.
Maher, L., Gallagher, V., Rossi, A., Ferris, G., & Perrewé, P. (2018). Political skill and will
as predictors of impression management frequency and style: A three-study investi-
gation. Journal of Vocational Behavior, 107, 276–294.
Maslyn, J., Farmer, S., & Bettenhausen, K. (2017). When organizational politics matters:
The effects of the perceived frequency and distance of experienced politics. Human
Relations, 70, 1486–1513.
Maslyn, J., & Fedor, D. (1998). Perceptions of politics: Does measuring different foci mat-
ter? Journal of Applied Psychology, 83, 645–653.
Matta, F., Scott, B., Colquitt, J., Koopman, J., & Passantino, L. (2017). Is consistently
unfair better than sporadically fair? An investigation of justice variability and stress.
Academy of Management Journal, 60, 743–770.
Mayes, B., & Allen, R. (1977). Toward a definition of organizational politics. Academy of
Management Review, 2, 672–678.
McArthur, J. (1917). What a company officer should know. New York, NY: Harvey Press.
Miller, B., Rutherford, M., & Kolodinsky, R. (2008). Perceptions of organizational politics:
A meta-analysis of outcomes. Journal of Business and Psychology, 22, 209–222.
Mintzberg, H. (1983). Power in and around organizations. Englewood Cliffs, NJ: Prentice-
Hall.
Mintzberg, H. (1985). The organization as political arena. Journal of Management Studies,
22, 133–154.
Misangyi, V., Greckhamer, T., Furnari, S., Fiss, P., Crilly, D., & Aguilera, R. (2017). Em-
bracing causal complexity: The emergence of a neo-configurational perspective.
Journal of Management, 43, 255–282.
Mitchell, M., Baer, M., Ambrose, M., Folger, R., & Palmer, N. (2018). Cheating under
pressure: A self-protection model of workplace cheating behavior. Journal of Ap-
plied Psychology, 103, 54–73.
Molina-Azorin, J., Bergh, D., Corley, K., & Ketchen, D. (2017). Mixed methods in the or-
ganizational sciences: Taking stock and moving forward. Organizational Research
Methods, 20, 179–192.
Morgan, L. (1989). “Political will” and community participation in Costa Rican primary
health care. Medical Anthropology Quarterly, 3, 232–245.
Morgeson, F., Mitchell, T., & Liu, D. (2015). Event system theory: An event-oriented ap-
proach to the organizational sciences. Academy of Management Review, 40, 515–
537.
Munyon, T., Summers, J., Thompson, K., & Ferris, G. (2015). Political skill and work
outcomes: A theoretical extension, meta‐analytic investigation, and agenda for the
future. Personnel Psychology, 68, 143–184.
Nye, L., & Witt, L. (1993). Dimensionality and construct validity of the perceptions of
organizational politics scale (POPS). Educational and Psychological Measurement,
53, 821–829.
O’Shea, P. (1920). Employees’ magazines for factories, offices, and business organiza-
tions. New York, NY: Wilson.
Perrewé, P., Zellars, K., Ferris, G., Rossi, A., Kacmar, C., & Ralston, D. (2004). Neutral-
izing job stressors: Political skill as an antidote to the dysfunctional consequences
of role conflict. Academy of Management Journal, 47, 141–152.
Pfeffer, J. (1981). Power in organizations. Marshfield, MA: Pitman.
Pfeffer, J. (1992). Managing with power: Politics and influence in organizations. Boston,
MA: Harvard Business Press.
Pfeffer, J. (2010). Power: Why some people have it and others don’t. New York, NY: Harp-
erCollins Publishers.
Pierce, J., & Aguinis, H. (2013). The too-much-of-a-good-thing effect in management.
Journal of Management, 39, 313–338.
Porter, L. (1976). Organizations as political animals. Presidential address, Division of
Industrial-Organizational Psychology, 84th Annual Meeting of the American Psy-
chological Association, Washington, DC.
Porter, L., Allen, R., & Angle, H. (1981). The politics of upward influence in organiza-
tions. In L. Cummings, & B. Staw (Eds.), Research in organizational behavior (pp.
109–149). Greenwich, CT: JAI Press.
Porter, C., Outlaw, R., Gale, J., & Cho, T. (2019). The use of online panel data in man-
agement research: A review and recommendations. Journal of Management, 45,
319–344.
Post, L., Raile, A., & Raile, E. (2010). Defining political will. Politics & Policy, 38, 653–
676.
Ravasi, D., Rindova, V., Etter, M., & Cornelissen, J. (2018). The formation of organiza-
tional reputation. Academy of Management Annals, 12, 574–599.
Reitz, A., Motti-Stefanidi, F., & Asendorpf, J. (2016). Me, us, and them: Testing sociom-
eter theory in a socially diverse real-life context. Journal of Personality and Social
Psychology, 110, 908–920.
Rihoux, B., & Ragin, C. (2008). Configurational comparative methods: Qualitative com-
parative analysis (QCA) and related techniques (Vol. 51). Thousand Oaks, CA:
Sage Publications.
Rindova, V., Williamson, I., & Petkova, A. (2010). Reputation as an intangible asset: Re-
flections on theory and methods in two empirical studies of business school reputa-
tions. Journal of Management, 36, 610–619.
Rose, P., & Greeley, M. (2006). Education in fragile states: Capturing lessons and identify-
ing good practice. Brighton, UK: DAC Fragile States Group.
Rosen, C., Ferris, D., Brown, D., Chen, Y., & Yan, M. (2014). Perceptions of organizational
politics: A need satisfaction paradigm. Organization Science, 25, 1026–1055.
Rosen, C., & Hochwarter, W. (2014). Looking back and falling further behind: The mod-
erating role of rumination on the relationship between organizational politics and
employee attitudes, well-being, and performance. Organizational Behavior and Hu-
man Decision Processes, 124, 177–189.
Rosen, C., Kacmar, K., Harris, K., Gavin, M., & Hochwarter, W. (2017). Workplace poli-
tics and performance appraisal: A two-study, multilevel field investigation. Journal
of Leadership & Organizational Studies, 24, 20–38.
Rosen, C., Koopman, J., Gabriel, A., & Johnson, R. (2016). Who strikes back? A daily
investigation of when and why incivility begets incivility. Journal of Applied Psy-
chology, 101, 1620–1634.
Rosen, C., Levy, P., & Hall, R. (2006). Placing perceptions of politics in the context of the
feedback environment, employee attitudes, and job performance. Journal of Applied
Psychology, 91, 211–220.
Runkel, P., & McGrath, J. (1972). Research on human behavior: A systematic guide to method. New York, NY: Holt, Rinehart and Winston.
Salancik, G., & Pfeffer, J. (1978). A social information processing approach to job attitudes
and task design. Administrative Science Quarterly, 23, 224–253.
Saleem, H. (2015). The impact of leadership styles on job satisfaction and mediating role
of perceived organizational politics. Procedia-Social and Behavioral Sciences, 172,
563–569.
Schein, V. (1977). Individual power and political behaviors in organizations: An inade-
quately explored reality. Academy of Management Review, 2, 64–72.
Schriesheim, C., Powers, K., Scandura, T., Gardiner, C., & Lankau, M. (1993). Improving construct measurement in management research: Comments and a quantitative approach for assessing the theoretical content adequacy of paper-and-pencil survey-type instruments. Journal of Management, 19, 385–417.
Sharfman, M., Wolf, G., Chase, R., & Tansik, D. (1988). Antecedents of organizational
slack. Academy of Management Review, 13, 601–614.
Shaughnessy, B., Treadway, D., Breland, J., & Perrewé, P. (2017). Informal leadership
status and individual performance: The roles of political skill and political will.
Journal of Leadership & Organizational Studies, 24, 83–94.
Silvester, J., & Wyatt, M. (2018). Political effectiveness at work. In C. Viswesvaran, D.
Ones, N. Anderson, & H. Sinangil (Eds.), Handbook of industrial work and organi-
zational Psychology (pp. 228–247). London, UK: Sage.
Smith, A., Plowman, D., Duchon, D., & Quinn, A. (2009). A qualitative study of high-
reputation plant managers: Political skill and successful outcomes. Journal of Op-
erations Management, 27, 428–443.
Smith, A., Watkins, M., Burke, M., Christian, M., Smith, C., Hall, A., & Simms, S. (2013).
Gendered influence: A gender role perspective on the use and effectiveness of influ-
ence tactics. Journal of Management, 39, 1156–1183.
Stolz, R. (1955). Is executive development coming of age? The Journal of Business, 28,
48–57.
Sun, S., & Chen, H. (2017). Is political behavior a viable coping strategy to perceived orga-
nizational politics? Unveiling the underlying resource dynamics. Journal of Applied
Psychology, 102, 1471–1482.
Tedeschi, J., & Melburg, V. (1984). Impression management and influence in the organiza-
tion. In S. Bacharach & E. Lawler (Eds.), Research in the sociology of organizations
(Vol. 3, pp. 31–58). Greenwich, CT: JAI Press.
Tocher, N., Oswald, S., Shook, C., & Adams, G. (2012). Entrepreneur political skill and
new venture performance: Extending the social competence perspective. Entrepre-
neurship & Regional Development: An International Journal, 24, 283–305.
Treadway, D. (2012). Political will in organizations. In G. Ferris & D. Treadway (Eds.),
Politics in organizations: Theory and research considerations (pp. 531–566). New
York, NY: Routledge/Taylor & Francis Group.
Treadway, D., Hochwarter, W., Ferris, G., Kacmar, C., Douglas, C., Ammeter, A., & Buck-
ley, M. (2004). Leader political skill and employee reactions. The Leadership Quar-
terly, 15, 493–513.
Treadway, D., Hochwarter, W., Kacmar, C., & Ferris, G. (2005). Political will, political
skill, and political behavior. Journal of Organizational Behavior, 26, 229–245.
Tsui, A. (1984). A role set analysis of managerial reputation. Organizational Behavior and
Human Performance, 34, 64–96.
Turnley, W., & Feldman, D. (1999). The impact of psychological contract violations on
exit, voice, loyalty, and neglect. Human Relations, 52, 895–922.
Valle, M., & Perrewé, P. (2000). Do politics perceptions relate to political behaviors? Tests
of an implicit assumption and expanded model. Human Relations, 53, 359–386.
Van Dyne, L., & LePine, J. (1998). Helping and voice extra-role behaviors: Evidence of
construct and predictive validity. Academy of Management Journal, 41, 108–119.
Van Knippenberg, B., & Steensma, H. (2003). Future interaction expectation and the use of
soft and hard influence tactics. Applied Psychology, 52, 55–67.
Van Maanen, J., Sorensen, J., & Mitchell, T. (2007). The interplay between theory and
method. Academy of Management Review, 32, 1145–1154.
Vecchio, R., & Sussman, M. (1991). Choice of influence tactics: individual and organiza-
tional determinants. Journal of Organizational Behavior, 12, 73–80.
Vigoda, E. (2002). Stress-related aftermaths to workplace politics: The relationships
among politics, job distress, and aggressive behavior in organizations. Journal of
Organizational Behavior, 23, 571–588.
Von Hippel, W., Lakin, J., & Shakarchi, R. (2005). Individual differences in motivated
social cognition: The case of self-serving information processing. Personality and
Social Psychology Bulletin, 31, 1347–1357.
Wade, J., Porac, J., Pollock, T., & Graffin, S. (2006). The burden of celebrity: The impact of
CEO certification contests on CEO pay and performance. Academy of Management
Journal, 49, 643–660.
Wheeler, A., Shanine, K., Leon, M., & Whitman, M. (2014). Student‐recruited samples
in organizational research: A review, analysis, and guidelines for future research.
Journal of Occupational and Organizational Psychology, 87, 1–26.
Whitman, M., Halbesleben, J., & Shanine, K. (2013). Psychological entitlement and abu-
sive supervision: Political skill as a self-regulatory mechanism. Health Care Man-
agement Review, 38, 248–257.
Wickenberg, J., & Kylén, S. (2007). How frequent is organizational political behaviour? A
study of managers’ opinions at 491 workplaces. In S. Reddy (Ed.), Organizational
politics—New insights (pp. 82–94). Hyderabad, India: ICFAI University Press.
Yukl, G., & Falbe, C. (1990). Influence tactics and objectives in upward, downward, and
lateral influence attempts. Journal of Applied Psychology, 75, 132–140.
Yukl, G., & Tracey, J. (1992). Consequences of influence tactics used with subordinates,
peers, and the boss. Journal of Applied Psychology, 77, 525–535.
Zanzi, A., Arthur, M., & Shamir, B. (1991). The relationships between career concerns and
political tactics in organizations. Journal of Organizational Behavior, 12, 219–233.
Zare, M., & Flinchbaugh, C. (2019). Voice, creativity, and big five personality traits: A
meta-analysis. Human Performance, 32, 30–51.
Zhang, Y., & Lu, C. (2009). Challenge stressor-hindrance stressor and employees’ work-re-
lated attitudes, and behaviors: The moderating effects of general self-efficacy. Acta
Psychologica Sinica, 6, 501–509.
Zinko, R., Gentry, W., & Laird, M. (2016). A development of the dimensions of personal
reputation in organizations. International Journal of Organizational Analysis, 24,
634–649.
CHAPTER 8

RANGE RESTRICTION IN
EMPLOYMENT INTERVIEWS
An Influence Too Big to Ignore

Allen I. Huffcutt

The emergence of meta-analysis as a formal research technique in the early 1980s raised awareness of the need to consider the influence of various statistical “arti-
facts” in research (e.g., Hunter, Schmidt, & Jackson, 1982). Sampling error, for
instance, artificially increases variability across coefficients, which could result in
the conclusion that validity is highly specific to individual selection situations and
not generalizable. Measurement error (particularly in performance criteria) and
restriction in range (hereafter range restriction) reduce the magnitude of validity
coefficients, thereby making selection approaches appear less effective than they
really are in predicting job performance. Construct validity can also be affected
by artifacts. For example, range restriction can artificially reduce correlations in a
multitrait-multimethod (MTMM) analysis, lowering confidence that similar mea-
sures are assessing a common construct.
The effect of range restriction, potentially the most potent statistical artifact,
is particularly troublesome with employment interviews. In most selection sys-
tems, assessment of candidates occurs sequentially. Completing an application
blank first is standard practice. After that, it is common to administer measures
that are relatively quick and reasonably inexpensive, such as those for ability and
personality. Procedures that are the most time intensive and/or expensive tend
to be implemented at (or towards) the end, usually after a number of candidates
have been eliminated. Interviews typically fall in this latter category. The later the
interview is in the selection process, the greater the possibility for (and extent of)
range restriction.
The degree to which range restriction can diminish the magnitude of validity
coefficients is illustrated in several of the larger and more prominent interview
meta-analyses. For instance, McDaniel, Whetzel, Schmidt, and Maurer (1994)
reported that the mean population validity of the Situational Interview or SI
(Latham, Saari, Pursell, & Campion, 1980) rose from .35 to .50 after further cor-
rection for range restriction. Expressed as a percent-of-variance, SIs accounted
for 12% of performance variance without the correction and 25% after. (As ex-
plained later, McDaniel et al.’s correction is most likely conservative because it
was based on direct rather than indirect restriction.)
Yet, most primary interview researchers—those actually conducting studies
rather than meta-analytically summarizing them—fail to take range restriction
into account. The lack of attention is evident with a quick search on PsycINFO
(search date: 4-11-2018). Entering “job applicant interviews” as the subject (SU)
term resulted in 1,472 entries. Adding “range restriction” as a second search term
anywhere in the document reduced that number to only seven. Using the alter-
nate term “restriction in range” resulted in the same number. Additional evidence
comes again from the McDaniel et al. (1994) interview meta-analysis, where only
14 of the 160 total studies in their dataset reported range restriction information
(see pp. 605–606).
Such widespread lack of consideration is surprising in one respect because
the mathematics behind range restriction, and the equations to correct for it, have
been around for a long time. Building on the earlier work of Pearson (1903),
for example, Thorndike (1949) presented the relatively straightforward procedure
needed to correct for direct (i.e., Case II) restriction.1 The first meta-analysis book
in Industrial-Organizational (I-O) psychology, Hunter et al. (1982), also outlined
the correction procedure for direct range restriction and illustrated how to utilize
it in selection research.
Unfortunately, the issue of range restriction got more complex for employment
interview researchers in the mid-2000s. Throughout essentially the entire history
of selection research, restriction was largely presumed to be direct. Schmidt and
colleagues (e.g., Hunter & Schmidt, 2004; Hunter, Schmidt, & Le, 2006) made
the assertion that most restriction is actually indirect rather than direct. Thorndike
(1949) provided the equations needed to correct for indirect (Case III) restriction
as well, but they were generally not viable for selection contexts because too
much of the needed information was unknown. Hunter et al. were able to sim-
plify the mathematics to make the indirect correction more feasible, although it is
still more complicated than direct. To distinguish their methodology from that of
Thorndike, they named their procedure Case IV.
Interview researchers as a whole appear to have paid even less attention to the indirect form of restriction. Another search on PsycINFO
(same date) combining “job applicant interviews” as the subject term with “indi-
rect restriction” as a general term anywhere in the document yielded only three
entries.2 The first was an interview meta-analysis that incorporated indirect re-
striction as its primary purpose (Huffcutt, Culbertson, & Weyhrauch, 2014a). The
second was a reanalysis of the McDaniel et al. (1994) interview dataset using in-
direct methodology (Oh, Postlethwaite, & Schmidt, 2013; see also Le & Schmidt,
2006). The third was a general commentary on indirect restriction (Schmitt, 2007).
Failure to account for range restriction in a primary interview study (and in a
meta-analysis for that matter) can result in inaccurate or even mistaken conclu-
sions. Consider a company that wants to switch from traditional, unstructured
interviews to a new structured format such as a SI or a Behavior Description
Interview or BDI (Janz, 1982), but isn’t completely sure doing so is worth the
time and administrative trouble. If range restriction is present, which is likely, the
resulting validity coefficient will be artificially low. It might even be low enough
that the company decides not to make the switch.
One possible reason for the lack of consistent attention to range restriction
among interview researchers is that they don’t have a good, intuitive feel for its
effects. Graduate school treatment of range restriction, along with prominent psychometric textbooks (e.g., Nunnally, 1978), tends to focus only on the correc-
tive formulas. Visual presentation, such as scatterplots, is often missing. Another
potential reason is that some of the needed information (e.g., the unrestricted
standard deviation of interview ratings in the applicant population) may not be
readily available given the customized nature of interviews (e.g., as opposed to
standardized ability measures). A final reason, one based on convenience and/or
expense, is that interview researchers may not feel they have the time to refamiliarize themselves with the correction process, which is not always presented in a user-friendly manner, or to purchase meta-analytic software that they would only
use periodically or perhaps even once (e.g., Schmidt & Le, 2014).
The overarching purpose of this manuscript is to provide a convenient, all-in-
one reference for interview researchers to help them deal with range restriction.
Subsumed under this purpose is an overview of the basic concepts and mechan-
ics of range restriction (including the all-important difference between its direct
and indirect forms), visual presentation of restriction effects via scatterplots to
enhance intuitive understanding, and a summary of equations and procedures for
their use in restriction correction. Further, realistic simulations are utilized to de-
rive some of the most difficult parameters for interviews, and then these param-
eters are built into the correction equations in order to simplify them.

UNRESTRICTED INTERVIEW POPULATION


The starting point is creation of a hypothetical but realistic distribution of inter-
view and performance ratings, one that is totally unrestricted. The focus is on
high-structure interviews, as they consistently show the highest validity and are
considerably more standardized across situations than unstructured ones. For in-
stance, although the content of questions varies, all SIs are comprised of hypothet-
ical scenarios while all BDIs focus exclusively on description of past experiences.
In contrast, the content, nature, and even format of unstructured interviews can
vary immensely by interviewer and even by interview. Indeed, it is not surprising
that unstructured interviews have been likened to a “disorganized conversation.”
A key parameter in this distribution is the population correlation between high-
ly structured interview ratings and job performance. At the present time, the best
available estimate appears to be the fully corrected (via indirect methodology)
population value (i.e., rho) of .69 from Huffcutt et al. (2014a). They provided
population estimates for four levels of structure (none to highly structured), and
this value is for the highest level (see p. 303). This level includes virtually all SIs,
and a majority of BDIs. (BDIs can be conducted using more of an intermediate
level of structure, such as allowing interviewers to choose questions from a bank
and to probe extensively; such studies usually reside at Level 3.) Their value of
.69 is corrected for both unreliability in performance assessment and range restriction, but not for unreliability in the interview ratings. As explained in more detail below, such a correction yields “operational validity” rather than a construct-to-construct association, that is, the degree to which a predictor (in its actual, imperfect state) is associated with true performance.
To enhance realism (and out of the necessity of choosing a scaling), interview
parameters from Weekley and Gier (1987) were utilized. They developed a SI to
select entry-level associates in a national retail outlet. The sample question they
provide (see p. 485) about an angry customer whose watch is late coming back
from being repaired is cited regularly as an example of the SI format. Their final
interview contained 16 questions, all rated using the typical five-point scale that
has behavioral benchmarks at one, three, and five, resulting in a possible range of
16 to 80 with a midpoint of 48. Using Excel, a normal distribution was generated
with a mean of 48.0 (the midpoint) and a standard deviation of 7.4 (the actual sd
in their validity sample; see p. 486). In regard to sample size, 100 was chosen for convenience. These parameters should be reasonably representative of high-
structure interviews in general.3
On the performance side, the goal was to create a second distribution that cor-
related .69 with the original distribution of interview ratings. Using Excel, the
interview distribution was copied, sufficient measurement error was added to re-
duce the correlation with the original distribution to .69, and then the result was
rescaled to have a mean of 50.0 and a standard deviation of 10.0 (i.e., T scaling).
Given the extremely wide variation in performance rating formats across studies, this particular scaling was chosen for convenience, on the assumption that it is reasonably representative.
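For readers who prefer code to a spreadsheet, the following is a rough Python analogue of the Excel steps just described. The mixing-weights construction and the random seed are assumptions of this sketch rather than a reproduction of the exact Excel procedure, and any single sample of n = 100 will show an observed correlation that drifts somewhat from the target of .69.

```python
# Sketch of the simulation described above: 100 interview ratings drawn from N(48, 7.4)
# and a performance variable built to correlate about .69 with them, rescaled to the
# T metric (mean 50, sd 10).
import numpy as np

rng = np.random.default_rng(2020)
n, rho = 100, 0.69

interview = rng.normal(48.0, 7.4, n)

# Mix standardized interview scores with independent noise so that the population
# correlation equals rho, then rescale the result to mean 50 and sd 10.
z_int = (interview - interview.mean()) / interview.std(ddof=1)
z_perf = rho * z_int + np.sqrt(1 - rho**2) * rng.normal(0.0, 1.0, n)
performance = 50.0 + 10.0 * (z_perf - z_perf.mean()) / z_perf.std(ddof=1)

print("sample correlation:", round(float(np.corrcoef(interview, performance)[0, 1]), 2))
```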
FIGURE 8.1. Scatterplot between unrestricted structured interview ratings and error-free ratings of job performance. The sample size is 100 and the correlation is .69.

The resulting bi-variate distribution is shown in Figure 8.1. Conceptually, this scatterplot illustrates the results of a selection situation where 100 applicants ap-
ply, all are interviewed with a highly structured interview, all are hired, and then
error-free (i.e., true score) performance ratings are collected. (Implicit in this
scenario is the lack of attrition, something that is built into upcoming scenarios,
specifically Scenarios 2 and 4.) Now various scenarios that build upon this distri-
bution are presented.

Scenario 1: All Applicants are Hired—No Restriction or Attrition


This scenario represents the infrequent but not unheard of case where there is
no restriction in range on the predictor, and all (or most) who are hired remain
on the job. Technically speaking, no studies should fall into this category unless
every person who applies is hired without any consideration of interview ratings
and there is no prior selection of any kind (even from an application blank, since
that can cause indirect restriction). Practically speaking, some interview studies
could reasonably be considered to do so. For instance, in their Study 3, Latham et
al. (1980) noted that “The situational interview was administered to 56 applicants
for entry-level work in a pulp mill, all of whom were subsequently hired” (p. 425).
Additional examples include Benz (1974) and McMurry (1947).
Unfortunately, even with no restriction (or attrition), a scatterplot of actual
data with these parameters would not mirror that in Figure 8.1. The reason is that
there is no such thing as error-free performance ratings in organizational contexts.
Performance ratings contain a considerable amount of measurement error (Hunter et al., 2006), which diminishes their correlation with interview ratings. Al-
though various values of interrater reliability (i.e., IRR) for performance ratings
can be found in the literature, the most common appears to be .52 (Viswesvaran,
Ones, & Schmidt, 1996; see also Rothstein, 1990).
Using the upcoming Formula 1 in reverse, measurement error in the perfor-
mance ratings reduces the magnitude of its correlation with interview ratings from
.69 to .50 (assuming an IRR of .52). This level of association is illustrated as a
scatterplot in Figure 8.2. Using Excel, the error-free performance distribution was
copied and sufficient measurement error was added to reduce the correlation with
the original interview distribution to .50. Finally, it was rescaled using T scaling
again. Using the same scaling as in Figure 8.1 enhances comparison and high-
lights the effect of performance measurement error. To illustrate, visual inspection
suggests that, around an interview rating of 50, the spread of performance ratings increases from roughly 26 points to roughly 35 points. Although the primary focus of this manuscript is
on range restriction, the visible difference between these two figures highlights the
importance of also correcting for measurement error in performance assessment.
Readers may wonder why no correction was made for measurement error on
the interview side, since clearly it is there as well (Conway, Jako, & Goodman,
1995). Measurement error in performance ratings is artifactual because these rat-
ings (often made by supervisors or managers) do not reflect true performance on
the job due to influences such as bias, contrast effects, and halo.

FIGURE 8.2. Scatterplot between unrestricted structured interview ratings and ratings of job performance, but with measurement error in the performance ratings. The correlation is .50.

Conversely, the ratings interviewers make, while error-prone, are used to make actual selection
decisions. Hence, the validity obtained by correcting the performance side alone is often referred to as “operational validity” (Schmidt, Hunter, Pearlman, & Rothstein-Hirsh, 1985, p. 763). Interviewer ratings can be corrected as well; doing so provides valuable (albeit theoretical) information on construct associations. To illustrate, Huffcutt, Roth,
and McDaniel (1996) corrected for measurement error in both interview ratings
and cognitive ability test scores (see p. 465) in order to assess the degree of con-
struct saturation of the latter in the former.
Statistically, correcting for measurement error in performance ratings is ac-
complished as shown in Formula 1, where ro is the observed (actual) validity coef-
ficient and ryy is the performance IRR (i.e., .52). Note that the correction involves
the square root of the reliability. The correction returns the validity coefficient to
its full population value. Readers are referred to Schmidt and Hunter (2015) for
more information on this correction (see p. 112).

$$r_c = \frac{r_o}{\sqrt{r_{yy}}} = \frac{r_o}{\sqrt{.52}} = \frac{r_o}{.72} = \frac{.50}{.72} = .69 \qquad (1)$$

If an interview researcher has a study that fits this scenario (at least to a reason-
able degree), the correction is simple. Just divide the actual validity coefficient by
.72. Situations where a new structured interview is being pilot tested with appli-
cants (not incumbents) and is not used to make actual selection decisions would
be particularly relevant, especially if a high majority of applicants are hired and
retained once on the job.
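For those who prefer to script the computation rather than work by hand, a minimal sketch of this Scenario 1 correction follows (Python is used purely for illustration; the function name is mine, and .52 is the performance IRR assumed throughout this chapter).

import math

def correct_for_criterion_unreliability(r_obs, r_yy=0.52):
    """Disattenuate an observed validity coefficient for measurement error
    in the performance ratings (Formula 1); r_yy is the performance IRR."""
    return r_obs / math.sqrt(r_yy)

# The Scenario 1 example: an observed correlation of .50 is restored to about .69.
print(round(correct_for_criterion_unreliability(0.50), 2))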

Scenario 2: All Applicants Are Hired—No Restriction But There Is


Attrition
Short of financial difficulties that necessitate layoffs, there are organizations
and/or job areas where employees are rarely let go and don’t frequently leave
(e.g., union shops, civil service). That situation is represented reasonably well
by Scenario 1 (again, assuming a high majority of applicants are hired). In oth-
er work environments, some degree of attrition is common. Such attrition often
comes from both the top and bottom of performance levels, as top performers get
promoted or leave and bottom performers get let go or reassigned (Sackett, Laczo,
& Arvey, 2002). There are exceptions of course, such as when top or bottom em-
ployees leave but not both. Attrition typically results in restriction in performance
ratings, which artificially lowers the validity coefficient.
Although the focus of this manuscript is on restriction in interview ratings, a gen-
eral correction for attrition is offered. The type and degree of attrition no doubt var-
ies considerably across situations, and modeling a broad spectrum of possibilities
is well beyond the scope of this study. Out of necessity, one hopefully common and
realistic scenario is modeled. Huffcutt, Culbertson, and Weyhrauch (2014b) posited
a general scenario of 10% attrition, 5% from the top and 5% from the bottom. Based
on a simulation, they derived a range restriction ratio (u) of .80 (see p. 550), which is
the standard deviation of the restricted ratings (R) divided by the standard deviation
of the unrestricted ratings (P) in the population (i.e., sdR/sdP).
In regard to correction for attrition, a key question is whether it represents direct
or indirect restriction. Given that all (or most) applicants are hired in this scenario
and that attrition is an end-stage phenomenon, it seems reasonable to view it as
direct. The formulas for direct restriction (Hunter & Schmidt, 1990, p. 48; Hunter
& Schmidt, 2004, p. 215; see also Callender & Osburn, 1980, p. 549) are generally
intended for the predictor (here interviews). However, as noted by Schmidt and
Hunter (2015, p. 48), the effects of the predictor and the criterion on the validity co-
efficient are symmetrical; hence, the same formulas can be used to correct for attri-
tion restriction by itself. If there happens to be both restriction on the predictor and
attrition, then things get considerably more complicated (see Schmidt & Hunter, p.
48, for a discussion). This situation is addressed in Scenario 4.
In regard to procedure, the direct correction equation from Callender and Os-
burn (1980; p. 549) seems particularly popular and used widely (see Hunter &
Schmidt, 1990, p. 48; Hunter & Schmidt, 2004, p. 37). It is presented as Formula
2. The key component in this equation is u, the range restriction ratio noted above.

$$r_c = \frac{r_o}{\sqrt{(1 - u^2)r_o^2 + u^2}} \qquad (2)$$

Substituting .80 for u, the equation becomes:

$$r_c = \frac{r_o}{\sqrt{(1 - .80^2)r_o^2 + .80^2}} = \frac{r_o}{\sqrt{.36\, r_o^2 + .64}} \qquad (3)$$

The above correction restores the validity coefficient to what it would have
been had there not been any attrition. To estimate operational validity, however,
an additional correction needs to be made for measurement error in the perfor-
mance ratings, which, fortunately, can be combined with the correction for direct
restriction. That equation is presented as Formula 4. As before, .52 is used for the
IRR of performance ratings.

$$r_c = \frac{r_o}{\sqrt{.52}\sqrt{(1 - .64)r_o^2 + .64}} = \frac{r_o}{.72\sqrt{.36\, r_o^2 + .64}} \qquad (4)$$

Using this equation, an interview researcher with a study that fits reasonably
well with this scenario simply has to enter his/her observed correlation in the
last part of the above equation and do the computations to find an estimate of the
corrected (population) correlation. Situations where a new structured interview is
being pilot tested with applicants and is not used to make actual selection deci-
sions, a high majority of applicants are hired and/or hiring is done without strong
reference to merit, and there is moderate (but not extreme) attrition by the time
performance ratings are collected (from both the top and bottom) would be par-
ticularly relevant.
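The combined correction in Formula 4 can be scripted the same way. The sketch below is illustrative only: the function name is mine, u = .80 reflects the 10% attrition scenario modeled above, and .52 is again the assumed performance IRR.

import math

def correct_attrition_and_criterion_error(r_obs, u=0.80, r_yy=0.52):
    """Correct an observed validity coefficient for direct restriction due to
    attrition (range restriction ratio u) and for measurement error in the
    performance ratings (Formula 4)."""
    return r_obs / (math.sqrt(r_yy) * math.sqrt((1 - u**2) * r_obs**2 + u**2))

# For example, an observed coefficient of .40 corrects to roughly .66 under
# these assumptions.
print(round(correct_attrition_and_criterion_error(0.40), 2))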

Scenario 3: Hiring Based Solely on Interview Ratings—Direct


Restriction, No Attrition
Case II (direct) restriction occurs when the new structured interview format
being evaluated is actually used in a strictly top-down fashion to select employ-
ees. Technically speaking, to fall into this category, studies would have to have
no prior selection whatsoever (again, even from an application blank). Practically
speaking, some studies could reasonably be classified as such. To illustrate, Ar-
vey, Miller, Gould, and Burch (1987) noted that “the interview was the sole basis
for hiring decisions” (p. 3), while Robertson, Gratton, and Rout (1990) noted
that applicants were placed “as a direct consequence of their performance in the
situational interview” (p. 72). Although the number of studies in the interview
literature with direct restriction is limited (see Huffcutt et al., 2014a, Table 1),
there may be more in practice. One such source could be promotions (Hunter et
al., 2006), a type of study that is reported far less often in the literature than initial
selection but occurs frequently in the workplace.
The degree of direct restriction is a function of the selection ratio. If, say, 90%
of interviewees are hired, the restriction would not be extensive. If 10% are hired,
there should be considerably more restriction. To illustrate the progressive ef-
fects of direct restriction graphically, scatterplots were formed illustrating hiring
percentages of 90, 50, and 10, respectively. There no doubt are situations where
the hiring percentage is outside these end values or between the middle value and
an endpoint. Nonetheless, these three levels should illustrate a sufficiently broad
spectrum of real employment scenarios.
To form these scatterplots, the structured interview ratings from the original
distribution (with a rho of .69) were rank-ordered and the appropriate number
were eliminated (e.g., the bottom 10% for 90% hiring). These data were used
rather than the subsequent data shown in Figure 8.2 where measurement error is
added to performance because this measurement error, in theory, has not occurred
yet and won’t occur until after restriction in interview ratings has taken place (and
the individuals have been on-the-job long enough to be evaluated).
From the data remaining after the appropriate elimination, the actual corre-
lation between interview ratings and performance was computed. Then, using
Formula 1 in reverse (with an IRR of .52), the value of the validity coefficient
with performance error induced was estimated. To portray this level of association
visually, the error-free performance ratings for each hiring level were copied and
sufficient measurement error was added to reduce the correlation with interview
ratings to the estimated value with performance error induced. Finally, a scatterplot
was created. The scatterplots for all three levels of hiring are shown in Figure 8.3.

FIGURE 8.3. Scatterplots illustrating the association between interview and job
performance ratings with 90%, 50%, and 10% hiring respectively (and measurement
error in performance ratings). The correlations are .44, .39, and .29 respectively.
The traditional way to correct for direct range restriction and performance
measurement error is to do the corrections simultaneously by combining Formulas 1
and 2 as shown in Formula 5 below (Callender & Osburn, 1980, p. 549; Hunter &
Schmidt, 1990, p. 48; Hunter & Schmidt, 2004, p. 215). This is essentially what
was done in Formula 4 in the correction for attrition. As before, the key parameter
is the range restriction ratio u, which here is the ratio of the restricted standard
deviation of interview ratings to the unrestricted one.

$$r_c = \frac{r_o}{\sqrt{r_{yy}}\sqrt{(1 - u^2)r_o^2 + u^2}} \qquad (5)$$

Hunter et al. (2006) presented a two-step alternative based on “the little known
fact that when range restriction is direct, accurate corrections for range restriction
require not only use of the appropriate correction formula…but also the correct
sequencing of corrections for measurement error and range restriction” (p. 596).
In their method, the observed validity coefficient is corrected first for measure-
ment error in performance ratings (since that occurs last) using the restricted IRR
value (i.e., .52; denoted as “YYR”). That is accomplished using Formula 1. Then,
the corrected coefficient is inserted into an accompanying restriction formula
(Step 2 in their Table 1; see p. 599). To simplify the process, the formulas for these
two steps are integrated into one, which is shown as Formula 6. Note that UX is
the inverse of the range restriction ratio (i.e., 1/ux).

$$r_c = \frac{U_X \left( r_o / \sqrt{r_{YY_R}} \right)}{\sqrt{1 + (U_X^2 - 1)\left( r_o / \sqrt{r_{YY_R}} \right)^2}} \qquad (6)$$

Now to the results. For 90% hiring (top panel in Figure 8.3), the standard de-
viation with the bottom 10% of interview ratings removed is 6.0, resulting in a
u value of .81 (i.e., 6.0/7.4) and a U value of 1.24 (i.e., 1/.81). The performance
IRR value, as always, is .52. The validity coefficient drops to .44. Inserting these
values into the above formula, as shown in Formula 7, returns the fully corrected
value of .69 (which is important to confirm given that the validity coefficient was
computed from the actual data after removal of the bottom 10%).

$$r_c = \frac{1.24 \left( .44 / \sqrt{.52} \right)}{\sqrt{1 + (1.24^2 - 1)\left( .44 / \sqrt{.52} \right)^2}} = \frac{.756}{1.096} = .69 \qquad (7)$$
If an interview researcher has a situation where a highly structured interview
is used in a top-down fashion in selection, there is minimal preselection prior to
the interview, a high majority of applicants are hired, and there is no (or minimal)
attrition by the time that performance ratings are collected, this equation can be
used to correct the observed validity coefficient. Isolating the relevant portion of
this equation to do the computations, one simply inserts the observed validity co-
efficient in the last part of the equation in Formula 8 to get a reasonable estimate
of the fully corrected value.

$$r_c = \frac{(1.24/.72)\, r_o}{\sqrt{1 + (1.24^2 - 1)(r_o^2 / .52)}} = \frac{1.72\, r_o}{\sqrt{1 + 1.03\, r_o^2}} \qquad (8)$$

For 50% hiring (middle panel in Figure 8.3), the standard deviation with the
bottom half of interview ratings removed is 5.0, resulting in a u value of .67 (i.e.,
5.0 / 7.4) and a U value of 1.49 (i.e., 1 / .67). The validity coefficient drops to .39.
Inserting these values into Formula 6, as shown in Formula 9 below, returns the
fully corrected value of .69.

$$r_c = \frac{1.49 \left( .39 / \sqrt{.52} \right)}{\sqrt{1 + (1.49^2 - 1)\left( .39 / \sqrt{.52} \right)^2}} = \frac{.804}{1.164} = .69 \qquad (9)$$

Extracting the relevant portion again, the correction equation becomes as


shown in Formula 10. Interview researchers can use this equation when a highly
structured interview is used in a top-down fashion in selection, there is minimal
preselection prior to the interview, the proportion of applicants hired is in the
ballpark of one-half, and there is no (or minimal) attrition by the time that perfor-
mance ratings are collected.

$$r_c = \frac{(1.49/.72)\, r_o}{\sqrt{1 + (1.49^2 - 1)(r_o^2 / .52)}} = \frac{2.07\, r_o}{\sqrt{1 + 2.35\, r_o^2}} \qquad (10)$$

Finally, for 10% hiring (bottom panel in Figure 8.3), the standard deviation
with the bottom 90% of interview ratings removed is 3.5, resulting in a u value
of .47 (i.e., 3.5 / 7.4) and a U value of 2.13 (i.e., 1 / .47). The validity coefficient
drops to .29. Inserting these values into Formula 6, as shown in Formula 11, re-
turns the fully corrected value of .69.

$$r_c = \frac{2.13 \left( .29 / \sqrt{.52} \right)}{\sqrt{1 + (2.13^2 - 1)\left( .29 / \sqrt{.52} \right)^2}} = \frac{.864}{1.257} = .69 \qquad (11)$$
Isolating the relevant portion once again, the result is shown in Formula 12.
Interview researchers can use this equation when a highly structured interview
is used in a top-down fashion in selection, there is minimal preselection prior to
the interview, only a small percentage of applicants are hired, and there is no (or
minimal) attrition by the time that performance ratings are collected.

$$r_c = \frac{(2.13/.72)\, r_o}{\sqrt{1 + (2.13^2 - 1)(r_o^2 / .52)}} = \frac{2.95\, r_o}{\sqrt{1 + 6.77\, r_o^2}} \qquad (12)$$
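For researchers whose hiring percentage does not fall neatly at 90, 50, or 10, the two-step logic behind Formulas 6 through 12 can be applied with whatever range restriction ratio their data yield. The sketch below (function name mine; the restricted performance IRR of .52 is again assumed) reproduces the three worked examples above.

import math

def two_step_direct_correction(r_obs, u, r_yy_restricted=0.52):
    """Two-step correction for direct restriction on the interview plus
    measurement error in performance ratings (Formula 6). Step 1 corrects
    for criterion unreliability using the restricted IRR; Step 2 corrects
    for direct restriction, with U defined as 1/u."""
    r_step1 = r_obs / math.sqrt(r_yy_restricted)
    big_u = 1.0 / u
    return (big_u * r_step1) / math.sqrt(1 + (big_u**2 - 1) * r_step1**2)

# The three worked examples above (90%, 50%, and 10% hiring).
for r_obs, u in [(0.44, 0.81), (0.39, 0.67), (0.29, 0.47)]:
    print(round(two_step_direct_correction(r_obs, u), 2))
# Each value lands close to the unrestricted .69; small rounding differences
# remain because the inputs themselves are rounded.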

A primary purpose of this manuscript is to highlight the effects of range restric-
tion in a straightforward and understandable manner in order to enhance under-
standing of the importance of correcting for it. The findings for 10% hiring, which
is a good benchmark in a number of selection situations, illustrate those effects
nicely. Comparison of the third panel in Figure 8.3 with Figure 8.2 shows the
considerable reduction in range after 90% of the candidates are eliminated. Earlier,
a potential danger of range restriction was noted, specifically in reference to a
company that wants to switch to a highly structured interview format but does not
feel that the resulting validity is high enough to justify doing so. An observed coef-
ficient around .29 may not seem that much better (or any better) than the original
selection process and might not be worth the administrative trouble of switching
to the new format. Conversely, a fully corrected coefficient of .69 sounds very
desirable and should generate a decision to switch as quickly as possible.
A key assumption of this scenario is that there is no (or minimal) attrition
by the time that performance ratings are collected. What if there is more than
minimal attrition? An important question then becomes the nature and/or status of
those who leave. It is common for people to leave for reasons other than perfor-
mance, including family, location, health, and retirement. If the performance of
those departing does not differ substantially from those remaining, the effects on
the correction process are probably negligible.
On the other hand, if the departures are largely performance-related (typically
top / bottom), then the effects would be more pronounced. In this case, the correc-
tions offered for this scenario could still be done, but they would be conservative
because there is performance-related restriction that would not be accounted for
in the process. Attrition is incorporated into the next scenario.
Before moving on, two psychometric phenomena emerge from this scenario
that are interesting scientifically. The first has to do with the progressive effects
of 90%, 50%, and 10% hiring on the range of scores. Comparing the top panel in
Figure 8.3 with Figure 8.2, some might be surprised at the degree to which the
range of interview ratings is reduced at 90% hiring (relative to no restriction), spe-
cifically from 40 to 29. On the surface, a 10% elimination seems pretty minimal
and shouldn’t result in such a noticeable drop. Univariate normal distributions
contain considerably more data points in the middle than at the ends, and so do bi-
variate. Because of the relatively small concentration of data points at the low end
of the distribution, it does not take much elimination of points from that region to
reduce the overall range noticeably.
Conversely, comparing the middle panel in Figure 8.3 with the top panel, the
change in the scatterplot from 10% to 50% hiring may not be as pronounced as
some might expect. Specifically, the range drops from 29 to only 20 even though
four times as many points were eliminated (compared to 90% hiring). This time,
the elimination occurred in the very dense scoring region leading up to the middle of
the distribution. Because of that density, the drop in range is much more modest,
in fact slightly less than the change from no restriction to 10% hiring. The drop in
range from 50% to 10% (second and third panels in Figure 8.3) is essentially the
same because it is the same region, just on the back side of the center.
There is a potentially important implication of this phenomenon, one that
should be explored further in future research. Given the low density in the high
end of the distribution (just like in the low end), one would expect the range (and
validity coefficient) to drop somewhat noticeably as the hiring ratio drops in rela-
tively small increments below 10%. This issue is particularly important for jobs
where a large number of individuals often apply (e.g., academic positions) and/or
when unemployment is high. In both cases, a very limited number of individuals
(sometimes only one) are hired.
The second phenomenon pertains to Hunter et al.’s (2006) two-step alternative
procedure, which continues to be “little known” (p. 596) in the general meta-
analytic community. Does it really lead to improved estimates over the traditional
Callender and Osburn (1980)-type simultaneous correction? As a supplemental
analysis, the computations were rerun for all three hiring percentages using the
simultaneous approach. The corrected validity coefficient was in fact overesti-
mated at all three hiring levels. Moreover, the degree of overestimation increased
progressively as the hiring percentage decreased. The overestimation was .03 at
90% hiring (i.e., .72 vs. .69), .05 at 50% hiring (i.e., .74 vs. .69), and .07 at 10%
hiring (i.e., 76 vs. .69). Clearly, the two-step procedure seems more accurate, par-
ticularly with lower hiring percentages.
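Readers who wish to verify this supplemental comparison can do so directly. The sketch below implements both approaches with the u values and observed coefficients reported above; because those inputs are rounded, the printed values may differ slightly from the figures reported in the text.

import math

def simultaneous_correction(r_obs, u, r_yy=0.52):
    """Callender and Osburn (1980)-type simultaneous correction for direct
    restriction and criterion unreliability (Formula 5)."""
    return r_obs / (math.sqrt(r_yy) * math.sqrt((1 - u**2) * r_obs**2 + u**2))

def two_step_correction(r_obs, u, r_yy=0.52):
    """Hunter et al. (2006) two-step alternative (Formula 6): criterion
    unreliability first, then direct range restriction."""
    r_1 = r_obs / math.sqrt(r_yy)
    big_u = 1.0 / u
    return (big_u * r_1) / math.sqrt(1 + (big_u**2 - 1) * r_1**2)

for label, r_obs, u in [("90% hiring", 0.44, 0.81),
                        ("50% hiring", 0.39, 0.67),
                        ("10% hiring", 0.29, 0.47)]:
    print(label,
          round(simultaneous_correction(r_obs, u), 2),
          round(two_step_correction(r_obs, u), 2))
# The simultaneous estimates come out higher than the two-step estimates at
# every hiring level, and the gap widens as the hiring percentage drops.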

Scenario 4: Hiring Based Solely on Interview Ratings—Direct


Restriction with Attrition
This scenario involves restriction both on the predictor side from use in se-
lection (here interview ratings) and on the performance side (from attrition). As
noted by Schmidt and Hunter (2015), no methods currently exist for dealing with
double restriction (see pp. 48–49). There is a method that could possibly be adapt-
ed, that of Alexander, Carson, Alliger, and Carr (1987), but their method is based
on removal only of the lower portion of both distributions (e.g., from use of cut-
off scores). That assumption is probably fine for the predictor, but attrition from
both top and bottom is probably more likely with performance.
The effects of double restriction are illustrated using the Scenario 3 data with
50% hiring. Unlike the other scenarios, a correction formula is not offered, as
again, one does not currently exist. That said, it is important for both practitioners
and researchers to understand fully the debilitating effects of double restriction,
especially because it is likely to be extremely common in practice.
Recalling the 50% case (the middle panel in Figure 8.3), the standard deviation
with the bottom half of interview ratings removed is 5.0, resulting in a u value of
5.0 / 7.4 or .67, and the validity coefficient drops to .39. Those data were sorted by
performance rating from highest to lowest, and then the top and bottom 5% were
removed. Given that the starting sample size is 50, that corresponded to removal
of the top five and bottom five sets of ratings and a final sample size of 40.
The resulting distribution is shown in Figure 8.4. Removal of the top and bot-
tom 5% causes the validity coefficient to drop from .39 to .06. The standard devia-
tion of interview ratings dropped only modestly, from 5.0 to 4.5. As expected, the
standard deviation of performance ratings dropped more noticeably, from 11.1 to
7.1, although by itself, such reduction does not appear sufficient to account for the
somewhat drastic drop in validity (at least not fully).
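The trimming procedure just described is easy to reproduce on one's own data. The sketch below operates on hypothetical ratings generated purely for illustration (the function and variable names are mine); the correlations it prints will, of course, depend entirely on the data supplied.

import numpy as np

def drop_top_and_bottom(interview, performance, n_each=5):
    """Mimic attrition from both ends: sort cases by performance, drop the
    n_each highest and n_each lowest, and return what remains."""
    order = np.argsort(performance)
    keep = order[n_each:len(performance) - n_each]
    return interview[keep], performance[keep]

# Hypothetical restricted data standing in for the 50 hired applicants.
rng = np.random.default_rng(1)
interview = rng.normal(50, 5, size=50)
performance = 0.4 * interview + rng.normal(0, 10, size=50)

x_kept, y_kept = drop_top_and_bottom(interview, performance)
print("r before attrition:", round(np.corrcoef(interview, performance)[0, 1], 2))
print("r after attrition: ", round(np.corrcoef(x_kept, y_kept)[0, 1], 2))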
FIGURE 8.4. Scatterplot between interview and job performance ratings with 50%
hiring and 10% attrition (5% from the bottom and top respectively). The correlation
is .06.

So why did the validity coefficient drop from a somewhat respectable .39 to
something not that far from zero? Schmidt and Hunter (2015) provide valuable in-
sight, namely that the regression line changes in complex ways when there is dou-
ble restriction, including no longer being linear and homoscedastic.4 Inspection of
Figure 8.4 suggests that the regression line, which retained essentially the same
pronounced slope throughout the previous scenarios, is now almost flat. Imagine
an upward sloping rectangle, and then slicing off the bottom left and top right
corners. Those two corners, a noticeable portion of which were removed because
of 10% attrition, were largely responsible for the distinct upward slope. And the
peculiar shape of this distribution appears to violate virtually every known re-
gression assumption about bivariate relationships (see Cohen & Cohen, 1983;
Osborne, 2016), including being heteroscedastic. Given all these considerations, it is
not surprising that no statistical formulas exist for correction of double restriction.
The implications of this illustration are of paramount importance for organi-
zations. Head-to-head, 50% hiring with 10% attrition came out far worse than
10% hiring with no attrition (i.e., .06 vs. .29). It would appear that attrition, even
at relatively low levels (e.g., 10%), has a powerful influence on validity when
direct restriction is already present (and presumably indirect as well). And the
assumption of 50% hiring with 10% attrition is probably conservative. There
most likely are many employment situations where hiring is less than 50%, which
should, in theory, make things even worse since the starting scatterplot and va-
lidity coefficient (before attrition effects) are already diminished and/or where
attrition is greater than 10%. Clearly, more research attention needs to be given to
developing ways to deal with double restriction.

Scenario 5: Validation Data is Collected from Incumbents—Indirect


Restriction, No Attrition
As noted above, most restriction in selection is now presumed to be indirect.
Indirect restriction can take various forms, including hiring based on another pre-
dictor before the interview is given (one that is correlated with interview ratings),
testing of incumbents, and selection based on a composite of the interview with
other predictors. Inspection of the frequency data presented by Huffcutt et al.
(2014a), specifically their Table 1, suggests that testing of incumbents is by far
the most frequent. Accordingly, that form of indirect restriction is the focus in this
scenario and the other forms are left for future research.
To provide a context, assume a company hears about the wonders of mod-
ern structured interview formats and decides to invest the resources to develop
one. To evaluate it, they administer their new interview to a sample of current
employees (i.e., incumbents) and then correlate those ratings with ratings of job
performance. In regard to the latter, they could utilize the most recent company
performance evaluations (i.e., administrative) or develop a new appraisal form
specifically for the study and collect performance ratings on-the-spot.
A major challenge with this scenario is that information regarding the original
selection process is rarely available. It is likely that the original selection mea-
sures correlate to some degree with the new structured interview, thus inducing
indirect range restriction. An assumption of the indirect correction procedure is
that the original process directly causes restriction only on the predictor and not
on performance ratings (Schmidt & Hunter, 2015; p. 47). As noted by Hunter et
al. (2006), this assumption is likely to be met to a close enough degree in selection
studies. If it is clear that this assumption does not hold, an alternative method has
been developed (see Le, Oh, Schmidt, & Wooldridge, 2016). Denoted as “Case
V” indirect correction, this method does not have the above assumption, but does
require the range restriction ratio for the second variable as well. If that variable
is job performance ratings, which is usually the case with selection, the range
restriction ratio for it is extremely difficult to obtain empirically (Le et al., p. 981).
Correction for Case IV indirect restriction is a five-step process, clearly mak-
ing it more involved than direct correction. Step 1 is to find / estimate the unre-
stricted reliability of the predictor in the applicant population (rXX_A). This, of
course, is not known for interviews. Accordingly, the equation for estimating it
is shown in Formula 13 (Schmidt & Hunter, 2015, p. 127), which involves the
restricted reliability value (rXX_R) and the range restriction ratio (uX).

$$r_{XX_A} = 1 - u_X^2 (1 - r_{XX_R}) \qquad (13)$$

Taking all three sources of measurement error into account (i.e., random re-
sponse, transient, and conspect), Huffcutt, Culbertson, and Weyhrauch (2013)
found a mean interrater reliability of .61 for highly structured interviews (see
Table 3, p. 271).5 In regard to the range restriction ratio, Hunter and Schmidt
(2004) recommend using a general value of .65 for all tests and all job families
when the actual value is unknown (see p. 184). Inserting these two values, the
equation becomes as shown in Formula 14. The pronounced difference between
the restricted and unrestricted IRR values highlights yet another important psy-
chometric principle, which is that reliability coefficients are influenced by range
restriction as well.

$$r_{XX_A} = 1 - .65^2 (1 - .61) = .84 \qquad (14)$$

Step 2 is to convert the actual range restriction ratio (uX) into its equivalent for
true scores, unaffected by measurement error (i.e., uT). That equation is shown as
Formula 15 (Schmidt & Hunter, 2015, p. 127), which involves the unrestricted
applicant IRR value for the interview and the actual range restriction ratio. As
indicated, the range restriction ratio for true scores is smaller than the actual one,
which helps explain why indirect restriction tends to have a more detrimental ef-
fect than direct.

$$u_T = \sqrt{\frac{u_X^2 - (1 - r_{XX_A})}{r_{XX_A}}} = \sqrt{\frac{.65^2 - (1 - .84)}{.84}} = .56 \qquad (15)$$
Step 3 is to correct the observed validity coefficient for measurement error
in both the predictor and criterion using restricted reliability values (Schmidt &
Hunter, 2015, p. 150). That computation is shown as Formula 16.

$$r_c = \frac{r_o}{\sqrt{r_{XX_R}}\sqrt{r_{YY_R}}} = \frac{r_o}{\sqrt{.61}\sqrt{.52}} = \frac{r_o}{.56} \qquad (16)$$

Step 4 is to make the actual correction for indirect restriction, the equation for
which is shown as Formula 17 (Schmidt & Hunter, 2015, p. 129). Note that this
formula uses UT, which is the inverse of uT (i.e., 1/.56=1.79). Also note that the
subscript “T” denotes true scores for the interview and “P” denotes true scores for
performance.

$$\rho_{TP} = \frac{U_T\, r_c}{\sqrt{(U_T^2 - 1)r_c^2 + 1}} = \frac{1.79\, r_c}{\sqrt{(1.79^2 - 1)r_c^2 + 1}} = \frac{1.79\, r_c}{\sqrt{2.20\, r_c^2 + 1}} \qquad (17)$$

Because a correction was made for interview reliability, the value of rho that
comes out of the above formula is actually the construct-level association between
interviews and performance. Thus, the final step, Step 5, is to translate it back to
operational validity by restoring measurement error in the interviews (Schmidt
& Hunter, 2015, p. 155). It is important to note that the IRR value used for inter-
views in this final step should be its unrestricted version and not the restricted one.
Using the value of .84 noted earlier, the computation becomes:

$$\rho_{XP} = \rho_{TP}\sqrt{r_{XX_A}} = \rho_{TP}\sqrt{.84} = .92\, \rho_{TP} \qquad (18)$$

Synthesizing Formulas 16–18 yields a single formula for correction of indirect


restriction with highly structured employment interviews, as shown in Formula
19.

$$\rho_{XP} = \frac{.92\, U_T (r_o / .56)}{\sqrt{(U_T^2 - 1)(r_o / .56)^2 + 1}} = \frac{(.92 \times 1.79 / .56)\, r_o}{\sqrt{\left[(1.79^2 - 1)/.31\right] r_o^2 + 1}} = \frac{2.94\, r_o}{\sqrt{7.11\, r_o^2 + 1}} \qquad (19)$$

To illustrate, assume that a researcher does a concurrent validation of a new


structured interview format and finds an observed validity coefficient of .27 (a
very typical value). Using the last portion of Formula 19, the unrestricted opera-
tional validity of this new interview is estimated to be .64 as shown in Formula
20. This value compares very favorably with Hunter et al.’s (2006) updated (via
indirect correction) value of .66 for the validity of General Mental Ability (GMA)
for medium complexity jobs (see p. 606).

$$\rho_{XP} = \frac{2.94\, r_o}{\sqrt{7.11\, r_o^2 + 1}} = \frac{2.94 \times .27}{\sqrt{7.11 \times .27^2 + 1}} = \frac{.79}{\sqrt{1.52}} = .64 \qquad (20)$$
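Pulling Steps 1 through 5 together, the entire Case IV correction can be expressed as a single routine. In the sketch below the function and argument names are mine; the defaults are the chapter's assumed estimates (restricted interview IRR of .61, restricted performance IRR of .52, and a range restriction ratio of .65), so the worked example of an observed .27 corrects to approximately .64.

import math

def case_iv_indirect_correction(r_obs, r_xx_restricted=0.61,
                                r_yy_restricted=0.52, u_x=0.65):
    """Five-step Case IV indirect-restriction correction (Formulas 13-18)."""
    # Step 1: estimate the unrestricted (applicant) reliability of the interview.
    r_xx_applicant = 1 - u_x**2 * (1 - r_xx_restricted)
    # Step 2: convert the range restriction ratio to its true-score equivalent.
    u_t = math.sqrt((u_x**2 - (1 - r_xx_applicant)) / r_xx_applicant)
    # Step 3: correct for measurement error in both measures (restricted values).
    r_c = r_obs / math.sqrt(r_xx_restricted * r_yy_restricted)
    # Step 4: correct for indirect restriction at the true-score level.
    big_u_t = 1.0 / u_t
    rho_tp = (big_u_t * r_c) / math.sqrt((big_u_t**2 - 1) * r_c**2 + 1)
    # Step 5: restore interview measurement error (unrestricted reliability)
    # to obtain operational validity.
    return rho_tp * math.sqrt(r_xx_applicant)

# The worked example in the text: an observed concurrent validity of .27.
print(round(case_iv_indirect_correction(0.27), 2))  # approximately .64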

Although this scenario is focused on incumbents, it should be reasonably accu-


rate for the other two indirect situations identified by Huffcutt et al. (2014a). The
first is where another predictor is used for actual hiring and then the interview is
administered but not used, and the second is when hiring is based on a composite
of the interview and another predictor (see their Table 1). Using Formula 19 is
going to result in a much more accurate estimate of validity than no correction
whatsoever. What this formula would not be appropriate for is correction of the
final two restriction patterns identified by Huffcutt et al., namely hiring based on
another predictor first and then on the interview (indirect then direct) and hiring
based on the interview first and then on another predictor (direct then indirect).

DISCUSSION
The primary purpose of this manuscript is to provide a convenient, all-in-one ref-
erence for interview researchers to help them deal with range restriction. Was that
purpose accomplished? The answer is a qualified “yes.” Selection researchers in
a wide range of contexts should find the simplified formulas useful, especially so
given that the most difficult parameters are already estimated and incorporated. If
all (or at least a high majority) of applicants are hired and retained, then Formula
1 provides an easy correction for performance measurement error. If the interview
under consideration is used to make selection decisions in a top-down fashion,
then researchers simply pick the hiring proportion that is closest to their own ratio
(i.e., 90%, 50%, or 10%) and insert their observed validity coefficient into the
corresponding formula (i.e., Formula 8, 10, or 12). If the interview is not used to
make selection decisions, then the observed correlation can be inserted into For-
mula 19 for indirect correction.
Where the qualification manifests itself is when there is attrition. Due to the
symmetrical effects of the predictor and the criterion on the validity coefficient, a
modest level of attrition by itself (involving both the top and bottom segments
of the performance distribution) can be corrected for using Formula 4. Unfortu-
nately, when attrition is combined with any form of restriction, the impact on the
validity coefficient is both devastating and uncorrectable. Developing methods to
deal with attrition combined with restriction appears to be one of the most over-
looked psychometric challenges in the entire selection realm.
One possible way to deal with attrition and restriction is a backwards graphi-
cal approach. A similar method is found when correcting a predictor variable for
nonlinearity in a correlation / regression analysis. There are references available
(e.g., Osborne, 2016) that show various nonlinear patterns graphically and the as-
sociated equation to transform the data to become reasonably linear. Figure 8.4,
for instance, portrays a very distinctive pattern that implies a specific combination
of hiring and attrition, and can be traced back to the original distribution with the
full population value of the validity correlation. It might be possible to simulate
other combinations of hiring and attrition and find different yet distinctive and
identifiable patterns.
In perspective, some might be uncomfortable with the idea of simply inserting
an observed validity coefficient into a formula that outputs a supposed estimate of
the population value. Such concerns are both noted and appreciated. The best re-
sponse at the present time is a reminder that the goal of a validity study is to assess
how well a particular selection measure does at predicting job performance across
the entire applicant pool, that is with all potential applicants and not just a subset
thereof. Results of the simulations in this manuscript, along with a sizable body
of meta-analytic research, suggest that observed (uncorrected) estimates often are
far too low. They simply do not reflect the true level of predictability of most
selection measures in the entire applicant pool. Conversely, corrected measures,
even though imperfect, are likely to be considerably closer to the true underlying
population value. The observed validity coefficient of .29 with 10% hiring illus-
trates nicely the dangers of no correction (when compared to the population value
of .69). There is a real possibility that an organization would decide not to expend
the time and resources needed to implement a new structured interview format for
this modest level of validity.
Several directions for future research emerge from this work, some already
noted. One of the most interesting psychometrically is the nonlinear effects of
hiring percentages on the range of predictor scores and resulting validity coef-
ficient. For instance, the range, standard deviation, and validity coefficient could
be computed for all values from 10% to 1% hiring. The regions from 90% to 50%
hiring (and 50% to 10%) could also be fleshed out.
Also worthy of research attention are restriction patterns that combine direct
in some fashion with indirect. To illustrate, indirect then direct involves initial
selection based on another predictor that is correlated with the interview and then,
subsequently, the interview is used to make further selection decisions. Two con-
siderations make this pattern particularly important. One is that it seems to occur
somewhat regularly (see Huffcutt et al., 2014a, Table 1) in the literature and prob-
ably even more often in practice.
The other consideration relates to an issue that has largely been ignored, which
is up-front selection based on an application blank and letters of reference. At
the present time, there is very little understanding of how much these two almost
universal selection tools are actually used to reduce the applicant pool and how
much they tend to correlate with structured interview ratings. If the correlation
tends to be negligible, then their use really isn’t an issue. If the correlation is not
negligible, however, then indirect restriction is induced, which complicates every
scenario presented in this manuscript.
Finally, there are related topics that are likely to be affected by range restric-
tion as well, and its influence in these areas should be explored. One such topic
pertains to advances in data science and “big data” (Tonidandel, King, & Cortina,
2018). There is considerable interest in machine learning and the use of artificial
intelligence in selection, yet very little understanding of how range restriction
affects them and the decisions made from them (Rosopa, Moore, & Klinefelter,
2019). Another potential topic is missing data, a common phenomenon in selec-
tion analyses since values for two variables (predictor and criterion) are required.
It is unclear how range restriction affects the calculation of correlation coefficients
and multiple regression estimates when there is missing data.
On a closing note, this manuscript is focused on primary researchers with the
hope of providing them with a convenient, all-in-one resource for dealing with
range restriction. That said, the findings are just as applicable (and potentially
useful) for meta-analysts as well. They need the same difficult parameters, which
hopefully are presented sufficiently through the various scenarios. The main dif-
ference is that they work with mean validity coefficients instead of individual
ones.

NOTES
1. Thorndike’s Case I correction is applicable to the relation between two
variables (X1 and X2) when the actual restriction is on X1 but restriction
information is available only for X2. He noted that this situation is un-
likely to be encountered very often in practice.
2. Several other selection reanalyses have been done by Schmidt and col-
leagues, which, for whatever reason, did not appear on this search. See
Oh et al. (2013, p. 301) for a summary.
3. The mean of their overall SI scores was actually above the midpoint of
the scale. The midpoint was chosen, however, in an attempt to keep the
distribution symmetrical. However, even with the mean at the midpoint,
there was a small skew (which is not surprising given that a sample size
of 100 is not overly large). There were also minor anomalies in subse-
quent distributions, such as with homoscedasticity. Liberty was taken
in adjusting some of the data points to correct these anomalies (here to
make the distribution highly symmetrical).
4. Homoscedasticity is the assumption that the variability of criterion
scores (e.g., range) is reasonably consistent across the entire spectrum of
predictor values. When violated, the distribution is said to be heterosce-
dastic, power is reduced, and Type I error rates are inflated (see Rosopa,
Schaffer, & Schroeder, 2013, for a comprehensive review).
5. While random response error and transient error reflect variations in in-
terviewee responses to essentially the same questions within the same
and across interviews, respectively, conspect error reflects disagree-
ments among interviewers in how they evaluate the same response infor-
mation. As noted by Schmidt and Zimmerman (2004), panel interviews
only control fully for conspect error since interviewers observe the same
random response errors and there is no second interview.

REFERENCES
Alexander, R. A., Carson, K. P., Alliger, G. M., & Carr, L. (1987). Correcting doubly trun-
cated correlations: An improved approximation for correcting the bivariate normal
correlation when truncation has occurred on both variables. Educational and Psy-
chological Measurement, 47, 309–315.
Arvey, R. D., Miller, H. E., Gould, R., & Burch, R. (1987). Interview validity for select-
ing sales clerks. Personnel Psychology, 40, 1–12. doi:10.1111/j.1744-6570.1987.
tb02373.x
Benz, M. P. (1974). Validation of the examination for Staff Nurse II. Urbana, IL: University
Civil Service Testing Program of Illinois, Testing Research Program.
Callender, J. C., & Osburn, H. G. (1980). Development and test of a new model for validity
generalization. Journal of Applied Psychology, 65, 543–558.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the
behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Conway, J. M., Jako, R. A., & Goodman, D. F. (1995). A meta-analysis of interrater and
internal consistency reliability of selection interviews. Journal of Applied Psychol-
ogy, 80, 565–579.
Huffcutt, A. I., Culbertson, S. S., & Weyhrauch, W. S. (2013). Employment interview reli-
ability: New meta-analytic estimates by structure and format. International Journal
of Selection and Assessment, 21, 264–276.
Huffcutt, A. I., Culbertson, S. S., & Weyhrauch, W. S. (2014a). Moving forward indirect-
ly: Reanalyzing the validity of employment interviews with indirect range restric-
tion methodology. International Journal of Selection and Assessment, 22, 297–309.
doi: 10.1111/ijsa.12078
Huffcutt, A. I., Culbertson, S. S., & Weyhrauch, W. S. (2014b). Multistage artifact correc-
tion: An illustration with structured employment interviews. Industrial and Organi-
zational Psychology: Perspectives on Science and Practice, 7, 552–557.
Huffcutt, A., Roth, P., & McDaniel, M. (1996). A meta-analytic investigation of cogni-
tive ability in employment interview evaluations: Moderating characteristics and
implications for incremental validity. Journal of Applied Psychology, 81, 459–473.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and
bias in research findings. Newbury Park, CA: Sage.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and
bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research
findings across studies. Beverly Hills, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Lee, H. (2006). Implications of direct and indirect range
restriction for meta-analysis methods and findings. Journal of Applied Psychology,
91, 594–612. doi: 10.1037/0021-9010.91.3.594
Janz, T. (1982). Initial comparisons of patterned behavior description interviews versus
unstructured interviews. Journal of Applied Psychology, 67, 577–580.
Latham, G. P., Saari, L. M., Pursell, E. D., & Campion, M. A. (1980). The situational inter-
view. Journal of Applied Psychology, 65, 422–427. doi: 10.1037/0021-9010.65.4.422
Le, H., & Schmidt, F. L. (2006). Correcting for indirect range restriction in meta-analysis:
Testing a new meta-analytic procedure. Psychological Methods, 11, 416–438.
Le, H., Oh, I.-S., Schmidt, F. L., & Wooldridge, C. D. (2016). Correction for range restric-
tion in meta-analysis revisited: Improvements and implications for organizational
research. Personnel Psychology, 69, 975–1008.
McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of
employment interviews: A comprehensive review and meta-analysis. Journal of Ap-
plied Psychology, 79, 599–616.
McMurry, R. N. (1947). Validating the patterned interview. Personnel, 23, 263–272.
Nunnally, J. C. (1978). Psychometric theory. New York, NY: McGraw-Hill.
Oh, I.-S., Postlethwaite, B. E., & Schmidt, F. L. (2013). Rethinking the validity of in-
terviews for employment decision making: Implications of recent developments in
meta-analysis. In D. J. Svyantek & K. T. Mahoney (Eds.), Received wisdom, kernels
of truth, and boundary conditions in organizational studies (pp. 297–329). Char-
lotte, NC: IAP Information Age Publishing.
Osborne, J. W. (2016). Regression & linear modeling: Best practices and modern methods.
Thousand Oaks, CA: Sage.
Pearson, K. (1903). Mathematical contributions to the theory of evolution—XI. On the in-
fluence of natural selection on the variability and correlation of organs. Philosophi-
cal Transactions, 321, 1–66.
Robertson, I. T., Gratton, L., & Rout, U. (1990). The validity of situational interviews for
administrative jobs. Journal of Organizational Behavior, 11, 69–76.
Rosopa, P. J., Moore, A., & Klinefelter, Z. (2019, April 5). Employee selection: Don’t let
the machines take over. Poster presented at the meeting of the Society for Industrial
and Organizational Psychology, National Harbor, MD.
Rosopa, P. J., Schaffer, M., & Schroeder, A. N. (2013). Managing heteroscedasticity in
general linear models. Psychological Methods, 18, 335–351.
Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymp-
tote level with increasing opportunity to observe. Journal of Applied Psychology,
75, 322–327.
Sackett, P. R., Laczo, R. M., & Arvey, R. D. (2002). The effects of range restriction on
estimates of criterion interrater reliability: Implications for validation research. Per-
sonnel Psychology, 55, 807–825. doi: 10.1111/j.1744-6570.2002.tb00130.x
Schmidt, F. L., & Hunter, J. E. (2015). Methods of meta-analysis: Correcting error and
bias in research findings (3rd ed.). Thousand Oaks, CA: Sage.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Rothstein-Hirsh, H. (1985). Forty ques-
tions about validity generalization and meta-analysis. Personnel Psychology, 38,
697–798.
Schmidt, F. L., & Le, H. (2014). Software for the Hunter-Schmidt meta-analysis methods
(Version 2). Iowa City, IA: University of Iowa, Department of Management & Or-
ganizations.
Schmidt, F. L., & Zimmerman, R. D. (2004). A counterintuitive hypothesis about employ-
ment interview validity and some supporting evidence. Journal of Applied Psychol-
ogy, 89, 553–581. doi: 10.1037/0021-9010.89.3.553
Schmitt, N. (2007). The value of personnel selection: Reflections on some remarkable
claims. The Academy of Management Perspectives, 21, 19–23.
Thorndike, R. L. (1949). Personnel selection. New York, NY: Wiley.
Tonidandel, S., King, E. B., & Cortina, J. M. (2018). Big data methods: Leveraging modern
data analytic techniques to build organizational science. Organizational Research
Methods, 21, 525–547.
Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reli-
ability of job performance ratings. Journal of Applied Psychology, 81, 557–574. doi:
10.1037/0021-9010.81.5.557
Weekley, J. A., & Gier, J. A. (1987). Reliability and validity of the situational interview for
a sales position. Journal of Applied Psychology, 72, 484–487.
CHAPTER 9

WE’VE GOT (SAFETY) ISSUES


Current Methods and Potential Future Directions
in Safety Climate Research

Lois E. Tetrick, Robert R. Sinclair, Gargi Sawhney, and Tiancheng (Allen) Chen

Safety has been the focus of much research over the past four decades, given the
social and economic costs of unsafe work. For instance, the International Labor
Organization (2009) estimated that approximately 2.3 million workers die each
year due to occupational injuries and illnesses, and additionally, millions incur
non-fatal injuries and illnesses. More recently, the Liberty Mutual Research Insti-
tute for Safety (2016) estimated that US companies spend $62 billion in worker
compensation claims alone.
In the Human Resource Management and related literatures (e.g., Industrial
Psychology), safety climate has been perhaps the most heavily studied aspect of
workplace safety (Casey, Griffin, Flatau Harrison, & Neal, 2017; Hofmann, Burke,
& Zohar, 2017). Several meta-analyses have established that safety climate is an
important contextual antecedent of safety behavior and corresponding outcomes
(e.g., Christian, Bradley, Wallace, & Burke, 2009; Clarke, 2010, 2013; Nahrgang,
Morgeson, & Hofmann, 2011). However, the research included in these meta-analy-
ses varies considerably in several methodological and conceptual qualities that may
affect the inferences drawn from safety climate studies.
Understanding how these issues potentially influence safety climate research
should improve researchers’ ability to conduct high quality safety climate re-
search. Better quality research can help advance theoretical perspectives on safety
climate and inform evidence-based safety climate interventions. Therefore, the
purpose of this review is to examine the research methods used in recent research
on safety climate. Our review includes conceptual and measurement challenges in
defining the safety climate construct, cross-level implications of safety climate at
the individual, group/team, and organizational levels, research designs that limit
causal inferences, and implications for external validity.

LITERATURE SEARCH PROCESS


To identify safety climate articles for this review, we searched the keyword of
safety climate in the following databases: PsycINFO, Psychology and Behavioral
Sciences Collection, PsycARTICLES, Business Source Complete, ERIC, and
Health Source: Nursing/Academic Edition. We confined the keyword search to
the abstract of articles, limited the results to peer-reviewed journal articles, and
set a five-year time frame from 2013 to October 2018 (when the search was con-
ducted).
Our search resulted in a total of 1,002 articles. Figure 9.1 summarizes our
literature search process. Two authors conducted the first round of review by read-
ing the abstracts and skimming the body of text to exclude articles that were not
written in English or those that did not assess safety climate (e.g., articles on dif-
ferent types of climate, articles discussing safety issues but not climate). This first
round of review yielded 284 relevant articles. In the second round of review, all
four authors coded a set of 50 articles on the construct definition, methodology,
analytics, results, study samples, and industries. The purpose of having all four
authors code this set of 50 articles was to refine and reach consensus on the coding
process. After consensus was reached, the rest of the articles were distributed to
each author to code individually, excluding the review papers. During this second
round of review, we excluded 23 articles because they mentioned safety climate
but did not directly study it. Therefore, the final number of articles included was
261, which consisted of 230 empirical quantitative studies, 7 empirical qualitative
studies, 6 conceptual papers, and 18 review papers.1

FIGURE 9.1. Literature search process.

CONCEPTUALIZATION AND
MEASUREMENT OF SAFETY CLIMATE

Solid conceptual and operational definitions form the foundation of any research
enterprise. As Shadish, Cook, and Campbell (2002, p. 21) discussed: "the first
problem of causal generalization is always the same: How can we generalize from
a sample of instances and the data patterns associated with them to the particular
target constructs they represent?” Similarly, the AERA, APA, and NCME (1985)
standards for educational and psychological testing have long emphasized the
central and critical nature of construct validity in psychological measurement, a
view that has evolved into the perspective that all inferences about validity ulti-
mately are inferences about constructs. Although the unitary view of validity as
construct validity is not without critics (e.g., Kane, 2012; Lissitz & Samuelsen,
2007), the importance of understanding constructs is generally acknowledged as
central to the research enterprise. In the specific case of safety climate, solid con-
ceptual understanding of the definition of safety climate is both a foundational
issue in the literature and an often-overlooked stage of the research process.
As we show below, the safety climate literature is plagued by problems as-
sociated with what Block (1995) called the jingle and jangle fallacies. Block de-
scribed the jingle fallacy as a problem created when researchers use similar terms
to describe different ideas and the jangle fallacy as when researchers use different
terms to describe similar ideas. Both of these problems are pervasive in safety
climate research, as are three additional concerns. First, safety climate studies
often use inconsistent conceptual and operational definitions, providing concep-
tual definitions of safety climate that do not match their measurement process.
Second, safety climate research often offers vague conceptual definitions of safety
climate that do not suggest any particular measurement process. Third, there is
a great deal of inconsistency in the conceptual scope of safety climate – espe-
cially related to whether/how researchers define the dimensions of safety climate.
Each of these definitional issues creates conceptual ambiguity that leads to chal-
lenges in integrating findings across the body of safety climate literature as well
as operational challenges in measuring safety climate. Our review focuses on six
definitional and operational issues in understanding safety climate: the general
definition of climate, distinguishing climate from related terms, the importance
of aggregation, the climate-culture distinction, industry-specific versus universal/
generic climate measurement, and the dimensionality of climate.

Safety Climate as a Strategic Organizational Climate


Ambiguity about the definition of safety climate is perhaps the fundamental
problem running through the relevant literature. There is considerable variability
in terms of how researchers conceptualize safety climate along with correspond-
ing inconsistencies for safety climate measurement. In a review of the general
climate literature, Ostroff, Kinicki and Muhammad (2013) discussed the histori-
cal origins of the concept and placed it in a nomological network explaining how
climate differs from the related term organizational culture. They also provided a
heuristic model linking climate to individual outcomes such as job attitudes and
performance behaviors as well as organizational outcomes such as shared atti-
tudes and organizational effectiveness and efficiency.
Drawing from prior work by James and Jones (1974) and Schneider (2000),
Ostroff et al. (2013, p. 644) characterized climate as an “experientially based de-
scription of what people see and report happening to them in an organizational
situation.” Although Ostroff et al. (2013) discussed other climate constructs such
as generic and molar climate, safety climate research usually treats safety climate
as a specific example of what are referred to as strategic organizational climate
constructs. Starting with the work of Schneider (1975), strategic organizational
climate research focused on the idea of climate as having a particular referent,
reflective of an organization’s goals and priorities (see also Schneider, 1990). Os-
troff et al. (2013) cited literature on a wide range of these referents, including
safety, service, sexual harassment, diversity, innovation, justice, citizenship be-
havior, ethics, empowerment, voice, and excellence.
Zohar (1980) is widely credited as the first researcher to describe safety cli-
mate as one of these strategic climates; he noted that “when the strategic focus
involves performance of high-risk operations, the resultant shared perceptions
define safety climate” (Zohar, 2010, p. 2009). Interestingly, Zohar’s (2010) re-
view appeared nearly a decade ago. At that time, he characterized the literature
as mostly focusing on climate measurement issues such as its factor structure and
predictive validity with a corresponding need for greater attention to theoretical
issues. Since then, multiple meta-analytic and narrative reviews have accumu-
lated evidence supporting the predictive validity of climate perceptions, demon-
strating the efficacy of climate-related interventions, and clarifying the theoretical
pathways linking safety climate to safety-related outcomes (Beus, Payne, Berg-
man, & Arthur, 2010; Christian et al., 2009; Clarke, 2010; Clarke, 2013; Hofmann
et al., 2017; Lee, Huang, Cheung, Chen, & Shaw, 2018; Leitão & Greiner, 2016;
Nahrgang et al., 2011). Despite this progress, definitional ambiguities remain a
problem in the literature with fundamental measurement issues about the nature
of safety climate remaining unresolved.
One fundamental definitional issue concerns the extent to which Zohar’s defi-
nition of safety climate is accepted in the literature. To address this question, we
coded studies according to how they defined safety climate based on the citation
used. A total of 86 studies (36.3%) cited Zohar, 25 studies (10.6%) cited Neal and
Griffin in some combination, and 11 studies offered a definition without a cita-
tion (4.6%). It is important to note that whereas Griffin and Neal’s earlier work
emphasized the individual level (e.g., Griffin & Neal, 2000; Neal, Griffin, & Hart,
2000), their later work emphasized both the individual and group level in a similar
fashion to Zohar (e.g., Casey, et al., 2017; Neal & Griffin, 2006). Interestingly,
110 studies (46.4%) offered some other citation and 42 studies (17.7%) did not
clearly define safety climate.
Table 9.1 presents illustrative examples of the range of these definitions. As
should be evident from the table, there are a wide range of approaches that vary in
how precisely they define safety climate. Some key definitional issues include (1)
whether safety climate is conceptualized as a group, individual, or multilevel con-
struct and thus involves shared perceptions; (2) the temporal stability of climate
perceptions; and (3) whether safety climate narrowly refers to perceptions
about the relative priority of safety or whether safety climate also encompasses
perceptions about a variety of management practices that may inform perceptions
about the relative priority of safety.
Not shown in the table are examples from the literature of the many studies
that do not offer an explicit definition, appearing to take for granted that there is a
shared understanding of the meaning of safety climate, beyond something about
the idea that safety is important (for example, Arcury, Grzywacz, Chen, Mora,
& Quandt, 2014; Cox et al., 2017). Given that conceptual definitions should in-
form researchers’ methodological choices of what to measure, we see the lack of
definitional precision in the safety climate literature as troubling.

TABLE 9.1. Illustrative Examples of Various Safety Climate Definitions


Clarke, S. (2013) Climate perceptions represent the individual’s cognitive interpretations
of the organizational context, bridging the effects of this wider context
on individual attitudes and behaviour. In relation to safety, Zohar (1980)
argued for the existence of a facet-specific climate for safety, which
represents employees’ perceptions of the relative priority of safety
in relation to other organizational goals. In subsequent work, safety
climate has been operationalised as a group-level construct (Zohar,
2000), and so researchers have aggregated climate perceptions to
represent the shared perceptions at this level. Safety climate can also
be considered as an individual-level construct, where perceived safety
climate represents ‘individual perceptions of policies, procedures and
practices relating to safety in the workplace’ (p. 27)
Bennett et al. (2014) Safety climate differs from safety culture in that it is ‘the temporal state
measure of safety culture, subject to commonalities among individual
perceptions of the organization. It is therefore situationally based, refers
to the perceived state of safety at a particular place at a particular time,
is relatively unstable, and subject to change depending on the features
of the current environment or prevailing conditions’ (p. 27)
Bergheim et al. (2013) Organizational climate is conceived as an empirically measurable
component of culture and is linked to a number of important
organizational outcomes (Zohar, 2002). In safety critical organizations
such as air traffic control, the concept of safety climate is more often
used than the more general organizational climate, to emphasize the
importance of ensuring a focus on safety issues (Cox & Flin, 1998). ... safety climate to describe how air traffic controllers perceive both management and group commitment to safety in their everyday work (Wiegmann, Zhang, & von Thaden, 2001). Thus, safety climate is often
referred to as a state-like construct, providing a snapshot of selected
aspects of an organization’s safety culture at a particular point in time
(Mearns, Whitaker, & Flin, 2003). (p. 232)

Future research should attend much more closely to definitional issues and strive toward consensus on the fundamental meaning of safety climate, particularly toward greater use of the original Zohar definition.

Distinguishing Safety Climate from Related Terms


The safety climate literature faces the challenge that a variety of related terms refer to similar but conceptually distinct constructs. Although distinguishing these terms may be more of a conceptual than a methodological issue, the distinctions are important for developing clarity about what is measured in a study, particularly given that some researchers use climate-related terms in potentially confusing ways. It is especially important to discuss distinctions
between safety climate, psychosocial safety climate (PSC), and psychological
safety.

TABLE 9.1. Continued


Colley et al. (2013) “Safety climate” refers to perceptions of organizational policies,
procedures and practices relating to safety (p. 69)
Bell et al. (2016) Safety climate refers to the components of safety culture [7] that can be
measured. Safety culture, in turn, determines how safety is managed by
a team or organization. (p. 71)
Golubovich et al. (2014) Safety climate refers to employees’ perceptions of safety policies,
procedures, and practices within their unit or organization (Zohar and
Luria, 2005). (p. 759)
Curcuruto et al. (2018) Shared perceptions with regard to safety policies, procedures and
practices. (p. 184)
Rodrigues et al. (2015) Zohar emphasised safety climate in the 1980s (Zohar 1980), and it
has been defined as a descriptive measure that ‘can be regarded as the
surface features of the safety culture discerned from the workforce’s
attitudes and perceptions at a given point in time’ (p. 412)
Hicks et al. (2016) Safety climate has been conceptualized in this paper as consisting of
management’s commitment to safety, safety communication, safety
standards and goals, environmental risk, safety systems, and safety
knowledge and training. (p. 20)
Hinde et al. (2016) “Safety culture” is described as “The product of individual and group
values, attitudes, perceptions, competencies, and patterns of behaviour”
(Health and Safety Commission, 1993...Safety and teamwork climates
(the feelings and attitudes of everyone in a work unit) are two
components of safety culture that are readily measured and amenable to
improvement by focused interventions. (p. 251)
Huang et al. (2016) Safety climate, the degree to which employees perceive that safety is
prioritized in their company (Zohar, 2010) (p. 248)

Dollard and Bakker (2010, p. 580) described PSC as the extent to which the or-
ganization has “policies, practices, and procedures aimed to protect the health and
psychological safety of workers.” They elaborated (and empirically demonstrat-
ed) that PSC perceptions could be shared within an organizational unit (schools in
their study) and, similar to definitions of safety climate, they characterized PSC as
focused on perceptions about management policies, practices, and procedures that
reflected the relative priority of employees’ psychosocial health. Thus, PSC ex-
pands the health focus of safety climate to include psychosocial stressors and out-
comes in addition to the physical safety/injury prevention focus of safety climate.
Numerous studies show that PSC is related to psychosocial outcomes (e.g.,
Lawrie, Tuckey, & Dollard, 2018; Mansour & Tremblay, 2018, 2019). Some re-
search has linked PSC to safety related outcomes such as injuries and muscu-
loskeletal disorders (Hall, Dollard, & Coward, 2010; Idris, Dollard, Coward, &
Dormann, 2012). An understudied issue in this literature concerns the empirical
distinctiveness of safety climate and PSC. A few studies have examined measures
of both safety and psychosocial safety in the same study (e.g., Hall et al., 2010;
Bronkhorst, 2015; Bronkhorst & Vermeeren, 2016), but these studies often use safety climate measures that are simply PSC measures adapted to safety. They have found that although PSC and safety climate measures are often highly correlated (e.g., r > .69 in Bronkhorst, 2015; Bronkhorst & Vermeeren, 2016; and Study 1 of Idris et al., 2012), they are structurally distinct, with different patterns of correlates (Idris et al., 2012). However, more research is clearly required to determine the extent
to which PSC and safety climate measures are distinct and the possible boundary
conditions that might affect the degree to which they are related to each other and/
or to various safety and health-related outcomes.
Edmondson (1999, p. 354) defined psychological safety as “a shared belief
held by a work team that the team is safe for interpersonal risk taking.” Guer-
rero, Lapalme, and Séguin (2015) used the term “participative safety” to describe
essentially the same idea, but the term psychological safety is much more com-
monly used and Edmondson’s approach appears to represent consensus in the
now fairly extensive literature on the antecedents and outcomes of psychological
safety (Newman, Donohue, & Eva, 2017). Some of the qualities of a psychologi-
cally safe work environment include mutual respect among coworkers, the ability
to engage in constructive conflict, and comfort in expressing opinions and taking
interpersonal risks (Newman, et al., 2017). Thus, whereas safety climate focuses
on perceptions about the organization’s relative priority for employees’ physical
safety and PSC focuses on relative priorities for psychosocial health, psychologi-
cal safety refers to employees’ general comfort in the interpersonal aspects of the
workplace.
Another definitional issue in safety climate research is the distinction between
psychological safety and individual-level perceptions of safety climate. Some re-
searchers use the term psychological safety climate to refer to individual level
perceptions about safety climate issues (cf. Clark, Zickar, & Jex, 2014; Nixon et al., 2015). These authors appear to have had good intentions: to clearly label individual level safety climate perceptions with a term that highlights the individual nature of the construct (i.e., drawing on psychological climate literature such as James & James, 1989; James et al., 2008). However, other researchers have studied psychological safety as a component of safety culture (Vogus, Cull, Hengelbrok, Modell, & Epstein, 2016) or as an antecedent of safety outcomes (e.g., Chen, McCabe, & Hyatt, 2018; Halbesleben et al., 2013). In our view, it is theoretically appropriate to treat psychological safety as an antecedent of psychological climate, but researchers need to be wary of exactly how studies are using
these various terms.
The terminological confusion between terms such as safety climate, psycho-
logical safety, psychological safety climate, and PSC represents a potential barrier
to accumulating knowledge about and drawing clear distinctions between these
constructs. At the very least, researchers are urged to use caution when citing
studies to ensure that they do in fact capture the construct of interest. However,
further empirical research is needed to distinguish these terms.

Do We All Have to Agree? The Importance (or Not) of Aggregation


Zohar’s (1980) original depiction of climate explicitly included the idea of
shared perceptions: “shared employee perceptions about the relative importance
of safe conduct in their occupational behavior” (p. 96). As Hofmann, Burke, and
Zohar (2017, p. 329) elaborate:

Key terms in this definition emphasize that it is a shared, agreed upon cognition
regarding the relative importance or priority of acting safely versus meeting other
competing demands such as productivity or cost cutting. These safety climate percep-
tions emerge through ongoing social interaction in which employees share personal
experiences informing the extent to which management cares and invests in their
protection (as opposed to cost cutting or productivity).

In our anecdotal experience (the first two authors have been editors and associ-
ate editors of multiple journals), the issue of whether climate measures need to be
shared raises considerable consternation among researchers, particularly during
the peer review process. We have seen some reviewers assert that if a study does not include shared perceptions, it is not a study of climate; other authors acknowledge that safety climate is a shared construct but continue to study it at the individual level; and still others do not discuss its multilevel nature. Irrespec-
tive of how researchers conceptually define safety climate, very few studies as-
sess it at the group level. In fact, out of the 230 empirical, quantitative studies we
reviewed, only 67 studies (29.1%) aggregated individual level data to test climate
effects at unit or higher levels. Of these 67, 42 studies (62.7%) reported statistical
evidence for the appropriateness of aggregation including ICC(1) only (N = 12,
17.9%), rwg only (N = 1, 1.5%), ICC(1) and ICC(2) (N = 4, 6.0%), ICC(1) and rwg
(N = 6, 9.0%), and all three measures (N = 19, 28.4%). These data suggest that
more multilevel studies are needed with improved reporting of statistical justifica-
tion for aggregation.
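To make that reporting bar concrete, the sketch below shows one way to compute the three statistics just mentioned, ICC(1), ICC(2), and rwg(j), from individual-level climate ratings nested in teams. It is a minimal Python illustration; the column names, the five-point response format, and the uniform null distribution for rwg(j) are our assumptions, not features of any particular published measure.

```python
# Minimal sketch (illustrative assumptions: a "team" grouping column, a set of
# item columns, a 5-point response scale, and a uniform null for rwg(j)).
import numpy as np
import pandas as pd

def aggregation_stats(df, group_col, item_cols, n_options=5):
    """ICC(1), ICC(2), and mean rwg(j) for a multi-item climate scale."""
    scale = df[item_cols].mean(axis=1)          # composite climate score per respondent
    groups = df[group_col]
    sizes = scale.groupby(groups).size()
    means = scale.groupby(groups).mean()
    G, N = len(sizes), len(scale)

    # One-way random-effects ANOVA mean squares
    ms_between = (sizes * (means - scale.mean()) ** 2).sum() / (G - 1)
    ms_within = ((scale - groups.map(means)) ** 2).sum() / (N - G)

    # Adjusted average group size for unequal group sizes
    k = (N - (sizes ** 2).sum() / N) / (G - 1)
    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc2 = (ms_between - ms_within) / ms_between

    # rwg(j): within-group agreement relative to a uniform (rectangular) null
    sigma_eu = (n_options ** 2 - 1) / 12.0
    J = len(item_cols)
    rwg_values = []
    for _, g in df.groupby(group_col):
        ratio = min(g[item_cols].var(ddof=1).mean() / sigma_eu, 1.0)
        rwg_values.append(J * (1 - ratio) / (J * (1 - ratio) + ratio))
    return icc1, icc2, float(np.mean(rwg_values))

# Example usage with a hypothetical data file:
# df = pd.read_csv("climate_survey.csv")
# print(aggregation_stats(df, "team", ["item1", "item2", "item3", "item4"]))
```

Reporting all three values (rather than ICC(1) alone) gives reviewers the information needed to judge whether aggregation to the unit level is defensible.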
As we reviewed this literature, we were especially struck by the number of
articles that offered a group level conceptual definition of safety climate but stud-
ied safety climate at the individual level, often without explicit rationale for the
discrepancy (for example, Hoffmann et al., 2013; McGuire et al., 2017; Schwatka
& Rosencrance, 2016). Other studies have defined safety climate at the group
level but offered a justification for studying it as an individual construct. For ex-
ample, multiple studies by Huang and colleagues (examples include Huang et al.,
2013, 2018; Huang, Lee, McFadden, Rineer, & Robertson, 2017) have argued
that conceptualizing climate as a shared perception is less meaningful for employees who work by themselves, such as long-haul truck drivers.
On one hand, the lack of attention to safety climate as a shared perception rep-
resents a potentially serious problem in the literature, as there appears to be a wide
disparity between how Zohar (1980) initially conceptualized safety climate and
how many researchers appear to be operationalizing it in practice. One might go
so far as to argue that given that comparatively little research has been performed
on safety climate as a group level construct, relatively little is known about it. On
the other hand, both the general climate literature (e.g., Ostroff et al., 2013) and
the safety climate literature (e.g., Clarke, 2013) explicitly acknowledge the con-
ceptual relevance of individual safety climate perceptions to the study of climate.
A common practice is to distinguish between organizational climate (a group level
construct) and psychological climate (an individual level construct; cf. James &
James, 1989; James et al., 2008). When designing a study, researchers should con-
sider Ostroff et al.’s (2013) discussion of this distinction as their model proposed
that psychological climate is more directly relevant to individual level outcomes
while organizational climate is more directly related to group level outcomes.
The individual-organizational level distinctions highlight the need to avoid at-
omistic and ecological fallacies (cf. Hannan, 1971) in safety climate research.
Atomistic fallacies occur when results obtained at the individual level are errone-
ously generalized to the group level. On the other hand, ecological fallacies occur
when group level results are used to draw conclusions at the individual level.
Also, it is important to acknowledge that researchers often focus on the individual
level because of practical constraints such as the lack of a work group/unit identi-
fier that can be used as the basis of aggregation, the lack of a sufficient number
of subunits to study climate, or a lack of proficiency in the multilevel methods
needed to study climate across organizational levels. Given these issues, it may
be appropriate for safety researchers interested in individual behavioral and atti-
tudinal phenomena to focus on psychological climate perceptions as they relate to
safety, although they should test for and/or attempt to rule out group level effects
when possible.
When researchers focus on individual level safety climate measurement, it is
important to ensure that their theoretical rationale fits with the individual level
formulation of climate. One of the potential areas of confusion in this literature
concerns the use of the term level. Although climate researchers distinguish be-
tween individual and organizational safety climate measures based on the level of
analysis/measurement, safety climate researchers also use the term level to refer
to particular climate stakeholders. For example, drawing from Zohar (2000, 2008,
2010), Huang et al. (2013) described group and organizational level climate as
two distinct perceptions employees form about safety. In Huang et al.’s approach,
the group level refers to one’s immediate work unit, with measures typically fo-
cused on employees’ perceptions of safety as a relative priority of one’s immedi-
ate supervisor. The organizational level refers to employees’ perceptions of the
global organization’s (or top management’s) relative priority for safety. But, both
group and organizational-level safety climate in Huang et al.’s model are usually
measured with individual level perceptual measures.

Safety Climate versus Safety Culture


Although nearly four decades have passed since Zohar’s initial conception of
safety climate (Zohar, 1980), many researchers still blur the distinction between
safety culture and safety climate or use what are essentially safety climate mea-
sures to study safety culture. For instance, Kagan and Barnoy (2013) referred to
safety culture as “workers’ understanding of the hazards in their workplace, and
the norms and roles governing safe working [conditions]” (p. 273). Similarly,
Pan, Huang, Lin, and Chen (2018) posited that safety culture can be characterized as including employee safety cognitions and behaviors, the safety management system, the safety environment, and individuals’ stress recognition and competence. He et
al. (2016, p. 230) noted that safety climate “can reflect the current state of the
underlying safety culture.” Hinde, Gale, Anderson, Roberts, and Sice (2016, p.
251) characterized safety climate as aspects “of safety culture that are readily
measured and amenable to improvement by focused interventions.” Hartman,
Meterko, Zhao, Palmer, and Berlowitz (2013) also described safety climate as
modifiable aspects of the work environment. Referring to health care settings,
Hong and Li (2017) measured safety climate as a dimension of patient safety cul-
ture where other dimensions included teamwork climate, perception of manage-
ment, job satisfaction, and work stress. Still others described safety climate as the
measurable aspects of safety culture (e.g., Bell, Reeves, Marsden, & Avery, 2016;
Martowirono, Wagner, & Bijnen, 2014), which is problematic in that it implies
that other aspects of safety culture, such as artifacts or assumptions, cannot be
measured. Finally, blurring the lines even further, Milijić, Mihajlović, Nikolić, and Živković (2014, p. 510) indicated that:
Safety climate is viewed as an individual attribute, which consists of two factors: “management’s commitment to safety and workers’ involvement in safety” (Dedobbeleer & Béland, 1991). On the other hand, safety culture refers to the term used to describe a way in which safety is managed at the workplace, and often reflects “the attitudes, beliefs, perceptions and values that employees share in relation to safety” (Cox & Cox, 1991).

Researchers also sometimes describe safety climate as a “snapshot” of the or-
ganization’s safety culture. According to Bergheim et al. (2013, p. 232) “safety
climate is often referred to as a state-like construct, providing a snapshot of select-
ed aspects of an organization’s safety culture at a particular point in time (Mearns,
Whitaker, & Flin, 2003).” Similarly, Bennett et al. (2014) noted that, as compared to safety culture, safety climate is more contingent on the work environment and susceptible to change. Given the relative paucity of lon-
gitudinal research on safety climate (see below), little data exist concerning the
temporal stability of safety climate. One recent study reported test-retest correla-
tions greater than .50 for two safety climate measures and a corresponding ability
of the climate measures to predict safety outcomes across a two-year time period
(Lee, Sinclair, Huang, & Cheung, in press). However, others have concluded that
the ability of safety climate to predict outcomes drops much more rapidly (Berg-
man, Payne, Taylor, & Beus, 2014). Of course, the stability of both safety climate
scores and their ability to predict outcomes likely depends on the stability of the
work environment, but relatively little research has directly addressed this issue.
In our view, whether safety climate is a relatively stable phenomenon or a varying
snapshot of culture remains unresolved.
Guldenmund (2000, p. 220) noted that “before defining safety culture and cli-
mate, the distinction between culture and climate has to be resolved.” In the ensu-
ing nearly two decades, although progress has been made in understanding the general conceptual distinctions between organizational climate and culture (cf. Ostroff et al., 2013), safety researchers have often been careless in distinguishing culture and climate (Zohar, 2014). Some researchers assume that the climate-culture distinction rests on the idea that climate is easier to change than culture, while others distinguish the two in terms of their relative temporal stability.
Still others treat climate as the measurable aspect of culture, even though other
aspects of culture are likely measurable, albeit through different strategies than
those used in climate assessments. These ambiguities highlight the critical need
for further clarity in the conceptualization of climate. In fact, Hofmann, Burke,
and Zohar (2017, p. 381) concluded:

In the context of safety research, there potentially is even greater conceptual am-
biguity given the lack of a clear and agreed upon definition of safety culture, and
where the definitions that have been put forth do not make reference to broader,
more general aspects of organizational culture. In addition, many measures of safety
culture use items and scales which resemble safety climate measures. This has led
many authors to use the two constructs interchangeably. We believe this situation is
unfortunate and suggest that any study of safety culture should be integrated with
and connected to the broader, more general organizational culture as well as the
models and research within this domain.

Industry Specific versus General Measures


One of the ongoing issues in safety climate measurement concerns the use of
industry/context specific versus what are referred to as general or universal mea-
sures. The former refers to safety climate measures with item content designed to
reflect safety issues in a specific industry; the latter refers to measures with items
designed to generalize across a wide variety of contexts. Our review indicates
that although general safety climate measures are more commonly used, industry
specific measures appear to be more common in a few types of settings including
schools/education, health care, transportation, and offshore and gas production
(Jiang et al., 2019, also reported these as the most common contexts in their meta-analysis).
One example of context specific measures comes from a series of studies on
truck drivers by Huang and colleagues (Huang et al., 2018; Lee, Sinclair, Huang,
& Cheung, 2019). Huang et al. (2013) argued that the need for a context specific
measure reflects the unique safety concerns faced by lone workers such as truck
drivers. They developed and validated a measure consisting of three organization-
al-level factors: (a) proactive practices, (b) driver safety priority, and (c) supervi-
sory care promotion, and three group/unit level measures: (a) safety promotion,
(b) delivery limits, and (c) cell phone (use) disapproval.
Another example comes from literature on school climate. Zohar and Lee
(2016) provided an example of a traditional safety climate study conducted in a
school setting with school bus drivers. In addition to items measuring perceived
management commitment to safety, they developed context-specific items such as “management becomes angry with drivers who have violated any safety rule” and “department immediately informs school principal of driver complaint against disruptive child.”
Occupational health research has paid less attention to the school context than to contexts such as manufacturing; nevertheless, we conducted a separate review of the school climate literature, which located over 1,000 citations to school climate, including over 500 in 2013 alone. Although a full review of this literature is well beyond the scope of this article, it should be
(Wang & Degol, 2016). However, rather than reflecting physical injuries from
sources such as transportation incidents, slips, and strains, the predominant safety
concern is the extent to which teachers and students are protected from physical
and verbal violence. Moreover, much of this literature is concerned with student
health and academic performance outcomes rather than teachers’ occupational
well-being. Thus, traditional safety climate measures may be insufficient to cap-
ture the unique challenges of this context.
Healthcare is another setting where context-specific measures are frequently
used. Healthcare, however, encompasses a wide variety of practice areas and occu-
pations, each with specific sets of safety challenges. Accordingly, researchers have
measured a wide array of different aspects of safety climate such as error-related
communication (Ausserhofer et al., 2013), hospital falls prevention (Bennett et al.,
2014), communication openness and handoffs and transitions (Cox et al., 2017),
forensic ward climate such as therapeutic hold and patients’ cohesion and mutual
support (de Vries, Brazil, Tonkin, & Bulten, 2016), and hospital safety climate items
relating to issues such as availability of personal protective equipment and cleanli-
ness (Kim et al., 2018). The variety of issues captured by these measures raises
questions about whether healthcare should be treated as a single industry context by
researchers seeking to understand the effects of context on safety climate.
Jiang et al. (2019) highlighted some of the reasons why general/universal or
context-specific measures might be preferred. For example, industry-specific
measures may have greater value in diagnosing safety concerns that are unique
to a specific industry and therefore potentially more useful in guiding safety in-
terventions (see also Zohar, 2014). General measures may have more predictive
value if safety climate primarily reflects a general management commitment to
safety; if this is the case, safety interventions should focus on those broadly ap-
plicable concerns. General measures can also contribute to benchmarking norms
that may be used across a wide variety of industries.
To examine the possible distinctions between universal and industry-specific mea-
sures, Jiang et al. (2019) tested the relative predictive power of each type of mea-
sure in a meta-analytic review of 120 samples (N = 81,213). They found that
each type of measure performed better in different situations. Specifically, the
industry-specific measures were more strongly related to safety behavior and
risk perceptions whereas the universal measures predicted other adverse events
such as errors and near misses. There were no differences between universal and
industry-specific measures in their ability to predict accidents and injuries. It is
important to note that Jiang et al. (2019) did not test whether the industries of the
industry-specific measures differed from those of the universal measures.
Jiang et al. (2019) reported the most commonly studied industries in their review to be construction (K = 21), health care, hospitality, manufacturing (K = 18), transportation (K = 18), hospitality, restaurant/accommodations (K = 12), and construction (K = 11), with 19 studies described as “mixed context.” Our re-
view (which encompasses a different set of years than Jiang et al.) indicates that
the industries that appeared to be most likely to use industry-specific measures
were transportation, off-shore and gas production, education, and hospital/health
care. Thus, the comparison of industry-specific versus general measure may be
somewhat confounded if some industries are more/less likely to be represented
in the industry-specific group. Researchers could address this by comparing both
measures within the same industry.
Keiser and Payne (2018) did just this, using both types of measures in the same setting (university research labs) and including context-specific measures for animal biological, biological, chemical, human subjects/computer, and mechanical/electrical labs. They concluded that although the context-specific measures appeared to be more useful in less safety-salient contexts, there were relatively few differences between the measures. However, they also noted that
there appeared to be measurement equivalence problems with the general measure
across the different settings they investigated. Of course, Keiser and Payne’s find-
ings may be unique to their organizational setting given that university research
labs likely differ in many ways from other types of safety-salient contexts. Thus,
the evidence about whether researchers should use context-specific or universal/general measures is mixed, although it so far suggests at least some differences between the two types of measures in the settings in which they are most useful. This is clearly an issue that requires further research.

The Dimensionality of Safety Climate


Nearly 40 years after Zohar (1980) first offered a formal definition of safety
climate, there seems to be little consensus on the dimensionality of safety climate
measures. This has been a long-standing concern in the literature. Twenty years
after Zohar’s original publication, Flin, Mearns, O’Connor, and Bryden (2000)
identified 100 dimensions of safety climate used in prior literature. They narrowed
these dimensions down to six themes: (1) management/supervision, (2) safety sys-
tem, (3) risk, (4) work pressure, (5) competence of the workforce, and (6) pro-
cedures/rules. Yet, measures continued to proliferate; in fact, 10 years after the
Flin et al. (2000) publication, Beus et al.’s (2010) meta-analytic review identified
61 different climate measures with varying numbers of dimensions. Our review
suggests that little progress has been made and there continues to be a wide array
of approaches to measuring safety climate. As noted above, one important dis-
tinction is between universal/generic and context-specific measures, with many
alternatives within each of these categories. A related issue concerns the dimen-
sions of those measures. For the purpose of this review, we did not compile a list
of the dimensions used in various measures of safety climate. Rather, we focused
on the methods used to ascertain the number of dimensions in individual studies.
Factor analysis is a widely recognized approach to assessing dimensionality
of a measure and therefore is an important step in measure development and con-
struct validation. Factor analyses are especially important in a literature such as
safety climate where there is a lack of clarity about the dimensionality of the
construct. Therefore, we coded studies in terms of whether they used any factor
analytic technique and, if so, which technique they used. Across the 230 quantitative
empirical studies, the most common factor analytic technique used was confirma-
tory factor analysis (CFA, K = 64; 27.8%). Approximately 22% of the studies
used exploratory factor analysis (EFA) with half of them only using EFA (K =
25, 10.9%) and half using a combination of EFA and CFA (K = 24, 10.4%). That
CFA was used separately or in some combination with EFA in 38.2% of the stud-
ies (K = 88) is encouraging given that CFA requires researchers to specify an a
priori measurement model. However, it is arguably more distressing that nearly
half of the studies in our review (K = 112, 48.7%) did not report any form of fac-
tor analysis, 2 studies reported the use of an unspecified form of factor analysis
(0.9%), and 3 studies (1.3%) reported using CFA but only on measures other than
safety climate. Given the lack of clarity in the literature about the dimensionality
of safety climate, the fact that just over 50% of the studies in our review either
did not report factor analyses or provided unclear information about the factor
analytic techniques used represents an important barrier to accumulating evidence
about the dimensionality of safety climate measures.
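As a concrete illustration of the kind of factor-analytic evidence we are calling for, the sketch below runs a basic exploratory factor analysis on a hypothetical pool of safety climate items using the factor_analyzer package in Python. The file name, the item data, and the three-factor solution are assumptions made for illustration only; in practice the number of factors would be guided by theory and parallel analysis, and the resulting a priori model would then be tested with CFA.

```python
# Minimal sketch (illustrative assumptions: "safety_climate_items.csv" contains
# only respondent-by-item responses, and three factors are retained).
import pandas as pd
from factor_analyzer import FactorAnalyzer

items = pd.read_csv("safety_climate_items.csv")   # hypothetical item-level data

fa = FactorAnalyzer(n_factors=3, rotation="oblimin", method="minres")
fa.fit(items)

loadings = pd.DataFrame(fa.loadings_, index=items.columns)
print(loadings.round(2))          # which items load on which factor
print(fa.get_factor_variance())   # variance explained by each factor
```

Reporting the loading matrix and variance explained, rather than only a scale alpha, is what allows dimensionality evidence to accumulate across studies.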
A related issue concerns how safety researchers interpret factor analytic re-
sults. Some researchers use unidimensional measures typically focusing on the
core idea of perceived management commitment to safety (for example, Arcury
et al., 2014; He et al., 2016). This approach is consistent with the argument that
management commitment is the central concept in safety climate literature as well
as with meta-analytic evidence showing that management commitment is among
the best predictors of safety-related outcomes (Beus et al., 2010). However, good
methodological practice suggests that confirmatory factor analyses should be
used to affirm the dimensionality of these measures.
Other studies use multidimensional measures but treat them as unidimension-
al, combining scores across multiple dimensions into an overall construct (for
example, McCaughey, DelliFraine, McGhan, & Bruning, 2013). In some cases,
this may be justified by high correlations among the factors in question. However,
other studies using multidimensional measures have treated the dimensions as
separate scores (for example, Hoffmann et al. 2013; Huang et al., 2017). The
fact that the studies using multidimensional measures often find differences in
safety climate antecedents or outcomes across climate dimensions suggests that
researchers who combine multidimensional measures into a single score may be
missing information of diagnostic or theoretical value.
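A simple way to check what is lost by collapsing dimensions is to compare a model using the overall composite with a model entering the dimensions separately, as sketched below. The dimension labels, the outcome, and the file name are illustrative assumptions rather than a recommended measurement model.

```python
# Minimal sketch (illustrative assumptions: hypothetical dimension scores and a
# continuous safety behavior criterion in "climate_dimensions.csv").
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("climate_dimensions.csv")
dims = ["mgmt_commitment", "safety_communication", "safety_training"]
df["climate_overall"] = df[dims].mean(axis=1)    # collapsed composite score

m_overall = smf.ols("safety_behavior ~ climate_overall", data=df).fit()
m_dims = smf.ols("safety_behavior ~ " + " + ".join(dims), data=df).fit()
print(m_overall.rsquared, m_dims.rsquared)   # does separating dimensions add variance?
print(m_dims.params)                         # which dimensions carry the relation?
```

If the dimension-level model explains meaningfully more variance, or if the dimensions show different patterns of coefficients, the composite is discarding diagnostic information.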
In addition to the broad question of whether safety climate measures should
be treated as multidimensional, questions can be raised about what dimensions
should be included in safety climate measures. For example, five of the six themes
in the Flin et al. (2000) review of the literature did not directly concern perceptions
of management commitment: safety system, risk, work pressure, competence of
the workforce, and procedures/rules. Two other commonly-mentioned themes re-
late to safety training (e.g., Golubovich, Chang, & Eatough, 2014; Graeve, Mc-
Govern, Arnold, & Polovich, 2017) and safety communication (e.g., Bell, Reeves,
Marsden, & Avery, 2016; Cox et al., 2017).
The question that can be raised about any of these dimensions is whether they
should be treated as part of a unitary safety climate construct or whether they
should be regarded as antecedents or even possibly consequences of safety cli-
mate. Referring to the general climate literature, Ostroff et al. (2013) treated or-
ganizational climate as a consequence of organizational structure and practices.
To the extent that one adopts a strict definition of safety climate as relating to
perceptions about the relative priority of safety, several of the commonly-used
dimensions of safety climate might be viewed as causal antecedents of rather than
indicators of safety climate. Huang et al. (2018) noted that safety communication
has been treated as a part of safety climate, as a cause of safety climate, and even
as a consequence of safety climate. They argued that the literature was ambiguous
enough on this point that they treated safety communication as a correlate of safe-
ty climate. They found that communication both independently predicted safety
performance and moderated the safety climate-safety performance relationship
such that the benefits of (individual level) climate perceptions were stronger when
workers also perceived good organizational practices.
Future research needs to attend to the issue of whether perceptions about or-
ganizational policies and practices (such as training and communication) as well
as working conditions (such as job stress or work pressure) should be viewed
as indicators or as antecedents of safety climate. This is an important issue both
in terms of striving to reach consensus on the nature of climate and in terms of
researchers’ decisions about how to operationalize safety climate. Importantly,
strong correlations among the dimensions may be an insufficient justification as
one would expect that policy and practice-based antecedents of safety climate
should influence workers’ perceptions of the extent to which their management is
committed to safety.
We also noted several studies that used dimensions of safety climate not di-
rectly related to workers’ perceptions about the relative priority of safety. A few
examples include Arens, Fierz, and Zúñiga (2017), who combined measures of teamwork climate and patient safety climate; Hong and Li (2017), who used measures such as teamwork climate, stress recognition, and job satisfaction; Kim et al. (2018), who included a measure of the absence of job hindrances; and Ausserhofer et al. (2013), who assessed commitment to resilience. We discourage researchers
from including such marginally-relevant concepts directly in their safety climate
measures. Rather, future research should carefully consider whether such vari-
ables might be better treated as antecedents or outcomes of safety climate or per-
haps as moderators of the effects of climate on outcomes.

Conceptualizing and Operationalizing Safety Climate: A Progress Report
Safety climate has been a topic in occupational health research for nearly 40
years. In that time, several hundred studies have supported the general importance
of safety climate in occupational health. Despite the voluminous literature on the
topic, including 230 quantitative empirical studies in the past five years, problems
remain in nearly every aspect of conceptualizing and operationalizing safety cli-
mate. These issues create challenges in drawing conclusions about the nature of
safety climate as a construct. There seems to be relatively broad consensus about
the idea of safety climate as reflecting workers’ perceptions about the importance
of safety issues. However, there also are wide inconsistencies regarding how re-
searchers define and measure safety climate. Given that clarity about constructs
is fundamental to scientific progress, these inconsistencies raise questions about
how much we really know about the nature of safety climate. Some of these prob-
lems are likely to worsen as researchers begin to study similar ideas such as PSC and psychological safety, and as interest grows in other health-related aspects of climate (e.g., Gazica & Spector, 2016; Mearns, Hope, Ford, & Tetrick,
2010; Sawhney, Sinclair, Cox, Munc, & Sliter, 2018; Sliter, 2013). Ultimately,
these problems should be addressed through greater attention to the importance
of construct validation and establishing a coherent nomological network for the
safety climate construct.

METHODS/DESIGNS
As indicated above, the conceptualization and measurement of safety climate have several pitfalls that generate challenges for the design of studies seeking to examine the effects of safety climate, its antecedents, as well as
the mediating and moderating effects of safety climate. In this section we review
some of the methodological challenges and issues.

Interventions
For the period 2013–2018, only 6% of all of the articles we coded were inter-
vention studies. Of these, four studies treated safety climate as an independent
variable, eight studies treated safety climate as a dependent variable, and two
studies treated safety climate as a mediator. Two of these intervention studies used
random assignment, two used quasi-experimental designs, and two used cluster random assignment. Experimental or quasi-experimental designs were therefore rare. Admittedly, these designs are difficult to implement in applied field settings, but their absence does limit our ability to make causal inferences about safety climate-related processes.
Lee, Huang, Cheung, Chen, and Shaw (2018) reviewed 19 intervention studies
that met their inclusion criteria; they reported that 10 of the 19 studies were quasi-
experimental pre-post-intervention designs and eight were based on mixed de-
signs with between- and within-subjects components. Ten of the 19 studies were
published in years preceding the period of our review, which raises the question of whether research designs are becoming stronger. That said, the results of both
of these reviews support the ability of interventions to improve safety climate in
applied settings across several industries. But, they also highlight how rare such
studies are and the corresponding need for more studies utilizing these designs.
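For researchers weighing such designs, the sketch below illustrates one common quasi-experimental analysis: a pre/post comparison of intervention and comparison sites, in which the treated-by-post interaction serves as the difference-in-differences estimate of the intervention's effect on climate. The data file, column names, and clustering by site are illustrative assumptions, not a description of any study reviewed here.

```python
# Minimal sketch (illustrative assumptions: long-format data with one row per
# respondent per wave, 0/1 "treated" and "post" indicators, a "climate" score,
# and a "site" identifier used for clustered standard errors).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("intervention_long.csv")   # hypothetical evaluation data

model = smf.ols("climate ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["site"]}
)
print(model.summary())
# The treated:post coefficient estimates the intervention effect: the pre-to-post
# change in treated sites over and above the change in comparison sites.
```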

Longitudinal versus Cross-sectional Designs


In our review, 21% of the studies we coded obtained data across multiple mea-
surement occasions. Twenty-six studies included either two or three waves of data
collection, although they did not necessarily measure all variables at all points in time. There were eight prospective studies and 15 intervention studies. However, about 80% of the studies were cross-sectional. This is considerably higher
than the proportion of single-source cross-sectional designs reported in Spec-
tor’s (2019) review. Forty-one percent of the studies in two occupational health
psychology journals (Journal of Occupational Health Psychology and Work &
Stress) used a single-source cross-sectional design (Spector & Pindek, 2016) and
38% of the studies in the Journal of Business and Psychology used a single-source cross-sectional design (Spector, 2019). The higher prevalence of cross-sectional designs in safety climate research may be a result of disciplinary differences or
editorial policies and practices. There does appear to be a belief among many
researchers that using a longitudinal design is preferred in establishing the valid-
ity of the research; however, adding additional measurement occasions raises a
number of threats to validity and may still not allow strong causal inferences
depending on other design issues (see Ferrer & Grimm, 2012; Ployhart & Ward,
2011; Stone-Romero, 2010).
Many longitudinal studies do not make a case for the specific lag in measure-
ment they included in their designs. In addition, if there are not at least three
measurement occasions, then it is not possible to detect nonlinear trends. Un-
fortunately, the theories commonly used in safety climate research are silent on
the most appropriate time lag to choose for a given research question. It may be
the case that there is no perfect time lag as changes in safety climate may be best
explained by unique events, such as severe accidents or changes in organization-
al policy. Nevertheless, we echo calls by other scholars (e.g., Ployhart & Ward,
2011) to incorporate time into our research designs. This is especially important
for understanding the time that it takes for a cause (e.g., an accident) to exert
an effect (e.g., changes in safety climate). Other scholars (e.g., Spector, 2019)
have suggested that we modify our measures to explicitly incorporate time. Many
of our measures are so general that it is impossible to assess the sequencing of
events. By including time-related content such as “in the last month” or “today,”
the temporal ambiguity is reduced if not eliminated.
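One simple way to begin incorporating time, under the strong assumption that the chosen lag is meaningful, is a two-wave cross-lagged regression in which each Time 2 variable is regressed on both Time 1 variables. The sketch below illustrates this; the variable names and the single lag are assumptions made only for illustration, and with three or more waves nonlinear change could also be examined.

```python
# Minimal sketch (illustrative assumptions: person-level panel data with climate
# and safety behavior measured at two waves, columns climate_t1/t2, behavior_t1/t2).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("two_wave_panel.csv")   # hypothetical panel data

# Does Time 1 climate predict Time 2 behavior beyond behavior's own stability?
m_forward = smf.ols("behavior_t2 ~ behavior_t1 + climate_t1", data=df).fit()
# Reverse path: does Time 1 behavior predict Time 2 climate perceptions?
m_reverse = smf.ols("climate_t2 ~ climate_t1 + behavior_t1", data=df).fit()
print(m_forward.params, m_reverse.params, sep="\n")
```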

Level of Analysis
As discussed above, many conceptualizations of safety climate suggest a group
or organizational level of analysis. However, 70.4% of the studies we coded mea-
sured and/or analyzed safety climate at only the individual level of analysis. Only
67 studies (29.1%) took a group, organizational, or multi-level approach. As we
point out in the previous section as well as in the section on future directions
below, moving beyond the individual level of analysis is necessary to advance
understanding of safety climate. More research at the group and organizational levels is needed to link safety climate to organizational level outcomes, as well as to understand the relations of group- and organizational-level climate with individual level behaviors and outcomes.
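A sketch of what such multilevel work can look like appears below: individual climate perceptions are centered within teams so that within-team (psychological climate) and between-team (aggregated climate) effects on an individual outcome can be estimated in a single random-intercept model. The data file, column names, and outcome are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions: columns "team", "climate", and
# "safety_behavior" in a hypothetical nested data set).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("multilevel_safety.csv")
df["team_mean"] = df.groupby("team")["climate"].transform("mean")   # aggregated climate
df["climate_cwc"] = df["climate"] - df["team_mean"]                 # centered within team

# Random-intercept model estimating within-team and between-team climate effects
mlm = smf.mixedlm("safety_behavior ~ climate_cwc + team_mean",
                  data=df, groups=df["team"]).fit()
print(mlm.summary())
```

Estimating both components in one model helps guard against the atomistic and ecological fallacies discussed earlier.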

Role of Safety Climate


Reflecting the varied definitions and theoretical frameworks for safety climate,
there is considerable variability in whether safety climate is treated as an indepen-
dent variable, a dependent variable, a moderator or a mediator. Forty-two percent
of the studies we coded treated safety climate as an independent variable and
only 20% treated safety climate as a dependent variable. Leadership was the most
prevalent predictor of climate. Only 5.2% of the studies treated safety climate as a moderator and 7.8% treated it as a mediator; notably, most of the studies treating safety climate as a mediator used cross-sectional designs, which some argue are not appropriate for testing mediation (Stone-Romero, 2010). However, Spector (2019) suggested that in mature areas of research,
cross-sectional mediation models may be useful for ruling out plausible alterna-
tive explanations or identifying potential mediating mechanisms. Safety climate
thus can be considered a dependent variable, independent variable, moderator, or
mediator depending on the research question; however, our review indicates that the literature needs to investigate all of these roles more fully to broaden our understanding of the development and effects of safety climate.
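As one example of testing safety climate in a moderator role, echoing the climate-by-communication interaction described earlier, the sketch below estimates a moderated regression with mean-centered predictors. The variable names are illustrative assumptions rather than a prescription for any particular study; a parallel logic applies when climate itself is the focal predictor and another variable is the moderator.

```python
# Minimal sketch (illustrative assumptions: columns "climate", "communication",
# and "safety_performance" in a hypothetical individual-level survey file).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("climate_roles.csv")
for col in ["climate", "communication"]:
    df[col + "_c"] = df[col] - df[col].mean()    # mean-center predictors

mod = smf.ols("safety_performance ~ climate_c * communication_c", data=df).fit()
print(mod.summary())
# A reliable climate_c:communication_c coefficient indicates that the
# climate-performance relation varies with perceived safety communication.
```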

FUTURE RESEARCH AGENDA


Research on safety climate has made tremendous progress since the term safety
climate was coined by Zohar in 1980. In the last five years, over 200 empirical
studies have been published on the topic. These studies span such countries as
China, Australia, Portugal, USA, and Norway, as well as a range of industries,
including healthcare, construction, education, manufacturing, and transportation.
Despite the considerable interest and advancement in the field, much work re-
mains to be done to improve the literature on safety climate. The remainder of this paper is devoted to a discussion of recommendations for conducting research on safety climate. Specifically, the focus of the future research agenda should be on improving theory, research designs, and measurement.
Perhaps the most fundamental need in the safety climate literature is to arrive
at consensus on the conceptual and operational definition of safety climate. Al-
though many studies define safety climate as shared perceptions with respect to
safety (Zohar, 1980) or policies, procedures, and practices in an organization with
regard to safety (Neal & Griffin, 2006), there is still considerable variability
in how people define the construct. Additionally, safety climate and culture are
sometimes still used interchangeably. Furthermore, agreement on the dimensions that comprise the construct of safety climate is currently lacking. Although
much of the research has considered safety climate to be multidimensional in
nature (e.g., Kines et al., 2011; Sawhney et al., 2018), some studies have opera-
tionalized it as a unidimensional construct (Drach-Zahavy & Somech, 2015).
With respect to dimensionality, even though most safety climate measures
share common themes of management commitment to safety, supervisor support
for safety, and safety communication (Kines et al., 2011; Neal & Griffin, 2006;
Sawhney et al., 2018), there are other dimensions that are less frequently utilized.
These include perceptions of causes of error, satisfaction with the safety program,
safety justice, and social status of the safety officer, to name just a few (Hoffmann
et al., 2013; McCaughey, DelliFraine, McGhan, & Bruning, 2013; Schwatka &
Rosecrance, 2016; Zohar, 1980). The lack of consensus on the conceptualization
of safety climate spills into the operationalization of the construct. As recently
discussed by Beus, Payne, Arthur, and Muñoz (2019), this conceptual ambigu-
ity makes it difficult to compare studies that use different definitions of safety
climate. According to Guldenmund (2000), a construct’s definition “sets the stage for ensuing research [and] is the basis for hypotheses, research paradigms, and interpretations of findings” (p. 227). With differing dimensions of safety climate,
research can produce different findings. Future research can benefit from dispel-
ling theoretical ambiguity by reaching consensus on the dimensions that comprise
safety climate beyond management commitment to safety.
The advancement of research on safety climate is also contingent upon theo-
retical development in the area. Currently, we have a few frameworks that are spe-
cific to safety climate, such as Zohar’s (2000) multilevel model of safety climate
and Neal, Griffin, and Hart’s (2000) model of safety performance. Despite the
existence of these theories, researchers rarely draw explicitly upon these frame-
works, and therefore, these theories largely remain untested. The exception appears to be research examining the links of safety climate with safety behavior and outcomes, which has received abundant research attention. Theoretical devel-
opment and testing will provide the necessary foundation for better understand-
ing of predictors, mechanisms, and outcomes of safety climate. Currently, we
have comparatively few studies on the antecedents of safety climate (e.g., Beus
et al., 2015). Theories of safety climate can shed light on contextual factors that
facilitate the emergence of such climate in the workplace. By better understand-
ing the antecedents of safety climate, researchers and practitioners will be better
equipped to intervene in order to enhance workplace safety.
Research on safety climate can further prosper if theoretical models of safety
acknowledge the different levels of safety climate within an organization. Much
of the research on safety climate continues to focus on the individual level, with
relatively fewer studies exploring safety climate at the group or organizational
level. At the same time, it remains unknown whether aggregated responses to
safety climate measures maintain the properties of individual-level responses
(Beus et al., 2019). Considering that organizational processes at different levels
are often interconnected (Kozlowski & Klein, 2000), examining safety climate at
different levels separately will only give us an incomplete picture of safety within
an organization. Therefore, by explicitly modeling relations between safety cli-
mate and various outcomes at different levels of the organization, we decrease
the risk of committing either the atomistic or ecological fallacy (Hannan, 1971).
Beus et al. (2019) examined the validity of a newly developed safety climate
measure across individual and group levels of the construct and reported that the
associations between group-level safety climate and injuries/incidents were not
substantially different from those using corresponding individual-level percep-
tions of safety climate. However, more studies that explore the interconnectedness
of safety climate and criteria at varying levels are needed.
In addition to advances in theoretical development, research on safety climate
can be bolstered by strengthening research methods and measurement. Based on
our analysis of studies published over the last five years, safety climate research-
ers primarily rely on quantitative methods with a comparative lack of attention
given to qualitative studies. Although quantitative studies permit researchers to
objectively test relations of safety climate with theoretically meaningful con-
structs, such as safety performance and accidents and injuries in the workplace,
qualitative methods have been credited with allowing in-depth analysis of com-
plex social phenomena (Patton, 2002).
In the case of safety climate, qualitative designs may be particularly useful
in generating theory by understanding the process of emergence of employee
perceptions regarding safety and even emergence of group level perceptions of
safety. Bliese, Chan, and Ployhart (2007) described three sources that might af-
fect the formation of group level perceptions: employees’ individual experiences,
shared group characteristics (e.g., group cohesion) and clustering of individual
attributes by workgroup (e.g., demographics, backgrounds). For example, team
faultlines —defined as the hypothetical lines that divide a team into subgroups
on one or more attributes (Lau & Murnighan, 1998)—could be viewed as a clus-
tering of individual attributes by workgroup. Team faultlines create subgroups
among employees; the norms and ideologies of those subgroups shape in-group members’ perceptions. Eventually, such in-group and out-group dynamics could produce inaccuracies in quantitative measures. That is, even when quantitative measures demonstrate statistical justification for aggregation and an apparently shared perception among employees, the subgroups formed through faultlines may hold different perceptions of safety policies, procedures, and practices (see Beus, Jarrett, Bergman, &
Payne, 2012). Qualitative research methods (e.g., action research, interviews, ob-
servational research) could reveal the nuances and contexts of a multilevel system
(Aiken, Hanges, & Chen, 2018). Thus, using qualitative methods to understand
an organization could enrich the understanding of the emergence of employee
perceptions at both the individual level and higher levels. Similarly, such designs
may be useful for gaining better insights into change in safety climate. Therefore,
more safety climate studies should be undertaken using qualitative methodolo-
gies.
For research to move forward in the area of safety climate, more studies are
needed that utilize strong designs. Studies on safety climate have predominantly
relied on cross-sectional designs (e.g., Bodner, Kraner, Bradford, Hammer, &
Truxillo, 2014; Smith, Eldridge, & DeJoy, 2016). Although cross-sectional de-
signs allow investigations regarding the interrelatedness of different constructs,
they are insufficient for establishing causal relations and may yield results influenced by common method biases (CMB; Podsakoff, MacKenzie, Lee,
& Podsakoff, 2003; but also see Spector, 2019). To overcome CMB, some studies
have employed prospective designs (e.g., Zohar, Huang, Lee, & Robertson, 2014)
whereby variables are measured at two or more different time points. Although
prospective designs may offer some advantages over cross-sectional designs,
they do not necessarily remove CMB (Spector, 2019), test the reverse causation
hypotheses (Zapf, Dormann, & Frese, 1996) or allow causal inferences (Stone-
Romero, 2010) as implied by some authors. To detect causal effects, studies are
needed that employ experimental and quasi-experimental designs. Based on our
review, only a handful of studies have utilized such designs in the safety climate
literature (e.g., Cox et al., 2017; Graeve, McGovern, Arnold, & Polovich, 2017).
Although researchers have argued that safety climate can change over time,
research has yet to explore this phenomenon. One way to explore such changes in
safety climate is to utilize experience sampling methodology (ESM), which goes
beyond between-subject approaches. Specifically, ESM designs are equipped to
assess how within-person perceptions of safety climate fluctuate from one day
to another (Beal & Weiss, 2003). At the same time, ESM can rule out explana-
tions that are introduced by third variables (Uy, Foo, & Aguinis, 2010), thereby
enhancing theory. Future studies may consider employing ESM designs to better
understand the effect of safety climate on criteria of interest.
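A sketch of what such an ESM analysis might look like appears below: daily climate perceptions are person-mean centered so that day-to-day (within-person) fluctuations can be separated from stable between-person differences. The diary file, column names, and daily outcome are illustrative assumptions only.

```python
# Minimal sketch (illustrative assumptions: daily diary rows with "person",
# "daily_climate", and "daily_safety_behavior" columns).
import pandas as pd
import statsmodels.formula.api as smf

esm = pd.read_csv("daily_diary.csv")
esm["climate_pm"] = esm.groupby("person")["daily_climate"].transform("mean")
esm["climate_wp"] = esm["daily_climate"] - esm["climate_pm"]   # within-person deviation

# Random-intercept model separating daily (within-person) and stable
# (between-person) components of climate perceptions
mlm = smf.mixedlm("daily_safety_behavior ~ climate_wp + climate_pm",
                  data=esm, groups=esm["person"]).fit()
print(mlm.summary())
```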

CONCLUSION
In the present paper, we reviewed trends within the last five years in the safety
climate literature. Our review focused on safety climate, a mature area of research
that extends over four decades and encompasses hundreds of studies. Despite the
size of the literature, it still lacks consistent conceptualization and operationaliza-
tion of constructs. Research needs to consider these potentially important aspects
of safety climate as either concepts of the definition or important antecedents or
outcomes of safety climate. Additionally, research should explore alternative ana-
lytic perspectives of examining dimensions and progression of safety climate over
time, including the stability of safety climate, non-linearity patterns of safety cli-
mate, and the relation of safety climate with potential antecedents and outcomes.

NOTE
1. Because it was not possible to cite all of the empirical studies, a list of
the 237 empirical studies included in this review can be obtained from
the first author. Please contact her at [email protected]

REFERENCES
Aiken, J. R., Hanges, P. J., & Chen, T. (2018). The means are the end: Complexity science
in organizational research. In S. E. Humphrey & J. M. LeBreton (Eds.), The handbook
of multilevel theory, measurement, and analysis. Washington, DC: American Psy-
chological Association.
American Educational Research Association, American Psychological Association, & Na-
tional Council on Measurement in Education. (1985). Standards for educational
and psychological testing. Washington, DC: American Psychological Association.
Arcury, T. A., Grzywacz, J. G., Chen, H., Mora, D. C., & Quandt, S. A. (2014). Work
organization and health among immigrant women: Latina manual workers in North
Carolina. American Journal of Public Health, 104(12), 2445–2452.
Arens, O. B., Fierz, K., & Zúñiga, F. (2017). Elder abuse in nursing homes: Do spe-
cial care units make a difference? A secondary data analysis of the Swiss Nurs-
ing Homes Human Resources Project. Gerontology, 63(2), 169–179. https://doi.
org/10.1159/000450787
Ausserhofer, D., Schubert, M., Desmedt, M., Blegen, M. A., De Geest, S., & Schwendi-
mann, R. (2013). The association of patient safety climate and nurse-related or-
ganizational factors with selected patient outcomes: A cross-sectional survey. In-
ternational Journal of Nursing Studies, 50(2), 240–252. https://doi.org/10.1016/j.ijnurstu.2012.04.007
Beal, D. J., & Weiss, H. M. (2003). Methods of ecological momentary assessment in orga-
nizational research. Organizational Research Methods, 6(4), 440–464.
Bell, B. G., Reeves, D., Marsden, K., & Avery, A. (2016). Safety climate in English general
practices: Workload pressures may compromise safety. Journal of Evaluation in
Clinical Practice, 22(1), 71–76. https://doi.org/10.1111/jep.12437
Bennett, P. N., Ockerby, C., Stinson, J., Willcocks, K., & Chalmers, C. (2014). Measur-
ing hospital falls prevention safety climate. Contemporary Nurse, 47(1–2), 27–35.
https://doi.org/10.1080/10376178.2014.11081903
Bergheim, K., Eid, J., Hystad, S. W., Nielsen, M. B., Mearns, K., Larsson, G., & Luthans,
B. (2013). The role of psychological capital in perception of safety climate among
air traffic controllers. Journal of Leadership & Organizational Studies, 20(2), 232–
241. https://doi.org/10.1177/1548051813475483
Bergman, M. E., Payne, S. C., Taylor, A. B., & Beus, J. M. (2014). The shelf life of a safety
climate assessment: How long until the relationship with safety–critical incidents
expires? Journal of Business and Psychology, 29(4), 519–540.
Beus, J. M., Dhanani, L. Y., & McCord, M. A. (2015). A meta-analysis of personality and
workplace safety: Addressing unanswered questions. Journal of Applied Psychol-
ogy, 100, 481–498. https://doi.org/10.1037/a0037916
Beus, J. M., Jarrett, S. M., Bergman, M. E., & Payne, S. C. (2012). Perceptual equivalence
of psychological climates within groups: When agreement indices do not agree.
Journal of Occupational and Organizational Psychology, 85(3), 454–471.
Beus, J. M., Payne, S. C., Arthur, W., Jr., & Muñoz, G. J. (2019). The development and validation of a cross-industry safety climate measure: Resolving conceptual and operational issues. Journal of Management, 45(5), 1987–2013.
Beus, J. M., Payne, S. C., Bergman, M. E., & Arthur, W. (2010). Safety climate and inju-
ries: An examination of theoretical and empirical relationships. Journal of Applied
Psychology, 95, 713–727.
Bliese, P. D., Chan, D., & Ployhart, R. E. (2007). Multilevel methods: Future directions
in measurement, longitudinal analyses, and nonnormal outcomes. Organizational
Research Methods, 10(4), 551–563.
Block, J. (1995). A contrarian view of the five-factor approach to personality description.
Psychological Bulletin, 117, 187–215.
Bodner, T., Kraner, M., Bradford, B., Hammer, L., & Truxillo, D. (2014). Safety, health,
and well-being of municipal utility and construction workers. Journal of Oc-
cupational and Environmental Medicine, 56, 771–778. https://doi.org/10.1097/
JOM.0000000000000178
Bronkhorst, B. (2015). Behaving safely under pressure: The effects of job demands, re-
sources, and safety climate on employee physical and psychosocial safety behavior.
Journal of Safety Research, 55, 63–72. https://doi.org/10.1016/j.jsr.2015.09.002
Bronkhorst, B., & Vermeeren, B. (2016). Safety climate, worker health and organizational
health performance: Testing a physical, psychosocial and combined pathway. Inter-
national Journal of Workplace Health Management, 9(3), 270–289.
Casey, T., Griffin, M. A., Flatau Harrison, H., & Neal, A. (2017). Safety climate and cul-
ture: Integrating psychological and systems perspectives. Journal of Occupational
Health Psychology, 22, 341–353.
We’ve Got (Safety) Issues • 221

Christian, M. S., Bradley, J. C., Wallace, J. C., & Burke, M. J. (2009). Workplace safety:
A meta-analysis of the roles of person and situational factors. Journal of Applied
Psychology, 94, 1103–1127.
Clarke, S. (2010). An integrative model of safety climate: Linking psychological climate
and work attitudes to individual safety outcomes using meta-analysis. Journal of
Occupational and Organizational Psychology, 83, 553–579.
Clarke, S. (2013). Safety leadership: A meta-analytic review of transformational and trans-
actional leadership styles as antecedents of safety behaviors. Journal of Occupa-
tional and Organizational Psychology, 86, 22–49.
Clark, O. L., Zickar, M. J., & Jex, S. M. (2014). Role definition as a moderator of the
relationship between safety climate and organizational citizenship behavior among
hospital nurses. Journal of Business and Psychology, 29, 101–110. https://doi.
org/10.1007/s10869-013-9302-0
Cox, S., & Cox, T. (1991). The structure of employee attitudes to safety: A European ex-
ample. Work & Stress, 5(2), 93–106. https://doi.org/10.1080/02678379108257007
Cox, E. D., Jacobsohn, G. C., Rajamanickam, V. P., Carayon, P., Kelly, M. M., Wetterneck,
T. B., ... & Brown, R. L. (2017). A family-centered rounds checklist, family engage-
ment, and patient safety: A randomized trial. Pediatrics, 139(5), 1–10. https://doi-
org.proxy.lib.odu.edu/10.1542/peds.2016-1688
Dedobbeleer, N., & Béland, F. (1991). A safety climate measure for construction sites. Jour-
nal of Safety Research, 22, 97–103. https://doi.org/10.1016/0022-4375(91)90017-P
de Vries, M. G., Brazil, I. A., Tonkin, M., & Bulten, B. H. (2016). Ward climate within a
high secure forensic psychiatric hospital: Perceptions of patients and nursing staff
and the role of patient characteristics. Archives of Psychiatric Nursing, 30(3), 342–
349. https://doi.org/10.1016/j.apnu.2015.12.007
Dollard, M. F., & Bakker, A. B. (2010). Psychosocial safety climate as a precursor to con-
ducive work environments, psychological health problems, and employee engage-
ment. Journal of Occupational and Organizational Psychology, 83(3), 579–599.
Drach-Zahavy, A., & Somech, A. (2015). Goal orientation and safety climate: Enhancing
versus compensatory mechanisms for safety compliance? Group & Organization
Management, 40, 560–588. https://doi.org/10.1177/1059601114560372
Edmondson, A. (1999). Psychological safety and learning behavior in work teams. Administrative Science Quarterly, 44(2), 350–383.
Flin, R., Mearns, K., O’Connor, P., & Bryden, R. (2000). Measuring safety climate: Iden-
tifying the common features. Safety Science, 34, 177–192.
Gazica, M. W., & Spector, P. E. (2016). A test of safety, violence prevention, and civility cli-
mate domain-specific relationships with relevant workplace hazards. International
Journal of Occupational and Environmental Health, 22, 45–51.
Golubovich, J., Chang, C. H., & Eatough, E. M. (2014). Safety climate, hardiness, and
musculoskeletal complaints: A mediated moderation model. Applied Ergonomics,
45(3), 757–766. https://doi.org/10.1016/j.apergo.2013.10.008
Graeve, C., McGovern, P. M., Arnold, S., & Polovich, M. (2017). Testing an intervention to
decrease healthcare workers’ exposure to antineoplastic agents. Oncology Nursing
Forum, 44(1), E10–E19. https://doi.org/10.1188/17.ONF.E10-E19
Griffin, M. A., & Neal, A. (2000). Perceptions of safety at work: A framework for linking
safety climate to safety performance, knowledge, and motivation. Journal of Oc-
cupational Health Psychology, 5, 347–358.
Guerrero, S., Lapalme, M. È., & Séguin, M. (2015). Board chair authentic leader-
ship and nonexecutives’ motivation and commitment. Journal of Leader-
ship & Organizational Studies, 22(1), 88-101. https://doi-org.proxy.lib.odu.
edu/10.1177/1548051814531825
Guldenmund, F. W. (2000). The nature of safety culture: a review of theory and research.
Safety Science, 34, 215–257.
Halbesleben, J. R. B., Leroy, H., Dierynck, B., Simons, T., Savage, G. T., McCaughey, D., & Leon, M. R. (2013). Living up to safety values in health care: The effect of
leader behavioral integrity on occupational safety. Journal of Occupational Health
Psychology, 18, 395–405.
Hall, G. B., Dollard, M. F., & Coward, J. (2010). Psychosocial safety climate: Development of the PSC-12. International Journal of Stress Management, 17, 353–383.
Hannan, M. T. (1971). Aggregation and disaggregation in sociology. Lexington, MA: Lex-
ington Books.
Hartmann, C. W., Meterko, M., Zhao, S., Palmer, J. A., & Berlowitz, D. (2013). Validation
of a novel safety climate instrument in VHA nursing homes. Medical Care Research
and Review, 70(4), 400–417. https://doi.org/10.1177/1077558712474349
He, Q., Dong, S., Rose, T., Li, H., Yin, Q., & Cao, D. (2016). Systematic impact of institu-
tional pressures on safety climate in the construction industry. Accident Analysis and
Prevention, 93, 230–239. https://doi.org/10.1016/j.aap.2015.11.034
Hinde, T., Gale, T., Anderson, I., Roberts, M., & Sice, P. (2016). A study to assess the influ-
ence of interprofessional point of care simulation training on safety culture in the
operating theatre environment of a university teaching hospital. Journal of Interpro-
fessional Care, 30(2), 251–253. https://doi.org/10.3109/13561820.2015.1084277
Hoffmann, B., Miessner, C., Albay, Z., Scbrbber, J., Weppler, K., Gerlach, F. M., & Guth-
lin, C. (2013). Impact of individual and team features of patient safety climate: A
survey in family practices. Annals of Family Medicine, 11, 355–362. https://doi-org.
proxy.lib.odu.edu/10.1370/afm.1500
Hofmann, D. A., Burke, M. J., & Zohar, D. (2017). 100 Years of occupational safety re-
search: From basic protections and work analysis to a multilevel view of workplace
safety and risk. Journal of Applied Psychology, 102, 375–388.
Hong, S., & Li, Q. (2017). The reasons for Chinese nursing staff to report adverse events:
A questionnaire survey. Journal of Nursing Management, 25(3), 231–239. https://
doi.org/10.1111/jonm.12461
Huang, Y., Lee, J., McFadden, A. C., Rineer, J., & Robertson, M. M. (2017). Individual
employee’s perceptions of “Group-level Safety Climate” (supervisor referenced)
versus “Organization-level Safety Climate” (top management referenced): Associa-
tions with safety outcomes for lone workers. Accident Analysis and Prevention, 98,
37–45. https://doi.org/10.1016/j.aap.2016.09.016
Huang, Y., Sinclair, R. R., Lee, J., McFadden, A. C., Cheung, J. H., & Murphy, L. A. (2018).
Does talking the talk matter? Effects of supervisor safety communication and safety
climate on long-haul truckers’ safety performance. Accident Analysis & Prevention,
117, 357–367. https://doi-org.proxy.lib.odu.edu/10.1016/j.aap.2017.09.006
Huang, Y., Zohar, D., Robertson, M. M., Garabet, A., Lee, J., & Murphy, L. A. (2013).
Development and validation of safety climate scales for lone workers using truck
drivers as exemplar. Transportation Research Part F: Traffic Psychology and Be-
haviour, 17, 5–19. https://doi.org/10.1016/j.trf.2012.08.011
We’ve Got (Safety) Issues • 223

Idris, M. A., Dollard, M. F., Coward, J., & Dormann, C. (2012). Psychosocial safety cli-
mate: Conceptual distinctiveness and effect on job demands and worker health.
Safety Science, 50, 19–28.
International Labor Organization (2009). World day for safety and health at work 2009:
Facts on safety and health at work? International Labour Office. Geneva: ILO.
Retrieved from: http://www.ilo.org/wcmsp5/groups/public/@dgreports/@dcomm/
documents/publication/wcms_105146.pdf
James, L. A., & James, L. R. (1989). Integrating work environment perceptions: Explora-
tions into the measurement of meaning. Journal of Applied Psychology, 74, 739–
751.
James, L. R., & Jones, A. P. (1974). Organizational climate: A review of theory and re-
search. Psychological Bulletin, 81, 1096–1112.
James, L. R., Choi, C. C., Ko, C.-H. E., McNeil, P. K., Minton, M. K., Wright, M. A., &
Kim, K. I. (2008). Organizational and psychological climate: A review of theory
and research. European Journal of Work and Organizational Psychology, 17, 5–32.
Jiang, L., Lavaysse, L. M., & Probst, T. M. (2019). Safety climate and safety outcomes: A meta-analytic comparison of universal vs. industry-specific safety climate predictive validity. Work & Stress, 33, 41–57.
Kagan, I., & Barnoy, S. (2013). Organizational safety culture and medical error reporting
by Israeli nurses. Journal of Nursing Scholarship, 45(3), 273–280. https://doi-org.
proxy.lib.odu.edu/10.1111/jnu.12026
Kane, M. (2012). All validity is construct validity. Or is it? Measurement, 10, 66–70.
Keiser, N. L., & Payne, S. C. (2018). Safety climate measurement: An empirical test of
context-specific versus general assessments. Journal of Business and Psychology,
33, 479–494.
Kim, O., Kim, M. S., Jang, H. J., Lee, H., Kang, Y., Pang, Y., & Jung, H. (2018). Radia-
tion safety education and compliance with safety procedures—The Korea Nurses’
Health Study. Journal of Clinical Nursing, 27(13/14), 2650–2660. https://doi-org.
proxy.lib.odu.edu/10.1111/jocn.14338
Kines, P., Lappalainen, J., Mikkelsen, K. L., Olsen, E., Pousette, A., Tharaldsen, J., ...
& Törner, M. (2011). Nordic Safety Climate Questionnaire (NOSACQ-50): A new
tool for diagnosing occupational safety climate. International Journal of Industrial
Ergonomics, 41, 634–646.
Kozlowski, S. W., & Klein, K. J. (2000). A multilevel approach to theory and research
in organizations: Conceptual, temporal, and emergent processes. In K. J. Klein & S. W. J.
Kozlowski (Eds.), Multilevel theory, research, and methods in organizations (pp.
3–90). San Francisco, CA: Jossey-Bass.
Lawrie, E. J., Tuckey, M. R., & Dollard, M. F. (2018). Job design for mindful work: The
boosting effect of psychosocial safety climate. Journal of Occupational Health Psy-
chology, 23(4), 483–495. https://doi-org.proxy.lib.odu.edu/10.1037/ocp0000102
Lau, D. C., & Murnighan, J. K. (1998). Demographic diversity and faultlines: The com-
positional dynamics of organizational groups. Academy of Management Review, 23,
325–340. doi:10.2307/259377
Lee, J., Huang, Y.-H., Cheung, J. H., Chen, Z., & Shaw, W. S. (2018). A systematic re-
view of the safety climate intervention literature: Past trends and future directions.
Journal of Occupational Health Psychology, 24, 66–91.
Lee, J., Sinclair, R. R., Huang, E., & Cheung, J. (2019). Outcomes of safety climate in
trucking: A longitudinal framework. Journal of Business and Psychology, 34, 865–
878.
Leitão, S., & Greiner, B. A. (2016). Organisational safety climate and occupational ac-
cidents and injuries: An epidemiology based systematic review. Work & Stress, 30,
71–90.
Liberty Mutual Research Institute for Safety. (2016). 2016 Liberty Mutual work-
place safety index. Hopkinton, MA. Retrieved from: http://cdn2.hubspot.net/
hubfs/330425/2016_Liberty_Mutual_Workplace_Safety_Index.pdf
Lissitz, R. W., & Samuelsen, K. (2007). A suggested change in terminology and emphasis
regarding validity and education. Educational Researcher, 36, 437–448.
Mansour, S., & Tremblay, D. G. (2018). Psychosocial safety climate as resource pathways
to alleviate work-family conflict. A study in the health sector in Quebec. Personnel
Review, 47(2), 474–493. https://doi.org/10.1108/PR-10-2016-0281
Mansour, S., & Tremblay, D. G. (2019). How can we decrease burnout and safety work-
around behaviors in health care organizations? The role of psychosocial safety cli-
mate. Personnel Review, 48(2), 528–550.
Martowirono, K., Wagner, C., & Bijnen, A. B. (2014). Surgical residents’ perceptions of
patient safety climate in Dutch teaching hospitals. Journal of Evaluation in Clinical
Practice, 20(2), 121–128. https://doi-org.proxy.lib.odu.edu/10.1111/jep.12096
McCaughey, D., DelliFraine, J. L., McGhan, G., & Bruning, N. S. (2013). The negative
effects of workplace injury and illness on workplace safety climate perceptions and
health care worker outcomes. Safety Science, 51, 138–147. https://doi.org/10.1016/j.
ssci.2012.06.004
Mearns, K., Hope, L., Ford, M. T., & Tetrick, L. E. (2010). Investment in workforce health:
Exploring the implications for workforce safety climate and commitment. Accident
Analysis and Prevention, 42, 1445–1454.
Mearns, K., Whitaker, S. M., & Flin, R. (2003). Safety climate, safety management prac-
tice and safety performance in offshore environments. Safety Science, 41(8), 641–
680. https://doi.org/10.1016/S0925-7535(02)00011-5
Milijić, N., Mihajlović, I., Nikolić, D., & Živković, Ž. (2014). Multicriteria analysis of safe-
ty climate measurements at workplaces in production industries in Serbia. Interna-
tional Journal of Industrial Ergonomics, 44(4), 510–519. https://doi.org/10.1016/j.
ergon.2014.03.004
Nahrgang, J. D., Morgeson, F. P., & Hofmann, D. A. (2011). Safety at work: A meta-ana-
lytic investigation of the link between job demands, job resources, burnout, engage-
ment, and safety outcomes. Journal of Applied Psychology, 96, 71–94.
Neal, A., & Griffin, M. A. (2006). A study of the lagged relationships among safety climate,
safety motivation, safety behavior, and accidents at the individual and group levels.
Journal of Applied Psychology, 91(4), 946–953.
Neal, A., Griffin, M. A., & Hart, P. M. (2000). The impact of organizational climate on
safety climate and individual behavior. Safety Science, 34, 99–109.
Newman, A., Donohue, R., & Eva, N. (2017). Psychological safety: A systematic review of
the literature. Human Resource Management Review, 27, 521–535.
Nixon, A. E., Lanz, J. J., Manapragada, A., Bruk-Lee, V., Schantz, A., & Rodriguez, J. F.
(2015). Nurse safety: How is safety climate related to affect and attitude? Work &
We’ve Got (Safety) Issues • 225

Stress, 29(4), 401–419. https://doi-org.proxy.lib.odu.edu/10.1080/02678373.2015.1076536
Ostroff, C., Kinicki, A. J., & Muhammad, R. S. (2013). Organizational culture and climate.
In I. B. Wiener (Ed.), Handbook of psychology (2nd ed., pp. 643–676). New York,
NY: Wiley.
Pan, K. C., Huang, C. Y., Lin, S. C., & Chen, C. I. (2018). Evaluating safety culture and
related factors on leaving intention of nurses: The mediating effect of emotional
intelligence. International Journal of Organizational Innovation, 11(1), 1–9.
Patton, M. Q. (2002). Qualitative research & evaluation methods (3rd ed.). Thousand
Oaks, CA: Sage.
Ployhart, R. E., & Ward, A.-K. (2011). The “quick start guide” for conducting and pub-
lishing longitudinal research. Journal of Business and Psychology, 26(4), 413–422.
doi:10.1007/s10869-011-9209-6
Podsakoff, P. M., MacKenzie, S. B., Lee, J. Y., & Podsakoff, N. P. (2003). Common method
biases in behavioral research: A critical review of the literature and recommended
remedies. Journal of Applied Psychology, 88, 879–903.
Sawhney, G., Sinclair, R. R., Cox, A. R., Munc, A. H., & Sliter, M. T., (2018). One climate
or many: Examining the structural distinctiveness of safety, health, and stress pre-
vention climate measures. Journal of Occupational and Environmental Medicine,
60, 1015–1025.
Schneider, B. (1975). Organizational climates: An essay. Personnel Psychology, 28, 447–
479.
Schneider, B. (2000). The psychological life of organizations. In N. M. Ashkanasy, C. P. M.
Wilderom, & M. F. Peterson (Eds.), Handbook of organizational culture & climate
(pp. xvii–xxi). Thousand Oaks, CA: Sage.
Schneider, B. (1990). The climate for service: An application of the climate construct. In B. Schneider (Ed.), Organizational climate and culture (pp. 383–412). San Francisco, CA: Jossey-Bass.
Schwatka, N. V., & Rosecrance, J. C. (2016). Safety climate and safety behaviors in the
construction industry: The importance of co-workers commitment to safety. Work:
Journal of Prevention, Assessment & Rehabilitation, 54, 401–413. https://doi.
org/10.3233/WOR-162341
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Sliter, K. A. (2013). Development and validation of a measure of workplace climate for
healthy weight maintenance. Journal of Occupational Health Psychology, 18, 350–
362.
Smith, T. D., Eldridge, F., & DeJoy, D. M. (2016). Safety-specific transformational and pas-
sive leadership influences on firefighter safety climate perceptions and safety behav-
ior outcomes. Safety Science, 86, 92–97. https://doi.org/10.1016/j.ssci.2016.02.019
Spector, P. E. (2019). Do not cross me: Optimizing the use of cross-sectional designs. Journal of Business and Psychology, 34, 125–137. https://doi.org/10.1007/s10869-018-09613-8
Spector, P. E., & Pindek, S. (2016). The future of research methods in work and occu-
pational health psychology. Applied Psychology: An International Review, 65(2),
412–431. https://doi.org/10.1111/apps.12056
Stone-Romero, E. F. (2010). Research strategies in industrial and organizational psychology: Nonexperimental, quasi-experimental, and randomized experimental research
in special purpose and nonspecial purpose settings. In S. Zedeck (Ed.), Handbook of
industrial and organizational psychology (pp. 35–70). Washington, DC: American
Psychological Association Press.
Uy, M. A., Foo, M. D., & Aguinis, H. (2010). Using experience sampling methodology to
advance entrepreneurship theory and research. Organizational Research Methods,
13, 31–54.
Vogus, T. J., Cull, M. J., Hengelbrok, N. E., Modell, S. J., & Epstein, R. A. (2016). Assess-
ing safety culture in child welfare: Evidence from Tennessee. Children and Youth
Services Review, 65, 94–103.
Wang, M.-T., & Degol, J. L. (2016). School climate: A review of the construct, measurement, and impact on student outcomes. Educational Psychology Review, 28, 315–352. https://doi.org/10.1007/s10648-015-9319-1
Zapf, D., Dormann, C., & Frese, M. (1996). Longitudinal studies in organizational stress
research: A review of the literature with reference to methodological issues. Jour-
nal of Occupational Health Psychology, 1, 145–169. https://doi.org/10.1037/1076-8998.1.2.145
Zohar, D. (1980). Safety climate in industrial organizations: Theoretical and applied impli-
cations. Journal of Applied Psychology, 65, 96–102.
Zohar, D. (2000). A group-level model of safety climate: Testing the effects of group cli-
mate on microaccidents in manufacturing jobs. Journal of Applied Psychology, 85,
587–596.
Zohar, D. (2008). Safety climate and beyond: A multi-level multi-climate framework. Safe-
ty Science, 46, 376–387.
Zohar, D. (2010). Thirty years of safety climate research: Reflections and future directions.
Accident Analysis and Prevention, 42, 1517–1522.
Zohar, D. (2014). Safety climate: Conceptualization, measurement, and improvement. In
B. Schneider & K. M. Barbera (Eds.), The Oxford handbook of organizational cli-
mate and culture (pp. 317–334). New York, NY: Oxford University Press.
Zohar, D., & Lee, J. (2016). Testing the effects of safety climate and disruptive children
behavior on school bus drivers performance: A multilevel model. Accident Analysis
and Prevention, 95(Part A), 116–124. https://doi.org/10.1016/j.aap.2016.06.016
BIOGRAPHIES

ABOUT THE AUTHORS

Dr. David G. Allen is Associate Dean for Graduate Programs and Professor of
Management, Entrepreneurship, and Leadership at the Neeley School of Business
at Texas Christian University; Distinguished Research Environment Professor at
Warwick Business School; and Editor-in-Chief of the Journal of Management.
Professor Allen earned his Ph.D. from the Beebe Institute of Personnel and Em-
ployment Relations at Georgia State University. His teaching, research, and con-
sulting cover a wide range of topics related to people and work, with a particular
focus on the flow of human capital into and out of organizations. His award-
winning research has been regularly published in the field’s top journals, such as
Academy of Management Journal, Human Relations, Human Resource Manage-
ment, Journal of Applied Psychology, Journal of Management, Journal of Or-
ganizational Behavior, Organization Science, Organizational Research Methods,
and Personnel Psychology, and he is the author of the book Managing Employee
Turnover: Dispelling Myths and Fostering Evidence-Based Retention Strategies.
Professor Allen is a Fellow of the American Psychological Association, the Soci-
ety for Industrial and Organizational Psychology, and the Southern Management
Association.
Tiancheng (Allen) Chen has a Master of Professional Studies degree in Industrial and Organizational Psychology from the University of Maryland, College Park and is currently a doctoral student at George Mason University. Allen is
also a student member of the Society for Industrial Organizational Psychology
and the Personnel Testing Council Metropolitan Washington. Allen’s research in-
terests are leadership, teams, and organizational climate and culture. Allen’s main
publication is co-authoring a book chapter in the Handbook of Multilevel Theory,
Measurement, and Analysis.

Dr. Angelo DeNisi is the Albert Harry Cohen Chair in Business Administration
at Tulane University, where he also served a six-year term as Dean of the A.B.
Freeman School of Business. After receiving his Ph.D. in Industrial/Organiza-
tional Psychology from Purdue University in 1977, he served as a faculty mem-
ber at Kent State, the University of South Carolina, Rutgers, and Texas A&M
University before moving to Tulane. His research interests include performance
appraisal and performance management, as well as expatriate management, and
his research has been funded by the National Science Foundation, the U.S. Army
Research Institute, several state agencies and several industry groups in the U.S.
He has also served as President of the Society for Industrial and Organizational
Psychology (SIOP), as well as President of the Academy of Management (AOM);
he has chaired both the Organizational Behavior and the Human Resources Di-
visions of the AOM, and he is a Fellow of the Academy of Management, SIOP,
and the American Psychological Association. He has published more than a doz-
en book chapters, and more than 80 articles in refereed journals, most of them
in top academic journals such as the Academy of Management Journal (AMJ),
the Academy of Management Review (AMR), the Journal of Applied Psychol-
ogy (JAP), the Journal of Personality and Social Psychology and Psychological
Bulletin. His research has been recognized with awards from several divisions
of the AOM, including winning the 2016 Herbert Heneman Lifetime Contribu-
tion Award from the Human Resources Division, and SIOP named him the co-
winner of the 2005 Distinguished Lifetime Scientific Contribution Award. He also
serves, or has served on a number of Editorial Boards, including AMJ, AMR, JAP,
Journal of Management, Entrepreneurship Theory and Practice, and Journal of
Organizational Behavior. He was Editor of AMJ from 1994 to 1996, and was re-
cently named Co-Editor of the SIOP Organizational Frontiers Series, with Kevin
Murphy.

Ian N. Fairbanks is a graduate student and teaching assistant at Clemson University. He is pursuing a Master of Science in Applied Psychology with an emphasis
in industrial-organizational psychology. His research interests are in personality
and individual differences, particularly in their application to personnel selection
and training. He is a member of the Society for Industrial and Organizational
Psychology.
Dr. Gerald R. Ferris is the Francis Eppes Professor of Management, Professor of Psychology, and Professor of Sport Management at Florida State University.
Before accepting this chaired position, he held the Robert M. Hearin Chair of
Business Administration and was Professor of Management and Acting Associate
Dean for Faculty and Research in the School of Business Administration at the
University of Mississippi from 1999–2000. Before that, he served as Professor of
Labor and Industrial Relations, of Business Administration, and of Psychology
at the University of Illinois at Urbana-Champaign from 1989–1999, and as the
Director of the Center for Human Resource Management at the University of Il-
linois from 1991–1996. Ferris received a Ph.D. in Business Administration from
the University of Illinois at Urbana-Champaign. He has research interests in the
areas of social influence in organizations, performance evaluation, and reputa-
tion in organizational contexts. Ferris is the author of numerous articles published
in such scholarly journals as the Journal of Applied Psychology, Organizational
Behavior and Human Decision Processes, Personnel Psychology, the Academy of
Management Journal, the Journal of Management, and the Academy of Manage-
ment Review. Ferris served as editor of the annual series, Research in Person-
nel and Human Resources Management, from 1981–2003. He has authored or
edited several books including Political Skill at Work, Handbook of Human Re-
source Management, Strategy and Human Resources Management, and Method
& Analysis in Organizational Research. Ferris has been the recipient of many
distinctions and honors, and in 2001 was the recipient of the Heneman Career
Achievement Award, and in 2010 was the recipient of the Thomas A. Mahoney
Mentoring Award, both from the Human Resources Division of the Academy of
Management.

Dr. Julie I. Hancock is Assistant Professor at the G. Brint Ryan College of Busi-
ness, University of North Texas. She holds a Ph.D. in Business Administration
from the University of Memphis. Her primary research interests include the flow
of people in organizations, collective turnover, perceived organizational sup-
port, and pro-social rule breaking. Her work on these topics has been published
in Journal of Management, Journal of Organizational Behavior, Human Rela-
tions, and Human Resource Management Review. Dr. Hancock currently serves
on the Academy of Management HR Division Executive Committee as a Repre-
sentative-at-Large.

Wayne A. Hochwarter is the Jim Moran Professor of Organizational Behavior in the Department of Management, College of Business at Florida State Univer-
sity (FSU). He also is a Research Fellow in the Jim Moran Institute for Global
Entrepreneurship at FSU, and Honorary Research Professor at Australian Catholic
University. Before moving to FSU in 2001, Hochwarter served on the faculties of
Management at Mississippi State University and the University of Alabama. He
received a Ph.D. in Management from FSU. Hochwarter has research interests
in organizational leadership, power, influence, and workplace social dynamics, and his research has been published in such journals as Administrative Science
Quarterly, the Journal of Applied Psychology, Organizational Behavior and Hu-
man Decision Processes, the Academy of Management Review, and the Journal
of Management.

Dr. Allen I. Huffcutt is Caterpillar Professor at Bradley University in Peoria, Illinois. He publishes regularly in the employment interview literature in a variety
of journals including Human Resource Management Review, European Manage-
ment Journal, International Journal of Selection and Assessment, and Person-
nel Assessment and Decisions. This research has addressed core issues such as
reliability, validity, construct assessment, and ethnic group differences. His cur-
rent research focus is on the cognitive processes that underlie responding to Be-
havior Description and Situational Interviews. In addition, he publishes research
on methodological and measurement issues, including meta-analysis and artifact
(e.g., range restriction) correction. Dr. Huffcutt has written various book chapters,
including in the APA Handbook of Industrial and Organizational Psychology and
the Encyclopedia of Industrial-Organizational Psychology. Finally, he is a Fel-
low in the Society for Industrial and Organizational Psychology, and was recently
recognized as one of the top two percent most influential authors (Aguinis et al.,
2017) as measured by textbook citations. He reviews for a number of journals,
including Human Performance, Journal of Business and Psychology, Personnel
Assessment and Decisions, and Journal of Business Research.

Samantha L. Jordan is a Ph.D. candidate in Organizational Behavior/Human Resources Management at Florida State University. She received a B.S. degree in
Psychology at the University of Florida. Jordan has research interests in organi-
zational justice and inclusion, social influence and political processes, and indi-
vidual differences (e.g., grit, narcissism). Her research has been published in such
journals as Group & Organization Management, Human Resource Management
Review, and the Journal of Leadership & Organizational Studies.

Dr. Liam P. Maher is Assistant Professor of Management in the College of Business and Economics at Boise State University. He received a Ph.D. in Manage-
ment at Florida State University. His research interests include political skill,
political will, leadership, and identity. His research can be found in Personnel
Psychology, Annual Review of Organizational Psychology and Organizational
Behavior, Journal of Vocational Behavior, Group & Organization Management,
and Journal of Leadership & Organizational Studies.

Dr. Kevin Murphy holds the Kemmy Chair of Work and Employment Studies
at the University of Limerick. Professor Murphy earned his PhD in Psychology
from The Pennsylvania State University in 1979, and has served on the facul-
ties of Rice University, New York University, Pennsylvania State University and
Colorado State University. He is a Fellow of the American Psychological Asso-
ciation, the Society for Industrial and Organizational Psychology (SIOP) and the
American Psychological Society, and the recipient of SIOP’s 2004 Distinguished
Scientific Contribution Award. He is the author of over one hundred and ninety
articles and book chapters, and author or editor of eleven books, in areas ranging
from psychometrics and statistical analysis to individual differences, performance
assessment and honesty in the workplace. He served as co-Editor of the Taylor &
Francis (previously Erlbaum) Applied Psychology Series and has been appointed
co-editor, with Angelo DeNisi, of the SIOP Organizational Frontiers Series.
He has served as President of SIOP and Editor of Journal of Applied Psychol-
ogy and of Industrial and Organizational Psychology: Perspectives on Science
and Practice, and is a member of numerous editorial boards. Throughout his ca-
reer, Dr. Murphy has worked to advance both research and the application of that
research to solve practical problems in organizations. For example, he served as
both a member and the Chair of the U.S. Department of Defense Advisory Com-
mittee on Military Personnel Testing, and has also served on five U.S. National
Academy of Sciences committees, all of which dealt with problems in the work-
place. He has carried out a number of research projects with military and national
security organizations, dealing with problems ranging from training to applying
research on motivation to problems of nuclear deterrence, and has worked with
numerous private and public-sector organizations to build and evaluate their hu-
man resource management systems.

Dr. Patrick J. Rosopa is an Associate Professor in the Department of Psychology at Clemson University. His substantive research interests are in personality
and cognitive ability, stereotypes and fairness in the workplace, and cross-cultural
issues in organizational research. He also has quantitative research interests in ap-
plied statistical modeling in the behavioral sciences including applications of ma-
chine learning and the use of computer-intensive approaches to evaluate statistical
procedures. Dr. Rosopa’s work has been supported by $4.1 million in grants from
various organizations including Alcon, BMW, and the National Science Founda-
tion. Dr. Rosopa has published in various peer-reviewed journals including Psy-
chological Methods, Organizational Research Methods, Journal of Modern Ap-
plied Statistical Methods, Human Resource Management Review, Journal of
Managerial Psychology, Journal of Vocational Behavior, Human Performance,
and Personality and Individual Differences. In addition, he has co-authored a sta-
tistics textbook titled Statistical Reasoning in the Behavioral Sciences, published
by Wiley in 2010 and 2018. Dr. Rosopa serves on the editorial board of Human
Resource Management Review and Organizational Research Methods. He also
serves as Associate Editor-Methodology for Journal of Managerial Psychology.
Dr. Rosopa is a member of the American Psychological Association, Association
for Psychological Science, and Society for Industrial and Organizational Psychol-
ogy.

Zachary A. Russell is Assistant Professor of Management in the Department of Management and Entrepreneurship, Williams College of Business at Xavier
University. He received a Ph.D. in Management at Florida State University. His
research interests include reputation, social influence, organizational politics, hu-
man resource practice implementation, and labor unions. His research has been
published in Human Resource Management Review, Journal of Leadership &
Organizational Studies, Journal of Organizational Effectiveness: People and Per-
formance, and Journal of Labor Research.

Dr. Gargi Sawhney is Assistant Professor at the University of Minnesota Duluth. Her research interests fall within the realm of occupational stress and occupa-
tional safety. Dr. Sawhney’s research has been published in various peer-reviewed
outlets, including Journal of Business and Psychology, Journal of Occupational
Health Psychology, and Journal of Positive Psychology.

Dr. Neal Schmitt is Emeritus Professor of Psychology and Management at Michigan State University. He was editor of Journal of Applied Psychology from
1988–1994 and has served on a dozen editorial boards. He has received the So-
ciety for Industrial and Organizational Psychology’s Distinguished Scientific
Contributions Award (1999) and its Distinguished Service Contributions Award
(1998). In 2014, he was named a James McKeen Cattell Fellow of the American
Psychological Society (APS). He served as the Society’s President in 1989–90
and as the President of Division 5 (Measurement, Evaluation, and Statistics) of
the American Psychological Association (APA). Schmitt is a Fellow of Divisions
5 and 14, APA, and APS. He was also awarded the Heneman Career Achievement
Award and the Career Mentoring Award from the Human Resources Division of
the Academy of Management and Distinguished Career Award from the Research
Methods Division of the Academy of Management. He has coauthored three text-
books, Staffing Organizations with Ben Schneider and Rob Ployhart, Research
Methods in Human Resource Management with Richard Klimoski, Personnel Se-
lection with David Chan, edited the Handbook of Assessment and Selection, and
co-edited Personnel Selection in Organizations with Walter Borman and Mea-
surement and Data Analysis with Fritz Drasgow and published approximately
250 peer-reviewed articles and book chapters. His current research centers on the
effectiveness of organizations’ selection procedures, college admissions process-
es, and the outcomes of these procedures. He is the current chair of the Defense
Advisory Committee for Military Personnel Testing and chair of the Publications
Committee of the International Testing Commission.
Dr. Amber N. Schroeder is an assistant professor of psychology at The University of Texas at Arlington. She earned an M.S. and Ph.D. in Industrial-Organi-
zational Psychology from Clemson University after completing a B.A. in psy-
chology from Texas A&M University. She has published in elite, peer-reviewed
journals, including the Journal of Applied Psychology, Psychological Methods,
Psychological Bulletin, Journal of Occupational Health Psychology, Journal of
Managerial Psychology, and Computers in Human Behavior, among others. She
has also served as a PI on a National Science Foundation grant-funded project
and as a program evaluator on a U.S. Department of Education Race to the Top
grant. Dr. Schroeder’s research focuses primarily on the impact of technology
use in organizational settings, with published articles on topics such as the use of
web-based job applicant screening (i.e., cybervetting) and employee engagement
in cybermisbehavior (e.g., online incivility), as well as on approaches for detect-
ing and managing heteroscedasticity. Dr. Schroeder is a member of the Society
for Industrial and Organizational Psychology (SIOP) and the Association for Psy-
chological Science (APS).

Dr. Robert R. Sinclair is Professor of Industrial-Organizational Psychology at Clemson University. Prior to arriving at Clemson in 2008, he was a member of
the faculty at Portland State University (2000–2008) and the University of Tulsa
(1995–1999). Bob currently serves as the Founding Editor-in-Chief for Occu-
pational Health Science, as an Associate Editor for the Journal of Business and
Psychology and is a founding member and past-president of the Society for Occu-
pational Health Psychology. Bob has published over 70 book chapters and articles
in leading journals such as the Journal of Applied Psychology, Journal of Organi-
zational Behavior, Journal of Occupational and Organizational Psychology, and
Journal of Occupational Health Psychology. He also has published four edited
volumes including Contemporary Occupational Health Psychology: Global Per-
spectives on Research and Practice, Volume 2 (2012 with Houdmont & Leka) and
Volume 3 (with Leka), Building Psychological Resilience in Military Personnel:
Theory and Practice (2013, with Britt), and Research Methods in Occupational
Health Psychology: Measurement, Design, and Data Analysis (2012, with Wang
and Tetrick).

Dr. Eugene F. Stone-Romero is Research Professor at the Anderson Graduate School of Management, University of New Mexico. He is a Fellow of the Society
for Industrial and Organizational Psychology, the Association for Psychological
Science, and the American Psychological Association. He served as an Associ-
ate Editor of the Journal of Applied Psychology and as a member of numerous
editorial boards. Stone-Romero received the Distinguished Career Award of the
Research Methods Division of the Academy of Management in recognition of
publications in the field of research methods. In addition, he received the Thomas
Mahoney Career Mentoring Award from the Human Resource Division of the
Academy of Management in recognition of lifelong mentoring of doctoral students in human resource management. Moreover, he received the Kenneth and
Mamie Clark Award of American Psychological Association of Graduate Students
(APAGS) in recognition of outstanding contributions to the professional devel-
opment of ethnic minority graduate students. The results of his research have
appeared in such outlets as the Journal of Applied Psychology, Organizational
Behavior and Human Performance, Personnel Psychology, Journal of Vocational
Behavior, Academy of Management Journal, Journal of Management, Education-
al and Psychological Measurement, Journal of Educational Psychology, Research
in Personnel and Human Resources Management, Applied Psychology: An Inter-
national Review, and the Journal of Applied Social Psychology. He is also the au-
thor of numerous book chapters dealing with issues germane to the related fields
of research methods, human resources management, industrial and organizational
psychology, and organizational behavior. Stone-Romero is the author of a chap-
ter on research methods in the APA Handbook of Industrial and Organizational
Psychology. What’s more, he is the author of a book titled Research Methods in
Organizational Behavior, and the co-author of books titled Job Satisfaction: How
People Feel about Their Jobs and How It Affects Their Performance, and The
Influence of Culture on Human Resource Management Processes and Practices.

Dr. Lois Tetrick is University Professor in the Industrial and Organizational Psy-
chology Program, George Mason University. She is a former president of the
Society for Industrial and Organizational Psychology and a founding member of
the Society for Occupational Health Psychology. Dr. Tetrick is a fellow of the Eu-
ropean Academy of Occupational Health Psychology, the American Psychologi-
cal Association, the Society for Industrial and Organizational Psychology and the
Association for Psychological Science. Dr. Tetrick is a past editor of the Journal
of Occupational Health Psychology and the Journal of Managerial Psychology,
and served as Associate Editor of the Journal of Applied Psychology. Dr. Tetrick
has edited several books including The employment relationship: Examining psy-
chological and contextual perspectives with Jackie Coyle-Shapiro, Lynn Shore,
and Susan Taylor; The Employee-Organization Relationship: Applications for the
21st Century with Lynn Shore and Jackie Coyle-Shapiro; the Handbook of Occu-
pational Health Psychology (1st and 2nd editions) with James C. Quick; Health and
Safety in Organizations with David Hofmann; Research Methods in Occupational
Health Psychology: Measurement, Design and Data Analysis with Bob Sinclair
and Mo Wang and two volumes on cybersecurity response teams: Psychosocial
dynamics of cybersecurity and Improving Social Maturity of Cybersecurity inci-
dent response teams with S. J. Zaccaro, R. D. Dalal, and colleagues. In addition,
she has published numerous chapters and journal articles on topics related to her
research interests in occupational health and safety, occupational stress, the work-
family interface, psychological contracts, social exchange theory and reciprocity,
organizational commitment, and organizational change and development.
Julia H. Whitaker is a doctoral student at The University of Texas at Arlington. She has an M.S. in experimental psychology from The University of Texas at
Arlington, as well as a B.S. in psychology and a certificate in Human Resource
Management from Indiana University Purdue University Indianapolis. Her re-
search interests include organizational technology use, with a special focus on
the use of online information for employment purposes (i.e., cybervetting). She
has co-authored a book chapter examining various forms of cyberdeviance (e.g.,
cyberloafing, cybercrime), and has presented at national conferences on topics
such as applicant reactions to cybervetting procedures and rater cognition in cy-
bervetting-based evaluation. Julia is a member of the Society for Industrial and
Organizational Psychology.

Phoebe Xoxakos is a graduate student and teaching assistant in the Department of Psychology at Clemson University. She is pursuing a Ph.D. in Industrial-Or-
ganizational Psychology. Her research interests relate to diversity including ways
to enhance diversity climate; the effects of oppressive systems such as sexism,
ageism, and racism; and quantitative methods. She is a member of the Society for
Industrial and Organizational Psychology.
