A Comparison of Intraclass Correlation Coefficients, rwg(j), Hierarchical Linear Modeling, Within- and Between-Analysis, and Random Group Resampling
Abstract
Researchers investigating organizations and leadership in particular are increasingly being called
upon to theorize multilevel models and to utilize multilevel data analytic techniques. However, the
literature provides relatively little guidance for researchers to identify which of the multilevel
methodologies are appropriate for their particular questions. In this final article, the statistical
procedures used in the multilevel data analyses in the previous articles of this special issue are
compared. Specifically, intraclass correlation coefficients (ICCs), rwg( j), hierarchical linear modeling
(HLM), within- and between-analysis (WABA), and random group resampling (RGR) are examined
and their results and conclusions discussed. Following comparisons of these methods, recommendations for their use are presented. © 2002 Elsevier Science Inc. All rights reserved.
1. Introduction
The importance of theorizing and testing the levels of analysis at which variables and
relationships operate has been emphasized in the literature for several decades (e.g.,
Dansereau, Alutto, & Yammarino, 1984; Klein, Dansereau, & Hall, 1994; Roberts, Hulin,
& Rousseau, 1978; Rousseau, 1985). However, relatively few published studies have actually
conceptualized and tested theories at multiple levels of analysis. This is despite the fact that
there are multiple data-analytic options for conducting multilevel empirical analyses. Quite
possibly, the lack of published empirical analyses could be due to a dearth of knowledge
concerning the existence of the methods, uncertainty about the appropriateness of the
methods, or limitations of the methods. The intention of this article is to clarify questions
and address concerns related to multilevel methodologies by contrasting the methods used in
the previous articles in this special issue.
This article is organized around the different methods used in the previous articles. First,
the basic principles of intraclass correlation coefficients (ICCs), the rwg( j) coefficient (James,
Demaree, & Wolf, 1984), hierarchical linear modeling (HLM; Bryk & Raudenbush, 1992),
within- and between-analysis (WABA; Dansereau et al., 1984), and random group resampling
(RGR) are each briefly presented in separate sections. Following the introduction to each
method, the hypotheses and results of the studies utilizing the method are summarized. The
significant conclusions and theoretical units of each article, as well as the assumptions of each
method, are then addressed. Finally, a discussion of the strengths and weaknesses of each
methodology is presented, and recommendations are made in an attempt to help researchers
identify the appropriate method(s) for their research questions. For clarity, when discussing
the various methodologies I will be referring to ‘‘individuals’’ and ‘‘groups’’ since these are
the levels used in the other articles in this issue. However, readers should recognize that the
methods apply to any data set where there are two or more nested levels (e.g., individuals in
groups, groups in departments, departments in organizations, etc.).
2.1. Methodology
ICC(1) estimates the proportion of the total variance in a measure that is accounted for by group membership and is calculated from the one-way ANOVA mean squares (Eq. (1)):

$$\mathrm{ICC}(1) = \frac{MS_B - MS_W}{MS_B + (n_g - 1)\,MS_W} \qquad (1)$$

where MSB is the between-group mean square, MSW is the within-group mean square, and ng is the group size (Bliese, 2000b).

ICC(2) estimates the reliability of group means and is calculated as follows (Bliese, 1998; Glick & Roberts, 1984) (Eq. (2a)):

$$\mathrm{ICC}(2) = \frac{MS_B - MS_W}{MS_B} \qquad (2a)$$

When group sizes are large, ICC(2) can also be calculated using the Spearman–Brown formula in conjunction with the ICC(1) and group size (Bliese, 1998; Shrout & Fleiss, 1979) (Eq. (2b)):

$$\mathrm{ICC}(2) = \frac{n_g \times \mathrm{ICC}(1)}{1 + (n_g - 1)\times \mathrm{ICC}(1)} \qquad (2b)$$
Since large group sizes generally result in more stable mean scores, it is possible to have
high ICC(2) values and low ICC(1) values (James, 1982). Both of these equations assume
equal group sizes; to calculate ICCs in samples with uneven group sizes, see Bliese and Halverson (1998b).
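To make Eqs. (1)–(2b) concrete, the following is a minimal sketch in Python (the article itself presents no code; the data layout of equal-sized groups of scores is an assumption for illustration):

```python
import numpy as np

def anova_mean_squares(groups):
    """One-way ANOVA mean squares for a list of equal-sized groups of scores."""
    k = len(groups)                       # number of groups
    n_g = len(groups[0])                  # common group size (assumed equal)
    grand_mean = np.concatenate(groups).mean()
    group_means = np.array([g.mean() for g in groups])
    ms_b = n_g * np.sum((group_means - grand_mean) ** 2) / (k - 1)
    ms_w = sum(np.sum((g - g.mean()) ** 2) for g in groups) / (k * (n_g - 1))
    return ms_b, ms_w

def icc1(ms_b, ms_w, n_g):
    return (ms_b - ms_w) / (ms_b + (n_g - 1) * ms_w)          # Eq. (1)

def icc2(ms_b, ms_w):
    return (ms_b - ms_w) / ms_b                               # Eq. (2a)

def icc2_spearman_brown(icc1_value, n_g):
    return (n_g * icc1_value) / (1 + (n_g - 1) * icc1_value)  # Eq. (2b)

# Hypothetical data: three groups of three raters each.
groups = [np.array([2., 3., 3.]), np.array([4., 4., 5.]), np.array([1., 2., 2.])]
ms_b, ms_w = anova_mean_squares(groups)
print(icc1(ms_b, ms_w, 3))                                # .84
print(icc2(ms_b, ms_w))                                   # .94
print(icc2_spearman_brown(icc1(ms_b, ms_w, 3), 3))        # .94, matching Eq. (2a)
```

With equal group sizes, Eqs. (2a) and (2b) return identical values, which the last two lines confirm.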
The rwg(j) index for within-group agreement on a J-item measure is calculated as (Eq. (3)):

$$r_{wg(j)} = \frac{J\left(1 - \bar{s}_{xj}^{2}/\sigma_{E}^{2}\right)}{J\left(1 - \bar{s}_{xj}^{2}/\sigma_{E}^{2}\right) + \bar{s}_{xj}^{2}/\sigma_{E}^{2}} \qquad (3)$$

where rwg(j) is the within-group agreement coefficient for judges' mean scores based on J items, s̄xj² is the mean of the observed variances on the J items, and σE² is the expected variance of a hypothesized null distribution (James et al., 1984, p. 88). In the Bliese et al.
(this issue) article, each group in the sample had three rwg( j) coefficients, one for each
variable (task significance, leadership climate, and hostility). These were averaged across
groups and then interpreted.
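As an illustration (a sketch only; Python and the variable names are not from the article), Eq. (3) can be computed for one group's ratings matrix, using the rectangular null variance σE² = (A² − 1)/12 for an A-point scale:

```python
import numpy as np

def rwg_j(ratings, scale_points):
    """Eq. (3) for one group: ratings is a (judges x J items) array.
    Following the usual convention, observed variance at or above the
    null variance is taken to indicate no agreement (rwg(j) = 0)."""
    j = ratings.shape[1]                          # number of items, J
    s2_mean = ratings.var(axis=0, ddof=1).mean()  # mean observed item variance
    sigma2_e = (scale_points ** 2 - 1) / 12       # rectangular null variance
    m = s2_mean / sigma2_e
    return (j * (1 - m)) / (j * (1 - m) + m) if m < 1 else 0.0

# High agreement: five judges give nearly identical ratings on three items.
group = np.array([[4, 4, 5], [4, 5, 5], [4, 4, 5], [5, 4, 5], [4, 4, 4]])
print(rwg_j(group, scale_points=5))   # about .96
```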
The raw score analyses conducted at the individual level found no evidence for moderation.
However, the group level analyses (and the results of F tests) found some evidence supporting
aggregation. ICC(1) results indicated very little group-level variability for leadership climate,
task significance, and hostility (only 5% to 9% of the variance in individuals' responses was a
function of group membership), which correspondingly indicates considerable individual level
variability. ICC(2) results varied from .70 to .80, indicating that the group means were reliable
and were differentiated from one another. These results offer some support for the aggregation
of the data to the group level, but the results are certainly not strong.
Averaged rwg( j) results (r̄wg( j)) indicated strong average within-group agreement for
leadership climate (.87), but weak average within-group agreement for task significance and
hostility (.58 and .56, respectively). Thus, using r̄wg( j) and a .70 criterion, support for
aggregation was obtained for only leadership climate. However, these results and conclusions
should be viewed with caution, since the r̄wg( j) coefficient was used and reported for the sample
as a whole, not for each of the groups separately. It is quite likely that some groups should not
have had their data aggregated based on low indices of agreement.
2.4. Assumptions
ICCs are based on variance partitioning and therefore are subject to essentially the same
assumptions as analysis of variance (ANOVA). These include homogeneity of variance (the
variances within the units are statistically the same), normality (the population scores are
normally distributed), statistical independence (the observations are independent), and
measures with equal psychological intervals.
The rwg( j) coefficient is a measure of interrater agreement. It was intended to be used in
analyzing variables that have discrete response formats, such as a 5- or 7-point response scale.
James et al. (1984) recommended that it not be used with a shorter response format (e.g., a
2-point response scale) as artificially low estimates of interrater agreement may result. Other
conditions that should be met if using rwg( j) include employing measures that ‘‘have acceptable
psychometric properties’’ and approximately equal-interval measurement (James et al., 1984,
p. 85), and having empirical evidence that supports the null distribution (pp. 93–94). Also, the
distribution of obtained responses should not be bimodal or multimodal (James & LeBreton,
2001). Finally, the rwg( j) coefficient should only be applied to measures with ‘‘‘essentially
parallel’ indicators of the same construct’’ (James et al., 1984, p. 88), implying that the
measure should be unidimensional.
ICCs are used to evaluate group-level properties of data, or the ratio of between-group
variance to total variance (the ICC(1)) and the group-mean reliability (the ICC(2)). The
ICC(1) coefficient estimates the variance in an individual’s response that can be explained by
group membership, or the degree to which a measure varies between versus within groups.
The ICC(2) coefficient evaluates the internal consistency reliability of the group means in a
sample. In contrast, the rwg( j) coefficient is designed to evaluate intragroup rater agreement
(see Bliese, 2000b; James et al., 1993; James & LeBreton, 2001, for a discussion of the
distinction between agreement and reliability). In other words, rwg( j) assesses the consensus
among raters within a single unit for a single variable. Thus, although all three indices
measure group-level properties of data, ICCs are omnibus measures that apply across all
groups, whereas the rwg( j) coefficient applies only to single groups (i.e., an rwg( j) measure is
obtained for each group in the sample).
The strength of ICCs is that they allow determination of how much of the total varia-
bility is due to group membership (ICC(1)) and whether this variability results in reliable
group means (ICC(2)). ICC(1) values are not affected by group size (Bliese, 1998; Bliese &
Halverson, 1998b). However, ICC(2) values are affected by group size since they are based
on the Spearman–Brown formula. A simulation by Bliese (1998) demonstrated that larger
groups and higher ICC(1) values both resulted in more reliable estimates (i.e., higher
ICC(2) values).
The strength of rwg( j) lies in its assessment of a separate within-group interrater agreement
measure for each group that is not based on intergroup variability (James et al., 1993). The
rwg( j) coefficient determines whether aggregation is justified by comparing the variability of
the variable of interest (within a specific group or unit) to an expected variance. The expected
variance utilized in rwg( j) calculations should be explicitly considered.
Many researchers default to comparing the observed group variance with the variance of a
rectangular distribution (Schriesheim et al., 2001), despite James et al.’s (1984) cautions. The
rectangular distribution assumes completely random responses and is thus characterized by an
equal number of responses for each category (e.g., an equal number of 1, 2, 3, 4, and 5
responses). As noted by James et al., since responses are generally not random, the result is
that the rwg( j) agreement coefficient is typically overstated. However, if raters’ responses are
polarized, or at the two extremes of the response scale, the rectangular distribution will result
in an understatement of rwg( j) (Lindell, Brandt, & Whitney, 1999). A better alternative is to
use theory, past research, and/or additional data (current data should not be used) to specify
the null distribution to be used in the analysis.
Relatedly, since the sampling distribution for various null distributions is not known,
determining the ‘‘significance’’ of rwg( j) values is difficult. A .70 criterion has been
commonly used (e.g., George, 1990), but adequate support and justification for this value
have not been provided. However, a Monte Carlo simulation developed by Charnes and
Schriesheim (1995) can be used to evaluate statistical significance. Quantiles of the rwg( j)
sampling distribution are estimated so that the obtained rwg( j) value can be tested to determine
if it is significantly greater than would be expected by chance. (Parenthetically, it should be
noted that using the Monte Carlo simulation, Schriesheim, Cogliser, & Neider, 1995, found
rwg( j) coefficients above .70 that were not statistically significant.)
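A rough sketch of this idea (in the spirit of, not reproducing, the Charnes and Schriesheim program; all names are illustrative) reuses the rwg_j function sketched earlier to approximate null quantiles under purely random responding:

```python
import numpy as np

rng = np.random.default_rng(0)

def rwg_null_quantile(n_judges, n_items, scale_points, reps=10_000, q=0.95):
    """Approximate the q-th quantile of rwg(j) when judges respond at random."""
    sims = []
    for _ in range(reps):
        # Random uniform responses on a discrete A-point scale.
        ratings = rng.integers(1, scale_points + 1, size=(n_judges, n_items))
        sims.append(rwg_j(ratings, scale_points))
    return np.quantile(sims, q)

# An observed rwg(j) above this value would be unlikely to arise by chance.
print(rwg_null_quantile(n_judges=10, n_items=5, scale_points=5))
```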
Several cautions related to the applicability of the methodology need to be noted.
Since rwg( j) applies to each group separately, it does not allow for comparisons between
groups. As we saw in the introductory article, combining rwg( j) results to draw inferences
about aggregating data across groups is a problem because the values for each group must be
summarized using a summary statistic such as the mean. Whether it is appropriate to
summarize the rwg( j) coefficient in this way is debatable. To date, there is no prescribed way
to combine the rwg( j) information across groups to infer an appropriate level of analysis. For
example, whether results generalize across all groups is a question researchers may
want to answer but will be unable to with the current index. Additionally, rwg( j) is only able to
assess one variable at a time, and is unable to evaluate interaction terms or two or more
predictors simultaneously. Thus, while rwg( j) can be used to determine whether data within a
group should be aggregated to a higher level of analysis, it is only useful for evaluating single
variables. That is, it is not able to evaluate the relationship among two or more variables.
However, these are not weaknesses of the method per se, as it was not intended or designed to
do these things.
A caution related to the interpretation of the r̄wg( j) coefficients needs to also be noted. The
coefficients presented in the introductory article were averages calculated for all of the groups
in the sample. Since very low rwg( j) coefficients were undoubtedly obtained for some of the
groups on these variables, an argument can be made that the raw score data for these groups
should not have been aggregated (cf. Schriesheim et al., 1995). A low rwg( j) estimate indicates
that raters within the group do not agree, or do not perceive the construct similarly. There are
many possible reasons for this lack of agreement, including alternative levels of analysis and
the existence of subgroups (within which there might be agreement). Aggregating the data
may thus have caused effects to be missed, misidentified, or misinterpreted.
As a final caution, Schriesheim et al. (1995) found that the number of items in a measure
seemed to affect the size and significance of the rwg( j) coefficient such that the magnitude of
the rwg( j) coefficient for measures with more items was greater and was more likely to be
significant. Furthermore, Lindell and colleagues demonstrated that since the rwg( j) index is
based upon the Spearman–Brown formula, it increases as the number of items in the scale
increases (Lindell & Brandt, 1999, 2000; Lindell et al., 1999). Thus, part of the reason why
the leadership climate scale has a higher rwg( j) value than do the other scales is because
leadership climate is based on a scale with 11 items. In contrast, task significance is based on
a 3-item scale and hostility is based on a 5-item scale.
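The arithmetic behind this is easy to verify. Holding the ratio of observed to null variance fixed (a hypothetical value of .5 is assumed here), Eq. (3) grows with the number of items, mirroring the 3-, 5-, and 11-item scales just discussed:

```python
m = 0.5   # hypothetical ratio of observed to null variance, held constant
for j in (3, 5, 11):
    print(j, round((j * (1 - m)) / (j * (1 - m) + m), 2))
# 3 -> 0.75, 5 -> 0.83, 11 -> 0.92: same data quality, higher rwg(j) with more items
```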
3.1. Methodology
The second article in this special issue (Gavin & Hofmann) used HLM (Bryk &
Raudenbush, 1992) to test the proposed moderator effect. HLM is a program to estimate
multilevel random coefficient models. These models evaluate relationships at multiple levels
of analysis and model variance among variables at these different levels. In Gavin and
Hofmann’s article (this issue), leadership climate was assumed to be at a higher level of
analysis than hostility and task significance. That is, leadership climate was treated as a
group-level (Level 2) variable. This Level 2 variable was expected to moderate the
relationship between individual reports of task significance and hostility.
Conceptually, HLM can be thought of as a two-step approach to modeling these multilevel
relationships. Level 1 involves estimating a separate regression for each group including the
individual-level predictor and individual-level outcome. Level 2 models the variance in the
Level 1 intercepts and slopes using the group-level variable. The general HLM model (also
used by Gavin and Hofmann in this issue) is expressed as:
Level 1 (Eq. (4)):

$$Y_{ij} = B_{0j} + B_{1j}X_{ij} + e_{ij}$$

Level 2 (Eqs. (5) and (6)):

$$B_{0j} = \gamma_{00} + \gamma_{01}G_j + u_{0j}$$

$$B_{1j} = \gamma_{10} + \gamma_{11}G_j + u_{1j}$$

In the Level 1 equation, Yij is the outcome measure for individual i in group j, B0j and B1j are the intercept and slope (respectively), Xij is the value on the predictor for individual i in group j, and eij is the residual. In the Level 2 equations, γ00 and γ10 are the Level 2 intercepts, γ01 and γ11 are the Level 2 slopes, Gj is a group-level variable, and u0j and u1j are the Level 2 residuals (see Bryk & Raudenbush, 1992; Hofmann, 1997, for further details).
Eq. (4) evaluates the relationship between the predictor(s) and outcome at the individual
level (individual-level error = eij). Eq. (5) evaluates the main effect across levels, with a
group-level variable predicting variance in an individual-level intercept. In other words,
Eq. (5) evaluates the relationship between a group-level variable and the group mean of the
dependent variable (e.g., hostility). Eq. (6) evaluates the interaction effect across levels, with a
group-level variable predicting variance in an individual-level slope. As Gavin and Hofmann
(this issue) noted, Eqs. (4–6) are estimated simultaneously, and the slopes and intercepts are
allowed to vary randomly across groups. Bliese (2000a, in press) shows how these multiple
equations can be combined and estimated in a single equation using software such as SAS,
S-PLUS, and R.
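For example, the combined model can be estimated in Python with statsmodels (a sketch on simulated data; the article used the HLM program, and all column names and generating coefficients here are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_groups, n_per = 50, 10
group = np.repeat(np.arange(n_groups), n_per)
climate = np.repeat(rng.normal(size=n_groups), n_per)        # group-level G_j
task_sig = rng.normal(size=n_groups * n_per)                 # individual-level X_ij
u0 = np.repeat(rng.normal(scale=0.3, size=n_groups), n_per)  # random intercepts u_0j
u1 = np.repeat(rng.normal(scale=0.2, size=n_groups), n_per)  # random slopes u_1j
hostility = ((-0.3 + u1) * task_sig - 0.2 * climate
             - 0.25 * task_sig * climate + u0
             + rng.normal(size=n_groups * n_per))
df = pd.DataFrame(dict(group=group, climate=climate,
                       task_sig=task_sig, hostility=hostility))

# Fixed part: cross-level main effects and interaction; random part: intercept
# and task_sig slope varying across groups (Eqs. (4)-(6) in a single model).
model = smf.mixedlm("hostility ~ task_sig * climate", df,
                    groups=df["group"], re_formula="~task_sig")
print(model.fit().summary())
```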
Gavin and Hofmann (this issue) proposed several multilevel hypotheses. First, after
controlling for individual-level task significance, group-level task significance was proposed
to incrementally and negatively predict hostility. Secondly, after controlling for both
individual- and group-level task significance, group-level leadership climate was proposed
to incrementally and negatively predict hostility. Finally, group-level leadership climate
was hypothesized to interact with individual-level task significance to predict individual-
level hostility.
Using a two-level model, the total variance in hostility (1.08) was partitioned into between-
and within-group components by estimating the HLM model with no Level 1 or Level 2
predictors. (This has the effect of forcing the within-group variance in hostility into the
Level 1 residual term and the between-group variance in hostility into the Level 2 residual
term; see Bryk & Raudenbush, 1992; Hofmann, 1997). Results indicated that both the within-
and between-group variance (1.02 and 0.06, respectively) were significantly different from
zero and also allowed for an estimate of the ICC (which is equivalent to the ICC(1)). The ICC
estimate for hostility based on the HLM model was .055 [.06/(.06 + 1.02)]. This value was
within rounding error of the ICC(1) estimate of .05 from the ANOVA model presented in
Bliese et al. (this issue).
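Continuing the simulated-data sketch above, the same variance partitioning can be reproduced by fitting an unconditional (null) model and forming the ratio reported in the text:

```python
# Null model: no predictors, random intercept only (uses df and smf from above).
null = smf.mixedlm("hostility ~ 1", df, groups=df["group"]).fit()
tau00 = null.cov_re.iloc[0, 0]    # between-group variance (Level 2 residual)
sigma2 = null.scale               # within-group variance (Level 1 residual)
print(tau00 / (tau00 + sigma2))   # ICC, e.g., .06 / (.06 + 1.02) = .055 in the text
```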
To test their first hypothesis, Gavin and Hofmann (this issue) added task significance to
the Level 1 equation. The γ10 parameter, or the intercept term in the Level 2 equation predicting the Level 1 slopes (B1j), represents the pooled within-group slopes since the Level 1 slopes are regressed onto a unit vector (i.e., there are no predictors in the Level 2 slope models). A t test of the γ10 parameter was significant and negative (γ10 = −.30, t = −10.21, P < .05), supporting the prediction that individual reports of task significance
would be negatively related to individual hostility. Furthermore, the chi-square tests of the
residual terms from the Level 2 equations (u0j and u1j) indicated that the Level 1 intercepts
and slopes varied significantly across groups. This suggests that the task significance–
hostility relationship varies across groups, and sets the stage for subsequent cross-level
interaction analyses.
To test the second hypothesis, aggregated task significance was added to the Level 2
equation predicting the Level 1 intercept (B0j). A t test of the γ01 parameter (the slope of aggregated task significance) was significant and negative (γ01 = −.17, t = −2.04, P < .05),
indicating that groups with high task significance had low average levels of hostility. This
provides evidence of a contextual effect where the aggregate variable differs in meaning from
the lower-level variable (see Bliese, 2000b).
Tests of the third hypothesis involved estimating the HLM model with aggregated
leadership climate added to the Level 2 equation predicting the Level 1 intercept (B0j).
The slope of aggregated leadership climate (γ02) was significant and negative (γ02 = −.36, t = −3.59, P < .05), indicating that average leadership climate was negatively related to the
average level of hostility in the companies.
Tests of the fourth hypothesis were conducted by adding aggregated leadership climate to
the Level 2 equation predicting the Level 1 slope (B1j). The slope of aggregated leadership
climate was significant and negative (γ11 = −.28, t = −2.93, P < .01). Gavin and Hofmann's
(this issue) analysis of the interaction revealed that the task significance–hostility relationship
was negative, and stronger in groups with low levels of leadership climate.
The results from the Level 2 model evaluating the variance in hostility indicated that the
variance between groups was small (0.06), but significantly different from zero, and Gavin
and Hofmann (this issue) thus interpreted this as supporting the examination of group level
predictors of between group variance. The variance within groups was larger (1.02),
supporting the examination of individual level predictors of within-group variance.
In general, all of the hypotheses were supported. Individual-level task significance was
significantly negatively related to hostility. Group-level task significance was significantly
negatively related to hostility (after controlling for individual-level task significance). Group-
level leadership climate was significantly negatively related to hostility (controlling for both
individual- and group-level task significance). Finally, support was found for leadership
climate as a moderator of the task significance–hostility relationship. The task significance–
hostility relationship was negative under both high and low levels of leadership climate, but
stronger for groups with low levels of leadership climate.
3.4. Assumptions
Hofmann (1997, p. 739) notes the following statistical assumptions of HLM (see also Bryk
& Raudenbush, 1992, p. 200):
1. Level 1 residuals (eij) are independent and normally distributed with a mean of zero and variance σ² for every Level 1 unit within each Level 2 unit (residual variance should be the same across groups).
2. The Level 1 predictors (Xij) are independent of the Level 1 residuals (eij).
3. The random errors at Level 2 (u0j, u1j) are multivariate normal, each with a mean of zero, some variance (τqq) and covariance among random elements q and q′ (τqq′), and are independent among Level 2 units.
4. The set of Level 2 predictors (Gj) are independent of every Level 2 residual (u0j, u1j; similar to Assumption 2, but for Level 2).
5. The Level 1 (eij) and Level 2 (u0j, u1j) residuals are independent.
Gavin and Hofmann (this issue) note that HLM uses maximum likelihood to estimate the
error variance components of the Level 2 models, which Hofmann (1997) notes assumes the
variables are multivariate normal.
Gavin and Hofmann’s choice of modeling both individual and group level variability in
hostility seems reasonable.
The results of the rwg( j) and ICC analyses for task significance were similar to those for
hostility (r̄wg( j)=.58, ICC(1)=.08, and ICC(2)=.78), and thus task significance was also used
at the individual level of analysis. However, Gavin and Hofmann argued that the significant
F test result for task significance (combined with the ICC(2) value of .78) indicated that the
group-level properties of task significance needed to be investigated. They therefore used
both individual-level and group-level (aggregated) task significance in their analyses. Lead-
ership climate was aggregated to the group level of analysis in the HLM analysis, based on
somewhat stronger support from the introductory article’s analyses, since r̄wg( j)=.87,
ICC(1)=.09, and ICC(2)=.80. Recall, however, that the strong rwg( j) results may be inflated
due to the number of items in the scale.
HLM has several strengths that have been demonstrated in this issue. One strength is that
the method is well suited to testing ‘‘cross-level moderator effects’’ models where, as in this
issue, leadership climate (group level) is expected to have an impact on the task significance–
hostility slope (cf. Klein & Kozlowski, 2000). Another strength of HLM is that the method
allows researchers to identify and partition different sources of variance in outcome variables,
as Gavin and Hofmann did with hostility in their article (this issue). In fact, Gavin and
Hofmann noted that there needs to be evidence of both between and within variance in the
dependent variable. The magnitude of between group variance in the dependent variable can
be estimated using HLM (see Hofmann, 1997, pp. 732–733). Unfortunately, however, the
methodology is limited in that the variance in independent variables and in moderators cannot
be partitioned and evaluated. This prevents researchers from determining where the variance
in the independent variables and/or the moderator variables lies (e.g., between-groups,
within-groups, both between- and within-groups, or neither between- nor within-groups).
HLM also provides a method for analyzing longitudinal relationships. Interunit differences
in intraunit change are evaluated using HLM (the Level 1 model would include multiple
observations over time within a unit; the Level 2 model would look at the sample of multiple
units; see Hofmann, 1997, p. 737), but a problem occurs because the method assumes
uncorrelated residuals (Level 1 residuals are assumed to be independent). This is an
assumption that longitudinal data are likely to violate (cf. James, 1995). (However, Bliese,
in press, notes that other random coefficient models can model autocorrelation in their
procedures; this highlights that HLM is but one variant of a more general class of models and
that other variants need to be considered as data analytic options.)
HLM also has other limitations or weaknesses, several of which were pointed out by James
(1995). One is the assumption of multivariate normality that is involved in the use of
maximum likelihood estimation. This assumption is problematic when interactions are
present, as they are likely to violate the normality assumption. A second issue noted by
James is that HLM treats independent variables as random variables, and thus the possibility arises that independent variables can be correlated with their associated residuals (in violation of a basic assumption of regression models).
4.1. Methodology
The third article used WABA (Dansereau et al., 1984) to examine the levels at which the
variables and relationships were operating and whether a moderated relationship existed.
WABA employs the logic of ANOVA to represent analytic entities (i.e., dyads, groups, or
other organizational units). Data on each variable are divided into within-entity (deviation
from the entity average) and between-entity (the entity average) scores. These within- and
between-entity scores are then compared using tests of both statistical significance (i.e., t, F,
and Z tests) and practical significance (i.e., E, A, and R tests). Unlike the tests of statistical
significance that incorporate sample size (degrees of freedom), the tests of practical
significance are geometrically based and are not influenced by sample size (see Dansereau
et al., 1984; Yammarino & Markham, 1992, for more information).
The fundamental WABA equation decomposes the total correlation between any two variables x and y into within- and between-entity components (Eq. (7)):

$$r_{Txy} = \eta_{Bx}\eta_{By}r_{Bxy} + \eta_{Wx}\eta_{Wy}r_{Wxy} \qquad (7)$$

where ηBx and ηBy are the corresponding between-entity etas for variables x and y, ηWx and ηWy are the corresponding within-entity etas, rBxy and rWxy are the corresponding between-entity and within-entity correlations, and rTxy is the total (raw score) correlation.
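A sketch of these quantities (Python; the array names are assumptions) computes the within- and between-entity scores, etas, and correlations, and confirms that they recombine into the raw score correlation of Eq. (7):

```python
import numpy as np

def waba_decomposition(x, y, group):
    """Within/between etas and correlations for x and y; group holds entity codes."""
    xb, yb = np.empty_like(x, dtype=float), np.empty_like(y, dtype=float)
    for g in np.unique(group):
        idx = group == g
        xb[idx], yb[idx] = x[idx].mean(), y[idx].mean()   # between-entity scores
    xw, yw = x - xb, y - yb                               # within-entity deviations

    def etas(total, between, within):
        ss = np.sum((total - total.mean()) ** 2)
        return (np.sqrt(np.sum((between - total.mean()) ** 2) / ss),
                np.sqrt(np.sum(within ** 2) / ss))

    eta_bx, eta_wx = etas(x, xb, xw)
    eta_by, eta_wy = etas(y, yb, yw)
    r_b = np.corrcoef(xb, yb)[0, 1]   # between-entity correlation
    r_w = np.corrcoef(xw, yw)[0, 1]   # within-entity correlation
    r_t = eta_bx * eta_by * r_b + eta_wx * eta_wy * r_w   # reproduces r_Txy
    return dict(eta_bx=eta_bx, eta_wx=eta_wx, eta_by=eta_by, eta_wy=eta_wy,
                r_b=r_b, r_w=r_w, r_total=r_t)
```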
The traditional WABA procedure involves three steps to determine whether findings
should be viewed as occurring within-entities, between-entities, both, or neither.
In WABA I, variables are examined separately to determine the amount of variation
within-entities and between-entities. Within- and between-entity etas are calculated and tested
(relative to each other) using F tests of statistical significance and E tests of practical
significance. WABA II involves the examination of covariation within and between entities.
Within- and between-entity correlations are first computed. The magnitude of association for
each of these correlations is tested for practical significance using R tests and for statistical
significance using t tests. Differences among these correlations are then tested using A tests of
practical significance and Z tests of statistical significance. (For more complete information
on the computations and significance tests involved in WABA I and WABA II analyses, see
Dansereau et al., 1984; Yammarino & Markham, 1992.)
In the third and final step of WABA, inferences are drawn. Four overall inferences or
conclusions about variable relationships are possible in WABA: between-entities, within-
entities, both between- and within-entities (i.e., individual differences), or neither (i.e., null).
To draw inferences, the results of the tests of statistical and practical significance conducted in
WABA I and WABA II are considered along with the decomposed raw score correlations
(decomposed into within- and between-entities components; see Dansereau et al., 1984,
pp. 183–185 and 190–200; Yammarino & Markham, 1992, pp. 171–172, for additional
details on drawing inferences).
To test moderated relationships, the multiple relationship analysis (MRA) procedure outlined by Dansereau et al. (1984) can be extended to multivariate form using hierarchical linear multiple regression (see Schriesheim, 1995, for details).
Markham and Halverson (this issue) hypothesized that the variables (task significance,
leadership climate, and psychological hostility) and relationships would operate at the group
level of analysis. Specifically, they proposed that groups with high levels of leadership
climate (referred to hereafter as ‘‘good’’ leadership climate) would report high levels of task
significance but low levels of psychological hostility. Additionally, groups with high levels
of task significance were hypothesized to also experience low levels of psychological
hostility. Finally, they hypothesized a moderated relationship: the relationship between
group task significance and group psychological hostility would differ under poor and good
leadership conditions.
The E tests of the WABA I analyses (presented in Markham and Halverson’s Table 1)
showed that there was practically significant within-group variance in all three variables
(leadership, task significance, and hostility). Additionally, since the within-etas were larger
than the between-etas, inverse F tests were conducted (the within and between etas and
corresponding degrees of freedom are inverted; see Dansereau et al., 1984, pp. 172–175, for
details), none of which were statistically significant. However, the traditional F test results
(with the between-eta in the numerator and the within-eta in the denominator) were all
statistically significant.
Markham and Halverson thus proposed a group-level inference in light of the traditional
F test results and the ICC(1) and ICC(2) values reported in Bliese et al. (this issue). It seems
important to point out that this WABA I conclusion would not be supported using the pro-
cedures and inferences outlined by Dansereau et al. (1984). They state that ‘‘tests of practical
significance always precede tests of statistical significance because the practical tests provide
an indication of which eta correlation is larger (not error)’’ (Dansereau et al., 1984, pp. 174–
175). Following this logic, the smaller eta correlation—in this example, the between eta cor-
relation—is then viewed as error and used in the denominator of the F test. Thus, using the
Dansereau et al. criteria, the WABA I conclusion that would be drawn is that the variance in
leadership, task significance, and hostility is equivocal, or both between and within groups.
The t tests in WABA II indicated that all of the within and between correlations were
statistically significant. Again, departing from the procedure advocated by Dansereau et al.
(1984), R tests of practical significance were not reported or discussed by Markham and
Halverson. However, a simple calculation using the equations specified by Dansereau et al.
(pp. 131–132) indicates that the R test values for all the within and between correlations are
practically significant. (Tests of practical significance are particularly important in this
instance, given the large sample size and its influence on tests of statistical significance.)
The A and Z tests for the relationship between leadership and hostility and for the
relationship between task significance and hostility were both practically and statistically
significant and indicated support for a group level of analysis, but neither the A test nor the
Z test for the relationship between leadership and task significance was significant. Finally,
evaluation of the between and within correlation components showed that only the within
component for the relationship between leadership and task significance was practically
larger (by the A test) than the corresponding between component.
Although Markham and Halverson interpreted these results as supporting a ‘‘wholes’’ or
group level of analysis for two of the three relationships, the results should not be considered
strong, since all of the necessary conditions recommended in Dansereau et al. (1984) were not
met. For example, the E tests in WABA I did not support aggregation, since the within eta
was practically larger than the between eta. Additionally, the A tests for the components did
not find any of the between components to be practically larger than the within components.
The results of the MRA analysis assessing the moderated relationship indicated that
leadership climate moderated the relationship between task significance and hostility.
However, the results should be interpreted with some caution, since all of Dansereau
et al.’s (1984) conditions were not met. To satisfy Dansereau et al.’s criteria using this
example, the following had to be found: (1) the between-groups correlation under the poor
leadership climate condition must be significant (using the R and t tests of practical and
statistical significance, respectively); (2) the between-groups correlation under the poor
leadership climate condition must differ significantly from the other three correlations by
the A and Z tests (the within-groups correlation under poor leadership climate, and the
between- and within-groups correlations under good leadership climate); (3) the other
three correlations must be nonsignificant (R and t tests); and (4) the other three
correlations must not differ significantly from each other (A and Z tests).
Conditions 1, 2, and 4 were satisfied; however, Condition 3 was not satisfied. The between-groups correlation under good leadership climate (rB = −.41) was practically significant (R = .45), while the within-group correlations under both good and poor leadership climate conditions (rW = −.28 and −.33, respectively) were both statistically and practically significant (under good leadership climate, R = .29 and t = −8.66 with 881 degrees of freedom, P < .01; under poor leadership climate, R = .35 and t = −11.66 with 1113 degrees of freedom, P < .01). Thus, there
is only weak support for a moderated relationship using Dansereau et al.’s (1984) criteria.
The moderated relationship that was found using WABA’s MRA was somewhat different
than the moderated relationship the HLM results identified. Under poor leadership climate, a
group level effect was evidenced such that groups with high levels of task significance
experienced low levels of hostility. That is, groups that had important tasks, on average,
reported low hostility even if the leadership climate was poor. Alternatively, under a good
leadership climate, an individual level effect was found such that individuals with low levels
of task significance experienced higher levels of hostility.
4.4. Assumptions
WABA is based on ANOVA and regression, and shares the same assumptions as these
techniques. Thus, the assumptions of WABA are homogeneity of variance, normality,
statistical independence, and equal interval measurement. With the exception of violations of
independence (which have been shown to be problematic; see Kenny & Judd, 1986), the F
and t tests of WABA, like those of ANOVA and regression, are fairly robust to violations of
these other assumptions (Cohen & Cohen, 1983).
WABA allows researchers to test variables and relationships at any hypothesized level of
analysis. Both the independent and dependent variables can be analyzed to assess the level of
analysis at which the variables and/or the relationship(s) are operating (individual, dyad,
group, department, organization, industry, etc.). However, there are some restrictions on the
level of analysis when evaluating relationships, and these will be discussed below.
Markham and Halverson (this issue) hypothesized that the relationship between task
significance and hostility was at the group level of analysis and was significantly different
under poor and good leadership climate conditions. However, WABA permitted the
researchers to empirically assess whether the variables and relationships were operating at
the individual level, within-group level, group level (referred to as between-groups), or some
other level (i.e., none of these levels). In fact, using WABA this study found that the
relationship between task significance and hostility operated at two different levels of analysis
(group and individual), depending on the leadership climate.
Generally speaking, one of the main strengths of WABA is the assessment of both
statistical and practical significance. In addition to traditional F and t tests of statistical
significance, the E, R, and A tests of practical significance conducted in WABA ensure that
the magnitudes of variables and the differences among variables are meaningfully large or
different, and not merely statistically significant due to large sample sizes. The results of the
three steps in WABA (WABA I, WABA II, and the comparison of the decomposed raw score
correlation components using the A test) are all considered in making an inference. The
criteria proposed by Dansereau et al. (1984) for interpreting effects are very conservative, as
there must be evidence of both statistical and practical significance to draw a strong inference
about the level of analysis at which the variables and relationships are operating. The explicit
treatment of practical significance is a unique component of the WABA procedure and
represents one of its strengths.
However, there have been criticisms leveled at the use of the E tests in WABA I. Although
the E tests (a ratio of etas) are geometrically based and not dependent on degrees of freedom
(Dansereau et al., 1984), they are influenced by group size. Specifically, the E test becomes
more conservative as group size increases. Bliese and Halverson (1998b) demonstrated that
the larger the group, the more unlikely it is that the E test will be practically significant. So,
with large groups (as in the current data set), finding practical significance is unlikely.
The third step in WABA, evaluating the components of the total correlation (e.g., the
within-group and the between-group correlation components), is important to ensure that the
conclusions or inferences drawn are logically consistent. Considering the combined within
and between correlation components ‘‘prevents such logical inconsistencies as (a) inferring a
between-groups effect when the between-cell correlation is large but the amount of explained
variance (etas) between groups is negligible, or (b) inferring a within-groups effect when the
within-cell correlation is large but the amount of within-group variance (etas) is minimal’’
(Yammarino & Markham, 1992, p. 171).
Another strength of WABA is that it allows researchers to identify and partition different
sources of variance in all the variables of interest and then to assess the level(s) of analysis at
which the variables and the relationships are operating in the data. WABA makes no
assumption about whether between-cell or within-cell variance is error, and both are tested.
The WABA I analysis is an ANOVA-based statistic, and compares between-groups variance
to within-group variance. A limitation associated with this is that any restriction of between-
groups variance (variance across groups) on the construct of interest may result in
underestimation of within-cell agreement (James, 1988), and thus produce erroneous WABA
I conclusions (e.g., an inappropriate conclusion that scores should not be aggregated to the
group level when in fact groups are present; see George & James, 1993, for an explanation
and example).
WABA is a fairly flexible methodology, in that researchers are able to look at alternative
levels of analysis. For example, in WABA it is possible to empirically evaluate whether the
variable(s) or relationships operate at the individual or group level, or whether they operate at
the group or organizational level. In other words, the dependent and independent variables are
not initially constrained to any particular level, allowing researchers to look at any level(s) of
analysis. However, in WABA II analyses, the independent and dependent variables must be at
the same level of analysis for the relationship being evaluated. So while there is no constraint
regarding the particular level in developing hypotheses about variables and relationships, the
relationships must be at the same level (empirically supported by the WABA I analyses) for
WABA II analyses to be conducted.
Additionally, when MRA is employed to evaluate moderators, there is a restriction on the
level of the moderator. MRA requires that the moderator variable be at a higher level of
analysis, such that if individuals and groups are of interest, the moderator would need to be at
the group level or higher (cf. Schriesheim et al., 2000). In MRA, the moderator variable is
categorized, and analyses are conducted within each subgroup of the moderator (Dansereau
et al., 1984). Thus, MRA cannot be used to test a moderator for which individuals within
groups may have different scores on the moderating variable (e.g., an attitude such as satis-
faction) if relationships at the group level or lower are of interest, since members of the same
group might be placed into different moderator subgroups.
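A sketch of this subgrouping step (reusing the waba_decomposition function sketched earlier; the names and the median split are illustrative) makes the restriction visible: each group, and therefore each of its members, lands in exactly one moderator condition:

```python
import numpy as np

def mra_split(x, y, group, moderator_by_group):
    """WABA components within each condition of a group-level moderator."""
    cut = np.median(list(moderator_by_group.values()))
    out = {}
    for label, keep in (("low", lambda m: m <= cut), ("high", lambda m: m > cut)):
        ids = [g for g, m in moderator_by_group.items() if keep(m)]
        idx = np.isin(group, ids)                 # all members of the selected groups
        out[label] = waba_decomposition(x[idx], y[idx], group[idx])
    return out   # compare r_b and r_w across the low/high moderator conditions
```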
One of the advantages of this restriction is that scores on variables at higher levels (e.g.,
groups) are not duplicated for each unit at lower levels (e.g., individual) and therefore
degrees of freedom are not a problem. For example, one option for analyzing multilevel data
that has been used in the past is to assign the scores of the group-level variables to each
individual member and then analyze the data at the individual level of analysis. A problem
with this approach is that the degrees of freedom used in the statistical significance tests are
based on the number of individuals, and not on the number of groups. Restricting the
moderator to be a higher level precludes this from happening. However, the restriction on the
moderator’s level of analysis does limit the researcher to only testing theories that propose
the moderator variable at a higher level of analysis. Additionally, if the moderator is a
continuous variable, it must be categorized for use in MRA, which results in a loss of
statistical power.
Alternatively, Schriesheim’s (1995) multivariate WABA procedure does not limit
moderator variables to be at higher levels of analysis, and thus is not subject to these
limitations. The procedure is limited, though, in that it cannot identify situations where
the level of analysis may change across different values of the moderator (cf. Schriesheim
et al., 2000).
5.1. Methodology
Bliese and Halverson (this issue) tested three hypotheses. First, they proposed that
group-level perceptions of leadership climate moderated the relationship between group-
level task significance and group-level hostility. Second, they proposed that group-level
relationships between leadership climate, task significance and hostility were due to true
group effects and not to aggregation effects. Finally, they proposed that the moderated
relationship (group-level leadership climate moderating the relationship between group-
level task significance and group-level hostility) was due to true group effects and not to
aggregation effects.
Hierarchical linear regression using unweighted group means (each variable was aggre-
gated to the group level) evaluated the moderation hypothesis, and determined that the
interaction was significant. At the group level of analysis, when leadership climate was high,
hostility was low regardless of task significance. However, when leadership climate was low,
task significance was strongly and negatively related to hostility.
Confidence intervals generated by RGR from the pseudo group results indicated that the
main effects (task significance and leadership climate) and their interaction were outside the
pseudo group confidence intervals. However, F tests of the pseudo group results revealed that
the two main effect terms were significantly related to hostility (the pseudo group interaction
term was not significantly related to hostility).
Bliese and Halverson (this issue) then extended the RGR procedure to the WABA II
results. RGR was used to test whether the within- and between-correlations from the actual
groups were significantly different from the pseudo group correlations. In both cases, the
actual group correlations were outside the 95% confidence intervals. That is, actual group
correlations were significantly different from pseudo group correlations.
The unweighted group-means analysis conducted by Bliese and Halverson (this issue)
using hierarchical linear regression indicated that leadership climate moderated the task
significance–hostility relationship (the interaction term MS = .35, P < .01). Thus, their first
hypothesis was supported. The form of the interaction differed only slightly from the form
identified by HLM. The group-means analysis indicated that under poor leadership
climate, group task significance was strongly and negatively related to group hostility.
However, under good leadership climate, group task significance did not have a strong
relationship with group hostility (i.e., hostility levels were low regardless of the level of
group task significance).
Since the actual group results for the interaction term fell outside the pseudo group
confidence intervals, Bliese and Halverson argued that the significant interaction effects
were due to actual group characteristics (i.e., not merely a product of the aggregation
process), and thus their third hypothesis was supported. However, due to the significant
F test for the pseudo group main effect terms, Bliese and Halverson interpreted these results
as providing only partial support for their second hypothesis. In other words, some portion
of the group-level task significance and leadership climate values were attributable to
differences between groups, but some portion was also attributable to the aggregation
process (individual level effects).
The extension of RGR to WABA II determined that actual group effects existed for both the
within-group and between-group correlations. That is, the within- and between-groups
correlations differed significantly from the pseudo group correlations and were not artifacts.
5.4. Assumptions
RGR is a technique employed to generate random groups from within the same data set to
compare with actual groups. Random sampling is one assumption of this method. Relevant
assumptions also include those of the underlying methods (in this study, OLS regression and
WABA analyses).
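The resampling step itself is simple; a sketch (Python; the statistic and names are illustrative) randomly reassigns members to pseudo groups of the same sizes and builds the confidence interval from the pseudo-group results:

```python
import numpy as np

rng = np.random.default_rng(2)

def rgr_interval(values, group, stat, reps=1_000):
    """95% interval of stat(values, group) over random pseudo-group assignments."""
    # Permuting the group labels keeps every group's size but breaks membership.
    pseudo = [stat(values, rng.permutation(group)) for _ in range(reps)]
    return np.quantile(pseudo, [0.025, 0.975])

def between_group_variance(values, group):
    """Example statistic: variance of the group means."""
    return np.var([values[group == g].mean() for g in np.unique(group)], ddof=1)

# If the actual-group statistic falls outside the interval, the effect is
# attributed to real group characteristics rather than aggregation alone.
```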
By design, the RGR group-mean analysis looked at unit level variables (i.e., the group). RGR
analyses in the Bliese and Halverson article were conducted at the group level, using group
averages for task significance, leadership climate, and hostility. The hypotheses tested were that
the group-level relationships between the variables and the group-level moderator relationship
(group-level leadership climate moderating the relationship between group-level task signific-
ance and group-level hostility) were due to true group effects and not to the aggregation process.
The RGR analysis was based on the assumption that the stress-buffering hypothesis, initially
proposed at the individual level, would have applicability at the group level.
The RGR group-mean procedure allows researchers to evaluate whether group results (or
other units of interest) are based on actual group differences or are attributable to the
aggregation of individual (or lower level) results. A limitation of the RGR group-mean
analysis is that variables must be operationalized at the group level using group averages in
the analyses. This prevents the evaluation of other potential levels of analysis at which effects
could be operating (e.g., within-group effects, cross-level effects, etc.). Additionally, the
variability within each group is assumed to be error. By not evaluating the variance within
groups, important information could be overlooked. For example, a variable or relationship
could be operating at both the within- and between-groups levels of analysis, but this would
not be detected by the RGR group-mean analysis.
The RGR procedure can easily be extended to other areas as well. In Bliese and
Halverson’s study (this issue), RGR was applied to the WABA II correlation analysis. The
authors also note that the procedure has been applied to WABA I etas to determine whether
they differ significantly from chance levels (see Bliese, 2000a; Bliese & Halverson, 1996).
6. Comparisons
Considering the results of these articles in contrast to one another points out some
interesting differences. One difference that should draw the reader’s immediate attention is
the difference in results for the raw score regression analysis (in the introductory article) and
the multilevel analyses, which took into consideration levels of analysis. The raw score
regression found no evidence for moderation, while each of the multilevel methods found
evidence of a moderator. This underscores the importance of taking levels of analysis into
account both when theorizing and when conducting empirical analyses. Without multilevel
investigations, the moderating effect of leadership climate would have been missed.
The second contrast of interest is that the units of theory for each methodology differ in
subtle ways. In large part, these differences are based on the methodological orientation.
ICCs, rwg( j), and RGR were used to test only group-level properties of the data. HLM forced
the dependent variable to be at the lowest level of analysis, but allowed independent and
moderator variables to be at any theoretical level. WABA requires that the independent and
dependent variables be at the same level of analysis, and the MRA method requires that
moderators be at a higher level of analysis. All of these methods are therefore limited with
respect to the theoretical questions they can answer.
The tests conducted in each of the methodologies are strikingly different as well. The rwg( j)
coefficient tests intragroup rater agreement, while RGR group-mean analysis tests whether
actual group results can be assumed to be due to group characteristics, using confidence in-
tervals generated from randomly sampled pseudo groups. HLM tests between- and within-
group variance in the dependent variable separately to see if the variances are significantly
different from zero. This is different from WABA I’s assessment of variance, since in WABA I
etas evaluate within- and between-group variance for each variable of interest, including de-
pendent variables, independent variables, and moderators. WABA I offers the advantage of a
test of practical significance, to prevent variance that is not meaningful from being interpreted
as significant. HLM does not evaluate the relative covariances of the variables at different
levels, whereas WABA II tests whether the within- and between-group covariances are sig-
nificantly different from each other as well as whether the covariances are large enough to be
significant. Again, tests of practical significance in WABA II help prevent researchers from
making inferences about statistically significant but not practically meaningful results.
Finally, it should be noted that different forms of moderation were found using the
different methods. The HLM and RGR group-mean moderator results were similar. Under
good leadership climate, HLM found a negative relationship between task significance and
hostility, whereas under good leadership climate RGR determined that the level of hostility
was low at all levels of task significance. Both methods found a strong negative relationship
between task significance and hostility under poor leadership climate.
The WABA results differed from these in the level of analysis at which the relationship was
detected. Low and high levels of the moderator were found to operate at different levels of
analysis (group and individual). It should be noted that WABA uncovered individual level
relationships with leadership climate that HLM was unable to detect, due to the fact that MRA
can test whether the level of analysis at which a moderator functions changes across different
values of the moderator (this is despite Klein & Kozlowski’s (2000) assertion that WABA was
not designed to test cross-level moderator effects and ‘‘does so less flexibly than HLM,’’ p.
232). WABA’s identification of a moderator relationship varying across levels offers a view that
is somewhat different from what might be considered more ‘‘traditional’’ moderator analysis.
The practical implications for each of the moderator results are strikingly different as well.
Using the HLM results, to have low individual hostility both the group leadership climate and
individual task significance would need to be perceived as high. However, using the RGR
results, to have low average hostility one only needs to have a high average leadership climate
(average task significance is not important). The implications of the WABA results are
somewhat different in that the effect of leadership climate on the relationship between task
significance and hostility varied depending on the level of analysis. Under poor leadership
climate, groups with low levels of task significance experienced high levels of hostility. This
implies that leaders of groups with poor leadership climate would want to ensure that the
perception of task significance among group members on average was high. Alternatively,
leaders could try to improve the climate. Under good leadership climate conditions, WABA
indicates that the individual level is important. Thus, once the climate is good, leaders should
then be concerned with ensuring that each individual’s perception of task significance is high.
To some extent, each methodology presented in this issue answers different questions. ICC(1)
indicates the amount of variance in a variable attributable to group membership, and ICC(2)
assesses the internal consistency reliability of the group means. The rwg( j) index evaluates the
degree of consensus or agreement among raters within a group. HLM is useful for determining
whether cross-level direct effects exist and/or whether cross-level moderators are operating.
WABA answers two questions: (1) at what level(s) of analysis are phenomena operating and (2)
what is the relationship (e.g., direct effects, cross-level moderation, etc.). Finally, RGR
determines whether a group effect is due to group characteristics or to the effects of aggregation.
Each of the multilevel methodologies used in this issue has its strengths and weaknesses.
Some may be suited for use in combination, while others answer theoretically different
questions. For example, both rwg( j) and RGR would be useful in combination with either HLM or WABA. The rwg( j) index would allow an evaluation of intragroup rater agreement prior to the aggregation of scores (in HLM), or it could serve as an additional piece of evidence in WABA's level-of-analysis decisions. Alternatively, RGR can be used to set confidence limits for evaluating whether results are due to group characteristics or merely to chance (Bliese & Halverson, 1996, this issue); a minimal sketch of this idea appears below. WABA's evaluation of practical significance and assessment of level (are the variables or relationships operating at the individual level, dyad level, group level, etc.?) might also be useful prior to utilizing HLM, to determine whether variables are being modeled at appropriate levels of analysis. A unique characteristic of HLM is its ability to evaluate relationships where the independent variable is at a higher level of analysis than the dependent variable.
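To illustrate the RGR logic mentioned above, the following minimal sketch (again assuming a hypothetical data frame d with columns y and grp) repeatedly reassigns individuals to random pseudo-groups of the same sizes and uses the resulting distribution to set confidence limits for the observed between-group variance; the details of Bliese and Halverson's procedure may differ.

# Random group resampling (RGR) sketch: a pseudo-group null distribution
set.seed(42)
actual <- var(tapply(d$y, d$grp, mean))       # observed variance of group means
null <- replicate(1000, {
  pseudo <- sample(d$grp)                     # shuffle group memberships
  var(tapply(d$y, pseudo, mean))              # same statistic for pseudo-groups
})
quantile(null, c(0.05, 0.95))                 # pseudo-group confidence limits
mean(null >= actual)                          # approximate p value for a group effect

If the observed statistic falls outside the pseudo-group confidence limits, the group effect is unlikely to be an artifact of aggregation alone.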
Another strength of HLM is its ability to model longitudinal relationships. A problem arises, however, because HLM assumes that residuals are uncorrelated, and longitudinal analyses are likely to violate this assumption. Recognizing this limitation of HLM, Bliese (2002) notes that alternative multilevel random coefficient models can handle such violations. Specifically, Bliese states that PROC MIXED in SAS (see Littell, Milliken, Stroup, & Wolfinger, 1996; Singer, 1998) and the lme routine in S-PLUS (Pinheiro & Bates, 2000) can model Level 1 autocorrelation, although the form of the nonindependence must be specified. There are also other assumptions of HLM (noted earlier) that are likely to be violated in organizational research; information on how robust the methodology is to violations of its assumptions is therefore greatly needed.
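As a concrete illustration of the point about lme, the following minimal sketch fits a growth model with first-order autocorrelated Level 1 residuals, assuming a hypothetical long-format data frame d with an outcome y, integer-coded measurement occasions time, and a person identifier id; the corAR1 argument is where the form of the nonindependence is specified.

# Growth model with AR(1) Level 1 errors using lme from the nlme library
library(nlme)
fit <- lme(fixed = y ~ time,
           random = ~ time | id,                     # random intercepts and slopes
           correlation = corAR1(form = ~ time | id), # AR(1) within-person residuals
           data = d)
summary(fit)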
There are also questions related to rwg( j) that need additional research. For example, it would be useful to know whether it is possible to summarize rwg( j) values across groups and, if so, how this can best be accomplished (Schriesheim et al., 2001). Additionally, as Schriesheim et al. (1995) pointed out, whether rwg( j) is a ''lenient'' statistic (i.e., how easily values greater than zero are obtained) has not been evaluated. Questions have also been raised about how the number of raters in groups (James et al., 1984) and the number of items in scales (Schriesheim et al., 1995) affect the rwg( j) statistic. Notably, Kozlowski and Hattrup (1992) and Lindell et al. (1999) have provided some insight into the effect of the number of raters on the rwg( j) coefficient, and Lindell et al. offer recommendations on the maximum number of items (cf. Schriesheim et al., 2001). In general, however, more research and information on the rwg( j) coefficient would be extremely useful.
The evaluation of negative rwg( j) values has been somewhat problematic. James et al. (1984) suggested that any negative values be set to zero. Some researchers have questioned this recommendation, however, and proposed an alternative index of agreement in which negative values represent disagreement (see Lindell & Brandt, 1999; Lindell et al., 1999). It may also be useful to investigate negative index values to determine why they are negative. For example, a negative value could be obtained because two subgroups exist: one subgroup rated all items 1, while the second rated them all 5. There would be perfect agreement (rwg( j) = 1) within each subgroup separately, but the rwg( j) index for the entire group would not reveal this. Thus, evaluating the basis of a negative result before discounting or discarding the information might prove fruitful for researchers.
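The subgroup example is easy to verify numerically. The following minimal sketch uses the single-item rwg of James et al. (1984) with the uniform null variance (A^2 - 1)/12 for an A-point scale; the same logic drives the multi-item rwg( j).

# Two perfectly agreeing subgroups produce a negative whole-group rwg
rwg <- function(x, A = 5) 1 - var(x) / ((A^2 - 1) / 12)
ratings <- c(1, 1, 1, 5, 5, 5)        # subgroup 1 rates all 1s, subgroup 2 all 5s
rwg(ratings)                          # negative: observed variance exceeds the null
rwg(ratings[1:3]); rwg(ratings[4:6])  # rwg = 1 within each subgroup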
Another suggestion for future research is further exploration of the potential usefulness of simulation. Charnes and Schriesheim (1995) used simulation to evaluate the distribution of rwg( j), and simulation could be used similarly in HLM and WABA. In HLM, a population of intercepts and slopes could be generated, allowing evaluation of effects, sensitivity to changes, and differences in parameters. In WABA, simulation could be used to adjust for group size in WABA I; that is, the observed etas could be compared with etas obtained when group membership does not matter, as sketched below.
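A minimal sketch of this WABA I idea follows, with hypothetical values for the number and size of groups: data are generated with no true group effect, and the between-group eta is computed for each sample, yielding a group-size-specific benchmark against which observed etas can be compared.

# Null distribution of between-group etas when groups do not matter
set.seed(42)
ngroups <- 20; ng <- 10                        # hypothetical design
eta_between <- function(y, g) {
  means <- ave(y, g)                           # each case's group mean
  sqrt(sum((means - mean(y))^2) / sum((y - mean(y))^2))
}
null_etas <- replicate(1000, {
  y <- rnorm(ngroups * ng)                     # no group effect by construction
  g <- rep(seq_len(ngroups), each = ng)
  eta_between(y, g)
})
quantile(null_etas, c(0.05, 0.95))             # size-adjusted comparison values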
Finally, as many have emphasized, theory is needed before multilevel methods are applied. In fact, several theoretical areas in leadership seem particularly suitable for multilevel investigation. Leader–member exchange (LMX; Dansereau, Graen, & Haga, 1975; Graen & Uhl-Bien, 1995) theory stands out among leadership theories as almost demanding multilevel analysis (cf. Schriesheim, Castro, & Cogliser, 1999), since the theory proposes that leaders differentiate between subordinates within their work groups. Transformational leadership theory (e.g., Bass, 1985; House & Podsakoff, 1994) is another area where multilevel analysis might provide useful insight into how the phenomenon operates, as different transformational behaviors may operate at, or have different effects at, different levels of analysis. For example, the fostering of group performance goals may function at a higher level of analysis than does individualized consideration. Other areas in leadership that might benefit from the application of multilevel analysis include conflict management, influence attempts, performance appraisal, empowerment, and delegation. In general, the level at which these phenomena operate (e.g., individual, dyadic, group, etc.) should be theoretically identified and then empirically tested. Also, as demonstrated in the studies in this issue, variables (e.g., moderators) that operate at other levels of analysis can be investigated to gain a more complete understanding of leadership phenomena.
Once the theoretical questions are delineated (whether in leadership or other areas), the
method that is both theoretically and empirically appropriate for the question(s) of interest should
be employed. Hopefully, the description of the strengths and weaknesses associated with ICCs,
rwg( j), HLM, WABA, and RGR in this article will help researchers determine which method or
methods are appropriate for their research.
Researchers interested in learning more about these methodologies are directed to the
following sources. For more information on ICCs, articles by Bliese (1998, 2000b), McGraw
and Wong (1996), and Shrout and Fleiss (1979) should prove useful. The initial article by
James et al. (1984) on rwg( j) is a good starting point, but articles by James et al. (1993), Lindell and Brandt
(1999), Lindell et al. (1999), and Schriesheim et al. (1995, 2001) should also be read. HLM is
described in detail in Bryk and Raudenbush’s (1992) book, and Hofmann (1997) provides a
useful overview of the method. Bliese (2002) notes that alternative multilevel random coefficient modeling programs can be used instead of HLM, including MLn
(Kreft & de Leeuw, 1998), PROC MIXED for SAS (Littell et al., 1996; Singer, 1998),
VARCL (Longford, 1990), and the nlme library for R and S-PLUS (Pinheiro & Bates, 2000).
For those desiring to learn more about WABA, there is the book by Dansereau et al. (1984), as
well as informative articles by Schriesheim (1995) and Yammarino (1998). George and James
(1993) and Yammarino and Markham (1992) engaged in an interesting dialogue over the problems and merits of WABA that interested readers may find informative.
Additionally, Schriesheim et al. (2000) provided a comparison of two methods for examining
moderators, MRA (Dansereau et al., 1984) and multivariate WABA (Schriesheim, 1995).
Unfortunately, there is little available in the literature about RGR. Some information about
the procedure is provided in two studies by Bliese and Halverson (1996, 1998a), but the
article in this issue is more comprehensive. Examples involving RGR along with open-source
software are outlined in Bliese (2000a). Interested researchers should consult these references
for more information on the multilevel data analytic methodologies used in this issue.
References
Bass, B. M. (1985). Leadership and performance beyond expectations. New York: Free Press.
Bliese, P. D. (1998). Group size, ICC values, and group-level correlations: a simulation. Organizational Research Methods, 1, 355–373.
Bliese, P. D. (2000a). Multilevel modeling in R: a brief introduction to R, the multilevel package, and the NLME package. Washington, DC: Walter Reed Army Institute of Research.
Bliese, P. D. (2000b). Within-group agreement, non-independence, and reliability: implications for data aggregation and analysis. In: K. J. Klein & S. W. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations (pp. 349–381). San Francisco, CA: Jossey-Bass.
Bliese, P. D. (2002). Multilevel random coefficient modeling in organizational research: examples using SAS and S-PLUS. In: F. Drasgow & N. Schmitt (Eds.), Modeling in organizational research: measuring and analyzing behavior in organizations (pp. 401–445). San Francisco, CA: Jossey-Bass.
Bliese, P. D., & Halverson, R. H. (1996). Individual and nomothetic models of job stress: an examination of work hours, cohesion, and well-being. Journal of Applied Social Psychology, 26, 1171–1189.
Bliese, P. D., & Halverson, R. H. (1998a). Group consensus and psychological well-being: a large field study. Journal of Applied Social Psychology, 28, 563–580.
Bliese, P. D., & Halverson, R. H. (1998b). Group size and measures of group-level properties: an examination of eta-squared and ICC values. Journal of Management, 24, 157–172.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models. Newbury Park, CA: Sage.
Charnes, J. M., & Schriesheim, C. A. (1995). Estimation of quantiles for the sampling distribution of the rwg within-group agreement index. Educational and Psychological Measurement, 53, 435–437.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Dansereau, F., Alutto, J. A., & Yammarino, F. J. (1984). Theory testing in organizational behavior: the varient approach. Englewood Cliffs, NJ: Prentice-Hall.
Dansereau, F., Jr., Graen, G. B., & Haga, W. J. (1975). A vertical dyad linkage approach to leadership within formal organizations: a longitudinal investigation of the role making process. Organizational Behavior and Human Performance, 13, 46–78.
George, J. M. (1990). Personality, affect, and behavior in groups. Journal of Applied Psychology, 75, 107–116.
George, J. M., & James, L. R. (1993). Personality, affect, and behavior in groups revisited: comment on aggregation, levels of analysis, and a recent application of within and between analysis. Journal of Applied Psychology, 78, 798–804.
Glick, W. H., & Roberts, K. H. (1984). Hypothesized interdependence, assumed independence. Academy of Management Review, 9, 722–735.
Graen, G. B., & Uhl-Bien, M. (1995). Relationship-based approach to leadership: development of leader–member exchange (LMX) theory of leadership over 25 years: applying a multi-level multi-domain perspective. Leadership Quarterly, 6, 219–247.
Hofmann, D. A. (1997). An overview of the logic and rationale of hierarchical linear models. Journal of Management, 23, 723–744.
House, R. J., & Podsakoff, P. M. (1994). Leadership effectiveness: past perspectives and future directions for research. In: J. Greenberg (Ed.), Organizational behavior: the state of the science (pp. 45–82). Hillsdale, NJ: Erlbaum.
James, L. R. (1980). The unmeasured variables problem in path analysis. Journal of Applied Psychology, 65, 415–421.
James, L. R. (1982). Aggregation bias in estimates of perceptual agreement. Journal of Applied Psychology, 67, 219–229.
James, L. R. (1988). Organizational climate: another look at a potentially important construct. In: S. G. Cole & R. G. Demaree (Eds.), Applications of interactionist psychology: essays in honor of Saul B. Sells (pp. 253–282). Hillsdale, NJ: Erlbaum.
James, L. R. (1995). Discussant's comments. In N. Bennett (Symposium chair), Introduction, explanation, and illustrations of hierarchical linear modeling as a management research tool. Vancouver, British Columbia: Academy of Management Annual Conference.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.
James, L. R., Demaree, R. G., & Wolf, G. (1993). rwg: an assessment of within-group agreement. Journal of Applied Psychology, 78, 306–309.
James, L. R., & LeBreton, J. M. (2001). Disentangling issues of agreement, disagreement, and lack of agreement using rwg, r*wg, and rwg( j). Unpublished working paper.
Kenny, D. A., & Judd, C. M. (1986). Consequences of violating the independence assumption in analysis of variance. Psychological Bulletin, 99, 422–431.
Klein, K. J., Dansereau, F., & Hall, R. J. (1994). Levels issues in theory development, data collection, and analysis. Academy of Management Review, 19, 195–229.
Klein, K. J., & Kozlowski, S. W. (2000). From micro to meso: critical steps in conceptualizing and conducting multilevel research. Organizational Research Methods, 3, 211–236.
Kozlowski, S. W. J., & Hattrup, K. (1992). A disagreement about within-group agreement: disentangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161–167.
Kreft, I. G. G. (1996). Are multilevel techniques necessary? An overview, including simulation studies. Unpublished paper, California State University, Los Angeles, USA.
Kreft, I. G. G., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lindell, M. K., & Brandt, C. J. (1999). Assessing interrater agreement on the job relevance of a test: a comparison of the CVI, T, rwg( j), and r*wg( j) indexes. Journal of Applied Psychology, 84, 640–647.
Lindell, M. K., & Brandt, C. J. (2000). Climate quality and climate consensus as mediators of the relationship between organizational antecedents and outcomes. Journal of Applied Psychology, 85, 331–348.
Lindell, M. K., Brandt, C. J., & Whitney, D. J. (1999). A revised index of interrater agreement for multi-item ratings of a single target. Applied Psychological Measurement, 23, 127–135.
Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS system for mixed models. Cary, NC: SAS Institute.
Longford, N. T. (1990). VARCL. Software for variance component analysis of data with nested random effects (maximum likelihood). Princeton, NJ: Educational Testing Service.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-PLUS. New York: Springer-Verlag.
Roberts, K. H., Hulin, C. L., & Rousseau, D. M. (1978). Developing an interdisciplinary science of organizations. San Francisco: Jossey-Bass.
Rousseau, D. M. (1985). Issues of level in organizational research: multi-level and cross-level perspectives. Research in Organizational Behavior, 7, 1–37.
Schriesheim, C. A. (1995). Multivariate and moderated within- and between-entity analysis (WABA) using hierarchical linear multiple regression. Leadership Quarterly, 6, 1–18.
Schriesheim, C. A., Castro, S. L., & Cogliser, C. C. (1999). Leader–member exchange (LMX) research: a comprehensive review of theory, measurement, and data-analytic practices. Leadership Quarterly, 10, 63–113.
Schriesheim, C. A., Castro, S. L., & Yammarino, F. J. (2000). Investigating contingencies: an examination of the impact of span of supervision and upward controllingness on leader–member exchange using traditional and multivariate within- and between-entities analysis. Journal of Applied Psychology, 85, 659–677.
Schriesheim, C. A., Cogliser, C. C., & Neider, L. L. (1995). Is it ''trustworthy''? A multiple levels-of-analysis reexamination of an Ohio State leadership study, with implications for future research. Leadership Quarterly, 6, 111–145.
Schriesheim, C. A., Donovan, J. A., Zhou, X., LeBreton, J. M., Whanger, J. C., & James, L. R. (2001). Use and misuse of the rwg( j) coefficient of within-group agreement: a review and suggestions for future research and use. Unpublished working paper.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 23, 323–355.
Yammarino, F. J. (1998). Multivariate aspects of the varient/WABA approach: a discussion and leadership illustration. Leadership Quarterly, 9, 203–227.
Yammarino, F. J., & Markham, S. E. (1992). On the application of within and between analysis: are absence and affect really group-based phenomena? Journal of Applied Psychology, 77, 168–176.