Measuring Sex Differences and Similarities
In press (2019):
Chapter in D. P. VanderLaan & W. I. Wong (Eds.), Gender and sexuality development:
Contemporary theory and research. New York, NY: Springer.
Marco Del Giudice, Department of Psychology, University of New Mexico. Logan Hall, 2001
Redondo Dr. NE, Albuquerque, NM 87131, USA; email: [email protected]
Few topics in psychology can rival sex differences in their power to stir controversy and
captivate both scientists and the public. Debates in this area revolve around two types of
questions: explanatory questions about the role of social learning and biological factors in
determining patterns of sex-related behavior, and descriptive questions about the size and
variability of those effects. These questions are logically distinct and can be addressed
independently; however, throughout the history of the discipline the answers have tended to
cluster together (see Eagly & Wood, 2013; Lippa, 2005). More often than not, researchers who
emphasize socio-cognitive factors view sex differences as small, outweighed by
similarities, and highly context-dependent. They also tend to worry that exaggerated beliefs
about the extent of sex differences and their stability may have pernicious influences on
individuals and society (e.g., Hyde, 2005; Hyde et al., 2018; Rippon et al., 2014; Unger, 1979).
Conversely, most biologically oriented scholars argue that—at least in regard to certain traits—
differences between the sexes can be large, pervasive, and potentially universal (e.g., Buss, 1995;
Davies & Shackelford, 2008; Ellis, 2011; Geary, 2010; Schmitt, 2015). While not all scholars
can be neatly placed in one of these two “camps,” the long-standing divide helps explain
why measurement and quantification are so often at the center of disputes in the field (Eagly &
Wood, 2013).
Regardless of one’s theoretical background, it is clear that future progress will depend on
our ability to quantify differences and similarities as accurately and meaningfully as possible.
Doing so requires not only the proper statistical tools, but also awareness of the many factors that
may distort empirical findings and make them less interpretable, or even potentially misleading.
Despite the importance of these issues, the relevant literature is fragmented; as far as I know,
there have been no attempts to organize it in an accessible form. This chapter aims to fill this gap
with a concise but systematic introduction to quantification in sex differences research. I begin
with a meta-methodological note about the meaning of “sex” and “gender,” and the rationale for
treating sex as a binary variable despite the complexities of sex-related identity and behavior (a
point that necessitates a brief detour into evolutionary biology). In the following section, I review
the main approaches to quantification, examine their strengths and limitations, and offer
suggestions for visualization. Finally, I discuss various statistical and methodological factors that
may inflate or deflate the apparent size of sex differences, and consider the available options to
minimize their influence.
While many authors in psychology and other disciplines treat “sex” and “gender” as
synonyms (Haig, 2004), these terms have different histories and implications. The contemporary
usage of “gender” as the social and/or psychological counterpart of biological sex was
introduced in psychology by Money (1955), though Bentley (1945) had drawn the same
distinction ten years before. Popularized by Stoller (1968), the term was rapidly adopted by
feminist scholars in the 1970s (Haig, 2004; Janssen, 2018). The motivation was to distinguish the
biological characteristics of males and females from the social roles, behaviors, and aspects of
identity associated with male/female labels, usually with the assumption that sociocultural
factors are more powerful and consequential than biological ones, and that psychological
differences are largely or exclusively determined by socialization (e.g., Oakley, 1972; Unger,
1979; for an in-depth analysis see Byrne, 2018). As many have noted over the years, the sex-
gender distinction is problematic and ultimately unworkable, which is probably why few authors
actually follow it in their writing. Not only does it suggest a clear-cut separation between social
and biological explanations; it also presupposes that one already knows whether a certain aspect
of behavior is biological or socially constructed in order to pick the appropriate term (Deaux,
1985; Ellis et al., 2008; Haig, 2004).
Having grown uneasy with the sex-gender distinction, some feminist scholars have
started to promote the use of the hybrid term “sex/gender” (or “gender/sex”) as a way to
recognize that biological and social factors are inseparable, encourage critical examination of the
processes that lead to observable male-female differences, and underscore the potential for
plasticity (Fausto-Sterling, 2012; Hyde et al., 2018; Jordan-Young & Rumiati, 2012; Rippon et
al., 2014). This is a legitimate stance, but the new terminology has its own problems,
and I suspect that the cure would be worse than the disease. Sex/gender is often described by its
proponents as a continuum, or even a multidimensional collection of semi-independent features;
from this perspective, a person’s sex/gender may be regarded as hybrid, fluid, or otherwise
nonbinary (see e.g., Hyde et al., 2018). Yet, the same term is also used in the context of the
distinction between males and females as groups (Jordan-Young & Rumiati, 2012). Some
authors have carried this tension to its logical conclusion and suggested that researchers should
stop using sex as a binary variable (Joel & Fausto-Sterling, 2016). On this view, “male” and
“female” should be replaced with multiple overlapping categories, or even (multi)dimensional
scores of gendered self-concepts and attitudes (Hyde et al., 2018; Joel & Fausto-Sterling, 2016).
This radical methodological change is justified with the need to overcome the “gender binary.”
However, the binary nature of sex is not an illusion to dispel but a biological reality, as I now
briefly discuss.
A deeper issue with the “patchwork” definition of sex used in the social sciences is the
lack of a functional rationale, in stark contrast with how the sexes are defined in biology. From a
biological standpoint, what distinguishes the males and females of a species is the size of their
gametes: males produce small gametes (e.g., sperm), females produce large gametes (e.g., eggs;
Kodric-Brown & Brown, 1987; see footnote 1). Dimorphism in gamete size, or anisogamy, is the dominant
pattern in multicellular organisms, including animals. The evolution of two gamete types with
different sizes and roles in fertilization can be predicted from first principles, as a result of
selection to maximize the efficiency of fertilization (Lehtonen & Kokko, 2011; Lehtonen &
Parker, 2014). In turn, anisogamy generates a cascade of selective pressures for sexually
differentiated traits in morphology, development, and behavior (see Janicke et al., 2016;
Lehtonen et al., 2016; Schärer et al., 2012). The biological definition of sex is not just one option
among many, or a matter of arbitrary preference: the very existence of differentiated males and
females in a species depends on the existence of two gamete types. Chromosomes and hormones
participate in the mechanics of sex determination and sexual differentiation, but do not play the
same foundational role. Crucially, anisogamy gives rise to a true sex binary at the species level:
even if a given individual may fail to produce viable gametes, there are only two gamete types
with no meaningful intermediate forms (Lehtonen & Parker, 2014). This dichotomy is functional
rather than statistical, and is not challenged by the existence of intersex conditions (regardless of
their frequency), nonbinary gender identities, and other apparent exceptions. And yet, anisogamy
is rarely discussed—or even mentioned—in the social science literature on sex and gender, with
the obvious exceptions of evolutionary psychology and anthropology.
What are the implications for research? If the sex binary is a basic biological fact,
arguments that call for rejecting it on scientific grounds (e.g., Hyde et al., 2018) lose much of
their appeal. One can speak of sex differences in descriptive terms—as I do in this chapter—
without assuming that such differences are “hardwired” or immune from social influences. From
a practical standpoint, sex as a categorical variable is also robust to the presence of a small
proportion of individuals who, for various reasons, are not easily classified or do not align with
the biological definition. This does not mean that exceptions are unimportant, or that sex should
only be viewed through a categorical lens. For example, there are methods for ranking
individuals of both sexes along a continuum of masculinity-femininity or male-female typicality
(e.g., Lippa, 2001, 2010a; Phillips et al., 2018; more on this in Section 2.2.1). Variations in
gender identity and sexual orientation can and should be studied in all their complexity
regardless of whether sex is a biological binary. More generally, the existence of a well-defined
sex binary is perfectly compatible with large amounts of within-sex variation in anatomy,
physiology, and behavior. Indeed, sexual selection often amplifies individual variability in sex-
related traits, and can favor the evolution of multiple alternative phenotypes in males and
females (Geary, 2010, 2015; Taborsky & Brockmann, 2010; see also Del Giudice et al., 2018).
In the remainder of the chapter I discuss how patterns of quantitative variation between the sexes
can be measured and analyzed in detail.
There are many possible ways to quantify sex differences and similarities. In this section
I review the methods that are most often employed in the literature. I then discuss some methods
1
Species with simultaneous hermaphroditism (mostly plants and invertebrates) do not have distinct sexes, given that
any individual can produce both types of gametes at the same time.
that are less common but warrant a closer look, either because of their untapped potential or
because of their peculiar limitations. I also address the question of how to visualize quantitative
findings effectively and intuitively. Note that the various methods and indices discussed in this
section are not mutually exclusive. Different indices can reveal different aspects of
the data, and may be used in combination to gain a broader perspective; other times, one of the
indices may be better suited to answer the particular question at hand. The basic formulas are
reported and explained in Table 2.1. Additional methods to deal with more complex scenarios
can be found in the cited references.
Table 2.1

d, D
Univariate: $d = \dfrac{m_M - m_F}{S}$, where $S = \sqrt{\dfrac{(N_M - 1)S_M^2 + (N_F - 1)S_F^2}{N_M + N_F - 2}}$
Multivariate: $D = \sqrt{(\mathbf{m}_M - \mathbf{m}_F)^{\mathsf{T}}\, \mathbf{S}^{-1} (\mathbf{m}_M - \mathbf{m}_F)} = \sqrt{\mathbf{d}^{\mathsf{T}}\, \mathbf{R}^{-1}\, \mathbf{d}}$

d_u (g), D_u
Univariate: $d_u = g = d \left( 1 - \dfrac{3}{4(N_M + N_F - 2) - 1} \right)$
Small-sample variant of $d$ corrected for bias (approximate formula); also known as Hedges’ $g$.
Multivariate: $D_u = \sqrt{\max\left[ 0,\ \dfrac{N_M + N_F - k - 3}{N_M + N_F - 2}\, D^2 - k\, \dfrac{N_M + N_F}{N_M N_F} \right]}$
Small-sample variant of $D$ corrected for bias. $k$: number of variables.

$\Phi(\cdot)$: normal cumulative distribution function (CDF).

OVL2
Univariate: $OVL_2 = \dfrac{OVL}{2 - OVL} = 1 - U_1$
Multivariate: $OVL_2 = \dfrac{OVL}{2 - OVL} = 1 - U_1$
Proportion of overlap relative to the joint distribution (table notes 1, 2).

U1
Univariate: $U_1 = 1 - \dfrac{OVL}{2 - OVL} = 1 - OVL_2$
Multivariate: $U_1 = 1 - \dfrac{OVL}{2 - OVL} = 1 - OVL_2$
Proportion of nonoverlap relative to the joint distribution (table notes 1, 2).

U3
Univariate: $U_3 = \Phi(|d|)$
Proportion of individuals in the group with the higher mean who exceed the median individual of the other group (table notes 1, 2).
Multivariate: $U_3 = \Phi(D)$
Proportion of males who are more male-typical than the median female (= proportion of females who are more female-typical than the median male) (table notes 1, 2).

CL
Univariate: $CL = \Phi\left( |d| / \sqrt{2} \right)$
Common language effect size. Probability that a randomly picked individual from the group with the higher mean will exceed a randomly picked individual from the other group (table notes 1, 2).
Multivariate: $CL = \Phi\left( D / \sqrt{2} \right)$
Common language effect size. Probability that a randomly picked male will be more male-typical than a randomly picked female (= probability that a randomly picked female will be more female-typical than a randomly picked male) (table notes 1, 2).

η²
Univariate: $\eta^2 = \dfrac{d^2}{d^2 + 4}$
Eta squared. Proportion of variance explained by sex (table notes 1, 2, 3).
Multivariate: $\eta^2 = \dfrac{D^2}{D^2 + 4}$
Eta squared. Proportion of generalized variance explained by sex (table notes 1, 2, 3).

VR
Univariate: $VR = S_M^2 / S_F^2$
Multivariate: $VR = |\mathbf{S}_M| / |\mathbf{S}_F|$

TR
Univariate: $TR_z = \dfrac{\Phi(d - z)}{\Phi(-z)}$
Tail ratio. Relative proportion of males:females in the region located z standard deviations above the female mean (use $-d$ for the relative proportion of females:males in the region located z standard deviations above the male mean) (table notes 1, 2, 3).
Multivariate: $TR_z = \dfrac{\Phi(D - z)}{\Phi(-z)}$
Tail ratio. Relative proportion of males:females in the region located z standard deviations from the female centroid in the male-typical direction (= relative proportion of females:males in the region located z standard deviations from the male centroid in the female-typical direction) (table notes 1, 2, 3).

Table notes:
1. The formula assumes equality of variances (univariate case) or covariance matrices (multivariate case) in the population.
2. The formula assumes (multivariate) normality in the population.
3. The formula assumes equal group sizes (i.e., equal proportions of males and females).
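As a minimal illustration of the univariate entries in Table 2.1, the following Python sketch computes the pooled standard deviation, Cohen's d, the bias-corrected d_u (Hedges' g), and eta squared from summary statistics. The function names and the example numbers are hypothetical and chosen only for illustration; the eta squared conversion assumes normality, equal variances, and equal group sizes, as noted in the table.

```python
import math

def pooled_sd(n_m, sd_m, n_f, sd_f):
    """Pooled within-sex standard deviation S."""
    num = (n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2
    return math.sqrt(num / (n_m + n_f - 2))

def cohens_d(mean_m, mean_f, s):
    """Standardized mean difference (male minus female)."""
    return (mean_m - mean_f) / s

def hedges_g(d, n_m, n_f):
    """Approximate small-sample bias correction of d (d_u, Hedges' g)."""
    return d * (1 - 3 / (4 * (n_m + n_f - 2) - 1))

def eta_squared(d):
    """Proportion of variance explained by sex, from d (equal group sizes)."""
    return d**2 / (d**2 + 4)

# Hypothetical summary statistics, for illustration only.
n_m, n_f = 50, 50
s = pooled_sd(n_m, 10.0, n_f, 10.0)
d = cohens_d(105.0, 100.0, s)
g = hedges_g(d, n_m, n_f)
eta2 = eta_squared(d)
print(f"S = {s:.2f}, d = {d:.3f}, g = {g:.3f}, eta^2 = {eta2:.3f}")
```

With these inputs the correction shrinks d only slightly, which is expected at a total sample size of 100; the shrinkage grows as samples get smaller.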
2
In Cohen’s own words: “The terms “small,” “medium,” and “large” are relative, not only to each other, but to the
area of behavioral science or even more particularly to the specific content and research method being employed in
any given investigation […] In the face of this relativity, there is a certain risk inherent in offering conventional
definitions for these terms for use in power analysis in as diverse a field of inquiry as behavioral science. This risk is
nevertheless accepted in the belief that more is to be gained than lost by supplying a common conventional frame of
reference which is recommended only when no better basis for estimating the ES index is available.” (Cohen, 1988,
p. 25; emphasis added). This must have been one of the least successful warnings in the history of statistics.
3
Of course, it is always possible to test the null hypothesis that a given difference is exactly zero, or within a range
that makes it practically equivalent to zero for the purpose of a particular study. In contrast with standard
significance testing, Bayesian methods can directly quantify the evidence in support of the null hypothesis (see
Dienes, 2016; Kruschke & Liddell, 2018; Wagenmakers et al., 2018).
can be nearly useless if one needs to make highly accurate predictions or classifications; to
illustrate, d = 0.80 implies a predictive accuracy of about 66%, which is better than chance but
may be too low in some applied contexts (see Section 2.2.1.5). Also, a conventionally “large”
effect may be comparatively small if the other effects in the same domain are consistently larger.
This is not just the case for Cohen’s d: the same principle applies to all the effect sizes discussed
in this chapter. The idea that the practical importance of an effect can be determined
mechanically using fixed conventional guidelines is tempting, but deeply misguided.
Figure 2.1. Sex differences in facial morphology. (a) Composite male and female faces (averages of 24
pictures each). (b) The continuum of male-female typicality in facial features. The figure shows a
sequence of morphed faces, from 100% female to 100% male. Adapted with permission from Rhodes et
al. (2004). Copyright 2004 by Elsevier Ltd.
The natural metric for measuring global sex differences across multiple variables is
Mahalanobis’ D, the multivariate generalization of Cohen’s d (Huberty, 2005; Olejnik & Algina,
2000; Table 2.1). The value of D is the distance between the centroids (multivariate means) of
the male and female distributions, relative to the standard deviation along the axis that connects
the centroids. Figure 2.2 illustrates the geometric meaning of D in the case of two variables (for
more details see Del Giudice, 2009). The interpretation of D is essentially the same as that of d,
with the difference that D is unsigned and cannot take negative values (reflecting the multivariate
nature of the comparison). Confidence intervals for D can be obtained with bootstrapping
(Kelley, 2005; Hess et al., 2007) or with exact methods, which unfortunately are not always
applicable (see Reiser, 2001; Zhou, 2007). Procedures for obtaining a pooled correlation matrix
are discussed in Furlow & Beretvas (2005). Simple R functions to calculate D with confidence
intervals, corrections for bias and measurement error (Section 2.3), heterogeneity statistics (see
below), and other diagnostics and effect sizes are available at
https://doi.org/10.6084/m9.figshare.7934942.v1.
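To make the definition of D concrete, here is a minimal Python sketch (not the R functions mentioned above) that computes D from a vector of univariate d values and a pooled correlation matrix, together with the bias-corrected D_u from Table 2.1. The helper names (`solve`, `mahalanobis_d`, `d_unbiased`) and the example values are hypothetical; a small Gaussian-elimination solver is used only to keep the example free of external libraries.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def mahalanobis_d(d, r):
    """D = sqrt(d' R^-1 d), with d the univariate effects and R the pooled correlations."""
    w = solve(r, d)
    return math.sqrt(sum(di * wi for di, wi in zip(d, w)))

def d_unbiased(big_d, n_m, n_f, k):
    """Small-sample bias correction of D for k variables (floored at zero)."""
    n = n_m + n_f
    value = (n - k - 3) / (n - 2) * big_d**2 - k * n / (n_m * n_f)
    return math.sqrt(max(0.0, value))

# Two variables with d = 0.50 each; only the correlation changes.
d_vec = [0.5, 0.5]
D_uncorr = mahalanobis_d(d_vec, [[1.0, 0.0], [0.0, 1.0]])    # Euclidean distance
D_pos = mahalanobis_d(d_vec, [[1.0, 0.5], [0.5, 1.0]])       # smaller than D_uncorr
D_neg = mahalanobis_d(d_vec, [[1.0, -0.5], [-0.5, 1.0]])     # larger than D_uncorr
print(D_uncorr, D_pos, D_neg, d_unbiased(D_neg, 100, 100, 2))
```

The example shows why D cannot be read off the univariate effects alone: the same two d values yield different multivariate distances depending on the sign and size of the correlation between the variables.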
The axis connecting the centroids summarizes the differences between males and females
across the entire set of variables, and can be conveniently interpreted as an overall dimension of
male-female typicality or masculinity-femininity (M-F) in the domain described by those
variables (see footnote 4). To illustrate: in the case of facial morphology, the M-F axis would represent a
continuum of male-female typicality like the one shown in Figure 2.1b (see footnote 5). This continuum
summarizes the combination of anatomical features that make a particular face male- or female-
typical. Depending on the size of D, the male and female distributions may overlap substantially
along the continuum or form largely separate clumps (as in Figure 2.2). Individual scores on the
M-F axis are closely related to the gender diagnosticity index proposed by Lippa and Connelly
(1990). Gender diagnosticity is the probability that a given individual is male (or, symmetrically,
female), estimated with linear discriminant analysis from a set of sexually differentiated
variables (e.g., preferences for various occupations or activities). This probability can be used as
an index of masculinity-femininity, and is a function of an individual’s position along the M-F
axis.
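The link between position on the M-F axis and the probability of being male can be sketched in a few lines of Python. This is not the original gender diagnosticity procedure; it is the textbook posterior probability from linear discriminant analysis with two variables, assuming bivariate normality, a common covariance matrix, and (by default) equal group sizes. The function names and the centroids used in the example are hypothetical.

```python
import math

def inv2(s):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = s
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def prob_male(x, mean_m, mean_f, pooled_cov, prior_m=0.5):
    """Posterior probability that observation x comes from the male distribution."""
    s_inv = inv2(pooled_cov)
    diff = [mm - mf for mm, mf in zip(mean_m, mean_f)]
    # Discriminant weights: w = S^-1 (m_M - m_F)
    w = [sum(s_inv[i][j] * diff[j] for j in range(2)) for i in range(2)]
    # Score relative to the midpoint between the two centroids
    mid = [(mm + mf) / 2 for mm, mf in zip(mean_m, mean_f)]
    score = sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mid))
    log_odds = score + math.log(prior_m / (1 - prior_m))
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical centroids and covariance matrix, for illustration only.
mean_m, mean_f = [0.5, 0.5], [0.0, 0.0]
cov = [[1.0, 0.0], [0.0, 1.0]]
p_mid = prob_male([0.25, 0.25], mean_m, mean_f, cov)
p_at_male_centroid = prob_male(mean_m, mean_m, mean_f, cov)
p_at_female_centroid = prob_male(mean_f, mean_m, mean_f, cov)
print(p_mid, p_at_male_centroid, p_at_female_centroid)
```

At the midpoint between the centroids the probability is exactly .50, and at the male centroid the log odds equal D²/2, so the probability depends only on where the individual falls along the axis.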
4
Except in special cases, the M-F axis does not coincide with the discriminant axis. However, the position of an
individual point along the M-F axis (i.e., its projection onto the M-F axis in the direction of the classification
boundary) is equivalent to its position along the discriminant axis. Thus, scores on the M-F axis provide the same
information as discriminant scores.
5
In this case, “male-female typicality” is arguably preferable to “masculinity-femininity:” studies have shown that
when observers make judgements of facial masculinity, they rely on facial cues of body size in addition to sexually
dimorphic features (Holzleitner et al., 2014; Mitteroecker et al., 2015).
ranging from personality (D = 2.71 in Del Giudice et al., 2012; average D = 1.12 in Mac Giolla
& Kajonius, 2018) and vocational interests (D = 1.61 in Morris, 2016) to mate preferences
(average D = 2.41 in Conroy-Beam et al., 2015). For comparison, the size of multivariate sex
differences in facial morphology is about D = 3.20 (Hennessy et al., 2005).
Figure 2.2. Illustration of Mahalanobis’ distance (D) in the bivariate case. D is the standardized distance
between the male and female centroids in the bivariate space, taking the correlation between variables
into account. (If the variables are uncorrelated, D reduces to the Euclidean distance.) Note that the
distributions in the figure are bivariate normal with equal covariance matrices. The axis that connects the
male and female centroids can be interpreted as a dimension of male-female typicality or “masculinity-
femininity” (M-F) with respect to the relevant variables. Univariate differences are represented as d1 and d2.
6
Of note, Phillips et al. (2018) framed their study as a demonstration that “the sex of the human brain can be
conceptualized along a continuum rather than as binary” (emphasis added). But this is not what they did: the
correlations between sex differentiation scores and other variables were calculated within each sex, meaning that sex
was treated as a binary variable and implicitly “controlled for” by analyzing males and females separately.
approach that is conceptually similar to gender diagnosticity). They then selected a subset of
features showing sizable sex differences and averaged them into a summary score. The effect
size for this differentiation score was about d = 1.80 (see footnote 7). Depending on how they are constructed,
summary scores can be less prone to overfitting the sample data than D (see Section 2.3.2); at the
same time, they discard information about the correlation structure of the variables and tend to
underestimate the overall effect. Note that systematic variation in effect sizes across studies may
depend on several factors, from differences in the reference populations (e.g., cross-cultural or
age-related effects) to the methods employed to correct for measurement error and other artifacts
(more on this in Section 2.3.3).
It is worth stressing that multivariate effect sizes like D are not meant to replace
univariate indices like Cohen’s d. Univariate and multivariate approaches are complementary,
and whether one of them provides a more meaningful description of the data is going to depend
on the specific question being asked. Criticism of D as an effect size has focused on the supposed
lack of interpretability of the M-F axis, and on the fact that D can be inflated by adding large
numbers of irrelevant variables (Hyde, 2014; Stewart-Williams & Thomas, 2013). While these
points can be readily addressed (see above and Section 2.3.2; for a lengthier discussion see Del
Giudice, 2013), they do raise the crucial point that D is only meaningful to the extent that it
summarizes a coherent, theoretically justified set of variables. A related issue is that many
multidimensional constructs in psychology are also hierarchical; for example, the broad-band
structure of personality can be usefully described with five broad traits (the Big Five:
extraversion, openness, agreeableness, conscientiousness, and neuroticism/emotional instability),
but each of those traits can be split into multiple narrower traits or “facets” (e.g., the possible
facets of extraversion include friendliness, gregariousness, activity, assertiveness, excitement-
seeking, and cheerfulness). If sex differences in the lower-order facets of a trait run in opposite
directions, they may cancel out at the level of broad traits, leading to underestimates of the actual
effect size (see Del Giudice, 2015; Del Giudice et al., 2012). Thus, the choice of the appropriate
level of analysis is an important consideration when applying multivariate methods to
hierarchical constructs.
7
The paper did not report descriptive statistics for the differentiation score; unfortunately, the raw data were not
available for reanalysis (Owen R. Phillips, personal communication, November 2, 2018). I extracted frequencies and
central bin values from the histogram in Figure 2 of Phillips et al. (2018) with ImageJ 1.50 (Schneider et al., 2012),
and used them to recover approximate sample statistics (females: M = –0.25, SD = 0.29; males: M = 0.26, SD =
0.27).
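The approximate effect size can be checked against the recovered statistics with a short Python sketch. Two assumptions are mine rather than the source's: the standard deviations are pooled with equal weights because the group sizes are not given here, and the helper `binned_mean_sd`, along with its toy histogram, is a hypothetical illustration of how sample statistics can be recovered from bin centers and frequencies.

```python
import math

def binned_mean_sd(centers, counts):
    """Approximate mean and SD from histogram bin centers and frequencies."""
    n = sum(counts)
    mean = sum(c * k for c, k in zip(centers, counts)) / n
    ss = sum(k * (c - mean) ** 2 for c, k in zip(centers, counts))
    return mean, math.sqrt(ss / (n - 1))

# Toy histogram (hypothetical values, only to show the helper at work).
toy_mean, toy_sd = binned_mean_sd([0.0, 1.0, 2.0], [1, 2, 1])

# Recovered statistics reported in this footnote.
f_mean, f_sd = -0.25, 0.29
m_mean, m_sd = 0.26, 0.27

# Equal-weight pooling of the two variances (assumes equal group sizes).
pooled = math.sqrt((f_sd**2 + m_sd**2) / 2)
d = (m_mean - f_mean) / pooled
print(f"d = {d:.2f}")
```

The result is close to the value of about d = 1.80 given in the text; small discrepancies are expected from rounding and from the equal-weight pooling.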
30% of the variables contributed equally and the remaining 70% made no contribution to the
effect). For example, in the personality dataset analyzed by Del Giudice et al. (2012) the
heterogeneity coefficients are H2 = .90 and EPV2 = .16, suggesting that the overall difference is
largely driven by a small subset of variables. Note that there are several possible ways to assign
credit to individual variables (e.g., Garthwaite & Koch, 2016); the method used to calculate H2
and EPV2 is somewhat ad-hoc and will likely be superseded by better alternatives (see Del
Giudice, 2018). Still, these indices can be used heuristically to contextualize plain D values and
flag patterns that may warrant further attention.
2.2.1.3 Indices of overlap (OVL, OVL2). In contrast with difference metrics, indices of
overlap focus on similarity, as they quantify the proportion of the distribution area (or
volume/hypervolume) that is shared between males and females. When overlap is high, many
males have female-typical scores and many females have male-typical scores. The overlapping
coefficient (OVL) is the proportion of each distribution that is shared with the other (Bradley,
2006). This is a highly intuitive index of overlap; however, many researchers use a somewhat
different index (OVL2), in which overlap is calculated as the shared area relative to the joint
distribution (see footnote 8). The corresponding value can be calculated as 1–U1, where U1 is Cohen’s
coefficient of nonoverlap (Cohen, 1988). Typically, the quantity of interest is overlap rather than
nonoverlap; for convenience I use the label OVL2 to indicate 1–U1, the proportion of overlap
relative to the joint distribution. While OVL2 is a common index in psychology, its practical
interpretation is somewhat obscure, and some authors have argued (quite convincingly) that OVL
is preferable in most contexts (e.g., Grice & Barrett, 2014).
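The relation between the two overlap indices can be illustrated with a short Python sketch. The expression used for OVL, 2Φ(−|d|/2), is the standard result for two normal distributions with equal variances; it is stated here as an assumption, since only the conversion OVL2 = OVL/(2 − OVL) is given explicitly in Table 2.1. The function names are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def ovl(d):
    """Overlapping coefficient for two equal-variance normal distributions."""
    return 2 * PHI(-abs(d) / 2)

def ovl2(d):
    """Overlap relative to the joint distribution (equals 1 - U1)."""
    o = ovl(d)
    return o / (2 - o)

def u1(d):
    """Cohen's coefficient of nonoverlap."""
    return 1 - ovl2(d)

for d in (0.2, 0.5, 0.8, 2.0):
    print(f"d = {d:.1f}: OVL = {ovl(d):.3f}, OVL2 = {ovl2(d):.3f}, U1 = {u1(d):.3f}")
```

For any nonzero difference OVL2 is smaller than OVL, because the shared area is divided by the larger joint area rather than by the area of a single distribution.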
2.2.1.4 Indices of superiority (U3, CL). Another way of looking at differences and
similarities is to ask what proportion of people in the group with the higher mean would score
above the median member of the other group. The answer is provided by Cohen’s U3 coefficient,
which can be obtained from d or D under the same assumptions as the overlap indices (Figure 2.3;
Table 2.1). For example, both d = 0.50 and D = 0.50 correspond to U3 = .69. Following the usual
conventions, U3 = .69 with a positive d means that 69% of males score above the median female
(or, equivalently, that 69% of females score below the median male; Cohen, 1988). The
interpretation of U3 changes slightly when one is dealing with a multivariate distribution.
Specifically, U3 becomes the proportion of males that are more “masculine” or “male-typical”
8
The difference between OVL and OVL2 can be visualized by looking at Figure 2.5. OVL = (purple area)/(purple
area + blue area) = (purple area)/(purple area + pink area). OVL2 = (purple area)/(purple area + blue area + pink
area).
than the median female—or, symmetrically, the proportion of females that are more “feminine”
or “female-typical” than the median male.
The common language effect size (CL; also known as “probability of superiority”) is
another popular index that translates group differences into probabilities. Specifically, CL is the
probability that a randomly picked individual from the group with the higher mean will outscore
a randomly picked individual from the other group (McGraw & Wong, 1992). By assuming
normality and equality of variances/covariances, CL can be easily obtained from d or D (Figure
2.3; Table 2.1). As with U3, the interpretation of CL changes somewhat in a multivariate context,
and becomes the probability that a randomly picked male will be more “masculine” or “male-
typical” than a randomly picked female (or, symmetrically, the probability that a randomly
picked female will be more “feminine” or “female-typical” than a randomly picked male). The
original CL index can be generalized to discrete distributions (Vargha & Delaney, 2000), and
there are procedures to calculate confidence intervals when standard assumptions do not apply
(Vargha & Delaney, 2000; Zhou, 2008).
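Both superiority indices are easy to compute from d (or D) under the assumptions in Table 2.1, as the following Python sketch shows. The last function is a simple pairwise estimate of the probability of superiority from raw scores, counting ties as one half; it is included as an illustration of the idea behind the generalization to discrete data, not as a reproduction of any published procedure. All function names and example scores are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def u3(d):
    """Proportion of the higher-scoring group above the other group's median."""
    return PHI(abs(d))

def cl(d):
    """Common language effect size (probability of superiority)."""
    return PHI(abs(d) / math.sqrt(2))

def pairwise_superiority(xs, ys):
    """Share of (x, y) pairs with x > y, counting ties as one half."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    return wins / (len(xs) * len(ys))

print(f"d = 0.50: U3 = {u3(0.5):.3f}, CL = {cl(0.5):.3f}")  # U3 is about 0.691

# Hypothetical raw scores on a discrete scale.
males = [3, 4, 4, 5, 6]
females = [2, 3, 4, 4, 5]
print(pairwise_superiority(males, females))
```

The normal-theory values and the pairwise estimate answer the same question, but the latter makes no distributional assumptions and can be applied directly to ordinal or heavily tied data.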
Figure 2.3. Relations between the standardized mean difference (Cohen’s d or Mahalanobis’ D) and
various indices of difference/similarity. All conversion formulas assume (multivariate) normality and
equality of variances/covariance matrices. See Table 2.1 for details. OVL = proportion of overlap on a
single distribution. OVL2 = proportion of overlap on the joint distribution (equals 1–U1 in Cohen’s
terminology). U3 = proportion of a group above the median of the other group. CL = common language
effect size (“probability of superiority”). PCC = probability of correct classification (assuming equal
group sizes). η² = proportion of variance explained (assuming equal group sizes).
If variances/covariances differ between the sexes but normality still applies, the
approximately optimal classifier is not LDA but QDA (quadratic discriminant analysis; see
James et al., 2013). When distributions are strongly non-normal and patterns of sex differences
are characterized by nonlinearity and higher-order interactions, the PCC is going to depend on
the particular classification model chosen for the analysis. The menu of available methods has
been expanding rapidly thanks to advances in machine learning; common options include logistic
regression, classification trees, support vector machines (SVMs), and deep neural networks (see
Berk, 2016; Efron & Hastie, 2016; James et al., 2013; Skiena, 2017). Sophisticated classification
methods can be especially effective in complex datasets with large numbers of variables; it is not
a coincidence that many recent applications to sex differences come from neuroscience. In a
study by van Putten et al. (2018), a neural network trained on electroencephalogram signals
(EEG) was able to identify the sex of participants more than 80% of the time. Using regularized
logistic regression, Chekroud et al. (2016) achieved 93% accuracy in identifying the sex of
adult participants from brain structure. The same accuracy (93%) was reported by Anderson et
al. (2018), who employed both SVM and regularized logistic regression on a large sample of
inmates and controls. By applying SVM to brain scan data, Joel et al. (2018) obtained 72-80%
accuracy in adults, while Sepehrband et al. (2018) achieved 77-83% accuracy in children and
adolescents. In all these studies, classification was performed on multivariate data from the
whole brain, not on individual brain regions. Interestingly, the sex differentiation score computed
by Phillips et al. (2018) from brain structure data (see Section 2.2.1.2) yields an expected PCC =
.82 (estimated from d = 1.80), which is close to the performance of more complex algorithms.
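The expected PCC values quoted in this chapter can be reproduced with a one-line conversion. The sketch below assumes the standard normal-theory expression for the accuracy of the optimal linear classifier, PCC = Φ(d/2), which holds under normality, equal variances/covariances, and equal group sizes; the formula is stated here as an assumption, although it matches both figures given in the text. The function name is hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def pcc(d):
    """Expected probability of correct classification from d or D."""
    return PHI(abs(d) / 2)

print(f"d = 0.80: PCC = {pcc(0.80):.3f}")  # about 0.655
print(f"d = 1.80: PCC = {pcc(1.80):.3f}")  # about 0.816
```

A difference of zero gives PCC = .50 (chance level with equal group sizes), and accuracy approaches 1 only slowly as the effect size grows.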
2.2.1.6 Variance explained (η²). The proportion of variance in the variable of interest
that is explained by a categorical predictor (e.g., sex) is usually labeled eta squared (η²; see
Lakens, 2013; Olejnik & Algina, 2000). This is a classic effect size but not a very intuitive one;
for this reason, it is seldom employed in sex differences research (but see Deaux, 1985). The
9
This is different from gender diagnosticity (Section 2.2.1.2), which is the estimated probability that a particular
individual is male (or female), regardless of his/her actual sex.
2.2.1.7 Variance ratio (VR). Males and females may differ not only in their mean value
on a trait, but also in their variability around the mean. When computing most of the indices
reviewed in this chapter, unequal variances are treated as a deviation from standard assumptions
(Table 2.1); however, systematic differences in variability may be interesting in their own
right, for example because they can have large effects on the relative proportions of males and
females at the distribution tails (Section 2.2.1.8).
Empirically, males have been found to show larger variance than females in a majority of
traits, including most dimensions of personality (except neuroticism; see Del Giudice, 2015),
general intelligence (e.g., Arden & Plomin, 2006; Dykiert et al., 2009; Johnson et al., 2008),
specific cognitive skills (e.g., Bessudnov & Makarov, 2015; Hyde et al., 2008; Lakin, 2013; Wai
et al., 2018), brain size (e.g., Ritchie et al., 2018; Wierenga et al., 2017), and many other bodily
and physiological features (see Del Giudice et al., 2018; Lehre et al., 2009). In the human
literature, this is known as the “greater male variability hypothesis” (for a historical perspective
see Feingold, 1992), but the same general pattern is apparent in most sexually reproducing
species (Wyman & Rowe, 2014; Del Giudice et al., 2018). Some of these differences seem to
reflect scaling effects: if the variability of a trait increases with its mean level, the sex with the
higher mean will also show the larger variance. This is the case for physical traits such as height,
body mass, and brain volume. While the variance of these traits is higher in males, the
coefficient of variation (i.e., the standard deviation divided by the mean) is very similar in men
and women (Del Giudice et al., 2018). However, greater male variance is also found in domains
in which average differences are very small or favor females (such as general intelligence and
most personality traits).
The standard index for sex differences in variability is the variance ratio (VR), which by
convention is the ratio of the male variance to the female variance. In sex differences research,
variance ratios are usually calculated on univariate distributions (confidence intervals on VR are
discussed in Shaffer, 1992). However, the generalized variance of a multivariate distribution is
the determinant of the covariance matrix (Sen Gupta, 2004); a generalized variance ratio can be
easily obtained as the ratio of the male and female generalized variances (Table 2.1). Equality of
variances corresponds to VR = 1.00. In the domains of personality and cognition, values of VR
estimated from large samples are often smaller than 1.20 and rarely larger than 1.50. For
neuroticism and related traits, which tend to be more variable in females, VR usually ranges
between 0.90 and 1.00 (Del Giudice, 2015; Hyde, 2014; Lakin, 2013; Lippa, 2009). For
comparison, the variance ratio for height is estimated at about VR = 1.11 (average across
countries; Lippa, 2009).
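As a rough illustration of the two indices described above, the following sketch computes a univariate VR and a generalized variance ratio for two variables. The scores are made up for the example, the function names are hypothetical, and the determinant is written out for the 2 x 2 case only.

```python
from statistics import mean, variance

def variance_ratio(males, females):
    """VR = male sample variance / female sample variance."""
    return variance(males) / variance(females)

def generalized_variance_2d(xs, ys):
    """Determinant of the 2 x 2 sample covariance matrix of (x, y)."""
    mx, my = mean(xs), mean(ys)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return variance(xs) * variance(ys) - cov_xy ** 2

# Made-up illustrative scores on two variables (x, y) for each sex
males_x, males_y = [1.0, 3.0, 5.0, 7.0, 9.0], [2.0, 1.0, 4.0, 3.0, 5.0]
females_x, females_y = [2.0, 4.0, 5.0, 6.0, 8.0], [3.0, 1.0, 4.0, 2.0, 5.0]

vr_x = variance_ratio(males_x, females_x)
gvr = (generalized_variance_2d(males_x, males_y)
       / generalized_variance_2d(females_x, females_y))

print(vr_x)           # 2.0
print(round(gvr, 3))  # 0.954
```

With more than two variables, the same ratio would be computed from the determinants of the full male and female covariance matrices.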
2.2.1.8 Tail ratio (TR). The relative proportions of males and females in the region
around the mean are often less interesting than their representation at the tails of the distribution.
This is typically the case when the outcome of interest depends on competition (e.g., selection of
the top-ranking applicants for a job), the crossing of a threshold (e.g., selection requiring a
minimum passing score), or other nonlinear effects (e.g., the probability of committing violent
crimes may increase more steeply at the upper end of the distribution of aggression). Crucially,
small differences between means can have a substantial impact as one moves toward the tails of
the distribution; and even if males and females have exactly the same mean on a trait, sex
differences in variability can produce marked differences at the extremes (Halpern et al., 2007).
When the tails of the distribution are the focus of interest, summary indices such as mean
differences and overlap coefficients are uninformative; researchers may wish to calculate a tail
ratio (TR), that is, the relative proportion of the two sexes in the region above (or below) a
certain cutoff. Here I adopt a slight variation of the reference group method proposed by Voracek
et al. (2013); the alternative approach by Hedges & Friedman (1993) uses the total distribution of
the two groups combined. In the standard version of Voracek et al.’s method, the group with the
lower mean serves as the reference group, and the cutoff to identify the tail is placed at z
standard deviations from the lower mean (where z can be any value). The choice of cutoff is
noted as TRzSD: for example, TR2SD is the tail ratio for a cutoff located z = 2 standard deviations
above the lower mean; TR2.5SD is the tail ratio for a cutoff located z = 2.5 standard deviations
above the lower mean; and so on. In the context of sex differences, it is arguably more useful to
pick one of the two sexes as the reference group regardless of the ranking of means; in what
follows I use females as the reference group, in line with the standard convention for variance
ratios. While Voracek et al. (2013) proposed benchmarks for the interpretation of TR modeled on
those for Cohen’s d, fixed conventions are even less meaningful in this context and should
probably be avoided.
Tail ratios can be estimated from means and variances assuming normality, or from d and
D with the additional assumption of equal variances/covariances (Table 2.1). However, the
resulting estimates can be very sensitive to violations of these assumptions (see Section 2.3.1),
and researchers working with large samples often calculate tail ratios directly from frequency
data rather than from summary statistics (e.g., Lakin et al., 2013; Wai et al., 2018). Figure 2.4
shows how d determines the tail ratios above three common cutoffs. With equal variances (VR =
1), an effect size d = 0.50 corresponds to TR1SD = 1.94, TR2SD = 2.94, and TR3SD = 4.60. In other
words, there are almost twice as many males as females in the region one standard deviation
above the female mean (TR1SD); almost three times as many in the region two standard deviations
above the female mean (TR2SD); and 4.6 times as many in the region three standard deviations
above the female mean (TR3SD). As the standardized difference increases, TR becomes
disproportionately larger (note that the vertical axis of Figure 2.4 is logarithmic). Figure 2.4 also
illustrates the major impact of unequal variances, which—depending on how they combine with
distribution means—can dramatically amplify sex imbalances in the tails, but also attenuate or
even reverse them. While standardized differences and overlap coefficients are robust to minor
sex differences in variability, tail ratios can be remarkably sensitive to unequal variances.
Specifically, the impact of VR is maximized when d or D values are smaller and/or the chosen
cutoff is more extreme (Figure 2.4).
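The tail ratios quoted above can be reproduced with a short sketch under the stated normality assumption. Two details are assumptions of this example rather than the chapter's own formula: the function name `tail_ratio` is hypothetical, and the mean difference is expressed in female standard deviation units (which coincides with Cohen's d only when VR = 1).

```python
from math import sqrt
from statistics import NormalDist

def tail_ratio(d, z, vr=1.0):
    """Male:female ratio above a cutoff z female SDs above the female mean.

    Assumes normal distributions and equal group sizes. Females are
    N(0, 1); males have mean d (in female SD units) and variance vr.
    """
    females = NormalDist(0.0, 1.0)
    males = NormalDist(d, sqrt(vr))
    return (1 - males.cdf(z)) / (1 - females.cdf(z))

# Equal variances, d = 0.50
print(round(tail_ratio(0.5, 1), 2))  # 1.94
print(round(tail_ratio(0.5, 2), 2))  # 2.94
print(round(tail_ratio(0.5, 3), 2))  # 4.6

# Equal means but greater male variance still yields an imbalance in the tail
print(tail_ratio(0.0, 3, vr=1.2) > 1)  # True
```

Setting `vr` below 1 shows the opposite effect, in which greater female variance attenuates or reverses the imbalance.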
Figure 2.4. Tail ratios and the effect of unequal variances. The thick lines show the relative proportion of
males to females above the cutoffs located at one, two, and three standard deviations from the female
mean (TR1SD, TR2SD, and TR3SD) for positive values of d. Calculations assume normality, equal group
sizes, and equal variances in the two sexes (variance ratio VR = 1.00). The shaded areas represent changes
in tail ratios when variances are unequal, ranging from VR = 0.50 (twice as high in females) to VR = 2.00
(twice as high in males). Note that the impact of unequal variances on TR is stronger when the difference
between means is smaller and/or the cutoff is more extreme.
Besides visual exploration, relative distribution methods also support various types of
quantitative inference. Most intriguingly, the relative distribution can be easily decomposed into
independent components that separate the effects of location (i.e., differences in means or
medians) from those of shape (including, but not limited to, differences in variance). These
components of the distribution can be plotted separately to visually examine their characteristics,
or quantified and compared using information-theoretic measures (for details and examples see
Handcock & Morris, 1998). Despite their many attractive features, relative distribution methods
have been largely ignored in sex differences research; the few applications I am aware of—
limited to relative density plots—are in Bessudnov and Makarov (2015), Del Giudice (2011),
and Del Giudice et al. (2010, 2014).
Carothers and Reis (2013; Reis & Carothers, 2014) performed a taxometric analysis on
various putative indicators of gender, which they distinguished from biological sex: measures of
sexuality, mating preferences, empathy, intimacy, and personality (including the Big Five). They
found overwhelming support for a dimensional model and concluded that the latent structure of
gender—in contrast with that of sex—is not a binary but a continuum. They also argued that
average sex differences are “not consistent or big enough to accurately diagnose group
membership” (p. 401). However, a simpler interpretation of these findings is that the indicators
used in the study were too weak to detect the underlying taxa. As also noted by the authors,
taxometric procedures quickly lose sensitivity as group differences on the indicators become
smaller than d = 1.20 (Beauchaine, 2007; Ruscio et al., 2011); but almost all the effect sizes in
the study were below this threshold, and often substantially so. Because the indicators were
inadequate to detect taxonic differences, the analysis predictably indicated a dimensional
structure. The only set of psychological indicators with adequate effect sizes was a list of
preferences for sex-typed activities (e.g., boxing, hair styling, playing golf). Predictably, sex-
typed activities showed clear evidence of taxonicity, but this result was not treated as part of the
main analysis. Also, the authors’ claim that sex differences are too small and inconsistent to infer
a person’s sex from psychological measures is unfounded: personality traits alone can correctly
classify males and females with high probability, provided they are measured at the level of
narrow traits and aggregated with multivariate methods. For example, D = 2.71 (Del Giudice et
al., 2012) yields PCC = .91 using the standard formula (see footnote 10). In contrast, the Big Five lack the
10
Of course, this effect size is based on latent variables, and the corresponding PCC assumes error-free
measurement (Section 2.3.3). The point remains valid: in principle, a combination of narrow personality traits can
accurately discriminate between males and females. Note that Carothers and Reis’ claim concerned the actual
amount of overlap between the sexes, not the attenuating effects of measurement error.
resolution to accurately differentiate the sexes, and the corresponding effect sizes (d = 0.19 to
0.56 in the study) are too small to regard these traits as valid taxometric indicators. In light of
these limitations, the findings by Carothers and Reis (2013) are hard to interpret with any
confidence.
Beyond this particular study, it is unclear whether taxometric methods can make a
substantive contribution to sex differences research. The purpose of taxometrics is to probe for
the existence of taxa that cannot be directly observed, as is often the case with mental disorders
(Meehl, 1995). In meaningful applications, one does not know a priori whether the hypothetical
taxa exist or not, and there is a genuine possibility that the underlying structure of the data is
fully dimensional. But in the case of sex differences, the taxa (males and females) are already
known to the investigators, and indicator variables are chosen precisely because they can
distinguish between males and females. Given these premises, studies that use sufficiently strong
indicators (e.g., sex-typed activities) can be expected to confirm the existence of two sexes;
whereas studies that use weak indicators will be uninterpretable because of their lack of
sensitivity, as in Carothers and Reis (2013). Either way, the results are going to be
uninformative, unless the goal is to look for additional taxonic distinctions within each sex (e.g.,
discrete categories related to sexual orientation; Gangestad et al., 2000; Norris et al., 2015).
Unfortunately, the method devised by Joel et al. (2015) is seriously flawed. The threshold
for consistency is both arbitrary and exceedingly high: it is easy to show that, in realistic
conditions, the method always returns a small proportion of “internally consistent” individuals,
regardless of the pattern of differences and correlations among variables (Del Giudice et al.,
2015, 2016). This remains true even when the variables show unrealistically high levels of
consistency (i.e., all correlations among variables equal to .90). In light of this, it is not surprising
that Joel et al. (2015) found only 1.2% of internally consistent individuals in the domain of sex-
typed activities, with the same data that showed clear evidence of taxonicity in Carothers and
Reis’ (2013) analysis (see footnote 11). While “substantially variable” profiles are more sensitive to variations in
the data (Del Giudice et al., 2015; Joel et al., 2016), the percentages returned by this method can
be quite misleading if taken at face value. The authors have continued to present their findings as
evidence that most brains are “gender/sex mosaics” (Joel & Fausto-Sterling, 2016; Hyde et al.,
2018). The question they address is without doubt an important one; patterns of
consistency/inconsistency among sex-related traits can be both theoretically interesting and
practically important. However, their method is designed to show invariably low levels of
internal consistency, and I cannot recommend it as a useful analytic tool.
2.2.3 Visualization
There are many possible ways to visualize sex differences/similarities in plots and
diagrams; the most appropriate type of display is going to depend on the researchers’ aims and
their intended audience. Figure 2.5a shows a relative density plot with females as the reference
group (Section 2.2.2.1). This plot does not depict the original distributions but only their relative
differences, and highlights the behavior of the variable in the tail regions. While relative density
plots can be very informative, they are not immediately intuitive and require some technical
background to interpret. A similar type of plot based on quantile differences instead of relative
densities is discussed in Rousselet et al. (2017) and Wilcox (2006). In Figure 2.5b, the male and
female probability densities are overlaid on the same plot (e.g., Ritchie et al., 2018). This
straightforward display conveys a lot of information, including the shape of the two distributions,
the difference between means, and the amount of male-female overlap—though it is less
effective than the relative density plot in showing differences in the tail regions. Overlay density
plots are similar to split violin plots, in which densities are displayed side by side instead of
overlaid (e.g., Wai et al., 2018); however, split violin plots make it hard to visualize the overlap
between distributions. Both density and relative density plots can be used to visually detect
obvious deviations from standard assumptions.
When effect sizes are mapped on normal distributions (with equal or unequal variances),
normalized density plots (Figure 2.5c) offer an intuitive display of standardized differences and
overlaps (e.g., Maney, 2016; see footnote 12). Plots of actual or normalized distributions can be easily
augmented with confidence intervals on d, as shown in Figure 2.5c. Still, this kind of plot is
inherently univariate, and can be misleading when one wants to present the results of
multivariate analyses. In complex multivariate contexts, the overlap between distributions is
usually the most intuitive metric; overlap coefficients can be visualized with Venn diagrams
(Figure 2.5d) in which areas represent proportions of overlap and nonoverlap (e.g., Del Giudice
et al., 2012).
11
To see why, consider a fictional man who hates talk shows and cosmetics and is passionate about boxing and
video games (male-typical values), but does not particularly like golf (intermediate). He would be classified as
showing an “intermediate” profile of gendered interests. If he happened to dislike golf (female-typical value), he
would be classified as a sex/gender mosaic with a “substantially variable” interest profile (see Del Giudice et al.,
2015).
12
Note that some of the normalized plots in Maney (2016) show atypically large differences in variance between
males and females, up to about VR = 23. However, those plots are based on very small samples, and the extreme
differences in variability they display are most likely due to sampling error.
Figure 2.5. Four visualizations of sex differences/similarities. All plots are based on the same dataset
with d = 1.0. (a) Relative density plot. This plot shows the relative male:female density at different
quantiles of the female distribution (bottom axis); the corresponding values of the variable (X) are shown
for reference on the top axis. Dotted lines represent 95% pointwise confidence intervals. Assuming equal
group sizes, a relative density of 1.0 (horizontal dashed line) indicates equal proportions of males and
females. Under the same assumption, there are about five times as many males as females with values at
the lower extreme of the female distribution (0.0 on the bottom axis; relative density ≈ 5.0). At the
median of the female distribution (0.5 on the bottom axis) there are about three times as many females as
males (relative density ≈ 0.3), approximately the same proportions found at the upper extreme (1.0 on the
bottom axis). (b) Overlay density plot of the male and female distributions. This plot shows the shape of
the distributions, their overlap, and the location of means (vertical dotted lines). (c) Normalized plot of
the male and female distributions. This plot shows the standardized mean difference and the
corresponding overlap assuming normality and equality of variances (in this case, OVL = .62 and OVL2 =
.45). Horizontal bars represent 95% confidence intervals on d; the colors on the bottom bar can be
reversed when the interval includes opposite-sign values. (d) Venn diagram of the overlap between the
male and female distributions. This type of diagram can be used to intuitively communicate the overall
size of effects in complex multivariate contexts.
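The overlap values reported for panel (c) can be checked with a brief sketch. It assumes the usual normal-theory formulas, OVL = 2Φ(−|d|/2) and OVL2 = OVL/(2 − OVL), which reproduce the values in the caption; the function names are hypothetical.

```python
from statistics import NormalDist

def ovl(d: float) -> float:
    """Overlapping coefficient for two equal-variance normal distributions."""
    return 2 * NormalDist().cdf(-abs(d) / 2)

def ovl2(d: float) -> float:
    """Overlap expressed as a proportion of the joint distribution."""
    o = ovl(d)
    return o / (2 - o)

print(round(ovl(1.0), 2))   # 0.62
print(round(ovl2(1.0), 2))  # 0.45
```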
Many of the standard formulas presented in this chapter make the assumptions of
normality and equality of variances/covariances in the population. These formulas are useful
because they allow investigators to calculate a wide range of indices from commonly reported
statistics such as means, standard deviations, correlations, and values of d or D. Moreover, some
non-standard indices (e.g., multivariate overlap between non-normal distributions) may be
complicated to obtain even if raw data are available. Still, deviations from normality are quite
common: empirical data are frequently skewed, have heavier tails than expected under a normal
distribution, and so on (e.g., Limpert & Stahel, 2011). The size of indices like d and D is
sensitive to both non-normality and the presence of outliers (Wilcox, 2006); moreover, exact
formulas for confidence intervals are only accurate when normality can be assumed. Remedies to
these distorting effects include bootstrap confidence intervals and robust variants of Cohen’s d
that eliminate the influence of extreme values (e.g., Algina et al., 2005; see Kirby & Gerlanc,
2013). Deviations from normality may also change the amount of overlap between distributions.
When this is the case, robust nonparametric methods can be used to estimate the OVL coefficient
in place of the usual formulas (Anderson et al., 2012; Schmid & Schmidt, 2006). As noted in
Section 2.2.1, when variances/covariances are markedly unequal it is possible to use QDA
instead of LDA to estimate the PCC; however, both models are quite sensitive to non-normality
(Eisenbeis, 1977), which limits the utility of standard formulas when normality assumptions are
not met.
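One of the remedies mentioned above, a bootstrap confidence interval on d, can be sketched as follows. This is a plain percentile bootstrap with resampling within each sex, not the robust variants of d cited in the text. The data are simulated, and all names and settings (number of resamples, seeds) are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, variance

def cohens_d(males, females):
    """Standardized mean difference (male minus female) with pooled SD."""
    n_m, n_f = len(males), len(females)
    pooled_var = ((n_m - 1) * variance(males) +
                  (n_f - 1) * variance(females)) / (n_m + n_f - 2)
    return (mean(males) - mean(females)) / sqrt(pooled_var)

def bootstrap_ci(males, females, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for d, resampling within each sex."""
    rng = random.Random(seed)
    estimates = sorted(
        cohens_d(rng.choices(males, k=len(males)),
                 rng.choices(females, k=len(females)))
        for _ in range(n_boot)
    )
    lower = estimates[int(n_boot * (1 - level) / 2)]
    upper = estimates[int(n_boot * (1 + level) / 2) - 1]
    return lower, upper

# Simulated data (hypothetical): population d = 0.50, 100 cases per sex
rng = random.Random(1)
males = [rng.gauss(0.5, 1.0) for _ in range(100)]
females = [rng.gauss(0.0, 1.0) for _ in range(100)]

d_hat = cohens_d(males, females)
ci_low, ci_high = bootstrap_ci(males, females)
print(f"d = {d_hat:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Because the interval is built from the empirical distribution of resampled estimates, it does not rely on the normality assumption behind the exact formulas.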
The most widely used test of univariate normality is the Shapiro-Wilk test (Garson, 2012;
Yap & Sim, 2011). Multivariate normality is harder to assess, and no single method performs
well in all conditions (Mecklin & Mundfrom, 2004, 2005). Thus, the recommended approach is
to combine multiple tests (which do not always agree with one another) and supplement them
with graphical displays (Holgersson, 2006; Korkmaz et al., 2014; see Mecklin & Mundfrom,
2004). Levene’s test is the standard procedure for comparing variances, and there are robust
versions of the test that are less sensitive to non-normality (Gastwirth et al., 2009). The equality
of covariance matrices is usually evaluated with Box’s M test. Unfortunately, the M test suffers
from a high rate of false positives (i.e., it rejects homogeneity too often) and is very sensitive to
departures from multivariate normality; the latter problem can be lessened by using robust
variants of the test (Anderson, 2006; O’Brien, 1992). More generally, using significance tests to
evaluate assumptions is not without problems. With small samples, many tests have low power
to detect violations; but when sample size is large, very small deviations from perfect
normality/homogeneity may cause a test to reject the assumption, even if the practical
consequences may be negligible.
Section 2.2.1, variance ratios are often lower than 1.20 and rarely higher than 1.50. Large
discrepancies between male and female variances typically occur as a consequence of non-
normality (e.g., skewed distributions with long tails), the presence of outliers, ceiling/floor
effects, and other artifacts. With variance ratios in the usual range and approximately normal
distributions, the results of the formulas in Table 2.1 are very close to the actual values even
when variances differ between the sexes (with the exception of tail ratios; see below). Because
equality of variances cannot be generally assumed, one can test the equality of correlation
matrices (which are standardized and do not contain information on variance) instead of that of
covariance matrices. This can be done with various significance tests (e.g., Jennrich, 1970;
Steiger, 1980; see Revelle, 2018). However, these tests suffer from the usual problems of low
sensitivity in small samples and excessive sensitivity in large samples (see above). An alternative
that does not rely on significance is to compare sample correlation matrices with Tucker’s
congruence coefficient (φ or CC; Abdi, 2007). The CC coefficient is an index of matrix
congruence that ranges from –1.00 to 1.00. Lorenzo-Seva & ten Berge (2006) proposed
benchmarks for CC based on expert judgments; following their recommendations, values of .85
or more indicate fair similarity, while values above .95 indicate high similarity. A high value of
CC implies that there are no major discrepancies between the correlation matrices of males and
females. In many applications, this justifies the use of multivariate indices, with the caveat that
the resulting values are best regarded as reasonable approximations. Inspection of the correlation
matrices (and their difference) may point to specific variables that seem to behave differently in
the two sexes. Yet another strategy is to employ structural equation modeling (SEM) to fit a
multi-group factor model of the variables (see below), and use model fit indices to evaluate the
equivalence of correlations in the two sexes (e.g., Del Giudice et al., 2012).
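A congruence coefficient between male and female correlation matrices can be sketched as the sum of cross-products of corresponding elements divided by the square root of the product of their sums of squares. Restricting the calculation to the off-diagonal elements is an assumption of this example, as are the function name and the made-up matrices.

```python
from math import sqrt

def congruence_coefficient(r_a, r_b):
    """Tucker's congruence coefficient between two square matrices,
    computed over the off-diagonal (lower-triangle) elements only.
    """
    n = len(r_a)
    a = [r_a[i][j] for i in range(n) for j in range(i)]
    b = [r_b[i][j] for i in range(n) for j in range(i)]
    cross = sum(x * y for x, y in zip(a, b))
    return cross / sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Made-up correlation matrices for three variables
r_males = [[1.0, 0.5, 0.3],
           [0.5, 1.0, 0.4],
           [0.3, 0.4, 1.0]]
r_females = [[1.0, 0.4, 0.3],
             [0.4, 1.0, 0.5],
             [0.3, 0.5, 1.0]]

print(round(congruence_coefficient(r_males, r_females), 2))  # 0.98
```

Including the unit diagonal would push the coefficient toward 1.00, which is why this sketch leaves it out.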
While most of the standard formulas are robust to minor violations of their assumptions,
this is emphatically not the case for tail ratios. The formulas used to estimate TR from effect sizes
or summary statistics are very sensitive to small deviations from the hypothesized distributions,
particularly when differences between groups are small and/or cutoffs are extreme (Figure 2.4).
Thus, estimates of TR based on standard formulas should be treated with special caution unless
the underlying assumptions can be reasonably justified.
When they are calculated from sample data, d and D are not unbiased estimators of the
corresponding population parameters but exhibit a certain amount of bias away from zero (i.e.,
their expected value overestimates the absolute size of the effect). Bias is typically negligible in
large samples, but can be substantial in small studies; it transmits to other indices when
conversion formulas are used (Table 2.1), and may lead investigators to overestimate the size of
sex differences in their data. The bias in d arises from the fact that the pooled sample variance
slightly underestimates the population variance, and is only an issue when sample size is very
small: it amounts to less than 5% of the absolute value when the total N ≥ 18, and less than
1% when N ≥ 78. The bias-corrected variant of Cohen’s d is known as du or Hedges’ g; a simple
correction formula is reported in Table 2.1 (see Hedges, 1981; Kelley, 2005). The bias in D is a
bigger concern, because random deviations from zero in the univariate effects (caused by
sampling error) add up and collectively inflate the value of D. In a previous paper (Del Giudice,
2013), I suggested a simple rule of thumb based on simulations: the bias in D can be kept to
acceptable levels (i.e., less than 0.05 in absolute value) by having at least 100 cases for each
variable in the analysis (e.g., N ≥ 500 when calculating D from 5 variables). The rule works as
advertised when D ≥ 0.45, but bias can still be substantial for smaller values of D. A better
alternative when N is small relative to the number of variables is to use the correction formula
reported in Table 2.1, which yields the small-sample variant Du (Lachenbruch & Mickey, 1968;
Hess et al., 2007).
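The small-sample correction for d can be illustrated with the widely used approximation g = d × (1 − 3/(4df − 1)), with df = N − 2. Since Table 2.1 is not reproduced here, the formula in the table may differ in form, so this is a sketch under that assumption; the function names are hypothetical. The correction for D is not shown.

```python
def hedges_g(d: float, n_total: int) -> float:
    """Approximate bias-corrected d: g = d * (1 - 3 / (4 * df - 1))."""
    df = n_total - 2
    return d * (1 - 3 / (4 * df - 1))

def relative_bias(n_total: int) -> float:
    """Share of d removed by the correction, 1 - g/d."""
    return 3 / (4 * (n_total - 2) - 1)

print(round(hedges_g(0.50, 20), 3))  # 0.479

# The thresholds quoted in the text follow from the approximation
print(relative_bias(18) < 0.05)  # True
print(relative_bias(78) < 0.01)  # True
```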
While upward bias increases the apparent size of sex differences, measurement error has
the opposite effect. When variables are measured with error, the raw difference between group
means remains approximately the same but the standard deviation is inflated by noise; as a
consequence, standardized indices like d and D become proportionally smaller. When
measurement is unreliable, this reduction (attenuation) can be substantial. In classical test theory,
the reliability of a measure is the proportion of variance attributable to the construct being
measured (“true score variance,” as contrasted with “error variance”). Assuming that sex is
measured without error, the true value of d is attenuated by the square root of the reliability: d =
1.00 becomes 0.95 if the measure has 90% reliability, 0.84 with 70% reliability, and 0.71 with
50% reliability (Schmidt & Hunter, 2014; see also Schmidt & Hunter, 1996). In the case of D,
measurement error reduces both the univariate differences and the correlations among variables;
these effects may either reinforce or oppose one another depending on the correlation structure
and the direction of the univariate effects. In the field of sex differences, the large majority of
individual studies and meta-analyses fail to correct for attenuation due to measurement error, and
as a result yield downward-biased estimates of effect sizes. This is also the case for the literature
syntheses compiled by Hyde (2005) and Zell et al. (2015).
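The attenuation of d by unreliability, and its inverse, can be sketched in a few lines. The example assumes, as in the text, that sex is measured without error and that a single reliability coefficient applies to the measure; the function names are hypothetical.

```python
from math import sqrt

def attenuate(d_true: float, reliability: float) -> float:
    """Observed d expected when the measure has the given reliability."""
    return d_true * sqrt(reliability)

def disattenuate(d_observed: float, reliability: float) -> float:
    """Inverse operation: estimate of the error-free d."""
    return d_observed / sqrt(reliability)

for rel in (0.90, 0.70, 0.50):
    print(rel, round(attenuate(1.00, rel), 2))
# 0.9 0.95
# 0.7 0.84
# 0.5 0.71
```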
There are two main approaches to correcting for measurement error. The first and
simpler method is to estimate the reliability of measures from sample data, then disattenuate d by
dividing it by the square root of the reliability coefficient. For example, consider a standardized
difference d = 0.50 on a variable with reliability .77. The square root of .77 is .88, and the
disattenuated d is 0.50/.88 = 0.57. To calculate D, both univariate effect sizes and correlations
The second and more sophisticated approach is to use latent variable methods (most
commonly SEM) to explicitly model the factor structure of the measures, and obtain estimates of
sex differences on latent variables instead of observed scores (e.g., Del Giudice et al., 2012; for a
different approach to factor analysis with SEM see Marsh et al., 2014). This applies to both
univariate and multivariate differences. If the factor structure is correctly specified, latent
variable modeling sidesteps the many problems of α and can achieve nearly error-free estimates
of the underlying effects (Brown, 2015; Kline, 2016; Rhemtulla et al., 2018). Typically, SEM
estimates of sex differences are notably larger than those obtained with reliability-based
disattenuation. In Del Giudice et al. (2012), we examined the effect of different correction
methods on the same dataset (15 personality facets in a large United States sample). With
uncorrected raw scores, we obtained D = 1.49. Disattenuation with α raised the estimate to D =
1.72; fitting a multigroup SEM and calculating the effect size from latent mean differences and
correlations yielded D = 2.71. Similarly, Mac Giolla and Kajonius (2018) calculated D on 30
facets of the Big Five, with no error correction; their average estimate across countries was D =
1.12. Of course, the use of SEM raises additional methodological issues, primarily that of
measurement invariance between the sexes (or lack thereof; see Brown, 2015; Kline, 2016). Note
that while invariance is desirable, the practical impact of statistically significant violations may
be small enough to be tolerable or even negligible (especially in large samples; e.g., Schmitt et
al., 2011). Nye and Drasgow (2011) developed methods to quantify the effects of measurement
non-invariance at the item level and estimate its impact on observed (not latent) group
differences. In the presence of sizable distortions, it may still be possible to estimate latent
differences by fitting a partially invariant model (Guenole & Brown, 2014; Schmitt et al., 2011).
As an alternative to SEM, models based on item response theory (IRT) can also be used to
estimate sex differences on latent variables (e.g., Liddell & Kruschke, 2018).
Measurement error is not the only artifact researchers should guard against. Floor and
ceiling effects can severely distort measurement, and either inflate or deflate sex differences
depending on the direction of the effect, the direction of the artifact (floor vs. ceiling), and the
relative variances of males and females (Wilcox, 2006; see also Liddell & Kruschke, 2018).
Range restriction is another insidious artifact that occurs in a variety of research contexts: when
the participants of a study are (directly or indirectly) selected from the original population on the
basis of their personal characteristics, the resulting effect sizes can be substantially biased. There
are several methods and formulas that attempt to correct for range restriction, though they are not
without limitations (see Hunter & Schmidt, 2014; Johnson et al., 2017).
2.3.4 Meta-analysis
differences in psychology are often denounced as dangerous and socially harmful (e.g., Fine,
2010; Hyde, 2005; Reis & Carothers, 2014). In principle, several methods can be used to detect
publication and/or reporting bias in meta-analytic datasets (Jin et al., 2015). Unfortunately, the
standard tests are easy to misapply, suffer from high rates of false negatives unless the dataset
includes a large number of studies, and may mistake other sources of heterogeneity for evidence
of bias (Ioannidis, 2008b; Ioannidis & Trikalinos, 2007; Jin et al., 2014). Thus, common tests of
bias can be meaningfully applied only in the relatively few cases in which effect sizes are fairly
homogeneous across studies (Ioannidis & Trikalinos, 2007).
In recent years, standard procedures based on the distribution of effect sizes have been
joined by p-curve and p-uniform analyses, two methods that rely on the distribution of significant
p values in a set of studies to detect selective publication and/or reporting (Simonsohn et al.,
2014a, 2015; van Assen et al., 2015). The same methods can be used to estimate the average
effect size of a set of studies from their significant p values (Simonsohn et al., 2014b; van Assen
et al., 2015), thus complementing standard meta-analytic techniques. However, both p-curve and
p-uniform may overestimate the population effect when studies are highly heterogeneous (van
Aert et al., 2016). There are also some concerns about the validity of p-curve methods in non-
experimental research, when changes in significance may depend on the selective inclusion of
covariates in the analysis (see Bruns & Ioannidis, 2016).
2.4 Conclusion
In concluding this chapter it may be useful to point out that, important as it is, successful
quantification is only the beginning of understanding. Research on sex differences and
similarities relies on an exceptionally rich toolkit of methods, ranging from experimental studies
to developmental, cross-cultural, and even comparative research across species. Together, these
methods can be used to understand how sex differences in various domains vary systematically
across contexts, and which main factors reduce or amplify them. At a deeper level, an
emphasis on measurement should not blind investigators to the possibility that males and females
may differ in qualitative rather than purely quantitative ways. For example, the same traits may
be influenced by different causal factors in the two sexes, or predict different patterns of
outcomes. If multiple sexually differentiated traits interact with each other in complex patterns,
they may give rise to configural or “gestalt” effects that are not well captured by their linear
combination (as implicitly assumed by D or discriminant analysis). Other nonlinear relations
between traits and outcomes (e.g., threshold effects) may turn graded quantitative differences
into discrete transitions. In some cases, males and females may possess different psychological
specializations that follow qualitatively different rules of operation. No doubt, the study of sex
differences and similarities will remain an exciting enterprise for a long time to come, and it is
easy to predict that high-quality measurement will play an ever more central role in the future of
the field.
Acknowledgments
I am grateful to Drew Bailey, Mike Bailey, Alex Byrne, Tom Booth, Doug VanderLaan,
and Ivy Wong for their many thoughtful comments on earlier drafts of this chapter.
References
Del Giudice, M., Klimczuk, A. C. E., Traficonte, D. M., & Maestripieri, D. (2014). Autistic-like and
schizotypal traits in a life history perspective: Diametrical associations with impulsivity,
sensation seeking, and sociosexual behavior. Evolution & Human Behavior, 35, 415-424.
Del Giudice, M., Lippa, R. A., Puts, D. A., Bailey, D. H., Bailey, J. M., & Schmitt, D. P. (2015). Mosaic
brains? A methodological critique of Joel et al. (2015). doi: 10.13140/RG.2.1.1038.8566.
Del Giudice, M., Lippa, R. A., Puts, D. A., Bailey, D. H., Bailey, J. M., & Schmitt, D. P. (2016). Joel et
al.'s method systematically fails to detect large, consistent sex differences. Proceedings of the
National Academy of Sciences USA, 113, E1965-E1965.
Dienes, Z. (2016). How Bayes factors change scientific practice. Journal of Mathematical Psychology,
72, 78-89.
Dunn, O. J., & Varady, P. D. (1966). Probabilities of correct classification in discriminant analysis.
Biometrics, 22, 908-924.
Dunn, T. J., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to the
pervasive problem of internal consistency estimation. British Journal of Psychology, 105, 399-
412.
Dykiert, D., Gale, C. R., & Deary, I. J. (2009). Are apparent sex differences in mean IQ scores created in
part by sample restriction and increased male variance? Intelligence, 37, 42-47.
Eagly, A. H., & Wood, W. (2013). The nature–nurture debates: 25 years of challenges in understanding
the psychology of gender. Perspectives on Psychological Science, 8, 340-357.
Efron, B., & Hastie, T. (2016). Computer age statistical inference: Algorithms, evidence, and data
science. New York: Cambridge University Press.
Eisenbeis, R. A. (1977). Pitfalls in the application of discriminant analysis in business, finance, and
economics. Journal of Finance, 32, 875-900.
Ellis, L. (2011). Identifying and explaining apparent universal sex differences in cognition and behavior.
Personality and Individual Differences, 51, 552–561.
Ellis, L., Hershberger, S., Field, E., Wersinger, S., Pellis, S., Geary, D., … & Karadi, K. (2008). Sex
differences: Summarizing more than a century of scientific research. New York: Psychology
Press.
Fausto-Sterling, A. (2012). Sex/gender: Biology in a social world. New York: Routledge.
Ferguson, C. J. (2009). An effect size primer: A guide for clinicians and researchers. Professional
Psychology: Research and Practice, 40, 532-538.
Fine, C. (2010). Delusions of gender: How our minds, society, and neurosexism create difference. New
York: Norton.
Furlow, C. F., & Beretvas, S. N. (2005). Meta-analytic methods of pooling correlation matrices for
structural equation modeling under different patterns of missing data. Psychological Methods, 10,
227-254.
Gangestad, S. W., Bailey, J. M., & Martin, N. G. (2000). Taxometric analyses of sexual orientation and
gender identity. Journal of Personality and Social Psychology, 78, 1109-1121.
Garthwaite, P. H., & Koch, I. (2016). Evaluating the contributions of individual variables to a quadratic
form. Australian & New Zealand Journal of Statistics, 58, 99-119.
Gastwirth, J. L., Gel, Y. R., & Miao, W. (2009). The impact of Levene's test of equality of variances on
statistical theory and practice. Statistical Science, 24, 343-360.
Geary, D. C. (2010). Male, female: The evolution of human sex differences (2nd ed.). Washington, DC:
American Psychological Association.
Geary, D. C. (2015). Evolution of vulnerability: Implications for sex differences in health and
development. San Diego, CA: Academic Press.
Glick, N. (1978). Additive estimators for probabilities of correct classification. Pattern Recognition, 10,
211-222.
Grice, J. W., & Barrett, P. T. (2014). A note on Cohen's overlapping proportions of normal distributions.
Psychological Reports, 115, 741-747.
Guenole, N., & Brown, A. (2014). The consequences of ignoring measurement invariance for path
coefficients in structural equation models. Frontiers in Psychology, 5, 980.
Haig, D. (2004). The inexorable rise of gender and the decline of sex: Social change in academic titles,
1945–2001. Archives of Sexual Behavior, 33, 87-96.
Halpern, D. F., Benbow, C. P., Geary, D. C., Gur, R. C., Hyde, J. S., & Gernsbacher, M. A. (2007). The
science of sex differences in science and mathematics. Psychological Science in the Public
Interest, 8, 1-51.
Handcock, M. S., & Janssen, P. L. (2002). Statistical inference for the relative density. Sociological
Methods & Research, 30, 394-424.
Handcock, M. S., & Morris, M. (1998). Relative distribution methods. Sociological Methodology, 28, 53-
97.
Handcock, M. S., & Morris, M. (1999). Relative distribution methods in the social sciences. New York:
Springer.
Hedges, L. (1981). Distribution theory for Glass’s estimator of effect size and related estimators. Journal
of Educational Statistics, 6, 107–128.
Hedges, L. V., & Friedman, L. (1993). Gender differences in variability in intellectual abilities: A
reanalysis of Feingold’s results. Review of Educational Research, 63, 94-105.
Helgeson, V. S. (2016). Psychology of gender (5th ed.). New York: Routledge.
Hennessy, R. J., McLearie, S., Kinsella, A., & Waddington, J. L. (2005). Facial surface analysis by 3D
laser scanning and geometric morphometrics in relation to sexual dimorphism in cerebral–
craniofacial morphogenesis and cognitive function. Journal of Anatomy, 207, 283-295.
Hess, M. R., Hogarty, K. Y., Ferron, J. M., & Kromrey, J. D. (2007). Interval estimates of multivariate
effect sizes: Coverage and interval width estimates under variance heterogeneity and
nonnormality. Educational and Psychological Measurement, 67, 21-40.
Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical benchmarks for interpreting
effect sizes in research. Child Development Perspectives, 2, 172-177.
Holgersson, H. E. T. (2006). A graphical method for assessing multivariate normality. Computational
Statistics, 21, 141-149.
Holzleitner, I. J., Hunter, D. W., Tiddeman, B. P., Seck, A., Re, D. E., & Perrett, D. I. (2014). Men's
facial masculinity: When (body) size matters. Perception, 43, 1191-1202.
Hooten, M. B., & Hobbs, N. T. (2015). A guide to Bayesian model selection for ecologists. Ecological
Monographs, 85, 3-28.
Huberty, C. J. (2002). A history of effect size indices. Educational and Psychological Measurement, 62,
227-240.
Huberty, C. J. (2005). Mahalanobis distance. In B. S. Everitt and D. C. Howell (Eds.), Encyclopedia of
statistics in behavioral science (pp. 1110-1111). Chichester, UK: Wiley.
Hull, C. L. (2003). Letter to the Editor: How sexually dimorphic are we? Review and synthesis. American
Journal of Human Biology, 15, 112-116.
Hyde, J. S. (2005). The gender similarities hypothesis. American Psychologist, 60, 581-592.
Hyde, J. S. (2014). Gender similarities and differences. Annual Review of Psychology, 65, 373-398.
Hyde, J. S., Bigler, R. S., Joel, D., Tate, C. C., & van Anders, S. M. (2018). The future of sex and gender
in psychology: Five challenges to the gender binary. American Psychologist.
Hyde, J. S., Lindberg, S. M., Linn, M. C., Ellis, A. B., & Williams, C. C. (2008). Gender similarities
characterize math performance. Science, 321, 494-495.
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.
Ioannidis, J. P. (2008a). Why most discovered true associations are inflated. Epidemiology, 19, 640-648.
Ioannidis, J. P. (2008b). Interpretation of tests of heterogeneity and bias in meta-analysis. Journal of
Evaluation in Clinical Practice, 14, 951-957.
Ioannidis, J. P., & Trikalinos, T. A. (2007). The appropriateness of asymmetry tests for publication bias in
meta-analyses: a large survey. Canadian Medical Association Journal, 176, 1091-1096.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with
applications in R. New York: Springer.
Janicke, T., Häderer, I. K., Lajeunesse, M. J., & Anthes, N. (2016). Darwinian sex roles confirmed across
the animal kingdom. Science Advances, 2, e1500983.
Janssen, D. F. (2018). Know thy gender: Etymological primer. Archives of Sexual Behavior,
doi:10.1007/s10508-018-1300-x
Jennrich, R. I. (1970). An asymptotic χ² test for the equality of two correlation matrices. Journal of
the American Statistical Association, 65, 904-912.
Jin, Z. C., Zhou, X. H., & He, J. (2015). Statistical methods for dealing with publication bias in meta-
analysis. Statistics in Medicine, 34, 343-360.
Joel, D. (2012). Genetic-gonadal-genitals sex (3G-sex) and the misconception of brain and gender, or,
why 3G-males and 3G-females have intersex brain and intersex gender. Biology of Sex
Differences, 3, 27.
Joel, D., Berman, Z., Tavor, I., Wexler, N., Gaber, O., Stein, Y., ... & Liem, F. (2015). Sex beyond the
genitalia: The human brain mosaic. Proceedings of the National Academy of Sciences USA, 112,
15468-15473.
Joel, D., & Fausto-Sterling, A. (2016). Beyond sex differences: New approaches for thinking about
variation in brain structure and function. Philosophical Transactions of the Royal Society of
London B, 371, 20150451.
Joel, D., Persico, A., Hänggi, J., Pool, J., & Berman, Z. (2016). Reply to Del Giudice et al., Chekroud et
al., and Rosenblatt: Do brains of females and males belong to two distinct populations?
Proceedings of the National Academy of Sciences USA, 113, E1969-E1970.
Joel, D., Persico, A., Salhov, M., Berman, Z., Oligschlager, S., Meilijson, I., & Averbuch, A. (2018).
Analysis of human brain structure reveals that the brain ‘types’ typical of males are also typical of
females, and vice versa. Frontiers in Human Neuroscience, 12, 399.
Johnson, W., Carothers, A., & Deary, I. J. (2008). Sex differences in variability in general intelligence: A
new look at the old question. Perspectives on Psychological Science, 3, 518-531.
Johnson, W., Deary, I. J., & Bouchard Jr, T. J. (2017). Have standard formulas correcting correlations for
range restriction been adequately tested? Minor sampling distribution quirks distort them.
Educational and Psychological Measurement, doi:10.1177/0013164417736092
Jordan-Young, R., & Rumiati, R. I. (2012). Hardwired for sexism? Approaches to sex/gender in
neuroscience. Neuroethics, 5, 305-315.
Kelley, K. (2005). The effects of nonnormal distributions on confidence intervals around the standardized
mean difference: Bootstrap and parametric confidence intervals. Educational and Psychological
Measurement, 65, 51–69.
Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, application, and
implementation. Journal of Statistical Software, 20, 1-24.
Kirby, K. N., & Gerlanc, D. (2013). BootES: An R package for bootstrap confidence intervals on effect
sizes. Behavior Research Methods, 45, 905-927.
Kline, R. B. (2016). Principles and practice of structural equation modeling (4th ed.). New York:
Guilford.
Kodric-Brown, A., & Brown, J. H. (1987). Anisogamy, sexual selection, and the evolution and
maintenance of sex. Evolutionary Ecology, 1, 95-105.
Korkmaz, S., Goksuluk, D., & Zararsiz, G. (2014). MVN: An R package for assessing multivariate
normality. The R Journal, 6, 151-162.
Kruschke, J. K., & Liddell, T. M. (2018). Bayesian data analysis for newcomers. Psychonomic Bulletin &
Review, 25, 155-177.
Lachenbruch, P. A., & Mickey, M. R. (1968). Estimation of error rates in discriminant analysis.
Technometrics, 10, 1-11.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical
primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.
Lakin, J. M. (2013). Sex differences in reasoning abilities: Surprising evidence that male-female ratios in
the tails of the quantitative reasoning distribution have increased. Intelligence, 41, 263-274.
Lehre, A. C., Lehre, K. P., Laake, P., & Danbolt, N. C. (2009). Greater intrasex phenotype variability in
males than in females is a fundamental aspect of the gender differences in humans.
Developmental Psychobiology, 51, 198-206.
Lehtonen, J., & Kokko, H. (2011). Two roads to two sexes: unifying gamete competition and gamete
limitation in a single model of anisogamy evolution. Behavioral Ecology and Sociobiology, 65,
445-459.
Lehtonen, J., & Parker, G. A. (2014). Gamete competition, gamete limitation, and the evolution of the
two sexes. Molecular Human Reproduction, 20, 1161-1168.
Lehtonen, J., Parker, G. A., & Schärer, L. (2016). Why anisogamy drives ancestral sex roles. Evolution,
70, 1129-1135.
Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could
possibly go wrong? Journal of Experimental Social Psychology, 79, 328-348.
Limpert, E., & Stahel, W. A. (2011). Problems with using the normal distribution–and ways to improve
quality and efficiency of data analysis. PLoS ONE, 6, e21403.
Lippa, R. A. (2001). On deconstructing and reconstructing masculinity–femininity. Journal of Research
in Personality, 35, 168-207.
Lippa, R. A. (2005). Gender, nature, and nurture (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Lippa, R. A. (2009). Sex differences in sex drive, sociosexuality, and height across 53 nations: Testing
evolutionary and social structural theories. Archives of Sexual Behavior, 38, 631-651.
Lippa, R. A. (2010a). Sex differences in personality traits and gender-related occupational preferences
across 53 nations: Testing evolutionary and social-environmental theories. Archives of Sexual
Behavior, 39, 619-636.
Lippa, R. A. (2010b). Gender differences in personality and interests: When, where, and why? Social and
Personality Psychology Compass, 4, 1098-1110.
Lippa, R. A., & Connelly, S. (1990). Gender diagnosticity: A new Bayesian approach to gender-related
individual differences. Journal of Personality and Social Psychology, 59, 1051-1065.
Lorenzo-Seva, U., & Ten Berge, J. M. (2006). Tucker's congruence coefficient as a meaningful index of
factor similarity. Methodology, 2, 57-64.
Mac Giolla, E., & Kajonius, P. J. (2018). Sex differences in personality are larger in gender equal
countries: Replicating and extending a surprising finding. International Journal of Psychology,
doi:10.1002/ijop.12529
Maney, D. L. (2016). Perils and pitfalls of reporting sex differences. Philosophical Transactions of the
Royal Society B, 371, 20150119.
Marsh, H. W., Morin, A. J., Parker, P. D., & Kaur, G. (2014). Exploratory structural equation modeling:
An integration of the best features of exploratory and confirmatory factor analysis. Annual
Review of Clinical Psychology, 10, 85-110.
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin,
111, 361-365.
McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23, 412-
433.
Mecklin, C. J., & Mundfrom, D. J. (2004). An appraisal and bibliography of tests for multivariate
normality. International Statistical Review, 72, 123-138.
Mecklin, C. J., & Mundfrom, D. J. (2005). A Monte Carlo comparison of the type I and type II error rates
of tests of multivariate normality. Journal of Statistical Computation and Simulation, 75, 93–107.
Meehl, P. E. (1995). Bootstraps taxometrics: Solving the classification problem in psychopathology.
American Psychologist, 50, 266-275.
Mitteroecker, P., Windhager, S., Müller, G. B., & Schaefer, K. (2015). The morphometrics of
“masculinity” in human faces. PLoS ONE, 10, e0118374.
Money, J. (1955). Hermaphroditism, gender and precocity in hyperadrenocorticism: Psychologic findings.
Bulletin of the Johns Hopkins Hospital, 96, 253–264.
Morris, M. L. (2016). Vocational interests in the United States: Sex, age, ethnicity, and year effects.
Journal of Counseling Psychology, 63, 604–615.
Nakagawa, S., Noble, D. W., Senior, A. M., & Lagisz, M. (2017). Meta-evaluation of meta-analysis: Ten
appraisal questions for biologists. BMC Biology, 15, 18.
Norris, A. L., Marcus, D. K., & Green, B. A. (2015). Homosexuality as a discrete class. Psychological
Science, 26, 1843-1853.
Nye, C. D., & Drasgow, F. (2011). Effect size indices for analyses of measurement equivalence:
Understanding the practical importance of differences between groups. Journal of Applied
Psychology, 96, 966-980.
Oakley, A. (1972). Sex, gender, and society. New York: Harper Colophon.
O'Brien, P. C. (1992). Robust procedures for testing equality of covariance matrices. Biometrics, 48, 819-
827.
Olejnik, S., & Algina, J. (2000). Measures of effect size for comparative studies: Applications,
interpretations, and limitations. Contemporary Educational Psychology, 25, 241-286.
Phillips, O. R., Onopa, A. K., Hsu, V., Ollila, H. M., Hillary, R. P., Hallmayer, J., ... & Singh, M. K.
(2018). Beyond a binary classification of sex: An examination of brain sex differentiation,
psychopathology, and genotype. Journal of the American Academy of Child & Adolescent
Psychiatry. doi:10.1016/j.jaac.2018.09.425
Prentice, D. A., & Miller, D. T. (1992). When small effects are impressive. Psychological Bulletin, 112,
160-164.
Reiser, B. (2001). Confidence intervals for the Mahalanobis distance. Communications in Statistics:
Simulation and Computation, 30, 37–45.
Revelle, W. (2018). An introduction to psychometric theory with applications in R. Manuscript retrieved
on October 24, 2018 from the Personality Project website: http://personality-project.org/r/book/
Revelle, W., & Condon, D. M. (2018). Reliability. In P. Irwing, T. Booth, & D. J. Hughes (Eds.), The
Wiley handbook of psychometric testing (pp. 709-749). Hoboken, NJ: Wiley.
Rhemtulla, M., van Bork, R., & Borsboom, D. (2018). Worse than measurement error: Consequences of
inappropriate latent variable measurement models. Preprint retrieved on October 24, 2018 from
the Open Science Framework website: https://osf.io/27bxg/
Rhodes, G., Jeffery, L., Watson, T. L., Jaquet, E., Winkler, C., & Clifford, C. W. G. (2004). Orientation-
contingent face aftereffects and implications for face-coding mechanisms. Current Biology, 14,
2119–2123.
Rippon, G., Jordan-Young, R., Kaiser, A., & Fine, C. (2014). Recommendations for sex/gender
neuroimaging research: Key principles and implications for research design, analysis, and
interpretation. Frontiers in Human Neuroscience, 8, 650.
Ritchie, S. J., Cox, S. R., Shen, X., Lombardo, M. V., Reus, L. M., Alloza, C., ... & Liewald, D. C.
(2018). Sex differences in the adult human brain: evidence from 5216 UK Biobank participants.
Cerebral Cortex, 28, 2959-2975.
Rosenthal, R., & Rubin, D. B. (1979). A note on percent variance explained as a measure of the
importance of effects. Journal of Applied Social Psychology, 9, 395–396.
Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical
methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738-
1748.
Ruscio, J., Haslam, N., & Ruscio, A. M. (2013). Introduction to the taxometric method: A practical guide.
New York: Routledge.
Ruscio, J., Ruscio, A. M., & Carney, L. M. (2011). Performing taxometric analysis to distinguish
categorical and dimensional variables. Journal of Experimental Psychopathology, 2, 170-196.
Sapp, M., Obiakor, F. E., Gregas, A. J., & Scholze, S. (2007). Mahalanobis distance: A multivariate
measure of effect in hypnosis research. Sleep and Hypnosis, 9, 67-70.
Sax, L. (2002). How common is intersex? A response to Anne Fausto-Sterling. Journal of Sex Research,
39, 174-178.
Schärer, L., Rowe, L., & Arnqvist, G. (2012). Anisogamy, chance and the evolution of sex roles. Trends
in Ecology & Evolution, 27, 260-264.
Schmid, F., & Schmidt, A. (2006). Nonparametric estimation of the coefficient of overlapping—theory
and empirical application. Computational Statistics & Data Analysis, 50, 1583-1596.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26
research scenarios. Psychological Methods, 1, 199-223.
Schmidt, F. L., & Hunter, J. E. (2014). Methods of meta-analysis: Correcting error and bias in research
findings (3rd ed.). Thousand Oaks, CA: Sage.
Schmitt, D. P. (2015). The evolution of culturally-variable sex differences: Men and women are not
always different, but when they are… it appears not to result from patriarchy or sex role
socialization. In T. K. Shackelford & R. D. Hansen (Eds.), The evolution of sexuality (pp. 221-256).
Cham, Switzerland: Springer.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350-353.
Schmitt, N., Golubovich, J., & Leong, F. T. (2011). Impact of measurement invariance on construct
correlations, mean differences, and relations with external correlates: An illustrative example
using Big Five and RIASEC measures. Assessment, 18, 412-427.
Schneider, C. A., Rasband, W. S., & Eliceiri, K. W. (2012). NIH Image to ImageJ: 25 years of image
analysis. Nature Methods, 9, 671-675.
Sen Gupta, A. (2004). Generalized variance. In S. Kotz, C. B. Read, N. Balakrishnan, B. Vidakovic, & N.
L. Johnson (Eds.), Encyclopedia of statistical sciences (6053). New York: Wiley.
doi:10.1002/0471667196.ess6053
Sepehrband, F., Lynch, K. M., Cabeen, R. P., Gonzalez-Zacarias, C., Zhao, L., D'arcy, M., ... & Clark, K.
A. (2018). Neuroanatomical morphometric characterization of sex differences in youth using
statistical learning. NeuroImage, 172, 217-227.
Shaffer, J. P. (1992). Caution on the use of variance ratios: A comment. Review of Educational Research,
62, 429-432.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). P-curve: A key to the file-drawer. Journal of
Experimental Psychology: General, 143, 534-547.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). P-curve and effect size: Correcting for
publication bias using only significant results. Perspectives on Psychological Science, 9, 666-681.
Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2015). Better p-curves: Making p-curve analysis more
robust to errors, fraud, and ambitious p-hacking. A Reply to Ulrich and Miller (2015). Journal of
Experimental Psychology: General, 144, 1146-1152.
Skiena, S. S. (2017). The data science design manual. New York: Springer.
Steiger, J. H. (1980). Testing pattern hypotheses on correlation matrices: Alternative statistics and
some empirical results. Multivariate Behavioral Research, 15, 335-352.
Stewart-Williams, S., & Thomas, A. G. (2013). The ape that thought it was a peacock: Does evolutionary
psychology exaggerate human sex differences? Psychological Inquiry, 24, 137-168.
Stoller, R. J. (1968). Sex and Gender: The Development of Masculinity and Femininity. New York:
Science House.
Taborsky, M., & Brockmann, H. J. (2010). Alternative reproductive tactics and life history phenotypes. In
P. Kappeler (Ed.), Animal behavior: Evolution and mechanisms (pp. 537–586). New York, NY:
Springer.
Unger, R. K. (1979). Toward a redefinition of sex and gender. American Psychologist, 34, 1085-1094.
Vacha-Haase, T., & Thompson, B. (2004). How to estimate and interpret various effect sizes. Journal of
Counseling Psychology, 51, 473-481.
van Aert, R. C., Wicherts, J. M., & van Assen, M. A. (2016). Conducting meta-analyses based on p
values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on
Psychological Science, 11, 713-729.
van Assen, M. A., van Aert, R., & Wicherts, J. M. (2015). Meta-analysis using effect size distributions of
only statistically significant studies. Psychological Methods, 20, 293-309.
van Putten, M. J., Olbrich, S., & Arns, M. (2018). Predicting sex from brain rhythms with deep learning.
Scientific Reports, 8, 3069.
Verweij, K. J., Mosing, M. A., Ullén, F., & Madison, G. (2016). Individual differences in personality
masculinity-femininity: Examining the effects of genes, environment, and prenatal hormone
transfer. Twin Research and Human Genetics, 19, 87-96.
Voracek, M., Mohr, E., & Hagmann, M. (2013). On the importance of tail ratios for psychological
science. Psychological Reports, 112, 872-886.
Wagenmakers, E. J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., ... & Matzke, D. (2018).
Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications.
Psychonomic Bulletin & Review, 25, 35-57.
Wai, J., Hodges, J., & Makel, M. C. (2018). Sex differences in ability tilt in the right tail of cognitive
abilities: A 35-year examination. Intelligence, 67, 76-83.
Wierenga, L. M., Sexton, J. A., Laake, P., Giedd, J. N., Tamnes, C. K., & Pediatric Imaging,
Neurocognition, and Genetics Study. (2017). A key characteristic of sex differences in the
developing brain: Greater variability in brain structure of boys than girls. Cerebral Cortex, 28,
2741-2751.
Wilcox, R. R. (2006). Graphical methods for assessing effect size: Some alternatives to Cohen's d.
Journal of Experimental Education, 74, 351-367.
Wyman, M. J., & Rowe, L. (2014). Male bias in distributions of additive genetic, residual, and phenotypic
variances of shared traits. The American Naturalist, 184, 326-337.
Yap, B. W., & Sim, C. H. (2011). Comparisons of various types of normality tests. Journal of Statistical
Computation and Simulation, 81, 2141-2155.
Zell, E., Krizan, Z., & Teeter, S. R. (2015). Evaluating gender similarities and differences using
metasynthesis. American Psychologist, 70, 10-20.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach’s α, Revelle’s β, and McDonald’s ωH:
Their relations with each other and two alternative conceptualizations of reliability.
Psychometrika, 70, 123-133.
Zou, G. Y. (2007). Exact confidence interval for Cohen’s effect size is readily available. Statistics in
Medicine, 26, 3054.