Machine Learning for Sociology
INTRODUCTION
Machine learning (ML) seeks to automate discovery from data. It represents a breakthrough in computer science, where past intelligent systems typically involved fixed algorithms (logical sets of instructions) that coded the desired output for all possible inputs. Now, intelligent systems learn from data and estimate complex functions that discover representations of some input (X), or link the input to an output (Y) in order to make predictions on new data (Jordan & Mitchell 2015). ML can be viewed as an offshoot of nonparametric statistics (Kleinberg et al. 2015).

We can classify ML tools by how they learn (extract information) from data. Different varieties of ML use different algorithms that invoke different assumptions about the principles underlying intelligence (Domingos 2015). We can also categorize ML tools by the kind of experience they are allowed to have during the learning process (Goodfellow et al. 2016), and we use this latter categorization here.

In supervised machine learning (SML), the algorithm observes an output (Y) for each input (X). That output gives the algorithm a target to predict and acts as a teacher. In unsupervised machine learning (UML), the algorithm only observes the input. It needs to make sense of the data without a teacher providing the correct answers—in fact, there are often no correct answers.1

Algorithm: a set of instructions telling a computer what to do
Supervised machine learning: methods that use training data of paired input (X) and output (Y) samples to learn parameters that predict Y from X in new data
Unsupervised machine learning: methods to summarize or characterize input data (X) without reference to a ground-truth output (Y)

We start with a brief (and somewhat technical) description of SML and UML and follow with examples of social science applications. We cannot give a comprehensive account, given the sprawl of the topic, but we hope to provide enough coverage to allow readers to follow up on different ideas. Our concluding remarks state why ML matters for sociology and how these tools can address some long-standing questions in the field.
1 Supervised and unsupervised learning are not formally defined terms (Goodfellow et al. 2016). Many ML algorithms can be used for both tasks. Scholars have proposed alternative labels, such as predictive and representation learning (Grosse 2013). There are other kinds of learning not captured with a binary categorization. In so-called reinforcement learning, the algorithm observes only some indication of the output (e.g., the end result of a chess game but not the rewards/costs associated with each move) (Jordan & Mitchell 2015).
2 Uppercase letters (e.g., X or Y) denote variable vectors, and lowercase letters refer to observed values (e.g., x or y).

SUPERVISED MACHINE LEARNING
Consider ordinary least squares (OLS) regression. The method assumes a linear function linking the inputs to the output,

$$Y = f(X) = X^{T}\beta,$$

and selects parameter estimates, β̂, that minimize the in-sample sum of squared residuals,

$$\min_{\beta}\ \sum_{i=1}^{n} \left[ y_i - f(x_i) \right]^2. \qquad (1)$$
This strategy ensures estimates of β that give the best fit in sample, but not necessarily the best
predictions out of sample (i.e., on new data) (see sidebar titled Classical Statistics Versus Machine
Learning).
To see that, consider the generalization error of the OLS model, that is, the expected prediction
error on new data. This error comprises two components: bias and variance (Hastie et al. 2009). A
model has bias if it produces estimates of the outcome that are consistently wrong in a particular
direction (e.g., a clock that is always an hour late). A model has variance if its estimates deviate from
the expected values across samples (e.g., a clock that alternates between fast and slow) (Domingos
2015). OLS minimizes in-sample error (Equation 1), but it can still have high generalization error
if it yields high-variance estimates (Kleinberg et al. 2015).
To minimize generalization error, SML makes a trade-off between bias and variance—that is,
unlike OLS, the methods allow for bias in order to reduce variance (Athey & Imbens 2017).3 For
example, an SML technique is to minimize

$$\sum_{i=1}^{n} \left[ y_i - f(x_i) \right]^2 + \lambda R(f), \qquad (2)$$
that is, in-sample error plus a regularizer, R(f), that penalizes functions that create variance (Kleinberg et al. 2015, Mullainathan & Spiess 2017). An important decision is to select λ, which sets the relative price for variance (Kleinberg et al. 2015). In OLS, that price is set to zero. In SML methods, the price is determined using the data (more on that later).

For example, in linear models, larger coefficients yield more variance in predictions. A popular SML technique called lasso (least absolute shrinkage and selection operator) introduces a regularizer,

$$R(f) = \sum_{j=1}^{p} |\beta_j|, \qquad (3)$$

that equals the sum of the absolute values of the coefficients, β_j (j = 1, ..., p) (Tibshirani 1996). The optimal function, f(X), now needs to select coefficients that minimize the sum of squared residuals while also yielding the smallest absolute coefficient sum.

Generalization error: the prediction error of a model on new data (also known as test error)
Model: a formal representation of our assumptions about the world
Regularizer: a term that penalizes estimation variance in out-of-sample predictions
3 One can find a similar approach in multilevel models popular in sociology where cluster parameters are
deliberately biased (Gelman & Hill 2007).
SML techniques seek to achieve an ideal balance between reducing the in-sample and out-of-sample error (i.e., training and generalization error, respectively). This goal helps avoid two pitfalls of data analysis: underfitting and overfitting. Underfitting occurs when a model fits the data at hand poorly: As a simple example, an OLS model with only a linear term linking an input to an output offers a poor fit if the true relationship is quadratic. Overfitting occurs when a model fits the data at hand too well and fails to predict the output for new inputs; for example, an OLS model with N inputs (plus a constant) will perfectly fit N data points, but it will likely not generalize well to new observations (Belloni et al. 2014).

Underfitting means we miss part of the signal in the data; we remain blind to some of its patterns. Overfitting means we capture not just the signal, but also the noise—the idiosyncratic factors that vary from sample to sample—and hallucinate patterns that are not there (Domingos 2015).

Through regularization, SML effectively searches for functions that are sufficiently complex to fit the underlying signal without fitting the noise. One way to regularize is to restrain model parameters. Let us consider lasso. The regularizer in Equation 3 puts a bound on the sum of the absolute values of the coefficients. It can be shown that lasso favors sparse models, where a small number of inputs have nonzero coefficients, and effectively restrains model complexity (Tibshirani 1996).

Training error: the error of a model on training data (e.g., sum of squared residuals)
Overfitting: a model fitting the data at hand well, but not generalizing to new data
Empirical tuning: using the data to optimize model design, including the choice of regularization weight
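To make the mechanics concrete, the sketch below fits lasso at several values of λ (called alpha in scikit-learn) on simulated data and counts the nonzero coefficients. The data, variable names, and library choice are illustrative assumptions, not part of the analyses reviewed here.

```python
# Illustrative sketch: lasso zeroes out more coefficients as the
# regularization weight grows. Simulated data; scikit-learn assumed.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 1.5, 1, -1]            # only 5 inputs truly matter
y = X @ beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
print("OLS nonzero coefficients:", np.sum(ols.coef_ != 0))   # all 50

for alpha in [0.01, 0.1, 1.0]:            # alpha plays the role of lambda
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"lasso(alpha={alpha}): {np.sum(lasso.coef_ != 0)} nonzero coefficients")
```

Larger values of alpha yield sparser models, mirroring the bound that Equation 3 places on the absolute coefficient sum.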
Now consider regression trees, another function class in SML. The method proceeds by par-
titioning the inputs into separate regions in a tree-like structure and returning a separate output
estimate (Ŷ ) for each region. Say we want to predict whether someone migrates using individ-
ual attributes of age and education. A tree might first split into two branches by age (young and
old), and then each branch might split into two by education (college degree or not). Each termi-
nal node (leaf ) corresponds to a migration prediction (e.g., 1 for young college graduates). With
enough splits in the tree, one can perfectly predict each observation in sample. To prevent overfit-
ting, a typical regularizer controls the tree depth and, thus, makes us search not for the best fitting
tree overall, but the best fitting tree among those of a certain depth (Mullainathan & Spiess 2017).
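A minimal sketch of the tree example follows, assuming hypothetical age and education inputs and a simulated binary migration outcome; the depth limit plays the role of the regularizer just described.

```python
# Illustrative sketch: a depth-limited classification tree predicting migration
# from age and college degree. Data are simulated for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 1000
age = rng.integers(18, 65, size=n)
college = rng.integers(0, 2, size=n)
# Assumed data-generating process: young college graduates migrate most often.
p_migrate = 0.1 + 0.4 * (age < 30) * (college == 1)
migrate = rng.binomial(1, p_migrate)

X = np.column_stack([age, college])
tree = DecisionTreeClassifier(max_depth=2)   # the depth limit regularizes the tree
tree.fit(X, migrate)
print(export_text(tree, feature_names=["age", "college"]))
```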
How do we select the model that offers the right compromise between in-sample and out-of-
sample fit? To answer this question, we need to decide, first, on how to regularize [measure model
variance/complexity, R( f )] and second, on how much to regularize [set the price for variance/
complexity, λ, in Equation 2].
In SML, we start the analysis by picking a function class and a regularizer. There are many
function classes and many associated regularizers (see the sidebar titled Some Supervised Ma-
chine Learning Techniques). The so-called no free lunch theorem proves that no ML method (or
no form of regularization) is universally better than any other (Wolpert & Macready 1997); the
task is not to seek the best overall method, but the best method for the particular question at hand
(Goodfellow et al. 2016, but see Domingos 2015 for a counterargument). The general recommen-
dation is to use the substantive question at hand to guide these choices.4 With the function class
and regularizer in hand, we turn to data to choose the optimal model complexity. Put differently,
in SML, we use the data not just to estimate the model parameters (e.g., coefficients, β, in lasso),
but also for tuning regularization parameters (e.g., the price for variance/complexity, λ).
What sets SML apart from classical statistical estimation, then, are two essential features: regu-
larization and the data-driven choice of regularization parameters (also known as empirical tuning)
(Athey & Imbens 2017, Kleinberg et al. 2015, Mullainathan & Spiess 2017). These features allow
4 Hastie et al. (2009, table 10.1) compare different methods on several criteria (e.g., interpretability, predictive
power, ability to deal with different kinds of data). Athey & Imbens (2016), Athey (2017), and Abadie & Kasy
(2017) link SML methods to traditional tools and questions in economics. Olson et al. (2018) offer an empirical
comparison on bioinformatics data.
researchers to consider complex functions and more inputs (polynomial terms, high-order inter-
actions, and, in some cases, more variables than observations) without overfitting the data. This
flexibility contrasts sharply with classical statistics, where one typically selects a small number of
inputs and a simple functional form to relate the inputs to the output.
One way SML uses data, therefore, is for model selection, that is, to estimate the performance
of alternative models (functions, regularization parameters) to choose the best one. This process
requires solving an optimization problem. Another way SML uses data is for model assessment,
that is, after settling on a final model, estimating its generalization (prediction) error on new data
(Hastie et al. 2009).
A crucial step in SML is to separate the data used for model selection from the data used for
model assessment. In fact, in an idealized setup, one creates three, not two, separate data sets.
Training data are used to fit the model; validation data are put aside to select among different
models (or to select among the different parameterizations of the same model); and finally, test
(or hold-out) data is kept in the vault to compute the generalization error of the selected model.
There is no generic rule for determining the ideal partition, but typically, a researcher can reserve
half of the data for training, and a quarter each for validation and testing (Hastie et al. 2009).
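As an illustration of this idealized three-way partition, the sketch below reserves half of a simulated data set for training and a quarter each for validation and testing; the arrays and the scikit-learn helper are assumptions made for the example.

```python
# Illustrative sketch: a 50/25/25 train/validation/test split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.normal(size=(1000, 10))
y = np.random.normal(size=1000)

# First peel off 25% as the hold-out test set kept "in the vault".
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Then split the remainder into training (50% of the total) and validation (25%).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1/3, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 500, 250, 250
```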
Splitting the data in this way comes at a cost, however. By reserving a validation and test set, we reduce the chance of overfitting but now run the risk of underfitting because there are fewer data left for estimation (Yarkoni & Westfall 2017). To achieve a middle ground, we can reserve the test data but combine training and validation sets into one, especially if the data are small. We can then recycle the training data for validation purposes (e.g., to select the optimal degree of complexity). One version of this process, called k-fold cross-validation, involves randomly splitting the data into k subsets (folds), and then successively fitting the data to k − 1 of the folds and evaluating the model performance on the kth fold.

Training data: sample used to fit the model
Validation data: sample used to select among different models
Test data: sample reserved to compute the generalization error of the selected model (hold-out data)
Cross-validation: a method to estimate and validate a model on different partitions of data
Consider the regression-tree example above. We can divide the training data into k = 5 folds, use four of the folds to grow a tree with a particular depth (complexity), and then predict the output (migration) separately on the excluded fold, repeating for each of the five folds. We can then repeat the same process with a different tree depth and pick the complexity level that minimizes the average prediction error across the left-out folds (see Varma & Simon 2006 for more sophisticated nested cross-validation). In the final step, we can use the test data to compute the predictive accuracy (generalization error) of the selected model.
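A sketch of this five-fold procedure for tuning tree depth appears below; the simulated data and the scikit-learn helpers are assumptions, and a real analysis would substitute the researcher's own inputs and output.

```python
# Illustrative sketch: pick tree depth by 5-fold cross-validation,
# then assess the chosen model on held-out test data.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=2000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 4, 5, 6, 8, 10]},
                      cv=5)                       # 5-fold cross-validation
search.fit(X_train, y_train)
print("chosen depth:", search.best_params_["max_depth"])
print("test accuracy:", search.score(X_test, y_test))
```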
In SML, there are model-averaging techniques to improve predictive performance. For example, bagging involves averaging across models estimated on different bootstrap samples (where one draws with replacement N observations from a sample of size N). Boosting involves giving more weight to misclassified observations over repeated estimation (Hastie et al. 2009).
SUPERVISED MACHINE LEARNING FOR POLICY PREDICTIONS, CAUSAL INFERENCE, AND DATA AUGMENTATION

SML uses flexible functions of inputs to predict an output. Some SML tools, such as nearest neighbor, have no parameters at all. Other methods, such as lasso, give parameter estimates, β̂, but those estimates are not always consistent (that is, they do not converge to the true value as N grows large).
Instead, SML is good at solving what Mullainathan & Spiess (2017, p. 88) call Ŷ tasks. Social
scientists (mostly economists) have identified three classes of Ŷ tasks: predictions for policy and
theory development, certain procedures for causal inference, and data augmentation [for reviews,
see Mullainathan & Spiess (2017) for predictive modeling in economics, Cranmer & Desmarais
(2017) for political science, and Yarkoni & Westfall (2017) for psychology].
Kleinberg et al. (2017), for example, illustrate how machine predictions can help us understand
the process underlying judicial decisions. The authors first train a regression-tree model to predict
judges’ bail-or-release decisions in New York City, and then they use the quasi-random assignment
of judges to cases to explain the sources of the discrepancy between model predictions and actual
decisions. Their findings show that judges overweight the current charge, releasing high-risk cases
if their present charge is minor and detaining low-risk ones if the present charge is serious. These
findings reveal important insights on human decision-making and carry the potential to inspire
new theory. From a policy standpoint, the authors’ predictive model, if used in practice, promises
significant welfare gains over human decisions: reducing the reoffending rate by 25% with no
increase in jailing rate or, alternatively, pulling down the jailing rate by 42% with no increase in
reoffending rate.
An important discussion in the literature is on how SML tools should weight different kinds of
prediction errors. Berk et al. (2016), for example, apply random forests to forecast repeat offenses
in domestic violence cases. In consultation with stakeholders, the authors weight false negatives
(where the model predicts no repeat offense when there is one) 10 times more heavily than false
positives (where the model predicts repeat offense when there is none). Their model, consequently,
produces highly accurate predictions of no-offense cases (which require very strong evidence), but
less accurate forecasts of repeat offenses (many of which do not end up occurring).
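The sketch below shows one generic way to encode such asymmetric costs, weighting missed repeat offenses ten times more heavily through class weights in a random forest; the data and the weighting mechanism are illustrative and are not Berk et al.'s (2016) actual procedure.

```python
# Illustrative sketch: make false negatives (missed repeat offenses, class 1)
# roughly ten times more costly than false positives via class weights.
# Simulated data; not the data or model used by Berk et al. (2016).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 1.2).astype(int)  # rare outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for weights in [None, {0: 1, 1: 10}]:
    model = RandomForestClassifier(n_estimators=200, class_weight=weights, random_state=0)
    model.fit(X_tr, y_tr)
    print(weights, confusion_matrix(y_te, model.predict(X_te)).ravel())  # tn, fp, fn, tp
```

Raising the weight on the rare class shifts errors from false negatives toward false positives, which is the trade-off the stakeholders in Berk et al. (2016) had to adjudicate.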
There are legitimate concerns that SML predictions (and the data on which they are based)
can perpetuate social inequalities (Barocas & Selbst 2016, Harcourt 2007, Starr 2014). What if
predicted offenders are disproportionately drawn from minority groups? What if predicted ben-
eficiaries of health interventions are mostly high-status individuals?
Scholars now acknowledge an inherent trade-off between predictive accuracy and algorithmic
fairness (Berk et al. 2018, Hardt et al. 2016, Kleinberg et al. 2016). An open question is how to
define fairness. While most definitions relate to treatment of protected groups, one can opera-
tionalize fairness in many different ways (Berk et al. 2018, Narayanan 2018).
To see the complexity of the problem, consider a predictive algorithm that outputs loan deci-
sions (Ŷ ) from credit scores (X ) (Hardt et al. 2016). Assume the algorithm produces more accurate
predictions for men than women and recommends more loans to be given to men. One way to
make the algorithm fair is to exclude applicants’ gender from the data, but this solution fails if gen-
der is correlated with another input, such as income. Another way is to seek demographic parity,
that is, to constrain the model so that gender has no correlation with the loan decision. But this
constraint might generate disparity in some other characteristic (Dwork et al. 2012). Yet another
way to define fairness is to impose equal opportunity (Hardt et al. 2016), that is, to force the model
to make men and women equally likely to qualify for loans within a given subpopulation (e.g.,
individuals who pay back their loans).
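To make these competing definitions tangible, the sketch below computes demographic parity (equal rates of predicted approval by gender) and equal opportunity (equal approval rates among applicants who in fact repay) for a hypothetical set of predictions; all variables are invented for illustration.

```python
# Illustrative sketch: two fairness checks on hypothetical loan predictions.
import numpy as np

rng = np.random.default_rng(5)
gender = rng.integers(0, 2, size=10000)          # 0 = women, 1 = men (hypothetical)
repaid = rng.binomial(1, 0.6 + 0.1 * gender)     # ground truth: loan repaid
approved = rng.binomial(1, 0.4 + 0.2 * gender)   # some algorithm's loan decisions

for g, label in [(0, "women"), (1, "men")]:
    mask = gender == g
    parity = approved[mask].mean()                       # demographic parity compares these
    opportunity = approved[mask & (repaid == 1)].mean()  # equal opportunity compares these
    print(f"{label}: approval rate {parity:.2f}, "
          f"approval rate among repayers {opportunity:.2f}")
```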
Different definitions of fairness yield different outcomes. And it is difficult (if not impossible) to
implement multiple definitions at the same time (Berk et al. 2018). Addressing algorithmic fairness
is not just a technical issue in ML; it requires us—as a society—to consider difficult trade-offs.5
Causal Inference
Social scientists are often interested in identifying the causal effect of an input (treatment) on an
output. SML tools can help in certain causal inference procedures that involve prediction tasks.
5 Similar moral dilemmas abound in the use of ML in new technologies, such as self-driving cars (Greene
2016). Survey experiments show that while people agree that an algorithm should minimize casualties, they
are not thrilled with the prospect of riding in a utilitarian car that can sacrifice its driver for the greater good
(Bonnefon et al. 2016).
We provide some basic intuition and examples from this rather technical literature and refer the
readers to Athey & Imbens (2017) and Mullainathan & Spiess (2017) for comprehensive reviews,
and to Pearl & Mackenzie (2018) and Peters et al. (2017) for general frameworks that link ML to
causality.
As a primer, consider the fundamental problem of causal inference: We observe an individ-
ual (or any unit of analysis) in one condition alone (treatment or control) and cannot measure
individual-level variation in the effect of the treatment (for an authoritative review of causal infer-
ence in the social sciences, see Morgan & Winship 2007, 2014). We instead focus on an aggregate
average effect that we treat as homogeneous across the population (Xie 2013). In experimental
design, we randomly assign individuals to treatment and control groups and directly estimate
the average causal effect by comparing the mean output between the groups (Imbens & Rubin
2015).
Social scientists now use SML to identify heterogeneous treatment effects in subpopulations
in existing experimental data. For example, Imai & Ratkovic (2013) discover groups of workers
differentially affected by a job training program. They interact the treatment (i.e., being in the pro-
gram) with different inputs and use a lasso model to select the inputs that are most important in
predicting increases in worker earnings. Similarly, Athey & Imbens (2016) develop causal trees to
estimate treatment effects for subgroups. Different from standard regression trees in ML (where
one seeks to minimize the error in predictions, Ŷ ), causal trees focus on minimizing the error in
treatment effects. One can then obtain valid inference for each leaf (subgroup) with honest estima-
tion, that is, by using half the sample to build the tree (select the optimal partition of inputs), and
the other half to estimate the treatment effects within the leaves. Wager & Athey (2018) extend the
method to random forests that average across many causal trees and allow for personalized treat-
ment effects (where each individual observation gets a distinct estimate). Similarly, Grimmer et al.
(2017) propose ensemble methods that weight several ML models and discover heterogeneous
treatment effects in data from two existing political science experiments.
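The sketch below conveys only the honest-estimation idea: one half of a simulated experimental sample defines subgroups (here with an ordinary regression tree on outcomes, not Athey & Imbens's causal splitting criterion), and the other half estimates treatment effects within those subgroups.

```python
# Illustrative sketch of honest estimation: build subgroups on one half of the
# sample, estimate within-subgroup treatment effects on the other half.
# Uses an ordinary regression tree, not the causal-tree splitting rule.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
n = 4000
X = rng.normal(size=(n, 3))
treat = rng.integers(0, 2, size=n)                      # randomized treatment
effect = 1.0 + 2.0 * (X[:, 0] > 0)                      # heterogeneous true effect
y = X[:, 1] + treat * effect + rng.normal(size=n)

X_build, X_est, t_build, t_est, y_build, y_est = train_test_split(
    X, treat, y, test_size=0.5, random_state=0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_build, y_build)
leaves_est = tree.apply(X_est)                          # assign held-out units to leaves
for leaf in np.unique(leaves_est):
    in_leaf = leaves_est == leaf
    ate = y_est[in_leaf & (t_est == 1)].mean() - y_est[in_leaf & (t_est == 0)].mean()
    print(f"leaf {leaf}: estimated treatment effect {ate:.2f}")
```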
Most empirical work in sociology relies on observational data where we do not control assign-
ment to treatment. One way to estimate the causal effect in this case is to assume the potential
output to be independent of assignment to treatment, conditional on other observed inputs. Under
this so-called selection-on-observables assumption, we can estimate a causal effect by matching
treatment and control groups on their propensity score (that is, the likelihood of being in the
treatment group conditional on inputs). Estimation of this score is well suited to SML as it in-
volves a prediction task (where the effects of inputs are not of interest). Recent work uses boosting
(McCaffrey et al. 2004), neural networks (Setoguchi et al. 2008, Westreich et al. 2010), and regres-
sion trees for this task (Diamond & Sekhon 2013, Hill 2011, Lee et al. 2010, Wyss et al. 2014) as
alternatives to traditional logistic regression.
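A sketch of the prediction step alone follows: estimating propensity scores with gradient boosting instead of logistic regression on simulated observational data; the subsequent matching or weighting step is omitted, and all variable names are assumptions.

```python
# Illustrative sketch: propensity scores (probability of treatment given inputs)
# estimated with gradient boosting on simulated observational data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)
n = 5000
X = rng.normal(size=(n, 5))
# Assumed selection process: treatment more likely for units with high X1 and X2.
p_treat = 1 / (1 + np.exp(-(X[:, 0] + X[:, 1])))
treat = rng.binomial(1, p_treat)

model = GradientBoostingClassifier(random_state=0).fit(X, treat)
pscore = model.predict_proba(X)[:, 1]      # estimated propensity scores
print(pscore[:5].round(2))
# These scores could then feed a matching or inverse-probability-weighting step.
```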
In some cases, the selection-on-observables assumption does not hold, and we suspect that
some unobserved inputs are correlated with both the assignment to treatment and the output,
creating omitted variable bias in estimation. Regularization in SML could lead to exclusion of
such inputs from the model, amplifying this bias. Similarly, with many inputs, one generally runs
the risk of model misspecification (Belloni et al. 2014, Ho et al. 2007, King & Nielsen 2019, Muñoz
& Young 2018, Raftery 1995, Young & Holsteen 2017). Athey & Imbens (2015) develop a measure
of sensitivity to misspecification. Belloni et al. (2017) propose double-selection of inputs to address
potential omitted variable bias. This procedure involves solving two prediction tasks to determine,
first, the inputs correlated with the treatment, and second, those correlated with the output. The
union of these two sets of inputs enters an OLS regression of the output, leading to parameter
estimates with improved properties (Belloni et al. 2014, 2017; Chernozhukov et al. 2017).
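A compact sketch of the double-selection logic on simulated data follows: one lasso selects inputs that predict the treatment, another selects inputs that predict the output, and the union of selected inputs enters a final OLS regression alongside the treatment. This is a simplified stand-in for the formal estimators and inference in Belloni et al. (2014, 2017).

```python
# Illustrative sketch of double selection with lasso (simplified; see
# Belloni et al. 2014, 2017 for the formal procedure and valid inference).
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(8)
n, p = 1000, 50
X = rng.normal(size=(n, p))
treat = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)          # confounded treatment
y = 2.0 * treat + X[:, 0] + X[:, 2] + rng.normal(size=n)      # true effect = 2

sel_treat = np.flatnonzero(LassoCV(cv=5).fit(X, treat).coef_)  # inputs predicting treatment
sel_out = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)        # inputs predicting output
selected = np.union1d(sel_treat, sel_out)                      # union of control variables

design = np.column_stack([treat, X[:, selected]])
fit = LinearRegression().fit(design, y)
print("estimated treatment effect:", round(fit.coef_[0], 2))
```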
Another way to address the omitted variable bias is to find an instrument—an input that is
correlated with the output only through its correlation with assignment to treatment (Angrist
et al. 1996). We can then regress the treatment (a given input, X ) on the instrument (Z), and then
use the predicted values (X̂ ) as an input in the output (Y ) regression. Because the first stage in this
instrumental variables regression involves a prediction task, we can use SML tools. There are now
many examples of this application in the econometrics literature. Belloni et al. (2012) use lasso to
produce first-stage predictions in data with many potential instruments, while Carrasco (2012)
and Hartford et al. (2016) turn to ridge regression and neural networks, respectively.
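The two-stage logic with a lasso first stage is sketched below on simulated data with many candidate instruments; it omits the corrections needed for valid standard errors developed in Belloni et al. (2012), and the data-generating process is an assumption made for the example.

```python
# Illustrative sketch: instrumental variables with a lasso first stage
# (many candidate instruments Z, one endogenous input x, output y).
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(9)
n, k = 2000, 40
Z = rng.normal(size=(n, k))                        # candidate instruments
u = rng.normal(size=n)                             # unobserved confounder
x = Z[:, 0] + 0.5 * Z[:, 1] + u + rng.normal(size=n)
y = 1.5 * x + 2.0 * u + rng.normal(size=n)         # true causal effect = 1.5

x_hat = LassoCV(cv=5).fit(Z, x).predict(Z)         # first stage: predict x from Z
second = LinearRegression().fit(x_hat.reshape(-1, 1), y)
print("naive OLS:", round(LinearRegression().fit(x.reshape(-1, 1), y).coef_[0], 2))
print("IV with lasso first stage:", round(second.coef_[0], 2))
```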
Data Augmentation

Researchers use human-coded data to train SML algorithms to link individuals across census waves. Abramitzky et al. (2019) develop a fully automated method to estimate probabilities of matches across census waves, and then measure intergenerational occupational mobility. Using a nested design, Bernheim et al. (2013) recruit a subset of survey respondents to participate in a lab experiment and use their responses in the lab as training data to impute responses for the remaining sample. Blumenstock et al. (2015) collect survey responses from a subset of cell phone users in Rwanda as training data to predict the wealth and well-being of one million phone users.

Scholars are similarly turning to supervised topic modeling (Blei & McAuliffe 2010) to use human-identified topics as training data to classify a larger set of documents (Hopkins & King 2010, Mohr et al. 2013). For instance, Chong et al. (2009) apply this approach successfully to predict topics for image labels and annotations.
Researchers are also using SML for missing data imputation. Farhangfar et al. (2008) investigate the performance of different ML classifiers in fifteen data sets and find that, although no method is universally best, naive-Bayes and support vector machine classifiers perform particularly well in imputing missing values. More recently, Sovilj et al. (2016) use Gaussian mixture models to estimate the underlying distribution of data and an extreme learning machine (a type of one-layer neural network) for data imputation. Their approach, evaluated in six different data sets, yields more accurate values compared with conditional mean imputation.
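The sketch below shows generic ML-based imputation with a k-nearest-neighbors imputer on simulated data; it is not the specific set of classifiers evaluated by Farhangfar et al. (2008) or the mixture-model approach of Sovilj et al. (2016).

```python
# Illustrative sketch: impute missing values with k-nearest neighbors,
# compared against simple mean imputation. Simulated data with 20% missingness.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(10)
cov = [[1, 0.8, 0.5], [0.8, 1, 0.5], [0.5, 0.5, 1]]
X_true = rng.multivariate_normal([0, 0, 0], cov, size=1000)
X_obs = X_true.copy()
mask = rng.random(X_obs.shape) < 0.2
X_obs[mask] = np.nan                              # introduce missing values

for name, imputer in [("mean", SimpleImputer()), ("knn", KNNImputer(n_neighbors=10))]:
    X_imp = imputer.fit_transform(X_obs)
    rmse = np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2))
    print(name, "imputation RMSE:", round(rmse, 3))
```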
6 In the ML community, researchers also use the term data augmentation to refer to the technique of artificially increasing the size of the training data in order to improve the predictive performance of ML classifiers. This strategy
is widely used in deep neural networks for image recognition (e.g., Wong et al. 2016) but remains outside the
scope of our review.
7 There are excellent reviews of latent class analysis (Bollen 2002), sequence analysis (Abbott & Tsay 2000,
Cornwell 2015), and community detection (Fortunato 2010, Fortunato & Hric 2016, Watts 2004).
Sequence analysis: compares sequences (ordered elements or events) with optimal matching to discover
groups of observations with similar patterns (typically with cluster analysis)
Topic modeling: discovers latent topics in text data based on co-occurrence of words across documents
Community detection: identifies communities in networks (graphs) based on structural position of nodes
Generating measures from complex data. UML can produce measures from data to be used in
subsequent statistical analysis. Sociologists have long used principal components and factor anal-
ysis to reduce many inputs into a smaller set. Social scientists now use UML to process new kinds
of data (images or text). Economists, for example, classify satellite images with UML to generate
measures (deforestation, pollution, night lights, and so on) that relate to economic outputs (see
Donaldson & Storeygard 2016 for a review). Sociologists categorize text to develop proxies for
discourse in the media (DiMaggio et al. 2013), state documents (Mohr et al. 2013) and academic
publications (McFarland et al. 2013). For more information about text analysis, readers are di-
rected to articles by Bail (2014), Blei (2012), Evans & Aceves (2016), Grimmer & Stewart (2013),
and Mohr & Bogdanov (2013).
Following a long tradition, sociologists also use UML to group social network data. Earlier
applications, such as block models, employed structural equivalence (sharing neighbors) to eval-
uate similarity and to then partition the network into subgroups (Breiger et al. 1975, White et al.
1976). Recent improvements involve using centrality (instead of equivalence) measures to discover
communities (Girvan & Newman 2002), assuming generative probabilistic distributions (Nowicki
& Snijders 2001) that help in model selection (Handcock et al. 2007), allowing for mixed mem-
bership in communities (Airoldi et al. 2008), and considering temporal dynamics (Matias & Miele
2017, Xing et al. 2010, Yang et al. 2011) and latent social structure (Hoff et al. 2002).
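A small sketch of edge-betweenness community detection in the spirit of Girvan & Newman (2002) follows, using NetworkX and its built-in karate club example graph; real applications would substitute the researcher's own network data.

```python
# Illustrative sketch: Girvan-Newman community detection on a small example network.
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()                 # classic 34-node friendship network
communities = next(girvan_newman(G))       # first split into two communities
for i, group in enumerate(communities):
    print(f"community {i}: {sorted(group)}")
```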
Bail (2008) identifies three configurations of symbolic boundaries against immigrants in Europe. Bonikowski & DiMaggio (2016) employ latent class analysis to characterize four types of popular nationalism in the United States. Frye & Trinitapoli (2015) use sequence analysis to discover five distinct event sequences that characterize discrepancy in women's ideal and experienced prelude to sex in Malawi. Killewald & Zhuo (2018) employ the same method to identify four maternal employment patterns of American mothers. Garip (2012, 2016) uses cluster analysis to identify four distinct groups among first-time Mexico-US migrants. Goldberg (2011) develops relational class analysis that considers associations between individuals' survey responses (rather than responses themselves) to discover three separate logics of cultural distinction around musical tastes. Baldassarri & Goldberg (2014) apply the same tool to identify three configurations of political beliefs among Americans.

These examples use a variety of methods, but share a common goal. They search for the hidden structure in a population that would be presumed homogeneous under the traditional statistical approach (Xie 2007, 2013; Duncan 1982). This search often yields new hypotheses that emerge from the data.
Model checking. Unlike prediction problems, there is often no ground truth in UML; therefore,
model checking is an important step. Researchers use statistical validation techniques that involve
some heuristic measure to capture whether, for example, clusters (Garip 2012, Killewald & Zhuo
2018), latent classes (Bonikowski & DiMaggio 2016), or topics are well separated (DiMaggio et al.
2013). Scholars employ substantive validation to see if the produced partitions cohere with existing
typologies or, more generally, with human judgement. Grimmer & King (2011) offer a method for
computer-assisted clustering. The method allows researchers to explore and select from thousands
of partitions produced by different clustering methods and, thus, puts their domain knowledge at
the center (Grimmer & Stewart 2013).
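As one example of the heuristic checks mentioned above, the sketch below computes average silhouette widths for k-means solutions with different numbers of clusters on simulated data; the measure and data are illustrative, not those used in the cited studies.

```python
# Illustrative sketch: statistical validation of cluster solutions with the
# average silhouette width (higher values indicate better-separated clusters).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, n_features=6, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.2f}")
```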
Researchers also resort to external validation that brings in new data to evaluate whether identi-
fied patterns confirm expectations. Bail (2008), for example, shows that three types of symbolic
boundaries emerging from attitudinal data are associated with country-level immigration patterns
and integration philosophies in Europe. Bonikowski & DiMaggio (2016) find that four varieties
of nationalism in the United States correlate with social and policy attitudes that were not used in
the identification of the typology. DiMaggio et al. (2013) check that topics identified in the news
coverage of government assistance to the arts respond to other news events in hypothesized ways.
Garip (2016) confirms that four migrant types, obtained by clustering survey responses alone,
relate differently to macrolevel economic and political indicators.
SML allows us to include many inputs (including higher-order terms and interactions) and
complex functions that connect inputs to the output. It helps break away from the linear model
imposed by OLS (Abbott 2001). It helps us avoid underfitting (missing part of the signal) and
mine the data effectively without overfitting (capturing the noise as well as the signal). This gain
comes at a cost. Predictive tools in SML typically do not yield reliable estimates of the effects of
particular inputs (β̂), and indeed, some methods only produce black-box results.
Sociologists can identify pure prediction (Ŷ ) problems where different research teams can po-
tentially compete in a common-task framework (Donoho 2017). Economists, for example, are
already using SML to make policy predictions (Kleinberg et al. 2015). Sociologists can further use
predictions as a starting point to understand underlying social process and to develop theory. So-
ciologists can also use their expertise in processes of stratification to inform debates on the ethics
of predictive modeling, and its ‘fairness’ to different social groups (Berk et al. 2018).
Another direction for sociologists is to use SML to improve classical statistical techniques.
Economists now apply SML to prediction tasks within the causal-inference framework, for exam-
ple, estimation of the propensity score in matching (Westreich et al. 2010) or the first-stage equa-
tion in instrumental variables (Belloni et al. 2012), and identification of heterogeneous treatment
effects in existing experimental data (Athey & Imbens 2016). One particularly fruitful application
(and one that is highly relevant to sociologists given our typical attention to omitted variable bias)
involves using SML for model selection (Belloni et al. 2014, 2017).
In any data analysis, there are many choices available to us (researcher degrees of freedom) that might influence the results
(Simmons et al. 2011, King & Nielsen 2019).8 Any time we use the data to optimize over these
degrees of freedom (for example, choosing variables that give the best fit), we need to conduct an
out-of-sample test (or cross-validation) to evaluate the true performance of our choices. A related
activity at the research community level is to encourage independent replication studies, which
would serve as out-of-sample tests (Freese 2007).
8 This issue has led to heated debates in psychology where researchers have been unable to replicate some
well-known experimental findings (Simmons et al. 2011, Open Sci. Collab. 2015).
Sociologists typically use statistical models to estimate the effect of an input on an output. ML not only helps us improve parts of this strategy, but also gives us tools that can inspire new questions. How well does a set of inputs, for example, predict an output? How do these predictions deviate from observed outcomes and why? Or what is the underlying structure of some
input? How is that structure related to external factors? Answering these questions can help us
push theory forward or generate new hypotheses. Indeed, in some of the best social science appli-
cations, the results from ML provide not an end goal, but the starting point for further analysis
and conceptualization. As such, ML tools complement, but do not replace, existing methods in
sociology.
SUMMARY POINTS
1. Classical statistics focuses on inference (estimating parameters, β, that link the output, Y, to inputs, X); SML focuses on prediction (estimating the output, Ŷ, for new inputs).
2. SML balances in-sample and out-of-sample fit through regularization (i.e., penalizing model complexity and estimation variance) and empirical tuning (i.e., data-driven choice of regularization parameters).
3. Unsupervised machine learning (UML) discovers underlying structure in data (e.g.,
principal components, clusters, latent classes) that needs to be validated with statistical,
substantive, or external evidence.
4. Sociologists can apply SML to predict outputs, to use the predictions as a starting point
to understand underlying social process, or to improve classical statistical techniques.
5. Sociologists can use UML to describe and classify inputs, and to conceptualize on the
basis of the descriptions.
FUTURE ISSUES
1. What are the prediction (Ŷ ) questions in sociology?
2. What can the deviations from predictions reveal about the underlying social process?
3. What are the criteria for evaluating predictive fairness?
4. How can we use predictions given by SML or descriptions produced by UML to
theorize?
5. How can we validate the findings of ML applications?
DISCLOSURE STATEMENT
The authors are not aware of any affiliations, memberships, funding, or financial holdings that
might be perceived as affecting the objectivity of this review.
ACKNOWLEDGMENTS
We offer our apologies to scholars whose work could not be appropriately cited due to space
constraints. We extend our thanks to Thomas Davidson, Joscha Legewie, Karen Levy, Samir Passi,
Mert Sabuncu, Florencia Torche, and Cristobal Young for their thoughtful feedback on our earlier
drafts. We also thank an anonymous reviewer and the editors. All errors are our own.
LITERATURE CITED
Abadie A, Kasy M. 2017. The risk of machine learning. arXiv:1703.10935 [stat.ML]
Abbott A. 1995. Sequence analysis: new methods for old ideas. Annu. Rev. Sociol. 21:93–113
Abbott A. 2001. Time Matters: On Theory and Method. Chicago: Univ. Chicago Press
Abbott A, Tsay A. 2000. Sequence analysis and optimal matching methods in sociology. Sociol. Methods Res.
29:3–33
Abramitzky R, Mill R, Perez S. 2019. Linking individuals across historical sources: a fully automated approach.
Hist. Methods. In press. https://doi.org/10.1080/01615440.2018.1543034
Airoldi EM, Blei DM, Fienberg SE, Xing EP. 2008. Mixed membership stochastic blockmodels. J. Mach. Learn.
Res. 9:1981–2014
Angrist JD, Imbens GW, Rubin DB. 1996. Identification of causal effects using instrumental variables. J. Am. Stat. Assoc. 91:444–55
Athey S. 2017. Beyond prediction: using big data for policy problems. Science 355:483–85
Athey S, Imbens G. 2015. A measure of robustness to misspecification. Am. Econ. Rev. 105:476–80
Athey S, Imbens G. 2016. Recursive partitioning for heterogeneous causal effects. PNAS 113:7353–60
Athey S, Imbens GW. 2017. The state of applied econometrics: causality and policy evaluation. J. Econ. Perspect. 31:3–32. (Overview of applied econometrics and the place of machine learning tools in the field.)
Bail CA. 2008. The configuration of symbolic boundaries against immigrants in Europe. Am. Sociol. Rev. 73:37–59
Bail CA. 2014. The cultural environment: measuring culture with big data. Theor. Soc. 43:465–82
Baldassarri D, Abascal M. 2017. Field experiments across the social sciences. Annu. Rev. Sociol. 43:41–73
Baldassarri D, Goldberg A. 2014. Neither ideologues nor agnostics: alternative voters’ belief system in an age
of partisan politics. Am. J. Sociol. 120:45–95
Barocas S, Selbst A. 2016. Big data’s disparate impact. Calif. Law Rev. 104:671–732
Baumer EPS, Mimno D, Guha S, Quan E, Gay GK. 2017. Comparing grounded theory and topic modeling:
extreme divergence or unlikely convergence? J. Assoc. Inf. Sci. Tech. 68:1397–410
Beck N, King G, Zeng L. 2000. Improving quantitative studies of international conflict: a conjecture. Am.
Political Sci. Rev. 94:21–35
Belloni A, Chen D, Chernozhukov V, Hanse C. 2012. Sparse models and methods for optimal instruments
with an application to eminent domain. Econometrica 80:2369–429
Belloni A, Chernozhukov V, Fernandez-Val I, Hansen C. 2017. Program evaluation and causal inference with
high-dimensional data. Econometrica 85:233–98
Belloni A, Chernozhukov V, Hansen C. 2014. Inference on treatment effects after selection among high-dimensional controls. Rev. Econ. Stud. 81:608–50. (Consideration of omitted variable bias in ML.)
Berk R. 2012. Criminal Justice Forecasts of Risk. New York: Springer
Berk R, Heidari H, Jabbari S, Kearns M, Roth A. 2018. Fairness in criminal justice risk assessments: the state
of the art. Sociol. Method. Res. https://doi.org/10.1177/0049124118782533
Berk RA, Sorenson SB, Barnes G. 2016. Forecasting domestic violence: a machine learning approach to help
inform arraignment decisions. J. Empir. Legal Stud. 13:94–115
Bernheim BD, Bjorkegren D, Naecker J, Rangel A. 2013. Non-choice evaluations predict behavioral responses to
changes in economic conditions. NBER Work. Pap. 19269
Billari FC, Fürnkranz J, Prskawetz A. 2006. Timing, sequencing, and quantum of life course events: a machine
learning approach. Eur. J. Popul. 22:37–65
Blei DM. 2012. Probabilistic topic models. Commun. ACM 55:77–84
Blei DM, McAuliffe JD. 2010. Supervised topic models. arXiv:1003.0783 [stat.ML]
Blumenstock J, Cadamuro G, On R. 2015. Predicting poverty and wealth from mobile phone metadata. Science
350:1073–76
Bollen KA. 2002. Latent variables in psychology and the social sciences. Annu. Rev. Psychol. 53:605–34
Bonikowski B, DiMaggio P. 2016. Varieties of American popular nationalism. Am. Sociol. Rev. 81:949–80
Bonnefon JF, Shariff A, Rahwan I. 2016. The social dilemma of autonomous vehicles. Science 352:1573–6
Brandt PT, Freeman JR, Schrodt PA. 2011. Real time, time series forecasting of inter- and intra-state political
conflict. Conflict Manag. Peace 28:41–64
Breiger RL, Boorman SA, Arabie P. 1975. An algorithm for clustering relational data with applications to
social network analysis and comparison with multidimensional scaling. J. Math. Psychol. 12:328–83
Breiman L. 2001a. Random forests. Mach. Learn. 45:5–32
Breiman L. 2001b. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat.
Sci. 16:199–231
Carrasco M. 2012. A regularization approach to the many instruments problem. J. Econom. 170:383–98
Cederman LE, Weidmann NB. 2017. Predicting armed conflict: Time to adjust our expectations? Science
355:474–76
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W. 2017. Double/debiased/Neyman machine learning of treatment effects. Am. Econ. Rev. 107:261–65
Goldberg A. 2011. Mapping shared understandings using relational class analysis: the case of the cultural
omnivore reexamined. Am. J. Sociol. 116:1397–436
Goodfellow I, Bengio Y, Courville A. 2016. Deep Learning. Cambridge, MA: MIT Press. (Basic introduction to machine learning, with emphasis on the deep learning toolbox.)
Greene JD. 2016. Our driverless dilemma. Science 352:1514–15
Grimmer J, King G. 2011. General purpose computer-assisted clustering and conceptualization. PNAS 108:2643–50
Grimmer J, Messing S, Westwood SJ. 2017. Estimating heterogeneous treatment effects and the effects of
heterogeneous treatments with ensemble methods. Political Anal. 25:413–34
Grimmer J, Stewart BM. 2013. Text as data: the promise and pitfalls of automatic content analysis methods
for political texts. Political Anal. 21:267–97
Grosse R. 2013. Predictive learning vs. representation learning. Laboratory for Intelligent Probabilistic Systems
Blog, Feb. 4. https://lips.cs.princeton.edu/predictive-learning-vs-representation-learning/
Handcock MS, Raftery AE, Tantrum JM. 2007. Model-based clustering for social networks. J. R. Stat. Soc.
170:301–54
Harcourt BE. 2007. Against Prediction: Profiling, Policing, and Punishing in an Actuarial Age. Chicago: Univ.
Chicago Press
Hardt M, Price E, Srebro N. 2016. Equality of opportunity in supervised learning. In Proceedings of the 30th
International Conference on Neural Information Processing Systems, ed. DD Lee, U von Luxburg, R Garnett,
M Sugiyama, I Guyon. Red Hook, NY: Curran
Hartford J, Lewis G, Leyton-Brown K, Taddy M. 2016. Counterfactual prediction with deep instrumental
variables networks. arXiv:1612.09596 [stat.AP]
Hastie T, Tibshirani R, Friedman J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer. 2nd ed.
Hill JL. 2011. Bayesian nonparametric modeling for causal inference. J. Comput. Graph. Stat. 20:217–40
Ho DE, Imai K, King G, Stuart EA. 2007. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Anal. 15:199–236
Matias C, Miele V. 2017. Statistical clustering of temporal networks through a dynamic stochastic block model.
Stat. Methodol. 79:1119–41
McCaffrey DF, Ridgeway G, Morral AR. 2004. Propensity score estimation with boosted regression for eval-
uating causal effects in observational studies. Psychol. Methods 9(4):403–25
McFarland DA, Ramage D, Chuang J, Heer J, Manning CD, Jurafsky D. 2013. Differentiating language usage
through topic models. Poetics 41:607–25
Mohr JW, Bogdanov P. 2013. Introduction—topic models: what they are and why they matter. Poetics 41:545–
69
Mohr JW, Wagner-Pacifici R, Breiger RL, Bogdanov P. 2013. Graphing the grammar of motives in national
security strategies: cultural interpretation, automated text analysis and the drama of global politics. Poetics
41:670–700
Morgan SL, Winship C. 2007. Counterfactuals and Causal Inference: Methods and Principles for Social Research.
Cambridge, UK: Cambridge Univ. Press. 1st ed.
Morgan SL, Winship C. 2014. Counterfactuals and Causal Inference: Methods and Principles for Social Research.
Cambridge, UK: Cambridge Univ. Press. 2nd ed.
Mullainathan S, Spiess J. 2017. Machine learning: an applied econometric approach. J. Econ. Perspect. 31:87–106. (Introduction to machine learning from an applied econometrics perspective.)
Wager S, Athey S. 2018. Estimation and inference of heterogeneous treatment effects using random forests.
J. Am. Stat. Assoc. 113:1228–42
Watts DJ. 2004. The new science of networks. Annu. Rev. Sociol. 30:243–70
Watts DJ. 2014. Common sense and sociological explanations. Am. J. Sociol. 120:313–51
Weber M. 1978. Economy and Society. Berkeley: Univ. Calif. Press
Western B. 1996. Vague theory and model uncertainty in macrosociology. Sociol. Methodol. 26:165–92
Westreich D, Lessler J, Funk MJ. 2010. Propensity score estimation: neural networks, support vector ma-
chines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J. Clin. Epidemiol.
63:826–33
White HC, Boorman SA, Breiger RL. 1976. Social structure from multiple networks. I. Blockmodels of roles
and positions. Am. J. Sociol. 81:730–80
Wolpert D, Macready W. 1997. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1:67–82
Wong SC, Gatt A, Stamatescu V, McDonnell MD. 2016. Understanding data augmentation for classification:
when to warp? In 2016 International Conference on Digital Image Computing: Techniques and Applications
(DICTA), ed. AW Liew, B Lovell, C Fookes, J Zhou, Y Gao, et al., pp. 1–6. New York: IEEE
Wyss R, Ellis AR, Brookhart MA, Girman CJ, Jonsson Funk M, et al. 2014. The role of prediction modeling
in propensity score estimation: an evaluation of logistic regression, bCART, and the covariate-balancing
propensity score. Am. J. Epidemiol. 180:645–55
Xie Y. 2007. Otis Dudley Duncan’s legacy: the demographic approach to quantitative reasoning in social
science. Res. Soc. Strat. Mobil. 25:141–56
Xie Y. 2013. Population heterogeneity and causal inference. PNAS 110:6262–8
Xing EP, Fu W, Song L. 2010. A state-space mixed membership blockmodel for dynamic network tomography.
Ann. Appl. Stat. 4:535–66
Yang T, Chi Y, Zhu S, Gong Y, Jin R. 2011. Detecting communities and their evolutions in dynamic social
networks—a Bayesian approach. Mach. Learn. 82:157–89
Yarkoni T, Westfall J. 2017. Choosing prediction over explanation in psychology: lessons from machine learning. Perspect. Psychol. Sci. 12:1100–22. (Explains relevance of machine learning to psychological research and to preventing p-hacking.)
Young C. 2009. Model uncertainty in sociological research: an application to religion and economic growth. Am. Sociol. Rev. 74:380–97
Young C, Holsteen K. 2017. Model uncertainty and robustness: a computational framework for multimodel analysis. Sociol. Methods Res. 46:3–40
RELATED RESOURCES
Murphy KP. 2012. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press
Bishop CM. 2016. Pattern Recognition and Machine Learning. New York: Springer
Salganik MJ. 2017. Bit by Bit: Social Research in the Digital Age. Princeton, NJ: Princeton Univ. Press
Summer Institute in Computational Social Science. 2017. Online resources. SICSS. https://compsocialscience.github.io/summer-institute/2017/#schedule