Integrating Explanation and Prediction in Computational Social Science

Jake M. Hofman1,17 ✉, Duncan J. Watts2,3,4,17 ✉, Susan Athey5, Filiz Garip6, Thomas L. Griffiths7,8, Jon Kleinberg9,10, Helen Margetts11,12, Sendhil Mullainathan13, Matthew J. Salganik6, Simine Vazire14, Alessandro Vespignani15 & Tal Yarkoni16

https://doi.org/10.1038/s41586-021-03659-0
Received: 23 February 2021
Accepted: 20 May 2021
In the past 15 years, social science has experienced the beginnings of a ‘computational revolution’ that is still unfolding1–4. In part this revolution has been driven by the technological revolution of the internet, which has effectively digitized the social, economic, political, and cultural activities of billions of people, generating vast repositories of digital data as a byproduct5. And in part it has been driven by an influx of methods and practices from computer science that were needed to deal with new classes of data—such as search and social media data—that have tended to be noisier, more unstructured, and less ‘designed’ than traditional social science data (for example, surveys and lab experiments). One obvious and important outcome of these dual processes has been the emergence of a new field, now called computational social science2,4, that has generated considerable interest among social scientists and computer scientists alike6.

What we argue in this paper, however, is that another outcome—less obvious but potentially even more important—has been the surfacing of a tension between the epistemic values of social and computer scientists. On the one hand, social scientists have traditionally prioritized the formulation of interpretatively satisfying explanations of individual and collective human behaviour, often invoking causal mechanisms derived from substantive theory7. On the other hand, computer scientists have traditionally been more concerned with developing accurate predictive models, whether or not they correspond to causal mechanisms or are even interpretable8.

In turn, these different values have led social and computer scientists to prefer different methods from one another, and to invoke different standards of evidence. For example, whereas quantitative methods in social science are designed to identify causal relationships or to obtain unbiased estimates of theoretically interesting parameters, machine learning methods are typically designed to minimize total error on as-yet unseen data9,10. As a result, it is standard practice for social scientists to fit their models entirely ‘in-sample’, on the grounds that they are seeking to explain social processes and not to predict outcomes, whereas for computer scientists evaluation on ‘held out’ data is considered obligatory11. Conversely, computer scientists often allow model complexity to increase as long as it continues to improve predictive performance, whereas for social scientists models should be grounded in, and therefore constrained by, substantive theory12.

We emphasize that both approaches are defensible on their own terms, and both have generated large, productive scientific literatures; however, both approaches have also been subjected to serious criticism. On the one hand, theory-driven empirical social science has been criticized for generating findings that fail to replicate13, fail to generalize14, fail to predict outcomes of interest15,16, and fail to offer solutions to real-world problems17,18. On the other hand, complex predictive models have also been criticized for failing to generalize19 as well as being uninterpretable20 and biased21. Meanwhile, extravagant claims that the ability to mine sufficiently large datasets will result in an ‘end of theory’ have been widely panned22. How might we continue
1Microsoft Research, New York, NY, USA. 2Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA. 3The Annenberg School of Communication, University of Pennsylvania, Philadelphia, PA, USA. 4Operations, Information, and Decisions Department, University of Pennsylvania, Philadelphia, PA, USA. 5Graduate School of Business, Stanford University, Stanford, CA, USA. 6Department of Sociology, Princeton University, Princeton, NJ, USA. 7Department of Psychology, Princeton University, Princeton, NJ, USA. 8Department of Computer Science, Princeton University, Princeton, NJ, USA. 9Department of Computer Science, Cornell University, Ithaca, NY, USA. 10Department of Information Science, Cornell University, Ithaca, NY, USA. 11Oxford Internet Institute, University of Oxford, Oxford, UK. 12Public Policy Programme, The Alan Turing Institute, London, UK. 13Booth School of Business, University of Chicago, Chicago, IL, USA. 14Melbourne School of Psychological Sciences, University of Melbourne, Melbourne, Victoria, Australia. 15Laboratory for the Modeling of Biological and Socio-technical Systems, Northeastern University, Boston, MA, USA. 16Department of Psychology, University of Texas at Austin, Austin, TX, USA. 17These authors contributed equally: Jake M. Hofman, Duncan J. Watts. ✉e-mail: [email protected]; [email protected]
and community detection in networks10. For example, much of what is known about public opinion, the state of the economy, and everyday human experience is derived from survey research, whether conducted by federal statistical agencies such as the Bureau of Labor Statistics or research organizations such as Pew Research Center. Statistical analyses of administrative data are also often descriptive in nature. For example, recent studies have documented important differences in mortality rates37, wealth gaps38 and intergenerational economic mobility39 across racial and ethnic groups. Qualitative and comparative methods that are popular in sociology, communications, and anthropology also fall into this quadrant. Finally, much of the progress in computational social science to date has been in using digital signals and platforms to investigate previously unmeasurable concepts5,40. Descriptive work, in other words, whether qualitative or quantitative, is useful and interesting in its own right and also foundational to the activities conducted in the other three quadrants.
Moving beyond description, explanatory modelling (quadrant 2) refers to activities whose goal is to identify and estimate causal effects, but that do not focus directly on predicting outcomes. Most of traditional empirical sociology, political science, economics, and psychology falls into this quadrant, which encompasses a wide range of methods, including statistical modelling of observational data, lab experiments, field experiments, and qualitative methods. Some methods (for example, randomized or natural experiments, or non-experimental identification strategies such as instrumental variables and regression discontinuity designs) isolate causal effects by design, whereas others (for example, regression modelling, qualitative data) invoke causal interpretations based on theory. Regardless, methods in this quadrant tend to prioritize simplicity, considering one or only a handful of features that may affect an outcome of interest. We emphasize that these approaches can be very useful for understanding individual causal effects, shaping theoretical models, and even guiding policy. For example, field experiments that show that job applicants with characteristically ‘Black’ names are less likely to be interviewed than those with ‘white’ names25 reveal the presence of structural racism and inform public debates about discrimination with respect to gender, race, and other protected attributes. Relatedly, quantifying difficult-to-assess effects, such as the impact of gender and racial diversity on policing41, can motivate concrete policy interventions. Nonetheless, the emphasis on studying effects in isolation can lead to little, if any, attention being paid to predictive accuracy. As many effects are small, and simple models can fail to incorporate the broader set of features pertinent to the outcome being studied, these methods can suffer from relatively poor predictive performance.
In contrast with explanatory modelling, predictive modelling (quadrant 3) refers to activities that attempt to predict the outcome of interest directly but do not explicitly concern themselves with the identification of causal effects. ‘Prediction’ in this quadrant may or may not be about actual future events; however, in contrast with quadrants 1 and 2, it refers exclusively to ‘out of sample’ prediction42, meaning that the data on which the model is evaluated (the held-out or test data) are different from the data on which the model was estimated (the training data). Activities in this quadrant encompass time series modelling43, prediction contests44, and much of supervised machine learning10, ranging from simple linear regression to complex artificial neural networks. By evaluating performance on a held-out test set, these methods focus on producing predictions that generalize well to future observations. From a policy perspective, it can be helpful to have high-quality forecasts of future events even if those forecasts are not causal in nature9,45–47. For example, applications of machine learning to human behaviour abound in online advertising and recommendation systems, but can also detect potentially viral content on social media early in its trajectory48. Although these algorithms do not identify what is causing people to click or content to spread, they can still be useful inputs for decision-makers—for example, alerting human reviewers to check potentially large cascades for harmful misinformation. That said, there is often an implicit assumption that the data used to train and test the model come from the same data-generating process, akin to making forecasts in a static (albeit possibly noisy) world. As a result, while these methods often work well for a fixed data distribution, they may not generalize to settings in which features or inputs are actively manipulated (as in a controlled experiment or policy change) or change as a result of other, uncontrolled factors.
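To make the in-sample versus out-of-sample distinction concrete, the following minimal sketch (our illustration, not code from any study cited here; the data and variable names are synthetic) fits an ordinary regression and reports both its in-sample fit and its accuracy on a held-out test set. The gap between the two numbers is precisely what quadrant 3 methods are designed to surface and minimize.

# Illustrative sketch (ours): held-out evaluation on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                                   # 20 candidate features
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=2.0, size=1000)    # signal plus substantial noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# The in-sample fit is optimistic; the held-out score estimates how well
# predictions generalize to data the model has not seen.
print("in-sample R^2:", round(r2_score(y_train, model.predict(X_train)), 3))
print("held-out R^2: ", round(r2_score(y_test, model.predict(X_test)), 3))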
Combining the explanatory properties of quadrant 2 and the predictive properties of quadrant 3, integrative modelling (quadrant 4) refers to activities that attempt to predict as-yet unseen outcomes in terms of causal relationships. More specifically, whereas quadrant 3 concerns itself with data that are out of sample, but still from the same (statistical) distribution, here the focus is on generalizing ‘out of distribution’ to a situation that might change either naturally, owing to some factor out of our control, or because of some intentional intervention such as an experiment or change in policy. This category includes distributional changes for settings that we have observed before (that is, setting an input feature to a specific value, rather than simply observing it to be at that value) as well as the more extreme case of entirely new situations (that is, setting an input feature to an entirely new value that we have never seen before). Integrative modelling therefore requires attention to quadrant 2 concerns about estimating causal, rather than simply associational, effects49, while simultaneously considering the impact of all such effects to forecast outcomes as accurately as possible (that is, quadrant 3). Ideally work in this quadrant would generate high-quality predictions about future outcomes in a (potentially) changing world. However, forcing one’s explanations to make predictions can reveal that they explain less than one would like15,50, thereby motivating and guiding the search for more complete explanations51. Alternatively, such a search may reveal the presence of a fundamental limit to predictive accuracy that results from the presence of system complexity or intrinsic randomness52, in which case the conclusion may be that we can explain less than we would like, even in principle53.
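The contrast between out-of-sample and out-of-distribution generalization can be illustrated with a toy simulation. In the hedged sketch below (our construction; the confounder u, feature x, and all coefficients are invented for illustration), a purely associational model predicts well on new draws from the same observational process but fails badly once the feature is set by intervention rather than observed.

# Illustrative sketch (ours): a model that generalizes out of sample but not out of
# distribution. A confounder u drives both x and y; the true causal effect of x is 0.5.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 10000

def observational_sample():
    u = rng.normal(size=n)                         # unobserved confounder
    x = u + rng.normal(scale=0.1, size=n)          # x largely proxies u
    y = 2.0 * u + 0.5 * x + rng.normal(scale=0.1, size=n)
    return x, y

x, y = observational_sample()
model = LinearRegression().fit(x.reshape(-1, 1), y)   # learns an associational slope near 2.5

# Out of sample, same data-generating process: predictions look excellent.
x_new, y_new = observational_sample()
print("same-distribution MSE:", mean_squared_error(y_new, model.predict(x_new.reshape(-1, 1))))

# Under an intervention, x is set independently of u (as in an experiment or policy
# change), so the learned association no longer holds and predictions degrade sharply.
u = rng.normal(size=n)
x_do = rng.normal(size=n)                          # x assigned, not observed
y_do = 2.0 * u + 0.5 * x_do + rng.normal(scale=0.1, size=n)
print("post-intervention MSE:", mean_squared_error(y_do, model.predict(x_do.reshape(-1, 1))))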
could be hit by an out-of-control vehicle and measuring the changes in participants’ judgements of the moral acceptability of different outcomes. Agrawal et al.12 used this dataset as the basis for building a predictive model, using a black box machine learning method (an artificial neural network) to predict people’s decisions. This predictive model was used to critique a more traditional cognitive model and to identify potential causal factors that might have influenced people’s decisions. The cognitive model was then evaluated in a new round of experiments that tested its predictions about the consequences of manipulating those causal factors.
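As a rough schematic of this iterate-between-models workflow (our sketch, not Agrawal et al.’s actual code; the dataset, models, and function below are hypothetical stand-ins), one can pit a flexible black-box predictor against an interpretable theory-based model and inspect the held-out cases where the theory lags furthest behind, which suggests candidate causal factors to add and then test by manipulation in a follow-up experiment.

# Schematic sketch (ours) of an iterate-between-models loop in the spirit described above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def regret_cases(X, y, n_cases=10):
    """Return held-out cases where an interpretable model trails a black-box benchmark."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    black_box = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                             random_state=0).fit(X_tr, y_tr)   # flexible predictive benchmark
    theory = LinearRegression().fit(X_tr, y_tr)                # stand-in for a cognitive model

    bb_err = (y_te - black_box.predict(X_te)) ** 2
    th_err = (y_te - theory.predict(X_te)) ** 2
    print("black box MSE:", bb_err.mean(), " theory MSE:", th_err.mean())

    # 'Regret': observations the benchmark captures but the theory misses.
    # Inspecting them suggests factors to add to the theory before a new experiment.
    worst = np.argsort(th_err - bb_err)[-n_cases:]
    return X_te[worst], y_te[worst]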
Clearly label contributions

Our second suggestion is deceptively simple: researchers should clearly label their research activities according to the type of contributions they make. Simply adding labels to published research sounds trivial, but checklists68, badges69, and other labelling schemes are already a central component of efforts to improve the transparency, openness, and reproducibility of science70. Inspired by these efforts, we argue that encouraging researchers to clearly identify the nature of their contribution would be clarifying both for ourselves and for others, and propose the labelling scheme in Table 2 for this purpose. We anticipate that many other labelling schemes could be proposed, each of which would have advantages and disadvantages. At a minimum, however, we advocate for a scheme that satisfies two very general properties: first, it should differentiate as cleanly as possible between contributions in the four quadrants of Table 1; and second, within each quadrant it should identify the level of granularity (for example, high, medium or low) that is exhibited by the result.
Focusing first on the columns of Table 2, we recognize that the boundaries of the quadrants will, in reality, be blurry, and that individual papers will sometimes comprise a blend of contributions across quadrants or granularity levels; however, we believe that surfacing these ambiguities and making them explicit would itself be a useful exercise. If, for example, it is unclear whether a particular claim is merely descriptive (for example, there exists a difference in outcome variable y between two groups A and B) or is intended as a causal claim (for example, that the difference exists because A and B differ on some other variable x), requiring us to attest that our model tests a causal claim in order to place it in quadrant 2 should cause us to reflect on our choice of language and possibly to clarify it. Such a clarification would also help to avoid confusion that can arise from any given research method falling into more than one quadrant, depending on the objectives of the researcher (see example in Box 1).
Focusing next on the rows, Table 2 is also intended to clarify that it is possible to engage in activities that reveal widely different amounts of information while remaining within a given quadrant. In quadrant 1, for example, a description that specifies the association between individual-level attributes and outcomes tells us more about a phenomenon than one that does the same thing at the level of population averages or ‘stylized facts’ (that is, the sort of qualitative statements that are often used in summaries of scientific work, such as “income rises with education”). In quadrant 2, estimating the magnitude of an effect is more informative than determining only its sign (positive or negative), which is in turn more informative than simply establishing that it is unlikely to be zero. Likewise, estimates of effect sizes made across a range of conditions are more informative than those that are made for only one set of conditions (for example, the particular settings chosen for a lab experiment14). In quadrant 3, predictions about outcomes can also be subjected to tests at widely different levels, depending on numerous, often benign-seeming, details of the test36. For example: (a) predictions about distributional properties (for example, population averages) are less informative than predictions of individual outcomes; (b) predictions about which ‘bucket’ an observation falls into (for example, above or below some threshold, as in most classification tasks) tell us less than predictions of specific outcome values (as in regression); (c) ex-ante predictions made immediately before an event are less difficult than those made far in advance; and (d) predictions that are evaluated against poor or inappropriate baseline models—or where a baseline is absent—are less informative than those that are compared against a strong baseline35. The same distinctions apply to quadrant 4, with the key difference being that claims made in this quadrant are evaluated under some change in the data-generating process, whether through intentional experimentation or changes that result from other external factors. Requiring researchers to state explicitly the level of granularity at which a particular claim is made will, we hope, lead to more accurate interpretations of our findings.
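Point (d) is easy to see in a small simulation (ours; the trending series and both baselines are invented for illustration): the same forecast looks far more impressive when judged against the historical mean than against a simple persistence baseline.

# Illustrative sketch (ours) of point (d): the choice of baseline changes how
# informative a forecast appears. All data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
t = np.arange(400)
y = 0.05 * t + rng.normal(scale=1.0, size=t.size)     # slowly trending, noisy series

train, test = slice(0, 300), slice(300, 400)
model = LinearRegression().fit(t[train].reshape(-1, 1), y[train])
forecast = model.predict(t[test].reshape(-1, 1))

weak = np.full(100, y[train].mean())                   # weak baseline: historical mean
strong = y[299:399]                                    # strong baseline: persistence (previous value)

print("model MAE:          ", mean_absolute_error(y[test], forecast))
print("mean-baseline MAE:  ", mean_absolute_error(y[test], weak))     # easy to beat
print("persistence MAE:    ", mean_absolute_error(y[test], strong))   # much harder to beat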
Standardize open science practices

Our third suggestion is to standardize open science practices between those engaged in predictive and explanatory modelling. Over the last several years, scientists working in each tradition have promoted best practices to facilitate transparent, reproducible, and cumulative science; specifically, pre-registration in the explanatory modelling community71, and the common task framework in the predictive modelling community72. Here we highlight how each community can learn from and leverage best practices developed in the other.

Pre-registration. Pre-registration is the act of publicly declaring one’s plans for how any given research activity will be done before it is actually carried out and is designed with a simple goal in mind: to make it easier for readers and reviewers to tell the difference between planned and unplanned analyses. This procedure can help to calibrate expectations about the reliability of reported findings and, in turn, reduce the incidence of unreliable, false-positive results in research that tests a given hypothesis or prediction27,71. Specifically, pre-registration reduces the risk of making undisclosed post hoc, data-dependent decisions (for example, which of many possible statistical tests to run) that can lead to non-replicable findings.
Summary of suggestions

● Integrate predictive and explanatory modelling
  ○ Look to sparsely populated quadrants for new research opportunities
  ○ Test existing methods to see how they generalize under interventions or distributional changes
  ○ Develop new methods that iterate between predictive and explanatory modelling
● Clearly label contributions according to the quadrant in which they make a claim, and the granularity of that claim
● Standardize open science practices across the social and computer sciences, encouraging, for instance, pre-registration for predictive models and the common task framework for explanatory modelling

the subjective experience of having made sense of many diverse phenomena without being either predictively accurate or demonstrably causal81 (for example, conspiracy theories).

Interpretable explanations, of course, can be valued for other reasons. For example, interpretability allows scientists to ‘mentally simulate’ their models, thereby generating plausible hypotheses for subsequent testing. Clearly this ability is helpful to theory development, especially when data are sparse or noisy, which is often the case for social phenomena. Equally important, interpretable models are often easier to communicate and discuss (verbally or in text), thereby increasing the likelihood that others will pay attention to them, use them, or improve upon them. In other words, interpretability is a perfectly valid property to desire of an explanation, and can be very useful pragmatically. It is our opinion, however, that it should be valued on its own merits, not on the grounds that it directly improves the predictive or causal properties of a model.

We also acknowledge that there are costs associated with adopting the integrative modelling practices that we have described. As mentioned earlier, evaluating explanations in terms of their predictive accuracy may reveal that our existing theories explain less than we would like53. Likewise, clearly labelling contributions as descriptive, explanatory, predictive and so on may cast our findings in a less flattering light than if they are described in vague or ambiguous language. Pre-registration requires additional time and effort from individual researchers, and some have criticized it as de-emphasizing important exploratory work. Increased adoption of registered reports requires changes to editorial and review processes, and therefore the coordination of many individuals with potentially disparate interests. The common task framework demands a great deal of effort on the part of those organizing an instance of it82, as well as adoption by others in the field once a task is created. It is also subject to what has been called Goodhart’s law83: “When a measure becomes a target, it ceases to be a good measure.”

That said, it is our view that wider adoption of these practices would be a net benefit for the field of computational social science. Exploratory work is important and should be encouraged, but pre-registration is crucial in that it helps to distinguish the act of testing models from the process of building them. Registered reports help us to focus on the informativeness of inquiries being conducted without biasing our attention based on the outcomes of those tests. And the common task framework provides a way of uniting sub-fields and disciplines to accelerate collective progress. Most importantly, thinking clearly about the epistemic values of explanation and prediction not only helps us to recognize their distinct contributions but also reveals new ways to integrate them in empirical research. Doing so will, we believe,

References

1. Watts, D. J. A twenty-first century science. Nature 445, 489 (2007).
2. Lazer, D. et al. Computational social science. Science 323, 721–723 (2009).
3. Salganik, M. J. Bit by Bit: Social Research in the Digital Age (Princeton Univ. Press, 2018).
4. Lazer, D. M. J. et al. Computational social science: obstacles and opportunities. Science 369, 1060–1062 (2020).
5. Lazer, D. et al. Meaningful measures of human society in the twenty-first century. Nature https://doi.org/10.1038/s41586-021-03660-7 (2021).
6. Wing, J. M. Computational thinking. Commun. ACM 49, 33–35 (2006).
7. Hedström, P. & Ylikoski, P. Causal mechanisms in the social sciences. Annu. Rev. Sociol. 36, 49–67 (2010).
8. Breiman, L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat. Sci. 16, 199–231 (2001).
   We view our paper as an extension of Breiman’s dichotomy (the ‘algorithmic’ and ‘data modelling’ cultures), arguing that these approaches should be integrated.
9. Mullainathan, S. & Spiess, J. Machine learning: an applied econometric approach. J. Econ. Perspect. 31, 87–106 (2017).
   This paper explores the relationships between predictive models and causal inference.
10. Molina, M. & Garip, F. Machine learning for sociology. Annu. Rev. Sociol. 45, 27–45 (2019).
11. Shmueli, G. To explain or to predict? Stat. Sci. 25, 289–310 (2010).
   We build on Shmueli’s distinction between prediction and explanation and propose a framework for integrating the two approaches.
12. Agrawal, M., Peterson, J. C. & Griffiths, T. L. Scaling up psychology via Scientific Regret Minimization. Proc. Natl Acad. Sci. USA 117, 8825–8835 (2020).
   This paper exemplifies what we call integrative modelling.
13. Munafò, M. R. et al. A manifesto for reproducible science. Nat. Hum. Behav. 1, 0021 (2017).
14. Yarkoni, T. The generalizability crisis. Behav. Brain Sci. https://doi.org/10.1017/S0140525X20001685 (2020).
15. Ward, M. D., Greenhill, B. D. & Bakke, K. M. The perils of policy by p-value: predicting civil conflicts. J. Peace Res. 47, 363–375 (2010).
16. Yarkoni, T. & Westfall, J. Choosing prediction over explanation in psychology: lessons from machine learning. Perspect. Psychol. Sci. 12, 1100–1122 (2017).
17. Watts, D. J. Should social science be more solution-oriented? Nat. Hum. Behav. 1, 0015 (2017).
18. Berkman, E. T. & Wilson, S. M. So useful as a good theory? The practicality crisis in (social) psychological theory. Perspect. Psychol. Sci. https://doi.org/10.1177/1745691620969650 (2021).
19. Athey, S. Beyond prediction: using big data for policy problems. Science 355, 483–485 (2017).
20. Lipton, Z. C. The mythos of model interpretability. Queue 16, 31–57 (2018).
21. Kleinberg, J., Ludwig, J., Mullainathan, S. & Sunstein, C. R. Discrimination in the age of algorithms. J. Legal Anal. 10, 113–174 (2018).
22. Coveney, P. V., Dougherty, E. R. & Highfield, R. R. Big data need big theory too. Philos. Trans. R. Soc. A 374, 20160153 (2016).
23. Gigerenzer, G. Mindless statistics. J. Socio-Econ. 33, 587–606 (2004).
24. Cohen, J. The earth is round (p < .05). Am. Psychol. 49, 997–1003 (1994).
25. Bertrand, M. & Mullainathan, S. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. Am. Econ. Rev. 94, 991–1013 (2004).
26. Ioannidis, J. P. A. Why most published research findings are false. PLoS Med. 2, e124 (2005).
27. Simmons, J. P., Nelson, L. D. & Simonsohn, U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366 (2011).
28. Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
29. Meehl, P. E. Why summaries of research on psychological theories are often uninterpretable. Psychol. Rep. 66, 195–244 (1990).
30. Gelman, A. Causality and statistical learning. Am. J. Sociol. 117, 955–966 (2011).
31. Dienes, Z. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference (Macmillan, 2008).
32. Schrodt, P. A. Seven deadly sins of contemporary quantitative political analysis. J. Peace Res. 51, 287–300 (2014).
33. Lazer, D., Kennedy, R., King, G. & Vespignani, A. The parable of Google flu: traps in big data analysis. Science 343, 1203–1205 (2014).
34. Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
35. Goel, S., Hofman, J. M., Lahaie, S., Pennock, D. M. & Watts, D. J. Predicting consumer behavior with web search. Proc. Natl Acad. Sci. USA 107, 17486–17490 (2010).
36. Hofman, J. M., Sharma, A. & Watts, D. J. Prediction and explanation in social systems. Science 355, 486–488 (2017).
37. Case, A. & Deaton, A. Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century. Proc. Natl Acad. Sci. USA 112, 15078–15083 (2015).
38. Oliver, M. L., Shapiro, T. M. & Shapiro, T. Black Wealth, White Wealth: A New Perspective on Racial Inequality (Taylor & Francis, 2006).
39. Chetty, R., Hendren, N., Kline, P. & Saez, E. Where is the land of opportunity? The geography of intergenerational mobility in the United States. Q. J. Econ. 129, 1553–1623 (2014).
40. Wagner, C. et al. Measuring algorithmically infused societies. Nature https://doi.org/10.1038/s41586-021-03666-1 (2021).
41. Ba, B. A., Knox, D., Mummolo, J. & Rivera, R. The role of officer race and gender in police–civilian interactions in Chicago. Science 371, 696–702 (2021).