Critiquing Big Data
19328036/20140005
KATE MILTNER
Microsoft Research, USA
MARY L. GRAY
Microsoft Research New England, USA
Indiana University, USA
Why now? This is the first question we might ask of the big data phenomenon. Why has
it gained such remarkable purchase in a range of industries and across academia, at this
point in the 21st century? Big data as a term has spread like kudzu in a few short years,
ranging across a vast terrain that spans health care, astronomy, policing, city planning,
and advertising. From the RNA bacteriophages in our bodies to the Kepler Space
Telescope, searching for terrorists or predicting cereal preferences, big data is deployed
as the term of art to encompass all the techniques used to analyze data at scale. But
why has the concept gained such traction now?
The Technical Terrain
A common answer is that big data approaches have been produced by our current technological
capacities: that the steady drumbeat of Moore's Law, the doubling of transistors on integrated circuits roughly every two years,
has brought us to a point where massive amounts of data can be easily gathered, stored, analyzed and
interlinked. Apache's Hadoop, a common big data platform that utilizes distributed nodes to act as
processing and analysis clusters, was launched in 2005, reinforcing the idea of big data's newness. Yet
this isn't satisfying; it overemphasizes the technological element of our historical conjuncture and fails to
account for the economic, political, and cultural forces at work. And, in fact, the term "big data" has been
around for almost two decades, and yet it has only been in the last five years that it has acquired such
popular resonance. In its earliest incarnations in the archive of the Association for Computing Machinery,
the concept of big data simply referred to data sets that were too large for any single computer:
"Visualization provides an interesting challenge for computer systems: data sets are generally quite large,
taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big
data" (Cox & Ellsworth, 1997, p. 235).
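The distributed model that platforms like Hadoop popularized can be illustrated with a minimal, single-process sketch of MapReduce-style word counting. The documents and function names here are hypothetical; a real Hadoop job would distribute these phases across cluster nodes rather than run them in one process.

```python
from collections import defaultdict

# A minimal, single-process sketch of the MapReduce pattern that Hadoop
# distributes across cluster nodes. Documents and names are hypothetical.

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: sum the emitted counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data is big", "data at scale"]
word_counts = reduce_phase(map_phase(docs))
# word_counts == {"big": 2, "data": 2, "is": 1, "at": 1, "scale": 1}
```

In a real cluster, the map step runs in parallel over partitions of the input and a shuffle step routes each word's pairs to the node responsible for reducing it; the analytic logic, however, is the same.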
This definition of the problem of big data as being about storage and compression was well
established in the mid-20th century, as RAND was developing its Relational Data File, a system designed
for "the logical analysis of large collections of factual data." In November 1967, two computer scientists
were struggling with the difficulty of working with large data sets, and they discovered "a wide variety of
problems: logical and linguistic, hardware and software, practical and theoretical" that plagued their
endeavors (Levien & Maron, 1967, p. 715).
And these difficulties persist. As cloud computing platforms distribute storage, the pipes to move
data remain expensive and limited and storage remains problematic. Jonathan Sterne notes that "further
increases in computing power and bandwidth may be used for higher definition, but they will also be used
for more elaborate compression schemes . . . there will be no post-compression age" (2012, p. 231). In
other words, big data is neither new nor free of the technical challenges raised since the emergence of
supercomputers. Just as the Large Hadron Collider, the poster child for truly large data projects, has
turned to magnetic tape as its best data storage solution, big data contains the techniques, artifacts, and
challenges of older computational forms. Older concerns (technical, epistemological, and ethical) haunt
the domains of big data.
Big Data as Political, Economic, and Cultural
If we are to reject the claim that the big data moment has been precipitated by technology
alone, then we need to look more widely at the cultural, political, and economic making of big data: as a
science, a business, and, importantly, as a mythology (boyd & Crawford, 2012). This cultural mythology
can be seen in city billboards promoting big data solutions, at highly profitable big data conferences, and
in the many newspaper and magazine columns covering the advances brought about by big data science.
The very term "big data science" is itself a kind of mythological artifact, implying that the precepts and
methods of scientific research change as the data sets increase in size. Some big data fundamentalists
argue that at sufficient scale, data is enough; "statistical algorithms find patterns where science cannot"
(Anderson, 2008, para. 14), and thus big data represents the end of theory (Graham, 2012). But we
argue that big data is theory. It is an emerging Weltanschauung grounded across multiple domains in the
public and private sectors, one that is in need of deeper critical engagement.
The mythic power of big data is part of what unifies it as a concept and informs its legibility as a
set of tools. And this is, of course, not a novel claim. As Donna J. Haraway first wrote in 1983:
The boundary is very permeable between tool and myth, instrument and concept,
historical systems of social relations and historical anatomies of possible bodies,
including objects of knowledge. Indeed, myth and tool mutually constitute each other.
(1983, para. 7)
Haraway writes of this common move in the technological sciences, of "translating the world into
a problem of coding," where differences in the world are submitted to disassembly, reassembly,
investment, and exchange in what she describes as "the informatics of domination" (1991, pp. 302–303).
We can see how the process of coding the world has progressed in the multiple prehistories of big data:
the first instantiations of the U.S. census tracking human populations (Driscoll, 2012), the
quantification of climate shifts (Edwards, 2010), the rapid analysis and projection of financial data
(MacKenzie, 2006), and the complete capture of communications systems by intelligence agencies
(Landau, 2010).
This special section of the International Journal of Communication brings together critical
accounts of big data as theory, practice, archive, myth, and rhetorical move. The essays collected here
interrogate rather than accept the realities conjured through our political, economic, and cultural
imaginings of big data. From neoliberal economic logics shaping the deployment of big data to the cultural
precursors that underlie data mining techniques, the issue covers both the macro and micro contexts. We
have drawn together researchers from communication, anthropology, geography, information science,
sociology, and critical media studies to, among other things, examine the political and epistemological
ramifications of big data for a range of audiences. Articles in this collection also interrogate the ethics of
big data use and critically consider who gets access to big data and how access (or lack of it) matters to
the issues of class, race, gender, sexuality, and geography.
But, importantly, this is not a wholesale rejection of big data: Several of the authors presented
here use big data tools and techniques in their everyday work. By analyzing big data's applications,
methods, and assumptions, they aim to improve the way social and cultural research is done. The already
tired binary of big data (is it good or bad?) neglects a far more complex reality that is developing. There
is a multitude of different, sometimes completely opposed, disciplinary settings, techniques, and
practices that still assemble (albeit uncomfortably) under the banner of big data. Fields engaged in media
and communication research that draw on big data to address dilemmas or raise new questions push us to
carefully consider the ways in which the term and techniques are deployed. This is particularly necessary
given that the big data nomenclature has generated nationally funded multi-billion-dollar grant programs
and tenure-track jobs across academe; it is the megafauna of the academic landscape. The rapid and
widespread ascendancy of the concept attests to its significance and stickiness across multiple fieldsit
has become a thing, despite the ways in which the term is often at odds with itself semantically and
industrially. As Tom Boellstorff (2013) suggests, "there is no unitary phenomenon big data . . . yet the
impact of big data is real and worthy of sustained attention" (para. 2).
Ethical Challenges
Another way to consider the "why now?" question is to ask who and what is subjected to
analysis. For decades, the informatics of domination have been focused and tested on historically
marginalized groups. As Virginia Eubanks (2014) has shown, drawing on her ethnographic research of
Electronic Benefit Transfer card and food stamp use in the United States, poor and working-class
Americans "already live in the surveillance future" (para. 3). Thanks to the archive released by Edward
Snowden, it is now public knowledge that consentless big data gathering is out of its testing phase and
has been widely franchised to the mass populace. This has also demonstrated the extent of the erosion of
civil liberties and privacy. However, in the case of scientific inquiry, it also raises the question of how big
data tools should be used. How can data be gathered without people's knowledge or consent and still
meet the ethical obligation to treat people with justice, beneficence, and respect, as the Belmont Report
on human subjects research first outlined in 1978? Scientific research that involves drawing on what is
euphemistically known as "passively collected" big data must face difficult questions and develop new
ethical frameworks. This is particularly urgent given that the leading professional bodies for computing and
engineering, the ACM and IEEE, both have ethical guidelines that are almost two decades old.
Snowden's trove of documents exposed to the public that the e-mails, phone calls, text
messages, and social media activity of millions of people around the world are collected and stored, that
enormous cloud servers have been breached, and that both data and metadata have been fair game. But it
also revealed the driving economic imperatives: Big data's promise is economic efficiency, more
observation at less cost. Bankston and Soltani (2014) have shown in detail just how cheap mass
surveillance is compared to hiring police officers: just 6.5 cents per hour to monitor a person electronically
rather than $275 for a covert pursuit. The excitement about harnessing the promise of big data through
the widespread collection of disparate online transactions and interactions coincides with its cost efficiency
in targeting niche markets and providing oversight of populations.
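The scale of that cost differential is worth making explicit; a quick back-of-the-envelope calculation using the hourly figures cited above:

```python
# Hourly surveillance costs as cited from Bankston and Soltani (2014).
electronic_monitoring = 0.065  # dollars per hour (6.5 cents)
covert_pursuit = 275.00        # dollars per hour

ratio = covert_pursuit / electronic_monitoring
print(f"Electronic monitoring is roughly {ratio:,.0f}x cheaper per hour")
```

On these figures, electronic monitoring costs over four thousand times less per person-hour than covert physical pursuit, which makes the economic appeal of mass data collection concrete.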
In an informatics of domination that gathers all the data it can to unlock some presumed or as-yet-unknown value down the road, data generation and collection are equated with innovation and
scientific breakthroughs. As such, participation in the big data project (offering up the data we generate
through the social interactions that shape our everyday lives) becomes the responsibility of all good
citizens. To contribute one's data to the pool is to contribute to the advancement of science, innovation,
and learning. This rhetoric can be seen most clearly with regard to health data. To be concerned about
individual risk is equated with hindering progress; why be concerned about releasing data if it could help
others, in the aggregate? Of course, this fails to acknowledge the ways in which our data can reveal much
about us that we cannot know or intend, and can be used to discriminate against individuals and groups.
And how much trust should we have in the custodianship of data? The repositories of data are
characteristically unstable; data is leaky, and it escapes in unexpected ways, be it through errors, hacks,
or whistleblowing.
Big Data Literatures
There is a strong celebratory thread in the literature on big data: that more data will bring better
science, safer cities, and rapid innovation. One such example is The Human Face of Big Data (Smolan &
Erwitt, 2012), a collection of essays about the potential of large-scale data gathering to design
personalized drugs, predict divorce, and research Parkinson's and retinal disease. Even India's
controversial biometric ID card program, the Aadhaar card, is described in glowing terms, with no mention
of privacy or ethics concerns, and only a brief mention of the information security risks. Likewise, in Social
Physics, Pentland (2014) outlines his goal to gather the digital bread crumbs we all leave behind as we
move through the world (call records, credit card transactions, and GPS location fixes) (2014, p. 16) to
predict who is more likely to get diabetes and whether someone is "the sort of person who will pay back
loans" (p. 7). In his view, big data brings us closer to a probabilistic universe where human behavior can
be predicted from metadata, to build a society that is "better at avoiding market crashes, ethnic and
religious violence, political stalemates, widespread corruption, and dangerous concentrations of power" (p.
16). Similarly, Mayer-Schönberger and Cukier (2013) suggest that large data sets have the potential to
replace the exactitude of causality with the "good enough" of correlation.
What these arguments fail to fully consider, however, is that data sets, including predictive
data, may lead to new concentrations of power, and they are never methodologically removed from
human design and bias (Crawford, 2013). Big data continues to present blind spots and problems of
representativeness, precisely because it cannot account for those who participate in the social world in
ways that do not register as digital signals. It is big data's opacity to outsiders and subsequent claims to
veracity through volume that discursively neutralize the tendency to make errors, fail to account for
certain people and communities, or discriminate. The rhetoric of objectivity can be very seductive to public
policy makers traversing the complex world of social phenomena. In Will Davies' (2013) terms, data is
being "icily naturalized, with its institutional and methodological preconditions being marginalized from
discussion" (para. 7). Indeed, the celebratory promises of big data as "good enough" to produce
predictors of social behavior fundamentally ignore a key insight of social theory: Aggregated, individual
actions cannot, in and of themselves, illustrate the complicated dynamics that produce social interaction;
the whole of society is greater than the sum of its parts.
More critical and historical investigations are emerging that address how big data is being
understood, operationalized, and resisted across the fields of media, computer science, law, and
economics. Two collections in particular have addressed the issue of how big data is made. Raw Data Is an
Oxymoron, a collection edited by Lisa Gitelman (2013), takes up this question by examining the
imagination of data across various disciplines, eras, and media. These essays observe how data is
generated and shaped, with the very definition of data changing across time and media, from newspaper
clippings in the 1860s to the computational cloud. A First Monday special issue likewise argues that data
creation is a process that is extended in time and across spatial and institutional settings (Helles &
Jensen, 2013). This special section contributes to this growing critical conversation. These articles bring a
nuanced and grounded analysis to engage big data practices, tools, and rhetorics directly and ask how
they function, how they build interpretations, and how they could be different, more ethical, and more
historically aware. This collection identifies three threads of inquiry.
Andrejevic sees a "big data divide" between us and our data; not only are we rarely granted access to our
own data, we lack the capability to analyze it and make sense of it, particularly in the context of other
users. Andrejevic argues that it is not simply access to data sets but the technologies, infrastructure, and
expertise to analyze these data that reinforce power differentials between those who have the capacity to
make use of big data and those who are simply part of the sorting process. Andrejevic observes that the
users who opt in to data systems often feel powerless in regard to their participation. When users click "I
agree," it is often with a sense of coercion; to access essential technologies, relinquishing control over
their personal data is the price they must pay. In contrast to the purportedly democratizing effect of
widespread Internet access and use, Andrejevic argues that the big data paradigm offers a re-entrenchment of societal power differentials, with big data being used by an elite few to make decisions of
wide-ranging impact for the many.
Whereas Andrejevic argues that the promise of personalization has failed us individually, Nick
Couldry and Joseph Turow argue that personalization has failed us collectively, albeit in a different way. In
their analysis of the deep personalization of online content enabled by big data, Couldry and Turow
argue that big data practices have the potential to undermine the public sphere. When content
(journalistic or otherwise) is tailored to individuals based upon the needs of advertisers and content
providers, it fractures the reference points necessary for a shared political and social life and risks
eliminating the connective media necessary for an effective democracy. Couldry and Turow remind us
that the unexpected negative externalities that result from successful applications of big data analysis
have the potential to undermine fundamental societal structures more than poorly implemented big data
initiatives.
Epistemological Challenges and Research Provocations
If researchers are to intervene in the debates outlined above, we must collectively invest in an
explicit epistemological pluralism. This would mean scholars from a range of disciplines engaging in
dialogue about how data shapes understanding and productively questioning the rhetoric of objectivity and
claims to knowledge.
Cornelius Puschmann and Jean Burgess examine the metaphors used to describe big data across
various publications reporting on the business and management, technology, news, and
telecommunications sectors. They find that big data is explicated in two key ways: as a resource to be
consumed and as a natural force to be controlled. Both of these metaphors position big data as a reliable,
value-neutral source of information. Far from being innocuous framings that help explain a technical
and complicated concept, however, these metaphors, the authors argue, obscure the many ways that
data, big or otherwise, are socially constructed, consequently reifying the notion that big data is somehow
a source of objective truth.
Dawn Nafus and Jamie Sherman's essay on the data practices of the Quantified Self (QS)
community encourages us to think about the role that individual agency plays when it comes to resisting
dominant data logics. Several articles in this special section note how the process of data generation and
collection affects big data analyses: the conditions for what we can know are shaped by what data is
recognized, how that data is collected, and by whom. QSers engage in self-monitoring, but they collect
data to serve their own needs, often generating data and practices that confound the wishes of
institutional data collectors. The QS community provides examples of a subtle subversion of the dominant
data logic. Nafus and Sherman argue that even when data is being generated on the most intimate levels,
individuals are not necessarily willing participants in the big data project, and they complicate naturalistic
epistemologies grounded in a straightforward empiricism.
Finally, Geoffrey Bowker's epilogue brings the collection full circle, questioning how an archive's
size could ever serve as sufficient justification for certain beliefs or as a self-evident conveyor of truth
claims. He argues that different levels of interpretation and scope are necessary, because data of any size
do not operate in a social vacuum. Even though some databases are bigger than ever, they are still
structured in ways that privilege certain ontologies and obscure others.
Above all, we need new critical approaches to big data that begin with deep skepticism of its a
priori validity as a naturalized representation of the social world. We can make big data sets productive
archives for theory building if we reimagine what big data offers us. Combining separate, often disparate,
multiterabyte sets of information can reframe our understandings of people, institutions, and things.
Rather than invest in big data as an all-knowing prognosticator or a shortcut to ground truth, we need to
recognize and make plain its complexities and dimensionality as an emerging theory of knowledge.
References
Anderson, C. (2008, June 23). The end of theory. Wired, 16. Retrieved from
http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory
Bankston, K. S., & Soltani, A. (2014, January 9). Tiny constables and the cost of surveillance: Making
cents out of United States v. Jones. Yale Law Journal, 123. Retrieved from
http://yalelawjournal.org/forum/tiny-constables-and-the-cost-of-surveillance-making-cents-out-of-united-states-v-jones
Boellstorff, T. (2013). Making big data, in theory. First Monday, 18(10). Retrieved from
http://firstmonday.org/ojs/index.php/fm/article/view/4869/3750
boyd, d., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological,
and scholarly phenomenon. Information, Communication & Society, 15(5), 662–679.
Clifford, J., & Marcus, G. E. (Eds.). (1986). Writing culture: The poetics and politics of ethnography.
Berkeley, CA: University of California Press.
Cox, M., & Ellsworth, D. (1997, October). Application-controlled demand paging for out-of-core
visualization. In Proceedings of the 8th Conference on Visualization '97 (pp. 235–244). Los
Alamitos, CA: IEEE.
Crawford, K. (2013, April 1). The hidden biases in big data. Harvard Business Review. Retrieved from
http://blogs.hbr.org/2013/04/the-hidden-biases-in-big-data
Davies, W. (2013, June 12). Datas double life. Potlatch. Retrieved from
http://potlatch.typepad.com/weblog/2013/06/datas-double-life.html
Driscoll, K. (2012). From punched cards to big data: A social history of database populism.
communication+1, 1(1), 4.
Edwards, P. N. (2010). A vast machine: Computer models, climate data, and the politics of global
warming. Cambridge, MA: MIT Press.
Eubanks, V. (2014, January 15). Want to predict the future of surveillance? Ask poor communities. The
American Prospect. Retrieved from http://prospect.org/article/want-predict-future-surveillance-ask-poor-communities
Gitelman, L. (Ed.). (2013). Raw data is an oxymoron. Cambridge, MA: MIT Press.
Graham, M. (2012, March 9). Big data and the end of theory? The Guardian. Retrieved from
http://www.theguardian.com/news/datablog/2012/mar/09/big-data-theory
Haraway, D. J. (1983). The ironic dream of a common language for women in the integrated circuit:
Science, technology, and socialist feminism in the 1980s or a socialist feminist manifesto for
cyborgs. History of Consciousness Board, University of California at Santa Cruz. Retrieved from
http://homepages.herts.ac.uk/~comqcln/HarawayCyborg.html
Haraway, D. J. (1991). A cyborg manifesto: Science, technology and socialist feminism in the late
twentieth century. In D. J. Haraway, Simians, cyborgs, and women: The reinvention of
nature. London, UK: Free Association Books.
Helles, R., & Jensen, K. B. (2013). Introduction to the special issue: Making data, big data and
beyond. First Monday, 18(10). Retrieved from http://firstmonday.org/article/view/4860/3748
Landau, S. E. (2010). Surveillance or security? The risks posed by new wiretapping technologies.
Cambridge, MA: MIT Press.
Levien, R. E., & Maron, M. E. (1967). A computer system for inference execution and data
retrieval. Communications of the ACM, 10(11), 715–721.
MacKenzie, D. A. (2006). An engine, not a camera: How financial models shape markets. Cambridge, MA:
MIT Press.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work,
and think. New York, NY: Houghton Mifflin Harcourt.
Pentland, A. S. (2014). Social physics: How good ideas spread. The lessons from a new science. New
York, NY: Penguin Books.
Smolan, R., & Erwitt, J. (2012). The human face of big data. Chicago, IL: Against All Odds Productions.
Sterne, J. (2012). MP3: The meaning of a format. Durham, NC: Duke University Press.
Tape rescues big data. (2013, September 26). The Economist. Retrieved from
http://www.economist.com/blogs/babbage/2013/09/information
This article extends the notion of a big data divide to describe the asymmetric
relationship between those who collect, store, and mine large quantities of data, and those
whom data collection targets. It argues that this key distinction highlights differential
access to ways of thinking about and using data that potentially exacerbate power
imbalances in the digital era. Drawing on original survey and interview findings about
public attitudes toward collection and use of personal information, it maintains that the
inability to anticipate the potential uses of such data is a defining attribute of data-mining
processes, and thus of the forms of sorting and targeting that result from them.
Keywords: big data, data mining, privacy, digital divide, predictive analytics
This research was supported by the Australian Research Council's Discovery Projects funding scheme
(DP1092606).
Mark Andrejevic: [email protected]
Date submitted: 2013-04-06
Copyright © 2014 (Mark Andrejevic). Licensed under the Creative Commons Attribution Non-commercial
No Derivatives (by-nc-nd). Available at http://ijoc.org.
Of course, Google News and any number of aggregators and services are already hard at work
providing these kinds of services, without users needing to get involved or reclaim access to their data
trails. Berners-Lee's point, however, is that personal devices might usefully combine data from different
social-networking silos and other applications and devices, since these form a personal informational
nexus where all types of different data rub shoulders (a personal NSA, as it were):
There are no programmes that I can run on my computer which allow me to use all the
data in each of the social networking systems that I use plus all the data in my calendar
plus in my running map site, plus the data in my little fitness gadget and so on to really
provide an excellent support to me. (Katz, 2012, para. 4)
Berners-Lee is bemoaning a growing separation of people from their data that characterizes the
lives of active users of interactive devices and services: a form of data divide not simply between those
who generate the data and those who collect, store, and sort it, but also between the capabilities available
to those two groups. Berners-Lee challenges one aspect of that divide: If we generate data that is
potentially useful to us, he reasons, shouldn't we be able to access it and put it to use? Why not overcome
this separation between users and their data, and with it the separation between the different data silos
we generate on various devices and platforms?
Surely he has a point, but it raises a further one: Even if users had such access, what individuals
can do with their data in isolation differs strikingly from what various data collectors can do with this same
data in the broader context of everyone elses data. To take a familiar example, Berners-Lee mentions
customized news delivery as one possible benefit of self-data-mining: If a computer knows what its users
have read in the past, it might be able to predict which news stories will interest them in the future (this
is, of course, an echo of Negroponte's [1996] Daily Me, and perhaps also his digital butler). But online
news aggregators take into account not only ones own interest patterns (surely not formed in isolation)
but also those of everyone else about whom they collect data. This data trove enables them to engage in
various forms of collaborative filtering; that is, to consider what the other people who share one's
interests are also interested in.
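The collaborative-filtering logic described here can be sketched in a few lines. The user profiles and the similarity measure (Jaccard overlap) are illustrative assumptions for this sketch, not any particular aggregator's actual method.

```python
# A minimal sketch of collaborative filtering: recommend items favored by
# users whose interests overlap with the target user's. All data here is
# hypothetical; real systems use far richer models at far larger scale.

def jaccard(a, b):
    """Similarity between two interest sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b)

def recommend(target, profiles):
    """Score items the target has not seen by the similarity of users who have."""
    scores = {}
    for user, items in profiles.items():
        if user == target:
            continue
        sim = jaccard(profiles[target], items)
        for item in items - profiles[target]:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

profiles = {
    "alice": {"politics", "science", "tech"},
    "bob":   {"politics", "science", "sport"},
    "carol": {"cooking", "sport"},
}
suggestions = recommend("alice", profiles)
# "sport" ranks first: bob, who shares alice's interests, also follows it
```

The point the sketch makes concrete is the asymmetry discussed above: the recommendation for one user depends on everyone else's profiles, which is exactly the aggregate context that individual users, even with access to their own data, do not possess.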
Generalizing this principle from the perspective of data mining, it is potentially much more
powerful to situate individual behavior patterns within the context of broader social patterns than to rely
solely on the historical data for a particular individual. Put somewhat differently, allowing users access to
their own data does not fully address the discrepancies associated with the data divide: that is, differential
capacities for putting data to use. Even if users had access to their own data, they would not have the
pattern recognition or predictive capabilities of those who can mine aggregated databases. Moreover, even
if individuals were provided with everyone else's data (a purely hypothetical conditional), they would lack
the storage capacity and processing power to make sense of the data and put it to use. It follows that the
structural divide associated with the advent of new forms of data-driven sense making will be increasingly
apparent in the era of big data.
To characterize the differential ability to access and use huge amounts of data, this article
proposes the notion of a "big data divide" by first defining the term, then considering why such a divide
merits attention, and then exploring how this divide might relate to public concern about the collection
and use of personal information. The sense of powerlessness that individuals express about emerging
forms of data collection and data mining reflects both the relations of ownership and control that shape
access to communication and information resources, and growing awareness of just how little people know
about the ways in which their data might be turned back upon them. Although the following research will
focus exclusively on personal data (the type of data at the heart of current debates about regulation of
data collection online), the notion of a big data divide is meant to invoke the broader issue of access to
sense-making resources in the digital era, and the distinct ways of thinking about and using data available
to those with access to tremendous databases and the technology and processing power to put them to
use.
From a research perspective, boyd and Crawford (2011) have noted the divide between the "Big
Data rich" (companies and universities that can generate or purchase and store large datasets) and the
"Big Data poor" (those excluded from access to the data, expertise, and processing power), highlighting
the fact that a relatively small group with defined interests threatens to control the big data research
agenda. This article extends the notion of a big data divide to incorporate a distinction between ways of
thinking about data and putting it to use. It argues that big data mining privileges correlation and
prediction over explanation and comprehension in ways that undermine the democratizing/empowering
promise of digital media. Despite the rhetoric of personalization associated with data mining, it yields
predictions that are probabilistic in character, privileging decision-making at the aggregate level (over
time). Moreover, it ushers in an era of emergent social sorting, the ability to discern un-anticipatable but
persistent patterns that can be used to make decisions that influence the life chances of individuals and
groups. In online tracking and other types of digital-era data surveillance, the logic of data mining, which
proposes to reveal unanticipated, unpredictable patterns in the data, renders notions such as informed
consent largely meaningless. Data miners' claims, discussed in more detail in the following sections,
reveal that big data holds promise for much more than targeted advertising: It is about finding new ways
to use data to make predictions, and thus decisions, about everything from health care to policing, urban
planning, financial planning, job screening, and educational admissions. At a deeper level, the big data
paradigm challenges the empowering promise of the Internet by proposing the superiority of a post-explanatory pragmatics (available only to the few) to the forms of comprehension that digital media were
supposed to make more accessible to the many. None of these concerns fits comfortably within the
standard privacy-oriented framing of issues related to the collection and use of personal information.
A Big Data Divide
In the sense of standing for more information than any individual human or group of humans can
comprehend, the notion of big data has existed since the dawn of consciousness. The world and its
universe are, to anything or anyone with senses, incomprehensibly big data. The contemporary usage is
distinct, however, in that it marks the emergence of the prospect of making sense of an incomprehensibly
large trove of recorded data: the promise of being able to put it to meaningful use even though no
individual or group of individuals can comprehend it. More prosaically, big data denotes the moment when
automated forms of pattern recognition known as data analytics can catch up with automated forms of
data collection and storage. Such data analytics are distinct from simple searching and querying of large
data sources, a practice with a much longer legacy. Thus, for the purposes of this article, the big data
moment and the advent of data-mining techniques go hand in hand. The magnitude of what counts as big
data, then, will likely continue to increase to keep pace with both data storage and data processing
capacities. IBM, which is investing heavily in data mining and predictive analytics, notes that big data is
not just about size but also about the speed of data generation and processing and the heterogeneity of
data that can be dumped into combined databases. It describes these dimensions in terms of the three
"Vs": volume, velocity, and variety (IBM, 2012, para. 2).
Big-data mining is omnivorous, in part because it has embarked on the project of discerning
unexpected, unanticipated correlations. As IBM puts it, "Big data is any type of data – structured and
unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights
are found when analyzing these data types together" (IBM, 2012, para. 9). Data can be collected, sorted,
and correlated on a hitherto unprecedented scale that promises to generate useful patterns far beyond the
human mind's ability to detect or even explain. As data-mining consultant Colleen McCue (2007) puts it,
"With data mining we can perform exhaustive searches of very large databases using automated methods,
searching well beyond the capacity of human analysts or even a team of analysts" (p. 23). In short, data
mining promises to generate patterns of actionable information that outstrip the reach of the unaided
human brain. In his book Too Big to Know, David Weinberger (2011) describes this new knowledge as
requiring "not just giant computers but a network to connect them, to feed them, and to make their work
accessible. It exists at the network level, not in the heads of individual human beings" (p. 130).
Such observations trace the emerging contours of a big data divide insofar as putting the data
to use requires access to and control over costly technological infrastructures, expensive data sets, and
the software, processing power, and expertise for analyzing them. If, as Weinberger puts it, in the era of
big data "the smartest person in the room is the room," then much depends on who owns and operates
the room. The forms of knowing associated with big data mining are available only to those with access
to the machines, the databases, and the algorithms. Assuming for the sake of argument that the big data
prognosticators (e.g., Mayer-Schönberger & Cukier, 2012) are correct, the era of big data, characterized
by the ability to make use of databases too large for any individual or group of individuals to
comprehend, ushers in powerful new capabilities for decision making and prediction unavailable to those
without access to the databases, storage, and processing power. In manifold spheres of social practice,
then, those with access to databases, processing power, and data-mining expertise will find themselves
advantageously positioned compared to those without such access. But the divide at issue is not simply
between what boyd and Crawford (2011) describe as database "haves" and "have-nots"; it is also about
asymmetric sorting processes and different ways of thinking about how data relate to knowledge and its
application. The following sections consider each of these issues in turn.
The Big Data Sort
For those with database access, the ability to capture and mine tremendous amounts of data
considerably enhances and alters possibilities for engaging in what David Lyon (2002), building on the
work of Oscar Gandy (1993), has called "surveillance as social sorting": a means of "verifying identities
but also of assessing risks and assigning worth" (p. i). Those with access to data, expertise, and
processing power are positioned to engage in increasingly powerful, sophisticated, and opaque forms of
sorting that can be powerful means of "creating and reinforcing long-term [or newly generated] social
differences" (Lyon, 2002, p. i). The very notion of a "panoptic sort" is premised on a power imbalance
between those positioned to make decisions that affect the life chances of individuals (in Gandy's original
work, businesses as both employers and marketers) and those subjected to the sorting process.
Subsequently reflecting on the notion of the panoptic sort, Gandy observed that he had come to
understand that these decisions "are not really based on an assessment of who or what people are, but on
what they will do in the future. The panoptic sort is not only a discriminatory technology, but it is one that
depends upon an actuarial assumption" (Gandy, 2005, p. 2). This observation remains as salient as ever
in the era of data mining and predictive analytics, which, while deploying the rhetoric of personalization,
also operate at a probabilistic level. In this regard, the assertion that data mining augurs a future in
which "predictions seem so accurate that people can be arrested for crimes before they are committed" is
misleading (Kakutani, 2013, para. 14). Predictive analytics is not, despite the hype, a crystal ball. As one
commentator put it,
When you are doing this kind of analytics, which is called big data, you are looking at
hundreds of thousands to millions of people, and you are converging against the mean. I
can't tell you what one shopper is going to do, but I can tell you with 90 percent
accuracy what one shopper is going to do if he or she looks exactly like one million other
shoppers. (Nolan, 2012, p. 15)
But the confusion between fortune telling and forecasting is consequential, for decisions made at a
probabilistic, aggregate level produce effects felt at an individual level: the profile and the person
intersect. To someone who has been denied health care, employment, or credit, the difference between a
probabilistic prediction and a certainty is, for all practical purposes, immaterial.
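The distinction Nolan draws between aggregate accuracy and individual prediction can be made concrete with a small simulation (hypothetical numbers, Python standard library only): if 90% of a "lookalike" segment buys, then predicting "buys" for every member is right about nine times in ten overall, while remaining strictly probabilistic about any single person.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical "lookalike" segment: each member independently buys with
# probability 0.9. The model knows only this segment-level base rate.
SEGMENT_RATE = 0.9
shoppers = [random.random() < SEGMENT_RATE for _ in range(100_000)]

# The model's only move is to predict "buys" for everyone in the segment.
aggregate_accuracy = sum(shoppers) / len(shoppers)
print(f"aggregate accuracy: {aggregate_accuracy:.3f}")  # converges toward 0.9

# For any single shopper the very same prediction is simply right or
# wrong -- the probabilistic profile and the person are not the same thing.
print(f"shopper #1 actually bought: {shoppers[0]}")
```

The aggregate figure "converges against the mean" exactly as Nolan describes; nothing in it tells the decision maker, or the shopper, what this shopper will in fact do.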
Social sorting has a long history but comes into its own as a form of automated calculus, as
Gandy (1993) suggests, in the era of modern bureaucratic rationality. Thus, it is tempting to note the
historical continuity between big data-driven forms of social sorting and earlier forms of data-based
decision making, from Taylorist forms of scientific management to mid-20th-century forms of redlining
in the banking, housing, and insurance industries. Raley (2013), for example, has noted that in an early
account of computer-enabled surveillance, David Lyon (1994) suggests that the difference made by
information technologies is "one of degree, not of kind," that they "simply make more efficient, more
widespread, and simultaneously less visible many processes that already occur" (Raley, 2013, p. 124).
However, a qualitative shift in monitoring-based social sorting results from the emergent character of
new data-mining processes, which now can generate un-anticipatable and un-intuitable predictive patterns
(e.g., Chakrabarti, 2009). That is, their systemic, structural opacity creates a divide between the kinds of
useful knowledge available to those with and without access to the database.
In the following sections, I argue that emerging awareness of forms of asymmetrical power
associated with both the tremendous accumulation of data and new techniques for putting it to work
provides a possible explanation for public concern about the collection and use of personal data. Survey
after survey, including my own (discussed below), has revealed a high level of concern about the
commercial collection and use of personal information online. For example, a 2012 Pew study in the
United States revealed that a majority (65%) of people who use search engines did not approve of the use
of behavioral data to customize search results, and that more than two-thirds of all Internet users (68%)
did not approve of targeted advertising based on behavioral tracking (Purcell, Brenner, & Rainie, 2012).
Another nationwide U.S. survey found that 66% of respondents opposed ad targeting based on tracking
users activities (Turow, King, Hoofnagle, Bleakley, & Hennessy, 2009). In a U.S. study of public reaction
to proposed "do not track" legislation, 60% of respondents said they would opt out of online tracking,
given the choice. My own nationwide survey in Australia revealed strong support for do-not-track
legislation (95% in favor). Well over half of the respondents (56%) opposed customized advertising based
on tracking, and 59% felt Web sites collect too much information about users. People's continuing use,
despite their stated concerns, of services that collect and use their personal information is framed
sometimes as a paradox (e.g., Norberg, Horne, & Horne, 2007), and sometimes as evidence that people
do not really care as much as the research indicates (e.g., Oppmann, 2010). Based on early results of
qualitative research on privacy concerns, this article offers an alternative explanation: that people operate
within structured power relations that they dislike but feel powerless to contest. On a somewhat more
speculative level, I suggest that there is an emerging understanding on the part of users that the
asymmetry and opacity of a big data divide augurs an era of powerful but undetectable and un-anticipatable forms of data mining, contributing to their concern about potential downsides of the digital
surveillance economy. This asymmetry runs deep, insofar as it privileges a form of knowledge available
only to those with access to costly resources and technologies over the types of knowledge and
information access that underwrite the empowering and democratizing promise of the Internet.2
Theorys End?
In a much discussed Wired magazine article, Chris Anderson (2008) claimed that the era of big
data (which he called "the petabyte age") portended "the end of theory": that is, the coming irrelevance
of model-based understandings of the world. As he put it,
This is a world where massive amounts of data and applied mathematics replace every
other tool that might be brought to bear. Out with every theory of human behavior, from
linguistics to sociology. Forget taxonomy, ontology, and psychology . . . With enough
data, the numbers speak for themselves. (Anderson, 2008, para. 8)
This sweeping, manifesto-like claim invites qualification: Surely, statistical models remain necessary for
developing algorithms, and other sorts of models are needed to shape the use of the information
generated by increasingly loquacious data. Data scientists have emphasized the importance of domain-specific expertise in assessing the data that gets fed into mining algorithms and shaping the questions
that might be put to the data. As McCue (2007) stated in her primer on data mining and predictive
analytics, domain expertise "is used to evaluate the inputs, guide the process, and evaluate the end
products within the context of value and validity" (p. 22). Indeed, the term "domain expert" emerges
against the background of data mining's convergent character to redress both the fact that it is, in a
sense, content-agnostic, and the resulting tendency to treat data analysis as a strictly technical exercise
(Berry & Linoff, 2001, p. 44).
2
For a good overview of the celebratory, democratizing rhetoric surrounding the reception of the Internet,
Moreover, the claim that "the numbers speak for themselves" overlooks the broader context of the
conversation around them (boyd and Crawford met this claim with a resounding "no" [2011, p. 4]).
Patterns may emerge from the data, but their relevance or usefulness depends heavily on the questions
they address, which in turn depend on who is asking. One thing data cannot do is set the agenda.
Anderson's version of big data is an instrumental one abstracted from broader issues of values and goals
(questions of social justice, democratic commitments, etc.), the very issues that the existing bodies of
theory sidelined by Anderson are needed to address. Anderson's article is simply a quick-hit magazine
piece, but its failure to consider the larger context in which large, for-profit entities (and even
governments, which, it turns out, piggyback on these databases) collect, own, and control the data is
telling nonetheless. More bluntly, sidelining the broader question of context and values effectively exempts
the question of the best uses of the data from the reach of theory and models, leaving it to the
imperatives of those with access to the databases. This is the real import of the "end of theory" claim.
With these qualifications in mind, the substance of Andersons claim is more narrowly
interpretable: data mining has the ability to generate actionable information that is both unpredictable and
inexplicable (neither needing nor generating an underlying explanatory model). For example, the era of
data mining and micro-targeting has renewed the salience of a bit of political wisdom discovered early in
the 1970s by Republican political consultants in the United States: "Mercury owners were far more likely
to vote Republican than owners of any other kind of automobile" (Gertner, 2004, para. 12). As one
consultant put it, "We never had the money or the technology to make anything of it . . . but of course
they do now" (ibid.).
This kind of inductive correlation, which is relatively easy to generate through data mining,
provides predictive power and actionable information but little in the way of explanation. Meanwhile, those
interested in using the information for electioneering purposes do not particularly care about any
underlying explanation, should there be one. As Anderson (2008, para. 8) pointed out, "Who knows why
people do what they do? The point is they do it, and we can track and measure it with unprecedented
fidelity."
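The Mercury anecdote is a textbook instance of what data miners call "lift": the ratio of an outcome's rate within a group to its rate overall. A minimal sketch, using invented counts rather than Gertner's actual figures, shows how little machinery such an actionable but explanation-free correlation requires:

```python
# Hypothetical voter-file counts (invented for illustration; these are
# not Gertner's actual figures): car make cross-tabulated with vote.
rows = [
    # (car_make, republican_votes, total_owners)
    ("mercury", 720, 1000),
    ("other",   450, 1000),
]

overall_rate = sum(r for _, r, _ in rows) / sum(n for _, _, n in rows)

# "Lift": how much more common the outcome is among owners of a given
# make than in the sample overall. No causal story is offered or needed;
# the pattern itself is the actionable product.
for make, r, n in rows:
    lift = (r / n) / overall_rate
    print(f"{make}: P(Republican) = {r / n:.2f}, lift = {lift:.2f}")
```

A lift above 1 flags the group as a target for outreach; whether the association reflects geography, income, marketing, or sheer coincidence never enters the calculation.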
A defining attribute of this kind of knowledge is the replacement of explanation and causation
with correlation and prediction. What is known is not an underlying cause or explanation but rather a set
of probabilistic predictions. Data mining promises to unearth increasingly unpredictable (in the sense of
not being readily anticipatable) and otherwise indiscernible patterns by sorting through much larger data
setsindeed, the goal of data mining is to detect patterns that are not intuitively available to the unaided
human eye or mind. That is, the goal is, by definition, to extract non-predictable patterns that emerge
only via automated processing of data sets that are too large to make sense of otherwise. As one data-mining textbook observed, "as the world grows in complexity, overwhelming us with the data it generates,
data mining becomes our only hope for elucidating the patterns that underlie it" (Chakrabarti, 2009, p.
32). Perhaps unsurprisingly, considering commercial databases' central role in its development, the goals
of data mining are often, although not exclusively, portrayed in terms of competitive advantage. Data
mining is defined as "the process of discovering patterns in data. The process must be automatic or (more
usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage,
usually an economic advantage" (Chakrabarti, 2009, p. 27). But numerous other types of advantages are
conceivable. McCue's (2007) textbook on predictive policing frames the goal in terms of national security
and military advantage: "If knowledge is power, then foreknowledge [via predictive analytics] can be seen
as battlespace dominance or supremacy" (p. 48). MIT's big data guru Alex Pentland, who coined the term
"reality mining" to describe the breadth and depth of new forms of data capture, anticipates that insights
gleaned from the database will help create a more healthy, secure, and efficient world for all:
For society, the hope is that we can use this new in-depth understanding of individual
behaviour to increase the efficiency and responsiveness of industries and governments.
For individuals, the attraction is the possibility of a world where everything is arranged for
your convenience: your health checkup is magically scheduled just as you begin to get
sick, the bus comes just as you get to the bus stop, and there is never a line of waiting
people at city hall. (Pentland, 2009, p. 79)
Other benefits could involve new forms of transparency that make various kinds of public records available
so as to hold public officials and private entities more accountable.
But even these salutary scenarios belie the leveling promise of networked digital technology.
The era of big data mining concentrates a particular technique for generating actionable information (to be
used for good or ill) in only a few hands, for the specific purpose of gaining some kind of advantage.3
Tellingly, it posits a form of knowing that allegedly renders obsolete or outdated the very model of
Internet empowerment that was supposed to help hold entrenched forms of power accountable by
increasing access to forms of knowledge that allowed people to understand the world around them.4 This
is the thrust of Andersons account of the end of theory: that understanding the world through the
careful, judicious, and informed study of available information is, for a growing range of applications,
obsolete in the petabyte era, which promises to unearth powerfully useful patterns from bodies of
information that are too large for a single person or group of people to make sense of. At the very
moment that the new technology enhances access to traditional forms of understanding and evidence,
they are treated as ostensibly outdated.
Even if Anderson is overstating the case and understanding remains an important aspect of
knowledge acquisition in the digital era, the point remains: The few will have access to useful forms of
knowledge that provide an advantage of some kind and that are not just unavailable to the vast majority
but incomprehensible, in the sense described by Weinberger (2011). This knowledge is unpredictable and
inexplicable in the conventional sense (as in the Mercury example: a correlation without an underlying
3
Big data should not be understood as a static concept, for as more people gain access to data-mining
technology, still bigger data will remain beyond their reach, available only to those with the resources to
support the latest technology and the largest databases.
4
For a discussion of the promise of Internet empowerment, see Andrejevic (2007, pp. 15–21).
explanation) and therefore opaque to those without access to the database. Thus, individual users have no
way to anticipate fully how information about them might prove salient for particular forms of decision
making, including, for example, whether they might be considered a security risk, a good or bad job
prospect, a credit risk, or more or less likely to drop out of school. Consider, for instance, the finding that
people who fill out online job applications using browsers that "did not come with the computer . . . but
had to be deliberately installed (like Firefox or Google's Chrome)" perform better and change jobs less
often ("Robot Recruiters," 2013, para. 2). The finding is unexplained and unlikely to be anticipated by the
applicants themselves, but it can significantly affect their lives nevertheless. As this example suggests, the
forms of social sorting associated with big data mining will range far beyond the marketing realm, feeding
into the decision-making processes of those with access to the information it provides and thereby
allowing them to affect the life chances of others in increasingly opaque but significant ways.
Whereas it may still be possible to intuitively grasp the link between, for example, a particular
brand of car and a political preference, the promise of data mining is to unearth correlations beyond the
realm of such imagining. Reverse engineering an algorithmic determination can require as much expertise
as generating it in the first place, and the results may have no direct explanatory power. When correlation
displaces causality or explanation, the goal is to accumulate as comprehensive and varied a database as
possible to generate truly surprising, non-intuitive results. Perhaps a particular combination of eating
habits, weather patterns, and geographic location correlates with a tendency to perform poorly in a
particular job or susceptibility to a chronic illness that threatens employability. There may not be any
underlying explanation beyond the pattern itself.
The basis for the kind of sorting envisioned via big data mining is likely to become increasingly
obscure in direct proportion to the size and scope of the available data and the sophistication of the
techniques used to mine it. At a recent meeting of the Organisation for Economic Co-operation and
Development, one participant observed that data mining entails the loss of "a degree of transparency in
why computers make the decisions they do" (Cukier, 2013, para. 6). According to the participant, who is
CEO of a data-mining company:
There are machines that learn, that are able to make connections that are much, much
finer than you can see and they can calibrate connections between tons and tons of
different facets of information, so that there is no way you as a human can understand
fully what is going on there. (J. Haesler, personal communication, February 26, 2013)
To note these characteristics of data mining is not to discount the potential benefits of its anticipated
benevolent uses. Yet the shadow of rationalization betokens asymmetrical control: a world in which people
are sorted at important life moments according to genetic, demographic, geo-locational, and previously
unanticipated types of data in ways that remain opaque and inaccessible to those who are affected. In
some instances, this is surely desirable: when, for example, a medical intervention is triggered just in
time to avoid more severe complications. At the same time, it is easy to imagine ways in which this type
of pre-emptive modelling, what William Bogard (1996, p. 1) has called "the simulation of surveillance,"
can be abused. Imagine, for example, a world in which private health insurers mine client data in an
attempt to cancel coverage just in time to avoid having to cover major medical expenses.
5
These survey findings are based on a national telephone survey conducted with N = 1,106 adults across
Australia between November 17 and December 14, 2011. Managed by the Social Research Centre in
Melbourne, the project sourced respondents through random-digit phone number generation for landlines
and mobile phones. The final sample consisted of 642 surveys taken via landline numbers and 464 taken
via mobile numbers. Reported data were proportionally weighted to adjust for design (chance of
selection), contact opportunities (mobile only, landline, or both), and demographics (gender, age,
education, and state). A complete summary of the findings and methodology is available online at
www.cccs.uq.edu.au/personal-information-project. The survey was followed up by an ongoing series of
interviews and focus group discussions. As of this writing, 27 structured interviews were conducted at
three sites across Australia (Melbourne, Sydney, and Brisbane). Recruited randomly in public spaces for
30- to 45-minute discussions, respondents were screened to include only experienced Internet users. The
preliminary interview sample skews young and female, consisting of 19 female respondents and 8 male
respondents, all between the ages of 19 and 37. As the project develops, respondents will be selected to
counter this skew. Focus group participants were similarly recruited in public spaces at the three research
sites and received a $20 iTunes gift card to participate in a 50-minute group discussion. A similar skew
applies to the focus group participants: 16 women and 6 men, ages 20–31. The focus group structure was
tested on students in an undergraduate seminar, and some of their comments were included.
The survey revealed a high level of concern about the collection and use of personal information: 59% of
respondents said websites collect too much information about people.6 The results also revealed a very
high level of support for stricter controls on information collection, including a do-not-track option (92%
support), a requirement to delete personal data upon request (96% support), and real-time notification of
tracking (95% support).7 Well over half of the respondents (56%) said they opposed customized
advertising based on tracking. The survey results also indicated that people are palpably aware that they
know little about how their information is used: 73% of respondents said they needed to know more
about the ways websites collect and use their information.8
These findings represent a particular type of big data divide, not between researchers with
access to the data and those without, but between sorters and sortees: that is, not between those who
comprehend the correlations and those who do not, but between those who are able to extract and use
un-anticipatable and inexplicable (as described above) findings and those who find their lives affected by
the resulting decisions. This formulation can aid consideration of the ways that the post-survey findings
from the follow-up focus groups challenge the dominant framing of issues in contemporary discussions of
privacy. One repeatedly mobilized frame is perhaps best summed up by Eric Schmidt's notorious
observation: "If you have something that you don't want anyone to know, maybe you shouldn't be doing
it in the first place" (and his subsequent, related assurance that "if you don't have anything to hide, you
have nothing to fear") (Bradley, 2012, para. 3; Google CEO on Privacy, 2010, para. 1). Gmail's role in
6
Actual survey question: Thinking now about the personal information gathered by ONLINE companies
about their consumers, would you say they gather too much, about the right amount or not enough
information?
7
Survey questions: Do you think:
1. There should be a law that requires websites and advertising companies to delete all stored information about an individual, if requested to do so?
2. There should be a law requiring Web sites and applications to provide a do-not-track option that would prevent them from gathering information about people?
3. There should be a law requiring companies to notify people at the time when they collect data about them online?
8
Survey question: How would you describe your understanding of the ways in which companies collect
and use the information they gather about people online? Do you feel that you already know as much as
you need to know about what companies do in this regard or need to know more about what companies
do in this regard?
the downfall of U.S. General Petraeus lent these remarks a certain salience, but they do not reflect the
concerns of most respondents, who emphasize that whereas much of the information they share (and that
is collected about them) is mundane, they still dislike being compelled to share it.
One focus group participant, for example, used just one word in response to concerns about the
collection and use of personal information: "powerless." Several others in the seven-person discussion
group indicated they had also written down "powerless" in their discussion notes. Another participant
chimed in,
You just feel out of control of what people can know about you. It reinforces what the
world has come to. I know that in general you share a lot more than you used to, we're
used to that. But then I still feel powerless. (male, 21)
The focus group participants repeatedly invoked a feeling of asymmetry that paralleled this sense of
powerlessness. As one respondent maintained in a conversation touching on e-mail and social networking:
"It's not fair, it's not transparent. It's funny because Facebook is supposed to be all about transparency,
and they're the ones who aren't transparent at all" (female, 31). Another respondent explained how this
sense of powerlessness influenced her decision not to read privacy policies:
I just click agree, because what else can I do? I think that frustration sometimes just
translates into: I won't even think about it, because what can I do? It [Facebook]
becomes part of how you connect with people. It's really useful for your career, for your
choices in life. It doesn't mean you can't live without it, but living with it becomes
important. (female, 29)
Most respondents expressed concern and frustration with the online collection of information
about them, but a few said they were unconcerned because there was nothing they could do about it. As
one focus group participant put it,
I don't see it as a threat . . . probably because I don't know much about it. . . . I can't
see it affecting me in my everyday life but if you tell me about online privacy . . . then
I'll be thinking about it all the time. I'm better off not knowing about it in the first place.
(male, 22)
Significantly, even respondents who expressed concern about data collection were vague about actual,
perceived, or anticipated harm. When pressed on the concrete content of their concern, respondents
tended to fall back, not particularly confidently, on a familiar litany of well-covered privacy concerns: the
threat of identity theft or fraud and distaste for data-driven target marketing, which some equated with a
limiting form of stereotyping. As one respondent put it,
It kind of pushes you and says who you are and what you'd like. . . . At the end of the
day you do have your right to choose, but this kind of enforces an idea of what you
should be choosing and limits what it is you can choose from. . . . You either work within
that stereotype or they will create another stereotype for you. (female, 25)
Overall, concern about actual harms came across less vociferously than did frustration over a
sense of powerlessness in the face of increasingly sophisticated and comprehensive forms of data
collection and mining. Focus group participants generally agreed with responses emphasizing that this
sense of powerlessness extended to their lack of knowledge over how personal data might be used. As
one respondent admitted, "We really don't know where things collected about us go; we don't understand
how they interact in such a complex environment" (female, 22). Interview respondents and focus group
participants alike noted the seemingly endless appetite for personal data: "It's not just what you want,
it's where you are, what you do. It's everything. You're not free any more. You're just a slave of these
companies" (male, 22). This may come across as hyperbolic, but nonetheless noteworthy is the stark
contrast between this response and the rhetoric of freedom, empowerment and convenience that has long
underpinned the promotion of the online economy. The contrast highlights the challenge posed by the
power asymmetries ushered in by big data mining.
Dimensions of the Divide
This article's analysis suggests that the sense of powerlessness expressed by the focus group
respondents operates in at least two dimensions: that of ownership and control over information and
communication resources, and that of different approaches to knowledge-based decision making. People
are palpably aware that powerful commercial interests shape the terms of access that extract information
from them: they must choose either to accept the terms on offer or to go without resources that in many
ways are treated as utilities of increasing importance in their personal and professional lives. However
(and this is an interpretive, speculative claim), the very vagueness (but vociferousness) of their concerns
about information collection may reflect the structural gap in the big data divide: the fact that users of big
data rely on the unanticipatable and un-intuitable character of their findings. This vagueness, then, is not
necessarily an artifact of laziness or ignorance due to users' failure to educate themselves about the
technologies they use (or to read legalistic, vague privacy policies) but may be a defining characteristic of
the data collection strategies to which they are subjected. People can hardly be expected to imagine that,
for example, their use of a particular browser might render them more or less desirable to employers, or
to envision all the possible patterns generated by the complex interplay of thousands of variables about
millions of people, patterns that data-mining strategies have explicitly relegated to the realm of too big to
know or predict. As one respondent put it, "you end up accepting having no privacy without knowing the
consequences" (male, 32).
If, as Helen Nissenbaum (2009) has compellingly argued, privacy is contextual (because of
established expectations associated with particular information-collection contexts), then the big data era
challenges people to develop contextual norms for the use of data whose uses can be radically,
unpredictably decontextualized. Thanks to the proliferation of monitoring technologies (license plate
readers, smart cameras, drones, RFID scanners, audio sensors, etc.), data scraping continues to extend
its reach both online and off, so fewer places and activities are likely to be exempt from the logic of the
big data divide, whereby people are separated from their data and excluded from the process of putting it
to use. Overcoming the digital divide means exacerbating the big data divide. Greater access to and
facility in the use of smartphones and networked laptops, tablets, and computers of one kind or another
means more data to store, sort, and mine. More comprehensive forms of data mining promise to serve a
growing variety of decision-making, forecasting, and sorting operations. Although many of the applications
mentioned here are only in their infancy, the pace of change urges the individual to anticipate the social,
cultural, and political consequences now. Given the impossibility of adjusting expectations to anticipate
correlations that are by definition unpredictable, people face the daunting prospect of finding ways to limit
the reach and opacity of emerging forms of social sorting and discrimination. This is the challenge of the
big data era.
References
Anderson, C. (2008, June 23). The end of theory: The data deluge makes the scientific method
obsolete. Wired Magazine, 16(7). Retrieved from
http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
Andrejevic, M. (2007). iSpy: Surveillance and power in the interactive era. Lawrence: University Press of
Kansas.
Berry, M., & Linoff, G. (2001). Mastering data mining: The art and science of customer relationship
management. Hoboken, NJ: John Wiley & Sons.
Bogard, W. (1996). The simulation of surveillance: Hypercontrol in telematic societies. New York, NY:
Cambridge University Press.
boyd, d., & Crawford, K. (2011, September). Six provocations for big data. Presentation at A Decade in
Internet Time: Symposium on the Dynamics of the Internet and Society, Oxford Internet
Institute, Oxford University, Oxford, UK. Available at SSRN http://ssrn.com/abstract=1926431
or http://dx.doi.org/10.2139/ssrn.1926431
Bradley, T. (2012, March 24). Hey Employers: My Facebook Password Is None of Your Business. PCWorld.
Retrieved from
http://www.pcworld.com/article/252514/hey_employers_my_facebook_password_is_none_of_yo
ur_business.html
Byers, A. (2013, February 7). Microsoft hits Google email privacy. Politico. Retrieved from
http://www.politico.com/story/2013/02/microsoft-renews-google-attack-on-email-privacy-87302.html#ixzz2LW6dnEV4
Chakrabarti, S. (2009). Data mining: Know it all. Burlington, MA: Morgan Kaufmann.
Cukier, K. N. (2013, February 18). The thing, and not the thing. The Economist. Retrieved from
http://www.economist.com/blogs/graphicdetail/2013/02/elusive-big-data
Gandy, O. H., Jr. (1993). The panoptic sort: A political economy of personal information. Critical studies in
communication and in the cultural industries. Boulder, CO: Westview Press.
Gandy, O. H., Jr. (2005, October). If it weren't for bad luck. 14th Annual Walter and Lee Annenberg
Distinguished Lecture. Annenberg School for Communication, University of Pennsylvania,
Philadelphia, PA. Retrieved from
http://www.asc.upenn.edu/usr/ogandy/Annenberg%20Lecture.pdf
Gates, B. (1995). The road ahead. New York, NY: Penguin Books.
Gertner, J. (2004, February 15). The very, very personal is the political. The New York Times Magazine.
Retrieved from http://www.nytimes.com/2004/02/15/magazine/15VOTERS.html?pagewanted=all
Google CEO on privacy (VIDEO): If you have something you don't want anyone to know, maybe you
shouldn't be doing it. (2010, March 10). The Huffington Post. Retrieved from
http://www.huffingtonpost.com/2009/12/07/google-ceo-on-privacy-if_n_383105.html
IBM. (2012). Bringing big data to the enterprise. Retrieved from http://www-01.ibm.com/software/in/data/bigdata
Improve health care: Win $3 million. (2012). Heritage Provider Network: Health Prize. Retrieved from
http://www.heritagehealthprize.com/c/hhp
Kakutani, M. (2013, June 10). Watched by the Web: Surveillance is reborn. The New York Times.
Retrieved from http://www.nytimes.com/2013/06/11/books/big-data-by-viktor-mayer-schonberger-and-kenneth-cukier.html
Katz, I. (2012, April 18). Tim Berners-Lee: Demand your data from Google and Facebook. The Guardian
[London]. Retrieved from http://www.guardian.co.uk/technology/2012/apr/18/tim-berners-lee-google-facebook
Lyon, D. (1994). The electronic eye: The rise of surveillance society. Minneapolis: University of Minnesota
Press.
Lyon, D. (Ed.). (2002). Surveillance as social sorting: Privacy, risk and automated discrimination. New
York, NY: Routledge.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work,
and think. Boston, MA and New York, NY: Eamon Dolan/Houghton Mifflin Harcourt.
McCue, C. (2007). Data mining and predictive analysis: Intelligence gathering and crime analysis. New
York, NY: Butterworth-Heinemann.
Mosco, V. (2004). The digital sublime: Myth, power, and cyberspace. Cambridge, MA: MIT Press.
Negroponte, N. (1996). Being digital. New York, NY: Vintage.
Nissenbaum, H. (2009). Privacy in context: Technology, policy, and the integrity of social life. Stanford,
CA: Stanford Law Books.
Nolan, R. (2012, February 21). Behind the cover story: How much does Target know? The New York Times
Magazine. Retrieved from http://6thfloor.blogs.nytimes.com/2012/02/21/behind-the-cover-story-how-much-does-target-know
Norberg, P. A., Horne, D. R., & Horne, D. A. (2007). The privacy paradox: Personal information disclosure
intentions versus behaviors. Journal of Consumer Affairs, 41(1), 100–126.
Oppmann, P. (2010, April 14). In a digital world we trade privacy for convenience. CNN.com. Retrieved
from http://www.cnn.com/2010/TECH/04/14/oppmann.off.the.grid/index.html
Pentland, A. (2009). Reality mining of mobile communications: Toward a new deal on data. In S. Dutta &
R. Mia (Eds.), The global information technology report 2008–2009: Mobility in a networked
world (pp. 75–80). Basingstoke, UK: Palgrave Macmillan.
Purcell, K., Brenner, J., & Rainie, L. (2012, March 9). Search Engine Use, 2012. Pew Internet and
American Life Project. Retrieved from http://pewinternet.org/Reports/2012/Search-Engine-Use-2012.aspx
Raley, R. (2013). Dataveillance and counterveillance. In L. Gitelman (Ed.), Raw data is an oxymoron (pp.
121–146). Cambridge, MA: MIT Press.
Regulators demand clearer privacy policies. (2009, February 16). Out-Law.com. Retrieved from
http://www.out-law.com/page-9795
Robot recruiters. (2013, April 6). The Economist. Retrieved from
http://www.economist.com/news/business/21575820-how-software-helps-firms-hire-workers-more-efficiently-robot-recruiters
Turow, J., King, J., Hoofnagle, C. J., Bleakley, A., & Hennessy, M. (2009, September 29). Americans reject
tailored advertising and three activities that enable it. Retrieved from
http://ssrn.com/abstract=1478214 or http://dx.doi.org/10.2139/ssrn.1478214
Turow, J., Mulligan, D., & Hoofnagle, C. (2007, October 31). Research report: Consumers fundamentally
misunderstand the online advertising marketplace. Retrieved from
http://www.law.berkeley.edu/files/annenberg_samuelson_advertising.pdf
Weinberger, D. (2011). Too big to know: Rethinking knowledge now that the facts aren't the facts, experts
are everywhere, and the smartest person in the room is the room. New York, NY: Basic Books.
1 See http://www.sing365.com/music/lyric.nsf/First-We-Take-Manhattan-lyrics-Leonard-Cohen/926CCB64249F308848256AF00028CB85
2 See http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
individual. The death of Freud and the rise of neuropharmacology have engrained this within academia.
Data sunt potestas. This leads to our intelligence being that of the ant colony, an arguably sad apotheosis.
Ants act as if they are intelligent, in terms of organizing their colonies, farming fungi, and so forth, but
they do not need to pass through ratiocination in order to achieve these goals. It is a stripped-down
version of Teilhard de Chardin's numinous noosphere: global consciousness as glorified instinct rather than
spiritual insight.
A strong virtue of correlationalism is that it avoids funneling our findings through vapid
stereotypes. Thus, in molecular biology, most scientists do not believe in the categories of ethnicity
(Reardon, 2001) and are content to assign genetic clusters to diseases without passing through ethnicity
(e.g., Kaposi's sarcoma as initially a Jewish disease). Similarly, from the commercial end, many
recommender systems work through correlation of purchases without passing through the vapid
categories of the marketers: you don't need to know whether someone is male or female, queer or
straight, you just need to know his or her patterns of purchases and find similar clusters.
But there is a series of problems with this movement, which we can start to adumbrate if we look
to Bruno Latour. Latour (2002) argues for Gabriel Tarde contra Émile Durkheim. The latter reified society
and explained constant correlations (e.g., suicide rates) as social facts. Social conditions cause social
effects. The Tardean position, for Latour, involves replacing statistics (etymologically, facts about the
State) with aggregating clusters on the fly through large-scale data analysis. There is no need to go
outside of events for their explanation; we do not need to assume that there are categories like society,
class, ethnicity, and so forth: Everything depends on describing a specific correlation at a specific time.
Thus for Latour, as for the molecular biologists and the marketers, there is no need to appeal to analytic
categories in order to study and write about events. (I am deliberately not using "understand," since
understanding is precisely what is at stake.)
Latour here is retrojecting onto Tarde his own prior views that actor-network theory is not a
theory but a way of flattening all categories and replacing theory with method. His is the nec plus ultra of
Margaret Thatcher's infamous proclamation: "And, you know, there is no such thing as society. There are
individual men and women, and there are families."3 Latour would just add in that there aren't families or
individuals either (the latter being the more interesting ontological point).
So a two-part question: do we need theories, and do theories need categories? In The Fragile
Absolute: Or, Why Is the Christian Legacy Worth Fighting For? Žižek (2009) provides one way into these
questions. Take the social dimension first. If we accept the underlying ontology that we are all individuals
(atoms) who aggregate in unnamed clusters rather than categories, then Žižek argues that we certainly
lose the ability to recognize constant and meaningful forces in "society" (which I'll put in scare quotes for
the nonce). It does not just happen that there is a net protein, natural resource drain from the Third
World to the First, nor that women in the United States are consistently paid less for the same quality of
work as men. These categories represent a reality. Certainly, they should not be essentialized. The Third
World/First World divide overlooks regions of intense underdevelopment in, say, the United States and
regions of vast wealth in, say, India. Similarly, "woman" is a category that can and should be questioned.
3 See http://briandeer.com/social/thatcher-society.htm
And yet . . . the rough, aggregate truth is that there is not a level playing field for either, broadly
construed. No data deluge will explain these truths; at best, it can help direct policies to mitigate the
injustice; at worst (and most commonly), it can deny that there are indeed broad social forces. Willy-nilly,
our social world is one in which categories have deep meaning. This is not just about the social truths: The
same can be argued for truths in the natural sciences. A category system like the species concept is
indeed highly problematic (Wilkins, 2011); however, the aggregate behavior of most entities can be
described along certain dimensions as if this categorization were real. In both cases, the world is
structured in such a way as to make the categories have real consequences.
So in some ways, categories are central to being in the world. Big data does not do away with
categories at all. As I have argued elsewhere, the term "raw data" is itself an oxymoron. Antonia Walford
(2012) writes about the work it takes to turn data from sensors in the Amazon rain forest into manipulable
data within databases. There is a plenum of data: For her, the art of the scientific database is to take this
undifferentiated onslaught and conjure it into models (structured data fields, metadata) that allow Amazon
data to circulate scientifically. As Derrida (1998) argues in Archive Fever and Cory Knobel (2010) so
beautifully develops with his concept of ontic occlusion, every act of admitting data into the archive is
simultaneously an act of occluding other ways of being, other realities. The archive cannot in principle
contain the world in small; its very finitude means that most slices of reality are not represented. The
question for theory is what the forms of exclusion are and how we can generalize about them. Take the
other Amazon as an illustration. If I am defined by my clicks and purchases and so forth, I get
represented largely as a person with no qualities other than consumer with tastes. However, creating a
system that locks me into my tastes reduces me significantly. Individuals are not stable categories; things
and people are not identical with themselves over time. (This is argued in formal logic in the discipline of
mereology and in psychiatry by, say, ethnopsychiatry.) The unexamined term "the individual" is what
structures the database and significantly excludes temporality.
Two things, then. Just because we have big data does not mean that the world acts as if there
are no categories. And just because we have big (or very big, or massive) data does not mean that our
databases are not theoretically structured in ways that enable certain perspectives and disable others.
There is, however, also the overarching problem with both Anderson and Latour. Sure, with the
above caveats, I can imagine living in a world where science and social science are about manipulating the
worldeffective action is after all a good thing. However, this is a massive reduction of what it means to
know. I have already witnessed in the unhallowed halls of the National Science Foundation a line of
argument that says we dont really need ethnography any more. After all, ethnographers just reason from
an n of, say, 20, where other methods deploy an n of 200,000. In John King's immortal words, "numbers
beats no numbers every time." The hyping of big data leads to the withering away of interpretation, not
through the actions of a cabal, but through a sociologic of excluding from the archive all data which is not
big. This unconsidered exclusion is occurring in small across the sciences (first they came for
ethnography, but I did not speak out because I was not an ethnographer. . . and so forth). It demands a
systematic response.
The theory/data thing is very much about things, in the sense in which Pelle Ehn uses the
term: for him, a designed object (a thing) contains within it a host of contradictory discourses, never
finally resolved, as in the Icelandic Thing (the original parliament) (Binder, De Michelis, Ehn, &
Jacucci, 2011). Any thing that we create (object, way of looking at the world) irreducibly embodies
theory and data. And that is a good thing.
References
Binder, T., De Michelis, G., Ehn, P., & Jacucci, G. (2011). Design things (design thinking, design theory).
Cambridge, MA: MIT Press.
Derrida, J. (1998). Archive fever: A Freudian impression. Chicago, IL: University of Chicago Press.
Edwards, P. N. (2010). A vast machine: Computer models, climate data, and the politics of global
warming. Cambridge, MA: MIT Press.
Knobel, C. (2010). Ontic occlusion and exposure in sociotechnical systems (Doctoral dissertation,
University of Michigan). Retrieved from http://deepblue.lib.umich.edu/handle/2027.42/78763
(UMI Number AAI3441199)
Latour, B. (2002). Gabriel Tarde and the end of the social. In P. Joyce (Ed.), The social in question: New
bearings in history and the social sciences (pp. 117–132). London, UK: Routledge.
Reardon, J. (2001). The human genome diversity project: A case study in coproduction. Social Studies of
Science, 31, 357–388.
Walford, A. (2012). Data moves: Taking Amazonian climate science seriously. Cambridge Anthropology,
30(2), 101–117.
Wilkins, J. S. (2011). Species: A history of the idea. Berkeley: University of California Press.
Žižek, S. (2009). The fragile absolute: Or, why is the Christian legacy worth fighting for? London, UK:
Verso.