Life Beyond Big Data: Governing With Little Analytics
Abstract
The twenty-first-century rise of big data marks a significant break with statistical
notions of what is of interest or concern. The vast expansion of digital data has
been closely intertwined with the development of advanced analytical algorithms
with which to make sense of the data. The advent of techniques of knowledge
discovery affords some capacity for the analytics to derive the object or subject
of interest from clusters and patterns in large volumes of data, otherwise
imperceptible to human reading. Thus, the scale of the 'big' in big data is of less
significance to contemporary forms of knowing and governing than what we will
call the little analytics. Following Henri Bergson's analysis of forms of perception
which cut out a series of figures detached from the whole, we propose that
analytical algorithms are instruments of perception without which the extensity
of big data would not be comprehensible. The technologies of analytics focus
human attention and decision on particular persons and things of interest, whilst
annulling or discarding much of the material context from which they are
extracted. Following the algorithmic processes of ingestion, partitioning and
memory, we illuminate how the use of analytics engines has transformed the
nature of analysis and knowledge and, thus, the nature of the governing of
economic, social and political life.
Keywords: analytics; algorithm; big data; knowledge discovery; Bergson;
technology.
Louise Amoore, Department of Geography, Durham University, South Road, Durham DH1
3LE, United Kingdom. E-mail: [email protected]
Volha Piotukh, Department of Geography, Durham University, South Road, Durham DH1
3LE, United Kingdom. E-mail: [email protected]
Copyright 2015 Taylor & Francis
Ingestion: n=all
Reporting on the rise of analytics-based decision-making, the consultants
Accenture urge their business clients to move beyond traditional sources
of data and seize the opportunities for new insights created by new sources,
such as 'text analytics from social media and digital interactions' (Accenture,
2013, p. 5). What is captured here is a double transformation in the landscape
of big data: a radical expansion in the forms of social interaction and transaction
that can be rendered as data, or what Viktor Mayer-Schönberger and Kenneth
Cukier (2013) call 'datafication', coupled with a novel capacity to analyse across
a variety of types of data. In short, the rise of big data witnesses a transformation
in what can be collected or sampled as data, and how it can be rendered
analysable. In the vocabulary of the computer scientists and data analysts, data
are no longer strictly 'collected', but rather are 'ingested', such that everything
becomes available to analysis, the sample being represented as infinite, or n=all.5
In the past, conventional forms of structured data, characterized by 'numbers,
tables, rows, and columns' (Inmon & Nesavich, 2007, p. 1), were the only
forms of data to inhabit the world of databases, spreadsheets and statistical
tables, and thus were the only data that could be leveraged for analysis. In many
ways, the distinction between structured and unstructured data that dominates
data science discourse and social science accounts is profoundly misleading.
Of course, we might say that all data declared to be 'unstructured' is always
already structured, and certainly remains structured in important ways within
data architectures and digital devices (Berry, 2014; Kitchin, 2014). Yet, while
structured data is territorially indexable, in the sense that it can be queried
on the horizontal and vertical axes of spreadsheets within databases, so-called
unstructured data demands new forms of indexing that allow for analysis to
be deterritorialized (conducted across jurisdictions, or via distributed or cloud
computing, for example) and to be conducted across diverse data forms:
images, video, text in chat rooms, audio files and so on.6 In the main this has
implied making unstructured data analysable by the establishment of links
with already indexed structured data and the creation of new indexes.
So, for example, IBM's predictive policing software uses content analytics
that promise to: 'search and analyse across multiple information sources,
extracting key pieces of information like new addresses, credit cards or passports
that can help resolve identities, build relationship networks and trace patterns
of behaviour' (IBM, 2012, p. 2). The linking of the data elements is performed
through 'joins' across data from different data sets, either on the basis of
direct intersections with already indexed data (e.g. via a phone, credit card or
social security number ingested from a database), or probabilistically, through
correlations among data-points from different sources (e.g. text scraped from
a Twitter account correlated with facial biometrically tagged images drawn
from Facebook). Though in many ways the join is not novel, being
commonly used for querying relational databases, today the analysis operates
with a much more diverse pool of data. The allure of unstructured data is that it
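The mechanics of such joins can be pictured in a brief sketch. The Python fragment below is purely illustrative: the record fields (phone_number, screen_name and so on) and the fuzzy name-matching rule are our assumptions rather than IBM's implementation. It shows a deterministic join on an already indexed key alongside a crude probabilistic link, scored on name similarity and accepted above an arbitrary threshold.

```python
from difflib import SequenceMatcher

# Toy records standing in for two ingested sources (all fields are hypothetical).
call_records = [
    {"phone_number": "07700900123", "name": "J. Smith", "cell_tower": "A12"},
]
social_profiles = [
    {"phone_number": "07700900123", "screen_name": "jsmith88", "name": "John Smith"},
    {"phone_number": None, "screen_name": "j_smyth", "name": "Jon Smyth"},
]

# Direct join: link records that share an already indexed key.
direct_links = [
    (c, p) for c in call_records for p in social_profiles
    if c["phone_number"] and c["phone_number"] == p["phone_number"]
]

# Probabilistic join: link records whose names merely resemble one another,
# accepting any pair whose similarity score crosses an arbitrary threshold.
def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

probable_links = [
    (c, p, round(similarity(c["name"], p["name"]), 2))
    for c in call_records for p in social_profiles
    if similarity(c["name"], p["name"]) > 0.6
]

print(direct_links)
print(probable_links)
```

Notably, even in this toy case the probabilistic rule links the call record to both profiles, including the merely similar 'Jon Smyth': the threshold, not the data, decides what counts as a relation.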
reading that form part of the work of the little analytics. At first glance,
text analytics do not appear dissimilar from reading as such, and, indeed, the
genesis of text mining has its roots in natural language processing and semantic
structure. However, as Katherine Hayles has argued persuasively, machine
reading is a specific kind of reading that not only allows algorithms to read
text, but also alters irrevocably the way humans read and, consequently,
the way humans think and perceive (2012, pp. 28-29). What matters is thus
not strictly whether machines may somehow read like humans, but rather how
the possibilities of digital forms, such as text analytics, change the practice
of reading for humans and machines alike.8 The hyper reading that Hayles
identifies among multiple forms of human and machinic reading consists of
'skimming, scanning, fragmenting, and juxtaposing texts', being a mode
of reading attuned to an 'information-intensive environment' (2012, p. 12).
The reading involved in text analytics, engaged on the part of the algorithms
and the humans who action a query, is just such hyper reading of multiple
forms and sources of data as though they were a single text.
As in our Alpharm example, in order for the particular object of interest to
be perceptible, a certain damage is done to words and syntax, and to context.
Consider the processes necessary for text analytics to read: the removal of stop
words, including 'and', prepositions, gender suffixes in some languages, and
the definite and indefinite articles 'the' and 'a'; stemming, whereby words are
reduced to their stems; and the removal of punctuation marks and case
sensitivity. In effect, as one sees in the pharmaceutical company's scraping
of the web for a complete life story of a person, in order for a life to be read
with data analytics, any trace of a context, movement or a story that has a
recognizable narrative must first be pruned out. As Hayles points out, there
remain important differences between narrative-based stories of literature and
data-based story-telling:
The indeterminacy that databases find difficult to tolerate marks another way in
which narrative differs from database. Narratives gesture toward the inexplicable,
the unspeakable, the ineffable, whereas databases rely on enumeration, requiring
explicit articulation of attributes and data values. (Hayles, 2012, p. 179)
The parsing and stemming of text, then, is intrinsic and necessary to the
capacity of analytics to read at all. The stories about the lives of epilepsy
sufferers, or the purchases of retail loyalty-card holders, or the transactions
of online banking customers that can be read by algorithms are not the
indeterminate narratives of life stories. They are lives that are flattened and
reduced to their common stems, connected with others only through correlations, links and associations. On the basis of these analytics-derived life stories,
decisions are made about people, policies are implemented, resources are
allocated and interventions are targeted.
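What this pruning involves can be sketched minimally as follows. The stop-word list and the crude suffix-stripping 'stemmer' below are our own stand-ins for the much larger dictionaries and algorithms a commercial text-analytics engine would use, and the sentence is an invented example in the spirit of the epilepsy life stories discussed above.

```python
import re

# A tiny stand-in stop-word list; real engines use much longer dictionaries.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "her", "his"}

def crude_stem(word: str) -> str:
    """Strip a few common English suffixes; a stand-in for a Porter-style stemmer."""
    for suffix in ("ing", "edly", "ed", "es", "s", "ly"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def flatten(text: str) -> list[str]:
    # Remove case sensitivity and punctuation, then drop stop words and stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

story = "She stopped taking the medication and her seizures returned within weeks."
print(flatten(story))
# ['she', 'stopp', 'tak', 'medication', 'seizur', 'return', 'within', 'week']
```

The narrative sentence survives only as a bag of truncated stems: precisely the form in which it can be counted, correlated and joined with other lives.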
Because text analytics and sentiment analysis conduct their reading by a
process of reduction to bases and stems, their work exposes something of the
The analytics promise to leverage all types of data stored across multiple
architectures in order to unveil things that could not otherwise be seen, the
previously unseen, hidden patterns that dwell in the folds and joins between
data forms. Yet, if we understand the work of the analytics in seizing from
the surroundings that which interests or sustains, then we begin to see how
qualitative differences between data forms become obscured by the pursuit of
the object of interest.10 The analytics extract from diverse elements that which
is of interest, indifferent to the heterogeneity that surges beneath that data.
Viewed in this way, the contemporary big data question of how to approach
n=all is posed rather differently. In contrast with a world of big data that seeks
out 'complete data sets never available before' (interview, 1 October 2013) and
where 'big data wants n, nothing else' (Hildebrandt, 2013, p. 6), n=all appears
instead as an impossible claim. The process of ingestion draws in the data
rather as Bergson's hydrochloric acid acts upon chalk, or a plant acts on
diverse nutrients in the soil, that is to say indifferent to the 'all' with which
it communes. In this specific sense n will never be equal to all. In the so-called
'flat files' of analytics algorithms, which quite literally flatten the multiple
distinctions among data forms in order to make the data readable and
analysable, the complex temporalities of the life that generated the data are
entirely lost.
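The flattening can be gestured at with a small sketch (the sources and fields here are invented for illustration): heterogeneous, differently timed traces are coerced into one rectangular table of comparable rows, and whatever does not fit the chosen columns, including the timings themselves, simply disappears.

```python
import csv
import io

# Heterogeneous traces from different sources, each with its own shape and timing.
traces = [
    {"source": "forum_post", "user": "anon42", "text": "third seizure this month",
     "thread": "living-with-epilepsy", "posted": "2013-02-11T22:40"},
    {"source": "loyalty_card", "user": "anon42", "item": "ibuprofen",
     "store": "Durham", "scanned": "2013-02-12T09:03:17"},
    {"source": "gps_ping", "user": "anon42", "lat": 54.77, "lon": -1.57},
]

# One fixed set of columns; everything else (thread, store, precise timings,
# coordinates) is silently dropped from the flattened output.
columns = ["user", "source", "text", "item"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
writer.writeheader()
for row in traces:
    writer.writerow(row)

print(buffer.getvalue())
```

The resulting flat file is eminently analysable, and that is exactly the point: the durational texture of when and in what context each trace was generated is no longer part of the record.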
Partitioning: transform, select and filter the variables
As IBM describe their Intelligent Miner software, the task of analytics
algorithms is to 'extract facts, entities, concepts and objects from vast
repositories' (2012, p. 2). Understood thus, the work of the analytics can
be conceived as one specific form of sense-making: one means by which
subjects and objects of interest are partitioned from a remainder and singled
out for attention. How are qualitatively different entities in a heterogeneous
body of data transformed into something quantitative, something that can be
enumerated? In his early work Henri Bergson differentiates between two ideas
of time, the time of lived experience, or durée réelle, and the mechanistic time
of science in which time is a succession of images or spatial frames, as in film
(Ansell Pearson & Mullarkey, 2002, p. 17). In this spatial representation of time
as a series of halts, we begin from a fixed point in the immobile to watch for the
moving reality as it passes instead of putting ourselves back into the moving
reality to traverse with it (Bergson, 1965; see also Connolly, 2011). Understood
thus, the fixed instrument of perception partitions, according to what is of
interest to it, a series of immobile stills from which to derive some picture of
a changing world.
As Gilles Deleuze notes, Bergson calls into question the order of needs,
of action, and of society that predisposes us to retain only what interests us
in things and that tends to obscure differences in kind (Deleuze, 1991, p. 33).
Following Bergson, Deleuze understands the qualitative multiplicity of
duration to bear all of the differences in kind, while space is unable to
present anything but differences of degree (since it is quantitative homogeneity) (1991, p. 31). The patterns of life, so readily claimed as the world captured
by analytics, might be properly thought of as durational, multiple, continuous
and qualitative. Like the modern physics Bergson and Deleuze describe, the
analytics extract and detach data from the whole, drawing a series of
discontinuous spatial images as vantage points on a mobile world. While analytics
claim to afford a vantage point on emergent life patterns and tendencies,
When IBM queried the temporality of the 100 terabytes and 100 per cent duty
cycle, they specifically asked how many data analysts would make simultaneous
queries of the data in the scenario. The CIA responded to the query by
appealing to the existing practices of data analytics in the commercial sphere,
inviting the bidders to bring to the state the techniques already thought to
be best practice in economy and commerce: 'The contractor should propose
commercial best practices derived from their commercially available solutions
to provide data analytics via the MapReduce software framework to concurrent
users from multiple organizations' (GAO, 2013, p. 7). Here the divergent
responses of IBM and Amazon to the scenario reveal rather more than two
competing interpretations of the requirements. They afford a glimpse of how
the data analytics in processes such as these scissor and sort large volumes of data, and
the proximity of security applications such as PRISM and TEMPORA to the
commercial data analytics used every day to tell us which book we might like
to buy next. Amazon's established commercial practice of analysing clickstream
data on its customers in close to real time and on a continuous cycle, it seems,
better met the CIA's requirement for analytics to deal with large volumes
of unstructured Internet data to be queried by multiple concurrent users, from
border and immigration control to counter-terrorism officers.
The capacity to integrate data analytics across multiple analysts, and to map
and reduce across multiple nodes, exhibited here by Amazon, contrasts with
IBM's extraction of batches of data for analysis. One response to the CIA's
scenario appears to sustain a somewhat conventional social science approach
to sampling, and a particular relation between the subset and the whole of big
data. In the other response one can see the ceaseless stream of ingestion,
partitioning and reassembly that affords novel iterative approaches to sample
and whole, where, in effect, people and objects continually cross back and forth
across the sample and the whole. The distributed analysis of data streams,
as David Berry writes, sustains some form of relationship with the flow of data
that doesn't halt the flow, but rather allows the user to step into and out of a
number of different streams in an intuitive way (2011, p. 143). In Amazon's
MapReduce framework for the CIA, it is the identification of patterns of note
across different data streams that gives rise to a threshold at which a target or
person of interest is identified.
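The map/reduce pattern at issue here can be illustrated in miniature. The sketch below is a toy, single-process rendering of the idea rather than Amazon's actual framework for the CIA, and the 'streams' and items of interest are invented: a map step counts items within each stream independently, a reduce step merges the partial counts, and a threshold determines which patterns of note are surfaced.

```python
from collections import Counter
from functools import reduce

# Toy "streams" standing in for clickstream, chat and social media feeds.
streams = [
    ["wire transfer", "new passport", "flight LHR-IST", "wire transfer"],
    ["flight LHR-IST", "sim card", "wire transfer"],
    ["sim card", "hotel booking"],
]

def map_stream(stream: list[str]) -> Counter:
    """Map step: count items of interest within one stream, independently."""
    return Counter(stream)

def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce step: merge partial counts from different streams."""
    return a + b

partial_counts = [map_stream(s) for s in streams]          # could run on separate nodes
totals = reduce(reduce_counts, partial_counts, Counter())  # merged on a reducer

THRESHOLD = 3  # the gauge that decides what counts as a pattern of note
patterns_of_note = {k: v for k, v in totals.items() if v >= THRESHOLD}
print(patterns_of_note)  # {'wire transfer': 3}
```

The design point is that each stream can be mapped wherever and whenever it arrives, while the reduce step and the threshold do the political work of deciding what is brought to attention.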
The mobile thresholds of support and confidence for an association rule, the
very key to setting the gauge for the analytics, have become highly significant
political boundaries for our times. The threshold is the moment when the
strongest relationships are identified, the moment when someone or something
of interest becomes perceptible. The moving of the threshold changes who or
what is surfaced from the data and brought to attention. In the historical origins
of data analytics this threshold was commonly defined in terms of a 'frequent set',
where the co-occurrence of retail consumer items in patterns of purchases
met a predetermined level of support and confidence. Co-occurrence in itself
is not always a matter of interest: for example, milk co-occurring with bread
in basket data would have high levels of support and confidence, but would
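In the classic market-basket formulation, the support of an item pairing is the fraction of all baskets containing both items, and the confidence of a rule A → B is the fraction of baskets containing A that also contain B. The sketch below, with invented basket data, shows how the two thresholds gauge which associations are surfaced.

```python
from itertools import permutations

baskets = [
    {"milk", "bread", "nappies"},
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"nappies", "beer"},
    {"milk", "bread", "nappies", "beer"},
]

def support(itemset: set) -> float:
    # Fraction of baskets containing every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent: str, consequent: str) -> float:
    # Of the baskets containing the antecedent, how many also contain the consequent.
    return support({antecedent, consequent}) / support({antecedent})

MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6  # the mobile thresholds

items = set().union(*baskets)
rules = [
    (a, c, support({a, c}), confidence(a, c))
    for a, c in permutations(items, 2)
    if support({a, c}) >= MIN_SUPPORT and confidence(a, c) >= MIN_CONFIDENCE
]
for a, c, s, conf in sorted(rules, key=lambda r: -r[3]):
    print(f"{a} -> {c}: support={s:.2f}, confidence={conf:.2f}")
```

Lowering MIN_SUPPORT or MIN_CONFIDENCE by even a small amount surfaces a quite different set of rules from the same baskets, which is precisely why the setting of the gauge is itself a consequential decision.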
The processes of partitioning and analysis precisely do not require a context, nor
do they need individuals who can be remembered. The critical demand for
a contextual limit to the analysis of life data, or a deletion of the digital subject,
gains little purchase in a world where attributes are extracted from their qualities
and afforded numeric values. The analytics that partition big data make it
possible to forget the person and the context, but to remember the position,
the distance or proximity of association.
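This positional memory can be gestured at with a small sketch (the attribute values and record identifiers here are invented): once each record is reduced to a vector of numeric attribute values, the analysis needs only the distances between vectors; names, stories and contexts play no further part.

```python
from itertools import combinations
from math import sqrt

# Persons flattened to positions in an attribute space; the attributes themselves
# (say, transactions per week, night-time logins, distinct devices) are invented.
positions = {
    "record_0412": [14.0, 2.0, 1.0],
    "record_7730": [13.0, 3.0, 1.0],
    "record_2219": [2.0, 11.0, 6.0],
}

def distance(u: list[float], v: list[float]) -> float:
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# What is remembered is not the person but the proximity of association.
for (id_a, vec_a), (id_b, vec_b) in combinations(positions.items(), 2):
    print(f"{id_a} ~ {id_b}: {distance(vec_a, vec_b):.2f}")
```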
described the process to us in interview, 'we just keep iterating until the results
are satisfying'. The processes of memory and iteration in Featurespace, or
TIBCO Spotfire, or indeed in the analytics used by GCHQ and the NSA to
build a 'pattern of life' (The New York Times, 2013), are far removed from what
Bergson termed 'attention to life' (1912, p. 63). Where attention to life bears
witness to 'the adaptation of the past to the present, the utilization of the past
in terms of the present' (Deleuze, 1991, p. 70), the features or patterns
of life sought by major supermarket chains and national security agencies know
no limitations, no indeterminacies, nothing that is not available to action.
Where the durational time of consciousness confronts the indeterminate
future by shedding some light gathered from selected past states, combining
with present states, it does so in the knowledge that the rest remains in the
dark (Bergson, 1912, p. 194). Amid their claims to predict human propensities,
by contrast, analytics engines, such as ARIC and Spotfire, confront an
indeterminate future in order precisely to leave nothing in the dark and nothing
undetermined. Though the analytics share with consciousness the selection
of some discrete past events, there the commonality ends. For the analytics take
the light of some past states and project it forward as though there could be no
dark corners remaining: all propensities will be known, all future acts
anticipated. When analytics like ARIC are being used to monitor social media
in the Arab Spring uprisings, or to monitor Twitter in the predictive policing
of urban protest, it is of great significance that all memory of every past infraction
is thought to be retrievable, all futures foreseeable. Indeed, the 'event' in event
stream analysis is annulled as such, along with the 'real' in real-time analytics.
For nothing new or eventful can emerge, such is the machine time of the iterative
replay of the past state, modified for recent deviations and gnawing into the
uncertain future.
Funding
This work was supported by the RCUK Global Uncertainties Fellowship 'Securing
against Future Events: Pre-emption, Protocols and Publics' [Grant number ES/
K000276/1].
Notes
1 In response to the UK government's announcement of the second phase of funding
for Big Data centres, Chief Executive of the ESRC, Professor Paul Boyle, welcomed
the sheer volume of data that is now being created, 'a significant resource that can
shape our knowledge of society and help us prepare and evaluate better government
policies in the future' (ESRC, 2014).
2 It is not our purpose here to map a linear history of practices of data collection and
analysis. Rather, we juxtapose two moments when a specific set of claims are made
regarding the scale and scope of social data and its effects on the governing of societies.
3 Bergson's reflections on perception in science are present throughout his body of
work. Of particular significance here is his insistence on the shared categories of
thought and sensing across science and prosaic perception, so that ordinary knowledge
is forced, like scientific knowledge, to take things in a time broken up into particles,
pulverized so to speak, where an instant which does not endure follows another without
duration (1965, p. 120).
4 Indeed, by 1930, Bergson himself appreciated the growing capacity of modern
mathematics and physics to capture something of perpetual and indivisible change, to
follow the growth of magnitudes and to seize movement from within (1965, p. 211).
5 The earliest use of the concept of ingestion for analysis of data in multiple formats
can be found in papers from IBM's research on smart surveillance and web architecture
(Chiao-Fe, 2005; Gruhl et al., 2004). The use of a vocabulary of ingestion coincides
with an expansion of analysable samples of digital data, such that it is said that n=all, or
the sample is equal to everything.
6 The concept of index is used here in the sense proposed by Deleuze and Guattari to
denote the capacity to designate the state of things, territorially locatable in time and
space (1987, p. 124). Understood thus, for example, extraction algorithms are required
in order territorially to index unstructured objects, as in the use of biometric templates
derived from Facebook. It is the extracted template that makes the object searchable in
time and space.
7 The case is derived from field-work conducted in London in 2013. For further
examples and detailed descriptions of text mining and sentiment analysis, see Bello et al.
(2013); Zhao et al. (2013); and Anjaria and Guddeti (2014).
8 Hayles defines the concept of technogenesis as the idea that humans and technics
have coevolved together, such that our very capacity for thought and action is bound
up with epigenetic changes catalysed by exposure to and engagement with digital
media (2012, pp. 10-12). The idea is present also in Walter Benjamin's famous essay
on art in the age of mechanical reproduction, where he notes that 'the mode of human
sense perception changes with humanity's entire mode of existence' (1999, p. 216).
9 Retrieved from http://www.theguardian.com/profile/laura-poitras. See also Harding (2014, pp. 110, 204).
10 Though the focus of this essay is not on the interface between data architectures
and software, the flattening of differences at this interface is significant. See Galloway
(2012); Berry (2011).
11 Despite substantial interest in the automated analysis of large data sets for security
purposes in the wake of Edward Snowden's disclosures, the use of algorithmic
techniques to analyse Passenger Name Record (PNR) and SWIFT financial data has
been known and documented for some time (Amoore, 2013; de Goede, 2012).
12 Insights drawn from observations at TIBCO Spotfire event, London, 13
June 2013.
13 Insights drawn from observations at TIBCO Spotfire event, London, 13 June
2013, and SAS Analytics 'How to' workshops, 19 June 2013.
References
Accenture. (2013). Accenture analytics in action. Retrieved from http://www.accenture.com/sitecollectiondocuments/pdf/accenture-analytics-in-actionsurvey.pdf
Agrawal, R., Asonov, D., Baliga, P., Liang, L., Porst, B. & Srikant, R. (2005). A reusable platform for building sovereign information sharing applications. SIGMOD Proceedings. Retrieved March 25, 2015, from http://www.rsrikant.com/papers/divo04.pdf
Agrawal, R., Imielinski, T. & Swami, A. (1993). Mining association rules between sets of items in large databases. SIGMOD Proceedings.