Big Data for Development
Global Pulse, May 2012
Acknowledgements
This paper was developed by UN Global Pulse, an initiative based in the Executive Office of the
Secretary-General, United Nations. Global Pulse is grateful to the Government of Sweden, the
Government of the United Kingdom, UNDP, UNICEF, WFP and the UN's Department of Public
Information for their generous support to the initiative.
The paper was written by Emmanuel Letouzé, Senior Development Economist (2011), with
research support from current and former Global Pulse colleagues, who provided resources and
feedback and helped shape the ideas that inform the contents of this paper. It was edited by
Anoush Rima Tatevossian and Robert Kirkpatrick, with copyediting support from Cassandra
Hendricks. Production and design were supported by Charlotte Friedman.
We would like to thank Björn Gillsäter, Scott Gaul, and Samuel Mendelson for their time in
reviewing a number of draft versions of this paper and providing detailed feedback.
Global Pulse has benefited from the experience and expertise of a wide range of colleagues both
inside and outside of the UN system. Global Pulse acknowledges the extensive research
conducted in this diverse and emerging field by partners such as the Billion Prices Project, JANA,
Ushahidi, the International Food Policy Research Institute (IFPRI), Google Flu Trends, the Global
Viral Forecasting Initiative, Telefónica Research, the Institute for Quantitative Social Science at
Harvard University, L'Institut des Systèmes Complexes Paris Île-de-France, SAS, Crimson
Hexagon, and UNICEF.
The paper is available online at http://unglobalpulse.org/
The analysis and recommendations of this paper do not necessarily reflect the views of the United
Nations, or the Member States of the United Nations. The views presented in this paper are the
sole responsibility of its authors.
To provide feedback, comments or for general inquiries, please contact:
Global Pulse
370 Lexington Ave, Suite 1707
New York, New York 10017
E-mail: [email protected]
Web: www.unglobalpulse.org
Abstract
Innovations in technology and the greater affordability of digital devices have presided over
today's "Age of Big Data," an umbrella term for the explosion in the quantity and diversity
of high-frequency digital data. These data hold the potential, as yet largely untapped,
to allow decision makers to track development progress, improve social protection, and
understand where existing policies and programmes require adjustment.
Turning Big Data (call logs, mobile-banking transactions, online user-generated content
such as blog posts and Tweets, online searches, satellite images, etc.) into actionable
information requires using computational techniques to unveil trends and patterns within
and between these extremely large socioeconomic datasets. New insights gleaned from
such data mining should complement official statistics, survey data, and information
generated by Early Warning Systems, adding depth and nuance on human behaviours
and experiences, and doing so in real time, thereby narrowing both information and
time gaps.
With the promise come questions about the analytical value, and thus policy relevance, of
this data, including concerns over the relevance of the data in developing country
contexts, its representativeness, and its reliability, as well as the overarching privacy issues
of utilising personal data. This paper does not offer a grand theory of technology-driven
social change in the Big Data era. Rather, it aims to delineate the main concerns and
challenges raised by Big Data for Development as concretely and openly as possible,
and to suggest ways to address at least a few aspects of each.
It is important to recognise that Big Data and real-time analytics are no modern panacea
for age-old development challenges. That said, the diffusion of data science to the realm
of international development nevertheless constitutes a genuine opportunity to bring
powerful new tools to the fight against poverty, hunger and disease.
Table of Contents
INTRODUCTION
SECTION 1: OPPORTUNITY
1.1. DATA INTENT AND CAPACITY
The Data Revolution
Relevance to the Developing World
Intent in an Age of Growing Volatility
Big Data for Development: Getting Started
Capacity: Big Data Analytics
1.2 SOCIAL SCIENCE AND POLICY APPLICATIONS
A Growing Body of Evidence
SECTION II: CHALLENGES
2.1 DATA
Privacy
Access and Sharing
2.2 ANALYSIS
Getting the Picture Right
Interpreting Data
Defining and Detecting Anomalies in Human Ecosystems
SECTION III: APPLICATION
3.1 WHAT NEW DATA STREAMS BRING TO THE TABLE
Know Your Data
Applications of Big Data for Development
3.2 MAKING BIG DATA WORK FOR DEVELOPMENT
Contextualisation is Key
Becoming Sophisticated Users of Information
CONCLUDING REMARKS ON THE FUTURE OF BIG DATA FOR DEVELOPMENT
The hope is that as you take the economic pulse in real time, you will be able to respond to anomalies more quickly.
- Hal Varian, Google Chief Economist (Professor Emeritus, University of California, Berkeley)
Introduction
Since the turn of the century, innovations in technology and the greater affordability of
digital devices have presided over the "Industrial Revolution of Data,"1 characterised by
an explosion in the quantity and diversity of real-timei digital data resulting from the
ever-increasing role of technology in our lives. As a result, "we are entering an
unprecedented period in history in terms of our ability to learn about human
behaviour."2
What was a hypothesis only a few years ago is today being confirmed by researchers and
corporations, and has recently received significant coverage in the mainstream media3:
analysis of this new wealth of digital data, the massive and ever-expanding archive of
what we say and do using digital devices every day, may reveal remarkable insights into
the collective behaviour of communities. No less significantly, just as this data is
generated by people in real time, so it may now be analysed in real time by high-performance
computing networks, thus creating at least the potential for improved
decision-making. It is time for the development community and policymakers around the
world to recognise and seize this historic opportunity to address twenty-first century
challenges, including the effects of global volatility, climate change, and demographic
shifts, with twenty-first century tools.
What exactly is the potential applicability of Big Data for Development? At the most
general level, properly analysed, these new data can provide snapshots of the well-being
of populations at high frequency, high degrees of granularity, and from a wide range of
angles, narrowing both time and knowledge gaps. Practically, analysing this data may
help discover what Global Pulse has called "digital smoke signals"4: anomalous changes
in how communities access services that may serve as proxy indicators of changes in
underlying well-being. Real-time awareness of the status of a population and real-time
feedback on the effectiveness of policy actions should in turn lead to a more agile and
adaptive approach to international development, and ultimately, to greater resilience and
better outcomes.
Big Data for Development is about turning imperfect, complex, often unstructured data
into actionable information. This implies leveraging advanced computational tools (such
as machine learning), which have developed in other fields, to reveal trends and
correlations within and across large data sets that would otherwise remain undiscovered.
Above all, it requires human expertise and perspectives. Application of these approaches
to development raises great expectations, as well as questions and concerns, chief of
which is the analytical value, and thus ultimately the policy relevance, of Big Data to
address development challenges. These must be discussed in a very open manner to
ensure that there are no misunderstandings about what real-time data analytics can offer
the field of global development, and what it cannot.
i For the purposes of this paper, as discussed in Section 1.2, real-time data is defined as data that (1)
covers/is relevant to a relatively short and recent period of time, such as the average price of a commodity
over a few days rather than a few weeks, and (2) is made available within a timeframe that allows action to
be taken that may affect the conditions reflected in the data. The definition of "real time" is contextual:
real time in the realm of fiber optics is not the same as in the realm of public policy.
This paper outlines the opportunities and challenges, which have guided the United
Nations Global Pulse initiative since its inception in 2009. The paper builds on some of
the most recent findings in the field of data science, and findings from our own
collaborative research projects. It does not aim to cover the entire spectrum of challenges
nor to offer definitive answers to those it addresses, but to serve as a reference for further
reflection and discussion. The rest of this document is organised as follows: section one
lays out the vision that underpins Big Data for Development; section two discusses the
main challenges it raises; section three discusses its application. The concluding section
examines options and priorities for the future.
Section 1: Opportunity
1.1. Data Intent and Capacity
The Data Revolution
The world is experiencing a data revolution, or "data deluge" (Figure 1).5 Whereas in
previous generations a relatively small volume of analog data was produced and made
available through a limited number of channels, today, in the Digital Age, a massive
amount of data is regularly being generated and flowing from various sources, through
different channels, every minute.6 It is the speed and frequency with which data is
emitted and transmitted on the one hand, and the rise in the number and variety of
sources from which it emanates on the other hand, that jointly constitute the data deluge.
The amount of available digital data at the global level grew from 150 exabytes in 2005
to 1,200 exabytes in 2010.7 It is projected to increase by 40% annually in the next few
years,8 which is about 40 times the much-debated growth of the world's population.9
This rate of growth means that the stock of digital data is expected to increase 44 times
between 2007 and 2020, doubling every 20 months.ii
Figure 1: The Early Years of the Data Revolution
Source: The Leaky Corporation. The Economist. http://www.economist.com/node/18226961.
ii In September 2008, the Jerusalem Declaration stated: "We are entering the era of a high rate of
production of information of physical, biological, environmental, social and economic systems. The
recording, accessing, data mining and dissemination of this information affect in a crucial way the
progress of knowledge of mankind in the next years. Scientists should design, explore and validate
protocols for the access and use of this information able to maximize the access and freedom of research
and meanwhile protect and respect the private nature of part of it. (...) Several scientific disciplines once
characterized by a low rate of data production have recently become disciplines with a huge rate of data
production. Today a huge amount of data easily accessible in electronic form is produced by both research
and, more generally, human activity."
The revolution has various features and implications. The stock of available data gets
younger and younger, i.e. the share of data that is less than a minute old (or a day, or a
week, or any other time benchmark) rises by the minute.iii Further, a large and increasing
percentage of this data is both produced and made available in real time (which is a related
but different phenomenon).iv The nature of the information is also changing, notably with
the rise of social media and the spread of services offered via mobile phones. The bulk of
this information can be called "data exhaust," in other words, the digitally trackable or
storable actions, choices, and preferences that people generate as they go about their
daily lives.10 At any point in time and space, such data may be available for thousands
of individuals, providing an opportunity to figuratively take the pulse of communities.
The significance of these features is worth re-emphasising: this revolution is extremely
recent (less than one decade old), extremely rapid (the growth is exponential), and
immensely consequential for society, perhaps especially for developing countries.
Relevance to the Developing World
The data revolution is not restricted to the industrialised world; it is also happening
in developing countriesand increasingly so. The spread of mobile-phone technology to
the hands of billions of individuals over the past decade might be the single most
significant change that has affected developing countries since the decolonisation
movement and the Green Revolution. Worldwide, there were over five billion mobile
phones in use in 2010, and of those, over 80% in developing countries. That number
continues to grow quickly, as analysts at the GSM Association/Wireless Intelligence
predict six billion connections worldwide by the middle of 2012. The trend is especially
impressive in Sub-Saharan Africa, where mobile phone technology has been used as a
substitute for usually weak telecommunication and transport infrastructure as well as
underdeveloped financial and banking systems.v
Across the developing world, mobile phones are routinely used not only for personal
communications, but also to transfer money, to search for work, to buy and sell goods, or
to transfer data such as grades, test results, stock levels and prices of various commodities,
medical information, etc. (For example, popular mobile services such as Cell Bazaar in
Bangladesh allow customers to buy and sell products, SoukTel in the Middle East offers
an SMS-based job-matching service, and the M-PESA mobile-banking service in Kenya
allows individuals to make payments to banks, or to individuals.) In many instances,
mobile services have outpaced the growth and availability of their traditional counterparts.
iii As demographic theory shows, a population whose rate of growth has risen, as in the case of digital data,
will eventually get younger for several years, and the process will continue if the rate of growth continues
to increase; nonetheless, a population of data will always be older than a population of humans subject to
the same birth rate, because the life expectancy of digital data is much higher than that of any human
population.
iv Conceptually, not all newly produced data is real-time data as defined above and in section 1.2. For
example, if all historical vital statistics records of the world were all of a sudden made available digitally,
these newly produced digital data would obviously not be real-time data. In practice, a large share of
newly produced digital data tends to be high frequency.
v For example, mobile phone penetration, measured by the number of mobile phones per 100 inhabitants,
was 96% in Botswana, 63% in Ghana, 66% in Mauritania, 49% in Kenya, 47% in Nigeria, 44% in Angola,
40% in Tanzania, etc. <http://www.google.com/fusiontables/Home/> (Source: Google Fusion Tables)
vi Question Box is a pilot initiative that helps people find answers to everyday questions through hotlines,
SMS, and kiosks (http://questionbox.org/). UNICEF's Digital Drum is a solar-powered computer kiosk
(http://is.gd/gVepRP).
Although the data revolution is unfolding around the world in different ways and with
different speeds, the digital divide is closing faster than many had anticipated even a few
years ago. Furthermore, in countries with weak institutional capacities, the data
revolution may be especially relevant to supplement limited and often unreliable data.
Increasingly, Big Data is recognised as creating new possibilities for international
development.14 But data is a raw good that would be of little use without both the
"intent" and "capacity"15 to make sense of it.
Intent in an Age of Growing Volatility
There is a general perception that our world has become more volatile, increasing the
risk of severe hardship for vulnerable communities. Fluctuations in economic
conditions (harvests, prices, employment, capital flows, etc.) are certainly not new, but
it seems that our global economic system may have become more prone to large and
swift swings in the past few years.
The most commonly mentioned drivers are financial and climatologic shocks in a context
of greater interconnection.16 In the last five years alone, a series of crises have unfolded,17
with the food and fuel crisis of 2007 to 2008 followed by the Great Recession that started
in 2008. By the second half of 2011, the world economy had entered yet another period of
turmoil, with a famine in the Horn of Africa and significant financial instability in Europe
and the United States. Global volatility is unlikely to abate: according to the OECD,
"[d]isruptive shocks to the global economy are likely to become more frequent and cause
greater economic and societal hardship. The economic spill-over effect of events like the
financial crisis or a potential pandemic will grow due to the increasing interconnectivity
of the global economy and speed with which people, goods and data travel."18 For many
households in developing countries, food price volatility, even more than price spikes, is
the most severe challenge.19
For all this interconnectivity, local impacts may not be immediately visible and
trackable, but may be both severe and long-lasting. A rich literature on vulnerability20
has highlighted the long-term impact of shocks on poor communities. Children who are
forced to drop out of school may never go back or catch up; households forced to sell
their productive assets or flee face a significant risk of falling back or deeper into
poverty; undernourished infants and foetuses exposed to acute maternal malnutrition may
never fully recover,21 or, worse, die.22 These processes often unfold beneath the radar of
traditional monitoring systems. By the time hard evidence finds its way to the front pages
of newspapers and the desks of decision makers, it is often too late or extremely expensive
to respond. The main triggers will often be known (a drought, rising temperatures, floods,
a global oil or financial shock, armed conflict), but even with sufficient macro-level
contextual information it is hard to distinguish which groups are affected, where, when,
and how badly.
Policymakers have become increasingly aware of the costs of this growing volatility;
they know the simple fact that it is easier and less costly to prevent damages or to keep
them at a minimum than to reverse them. These arguments have been ringing in their ears
in recent years, as the words "volatility," "vulnerability," "fragility" and "austerity" have
made headlines.
Various existing Early Warning Systemsvii do help raise flags in the international
community, but their coverage is limited (with a heavy focus on food security in rural
areas) and they are expensive to scale. Some are also plagued with design and
implementation problems.viii Survey data also provide important insights, but such data
take time to be collected, processed, verified, and eventually published. Surveys are too
cumbersome and expensive to scale up to the point where they can function as an effective
and proactive solution. These traditional data (which, for the purpose of this paper,
refer to official statistics and survey data) will continue to generate relevant information,
but the digital data revolution presents a tremendous opportunity to gain richer, deeper
insights into human experience that can complement the development indicators that are
already collected.
Meanwhile, examples of the private sector successfully embracing Big Data analyticsix
and a growing volume of reports and discourse emphasising the promise of real-time
data and data-driven decision-making, forwarded by leading institutes, institutions and
media (from the World Economic Forum to the McKinsey Institute23 to the New York
Times), have begun to make their way into the public sector discourse.
Civil society organisations have also showed their eagerness to embrace more agile
approaches to leveraging real-time digital data. This is evidenced by the growing role of
"crowdsourcing"x and other "participatory sensing"24 efforts bringing together
communities of practice and like-minded individuals through the use of mobile phones
and other platforms, including the Internet, hand-held radio, and geospatial technologies, etc.xi
In many cases, these initiatives involve multiple partners from various fields in what
constitutes a novel way of doing business.
vii A mapping conducted by Global Pulse found 39 such systems in operation in the UN system alone.
viii Reasons include their focus on climate-related shocks as a criterion for geographical coverage, lack of
consistency of definitions, and lag in reporting. See in particular: WFP and UNICEF. "Food and Nutrition
Security Monitoring and Analysis Systems: A Review of Five Countries (Madagascar, Malawi, Nepal, and
Zambia)." Rep. UN Global Pulse, Dec. 2011. <http://www.unglobalpulse.org/projects/rivaf-research-studyimpact-global-economic-and-financial-crisis-vulnerable-people-five-cou>
ix For example, MIT Professor Erik Brynjolfsson's research found significant differences in the "data
payoff" (a 5 percent gap in productivity, considered to be a decisive advantage) enjoyed by companies
relying on data-driven decision-making processes on the one hand, and those that continue to rely
primarily on experience and intuition on the other. Source: Lohr, Steve. "When There's No Such
Thing as Too Much Information." The New York Times. 23 Apr. 2011.
<http://www.nytimes.com/2011/04/24/business/24unboxed.html?_r=1&src=tptw>
x The word "crowdsourcing" refers to the use of non-official actors (the crowd) as (free) sources of
information, knowledge and services, in reference and opposition to the commercial practice of
outsourcing.
xi Examples of such initiatives and approaches include Crisismappers, Ushahidi and participatory sensing.
For additional details, see Crisis Mappers Net, The International Network of Crisis Mappers.
<http://crisismappers.net>, <http://haiti.ushahidi.com>, and Goldman et al., 2009.
Slowly, governments the world over are realising the power of Big Data. Some choose
conservative paths for managing the data deluge, involving crackdowns and strict
controls (which will likely prove unsustainable), while others will devise institutional
frameworks and support innovative initiatives, such as open data movements, that will
help leverage its power for the common good.xii
Finally, many other applications in the social sciences have also strengthened the case for
embracing Big Data for Development, as mentioned above and discussed in greater detail
below.
It is the double recognition of the promise of the data revolution and the need for better,
faster information in an age of growing global volatility that led the leaders of the G20
and the UN Secretary-General to call for the establishment of the Global Pulse initiative
(in the wake of the on-going Global Economic Crisis), with the aim of developing a new
approach to social impact monitoring and behavioural analysis by building on new
sources of data and new analytical tools.
Beyond the availability of raw data alone, and beyond the intent to utilise it, there needs
to be capacity to understand and use data effectively. Big Data analytics refers to tools
and methodologies that aim to transform massive quantities of raw data into "data about
data" for analytical purposes. In the words of Stanford Professor Andreas Weigend, data
is the new oil; like oil, it must be refined before it can be used.25
Big Data for Development: Getting Started
"Big Data" is a popular phrase used to describe a massive volume of both structured and
unstructured data that is so large that it's difficult to process with traditional database and
software techniques. The characteristics which broadly distinguish Big Data are
sometimes called the "3 Vs": more volume, more variety and higher rates of velocity.
This data comes from everywhere: sensors used to gather climate information, posts to
social media sites, digital pictures and videos posted online, transaction records of online
purchases, and cell phone GPS signals, to name a few. This data is known as "Big Data"
because, as the term suggests, it is huge in both scope and power.
To illustrate how Big Data might be applicable to a development context, imagine a
hypothetical household living in the outskirts of a medium-size city a few hours
from the capital in a developing country. The head of household is a mechanic who
owns a small garage. His wife cultivates vegetables and raises a few sheep on their plot
of land as well as sews and sells drapes in town. They have four children aged 6 to 18.
Over the past couple of months, they have faced soaring commodity prices, particularly
food and fuel. Let us consider their options.
xii Several governments, for example, have already taken significant steps to make government data open
and available for analysis and use. Countries from regions across the globe, from Kenya to Norway and
Brazil to South Korea, and international institutions like the World Bank, have begun adopting norms to
standardize and publish datasets. See "Open Data Sites." <http://www.data.gov/opendatasites/> for a
sample of existing sites and datasets.
The couple could certainly reduce their expenses on food by switching to cheaper
alternatives, buying in bulk, or simply skipping meals. They could also get part of their
food at a nearby World Food Programme distribution center. To reduce other expenses,
the father could start working earlier in the morning in order to finish his day before
nightfall to lower his electricity bill. The mother could work longer hours and go to town
everyday to sell her drapes, rather than twice a week. They could also choose to top-off
their mobile phone SIM cards in smaller increments instead of purchasing credit in larger
sums and less-frequent intervals. The mother could withdraw from the savings
accumulated through a mobile phone-based banking service which she uses.
If things get worse they might be forced to sell pieces of the garage equipment or a few
sheep, or default on their microfinance loan repayment. They might opt to call relatives in
Europe for financial support. They might opt to temporarily take their youngest child out
of school to save on tuition fees, school supplies and bus tickets. Over time, if the
situation does not improve, their younger children may show signs of anaemia, prompting
them to call a health hotline to seek advice, while their elder son might do online
searches, or vent about his frustration on social media at the local cybercaf. Local aid
workers and journalists may also report on increased hardships online.
Such a systemic (as opposed to idiosyncratic) shock will prompt dozens, hundreds
or thousands of households and individuals to react in roughly similar ways.26 Over
time, these collective changes in behaviour may show up in different digital data sources.
Take this series of hypothetical scenarios, for instance:
(1) The incumbent local mobile operator may see many subscribers shift from adding
an average denomination of $10 on their SIM-cards on the first day of the month
to a pattern of only topping off $1 every few days; The data may also show a
concomitant significant drop in calls and an increase in the use of text messages;
(2) Mobile banking service providers may notice that subscribers are depleting their
mobile money savings accounts; A few weeks into this trend, there may be an
increase in defaults on mobile repayments of microloans in larger numbers than
ever before;
(3) The following month, the carrier-supported mobile trading network might record
three times as many attempts to sell livestock as is typical for the season;
(4) Health hotlines might see increased volumes of calls reporting symptoms
consistent with the health impacts of malnutrition and unsafe water sources;
(5) Other sources may also pick up changes consistent with the scenario laid out
above. For example, the number of Tweets mentioning the difficulty of affording
food might begin to rise. Newspapers may be publishing stories about rising
infant mortality;
(6) Satellite imaging may show a decrease in the movement of cars and trucks
travelling in and out of the citys largest market;
(7) WFP might record that it serves twice as many meals a day as it did during the
same period one year before. UNICEF also holds daily data that may indicate that
school attendance has dropped.
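To make scenario (1) above concrete, here is a minimal sketch, in Python, of how such a shift in top-up behaviour might be surfaced automatically. The daily figures, window size and threshold are hypothetical, and a simple rolling z-score rule stands in for the more sophisticated anomaly-detection methods such analyses would use in practice.

```python
# A minimal sketch (hypothetical figures) of scenario (1): flagging a sudden
# drop in average airtime top-up denominations against a trailing baseline,
# using a simple rolling z-score rule.
from statistics import mean, stdev

def flag_topup_anomalies(daily_avg_topup, window=30, threshold=3.0):
    """Return indices of days whose average top-up falls more than
    `threshold` standard deviations below the trailing `window`-day baseline."""
    anomalies = []
    for day in range(window, len(daily_avg_topup)):
        baseline = daily_avg_topup[day - window:day]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and daily_avg_topup[day] < mu - threshold * sigma:
            anomalies.append(day)
    return anomalies

# Hypothetical series: ~$10 top-ups on most days, collapsing to ~$1.
series = [10.0 + 0.3 * (i % 3) for i in range(60)] + [1.0, 1.2, 0.9]
print(flag_topup_anomalies(series))  # -> [60, 61, 62]
```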
xiii "A feedback loop involves four distinct stages. First comes the data: A behaviour must be measured,
captured and stored. This is the evidence stage. Second, the information must be relayed to the individual,
not in the raw-data form in which it was captured but in a context that makes it emotionally resonant. This
is the relevance stage. But even compelling information is useless if we don't know what to make of it, so
we need a third stage: consequence. The information must illuminate one or more paths ahead. And finally,
the fourth stage: action. There must be a clear moment when the individual can recalibrate a behaviour,
make a choice and act. Then that action is measured, and the feedback loop can run once more, every
action stimulating new behaviours that inch us closer to our goals." Goetz, Thomas. "Harnessing the Power
of Feedback Loops." Wired.com. Conde Nast Digital, 19 June 2011.
<http://www.wired.com/magazine/2011/06/ff_feedbackloop/all/1>
intrinsic time dimensionality of the data, and that of the feedback loop that jointly define
its characteristic as real-time. (One could also add that the real-time nature of the data is
ultimately contingent on the analysis being conducted in real-time, and by extension,
where action is required, used in real-time.)
With respect to spatial granularity, finer is not necessarily better. Village or community
level data may in some cases be preferable to household or individual level data because
it can provide richer insights and better protect privacy. As for the time dimensionality,
any immediacy benchmark is difficult to set precisely, and will become outdated as
higher frequency data are made available in greater volumes and with a higher degree of
immediacy in the next few years. It must also be noted that real-time is an attribute that
doesn't last long: sooner or later, real-time data becomes contextual, i.e. non-actionable,
data. Examples include data made available on the spot about average rainfalls or prices,
or phone calls made over a relatively long period of time in the past (even a few months),
as well as the vast majority of official statistics, such as GDP or employment data.
Without getting caught up in semantics, it is important to recognise that Big Data for
Development is an evolving and expanding universe, best conceptualised in terms of
continuum and relativeness.
For purposes of discussion, Global Pulse has developed a loose taxonomy of types of
new, digital data sources that are relevant to global development:
(1) Data Exhaust: passively collected transactional data from people's use of digital
services like mobile phones, purchases, web searches, etc., and/or operational
metrics and other real-time data collected by UN agencies, NGOs and other aid
organisations to monitor their projects and programmes (e.g. stock levels, school
attendance); these digital services create networked sensors of human behaviour;
(2) Online Information: web content such as news media and social media
interactions (e.g. blogs, Twitter), news articles, obituaries, e-commerce, job
postings; this approach considers web usage and content as a sensor of human
intent, sentiments, perceptions, and wants;
(3) Physical Sensors: satellite or infrared imagery of changing landscapes, traffic
patterns, light emissions, urban development and topographic changes, etc.; this
approach focuses on remote sensing of changes in human activity;
(4) Citizen Reporting or Crowd-sourced Data: information actively produced or
submitted by citizens through mobile phone-based surveys, hotlines, user-generated
maps, etc.; while not passively produced, this is a key information source for
verification and feedback.
Yet another perspective breaks down the types of data that might be relevant to
international development by how it is produced or made available: by individuals, by the
public/development sector, or by the private sector (Figure 3).
Figure 3: Understanding the Dynamics of the Data Ecosystem
The new data ecosystem, as illustrated by the World Economic Forum, which we are in today includes
various data types, incentives, and requirements of actors.
Source: WEF White Paper, Big Data, Big Impact: New Possibilities for International Development
http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf
Then, the data that will eventually lend itself to analysis needs to be adequately prepared.
This step may include:
(1) Filtering, i.e. keeping instances and observations of relevance and discarding
irrelevant pieces of information;
(2) Summarising, i.e. extracting a keyword or set of keywords from a text;
(3) Categorising and/or turning the raw data into an appropriate set of indicators,
i.e. assigning a qualitative attribute to each observation where relevant, such as
negative vs. positive comments. Yet another option is simply to calculate
indicators from quantitative data, such as growth rates of price indices
(i.e. inflation). A minimal sketch of these steps follows.
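The following sketch illustrates the three preparation steps just listed on toy text messages. The keyword lists and messages are hypothetical stand-ins; real pipelines rely on far richer lexicons and language models.

```python
# A minimal sketch of the three steps above on toy text messages. Keyword
# lists and messages are hypothetical stand-ins for real lexicons and data.
import re
from collections import Counter

FOOD_TERMS = {"food", "price", "maize", "bread", "fuel"}      # filtering criterion
NEGATIVE_TERMS = {"cannot", "afford", "expensive", "hungry"}  # crude polarity cue

def prepare(messages):
    prepared = []
    for text in messages:
        tokens = re.findall(r"[a-z']+", text.lower())
        if not FOOD_TERMS & set(tokens):  # 1. filtering: keep only relevant items
            continue
        keywords = [w for w, _ in Counter(tokens).most_common(3)]  # 2. summarising
        label = ("negative" if NEGATIVE_TERMS & set(tokens)        # 3. categorising
                 else "neutral")
        prepared.append({"keywords": keywords, "label": label})
    return prepared

print(prepare(["Bread price up again, cannot afford food", "Nice weather today"]))
```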
Once the data is ready to be analysed, data analytics per se implies letting powerful
algorithms and computational tools dive into the data. A characteristic of these
algorithms is their ability to adapt their parameters in response to new streams of data, by
creating algorithms of their own to take care of parts of the data. This is necessary
because these advanced models (non-linear models with many heterogeneous interacting
elements) require more data to calibrate them with a data-driven approach.28
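One common realisation of this adaptive behaviour is online (incremental) learning, in which a model's parameters are updated batch by batch as new data streams in. The sketch below uses scikit-learn's SGDClassifier on synthetic data; it illustrates the general idea only, not any specific system mentioned in this paper.

```python
# A hedged illustration of "adapting as new data streams in": an incremental
# linear classifier updated one synthetic batch at a time.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()  # a linear classifier fit incrementally

for batch in range(10):  # each loop stands in for a new slice of streaming data
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic ground truth
    model.partial_fit(X, y, classes=[0, 1])  # parameters adapt to the new batch

print(model.coef_)  # coefficients after seeing all batches
```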
This intensive mining of socioeconomic data, known as "reality mining,"29 can shed light
on processes and interactions in the data that would not have appeared otherwise. Reality
mining can be done in three main ways:42
(1) Continuous data analysis over streaming data, using tools to scrape the Web to
monitor and analyse high-frequency online data streams, including uncertain,
inexact data. Examples include systematically gathering online product prices in
real-time for analysis;
(2) Online digestion of semi-structured and unstructured data, such as news
items, product reviews, etc., to shed light on hot topics, perceptions, needs and
wants;
(3) Real-time correlation of streaming data ("fast stream") with slowly accessible
historical data repositories. This terminology refers to mechanisms for
correlating and integrating real-time (fast streams) with historical records, in
order to deliver "a contextualised and personalised information space [that adds]
considerable value to the data, by providing (historical) context to new data."30
Big Data for Development could use all three techniques to various degrees depending on
the availability of data and the specific needs.
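As an illustration of technique (3), the sketch below contextualises a hypothetical "fast stream" of intraday market prices against a slowly changing historical baseline; all figures and field names are invented for the example.

```python
# Technique (3) in miniature: annotate real-time readings ("fast stream") with
# context drawn from a historical repository. All numbers are hypothetical.
historical_monthly_avg = {"jan": 100.0, "feb": 102.0, "mar": 105.0}

def contextualise(fast_stream, month):
    """Express each real-time price as a deviation from the month's baseline."""
    baseline = historical_monthly_avg[month]
    return [
        {"price": p, "pct_above_baseline": round(100 * (p - baseline) / baseline, 1)}
        for p in fast_stream
    ]

todays_ticks = [118.0, 121.5, 119.0]  # intraday price readings
print(contextualise(todays_ticks, "mar"))  # each tick placed in historical context
```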
Further, an important feature of Big Data analytics is the role of visualisation, which can
provide new perspectives on findings that would otherwise be difficult to grasp. For
example, word clouds (Figure 4), which display the set of words that have appeared in a
certain body of text (such as blogs, news articles or speeches) sized according to their
frequency, are a simple way of conveying at a glance which terms dominate a corpus.
Source: The full text of this paper; word cloud created using www.Wordle.net
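Word clouds simply render term frequencies as font sizes. The stand-in below, using only the Python standard library, computes the frequencies that a tool like Wordle then draws; the stopword list and sample text are illustrative.

```python
# Word clouds render term frequencies as font sizes; this standard-library
# stand-in computes the frequencies a tool like Wordle would draw.
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "for", "that"}

def top_terms(text, n=10):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)  # (word, frequency) pairs drive the font sizes

sample = ("Big Data for Development is about turning imperfect, complex, "
          "often unstructured data into actionable information.")
print(top_terms(sample))
```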
Graphic designer Paul Butt created a data visualisation that shows the top five timber exporting countries of
the world and maps out the countries to which timber is sold and at what cost.
Source: "Got Wood?" The Information is Beautiful Awards.
<http://www.informationisbeautifulawards.com/2012/03/who-buys-and-who-sells-wood/>
xiv Global Pulse has been working with the global community Visualizing.org to explore different styles
and methods for data visualization and information design. See "Data Channels," UN Global Pulse.
<http://www.visualizing.org/data/channels/un-global-pulse>
Twitter may serve a similar purpose. Computer scientists at Johns Hopkins University
analysed over 1.6 million health-related tweets (out of over 2 billion) posted in the US
between May 2009 and October 2010 using a sophisticated proprietary algorithm, and
found a 0.958 correlation between the flu rate they modelled based on their data and the
official flu rate (Figure 7).38
The prevalence and spread of other types of health conditions and ailments in a
community can also be analysed via Twitter data, including obesity and cancer.39
Information about Twitter users' location that they provide publicly can be used to study
the geographic spread of a disease or virus, as evidenced by a study of the H1N1
epidemic in the US.40
Figure 7: Influenza rate from August 2009 to October 2010 as measured by CDC FluView (the percentage
of specimens that tested positive for influenza) and ATAM+'s flu ailment (the normalised probability of
the flu ailment given the week): correlation coefficient of 0.958. (p. 4)
Source: "You Are What You Tweet: Analyzing Twitter for Public Health." M. J. Paul and M. Dredze,
2011. <http://www.cs.jhu.edu/%7Empaul/files/2011.icwsm.twitter_health.pdf>
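The 0.958 figure is a Pearson correlation between two weekly time series; computing such a coefficient is straightforward, as the sketch below shows on invented numbers.

```python
# The correlation above is a Pearson coefficient between two weekly series.
# The numbers below are invented solely to show the computation.
import numpy as np

official_flu = np.array([1.2, 1.5, 2.1, 3.8, 5.0, 4.2, 2.9, 1.8])  # e.g. a CDC rate
modelled_flu = np.array([1.1, 1.6, 2.3, 3.5, 4.8, 4.4, 3.0, 1.7])  # e.g. tweet-derived

r = np.corrcoef(official_flu, modelled_flu)[0, 1]
print(round(r, 3))  # a high r suggests the proxy tracks the official rate
```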
Another application for which Twitter is especially suited is to provide richer information
on health-related behaviours, perceptions, concerns, and beliefs. Important sociological
insights were gleaned from the Johns Hopkins study. These included detailed information
about the use and misuse of medication (such as self-prescribed antibiotics to treat the
flu, which are by nature irrelevant, and other over-the-counter misuse) and exercise
routines among various socioeconomic groupsrelevant for both research and policy.
Still in the realm of social media, a study showed that Facebook updates, photos and
posts could help identify drinking problems among college students, a major public
health concern.41 An important difference between Twitter and Facebook is, of course,
the fact that the latter is a closed network, whereas the vast majority of tweets are public,
thus theoretically (but not necessarily) accessible for analysis.
Other data streams can also be brought in or layered against social media data, notably to
provide geographical information: the HealthMap project, for example, compiles
disparate data from online news, eyewitness reports and expert-curated discussions, as
well as validated official reports, to "achieve a unified and comprehensive view of the
current global state of infectious diseases" that can be visualised on a map.42 In the
words of one of its creators, "[i]t's really taking the local-weather forecast idea and
making it applicable to disease."43 This may help communities prepare for outbreaks as
they do for storms.
Another example of the use of new data for policy purposes was Ushahidi'sxv use of
crowdsourcing following the earthquake that devastated Haiti, where a centralised text
messaging system was set up to allow cell-phone users to report on people trapped under
damaged buildings. Analysis of the data found that the concentration of aggregated text
messages was highly correlated with areas where the damaged buildings were
concentrated.xvi According to Ushahidi's Patrick Meier, these results were evidence of the
system's ability to predict, "with surprisingly high accuracy and statistical significance,
the location and extent of structural damage post-earthquake."
This section has aimed to illustrate how leveraging Big Data for Development can reduce
human inputs and time-lags in the production, collection, and transmission of
information, allowing more onus to be placed on analysis, interpretation, and the making
of informed, effective, evidence-based decisions from the data.
With all this available data, a growing body of evidence, and all the existing
technological and analytical capacity to use it, the question must be asked: why hasn't
Big Data for Development taken hold as a common practice?
The answer is that it is not easy. Section 2 turns to the challenges of using Big Data
for Development.
xv Ushahidi is a nonprofit tech company that was developed to map reports of violence in Kenya following
the 2007 post-election fallout. Ushahidi specializes in developing free and open source software for
information collection, visualization and interactive mapping. <http://ushahidi.com>
xvi Conducted by the European Commission's Joint Research Centre against data on damaged buildings
collected by the World Bank and the UN from satellite images, through spatial statistical techniques.
Source: Ensuring the Data-Rich Future of the Social Sciences. Science Magazine.
http://gking.harvard.edu/files/datarich.pdf
2.2 Analysis
Working with new data sources brings about a number of analytical challenges. The
relevance and severity of those challenges will vary depending on the type of analysis
being conducted, and on the type of decisions that the data might eventually inform. The
question "what is the data really telling us?" is at the core of any social science research
and evidence-based policymaking, but there is a general perception that new digital
data sources pose specific, more acute challenges. It is thus essential that these concerns
be spelled out in a fully transparent manner. The challenges are intertwined and difficult
to consider in isolation, but for the sake of clarity, they can be split into three distinct
categories: (1) getting the picture right, i.e. summarising the data; (2) interpreting, or
making sense of, the data through inferences; and (3) defining and detecting anomalies.54
xvii For example, Tom MacMaster, a Scotland-based heterosexual male, caused quite the controversy during
the 2011 Arab Spring when he regularly posted blogposts on the web in the persona of a lesbian woman in
Syria. The blog, "A Gay Girl in Damascus," managed to gather a growing following as MacMaster
published regularly between February and June, until it was revealed that the blog was a hoax. The incident
raised or intensified concerns about unverified user-generated information. While Tom MacMaster claimed
he used the hypothetical persona to try to illuminate the events in the Middle East for a western audience,
any reader would have been ultimately relying on fabricated information. Addley, Esther. "Syrian lesbian
blogger is revealed conclusively to be a married man." The Guardian. 12 Jun. 2011.
<http://www.guardian.co.uk/world/2011/jun/13/syrian-lesbian-blogger-tom-macmaster>
xviii False negatives refer to cases where some event of interest fails to be noticed.
xix The CDC's influenza-like-illness surveillance network reports the proportion of people who visit a
physician with flu-like symptoms (fever, fatigue and cough); the CDC's virologic surveillance system
reports the proportion of people who visit a physician and actually have lab-confirmed influenza. See also:
Liu, Bing. "Sentiment Analysis and Subjectivity." Handbook of Natural Language Processing 2 (2010):
1-38. Department of Computer Science at the University of Illinois at Chicago.
<http://www.cs.uic.edu/~liub/FBS/NLP-handbook-sentiment-analysis.pdf>
infections like SARS) that seems like the flu, it did not predict actual flu very well. The
mismatch was due to the presence of infections causing symptoms that resemble those of
influenza, and the fact that influenza is not always associated with influenza-like
symptoms. According to one of the researchers, "[t]his year, up to 40% of people with
pandemic flu did not have influenza-like illness because they did not have a fever (...)
Influenza-like illness is neither sensitive nor specific for influenza virus activity; it's a
good proxy, it's a very useful public-health surveillance system, but it is not as accurate
as actual nationwide specimens positive for influenza virus." This mismatch may have
important policy implications. Doctors using Google Flu Trends may overstock on flu
vaccines or be inclined to conclude that the flu-like symptoms of their patients are
attributable to influenza, even when they are not. Similar caveats apply to the analysis of
Tweets or calls to health hotlines.
Another challenge relates to sentiment analysis (or opinion mining). The term refers to
"the computational study of opinions, sentiments and emotions expressed in text"57 that
aims at "translating the vagaries of human emotion into hard data."58 Scraping blogs and
other social media content has become a common undertaking of corporations and
academic researchers. Sentiment analysis aims at finding out and quantifying whether,
how many, and how strongly people are happy vs. unhappy, pessimistic vs. optimistic,
what they like or dislike, support or reject, fear or look forward to, and any
shades of grey in between.
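To make the idea concrete, here is a deliberately naive lexicon-based polarity scorer. The word lists are toy assumptions, and, as the next paragraphs discuss, production systems must go far beyond this to handle negation, slang, sarcasm and intensity.

```python
# A deliberately naive lexicon-based polarity scorer; the word lists are toy
# assumptions, and real systems must handle negation, slang, sarcasm, etc.
import re

POSITIVE = {"good", "happy", "affordable", "hope", "improve"}
NEGATIVE = {"bad", "unhappy", "expensive", "fear", "worse"}

def polarity(text):
    """Return a score in [-1, 1]: +1 if all cues are positive, -1 if all negative."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

print(polarity("Food is so expensive, I fear things will get worse"))  # -1.0
print(polarity("Prices are affordable again, happy days"))             # 1.0
```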
Difficulties in conducting sentiment analysis can be organised in various ways.59 One
perspective distinguishes challenges related to conceptualisation (i.e. defining
categories, clusters), measurement (i.e. assigning categories and clusters to
unstructured data, or vice-versa) and verification (i.e. assessing how well steps 1 and 2
fare in extracting relevant information). Another focuses on the main challenges of
selecting target documents, identifying the overall sentiment expressed in the target
documents, and lastly "present[ing] these sentiments [...] in some reasonable summary
fashion."xx
Overall, the fundamental challenge is getting to the true intent of a statement, in terms of
polarity, intensity, etc. Many obstacles may impede this, from the use of slang, local
dialect, sarcasm, hyperbole, and irony, to the absence of any key words. These are, in a
sense, technical or measurement challenges that become easier to handle as the degree of
sophistication of sentiment analysis algorithms improves. But the conceptualisation and
classification phase of any analysis is non-trivial. It implies deciding, for instance,
whether what matters is the frequency or the mere presence of key word(s). Thus, the
human analyst's input is critical. "Classification is one of the most central and
generic of all our conceptual exercises. [...] Without classification, there could be no
advanced conceptualization, reasoning, language, data analysis, or, for that matter,
social science research."60
xx Options considered include (a) aggregation of "votes" that may be registered on different scales (e.g.,
one reviewer uses a star system, but another uses letter grades); (b) selective highlighting of some opinions;
(c) representation of points of disagreement and points of consensus; (d) identification of communities of
opinion holders; (e) accounting for different levels of authority among opinion holders. (Source: Pang and
Lee, 2008.)
Text mining goes beyond sentiment analysis to the extraction of key words or events. It
may involve scraping websites for facts such as deaths, job announcements or losses,
foreclosures, weddings, financial difficulties, etc. The difficulty here, again, is
extracting the true significance of the statements in which these facts are reported. For
example, if our hypothetical couple's son reports having lost a job (possibly out of two
or more), it is different from having lost one's only job (just like losing a part-time job is
different from losing a full-time job). Text mining may also involve looking for trending
topics in online news and other publications. This type of analysis (text categorisation)
is technically easier to conduct, but aggregating topics within larger clusters
also requires inputs from the analyst, as the sketch below suggests.
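The sketch shows text categorisation in miniature with scikit-learn: bag-of-words features feed a Naive Bayes classifier, while the topic labels, i.e. the analyst's input, are supplied by hand. The labelled examples are toy assumptions.

```python
# A minimal text-categorisation sketch: TF-IDF features plus Naive Bayes.
# Training examples and topic labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "maize prices doubled at the market",
    "fuel costs are rising sharply",
    "clinic reports more cases of cholera",
    "hospital sees surge in malnutrition",
]
train_topics = ["economy", "economy", "health", "health"]  # analyst-defined clusters

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_topics)

print(classifier.predict(["market prices are rising again"]))  # -> ['economy']
```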
A somewhat different problem is the fact that a significant share of new, digital data
sources are based on expressed intentions as revealed through blogposts, online searches,
or mobile-phone based information systems for checking market prices, etc., which may
be a poor indicator of actual intentions and ultimate decisions. An individual searching
for information about, or discussing, "moving to the UK" may have no intention of
moving to the UK at all, or may not follow through on his/her intentions even if he or she
initially did.
These examples are meant to illustrate just some of the difficulties in summarising facts
from user-generated text; the line between reported feelings and facts may not be as easy
to draw as it may seem, because "facts all come with points of view."61 With an
understanding of the kinds of issues related to the accuracy of the variety of new digital
data sources that make up Big Data, let us now turn to the challenge of interpretation.
Interpreting Data
In contrast to user-generated text, as described in the section above, some digital data
sources (transactional records, such as the number of microfinance loan defaults, the
number of text messages sent, or the number of mobile-phone based food vouchers
activated) are as close as it gets to indisputable, "hard" data. But whether or not the data
under consideration is thought to be accurate, interpreting it is never straightforward.
A frequently voiced concern is the sampling selection bias, i.e. the fact that the people
who use mobile or other digital services (and thus generate the real-time digital data
being considered for analysis) are not a representative sample of the larger population
considered. Cell-phones, computers, food vouchers, or microfinance products are neither
used at random nor used by the entire population. Depending on the type of data, one can
expect younger or older, wealthier or poorer, more males than females, and educated or
uneducated individuals to account for a disproportionate share of producers.xxi
For instance, "[c]hances are that older adults are not searching in Google in the same
proportion as a younger, more Internet-bound, generation."62 It is the combination of the
self-selection of individuals with specific attributes, and the fact that these attributes
affect their behaviours in other ways, that causes the bias.xxii There are other practical
reasons why the sample may be non-representative. For example, it could be that only
data from one mobile phone company is available for analysis. The resulting sample will
most likely be non-representative of either the population of mobile-phone holders or of
the population of the area. More traditional datasets are generally less subject to such a
bias.xxiii
xxi Note that the fact that some respondents in a representative sample may yield more data points than
others does not jeopardize the representativeness of the sample; one will simply infer that the same
behaviour will appear in the larger population from which the sample was drawn.
The problem with findings based on non-representative samples is that they lack external
validity.xxiv They cannot be generalised beyond the sample. Suppose that, in the village
where our hypothetical household lives, the number of cases of cholera reported through
a mobile-phone hotline doubles in a week following a rise in temperature. Because the
sample is not representative (since individuals using cell-phones tend to have specific
characteristics), one cannot infer that there are twice as many cases of cholera in the
entire community,xxv nor that in general a rise in temperature leads to more cases of
cholera. But while findings based on non-representative datasets need to be treated with
caution, they are not valueless, as discussed in section three.
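Footnote xxiii mentions pollsters who "redress" a non-representative sample with weights; the sketch below shows those mechanics on invented numbers, down-weighting a group that is over-represented among mobile users relative to an assumed known census share.

```python
# Post-stratification in miniature: mobile users skew young (70% of the sample)
# while the population is only 40% young, so young users' reports are
# down-weighted. All shares and reports are hypothetical.
population_share = {"young": 0.40, "old": 0.60}  # assumed known, e.g. from a census
sample_share = {"young": 0.70, "old": 0.30}      # observed among mobile users

weights = {g: population_share[g] / sample_share[g] for g in population_share}

# Hypothetical symptom reports (1 = reported) from the phone-based sample:
reports = [("young", 1), ("young", 0), ("young", 1), ("old", 0), ("old", 1)]

raw_rate = sum(v for _, v in reports) / len(reports)
weighted_rate = (sum(weights[g] * v for g, v in reports)
                 / sum(weights[g] for g, _ in reports))
print(weights)                   # {'young': 0.57..., 'old': 2.0}
print(raw_rate, weighted_rate)   # naive vs. reweighted estimate
```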
Further, with massive quantities of data there is a risk of focusing exclusively on finding
patterns or correlations and subsequently rushing to judgements without a solid
understanding of the deeper dynamics at play. Such a wealth of data tempts some
researchers to believe that they can see everything at a 30,000-foot view. It is the kind of
data that encourages the practice of apophenia: seeing patterns where none actually
exist, simply because massive quantities of data can offer connections that radiate in all
directions.63 It also intensifies the search for interesting correlations,64 which might
be (mis)interpreted as causal relationships.
The latter tendency is problematic with any type of data, but new data sources
xxii
Importantly, the sampling bias issue is only about the samplei.e. the individuals who are observed
not about the type or number of data they generate. Some individuals may generate many more
observations (such as phone calls, or visits to clinics) than others but as long as the difference is not due to
any other measurement problemsand the sample is representativethe data is unbiased.
xxiii
This can be avoided in one of three ways. The first, obviously, is to ensure complete coverage of the unit of
analysis. The second is to observe or survey individuals truly at random on a sufficiently large scale; for
instance, random samples can be constructed using a computer that picks individual-specific numbers at
random. The third, less rigorous, way is to purposefully build a representative sample ex-ante or ex-post;
the latter technique is used by pollsters, who redress the results obtained from a sample thought to be
non-representative by using weights.
xxiv
The term validity refers to the degree to which a conclusion or inference accurately describes a real-world
process. It is standard to contrast internal versus external validity. The former refers to the approximate
truth of inferences regarding cause-effect or causal relationships; in other words, the extent to which a
study demonstrates the existence of a causal relationship. The latter refers to the degree to which an
internally valid conclusion can be generalised to a different setting. A causal relationship is different from
a causal effect, which is the quantifiable difference between two potential outcomes if some initial
condition changed. For the purposes of this paper, internal validity is understood to encompass the extent
to which the existence of some recurrent pattern in the data (a stylised fact, a correlation) can be
established. (Sources: King, Gary and Powell, Eleanor. How Not to Lie Without Statistics. Harvard
University. National Science Foundation (2008) <http://gking.harvard.edu/files/mist.pdf> and Internal
Validity. Research Methods Knowledge Base. <http://www.socialresearchmethods.net/kb/intval.php>)
xxv
Some may treat this as an internal validity problem, but only because they wrongly conflate the
sample with the larger unit of analysis from which it is drawn.
searches since 2004 (Figure 9). It is unclear how one could cause the other.
Figure 9: Correlation Does Not Mean Causation
The problems highlighted here can have serious policy implications since
"[m]isunderstanding relationships in data (…) can lead to choosing less effective, more
expensive data instead of choosing obvious, more accurate starting points." In the
Ushahidi case, Fruchterman argued, if decision makers simply had a map [of
pre-earthquake buildings] they could have made better decisions more quickly, more
accurately, and with less complication.xxix
Another temptation when faced with the availability of massive amounts of disparate data
is to throw in a large number of controls in regression analysis to find "true" relationships
in the data. Doing so can strengthen the analysis's internal validity by netting out the
effects of other endogenous factors on the dependent variable, but it brings about several
potential challenges. First, it downplays the fact that combining data from multiple
sources may also mean "magnifying"66 their flaws. If the results are contingent on factors
specific to the unit of analysis, they also become hard to generalise to other settings with
vastly different mean values of the included controls: external validity is then weakened.
There is another econometric downside to throwing in a large number of controls
indiscriminately. If many of them are correlated, the resulting multicollinearity will lead
to non-unique parameter fits and more or less arbitrary parameter choices, meaning that
xxix
Note that the causal link in question does not imply that text messages cause actual damage, but that
more texts mean more damage because of the features of the system.
the results may be misleading. This suggests that theory and context matter even (or,
especially) with large amounts of data.
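The multicollinearity problem described above is easy to demonstrate. The sketch below (simulated data, standard numpy only) regresses an outcome on two nearly identical controls: the fit is fine, but how the effect is split between the two coefficients is close to arbitrary:

```python
# Minimal sketch of multicollinearity: two nearly identical controls
# yield unstable, near-arbitrary coefficient estimates even though the
# overall fit is good. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost perfectly collinear with x1
y = 2.0 * x1 + rng.normal(size=n)          # the true model involves x1 only

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted coefficients (intercept, x1, x2):", coef)

# A huge condition number of X'X diagnoses the problem: the parameter
# fit is numerically ill-determined.
print("condition number:", np.linalg.cond(X.T @ X))
```

Changing the random seed or dropping a few observations can swing the two coefficients wildly while leaving their sum, and the predictions, nearly unchanged, which is exactly why individually interpreted estimates can mislead.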
Transposing insights gleaned from static analysis across time and space is also a
challenge. Imagine that the prevalence of the phrase "runny nose" in Google searches
doubles over a week. Most data is subject to a confirmation or mimicking bias that creates
a self-fulfilling prophecy similar to the case of inflation: if the rise is widely covered in
the media, it is possible that more people will search for these terms. Various other events
may affect people's incentives to search for "runny nose": if a cholera epidemic broke out,
flu symptoms would become less of a concern; conversely, if the epidemic were brought
under control, people might suddenly become more concerned about runny noses.xxx In
other words, "Google's data are likely to be sensitive to factors that modify human
behaviour but which are not related to true disease rate," as noted by Elena Naumova,
Director of the Tufts University Initiative for the Forecasting and Modeling of Infectious
Diseases.67 The same observation applies to other data streams: hotlines, purchases,
Twitter, other user-generated content, etc. In addition, the individuals "surveyed" in
new data streams are usually different between two points in time (imagine how the
population of a large university town changes in the summer), such that changes in
behaviours may simply reflect this.
The challenge here lies partly in the fact that (as discussed in the previous section) a
significant share of the new data sources in question reflects perceptions, intentions, and
desires. Understanding the mechanisms by which people express perceptions, intentions
and desires, as well as how these differ between regions or linguistic cultures and change
over time, is hard. Other difficulties clearly find their root in the samples. Both internal
and external validity can be threatened, depending on the exact setup and question. It
can be difficult to establish a form of causality for a given area between two points in
time, or for two different areas at a given point in time. Even once this is established,
such that internal validity is ensured, the difficulty is to transpose the insight(s) to either a
different point in time or a different area, which relates to external validity.
Above and beyond the question of accuracy, these are some of the key challenges
associated with trying to draw valid inferences and conclusions from certain types of
digital data sources. It should be clear that while some of these challenges are specific to
dealing with new digital data sources, most are pervasive in any social science or
policymaking that relies on data.
As discussed in section three, the best remedy is to rely on analysts who are fully aware
of these limitations and keep the claims and decisions made on the basis of the data
within acceptable boundaries, which are wide enough.
Defining and Detecting Anomalies in Human Ecosystems
An overarching challenge when attempting to measure or detect anomalies in
human ecosystems is the characterization of (ab)normality. The nature and detection
of what may constitute socioeconomic anomalies differ, and indeed may be less clear
xxx
This observation echoes one made in Solzhenitsyn's One Day in the Life of Ivan Denisovich, whose
hero realizes that whenever his situation improves along some dimension, he notices a problem that had
until then been hidden, concealed by the graver concern.
xxxi
Conducting repeated tests on a dynamic system with a very low probability of making a type I error in
any single trip (or any single day) has a substantial probability of leading to such an error at least once
after n trips (or days). In contrast, a system that would have an unrealistically high probability of making a
type II error (missing an actual problem) in any given trip has a very low probability of missing a problem
after n trips, for any n sufficiently large. Under all realistic scenarios, a malfunction will eventually be
detected, such that in most cases it would seem prudent to balance the probabilities of making type I and
type II errors in favor of reducing the chances of type I errors. (Source: Box, George, Spencer Graves,
Soren Bisgaard, John Van Gilder, Ken Marko, John James, Mark Poublon and Frank Fodale. Detecting
Malfunctions in Dynamic Systems. Center for Quality and Productivity Improvement (1999): 1-10.
University of Wisconsin, Mar. 1999 (p. 4).)
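The arithmetic behind that note is worth spelling out: assuming independent tests, the probability of at least one false alarm after n tests is 1 - (1 - alpha)^n, which climbs towards certainty even for tiny per-test error rates. A minimal sketch (illustrative numbers only):

```python
# Probability of at least one type I error (false alarm) over n
# independent repeated tests, for a small per-test error rate.
alpha = 0.001  # per-test probability of a false alarm
for n in (1, 100, 1000, 5000):
    p_any = 1 - (1 - alpha) ** n
    print(f"n = {n:>4} tests -> P(at least one false alarm) = {p_any:.3f}")
```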
In other words, X and Y can be used as proxy indicators, even if no causality is claimed. As
noted by Google Chief Economist Hal Varian, "even if all you've got is a
contemporaneous correlation, you've got a 6-week lead on the reported values. The hope
is that as you take the economic pulse in real time, you will be able to respond to
anomalies more quickly."73
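A minimal sketch of what such "nowcasting" looks like in practice: regress the lag-published official series on the real-time proxy over their overlapping history, then apply the fit to the latest proxy reading. The series and numbers below are hypothetical, not drawn from any cited study:

```python
# Minimal nowcasting sketch: use a contemporaneous correlation between
# a real-time proxy and a lag-published official statistic to produce
# an early estimate. All values are hypothetical.
import numpy as np

official = np.array([2.1, 2.3, 2.2, 2.6, 2.9, 3.1])  # published with a lag
proxy = np.array([110, 118, 115, 131, 144, 152])     # observable in real time

# Fit official ~ a + b * proxy on the overlapping history.
b, a = np.polyfit(proxy, official, 1)

latest_proxy = 160.0  # today's proxy; the official figure is not out yet
print(f"nowcast of the official statistic: {a + b * latest_proxy:.2f}")
```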
The temptation to find any correlations in big datasets must certainly be kept in check to
avoid misinterpretations and abuses, but there are many cases where correlations are
relevant. In some cases, new data sources may mirror official statistics, offering cheaper
and faster proxies. For example, as noted above, MIT researchers have been estimating
inflation by collecting and analysing the daily price of goods sold or advertised on the
web with impressive accuracy.74 The key value-add of this method is that online prices
can be obtained daily, whilst consumer price indices in most countries are only published
on a monthly basis. Thus, this approach may help detect inflation spikes sooner than
traditional methods, or offer new insights into the transmission of price fluctuations to
various goods and areas.
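As a rough illustration of how scraped prices can be chained into a daily index (a toy sketch only, not the Billion Prices Project's actual methodology), one can average the day-on-day log price changes of a fixed basket of products:

```python
# Toy daily price index from scraped prices: average the daily log
# price change across a fixed basket, then chain into an index.
# Prices are hypothetical; rows are days, columns are products.
import numpy as np

prices = np.array([
    [10.0, 4.0, 25.0],
    [10.1, 4.0, 25.5],
    [10.2, 4.1, 25.4],
    [10.4, 4.1, 26.0],
])

daily_log_change = np.diff(np.log(prices), axis=0).mean(axis=1)
index = 100 * np.exp(np.concatenate([[0.0], np.cumsum(daily_log_change)]))
print("daily price index:", np.round(index, 2))  # base day = 100
```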
Beyond correlations, analysing large quantities of data can help unveil "stylised facts", i.e.
broadly recurring behaviours and patterns. Stylised facts should not be considered laws
that always hold true, but they give a sense of the likelihood that some deviation
from the trend may occur. As such, they form the basis of anomaly detection. When it
comes to defining and detecting anomalies, important and promising work is being done.
For example, researchers at the International Food Policy Research Institute (IFPRI) have
developed a methodology to detect "excessive food price volatility, (…) i.e. a period of
time in which a large number of extreme positive returns" (usually defined as a value of
return that is exceeded with low probability: 5% or 1%)75 in order to determine
appropriate country-level food security responses, such as the release of physical food
stocks. Similar methods can be applied to the detection of anomalies in how community
members use their cell-phones, sell their livestock, etc.
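The core of such a method can be sketched very simply: compute returns on the price series and flag those that exceed a high quantile. IFPRI's published model is considerably more sophisticated; the cut-off rule below, applied to simulated prices, only illustrates the idea of "extreme positive returns":

```python
# Minimal sketch: flag extreme positive returns in a (simulated) daily
# food price series using a 95th-percentile threshold.
import numpy as np

rng = np.random.default_rng(1)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=365)))  # hypothetical prices
returns = np.diff(np.log(prices))

threshold = np.quantile(returns, 0.95)
extreme_days = np.flatnonzero(returns > threshold)
print(f"threshold: {threshold:.4f}; days flagged: {extreme_days.size}")
# A period with an unusually dense cluster of flagged days would count
# as excessive volatility and could trigger a policy response.
```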
Access to large-scale sources of real time data can help save lives. The United States
Geological Survey (USGS) has developed a system that monitors Twitter for significant
spikes in the volume of messages about earthquakes.76 Location information is then
extracted and passed on to USGS's team of seismologists to verify that an earthquake
occurred, locate its epicentre and quantify its magnitude. In practice, 90% of the
reports that trigger an alert have turned out to be valid. Similarly, a recent retrospective
analysis of the 2010 cholera outbreak in Haiti, conducted by researchers at Harvard
Medical School and Children's Hospital Boston, demonstrated that mining Twitter and
online news reports could have provided health officials with a highly accurate indication
of the actual spread of the disease, with a two-week lead time.77
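A volume-spike detector of the kind such systems rely on can be sketched in a few lines. The USGS pipeline itself is not described in detail in this paper, so the trailing-window z-score rule below is an illustrative assumption, applied to hypothetical hourly counts:

```python
# Minimal sketch: flag hours whose message count exceeds the trailing
# mean by several trailing standard deviations.
import numpy as np

def detect_spikes(counts, window=24, z_threshold=5.0):
    counts = np.asarray(counts, dtype=float)
    spikes = []
    for t in range(window, len(counts)):
        baseline = counts[t - window:t]
        mu, sigma = baseline.mean(), baseline.std()
        if sigma > 0 and (counts[t] - mu) / sigma > z_threshold:
            spikes.append(t)  # hour t is a spike relative to its recent past
    return spikes

# Hypothetical hourly counts of earthquake-related messages, with one burst.
rng = np.random.default_rng(2)
hourly = list(rng.poisson(20, size=48)) + [400] + list(rng.poisson(25, size=5))
print("spikes detected at hours:", detect_spikes(hourly))
```

In a real deployment the flagged location and time would then be handed to analysts for verification, as in the USGS workflow described above, rather than acted on automatically.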
Further, as various examples have shown, it is not just the size and speed but also
the nature, the richness, of the information that many new data streams contain that
has great value. In many cases Big Data for Development is not meant to replace or act
as a proxy for official statistics, but to complement them by adding depth and nuance, as
the aforementioned Johns Hopkins study showed. Indeed, even in the field of
syndromic surveillance, experts agree that new data streams (Tweets in that case) are
not accurate enough to replace traditional methods of sentinel surveillance.78 But the
same experts add: "[t]here is a lot of potential to learn so much about people that they
don't share with their doctors." The more qualitative social media information helps
paint a picture that quickly reacts to changing conditions. In turn, all of this information
can be factored in to affect these very same conditions in a much more agile way.
It is clear from these examples that it is the combination of the size, speed, and nature of
the data that makes it so valuable for affecting certain outcomes.
A recent joint research project between Global Pulse and social media analysis firm
Crimson Hexagon analysed over 14 million Tweets related to food, fuel, and housing in
the US and Indonesia to better understand people's concerns.79 The research tracked
trends in these topics in conjunction with themes such as "afford", showing how the
volume and topics of the conversations changed over time, reflecting populations'
concerns. Interestingly, changes in the number of Tweets mentioning the price of rice and
actual food price inflation (official statistics) in Indonesia proved to be closely correlated
(Figure 9). Another collaborative research project, between Global Pulse and the SAS
Institute, analysing unemployment through the lens of social media in the US and Ireland,
revealed that increases in the volume of employment-related conversations on public
blogs, online forums and news in Ireland characterised by the sentiment "confusion"
showed up three months before official increases in unemployment, while in
the US conversations about the loss of housing increased two months after
unemployment spikes (Figure 10).80 Similar research could be conducted in developing
countries with high Internet penetration, such as Indonesia or Brazil.
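The lead/lag structure reported in these projects can be probed with a simple cross-correlation at different offsets. The sketch below uses hypothetical monthly series (not the Global Pulse project data) to show the mechanics:

```python
# Minimal sketch: correlate a social-media volume series with an
# official statistic at different offsets to see which one leads.
# Both series are hypothetical monthly values.
import numpy as np

official = np.array([4.0, 4.1, 4.3, 4.8, 5.6, 6.5, 7.1, 7.3, 7.2, 7.0])
chatter = np.array([90, 120, 180, 260, 310, 330, 320, 300, 280, 260])

def lagged_corr(x, y, lag):
    # Correlation of x with y shifted `lag` steps later (lag > 0: x leads y).
    if lag > 0:
        return np.corrcoef(x[:-lag], y[lag:])[0, 1]
    if lag < 0:
        return np.corrcoef(x[-lag:], y[:lag])[0, 1]
    return np.corrcoef(x, y)[0, 1]

for lag in range(-3, 4):
    print(f"lag {lag:+d}: r = {lagged_corr(chatter, official, lag):+.2f}")
```

A correlation peaking at a positive lag would suggest the online conversation leads the official statistic, as in the Irish example; a peak at a negative lag would indicate a lagging indicator, as in the US housing example.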
Figure 9: Tweets about the price of rice vs. actual price of rice in Indonesia
Source: Twitter and Perceptions of Crisis-Related Stress. UN Global Pulse.
<http://www.unglobalpulse.org/projects/twitter-and-perceptions-crisis-related-stress>
Figure 10: Infographic depicting some of the leading and lagging indicators of
unemployment spikes, as discovered through a joint SAS-Global Pulse research
project examining unemployment through the lens of social media
Properly analysed, Big Data offers the opportunity for an improved understanding of
human behaviour that can support the field of global development in three main ways:
1) Early warning: early detection of anomalies in how populations use digital devices
and services can enable faster response in times of crisis;
2) Real-time awareness: Big Data can paint a fine-grained and current representation
of reality which can inform the design and targeting of programs and policies;
3) Real-time feedback: the ability to monitor a population in real time makes it
possible to understand where policies and programs are failing and make the
necessary adjustments.
These applications are highly promising. But, as has been emphasised over and over,
there is nothing automatic, let alone simple, about turning Big Data sources into
actionable information in development contexts.
3.2. Making Big Data Work for Development
Contextualisation is Key
The examples and arguments presented so far have all underscored the importance of
contextualisation, understood in two complementary ways.
1) Data context. Indicators should not be interpreted in isolation. If one is
concerned with anomaly detection, it is not so much the occurrence of one
seemingly unusual fact or trend that should raise concern, but the joint occurrence
of two, three or more;
His point was that real-time information does not replace the quantitative statistical
evidence which governments traditionally use for decision making, but, if understood
correctly, it can inform where further targeted investigation is necessary (in less
time-critical situations) or even inform immediate response (in disaster situations,
such as tornadoes), and thus change outcomes like nothing else can.
Abiding by these guiding principles should allow Big Data for Development to meet its
ultimate objective: to help policymakers and development practitioners gain richer and
timelier insights into the experiences of vulnerable communities and implement
better-informed and more agile interventions.
xxxii
Moore's Law (an observation made by Intel co-founder Gordon Moore in 1965) predicts that computer
chips shrink by half in size, and processors double in complexity, every two years. It is often referred to as
a rule of thumb for understanding exponential improvement, although some argue that the world will soon
see a period where progress in technology outpaces Moore's prediction of two-year cycles.
Endnotes
1
Referenced by Nathan Eagle in a video interview for UN Global Pulse, July 2011. The term seems
to have been originally coined by Joe Hellerstein, a computer scientist at the University of California,
Berkeley. <http://www.economist.com/node/15557443>
2
Onnela, Jukka-Pekka. Social Networks and Collective Human Behavior. UN Global Pulse. 10 Nov.
2011. <http://www.unglobalpulse.org/node/14539>
3
Lohr, Steve. The Age of Big Data. New York Times. 11 Feb. 2012.
<http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?_r=2&pagewanted=all>
4
Kirkpatrick, Robert. Digital Smoke Signals. UN Global Pulse. 21 Apr. 2011.
<http://www.unglobalpulse.org/blog/digital-smoke-signals>
5
The Data Deluge. The Economist. 25 Feb 2010. <http://www.economist.com/node/15579717> and
Ammirati, Sean. Infographic: Data Deluge 8 Zettabytes of Data by 2015. Read Write Enterprise.
<http://www.readwriteweb.com/enterprise/2011/11/infographic-data-deluge---8-ze.php>
6
King, Gary. Ensuring the Data-Rich Future of Social Science. Science Mag 331 (2011) 719-721. 11
Feb, 2011 Web. <http://gking.harvard.edu/sites/scholar.iq.harvard.edu/files/gking/files/datarich_0.pdf>
7
Helbing, Dirk and Stefano Balietti. From Social Data Mining to Forecasting Socio-Economic Crises.
Arxiv (2011) 1-66. 26 July 2011 http://arxiv.org/pdf/1012.0178v5.pdf.
8
Manyika, James, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh and
Angela H. Byers. Big data: The next frontier for innovation, competition, and productivity. McKinsey
Global Institute (2011): 1-137. May 2011.
<http://www.mckinsey.com/mgi/publications/big_data/pdfs/MGI_big_data_full_report.pdf>
9
World Population Prospects, the 2010 Revision. United Nations, Department of Economic and Social
Affairs, Population Division. <http://esa.un.org/unpd/wpp/unpp/panel_population.htm>
10
Data Exhaust. <http://www.wordspy.com/words/dataexhaust.asp>
11
Cornu, Celine. Mobile Banking Moving Through Developing Countries. Jakarta Globe. 21 Feb. 2010.
<http://www.thejakartaglobe.com/business/mobile-banking-moving-through-developing-countries/359920>
12
Global Internet Usage by 2015 [Infographic]. Alltop. <http://holykaw.alltop.com/global-internet-usage-by-2015-infographic?tu3=1>
13
Rao, Dr. Madanmohan. Mobile Africa Report: Regional Hubs of Excellence and Innovation. Mobile
Monday (2011): 1-68. Mar. 2011. <http://www.mobilemonday.net/reports/MobileAfrica_2011.pdf>
14
Big Data, Big Impact: New Possibilities for International Development. World Economic Forum
(2012): 1-9. Vital Wave Consulting. Jan. 2012. <http://www.weforum.org/reports/big-data-big-impact-new-possibilities-international-development>
15
Toyama, Kentaro. Can Technology End Poverty? Boston Review. Dec 2010.
<http://bostonreview.net/BR35.6/toyama.php>
16
OECD, Future Global Shocks, Improving Risk Governance, 2011
17
Global Monitoring Report 2009: A Development Emergency. Rep. Washington DC: International Bank
for Reconstruction and Development / The World Bank, 2009.
<http://siteresources.worldbank.org/INTGLOMONREP2009/Resources/5924349-1239742507025/GMR09_book.pdf>
18
Economy: Global Shocks to Become More Frequent, Says OECD. Organisation for Economic
Co-operation and Development. 27 June 2011.
<http://www.oecd.org/document/15/0,3746,en_21571361_44315115_48252559_1_1_1_1,00.html>
19
FAO, IFAD, IMF, OECD, UNCTAD, WFP, the World Bank, the WTO, IFPRI and the UN HLTF.
Price Volatility in Food and Agricultural Markets: Policy Responses. 2 June, 2011.
<http://www.foodsecurityportal.org/sites/default/files/g20_interagency_report_food_price_volatility.pdf>
20
For a recent review, see: Fuentes, Nieva Ricardo, and Papa A. Seck. Risks, Shocks and Human
Development: On the Brink. Basingstoke, England: Palgrave Macmillan, 2010.
21
Almond, Douglas, Lena Edlund, Hongbin Li and Junsen Zhang. Long-term Effects of the 1959-1961
China Famine: Mainland China and Hong Kong Working Paper no. 13384. National Bureau of Economic
Research, Sept. 2007. <http://www.nber.org/papers/w13384.pdf>
22
Friedman, Jed, and Norbert Schady. How Many More Infants Are Likely to Die in Africa as a Result of
the Global Financial Crisis? Rep. The World Bank.
<http://siteresources.worldbank.org/INTAFRICA/Resources/AfricaIMR_FriedmanSchady_060209.pdf>
23
Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute,
June 2011.
http://www.mckinsey.com/mgi/publications/big_data/pdfs/MGI_big_data_full_report.pdf
24
Burke, J., D. Estrin, M. Hansen, A. Parker, N. Ramanthan, S. Reddy and M.B. Srivastava. Participatory
Sensing. Rep. Escholarship, University of California, 2006. <http://escholarship.org/uc/item/19h777qd>.
25
Opening remarks to UN Global Pulse briefing presentation to the UN General Assembly, 8 Nov. 2011.
<http://youtu.be/lbmsDH8RJA4>
26
Kenya-Somalia: The Nitty-Gritty of Flight. Irin News Africa. 23 Aug. 2011.
<http://www.irinnews.org/report.aspx?reportid=93564>.
27
Bollier, David. The Promise and Peril of Big Data. The Aspen Institute, 2010.
<http://www.aspeninstitute.org/publications/promise-peril-big-data>
28
Helbing, Dirk and Stefano Balietti. From Social Data Mining to Forecasting Socio-Economic Crises.
29
Eagle, Nathan and Alex (Sandy) Pentland. "Reality Mining: Sensing Complex Social Systems",
Personal and Ubiquitous Computing, 10.4 (2006): 255-268.
<http://reality.media.mit.edu/pdfs/realitymining.pdf>
30
Helbing and Balietti, From Social Data Mining to Forecasting Socio-Economic Crises.
31
Hotz, Robert Lee. The Really Smart Phone. The Wall Street Journal. 22 Apr. 2011.
<http://online.wsj.com/article/SB10001424052748704547604576263261679848814.html>
32
Alex Pentland cited in When Theres No Such Thing As Too Much Information. The New York Times.
23 Apr. 2011
<http://www.nytimes.com/2011/04/24/business/24unboxed.html?_r=1&src=tptw>.
33
Nathan Eagle also cited in When Theres No Such Thing As Too Much Information. The New York
Times. 23 Apr. 2011.
<http://www.nytimes.com/2011/04/24/business/24unboxed.html?_r=1&src=tptw>.
34
Helbing and Balietti. From Social Data Mining to Forecasting Socio-Economic Crises.
35
Ginsberg, Jeremy, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and
Larry Brilliant. Detecting Influenza Epidemics Using Search Engine Query Data. Nature 457.7232
(2009): 1012-1014.
<http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/papers/detecting-influenza-epidemics.pdf>
36
Eysenbach, G. Infodemiology: Tracking Flu-related Searches on the Web for Syndromic Surveillance.
AMIA (2006). <http://yi.com/home/EysenbachGunther/publications/2006/eysenbach2006c-infodemiology-amia-proc.pdf>
37
Syndromic Surveillance (SS). Centers for Disease Control and Prevention. 06 Mar. 2012.
<http://www.cdc.gov/ehrmeaningfuluse/Syndromic.html>.
38
Paul, M.J. and M. Dredze. You Are What You Tweet: Analyzing Twitter for Public Health. Rep. Center
for Language and Speech Processing at Johns Hopkins University, 2011.
<http://www.cs.jhu.edu/%7Empaul/files/2011.icwsm.twitter_health.pdf>
39
Eke, P.I.. Using Social Media for Research and Public Health Surveillance. (Abstract). Journal of
Dental Research 90.9 (2011). <http://jdr.sagepub.com/content/early/2011/07/15/0022034511415273>
40
Signorini, Alessio, Alberto M. Segre, and Philip M. Polgreen. The Use of Twitter to Track Levels of
Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic. PLoS ONE 6.5
(2011). PubMed. 4 May 2011. <http://www.ncbi.nlm.nih.gov/pubmed/21573238>
41
Moreno, Megan A., Dimitri A. Christakis, Katie G. Egan, Libby N. Brockman and Tara Becker.
Associations Between Displayed Alcohol References on Facebook and Problem Drinking Among College
Students. Archives of Pediatrics and Adolescent Medicine (2011). <http://archpedi.ama-assn.org/cgi/content/abstract/archpediatrics.2011.180v1?maxtoshow=&hits=10&RESULTFORMAT=&fulltext=facebook&searchid=1&FIRSTINDEX=0&resourcetype=HWCIT>
42
Health Map <http://healthmap.org/en/>
43
Walsh, Bryan. Outbreak.com: Using the Web to Track Deadly Diseases in Real Time. Time Science.
16 Aug. 2011. <http://www.time.com/time/health/article/0,8599,2088868,00.html>
44
Helbing and Balietti. From Social Data Mining to Forecasting Socio-Economic Crises.
45
Ibid.
46
Ibid.
47
Efrati, Amir. Like Button Follows Web Users. The Wall Street Journal. 18 May 2011.
<http://online.wsj.com/article/SB10001424052748704281504576329441432995616.html>
48
Boyd, Danah and Crawford, Kate. Six Provocations for Big Data. Working Paper - Oxford Internet
Institute. 21 Sept. 2011. <http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431>
49
A. Narayanan and V. Shmatikov (2008)
50
Both examples gleaned from conversations and consultations with Global Pulse staff.
51
Both examples gleaned from conversations and consultations with Global Pulse staff.
52
Kirkpatrick, Robert. Data Philanthropy: Public and Private Sector Data Sharing for Global Resilience.
62
Google Flu Trends Do Not Match CDC Data. Popular Mechanics. 17 Mar. 2010.
<http://www.popularmechanics.com/science/health/med-tech/google-flu-trends-cdc-data>
63
Boyd, Danah and Crawford, Kate. Six Provocations for Big Data. Working Paper - Oxford Internet
Institute. 21 Sept. 2011. <http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431>
64
Boyd, Danah and Crawford, Kate.
65
Fruchterman, Jim. Issues with Crowdsourced Data Part 2. Beneblog: Technology Meets Society. 28
Mar. 2011. <http://benetech.blogspot.com/2011/03/issues-with-crowdsourced-data-part-2.html>; Ball,
Patrick, Jeff Klingner, and Kristian Lum. Crowdsourced Data is Not a Substitute for Real Statistics.
Beneblog. 17 Mar. 2011. <http://benetech.blogspot.com/2011/03/crowdsourced-data-is-not-substitute-for.html>
66
Bollier, David. The Promise and Peril of Big Data, pg. 13
67
Wenner, Melinda. Google Flu Trends Do Not Match CDC Data. Popular Mechanics. 17 May 2010.
<http://www.popularmechanics.com/science/health/med-tech/google-flu-trends-cdc-data>
68
Twenty Years Later, Timisoara Affairs Exposes Media Credulity. France 24. 22 Dec. 2009.
<http://www.france24.com/en/20091220-twenty-years-later-timisoara-affair-exposes-media-credulity>
69
Lustig, Robin. Why were we fooled by the fake Syria blog? BBC News. 13 June 2011.
<http://www.bbc.co.uk/blogs/worldtonight/2011/06/why_were_we_fooled_by_the_fake.html>
70
Murray, Alex. BBC Processes for Verifying Social Media Content. BBC News. 18 May 2011.
<http://www.bbc.co.uk/journalism/blog/2011/05/bbcsms-bbc-procedures-for-veri.shtml>
71
Referenced by Dr. Alberto Cavallo (PriceStats) in research consultations with UN Global Pulse, 2011.
72
Lise Getoor cited in David Bollier's The Promise and Peril of Big Data, pg. 16.
73
Bollier, David. The Promise and Peril of Big Data.
74
Cavallo, Alberto. BPP and PriceStats. The Billion Prices Project and MIT. 6 May 2011.
<http://bpp.mit.edu/bpp-and-pricestats/>
75
Excessive Food Price Variability Early Warning System. Food Security Portal.
76
Shaking and Tweeting: The USGS Twitter Earthquake Detection Program. USGS. 14 December 2009.
<http://gallery.usgs.gov/audios/326#.T7um0r9J_sU>
77
Chunara, R., Andrews, J., and Brownstein, J. Social and News Media Enable Estimation of
Epidemiological Patterns Early in the 2010 Haitian Cholera Outbreak. American Journal of Tropical
Medicine and Hygiene. 2012 86:39-45. <http://www.ajtmh.org/content/86/1/39.abstract>
78
You Are What You Tweet: Analyzing Twitter for Public Health.
<http://www.cs.jhu.edu/~mpaul/files/2011.icwsm.twitter_health.pdf>
79
Twitter and Perceptions of Crisis-Related Stress. UN Global Pulse.
<http://www.unglobalpulse.org/projects/twitter-and-perceptions-crisis-related-stress>
80
Unemployment Through the Lens of Social Media. UN Global Pulse.
<http://www.unglobalpulse.org/projects/can-social-media-mining-add-depth-unemployment-statistics>
81
Fruchterman, Jim. More on Using Crowdsourced Data to Find Big Picture Patterns (Take 3).
Beneblog: Technology Meets Society. 7 Apr. 2011.
<http://benetech.blogspot.com/2011/04/more-on-using-crowdsourced-data-to-find.html>
82
http://tech.state.gov/profiles/blogs/tech-state-real-time-awareness-agenda
83
Video of Craig Fugate's keynote address at US State Department Tech@State event, 3 February 2012:
<http://www.livestream.com/techstate/video?clipId=pla_a1cef922-7b17-400b-b67c-a21cb877b1f9&utm_source=lslibrary&utm_medium=ui-thumb>
84
Stelter, Leischen. FEMA's Fugate Says Social Media is Valuable, but No Tweet Stops the Bleeding.
In Public Safety. 16 Feb. 2012. <http://inpublicsafety.com/2012/02/femas-fugate-says-social-media-is-valuable-but-no-tweet-stops-the-bleeding/>
85
Gray, Jim (ed. Gray, J., Tansley, S. and Tolle, K.). eScience: A Transformed Scientific Method. The
Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research. Redmond, Washington, 2009.
<http://research.microsoft.com/en-us/collaboration/fourthparadigm/contents.aspx>