The Economist Data Data Everywhere
A special report on managing information
{Contents}
.:/p03 The data deluge
Businesses, governments and society are only starting to tap its vast potential
.:/p09 Show me
New ways of visualising data
Technology
.:/p03
Reprinted from The Economist February 27th 2010
services. And governments are belatedly coming around to the idea of putting more information—such as crime figures, maps, details of government contracts or statistics about the performance of public services—into the public domain. People can then reuse this information in novel ways to build businesses and hold elected officials to account. Companies that grasp these new opportunities, or provide the tools for others to do so, will prosper. Business intelligence is one of the fastest-growing parts of the software industry.

Now for the bad news
But the data deluge also poses risks. Examples abound of databases being stolen: disks full of social-security data go missing, laptops loaded with tax records are left in taxis, credit-card numbers are stolen from online retailers. The result is privacy breaches, identity theft and fraud. Privacy infringements are also possible even without such foul play: witness the periodic fusses when Facebook or Google unexpectedly change the privacy settings on their online social networks, causing members to reveal personal information unwittingly. A more sinister threat comes from Big Brotherishness of various kinds, particularly when governments compel companies to hand over personal information about their customers. Rather than owning and controlling their own personal data, people very often find that they have lost control of it.

The best way to deal with these drawbacks of the data deluge is, paradoxically, to make more data available in the right way, by requiring greater transparency in several areas. First, users should be given greater access to and control over the information held about them, including whom it is shared with. Google allows users to see what information it holds about them, and lets them delete their search histories or modify the targeting of advertising, for example. Second, organisations should be required to disclose details of security breaches, as is already the case in some parts of the world, to encourage bosses to take information security more seriously. Third, organisations should be subject to an annual security audit, with the resulting grade made public (though details of any problems exposed would not be). This would encourage companies to keep their security measures up to date.

Market incentives will then come into play as organisations that manage data well are favoured over those that do not. Greater transparency in these three areas would improve security and give people more control over their data without the need for intricate regulation that could stifle innovation. After all, the process of learning to cope with the data deluge, and working out how best to tap it, has only just begun.

© The Economist Newspaper Limited, London (2010)
.:/p04
economic value, provide fresh insights into science and hold governments to account.

But they are also creating a host of new problems. Despite the abundance of tools to capture, process and share all this information—sensors, computers, mobile phones and the like—it already exceeds the available storage space (see chart 1). Moreover, ensuring data security and protecting privacy is becoming harder as the information multiplies and is shared ever more widely around the world.

Alex Szalay, an astrophysicist at Johns Hopkins University, notes that the proliferation of data is making them increasingly inaccessible. “How to make sense of all these data? People should be worried about how we train the next generation, not just of scientists, but people in government and industry,” he says.

“We are at a different period because of so much information,” says James Cortada of IBM, who has written a couple of dozen books on the history of information in society. Joe Hellerstein, a computer scientist at the University of California in Berkeley, calls it “the industrial revolution of data”. The effect is being felt everywhere, from business to science, from government to the arts. Scientists and computer engineers have coined a new term for the phenomenon: “big data”.

Epistemologically speaking, information is made up of a collection of data and knowledge is made up of different strands of information. But this special report uses “data” and “information” interchangeably because, as it will argue, the two are increasingly difficult to tell apart. Given enough raw data, today’s algorithms and powerful computers can reveal new insights that would previously have remained hidden.

The business of information management—helping organisations to make sense of their proliferating data—is growing by leaps and bounds. In recent years Oracle, IBM, Microsoft and SAP between them have spent more than $15 billion on buying software firms specialising in data management and analytics. This industry is estimated to be worth more than $100 billion and growing at almost 10% a year, roughly twice as fast as the software business as a whole.

Chief information officers (CIOs) have become somewhat more prominent in the executive suite, and a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

More of everything
There are many reasons for the information explosion. The most obvious one is technology. As the capabilities of digital devices soar and prices plummet, sensors and gadgets are digitising lots of information that was previously unavailable. And many more people have access to far more powerful tools. For example, there are 4.6 billion mobile-phone subscriptions worldwide (though many people have more than one, so the world’s 6.8 billion people are not quite as well supplied as these figures suggest), and 1 billion-2 billion people use the internet.

Moreover, there are now many more people who interact with information. Between 1990 and 2005 more than 1 billion people worldwide entered the middle class. As they get richer they become more literate, which fuels information growth, notes Mr Cortada. The results are showing up in politics, economics and the law as well. “Revolutions in science have often been preceded by revolutions in measurement,” says Sinan Aral, a business professor at New York University. Just as the microscope transformed biology by exposing germs, and the electron microscope changed physics, all these data are turning the social sciences upside down, he explains. Researchers are now able to understand human behaviour at the population level rather than the individual level.

The amount of digital information increases tenfold every five years. Moore’s law, which the computer industry now takes for granted, says that the processing power and storage capacity of computer chips double or their prices halve roughly every 18 months. The software programs are getting better too. Edward Felten, a computer scientist at Princeton University, reckons that the improvements in the algorithms driving computer applications have played as important a part as Moore’s law for decades.

A vast amount of that information is shared. By 2013 the amount of traffic flowing over the internet annually will reach 667 exabytes, according to Cisco, a maker of communications gear. And the quantity of data continues to grow faster than the ability of the network to carry it all.

People have long groused that they were swamped by information. Back in 1917 the manager of a Connecticut manufacturing firm complained about the effects of the telephone: “Time is lost, confusion results and money is spent.” Yet what is happening now goes way beyond incremental growth. The quantitative change has begun to make a qualitative difference.

This shift from information scarcity to surfeit has broad effects. “What we are seeing is the ability to have economies form around the data—and that to me is the big change at a societal and even macroeconomic level,” says Craig Mundie, head of research and strategy at Microsoft. Data are becoming the new raw material of business: an economic input almost on a par with capital and labour. “Every day I wake up and ask, ‘how can I flow data better, manage data better, analyse data better?’” says Rollin Ford, the CIO of Wal-Mart.

Sophisticated quantitative analysis is being applied to many aspects of life, not just missile trajectories or financial
.:/p05
hedging strategies, as in the past. For example, Farecast, a part of Microsoft’s search engine Bing, can advise customers whether to buy an airline ticket now or wait for the price to come down by examining 225 billion flight and price records. The same idea is being extended to hotel rooms, cars and similar items. Personal-finance websites and banks are aggregating their customer data to show up macroeconomic trends, which may develop into ancillary businesses in their own right. Number-crunchers have even uncovered match-fixing in Japanese sumo wrestling.

Dross into gold
“Data exhaust”—the trail of clicks that internet users leave behind from which value can be extracted—is becoming a mainstay of the internet economy. One example is Google’s search engine, which is partly guided by the number of clicks on an item to help determine its relevance to a search query. If the eighth listing for a search term is the one most people go to, the algorithm puts it higher up.

As the world is becoming increasingly digital, aggregating and analysing data is likely to bring huge benefits in other fields as well. For example, Mr Mundie of Microsoft and Eric Schmidt, the boss of Google, sit on a presidential task force to reform American health care. “Early on in this process Eric and I both said: ‘Look, if you really want to transform health care, you basically build a sort of health-care economy around the data that relate to people’,” Mr Mundie explains. “You would not just think of data as the ‘exhaust’ of providing health services, but rather they become a central asset in trying to figure out how you would improve every aspect of health care. It’s a bit of an inversion.”

To be sure, digital records should make life easier for doctors, bring down costs for providers and patients and improve the quality of care. But in aggregate the data can also be mined to spot unwanted drug interactions, identify the most effective treatments and predict the onset of disease before symptoms emerge. Computers already attempt to do these things, but need to be explicitly programmed for them. In a world of big data the correlations surface almost by themselves.

Sometimes those data reveal more than was intended. For example, the city of Oakland, California, releases information on where and when arrests were made, which is put out on a private website, Oakland Crimespotting. At one point a few clicks revealed that police swept the whole of a busy street for prostitution every evening except on Wednesdays, a tactic they probably meant to keep to themselves.

But big data can have far more serious consequences than that. During the recent financial crisis it became clear that banks and rating agencies had been relying on models which, although they required a vast amount of information to be fed in, failed to reflect financial risk in the real world. This was the first crisis to be sparked by big data—and there will be more.

The way that information is managed touches all areas of life. At the turn of the 20th century new flows of information through channels such as the telegraph and telephone supported mass production. Today the availability of abundant data enables companies to cater to small niche markets anywhere in the world. Economic production used to be based in the factory, where managers pored over every machine and process to make it more efficient. Now statisticians mine the information output of the business for new ideas.

“The data-centred economy is just nascent,” admits Mr Mundie of Microsoft. “You can see the outlines of it, but the technical, infrastructural and even business-model implications are not well understood right now.” This special report will point to where it is beginning to surface.
.:/p06
A different game
Information is transforming traditional businesses
was happening in the business.

Sales data remain one of a company’s most important assets. In 2004 Wal-Mart peered into its mammoth databases and noticed that before a hurricane struck, there was a run on flashlights and batteries, as might be expected; but also on Pop-Tarts, a sugary American breakfast snack. On reflection it is clear that the

are collecting more data than ever before. In the past they were kept in different systems that were unable to talk to each other, such as finance, human resources or customer management. Now the systems are being linked, and companies are using data-mining techniques to get a complete picture of their operations—“a single version of the truth”, as the industry likes
.:/p07
months into their subscription and reaped the rewards.

Agony and torture
Such data-mining has a dubious reputation. “Torture the data long enough and they will confess to anything,” statisticians quip. But it has become far more effective as more companies have started to use the technology. Best Buy, a retailer, found that 7% of its customers accounted for 43% of its sales, so it reorganised its stores to concentrate on those customers’ needs. Airline yield management improved because analytical techniques uncovered the best predictor that a passenger would actually catch a flight he had booked: that he had ordered a vegetarian meal.

The IT industry is piling into business intelligence, seeing it as a natural successor of services such as accountancy and computing in the first and second half of the 20th century respectively. Accenture, PricewaterhouseCoopers, IBM and SAP are investing heavily in their consulting practices. Technology vendors such as Oracle, Informatica, TIBCO, SAS and EMC have benefited. IBM believes business intelligence will be a pillar of its growth as sensors are used to manage things from a city’s traffic flow to a patient’s blood flow. It has invested $12 billion in the past four years and is opening six analytics centres with 4,000 employees worldwide.

Analytics—performing statistical operations for forecasting or uncovering correlations such as between Pop-Tarts and hurricanes—can have a big pay-off. In Britain the Royal Shakespeare Company (RSC) sifted through seven years of sales data for a marketing campaign that increased regular visitors by 70%. By examining more than 2m transaction records, the RSC discovered a lot more about its best customers: not just income, but things like occupation and family status, which allowed it to target its marketing more precisely. That was of crucial importance, says the RSC’s Mary Butlin, because it substantially boosted membership as well as fund-raising revenue.

Yet making the most of data is not easy. The first step is to improve the accuracy of the information. Nestlé, for example, sells more than 100,000 products in 200 countries, using 550,000 suppliers, but it was not using its huge buying power effectively because its databases were a mess. On examination, it found that of its 9m records of vendors, customers and materials around half were obsolete or duplicated, and of the remainder about one-third were inaccurate or incomplete. The name of a vendor might be abbreviated in one record but spelled out in another, leading to double-counting.

Plainer vanilla
Over the past ten years Nestlé has been overhauling its IT system, using SAP software, and improving the quality of its data. This enabled the firm to become more efficient, says Chris Johnson, who led the initiative. For just one ingredient, vanilla, its American operation was able to reduce the number of specifications and use fewer suppliers, saving $30m a year. Overall, such operational improvements save more than $1 billion annually.

Nestlé is not alone in having problems with its database. Most CIOs admit that their data are of poor quality. In a study by IBM half the managers quizzed did not trust the information on which they had to base decisions. Many say that the technology meant to make sense of it often just produces more data. Instead of finding a needle in the haystack, they are making more hay.

Still, as analytical techniques become more widespread, business decisions will increasingly be made, or at least corroborated, on the basis of computer algorithms rather than individual hunches. This creates a need for managers who are comfortable with data, but statistics courses in business schools are not popular.

Many new business insights come from “dead data”: stored information about past transactions that are examined to reveal hidden correlations. But now companies are increasingly moving to analysing real-time information flows.

Wal-Mart is a good example. The retailer operates 8,400 stores worldwide, has more than 2m employees and handles over 200m customer transactions each week. Its revenue last year, around $400 billion, is more than the GDP of many entire countries. The sheer scale of the data is a challenge, admits Rollin Ford, the CIO at Wal-Mart’s headquarters in Bentonville, Arkansas. “We keep a healthy paranoia.”

Not a sparrow falls
Wal-Mart’s inventory-management system, called Retail Link, enables suppliers to see the exact number of their products on every shelf of every store at that precise moment. The system shows the rate of sales by the hour, by the day, over the past year and more. Begun in the 1990s, Retail Link gives suppliers a complete overview of when and how their products are selling, and with what other products in the shopping cart. This lets suppliers manage their stocks better.

The technology enabled Wal-Mart to change the business model of retailing. In some cases it leaves stock management in the hands of its suppliers and does not take ownership of the products until the moment they are sold. This allows it to shed inventory risk and reduce its costs. In essence, the shelves in its shops are a highly efficiently managed depot.

Another company that capitalises on real-time information flows is Li & Fung, one of the world’s biggest supply-chain operators. Founded in Guangzhou in southern China a century ago, it does not own any factories or equipment but
.:/p08
orchestrates a network of 12,000 suppliers in 40 countries, sourcing goods for brands ranging from Kate Spade to Walt Disney. Its turnover in 2008 was $14 billion.

Li & Fung used to deal with its clients mostly by phone and fax, with e-mail counting as high technology. But thanks to a new web-services platform, its processes have speeded up. Orders flow through a web portal and bids can be solicited from pre-qualified suppliers. Agents now audit factories in real time with hand-held computers. Clients are able to monitor the details of every stage of an order, from the initial production run to shipping.

One of the most important technologies has turned out to be videoconferencing. It allows buyers and manufacturers to examine the colour of a material or the stitching on a garment. “Before, we weren’t able to send a 500MB image—we’d post a DVD. Now we can stream it to show vendors in our offices. With real-time images we can make changes quicker,” says Manuel Fernandez, Li & Fung’s chief technology officer. Data flowing through its network soared from 100 gigabytes a day only 18 months ago to 1 terabyte.

The information system also allows Li & Fung to look across its operations to identify trends. In southern China, for instance, a shortage of workers and new legislation raised labour costs, so production moved north. “We saw that before it actually happened,” says Mr Fernandez. The company also got advance warning of the economic crisis, and later the recovery, from retailers’ orders before these trends became apparent. Investment analysts use country information provided by Li & Fung to gain insights into macroeconomic patterns.

Now that they are able to process information flows in real time, organisations are collecting more data than ever. One use for such information is to forecast when machines will break down. This hardly ever happens out of the blue: there are usually warning signs such as noise, vibration or heat. Capturing such data enables firms to act before a breakdown.

Similarly, the use of “predictive analytics” on the basis of large data sets may transform health care. Dr Carolyn McGregor of the University of Ontario, working with IBM, conducts research to spot potentially fatal infections in premature babies. The system monitors subtle changes in seven streams of real-time data, such as respiration, heart rate and blood pressure. The electrocardiogram alone generates 1,000 readings per second. This kind of information is turned out by all medical equipment, but it used to be recorded on paper and examined perhaps once an hour. By feeding the data into a computer, Dr McGregor has been able to detect the onset of an infection before obvious symptoms emerge. “You can’t see it with the naked eye, but a computer can,” she says.

Open sesame
Two technology trends are helping to fuel these new uses of data: cloud computing and open-source software. Cloud computing—in which the internet is used as a platform to collect, store and process data—allows businesses to lease computing power as and when they need it, rather than having to buy expensive equipment. Amazon, Google and Microsoft are the most prominent firms to make their massive computing infrastructure available to clients. As more corporate functions, such as human resources or sales, are managed over a network, companies can see patterns across the whole of the business and share their information more easily.

A free programming language called R lets companies examine and present big data sets, and free software called Hadoop now allows ordinary PCs to analyse huge quantities of data that previously required a supercomputer. It does this by parcelling out the tasks to numerous computers at once. This saves time and money. For example, the New York Times a few years ago used cloud computing and Hadoop to convert over 400,000 scanned images from its archives, from 1851 to 1922. By harnessing the power of hundreds of computers, it was able to do the job in 36 hours.

Visa, a credit-card company, in a recent trial with Hadoop crunched two years of test records, or 73 billion transactions, amounting to 36 terabytes of data. The processing time fell from one month with traditional methods to a mere 13 minutes. It is a striking successor of Ritty’s incorruptible cashier for a data-driven age.
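
The idea of parcelling out tasks to numerous computers at once, as described for Hadoop above, can be illustrated with a toy example. What follows is only a rough sketch in Python and does not use Hadoop itself: a pool of threads on one machine stands in for the separate computers, and the transaction records, the merchant names and the function names are all invented for illustration. Each worker tallies its own share of the records, and the partial tallies are then merged into one result.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Deal the records out into n_chunks roughly equal lists."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def count_by_merchant(chunk):
    """One worker's task: tally transactions per merchant in its own chunk."""
    return Counter(merchant for merchant, _amount in chunk)


def parallel_count(records, n_workers=4):
    """Parcel the work out to n_workers, then merge the partial tallies."""
    chunks = split_into_chunks(records, n_workers)
    # Threads stand in for separate computers (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_by_merchant, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total


# Hypothetical transaction records: (merchant, amount)
transactions = [
    ("grocer", 12.50), ("airline", 310.00), ("grocer", 8.75),
    ("bookshop", 21.00), ("airline", 95.20), ("grocer", 40.10),
]
totals = parallel_count(transactions, n_workers=3)
print(totals["grocer"], totals["airline"], totals["bookshop"])
```

Because each worker needs only its own share of the records, adding workers shortens the job without changing the answer, which is the property the examples above rely on.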
Show me
New ways of visualising data
.:/p09
.:/p10
charts and colour schemes.

These graphics are often based on immense quantities of data. Jeffrey Heer of Stanford University helped develop sense.us, a website that gives people access to American census data going back more than a century. Ben Fry, an independent designer, created a map of the 26m roads in the continental United States. The dense communities of the north-east form a powerful contrast to the desolate far west. Aaron Koblin of Google plotted a map of every commercial flight in America over 24 hours, with brighter lines identifying routes with heavier traffic.

Such techniques are moving into the business world. Mr Fry designed interactive charts for GE’s health-care division that show the costs borne by patients and insurers, respectively, for common diseases throughout people’s lives. Among media companies the New York Times and the Guardian in Britain have been the most ambitious, producing data-rich, interactive graphics that are strong enough to stand on their own.

The tools are becoming more accessible. For example, Tableau Software, co-founded in 2003 by Pat Hanrahan of Stanford University, does for visualising data what word-processing did for text, allowing anyone to manipulate information creatively. Tableau offers both free and paid-for products, as does a website called Swivel.com. Some sites are entirely free. Google and an IBM website called Many Eyes let people upload their data to display in novel ways and share with others.

Some data sets are best represented as a moving image. As print publications move to e-readers, animated infographics will eventually become standard. The software Gapminder elegantly displays four dynamic variables at once.

Displaying information can make a difference by enabling people to understand complex matters and find creative solutions. Valdis Krebs, a specialist in mapping social interactions, recalls being called in to help with a corporate project that was vastly over budget and behind schedule. He drew up an intricate network map of e-mail traffic that showed distinct clusters, revealing that the teams involved were not talking directly to each other but passing messages via managers. So the company changed its office layout and its work processes—and the project quickly got back on track.
Needle in a haystack
The uses of information about information
.:/p11
.:/p12
actually clicked and stayed. A Google search might yield 2m pages of results in a quarter of a second, but users often want just one page, and by choosing it they “tell” Google what they are looking for. So the algorithm was rejigged to feed that information back into the service automatically.

From then on Google realised it was in the data-mining business. To put the model in simple economic terms, its search results give away, say, $1 in value, and in return (thanks to the user’s clicks) it gets 1 cent back. When the next user visits, he gets $1.01 of value, and so on. As one employee puts it: “We like learning from large, ‘noisy’ data sets.”

Making improvements on the back of a big data set is not a Google monopoly, nor is the technique new. One of the most striking examples dates from the mid-1800s, when Matthew Fontaine Maury of the American navy had the idea of aggregating nautical logs from ships crossing the Pacific to find the routes that offered the best winds and currents. He created an early variant of a “viral” social network, rewarding captains who submitted their logbooks with a copy of his maps. But the process was slow and laborious.

Wizard spelling
Google applies this principle of recursively learning from the data to many of its services, including the humble spell-check, for which it used a pioneering method that produced perhaps the world’s best spell-checker in almost every language. Microsoft says it spent several million dollars over 20 years to develop a robust spell-checker for its word-processing program. But Google got its raw material free: its program is based on all the misspellings that users type into a search window and then “correct” by clicking on the right result. With almost 3 billion queries a day, those results soon mount up. Other search engines in the 1990s had the chance to do the same, but did not pursue it. Around 2000 Yahoo! saw the potential, but nothing came of the idea. It was Google that recognised the gold dust in the detritus of its interactions with its users and took the trouble to collect it up.

Two newer Google services take the same approach: translation and voice recognition. Both have been big stumbling blocks for computer scientists working on artificial intelligence. For over four decades the boffins tried to program computers to “understand” the structure and phonetics of language. This meant defining rules such as where nouns and verbs go in a sentence, which are the correct tenses and so on. All the exceptions to the rules needed to be programmed in too. Google, by contrast, saw it as a big maths problem that could be solved with a lot of data and processing power—and came up with something very useful.

For translation, the company was able to draw on its other services. Its search system had copies of European Commission documents, which are translated into around 20 languages. Its book-scanning project has thousands of titles that have been translated into many languages. All these translations are very good, done by experts to exacting standards. So instead of trying to teach its computers the rules of a language, Google turned them loose on the texts to make statistical inferences.

Google Translate now covers more than 50 languages, according to Franz Och, one of the company’s engineers. The system identifies which word or phrase in one language is the most likely equivalent in a second language. If direct translations are not available (say, Hindi to Catalan), then English is used as a bridge.

Google was not the first to try this method. In the early 1990s IBM tried to build a French-English program using translations from Canada’s Parliament. But the system did not work well and the project was abandoned. IBM had only a few million documents at its disposal, says Mr Och dismissively. Google has billions. The system was first developed by processing almost 2 trillion words. But although it learns from a big body of data, it lacks the recursive qualities of spell-check and search.

The design of the feedback loop is critical. Google asks users for their opinions, but not much else. A translation start-up in Germany called Linguee is trying something different: it presents users with snippets of possible translations and asks them to click on the best. That provides feedback on which version is the most accurate.

Voice recognition highlights the importance of making use of data exhaust. To use Google’s telephone directory or audio car navigation service, customers dial the relevant number and say what they are looking for. The system repeats the information; when the customer confirms it, or repeats the query, the system develops a record of the different ways the target word can be spoken. It does not learn to understand voice; it computes probabilities.

To launch the service Google needed an existing voice-recognition system, so it licensed software from Nuance, a leader in the field. But Google itself keeps the data from voice queries, and its voice-recognition system may end up performing better than Nuance’s—which is now trying to get access to lots more data by partnering with everyone in sight.

Re-using data represents a new model for how computing is done, says Edward Felten of Princeton University. “Looking at large data sets and making inferences about what goes together is advancing more rapidly than expected. ‘Understanding’ turns out to be overrated, and statistical analysis goes a lot of the way.” Many internet companies now see things the same way. Facebook regularly examines its huge databases to boost usage. It found that the best single predictor of whether members would contribute to the site was seeing that their friends had been active on it, so it took to sending members information about what their friends had been up to online. Zynga, an online games company, tracks its 100m unique players each month to improve its games.

“If there are user-generated data to be had, then we can build much better
.:/p13
systems than just trying to improve the algorithms,” says Andreas Weigend, a former chief scientist at Amazon who is now at Stanford University. Marc Andreessen, a venture capitalist who sits on numerous boards and was one of the founders of Netscape, the web’s first commercial browser, thinks that “these new companies have built a culture, and the processes and the technology to deal with large amounts of data, that traditional companies simply don’t have.”
Recycling data exhaust is a common theme in the myriad projects going on in Google’s empire and helps explain why almost all of them are labelled as a “beta” or early test version: they truly are in continuous development. A service that lets Google users store medical records might also allow the company to spot valuable patterns about diseases and treatments. A service where users can monitor their use of electricity, device by device, provides rich information on energy consumption. It could become the world’s best database of household appliances and consumer electronics—and even foresee breakdowns. The aggregated search queries, which the company makes available free, are used as remarkably accurate predictors for everything from retail sales to flu outbreaks.
Together, all this is in line with the company’s audacious mission to “organise the world’s information”. Yet the words are carefully chosen: Google does not need to own the data. Usually all it wants is to have access to them (and see that its rivals do not). In an initiative called “Data Liberation Front” that quietly began last September, Google is planning to rejig all its services so that users can discontinue them very easily and take their data with them. In an industry built on locking in the customer, the company says it wants to reduce the “barriers to exit”. That should help save its engineers from complacency, the curse of many a tech champion. The project might stall if it started to hurt the business. But perhaps Google reckons that users will be more inclined to share their information with it if they know that they can easily take it back.
© The Economist Newspaper Limited, London (2010)
.:/p14
people expect from government, fuelled by the experience of shopping on the internet and having real-time access to financial information,” says John Wonderlich of the Sunlight Foundation, which promotes open government. The economic crisis has speeded up that change, particularly in state and city governments.
“The city is facing its eighth budget shortfall. We’re looking at a 50% reduction in operating funds,” says Chris Vein, San Francisco’s CIO. “We must figure out how we change our operations.” He insists that providing more information can make government more efficient. California’s generous “sunshine laws” provide the necessary legal backing.
Among the first users of the newly available data was a site called “San Francisco Crimespotting” by Stamen Design that layers historical crime figures on top of map information. It allows users to play around with the data and spot hidden trends. People now often come to public meetings armed with crime maps to demand police patrols in their particular area.
Anyone can play
Other cities, including New York, Chicago and Washington, DC, are racing ahead as well. Now that citizens’ groups and companies have the raw data, they can use them to improve city services in ways that cash-strapped local governments cannot. For instance, cleanscores.com puts restaurants’ health-inspection scores online; other sites list children’s activities or help people find parking spaces. In the past government would have been pressed to provide these services; now it simply supplies the data. Mr Vein concedes, however, that “we don’t know what is useful or not. This is a grand experiment.”
Other parts of the world are also beginning to move to greater openness. A European Commission directive in 2005 called for making public-sector information more accessible (but it has no bite). Europe’s digital activists use the web to track politicians and to try to improve public services. In Britain FixMyStreet.com gives citizens the opportunity to flag up local problems. That allows local authorities to find out about people’s concerns; and once the problem has been publicly aired it becomes more difficult to ignore.
One obstacle is that most countries lack America’s open-government ethos, nurtured over decades by laws on ethics in government, transparency rules and the Freedom of Information act, which acquired teeth after the Nixon years.
An obstacle of a different sort is Crown copyright, which means that most government data in Britain and the Commonwealth countries are the state’s property, constraining their use. In Britain postcodes and Ordnance Survey map data at present cannot be freely used for commercial purposes—a source of loud complaints from businesses and activists. But from later this year access to some parts of both data sets will be free, thanks to an initiative to bring more government services online.
But even in America access to some government information is restricted by financial barriers. Remarkably, this applies to court documents, which in a democracy should surely be free. Legal records are public and available online from the Administrative Office of the US Courts (AOUSC), but at a costly eight cents per page. Even the federal government has to pay: between 2000 and 2008 it spent $30m to get access to its own records. Yet the AOUSC is currently paying $156m over ten years to two companies, WestLaw and LexisNexis, to publish the material online (albeit organised and searchable with the firms’ technologies). Those companies, for their part, earn an estimated $2 billion annually from selling American court rulings and extra content such as case reference guides. “The law is locked up behind a cash register,” says Mr Malamud.
The two firms say they welcome competition, pointing to their strong search technology and the additional services they provide, such as case summaries and useful precedents. It seems unlikely that they will keep their grip for long. One administration official privately calls freeing the information a “no-brainer”. Even Google has begun to provide some legal documents online.
Change agent
The point of open information is not merely to expose the world but to change it. In recent years moves towards more transparency in government have become one of the most vibrant and promising areas of public policy. Sometimes information disclosure can achieve policy aims more effectively and at far lower cost than traditional regulation.
In an important shift, new transparency requirements are now being used by government—and by the public—to hold the private sector to account. For example, it had proved extremely difficult to persuade American businesses to cut down on the use of harmful chemicals and their release into the environment. An add-on to a 1986 law required firms simply to disclose what they release, including “by computer telecommunications”. Even to supporters it seemed like a fudge, but it turned out to be a resounding success. By 2000 American businesses had reduced their emissions of the chemicals covered under the law by 40%, and over time the rules were actually tightened. Public scrutiny achieved what legislation could not.
There have been many other such successes in areas as diverse as restaurant sanitation, car safety, nutrition, home loans for minorities and educational performance, note Archon Fung, Mary Graham and David Weil of the Transparency Policy Project at Harvard’s Kennedy School of Government in their book “Full Disclosure”. But transparency alone is not enough. There has to be a community to champion the information. Providers need an incentive to supply the data as well as penalties for withholding them. And web developers have to find ways of ensuring that the public data being released are used effectively.
Mr Fung thinks that as governments release more and more information about the things they do, the data will be used to show the public sector’s shortcomings rather than to highlight its achievements. Another concern is that the accuracy and quality of the data will be found wanting (which is a problem for business as well as for the public sector). There is also a debate over whether governments should merely supply the raw data or get involved in processing and displaying them too. The concern is that they might manipulate them—but then so might anyone else.
.:/p15
Public access to government figures is certain to release economic value and encourage entrepreneurship. That has already happened with weather data and with America’s GPS satellite-navigation system that was opened for full commercial use a decade ago. And many firms make a good living out of searching for or repackaging patent filings.
Moreover, providing information opens up new forms of collaboration between the public and the private sectors. Beth Noveck, one of the Obama administration’s recruits, who is a law professor and author of a book entitled “Wiki Government”, has spearheaded an initiative called peer-to-patent that has opened up some of America’s patent filings for public inspection.
John Stuart Mill in 1861 called for “the widest participation in the details of judicial and administrative business… above all by the utmost possible publicity.” These days, that includes the greatest possible disclosure of data by electronic means.
.:/p16
handed to regulators if problems arose. It could even be a requirement for insurance coverage, allowing a market for information security to emerge.
Current rules on digital records state that data should never be stored for longer than necessary because they might be misused or inadvertently released. But Viktor Mayer-Schönberger of the National University of Singapore worries that the increasing power and decreasing price of computers will make it too easy to hold on to everything. In his recent book “Delete” he argues in favour of technical systems that “forget”: digital files that have expiry dates or slowly degrade over time.
Yet regulation is pushing in the opposite direction. There is a social and political expectation that records will be kept, says Peter Allen of CSC, a technology provider: “The more we know, the more we are expected to know—for ever.” American security officials have pressed companies to keep records because they may hold clues after a terrorist incident. In future it is more likely that companies will be required to retain all digital files, and ensure their accuracy, than to delete them.
Processing data is another concern. Ian Ayres, an economist and lawyer at Yale University and the author of “Super-Crunchers”, a book about computer algorithms replacing human intuition, frets about the legal implications of using statistical correlations. Rebecca Goldin, a mathematician at George Mason University, goes further: she worries about the “ethics of super-crunching”. For example, racial discrimination against an applicant for a bank loan is illegal. But what if a computer model factors in the educational level of the applicant’s mother, which in America is strongly correlated with race? And what if computers, just as they can predict an individual’s susceptibility to a disease from other bits of information, can predict his predisposition to committing a crime?
A new regulatory principle in the age of big data, then, might be that people’s data cannot be used to discriminate against them on the basis of something that might or might not happen. The individual must be regarded as a free agent. This idea is akin to the general rule of national statistical offices that data gathered for surveys cannot be used against a person for things like deporting illegal immigrants—which, alas, has not always been respected.
Privacy rules lean towards treating personal information as a property right. A reasonable presumption might be that the trail of data that an individual leaves behind and that can be traced to him, from clicks on search engines to book-buying preferences, belong to that individual, not the entity that collected it. Google’s “data liberation” initiative mentioned earlier in this report points in that direction. That might create a market for information. Indeed, “data portability” stimulates competition, just as phone-number portability encourages competition among mobile operators. It might also reduce the need for antitrust enforcement by counteracting data aggregators’ desire to grow ever bigger in order to reap economies of scale.
Ensuring the integrity of the information is an important part of the big-data age. When America’s secretary of state, Hillary Clinton, lambasted the Chinese in January for allegedly hacking into Google’s computers, she used the term “the global networked commons”. The idea is that the internet is a shared environment, like the oceans or airspace, which requires international co-operation to make the best use of it. Censorship pollutes that environment. Disrupting information flows not only violates the integrity of the data but quashes free expression and denies the right of assembly. Likewise, if telecoms operators give preferential treatment to certain content providers, they undermine the idea of “network neutrality”.
Governments could define best practice on dealing with information flows and the processing of data, just as they require firms to label processed foods with the ingredients or impose public-health standards. The World Trade Organisation, which oversees the free flow of physical trade, might be a suitable body for keeping digital goods and services flowing too. But it will not be quick or easy.
.:/p17
four concepts or relationships at once. If there is more information to process, or it is especially complex, people become confused.
Moreover, knowledge has become so specialised that it is impossible for any individual to grasp the whole picture. A true understanding of climate change, for instance, requires a knowledge of meteorology, chemistry, economics and law, among many other things. And whereas doctors a century ago were expected to keep up with the entire field of medicine, now they would need to be familiar with about 10,000 diseases, 3,000 drugs and more than 1,000 lab tests. A study in 2004 suggested that in epidemiology alone it would take 21 hours of work a day just to stay current. And as more people around the world become more educated, the flow of knowledge will increase even further. The number of peer-reviewed scientific papers in China alone has increased 14-fold since 1990 (see chart 3).
“What information consumes is rather obvious: it consumes the attention of its recipients,” wrote Herbert Simon, an economist, in 1971. “Hence a wealth of information creates a poverty of attention.” But just as it is machines that are generating most of the data deluge, so they can also be put to work to deal with it. That highlights the role of “information intermediaries”. People rarely deal with raw data but consume them in processed form, once they have been aggregated or winnowed by computers. Indeed, many of the technologies described in this report, from business analytics to recursive machine-learning to visualisation software, exist to make data more digestible for humans.
Some applications have already become so widespread that they are taken for granted. For example, banks use credit scores, based on data about past financial transactions, to judge an applicant’s ability to repay a loan. That makes the process less subjective than the say-so of a bank manager. Likewise, landing a plane requires a lot of mental effort, so the process has been largely automated, and both pilots and passengers feel safer. And in health care the trend is towards “evidence-based medicine”, where not only doctors but computers too get involved in diagnosis and treatment.
The dangers of complacency
In the age of big data, algorithms will be doing more of the thinking for people. But that carries risks. The technology is far less reliable than people realise. For every success with big data there are many failures. The inability of banks to understand their risks in the lead-up to the financial crisis is one example. The deficient system used to identify potential terrorists is another.
On Christmas Day last year a Nigerian man, Umar Farouk Abdulmutallab, tried to ignite a hidden bomb as his plane was landing in Detroit. It turned out his father had informed American officials that he posed a threat. His name was entered into a big database of around 550,000 people who potentially posed a security risk. But the database is notoriously flawed. It contains many duplicates, and names are regularly lost during back-ups. The officials had followed all the right procedures, but the system still did not prevent the suspect from boarding the plane.
One big worry is what happens if the technology stops working altogether. This is not a far-fetched idea. In January 2000 the torrent of data pouring into America’s National Security Agency (NSA) brought the system to a crashing halt. The agency was “brain-dead” for three-and-a-half days, General Michael Hayden, then its director, said publicly in 2002. “We were dark. Our ability to process information was gone.”
If an intelligence agency can be hit in this way, the chances are that most other users are at even greater risk. Part of the solution will be to pour more resources into improving the performance of existing technologies, not just pursue more innovations. The computer industry went through a similar period of reassessment in 2001-02 when Microsoft and others announced that they were concentrating on making their products much more secure rather than adding new features.
Another concern is energy consumption. Processing huge amounts of data takes a lot of power. “In two to three years we will saturate the electric cables running into the building,” says Alex Szalay at Johns Hopkins University. “The next challenge is how to do the same things as today, but with ten to 100 times less power.”
It is a worry that affects many organisations. The NSA in 2006 came close to exceeding its power supply, which would have blown out its electrical infrastructure. Both Google and Microsoft have had to put some of their huge data centres next to hydroelectric plants to ensure access to enough energy at a reasonable price.
Some people are even questioning whether the scramble for ever more information is a good idea. Nick Bostrom, a philosopher at Oxford University, identifies “information hazards” which result from disseminating information that is likely to cause harm, such as publishing the blueprint for a nuclear bomb or broadcasting news of a race riot that could provoke further violence. “It is said that a little knowledge is a dangerous thing,” he writes. “It is an open question whether more knowledge is safer.” Yet similar concerns have been raised through the ages, and mostly proved overblown.
Knowledge is power
The pursuit of information has been a human preoccupation since knowledge was first recorded. In the 3rd century BC Ptolemy stole every available scroll from passing travellers and ships to stock his great library in Alexandria. After September 11th 2001 the American Defence Department launched a program called “Total Information Awareness” to compile as many data as possible
.:/p18
about just about everything—e-mails, phone calls, web searches, shopping transactions, bank records, medical files, travel history and much more. Since 1996 Brewster Kahle, an internet entrepreneur, has been recording all the content on the web as a not-for-profit venture called the “Internet Archive”. It has since expanded to software, films, audio recordings and scanning books.
There has always been more information than people can mentally process. The chasm between the amount of information and man’s ability to deal with it may be widening, but that need not be a cause for alarm. “Our sensory and attentional systems are tuned via evolution and experience to be selective,” says Dennis Proffitt, a cognitive psychologist at the University of Virginia. People find patterns to compress information and make it manageable. Even Commander Schmorrow does not think that man will be replaced by robots. “The flexibility of the human to consider as-yet-unforeseen consequences during critical decision-making, go with the gut when problem-solving under uncertainty and other such abstract reasoning behaviours built up over years of experience will not be readily replaced by a computer algorithm,” he says.
The cornucopia of data now available is a resource, similar to other resources in the world and even to technology itself. On their own, resources and technologies are neither good nor bad; it depends on how they are used. In the age of big data, computers will be monitoring more things, making more decisions and even automatically improving their own processes—and man will be left with the same challenges he has always faced. As T.S. Eliot asked: “Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?”
.:/p19
Overload and the Limits of Working Memory” by Torkel Klingberg, Oxford University Press, 2008.
Reports and papers:
“The Promise and Peril of Big Data” by David Bollier, The Aspen Institute, 2010.
“Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age” by The National Academies Press, 2009.
“Computational Social Science” by David Lazer, Alex Pentland, Lada Adamic, et al. Science, February 6th 2009.
“Computer Mediated Transactions” by Hal Varian. Ely Lecture at the American Economics Association, January 18th 2010.
“The Unreasonable Effectiveness of Data” by Alon Halevy, Peter Norvig and Fernando Pereira, IEEE Intelligent Systems, vol. 24, no. 2, pp. 8-12, March/April 2009.
“Predicting the Present with Google Trends” by Hyunyoung Choi and Hal Varian, Google. April 10th 2009.
“Detecting influenza epidemics using search engine query data” by Jeremy Ginsberg, Matthew H. Mohebbi and Rajan S. Patel et al. Nature, vol 457, February 19th 2009.
“Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer, draft book chapter for Morgan & Claypool Synthesis Lectures on Human Language Technologies, February 7th 2010.
“Web Squared: Web 2.0 Five Years On” by Tim O’Reilly and John Battelle, O’Reilly Media, 2009.
“Information Accountability” by Daniel J. Weitzner, Harold Abelson and Tim Berners-Lee, et al. Communications of the ACM, vol 51, issue 6, June 2008.
“How Diverse is Facebook?” by Facebook Data Team, Facebook, December 16th 2009.
“Managing Global Data Privacy: Cross-Border Information Flows in a Networked Environment” by Paul M. Schwartz, The Privacy Projects, 2009.
“Data Protection Accountability: The Essential Elements—A Document for Discussion” Centre for Information Policy Leadership, the Galway Project, October 2009.
“The Semantic Web” by Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American, May 2001.
“Information Hazards: A Typology of Potential Harms from Knowledge” by Nick Bostrom, Draft 1.11, 2009.
.:/p20