
Data, data everywhere

A special report on managing information

Contents

The data deluge
Businesses, governments and society are only starting to tap its vast potential

Data, data everywhere
Information has gone from scarce to superabundant. That brings huge new benefits, says Kenneth Cukier—but also big headaches

All too much
Monstrous amounts of data

A different game
Information is transforming traditional businesses

Show me
New ways of visualising data

Needle in a haystack
The uses of information about information

Clicking for gold
How internet companies profit from data on the web

The open society
Governments are letting in the light

New rules for big data
Regulators are having to rethink their brief

Handling the cornucopia
The best way to deal with all that information is to use machines. But they need watching

Sources and acknowledgments
Reprinted from The Economist February 27th 2010

Technology

The data deluge


Businesses, governments and society are only starting to tap its vast potential

EIGHTEEN months ago, Li & Fung, a firm that manages supply chains for retailers, saw 100 gigabytes of information flow through its network each day. Now the amount has increased tenfold. During 2009, American drone aircraft flying over Iraq and Afghanistan sent back around 24 years' worth of video footage. New models being deployed this year will produce ten times as many data streams as their predecessors, and those in 2011 will produce 30 times as many.

Everywhere you look, the quantity of information in the world is soaring. According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Merely keeping up with this flood, and storing the bits that might be useful, is difficult enough. Analysing it, to spot patterns and extract useful information, is harder still. Even so, the data deluge is already starting to transform business, government, science and everyday life. It has great potential for good—as long as consumers, companies and governments make the right choices about when to restrict the flow of data, and when to encourage it.

Plucking the diamond from the waste

A few industries have led the way in their ability to gather and exploit data. Credit-card companies monitor every purchase and can identify fraudulent ones with a high degree of accuracy, using rules derived by crunching through billions of transactions. Stolen credit cards are more likely to be used to buy hard liquor than wine, for example, because it is easier to fence. Insurance firms are also good at combining clues to spot suspicious claims: fraudulent claims are more likely to be made on a Monday than a Tuesday, since policyholders who stage accidents tend to assemble friends as false witnesses over the weekend. By combining many such rules, it is possible to work out which cards are likeliest to have been stolen, and which claims are dodgy.
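
The rule-combining approach can be sketched in a few lines of Python. This is an illustration only: the rules, weights and threshold below are invented for the purpose, whereas real card issuers and insurers derive theirs by crunching billions of historical transactions and claims:

# Illustrative sketch of combining simple fraud "rules" into a score.
# The rules and weights are invented; real ones come from mining
# billions of historical transactions.

def fraud_score(transaction):
    score = 0.0
    if transaction["category"] == "hard liquor":        # easier to fence than wine
        score += 0.3
    if transaction["hour"] < 6:                          # small-hours purchase
        score += 0.2
    if transaction["country"] != transaction["home_country"]:
        score += 0.4
    return score

def flag_suspicious(transactions, threshold=0.6):
    # Return the transactions whose combined rule score crosses the threshold.
    return [t for t in transactions if fraud_score(t) >= threshold]

sample = [
    {"id": 1, "category": "hard liquor", "hour": 3, "country": "BR", "home_country": "US"},
    {"id": 2, "category": "groceries", "hour": 18, "country": "US", "home_country": "US"},
]
print(flag_suspicious(sample))   # only transaction 1 is flagged
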


Mobile-phone operators, meanwhile, analyse subscribers' calling patterns to determine, for example, whether most of their frequent contacts are on a rival network. If that rival network is offering an attractive promotion that might cause the subscriber to defect, he or she can then be offered an incentive to stay.

Older industries crunch data with just as much enthusiasm as new ones these days. Retailers, offline as well as online, are masters of data mining (or "business intelligence", as it is now known). By analysing "basket data", supermarkets can tailor promotions to particular customers' preferences. The oil industry uses supercomputers to trawl seismic data before drilling wells. And astronomers are just as likely to point a software query-tool at a digital sky survey as to point a telescope at the stars.

There's much further to go. Despite years of effort, law-enforcement and intelligence agencies' databases are not, by and large, linked. In health care, the digitisation of records would make it much easier to spot and monitor health trends and evaluate the effectiveness of different treatments. But large-scale efforts to computerise health records tend to run into bureaucratic, technical and ethical problems. Online advertising is already far more accurately targeted than the offline sort, but there is scope for even greater personalisation. Advertisers would then be willing to pay more, which would in turn mean that consumers prepared to opt into such things could be offered a richer and broader range of free online services. And governments are belatedly coming around to the idea of putting more information—such as crime figures, maps, details of government contracts or statistics about the performance of public services—into the public domain. People can then reuse this information in novel ways to build businesses and hold elected officials to account. Companies that grasp these new opportunities, or provide the tools for others to do so, will prosper. Business intelligence is one of the fastest-growing parts of the software industry.

Now for the bad news

But the data deluge also poses risks. Examples abound of databases being stolen: disks full of social-security data go missing, laptops loaded with tax records are left in taxis, credit-card numbers are stolen from online retailers. The result is privacy breaches, identity theft and fraud. Privacy infringements are also possible even without such foul play: witness the periodic fusses when Facebook or Google unexpectedly change the privacy settings on their online social networks, causing members to reveal personal information unwittingly. A more sinister threat comes from Big Brotherishness of various kinds, particularly when governments compel companies to hand over personal information about their customers. Rather than owning and controlling their own personal data, people very often find that they have lost control of it.

The best way to deal with these drawbacks of the data deluge is, paradoxically, to make more data available in the right way, by requiring greater transparency in several areas. First, users should be given greater access to and control over the information held about them, including whom it is shared with. Google allows users to see what information it holds about them, and lets them delete their search histories or modify the targeting of advertising, for example. Second, organisations should be required to disclose details of security breaches, as is already the case in some parts of the world, to encourage bosses to take information security more seriously. Third, organisations should be subject to an annual security audit, with the resulting grade made public (though details of any problems exposed would not be). This would encourage companies to keep their security measures up to date.

Market incentives will then come into play as organisations that manage data well are favoured over those that do not. Greater transparency in these three areas would improve security and give people more control over their data without the need for intricate regulation that could stifle innovation. After all, the process of learning to cope with the data deluge, and working out how best to tap it, has only just begun.

Data, data everywhere


Information has gone from scarce to superabundant. That brings huge new benefits,
says Kenneth Cukier—but also big headaches
WHEN the Sloan Digital Sky Survey started work in 2000, its telescope in New Mexico collected more data in its first few weeks than had been amassed in the entire history of astronomy. Now, a decade later, its archive contains a whopping 140 terabytes of information. A successor, the Large Synoptic Survey Telescope, due to come on stream in Chile in 2016, will acquire that quantity of data every five days.

Such astronomical amounts of information can be found closer to Earth too. Wal-Mart, a retail giant, handles more than 1m customer transactions every hour, feeding databases estimated at more than 2.5 petabytes—the equivalent of 167 times the books in America's Library of Congress (see article for an explanation of how data are quantified). Facebook, a social-networking website, is home to 40 billion photos. And decoding the human genome involves analysing 3 billion base pairs—which took ten years the first time it was done, in 2003, but can now be achieved in one week.

All these examples tell the same story: that the world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account.


But they are also creating a host of new problems. Despite the abundance of tools to capture, process and share all this information—sensors, computers, mobile phones and the like—it already exceeds the available storage space (see chart 1). Moreover, ensuring data security and protecting privacy is becoming harder as the information multiplies and is shared ever more widely around the world.

Alex Szalay, an astrophysicist at Johns Hopkins University, notes that the proliferation of data is making them increasingly inaccessible. "How to make sense of all these data? People should be worried about how we train the next generation, not just of scientists, but people in government and industry," he says.

"We are at a different period because of so much information," says James Cortada of IBM, who has written a couple of dozen books on the history of information in society. Joe Hellerstein, a computer scientist at the University of California in Berkeley, calls it "the industrial revolution of data". The effect is being felt everywhere, from business to science, from government to the arts. Scientists and computer engineers have coined a new term for the phenomenon: "big data".

Epistemologically speaking, information is made up of a collection of data and knowledge is made up of different strands of information. But this special report uses "data" and "information" interchangeably because, as it will argue, the two are increasingly difficult to tell apart. Given enough raw data, today's algorithms and powerful computers can reveal new insights that would previously have remained hidden.

The business of information management—helping organisations to make sense of their proliferating data—is growing by leaps and bounds. In recent years Oracle, IBM, Microsoft and SAP between them have spent more than $15 billion on buying software firms specialising in data management and analytics. This industry is estimated to be worth more than $100 billion and growing at almost 10% a year, roughly twice as fast as the software business as a whole.

Chief information officers (CIOs) have become somewhat more prominent in the executive suite, and a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google's chief economist, predicts that the job of statistician will become the "sexiest" around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

More of everything

There are many reasons for the information explosion. The most obvious one is technology. As the capabilities of digital devices soar and prices plummet, sensors and gadgets are digitising lots of information that was previously unavailable. And many more people have access to far more powerful tools. For example, there are 4.6 billion mobile-phone subscriptions worldwide (though many people have more than one, so the world's 6.8 billion people are not quite as well supplied as these figures suggest), and 1 billion-2 billion people use the internet.

Moreover, there are now many more people who interact with information. Between 1990 and 2005 more than 1 billion people worldwide entered the middle class. As they get richer they become more literate, which fuels information growth, notes Mr Cortada. The results are showing up in politics, economics and the law as well. "Revolutions in science have often been preceded by revolutions in measurement," says Sinan Aral, a business professor at New York University. Just as the microscope transformed biology by exposing germs, and the electron microscope changed physics, all these data are turning the social sciences upside down, he explains. Researchers are now able to understand human behaviour at the population level rather than the individual level.

The amount of digital information increases tenfold every five years. Moore's law, which the computer industry now takes for granted, says that the processing power and storage capacity of computer chips double or their prices halve roughly every 18 months. The software programs are getting better too. Edward Felten, a computer scientist at Princeton University, reckons that the improvements in the algorithms driving computer applications have played as important a part as Moore's law for decades.

A vast amount of that information is shared. By 2013 the amount of traffic flowing over the internet annually will reach 667 exabytes, according to Cisco, a maker of communications gear. And the quantity of data continues to grow faster than the ability of the network to carry it all.

People have long groused that they were swamped by information. Back in 1917 the manager of a Connecticut manufacturing firm complained about the effects of the telephone: "Time is lost, confusion results and money is spent." Yet what is happening now goes way beyond incremental growth. The quantitative change has begun to make a qualitative difference.

This shift from information scarcity to surfeit has broad effects. "What we are seeing is the ability to have economies form around the data—and that to me is the big change at a societal and even macroeconomic level," says Craig Mundie, head of research and strategy at Microsoft. Data are becoming the new raw material of business: an economic input almost on a par with capital and labour. "Every day I wake up and ask, 'how can I flow data better, manage data better, analyse data better?'" says Rollin Ford, the CIO of Wal-Mart.

Sophisticated quantitative analysis is being applied to many aspects of life, not just missile trajectories or financial hedging strategies, as in the past. For example, Farecast, a part of Microsoft's search engine Bing, can advise customers whether to buy an airline ticket now or wait for the price to come down by examining 225 billion flight and price records. The same idea is being extended to hotel rooms, cars and similar items. Personal-finance websites and banks are aggregating their customer data to show up macroeconomic trends, which may develop into ancillary businesses in their own right. Number-crunchers have even uncovered match-fixing in Japanese sumo wrestling.

Dross into gold

"Data exhaust"—the trail of clicks that internet users leave behind from which value can be extracted—is becoming a mainstay of the internet economy. One example is Google's search engine, which is partly guided by the number of clicks on an item to help determine its relevance to a search query. If the eighth listing for a search term is the one most people go to, the algorithm puts it higher up.

As the world is becoming increasingly digital, aggregating and analysing data is likely to bring huge benefits in other fields as well. For example, Mr Mundie of Microsoft and Eric Schmidt, the boss of Google, sit on a presidential task force to reform American health care. "Early on in this process Eric and I both said: 'Look, if you really want to transform health care, you basically build a sort of health-care economy around the data that relate to people'," Mr Mundie explains. "You would not just think of data as the 'exhaust' of providing health services, but rather they become a central asset in trying to figure out how you would improve every aspect of health care. It's a bit of an inversion."

To be sure, digital records should make life easier for doctors, bring down costs for providers and patients and improve the quality of care. But in aggregate the data can also be mined to spot unwanted drug interactions, identify the most effective treatments and predict the onset of disease before symptoms emerge. Computers already attempt to do these things, but need to be explicitly programmed for them. In a world of big data the correlations surface almost by themselves.

Sometimes those data reveal more than was intended. For example, the city of Oakland, California, releases information on where and when arrests were made, which is put out on a private website, Oakland Crimespotting. At one point a few clicks revealed that police swept the whole of a busy street for prostitution every evening except on Wednesdays, a tactic they probably meant to keep to themselves.

But big data can have far more serious consequences than that. During the recent financial crisis it became clear that banks and rating agencies had been relying on models which, although they required a vast amount of information to be fed in, failed to reflect financial risk in the real world. This was the first crisis to be sparked by big data—and there will be more.

The way that information is managed touches all areas of life. At the turn of the 20th century new flows of information through channels such as the telegraph and telephone supported mass production. Today the availability of abundant data enables companies to cater to small niche markets anywhere in the world. Economic production used to be based in the factory, where managers pored over every machine and process to make it more efficient. Now statisticians mine the information output of the business for new ideas.

"The data-centred economy is just nascent," admits Mr Mundie of Microsoft. "You can see the outlines of it, but the technical, infrastructural and even business-model implications are not well understood right now." This special report will point to where it is beginning to surface.

All too much


Monstrous amounts of data

QUANTIFYING the amount of information that exists in the world is hard. What is clear is that there is an awful lot of it, and it is growing at a terrific rate (a compound annual 60%) that is speeding up all the time. The flood of data from sensors, computers, research labs, cameras, phones and the like surpassed the capacity of storage technologies in 2007. Experiments at the Large Hadron Collider at CERN, Europe's particle-physics laboratory near Geneva, generate 40 terabytes every second—orders of magnitude more than can be stored or analysed. So scientists collect what they can and let the rest dissipate into the ether.

According to a 2008 study by International Data Corp (IDC), a market-research firm, around 1,200 exabytes of digital data will be generated this year. Other studies measure slightly different things. Hal Varian and the late Peter Lyman of the University of California in Berkeley, who pioneered the idea of counting the world's bits, came up with a far smaller amount, around 5 exabytes in 2002, because they counted only the stock of original content.

What about the information that is actually consumed? Researchers at the University of California in San Diego (UCSD) examined the flow of data to American households. They found that in 2008 such households were bombarded with 3.6 zettabytes of information (or 34 gigabytes per person per day). The biggest data hogs were video games and television. In terms of bytes, written words are insignificant, amounting to less than 0.1% of the total. However, the amount of reading people do, previously in decline because of television, has almost tripled since 1980, thanks to all that text on the internet. In the past information consumption was largely passive, leaving aside the telephone. Today half of all bytes are received interactively, according to the UCSD. Future studies will extend beyond American households to quantify consumption globally and include business use as well.

March of the machines

Significantly, "information created by machines and used by other machines will probably grow faster than anything else," explains Roger Bohn of the UCSD, one of the authors of the study on American households. "This is primarily 'database to database' information—people are only tangentially involved in most of it."

Only 5% of the information that is created is "structured", meaning it comes in a standard format of words or numbers that can be read by computers. The rest are things like photos and phone calls which are less easily retrievable and usable. But this is changing as content on the web is increasingly "tagged", and facial-recognition and voice-recognition software can identify people and words in digital files.

"It is a very sad thing that nowadays there is so little useless information," quipped Oscar Wilde in 1894. He did not know the half of it.

A different game
Information is transforming traditional businesses

IN 1879 James Ritty, a saloon-keeper in Dayton, Ohio, received a patent for a wooden contraption that he dubbed the "incorruptible cashier". With a set of buttons and a loud bell, the device, sold by National Cash Register (NCR), was little more than a simple adding machine. Yet as an early form of managing information flows in American business the cash register had a huge impact. It not only reduced pilferage by alerting the shopkeeper when the till was opened; by recording every transaction, it also provided an instant overview of what was happening in the business.

Sales data remain one of a company's most important assets. In 2004 Wal-Mart peered into its mammoth databases and noticed that before a hurricane struck, there was a run on flashlights and batteries, as might be expected; but also on Pop-Tarts, a sugary American breakfast snack. On reflection it is clear that the snack would be a handy thing to eat in a blackout, but the retailer would not have thought to stock up on it before a storm. The company whose system crunched Wal-Mart's numbers was none other than NCR and its data-warehousing unit, Teradata, now an independent firm.

A few years ago such technologies, called "business intelligence", were available only to the world's biggest companies. But as the price of computing and storage has fallen and the software systems have got better and cheaper, the technology has moved into the mainstream. Companies are collecting more data than ever before. In the past they were kept in different systems that were unable to talk to each other, such as finance, human resources or customer management. Now the systems are being linked, and companies are using data-mining techniques to get a complete picture of their operations—"a single version of the truth", as the industry likes to call it. That allows firms to operate more efficiently, pick out trends and improve their forecasting.

Consider Cablecom, a Swiss telecoms operator. It has reduced customer defections from one-fifth of subscribers a year to under 5% by crunching its numbers. Its software spotted that although customer defections peaked in the 13th month, the decision to leave was made much earlier, around the ninth month (as indicated by things like the number of calls to customer support services). So Cablecom offered certain customers special deals seven months into their subscription and reaped the rewards.
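
In code, the early-warning rule might look something like the sketch below. It is not Cablecom's actual model; the field names, the three-call threshold and the simple two-part test are assumptions made up for illustration, with only the ninth-month decision window taken from the account above:

# Toy churn early-warning rule: flag subscribers whose behaviour around
# month nine suggests they have already decided to leave.  Field names,
# call threshold and the rule itself are illustrative, not Cablecom's model.

def at_risk(subscriber):
    in_decision_window = 8 <= subscriber["tenure_months"] <= 10
    support_calls_rising = subscriber["support_calls_last_90d"] >= 3
    return in_decision_window and support_calls_rising

def retention_targets(subscribers):
    # Subscribers who might be offered a special deal before they defect.
    return [s["id"] for s in subscribers if at_risk(s)]

subscribers = [
    {"id": "A", "tenure_months": 9,  "support_calls_last_90d": 4},
    {"id": "B", "tenure_months": 9,  "support_calls_last_90d": 0},
    {"id": "C", "tenure_months": 14, "support_calls_last_90d": 5},
]
print(retention_targets(subscribers))   # -> ['A']
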
Agony and torture

Such data-mining has a dubious reputation. "Torture the data long enough and they will confess to anything," statisticians quip. But it has become far more effective as more companies have started to use the technology. Best Buy, a retailer, found that 7% of its customers accounted for 43% of its sales, so it reorganised its stores to concentrate on those customers' needs. Airline yield management improved because analytical techniques uncovered the best predictor that a passenger would actually catch a flight he had booked: that he had ordered a vegetarian meal.

The IT industry is piling into business intelligence, seeing it as a natural successor of services such as accountancy and computing in the first and second half of the 20th century respectively. Accenture, PricewaterhouseCoopers, IBM and SAP are investing heavily in their consulting practices. Technology vendors such as Oracle, Informatica, TIBCO, SAS and EMC have benefited. IBM believes business intelligence will be a pillar of its growth as sensors are used to manage things from a city's traffic flow to a patient's blood flow. It has invested $12 billion in the past four years and is opening six analytics centres with 4,000 employees worldwide.

Analytics—performing statistical operations for forecasting or uncovering correlations such as between Pop-Tarts and hurricanes—can have a big pay-off. In Britain the Royal Shakespeare Company (RSC) sifted through seven years of sales data for a marketing campaign that increased regular visitors by 70%. By examining more than 2m transaction records, the RSC discovered a lot more about its best customers: not just income, but things like occupation and family status, which allowed it to target its marketing more precisely. That was of crucial importance, says the RSC's Mary Butlin, because it substantially boosted membership as well as fund-raising revenue.

Yet making the most of data is not easy. The first step is to improve the accuracy of the information. Nestlé, for example, sells more than 100,000 products in 200 countries, using 550,000 suppliers, but it was not using its huge buying power effectively because its databases were a mess. On examination, it found that of its 9m records of vendors, customers and materials around half were obsolete or duplicated, and of the remainder about one-third were inaccurate or incomplete. The name of a vendor might be abbreviated in one record but spelled out in another, leading to double-counting.

Plainer vanilla

Over the past ten years Nestlé has been overhauling its IT system, using SAP software, and improving the quality of its data. This enabled the firm to become more efficient, says Chris Johnson, who led the initiative. For just one ingredient, vanilla, its American operation was able to reduce the number of specifications and use fewer suppliers, saving $30m a year. Overall, such operational improvements save more than $1 billion annually.

Nestlé is not alone in having problems with its database. Most CIOs admit that their data are of poor quality. In a study by IBM half the managers quizzed did not trust the information on which they had to base decisions. Many say that the technology meant to make sense of it often just produces more data. Instead of finding a needle in the haystack, they are making more hay.

Still, as analytical techniques become more widespread, business decisions will increasingly be made, or at least corroborated, on the basis of computer algorithms rather than individual hunches. This creates a need for managers who are comfortable with data, but statistics courses in business schools are not popular.

Many new business insights come from "dead data": stored information about past transactions that are examined to reveal hidden correlations. But now companies are increasingly moving to analysing real-time information flows.

Wal-Mart is a good example. The retailer operates 8,400 stores worldwide, has more than 2m employees and handles over 200m customer transactions each week. Its revenue last year, around $400 billion, is more than the GDP of many entire countries. The sheer scale of the data is a challenge, admits Rollin Ford, the CIO at Wal-Mart's headquarters in Bentonville, Arkansas. "We keep a healthy paranoia."

Not a sparrow falls

Wal-Mart's inventory-management system, called Retail Link, enables suppliers to see the exact number of their products on every shelf of every store at that precise moment. The system shows the rate of sales by the hour, by the day, over the past year and more. Begun in the 1990s, Retail Link gives suppliers a complete overview of when and how their products are selling, and with what other products in the shopping cart. This lets suppliers manage their stocks better.

The technology enabled Wal-Mart to change the business model of retailing. In some cases it leaves stock management in the hands of its suppliers and does not take ownership of the products until the moment they are sold. This allows it to shed inventory risk and reduce its costs. In essence, the shelves in its shops are a highly efficiently managed depot.

Another company that capitalises on real-time information flows is Li & Fung, one of the world's biggest supply-chain operators. Founded in Guangzhou in southern China a century ago, it does not own any factories or equipment but orchestrates a network of 12,000 suppliers in 40 countries, sourcing goods for brands ranging from Kate Spade to Walt Disney. Its turnover in 2008 was $14 billion.

Li & Fung used to deal with its clients mostly by phone and fax, with e-mail counting as high technology. But thanks to a new web-services platform, its processes have speeded up. Orders flow through a web portal and bids can be solicited from pre-qualified suppliers. Agents now audit factories in real time with hand-held computers. Clients are able to monitor the details of every stage of an order, from the initial production run to shipping.

One of the most important technologies has turned out to be videoconferencing. It allows buyers and manufacturers to examine the colour of a material or the stitching on a garment. "Before, we weren't able to send a 500MB image—we'd post a DVD. Now we can stream it to show vendors in our offices. With real-time images we can make changes quicker," says Manuel Fernandez, Li & Fung's chief technology officer. Data flowing through its network soared from 100 gigabytes a day only 18 months ago to 1 terabyte.

The information system also allows Li & Fung to look across its operations to identify trends. In southern China, for instance, a shortage of workers and new legislation raised labour costs, so production moved north. "We saw that before it actually happened," says Mr Fernandez. The company also got advance warning of the economic crisis, and later the recovery, from retailers' orders before these trends became apparent. Investment analysts use country information provided by Li & Fung to gain insights into macroeconomic patterns.

Now that they are able to process information flows in real time, organisations are collecting more data than ever. One use for such information is to forecast when machines will break down. This hardly ever happens out of the blue: there are usually warning signs such as noise, vibration or heat. Capturing such data enables firms to act before a breakdown.

Similarly, the use of "predictive analytics" on the basis of large data sets may transform health care. Dr Carolyn McGregor of the University of Ontario, working with IBM, conducts research to spot potentially fatal infections in premature babies. The system monitors subtle changes in seven streams of real-time data, such as respiration, heart rate and blood pressure. The electrocardiogram alone generates 1,000 readings per second.

This kind of information is turned out by all medical equipment, but it used to be recorded on paper and examined perhaps once an hour. By feeding the data into a computer, Dr McGregor has been able to detect the onset of an infection before obvious symptoms emerge. "You can't see it with the naked eye, but a computer can," she says.
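
The underlying idea, catching a subtle shift in a vital-signs stream before a clinician would notice it, can be illustrated with a rolling-baseline check. The sketch below is not Dr McGregor's system: it uses a simple z-score against a moving window, and the heart-rate numbers are invented:

# Toy illustration of spotting drift in a vital-signs stream: a reading is
# flagged when it strays several standard deviations from the recent
# baseline.  Not the real system; the data and threshold are invented.
from collections import deque
from statistics import mean, stdev

def monitor(readings, window=60, z_threshold=3.0):
    baseline = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(baseline) == window:
            mu, sigma = mean(baseline), stdev(baseline)
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                yield i, value                 # deviates sharply from baseline
        baseline.append(value)

# Simulated heart-rate stream: steady around 150 bpm, then a gradual climb.
stream = [150 + (i % 3) - 1 for i in range(100)] + [152, 156, 163, 171, 180]
for index, bpm in monitor(stream):
    print(f"reading {index}: {bpm} bpm is well outside the recent baseline")
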
Open sesame

Two technology trends are helping to fuel these new uses of data: cloud computing and open-source software. Cloud computing—in which the internet is used as a platform to collect, store and process data—allows businesses to lease computing power as and when they need it, rather than having to buy expensive equipment. Amazon, Google and Microsoft are the most prominent firms to make their massive computing infrastructure available to clients. As more corporate functions, such as human resources or sales, are managed over a network, companies can see patterns across the whole of the business and share their information more easily.

A free programming language called R lets companies examine and present big data sets, and free software called Hadoop now allows ordinary PCs to analyse huge quantities of data that previously required a supercomputer. It does this by parcelling out the tasks to numerous computers at once. This saves time and money. For example, the New York Times a few years ago used cloud computing and Hadoop to convert over 400,000 scanned images from its archives, from 1851 to 1922. By harnessing the power of hundreds of computers, it was able to do the job in 36 hours.
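
Hadoop's trick of parcelling out work follows the map-reduce pattern: split the input, let many workers process their pieces independently, then merge the partial results. The sketch below imitates the pattern on a single machine with Python's multiprocessing module; a real Hadoop job distributes the same two steps across a cluster of computers:

# Map-reduce in miniature: count words by parcelling chunks of text out to
# several worker processes, then merging their partial counts.  A stand-in
# for what Hadoop does across many machines, not Hadoop itself.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):               # the "map" step, run in parallel
    return Counter(chunk.split())

def merge(partial_counts):            # the "reduce" step
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    chunks = [
        "data data everywhere and not a byte to waste",
        "the data deluge is already transforming business",
        "managed well the data can unlock new value",
    ]
    with Pool(processes=3) as pool:
        counts = pool.map(count_words, chunks)
    print(merge(counts).most_common(3))   # [('data', 4), ('the', 2), ...]
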
Visa, a credit-card company, in a recent trial with Hadoop crunched two years of test records, or 73 billion transactions, amounting to 36 terabytes of data. The processing time fell from one month with traditional methods to a mere 13 minutes. It is a striking successor to Ritty's incorruptible cashier for a data-driven age.

Show me
New ways of visualising data

IN 1998 Martin Wattenberg, then a graphic designer at the magazine SmartMoney in New York, had a problem. He wanted to depict the daily movements in the stockmarket, but the customary way, as a line showing the performance of an index over time, provided only a very broad overall picture. Every day hundreds of individual companies may rise or fall by a little or a lot. The same is true for whole sectors. Being able to see all this information at once could be useful to investors. But how to make it visually accessible?

Mr Wattenberg's brilliant idea was to adapt an existing technique to create a "Map of the Market" in the form of a grid. It used the day's closing share price to show more than 500 companies arranged by sector. Shades of green or red indicated whether a share had risen or fallen and by how much, showing the activity in every sector of the market. It was an instant hit—and brought the nascent field of data visualisation to a mainstream audience.

In recent years there have been big advances in displaying massive amounts of data to make them easily accessible. This is emerging as a vibrant and creative field melding the skills of computer science, statistics, artistic design and storytelling.

"Every field has some central tension it is trying to resolve. Visualisation deals with the inhuman scale of the information and the need to present it at the very human scale of what the eye can see," says Mr Wattenberg, who has since moved to IBM and now spearheads a new generation of data-visualisation specialists.

Market information may be hard to display, but at least the data are numerical. Words are even more difficult. One way of depicting them is to count them and present them in clusters, with more common ones shown in a proportionately larger font. Called a "word cloud", this method is popular across the web. It gives a rough indication of what a body of text is about.
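
The arithmetic behind a word cloud is simple enough to show in full: tally the words in a text, then scale each word's font size in proportion to its frequency. In the sketch below the point sizes and the choice of the 50 most common words are arbitrary, and actually drawing the cloud is left to whatever graphics tool is at hand:

# The arithmetic behind a word cloud: word frequencies mapped to font sizes.
# The point sizes are arbitrary; drawing the cloud is a separate job.
import re
from collections import Counter

def word_sizes(text, min_pt=10, max_pt=72, top_n=50):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    peak = counts[0][1]                 # frequency of the most common word
    return {word: min_pt + (max_pt - min_pt) * n / peak for word, n in counts}

speech = ("we the people of the united states, in order to form a more "
          "perfect union, establish justice, insure domestic tranquility")
for word, size in sorted(word_sizes(speech).items(), key=lambda kv: -kv[1])[:5]:
    print(f"{word:12s} {size:5.1f}pt")
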

Soon after President Obama's inauguration a word cloud with a graphical-semiotic representation of his 21-minute speech appeared on the web. The three most common words were nation, America and people. His predecessor's had been freedom, America and liberty. Abraham Lincoln had majored on war, God and offence.

The technique has a utility beyond identifying themes. Social-networking sites let users "tag" pages and images with words describing the content. The terms displayed in a "tag cloud" are links that will bring up a list of the related content.

Another way to present text, devised by Mr Wattenberg and a colleague at IBM, Fernanda Viégas, is a chart of edits made on Wikipedia. The online encyclopedia is written entirely by volunteers. The software creates a permanent record of every edit to show exactly who changed what, and when. That amounts to a lot of data over time.

One way to map the process is to assign different colours to different users and show how much of their contribution remains by the thickness of the line that represents it. The entry for "chocolate", for instance, looks smooth until a series of ragged zigzags reveals an item of text being repeatedly removed and restored as an arcane debate rages. Another visualisation looks at changes to Wikipedia entries by software designed to improve the way articles are categorised, showing the modifications as a sea of colour.

Is it art? Is it information? Some data-visual works have been exhibited in places such as the Whitney and the Museum of Modern Art in New York. Others have been turned into books, such as the web project "We Feel Fine" by Jonathan Harris and Sep Kamvar, which captures every instance of the words "feel" or "feeling" on Twitter, a social-networking site, and matches it to time, location, age, sex and even the weather.

For the purposes of data visualisation as many things as possible are reduced to raw data that can be presented visually, sometimes in unexpected ways. For instance, a representation of the sources cited in the journal Nature gives each source publication a line and identifies different scientific fields in different colours. This makes it easy to see that biology sources are most heavily cited, which is unsurprising. But it also shows, more unexpectedly, that the publications most heavily cited include the Physical Review Letters and Astrophysical Journal.

The art of the visible

Resembling a splendid orchid, the Nature chart can be criticised for being more picturesque than informative; but whether it is more art or more information, it offers a new way to look at the world at a time when almost everything generates huge swathes of data that are hard to understand. If a picture is worth a thousand words, an infographic is worth an awful lot of data points.

Visualisation is a relatively new discipline. The time series, the most common form of chart, did not start to appear in scientific writings until the late 18th century, notes Edward Tufte in his classic "The Visual Display of Quantitative Information", the bible of the business. Today's infographics experts are pioneering a new medium that presents meaty information in a compelling narrative: "Something in-between the textbook and the novel", writes Nathan Yau of UCLA in a recent book, "Beautiful Data".

It's only natural

The brain finds it easier to process information if it is presented as an image rather than as words or numbers. The right hemisphere recognises shapes and colours. The left side of the brain processes information in an analytical and sequential way and is more active when people read text or look at a spreadsheet. Looking through a numerical table takes a lot of mental effort, but information presented visually can be grasped in a few seconds. The brain identifies patterns, proportions and relationships to make instant subliminal comparisons. Businesses care about such things. Farecast, the online price-prediction service, hired applied psychologists to design the site's charts and colour schemes.

These graphics are often based on immense quantities of data. Jeffrey Heer of Stanford University helped develop sense.us, a website that gives people access to American census data going back more than a century. Ben Fry, an independent designer, created a map of the 26m roads in the continental United States. The dense communities of the north-east form a powerful contrast to the desolate far west. Aaron Koblin of Google plotted a map of every commercial flight in America over 24 hours, with brighter lines identifying routes with heavier traffic.

Such techniques are moving into the business world. Mr Fry designed interactive charts for GE's health-care division that show the costs borne by patients and insurers, respectively, for common diseases throughout people's lives. Among media companies the New York Times and the Guardian in Britain have been the most ambitious, producing data-rich, interactive graphics that are strong enough to stand on their own.

The tools are becoming more accessible. For example, Tableau Software, co-founded in 2003 by Pat Hanrahan of Stanford University, does for visualising data what word-processing did for text, allowing anyone to manipulate information creatively. Tableau offers both free and paid-for products, as does a website called Swivel.com. Some sites are entirely free. Google and an IBM website called Many Eyes let people upload their data to display in novel ways and share with others.

Some data sets are best represented as a moving image. As print publications move to e-readers, animated infographics will eventually become standard. The software Gapminder elegantly displays four dynamic variables at once.

Displaying information can make a difference by enabling people to understand complex matters and find creative solutions. Valdis Krebs, a specialist in mapping social interactions, recalls being called in to help with a corporate project that was vastly over budget and behind schedule. He drew up an intricate network map of e-mail traffic that showed distinct clusters, revealing that the teams involved were not talking directly to each other but passing messages via managers. So the company changed its office layout and its work processes—and the project quickly got back on track.

Needle in a haystack
The uses of information about information

AS DATA become more abundant, the main problem is no longer finding the information as such but laying one's hands on the relevant bits easily and quickly. What is needed is information about information. Librarians and computer scientists call it metadata.

Information management has a long history. In Assyria around three millennia ago clay tablets had small clay labels attached to them to make them easier to tell apart when they were filed in baskets or on shelves. The idea survived into the 20th century in the shape of the little catalogue cards librarians used to note down a book's title, author, subject and so on before the records were moved onto computers. The actual books constituted the data, the catalogue cards the metadata. Other examples range from package labels to the 5 billion bar codes that are scanned throughout the world every day.

These days metadata are undergoing a virtual renaissance. In order to be useful, the cornucopia of information provided by the internet has to be organised. That is what Google does so well. The raw material for its search engines comes free: web pages on the public internet. Where it adds value (and creates metadata) is by structuring the information, ranking it in order of its relevance to the query.

Google handles around half the world's internet searches, answering around 35,000 queries every second. Metadata are a potentially lucrative business. "If you can control the pathways and means of finding information, you can extract rents from subsequent levels of producers," explains Eli Noam, a telecoms economist at New York's Columbia Business School. But there are more benign uses too. For example, photos uploaded to the website Flickr contain metadata such as when and often where they were snapped, as well as the camera model—useful for would-be buyers.

Internet users help to label unstructured information so it can be easily found, tagging photos and videos. But they disdain conventional library classifications. Instead, they pick any word they fancy, creating an eclectic "folksonomy". So instead of labelling a photograph of Barack Obama as "president", they might call it "sexy" or "SOB". That sounds chaotic, but needn't be.

When information was recorded on a tangible medium—paper, film and so on—everything had only one correct place. With digital information the same item can be filed in several places at once, notes David Weinberger, the author of a book about taxonomy and the internet, "Everything Is Miscellaneous". Digital metadata make things more complicated and simpler at the same time.


Clicking for gold


How internet companies profit from data on the web
PSST! Amazon.com does not want you to know what it knows about you. It not only tracks the books you purchase, but also keeps a record of the ones you browse but do not buy to help it recommend other books to you. Information from its e-book, the Kindle, is probably even richer: how long a user spends reading each page, whether he takes notes and so on. But Amazon refuses to disclose what data it collects or how it uses them.

It is not alone. Across the internet economy, companies are compiling masses of data on people, their activities, their likes and dislikes, their relationships with others and even where they are at any particular moment—and keeping mum. For example, Facebook, a social-networking site, tracks the activities of its 400m users, half of whom spend an average of almost an hour on the site every day, but does not talk about what it finds. Google reveals a little but holds back a lot. Even eBay, the online auctioneer, keeps quiet.

"They are uncomfortable bringing so much attention to this because it is at the heart of their competitive advantage," says Tim O'Reilly, a technology insider and publisher. "Data are the coin of the realm. They have a big lead over other companies that do not 'get' this." As the communications director of one of the web's biggest sites admits, "we're not in a position to have an in-depth conversation. It has less to do with sensitive considerations like privacy. Instead, we're just not ready to tip our hand." In other words, the firm does not want to reveal valuable trade secrets.

The reticence partly reflects fears about consumer unease and unwelcome attention from regulators. But this is short-sighted, for two reasons. First, politicians and the public are already anxious. The chairman of America's Federal Trade Commission, Jon Leibowitz, has publicly grumbled that the industry has not been sufficiently forthcoming. Second, if users knew how the data were used, they would probably be more impressed than alarmed.

Where traditional businesses generally collect information about customers from their purchases or from surveys, internet companies have the luxury of being able to gather data from everything that happens on their sites. The biggest websites have long recognised that information itself is their biggest treasure. And it can immediately be put to use in a way that traditional firms cannot match.

Some of the techniques have become widespread. Before deploying a new feature, big sites run controlled experiments to see what works best. Amazon and Netflix, a site that offers films for hire, use a statistical technique called collaborative filtering to make recommendations to users based on what other users like. The technique they came up with has produced millions of dollars of additional sales. Nearly two-thirds of the film selections by Netflix's customers come from the referrals made by computer.
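
Collaborative filtering boils down to recommending what people with similar histories also liked. The sketch below scores candidate films by how often they co-occur with a user's own films in other customers' histories; the data are invented, and the production systems at Amazon and Netflix weigh ratings, popularity and much else besides:

# Bare-bones collaborative filtering: recommend items that co-occur most
# often with a user's items in other users' histories.  Invented data;
# real recommenders weight ratings, popularity, recency and much more.
from collections import Counter

histories = {
    "ann":   {"Alien", "Blade Runner", "Brazil"},
    "bob":   {"Alien", "Blade Runner", "Solaris"},
    "carol": {"Blade Runner", "Solaris", "Stalker"},
    "dave":  {"Brazil", "Stalker"},
}

def recommend(user, k=2):
    seen = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(seen & items)      # how similar the other user is
        for item in items - seen:        # their items this user hasn't seen
            scores[item] += overlap
    return [item for item, _ in scores.most_common(k)]

print(recommend("ann"))   # -> ['Solaris', 'Stalker']
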

EBay, which at first sight looks like nothing more than a neutral platform for commercial exchanges, makes myriad adjustments based on information culled from listing activity, bidding behaviour, pricing trends, search terms and the length of time users look at a page. Every product category is treated as a micro-economy that is actively managed. Lots of searches but few sales for an expensive item may signal unmet demand, so eBay will find a partner to offer sellers insurance to increase listings.

The company that gets the most out of its data is Google. Creating new economic value from unthinkably large amounts of information is its lifeblood. That helps explain why, on inspection, the market capitalisation of the 11-year-old firm, of around $170 billion, is not so outlandish. Google exploits information that is a by-product of user interactions, or data exhaust, which is automatically recycled to improve the service or create an entirely new product.

Vote with your mouse

Until 1998, when Larry Page, one of Google's founders, devised the PageRank algorithm for search, search engines counted the number of times that a word appeared on a web page to determine its relevance—a system wide open to manipulation. Google's innovation was to count the number of inbound links from other web pages. Such links act as "votes" on what internet users at large believe to be good content. More links suggest a webpage is more useful, just as more citations of a book suggest it is better.
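
The links-as-votes idea can be captured in a few lines. The sketch below is the textbook power-iteration form of PageRank run on an invented four-page web, with the conventional damping factor of 0.85; it follows the published algorithm rather than whatever Google actually runs today:

# Toy PageRank by power iteration: a page's score is built from the scores
# of the pages that link to it, so inbound links act as weighted "votes".
# Textbook algorithm on an invented four-page web, not Google's live system.

links = {                      # page -> pages it links out to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)    # a page splits its vote
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")   # "c", with three inbound links, comes top
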
But although Google's system was an improvement, it too was open to abuse from "link spam", created only to dupe the system. The firm's engineers realised that the solution was staring them in the face: the search results on which users actually clicked and stayed. A Google search might yield 2m pages of results in a quarter of a second, but users often want just one page, and by choosing it they "tell" Google what they are looking for. So the algorithm was rejigged to feed that information back into the service automatically.

From then on Google realised it was in the data-mining business. To put the model in simple economic terms, its search results give away, say, $1 in value, and in return (thanks to the user's clicks) it gets 1 cent back. When the next user visits, he gets $1.01 of value, and so on. As one employee puts it: "We like learning from large, 'noisy' data sets."

Making improvements on the back of a big data set is not a Google monopoly, nor is the technique new. One of the most striking examples dates from the mid-1800s, when Matthew Fontaine Maury of the American navy had the idea of aggregating nautical logs from ships crossing the Pacific to find the routes that offered the best winds and currents. He created an early variant of a "viral" social network, rewarding captains who submitted their logbooks with a copy of his maps. But the process was slow and laborious.

Wizard spelling

Google applies this principle of recursively learning from the data to many of its services, including the humble spell-check, for which it used a pioneering method that produced perhaps the world's best spell-checker in almost every language. Microsoft says it spent several million dollars over 20 years to develop a robust spell-checker for its word-processing program. But Google got its raw material free: its program is based on all the misspellings that users type into a search window and then "correct" by clicking on the right result. With almost 3 billion queries a day, those results soon mount up. Other search engines in the 1990s had the chance to do the same, but did not pursue it. Around 2000 Yahoo! saw the potential, but nothing came of the idea. It was Google that recognised the gold dust in the detritus of its interactions with its users and took the trouble to collect it up.

Two newer Google services take the same approach: translation and voice recognition. Both have been big stumbling blocks for computer scientists working on artificial intelligence. For over four decades the boffins tried to program computers to "understand" the structure and phonetics of language. This meant defining rules such as where nouns and verbs go in a sentence, which are the correct tenses and so on. All the exceptions to the rules needed to be programmed in too. Google, by contrast, saw it as a big maths problem that could be solved with a lot of data and processing power—and came up with something very useful.

For translation, the company was able to draw on its other services. Its search system had copies of European Commission documents, which are translated into around 20 languages. Its book-scanning project has thousands of titles that have been translated into many languages. All these translations are very good, done by experts to exacting standards. So instead of trying to teach its computers the rules of a language, Google turned them loose on the texts to make statistical inferences. Google Translate now covers more than 50 languages, according to Franz Och, one of the company's engineers. The system identifies which word or phrase in one language is the most likely equivalent in a second language. If direct translations are not available (say, Hindi to Catalan), then English is used as a bridge.
.:/p13
Reprinted from The Economist February 27th 2010

systems than just trying to improve the algorithms,” says Andreas Weigend, a former chief scientist at Amazon who is now at Stanford University. Marc Andreessen, a venture capitalist who sits on numerous boards and was one of the founders of Netscape, the web’s first commercial browser, thinks that “these new companies have built a culture, and the processes and the technology to deal with large amounts of data, that traditional companies simply don’t have.”

Recycling data exhaust is a common theme in the myriad projects going on in Google’s empire and helps explain why almost all of them are labelled as a “beta” or early test version: they truly are in continuous development. A service that lets Google users store medical records might also allow the company to spot valuable patterns about diseases and treatments. A service where users can monitor their use of electricity, device by device, provides rich information on energy consumption. It could become the world’s best database of household appliances and consumer electronics—and even foresee breakdowns. The aggregated search queries, which the company makes available free, are used as remarkably accurate predictors for everything from retail sales to flu outbreaks.

Together, all this is in line with the company’s audacious mission to “organise the world’s information”. Yet the words are carefully chosen: Google does not need to own the data. Usually all it wants is to have access to them (and see that its rivals do not). In an initiative called “Data Liberation Front” that quietly began last September, Google is planning to rejig all its services so that users can discontinue them very easily and take their data with them. In an industry built on locking in the customer, the company says it wants to reduce the “barriers to exit”. That should help save its engineers from complacency, the curse of many a tech champion. The project might stall if it started to hurt the business. But perhaps Google reckons that users will be more inclined to share their information with it if they know that they can easily take it back.

© The Economist Newspaper Limited, London (2010)
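To make the statistical-translation idea described above concrete, here is a minimal sketch in Python. It is not Google’s system: the phrase tables, probabilities and names (PHRASE_TABLES, best_guess, translate) are invented for illustration, and a real translator would estimate such probabilities from millions of aligned documents rather than hard-code a handful of entries.

# Minimal sketch, assuming invented phrase tables: pick the likeliest target
# phrase for a source phrase, and bridge through English when no direct
# table exists (say, Hindi to Catalan).

PHRASE_TABLES = {
    ("hi", "en"): {"duniya": {"world": 0.9, "earth": 0.1}},
    ("en", "ca"): {"world": {"món": 0.8, "planeta": 0.2}},
}

def best_guess(candidates):
    # Return the (phrase, probability) pair with the highest probability.
    return max(candidates.items(), key=lambda kv: kv[1])

def translate(phrase, src, tgt):
    direct = PHRASE_TABLES.get((src, tgt), {})
    if phrase in direct:
        return best_guess(direct[phrase])
    # No direct table: use English as a bridge, multiplying the probabilities
    # of the two hops and summing over every English pivot phrase.
    scores = {}
    for en, p1 in PHRASE_TABLES.get((src, "en"), {}).get(phrase, {}).items():
        for out, p2 in PHRASE_TABLES.get(("en", tgt), {}).get(en, {}).items():
            scores[out] = scores.get(out, 0.0) + p1 * p2
    return best_guess(scores) if scores else (None, 0.0)

print(translate("duniya", "hi", "ca"))  # ('món', 0.72), via the English bridge

The point is not the toy numbers but the shape of the computation: every choice is a probability estimated from data, which is why more data can beat cleverer hand-written rules.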


The open society


Governments are letting in the light

FROM antiquity to modern times, the nation has always been a product of information management. The ability to impose taxes, promulgate laws, count citizens and raise an army lies at the heart of statehood. Yet something new is afoot. These days democratic openness means more than that citizens can vote at regular intervals in free and fair elections. They also expect to have access to government data.

The state has long been the biggest generator, collector and user of data. It keeps records on every birth, marriage and death, compiles figures on all aspects of the economy and keeps statistics on licences, laws and the weather. Yet until recently all these data have been locked tight. Even when publicly accessible they were hard to find, and aggregating lots of printed information is notoriously difficult.

But now citizens and non-governmental organisations the world over are pressing to get access to public data at the national, state and municipal level—and sometimes government officials enthusiastically support them. “Government information is a form of infrastructure, no less important to our modern life than our roads, electrical grid or water systems,” says Carl Malamud, the boss of a group called Public.Resource.Org that puts government data online. He was responsible for making the databases of America’s Securities and Exchange Commission available on the web in the early 1990s.

America is in the lead on data access. On his first full day in office Barack Obama issued a presidential memorandum ordering the heads of federal agencies to make available as much information as possible, urging them to act “with a clear presumption: in the face of doubt, openness prevails”. This was all the more remarkable since the Bush administration had explicitly instructed agencies to do the opposite.

Mr Obama’s directive caused a flurry of activity. It is now possible to obtain figures on job-related deaths that name employers, and to get annual data on migration free. Some information that was previously available but hard to get at, such as the Federal Register, a record of government notices, now comes in a computer-readable format. It is all on a public website, data.gov. And more information is being released all the time. Within 48 hours of data on flight delays being made public, a website had sprung up to disseminate them.

Providing access to data “creates a culture of accountability”, says Vivek Kundra, the federal government’s CIO. One of the first things he did after taking office was to create an online “dashboard” detailing the government’s own $70 billion technology spending. Now that the information is freely available, Congress and the public can ask questions or offer suggestions. The model will be applied to other areas, perhaps including health-care data, says Mr Kundra—provided that looming privacy issues can be resolved.

All this has made a big difference. “There is a cultural change in what
people expect from government, fuelled by the experience of shopping on the internet and having real-time access to financial information,” says John Wonderlich of the Sunlight Foundation, which promotes open government. The economic crisis has speeded up that change, particularly in state and city governments.

“The city is facing its eighth budget shortfall. We’re looking at a 50% reduction in operating funds,” says Chris Vein, San Francisco’s CIO. “We must figure out how we change our operations.” He insists that providing more information can make government more efficient. California’s generous “sunshine laws” provide the necessary legal backing.

Among the first users of the newly available data was a site called “San Francisco Crimespotting” by Stamen Design that layers historical crime figures on top of map information. It allows users to play around with the data and spot hidden trends. People now often come to public meetings armed with crime maps to demand police patrols in their particular area.

Anyone can play

Other cities, including New York, Chicago and Washington, DC, are racing ahead as well. Now that citizens’ groups and companies have the raw data, they can use them to improve city services in ways that cash-strapped local governments cannot. For instance, cleanscores.com puts restaurants’ health-inspection scores online; other sites list children’s activities or help people find parking spaces. In the past government would have been pressed to provide these services; now it simply supplies the data. Mr Vein concedes, however, that “we don’t know what is useful or not. This is a grand experiment.”

Other parts of the world are also beginning to move to greater openness. A European Commission directive in 2005 called for making public-sector information more accessible (but it has no bite). Europe’s digital activists use the web to track politicians and to try to improve public services. In Britain FixMyStreet.com gives citizens the opportunity to flag up local problems. That allows local authorities to find out about people’s concerns; and once the problem has been publicly aired it becomes more difficult to ignore.

One obstacle is that most countries lack America’s open-government ethos, nurtured over decades by laws on ethics in government, transparency rules and the Freedom of Information Act, which acquired teeth after the Nixon years.

An obstacle of a different sort is Crown copyright, which means that most government data in Britain and the Commonwealth countries are the state’s property, constraining their use. In Britain postcodes and Ordnance Survey map data at present cannot be freely used for commercial purposes—a source of loud complaints from businesses and activists. But from later this year access to some parts of both data sets will be free, thanks to an initiative to bring more government services online.

But even in America access to some government information is restricted by financial barriers. Remarkably, this applies to court documents, which in a democracy should surely be free. Legal records are public and available online from the Administrative Office of the US Courts (AOUSC), but at a costly eight cents per page. Even the federal government has to pay: between 2000 and 2008 it spent $30m to get access to its own records. Yet the AOUSC is currently paying $156m over ten years to two companies, Westlaw and LexisNexis, to publish the material online (albeit organised and searchable with the firms’ technologies). Those companies, for their part, earn an estimated $2 billion annually from selling American court rulings and extra content such as case reference guides. “The law is locked up behind a cash register,” says Mr Malamud.

The two firms say they welcome competition, pointing to their strong search technology and the additional services they provide, such as case summaries and useful precedents. It seems unlikely that they will keep their grip for long. One administration official privately calls freeing the information a “no-brainer”. Even Google has begun to provide some legal documents online.

Change agent

The point of open information is not merely to expose the world but to change it. In recent years moves towards more transparency in government have become one of the most vibrant and promising areas of public policy. Sometimes information disclosure can achieve policy aims more effectively and at far lower cost than traditional regulation.

In an important shift, new transparency requirements are now being used by government—and by the public—to hold the private sector to account. For example, it had proved extremely difficult to persuade American businesses to cut down on the use of harmful chemicals and their release into the environment. An add-on to a 1986 law required firms simply to disclose what they release, including “by computer telecommunications”. Even to supporters it seemed like a fudge, but it turned out to be a resounding success. By 2000 American businesses had reduced their emissions of the chemicals covered under the law by 40%, and over time the rules were actually tightened. Public scrutiny achieved what legislation could not.

There have been many other such successes in areas as diverse as restaurant sanitation, car safety, nutrition, home loans for minorities and educational performance, note Archon Fung, Mary Graham and David Weil of the Transparency Policy Project at Harvard’s Kennedy School of Government in their book “Full Disclosure”. But transparency alone is not enough. There has to be a community to champion the information. Providers need an incentive to supply the data as well as penalties for withholding them. And web developers have to find ways of ensuring that the public data being released are used effectively.

Mr Fung thinks that as governments release more and more information about the things they do, the data will be used to show the public sector’s shortcomings rather than to highlight its achievements. Another concern is that the accuracy and quality of the data will be found wanting (which is a problem for business as well as for the public sector). There is also a debate over whether governments should merely supply the raw data or get involved in processing and displaying them too. The concern is that they might manipulate them—but then so might anyone else.


Public access to government figures is certain to release economic value and encourage entrepreneurship. That has already happened with weather data and with America’s GPS satellite-navigation system that was opened for full commercial use a decade ago. And many firms make a good living out of searching for or repackaging patent filings.

Moreover, providing information opens up new forms of collaboration between the public and the private sectors. Beth Noveck, one of the Obama administration’s recruits, who is a law professor and author of a book entitled “Wiki Government”, has spearheaded an initiative called peer-to-patent that has opened up some of America’s patent filings for public inspection.

John Stuart Mill in 1861 called for “the widest participation in the details of judicial and administrative business… above all by the utmost possible publicity.” These days, that includes the greatest possible disclosure of data by electronic means.

© The Economist Newspaper Limited, London (2010)


New rules for big data


Regulators are having to rethink their brief

TWO centuries after Gutenberg invented movable type in the mid-1400s there were plenty of books around, but they were expensive and poorly made. In Britain a cartel had a lock on classic works such as Shakespeare’s and Milton’s. The first copyright law, enacted in the early 1700s in the Bard’s home country, was designed to free knowledge by putting books in the public domain after a short period of exclusivity, around 14 years. Laws protecting free speech did not emerge until the late 18th century. Before print became widespread the need was limited.

Now the information flows in an era of abundant data are changing the relationship between technology and the role of the state once again. Many of today’s rules look increasingly archaic. Privacy laws were not designed for networks. Rules for document retention presume paper records. And since all the information is interconnected, it needs global rules.

New principles for an age of big data sets will need to cover six broad areas: privacy, security, retention, processing, ownership and the integrity of information.

Privacy is one of the biggest worries. People are disclosing more personal information than ever. Social-networking sites and others actually depend on it. But as databases grow, information that on its own cannot be traced to a particular individual can often be unlocked with just a bit of computer effort.

This tension between individuals’ interest in protecting their privacy and companies’ interest in exploiting personal information could be resolved by giving people more control. They could be given the right to see and correct the information about them that an organisation holds, and to be told how it was used and with whom it was shared.

Today’s privacy rules aspire to this, but fall short because of technical difficulties which the industry likes to exaggerate. Better technology should eliminate such problems. Besides, firms are already spending a great deal on collecting, sharing and processing the data; they could divert a sliver of that money to provide greater individual control.

The benefits of information security—protecting computer systems and networks—are inherently invisible: if threats have been averted, things work as normal. That means it often gets neglected. One way to deal with that is to disclose more information. A pioneering law in California in 2003 required companies to notify people if a security breach had compromised their personal information, which pushed companies to invest more in prevention. The model has been adopted in other states and could be used more widely.

In addition, regulators could require large companies to undergo an annual information-security audit by an accredited third party, similar to financial audits for listed companies. Information about vulnerabilities would be kept confidential, but it could be used by firms to improve their practices and
handed to regulators if problems arose. It could even be a requirement for insurance coverage, allowing a market for information security to emerge.

Current rules on digital records state that data should never be stored for longer than necessary because they might be misused or inadvertently released. But Viktor Mayer-Schönberger of the National University of Singapore worries that the increasing power and decreasing price of computers will make it too easy to hold on to everything. In his recent book “Delete” he argues in favour of technical systems that “forget”: digital files that have expiry dates or slowly degrade over time.

Yet regulation is pushing in the opposite direction. There is a social and political expectation that records will be kept, says Peter Allen of CSC, a technology provider: “The more we know, the more we are expected to know—for ever.” American security officials have pressed companies to keep records because they may hold clues after a terrorist incident. In future it is more likely that companies will be required to retain all digital files, and ensure their accuracy, than to delete them.

Processing data is another concern. Ian Ayres, an economist and lawyer at Yale University and the author of “Super Crunchers”, a book about computer algorithms replacing human intuition, frets about the legal implications of using statistical correlations. Rebecca Goldin, a mathematician at George Mason University, goes further: she worries about the “ethics of super-crunching”. For example, racial discrimination against an applicant for a bank loan is illegal. But what if a computer model factors in the educational level of the applicant’s mother, which in America is strongly correlated with race? And what if computers, just as they can predict an individual’s susceptibility to a disease from other bits of information, can predict his predisposition to committing a crime?

A new regulatory principle in the age of big data, then, might be that people’s data cannot be used to discriminate against them on the basis of something that might or might not happen. The individual must be regarded as a free agent. This idea is akin to the general rule of national statistical offices that data gathered for surveys cannot be used against a person for things like deporting illegal immigrants—which, alas, has not always been respected.

Privacy rules lean towards treating personal information as a property right. A reasonable presumption might be that the trail of data that an individual leaves behind and that can be traced to him, from clicks on search engines to book-buying preferences, belongs to that individual, not the entity that collected it. Google’s “data liberation” initiative mentioned earlier in this report points in that direction. That might create a market for information. Indeed, “data portability” stimulates competition, just as phone-number portability encourages competition among mobile operators. It might also reduce the need for antitrust enforcement by counteracting data aggregators’ desire to grow ever bigger in order to reap economies of scale.

Ensuring the integrity of the information is an important part of the big-data age. When America’s secretary of state, Hillary Clinton, lambasted the Chinese in January for allegedly hacking into Google’s computers, she used the term “the global networked commons”. The idea is that the internet is a shared environment, like the oceans or airspace, which requires international co-operation to make the best use of it. Censorship pollutes that environment. Disrupting information flows not only violates the integrity of the data but quashes free expression and denies the right of assembly. Likewise, if telecoms operators give preferential treatment to certain content providers, they undermine the idea of “network neutrality”.

Governments could define best practice on dealing with information flows and the processing of data, just as they require firms to label processed foods with the ingredients or impose public-health standards. The World Trade Organisation, which oversees the free flow of physical trade, might be a suitable body for keeping digital goods and services flowing too. But it will not be quick or easy.

© The Economist Newspaper Limited, London (2010)
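To illustrate the worry above that information which on its own cannot be traced to a particular individual can often be unlocked “with just a bit of computer effort”, here is a minimal sketch in Python of a linkage attack: joining a nominally anonymous dataset to a public register on shared quasi-identifiers. Every record, name and field below is invented for illustration.

# Minimal sketch, assuming invented data: re-identify an "anonymous" record
# by matching it against a public register on quasi-identifiers.

anonymous_health_records = [
    {"postcode": "SW1A", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "EC2M", "birth_year": 1981, "sex": "M", "diagnosis": "diabetes"},
]

public_register = [
    {"name": "Jane Example", "postcode": "SW1A", "birth_year": 1975, "sex": "F"},
    {"name": "John Sample", "postcode": "N1", "birth_year": 1990, "sex": "M"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")

def link(records, register):
    # Attach a name to any record whose quasi-identifiers match exactly one
    # person in the public register.
    for record in records:
        matches = [p for p in register
                   if all(p[k] == record[k] for k in QUASI_IDENTIFIERS)]
        if len(matches) == 1:
            yield matches[0]["name"], record["diagnosis"]

print(list(link(anonymous_health_records, public_register)))
# [('Jane Example', 'asthma')] -- the "anonymous" record is re-identified

Stripping out names is therefore not enough; it is the combination of otherwise innocuous fields that does the unlocking.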


Handling the cornucopia


The best way to deal with all that information is to use machines. But they need watching

IN 2002 America’s Defence Advanced Research Projects Agency, best known for developing the internet four decades ago, embarked on a futuristic initiative called Augmented Cognition, or “AugCog”. Commander Dylan Schmorrow, a cognitive scientist with the navy, devised a crown of sensors to monitor activity in the brain such as blood flow and oxygen levels. The idea was that modern warfare requires soldiers to think like never before. They have to do things that require large amounts of information, such as manage drones or oversee a patrol from a remote location. The system can help soldiers make sense of the flood of information streaming in. So if the sensors detect that the wearer’s spatial memory is becoming saturated, new information will be sent in a different form, say via an audio alert instead of text. In a trial in 2005 the device achieved a 100% improvement in recall and a 500% increase in working memory.

Is this everybody’s future? Probably not. But as the torrent of information increases, it is not surprising that people feel overwhelmed. “There is an immense risk of cognitive overload,” explains Carl Pabo, a molecular biologist who studies cognition. The mind can handle seven pieces of information in its short-term memory and can generally deal with only
four concepts or relationships at once. If there is more information to process, or it is especially complex, people become confused.

Moreover, knowledge has become so specialised that it is impossible for any individual to grasp the whole picture. A true understanding of climate change, for instance, requires a knowledge of meteorology, chemistry, economics and law, among many other things. And whereas doctors a century ago were expected to keep up with the entire field of medicine, now they would need to be familiar with about 10,000 diseases, 3,000 drugs and more than 1,000 lab tests. A study in 2004 suggested that in epidemiology alone it would take 21 hours of work a day just to stay current. And as more people around the world become more educated, the flow of knowledge will increase even further. The number of peer-reviewed scientific papers in China alone has increased 14-fold since 1990 (see chart 3).

“What information consumes is rather obvious: it consumes the attention of its recipients,” wrote Herbert Simon, an economist, in 1971. “Hence a wealth of information creates a poverty of attention.” But just as it is machines that are generating most of the data deluge, so they can also be put to work to deal with it. That highlights the role of “information intermediaries”. People rarely deal with raw data but consume them in processed form, once they have been aggregated or winnowed by computers. Indeed, many of the technologies described in this report, from business analytics to recursive machine-learning to visualisation software, exist to make data more digestible for humans.

Some applications have already become so widespread that they are taken for granted. For example, banks use credit scores, based on data about past financial transactions, to judge an applicant’s ability to repay a loan. That makes the process less subjective than the say-so of a bank manager. Likewise, landing a plane requires a lot of mental effort, so the process has been largely automated, and both pilots and passengers feel safer. And in health care the trend is towards “evidence-based medicine”, where not only doctors but computers too get involved in diagnosis and treatment.

The dangers of complacency

In the age of big data, algorithms will be doing more of the thinking for people. But that carries risks. The technology is far less reliable than people realise. For every success with big data there are many failures. The inability of banks to understand their risks in the lead-up to the financial crisis is one example. The deficient system used to identify potential terrorists is another.

On Christmas Day last year a Nigerian man, Umar Farouk Abdulmutallab, tried to ignite a hidden bomb as his plane was landing in Detroit. It turned out his father had informed American officials that he posed a threat. His name was entered into a big database of around 550,000 people who potentially posed a security risk. But the database is notoriously flawed. It contains many duplicates, and names are regularly lost during back-ups. The officials had followed all the right procedures, but the system still did not prevent the suspect from boarding the plane.

One big worry is what happens if the technology stops working altogether. This is not a far-fetched idea. In January 2000 the torrent of data pouring into America’s National Security Agency (NSA) brought the system to a crashing halt. The agency was “brain-dead” for three-and-a-half days, General Michael Hayden, then its director, said publicly in 2002. “We were dark. Our ability to process information was gone.”

If an intelligence agency can be hit in this way, the chances are that most other users are at even greater risk. Part of the solution will be to pour more resources into improving the performance of existing technologies, not just pursue more innovations. The computer industry went through a similar period of reassessment in 2001-02 when Microsoft and others announced that they were concentrating on making their products much more secure rather than adding new features.

Another concern is energy consumption. Processing huge amounts of data takes a lot of power. “In two to three years we will saturate the electric cables running into the building,” says Alex Szalay at Johns Hopkins University. “The next challenge is how to do the same things as today, but with ten to 100 times less power.”

It is a worry that affects many organisations. The NSA in 2006 came close to exceeding its power supply, which would have blown out its electrical infrastructure. Both Google and Microsoft have had to put some of their huge data centres next to hydroelectric plants to ensure access to enough energy at a reasonable price.

Some people are even questioning whether the scramble for ever more information is a good idea. Nick Bostrom, a philosopher at Oxford University, identifies “information hazards” which result from disseminating information that is likely to cause harm, such as publishing the blueprint for a nuclear bomb or broadcasting news of a race riot that could provoke further violence. “It is said that a little knowledge is a dangerous thing,” he writes. “It is an open question whether more knowledge is safer.” Yet similar concerns have been raised through the ages, and mostly proved overblown.

Knowledge is power

The pursuit of information has been a human preoccupation since knowledge was first recorded. In the 3rd century BC Ptolemy stole every available scroll from passing travellers and ships to stock his great library in Alexandria. After September 11th 2001 the American Defence Department launched a program called “Total Information Awareness” to compile as many data as possible
about just about everything—e-mails, phone calls, web searches, shopping transactions, bank records, medical files, travel history and much more. Since 1996 Brewster Kahle, an internet entrepreneur, has been recording all the content on the web as a not-for-profit venture called the “Internet Archive”. It has since expanded to software, films, audio recordings and scanning books.

There has always been more information than people can mentally process. The chasm between the amount of information and man’s ability to deal with it may be widening, but that need not be a cause for alarm. “Our sensory and attentional systems are tuned via evolution and experience to be selective,” says Dennis Proffitt, a cognitive psychologist at the University of Virginia. People find patterns to compress information and make it manageable.

Even Commander Schmorrow does not think that man will be replaced by robots. “The flexibility of the human to consider as-yet-unforeseen consequences during critical decision-making, go with the gut when problem-solving under uncertainty and other such abstract reasoning behaviours built up over years of experience will not be readily replaced by a computer algorithm,” he says.

The cornucopia of data now available is a resource, similar to other resources in the world and even to technology itself. On their own, resources and technologies are neither good nor bad; it depends on how they are used. In the age of big data, computers will be monitoring more things, making more decisions and even automatically improving their own processes—and man will be left with the same challenges he has always faced. As T.S. Eliot asked: “Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?”

© The Economist Newspaper Limited, London (2010)
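As a rough illustration of the “information intermediaries” described above, machines that aggregate and winnow raw data into the handful of items short-term memory can hold, here is a minimal sketch in Python. The event stream and the winnow function are invented for illustration.

# Minimal sketch, assuming an invented event stream: aggregate and winnow
# raw data so a person reads a few lines instead of the whole stream.

from collections import Counter

raw_events = [
    "login-failure", "login-failure", "disk-full", "login-failure",
    "disk-full", "backup-ok", "login-failure",
]

def winnow(events, limit=7):
    # Count the events and keep only the `limit` most common items --
    # roughly the number of pieces of information short-term memory
    # can hold at once.
    return Counter(events).most_common(limit)

for event, count in winnow(raw_events, limit=3):
    print(f"{event}: {count}")
# login-failure: 4
# disk-full: 2
# backup-ok: 1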


Sources and acknowledgments


The author would like to thank all the people who shared their insights with him. He is particularly grateful to Viktor Mayer-Schönberger, Matthew Hindman, Hal Varian, Niko Waesche, Tim O’Reilly, Marc Benioff, Michael Kleeman, Stephan Herrera, Carl Pabo, James Cortada, Alex Szalay, Jeff Hammerbacher and the participants in the 2010 Global Leaders in Information Policy Conference. Thanks also to IBM, Informatica, TIBCO, McKinsey, Accenture, PwC, Microsoft and Google.

Sources

The following infographics were mentioned in this special report:
“Map of the Market”
Wikipedia / chocolate—the first image
Wikipedia / changes by software to categorise articles—fig. 4
“Word cloud” of Barack Obama’s inaugural speech
San Francisco “crimespotting” site
“We Feel Fine”
Nature sources
Sense.us
Ben Fry’s road map
Aaron Koblin’s flight patterns
Tableau Software
IBM Many Eyes
The Gapminder visualisation

The following books, reports and papers provided valuable information for this special report:

Books:
“Delete: The Virtue of Forgetting in the Digital Age” by Viktor Mayer-Schönberger, Princeton University Press, 2009.
“A Nation Transformed by Information: How Information Has Shaped the United States from Colonial Times to the Present” by Alfred D. Chandler Jr. and James W. Cortada (eds), Oxford University Press, 2000.
“Control through Communication: The Rise of System in American Management” by JoAnne Yates, The Johns Hopkins University Press, 1993.
“The Fourth Paradigm: Data-Intensive Scientific Discovery” by Tony Hey, Stewart Tansley and Kristin Tolle (eds), Microsoft Research, 2009.
“Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres, Bantam, 2007.
“The Numerati” by Stephen Baker, Houghton Mifflin Harcourt, 2008.
“Competing on Analytics: The New Science of Winning” by Thomas H. Davenport and Jeanne G. Harris, Harvard Business School Press, 2007.
“Analytics at Work: Smarter Decisions, Better Results” by Thomas H. Davenport, Jeanne G. Harris and Robert Morison, Harvard Business Press, 2010.
“Competing in a Flat World: Building Enterprises for a Borderless World” by Victor K. Fung, William K. Fung and Yoram (Jerry) Wind, Wharton School Publishing, 2007.
“Click: What Millions of People Are Doing Online and Why it Matters” by Bill Tancer, Hyperion, 2008.
“Full Disclosure: The Perils and Promise of Transparency” by Archon Fung, Mary Graham and David Weil, Cambridge University Press, 2007.
“Wiki Government: How Technology Can Make Government Better, Democracy Stronger, and Citizens More Powerful” by Beth Simone Noveck, Brookings Institution Press, 2009.
“The Visual Display of Quantitative Information” by Edward Tufte, Graphics Press, 2001 (second edition).
“Beautiful Data: The Stories Behind Elegant Data Solutions” by Toby Segaran and Jeff Hammerbacher (eds), O’Reilly Media, 2009.
“Everything Is Miscellaneous: The Power of the New Digital Disorder” by David Weinberger, Times Books, 2007.
“Glut: Mastering Information Through The Ages” by Alex Wright, Joseph Henry Press, 2007.
“The Social Life of Information” by John Seely Brown and Paul Duguid, Harvard Business Press, 2000.
“The Overflowing Brain: Information
Overload and the Limits of Working Memory” by Torkel Klingberg, Oxford University Press, 2008.

Reports and papers:
“The Promise and Peril of Big Data” by David Bollier, The Aspen Institute, 2010.
“Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age” by The National Academies Press, 2009.
“Computational Social Science” by David Lazer, Alex Pentland, Lada Adamic, et al., Science, February 6th 2009.
“Computer Mediated Transactions” by Hal Varian, Ely Lecture at the American Economics Association, January 18th 2010.
“The Unreasonable Effectiveness of Data” by Alon Halevy, Peter Norvig and Fernando Pereira, IEEE Intelligent Systems, vol. 24, no. 2, pp. 8-12, March/April 2009.
“Predicting the Present with Google Trends” by Hyunyoung Choi and Hal Varian, Google, April 10th 2009.
“Detecting influenza epidemics using search engine query data” by Jeremy Ginsberg, Matthew H. Mohebbi and Rajan S. Patel et al., Nature, vol 457, February 19th 2009.
“Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer, draft book chapter for Morgan & Claypool Synthesis Lectures on Human Language Technologies, February 7th 2010.
“Web Squared: Web 2.0 Five Years On” by Tim O’Reilly and John Battelle, O’Reilly Media, 2009.
“Information Accountability” by Daniel J. Weitzner, Harold Abelson and Tim Berners-Lee, et al., Communications of the ACM, vol 51, issue 6, June 2008.
“How Diverse is Facebook?” by Facebook Data Team, Facebook, December 16th 2009.
“Managing Global Data Privacy: Cross-Border Information Flows in a Networked Environment” by Paul M. Schwartz, The Privacy Projects, 2009.
“Data Protection Accountability: The Essential Elements—A Document for Discussion”, Centre for Information Policy Leadership, the Galway Project, October 2009.
“The Semantic Web” by Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American, May 2001.
“Information Hazards: A Typology of Potential Harms from Knowledge” by Nick Bostrom, Draft 1.11, 2009.
