Big Data
Achieving scale
What is big data?
Source: http://www.internetlivestats.com/
� Twitter: 600 million tweets per day.
� Facebook: 600 terabytes of incoming data each day, from 1.6 billion
active users.
� Google: 3.5 billion search queries per day.
� Instagram: 52 million new photos per day.
� Apple: 130 billion total app downloads.
� Netflix: 125 million hours of TV shows and movies streamed daily.
� Email: 205 billion messages per day.
How big is big?
� Massive data sets are typically the result of opportunity rather than design.
� In traditional hypothesis-driven science, we design an experiment to gather
exactly the data we need to answer our specific question.
� But big data is more typically the product of some logging process
recording discrete events, or distributed contributions from millions of
people over social media.
� The data scientist generally has little or no control of the collection process,
just a vague charter to turn all those bits into money.
Big data as bad data
Consider measuring popular opinion from the posts on a social media platform.
Big data can be a wonderful resource.
But it is particularly prone to biases and limitations that make it difficult to draw accurate conclusions, such as:
� Unrepresentative participation:
The data from any particular social media site does not reflect the people who don't use it.
Amazon users buy far more books than shoppers at Walmart, and their political affiliations and economic status differ as well. You get equally biased but very different views of the world when analyzing data from Instagram (too young), The New York Times (too liberal), Fox News (too conservative), or The Wall Street Journal (too wealthy).
The Three V’s
Volume :
� It goes without saying that big data is bigger than little data.
� But the distinction is one of class, not just of degree.
� We leave the world where we can represent our data in a spreadsheet or
process it on a single machine.
� This requires developing a more sophisticated computational infrastructure,
and restricting our analysis to linear-time algorithms for efficiency.
The Three V’s
Variety :
� Ambient data collection typically moves beyond the matrix to gather
heterogeneous data, which often requires ad hoc integration techniques.
� Consider social media. Posts may well include text, links, photos, and video.
Depending upon our task, all of these may be relevant, but text processing requires vastly different techniques than those used for network data and multimedia.
� Even images and videos are quite different beasts, not to be processed
using the same pipeline.
� Meaningfully integrating these materials into a single data set for analysis
requires substantial thought and effort.
The Three V’s
Velocity :
� Collecting data from ambient sources implies that the system is live,
meaning it is always on, always collecting data.
� Live data means that infrastructures must be built for collecting, indexing,
accessing, and visualizing the results, typically through a dashboard system.
� Live data means that consumers want real-time access to the latest results,
through graphs, charts, etc.
The fourth V
Veracity :
� How much can we trust the data we collect? Ambiently gathered data is often noisy, incomplete, and unvetted, so its accuracy must be assessed before we rely on it.
Big Analysis
� Traditional algorithm analysis rests on the RAM model of computation, in which each basic operation takes constant time. Hence counting up the operations performed over the course of the algorithm gives its running time.
Examples include:
Find the nearest neighbor of point p:
� Compare p against all n points in a given array a.
� The distance computation between p and point a[i] requires subtracting
and squaring d terms,
where d is the dimensionality of p.
� Looping through all n points and keeping track of the closest point takes O(d · n) time.
� Since d is typically small, it can be treated as a constant, so this is considered a linear-time algorithm, as in the sketch below.
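A minimal sketch of this linear scan in Python; the point representation (lists of d coordinates) and the function name are illustrative assumptions, not from the slides:

```python
def nearest_neighbor(p, points):
    """Return the point in points closest to p by scanning all n candidates."""
    best, best_dist = None, float("inf")
    for q in points:
        # Squared Euclidean distance: subtract and square each of the d terms.
        dist = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
        if dist < best_dist:
            best, best_dist = q, dist
    return best

# Example with d = 2: the closest point to the origin is [-1.0, 0.5].
print(nearest_neighbor([0.0, 0.0], [[3.0, 4.0], [1.0, 2.0], [-1.0, 0.5]]))
```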
Basic loop-structure algorithms
� Compare every point a[i] against every other point a[j], where 1 ≤ i ≠ j ≤ n; this takes O(d · n²) time, as in the sketch below.
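A sketch of such a quadratic all-pairs loop; using it to find the closest pair of points is my illustrative choice of task:

```python
def closest_pair(points):
    """Compare every point against every other point: O(d * n^2) time."""
    best_pair, best_dist = None, float("inf")
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair {i, j} is examined once
            dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if dist < best_dist:
                best_pair, best_dist = (points[i], points[j]), dist
    return best_pair

print(closest_pair([[0, 0], [5, 5], [1, 1], [9, 9]]))  # ([0, 0], [1, 1])
```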
Matrix multiplication :
� Multiplying an [x × y] matrix by a [y × z] matrix yields an [x × z] matrix.
� Each of the x · z resulting terms is the dot product of two y-length vectors, so the computation takes O(x · y · z) time, as in the sketch below.
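A minimal sketch of this triple loop in plain Python, with matrices as nested lists (names and data are illustrative):

```python
def matmul(A, B):
    """Multiply an x-by-y matrix A by a y-by-z matrix B: O(x * y * z) time."""
    x, y, z = len(A), len(B), len(B[0])
    C = [[0] * z for _ in range(x)]
    for i in range(x):
        for j in range(z):
            # C[i][j] is the dot product of row i of A and column j of B.
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(y))
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Libraries such as NumPy perform the same O(x · y · z) work with far better constants, but the loop structure above is the point here.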
Dictionary maintenance
� A hash table lets us insert, delete, and look up the position of object x in constant expected time in practice.
Hashing: Applications
� Frequency counting
� Duplicate removal
� Canonization
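A brief sketch of these applications using Python's hash-backed built-in containers; the sample data is made up for illustration:

```python
from collections import Counter

words = ["Big", "data", "big", "DATA", "data"]

# Canonization: reduce each word to a canonical form so that
# equivalent items hash to the same key.
canonical = [w.lower() for w in words]

# Frequency counting: a hash-backed dictionary from word to count.
counts = Counter(canonical)
print(counts["data"], counts["big"])   # 3 2

# Duplicate removal: a hash-backed set keeps one copy of each key.
unique = set(canonical)
print(sorted(unique))                  # ['big', 'data']
```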
Cryptographic hashing
� How can you prove that an input file remains unchanged since you last
analyzed it?
� Construct a hash code or checksum for the file when you worked on it
� Save this code for comparison with the file hash at any point in the future
� They will be the same if the file is unchanged, and
� almost surely differ if any alterations have occurred.
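A minimal sketch using Python's hashlib; the choice of SHA-256 and the file name are assumptions for illustration:

```python
import hashlib

def file_checksum(path):
    """Return a SHA-256 hex digest of the file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read 1 MB at a time
            h.update(chunk)
    return h.hexdigest()

# Save this digest now; if the file later yields a different digest,
# the contents have almost surely been altered.
print(file_checksum("dataset.csv"))
```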
Storage Hierarchy
� There are several possible types of devices on which to store our data, and they differ greatly in speed, capacity, and latency.
� The performance differences between levels of the storage hierarchy are so enormous that we cannot ignore them in our abstraction of the RAM machine.
� The ratio of disk access speed to cache-memory access speed is roughly the same (about 10^6) as the ratio of a tortoise's speed to the escape velocity of the Earth!
Levels of the Storage Hierarchy
Cache memory:
� Small, very fast memory on or near the processor that holds recently used data and instructions; it is typically measured in megabytes and accessed far faster than main memory.
Main memory:
� This is what holds the general state of the computation, and where large
data structures are hosted and maintained.
� Main memory is generally measured in gigabytes, and runs hundreds to
thousands of times faster than disk storage.
� To the greatest extent possible, we need data structures that fit into main
memory and avoid the paging behavior of virtual memory.
Levels of the Storage Hierarchy
Main memory on another machine:
� Latency times on a local area network run into the low-order milliseconds,
making it generally faster than secondary storage devices like disks.
� This means that distributed data structures like hash tables can be
meaningfully maintained across networks of machines, but with access
times that can be hundreds of times slower than main memory.
Levels of the Storage Hierarchy
Disk storage:
� One can organize a corpus of documents so that each document sits in its own file.
� This is logical for humans but slow for machines when there are millions of tiny files.
� Much better is to pack them into one large file that can be swept through efficiently, instead of requiring a separate disk access for each document, as in the sketch below.
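One way to do this, shown as a sketch (the JSON-lines format and file names are my assumptions, not prescribed here), is to append one record per line to a single large file and then stream through it sequentially:

```python
import json

# Pack many small documents into one large file, one JSON record per line.
docs = [{"id": i, "text": f"document {i}"} for i in range(1000)]
with open("corpus.jsonl", "w") as out:
    for doc in docs:
        out.write(json.dumps(doc) + "\n")

# A single sequential sweep reads every document without per-file disk seeks.
with open("corpus.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        # ... process doc here ...
```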
Filtering and Sampling
One important benefit of big data is that, with sufficient volume, you can afford to throw most of your data away.
This can be quite worthwhile, making your analysis cleaner and easier.
Filtering and Sampling
Sampling by truncation
� takes the first n records in the file as the desired sample.
� This is simple, and is readily reproducible, meaning someone else with the
full data file could easily reconstruct the sample.
However, truncated samples often contain subtle biases from factors such as:
� Temporal biases: Log files are typically constructed by appending new
records to the end of the file. Thus the first n records would be the oldest
available, and will not reflect recent regime changes.
Deterministic Sampling Algorithms
Sampling by truncation
� Lexicographic biases: Many files are sorted according to the primary key,
which means that the first n records are biased to a particular population.
Sampling by truncation
� Numerical biases: Often files are sorted by identity numbers, which may
appear to be arbitrarily defined. But ID numbers can encode meaning.
Example : Consider sorting the personnel records by their U.S. social security
numbers. In fact, the first five digits of social security numbers are generally a
function of the year and place of birth. Thus truncation leads to a
geographically and age-biased sample.
So truncation is, in general, a biased way to construct a sample.
Uniform Sampling
� A better approach is uniform random sampling: accept each record independently with the same probability p, so every record has an equal chance of appearing in the sample, as in the sketch below.
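A minimal sketch contrasting truncation with uniform random sampling; the record format, sampling rate, and seed are illustrative assumptions:

```python
import random

records = [f"record-{i}" for i in range(100000)]

# Sampling by truncation: take the first n records (simple and reproducible,
# but biased toward whatever happens to sit at the front of the file).
truncated = records[:1000]

# Uniform sampling: keep each record independently with probability p.
# Seeding the generator makes the sample reproducible by others.
random.seed(42)
p = 0.01
uniform = [r for r in records if random.random() < p]

print(len(truncated), len(uniform))  # 1000 and roughly 1000
```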
Parallel processing
� Happens on one machine, involving multiple cores and/or processors
� Communication through threads and operating system resources.
� Computation is often CPU-bound, limited more by the number of available cycles than by the movement of data through the machine.
� The emphasis is on solving a particular computing problem faster than one could do sequentially, as in the sketch below.
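A minimal sketch of single-machine parallelism with Python's multiprocessing module; the CPU-bound task (summing squares) is an arbitrary illustration:

```python
from multiprocessing import Pool

def sum_of_squares(n):
    """A CPU-bound task: limited by cycles, not by data movement."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [2_000_000] * 8
    with Pool() as pool:            # one worker process per available core
        results = pool.map(sum_of_squares, jobs)
    print(sum(results))
```

Distributed processing, by contrast, would spread such work across machines over a network rather than across the cores of one machine.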
Parallelism
Distributed processing
� Happens on many machines, using network communication.
� Appropriate to loosely-coupled jobs which do not communicate much.
� The goal involves sharing resources like memory and secondary storage
across multiple machines, more so than exploiting multiple CPUs.
� Whenever the speed of reading data from disk is the bottleneck, we are better off having many machines reading from as many different disks as possible, simultaneously.
Parallelism : Complexity
� Parallel and distributed programs are substantially harder to design, debug, and reason about than sequential ones, so this added complexity must be weighed against the performance gains.
Big data ethics
Common ethical concerns in the world of big data that one should worry about:
Integrity in communications and modeling:
� Your results will be used to influence public opinion
� or appear in testimony before legal or governmental authorities
Recommendation and personalization filters decide what content each user sees. Such filters may bear some responsibility for political polarization in our society: do you see opposing viewpoints, or just an echo chamber for your own thoughts?
Big data ethics
� Maintaining privacy in aggregated data