Big Data

The document discusses big data and some of the challenges associated with analyzing large datasets. It notes that big data sources often consist of massive numbers of records across a small number of fields. Additionally, big data is frequently opportunistic rather than designed, and may be biased, redundant, or prone to temporal biases. While big data has value, good data for a problem requires data appropriate to answering relevant questions. The document also covers the three V's of big data - volume, variety, and velocity - as well as hashing techniques useful for analyzing big data efficiently.


Big data

Achieving scale
What is big data?

http://www.internetlivestats.com/
• Twitter: 600 million tweets per day.
• Facebook: 600 terabytes of incoming data each day, from 1.6 billion active users.
• Google: 3.5 billion search queries per day.
• Instagram: 52 million new photos per day.
• Apple: 130 billion total app downloads.
• Netflix: 125 million hours of TV shows and movies streamed daily.
• Email: 205 billion messages per day.
How big is big?

• Big data generally consists of massive numbers of rows (records) over a relatively small number of columns (features).
• Thus big data is often overkill for accurately fitting a single model to a given problem.
• The value generally comes from fitting many distinct models, as in training a custom model personalized for each distinct user.
Big data as bad data

• Massive data sets are typically the result of opportunity, instead of design.
• In traditional hypothesis-driven science, we design an experiment to gather exactly the data we need to answer our specific question.
• But big data is more typically the product of some logging process recording discrete events, or distributed contributions from millions of people over social media.
• The data scientist generally has little or no control of the collection process, just a vague charter to turn all those bits into money.
Big data as bad data

Consider measuring popular opinion from the posts on a social media platform.
Big data can be a wonderful resource, but it is particularly prone to biases and limitations that make it difficult to draw accurate conclusions, such as:
• Unrepresentative participation:
The data from any particular social media site does not reflect the people who don't use it. Amazon users buy far more books than shoppers at Walmart, and their political affiliations and economic status differ as well. You get equally biased but very different views of the world when analyzing data from Instagram (too young), The New York Times (too liberal), Fox News (too conservative), or The Wall Street Journal (too wealthy).
Big data as bad data

• Spam and machine-generated content:
Big data sources can be worse than unrepresentative: they are often deliberately misleading. Armies of paid reviewers work each day writing fake and misleading product reviews.
• A sizable fraction of the hits reported on any website come from mechanical crawlers, not people.
• 90% of all email sent over networks is spam: the effectiveness of spam filters at several stages of the pipeline is the only reason you don't see more of it.
• Spam filtering is an essential part of the data cleaning process in any social media analysis. If you don't remove the spam, it will be lying to you instead of just misleading you.
Big data as bad data

• Too much redundancy:
Many human activities follow a power law distribution, meaning that a very small percentage of the items account for a large percentage of the total activity.
News and social media concentrate heavily on the latest missteps of celebrities, covering them with articles by the thousands. Many of these will be almost exact duplicates of other articles. How much more does the full set of them tell you than any one of them would?
This law of unequal coverage implies that much of the data we see through ambient sources is something we have seen before. Removing this duplication is an essential cleaning step for many applications.
Big data as bad data

• Susceptibility to temporal bias:
Products change in response to competition and shifts in consumer demand, and often these improvements change the way people use them.
A time series resulting from ambient data collection may well encode several product/interface transitions, making it hard to distinguish artifact from signal.
Take away

• Big data is data we have.
• Good data is data appropriate to the challenge at hand.
• Big data is bad data if it cannot really answer the questions we care about.
The Three V’s

Management consulting types have latched onto a notion of the three V's of big data as a means of explaining it: the properties of volume, variety, and velocity.
They provide a foundation to talk about what makes big data different.
The Three V’s

Volume:
• It goes without saying that big data is bigger than little data.
• The distinction is one of class.
• We leave the world where we can represent our data in a spreadsheet or process it on a single machine.
• This requires developing a more sophisticated computational infrastructure, and restricting our analysis to linear-time algorithms for efficiency.
The Three V’s

Variety:
• Ambient data collection typically moves beyond the matrix to gather heterogeneous data, which often requires ad hoc integration techniques.
• Consider social media. Posts may well include text, links, photos, and video. Depending upon our task, all of these may be relevant, but text processing requires vastly different techniques than network data or multimedia.
• Even images and videos are quite different beasts, not to be processed using the same pipeline.
• Meaningfully integrating these materials into a single data set for analysis requires substantial thought and effort.
The Three V’s

Velocity:
• Collecting data from ambient sources implies that the system is live, meaning it is always on, always collecting data.
• Live data means that infrastructures must be built for collecting, indexing, accessing, and visualizing the results, typically through a dashboard system.
• Live data means that consumers want real-time access to the latest results, through graphs, charts, etc.
The fourth V

The management set sometimes defines a fourth V: veracity,
• a measure of how much we trust the underlying data.
• Here we are faced with the problem of eliminating spam and other artifacts resulting from the collection process, beyond the level of normal cleaning.
Algorithmics for Big Data

• Big data requires efficient algorithms to work on it.
• The basic algorithmic issues associated with big data are:
  • asymptotic complexity,
  • hashing,
  • streaming models to optimize I/O performance in large data files.
Big Analysis

Traditional algorithm analysis is based on an abstract computer called the Random Access Machine, or RAM. On such a model:
• Each simple operation takes exactly one step.
• Each memory operation takes exactly one step.
Hence counting up the operations performed over the course of the algorithm gives its running time.
Big Analysis

In general, the number of operations performed by any algorithm is a function of the size of the input n:
• a matrix with n rows,
• a text with n words,
• a point set with n points.
Algorithm analysis is the process of estimating or bounding the number of steps the algorithm takes as a function of n.
Big Analysis

For algorithms defined by for-loops, such analysis is fairly straightforward.
• The depth of the nesting of these loops defines the complexity of the algorithm.
• A single loop from 1 to n defines a linear-time or O(n) algorithm, while
• two nested loops define a quadratic-time or O(n²) algorithm.
• Two sequential for loops that do not nest are still linear, because n + n = 2n steps are used instead of n × n such operations.
Basic loop-structure algorithms

Examples include:
Find the nearest neighbor of point p:
• Compare p against all n points in a given array a.
• The distance computation between p and point a[i] requires subtracting and squaring d terms, where d is the dimensionality of p.
• Looping through all n points and keeping track of the closest point takes O(d · n) time.
• Since d is typically small, it can be treated as a constant, so this is considered a linear-time algorithm (a sketch follows below).
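
As a hedged illustration (not code from the slides), a minimal Python sketch of this linear scan might look like the following; representing each point as a list of d coordinates is an assumption.

```python
def nearest_neighbor(p, points):
    """Return the point in `points` closest to p, scanning all n candidates.

    Each distance computation costs O(d) for d-dimensional points,
    so the whole loop runs in O(d * n) time.
    """
    best, best_dist = None, float("inf")
    for q in points:
        # squared Euclidean distance: subtract and square d terms
        dist = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
        if dist < best_dist:
            best, best_dist = q, dist
    return best
```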
Basic loop-structure algorithms

The closest pair of points in a set:
• Compare every point a[i] against every other point a[j], where 1 ≤ i ≠ j ≤ n.
• This takes O(d · n²) time, and would be considered a quadratic-time algorithm (sketched below).
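
A correspondingly minimal sketch of the quadratic closest-pair loop (again an illustrative assumption, not code supplied by the document):

```python
def closest_pair(points):
    """Compare every pair (i, j) with i < j: O(d * n^2) time overall."""
    best_pair, best_dist = None, float("inf")
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if dist < best_dist:
                best_pair, best_dist = (points[i], points[j]), dist
    return best_pair
```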
Basic loop-structure algorithms

Matrix multiplication:
• Multiplying an x × y matrix by a y × z matrix yields an x × z matrix, where each of the x · z terms is the dot product of two y-length vectors:
  C[i][j] = Σ_k A[i][k] × B[k][j]
• This algorithm takes x · y · z steps.
• If n = max(x, y, z), then this takes at most O(n³) steps, and would be considered a cubic-time algorithm (a sketch follows below).
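
A minimal triple-loop sketch in Python, illustrative only (in practice one would use a library such as NumPy):

```python
def matrix_multiply(A, B):
    """Multiply an x-by-y matrix A by a y-by-z matrix B: x*y*z steps."""
    x, y, z = len(A), len(B), len(B[0])
    C = [[0] * z for _ in range(x)]
    for i in range(x):
        for j in range(z):
            # C[i][j] is the dot product of row i of A and column j of B
            for k in range(y):
                C[i][j] += A[i][k] * B[k][j]
    return C
```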
Complex algorithms

For algorithms defined by conditional while loops or recursion, the analysis often requires more sophistication. Examples include:
• Binary search: to locate a given search key k in a sorted array A containing n items, like searching for a name in the telephone book.
• Mergesort: two sorted lists with a total of n items can be merged into a single sorted list in linear time.

Algorithms running on big data sets must be linear or near-linear, perhaps O(n log n). Quadratic algorithms become impossible to contemplate for n > 10,000.
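
For instance, a standard binary search (a sketch of the textbook algorithm, not code from the slides) locates a key in a sorted array in O(log n) steps:

```python
def binary_search(A, k):
    """Return the index of key k in sorted array A, or -1 if absent: O(log n)."""
    lo, hi = 0, len(A) - 1
    while lo <= hi:
        mid = (lo + hi) // 2     # halve the remaining search range each step
        if A[mid] == k:
            return mid
        elif A[mid] < k:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```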
Hashing

Definition: Hashing is a technique which can often turn quadratic algorithms into linear-time algorithms, making them tractable for dealing with the scale of data we hope to work with.
• A hash function h takes an object x and maps it to a specific integer h(x). The key idea is that whenever x = y, then h(x) = h(y).
• Different items are usually mapped to different places, assuming a well-designed hash function.
• Turning the vector of numbers into a single representative number is the job of the hash function h(x) (a sketch follows below).
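
As a hedged sketch of what such a function might look like, the polynomial string hash below is a common textbook choice; the table size and multiplier are assumptions, not values prescribed by the slides.

```python
def string_hash(x, table_size=1_000_003, alpha=31):
    """Map string x to an integer in [0, table_size); equal strings always map to the same value."""
    h = 0
    for ch in x:
        # treat the string as a number in base `alpha`, reduced modulo the table size
        h = (h * alpha + ord(ch)) % table_size
    return h

print(string_hash("big data"))   # same input always yields the same bucket
```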
Hashing: Applications

Dictionary maintenance

• A hash table is an array-based data structure using h(x) to define the position of object x,
• coupled with an appropriate collision-resolution method.
• Properly implemented, such hash tables yield constant search times in practice (a usage sketch follows below).
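
Python's built-in dict is such a hash table; a minimal usage sketch (the keys and values are made up for illustration):

```python
inventory = {}                         # hash table keyed by object
inventory["widget-42"] = 17            # insert: expected O(1)
print("widget-42" in inventory)        # membership test: expected O(1) -> True
print(inventory.get("widget-99", 0))   # lookup with a default, no KeyError -> 0
```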
Hashing: Applications

Frequency counting

• A common task in analyzing logs is tabulating the frequencies of given events, such as word counts or page hits.
• The fastest and easiest approach is to set up a hash table with event types as the key,
• and increment the associated counter for each new event.
• Properly implemented, this algorithm is linear in the total number of events being analyzed (see the sketch below).
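
A minimal sketch of hash-based frequency counting, assuming the events arrive as an iterable of keys (collections.Counter is the idiomatic shortcut):

```python
from collections import Counter

def count_events(events):
    """Tabulate event frequencies in one linear pass over the stream."""
    counts = Counter()
    for event in events:
        counts[event] += 1   # hash lookup + increment: expected O(1)
    return counts

print(count_events(["page_hit", "login", "page_hit"]))
# Counter({'page_hit': 2, 'login': 1})
```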
Hashing: Applications

Duplicate removal

• An important data cleaning chore is identifying duplicate records in a data stream and removing them.
• For each item in the stream, check whether it is already in the hash table.
• If not, insert it; if so, ignore it.
• Properly implemented, this algorithm takes time linear in the total number of records being analyzed (sketched below).
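
A minimal sketch using a hash set (illustrative; it assumes the records are hashable keys):

```python
def deduplicate(stream):
    """Yield each record the first time it is seen; drop later duplicates."""
    seen = set()
    for record in stream:
        if record not in seen:   # hash lookup: expected O(1)
            seen.add(record)
            yield record

print(list(deduplicate(["a", "b", "a", "c", "b"])))   # ['a', 'b', 'c']
```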
Hashing: Applications

Canonization

• Often the same object can be referred to by multiple different names.
• Vocabulary words are generally case-insensitive.
• Determining the vocabulary of a language requires unifying alternate forms, mapping them to a single key.
• This process of constructing a canonical representation can be interpreted as hashing.
• It requires a domain-specific simplification function doing such things as reduction to lower case, white space removal, stop word elimination, and abbreviation expansion (see the sketch below).
• These canonical keys can then be hashed, using conventional hash functions.
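
A hedged sketch of such a domain-specific simplification function; the particular stop-word list and abbreviation table are hypothetical, chosen only to illustrate the idea:

```python
import re

STOP_WORDS = {"the", "a", "an", "of"}             # hypothetical stop-word list
ABBREVIATIONS = {"dr": "doctor", "st": "street"}  # hypothetical expansions

def canonical_key(text):
    """Reduce a raw string to a canonical key suitable for conventional hashing."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # strip punctuation
    words = []
    for w in text.split():                 # split() also collapses white space
        w = ABBREVIATIONS.get(w, w)        # expand known abbreviations
        if w not in STOP_WORDS:            # drop stop words
            words.append(w)
    return " ".join(words)

print(canonical_key("The Dr.  of St.  John"))   # "doctor street john"
```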
Hashing: Applications

Cryptographic hashing

• How can you prove that an input file remains unchanged since you last analyzed it?
• Construct a hash code or checksum for the file when you worked on it,
• and save this code for comparison with the file's hash at any point in the future.
• They will be the same if the file is unchanged, and
• almost surely differ if any alterations have occurred (a sketch follows below).
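
A minimal sketch using Python's hashlib; the choice of SHA-256 and the 1 MB chunk size are assumptions, not requirements from the slides.

```python
import hashlib

def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Save the digest now; if file_checksum(path) differs later, the file has changed.
```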
Storage Hierarchy

• Big data algorithms are often storage-bound or bandwidth-bound rather than compute-bound.
• Achieving good performance can rest more on smart data management than on sophisticated algorithmics.
• To be available for analysis, data must be stored somewhere in a computing system.
Storage Hierarchy

• There are several possible types of devices to put it on, which differ greatly in speed, capacity, and latency.
• The performance differences between levels of the storage hierarchy are so enormous that we cannot ignore them in our abstraction of the RAM machine.
• The ratio of cache-memory access speed to disk access speed is roughly the same (10^6) as the ratio of the Earth's escape velocity to the speed of a tortoise!
Levels of storage Hierarchy

Cache memory:

• Modern computer architectures feature a complex system of registers and caches to store working copies of the data actively being used.
• Some of this is used for prefetching: grabbing larger blocks of data around memory locations which have been recently accessed, in anticipation of them being needed later.
Levels of storage Hierarchy

Cache memory:

• Cache sizes are typically measured in megabytes, and cache access is between five and one hundred times faster than access to main memory.
• This performance makes it very advantageous for computations to exploit locality: to use particular data items intensively in concentrated bursts, rather than intermittently over a long computation.
Levels of storage Hierarchy

Main memory:

• This is what holds the general state of the computation, and where large data structures are hosted and maintained.
• Main memory is generally measured in gigabytes, and runs hundreds to thousands of times faster than disk storage.
• To the greatest extent possible, we need data structures that fit into main memory and avoid the paging behavior of virtual memory.
Levels of storage Hierarchy

Main memory on another machine:

• Latency times on a local area network run into the low-order milliseconds, making it generally faster than secondary storage devices like disks.
• This means that distributed data structures like hash tables can be meaningfully maintained across networks of machines, but with access times that can be hundreds of times slower than main memory.
Levels of storage Hierarchy

Disk storage:

• Secondary storage devices can be measured in terabytes, providing the capacity that enables big data to get big.
• Physical devices like spinning disks take considerable time to move the read head to the position where the data is.
• Once there, it is relatively quick to read a large block of data.
• This motivates pre-fetching: copying large chunks of files into memory under the assumption that they will be needed later.
Storage Hierarchy

We need to organize our computations to account for this latency, using techniques like:
Process files and data structures in streams:
• It is important to access files and data structures sequentially whenever possible, to exploit pre-fetching.
• Much of the advantage of sorting data is that we can jump directly to the appropriate location in question.
Storage Hierarchy

We need to organize our computations to account for this latency, using techniques like:
Think big files instead of directories:
• One can organize a corpus of documents such that each is in its own file.
• This is logical for humans but slow for machines when there are millions of tiny files.
• Much better is to organize them in one large file, so we can efficiently sweep through all examples instead of requiring a separate disk access for each one.
Storage Hierarchy

We need to organize our computations to account for this latency, using techniques like:
Packing data concisely:
• The cost of decompressing data held in main memory is generally much smaller than the extra transfer cost of larger files.
• This is an argument that it pays to represent large data files concisely whenever you can.
• This might mean explicit file compression schemes, with file sizes small enough that they can be expanded in memory (see the sketch below).
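
A minimal sketch of streaming through a compressed file with Python's gzip module (the file name is hypothetical); decompression happens on the fly, so only the compressed bytes cross the disk, and access stays sequential.

```python
import gzip

line_count = 0
with gzip.open("events.log.gz", "rt", encoding="utf-8") as f:   # hypothetical file
    for line in f:        # decompressed on the fly, read sequentially
        line_count += 1   # stand-in for real per-record processing
print(line_count)
```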
Filtering and Sampling
Filtering and Sampling

One important benefit of big data is that with sufficient volume you can afford to throw most of your data away.
And this can be quite worthwhile, to make your analysis cleaner and easier.
Filtering and Sampling

We consider two distinct ways to throw data away: filtering and sampling.
Filtering means selecting a relevant subset of the data based on specific criteria.
Example: Suppose we wanted to build a language model for an application in the United States, and we wanted to train it on data from Twitter.
English accounts for only about one third of all tweets on Twitter, so filtering out all other languages leaves enough for meaningful analysis (a sketch follows below).
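
A hedged sketch of such a language filter; the record structure with a "lang" field is an assumption for illustration, loosely modeled on common API payloads rather than anything specified in the slides.

```python
def filter_english(tweets):
    """Keep only records whose language code marks them as English."""
    return [t for t in tweets if t.get("lang") == "en"]

tweets = [{"lang": "en", "text": "hello"}, {"lang": "es", "text": "hola"}]
print(filter_english(tweets))   # [{'lang': 'en', 'text': 'hello'}]
```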
Filtering and Sampling

But filtering introduces biases.
• Over 10% of the U.S. population speaks Spanish. Shouldn't they be represented in the language model, amigo?
• It is important to select the right filtering criteria to achieve the outcome we seek.
Perhaps we might better filter tweets based on location of origin, instead of language.
Filtering and Sampling

Sampling means selecting an appropriately sized subset in an arbitrary manner, without domain-specific criteria. There are several reasons why we may want to subsample good, relevant data:
• Right-sizing training data: Simple, robust models generally have few parameters, making big data unnecessary to fit them. Subsampling your data in an unbiased way leads to efficient model fitting while still being representative of the entire data set.
Filtering and Sampling

• Data partitioning: Model-building hygiene requires cleanly separating training, testing, and evaluation data, typically in a 60%, 20%, and 20% mix. Constructing these partitions in an unbiased manner is necessary for the veracity of this process (see the sketch below).
• Exploratory data analysis and visualization: Spreadsheet-sized data sets are fast and easy to explore. An unbiased sample is representative of the whole while remaining comprehensible.
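
A minimal sketch of an unbiased 60/20/20 split; shuffling before slicing is a standard way to avoid ordering biases, and the fixed seed is an assumption added only for reproducibility.

```python
import random

def partition(records, seed=42):
    """Shuffle, then split into 60% train, 20% test, 20% evaluation."""
    records = list(records)
    random.Random(seed).shuffle(records)   # remove any ordering bias
    n = len(records)
    i, j = int(0.6 * n), int(0.8 * n)
    return records[:i], records[i:j], records[j:]

train, test, evaluation = partition(range(100))
print(len(train), len(test), len(evaluation))   # 60 20 20
```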
Filtering and Sampling

Sampling n records in an efficient and unbiased manner is a more subtle task than it may appear at first. There are two general approaches:
• Deterministic sampling algorithms
• Randomized and stream sampling
Deterministic Sampling Algorithms

Sampling by truncation
• takes the first n records in the file as the desired sample.
• This is simple and readily reproducible, meaning someone else with the full data file could easily reconstruct the sample.
However, truncated samples often contain subtle effects from factors such as:
• Temporal biases: Log files are typically constructed by appending new records to the end of the file. Thus the first n records would be the oldest available, and will not reflect recent regime changes.
Deterministic Sampling Algorithms

Sampling by truncation

• Lexicographic biases: Many files are sorted according to the primary key, which means that the first n records are biased toward a particular population.
Example: Imagine a personnel roster sorted by name. The first n records might consist only of the A's, which means that we will probably over-sample Arabic names from the general population, and under-sample Chinese ones.
Deterministic Sampling Algorithms

Sampling by truncation

• Numerical biases: Often files are sorted by identity numbers, which may appear to be arbitrarily defined. But ID numbers can encode meaning.
Example: Consider sorting the personnel records by their U.S. social security numbers. In fact, the first five digits of social security numbers are generally a function of the year and place of birth. Thus truncation leads to a geographically and age-biased sample.
So truncation is, in general, a biased way to construct a sample.
Uniform Sampling

• Suppose we seek to sample n/m records out of n from a given file.
• We start from the i-th record, where i is some value between 1 and m, and then sample every m-th record starting from i (a sketch follows below).
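
A minimal sketch of this uniform (systematic) sampling scheme, assuming the records are available as a list; the offset here is 0-based, a small departure from the 1-to-m convention above.

```python
def uniform_sample(records, m, i=0):
    """Take every m-th record starting from offset i (0 <= i < m)."""
    return records[i::m]

print(uniform_sample(list(range(20)), m=5, i=2))   # [2, 7, 12, 17]
```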
Uniform Sampling

Advantages of uniform sampling:
• We obtain exactly the desired number of records for our sample.
• It is quick and reproducible by anyone given the file and the values of i and m.
• It is easy to construct multiple disjoint samples: if we repeat the process with a different offset i, we get an independent sample.

There may still be periodic temporal biases, however.
Randomized and Stream Sampling

• Randomly sampling records with probability p results in a selection of an expected p · n items, without any explicit biases.
• Typical random number generators return a value between 0 and 1, drawn from a uniform distribution.
• We can use the sampling probability p as a threshold: as we scan each new record, generate a new random number r.
• When r < p, we accept this record into our sample; when r > p, we ignore it.
Random sampling is not reproducible without the seed and the random number generator (see the sketch below).
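
A minimal sketch of threshold-based stream sampling; seeding the generator, as noted above, is what makes the sample reproducible (the particular seed value is arbitrary).

```python
import random

def stream_sample(stream, p, seed=7):
    """Keep each record independently with probability p: about p*n survivors."""
    rng = random.Random(seed)   # fixed seed makes the sample reproducible
    return [record for record in stream if rng.random() < p]

sample = stream_sample(range(1000), p=0.1)
print(len(sample))   # roughly 100
```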
Parallelism

There are two distinct approaches to simultaneous computing with multiple machines.
Parallelism

Parallel processing
• Happens on one machine, involving multiple cores and/or processors.
• Communication is through threads and operating system resources.
• Computation is often CPU-bound, limited more by the number of cycles than by the movement of data through the machine.
• The emphasis is on solving a particular computing problem faster than one could sequentially.
Parallelism

Distributed processing
• Happens on many machines, using network communication.
• Appropriate for loosely-coupled jobs which do not communicate much.
• The goal involves sharing resources like memory and secondary storage across multiple machines, more so than exploiting multiple CPUs.
• Whenever the speed of reading data from a disk is the bottleneck, we are better off having many machines reading as many different disks as possible, simultaneously.
Parallelism: Complexity

• 1 person: A date is easy to arrange using personal communication.
• More than 2 persons: A dinner among friends requires active coordination.
• More than 10 persons: A group meeting requires that there be a leader in charge.
• More than 100 persons: A wedding dinner requires a fixed menu, because the kitchen cannot manage the diversity of possible orders.
Parallelism: Complexity

• More than 1,000 persons: At any community festival or parade, no one knows the majority of attendees.
• More than 10,000 persons: After any major political demonstration, somebody is going to spend the night in the hospital, even if the march is peaceful.
• More than 100,000 persons: At any large sporting event, one of the spectators will presumably die that day, either through a heart attack or an accident on the drive home.
Challenges of parallelization and
distributed computing
Coordination:
• How do we assign work units to processors, particularly when we have more work units than workers?
• How do we aggregate or combine each worker's efforts into a single result?
Communication:
• To what extent can workers share partial results?
• How can we know when all the workers have finished their tasks?
Challenges of parallelization and
distributed computing
Fault tolerance:
• How do we reassign tasks if workers quit or die?
• Must we protect against malicious and systematic attacks, or just random failures?

Parallel computing works when we can minimize communication and coordination complexity, and complete the task with low probability of failure.
Big data ethics

Our ability to get into serious trouble increases with size.
A car can cause a more serious accident than a bicycle, and an airplane more serious carnage than an automobile.
Big data ethics

Common ethical concerns in the world of big data that one should worry about:
Integrity in communications and modeling:
• Your results will be used to influence public opinion,
• or appear in testimony before legal or governmental authorities.

Accurate reporting and dissemination of results are essential behaviors for ethical data scientists.
Big data ethics

Transparency and ownership:
• Ownership means that users should have the right to see what information has been collected from them, and the ability to prevent the future use of this material.
• Data errors can propagate and harm individuals when there is no mechanism for people to access and understand what information has been collected about them.
Big data ethics

Uncorrectable decisions and feedback loops:
• Employing models as hard screening criteria can be dangerous, particularly in domains where the model is just a proxy for what you really want to measure.
Example: Consider a model suggesting that it is risky to hire a particular job candidate because people like him who live in lower-class neighborhoods are more likely to be arrested.
If all employers use such models, these people simply won't get hired, and are driven deeper into poverty through no fault of their own.
Big data ethics

Model-driven bias and filters:
• Big data permits the customization of products to best fit each individual user.
• Google, Facebook, and others analyze your data so as to show you the results their algorithms think you most want to see.

Such filters may have some responsibility for political polarization in our society: do you see opposing viewpoints, or just an echo chamber for your own thoughts?
Big data ethics

• Maintaining the security of large data sets
• Maintaining privacy in aggregated data
