Big Data

The document discusses big data and some of the challenges associated with analyzing large datasets. It notes that big data sources often consist of massive numbers of records across a small number of fields. Additionally, big data is frequently opportunistic rather than designed, and may be biased, redundant, or prone to temporal biases. While big data has value, good data for a problem requires data appropriate to answering relevant questions. The document also covers the three V's of big data - volume, variety, and velocity - as well as hashing techniques useful for analyzing big data efficiently.


Big data

Achieving scale
What is big data?

http://www.internetlivestats.com/
• Twitter: 600 million tweets per day.
• Facebook: 600 terabytes of incoming data each day, from 1.6 billion active users.
• Google: 3.5 billion search queries per day.
• Instagram: 52 million new photos per day.
• Apple: 130 billion total app downloads.
• Netflix: 125 million hours of TV shows and movies streamed daily.
• Email: 205 billion messages per day.
How big is big?

• Big data generally consists of massive numbers of rows (records) over a relatively small number of columns (features).
• Thus big data is often overkill for accurately fitting a single model to a given problem.
• The value generally comes from fitting many distinct models, as in training a custom model personalized for each distinct user.
Big data as bad data

• Massive data sets are typically the result of opportunity, instead of design.
• In traditional hypothesis-driven science, we design an experiment to gather exactly the data we need to answer our specific question.
• But big data is more typically the product of some logging process recording discrete events, or distributed contributions from millions of people over social media.
• The data scientist generally has little or no control of the collection process, just a vague charter to turn all those bits into money.
Big data as bad data

Consider measuring popular opinion from the posts on a social media platform.
Big data can be a wonderful resource, but it is particularly prone to biases and limitations that make it difficult to draw accurate conclusions, such as:
• Unrepresentative participation:
The data from any particular social media site does not reflect the people who don't use it. Amazon users buy far more books than shoppers at Walmart, and their political affiliations and economic status differ as well. You get equally biased but very different views of the world when analyzing data from Instagram (too young), The New York Times (too liberal), Fox News (too conservative), or The Wall Street Journal (too wealthy).
Big data as bad data

• Spam and machine-generated content:
Big data sources can be worse than unrepresentative: they are often deliberately misleading. Armies of paid reviewers work each day writing fake and misleading product reviews.
• A sizable fraction of the hits reported on any website come from mechanical crawlers, not people.
• 90% of all email sent over networks is spam: the effectiveness of spam filters at several stages of the pipeline is the only reason you don't see more of it.
• Spam filtering is an essential part of the data cleaning process in any social media analysis. If you don't remove the spam, it will be lying to you instead of just misleading you.
Big data as bad data

• Too much redundancy:
Many human activities follow a power law distribution, meaning that a very small percentage of the items account for a large percentage of the total activity.
News and social media concentrate heavily on the latest missteps of celebrities, covering them with articles by the thousands. Many of these will be almost exact duplicates of other articles. How much more does the full set of them tell you than any one of them would?
This law of unequal coverage implies that much of the data we see through ambient sources is something we have seen before. Removing this duplication is an essential cleaning step for many applications.
Big data as bad data

• Susceptibility to temporal bias:
Products change in response to competition and shifts in consumer demand, and often these improvements change the way people use them.
A time series resulting from ambient data collection may well encode several product/interface transitions, making it hard to distinguish artifact from signal.
Take away

• Big data is data we have.
• Good data is data appropriate to the challenge at hand.
• Big data is bad data if it cannot really answer the questions we care about.
The Three V’s

Management consulting types have latched onto a notion of the three V's of big data as a means of explaining it: the properties of volume, variety, and velocity.
They provide a foundation to talk about what makes big data different.
The Three V’s

Volume:
• It goes without saying that big data is bigger than little data.
• The distinction is one of class.
• We leave the world where we can represent our data in a spreadsheet or process it on a single machine.
• This requires developing a more sophisticated computational infrastructure, and restricting our analysis to linear-time algorithms for efficiency.
The Three V’s

Variety:
• Ambient data collection typically moves beyond the matrix to gather heterogeneous data, which often requires ad hoc integration techniques.
• Consider social media. Posts may well include text, links, photos, and video. Depending upon our task, all of these may be relevant, but text processing requires vastly different techniques than network data or multimedia.
• Even images and videos are quite different beasts, not to be processed using the same pipeline.
• Meaningfully integrating these materials into a single data set for analysis requires substantial thought and effort.
The Three V’s

Velocity:
• Collecting data from ambient sources implies that the system is live, meaning it is always on, always collecting data.
• Live data means that infrastructures must be built for collecting, indexing, accessing, and visualizing the results, typically through a dashboard system.
• Live data means that consumers want real-time access to the latest results, through graphs, charts, etc.
The fourth V

The management set sometimes defines a fourth V: veracity,
• a measure of how much we trust the underlying data.
• Here we are faced with the problem of eliminating spam and other artifacts resulting from the collection process, beyond the level of normal cleaning.
Algorithmics for Big Data

• Big data requires efficient algorithms to work on it.
• The basic algorithmic issues associated with big data are:
  • asymptotic complexity,
  • hashing,
  • streaming models to optimize I/O performance in large data files.
Big Analysis

Traditional algorithm analysis is based on an abstract computer called the Random Access Machine, or RAM. On such a model:
• Each simple operation takes exactly one step.
• Each memory operation takes exactly one step.
Hence counting up the operations performed over the course of the algorithm gives its running time.
Big Analysis

In general, the number of operations performed by any algorithm is a function of the size of the input n:
• a matrix with n rows,
• a text with n words,
• a point set with n points.
Algorithm analysis is the process of estimating or bounding the number of steps the algorithm takes as a function of n.
Big Analysis

For algorithms defined by for-loops, such analysis is fairly straightforward.
• The depth of the nesting of these loops defines the complexity of the algorithm.
• A single loop from 1 to n defines a linear-time or O(n) algorithm, while
• two nested loops define a quadratic-time or O(n²) algorithm.
• Two sequential for loops that do not nest are still linear, because n + n = 2n steps are used instead of n × n such operations.
Basic loop-structure algorithms

Examples include:
Find the nearest neighbor of point p:
• Compare p against all n points in a given array a.
• The distance computation between p and point a[i] requires subtracting and squaring d terms, where d is the dimensionality of p.
• Looping through all n points and keeping track of the closest point takes O(d · n) time.
• Since d is typically small, it can be treated as a constant, so this is considered a linear-time algorithm (a sketch follows below).
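
As a hedged illustration (not code from the slides), a minimal Python sketch of this linear scan might look like the following; representing each point as a list of d coordinates is an assumption.

```python
def nearest_neighbor(p, points):
    """Return the point in `points` closest to p, scanning all n candidates.

    Each distance computation costs O(d) for d-dimensional points,
    so the whole loop runs in O(d * n) time.
    """
    best, best_dist = None, float("inf")
    for q in points:
        # squared Euclidean distance: subtract and square d terms
        dist = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
        if dist < best_dist:
            best, best_dist = q, dist
    return best
```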
Basic loop-structure algorithms

The closest pair of points in a set:
• Compare every point a[i] against every other point a[j], where 1 ≤ i ≠ j ≤ n.
• This takes O(d · n²) time, and would be considered a quadratic-time algorithm (sketched below).
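
A correspondingly minimal sketch of the quadratic closest-pair loop (again an illustrative assumption, not code supplied by the document):

```python
def closest_pair(points):
    """Compare every pair (i, j) with i < j: O(d * n^2) time overall."""
    best_pair, best_dist = None, float("inf")
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if dist < best_dist:
                best_pair, best_dist = (points[i], points[j]), dist
    return best_pair
```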
Basic loop-structure algorithms

Matrix multiplication:
• Multiplying an x × y matrix by a y × z matrix yields an x × z matrix, where each of the x · z terms is the dot product of two y-length vectors:
  C[i][j] = Σ_k A[i][k] × B[k][j]
• This algorithm takes x · y · z steps.
• If n = max(x, y, z), then this takes at most O(n³) steps, and would be considered a cubic-time algorithm (a sketch follows below).
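
A minimal triple-loop sketch in Python, illustrative only (in practice one would use a library such as NumPy):

```python
def matrix_multiply(A, B):
    """Multiply an x-by-y matrix A by a y-by-z matrix B: x*y*z steps."""
    x, y, z = len(A), len(B), len(B[0])
    C = [[0] * z for _ in range(x)]
    for i in range(x):
        for j in range(z):
            # C[i][j] is the dot product of row i of A and column j of B
            for k in range(y):
                C[i][j] += A[i][k] * B[k][j]
    return C
```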
Complex algorithms

For algorithms defined by conditional while loops or recursion, the analysis often requires more sophistication. Examples include:
• Binary search: to locate a given search key k in a sorted array A containing n items, like searching for a name in the telephone book.
• Mergesort: two sorted lists with a total of n items can be merged into a single sorted list in linear time.

Algorithms running on big data sets must be linear or near-linear, perhaps O(n log n). Quadratic algorithms become impossible to contemplate for n > 10,000.
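
For instance, a standard binary search (a sketch of the textbook algorithm, not code from the slides) locates a key in a sorted array in O(log n) steps:

```python
def binary_search(A, k):
    """Return the index of key k in sorted array A, or -1 if absent: O(log n)."""
    lo, hi = 0, len(A) - 1
    while lo <= hi:
        mid = (lo + hi) // 2     # halve the remaining search range each step
        if A[mid] == k:
            return mid
        elif A[mid] < k:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```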
Hashing

Definition: Hashing is a technique which can often turn quadratic algorithms into linear-time algorithms, making them tractable for dealing with the scale of data we hope to work with.
• A hash function h takes an object x and maps it to a specific integer h(x). The key idea is that whenever x = y, then h(x) = h(y).
• Different items are usually mapped to different places, assuming a well-designed hash function.
• Turning the vector of numbers into a single representative number is the job of the hash function h(x) (a sketch follows below).
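
As a hedged sketch of what such a function might look like, the polynomial string hash below is a common textbook choice; the table size and multiplier are assumptions, not values prescribed by the slides.

```python
def string_hash(x, table_size=1_000_003, alpha=31):
    """Map string x to an integer in [0, table_size); equal strings always map to the same value."""
    h = 0
    for ch in x:
        # treat the string as a number in base `alpha`, reduced modulo the table size
        h = (h * alpha + ord(ch)) % table_size
    return h

print(string_hash("big data"))   # same input always yields the same bucket
```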
Hashing: Applications

Dictionary maintenance

• A hash table is an array-based data structure using h(x) to define the position of object x,
• coupled with an appropriate collision-resolution method.
• Properly implemented, such hash tables yield constant search times in practice (a usage sketch follows below).
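
Python's built-in dict is such a hash table; a minimal usage sketch (the keys and values are made up for illustration):

```python
inventory = {}                         # hash table keyed by object
inventory["widget-42"] = 17            # insert: expected O(1)
print("widget-42" in inventory)        # membership test: expected O(1) -> True
print(inventory.get("widget-99", 0))   # lookup with a default, no KeyError -> 0
```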
Hashing: Applications

Frequency counting

• A common task in analyzing logs is tabulating the frequencies of given events, such as word counts or page hits.
• The fastest and easiest approach is to set up a hash table with event types as the key,
• and increment the associated counter for each new event.
• Properly implemented, this algorithm is linear in the total number of events being analyzed (see the sketch below).
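
A minimal sketch of hash-based frequency counting, assuming the events arrive as an iterable of keys (collections.Counter is the idiomatic shortcut):

```python
from collections import Counter

def count_events(events):
    """Tabulate event frequencies in one linear pass over the stream."""
    counts = Counter()
    for event in events:
        counts[event] += 1   # hash lookup + increment: expected O(1)
    return counts

print(count_events(["page_hit", "login", "page_hit"]))
# Counter({'page_hit': 2, 'login': 1})
```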
Hashing: Applications

Duplicate removal

• An important data cleaning chore is identifying duplicate records in a data stream and removing them.
• For each item in the stream, check whether it is already in the hash table.
• If not, insert it; if so, ignore it.
• Properly implemented, this algorithm takes time linear in the total number of records being analyzed (sketched below).
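
A minimal sketch using a hash set (illustrative; it assumes the records are hashable keys):

```python
def deduplicate(stream):
    """Yield each record the first time it is seen; drop later duplicates."""
    seen = set()
    for record in stream:
        if record not in seen:   # hash lookup: expected O(1)
            seen.add(record)
            yield record

print(list(deduplicate(["a", "b", "a", "c", "b"])))   # ['a', 'b', 'c']
```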
Hashing: Applications

Canonization

• Often the same object can be referred to by multiple different names.
• Vocabulary words are generally case-insensitive.
• Determining the vocabulary of a language requires unifying alternate forms, mapping them to a single key.
• This process of constructing a canonical representation can be interpreted as hashing.
• It requires a domain-specific simplification function doing such things as reduction to lower case, white space removal, stop word elimination, and abbreviation expansion (see the sketch below).
• These canonical keys can then be hashed, using conventional hash functions.
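
A hedged sketch of such a domain-specific simplification function; the particular stop-word list and abbreviation table are hypothetical, chosen only to illustrate the idea:

```python
import re

STOP_WORDS = {"the", "a", "an", "of"}             # hypothetical stop-word list
ABBREVIATIONS = {"dr": "doctor", "st": "street"}  # hypothetical expansions

def canonical_key(text):
    """Reduce a raw string to a canonical key suitable for conventional hashing."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # strip punctuation
    words = []
    for w in text.split():                 # split() also collapses white space
        w = ABBREVIATIONS.get(w, w)        # expand known abbreviations
        if w not in STOP_WORDS:            # drop stop words
            words.append(w)
    return " ".join(words)

print(canonical_key("The Dr.  of St.  John"))   # "doctor street john"
```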
Hashing: Applications

Cryptographic hashing

• How can you prove that an input file remains unchanged since you last analyzed it?
• Construct a hash code or checksum for the file when you worked on it,
• and save this code for comparison with the file's hash at any point in the future.
• They will be the same if the file is unchanged, and
• almost surely differ if any alterations have occurred (a sketch follows below).
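
A minimal sketch using Python's hashlib; the choice of SHA-256 and the 1 MB chunk size are assumptions, not requirements from the slides.

```python
import hashlib

def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Save the digest now; if file_checksum(path) differs later, the file has changed.
```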
Storage Hierarchy

• Big data algorithms are often storage-bound or bandwidth-bound rather than compute-bound.
• Achieving good performance can rest more on smart data management than on sophisticated algorithmics.
• To be available for analysis, data must be stored somewhere in a computing system.
Storage Hierarchy

• There are several possible types of devices to put it on, which differ greatly in speed, capacity, and latency.
• The performance differences between levels of the storage hierarchy are so enormous that we cannot ignore them in our abstraction of the RAM machine.
• The ratio of cache-memory access speed to disk access speed is roughly the same (10^6) as the ratio of the Earth's escape velocity to the speed of a tortoise!
Levels of storage Hierarchy

Cache memory:

• Modern computer architectures feature a complex system of registers and caches to store working copies of the data actively being used.
• Some of this is used for prefetching: grabbing larger blocks of data around memory locations which have been recently accessed, in anticipation of them being needed later.
Levels of storage Hierarchy

Cache memory:

• Cache sizes are typically measured in megabytes, and cache access is between five and one hundred times faster than access to main memory.
• This performance makes it very advantageous for computations to exploit locality: to use particular data items intensively in concentrated bursts, rather than intermittently over a long computation.
Levels of storage Hierarchy

Main memory:

• This is what holds the general state of the computation, and where large data structures are hosted and maintained.
• Main memory is generally measured in gigabytes, and runs hundreds to thousands of times faster than disk storage.
• To the greatest extent possible, we need data structures that fit into main memory and avoid the paging behavior of virtual memory.
Levels of storage Hierarchy

Main memory on another machine:

• Latency times on a local area network run into the low-order milliseconds, making it generally faster than secondary storage devices like disks.
• This means that distributed data structures like hash tables can be meaningfully maintained across networks of machines, but with access times that can be hundreds of times slower than main memory.
Levels of storage Hierarchy

Disk storage:

• Secondary storage devices can be measured in terabytes, providing the capacity that enables big data to get big.
• Physical devices like spinning disks take considerable time to move the read head to the position where the data is.
• Once there, it is relatively quick to read a large block of data.
• This motivates pre-fetching: copying large chunks of files into memory under the assumption that they will be needed later.
Storage Hierarchy

We need to organize our computations to account for this latency, using techniques like:
Process files and data structures in streams:
• It is important to access files and data structures sequentially whenever possible, to exploit pre-fetching.
• Much of the advantage of sorting data is that we can jump directly to the appropriate location in question.
Storage Hierarchy

We need to organize our computations to account for this latency, using techniques like:
Think big files instead of directories:
• One can organize a corpus of documents such that each is in its own file.
• This is logical for humans but slow for machines when there are millions of tiny files.
• Much better is to organize them in one large file, so we can efficiently sweep through all examples instead of requiring a separate disk access for each one.
Storage Hierarchy

We need to organize our computations to account for this latency, using techniques like:
Packing data concisely:
• The cost of decompressing data held in main memory is generally much smaller than the extra transfer cost of larger files.
• This is an argument that it pays to represent large data files concisely whenever you can.
• This might mean explicit file compression schemes, with file sizes small enough that they can be expanded in memory (see the sketch below).
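
A minimal sketch of streaming through a compressed file with Python's gzip module (the file name is hypothetical); decompression happens on the fly, so only the compressed bytes cross the disk, and access stays sequential.

```python
import gzip

line_count = 0
with gzip.open("events.log.gz", "rt", encoding="utf-8") as f:   # hypothetical file
    for line in f:        # decompressed on the fly, read sequentially
        line_count += 1   # stand-in for real per-record processing
print(line_count)
```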
Filtering and Sampling
Filtering and Sampling

One important benefit of big data is that with sufficient volume you can afford to throw most of your data away.
And this can be quite worthwhile, to make your analysis cleaner and easier.
Filtering and Sampling

We consider two distinct ways to throw data away: filtering and sampling.
Filtering means selecting a relevant subset of the data based on specific criteria.
Example: Suppose we wanted to build a language model for an application in the United States, and we wanted to train it on data from Twitter.
English accounts for only about one third of all tweets on Twitter, so filtering out all other languages leaves enough for meaningful analysis (a sketch follows below).
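
A hedged sketch of such a language filter; the record structure with a "lang" field is an assumption for illustration, loosely modeled on common API payloads rather than anything specified in the slides.

```python
def filter_english(tweets):
    """Keep only records whose language code marks them as English."""
    return [t for t in tweets if t.get("lang") == "en"]

tweets = [{"lang": "en", "text": "hello"}, {"lang": "es", "text": "hola"}]
print(filter_english(tweets))   # [{'lang': 'en', 'text': 'hello'}]
```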
Filtering and Sampling

But filtering introduces biases.
• Over 10% of the U.S. population speaks Spanish. Shouldn't they be represented in the language model, amigo?
• It is important to select the right filtering criteria to achieve the outcome we seek.
Perhaps we might better filter tweets based on location of origin, instead of language.
Filtering and Sampling

Sampling means selecting an appropriately sized subset in an arbitrary manner, without domain-specific criteria. There are several reasons why we may want to subsample good, relevant data:
• Right-sizing training data: Simple, robust models generally have few parameters, making big data unnecessary to fit them. Subsampling your data in an unbiased way leads to efficient model fitting while still being representative of the entire data set.
Filtering and Sampling

• Data partitioning: Model-building hygiene requires cleanly separating training, testing, and evaluation data, typically in a 60%, 20%, and 20% mix. Constructing these partitions in an unbiased manner is necessary for the veracity of this process (see the sketch below).
• Exploratory data analysis and visualization: Spreadsheet-sized data sets are fast and easy to explore. An unbiased sample is representative of the whole while remaining comprehensible.
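
A minimal sketch of an unbiased 60/20/20 split; shuffling before slicing is a standard way to avoid ordering biases, and the fixed seed is an assumption added only for reproducibility.

```python
import random

def partition(records, seed=42):
    """Shuffle, then split into 60% train, 20% test, 20% evaluation."""
    records = list(records)
    random.Random(seed).shuffle(records)   # remove any ordering bias
    n = len(records)
    i, j = int(0.6 * n), int(0.8 * n)
    return records[:i], records[i:j], records[j:]

train, test, evaluation = partition(range(100))
print(len(train), len(test), len(evaluation))   # 60 20 20
```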
Filtering and Sampling

Sampling n records in an efficient and unbiased manner is a more subtle task than it may appear at first. There are two general approaches:
• Deterministic sampling algorithms
• Randomized and stream sampling
Deterministic Sampling Algorithms

Sampling by truncation
• takes the first n records in the file as the desired sample.
• This is simple and readily reproducible, meaning someone else with the full data file could easily reconstruct the sample.
However, truncated samples often contain subtle effects from factors such as:
• Temporal biases: Log files are typically constructed by appending new records to the end of the file. Thus the first n records would be the oldest available, and will not reflect recent regime changes.
Deterministic Sampling Algorithms

Sampling by truncation

• Lexicographic biases: Many files are sorted according to the primary key, which means that the first n records are biased toward a particular population.
Example: Imagine a personnel roster sorted by name. The first n records might consist only of the A's, which means that we will probably over-sample Arabic names from the general population, and under-sample Chinese ones.
Deterministic Sampling Algorithms

Sampling by truncation

• Numerical biases: Often files are sorted by identity numbers, which may appear to be arbitrarily defined. But ID numbers can encode meaning.
Example: Consider sorting the personnel records by their U.S. social security numbers. In fact, the first five digits of social security numbers are generally a function of the year and place of birth. Thus truncation leads to a geographically and age-biased sample.
So truncation is, in general, a biased way to construct a sample.
Uniform Sampling

• Suppose we seek to sample n/m records out of n from a given file.
• We start from the i-th record, where i is some value between 1 and m, and then sample every m-th record starting from i (a sketch follows below).
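
A minimal sketch of this uniform (systematic) sampling scheme, assuming the records are available as a list; the offset here is 0-based, a small departure from the 1-to-m convention above.

```python
def uniform_sample(records, m, i=0):
    """Take every m-th record starting from offset i (0 <= i < m)."""
    return records[i::m]

print(uniform_sample(list(range(20)), m=5, i=2))   # [2, 7, 12, 17]
```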
Uniform Sampling

Advantages of uniform sampling:
• We obtain exactly the desired number of records for our sample.
• It is quick and reproducible by anyone given the file and the values of i and m.
• It is easy to construct multiple disjoint samples: if we repeat the process with a different offset i, we get an independent sample.

There may still be periodic temporal biases, however.
Randomized and Stream Sampling

• Randomly sampling records with probability p results in a selection of an expected p · n items, without any explicit biases.
• Typical random number generators return a value between 0 and 1, drawn from a uniform distribution.
• We can use the sampling probability p as a threshold: as we scan each new record, generate a new random number r.
• When r < p, we accept this record into our sample; when r > p, we ignore it.
Random sampling is not reproducible without the seed and the random number generator (see the sketch below).
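
A minimal sketch of threshold-based stream sampling; seeding the generator, as noted above, is what makes the sample reproducible (the particular seed value is arbitrary).

```python
import random

def stream_sample(stream, p, seed=7):
    """Keep each record independently with probability p: about p*n survivors."""
    rng = random.Random(seed)   # fixed seed makes the sample reproducible
    return [record for record in stream if rng.random() < p]

sample = stream_sample(range(1000), p=0.1)
print(len(sample))   # roughly 100
```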
Parallelism

There are two distinct approaches to simultaneous computing with multiple machines.
Parallelism

Parallel processing
• Happens on one machine, involving multiple cores and/or processors.
• Communication is through threads and operating system resources.
• Computation is often CPU-bound, limited more by the number of cycles than by the movement of data through the machine.
• The emphasis is on solving a particular computing problem faster than one could sequentially.
Parallelism

Distributed processing
• Happens on many machines, using network communication.
• Appropriate for loosely-coupled jobs which do not communicate much.
• The goal involves sharing resources like memory and secondary storage across multiple machines, more so than exploiting multiple CPUs.
• Whenever the speed of reading data from a disk is the bottleneck, we are better off having many machines reading as many different disks as possible, simultaneously.
Parallelism: Complexity

• 1 person: A date is easy to arrange using personal communication.
• More than 2 persons: A dinner among friends requires active coordination.
• More than 10 persons: A group meeting requires that there be a leader in charge.
• More than 100 persons: A wedding dinner requires a fixed menu, because the kitchen cannot manage the diversity of possible orders.
Parallelism: Complexity

• More than 1,000 persons: At any community festival or parade, no one knows the majority of attendees.
• More than 10,000 persons: After any major political demonstration, somebody is going to spend the night in the hospital, even if the march is peaceful.
• More than 100,000 persons: At any large sporting event, one of the spectators will presumably die that day, either through a heart attack or an accident on the drive home.
Challenges of parallelization and
distributed computing
Coordination:
• How do we assign work units to processors, particularly when we have more work units than workers?
• How do we aggregate or combine each worker's efforts into a single result?
Communication:
• To what extent can workers share partial results?
• How can we know when all the workers have finished their tasks?
Challenges of parallelization and
distributed computing
Fault tolerance:
• How do we reassign tasks if workers quit or die?
• Must we protect against malicious and systematic attacks, or just random failures?

Parallel computing works when we can minimize communication and coordination complexity, and complete the task with low probability of failure.
Big data ethics

Our ability to get into serious trouble increases with size.
A car can cause a more serious accident than a bicycle, and an airplane more serious carnage than an automobile.
Big data ethics

Common ethical concerns in the world of big data that one should worry about:
Integrity in communications and modeling:
• Your results will be used to influence public opinion,
• or appear in testimony before legal or governmental authorities.

Accurate reporting and dissemination of results are essential behaviors for ethical data scientists.
Big data ethics

Transparency and ownership:
• Ownership means that users should have the right to see what information has been collected from them, and the ability to prevent the future use of this material.
• Data errors can propagate and harm individuals when there is no mechanism for people to access and understand what information has been collected about them.
Big data ethics

Uncorrectable decisions and feedback loops:
• Employing models as hard screening criteria can be dangerous, particularly in domains where the model is just a proxy for what you really want to measure.
Example: Consider a model suggesting that it is risky to hire a particular job candidate because people like him who live in lower-class neighborhoods are more likely to be arrested.
If all employers use such models, these people simply won't get hired, and are driven deeper into poverty through no fault of their own.
Big data ethics

Model-driven bias and filters:
• Big data permits the customization of products to best fit each individual user.
• Google, Facebook, and others analyze your data so as to show you the results their algorithms think you most want to see.

Such filters may have some responsibility for political polarization in our society: do you see opposing viewpoints, or just an echo chamber for your own thoughts?
Big data ethics

• Maintaining the security of large data sets
• Maintaining privacy in aggregated data
