UNIT-I
INTRODUCTION TO BIG DATA
What is Big Data
Definition: Big Data is often described as extremely large data sets that
have grown beyond the ability to manage and analyze them with traditional
data processing tools.
The data set has grown so large that it is difficult to manage and even
harder to garner value out of it.
The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of the data. The challenge lies not only in the size of the data set but also in processing the data.
The data come from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, cell phone GPS signals, and so on.
All of these data have intrinsic value that can be extracted using analytics,
algorithms, and other techniques.
Why Big Data is Important
Big Data solutions are ideal for analyzing not only raw structured
data, but semi structured and unstructured data from a wide variety
of sources.
Big Data solutions are ideal when all, or most, of the data needs to be
analyzed rather than just a sample, or when a sample of the data is not
nearly as effective a basis for analysis as the larger data set.
Big Data solutions are ideal for iterative and exploratory analysis
when business measures on data are not predetermined.
Big Data is well suited for solving information challenges that don’t
natively fit within a traditional relational database approach for
handling the problem at hand.
Big Data has already proved its importance and value in several areas.
Organizations such as the National Oceanic and Atmospheric Administration (NOAA), the National Aeronautics and Space Administration (NASA), several pharmaceutical companies, and numerous energy companies have amassed huge amounts of data and now leverage Big Data technologies on a daily basis to extract value from them.
NOAA uses Big Data approaches to aid in climate, ecosystem, weather, and commercial research, while NASA uses Big Data for aeronautical and other research.
Pharmaceutical companies and energy companies have leveraged Big Data for more tangible results, such as drug testing and geophysical analysis.
The New York Times has used Big Data tools for text analysis and
Web Mining.
The Walt Disney Company uses them to correlate and understand customer behavior across all of its stores and theme parks.
Companies such as Facebook, Amazon, and Google rely on Big Data
analytics as part of their primary marketing schemes as well as a
means of servicing their customers better.
This is accomplished by storing each customer's searches, purchases,
and every other piece of information available, and then applying
algorithms to that information to compare one customer's information
with that of all other customers.
Big Data plays another role in today’s businesses: Large organizations
increasingly face the need to maintain massive amounts of structured
and unstructured data—from transaction information in data
warehouses to employee tweets, from supplier records to regulatory
filings—to comply with government regulations.
When the amount of data runs into hundreds of terabytes or more, it is referred to as Big Data.
Data generated by people:
Through individual interactions: phone calls, emails, documents
Through social media: Twitter, Facebook, WhatsApp, etc.
Data generated by machines:
RFID readers, sensor networks, vehicle GPS traces, machine logs
Characteristics of Big Data
Three characteristics define Big Data: volume, variety, and velocity.
Volume: Represents the scale of data
A text file is a few kilobytes, a sound file is a few megabytes, while a
full-length movie is a few gigabytes. More sources of data are added
on a continuous basis.
For companies, in the old days, all data was generated internally by
employees. Currently, the data is generated by employees, partners
and customers.
Petabyte data sets are common these days, and exabyte-scale data sets are not far away.
Velocity: Represents the speed of data
Traditionally, data is processed in batches: it is collected, stored, and
analyzed later to produce a result. That scheme works when the incoming
data rate is slower than the batch-processing rate and when the result is
useful despite the delay.
With the new sources of data such as social and mobile applications,
the batch process breaks down. The data is now streaming into the
server in real time, in a continuous fashion and the result is only
useful if the delay is very short.
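To make the contrast concrete, the following is a minimal Java sketch (not from the notes; the event names are made up) showing the same count computed batch-style, only after all the data has arrived, and streaming-style, updated as each event arrives.

import java.util.ArrayList;
import java.util.List;

// Minimal illustration: the same count computed in batch and in streaming fashion.
public class BatchVsStreaming {

    // Batch style: collect everything first, then analyze.
    // The result is available only after the whole batch has been stored.
    static long batchCount(List<String> allEvents) {
        long count = 0;
        for (String event : allEvents) {
            if (event.equals("purchase")) count++;
        }
        return count;
    }

    // Streaming style: update the running result as each event arrives,
    // so the answer is available with very little delay.
    static long runningCount = 0;
    static void onEvent(String event) {
        if (event.equals("purchase")) runningCount++;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        events.add("page_view");
        events.add("purchase");
        events.add("purchase");

        System.out.println("batch result: " + batchCount(events));

        for (String e : events) onEvent(e);   // events handled one at a time as they arrive
        System.out.println("streaming result: " + runningCount);
    }
}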
Variety: Represents all kinds of data
Data can be classified under several categories: structured data, semistructured data, and unstructured data.
Structured data are normally found in traditional databases (SQL or others)
where data are organized into tables based on defined business rules.
Structured data usually prove to be the easiest type of data to work with,
simply because the data are defined and indexed, making access and
filtering easier.
Unstructured data are not organized into tables and cannot be natively
used by applications or interpreted by a database. A good example of
unstructured data would be a collection of binary image files.
Semistructured data fall between unstructured and structured data.
Semistructured data do not have a formal structure like a database with
tables and relationships. However, unlike unstructured data,
semistructured data have tags or other markers to separate the elements
and provide a hierarchy of records and fields, which define the data.
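As a small illustration, here is a Java sketch (the record format is hypothetical, not from the notes) of a semistructured record: there is no fixed table schema, but markers such as "=" and ";" separate the elements so a program can still pick out individual fields.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch: a semistructured record has no table schema,
// but its markers let a reader recover named fields.
public class SemiStructuredExample {
    public static void main(String[] args) {
        String record = "name=Alice;city=Hyderabad;age=30";   // hypothetical record

        Map<String, String> fields = new HashMap<>();
        for (String pair : record.split(";")) {               // ";" separates the elements
            String[] kv = pair.split("=", 2);                 // "=" separates tag and value
            fields.put(kv[0], kv[1]);
        }
        System.out.println(fields.get("city"));               // prints Hyderabad
    }
}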
A second problem is that most analysis tasks need to be able to combine the data
in some way;
that is, data read from one disk may need to be combined with data from any
of the other 99 disks.
MapReduce is a good fit for problems that need to analyze the whole
dataset, in a batch fashion, particularly for ad hoc analysis. RDBMS is
good for point queries or updates, where the dataset has been indexed
to deliver low-latency retrieval and update times of a relatively small
amount of data.
MapReduce suits applications where the data is written once and
read many times. A relational database is good for datasets that are
continually updated.
Another difference is the amount of structure in the datasets that they
operate on.
An RDBMS operates on structured data: data that is organized into entities
with a defined format, such as XML documents or database tables that
conform to a particular predefined schema. MapReduce operates on
semistructured and unstructured data. Semistructured data may have a
schema, but it is often ignored, so the schema may be used only as a guide
to the structure of the data.
Ex: a spreadsheet, in which the structure is the grid of cells, although the
cells themselves may hold any form of data. Unstructured data does not have
any particular internal structure.
Ex: plain text or image data. MapReduce works well on unstructured or
semistructured data, since it is designed to interpret the data at processing
time.
Web server logfiles, for example, are not normalized: the client hostnames are
specified in full each time, even though the same client may appear many times.
This is one reason that logfiles of all kinds are particularly well suited to
analysis with MapReduce.
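As a sketch of why this works, the plain-Java fragment below (not the Hadoop API; the log line is a made-up example in common log format) parses one raw logfile line at processing time and emits a (hostname, 1) pair, the kind of record a map function would produce.

// Plain-Java sketch, not the Hadoop API: interpret one raw log line at processing time.
public class LogLineMap {

    // Hypothetical log line in common log format; the client host is the first field.
    static final String LINE =
        "203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] \"GET /index.html HTTP/1.0\" 200 2326";

    // Map-style step: parse one line and emit a (hostname, 1) pair.
    static String[] map(String line) {
        String host = line.split(" ")[0];
        return new String[] { host, "1" };
    }

    public static void main(String[] args) {
        String[] pair = map(LINE);
        System.out.println(pair[0] + " -> " + pair[1]);   // 203.0.113.7 -> 1
    }
}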
These functions are oblivious to the size of the data or the cluster that
they are operating on, so they can be used unchanged for a small
dataset and for a massive one.
If you double the size of the input data, a job will take twice as long to run.
But if you also double the size of the cluster, the job will run as fast as
the original one. This is not generally true of SQL queries.
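As a simple illustration with assumed figures (not from the notes): if a 10-node cluster processes 1 TB of input in one hour, then 2 TB of input on the same cluster takes roughly two hours, while 2 TB on a 20-node cluster comes back down to roughly one hour, because each node still handles the same 100 GB share of the data.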
Grid Computing
1) HPC and grid computing have been doing large-scale data processing for
years, using APIs such as the Message Passing Interface (MPI).
The approach of HPC is to distribute the work across a cluster of machines,
which access a shared filesystem hosted by a Storage Area Network (SAN).
This works well for compute-intensive jobs.
It faces problems when nodes need to access larger data volumes, i.e.,
hundreds of gigabytes.
The reason is that the network bandwidth becomes the bottleneck and
compute nodes become idle. (This is the point at which Hadoop starts to shine.)
MapReduce tries to collocate the data with the compute node, so data
access is fast since it is local. This feature, known as data locality, is
at the heart of MapReduce and is the reason for its good performance.
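As a rough illustration with assumed figures (not from the notes): if 100 compute nodes each need 100 GB of input, pulling the full 10 TB across a shared 10 Gb/s (about 1.25 GB/s) SAN link takes more than two hours, whereas 100 local disks each reading at 100 MB/s deliver the same data in under 20 minutes, in parallel across the nodes. Keeping the data next to the compute node removes the shared network as the bottleneck.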
2) MPI gives great control to the programmer, but requires that he or she
explicitly handle the mechanics of the data flow, exposed via low-level C
routines and constructs such as sockets, as well as the higher-level
algorithms for the analysis.
MapReduce operates only at the higher level: the programmer thinks in
terms of functions of key and value pairs, and the data flow is implicit.
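The classic word-count example illustrates this style of thinking. The following is a minimal in-memory Java sketch (not the Hadoop API; class and method names are made up for illustration): the programmer writes only map() and reduce() over key and value pairs, while grouping the mapped output by key, done here by hand, is what the framework's implicit data flow takes care of.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory sketch of the MapReduce model (not the Hadoop API):
// the programmer supplies map() and reduce(); grouping by key is implicit.
public class WordCountSketch {

    // map: one input line -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce: one key plus all of its values -> a single count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        String[] input = { "big data is big", "data is data" };

        // "Shuffle": group the mapped pairs by key (done by the framework in real MapReduce).
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " = " + reduce(e.getKey(), e.getValue()));
        }
    }
}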
Volunteer Computing
In volunteer computing projects such as SETI@home, the problem is broken
into chunks called work units, which are sent to computers around the world
to be analyzed. When the analysis is completed, the results are sent back to
the server, and the client gets another work unit.
As a precaution to combat cheating, each work unit is sent to three
different machines and needs at least two results to agree to be
accepted.
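A sketch of that acceptance rule (names and values are illustrative, not taken from SETI@home itself): accept a result only when at least two of the three submitted results are identical.

import java.util.Arrays;
import java.util.List;

// Minimal sketch: accept a work unit's result only if at least two of the
// three submitted results agree.
public class ResultAgreement {

    static String accept(List<String> results) {
        for (String candidate : results) {
            long votes = results.stream().filter(candidate::equals).count();
            if (votes >= 2) return candidate;   // at least two machines agree
        }
        return null;                            // no agreement: reject the work unit
    }

    public static void main(String[] args) {
        System.out.println(accept(Arrays.asList("42", "42", "41")));  // 42
        System.out.println(accept(Arrays.asList("42", "41", "40")));  // null (rejected)
    }
}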
SETI@home may be superficially similar to MapReduce (breaking a
problem into independent pieces to be worked on in parallel).
The difference is that the SETI@home problem is very CPU-intensive, which
makes it suitable for running on hundreds of thousands of computers across
the world, because the time to transfer a work unit is dwarfed by the time
to run the computation on it. Volunteers are donating CPU cycles, not
bandwidth.
MapReduce is designed to run jobs that last minutes or hours on
trusted, dedicated hardware running in a single data center with very
high aggregate bandwidth interconnects.
By contrast, SETI@home performs its computation on untrusted machines
on the Internet with highly variable connection speeds and no data
locality.