Unit - III Advanced Analytics Technology and Tools
ADVANCED ANALYTICS
TECHNOLOGY AND TOOLS
Introduction
Types of Data Structures in Big Data
• Structured: A specific and consistent format (for
example, a data table)
• Semi-structured: A self-describing format
(for example, an XML file)
• Quasi-structured: A somewhat inconsistent format
(for example, a hyperlink)
• Unstructured: An inconsistent format
(for example, text or video)
Use Cases
• IBM Watson
• Watson participated in the TV game show Jeopardy! against
two of the best Jeopardy! champions in the show's history
• Over the three-day tournament, Watson was able to
defeat the two human contestants.
• To educate Watson, Hadoop was utilized to process
various data sources such as encyclopedias, dictionaries,
news wire feeds, literature, and the entire contents of
Wikipedia
Use Cases
• IBM Watson
o Deconstruct the provided clue into words and phrases
o Establish the grammatical relationship between the
words and the phrases
o Create a set of similar terms to use in Watson's search
for a response
o Use Hadoop to coordinate the search for a response
across terabytes of data
o Determine possible responses and assign their
likelihood of being correct
o Actuate the buzzer
o Provide a syntactically correct response in English
Use Cases
• LinkedIn
• LinkedIn utilizes Hadoop for the following purposes
o Process daily production database transaction logs
o Examine the users' activities such as views and clicks
o Feed the extracted data back to the production
systems
o Restructure the data to add to an analytical database
o Develop and test analytical models
Use Cases
• Yahoo!
• Yahoo!'s Hadoop applications include the following
o Search index creation and maintenance
o Web page content optimization
o Web ad placement optimization
o Spam filters
o Ad-hoc analysis and analytic model development
MapReduce
• MapReduce™ is the heart of Apache™ Hadoop®.
• It is the programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a
Hadoop cluster
• It breaks a large task into smaller tasks, runs the tasks in
parallel, and consolidates the outputs of the individual
tasks into the final output
• MapReduce consists of two basic parts:
- a map step and
- a reduce step
MapReduce
Map:
• Applies an operation to a piece of data
• Provides some intermediate output
Reduce:
• Consolidates the intermediate outputs from the
map steps
• Provides the final output
• Each step uses key/value pairs, denoted as <key, value>,
as input and output.
For example, the key could be a filename, and the value
could be the entire contents of the file.
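For instance, in a word count, the map step could turn the pair
<"file1.txt", "the cat saw the rat"> into the intermediate pairs
<"the", 1>, <"cat", 1>, <"saw", 1>, <"the", 1>, <"rat", 1>, and the
reduce step would consolidate those into <"the", 2>, <"cat", 1>,
<"saw", 1>, <"rat", 1> (the filename and contents are made up for
illustration).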
Benefits of MapReduce
• Recovery: MapReduce takes care of failures. If a machine with one
copy of the data is unavailable, another machine has a copy of the
same key/value pair, which can be used to solve the same sub-task.
The JobTracker keeps track of it all.
• Minimal data motion: MapReduce moves compute processes to the data
on HDFS and not the other way around. Processing tasks can occur on
the physical node where the data resides. This significantly reduces
the network I/O patterns and contributes to Hadoop's processing speed.
MapReduce - The Algorithm
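To make the algorithm concrete, below is a minimal sketch of the
classic word-count job written against Hadoop's Java MapReduce API;
the input and output paths supplied on the command line are
placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each input line, emit <word, 1> for every word.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the intermediate counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}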
YARN Features
Multi-tenancy
• YARN allows multiple access engines (either open-
source or proprietary) to use Hadoop as the common
standard for batch, interactive and real-time engines that
can simultaneously access the same data set.
• Multi-tenant data processing improves an enterprise’s
return on its Hadoop investments
Cluster utilization
• YARN’s dynamic allocation of cluster resources improves
utilization over more static MapReduce rules used in
early versions of Hadoop
YARN Features
Scalability
• Data center processing power continues to rapidly
expand.
• YARN’s ResourceManager focuses exclusively on
scheduling and keeps pace as clusters expand to thousands
of nodes managing petabytes of data.
Compatibility
• Existing MapReduce applications developed for Hadoop 1
can run on YARN without any disruption to existing
processes that already work
The Hadoop Ecosystem
Hadoop-related Apache projects:
• Pig: Provides a high-level data-flow programming
language
• Hive: Provides SQL-like access
• Mahout: Provides analytical tools
• HBase: Provides real-time reads and writes (see the sketch
after this list)
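As a quick illustration of HBase's real-time reads and writes, here
is a minimal sketch using the HBase Java client; the table name,
column family, and row key are hypothetical, and the code assumes an
hbase-site.xml on the classpath pointing at a running cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("user_activity"))) {
      // Write: a single-row put, visible to readers immediately.
      Put put = new Put(Bytes.toBytes("user123"));
      put.addColumn(Bytes.toBytes("clicks"), Bytes.toBytes("count"),
          Bytes.toBytes("42"));
      table.put(put);

      // Read: fetch the same row back with a point lookup.
      Result result = table.get(new Get(Bytes.toBytes("user123")));
      byte[] value = result.getValue(Bytes.toBytes("clicks"),
          Bytes.toBytes("count"));
      System.out.println(Bytes.toString(value)); // prints "42"
    }
  }
}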
The Hadoop Ecosystem
• By masking the details necessary to develop a
MapReduce program, Pig and Hive each enable a
developer to write high-level code that is later translated
into one or more MapReduce programs.
• Because MapReduce is intended for batch processing, Pig
and Hive are also intended for batch-processing use cases,
as the sketch below illustrates.
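To show what this higher-level style looks like in practice, here is
a minimal sketch that queries Hive from Java over JDBC and lets Hive
translate the SQL-like statement into MapReduce jobs; the HiveServer2
endpoint, database, table, and column names are assumptions for
illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (ships with Hive).
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement();
         // Hive compiles this query into one or more MapReduce jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS views "
                 + "FROM page_views GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t"
            + rs.getLong("views"));
      }
    }
  }
}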
• Once Hadoop processes a dataset, Mahout provides several
tools that can analyze the data in a Hadoop environment. For
example, a k-means clustering analysis can be conducted with
Mahout, as sketched below.
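Mahout's k-means runs this computation as a series of MapReduce jobs
over vectors stored in HDFS. To show just the underlying algorithm,
here is a self-contained toy k-means in plain Java (not Mahout's
API); the points and k below are made up for illustration.

import java.util.Arrays;

public class ToyKMeans {
  public static void main(String[] args) {
    // Made-up 2-D points; Mahout would read these as vectors from HDFS.
    double[][] points = {
        {1.0, 1.0}, {1.5, 2.0}, {3.0, 4.0},
        {5.0, 7.0}, {3.5, 5.0}, {4.5, 5.0}, {3.5, 4.5}
    };
    int k = 2;
    double[][] centroids = {points[0].clone(), points[3].clone()};

    for (int iter = 0; iter < 10; iter++) {
      // Assignment step: attach each point to its nearest centroid
      // (the "map" side in Mahout's MapReduce formulation).
      int[] assign = new int[points.length];
      for (int i = 0; i < points.length; i++) {
        double best = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = dist(points[i], centroids[c]);
          if (d < best) { best = d; assign[i] = c; }
        }
      }
      // Update step: recompute each centroid as the mean of its
      // assigned points (the "reduce" side).
      double[][] sums = new double[k][2];
      int[] counts = new int[k];
      for (int i = 0; i < points.length; i++) {
        sums[assign[i]][0] += points[i][0];
        sums[assign[i]][1] += points[i][1];
        counts[assign[i]]++;
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          centroids[c][0] = sums[c][0] / counts[c];
          centroids[c][1] = sums[c][1] / counts[c];
        }
      }
    }
    System.out.println("Centroids: " + Arrays.deepToString(centroids));
  }

  // Euclidean distance between two 2-D points.
  static double dist(double[] a, double[] b) {
    double dx = a[0] - b[0], dy = a[1] - b[1];
    return Math.sqrt(dx * dx + dy * dy);
  }
}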