
Unit – III

ADVANCED ANALYTICS
TECHNOLOGY AND TOOLS
Introduction
Types of Data Structures in Big Data
• Structured: A specific and consistent format (for
example, a data table)
• Semi-structured: A self-describing format
(for example, an XML file)
• Quasi-structured: A somewhat inconsistent format
(for example, a hyperlink)
• Unstructured: An inconsistent format
(for example, text or video)
Use Cases
• IBM Watson
• Watson participated in the TV game show Jeopardy! against
two of the best Jeopardy! champions in the show's history
• Over the three-day tournament, Watson was able to
defeat the two human contestants.
• To educate Watson, Hadoop was utilized to process
various data sources such as encyclopedias, dictionaries,
news wire feeds, literature, and the entire contents of
Wikipedia
Use Cases
• IBM Watson
o Deconstruct the provided clue into words and phrases
o Establish the grammatical relationship between the
words and the phrases
o Create a set of similar terms to use in Watson's search
for a response
o Use Hadoop to coordinate the search for a response
across terabytes of data
o Determine possible responses and assign their
likelihood of being correct
o Actuate the buzzer
o Provide a syntactically correct response in English
Use Cases
• LinkedIn
• LinkedIn utilizes Hadoop for the following purposes
o Process daily production database transaction logs
o Examine the users' activities such as views and clicks
o Feed the extracted data back to the production
systems
o Restructure the data to add to an analytical database
o Develop and test analytical models
Use Cases
• Yahoo!
• Yahoo!'s Hadoop applications include the following
o Search index creation and maintenance
o Web page content optimization
o Web ad placement optimization
o Spam filters
o Ad-hoc analysis and analytic model development
MapReduce
• MapReduce™ is the heart of Apache™ Hadoop®.
• It is the programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a
Hadoop cluster
• It breaks a large task into smaller tasks, runs the tasks in
parallel, and consolidates the outputs of the individual
tasks into the final output
• MapReduce consists of two basic parts
-a map step and
-a reduce step
MapReduce
Map:
• Applies an operation to a piece of data
• Provides some intermediate output
Reduce:
• Consolidates the intermediate outputs from the
map steps
• Provides the final output
• Each step uses key/value pairs, denoted as <key, value>,
as input and output.
For example, the key could be a filename, and the value
could be the entire contents of the file.
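To make the <key, value> contract concrete, the following minimal Java sketch (illustrative interfaces only, not the actual Hadoop API) shows the shape of the two steps: map takes one input pair and may emit any number of intermediate pairs, while reduce receives an intermediate key together with all of its values and emits the final pairs.

// Illustrative sketch of the map/reduce <key, value> contract (not the Hadoop API).
interface Emitter<K, V> {
    void emit(K key, V value);
}

interface MapFunction<K1, V1, K2, V2> {
    // e.g. key = a filename, value = the entire contents of the file
    void map(K1 key, V1 value, Emitter<K2, V2> intermediate);
}

interface ReduceFunction<K2, V2, K3, V3> {
    // receives one intermediate key with all of the values emitted for it
    void reduce(K2 key, Iterable<V2> values, Emitter<K3, V3> output);
}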
Benefits of MapReduce
• Simplicity: Developers can write applications in their language of choice, such as Java, C++, or Python, and MapReduce jobs are easy to run.
• Scalability: MapReduce can process petabytes of data, stored in HDFS on one cluster.
• Speed: Parallel processing means that MapReduce can take problems that used to take days to solve and solve them in hours or minutes.
• Recovery: MapReduce takes care of failures. If a machine with one copy of the data is unavailable, another machine has a copy of the same key/value pair, which can be used to solve the same sub-task. The JobTracker keeps track of it all.
• Minimal data motion: MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces the network I/O patterns and contributes to Hadoop's processing speed.
MapReduce
MapReduce - The Algorithm

• Generally, the MapReduce paradigm is based on sending the
computation to where the data resides.
• A MapReduce program executes in three stages, namely the
map stage, the shuffle stage, and the reduce stage.
• Map stage: The map or mapper's job is to process the
input data.
• Generally, the input data is in the form of a file or directory
and is stored in the Hadoop Distributed File System (HDFS).
• The input file is passed to the mapper function line by
line.
• The mapper processes the data and creates several small
chunks of data.
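As a concrete illustration of the map stage, the sketch below is a minimal word-count mapper written against the standard org.apache.hadoop.mapreduce API; the class name and the word-count task are illustrative choices, not anything mandated by Hadoop. The mapper receives one line of the input file at a time and emits an intermediate <word, 1> pair for every word it finds.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: input key = byte offset of the line, input value = the line itself;
// output is an intermediate <word, 1> pair for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit the intermediate <key, value> pair
        }
    }
}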
MapReduce - The Algorithm

• Reduce stage: This stage is the combination of
the Shuffle stage and the Reduce stage.
• The Reducer's job is to process the data that comes from
the mapper.
• After processing, it produces a new set of output, which
will be stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and
Reduce tasks to the appropriate servers in the cluster.
• The framework manages all the details of data-passing
such as issuing tasks, verifying task completion, and
copying data around the cluster between the nodes.
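The matching reduce-side sketch below (again using the standard org.apache.hadoop.mapreduce API, with illustrative names) receives each word together with all of its intermediate counts after the shuffle, sums them, and writes the final <word, total> pair, which ends up in HDFS.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: receives one word together with all of its counts and sums them.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);   // final <word, total count> pair written to the job output
    }
}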
MapReduce - The Algorithm

• Most of the computing takes place on nodes with data
on local disks, which reduces the network traffic.
• After completion of the given tasks, the cluster collects
and reduces the data to form an appropriate result, and
sends it back to the Hadoop server.
• The MapReduce framework operates on <key, value>
pairs, that is, the framework views the input to the job as
a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of
different types.
MapReduce - The Algorithm
MapReduce
• The data goes through the following phases:
Input Splits:
• Input to a MapReduce job is divided into fixed-
size pieces called input splits
• An input split is a chunk of the input that is
consumed by a single map task
MapReduce
Mapping
• This is the very first phase in the execution of a
MapReduce program.
• In this phase, the data in each split is passed to a
mapping function to produce output values.
• In our example, the job of the mapping phase is to count
the number of occurrences of each word in the input
splits and to prepare a list in the form of <word,
frequency>
MapReduce
Shuffling
• This phase consumes output of Mapping phase.
• Its task is to consolidate the relevant records from
Mapping phase output.
• In our example, identical words are grouped together
along with their respective frequencies.
Reducing
• In this phase, output values from Shuffling phase are
aggregated.
• This phase combines values from Shuffling phase and
returns a single output value.
• In short, this phase summarizes the complete
dataset.
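The small self-contained Java program below (ordinary Java, not Hadoop code) mimics this word-count data flow for two hypothetical input splits, so the mapping, shuffling, and reducing phases can be seen end to end on a toy example.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the mapping, shuffling, and reducing phases
// for a word count over two hypothetical input splits.
public class PhasesDemo {
    public static void main(String[] args) {
        List<String> splits = List.of("deer bear river", "car car river");

        // Mapping: each split independently produces <word, 1> pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            for (String word : split.split(" ")) {
                mapped.add(Map.entry(word, 1));
            }
        }

        // Shuffling: pairs with the same word are grouped together.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reducing: the grouped values for each word are aggregated into a single count.
        shuffled.forEach((word, counts) ->
                System.out.println(word + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}

Running it prints bear 1, car 2, deer 1, and river 2, which is the same summary a real MapReduce word-count job would produce for this input.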
Apache Hadoop
• Hadoop is an open-source framework
• It allows big data to be stored and processed in a distributed
environment across clusters of computers using simple
programming models.
• It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.
• Hadoop runs applications using the MapReduce
algorithm, where the data is processed in parallel on
different CPU nodes.
• In short, the Hadoop framework makes it possible to
develop applications that run on clusters of
computers and perform complete statistical
analysis of huge amounts of data.
Apache Hadoop
Hadoop Architecture
• Hadoop framework includes following four modules:
• Hadoop Common: These are Java libraries and utilities
required by other Hadoop modules.
• These libraries provide file system and OS-level
abstractions and contain the necessary Java files and
scripts required to start Hadoop.
• Hadoop YARN: This is a framework for job scheduling and
cluster resource management.
• Hadoop Distributed File System (HDFS™): A distributed
file system that provides high-throughput access to
application data.
• Hadoop MapReduce: This is a YARN-based system for
parallel processing of large data sets.
Hadoop Architecture
Hadoop Distributed File System (HDFS)
• Hadoop can work directly with any mountable
distributed file system such as Local FS, HFTP FS, S3 FS…
• But, the most common file system used by Hadoop is the
Hadoop Distributed File System (HDFS).
• The Hadoop Distributed File System (HDFS) is based on
the Google File System (GFS)
• It provides a distributed file system that is designed to
run on large clusters (thousands of computers) of small
computer machines in a reliable, fault-tolerant manner.
• HDFS uses a master/slave architecture where master
consists of a single NameNode that manages the file
system metadata and one or more slave DataNodes that
store the actual data.
Hadoop Distributed File System (HDFS)
• A file in an HDFS namespace is split into several blocks
and those blocks are stored in a set of DataNodes.
• The NameNode determines the mapping of blocks to the
DataNodes.
• The DataNodes take care of read and write operations
on the file system.
• They also take care of block creation, deletion, and
replication based on instructions given by the NameNode.
• HDFS provides a shell like any other file system, and a list
of commands is available to interact with the file system.
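Besides the shell, HDFS can also be used programmatically. The sketch below uses the org.apache.hadoop.fs Java API to copy a local file into HDFS and list a directory; the paths are placeholders, and the cluster configuration (core-site.xml/hdfs-site.xml) is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a local file into HDFS and list an HDFS directory
// through the Java FileSystem API. All paths are placeholders.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        fs.copyFromLocalFile(new Path("/tmp/local.txt"),                 // local source
                             new Path("/user/demo/input/local.txt"));    // HDFS destination

        for (FileStatus status : fs.listStatus(new Path("/user/demo/input"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}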
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS)
• The above figure illustrates a Hadoop cluster with ten
machines and the storage of one large file requiring
three HDFS data blocks.
• Furthermore, this file is stored using triple replication.
• The machines running the NameNode and the Secondary
Name Node are considered master nodes.
• Because the Data Nodes take their instructions from the
master nodes, the machines running the Data Nodes are
referred to as worker nodes
How MapReduce Organizes Work?
• Hadoop divides the job into tasks. There are two types of
tasks:
1. Map tasks (splits and mapping)
2. Reduce tasks (shuffling and reducing)
• The complete execution process (execution of both Map and
Reduce tasks) is controlled by two types of entities:
 Jobtracker: acts like a master (responsible for complete
execution of the submitted job)
 Multiple Task Trackers: act like slaves, each of them
performing part of the job
How MapReduce Organizes Work?
How MapReduce Organizes Work?
• For every job submitted for execution in the system,
there is one Jobtracker that resides on Namenode and
there are multiple tasktrackers which reside on Datanode.
• A job is divided into multiple tasks, which are then run
on multiple data nodes in a cluster.
• It is the responsibility of the jobtracker to coordinate the
activity by scheduling tasks to run on different data nodes.
• Execution of an individual task is then looked after by the
tasktracker, which resides on every data node executing
part of the job.
How MapReduce Organizes Work?
• Tasktracker's responsibility is to send the progress report
to the jobtracker.
• In addition, the tasktracker sends a 'heartbeat' signal to the
jobtracker periodically, so as to notify it of the current state
of the system.
• Thus jobtracker keeps track of overall progress of each
job. In the event of task failure, the jobtracker can
reschedule it on a different tasktracker.
• A third daemon, the Secondary NameNode, provides the
capability to perform some of the NameNode tasks to
reduce the load on the NameNode.
• Such tasks include updating the file system image with
the contents of the file system edit logs.
How Does Hadoop Work?
Stage 1
• A user/application can submit a job to Hadoop
(via a Hadoop job client) for the required processing by specifying the
following items:
• The location of the input and output files in the
distributed file system.
• The Java classes, in the form of a JAR file, containing the
implementation of the map and reduce functions.
• The job configuration, set through different parameters
specific to the job.
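In Java these items are typically captured in a small driver class. The sketch below assumes the WordCountMapper and WordCountReducer classes sketched earlier and takes the HDFS input and output paths as command-line arguments; the names and paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: ties together the input/output locations, the mapper and reducer
// classes, and the job configuration, then submits the job to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory (must not already exist)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once compiled and packaged into a JAR, a driver like this is normally launched with the hadoop jar command, after which Stage 2 and Stage 3 below take over.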
How Does Hadoop Work?
Stage 2
• The Hadoop job client then submits the job
(JAR/executable, etc.) and configuration to the JobTracker,
which then assumes the responsibility of distributing the
software/configuration to the slaves, scheduling tasks,
monitoring them, and providing status and diagnostic
information to the job client.
Stage 3
• The TaskTrackers on different nodes execute the task as
per the MapReduce implementation, and the output of the
reduce function is stored in output files on the file system.
Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and
test distributed systems. It is efficient, and it automatically
distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault-
tolerance and high availability (FTHA), rather Hadoop
library itself has been designed to detect and handle
failures at the application layer.
• Servers can be added or removed from the cluster
dynamically and Hadoop continues to operate without
interruption.
• Another big advantage of Hadoop is that, apart from
being open source, it is compatible with all platforms
since it is Java-based.
Developing and Executing a Hadoop MapReduce Program
• A common approach to developing a Hadoop MapReduce
program is to write Java code using an Integrated
Development Environment (IDE) tool such as Eclipse
• A typical MapReduce program consists of three Java
files: one each for the driver code, map code, and reduce
code.
• The Java code is compiled and stored as a Java Archive
(JAR) file.
• This JAR file is then executed against the specified
HDFS input files.
Developing and Executing a Hadoop MapReduce Program
• Three key challenges to a new Hadoop developer are:
 defining the logic of the code to use the
MapReduce paradigm;
 learning the Apache Hadoop Java classes, methods,
and interfaces; and
 implementing the driver, map, and reduce
functionality in Java
• The Hadoop Streaming API allows the user to write and run
Hadoop jobs in other languages such as C++ and Python.
Developing and Executing a Hadoop MapReduce Program
• Some important considerations when preparing and
running a Hadoop streaming job
o Although the shuffle and sort output are provided to
the reducer in key sorted order, the reducer
does not receive the corresponding values as a list;
rather, it receives individual key/value pairs.
o The reduce code has to monitor for changes in the
value of the key and appropriately handle the new key
(a sketch of this pattern follows this list).
o The map and reduce code must already be in an
executable form, or the necessary interpreter must
already be installed on each worker node.
Developing and Executing a Hadoop MapReduce Program
o The map and reduce code must already reside on each
worker node, or the location of the code must be
provided when the job is submitted. In the latter case,
the code is copied to each worker node.
o Some functionality, such as a partitioner, still needs to
be written in Java.
o The inputs and outputs are handled through stdin and
stdout. Stderr is also available to track the status of the
tasks, implement counter functionality, and report
execution issues to the display.
o The streaming API may not perform as well as similar
functionality written in Java.
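To illustrate the key-change handling mentioned in this list: streaming reducers are usually written in a scripting language such as Python, but to stay consistent with the other examples here, the sketch below expresses the same pattern as a stand-alone Java program. It reads sorted, tab-separated <word, count> lines from stdin and emits one total per word to stdout; the class name and the tab-separated word-count format are illustrative assumptions, not requirements of the Streaming API.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Streaming-style reducer skeleton: reads sorted "word<TAB>count" lines from stdin,
// detects when the key changes, and emits one total per key to stdout.
public class StreamingWordCountReducer {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String currentWord = null;
        int currentSum = 0;

        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            String word = parts[0];
            int count = Integer.parseInt(parts[1].trim());

            if (currentWord != null && !currentWord.equals(word)) {
                System.out.println(currentWord + "\t" + currentSum);   // key changed: flush the previous total
                currentSum = 0;
            }
            currentWord = word;
            currentSum += count;
        }
        if (currentWord != null) {
            System.out.println(currentWord + "\t" + currentSum);       // flush the last key
        }
    }
}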
Yet Another Resource Negotiator (YARN)

• YARN is the foundation of the new generation of
Hadoop and is enabling organizations everywhere to
realize a modern data architecture
• YARN separates the resource management of the cluster
from the scheduling and monitoring of jobs running on
the cluster.
• The YARN implementation makes it possible for
paradigms other than MapReduce to be utilized in
Hadoop environments
• YARN replaces the functionality previously provided by
the Job Tracker and TaskTracker daemons
Yet Another Resource Negotiator (YARN)

• YARN is the prerequisite for Enterprise Hadoop,
providing resource management and a central platform
to deliver consistent operations, security, and data
governance tools across Hadoop clusters.
• YARN also extends the power of Hadoop to incumbent
and new technologies found within the data center so
that they can take advantage of cost effective, linear-scale
storage and processing.
• It provides ISVs and developers a consistent framework
for writing data access applications that run IN Hadoop.
Yet Another Resource Negotiator (YARN)
YARN Features

Multi-tenancy
• YARN allows multiple access engines (either open-
source or proprietary) to use Hadoop as the common
standard for batch, interactive and real-time engines that
can simultaneously access the same data set.
• Multi-tenant data processing improves an enterprise’s
return on its Hadoop investments
Cluster utilization
• YARN’s dynamic allocation of cluster resources improves
utilization over more static MapReduce rules used in
early versions of Hadoop
YARN Features

Scalability
• Data center processing power continues to rapidly
expand.
• YARN’s ResourceManager focuses exclusively on
scheduling and keeps pace as clusters expand to thousands
of nodes managing petabytes of data.
Compatibility
• Existing MapReduce applications developed for Hadoop 1
can run on YARN without any disruption to existing processes
that already work
The Hadoop Ecosystem
The Hadoop Ecosystem
Hadoop-related Apache projects:
• Pig: Provides a high-level data-flow programming
language
• Hive: Provides SQL-like access
• Mahout: Provides analytical tools
• HBase: Provides real-time reads and writes
The Hadoop Ecosystem
• By masking the details necessary to develop a
MapReduce program, Pig and Hive each enable a
developer to write high-level code that is later translated
into one or more MapReduce programs.
• Because MapReduce is intended for batch processing, Pig
and Hive are also intended for batch processing use cases.
• Once Hadoop processes a dataset, Mahout provides
several tools that can analyze the data in a Hadoop
environment, for example, a k-means clustering analysis.
