
Unit – III

ADVANCED ANALYTICS
TECHNOLOGY AND TOOLS
Introduction
Types of Data Structures in Big Data
• Structured: A specific and consistent format (for
example, a data table)
• Semi-structured: A self-describing format
(for example, an XML file)
• Quasi-structured: A somewhat inconsistent format
(for example, a hyperlink)
• Unstructured: An inconsistent format
(for example, text or video)
Use Cases
• IBM Watson
• Watson participated in the TV game show Jeopardy! against
two of the best Jeopardy! champions in the show's history
• Over the three-day tournament, Watson was able to
defeat the two human contestants.
• To educate Watson, Hadoop was utilized to process
various data sources such as encyclopedias, dictionaries,
news wire feeds, literature, and the entire contents of
Wikipedia
Use Cases
• IBM Watson
o Deconstruct the provided clue into words and phrases
o Establish the grammatical relationship between the
words and the phrases
o Create a set of similar terms to use in Watson's search
for a response
o Use Hadoop to coordinate the search for a response
across terabytes of data
o Determine possible responses and assign their
likelihood of being correct
o Actuate the buzzer
o Provide a syntactically correct response in English
Use Cases
• LinkedIn
• LinkedIn utilizes Hadoop for the following purposes
o Process daily production database transaction logs
o Examine the users' activities such as views and clicks
o Feed the extracted data back to the production
systems
o Restructure the data to add to an analytical database
o Develop and test analytical models
Use Cases
• Yahoo!
• Yahoo!'s Hadoop applications include the following
o Search index creation and maintenance
o Web page content optimization
o Web ad placement optimization
o Spam filters
o Ad-hoc analysis and analytic model development
MapReduce
• MapReduce™ is the heart of Apache™ Hadoop®.
• It is the programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a
Hadoop cluster
• It breaks a large task into smaller tasks, runs the tasks in
parallel, and consolidates the outputs of the individual
tasks into the final output
• MapReduce consists of two basic parts
-a map step and
-a reduce step
MapReduce
Map:
• Applies an operation to a piece of data
• Provides some intermediate output
Reduce:
• Consolidates the intermediate outputs from the
map steps
• Provides the final output
• Each step uses key/value pairs, denoted as <key, value>,
as input and output.
For example, the key could be a filename, and the value
could be the entire contents of the file.
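To make the <key, value> contract concrete, the following minimal Java sketch (illustrative interfaces only, not the actual Hadoop API) shows the shape of the two steps: map takes one input pair and may emit any number of intermediate pairs, while reduce receives an intermediate key together with all of its values and emits the final pairs.

// Illustrative sketch of the map/reduce <key, value> contract (not the Hadoop API).
interface Emitter<K, V> {
    void emit(K key, V value);
}

interface MapFunction<K1, V1, K2, V2> {
    // e.g. key = a filename, value = the entire contents of the file
    void map(K1 key, V1 value, Emitter<K2, V2> intermediate);
}

interface ReduceFunction<K2, V2, K3, V3> {
    // receives one intermediate key with all of the values emitted for it
    void reduce(K2 key, Iterable<V2> values, Emitter<K3, V3> output);
}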
Benefits of MapReduce
• Simplicity: Developers can write applications in their language of choice, such as Java, C++, or Python, and MapReduce jobs are easy to run.
• Scalability: MapReduce can process petabytes of data, stored in HDFS on one cluster.
• Speed: Parallel processing means that MapReduce can take problems that used to take days to solve and solve them in hours or minutes.
• Recovery: MapReduce takes care of failures. If a machine with one copy of the data is unavailable, another machine has a copy of the same key/value pair, which can be used to solve the same sub-task. The JobTracker keeps track of it all.
• Minimal data motion: MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces the network I/O patterns and contributes to Hadoop's processing speed.
MapReduce
MapReduce - The Algorithm

• Generally, the MapReduce paradigm is based on sending the
computation to where the data resides.
• A MapReduce program executes in three stages, namely the
map stage, the shuffle stage, and the reduce stage.
• Map stage: The map or mapper's job is to process the
input data.
• Generally, the input data is in the form of a file or directory
and is stored in the Hadoop Distributed File System (HDFS).
• The input file is passed to the mapper function line by
line.
• The mapper processes the data and creates several small
chunks of data.
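As a concrete illustration of the map stage, the sketch below is a minimal word-count mapper written against the standard org.apache.hadoop.mapreduce API; the class name and the word-count task are illustrative choices, not anything mandated by Hadoop. The mapper receives one line of the input file at a time and emits an intermediate <word, 1> pair for every word it finds.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: input key = byte offset of the line, input value = the line itself;
// output is an intermediate <word, 1> pair for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit the intermediate <key, value> pair
        }
    }
}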
MapReduce - The Algorithm

• Reduce stage: This stage is the combination of
the Shuffle stage and the Reduce stage.
• The Reducer's job is to process the data that comes from
the mapper.
• After processing, it produces a new set of output, which
will be stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and
Reduce tasks to the appropriate servers in the cluster.
• The framework manages all the details of data-passing
such as issuing tasks, verifying task completion, and
copying data around the cluster between the nodes.
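The matching reduce-side sketch below (again using the standard org.apache.hadoop.mapreduce API, with illustrative names) receives each word together with all of its intermediate counts after the shuffle, sums them, and writes the final <word, total> pair, which ends up in HDFS.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: receives one word together with all of its counts and sums them.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);   // final <word, total count> pair written to the job output
    }
}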
MapReduce - The Algorithm

• Most of the computing takes place on nodes with data
on local disks, which reduces the network traffic.
• After completion of the given tasks, the cluster collects
and reduces the data to form an appropriate result, and
sends it back to the Hadoop server.
• The MapReduce framework operates on <key, value>
pairs, that is, the framework views the input to the job as
a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of
different types.
MapReduce - The Algorithm
MapReduce
• The data goes through the following phases:
Input Splits:
• Input to a MapReduce job is divided into fixed-
size pieces called input splits
• An input split is a chunk of the input that is
consumed by a single map task
MapReduce
Mapping
• This is the very first phase in the execution of a
MapReduce program.
• In this phase, the data in each split is passed to a
mapping function to produce output values.
• In our example, the job of the mapping phase is to count
the number of occurrences of each word in the input
splits and to prepare a list in the form of <word,
frequency>
MapReduce
Shuffling
• This phase consumes output of Mapping phase.
• Its task is to consolidate the relevant records from
Mapping phase output.
• In our example, identical words are grouped together
along with their respective frequencies.
Reducing
• In this phase, output values from Shuffling phase are
aggregated.
• This phase combines values from Shuffling phase and
returns a single output value.
• In short, this phase summarizes the complete
dataset.
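The small self-contained Java program below (ordinary Java, not Hadoop code) mimics this word-count data flow for two hypothetical input splits, so the mapping, shuffling, and reducing phases can be seen end to end on a toy example.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the mapping, shuffling, and reducing phases
// for a word count over two hypothetical input splits.
public class PhasesDemo {
    public static void main(String[] args) {
        List<String> splits = List.of("deer bear river", "car car river");

        // Mapping: each split independently produces <word, 1> pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            for (String word : split.split(" ")) {
                mapped.add(Map.entry(word, 1));
            }
        }

        // Shuffling: pairs with the same word are grouped together.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reducing: the grouped values for each word are aggregated into a single count.
        shuffled.forEach((word, counts) ->
                System.out.println(word + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}

Running it prints bear 1, car 2, deer 1, and river 2, which is the same summary a real MapReduce word-count job would produce for this input.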
Apache Hadoop
• Hadoop is an open-source framework
• It allows big data to be stored and processed in a distributed
environment across clusters of computers using simple
programming models.
• It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.
• Hadoop runs applications using the MapReduce
algorithm, where the data is processed in parallel on
different CPU nodes.
• In short, the Hadoop framework makes it possible to
develop applications that run on clusters of
computers and perform complete statistical
analysis of huge amounts of data.
Apache Hadoop
Hadoop Architecture
• Hadoop framework includes following four modules:
• Hadoop Common: These are Java libraries and utilities
required by other Hadoop modules.
• These libraries provide file system and OS-level
abstractions and contain the necessary Java files and
scripts required to start Hadoop.
• Hadoop YARN: This is a framework for job scheduling and
cluster resource management.
• Hadoop Distributed File System (HDFS™): A distributed
file system that provides high-throughput access to
application data.
• Hadoop MapReduce: This is a YARN-based system for
parallel processing of large data sets.
Hadoop Architecture
Hadoop Distributed File System (HDFS)
• Hadoop can work directly with any mountable
distributed file system such as Local FS, HFTP FS, S3 FS…
• But, the most common file system used by Hadoop is the
Hadoop Distributed File System (HDFS).
• The Hadoop Distributed File System (HDFS) is based on
the Google File System (GFS)
• It provides a distributed file system that is designed to
run on large clusters (thousands of computers) of small
computer machines in a reliable, fault-tolerant manner.
• HDFS uses a master/slave architecture where master
consists of a single NameNode that manages the file
system metadata and one or more slave DataNodes that
store the actual data.
Hadoop Distributed File System (HDFS)
• A file in an HDFS namespace is split into several blocks
and those blocks are stored in a set of DataNodes.
• The NameNode determines the mapping of blocks to the
DataNodes.
• The DataNodes take care of read and write operations
on the file system.
• They also take care of block creation, deletion, and
replication based on instructions given by the NameNode.
• HDFS provides a shell like any other file system, and a list
of commands is available to interact with the file system.
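Besides the shell, HDFS can also be used programmatically. The sketch below uses the org.apache.hadoop.fs Java API to copy a local file into HDFS and list a directory; the paths are placeholders, and the cluster configuration (core-site.xml/hdfs-site.xml) is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a local file into HDFS and list an HDFS directory
// through the Java FileSystem API. All paths are placeholders.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        fs.copyFromLocalFile(new Path("/tmp/local.txt"),                 // local source
                             new Path("/user/demo/input/local.txt"));    // HDFS destination

        for (FileStatus status : fs.listStatus(new Path("/user/demo/input"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}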
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS)
• The above figure illustrates a Hadoop cluster with ten
machines and the storage of one large file requiring
three HDFS data blocks.
• Furthermore, this file is stored using triple replication.
• The machines running the NameNode and the Secondary
Name Node are considered master nodes.
• Because the Data Nodes take their instructions from the
master nodes, the machines running the Data Nodes are
referred to as worker nodes
How MapReduce Organizes Work?
• Hadoop divides the job into tasks. There are two types of
tasks:
1. Map tasks (splits and mapping)
2. Reduce tasks (shuffling and reducing)
• The complete execution process (execution of both Map and
Reduce tasks) is controlled by two types of entities:
 Jobtracker: acts like a master (responsible for complete
execution of the submitted job)
 Multiple Task Trackers: act like slaves, each of them
performing part of the job
How MapReduce Organizes Work?
How MapReduce Organizes Work?
• For every job submitted for execution in the system,
there is one Jobtracker that resides on Namenode and
there are multiple tasktrackers which reside on Datanode.
• A job is divided into multiple tasks, which are then run
on multiple data nodes in a cluster.
• It is the responsibility of the jobtracker to coordinate the
activity by scheduling tasks to run on different data nodes.
• Execution of an individual task is then looked after by the
tasktracker, which resides on every data node executing
part of the job.
How MapReduce Organizes Work?
• Tasktracker's responsibility is to send the progress report
to the jobtracker.
• In addition, the tasktracker sends a 'heartbeat' signal to the
jobtracker periodically, so as to notify it of the current state
of the system.
• Thus jobtracker keeps track of overall progress of each
job. In the event of task failure, the jobtracker can
reschedule it on a different tasktracker.
• A third daemon, the Secondary NameNode, provides the
capability to perform some of the NameNode tasks to
reduce the load on the NameNode.
• Such tasks include updating the file system image with
the contents of the file system edit logs.
How Does Hadoop Work?
Stage 1
• A user/application can submit a job to Hadoop
(via a Hadoop job client) for the required processing by specifying the
following items:
• The location of the input and output files in the
distributed file system.
• The Java classes, in the form of a JAR file, containing the
implementation of the map and reduce functions.
• The job configuration, set through different parameters
specific to the job.
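In Java these items are typically captured in a small driver class. The sketch below assumes the WordCountMapper and WordCountReducer classes sketched earlier and takes the HDFS input and output paths as command-line arguments; the names and paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: ties together the input/output locations, the mapper and reducer
// classes, and the job configuration, then submits the job to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory (must not already exist)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once compiled and packaged into a JAR, a driver like this is normally launched with the hadoop jar command, after which Stage 2 and Stage 3 below take over.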
How Does Hadoop Work?
Stage 2
• The Hadoop job client then submits the job
(JAR/executable, etc.) and configuration to the JobTracker,
which then assumes the responsibility of distributing the
software/configuration to the slaves, scheduling tasks,
monitoring them, and providing status and diagnostic
information to the job client.
Stage 3
• The TaskTrackers on different nodes execute the task as
per the MapReduce implementation, and the output of the
reduce function is stored in output files on the file system.
Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and
test distributed systems. It is efficient, and it automatically
distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault-
tolerance and high availability (FTHA), rather Hadoop
library itself has been designed to detect and handle
failures at the application layer.
• Servers can be added or removed from the cluster
dynamically and Hadoop continues to operate without
interruption.
• Another big advantage of Hadoop is that, apart from
being open source, it is compatible with all platforms
since it is Java-based.
Developing and Executing a Hadoop MapReduce Program
• A common approach to developing a Hadoop MapReduce
program is to write Java code using an Integrated
Development Environment (IDE) tool such as Eclipse
• A typical MapReduce program consists of three Java
files: one each for the driver code, map code, and reduce
code.
• The Java code is compiled and stored as a Java Archive
(JAR) file.
• This JAR file is then executed against the specified
HDFS input files.
Developing and Executing a Hadoop MapReduce Program
• Three key challenges to a new Hadoop developer are:
 defining the logic of the code to use the
MapReduce paradigm;
 learning the Apache Hadoop Java classes, methods,
and interfaces; and
 implementing the driver, map, and reduce
functionality in Java
• The Hadoop Streaming API allows the user to write and run
Hadoop jobs in other languages such as C++ and Python.
Developing and Executing a Hadoop MapReduce Program
• Some important considerations when preparing and
running a Hadoop streaming job
o Although the shuffle and sort output are provided to
the reducer in key sorted order, the reducer
does not receive the corresponding values as a list;
rather, it receives individual key/value pairs.
o The reduce code has to monitor for changes in the
value of the key and appropriately handle the new key
(a sketch of this pattern follows this list).
o The map and reduce code must already be in an
executable form, or the necessary interpreter must
already be installed on each worker node.
Developing and Executing a Hadoop MapReduce Program
o The map and reduce code must already reside on each
worker node, or the location of the code must be
provided when the job is submitted. In the latter case,
the code is copied to each worker node.
o Some functionality, such as a partitioner, still needs to
be written in Java.
o The inputs and outputs are handled through stdin and
stdout. Stderr is also available to track the status of the
tasks, implement counter functionality, and report
execution issues to the display.
o The streaming API may not perform as well as similar
functionality written in Java.
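To illustrate the key-change handling mentioned in this list: streaming reducers are usually written in a scripting language such as Python, but to stay consistent with the other examples here, the sketch below expresses the same pattern as a stand-alone Java program. It reads sorted, tab-separated <word, count> lines from stdin and emits one total per word to stdout; the class name and the tab-separated word-count format are illustrative assumptions, not requirements of the Streaming API.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Streaming-style reducer skeleton: reads sorted "word<TAB>count" lines from stdin,
// detects when the key changes, and emits one total per key to stdout.
public class StreamingWordCountReducer {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String currentWord = null;
        int currentSum = 0;

        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            String word = parts[0];
            int count = Integer.parseInt(parts[1].trim());

            if (currentWord != null && !currentWord.equals(word)) {
                System.out.println(currentWord + "\t" + currentSum);   // key changed: flush the previous total
                currentSum = 0;
            }
            currentWord = word;
            currentSum += count;
        }
        if (currentWord != null) {
            System.out.println(currentWord + "\t" + currentSum);       // flush the last key
        }
    }
}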
Yet Another Resource Negotiator (YARN)

• YARN is the foundation of the new generation of
Hadoop and is enabling organizations everywhere to
realize a modern data architecture
• YARN separates the resource management of the cluster
from the scheduling and monitoring of jobs running on
the cluster.
• The YARN implementation makes it possible for
paradigms other than MapReduce to be utilized in
Hadoop environments
• YARN replaces the functionality previously provided by
the Job Tracker and TaskTracker daemons
Yet Another Resource Negotiator (YARN)

• YARN is the prerequisite for Enterprise Hadoop,
providing resource management and a central platform
to deliver consistent operations, security, and data
governance tools across Hadoop clusters.
• YARN also extends the power of Hadoop to incumbent
and new technologies found within the data center so
that they can take advantage of cost effective, linear-scale
storage and processing.
• It provides ISVs and developers a consistent framework
for writing data access applications that run IN Hadoop.
Yet Another Resource Negotiator (YARN)
YARN Features

Multi-tenancy
• YARN allows multiple access engines (either open-
source or proprietary) to use Hadoop as the common
standard for batch, interactive and real-time engines that
can simultaneously access the same data set.
• Multi-tenant data processing improves an enterprise’s
return on its Hadoop investments
Cluster utilization
• YARN’s dynamic allocation of cluster resources improves
utilization over more static MapReduce rules used in
early versions of Hadoop
YARN Features

Scalability
• Data center processing power continues to rapidly
expand.
• YARN’s ResourceManager focuses exclusively on
scheduling and keeps pace as clusters expand to thousands
of nodes managing petabytes of data.
Compatibility
• Existing MapReduce applications developed for Hadoop 1
can run on YARN without any disruption to existing processes
that already work
The Hadoop Ecosystem
The Hadoop Ecosystem
Hadoop-related Apache projects:
• Pig: Provides a high-level data-flow programming
language
• Hive: Provides SQL-like access
• Mahout: Provides analytical tools
• HBase: Provides real-time reads and writes
The Hadoop Ecosystem
• By masking the details necessary to develop a
MapReduce program, Pig and Hive each enable a
developer to write high-level code that is later translated
into one or more MapReduce programs.
• Because MapReduce is intended for batch processing, Pig
and Hive are also intended for batch processing use cases.
• Once Hadoop processes a dataset, Mahout provides
several tools that can analyze the data in a Hadoop
environment, for example, a k-means clustering analysis.
