
CS8091 – Big Data Analytics

UNIT – I
CS8091 Big Data Analytics Lecture 1:
Overview of Big Data Analytics
Definition and Characteristics of Big Data

“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” – Gartner

Big data is fundamentally about applying innovative and cost-effective techniques for solving existing and future business problems whose resource requirements (for data management space, computation resources, or immediate, in-memory representation needs) exceed the capabilities of traditional computing environments.
What made Big Data necessary?
CHARACTERISTICS OF BIG DATA:

(i) Volume
❑ The name 'Big Data' itself relates to a size which is enormous.
❑ The size of data plays a very crucial role in determining the value derived from the data.
❑ Also, whether a particular dataset can actually be considered Big Data or not depends upon the volume of data.
❑ Hence, 'Volume' is one characteristic which needs to be considered while dealing with 'Big Data'.
CHARACTERISTICS OF BIG DATA:

(ii) Variety
❑ Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
❑ In earlier days, spreadsheets and databases were the only sources of data.
❑ Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications.
❑ This variety of unstructured data poses certain issues for storing, mining and analyzing the data.
CHARACTERISTICS OF BIG DATA:

(iii) Velocity

❑ The term 'velocity' refers to the speed of generation of data.
❑ How fast the data is generated and processed to meet demands determines the real potential of the data.
❑ Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc.
❑ The flow of data is massive and continuous.
CHARACTERISTICS OF BIG DATA:

(iv) Variability
❑ This refers to the inconsistency which can be shown by the data at times, thus hampering the process of handling and managing the data effectively.
❑ Variability refers to data whose meaning is constantly changing. Many a time, organizations need to develop sophisticated programs in order to understand the context of the data and decode its exact meaning.
(v) Veracity
❑ It refers to the quality of the data that is being analyzed.
❑ High-veracity data has many records that are valuable to analyze and that contribute in a meaningful way to the overall results.
❑ Low-veracity data, on the other hand, contains a high percentage of meaningless data.
UNDERSTANDING THE BUSINESS DRIVERS

Business drivers are about agility in utilization and analysis of collections of datasets
and streams to create value:

• increase revenues,
• decrease costs,
• improve the customer experience,
• reduce risks, and
• increase productivity.
UNDERSTANDING THE BUSINESS DRIVERS

What is driving businesses to adopt big data solutions?

• Increased data volumes being captured and stored


• Rapid acceleration of data growth
• Increased data volumes pushed into the network
• Growing variation in types of data assets for analysis
Validating – The promotion of the Value of Big Data

• There are a number of factors that need to be considered before making a decision regarding the adoption of big data technology within an organization.
• The initial task would be to evaluate the organization’s fitness as the combination of five factors, namely:
Feasibility
• Steps taken to create an environment that is suited to the introduction and
assessment of innovative technologies (Are we capable of doing this?)
Reasonability
• Anticipation of business challenges whose resource requirements exceed the
capability of the existing or planned environment (Do we need to be doing this?)
Value
• Defining clear measures of value and methods for measurement (Will this provide
value?)
Integrability
• Steps taken to evaluate the means by which big data can be integrated as part of the enterprise (Can we incorporate this?)
Sustainability
• Plan for funding the continued management and maintenance of a big data environment (Are we able to keep doing this?)
Techniques towards Big Data

• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization

➔ These techniques have existed for years to decades.


How much data do we generate? (Source: IDC)
Big data – 5V Classification/Characteristics
THE PROMOTION OF THE VALUE OF BIG DATA

The following illustrate the value of big data and what is being done with it:

∙ Optimized consumer spending as a result of improved targeted customer marketing;


∙ Improvements to research and analytics within the manufacturing sectors to lead to
new product development;
∙ Improvements in strategizing and business planning leading to innovation and new
start-up companies;
∙ Predictive analytics for improving supply chain management to optimize stock
management, replenishment, and forecasting;
∙ Improving the scope and accuracy of fraud detection.
THE PROMOTION OF THE VALUE OF BIG DATA

These are the same types of benefits promoted by business intelligence and data warehouse tool vendors and system integrators for the past 15-20 years, namely:

❑ Better targeted customer marketing


❑ Improved product analytics
❑ Improved business planning
❑ Improved supply chain management
❑ Improved analysis for fraud, waste, and abuse.

Then what makes big data different? This question is addressed by the use cases of big data.
Big Data Use Cases

Business Intelligence, querying, reporting, searching


• Including many applications of searching, filtering, indexing,
speeding up aggregation for reporting and for report
generation, trend analysis, search optimization and general
information retrieval
Improved performance for common data management operations
• Log storage, data storage and archiving, followed by sorting, running joins, ETL processing, other types of data conversions, as well as duplicate analysis and elimination
Non-database Applications
• Such as image processing, text processing, genome
sequencing, protein sequencing and structure prediction, web
crawling and monitoring workflow processes
Big Data Use Cases

Data Mining and Analytical Applications


• Including social network analysis, facial recognition, profile matching and
other types of text analytics, web mining, machine learning, information
extraction, personalization and recommendation analysis, and behavioral
analysis
In turn, the core capabilities that are implemented using the big data application can be
further abstracted into more fundamental categories:
Counting
Functions applied to large volumes of data that can be segmented and distributed
among a group of computing and storage resources, such as document indexing,
concept filtering and aggregation (counts and sums)
Scanning
Functions that can be broken into parallel threads, such as sorting, data
transformations, semantic text analysis, pattern recognition and searching
Modeling
Capabilities for analysis and prediction
Storing
Storing large datasets while providing relatively rapid access
Perception and Quantification of Value

Three facets of the appropriateness of big data are:


• Organizational fitness
• Sustainability of the business challenge
• Big Data’s contribution to the organization
Whether using big data significantly contributes to adding value to the
organization by:
Increasing revenues – increasing same-customer sales
Lowering costs – eliminate need for specialized servers and reduce operating
costs
Increasing Productivity – Increasing the speed of analytics and thereby allowing actions to be taken more quickly
Reducing Risks – unusual events are rapidly investigated to determine the
risk factor
Big Data Applications Characteristics

The big data approach is mostly suited to addressing or solving business problems that are subject to one or more of the following criteria.

Data Throttling (Existing solution – Traditional Hardware)


• The performance of an existing solution is throttled as a result of data accessibility, data latency, data availability, or limits on bandwidth in relation to the size of inputs
Computation-restricted Throttling (Existing algorithms - Heuristic)
• Existing algorithms have not been implemented because the expected computational performance cannot be achieved with conventional systems
Large Data Volumes
• The analytical application combines a multitude of existing large datasets
and data streams with high rates of data creation and delivery
Big Data Applications Characteristics

Significant Data Variety


• The data acquired from different sources vary in structure and content, and some of the data is unstructured
Benefits from Data Parallelization
• Because of the reduced data dependencies, the application's run time can be improved through task- or thread-level parallelism applied to independent data segments
Examples of Big Data Applications

• Energy network monitoring and optimization


• Credit card fraud detection
• Data profiling
• Clustering and customer segmentation
• Recommendation engines
• Price modeling, etc.
Examples of Big Data Applications
Understanding Big Data Storage
• The ability to design, develop and implement a big data application is directly dependent on an awareness of the architecture of the underlying computing platform, both from a hardware perspective and a software perspective.
• One commonality among the different appliances and frameworks is the
adaptation of tools to leverage the combination of collections of four key
computing resources:
• Processing Capability (CPU, processor or node) – multiprocessor,
multithreading
• Memory – which holds the data that the processing node is currently working
on
• Storage – providing persistence of data
• Network – which provides the “pipes” through which the datasets are
exchanged between different processing and storage nodes.
• Single-node computers – limited in capacity; they cannot accommodate massive amounts of data
• High-performance platforms – collections of computers among which massive amounts of data and the processing requirements can be distributed across the group of available resources
Overview of High Performance Architecture

Most high-performance platforms are created by connecting multiple nodes together via a
variety of network topologies.
The general architecture distinguishes the management of computing resources and the
management of the data across the network of storage nodes.

Typical organization of resources in a big data platform.


Overview of High Performance Architecture

In this configuration, a master job manager oversees the pool of processing


nodes, assigns tasks, and monitors the activity. At the same time, a storage
manager oversees the data storage pool and distributes datasets across the
collection of storage resources.

Hadoop is a framework that allows Big Data to be stored in a distributed environment so that the data can be processed in parallel.
Overview of High Performance Architecture – Apache Hadoop Framework

• Hadoop is essentially a collection of open source projects that are


combined to enable a software-based big data appliance.
• The core aspects of Hadoop’s utilities are the Hadoop Distributed File System (HDFS) and the MapReduce programming model, upon which the next layer in the stack is implemented.
• A new generation framework ‘YARN’ has been developed for job
scheduling and cluster management

Apache Hadoop Project


• Google's solution as the starting point
• runs applications using the MapReduce algorithm
• data is processed in parallel on different CPU nodes
• framework for developing applications that run on clusters of computers and can perform complete statistical analysis for huge amounts of data
Overview of High Performance Architecture – Apache Hadoop Framework

Hadoop Architecture

• Hadoop Common: Java libraries and utilities required by other Hadoop modules; provides filesystem and OS-level abstractions and contains the necessary Java files and scripts required to start Hadoop
• Hadoop YARN: a framework for job scheduling and cluster
resource management
• Hadoop Distributed File System (HDFS): a distributed file system
that provides high-throughput access to application data
• Hadoop MapReduce: a YARN-based system for parallel processing
of large data sets
Use Cases of Hadoop

IBM Watson

In 2011, IBM’s computer system Watson participated in the U.S. television game
show Jeopardy against two of the best Jeopardy champions in the show’s history. In the
game, the contestants are provided a clue such as “He likes his martinis shaken, not stirred”
and the correct response, phrased in the form of a question, would be, “Who is James
Bond?”
Over the three-day tournament, Watson was able to defeat the two human
contestants. To educate Watson, Hadoop was utilized to process various data sources such
as encyclopedias, dictionaries, news wire feeds, literature, and the entire contents of
Wikipedia. For each clue provided during the game,

Watson had to perform the following tasks in less than three seconds.

❑ Deconstruct the provided clue into words and phrases


❑ Establish the grammatical relationship between the words and the phrases
❑ Create a set of similar terms to use in Watson’s search for a response
❑ Use Hadoop to coordinate the search for a response across terabytes of data
Use Cases of Hadoop

IBM Watson

❑ Determine possible responses and assign their likelihood of being correct


❑ Actuate the buzzer
❑ Provide a syntactically correct response in English

Among other applications, Watson is being used in the medical profession to


diagnose patients and provide treatment recommendations.
Use Cases of Hadoop

LinkedIn
LinkedIn is an online professional network of 250 million users in 200 countries as
of early 2014 [5]. LinkedIn provides several free and subscription-based services, such as
company information pages, job postings, talent searches, social graphs of one’s contacts,
personally tailored news feeds, and access to discussion groups, including a Hadoop users
group.
LinkedIn utilizes Hadoop for the following purposes:
❑ Process daily production database transaction logs
❑ Examine the users’ activities such as views and clicks
❑ Feed the extracted data back to the production systems
❑ Restructure the data to add to an analytical database
❑ Develop and test analytical models
Use Cases of Hadoop

Yahoo!
As of 2012, Yahoo! has one of the largest publicly announced Hadoop deployments, at 42,000 nodes across several clusters utilizing 350 petabytes of raw storage. Yahoo!’s Hadoop applications include the following:
❑ Search index creation and maintenance
❑ Web page content optimization
❑ Web ad placement optimization
❑ Spam filters
❑ Ad-hoc analysis and analytic model development
Prior to deploying Hadoop, it took 26 days to process three years’ worth of log
data. With Hadoop, the processing time was reduced to 20 minutes.
History behind Hadoop

Hadoop is an open source project managed and licensed by the Apache Software Foundation.
The origins of Hadoop began as a search engine called Nutch, developed by Doug
Cutting and Mike Cafarella. Based on two Google papers [9] [12], versions of MapReduce
and the Google File System were added to Nutch in 2004. In 2006, Yahoo! hired Cutting,
who helped to develop Hadoop based on the code in Nutch [13]. The name “Hadoop” came
from the name of Cutting’s child’s stuffed toy elephant, which also inspired the well-recognized symbol for the Hadoop project.
Overview of High Performance Architecture – Apache Hadoop Framework

Apart from HDFS and MapReduce, the other components of the Hadoop ecosystem are shown below.
Overview of High Performance Architecture – Apache Hadoop Framework

Hbase

⮚ HBase is an open-source non-relational distributed database modeled


after Google's Bigtable and written in Java.

⮚ It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop.

⮚ That is, it provides a fault-tolerant way of storing large quantities of


sparse data.
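
As a brief, hedged illustration of how an application stores and reads sparse data through the HBase Java client API, the following minimal sketch assumes a running HBase cluster and uses a hypothetical 'users' table with a 'profile' column family:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    // Reads cluster settings (e.g., the ZooKeeper quorum) from hbase-site.xml on the classpath
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table

      // Write one sparse cell: row key "user123", column family "profile", qualifier "name"
      Put put = new Put(Bytes.toBytes("user123"));
      put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read the same cell back
      Get get = new Get(Bytes.toBytes("user123"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(value));
    }
  }
}

Each row may carry a different set of columns within a family, which is what makes this storage model a good fit for sparse data.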
Overview of High Performance Architecture – Apache Hadoop Framework

PIG
Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger sets of data, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Pig.
Hive
Apache Hive is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
Hive performs three main functions: data summarization, query, and analysis.
Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs which execute on Hadoop.
Overview of High Performance Architecture – Apache Hadoop Framework

Apache Mahout
Mahout is an open source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in Hadoop HDFS, Mahout provides data science tools to automatically find meaningful patterns in those big data sets.

Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem components like HDFS, HBase or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
Overview of High Performance Architecture – Apache Hadoop Framework

Zookeeper
Apache Zookeeper is a centralized service and a Hadoop Ecosystem
component for maintaining configuration information, naming, providing
distributed synchronization, and providing group services. Zookeeper manages
and coordinates a large cluster of machines.
Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated with the Apache Hadoop stack.
Oozie is scalable and can manage the timely execution of thousands of workflows in a Hadoop cluster. Oozie is also very flexible; one can easily start, stop, suspend and rerun jobs.
Overview of High Performance Architecture – Apache Hadoop Framework

Hadoop Distributed File System (HDFS)


• HDFS is based on the Google File System (GFS)
✔ provides a distributed file system
✔ it is designed to run on large clusters (thousands of computers) of small
computer machines
✔ reliable and fault-tolerant
• HDFS uses a master/slave architecture
✔ the master consists of a single NameNode that manages the file system
metadata (a single point of failure)
✔ one or more slave DataNodes store the actual data
• a file in an HDFS namespace is split into several blocks and those blocks are
stored in a set of DataNodes
• the NameNode determines the mapping of blocks to the DataNodes
• the DataNodes take care of (based on instructions given by the NameNode)
✔ read and write operations on the file system
✔ block creation, deletion and replication
• HDFS provides a shell like any other file system and a list of commands to
interact with the file system
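
In addition to the shell, applications can interact with HDFS programmatically through the Hadoop FileSystem Java API. The following is a minimal sketch, assuming core-site.xml is on the classpath; the directory and file paths are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Create a directory and copy a local file into HDFS (hypothetical paths)
    Path dir = new Path("/user/student/input");
    fs.mkdirs(dir);
    fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path(dir, "sample.txt"));

    // List the directory contents; block placement itself is handled by the NameNode
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}

The shell offers equivalent commands, for example hdfs dfs -mkdir, hdfs dfs -put and hdfs dfs -ls.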
Overview of High Performance Architecture – Apache Hadoop Framework

HDFS Architecture

The name node maintains metadata about each file (i.e., the number of data blocks, file name, path, block IDs, block locations, number of replicas, and also slave-related configuration; this metadata is held in memory on the master for faster retrieval), as well as the history of changes to file metadata.
Overview of High Performance Architecture – Apache Hadoop Framework

Hadoop Distributed File System (HDFS)

Data Replication

• HDFS provides a level of fault tolerance through data replication.


An application can specify the degree of replication (i.e., the
number of copies made) when a file is created.

• The name node also manages replication.

• In essence, HDFS provides performance through distribution of


data and fault tolerance through replication.
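
As a small, hedged sketch, the replication degree can be controlled through the dfs.replication configuration property or changed for an existing file via the FileSystem API (the path below is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication factor for files created through this client
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);

    // Change the replication factor of an existing file to 2 (hypothetical path)
    fs.setReplication(new Path("/user/student/input/sample.txt"), (short) 2);
    fs.close();
  }
}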
Overview of High Performance Architecture – Apache Hadoop Framework

Hadoop Distributed File System (HDFS) (Key Tasks)


• Monitoring: There is a continuous “heartbeat” communication from the data nodes to the name node. If a data node’s
heartbeat is not heard by the name node, the data node is
considered to have failed and is no longer available. In this
case, a replica is employed to replace the failed node, and a
change is made to the replication scheme.
• Rebalancing: This is a process of automatically migrating blocks
of data from one data node to another when there is free
space, when there is an increased demand for the data and
moving it may improve performance (such as moving from a
traditional disk drive to a solid-state drive that is much faster or
can accommodate increased numbers of simultaneous
accesses), or an increased need for replication in reaction to more frequent node failures.
Overview of High Performance Architecture – Apache Hadoop Framework

Hadoop Distributed File System (HDFS) (Key Tasks)


• Managing integrity: HDFS uses checksums, which are
effectively “digital signatures” associated with the actual data
stored in a file (often calculated as a numerical function of the
values within the bits of the files) that can be used to verify
that the data stored corresponds to the data shared or
received. When the checksum calculated for a retrieved block
does not equal the stored checksum of that block, it is
considered an integrity error. In that case, the requested block
will need to be retrieved from a replica instead.
• Metadata replication: The metadata files are also subject to
failure, and HDFS can be configured to maintain replicas of the
corresponding metadata files to protect against corruption.
• Snapshots: This is incremental copying of data to establish a
point in time to which the system can be rolled back.
Overview of High Performance Architecture – Apache Hadoop Framework

Hadoop Distributed File System (Example)

⮚ For a given file, HDFS breaks the file, say, into 64 MB blocks and
stores the blocks across the cluster. So, if a file size is 300 MB, the file
is stored in five blocks: four 64 MB blocks and one 44 MB block. If a
file size is smaller than 64 MB, the block is assigned the size of the
file.
⮚ By default, HDFS creates three copies of each block across the cluster
to provide the necessary redundancy in case of a failure.
⮚ If a machine fails, HDFS replicates an accessible copy of the relevant
data blocks to another available machine. HDFS is also rack aware, which means that it distributes the blocks across several equipment racks to prevent an entire rack failure from making the data unavailable.
Overview of High Performance Architecture – Apache Hadoop Framework

Hadoop Distributed File System (Example)


Overview of High Performance Architecture – Apache Hadoop Framework

MapReduce
• A software framework for easily writing applications which process large amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• The MapReduce execution environment employs a master/slave
execution model, in which one master node (called the
JobTracker) manages a pool of slave computing resources (called
TaskTrackers) that are called upon to do the actual work.
• The term MapReduce actually refers to two different tasks:
✔ the Map task - the first task, which takes input data and converts
it into a set of data, where individual elements are broken down
into tuples (key/value pairs)
✔ the Reduce task - this task takes the output from a map task as input and combines those tuples into a smaller set of output tuples.
Overview of High Performance Architecture – Apache Hadoop Framework

MapReduce
• the master is responsible for:
✔ resource management
✔ tracking resource consumption/availability
✔ scheduling the jobs’ component tasks on the slaves
✔ monitoring slaves and re-executing the failed tasks
• the slave TaskTracker:
✔ executes the tasks as directed by the master
✔ provides task-status information to the master periodically
Overview of High Performance Architecture – Apache Hadoop Framework

THE MAPREDUCE PROGRAMMING MODEL

MapReduce, which can be used to develop applications to read, analyze, transform, and share massive amounts of data, is not a database system but rather a programming model introduced and described by Google researchers for parallel, distributed computation involving massive datasets (ranging from hundreds of terabytes to petabytes).

Application development in MapReduce builds on the familiar procedural/imperative approaches used by Java or C++ programmers.

MapReduce depends on two basic operations that are applied to sets or lists of data value pairs:

Map, which describes the computation or analysis applied to a set of input


key/value pairs to produce a set of intermediate key/value pairs.

Reduce, in which the set of values associated with the intermediate key/value
pairs output by the Map operation are combined to provide the results.
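
In the notation popularized by the original Google MapReduce paper, these two operations are often summarized as:

map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)

That is, Map turns each input pair into a list of intermediate pairs, and Reduce merges all intermediate values that share the same intermediate key into the final result.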
Overview of High Performance Architecture – Apache Hadoop Framework

MapReduce Programming Model

A Map Reduce application is envisioned as a series of basic operations applied in a


sequence to small sets of many (millions, billions, or even more) data items. These
data items are logically organized in a way that enables the MapReduce execution
model to allocate tasks that can be executed in parallel.

The data items are indexed using a defined key into (key, value) pairs, in which the key represents some grouping criterion associated with a computed value. With some applications applied to massive datasets, the theory is that the computations applied during the Map phase to each input key/value pair are independent from one another. Figure 1.5 shows how Map and Reduce work.
Overview of High Performance Architecture – Apache Hadoop Framework

MapReduce Programming Model


Overview of High Performance Architecture – Apache Hadoop Framework

MapReduce

The Map Reduce paradigm provides the means to break a large task
into smaller tasks, run the tasks in parallel, and consolidate the
outputs of the individual tasks into the final output. As its name
implies,
Map Reduce consists of two basic parts
▪ a map step and
▪ a reduce step

Map:
Applies an operation to a piece of data
Provides some intermediate output

Reduce:
Consolidates the intermediate outputs from the map steps
Overview of High Performance Architecture – Apache Hadoop Framework

Each step uses key/value pairs, denoted as <key, value>, as input and
output.

For example, the key could be a filename, and the value could be the
entire contents of the file.

The simplest illustration of MapReduce is a word count example in


which the task is to simply count the number of times each word
appears in a collection of documents.
Overview of High Performance Architecture – Apache Hadoop Framework
In this example, the map step parses the provided text string
into individual words and emits a set of key/value pairs of the
form <word, 1>.

For each unique key—in this example, word—the reduce step


sums the 1 values and outputs the < word, count> key/value
pairs. Because the word each appeared twice in the given line of
text, the reduce step provides a corresponding key/value pair of
<each, 2>.

It should be noted that, in this example, the original key, 1234, is


ignored in the processing.

In a typical word count application, the map step may be


applied to millions of lines of text, and the reduce step will
summarize the key/value pairs generated by all the map steps.
Overview of High Performance Architecture – Apache Hadoop Framework

A key characteristic of MapReduce is that the processing of one


portion of the input can be carried out independently of the
processing of the other inputs. Thus, the workload can be easily
distributed over a cluster of machines.

Although MapReduce is a simple paradigm to understand, it is


not as easy to implement, especially in a distributed system.
Executing a MapReduce job (the MapReduce code run against
some specified data) requires the management and
coordination of several activities:
Overview of High Performance Architecture – Apache Hadoop Framework

Structuring a MapReduce Job in Hadoop

A typical MapReduce program in Java consists of three classes:


The driver,
The mapper, and
The reducer.

The driver provides details such as input file locations, the


provisions for adding the input file to the map task, the names of
the mapper and reducer Java classes, and the location of the
reduce task output. Various job configuration options can also be
specified in the driver.

The mapper provides the logic to be processed on each


data block corresponding to the specified input files in the driver
code.
Overview of High Performance Architecture – Apache Hadoop Framework

Structuring a MapReduce Job in Hadoop

Next, the key/value pairs are processed by the built-in


shuffle and sort functionality based on the number of reducers to
be executed. In this simple example, there is only one reducer. So,
all the intermediate data is passed to it.

Also, Hadoop ensures that the keys are passed to each


reducer in sorted order
Overview of High Performance Architecture – Apache Hadoop Framework

In general, each reducer processes the values for each key and
emits a key/value pair as defined by the reduce logic.

The output is then stored in HDFS like any other file in, say, 64
MB blocks replicated three times across the nodes.

Several Hadoop features provide additional functionality to a


MapReduce job.

First, a combiner is a useful option to apply, when possible,


between the map task and the shuffle and sort. Typically, the
combiner applies the same logic used in the reducer, but it also
applies this logic on the output of each map task. In the word
count example, a combiner sums up the number of occurrences
of each word from a mapper’s output.
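
To make the driver/mapper/reducer/combiner structure concrete, the following is a sketch of the canonical word count job written against the Hadoop MapReduce Java API; class names and the job name are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits <word, 1> for every word in the input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also reused as combiner): sums the 1s for each word and emits <word, count>
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the mapper, combiner, and reducer together and sets input/output paths
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job is typically packaged into a jar and launched with a command of the form hadoop jar wordcount.jar WordCount <input path> <output path>, where both paths refer to HDFS locations.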
Overview of High Performance Architecture – Apache Hadoop Framework

MapReduce Programming Model (Example)


Overview of High Performance Architecture – Apache Hadoop Framework

MapReduce Programming Model (Example)


Overview of High Performance Architecture – Apache Hadoop Framework

MapReduce Programming Model


Overview of High Performance Architecture – Apache Hadoop Framework

Advantages of Hadoop
⮚ allows the user to quickly write and test distributed systems
⮚ efficient
⮚ automatic data and work distribution across the machines
⮚ easy utilization of the underlying parallelism of the CPU cores
⮚ Hadoop does not rely on hardware to provide fault-tolerance and
high availability (FTHA)
⮚ Hadoop library itself has been designed to detect and handle
failures at the application layer
⮚ servers can be added or removed from the cluster dynamically -
Hadoop continues to operate without interruption
⮚ open source
⮚ compatible on all platforms since it is Java based
Overview of High Performance Architecture – Apache Hadoop Framework

Yet Another Resource Negotiator (YARN)


⮚ Released in Hadoop 2.0
⮚ Performs activities by allocating resources & scheduling tasks.
⮚ Responsible for managing and monitoring workloads.
Main Components
✔ Resource Manager (Manages the resource allocation in the cluster)
✔ Node Manager (manages the containers and the status of the data node & sends status to the RM)
✔ Application Master (handles the job life cycle & talks to the RM to allocate containers)
✔ Container (executes application-specific processes)
Overview of High Performance Architecture – Apache Hadoop Framework

Yarn Application Workflow


Overview of High Performance Architecture – Apache Hadoop Framework

Yarn Application Workflow

1. Client submits an application.
2. Resource Manager allocates a container to start the Application Master.
3. Application Master registers itself with the Resource Manager.
4. Application Master negotiates containers from the Resource Manager.
5. Resource Manager notifies the Node Manager to launch containers.
6. Application code is executed in the container.
7. Client contacts the RM/AM to monitor the application’s status.
8. Once the processing is complete, the Application Master unregisters with the Resource Manager.
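
For illustration only, a minimal client-side sketch of steps 1 and 7 using the YarnClient API is shown below; it assumes a reachable Resource Manager, and a trivial shell command stands in for a real Application Master (which would itself perform steps 3 and 4):

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnClientSketch {
  public static void main(String[] args) throws Exception {
    // Step 1: the client connects to the Resource Manager
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the RM for a new application and fill in its submission context
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app");   // hypothetical name
    appContext.setQueue("default");

    // Container spec for the Application Master; a trivial shell command stands in
    // here for a real AM that would register with the RM and negotiate containers
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("echo hello-from-am"));
    appContext.setAMContainerSpec(amContainer);
    appContext.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM

    // Submit; the RM then allocates a container to start the Application Master (step 2)
    ApplicationId appId = appContext.getApplicationId();
    yarnClient.submitApplication(appContext);

    // Step 7: the client polls the RM for the application's status
    ApplicationReport report = yarnClient.getApplicationReport(appId);
    System.out.println("State: " + report.getYarnApplicationState());

    yarnClient.stop();
  }
}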
Overview of High Performance Architecture – Apache Hadoop Framework

Advantages of YARN

An Application Master is associated with each application and directly negotiates with the central Resource Manager for resources, while taking over the responsibility for monitoring progress and tracking status. Pushing this responsibility to the application environment allows greater flexibility in the assignment of resources as well as more effective scheduling to improve node utilization.

The YARN approach allows applications to be better aware of the data


allocation across the topology of the resources within a cluster. This awareness
allows for improved colocation of compute and data resources, reducing data
motion, and consequently, reducing delays associated with data access
latencies. The result should be increased scalability and performance.
Overview of High Performance Architecture – Apache Hadoop Framework

Yet Another Resource Negotiator (YARN)


Thank You
