
INTRODUCTION TO BIG DATA TECHNOLOGIES
MODULE - 1
Content
• What is Big Data? Evolution of Big Data
• Big data challenges: traditional versus big data approach
• Structured, unstructured, semi-structured and quasi-structured data
• Drivers for Big data: the five Vs
• Big data applications
• Basics of Distributed File System
• The Big Data Technology Landscape: NoSQL and Hadoop
Contents
• History of Hadoop, Hadoop use cases
• Distributed File System
• HDFS architecture
• The Design of HDFS
• Name node and data node
• Blocks and replication management
• Rack awareness
• HDFS Federation
• Anatomy of File write
• Anatomy of File read.
What is Big Data?
• Big Data is a term used for collections of data sets that are so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications.
• The challenges include capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.
Evolution of Big Data

Big data Challenges-Traditional versus
big data approach
The comparison below contrasts traditional data with big data:

1. Traditional data is generated at the enterprise level; big data is generated both inside and outside the enterprise.
2. Traditional data volume ranges from gigabytes to terabytes; big data volume ranges from petabytes to zettabytes or exabytes.
3. Traditional database systems deal with structured data; big data systems deal with structured, semi-structured and unstructured data.
4. Traditional data is generated per hour or per day or more; big data is generated far more frequently, mainly per second.
5. Traditional data sources are centralized and managed in a centralized form; big data sources are distributed and managed in a distributed form.
6. Integration of traditional data is very easy; integration of big data is very difficult.
7. A normal system configuration is capable of processing traditional data; a high-end system configuration is required to process big data.
8. The size of traditional data is very small; big data is much larger than traditional data.
9. Traditional database tools are sufficient to perform operations on traditional data; special kinds of database tools are required to perform operations on big data.
10. The traditional data model is strict-schema based and static; the big data model is flat-schema based and dynamic.
11. Traditional data is stable, with known inter-relationships; big data is not stable, with unknown relationships.
12. Traditional data is of manageable volume; big data is of huge volume, which becomes unmanageable.
13. Traditional data is easy to manage and manipulate; big data is difficult to manage and manipulate.
14. Traditional data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc.; big data sources include social media, device data, sensor data, video, images, audio, etc.
Types of Big Data

• Unstructured
• Quasi-Structured
• Semi-Structured
• Structured
Characteristics of Big Data
The five characteristics that define Big Data are: Volume, Velocity, Variety, Veracity and Value.

VOLUME
• Volume refers to the ‘amount of data’,
which is growing day by day at a very fast
pace.
• The size of data generated by humans,
machines and their interactions on social
media itself is massive.
• Researchers have predicted that 40
Zettabytes (40,000 Exabytes) will be
generated by 2020, which is an increase of
300 times from 2005.

Characteristics of Big Data

VELOCITY
• Velocity is defined as the pace at which different sources
generate the data every day.
• This flow of data is massive and continuous.
• There are 1.03 billion Daily Active Users (Facebook DAU) on
Mobile as of now, which is an increase of 22% year-over-year.
• This shows how fast the number of users is growing on social media and how fast data is being generated daily.
• If we are able to handle the velocity, we will be able to generate
insights and take decisions based on real-time data.
VARIETY

As there are many sources contributing to Big Data, the types of data they generate are different.
• It can be structured, semi-structured or unstructured.
• Hence, there is a variety of data getting generated every day.
• Earlier, data mostly came from Excel sheets and databases; now it arrives in the form of images, audio, video, sensor data, etc.
• Hence, this variety of unstructured data creates problems in capturing, storing, mining and analyzing the data.
VERACITY
• Veracity refers to doubt or uncertainty about the available data, arising from data inconsistency and incompleteness.
• For example, a data set may have missing values in some fields, or contain values that are clearly implausible and hard to accept.
• This inconsistency and incompleteness is Veracity.
• Data available can sometimes get messy and maybe difficult to trust.
• With many forms of big data, quality and accuracy are difficult to control like
Twitter posts with hashtags, abbreviations, typos and colloquial speech.
• Volume is often the reason behind the lack of quality and accuracy in the data.
VALUE
• It is all well and good to have access to big data but unless we can turn it into
value it is useless.
• Turning big data into value means asking: is it adding to the benefit of the organizations that are analyzing it? Is the organization working on Big Data achieving a high ROI (Return On Investment)?
• Unless working on Big Data adds to their profits, it is useless.
Applications of Big Data
• Smarter Healthcare
- Making use of the petabytes of patients' data, organizations can extract meaningful information and build applications that can predict a patient's deteriorating condition in advance.
• Telecom
- The telecom sector collects information, analyzes it and provides solutions to different problems.
- By using Big Data applications, telecom companies have been able to significantly reduce data packet loss, which occurs when networks are overloaded, and thus provide a seamless connection to their customers.
Applications of Big Data
• Retail
Retail has some of the tightest margins, and is one of the greatest
beneficiaries of big data.
The beauty of using big data in retail is to understand consumer
behavior.
Amazon's recommendation engine provides suggestions based on the browsing history of the consumer.
• Traffic control
Traffic congestion is a major challenge for many cities globally.
Effective use of data and sensors will be key to managing traffic better
as cities become increasingly densely populated.

Applications of Big Data
• Manufacturing
Analyzing big data in the manufacturing industry can reduce
component defects, improve product quality, increase efficiency, and
save time and money.
• Search Quality
Every time we extract information from Google, we simultaneously generate data for it.
Google stores this data and uses it to improve its search quality.

The Big Data Technology Landscape: NoSQL (Not Only SQL)

• The term NoSQL was first coined by Carlo Strozzi in 1998 to name his lightweight, open-source, non-relational database that did not expose the standard SQL interface.
• NoSQL, originally referring to "non-SQL" or "non-relational", describes a database that provides a mechanism for the storage and retrieval of data.
• This data is modeled in means other than the tabular relations used in relational databases.
• NoSQL databases are used in real-time web applications and big data, and their use is increasing over time.
• NoSQL systems are also sometimes called Not only SQL to emphasize
the fact that they may support SQL-like query languages.
The Big Data Technology Landscape: NoSQL (Not Only SQL)
• NoSQL databases offer simplicity of design, simpler horizontal scaling to clusters of machines and finer control over availability.
• The data structures used by NoSQL databases are different from
those used by default in relational databases which makes some
operations faster in NoSQL.
• The suitability of a given NoSQL database depends on the problem it
should solve.
• Data structures used by NoSQL databases are sometimes also viewed
as more flexible than relational database tables.
The Big Data Technology Landscape: NoSQL (Not Only SQL)
• The concept of NoSQL databases became popular with Internet
giants like Google, Facebook, Amazon, etc. who deal with huge
volumes of data.
• The system response time becomes slow when you use RDBMS for
massive volumes of data.
• To resolve this problem, we could “scale up” our systems by
upgrading our existing hardware. This process is expensive.
• The alternative for this issue is to distribute database load on
multiple hosts whenever the load increases. This method is known as
“scaling out.”
The Big Data Technology Landscape: NoSQL (Not Only SQL)

A NoSQL database is non-relational, so it scales out better than a relational database, because NoSQL databases are designed with web applications in mind.
Advantages of NoSQL
1. Can easily scale up and down: NoSQL databases support scaling rapidly and elastically, and allow scaling to the cloud.
• Cluster scale: allows distribution of the database across 100+ nodes, often in multiple data centers.
• Performance scale: sustains over 100,000+ database reads and writes per second.
• Data scale: supports housing 1 billion+ documents in the database.
2. Doesn't require a pre-defined schema: NoSQL does not require any adherence to a pre-defined schema.
Advantages of NoSQL
3. It is pretty flexible. For example, if we look at MongoDB, the
documents in a collection can have different sets of key-value pairs.

4. Cheap and easy to implement: Deploying NoSQL properly provides all of these benefits (high availability, fault tolerance, etc.) while also lowering operational costs.

Types of NoSQL Databases
• Key-value pair based
• Column-oriented
• Graph based
• Document-oriented
Types of NoSQL Databases
Key Value Pair Based
• Data is stored in key/value pairs. It is designed in such a way to
handle lots of data and heavy load.
• Key-value pair storage databases store data as a hash table where
each key is unique, and the value can be a JSON, BLOB(Binary Large
Objects), string, etc.
• It is one of the most basic types of NoSQL database. This kind of NoSQL database is used for collections, dictionaries, associative arrays, etc. Key-value stores help the developer to store schema-less data.
• They work best for content such as shopping cart data.
• Redis, Dynamo and Riak are examples of key-value store databases; Dynamo and Riak are based on Amazon's Dynamo paper. (A minimal sketch of the key-value model follows below.)
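As a minimal illustration of the key-value model described above (not any particular product's API), the sketch below keeps schema-less JSON strings against unique keys in an in-memory Java map; the class name and the keys are purely illustrative.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the key-value model: each unique key maps to an opaque
// value (here a JSON string), and the store imposes no schema on the values.
public class KeyValueStoreSketch {
    private final Map<String, String> store = new HashMap<>();

    public void put(String key, String value) { store.put(key, value); }

    public String get(String key) { return store.get(key); }

    public static void main(String[] args) {
        KeyValueStoreSketch cart = new KeyValueStoreSketch();
        // Values for different keys need not share any structure (schema-less).
        cart.put("cart:user:1001", "{\"items\":[\"pen\",\"notebook\"],\"total\":5.50}");
        cart.put("session:abc123", "{\"lastSeen\":\"2024-01-01T10:00:00Z\"}");
        System.out.println(cart.get("cart:user:1001"));
    }
}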
Types of NoSQL Databases
Column-based
• Column-oriented databases work on columns and are based on the BigTable paper by Google.
• Every column is treated separately. Values of a single column are stored contiguously.
• They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc., as the data is readily available in a column.
• Column-based NoSQL databases are widely used for data warehouses, business intelligence, CRM, library card catalogs, etc.
• HBase, Cassandra and Hypertable are examples of column-based NoSQL databases.
Types of NoSQL Databases
Document-Oriented:
• Document-Oriented NoSQL DB stores and retrieves data as a key
value pair but the value part is stored as a document.
• The document is stored in JSON or XML formats.
• The value is understood by the DB and can be queried.
Types of NoSQL Databases
Document-Oriented
• Compared with the rows and columns of a relational database, a document database stores data in a structure similar to JSON.
• For a relational database, we have to know in advance which columns we have, and so on.
• For a document database, however, data is stored as JSON-like objects; we do not need to define the structure up front, which makes it flexible.
• The document type is mostly used for CMS systems, blogging platforms, real-time analytics and e-commerce applications.
• It should not be used for complex transactions which require multiple operations or queries against varying aggregate structures.
• Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular document-oriented DBMS systems (a short MongoDB sketch follows below).
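To illustrate that flexibility, the sketch below inserts two documents with different sets of fields into the same collection. It assumes the MongoDB Java sync driver and a MongoDB instance running locally; the database, collection and field names are illustrative only.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Sketch: two documents in one collection with different fields, which a fixed
// relational schema would only allow via NULL columns or ALTER TABLE.
public class DocumentStoreSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> products =
                    client.getDatabase("shop").getCollection("products");

            products.insertOne(new Document("name", "pen")
                    .append("price", 1.50));
            products.insertOne(new Document("name", "notebook")
                    .append("price", 3.20)
                    .append("pages", 200)   // extra field, no schema change needed
                    .append("tags", java.util.Arrays.asList("paper", "A5")));

            products.find().forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}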
Types of NoSQL Databases
Graph-Based
• A graph type database stores entities as well the relations amongst those
entities.
• The entity is stored as a node with the relationship as edges.
• An edge gives a relationship between nodes.
• Every node and edge has a unique identifier.
• Compared to a relational database where tables are loosely connected, a graph database is multi-relational in nature.
• Traversing relationships is fast, as they are already captured in the DB and there is no need to compute them.
• Graph-based databases are mostly used for social networks, logistics and spatial data.
• Neo4J, Infinite Graph, OrientDB, FlockDB are some popular graph-based
databases.
Hadoop
• Hadoop is an open-source software framework used for storing and
processing Big Data in a distributed manner on large clusters of
commodity hardware.
• Hadoop is licensed under the Apache v2 license.
• Hadoop was developed based on the paper written by Google on the MapReduce system, and it applies concepts of functional programming.
• Hadoop is written in the Java programming language and ranks among the top-level Apache projects.
• Hadoop was developed by Doug Cutting and Michael J. Cafarella.
History of Hadoop
USE CASE OF HADOOP
The ClickStream analysis using Hadoop provides three key benefits:
• Hadoop helps to join ClickStream data with other data sources such as
Customer Relationship Management Data (Customer Demographics
Data, Sales Data, and Information on Advertising Campaigns). This
additional data often provides the much needed information to
understand customer behavior.
• Hadoop's scalability helps you store years of data without much incremental cost. This helps you perform temporal or year-over-year analysis on ClickStream data which your competitors may miss.
• Business analysts can use Apache Pig or Apache Hive for website
analysis. With these tools, you can organize ClickStream data by user
session, refine it, and feed it to visualization or analytics tools.
USE CASE OF HADOOP
• ClickStream data (mouse clicks) helps you to understand the
purchasing behavior of customers.
• ClickStream analysis helps online marketers to optimize their
product web pages, promotional content, etc. to improve their
business.
Distributed File System
• A distributed file system (DFS) is a
file system that is distributed on
various file servers and locations.
• It permits programs to access and store isolated data in the same way as local files.
• It also permits the user to access files from any system.
• It allows network users to share information and files in a regulated and permitted manner.
• The servers have complete control over the data and provide access control to the users.
Distributed File System
• DFS's primary goal is to enable users of physically
distributed systems to share resources and information
through the Common File System (CFS).
• It is a file system that runs as a part of the operating system. Its configuration is a set of workstations and mainframes connected by a LAN.
Distributed File System
• DFS has two components in its services, and these are as follows:
1. Location transparency
• It is achieved via the namespace component.
2. Redundancy
• It is achieved via a file replication component.
• In the case of failure or heavy load, these components work
together to increase data availability by allowing data from
multiple places to be logically combined under a single folder
known as the "DFS root".
• It is not required to use both DFS components simultaneously;
the namespace component can be used without the file
replication component, and the file replication component can be
used between servers without the namespace component.
Applications of Distributed File System
• There are several applications of the distributed file system. Some of
them are as follows:
• Hadoop
✓ Hadoop is a collection of open-source software services.
✓ It is a software framework that uses the MapReduce programming
style to allow distributed storage and management of large amounts
of data.
✓ Hadoop is made up of a storage component, known as the Hadoop Distributed File System (HDFS), and an operational component based on the MapReduce programming model.

• NFS (Network File System)
✓ NFS is a client-server architecture that enables a computer user to store, update and view files remotely.
✓ It is one of several DFS standards for Network-Attached Storage.
Applications of Distributed File System
• SMB (Server Message Block)
✓ IBM developed the SMB protocol for file sharing.
✓ It was developed to permit systems to read and write files to a remote host across a LAN.
✓ The directories on the remote host that can be accessed through SMB are known as "shares".
• NetWare
✓ It is a discontinued computer network operating system developed by Novell, Inc.
✓ It used the IPX network protocol and mainly relied on cooperative multitasking to execute many services on a computer system.
Hadoop Distributed File System
• Apache HDFS or Hadoop Distributed File System is a block-
structured file system where each file is divided into blocks of a pre-
determined size.
• These blocks are stored across a cluster of one or several machines. The Apache Hadoop HDFS architecture follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes).
• HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several Data Nodes on a single machine, in practice these Data Nodes are spread across various machines.
Hadoop Distributed File System
Name Node
• Name Node is the master node in the Apache Hadoop HDFS
Architecture that maintains and manages the blocks present on the
Data Nodes (slave nodes).
• Name Node is a very highly available server that manages the File
System Namespace and controls access to files by clients.
• The HDFS architecture is built in such a way that the user data never
resides on the Name Node.
• The data resides on Data Nodes only.
Hadoop Distributed File System
Data Node
• Data Nodes are the slave nodes in HDFS.
• Unlike the Name Node, a Data Node is commodity hardware, that is, an inexpensive system which is not of high quality or high availability.
• The Data Node is a block server that stores the data in the local file system (ext3 or ext4).
Hadoop Distributed File System
Functions of Data Node
• These are slave daemons or processes which run on each slave machine.
• The actual data is stored on Data Nodes.
• The Data Nodes perform the low-level read and write requests from the file system's clients.
• They send heartbeats to the Name Node periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.

Blocks

• Blocks are nothing but the smallest contiguous locations on your hard drive where data is stored.
• In general, in any of the File System, you store the data as a collection
of blocks.
• Similarly, HDFS stores each file as blocks which are scattered
throughout the Apache Hadoop cluster.
• The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which you can configure as per your requirement.
Blocks

• It is not necessary that each file in HDFS is stored in an exact multiple of the configured block size (128 MB, 256 MB, etc.).
• Let's take an example of a file "example.txt" of size 514 MB.
• Suppose that we are using the default configuration of block size,
which is 128 MB. Then, how many blocks will be created? 5, Right.
• The first four blocks will be of 128 MB. But, the last block will be of
2 MB size only.
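A quick arithmetic sketch of this example in plain Java (not the HDFS API), assuming the 514 MB file and the 128 MB default block size described above:

// 514 MB split into 128 MB blocks: four full blocks plus one final 2 MB block.
public class BlockCountSketch {
    public static void main(String[] args) {
        long fileSizeMb = 514;
        long blockSizeMb = 128;

        long fullBlocks = fileSizeMb / blockSizeMb;                 // 4
        long lastBlockMb = fileSizeMb % blockSizeMb;                // 2
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);  // 5

        System.out.println("Total blocks:    " + totalBlocks);
        System.out.println("Last block (MB): " + lastBlockMb);
    }
}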
Blocks
• Well, whenever we talk about HDFS, we talk about huge data sets, i.e.
Terabytes and Petabytes of data.
• So, if we had a block size of, say, 4 KB, as in a Linux file system, we would have too many blocks and therefore too much metadata.
• Managing this number of blocks and metadata would create a huge overhead, which is something we don't want.
Hadoop Distributed File System
Secondary Name Node:
• Apart from these two daemons, there is a third daemon or a process
called Secondary Name Node.
• The Secondary Name Node works concurrently with the primary
Name Node as a helper daemon.
• The Secondary Name Node is not a backup Name Node.
Hadoop Distributed File System
Functions of Secondary Name Node:
• The Secondary Name Node constantly reads the file system state and metadata from the RAM of the Name Node and writes it to the hard disk or the file system.
• It is responsible for combining the EditLogs with FsImage from the
Name Node.
• It downloads the EditLogs from the Name Node at regular intervals
and applies to FsImage. The new FsImage is copied back to the Name
Node, which is used whenever the Name Node is started the next
time.
• Hence, Secondary Name Node performs regular checkpoints in
HDFS. Therefore, it is also called Checkpoint Node.
Replication Management
• HDFS provides a reliable way to store huge data in a distributed
environment as data blocks.
• The blocks are also replicated to provide fault tolerance.
• The default replication factor is 3 which is again configurable.
• With the default replication factor, each block is replicated three times and the copies are stored on different Data Nodes.
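As a sketch of how a client can influence these settings, the snippet below uses the standard Hadoop Configuration and FileSystem APIs with the dfs.replication and dfs.blocksize properties; the file path is illustrative, and in practice these defaults are usually set cluster-wide in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: overriding the replication factor and block size for files created by
// this client, and raising the replication of one existing (illustrative) file.
public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");                               // replicas per block
        conf.set("dfs.blocksize", String.valueOf(128L * 1024 * 1024));  // 128 MB blocks

        FileSystem fs = FileSystem.get(conf);
        fs.setReplication(new Path("/user/demo/example.txt"), (short) 4);
        fs.close();
    }
}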
Rack Awareness
• The Name Node also ensures that all the replicas are not stored on
the same rack or a single rack.
• It follows an in-built Rack Awareness Algorithm to reduce latency as
well as provide fault tolerance.
• Considering a replication factor of 3, the Rack Awareness Algorithm says that the first replica of a block will be stored on the local rack and the next two replicas will be stored on a different (remote) rack, but on different Data Nodes within that (remote) rack.
• If we have more replicas, the rest of the replicas will be placed on
random Data Nodes provided not more than two replicas reside on
the same rack, if possible.
• This is how an actual Hadoop production cluster looks: multiple racks populated with Data Nodes.
Limitations of Hadoop 1.0 Architecture
HDFS Limitation
• Name Node saves all its file metadata in main memory.
• Although the main memory today is not as small and as expensive as
it used to be two decades ago, still there is a limit on the number of
objects that one can have in the memory on a single Name Node.
• The Name Node can quickly become overwhelmed as the load on the system increases.
• In Hadoop 2.x, this is resolved with the help of HDFS Federation.
HDFS Federation
• HDFS Federation uses multiple independent Name Nodes for
horizontal scalability.
• Name Nodes are independent of each other.
• This means the Name Nodes do not need any coordination with each other. The Data Nodes are common storage for blocks and are shared by all Name Nodes.
• All Data Nodes in the cluster register with each Name Node in the cluster.
• High availability of Name Node is obtained with the help of Passive
Standby Name Node.
HDFS Federation
• In Hadoop 2.x, Active-Passive Name Node handles failover
automatically.
• All namespace edits are recorded to a shared NFS Storage and there
is a single writer at any point of time.
• Passive Name Node reads edits from shared storage and keeps
updated metadata information.
• In case of Active Name Node failure, the Passive Name Node becomes the Active Name Node automatically.
• Then it starts writing to the shared storage.

Anatomy of File Read

The steps involved in the File Read are as follows:

1. The client opens the file that it wishes to read by calling open() on the Distributed File System.

2. The Distributed File System communicates with the Name Node to get the location of the data blocks. The Name Node returns the addresses of the Data nodes that the data blocks are stored on. Subsequent to this, the Distributed File System returns an FSDataInputStream to the client to read from the file.

3. The client then calls read() on the stream (DFSInputStream), which has the addresses of the Data nodes for the first few blocks of the file, and connects to the closest Data node for the first block in the file.
Anatomy of File Read
4. The client calls read() repeatedly to stream the data from the Data node.

5. When the end of a block is reached, DFSInputStream closes the connection with the Data node. It repeats these steps to find the best Data node for the next block and for subsequent blocks.

6. When the client completes reading the file, it calls close() on the FSDataInputStream to close the connection.
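From the client's point of view, the read path above reduces to a few calls on the standard Hadoop FileSystem API. The sketch below is illustrative only: the path is made up and the Configuration is assumed to point at a running HDFS cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch of the client side of a file read: open() returns an FSDataInputStream,
// repeated read()s stream the blocks from the Data nodes, and close() ends it.
public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/example.txt");  // illustrative path
        FSDataInputStream in = fs.open(file);            // steps 1-2: open() and block lookup
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);  // steps 3-5: repeated read()s
        } finally {
            IOUtils.closeStream(in);                     // step 6: close()
        }
        fs.close();
    }
}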

Anatomy of File Write
• The steps involved in anatomy of File Write are as follows:
1. The client calls create() on Distributed File System to create a file.
2. An RPC call to the Name Node happens through the Distributed File
System to create a new file. The Name Node performs various
checks to create a new file (checks whether such a file exists or
not). Initially, the Name Node creates a file without associating any
data blocks to the file. The Distributed File System returns an
FSDataOutputStream to the client to perform write.
3. As the client writes data, the data is split into packets by DFSOutputStream, which are then written to an internal queue, called the data queue. The DataStreamer consumes the data queue. The DataStreamer asks the Name Node to allocate new blocks by selecting a list of suitable Data nodes to store the replicas. This list of Data nodes forms a pipeline. Here, we will go with the default replication factor of three, so there will be three nodes in the pipeline for the first block.
Anatomy of File Write
4. The DataStreamer streams the packets to the first Data node in the pipeline, which stores each packet and forwards it to the second Data node in the pipeline. In the same way, the second Data node stores the packet and forwards it to the third Data node in the pipeline.
5. In addition to the internal queue, DFSOutputStream also manages an
"Ack queue" of packets that are waiting for the acknowledgement by
Data nodes. A packet is removed from the "Ack queue" only if it is
acknowledged by all the Data nodes in the pipeline.
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the Data node pipeline and
waits for relevant acknowledgments before communicating with the
Name Node to inform the client that the creation of the file is complete.
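The write path looks similar from the client side. The sketch below uses the standard FileSystem API; the path and the written text are illustrative only, and the packet/pipeline mechanics described above happen inside the FSDataOutputStream.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the client side of a file write: create() returns an FSDataOutputStream,
// writes are packaged into packets behind the scenes, and close() flushes the
// pipeline and completes the file on the Name Node.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/output.txt");  // illustrative path
        FSDataOutputStream out = fs.create(file);       // steps 1-2: create() + Name Node RPC
        try {
            out.writeBytes("hello hdfs\n");             // steps 3-5: packets flow down the pipeline
        } finally {
            out.close();                                // steps 6-7: flush, wait for acks, complete
        }
        fs.close();
    }
}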
MAPREDUCE
Hadoop Mapreduce paradigm
• Hadoop is an open-source software framework
for storing and processing large datasets ranging
in size from gigabytes to petabytes.
• It was developed at the Apache Software Foundation.
• There are basically two components in Hadoop:
1. Massive data storage
2. Faster data processing

Hadoop Mapreduce paradigm
• Hadoop distributed File System (HDFS):
• It allows you to store data of various formats
across a cluster.
• Map-Reduce:
• A programming model and processing framework in Hadoop. It allows parallel processing over the data stored across HDFS (resource management in Hadoop 2.x is handled by YARN, covered later in this module).

History of Hadoop

Why Hadoop?
• Cost Effective System
• Computing power
• Scalability
• Storage flexibility
• Inherent data protection
• Varied Data Sources
• Fault-Tolerant
• Highly Available
• High Throughput
• Multiple Languages Supported

Traditional restaurant scenario

Traditional Scenario

Distributed Processing Scenario

Distributed Processing Scenario Failure

Solution of Restaurant problem

Hadoop in Restaurant Analogy

Map tasks
• Process independent chunks in a parallel manner
• The output of each map task is stored as intermediate data on the local disk of that server
Reduce task
• The output of the mappers is automatically shuffled and stored by the framework
• The framework sorts this output based on the key
• Produces the reduced output by combining the output from the various mappers

Map-reduce daemons
1. JobTrackers
2. TaskTrackers

JobTracker
• Master daemon
• Single JobTracker per Hadoop cluster
• Provides connectivity between Hadoop and the client application
• Creates the execution plan (decides which task to assign to which node)
• Monitors all running tasks
• If a task fails, it is rescheduled

Task Tracker
• Responsible for executing the individual tasks assigned by the JobTracker
• There is a single Task Tracker per slave node
• It continuously sends heartbeat messages to the JobTracker
• If no heartbeat message is received, the task is allocated to another Task Tracker

Map-reduce execution pipeline

Mapper
• Mapper maps the input key-value pairs into a set of
intermediate key-value pairs
• Phases:
1. RecordReader:
• Converts each input split into records as key-value pairs
• <key, value> → <positional information, chunk of data that constitutes the record>
2. Map:
• Generates zero or more intermediate key-value pairs for each input record

Mapper
3. Combiner
• An optimization technique for a MapReduce job; it applies a user-specified aggregate function to the output of a single mapper only
• Also known as a local reducer
4. Partitioner
• Splits the intermediate key-value pairs into partitions
• Usually the number of partitions is equal to the number of reducers
(A word-count mapper illustrating the Map phase is sketched below.)
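The following word-count mapper is a minimal sketch of the Map phase, written against the classic org.apache.hadoop.mapred API listed in the package slides later in this module; the class name WordCountMapper and the whitespace tokenization are illustrative choices, not part of the original slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map phase for word count: for every input line (key = byte offset supplied by
// the RecordReader, value = line text), emit an intermediate <word, 1> pair.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, ONE);  // intermediate key-value pair
            }
        }
    }
}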

Reducer
1. Shuffle and sort:
• Consumes the output of the Map phase
• Consolidates the relevant records from the Map phase output
• For word count, the same words are clubbed together along with their respective frequencies

Reducer
2. Reducer:
• Receives the grouped data produced by the shuffle and sort phase
• Applies the reduce function, processing one group at a time
• The reduce function iterates over all the values associated with that key
• Typical operations are aggregation, filtering and combining

3. Output format:
• Separates each key-value pair with a tab
• Writes it out to a file using a RecordWriter
(A word-count reducer illustrating the reduce function is sketched below.)
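The matching word-count reducer, again a minimal sketch against the classic org.apache.hadoop.mapred API (the class name WordCountReducer is illustrative): it receives each word together with an iterator over its counts and emits the total.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reduce phase for word count: the shuffle and sort phase has already grouped
// the intermediate pairs by key, so each call sums the counts for one word.
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();            // aggregate the counts for this word
        }
        output.collect(key, new IntWritable(sum)); // <word, total count>
    }
}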

API
• Main Class file Packages

• Mapper Class Packages

• Reducer Class Packages

Main class file packages
• import org.apache.hadoop.conf.Configured; (Configuration of system parameters)

• import org.apache.hadoop.fs.Path; (Configuration of file system path)

• import org.apache.hadoop.io.IntWritable; (Input/output package to display in output screen)

• import org.apache.hadoop.io.Text; ( to read and write the text)

• import org.apache.hadoop.mapred.FileInputFormat; ( MapRed file input format)

• import org.apache.hadoop.mapred.FileOutputFormat; (MapRed file output format)

• import org.apache.hadoop.mapred.JobClient; ( assign the input job and process)

• import org.apache.hadoop.mapred.JobConf; (configuration file to execute I/O process)

• import org.apache.hadoop.util.Tool; (interface (command line options) used to access MapRed functions)

• import org.apache.hadoop.util.ToolRunner; (interface used to call the run function)
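To show how the packages above fit together, here is a minimal word-count driver sketch built on JobConf and JobClient; it assumes the WordCountMapper and WordCountReducer sketches shown earlier, and takes the HDFS input and output paths from the command line.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Driver (main class) for the word-count job: configures the job, wires in the
// mapper and reducer sketches, and submits the job to the cluster.
public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);          // key type of the final output
        conf.setOutputValueClass(IntWritable.class); // value type of the final output
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not already exist

        JobClient.runJob(conf);  // submit and wait for completion
    }
}

Compiled against the Hadoop client libraries and packaged as a JAR, such a driver would typically be run with something like: hadoop jar wordcount.jar WordCount /input /output.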

Mapper File Packages
• import java.io.IOException; ( Exception handle)

• import org.apache.hadoop.io.IntWritable; ( to read the integer file)

• import org.apache.hadoop.io.LongWritable; (to read files range exceeding integer)

• import org.apache.hadoop.io.Text; (Input and output text)

• import org.apache.hadoop.mapred.MapReduceBase;( Inherited class of MapReduce functions)

• import org.apache.hadoop.mapred.Mapper; (Mapper Class)

• import org.apache.hadoop.mapred.OutputCollector; ( to collect and display class)

• import org.apache.hadoop.mapred.Reporter; (to display the information)

Reducer file Package
• import java.io.IOException; ( Exception handle)

• import java.util.Iterator; (to call utility function has more elements from iterator class)

• import org.apache.hadoop.io.IntWritable; ( to read the integer file)

• import org.apache.hadoop.io.Text; (Input and output text)

Reducer file Package
• import org.apache.hadoop.mapred.MapReduceBase; (Inherited class of MapReduce functions)

• import org.apache.hadoop.mapred.OutputCollector; (to collect and display class)

• import org.apache.hadoop.mapred.Reducer; (Reducer Class)

• import org.apache.hadoop.mapred.Reporter; (to display the information)

Hadoop 2.0 features
• HDFS Federation – horizontal scalability of
NameNode
• NameNode High Availability – NameNode is no
longer a Single Point of Failure
• YARN – ability to process Terabytes and
Petabytes of data available in HDFS using Non-
MapReduce applications such as MPI, GIRAPH

Namenode high availability
• In Hadoop 1.x, the NameNode was a single point of failure
• Hadoop administrators needed to manually recover the NameNode using the Secondary NameNode
• Hadoop 2.0 Architecture supports multiple
NameNodes to remove this bottleneck
• Passive Standby NameNode support.
• In case of Active NameNode failure, the passive
NameNode becomes the Active NameNode and
starts writing to the shared storage

MapReduce Execution Framework

Motivation for MR V2

Yarn Architecture

YARN(Yet Another Resource Negotiator)
• The main idea is to split the JobTracker's responsibilities of resource management and job scheduling into separate daemons.

YARN daemons
1. Global resource manager:
a) Scheduler(allocation of resources among
various running applications)
b) Application manager(Accepting job
submission, restarting application master in
case of failure)

YARN daemons
2. Node manager:
• Per-machine slave daemon
• Launching application container for application
execution
• Report usage of resources to the global resource
manager

YARN daemons
3. Application master:
• Application specific entity
• Negotiate required resources for execution from
the resource manager
• Works with node manager for executing and
monitoring component tasks

YARN

YARN workflow
1. The client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. The client contacts the Resource Manager/Application Master to monitor the application's status (see the sketch below)
8. Once the processing is complete, the Application Master un-registers with the Resource Manager
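As a small client-side illustration of step 7 (monitoring application status), the sketch below uses the standard YarnClient API; it assumes a reachable Resource Manager configured via the usual yarn-site.xml and simply lists the reports of submitted applications.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

// Sketch: the client queries the Resource Manager for application reports
// (step 7 of the workflow) instead of talking to each container directly.
public class YarnStatusSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId()
                    + " -> " + report.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}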
