KCS-061 Unit 2

Lecture 09-16

Big Data (KCS-061)

Hadoop
History of Hadoop, Apache Hadoop, the Hadoop Distributed File System,
components of Hadoop, data format, analyzing data with Hadoop, scaling out,
Hadoop Streaming, Hadoop Pipes, Hadoop Ecosystem
Hadoop: Definition

Hadoop is an open-source software framework used for storing and processing
Big Data in a distributed manner on large clusters of commodity hardware. It is
an Apache Software Foundation (ASF) project, released under the Apache License.


Figure: Technicians working on a large Linux cluster at the Chemnitz University
of Technology, Germany.


What is the Need of Hadoop?
Use case:
Problem:
An e-commerce site XYZ (having 100 million users) wants to offer a gift voucher
of $100 to its top 10 customers who have spent the most in the previous year.
Moreover, it wants to find the buying trend of these customers so that the
company can suggest more related items to them.


Use Case
Issues:
A huge amount of unstructured data needs to be stored, processed and analyzed.
Solution:
Apache Hadoop is not just a storage system; it is a platform for both data
storage and data processing.


Use Case
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop
Distributed File System), which forms clusters of commodity hardware and stores
the data in a distributed fashion. It works on the "write once, read many
times" principle.
Processing: The MapReduce paradigm is applied to the data distributed over the
network to compute the required output.
Analyze: Pig and Hive can be used to analyze the data.
Cost: Hadoop is open source, so cost is no longer an issue.
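As a rough illustration of how the "top spenders" computation maps onto Hadoop,
here is a minimal Java MapReduce sketch that totals spend per customer. It
assumes, purely for illustration, that each input line looks like
"customerId,amount"; all class names here are hypothetical.

// Illustrative sketch: per-customer spend aggregation with Hadoop MapReduce.
// Input lines are assumed to look like "customerId,amount".
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CustomerSpend {

  // Map: one input line -> (customerId, amount)
  public static class SpendMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 2) {
        context.write(new Text(fields[0]),
                      new DoubleWritable(Double.parseDouble(fields[1])));
      }
    }
  }

  // Reduce: (customerId, [amounts]) -> (customerId, total spend)
  public static class SpendReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double total = 0;
      for (DoubleWritable v : values) {
        total += v.get();
      }
      context.write(key, new DoubleWritable(total));
    }
  }
}

The resulting per-customer totals could then be sorted in a small follow-up step
(or a second job) to pick out the top 10 customers.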


Era of Big Data
For the past two decades, we have been living in an era of data explosion. The
amount of data created in traditional business, such as orders and warehousing,
has increased relatively slowly, and its proportion in the total amount of data
has gradually decreased.

Instead, massive amounts of human data and machine data (logs, IoT devices,
etc.) have been collected and stored in quantities far exceeding traditional
business data. A huge technology gap exists between these massive amounts of
data and human capabilities, which has spawned various big data technologies.
In this context, what we call the era of big data has come into being.
Data Explosion


Hadoop at a Glance
• Hadoop is written in the Java programming language and ranks among the
top-level Apache projects.
• Hadoop was developed by Doug Cutting and Mike Cafarella.
• Inspired by Google's work, Hadoop builds on technologies such as the
MapReduce programming model and the Google File System (GFS).
• It is optimized to handle massive quantities of data, which may be
structured, unstructured or semi-structured, using commodity hardware, that is,
relatively inexpensive computers.
• It is designed to scale from a single server to thousands of machines, each
offering local computation and storage, and supports large data sets in a
distributed computing environment.


History of Hadoop
• 2012: Apache Hadoop 1.0 released.
• 2013: Apache Hadoop 2.2 released.
• 2014: Apache Hadoop 2.6 released.
• 2015: Apache Hadoop 2.7 released.
• 2017: Apache Hadoop 3.0 released.
• 2018: Apache Hadoop 3.1 released.


Apache Hadoop
Apache Hadoop (/həˈduːp/) is a collection of open-source software utilities
that facilitates using a network of many computers to solve problems involving
massive amounts of data and computation. It provides a software framework for
distributed storage and processing of big data using the MapReduce programming
model.
Hadoop was originally designed for computer clusters built from commodity
hardware, which is still the common use.


Apache Hadoop
The core of Apache Hadoop consists of a storage part, known as the Hadoop
Distributed File System (HDFS), and a processing part, which is the MapReduce
programming model.

Hadoop splits files into large blocks and distributes them across nodes in a
cluster. It then transfers packaged code to the nodes to process the data in
parallel.


Hadoop Architecture
Apache Hadoop offers a scalable, flexible and reliable distributed-computing
big data framework for a cluster of systems with storage capacity and local
computing power, by leveraging commodity hardware.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes. In short, the Hadoop framework
makes it possible to develop applications that run on clusters of computers and
perform complete statistical analysis of huge amounts of data.
Hadoop follows a master-slave architecture for the transformation and analysis
of large datasets using the Hadoop MapReduce paradigm.


Hadoop Architecture
The three important Hadoop core components that play a vital role in the Hadoop
architecture are:
1. Hadoop Distributed File System (HDFS)
2. Hadoop MapReduce
3. Yet Another Resource Negotiator (YARN)




Hadoop Distributed File System (HDFS)
• HDFS runs on top of the existing file systems on each node in a Hadoop
cluster.
• HDFS is a block-structured file system in which each file is divided into
blocks of a pre-determined size.
• Data in a Hadoop cluster is broken down into these smaller units (blocks) and
distributed throughout the cluster. By default, each block is duplicated twice
(for a total of three copies), with the two replicas stored on two different
nodes in a rack elsewhere in the cluster.


Hadoop Distributed File System (HDFS)
• Since the data has a default replication factor of three, it is highly
available and fault-tolerant.
• If a copy is lost (because of machine failure, for example), HDFS
automatically re-replicates it elsewhere in the cluster, ensuring that the
threefold replication factor is maintained.
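For illustration, the replication factor described above can also be inspected
and adjusted per file through the HDFS Java client API. This is only a sketch:
the file path is hypothetical and the cluster configuration files are assumed
to be on the classpath.

// Illustrative sketch of the HDFS client API; path and replication value are examples only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // handle to the distributed file system
    Path file = new Path("/data/orders.csv");      // hypothetical file

    // Ask HDFS to keep 3 replicas of every block of this file (the usual default).
    fs.setReplication(file, (short) 3);

    // Replication factor reported by the NameNode for this file.
    System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
    fs.close();
  }
}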




Hadoop MapReduce
• This is the framework for parallel processing of large data sets.
• The MapReduce framework consists of a single master node (JobTracker) and n
slave nodes (TaskTrackers), where n can be in the thousands. The master
manages, maintains and monitors the slaves, while the slaves are the actual
worker nodes.
• The client submits a job to Hadoop. A job consists of a mapper, a reducer and
a list of inputs. The job is sent to the JobTracker process on the master node,
and each slave node runs a TaskTracker process.
• The master is responsible for resource management, tracking resource
consumption/availability and scheduling the job's component tasks on the
slaves, monitoring them and re-executing the failed tasks.
Hadoop MapReduce
• The slave TaskTrackers execute the tasks as directed by the master and
provide task-status information to the master periodically.
• The master stores the metadata (data about data), while the slaves are the
nodes that store the data. The client connects to the master node to perform
any task.

• Hadoop YARN: YARN (Yet Another Resource Negotiator) is the framework
responsible for assigning computational resources for application execution and
for cluster management. YARN consists of three core components:
• ResourceManager (one per cluster)
• ApplicationMaster (one per application)
• NodeManager (one per node)




File Formats in Hadoop
Input file formats in Hadoop:
1. Text/CSV Files
2. JSON Records
3. Avro Files
4. Sequence Files
5. RC Files
6. ORC Files
7. Parquet Files


File Formats in Hadoop
• Text/CSV (comma-separated values) files: Text and CSV files are quite common,
and Hadoop developers and data scientists frequently receive text and CSV files
to work on.
• JSON (JavaScript Object Notation) records: In JSON record files, each line is
its own JSON datum. Metadata is stored along with the data and the file is
splittable, but it does not support block compression.
• Avro files include markers that can be used to split large data sets into
subsets suitable for MapReduce processing.
• Sequence files store data in binary format and have a structure similar to
CSV files, with some differences. They do not store metadata, so the only
schema-evolution option is to append new fields, but they do support block
compression.
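As a small illustration of the binary key/value layout of sequence files, here
is a sketch that writes a few records with the Hadoop SequenceFile API; the
output path is an assumption and compression options are left out.

// Illustrative sketch: writing binary key/value records to a SequenceFile.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/data/example.seq");   // hypothetical output path

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(IntWritable.class))) {
      // Each append() writes one key/value record in binary form.
      writer.append(new Text("record-1"), new IntWritable(100));
      writer.append(new Text("record-2"), new IntWritable(200));
    }
  }
}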


File Formats in Hadoop
• RCFile (Record Columnar File) is a data placement structure that determines
how to store relational tables on computer clusters. It is designed for systems
using the MapReduce framework.
• Optimized Row Columnar (ORC) is an open-source columnar storage file format,
originally released in early 2013 for Hadoop workloads. ORC provides a highly
efficient way to store Apache Hive data, though it can store other data as
well.
• Parquet is an open-source file format built to handle flat columnar storage
data formats. Parquet works well with complex data in large volumes. It is
known for both its performant data compression and its ability to handle a wide
variety of encoding types.


Analysing Data with Hadoop
• The Hive and Pig projects are popular choices that provide SQL-like and
procedural data-flow-like languages, respectively.
• HBase is also a popular way to store and analyze data in HDFS. It is a
column-oriented database and, unlike MapReduce, provides random read and write
access to data with low latency.
• MapReduce jobs can read and write data in HBase's table format, but data
processing is often done via HBase's own client API.
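To illustrate the low-latency random access mentioned above, here is a minimal
sketch using the HBase client API; the table name "customers", the column
family "info", and the row key are all hypothetical.

// Illustrative sketch of random read/write with the HBase client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("customers"))) {

      // Low-latency write of a single cell.
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("spend"), Bytes.toBytes("1250.75"));
      table.put(put);

      // Low-latency random read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("user-42")));
      byte[] spend = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("spend"));
      System.out.println("spend = " + Bytes.toString(spend));
    }
  }
}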


Scaling Out
Scaling Up vs. Scaling Out:
Once a decision has been made for data scaling, the specific scaling approach
must be chosen. There are two commonly used types of data scaling:
• Up
• Out


Scaling Up, or Vertical Scaling
• It involves obtaining a faster server with more powerful processors and more
memory.
• This solution uses less network hardware and consumes less power, but for
many platforms it may ultimately provide only a short-term fix, especially if
continued growth is expected.


Scaling Out, or Horizontal Scaling
• It involves adding servers for parallel computing.
• The scale-out technique is a long-term solution, as more and more servers may
be added when needed.
• Going from one monolithic system to this type of cluster may be difficult,
although it is an extremely effective solution.
Hadoop Streaming
Hadoop Streaming is a utility that comes with the Hadoop distribution. This
utility allows you to create and run MapReduce jobs with any executable or
script as the mapper and/or the reducer.

It is a utility, or feature, that comes with a Hadoop distribution and allows
developers or programmers to write MapReduce programs in different programming
languages like Ruby, Perl, Python, C++, etc. We can use any language that can
read from standard input (STDIN), such as keyboard input, and write to standard
output (STDOUT).
The Hadoop framework itself is completely written in Java, but programs for
Hadoop do not necessarily need to be coded in Java. The Hadoop Streaming
feature has been available since Hadoop version 0.14.1.
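Streaming mappers and reducers are normally written in scripting languages such
as Python, but any executable that reads lines from STDIN and writes
tab-separated key/value pairs to STDOUT will work. Below is a minimal
word-count mapper sketch, written in Java only to stay consistent with the
other examples in these notes; such a program would be handed to the
hadoop-streaming utility via its -mapper option.

// Illustrative sketch of a Hadoop Streaming mapper: reads lines from STDIN and
// writes "word<TAB>1" pairs to STDOUT. Any language could be used instead.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class StreamingWordCountMapper {
  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      for (String word : line.trim().split("\\s+")) {
        if (!word.isEmpty()) {
          System.out.println(word + "\t1");   // key<TAB>value, one pair per line
        }
      }
    }
  }
}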


How Hadoop Streaming Works (figure)


Hadoop Pipes
• Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
• Unlike Streaming, which uses standard input and output to communicate with
the map and reduce code, Pipes uses sockets as the channel over which the
TaskTracker communicates with the process running the C++ map or reduce
function.


Streaming vs. Pipes (figure)


Hadoop Ecosystem
• The Hadoop ecosystem is neither a programming language nor a service; it is a
platform or framework that solves big data problems. You can consider it a
suite that encompasses a number of services (ingesting, storing, analyzing and
maintaining data). Let us discuss and get a brief idea of how these services
work individually and in collaboration.
• The Hadoop ecosystem provides the furnishings that turn the framework into a
comfortable home for big data activity that reflects your specific needs and
tastes.
• The Hadoop ecosystem includes both official Apache open-source projects and a
wide range of commercial tools and solutions.
Hadoop Ecosystem
Below are the Hadoop components that together form the Hadoop ecosystem:
✓ HDFS -> Hadoop Distributed File System
✓ YARN -> Yet Another Resource Negotiator
✓ MapReduce -> Data processing using programming
✓ Spark -> In-memory data processing
✓ Pig, Hive -> Data processing services using query (SQL-like)
✓ HBase -> NoSQL database
✓ Mahout, Spark MLlib -> Machine learning
✓ Apache Drill -> SQL on Hadoop
✓ Zookeeper -> Managing cluster
✓ Oozie -> Job scheduling
✓ Flume, Sqoop -> Data ingesting services
✓ Solr & Lucene -> Searching & indexing


Hadoop Distributions
Hadoop is an open-source, catch-all technology solution with incredible
scalability, low-cost storage and fast-paced big data analytics on economical
servers.
Hadoop vendor distributions overcome the drawbacks and issues of the
open-source edition of Hadoop. These distributions add functionality that
focuses on:
• Support: Most Hadoop vendors provide technical guidance and assistance that
makes it easy for customers to adopt Hadoop for enterprise-level tasks and
mission-critical applications.
• Reliability: Hadoop vendors respond promptly whenever a bug is detected. With
the intent of making commercial solutions more stable, patches and fixes are
deployed immediately.


Hadoop Distributions
• Completeness: Hadoop vendors couple their distributions with various other
add-on tools which help customers customize the Hadoop application to address
their specific tasks.
• Fault tolerance: Since the data has a default replication factor of three, it
is highly available and fault-tolerant.


Top Hadoop Vendors
Here is a list of top Hadoop vendors who play a key role in big data market
growth:
➢ Amazon Elastic MapReduce
➢ Cloudera CDH Hadoop Distribution
➢ Hortonworks Data Platform (HDP)
➢ MapR Hadoop Distribution
➢ IBM Open Platform (IBM InfoSphere BigInsights)
➢ Microsoft Azure HDInsight - Cloud-based Hadoop distribution


Advantages of Hadoop
The increase in the requirement of computing resources has made Hadoop a viable
and extensively used programming framework. Modern-day organizations can learn
Hadoop and leverage that know-how to manage the processing needs of their
businesses.
1. Scalable: Hadoop is a highly scalable storage platform because it can store
and distribute very large data sets across hundreds of inexpensive servers that
operate in parallel.
2. Cost-effective: Hadoop also offers a cost-effective storage solution for
businesses' exploding data sets. The problem with traditional relational
database management systems is that it is extremely cost-prohibitive to scale
to such a degree in order to process such massive volumes of data.


Advantages of Hadoop (contd.)
3. Flexible: Hadoop enables businesses to easily access new data sources and
tap into different types of data (both structured and unstructured) to generate
value from that data.
4. Speed of processing: Hadoop's unique storage method is based on a
distributed file system that basically 'maps' data wherever it is located on a
cluster.
5. Resilient to failure: A key advantage of using Hadoop is its fault
tolerance. When data is sent to an individual node, that data is also
replicated to other nodes in the cluster, which means that in the event of
failure there is another copy available for use.
Lecture 13-16
Big Data (KCS-061)

MapReduce
MapReduce framework and basics, how MapReduce works, developing a MapReduce
application, unit tests with MRUnit, test data and local tests, anatomy of a
MapReduce job run, failures, job scheduling, shuffle and sort, task execution,
MapReduce types, input formats, output formats, MapReduce features, real-world
MapReduce
MapReduce

MapReduce is a programming framework that allows us to perform parallel and
distributed processing on huge data sets in a distributed environment.


How MapReduce Works?

MapReduce is a massively parallel processing framework that can easily be
scaled over large amounts of commodity hardware to meet the increased need for
processing larger amounts of data. Once we get the mapping and reducing tasks
right, all it needs is a change in configuration to make it work on a larger
set of data. This kind of extreme scalability, from a single node to hundreds
and even thousands of nodes, is what makes MapReduce a top favorite among Big
Data professionals worldwide.
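The point that only configuration changes when a job is scaled is visible in a
typical driver: the job definition below runs unchanged whether the cluster has
one node or thousands. This is a minimal word-count driver sketch that uses the
TokenCounterMapper and IntSumReducer classes shipped with Hadoop, so no custom
mapper or reducer is needed; the input and output paths are taken from the
command line.

// Illustrative driver sketch: a complete word-count job wired together with
// Hadoop's built-in TokenCounterMapper and IntSumReducer.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(TokenCounterMapper.class);  // map phase: (offset, line) -> (word, 1)
    job.setReducerClass(IntSumReducer.class);      // reduce phase: (word, [1,1,...]) -> (word, n)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}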




Architectural Components
MapReduce is implemented through the following components:
1. Architectural components
   a) Job Tracker
   b) Task Trackers
2. Functional components
   a) Mapper (Map)
   b) Combiner (Shuffler)
   c) Reducer (Reduce)


Architectural Components
Architectural components:
➢ The complete execution process (execution of both Map and Reduce tasks) is
controlled by two types of entities:
1. A Job Tracker: acts like a master (responsible for complete execution of the
submitted job).
2. Multiple Task Trackers: act like slaves, each of them performing part of the
job.
➢ For every job submitted for execution in the system, there is one Job
Tracker, which resides on the Name Node, and there are multiple Task Trackers,
which reside on the Data Nodes.
 A job is divided into multiple tasks, which are then run on multiple data
nodes in a cluster.
 It is the responsibility of the Job Tracker to coordinate the activity by
scheduling tasks to run on different data nodes.




Architectural Components
• Execution of an individual task is then looked after by the Task Tracker,
which resides on every data node executing its part of the job.
• The Task Tracker's responsibility is to send progress reports to the Job
Tracker.
• In addition, the Task Tracker periodically sends a 'heartbeat' signal to the
Job Tracker so as to notify it of the current state of the system.
• Thus the Job Tracker keeps track of the overall progress of each job. In the
event of task failure, the Job Tracker can reschedule it on a different Task
Tracker.




Architectural Components
Functional components:
• A job (the complete work) is submitted to the master; Hadoop divides the job
into two phases, a map phase and a reduce phase, with a small shuffle & sort
phase in between: 1. Map tasks, 2. Reduce tasks.
• Map phase: This is the very first phase in the execution of a MapReduce
program. In this phase, the data in each split is passed to a mapping function
to produce output values. The map takes a key/value pair as input: the key is a
reference to the input, and the value is the data set on which to operate. The
map function applies the business logic to every value in the input and
produces output called intermediate output. The output of the map is stored on
local disk, from where it is shuffled to the reduce nodes.
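A minimal map-phase sketch in Java, using word counting as the business logic:
for every word in an input line it emits the intermediate pair (word, 1). The
class name is a stand-in used only for these notes.

// Illustrative map phase: each input line is tokenized and an intermediate
// (word, 1) pair is emitted for every token.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // intermediate output, written to local disk
    }
  }
}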


Architectural Components
• Reduce phase: In MapReduce, reduce takes intermediate key/value pairs as
input and processes the output of the mapper. The key/value pairs provided to
reduce are sorted by key. Usually, in the reducer, we do an aggregation or
summation sort of computation. A user-defined function supplies the values for
a given key to the reduce function. Reduce produces the final output as a list
of key/value pairs. This final output is stored in HDFS, and replication is
done as usual.
• Shuffling: This phase consumes the output of the mapping phase. Its task is
to consolidate the relevant records from the mapping-phase output.
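A matching reduce-phase sketch: by the time reduce runs, the framework has
already shuffled and sorted the intermediate pairs by key, so the reducer
simply sums the values for each word. Again, the class name is a stand-in.

// Illustrative reduce phase: values for the same key arrive grouped and sorted,
// and are aggregated into a single (word, count) output record stored in HDFS.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();   // summation-style aggregation, as described above
    }
    context.write(word, new IntWritable(sum));
  }
}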


MapReduce Terminologies

• PayLoad − Applications implement the Map and the Reduce functions, and form
the core of the job.
• Mapper − Maps the input key/value pairs to a set of intermediate key/value
pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where data is presented in advance before any processing
takes place.
• MasterNode − Node where the JobTracker runs and which accepts job requests
from clients.
• SlaveNode − Node where the Map and Reduce programs run.
• JobTracker − Schedules jobs and tracks the jobs assigned to the Task Tracker.
• Task Tracker − Tracks the task and reports status to the JobTracker.
• Job − A program that is an execution of a Mapper and Reducer across a
dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.
Unit Tests with MRUnit
MRUnit is a unit-test library designed to facilitate easy integration between
your MapReduce development process and standard development and testing tools
such as JUnit. With MRUnit, there are no test files to create, no configuration
parameters to change, and generally less test code.
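A minimal MRUnit sketch, assuming the word-count mapper from the earlier
example and JUnit on the test classpath: the mapper is exercised entirely in
memory, with no cluster and no test files.

// Illustrative MRUnit test: feeds one record to the mapper in memory and
// checks the expected intermediate output.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    mapDriver = MapDriver.newMapDriver(new WordCountMapper());
  }

  @Test
  public void emitsOnePairPerWord() throws Exception {
    mapDriver.withInput(new LongWritable(0), new Text("big data"))
             .withOutput(new Text("big"), new IntWritable(1))
             .withOutput(new Text("data"), new IntWritable(1))
             .runTest();
  }
}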


Manual Testing
How to perform manual testing:
• Analyze requirements from the software requirement specification document.
• Create a clear test plan.
• Write test cases that cover all the requirements defined in the document.
• Get the test cases reviewed by the QA lead.
• Execute the test cases and detect any bugs.


MapReduce Features

Features of MapReduce are as follows:
• Scalability: Apache Hadoop is a highly scalable framework because of its
ability to store and distribute huge data across plenty of servers.
• Flexibility: MapReduce programming enables companies to access new sources of
data and to operate on different types of data.
• Security and authentication: The MapReduce programming model uses the HBase
and HDFS security platform that allows only authenticated users to operate on
the data.
• Cost-effective solution: Hadoop's scalable architecture with the MapReduce
programming framework allows the storage and processing of large data sets in a
very affordable manner.
• Fast: Even when dealing with large volumes of unstructured data, Hadoop
MapReduce takes just minutes to process terabytes of data, and it can process
petabytes of data in just an hour.
• A simple programming model: One of the most important features is that it is
based on a simple programming model.
• Parallel programming: It divides the tasks in a manner that allows their
execution in parallel; parallel processing allows multiple processors to
execute these divided tasks.
• Availability: If any particular node suffers a failure, there are always
other copies present on other nodes that can still be accessed whenever needed.
• Resilient nature: One of the major features offered by Apache Hadoop is its
fault tolerance; the Hadoop MapReduce framework has the ability to quickly
recognize faults that occur.
MapReduce Types
MapReduce types:
• Hadoop uses the MapReduce programming model for data processing, with the
input and output of the map and reduce functions represented as key-value
pairs.
• They are subject to parallel execution on datasets situated across a wide
array of machines in a distributed architecture.
• The programming paradigm is essentially functional in nature, combining the
techniques of map and reduce.


MapReduce Types
MapReduce types:
• Mapping is the core technique of processing a list of data elements that come
in pairs of keys and values.
• The map function applies to individual elements defined as key-value pairs of
a list and produces a new list.
• The general idea of the map and reduce functions of Hadoop can be illustrated
as follows:
• map: (K1, V1) -> list(K2, V2)
• reduce: (K2, list(V2)) -> list(K3, V3)
MapReduce Types
• The input parameters of the key and value pair, represented by K1 and V1
respectively, are different from the output pair type, K2 and V2.
• The reduce function accepts the same format as the output of the map, but the
output type of the reduce operation is again different: K3 and V3.
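In the Java API these four type parameters appear directly in the generic class
declarations. A sketch, using the word-count types as concrete stand-ins for
K1 through V3; the class names are hypothetical.

// The (K1, V1) -> list(K2, V2) and (K2, list(V2)) -> list(K3, V3) signatures map
// directly onto the generic parameters of the Java Mapper and Reducer classes.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// K1 = LongWritable (byte offset), V1 = Text (line),
// K2 = Text (word),                V2 = IntWritable (count)
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// K2 = Text, V2 = IntWritable, K3 = Text, V3 = IntWritable
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }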


Advantages of MapReduce
There are many advantages of learning this technology. MapReduce is a very
simplified way of working with extremely large volumes of data. The best part
is that the entire MapReduce process is written in the Java language, which is
a very common language in the software developer community, so it can help you
in your career by helping you upgrade from a Java career to a Hadoop career and
stand out from the crowd. You will have a head start when it comes to working
on the Hadoop platform if you are able to write MapReduce programs. Some of the
biggest enterprises on earth deploy Hadoop on previously unheard-of scales, and
things can only get better for the companies deploying Hadoop. Companies like
Amazon, Facebook, Google, Microsoft, Yahoo, General Electric and IBM run
massive Hadoop clusters in order to parse their inordinate amounts of data. So,
as a forward-thinking IT professional, this technology can help you leapfrog
your competitors and take your career to an altogether new level.
