KCS-061 Unit 2

Lecture 09-16

Big Data (KCS-061)

Hadoop
History of Hadoop, Apache Hadoop, the Hadoop Distributed File System,
components of Hadoop, data format, analyzing data with Hadoop, scaling out,
Hadoop Streaming, Hadoop Pipes, Hadoop Ecosystem
Hadoop: Definition

Hadoop is an open-source software framework used for storing and processing
Big Data in a distributed manner on large clusters of commodity hardware. It is
an Apache Software Foundation (ASF) project, released under the Apache License.


Figure: Technicians working on a large Linux cluster at the Chemnitz University
of Technology, Germany.


What is the Need of Hadoop?
Use case:
Problem:
An e-commerce site XYZ (having 100 million users) wants to offer a gift voucher
of $100 to its top 10 customers who have spent the most in the previous year.
Moreover, it wants to find the buying trend of these customers so that the
company can suggest more related items to them.


Use Case
Issues:
A huge amount of unstructured data needs to be stored, processed and analyzed.
Solution:
Apache Hadoop is not just a storage system; it is a platform for both data
storage and data processing.


Use Case
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop
Distributed File System), which forms clusters of commodity hardware and stores
the data in a distributed fashion. It works on the "write once, read many
times" principle.
Processing: The MapReduce paradigm is applied to the data distributed over the
network to compute the required output.
Analyze: Pig and Hive can be used to analyze the data.
Cost: Hadoop is open source, so cost is no longer an issue.
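As a rough illustration of how the "top spenders" computation maps onto Hadoop,
here is a minimal Java MapReduce sketch that totals spend per customer. It
assumes, purely for illustration, that each input line looks like
"customerId,amount"; all class names here are hypothetical.

// Illustrative sketch: per-customer spend aggregation with Hadoop MapReduce.
// Input lines are assumed to look like "customerId,amount".
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CustomerSpend {

  // Map: one input line -> (customerId, amount)
  public static class SpendMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 2) {
        context.write(new Text(fields[0]),
                      new DoubleWritable(Double.parseDouble(fields[1])));
      }
    }
  }

  // Reduce: (customerId, [amounts]) -> (customerId, total spend)
  public static class SpendReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double total = 0;
      for (DoubleWritable v : values) {
        total += v.get();
      }
      context.write(key, new DoubleWritable(total));
    }
  }
}

The resulting per-customer totals could then be sorted in a small follow-up step
(or a second job) to pick out the top 10 customers.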


Era of Big Data
For the past two decades, we have been living in an era of data explosion. The
amount of data created in traditional business, such as orders and warehousing,
has increased relatively slowly, and its proportion in the total amount of data
has gradually decreased.

Instead, massive amounts of human data and machine data (logs, IoT devices,
etc.) have been collected and stored in quantities far exceeding traditional
business data. A huge technology gap exists between these massive amounts of
data and human capabilities, which has spawned various big data technologies.
In this context, what we call the era of big data has come into being.
Data Explosion


Hadoop at a Glance
• Hadoop is written in the Java programming language and ranks among the
top-level Apache projects.
• Hadoop was developed by Doug Cutting and Mike Cafarella.
• Inspired by Google's work, Hadoop builds on technologies such as the
MapReduce programming model and the Google File System (GFS).
• It is optimized to handle massive quantities of data, which may be
structured, unstructured or semi-structured, using commodity hardware, that is,
relatively inexpensive computers.
• It is designed to scale from a single server to thousands of machines, each
offering local computation and storage, and supports large data sets in a
distributed computing environment.


History of Hadoop
• 2012: Apache Hadoop 1.0 released.
• 2013: Apache Hadoop 2.2 released.
• 2014: Apache Hadoop 2.6 released.
• 2015: Apache Hadoop 2.7 released.
• 2017: Apache Hadoop 3.0 released.
• 2018: Apache Hadoop 3.1 released.


Apache Hadoop
Apache Hadoop (/həˈduːp/) is a collection of open-source software utilities
that facilitates using a network of many computers to solve problems involving
massive amounts of data and computation. It provides a software framework for
distributed storage and processing of big data using the MapReduce programming
model.
Hadoop was originally designed for computer clusters built from commodity
hardware, which is still the common use.


Apache Hadoop
The core of Apache Hadoop consists of a storage part, known as the Hadoop
Distributed File System (HDFS), and a processing part, which is the MapReduce
programming model.

Hadoop splits files into large blocks and distributes them across nodes in a
cluster. It then transfers packaged code to the nodes to process the data in
parallel.


Hadoop Architecture
Apache Hadoop offers a scalable, flexible and reliable distributed-computing
big data framework for a cluster of systems with storage capacity and local
computing power, by leveraging commodity hardware.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes. In short, the Hadoop framework
makes it possible to develop applications that run on clusters of computers and
perform complete statistical analysis of huge amounts of data.
Hadoop follows a master-slave architecture for the transformation and analysis
of large datasets using the Hadoop MapReduce paradigm.


Hadoop Architecture
The three important Hadoop core components that play a vital role in the Hadoop
architecture are:
1. Hadoop Distributed File System (HDFS)
2. Hadoop MapReduce
3. Yet Another Resource Negotiator (YARN)




Hadoop Distributed File System (HDFS)
• HDFS runs on top of the existing file systems on each node in a Hadoop
cluster.
• HDFS is a block-structured file system in which each file is divided into
blocks of a pre-determined size.
• Data in a Hadoop cluster is broken down into these smaller units (blocks) and
distributed throughout the cluster. By default, each block is duplicated twice
(for a total of three copies), with the two replicas stored on two different
nodes in a rack elsewhere in the cluster.


Hadoop Distributed File System (HDFS)
• Since the data has a default replication factor of three, it is highly
available and fault-tolerant.
• If a copy is lost (because of machine failure, for example), HDFS
automatically re-replicates it elsewhere in the cluster, ensuring that the
threefold replication factor is maintained.
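For illustration, the replication factor described above can also be inspected
and adjusted per file through the HDFS Java client API. This is only a sketch:
the file path is hypothetical and the cluster configuration files are assumed
to be on the classpath.

// Illustrative sketch of the HDFS client API; path and replication value are examples only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // handle to the distributed file system
    Path file = new Path("/data/orders.csv");      // hypothetical file

    // Ask HDFS to keep 3 replicas of every block of this file (the usual default).
    fs.setReplication(file, (short) 3);

    // Replication factor reported by the NameNode for this file.
    System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
    fs.close();
  }
}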




Hadoop MapReduce
• This is the framework for parallel processing of large data sets.
• The MapReduce framework consists of a single master node (JobTracker) and n
slave nodes (TaskTrackers), where n can be in the thousands. The master
manages, maintains and monitors the slaves, while the slaves are the actual
worker nodes.
• The client submits a job to Hadoop. A job consists of a mapper, a reducer and
a list of inputs. The job is sent to the JobTracker process on the master node,
and each slave node runs a TaskTracker process.
• The master is responsible for resource management, tracking resource
consumption/availability and scheduling the job's component tasks on the
slaves, monitoring them and re-executing the failed tasks.
Hadoop MapReduce
• The slave TaskTrackers execute the tasks as directed by the master and
provide task-status information to the master periodically.
• The master stores the metadata (data about data), while the slaves are the
nodes that store the data. The client connects to the master node to perform
any task.

• Hadoop YARN: YARN (Yet Another Resource Negotiator) is the framework
responsible for assigning computational resources for application execution and
for cluster management. YARN consists of three core components:
• ResourceManager (one per cluster)
• ApplicationMaster (one per application)
• NodeManager (one per node)




File Formats in Hadoop
Input file formats in Hadoop:
1. Text/CSV Files
2. JSON Records
3. Avro Files
4. Sequence Files
5. RC Files
6. ORC Files
7. Parquet Files


File Formats in Hadoop
• Text/CSV (comma-separated values) files: Text and CSV files are quite common,
and Hadoop developers and data scientists frequently receive text and CSV files
to work on.
• JSON (JavaScript Object Notation) records: In JSON record files, each line is
its own JSON datum. Metadata is stored along with the data and the file is
splittable, but it does not support block compression.
• Avro files include markers that can be used to split large data sets into
subsets suitable for MapReduce processing.
• Sequence files store data in binary format and have a structure similar to
CSV files, with some differences. They do not store metadata, so the only
schema-evolution option is to append new fields, but they do support block
compression.
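As a small illustration of the binary key/value layout of sequence files, here
is a sketch that writes a few records with the Hadoop SequenceFile API; the
output path is an assumption and compression options are left out.

// Illustrative sketch: writing binary key/value records to a SequenceFile.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/data/example.seq");   // hypothetical output path

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(IntWritable.class))) {
      // Each append() writes one key/value record in binary form.
      writer.append(new Text("record-1"), new IntWritable(100));
      writer.append(new Text("record-2"), new IntWritable(200));
    }
  }
}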


File Formats in Hadoop
• RCFile (Record Columnar File) is a data placement structure that determines
how to store relational tables on computer clusters. It is designed for systems
using the MapReduce framework.
• Optimized Row Columnar (ORC) is an open-source columnar storage file format,
originally released in early 2013 for Hadoop workloads. ORC provides a highly
efficient way to store Apache Hive data, though it can store other data as
well.
• Parquet is an open-source file format built to handle flat columnar storage
data formats. Parquet works well with complex data in large volumes. It is
known for both its performant data compression and its ability to handle a wide
variety of encoding types.


Analysing Data with Hadoop
• The Hive and Pig projects are popular choices that provide SQL-like and
procedural data-flow-like languages, respectively.
• HBase is also a popular way to store and analyze data in HDFS. It is a
column-oriented database and, unlike MapReduce, provides random read and write
access to data with low latency.
• MapReduce jobs can read and write data in HBase's table format, but data
processing is often done via HBase's own client API.
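To illustrate the low-latency random access mentioned above, here is a minimal
sketch using the HBase client API; the table name "customers", the column
family "info", and the row key are all hypothetical.

// Illustrative sketch of random read/write with the HBase client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("customers"))) {

      // Low-latency write of a single cell.
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("spend"), Bytes.toBytes("1250.75"));
      table.put(put);

      // Low-latency random read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("user-42")));
      byte[] spend = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("spend"));
      System.out.println("spend = " + Bytes.toString(spend));
    }
  }
}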


Scaling Out
Scaling Up vs. Scaling Out:
Once a decision has been made for data scaling, the specific scaling approach
must be chosen. There are two commonly used types of data scaling:
• Up
• Out


Scaling Up, or Vertical Scaling
• It involves obtaining a faster server with more powerful processors and more
memory.
• This solution uses less network hardware and consumes less power, but for
many platforms it may ultimately provide only a short-term fix, especially if
continued growth is expected.


Scaling Out, or Horizontal Scaling
• It involves adding servers for parallel computing.
• The scale-out technique is a long-term solution, as more and more servers may
be added when needed.
• Going from one monolithic system to this type of cluster may be difficult,
although it is an extremely effective solution.
Hadoop Streaming
Hadoop Streaming is a utility that comes with the Hadoop distribution. This
utility allows you to create and run MapReduce jobs with any executable or
script as the mapper and/or the reducer.

It is a utility, or feature, that comes with a Hadoop distribution and allows
developers or programmers to write MapReduce programs in different programming
languages like Ruby, Perl, Python, C++, etc. We can use any language that can
read from standard input (STDIN), such as keyboard input, and write to standard
output (STDOUT).
The Hadoop framework itself is completely written in Java, but programs for
Hadoop do not necessarily need to be coded in Java. The Hadoop Streaming
feature has been available since Hadoop version 0.14.1.
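Streaming mappers and reducers are normally written in scripting languages such
as Python, but any executable that reads lines from STDIN and writes
tab-separated key/value pairs to STDOUT will work. Below is a minimal
word-count mapper sketch, written in Java only to stay consistent with the
other examples in these notes; such a program would be handed to the
hadoop-streaming utility via its -mapper option.

// Illustrative sketch of a Hadoop Streaming mapper: reads lines from STDIN and
// writes "word<TAB>1" pairs to STDOUT. Any language could be used instead.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class StreamingWordCountMapper {
  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      for (String word : line.trim().split("\\s+")) {
        if (!word.isEmpty()) {
          System.out.println(word + "\t1");   // key<TAB>value, one pair per line
        }
      }
    }
  }
}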


How Hadoop Streaming Works (figure)


Hadoop Pipes
• Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
• Unlike Streaming, which uses standard input and output to communicate with
the map and reduce code, Pipes uses sockets as the channel over which the
TaskTracker communicates with the process running the C++ map or reduce
function.


Streaming vs. Pipes (figure)


Hadoop Ecosystem
• The Hadoop ecosystem is neither a programming language nor a service; it is a
platform or framework that solves big data problems. You can consider it a
suite that encompasses a number of services (ingesting, storing, analyzing and
maintaining data). Let us discuss and get a brief idea of how these services
work individually and in collaboration.
• The Hadoop ecosystem provides the furnishings that turn the framework into a
comfortable home for big data activity that reflects your specific needs and
tastes.
• The Hadoop ecosystem includes both official Apache open-source projects and a
wide range of commercial tools and solutions.
Hadoop Ecosystem
Below are the Hadoop components that together form the Hadoop ecosystem:
✓ HDFS -> Hadoop Distributed File System
✓ YARN -> Yet Another Resource Negotiator
✓ MapReduce -> Data processing using programming
✓ Spark -> In-memory data processing
✓ Pig, Hive -> Data processing services using query (SQL-like)
✓ HBase -> NoSQL database
✓ Mahout, Spark MLlib -> Machine learning
✓ Apache Drill -> SQL on Hadoop
✓ Zookeeper -> Managing cluster
✓ Oozie -> Job scheduling
✓ Flume, Sqoop -> Data ingesting services
✓ Solr & Lucene -> Searching & indexing


Hadoop Distributions
Hadoop is an open-source, catch-all technology solution with incredible
scalability, low-cost storage and fast-paced big data analytics on economical
servers.
Hadoop vendor distributions overcome the drawbacks and issues of the
open-source edition of Hadoop. These distributions add functionality that
focuses on:
• Support: Most Hadoop vendors provide technical guidance and assistance that
makes it easy for customers to adopt Hadoop for enterprise-level tasks and
mission-critical applications.
• Reliability: Hadoop vendors respond promptly whenever a bug is detected. With
the intent of making commercial solutions more stable, patches and fixes are
deployed immediately.


Hadoop Distributions
• Completeness: Hadoop vendors couple their distributions with various other
add-on tools which help customers customize the Hadoop application to address
their specific tasks.
• Fault tolerance: Since the data has a default replication factor of three, it
is highly available and fault-tolerant.


Top Hadoop Vendors
Here is a list of top Hadoop vendors who play a key role in big data market
growth:
➢ Amazon Elastic MapReduce
➢ Cloudera CDH Hadoop Distribution
➢ Hortonworks Data Platform (HDP)
➢ MapR Hadoop Distribution
➢ IBM Open Platform (IBM InfoSphere BigInsights)
➢ Microsoft Azure HDInsight - Cloud-based Hadoop distribution


Advantages of Hadoop
The increase in the requirement of computing resources has made Hadoop a viable
and extensively used programming framework. Modern-day organizations can learn
Hadoop and leverage that know-how to manage the processing needs of their
businesses.
1. Scalable: Hadoop is a highly scalable storage platform because it can store
and distribute very large data sets across hundreds of inexpensive servers that
operate in parallel.
2. Cost-effective: Hadoop also offers a cost-effective storage solution for
businesses' exploding data sets. The problem with traditional relational
database management systems is that it is extremely cost-prohibitive to scale
to such a degree in order to process such massive volumes of data.


Advantages of Hadoop (contd.)
3. Flexible: Hadoop enables businesses to easily access new data sources and
tap into different types of data (both structured and unstructured) to generate
value from that data.
4. Speed of processing: Hadoop's unique storage method is based on a
distributed file system that basically 'maps' data wherever it is located on a
cluster.
5. Resilient to failure: A key advantage of using Hadoop is its fault
tolerance. When data is sent to an individual node, that data is also
replicated to other nodes in the cluster, which means that in the event of
failure there is another copy available for use.
Lecture 13-16
Big Data (KCS-061)

MapReduce
MapReduce framework and basics, how MapReduce works, developing a MapReduce
application, unit tests with MRUnit, test data and local tests, anatomy of a
MapReduce job run, failures, job scheduling, shuffle and sort, task execution,
MapReduce types, input formats, output formats, MapReduce features, real-world
MapReduce
MapReduce

MapReduce is a programming framework that allows us to perform parallel and
distributed processing on huge data sets in a distributed environment.


How MapReduce Works?

MapReduce is a massively parallel processing framework that can easily be
scaled over large amounts of commodity hardware to meet the increased need for
processing larger amounts of data. Once we get the mapping and reducing tasks
right, all it needs is a change in configuration to make it work on a larger
set of data. This kind of extreme scalability, from a single node to hundreds
and even thousands of nodes, is what makes MapReduce a top favorite among Big
Data professionals worldwide.
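The point that only configuration changes when a job is scaled is visible in a
typical driver: the job definition below runs unchanged whether the cluster has
one node or thousands. This is a minimal word-count driver sketch that uses the
TokenCounterMapper and IntSumReducer classes shipped with Hadoop, so no custom
mapper or reducer is needed; the input and output paths are taken from the
command line.

// Illustrative driver sketch: a complete word-count job wired together with
// Hadoop's built-in TokenCounterMapper and IntSumReducer.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(TokenCounterMapper.class);  // map phase: (offset, line) -> (word, 1)
    job.setReducerClass(IntSumReducer.class);      // reduce phase: (word, [1,1,...]) -> (word, n)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}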




Architectural Components
MapReduce is implemented through the following components:
1. Architectural components
   a) Job Tracker
   b) Task Trackers
2. Functional components
   a) Mapper (Map)
   b) Combiner (Shuffler)
   c) Reducer (Reduce)


Architectural Components
Architectural components:
➢ The complete execution process (execution of both Map and Reduce tasks) is
controlled by two types of entities:
1. A Job Tracker: acts like a master (responsible for complete execution of the
submitted job).
2. Multiple Task Trackers: act like slaves, each of them performing part of the
job.
➢ For every job submitted for execution in the system, there is one Job
Tracker, which resides on the Name Node, and there are multiple Task Trackers,
which reside on the Data Nodes.
 A job is divided into multiple tasks, which are then run on multiple data
nodes in a cluster.
 It is the responsibility of the Job Tracker to coordinate the activity by
scheduling tasks to run on different data nodes.




Architectural Components
• Execution of an individual task is then looked after by the Task Tracker,
which resides on every data node executing its part of the job.
• The Task Tracker's responsibility is to send progress reports to the Job
Tracker.
• In addition, the Task Tracker periodically sends a 'heartbeat' signal to the
Job Tracker so as to notify it of the current state of the system.
• Thus the Job Tracker keeps track of the overall progress of each job. In the
event of task failure, the Job Tracker can reschedule it on a different Task
Tracker.




Architectural Components
Functional components:
• A job (the complete work) is submitted to the master; Hadoop divides the job
into two phases, a map phase and a reduce phase, with a small shuffle & sort
phase in between: 1. Map tasks, 2. Reduce tasks.
• Map phase: This is the very first phase in the execution of a MapReduce
program. In this phase, the data in each split is passed to a mapping function
to produce output values. The map takes a key/value pair as input: the key is a
reference to the input, and the value is the data set on which to operate. The
map function applies the business logic to every value in the input and
produces output called intermediate output. The output of the map is stored on
local disk, from where it is shuffled to the reduce nodes.
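A minimal map-phase sketch in Java, using word counting as the business logic:
for every word in an input line it emits the intermediate pair (word, 1). The
class name is a stand-in used only for these notes.

// Illustrative map phase: each input line is tokenized and an intermediate
// (word, 1) pair is emitted for every token.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // intermediate output, written to local disk
    }
  }
}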


Architectural Components
• Reduce phase: In MapReduce, reduce takes intermediate key/value pairs as
input and processes the output of the mapper. The key/value pairs provided to
reduce are sorted by key. Usually, in the reducer, we do an aggregation or
summation sort of computation. A user-defined function supplies the values for
a given key to the reduce function. Reduce produces the final output as a list
of key/value pairs. This final output is stored in HDFS, and replication is
done as usual.
• Shuffling: This phase consumes the output of the mapping phase. Its task is
to consolidate the relevant records from the mapping-phase output.
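A matching reduce-phase sketch: by the time reduce runs, the framework has
already shuffled and sorted the intermediate pairs by key, so the reducer
simply sums the values for each word. Again, the class name is a stand-in.

// Illustrative reduce phase: values for the same key arrive grouped and sorted,
// and are aggregated into a single (word, count) output record stored in HDFS.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();   // summation-style aggregation, as described above
    }
    context.write(word, new IntWritable(sum));
  }
}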


MapReduce Terminologies

• PayLoad − Applications implement the Map and the Reduce functions, and form
the core of the job.
• Mapper − Maps the input key/value pairs to a set of intermediate key/value
pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where data is presented in advance before any processing
takes place.
• MasterNode − Node where the JobTracker runs and which accepts job requests
from clients.
• SlaveNode − Node where the Map and Reduce programs run.
• JobTracker − Schedules jobs and tracks the jobs assigned to the Task Tracker.
• Task Tracker − Tracks the task and reports status to the JobTracker.
• Job − A program that is an execution of a Mapper and Reducer across a
dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.
Unit Tests with MRUnit
MRUnit is a unit-test library designed to facilitate easy integration between
your MapReduce development process and standard development and testing tools
such as JUnit. With MRUnit, there are no test files to create, no configuration
parameters to change, and generally less test code.
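A minimal MRUnit sketch, assuming the word-count mapper from the earlier
example and JUnit on the test classpath: the mapper is exercised entirely in
memory, with no cluster and no test files.

// Illustrative MRUnit test: feeds one record to the mapper in memory and
// checks the expected intermediate output.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    mapDriver = MapDriver.newMapDriver(new WordCountMapper());
  }

  @Test
  public void emitsOnePairPerWord() throws Exception {
    mapDriver.withInput(new LongWritable(0), new Text("big data"))
             .withOutput(new Text("big"), new IntWritable(1))
             .withOutput(new Text("data"), new IntWritable(1))
             .runTest();
  }
}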


Manual Testing
How to perform manual testing:
• Analyze requirements from the software requirement specification document.
• Create a clear test plan.
• Write test cases that cover all the requirements defined in the document.
• Get the test cases reviewed by the QA lead.
• Execute the test cases and detect any bugs.


MapReduce Features

Features of MapReduce are as follows:
• Scalability: Apache Hadoop is a highly scalable framework because of its
ability to store and distribute huge data across plenty of servers.
• Flexibility: MapReduce programming enables companies to access new sources of
data and to operate on different types of data.
• Security and authentication: The MapReduce programming model uses the HBase
and HDFS security platform that allows only authenticated users to operate on
the data.
• Cost-effective solution: Hadoop's scalable architecture with the MapReduce
programming framework allows the storage and processing of large data sets in a
very affordable manner.
• Fast: Even when dealing with large volumes of unstructured data, Hadoop
MapReduce takes just minutes to process terabytes of data, and it can process
petabytes of data in just an hour.
• A simple programming model: One of the most important features is that it is
based on a simple programming model.
• Parallel programming: It divides the tasks in a manner that allows their
execution in parallel; parallel processing allows multiple processors to
execute these divided tasks.
• Availability: If any particular node suffers a failure, there are always
other copies present on other nodes that can still be accessed whenever needed.
• Resilient nature: One of the major features offered by Apache Hadoop is its
fault tolerance; the Hadoop MapReduce framework has the ability to quickly
recognize faults that occur.
MapReduce Types
MapReduce types:
• Hadoop uses the MapReduce programming model for data processing, with the
input and output of the map and reduce functions represented as key-value
pairs.
• They are subject to parallel execution on datasets situated across a wide
array of machines in a distributed architecture.
• The programming paradigm is essentially functional in nature, combining the
techniques of map and reduce.


MapReduce Types
MapReduce types:
• Mapping is the core technique of processing a list of data elements that come
in pairs of keys and values.
• The map function applies to individual elements defined as key-value pairs of
a list and produces a new list.
• The general idea of the map and reduce functions of Hadoop can be illustrated
as follows:
• map: (K1, V1) -> list(K2, V2)
• reduce: (K2, list(V2)) -> list(K3, V3)
MapReduce Types
• The input parameters of the key and value pair, represented by K1 and V1
respectively, are different from the output pair type, K2 and V2.
• The reduce function accepts the same format as the output of the map, but the
output type of the reduce operation is again different: K3 and V3.
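In the Java API these four type parameters appear directly in the generic class
declarations. A sketch, using the word-count types as concrete stand-ins for
K1 through V3; the class names are hypothetical.

// The (K1, V1) -> list(K2, V2) and (K2, list(V2)) -> list(K3, V3) signatures map
// directly onto the generic parameters of the Java Mapper and Reducer classes.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// K1 = LongWritable (byte offset), V1 = Text (line),
// K2 = Text (word),                V2 = IntWritable (count)
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// K2 = Text, V2 = IntWritable, K3 = Text, V3 = IntWritable
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }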


Advantages of MapReduce
There are many advantages of learning this technology. MapReduce is a very
simplified way of working with extremely large volumes of data. The best part
is that the entire MapReduce process is written in the Java language, which is
a very common language in the software developer community, so it can help you
in your career by helping you upgrade from a Java career to a Hadoop career and
stand out from the crowd. You will have a head start when it comes to working
on the Hadoop platform if you are able to write MapReduce programs. Some of the
biggest enterprises on earth deploy Hadoop on previously unheard-of scales, and
things can only get better for the companies deploying Hadoop. Companies like
Amazon, Facebook, Google, Microsoft, Yahoo, General Electric and IBM run
massive Hadoop clusters in order to parse their inordinate amounts of data. So,
as a forward-thinking IT professional, this technology can help you leapfrog
your competitors and take your career to an altogether new level.
