Big Data
In Hadoop, MapReduce works by breaking the data processing into two phases: the Map
phase and the Reduce phase. Map is the first phase of processing, where we specify
all the complex logic, business rules, and costly code. Reduce is the second phase of
processing, where we specify lightweight processing such as aggregation and summation.
You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It's
very short, but it conceals a great deal of processing behind the scenes. At the highest
level, there are four independent entities:
1. The client, which submits the MapReduce job.
2. The jobtracker, which coordinates the job run. The jobtracker is a Java application
whose main class is JobTracker.
3. The tasktrackers, which run the tasks that the job has been split into. Tasktrackers
are Java applications whose main class is TaskTracker.
4. The distributed file system, which is used for sharing job files between the other
entities.
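For illustration, a minimal driver for the classic (org.apache.hadoop.mapred) API might look like the sketch below. The class name WordCountDriver and the command-line input/output paths are illustrative assumptions; the mapper and reducer are the old API's built-in TokenCountMapper and LongSumReducer library classes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

// Illustrative client: configures a word-count job and submits it with the
// single JobClient.runJob(conf) call mentioned above.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("word count");

        // Input and output paths are taken from the command line.
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Library mapper/reducer: tokenize each line, then sum counts per token.
        conf.setMapperClass(TokenCountMapper.class);
        conf.setReducerClass(LongSumReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // The single call that submits the job and blocks until it completes.
        JobClient.runJob(conf);
    }
}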
The MapReduce model is to break jobs into tasks and run the tasks in parallel to make
the overall job execution time smaller than it would otherwise be if the tasks ran
sequentially. This makes job execution time sensitive to slow-running tasks, as it takes
only one slow task to make the whole job take significantly longer than it would have
done otherwise. When a job consists of hundreds or thousands of tasks, the possibility
of a few straggling tasks is very real. Tasks may be slow for various reasons, including
hardware degradation or software mis-configuration, but the causes may be hard to
detect since the tasks still complete successfully, albeit after a longer time than
expected. Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to
detect when a task is running slower than expected and launches another, equivalent
task as a backup. This is termed speculative execution of tasks.

Hadoop runs tasks in
their own Java Virtual Machine to isolate them from other running tasks. The
overhead of starting a new JVM for each task can take around a second, which for jobs
that run for a minute or so is insignificant. However, jobs that have a large number of
very short-lived tasks (these are usually map tasks), or that have lengthy initialization,
can see performance gains when the JVM is reused for subsequent tasks. With task
JVM reuse enabled, tasks do not run concurrently in a single JVM. The JVM runs
tasks sequentially. Tasktrackers can, however, run more than one task at a time, but
this is always done in separate JVMs.
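As a rough sketch, and assuming the classic (pre-YARN) JobConf API, speculative execution and JVM reuse can be controlled per job as shown below; the setter names are from the old API and should be checked against the Hadoop version in use.

import org.apache.hadoop.mapred.JobConf;

public class TaskTuning {
    // Sketch: tune speculative execution and JVM reuse on a classic JobConf.
    public static JobConf tune(JobConf conf) {
        // Turn speculative execution off for map and reduce tasks (it is on by default).
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);

        // Reuse the task JVM for an unlimited number of tasks from the same job
        // (-1 means no limit; the default of 1 means a fresh JVM per task).
        conf.setNumTasksToExecutePerJvm(-1);
        return conf;
    }
}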
How MapReduce Works
• Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
• All data, whether structured or unstructured, needs to be translated into key-value
pairs before it is passed through the MapReduce model.
• Generally, the MapReduce paradigm is based on sending the computation to where the
data resides. The MapReduce model, as the name suggests, has two different
functions: the Map function and the Reduce function. The order of operation is always
Map | Shuffle | Reduce.
• The reduce task takes the output from a map as an input and combines those data
tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the reduce task is always
performed after the map job.
• Under the MapReduce model, the data processing primitives are called mappers
and reducers; a minimal mapper/reducer sketch follows this list.
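A minimal mapper and reducer sketch, written against the classic org.apache.hadoop.mapred API with assumed class names WordCountMapper and WordCountReducer, shows how each line of input is broken into (word, 1) tuples and how those tuples are combined into a smaller set of (word, count) tuples.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: turns each input line into (word, 1) key/value tuples.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE); // emit one tuple per word occurrence
        }
    }
}

// Reducer: combines all tuples for a word into a single (word, count) tuple.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}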
What is Hadoop?
9. Higher Salaries
In the current scenario, there is a gap between the demand for and supply of Big Data
professionals, and this gap is widening every day. In the wake of this scarcity of Hadoop
professionals, organizations are ready to offer big packages for Hadoop skills. There is
always a compelling requirement for skilled people who can think from a business point
of view. They are the people who understand data and can produce insights from that
data. For this reason, technical people with analytics skills find themselves in great demand.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatically distributes the data and work across
the machines and, in turn, utilizes the underlying parallelism of the CPU
cores.
Hadoop does not rely on hardware to provide fault tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to detect
and handle failures at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop
continues to operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java based.
HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System was developed using distributed file system design. It runs
on commodity hardware. Unlike other distributed systems, HDFS is highly fault
tolerant and designed using low-cost hardware. HDFS holds a very large amount of
data and provides easy access. To store such huge data, files are stored across
multiple machines. These files are stored in a redundant fashion to rescue the system
from possible data loss in case of failure. HDFS also makes applications
available for parallel processing. The features of HDFS are:
2. Data Node
The data node is commodity hardware with the GNU/Linux operating system
and data node software. For every node (commodity hardware/system) in a
cluster, there will be a data node. These nodes manage the data storage of their
system. Data nodes perform read-write operations on the file system, as per client
request. They also perform operations such as block creation, deletion, and
replication according to the instructions of the name node.
3. Block
Generally, the user data is stored in the files of HDFS. A file in the file system
is divided into one or more segments and/or stored in individual data nodes.
These file segments are called blocks. In other words, the minimum amount of
data that HDFS can read or write is called a block. The default block size is 128
MB, but it can be changed as needed in the HDFS configuration.
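As a small sketch of how a client writes data into HDFS, the example below assumes a Hadoop 2.x setup, the dfs.blocksize property name, and an illustrative path /user/demo/sample.txt; HDFS splits the written file into blocks and replicates them across data nodes behind the scenes.

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: writing a small file through the HDFS client API.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed Hadoop 2.x property; 128 MB is the default block size.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);          // connect to the default file system
        Path path = new Path("/user/demo/sample.txt"); // illustrative path
        try (OutputStream out = fs.create(path)) {
            out.write("hello hdfs".getBytes("UTF-8"));
        }
        fs.close();
    }
}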
What is the surprise number?
Consider a data stream of elements drawn from a universal set. Let m_i be the number of
occurrences of the i-th element, for any i. Then the k-th moment of the stream is the
sum over all i of (m_i)^k. The 1st moment is the sum of the m_i's; that is, the length
of the stream. The 2nd moment is the sum of the squares of the m_i's. It is also called
the "surprise number".
Suppose we have a stream of length 100 in which eleven different elements
appear. The most even distribution of these eleven elements would have one
appearing 10 times and the other ten appearing 9 times each. In this case, the
surprise number is 10^2 + 10 × 9^2 = 100 + 810 = 910.
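The same calculation is easy to express in code; the short in-memory "stream" below is purely illustrative.

import java.util.HashMap;
import java.util.Map;

// Sketch: computing the 1st and 2nd moments of a small in-memory "stream".
// The 1st moment is the stream length; the 2nd moment is the surprise number.
public class SurpriseNumber {
    public static void main(String[] args) {
        String[] stream = {"a", "b", "a", "c", "a", "b"};

        // Count the occurrences m_i of each distinct element i.
        Map<String, Long> counts = new HashMap<>();
        for (String x : stream) {
            counts.merge(x, 1L, Long::sum);
        }

        long firstMoment = 0;   // sum of the m_i   = length of the stream
        long secondMoment = 0;  // sum of the m_i^2 = surprise number
        for (long m : counts.values()) {
            firstMoment += m;
            secondMoment += m * m;
        }
        System.out.println("1st moment = " + firstMoment);  // 6
        System.out.println("2nd moment = " + secondMoment); // 3^2 + 2^2 + 1^2 = 14
    }
}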