
Hadoop - Handling BigDATA

===============================

In this workshop, we describe Hadoop, a software framework that enables distributed storage and processing of large datasets using simple, high-level programming models. We cover the most important concepts of Hadoop, describe its architecture, and work through some hands-on examples using this framework.

Hadoop is an open-source project under the Apache software license that can be installed on a standard set of desktop computers so that these computers can communicate among themselves and work together to store and process massive data. Because Hadoop scales well and provides many fault-tolerance mechanisms, it is not necessary to purchase expensive high-end servers to reduce the risk of hardware failure or to increase storage capacity and processing power.

CHALLENGES
--------------

Hadoop handles the challenges of storing and processing large data through the following characteristics:

1. Distribution: The storage and processing are spread across a cluster of smaller machines that communicate and work together on a specific task.

2. Scalability: New machines can be added to the Hadoop cluster, and every new addition increases the storage and processing power of the cluster.

3. Fault-tolerance: The ability to continue the underlying process even when a component (hardware or software) fails.

4. Optimization: Costs are reduced by running on standard commodity hardware; Hadoop does not require expensive servers.

5. Abstraction: Hadoop handles all the messy details related to distributed computing.

6. Data Locality: Instead of moving data to where the application is running, the application is run where the data is already present.

MAPREDUCE
----------------

MapReduce is a programming model for implementing parallel, distributed algorithms.
We describe below the basic steps of applying the MapReduce model to process massive
data; a minimal code sketch follows the list of steps.

STEPS:
----------
Input: This is the input data or file to be processed.

Split: Hadoop splits the incoming data into smaller pieces called "splits".

Map: In this step, MapReduce processes each split according to the logic
defined in the map() function. Each mapper works on one split at a time. Each mapper
is treated as a task, and multiple tasks are executed across different TaskTrackers
and coordinated by the JobTracker.

Combine: This is an optional step used to improve performance by reducing the
amount of data transferred across the network. The combiner typically applies the
same logic as the reduce step and aggregates the output of the map() function before
it is passed to the subsequent steps.

Shuffle & Sort: In this step, the outputs from all the mappers are shuffled,
sorted to put them in order, and grouped by key before being sent to the next step.

Reduce: This step aggregates the outputs of the mappers using the
reduce() function. The output of each reducer is sent to the next and final step.
Each reducer is treated as a task, and multiple tasks are executed across different
TaskTrackers and coordinated by the JobTracker.

Output: Finally, the output of the reduce step is written to a file in the
distributed storage.
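
To make these steps concrete, below is a minimal word-count sketch written against
the org.apache.hadoop.mapreduce Java API. It is an illustrative sketch, not the only
way to write a Hadoop job: the class names are arbitrary, the combiner is optional,
and the input/output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(): emit (word, 1) for every word in the split assigned to this mapper
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(): sum the counts that the shuffle grouped under each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional combine step
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Assuming the class is packaged into a jar (the jar name and HDFS paths below are
placeholders), the job could be launched with something like:

    hadoop jar wordcount.jar WordCount /user/student/input /user/student/output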

PRACTICAL APPLICATIONS
--------------------------

We will describe some examples and illustrate them so that students can try this
framework on their own big data. We will start with a simple word count example,
like the sketch in the previous section, and then illustrate the processing of some
synthetic data in which the columns are separated by a delimiter; a sketch of such a
job is shown below.
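
As a hedged sketch of the second exercise, the code below assumes a hypothetical
three-column layout (id, category, amount) separated by commas and sums the numeric
column per category; the delimiter, field positions, and class names are assumptions
for illustration and should be adapted to the actual synthetic data. It reuses the
imports and driver structure of the word-count sketch, with the output value class
changed to DoubleWritable.

// Additional import needed: org.apache.hadoop.io.DoubleWritable

// map(): parse a delimited line and emit (category, amount)
public static class ColumnSumMapper
        extends Mapper<Object, Text, Text, DoubleWritable> {
    private final Text category = new Text();
    private final DoubleWritable amount = new DoubleWritable();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed layout: id,category,amount -- adjust the delimiter and indexes
        String[] fields = value.toString().split(",");
        if (fields.length >= 3) {
            category.set(fields[1]);
            amount.set(Double.parseDouble(fields[2]));
            context.write(category, amount);
        }
    }
}

// reduce(): sum the amounts that the shuffle grouped under each category
public static class DoubleSumReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private final DoubleWritable total = new DoubleWritable();

    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable v : values) {
            sum += v.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}

In the driver these classes would replace TokenizerMapper and IntSumReducer, with
job.setOutputValueClass(DoubleWritable.class); the reducer can also serve as the
combiner here because summation is associative.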

SUMMARY
-----------

Hadoop is one of the most popular tools for big data processing and offers a high-level
API. Hadoop has been successfully deployed in production by many companies for
several years. In the Hadoop ecosystem, there are many tools available for collecting,
storing, and processing data, as well as for cluster deployment, monitoring, and data
security.
