
Hadoop - Handling BigDATA

===============================

In this workshop, we describe Hadoop, a software framework that enables distributed storage and processing of large datasets using simple, high-level programming models. We cover the most important concepts of Hadoop, describe its architecture, and work through some hands-on examples using this framework.

Hadoop is an open-source project under the Apache software license that can be installed on a standard set of desktop computers so that these computers can communicate among themselves and work together to store and process massive data. Because Hadoop scales well and provides many fault-tolerance mechanisms, it is not necessary to purchase expensive high-end servers to reduce the risk of hardware failure or to increase storage capacity and processing power.

CHALLENGES
--------------

Hadoop handles the challenges of storing and processing large data through the following characteristics:

1. Distribution: The storage and processing are spread across a cluster of smaller machines that communicate and work together on a specific task.

2. Scalability: New machines can be added to the Hadoop cluster, and every new addition increases the storage and processing power of the cluster.

3. Fault-tolerance: The ability to continue the underlying process even when a component (hardware or software) fails.

4. Optimization: Costs are reduced by running on standard commodity hardware; Hadoop does not require expensive servers.

5. Abstraction: Hadoop handles all the messy details related to distributed computing.

6. Data Locality: Instead of moving data to where the application is running, the application is run where the data is already present.

MAPREDUCE
----------------

MapReduce is a programming model for implementing parallel, distributed algorithms.
We describe below the basic steps of applying the MapReduce model to process massive
data; a minimal code sketch follows the list of steps.

STEPS:
----------
Input: This is the input data or file to be processed.

Split: Hadoop splits the incoming data into smaller pieces called "splits".

Map: In this step, MapReduce processes each split according to the logic
defined in the map() function. Each mapper works on one split at a time. Each mapper
is treated as a task, and multiple tasks are executed across different TaskTrackers
and coordinated by the JobTracker.

Combine: This is an optional step used to improve performance by reducing the
amount of data transferred across the network. The combiner typically applies the
same logic as the reduce step and aggregates the output of the map() function before
it is passed to the subsequent steps.

Shuffle & Sort: In this step, the outputs from all the mappers are shuffled,
sorted to put them in order, and grouped by key before being sent to the next step.

Reduce: This step aggregates the outputs of the mappers using the
reduce() function. The output of each reducer is sent to the next and final step.
Each reducer is treated as a task, and multiple tasks are executed across different
TaskTrackers and coordinated by the JobTracker.

Output: Finally, the output of the reduce step is written to a file in the
distributed storage.
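
To make these steps concrete, below is a minimal word-count sketch written against
the org.apache.hadoop.mapreduce Java API. It is an illustrative sketch, not the only
way to write a Hadoop job: the class names are arbitrary, the combiner is optional,
and the input/output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(): emit (word, 1) for every word in the split assigned to this mapper
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(): sum the counts that the shuffle grouped under each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional combine step
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Assuming the class is packaged into a jar (the jar name and HDFS paths below are
placeholders), the job could be launched with something like:

    hadoop jar wordcount.jar WordCount /user/student/input /user/student/output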

PRACTICAL APPLICATIONS
--------------------------

We will describe some examples and illustrate them so that students can try this
framework on their own big data. We will start with a simple word count example,
like the sketch in the previous section, and then illustrate the processing of some
synthetic data in which the columns are separated by a delimiter; a sketch of such a
job is shown below.
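
As a hedged sketch of the second exercise, the code below assumes a hypothetical
three-column layout (id, category, amount) separated by commas and sums the numeric
column per category; the delimiter, field positions, and class names are assumptions
for illustration and should be adapted to the actual synthetic data. It reuses the
imports and driver structure of the word-count sketch, with the output value class
changed to DoubleWritable.

// Additional import needed: org.apache.hadoop.io.DoubleWritable

// map(): parse a delimited line and emit (category, amount)
public static class ColumnSumMapper
        extends Mapper<Object, Text, Text, DoubleWritable> {
    private final Text category = new Text();
    private final DoubleWritable amount = new DoubleWritable();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed layout: id,category,amount -- adjust the delimiter and indexes
        String[] fields = value.toString().split(",");
        if (fields.length >= 3) {
            category.set(fields[1]);
            amount.set(Double.parseDouble(fields[2]));
            context.write(category, amount);
        }
    }
}

// reduce(): sum the amounts that the shuffle grouped under each category
public static class DoubleSumReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private final DoubleWritable total = new DoubleWritable();

    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable v : values) {
            sum += v.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}

In the driver these classes would replace TokenizerMapper and IntSumReducer, with
job.setOutputValueClass(DoubleWritable.class); the reducer can also serve as the
combiner here because summation is associative.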

SUMMARY
-----------

Hadoop is one of the most popular tools for big data processing and offers a high-level
API. Hadoop has been successfully deployed in production by many companies for
several years. In the Hadoop ecosystem, there are many tools available for collecting,
storing, and processing data, as well as for cluster deployment, monitoring, and data
security.
