Hadoop OnePage
===============================
CHALLENGES
--------------
Hadoop handles the challenge of processing and storing large data using the
following characteristics:
1. Scalability: Adding a new machine to a Hadoop cluster increases both the
storage capacity and the processing power of the cluster.
2. Data Locality: Instead of moving data to where the application is running,
Hadoop runs the application where the data already resides.
MAPREDUCE
----------------
STEPS
----------
Input: This is the input data / file to be processed.
Split: Hadoop splits the incoming data into smaller pieces called "splits".
Map: In this step, MapReduce processes each split according to the logic
defined in the map() function. Each mapper works on one split at a time. Each
mapper is treated as a task, and multiple tasks are executed across different
TaskTrackers and coordinated by the JobTracker.
Shuffle & Sort: In this step, the outputs from all the mappers are shuffled,
sorted into order, and grouped by key before being sent to the next step.
Reduce: This step aggregates the grouped mapper outputs using the reduce()
function; the reducers' output forms the final result. Each reducer is treated
as a task, and multiple tasks are executed across different TaskTrackers and
coordinated by the JobTracker.
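The steps above can be sketched in plain Python. This is a minimal simulation of the Split, Map, Shuffle & Sort, and Reduce phases for a word count, not the actual Hadoop API; the input splits are hard-coded for illustration, whereas Hadoop would derive them from HDFS blocks.

```python
from collections import defaultdict

# Hypothetical input splits (Hadoop would create these from HDFS blocks).
splits = ["to be or not", "to be"]

# Map: each "mapper" emits (word, 1) pairs from its split.
mapped = []
for split in splits:
    for word in split.split():
        mapped.append((word, 1))

# Shuffle & Sort: sort the pairs by key and group the values per key.
grouped = defaultdict(list)
for key, value in sorted(mapped):
    grouped[key].append(value)

# Reduce: aggregate each key's values into a final count.
result = {key: sum(values) for key, values in grouped.items()}
print(result)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In real Hadoop the mapper and reducer run as separate tasks on different nodes, and the framework performs the shuffle and sort between them; the data flow, however, is exactly the one shown here.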
PRACTICAL APPLICATIONS
--------------------------
We will describe and illustrate some examples so that students can try this
framework on their own big data. We will start with a simple word count
example and then illustrate the processing of some synthetic data in which
columns are separated by a delimiter.
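As a sketch of the delimiter-separated case, the snippet below aggregates one column of some made-up records grouped by another. The field layout (name, department, salary) and the comma delimiter are assumptions for illustration only; in Hadoop this splitting would happen inside the map() function.

```python
from collections import defaultdict

# Hypothetical synthetic records: "name,department,salary".
lines = [
    "alice,sales,100",
    "bob,hr,80",
    "carol,sales,120",
]

# Map-style step: split each line on the delimiter and emit
# (department, salary) pairs; reduce-style step: sum per department.
totals = defaultdict(int)
for line in lines:
    name, dept, salary = line.split(",")
    totals[dept] += int(salary)

print(dict(totals))  # {'sales': 220, 'hr': 80}
```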
SUMMARY
-----------
Hadoop is one of the most popular tools for big data processing, offering a
high-level API. Hadoop has been successfully deployed in production by many
companies for several years. Many tools are available around it for collecting,
storing, and processing data, as well as for cluster deployment, monitoring,
and data security.