2. MapReduce
MapReduce is the processing layer of Hadoop. It is a programming model used for
processing large data sets in parallel across a distributed cluster.
● Map phase: In the Map phase, the input data is divided into chunks (called
splits), and each chunk is processed by a mapper. The mapper processes the
data and generates a set of intermediate key-value pairs.
● Shuffle and Sort: After the Map phase, the intermediate key-value pairs are
shuffled and sorted. The system groups the data by key and prepares it for the
Reduce phase.
● Reduce phase: In the Reduce phase, the system applies the reduce function to
the sorted intermediate data, aggregating the values associated with each key
(for example, summing the counts for a word). The results are written to the output files.
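For example, for a single input line of text, the three phases behave as follows (the line itself is just an illustration):
Input:             hello world hello
Map output:        (hello, 1), (world, 1), (hello, 1)
Shuffle and Sort:  hello -> [1, 1], world -> [1]
Reduce output:     (hello, 2), (world, 1)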
MapReduce Architecture Diagram:
+-------------+
| Input | ----> [Map] ----> [Shuffle & Sort] ----> [Reduce] ----> Output
+-------------+
● JobTracker: The JobTracker is the master daemon in the MapReduce
framework. It is responsible for scheduling and monitoring jobs, dividing the work
into tasks, and allocating tasks to TaskTrackers.
● TaskTracker: TaskTrackers are worker daemons that run on the cluster nodes
and execute tasks assigned by the JobTracker. Each TaskTracker can run both Map
and Reduce tasks.
(The JobTracker and TaskTracker make up the classic Hadoop 1.x MapReduce runtime; from Hadoop 2.x onward, resource management and scheduling are handled by YARN, described next.)
3. YARN (Yet Another Resource Negotiator)
YARN is the resource management layer of Hadoop, responsible for managing
resources across the cluster and scheduling the execution of tasks.
● ResourceManager (RM): The ResourceManager is the master daemon in YARN,
which manages the allocation of resources (memory, CPU) to the various
applications running on the cluster. It makes sure that resources are allocated
based on job requirements and cluster availability.
● NodeManager (NM): The NodeManager runs on each node in the cluster. It is
responsible for managing resources on the individual node and monitoring the
status of the node.
● ApplicationMaster (AM): The ApplicationMaster is a per-application entity that
manages the lifecycle of a job. It negotiates resources with the ResourceManager
and monitors the progress of its application (MapReduce job or Spark job).
YARN Architecture Diagram:
+-----------------------+
| ResourceManager | <-----> [Resource Allocation]
+-----------------------+
|
+-----------------------------+
| NodeManager | <-----> [Resource Monitoring]
+-----------------------------+
|
+---------------------------+
| ApplicationMaster (AM) | <-----> [Job Coordination]
+---------------------------+
|
+-----------------------+
| Application | <-----> [MapReduce/Spark Job]
+-----------------------+
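Once YARN is running, the cluster can be inspected from the command line; a minimal sketch (the application ID is a placeholder):
yarn node -list                              # NodeManagers registered with the ResourceManager
yarn application -list                       # running applications and their ApplicationMasters
yarn application -status <application_id>    # details for a single application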
2. Loading a Dataset into HDFS for Spark Analysis
Installation of Hadoop and cluster management:
(i) Installing a Hadoop single-node cluster in an Ubuntu environment
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iii) Accessing the Web UI and its port number
(iv) Installing and accessing environments such as Hive and Sqoop
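Once installed, a quick way to verify a single-node cluster (assuming Hadoop's sbin scripts are on the PATH; the ports below are the defaults):
start-dfs.sh       # starts the NameNode, DataNode and SecondaryNameNode
start-yarn.sh      # starts the ResourceManager and NodeManager
jps                # lists the running Java daemons to confirm everything started
# Default Web UIs:
#   NameNode:        http://localhost:9870   (Hadoop 3.x; 50070 in Hadoop 2.x)
#   ResourceManager: http://localhost:8088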
Here’s a breakdown of file management tasks and basic Linux commands, particularly
focused on HDFS (Hadoop Distributed File System) operations:
(i) Creating a directory in HDFS:
To create a directory in HDFS, you can use the hadoop fs -mkdir command.
hadoop fs -mkdir /path/to/your/directory
This will create a directory at the specified path in HDFS.
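For example (the directory name is hypothetical), the -p option creates any missing parent directories, and -ls verifies the result:
hadoop fs -mkdir -p /user/hadoop/input
hadoop fs -ls /user/hadoop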
(ii) Moving back and forth between directories:
You can navigate directories in the Linux file system using the cd command.
● To move to a directory:
● cd /path/to/directory
● To move back to the previous directory:
● cd -
● To move up one directory level:
● cd ..
For HDFS directories there is no cd equivalent, because HDFS does not keep a current
working directory; instead, list a directory's contents with hadoop fs -ls /path/to/directory
and give full paths in every HDFS command.
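A short sketch combining local navigation and HDFS listing (the paths are hypothetical):
cd /opt/hadoop               # move into the local Hadoop installation directory
cd ..                        # go back up to /opt
hadoop fs -ls /              # list the root of HDFS
hadoop fs -ls /user/hadoop   # list an HDFS directory by its absolute path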
(vii) Copying and moving files between local and HDFS environment:
● Copying a file from local to HDFS:
● hadoop fs -copyFromLocal /local/path/to/file /hdfs/path/to/destination
● Copying a file from HDFS to local:
● hadoop fs -copyToLocal /hdfs/path/to/file /local/path/to/destination
● Moving a file from local to HDFS:
● hadoop fs -moveFromLocal /local/path/to/file /hdfs/path/to/destination
● Moving a file from HDFS to local:
● hadoop fs -moveToLocal /hdfs/path/to/file /local/path/to/destination
(Note: -moveToLocal is documented but not implemented in many Hadoop releases; if it fails, use -copyToLocal and then remove the HDFS copy with hadoop fs -rm.)
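A typical round trip might look like this (the file and directory names are hypothetical):
hadoop fs -mkdir -p /user/hadoop/input
hadoop fs -copyFromLocal dataset.csv /user/hadoop/input/
hadoop fs -ls /user/hadoop/input
hadoop fs -copyToLocal /user/hadoop/input/dataset.csv /tmp/dataset_copy.csv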
import sys

# Mapper function: reads lines from standard input and emits one
# "word<TAB>1" pair per word
def mapper():
    for line in sys.stdin:
        words = line.split()
        for word in words:
            # Emit word with value 1
            print(f"{word}\t1")

if __name__ == "__main__":
    mapper()
In this code:
● The input is a line of text.
● The line is split into words.
● For each word, a key-value pair is emitted, where the key is the word, and the value is 1.
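You can test the mapper on its own before running it in Hadoop, assuming the script above is saved as mapper.py:
echo "hello world hello" | python3 mapper.py
# Output:
# hello   1
# world   1
# hello   1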
2. Shuffle and Sort:
After the map phase, the framework automatically groups and sorts the emitted key-value pairs by
key. For example, all occurrences of the word "hello" are brought together so that they can be
passed to the same reducer.
Example of shuffled and sorted data:
data 1
hello 1
hello 1
world 1
world 1
3. Reducer Phase:
The reducer processes the grouped key-value pairs. It aggregates the values by summing them to get
the total count for each word.
Reducer code:
import sys

# Reducer function: reads sorted "word<TAB>count" pairs from standard
# input and prints the total count for each word
def reducer():
    current_word = None
    current_count = 0
    for line in sys.stdin:
        word, count = line.strip().split('\t')
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            # A new word has started: emit the total for the previous word
            if current_word:
                print(f"{current_word}\t{current_count}")
            current_word = word
            current_count = count
    # Emit the total for the last word in the input
    if current_word:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    reducer()
In this code:
● The reducer receives grouped key-value pairs.
● It aggregates the count of each word and prints the final result.
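Because the framework sorts by key between the two phases, the whole job can be simulated locally with the Unix sort command standing in for Shuffle and Sort (assuming the scripts are saved as mapper.py and reducer.py, and input.txt is any text file):
cat input.txt | python3 mapper.py | sort | python3 reducer.py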
4. Driver Code:
The driver code sets up the map and reduce operations and coordinates the execution of the map
and reduce phases in the framework. In Hadoop, this would be handled by a job configuration, but
for simplicity, this can be managed manually in a basic script.
Example Driver Code (in a Hadoop or basic setup):
# Pseudo code to explain the execution
# 1. The input text is passed to the Mapper.
# 2. Mapper emits key-value pairs.
# 3. Intermediate data is shuffled and sorted by keys.
# 4. The Reducer takes the sorted data, aggregates it, and outputs the result.
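On a real cluster, this word count is typically submitted as a Hadoop Streaming job; a minimal sketch, assuming the input has already been copied to /user/hadoop/input and the streaming jar path matches your installation:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -mapper "python3 mapper.py" \
  -reducer "python3 reducer.py" \
  -input /user/hadoop/input \
  -output /user/hadoop/output
hadoop fs -cat /user/hadoop/output/part-*    # view the final word counts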