
1. To Study Big Data Analytics and Hadoop Architecture

(i) To know the concept of Big Data architecture
(ii) To know the concept of Hadoop architecture
Big Data architecture:
Big Data architecture refers to the design and structure used to store, process, and
analyze large volumes of data. These architectures are built to handle a variety of data
types (structured, semi-structured, unstructured), as well as the large scale and speed of
modern data flows. The core components of Big Data architecture typically include the
following layers:
1. Data Source Layer
This layer refers to the origin of the data, which could come from a variety of sources:
* External data sources: Social media, IoT devices, third-party services, etc.
* Internal data sources: Databases, data warehouses, etc.
* Data streams: Real-time data from sensors, logs, etc.
2. Data Ingestion Layer
Data ingestion is the process of collecting and transporting data from various sources
to the storage layer. The two main types of ingestion are:
* Batch processing: Data is collected over a fixed period (e.g., every hour, daily).
* Real-time/streaming processing: Data is collected in real-time or near real-time.
Tools used for data ingestion include:
* Apache Kafka: A distributed streaming platform.
* Apache Flume: A service for collecting and moving large amounts of log data.
* AWS Kinesis: A platform for real-time streaming data on AWS.
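For example, a minimal streaming-ingestion sketch using the kafka-python client is shown below. It assumes the kafka-python package is installed, a Kafka broker is listening on localhost:9092, and a topic named sensor-readings exists; the broker address, topic name, and sample payloads are illustrative assumptions, not part of any particular deployment.

import json
from kafka import KafkaProducer  # requires the kafka-python package

# Minimal sketch: push a few JSON-encoded sensor readings into a Kafka topic.
# The broker address and topic name below are illustrative assumptions.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for reading in [{"sensor": "s1", "temp": 21.5}, {"sensor": "s2", "temp": 19.8}]:
    producer.send("sensor-readings", value=reading)  # hypothetical topic name

producer.flush()   # ensure buffered messages are delivered before exiting
producer.close()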
3. Data Storage Layer
This is where all the data is stored. Big Data storage should support both structured and
unstructured data. It needs to be scalable, reliable, and highly available. Some common
types of data storage in Big Data systems include:
* HDFS (Hadoop Distributed File System): A scalable, distributed file system.
* NoSQL databases: MongoDB, Cassandra, HBase for non-relational data.
* Data Lakes: A central repository for storing raw data in its native format (e.g.,
AWS S3, Azure Blob Storage).
4. Data Processing Layer
This layer processes the stored data and transforms it into valuable insights. It can be
divided into two major approaches:
* Batch Processing: Processing data in large, scheduled intervals (e.g., Hadoop
MapReduce).
* Stream Processing: Processing data in real-time as it flows in (e.g., Apache
Flink, Apache Storm, Spark Streaming).
Some key processing tools:
* Apache Spark: A fast and general-purpose cluster-computing system.
* Apache Hadoop: A framework for distributed storage and processing.
* Flink and Storm: Used for real-time data stream processing.
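To make the batch-processing approach concrete, here is a minimal sketch using PySpark, the Python API of Apache Spark listed above. It assumes pyspark is installed and that a CSV file named events.csv with an event_type column is available; the file name and column are illustrative assumptions.

from pyspark.sql import SparkSession

# Minimal PySpark batch job: load a CSV file and run a simple aggregation.
# "events.csv" and the "event_type" column are illustrative assumptions.
spark = SparkSession.builder.appName("batch-example").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()

spark.stop()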
5. Data Analytics Layer
Once data is processed, it is often analyzed to extract insights. The analytics layer
provides tools for complex analysis, including:
* Machine Learning (ML): Building predictive models and discovering patterns
using algorithms.
* Data Mining: Discovering hidden patterns and trends in data.
* Business Intelligence (BI): Tools like Tableau, Power BI for reporting and
visualization.
Popular tools used for analytics:
* Apache Hive: A data warehouse built on top of Hadoop for querying and
analyzing large datasets.
* Apache Impala: A high-performance SQL engine for big data.
* Python libraries (Pandas, scikit-learn): For data manipulation and machine
learning.
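As a small illustration of the Python-based analytics mentioned above, the sketch below fits a linear regression with pandas and scikit-learn; the data is invented purely for illustration.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny analytics sketch: fit a linear model on made-up data.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40],   # illustrative feature
    "sales":    [25, 45, 65, 85],   # illustrative target
})

model = LinearRegression().fit(df[["ad_spend"]], df["sales"])
print(model.predict(pd.DataFrame({"ad_spend": [50]})))  # -> [105.]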
6. Data Presentation Layer
This layer presents the insights derived from the analytics layer. It often involves
dashboards, reports, and visualizations. Users, stakeholders, or systems will interact
with this layer to make data-driven decisions. Tools include:
* BI tools: Tableau, Power BI, QlikView.
* Custom web interfaces: To display reports, graphs, and analysis.
7. Security and Governance Layer
Given the large volumes and sensitivity of data, security and governance are critical.
This layer ensures data privacy, access control, and regulatory compliance.
* Authentication/Authorization: Ensuring only authorized users can access
specific data.
* Data Encryption: To protect sensitive data at rest and in transit.
* Data Lineage: Tracking the origin and movement of data to ensure
trustworthiness.
* Compliance: Adhering to regulations such as GDPR, HIPAA, etc.
8. Orchestration and Management Layer
Big Data systems require complex management for coordination, scheduling, and
monitoring.
* Apache Airflow: An open-source platform to programmatically author, schedule,
and monitor workflows.
* Kubernetes: For managing containerized applications and ensuring scalability
and reliability.
Key Technologies in Big Data Architecture:
* Hadoop Ecosystem: For storage and processing (HDFS, YARN, MapReduce, Pig,
Hive, etc.).
* Apache Kafka: For real-time streaming.
* Apache Spark: For fast in-memory data processing.
* NoSQL Databases: MongoDB, Cassandra, HBase.
* Cloud Platforms: AWS, Azure, Google Cloud provide tools for storage,
processing, and management.
Example of Big Data Architecture
+---------------------+
|    Data Sources     |
+---------------------+
          |
          v
+---------------------+        +---------------------+
|   Data Ingestion    | -----> |    Data Storage     |
|  (Batch/Streaming)  |        |   (HDFS, NoSQL,     |
+---------------------+        |    Data Lakes)      |
          |                    +---------------------+
          v                               |
+---------------------+                   v
|   Data Processing   |        +---------------------+
|   (Batch/Stream)    | -----> |   Data Analytics    |
+---------------------+        | (ML, BI, Analysis)  |
          |                    +---------------------+
          v                               |
+------------------------+                v
|   Data Presentation    |     +---------------------+
| (Dashboards, Reports,  | <-->|      Security &     |
|     Visualization)     |     |      Governance     |
+------------------------+     +---------------------+
This high-level overview demonstrates the flow of data through the architecture from
collection to processing and presentation.
(ii) know the concept of Hadoop architecture
Hadoop Architecture Overview
Hadoop is an open-source framework for processing and storing large datasets in a
distributed computing environment. It is designed to scale from a single server to
thousands of machines, each offering local computation and storage. Understanding
Hadoop architecture is essential for working with Hadoop-based systems. Below is a
detailed overview of the Hadoop architecture, its components, and how they work
together.

Key Components of Hadoop Architecture


The architecture of Hadoop primarily revolves around three main components:
1. Hadoop Distributed File System (HDFS)
2. MapReduce
3. YARN (Yet Another Resource Negotiator)
These components work together to provide a distributed system that can store and
process large volumes of data.

1. Hadoop Distributed File System (HDFS)


HDFS is the storage layer of Hadoop. It is designed to store vast amounts of data
across multiple machines in a distributed environment.
● Block-based storage: HDFS stores data in blocks (typically 128MB or 256MB by
default). Each file is divided into blocks, which are then distributed across
multiple nodes.
● Fault tolerance: HDFS ensures fault tolerance by replicating blocks. The default
replication factor is 3 (each block is copied three times across the cluster). If one
node fails, the data can still be accessed from another replica.
● NameNode: The NameNode is the master node in HDFS that manages the
metadata (such as block locations) for the files. It does not store the data itself
but keeps track of where the blocks are stored across the cluster.
● DataNode: DataNodes are the worker nodes that store the actual data in the
form of blocks. Each DataNode is responsible for serving the blocks on request
and performing block-level operations (like block creation, deletion, and
replication).
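To make the block and replication numbers above concrete, here is a quick back-of-the-envelope calculation; the 1 GB file size is an illustrative assumption.

import math

# How a 1 GB file is laid out in HDFS with the default settings above.
file_size_mb = 1024        # illustrative file size (1 GB)
block_size_mb = 128        # default HDFS block size
replication_factor = 3     # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
print(blocks)                              # 8 blocks
print(blocks * replication_factor)         # 24 block replicas across the cluster
print(file_size_mb * replication_factor)   # 3072 MB of raw storage consumed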
HDFS Architecture Diagram:
                 +-------------------+
                 |      Client       |
                 +-------------------+
                           |
                 +-------------------+
                 |     NameNode      |   (master: stores metadata only)
                 +-------------------+
                  /                 \
       +-------------------+   +-------------------+
       |     DataNode      |   |     DataNode      |   (workers: store data blocks)
       +-------------------+   +-------------------+

2. MapReduce
MapReduce is the processing layer of Hadoop. It is a programming model used for
processing large data sets in parallel across a distributed cluster.
● Map phase: In the Map phase, the input data is divided into chunks (called
splits), and each chunk is processed by a mapper. The mapper processes the
data and generates a set of intermediate key-value pairs.
● Shuffle and Sort: After the Map phase, the intermediate key-value pairs are
shuffled and sorted. The system groups the data by key and prepares it for the
Reduce phase.
● Reduce phase: In the Reduce phase, the system applies the reduce function to
the sorted intermediate data, aggregating or transforming the data in some way.
The results are written to the output files.
MapReduce Architecture Diagram:
+-------------+
| Input | ----> [Map] ----> [Shuffle & Sort] ----> [Reduce] ----> Output
+-------------+
● JobTracker: The JobTracker is the master daemon in the classic (Hadoop 1.x)
MapReduce framework. It is responsible for scheduling and monitoring jobs, dividing
the work into tasks, and allocating tasks to TaskTrackers.
● TaskTracker: TaskTrackers are worker daemons that run on the cluster nodes and
execute tasks assigned by the JobTracker. Each TaskTracker handles both Map and
Reduce tasks.
Note that in Hadoop 2.x and later these MRv1 daemons are replaced by YARN: the
ResourceManager and a per-application ApplicationMaster take over job scheduling,
while NodeManagers run the tasks (see the next section).
3. YARN (Yet Another Resource Negotiator)
YARN is the resource management layer of Hadoop, responsible for managing
resources across the cluster and scheduling the execution of tasks.
● ResourceManager (RM): The ResourceManager is the master daemon in YARN,
which manages the allocation of resources (memory, CPU) to the various
applications running on the cluster. It makes sure that resources are allocated
based on job requirements and cluster availability.
● NodeManager (NM): The NodeManager runs on each node in the cluster. It is
responsible for managing resources on the individual node and monitoring the
status of the node.
● ApplicationMaster (AM): The ApplicationMaster is a per-application entity that
manages the lifecycle of a job. It negotiates resources with the ResourceManager
and monitors the progress of its application (MapReduce job or Spark job).
YARN Architecture Diagram:
+-----------------------+
| ResourceManager | <-----> [Resource Allocation]
+-----------------------+
|
+-----------------------------+
| NodeManager | <-----> [Resource Monitoring]
+-----------------------------+
|
+---------------------------+
| ApplicationMaster (AM) | <-----> [Job Coordination]
+---------------------------+
|
+-----------------------+
| Application | <-----> [MapReduce/Spark Job]
+-----------------------+

Hadoop Ecosystem Components


Apart from the core components (HDFS, MapReduce, and YARN), Hadoop has a rich
ecosystem that includes several tools and frameworks for different use cases. Some of
the key components include:
● Hive: A data warehouse system that facilitates querying and managing large
datasets in HDFS using SQL-like queries.
● Pig: A platform for analyzing large datasets, providing a high-level language
called Pig Latin for processing and transforming data.
● HBase: A NoSQL database for real-time read/write access to large datasets
stored in HDFS.
● Sqoop: A tool for transferring data between Hadoop and relational databases.
● Flume: A service for collecting and aggregating log data and other types of
streaming data.
● Oozie: A workflow scheduler for managing Hadoop jobs.
● Zookeeper: A service for coordinating distributed applications in the Hadoop
ecosystem.
● Mahout: A machine learning library for scalable machine learning algorithms.

Hadoop Architecture Diagram (Complete)


                        +------------------+
                        |   Client Node    |
                        +------------------+
                                 |
               +-----------------+------------------+
               |                                    |
+--------------------------------+   +----------------------------------+
|     HDFS (Storage Layer)       |   |  YARN (Resource Management)      |
|  NameNode (Master)             |   |  ResourceManager (Master)        |
|  DataNode (Workers)            |   |  NodeManager (Workers)           |
+--------------------------------+   +----------------------------------+
               |                                    |
+--------------------------------+   +----------------------------------+
|        MapReduce Layer         |   |        ApplicationMaster         |
+--------------------------------+   +----------------------------------+

Key Characteristics of Hadoop Architecture


1. Scalability: Hadoop is designed to scale horizontally. As your data grows, you
can add more nodes to the cluster.
2. Fault Tolerance: Through replication and data distribution, Hadoop ensures that
the data is not lost even when individual nodes fail.
3. Cost Efficiency: Hadoop runs on commodity hardware, meaning you can build
large-scale clusters with low-cost machines.
4. Data Locality: Hadoop tries to move computation to where the data is stored to
minimize network congestion and speed up processing.

2. Loading a Dataset into HDFS for Spark Analysis: Installation of Hadoop and
Cluster Management
(i) Installing a Hadoop single-node cluster in an Ubuntu environment
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iii) Accessing WEB-UI and the port number
(iv) Installing and accessing the environments such as hive and sqoop

Installing Hadoop Single Node Cluster in Ubuntu Environment


Prerequisites:
● A fresh Ubuntu system or a virtual machine running Ubuntu.
● Java should be installed (Hadoop requires Java 8 or later).
● A user with sudo privileges.
● SSH installed, with passwordless SSH to localhost configured (the Hadoop start
scripts use ssh to launch the daemons).
Step-by-Step Installation:
1. Install Java (JDK):
Hadoop requires Java to be installed. Install Java 8 or a compatible version.
sudo apt update
sudo apt install openjdk-8-jdk
Verify the Java installation:
java -version
2. Install Hadoop:
Download a stable Hadoop release from the official Apache archive, then extract it
and move it into place:
3. Download the binaries using wget:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
4. Extract the downloaded tar file:
tar -xzvf hadoop-3.3.1.tar.gz
5. Move it to the /opt directory:
sudo mv hadoop-3.3.1 /opt/hadoop
6. Set Environment Variables:
Add Hadoop-related environment variables to the .bashrc file:
nano ~/.bashrc
Add the following lines at the end of the file:
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # adjust to your Java installation path
After saving and closing, apply the changes:
source ~/.bashrc
7. Configure Hadoop:
In the Hadoop configuration directory, you'll need to edit several XML files to set
up the cluster.
o core-site.xml:
Edit the core configuration to set the HDFS URI.
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
o hdfs-site.xml:
Configure HDFS directories and replication:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/hdfs/datanode</value>
</property>
</configuration>
o mapred-site.xml:
Set up the MapReduce framework:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
o yarn-site.xml:
Configure YARN settings:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add:
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
8. Format HDFS:
Before starting Hadoop, format the HDFS:
hdfs namenode -format
9. Start Hadoop Daemons:
Start the Hadoop daemons (NameNode, DataNode, ResourceManager,
NodeManager):
start-dfs.sh
start-yarn.sh
10. Verify the Installation:
o Run jps and check that the HDFS daemons (NameNode, DataNode) and the YARN
daemons (ResourceManager, NodeManager) are all listed:
jps
o You can also check the Hadoop Web UI to view the status of your cluster.

(ii) Differences Between Single-Node and Multi-Node Clusters


Single-Node Cluster:
● A single-node cluster is a Hadoop setup where all the Hadoop services
(NameNode, DataNode, ResourceManager, and NodeManager) run on one
machine (localhost).
● It is simpler to set up and useful for development and testing purposes.
● It offers limited scalability and no truly distributed computation, unlike a
multi-node cluster.
Multi-Node Cluster:
● A multi-node cluster involves multiple machines, where one node acts as the
master (NameNode, ResourceManager) and others as slaves (DataNode,
NodeManager).
● It offers the true power of distributed computing and storage, enabling
scalability and fault tolerance.
● It requires more complex configuration, network setup, and hardware resources.
● It is used in production environments where large-scale data processing is
required.

(iii) Accessing WEB-UI and the Port Number


Hadoop provides a Web UI to monitor the cluster's health and performance. The
following are the key ports:
● NameNode Web UI: http://localhost:9870 – For monitoring HDFS status (Hadoop 3.x;
older Hadoop 2.x releases used port 50070).
● ResourceManager Web UI: http://localhost:8088 – For monitoring the YARN resource
manager.
● JobHistory Server: http://localhost:19888 – For tracking MapReduce job history.
Make sure these ports are open and accessible; a quick way to check them from
Python is sketched below.
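A minimal sketch for checking whether these web UI ports respond on the local machine is shown below; it assumes the single-node cluster from the installation steps is running on this host.

import socket

# Probe the default Hadoop web UI ports on localhost.
ports = {
    "NameNode Web UI": 9870,
    "ResourceManager Web UI": 8088,
    "JobHistory Server": 19888,
}

for name, port in ports.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        status = "reachable" if s.connect_ex(("localhost", port)) == 0 else "not reachable"
        print(f"{name} on port {port}: {status}")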

(iv) Installing and Accessing Environments such as Hive and Sqoop


Hive Installation:
1. Install Hive:
You can download the latest stable version of Apache Hive from the Apache
website or install it via apt if available.
sudo apt-get install hive
2. Configure Hive:
Hive requires a metastore (typically MySQL or Derby). You can configure it by
editing hive-site.xml:
nano $HIVE_HOME/conf/hive-site.xml
3. Access Hive:
After installation and configuration, start Hive:
hive
This opens the Hive CLI where you can execute Hive queries.
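Besides the CLI, Hive can also be queried from Python once HiveServer2 is running. The sketch below assumes the PyHive package is installed, HiveServer2 is listening on its default port 10000, and a table named employees exists; the package, port, and table name are assumptions for illustration only.

from pyhive import hive  # assumes the PyHive package is installed

# Connect to a local HiveServer2 instance (default port 10000 assumed).
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# "employees" is a hypothetical table used only for illustration.
cursor.execute("SELECT * FROM employees LIMIT 10")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()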
Sqoop Installation:
1. Install Sqoop:
Download and install Sqoop, which is used for transferring data between
relational databases and Hadoop.
sudo apt-get install sqoop
2. Configure Sqoop:
Set up database connection configurations in Sqoop by editing the sqoop-site.xml
file.
3. Access Sqoop:
To use Sqoop to import or export data, you can run commands like:
sqoop import --connect jdbc:mysql://localhost/database --table tablename \
    --username user --password pass

3. File management tasks & Basic linux commands


(i) Creating a directory in HDFS
(ii) Moving forth and back to directories
(iii) Listing directory contents
(iv) Uploading and downloading a file in HDFS
(v) Checking the contents of the file
(vi) Copying and moving files
(vii) Copying and moving files between local to HDFS environment
(viii) Removing files and paths
(ix) Displaying few lines of a file
(x) Display the aggregate length of a file
(xi) Checking the permissions of a file
(xii) Zipping and unzipping the files with & without permission pasting it to a location
(xiii) Copy, Paste commands

Here’s a breakdown of file management tasks and basic Linux commands, particularly
focused on HDFS (Hadoop Distributed File System) operations:
(i) Creating a directory in HDFS:
To create a directory in HDFS, you can use the hadoop fs -mkdir command.
hadoop fs -mkdir /path/to/your/directory
This will create a directory at the specified path in HDFS.
(ii) Moving forth and back to directories:
You can navigate directories in the Linux file system using the cd command.
● To move to a directory:
● cd /path/to/directory
● To move back to the previous directory:
● cd -
● To move up one directory level:
● cd ..
HDFS has no notion of a current working directory, so there is no cd equivalent for
HDFS; instead you work with full paths and use the hadoop fs -ls command to inspect
directory contents.

(iii) Listing directory contents:


To list contents of a directory, whether in HDFS or local, you use the ls command.
● In HDFS:
● hadoop fs -ls /path/to/directory
● In Local File System:
● ls /path/to/directory

(iv) Uploading and downloading a file in HDFS:


To upload a file to HDFS:
hadoop fs -put /local/path/to/file /hdfs/path/to/directory
To download a file from HDFS:
hadoop fs -get /hdfs/path/to/file /local/path/to/directory

(v) Checking the contents of the file:


You can check the contents of a file using the cat command.
● In HDFS:
● hadoop fs -cat /path/to/file
● In Local File System:
● cat /path/to/file

(vi) Copying and moving files:


● Copying files:
o To copy a file within HDFS:
o hadoop fs -cp /hdfs/source/path /hdfs/destination/path
o To copy a file from local to HDFS:
o hadoop fs -copyFromLocal /local/source/path /hdfs/destination/path
o To copy a file from HDFS to local:
o hadoop fs -copyToLocal /hdfs/source/path /local/destination/path
● Moving files:
o To move a file within HDFS:
o hadoop fs -mv /hdfs/source/path /hdfs/destination/path
o To move a file from local to HDFS:
o hadoop fs -moveFromLocal /local/source/path /hdfs/destination/path
o To move a file from HDFS to local (note: hadoop fs -moveToLocal is not
implemented in current Hadoop releases, so copy the file and then delete it):
o hadoop fs -get /hdfs/source/path /local/destination/path
o hadoop fs -rm /hdfs/source/path

(vii) Copying and moving files between local and HDFS environment:
● Copying a file from local to HDFS:
● hadoop fs -copyFromLocal /local/path/to/file /hdfs/path/to/destination
● Copying a file from HDFS to local:
● hadoop fs -copyToLocal /hdfs/path/to/file /local/path/to/destination
● Moving a file from local to HDFS:
● hadoop fs -moveFromLocal /local/path/to/file /hdfs/path/to/destination
● Moving a file from HDFS to local (again, since hadoop fs -moveToLocal is not
implemented, copy with -get and then remove the HDFS copy):
● hadoop fs -get /hdfs/path/to/file /local/path/to/destination
● hadoop fs -rm /hdfs/path/to/file

(viii) Removing files and paths:


To remove files and directories, you can use the -rm and -r options for directories.
● Remove a file in HDFS:
● hadoop fs -rm /hdfs/path/to/file
● Remove a directory in HDFS:
● hadoop fs -rm -r /hdfs/path/to/directory
● Remove a file locally:
● rm /local/path/to/file
● Remove a directory locally:
● rm -r /local/path/to/directory

(ix) Displaying few lines of a file:


To display the first few lines of a file:
● In HDFS:
● hadoop fs -head /path/to/file
● In Local File System:
● head /path/to/file

(x) Display the aggregate length of a file:


You can get the file size using the -du (disk usage) command.
● In HDFS:
● hadoop fs -du -s /path/to/file
● In Local File System:
● du -sh /path/to/file
This will display the total size of the file.
(xi) Checking the permissions of a file:
You can check the permissions of a file using the -ls command, which will show the file
permissions.
● In HDFS:
● hadoop fs -ls /path/to/file
● In Local File System:
● ls -l /path/to/file
This will display the permissions, owner, and group of the file or directory.
(xii) Zipping and unzipping files (with and without preserving permissions) and pasting them to a location:
You can zip and unzip files using the zip and unzip commands.
● Zipping a file:
● zip filename.zip /path/to/file
● Unzipping a file:
● unzip filename.zip -d /path/to/extract
To maintain permissions while transferring a file, use the -p option in cp or rsync for
preserving permissions.
Example with rsync:
rsync -av /path/to/source /path/to/destination

(xiii) Copy, Paste Commands:


● Copy Command (For local filesystem):
● cp /source/path /destination/path
For HDFS:
hadoop fs -cp /source/hdfs/path /destination/hdfs/path
● Paste Command (To paste a file after copying it): This is generally done by using
cp or mv as mentioned above. There's no specific "paste" command, but the
operation is performed through these commands when moving or copying data.
4. Map-Reduce
(i) Definition of Map-reduce
(ii) Its stages and terminologies
(iii) Word-count program to understand map-reduce (Mapper phase, Reducer
phase, Driver code)
(i) Definition of Map-Reduce:
Map-Reduce is a programming model and processing technique used to process and generate large
datasets. It allows the parallel processing of data by dividing it into small chunks and distributing it
across multiple nodes in a cluster. The main concept involves two key operations: Map and Reduce.
● Map: The map function processes input data and produces a set of intermediate key-value
pairs.
● Reduce: The reduce function takes the intermediate key-value pairs, processes them, and
merges them to produce the final result.
Map-Reduce is widely used in distributed systems like Hadoop for large-scale data processing tasks.
(ii) Stages and Terminologies in Map-Reduce:
The Map-Reduce process is split into two main stages: the Map stage and the Reduce stage, but
several other intermediate processes and terminologies come into play.
1. Map Stage:
o The input data is divided into chunks (usually files or records).
o The Mapper function processes each chunk and outputs intermediate key-value pairs.
o The intermediate output is sorted and grouped by key (called the shuffle phase).
2. Shuffle and Sort:
o After the map phase, the intermediate key-value pairs are shuffled and sorted to ensure
that all values corresponding to the same key are grouped together. This step
happens automatically in Map-Reduce frameworks like Hadoop.
3. Reduce Stage:
o The Reducer function processes each group of intermediate key-value pairs and
merges them to produce a final output. It can aggregate, summarize, or process data
in any other way required by the user.
4. Output:
o After the reduce phase, the final output is written to a file or a database.
Key Terminologies in Map-Reduce:
● Mapper: The function or process that reads input data, processes it, and outputs key-value
pairs.
● Reducer: The function that processes the grouped key-value pairs from the mapper and
performs the final aggregation or computation.
● Key-Value Pair: The fundamental unit of data in Map-Reduce, where each record is
represented as a key paired with a value.
● Shuffle: The process of redistributing the data across reducers based on keys, ensuring that
all values for the same key are sent to the same reducer.
● Input Split: The unit of work or chunk of data that is sent to a mapper.
● Output: The final result after processing in the reduce phase, usually saved to disk or a
storage system.
(iii) Word-Count Program to Understand Map-Reduce:
Here is a simple example of a Word-Count program to demonstrate the Map-Reduce process. We will
break it into three main parts:
1. Mapper Phase:
The mapper reads input text and emits key-value pairs, where the key is a word, and the value is 1
(representing a single occurrence of the word).
Mapper code (in Python or any suitable language):
import sys

# Mapper function
def mapper():
    for line in sys.stdin:
        words = line.split()
        for word in words:
            # Emit word with value 1
            print(f"{word}\t1")

if __name__ == "__main__":
    mapper()
In this code:
● The input is a line of text.
● The line is split into words.
● For each word, a key-value pair is emitted, where the key is the word, and the value is 1.
2. Shuffle and Sort:
After the map phase, the framework automatically groups and sorts the emitted key-value pairs. For
instance, all instances of the word "hello" will be grouped together so that they can be passed to the
same reducer.
Example of shuffled data:
hello 1
hello 1
world 1
world 1
data 1
3. Reducer Phase:
The reducer processes the grouped key-value pairs. It aggregates the values by summing them to get
the total count for each word.
Reducer code:
import sys

# Reducer function
def reducer():
    current_word = None
    current_count = 0
    for line in sys.stdin:
        word, count = line.strip().split('\t')
        count = int(count)

        if word == current_word:
            current_count += count
        else:
            if current_word:
                print(f"{current_word}\t{current_count}")
            current_word = word
            current_count = count

    # Output the last word
    if current_word:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    reducer()
In this code:
● The reducer receives grouped key-value pairs.
● It aggregates the count of each word and prints the final result.
4. Driver Code:
The driver code sets up the map and reduce operations and coordinates the execution of the map
and reduce phases in the framework. In Hadoop, this would be handled by a job configuration, but
for simplicity, this can be managed manually in a basic script.
Example Driver Code (in a Hadoop or basic setup):
# Pseudo code to explain the execution
# 1. The input text is passed to the Mapper.
# 2. Mapper emits key-value pairs.
# 3. Intermediate data is shuffled and sorted by keys.
# 4. The Reducer takes the sorted data, aggregates it, and outputs the result.

# In Hadoop, you would configure a Job with Mapper and Reducer.
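For a quick local test of the mapper and reducer outside Hadoop, the pipeline can be simulated by piping the mapper output through sort into the reducer. The sketch below does this with Python's subprocess module; it assumes the scripts above are saved as mapper.py and reducer.py and that an input.txt file exists in the current directory (the file names are assumptions).

import subprocess

# Local simulation of the streaming pipeline:
#   cat input.txt | python3 mapper.py | sort | python3 reducer.py
# mapper.py, reducer.py and input.txt are assumed file names.
with open("input.txt", "rb") as infile:
    mapper = subprocess.Popen(["python3", "mapper.py"],
                              stdin=infile, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(["sort"],
                              stdin=mapper.stdout, stdout=subprocess.PIPE)
    reducer = subprocess.Popen(["python3", "reducer.py"],
                               stdin=sorter.stdout)
    mapper.stdout.close()   # allow sort to see EOF when the mapper finishes
    sorter.stdout.close()   # allow the reducer to see EOF when sort finishes
    reducer.communicate()   # wait for the final counts to be printed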


Final Output:
After the map and reduce phases, the output would look like this:
data 1
hello 2
world 2
This shows the word count for each word in the input text.
In a distributed setup like Hadoop:
● The mapper would be executed on different nodes processing chunks of data in parallel.
● The reducer would then aggregate the results from all the mappers.
This basic example gives you a good understanding of how Map-Reduce works to process large
datasets by distributing the work and aggregating results efficiently.
