
1. To Study Big Data Analytics and Hadoop Architecture

(i) To know the concept of Big Data architecture
(ii) To know the concept of Hadoop architecture
Big Data architecture:
Big Data architecture refers to the design and structure used to store, process, and
analyze large volumes of data. These architectures are built to handle a variety of data
types (structured, semi-structured, unstructured), as well as the large scale and speed of
modern data flows. The core components of Big Data architecture typically include the
following layers:
1. Data Source Layer
This layer refers to the origin of the data, which could come from a variety of sources:
* External data sources: Social media, IoT devices, third-party services, etc.
* Internal data sources: Databases, data warehouses, etc.
* Data streams: Real-time data from sensors, logs, etc.
2. Data Ingestion Layer
Data ingestion is the process of collecting and transporting data from various sources
to the storage layer. The two main types of ingestion are:
* Batch processing: Data is collected over a fixed period (e.g., every hour, daily).
* Real-time/streaming processing: Data is collected in real-time or near real-time.
Tools used for data ingestion include:
* Apache Kafka: A distributed streaming platform.
* Apache Flume: A service for collecting and moving large amounts of log data.
* AWS Kinesis: A platform for real-time streaming data on AWS.
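For example, a minimal streaming-ingestion sketch using the kafka-python client is shown below. It assumes the kafka-python package is installed, a Kafka broker is listening on localhost:9092, and a topic named sensor-readings exists; the broker address, topic name, and sample payloads are illustrative assumptions, not part of any particular deployment.

import json
from kafka import KafkaProducer  # requires the kafka-python package

# Minimal sketch: push a few JSON-encoded sensor readings into a Kafka topic.
# The broker address and topic name below are illustrative assumptions.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for reading in [{"sensor": "s1", "temp": 21.5}, {"sensor": "s2", "temp": 19.8}]:
    producer.send("sensor-readings", value=reading)  # hypothetical topic name

producer.flush()   # ensure buffered messages are delivered before exiting
producer.close()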
3. Data Storage Layer
This is where all the data is stored. Big Data storage should support both structured and
unstructured data. It needs to be scalable, reliable, and highly available. Some common
types of data storage in Big Data systems include:
* HDFS (Hadoop Distributed File System): A scalable, distributed file system.
* NoSQL databases: MongoDB, Cassandra, HBase for non-relational data.
* Data Lakes: A central repository for storing raw data in its native format (e.g.,
AWS S3, Azure Blob Storage).
4. Data Processing Layer
This layer processes the stored data and transforms it into valuable insights. It can be
divided into two major approaches:
* Batch Processing: Processing data in large, scheduled intervals (e.g., Hadoop
MapReduce).
* Stream Processing: Processing data in real-time as it flows in (e.g., Apache
Flink, Apache Storm, Spark Streaming).
Some key processing tools:
* Apache Spark: A fast and general-purpose cluster-computing system.
* Apache Hadoop: A framework for distributed storage and processing.
* Flink and Storm: Used for real-time data stream processing.
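To make the batch-processing approach concrete, here is a minimal sketch using PySpark, the Python API of Apache Spark listed above. It assumes pyspark is installed and that a CSV file named events.csv with an event_type column is available; the file name and column are illustrative assumptions.

from pyspark.sql import SparkSession

# Minimal PySpark batch job: load a CSV file and run a simple aggregation.
# "events.csv" and the "event_type" column are illustrative assumptions.
spark = SparkSession.builder.appName("batch-example").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()

spark.stop()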
5. Data Analytics Layer
Once data is processed, it is often analyzed to extract insights. The analytics layer
provides tools for complex analysis, including:
* Machine Learning (ML): Building predictive models and discovering patterns
using algorithms.
* Data Mining: Discovering hidden patterns and trends in data.
* Business Intelligence (BI): Tools like Tableau, Power BI for reporting and
visualization.
Popular tools used for analytics:
* Apache Hive: A data warehouse built on top of Hadoop for querying and
analyzing large datasets.
* Apache Impala: A high-performance SQL engine for big data.
* Python libraries (Pandas, scikit-learn): For data manipulation and machine
learning.
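As a small illustration of the Python-based analytics mentioned above, the sketch below fits a linear regression with pandas and scikit-learn; the data is invented purely for illustration.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny analytics sketch: fit a linear model on made-up data.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40],   # illustrative feature
    "sales":    [25, 45, 65, 85],   # illustrative target
})

model = LinearRegression().fit(df[["ad_spend"]], df["sales"])
print(model.predict(pd.DataFrame({"ad_spend": [50]})))  # -> [105.]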
6. Data Presentation Layer
This layer presents the insights derived from the analytics layer. It often involves
dashboards, reports, and visualizations. Users, stakeholders, or systems will interact
with this layer to make data-driven decisions. Tools include:
* BI tools: Tableau, Power BI, QlikView.
* Custom web interfaces: To display reports, graphs, and analysis.
7. Security and Governance Layer
Given the large volumes and sensitivity of data, security and governance are critical.
This layer ensures data privacy, access control, and regulatory compliance.
* Authentication/Authorization: Ensuring only authorized users can access
specific data.
* Data Encryption: To protect sensitive data at rest and in transit.
* Data Lineage: Tracking the origin and movement of data to ensure
trustworthiness.
* Compliance: Adhering to regulations such as GDPR, HIPAA, etc.
8. Orchestration and Management Layer
Big Data systems require complex management for coordination, scheduling, and
monitoring.
* Apache Airflow: An open-source platform to programmatically author, schedule,
and monitor workflows.
* Kubernetes: For managing containerized applications and ensuring scalability
and reliability.
Key Technologies in Big Data Architecture:
* Hadoop Ecosystem: For storage and processing (HDFS, YARN, MapReduce, Pig,
Hive, etc.).
* Apache Kafka: For real-time streaming.
* Apache Spark: For fast in-memory data processing.
* NoSQL Databases: MongoDB, Cassandra, HBase.
* Cloud Platforms: AWS, Azure, Google Cloud provide tools for storage,
processing, and management.
Example of Big Data Architecture
+---------------------+
|    Data Sources     |
+---------------------+
          |
          v
+---------------------+        +---------------------+
|   Data Ingestion    | -----> |    Data Storage     |
|  (Batch/Streaming)  |        |   (HDFS, NoSQL,     |
+---------------------+        |    Data Lakes)      |
          |                    +---------------------+
          v                               |
+---------------------+                   v
|   Data Processing   |        +---------------------+
|   (Batch/Stream)    | -----> |   Data Analytics    |
+---------------------+        | (ML, BI, Analysis)  |
          |                    +---------------------+
          v                               |
+------------------------+                v
|   Data Presentation    |     +---------------------+
| (Dashboards, Reports,  | <-->|      Security &     |
|     Visualization)     |     |      Governance     |
+------------------------+     +---------------------+
This high-level overview demonstrates the flow of data through the architecture from
collection to processing and presentation.
(ii) know the concept of Hadoop architecture
Hadoop Architecture Overview
Hadoop is an open-source framework for processing and storing large datasets in a
distributed computing environment. It is designed to scale from a single server to
thousands of machines, each offering local computation and storage. Understanding
Hadoop architecture is essential for working with Hadoop-based systems. Below is a
detailed overview of the Hadoop architecture, its components, and how they work
together.

Key Components of Hadoop Architecture


The architecture of Hadoop primarily revolves around three main components:
1. Hadoop Distributed File System (HDFS)
2. MapReduce
3. YARN (Yet Another Resource Negotiator)
These components work together to provide a distributed system that can store and
process large volumes of data.

1. Hadoop Distributed File System (HDFS)


HDFS is the storage layer of Hadoop. It is designed to store vast amounts of data
across multiple machines in a distributed environment.
● Block-based storage: HDFS stores data in blocks (typically 128MB or 256MB by
default). Each file is divided into blocks, which are then distributed across
multiple nodes.
● Fault tolerance: HDFS ensures fault tolerance by replicating blocks. The default
replication factor is 3 (each block is copied three times across the cluster). If one
node fails, the data can still be accessed from another replica.
● NameNode: The NameNode is the master node in HDFS that manages the
metadata (such as block locations) for the files. It does not store the data itself
but keeps track of where the blocks are stored across the cluster.
● DataNode: DataNodes are the worker nodes that store the actual data in the
form of blocks. Each DataNode is responsible for serving the blocks on request
and performing block-level operations (like block creation, deletion, and
replication).
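To make the block and replication numbers above concrete, here is a quick back-of-the-envelope calculation; the 1 GB file size is an illustrative assumption.

import math

# How a 1 GB file is laid out in HDFS with the default settings above.
file_size_mb = 1024        # illustrative file size (1 GB)
block_size_mb = 128        # default HDFS block size
replication_factor = 3     # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
print(blocks)                              # 8 blocks
print(blocks * replication_factor)         # 24 block replicas across the cluster
print(file_size_mb * replication_factor)   # 3072 MB of raw storage consumed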
HDFS Architecture Diagram:
                 +-------------------+
                 |      Client       |
                 +-------------------+
                           |
                 +-------------------+
                 |     NameNode      |   (master: stores metadata only)
                 +-------------------+
                  /                 \
       +-------------------+   +-------------------+
       |     DataNode      |   |     DataNode      |   (workers: store data blocks)
       +-------------------+   +-------------------+

2. MapReduce
MapReduce is the processing layer of Hadoop. It is a programming model used for
processing large data sets in parallel across a distributed cluster.
● Map phase: In the Map phase, the input data is divided into chunks (called
splits), and each chunk is processed by a mapper. The mapper processes the
data and generates a set of intermediate key-value pairs.
● Shuffle and Sort: After the Map phase, the intermediate key-value pairs are
shuffled and sorted. The system groups the data by key and prepares it for the
Reduce phase.
● Reduce phase: In the Reduce phase, the system applies the reduce function to
the sorted intermediate data, aggregating or transforming the data in some way.
The results are written to the output files.
MapReduce Architecture Diagram:
+-------------+
| Input | ----> [Map] ----> [Shuffle & Sort] ----> [Reduce] ----> Output
+-------------+
● JobTracker: The JobTracker is the master daemon in the classic (Hadoop 1.x)
MapReduce framework. It is responsible for scheduling and monitoring jobs, dividing
the work into tasks, and allocating tasks to TaskTrackers.
● TaskTracker: TaskTrackers are worker daemons that run on the cluster nodes and
execute tasks assigned by the JobTracker. Each TaskTracker handles both Map and
Reduce tasks.
Note that in Hadoop 2.x and later these MRv1 daemons are replaced by YARN: the
ResourceManager and a per-application ApplicationMaster take over job scheduling,
while NodeManagers run the tasks (see the next section).
3. YARN (Yet Another Resource Negotiator)
YARN is the resource management layer of Hadoop, responsible for managing
resources across the cluster and scheduling the execution of tasks.
● ResourceManager (RM): The ResourceManager is the master daemon in YARN,
which manages the allocation of resources (memory, CPU) to the various
applications running on the cluster. It makes sure that resources are allocated
based on job requirements and cluster availability.
● NodeManager (NM): The NodeManager runs on each node in the cluster. It is
responsible for managing resources on the individual node and monitoring the
status of the node.
● ApplicationMaster (AM): The ApplicationMaster is a per-application entity that
manages the lifecycle of a job. It negotiates resources with the ResourceManager
and monitors the progress of its application (MapReduce job or Spark job).
YARN Architecture Diagram:
+-----------------------+
| ResourceManager | <-----> [Resource Allocation]
+-----------------------+
|
+-----------------------------+
| NodeManager | <-----> [Resource Monitoring]
+-----------------------------+
|
+---------------------------+
| ApplicationMaster (AM) | <-----> [Job Coordination]
+---------------------------+
|
+-----------------------+
| Application | <-----> [MapReduce/Spark Job]
+-----------------------+

Hadoop Ecosystem Components


Apart from the core components (HDFS, MapReduce, and YARN), Hadoop has a rich
ecosystem that includes several tools and frameworks for different use cases. Some of
the key components include:
● Hive: A data warehouse system that facilitates querying and managing large
datasets in HDFS using SQL-like queries.
● Pig: A platform for analyzing large datasets, providing a high-level language
called Pig Latin for processing and transforming data.
● HBase: A NoSQL database for real-time read/write access to large datasets
stored in HDFS.
● Sqoop: A tool for transferring data between Hadoop and relational databases.
● Flume: A service for collecting and aggregating log data and other types of
streaming data.
● Oozie: A workflow scheduler for managing Hadoop jobs.
● Zookeeper: A service for coordinating distributed applications in the Hadoop
ecosystem.
● Mahout: A machine learning library for scalable machine learning algorithms.

Hadoop Architecture Diagram (Complete)


                        +------------------+
                        |   Client Node    |
                        +------------------+
                                 |
               +-----------------+------------------+
               |                                    |
+--------------------------------+   +----------------------------------+
|     HDFS (Storage Layer)       |   |  YARN (Resource Management)      |
|  NameNode (Master)             |   |  ResourceManager (Master)        |
|  DataNode (Workers)            |   |  NodeManager (Workers)           |
+--------------------------------+   +----------------------------------+
               |                                    |
+--------------------------------+   +----------------------------------+
|        MapReduce Layer         |   |        ApplicationMaster         |
+--------------------------------+   +----------------------------------+

Key Characteristics of Hadoop Architecture


1. Scalability: Hadoop is designed to scale horizontally. As your data grows, you
can add more nodes to the cluster.
2. Fault Tolerance: Through replication and data distribution, Hadoop ensures that
the data is not lost even when individual nodes fail.
3. Cost Efficiency: Hadoop runs on commodity hardware, meaning you can build
large-scale clusters with low-cost machines.
4. Data Locality: Hadoop tries to move computation to where the data is stored to
minimize network congestion and speed up processing.

2. Loading a Dataset into HDFS for Spark Analysis: Installation of Hadoop and
Cluster Management
(i) Installing a Hadoop single-node cluster in an Ubuntu environment
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iii) Accessing WEB-UI and the port number
(iv) Installing and accessing the environments such as hive and sqoop

Installing Hadoop Single Node Cluster in Ubuntu Environment


Prerequisites:
● A fresh Ubuntu system or a virtual machine running Ubuntu.
● Java should be installed (Hadoop requires Java 8 or later).
● A user with sudo privileges.
● SSH installed, with passwordless SSH to localhost configured (the Hadoop start
scripts use ssh to launch the daemons).
Step-by-Step Installation:
1. Install Java (JDK):
Hadoop requires Java to be installed. Install Java 8 or a compatible version.
sudo apt update
sudo apt install openjdk-8-jdk
Verify the Java installation:
java -version
2. Install Hadoop:
Download a stable Hadoop release from the official Apache archive, then extract it
and move it into place:
3. Download the binaries using wget:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
4. Extract the downloaded tar file:
tar -xzvf hadoop-3.3.1.tar.gz
5. Move it to the /opt directory:
sudo mv hadoop-3.3.1 /opt/hadoop
6. Set Environment Variables:
Add Hadoop-related environment variables to the .bashrc file:
nano ~/.bashrc
Add the following lines at the end of the file:
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # adjust to your Java installation path
After saving and closing, apply the changes:
source ~/.bashrc
7. Configure Hadoop:
In the Hadoop configuration directory, you'll need to edit several XML files to set
up the cluster.
o core-site.xml:
Edit the core configuration to set the HDFS URI.
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
o hdfs-site.xml:
Configure HDFS directories and replication:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/hdfs/datanode</value>
</property>
</configuration>
o mapred-site.xml:
Set up the MapReduce framework:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
o yarn-site.xml:
Configure YARN settings:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add:
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
8. Format HDFS:
Before starting Hadoop, format the HDFS:
hdfs namenode -format
9. Start Hadoop Daemons:
Start the Hadoop daemons (NameNode, DataNode, ResourceManager,
NodeManager):
start-dfs.sh
start-yarn.sh
10. Verify the Installation:
o Run jps and check that the HDFS daemons (NameNode, DataNode) and the YARN
daemons (ResourceManager, NodeManager) are all listed:
jps
o You can also check the Hadoop Web UI to view the status of your cluster.

(ii) Differences Between Single-Node and Multi-Node Clusters


Single-Node Cluster:
● A single-node cluster is a Hadoop setup where all the Hadoop services
(NameNode, DataNode, ResourceManager, and NodeManager) run on one
machine (localhost).
● It is simpler to set up and useful for development and testing purposes.
● It offers limited scalability and no truly distributed computation, unlike a
multi-node cluster.
Multi-Node Cluster:
● A multi-node cluster involves multiple machines, where one node acts as the
master (NameNode, ResourceManager) and others as slaves (DataNode,
NodeManager).
● It offers the true power of distributed computing and storage, enabling
scalability and fault tolerance.
● It requires more complex configuration, network setup, and hardware resources.
● It is used in production environments where large-scale data processing is
required.

(iii) Accessing WEB-UI and the Port Number


Hadoop provides a Web UI to monitor the cluster's health and performance. The
following are the key ports:
● NameNode Web UI: http://localhost:9870 – For monitoring HDFS status (Hadoop 3.x;
older Hadoop 2.x releases used port 50070).
● ResourceManager Web UI: http://localhost:8088 – For monitoring the YARN resource
manager.
● JobHistory Server: http://localhost:19888 – For tracking MapReduce job history.
Make sure these ports are open and accessible; a quick way to check them from
Python is sketched below.
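A minimal sketch for checking whether these web UI ports respond on the local machine is shown below; it assumes the single-node cluster from the installation steps is running on this host.

import socket

# Probe the default Hadoop web UI ports on localhost.
ports = {
    "NameNode Web UI": 9870,
    "ResourceManager Web UI": 8088,
    "JobHistory Server": 19888,
}

for name, port in ports.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        status = "reachable" if s.connect_ex(("localhost", port)) == 0 else "not reachable"
        print(f"{name} on port {port}: {status}")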

(iv) Installing and Accessing Environments such as Hive and Sqoop


Hive Installation:
1. Install Hive:
You can download the latest stable version of Apache Hive from the Apache
website or install it via apt if available.
sudo apt-get install hive
2. Configure Hive:
Hive requires a metastore (typically MySQL or Derby). You can configure it by
editing hive-site.xml:
nano $HIVE_HOME/conf/hive-site.xml
3. Access Hive:
After installation and configuration, start Hive:
hive
This opens the Hive CLI where you can execute Hive queries.
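Besides the CLI, Hive can also be queried from Python once HiveServer2 is running. The sketch below assumes the PyHive package is installed, HiveServer2 is listening on its default port 10000, and a table named employees exists; the package, port, and table name are assumptions for illustration only.

from pyhive import hive  # assumes the PyHive package is installed

# Connect to a local HiveServer2 instance (default port 10000 assumed).
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# "employees" is a hypothetical table used only for illustration.
cursor.execute("SELECT * FROM employees LIMIT 10")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()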
Sqoop Installation:
1. Install Sqoop:
Download and install Sqoop, which is used for transferring data between
relational databases and Hadoop.
sudo apt-get install sqoop
2. Configure Sqoop:
Set up database connection configurations in Sqoop by editing the sqoop-site.xml
file.
3. Access Sqoop:
To use Sqoop to import or export data, you can run commands like:
sqoop import --connect jdbc:mysql://localhost/database --table tablename \
    --username user --password pass

3. File management tasks & Basic linux commands


(i) Creating a directory in HDFS
(ii) Moving forth and back to directories
(iii) Listing directory contents
(iv) Uploading and downloading a file in HDFS
(v) Checking the contents of the file
(vi) Copying and moving files
(vii) Copying and moving files between local to HDFS environment
(viii) Removing files and paths
(ix) Displaying few lines of a file
(x) Display the aggregate length of a file
(xi) Checking the permissions of a file
(xii) Zipping and unzipping the files with & without permission pasting it to a location
(xiii) Copy, Paste commands

Here’s a breakdown of file management tasks and basic Linux commands, particularly
focused on HDFS (Hadoop Distributed File System) operations:
(i) Creating a directory in HDFS:
To create a directory in HDFS, you can use the hadoop fs -mkdir command.
hadoop fs -mkdir /path/to/your/directory
This will create a directory at the specified path in HDFS.
(ii) Moving forth and back to directories:
You can navigate directories in the Linux file system using the cd command.
● To move to a directory:
● cd /path/to/directory
● To move back to the previous directory:
● cd -
● To move up one directory level:
● cd ..
HDFS has no notion of a current working directory, so there is no cd equivalent for
HDFS; instead you work with full paths and use the hadoop fs -ls command to inspect
directory contents.

(iii) Listing directory contents:


To list contents of a directory, whether in HDFS or local, you use the ls command.
● In HDFS:
● hadoop fs -ls /path/to/directory
● In Local File System:
● ls /path/to/directory

(iv) Uploading and downloading a file in HDFS:


To upload a file to HDFS:
hadoop fs -put /local/path/to/file /hdfs/path/to/directory
To download a file from HDFS:
hadoop fs -get /hdfs/path/to/file /local/path/to/directory

(v) Checking the contents of the file:


You can check the contents of a file using the cat command.
● In HDFS:
● hadoop fs -cat /path/to/file
● In Local File System:
● cat /path/to/file

(vi) Copying and moving files:


● Copying files:
o To copy a file within HDFS:
o hadoop fs -cp /hdfs/source/path /hdfs/destination/path
o To copy a file from local to HDFS:
o hadoop fs -copyFromLocal /local/source/path /hdfs/destination/path
o To copy a file from HDFS to local:
o hadoop fs -copyToLocal /hdfs/source/path /local/destination/path
● Moving files:
o To move a file within HDFS:
o hadoop fs -mv /hdfs/source/path /hdfs/destination/path
o To move a file from local to HDFS:
o hadoop fs -moveFromLocal /local/source/path /hdfs/destination/path
o To move a file from HDFS to local (note: hadoop fs -moveToLocal is not
implemented in current Hadoop releases, so copy the file and then delete it):
o hadoop fs -get /hdfs/source/path /local/destination/path
o hadoop fs -rm /hdfs/source/path

(vii) Copying and moving files between local and HDFS environment:
● Copying a file from local to HDFS:
● hadoop fs -copyFromLocal /local/path/to/file /hdfs/path/to/destination
● Copying a file from HDFS to local:
● hadoop fs -copyToLocal /hdfs/path/to/file /local/path/to/destination
● Moving a file from local to HDFS:
● hadoop fs -moveFromLocal /local/path/to/file /hdfs/path/to/destination
● Moving a file from HDFS to local (again, since hadoop fs -moveToLocal is not
implemented, copy with -get and then remove the HDFS copy):
● hadoop fs -get /hdfs/path/to/file /local/path/to/destination
● hadoop fs -rm /hdfs/path/to/file

(viii) Removing files and paths:


To remove files and directories, you can use the -rm and -r options for directories.
● Remove a file in HDFS:
● hadoop fs -rm /hdfs/path/to/file
● Remove a directory in HDFS:
● hadoop fs -rm -r /hdfs/path/to/directory
● Remove a file locally:
● rm /local/path/to/file
● Remove a directory locally:
● rm -r /local/path/to/directory

(ix) Displaying few lines of a file:


To display the first few lines of a file:
● In HDFS:
● hadoop fs -head /path/to/file
● In Local File System:
● head /path/to/file

(x) Display the aggregate length of a file:


You can get the file size using the -du (disk usage) command.
● In HDFS:
● hadoop fs -du -s /path/to/file
● In Local File System:
● du -sh /path/to/file
This will display the total size of the file.
(xi) Checking the permissions of a file:
You can check the permissions of a file using the -ls command, which will show the file
permissions.
● In HDFS:
● hadoop fs -ls /path/to/file
● In Local File System:
● ls -l /path/to/file
This will display the permissions, owner, and group of the file or directory.
(xii) Zipping and unzipping files (with and without preserving permissions) and pasting them to a location:
You can zip and unzip files using the zip and unzip commands.
● Zipping a file:
● zip filename.zip /path/to/file
● Unzipping a file:
● unzip filename.zip -d /path/to/extract
To maintain permissions while transferring a file, use the -p option in cp or rsync for
preserving permissions.
Example with rsync:
rsync -av /path/to/source /path/to/destination

(xiii) Copy, Paste Commands:


● Copy Command (For local filesystem):
● cp /source/path /destination/path
For HDFS:
hadoop fs -cp /source/hdfs/path /destination/hdfs/path
● Paste Command (To paste a file after copying it): This is generally done by using
cp or mv as mentioned above. There's no specific "paste" command, but the
operation is performed through these commands when moving or copying data.
4. Map-Reduce
(i) Definition of Map-reduce
(ii) Its stages and terminologies
(iii) Word-count program to understand map-reduce (Mapper phase, Reducer
phase, Driver code)
(i) Definition of Map-Reduce:
Map-Reduce is a programming model and processing technique used to process and generate large
datasets. It allows the parallel processing of data by dividing it into small chunks and distributing it
across multiple nodes in a cluster. The main concept involves two key operations: Map and Reduce.
● Map: The map function processes input data and produces a set of intermediate key-value
pairs.
● Reduce: The reduce function takes the intermediate key-value pairs, processes them, and
merges them to produce the final result.
Map-Reduce is widely used in distributed systems like Hadoop for large-scale data processing tasks.
(ii) Stages and Terminologies in Map-Reduce:
The Map-Reduce process is split into two main stages: the Map stage and the Reduce stage, but
several other intermediate processes and terminologies come into play.
1. Map Stage:
o The input data is divided into chunks (usually files or records).
o The Mapper function processes each chunk and outputs intermediate key-value pairs.
o The intermediate output is sorted and grouped by key (called the shuffle phase).
2. Shuffle and Sort:
o After the map phase, the intermediate key-value pairs are shuffled and sorted to ensure
that all values corresponding to the same key are grouped together. This step
happens automatically in Map-Reduce frameworks like Hadoop.
3. Reduce Stage:
o The Reducer function processes each group of intermediate key-value pairs and
merges them to produce a final output. It can aggregate, summarize, or process data
in any other way required by the user.
4. Output:
o After the reduce phase, the final output is written to a file or a database.
Key Terminologies in Map-Reduce:
● Mapper: The function or process that reads input data, processes it, and outputs key-value
pairs.
● Reducer: The function that processes the grouped key-value pairs from the mapper and
performs the final aggregation or computation.
● Key-Value Pair: The fundamental unit of data in Map-Reduce, where each record is
represented as a key paired with a value.
● Shuffle: The process of redistributing the data across reducers based on keys, ensuring that
all values for the same key are sent to the same reducer.
● Input Split: The unit of work or chunk of data that is sent to a mapper.
● Output: The final result after processing in the reduce phase, usually saved to disk or a
storage system.
(iii) Word-Count Program to Understand Map-Reduce:
Here is a simple example of a Word-Count program to demonstrate the Map-Reduce process. We will
break it into three main parts:
1. Mapper Phase:
The mapper reads input text and emits key-value pairs, where the key is a word, and the value is 1
(representing a single occurrence of the word).
Mapper code (in Python or any suitable language):
import sys

# Mapper function
def mapper():
    for line in sys.stdin:
        words = line.split()
        for word in words:
            # Emit word with value 1
            print(f"{word}\t1")

if __name__ == "__main__":
    mapper()
In this code:
● The input is a line of text.
● The line is split into words.
● For each word, a key-value pair is emitted, where the key is the word, and the value is 1.
2. Shuffle and Sort:
After the map phase, the framework automatically groups and sorts the emitted key-value pairs. For
instance, all instances of the word "hello" will be grouped together so that they can be passed to the
same reducer.
Example of shuffled data:
hello 1
hello 1
world 1
world 1
data 1
3. Reducer Phase:
The reducer processes the grouped key-value pairs. It aggregates the values by summing them to get
the total count for each word.
Reducer code:
import sys

# Reducer function
def reducer():
    current_word = None
    current_count = 0
    for line in sys.stdin:
        word, count = line.strip().split('\t')
        count = int(count)

        if word == current_word:
            current_count += count
        else:
            if current_word:
                print(f"{current_word}\t{current_count}")
            current_word = word
            current_count = count

    # Output the last word
    if current_word:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    reducer()
In this code:
● The reducer receives grouped key-value pairs.
● It aggregates the count of each word and prints the final result.
4. Driver Code:
The driver code sets up the map and reduce operations and coordinates the execution of the map
and reduce phases in the framework. In Hadoop, this would be handled by a job configuration, but
for simplicity, this can be managed manually in a basic script.
Example Driver Code (in a Hadoop or basic setup):
# Pseudo code to explain the execution
# 1. The input text is passed to the Mapper.
# 2. Mapper emits key-value pairs.
# 3. Intermediate data is shuffled and sorted by keys.
# 4. The Reducer takes the sorted data, aggregates it, and outputs the result.

# In Hadoop, you would configure a Job with Mapper and Reducer.
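For a quick local test of the mapper and reducer outside Hadoop, the pipeline can be simulated by piping the mapper output through sort into the reducer. The sketch below does this with Python's subprocess module; it assumes the scripts above are saved as mapper.py and reducer.py and that an input.txt file exists in the current directory (the file names are assumptions).

import subprocess

# Local simulation of the streaming pipeline:
#   cat input.txt | python3 mapper.py | sort | python3 reducer.py
# mapper.py, reducer.py and input.txt are assumed file names.
with open("input.txt", "rb") as infile:
    mapper = subprocess.Popen(["python3", "mapper.py"],
                              stdin=infile, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(["sort"],
                              stdin=mapper.stdout, stdout=subprocess.PIPE)
    reducer = subprocess.Popen(["python3", "reducer.py"],
                               stdin=sorter.stdout)
    mapper.stdout.close()   # allow sort to see EOF when the mapper finishes
    sorter.stdout.close()   # allow the reducer to see EOF when sort finishes
    reducer.communicate()   # wait for the final counts to be printed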


Final Output:
After the map and reduce phases, the output would look like this:
data 1
hello 2
world 2
This shows the word count for each word in the input text.
In a distributed setup like Hadoop:
● The mapper would be executed on different nodes processing chunks of data in parallel.
● The reducer would then aggregate the results from all the mappers.
This basic example gives you a good understanding of how Map-Reduce works to process large
datasets by distributing the work and aggregating results efficiently.
